<a href="https://colab.research.google.com/github/dbro-dev/DataQuest_Courses/blob/master/018__Exploring_Data_with_pandas__Fundamentals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **MISSION 4: Exploring Data with pandas: Fundamentals**

In this mission, we'll continue to learn about exploring data in pandas, including:

* How to select data from pandas objects using boolean arrays.
* How to assign data using labels and boolean arrays.
* How to create new rows and columns in pandas.
* New methods to make data analysis easier in pandas.

**1. Introduction to the Data**

We'll continue working with a data set from [Fortune](https://fortune.com/) magazine's 2017 [Global 500 list](https://en.wikipedia.org/wiki/Fortune_Global_500), which ranks the top 500 corporations worldwide by revenue. The data set was originally compiled [here](https://data.world/chasewillden/fortune-500-companies-2017); however, we modified the original data set to make it more accessible. [Click here](https://github.com/dbro-dev/DataQuest_Courses/blob/master/datasets/f500.csv) or [here](https://drive.google.com/file/d/1sp668oBm1G7vQbgCpw8zH-fnD1IJd9Ut/view?usp=sharing) for the current version used in this notebook (*two links are given because my GitHub username may change in the future*).

![Fortune_500_logo](https://s3.amazonaws.com/dq-content/291/fortune-500.jpg)

Below is the code to import pandas and use the `pandas.read_csv()` function to read the CSV into a dataframe and assign it to the variable name `f500`.

```
import pandas as pd
f500 = pd.read_csv('f500.csv', index_col=0)
f500.index.name = None
```

In Google Colab, however, loading a .csv file takes a few extra steps. The cells below show how it is done:

In [None]:
# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
# Once you have completed verification, go to the CSV file in Google Drive, right-click on it, select "Get shareable link", and copy the unique ID from the link.
id = "1sp668oBm1G7vQbgCpw8zH-fnD1IJd9Ut"

In [None]:
# Download the dataset
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('f500.csv')

In [None]:
# Import code which resembles the original code above
import pandas as pd

f500 = pd.read_csv('f500.csv',index_col=0)
f500.index.name = None

Inspect the data set type and its dimensions:

In [None]:
f500_type = type(f500)
print(f500_type)

<class 'pandas.core.frame.DataFrame'>


In [None]:
f500_shape = f500.shape
print(f500_shape)

(500, 16)


Next, let's use the `DataFrame.head()` and `DataFrame.info()` methods to refamiliarize ourselves with the data.

In [None]:
# Explore the data set by printing the first 10 lines
f500_head = f500.head(10)
print(f500_head)

                          rank  revenues  ...  employees  total_stockholder_equity
Walmart                      1    485873  ...    2300000                     77798
State Grid                   2    315199  ...     926067                    209456
Sinopec Group                3    267518  ...     713288                    106523
China National Petroleum     4    262573  ...    1512048                    301893
Toyota Motor                 5    254694  ...     364445                    157210
Volkswagen                   6    240264  ...     626715                     97753
Royal Dutch Shell            7    240033  ...      89000                    186646
Berkshire Hathaway           8    223604  ...     367700                    283001
Apple                        9    215639  ...     116000                    128249
Exxon Mobil                 10    205004  ...      72700                    167325

[10 rows x 16 columns]


In [None]:
# Display more info about the dataframe
f500.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, Walmart to AutoNation
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   rank                      500 non-null    int64  
 1   revenues                  500 non-null    int64  
 2   revenue_change            498 non-null    float64
 3   profits                   499 non-null    float64
 4   assets                    500 non-null    int64  
 5   profit_change             436 non-null    float64
 6   ceo                       500 non-null    object 
 7   industry                  500 non-null    object 
 8   sector                    500 non-null    object 
 9   previous_rank             500 non-null    int64  
 10  country                   500 non-null    object 
 11  hq_location               500 non-null    object 
 12  website                   500 non-null    object 
 13  years_on_global_500_list  500 non-null    int64  
 14  em

**2. Vectorized Operations**

Because pandas is designed to operate like NumPy, many concepts and methods from NumPy are supported. Recall that one of the ways NumPy makes working with data easier is with **vectorized operations**, or operations applied to multiple data points at once:
![alt text](https://s3.amazonaws.com/dq-content/289/vectorized.gif)
Vectorization not only improves our code's performance, but also enables us to write code more quickly.
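
As a minimal sketch of the idea, the example below compares a Python loop with the equivalent vectorized expression. The three company names and their ranks are made up for illustration and are not taken from `f500.csv`.

```python
import pandas as pd

# Hypothetical ranks for three made-up companies (not from f500.csv).
ranks = pd.DataFrame(
    {"rank": [1, 2, 3], "previous_rank": [1, 4, 2]},
    index=["Alpha", "Beta", "Gamma"],
)

# Loop version: one subtraction per row, written out explicitly.
loop_change = [
    prev - cur for prev, cur in zip(ranks["previous_rank"], ranks["rank"])
]

# Vectorized version: a single expression over whole columns.
# pandas aligns the two series on their shared index labels.
vector_change = ranks["previous_rank"] - ranks["rank"]

print(vector_change)
```

Both versions produce the same numbers, but the vectorized one keeps the index labels and is shorter to write.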

Recall that our `f500` dataframe includes each company's current and previous year's rank on the Fortune 500 list. Let's use vectorized operations to calculate the changes in rank for each company.

In [None]:
rank_change =  f500["previous_rank"] - f500["rank"]
print(rank_change[:6])

Walmart                     0.0
State Grid                  0.0
Sinopec Group               1.0
China National Petroleum   -1.0
Toyota Motor                3.0
Volkswagen                  1.0
dtype: float64


**3. Series Data Exploration Methods**

Like NumPy, pandas supports many descriptive stats methods that can help us explore a series such as `rank_change`. Here are a few of the most useful ones (with links to documentation):

* [`Series.max()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.max.html)
* [`Series.min()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.min.html)
* [`Series.mean()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html)
* [`Series.median()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.median.html)
* [`Series.mode()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mode.html)
* [`Series.sum()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sum.html)
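
As a quick, self-contained illustration of these methods, here is a sketch on a small series of made-up numbers (the values are hypothetical and unrelated to `f500`):

```python
import pandas as pd

# Made-up rank changes, for illustration only.
changes = pd.Series([3, -1, 0, 3, -5], index=list("abcde"))

largest = changes.max()
smallest = changes.min()
average = changes.mean()
middle = changes.median()
most_common = changes.mode()  # a Series, because several values can tie
total = changes.sum()

print(largest, smallest, average, middle, total)
print(most_common)
```

Note that `Series.mode()` returns a series rather than a single value, since more than one value can share the highest frequency.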

In [None]:
rank_change_max = rank_change.max()
print(rank_change_max)

226.0


Biggest increase in rank: 226

In [None]:
rank_change_min = rank_change.min()
print(rank_change_min)
# Outcome was -500 before alterations were made further down this document

Biggest decrease in rank: -500

According to the data dictionary, this list should only rank companies on a scale of 1 to 500. Even if the company ranked 1st in the previous year moved to 500th this year, the rank change calculated would be -499. This indicates that there is incorrect data in either the rank column or previous_rank column.

**4. Series Describe Method**

Another method that can help us investigate this issue more quickly is the `Series.describe()` [method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.describe.html#pandas.Series.describe). This method tells us how many non-null values are contained in the series, along with the mean, minimum, maximum, and other statistics we'll learn about later in this path.

If we use `describe()` on a column that contains non-numeric values, we get some different statistics. Let's look at an example:

In [None]:
country = f500["country"]
print(country.describe())

count     500
unique     34
top       USA
freq      132
Name: country, dtype: object


The first statistic, `count`, is the same as for numeric columns, showing us the number of non-null values. The other three statistics are new:

* `unique`: Number of unique values in the series. In this case, it tells us that there are 34 different countries represented in the Fortune 500.
* `top`: Most common value in the series. The USA is the country that headquarters the most Fortune 500 companies.
* `freq`: Frequency of the most common value. Exactly 132 companies from the Fortune 500 are headquartered in the USA.
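
The four statistics can be reproduced on a tiny series. This is a sketch with made-up country values, including one missing entry to show that `count` ignores nulls:

```python
import pandas as pd

# Hypothetical values; None stands in for a missing country.
countries = pd.Series(["USA", "China", "USA", "Japan", "USA", None])

summary = countries.describe()
print(summary)

# Individual statistics can be read from the result by label.
n_non_null = summary["count"]
n_unique = summary["unique"]
most_common = summary["top"]
most_common_freq = summary["freq"]
```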

Let's use this method to gather more information about the `rank` and `previous_rank` series.

In [None]:
rank = f500["rank"]
rank_desc = rank.describe()
print(rank_desc)

count    500.000000
mean     250.500000
std      144.481833
min        1.000000
25%      125.750000
50%      250.500000
75%      375.250000
max      500.000000
Name: rank, dtype: float64


In [None]:
prev_rank = f500["previous_rank"]
prev_rank_desc = prev_rank.describe()
print(prev_rank_desc)

count    500.000000
mean     222.134000
std      146.941961
min        0.000000
25%       92.750000
50%      219.500000
75%      347.250000
max      500.000000
Name: previous_rank, dtype: float64


Something is odd: the minimum value for the `previous_rank` column is 0. However, this column should only have values between 1 and 500 (inclusive), so a value of 0 doesn't make sense. To investigate the possible cause of this issue, let's confirm the number of 0 values that appear in the `previous_rank` column.

**5. Method Chaining**

Let's use the `Series.value_counts()` method and `Series.loc` next to confirm the number of 0 values in the `previous_rank` column.

In [None]:
zero_previous_rank = f500["previous_rank"].value_counts().loc[0]
print(zero_previous_rank)

33


> This expression might seem complex, but it is actually quite logical. Rather than assigning the `previous_rank` series to its own variable, we skip that step and call the method directly on the result of the column selection.

> This is called **method chaining**: a way to combine multiple methods in a single line. When writing code, always assess whether method chaining will make your code harder to read. If it does, it is preferable to break the code into more than one line.

From the result we can confirm that **33 companies in the dataframe have a value of 0** in the previous_rank column. Given that multiple companies have a 0 rank, we might conclude that **these companies didn't have a rank at all for the previous year**. It would make more sense for us to replace these values with a null value instead.
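
To make the comparison concrete, here is a sketch of the same count written step by step and as a chain, using a small made-up series. The final line uses `Series.get()` with a default, which is one possible way (my suggestion, not part of the original exercise) to avoid a `KeyError` when the value being looked up never occurs.

```python
import pandas as pd

# Made-up previous ranks; three of them are 0.
previous_rank = pd.Series([0, 12, 0, 7, 0])

# Step by step, with an intermediate variable.
counts = previous_rank.value_counts()
zeros_stepwise = counts.loc[0]

# Chained: each method is called on the result of the previous one.
zeros_chained = previous_rank.value_counts().loc[0]

# .loc[99] would raise a KeyError because 99 never occurs;
# .get() returns the supplied default instead.
count_of_99 = previous_rank.value_counts().get(99, 0)

print(zeros_stepwise, zeros_chained, count_of_99)
```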

**6. Dataframe Exploration Methods**

Before we correct these values, let's explore the rest of our dataframe to make sure there are no other data issues. Just like we used descriptive stats methods **to explore individual series**, we can **also use descriptive stats methods** to explore our `f500` **dataframe**.

Because series and dataframes are two distinct objects, **they have their own unique methods**. However, oftentimes both series and dataframe objects have a method of the **same name that behaves in similar ways**. 

Below are some examples:

* `Series.max()` and `DataFrame.max()`
* `Series.min()` and `DataFrame.min()`
* `Series.mean()` and `DataFrame.mean()`
* `Series.median()` and `DataFrame.median()`
* `Series.mode()` and `DataFrame.mode()`
* `Series.sum()` and `DataFrame.sum()`


Unlike their series counterparts, **dataframe methods take an `axis` parameter** that specifies which axis to calculate across. It is optional and defaults to `axis=0`, which computes one result per column.

While you can use integers to refer to the first and second axis, pandas dataframe methods also accept the strings `"index"` and `"columns"` for the axis parameter:

![alt text](https://s3.amazonaws.com/dq-content/291/axis_param.svg)
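
The effect of the `axis` parameter can be sketched on a tiny dataframe. The column names and values here are made up for illustration:

```python
import pandas as pd

# Hypothetical 3x2 dataframe.
scores = pd.DataFrame(
    {"a": [1, 2, 3], "b": [10, 20, 30]},
    index=["x", "y", "z"],
)

# axis=0 / axis="index": calculate down the rows, one result per column.
by_column = scores.sum(axis="index")

# axis=1 / axis="columns": calculate across the columns, one result per row.
by_row = scores.sum(axis="columns")

print(by_column)
print(by_row)
```

The integer and string forms are interchangeable, so `scores.sum(axis=0)` gives the same result as `scores.sum(axis="index")`.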

If we wanted to find the median (middle) value for the revenues and profits columns, we could use the following code:

In [None]:
medians = f500[["revenues", "profits"]].median(axis=0)
# we could also use .median(axis="index"), or leave it out as the default value is 0
print(medians)

revenues    40236.0
profits      1761.6
dtype: float64


Instructions: 

Use the `DataFrame.max()` method to find the maximum value for only the *numeric* columns from `f500` (you may need to check the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.max.html)). 

In [None]:
max_f500 = f500.max(axis=0, numeric_only=True)

print(max_f500)

rank                            500.0
revenues                     485873.0
revenue_change                  442.3
profits                       45687.0
assets                      3473238.0
profit_change                  8909.5
previous_rank                   500.0
years_on_global_500_list         23.0
employees                   2300000.0
total_stockholder_equity     301893.0
dtype: float64


Based on the column descriptions, the maximum for each of these columns seems reasonable.

**7. Dataframe Describe Method**

Like series objects, dataframe objects also have a `DataFrame.describe()` [method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) that we can use to explore the dataframe more quickly (See link for documentation).

One difference is that we need to specify explicitly when we want statistics for the non-numeric columns. By default, `DataFrame.describe()` returns statistics for only the numeric columns. To get just the object columns, we use the `include=['O']` parameter (`'O'` for object, not zero):

In [None]:
print(f500.describe(include=['O']))

                   ceo  ...                website
count              500  ...                    500
unique             500  ...                    500
top     Larry J. Merlo  ...  http://www.ckh.com.hk
freq                 1  ...                      1

[4 rows x 6 columns]


Whereas the `Series.describe()` method returns a series object, the `DataFrame.describe()` method returns a dataframe object.

In [None]:
print(f500.describe())


             rank       revenues  ...     employees  total_stockholder_equity
count  500.000000     500.000000  ...  5.000000e+02                500.000000
mean   250.500000   55416.358000  ...  1.339983e+05              30628.076000
std    144.481833   45725.478963  ...  1.700878e+05              43642.576833
min      1.000000   21609.000000  ...  3.280000e+02             -59909.000000
25%    125.750000   29003.000000  ...  4.293250e+04               7553.750000
50%    250.500000   40236.000000  ...  9.291050e+04              15809.500000
75%    375.250000   63926.750000  ...  1.689172e+05              37828.500000
max    500.000000  485873.000000  ...  2.300000e+06             301893.000000

[8 rows x 10 columns]


By default, this method returns descriptive statistics for the numeric columns only. To include the object columns as well, pass `include='all'` (with quotes: unquoted `all` refers to Python's built-in function rather than the string pandas expects). See also the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html).
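
The three variants can be compared side by side. This is a sketch on a made-up two-column dataframe; the `country` column is given `dtype=object` explicitly (an assumption on my part) so that the example does not depend on how the installed pandas version infers string columns.

```python
import pandas as pd

# Hypothetical data: one numeric column, one object column.
firms = pd.DataFrame(
    {
        "revenues": [100, 250, 175],
        "country": pd.Series(["USA", "China", "USA"], dtype=object),
    }
)

numeric_only = firms.describe()                # default: numeric columns
objects_only = firms.describe(include=["O"])   # object columns only
everything = firms.describe(include="all")     # both kinds of columns

print(numeric_only.columns.tolist())
print(objects_only.columns.tolist())
print(everything.columns.tolist())
```

With `include="all"`, statistics that don't apply to a column (for example `mean` for `country`) are filled with `NaN`.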

After reviewing the descriptive statistics for the numeric columns in `f500`, we can conclude that no values look unusual besides the `0` values in the `previous_rank` column. Previously, we concluded that companies with a rank of zero didn't have a rank at all. 

Next, we'll replace these values with a null value to clearly indicate that the value is missing.

**8. Assignment with pandas**

We'll learn how to do two things so we can correct these values:

* Perform assignment in pandas.
* Use boolean indexing in pandas.

Let's start by learning assignment, starting with the following example:

In [None]:
top5_rank_revenue = f500[["rank", "revenues"]].head()
print(top5_rank_revenue)

                          rank  revenues
Walmart                      1    485873
State Grid                   2    315199
Sinopec Group                3    267518
China National Petroleum     4    262573
Toyota Motor                 5    254694


Just like in NumPy, the same techniques that we use to select data can be used for assignment. When we select a whole column by label and assign to it, the value is assigned to every item in that column.



```
# All revenues of the 5 first companies would be overwritten to 0 when executing this formula.

top5_rank_revenue["revenues"] = 0
```


By providing labels for both axes, we can assign a value to a single cell within our dataframe.

```
top5_rank_revenue.loc["Sinopec Group", "revenues"] = 999
```
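
Both kinds of assignment can be tried safely on a small copy. The sketch below rebuilds the first three rows shown above by hand and calls `.copy()` before assigning; the explicit copy is my addition, intended to keep the subset independent of the original dataframe so that changes don't leak back and pandas has no reason to warn about assigning to a slice.

```python
import pandas as pd

# First three rows of the rank/revenues table shown above, rebuilt by hand.
top = pd.DataFrame(
    {"rank": [1, 2, 3], "revenues": [485873, 315199, 267518]},
    index=["Walmart", "State Grid", "Sinopec Group"],
)

# Work on an independent copy so `top` itself is left untouched.
subset = top[["rank", "revenues"]].copy()

# Labels for both axes: assign to a single cell.
subset.loc["Sinopec Group", "revenues"] = 999

# Column label only: assign the value to every row of that column.
subset["rank"] = 0

print(subset)
print(top)
```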



Let's assign another value in the Fortune 500 dataframe. The company "Dow Chemical" has named a new CEO: Jim Fitterling:

In [None]:
print(f500.loc["Dow Chemical", "ceo"])

Andrew N. Liveris


In [None]:
f500.loc["Dow Chemical", "ceo"] = "Jim Fitterling"
print(f500.loc["Dow Chemical", "ceo"])

Jim Fitterling


**9. Using Boolean Indexing with pandas Objects**

We can use **boolean indexing** to change all rows that meet the same criteria, just like we did with NumPy.

Let's look at two examples of how boolean indexing works in pandas. For our example, we'll work with this dataframe of people and their favorite numbers:
![alt text](https://s3.amazonaws.com/dq-content/291/eg_df.svg)

Let's check which people have a favorite number of 8. First, we perform a vectorized boolean operation that produces a boolean series:
![alt text](https://s3.amazonaws.com/dq-content/291/bool_series.svg)

We can use that series to index the whole dataframe, leaving us with the rows that correspond only to people whose favorite number is **8**:
![alt text](https://s3.amazonaws.com/dq-content/291/boolean_indexing_df.svg)

Note that we didn't use `loc[]`. This is because boolean arrays use the same shortcut as slices to select along the index axis. We can also use the boolean series to index just one column of the dataframe:
![alt text](https://s3.amazonaws.com/dq-content/291/boolean_indexing_s.svg)
In this case, we used `df.loc[]` to specify both axes.
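
The steps illustrated above can be sketched in code as follows. The names, numbers, and index labels are made up and may differ from those in the diagrams:

```python
import pandas as pd

# Hypothetical people and their favorite numbers.
df = pd.DataFrame(
    {"name": ["Kylie", "Rahul", "Michael", "Sarah"], "num": [12, 8, 5, 8]},
    index=["w", "x", "y", "z"],
)

# Vectorized comparison: produces a boolean series, one value per row.
num_bool = df["num"] == 8

# Index the whole dataframe: keeps only the rows where the mask is True.
rows_with_8 = df[num_bool]

# Use .loc to apply the mask to the rows and pick a single column.
names_with_8 = df.loc[num_bool, "name"]

print(num_bool)
print(rows_with_8)
print(names_with_8)
```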



Next, let's use boolean indexing to identify companies belonging to the "Motor Vehicles and Parts" industry, and the countries in which they are located.

In [None]:
motor_bool = f500["industry"] == "Motor Vehicles and Parts"

motor_countries = f500.loc[motor_bool, "country"]

print(motor_countries)

Toyota Motor                                 Japan
Volkswagen                                 Germany
Daimler                                    Germany
General Motors                                 USA
Ford Motor                                     USA
Honda Motor                                  Japan
SAIC Motor                                   China
Nissan Motor                                 Japan
BMW Group                                  Germany
Dongfeng Motor                               China
Robert Bosch                               Germany
Hyundai Motor                          South Korea
China FAW Group                              China
Beijing Automotive Group                     China
Peugeot                                     France
Renault                                     France
Kia Motors                             South Korea
Continental                                Germany
Denso                                        Japan
Guangzhou Automobile Industry G

**10. Using Boolean Arrays to Assign Values**

We will now combine the two operations:
* Perform assignment in pandas.
* Use boolean indexing in pandas.

We will change the `'Motor Vehicles & Parts'` values in the `sector` column to `'Motor Vehicles and Parts'`, i.e. we will change the ampersand (`&`) to `and`.

First, we create a boolean series by comparing the values in the `sector` column to `'Motor Vehicles & Parts'`:

```
ampersand_bool = f500["sector"] == "Motor Vehicles & Parts"
```
Next, we use that boolean series and the string `"sector"` to perform the assignment.

```
f500.loc[ampersand_bool,"sector"] = "Motor Vehicles and Parts"
```
Just like we saw in the NumPy mission earlier in this course, we can remove the intermediate step of creating a boolean series, and combine everything into one line. This is the most common way to write pandas code to perform assignment using boolean arrays:


```
f500.loc[f500["sector"] == "Motor Vehicles & Parts","sector"] = "Motor Vehicles and Parts"
```

Now we can follow this pattern to replace the values in the `previous_rank` column. We'll replace these values with `np.nan`. Just like in NumPy, `np.nan` is used in pandas to represent values that can't be represented numerically, most commonly missing values.

To make it easier to compare the values in this column before and after our operation, we will create two series, `prev_rank_before` and `prev_rank_after`:

```
prev_rank_before = f500["previous_rank"].value_counts(dropna=False).head()
```

This uses `Series.value_counts()` and `Series.head()` to display the 5 most common values in the `previous_rank` column, but adds an additional `dropna=False` parameter, which stops the `Series.value_counts()` method from excluding null values when it makes its calculation, as shown in the `Series.value_counts()` documentation.

In [None]:
import numpy as np
prev_rank_before = f500["previous_rank"].value_counts(dropna=False).head()

prev_rank_bool = f500["previous_rank"] == 0

print(prev_rank_before)

0      33
159     1
147     1
148     1
149     1
Name: previous_rank, dtype: int64


In [None]:
f500.loc[prev_rank_bool, "previous_rank"] = np.nan

prev_rank_after = f500["previous_rank"].value_counts(dropna=False).head()

print(prev_rank_after)

NaN      33
471.0     1
234.0     1
125.0     1
166.0     1
Name: previous_rank, dtype: int64


Instead of 33 instances of `0`, we now have 33 instances of `NaN`.

Also note that the index of the series that `Series.value_counts()` produces now shows us floats like 471.0 instead of integers. 
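
The switch to floats happens because `NaN` is a floating-point value, so a column that holds it cannot stay an integer column. The sketch below shows the same replacement on made-up data. It casts the column to `float64` before assigning, which is an extra step not in the cells above; I include it on the assumption that newer pandas releases may warn about, or reject, writing `NaN` into an integer column implicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical previous ranks, where 0 means "not ranked last year".
ranks = pd.DataFrame({"previous_rank": [0, 3, 0, 8]}, index=["A", "B", "C", "D"])

# Make the column float up front so that it can hold NaN.
ranks["previous_rank"] = ranks["previous_rank"].astype("float64")

# Boolean mask on the rows, column label on the columns.
ranks.loc[ranks["previous_rank"] == 0, "previous_rank"] = np.nan

# dropna=False keeps NaN in the tally.
counts = ranks["previous_rank"].value_counts(dropna=False)

print(ranks["previous_rank"].dtype)
print(counts)
```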

**11. Creating New Columns**

Now that we've corrected the data, let's create the `rank_change` series again. This time, we'll add it to our `f500` dataframe as a new column.

When we assign a value or values to a new column label, pandas will automatically create a new column in our dataframe. For example, below we add a new column named `year_founded` to a dataframe `top5_rank_revenue`:

`top5_rank_revenue["year_founded"] = 0`
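
Here is a self-contained sketch of both ways of creating a column: from a single value and from a vectorized calculation. The dataframe is made up, with one missing `previous_rank` to show how `NaN` carries through.

```python
import numpy as np
import pandas as pd

# Hypothetical companies; B has no previous rank.
firms = pd.DataFrame(
    {"rank": [1, 2, 3], "previous_rank": [2.0, np.nan, 1.0]},
    index=["A", "B", "C"],
)

# A scalar is broadcast to every row of the new column.
firms["year_founded"] = 0

# A series is aligned on the index; NaN in an operand gives NaN in the result.
firms["rank_change"] = firms["previous_rank"] - firms["rank"]

# describe() counts only the non-null results.
rank_change_count = firms["rank_change"].describe()["count"]

print(firms)
```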

Next, we create a `rank_change` column in our `f500` dataframe and return a series of descriptive statistics for the `rank_change` column to study the results:

In [None]:
f500["rank_change"] = f500["previous_rank"] - f500["rank"]

rank_change_desc = f500["rank_change"].describe()
     
print(rank_change_desc)

count    467.000000
mean      -3.533191
std       44.293603
min     -199.000000
25%      -21.000000
50%       -2.000000
75%       10.000000
max      226.000000
Name: rank_change, dtype: float64


Note that the minimum value of the `rank_change` column is now -199 instead of -500, which falls within the range we would expect for ranks between 1 and 500.

**12. Challenge: Top Performers by Country**

In this challenge, we'll calculate a specific statistic or attribute of each of the two most common countries from our f500 dataframe. 

Like the `DataFrame.head()` method, the `Series.head()` method returns the first five items from a series by default, or a different number if you provide an argument.



1.   Create a series, `industry_usa`,  containing counts of the two most common values in the `industry` column for companies headquartered in the USA.

In [None]:
# The boolean mask could first be stored in its own variable:
# companies_USA_bool = f500["country"] == "USA"

industry_usa = f500.loc[f500["country"] == "USA", "industry"].value_counts().head(2)

print(industry_usa)

Banks: Commercial and Savings               8
Insurance: Property and Casualty (Stock)    7
Name: industry, dtype: int64


The shorthand way of writing the code above is:


```
industry_usa = f500["industry"][f500["country"] == "USA"].value_counts().head(2)
```



2.   Create a series, `sector_china`, containing counts of the three most common values in the `sector` column for companies headquartered in China.

In [None]:
sector_china = f500.loc[f500["country"] == "China", "sector"].value_counts().head(3)

print(sector_china)

Financials    25
Energy        22
Name: sector, dtype: int64


The shorthand way of writing the code above is:


```
sector_china = f500["sector"][f500["country"] == "China"].value_counts().head(3)
```
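
The filter-count-head pattern used in both tasks can be sketched on a small made-up dataframe. It also shows a general property of `head(n)`: when fewer than `n` distinct values exist, the result simply has fewer rows.

```python
import pandas as pd

# Hypothetical companies, for illustration only.
firms = pd.DataFrame(
    {
        "country": ["USA", "USA", "USA", "China", "China", "USA"],
        "sector": ["Energy", "Financials", "Energy",
                   "Financials", "Energy", "Technology"],
    }
)

# Filter rows by country, select one column, count values, keep the top 2.
top_usa = firms.loc[firms["country"] == "USA", "sector"].value_counts().head(2)

# Only two distinct sectors exist for China here, so head(3) returns 2 rows.
top_china = firms.loc[firms["country"] == "China", "sector"].value_counts().head(3)

print(top_usa)
print(top_china)
```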



---


In the next mission, we'll continue to learn about exploring data in pandas, including:

* New ways of creating dataframe and series objects.
* Advanced selection techniques.
* Performing more complex analysis.
