<a href="https://colab.research.google.com/github/dbro-dev/DataQuest_Courses/blob/master/019__Exploring_Data_with_pandas_-_Intermediate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MISSION 5: Exploring Data with pandas: Intermediate**

*Learn more techniques for selecting and analyzing data in pandas.*

In this mission, we will learn how to:

* Select columns, rows and individual items using their integer location.
* Use `pd.read_csv()` to read CSV files in pandas.
* Work with integer axis labels.
* Use pandas methods to produce boolean arrays.
* Use boolean operators to combine boolean comparisons to perform more complex analysis.
* Use index labels to align data.
* Use aggregation to perform advanced analysis using loops.

## **1. Introduction**

We'll continue working with a data set from [Fortune](https://fortune.com/) magazine's 2017 [Global 500 list](https://en.wikipedia.org/wiki/Fortune_Global_500), which ranks the top 500 corporations worldwide by revenue. The data set was originally compiled [here](https://data.world/chasewillden/fortune-500-companies-2017); however, we modified the original data set to make it more accessible. [Click here](https://github.com/dbro-dev/DataQuest_Courses/blob/master/datasets/f500.csv) or [here](https://drive.google.com/file/d/1sp668oBm1G7vQbgCpw8zH-fnD1IJd9Ut/view?usp=sharing) for the current version used in this notebook (*as my Github username may change in the future*).

![Fortune_500_logo](https://s3.amazonaws.com/dq-content/291/fortune-500.jpg)

Below is the code to import pandas and use the `pandas.read_csv()` function to read the CSV into a dataframe and assign it to the variable name `f500`.

```
import pandas as pd
f500 = pd.read_csv('f500.csv', index_col=0)
f500.index.name = None
```

In Google Colab, however, loading a CSV file takes a few more steps. The cells below show one way to do it with PyDrive:

In [149]:
# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [150]:
# Once you have completed verification, go to the CSV file in Google Drive, right-click on it, select “Get shareable link”, and copy the unique ID from the link.
# The variable is named file_id so that it does not shadow Python's built-in id().
file_id = "1sp668oBm1G7vQbgCpw8zH-fnD1IJd9Ut"

In [151]:
# Download the dataset
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('f500.csv')

In [152]:
# Import code which resembles the original code above
import pandas as pd
import numpy as np

f500 = pd.read_csv('f500.csv',index_col=0)
f500.index.name = None

In [153]:
# replace 0 values in the "previous_rank" column with NaN
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan

Select the `rank`, `revenues`, and `revenue_change` columns in `f500`. Then, use the `DataFrame.head()` method to select the first five rows. Assign the result to `f500_selection`.


In [154]:
f500_selection = f500[['rank', 'revenues', 'revenue_change']].head()
print(f500_selection)

                          rank  revenues  revenue_change
Walmart                      1    485873             0.8
State Grid                   2    315199            -4.4
Sinopec Group                3    267518            -9.1
China National Petroleum     4    262573           -12.3
Toyota Motor                 5    254694             7.7


Note that a few steps earlier we used this code to load the CSV file:



```
f500 = pd.read_csv('f500.csv',index_col=0)
f500.index.name = None
```
With this code, the index axis labels are the values from the first column in the data set, `company`.



The `index_col` parameter is an optional argument and should specify which column to use as the row labels for the dataframe. When we used a value of 0, we specified that we wanted to use the first column as the row labels.

The second line, `f500.index.name = None`, needs a short explanation. Both the column and index axes can have names assigned to them. Because we used the `company` column as the index, the index axis would otherwise be named `company`. We used the second line to access the name of the index axis and set it to `None`, so our dataframe didn't have a name for the index axis.
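As a minimal sketch of both points, the snippet below reads a tiny in-memory CSV instead of `f500.csv`. The `csv_text` string and the `demo` variable are stand-ins made up for illustration; only the two company rows are borrowed from the output shown earlier.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for f500.csv
csv_text = "company,rank,revenues\nWalmart,1,485873\nState Grid,2,315199\n"

# index_col=0 turns the first column (company) into the row labels
demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
name_before = demo.index.name  # the index inherits the column header

# Clearing the name leaves the labels themselves untouched
demo.index.name = None
name_after = demo.index.name

print(name_before, name_after)
print(list(demo.index))
```

The row labels stay the same either way; only the name displayed above the index changes.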

## **2. Reading CSV files with pandas**

The more conventional way to read in a dataframe is this:

In [155]:
f500 = pd.read_csv("f500.csv")

In [156]:
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan



There are two differences with this approach:

* The company column is now included as a regular column, instead of being used for the index.
* The index labels are now integers starting from 0.
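To see these two differences without the full data set, here is a small sketch using a made-up in-memory CSV (`csv_text` and `demo` are illustrative names, not part of the mission):

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for f500.csv
csv_text = "company,rank\nWalmart,1\nState Grid,2\n"

# Without index_col, pandas keeps every column and builds an integer index
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.columns.tolist())  # company is an ordinary column
print(demo.index.tolist())    # labels are integers starting at 0
```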


## **3. Using `iloc` to select by integer position**

Recall that when we worked with a dataframe with string index labels, we used `loc[]` to select data:

![alt text](https://s3.amazonaws.com/dq-content/292/selection_loc.svg)

Just like in NumPy, we can also use integer positions to select data using `DataFrame.iloc[]` and `Series.iloc[]`. It's easy to get `loc[]` and `iloc[]` confused at first, but the easiest way to tell them apart is to remember the first letter of each method:

* `loc`: **l**abel based selection
* `iloc`: **i**nteger position based selection


Using `iloc[]` is almost identical to indexing with NumPy, with integer positions starting at `0` like ndarrays and Python lists. This is how we would perform the selection above using `iloc[]`:

![alt text](https://s3.amazonaws.com/dq-content/292/selection_iloc.svg)

As you can see, `DataFrame.iloc[]` behaves similarly to `DataFrame.loc[]`. The full syntax for `DataFrame.iloc[]`, in pseudocode, is:



```
df.iloc[row_index, column_index]
```
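A short sketch of this syntax on a toy dataframe follows. The dataframe `df` is an assumption built for illustration, with values mirroring the first rows of `f500` shown earlier.

```python
import pandas as pd

# Hypothetical mini dataframe with default integer labels 0, 1, 2
df = pd.DataFrame(
    {
        "company": ["Walmart", "State Grid", "Sinopec Group"],
        "rank": [1, 2, 3],
        "revenues": [485873, 315199, 267518],
    }
)

second_row = df.iloc[1]        # a single row comes back as a Series
first_company = df.iloc[0, 0]  # row position 0, column position 0
last_revenue = df.iloc[-1, 2]  # negative positions count from the end

print(second_row["company"], first_company, last_revenue)
```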



Practice:

1. Select just the fifth row of the `f500` dataframe. Assign the result to `fifth_row`.

In [157]:
fifth_row = f500.iloc[4]

2. Select the value in the first row of the `company` column. Assign the result to `company_value`.

In [158]:
company_value = f500.iloc[0, 0]

## **4. Using `iloc` to select by integer position continued**

To select just the first column from our `f500` dataframe, we use `:` (a colon) to specify all rows, and then use the integer `0` to specify the first column:



In [159]:
first_column = f500.iloc[:,0]
print(first_column)

0                             Walmart
1                          State Grid
2                       Sinopec Group
3            China National Petroleum
4                        Toyota Motor
                    ...              
495    Teva Pharmaceutical Industries
496          New China Life Insurance
497         Wm. Morrison Supermarkets
498                               TUI
499                        AutoNation
Name: company, Length: 500, dtype: object


We can also use slicing. Here we select the rows at index positions one to four (inclusive), that is, the second to fifth rows:

In [160]:
second_to_fifth_rows = f500[1:5]



```
company  rank  revenues ... employees  total_stockholder_equity
1         State Grid     2    315199 ...    926067                    209456
2      Sinopec Group     3    267518 ...    713288                    106523
3  China National...     4    262573 ...   1512048                    301893
4       Toyota Motor     5    254694 ...    364445                    157210
```



In the example above, the row at index position `5` is not included, just as if we were slicing with a Python list or NumPy ndarray. Recall that `loc[]` handles slicing differently:

* With `loc[]`, the end of the slice is included.
* With `iloc[]`, the end of the slice is not included.
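The difference is easiest to see when the labels and the positions are the same numbers. A minimal sketch, assuming a made-up six-row dataframe `df`:

```python
import pandas as pd

# Hypothetical dataframe with default labels 0..5
df = pd.DataFrame({"rank": [1, 2, 3, 4, 5, 6]})

by_label = df.loc[1:4]      # labels 1, 2, 3, 4 (end label included)
by_position = df.iloc[1:4]  # positions 1, 2, 3 (end position excluded)

print(len(by_label), len(by_position))
```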


The table below summarizes how we can use `DataFrame.iloc[]` and `Series.iloc[]` to select by integer position:


| Select by integer position | Explicit Syntax | Shorthand Convention |
| --- | --- | --- |
| Single column from dataframe | `df.iloc[:,3]` | |
| List of columns from dataframe | `df.iloc[:,[3,5,6]]` | |
| Slice of columns from dataframe | `df.iloc[:,3:7]` | |
| Single row from dataframe | `df.iloc[20]` | |
| List of rows from dataframe | `df.iloc[[0,3,8]]` | |
| Slice of rows from dataframe | `df.iloc[3:5]` | `df[3:5]` |
| Single item from series | `s.iloc[8]` | `s[8]` |
| List of items from series | `s.iloc[[2,8,1]]` | `s[[2,8,1]]` |
| Slice of items from series | `s.iloc[5:10]` | `s[5:10]` |
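The explicit forms that take lists and slices can be tried on a small example. This is only a sketch: the dataframe `df` and its column names `a` to `d` are made up for illustration.

```python
import pandas as pd

# Hypothetical 3x4 dataframe
df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
    columns=["a", "b", "c", "d"],
)

one_column = df.iloc[:, 1]         # single column "b", returned as a Series
some_columns = df.iloc[:, [0, 3]]  # list of columns: "a" and "d"
column_slice = df.iloc[:, 1:3]     # slice of columns: "b" and "c"
some_rows = df.iloc[[0, 2]]        # list of rows: first and third

print(column_slice)
```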



Practice:
1. Select the first three rows of the `f500` dataframe. Assign the result to `first_three_rows`.


In [161]:
first_three_rows = f500.iloc[:3]
print(first_three_rows)

         company  rank  ...  employees  total_stockholder_equity
0        Walmart     1  ...    2300000                     77798
1     State Grid     2  ...     926067                    209456
2  Sinopec Group     3  ...     713288                    106523

[3 rows x 17 columns]


2. Select the first and seventh rows and the first five columns of the `f500` dataframe. Assign the result to `first_seventh_row_slice`.

In [162]:
first_seventh_row_slice = f500.iloc[[0,6], :5]
print(first_seventh_row_slice)

             company  rank  revenues  revenue_change  profits
0            Walmart     1    485873             0.8  13643.0
6  Royal Dutch Shell     7    240033           -11.8   4575.0


## **5. Using pandas methods to create boolean masks**

Besides the comparison operators `>`, `<`, and `==`, there are pandas methods that return boolean masks. Two examples are:
* `Series.isnull()` [method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isnull.html)
* `Series.notnull()` [method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.notnull.html)

First, let's use the `Series.isnull()` method to view rows with null values in the `revenue_change` column:

In [163]:
rev_is_null = f500["revenue_change"].isnull()
print(rev_is_null.head())

0    False
1    False
2    False
3    False
4    False
Name: revenue_change, dtype: bool


Using `Series.isnull()` resulted in a **boolean series**. Just like in NumPy, we can use this series to filter our dataframe, `f500`:

In [164]:
rev_change_null = f500[rev_is_null]
print(rev_change_null[["company","country","sector"]])

                        company  country      sector
90                       Uniper  Germany      Energy
180  Hewlett Packard Enterprise      USA  Technology


We can confirm that the two companies with missing values for the `revenue_change` column are Uniper, a German energy company, and Hewlett Packard Enterprise, an American technology company.
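The same pattern works on any column that contains missing values. Below is a minimal sketch on a made-up three-row dataframe (`df` and its values are assumptions for illustration), which also shows `Series.notnull()` producing the opposite mask:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing revenue_change value
df = pd.DataFrame(
    {
        "company": ["A", "B", "C"],
        "revenue_change": [0.8, np.nan, -4.4],
    }
)

is_null = df["revenue_change"].isnull()       # True where the value is missing
missing = df[is_null]                         # rows with a missing value
present = df[df["revenue_change"].notnull()]  # rows with a value

print(missing["company"].tolist(), present["company"].tolist())
```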

Let's use what we've learned to find the null values in the `previous_rank` column next:

Use the `Series.isnull()` method to select all rows from `f500` that have a null value for the `previous_rank` column. Select only the `company`, `rank`, and `previous_rank` columns. Assign the result to `null_previous_rank`.



In [165]:
null_previous_rank = f500[f500["previous_rank"].isnull()][["company","rank", "previous_rank"]]

print(null_previous_rank.head())

                    company  rank  previous_rank
48    Legal & General Group    49            NaN
90                   Uniper    91            NaN
123       Dell Technologies   124            NaN
138  Anbang Insurance Group   139            NaN
140         Albertsons Cos.   141            NaN


## **6. Working with Integer Labels**

Always think carefully about whether you want to select by *label* or *integer position*. Use `DataFrame.loc[]` or `DataFrame.iloc[]` accordingly. 

![.iloc vs .loc](https://s3.amazonaws.com/dq-content/292/integer_labels_2.svg)
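After filtering, integer labels and integer positions stop matching, and the two methods can return different rows. A small sketch of this, assuming a made-up `previous_rank` column:

```python
import numpy as np
import pandas as pd

# Hypothetical column with missing values at labels 1, 3 and 4
df = pd.DataFrame({"previous_rank": [1.0, np.nan, 3.0, np.nan, np.nan, 6.0]})

# Filtering keeps the original labels, so the result is labelled 1, 3, 4
nulls = df[df["previous_rank"].isnull()]

first_two_by_position = nulls.iloc[:2]  # first two rows: labels 1 and 3
up_to_label_two = nulls.loc[:2]         # labels up to 2: only label 1

print(first_two_by_position.index.tolist(), up_to_label_two.index.tolist())
```

To get "the first N rows" of a filtered dataframe, `iloc[]` is therefore the safer choice.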

Practice:

Assign the first five rows of the `null_previous_rank` dataframe to the variable `top5_null_prev_rank` by choosing the correct method out of either `loc[]` or `iloc[]`.

In [166]:
null_previous_rank = f500[f500["previous_rank"].isnull()]

top5_null_prev_rank = null_previous_rank.iloc[:5]

print(top5_null_prev_rank)

                    company  rank  ...  employees  total_stockholder_equity
48    Legal & General Group    49  ...       8939                      8579
90                   Uniper    91  ...      12890                     12889
123       Dell Technologies   124  ...     138000                     13243
138  Anbang Insurance Group   139  ...      40707                     20372
140         Albertsons Cos.   141  ...     273000                      1371

[5 rows x 17 columns]


## **7. Pandas Index Alignment**

Now that we've identified the rows with null values in the `previous_rank` column, let's use the `Series.notnull()` method to exclude them from the next part of our analysis.

In [167]:
previously_ranked = f500[f500["previous_rank"].notnull()]

We can then create a `rank_change` column by subtracting the `rank` column from the `previous_rank` column:

In [170]:
rank_change = previously_ranked["previous_rank"] - previously_ranked["rank"]

f500["rank_change"] = rank_change

print(rank_change.shape)
print(rank_change.tail(3))

(467,)
496   -70.0
497   -61.0
498   -32.0
dtype: float64


Above, we can see that our `rank_change` series has 467 rows. Since the last integer index label is 498, we know that our index labels no longer align with the integer positions. 

Suppose now we decided to add the `rank_change` series to the `f500` dataframe as a new column. Its index labels no longer match the index labels in `f500`, so how could this be done?

Simple. In pandas almost every operation will align on the index labels. Let's look at an example – below we have a dataframe named `food` and a series named `alt_name`:

![alt text](https://s3.amazonaws.com/dq-content/292/align_index_1_updated.svg)

The `food` dataframe and the `alt_name` series not only have a different number of items, they also share only two index labels, `corn` and `eggplant`, and those labels are in different orders. If we want to add `alt_name` as a new column in our `food` dataframe, we can use the following code:


```
food["alt_name"] = alt_name
```



When we do this, pandas will ignore the order of the `alt_name` series, and align on the index labels:

![alt text](https://s3.amazonaws.com/dq-content/292/align_index_2_updated.svg)

Pandas will also:

* Discard any items that have an index that doesn't match the dataframe (like arugula).
* Fill any remaining rows with NaN.

Below is the result:

![alt text](https://s3.amazonaws.com/dq-content/292/align_index_5_updated.svg)

The pandas library will align on index at every opportunity, no matter if our index labels are strings or integers. This makes working with data from different sources, or with data where we have removed, added, or reordered rows, much easier than it would be otherwise.
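The alignment behavior can be reproduced in a few lines. The values below are assumptions chosen to match the description above (two shared labels, `arugula` present only in the series); they are not copied from the diagrams.

```python
import pandas as pd

# Hypothetical stand-ins for the food dataframe and alt_name series
food = pd.DataFrame(
    {"fruit_veg": ["fruit", "veg", "fruit", "veg", "veg"]},
    index=["tomato", "carrot", "lime", "corn", "eggplant"],
)
alt_name = pd.Series(
    ["rocket", "aubergine", "maize"],
    index=["arugula", "eggplant", "corn"],
)

# Assignment aligns on index labels, not on order or position
food["alt_name"] = alt_name

print(food)
```

Here `corn` and `eggplant` receive their alternative names, `arugula` is dropped because `food` has no such label, and the remaining rows are filled with NaN.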

## **8. Using Boolean Operators**

Boolean indexing is a powerful tool which allows us to select or exclude parts of our data based on their values. However, to answer more complex questions, we need to learn how to combine boolean arrays.

To recap, boolean arrays are created using any of the Python standard comparison operators: `==` (equal), `>` (greater than), `<` (less than), `!=` (not equal).

We combine boolean arrays using boolean operators. In Python, these boolean operators are `and`, `or`, and `not`. In pandas, the operators are slightly different:

pandas |	Python equivalent |	Meaning
--- | --- | ---
a & b |	a and b |	True if both a and b are True, else False
a \| b |	a or b |	True if either a or b is True, else False
~a |	not a |	True if a is False, else False
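A quick sketch of the three operators on made-up boolean series (`a`, `b`, and `nums` are illustrative names). Note that each comparison is wrapped in parentheses when combined, because `&` and `|` bind more tightly than comparison operators in Python.

```python
import pandas as pd

a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])

both = a & b    # element-wise "and"
either = a | b  # element-wise "or"
not_a = ~a      # element-wise "not"

# Parentheses are required around each comparison
nums = pd.Series([1, 5, 10])
in_range = (nums > 2) & (nums < 8)

print(both.tolist(), either.tolist(), not_a.tolist(), in_range.tolist())
```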

Let's look at an example using `f500_sel`, a small selection of our `f500` dataframe:

![alt text](https://s3.amazonaws.com/dq-content/292/bool_ops_1.svg)

Suppose we wanted to find the companies in `f500_sel` with more than 265 billion in revenue that are headquartered in China. We'll start by performing two boolean comparisons to produce two separate boolean arrays (the revenue column is already in millions).

![alt text](https://s3.amazonaws.com/dq-content/292/bool_ops_2.svg)

We then use the `&` operator to combine the two boolean arrays using boolean "and" logic:
![alt text](https://s3.amazonaws.com/dq-content/292/bool_ops_3.svg)

Lastly, we use the combined boolean array to perform selection on our dataframe:
![alt text](https://s3.amazonaws.com/dq-content/292/bool_ops_4.svg)

The result gives us two companies from `f500_sel` that are both Chinese and have over 265 billion in revenue.

Let's practice more complex selection using boolean operators:


Select all companies with revenues over 100 billion and negative profits from the `f500` dataframe. The result should include all columns.

* Create a boolean array that selects the companies with revenues greater than 100 billion. Assign the result to `large_revenue`.
* Create a boolean array that selects the companies with profits less than 0. Assign the result to `negative_profits`.
* Combine `large_revenue` and `negative_profits`. Assign the result to `combined`.
* Use `combined` to filter `f500`. Assign the result to `big_rev_neg_profit`.
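One possible way to work through these steps is sketched below on a made-up dataframe `demo` rather than the real `f500`. It assumes, as noted above, that revenues are stored in millions, so 100 billion corresponds to `100000`.

```python
import pandas as pd

# Hypothetical figures in millions of dollars, mimicking the f500 columns
demo = pd.DataFrame(
    {
        "company": ["A", "B", "C", "D"],
        "revenues": [150000, 120000, 90000, 105000],
        "profits": [5000.0, -300.0, -50.0, -1200.0],
    }
)

large_revenue = demo["revenues"] > 100000    # over 100 billion
negative_profits = demo["profits"] < 0       # lost money
combined = large_revenue & negative_profits  # both conditions hold
big_rev_neg_profit = demo[combined]          # keeps all columns

print(big_rev_neg_profit)
```

Replacing `demo` with `f500` applies the same four steps to the full data set.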


## **9. Using Boolean Operators Continued**

## **10. Sorting Values**

## **11. Using Loops with pandas**

## **12. Challenge: Calculating Return on Assets by Country**