<a href="https://colab.research.google.com/github/dbro-dev/DataQuest_Courses/blob/master/019___Exploring_Data_with_pandas__Intermediate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MISSION 5: Exploring Data with pandas: Intermediate**

*Learn more techniques for selecting and analyzing data in pandas.*

In this mission, we will learn how to:

* Select columns, rows and individual items using their integer location.
* Use `pd.read_csv()` to read CSV files in pandas.
* Work with integer axis labels.
* Use pandas methods to produce boolean arrays.
* Use boolean operators to combine boolean comparisons to perform more complex analysis.
* Use index labels to align data.
* Use aggregation to perform advanced analysis using loops.

## **1. Introduction**

We'll continue working with a data set from [Fortune](https://fortune.com/) magazine's 2017 [Global 500 list](https://en.wikipedia.org/wiki/Fortune_Global_500), which ranks the top 500 corporations worldwide by revenue. The data set was originally compiled [here](https://data.world/chasewillden/fortune-500-companies-2017); however, we modified the original data set to make it more accessible. [Click here](https://github.com/dbro-dev/DataQuest_Courses/blob/master/datasets/f500.csv) or [here](https://drive.google.com/file/d/1sp668oBm1G7vQbgCpw8zH-fnD1IJd9Ut/view?usp=sharing) for the current version used in this notebook (*as my Github username may change in the future*).

![Fortune_500_logo](https://s3.amazonaws.com/dq-content/291/fortune-500.jpg)

Below is the code to import pandas and use the `pandas.read_csv()` function to read the CSV into a dataframe and assign it to the variable name `f500`.

```
import pandas as pd
f500 = pd.read_csv('f500.csv', index_col=0)
f500.index.name = None
```

In Google Colab, however, loading a .csv file takes a few more steps. The cells below show how it is done:

In [None]:
# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
# Once you have completed verification, go to the CSV file in Google Drive, right-click on it, select “Get shareable link”, and copy the unique id from the link.
id = "1sp668oBm1G7vQbgCpw8zH-fnD1IJd9Ut"

In [None]:
# Download the dataset
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('f500.csv')

In [None]:
# Import code which resembles the original code above
import pandas as pd
import numpy as np

f500 = pd.read_csv('f500.csv',index_col=0)
f500.index.name = None

In [None]:
# replace 0 values in the "previous_rank" column with NaN
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan

Select the `rank`, `revenues`, and `revenue_change` columns in `f500`. Then, use the `DataFrame.head()` method to select the first five rows. Assign the result to `f500_selection`.


In [None]:
f500_selection = f500[['rank', 'revenues', 'revenue_change']].head(5)
print(f500_selection)

                          rank  revenues  revenue_change
Walmart                      1    485873             0.8
State Grid                   2    315199            -4.4
Sinopec Group                3    267518            -9.1
China National Petroleum     4    262573           -12.3
Toyota Motor                 5    254694             7.7


Note that a few steps earlier we used this code to load the .csv file:



```
f500 = pd.read_csv('f500.csv',index_col=0)
f500.index.name = None
```
Using this code, the index axis labels are the values from the first column in the data set, `company`.



The `index_col` parameter is optional and specifies which column to use as the row labels for the dataframe. By passing a value of `0`, we specified that we wanted to use the first column as the row labels.

The second line, `f500.index.name = None`, deals with axis names. Both the column and index axes can have names assigned to them. When `read_csv()` uses a column as the index, the index axis takes that column's name (here, `company`). We used the second line of the code above to access the name of the index axis and set it to `None`, so our dataframe doesn't have a name for the index axis.
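
To make the effect of `index_col` and `index.name` concrete, here is a small sketch. It assumes a hypothetical three-row CSV held in a string (read through `io.StringIO`) rather than the real `f500.csv`, so the names `csv_text` and `toy` and the values in them are for illustration only.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for f500.csv (illustrative values).
csv_text = """company,rank,revenues
Walmart,1,485873
State Grid,2,315199
Sinopec Group,3,267518
"""

# index_col=0 turns the first column into the row labels.
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)

# The index axis is named after the column it was built from.
name_before = toy.index.name

# Setting the name to None removes the label from the index axis.
toy.index.name = None
name_after = toy.index.name

print(name_before, name_after)
print(toy)
```

Printing `toy` before and after resetting the name is a quick way to see the extra header row that a named index adds.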

## **2. Reading CSV files with pandas**

The more conventional way to read in a dataframe is this:

In [None]:
f500 = pd.read_csv("f500.csv")



There are two differences with this approach:

* The company column is now included as a regular column, instead of being used for the index.
* The index labels are now integers starting from 0.
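
Both differences can be checked on a small example. The sketch below assumes a hypothetical three-row CSV string in place of `f500.csv`; `csv_text` and `toy_default` are illustrative names.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for f500.csv (illustrative values).
csv_text = """company,rank,revenues
Walmart,1,485873
State Grid,2,315199
Sinopec Group,3,267518
"""

# Without index_col, pandas builds a default integer index.
toy_default = pd.read_csv(io.StringIO(csv_text))

print(toy_default.columns.tolist())  # 'company' appears as an ordinary column
print(toy_default.index.tolist())    # integer labels starting from 0
```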


## **3. Using `iloc` to select by integer position**

Recall that when we worked with a dataframe with string index labels, we used `loc[]` to select data:

![alt text](https://s3.amazonaws.com/dq-content/292/selection_loc.svg)

Just like in NumPy, we can also use integer positions to select data using `DataFrame.iloc[]` and `Series.iloc[]`. It's easy to get `loc[]` and `iloc[]` confused at first, but the easiest way to tell them apart is to remember the first letter of each method:

* `loc`: **l**abel based selection
* `iloc`: **i**nteger position based selection


Using `iloc[]` is almost identical to indexing with NumPy, with integer positions starting at `0` like ndarrays and Python lists. This is how we would perform the selection above using `iloc[]`:

![alt text](https://s3.amazonaws.com/dq-content/292/selection_iloc.svg)

As you can see, `DataFrame.iloc[]` behaves similarly to `DataFrame.loc[]`. The full syntax for `DataFrame.iloc[]`, in pseudocode, is:



```
df.iloc[row_index, column_index]
```
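
As a rough illustration of the syntax, the sketch below selects the same cell once by label and once by position. The dataframe `toy` is a hypothetical stand-in with string row labels, not the real `f500` data.

```python
import pandas as pd

# Hypothetical toy dataframe with string row labels (illustrative values).
toy = pd.DataFrame(
    {"rank": [1, 2, 3], "revenues": [485873, 315199, 267518]},
    index=["Walmart", "State Grid", "Sinopec Group"],
)

# loc: row and column given by label.
by_label = toy.loc["State Grid", "revenues"]

# iloc: the same cell given by position (second row, second column).
by_position = toy.iloc[1, 1]

# Omitting column_index returns the whole row as a Series.
second_row = toy.iloc[1]

print(by_label, by_position)
print(second_row)
```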



Practice:

1. Select just the fifth row of the `f500` dataframe. Assign the result to `fifth_row`.

In [None]:
fifth_row = f500.iloc[4]

2. Select the value in the first row of the `company` column. Assign the result to `company_value`.

In [None]:
company_value = f500.iloc[0, 0]

## **4. Using `iloc` to select by integer position continued**

To select just the first column from our `f500` dataframe, we use `:` (a colon) to specify all rows, and then use the integer `0` to specify the first column:



In [None]:
first_column = f500.iloc[:,0]
print(first_column)

0                             Walmart
1                          State Grid
2                       Sinopec Group
3            China National Petroleum
4                        Toyota Motor
                    ...              
495    Teva Pharmaceutical Industries
496          New China Life Insurance
497         Wm. Morrison Supermarkets
498                               TUI
499                        AutoNation
Name: company, Length: 500, dtype: object


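We can also slice rows by position. To select the rows at index positions one to four (inclusive):

In [None]:
# Positions 1 to 4 are the second to fifth rows; f500[1:5] is the shorthand form.
second_to_fifth_rows = f500.iloc[1:5]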



```
             company  rank  revenues ... employees  total_stockholder_equity
1         State Grid     2    315199 ...    926067                    209456
2      Sinopec Group     3    267518 ...    713288                    106523
3  China National...     4    262573 ...   1512048                    301893
4       Toyota Motor     5    254694 ...    364445                    157210
```



In the example above, the row at index position `5` is not included, just as if we were slicing with a Python list or NumPy ndarray. Recall that `loc[]` handles slicing differently:

* With `loc[]`, the end of the slice is included.
* With `iloc[]`, the end of the slice is not included.
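
The difference is easiest to see when the labels are themselves integers, so the same slice can be passed to both. A minimal sketch, assuming a hypothetical six-row dataframe named `toy` with the default integer index:

```python
import pandas as pd

# Hypothetical toy dataframe; the default index gives labels 0 through 5.
toy = pd.DataFrame({"company": ["A", "B", "C", "D", "E", "F"]})

iloc_slice = toy.iloc[1:5]  # positions 1, 2, 3, 4 (end excluded)
loc_slice = toy.loc[1:5]    # labels 1 through 5 (end included)

print(len(iloc_slice), len(loc_slice))
```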


The table below summarizes how we can use `DataFrame.iloc[]` and `Series.iloc[]` to select by integer position:


| Select by integer position | Explicit Syntax | Shorthand Convention |
| --- | --- | --- |
| Single column from dataframe | `df.iloc[:,3]` | |
| List of columns from dataframe | `df.iloc[:,[3,5,6]]` | |
| Slice of columns from dataframe | `df.iloc[:,3:7]` | |
| Single row from dataframe | `df.iloc[20]` | |
| List of rows from dataframe | `df.iloc[[0,3,8]]` | |
| Slice of rows from dataframe | `df.iloc[3:5]` | `df[3:5]` |
| Single item from series | `s.iloc[8]` | `s[8]` |
| List of items from series | `s.iloc[[2,8,1]]` | `s[[2,8,1]]` |
| Slice of items from series | `s.iloc[5:10]` | `s[5:10]` |
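
Several of the patterns in the table can be tried on a small dataframe. The sketch below assumes a hypothetical four-row, four-column dataframe named `toy`; the positions are adapted to its size rather than copied from the table.

```python
import pandas as pd

# Hypothetical toy dataframe (illustrative values).
toy = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "rank": [1, 2, 3, 4],
    "revenues": [400, 300, 200, 100],
    "profits": [40, 30, 20, 10],
})

single_column = toy.iloc[:, 2]       # Series: the third column
column_list = toy.iloc[:, [0, 3]]    # DataFrame: first and fourth columns
column_slice = toy.iloc[:, 1:3]      # DataFrame: second and third columns
row_list = toy.iloc[[0, 2]]          # DataFrame: first and third rows
row_slice_explicit = toy.iloc[1:3]   # explicit syntax for a row slice
row_slice_shorthand = toy[1:3]       # shorthand for the same row slice

revenues = toy["revenues"]
item = revenues.iloc[3]              # single item from a series, by position

print(column_list)
print(row_slice_shorthand)
```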



Practice:
1. Select the first three rows of the `f500` dataframe. Assign the result to `first_three_rows`.


In [None]:
first_three_rows = f500.iloc[:3]
print(first_three_rows)

         company  rank  ...  employees  total_stockholder_equity
0        Walmart     1  ...    2300000                     77798
1     State Grid     2  ...     926067                    209456
2  Sinopec Group     3  ...     713288                    106523

[3 rows x 17 columns]


2. Select the first and seventh rows and the first five columns of the `f500` dataframe. Assign the result to `first_seventh_row_slice`.

In [None]:
first_seventh_row_slice = f500.iloc[[0,6], :5]
print(first_seventh_row_slice)

             company  rank  revenues  revenue_change  profits
0            Walmart     1    485873             0.8  13643.0
6  Royal Dutch Shell     7    240033           -11.8   4575.0


## **5. Using pandas methods to create boolean masks**

Besides the comparison operators `>`, `<`, and `==`, there are pandas methods that return boolean masks. Two examples are:
* `Series.isnull()` [method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isnull.html)
* `Series.notnull()` [method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.notnull.html)

First, let's use the `Series.isnull()` method to view rows with null values in the `revenue_change` column:

In [None]:
rev_is_null = f500["revenue_change"].isnull()
print(rev_is_null.head())

0    False
1    False
2    False
3    False
4    False
Name: revenue_change, dtype: bool


Just like in NumPy, we can use this series to filter our dataframe, `f500`:

In [None]:
rev_change_null = f500[rev_is_null]
print(rev_change_null[["company","country","sector"]])

                        company  country      sector
90                       Uniper  Germany      Energy
180  Hewlett Packard Enterprise      USA  Technology


We can confirm that the two companies with missing values for the `revenue_change` column are Uniper, a German energy company, and Hewlett Packard Enterprise, an American technology company.
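
`Series.notnull()` works the same way in reverse and is handy for keeping only the rows that have a value. A minimal sketch, assuming a hypothetical three-row dataframe named `toy` with one missing `revenue_change`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe with one missing value (illustrative values).
toy = pd.DataFrame({
    "company": ["A", "B", "C"],
    "revenue_change": [0.8, np.nan, -4.4],
})

has_value = toy["revenue_change"].notnull()   # True where a value is present
complete_rows = toy[has_value]                # keep only rows with a value

is_missing = toy["revenue_change"].isnull()   # the exact opposite mask

print(complete_rows)
```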

Let's use what we've learned to find the null values in the `previous_rank` column next:

Use the `Series.isnull()` method to select all rows from `f500` that have a null value for the `previous_rank` column. Select only the `company`, `rank`, and `previous_rank` columns. Assign the result to `null_previous_rank`.



In [None]:
prev_rank_is_null = f500["previous_rank"].isnull()
null_previous_rank = f500[prev_rank_is_null][["company", "rank", "previous_rank"]]

null_previous_rank.info()



<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   company        0 non-null      object
 1   rank           0 non-null      int64 
 2   previous_rank  0 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 0.0+ bytes


In [None]:
print(null_previous_rank.head())

Empty DataFrame
Columns: [company, rank, previous_rank]
Index: []
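
The empty result is worth a note. The `info()` output shows `previous_rank` as `int64` with no null entries, which suggests the zeros in that column had not been converted to `NaN` in the dataframe re-read in section 2; the replacement in section 1 was applied to the earlier copy of `f500`. The sketch below shows the likely fix on a hypothetical three-row dataframe named `toy`. The explicit cast to float is an assumption added here, because `NaN` cannot be stored in an integer column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe where 0 means "no previous rank" (illustrative values).
toy = pd.DataFrame({
    "company": ["A", "B", "C"],
    "rank": [1, 2, 3],
    "previous_rank": [1, 0, 4],
})

# Before the replacement there is nothing for isnull() to find.
nulls_before = toy["previous_rank"].isnull().sum()

# Cast to float first, then replace the zeros with NaN.
toy["previous_rank"] = toy["previous_rank"].astype(float)
toy.loc[toy["previous_rank"] == 0, "previous_rank"] = np.nan

# Now the boolean mask selects the rows that were zero.
null_rows = toy[toy["previous_rank"].isnull()][["company", "rank", "previous_rank"]]

print(nulls_before)
print(null_rows)
```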


# *To be completed when I come back from holiday in 2 weeks*

## **6. Working with Integer Labels**

## **7. Pandas Index Alignment**

## **8. Using Boolean Operators**


## **9. Using Boolean Operators Continued**

## **10. Sorting Values**

## **11. Using Loops with pandas**

## **12. Challenge: Calculating Return on Assets by Country**