# Pandas Basics Practical (09 / 15 / 2020)

Steps:
1. Import numpy and pandas
2. Load the DataFrame (`pd.read_csv()`)
  - print the dimensions of the df `df.shape`
3. Clean the DataFrame
  - Set the dtype for columns (`.astype()`)
    - Numeric Types
    - Date Times
  - Drop columns that only have null values (`.isnull()`)
  - Set the index
    - Check whether the 'Province_State' is unique (`.value_counts()`)
    - If so, use this column as the index (`.set_index()`)
4. Run some simple operations
 1. Extract a column from the DataFrame (`[ ]` or `.loc[]`)
 2. Extract a row from the DataFrame (`.loc[]` and `.iloc[]`)
 3. Subset the DataFrame using several rows / columns (`.loc[]`)
 4. Find the State with the most recoveries
    1. Find the row with the max (`.max()`)
    2. Use `.loc` to extract that row
 5. Find the states within the first quartile for 'Active' cases
    1. Use `.quantile()` to find the first quartile range
    2. Use `.loc` to subset the dataset using the previous value
 6. Find the states within the third quartile for 'Confirmed' cases
 

## 1. Importing the Pandas Module

In [None]:
import pandas as pd
import numpy as np

## 2. Loading the DataFrame

Use the `pd.read_csv()` command with the provided url

Remember to save the result into a variable (normally `df`)

In [None]:
# This URL will return the raw text contained in the csv file
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports_us/09-13-2020.csv'
# Pandas will parse the raw text into a DataFrame (the file is comma-delimited, so we do not need to set sep)
df = pd.read_csv(url)
# Output the first 5 rows of the DataFrame
df.head()


### Print the dimensions of the DataFrame using `.shape`

In [None]:
# Output the dimensions of the dataframe
df.shape

## 3. Cleaning the DataFrame

### Setting the correct dtypes
Start by viewing the first 5 rows of the `df`. (`.head()`)

Look for columns whose types appear to be incorrect (FIPS is one)

Then print the types for each column (`df.dtypes`).
- 'object' generally implies a string
- Look for columns that might need to be changed

Changes:
- Convert any columns you identified in the previous step
- Convert 'Last_Update' to a datetime type
  - `df['Last_Update'].astype('datetime64[ns]')` (recent pandas versions reject the unit-less `'datetime64'`)
- Remember to replace the existing columns with the converted ones


In [None]:
# Output the first 5 rows of the dataframe
df.head()

In [None]:
# Output the type used for each column in the dataframe
df.dtypes

In [None]:
# Change the type of the 'Last_Update' series to 'datetime64[ns]' and then replace that series in the original dataframe
# (recent pandas versions raise an error for the unit-less 'datetime64')
df['Last_Update'] = df['Last_Update'].astype('datetime64[ns]')
# Change the type of the 'FIPS' series to np.uint32 and then replace that series in the original dataframe
df['FIPS'] = df['FIPS'].astype(np.uint32)
df.dtypes
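
The conversions above can also be tried on a small stand-in frame before touching the real data. The sketch below is only an illustration: the `toy` DataFrame and its values are made up, and `pd.to_datetime` is shown as an alternative to `.astype()` for the date column, not as the method this practical requires.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real report; the values are made up for illustration
toy = pd.DataFrame({
    "Province_State": ["Alabama", "Alaska"],
    "Last_Update": ["2020-09-14 04:30:35", "2020-09-14 04:30:35"],
    "FIPS": [1.0, 2.0],
})

# pd.to_datetime parses the strings and returns a datetime64 series
toy["Last_Update"] = pd.to_datetime(toy["Last_Update"])
# Whole-number codes fit an unsigned integer type (this cast fails if NaN is present)
toy["FIPS"] = toy["FIPS"].astype(np.uint32)

print(toy.dtypes)
```

Assigning the converted series back to the same column name is what replaces the old column; without the assignment the DataFrame is unchanged.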

### Drop any columns that only have null values

Compare each column's null count with the number of rows in the df (`df.shape[0]`): a column whose null count equals the row count contains only null values

`df.isnull().sum()`

In [None]:
# isnull() will return True or False for each value in the dataframe
# sum will sum the True values down each column and return the total
df.isnull().sum()

In [None]:
# From the previous step we can see that these columns only have null values
# so we drop them from the dataframe
df = df.drop(columns=["People_Hospitalized", "Hospitalization_Rate"])
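
As a rough sketch of the same idea without hard-coding column names, the all-null columns can be found by comparing null counts with the row count, or dropped directly with `dropna(axis=1, how="all")`. The `toy` frame and its column names here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "B" is entirely null, column "C" is only partly null
toy = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [np.nan, np.nan, np.nan],
    "C": [1.0, np.nan, 3.0],
})

# A column is entirely null when its null count equals the number of rows
null_counts = toy.isnull().sum()
all_null_cols = null_counts[null_counts == len(toy)].index.tolist()
print(all_null_cols)

# dropna(axis=1, how="all") drops exactly those columns without naming them
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

Column "C" survives because it still holds some values; only columns with no data at all are removed.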

Set the index to the `'Province_State'` column.

First make sure that each state appears only once by using the `.value_counts()` function on the column

In [None]:
# .value_counts will count the number of occurrences of each value in the series
df["Province_State"].value_counts()
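
Reading through the counts works for a short column, but the check can also be made explicit. A minimal sketch, assuming made-up series in place of the real 'Province_State' column:

```python
import pandas as pd

# Hypothetical column values, made up for illustration
states = pd.Series(["Alabama", "Alaska", "Arizona"], name="Province_State")
with_repeat = pd.Series(["Alabama", "Alaska", "Alabama"], name="Province_State")

# is_unique is True only when no value appears more than once
print(states.is_unique)
print(with_repeat.is_unique)

# Equivalent check using value_counts: the largest count must be 1
print(states.value_counts().max() == 1)
```

If the column is not unique, it can still be set as the index, but `.loc` lookups by label may then return several rows.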

In [None]:
# This will set the index to the specified column
# Raises a KeyError if run twice (the column has already become the index) or if the name is misspelled
df = df.set_index("Province_State")

In [None]:
# Output the first 5 rows of the dataframe
df.head()

## 4. Run some simple operations
 1. Extract a column from the DataFrame (`[ ]` or `.loc[]`)


In [None]:
# Output the first 5 values from the 'Lat' series
df["Lat"].head()

 2. Extract a row from the DataFrame (`.loc[]` and `.iloc[]`)


In [None]:
# Output the 26th row (zero based indexing) of the dataframe 
df.iloc[25]

In [None]:
# Use the index column we set earlier to extract the row for Massachusetts
df.loc["Massachusetts"]

 3. Subset the DataFrame using several rows / columns (`.loc[]`)


In [None]:
# Extract the Lat and Long_ columns into a new DataFrame, then output the first 5 rows
df[["Lat", "Long_"]].head()

In [None]:
# Output the rows at positions 0, 1 and 2 (the first three rows)
df.iloc[[0, 1, 2]]
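
Rows and columns can also be selected together in a single `.loc[]` call by passing a list of index labels and a list of column names. The sketch below uses a hypothetical `toy` frame with made-up coordinates, indexed by 'Province_State' as in the earlier step.

```python
import pandas as pd

# Hypothetical frame indexed by state; the numbers are made up for illustration
toy = pd.DataFrame(
    {
        "Lat": [32.3, 61.4, 33.7],
        "Long_": [-86.9, -152.4, -111.4],
        "Confirmed": [1, 2, 3],
    },
    index=pd.Index(["Alabama", "Alaska", "Arizona"], name="Province_State"),
)

# First list selects rows by index label, second list selects columns by name
subset = toy.loc[["Alabama", "Arizona"], ["Lat", "Long_"]]
print(subset)
```

`.loc[]` works with labels, while `.iloc[]` works with integer positions, so the same subset by position would be `toy.iloc[[0, 2], [0, 1]]`.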

 4. Find the State with the most recoveries
    1. Find the row with the max (`.max()`)
    2. Use `.loc` to extract that row


In [None]:
# Find the maximum value in the 'Recovered' Column
df["Recovered"].max()

In [None]:
# df["Recovered"] == df["Recovered"].max()
#    This will return True for every row that holds the max value and False for all others
# df.loc[expression]
#    Will use the expression to select rows from the DataFrame
df.loc[df["Recovered"] == df["Recovered"].max()]
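
An alternative worth knowing, shown here only as a sketch on hypothetical data, is `.idxmax()`, which returns the index label of the maximum directly. Because the index is the state name, the label is the answer to the question.

```python
import pandas as pd

# Hypothetical frame indexed by state; the numbers are made up for illustration
toy = pd.DataFrame(
    {"Recovered": [100.0, 250.0, 75.0]},
    index=pd.Index(["Alabama", "Alaska", "Arizona"], name="Province_State"),
)

# idxmax returns the index label of the first row holding the maximum
top_state = toy["Recovered"].idxmax()
# .loc with a single label returns that row as a Series
top_row = toy.loc[top_state]
print(top_state)
```

Unlike the boolean filter, `.idxmax()` returns only the first label when several rows tie for the maximum.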

 5. Find the states within the first quartile for 'Active' cases
    1. Use `.quantile()` to find the first quartile range
    2. Use `.loc` to subset the dataset using the previous value
 

In [None]:
# Find the first quartile (25th percentile) of 'Active'
q1 = df["Active"].quantile(0.25)

# Extract all rows whose 'Active' value falls within the first quartile
df.loc[df["Active"] < q1]

6. Find the states within the third quartile for 'Confirmed' cases

This step requires two filters (boolean expressions) combined with `&`: one for the lower bound of the quartile and one for the upper bound
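
One possible sketch of this step, assuming "within the third quartile" means values above the median (0.50 quantile) and up to the 0.75 quantile. The `toy` frame, its single-letter index labels, and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame; labels and counts are made up for illustration
toy = pd.DataFrame(
    {"Confirmed": [10, 20, 30, 40, 50, 60, 70, 80]},
    index=pd.Index(list("ABCDEFGH"), name="Province_State"),
)

# Lower and upper bounds of the third quartile
q2 = toy["Confirmed"].quantile(0.50)
q3 = toy["Confirmed"].quantile(0.75)

# Each comparison needs its own parentheses because & binds tighter than > and <=
in_third_quartile = toy.loc[(toy["Confirmed"] > q2) & (toy["Confirmed"] <= q3)]
print(in_third_quartile)
```

Python's `and` keyword does not work between two boolean Series; the element-wise `&` operator is needed instead.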