## Pandas

Today we will look at a Package called [Pandas](https://pandas.pydata.org/docs/)

- Packages are a collection of functions
- Pandas is the best available package to handle data
- It is built on an other famous package [Numpy](https://numpy.org/devdocs/user/quickstart.html)

In [46]:
# import numpy package and give it the usual alias: np
import numpy as np
print(np.__version__)

1.18.1


## In class exercise
- import the Pandas library
- use the pd alias
- print out the Pandas version

In [47]:
# import the library
import pandas as pd
print(pd.__version__)

1.0.1


## Numpy
- We will not go into the details of the Numpy package
- But you might find it useful to know it exists
- Let's demonstrate why we will use packages instead of making our own functions

In [8]:
# Use list comprehension to create a list from 1 to 1 million
x_list=[i for i in range(1,1000001)]

# Create the equivalent numpy array
x_np= np.array(x_list)
# print type of each list
print(type(x_list), type(x_np), sep=', ')

<class 'list'>, <class 'numpy.ndarray'>


## Jupyter (IPython) magic
- In Jupyter notebooks you can built-in commands named [__magic__](https://ipython.readthedocs.io/en/stable/interactive/magics.html)
- To activate a magic command simply put a % in front
- Let's illustrate using the __%time__ magic command

In [9]:
# time the runtime to sum the Python list using the base sum() function
%time 
sum(x_list)

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 11 µs


500000500000

## In class exercise
- Numpy has a sum function
- class the sum function from numpy on x_np
- Make sure to time the execution

In [11]:
# time the runtime to sum the numpy array using the sum function from numpy
%time
np.sum(x_np)


CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 9.06 µs


500000500000

## How to use packages
- The command [np.sum()](https://numpy.org/doc/stable/reference/generated/numpy.sum.html) asks python to access the sum() function in the Numpy package
- All functions in packages are accessed the same way.
- Let's go back to Pandas!

## Pandas' raison d'être

- Pandas keeps the speed of Numpy
- Then adds the ability to label __variables__ and __index__ rows in a user friendly way
- This is called a [dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
- Before loading an actual dataset into memory let's see a very simple example

In [12]:
# Create a small dictionary for data
d = {'col1': [1, 2, 10,20], 'col2': [3, 4, 25.0,'ottawa']}
dataframe = pd.DataFrame(data=d)

dataframe

Unnamed: 0,col1,col2
0,1,3
1,2,4
2,10,25
3,20,ottawa


- Your dataframe is now assigned to _dataframe_
- This is what's known as an __object__
    - You can learn more about [Object Oriented Programming here](https://realpython.com/python3-object-oriented-programming/)

In [13]:
# Use the columns attribute on the dataframe to list variables
dataframe.columns

Index(['col1', 'col2'], dtype='object')

In [18]:
# print type of what is returned by calling columns
print(type(dataframe.columns))
# Save column names to a list using the tolist method
cols=dataframe.columns.tolist()
# print the type of this list
print(type(cols))
#return the columns
print(cols)

<class 'pandas.core.indexes.base.Index'>
<class 'list'>
['col1', 'col2']


## In class exercise

- Data frames have also an index attribute
- print out this index
- save it into a list
- loop over the list and print out the values

In [22]:
# Use the index attribute on the dataframe to list indices
display(dataframe.index)
indices=dataframe.index.tolist()
for i in indices:
    print(i)

RangeIndex(start=0, stop=4, step=1)

0
1
2
3


# Let's use real data
- Let's look into a dataset freely available on [Open Canada](https://www.kaggle.com)
- Specifically, we will look into data related to [COVID19 in Canada](https://open.canada.ca/data/en/dataset/b8d1d622-1ceb-4c1c-96e9-a0b38939080b)


Best practice

In [24]:
# # If you access multiple times the same folder or if you want to share your code
# pathFolder="~/Dropbox/Data Science for Social Scientists/Classes/Class 2 - Tidy Data/COVID/"

# # read_csv into a pandas datframe
# covidFile=''.join([pathFolder,"covid19_map.csv"])
# pd.read_csv(covidFile)

# altenatively, if you don't have access to the file you can dowload directly from this Dropbox link:
pd.read_csv("https://www.dropbox.com/s/78h5e3xj36xl9eu/covid19_map.csv?dl=1")

Unnamed: 0,pruid,prname,prnameFR,date,update,numconf,numprob,numdeaths,numtotal,numtested,...,numdeaths_last14,ratedeaths_last14,numtotal_last7,ratetotal_last7,numdeaths_last7,ratedeaths_last7,avgtotal_last7,avgincidence_last7,avgdeaths_last7,avgratedeaths_last7
0,35,Ontario,Ontario,31-01-2020,,3,0,0.0,3,,...,,,,,,,,,,
1,59,British Columbia,Colombie-Britannique,31-01-2020,,1,0,0.0,1,,...,,,,,,,,,,
2,1,Canada,Canada,31-01-2020,,4,0,0.0,4,,...,,,,,,,,,,
3,35,Ontario,Ontario,08-02-2020,,3,0,0.0,3,,...,,,,,,,,,,
4,59,British Columbia,Colombie-Britannique,08-02-2020,,4,0,0.0,4,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4747,60,Yukon,Yukon,17-01-2021,0.0,70,0,1.0,70,6256.0,...,0.0,0.00,0.0,0.00,0.0,0.00,0.0,0.00,0.0,0.00
4748,61,Northwest Territories,Territoires du Nord-Ouest,17-01-2021,1.0,28,0,0.0,28,8323.0,...,0.0,0.00,4.0,8.92,0.0,0.00,1.0,1.27,0.0,0.00
4749,62,Nunavut,Nunavut,17-01-2021,0.0,266,0,1.0,266,6558.0,...,0.0,0.00,0.0,0.00,0.0,0.00,0.0,0.00,0.0,0.00
4750,99,Repatriated travellers,Voyageurs rapatriés,17-01-2021,,13,0,0.0,13,76.0,...,0.0,,0.0,,0.0,,0.0,,0.0,


Where did my dataset go?

## Pandas dataframe
You want to keep the data into a Pandas dataframe so you will assign it!

In [48]:
covidFile="https://www.dropbox.com/s/78h5e3xj36xl9eu/covid19_map.csv?dl=1"
df=pd.read_csv(covidFile)

## head()
- A method is a function that takes as a first argument, the object itself
- Let's explore this dataset using the [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method

In [28]:
# call the dataframe's head
df.head()

Unnamed: 0,pruid,prname,prnameFR,date,update,numconf,numprob,numdeaths,numtotal,numtested,...,numdeaths_last14,ratedeaths_last14,numtotal_last7,ratetotal_last7,numdeaths_last7,ratedeaths_last7,avgtotal_last7,avgincidence_last7,avgdeaths_last7,avgratedeaths_last7
0,35,Ontario,Ontario,31-01-2020,,3,0,0.0,3,,...,,,,,,,,,,
1,59,British Columbia,Colombie-Britannique,31-01-2020,,1,0,0.0,1,,...,,,,,,,,,,
2,1,Canada,Canada,31-01-2020,,4,0,0.0,4,,...,,,,,,,,,,
3,35,Ontario,Ontario,08-02-2020,,3,0,0.0,3,,...,,,,,,,,,,
4,59,British Columbia,Colombie-Britannique,08-02-2020,,4,0,0.0,4,,...,,,,,,,,,,


- head() will print the first 5 rows
- We have 36 columns but only a subset of them were printed out
- You can change this as an [option](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html)

In [30]:
# Change the maximum display option
pd.set_option("display.max_columns",999)
#print again the head
df.head()


Unnamed: 0,pruid,prname,prnameFR,date,update,numconf,numprob,numdeaths,numtotal,numtested,numrecover,percentrecover,ratetested,numtoday,percentoday,ratetotal,ratedeaths,numdeathstoday,percentdeath,numtestedtoday,numrecoveredtoday,percentactive,numactive,rateactive,numtotal_last14,ratetotal_last14,numdeaths_last14,ratedeaths_last14,numtotal_last7,ratetotal_last7,numdeaths_last7,ratedeaths_last7,avgtotal_last7,avgincidence_last7,avgdeaths_last7,avgratedeaths_last7
0,35,Ontario,Ontario,31-01-2020,,3,0,0.0,3,,,,,3,300.0,0.02,0.0,0.0,0.0,,,100.0,3.0,0.02,,,,,,,,,,,,
1,59,British Columbia,Colombie-Britannique,31-01-2020,,1,0,0.0,1,,,,,1,100.0,0.02,0.0,0.0,0.0,,,100.0,1.0,0.02,,,,,,,,,,,,
2,1,Canada,Canada,31-01-2020,,4,0,0.0,4,,,,,4,400.0,0.01,0.0,0.0,0.0,,,100.0,4.0,0.01,,,,,,,,,,,,
3,35,Ontario,Ontario,08-02-2020,,3,0,0.0,3,,,,,0,0.0,0.02,0.0,0.0,0.0,,,100.0,3.0,0.02,,,,,,,,,,,,
4,59,British Columbia,Colombie-Britannique,08-02-2020,,4,0,0.0,4,,,,,3,300.0,0.08,0.0,0.0,0.0,,,100.0,4.0,0.08,,,,,,,,,,,,


## In class exercise

- There is also a [tail()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) method
- It works like head()
- By default the number of rows displayed is 5
- You can change this by calling n=X inside the method
- Call the tail() method on your dataset and print out the 10 last rows

In [31]:
# Call tail() on the dataframe
df.tail(n=10)

Unnamed: 0,pruid,prname,prnameFR,date,update,numconf,numprob,numdeaths,numtotal,numtested,numrecover,percentrecover,ratetested,numtoday,percentoday,ratetotal,ratedeaths,numdeathstoday,percentdeath,numtestedtoday,numrecoveredtoday,percentactive,numactive,rateactive,numtotal_last14,ratetotal_last14,numdeaths_last14,ratedeaths_last14,numtotal_last7,ratetotal_last7,numdeaths_last7,ratedeaths_last7,avgtotal_last7,avgincidence_last7,avgdeaths_last7,avgratedeaths_last7
4742,24,Quebec,Québec,17-01-2021,1.0,242714,0,9055.0,242714,2656534.0,213008.0,87.76,313087.0,1744,0.72,2860.52,106.72,50.0,3.73,9270.0,2644.0,8.51,20651.0,243.38,32410.0,381.97,708.0,8.34,13893.0,163.74,369.0,4.35,1985.0,23.39,53.0,0.62
4743,10,Newfoundland and Labrador,Terre-Neuve-et-Labrador,17-01-2021,1.0,396,0,4.0,396,76369.0,383.0,96.72,146429.0,1,0.25,75.93,0.77,0.0,1.01,204.0,0.0,2.27,9.0,1.73,6.0,1.15,0.0,0.0,3.0,0.58,0.0,0.0,0.0,0.08,0.0,0.0
4744,13,New Brunswick,Nouveau-Brunswick,17-01-2021,1.0,947,0,12.0,947,128277.0,642.0,67.79,165129.0,36,3.95,121.91,1.54,0.0,1.27,874.0,11.0,30.94,293.0,37.72,329.0,42.35,3.0,0.39,168.0,21.63,3.0,0.39,24.0,3.09,0.0,0.06
4745,12,Nova Scotia,Nouvelle-Écosse,17-01-2021,1.0,1558,0,65.0,1558,195810.0,1464.0,93.97,201576.0,4,0.26,160.39,6.69,0.0,4.17,743.0,5.0,1.86,29.0,2.99,57.0,5.87,0.0,0.0,30.0,3.09,0.0,0.0,4.0,0.44,0.0,0.0
4746,11,Prince Edward Island,Île-du-Prince-Édouard,17-01-2021,0.0,104,0,0.0,104,86220.0,95.0,91.35,549357.0,0,0.0,66.26,0.0,0.0,0.0,331.0,0.0,8.65,9.0,5.73,8.0,5.1,0.0,0.0,2.0,1.27,0.0,0.0,0.0,0.18,0.0,0.0
4747,60,Yukon,Yukon,17-01-2021,0.0,70,0,1.0,70,6256.0,67.0,95.71,153131.0,0,0.0,171.34,2.45,0.0,1.43,0.0,0.0,2.86,2.0,4.9,6.0,14.69,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4748,61,Northwest Territories,Territoires du Nord-Ouest,17-01-2021,1.0,28,0,0.0,28,8323.0,24.0,85.71,185673.0,3,12.0,62.46,0.0,0.0,0.0,0.0,0.0,14.29,4.0,8.92,4.0,8.92,0.0,0.0,4.0,8.92,0.0,0.0,1.0,1.27,0.0,0.0
4749,62,Nunavut,Nunavut,17-01-2021,0.0,266,0,1.0,266,6558.0,265.0,99.62,169108.0,0,0.0,685.92,2.58,0.0,0.38,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4750,99,Repatriated travellers,Voyageurs rapatriés,17-01-2021,,13,0,0.0,13,76.0,13.0,100.0,,0,0.0,,,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,,0.0,
4751,1,Canada,Canada,17-01-2021,,708619,0,18014.0,708619,16557083.0,615324.0,86.83,440474.0,6436,0.92,1885.16,47.92,149.0,2.54,70499.0,7240.0,10.62,75281.0,200.27,105106.0,279.62,2103.0,5.59,47285.0,125.79,1001.0,2.66,6755.0,17.97,143.0,0.38


## Shape

- Let's get a better sense of the number of variables (columns) and observations (rows) in this dataset.
- To do so you can use the [shape](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) __attribute__

In [32]:
# What are the dimensions of our df?
df.shape

(4752, 36)

## Info()
Last week we talked about the different types.

Pandas offer a way to know about the type of each variable and more.

This is the [info()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) method

In [34]:
# Call info()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4752 entries, 0 to 4751
Data columns (total 36 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   pruid                4752 non-null   int64  
 1   prname               4752 non-null   object 
 2   prnameFR             4752 non-null   object 
 3   date                 4752 non-null   object 
 4   update               325 non-null    float64
 5   numconf              4752 non-null   int64  
 6   numprob              4752 non-null   int64  
 7   numdeaths            4633 non-null   float64
 8   numtotal             4752 non-null   int64  
 9   numtested            4695 non-null   float64
 10  numrecover           4228 non-null   float64
 11  percentrecover       4024 non-null   float64
 12  ratetested           4370 non-null   float64
 13  numtoday             4752 non-null   int64  
 14  percentoday          4752 non-null   float64
 15  ratetotal            4439 non-null   f

## describe()

- You can also obtain a few summary statistics



In [37]:
# use the describe method on df
df.describe().round()

Unnamed: 0,pruid,update,numconf,numprob,numdeaths,numtotal,numtested,numrecover,percentrecover,ratetested,numtoday,percentoday,ratetotal,ratedeaths,numdeathstoday,percentdeath,numtestedtoday,numrecoveredtoday,percentactive,numactive,rateactive,numtotal_last14,ratetotal_last14,numdeaths_last14,ratedeaths_last14,numtotal_last7,ratetotal_last7,numdeaths_last7,ratedeaths_last7,avgtotal_last7,avgincidence_last7,avgdeaths_last7,avgratedeaths_last7
count,4752.0,325.0,4752.0,4752.0,4633.0,4752.0,4695.0,4228.0,4024.0,4370.0,4752.0,4752.0,4439.0,4439.0,4633.0,4355.0,4695.0,4228.0,4355.0,4355.0,4439.0,4500.0,4200.0,4500.0,4200.0,4500.0,4200.0,4500.0,4199.0,4500.0,4200.0,4500.0,4200.0
mean,39.0,1.0,24742.0,1.0,1139.0,24742.0,773226.0,22130.0,83.0,113185.0,298.0,3.0,266.0,9.0,8.0,2.0,7053.0,293.0,23.0,4301.0,48.0,4116.0,48.0,106.0,1.0,2138.0,25.0,55.0,1.0,305.0,4.0,8.0,0.0
std,26.0,0.0,70018.0,19.0,2730.0,70018.0,2094634.0,61350.0,20.0,113188.0,909.0,18.0,461.0,18.0,24.0,3.0,17923.0,1060.0,31.0,11008.0,97.0,12216.0,91.0,308.0,2.0,6325.0,47.0,158.0,1.0,904.0,7.0,23.0,0.0
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,26.0,-4.0,-1.0,0.0,0.0,-2.0,0.0,-20682.0,-31.0,0.0,0.0,0.0,-1.0,-0.0,0.0,0.0,-1.0,-0.0,0.0,0.0,0.0,-0.0,0.0,0.0
25%,12.0,1.0,17.0,0.0,0.0,20.0,4134.0,27.0,79.0,32271.0,0.0,0.0,22.0,0.0,0.0,0.0,14.0,0.0,2.0,3.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,46.0,1.0,340.0,0.0,7.0,341.0,54477.0,454.0,88.0,76245.0,1.0,0.0,60.0,1.0,0.0,1.0,394.0,1.0,10.0,65.0,7.0,41.0,9.0,0.0,0.0,18.0,4.0,0.0,0.0,3.0,1.0,0.0,0.0
75%,60.0,1.0,11864.0,0.0,237.0,11864.0,344259.0,12047.0,97.0,151229.0,133.0,1.0,271.0,7.0,2.0,4.0,3604.0,117.0,27.0,2225.0,34.0,1979.0,37.0,30.0,1.0,1038.0,20.0,16.0,0.0,148.0,3.0,2.0,0.0
max,99.0,1.0,708619.0,833.0,18014.0,708619.0,16557083.0,615324.0,100.0,681640.0,8766.0,600.0,2861.0,107.0,222.0,10.0,255537.0,23853.0,100.0,85549.0,676.0,109120.0,545.0,2311.0,18.0,57821.0,289.0,1236.0,9.0,8260.0,41.0,177.0,1.0


## Question
- Why are there 33 columns returned when using describe?

## In class exercise

- You can chain commands in Pandas.
- Code will be executed from letf to right
- Print again the describe() table 
- This time make sure that the statistics form the columns and the variables are the indices
- Hint: you can use the [transpose()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html) method 

In [38]:
 df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
pruid,4752.0,39.105219,26.25532,1.0,12.0,46.0,60.0,99.0
update,325.0,0.843077,0.3642891,0.0,1.0,1.0,1.0,1.0
numconf,4752.0,24741.714646,70018.19,0.0,17.0,339.5,11863.75,708619.0
numprob,4752.0,1.381313,19.11874,0.0,0.0,0.0,0.0,833.0
numdeaths,4633.0,1138.887114,2730.36,0.0,0.0,7.0,237.0,18014.0
numtotal,4752.0,24742.271254,70018.28,0.0,20.0,341.0,11863.75,708619.0
numtested,4695.0,773226.182322,2094634.0,0.0,4134.0,54477.0,344259.0,16557083.0
numrecover,4228.0,22129.917928,61349.78,0.0,27.0,453.5,12047.25,615324.0
percentrecover,4024.0,82.951817,20.26169,0.0,79.13,88.35,96.61,100.0
ratetested,4370.0,113185.255835,113188.1,26.0,32271.25,76245.0,151229.25,681640.0


## Subsetting
- It is very unlikely that a dataset, you didn't create yourself, will be tailored to your exact needs
- __Data cleaning__ consist in removing, reshaping, creating and merging data
- Let's start with using a subset of the data

In [40]:
# call head again
df.head()

Unnamed: 0,pruid,prname,prnameFR,date,update,numconf,numprob,numdeaths,numtotal,numtested,numrecover,percentrecover,ratetested,numtoday,percentoday,ratetotal,ratedeaths,numdeathstoday,percentdeath,numtestedtoday,numrecoveredtoday,percentactive,numactive,rateactive,numtotal_last14,ratetotal_last14,numdeaths_last14,ratedeaths_last14,numtotal_last7,ratetotal_last7,numdeaths_last7,ratedeaths_last7,avgtotal_last7,avgincidence_last7,avgdeaths_last7,avgratedeaths_last7
0,35,Ontario,Ontario,31-01-2020,,3,0,0.0,3,,,,,3,300.0,0.02,0.0,0.0,0.0,,,100.0,3.0,0.02,,,,,,,,,,,,
1,59,British Columbia,Colombie-Britannique,31-01-2020,,1,0,0.0,1,,,,,1,100.0,0.02,0.0,0.0,0.0,,,100.0,1.0,0.02,,,,,,,,,,,,
2,1,Canada,Canada,31-01-2020,,4,0,0.0,4,,,,,4,400.0,0.01,0.0,0.0,0.0,,,100.0,4.0,0.01,,,,,,,,,,,,
3,35,Ontario,Ontario,08-02-2020,,3,0,0.0,3,,,,,0,0.0,0.02,0.0,0.0,0.0,,,100.0,3.0,0.02,,,,,,,,,,,,
4,59,British Columbia,Colombie-Britannique,08-02-2020,,4,0,0.0,4,,,,,3,300.0,0.08,0.0,0.0,0.0,,,100.0,4.0,0.08,,,,,,,,,,,,


## Keep only the columns that you need
- Often you will want to remove columns (variables)
- Columns that are not needed may take space and slow down your code
- Let's use the [pop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pop.html) method on prnameFR variable

In [49]:
# you can use the pop method by using the variable's name
print(df.columns.tolist())
df.pop('prnameFR')
print(df.columns.tolist()) 

['pruid', 'prname', 'prnameFR', 'date', 'update', 'numconf', 'numprob', 'numdeaths', 'numtotal', 'numtested', 'numrecover', 'percentrecover', 'ratetested', 'numtoday', 'percentoday', 'ratetotal', 'ratedeaths', 'numdeathstoday', 'percentdeath', 'numtestedtoday', 'numrecoveredtoday', 'percentactive', 'numactive', 'rateactive', 'numtotal_last14', 'ratetotal_last14', 'numdeaths_last14', 'ratedeaths_last14', 'numtotal_last7', 'ratetotal_last7', 'numdeaths_last7', 'ratedeaths_last7', 'avgtotal_last7', 'avgincidence_last7', 'avgdeaths_last7', 'avgratedeaths_last7']
['pruid', 'prname', 'date', 'update', 'numconf', 'numprob', 'numdeaths', 'numtotal', 'numtested', 'numrecover', 'percentrecover', 'ratetested', 'numtoday', 'percentoday', 'ratetotal', 'ratedeaths', 'numdeathstoday', 'percentdeath', 'numtestedtoday', 'numrecoveredtoday', 'percentactive', 'numactive', 'rateactive', 'numtotal_last14', 'ratetotal_last14', 'numdeaths_last14', 'ratedeaths_last14', 'numtotal_last7', 'ratetotal_last7', 'nu

## drop()
- The [drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) method allows you to drop one or more columns
- It is more verbose
- Let's see some of its functionalities and remove variables ending with _last7

In [51]:
# Let's pass a list of columns to drop
toDrop=['numtotal_last7','ratetotal_last7', 'numdeaths_last7', 'ratedeaths_last7','avgtotal_last7', 'avgincidence_last7', 'avgdeaths_last7','avgratedeaths_last7']
df=df.drop(columns=toDrop)
df.head()

Unnamed: 0,pruid,prname,date,update,numconf,numprob,numdeaths,numtotal,numtested,numrecover,percentrecover,ratetested,numtoday,percentoday,ratetotal,ratedeaths,numdeathstoday,percentdeath,numtestedtoday,numrecoveredtoday,percentactive,numactive,rateactive,numtotal_last14,ratetotal_last14,numdeaths_last14,ratedeaths_last14
0,35,Ontario,31-01-2020,,3,0,0.0,3,,,,,3,300.0,0.02,0.0,0.0,0.0,,,100.0,3.0,0.02,,,,
1,59,British Columbia,31-01-2020,,1,0,0.0,1,,,,,1,100.0,0.02,0.0,0.0,0.0,,,100.0,1.0,0.02,,,,
2,1,Canada,31-01-2020,,4,0,0.0,4,,,,,4,400.0,0.01,0.0,0.0,0.0,,,100.0,4.0,0.01,,,,
3,35,Ontario,08-02-2020,,3,0,0.0,3,,,,,0,0.0,0.02,0.0,0.0,0.0,,,100.0,3.0,0.02,,,,
4,59,British Columbia,08-02-2020,,4,0,0.0,4,,,,,3,300.0,0.08,0.0,0.0,0.0,,,100.0,4.0,0.08,,,,


## Keep columns
- What if instead you want to specify which columns to keep instead?
- You can do so by calling a list on the dataframe
- Say you only want to keep the province name, date and total number of positive cases

In [53]:
# subsetting on 'prname','date','numtotal'
df[['prname','date','numtotal']]

Unnamed: 0,prname,date,numtotal
0,Ontario,31-01-2020,3
1,British Columbia,31-01-2020,1
2,Canada,31-01-2020,4
3,Ontario,08-02-2020,3
4,British Columbia,08-02-2020,4
...,...,...,...
4747,Yukon,17-01-2021,70
4748,Northwest Territories,17-01-2021,28
4749,Nunavut,17-01-2021,266
4750,Repatriated travellers,17-01-2021,13


## In class exercise
- Save in a list named toKeep the columns corresponding to province name, date and the number of deaths
- pass this list in df
- save the results in a new dataframe called df_mortality
- show the head of df_mortality

In [55]:
# save list to keep, province name, 'date', 'numdeaths'
toKeep=['prname','date','numdeaths']
# pass this list in dataframe and save in new dataframe named df_mortality
df_mortality=df[toKeep]
# show head()
df_mortality.head()

Unnamed: 0,prname,date,numdeaths
0,Ontario,31-01-2020,0.0
1,British Columbia,31-01-2020,0.0
2,Canada,31-01-2020,0.0
3,Ontario,08-02-2020,0.0
4,British Columbia,08-02-2020,0.0


## rename()
- Another useful method is [rename()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html)
- rename() takes a dictionary as a _mapper_ the key represents the current variable name and the value the new variable name.
- you can change multiple names at the same time
- you can also set the inplace argument to True
- Let's rename prname and numdeaths of df_mortality to province and deaths

In [60]:
# rename columns of df_mortality, set the inplace argument to True
df_mortality.rename(columns={'prname':'province', 'numdeaths':'deaths'}, inplace=True)
# print df_mortality's head
df_mortality

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


Unnamed: 0,province,date,deaths
0,Ontario,31-01-2020,0.0
1,British Columbia,31-01-2020,0.0
2,Canada,31-01-2020,0.0
3,Ontario,08-02-2020,0.0
4,British Columbia,08-02-2020,0.0
...,...,...,...
4747,Yukon,17-01-2021,1.0
4748,Northwest Territories,17-01-2021,0.0
4749,Nunavut,17-01-2021,1.0
4750,Repatriated travellers,17-01-2021,0.0


## Subset of observations
- You may also be interested in subseting the dataset according to the information it contains
- Say only want to keep data from Ontario
- You can do so by using booleans

In [65]:
# Can you guess what this will return?
df['prname']=="Ontario"



Unnamed: 0,pruid,prname,date,update,numconf,numprob,numdeaths,numtotal,numtested,numrecover,percentrecover,ratetested,numtoday,percentoday,ratetotal,ratedeaths,numdeathstoday,percentdeath,numtestedtoday,numrecoveredtoday,percentactive,numactive,rateactive,numtotal_last14,ratetotal_last14,numdeaths_last14,ratedeaths_last14
0,35,Ontario,31-01-2020,,3,0,0.0,3,,,,,3,300.0,0.02,0.0,0.0,0.0,,,100.0,3.0,0.02,,,,
1,59,British Columbia,31-01-2020,,1,0,0.0,1,,,,,1,100.0,0.02,0.0,0.0,0.0,,,100.0,1.0,0.02,,,,
2,1,Canada,31-01-2020,,4,0,0.0,4,,,,,4,400.0,0.01,0.0,0.0,0.0,,,100.0,4.0,0.01,,,,
3,35,Ontario,08-02-2020,,3,0,0.0,3,,,,,0,0.0,0.02,0.0,0.0,0.0,,,100.0,3.0,0.02,,,,
4,59,British Columbia,08-02-2020,,4,0,0.0,4,,,,,3,300.0,0.08,0.0,0.0,0.0,,,100.0,4.0,0.08,,,,


0        True
1       False
2       False
3        True
4       False
        ...  
4747    False
4748    False
4749    False
4750    False
4751    False
Name: prname, Length: 4752, dtype: bool

In [66]:
# What is the type of what is returned?
type(df['prname']=="Ontario")

pandas.core.series.Series

In [71]:
# Create a new dataframe named on as a copy of df 
on=df[df['prname']=="Ontario"].copy()
# print head
on.head()

Unnamed: 0,pruid,prname,date,update,numconf,numprob,numdeaths,numtotal,numtested,numrecover,percentrecover,ratetested,numtoday,percentoday,ratetotal,ratedeaths,numdeathstoday,percentdeath,numtestedtoday,numrecoveredtoday,percentactive,numactive,rateactive,numtotal_last14,ratetotal_last14,numdeaths_last14,ratedeaths_last14
0,35,Ontario,31-01-2020,,3,0,0.0,3,,,,,3,300.0,0.02,0.0,0.0,0.0,,,100.0,3.0,0.02,,,,
3,35,Ontario,08-02-2020,,3,0,0.0,3,,,,,0,0.0,0.02,0.0,0.0,0.0,,,100.0,3.0,0.02,,,,
6,35,Ontario,16-02-2020,,3,0,0.0,3,,,,,0,0.0,0.02,0.0,0.0,0.0,,,100.0,3.0,0.02,,,,
9,35,Ontario,21-02-2020,,3,0,0.0,3,,,,,0,0.0,0.02,0.0,0.0,0.0,,,100.0,3.0,0.02,,,,
12,35,Ontario,24-02-2020,,4,0,0.0,4,,,,,1,33.33,0.03,0.0,0.0,0.0,,,100.0,4.0,0.03,,,,


## copy()
- Pandas allows you to map the changes you make on a subset of a dataset to the original dataset
- In our example, you could map the changes you make to the dataframe on onto df
- Hence, Pandas wants you to be specific about the type of copy you are making
- using the [copy()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html) method allows you to specify you do not want these dataframes to be linked.
- Sorry!


## unique()
- The unique() method can be called on a [Pandas series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) (a single column)
- It will return all unique values in a series

In [73]:
# call unique on the prname variable
df['prname'].unique()

array(['Ontario', 'British Columbia', 'Canada', 'Quebec', 'Alberta',
       'Saskatchewan', 'Manitoba', 'New Brunswick',
       'Newfoundland and Labrador', 'Nova Scotia', 'Prince Edward Island',
       'Northwest Territories', 'Nunavut', 'Yukon',
       'Repatriated travellers'], dtype=object)

## In class exercise
- You can combine the booleans 
- It seems like there are values in prname that are not provinces or territories 
- Subset df and keep only provinces and territories 
- hint you will need to use  "&" or "|" in between booleans and put parentheses around each condition
- Save the results in a df2 dataframe
- print the unique values of prname for df2
- show the tail of the df2

In [74]:
# create df2
df2=df[(df['prname']!='Canada')  & (df['prname']!='Repatriated travellers')]
# display unique values of prname
display(df2.prname.unique())
# show tail
df2.tail()

array(['Ontario', 'British Columbia', 'Quebec', 'Alberta', 'Saskatchewan',
       'Manitoba', 'New Brunswick', 'Newfoundland and Labrador',
       'Nova Scotia', 'Prince Edward Island', 'Northwest Territories',
       'Nunavut', 'Yukon'], dtype=object)

Unnamed: 0,pruid,prname,date,update,numconf,numprob,numdeaths,numtotal,numtested,numrecover,percentrecover,ratetested,numtoday,percentoday,ratetotal,ratedeaths,numdeathstoday,percentdeath,numtestedtoday,numrecoveredtoday,percentactive,numactive,rateactive,numtotal_last14,ratetotal_last14,numdeaths_last14,ratedeaths_last14
4745,12,Nova Scotia,17-01-2021,1.0,1558,0,65.0,1558,195810.0,1464.0,93.97,201576.0,4,0.26,160.39,6.69,0.0,4.17,743.0,5.0,1.86,29.0,2.99,57.0,5.87,0.0,0.0
4746,11,Prince Edward Island,17-01-2021,0.0,104,0,0.0,104,86220.0,95.0,91.35,549357.0,0,0.0,66.26,0.0,0.0,0.0,331.0,0.0,8.65,9.0,5.73,8.0,5.1,0.0,0.0
4747,60,Yukon,17-01-2021,0.0,70,0,1.0,70,6256.0,67.0,95.71,153131.0,0,0.0,171.34,2.45,0.0,1.43,0.0,0.0,2.86,2.0,4.9,6.0,14.69,0.0,0.0
4748,61,Northwest Territories,17-01-2021,1.0,28,0,0.0,28,8323.0,24.0,85.71,185673.0,3,12.0,62.46,0.0,0.0,0.0,0.0,0.0,14.29,4.0,8.92,4.0,8.92,0.0,0.0
4749,62,Nunavut,17-01-2021,0.0,266,0,1.0,266,6558.0,265.0,99.62,169108.0,0,0.0,685.92,2.58,0.0,0.38,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Combine the two
- You can combine the column and row selection
- Usually you would use [.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)
- The order is always rows, columns: df.loc[df.colname==value, [cols_list]]
- Say you want data only for Quebec and only for the number of tested and the date

In [75]:
# create a Quebec (qc) dataset with 'prname','date','numtestedtoday' variables as a copy of df
qc=df.loc[df.prname=='Quebec', ['prname','date','numtestedtoday']]
#show head
qc.head()

Unnamed: 0,prname,date,numtestedtoday
29,Quebec,01-03-2020,
33,Quebec,03-03-2020,
37,Quebec,05-03-2020,
41,Quebec,06-03-2020,
45,Quebec,07-03-2020,


## Create new variables
- Often you will also want to create new variables
- Let's see a very simple case

In [76]:
# you can also create a constant very easily
qc['test']=1
# show head  
qc.head()

Unnamed: 0,prname,date,numtestedtoday,test
29,Quebec,01-03-2020,,1
33,Quebec,03-03-2020,,1
37,Quebec,05-03-2020,,1
41,Quebec,06-03-2020,,1
45,Quebec,07-03-2020,,1


In [77]:
# Let's pop this series out
qc.pop('test')

29      1
33      1
37      1
41      1
45      1
       ..
4682    1
4697    1
4712    1
4727    1
4742    1
Name: test, Length: 320, dtype: int64

In [78]:
# show head again
qc.head()

Unnamed: 0,prname,date,numtestedtoday
29,Quebec,01-03-2020,
33,Quebec,03-03-2020,
37,Quebec,05-03-2020,
41,Quebec,06-03-2020,
45,Quebec,07-03-2020,


## groupby()
- Another very useful tool is [groupby()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)
- Say you want to know about the total number of people tested each month in Quebec
- This is equivalent to reducing the number of observations to one per month and sum over the total number of tested

### Problem
- We need a variable that varies by month
- For now we only have a variable that varies by day
- You could you Pandas' powerful [datetime tools](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html)
- We will use the [split()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html) method instead. 

In [None]:
# create a Quebec (qc) dataset with 'prname','date','numtestedtoday' variables as a copy of df

#show head


In [80]:
# The date column is an object (string) type
print(qc['date'])
# use the split method on the string of a pandas series
qc['date'].str.split(pat='-')

29      01-03-2020
33      03-03-2020
37      05-03-2020
41      06-03-2020
45      07-03-2020
           ...    
4682    13-01-2021
4697    14-01-2021
4712    15-01-2021
4727    16-01-2021
4742    17-01-2021
Name: date, Length: 320, dtype: object


29      [01, 03, 2020]
33      [03, 03, 2020]
37      [05, 03, 2020]
41      [06, 03, 2020]
45      [07, 03, 2020]
             ...      
4682    [13, 01, 2021]
4697    [14, 01, 2021]
4712    [15, 01, 2021]
4727    [16, 01, 2021]
4742    [17, 01, 2021]
Name: date, Length: 320, dtype: object

In [84]:
# unpack the values in 3 different columns using expand=True in split()
qc[['day','month','year']]=qc['date'].str.split(pat='-',expand=True)

# show tail()

qc.tail()

Unnamed: 0,prname,date,numtestedtoday,day,month,year
4682,Quebec,13-01-2021,8692.0,13,1,2021
4697,Quebec,14-01-2021,8955.0,14,1,2021
4712,Quebec,15-01-2021,8471.0,15,1,2021
4727,Quebec,16-01-2021,9590.0,16,1,2021
4742,Quebec,17-01-2021,9270.0,17,1,2021


We can now use groupby on the dataframe

In [92]:
# groupby year and month, and take the sum() of numtestedtoday
qc.groupby(['year','month'],as_index=False)['numtestedtoday'].sum().astype(int)

Unnamed: 0,year,month,numtestedtoday
0,2020,3,67540
1,2020,4,146507
2,2020,5,222752
3,2020,6,163515
4,2020,7,219403
5,2020,8,288470
6,2020,9,416540
7,2020,10,382212
8,2020,11,279137
9,2020,12,312265


## In class exercise
- use groupby to find the maximum number of daily deaths each month in Ontario
- First you'll need to get day, month and year in on dataframe
- Hint: use the max() method

In [97]:
on[['day','month','year']]=on['date'].str.split(pat='-',expand=True)
on.groupby(['year','month'],as_index=False)['numdeathstoday'].max().astype(int)

Unnamed: 0,year,month,numdeathstoday
0,2020,1,0
1,2020,2,0
2,2020,3,14
3,2020,4,86
4,2020,5,84
5,2020,6,45
6,2020,7,14
7,2020,8,6
8,2020,9,9
9,2020,10,76


### transform()
- Say that I want to express the number of people tested each day as a percentage of the total number of people treated in the same month
- First, we should find a way to keep the value for the within month sum
- This is what calling [transform](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html) on a groupby() operation allows you to do

In [100]:

qc['sum_tested']= qc.groupby(['year','month'])['numtestedtoday'].transform(sum)
qc.head()

Unnamed: 0,prname,date,numtestedtoday,day,month,year,sum_tested
29,Quebec,01-03-2020,,1,3,2020,67540.0
33,Quebec,03-03-2020,,3,3,2020,67540.0
37,Quebec,05-03-2020,,5,3,2020,67540.0
41,Quebec,06-03-2020,,6,3,2020,67540.0
45,Quebec,07-03-2020,,7,3,2020,67540.0


## fillna()
- Second we want to replace NaN values by zeros for tested
- Warning: this is very rarely a good idea
- But here we know that testing was not in place in early March and this is why the data is missing
- We can fill the missing values with zeros using the [fillna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) method

In [101]:
# fill missing values in qc['numtestedtoday'] with zeros
qc['numtestedtoday'].fillna(0)

29         0.0
33         0.0
37         0.0
41         0.0
45         0.0
         ...  
4682    8692.0
4697    8955.0
4712    8471.0
4727    9590.0
4742    9270.0
Name: numtestedtoday, Length: 320, dtype: float64

In [104]:
# Replace NaN by zeros for the numtestedtoday variable
qc['numtestedtoday']=qc['numtestedtoday'].fillna(0)

#show head()
qc.head()

Unnamed: 0,prname,date,numtestedtoday,day,month,year,sum_tested
29,Quebec,01-03-2020,0.0,1,3,2020,67540.0
33,Quebec,03-03-2020,0.0,3,3,2020,67540.0
37,Quebec,05-03-2020,0.0,5,3,2020,67540.0
41,Quebec,06-03-2020,0.0,6,3,2020,67540.0
45,Quebec,07-03-2020,0.0,7,3,2020,67540.0


We can now create our new variable: share of people tested in a day as a percentage of the total number in a month

In [106]:
# create share_tested variable
qc['share_tested']=qc['numtestedtoday']/qc['sum_tested']*100

# show tail()
qc

Unnamed: 0,prname,date,numtestedtoday,day,month,year,sum_tested,share_tested
29,Quebec,01-03-2020,0.0,01,03,2020,67540.0,0.000000
33,Quebec,03-03-2020,0.0,03,03,2020,67540.0,0.000000
37,Quebec,05-03-2020,0.0,05,03,2020,67540.0,0.000000
41,Quebec,06-03-2020,0.0,06,03,2020,67540.0,0.000000
45,Quebec,07-03-2020,0.0,07,03,2020,67540.0,0.000000
...,...,...,...,...,...,...,...,...
4682,Quebec,13-01-2021,8692.0,13,01,2021,158193.0,5.494554
4697,Quebec,14-01-2021,8955.0,14,01,2021,158193.0,5.660807
4712,Quebec,15-01-2021,8471.0,15,01,2021,158193.0,5.354851
4727,Quebec,16-01-2021,9590.0,16,01,2021,158193.0,6.062215


## In class exercise
- What is the share of people tested in each province as percentage of the total number of people tested in Canada?
1. Create a dataframe named can, keeping the prname and numtestedtoday variables
2. Drop observations for 'Canada' and 'Repatriated travellers' in prname
3. Get the total number of people tested (hint: call sum on the column but do not use groupby)
4. Save the results in total_tested
5. Print out the total number of people tested in Canada
6. group the can dataframe by province and sum the total number of tested (suggestion: set as_index=False in groupby)
7. rename the variable _numtestedtoday_ to _totaltested_ 
8. Create a new colum, _total_, which contains the value in total_tested
9. Create a new column with the share for each province
10. show the head() of can for 13 observations

In [None]:
#1. Create a dataframe named can, keeping the prname and numtestedtoday variables
#2. Drop observations for 'Canada' and 'Repatriated travellers' in prname

#3. Get the total number of people tested (hint: call sum on the column but do not use groupby)
#4. Save the results in total_tested

#5. Print out the total number of people tested in Canada

#6. group the can dataframe by province and sum the total number of tested (suggestion: set as_index=False in groupby)

#7. rename the variable numtestedtoday to totaltested

# 8. Create a new colum, _total_, which contains the value in total_tested

#9. Create a new column with the share for each province

