## Python Packages
- Packages are a convenient way to extend Python's functionality
- A package is a collection of functions and objects
- First example is [Numpy](https://numpy.org/devdocs/user/quickstart.html)

In [None]:
# import numpy package and give it the usual alias: np
import 

In [None]:
# Access numpy commands by saying "np.command"


### Numpy arrays

- Numpy has an object called an _array_
- Numpy arrays act like lists, but they support many more operations (such as element-wise arithmetic), and math functions on them can be much faster

In [None]:
# Create a numpy array [1, 4, 9]
array = 

# An equivalent list (named lst, because the name "list" would shadow Python's built-in list type)
lst = 

print(type(lst), type(array))

### Numpy functions
- The numpy sum function [np.sum()](https://numpy.org/doc/stable/reference/generated/numpy.sum.html)  asks python to access the sum() function in the Numpy package
- The numpy square root function [np.sqrt()](https://numpy.org/doc/stable/reference/generated/numpy.sqrt.html) calculates the square root of each element in a numpy array
- All functions in packages are accessed the same way.
- We use numpy functions because (1) they are often faster than built-in python functions or (2) they have more functionality

In [None]:
# Print the sum of the numpy array
print()

# Print the sum of the list
print()

# Print the square root of each element in the numpy array
print()
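
As an illustration of the points above, here is a minimal sketch using the values [1, 4, 9] from the earlier cell. The variable names (`values_array`, `values_list`, `roots`) are chosen for this example only.

```python
import numpy as np

# Example values [1, 4, 9]; the variable names are illustrative
values_array = np.array([1, 4, 9])
values_list = [1, 4, 9]

# A list and a numpy array are different types
print(type(values_list), type(values_array))

# np.sum accepts both numpy arrays and plain lists
print(np.sum(values_array))
print(np.sum(values_list))

# np.sqrt is applied to each element of the array
roots = np.sqrt(values_array)
print(roots)
```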


## Import Pandas
- Today we will look at a package called [Pandas](https://pandas.pydata.org/docs/)
- Pandas is one of the most widely used Python packages for handling tabular data

In [None]:
# Import the library, use pd alias


# Print out the Pandas version


## Why use Pandas? 

- Pandas keeps the speed of Numpy
- It then adds the ability to label __variables__ and __index__ rows in a user-friendly way
- Before loading an actual dataset into memory, let's see a very simple example

In [None]:
# Create a small dictionary for data
d = {'col1': [1, 2, 10,20], 'col2': [3, 4, 25.0,'ottawa']}
dataframe = 
dataframe

- Your dataframe is now assigned to _dataframe_
- This is what's known as an __object__
    - You can learn more about [Object Oriented Programming here](https://realpython.com/python3-object-oriented-programming/)

In [None]:
# Use the columns attribute on the dataframe to list variables
dataframe

In [None]:
# print type of what is returned by calling columns
print(type(dataframe.columns))
# Save column names to a list using the tolist method
cols=dataframe.columns.tolist()
# print the type of this list
print(type(cols))
# display the list of column names
cols
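
For reference, a small self-contained sketch of the same steps on a toy dataframe. The column names `x` and `y` and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from any real dataset
toy = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.5, 6.0]})

# The columns attribute returns a pandas Index object
print(type(toy.columns))

# tolist() converts that Index into a regular Python list
col_names = toy.columns.tolist()
print(col_names)
```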

# Let's use real data
- Let's look into a dataset freely available on [Open Canada](https://open.canada.ca)
- Specifically, we will look into data related to [COVID19 in Canada](https://open.canada.ca/data/en/dataset/b8d1d622-1ceb-4c1c-96e9-a0b38939080b)


Best practice

In [None]:
# Store the folder path in a variable if you access the same folder several times or want to share your code
pathFolder="~/Dropbox/_teaching/ECO4199/2023/Data-Science-for-Social-Scientists/Class 03 - Tidy Data/COVID/"

In [None]:
# read_csv into a pandas dataframe
covidFile=''.join([pathFolder,"covid19_map.csv"])
pd.read_csv(covidFile)

# Alternatively, if you don't have access to the file, you can download it directly from this Dropbox link:
#pd.read_csv("https://www.dropbox.com/s/78h5e3xj36xl9eu/covid19_map.csv?dl=1")

Where did my dataset go?

## Pandas dataframe
To keep the data in a Pandas dataframe, assign the result of read_csv to a variable!

In [None]:
covidFile="https://www.dropbox.com/s/78h5e3xj36xl9eu/covid19_map.csv?dl=1"
df=pd.read_csv(covidFile)

## head()
- A method is a function attached to an object; it takes the object itself as its first argument
- Let's explore this dataset using the [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method

In [None]:
# call the dataframe's head
df

- By default, head() returns the first 5 rows
- We have 36 columns but only a subset of them were printed out
- You can change this as an [option](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html)

In [None]:
# Change the maximum display option

# print the head again
df.head()
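
One possible way to change the display option is sketched below. It uses the `display.max_columns` option; setting it to `None` removes the limit on displayed columns. The variable `limit_while_set` exists only for this illustration.

```python
import pandas as pd

# None means "no limit": every column is shown when a dataframe is displayed
pd.set_option("display.max_columns", None)
limit_while_set = pd.get_option("display.max_columns")
print(limit_while_set)

# Restore the default afterwards if you prefer the compact view
pd.reset_option("display.max_columns")
```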

## In class exercise

- There is also a [tail()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) method
- It works like head()
- By default the number of rows displayed is 5
- You can change this by passing n=X to the method

In [None]:
#Call the tail() method on your dataset and print out the 10 last rows
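
A minimal sketch of the `n` argument on a toy dataframe (the 20-row frame and the column name `obs` are hypothetical):

```python
import pandas as pd

# Hypothetical toy dataframe with 20 rows numbered 0 to 19
toy = pd.DataFrame({"obs": range(20)})

first_three = toy.head(n=3)   # first 3 rows instead of the default 5
last_ten = toy.tail(n=10)     # last 10 rows
print(last_ten)
```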


## Shape

- Let's get a better sense of the number of variables (columns) and observations (rows) in this dataset.
- To do so you can use the [shape](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) __attribute__

In [None]:
# What are the dimensions of our df?
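
As a quick illustration on made-up data, `shape` is an attribute (no parentheses) that holds a (rows, columns) tuple:

```python
import pandas as pd

# Hypothetical toy dataframe: 3 observations, 2 variables
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is a tuple that can be unpacked into rows and columns
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
```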


## info()
Last week we talked about the different types.

Pandas offers a way to check the type of each variable, and more.

This is the [info()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) method

In [None]:
# Call info()


## describe()

- You can also obtain a few summary statistics

- By default, these are computed only for numerical variables

In [None]:
# use the describe method on df
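
The sketch below shows, on a toy dataframe with made-up values, that `describe()` summarizes only the numerical column by default, while `info()` lists every column with its type.

```python
import pandas as pd

# Hypothetical toy data: one text column and one numerical column
toy = pd.DataFrame({
    "prname": ["Ontario", "Quebec", "Ontario"],
    "numtotal": [10, 20, 30],
})

# info() prints each column with its type and non-null count
toy.info()

# describe() keeps only the numerical column by default
summary = toy.describe()
print(summary)
```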


## Question
- Why are there 33 columns returned when using describe?

## Chaining Commands

- You can chain commands in Pandas.
- Code will be executed from left to right
- Here is an example using the [transpose()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html) method 

In [None]:
# Transposed describe method
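
A small sketch of chaining on made-up data: `describe()` runs first, then `transpose()` is applied to its result, so each variable becomes a row.

```python
import pandas as pd

# Hypothetical toy dataframe with two numerical variables
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10.0, 20.0, 30.0]})

# describe() is evaluated first, transpose() second
summary_t = toy.describe().transpose()
print(summary_t)
```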


## Subsetting
- It is very unlikely that a dataset you didn't create yourself will be tailored to your exact needs
- __Data cleaning__ consists of removing, reshaping, creating and merging data
- Let's start with using a subset of the data

In [None]:
# call head again
df.head()

## Keep only the columns that you need
- Often you will want to remove columns (variables)
- Columns that are not needed may take space and slow down your code
- Let's use the [pop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pop.html) method on the prnameFR variable

In [None]:
# you can use the pop method by using the variable's name
print(df.columns)



print(df.columns)
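
For illustration, here is how `pop` behaves on a toy dataframe. The column names mirror the dataset, but the values are made up. Note that `pop` removes the column in place and returns it as a Series.

```python
import pandas as pd

# Hypothetical toy data reusing the column names from the dataset
toy = pd.DataFrame({
    "prname": ["Ontario", "Quebec"],
    "prnameFR": ["Ontario", "Québec"],
    "numtotal": [5, 7],
})

# pop removes the column from toy and returns it
removed = toy.pop("prnameFR")
print(toy.columns)
```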

## drop()
- The [drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) method allows you to drop one or more columns
- It is more verbose
- Let's see some of its functionalities and remove variables ending with _last7

In [None]:
# Let's pass a list of columns to drop
colstodrop = ['numtotal_last7','ratetotal_last7', 'numdeaths_last7', 'ratedeaths_last7','avgtotal_last7', 'avgincidence_last7', 'avgdeaths_last7','avgratedeaths_last7']
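
A minimal sketch of `drop` on a toy dataframe with made-up values. Unlike `pop`, `drop` returns a new dataframe and leaves the original untouched unless you reassign the result. The list comprehension is an optional alternative for selecting columns by suffix.

```python
import pandas as pd

# Hypothetical toy data with two columns ending in _last7
toy = pd.DataFrame({
    "numtotal": [1, 2],
    "numtotal_last7": [1, 1],
    "numdeaths_last7": [0, 0],
})

# Drop an explicit list of columns; the result is a new dataframe
cols_to_drop = ["numtotal_last7", "numdeaths_last7"]
trimmed = toy.drop(columns=cols_to_drop)

# Alternative: build the list from the column names themselves
suffix_cols = [c for c in toy.columns if c.endswith("_last7")]
print(trimmed.columns.tolist(), suffix_cols)
```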


## Keep columns
- What if you want to specify which columns to keep instead?
- You can do so by passing a list of column names to the dataframe inside square brackets
- Say you only want to keep the province name, date and total number of positive cases

In [None]:
# subsetting: Subset dataframe to columns 'prname','date','numtotal'
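
As a sketch on made-up data (the values and the extra column `numdeaths` are hypothetical), note the double brackets: the outer pair indexes the dataframe, the inner pair is the list of columns to keep.

```python
import pandas as pd

# Hypothetical toy data reusing the column names from the dataset
toy = pd.DataFrame({
    "prname": ["Ontario", "Quebec"],
    "date": ["01-03-2020", "01-03-2020"],
    "numtotal": [5, 7],
    "numdeaths": [0, 1],
})

# Keep only the listed columns
subset = toy[["prname", "date", "numtotal"]]
print(subset)
```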


## rename()
- Another useful method is [rename()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html)
- rename() takes a dictionary as a _mapper_: each key is a current variable name and each value is the new variable name
- You can change multiple names at the same time
- You can also set the inplace argument to True to modify the dataframe directly
- Let's rename prname and numdeaths of df_mortality to province and deaths

In [None]:
#Rename column 'prname' to 'province'; 'numdeaths' to 'deaths'
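
A minimal sketch of the mapper dictionary on a toy dataframe with made-up values. Here the result is assigned to a new variable rather than using `inplace=True`.

```python
import pandas as pd

# Hypothetical toy data reusing the column names from the dataset
toy = pd.DataFrame({"prname": ["Ontario"], "numdeaths": [3]})

# Keys are the current names, values are the new names
renamed = toy.rename(columns={"prname": "province", "numdeaths": "deaths"})
print(renamed.columns.tolist())
```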


## Subset of observations
- You may also be interested in subsetting the dataset according to the information it contains
- Say you only want to keep data from Ontario
- You can do so by using booleans

In [None]:
# Subsetting to a specific column
df['province']

In [None]:
# Can you guess what this will return?
df['province']=="Ontario"

In [None]:
# What is the type of what is returned?
type(df['province']=="Ontario")

In [None]:
# Create a new dataframe named on as a copy of df 
on = 
# print head
on.head()

## copy()
- When you subset a dataframe, the result may still be linked to the original data, so changes made to the subset can affect the original (or trigger a warning)
- In our example, changes made to the dataframe on could end up affecting df
- Hence, Pandas wants you to be explicit about the kind of copy you are making
- Using the [copy()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html) method makes the new dataframe independent, so the two are not linked
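
The sketch below combines a boolean mask with `copy()` on made-up data. The names `toy`, `mask`, and `ontario` are illustrative. Modifying the copy leaves the original unchanged.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "province": ["Ontario", "Quebec", "Ontario"],
    "numtotal": [1, 2, 3],
})

# The comparison returns a boolean Series, one value per row
mask = toy["province"] == "Ontario"

# Keep the rows where the mask is True, as an independent copy
ontario = toy[mask].copy()

# Changing the copy does not touch the original dataframe
ontario["numtotal"] = 0
print(toy)
```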


## unique()
- The unique() method can be called on a [Pandas series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) (a single column)
- It will return all unique values in a series

In [None]:
# call unique on the prname variable
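
For illustration on made-up data, `unique()` returns each distinct value once, in order of first appearance:

```python
import pandas as pd

# Hypothetical toy data with repeated province names
toy = pd.DataFrame({"prname": ["Ontario", "Quebec", "Ontario", "Quebec"]})

# unique() is called on a single column (a Series)
provinces = toy["prname"].unique()
print(provinces)
```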


## Combine the two
- You can combine the column and row selection
- Usually you would use [.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)
- The order is always rows, columns: df.loc[df.colname==value, [cols_list]]
- Say you want data only for Quebec, and only the date and the number of people tested

In [None]:
# create a Quebec (qc) dataset with 'prname','date','numtestedtoday' variables as a copy of df
qc=
qc.head()
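
A minimal sketch of the rows-then-columns pattern with `.loc`, on a toy dataframe with made-up values (`qc_toy` is an illustrative name):

```python
import pandas as pd

# Hypothetical toy data reusing the column names from the dataset
toy = pd.DataFrame({
    "prname": ["Ontario", "Quebec", "Quebec"],
    "date": ["01-03-2020", "01-03-2020", "02-03-2020"],
    "numtestedtoday": [100.0, 50.0, 60.0],
    "numtotal": [5, 7, 9],
})

# Rows first (boolean condition), columns second (list of names)
qc_toy = toy.loc[toy["prname"] == "Quebec",
                 ["prname", "date", "numtestedtoday"]].copy()
print(qc_toy)
```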

## Create new variables
- Often you will also want to create new variables
- Let's see a very simple case

In [None]:
# you can also create a constant very easily

qc.head()

In [None]:
# Let's pop this series out


In [None]:
qc.head()

## groupby()
- Another very useful tool is [groupby()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)
- Say you want to know about the total number of people tested each month in Quebec
- This is equivalent to reducing the data to one observation per month and summing the number of people tested

### Problem
- We need a variable that varies by month
- For now we only have a variable that varies by day
- You could use Pandas' powerful [datetime tools](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html)
- We will use the [split()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html) method instead. 

In [None]:
# The date column is an object (string) type
print(qc['date'])
# use the split method on the string of a pandas series
qc['date']

In [None]:
# unpack the values in 3 different columns using expand=True in split()
qc[['day','month','year']]=qc['date'].str
# show tail()
qc.tail()
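
As a sketch of `str.split` with `expand=True`, assuming dates are stored as day-month-year strings separated by hyphens. The separator and the sample dates are assumptions made for this example; check the actual format in your data.

```python
import pandas as pd

# Hypothetical dates in an assumed "dd-mm-yyyy" format
toy = pd.DataFrame({"date": ["01-03-2020", "15-04-2020"]})

# expand=True returns one column per piece instead of a list per row
parts = toy["date"].str.split("-", expand=True)

# Assign the three pieces to three new columns
toy[["day", "month", "year"]] = parts
print(toy)
```

The new columns contain strings, not numbers, which is fine for grouping.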

We can now use groupby on the dataframe

In [None]:
# groupby year and month, and take the sum() of numtestedtoday
qc.groupby()[].sum().astype(int)
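
A minimal sketch of the groupby pattern on made-up data: group on a list of columns, select the variable to aggregate, then apply `sum()`.

```python
import pandas as pd

# Hypothetical toy data: two days in March, one in April
toy = pd.DataFrame({
    "year": ["2020", "2020", "2020"],
    "month": ["03", "03", "04"],
    "numtestedtoday": [10.0, 20.0, 5.0],
})

# One row per (year, month) with the total number tested
monthly = toy.groupby(["year", "month"])["numtestedtoday"].sum().astype(int)
print(monthly)
```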

### transform()
- Say that I want to express the number of people tested each day as a percentage of the total number of people tested in the same month
- First, we should find a way to keep the value for the within month sum
- This is what calling [transform](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html) on a groupby() operation allows you to do

In [None]:
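# passing the string 'sum' gives the same result as np.sum and avoids the warning raised by recent pandas versions
qc.groupby()[].transform('sum')

In [None]:
qc['monthly_tested']=qc.groupby(['year','month'])['numtestedtoday'].transform('sum')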
qc.head()
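
To see the difference from a plain groupby sum, here is a sketch on made-up data: `transform` returns one value per original row, so the monthly total is repeated on every day of that month.

```python
import pandas as pd

# Hypothetical toy data: two days in March, one in April
toy = pd.DataFrame({
    "year": ["2020", "2020", "2020"],
    "month": ["03", "03", "04"],
    "numtestedtoday": [10.0, 20.0, 5.0],
})

# Same length as toy: each row receives its month's total
toy["monthly_tested"] = (
    toy.groupby(["year", "month"])["numtestedtoday"].transform("sum")
)
print(toy)
```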

## fillna()
- Second, we want to replace the NaN values of the tested variable with zeros
- Warning: filling missing values with zeros is very rarely a good idea
- But here we know that testing was not in place in early March, and this is why the data is missing
- We can fill the missing values with zeros using the [fillna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) method

In [None]:
# fill missing values in qc['numtestedtoday'] with zeros
qc['numtestedtoday']

In [None]:
# Replace NaN by zeros for the numtestedtoday variable
qc['numtestedtoday']=qc['numtestedtoday'].fillna(0)
qc.head()
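
A small sketch of `fillna` on made-up data containing missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values
toy = pd.DataFrame({"numtestedtoday": [np.nan, 10.0, np.nan]})

# fillna returns a new Series, so assign it back to the column
toy["numtestedtoday"] = toy["numtestedtoday"].fillna(0)
print(toy)
```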

We can now create our new variable

In [None]:
qc['share_tested']=qc['numtestedtoday']/qc['monthly_tested']*100
qc.tail()
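
The arithmetic can be checked on a toy example with made-up numbers: dividing two columns works row by row, and the shares within one month add up to 100.

```python
import pandas as pd

# Hypothetical toy data: the first two rows belong to the same month
toy = pd.DataFrame({
    "numtestedtoday": [10.0, 30.0, 5.0],
    "monthly_tested": [40.0, 40.0, 5.0],
})

# Element-wise division, then scaling to a percentage
toy["share_tested"] = toy["numtestedtoday"] / toy["monthly_tested"] * 100
print(toy)
```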