# CME538 - Introduction to Data Science

## Tutorial 2 - Pandas: A Brief Review 
By Navid Kayhani, Marc Saleh
### Goals

### Tutorial Structure
0. [Import the necessary libraries](#section0)


1. [Review of basics in Pandas](#section1)

    1.1. Anatomy of a DataFrame
    
    1.2. Define a DataFrame from scratch 
    
    1.3. DataFrame Manipulation
    
    
2. Exploring an imported dataframe

    2.1 Read in data sources (Importing CSV files)
    
    2.2 Filtering a dataframe based on conditions
    
    2.3 Using groupby()
    
    2.4 Iterating through a dataframe

<a id='section0'></a>
## Setup Notebook
At the start of a notebook, we need to import the Python packages we plan to use.
* [NumPy](https://numpy.org/) - A library for the Python programming language that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on these arrays. NumPy was introduced in Lecture 4 and we will learn more about its functionality in this lecture. It is customary to `import numpy as np`.
* [Pandas](https://pandas.pydata.org/) - pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. Lectures 5 and 6 will do a deep dive into the core functionality of Pandas. It is customary to `import pandas as pd`.
* [Seaborn](https://seaborn.pydata.org/) - Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. We will use Seaborn throughout CIV1498 for data visualization. It is customary to `import seaborn as sns`.
* [Matplotlib](https://matplotlib.org/) - Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. We will use Matplotlib throughout CIV1498 for data visualization. It is customary to `import matplotlib.pyplot as plt`.

Next, we want to configure the Jupyter Notebook.
* `%matplotlib inline` - This code configures the notebook to display all plots, from Seaborn or Matplotlib, in the Notebook as opposed to in a separate pop-up window.
* `plt.style.use('fivethirtyeight')` - This code configures the plots with the "fivethirtyeight" styling, which tries to replicate the styles from the website [FiveThirtyEight](https://fivethirtyeight.com/).
* `sns.set_context("notebook")` - This sets the plotting context parameters to be optimized for a Notebook. This affects things like the size of the labels, lines, and other elements of the plot, but not the overall style.

In [2]:
# Import 3rd party libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
import warnings
warnings.filterwarnings('ignore')

<a id='section1'></a>
## 1. Basics

### 1.1. Anatomy of a DataFrame
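The two primary components of `pandas` are the `Series` and the `DataFrame`. A `Series` is essentially a single column, and a `DataFrame` is a table made up of a collection of `Series`.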

![DFvsSeries](https://storage.googleapis.com/lds-media/images/series-and-dataframe.width-1200.png)
<center>Series and DataFrames: Number of purchases for apples and oranges</center>

https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
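
To make the relationship concrete, here is a minimal sketch (the variable names are illustrative) that builds two `Series` and combines them into a `DataFrame`. Selecting a single column of the `DataFrame` gives back a `Series`.

```python
import pandas as pd

# Two Series that share the same default integer index (illustrative values)
apples = pd.Series([3, 2, 0, 1], name='apples')
oranges = pd.Series([0, 3, 7, 2], name='oranges')

# Combine the Series column-wise into a DataFrame
fruit = pd.concat([apples, oranges], axis=1)

# Selecting a single column returns a Series again
column = fruit['apples']
print(type(fruit).__name__)   # DataFrame
print(type(column).__name__)  # Series
```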

### 1.2 Creating DataFrames from scratch and selecting values

There are many ways to create a `DataFrame` from scratch, but a great option is to use a simple Python dictionary (`dict`).

Dictionaries are used to store data values in `key:value` pairs.

A dictionary is a changeable collection that does not allow duplicates: it cannot have two items with the same `key`. Since Python 3.7, dictionaries also preserve the order in which items were inserted.

Dictionaries are written with curly brackets, and have `keys` and `values`:

##### **Create a dictionary that includes the 'apples' and 'oranges' series**

In [4]:
data = {'apples' : [3,2,0,1],
       'oranges' : [0,3,7,2]}

Print the values for the key 'apples'

In [6]:
print(data['apples'])

[3, 2, 0, 1]
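
Two of the dictionary properties mentioned earlier can be checked directly. The sketch below uses a hypothetical `basket` dictionary, separate from `data`, to show that a repeated key overwrites the earlier value and that entries can be added after creation.

```python
# Keys must be unique: repeating a key keeps only the last value
basket = {'apples': [3, 2, 0, 1], 'apples': [9, 9, 9, 9]}
print(basket)                # {'apples': [9, 9, 9, 9]}

# Dictionaries are changeable: add or update entries after creation
basket['oranges'] = [0, 3, 7, 2]
print(list(basket.keys()))   # ['apples', 'oranges']
```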


##### Use the dictionary to create a dataframe where each key is a column and its associated values fill the rows

In [7]:
purchases = pd.DataFrame(data)
print(purchases)

   apples  oranges
0       3        0
1       2        3
2       0        7
3       1        2


The `Index` of this DataFrame was given to us on creation as the numbers 0-3. We can change it by assigning an existing column as the index:

In [11]:
# make the 'oranges' column the index
orange_index_purchases = purchases.set_index('oranges')
orange_index_purchases

oranges,apples
0,3
3,2
7,0
2,1


We could also supply our own index when we create the DataFrame from the dictionary.

Let's have customer names as our index:

In [12]:
# Names 'Sarah', 'Tim', 'Lily', 'David'
purchases = pd.DataFrame(data, index =['Sarah', 'Tim', 'Lily', 'David'])
purchases

,apples,oranges
Sarah,3,0
Tim,2,3
Lily,0,7
David,1,2


##### Select values using df.loc[rows_list, columns_list]

We can locate a customer's purchases using either the labels (`.loc`) or the integer positions (`.iloc`) of the rows and columns.

In [15]:
# select all purchases of David using .loc
print(purchases.loc['David'])

# select all purchases of David using .iloc, knowing that David is at row position 3
print(purchases.iloc[3])

apples     1
oranges    2
Name: David, dtype: int64
apples     1
oranges    2
Name: David, dtype: int64


In [17]:
# select the number of oranges David purchased using .loc
oranges_david = purchases.loc['David','oranges']
print(oranges_david)

# select the same value using .iloc: David is at row position 3 and 'oranges' is at column position 1
oranges_david = purchases.iloc[3,1]
print(oranges_david)

2
2
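
Both indexers also accept slices and lists. The sketch below rebuilds the small purchases table under a hypothetical name, `purchases_demo`, and illustrates one difference worth remembering: a label slice with `.loc` includes its end point, while a position slice with `.iloc` excludes it.

```python
import pandas as pd

# Rebuild the small example table (purchases_demo is an illustrative name)
purchases_demo = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
    index=['Sarah', 'Tim', 'Lily', 'David'],
)

# .loc selects by label; a label slice includes both end points
by_label = purchases_demo.loc['Tim':'Lily', ['apples']]

# .iloc selects by integer position; the slice end is excluded
by_position = purchases_demo.iloc[1:3, [0]]

print(by_label.equals(by_position))  # True
```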


In [18]:
purchases

,apples,oranges
Sarah,3,0
Tim,2,3
Lily,0,7
David,1,2


Let's get back to our numbered indices and have the names as a new column:

In [19]:
# reset index
purchases = purchases.reset_index()
purchases

,index,apples,oranges
0,Sarah,3,0
1,Tim,2,3
2,Lily,0,7
3,David,1,2


Let's rename the generated column to `name`

In [21]:
# A common mistake is forgetting to assign the result: rename() returns a new DataFrame
purchases.rename(columns = {'index' : 'name'})

# purchases itself is unchanged: the column is still called 'index'
print(purchases)

   index  apples  oranges
0  Sarah       3        0
1    Tim       2        3
2   Lily       0        7
3  David       1        2


Rename a column and keep the change:

In [22]:
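# OPTION 1: modify the DataFrame in place with inplace = True
purchases.rename(columns = {'index' : 'name'}, inplace = True)

# OPTION 2: assign the returned DataFrame back to purchases
# (this has no further effect here, because OPTION 1 already renamed the column)
purchases = purchases.rename(columns = {'index' : 'name'})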

### 1.3. DataFrame Manipulation

##### Add column

Maybe we have other types of fruits in our store (bananas):

In [24]:
# add the column bananas: [0, 1 , 3 , 3]
purchases['bananas'] = [0, 1 , 3 , 3]

##### Add row

Maybe we have other customers:

In [27]:
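# Insert a new row
# Wrap the row (a dict of column: value pairs) in a one-row DataFrame and concatenate it
# (DataFrame.append() was removed in pandas 2.0, so pd.concat() is used instead)

new_row = {'name': 'Dan', 'apples': 2, 'oranges':2, 'bananas':0}
purchases = pd.concat([purchases, pd.DataFrame([new_row])], ignore_index = True)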

In [28]:
purchases

,name,apples,oranges,bananas
0,Sarah,3,0,0
1,Tim,2,3,1
2,Lily,0,7,3
3,David,1,2,3
4,Dan,2,2,0


**Q**: For each customer, what is the largest number of items purchased in any one category?

In [29]:
# To find the maximum in each row, compare the values across the columns (axis=1)
# numeric_only = True skips the 'name' column, which holds strings
purchases.max(axis = 1, numeric_only = True)

0    3
1    3
2    7
3    3
4    2
dtype: int64

**Q**: For each category, what is the largest number of items purchased by any one customer?

In [30]:
# I want to find the maximum in each column --> I have to check data in each row (axis=0)
purchases.max(axis = 0)

name       Tim
apples       3
oranges      7
bananas      3
dtype: object

In [31]:
#check the df shape. Do axis number 0 and 1 make sense now?
purchases.shape

(5, 4)

Transpose the dataframe

In [32]:
#transpose the df
purchases_t = purchases.transpose()
purchases_t

,0,1,2,3,4
name,Sarah,Tim,Lily,David,Dan
apples,3,2,0,1,2
oranges,0,3,7,2,2
bananas,0,1,3,3,0


In [33]:
#check the df shape. Do axis number 0 and 1 make sense now?
purchases_t.shape

(4, 5)
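
One way to remember the `axis` argument: the axis you pass is the one that gets collapsed. The sketch below uses a small, made-up numeric table called `counts` and `sum()` instead of `max()`, but the rule is the same for any aggregation.

```python
import pandas as pd

# Small illustrative table: 3 rows (axis 0) by 2 columns (axis 1)
counts = pd.DataFrame({'apples': [3, 2, 0], 'oranges': [0, 3, 7]},
                      index=['Sarah', 'Tim', 'Lily'])

# axis=0 collapses the rows: one result per column
per_column = counts.sum(axis=0)

# axis=1 collapses the columns: one result per row
per_row = counts.sum(axis=1)

print(per_column.tolist())  # [5, 10]
print(per_row.tolist())     # [3, 5, 7]
```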

## 2.0 Exploring an imported dataframe

### 2.1. Read in data sources (Importing CSV files)
* `pd.read_csv()` - Import a **comma-separated values (.csv)** file.

In [35]:
# import dataframe
df_names = pd.read_csv('US_baby_names_2013-14.csv')

print(df_names.head())

      Id      Name  Year Gender State  Count
0  13298      Emma  2013      F    AK     57
1  13299    Sophia  2013      F    AK     50
2  13300   Abigail  2013      F    AK     39
3  13301  Isabella  2013      F    AK     38
4  13302    Olivia  2013      F    AK     35


In [37]:
# explore general info on column types, use .info()
df_names.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186891 entries, 0 to 186890
Data columns (total 6 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   Id      186891 non-null  int64 
 1   Name    186891 non-null  object
 2   Year    186891 non-null  int64 
 3   Gender  186891 non-null  object
 4   State   186891 non-null  object
 5   Count   186891 non-null  int64 
dtypes: int64(3), object(3)
memory usage: 8.6+ MB


In [38]:
# explore statistical data of numerical columns in dataframe, use .describe()
df_names.describe()

,Id,Year,Count
count,186891.0,186891.0,186891.0
mean,2852127.0,2013.503759,33.067692
std,1652032.0,0.499987,87.988787
min,13298.0,2013.0,5.0
25%,1325804.0,2013.0,7.0
50%,2816340.0,2014.0,11.0
75%,4347148.0,2014.0,26.0
max,5647426.0,2014.0,3451.0


List the US states in the df

In [40]:
df_names['State'].unique()

array(['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
       'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
       'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
       'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
       'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY'], dtype=object)

### 2.2 Filtering dataframe based on conditions

##### Find the 10 most popular male baby names in CA in 2013.


In [44]:
# Let's first filter the dataframe to only keep male baby names from California in 2013
df_M_Cali_2013 = df_names[(df_names['Year'] == 2013) &
                         (df_names['State'] == 'CA') &
                         (df_names['Gender'] == 'M')]

# Let's now sort this dataframe by the 'Count' column in descending order; the 10 most popular names are the first 10 rows (use .head(10) to show only those)
sorted_df_M_Cali_2013 = df_M_Cali_2013.sort_values(by ='Count', ascending = False)

sorted_df_M_Cali_2013

,Id,Name,Year,Gender,State,Count
19288,704421,Jacob,2013,M,CA,2879
19289,704422,Ethan,2013,M,CA,2659
19290,704423,Daniel,2013,M,CA,2590
19291,704424,Jayden,2013,M,CA,2580
19292,704425,Matthew,2013,M,CA,2553
...,...,...,...,...,...,...
21926,707059,Eliu,2013,M,CA,5
21925,707058,Elisandro,2013,M,CA,5
21924,707057,Elih,2013,M,CA,5
21923,707056,Ej,2013,M,CA,5


In [46]:
# The previous cell could be written as a single chained expression
sorted_df_M_Cali_2013 = df_names[(df_names['Year'] == 2013) &
                         (df_names['State'] == 'CA') &
                         (df_names['Gender'] == 'M')].sort_values(by ='Count', ascending = False)

sorted_df_M_Cali_2013

,Id,Name,Year,Gender,State,Count
19288,704421,Jacob,2013,M,CA,2879
19289,704422,Ethan,2013,M,CA,2659
19290,704423,Daniel,2013,M,CA,2590
19291,704424,Jayden,2013,M,CA,2580
19292,704425,Matthew,2013,M,CA,2553
...,...,...,...,...,...,...
21926,707059,Eliu,2013,M,CA,5
21925,707058,Elisandro,2013,M,CA,5
21924,707057,Elih,2013,M,CA,5
21923,707056,Ej,2013,M,CA,5
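
The filtering pattern can be tried without the CSV file. The sketch below uses a made-up table, `toy`, with the same column names. It builds the boolean mask as a separate step and then uses `nlargest()`, which is one alternative to sorting and taking the first rows.

```python
import pandas as pd

# Toy data with the same column names as the baby-names file (values are made up)
toy = pd.DataFrame({
    'Name':   ['Ava', 'Liam', 'Noah', 'Liam', 'Ava'],
    'Year':   [2013, 2013, 2013, 2014, 2013],
    'Gender': ['F', 'M', 'M', 'M', 'F'],
    'State':  ['CA', 'CA', 'CA', 'CA', 'NY'],
    'Count':  [30, 50, 40, 60, 20],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than ==.
mask = (toy['Year'] == 2013) & (toy['State'] == 'CA') & (toy['Gender'] == 'M')

# Keep the matching rows, then take the 2 rows with the largest counts
top = toy[mask].nlargest(2, 'Count')
print(top['Name'].tolist())  # ['Liam', 'Noah']
```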


### 2.3 Using groupby() to aggregate data

##### How many male and female babies were born in 2014?

In [50]:
# Use .groupby('column to group by').aggregation_method() to group by year and gender
# print(df_names.groupby(['Year', 'Gender']).sum(numeric_only=True))

print('\n')

# We are only interested in the 'Count' column, so select it before aggregating
# print(df_names.groupby(['Year', 'Gender'])['Count'].sum())

print('\n')

# We are also interested in the year 2014 specifically (.loc on the first index level)
print(df_names.groupby(['Year', 'Gender'])['Count'].sum().loc[2014])





Gender
F    1446259
M    1667352
Name: Count, dtype: int64
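
The split, aggregate, and select steps are easier to follow on a table small enough to check by hand. This is a minimal sketch on made-up data (`toy` is a hypothetical name), not on the baby-names file.

```python
import pandas as pd

# Made-up data with the columns used in the grouping above
toy = pd.DataFrame({
    'Year':   [2013, 2013, 2014, 2014, 2014],
    'Gender': ['F', 'M', 'F', 'M', 'M'],
    'Count':  [10, 20, 30, 40, 5],
})

# Split the rows by (Year, Gender), select 'Count', and sum within each group
totals = toy.groupby(['Year', 'Gender'])['Count'].sum()

# The result has a two-level index; .loc[2014] keeps the Gender level for that year
totals_2014 = totals.loc[2014]
print(totals_2014.index.tolist())  # ['F', 'M']
print(totals_2014.tolist())        # [30, 45]
```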


##### Combine grouping with condition-based filtering to answer the following question

##### What is the most popular name that has the letter 'z' in it?

In [51]:
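# sum(numeric_only = True) totals only the numeric columns (Id, Year, Count) for each name
df_name_Z = df_names[(df_names['Name'].str.contains('z')) | (df_names['Name'].str.contains('Z'))].groupby('Name').sum(numeric_only = True).sort_values(by= 'Count', ascending = False)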

In [53]:
df_name_Z.head(10)

Name,Id,Year,Count
Elizabeth,293243593,205377,18905
Zoey,293244668,205377,14568
Zoe,293957347,207391,11791
Zachary,298261184,205377,10863
Mackenzie,290191136,209405,8144
Ezra,338973328,239605,6267
Hazel,286666912,201351,4922
Ezekiel,261234922,185242,4498
Mckenzie,266858791,191283,4488
Zayden,264907961,187255,4281


In [54]:
# reset index to move 'Name' back to a regular column
df_name_Z.reset_index(inplace= True)

# print the result (pandas shows only the first and last rows of a long dataframe)
print(df_name_Z)

          Name         Id    Year  Count
0    Elizabeth  293243593  205377  18905
1         Zoey  293244668  205377  14568
2          Zoe  293957347  207391  11791
3      Zachary  298261184  205377  10863
4    Mackenzie  290191136  209405   8144
..         ...        ...     ...    ...
761  Kynzleigh    4937651    2014      5
762       Elza     565090    2014      5
763    Kynzlie    4937652    2014      5
764    Larenzo    2872417    2013      5
765      Tzivy    3749652    2014      5

[766 rows x 4 columns]
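
`str.contains` is case-sensitive by default, which is why the filter in this section tests for both 'z' and 'Z'. As a sketch on a made-up `Series` called `names`, the `case=False` option performs the same match in a single call.

```python
import pandas as pd

# Made-up names, some with an upper-case 'Z' and some with a lower-case 'z'
names = pd.Series(['Zoe', 'Ezra', 'Emma', 'Liz'])

# case=False makes the match ignore upper/lower case
has_z = names.str.contains('z', case=False)
print(names[has_z].tolist())  # ['Zoe', 'Ezra', 'Liz']
```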


### 2.4 Iterate through a dataframe

##### Iterate through the dataframe and add a new 'HighFemaleBirth' column.

Each row (one baby name in one state and year) gets a 'HighFemaleBirth' value of 'Yes' if its count is at least 500 and the gender is female, and 'No' otherwise.

In [55]:
# select a smaller portion of the data (first 30,000 rows)
# .copy() makes an independent dataframe, so adding a column later does not affect df_names
data = df_names.iloc[:30000].copy()

In [60]:
data.shape

(30000, 6)

In [61]:
%%time
# OPTION 1: Simple for loop over range
# (.loc works here because the row labels are the default integers 0, 1, 2, ...)

# initialize empty list
HighFemaleBirth = []

for row_index in range(data.shape[0]):
    Count = data.loc[row_index , 'Count']
    Gender = data.loc[row_index, 'Gender']
    
    if (Count >=500) and (Gender == 'F'):
        HighFemaleBirth.append('Yes')
    else:
        HighFemaleBirth.append('No')

data['HighFemaleBirth'] = HighFemaleBirth


Wall time: 863 ms


In [63]:
%%time
# OPTION 2: Simple for loop using .iterrows()

# initialize empty list
HighFemaleBirth = []

for index, row in data.iterrows():
    Count = row['Count']
    Gender = row['Gender']
    
    if (Count >=500) and (Gender == 'F'):
        HighFemaleBirth.append('Yes')
    else:
        HighFemaleBirth.append('No')
        
data['HighFemaleBirth'] = HighFemaleBirth

Wall time: 4.4 s


In [65]:
%%time
# OPTION 3: Using pandas .apply
def add_highfemalebirth(Count , Gender):
    if (Count >=500) and (Gender == 'F'):
        return 'Yes'
    else:
        return 'No'
    
data['HighFemaleBirth'] = data.apply(lambda row : add_highfemalebirth(row['Count'], row['Gender']) , axis =1)

Wall time: 545 ms
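
A fourth option, which is not part of the timing comparison above, avoids row-by-row Python code entirely. The sketch below assumes NumPy's `np.where` is acceptable for this task and uses a made-up table, `toy`. Vectorized operations like this are usually faster than loops or `.apply`, although that has not been timed here.

```python
import numpy as np
import pandas as pd

# Made-up rows covering each case of the rule
toy = pd.DataFrame({
    'Gender': ['F', 'M', 'F', 'F'],
    'Count':  [600, 900, 499, 500],
})

# Build the condition for the whole column at once
is_high_female = (toy['Count'] >= 500) & (toy['Gender'] == 'F')

# np.where picks 'Yes' where the condition holds and 'No' elsewhere
toy['HighFemaleBirth'] = np.where(is_high_female, 'Yes', 'No')
print(toy['HighFemaleBirth'].tolist())  # ['Yes', 'No', 'No', 'Yes']
```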


##### Check that the new column is correct

In [67]:
# print the count of each value in the 'HighFemaleBirth' column
data['HighFemaleBirth'].value_counts()

No     29852
Yes      148
Name: HighFemaleBirth, dtype: int64

In [68]:
# Look at the rows that have a 'Yes' for 'HighFemaleBirth'
test = data[data['HighFemaleBirth'] == 'Yes']
test

,Id,Name,Year,Gender,State,Count,HighFemaleBirth
6837,307228,Sophia,2013,F,AZ,604,Yes
11353,557545,Sophia,2013,F,CA,3451,Yes
11354,557546,Isabella,2013,F,CA,2783,Yes
11355,557547,Mia,2013,F,CA,2592,Yes
11356,557548,Emma,2013,F,CA,2478,Yes
...,...,...,...,...,...,...,...
15374,561566,Valerie,2014,F,CA,534,Yes
15375,561567,Ruby,2014,F,CA,531,Yes
15376,561568,Claire,2014,F,CA,520,Yes
15377,561569,Ariel,2014,F,CA,507,Yes
