# CME538 - Introduction to Data Science

## Tutorial 2 - Pandas: A Brief Review 
By Navid Kayhani, Marc Saleh
### Goals

### Tutorial Structure
0. [Import the necessary libraries](#section0)


1. [Review of basics in Pandas](#section1)

    1.1. Anatomy of a DataFrame
    
    1.2. Define a DataFrame from scratch 
    
    1.3. DataFrame Manipulation
    
    
2. Exploring an imported dataframe

    2.1 Read in data sources (Importing CSV files)
    
    2.2 Filtering a dataframe based on conditions
    
    2.3 Using groupby()
    
    2.4 Iterating through a dataframe

<a id='section0'></a>
## Setup Notebook
At the start of a notebook, we need to import the Python packages we plan to use.
* [NumPy](https://numpy.org/) - A library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. NumPy was introduced in Lecture 4 and we will learn more about its functionality in this lecture. It is customary to `import numpy as np`.
* [Pandas](https://pandas.pydata.org/) - pandas is a fast, powerful, flexible, and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language. Lectures 5 and 6 will do a deep dive into the core functionality of Pandas. It is customary to `import pandas as pd`.
* [Seaborn](https://seaborn.pydata.org/) - Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. We will use Seaborn throughout CME538 for data visualization. It is customary to `import seaborn as sns`.
* [Matplotlib](https://matplotlib.org//) - Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. We will use Matplotlib throughout CME538 for data visualization. It is customary to `import matplotlib.pyplot as plt`.

Next, we want to configure the Jupyter Notebook.
* `%matplotlib inline` - This code configures the notebook to display all plots, from Seaborn or Matplotlib, in the Notebook as opposed to in a separate pop-up window.
* `plt.style.use('fivethirtyeight')` - This code configures the plots with the "fivethirtyeight" styling, which tries to replicate the styles from the website [FiveThirtyEight](https://fivethirtyeight.com/).
* `sns.set_context("notebook")` - This sets the plotting context parameters to be optimized for a Notebook. This affects things like the size of the labels, lines, and other elements of the plot, but not the overall style.

In [None]:
# Import 3rd party libraries


In [None]:
import warnings
warnings.filterwarnings('ignore')

<a id='section1'></a>
## 1. Basics

### 1.1. Anatomy of a DataFrame
The two primary components of `pandas` are the `Series` and the `DataFrame`.

![DFvsSeries](https://storage.googleapis.com/lds-media/images/series-and-dataframe.width-1200.png)
<center>Series and DataFrames: Number of purchases for apples and oranges</center>

https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/

### 1.2 Creating DataFrames from scratch and selecting values

There are many ways to create a `DataFrame` from scratch, but a great option is to use a simple Python dictionary (`dict`).

Dictionaries are used to store data values in `key:value` pairs.

A dictionary is a collection that is changeable, preserves insertion order (as of Python 3.7), and does not allow duplicate keys (it cannot have two items with the same `key`).

Dictionaries are written with curly brackets, and have `keys` and `values`:

##### **Create a dictionary that includes the 'apples' and 'oranges' series**

Print the values for the key 'apples'
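
As a minimal sketch of the dictionary described above: the purchase counts below are illustrative values chosen for this example, not numbers stated in the tutorial.

```python
# Illustrative purchase counts (assumed values for this sketch)
data = {
    "apples": [3, 2, 0, 1],
    "oranges": [0, 3, 7, 2],
}

# Look up the list stored under the key 'apples'
apples = data["apples"]
print(apples)
```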

##### Use the dictionary to create a DataFrame where each key is a column and its associated values are represented in the rows
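
One possible way to do this is to pass the dictionary straight to `pd.DataFrame`. The counts are assumed, illustrative values; the name `purchases` follows the variable used later in this tutorial.

```python
import pandas as pd

# Illustrative purchase counts (assumed values for this sketch)
data = {
    "apples": [3, 2, 0, 1],
    "oranges": [0, 3, 7, 2],
}

# Each dictionary key becomes a column; each list becomes that column's rows
purchases = pd.DataFrame(data)
print(purchases)
```

With no index supplied, pandas assigns the default integer index 0 to 3.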

The `Index` of this DataFrame was given to us on creation as the numbers 0-3. We could change it by assigning an existing column as the index.

In [None]:
# make the 'oranges' column the index
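
A sketch of this step using `set_index`, with assumed illustrative data. The result is stored under a hypothetical name, `by_oranges`, so the original frame stays unchanged.

```python
import pandas as pd

# Illustrative purchase counts (assumed values for this sketch)
purchases = pd.DataFrame({
    "apples": [3, 2, 0, 1],
    "oranges": [0, 3, 7, 2],
})

# set_index returns a new DataFrame; 'oranges' moves from the columns to the index
by_oranges = purchases.set_index("oranges")
print(by_oranges)
```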


We could also create our own index when we initialize the DataFrame with the dictionary.

Let's have customer names as our index:

In [None]:
# Names 'Sarah', 'Tim', 'Lily', 'David'
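
A minimal sketch, assuming the same illustrative counts as before: the `index` argument of `pd.DataFrame` accepts the list of customer names.

```python
import pandas as pd

# Illustrative purchase counts (assumed values for this sketch)
data = {
    "apples": [3, 2, 0, 1],
    "oranges": [0, 3, 7, 2],
}

# Supply the customer names as row labels at creation time
purchases = pd.DataFrame(data, index=["Sarah", "Tim", "Lily", "David"])
print(purchases)
```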


##### Select values using df.loc[rows_list, columns_list]

We can locate a customer's purchases using either the labels (`.loc`) or the zero-based integer positions (`.iloc`) of rows and columns.

In [None]:
# select all purchases of David using .loc


# select all purchases of David using .iloc knowing that David represents row 3


In [None]:
# select the number of oranges David purchased using .loc


# select the number of oranges David purchased using .iloc knowing that David represents row 3 and the oranges column represents column 1
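
The selections above could look like the sketch below. The data values are assumed for illustration; the variable names (`david_loc`, `oranges_iloc`, and so on) are hypothetical.

```python
import pandas as pd

# Illustrative purchase counts (assumed values for this sketch)
purchases = pd.DataFrame(
    {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]},
    index=["Sarah", "Tim", "Lily", "David"],
)

# Whole row for David: by label, then by zero-based position
david_loc = purchases.loc["David"]
david_iloc = purchases.iloc[3]

# Single value: David's oranges, by labels and by positions (row 3, column 1)
oranges_loc = purchases.loc["David", "oranges"]
oranges_iloc = purchases.iloc[3, 1]

print(david_loc)
print(oranges_loc, oranges_iloc)
```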


Let's get back to our numbered indices and have the names as a new column:

In [None]:
# reset index
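
A sketch of `reset_index`, using assumed illustrative data. Because the index has no name here, pandas calls the generated column `index`.

```python
import pandas as pd

# Illustrative purchase counts (assumed values for this sketch)
purchases = pd.DataFrame(
    {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]},
    index=["Sarah", "Tim", "Lily", "David"],
)

# Move the names out of the index into a regular column and restore 0..n-1 labels
purchases = purchases.reset_index()
print(purchases)
```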


Let's rename the generated column to `name`

In [None]:
# A common mistake is forgetting to overwrite a DataFrame when making changes


Rename a column with overwrite

In [None]:
# OPTION 1 with inplace=True


# OPTION 2 with purchases = purchases.rename
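
The sketch below contrasts the common mistake with both options. The data are assumed illustrative values, and `option_1` is a hypothetical name used only to show the two options side by side.

```python
import pandas as pd

# Illustrative data (assumed values); reset_index creates the column named 'index'
purchases = pd.DataFrame(
    {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]},
    index=["Sarah", "Tim", "Lily", "David"],
).reset_index()

# Common mistake: rename() returns a new DataFrame, so the result is discarded here
purchases.rename(columns={"index": "name"})
print(purchases.columns.tolist())  # the column is still called 'index'

# OPTION 1: modify a DataFrame in place
option_1 = purchases.copy()
option_1.rename(columns={"index": "name"}, inplace=True)

# OPTION 2: overwrite the variable with the returned DataFrame
purchases = purchases.rename(columns={"index": "name"})
print(purchases.columns.tolist())
```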


### 1.3. DataFrame Manipulation

##### Add column

Maybe we have other types of fruits in our store (bananas):

In [None]:
# add the column bananas: [0, 1 , 3 , 3]
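
Assigning a list to a new column label is one way to add the column. The apples and oranges counts are assumed illustrative values; the bananas values come from the comment above.

```python
import pandas as pd

# Illustrative data (assumed values for apples and oranges)
purchases = pd.DataFrame({
    "name": ["Sarah", "Tim", "Lily", "David"],
    "apples": [3, 2, 0, 1],
    "oranges": [0, 3, 7, 2],
})

# The list length must match the number of rows
purchases["bananas"] = [0, 1, 3, 3]
print(purchases)
```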


##### Add row

Maybe we have other customers:

In [None]:
# Insert a new row
# Pass the row elements as key-value pairs; DataFrame.append() was removed in pandas 2.0, so use pd.concat() instead
# new_row = {'name': 'Dan', 'apples': 2, 'oranges':2, 'bananas':0}
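
A sketch of adding the row with `pd.concat`, which works on current pandas releases where `DataFrame.append()` is no longer available. The existing rows are assumed illustrative values; `new_row` is taken from the comment above.

```python
import pandas as pd

# Illustrative data (assumed values for the existing customers)
purchases = pd.DataFrame({
    "name": ["Sarah", "Tim", "Lily", "David"],
    "apples": [3, 2, 0, 1],
    "oranges": [0, 3, 7, 2],
    "bananas": [0, 1, 3, 3],
})

new_row = {"name": "Dan", "apples": 2, "oranges": 2, "bananas": 0}

# Wrap the dictionary in a one-row DataFrame, then stack it under the original
purchases = pd.concat([purchases, pd.DataFrame([new_row])], ignore_index=True)
print(purchases)
```

`ignore_index=True` renumbers the rows so the new customer gets the next integer label.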


**Q**: For each customer, what is the largest number of items purchased in a single category?

In [None]:
# I want to find the maximum in each row --> I have to check data in each column (axis=1)


**Q**: For each category, what is the largest number of items purchased by a single customer?

In [None]:
# I want to find the maximum in each column --> I have to check data in each row (axis=0)
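
Both questions can be sketched with `max` and the `axis` argument. The data are assumed illustrative values, and the names are placed in the index so only numeric columns are compared.

```python
import pandas as pd

# Illustrative data (assumed values), with customer names as the index
purchases = pd.DataFrame({
    "name": ["Sarah", "Tim", "Lily", "David", "Dan"],
    "apples": [3, 2, 0, 1, 2],
    "oranges": [0, 3, 7, 2, 2],
    "bananas": [0, 1, 3, 3, 0],
}).set_index("name")

# axis=1: collapse across the columns, giving one value per row (customer)
max_per_customer = purchases.max(axis=1)

# axis=0: collapse down the rows, giving one value per column (category)
max_per_category = purchases.max(axis=0)

print(max_per_customer)
print(max_per_category)
```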


In [None]:
#check the df shape. Do axis number 0 and 1 make sense now?


Transpose the dataframe

In [None]:
#transpose the df


In [None]:
#check the df shape. Do axis number 0 and 1 make sense now?
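
A short sketch of checking the shape before and after transposing, on assumed illustrative data. `shape` is reported as (rows, columns), which corresponds to (axis 0, axis 1).

```python
import pandas as pd

# Illustrative data (assumed values), with customer names as the index
purchases = pd.DataFrame({
    "name": ["Sarah", "Tim", "Lily", "David", "Dan"],
    "apples": [3, 2, 0, 1, 2],
    "oranges": [0, 3, 7, 2, 2],
    "bananas": [0, 1, 3, 3, 0],
}).set_index("name")

print(purchases.shape)   # length of axis 0 (rows), then axis 1 (columns)

# .T swaps rows and columns, so customers become columns
transposed = purchases.T
print(transposed.shape)
```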


## 2. Exploring an imported dataframe

### 2.1. Read in data sources (Importing CSV files)
* `pd.read_csv()` - Import a **comma-separated values (.csv)** file.

In [None]:
# import dataframe


In [None]:
# explore general info on column types, use .info()


In [None]:
# explore statistical data of numerical columns in dataframe, use .describe()


List the US states in the df
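
The steps in this subsection can be sketched as follows. The tutorial's CSV file name is not given here, so the sketch reads a small in-memory CSV instead. The column names `State`, `Sex`, and `Year`, the variable name `baby_names`, and all row values are assumptions made for illustration; only `Name` and `Count` are mentioned in this tutorial.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file; with a file on disk, pass its path instead
csv_text = """State,Sex,Year,Name,Count
CA,M,2013,Jacob,550
CA,M,2013,Noah,600
CA,F,2013,Sophia,700
NY,F,2014,Emma,520
NY,M,2014,Liam,480
TX,F,2014,Zoe,300
"""

baby_names = pd.read_csv(io.StringIO(csv_text))

# Column names, dtypes, and non-null counts
baby_names.info()

# Summary statistics for the numerical columns
print(baby_names.describe())

# Distinct states present in the data
states = baby_names["State"].unique()
print(states)
```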

### 2.2 Filtering dataframe based on conditions

##### Find the 10 most popular male baby names in CA in 2013.


In [None]:
# Let's first filter the dataframe to only keep male baby names from California in 2013


# Let's now sort this dataframe by the 'Count' column in descending order and only print the first 10 rows




In [None]:
# The previous cell could be completed in a single line of code
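
A sketch of the two-step and single-expression versions, using a tiny made-up table. The column names `State`, `Sex`, and `Year` and all values are assumptions; the real dataset may label them differently.

```python
import pandas as pd

# Hypothetical sample rows (assumed column names and values)
baby_names = pd.DataFrame({
    "State": ["CA", "CA", "CA", "NY", "NY", "TX"],
    "Sex": ["M", "M", "F", "F", "M", "F"],
    "Year": [2013, 2013, 2013, 2014, 2014, 2014],
    "Name": ["Jacob", "Noah", "Sophia", "Emma", "Liam", "Zoe"],
    "Count": [550, 600, 700, 520, 480, 300],
})

# Step 1: combine boolean conditions with & (each condition in parentheses)
ca_boys_2013 = baby_names[
    (baby_names["State"] == "CA")
    & (baby_names["Sex"] == "M")
    & (baby_names["Year"] == 2013)
]

# Step 2: sort by 'Count' in descending order and keep the first 10 rows
top_10 = ca_boys_2013.sort_values("Count", ascending=False).head(10)
print(top_10)

# The same result as a single chained expression
top_10_chained = baby_names[
    (baby_names["State"] == "CA") & (baby_names["Sex"] == "M") & (baby_names["Year"] == 2013)
].sort_values("Count", ascending=False).head(10)
```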


### 2.3 Using groupby() to aggregate data

##### How many male and female babies were born in 2014?

In [None]:
# use .groupby('Column to group by').aggregation_method() to group by year and gender


print('\n')

# We are interested in the count column


print('\n')

# we are also interested in year 2014 specifically (.loc)
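
The three steps above might look like this sketch. The column names `Year` and `Sex` and all values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample rows (assumed column names and values)
baby_names = pd.DataFrame({
    "State": ["CA", "CA", "CA", "NY", "NY", "TX"],
    "Sex": ["M", "M", "F", "F", "M", "F"],
    "Year": [2013, 2013, 2013, 2014, 2014, 2014],
    "Name": ["Jacob", "Noah", "Sophia", "Emma", "Liam", "Zoe"],
    "Count": [550, 600, 700, 520, 480, 300],
})

# Group by year and sex, then sum the numeric columns within each group
grouped = baby_names.groupby(["Year", "Sex"]).sum(numeric_only=True)
print(grouped)

# Keep only the 'Count' column
counts = baby_names.groupby(["Year", "Sex"])["Count"].sum()
print(counts)

# Select the first index level with .loc to get 2014 only
births_2014 = counts.loc[2014]
print(births_2014)
```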


##### Combining grouping with condition-based filtering, answer the following question:

##### What is the most popular name that has the letter 'z' in it?

In [None]:
# reset index to move 'Name' as column

# print list of first 10 names
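
One possible approach, sketched on made-up data: total the counts per name, then filter the names with `str.contains`. The match here ignores case (so "Zoe" qualifies), which is an assumption about what the question intends; all values are illustrative.

```python
import pandas as pd

# Hypothetical sample rows (assumed values)
baby_names = pd.DataFrame({
    "Name": ["Zoe", "Elizabeth", "Zoe", "Emma", "Liam"],
    "Count": [300, 400, 150, 520, 480],
})

# Total count per name; reset_index moves 'Name' from the index back to a column
name_totals = baby_names.groupby("Name")["Count"].sum().reset_index()

# Keep names containing the letter z, ignoring case (an assumption of this sketch)
has_z = name_totals["Name"].str.contains("z", case=False)
z_names = name_totals[has_z].sort_values("Count", ascending=False)

# First 10 names, most popular first
print(z_names.head(10))
most_popular_z = z_names.iloc[0]["Name"]
```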


### 2.4 Iterate through a dataframe

##### Iterate through the dataframe and add a new 'HighFemaleBirth' column.

For each baby name (row), 'HighFemaleBirth' is set to 'Yes' if the number of births with this name is above 500 and the sex is female. Otherwise it is set to 'No'.

In [None]:
# select a smaller portion of the data (first 30,000 rows)


In [None]:
%%time
# OPTION 1: Simple for loop over range

# initialize empty list


In [None]:
%%time
# OPTION 2: Simple for loop using .iterrows()

# initialize empty list


In [None]:
%%time
# OPTION 3: Using pandas .apply
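
The three options can be sketched side by side as below. The column name `Sex`, the helper `high_female_birth`, and the sample values are assumptions for illustration. In the notebook, each option would sit in its own cell under `%%time` so the run times can be compared.

```python
import pandas as pd

# Hypothetical sample rows (assumed column names and values)
baby_names = pd.DataFrame({
    "Sex": ["M", "M", "F", "F", "M", "F"],
    "Name": ["Jacob", "Noah", "Sophia", "Emma", "Liam", "Zoe"],
    "Count": [550, 600, 700, 520, 480, 300],
})

# Work on a smaller portion of the data (here the sample has fewer than 30,000 rows)
subset = baby_names.head(30000).copy()

# OPTION 1: simple for loop over range, reading each row by position
flags_loop = []
for i in range(len(subset)):
    row = subset.iloc[i]
    if row["Count"] > 500 and row["Sex"] == "F":
        flags_loop.append("Yes")
    else:
        flags_loop.append("No")

# OPTION 2: for loop using .iterrows(), which yields (index, row) pairs
flags_iterrows = []
for _, row in subset.iterrows():
    if row["Count"] > 500 and row["Sex"] == "F":
        flags_iterrows.append("Yes")
    else:
        flags_iterrows.append("No")

# OPTION 3: .apply with axis=1 calls the function once per row
def high_female_birth(row):
    """Return 'Yes' for female rows with more than 500 births, else 'No'."""
    return "Yes" if row["Count"] > 500 and row["Sex"] == "F" else "No"

flags_apply = subset.apply(high_female_birth, axis=1)

# All three options produce the same values; store one as the new column
subset["HighFemaleBirth"] = flags_apply
print(subset)
```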


##### Check that the new column is working

In [None]:
# print the count of each value in the 'HighFemaleBirth' column


In [None]:
# Look at the rows that have a 'Yes' for 'HighFemaleBirth'
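
A sketch of both checks on made-up data: `value_counts` tallies each label, and a boolean filter shows the flagged rows. The column name `Sex` and all values are assumptions.

```python
import pandas as pd

# Hypothetical sample rows (assumed column names and values)
baby_names = pd.DataFrame({
    "Sex": ["M", "M", "F", "F", "M", "F"],
    "Name": ["Jacob", "Noah", "Sophia", "Emma", "Liam", "Zoe"],
    "Count": [550, 600, 700, 520, 480, 300],
})

# Build the column as defined above: 'Yes' for female rows with more than 500 births
baby_names["HighFemaleBirth"] = baby_names.apply(
    lambda row: "Yes" if row["Count"] > 500 and row["Sex"] == "F" else "No",
    axis=1,
)

# Count of each value in the new column
flag_counts = baby_names["HighFemaleBirth"].value_counts()
print(flag_counts)

# Rows flagged 'Yes' should all be female with Count above 500
high_rows = baby_names[baby_names["HighFemaleBirth"] == "Yes"]
print(high_rows)
```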
