# CME538 - Introduction to Data Science
## Lecture 2.1 - Pandas I
### Goals
Introduce Pandas, with emphasis on:
* Key Data Structures (data frames, series, indices).
* How to index into these structures.
* How to read files to create these structures.
* Other basic operations on these structures.
* We will go through quite a lot of the language without full explanations.
* We expect you to fill in the gaps on homeworks, labs, projects, and through your own experimentation.
* Solve some very basic data science problems using Jupyter/pandas.

### Lecture Structure
1. [What is Pandas and what are DataFrames?](#section1)
2. [Importing Data Sources](#section2)
3. [Anatomy of a DataFrame](#section3)
4. [Getting a quick look at your DataFrame](#section4)
5. [Utility Operations](#section5)
6. [Indexing](#section6)
7. [Accessing Rows and Columns (Slicing)](#section7)
8. [Boolean Array Selection](#section8)
9. [Slicing using .iloc](#section9)

## Setup Notebook
At the start of a notebook, we need to import the Python packages we plan to use.
* [NumPy](https://numpy.org/) - A library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. NumPy was introduced in Lecture 3 and we will learn more about its functionality in this lecture. It is customary to `import numpy as np`.
* [Pandas](https://pandas.pydata.org/) - pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. Lectures 4, 5, and 6 will do a deep dive into the core functionality of Pandas. It is customary to `import pandas as pd`. 
* [Seaborn](https://seaborn.pydata.org/) - Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. We will use Seaborn throughout the course for data visualization. It is customary to `import seaborn as sns`.  
* [Matplotlib](https://matplotlib.org//) - Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. We will use Matplotlib throughout the course for data visualization. It is customary to `import matplotlib.pyplot as plt`. 

Next, we want to configure the Jupyter Notebook.
* `%matplotlib inline` - This code configures the notebook to display all plots, from Seaborn or Matplotlib, in the Notebook as opposed to in a separate pop-up window.
* `plt.style.use('fivethirtyeight')` - This code configures the plots with the "fivethirtyeight" styling, which tries to replicate the styles from the website [FiveThirtyEight](https://fivethirtyeight.com/).
* `sns.set_context("notebook")` - This sets the plotting context parameters to be optimized for a Notebook. This affects things like the size of the labels, lines, and other elements of the plot, but not the overall style.

In [None]:
# Import 3rd party libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Configure Notebook
%matplotlib inline
plt.style.use('fivethirtyeight')
sns.set_context("notebook")

**Install xlrd**

In [None]:
!pip install xlrd

**Install lxml**

In [None]:
!pip install lxml

<a id='section1'></a>
## 1. What is Pandas and what are DataFrames?
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table. DataFrames were first introduced in the [**R Programming Language**](https://www.r-project.org/) and are generally the most commonly used pandas object. **Pandas** is the most popular Python package for working with DataFrames.

[![https://medium.com/epfl-extension-school/selecting-data-from-a-pandas-dataframe-53917dc39953](images/dataframe_overview.png)](https://medium.com/epfl-extension-school/selecting-data-from-a-pandas-dataframe-53917dc39953)
<center>The World’s Highest Mountains</center>
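
As a minimal sketch of the idea, a DataFrame can be built directly from a Python dictionary, with one key per column. The table below is a small illustrative example (the variable name `mountains` and the column names are assumptions made for this sketch, not part of the lecture data).

```python
import pandas as pd

# Small illustrative table: each dictionary key becomes a column.
mountains = pd.DataFrame({
    'Name': ['Everest', 'K2', 'Kangchenjunga'],
    'Height (m)': [8848, 8611, 8586],
    'First ascent': [1953, 1954, 1955],
})

print(mountains.shape)   # (number of rows, number of columns)
print(mountains.dtypes)  # each column carries its own data type
```

Columns can hold different types (text in `Name`, integers elsewhere), which is what distinguishes a DataFrame from a plain NumPy array.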

<a id='section2'></a>
## 2. Importing Data Sources
Pandas has a number of very useful file reading tools. You can see them enumerated by typing `pd.read` and pressing tab. Some common tools include:
* `pd.read_csv()` - Import a **comma-separated values (.csv)** file.
* `pd.read_excel()` - Import a **Microsoft Excel (.xls, .xlsx)** file.
* `pd.read_hdf()` - Import a **Hierarchical Data Format (.hdf)** file.
* `pd.read_html()` - Import the tables found in a **Hypertext Markup Language (.html)** page.
* `pd.read_json()` - Import a **JavaScript Object Notation (.json)** file.
* `pd.read_pickle()` - Import a **Python Pickle (.pickle)** file.
* `pd.read_sql()` - Import the result of a **Structured Query Language (SQL)** query or a database table.
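
The readers listed above accept more than file paths. As a hedged sketch, `pd.read_csv()` can also parse any file-like object, which makes it easy to experiment without a file on disk. The CSV text and the name `toy` below are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "Candidate,Party,%\nA,Red,51.0\nB,Blue,49.0\n"

# pd.read_csv accepts a path, a URL, or a file-like object such as StringIO.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)
```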

### CSV Table
Let's import the CSV file **elections.csv**.

In [None]:
# Import CSV file to DataFrame
elections = pd.read_csv('elections.csv')

# View the first few rows
elections.head()

### Excel Table
Let's import the Excel file **fossil_fuel.xls**.

In [None]:
# Import Excel sheet to DataFrame
fossil_fuel = pd.read_excel('fossil_fuel.xls', 
                            sheet_name='Data1')

# View the first few rows
fossil_fuel.head()

### HTML Table
Let's try to import a Wikipedia table that contains information about **[the largest recorded earthquakes](https://en.wikipedia.org/wiki/Lists_of_earthquakes)** by country.

In [None]:
# Import HTML tables into a list of DataFrames
dfs = pd.read_html('https://en.wikipedia.org/wiki/Lists_of_earthquakes')

# dfs is a list that contains one DataFrame for every table found on this Wikipedia page. 
# Visit the site and check out the tables.
print('There are {} tables in dfs.'.format(len(dfs)))

In [None]:
# Let's check out the first table.
df = dfs[0]

# View the first few rows
df.head()

In [None]:
# Let's check out the second table.
df = dfs[1]

# View the first few rows
df.head()

In [None]:
# Let's check out the fourth table (index 3).
df = dfs[3]

# View the first few rows
df.head()

Explore the contents of the other 10 tables.

<a id='section3'></a>
## 3. Anatomy of a DataFrame
To start, let's have a look at the `elections` DataFrame.

In [None]:
elections

The figure below displays the different components of a DataFrame, which include: indices, columns, axes, and Series. 
<br>
<img src="images/DataFrame.png" alt="drawing" width="450"/>
<br>
These different DataFrame components can be easily extracted using the following commands.
### Row Indices

In [None]:
elections.index

`.index` returns a `RangeIndex` object, which shows the start, stop, and step size of the row indices. `RangeIndex` is a memory-saving special case of `Index` limited to representing monotonic ranges (older pandas versions describe it as a special case of `Int64Index`). If we simply want an array of index values, we can use `.to_numpy()`.

In [None]:
elections.index.to_numpy()
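
A minimal sketch of the same idea on a toy table (the name `toy` and its contents are assumptions for illustration): a `RangeIndex` stores only its start, stop, and step, and is expanded into actual values on request.

```python
import pandas as pd

# Toy DataFrame created without an explicit index.
toy = pd.DataFrame({'x': [10, 20, 30]})

idx = toy.index
print(idx)                            # RangeIndex(start=0, stop=3, step=1)
print(idx.start, idx.stop, idx.step)  # the three numbers it stores
print(idx.to_numpy())                 # the index values as a NumPy array
```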

### Columns

In [None]:
elections.columns

### Data
The raw data of any Pandas object can be accessed as a NumPy array using the `.to_numpy()` method. The `.values` attribute has the same effect, but it's recommended to use `.to_numpy()`. 

In [None]:
elections.to_numpy()

### Axes
To illustrate how axes work, let's take the `.max()` of `elections`, which returns the maximum value along the chosen axis.
<br>
<br>
`.max()` along the `0` axis will return a value for each column.

In [None]:
elections.max(axis=0)

`.max()` along the `1` axis will return a value for each row.

In [None]:
# numeric_only=True avoids comparing text and numbers within a row
elections.max(axis=1, numeric_only=True)
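
The axis convention is easy to mix up, so here is a small sketch on a purely numeric toy table (the name `scores` and its values are made up for illustration).

```python
import pandas as pd

# Toy numeric table with 3 rows and 2 columns.
scores = pd.DataFrame({'a': [1, 5, 3], 'b': [4, 2, 6]})

col_max = scores.max(axis=0)  # collapses the rows: one value per column
row_max = scores.max(axis=1)  # collapses the columns: one value per row

print(col_max)
print(row_max)
```

One way to remember it: `axis=0` moves down the rows, so the result is indexed by column name, while `axis=1` moves across the columns, so the result is indexed by row label.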

### Series
A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). Each column in a DataFrame is a Series object. Let's select the `Candidate` column and see what gets returned.

In [None]:
candidate = elections['Candidate']
candidate

Check the data type of `candidate`.

In [None]:
type(candidate)

A Series has an Index.

In [None]:
candidate.index

and data.

In [None]:
candidate.to_numpy()

and a name.

In [None]:
candidate.name

A series can be easily converted to a DataFrame.

In [None]:
candidate.to_frame().head()

Series act like numpy arrays and support most numpy operations.

In [None]:
year = elections['Year']
year.mean()

You can apply NumPy operations.

In [None]:
np.sin(year * 3 + 10)

And of course, Series support Pandas operations.

In [None]:
np.sin(year * 3 + 10).sort_values()

Series also has a very useful method, `.value_counts()`, which allows you to compute the number of occurrences of each unique value.

In [None]:
party = elections['Party']
party_count = party.value_counts()
party_count

In [None]:
party_count.index

In [None]:
party_count.to_numpy()

In [None]:
party_count['Independent']
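
As a small sketch with made-up party labels, the result of `.value_counts()` is itself a Series: the unique values become the index and the counts become the data, sorted from most to least frequent.

```python
import pandas as pd

# Hypothetical labels, used only for illustration.
party = pd.Series(['Red', 'Blue', 'Red', 'Green', 'Red', 'Blue'], name='Party')

counts = party.value_counts()
print(counts)         # sorted from most to least frequent
print(counts['Red'])  # look up a count by its label
```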

<a id='section4'></a>
## 4. Getting a quick look at your DataFrame
We can use built-in Pandas commands to return only a few rows of a DataFrame for quick inspection. <br>
<br>
Check out the first 5 rows of a DataFrame.

In [None]:
elections.head()

Now, check out the first 10 rows of a DataFrame.

In [None]:
elections.head(10)

There is also a tail function that you can use to inspect the last few rows of a DataFrame.

In [None]:
elections.tail(8)

Randomly sample from a DataFrame without replacement.

In [None]:
elections_sample = elections.sample(10, random_state=0, replace=False) 

# view DataFrame
elections_sample.head(10)

Randomly sample from a DataFrame with replacement.

In [None]:
elections.sample(10, random_state=0, replace=True).head(10)

Randomly sample columns from a DataFrame without replacement.

In [None]:
elections.sample(2, random_state=0, replace=False, axis=1).head()
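
The sketch below, on a toy table named `toy` (an assumption for illustration), shows two properties of `.sample()` used above: fixing `random_state` makes the sample reproducible, and sampling with replacement can return the same row more than once.

```python
import pandas as pd

toy = pd.DataFrame({'x': range(5)})

# Same random_state, same sample.
s1 = toy.sample(3, random_state=0)
s2 = toy.sample(3, random_state=0)
print(s1.equals(s2))

# With replacement we may draw more rows than exist, so labels must repeat.
with_repl = toy.sample(8, random_state=0, replace=True)
print(with_repl.index.is_unique)
```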

<a id='section5'></a>
## 5. Ulitity Operations
In addition to `.head()`, `.tail()` and `.sample()`, there are a range of other useful operations.
<br>
<br>
For example, `.shape` returns the number of rows and columns in a DataFrame as a tuple `(rows, cols)`.

In [None]:
elections.shape

`.size` describes the number of "cells" in a DataFrame.

In [None]:
elections.size

In [None]:
print('rows: {} x cols: {} = {}'.format(elections.shape[0], elections.shape[1], elections.size))

We can sort rows by the values in one or more columns.

In [None]:
# Sort by Year in ascending order
elections.sort_values(['Year'], ascending=True)

In [None]:
# Sort by Year in descending order
elections.sort_values(['Year'], ascending=False)

In [None]:
# Sort first by Year in ascending order and then by vote % in descending order
elections.sort_values(['Year', '%'], ascending=[True, False])
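
To see how a multi-column sort breaks ties, here is a minimal sketch on a toy table (the values are invented for illustration). Rows are ordered by `Year` first, and rows with equal `Year` are then ordered by `%`.

```python
import pandas as pd

# Invented values; two rows per year so the second sort key matters.
toy = pd.DataFrame({
    'Year': [2000, 1996, 2000, 1996],
    '%': [48.4, 49.2, 47.9, 40.7],
})

ordered = toy.sort_values(['Year', '%'], ascending=[True, False])
print(ordered)  # the original row labels travel with their rows
```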

We can rename columns if their given names are not acceptable.

In [None]:
elections.rename(columns={'%': 'Percent', 'Result': 'Outcome'}).head()

**Important** The `.rename()` method returns a new DataFrame and does not modify the original one. Let's check out `elections` just to be sure.  

In [None]:
elections.head()

**Most operations in Pandas by default are not mutating.** 
<br>
<br>
This produces cleaner code.  If you change something it should be stored in a new appropriately named variable.
<br>
<br>
So, if we want to permanently make these changes to `elections`, we can reassign the variable as shown below.

In [None]:
elections = elections.rename(columns={'%': 'Percent', 'Result': 'Outcome'})

# View DataFrame
elections.head()

Let's switch back to the original names for continuity.

In [None]:
elections = elections.rename(columns={'Percent': '%', 'Outcome': 'Result'})
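
A compact sketch of the non-mutating behaviour, using a toy table (the names `toy` and `renamed` are assumptions for illustration): the original object is unchanged until the result is assigned to a variable.

```python
import pandas as pd

toy = pd.DataFrame({'%': [51.0, 49.0]})

# rename returns a new DataFrame and leaves toy alone.
renamed = toy.rename(columns={'%': 'Percent'})

print(toy.columns.tolist())      # original column name
print(renamed.columns.tolist())  # new column name
```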

You can inspect the data type of each column using `.dtypes`.

In [None]:
elections.dtypes

If we want to change the data type of `Year` from `int` to `float`, we can use the `.astype()` method.

In [None]:
elections.astype({'Year': float}).head()

When exploring a new DataFrame, we may want to get summary statistics for each column.

In [None]:
elections.describe(include='all')

We can look at summary statistics for numeric data only.

In [None]:
elections.describe(include=np.number)

Or object data.

In [None]:
# np.object was removed from recent NumPy releases; use the built-in object
elections.describe(include=object)

You can even `.transpose()` a DataFrame.

In [None]:
elections.transpose()

<a id='section6'></a>
## 6. Indexing
As we learned in [3. Anatomy of a DataFrame](#section3), all DataFrames and Series have an Index. An Index is like an address: it is how any data point in the DataFrame or Series can be accessed. 

In [None]:
elections

In [None]:
elections.index

By default, a `RangeIndex` is attached enumerating the rows, which is shown in bold as the far left column of the DataFrame. `RangeIndex` is a memory-saving special case of `Index` limited to representing monotonic ranges (older pandas versions describe it as a special case of `Int64Index`).
<br>
<br>
Recall that we sampled the elections table. Let's examine that sample.

In [None]:
elections_sample

In [None]:
elections_sample.index

Notice that the index is different and can no longer be expressed as `RangeIndex`. It maintained the index of the rows in the original table. This is very useful if we wanted to go back and relate derived tables with their original values.

You can use the `.set_index()` operation to set the index of a DataFrame to one of the columns.

In [None]:
elections_sample.set_index('Year')

In [None]:
elections_sample.reset_index()

In [None]:
elections_sample.reset_index(drop=True)
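
The three calls above differ in what happens to the old index. A minimal sketch on a toy table with a non-default index (all names and values are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'Year': [1992, 1980], 'Party': ['Red', 'Blue']},
                   index=[7, 2])

by_year = toy.set_index('Year')       # Year moves from a column to the index
restored = toy.reset_index()          # old index is kept as a column named 'index'
dropped = toy.reset_index(drop=True)  # old index is discarded

print(by_year)
print(restored)
print(dropped)
```

Like `.rename()`, none of these calls modifies `toy` itself.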

The index allows you to reference *rows* by *name*. You will see this in a moment when we talk about slicing.  

**Note:** The index does not need to be unique. Remember our random sample with replacement from earlier ([Getting a quick look at your DataFrame](#section4))? Row index 3 appears twice!

In [None]:
elections.sample(10, random_state=0, replace=True).head(10)

**Note:** Recall that the columns are also a type of index. We can get the list of column names, which can be used to reference columns by name.

In [None]:
elections.columns

<a id='section7'></a>
## 7. Accessing Rows and Columns (Slicing)
There are many ways to access rows and columns of a Pandas DataFrame.  We will spend some time reviewing the most used options. You can access columns using the square brackets `[  ]`.
### Columns

In [None]:
elections_sample

You can pass a list of column names to select only those columns.

In [None]:
elections_sample[['Candidate','Year', 'Result']]

If you pass a list with a single element you get back a DataFrame.

In [None]:
elections_sample[['Candidate']]

If you pass a single column name as a string, you get back a Series.

In [None]:
elections_sample['Candidate']

You can modify and even add columns using the square brackets `[ ]`.

In [None]:
temp = elections_sample.copy()
temp['Year'] = temp['Year'] * -1 + 25.
temp

We can add a new column by assignment.

In [None]:
temp['Corrected Year'] = temp['Year'] * -1 + 25.
temp

In [None]:
temp['random'] = np.random.randn(temp.shape[0])
temp
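
The reason for calling `.copy()` first can be sketched on a toy table (the column name `Decade` is hypothetical): assigning to a column of the copy leaves the original DataFrame untouched.

```python
import pandas as pd

toy = pd.DataFrame({'Year': [1980, 1984]})

# Work on a copy so the original table is not modified.
temp = toy.copy()
temp['Decade'] = temp['Year'] // 10 * 10  # new column created by assignment

print(temp)
print(toy)  # still has only the Year column
```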

### Accessing rows and columns by label using `.loc[ ]`
You can access rows and columns of a DataFrame by name using the `.loc[ ]` syntax.

In [None]:
elections_sample

The syntax for `.loc` is:

```
df.loc[rows_list, column_list]
```
We can pass a list of row names (index values).

In [None]:
elections_sample.loc[[11, 8], ['Party', 'Year']]

In [None]:
elections_sample.loc[:, ['Party', 'Year']]

In [None]:
elections_sample.loc[[11, 8]]

In [None]:
elections_sample.loc[[11, 8], :]

`.loc` also supports slicing (for all types, including numeric and string labels!). Note that slicing with `.loc` is **inclusive** of both endpoints, even for numeric slices.  In general, avoid range slicing with `.loc`.  

In [None]:
elections.loc[0:10, 'Candidate':'Year']
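
A minimal sketch of inclusive label slicing on a toy table (names and values are assumptions for illustration): both the row slice and the column slice include their end labels.

```python
import pandas as pd

toy = pd.DataFrame({'a': range(5), 'b': range(5, 10), 'c': range(10, 15)})

# Rows labelled 1 through 3 and columns 'a' through 'b', endpoints included.
sub = toy.loc[1:3, 'a':'b']
print(sub)
```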

Range slicing works for `elections`. If we try the same thing for `elections_sample`, we get the following value error. 
```
elections_sample.loc[0:10, 'Candidate':'Year']

Returns:
ValueError: index must be monotonic increasing or decreasing
```
Keep in mind that the ranges are over the index values, not the positions, and the index values need to be sorted (monotonic) for a range to be well defined.
<br>
<br>
If we try the same thing after sorting by index in ascending order, it will work.

In [None]:
elections_sample.sort_index().loc[0:10, 'Candidate':'Year']
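
The behaviour can be reproduced on a toy table with a shuffled index (an assumption for illustration). The exact exception type has varied between pandas versions, so the sketch below catches both `KeyError` and `ValueError` rather than relying on one of them.

```python
import pandas as pd

# Shuffled, non-monotonic index; the label 10 is not present.
toy = pd.DataFrame({'x': [1, 2, 3, 4]}, index=[11, 3, 8, 0])

try:
    toy.loc[0:10]
    failed = False
except (KeyError, ValueError):
    failed = True
print(failed)

# After sorting the index, the label range is well defined.
fixed = toy.sort_index().loc[0:10]
print(fixed)
```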

This functionality can be very useful when the index is set to a column, for example `Year`. We can use `.loc` to filter the DataFrame.  

In [None]:
elections_year = elections.set_index('Year').sort_index()
elections_year

Let's say we want to return all election results from 1980 to 2004.

In [None]:
elections_year.loc[1980:2004]
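
As a small sketch with invented rows, label slicing on a sorted index also works when index values repeat, which is the situation here since each election year has several candidates.

```python
import pandas as pd

# Invented rows; Year repeats, as it does in the elections table.
toy = pd.DataFrame({
    'Year': [2008, 1980, 2004, 1980, 1996],
    'Result': ['win', 'loss', 'win', 'win', 'loss'],
})
toy_year = toy.set_index('Year').sort_index()

# Both endpoints are included; every row with a matching label is returned.
sel = toy_year.loc[1980:2004]
print(sel)
```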

If you give `.loc` scalar arguments for both the requested row and column, you get back just a single value.

In [None]:
elections.loc[19, 'Candidate']

<a id='section8'></a>
## 8. Boolean Array Selection

`.loc[ ]` and `[ ]` support arrays of booleans as an input. In this case, the array must be exactly as long as the number of rows or columns. The result is a filtered version of the data frame, where only rows corresponding to `True` appear. This functionality is similar to `WHERE` in **SQL**.

In [None]:
elections_sample.shape

The `elections_sample` DataFrame has 10 rows, so if we create a list of 10 Boolean values, we can use it to filter the DataFrame.

In [None]:
boolean_list = [False, False, False, False, True, 
                False, False, True, True, False]

elections_sample.loc[boolean_list]

You can also pass the same arguments to the `[ ]` operator.

In [None]:
elections_sample[boolean_list]

One very common task in Data Science is filtering. Boolean Array Selection is one way to achieve this in Pandas. We start by observing that comparison operators like the equality operator `==` can be applied to a Pandas Series to generate a Boolean array. For example, we can compare the `Result` column to the string `win`.

In [None]:
elections.head()

In [None]:
iswin = elections['Result'] == 'win'
iswin

The output of the logical operator applied to the Series is another Series with the same name and index, but of datatype boolean. The entry at row **i** represents the result of the application of that operator to the entry of the original Series at row **i**.
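
A minimal sketch of this on a toy Series (labels and values are made up for illustration): the comparison is applied element by element, and the result keeps the name and index of the original.

```python
import pandas as pd

result = pd.Series(['win', 'loss', 'win'], index=[4, 9, 1], name='Result')

mask = result == 'win'
print(mask)        # one True/False per row
print(mask.name)   # same name as the original Series
print(mask.index)  # same index as the original Series
```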
<br>
<br>
Such a boolean Series can be used as an argument to the `[ ]` operator. For example, the following code creates a DataFrame of all election winners since 1980.

In [None]:
elections[iswin]

Above, we've assigned the result of the logical operator to a new variable called `iswin`. This is uncommon. Usually, the series is created and used on the same line. Such code is a little tricky to read at first, but you'll get used to it quickly.

In [None]:
elections[elections['Result'] == 'win']

We can select on multiple criteria by creating multiple boolean Series and combining them using the `&` operator.

In [None]:
elections[
    (elections['Result'] == 'win') & 
    (elections['%'] < 50)
]
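
Note the parentheses around each condition. In Python, `&` binds more tightly than comparison operators such as `==` and `<`, so each comparison needs to be wrapped before combining. A small sketch on invented data:

```python
import pandas as pd

toy = pd.DataFrame({
    'Result': ['win', 'loss', 'win'],
    '%': [50.7, 48.0, 43.0],
})

# Each comparison is parenthesised, then combined element-wise with &.
mask = (toy['Result'] == 'win') & (toy['%'] < 50)
print(toy[mask])
```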

We can also use the logical negation operator `~`, which means **Not**.  

In [None]:
elections[
    (elections['Result'] == 'win') & 
    ~(elections['%'] < 50)
]

We can also use the `|` operator, which means **Or**.

In [None]:
elections[
    ~((elections['Party'] == "Democratic") | 
      (elections['Party'] == "Republican"))
]

In [None]:
elections[
    (elections['Party'] != "Democratic") & 
    (elections['Party'] != "Republican")
]

If we have multiple conditions (say Republican or Democratic), we can use the `.isin()` method to simplify our code.

In [None]:
elections[~elections['Party'].isin(['Republican', 'Democratic'])]
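
The last three cells express the same filter in different ways. A minimal sketch on a toy table (the rows are invented for illustration) confirms that they select the same rows:

```python
import pandas as pd

toy = pd.DataFrame({
    'Party': ['Democratic', 'Republican', 'Independent', 'Green'],
})

# Not (A or B)
a = toy[~((toy['Party'] == 'Democratic') | (toy['Party'] == 'Republican'))]
# (Not A) and (Not B)
b = toy[(toy['Party'] != 'Democratic') & (toy['Party'] != 'Republican')]
# Not in {A, B}
c = toy[~toy['Party'].isin(['Republican', 'Democratic'])]

print(a.equals(b), b.equals(c))
```

The `.isin()` form tends to scale better as the list of values grows, since it avoids writing one comparison per value.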

<a id='section9'></a>
## 9. Slicing using `.iloc`

`.loc`'s cousin `.iloc` is very similar, but is used to access data based on numerical position instead of label. For example, to access the first 3 rows and first 3 columns of a table, we can use `.iloc[0:3, 0:3]`. `.iloc` slicing is **exclusive** of the end position, just like standard Python slicing.

In [None]:
elections_sample

In [None]:
elections_sample.iloc[2:, 3:5]

In [None]:
elections_sample.iloc[::2, 3:5]

In [None]:
elections_sample.iloc[5:-1, 3:5]
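
The difference between label-based and position-based slicing is easiest to see when the index does not start at zero. A small sketch on a toy table (names and values are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'a': range(5), 'b': range(5, 10)},
                   index=[10, 11, 12, 13, 14])

by_pos = toy.iloc[0:3]     # positions 0, 1, 2 (end excluded)
by_label = toy.loc[10:13]  # labels 10, 11, 12, 13 (end included)

print(by_pos)
print(by_label)
```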

### Caution
We will use both `.loc` and `.iloc` in the course. `.loc` is generally preferred for a number of reasons, for example: 

1. It is harder to make mistakes since you have to literally write out what you want to get.
2. Code is easier to read, because the reader doesn't have to know e.g. what column **#31** represents.
3. It is robust against permutations of the data, e.g. the social security administration switches the order of two columns.

However, `.iloc` is sometimes more convenient. We'll provide examples of when `.iloc` is the superior choice.
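
Point 3 above can be sketched with a toy table whose columns get reordered (the table and names are invented for illustration). Selecting by label returns the same data either way, while selecting by position silently returns a different column.

```python
import pandas as pd

toy = pd.DataFrame({'Name': ['A', 'B'], 'Count': [3, 7]})
swapped = toy[['Count', 'Name']]  # same data, columns in a different order

# Label-based selection is unaffected by the reordering.
print(toy.loc[:, 'Count'].equals(swapped.loc[:, 'Count']))

# Position-based selection now picks a different column.
print(toy.iloc[:, 1].name, swapped.iloc[:, 1].name)
```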

## Quick Challenge
Which of the following expressions returns a DataFrame of the first 3 Candidate and Party names for candidates that won with more than 50% of the vote?

In [None]:
elections.iloc[[0, 3, 5], [0, 3]]

In [None]:
elections.loc[[0, 3, 5], 'Candidate': 'Year']

In [None]:
elections.loc[elections['%'] > 50, ['Candidate', 'Year']].head(3)

In [None]:
elections.loc[elections['%'] > 50, ['Candidate', 'Year']].iloc[0:2, :]

## Baby Names Data

We will start working with the baby names dataset next lecture. If you're interested, you can get a head start.

Now let's play around a bit with the large baby names dataset. We'll start by loading that dataset from the social security administration's website.

To keep the data small enough to avoid crashing **JupyterHub**, we're going to look at only New York rather than looking at the national dataset.

In [None]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ny_name = 'NY.TXT'  # file for New York inside the zip archive
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ny_name) as fh:
    baby_names = pd.read_csv(fh, header=None, names=field_names)

baby_names.sample(5)

**Goal 1:** Find the 20 most popular female baby names in New York in 2018.

<details>
    <summary>Solution</summary>
<code>
baby_names[
    (baby_names['Year'] == 2018) & 
    (baby_names['Sex'] == 'F')
].sort_values(by='Count', ascending=False).head(20)
</code>
</details>

In [None]:
# Solution here

**Goal 2:** Make a plot of how many baby boys were named **Avery** over the years.

<details>
    <summary>Solution</summary>
<code>
_ = baby_names[
    (baby_names['Name'] == 'Avery') & 
    (baby_names['Sex'] == 'M')
].plot(x='Year', y='Count', marker='.');
</code>
</details>

In [None]:
# Solution here