# CME538 - Introduction to Data Science
## Lecture 2.1 - Pandas I
### Goals
Introduce Pandas, with emphasis on:
* Key Data Structures (data frames, series, indices).
* How to index into these structures.
* How to read files to create these structures.
* Other basic operations on these structures.
* We will go through quite a lot of the library without full explanations.
* We expect you to fill in the gaps on homeworks, labs, projects, and through your own experimentation.
* Solve some very basic data science problems using Jupyter/pandas.

### Lecture Structure
1. [What is Pandas and what are DataFrames?](#section1)
2. [Importing Data Sources](#section2)
3. [Anatomy of a DataFrame](#section3)
4. [Getting a quick look at your DataFrame](#section4)
5. [Utility Operations](#section5)
6. [Indexing](#section6)
7. [Accessing Rows and Columns (Slicing)](#section7)
8. [Boolean Array Selection](#section8)
9. [Slicing using .iloc](#section9)

## Setup Notebook
At the start of a notebook, we need to import the Python packages we plan to use.
* [NumPy](https://numpy.org/) - A library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. NumPy was introduced in Lecture 3 and we will learn more about its functionality in this lecture. It is customary to `import numpy as np`.
* [Pandas](https://pandas.pydata.org/) - pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. Lectures 4, 5, and 6 will do a deep dive into the core functionality of Pandas. It is customary to `import pandas as pd`. 
* [Seaborn](https://seaborn.pydata.org/) - Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. We will use Seaborn throughout CIV1498 for data visualization. It is customary to `import seaborn as sns`.  
* [Matplotlib](https://matplotlib.org//) - Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. We will use Matplotlib throughout the course for data visualization. It is customary to `import matplotlib.pyplot as plt`. 

Next, we want to configure the Jupyter Notebook.
* `%matplotlib inline` - This code configures the notebook to display all plots, from Seaborn or Matplotlib, in the Notebook as opposed to in a separate pop-up window.
* `plt.style.use('fivethirtyeight')` - This code configures the plots with the "fivethirtyeight" styling, which tries to replicate the styles from the website [FiveThirtyEight](https://fivethirtyeight.com/).
* `sns.set_context("notebook")` - This sets the plotting context parameters to be optimized for a Notebook. This affects things like the size of the labels, lines, and other elements of the plot, but not the overall style.

In [1]:
# Import 3rd party libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Configure Notebook
%matplotlib inline
plt.style.use('fivethirtyeight')
sns.set_context("notebook")

**Install xlrd**

In [2]:
!pip install xlrd

Collecting xlrd
  Obtaining dependency information for xlrd from https://files.pythonhosted.org/packages/a6/0c/c2a72d51fe56e08a08acc85d13013558a2d793028ae7385448a6ccdfae64/xlrd-2.0.1-py2.py3-none-any.whl.metadata
  Downloading xlrd-2.0.1-py2.py3-none-any.whl.metadata (3.4 kB)
Downloading xlrd-2.0.1-py2.py3-none-any.whl (96 kB)
Installing collected packages: xlrd
Successfully installed xlrd-2.0.1


**Install lxml**

In [3]:
!pip install lxml



<a id='section1'></a>
## 1. What is Pandas and what are DataFrames?
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table. DataFrames were popularized by the [**R Programming Language**](https://www.r-project.org/), and the DataFrame is generally the most commonly used pandas object. **Pandas** is the most popular Python package for working with DataFrames.

[![https://medium.com/epfl-extension-school/selecting-data-from-a-pandas-dataframe-53917dc39953](images/dataframe_overview.png)](https://medium.com/epfl-extension-school/selecting-data-from-a-pandas-dataframe-53917dc39953)
<center>The World’s Highest Mountains</center>

<a id='section2'></a>
## 2. Importing Data Sources
Pandas has a number of very useful file reading tools. You can see them enumerated by typing `pd.read` and pressing tab. Some common tools include:
* `pd.read_csv()` - Import a **comma-separated values (.csv)** file.
* `pd.read_excel()` - Import a **Microsoft Excel (.xlsx)** file.
* `pd.read_hdf()` - Import a **Hierarchical Data Format (.hdf)** file.
* `pd.read_html()` - Import a **Hypertext Markup Language (.html)** file.
* `pd.read_json()` - Import a **JavaScript Object Notation (.json)** file.
* `pd.read_pickle()` - Import a **Python Pickle (.pickle)** file.
* `pd.read_sql()` - Import the result of a **Structured Query Language (SQL)** query or a database table.
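
As a minimal sketch of how these readers behave, the example below parses CSV text held in memory instead of a file on disk. The variable names `csv_text` and `demo` are hypothetical, and the three rows are copied from the elections table used in this lecture.

```python
import io
import pandas as pd

# Hypothetical CSV content, held in memory so the example needs no file on disk
csv_text = """Candidate,Party,%,Year,Result
Reagan,Republican,50.7,1980,win
Carter,Democratic,41.0,1980,loss
Anderson,Independent,6.6,1980,loss
"""

# pd.read_csv accepts a file path or any file-like object
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.shape)   # (rows, columns)
print(demo.dtypes)  # column types are inferred from the text
```

The first line of the text is used as the column header, and a default integer index is attached to the rows.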

### CSV Table
Let's import the CSV file **elections.csv**.

In [4]:
# Import CSV file to DataFrame
elections = pd.read_csv('elections.csv')

# View the first few rows
elections.head()

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss


### Excel Table
Let's import the Excel file **fossil_fuel.xls**.

In [5]:
# Import Excel sheet to DataFrame
fossil_fuel = pd.read_excel('fossil_fuel.xls', 
                            sheet_name='Data1')

# View the first few rows
fossil_fuel.head()

Unnamed: 0,Compound,Formula,Concentration in atmosphere[25] (ppm),Contribution (%)
0,Water vapor and clouds,H2O,"10–50,000(A)",36–72%
1,Carbon dioxide,CO2,~400,9–26%
2,Methane,CH4,~1.8,4–9%
3,Ozone,O3,2–8(B),3–7%


### HTML Table
Let's try to import a Wikipedia table that contains information about **[the largest recorded earthquakes](https://en.wikipedia.org/wiki/Lists_of_earthquakes)** by country.

In [6]:
# Import html tables to DataFrame
dfs = pd.read_html('https://en.wikipedia.org/wiki/Lists_of_earthquakes')

# dfs is a list that contains a DataFrame for every table found on this Wikipedia page. 
# Visit the site and check out the tables.
print('There are {} tables in dfs.'.format(len(dfs)))

There are 17 tables in dfs.


In [7]:
# Let's check out the first table.
df = dfs[0]

# View the first few rows
df.head()

Unnamed: 0,0,1
0,,This article has multiple issues. Please help ...
1,,This article needs additional citations for ve...
2,,This article possibly contains original resear...


In [8]:
# Let's check out the second table.
df = dfs[1]

# View the first few rows
df.head()

Unnamed: 0,0,1
0,,This article needs additional citations for ve...


In [9]:
# Let's check out the fourth table (index 3).
df = dfs[3]

# View the first few rows
df.head()

Unnamed: 0,Country[1],2023,2022,2021,2020,2019,2018
0,Indonesia,2233,2207,2307,2082,2907,1928
1,Mexico,1838,1791,1873,2105,2090,1915
2,Philippines,1473,921,831,851,950,470
3,Chile,936,905,970,993,846,856
4,Japan,903,1253,1122,819,776,875


Explore the contents of the remaining tables.

<a id='section3'></a>
## 3. Anatomy of a DataFrame
To start, let's have a look at the `elections` DataFrame.

In [10]:
elections

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss
7,Clinton,Democratic,43.0,1992,win
8,Bush,Republican,37.4,1992,loss
9,Perot,Independent,18.9,1992,loss


The figure below displays the different components of a DataFrame, which include: indices, columns, axes, and Series. 
<br>
<img src="images/DataFrame.png" alt="drawing" width="450"/>
<br>
These different DataFrame components can be easily extracted using the following commands.
### Row Indices

In [11]:
elections.index

RangeIndex(start=0, stop=23, step=1)

`.index` returns a `RangeIndex` object, which shows the start, stop, and step of the row indices. `RangeIndex` is a memory-saving special case of an integer `Index` (called `Int64Index` in older pandas versions), limited to representing monotonic ranges. If we want to simply get an array of index values, we can use `.to_numpy()`.

In [12]:
elections.index.to_numpy()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22], dtype=int64)

### Columns

In [13]:
elections.columns

Index(['Candidate', 'Party', '%', 'Year', 'Result'], dtype='object')

### Data
The raw data of any Pandas object can be accessed as a NumPy array using the `.to_numpy()` operation. The `.values` attribute has the same effect, but it is recommended to use `.to_numpy()`. 

In [14]:
elections.to_numpy()

array([['Reagan', 'Republican', 50.7, 1980, 'win'],
       ['Carter', 'Democratic', 41.0, 1980, 'loss'],
       ['Anderson', 'Independent', 6.6, 1980, 'loss'],
       ['Reagan', 'Republican', 58.8, 1984, 'win'],
       ['Mondale', 'Democratic', 37.6, 1984, 'loss'],
       ['Bush', 'Republican', 53.4, 1988, 'win'],
       ['Dukakis', 'Democratic', 45.6, 1988, 'loss'],
       ['Clinton', 'Democratic', 43.0, 1992, 'win'],
       ['Bush', 'Republican', 37.4, 1992, 'loss'],
       ['Perot', 'Independent', 18.9, 1992, 'loss'],
       ['Clinton', 'Democratic', 49.2, 1996, 'win'],
       ['Dole', 'Republican', 40.7, 1996, 'loss'],
       ['Perot', 'Independent', 8.4, 1996, 'loss'],
       ['Gore', 'Democratic', 48.4, 2000, 'loss'],
       ['Bush', 'Republican', 47.9, 2000, 'win'],
       ['Kerry', 'Democratic', 48.3, 2004, 'loss'],
       ['Bush', 'Republican', 50.7, 2004, 'win'],
       ['Obama', 'Democratic', 52.9, 2008, 'win'],
       ['McCain', 'Republican', 45.7, 2008, 'loss'],
       ['O

### Axes
To illustrate how axes work, let's take the `.max()` of `elections`, which returns the maximum value along the chosen axis.
<br>
<br>
`.max()` along axis `0` compares values down the rows and returns one value for each column.

In [15]:
elections.max(axis=0)

Candidate         Trump
Party        Republican
%                  58.8
Year               2016
Result              win
dtype: object

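`.max()` along axis `1` compares values across the columns and returns one value for each row. Here it fails, because each row mixes strings and numbers, which cannot be compared with one another.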

In [16]:
elections.max(axis=1)

TypeError: '>=' not supported between instances of 'str' and 'float'
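
One possible way around the error above is to restrict the comparison to numeric columns with the `numeric_only=True` argument of `.max()`. The sketch below uses a small stand-in table (the hypothetical name `demo`, holding the first three rows of `elections`) so that it runs on its own.

```python
import pandas as pd

# Small stand-in for the elections table (first three rows only)
demo = pd.DataFrame({
    'Candidate': ['Reagan', 'Carter', 'Anderson'],
    '%': [50.7, 41.0, 6.6],
    'Year': [1980, 1980, 1980],
})

# Compare only the numeric columns ('%' and 'Year') within each row
row_max = demo.max(axis=1, numeric_only=True)
print(row_max)
```

Whether a row-wise maximum across `%` and `Year` is meaningful is a separate question; the point is only that axis `1` operations need comparable types in every row.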

### Series
A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). Each column in a DataFrame is a Series object. Let's select the `Candidate` column and see what gets returned.

In [17]:
candidate = elections['Candidate']
candidate

0       Reagan
1       Carter
2     Anderson
3       Reagan
4      Mondale
5         Bush
6      Dukakis
7      Clinton
8         Bush
9        Perot
10     Clinton
11        Dole
12       Perot
13        Gore
14        Bush
15       Kerry
16        Bush
17       Obama
18      McCain
19       Obama
20      Romney
21     Clinton
22       Trump
Name: Candidate, dtype: object

Check the data type of `candidate`.

In [18]:
type(candidate)

pandas.core.series.Series

A Series has an Index.

In [19]:
candidate.index

RangeIndex(start=0, stop=23, step=1)

and data.

In [20]:
candidate.to_numpy()

array(['Reagan', 'Carter', 'Anderson', 'Reagan', 'Mondale', 'Bush',
       'Dukakis', 'Clinton', 'Bush', 'Perot', 'Clinton', 'Dole', 'Perot',
       'Gore', 'Bush', 'Kerry', 'Bush', 'Obama', 'McCain', 'Obama',
       'Romney', 'Clinton', 'Trump'], dtype=object)

and a name.

In [21]:
candidate.name

'Candidate'

A series can be easily converted to a DataFrame.

In [22]:
candidate.to_frame().head()

Unnamed: 0,Candidate
0,Reagan
1,Carter
2,Anderson
3,Reagan
4,Mondale


Series act like NumPy arrays and support most NumPy operations.

In [23]:
year = elections['Year']
year.mean()

1996.8695652173913

You can apply NumPy operations.

In [24]:
np.sin(year * 3 + 10)

0    -0.175571
1    -0.175571
2    -0.175571
3    -0.676395
4    -0.676395
5    -0.965985
6    -0.965985
7    -0.953907
8    -0.953907
9    -0.953907
10   -0.643930
11   -0.643930
12   -0.643930
13   -0.132860
14   -0.132860
15    0.419702
16    0.419702
17    0.841194
18    0.841194
19    0.999988
20    0.999988
21    0.846493
22    0.846493
Name: Year, dtype: float64
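
As a small illustration of the point above, the sketch below applies a NumPy function to a Series with a non-default index and shows that the result is still a Series carrying the same index and name. The Series `year` here is a hypothetical three-element stand-in with made-up labels `'a'`, `'b'`, `'c'`.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with string labels to make the index easy to spot
year = pd.Series([1980, 1984, 1988], index=['a', 'b', 'c'], name='Year')

# A NumPy function applied to a Series returns a Series, not a bare array
decade = np.floor(year / 10) * 10
print(decade)
print(type(decade).__name__)
```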

And of course, Series support Pandas operations.

In [25]:
np.sin(year * 3 + 10).sort_values()

5    -0.965985
6    -0.965985
7    -0.953907
8    -0.953907
9    -0.953907
3    -0.676395
4    -0.676395
11   -0.643930
12   -0.643930
10   -0.643930
0    -0.175571
1    -0.175571
2    -0.175571
13   -0.132860
14   -0.132860
15    0.419702
16    0.419702
17    0.841194
18    0.841194
21    0.846493
22    0.846493
19    0.999988
20    0.999988
Name: Year, dtype: float64

Series also has a very useful method, `.value_counts()`, which computes the number of occurrences of each unique value.

In [26]:
party = elections['Party']
party_count = party.value_counts()
party_count

Party
Republican     10
Democratic     10
Independent     3
Name: count, dtype: int64

In [27]:
party_count.index

Index(['Republican', 'Democratic', 'Independent'], dtype='object', name='Party')

In [28]:
party_count.to_numpy()

array([10, 10,  3], dtype=int64)

In [29]:
party_count['Independent']

3
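
A related option, sketched here on a hypothetical six-element Series, is `normalize=True`, which makes `.value_counts()` return fractions instead of absolute counts.

```python
import pandas as pd

# Hypothetical party labels, chosen so that every count is different
party = pd.Series(['Republican', 'Democratic', 'Independent',
                   'Republican', 'Democratic', 'Republican'], name='Party')

counts = party.value_counts()                # absolute counts, largest first
shares = party.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```

In both cases the result is a Series whose index holds the unique values, so individual entries can be looked up by label.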

<a id='section4'></a>
## 4. Getting a quick look at your DataFrame
We can use built-in Pandas commands to return only a few rows of a DataFrame for quick inspection. <br>
<br>
Check out the first 5 rows of a DataFrame.

In [30]:
elections.head()

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss


Now, check out the first 10 rows of a DataFrame.

In [31]:
elections.head(10)

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss
7,Clinton,Democratic,43.0,1992,win
8,Bush,Republican,37.4,1992,loss
9,Perot,Independent,18.9,1992,loss


There is also a tail function that you can use to inspect the last few rows of a DataFrame.

In [32]:
elections.tail(8)

Unnamed: 0,Candidate,Party,%,Year,Result
15,Kerry,Democratic,48.3,2004,loss
16,Bush,Republican,50.7,2004,win
17,Obama,Democratic,52.9,2008,win
18,McCain,Republican,45.7,2008,loss
19,Obama,Democratic,51.1,2012,win
20,Romney,Republican,47.2,2012,loss
21,Clinton,Democratic,48.2,2016,loss
22,Trump,Republican,46.1,2016,win


Randomly sample from a DataFrame without replacement.

In [33]:
elections_sample = elections.sample(10, random_state=0, replace=False) 

# view DataFrame
elections_sample.head(10)

Unnamed: 0,Candidate,Party,%,Year,Result
11,Dole,Republican,40.7,1996,loss
10,Clinton,Democratic,49.2,1996,win
21,Clinton,Democratic,48.2,2016,loss
14,Bush,Republican,47.9,2000,win
20,Romney,Republican,47.2,2012,loss
1,Carter,Democratic,41.0,1980,loss
13,Gore,Democratic,48.4,2000,loss
22,Trump,Republican,46.1,2016,win
16,Bush,Republican,50.7,2004,win
8,Bush,Republican,37.4,1992,loss


Randomly sample from a DataFrame with replacement.

In [34]:
elections.sample(10, random_state=0, replace=True).head(10)

Unnamed: 0,Candidate,Party,%,Year,Result
12,Perot,Independent,8.4,1996,loss
15,Kerry,Democratic,48.3,2004,loss
21,Clinton,Democratic,48.2,2016,loss
0,Reagan,Republican,50.7,1980,win
3,Reagan,Republican,58.8,1984,win
3,Reagan,Republican,58.8,1984,win
7,Clinton,Democratic,43.0,1992,win
9,Perot,Independent,18.9,1992,loss
19,Obama,Democratic,51.1,2012,win
21,Clinton,Democratic,48.2,2016,loss


Randomly sample columns from a DataFrame without replacement.

In [35]:
elections.sample(2, random_state=0, replace=False, axis=1).head()

Unnamed: 0,%,Candidate
0,50.7,Reagan
1,41.0,Carter
2,6.6,Anderson
3,58.8,Reagan
4,37.6,Mondale
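
The `random_state` argument used in the sampling cells makes the draw reproducible. A minimal sketch, using a hypothetical five-row table `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'Year': [1980, 1984, 1988, 1992, 1996]})

# The same random_state yields the same rows on every call
first = demo.sample(3, random_state=0)
second = demo.sample(3, random_state=0)
print(first.index.equals(second.index))

# Without replacement (the default), no row label can appear twice
print(first.index.is_unique)
```

Omitting `random_state` gives a different sample on each run, which is usually what you want in analysis but inconvenient in lecture notes.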


<a id='section5'></a>
## 5. Utility Operations
In addition to `.head()`, `.tail()` and `.sample()`, there is a range of other useful operations.
<br>
<br>
For example, `.shape` returns the number of rows and columns in a DataFrame as a tuple `(rows, cols)`.

In [36]:
elections.shape

(23, 5)

`.size` describes the number of "cells" in a DataFrame.

In [37]:
elections.size

115

In [38]:
print('rows: {} x cols: {} = {}'.format(elections.shape[0], elections.shape[1], elections.size))

rows: 23 x cols: 5 = 115


We can sort rows by values in multiple columns.

In [39]:
# Sort by Year in ascending order
elections.sort_values(['Year'], ascending=True)

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss
7,Clinton,Democratic,43.0,1992,win
8,Bush,Republican,37.4,1992,loss
9,Perot,Independent,18.9,1992,loss


In [40]:
# Sort by Year in descending order
elections.sort_values(['Year'], ascending=False)

Unnamed: 0,Candidate,Party,%,Year,Result
22,Trump,Republican,46.1,2016,win
21,Clinton,Democratic,48.2,2016,loss
20,Romney,Republican,47.2,2012,loss
19,Obama,Democratic,51.1,2012,win
18,McCain,Republican,45.7,2008,loss
17,Obama,Democratic,52.9,2008,win
16,Bush,Republican,50.7,2004,win
15,Kerry,Democratic,48.3,2004,loss
14,Bush,Republican,47.9,2000,win
13,Gore,Democratic,48.4,2000,loss


In [41]:
# Sort first by Year in ascending order and then by vote % in descending order
elections.sort_values(['Year', '%'], ascending=[True, False])

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss
7,Clinton,Democratic,43.0,1992,win
8,Bush,Republican,37.4,1992,loss
9,Perot,Independent,18.9,1992,loss
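
Because `elections` is already stored in this order, the effect of a multi-column sort is easier to see on shuffled data. The sketch below uses a hypothetical four-row table `demo` whose rows start out of order.

```python
import pandas as pd

# Rows deliberately out of order within each year
demo = pd.DataFrame({
    'Candidate': ['Carter', 'Reagan', 'Mondale', 'Reagan'],
    '%': [41.0, 50.7, 37.6, 58.8],
    'Year': [1980, 1980, 1984, 1984],
})

# Year ascending; within each year, highest vote share first
ordered = demo.sort_values(['Year', '%'], ascending=[True, False])
print(ordered)
```

The original row labels travel with the rows, so the index of `ordered` is no longer in increasing order.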


We can rename columns if their given names are not acceptable.

In [42]:
elections.rename(columns={'%': 'Percent', 'Result': 'Outcome'}).head()

Unnamed: 0,Candidate,Party,Percent,Year,Outcome
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss


**Important** The `.rename()` method returns a new DataFrame and does not modify the original one. Let's check out `elections` just to be sure.  

In [43]:
elections.head()

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss


**Most operations in Pandas by default are not mutating.** 
<br>
<br>
This produces cleaner code.  If you change something it should be stored in a new appropriately named variable.
<br>
<br>
So, if we want to permanently make these changes to `elections`, we can reassign the variable as shown below.

In [44]:
elections = elections.rename(columns={'%': 'Percent', 'Result': 'Outcome'})

# View DataFrame
elections.head()

Unnamed: 0,Candidate,Party,Percent,Year,Outcome
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss


Let's switch back to the original names for continuity.

In [45]:
elections = elections.rename(columns={'Percent': '%', 'Outcome': 'Result'})
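
The non-mutating behaviour of `.rename()` can be checked directly. This is a minimal sketch on a hypothetical two-row table `demo`, not on `elections` itself.

```python
import pandas as pd

demo = pd.DataFrame({'%': [50.7, 41.0], 'Result': ['win', 'loss']})

# rename returns a new DataFrame; demo keeps its labels
renamed = demo.rename(columns={'%': 'Percent', 'Result': 'Outcome'})

print(list(demo.columns))     # original labels are untouched
print(list(renamed.columns))  # new labels live on the returned object
```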

You can inspect the data type of each column using `.dtypes`.

In [46]:
elections.dtypes

Candidate     object
Party         object
%            float64
Year           int64
Result        object
dtype: object

If we want to change the data type of `Year` from `int` to `float`, we can use the `.astype()` method.

In [47]:
elections.astype({'Year': float}).head()

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980.0,win
1,Carter,Democratic,41.0,1980.0,loss
2,Anderson,Independent,6.6,1980.0,loss
3,Reagan,Republican,58.8,1984.0,win
4,Mondale,Democratic,37.6,1984.0,loss
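
`.astype()` also accepts several columns at once, and like `.rename()` it returns a new DataFrame. The sketch below assumes a hypothetical table `demo` in which the vote share was read in as text.

```python
import pandas as pd

# '%' is stored as strings here to show a text-to-number conversion
demo = pd.DataFrame({'Year': [1980, 1984], '%': ['50.7', '58.8']})

# Convert several columns at once by passing a {column: dtype} mapping
converted = demo.astype({'Year': float, '%': float})

print(demo.dtypes)       # the original is unchanged
print(converted.dtypes)  # both columns are now float64
```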


When exploring a new DataFrame, we may want to get summary statistics for each column.

In [48]:
elections.describe(include='all')

Unnamed: 0,Candidate,Party,%,Year,Result
count,23,23,23.0,23.0,23
unique,15,3,,,2
top,Bush,Republican,,,loss
freq,4,10,,,13
mean,,,42.513043,1996.869565,
std,,,13.476117,11.627961,
min,,,6.6,1980.0,
25%,,,40.85,1988.0,
50%,,,47.2,1996.0,
75%,,,49.95,2006.0,


We can look at summary statistics for numeric data only.

In [49]:
elections.describe(include=np.number)

Unnamed: 0,%,Year
count,23.0,23.0
mean,42.513043,1996.869565
std,13.476117,11.627961
min,6.6,1980.0
25%,40.85,1988.0
50%,47.2,1996.0
75%,49.95,2006.0
max,58.8,2016.0


Or object data.

In [50]:
# `np.object` was deprecated in NumPy 1.20 and later removed; use the builtin `object` instead
elections.describe(include=object)

Unnamed: 0,Candidate,Party,Result
count,23,23,23
unique,15,3,2
top,Bush,Republican,loss
freq,4,10,13


You can even `.transpose()` a DataFrame.

In [51]:
elections.transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13,14,15,16,17,18,19,20,21,22
Candidate,Reagan,Carter,Anderson,Reagan,Mondale,Bush,Dukakis,Clinton,Bush,Perot,...,Gore,Bush,Kerry,Bush,Obama,McCain,Obama,Romney,Clinton,Trump
Party,Republican,Democratic,Independent,Republican,Democratic,Republican,Democratic,Democratic,Republican,Independent,...,Democratic,Republican,Democratic,Republican,Democratic,Republican,Democratic,Republican,Democratic,Republican
%,50.7,41.0,6.6,58.8,37.6,53.4,45.6,43.0,37.4,18.9,...,48.4,47.9,48.3,50.7,52.9,45.7,51.1,47.2,48.2,46.1
Year,1980,1980,1980,1984,1984,1988,1988,1992,1992,1992,...,2000,2000,2004,2004,2008,2008,2012,2012,2016,2016
Result,win,loss,loss,win,loss,win,loss,win,loss,loss,...,loss,win,loss,win,win,loss,win,loss,loss,win


<a id='section6'></a>
## 6. Indexing
As we learned in [3. Anatomy of a DataFrame](#section3), all DataFrames and Series have an Index. An Index is like an address: it is how any data point in the DataFrame or Series can be accessed. 

In [52]:
elections

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss
7,Clinton,Democratic,43.0,1992,win
8,Bush,Republican,37.4,1992,loss
9,Perot,Independent,18.9,1992,loss


In [53]:
elections.index

RangeIndex(start=0, stop=23, step=1)

By default, a `RangeIndex` is attached that enumerates the rows; it is shown in bold as the far left column of the DataFrame. `RangeIndex` is a memory-saving special case of an integer `Index` (called `Int64Index` in older pandas versions), limited to representing monotonic ranges.
<br>
<br>
Recall that we sampled the elections table. Let's examine that sample.

In [54]:
elections_sample

Unnamed: 0,Candidate,Party,%,Year,Result
11,Dole,Republican,40.7,1996,loss
10,Clinton,Democratic,49.2,1996,win
21,Clinton,Democratic,48.2,2016,loss
14,Bush,Republican,47.9,2000,win
20,Romney,Republican,47.2,2012,loss
1,Carter,Democratic,41.0,1980,loss
13,Gore,Democratic,48.4,2000,loss
22,Trump,Republican,46.1,2016,win
16,Bush,Republican,50.7,2004,win
8,Bush,Republican,37.4,1992,loss


In [55]:
elections_sample.index

Index([11, 10, 21, 14, 20, 1, 13, 22, 16, 8], dtype='int64')

Notice that the index is different and can no longer be expressed as `RangeIndex`. It maintained the index of the rows in the original table. This is very useful if we wanted to go back and relate derived tables with their original values.

You can use the `.set_index()` operation to set the index of a DataFrame to one of the columns.

In [56]:
elections_sample.set_index('Year')

Unnamed: 0_level_0,Candidate,Party,%,Result
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1996,Dole,Republican,40.7,loss
1996,Clinton,Democratic,49.2,win
2016,Clinton,Democratic,48.2,loss
2000,Bush,Republican,47.9,win
2012,Romney,Republican,47.2,loss
1980,Carter,Democratic,41.0,loss
2000,Gore,Democratic,48.4,loss
2016,Trump,Republican,46.1,win
2004,Bush,Republican,50.7,win
1992,Bush,Republican,37.4,loss


In [57]:
elections_sample.reset_index()

Unnamed: 0,index,Candidate,Party,%,Year,Result
0,11,Dole,Republican,40.7,1996,loss
1,10,Clinton,Democratic,49.2,1996,win
2,21,Clinton,Democratic,48.2,2016,loss
3,14,Bush,Republican,47.9,2000,win
4,20,Romney,Republican,47.2,2012,loss
5,1,Carter,Democratic,41.0,1980,loss
6,13,Gore,Democratic,48.4,2000,loss
7,22,Trump,Republican,46.1,2016,win
8,16,Bush,Republican,50.7,2004,win
9,8,Bush,Republican,37.4,1992,loss


In [58]:
elections_sample.reset_index(drop=True)

Unnamed: 0,Candidate,Party,%,Year,Result
0,Dole,Republican,40.7,1996,loss
1,Clinton,Democratic,49.2,1996,win
2,Clinton,Democratic,48.2,2016,loss
3,Bush,Republican,47.9,2000,win
4,Romney,Republican,47.2,2012,loss
5,Carter,Democratic,41.0,1980,loss
6,Gore,Democratic,48.4,2000,loss
7,Trump,Republican,46.1,2016,win
8,Bush,Republican,50.7,2004,win
9,Bush,Republican,37.4,1992,loss
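
To make the relationship between `.set_index()` and `.reset_index()` concrete, here is a minimal sketch on a hypothetical three-row table `demo` with non-default row labels.

```python
import pandas as pd

demo = pd.DataFrame(
    {'Candidate': ['Dole', 'Clinton', 'Bush'], 'Year': [1996, 1996, 2000]},
    index=[11, 10, 14],
)

by_year = demo.set_index('Year')   # 'Year' moves from a column to the index
restored = by_year.reset_index()   # 'Year' returns as a column; rows get a fresh default index

print(by_year.index.name)
print(list(restored.columns))
print(list(restored.index))
```

Note that the original labels (11, 10, 14) are discarded by `.set_index()`, so the round trip does not bring them back.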


The index allows you to reference *rows* by *name*. You will see this in a moment when we talk about slicing.  

**Note:** The index does not need to be unique. Remember our random sample with replacement from earlier ([Getting a quick look at your DataFrame](#section4))? Row index 3 appears twice!

In [59]:
elections.sample(10, random_state=0, replace=True).head(10)

Unnamed: 0,Candidate,Party,%,Year,Result
12,Perot,Independent,8.4,1996,loss
15,Kerry,Democratic,48.3,2004,loss
21,Clinton,Democratic,48.2,2016,loss
0,Reagan,Republican,50.7,1980,win
3,Reagan,Republican,58.8,1984,win
3,Reagan,Republican,58.8,1984,win
7,Clinton,Democratic,43.0,1992,win
9,Perot,Independent,18.9,1992,loss
19,Obama,Democratic,51.1,2012,win
21,Clinton,Democratic,48.2,2016,loss


**Note:** Recall that the columns are also a type of index. We can get the list of column names, which can be used to reference columns by name.

In [60]:
elections.columns

Index(['Candidate', 'Party', '%', 'Year', 'Result'], dtype='object')

<a id='section7'></a>
## 7. Accessing Rows and Columns (Slicing)
There are many ways to access rows and columns of a Pandas DataFrame.  We will spend some time reviewing the most used options. You can access columns using the square `[  ]` brackets.
### Columns

In [61]:
elections_sample

Unnamed: 0,Candidate,Party,%,Year,Result
11,Dole,Republican,40.7,1996,loss
10,Clinton,Democratic,49.2,1996,win
21,Clinton,Democratic,48.2,2016,loss
14,Bush,Republican,47.9,2000,win
20,Romney,Republican,47.2,2012,loss
1,Carter,Democratic,41.0,1980,loss
13,Gore,Democratic,48.4,2000,loss
22,Trump,Republican,46.1,2016,win
16,Bush,Republican,50.7,2004,win
8,Bush,Republican,37.4,1992,loss


You can pass a list of column names to select only those columns.

In [62]:
elections_sample[['Candidate','Year', 'Result']]

Unnamed: 0,Candidate,Year,Result
11,Dole,1996,loss
10,Clinton,1996,win
21,Clinton,2016,loss
14,Bush,2000,win
20,Romney,2012,loss
1,Carter,1980,loss
13,Gore,2000,loss
22,Trump,2016,win
16,Bush,2004,win
8,Bush,1992,loss


If you pass a list with a single element you get back a DataFrame.

In [63]:
elections_sample[['Candidate']]

Unnamed: 0,Candidate
11,Dole
10,Clinton
21,Clinton
14,Bush
20,Romney
1,Carter
13,Gore
22,Trump
16,Bush
8,Bush


If you pass a single column name as a string, you get back a Series.

In [64]:
elections_sample['Candidate']

11       Dole
10    Clinton
21    Clinton
14       Bush
20     Romney
1      Carter
13       Gore
22      Trump
16       Bush
8        Bush
Name: Candidate, dtype: object
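
The difference between the two forms can be confirmed by checking the type and shape of each result. A minimal sketch on a hypothetical two-row table `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'Candidate': ['Dole', 'Clinton'], 'Year': [1996, 1996]})

as_series = demo['Candidate']    # string -> Series (one-dimensional)
as_frame = demo[['Candidate']]   # list of one string -> DataFrame (two-dimensional)

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```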

You can modify and even add columns using the square brackets `[ ]`.

In [65]:
temp = elections_sample.copy()
temp['Year'] = temp['Year'] * -1 + 25.
temp

Unnamed: 0,Candidate,Party,%,Year,Result
11,Dole,Republican,40.7,-1971.0,loss
10,Clinton,Democratic,49.2,-1971.0,win
21,Clinton,Democratic,48.2,-1991.0,loss
14,Bush,Republican,47.9,-1975.0,win
20,Romney,Republican,47.2,-1987.0,loss
1,Carter,Democratic,41.0,-1955.0,loss
13,Gore,Democratic,48.4,-1975.0,loss
22,Trump,Republican,46.1,-1991.0,win
16,Bush,Republican,50.7,-1979.0,win
8,Bush,Republican,37.4,-1967.0,loss


We can add a new column by assignment.

In [66]:
temp['Corrected Year'] = temp['Year'] * -1 + 25.
temp

Unnamed: 0,Candidate,Party,%,Year,Result,Corrected Year
11,Dole,Republican,40.7,-1971.0,loss,1996.0
10,Clinton,Democratic,49.2,-1971.0,win,1996.0
21,Clinton,Democratic,48.2,-1991.0,loss,2016.0
14,Bush,Republican,47.9,-1975.0,win,2000.0
20,Romney,Republican,47.2,-1987.0,loss,2012.0
1,Carter,Democratic,41.0,-1955.0,loss,1980.0
13,Gore,Democratic,48.4,-1975.0,loss,2000.0
22,Trump,Republican,46.1,-1991.0,win,2016.0
16,Bush,Republican,50.7,-1979.0,win,2004.0
8,Bush,Republican,37.4,-1967.0,loss,1992.0


In [67]:
temp['random'] = np.random.randn(temp.shape[0])
temp

Unnamed: 0,Candidate,Party,%,Year,Result,Corrected Year,random
11,Dole,Republican,40.7,-1971.0,loss,1996.0,-0.659946
10,Clinton,Democratic,49.2,-1971.0,win,1996.0,1.200291
21,Clinton,Democratic,48.2,-1991.0,loss,2016.0,-1.248692
14,Bush,Republican,47.9,-1975.0,win,2000.0,1.171919
20,Romney,Republican,47.2,-1987.0,loss,2012.0,-0.14897
1,Carter,Democratic,41.0,-1955.0,loss,1980.0,0.72191
13,Gore,Democratic,48.4,-1975.0,loss,2000.0,0.335168
22,Trump,Republican,46.1,-1991.0,win,2016.0,-0.965496
16,Bush,Republican,50.7,-1979.0,win,2004.0,0.653243
8,Bush,Republican,37.4,-1967.0,loss,1992.0,-0.620461


### Accessing rows and columns by label using `.loc[ ]`
You can access rows and columns of a DataFrame by name using the `.loc[ ]` syntax.

In [68]:
elections_sample

Unnamed: 0,Candidate,Party,%,Year,Result
11,Dole,Republican,40.7,1996,loss
10,Clinton,Democratic,49.2,1996,win
21,Clinton,Democratic,48.2,2016,loss
14,Bush,Republican,47.9,2000,win
20,Romney,Republican,47.2,2012,loss
1,Carter,Democratic,41.0,1980,loss
13,Gore,Democratic,48.4,2000,loss
22,Trump,Republican,46.1,2016,win
16,Bush,Republican,50.7,2004,win
8,Bush,Republican,37.4,1992,loss


The syntax for `.loc` is:

```
df.loc[rows_list, column_list]
```
We can pass a list of row names (index values).

In [69]:
elections_sample.loc[[11, 8], ['Party', 'Year']]

Unnamed: 0,Party,Year
11,Republican,1996
8,Republican,1992


In [70]:
elections_sample.loc[:, ['Party', 'Year']]

Unnamed: 0,Party,Year
11,Republican,1996
10,Democratic,1996
21,Democratic,2016
14,Republican,2000
20,Republican,2012
1,Democratic,1980
13,Democratic,2000
22,Republican,2016
16,Republican,2004
8,Republican,1992


In [71]:
elections_sample.loc[[11, 8]]

Unnamed: 0,Candidate,Party,%,Year,Result
11,Dole,Republican,40.7,1996,loss
8,Bush,Republican,37.4,1992,loss


In [72]:
elections_sample.loc[[11, 8], :]

Unnamed: 0,Candidate,Party,%,Year,Result
11,Dole,Republican,40.7,1996,loss
8,Bush,Republican,37.4,1992,loss


`.loc` also supports slicing (for all types, including numeric and string labels!). Note that slicing with `.loc` is **inclusive** of the end label, even for numeric slices.  In general, avoid range slicing with `.loc`.  

In [73]:
elections.loc[0:10, 'Candidate':'Year']

Unnamed: 0,Candidate,Party,%,Year
0,Reagan,Republican,50.7,1980
1,Carter,Democratic,41.0,1980
2,Anderson,Independent,6.6,1980
3,Reagan,Republican,58.8,1984
4,Mondale,Democratic,37.6,1984
5,Bush,Republican,53.4,1988
6,Dukakis,Democratic,45.6,1988
7,Clinton,Democratic,43.0,1992
8,Bush,Republican,37.4,1992
9,Perot,Independent,18.9,1992
10,Clinton,Democratic,49.2,1996


Range slicing works for `elections`. If we try the same thing for `elections_sample`, we get the following `ValueError`.
```
elections_sample.loc[0:10, 'Candidate':'Year']

Returns:
ValueError: index must be monotonic increasing or decreasing
```
Keep in mind that the slice endpoints are index labels, not positions, and that label ranges are only well defined when the index is sorted (monotonically increasing or decreasing).
<br>
<br>
If we try the same thing after sorting by index in ascending order, it will work.

In [74]:
elections_sample.sort_index().loc[0:10, 'Candidate':'Year']

Unnamed: 0,Candidate,Party,%,Year
1,Carter,Democratic,41.0,1980
8,Bush,Republican,37.4,1992
10,Clinton,Democratic,49.2,1996
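One way to check in advance whether a label range is safe is to inspect the index before slicing. The sketch below uses a hypothetical toy frame (`shuffled`) rather than `elections_sample`, and relies on the `is_monotonic_increasing` attribute of a pandas index.

```python
import pandas as pd

# Hypothetical frame whose index labels are out of order
shuffled = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[11, 1, 8, 3])
print(shuffled.index.is_monotonic_increasing)

# Sorting the index makes label ranges well defined
ordered = shuffled.sort_index()
print(ordered.index.is_monotonic_increasing)

# The endpoints 0 and 10 are not labels in the index, but the slice still
# works because the index is sorted.
subset = ordered.loc[0:10]
print(subset)
```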


This functionality can be very useful when the index is set to a column, for example `Year`. We can then use `.loc` to filter the DataFrame.

In [75]:
elections_year = elections.set_index('Year').sort_index()
elections_year

Year,Candidate,Party,%,Result
1980,Reagan,Republican,50.7,win
1980,Carter,Democratic,41.0,loss
1980,Anderson,Independent,6.6,loss
1984,Reagan,Republican,58.8,win
1984,Mondale,Democratic,37.6,loss
1988,Bush,Republican,53.4,win
1988,Dukakis,Democratic,45.6,loss
1992,Clinton,Democratic,43.0,win
1992,Bush,Republican,37.4,loss
1992,Perot,Independent,18.9,loss


Let's say we want to return all election results from 1980 to 2004.

In [76]:
elections_year.loc[1980:2004]

Year,Candidate,Party,%,Result
1980,Reagan,Republican,50.7,win
1980,Carter,Democratic,41.0,loss
1980,Anderson,Independent,6.6,loss
1984,Reagan,Republican,58.8,win
1984,Mondale,Democratic,37.6,loss
1988,Bush,Republican,53.4,win
1988,Dukakis,Democratic,45.6,loss
1992,Clinton,Democratic,43.0,win
1992,Bush,Republican,37.4,loss
1992,Perot,Independent,18.9,loss
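Label slicing also works when the sorted index contains repeated labels, as it does here with several candidates per year. A minimal sketch on a hypothetical toy frame (`toy`, with a few rows borrowed from the table above) shows that every row carrying an endpoint label is kept.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1988, 1980, 1984, 1980, 1988, 1992],
    "Candidate": ["Bush", "Reagan", "Mondale", "Carter", "Dukakis", "Clinton"],
})
by_year = toy.set_index("Year").sort_index()

# Both endpoints are included, and all rows labelled 1980 or 1988 are returned.
eighties = by_year.loc[1980:1988]
print(eighties)
```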


If you give `.loc` a single scalar argument for the row and a single scalar argument for the column, you get back just a single value.

In [77]:
elections.loc[19, 'Candidate']

'Obama'
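Whether each argument is a scalar or a list changes the type of the result. The following sketch, on a hypothetical two-row frame (`toy`), contrasts the three cases.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Candidate": ["Reagan", "Carter"], "Party": ["Republican", "Democratic"]}
)

value = toy.loc[1, "Candidate"]       # scalar row, scalar column: a single value
row = toy.loc[1, ["Candidate"]]       # scalar row, list of columns: a Series
frame = toy.loc[[1], ["Candidate"]]   # list of rows, list of columns: a DataFrame

print(type(value).__name__, type(row).__name__, type(frame).__name__)
```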

<a id='section8'></a>
## 8. Boolean Array Selection

`.loc[ ]` and `[ ]` support arrays of booleans as an input. In this case, the array must be exactly as long as the number of rows or columns. The result is a filtered version of the data frame, where only rows corresponding to `True` appear. This functionality is similar to `WHERE` in **SQL**.

In [78]:
elections_sample.shape

(10, 5)

The `elections_sample` DataFrame has 10 rows, so if we create a list of 10 Boolean values, we can use it to filter the DataFrame.

In [79]:
boolean_list = [False, False, False, False, True, 
                False, False, True, True, False]

elections_sample.loc[boolean_list]

Unnamed: 0,Candidate,Party,%,Year,Result
20,Romney,Republican,47.2,2012,loss
22,Trump,Republican,46.1,2016,win
16,Bush,Republican,50.7,2004,win


You can also pass the same arguments to the `[ ]` operator.

In [80]:
elections_sample[boolean_list]

Unnamed: 0,Candidate,Party,%,Year,Result
20,Romney,Republican,47.2,2012,loss
22,Trump,Republican,46.1,2016,win
16,Bush,Republican,50.7,2004,win
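A Boolean list can select columns as well as rows, provided its length matches the axis it is applied to. The sketch below uses a hypothetical toy frame (`toy`); the exact exception type raised for a wrong-length mask is not assumed here, so both likely candidates are caught.

```python
import pandas as pd

toy = pd.DataFrame({"Candidate": ["Dole", "Clinton"],
                    "Party": ["Republican", "Democratic"],
                    "Year": [1996, 1996]})

# One boolean per column: keep 'Candidate' and 'Year', drop 'Party'
column_mask = [True, False, True]
kept = toy.loc[:, column_mask]
print(kept)

# A mask of the wrong length is rejected rather than silently truncated
try:
    toy.loc[[True, False, True]]      # 3 booleans for 2 rows
    mismatch_rejected = False
except (IndexError, ValueError):
    mismatch_rejected = True
print(mismatch_rejected)
```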


One very common task in Data Science is filtering. Boolean Array Selection is one way to achieve this in Pandas. We start by observing that comparison operators such as the equality operator `==` can be applied to a Pandas Series to generate a Boolean array. For example, we can compare the `Result` column to the string `'win'`.

In [81]:
elections.head()

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss


In [82]:
iswin = elections['Result'] == 'win'
iswin

0      True
1     False
2     False
3      True
4     False
5      True
6     False
7      True
8     False
9     False
10     True
11    False
12    False
13    False
14     True
15    False
16     True
17     True
18    False
19     True
20    False
21    False
22     True
Name: Result, dtype: bool

The output of the logical operator applied to the Series is another Series with the same name and index, but of datatype boolean. The entry at row **i** represents the result of the application of that operator to the entry of the original Series at row **i**.
<br>
<br>
Such a boolean Series can be used as an argument to the `[ ]` operator. For example, the following code creates a DataFrame of all election winners since 1980.

In [83]:
elections[iswin]

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
3,Reagan,Republican,58.8,1984,win
5,Bush,Republican,53.4,1988,win
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
14,Bush,Republican,47.9,2000,win
16,Bush,Republican,50.7,2004,win
17,Obama,Democratic,52.9,2008,win
19,Obama,Democratic,51.1,2012,win
22,Trump,Republican,46.1,2016,win


Above, we've assigned the result of the logical operator to a new variable called `iswin`. This is uncommon. Usually, the series is created and used on the same line. Such code is a little tricky to read at first, but you'll get used to it quickly.

In [84]:
elections[elections['Result'] == 'win']

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
3,Reagan,Republican,58.8,1984,win
5,Bush,Republican,53.4,1988,win
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
14,Bush,Republican,47.9,2000,win
16,Bush,Republican,50.7,2004,win
17,Obama,Democratic,52.9,2008,win
19,Obama,Democratic,51.1,2012,win
22,Trump,Republican,46.1,2016,win


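We can filter on multiple criteria by creating multiple boolean Series and combining them with the `&` operator, which means **And**. Each condition must be wrapped in parentheses because `&` binds more tightly than comparison operators such as `==` and `<`.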

In [85]:
elections[
    (elections['Result'] == 'win') & 
    (elections['%'] < 50)
]

Unnamed: 0,Candidate,Party,%,Year,Result
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
14,Bush,Republican,47.9,2000,win
22,Trump,Republican,46.1,2016,win
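It is tempting to combine conditions with Python's `and` keyword, but that does not work element-wise on a Series. The sketch below, on a hypothetical three-row frame (`toy`), shows the working `&` form and the error raised by `and`.

```python
import pandas as pd

toy = pd.DataFrame({"Result": ["win", "loss", "win"], "%": [43.0, 37.4, 58.8]})

# Each comparison is wrapped in parentheses before combining with &
close_wins = toy[(toy["Result"] == "win") & (toy["%"] < 50)]
print(close_wins)

# `and` asks for the truth value of a whole Series, which is ambiguous,
# so pandas raises a ValueError.
try:
    toy[(toy["Result"] == "win") and (toy["%"] < 50)]
    and_failed = False
except ValueError:
    and_failed = True
print(and_failed)
```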


We can negate a condition using the `~` operator, which means **Not**.

In [86]:
elections[
    (elections['Result'] == 'win') & 
    ~(elections['%'] < 50)
]

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
3,Reagan,Republican,58.8,1984,win
5,Bush,Republican,53.4,1988,win
16,Bush,Republican,50.7,2004,win
17,Obama,Democratic,52.9,2008,win
19,Obama,Democratic,51.1,2012,win


We can also combine conditions using the `|` operator, which means **Or**.

In [87]:
elections[
    ~((elections['Party'] == "Democratic") | 
      (elections['Party'] == "Republican"))
]

Unnamed: 0,Candidate,Party,%,Year,Result
2,Anderson,Independent,6.6,1980,loss
9,Perot,Independent,18.9,1992,loss
12,Perot,Independent,8.4,1996,loss


In [88]:
elections[
    (elections['Party'] != "Democratic") & 
    (elections['Party'] != "Republican")
]

Unnamed: 0,Candidate,Party,%,Year,Result
2,Anderson,Independent,6.6,1980,loss
9,Perot,Independent,18.9,1992,loss
12,Perot,Independent,8.4,1996,loss


If we want to test a column against several values (say Republican or Democratic), we can use the `.isin()` method to simplify our code.

In [89]:
elections[~elections['Party'].isin(['Republican', 'Democratic'])]

Unnamed: 0,Candidate,Party,%,Year,Result
2,Anderson,Independent,6.6,1980,loss
9,Perot,Independent,18.9,1992,loss
12,Perot,Independent,8.4,1996,loss
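To see that `.isin()` is equivalent to a chain of `|` comparisons, the two forms can be compared directly. This is a small sketch on a hypothetical four-row frame (`toy`); `major` is an illustrative name.

```python
import pandas as pd

toy = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Perot"],
    "Party": ["Republican", "Democratic", "Independent", "Independent"],
})
major = ["Republican", "Democratic"]

with_or = toy[(toy["Party"] == "Republican") | (toy["Party"] == "Democratic")]
with_isin = toy[toy["Party"].isin(major)]
print(with_or.equals(with_isin))

# Negating the mask keeps everyone outside the listed parties
others = toy[~toy["Party"].isin(major)]
print(others)
```

The `.isin()` form scales better as the list of accepted values grows, since only the list changes.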


<a id='section9'></a>
## 9. Slicing using `.iloc`

`.loc`'s cousin `.iloc` is very similar, but selects by numerical position instead of by label. For example, to access the first 3 rows and first 3 columns of a table, we can use `.iloc[0:3, 0:3]`. `.iloc` slicing is **exclusive** of the end position, just like standard Python slicing.

In [90]:
elections_sample

Unnamed: 0,Candidate,Party,%,Year,Result
11,Dole,Republican,40.7,1996,loss
10,Clinton,Democratic,49.2,1996,win
21,Clinton,Democratic,48.2,2016,loss
14,Bush,Republican,47.9,2000,win
20,Romney,Republican,47.2,2012,loss
1,Carter,Democratic,41.0,1980,loss
13,Gore,Democratic,48.4,2000,loss
22,Trump,Republican,46.1,2016,win
16,Bush,Republican,50.7,2004,win
8,Bush,Republican,37.4,1992,loss


In [91]:
elections_sample.iloc[2:, 3:5]

Unnamed: 0,Year,Result
21,2016,loss
14,2000,win
20,2012,loss
1,1980,loss
13,2000,loss
22,2016,win
16,2004,win
8,1992,loss


In [92]:
elections_sample.iloc[::2, 3:5]

Unnamed: 0,Year,Result
11,1996,loss
21,2016,loss
20,2012,loss
13,2000,loss
16,2004,win


In [93]:
elections_sample.iloc[5:-1, 3:5]

Unnamed: 0,Year,Result
1,1980,loss
13,2000,loss
22,2016,win
16,2004,win
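When the index labels are not in order, as in `elections_sample`, the same integer means different things to `.loc` and `.iloc`. A minimal sketch on a hypothetical three-row frame (`toy`) makes the distinction explicit.

```python
import pandas as pd

toy = pd.DataFrame({"Candidate": ["Dole", "Clinton", "Carter"]}, index=[11, 10, 1])

by_position = toy.iloc[1, 0]         # second row, first column
by_label = toy.loc[1, "Candidate"]   # the row whose index label is 1

print(by_position, by_label)
```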


### Caution
We will use both `.loc` and `.iloc` in the course. `.loc` is generally preferred for a number of reasons, for example: 

1. It is harder to make mistakes since you have to literally write out what you want to get.
2. Code is easier to read, because the reader doesn't have to know e.g. what column **#31** represents.
3. It is robust against permutations of the data, e.g. the Social Security Administration switches the order of two columns.

However, `.iloc` is sometimes more convenient. We'll provide examples of when `.iloc` is the superior choice.
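The third point in the list above can be illustrated with a short sketch. The frames `original` and `reordered` are hypothetical and hold the same data with the columns in a different order.

```python
import pandas as pd

original = pd.DataFrame({"Candidate": ["Reagan"], "Party": ["Republican"], "Year": [1980]})
reordered = original[["Party", "Candidate", "Year"]]   # same data, columns permuted

# Label-based selection returns the same value either way
same_by_label = original.loc[0, "Candidate"] == reordered.loc[0, "Candidate"]

# Position-based selection silently returns a different column
same_by_position = original.iloc[0, 0] == reordered.iloc[0, 0]

print(same_by_label, same_by_position)
```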

## Quick Challenge
Which of the following expressions return a DataFrame of the first 3 Candidate and Party names for candidates that won with more than 50% of the vote?

In [94]:
elections.iloc[[0, 3, 5], [0, 3]]

Unnamed: 0,Candidate,Year
0,Reagan,1980
3,Reagan,1984
5,Bush,1988


In [95]:
elections.loc[[0, 3, 5], 'Candidate': 'Year']

Unnamed: 0,Candidate,Party,%,Year
0,Reagan,Republican,50.7,1980
3,Reagan,Republican,58.8,1984
5,Bush,Republican,53.4,1988


In [96]:
elections.loc[elections['%'] > 50, ['Candidate', 'Year']].head(3)

Unnamed: 0,Candidate,Year
0,Reagan,1980
3,Reagan,1984
5,Bush,1988


In [97]:
elections.loc[elections['%'] > 50, ['Candidate', 'Year']].iloc[0:2, :]

Unnamed: 0,Candidate,Year
0,Reagan,1980
3,Reagan,1984


## Baby Names Data

We will start working with the baby names dataset next lecture. If you're interested, you can get a head start.

Now let's play around a bit with the large baby names dataset. We'll start by loading that dataset from the Social Security Administration's website.

To keep the data small enough to avoid crashing **JupyterHub**, we're going to look at only New York rather than looking at the national dataset.

In [98]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

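# The member name may differ between releases of the archive
# (e.g. 'NY.TXT' vs. 'STATE.NY.TXT'), so look it up instead of hard-coding it.
ny_name = next(name for name in zf.namelist() if name.upper().endswith('NY.TXT'))
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ny_name) as fh:
    baby_names = pd.read_csv(fh, header=None, names=field_names)

baby_names.sample(5)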

**Goal 1:** Find the 20 most popular female baby names in New York in 2018.

<details>
    <summary>Solution</summary>
<code>
baby_names[
    (baby_names['Year'] == 2018) & 
    (baby_names['Sex'] == 'F')
].sort_values(by='Count', ascending=False).head(20)
</code>
</details>

In [99]:
# Solution here

**Goal 2:** Make a plot of how many baby boys were named **Avery** over the years.

<details>
    <summary>Solution</summary>
<code>
_ = baby_names[
    (baby_names['Name'] == 'Avery') & 
    (baby_names['Sex'] == 'M')
].plot(x='Year', y='Count', marker='.');
</code>
</details>

In [None]:
# Solution here