# Working With Data Using Pandas

In [37]:
import pandas as pd
import glob
import os
import matplotlib.pyplot as plt

I'm getting older. Let's say I was born in 1890 and I'd like to know how common my first name was that year. The file could be really long, and I'd rather not scroll through it manually looking for my name (Ctrl + F is clearly no fun). I'll use pandas instead.

I've downloaded the data from Social Security, which includes name data from 1882 to 2017. Each year's data is stored in a separate file called `yob****.txt`, where the `****` stands for the year of birth. Inspecting the files, you see that each row is formatted as _name, sex, count_ and that the rows are sorted by sex, then by count. Comma-separated: we know how to deal with that!

In [47]:
# First we read in the data. There are no column names in the file, so we pass header=None
# and supply the names ourselves.
data_1890 = pd.read_csv('~/intro-pandas/data/yob1890.txt', header=None, names=['Name', 'Sex', 'Count'])
data_1890.head()

|  | Name | Sex | Count |
|---|---|---|---|
| 0 | Mary | F | 12078 |
| 1 | Anna | F | 5233 |
| 2 | Elizabeth | F | 3112 |
| 3 | Margaret | F | 3100 |
| 4 | Emma | F | 2980 |
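To see why `header=None` matters, here is a minimal sketch on a made-up three-line sample (the `sample_text` string is my own stand-in, not the real file). Without `header=None`, pandas treats the first data row as the column labels and we silently lose a name.

```python
import io
import pandas as pd

# Hypothetical sample in the same name,sex,count layout as the yob files.
sample_text = "Mary,F,12078\nAnna,F,5233\nJohn,M,8502\n"

# Default behaviour: the first line is consumed as the header row.
no_header = pd.read_csv(io.StringIO(sample_text))
print(list(no_header.columns))  # the Mary row became the column labels
print(len(no_header))           # only two data rows are left

# header=None keeps every line as data; names= supplies the labels.
df = pd.read_csv(io.StringIO(sample_text), header=None,
                 names=['Name', 'Sex', 'Count'])
print(df)
```

The second call is the same pattern used on `yob1890.txt`, just pointed at an in-memory string instead of a file path.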


Nice. Let's see if I can just get the male names.

In [18]:
males_1890 = data_1890[data_1890['Sex'] == 'M']
males_1890.head()

|  | Name | Sex | Count |
|---|---|---|---|
| 1534 | John | M | 8502 |
| 1535 | William | M | 7494 |
| 1536 | James | M | 5097 |
| 1537 | George | M | 4458 |
| 1538 | Charles | M | 4061 |
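If the filtering line looks like magic, it helps to split it in two. The sketch below uses a small made-up table (`toy` is a hypothetical name, and only four rows) to show that the comparison produces a column of True/False values, and that indexing with it keeps only the True rows along with their original row labels.

```python
import pandas as pd

# Hypothetical four-row stand-in for the 1890 data.
toy = pd.DataFrame({
    'Name': ['Mary', 'Anna', 'John', 'William'],
    'Sex': ['F', 'F', 'M', 'M'],
    'Count': [12078, 5233, 8502, 7494],
})

# Step 1: the comparison gives one boolean per row.
mask = toy['Sex'] == 'M'
print(mask)

# Step 2: indexing with the mask keeps the rows where it is True.
males = toy[mask]
print(males)  # the row labels are still 2 and 3, not 0 and 1
```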


Wow, 8500 Johns! But the row labels still start at 1534, left over from the full table, and I'd like John to be the number one name in my new rankings. Let's reset the index of the dataframe.

In [24]:
males_1890 = males_1890.reset_index(drop=True)  # discard the old index and restart the count at 0
males_1890.head()

|  | Name | Sex | Count |
|---|---|---|---|
| 0 | John | M | 8502 |
| 1 | William | M | 7494 |
| 2 | James | M | 5097 |
| 3 | George | M | 4458 |
| 4 | Charles | M | 4061 |


Better, but it'd be nice to start counting from 1, no? 

In [26]:
males_1890.index = range(1, len(males_1890) + 1)
males_1890.head()

|  | Name | Sex | Count |
|---|---|---|---|
| 1 | John | M | 8502 |
| 2 | William | M | 7494 |
| 3 | James | M | 5097 |
| 4 | George | M | 4458 |
| 5 | Charles | M | 4061 |
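One thing to keep in mind once the index starts at 1: row labels and row positions no longer agree. A minimal sketch on a made-up three-row table (`males` here is a hypothetical stand-in, not the real dataframe) shows the difference between `.loc`, which looks up by label, and `.iloc`, which looks up by position.

```python
import pandas as pd

# Hypothetical three-row stand-in, already sorted by count.
males = pd.DataFrame({
    'Name': ['John', 'William', 'James'],
    'Count': [8502, 7494, 5097],
})

# Same trick as above: relabel the rows 1, 2, 3.
males.index = range(1, len(males) + 1)

# .loc uses the labels, so the top name now lives at label 1 ...
first_by_label = males.loc[1, 'Name']

# ... while .iloc still counts positions from 0.
first_by_position = males['Name'].iloc[0]

print(first_by_label, first_by_position)
```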


And maybe add a label?

In [27]:
males_1890.index.name = 'Rank'
males_1890.head()

| Rank | Name | Sex | Count |
|---|---|---|---|
| 1 | John | M | 8502 |
| 2 | William | M | 7494 |
| 3 | James | M | 5097 |
| 4 | George | M | 4458 |
| 5 | Charles | M | 4061 |


Moment of truth; let's find out just how cool my name is:

In [28]:
males_1890[males_1890['Name'] == 'David']

| Rank | Name | Sex | Count |
|---|---|---|---|
| 26 | David | M | 731 |
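The lookup above returns a one-row dataframe. If you'd rather have the rank as a plain number, one possible approach is to read it off the index of the matching rows. The sketch below uses a made-up three-row table, so the ranks in it are not the real 1890 ranks, and `ranked`, `rank` and the name `Zed` are hypothetical.

```python
import pandas as pd

# Hypothetical ranked table; the real one has many more rows.
ranked = pd.DataFrame(
    {'Name': ['John', 'William', 'David'], 'Count': [8502, 7494, 731]},
    index=pd.Index([1, 2, 3], name='Rank'),
)

# The index of the matching rows holds the rank(s).
matches = ranked.loc[ranked['Name'] == 'David'].index
rank = int(matches[0]) if len(matches) > 0 else None
print(rank)

# A name that isn't in the table gives an empty result rather than an error.
missing = ranked.loc[ranked['Name'] == 'Zed'].index
print(len(missing))
```

Checking the length first matters: indexing `matches[0]` on an empty result would raise an `IndexError`.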



Your turn.

__Bonus:__ Can you figure out how your name ranks for your birth year _regardless of sex?_ 

In [54]:
# Your code here

__Bonus 2:__ Can you track the popularity of your name over time from the earliest to the latest data point? Store your answer as a list called `ranks`. I've gotten you started by creating a list of the file names for you to iterate over.

In [55]:
fnames = sorted(glob.glob('intro-pandas/data/*.txt'))
fnames[:5] #truncated to save space

# Write your code here

['intro-pandas/data/yob1882.txt',
 'intro-pandas/data/yob1883.txt',
 'intro-pandas/data/yob1884.txt',
 'intro-pandas/data/yob1885.txt',
 'intro-pandas/data/yob1886.txt']

In [56]:
# Run this after you've solved Bonus 2 (You'll need to uncomment each line first)
#plt.title('Popularity of <yourname>: 1882-2017')
#plt.xlabel('Year')
#plt.ylabel('Rank')
#years = range(1882, 2018)
#plt.plot(years, ranks, '.')
#plt.show()
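A hint for the plot, since it can trip you up: `plt.plot(years, ranks, '.')` needs `years` and `ranks` to have the same length, but a name may not appear in every year's file. One way to handle that, sketched here on made-up in-memory tables rather than the real files, is to record `NaN` for the missing years; matplotlib simply leaves those points out. The helper `rank_of` and the `toy_years` data are hypothetical, and the sketch assumes you want the rank by count within whatever table you pass in.

```python
import math
import pandas as pd

def rank_of(frame, name):
    """Return the 1-based rank of `name` by Count, or NaN if it is absent."""
    ordered = frame.sort_values('Count', ascending=False).reset_index(drop=True)
    hits = ordered.loc[ordered['Name'] == name].index
    return int(hits[0]) + 1 if len(hits) > 0 else float('nan')

# Hypothetical per-year tables standing in for the yob files.
toy_years = {
    1882: pd.DataFrame({'Name': ['John', 'David'], 'Count': [100, 50]}),
    1883: pd.DataFrame({'Name': ['John'], 'Count': [120]}),
    1884: pd.DataFrame({'Name': ['David', 'John'], 'Count': [90, 80]}),
}

# One entry per year, so the list lines up with the years on the x-axis.
ranks = [rank_of(toy_years[year], 'David') for year in sorted(toy_years)]
print(ranks)
```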