# Data Analysis With Python

This example shows how Python with pandas can be used for effective programmatic data analysis. The goals of this analysis are to:

1. Read in a collection of .txt files containing the counts of baby names by year
2. Combine them into a single data frame
3. Perform exploratory data analysis on that data frame

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline  

The loading loop later in this notebook builds each file path with a string formatter: it takes the integer year from the range of years (starting in 1880) and inserts it into the path string. This is done by including %d, the placeholder for a number, whereas %s is the placeholder for a string. Before you run that code, here are a couple of examples:

In [None]:
name = 'Bob'
number = 1980
print('%s %d' % (name, number))

# Or embedded in a sentence
print('%s was born in %d' % (name, number))

# Multiple instances
name2 = 'Sam'
print('%s and %s were both born in %d' % (name, name2, number))
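
To connect the formatter to the file-loading step, the short sketch below applies %d to a file name pattern. The helper name `build_filename` is hypothetical and used only for illustration; the `yob%d.txt` pattern is taken from the path used later in this notebook.

```python
def build_filename(year):
    # %d is replaced by the integer year
    return 'yob%d.txt' % year

# Build the file names for the first three years
filenames = [build_filename(year) for year in range(1880, 1883)]
print(filenames)
```

Note that `range(1880, 1883)` stops before 1883, so three names are produced.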

In [None]:
columns = ['name', 'sex', 'births']  # Column headers for the raw files
pieces = []  # One DataFrame per year, combined after the loop
years = range(1880, 2014)  # range() excludes the stop value, so this covers 1880 through 2013

# Loop through all of the files, add a column for the year,
# and collect the frames so they can be combined into one large data frame
for year in years:
    path = r'C:\Users\bharder\Dropbox\Ben\Python\Babynames\yob%d.txt' % year  # Build the file path for this year
    frame = pd.read_csv(path, names=columns)  # Read the .txt file as a CSV, using the provided column names
    frame['year'] = year  # Add a new column holding the year in question
    pieces.append(frame)  # Keep this year's frame for the final concatenation

# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)
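
As a minimal sketch of what the loop does, the example below runs the same read, tag, and combine steps on two small in-memory samples instead of files on disk. The sample rows and the `sample_files` dictionary are made up for illustration and are not taken from the real data set.

```python
import io
import pandas as pd

columns = ['name', 'sex', 'births']

# Hypothetical file contents keyed by year (illustrative values only)
sample_files = {
    1880: "Mary,F,10\nJohn,M,8\n",
    1881: "Mary,F,12\nJohn,M,9\nAnna,F,5\n",
}

pieces = []
for year, text in sorted(sample_files.items()):
    # The files have no header row, so the column names are supplied explicitly
    frame = pd.read_csv(io.StringIO(text), names=columns)
    frame['year'] = year  # Tag every row with the year it came from
    pieces.append(frame)

# ignore_index=True gives the combined frame a fresh 0..n-1 index
sample_names = pd.concat(pieces, ignore_index=True)
print(sample_names)
```

Collecting the frames in a list and concatenating once avoids copying the growing frame on every iteration.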

In [None]:
names.head()

In [None]:
total_births = names.pivot_table('births', index='year', columns='sex', aggfunc='sum')

# Plot the total births by year
total_births.plot(title='Total births by sex and year')
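
To see what `pivot_table` produces here, the sketch below applies the same kind of call to a small hand-made frame. The rows in `toy` are invented for illustration.

```python
import pandas as pd

# Invented rows in the same shape as the combined names frame
toy = pd.DataFrame({
    'name': ['Mary', 'John', 'Mary', 'John', 'Anna'],
    'sex': ['F', 'M', 'F', 'M', 'F'],
    'births': [10, 8, 12, 9, 5],
    'year': [1880, 1880, 1881, 1881, 1881],
})

# One row per year, one column per sex, each cell the sum of births
toy_totals = toy.pivot_table('births', index='year', columns='sex', aggfunc='sum')
print(toy_totals)
```

Each cell is the sum of `births` for that year and sex, so the 1881 value for F combines Mary and Anna.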

In [None]:
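# Create a function to get the top 1000 names for each sex/year combination
def get_top1000(frame):
    # Order each year/sex group by births (largest first), then keep its first 1000 rows
    ordered = frame.sort_values(by=['year', 'sex', 'births'], ascending=[True, True, False])
    return ordered.groupby(['year', 'sex']).head(1000)

# Reset to a plain integer index so 'year' and 'sex' stay ordinary columns
top1000 = get_top1000(names).reset_index(drop=True)
# Split the most popular names into boys and girls
boys = top1000[top1000.sex == 'M']
girls = top1000[top1000.sex == 'F']
# Aggregate total births by year and name
total_births = top1000.pivot_table('births', index='year', columns='name', aggfunc='sum')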
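
The top-N-per-group idea can be checked on a small frame. The sketch below keeps the top 2 names per year/sex group by sorting and then taking the first rows of each group with `groupby(...).head(n)`. It is one way to express the idea, under the assumption that ties do not matter; the data and the helper name `top_n_per_group` are made up for illustration.

```python
import pandas as pd

# Invented rows: three girls' names and two boys' names in 1880, two girls' names in 1881
toy = pd.DataFrame({
    'name': ['Mary', 'Anna', 'Emma', 'John', 'Will', 'Mary', 'Anna'],
    'sex': ['F', 'F', 'F', 'M', 'M', 'F', 'F'],
    'births': [10, 5, 7, 8, 6, 12, 4],
    'year': [1880, 1880, 1880, 1880, 1880, 1881, 1881],
})

def top_n_per_group(frame, n):
    # Sort so that, within each year/sex group, the largest counts come first
    ordered = frame.sort_values(by=['year', 'sex', 'births'], ascending=[True, True, False])
    # head(n) keeps the first n rows of every group and retains all columns
    return ordered.groupby(['year', 'sex']).head(n).reset_index(drop=True)

toy_top2 = top_n_per_group(toy, 2)
print(toy_top2)
```

In this toy data, Anna is dropped from the 1880 girls' group because two other names have more births that year.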

In [None]:
# Create a subset for a single name, typed at the prompt
# (input() is Python 3; on Python 2 use raw_input() instead)
subset = total_births[[input()]]
subset.plot(title="Number of births per year")

In [None]:
# Create a subset of the family names, typed one per prompt
# (input() is Python 3; on Python 2 use raw_input() instead)
subset = total_births[[input(), input(), input()]]
subset.plot(subplots=True, figsize=(12, 10), grid=False, title="Number of births per year")
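
Selecting a name that is not among the pivot table's columns raises a `KeyError`. A small guard, sketched below with a hypothetical helper `select_names` and invented data, filters the requested names down to those that exist before anything is plotted.

```python
import pandas as pd

# Invented year-by-name table in the same shape as the pivoted births table
toy_births = pd.DataFrame(
    {'Anna': [5, 4], 'John': [8, 9], 'Mary': [10, 12]},
    index=pd.Index([1880, 1881], name='year'),
)

def select_names(table, requested):
    # Keep only the requested names that are actual columns
    found = [name for name in requested if name in table.columns]
    missing = [name for name in requested if name not in table.columns]
    if missing:
        print('Not found: %s' % ', '.join(missing))
    return table[found]

toy_subset = select_names(toy_births, ['Mary', 'Zed', 'John'])
print(toy_subset)
```

The returned frame keeps the requested order, so the plot legend follows the order in which the names were entered.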

In [None]:
top1000.head()