## Birth Dates in the United States

The raw data behind the story **Some People Are Too Superstitious To Have A Baby on Friday the 13th**, which you can read [here](http://fivethirtyeight.com/features/some-people-are-too-superstitious-to-have-a-baby-on-friday-the-13th/).

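The FiveThirtyEight repository contains two files: one from the Centers for Disease Control and Prevention's National Center for Health Statistics (1994-2003) and one from the Social Security Administration (2000-2014). The URL used below points to the SSA file, `US_births_2000-2014_SSA.csv`. Both files have the following structure: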

- `year` - Year
- `month` - Month
- `date_of_month` - Day number of the month
- `day_of_week` - Day of week, where 1 is Monday and 7 is Sunday
- `births` - Number of births

The data can be downloaded [here](https://github.com/fivethirtyeight/data/tree/master/births/).


### Get the data

There are a few ways to read data directly from GitHub.

In [0]:
url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv"

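One option is the standard library's `urllib` (the `csv` module is imported too, although the cells below split the text manually).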

In [0]:
import csv
import urllib.request as ur

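Open the URL and read the file to investigate its contents. The response arrives as raw bytes, so it has to be decoded into a string:

In [0]:
# urlopen returns a file-like response; read() gives bytes, decode() gives str
with ur.urlopen(url) as response:
    raw = response.read()
text_urllib = raw.decode()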

We can also use the `requests` package.

In [0]:
import requests

In [0]:
text_requests = requests.get(url).text

From here on, both methods use the same code. Split the text into a list with one string per CSV row, using the carriage return `'\r'` as the line separator:

In [0]:
# Either text works; the second assignment simply overwrites the first
split = text_urllib.split('\r')
split = text_requests.split('\r')

In [7]:
for row in split[0:5]:
    line = row.split(',')
    print(line)

['year', 'month', 'date_of_month', 'day_of_week', 'births']
['2000', '1', '1', '6', '9083']
['2000', '1', '2', '7', '8006']
['2000', '1', '3', '1', '11363']
['2000', '1', '4', '2', '13032']
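
The `csv` module imported earlier can do the row and field splitting in one step. The following is a minimal sketch, not the notebook's own method: `parse_rows` is a hypothetical helper, and `sample_text` is a short stand-in for the downloaded text, built from the rows printed above.

```python
import csv

# Hypothetical sample standing in for the downloaded text
# (same columns as the real file, '\r' between rows).
sample_text = (
    "year,month,date_of_month,day_of_week,births\r"
    "2000,1,1,6,9083\r"
    "2000,1,2,7,8006"
)

def parse_rows(text):
    """Parse CSV text into a list of rows, each a list of strings."""
    # splitlines() handles '\r', '\n' and '\r\n' endings alike,
    # and csv.reader accepts any iterable of lines.
    return list(csv.reader(text.splitlines()))

rows = parse_rows(sample_text)
for row in rows:
    print(row)
```

Compared with `split(',')`, `csv.reader` would also cope with quoted fields, although this particular file does not appear to need that.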


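Or we can use pandas. Note that `parse_dates=[0]` parses only the `year` column, so every row of a given year gets January 1 of that year as its index: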

In [8]:
import pandas as pd
df = pd.read_csv(url,index_col=0,parse_dates=[0])

print(df.head(5))

            month  date_of_month  day_of_week  births
year                                                 
2000-01-01      1              1            6    9083
2000-01-01      1              2            7    8006
2000-01-01      1              3            1   11363
2000-01-01      1              4            2   13032
2000-01-01      1              5            3   12558
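
To get a real calendar date for each row (rather than the year-only index shown above), one option is to assemble it from the three date columns. This is a sketch under stated assumptions: `sample_df` is built from a few rows copied from the output above instead of the full download, and the `date` column name is an illustrative choice.

```python
import io
import pandas as pd

# A few rows copied from the output above, standing in for the full file.
sample_csv = io.StringIO(
    "year,month,date_of_month,day_of_week,births\n"
    "2000,1,1,6,9083\n"
    "2000,1,2,7,8006\n"
    "2000,1,3,1,11363\n"
)
sample_df = pd.read_csv(sample_csv)

# pd.to_datetime can assemble dates from columns named year, month and day,
# so date_of_month is renamed to day for this step only.
parts = sample_df[["year", "month", "date_of_month"]].rename(
    columns={"date_of_month": "day"}
)
sample_df["date"] = pd.to_datetime(parts)
sample_df = sample_df.set_index("date")

print(sample_df["births"])
```

With a proper `DatetimeIndex`, the `day_of_week` column can be cross-checked against `sample_df.index.dayofweek + 1`, since pandas numbers Monday as 0.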


### Count births on each day of week

Create a dictionary containing the total number of births on each day of the week:

In [0]:
day_counts = {}
split_1 = split[1:]  # skip the header row

for row in split_1:
    line = row.split(',')
    day_of_week = line[3]
    births = int(line[4])
    if day_of_week in day_counts:
        day_counts[day_of_week] += births
    else:
        day_counts[day_of_week] = births
        
print(day_counts)

{'6': 6704495, '7': 5886889, '1': 9316001, '2': 10274874, '3': 10109130, '4': 10045436, '5': 9850199}


Or, using pandas:

In [0]:
df.groupby(df.day_of_week)['births'].sum()

day_of_week
1     9316001
2    10274874
3    10109130
4    10045436
5     9850199
6     6704495
7     5886889
Name: births, dtype: int64
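
The numeric index is easier to read with day names attached. The sketch below assumes the coding given in the column list (1 is Monday, 7 is Sunday) and rebuilds the totals by hand from the output above rather than from `df`; `day_names` and `named_totals` are illustrative names.

```python
import pandas as pd

# Totals copied from the groupby output above.
totals = pd.Series(
    {1: 9316001, 2: 10274874, 3: 10109130, 4: 10045436,
     5: 9850199, 6: 6704495, 7: 5886889},
    name="births",
)

# Day coding taken from the column description: 1 is Monday, 7 is Sunday.
day_names = {
    1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday",
    5: "Friday", 6: "Saturday", 7: "Sunday",
}

# Replace the numeric labels with names and list the busiest days first.
named_totals = totals.rename(index=day_names)
print(named_totals.sort_values(ascending=False))
```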

## Part 2: Functional Programming

While a list of strings helps us get a general picture of the dataset, we need to convert it to a more structured format to be able to analyze it. Specifically, we need to convert the dataset into a list of lists where each nested list contains integer values (not strings). We also need to remove the header row.

In [0]:
def read_csv(url):
    # Download the file and split it into one string per row
    text = requests.get(url).text
    split = text.split('\r')
    # Skip the header row
    string_list = split[1:]
    final_list = []
    for row in string_list:
        # Convert every field of the row from str to int
        int_fields = []
        string_fields = row.split(',')
        for f in string_fields:
            int_fields.append(int(f))
        final_list.append(int_fields)
    return final_list

In [0]:
cdc_list = read_csv(url)

In [0]:
cdc_list[0:10]

[[2000, 1, 1, 6, 9083],
 [2000, 1, 2, 7, 8006],
 [2000, 1, 3, 1, 11363],
 [2000, 1, 4, 2, 13032],
 [2000, 1, 5, 3, 12558],
 [2000, 1, 6, 4, 12466],
 [2000, 1, 7, 5, 12516],
 [2000, 1, 8, 6, 8934],
 [2000, 1, 9, 7, 7949],
 [2000, 1, 10, 1, 11668]]
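
One fragile spot in this kind of parsing is a trailing line separator, which leaves an empty string at the end of the list and makes `int('')` fail. The variant below is a sketch that separates parsing from downloading and skips blank lines; `parse_int_rows` is a hypothetical name, and `sample_text` is a stand-in built from the rows shown above, with a trailing `'\r'` added on purpose.

```python
def parse_int_rows(text):
    """Turn CSV text into a list of integer rows, dropping the header."""
    lines = text.splitlines()[1:]  # splitlines() copes with '\r', '\n', '\r\n'
    return [
        [int(field) for field in line.split(",")]
        for line in lines
        if line.strip()  # ignore blank lines
    ]

# Stand-in for the downloaded text; note the trailing '\r'.
sample_text = (
    "year,month,date_of_month,day_of_week,births\r"
    "2000,1,1,6,9083\r"
    "2000,1,2,7,8006\r"
)

sample_list = parse_int_rows(sample_text)
print(sample_list)
```

Keeping the download out of the parsing function also makes it possible to try the logic on a small string without network access.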

Now that the data is in a more usable format, we can start to analyze it. Let's calculate the total number of births that occurred in each month, across all of the years in the dataset. We'll create a dictionary where each key is a unique month and each value is the total number of births in that month:

In [0]:
def month_births(req_list):
    births_per_month = {}
    for l in req_list:
        month = l[1]
        births = l[4]
        if month in births_per_month.keys():
            births_per_month[month] += births
        else:
            births_per_month[month] = births
    return births_per_month

In [0]:
cdc_month_births = month_births(cdc_list)

In [0]:
cdc_month_births

{1: 5072588,
 2: 4725693,
 3: 5172961,
 4: 4960750,
 5: 5195445,
 6: 5163360,
 7: 5450418,
 8: 5540170,
 9: 5399592,
 10: 5302865,
 11: 5008750,
 12: 5194432}

Let's now create a function that calculates the total number of births for each unique day of the week. It returns a dictionary keyed by day of week, built the same way as the monthly one:

In [0]:
def dow_births(req_list):
    births_per_dow = {}
    for l in req_list:
        dow = l[3]
        births = l[4]
        if dow in births_per_dow.keys():
            births_per_dow[dow] += births
        else:
            births_per_dow[dow] = births
    return births_per_dow

In [0]:
cdc_day_births = dow_births(cdc_list)

In [0]:
cdc_day_births

{6: 6704495,
 7: 5886889,
 1: 9316001,
 2: 10274874,
 3: 10109130,
 4: 10045436,
 5: 9850199}

You may have noticed that there was a lot of similarity between the two functions you just wrote. While we can also create separate functions to calculate the totals for the year and date_of_month columns, it's better to create a single function that works for any column and specify the column we want as a parameter each time we call the function.

In [0]:
def calc_counts(req_list, col):
    births_per_column = {}
    for l in req_list:
        column = l[col]
        births = l[4]
        if column in births_per_column.keys():
            births_per_column[column] += births
        else:
            births_per_column[column] = births
    return births_per_column

In [0]:
cdc_year_births = calc_counts(cdc_list, 0)
cdc_month_births = calc_counts(cdc_list, 1)
cdc_dom_births = calc_counts(cdc_list, 2)
cdc_dow_births = calc_counts(cdc_list, 3)
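
The same aggregation can be written a little more compactly with `dict.get`, which supplies a default of 0 for keys not seen yet. This is only a sketch: `calc_counts_sketch` and its `value_col` parameter are hypothetical, and `sample_rows` holds four rows copied from the `cdc_list` output above so the result can be checked by hand.

```python
def calc_counts_sketch(rows, col, value_col=4):
    """Sum rows[i][value_col] grouped by the value found in rows[i][col]."""
    totals = {}
    for row in rows:
        key = row[col]
        # get() returns 0 the first time a key appears
        totals[key] = totals.get(key, 0) + row[value_col]
    return totals

# Four rows copied from the cdc_list output above.
sample_rows = [
    [2000, 1, 1, 6, 9083],
    [2000, 1, 2, 7, 8006],
    [2000, 1, 3, 1, 11363],
    [2000, 1, 8, 6, 8934],
]

print(calc_counts_sketch(sample_rows, 3))  # totals per day of week
print(calc_counts_sketch(sample_rows, 0))  # totals per year
```

In the sample, the two Saturday rows (day 6) add up to 9083 + 8934 = 18017.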

That's it for the guided steps. Here are some suggestions for next steps:

- Write a function that can calculate the min and max values for any dictionary that's passed in.
- Write a function that extracts the same values across years and calculates the differences between consecutive values to show whether the number of births is increasing or decreasing. For example, how did the number of births on Saturday change each year between 1994 and 2003?
- Find a way to combine the CDC data with the SSA data, which you can find [here](https://github.com/fivethirtyeight/data/tree/master/births/). Specifically, brainstorm ways to deal with the overlapping time periods in the datasets.
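
As a possible starting point for the first two suggestions, here is a rough sketch. The function names `dict_min_max` and `consecutive_differences` are made up for illustration, and the monthly totals from the `cdc_month_births` output earlier in the notebook are reused as input so that no new data is invented.

```python
def dict_min_max(counts):
    """Return ((min_key, min_value), (max_key, max_value)) for a non-empty dict."""
    min_key = min(counts, key=counts.get)
    max_key = max(counts, key=counts.get)
    return (min_key, counts[min_key]), (max_key, counts[max_key])

def consecutive_differences(counts):
    """Return {key: value minus the previous key's value}, keys in sorted order."""
    keys = sorted(counts)
    return {cur: counts[cur] - counts[prev] for prev, cur in zip(keys, keys[1:])}

# Monthly totals copied from the cdc_month_births output earlier.
month_totals = {
    1: 5072588, 2: 4725693, 3: 5172961, 4: 4960750,
    5: 5195445, 6: 5163360, 7: 5450418, 8: 5540170,
    9: 5399592, 10: 5302865, 11: 5008750, 12: 5194432,
}

print(dict_min_max(month_totals))
print(consecutive_differences(month_totals))
```

For the year-over-year question, the same `consecutive_differences` function could be applied to a dictionary of per-year totals, built from only the rows where the day of week is Saturday.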