# US Births Data Exploration Project
----
## Opening the Datasets
This repo contains two csv files:
1. CDC, spanning all days 1994-2003
2. SSA, spanning all days 2000-2014

Each line of the csv files contains a comma-separated string representing a unique date. Below, I read in each file and print the first ten lines. The SSA file is read with the read_csv function defined in the next section.

In [2]:
# The with block closes the file once it has been read.
with open("US_births_1994-2003_CDC_NCHS.csv") as f:
    births_str = f.read()
births_data = births_str.split('\n')
print(births_data[:10])


['year,month,date_of_month,day_of_week,births', '1994,1,1,6,8096', '1994,1,2,7,7772', '1994,1,3,1,10142', '1994,1,4,2,11248', '1994,1,5,3,11053', '1994,1,6,4,11406', '1994,1,7,5,11251', '1994,1,8,6,8653', '1994,1,9,7,7910']


In [4]:
ssa_list = read_csv("US_births_2000-2014_SSA.csv")
print(len(ssa_list))
print(ssa_list[:10])

5479
[[2000, 1, 1, 6, 9083], [2000, 1, 2, 7, 8006], [2000, 1, 3, 1, 11363], [2000, 1, 4, 2, 13032], [2000, 1, 5, 3, 12558], [2000, 1, 6, 4, 12466], [2000, 1, 7, 5, 12516], [2000, 1, 8, 6, 8934], [2000, 1, 9, 7, 7949], [2000, 1, 10, 1, 11668]]


## Converting the csv files to lists of lists
Below, I created a function named "read_csv" that takes a csv file path as input and outputs a headerless list of lists.

In [3]:
def read_csv(f):
    """Converts a csv file to a headerless list of lists with int elements."""
    # The with block closes the file even if reading fails.
    with open(f, "r") as file:
        content = file.read()
    split_content = content.split('\n')
    # Drop the header row.
    string_list = split_content[1:]
    final_list = []

    for row in string_list:
        # Skip blank lines, e.g. one left by a trailing newline.
        if not row:
            continue
        int_fields = []
        string_fields = row.split(',')
        for e in string_fields:
            int_fields.append(int(e))
        final_list.append(int_fields)

    return final_list

cdc_list = read_csv("US_births_1994-2003_CDC_NCHS.csv")
print(len(cdc_list))
print(cdc_list[-10:])

3652
[[2003, 12, 22, 1, 12967], [2003, 12, 23, 2, 12598], [2003, 12, 24, 3, 9096], [2003, 12, 25, 4, 6628], [2003, 12, 26, 5, 10218], [2003, 12, 27, 6, 8646], [2003, 12, 28, 7, 7645], [2003, 12, 29, 1, 12823], [2003, 12, 30, 2, 14438], [2003, 12, 31, 3, 12374]]
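
As an alternative sketch, the standard library's `csv` module can handle the splitting, and a `with` block closes the file automatically. The name `read_csv_rows` is hypothetical and not part of this notebook; it assumes the same layout as the files above (one header row followed by rows of integers).

```python
import csv

def read_csv_rows(path):
    """Return a headerless list of lists of ints (hypothetical helper)."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        # csv.reader yields an empty list for a blank line, so filter those out.
        return [[int(field) for field in row] for row in reader if row]
```

Called as `read_csv_rows("US_births_1994-2003_CDC_NCHS.csv")`, this should return the same structure as read_csv, assuming the file contains only integer fields.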


## Combining the CDC and SSA lists
This function (combine_lists) keeps every row from the CDC data (1994-2003) and appends only the rows after 2003 from the SSA data, so the CDC values are used for the overlapping years 2000-2003. The resulting output contains each date exactly once, spanning all dates from 1994 to 2014.

In [5]:
def combine_lists(l_one, l_two):
    """Combines the cdc and ssa lists and returns the resulting headerless list of lists.

    This function is specific to these cdc and ssa lists and is not generalized to other lists of lists.
    """
    combined_list = []
    for row in l_one:        
        combined_list.append(row)
    for row in l_two:
        row_year = row[0]
        if row_year > 2003:
            combined_list.append(row)
    return combined_list

In [6]:
cdc_ssa_list = combine_lists(cdc_list, ssa_list)
print(len(cdc_ssa_list))

7670
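
One way to check that the combined list really holds each date only once is to count (year, month, date_of_month) keys. This is a minimal sketch; `duplicate_dates` is a hypothetical helper, and the sample rows are a tiny illustration that reuses births values printed elsewhere in this notebook.

```python
from collections import Counter

def duplicate_dates(data):
    """Return the (year, month, date_of_month) keys that appear more than once."""
    counts = Counter((row[0], row[1], row[2]) for row in data)
    return [date for date, n in counts.items() if n > 1]

# Illustrative sample only, not the full dataset.
sample = [
    [2003, 12, 31, 3, 12374],
    [2004, 1, 1, 4, 8205],
]
print(duplicate_dates(sample))  # an empty list means every date is unique
```

Running `duplicate_dates(cdc_ssa_list)` on the full combined list would be expected to return an empty list if the combination worked as intended.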


## Explore births of a given date
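This function (date_births) returns the births for a given date. The examples below show the births around the end of 2003 and the beginning of 2004, which is where the combined list switches from CDC data to SSA data. Note that the two sources report different counts for the same date (12374 versus 12540 for December 31, 2003), and the combined list keeps the CDC value.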

In [7]:
def date_births(data, year, month, date_of_month):
    """Returns a list with the births (int) for given date in a dataset. Multiple elements in the list imply duplicate dates in the data.
    
    First argument, data, must be a list of lists (all int) with the following format:
    index 0: year
    index 1: month
    index 2: date_of_month
    index 3: day_of_week
    index 4: births
    
    3 required arguments constrain the output: year, month, and date_of_month.
    """
    births_list = []
    for row in data:
        row_year = row[0]
        row_month = row[1]
        row_date_of_month = row[2]
        if row_year == year:
            if row_month == month:
                if row_date_of_month == date_of_month:
                    births_list.append(row[4])
    return births_list

In [8]:
print(date_births(cdc_list, 2003, 12, 31))
print(date_births(cdc_list, 2004, 1, 1))

[12374]
[]


In [9]:
print(date_births(ssa_list, 2003, 12, 31))
print(date_births(ssa_list, 2004, 1, 1))

[12540]
[8205]


In [10]:
print(date_births(cdc_ssa_list, 2003, 12, 31))
print(date_births(cdc_ssa_list, 2004, 1, 1))

[12374]
[8205]
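
Because the two sources disagree on some dates, it may be useful to measure the gap for every date they share. The sketch below is one possible approach; `overlap_differences` is a hypothetical helper, and the sample rows reuse only the values printed above.

```python
def overlap_differences(first, second):
    """Map each date present in both lists to (second births - first births)."""
    first_births = {(r[0], r[1], r[2]): r[4] for r in first}
    diffs = {}
    for r in second:
        date = (r[0], r[1], r[2])
        if date in first_births:
            diffs[date] = r[4] - first_births[date]
    return diffs

# Illustrative sample only, built from the outputs above.
cdc_sample = [[2003, 12, 31, 3, 12374]]
ssa_sample = [[2003, 12, 31, 3, 12540], [2004, 1, 1, 4, 8205]]
print(overlap_differences(cdc_sample, ssa_sample))  # {(2003, 12, 31): 166}
```

Applied to the full lists as `overlap_differences(cdc_list, ssa_list)`, this would cover every date from 2000 through 2003.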


## Exploring births by month and day of week
The following two functions (month_births and dow_births) collect the number of total births by month and day of week.

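- No month stands out dramatically during the years 1994-2014: monthly totals range from about 6.5 million (February) to about 7.6 million (August), and part of that spread reflects differing month lengths.
- Weekdays saw far more births than weekends: the average weekday total is about 52% higher than the average weekend-day total (about 44% higher than Saturday and 62% higher than Sunday). Monday is represented as 1 in the dataset; Tuesday is 2; etc.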

In [11]:
def month_births(births_list):
    """Returns a dict with month (int) keys and births values.
    Input must be a list of lists with month at index 1 and births at index 4.
    """
    births_per_month = {}
    for row in births_list:
        month = row[1]
        births = row[4]
        if month in births_per_month:
            births_per_month[month] += births
        else:
            births_per_month[month] = births
    
    return births_per_month

cdc_ssa_month_births = month_births(cdc_ssa_list)

In [12]:
cdc_ssa_month_births

{1: 6951352,
 2: 6487269,
 3: 7121938,
 4: 6826843,
 5: 7150547,
 6: 7098069,
 7: 7500795,
 8: 7595922,
 9: 7411299,
 10: 7264087,
 11: 6854860,
 12: 7123246}

In [13]:
def dow_births(births_list):
    """Returns a dict with day of week (int) keys and births values.
    Input must be a list of lists with day of week at index 3 and births at index 4.
    """
    births_per_day = {}
    for row in births_list:
        day = row[3]
        births = row[4]
        if day in births_per_day:
            births_per_day[day] += births
        else:
            births_per_day[day] = births
    return births_per_day

cdc_ssa_dow_births = dow_births(cdc_ssa_list)

In [14]:
cdc_ssa_dow_births

{6: 9417666,
 7: 8340922,
 1: 12672592,
 2: 14015353,
 3: 13775310,
 4: 13691289,
 5: 13473095}
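
The weekday-versus-weekend comparison can be checked directly from the totals above. This sketch compares the average of the five weekday totals with the average of the two weekend totals, which is only one of several reasonable ways to frame the comparison; the variable names are illustrative.

```python
# Totals copied from the day-of-week output above (1 = Monday ... 7 = Sunday).
dow_totals = {6: 9417666, 7: 8340922, 1: 12672592, 2: 14015353,
              3: 13775310, 4: 13691289, 5: 13473095}

weekday_avg = sum(dow_totals[d] for d in range(1, 6)) / 5
weekend_avg = sum(dow_totals[d] for d in (6, 7)) / 2
ratio = weekday_avg / weekend_avg

# Percentage by which the average weekday total exceeds the average weekend total.
print(f"{100 * (ratio - 1):.0f}%")  # 52%
```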

## Generalizing the birth count functions
This function (calc_counts) outputs a collection of births based on any of the columns (e.g. births by year, births by month).

- Annual births rose from about 3.95 million in 1994 to a peak of about 4.38 million in 2007, then fell back to about 4.01 million in 2014. The net change is well under 1% per year, so it may not be significant.
- Seemingly no date of month preference exists. Note: the lower total on the 31st is due to only seven of the twelve months having a 31st date; the 29th and 30th are also slightly lower because February usually lacks them.

In [15]:
def calc_counts(data, column):
    """Returns a dict with unique column elements as keys and births as values.
    Input must be a list of lists.
    """
    births_per_column = {}
    for row in data:
        column_num = row[column]
        births = row[4]
        if column_num in births_per_column:
            births_per_column[column_num] += births
        else:
            births_per_column[column_num] = births
    return births_per_column

cdc_ssa_year_births = calc_counts(cdc_ssa_list, 0)
cdc_ssa_month_births = calc_counts(cdc_ssa_list, 1)
cdc_ssa_dom_births = calc_counts(cdc_ssa_list, 2)
cdc_ssa_dow_births = calc_counts(cdc_ssa_list,3)

In [16]:
cdc_ssa_year_births, cdc_ssa_month_births, cdc_ssa_dom_births, cdc_ssa_dow_births

({1994: 3952767,
  1995: 3899589,
  1996: 3891494,
  1997: 3880894,
  1998: 3941553,
  1999: 3959417,
  2000: 4058814,
  2001: 4025933,
  2002: 4021726,
  2003: 4089950,
  2004: 4186863,
  2005: 4211941,
  2006: 4335154,
  2007: 4380784,
  2008: 4310737,
  2009: 4190991,
  2010: 4055975,
  2011: 4006908,
  2012: 4000868,
  2013: 3973337,
  2014: 4010532},
 {1: 6951352,
  2: 6487269,
  3: 7121938,
  4: 6826843,
  5: 7150547,
  6: 7098069,
  7: 7500795,
  8: 7595922,
  9: 7411299,
  10: 7264087,
  11: 6854860,
  12: 7123246},
 {1: 2755187,
  2: 2788038,
  3: 2803237,
  4: 2756282,
  5: 2788696,
  6: 2797080,
  7: 2826460,
  8: 2830509,
  9: 2812697,
  10: 2835319,
  11: 2815717,
  12: 2836469,
  13: 2747239,
  14: 2845418,
  15: 2841431,
  16: 2833274,
  17: 2845098,
  18: 2846056,
  19: 2830670,
  20: 2854849,
  21: 2851931,
  22: 2828415,
  23: 2787922,
  24: 2746320,
  25: 2710024,
  26: 2747304,
  27: 2791890,
  28: 2817479,
  29: 2637999,
  30: 2584936,
  31: 1592281},
 {6: 9417666,
  7: 8340922,
  1: 12672592,
  2: 14015353,
  3: 13775310,
  4: 13691289,
  5: 13473095})
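
To compare dates of month fairly, the totals can be divided by how often each date actually occurs in 1994-2014. The following is a rough sketch using the standard `calendar` module; `day_of_month_occurrences` is a hypothetical helper, and only a few totals from the output above are included for illustration.

```python
import calendar

def day_of_month_occurrences(first_year, last_year):
    """Count how many times each date of month (1-31) occurs in the year range."""
    occurrences = {day: 0 for day in range(1, 32)}
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            # monthrange returns (weekday of first day, number of days in month).
            days_in_month = calendar.monthrange(year, month)[1]
            for day in range(1, days_in_month + 1):
                occurrences[day] += 1
    return occurrences

occurrences = day_of_month_occurrences(1994, 2014)
# A few totals copied from the date-of-month output above.
totals = {1: 2755187, 29: 2637999, 30: 2584936, 31: 1592281}
for day, total in totals.items():
    # Print the date, how often it occurs, and the average births per occurrence.
    print(day, occurrences[day], round(total / occurrences[day]))
```

The 31st occurs 147 times in this range, compared with 252 times for the 1st, which accounts for most of the gap in the raw totals.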

## Finding Min and Max Values
The following function (dict_min_max) finds the min and max values of a dictionary. In the examples below, the function is used to explore the births by year, month, date of month, and day of week. The findings from these examples are not very insightful:

- 1997 had the fewest births, and 2007 had the most.
- February was the least popular (and shortest) month for births, and August was the most popular.
- The 31st date of month had the lowest births, presumably because only seven months have a 31st.
- Sunday was the least popular day for births, and Tuesday was the most popular.

In [17]:
def dict_min_max(d):
    """Returns the min and max values of a dictionary.
    Min and max are returned as tuples with key and value at index 0 and 1, respectively.
    """
    min_output = None
    max_output = None
    for key, value in d.items():
        if min_output is None and max_output is None:
            min_output = (key, value)
            max_output = (key, value)
        else:
            if value < min_output[1]:
                min_output = (key, value)
            if value > max_output[1]:
                max_output = (key, value)
    return min_output, max_output

In [18]:
print(dict_min_max(cdc_ssa_year_births))
print(dict_min_max(cdc_ssa_month_births))
print(dict_min_max(cdc_ssa_dom_births))
print(dict_min_max(cdc_ssa_dow_births))

((1997, 3880894), (2007, 4380784))
((2, 6487269), (8, 7595922))
((31, 1592281), (20, 2854849))
((7, 8340922), (2, 14015353))
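
For comparison, the same result can be obtained with Python's built-in `min` and `max` and a key function. This is a sketch of an alternative, not a replacement for dict_min_max; the name `dict_min_max_builtin` is hypothetical.

```python
from operator import itemgetter

def dict_min_max_builtin(d):
    """Return ((min_key, min_value), (max_key, max_value)), or (None, None) if d is empty."""
    if not d:
        return None, None
    by_value = itemgetter(1)  # compare (key, value) pairs by their value
    return min(d.items(), key=by_value), max(d.items(), key=by_value)

# Totals copied from the day-of-week output earlier in this notebook.
dow_totals = {6: 9417666, 7: 8340922, 1: 12672592, 2: 14015353,
              3: 13775310, 4: 13691289, 5: 13473095}
print(dict_min_max_builtin(dow_totals))  # ((7, 8340922), (2, 14015353))
```

Like the loop version, `min` and `max` return the first item encountered when several values tie.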


## Exploring Year over Year Trends
The following function (year_changes) outputs annual births limited by up to three constraints: month, date of month, and day of week. In the example below, I used the function to explore whether births on 9/11 declined after 2001. Average births on that date were slightly lower in 2002-2014 (about 11,460) than in 1994-2001 (about 11,580). However, this is inconclusive because births on any specific date of the year are volatile, and the lowest values appear to coincide with years in which the date fell on a weekend.

In [19]:
def year_changes(births_list, month=None, date_of_month=None, day_of_week=None):
    """Returns a dict with year (int) as keys and births (int) as values.
    
    First argument, births_list, must be a list of lists (all int) with the following format:
    index 0: year
    index 1: month
    index 2: date_of_month
    index 3: day_of_week
    index 4: births
    
    3 optional arguments constrain the output: month, date_of_month, and day_of_week.
    """
    output_dict = {}
    for row in births_list:
        row_year = row[0]
        row_month = row[1]
        row_date = row[2]
        row_dow = row[3]
        row_births = row[4]
        if month is None or month == row_month:
            if date_of_month is None or date_of_month == row_date:
                if day_of_week is None or day_of_week == row_dow:
                    if row_year in output_dict:
                        output_dict[row_year] += row_births
                    else:
                        output_dict[row_year] = row_births
    return output_dict
            

In [20]:
year_changes(cdc_ssa_list, month=9, date_of_month=11)

{1994: 8373,
 1995: 11480,
 1996: 12420,
 1997: 12467,
 1998: 12920,
 1999: 9634,
 2000: 12091,
 2001: 13238,
 2002: 12371,
 2003: 12932,
 2004: 9253,
 2005: 8041,
 2006: 12868,
 2007: 14063,
 2008: 13391,
 2009: 13032,
 2010: 8775,
 2011: 7501,
 2012: 12543,
 2013: 12074,
 2014: 12104}
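
The before-and-after comparison can be reproduced from the output above. The sketch below averages the two periods and also flags the years in which September 11 fell on a weekend, since those years are likely to show low counts regardless of any trend. `mean_births` is a hypothetical helper; `datetime.date.isoweekday` uses the same Monday = 1 through Sunday = 7 encoding as the dataset.

```python
from datetime import date

# Values copied from the year_changes output above.
sept_11 = {1994: 8373, 1995: 11480, 1996: 12420, 1997: 12467, 1998: 12920,
           1999: 9634, 2000: 12091, 2001: 13238, 2002: 12371, 2003: 12932,
           2004: 9253, 2005: 8041, 2006: 12868, 2007: 14063, 2008: 13391,
           2009: 13032, 2010: 8775, 2011: 7501, 2012: 12543, 2013: 12074,
           2014: 12104}

def mean_births(by_year, first_year, last_year):
    """Average the yearly values for first_year..last_year inclusive."""
    values = [by_year[y] for y in range(first_year, last_year + 1)]
    return sum(values) / len(values)

through_2001 = mean_births(sept_11, 1994, 2001)
after_2001 = mean_births(sept_11, 2002, 2014)
print(round(through_2001), round(after_2001))  # 11578 11458

# Years in which September 11 was a Saturday (6) or Sunday (7).
weekend_years = [y for y in sept_11 if date(y, 9, 11).isoweekday() >= 6]
print(weekend_years)
```

Passing `day_of_week` to year_changes, or comparing only years in which the date fell on a weekday, would be one way to control for this effect.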