# Exploring US Births

In this project, we will analyse a data set of daily US births from 1994 to 2003, which comes from the Centers for Disease Control and Prevention (CDC)'s National Center for Health Statistics. The data set has the following structure:

- year - Year
- month - Month
- date_of_month - Day number of the month
- day_of_week - Day of week, where 1 is Monday and 7 is Sunday
- births - Number of births

## Introduction to the Data Set

We will open and view the data set using Python.

In [1]:
# open the data set in its native format; the 'with' block closes the file afterwards
with open(r"C:\projectdatasets\US_births_1994-2003_CDC_NCHS.csv") as f:
    csv_raw = f.read()

# view the data set's first 500 characters (represented as one long string)
csv_raw[0:500]

'year,month,date_of_month,day_of_week,births\n1994,1,1,6,8096\n1994,1,2,7,7772\n1994,1,3,1,10142\n1994,1,4,2,11248\n1994,1,5,3,11053\n1994,1,6,4,11406\n1994,1,7,5,11251\n1994,1,8,6,8653\n1994,1,9,7,7910\n1994,1,10,1,10498\n1994,1,11,2,11706\n1994,1,12,3,11567\n1994,1,13,4,11212\n1994,1,14,5,11570\n1994,1,15,6,8660\n1994,1,16,7,8123\n1994,1,17,1,10567\n1994,1,18,2,11541\n1994,1,19,3,11257\n1994,1,20,4,11682\n1994,1,21,5,11811\n1994,1,22,6,8833\n1994,1,23,7,8310\n1994,1,24,1,11125\n1994,1,25,2,11981\n1994,1,26,3,11514\n1994,'

In [2]:
# check the data type is a 'string'
type(csv_raw)

str

In [3]:
# split the data set on the '\n' character to create a 'list'
csv_list = csv_raw.split("\n")

# view the first ten rows
csv_list[0:10]

['year,month,date_of_month,day_of_week,births',
 '1994,1,1,6,8096',
 '1994,1,2,7,7772',
 '1994,1,3,1,10142',
 '1994,1,4,2,11248',
 '1994,1,5,3,11053',
 '1994,1,6,4,11406',
 '1994,1,7,5,11251',
 '1994,1,8,6,8653',
 '1994,1,9,7,7910']

In [4]:
# check the data type is a 'list'
type(csv_list)

list

## Automate the data load using a function

We will create a function to:
- read the data in as a string
- split the data on the '\n' character, creating a list
- remove the header row
- split each remaining row on the ',' character, creating a list of string fields for that row
- convert each field to an integer and collect the integers in a new list
- append each list of integers to a 'final_list', giving a list of lists

In [5]:
# create the function, which should accept a file path
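def read_csv(filename):
    # read the whole file as one string; 'with' closes the file afterwards
    with open(filename) as f:
        string_data = f.read()
    # split into rows and drop the header row
    string_list = string_data.split("\n")[1:]
    final_list = []

    for row in string_list:
        # skip blank rows (e.g. after a trailing newline), which int() cannot convert
        if not row:
            continue
        string_fields = row.split(",")
        int_fields = []
        for value in string_fields:
            int_fields.append(int(value))
        final_list.append(int_fields)
    return final_list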

In [6]:
# run the function we created above on the raw data set
cdc_list = read_csv(r"C:\projectdatasets\US_births_1994-2003_CDC_NCHS.csv")

# view the first ten rows - this represents a 'list of lists'
cdc_list[0:10]

[[1994, 1, 1, 6, 8096],
 [1994, 1, 2, 7, 7772],
 [1994, 1, 3, 1, 10142],
 [1994, 1, 4, 2, 11248],
 [1994, 1, 5, 3, 11053],
 [1994, 1, 6, 4, 11406],
 [1994, 1, 7, 5, 11251],
 [1994, 1, 8, 6, 8653],
 [1994, 1, 9, 7, 7910],
 [1994, 1, 10, 1, 10498]]
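
As an alternative, the same load can be sketched with Python's standard `csv` module, which does the splitting for us. The function name `read_csv_rows` is hypothetical and not part of the original notebook; the sketch assumes every non-header field is an integer, as in this data set.

```python
import csv

def read_csv_rows(filename):
    """Read a CSV of integers into a list of lists, skipping the header row."""
    rows = []
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row (no error if the file is empty)
        for fields in reader:
            if not fields:  # csv.reader yields [] for a blank line
                continue
            rows.append([int(value) for value in fields])
    return rows
```

Called with the same file path as `read_csv`, this should return the same list of lists.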

## Number Of Births Per Month

We will create a function that totals the number of births for each month and stores the results in a dictionary of key-value pairs.

In [7]:
# function should accept a data set
def month_births(data):
    
    # create a dictionary to hold the results
    births_per_month = {}
    
    for row in data:
        month = row[1]
        births = row[4]
        if month in births_per_month:
            births_per_month[month] = births_per_month[month] + births
        else:
            births_per_month[month] = births
    return births_per_month

In [8]:
# run the function 'month_births' we created above (passing in the data set) 
# and store the result in a new variable 'cdc_month_births'
cdc_month_births = month_births(cdc_list)

# show the dictionary
cdc_month_births

{1: 3232517,
 2: 3018140,
 3: 3322069,
 4: 3185314,
 5: 3350907,
 6: 3296530,
 7: 3498783,
 8: 3525858,
 9: 3439698,
 10: 3378814,
 11: 3171647,
 12: 3301860}

## Number Of Births Per Day Of Week

Similar to the above, we will create a function that totals the number of births for each day of the week and stores the results in a dictionary of key-value pairs.

In [9]:
# function should accept a data set
def dow_births(data):
    
    # create a dictionary to hold the results
    births_per_dow = {}
    
    for row in data:
        dow = row[3]
        births = row[4]
        if dow in births_per_dow:
            births_per_dow[dow] = births_per_dow[dow] + births
        else:
            births_per_dow[dow] = births
    return births_per_dow

In [10]:
# run the function 'dow_births' we created above (passing in the data set) 
# and store the result in a new variable 'cdc_dow_births'
cdc_dow_births = dow_births(cdc_list)

# show the dictionary
cdc_dow_births

{6: 4562111,
 7: 4079723,
 1: 5789166,
 2: 6446196,
 3: 6322855,
 4: 6288429,
 5: 6233657}

## Number Of Births for Any Column

To improve flexibility, we will create a function that totals the number of births by any column the user specifies and stores the result in a dictionary of key-value pairs.

In [11]:
# function should accept a data set, and a column index
def calc_counts(data, column):
    
    # create a dictionary to hold the results
    sums_dict = {}
    
    for row in data:
        col_value = row[column]
        births = row[4]
        if col_value in sums_dict:
            sums_dict[col_value] = sums_dict[col_value] + births
        else:
            sums_dict[col_value] = births
    return sums_dict

In [12]:
# run the function 'calc_counts' we created above, four different times, varying the column each time
cdc_year_births = calc_counts(cdc_list, 0)
cdc_month_births = calc_counts(cdc_list, 1)
cdc_dom_births = calc_counts(cdc_list, 2)
cdc_dow_births = calc_counts(cdc_list, 3)

In [13]:
# show the dictionary
cdc_year_births

{1994: 3952767,
 1995: 3899589,
 1996: 3891494,
 1997: 3880894,
 1998: 3941553,
 1999: 3959417,
 2000: 4058814,
 2001: 4025933,
 2002: 4021726,
 2003: 4089950}
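
To see how the yearly totals move, one option is to compute the change from each year to the next. This is a minimal sketch; the helper name `yearly_change` is hypothetical, and the dictionary is copied from the output above so the block runs on its own.

```python
# yearly totals copied from the output above
cdc_year_births = {
    1994: 3952767, 1995: 3899589, 1996: 3891494, 1997: 3880894,
    1998: 3941553, 1999: 3959417, 2000: 4058814, 2001: 4025933,
    2002: 4021726, 2003: 4089950,
}

def yearly_change(totals):
    """Map each year (except the first) to its total minus the previous year's total."""
    years = sorted(totals)
    return {year: totals[year] - totals[prev] for prev, year in zip(years, years[1:])}

changes = yearly_change(cdc_year_births)
print(changes[1995])  # -53178
print(changes[1998])  # 60659
```

A negative value means fewer births than in the previous year.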

In [14]:
# show the dictionary
cdc_month_births

{1: 3232517,
 2: 3018140,
 3: 3322069,
 4: 3185314,
 5: 3350907,
 6: 3296530,
 7: 3498783,
 8: 3525858,
 9: 3439698,
 10: 3378814,
 11: 3171647,
 12: 3301860}

In [15]:
# show the dictionary
cdc_dom_births

{1: 1276557,
 2: 1288739,
 3: 1304499,
 4: 1288154,
 5: 1299953,
 6: 1304474,
 7: 1310459,
 8: 1312297,
 9: 1303292,
 10: 1320764,
 11: 1314361,
 12: 1318437,
 13: 1277684,
 14: 1320153,
 15: 1319171,
 16: 1315192,
 17: 1324953,
 18: 1326855,
 19: 1318727,
 20: 1324821,
 21: 1322897,
 22: 1317381,
 23: 1293290,
 24: 1288083,
 25: 1272116,
 26: 1284796,
 27: 1294395,
 28: 1307685,
 29: 1223161,
 30: 1202095,
 31: 746696}

In [16]:
# show the dictionary
cdc_dow_births

{6: 4562111,
 7: 4079723,
 1: 5789166,
 2: 6446196,
 3: 6322855,
 4: 6288429,
 5: 6233657}
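
To read the extremes off a dictionary like this without scanning it by eye, a small helper can return the keys with the lowest and highest totals. The name `min_max_keys` is hypothetical; the dictionary is copied from the output above.

```python
def min_max_keys(totals):
    """Return the keys with the smallest and the largest totals."""
    lowest = min(totals, key=totals.get)
    highest = max(totals, key=totals.get)
    return lowest, highest

# day-of-week totals copied from the output above (1 is Monday, 7 is Sunday)
cdc_dow_births = {6: 4562111, 7: 4079723, 1: 5789166, 2: 6446196,
                  3: 6322855, 4: 6288429, 5: 6233657}

lowest_day, highest_day = min_max_keys(cdc_dow_births)
print(lowest_day, highest_day)  # 7 2
```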

## Analysis

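Based on the totals above, the clearest pattern is by day of the week: Saturday (6) and Sunday (7) have far fewer births (about 4.6 million and 4.1 million) than any weekday (about 5.8 to 6.4 million), with Tuesday (2) the highest. By month, the totals range from about 3.0 million in February to about 3.5 million in August, though months differ in length, so the raw totals are not directly comparable. The same caveat applies to the day of the month, where the low totals for the 29th, 30th and 31st largely reflect that not every month has those dates. By year, births fell from 1994 to 1997 and then generally rose to 2003, the highest year in the data set, so the climb is not steady.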
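
The totals for each month or day of the month are sums over different numbers of days, so one way to compare groups more fairly is the average number of births per day in each group. The sketch below is one possible approach, not part of the original notebook; the name `calc_daily_averages` is hypothetical, and the sample is the first ten rows of the data set shown earlier.

```python
def calc_daily_averages(data, column, births_column=4):
    """Average births per row (i.e. per day) for each distinct value in `column`."""
    totals = {}
    counts = {}
    for row in data:
        key = row[column]
        totals[key] = totals.get(key, 0) + row[births_column]
        counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}

# first ten rows of the data set, as shown earlier
sample = [
    [1994, 1, 1, 6, 8096], [1994, 1, 2, 7, 7772], [1994, 1, 3, 1, 10142],
    [1994, 1, 4, 2, 11248], [1994, 1, 5, 3, 11053], [1994, 1, 6, 4, 11406],
    [1994, 1, 7, 5, 11251], [1994, 1, 8, 6, 8653], [1994, 1, 9, 7, 7910],
    [1994, 1, 10, 1, 10498],
]

dow_averages = calc_daily_averages(sample, 3)
print(dow_averages[6])  # 8374.5
```

Passing the full `cdc_list` with column index 1 or 2 would give per-day averages by month or by day of the month.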