# Explore U.S. Births

In this project we will explore U.S. births from 1994 to 2003 and try to answer the following questions:
- Which year had the highest number of births?
- Is there any month with significantly more births than the other months?
- Which day of the week and which month had the highest number of births?

We will be working with daily data on U.S. births from 1994 to 2003. The dataset contains the following columns:
- **year:** Year (1994 to 2003).
- **month:** Month (1 to 12).
- **date_of_month:** Day number of the month (1 to 31).
- **day_of_week:** Day of week (1 to 7).
- **births:** Number of births that day.

## Reading the data

In [2]:
# Read the whole file and split it into lines; the with-block closes the file afterwards.
with open(r"Data\US_births_1994-2003.csv") as f:
    data_list = f.read().split("\n")
data_list[:10]

['year,month,date_of_month,day_of_week,births',
 '1994,1,1,6,8096',
 '1994,1,2,7,7772',
 '1994,1,3,1,10142',
 '1994,1,4,2,11248',
 '1994,1,5,3,11053',
 '1994,1,6,4,11406',
 '1994,1,7,5,11251',
 '1994,1,8,6,8653',
 '1994,1,9,7,7910']

In [4]:
data_list[-10:]
# The last row is empty; when converting to a list of lists we will have to remove it, along with the header row.

['2003,12,23,2,12598',
 '2003,12,24,3,9096',
 '2003,12,25,4,6628',
 '2003,12,26,5,10218',
 '2003,12,27,6,8646',
 '2003,12,28,7,7645',
 '2003,12,29,1,12823',
 '2003,12,30,2,14438',
 '2003,12,31,3,12374',
 '']

## Converting Data Into A List Of Lists

In [5]:
def read_csv(filename):
    # Read the whole file and split it into lines; the with-block closes the file.
    with open(filename) as f:
        string_data = f.read().split("\n")
    # Drop the header row and the trailing empty line.
    string_list = string_data[1:-1]
    final_list = []
    for row in string_list:
        # Convert every comma-separated field of the row to an integer.
        int_fields = [int(value) for value in row.split(",")]
        final_list.append(int_fields)
    return final_list

In [6]:
cdc_list = read_csv(r"Data\US_births_1994-2003.csv")
cdc_list[:10]

[[1994, 1, 1, 6, 8096],
 [1994, 1, 2, 7, 7772],
 [1994, 1, 3, 1, 10142],
 [1994, 1, 4, 2, 11248],
 [1994, 1, 5, 3, 11053],
 [1994, 1, 6, 4, 11406],
 [1994, 1, 7, 5, 11251],
 [1994, 1, 8, 6, 8653],
 [1994, 1, 9, 7, 7910],
 [1994, 1, 10, 1, 10498]]
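
The manual splitting in `read_csv` relies on the file ending with exactly one empty line. As a hedged alternative, the sketch below uses Python's standard `csv` module and skips blank rows wherever they occur. The helper name `read_csv_rows` and the in-memory sample are illustrative assumptions, not part of the original notebook.

```python
import csv
import io

def read_csv_rows(lines):
    """Parse an iterable of CSV lines into a list of integer lists, skipping the header."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    # Blank lines come back as empty lists, so `if row` filters them out.
    return [[int(value) for value in row] for row in reader if row]

# Small in-memory sample copied from the first rows of the dataset.
sample = io.StringIO(
    "year,month,date_of_month,day_of_week,births\n"
    "1994,1,1,6,8096\n"
    "1994,1,2,7,7772\n"
)
rows = read_csv_rows(sample)
print(rows)
```

With the real file, the same helper could be called on an open file object, for example inside `with open(filename, newline="") as f:`.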

## Which year had the highest number of births?

In [7]:
year_dict = {}
for row in cdc_list:
    if row[0] in year_dict:
        year_dict[row[0]] += row[4]
    else:
        year_dict[row[0]] = row[4]

In [10]:
year_dict

{1994: 3952767,
 1995: 3899589,
 1996: 3891494,
 1997: 3880894,
 1998: 3941553,
 1999: 3959417,
 2000: 4058814,
 2001: 4025933,
 2002: 4021726,
 2003: 4089950}

In [11]:
#It looks like the highest number of births was in 2003. 
#Let's try to create a function that we can use to summarize other columns as well.

def calc_counts(data, group_col, target_col):
    final_dict = {}
    for row in data:
        if row[group_col] in final_dict:
            final_dict[row[group_col]] += row[target_col]
        else:
            final_dict[row[group_col]] = row[target_col]
    return final_dict
    

In [12]:
year_births = calc_counts(cdc_list, 0, 4)
year_births

{1994: 3952767,
 1995: 3899589,
 1996: 3891494,
 1997: 3880894,
 1998: 3941553,
 1999: 3959417,
 2000: 4058814,
 2001: 4025933,
 2002: 4021726,
 2003: 4089950}
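
Rather than scanning the dictionary by eye, the largest and smallest totals can be read off with the built-in `max` and `min`. This is a minimal sketch: the helper name `extreme_key` is hypothetical, and the dictionary is copied from the output above so that the snippet runs on its own.

```python
# Yearly totals copied from the output above.
year_births = {
    1994: 3952767, 1995: 3899589, 1996: 3891494, 1997: 3880894, 1998: 3941553,
    1999: 3959417, 2000: 4058814, 2001: 4025933, 2002: 4021726, 2003: 4089950,
}

def extreme_key(counts, func=max):
    """Return the (key, value) pair whose value is selected by func (max or min)."""
    key = func(counts, key=counts.get)
    return key, counts[key]

print(extreme_key(year_births))       # (2003, 4089950)
print(extreme_key(year_births, min))  # (1997, 3880894)
```

The same helper works on any dictionary returned by `calc_counts`.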

Now it should be easy to answer the other questions.

## Is there any month with significantly more births than the other months?

In [13]:
month_births = calc_counts(cdc_list,1,4)
month_births

{1: 3232517,
 2: 3018140,
 3: 3322069,
 4: 3185314,
 5: 3350907,
 6: 3296530,
 7: 3498783,
 8: 3525858,
 9: 3439698,
 10: 3378814,
 11: 3171647,
 12: 3301860}

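August has the highest total number of births (3,525,858), but it is only slightly ahead of July (3,498,783) and September (3,439,698), so no single month stands far above the rest.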
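
Monthly totals favour months with 31 days, so a fairer comparison divides each total by the number of calendar days that month contributed between 1994 and 2003. The sketch below does this with the standard `calendar` module. The helper name `days_in_month_total` is an assumption, and the totals are copied from the output above. On this measure, September appears to come out slightly ahead of August.

```python
import calendar

# Monthly totals copied from the output above.
month_births = {
    1: 3232517, 2: 3018140, 3: 3322069, 4: 3185314, 5: 3350907, 6: 3296530,
    7: 3498783, 8: 3525858, 9: 3439698, 10: 3378814, 11: 3171647, 12: 3301860,
}

def days_in_month_total(month, first_year=1994, last_year=2003):
    """Count the calendar days of a given month across the whole year range."""
    return sum(
        calendar.monthrange(year, month)[1]  # monthrange returns (first weekday, days in month)
        for year in range(first_year, last_year + 1)
    )

# Average births per calendar day for each month.
births_per_day = {
    month: total / days_in_month_total(month)
    for month, total in month_births.items()
}

# Print the months from the highest to the lowest daily average.
for month, avg in sorted(births_per_day.items(), key=lambda item: item[1], reverse=True):
    print(calendar.month_name[month], round(avg))
```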

## Which day of the week and which month had the highest number of births?

In [14]:
day_month_births = calc_counts(cdc_list,2,4)
day_month_births

{1: 1276557,
 2: 1288739,
 3: 1304499,
 4: 1288154,
 5: 1299953,
 6: 1304474,
 7: 1310459,
 8: 1312297,
 9: 1303292,
 10: 1320764,
 11: 1314361,
 12: 1318437,
 13: 1277684,
 14: 1320153,
 15: 1319171,
 16: 1315192,
 17: 1324953,
 18: 1326855,
 19: 1318727,
 20: 1324821,
 21: 1322897,
 22: 1317381,
 23: 1293290,
 24: 1288083,
 25: 1272116,
 26: 1284796,
 27: 1294395,
 28: 1307685,
 29: 1223161,
 30: 1202095,
 31: 746696}

In [15]:
day_week_births = calc_counts(cdc_list,3,4)
day_week_births

{6: 4562111,
 7: 4079723,
 1: 5789166,
 2: 6446196,
 3: 6322855,
 4: 6288429,
 5: 6233657}

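The 18th day of the month has the highest total number of births (1,326,855). Within the week, Tuesday (day 2, with 6,446,196 births) is the busiest day: 1 January 1994 is coded as day 6 and was a Saturday, so day 1 is Monday. The totals for days 29 to 31 are lower mainly because not every month has those days. As the monthly totals above show, August is the month with the most births.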
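
To avoid misreading the numeric weekday codes, the sketch below maps them to names before picking the busiest and quietest days. The mapping 1 = Monday through 7 = Sunday is an assumption inferred from the data: the first row codes 1 January 1994 as day 6, and that date was a Saturday. `DAY_NAMES` is an illustrative name, and the totals are copied from the output above.

```python
from datetime import date

# Weekday totals copied from the output above.
day_week_births = {
    6: 4562111, 7: 4079723, 1: 5789166, 2: 6446196,
    3: 6322855, 4: 6288429, 5: 6233657,
}

# Assumed coding: ISO weekday numbers, 1 = Monday ... 7 = Sunday.
DAY_NAMES = {
    1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday",
    5: "Friday", 6: "Saturday", 7: "Sunday",
}

# Sanity check of the assumption: the dataset codes 1994-01-01 as day 6.
assert date(1994, 1, 1).isoweekday() == 6

busiest = max(day_week_births, key=day_week_births.get)
quietest = min(day_week_births, key=day_week_births.get)
print(DAY_NAMES[busiest], day_week_births[busiest])    # Tuesday 6446196
print(DAY_NAMES[quietest], day_week_births[quietest])  # Sunday 4079723
```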