<h1>
<center>
Explore Birth Dates in the United States
</center>
</h1>

## 1. Load file and prepare data

We'll be working with the data set from the Centers for Disease Control and Prevention's National Center for Health Statistics. The data set has the following structure:


| Header | Definition   |
|------|------|
|   year  | Year|
|   month  | Month|
|   date_of_month  | Day number of the month|
|   day_of_week  | Day of week, where 1 is Monday and 7 is Sunday|
|   births  | Number of births|

We want to know when American babies are most likely to be born.

Data source: [FiveThirtyEight](https://github.com/fivethirtyeight/data)

In [11]:
with open("US_births_1994-2003_CDC_NCHS.csv", "r") as file:
    data = file.read().split('\n')

Display the first 10 lines of the data (the header plus nine rows):

In [7]:
data[0:10]

['year,month,date_of_month,day_of_week,births',
 '1994,1,1,6,8096',
 '1994,1,2,7,7772',
 '1994,1,3,1,10142',
 '1994,1,4,2,11248',
 '1994,1,5,3,11053',
 '1994,1,6,4,11406',
 '1994,1,7,5,11251',
 '1994,1,8,6,8653',
 '1994,1,9,7,7910']

We need to convert the data into a list of lists where each nested list contains integer values, and remove the header. Let's create a function to do that.

In [8]:
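def read_csv(filename):
    # Read the whole file, split it into lines, and drop the header row
    with open(filename, 'r') as f:
        string_list = f.read().split('\n')[1:]
    final_list = list()
    for row in string_list:
        if not row:
            continue  # skip blank lines, e.g. after a trailing newline
        int_fields = list()
        string_fields = row.split(',')
        # Convert every field of the row from text to an integer
        for value in string_fields:
            int_fields.append(int(value))
        final_list.append(int_fields)
    return final_list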
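
As an alternative, the same parsing could be handed to Python's standard `csv` module. The sketch below is only an illustration: the function name `read_csv_with_module` and the temporary demo file are assumptions, not part of the original notebook, and the sample lines are the first rows displayed earlier.

```python
import csv
import os
import tempfile

def read_csv_with_module(filename):
    """Return every row after the header as a list of integers."""
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        # Ignore empty rows, then convert each field to int
        return [[int(value) for value in row] for row in reader if row]

# Demo on a small temporary file (hypothetical, built from the first rows of the data)
sample = (
    "year,month,date_of_month,day_of_week,births\n"
    "1994,1,1,6,8096\n"
    "1994,1,2,7,7772\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
sample_rows = read_csv_with_module(tmp.name)
os.remove(tmp.name)
print(sample_rows)
```

On the real file, the call would be `read_csv_with_module("US_births_1994-2003_CDC_NCHS.csv")`.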

Now read the file with the function:

In [9]:
cdc_list = read_csv("US_births_1994-2003_CDC_NCHS.csv")

Display the first 10 values of the data:

In [10]:
cdc_list[0:10]

[[1994, 1, 1, 6, 8096],
 [1994, 1, 2, 7, 7772],
 [1994, 1, 3, 1, 10142],
 [1994, 1, 4, 2, 11248],
 [1994, 1, 5, 3, 11053],
 [1994, 1, 6, 4, 11406],
 [1994, 1, 7, 5, 11251],
 [1994, 1, 8, 6, 8653],
 [1994, 1, 9, 7, 7910],
 [1994, 1, 10, 1, 10498]]

## 2. Count all births

Now that the data is in a more usable format, we can start to analyze it. 

#### General count function
Let's calculate the total number of births that occurred in each month, across all of the years in the dataset. We'll create a dictionary where each key is a single month, and each value is the number of births that happened in that month, across all years:

In [21]:
def calc_counts(data, column):
    # Sum the births (column 4) for each distinct value found in `column`
    births_counts = dict()
    for row in data:
        feature = row[column]
        births = row[4]
        if feature in births_counts:
            births_counts[feature] = births_counts[feature] + births
        else:
            births_counts[feature] = births
    return births_counts
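
The if/else accumulation can also be written with `collections.defaultdict`. This is a minimal sketch of that variant, not the notebook's own function: the name `calc_counts_defaultdict` and the `births_column` parameter are assumptions added for illustration, and the sample rows are taken from the first rows of the data.

```python
from collections import defaultdict

def calc_counts_defaultdict(data, column, births_column=4):
    """Sum births for each distinct value found in `column`."""
    totals = defaultdict(int)  # missing keys start at 0
    for row in data:
        totals[row[column]] += row[births_column]
    return dict(totals)

# A few rows from the start of the data set
sample_rows = [
    [1994, 1, 1, 6, 8096],
    [1994, 1, 2, 7, 7772],
    [1994, 1, 3, 1, 10142],
    [1994, 1, 8, 6, 8653],
]
dow_totals = calc_counts_defaultdict(sample_rows, 3)
print(dow_totals)
```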

#### Total births per month:

In [26]:
cdc_month_births = calc_counts(cdc_list, 1)
cdc_month_births

{1: 3232517,
 2: 3018140,
 3: 3322069,
 4: 3185314,
 5: 3350907,
 6: 3296530,
 7: 3498783,
 8: 3525858,
 9: 3439698,
 10: 3378814,
 11: 3171647,
 12: 3301860}

#### Total births per year:

In [27]:
cdc_year_births = calc_counts(cdc_list, 0)
cdc_year_births

{1994: 3952767,
 1995: 3899589,
 1996: 3891494,
 1997: 3880894,
 1998: 3941553,
 1999: 3959417,
 2000: 4058814,
 2001: 4025933,
 2002: 4021726,
 2003: 4089950}

#### Total births per day of the month:

In [28]:
cdc_dom_births = calc_counts(cdc_list, 2)
cdc_dom_births

{1: 1276557,
 2: 1288739,
 3: 1304499,
 4: 1288154,
 5: 1299953,
 6: 1304474,
 7: 1310459,
 8: 1312297,
 9: 1303292,
 10: 1320764,
 11: 1314361,
 12: 1318437,
 13: 1277684,
 14: 1320153,
 15: 1319171,
 16: 1315192,
 17: 1324953,
 18: 1326855,
 19: 1318727,
 20: 1324821,
 21: 1322897,
 22: 1317381,
 23: 1293290,
 24: 1288083,
 25: 1272116,
 26: 1284796,
 27: 1294395,
 28: 1307685,
 29: 1223161,
 30: 1202095,
 31: 746696}

#### Total births per day of the week:

In [29]:
cdc_dow_births = calc_counts(cdc_list, 3)
cdc_dow_births

{6: 4562111,
 7: 4079723,
 1: 5789166,
 2: 6446196,
 3: 6322855,
 4: 6288429,
 5: 6233657}

## 3. Find the most and least frequent birth times

Now that we have counted all births per year, month, and day, we can identify when births are most and least frequent.
Let's create a function that finds the extrema of each dictionary, that is, the keys with the smallest and largest totals.

In [30]:
def extremum(dictionary):
    maxi = max(dictionary, key=dictionary.get)
    mini = min(dictionary, key=dictionary.get)
    return "Minimum ="+ str(mini) + " Maximum = " + str(maxi)

In [31]:
extremum(cdc_year_births)

'Minimum =1997 Maximum = 2003'

In [32]:
extremum(cdc_month_births)

'Minimum =2 Maximum = 8'

In [33]:
extremum(cdc_dom_births)

'Minimum =31 Maximum = 18'

In [34]:
extremum(cdc_dow_births)

'Minimum =7 Maximum = 2'
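
The string returned by `extremum` is easy to read but awkward to reuse in later calculations. A possible variant, sketched here under the hypothetical name `extremum_items`, returns the keys together with their totals; the dictionary is the day-of-week result printed above.

```python
def extremum_items(dictionary):
    """Return ((min_key, min_value), (max_key, max_value))."""
    min_key = min(dictionary, key=dictionary.get)
    max_key = max(dictionary, key=dictionary.get)
    return (min_key, dictionary[min_key]), (max_key, dictionary[max_key])

# Totals per day of the week, copied from the output above
cdc_dow_births = {
    6: 4562111, 7: 4079723, 1: 5789166, 2: 6446196,
    3: 6322855, 4: 6288429, 5: 6233657,
}
lowest, highest = extremum_items(cdc_dow_births)
print("Lowest:", lowest, "Highest:", highest)
```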

## 4. What have we learned about US births with this quick and easy study?

Total births were lowest in 1997 and highest in 2003. Looking more closely at the yearly totals, births fell from 1994 to 1997 and then generally rose through 2003, with a small dip in 2001 and 2002.
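
To look at the year-to-year trend more closely, one option is to compute the change in total births from each year to the next. The sketch below does this on the yearly totals printed in section 2; the variable name `changes` is an assumption for illustration.

```python
# Yearly totals copied from the output in section 2
cdc_year_births = {
    1994: 3952767, 1995: 3899589, 1996: 3891494, 1997: 3880894,
    1998: 3941553, 1999: 3959417, 2000: 4058814, 2001: 4025933,
    2002: 4021726, 2003: 4089950,
}

years = sorted(cdc_year_births)
# Difference between each year and the year before it
changes = {
    year: cdc_year_births[year] - cdc_year_births[year - 1]
    for year in years[1:]
}
for year, change in changes.items():
    print(year, f"{change:+d}")
```

A negative value means fewer births than in the previous year.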

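The day of the month with the fewest births is the 31st, which makes sense because it occurs in only seven months of the year. The 29th and 30th are also low, because February has no 30th and has a 29th only in leap years. The remaining days have broadly similar totals, so the 18th coming out on top may simply be random variation.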
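
Because some keys occur more often than others, raw totals are not directly comparable. One way to account for this is to average births per row instead of summing them. The function below is a sketch of that idea; `calc_means` and its `births_column` parameter are hypothetical names, and the demo uses a few rows from the start of the data grouped by day of the week.

```python
def calc_means(data, column, births_column=4):
    """Average births per row for each distinct value found in `column`."""
    totals = {}
    counts = {}
    for row in data:
        key = row[column]
        totals[key] = totals.get(key, 0) + row[births_column]
        counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}

# Rows taken from the first lines of the data set
sample_rows = [
    [1994, 1, 1, 6, 8096],
    [1994, 1, 2, 7, 7772],
    [1994, 1, 3, 1, 10142],
    [1994, 1, 8, 6, 8653],
    [1994, 1, 9, 7, 7910],
    [1994, 1, 10, 1, 10498],
]
dow_means = calc_means(sample_rows, 3)
print(dow_means)
```

Applied to the full list with `calc_means(cdc_list, 2)`, this would give the average number of births on each day of the month, regardless of how many months contain that day.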

August is the month with the most births, which may be explained by the winter baby boom described [here](https://www.independent.co.uk/life-style/health-and-families/babies-conceive-christmas-why-most-parents-couples-conception-a8103201.html).
February is the month with the fewest births. It is also the shortest month, which accounts for part of the gap. Another suggested explanation is that doctors advise against conceiving in June because of pesticide use; see the explanation of "toxic June" [here](https://www.telegraph.co.uk/news/science/science-news/11948522/Avoid-toxic-June-when-trying-to-conceive-say-scientists.html).
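
Since months differ in length, a rough check is to divide each monthly total by the number of calendar days that month had over the period. The sketch below assumes the data covers every day from 1994 through 2003, as the file name suggests, and uses the monthly totals printed in section 2; `days_in_month` and `births_per_day` are names introduced for illustration.

```python
import calendar

# Monthly totals copied from the output in section 2
cdc_month_births = {
    1: 3232517, 2: 3018140, 3: 3322069, 4: 3185314,
    5: 3350907, 6: 3296530, 7: 3498783, 8: 3525858,
    9: 3439698, 10: 3378814, 11: 3171647, 12: 3301860,
}

years = range(1994, 2004)  # assumption: full coverage of 1994-2003
# calendar.monthrange returns (weekday of first day, number of days)
days_in_month = {
    month: sum(calendar.monthrange(year, month)[1] for year in years)
    for month in cdc_month_births
}
births_per_day = {
    month: cdc_month_births[month] / days_in_month[month]
    for month in cdc_month_births
}
for month in sorted(births_per_day):
    print(month, round(births_per_day[month]))
```

Comparing this ranking with the raw totals gives a sense of how much of the February minimum is simply a month-length effect.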

Finally, the day with the fewest births is Sunday. Hospitals have the fewest staff on duty that day, so planned deliveries such as scheduled cesarean sections are generally not booked on Sundays, leaving mostly natural births and emergencies. The same effect appears on Saturdays, although it is less pronounced than on Sundays.
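
To put a number on the weekend effect, one could compare the average total for a weekend day with the average total for a weekday. The sketch below uses the day-of-week totals printed in section 2 and the encoding from the table in section 1 (1 is Monday, 7 is Sunday); the `DAY_NAMES` mapping and the `ratio` variable are additions for illustration.

```python
DAY_NAMES = {
    1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday",
    5: "Friday", 6: "Saturday", 7: "Sunday",
}

# Totals per day of the week, copied from the output in section 2
cdc_dow_births = {
    6: 4562111, 7: 4079723, 1: 5789166, 2: 6446196,
    3: 6322855, 4: 6288429, 5: 6233657,
}

weekday_mean = sum(cdc_dow_births[d] for d in range(1, 6)) / 5
weekend_mean = sum(cdc_dow_births[d] for d in (6, 7)) / 2
ratio = weekend_mean / weekday_mean

for day in sorted(cdc_dow_births):
    print(DAY_NAMES[day], cdc_dow_births[day])
print(f"Weekend-to-weekday ratio: {ratio:.2f}")
```

Each day of the week occurs almost equally often over ten years, so the totals can be compared directly here without normalizing.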