# Wrangle Births by Mother Age by Year

This script takes Michigan data on births broken down by mother's age and works out, for each year, what percent of all births belongs to each age.
We assume that the Michigan distribution can be extrapolated to the rest of the United States.
Data at this level of detail is hard to come by, which makes the Michigan table especially valuable.

The raw Michigan data on births by mother's age is [linked here](https://vitalstats.michigan.gov/osr/natality/Tab4.4.asp).

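Note that the data has been adjusted so that births to mothers under 18 are counted in the 18-19 bracket and births to mothers aged 45 and over are counted in the 40-44 bracket.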

----------------------

<p>Author: PJ Gibson</p>
<p>Date: 2022-12-22</p>
<p>Contact: peter.gibson@doh.wa.gov</p>
<p>Other Contact: pjgibson25@gmail.com</p>


## 0. Import libs, define fpaths

In [None]:
import pandas as pd
import numpy as np
import requests

# Specify [relative] fpaths
fpath_birthWrangled = '../../../SupportingDocs/Births/02_Wrangled'
fpath_birthComplete = '../../../SupportingDocs/Births/03_Complete'

## 1. API Query

### 1.1 Perform Query, Read table

In [None]:
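from io import BytesIO

url = 'https://vitalstats.michigan.gov/osr/natality/Tab4.4.asp'
response = requests.get(url)
response.raise_for_status()  # Fail early if the page could not be fetched
# Wrap the raw bytes in a file-like object; newer pandas versions warn on literal HTML input
df_MI = pd.read_html(BytesIO(response.content), skiprows=1, header=0)[0]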

### 1.2 Wrangle Data

In [None]:
# Drop null rows
df_MI = df_MI.dropna(subset=['Year'])

# Make year an int
df_MI['Year'] = df_MI['Year'].astype(int)

# Fold births to mothers aged 10-14 into the 15-19 bracket, then relabel it 18-19. Param set birthing age >= 18
df_MI['15-19'] = df_MI['15-19'] + df_MI['10-14']
df_MI.rename(columns={'15-19':'18-19'}, inplace=True)

# Fold 45+ births into the 40-44 bracket. Param set birthing age <= 44
df_MI['40-44'] = df_MI['40-44']+df_MI['45+']

# Drop unused columns
df_MI.drop(['10-14','Total  Fertility  Rate','45+'],axis=1,inplace=True)

# Sort values by Year
df_MI.sort_values('Year',inplace=True)
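
The bracket folding above moves births between columns rather than discarding them, so the yearly total is unchanged. A minimal sketch on made-up numbers (the `toy` table and its counts are illustrative assumptions, not Michigan data):

```python
import pandas as pd

# Hypothetical single-year table with the same bracket labels as the source
toy = pd.DataFrame({
    'Year': [2021],
    '10-14': [1],
    '15-19': [9],
    '20-24': [50],
    '40-44': [5],
    '45+': [2],
})
total_before = int(toy.drop(columns='Year').sum(axis=1).iloc[0])

folded = toy.copy()
folded['15-19'] += folded['10-14']   # youngest bracket absorbed upward
folded['40-44'] += folded['45+']     # oldest bracket absorbed downward
folded = (
    folded.drop(columns=['10-14', '45+'])
          .rename(columns={'15-19': '18-19'})
)
total_after = int(folded.drop(columns='Year').sum(axis=1).iloc[0])

print(folded)
print(total_before, total_after)
```

Both totals should match, which is a quick way to confirm that no births were dropped during the relabeling.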

#### 1.2.1 Extend Birthing Data

We're only using this data to determine what percent of births in a given year belongs to each age range.
Since the data we have here ends in 2021, we'll extend it to 2025 by assuming that the breakdown for those missing years is identical to 2021's.

In [None]:
# Find the most recent year (last row after the sort above).  At the time of running this, it was 2021
year_most_recent = int(df_MI['Year'].iloc[-1])

# Get years between most recent year + 1 and 2025 inclusive
missing_years = np.arange(year_most_recent + 1, 2026 )

# Repeat the last row of data once for each missing year up to 2025
new_rows = pd.concat([df_MI.iloc[-1:,:]] * len(missing_years))

# Edit the "Year" column to reflect those missing
new_rows['Year'] = missing_years.astype(int)

# Append new rows to existing dataframe
df_MI = pd.concat([df_MI, new_rows], ignore_index=True)
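
The same extension can also be expressed with `reindex` and a forward fill. The sketch below is an alternative under stated assumptions, not the method used in this notebook: `extend_years` and `target_year` are hypothetical names, and the toy counts are made up.

```python
import pandas as pd

def extend_years(df, target_year):
    """Repeat the last available year's row through target_year (inclusive).

    Hypothetical helper: assumes `df` has an integer 'Year' column.
    """
    indexed = df.sort_values('Year').set_index('Year')
    last_year = int(indexed.index.max())
    # Existing years followed by the years that still need a row
    years = list(indexed.index) + list(range(last_year + 1, target_year + 1))
    new_index = pd.Index(years, name='Year')
    # Missing years come back as NaN, then inherit the last known row
    return indexed.reindex(new_index).ffill().reset_index()

toy = pd.DataFrame({'Year': [2020, 2021], '18-19': [10, 12], '20-24': [30, 28]})
extended = extend_years(toy, 2023)
print(extended)
```

Note that the forward fill converts integer counts to floats, which is harmless here because the table is normalized to shares in the next step.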

### 1.3 Save for viewing/record

In [None]:
df_MI.to_csv(f'{fpath_birthWrangled}/michigan_yearly_births_by_mothers_age.csv',index=False,header=True)

## 2. Reread/format

In [None]:
# Read
df = pd.read_csv(f'{fpath_birthWrangled}/michigan_yearly_births_by_mothers_age.csv')

# Set index, define columns
df.set_index('Year', inplace=True)
AgeRange_cols = df.columns

# Normalize each row so the age-range shares sum to 1 for every year
df = df.div(df.sum(axis=1), axis=0)

# Get year on the outside
df = df.reset_index()
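
To illustrate what the normalization does, here is a small sketch on made-up counts (the `toy` table is an assumption for illustration only). Dividing each row by its own sum turns counts into shares that add up to 1 per year.

```python
import pandas as pd

# Hypothetical counts for two years and three age ranges
toy = pd.DataFrame(
    {'18-19': [10, 20], '20-24': [30, 60], '25-29': [60, 120]},
    index=pd.Index([2020, 2021], name='Year'),
)

# axis=0 aligns the row sums with the Year index, so each row is divided by its own total
shares = toy.div(toy.sum(axis=1), axis=0)
row_sums = shares.sum(axis=1)

print(shares)
print(row_sums)
```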

## 3. Wrangle data

In [None]:
output = []

# For each of the age-range columns
for col in AgeRange_cols:
    
    # Define minimum age in column age range
    range_min = int(col.split('-')[0])
    
    # Define maximum age in column age range
    range_max = int(col.split('-')[1])
    
    # Get a list of all ages within the range
    range_ages = np.arange(range_min, range_max + 1)
    
    for age in range_ages:
    
        # Calculate the share of births assigned to each individual age in the range,
        # assuming equal probability within the range (looks like a step-wise function)
        ys = list(df[col] / len(range_ages))
        
        # Build one row: the age followed by that age's share for every year
        outputme = [age]
        outputme.extend(ys)
        
        # Append the row to our output list
        output.append(outputme)
        
# Save to dataframe with proper columns, shape, index
df_output = pd.DataFrame(output).set_index(0).transpose()
df_output.reset_index(drop=True, inplace=True)

# Tack on the df.year column and we're set!
final_output = pd.merge(df.Year, df_output, how='inner', on=None, left_index=True, right_index=True)
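
Because each range's share is split evenly across its ages, the per-age shares for a year should still sum to 1. The sketch below checks this on made-up shares; `split_ranges_uniformly` is a hypothetical helper that mirrors the loop above, not a function from this notebook.

```python
import pandas as pd

def split_ranges_uniformly(shares):
    """Spread each 'lo-hi' column evenly over the single ages lo..hi (inclusive)."""
    per_age = {}
    for col in shares.columns:
        lo, hi = (int(part) for part in col.split('-'))
        n_ages = hi - lo + 1
        for age in range(lo, hi + 1):
            # Every age in the range gets the same fraction of the range's share
            per_age[age] = shares[col] / n_ages
    return pd.DataFrame(per_age)

# Hypothetical normalized shares for two years
toy_shares = pd.DataFrame(
    {'18-19': [0.2, 0.1], '20-24': [0.8, 0.9]},
    index=pd.Index([2020, 2021], name='Year'),
)
per_age = split_ranges_uniformly(toy_shares)
row_sums = per_age.sum(axis=1)

print(per_age)
print(row_sums)
```

A similar check on `final_output` (summing all columns except `Year`) is a cheap safeguard before saving.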

## 4. Save!

In [None]:
final_output.to_csv(f'{fpath_birthComplete}/ProbaBirthByMotherAgeByYear.csv',index=False,header=True)