# Calc Number of Births by Mother's Age by Year by State

This notebook merges the data produced by the two previous notebooks into a single dataset.
The result shows how many births occur in each state, for each year, by the age of the mother.

----------------------

<p>Author: PJ Gibson</p>
<p>Date: 2022-12-22</p>
<p>Contact: peter.gibson@doh.wa.gov</p>
<p>Other Contact: pjgibson25@gmail.com</p>

## 0. Import libraries, fpaths

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Specify [relative] fpaths
fpath_birthComplete = '../../../SupportingDocs/Births/03_Complete'

## 1. Read in Data

In [None]:
# Read in data representing the proportion of births each year for each childbearing age, 18-44 inclusive
df_BirthingAge = pd.read_csv(f'{fpath_birthComplete}/ProbaBirthByMotherAgeByYear.csv')\
                   .sort_values('Year')

# Read in vital statistics data: total number of births each year for each state
df_VitalStats = pd.read_csv(f'{fpath_birthComplete}/VitalStats_byYear_byState.csv')
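
Before merging, it may be worth confirming that the age proportions in each year add up to roughly 1, since any shortfall carries straight through to the per-age birth counts. The sketch below is a minimal illustration on made-up data: `toy_proba`, its three age columns, and the tolerance are assumptions for the example, not values from `ProbaBirthByMotherAgeByYear.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the proportions file (values are made up).
# Assumed layout: a 'Year' column plus one column per mother's age.
toy_proba = pd.DataFrame({
    'Year': [2000, 2001],
    '18': [0.2, 0.1],
    '19': [0.3, 0.4],
    '20': [0.5, 0.5],
})

# Every column other than 'Year' is treated as an age column
age_cols = [c for c in toy_proba.columns if c != 'Year']

# Sum across ages within each year; each total should be close to 1
row_totals = toy_proba[age_cols].sum(axis=1)
sums_to_one = np.isclose(row_totals, 1.0, atol=1e-6)

# List any years whose proportions do not sum to 1 within the tolerance
print(toy_proba.loc[~sums_to_one, 'Year'].tolist())
```

The same check could be run on `df_BirthingAge` directly, provided its non-`Year` columns are all age columns.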

## 2. Combine and Format

In [None]:
# Merge our data
df = df_VitalStats[['Year','State','Births']].merge(df_BirthingAge, on='Year',how='inner')

# When we round the number of births for each age, the age columns can sum to slightly
# more or less than the intended total. To measure the difference:
#     mylist = np.array(df.iloc[:,3:].sum(axis=1) - df['Births'])
# Mean of the births difference when using round: -0.075
# Std of the births difference when using round:   3.296
# This is acceptable: the number of births per year in the output is always within 10 births of the intended total
for i in range(18,45):
    df[f'{i}'] = np.round(df[f'{i}'] * df['Births']).astype(int)
    
df.drop('Births',axis=1,inplace=True)
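
To make the merge and rounding steps concrete, here is a small sketch on made-up data. The frames `toy_proba` and `toy_vital`, their values, and the `RoundingDiff` column are hypothetical and only mirror the column layout used above. Each state-year row picks up that year's age proportions, the proportions are scaled by `Births` and rounded, and the last step measures how far the rounded counts drift from the intended total.

```python
import numpy as np
import pandas as pd

# Made-up proportions of births by mother's age, one row per year
toy_proba = pd.DataFrame({
    'Year': [2000, 2001],
    '18': [0.25, 0.30],
    '19': [0.35, 0.30],
    '20': [0.40, 0.40],
})

# Made-up total births, one row per state and year
toy_vital = pd.DataFrame({
    'Year': [2000, 2000, 2001],
    'State': ['Washington', 'Oregon', 'Washington'],
    'Births': [101, 55, 98],
})

# Each state-year row receives the age proportions for its year
toy = toy_vital[['Year', 'State', 'Births']].merge(toy_proba, on='Year', how='inner')

# Convert proportions to whole-number birth counts
age_cols = ['18', '19', '20']
for col in age_cols:
    toy[col] = np.round(toy[col] * toy['Births']).astype(int)

# Difference between the summed rounded counts and the intended total
toy['RoundingDiff'] = toy[age_cols].sum(axis=1) - toy['Births']
print(toy)
```

With only three age columns the drift stays small. With 27 age columns, as in the real data, more rounding errors accumulate per row, which is consistent with the larger spread noted in the comments above.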

## 3. Save

In [None]:
df.to_csv(f'{fpath_birthComplete}/DesiredBirthsByYearByAgeByState.csv',
          header=True,
          index = False)

## Extras...

An earlier version of this project, which still lives in my not-yet-cleaned-up storage structure, has simulation data and supporting docs for Washington State only.
The current version extends the approach to other states.

The cell below shows how the number of births per year differs between the two versions for Washington State.

In [None]:
# # Read in previous project version of Washington State desired births by age for comparison
# df_WA = pd.read_csv(f'{fpath_birthComplete}/DesiredBirthsByYearByAge.csv')

# # Filter to data on Washington from newer project version
# df_WAnew = df.query('(State == "Washington") & (Year < 2023)').drop('State',axis=1).reset_index(drop=True)

# # Calculate differences between two datasets, format

# differences = (df_WAnew - df_WA).drop(['Year','year'], axis=1)
# differences['Year'] = df_WAnew.Year
# differences['Year'] = differences['Year'].astype(pd.Int64Dtype())
# differences.set_index('Year', inplace=True)
# differences.dropna(how='all', inplace=True)

# # Plotting

# fig = plt.figure(figsize=(9,7))
# differences.sum(axis=1).plot()
# plt.hlines(y=0, xmin=1920, xmax=2021, color='k', linestyle='--')
# plt.suptitle('Washington State')
# plt.title('Difference in Births between version 1 and version 0')
# plt.xlabel('Year')
# plt.ylabel('Difference (number) of Births')
# plt.show()
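
The commented-out comparison subtracts the two frames row by row, which relies on both having the same years in the same order. One alternative, sketched here on made-up data, is to index both frames on `Year` so that pandas aligns the subtraction by year. The frames `toy_old` and `toy_new` are hypothetical stand-ins, and this is not the approach used in the cell above.

```python
import pandas as pd

# Made-up stand-ins for the old (Washington-only) and new outputs
toy_old = pd.DataFrame({'Year': [2000, 2001], '18': [10, 12], '19': [20, 21]})
toy_new = pd.DataFrame({'Year': [2000, 2001, 2002], '18': [11, 14, 13], '19': [19, 21, 22]})

# Index both frames on Year so subtraction aligns by year, not by row position
diff = toy_new.set_index('Year').sub(toy_old.set_index('Year'))

# Years present in only one version come out as all-NaN rows; drop them
diff = diff.dropna(how='all')

# Net difference in births per year across all ages
total_diff = diff.sum(axis=1)
print(total_diff)
```

Aligning on the index also avoids having to drop and re-attach the `Year` column around the subtraction.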