# get-births-by-year-by-state

The purpose of this script is to calculate vital stats by year by state from 1914-2025.
Output includes:
* Year
* State
* Deaths
* MortalityRate
* Population
* sourceDeaths
* NextPopulation
* PopulationChange
* Population Retained
* Domestic Immigrants
* Foreign Immigrants
* Total Immigrants
* Total Emigrants
* CrudeBirthRate
* Births

Note: migration data is not available before 2005, so the migration columns contain mostly null entries.

In order to do this, we combine data from several sources:
* Population Sources
  * [Census Apportionment Data - population stats in decennial years](https://www.census.gov/data/tables/time-series/dec/popchange-data-text.html)
    * {root}/SupportingDocs/Births/01_Raw/apportionment.csv
  * Linear approximation for non-decennial years from 1910-1968
  * Requested NCHS Population Data from Mortality Data Files
    * {root}/SupportingDocs/Births/01_Raw/Deaths*
  * Using previous populations & linearly approximated birth/death rates for 2021-2025
* Mortality Sources
  * Data Manually Gathered from [Census PDFs for 1914-1940](https://www.cdc.gov/nchs/data/vsus/vsrates1900_40.pdf) for each state Mortality Rate
    * {root}/SupportingDocs/Births/02_Wrangled/MortalityRates_pt*
  * Data Manually Gathered from [Census for 1940-1970](https://www2.census.gov/library/publications/1975/compendia/hist_stats_colonial-1970/hist_stats_colonial-1970p1-chB.pdf) for United States Mortality Rate as a whole.
    * {root}/SupportingDocs/Births/02_Wrangled/census-hist-stats-deathrates.csv
  * Requested NCHS/CDC Mortality Data Files from 1968-2020
    * {root}/SupportingDocs/Births/01_Raw/Deaths*
  * Linear approximation for 2021-2025 based on the previous 5 years of mortality rates
* Birth Sources
  * Requested NCHS/CDC Birth Data Files from 1995-2020.  Mostly used for number of births from 1995-2016.
    * {root}/SupportingDocs/Births/01_Raw/Births*
  * Crude birth rates from [Gapminder 1800-2015](https://docs.google.com/spreadsheets/d/1QkK8B3EnGoWzcHUmdf0AIU8YHk5LmzbOcsRRKbN9w2Y/pub?gid=1) for all of United States.  Extrapolated to other states from 1914-1995 for simplicity.
    * {root}/SupportingDocs/Births/01_Raw/indicator_crude birth_rate.csv
  * CDC from 2016-2021.  Mostly used for number of births from 2016-2022
    * {root}/SupportingDocs/Births/01_Raw/Natality, 2016-2021 expanded.txt
* Migration Sources
  * Census data wrangled in previous notebook, [raw data found linked](https://www.census.gov/data/tables/time-series/demo/geographic-mobility/state-to-state-migration.html), wrangled below
    * {root}/SupportingDocs/Births/02_Wrangled/StateMigrationData.csv
    

----------------------

<p>Author: PJ Gibson</p>
<p>Date: 2022-12-22</p>
<p>Contact: peter.gibson@doh.wa.gov</p>
<p>Other Contact: pjgibson25@gmail.com</p>

# 0. Import Libraries, fpaths

In [None]:
# Import libraries we'll use
import numpy as np
import pandas as pd
import numpy.polynomial.polynomial as poly
import matplotlib.pyplot as plt

# Specify [relative] fpaths
fpath_birthRaw = '../../../SupportingDocs/Births/01_Raw'
fpath_birthWrangled = '../../../SupportingDocs/Births/02_Wrangled'
fpath_birthComplete = '../../../SupportingDocs/Births/03_Complete'
fpath_migrationWrangled = '../../../SupportingDocs/State-to-State-Migration/02_Wrangled'

# We'll extrapolate births/deaths/population up until this year (inclusive).
#### this will be done using a linear approach that takes the average trend of the last 5 years of available data
desired_end_year = 2025

# 1. Load data

## 1.1 Population data

Load population by state for decennial census years, convert the population to an integer, and keep only the columns we need.

The apportionment.csv file was downloaded from the [following page](https://www.census.gov/data/tables/time-series/dec/popchange-data-text.html)

In [None]:
df_populationCtRaw = pd.read_csv(f'{fpath_birthRaw}/apportionment.csv')
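# Strip thousands separators before casting to int (plain string replace avoids the invalid '\,' escape)
df_populationCtRaw['Population'] = df_populationCtRaw['Resident Population'].astype(str).str.replace(',', '', regex=False).astype(int)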
df_populationCt_before = df_populationCtRaw.query('(`Geography Type` == "State")')\
                                    [['Name','Year','Population']]\
                                    .rename(columns={'Population':'PopulationUN',
                                                     'Name':'State'})

### 1.1.1 Extrapolate Population by state in non-decennial  years

We only have population data for 1910, 1920, 1930, ...

but we would like that data for every single year.
Using linear approximation, we'll calculate the values for non-decennial years.
We could do this approximation several ways, including fitting logarithmic or exponential growth curves, but since our data points are fairly close together, I would rather use a linear approximation.

Higher-order (non-linear) fits would be particularly difficult for the state of West Virginia, which underwent unusual population change throughout its history.

In [None]:
# Groupby state, ordering by year
g = df_populationCt_before.sort_values(['State','Year']).groupby('State')

# Initialize output 
output = []

# For each state and its contents
for state, frame in g:
    
    # Calculate difference between population at year[row] and year[row+1] for all but the last row
    population_differences = frame['PopulationUN'][1:].to_numpy() - frame['PopulationUN'][:-1].to_numpy()
    
    # For every row but the last...
    for i in np.arange(0,len(frame)-1):
        
        # Define starting values and average population change.  Division by 10 assumes data incoming is decennial.
        start_year = int(frame.iloc[i]['Year'])
        start_pop = int(frame.iloc[i]['PopulationUN'])
        avg_pop_change = population_differences[i] / 10
        
        # Append starting values (already present in data)
        output.append( [state, start_year, start_pop] )
        
        # For every non-decennial year...
        for j in np.arange(1,10):
            
            # Append the linear approximation values
            output.append( [state, start_year+int(j), start_pop+(avg_pop_change*j)] )
    
    # append final year since the for-loop excluded the last row
    final_year = int(frame.iloc[-1]['Year'])
    final_pop = int(frame.iloc[-1]['PopulationUN'])
    output.append( [state, final_year, final_pop] )
    
# Overwrite our df_populationCt dataframe
df_populationCt = pd.DataFrame(output, columns=['State','Year','PopulationUN'])
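
The loop above amounts to piecewise-linear interpolation between census years. As a rough sketch of the same idea (not the code used to build the output), `np.interp` can fill the gaps one state at a time. The helper name `interpolate_decennial` and the toy numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

def interpolate_decennial(frame):
    """Linearly fill every year between the first and last census year of one state."""
    frame = frame.sort_values('Year')
    years = np.arange(frame['Year'].min(), frame['Year'].max() + 1)
    pops = np.interp(years, frame['Year'], frame['PopulationUN'])
    return pd.DataFrame({'Year': years, 'PopulationUN': pops})

# Hypothetical toy data, not real census counts
toy = pd.DataFrame({
    'State': ['A', 'A', 'A', 'B', 'B'],
    'Year': [1910, 1920, 1930, 1910, 1920],
    'PopulationUN': [1000, 2000, 1500, 400, 300],
})

# Interpolate each state separately, then stack the results
filled = pd.concat(
    [interpolate_decennial(frame).assign(State=state)
     for state, frame in toy.groupby('State')],
    ignore_index=True,
)
print(filled.query('Year in [1915, 1925]'))
```

Unlike the loop, this version does not assume that consecutive census years are exactly 10 years apart.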

## 1.2 Deaths data

### 1.2.1 CDC data

The data below was given to me directly by the CDC.
It is the only available data by state by year that they had.
Previous years require manual data entry.

In [None]:
df_Death1 = pd.read_csv(f'{fpath_birthRaw}/Deaths by Year and State in 1968-1978.txt', delimiter='\t', skipfooter=10, engine='python')\
               .dropna(subset=['State'],how='any')
               
df_Death2 = pd.read_csv(f'{fpath_birthRaw}/Deaths-by-Year-and-State-1979-1998.txt', delimiter='\t', skipfooter=10, engine='python')\
               .dropna(subset=['State'],how='any')
               
df_Death3 = pd.read_csv(f'{fpath_birthRaw}/Deaths by Year and State in 1999-2020.txt', delimiter='\t', skipfooter=10, engine='python')\
               .dropna(subset=['State'],how='any')
               
df_Deaths_nchs = pd.concat([df_Death1,df_Death2,df_Death3])\
                      [['Year','State','Deaths','Crude Rate','Population']]\
                      .rename(columns={'Crude Rate':'MortalityRate'})

df_Deaths_nchs['Year'] = df_Deaths_nchs['Year'].astype(int)

### 1.2.2 Data manually entered

The data below I manually entered into an Excel sheet using census data.
It only covers the years 1914-1940.
Data source [is linked here](https://www.cdc.gov/nchs/data/vsus/vsrates1900_40.pdf)


See {root_dir}/supporting/Births/01_Raw/vitalstats-snapshot-mortalityrates-1914-1940.pdf for the snapshot that I manually entered into the two CSVs below.


In [None]:
df_Mortality1 = pd.read_csv(f'{fpath_birthWrangled}/MortalityRates_pt1.csv')\
                  .rename(columns={'Unnamed: 0':'state'})\
                  .set_index('state')

df_Mortality2 = pd.read_csv(f'{fpath_birthWrangled}/MortalityRates_pt2.csv')\
                  .rename(columns={'Unnamed: 0':'state'})\
                  .set_index('state')

df_Mortality_manual = df_Mortality1.join(df_Mortality2)

In [None]:
# Add {'Alaska', 'Hawaii'} to the data.  Puerto Rico will not be part of our dataset due to missing CDC WONDER data for mortality (see df_Deaths_nchs)
nullarray = [np.nan]*len(df_Mortality_manual.columns)
null_df = pd.DataFrame([nullarray,nullarray],columns=df_Mortality_manual.columns).transpose()\
            .rename(columns={0:'Alaska',1:'Hawaii'})\
            .transpose()

# Add them by appending
df_Mortality_manual = pd.concat([df_Mortality_manual, null_df],ignore_index=False)

### 1.2.3 Data from 1940-1970
    
There is historical data on vital stats from 1865-1945 as a census publication.
It has the US overall crude death rate for all of those years.
Note that for the years 1940-1945 this excludes members of the armed forces (WWII had a large impact).
That resource can be [found linked here](https://www2.census.gov/library/publications/1975/compendia/hist_stats_colonial-1970/hist_stats_colonial-1970p1-chB.pdf)

This data lines up with the averages in the other census data from 1914-1940, so the two sources are consistent.

<b>Rationale for not manually entering data from 1940-1968</b>
We prefer state-specific information, but after manually entering data for each state for the years 1914-1940 (contained in one PDF), I decided against doing more manual labor.
For state-by-state data for the years 1940-1968, the data is stored in census pdf files on a year-by-year basis and does not always contain data on every state.
Given the amount of manual data entry work, data missingness, and large number of dated census photoscans I would need to manually parse through, I decided to use United States averages for this range.

In [None]:
df_Mortality_hist_census = pd.read_csv(f'{fpath_birthWrangled}/census-hist-stats-deathrates.csv')\
                             .rename(columns={'Unnamed: 0':'State'})\
                             .set_index('State')\
                             .iloc[:,1:-3]

### 1.2.4 Joining Census Data

In [None]:
# Join our manual state-by-state data entries for census death rates with historical averages to get years 1914-1968
df_Mortality_census = df_Mortality_manual.join(df_Mortality_hist_census)

In [None]:
# Fillna values with the average listed
df_Mortality_census = df_Mortality_census.fillna(df_Mortality_census.loc['Average'])
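
Passing a Series to `DataFrame.fillna` fills each column with the value whose index label matches that column's name, so every missing state/year cell receives that year's value from the `Average` row. A minimal sketch on made-up rates (the state names and numbers are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one row per state plus an 'Average' row, one column per year
rates = pd.DataFrame(
    {'1914': [12.0, np.nan, 13.0], '1915': [np.nan, 11.0, 12.5]},
    index=['StateA', 'StateB', 'Average'],
)

# Each NaN is replaced by the 'Average' value from the same column
filled = rates.fillna(rates.loc['Average'])
print(filled)
```

Cells that already hold a value are left untouched; only the gaps take the average for their year.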

In [None]:
# Melt wide data into 3 column dataset -> index=State , columns=[year,mortalityrate]
df_Mortality_census_melted = pd.melt(df_Mortality_census, ignore_index=False)\
                               .rename(columns={'variable':'Year','value':'MortalityRate'})

# Multiply mortality rate by 100 to match other death data
df_Mortality_census_melted['MortalityRate'] = df_Mortality_census_melted['MortalityRate'] * 100

# Reset index to a column labeled "State"
df_Mortality_census_melted = df_Mortality_census_melted.reset_index().rename(columns = {'index':'State'})
df_Mortality_census_melted['Year'] = df_Mortality_census_melted['Year'].astype(int)


## 1.3 Birth Data

In [None]:
df_Births1 = pd.read_csv(f'{fpath_birthRaw}/Births-by-Year-and-State-1995-2002.txt', delimiter='\t', skipfooter=10, engine='python')\
               .dropna(subset=['State'],how='any')

df_Births2 = pd.read_csv(f'{fpath_birthRaw}/Births-by-Year-and-State-2003-2006.txt', delimiter='\t', skipfooter=10, engine='python')\
               .dropna(subset=['State'],how='any')
               
df_Births3 = pd.read_csv(f'{fpath_birthRaw}/Births-by-Year-and-State-2007-2020.txt', delimiter='\t', skipfooter=10, engine='python')\
               .dropna(subset=['State'],how='any')            

df_Births = pd.concat([df_Births1,df_Births2,df_Births3])\
              [['Year','State','Births']]

## 1.4 Migration Data

In [None]:
df_Migration = pd.read_csv(f'{fpath_migrationWrangled}/StateMigrationData.csv')

## 1.5 Crude Birth Rates

### 1.5.1 Extrapolated from 1800-2015

Data found [here](https://docs.google.com/spreadsheets/d/1QkK8B3EnGoWzcHUmdf0AIU8YHk5LmzbOcsRRKbN9w2Y/pub?gid=1#) and produced by gapminder.


In [None]:
# read in data
df_BirthRates = pd.read_csv(f'{fpath_birthRaw}/indicator_crude birth_rate.csv')

# Keep only the United States row (other American territories can be queried the same way),
# melt to long format, and drop the leftover 'Country' row at index 0
df_BirthRates = df_BirthRates.query('Country == "United States"').melt().drop(0)

# Rename columns
df_BirthRates.columns = ['Year','CrudeBirthRate1']

# Convert dtype of year
df_BirthRates['Year'] = df_BirthRates['Year'].astype(int)

### 1.5.2 CDC from 2016-2021

In [None]:
# Read in data
df_cdc = pd.read_csv(f'{fpath_birthRaw}/Natality, 2016-2021 expanded.txt',delimiter='\t',usecols=[1,3,5,7])

# Redefine columns
df_cdc.columns = ['State','Year','Births','CrudeBirthRate2']

# Remove rows with na vals
df_cdc = df_cdc.dropna(how='any')

# Convert dtypes
df_cdc['Year'] = df_cdc['Year'].astype(int)

# We drop births since they are simply calculated from population and rate, we have both.
#### Population in this file exactly matched the source data we extract it from in our output df
df_cdc = df_cdc.drop('Births',axis=1)

# 2. Wrangle data

## 2.1 Combine Death/Population Data

In [None]:
# Combine mortality rates with population data, rename columns
df_Extrapolated = df_Mortality_census_melted.merge(df_populationCt, on=['State','Year'], how='inner')\
                                            .rename(columns={'PopulationUN':'Population'})
# Define null column for upcoming concat
df_Extrapolated['Deaths'] = np.nan

# Combine df_Extrapolated (pre 1968) with df_Deaths_nchs (post 1968)
df = pd.concat([df_Deaths_nchs, df_Extrapolated[df_Deaths_nchs.columns]])

# Calculate deaths from mortality rate (per 100,000) and population
calculated_deaths = np.round((df.Population / 100_000) * df.MortalityRate)

# Describe where deaths data came from in source column
df['sourceDeaths'] = 'NCHS'
df.loc[df.Deaths.isna(), 'sourceDeaths'] = 'calculated from mortality rate and population estimate'
df.loc[df.Deaths.isna(), 'Deaths'] = calculated_deaths[df.Deaths.isna()]
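
As a quick check on the units: the census rates are per 1,000 population, and multiplying by 100 puts them on the per-100,000 scale that the death calculation above divides by. A worked example with made-up numbers (the function name `deaths_from_rate` is only for illustration):

```python
def deaths_from_rate(population, rate_per_100k):
    """Estimated deaths = (population / 100,000) * crude rate per 100,000."""
    return round(population / 100_000 * rate_per_100k)

# Hypothetical inputs: a census-style rate of 10.5 deaths per 1,000 people
census_rate_per_1000 = 10.5
rate_per_100k = census_rate_per_1000 * 100   # 1050.0 per 100,000

deaths = deaths_from_rate(2_000_000, rate_per_100k)
print(rate_per_100k, deaths)   # 1050.0 21000
```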

## 2.2 Calculate next population data

In [None]:
# Sort values
df = df.sort_values(['State','Year'])

# Create shifted next population column (shift(-1) pulls the following row's population up)
df['NextPopulation'] = df['Population'].shift(-1)

# When year = year.max(), the shift compares state1 2020 data with state2 1914 data, make NaN
df.loc[df.Year == df.Year.max(), 'NextPopulation'] = np.nan
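
A grouped shift is one way to build a next-year column without values leaking across state boundaries, which removes the need to blank out the final year afterwards. This is a sketch on toy data, not the code used for the output:

```python
import pandas as pd

# Hypothetical populations for two states
toy = pd.DataFrame({
    'State': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Year': [2018, 2019, 2020, 2018, 2019, 2020],
    'Population': [100, 110, 120, 500, 490, 480],
}).sort_values(['State', 'Year'])

# shift(-1) within each state: the last year of every state becomes NaN automatically
toy['NextPopulation'] = toy.groupby('State')['Population'].shift(-1)
toy['PopulationChange'] = toy['NextPopulation'] - toy['Population']
print(toy)
```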

## 2.3 Format output, combine with migration data, birth data

In [None]:
# Calculate the population change
df['PopulationChange'] = df['NextPopulation']-df['Population']

# Add migration data
df = df.merge(df_Migration.rename(columns={'CurrentState':'State'}),on=['State','Year'], how='left')

# Add crude birth rates
df = df.merge(df_BirthRates, on='Year', how='left')\
       .merge(df_cdc, on=['Year','State'], how='left')

# Use coalesce functionality to get crude birth rate (US overall pre 2015 inclusive, state specific after)
df['CrudeBirthRate'] = df['CrudeBirthRate2'].combine_first(df['CrudeBirthRate1'])
df = df.drop(['CrudeBirthRate1','CrudeBirthRate2'], axis=1)

# Calculate number of births from population and crude birth rate (per 1,000)
df['Births'] = (df['Population'] / 1_000) * df['CrudeBirthRate']

# Drop assumed population column
df = df.drop('Assumed Population', axis=1)

# Bring Mortality back down to per 1000
df['MortalityRate'] = df['MortalityRate'] / 100
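
The coalesce step above relies on `Series.combine_first`, which keeps the caller's values and falls back to the argument only where the caller is null. A small illustration with made-up rates standing in for `CrudeBirthRate2` (state-specific) and `CrudeBirthRate1` (national):

```python
import numpy as np
import pandas as pd

# Hypothetical values: state-specific rates exist only for the most recent row
state_rate = pd.Series([np.nan, np.nan, 11.8])
national_rate = pd.Series([14.0, 12.4, 12.2])

# Prefer the state-specific rate, fall back to the national one where it is missing
combined = state_rate.combine_first(national_rate)
print(combined.tolist())   # [14.0, 12.4, 11.8]
```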

## 2.4 Estimate data for upcoming years until desired year

In [None]:
# Initialize empty output list
new_rows = []

# For all states
for state in np.unique(df['State']):
    
    # Define dataframe we're working with
    subdf = df.query(f'State == "{state}"').sort_values('Year')
    last5 = subdf[-5:]

    # Polyfit (linear, 1 degree) to our mortality rates
    coefs_deaths  = poly.polyfit(last5['Year'].astype(float),last5['MortalityRate'].astype(float),1)
    ffit_deaths = poly.Polynomial(coefs_deaths)
    
    # Polyfit (linear, 1 degree) to birth rates
    coefs_births = poly.polyfit(last5['Year'].astype(float),last5['CrudeBirthRate'].astype(float),1)
    ffit_births = poly.Polynomial(coefs_births)

    # Define xs we're defining for
    xs = np.arange(subdf.Year.max()+1, desired_end_year+1)
    
    # Calculate new rates
    new_deathrates = ffit_deaths(xs)
    new_birthrates = ffit_births(xs)
    
    # Find out the population for the following year given population, birth, death info
    last_row = subdf.iloc[-1]
    current_population = last_row['Population'] + last_row['Births'] - last_row['Deaths']
    
    # For each year using our linear prediction...
    for i in np.arange(0,len(xs)):
        
        # Assign variable to current linearly predicted rate
        current_brate = new_birthrates[i]
        current_drate = new_deathrates[i]
        
        # Calculate number of births and deaths using current population and crude rates
        num_deaths = (current_population/1000)*current_drate
        num_births = (current_population/1000)*current_brate
        
        # Label known variables for clarity
        current_year = xs[i]
        current_state = state
        
        # Append all useful info to our output list
        new_rows.append([current_year,current_state,
                       int(num_deaths),current_drate,
                       int(np.round(current_population)),
                       int(num_births),current_brate,])
        
        # Calculate next year's population (current in next index of the for-loop) using pop, births, deaths
        current_population = current_population + num_births - num_deaths
    
# Convert to pandas dataframe
pd_new_rows = pd.DataFrame(new_rows, columns=['Year','State','Deaths','MortalityRate','Population','Births','CrudeBirthRate'])
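
To make the projection step easier to follow, here is a stripped-down sketch of the same two ideas: fit a straight line to the last five rates, then roll the population forward one year at a time. All numbers are invented, and the death rate is held constant here purely to keep the example short.

```python
import numpy as np
import numpy.polynomial.polynomial as poly

# Hypothetical last five years of crude birth rates (per 1,000)
years = np.array([2016, 2017, 2018, 2019, 2020], dtype=float)
birth_rates = np.array([12.0, 11.8, 11.6, 11.4, 11.2])

# poly.polyfit returns coefficients from lowest to highest degree: [intercept, slope]
coefs = poly.polyfit(years, birth_rates, 1)
trend = poly.Polynomial(coefs)
future_birth_rates = trend(np.arange(2021, 2026))

# Roll the population forward: next = current + births - deaths
population = 1_000_000.0   # hypothetical starting population
death_rate = 9.0           # hypothetical constant rate per 1,000
projected = []
for birth_rate in future_birth_rates:
    projected.append(round(population))
    births = population / 1000 * birth_rate
    deaths = population / 1000 * death_rate
    population = population + births - deaths

print(np.round(future_birth_rates, 2), projected)
```

Note the coefficient order: `numpy.polynomial.polynomial.polyfit` returns low-to-high degree, while the older `np.polyfit` returns high-to-low.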

## 2.5 Combine data

In [None]:
# Add new rows to final dataframe
df = pd.concat([df,pd_new_rows])

# 3. Validation

In [None]:
# Create dataframe copy
df_copy = df.copy()
df_copy[['Total Immigrants','Total Emigrants']] = df_copy[['Total Immigrants','Total Emigrants']].fillna(value=0)

# Find out the calculated next population only using population, births, and deaths
df_copy['CalcNextPopulation'] = df_copy['Population'] + df_copy['Births'] - df_copy['Deaths']
df_copy['CalcNextPopulation2'] = df_copy['Population'] + df_copy['Births'] + df_copy['Total Immigrants'] - df_copy['Total Emigrants'] - df_copy['Deaths']

# Order by state and year so that we can create a lag column properly and remove data for 1914
df_copy = df_copy.sort_values(['State','Year'])
df_copy['CalcPopulation'] = df_copy['CalcNextPopulation'].shift(1)
df_copy['CalcPopulation2'] = df_copy['CalcNextPopulation2'].shift(1)

df_copy.loc[df_copy.Year == 1914, 'CalcPopulation'] = np.nan
df_copy.loc[df_copy.Year == 1914, 'CalcPopulation2'] = np.nan

## 3.1 Population Difference

In [None]:
# Plot relative population difference between:
###  calculated population (strictly births/deaths) vs census population as a percent of census population

# Calculate the column
df_copy['diff'] = abs(df_copy['Population'] - df_copy['CalcPopulation']) / df_copy['Population']
df_copy['diff2'] = abs(df_copy['Population'] - df_copy['CalcPopulation2']) / df_copy['Population']


# Histogram with labels
plt.hist(df_copy['diff'].dropna(), bins=100, color='b', alpha=0.5, label='births/deaths only')
#plt.hist(df_copy['diff2'].dropna(), range=(0,0.2), bins=100, color='r', alpha=0.5, label='births/deaths AND immigration/emigration')

plt.xlabel('Percent/100')
plt.ylabel('Count of Rows')
plt.suptitle('Comparing calculated population (births & deaths) with census populations')
plt.title('abs(calculated population - census population) / census population')
plt.show()

## 3.2 Population change for test states

In [None]:
test_state = 'Louisiana'

df1 = df_Deaths_nchs.query(f'State == "{test_state}"')
df2 = df_populationCt_before.query(f'State == "{test_state}"')
df3 = df_copy.query(f'State == "{test_state}"')
df4 = pd_new_rows.query(f'State == "{test_state}"')

fig,ax = plt.subplots(figsize = (9,7))
plt.plot(df1.Year, df1.Population, color='b', label='NCHS population data')
plt.plot(df2.Year, df2.PopulationUN, color='r', label='Census 10 year population estimates (filled)')
plt.scatter(df2.Year, df2.PopulationUN, color='darkred',s=10, label='Census 10 year population estimates')
plt.plot(df3.Year, df3.Population, color='k', linestyle='--', alpha=0.5, label='Dataframe population column')
plt.plot(df3.Year, df3.CalcPopulation, color='hotpink', label='Deaths/Births only population (calculated)')
plt.plot(df4.Year,df4.Population, color='cyan', label='Linear Prediction')
plt.vlines(x=list(df2.Year), ymin=ax.get_ylim()[0], ymax=ax.get_ylim()[1], color='darkred', linestyle=':',alpha=0.2)
plt.xlabel('Year')
plt.ylabel('Population')
plt.suptitle(test_state)
plt.title('Comparing Population by Year estimates')

plt.legend(loc='upper left')

plt.show()

## 3.3 Birth/Death Rates

In [None]:
test_state = 'Tennessee'
subdf = df.query(f'State == "{test_state}"')
subdf1 = subdf[~subdf['sourceDeaths'].isna()]
subdf2 = subdf[subdf['sourceDeaths'].isna()]

plt.plot(subdf1.Year, subdf1.MortalityRate, color='b', label='death')
plt.plot(subdf1.Year, subdf1.CrudeBirthRate, color='r', label='birth')

plt.plot(subdf2.Year, subdf2.MortalityRate, color='b', linestyle='--', label='death (linear prediction)')
plt.plot(subdf2.Year, subdf2.CrudeBirthRate, color='r', linestyle='--', label='birth (linear prediction)')

plt.vlines(x=subdf1.Year.max(),ymin=2,ymax=32, color='gray', linestyle=':',alpha=0.2)
plt.legend(loc = 'upper right')
plt.suptitle(test_state)
plt.title('Comparing Crude Birth and Death Rates from 1914-2020')
plt.xlabel('year')
plt.ylabel('crude rate (per 1000 population)')
plt.show()

# 4. Saving

## 4.1 Format

In [None]:
df['Population'] = df['Population'].round().astype(pd.Int64Dtype())
df['NextPopulation'] = df['NextPopulation'].round().astype(pd.Int64Dtype())
df['Births'] = df['Births'].astype(float).round().astype(pd.Int64Dtype())
df['Deaths'] = df['Deaths'].round().astype(pd.Int64Dtype())

df = df.sort_values(['State','Year'])

## 4.2 Write out to csv

In [None]:
df.to_csv(f'{fpath_birthComplete}/VitalStats_byYear_byState.csv',header=True,index=False)

# Exploration

In [None]:
# Import polyfit
import numpy.polynomial.polynomial as poly

# Instantiate fig
fig = plt.figure(figsize=(9,7))

xs = subdf['Year']


# Plot Raw
plt.plot(subdf['Year'],subdf['MortalityRate'],color='k',label='Raw Mort')

# Run polyfit 50 degree
coefs  = poly.polyfit(subdf['Year'],subdf['MortalityRate'],50)
ffit = poly.Polynomial(coefs)
plt.plot(xs,ffit(xs),color='r',label='50 deg Polyfit')

# Run rolling mean
ys_alt = subdf['MortalityRate'].rolling(5).mean()
plt.plot(xs,ys_alt,color='b',label='Rolling Mean')

# Run rolling quadratic fits over each preceding 10-year window
for i in np.arange(10,len(xs)):
    sdf_x = subdf['Year'].iloc[i-10:i]
    sdf_y = subdf['MortalityRate'].iloc[i-10:i]
    # np.polyfit returns coefficients from highest to lowest degree
    polyfitted = np.polyfit(sdf_x, sdf_y, 2)
    test_xs = np.arange(subdf['Year'].iloc[i-1],subdf['Year'].iloc[i]+2)
    test_ys = (polyfitted[0]*test_xs**2) + test_xs*polyfitted[1] + polyfitted[2]
    plt.plot(test_xs,test_ys,color='gray',alpha=0.5)
    
# Re-plot the last segment once more so the fits get a single legend entry
plt.plot(test_xs,test_ys,color='gray',alpha=0.5,label='Predictive fit (prev 10 years, quadratic)')

###########################################################################
plt.xlabel('Year')
plt.ylabel('Rate per 1000 Population')
plt.title('Comparing Methods for predictive rates')
plt.legend(loc='upper right')
plt.show()