# Mortality - Scrape Actuarial Life Table

The purpose of this script is to request the Social Security Administration's actuarial life table over HTTP and collect the likelihood of dying within the year at each age.

----------------------

<p>Author: PJ Gibson</p>
<p>Date: 2022-01-19</p>
<p>Contact: peter.gibson@doh.wa.gov</p>
<p>Other Contact: pjgibson25@gmail.com</p>

In [None]:
import pandas as pd
import numpy as np
import requests

## 1. Fetch Data

### 1.1 API Call

In [None]:
# Define URL
url = 'https://www.ssa.gov/oact/STATS/table4c6_2019_TR2021.html'

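# Perform the HTTP request; fail fast on an error status instead of parsing an error page
api_return = requests.get(url, timeout=30)
api_return.raise_for_status()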

### 1.2 Format Pandas DataFrame

In [None]:
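# Convert into a pandas DataFrame and drop the last row (it holds only annotations).
# The text is wrapped in StringIO because recent pandas versions deprecate passing a literal HTML string to read_html.
from io import StringIO
df = pd.read_html(StringIO(api_return.text))[0][:-1]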

# Manually define the last row of data (at age 120, the chance of dying that year is set to 100%)
final_row = pd.DataFrame(np.array([120, 1.0, 0, 0, 1.0, 0, 0])).transpose()
final_row.columns = df.columns

output_df = pd.concat([df,final_row],ignore_index=True)
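
Because the terminal row is appended by hand and the columns are later selected by position, a quick sanity check on the combined table may be worthwhile. The sketch below is one way to do that; `check_life_table` is a hypothetical helper, not part of the original notebook, and the toy values are invented for illustration. It assumes the column order of the scraped table: age in position 0, male death probability in position 1, female death probability in position 4.

```python
import pandas as pd

def check_life_table(table: pd.DataFrame) -> None:
    """Raise ValueError if the table does not look like consecutive ages with probabilities."""
    ages = pd.to_numeric(table.iloc[:, 0], errors="coerce")
    if ages.isna().any():
        raise ValueError("age column contains non-numeric values")
    if not (ages.diff().dropna() == 1).all():
        raise ValueError("ages are not consecutive")
    for col in (1, 4):  # male and female death probability, by position
        probs = pd.to_numeric(table.iloc[:, col], errors="coerce")
        if probs.isna().any() or not probs.between(0, 1).all():
            raise ValueError(f"column {col} holds values outside [0, 1]")
    if table.iloc[-1, 1] != 1.0 or table.iloc[-1, 4] != 1.0:
        raise ValueError("terminal row should have death probability 1.0")

# Toy three-row table with the same column order (values invented for illustration)
toy = pd.DataFrame(
    [[118, 0.90, 2, 0.6, 0.90, 3, 0.6],
     [119, 0.95, 1, 0.5, 0.95, 1, 0.5],
     [120, 1.00, 0, 0.0, 1.00, 0, 0.0]]
)
check_life_table(toy)  # passes silently when the table looks sane
```

In the notebook this would be called as `check_life_table(output_df)` right after the concatenation.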

## 2. Show Output

<b>Important</b>: In our notebook '{project_root}/Functions/Process Functions', the function `spdf_will_die()` uses the output of the cell below.
I found it easier to work with the array directly inside the Spark UDF: because UDFs run in parallel, we don't want to reread the same .csv file for each row of data the UDF is applied to.
Instead, we manually define a list that the function uses.
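
To illustrate the idea of a hard-coded list, here is a minimal sketch in plain Python. How `spdf_will_die()` actually consumes the list is not shown in this notebook, so `LIFE_TABLE`, `death_probability`, and `will_die` are hypothetical names, and the probabilities are invented placeholders rather than SSA values.

```python
import random

# Rows are [age, male death probability, female death probability].
# Values are invented for illustration; the real list would be pasted
# from this notebook's output.
LIFE_TABLE = [
    [0, 0.006, 0.005],
    [1, 0.0004, 0.0003],
    [2, 1.0, 1.0],  # terminal row: certain death, like age 120 in the real table
]

def death_probability(age: int, sex: str) -> float:
    """Look up the one-year death probability by age and sex ('M' or 'F')."""
    index = max(0, min(age, len(LIFE_TABLE) - 1))  # clamp to the table's range
    row = LIFE_TABLE[index]
    return row[1] if sex == "M" else row[2]

def will_die(age: int, sex: str, rng: random.Random) -> bool:
    """Draw a yes/no outcome with the looked-up probability."""
    return rng.random() < death_probability(age, sex)
```

Since the list lives in the function's module, each call is an in-memory lookup and no file is read per row.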

In [None]:
# The label-based selection below previously worked, but the double spaces in the multi-index column names are annoyingly difficult to remove, and the output varies between requests
# output_df[[('Exact  age','Exact  age'),('Male', 'Death  probability a'),('Female','Death  probability a')]].to_numpy()

# Show the age, male death probability, and female death probability columns (selected by position)
output_df.iloc[:,[0,1,4]].to_numpy()
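
If label-based selection is preferred over positions, one option is to normalize the multi-index column names first. The sketch below collapses runs of whitespace and joins the levels; `flatten_columns` is a hypothetical helper, and the example labels are taken from the commented-out line above rather than from a fresh request.

```python
import re
import pandas as pd

def flatten_columns(columns: pd.MultiIndex) -> list:
    """Collapse whitespace runs in each level and join the levels with ' | '."""
    flat = []
    for levels in columns:
        cleaned = [re.sub(r"\s+", " ", str(level)).strip() for level in levels]
        # Drop a repeated level such as ('Exact age', 'Exact age')
        unique = list(dict.fromkeys(cleaned))
        flat.append(" | ".join(unique))
    return flat

cols = pd.MultiIndex.from_tuples([
    ("Exact  age", "Exact  age"),
    ("Male", "Death  probability a"),
    ("Female", "Death  probability a"),
])
print(flatten_columns(cols))
# ['Exact age', 'Male | Death probability a', 'Female | Death probability a']
```

Assigning the result with `output_df.columns = flatten_columns(output_df.columns)` would then allow selecting columns by their cleaned names.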