# Probability Converter
This script converts the `occupational_fatalities.csv` dataset into an index of likelihoods of dying by profession, then joins it to the `deaths_age_gender_race_mechanism_cause.csv` dataset by mechanism of death. Next, the full dataset is converted into daily probabilities of dying by age, gender, race, and occupation, modeled with linear models connecting each age in years.

In [21]:
import pandas as pd
import numpy as np

In [22]:
job_deaths = pd.read_csv("../data/occupational_hazards_data.csv")
cdc_deaths = pd.read_csv("../data/deaths_age_gender_race_mechanism_cause.csv")

### Converting BLS Data to Probabilities by Occupation

In [125]:
job_deaths.head()

Unnamed: 0,occupation,hierarchy_level,mechanism_of_death,deaths,population
0,Total,0,Total,5250,161038
1,Total,0,Cut/Pierce,828,161038
2,Total,0,Motor Vehicle Traffic,2080,161038
3,Total,0,Fire/Flame,115,161038
4,Total,0,Fall,791,161038


In [126]:
job_deaths.shape

(4158, 5)

The hierarchy has 4 levels, and users will have a poor experience if they must search through too many job titles to find one close to theirs (a mentally taxing task), so selecting the right level is paramount. The number of titles and the degree of detail at each level should guide the choice. Level 3 is the most detailed, so we'll start there and work our way up:

In [128]:
job_deaths.occupation[job_deaths['hierarchy_level']==3].value_counts()

Cashiers                                                                           14
Millwrights                                                                         7
Furnace, kiln, oven, drier, and kettle operators and tenders                        7
Dishwashers                                                                         7
Packers and packagers, hand                                                         7
Bill and account collectors                                                         7
First-line supervisors of personal service workers                                  7
Operating engineers and other construction equipment operators                      7
First-line supervisors of landscaping, lawn service, and groundskeeping workers     7
Earth drillers, except oil and gas                                                  7
Chief executives                                                                    7
Maintenance workers, machinery                        

275 options is clearly too many. 

In [129]:
job_deaths.occupation[job_deaths['hierarchy_level']==2].value_counts()

Cashiers                                                           14
Social and community service managers                               7
Radio and telecommunications equipment installers and repairers     7
Pest control workers                                                7
Butchers and other meat, poultry, and fish processing workers       7
Bill and account collectors                                         7
First-line supervisors of personal service workers                  7
Massage therapists                                                  7
Welding, soldering, and brazing workers                             7
Secondary school teachers                                           7
Chief executives                                                    7
Waiters and waitresses                                              7
Personal care aides                                                 7
Brickmasons, blockmasons, and stonemasons                           7
Education administra

237 is still way too many.

In [131]:
job_deaths.occupation[job_deaths['hierarchy_level']==1].value_counts()

Textile, apparel, and furnishings workers                                   7
Postsecondary teachers                                                      7
Personal appearance workers                                                 7
Lawyers, judges, and related workers                                        7
Rail transportation workers                                                 7
Life, physical, and social science technicians                              7
Plant and system operators                                                  7
Woodworkers                                                                 7
Agricultural workers                                                        7
Supervisors of protective service workers                                   7
Media and communication equipment workers                                   7
Other protective service workers                                            7
Occupational therapy and physical therapist assistants and aides

87 is getting more reasonable, but still annoying.

In [132]:
job_deaths.occupation[job_deaths['hierarchy_level']==0].value_counts()

Education, training, and library occupations                  7
Construction and extraction occupations                       7
Total                                                         7
Healthcare support occupations                                7
Life, physical, and social science occupations                7
Food preparation and serving related occupations              7
Legal occupations                                             7
Management occupations                                        7
Computer and mathematical occupations                         7
Sales and related occupations                                 7
Arts, design, entertainment, sports, and media occupations    7
Architecture and engineering occupations                      7
Protective service occupations                                7
Farming, fishing, and forestry occupations                    7
Building and grounds cleaning and maintenance occupations     7
Office and administrative support occupa

In [133]:
len(job_deaths.occupation[job_deaths['hierarchy_level']==0].value_counts())

22
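The level-by-level counts above can also be produced in a single call. The following is a minimal sketch on a small made-up frame (the `toy` rows are illustrative, not taken from the BLS file); it assumes only the `occupation` and `hierarchy_level` columns shown in `job_deaths.head()`.

```python
import pandas as pd

# Illustrative stand-in for job_deaths; these rows are made up, not BLS data.
toy = pd.DataFrame({
    "occupation": ["Total", "Total", "Legal occupations", "Legal occupations",
                   "Lawyers, judges, and related workers", "Cashiers", "Cashiers"],
    "hierarchy_level": [0, 0, 0, 0, 1, 2, 3],
})

# Number of distinct occupation titles a user would have to choose from per level.
options_per_level = toy.groupby("hierarchy_level")["occupation"].nunique()
print(options_per_level)
```

Running the same `groupby` on the real `job_deaths` frame (before it is filtered to level 0) should give the option count for each level side by side.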

22 options (21 occupation groups plus `Total`) is totally reasonable. Occupation will have an extraordinarily small effect on the overall likelihood of dying; it is mainly included to help the user feel that the result is more personalized.

To prep the data for joining, we'll need to remove the extraneous data and generate a probability for each occupation and mechanism: the deaths for that occupation and mechanism divided by the occupation's total deaths.

In [134]:
# .copy() makes the filtered frame independent, so later column assignments do not warn
job_deaths = job_deaths[job_deaths['hierarchy_level']==0].copy()

We'll want to remove the levels.

In [136]:
# errors='ignore' keeps this cell safe to re-run once the column is already gone
job_deaths = job_deaths.drop(columns=['hierarchy_level'], errors='ignore')

Reset the index so that it can be searched.

In [137]:
job_deaths.reset_index(drop = True, inplace = True)

The population column is stored as strings with thousands separators, so we need to convert it into integers.

In [143]:
job_deaths.population = job_deaths.population.str.replace(',', '').astype('int')

In [146]:
probs = []
for i in range(job_deaths.shape[0]):
    # Share of this occupation's total deaths attributed to this row's mechanism
    prob = job_deaths.deaths.iloc[i]/\
        job_deaths[(job_deaths.occupation == job_deaths.occupation.iloc[i]) &
                   (job_deaths.mechanism_of_death == "Total")].deaths.iloc[0]
    probs.append(prob)

In [147]:
job_deaths['job_prob'] = probs
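The loop above can be expressed without iterating row by row. This is a minimal sketch on made-up numbers (the `toy` frame and its counts are illustrative, not BLS figures); it assumes each occupation has exactly one `Total` row, as the level-0 data appears to.

```python
import pandas as pd

# Illustrative rows only; counts are made up, not BLS figures.
toy = pd.DataFrame({
    "occupation": ["A", "A", "A", "B", "B", "B"],
    "mechanism_of_death": ["Total", "Fall", "Fire/Flame", "Total", "Fall", "Fire/Flame"],
    "deaths": [10, 4, 6, 20, 5, 15],
})

# Pull out each occupation's "Total" deaths, indexed by occupation.
totals = toy.loc[toy["mechanism_of_death"] == "Total"].set_index("occupation")["deaths"]

# Map the total back onto every row and divide, mirroring the loop's calculation.
toy["job_prob"] = toy["deaths"] / toy["occupation"].map(totals)
print(toy)
```

The `Total` rows come out as 1.0, which is harmless here because they are dropped in the next step anyway.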

We won't need the `Total` value of `mechanism_of_death`, since it was only used to generate the probabilities, so it can be removed along with the `Total` value of `occupation`. The `deaths` and `population` columns can be removed as well.

In [150]:
job_deaths = job_deaths[job_deaths.mechanism_of_death != "Total"]

In [178]:
job_deaths = job_deaths[job_deaths.occupation != "Total"]

In [153]:
del job_deaths['deaths']
del job_deaths['population']

### Joining CDC and BLS Data

The CDC data has a `mechanism_of_death` field which matches the BLS data (albeit with many more mechanisms). This being the case, we'll first do what we did with the BLS data and create annual probabilities for each age, gender, race, mechanism, and cause. Then, for the mechanisms that do match, we'll expand each row of the CDC data out by occupation and multiply the two probabilities using the multiplication rule of probability.

The pandas user guide has useful documentation on how to [split, apply, and combine grouped data](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html), an alternative to the for-loop approach used above.

In [38]:
cdc_deaths

Unnamed: 0,age,gender,race,mechanism_of_death,cause_of_death,deaths,population
0,0,Female,American Indian or Alaska Native,Fire/Flame,Exposure to uncontrolled fire in building or s...,1,36615
1,0,Female,American Indian or Alaska Native,Motor Vehicle Traffic,Car occupant injured in collision with heavy t...,1,36615
2,0,Female,American Indian or Alaska Native,Motor Vehicle Traffic,Person injured in unspecified motor-vehicle ac...,1,36615
3,0,Female,American Indian or Alaska Native,Suffocation,Accidental suffocation and strangulation in bed,7,36615
4,0,Female,American Indian or Alaska Native,Suffocation,Unspecified threat to breathing,3,36615
5,0,Female,American Indian or Alaska Native,Suffocation,"Hanging, strangulation and suffocation, undete...",1,36615
6,0,Female,American Indian or Alaska Native,Non-Injury: Intestinal infections,Campylobacter enteritis,1,36615
7,0,Female,American Indian or Alaska Native,Non-Injury: Intestinal infections,Other and unspecified gastroenteritis and coli...,1,36615
8,0,Female,American Indian or Alaska Native,Non-Injury: Septicemia,"Streptococcal septicaemia, unspecified",1,36615
9,0,Female,American Indian or Alaska Native,Non-Injury: Septicemia,"Septicaemia, unspecified",1,36615


The first step is to calculate each row's annual probability of dying by dividing its deaths by the population of its age, gender, and race group.

Note that this takes more than an hour to run.

In [55]:
probs = []
for i in range(cdc_deaths.shape[0]):
    prob = cdc_deaths.deaths.iloc[i]/\
        cdc_deaths[(cdc_deaths.age == cdc_deaths.age.iloc[i]) & 
                   (cdc_deaths.gender == cdc_deaths.gender.iloc[i]) &
                   (cdc_deaths.race == cdc_deaths.race.iloc[i])].population.iloc[0]
    probs.append(prob)

In [68]:
cdc_deaths['cdc_prob'] = probs
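Much of the loop's runtime comes from filtering the whole frame once per row. A grouped lookup gives the same per-row denominator in one pass. This is a minimal sketch on made-up rows (the `toy` frame is illustrative, not CDC data); `transform("first")` is used here to mirror the loop's `.population.iloc[0]` lookup.

```python
import pandas as pd

# Illustrative rows only; not CDC figures.
toy = pd.DataFrame({
    "age": [0, 0, 1],
    "gender": ["Female", "Female", "Male"],
    "race": ["White", "White", "White"],
    "deaths": [1, 3, 2],
    "population": [1000, 1000, 4000],
})

# Population of the first row in each age/gender/race group, broadcast to every row.
group_pop = toy.groupby(["age", "gender", "race"])["population"].transform("first")

# Annual probability of dying for each row (one cause within one demographic group).
toy["cdc_prob"] = toy["deaths"] / group_pop
print(toy)
```

If `population` is already constant within each age, gender, and race group, as the displayed rows suggest, `toy["deaths"] / toy["population"]` should give the same result.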

In [71]:
cdc_deaths

Unnamed: 0,age,gender,race,mechanism_of_death,cause_of_death,deaths,population,cdc_prob
0,0,Female,American Indian or Alaska Native,Fire/Flame,Exposure to uncontrolled fire in building or s...,1,36615,0.000027
1,0,Female,American Indian or Alaska Native,Motor Vehicle Traffic,Car occupant injured in collision with heavy t...,1,36615,0.000027
2,0,Female,American Indian or Alaska Native,Motor Vehicle Traffic,Person injured in unspecified motor-vehicle ac...,1,36615,0.000027
3,0,Female,American Indian or Alaska Native,Suffocation,Accidental suffocation and strangulation in bed,7,36615,0.000191
4,0,Female,American Indian or Alaska Native,Suffocation,Unspecified threat to breathing,3,36615,0.000082
5,0,Female,American Indian or Alaska Native,Suffocation,"Hanging, strangulation and suffocation, undete...",1,36615,0.000027
6,0,Female,American Indian or Alaska Native,Non-Injury: Intestinal infections,Campylobacter enteritis,1,36615,0.000027
7,0,Female,American Indian or Alaska Native,Non-Injury: Intestinal infections,Other and unspecified gastroenteritis and coli...,1,36615,0.000027
8,0,Female,American Indian or Alaska Native,Non-Injury: Septicemia,"Streptococcal septicaemia, unspecified",1,36615,0.000027
9,0,Female,American Indian or Alaska Native,Non-Injury: Septicemia,"Septicaemia, unspecified",1,36615,0.000027


We'll also need to repeat each age/gender/race/mechanism row once per occupation, adding an `occupation` column. This is done because not all mechanisms of death match the `job_deaths` dataset and we'll want to be able to key off occupation regardless.

In [179]:
job_titles = np.unique(job_deaths.occupation)

In [181]:
cdc_deaths_repeated = pd.concat([cdc_deaths] * len(job_titles), ignore_index=True)

In [183]:
job_titles_repeated = np.repeat(job_titles, cdc_deaths.shape[0])

In [194]:
cdc_deaths_repeated['occupation'] = job_titles_repeated
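The repeat-and-assign steps above build a Cartesian product of CDC rows and occupations. As an alternative sketch, a cross merge produces the same pairs in one call. The frames below are made up for illustration, and `how="cross"` assumes pandas 1.2 or newer.

```python
import pandas as pd

# Illustrative rows only.
cdc_toy = pd.DataFrame({"age": [0, 1], "mechanism_of_death": ["Fall", "Suffocation"]})
titles = pd.DataFrame({"occupation": ["Legal occupations", "Management occupations"]})

# Cartesian product: every CDC row paired with every occupation title.
expanded = cdc_toy.merge(titles, how="cross")
print(expanded)
```

The row order differs from the `pd.concat` approach (rows are grouped by CDC row rather than by occupation), but the set of pairs is the same.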

Now we can merge the job_deaths dataset based on occupation and mechanism of death.

In [205]:
annual_death_probs = pd.merge(cdc_deaths_repeated, job_deaths, 
                              on = ['mechanism_of_death', 'occupation'], how = 'left')

For all the NaNs where there was no match because the CDC data has more mechanisms of death, impute 1s so that multiplying the probabilities leaves the CDC probability unchanged.

In [208]:
# Assign the filled column back instead of using chained indexing,
# which triggers pandas' SettingWithCopyWarning
annual_death_probs['job_prob'] = annual_death_probs['job_prob'].fillna(1)


In [210]:
annual_death_probs['annual_death_prob'] = annual_death_probs.cdc_prob * annual_death_probs.job_prob
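The merge, imputation, and multiplication steps above can be traced on a two-row example. This is a minimal sketch with made-up values (the `cdc_toy` and `job_toy` frames are illustrative, not real probabilities).

```python
import pandas as pd

# Illustrative values only.
cdc_toy = pd.DataFrame({
    "mechanism_of_death": ["Fall", "Non-Injury: Septicemia"],
    "occupation": ["Legal occupations", "Legal occupations"],
    "cdc_prob": [0.002, 0.001],
})
job_toy = pd.DataFrame({
    "mechanism_of_death": ["Fall"],
    "occupation": ["Legal occupations"],
    "job_prob": [0.25],
})

# A left merge keeps every CDC row; mechanisms absent from job_toy get NaN.
merged = cdc_toy.merge(job_toy, on=["mechanism_of_death", "occupation"], how="left")

# Unmatched mechanisms get a neutral multiplier of 1.
merged["job_prob"] = merged["job_prob"].fillna(1)
merged["annual_death_prob"] = merged["cdc_prob"] * merged["job_prob"]
print(merged)
```

In this toy example the matched `Fall` row is scaled by 0.25, while the unmatched row keeps its CDC probability.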

With this complete, we can drop the unneeded columns.

In [211]:
annual_death_probs = annual_death_probs.drop(['deaths', 'population', 'cdc_prob', 'job_prob'], axis = 1)

In [212]:
annual_death_probs

Unnamed: 0,age,gender,race,mechanism_of_death,cause_of_death,occupation,annual_death_prob
0,0,Female,American Indian or Alaska Native,Fire/Flame,Exposure to uncontrolled fire in building or s...,Architecture and engineering occupations,9.103737e-07
1,0,Female,American Indian or Alaska Native,Motor Vehicle Traffic,Car occupant injured in collision with heavy t...,Architecture and engineering occupations,1.911785e-05
2,0,Female,American Indian or Alaska Native,Motor Vehicle Traffic,Person injured in unspecified motor-vehicle ac...,Architecture and engineering occupations,1.911785e-05
3,0,Female,American Indian or Alaska Native,Suffocation,Accidental suffocation and strangulation in bed,Architecture and engineering occupations,1.911785e-04
4,0,Female,American Indian or Alaska Native,Suffocation,Unspecified threat to breathing,Architecture and engineering occupations,8.193363e-05
5,0,Female,American Indian or Alaska Native,Suffocation,"Hanging, strangulation and suffocation, undete...",Architecture and engineering occupations,2.731121e-05
6,0,Female,American Indian or Alaska Native,Non-Injury: Intestinal infections,Campylobacter enteritis,Architecture and engineering occupations,2.731121e-05
7,0,Female,American Indian or Alaska Native,Non-Injury: Intestinal infections,Other and unspecified gastroenteritis and coli...,Architecture and engineering occupations,2.731121e-05
8,0,Female,American Indian or Alaska Native,Non-Injury: Septicemia,"Streptococcal septicaemia, unspecified",Architecture and engineering occupations,2.731121e-05
9,0,Female,American Indian or Alaska Native,Non-Injury: Septicemia,"Septicaemia, unspecified",Architecture and engineering occupations,2.731121e-05


In [213]:
annual_death_probs.to_csv("../data/annual_death_probs.csv", index = False)

### Conversion to Daily Death Probabilities

To convert to daily probabilities, we'd want to expand each row by 365 days, creating 1,283,895,165 rows (3,517,521 rows × 365 days).

In [214]:
617*365

225205

In [215]:
3517521*365

1283895165

In [216]:
job_deaths.mechanism_of_death.value_counts()

Poisoning                21
Cut/Pierce               21
Motor Vehicle Traffic    21
Fire/Flame               21
Struck by or against     21
Fall                     21
Name: mechanism_of_death, dtype: int64

In [218]:
len(cdc_deaths.mechanism_of_death.value_counts())

67

In [219]:
cdc_deaths

Unnamed: 0,age,gender,race,mechanism_of_death,cause_of_death,deaths,population,cdc_prob
0,0,Female,American Indian or Alaska Native,Fire/Flame,Exposure to uncontrolled fire in building or s...,1,36615,0.000027
1,0,Female,American Indian or Alaska Native,Motor Vehicle Traffic,Car occupant injured in collision with heavy t...,1,36615,0.000027
2,0,Female,American Indian or Alaska Native,Motor Vehicle Traffic,Person injured in unspecified motor-vehicle ac...,1,36615,0.000027
3,0,Female,American Indian or Alaska Native,Suffocation,Accidental suffocation and strangulation in bed,7,36615,0.000191
4,0,Female,American Indian or Alaska Native,Suffocation,Unspecified threat to breathing,3,36615,0.000082
5,0,Female,American Indian or Alaska Native,Suffocation,"Hanging, strangulation and suffocation, undete...",1,36615,0.000027
6,0,Female,American Indian or Alaska Native,Non-Injury: Intestinal infections,Campylobacter enteritis,1,36615,0.000027
7,0,Female,American Indian or Alaska Native,Non-Injury: Intestinal infections,Other and unspecified gastroenteritis and coli...,1,36615,0.000027
8,0,Female,American Indian or Alaska Native,Non-Injury: Septicemia,"Streptococcal septicaemia, unspecified",1,36615,0.000027
9,0,Female,American Indian or Alaska Native,Non-Injury: Septicemia,"Septicaemia, unspecified",1,36615,0.000027


In [220]:
cdc_deaths.race.value_counts()

White                               95519
Black or African American           45052
Asian or Pacific Islander           17294
American Indian or Alaska Native     9636
Name: race, dtype: int64

In [222]:
len(cdc_deaths.mechanism_of_death.value_counts())

67

In [223]:
len(cdc_deaths.cause_of_death.value_counts())

3671

In [224]:
len(job_deaths.occupation.value_counts())

21

In [230]:
2*4*101/2/4*365

36865.0

In [231]:
67*3671

245957

In [None]:
# Target relation: (1 - p)**365 = 0.80
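The relation in the last cell can be solved for `p` directly: if a constant daily death probability `p` compounds over 365 independent days to an annual survival of 0.80, then `p = 1 - 0.80**(1/365)`. The sketch below assumes a constant daily probability within the year, which is a simplification and not the per-age linear model described in the introduction; the helper name `annual_to_daily` is hypothetical.

```python
def annual_to_daily(annual_death_prob, days=365):
    """Constant daily death probability that compounds to the given annual one."""
    annual_survival = 1.0 - annual_death_prob
    return 1.0 - annual_survival ** (1.0 / days)

# An annual survival of 0.80 corresponds to an annual death probability of 0.20.
p = annual_to_daily(0.20)
print(p)

# Check: compounding p over 365 days recovers the 0.80 annual survival.
print((1 - p) ** 365)
```

Note that `p` is slightly larger than the naive `0.20 / 365`, because each day's probability applies only to those who survived the previous days.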