# Infant Mortality

One of my primary tasks at the Indiana Department of Health was, given the data from a child's birth certificate, to determine the probability of their death.  After linking the birth and death certificates, this simply involved reporting the descriptive statistics of infant deaths (defined as deaths of children who died at one year of age or younger), followed by a logistic regression.  I was never asked to make a more detailed predictive model, but as this is a project for my portfolio and not for my job, we can have a little fun.  Below I will first do a logistic regression, and then try a decision tree, a neural network, and a random forest.  *Note to readers:  This is a work in progress.  So far I've only done the logistic regression...next step is to test it.  This page will be updated weekly.*


While the data from IDOH are de-identified and I could conceivably use those, I'll elect to make my own dataset by splitting ten thousand individuals among three races (purple, blue, and green), two genders (positive and negative), and four ages (young, teen, adult, senior).

But first, let's import some necessary libraries:

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

## Create Population and Assign Probabilities

In the next section we'll create our population and assign the probabilities of death.

### Create Population

As mentioned above, we'll have three races (purple, blue, and green), four ages (young, teen, adult, senior), and at risk of reinforcing the gender binary, two genders (positive and negative).

In [2]:
# population will be 70% blue, 20% green, 10% purple.
race_list = ["blue"]*7000 + ["green"]*2000 + ["purple"]*1000

# population will be 50% positive and 50% negative.
gender_list = ["positive","negative"]*5000

# population will be 20% each young, adult, and senior, and 40% teen.
age_list = ["young","teen","teen","adult","senior"]*2000

# assemble these lists into a data frame using the dictionary method:
population = pd.DataFrame(
                            { 'race':race_list,
                              'gender':gender_list,
                              'age':age_list }
)
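
As a quick sanity check on the proportions described above, here is a minimal sketch that rebuilds the same three columns and tabulates the share of each category. The `proportions` dictionary is a name introduced only for this illustration.

```python
import pandas as pd

# Rebuild the synthetic population with the same proportions as above.
population = pd.DataFrame({
    "race": ["blue"] * 7000 + ["green"] * 2000 + ["purple"] * 1000,
    "gender": ["positive", "negative"] * 5000,
    "age": ["young", "teen", "teen", "adult", "senior"] * 2000,
})

# Share of each category within each column (hypothetical helper dict).
proportions = {
    col: population[col].value_counts(normalize=True).to_dict()
    for col in ["race", "gender", "age"]
}

for col, shares in proportions.items():
    print(col, shares)
```

If the lists were built as intended, race should come out at 70/20/10, gender at 50/50, and teen at twice the share of each other age group.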

### Assign Probabilities

We'll make the probability of death 50% for blue, 10% for green, and 100% for purple.  Note that these are *much higher* than the probabilities of *actual* infant death, but for the purposes of this exercise we want to refrain from having too unbalanced a target set.  Just trying to show basic technique, here, and not dig into the specifics of how to deal with unbalanced sets.  (Perhaps that'll be a project for later.)

At any rate, the probability will be 90% for positive gender and 80% for negative gender.  By age, it will be 60% for young, 80% for teen, 90% for adult, and 100% for senior.  The three probabilities are multiplied together to give each individual's total probability.

In [3]:
# assign race probabilities (same order as race_list: blue, green, purple)
race_probs = np.array([0.5]*7000 + [0.1]*2000 + [1.0]*1000)

# assign gender probabilities (positive, negative)
gender_probs = np.array([0.9,0.8]*5000)

# assign age probabilities (young, teen, teen, adult, senior)
age_probs = np.array([0.6, 0.8, 0.8, 0.9, 1.0]*2000)

# figure out total probabilities, add column to data frame
total_probs = race_probs*gender_probs*age_probs
population['total_probs'] = total_probs

# now assign each individual an outcome: 1 (dead) if the total probability exceeds 0.5, otherwise 0
final_outcome = (total_probs > 0.5).astype(int)
population['final_outcome'] = final_outcome

# take a look at the population
population.head()

   race    gender     age  total_probs  final_outcome
0  blue  positive   young         0.27              0
1  blue  negative    teen         0.32              0
2  blue  positive    teen         0.36              0
3  blue  negative   adult         0.36              0
4  blue  positive  senior         0.45              0
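
The cell above relies on the probability lists being in exactly the same order as the category lists. As an alternative sketch (not the approach used in this notebook), the same totals can be derived from the categorical columns themselves with a dictionary lookup. The names `race_p`, `gender_p`, and `age_p` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the synthetic population.
population = pd.DataFrame({
    "race": ["blue"] * 7000 + ["green"] * 2000 + ["purple"] * 1000,
    "gender": ["positive", "negative"] * 5000,
    "age": ["young", "teen", "teen", "adult", "senior"] * 2000,
})

# Per-category probabilities as stated in the text (hypothetical dict names).
race_p = {"blue": 0.5, "green": 0.1, "purple": 1.0}
gender_p = {"positive": 0.9, "negative": 0.8}
age_p = {"young": 0.6, "teen": 0.8, "adult": 0.9, "senior": 1.0}

# Look up each individual's three probabilities and multiply them.
total_probs = (
    population["race"].map(race_p)
    * population["gender"].map(gender_p)
    * population["age"].map(age_p)
)

# Same rule as the cell above: outcome is 1 when the total exceeds 0.5.
final_outcome = (total_probs > 0.5).astype(int)

print(total_probs.head().round(2).tolist())
print("Number of dead:", int(final_outcome.sum()))
```

Because the lookup is keyed on the category labels, reordering or shuffling the rows would not silently misalign the probabilities.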


## Descriptive Statistics

Let's do some quick and dirty descriptive statistics on the dataset.  We have already determined the proportion of race, age, and gender, so we'll not worry about that.  How many are dead?

In [4]:
x = population['final_outcome'].sum()
N = 10000

print("Number of dead:",x)
print("Percentage of population",100*x/N)

Number of dead: 900
Percentage of population 9.0


In [5]:
# create a data frame of the deceased population consisting of race, gender, and age columns
pop_deceased = population[population['final_outcome']==1].reset_index(drop=True).drop(['total_probs','final_outcome'],axis=1)
print(pop_deceased.head())

N = len(pop_deceased)
print("Number of dead:",N)

print("Total population breakdown:")
print( pop_deceased.value_counts()/N )

print("\n")
print("Proportion gender:")
print( pop_deceased.value_counts('gender')/N )

print("\n")
print("Proportion age:")
print( pop_deceased.value_counts('age')/N )

     race    gender     age
0  purple  positive   young
1  purple  negative    teen
2  purple  positive    teen
3  purple  negative   adult
4  purple  positive  senior
Number of dead: 900
Total population breakdown:
race    gender    age   
purple  positive  teen      0.222222
        negative  teen      0.222222
        positive  young     0.111111
                  senior    0.111111
                  adult     0.111111
        negative  senior    0.111111
                  adult     0.111111
dtype: float64


Proportion gender:
gender
positive    0.555556
negative    0.444444
dtype: float64


Proportion age:
age
teen      0.444444
senior    0.222222
adult     0.222222
young     0.111111
dtype: float64


So all of the deceased are of race purple; 5/9 are positively gendered and 4/9 are negatively gendered; 4/9 are teenagers, 2/9 are seniors, 2/9 are adults, and 1/9 are young.
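
The breakdown above describes the composition of the deceased. A complementary view is the death rate within each group, which accounts for group size. The following is a minimal sketch that rebuilds the population under the same rules; the `rates` dictionary is a name introduced only for this example.

```python
import pandas as pd

# Rebuild the synthetic population and outcome using the rules from the text.
population = pd.DataFrame({
    "race": ["blue"] * 7000 + ["green"] * 2000 + ["purple"] * 1000,
    "gender": ["positive", "negative"] * 5000,
    "age": ["young", "teen", "teen", "adult", "senior"] * 2000,
})
total_probs = (
    population["race"].map({"blue": 0.5, "green": 0.1, "purple": 1.0})
    * population["gender"].map({"positive": 0.9, "negative": 0.8})
    * population["age"].map({"young": 0.6, "teen": 0.8, "adult": 0.9, "senior": 1.0})
)
population["final_outcome"] = (total_probs > 0.5).astype(int)

# Death rate within each category: the mean of a 0/1 column is a proportion.
rates = {
    col: population.groupby(col)["final_outcome"].mean()
    for col in ["race", "gender", "age"]
}

for col, series in rates.items():
    print(series, "\n")
```

Read this way, the young group stands out with a lower death rate than the other age groups, even though it makes up 1/9 of the deceased.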

## Logistic Regression

Now let's perform a logistic regression on the dataset, the first and simplest form of machine learning we can apply here.

Because we're working with categorical data, we can't simply plug the values in directly and let 'er rip.  Instead we need to encode them into dummy variables.  For this we need LabelEncoder() and OneHotEncoder().

In [6]:
x = population.select_dtypes(exclude=['number']) \
              .apply(LabelEncoder().fit_transform) \
              .join(population.select_dtypes(include=['number']))

x

      race  gender  age  total_probs  final_outcome
0        0       1    3         0.27              0
1        0       0    2         0.32              0
2        0       1    2         0.36              0
3        0       0    0         0.36              0
4        0       1    1         0.45              0
...    ...     ...  ...          ...            ...
9995     2       0    3         0.48              0
9996     2       1    2         0.72              1
9997     2       0    2         0.64              1
9998     2       1    0         0.81              1


So in our first step, we have encoded race so that blue is 0, green is 1, and purple is 2.  Gender is encoded so that 0 is negative and 1 is positive.  Age is encoded so that 0 is adult, 1 is senior, 2 is teen, and 3 is young.
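
The mapping described above can be confirmed from the encoder itself: `LabelEncoder` stores the labels in sorted order in its `classes_` attribute, and a label's position there is its integer code. A minimal sketch, where the `columns` and `mappings` dictionaries are names introduced only for this example:

```python
from sklearn.preprocessing import LabelEncoder

# The category labels used in this notebook.
columns = {
    "race": ["purple", "blue", "green"],
    "gender": ["positive", "negative"],
    "age": ["young", "teen", "adult", "senior"],
}

# Fit one encoder per column and record label -> integer code.
mappings = {}
for name, values in columns.items():
    encoder = LabelEncoder().fit(values)
    mappings[name] = {
        str(label): code for code, label in enumerate(encoder.classes_)
    }
    print(name, mappings[name])
```

Since the codes follow alphabetical order rather than any meaningful ordering (adult comes before young, for instance), they should not be fed to the model as numbers, which is why the one-hot step follows.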

Next we use OneHotEncoder and run the model:

In [7]:
pop_subset = x[['race','gender','age']]
enc = OneHotEncoder()
enc.fit(pop_subset)
onehotlabels = enc.transform(pop_subset).toarray()

model = LogisticRegression(solver='liblinear',random_state=0) \
        .fit(onehotlabels,x['final_outcome'])

print(model.intercept_)
print(model.coef_)

[-2.20961704]
[[-5.60976893 -4.56132434  7.96147624 -3.04261357  0.83299653  0.72513262
   0.72513262  0.7862183  -4.44610058]]
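
The nine coefficients come out in the order of the one-hot columns, so it helps to line each one up with its category. The sketch below fits a `OneHotEncoder` directly on string labels, which is an assumption that differs from the cell above (that cell encodes integers first). Since both encoders sort their categories, the column order should be the same. The small `features` frame and the `column_names` list are introduced only for illustration; the coefficient values are copied from the output above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# A small frame containing every category at least once (illustrative only).
features = pd.DataFrame({
    "race": ["blue", "green", "purple", "blue"],
    "gender": ["positive", "negative", "positive", "negative"],
    "age": ["young", "teen", "adult", "senior"],
})

enc = OneHotEncoder().fit(features)

# categories_ holds the sorted labels per input column, in column order.
column_names = [
    f"{col}_{cat}"
    for col, cats in zip(features.columns, enc.categories_)
    for cat in cats
]

# Coefficients copied from the printed model.coef_ above.
coefs = [-5.60976893, -4.56132434, 7.96147624, -3.04261357, 0.83299653,
         0.72513262, 0.72513262, 0.7862183, -4.44610058]

for name, coef in zip(column_names, coefs):
    print(f"{name:>16}: {coef: .2f}")
```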


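So now we have our model, approximately:
$$\log\frac{p}{1-p} = -2.2 - 5.6x_{blue} - 4.6x_{green} + 8.0x_{purple} - 3.0x_{negative} + 0.8x_{positive} + 0.7x_{adult} + 0.7x_{senior} + 0.8x_{teen} - 4.4x_{young}$$

Here $p$ is the predicted probability of death and each $x$ is a 0/1 indicator that equals 1 when the individual belongs to that category.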
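
To see what the fitted model implies for an individual, the intercept and coefficients printed above can be combined by hand and passed through the logistic function. This is a minimal sketch using the printed values; the function name `predicted_probability` is introduced only for this example.

```python
import math

# Intercept and coefficients copied from the model output above.
intercept = -2.20961704
coef = {
    "blue": -5.60976893, "green": -4.56132434, "purple": 7.96147624,
    "negative": -3.04261357, "positive": 0.83299653,
    "adult": 0.72513262, "senior": 0.72513262,
    "teen": 0.7862183, "young": -4.44610058,
}

def predicted_probability(race, gender, age):
    """Sum the intercept and the three active coefficients, then apply the sigmoid."""
    log_odds = intercept + coef[race] + coef[gender] + coef[age]
    return 1.0 / (1.0 + math.exp(-log_odds))

examples = [
    ("purple", "positive", "teen"),
    ("purple", "negative", "young"),
    ("blue", "positive", "senior"),
]
for person in examples:
    print(person, round(predicted_probability(*person), 4))
```

The purple, negative, young combination is the one purple group whose total probability (0.48) fell below the 0.5 cutoff, so it is a useful case to compare against the other two.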
