# Infant Mortality

One of my primary tasks at the Indiana Department of Health was to determine, given the data from a child's birth certificate, the probability of that child's death.  After linking the birth and death certificates, this involved simply reporting the descriptive statistics of infant deaths (defined as deaths of children at one year of age or younger), followed by a logistic regression.  I was never asked to make a more detailed predictive model, but as this is a project for my portfolio and not for my job, we can have a little fun.  Below I will first do a logistic regression, and then try a decision tree, a neural network, and a random forest.  *Note to readers:  This is a work in progress.  This page will be updated weekly.*


While the data from IDOH are de-identified and I could conceivably use those, I'll elect to make my own dataset by splitting ten thousand individuals between three races (purple, blue, and green), two genders (positive and negative), and four ages (young, teen, adult, senior).

But first, let's import some necessary libraries:

In [1]:
import numpy as np
import pandas as pd
import random
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

## Create Population and Assign Probabilities

In the next section we'll create our population and assign the probabilities of death.

### Create Population

As mentioned above, we'll have three races (purple, blue, and green), four ages (young, teen, adult, senior), and at risk of reinforcing the gender binary, two genders (positive and negative).

In [2]:
# population will be 70% blue, 20% green, 10% purple.
race_list = ["blue"]*7000
for i in range(2000):
    race_list.append("green")
for i in range(1000):
    race_list.append("purple")

# population will be 50% positive and 50% negative.
gender_list = ["positive","negative"]*5000

# population will be 20% young, adult, and senior, and 40% teen.
age_list = ["young","teen","teen","adult","senior"]*2000

# assemble these lists into a data frame using the dictionary method:
population = pd.DataFrame(
                            { 'race':race_list,
                              'gender':gender_list,
                              'age':age_list }
)
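The same population can also be built more compactly with NumPy. The sketch below is an alternative construction rather than the code this notebook runs; `population_alt`, `race_share`, and `age_share` are names made up for the illustration. It also checks that the shares match the comments in the cell above.

```python
import numpy as np
import pandas as pd

# Repeat each race label by its block size: 7000 blue, 2000 green, 1000 purple.
race = np.repeat(["blue", "green", "purple"], [7000, 2000, 1000])

# Tile the short repeating patterns out to 10,000 rows.
gender = np.tile(["positive", "negative"], 5000)
age = np.tile(["young", "teen", "teen", "adult", "senior"], 2000)

population_alt = pd.DataFrame({"race": race, "gender": gender, "age": age})

# Share of each category as a fraction of the whole population.
race_share = population_alt["race"].value_counts(normalize=True)
age_share = population_alt["age"].value_counts(normalize=True)
print(race_share)
print(age_share)
```

Because the gender pattern repeats every 2 rows and the age pattern every 5, every block of 10 consecutive rows contains each gender and age pattern position exactly once, so the three attributes are balanced against each other by construction.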

### Assign Probabilities

Each attribute contributes a multiplicative factor to the probability of death.  We'll make the race factor 50% for blue, 10% for green, and 100% for purple.  Note that these are *much higher* than the probabilities of *actual* infant death, but for the purposes of this exercise we want to refrain from having too unbalanced a target set.  Just trying to show basic technique here, and not dig into the specifics of how to deal with unbalanced sets.  (Perhaps that'll be a project for later.)

At any rate, the gender factor will be 90% for positive and 80% for negative, and the age factor will be 60% for young, 80% for teen, 90% for adult, and 100% for senior.  An individual's total probability of death is the product of the three factors.

In [3]:
# assign race probabilities
race_probs = [0.5]*7000
for i in range(2000):
    race_probs.append(0.1)
for i in range(1000):
    race_probs.append(1.0)
race_probs = np.array(race_probs)

# assign gender probabilities
gender_probs = [0.9,0.8]*5000
gender_probs = np.array(gender_probs)

# assign age probabilities
age_probs = [0.6, 0.8, 0.8, 0.9, 1.0]*2000
age_probs = np.array(age_probs)

# figure out total probabilities, add column to data frame
total_probs = race_probs*gender_probs*age_probs
population['total_probs'] = total_probs

# now assign each individual to an outcome, based on probability
final_outcome = np.array( [0]*10000 )
for i in range(10000):
    j = random.random()
    if j > total_probs[i]:
        final_outcome[i] = 0
    else:
        final_outcome[i] = 1
population['final_outcome'] = final_outcome

# take a look at the population
population.head()

   race    gender     age  total_probs  final_outcome
0  blue  positive   young         0.27              0
1  blue  negative    teen         0.32              0
2  blue  positive    teen         0.36              0
3  blue  negative   adult         0.36              1
4  blue  positive  senior         0.45              0
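The outcome assignment above draws from an unseeded generator, so the counts change on every run. A minimal sketch of a reproducible, vectorized variant follows; the seed value 42 and the use of `np.random.default_rng` are my assumptions, not part of the original notebook. It uses a strict `<` comparison where the loop above uses `<=`, which makes no practical difference for continuous draws.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, chosen only for reproducibility

# Same factors as the cell above, laid out in the same row order.
race_probs = np.repeat([0.5, 0.1, 1.0], [7000, 2000, 1000])
gender_probs = np.tile([0.9, 0.8], 5000)
age_probs = np.tile([0.6, 0.8, 0.8, 0.9, 1.0], 2000)
total_probs = race_probs * gender_probs * age_probs

# One uniform draw per individual; the outcome is 1 when the draw falls below the probability.
final_outcome = (rng.random(total_probs.size) < total_probs).astype(int)
print(final_outcome.mean())
```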


## Descriptive Statistics

Let's do some quick and dirty descriptive statistics on the dataset.  We have already determined the proportion of race, age, and gender, so we'll not worry about that.  How many are dead?

In [4]:
x = population['final_outcome'].sum()
N = 10000

print("Number of dead:",x)
print("Percentage of population",100*x/N)

Number of dead: 3243
Percentage of population 32.43
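As a sanity check, the observed rate can be compared with the rate the construction implies. Because race, gender, and age are balanced against each other in this synthetic population, the expected death rate is the product of each attribute's population-weighted average factor. The sketch below uses the shares and factors from the code cells above; `mean_factor` is a helper invented for the illustration.

```python
# (population share, probability factor) per category, taken from the cells above.
race = {"blue": (0.7, 0.5), "green": (0.2, 0.1), "purple": (0.1, 1.0)}
gender = {"positive": (0.5, 0.9), "negative": (0.5, 0.8)}
age = {"young": (0.2, 0.6), "teen": (0.4, 0.8), "adult": (0.2, 0.9), "senior": (0.2, 1.0)}

def mean_factor(groups):
    """Population-weighted average of one attribute's factor."""
    return sum(share * factor for share, factor in groups.values())

# 0.47 (race) * 0.85 (gender) * 0.82 (age)
expected_rate = mean_factor(race) * mean_factor(gender) * mean_factor(age)
print(round(expected_rate, 4))
```

The product works out to roughly 32.8%, so the observed 32.43% is within ordinary sampling noise of the expected value.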


In [5]:
# create a data frame of the deceased population consisting of race, gender, and age columns
pop_deceased = population[population['final_outcome']==1].reset_index(drop=True).drop(['total_probs','final_outcome'],axis=1)
print(pop_deceased.head())

N = len(pop_deceased)
print("Number of dead:",N)

print("Total population breakdown:")
print( pop_deceased.value_counts()/N )

print("\n")
print("Proportion race:")
print( pop_deceased.value_counts('race')/N )

print("\n")
print("Proportion gender:")
print( pop_deceased.value_counts('gender')/N )

print("\n")
print("Proportion age:")
print( pop_deceased.value_counts('age')/N )

   race    gender     age
0  blue  negative   adult
1  blue  negative    teen
2  blue  negative  senior
3  blue  negative    teen
4  blue  positive  senior
Number of dead: 3243
Total population breakdown:
race    gender    age   
blue    positive  teen      0.155103
        negative  teen      0.131668
        positive  senior    0.096516
                  adult     0.089732
        negative  senior    0.083873
                  adult     0.074931
        positive  young     0.055813
        negative  young     0.053962
purple  positive  teen      0.043478
        negative  teen      0.038853
        positive  adult     0.028060
                  senior    0.027752
        negative  senior    0.025285
                  adult     0.022202
                  young     0.015418
        positive  young     0.013568
green   negative  teen      0.008326
        positive  teen      0.007092
        negative  senior    0.006475
        positive  adult     0.005550
                  senior    0.

So 74.7% of the deceased are blue race, 21.1% are purple race, and 4.1% are green race.  52.3% are positive gender, 47.7% negative gender.  14.1% are young, 39.5% are teens, 21.8% are adults, 24.7% are seniors.
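These proportions describe who makes up the deceased, which partly reflects how large each group is to begin with. A complementary view is the death rate within each group. The sketch below rebuilds a comparable population so that it runs on its own; the seed and the names `demo`, `probs`, and `rate_by_race` are assumptions for the illustration, so its numbers will differ slightly from the run shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed

demo = pd.DataFrame({
    "race": np.repeat(["blue", "green", "purple"], [7000, 2000, 1000]),
    "gender": np.tile(["positive", "negative"], 5000),
    "age": np.tile(["young", "teen", "teen", "adult", "senior"], 2000),
})
probs = (np.repeat([0.5, 0.1, 1.0], [7000, 2000, 1000])
         * np.tile([0.9, 0.8], 5000)
         * np.tile([0.6, 0.8, 0.8, 0.9, 1.0], 2000))
demo["final_outcome"] = (rng.random(len(demo)) < probs).astype(int)

# The mean of a 0/1 column within a group is that group's death rate.
rate_by_race = demo.groupby("race")["final_outcome"].mean()
print(rate_by_race)
```

In the notebook itself the same idea would be `population.groupby('race')['final_outcome'].mean()`, and likewise for gender and age.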

## Logistic Regression

Now let's perform a logistic regression on the dataset, the first and simplest form of machine learning we can apply here.

Because we're working with categorical data, we can't simply plug the values in directly and let 'er rip.  Instead we need to encode them into dummy variables.  For this we need LabelEncoder() and OneHotEncoder().

In [6]:
x = population.select_dtypes(exclude=['number']) \
              .apply(LabelEncoder().fit_transform) \
              .join(population.select_dtypes(include=['number']))

x

      race  gender  age  total_probs  final_outcome
0        0       1    3         0.27              0
1        0       0    2         0.32              0
2        0       1    2         0.36              0
3        0       0    0         0.36              1
4        0       1    1         0.45              0
...    ...     ...  ...          ...            ...
9995     2       0    3         0.48              1
9996     2       1    2         0.72              1
9997     2       0    2         0.64              1
9998     2       1    0         0.81              1


So in our first step, we have encoded race so that blue is 0, green is 1, and purple is 2.  Gender is encoded so that 0 is negative and 1 is positive.  Age is encoded so that 0 is adult, 1 is senior, 2 is teen, and 3 is young.
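The mapping does not have to be read off the table by eye. `LabelEncoder` stores the sorted labels in its `classes_` attribute, and a label's position there is its integer code. A small sketch (the `mappings` list is just a name for the illustration):

```python
from sklearn.preprocessing import LabelEncoder

mappings = []
for values in (["purple", "blue", "green"],
               ["positive", "negative"],
               ["young", "teen", "adult", "senior"]):
    enc = LabelEncoder().fit(values)
    # classes_ is sorted alphabetically; the index of each label is its code.
    mapping = {str(label): code for code, label in enumerate(enc.classes_)}
    mappings.append(mapping)
    print(mapping)
```

Note that these codes are alphabetical, not ordinal: "adult" gets 0 and "young" gets 3, which is one reason the next step one-hot encodes them instead of feeding the integers to the model.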

Next we use one hot encoder and run the model:

In [7]:
pop_subset = x[['race','gender','age']]
enc = OneHotEncoder()
enc.fit(pop_subset)
onehotlabels = enc.transform(pop_subset).toarray()

model = LogisticRegression(solver='liblinear',random_state=0) \
        .fit(onehotlabels,x['final_outcome'])

print(model.intercept_)
print(model.coef_)

[-0.38089533]
[[ 0.01692769 -1.91738952  1.5195665  -0.29782757 -0.08306775  0.11016506
   0.26683522 -0.16277802 -0.59511758]]


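So now we have our model, approximately (rounding the coefficients printed above, with $y$ the log-odds of death and each $x$ an indicator that is 1 when the individual belongs to that category and 0 otherwise):
$$y = -0.38 + 0.02x_{blue} - 1.92x_{green} + 1.52x_{purple} - 0.30x_{negative} - 0.08x_{positive} + 0.11x_{adult} + 0.27x_{senior} - 0.16x_{teen} - 0.60x_{young}$$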
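The fitted equation gives log-odds, so turning it into a probability takes one more step: apply the logistic function $1/(1+e^{-y})$. Here is a minimal sketch using the intercept and coefficients printed by the cell above; `predicted_probability` is a helper written for this illustration and is not part of scikit-learn.

```python
import math

# Values copied from model.intercept_ and model.coef_ as printed above.
intercept = -0.38089533
coef = {
    "blue": 0.01692769, "green": -1.91738952, "purple": 1.5195665,
    "negative": -0.29782757, "positive": -0.08306775,
    "adult": 0.11016506, "senior": 0.26683522, "teen": -0.16277802, "young": -0.59511758,
}

def predicted_probability(race, gender, age):
    """Sum the intercept and the three active coefficients, then apply the logistic function."""
    log_odds = intercept + coef[race] + coef[gender] + coef[age]
    return 1.0 / (1.0 + math.exp(-log_odds))

p = predicted_probability("purple", "positive", "adult")
print(round(p, 3))
```

For comparison, the probability assigned to a purple, positive, adult individual when the data were generated was 1.0 × 0.9 × 0.9 = 0.81, so the model's estimate for that group lands somewhat below the true value.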

How does our model do at predicting?  Let's make a "classification report":

In [8]:
y_pred = model.predict(onehotlabels)

print( classification_report(x['final_outcome'],y_pred))

              precision    recall  f1-score   support

           0       0.72      0.96      0.82      6757
           1       0.70      0.21      0.33      3243

    accuracy                           0.71     10000
   macro avg       0.71      0.58      0.57     10000
weighted avg       0.71      0.71      0.66     10000



In [9]:
confusion_matrix(x['final_outcome'],y_pred)

array([[6453,  304],
       [2547,  696]])

So, 6,453 true negatives, 696 true positives, 304 false positives, and 2,547 false negatives.
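The figures in the classification report can be recovered from this matrix. In scikit-learn's convention the rows are the true classes and the columns are the predicted classes, so the top-right cell counts false positives and the bottom-left cell counts false negatives. A short sketch using the matrix printed above:

```python
import numpy as np

cm = np.array([[6453, 304],
               [2547, 696]])

# ravel() flattens row by row: true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)       # of those predicted dead, the fraction that died
recall = tp / (tp + fn)          # of those who died, the fraction the model caught
accuracy = (tp + tn) / cm.sum()  # fraction of all predictions that were correct
print(round(precision, 2), round(recall, 2), round(accuracy, 2))
```

The low recall is the important number here: the model misses roughly four out of five deaths.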

Let's see if we can't do better!

## Decision Tree

Now let's try a decision tree, and see if that improves our results at all.  We'll try two splitting criteria: the Gini index $1-\sum_j p_j^2$ and the entropy $-\sum_j p_j \log_2 p_j$, where $p_j$ is the proportion of class $j$ in a node.
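To make the two impurity measures concrete, here is a small sketch that computes both for a node, given the proportion $p_j$ of each class in it. The functions are written for illustration and are not scikit-learn's internal implementation. The example node uses the overall class balance from this run (67.57% alive, 32.43% dead).

```python
import math

def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Shannon entropy in bits; classes with zero proportion contribute nothing."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

root = [0.6757, 0.3243]  # class balance of the whole dataset
print(gini(root), entropy(root))
```

Both measures are zero for a pure node and largest for an even split (0.5 for Gini and 1 bit for entropy with two classes), which is why they usually pick similar splits.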



In [10]:
GiniTree = DecisionTreeClassifier(criterion = "gini", random_state = 100, max_depth = 5, min_samples_leaf=50)
GiniTree.fit(onehotlabels,x['final_outcome'])

EntropyTree = DecisionTreeClassifier(criterion = "entropy", random_state = 100, max_depth = 5, min_samples_leaf=50)
EntropyTree.fit(onehotlabels,x['final_outcome'])

DecisionTreeClassifier(criterion='entropy', max_depth=5, min_samples_leaf=50,
                       random_state=100)

In [11]:
gini_pred = GiniTree.predict(onehotlabels)

print( classification_report(x['final_outcome'],gini_pred))

              precision    recall  f1-score   support

           0       0.71      0.97      0.82      6757
           1       0.75      0.19      0.30      3243

    accuracy                           0.72     10000
   macro avg       0.73      0.58      0.56     10000
weighted avg       0.73      0.72      0.65     10000



In [12]:
entr_pred = EntropyTree.predict(onehotlabels)

print( classification_report(x['final_outcome'],entr_pred))

              precision    recall  f1-score   support

           0       0.71      0.97      0.82      6757
           1       0.75      0.19      0.30      3243

    accuracy                           0.72     10000
   macro avg       0.73      0.58      0.56     10000
weighted avg       0.73      0.72      0.65     10000



In [13]:
confusion_matrix(x['final_outcome'],gini_pred)

array([[6559,  198],
       [2641,  602]])

In [14]:
confusion_matrix(x['final_outcome'],entr_pred)

array([[6559,  198],
       [2641,  602]])

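So we basically see no improvement using a decision tree over using a simple logistic regression: accuracy moves from 0.71 to 0.72, while recall on the deceased class slips from 0.21 to 0.19.  The Gini and entropy trees produce identical predictions.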
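One likely reason neither model pulls ahead is that the outcomes were drawn at random, so even a classifier that knew every individual's true probability would often be wrong. The sketch below estimates that ceiling under the generating factors used in the code above. It is my own back-of-the-envelope check, not part of the original analysis. The best possible rule predicts death when the probability exceeds 0.5 and is correct with probability max(p, 1 - p).

```python
import numpy as np

# True probability of death for each of the 10,000 individuals, as constructed above.
total_probs = (np.repeat([0.5, 0.1, 1.0], [7000, 2000, 1000])
               * np.tile([0.9, 0.8], 5000)
               * np.tile([0.6, 0.8, 0.8, 0.9, 1.0], 2000))

# Expected accuracy of the ideal rule, averaged over the population.
ceiling = np.maximum(total_probs, 1.0 - total_probs).mean()
print(round(ceiling, 3))
```

The expected ceiling comes out at about 0.71, so both models are already close to the best accuracy these three features allow. Scores on a single finite sample, especially when measured on the training data as here, can land slightly above or below it.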

*Next step:  Neural network.*