## What factors contribute to lower rates of maternal mortality long-term?

The primary purpose of this study was to identify maternal mortality statistics in Mexico. Given that most maternal deaths can be prevented, predicting the likelihood of death among maternal patients is an important task. Any misinterpretation of data can be misleading, and averages are no exception. An average compresses a range of different data points into a single number, which can result in a dichotomized perception of the sample data. To compensate for this overdramatization, one needs to look at where the majority of the data lie and assess where the spreads overlap.
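To illustrate how a single average can hide the spread, the sketch below compares two hypothetical samples of ages. The values are invented for illustration and are not taken from the study data; both samples share the same mean but differ clearly in standard deviation and range.

```python
import numpy as np

# Hypothetical ages, invented for illustration only
narrow = np.array([27, 28, 28, 28, 29])
wide = np.array([18, 22, 28, 34, 38])

for name, sample in [("narrow", narrow), ("wide", wide)]:
    mean = sample.mean()
    std = sample.std(ddof=1)  # sample standard deviation
    print(f"{name}: mean={mean:.1f}, std={std:.2f}, range={sample.min()}-{sample.max()}")
```

Reporting the standard deviation next to the mean, as the dataset in this notebook does, is one way to keep that spread visible.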

The incidence of maternal mortality has decreased since the UN initiative in 2015, but it is important to assess which regions accomplished the largest changes over time and which factors contributed to the reduction of maternal mortality. A predictive model that incorporates these factors is an effective tool for maintaining a low rate of maternal mortality long-term. The proposed method is based on Random Forest, which allowed the identification of 11 independent predictors from 32 predictive factors in order to create a predictive model to assess the factors associated with reducing the rate of maternal mortality within each State of Mexico.
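The text does not show how the 11 predictors were selected. One common approach with Random Forest is to rank features by `feature_importances_` and keep the highest-ranked ones. The sketch below does this on synthetic data from `make_classification`; the feature names (`factor_0`, `factor_1`, and so on) and the cut-off of 11 are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 32 candidate factors, 11 of them informative (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=32, n_informative=11, n_redundant=0, random_state=0
)
feature_names = [f"factor_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Rank factors from most to least important and keep the top 11
order = np.argsort(forest.feature_importances_)[::-1]
top_factors = [feature_names[i] for i in order[:11]]
print(top_factors)
```

Importance rankings from a single fit can vary with the seed, so in practice it may be worth repeating the fit with several `random_state` values before settling on a list.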

Machine learning methods are becoming increasingly relevant as data availability increases and as the general understanding grows of how indicators with a significant number of interrelated factors change over time. Additionally, since scientific significance is the foundation of the results, it is possible to make the results interpretable from a clinical point of view.

The following methods are considered: Random Forest Classifier, KNeighbors Classifier, Logistic Regression, and Gaussian Naive Bayes. To interpret the differences between individual parameters of the mothers in the two classes, prediction failure (maternal mortality) versus prediction success (maternal vitality), the classification problem can be solved with a decision tree. The target is the prediction failure class.
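As a rough sketch of how the four methods can be compared side by side, the example below fits each one on the same split and reports test accuracy. The data are synthetic (generated with `make_classification`) and stand in for the real measures, so the scores say nothing about the maternal mortality dataset itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for the real measures (assumption)
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.33, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0),
    "KNeighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Fit each model on the training split and record its test accuracy
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)
    print(f"{name}: {scores[name]:.2f}")
```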

**Description of the Computer Process Used to Quantify Factors Involved with Maternal Mortality:**
The formalism defines the learning capability of a computer program in terms of its experience (E) with respect to a class of tasks (T) and a performance measure (P): the program learns if its performance at T, as measured by P, improves with E.
This computer program is therefore defined as follows:

- **Task (T)**
    - Classify a measure as a predictive or non-predictive factor of maternal mortality.
    
- **Experience (E)**
    - A dataset of measures in which some States' mean age of maternal mortality is statistically different from other State means and some are not.
    
- **Performance Measures (P)**
    - Classification accuracy: the number of measures classified correctly out of all measures considered, expressed as a percentage.
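The performance measure can be written out directly. The sketch below uses invented labels (1 for a predictive factor, 0 for a non-predictive one) to show the arithmetic behind classification accuracy.

```python
# Hypothetical labels: 1 = predictive factor, 0 = non-predictive factor
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Count the positions where the predicted label matches the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = 100 * correct / len(y_true)
print(f"{correct} of {len(y_true)} correct, accuracy = {accuracy:.1f}%")
```

Here 6 of the 8 labels match, giving an accuracy of 75%.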

In [1]:
# Import the relevant modules
import pandas as pd
import numpy as np

# Machine Learning modules
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.tree import export_graphviz


In [13]:
# Load the merged metro_gdp_mortality dataset saved earlier with %store
%store -r metro_gdp_mortality

In [14]:
data = metro_gdp_mortality
data.head()

Unnamed: 0,State,State Population 2010,State Population 2015,State GDP 2010,State GDP 2015,μ Age Maternal Death,Standard Deviation,Variance
0,Aguascalientes,932369,1044049,16597.0,19528.0,28.36,7.43,55.16
1,Baja California,3155070,3315766,52579.0,57136.0,27.15,6.81,46.31
4,Baja California Sur,251871,272711,21260.0,21431.0,27.56,7.44,55.37
5,Campeche,259005,283025,,,26.87,6.65,44.29
6,Chiapas,1058712,1162592,14271.0,13392.0,28.02,6.71,45.02


#### Purpose for Changing all Categorical Strings to a Numeric Value:
- scikit-learn estimators expect numeric input, so string columns cannot be passed to them directly (strings have no statistical value until they are encoded)
- Numeric values are comparable, so each distinct string value is mapped to a number
- The resulting numeric code is what the model uses in place of the original string
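A small example of the encoding, using a few State names from the table above in a deliberately shuffled order. With `astype("category")`, pandas sorts the distinct strings and assigns each one its position as an integer code, so the codes are labels rather than measurements.

```python
import pandas as pd

# A few State names, deliberately out of order
states = pd.Series(["Chiapas", "Aguascalientes", "Campeche", "Chiapas"])

as_category = states.astype("category")
codes = as_category.cat.codes

# Categories are sorted, so the codes follow alphabetical order
print(list(as_category.cat.categories))  # ['Aguascalientes', 'Campeche', 'Chiapas']
print(codes.tolist())                    # [2, 0, 1, 2]
```

Because the codes carry an artificial ordering, a model may treat State 4 as "greater than" State 0; whether that matters depends on the estimator.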

In [15]:
# Convert string columns to numeric codes and fill missing values
for column in data.columns:
    if data[column].dtype == "object":
        # mode() returns a Series, so take its first entry as the fill value
        data[column] = data[column].fillna(data[column].mode()[0])
        data[column] = data[column].astype("category").cat.codes
    else:
        # Numeric columns: fill missing values with the column median
        data[column] = data[column].fillna(data[column].median())
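
When filling missing values with the mode, note that `Series.mode()` returns a Series rather than a single value, and `fillna` aligns a Series argument by index. The toy example below (invented data) shows the difference between passing that Series and passing its first entry.

```python
import pandas as pd

s = pd.Series(["a", None, "a", None])

# Passing the Series returned by mode() aligns on index, so the gaps at
# positions 1 and 3 are left unfilled
filled_by_series = s.fillna(s.mode())

# Passing the first modal value as a scalar fills every gap
filled_by_scalar = s.fillna(s.mode()[0])

print(filled_by_series.isna().sum(), filled_by_scalar.isna().sum())
```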

In [16]:
data.head()

Unnamed: 0,State,State Population 2010,State Population 2015,State GDP 2010,State GDP 2015,μ Age Maternal Death,Standard Deviation,Variance
0,0,932369,1044049,16597.0,19528.0,28.36,7.43,55.16
1,1,3155070,3315766,52579.0,57136.0,27.15,6.81,46.31
4,2,251871,272711,21260.0,21431.0,27.56,7.44,55.37
5,3,259005,283025,20994.5,22120.5,26.87,6.65,44.29
6,4,1058712,1162592,14271.0,13392.0,28.02,6.71,45.02


In [17]:
data.columns = [0, 1, 2, 3, 4, 5, 6, 7]
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,0,932369,1044049,16597.0,19528.0,28.36,7.43,55.16
1,1,3155070,3315766,52579.0,57136.0,27.15,6.81,46.31
4,2,251871,272711,21260.0,21431.0,27.56,7.44,55.37
5,3,259005,283025,20994.5,22120.5,26.87,6.65,44.29
6,4,1058712,1162592,14271.0,13392.0,28.02,6.71,45.02


In [21]:
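# Start from a copy of the entire dataset (response variable included)
X = data.copy()

# Remove the response variable from X and keep it as y.
# The columns were relabelled 0-7 above, so label 8 does not exist
# and this call raises the KeyError shown below.
y = X.pop(8)

KeyError: 8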

In [None]:
# Create train and test data sets with train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
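
The two steps above (separating the response variable, then splitting) can be sketched on a small hypothetical frame. The column names, the 32-row size, and the binary `target` column are all assumptions for illustration; the notebook's actual response column is not identified in the text.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame: 32 rows, three features and a made-up binary target
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(32, 3)), columns=["a", "b", "c"])
demo["target"] = rng.integers(0, 2, size=32)

X_demo = demo.copy()
y_demo = X_demo.pop("target")  # pop removes the column from X_demo and returns it

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
print(X_tr.shape, X_te.shape)
```

`pop` only works with a label that exists in the frame, so checking `X_demo.columns` first is a cheap way to avoid a `KeyError`.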

#### For Random Forest Classifiers:
- n_estimators: the number of trees in the forest; 100 is a straightforward starting value
- max_depth: the maximum depth of each tree, which sets the level of complexity/freedom; values from 2 to 6 are usually fine, though starting at 2 is recommended
- random_state: the seed for the random number generator; it makes the model reproducible, and any integer is fine
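
The effect of these three parameters can be checked on synthetic data (an assumption; the real dataset is not used here). The sketch fits two forests with identical settings and confirms that the tree count, the depth limit, and the seed behave as described.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)

# Two forests with identical settings and the same seed
forest_a = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0).fit(X_syn, y_syn)
forest_b = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0).fit(X_syn, y_syn)

# Same seed and same data should give the same predictions
same_predictions = np.array_equal(forest_a.predict(X_syn), forest_b.predict(X_syn))
# No individual tree should be deeper than max_depth
deepest_tree = max(tree.tree_.max_depth for tree in forest_a.estimators_)
print(len(forest_a.estimators_), deepest_tree, same_predictions)
```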

In [19]:
# Create a Random Forest Classifier instance
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)

In [None]:
# Fit train data to Random Forest Classifier
clf.fit(X_train, y_train)

In [None]:
# Compute the confusion_matrix to evaluate the accuracy of a classification
confusion_matrix(y_test, clf.predict(X_test))
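
For reading the result, a small hand-checkable example may help. The labels below are invented; in scikit-learn's `confusion_matrix`, rows correspond to the true classes and columns to the predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = prediction failure class, 0 = prediction success class
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # rows are true classes, columns are predicted classes
accuracy = (tn + tp) / cm.sum()
print(cm)
print(f"accuracy = {accuracy:.2f}")
```

The diagonal holds the correct predictions (1 true negative and 2 true positives here), so accuracy is 3 out of 5.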

In [None]:
# Create a Logistic Regression instance and fit it on the full dataset
clf_lin = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X, y)

In [None]:
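# Extract single tree
estimator = clf.estimators_[5]
# Write the tree to tree.dot so that Graphviz can render it
export_graphviz(estimator, out_file='tree.dot', rounded=True, filled=True)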

In [None]:
# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')
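
If the Graphviz binary is not available, the tree can still be inspected as text. A minimal sketch, assuming a forest fitted on synthetic data: with `out_file=None`, `export_graphviz` returns the DOT source as a string instead of writing a file.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

# Synthetic data standing in for the real training set (assumption)
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=0).fit(X_syn, y_syn)

# With out_file=None, export_graphviz returns the DOT source as a string
dot_source = export_graphviz(forest.estimators_[5], out_file=None, filled=True, rounded=True)

# Show the first few lines of the DOT description
print("\n".join(dot_source.splitlines()[:5]))
```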