## 1. Inspecting request_responses.csv file
<p>Nabd App provides an easy, fast, and reliable way to communicate for people who urgently need advice on handling a medical emergency. Trauma specialists call the first 60 minutes after a serious injury "the golden hour", because chances of survival are greatest if the patient gets proper treatment within that hour. The app does not replace a doctor, but it helps until medical transport arrives. We are modelling responses to gain insight into user behavior that will help us improve the app.</p>
<p>We generated a dummy dataset to test the model using []. Once we deploy the app we will link the model to the database to perform our analysis.</p>
<p>The data is stored in <code>datasets/request_responses.csv</code> and is structured according to the RFMTC marketing model (a variation of RFM).</p>
<p>RFM stands for Recency, Frequency, and Monetary Value. It is commonly used in marketing to identify the best customers. In our case, the customers are request responders. Below is a description of what each column means in our dataset:</p>
<ul>
<li>R (Recency - time since the last response, in hours)</li>
<li>F (Frequency - total number of responses)</li>
<li>M (Monetary - duration of the call, in minutes)</li>
<li>T (Time - weeks since the first response)</li>
<li>Responded - a binary variable indicating whether the responder answered a call in the last week (<code>1</code> stands for responding; <code>0</code> stands for not responding)</li>
</ul>

In [1]:
# Import pandas
import pandas as pd

# Read in dataset
responses = pd.read_csv('datasets/request_responses3.csv')

# Print out the first rows of our dataset
responses.head()

Unnamed: 0,Recency (in Hours),Frequency,Monetary(in Minutes),Time(in Weeks),Responded
0,64,7,45,5,1
1,32,10,39,2,1
2,33,16,49,3,1
3,14,10,45,3,1
4,70,14,33,1,1


In [2]:
# Print a concise summary of responses DataFrame
responses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 245 entries, 0 to 244
Data columns (total 5 columns):
Recency (in Hours)      245 non-null int64
Frequency               245 non-null int64
Monetary(in Minutes)    245 non-null int64
Time(in Weeks)          245 non-null int64
Responded               245 non-null int64
dtypes: int64(5)
memory usage: 9.7 KB


## 2. Creating target column
<p>We aim to predict the value in the <code>Responded</code> column. We will rename it to <code>target</code> so that it is more convenient to work with.</p>

In [3]:
# Rename target column as 'target' for brevity 
responses.rename(
    columns={'Responded': 'target'},
    inplace=True
)

# Print out the first 2 rows
responses.head(2)

Unnamed: 0,Recency (in Hours),Frequency,Monetary(in Minutes),Time(in Weeks),target
0,64,7,45,5,1
1,32,10,39,2,1


## 3. Checking target incidence
<p>We want to predict whether or not the same person will respond to a call. The model for this is a binary classifier, meaning that there are only 2 possible outcomes:</p>
<ul>
<li><code>0</code> - the person will not respond to the call</li>
<li><code>1</code> - the person will respond to the call</li>
</ul>
<p>Target incidence is defined as the number of cases of each individual target value in a dataset. That is, how many <code>0</code>s are in the target column compared to how many <code>1</code>s? Target incidence gives us an idea of how balanced (or imbalanced) our dataset is.</p>

In [4]:
# Print target incidence proportions, rounding output to 3 decimal places
responses.target.value_counts(normalize=True).round(3)

0    0.571
1    0.429
Name: target, dtype: float64
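
<p>One way to read these proportions is as a majority-class baseline: a model that always predicts the most common class would be right about as often as that class appears. The sketch below illustrates this on a hypothetical target column; the counts (140 zeros, 105 ones) are an assumption chosen only to mirror the proportions printed above, not values read from the real file.</p>

```python
import pandas as pd

# Hypothetical stand-in for responses.target (assumed counts: 140 zeros, 105 ones)
demo_target = pd.Series([0] * 140 + [1] * 105, name='target')

# Target incidence as proportions, rounded to 3 decimal places
demo_incidence = demo_target.value_counts(normalize=True).round(3)

# The most frequent class and the accuracy of always predicting it
majority_class = demo_incidence.idxmax()
baseline_accuracy = demo_incidence.max()

print(demo_incidence)
print(f'Always predicting {majority_class} gives accuracy {baseline_accuracy}')
```

<p>Any model we train later should beat this baseline to be considered useful.</p>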

## 4. Splitting responses into train and test datasets
<p>We'll now use the <code>train_test_split()</code> function to split the <code>responses</code> DataFrame.</p>
<p>Target incidence showed that <code>0</code>s make up 57.1% of our dataset. We want to keep the same structure in the train and test datasets, i.e., both should have a <code>0</code> target incidence of about 57.1%. This can be done with the <code>train_test_split()</code> function from the <code>scikit-learn</code> library by specifying the <code>stratify</code> parameter. In our case, we'll stratify on the <code>target</code> column.</p>

In [5]:
# Import train_test_split method
from sklearn.model_selection import train_test_split 

# Split responses DataFrame into
# X_train, X_test, y_train and y_test datasets,
# stratifying on the `target` column
X_train, X_test, y_train, y_test = train_test_split(
    responses.drop(columns='target'),
    responses.target,
    test_size=0.2,
    random_state=42,
    stratify=responses.target
)

# Print out the first 2 rows of X_train
X_train.head(2)

Unnamed: 0,Recency (in Hours),Frequency,Monetary(in Minutes),Time(in Weeks)
36,53,9,26,4
169,193,3,1,17
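
<p>To confirm that stratification preserved the class balance, the target incidence can be compared across the full, train, and test sets. The sketch below does this on a hypothetical stand-in DataFrame (<code>demo</code>, with assumed counts of 140 zeros and 105 ones and a made-up <code>Frequency</code> feature); on the real data the same comparison would use <code>responses.target</code>, <code>y_train</code>, and <code>y_test</code>.</p>

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for `responses` (assumed: 245 rows, 140 zeros, 105 ones)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'Frequency': rng.integers(1, 20, size=245),
    'target': [0] * 140 + [1] * 105,
})

# Same split settings as the cell above, stratifying on the target column
X_tr, X_te, y_tr, y_te = train_test_split(
    demo.drop(columns='target'),
    demo['target'],
    test_size=0.2,
    random_state=42,
    stratify=demo['target'],
)

# Side-by-side class proportions; the three columns should be nearly identical
incidence_check = pd.DataFrame({
    'full': demo['target'].value_counts(normalize=True),
    'train': y_tr.value_counts(normalize=True),
    'test': y_te.value_counts(normalize=True),
}).round(3)
print(incidence_check)
```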


## 5. Selecting model using TPOT
<p><a href="https://github.com/EpistasisLab/tpot">TPOT</a> is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.</p>
<p><img src="https://assets.datacamp.com/production/project_646/img/tpot-ml-pipeline.png" alt="TPOT Machine Learning Pipeline"></p>
<p>TPOT will automatically explore hundreds of possible pipelines to find the best one for our dataset. Note, the outcome of this search will be a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">scikit-learn pipeline</a>, meaning it will include any pre-processing steps as well as the model.</p>
<p>We are using TPOT to help us zero in on one model that we can then explore and optimize further.</p>

In [6]:
# Import TPOTClassifier and roc_auc_score
from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score

# Instantiate TPOTClassifier
tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    verbosity=2,
    scoring='roc_auc',
    random_state=42,
    disable_update_check=True,
    config_dict='TPOT light'
)
tpot.fit(X_train, y_train)

# AUC score for tpot model
tpot_auc_score = roc_auc_score(y_test, tpot.predict_proba(X_test)[:, 1])
print(f'\nAUC score: {tpot_auc_score:.4f}')

# Print best pipeline steps
print('\nBest pipeline steps:', end='\n')
for idx, (name, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):
    # Print idx and transform
    print(f'{idx}. {transform}')


Generation 1 - Current best internal CV score: 0.8402391885607997
Generation 2 - Current best internal CV score: 0.8402391885607997
Generation 3 - Current best internal CV score: 0.8402391885607997
Generation 4 - Current best internal CV score: 0.8408037375029063
Generation 5 - Current best internal CV score: 0.8425983782841199

Best pipeline: DecisionTreeClassifier(BernoulliNB(input_matrix, alpha=100.0, fit_prior=False), criterion=gini, max_depth=2, min_samples_leaf=4, min_samples_split=7)

AUC score: 0.7908

Best pipeline steps:
1. StackingEstimator(estimator=BernoulliNB(alpha=100.0, binarize=0.0,
                                        class_prior=None, fit_prior=False))
2. DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=4, min_samples_split=7,
                       min_weight_fract

In [7]:
# from sklearn import tree
from sklearn import linear_model
from sklearn.metrics import accuracy_score

# Fit a logistic regression model (decision tree alternative left commented out)
# model = tree.DecisionTreeClassifier()
model = linear_model.LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Accuracy of the predictions on the held-out test set
y_predict = model.predict(X_test)
accuracy_score(y_test, y_predict)



0.7755102040816326
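
<p>The value above is an accuracy, while the TPOT pipeline was scored with AUC, so the two numbers are not directly comparable. A minimal sketch of computing both metrics for a logistic regression follows. It runs on synthetic data from <code>make_classification</code> (an assumption made only to keep the example self-contained) and sets <code>max_iter=1000</code>, which is not in the original cell; on the real data, the same <code>roc_auc_score</code> call would take <code>y_test</code> and <code>model.predict_proba(X_test)[:, 1]</code>.</p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 245 rows, 4 features, binary target
X_demo, y_demo = make_classification(
    n_samples=245, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the classifier; max_iter is raised to help the solver converge
demo_model = LogisticRegression(random_state=42, max_iter=1000)
demo_model.fit(Xd_train, yd_train)

# Accuracy uses hard class labels; AUC uses the probability of class 1
demo_accuracy = accuracy_score(yd_test, demo_model.predict(Xd_test))
demo_auc = roc_auc_score(yd_test, demo_model.predict_proba(Xd_test)[:, 1])
print(f'Accuracy: {demo_accuracy:.4f}')
print(f'AUC score: {demo_auc:.4f}')
```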

In [8]:
from sklearn.metrics import confusion_matrix

pd.DataFrame(
    confusion_matrix(y_test, y_predict),
    columns=['Predicted Not Responded', 'Predicted Responded'],
    index=['True Not Responded', 'True Responded']
)

Unnamed: 0,Predicted Not Responded,Predicted Responded
True Not Responded,19,9
True Responded,2,19
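
<p>The confusion matrix can be turned into the usual summary metrics with simple arithmetic. The sketch below copies the four counts from the table above and derives accuracy, precision, recall, and specificity for the "Responded" class; the variable names are illustrative.</p>

```python
# Counts copied from the confusion matrix above
tn, fp = 19, 9   # true "Not Responded": predicted 0, predicted 1
fn, tp = 2, 19   # true "Responded": predicted 0, predicted 1

total = tn + fp + fn + tp

# Share of all test cases that were classified correctly
accuracy = (tp + tn) / total
# Of the cases predicted as "Responded", how many really responded
precision = tp / (tp + fp)
# Of the cases that really responded, how many the model caught
recall = tp / (tp + fn)
# Of the cases that did not respond, how many the model identified
specificity = tn / (tn + fp)

print(f'Accuracy:    {accuracy:.4f}')
print(f'Precision:   {precision:.4f}')
print(f'Recall:      {recall:.4f}')
print(f'Specificity: {specificity:.4f}')
```

<p>The accuracy computed this way, 38 correct out of 49, matches the <code>accuracy_score</code> value reported earlier. Recall is noticeably higher than precision: the model misses few actual responders but also flags 9 non-responders as responders.</p>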
