
# Lab: Testing Your Models in the Real World
## How do you know that your models will do a good job making predictions on new, unseen data?
This lab will discuss the fundamentals of splitting your data into training, validation and test data sets and demonstrate the dangers of overreliance on training data to make predictions.


## Section 1: Import Data
This lab uses the [Stroke Prediction Dataset](https://www.kaggle.com/fedesoriano/stroke-prediction-dataset) from Kaggle.
To work with the data in Python, you will need to import the CSV into a pandas DataFrame. The pandas package is useful for manipulating and analyzing data.


In [None]:
#Import pandas package
import pandas as pd
#Read in stroke data
stroke_data = pd.read_csv('healthcare-dataset-stroke-data.csv')
#Preview the first 20 rows of the data
stroke_data.head(20) 

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
5,56669,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1
6,53882,Male,74.0,1,1,Yes,Private,Rural,70.09,27.4,never smoked,1
7,10434,Female,69.0,0,0,No,Private,Urban,94.39,22.8,never smoked,1
8,27419,Female,59.0,0,0,Yes,Private,Rural,76.15,,Unknown,1
9,60491,Female,78.0,0,0,Yes,Private,Urban,58.57,24.2,Unknown,1


Below are some summary statistics about the stroke column, which is what you will be trying to predict.


In [None]:
#Count the total records in the dataset
total_records = stroke_data.shape[0]
print('There are {:,} records in the stroke dataset.'.format(total_records))
summary = pd.DataFrame(stroke_data.groupby('stroke').size()).rename(columns={0:'Count'})
summary['Percent'] = summary['Count'] / total_records
summary


There are 5,110 records in the stroke dataset.


stroke,Count,Percent
0,4861,0.951272
1,249,0.048728


* Ninety-five percent of the patients in this dataset did not have a stroke. Therefore, you could build a model that always predicts "no stroke" and achieve 95% accuracy.
* This is not what you want, however, since the goal is to accurately predict when a patient does have a stroke. There are a number of techniques for dealing with imbalanced datasets such as this one.
* For this lab, you will use the true positive rate to assess the performance of your predictions.

$$\text{True Positive Rate} = \frac{TP}{TP + FN}$$

where $TP$ is the number of true positive predictions (actual value = stroke; predicted value = stroke) and $FN$ is the number of false negative predictions (actual value = stroke; predicted value = no stroke).
This value, also called sensitivity or recall, measures how well a model captures actual stroke cases.
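
To make the contrast between accuracy and the true positive rate concrete, here is a minimal sketch in plain Python. It uses the class counts from the summary above and a hypothetical baseline that always predicts "no stroke"; the variable names are illustrative only.

```python
# Class counts taken from the summary of the stroke column.
no_stroke, stroke = 4861, 249

# Actual labels, and a baseline that always predicts "no stroke" (0).
actual = [0] * no_stroke + [1] * stroke
predicted = [0] * len(actual)

# Tally the counts needed for each metric.
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
correct = sum(1 for a, p in zip(actual, predicted) if a == p)

accuracy = correct / len(actual)
true_positive_rate = tp / (tp + fn)

print('Accuracy: {:.1%}'.format(accuracy))
print('True positive rate: {:.1%}'.format(true_positive_rate))
```

The baseline is right about 95% of the time, yet it never identifies a single stroke case, which is why accuracy alone is a poor guide here.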

* Assuming medical interventions are relatively cheap (e.g., recommending weight loss or exercise to a patient in danger of a stroke), it is better to have the occasional false positive than to miss patients at high risk for strokes.

* So now that you know how you will evaluate the predictions from your model, how can you know if your model does well predicting strokes in the real world? The goal is to build a model that identifies factors indicating a high likelihood of having a stroke, so interventions can hopefully prevent the stroke before it happens.
* This being the case, you need to evaluate the model's predictions on data that it has never seen before. This can be done by splitting your data into a training dataset (to use for training and evaluating the model as it is being built) and a test dataset (reserved for evaluating the finished model).


### Data Cleanup
- Before you can do any modeling, categorical variables that will be used to predict strokes need to be converted to dummy variables.
- You also need to replace missing data in the BMI column. For this lab, you will simply use the average BMI to replace the missing data.
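
Both cleanup steps can be previewed on a small toy DataFrame before applying them to the real data. This is only a sketch: the four rows below are made up for illustration, and the `dtype=int` argument is an assumption added so the dummy columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Toy data (not from the stroke dataset): one categorical column, one with a gap.
toy = pd.DataFrame({
    'smoking_status': ['smokes', 'never smoked', 'Unknown', 'smokes'],
    'bmi': [30.0, None, 20.0, 25.0],
})

# drop_first=True drops the first category in sort order ('Unknown' here),
# which becomes the baseline encoded as all zeros.
dummies = pd.get_dummies(toy['smoking_status'], drop_first=True, dtype=int)

# Replace the missing BMI with the mean of the observed values.
toy['bmi'] = toy['bmi'].fillna(toy['bmi'].mean())

print(list(dummies.columns))
print(toy['bmi'].tolist())
```

Dropping one category avoids a redundant column, since its value is implied when all the other dummy columns are 0.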


In [None]:
#Gender
#First, to make things easier, remove the one "Other" gender value.
#Use .copy() so the column assignments below do not trigger a SettingWithCopyWarning.
stroke_data = stroke_data[stroke_data['gender'] != 'Other'].copy()
#Add new column 'male': 1 = male; 0 = female
stroke_data['male'] = pd.get_dummies(stroke_data['gender'], drop_first=True, dtype=int)['Male']
#Add new column 'urban': 1 = urban; 0 = rural
stroke_data['urban'] = pd.get_dummies(stroke_data['Residence_type'], drop_first=True, dtype=int)['Urban']
#Add new column 'married': 1 = ever married; 0 = never married
stroke_data['married'] = pd.get_dummies(stroke_data['ever_married'], drop_first=True, dtype=int)['Yes']
#Add one dummy column per smoking status (the first category in sort order is dropped as the baseline)
smoking_dummies = pd.get_dummies(stroke_data['smoking_status'], drop_first=True, dtype=int)
stroke_data = pd.concat([stroke_data, smoking_dummies], axis=1)
#Replace missing BMI with average BMI
bmi_average = stroke_data['bmi'].mean()
stroke_data['bmi'] = stroke_data['bmi'].fillna(bmi_average)


## Section 2: Create a Test Dataset
* You will now split your dataset into two datasets:
  1. Training Data: Used to train your model to identify important predictors of stroke
  2. Test Data: Reserved to evaluate the model on new, unseen data
* The scikit-learn package in Python has many tools for machine learning, including data preparation tools. For this lab, you will be using the train_test_split function.
* Inputs to train_test_split:
  * arrays: One or more arrays to split: either the entire dataset including the output, or two separate arrays (an X array of predictors and a y array of outputs). If you pass two arrays, they must have the same number of rows, and the rows must be aligned.
  * test_size: Value between 0 and 1 that indicates the proportion of data to be reserved for the test dataset (defaults to 0.25 if train_size is None).
  * train_size: Value between 0 and 1 that indicates the proportion of data to be used for the training dataset (complement of test_size if test_size is set and this value is None).
  * random_state: Seed value for randomizing the data split.
  * shuffle: Whether to shuffle the data before splitting (defaults to True).
  * stratify: Output field to use for stratified sampling (defaults to None).

Since your dataset has imbalanced output classes, be sure to use the stratify option.
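
As a rough illustration of what stratify does, the sketch below splits a synthetic dataset with 5% positive labels (made up for this example, not the stroke data) and checks the class balance in each split. The fixed random_state is an assumption added so the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1,000 rows with 5% positive labels (not the stroke dataset).
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 50 + [0] * 950)

# Stratify on y so both splits keep the same share of positive labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0
)

print('Positive rate in training data: {:.1%}'.format(y_train.mean()))
print('Positive rate in test data: {:.1%}'.format(y_test.mean()))
```

Without stratify, a random split of a rare class can leave the test dataset with noticeably more or fewer positive cases than the training dataset.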


In [None]:
#Import train_test_split function from scikit-learn package
from sklearn.model_selection import train_test_split
#Split the data into 80% training / 20% test data using stratified sampling
train, test = train_test_split(stroke_data, train_size=0.8, stratify=stroke_data['stroke'])

In [None]:
#Count the total records in the training dataset
training_records = train.shape[0]
print('There are {:,} records in the training dataset.'.format(training_records))
#Calculate the count and percent of records grouped by stroke/no stroke
train_summary = pd.DataFrame(train.groupby('stroke').size()).rename(columns={0:'Count'})
train_summary['Percent'] = train_summary['Count'] / training_records
train_summary


There are 4,087 records in the training dataset.


stroke,Count,Percent
0,3888,0.951309
1,199,0.048691


In [None]:
#Count the total records in the test dataset
test_records = test.shape[0]
print('There are {:,} records in the test dataset.'.format(test_records))
#Calculate the count and percent of records grouped by stroke/no stroke
test_summary = pd.DataFrame(test.groupby('stroke').size()).rename(columns={0:'Count'})
test_summary['Percent'] = test_summary['Count'] / test_records
test_summary

In [None]:
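#Confirm the split did what you expected!
#Use the current row count, since one record was removed after total_records was computed.
print('There are {:.1%} of all records in training dataset.'.format(training_records / stroke_data.shape[0]))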


There are 80.0% of all records in training dataset.


## Section 3: Tune Models Using Validation Data
* So, now you have your training dataset and you know what metric to use to evaluate your model. But how can you tune a model to make sure you are using the best algorithm and settings for your dataset?
* You could use the true positive rate for the training dataset, but overreliance on the training dataset may create blind spots in your model: real combinations of predictors that, due to limited data or simple bad luck, never appeared in your training data.
* How about the test data? This is not a good idea because you want to keep the test data separate from the model evaluation/tuning process while building the model. If you use the test dataset during model training and evaluation, the test dataset will no longer represent "unseen" data, which is important for evaluating how well your model generalizes.
* So what can you do? The answer is validation data. This is another split of the data, this time using the training dataset.
Using the training dataset, you can again use the train_test_split function to create two new datasets:
* train_final: The final dataset used to train your models
* validation: The dataset used to evaluate and tune your models
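
The two-step split works out to roughly 60% training, 20% validation and 20% test data. The sketch below checks this on synthetic row indices (made up for illustration, not the stroke data); the fixed random_state values are an assumption added for repeatability.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 1,000 rows with 5% positives (not the stroke dataset).
labels = np.array([1] * 50 + [0] * 950)
rows = np.arange(len(labels))

# Step 1: hold out 20% of all rows as the test dataset.
train_rows, test_rows = train_test_split(
    rows, train_size=0.8, stratify=labels, random_state=0
)
# Step 2: split the remaining 80% into 75% training / 25% validation.
train_final_rows, validation_rows = train_test_split(
    train_rows, train_size=0.75, stratify=labels[train_rows], random_state=0
)

for name, part in [('train_final', train_final_rows),
                   ('validation', validation_rows),
                   ('test', test_rows)]:
    print('{}: {:.0%} of all rows'.format(name, len(part) / len(rows)))
```

Because the second split is taken only from the training rows, the test rows stay untouched until the final evaluation.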


In [None]:
from sklearn.model_selection import train_test_split
#Split the training data into training/validation data using a 75%/25% split
#Be sure to use stratified sampling!
train_final, validation = train_test_split(train, train_size=0.75, stratify=train['stroke'])
#Now you can finally build a binary classification model to predict strokes.
#You will be using the K-Nearest Neighbors algorithm to build your model.


In [None]:
#Import KNN model function from scikit-learn
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
#Remove the 'id' and 'stroke' columns, plus the original text columns, from the features (predictors)
excluded_columns = ['id', 'stroke', 'gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']
features = ~train_final.columns.isin(excluded_columns)
feature_columns = train_final.columns[features]
model.fit(train_final[feature_columns], train_final['stroke'])

In [None]:
#Import true positive rate (recall) function
from sklearn.metrics import recall_score

#Predict output for training dataset
train_predict = model.predict(train_final[feature_columns])

#Calculate the true positive rate for the training dataset
tpr_train = recall_score(train_final['stroke'], train_predict)
print('The true positive rate for the training dataset is {:.3%}.'.format(tpr_train))
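
To see why the training true positive rate alone can be misleading, here is a hedged sketch on synthetic data (generated with make_classification, not the stroke dataset). It deliberately uses n_neighbors=1, an extreme setting chosen for illustration rather than the lab's value of 5, and compares recall on the training split against a held-out validation split.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced data (roughly 10% positives); not the stroke dataset.
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=0
)

# With one neighbor, each training point is its own nearest neighbor,
# so the model simply reproduces the training labels.
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)

tpr_train = recall_score(y_train, model.predict(X_train))
tpr_val = recall_score(y_val, model.predict(X_val))

print('Training true positive rate: {:.1%}'.format(tpr_train))
print('Validation true positive rate: {:.1%}'.format(tpr_val))
```

The training score is perfect by construction because the model has memorized the training rows. The validation score shows how well it handles rows it has not seen, which is the number to use when tuning.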