This week's competition involves building classification models on a relatively simple data set.

First, we should import the packages we'll need.

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, accuracy_score, roc_auc_score, confusion_matrix
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

Next, we need to retrieve the data and store it in a pandas DataFrame variable.

In [2]:
data = pd.read_csv(r'healthcare-dataset-stroke-data.csv')
data

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0


Next, we should see if there are any NaN values.

In [3]:
pd.DataFrame.from_dict(data={
    'anyNaN' : data.isna().any()
})

Unnamed: 0,anyNaN
id,False
gender,False
age,False
hypertension,False
heart_disease,False
ever_married,False
work_type,False
Residence_type,False
avg_glucose_level,False
bmi,True


It looks like bmi is the only attribute with NaN values. Let's see how many such values there are.

In [4]:
len(data.bmi[data.bmi.isna()])

201

There are only 201 NaN values in the bmi attribute. Therefore, we can drop these records, as we'd still have enough data left over.

In [5]:
data.dropna(subset=['bmi'], inplace=True)
data.reset_index(drop=True, inplace=True)
data

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
2,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
3,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
4,56669,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
4904,14180,Female,13.0,0,0,No,children,Rural,103.08,18.6,Unknown,0
4905,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
4906,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
4907,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0


Now, let's make sure there are no duplicates in the data.

In [6]:
pd.DataFrame.from_dict(data={
    'anyDuplicates' : data[data.duplicated()].any()
})

Unnamed: 0,anyDuplicates
id,False
gender,False
age,False
hypertension,False
heart_disease,False
ever_married,False
work_type,False
Residence_type,False
avg_glucose_level,False
bmi,False


It seems there are no duplicates in the data, so we can move on.
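
As an aside, a more direct check is to count the rows flagged by `duplicated()`. The sketch below uses a small made-up frame (`toy` is a hypothetical name, not the competition data) to illustrate why: calling `.any()` on the filtered rows reports whether each column contains a truthy value, so a duplicated row made up of zeros would go unnoticed.

```python
import pandas as pd

# Hypothetical toy data: the second row duplicates the first, and both contain only zeros.
toy = pd.DataFrame({
    'hypertension': [0, 0, 1],
    'heart_disease': [0, 0, 0],
})

# Column-wise .any() on the duplicated rows only sees zeros, so every column reports False.
any_per_column = toy[toy.duplicated()].any()

# Counting the flagged rows gives the number of duplicated rows directly.
n_duplicates = int(toy.duplicated().sum())

print(any_per_column.to_dict())
print(n_duplicates)
```

On the stroke data, the same count would be obtained with `data.duplicated().sum()`.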

Now, let's check the data type each column contains.

In [7]:
pd.DataFrame.from_dict(data={
    'dataType' : data.dtypes
})

Unnamed: 0,dataType
id,int64
gender,object
age,float64
hypertension,int64
heart_disease,int64
ever_married,object
work_type,object
Residence_type,object
avg_glucose_level,float64
bmi,float64


In our baseline model, we'll only use bmi as our independent/explanatory variable. In your model, you can use any combination of the features in this data set that you'd like. Just make sure you don't overfit the model on the training data, or it'll perform very poorly on the testing data.

**Note: If you want to use a categorical feature of the data in your model, such as gender, you need to encode the feature's values first (more on this [here](https://www.askpython.com/python/examples/label-encoding)).**

In [8]:
X = data[['bmi']]
X

Unnamed: 0,bmi
0,36.6
1,32.5
2,34.4
3,24.0
4,29.0
...,...
4904,18.6
4905,40.0
4906,30.6
4907,25.6
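
If you'd like to use a categorical feature such as gender, one possible way to label-encode it with pandas alone is sketched below. The `sample` frame is hypothetical and only mimics two of the data set's columns; the approach assumes that plain integer codes are acceptable for your model.

```python
import pandas as pd

# Hypothetical sample rows mimicking two of the data set's categorical columns.
sample = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male'],
    'ever_married': ['Yes', 'No', 'Yes'],
})

encoded = sample.copy()
for column in ['gender', 'ever_married']:
    # Convert to a categorical dtype and keep the integer code of each category.
    # Codes follow the sorted category order, e.g. Female -> 0, Male -> 1.
    encoded[column] = encoded[column].astype('category').cat.codes

print(encoded)
```

For features with more than two unordered categories, such as work_type, one-hot encoding with `pd.get_dummies` may be a better fit, since integer codes imply an ordering that doesn't exist.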


We should also store the dependent/target variable in its own Python variable. In this case, the dependent/target variable is the stroke feature.

In [9]:
y = data['stroke']
pd.DataFrame(y)

Unnamed: 0,stroke
0,1
1,1
2,1
3,1
4,1
...,...
4904,0
4905,0
4906,0
4907,0


Before we do anything else, we need to split the data into training and testing sets.

The training set will be used to "train" the model we use; this essentially means that the model adjusts its parameters so that it performs well on the training data. Think of it like buying a suit or dress that doesn't fit: for it to fit you better, you need to have it tailored to your specific body shape. This is a good analogy for how the training set is used to tailor the model to better fit the data.

The testing set is used for evaluating the model after we train it using the training set. Therefore, your evaluation metrics should be calculated using the performance of the model on the testing set, **NOT** the training set. **Always make sure you train your model *before* testing it with the testing set; a model that hasn't been fitted can't generate predictions!**

When splitting the data using the train_test_split function from scikit-learn, make sure to define the "random_state" parameter, or the data will be split differently each time you execute the code in the cell.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.25)
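
To illustrate what "random_state" does, here is a small sketch on made-up data (`X_toy` and `y_toy` are hypothetical names). It checks that two calls with the same seed return identical splits. It also shows the optional `stratify` argument, which this notebook doesn't use, but which keeps the class proportions similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 40 observations with one feature each, 20% positive.
X_toy = np.arange(40).reshape(-1, 1)
y_toy = np.array([0] * 32 + [1] * 8)

# Two calls with the same random_state produce the same split.
first = train_test_split(X_toy, y_toy, random_state=42, test_size=0.25)
second = train_test_split(X_toy, y_toy, random_state=42, test_size=0.25)
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))

# Optional: stratify=y_toy preserves the share of positives in both sets.
_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=42, test_size=0.25, stratify=y_toy
)

print(same_split, y_tr.mean(), y_te.mean())
```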

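Now, we'll create a baseline classification model using the Logistic Regression algorithm. Note: whenever you're initializing a model, it's a good habit to define the "random_state" parameter, because models that rely on randomness can otherwise perform differently every time. (With scikit-learn's default solver, LogisticRegression doesn't use it, but setting it does no harm.)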

**This baseline model is what you're trying to beat with your own models!**

In [11]:
baseline = LogisticRegression(random_state=42)

Now, we're going to train the baseline model using the training data.

In [12]:
baseline.fit(X=X_train, y=y_train.values.ravel())

LogisticRegression(random_state=42)

Now, we'll use the independent/explanatory variable data in the testing set to generate predictions for the dependent/target variable values in the testing set.

In [13]:
y_pred = baseline.predict(X=X_test)
pd.DataFrame.from_dict(data={
    'prediction' : y_pred
})

Unnamed: 0,prediction
0,0
1,0
2,0
3,0
4,0
...,...
1223,0
1224,0
1225,0
1226,0


Before we evaluate our model using these generated predictions, we should make sure the predictions aren't one-sided, meaning all 0's or all 1's.

If this is the case, then the data we trained the model on might've been massively unbalanced, meaning there were far more instances of one case type than the other.

In [14]:
pd.DataFrame.from_dict(data={
    'count' : pd.Series(y_pred).value_counts()
})

Unnamed: 0,count
0,1228


It looks like our baseline model predicted all observations were negative, i.e., no stroke. Thus, our suspicion that our data is unbalanced may be true. Let's check how many values in the overall target data (training and testing) are positive/1 (stroke) versus negative/0 (no stroke).

In [15]:
pd.DataFrame.from_dict(data={
    'count' : y.value_counts()
})

Unnamed: 0,count
0,4700
1,209


The data is indeed massively unbalanced:
  - 4,700 negative (non-stroke) cases.
  - 209 positive (stroke) cases.
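
To put these counts in perspective, the short calculation below works out the share of positive cases and how a model that always predicts "no stroke" would score on the overall data. This is only illustrative arithmetic on the two counts above; the variable names are made up for this example.

```python
# Class counts taken from the value_counts() output above.
n_negative = 4700
n_positive = 209
n_total = n_negative + n_positive

# Share of positive (stroke) cases in the data.
positive_share = n_positive / n_total

# A model that always predicts "no stroke" gets every negative case right
# and every positive case wrong.
accuracy_all_negative = n_negative / n_total
recall_all_negative = 0 / n_positive

print(f'positive share: {positive_share:.3f}')
print(f'accuracy of always predicting 0: {accuracy_all_negative:.3f}')
print(f'recall of always predicting 0: {recall_all_negative:.3f}')
```

Such a model looks accurate while finding none of the stroke cases, which is why accuracy alone is misleading on unbalanced data.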

We should try to balance out the data so that there are about as many positive cases as there are negative cases. We'll use a method called SMOTE to do this: SMOTE generates synthetic data so that a data set is more evenly balanced between positive and negative target variable cases.

In [16]:
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
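
For intuition about what happens inside SMOTE, here is a heavily simplified one-feature sketch of the interpolation idea. It is not imblearn's implementation: `synthesize_1d`, its parameters, and the sample bmi values are all hypothetical and serve only to illustrate how a synthetic point is placed between a minority observation and one of its nearest minority neighbours.

```python
import random

def synthesize_1d(minority_values, n_new, k=2, seed=42):
    """Create n_new synthetic values by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority observation.
        base = rng.choice(minority_values)
        # Find its k nearest minority neighbours (excluding itself).
        others = [v for v in minority_values if v != base]
        neighbours = sorted(others, key=lambda v: abs(v - base))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point at a random position between the two.
        gap = rng.random()
        synthetic.append(base + gap * (neighbour - base))
    return synthetic

# Hypothetical bmi values of minority-class (stroke) patients.
minority_bmi = [24.0, 29.0, 32.5, 34.4, 36.6]
new_points = synthesize_1d(minority_bmi, n_new=5)
print(new_points)
```

Every synthetic value lies between two real minority values, which is why the resampled bmi column contains values with many decimal places. Note that a common variation is to resample only the training split, so that the testing set contains only real observations; this notebook resamples the full data before splitting.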

Now, let's make sure the data has been balanced using the SMOTE resampling method.

First, we'll view the resampled X data, and then we'll view the resampled y data.

In [17]:
X_res

Unnamed: 0,bmi
0,36.600000
1,32.500000
2,34.400000
3,24.000000
4,29.000000
...,...
9395,32.654579
9396,31.389270
9397,38.823629
9398,30.985245


In [18]:
pd.DataFrame(y_res)

Unnamed: 0,stroke
0,1
1,1
2,1
3,1
4,1
...,...
9395,1
9396,1
9397,1
9398,1


The data was indeed balanced correctly, so now we can redefine the training and testing data sets using train_test_split (again, make sure to define the "random_state" parameter, preferably using the same value as we did previously).

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, random_state=42)

Next, we should re-initialize the baseline Logistic Regression model (once again defining the "random_state" parameter, preferably using the same value as we did previously).

In [20]:
baseline = LogisticRegression(random_state=42)

Now, we'll train our new baseline model using the SMOTE-resampled training data.

In [21]:
baseline.fit(X_train, y_train.values.ravel())

LogisticRegression(random_state=42)

Next, we'll use the SMOTE-resampled testing data to generate predictions.

In [22]:
y_pred = baseline.predict(X_test)
pd.DataFrame.from_dict(data={
    'prediction' : y_pred
})

Unnamed: 0,prediction
0,1
1,0
2,1
3,0
4,0
...,...
2345,0
2346,0
2347,0
2348,0


For this competition, we'll be using the recall score as our evaluation metric. Recall measures the proportion of the observations that actually are positive that the model correctly classifies as positive; it is calculated using the formula

$$recall = \frac{TP}{TP + FN},$$

where
  - $TP:$ the number of observations correctly classified as positive cases.
  - $FN:$ the number of observations incorrectly classified as negative cases.

For this specific data set, $TP$ is the number of times the classifier predicted a patient had a stroke when the patient actually had a stroke, while $FN$ is the number of times the classifier predicted a patient **didn't** have a stroke when the patient did in fact have a stroke. Thus, we can see that $TP + FN$ would be the total number of patients who did have a stroke.
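
As a small worked example of the formula, the sketch below counts $TP$ and $FN$ by hand on made-up labels. The helper `recall_from_labels` and the toy lists are hypothetical; on real data, `recall_score` from scikit-learn does this for you.

```python
def recall_from_labels(y_true, y_pred):
    """Compute recall = TP / (TP + FN) for binary labels where 1 is positive.

    Assumes y_true contains at least one positive case.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

# Hypothetical labels: 4 patients had a stroke, and the model catches 3 of them.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 1, 0, 0]

print(recall_from_labels(y_true_toy, y_pred_toy))  # 3 / (3 + 1) = 0.75
```

Notice that the false positive in the sixth position doesn't affect recall at all; only the actual stroke cases enter the calculation.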

We're using recall over the other metrics we could've chosen because, when a model is meant to determine something as important as whether a patient suffered a stroke, the emphasis should be on correctly classifying the actual positive instances so that they can be dealt with appropriately. In this case, it's much more important to correctly classify a stroke patient as such so that the effects of the stroke can be treated as soon as possible, potentially fast enough to prevent the long-term brain damage a stroke might cause if left untreated.

**YOUR GOAL IS TO BEAT THE RECALL SCORE OF THIS BASELINE MODEL!!!**

In [23]:
recall = recall_score(y_test, y_pred)
print(f'This is the recall score you\'re trying to beat: {recall}')

This is the recall score you're trying to beat: 0.5050590219224284


Therefore,
$$recall_{goal} > 0.5050590219224284$$ 

In other words, your goal is to obtain a recall score greater than the above value. 

**Please make sure you use the testing data when calculating your recall score, as testing data is solely meant for evaluation!**