# 1. Background information on the dataset

## Features of Dataset

1. **Gender**: (0: 'Female', 1: 'Male')
2. **Age**: Patient's age in years.
3. **Hypertension**: (0: No history of hypertension, 1: History of hypertension)
4. **Heart Disease**: (0: No history of heart disease, 1: History of heart disease)
5. **Ever Married**: (0: Patient has not been married before, 1: Patient has been married before)
6. **Work Type**: (0: 'Govt_job', 1: 'Never_worked', 2: 'Private', 3: 'Self-employed', 4: 'children')
7. **Residence Type**: (0: 'Rural', 1: 'Urban')
8. **Average Glucose Level**: Numeric data representing the patient's average glucose level.
9. **BMI**: Numeric data representing Body Mass Index.
10. **Smoking Status**: (0: 'Unknown', 1: 'formerly smoked', 2: 'never smoked', 3: 'smokes')

Before data cleaning, the dataset had 5110 rows. We removed the 201 rows with "NaN" values under "BMI" and the single row with the anomalous value "Other" under "Gender", which leaves 4908 rows after data cleaning.

## Objective

Our primary goal is to develop a predictive model that accurately identifies individuals at risk of experiencing a stroke based on various demographic, lifestyle, and health-related factors provided in the dataset.

# 2. Libraries and packages

## Tools and Libraries

For this project, we use Python with the following main libraries for data analysis, machine learning, and data visualization:

- **Pandas**: Used for data manipulation and analysis.
- **Matplotlib** and **Seaborn**: Employed for data visualization tasks.
- **Scikit-learn**: Utilized for various machine learning tasks such as data preprocessing, model training, and evaluation.
- **Imbalanced-learn**: Employed for addressing class imbalance issues within the dataset.
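
Class imbalance means that one outcome is much rarer than the other. As a minimal sketch, the share of each class can be checked with pandas before deciding whether resampling is needed. The toy `stroke` series below is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy target column: 1 = stroke, 0 = no stroke
stroke = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name="stroke")

# Share of each class; a heavily skewed split signals class imbalance
class_share = stroke.value_counts(normalize=True)
print(class_share)

# Ratio of the majority class share to the minority class share
imbalance_ratio = class_share.max() / class_share.min()
print("Imbalance ratio:", round(imbalance_ratio, 1))
```

A large ratio suggests that plain accuracy will be a misleading metric, since a model that always predicts the majority class would still score well.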


In [1]:
# Import necessary packages
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt

# Set seaborn style
sb.set()

# Import models and tools from Scikit-Learn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# 3. Data preparation and cleaning

### Step 1: Import the csv file

In [2]:
# Import the data set
sourcedata = pd.read_csv('healthcare-dataset-stroke-data.csv')
sourcedata.head()

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


In [3]:
sourcedata.tail()

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5105 | 18234 | Female | 80.0 | 1 | 0 | Yes | Private | Urban | 83.75 | NaN | never smoked | 0 |
| 5106 | 44873 | Female | 81.0 | 0 | 0 | Yes | Self-employed | Urban | 125.2 | 40.0 | never smoked | 0 |
| 5107 | 19723 | Female | 35.0 | 0 | 0 | Yes | Self-employed | Rural | 82.99 | 30.6 | never smoked | 0 |
| 5108 | 37544 | Male | 51.0 | 0 | 0 | Yes | Private | Rural | 166.29 | 25.6 | formerly smoked | 0 |
| 5109 | 44679 | Female | 44.0 | 0 | 0 | Yes | Govt_job | Urban | 85.28 | 26.2 | Unknown | 0 |


The first and last five rows show that some values in the `bmi` column are missing ("NaN"). Similarly, in the `smoking_status` column, some respondents did not provide their smoking history and are recorded as "Unknown".

For the missing `bmi` values, we remove the affected rows rather than imputing estimates, so that the analysis rests on recorded measurements rather than assumptions. The "Unknown" smoking entries are handled separately in Step 6.
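
To make the choice between dropping and imputing concrete, here is a minimal sketch on a toy frame modelled on a few rows of the dataset. The frame and the names `dropped` and `imputed` are illustrative, and median imputation is shown only for comparison; it is not the approach used in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing BMI value
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
})

# Option 1 (the approach taken here): drop rows where 'bmi' is missing
dropped = toy.dropna(subset=["bmi"])

# Option 2 (not used here): fill missing 'bmi' with the column median
imputed = toy.assign(bmi=toy["bmi"].fillna(toy["bmi"].median()))

print(len(dropped), "rows after dropping")
print(imputed["bmi"].tolist())
```

Dropping keeps only observed values at the cost of fewer rows, while imputation keeps every row but introduces an estimated value.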


### Step 2: Check the data type of the factors in the dataset

In [4]:
sourcedata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [5]:
sourcedata.describe()

| | id | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke |
|---|---|---|---|---|---|---|---|
| count | 5110.0 | 5110.0 | 5110.0 | 5110.0 | 5110.0 | 4909.0 | 5110.0 |
| mean | 36517.829354 | 43.226614 | 0.097456 | 0.054012 | 106.147677 | 28.893237 | 0.048728 |
| std | 21161.721625 | 22.612647 | 0.296607 | 0.226063 | 45.28356 | 7.854067 | 0.21532 |
| min | 67.0 | 0.08 | 0.0 | 0.0 | 55.12 | 10.3 | 0.0 |
| 25% | 17741.25 | 25.0 | 0.0 | 0.0 | 77.245 | 23.5 | 0.0 |
| 50% | 36932.0 | 45.0 | 0.0 | 0.0 | 91.885 | 28.1 | 0.0 |
| 75% | 54682.0 | 61.0 | 0.0 | 0.0 | 114.09 | 33.1 | 0.0 |
| max | 72940.0 | 82.0 | 1.0 | 1.0 | 271.74 | 97.6 | 1.0 |


### Step 3: Check whether there are any NaN values in the csv file

In [6]:
# Count the number of NaN values in each column
sourcedata.isnull().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

In [7]:
# Remove the NaN values in 'bmi'
sourcedata.dropna(subset = ['bmi'], inplace=True)
sourcedata.head(25)

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |
| 5 | 56669 | Male | 81.0 | 0 | 0 | Yes | Private | Urban | 186.21 | 29.0 | formerly smoked | 1 |
| 6 | 53882 | Male | 74.0 | 1 | 1 | Yes | Private | Rural | 70.09 | 27.4 | never smoked | 1 |
| 7 | 10434 | Female | 69.0 | 0 | 0 | No | Private | Urban | 94.39 | 22.8 | never smoked | 1 |
| 9 | 60491 | Female | 78.0 | 0 | 0 | Yes | Private | Urban | 58.57 | 24.2 | Unknown | 1 |
| 10 | 12109 | Female | 81.0 | 1 | 0 | Yes | Private | Rural | 80.43 | 29.7 | never smoked | 1 |
| 11 | 12095 | Female | 61.0 | 0 | 1 | Yes | Govt_job | Rural | 120.46 | 36.8 | smokes | 1 |


### Step 4: Check to ensure there is no NaN value after data cleaning

In [8]:
# After removing the null values under 'bmi'
# Check to ensure that the dataset does not contain any NaN values after data cleaning
sourcedata.isnull().sum()

id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

In [9]:
sourcedata.describe()

| | id | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke |
|---|---|---|---|---|---|---|---|
| count | 4909.0 | 4909.0 | 4909.0 | 4909.0 | 4909.0 | 4909.0 | 4909.0 |
| mean | 37064.313506 | 42.865374 | 0.091872 | 0.049501 | 105.30515 | 28.893237 | 0.042575 |
| std | 20995.098457 | 22.555115 | 0.288875 | 0.216934 | 44.424341 | 7.854067 | 0.201917 |
| min | 77.0 | 0.08 | 0.0 | 0.0 | 55.12 | 10.3 | 0.0 |
| 25% | 18605.0 | 25.0 | 0.0 | 0.0 | 77.07 | 23.5 | 0.0 |
| 50% | 37608.0 | 44.0 | 0.0 | 0.0 | 91.68 | 28.1 | 0.0 |
| 75% | 55220.0 | 60.0 | 0.0 | 0.0 | 113.57 | 33.1 | 0.0 |
| max | 72940.0 | 82.0 | 1.0 | 1.0 | 271.74 | 97.6 | 1.0 |


In [10]:
sourcedata.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4909 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 4909 non-null   int64  
 1   gender             4909 non-null   object 
 2   age                4909 non-null   float64
 3   hypertension       4909 non-null   int64  
 4   heart_disease      4909 non-null   int64  
 5   ever_married       4909 non-null   object 
 6   work_type          4909 non-null   object 
 7   Residence_type     4909 non-null   object 
 8   avg_glucose_level  4909 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     4909 non-null   object 
 11  stroke             4909 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 498.6+ KB


After removing the null values under `bmi`, the total number of rows has dropped from 5110 to 4909: the 201 rows with a `bmi` value of "NaN" have been dropped.

### Step 5: Check for duplicates

In [11]:
# Check for duplicate rows
duplicate_rows = sourcedata.duplicated()

# Count the number of duplicate rows
num_duplicate_rows = duplicate_rows.sum()

# Print the number of duplicate rows
print("Number of duplicate rows:", num_duplicate_rows)

Number of duplicate rows: 0


There are no duplicates in our dataset.
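
One caveat worth noting: a full-row check compares every column, including `id`. If each row carries a distinct `id`, two otherwise identical records would not be flagged. The sketch below illustrates this on a hypothetical toy frame; the values and the names `full_row_dupes` and `feature_dupes` are illustrative.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 match on everything except 'id'
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [67.0, 61.0, 67.0],
    "bmi": [36.6, 30.1, 36.6],
})

# Every row has a distinct 'id', so no full-row duplicates are found
full_row_dupes = toy.duplicated().sum()

# Ignoring 'id' reveals the repeated record
feature_dupes = toy.drop(columns=["id"]).duplicated().sum()

print(full_row_dupes, feature_dupes)
```

Running the same comparison without the identifier column is a cheap extra check before concluding that no records are repeated.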

### Step 6: Check for rows with unknown information

In [12]:
# Count the number of instances of 'Unknown' in 'smoking_status'
unknown_smoking_status_count = (sourcedata['smoking_status'] == 'Unknown').sum()

# Print the count
print("Number of 'Unknown' instances in 'smoking_status':", unknown_smoking_status_count)

Number of 'Unknown' instances in 'smoking_status': 1483


There are 1483 instances of "Unknown" in the `smoking_status` column, roughly 30% of the 4909 rows remaining at this point. Dropping these rows would result in substantial data loss.

Instead, we treat "Unknown" as a distinct category that represents individuals whose smoking information is unavailable, rather than as missing data.
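
Keeping "Unknown" as its own category means it can be summarised like any other group. The following is a minimal sketch on made-up data; the toy frame and the names `share` and `stroke_rate` are illustrative and do not reflect the real dataset.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only
toy = pd.DataFrame({
    "smoking_status": ["Unknown", "smokes", "never smoked",
                       "Unknown", "formerly smoked", "never smoked"],
    "stroke": [0, 1, 0, 1, 0, 0],
})

# Share of rows in each smoking category, 'Unknown' included
share = toy["smoking_status"].value_counts(normalize=True)

# Stroke rate per category; 'Unknown' is kept as its own group
stroke_rate = toy.groupby("smoking_status")["stroke"].mean()

print(share)
print(stroke_rate)
```

If the "Unknown" group behaves very differently from the others, that is a hint that the missing information is not random and deserves a closer look.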

### Step 7: Remove rows with gender "Other"

In [13]:
# Count the number of instances of 'Other' in 'gender'
other_gender_count = (sourcedata['gender'] == 'Other').sum()

# Print the count
print("Number of 'Other' instances in 'gender':", other_gender_count)

Number of 'Other' instances in 'gender': 1


Since there is only 1 instance of "Other" for `gender`, we will drop the row as it is unlikely to have a significant impact on our analysis.

In [14]:
# Keep only rows whose gender is not 'Other'; copy so later column assignments act on an independent frame
sourcedata = sourcedata[sourcedata["gender"] != "Other"].copy()
sourcedata.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4908 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 4908 non-null   int64  
 1   gender             4908 non-null   object 
 2   age                4908 non-null   float64
 3   hypertension       4908 non-null   int64  
 4   heart_disease      4908 non-null   int64  
 5   ever_married       4908 non-null   object 
 6   work_type          4908 non-null   object 
 7   Residence_type     4908 non-null   object 
 8   avg_glucose_level  4908 non-null   float64
 9   bmi                4908 non-null   float64
 10  smoking_status     4908 non-null   object 
 11  stroke             4908 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 498.5+ KB


In [15]:
sourcedata.describe()

| | id | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke |
|---|---|---|---|---|---|---|---|
| count | 4908.0 | 4908.0 | 4908.0 | 4908.0 | 4908.0 | 4908.0 | 4908.0 |
| mean | 37060.423594 | 42.86881 | 0.091891 | 0.049511 | 105.297402 | 28.89456 | 0.042584 |
| std | 20995.468407 | 22.556128 | 0.288901 | 0.216954 | 44.42555 | 7.85432 | 0.201937 |
| min | 77.0 | 0.08 | 0.0 | 0.0 | 55.12 | 10.3 | 0.0 |
| 25% | 18602.5 | 25.0 | 0.0 | 0.0 | 77.0675 | 23.5 | 0.0 |
| 50% | 37580.5 | 44.0 | 0.0 | 0.0 | 91.68 | 28.1 | 0.0 |
| 75% | 55181.75 | 60.0 | 0.0 | 0.0 | 113.495 | 33.1 | 0.0 |
| max | 72940.0 | 82.0 | 1.0 | 1.0 | 271.74 | 97.6 | 1.0 |


After removing the row with the value "Other" under the `gender` column, the total number of rows has dropped from 4909 to 4908.

We dropped this row because gender is a key variable for our analysis. Removing the single outlying entry keeps the analysis focused on categories with enough observations to support reliable results.
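
Rows can be filtered out either by substring match or by exact comparison, and the two differ when one label contains another. The sketch below shows the difference on a toy column; the label "Other/unspecified" is invented for illustration and does not appear in the dataset.

```python
import pandas as pd

# Hypothetical toy column; 'Other/unspecified' is an invented label
gender = pd.Series(["Male", "Female", "Other", "Other/unspecified"])

# Substring match: flags every label that contains 'Other'
substring_mask = gender.str.contains("Other")

# Exact match: flags only the label 'Other'
exact_mask = gender == "Other"

kept_by_substring = gender[~substring_mask].tolist()
kept_by_exact = gender[~exact_mask].tolist()

print(kept_by_substring)
print(kept_by_exact)
```

For a column whose only labels are "Male", "Female", and "Other", both filters remove the same rows; the exact comparison is simply the stricter and more explicit choice.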

### Step 7: Remove irrelevant columns

In [16]:
sourcedata = sourcedata.drop(columns=['id'])

As the `id` column does not affect or benefit our Exploratory Data Analysis (EDA) process, we have decided to drop it.

### Step 8: Convert the data type to category

In [17]:
sourcedata['gender']= sourcedata['gender'].astype('category')

In [18]:
sourcedata['hypertension']= sourcedata['hypertension'].astype('category')

In [19]:
sourcedata['heart_disease']= sourcedata['heart_disease'].astype('category')

In [20]:
sourcedata['ever_married']= sourcedata['ever_married'].astype('category')

In [21]:
sourcedata['work_type']= sourcedata['work_type'].astype('category')

In [22]:
sourcedata['Residence_type']= sourcedata['Residence_type'].astype('category')

In [23]:
sourcedata['smoking_status']= sourcedata['smoking_status'].astype('category')

In [24]:
sourcedata['stroke']= sourcedata['stroke'].astype('category')
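
The per-column conversions above can also be written as a single call. This is a minimal sketch on a hypothetical toy frame that reuses a few of the dataset's column names; `categorical_cols` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy frame using a few of the dataset's column names
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "hypertension": [0, 1, 0],
    "age": [67.0, 61.0, 80.0],
})

# Columns to treat as categorical; 'age' stays numeric
categorical_cols = ["gender", "hypertension"]

# Convert all listed columns in one step by passing a column-to-dtype mapping
toy = toy.astype({col: "category" for col in categorical_cols})

print(toy.dtypes)
```

Keeping the column names in one list makes it easier to see at a glance which features are treated as categorical.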

### Step 9: Convert categorical columns into numerical columns

In [25]:
# Instantiate LabelEncoder
label_encoder = LabelEncoder()

In [26]:
# Fit and transform the 'gender' column
sourcedata['gender'] = label_encoder.fit_transform(sourcedata['gender'])

mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Encoded Mapping:", mapping)

Encoded Mapping: {'Female': 0, 'Male': 1}


In [27]:
# Fit and transform the 'ever_married' column
sourcedata['ever_married'] = label_encoder.fit_transform(sourcedata['ever_married'])

mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Encoded Mapping:", mapping)

Encoded Mapping: {'No': 0, 'Yes': 1}


In [28]:
# Fit and transform the 'work_type' column
sourcedata['work_type'] = label_encoder.fit_transform(sourcedata['work_type'])

mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Encoded Mapping:", mapping)

Encoded Mapping: {'Govt_job': 0, 'Never_worked': 1, 'Private': 2, 'Self-employed': 3, 'children': 4}


In [29]:
# Fit and transform the 'Residence_type' column
sourcedata['Residence_type'] = label_encoder.fit_transform(sourcedata['Residence_type'])

mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Encoded Mapping:", mapping)

Encoded Mapping: {'Rural': 0, 'Urban': 1}


In [30]:
# Fit and transform the 'smoking_status' column
sourcedata['smoking_status'] = label_encoder.fit_transform(sourcedata['smoking_status'])

mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Encoded Mapping:", mapping)

Encoded Mapping: {'Unknown': 0, 'formerly smoked': 1, 'never smoked': 2, 'smokes': 3}
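
The integer codes above follow the sorted order of the labels, which is arbitrary for nominal features such as `work_type`. As a minimal sketch, the loop below builds the same kind of mapping for several columns at once and keeps each one in a dictionary for later reference. It uses pandas category codes instead of `LabelEncoder`, on the assumption that both sort string labels the same way; the toy frame and the `mappings` name are illustrative.

```python
import pandas as pd

# Hypothetical toy frame using two of the dataset's column names
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "work_type": ["Private", "children", "Govt_job"],
})

mappings = {}
for col in ["gender", "work_type"]:
    # Categories are sorted, so codes follow the sorted label order
    cat = toy[col].astype("category")
    mappings[col] = {label: code for code, label in enumerate(cat.cat.categories)}
    # Replace the text labels with their integer codes
    toy[col] = cat.cat.codes

print(mappings)
print(toy)
```

Storing the mappings makes it possible to translate codes back to labels when interpreting plots or model output.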


### Step 10: Save the cleaned dataframe back to the csv file

In [31]:
sourcedata.to_csv('cleaned_data.csv', index=False)

#### Moving forward, we will use this cleaned dataframe, saved as "cleaned_data.csv".
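
One thing to keep in mind when reloading a saved CSV: the file stores plain text, so the `category` data type is not preserved. The sketch below demonstrates this with an in-memory buffer in place of a file on disk; the toy frame and variable names are illustrative.

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for the cleaned dataframe
toy = pd.DataFrame({"hypertension": [0, 1, 0], "bmi": [36.6, 32.5, 34.4]})
toy["hypertension"] = toy["hypertension"].astype("category")

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# After reloading, the column comes back as a plain integer dtype
reloaded = pd.read_csv(buffer)
dtype_after_reload = reloaded["hypertension"].dtype
print(dtype_after_reload)

# Restore the categorical dtype explicitly
reloaded["hypertension"] = reloaded["hypertension"].astype("category")
print(reloaded.dtypes)
```

Any notebook that reads the cleaned file will therefore need to reapply the category conversion for the columns where it matters.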