# 1. Background information on the dataset

Features of dataset:

1. gender: ('Female': 0, 'Male': 1)
2. age: (patient's age in years)
3. hypertension: (0 represents no history of hypertension & 1 represents history of hypertension)
4. heart_disease: (0 represents no history of heart disease & 1 represents history of heart disease)
5. ever_married: ('No': 0, 'Yes': 1)
6. work_type: ('Govt_job': 0, 'Never_worked': 1, 'Private': 2, 'Self-employed': 3, 'children': 4)
7. Residence_type: ('Rural': 0, 'Urban': 1)
8. avg_glucose_level: (numeric data to represent the average patient's glucose level)
9. bmi: (numeric data to represent Body Mass Index)
10. smoking_status: ('formerly smoked': 0, 'never smoked': 1, 'smokes': 2)

#### The raw dataset has 5110 rows, but it contains "Unknown" entries under "smoking_status", "NaN" values under "bmi", and one anomalous "Other" entry under "gender". After data cleaning, we are left with 3425 rows of data.

# 2. Libraries and packages

In [1]:
# Import general packages - numpy, pandas, seaborn, matplotlib
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
sb.set()  # Apply seaborn's default plot theme

# Import LinearRegression model from Scikit-Learn
from sklearn.linear_model import LinearRegression

# Create a Linear Regression object
linreg = LinearRegression()

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

# Encode target labels
from sklearn.preprocessing import LabelEncoder

# 3. Data preparation and cleaning

### Step 1: Import the csv file

In [2]:
# Import the data set
sourcedata = pd.read_csv('healthcare-dataset-stroke-data.csv')
sourcedata.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [3]:
sourcedata.tail()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.2,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0
5109,44679,Female,44.0,0,0,Yes,Govt_job,Urban,85.28,26.2,Unknown,0


#### Observing the first and last 5 rows of the dataset, we can see that some values are missing. For example, some values under the "bmi" column are empty and shown as "NaN". To handle these missing values, we will remove the rows that contain them. We will not fill them with estimated values, as doing so risks compromising the integrity of the data: we would then be working from assumptions rather than actual observations.

#### Also, under the "smoking_status" column, some survey respondents did not indicate their smoking history, and their value is recorded as "Unknown". We will also remove these rows from the dataset.
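
The cleaning plan described above can be sketched on a small stand-in table. The `toy` frame and its values are made up for illustration and are not taken from the survey; only the column names match the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the stroke dataset (values are illustrative only).
toy = pd.DataFrame({
    "bmi": [36.6, np.nan, 32.5, 24.2],
    "smoking_status": ["formerly smoked", "never smoked", "never smoked", "Unknown"],
})

# Count the problem rows before deciding to drop them.
n_missing_bmi = int(toy["bmi"].isna().sum())
n_unknown_smoking = int((toy["smoking_status"] == "Unknown").sum())

# Drop rows with a missing bmi, then rows with an unknown smoking status.
cleaned = toy.dropna(subset=["bmi"])
cleaned = cleaned[cleaned["smoking_status"] != "Unknown"]

print(n_missing_bmi, n_unknown_smoking, len(cleaned))
```

Counting first makes it easy to see how much data each rule removes before committing to it.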

### Step 2: Check the data type of the factors in the dataset

In [4]:
sourcedata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [5]:
sourcedata.describe()

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
count,5110.0,5110.0,5110.0,5110.0,5110.0,4909.0,5110.0
mean,36517.829354,43.226614,0.097456,0.054012,106.147677,28.893237,0.048728
std,21161.721625,22.612647,0.296607,0.226063,45.28356,7.854067,0.21532
min,67.0,0.08,0.0,0.0,55.12,10.3,0.0
25%,17741.25,25.0,0.0,0.0,77.245,23.5,0.0
50%,36932.0,45.0,0.0,0.0,91.885,28.1,0.0
75%,54682.0,61.0,0.0,0.0,114.09,33.1,0.0
max,72940.0,82.0,1.0,1.0,271.74,97.6,1.0


### Step 3: Check whether there are any NaN values in the csv file

In [6]:
# Count the number of NaN values in each column
sourcedata.isnull().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

In [7]:
# Remove the NaN values in 'bmi'
sourcedata.dropna(subset = ['bmi'], inplace=True)
sourcedata.head(25)

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
5,56669,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1
6,53882,Male,74.0,1,1,Yes,Private,Rural,70.09,27.4,never smoked,1
7,10434,Female,69.0,0,0,No,Private,Urban,94.39,22.8,never smoked,1
9,60491,Female,78.0,0,0,Yes,Private,Urban,58.57,24.2,Unknown,1
10,12109,Female,81.0,1,0,Yes,Private,Rural,80.43,29.7,never smoked,1
11,12095,Female,61.0,0,1,Yes,Govt_job,Rural,120.46,36.8,smokes,1


### Step 4: Check to ensure there is no NaN value after data cleaning

In [8]:
# After removing the null values under 'bmi'
# Check to ensure that the dataset does not contain any NaN values after data cleaning
sourcedata.isnull().sum()

id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

In [9]:
sourcedata.describe()

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
count,4909.0,4909.0,4909.0,4909.0,4909.0,4909.0,4909.0
mean,37064.313506,42.865374,0.091872,0.049501,105.30515,28.893237,0.042575
std,20995.098457,22.555115,0.288875,0.216934,44.424341,7.854067,0.201917
min,77.0,0.08,0.0,0.0,55.12,10.3,0.0
25%,18605.0,25.0,0.0,0.0,77.07,23.5,0.0
50%,37608.0,44.0,0.0,0.0,91.68,28.1,0.0
75%,55220.0,60.0,0.0,0.0,113.57,33.1,0.0
max,72940.0,82.0,1.0,1.0,271.74,97.6,1.0


In [10]:
sourcedata.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4909 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 4909 non-null   int64  
 1   gender             4909 non-null   object 
 2   age                4909 non-null   float64
 3   hypertension       4909 non-null   int64  
 4   heart_disease      4909 non-null   int64  
 5   ever_married       4909 non-null   object 
 6   work_type          4909 non-null   object 
 7   Residence_type     4909 non-null   object 
 8   avg_glucose_level  4909 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     4909 non-null   object 
 11  stroke             4909 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 498.6+ KB


#### After removing the null values under "bmi", the total number of rows has dropped from 5110 to 4909: the 201 rows with a "bmi" value of "NaN" have been dropped.
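
Rather than comparing two `info()` outputs by eye, the number of dropped rows can be computed directly. The helper below is a minimal sketch; `drop_and_report` is a hypothetical name introduced here, and the toy values are illustrative.

```python
import numpy as np
import pandas as pd

def drop_and_report(df, column):
    """Drop rows where `column` is NaN and return (cleaned_df, rows_dropped)."""
    before = len(df)
    cleaned = df.dropna(subset=[column])
    return cleaned, before - len(cleaned)

# Toy frame with two missing bmi values.
toy = pd.DataFrame({"bmi": [36.6, np.nan, 32.5, np.nan, 24.0]})
cleaned, dropped = drop_and_report(toy, "bmi")
print(f"{dropped} rows dropped, {len(cleaned)} rows remain")
```

Returning a new frame instead of using `inplace=True` keeps the original available for comparison.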

### Step 5: Remove rows with unknown information

In [11]:
# Remove rows with the value 'Unknown' under the smoking_status column, as we will not be filling in unknown values with estimates.
sourcedata = sourcedata[sourcedata["smoking_status"] != "Unknown"]
sourcedata.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3426 entries, 0 to 5108
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 3426 non-null   int64  
 1   gender             3426 non-null   object 
 2   age                3426 non-null   float64
 3   hypertension       3426 non-null   int64  
 4   heart_disease      3426 non-null   int64  
 5   ever_married       3426 non-null   object 
 6   work_type          3426 non-null   object 
 7   Residence_type     3426 non-null   object 
 8   avg_glucose_level  3426 non-null   float64
 9   bmi                3426 non-null   float64
 10  smoking_status     3426 non-null   object 
 11  stroke             3426 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 348.0+ KB


In [12]:
sourcedata.describe()

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
count,3426.0,3426.0,3426.0,3426.0,3426.0,3426.0,3426.0
mean,37339.00613,48.645943,0.119089,0.060128,108.321891,30.290047,0.052539
std,21049.976345,18.851239,0.323941,0.237759,47.703541,7.295958,0.223145
min,84.0,10.0,0.0,0.0,55.12,11.5,0.0
25%,18997.5,34.0,0.0,0.0,77.2375,25.3,0.0
50%,38068.5,50.0,0.0,0.0,92.36,29.1,0.0
75%,55464.25,63.0,0.0,0.0,116.2075,34.1,0.0
max,72915.0,82.0,1.0,1.0,271.74,92.0,1.0


#### After removing rows with the value "Unknown" under the "smoking_status" column, the total number of rows has dropped from 4909 to 3426: 1483 rows with "Unknown" smoking status have been dropped. We drop these rows because an individual's smoking history is one of the key variables for our analysis. Dropping the unknown values ensures that our analysis is based on reliable and complete data, which maintains the integrity of our results.
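
One way to confirm a drop count like this is to tally the categories with `value_counts()` before filtering and compare the result with the change in row count. This is a sketch on made-up values, not the notebook's actual output.

```python
import pandas as pd

# Toy smoking_status column (values are illustrative only).
toy = pd.DataFrame({
    "smoking_status": ["never smoked", "Unknown", "smokes", "Unknown", "formerly smoked"],
})

# Tally each category, then look up how many rows are 'Unknown'.
counts = toy["smoking_status"].value_counts()
n_unknown = int(counts.get("Unknown", 0))

# Keep only rows whose status is not exactly 'Unknown'.
kept = toy[toy["smoking_status"] != "Unknown"]

# The number of rows removed should equal the tally.
assert len(toy) - len(kept) == n_unknown
print(n_unknown, len(kept))
```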

### Step 6: Remove rows with other gender

In [13]:
# Remove rows with the value 'Other' under the gender column
sourcedata = sourcedata[sourcedata["gender"] != "Other"]
sourcedata.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3425 entries, 0 to 5108
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 3425 non-null   int64  
 1   gender             3425 non-null   object 
 2   age                3425 non-null   float64
 3   hypertension       3425 non-null   int64  
 4   heart_disease      3425 non-null   int64  
 5   ever_married       3425 non-null   object 
 6   work_type          3425 non-null   object 
 7   Residence_type     3425 non-null   object 
 8   avg_glucose_level  3425 non-null   float64
 9   bmi                3425 non-null   float64
 10  smoking_status     3425 non-null   object 
 11  stroke             3425 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 347.9+ KB


In [14]:
sourcedata.describe()

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
count,3425.0,3425.0,3425.0,3425.0,3425.0,3425.0,3425.0
mean,37333.512117,48.652555,0.119124,0.060146,108.31167,30.29235,0.052555
std,21050.593185,18.850018,0.323982,0.237792,47.706754,7.295778,0.223175
min,84.0,10.0,0.0,0.0,55.12,11.5,0.0
25%,18986.0,34.0,0.0,0.0,77.23,25.3,0.0
50%,38067.0,50.0,0.0,0.0,92.35,29.1,0.0
75%,55459.0,63.0,0.0,0.0,116.2,34.1,0.0
max,72915.0,82.0,1.0,1.0,271.74,92.0,1.0


#### After removing rows with the value "Other" under the "gender" column, the total number of rows has dropped from 3426 to 3425: 1 row with "Other" as gender has been dropped. We drop this row because an individual's gender is one of the key variables for our analysis, and a single "Other" entry is too few to analyse as its own group. Dropping it ensures that our analysis is based on reliable and complete data, which maintains the integrity of our results.
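
An alternative to excluding one unwanted label is to keep only the labels we expect, using `isin`. This is a sketch of a different technique from the one used in the cell above, shown on made-up values; the `expected_genders` list is an assumption.

```python
import pandas as pd

# Toy gender column (values are illustrative only).
toy = pd.DataFrame({"gender": ["Male", "Female", "Other", "Female"]})

# Allow-list of the categories the analysis is designed for (assumed here).
expected_genders = ["Female", "Male"]
kept = toy[toy["gender"].isin(expected_genders)]

print(sorted(kept["gender"].unique()))
```

An allow-list also removes any other unexpected label that might appear in a future version of the data.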

### Step 7: Remove irrelevant columns

In [15]:
sourcedata = sourcedata.drop(columns=['id'])

#### As the "id" column does not affect or benefit our Exploratory Data Analysis (EDA) process, we have decided to drop it.

### Step 8: Convert the data type to category

In [16]:
sourcedata['gender']= sourcedata['gender'].astype('category')

In [17]:
sourcedata['hypertension']= sourcedata['hypertension'].astype('category')

In [18]:
sourcedata['heart_disease']= sourcedata['heart_disease'].astype('category')

In [19]:
sourcedata['ever_married']= sourcedata['ever_married'].astype('category')

In [20]:
sourcedata['work_type']= sourcedata['work_type'].astype('category')

In [21]:
sourcedata['Residence_type']= sourcedata['Residence_type'].astype('category')

In [22]:
sourcedata['smoking_status']= sourcedata['smoking_status'].astype('category')

In [23]:
sourcedata['stroke']= sourcedata['stroke'].astype('category')
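
The per-column conversions above could also be written as a single call by passing a dictionary to `DataFrame.astype`. The sketch below uses a toy frame with a subset of the columns; the values are illustrative.

```python
import pandas as pd

# Toy frame with a few of the dataset's columns (values are illustrative only).
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "hypertension": [0, 1, 0],
    "age": [67.0, 61.0, 80.0],
    "stroke": [1, 1, 0],
})

# Columns to treat as categorical; 'age' is left as a numeric column.
categorical_columns = ["gender", "hypertension", "stroke"]
toy = toy.astype({column: "category" for column in categorical_columns})

print(toy.dtypes)
```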

### Step 9: Convert categorical columns into numerical columns

In [24]:
# Instantiate LabelEncoder
label_encoder = LabelEncoder()

In [25]:
# Fit and transform the 'gender' column
sourcedata['gender'] = label_encoder.fit_transform(sourcedata['gender'])

mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Encoded Mapping:", mapping)

Encoded Mapping: {'Female': 0, 'Male': 1}


In [26]:
# Fit and transform the 'ever_married' column
sourcedata['ever_married'] = label_encoder.fit_transform(sourcedata['ever_married'])

mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Encoded Mapping:", mapping)

Encoded Mapping: {'No': 0, 'Yes': 1}


In [27]:
# Fit and transform the 'work_type' column
sourcedata['work_type'] = label_encoder.fit_transform(sourcedata['work_type'])

mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Encoded Mapping:", mapping)

Encoded Mapping: {'Govt_job': 0, 'Never_worked': 1, 'Private': 2, 'Self-employed': 3, 'children': 4}


In [28]:
# Fit and transform the 'Residence_type' column
sourcedata['Residence_type'] = label_encoder.fit_transform(sourcedata['Residence_type'])

mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Encoded Mapping:", mapping)

Encoded Mapping: {'Rural': 0, 'Urban': 1}


In [29]:
# Fit and transform the 'smoking_status' column
sourcedata['smoking_status'] = label_encoder.fit_transform(sourcedata['smoking_status'])

mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Encoded Mapping:", mapping)

Encoded Mapping: {'formerly smoked': 0, 'never smoked': 1, 'smokes': 2}
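
The five encoding cells repeat the same pattern, so they could be folded into one helper that fits a fresh `LabelEncoder` per column and records each mapping. This is a minimal sketch; `encode_columns` is a hypothetical helper name and the toy values are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_columns(df, columns):
    """Label-encode each listed column; return (encoded_df, {column: {label: code}})."""
    encoded = df.copy()
    mappings = {}
    for column in columns:
        encoder = LabelEncoder()
        encoded[column] = encoder.fit_transform(encoded[column])
        # classes_ is sorted, and a label's position in it is its integer code.
        mappings[column] = {label: code for code, label in enumerate(encoder.classes_)}
    return encoded, mappings

toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "Residence_type": ["Urban", "Rural", "Urban"],
})
encoded, mappings = encode_columns(toy, ["gender", "Residence_type"])
print(mappings)
```

Keeping the mappings in one dictionary makes it easier to translate the integer codes back into labels when interpreting results later.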


### Step 10: Save the cleaned dataframe back to the csv file

In [30]:
sourcedata.to_csv('cleaned_data.csv', index=False)
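
One thing to keep in mind when reloading the cleaned file: CSV stores no dtype information, so columns converted to `category` come back as plain integers unless the dtype is passed to `read_csv`. The sketch below shows this with an in-memory buffer instead of a file; the toy values are illustrative.

```python
import io
import pandas as pd

# Toy frame with one categorical column (values are illustrative only).
toy = pd.DataFrame({"hypertension": [0, 1, 0], "bmi": [36.6, 32.5, 24.0]})
toy["hypertension"] = toy["hypertension"].astype("category")

# Write to an in-memory buffer, mirroring to_csv(..., index=False).
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Plain reload: the category dtype is not preserved.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Reload while requesting the category dtype explicitly.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"hypertension": "category"})

print(reloaded["hypertension"].dtype, restored["hypertension"].dtype)
```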