In [None]:
from google.colab import files
uploaded = files.upload()

Saving cleaned_heart_stroke_data.csv to cleaned_heart_stroke_data.csv


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np

In [None]:
# Load the dataset (update the path as needed)
df = pd.read_csv('cleaned_heart_stroke_data.csv')

**1. Convert Residence_type column to binary (0 = rural, 1 = urban)**

In [None]:
# Check unique values in the original Residence_type column
print("Unique values in 'Residence_type' column before encoding:", df['Residence_type'].unique())

Unique values in 'Residence_type' column before encoding: ['Urban' 'Rural']


In [None]:
# Encode Residence_type: 1 = Urban, 0 = Rural
# (whitespace and case are normalized first; any value other than 'urban' becomes 0)
df['Residence_type'] = df['Residence_type'].apply(lambda x: 1 if x.strip().lower() == 'urban' else 0)

In [None]:
# Verify encoding of Residence_type
print("Unique values in 'Residence_type' column after encoding:", df['Residence_type'].unique())

Unique values in 'Residence_type' column after encoding: [1 0]


**Observations:**

The Residence_type column is now represented in binary form: 1 for urban and 0 for rural.

Most machine learning algorithms require numeric input, so the column can now be used directly as a feature.
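
The same mapping can also be written without `apply`. The following is a minimal sketch on a small made-up Series (the sample values are illustrative and not taken from the dataset). Unlike the lambda above, it leaves unexpected labels as missing values instead of coding them as rural, which may or may not be the desired behavior.

```python
import pandas as pd

# Hypothetical sample values, including stray whitespace, mixed case, and an unexpected label
sample = pd.Series(['Urban', ' rural', 'URBAN ', 'Rural', 'suburban'])

# Normalize the text, then map the known labels; anything unmapped becomes NaN
normalized = sample.str.strip().str.lower()
encoded = normalized.map({'rural': 0, 'urban': 1})

print(encoded.tolist())
```

Counting the missing values with `encoded.isna().sum()` then shows how many rows carried a label outside the expected two.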

**2. Convert work_type column to multiple columns (binary encoding)**

In [None]:
# Create one binary indicator column for each of three work_type categories
# (rows with any other work_type value get 0 in all three columns)
df['Never_worked'] = df['work_type'].apply(lambda x: 1 if x == 'Never_worked' else 0)
df['Private'] = df['work_type'].apply(lambda x: 1 if x == 'Private' else 0)
df['Self_employed'] = df['work_type'].apply(lambda x: 1 if x == 'Self-employed' else 0)

**Observations:**

We created a separate binary column for each of the Never_worked, Private, and Self-employed categories of work_type.

This represents each work type as an explicit feature instead of a single categorical column, so no artificial ordering is imposed on the categories.
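
For comparison, pandas can generate indicator columns automatically with `pd.get_dummies`. This is a sketch on made-up data, not the notebook's method; the `Govt_job` value is an assumed extra category used only to show what happens to categories that do not get their own column.

```python
import pandas as pd

# Hypothetical sample; 'Govt_job' is an assumed category outside the three encoded above
sample = pd.DataFrame({'work_type': ['Private', 'Self-employed', 'Never_worked', 'Govt_job']})

# One indicator column per category found in the data
dummies = pd.get_dummies(sample['work_type'], dtype=int)

# Keep only the three categories used in this notebook;
# rows belonging to any other category are then 0 in every kept column
selected = dummies[['Never_worked', 'Private', 'Self-employed']]
print(selected)
```

Note that `get_dummies` names the column `Self-employed` after the raw value, whereas the lambda-based cell uses the identifier-friendly name `Self_employed`.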

In [None]:
# Verify encoding for work_type columns
print("Unique values in 'Never_worked':", df['Never_worked'].unique())
print("Unique values in 'Private':", df['Private'].unique())
print("Unique values in 'Self_employed':", df['Self_employed'].unique())

Unique values in 'Never_worked': [0 1]
Unique values in 'Private': [1 0]
Unique values in 'Self_employed': [0 1]


**3. Convert smoking_status column to binary columns**

In [None]:
# Create one binary indicator column for each of three smoking_status categories
# (rows with any other smoking_status value get 0 in all three columns)
df['Never_smoked'] = df['smoking_status'].apply(lambda x: 1 if x == 'never smoked' else 0)
df['Formerly_smoked'] = df['smoking_status'].apply(lambda x: 1 if x == 'formerly smoked' else 0)
df['Smokes'] = df['smoking_status'].apply(lambda x: 1 if x == 'smokes' else 0)


**Observations:**

The smoking_status column now has a separate binary column for each encoded category, so each smoking habit is represented as its own feature.

Encoding the categories separately avoids implying a numeric order among them, which a single integer-coded column would introduce.

In [None]:
# Verify encoding for smoking_status columns
print("Unique values in 'Never_smoked':", df['Never_smoked'].unique())
print("Unique values in 'Formerly_smoked':", df['Formerly_smoked'].unique())
print("Unique values in 'Smokes':", df['Smokes'].unique())


Unique values in 'Never_smoked': [0 1]
Unique values in 'Formerly_smoked': [1 0]
Unique values in 'Smokes': [0 1]
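
Beyond checking unique values, it can help to confirm that no row has more than one smoking flag set. The sketch below uses made-up data and an assumed `Unknown` label standing in for any category without its own column; it builds the flags with a boolean comparison, which is equivalent to the lambda-based encoding.

```python
import pandas as pd

# Hypothetical sample; 'Unknown' stands in for any category without its own column
sample = pd.DataFrame(
    {'smoking_status': ['never smoked', 'smokes', 'formerly smoked', 'Unknown']}
)

# New column name -> raw label it represents
labels = {
    'Never_smoked': 'never smoked',
    'Formerly_smoked': 'formerly smoked',
    'Smokes': 'smokes',
}
for col, label in labels.items():
    # Boolean comparison cast to 0/1
    sample[col] = (sample['smoking_status'] == label).astype(int)

# Each row should have at most one flag set; a sum of 0 marks an unencoded category
flags_per_row = sample[list(labels)].sum(axis=1)
print(flags_per_row.tolist())
```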


**Binary Encoding for Additional Categorical Variables (if applicable)**

Other categorical variables, such as gender, can be binary encoded in the same way.

In [None]:
# Encode gender as binary: 1 = Male, 0 = Female (any other value is also encoded as 0)
df['gender'] = df['gender'].apply(lambda x: 1 if x == 'Male' else 0)


In [None]:
# Verify encoding for gender column
print("Unique values in 'gender' column:", df['gender'].unique())


Unique values in 'gender' column: [1 0]


**Additional Verification:**

Summary Check for Binary Columns:

After encoding all columns, it’s a good idea to check the DataFrame to ensure that only 0s and 1s exist in the encoded columns.

In [None]:
# Verify that binary columns contain only 0s and 1s by displaying the unique values in each column
binary_columns = ['Residence_type', 'Never_worked', 'Private', 'Self_employed',
                  'Never_smoked', 'Formerly_smoked', 'Smokes', 'gender']

for col in binary_columns:
    print(f"Unique values in '{col}': {df[col].unique()}")


Unique values in 'Residence_type': [0]
Unique values in 'Never_worked': [0 1]
Unique values in 'Private': [1 0]
Unique values in 'Self_employed': [0 1]
Unique values in 'Never_smoked': [0 1]
Unique values in 'Formerly_smoked': [1 0]
Unique values in 'Smokes': [0 1]
Unique values in 'gender': [1 0]
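
The same check can be made programmatic rather than visual. The following sketch, on a made-up frame with hypothetical column names, flags columns that contain values outside {0, 1} and columns that contain only a single distinct value. A column that shows only one value, as Residence_type does in the output above, is worth investigating before modeling.

```python
import pandas as pd

# Hypothetical frame: 'bad' holds a value outside {0, 1}; 'constant' never varies
sample = pd.DataFrame({'ok': [0, 1, 1], 'bad': [0, 2, 1], 'constant': [0, 0, 0]})

# Columns containing anything other than 0 or 1
is_binary = sample.isin([0, 1]).all()
non_binary = is_binary[~is_binary].index.tolist()

# Columns with a single distinct value carry no information for a model
constant_cols = [col for col in sample.columns if sample[col].nunique() < 2]

print("Non-binary columns:", non_binary)
print("Constant columns:", constant_cols)
```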


**4. Create a new DataFrame for the data model and drop original columns**

In [None]:
# Create a new DataFrame for modeling, dropping the original categorical columns
# (Residence_type was encoded in place above, so it is kept as a feature)
df_model = df.drop(['work_type', 'smoking_status'], axis=1)

**Observations:**

The new DataFrame, df_model, is ready for the next stage in data modeling.

The original text columns have been removed, so these categories are now represented only by their binary encoded features, which are straightforward to interpret in a model.
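
Before moving on, it may be worth confirming that no text columns remain. This sketch uses a small stand-in for df_model; the `age` and `ever_married` columns are assumptions for illustration and may not match the actual dataset.

```python
import pandas as pd

# Hypothetical stand-in for df_model with one leftover text column
sample_model = pd.DataFrame(
    {'age': [67.0, 61.0], 'Private': [1, 0], 'ever_married': ['Yes', 'No']}
)

# Non-numeric columns still need encoding before most models can use them
remaining_text = sample_model.select_dtypes(exclude='number').columns.tolist()
print(remaining_text)
```

An empty list here would indicate that every remaining column is numeric.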

**Overall Observations:**

**Binary Encoding:** Each categorical variable has been converted to binary columns using only lambda functions.

**Data Ready for Modeling:** The new DataFrame df_model is now prepared for machine learning models, with the encoded categorical variables represented as binary columns.

**Increased Interpretability:** The binary encoding makes it easy to interpret the influence of specific categories (e.g., work type and smoking status) on stroke risk.

This setup ensures all categorical data is in a format that’s easy for models to work with, while maintaining interpretability in terms of individual category contributions.