<div style='background-color:#9B59B6; padding: 15px; border-radius: 10px; margin-bottom: 20px;'>
<h1 style='color:#FFFFFF; text-align:center; font-family: Arial, sans-serif; margin: 0;'>🔧 Feature Engineering & Preprocessing</h1>
<h2 style='color:#E8DAEF; text-align:center; font-family: Arial, sans-serif; margin: 5px 0 0 0;'>Preparing Data for Machine Learning</h2>
</div>

<div style='background-color:#F8F5FF; padding: 15px; border-radius: 8px; border-left: 4px solid #9B59B6;'>
<h3 style='color:#9B59B6; margin-top: 0;'>⚙️ Engineering Objectives</h3>
<ul style='color:#333; line-height: 1.6;'>
<li><strong>Data Transformation:</strong> Convert categorical variables to numerical format</li>
<li><strong>Feature Creation:</strong> Engineer new features from existing data</li>
<li><strong>Data Scaling:</strong> Standardize numerical features to zero mean and unit variance</li>
<li><strong>Missing Value Handling:</strong> Implement robust imputation strategies</li>
<li><strong>Feature Selection:</strong> Identify the most predictive variables for stroke risk</li>
</ul>
</div>

# **Feature Engineering for Stroke Prediction**

## Objectives

* Engineer features for machine learning model training
* Handle missing values and data preprocessing
* Create derived features and encode categorical variables
* Prepare stratified train, validation, and test splits

## Inputs

* Raw stroke dataset with EDA insights
* Statistical analysis results from previous notebooks

## Outputs

* Cleaned and preprocessed dataset
* Engineered features for model training
* Train-test-validation splits
* Feature encoding pipelines

## Additional Comments

* Focus on creating features that improve model performance
* Account for class imbalance (only about 4.9% of patients in the dataset had a stroke)
* Prevent data leakage by fitting scaling and encoding on the training split only
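
With so few positive cases, the class-imbalance note above usually translates into weighting the minority class during training. Below is a minimal sketch of the "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is also how scikit-learn defines `class_weight='balanced'`. The helper name `balanced_class_weights` is illustrative, and the counts are the class totals this notebook reports once the target is extracted.

```python
def balanced_class_weights(class_counts):
    """Return weights inversely proportional to class frequency."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }

# Class totals for the stroke target (0 = no stroke, 1 = stroke)
stroke_counts = {0: 4861, 1: 249}
weights = balanced_class_weights(stroke_counts)

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

A dictionary like this could be passed as `class_weight` to scikit-learn estimators that accept it.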

---

# Change working directory

In [1]:
import os
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
print("Working directory changed to parent folder")

Working directory changed to parent folder
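
Because `os.chdir(os.path.dirname(current_dir))` moves up one level every time it runs, re-executing the cell above leaves the notebook in the wrong folder. One possible idempotent alternative is to search upward for a known marker directory. The helper name `find_project_root` and the choice of `jupyter_notebooks` as the marker (taken from the dataset path used in the next section) are assumptions, not part of the original notebook.

```python
from pathlib import Path

def find_project_root(start, marker="jupyter_notebooks"):
    """Walk upward from `start` until a folder containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains '{marker}'")

# Usage (not executed here):
#   import os
#   os.chdir(find_project_root(Path.cwd()))
```

Running this repeatedly always resolves to the same folder, regardless of where the notebook currently sits below the project root.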


# Import Libraries and Load Data

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import warnings
warnings.filterwarnings('ignore')

# Load the dataset (path is relative to the project root set above)
df = pd.read_csv("jupyter_notebooks/inputs/datasets/Stroke-data.csv")
print(f"Dataset loaded: {df.shape[0]} patients, {df.shape[1]} features")
print(f"Stroke cases: {df['stroke'].sum()} ({df['stroke'].mean()*100:.1f}%)")

print("\nDataset Info:")
print(df.info())

Dataset loaded: 5110 patients, 12 features
Stroke cases: 249 (4.9%)

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB
None


## Data Quality Assessment and Cleaning

In [4]:
# Check for missing values
print("Missing values per column:")
missing_values = df.isnull().sum()
for col, missing in missing_values.items():
    if missing > 0:
        print(f"{col}: {missing} ({missing/len(df)*100:.1f}%)")

# Check for duplicate records
duplicates = df.duplicated().sum()
print(f"\nDuplicate records: {duplicates}")

# Data type overview
print("\nData types:")
print(df.dtypes)

Missing values per column:
bmi: 201 (3.9%)

Duplicate records: 0

Data types:
id                     int64
gender                object
age                  float64
hypertension           int64
heart_disease          int64
ever_married          object
work_type             object
Residence_type        object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                 int64
dtype: object


In [5]:
# Handle missing BMI values
print(f"BMI missing values: {df['bmi'].isnull().sum()}")

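# Analyze BMI missing pattern
# Note: 'count' excludes NaN, so 'total' below is the number of non-missing BMI values
# and 'missing_rate' is missing / non-missing, not missing / all rows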
bmi_missing_analysis = df.groupby('stroke')['bmi'].agg(['count', lambda x: x.isnull().sum()])
bmi_missing_analysis.columns = ['total', 'missing']
bmi_missing_analysis['missing_rate'] = bmi_missing_analysis['missing'] / bmi_missing_analysis['total']
print("\nBMI missing pattern by stroke status:")
print(bmi_missing_analysis)

# Strategy: median imputation within age group and gender
# Note: the medians are computed on the full dataset, before the train/test split
# Create age groups for better imputation
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 45, 60, 100], labels=['Young', 'Adult', 'MiddleAged', 'Senior'])

# Impute BMI based on age group and gender
for age_group in df['age_group'].unique():
    for gender in df['gender'].unique():
        mask = (df['age_group'] == age_group) & (df['gender'] == gender)
        median_bmi = df.loc[mask, 'bmi'].median()
        df.loc[mask & df['bmi'].isnull(), 'bmi'] = median_bmi

print(f"\nBMI missing values after imputation: {df['bmi'].isnull().sum()}")

BMI missing values: 201

BMI missing pattern by stroke status:
        total  missing  missing_rate
stroke                              
0        4700      161      0.034255
1         209       40      0.191388

BMI missing values after imputation: 0
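
The loop above computes group medians over the full dataset, so rows that later land in the validation and test splits influence the imputed values. A stricter variant, sketched here on toy data, learns the medians from training rows only and reuses them elsewhere. The helper names, the toy values, and the global-median fallback for unseen groups are illustrative assumptions rather than part of the original workflow.

```python
import numpy as np
import pandas as pd

GROUP_COLS = ["age_group", "gender"]

def fit_group_medians(train, target="bmi"):
    """Learn per-group medians and a global fallback from training rows only."""
    group_medians = train.groupby(GROUP_COLS)[target].median().to_dict()
    return group_medians, train[target].median()

def apply_group_medians(frame, group_medians, fallback, target="bmi"):
    """Fill missing values with learned medians; unseen groups get the fallback."""
    keys = zip(*(frame[col] for col in GROUP_COLS))
    looked_up = pd.Series(
        [group_medians.get(key, np.nan) for key in keys], index=frame.index
    )
    out = frame.copy()
    out[target] = out[target].fillna(looked_up).fillna(fallback)
    return out

# Toy data standing in for the real train/test splits
train = pd.DataFrame({
    "age_group": ["Young", "Young", "Senior", "Senior"],
    "gender": ["Female", "Female", "Male", "Male"],
    "bmi": [22.0, 24.0, 30.0, np.nan],
})
test = pd.DataFrame({
    "age_group": ["Young", "Senior", "Adult"],
    "gender": ["Female", "Male", "Male"],
    "bmi": [np.nan, np.nan, np.nan],
})

group_medians, fallback = fit_group_medians(train)
test_filled = apply_group_medians(test, group_medians, fallback)
print(test_filled)
```

In this toy case the third test row belongs to a group that never appears in training, so it receives the overall training median.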


## Feature Engineering

In [6]:
# Create derived features based on EDA insights
print("Creating derived features...")

# 1. Age categories (based on stroke risk patterns)
df['age_category'] = pd.cut(df['age'], 
                           bins=[0, 40, 55, 70, 100], 
                           labels=['Low_Risk', 'Moderate_Risk', 'High_Risk', 'Very_High_Risk'])

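# 2. BMI categories (WHO cut-offs; pd.cut bins are right-closed by default,
#    so a BMI of exactly 25.0 is 'Normal' and exactly 30.0 is 'Overweight')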
df['bmi_category'] = pd.cut(df['bmi'], 
                           bins=[0, 18.5, 25, 30, 100], 
                           labels=['Underweight', 'Normal', 'Overweight', 'Obese'])

# 3. Glucose level categories (cut-offs follow fasting-glucose ranges;
#    avg_glucose_level is not necessarily a fasting measurement)
df['glucose_category'] = pd.cut(df['avg_glucose_level'], 
                               bins=[0, 100, 125, 1000], 
                               labels=['Normal', 'Prediabetic', 'Diabetic'])

# 4. Health risk score (composite feature)
df['health_risk_score'] = (
    df['hypertension'] * 2 +  # Hypertension contributes 2 points
    df['heart_disease'] * 3 +  # Heart disease contributes 3 points
    (df['avg_glucose_level'] > 125).astype(int) * 2 +  # Elevated glucose (> 125) contributes 2 points
    (df['bmi'] > 30).astype(int) * 1  # Obesity contributes 1 point
)

# 5. Age-health interaction
df['age_health_interaction'] = df['age'] * (df['hypertension'] + df['heart_disease'])

# 6. Lifestyle risk score
smoking_risk = df['smoking_status'].map({
    'never smoked': 0,
    'Unknown': 1,
    'formerly smoked': 2,
    'smokes': 3
})
df['lifestyle_risk_score'] = smoking_risk + (df['bmi'] > 30).astype(int)

print("Derived features created:")
derived_features = ['age_category', 'bmi_category', 'glucose_category', 
                   'health_risk_score', 'age_health_interaction', 'lifestyle_risk_score']
for feature in derived_features:
    print(f"- {feature}")
    if df[feature].dtype == 'object' or hasattr(df[feature], 'cat'):
        print(f"  Categories: {df[feature].value_counts().to_dict()}")
    else:
        print(f"  Range: {df[feature].min():.1f} - {df[feature].max():.1f}")
    print()

Creating derived features...
Derived features created:
- age_category
  Categories: {'Low_Risk': 2244, 'Moderate_Risk': 1170, 'High_Risk': 986, 'Very_High_Risk': 710}

- bmi_category
  Categories: {'Obese': 1941, 'Overweight': 1528, 'Normal': 1292, 'Underweight': 349}

- glucose_category
  Categories: {'Normal': 3131, 'Diabetic': 1000, 'Prediabetic': 979}

- health_risk_score
  Range: 0.0 - 8.0

- age_health_interaction
  Range: 0.0 - 164.0

- lifestyle_risk_score
  Range: 0.0 - 4.0
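
To make the composite score concrete, the weighting can be written as a plain function and checked by hand. This sketch mirrors the thresholds in the cell above; the function name and the example patient values are hypothetical.

```python
def health_risk_score(hypertension, heart_disease, avg_glucose_level, bmi):
    """Weighted sum of four risk flags: 2, 3, 2 and 1 points respectively."""
    return (
        hypertension * 2
        + heart_disease * 3
        + int(avg_glucose_level > 125) * 2
        + int(bmi > 30) * 1
    )

# Hypothetical patient with every risk factor present: 2 + 3 + 2 + 1
max_score = health_risk_score(1, 1, avg_glucose_level=180.0, bmi=34.5)

# Hypothetical patient with hypertension only; a BMI of exactly 30 earns
# no obesity point because the comparison is strict
partial_score = health_risk_score(1, 0, avg_glucose_level=95.0, bmi=30.0)

print(max_score, partial_score)
```

The maximum of 8 agrees with the range reported for `health_risk_score` above.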



## Feature Selection and Preprocessing Pipeline

In [7]:
# Define feature sets
# Original features
numerical_features = ['age', 'avg_glucose_level', 'bmi']
categorical_features = ['gender', 'hypertension', 'heart_disease', 'ever_married', 
                       'work_type', 'Residence_type', 'smoking_status']

# Derived features
derived_numerical = ['health_risk_score', 'age_health_interaction', 'lifestyle_risk_score']
derived_categorical = ['age_category', 'bmi_category', 'glucose_category']

# Combine all features
all_numerical = numerical_features + derived_numerical
all_categorical = categorical_features + derived_categorical

print(f"Total features for modeling:")
print(f"Numerical features ({len(all_numerical)}): {all_numerical}")
print(f"Categorical features ({len(all_categorical)}): {all_categorical}")

# Create preprocessing pipeline
# Numerical features: imputation + scaling
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical features: imputation + one-hot encoding
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, all_numerical),
        ('cat', categorical_transformer, all_categorical)
    ]
)

print("\nPreprocessing pipeline created successfully!")

Total features for modeling:
Numerical features (6): ['age', 'avg_glucose_level', 'bmi', 'health_risk_score', 'age_health_interaction', 'lifestyle_risk_score']
Categorical features (10): ['gender', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'smoking_status', 'age_category', 'bmi_category', 'glucose_category']

Preprocessing pipeline created successfully!
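
For intuition about the categorical branch, the following library-free sketch mimics what one-hot encoding with `drop='first'` and `handle_unknown='ignore'` produces: categories are learned from the training values, the first category in sorted order is dropped as the baseline, and a value not seen during fitting encodes as all zeros. The function names are illustrative; in the notebook the real work is done by scikit-learn's `OneHotEncoder`.

```python
def fit_categories(values):
    """Learn the sorted list of categories from training values."""
    return sorted(set(values))

def one_hot_drop_first(value, categories):
    """Encode one value as 0/1 indicators, omitting the first category."""
    return [int(value == category) for category in categories[1:]]

train_smoking = ["never smoked", "smokes", "Unknown", "formerly smoked", "smokes"]
categories = fit_categories(train_smoking)

encoded_smokes = one_hot_drop_first("smokes", categories)
encoded_baseline = one_hot_drop_first("Unknown", categories)
encoded_unseen = one_hot_drop_first("vapes", categories)  # hypothetical new level

print(categories)
print(encoded_smokes, encoded_baseline, encoded_unseen)
```

Note that the dropped baseline and an unseen level both map to all zeros, so the two cannot be told apart after encoding.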


## Train-Test Split and Data Preparation

In [8]:
# Prepare features and target
feature_columns = all_numerical + all_categorical
X = df[feature_columns].copy()
y = df['stroke'].copy()

print(f"Feature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
print(f"Class distribution: {y.value_counts().to_dict()}")

# Stratified split: hold out 20% as the test set while preserving the stroke rate
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Split the remaining 80% into train and validation sets;
# test_size=0.25 of that 80% is 20% of the full data, giving a 60/20/20 split
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(f"\nDataset splits:")
print(f"Training set: {X_train.shape[0]} samples ({y_train.mean()*100:.1f}% stroke rate)")
print(f"Validation set: {X_val.shape[0]} samples ({y_val.mean()*100:.1f}% stroke rate)")
print(f"Test set: {X_test.shape[0]} samples ({y_test.mean()*100:.1f}% stroke rate)")

# Verify class balance preservation
print(f"\nClass balance verification:")
print(f"Original: {y.mean():.3f}")
print(f"Train: {y_train.mean():.3f}")
print(f"Validation: {y_val.mean():.3f}")
print(f"Test: {y_test.mean():.3f}")

Feature matrix shape: (5110, 16)
Target vector shape: (5110,)
Class distribution: {0: 4861, 1: 249}

Dataset splits:
Training set: 3066 samples (4.9% stroke rate)
Validation set: 1022 samples (4.9% stroke rate)
Test set: 1022 samples (4.9% stroke rate)

Class balance verification:
Original: 0.049
Train: 0.049
Validation: 0.049
Test: 0.049
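
When the target proportions are stated relative to the full dataset, the `test_size` for the second call has to be rescaled by what remains after the first. A small sketch of that arithmetic using exact fractions; the helper name `two_stage_sizes` is hypothetical.

```python
from fractions import Fraction

def two_stage_sizes(test_frac, val_frac):
    """Return the test_size values for two successive splits.

    Both arguments are fractions of the full dataset.
    """
    first = Fraction(test_frac)
    second = Fraction(val_frac) / (1 - first)
    return first, second

first, second = two_stage_sizes(Fraction(1, 5), Fraction(1, 5))

n_total = 5110
n_test = n_total * first
n_val = (n_total - n_test) * second
n_train = n_total - n_test - n_val
print(first, second, n_train, n_val, n_test)
```

The resulting sizes match the 3066 / 1022 / 1022 split reported above.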


In [9]:
# Fit preprocessing pipeline on training data only
print("Fitting preprocessing pipeline...")
preprocessor.fit(X_train)

# Transform all datasets
X_train_processed = preprocessor.transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)

print(f"Processed feature shapes:")
print(f"Training: {X_train_processed.shape}")
print(f"Validation: {X_val_processed.shape}")
print(f"Test: {X_test_processed.shape}")

# Get feature names after preprocessing
# Numerical feature names remain the same
num_feature_names = all_numerical

# Get categorical feature names after one-hot encoding
cat_feature_names = list(preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(all_categorical))

# Combine all feature names
all_feature_names = num_feature_names + cat_feature_names

print(f"\nTotal features after preprocessing: {len(all_feature_names)}")
print(f"Numerical features: {len(num_feature_names)}")
print(f"Categorical features (one-hot encoded): {len(cat_feature_names)}")

Fitting preprocessing pipeline...
Processed feature shapes:
Training: (3066, 27)
Validation: (1022, 27)
Test: (1022, 27)

Total features after preprocessing: 27
Numerical features: 6
Categorical features (one-hot encoded): 21
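
The count of 21 one-hot columns follows from `drop='first'`: each categorical feature with k levels contributes `k - 1` columns. The sketch below reproduces the total. The level counts for `smoking_status` and the three derived categories are visible in this notebook; the remaining counts are assumptions about the dataset (for example, three `gender` levels and five `work_type` levels) that are consistent with the reported shape.

```python
# Number of distinct levels per categorical feature (partly assumed, see above)
levels = {
    "gender": 3,
    "hypertension": 2,
    "heart_disease": 2,
    "ever_married": 2,
    "work_type": 5,
    "Residence_type": 2,
    "smoking_status": 4,
    "age_category": 4,
    "bmi_category": 4,
    "glucose_category": 3,
}
n_numerical = 6

# drop='first' removes one indicator column per feature
onehot_columns = sum(k - 1 for k in levels.values())
total_columns = n_numerical + onehot_columns
print(onehot_columns, total_columns)
```

If a rare level were absent from the training split, the encoder would learn one column fewer, so this total depends on which rows the split happens to select.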


## Feature Analysis and Validation

In [10]:
# Check how the derived features relate to the observed stroke rate
from scipy.stats import pearsonr

print("=== FEATURE ANALYSIS ===")

# 1. Health risk score analysis
health_risk_stroke = df.groupby('health_risk_score')['stroke'].agg(['count', 'sum', 'mean'])
print("\nHealth Risk Score vs Stroke Rate:")
for score, data in health_risk_stroke.iterrows():
    print(f"Score {score}: {data['mean']:.3f} ({data['mean']*100:.1f}%) - {data['sum']}/{data['count']} patients")

# 2. Age category analysis
age_cat_stroke = df.groupby('age_category')['stroke'].agg(['count', 'sum', 'mean'])
print("\nAge Category vs Stroke Rate:")
for category, data in age_cat_stroke.iterrows():
    print(f"{category}: {data['mean']:.3f} ({data['mean']*100:.1f}%) - {data['sum']}/{data['count']} patients")

# 3. Glucose category analysis
glucose_cat_stroke = df.groupby('glucose_category')['stroke'].agg(['count', 'sum', 'mean'])
print("\nGlucose Category vs Stroke Rate:")
for category, data in glucose_cat_stroke.iterrows():
    print(f"{category}: {data['mean']:.3f} ({data['mean']*100:.1f}%) - {data['sum']}/{data['count']} patients")

# Statistical validation of derived features
health_risk_corr, health_risk_p = pearsonr(df['health_risk_score'], df['stroke'])
lifestyle_risk_corr, lifestyle_risk_p = pearsonr(df['lifestyle_risk_score'], df['stroke'])
age_health_corr, age_health_p = pearsonr(df['age_health_interaction'], df['stroke'])

print("\n=== DERIVED FEATURE CORRELATIONS ===")
print(f"Health Risk Score: r={health_risk_corr:.3f}, p={health_risk_p:.2e}")
print(f"Lifestyle Risk Score: r={lifestyle_risk_corr:.3f}, p={lifestyle_risk_p:.2e}")
print(f"Age-Health Interaction: r={age_health_corr:.3f}, p={age_health_p:.2e}")

=== FEATURE ANALYSIS ===

Health Risk Score vs Stroke Rate:
Score 0: 0.029 (2.9%) - 72.0/2449.0 patients
Score 1: 0.025 (2.5%) - 30.0/1221.0 patients
Score 2: 0.078 (7.8%) - 40.0/513.0 patients
Score 3: 0.076 (7.6%) - 44.0/579.0 patients
Score 4: 0.167 (16.7%) - 17.0/102.0 patients
Score 5: 0.166 (16.6%) - 25.0/151.0 patients
Score 6: 0.213 (21.3%) - 13.0/61.0 patients
Score 7: 0.150 (15.0%) - 3.0/20.0 patients
Score 8: 0.357 (35.7%) - 5.0/14.0 patients

Age Category vs Stroke Rate:
Low_Risk: 0.004 (0.4%) - 8.0/2244.0 patients
Moderate_Risk: 0.026 (2.6%) - 31.0/1170.0 patients
High_Risk: 0.083 (8.3%) - 82.0/986.0 patients
Very_High_Risk: 0.180 (18.0%) - 128.0/710.0 patients

Glucose Category vs Stroke Rate:
Normal: 0.036 (3.6%) - 112.0/3131.0 patients
Prediabetic: 0.038 (3.8%) - 37.0/979.0 patients
Diabetic: 0.100 (10.0%) - 100.0/1000.0 patients

=== DERIVED FEATURE CORRELATIONS ===
Health Risk Score: r=0.178, p=1.33e-37
Lifestyle Risk Score: r=0.032, p=2.07e-02
Age-Health Interaction:
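
`pearsonr` against a 0/1 target is the point-biserial correlation, and its magnitude tends to stay modest when positives are rare, so the coefficients above are best read alongside the per-group stroke rates. For reference, here is a library-free sketch of the coefficient on toy data; the function name and the sample values are illustrative only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy example: a numeric score and a binary outcome
scores = [1, 2, 3, 4]
outcome = [0, 0, 1, 1]
r = pearson_r(scores, outcome)
print(f"r = {r:.3f}")
```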

## Data Quality Validation

In [11]:
# Final data quality checks
print("=== FINAL DATA QUALITY VALIDATION ===")

# Check for any remaining missing values
print(f"Missing values in processed training data: {np.isnan(X_train_processed).sum()}")
print(f"Missing values in processed validation data: {np.isnan(X_val_processed).sum()}")
print(f"Missing values in processed test data: {np.isnan(X_test_processed).sum()}")

# Check for infinite values
print(f"Infinite values in training data: {np.isinf(X_train_processed).sum()}")

# Check feature scaling (for numerical features)
print(f"\nNumerical feature scaling check (training data):")
for i, feature in enumerate(num_feature_names):
    feature_data = X_train_processed[:, i]
    print(f"{feature}: mean={feature_data.mean():.3f}, std={feature_data.std():.3f}")

# Sanity check: validation/test data are scaled with training statistics,
# so their means should be close to zero but not exactly zero
print(f"\nData leakage check - feature means:")
train_means = X_train_processed.mean(axis=0)[:len(num_feature_names)]
val_means = X_val_processed.mean(axis=0)[:len(num_feature_names)]
test_means = X_test_processed.mean(axis=0)[:len(num_feature_names)]

for i, feature in enumerate(num_feature_names):
    print(f"{feature}: train={train_means[i]:.3f}, val={val_means[i]:.3f}, test={test_means[i]:.3f}")

=== FINAL DATA QUALITY VALIDATION ===
Missing values in processed training data: 0
Missing values in processed validation data: 0
Missing values in processed test data: 0
Infinite values in training data: 0

Numerical feature scaling check (training data):
age: mean=0.000, std=1.000
avg_glucose_level: mean=0.000, std=1.000
bmi: mean=0.000, std=1.000
health_risk_score: mean=-0.000, std=1.000
age_health_interaction: mean=0.000, std=1.000
lifestyle_risk_score: mean=0.000, std=1.000

Data leakage check - feature means:
age: train=0.000, val=-0.000, test=-0.028
avg_glucose_level: train=0.000, val=-0.038, test=-0.028
bmi: train=0.000, val=-0.044, test=-0.029
health_risk_score: train=0.000, val=-0.033, test=-0.022
age_health_interaction: train=-0.000, val=0.006, test=0.005
lifestyle_risk_score: train=0.000, val=-0.012, test=-0.021
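
Why the comparison above is informative: a scaler fitted on the training split centres that split exactly, while other data transformed with the same statistics end up near zero without landing on it. The sketch below shows the pattern on toy numbers with a hand-written standardizer (population standard deviation, as `StandardScaler` uses); all names and values are illustrative.

```python
import math

def fit_standardizer(values):
    """Return the mean and population standard deviation of `values`."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, std

def standardize(values, mean, std):
    """Scale values using statistics learned elsewhere."""
    return [(v - mean) / std for v in values]

train = [1.0, 2.0, 3.0, 4.0, 5.0]
val = [2.0, 4.0, 6.0]

# Statistics come from the training values only
mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
val_scaled = standardize(val, mean, std)

train_mean = sum(train_scaled) / len(train_scaled)
val_mean = sum(val_scaled) / len(val_scaled)
print(f"train mean={train_mean:.3f}, val mean={val_mean:.3f}")
```

A validation mean of exactly zero on every feature would suggest the scaler had also been fitted on that split. Similar means alone do not rule out other forms of leakage, such as imputation statistics computed before splitting.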


---

# Save Processed Data and Pipeline

In [14]:
# Save processed datasets and preprocessing pipeline
import os
import pickle

# Ensure output directories exist
os.makedirs("jupyter_notebooks/outputs/datasets", exist_ok=True)

# Feature names in the column order of the processed arrays
# (all_feature_names was built from the fitted encoder in the preprocessing cell)
total_features = X_train_processed.shape[1]
feature_names = list(all_feature_names)
assert len(feature_names) == total_features, "feature names do not match processed columns"

# Save processed datasets
np.savez('jupyter_notebooks/outputs/datasets/processed_stroke_data.npz',
         X_train=X_train_processed, X_val=X_val_processed, X_test=X_test_processed,
         y_train=y_train, y_val=y_val, y_test=y_test)

# Save preprocessing pipeline together with the column lists it was fitted on
with open('jupyter_notebooks/outputs/datasets/preprocessing_pipeline.pkl', 'wb') as f:
    pickle.dump({
        'preprocessor': preprocessor,
        'numerical_features': all_numerical,
        'categorical_features': all_categorical
    }, f)

# Save feature engineering summary
summary = {
    'total_features': total_features,
    'numerical_features': len(numerical_features),
    'categorical_features_original': len(categorical_features),
    'training_samples': X_train_processed.shape[0],
    'validation_samples': X_val_processed.shape[0],
    'test_samples': X_test_processed.shape[0],
    'stroke_rate_train': float(y_train.mean()),
    'stroke_rate_val': float(y_val.mean()),
    'stroke_rate_test': float(y_test.mean()),
    'missing_values_imputed': 201,
    'new_features_created': len(derived_features)  # 3 categorical + 3 numerical derived features
}

import json
with open('jupyter_notebooks/outputs/datasets/feature_engineering_summary.json', 'w') as f:
    json.dump(summary, f, indent=2)

print("✅ All datasets and pipelines saved successfully!")
print(f"📊 Processed data shape: {X_train_processed.shape}")
print(f"🎯 Features created: {total_features}")
print(f"📁 Files saved to: jupyter_notebooks/outputs/datasets/")
print(f"   - processed_stroke_data.npz")
print(f"   - preprocessing_pipeline.pkl")
print(f"   - feature_engineering_summary.json")

print("\n🎉 Feature Engineering Complete!")
print("Ready for machine learning model training!")

print(f"\nDataset summary:")
print(f"  - Original features: 12")
print(f"  - Engineered features: {total_features}")
print(f"  - Training samples: {X_train_processed.shape[0]}")
print(f"  - Validation samples: {X_val_processed.shape[0]}")
print(f"  - Test samples: {X_test_processed.shape[0]}")
print(f"  - Class balance maintained: {y_train.mean():.3f}")

✅ All datasets and pipelines saved successfully!
📊 Processed data shape: (3066, 27)
🎯 Features created: 27
📁 Files saved to: jupyter_notebooks/outputs/datasets/
   - processed_stroke_data.npz
   - preprocessing_pipeline.pkl
   - feature_engineering_summary.json

🎉 Feature Engineering Complete!
Ready for machine learning model training!

Dataset summary:
  - Original features: 12
  - Engineered features: 27
  - Training samples: 3066
  - Validation samples: 1022
  - Test samples: 1022
  - Class balance maintained: 0.049
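
A downstream notebook could sanity-check the saved summary before training. This is a minimal sketch, assuming the JSON keys written above; the function names, the expected row count argument, and the tolerance are hypothetical choices.

```python
import json

def check_summary(summary, expected_rows=5110, rate_tolerance=0.005):
    """Raise ValueError if split sizes or stroke rates look inconsistent."""
    n_rows = (
        summary["training_samples"]
        + summary["validation_samples"]
        + summary["test_samples"]
    )
    if n_rows != expected_rows:
        raise ValueError(f"splits cover {n_rows} rows, expected {expected_rows}")
    rates = [summary[f"stroke_rate_{split}"] for split in ("train", "val", "test")]
    if max(rates) - min(rates) > rate_tolerance:
        raise ValueError(f"stroke rates differ across splits: {rates}")
    return True

def load_and_check(path):
    """Read the summary JSON from disk and validate it."""
    with open(path) as f:
        return check_summary(json.load(f))

# Values as printed in this notebook; stroke rates rounded to three decimals
example_summary = {
    "training_samples": 3066,
    "validation_samples": 1022,
    "test_samples": 1022,
    "stroke_rate_train": 0.049,
    "stroke_rate_val": 0.049,
    "stroke_rate_test": 0.049,
}
summary_ok = check_summary(example_summary)
print(summary_ok)
```

In practice `load_and_check` would be pointed at `jupyter_notebooks/outputs/datasets/feature_engineering_summary.json`.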
