<div style='background-color:#9B59B6; padding: 15px; border-radius: 10px; margin-bottom: 20px;'>
<h1 style='color:#FFFFFF; text-align:center; font-family: Arial, sans-serif; margin: 0;'>🔧 Feature Engineering & Preprocessing</h1>
<h2 style='color:#E8DAEF; text-align:center; font-family: Arial, sans-serif; margin: 5px 0 0 0;'>Preparing Data for Machine Learning</h2>
</div>

<div style='background-color:#F8F5FF; padding: 15px; border-radius: 8px; border-left: 4px solid #9B59B6;'>
<h3 style='color:#9B59B6; margin-top: 0;'>⚙️ Engineering Objectives</h3>
<ul style='color:#333; line-height: 1.6;'>
<li><strong>Data Transformation:</strong> Convert categorical variables to numerical format</li>
<li><strong>Feature Creation:</strong> Engineer new features from existing data</li>
<li><strong>Data Scaling:</strong> Standardize numerical features to zero mean and unit variance</li>
<li><strong>Missing Value Handling:</strong> Impute missing BMI values with group medians</li>
<li><strong>Feature Selection:</strong> Identify the most predictive variables for stroke risk</li>
</ul>
</div>

# **Feature Engineering for Stroke Prediction**

## Objectives

* Engineer features for machine learning model training
* Handle missing values and data preprocessing
* Create derived features and encode categorical variables
* Prepare stratified train, validation, and test splits (60/20/20)

## Inputs

* Raw stroke dataset with EDA insights
* Statistical analysis results from previous notebooks

## Outputs

* Cleaned and preprocessed dataset
* Engineered features for model training
* Train, validation, and test splits (60/20/20, stratified)
* Feature encoding pipelines

## Additional Comments

* Focus on creating features that improve model performance
* Account for class imbalance by stratifying every split on the target
* Prevent data leakage by fitting the preprocessing pipeline on the training set only

---

# Change working directory

In [None]:
import os
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
print("Working directory changed to parent folder")

# Import Libraries and Load Data

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import joblib
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
df = pd.read_csv("inputs/datasets/Stroke-data.csv")
print(f"Dataset loaded: {df.shape[0]} patients, {df.shape[1]} features")
print(f"Stroke cases: {df['stroke'].sum()} ({df['stroke'].mean()*100:.1f}%)")

## Data Quality Assessment and Cleaning

In [None]:
# Check for missing values
print("Missing values per column:")
missing_values = df.isnull().sum()
for col, missing in missing_values.items():
    if missing > 0:
        print(f"{col}: {missing} ({missing/len(df)*100:.1f}%)")

# Check for duplicate records
duplicates = df.duplicated().sum()
print(f"\nDuplicate records: {duplicates}")

# Data type overview
print("\nData types:")
print(df.dtypes)

In [None]:
# Handle missing BMI values
print(f"BMI missing values: {df['bmi'].isnull().sum()}")

# Analyze BMI missing pattern
# 'size' counts every row; 'count' would skip the missing values being measured
bmi_missing_analysis = df.groupby('stroke')['bmi'].agg(
    total='size',
    missing=lambda x: x.isnull().sum()
)
bmi_missing_analysis['missing_rate'] = bmi_missing_analysis['missing'] / bmi_missing_analysis['total']
print("\nBMI missing pattern by stroke status:")
print(bmi_missing_analysis)

# Strategy: Use median imputation grouped by key characteristics
# Create age groups for better imputation
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 45, 60, 100], labels=['Young', 'Adult', 'MiddleAged', 'Senior'])

# Impute BMI with the median of each age group and gender
# Note: these medians are computed on the full dataset, before the train-test split
group_median_bmi = df.groupby(['age_group', 'gender'], observed=True)['bmi'].transform('median')
df['bmi'] = df['bmi'].fillna(group_median_bmi)
# Fall back to the overall median for groups with no observed BMI
df['bmi'] = df['bmi'].fillna(df['bmi'].median())

print(f"\nBMI missing values after imputation: {df['bmi'].isnull().sum()}")
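
The grouped imputation above draws its medians from every row, including rows that later land in the validation and test sets. One leakage-safe alternative is to learn the group medians from the training rows only and then apply them to any split. The sketch below illustrates the idea under stated assumptions: `fit_group_medians` and `apply_group_medians` are hypothetical helpers (not part of pandas or scikit-learn), and the toy frames with made-up values stand in for the stroke data.

```python
import numpy as np
import pandas as pd

def fit_group_medians(train, group_cols, target):
    """Learn per-group medians and a global fallback from training rows only."""
    medians = (
        train.groupby(group_cols)[target]
        .median()
        .rename("group_median")
        .reset_index()
    )
    return medians, train[target].median()

def apply_group_medians(frame, group_cols, target, medians, fallback):
    """Fill missing values with the training group median, else the fallback."""
    out = frame.copy()
    # A left merge keeps the row order of `frame`; unseen groups get NaN.
    looked_up = out[group_cols].merge(medians, on=group_cols, how="left")
    fill = looked_up["group_median"].fillna(fallback).to_numpy()
    missing = out[target].isna().to_numpy()
    out.loc[missing, target] = fill[missing]
    return out

# Toy data standing in for the stroke dataset (values are made up).
train = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "age_group": ["Young", "Young", "Senior", "Senior"],
    "bmi": [20.0, 24.0, 30.0, np.nan],
})
test = pd.DataFrame({
    "gender": ["F", "M", "Other"],
    "age_group": ["Young", "Senior", "Young"],
    "bmi": [np.nan, np.nan, np.nan],
})

group_cols = ["gender", "age_group"]
medians, fallback = fit_group_medians(train, group_cols, "bmi")
train_filled = apply_group_medians(train, group_cols, "bmi", medians, fallback)
test_filled = apply_group_medians(test, group_cols, "bmi", medians, fallback)

print(test_filled["bmi"].tolist())  # [22.0, 30.0, 24.0]
```

The third test row belongs to a group never seen in training, so it receives the overall training median instead of a group median.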

## Feature Engineering

In [None]:
# Create derived features based on EDA insights
print("Creating derived features...")

# 1. Age categories (based on stroke risk patterns)
df['age_category'] = pd.cut(df['age'], 
                           bins=[0, 40, 55, 70, 100], 
                           labels=['Low_Risk', 'Moderate_Risk', 'High_Risk', 'Very_High_Risk'])

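# 2. BMI categories (WHO cut-offs: <18.5, 18.5-24.9, 25-29.9, >=30)
# right=False makes each bin include its lower edge, so BMI 25 is Overweight and 30 is Obese
df['bmi_category'] = pd.cut(df['bmi'], 
                           bins=[0, 18.5, 25, 30, np.inf], 
                           right=False,
                           labels=['Underweight', 'Normal', 'Overweight', 'Obese'])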

# 3. Glucose level categories (diabetes risk)
df['glucose_category'] = pd.cut(df['avg_glucose_level'], 
                               bins=[0, 100, 125, 1000], 
                               labels=['Normal', 'Prediabetic', 'Diabetic'])

# 4. Health risk score (composite feature)
df['health_risk_score'] = (
    df['hypertension'] * 2 +  # Hypertension contributes 2 points
    df['heart_disease'] * 3 +  # Heart disease contributes 3 points
    (df['avg_glucose_level'] > 125).astype(int) * 2 +  # Diabetes contributes 2 points
    (df['bmi'] > 30).astype(int) * 1  # Obesity contributes 1 point
)

# 5. Age-health interaction
df['age_health_interaction'] = df['age'] * (df['hypertension'] + df['heart_disease'])

# 6. Lifestyle risk score
smoking_risk = df['smoking_status'].map({
    'never smoked': 0,
    'Unknown': 1,
    'formerly smoked': 2,
    'smokes': 3
})
df['lifestyle_risk_score'] = smoking_risk + (df['bmi'] > 30).astype(int)

print("Derived features created:")
derived_features = ['age_category', 'bmi_category', 'glucose_category', 
                   'health_risk_score', 'age_health_interaction', 'lifestyle_risk_score']
for feature in derived_features:
    print(f"- {feature}")
    if df[feature].dtype == 'object' or hasattr(df[feature], 'cat'):
        print(f"  Categories: {df[feature].value_counts().to_dict()}")
    else:
        print(f"  Range: {df[feature].min():.1f} - {df[feature].max():.1f}")
    print()
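
A detail worth checking when binning with `pd.cut` is which side of each interval is closed. By default the right edge is included, so a value sitting exactly on a bin edge falls into the lower bin. The short sketch below compares both settings on the BMI edges used in this notebook; the three sample values are chosen only to hit the edges.

```python
import pandas as pd

bmi = pd.Series([18.5, 25.0, 30.0])
bins = [0, 18.5, 25, 30, 100]
labels = ["Underweight", "Normal", "Overweight", "Obese"]

# Default: intervals are (0, 18.5], (18.5, 25], (25, 30], (30, 100]
right_closed = pd.cut(bmi, bins=bins, labels=labels)
# right=False: intervals are [0, 18.5), [18.5, 25), [25, 30), [30, 100)
left_closed = pd.cut(bmi, bins=bins, labels=labels, right=False)

print(right_closed.astype(str).tolist())  # ['Underweight', 'Normal', 'Overweight']
print(left_closed.astype(str).tolist())   # ['Normal', 'Overweight', 'Obese']
```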

## Feature Selection and Preprocessing Pipeline

In [None]:
# Define feature sets
# Original features
numerical_features = ['age', 'avg_glucose_level', 'bmi']
categorical_features = ['gender', 'hypertension', 'heart_disease', 'ever_married', 
                       'work_type', 'Residence_type', 'smoking_status']

# Derived features
derived_numerical = ['health_risk_score', 'age_health_interaction', 'lifestyle_risk_score']
derived_categorical = ['age_category', 'bmi_category', 'glucose_category']

# Combine all features
all_numerical = numerical_features + derived_numerical
all_categorical = categorical_features + derived_categorical

print(f"Total features for modeling:")
print(f"Numerical features ({len(all_numerical)}): {all_numerical}")
print(f"Categorical features ({len(all_categorical)}): {all_categorical}")

# Create preprocessing pipeline
# Numerical features: imputation + scaling
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical features: imputation + one-hot encoding
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, all_numerical),
        ('cat', categorical_transformer, all_categorical)
    ]
)

print("\nPreprocessing pipeline created successfully!")

## Train-Test Split and Data Preparation

In [None]:
# Prepare features and target
feature_columns = all_numerical + all_categorical
X = df[feature_columns].copy()
y = df['stroke'].copy()

print(f"Feature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
print(f"Class distribution: {y.value_counts().to_dict()}")

# Stratified train-test split to preserve the class proportions in each split
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Split the remaining 80% into train and validation (0.25 x 80% = 20% of all data, giving 60/20/20)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(f"\nDataset splits:")
print(f"Training set: {X_train.shape[0]} samples ({y_train.mean()*100:.1f}% stroke rate)")
print(f"Validation set: {X_val.shape[0]} samples ({y_val.mean()*100:.1f}% stroke rate)")
print(f"Test set: {X_test.shape[0]} samples ({y_test.mean()*100:.1f}% stroke rate)")

# Verify class balance preservation
print(f"\nClass balance verification:")
print(f"Original: {y.mean():.3f}")
print(f"Train: {y_train.mean():.3f}")
print(f"Validation: {y_val.mean():.3f}")
print(f"Test: {y_test.mean():.3f}")
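
The two-stage split above first holds out 20% of the data for testing and then takes 25% of the remaining 80% for validation. The arithmetic behind the resulting proportions can be sketched as follows; `split_fractions` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
def split_fractions(test_size, val_size_of_remainder):
    """Return (train, val, test) fractions of the full dataset for a two-stage split."""
    remainder = 1.0 - test_size
    val = remainder * val_size_of_remainder
    train = remainder - val
    return train, val, test_size

train_frac, val_frac, test_frac = split_fractions(0.2, 0.25)
print(f"train={train_frac:.2f}, val={val_frac:.2f}, test={test_frac:.2f}")
# train=0.60, val=0.20, test=0.20
```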

In [None]:
# Fit preprocessing pipeline on training data only
print("Fitting preprocessing pipeline...")
preprocessor.fit(X_train)

# Transform all datasets
X_train_processed = preprocessor.transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)

print(f"Processed feature shapes:")
print(f"Training: {X_train_processed.shape}")
print(f"Validation: {X_val_processed.shape}")
print(f"Test: {X_test_processed.shape}")

# Get feature names after preprocessing
# Numerical feature names remain the same
num_feature_names = all_numerical

# Get categorical feature names after one-hot encoding
cat_feature_names = list(preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(all_categorical))

# Combine all feature names
all_feature_names = num_feature_names + cat_feature_names

print(f"\nTotal features after preprocessing: {len(all_feature_names)}")
print(f"Numerical features: {len(num_feature_names)}")
print(f"Categorical features (one-hot encoded): {len(cat_feature_names)}")

## Feature Analysis and Validation

In [None]:
# Check how stroke rates vary across the derived features
from scipy.stats import pearsonr

print("=== FEATURE ANALYSIS ===")

# 1. Health risk score analysis
health_risk_stroke = df.groupby('health_risk_score')['stroke'].agg(['count', 'sum', 'mean'])
print("\nHealth Risk Score vs Stroke Rate:")
for score, data in health_risk_stroke.iterrows():
    print(f"Score {score}: {data['mean']:.3f} ({data['mean']*100:.1f}%) - {int(data['sum'])}/{int(data['count'])} patients")

# 2. Age category analysis
age_cat_stroke = df.groupby('age_category')['stroke'].agg(['count', 'sum', 'mean'])
print("\nAge Category vs Stroke Rate:")
for category, data in age_cat_stroke.iterrows():
    print(f"{category}: {data['mean']:.3f} ({data['mean']*100:.1f}%) - {int(data['sum'])}/{int(data['count'])} patients")

# 3. Glucose category analysis
glucose_cat_stroke = df.groupby('glucose_category')['stroke'].agg(['count', 'sum', 'mean'])
print("\nGlucose Category vs Stroke Rate:")
for category, data in glucose_cat_stroke.iterrows():
    print(f"{category}: {data['mean']:.3f} ({data['mean']*100:.1f}%) - {int(data['sum'])}/{int(data['count'])} patients")

# Statistical validation of derived features
health_risk_corr, health_risk_p = pearsonr(df['health_risk_score'], df['stroke'])
lifestyle_risk_corr, lifestyle_risk_p = pearsonr(df['lifestyle_risk_score'], df['stroke'])
age_health_corr, age_health_p = pearsonr(df['age_health_interaction'], df['stroke'])

print("\n=== DERIVED FEATURE CORRELATIONS ===")
print(f"Health Risk Score: r={health_risk_corr:.3f}, p={health_risk_p:.2e}")
print(f"Lifestyle Risk Score: r={lifestyle_risk_corr:.3f}, p={lifestyle_risk_p:.2e}")
print(f"Age-Health Interaction: r={age_health_corr:.3f}, p={age_health_p:.2e}")

## Data Quality Validation

In [None]:
# Final data quality checks
print("=== FINAL DATA QUALITY VALIDATION ===")

# Check for any remaining missing values
print(f"Missing values in processed training data: {np.isnan(X_train_processed).sum()}")
print(f"Missing values in processed validation data: {np.isnan(X_val_processed).sum()}")
print(f"Missing values in processed test data: {np.isnan(X_test_processed).sum()}")

# Check for infinite values
print(f"Infinite values in training data: {np.isinf(X_train_processed).sum()}")

# Check feature scaling (for numerical features)
print(f"\nNumerical feature scaling check (training data):")
for i, feature in enumerate(num_feature_names):
    feature_data = X_train_processed[:, i]
    print(f"{feature}: mean={feature_data.mean():.3f}, std={feature_data.std():.3f}")

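# Sanity check on scaling: training means are ~0 by construction, while validation and
# test means should be close to 0 but not exactly 0 (the scaler was fitted on training data only)
print(f"\nScaled feature means by split:")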
train_means = X_train_processed.mean(axis=0)[:len(num_feature_names)]
val_means = X_val_processed.mean(axis=0)[:len(num_feature_names)]
test_means = X_test_processed.mean(axis=0)[:len(num_feature_names)]

for i, feature in enumerate(num_feature_names):
    print(f"{feature}: train={train_means[i]:.3f}, val={val_means[i]:.3f}, test={test_means[i]:.3f}")
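
The pattern to expect from the check above is that the training means are zero up to floating-point error, while validation and test means are only near zero, because the scaler's statistics come from the training split alone. The sketch below reproduces that pattern with NumPy on synthetic data; the distribution parameters and the 600/200 sizes are arbitrary assumptions. It is a sanity check on the scaling, not proof that no leakage occurred elsewhere.

```python
import numpy as np

rng = np.random.default_rng(42)
train = rng.normal(loc=50.0, scale=10.0, size=600)
val = rng.normal(loc=50.0, scale=10.0, size=200)

# Statistics come from the training split only, as with StandardScaler.fit.
# np.std defaults to the population formula (ddof=0), which StandardScaler also uses.
mu, sigma = train.mean(), train.std()

train_scaled = (train - mu) / sigma
val_scaled = (val - mu) / sigma

print(f"train: mean={train_scaled.mean():.3f}, std={train_scaled.std():.3f}")
print(f"val:   mean={val_scaled.mean():.3f}, std={val_scaled.std():.3f}")
```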

---

# Save Processed Data and Pipeline

In [None]:
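# Save the preprocessing pipeline (create the output folders first if needed)
os.makedirs('outputs/ml_pipeline', exist_ok=True)
os.makedirs('outputs/datasets', exist_ok=True)
joblib.dump(preprocessor, 'outputs/ml_pipeline/feature_engineering_pipeline.pkl')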
print("Preprocessing pipeline saved successfully!")

# Save processed datasets
np.save('outputs/datasets/X_train_processed.npy', X_train_processed)
np.save('outputs/datasets/X_val_processed.npy', X_val_processed)
np.save('outputs/datasets/X_test_processed.npy', X_test_processed)
np.save('outputs/datasets/y_train.npy', y_train.values)
np.save('outputs/datasets/y_val.npy', y_val.values)
np.save('outputs/datasets/y_test.npy', y_test.values)

# Save feature information
feature_info = {
    'all_feature_names': all_feature_names,
    'numerical_features': all_numerical,
    'categorical_features': all_categorical,
    'original_features': feature_columns,
    'n_features_after_preprocessing': len(all_feature_names),
    'train_shape': X_train_processed.shape,
    'val_shape': X_val_processed.shape,
    'test_shape': X_test_processed.shape
}

import json
with open('outputs/datasets/feature_info.json', 'w') as f:
    json.dump(feature_info, f, indent=2)

# Save dataset with all engineered features for further analysis
df_with_features = df.copy()
df_with_features.to_csv('outputs/datasets/stroke_engineered_features.csv', index=False)

print("\n=== FEATURE ENGINEERING COMPLETE ===")
print(f"Original features: {len(feature_columns)}")
print(f"Features after preprocessing: {len(all_feature_names)}")
print(f"Training samples: {X_train_processed.shape[0]}")
print(f"Validation samples: {X_val_processed.shape[0]}")
print(f"Test samples: {X_test_processed.shape[0]}")
print(f"Class balance maintained: {abs(y_train.mean() - y.mean()) < 0.01}")
print("\nFiles saved:")
print("- Preprocessing pipeline: outputs/ml_pipeline/feature_engineering_pipeline.pkl")
print("- Processed datasets: outputs/datasets/X_*.npy, y_*.npy")
print("- Feature information: outputs/datasets/feature_info.json")
print("- Enhanced dataset: outputs/datasets/stroke_engineered_features.csv")
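
When `feature_info.json` is read back in a later notebook, the shape entries arrive as lists rather than tuples, because JSON has no tuple type. The round trip below is a minimal sketch using an in-memory string and made-up values instead of the real file.

```python
import json

# Made-up values for illustration; the real dictionary is built in the cell above.
feature_info = {"train_shape": (6, 4), "all_feature_names": ["age", "bmi"]}

restored = json.loads(json.dumps(feature_info, indent=2))
print(restored["train_shape"])  # [6, 4]

# Convert back explicitly if downstream code compares against ndarray.shape.
train_shape = tuple(restored["train_shape"])
print(train_shape)  # (6, 4)
```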