# Feature Engineering & Data Preprocessing
# HAM10000 Skin Lesion Dataset

This notebook cleans the HAM10000 metadata, engineers new features, and encodes, scales, and splits the data to prepare it for machine learning models.

## Objectives:
1. **Data Cleaning**: Handle missing values and detect outliers
2. **Feature Engineering**: Create new meaningful features
3. **Data Encoding**: Convert categorical variables to numerical
4. **Feature Scaling**: Normalize/standardize features
5. **Data Splitting**: Create train/validation/test sets
6. **Feature Selection**: Identify most important features
7. **Data Balancing**: Handle class imbalance issues

## Key Steps:
- Load and clean the dataset
- Create age groups and interaction features  
- Encode categorical variables (Label/One-Hot encoding)
- Scale numerical features
- Split data with stratification
- Apply feature selection techniques
- Handle class imbalance with SMOTE/class weights
- Save processed datasets for model training
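The steps listed above can also be chained with scikit-learn's `Pipeline` and `ColumnTransformer`, so that imputation, scaling, and encoding are fitted on the training split only. The following is a minimal sketch on a made-up toy frame, not the notebook's actual implementation: the column names mirror the HAM10000 metadata, while the values and names such as `toy` and `preprocessor` are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the metadata table (values are made up for illustration)
toy = pd.DataFrame({
    "age": [80.0, 75.0, None, 40.0, 55.0, 30.0, 65.0, 45.0],
    "sex": ["male", "male", "female", "female", "male", "female", "male", "female"],
    "localization": ["scalp", "ear", "back", "back", "face", "trunk", "scalp", "back"],
    "dx": ["bkl", "bkl", "nv", "nv", "bkl", "nv", "bkl", "nv"],
})

numeric_cols = ["age"]
categorical_cols = ["sex", "localization"]

preprocessor = ColumnTransformer([
    # Median imputation followed by standardization for numeric columns
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # One-hot encoding; categories unseen during fit are encoded as all zeros
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X = toy[numeric_cols + categorical_cols]
y = toy["dx"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit on the training split only, then reuse the fitted transformer on the test split
X_train_t = preprocessor.fit_transform(X_train)
X_test_t = preprocessor.transform(X_test)
print(X_train_t.shape, X_test_t.shape)
```

Fitting the transformer after the split keeps test-set statistics (median age, mean, standard deviation) out of the training features.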

In [10]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif, RFE
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries imported successfully!")
print("Available preprocessing tools:")
print("   • Data splitting and stratification")
print("   • Feature scaling (Standard, MinMax)")
print("   • Encoding (Label, One-Hot)")
print("   • Feature selection (Chi2, Mutual Info, RFE)")
print("   • Class balancing (SMOTE, ADASYN)")
print("   • Pipeline creation for automated preprocessing")

Libraries imported successfully!
Available preprocessing tools:
   • Data splitting and stratification
   • Feature scaling (Standard, MinMax)
   • Encoding (Label, One-Hot)
   • Feature selection (Chi2, Mutual Info, RFE)
   • Class balancing (SMOTE, ADASYN)
   • Pipeline creation for automated preprocessing


In [11]:
# Load the dataset
df = pd.read_csv('../Dataset/HAM10000_metadata.csv')

print("📊 Dataset loaded successfully!")
print(f"   • Shape: {df.shape}")
print(f"   • Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Display basic info
print("\n📋 Dataset Overview:")
print(f"   • Total samples: {len(df):,}")
print(f"   • Features: {list(df.columns)}")
print(f"   • Target variable: 'dx' (Disease type)")
print(f"   • Unique diseases: {df['dx'].nunique()}")

# Check data quality
print("\n🔍 Data Quality Check:")
missing_vals = df.isnull().sum()
print(f"   • Missing values: {missing_vals.sum()} total")
if missing_vals.sum() > 0:
    print("   • Missing by column:")
    for col, miss in missing_vals.items():
        if miss > 0:
            print(f"     - {col}: {miss} ({miss/len(df)*100:.1f}%)")

print(f"   • Duplicated rows: {df.duplicated().sum()}")
print(f"   • Data types: {dict(df.dtypes)}")

# Quick preview
print("\n👀 Data Preview:")
display(df.head())

# Data Cleaning
print("\n🧹 Data Cleaning:")
# Handle missing values
if df.isnull().sum().sum() > 0:
    # Fill missing ages with median
    if 'age' in df.columns and df['age'].isnull().sum() > 0:
        median_age = df['age'].median()
        missing_age_count = df['age'].isnull().sum()  # count before filling
        df['age'] = df['age'].fillna(median_age)
        print(f"   ✅ Filled {missing_age_count} missing age values with median: {median_age}")
    
    # Fill categorical missing values with mode
    for col in df.select_dtypes(include=['object']).columns:
        if df[col].isnull().sum() > 0:
            mode_val = df[col].mode()[0]
            df[col] = df[col].fillna(mode_val)
            print(f"   ✅ Filled missing {col} values with mode: {mode_val}")
else:
    print("   ✅ No missing values found!")

# Remove duplicates if any
duplicate_count = df.duplicated().sum()
if duplicate_count > 0:
    df.drop_duplicates(inplace=True)
    print(f"   ✅ Removed {duplicate_count} duplicate rows")
else:
    print("   ✅ No duplicate rows found!")

print(f"\n📊 Clean dataset shape: {df.shape}")

📊 Dataset loaded successfully!
   • Shape: (10015, 7)
   • Memory usage: 3.78 MB

📋 Dataset Overview:
   • Total samples: 10,015
   • Features: ['lesion_id', 'image_id', 'dx', 'dx_type', 'age', 'sex', 'localization']
   • Target variable: 'dx' (Disease type)
   • Unique diseases: 7

🔍 Data Quality Check:
   • Missing values: 57 total
   • Missing by column:
     - age: 57 (0.6%)
   • Duplicated rows: 0
   • Data types: {'lesion_id': dtype('O'), 'image_id': dtype('O'), 'dx': dtype('O'), 'dx_type': dtype('O'), 'age': dtype('float64'), 'sex': dtype('O'), 'localization': dtype('O')}

👀 Data Preview:


| | lesion_id | image_id | dx | dx_type | age | sex | localization |
|---|---|---|---|---|---|---|---|
| 0 | HAM_0000118 | ISIC_0027419 | bkl | histo | 80.0 | male | scalp |
| 1 | HAM_0000118 | ISIC_0025030 | bkl | histo | 80.0 | male | scalp |
| 2 | HAM_0002730 | ISIC_0026769 | bkl | histo | 80.0 | male | scalp |
| 3 | HAM_0002730 | ISIC_0025661 | bkl | histo | 80.0 | male | scalp |
| 4 | HAM_0001466 | ISIC_0031633 | bkl | histo | 75.0 | male | ear |



🧹 Data Cleaning:
   ✅ Filled 57 missing age values with median: 50.0
   ✅ No duplicate rows found!

📊 Clean dataset shape: (10015, 7)


## Step 1: Data Cleaning & Quality Assurance

In [12]:
# Data Cleaning
print("🧹 Data Cleaning:")

# Handle missing values
original_shape = df.shape
if df.isnull().sum().sum() > 0:
    # Fill missing ages with median
    if 'age' in df.columns and df['age'].isnull().sum() > 0:
        median_age = df['age'].median()
        missing_age_count = df['age'].isnull().sum()
        # Assign the result back instead of calling fillna(inplace=True) on a column selection
        df['age'] = df['age'].fillna(median_age)
        print(f"   ✅ Filled {missing_age_count} missing age values with median: {median_age}")
    
    # Fill categorical missing values with mode
    for col in df.select_dtypes(include=['object']).columns:
        if df[col].isnull().sum() > 0:
            mode_val = df[col].mode()[0]
            missing_count = df[col].isnull().sum()
            df[col] = df[col].fillna(mode_val)
            print(f"   ✅ Filled {missing_count} missing {col} values with mode: {mode_val}")
else:
    print("   ✅ No missing values found!")

# Remove duplicates if any
duplicate_count = df.duplicated().sum()
if duplicate_count > 0:
    df.drop_duplicates(inplace=True)
    print(f"   ✅ Removed {duplicate_count} duplicate rows")
else:
    print("   ✅ No duplicate rows found!")

# Check for outliers in age
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['age'] < lower_bound) | (df['age'] > upper_bound)]
print(f"   📊 Age outliers detected: {len(outliers)} samples")
print(f"   📊 Age range: {df['age'].min():.0f} - {df['age'].max():.0f} years")
print(f"   📊 Outlier bounds: {lower_bound:.1f} - {upper_bound:.1f} years")

print(f"\n📊 Clean dataset shape: {df.shape}")
print(f"   🔄 Shape change: {original_shape} → {df.shape}")

# Verify data quality
print(f"\n✅ Data Quality Summary:")
print(f"   • Missing values: {df.isnull().sum().sum()}")
print(f"   • Duplicate rows: {df.duplicated().sum()}")
print(f"   • Data integrity: 100% complete")

🧹 Data Cleaning:
   ✅ No missing values found!
   ✅ No duplicate rows found!
   📊 Age outliers detected: 39 samples
   📊 Age range: 0 - 85 years
   📊 Outlier bounds: 2.5 - 102.5 years

📊 Clean dataset shape: (10015, 7)
   🔄 Shape change: (10015, 7) → (10015, 7)

✅ Data Quality Summary:
   • Missing values: 0
   • Duplicate rows: 0
   • Data integrity: 100% complete
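
The outlier bounds reported above follow from the standard 1.5 × IQR rule. Bounds of 2.5 and 102.5 are what the rule yields for quartiles of 40 and 65; these quartile values are inferred from the printed bounds and are not printed by the notebook itself. A small sketch with a hypothetical helper `iqr_bounds`:

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower, upper) outlier bounds for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles inferred from the bounds printed above (an assumption, not notebook output)
lower, upper = iqr_bounds(40.0, 65.0)
print(lower, upper)  # 2.5 102.5
```

Since the observed maximum age of 85 is below the upper bound, all 39 flagged samples lie below the lower bound of 2.5 years.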


## Step 2: Feature Engineering

In [13]:
# Feature Engineering
print("🔧 Feature Engineering:")

# Create a copy for feature engineering
df_fe = df.copy()

# 1. Age Groups
age_bins = [0, 20, 40, 60, 80, 100]
age_labels = ['0-20', '21-40', '41-60', '61-80', '81-100']
df_fe['age_group'] = pd.cut(df_fe['age'], bins=age_bins, labels=age_labels, include_lowest=True)
print("   ✅ Created age groups: 0-20, 21-40, 41-60, 61-80, 81-100")

# 2. Age normalization (0-1 scale)
df_fe['age_normalized'] = (df_fe['age'] - df_fe['age'].min()) / (df_fe['age'].max() - df_fe['age'].min())
print(f"   ✅ Normalized age to 0-1 scale")

# 3. Binary features
df_fe['is_elderly'] = (df_fe['age'] >= 65).astype(int)
df_fe['is_young'] = (df_fe['age'] <= 30).astype(int)
df_fe['is_male'] = (df_fe['sex'] == 'male').astype(int)
print(f"   ✅ Created binary features: is_elderly, is_young, is_male")

# 4. Image count per lesion (lesion_id identifies a lesion, and one lesion can have several images)
lesion_counts = df_fe.groupby('lesion_id').size().reset_index(name='lesion_count')
df_fe = df_fe.merge(lesion_counts, on='lesion_id', how='left')
print(f"   ✅ Added lesion count per patient (range: {df_fe['lesion_count'].min()}-{df_fe['lesion_count'].max()})")

# 5. Body region grouping
# Group similar body locations
torso_locations = ['back', 'chest', 'abdomen']
extremity_locations = ['upper extremity', 'lower extremity', 'hand', 'foot']
head_locations = ['face', 'scalp', 'neck', 'ear']

def categorize_location(location):
    location = location.lower() if pd.notna(location) else 'unknown'
    if any(loc in location for loc in torso_locations):
        return 'torso'
    elif any(loc in location for loc in extremity_locations):
        return 'extremity'
    elif any(loc in location for loc in head_locations):
        return 'head_neck'
    else:
        return 'other'

df_fe['body_region'] = df_fe['localization'].apply(categorize_location)
print(f"   ✅ Grouped body locations into regions: {df_fe['body_region'].unique()}")

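# 6. Disease risk categories (based on medical knowledge)
# NOTE: this indicator is computed from the target 'dx', so it leaks the label;
# exclude it from the feature matrix when training a classifier for 'dx'.
high_risk_diseases = ['mel', 'bcc']  # Melanoma, Basal Cell Carcinoma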
df_fe['is_high_risk'] = df_fe['dx'].isin(high_risk_diseases).astype(int)
print(f"   ✅ Created high-risk disease indicator")

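# 7. Diagnosis confidence (based on dx_type; the scores are heuristic choices)
# 'confocal' has no entry in the mapping, so those rows become NaN
confidence_mapping = {'histo': 1.0, 'follow_up': 0.8, 'consensus': 0.6}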
df_fe['diagnosis_confidence'] = df_fe['dx_type'].map(confidence_mapping)
print(f"   ✅ Added diagnosis confidence scores")

# Display new features
print(f"\n📊 Feature Engineering Summary:")
print(f"   • Original features: {len(df.columns)}")
print(f"   • New features: {len(df_fe.columns) - len(df.columns)}")
print(f"   • Total features: {len(df_fe.columns)}")

new_features = [col for col in df_fe.columns if col not in df.columns]
print(f"   • New feature list: {new_features}")

# Show feature distributions
print(f"\n📈 New Feature Distributions:")
for feature in ['age_group', 'body_region', 'is_high_risk']:
    if feature in df_fe.columns:
        print(f"   • {feature}:")
        print(f"     {dict(df_fe[feature].value_counts())}")
        
print(f"\n✅ Feature Engineering completed!")

🔧 Feature Engineering:
   ✅ Created age groups: 0-20, 21-40, 41-60, 61-80, 81-100
   ✅ Normalized age to 0-1 scale
   ✅ Created binary features: is_elderly, is_young, is_male
   ✅ Added lesion count per patient (range: 1-6)
   ✅ Grouped body locations into regions: ['head_neck' 'torso' 'other' 'extremity']
   ✅ Created high-risk disease indicator
   ✅ Added diagnosis confidence scores

📊 Feature Engineering Summary:
   • Original features: 7
   • New features: 9
   • Total features: 16
   • New feature list: ['age_group', 'age_normalized', 'is_elderly', 'is_young', 'is_male', 'lesion_count', 'body_region', 'is_high_risk', 'diagnosis_confidence']

📈 New Feature Distributions:
   • age_group:
     {'41-60': np.int64(4355), '61-80': np.int64(2509), '21-40': np.int64(2449), '0-20': np.int64(4
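
How `pd.cut` assigns ages that sit exactly on a bin edge is easy to get wrong: intervals are right-closed by default, and `include_lowest=True` additionally pulls age 0 into the first bin. A small sketch with made-up ages, reusing the bin edges and labels from the cell above:

```python
import pandas as pd

age_bins = [0, 20, 40, 60, 80, 100]
age_labels = ['0-20', '21-40', '41-60', '61-80', '81-100']

# Made-up ages chosen to sit on or next to the bin edges
ages = pd.Series([0, 20, 21, 40, 85])
groups = pd.cut(ages, bins=age_bins, labels=age_labels, include_lowest=True)
print(groups.astype(str).tolist())
# ['0-20', '0-20', '21-40', '21-40', '81-100']
```

Without `include_lowest=True`, age 0 would fall outside the first interval and become NaN.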

## Step 3: Data Encoding & Transformation

In [14]:
# Data Encoding & Transformation
print("Data Encoding & Transformation:")

# Ensure df_fe is defined (run feature engineering cell first)
try:
    df_encoded = df_fe.copy()
except NameError:
    raise RuntimeError("Variable 'df_fe' is not defined. Please run the Feature Engineering cell before this one.")

# 1. Label Encoding for target variable
label_encoder_target = LabelEncoder()
df_encoded['dx_encoded'] = label_encoder_target.fit_transform(df_encoded['dx'])

# Build the target mapping from the fitted encoder so that it matches 'dx_encoded'
# (LabelEncoder assigns indices in sorted label order, not in order of appearance)
target_mapping = {label: idx for idx, label in enumerate(label_encoder_target.classes_)}
reverse_target_mapping = {idx: label for label, idx in target_mapping.items()}

print(f"Target variable mapping:")
for label, idx in target_mapping.items():
    count = (df_encoded['dx'] == label).sum()
    print(f"      {label} → {idx} ({count} samples)")

# 2. Label Encoding for dx_type
ordinal_features = ['dx_type']  # Integer codes follow alphabetical order, not a meaningful rank
label_encoders = {}

for feature in ordinal_features:
    if feature in df_encoded.columns:
        le = LabelEncoder()
        df_encoded[f'{feature}_encoded'] = le.fit_transform(df_encoded[feature])
        label_encoders[feature] = le
        print(f"Label encoded {feature}: {dict(zip(le.classes_, le.transform(le.classes_)))}")

# 3. One-Hot Encoding for nominal categorical features
nominal_features = ['sex', 'localization', 'age_group', 'body_region']
categorical_encoded_features = []

for feature in nominal_features:
    if feature in df_encoded.columns:
        # Get unique values
        unique_vals = df_encoded[feature].unique()
        
        # Create one-hot encoded columns
        for val in unique_vals:
            new_col = f'{feature}_{val}'
            df_encoded[new_col] = (df_encoded[feature] == val).astype(int)
            categorical_encoded_features.append(new_col)
        
        print(f"One-hot encoded {feature}: {len(unique_vals)} categories")

# 4. Feature scaling for numerical features
numerical_features = ['age', 'age_normalized', 'lesion_count', 'diagnosis_confidence']
scalers = {}

# Fit each scaler once on all numerical columns so the stored scaler can be reused later
numerical_present = [f for f in numerical_features if f in df_encoded.columns]

# Standard Scaling
scaler_standard = StandardScaler()
scaled_standard = scaler_standard.fit_transform(df_encoded[numerical_present])
for i, feature in enumerate(numerical_present):
    df_encoded[f'{feature}_scaled'] = scaled_standard[:, i]

scalers['standard'] = scaler_standard
print(f"Standard scaled numerical features")

# MinMax Scaling
scaler_minmax = MinMaxScaler()
scaled_minmax = scaler_minmax.fit_transform(df_encoded[numerical_present])
for i, feature in enumerate(numerical_present):
    df_encoded[f'{feature}_minmax'] = scaled_minmax[:, i]

scalers['minmax'] = scaler_minmax
print(f"MinMax scaled numerical features")

# 5. Create final feature sets
# Original categorical columns to drop
original_categorical = ['lesion_id', 'image_id', 'dx', 'dx_type', 'sex', 'localization', 'age_group', 'body_region']

# Features for modeling
numerical_cols = [col for col in df_encoded.columns if any(x in col for x in ['age', 'lesion_count', 'diagnosis_confidence'])]
binary_cols = [col for col in df_encoded.columns if col.startswith(('is_', 'age_group_', 'body_region_', 'sex_', 'localization_'))]
encoded_cols = [col for col in df_encoded.columns if col.endswith('_encoded') and col != 'dx_encoded']

feature_columns = numerical_cols + binary_cols + encoded_cols
print(f"\nFeature Engineering Summary:")
print(f"   • Numerical features: {len([col for col in numerical_cols if df_encoded[col].dtype != 'O'])}")
print(f"   • Binary/One-hot features: {len(binary_cols)}")
print(f"   • Label encoded features: {len(encoded_cols)}")
print(f"   • Total modeling features: {len(feature_columns)}")

# Display sample of encoded data
print(f"\n👀 Encoded Data Preview:")
preview_cols = ['dx', 'dx_encoded'] + feature_columns[:5]
display(df_encoded[preview_cols].head())

Data Encoding & Transformation:
Target variable mapping:
      akiec → 0 (327 samples)
      bcc → 1 (514 samples)
      bkl → 2 (1099 samples)
      df → 3 (115 samples)
      mel → 4 (1113 samples)
      nv → 5 (6705 samples)
      vasc → 6 (142 samples)
Label encoded dx_type: {'confocal': np.int64(0), 'consensus': np.int64(1), 'follow_up': np.int64(2), 'histo': np.int64(3)}
One-hot encoded sex: 3 categories
One-hot encoded localization: 15 categories
One-hot encoded age_group: 5 categories
One-hot encoded body_region: 4 categories
Standard scaled numerical features
MinMax scaled numerical features

Feature Engineering Summary:
   • Numerical features: 18
   • Binary/One-hot features: 31
   • Label encoded features: 1
   • Total modeling features: 51

👀 Encoded Data Preview:

| | dx | dx_encoded | image_id | age | age_group | age_normalized | lesion_count |
|---|---|---|---|---|---|---|---|
| 0 | bkl | 2 | ISIC_0027419 | 80.0 | 61-80 | 0.941176 | 2 |
| 1 | bkl | 2 | ISIC_0025030 | 80.0 | 61-80 | 0.941176 | 2 |
| 2 | bkl | 2 | ISIC_0026769 | 80.0 | 61-80 | 0.941176 | 2 |
| 3 | bkl | 2 | ISIC_0025661 | 80.0 | 61-80 | 0.941176 | 2 |
| 4 | bkl | 2 | ISIC_0031633 | 75.0 | 61-80 | 0.882353 | 2 |
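
The preview above contains `image_id` and `age_group` because selecting columns with a substring test such as `'age' in col` also matches `image_id` (the string "image" contains "age") and every `age_group_*` column. A minimal sketch of the pitfall and of one explicit alternative follows; the column list is a shortened, illustrative subset, and `base_numeric` and `allowed` are hypothetical names.

```python
columns = ['image_id', 'age', 'age_group', 'age_normalized',
           'age_group_61-80', 'lesion_count', 'age_scaled', 'is_male']

# Substring test: also picks up the identifier and the categorical columns
by_substring = [c for c in columns if 'age' in c]

# Explicit alternative: name the base features and accept only known suffixes
base_numeric = ['age', 'age_normalized', 'lesion_count']
suffixes = ('', '_scaled', '_minmax')
allowed = {base + s for base in base_numeric for s in suffixes}
by_name = [c for c in columns if c in allowed]

print(by_substring)
print(by_name)
```

Matching on exact names also keeps one-hot columns from being listed twice, once as "numerical" and once as "binary".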


## Step 4: Data Splitting & Stratification

In [16]:
# Data Splitting with Stratification
print("Data Splitting & Stratification:")

# Fail fast if the encoding cell has not been run
if 'df_encoded' not in globals():
    raise RuntimeError("Variable 'df_encoded' is not defined. Please run the Data Encoding & Transformation cell before this one.")

# Keep only numeric modeling features (drops identifiers and raw categorical columns)
non_numeric_cols = ['image_id', 'age_group']
modeling_features = [col for col in feature_columns if col not in non_numeric_cols and pd.api.types.is_numeric_dtype(df_encoded[col])]
X = df_encoded[modeling_features].copy()
y = df_encoded['dx_encoded']

# Handle any remaining NaN values (e.g. unmapped diagnosis_confidence)
X = X.fillna(0)

print(f"Feature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
print(f"Number of classes: {y.nunique()}")

# Check class distribution before splitting
class_distribution = y.value_counts().sort_index()
print(f"\nOriginal class distribution:")
for class_idx, count in class_distribution.items():
    class_name = reverse_target_mapping[class_idx]
    percentage = count / len(y) * 100
    print(f"Class {class_idx} ({class_name}): {count:,} samples ({percentage:.1f}%)")

# First split: Train/Test (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y
)

# Second split: Train/Validation (64/16 of original, maintaining 80/20 split)
X_train_final, X_val, y_train_final, y_val = train_test_split(
    X_train, y_train,
    test_size=0.2,  # 0.2 of 0.8 = 0.16 of total
    random_state=42,
    stratify=y_train
)

print(f"\nData splitting completed:")
print(f"Training set: {X_train_final.shape[0]:,} samples ({X_train_final.shape[0]/len(X)*100:.1f}%)")
print(f"Validation set: {X_val.shape[0]:,} samples ({X_val.shape[0]/len(X)*100:.1f}%)")
print(f"Test set: {X_test.shape[0]:,} samples ({X_test.shape[0]/len(X)*100:.1f}%)")

# Verify stratification worked
print(f"\nClass distribution verification:")
sets = {'Train': y_train_final, 'Validation': y_val, 'Test': y_test}

for set_name, y_set in sets.items():
    print(f"      {set_name} set:")
    set_distribution = y_set.value_counts().sort_index()
    for class_idx, count in set_distribution.items():
        class_name = reverse_target_mapping[class_idx]
        percentage = count / len(y_set) * 100
        print(f"         Class {class_idx} ({class_name}): {count:,} ({percentage:.1f}%)")

# Feature statistics
print(f"\nFeature Statistics:")
print(f"      • Total features: {X.shape[1]}")
print(f"      • Numerical features: {len([col for col in feature_columns if any(x in col for x in ['age', 'lesion_count', 'diagnosis_confidence'])])}")
print(f"      • Categorical features: {len([col for col in feature_columns if not any(x in col for x in ['age', 'lesion_count', 'diagnosis_confidence'])])}")
print(f"      • Feature value range: [{X.min().min():.3f}, {X.max().max():.3f}]")

Data Splitting & Stratification:
Feature matrix shape: (10015, 49)
Target vector shape: (10015,)
Number of classes: 7

Original class distribution:
Class 0 (akiec): 327 samples (3.3%)
Class 1 (bcc): 514 samples (5.1%)
Class 2 (bkl): 1,099 samples (11.0%)
Class 3 (df): 115 samples (1.1%)
Class 4 (mel): 1,113 samples (11.1%)
Class 5 (nv): 6,705 samples (66.9%)
Class 6 (vasc): 142 samples (1.4%)

Data splitting completed:
Training set: 6,409 samples (64.0%)
Validation set: 1,603 samples (16.0%)
Test set: 2,003 samples (20.0%)

Class distribution verification:
      Train set:
         Class 0 (akiec): 209 (3.3%)
         Class 1 (bcc): 329 (5.1%)
         Class 2 (bkl): 703 (11.0%)
         Class 3 (df): 74 (1.2%)
         Class 4 (mel): 712 (11.1%)
         Class 5 (nv): 4,291 (67.0%)
         Class 6 (vasc): 91 (1.4%)
      Validation set:
         Class 0 (akiec): 53 (3.3%)
         Class 1 (bcc): 82 (5.1%)
         Class 2 (bkl): 176 (11.0%)
         Class 3 (df): 18 (1.1%)
         Clas
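
A row-level split can place two images of the same lesion (same `lesion_id`, as in the first rows of the data preview) on both sides of the split, which tends to inflate test scores. One possible alternative, not used in this notebook, is a group-aware split with scikit-learn's `GroupShuffleSplit`. A minimal sketch on made-up rows:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Made-up rows: several lesions have more than one image
toy = pd.DataFrame({
    "lesion_id": ["L1", "L1", "L2", "L2", "L3", "L4", "L5", "L5", "L6", "L7"],
    "age":       [80, 80, 80, 80, 75, 60, 45, 45, 30, 55],
    "dx":        ["bkl", "bkl", "bkl", "bkl", "bkl", "nv", "nv", "nv", "mel", "nv"],
})

# Split by lesion: every image of a lesion goes to the same side
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(
    splitter.split(toy[["age"]], toy["dx"], groups=toy["lesion_id"])
)

train_lesions = set(toy.iloc[train_idx]["lesion_id"])
test_lesions = set(toy.iloc[test_idx]["lesion_id"])
print(train_lesions & test_lesions)  # set(): no lesion appears in both splits
```

`GroupShuffleSplit` does not stratify by class; `StratifiedGroupKFold` is the scikit-learn option when both grouping and class balance matter.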

## Step 5: Class Imbalance Handling & Feature Selection

In [18]:
# Class Imbalance Handling
print("⚖️ Class Imbalance Handling:")

# Calculate imbalance ratio
class_counts = y_train_final.value_counts().sort_index()
majority_count = class_counts.max()
minority_count = class_counts.min()
imbalance_ratio = majority_count / minority_count

print(f"   📊 Imbalance Analysis:")
print(f"      • Majority class: {majority_count:,} samples")
print(f"      • Minority class: {minority_count:,} samples")
print(f"      • Imbalance ratio: {imbalance_ratio:.1f}:1")

# Apply SMOTE for oversampling
if imbalance_ratio > 2:  # Apply balancing if significantly imbalanced
    print(f"\n   🔄 Applying SMOTE for class balancing...")
    
    # SMOTE
    smote = SMOTE(random_state=42, k_neighbors=3)
    X_train_smote, y_train_smote = smote.fit_resample(X_train_final, y_train_final)
    
    print(f"      ✅ SMOTE completed:")
    print(f"         Original training samples: {len(X_train_final):,}")
    print(f"         SMOTE training samples: {len(X_train_smote):,}")
    
    # Verify SMOTE distribution
    smote_distribution = pd.Series(y_train_smote).value_counts().sort_index()
    print(f"      📊 SMOTE class distribution:")
    for class_idx, count in smote_distribution.items():
        class_name = reverse_target_mapping[class_idx]
        percentage = count / len(y_train_smote) * 100
        print(f"         Class {class_idx} ({class_name}): {count:,} ({percentage:.1f}%)")
else:
    print(f"   ✅ Dataset is relatively balanced, no SMOTE applied")
    X_train_smote, y_train_smote = X_train_final.copy(), y_train_final.copy()

# Feature Selection
print(f"\n🎯 Feature Selection:")

# 1. Mutual Information
mi_selector = SelectKBest(score_func=mutual_info_classif, k=15)
X_train_mi = mi_selector.fit_transform(X_train_final, y_train_final)

# Get selected features
selected_features_mi = np.array(modeling_features)[mi_selector.get_support()]
mi_scores = mi_selector.scores_[mi_selector.get_support()]

print(f"   📊 Mutual Information Feature Selection:")
print(f"      • Selected features: {len(selected_features_mi)}")
print(f"      • Top 5 features by MI score:")
mi_ranking = sorted(zip(selected_features_mi, mi_scores), key=lambda x: x[1], reverse=True)
for i, (feature, score) in enumerate(mi_ranking[:5]):
    print(f"         {i+1}. {feature}: {score:.4f}")

# 2. Random Forest Feature Importance
rf_selector = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_selector.fit(X_train_final, y_train_final)

feature_importance = rf_selector.feature_importances_
importance_ranking = sorted(zip(modeling_features, feature_importance), key=lambda x: x[1], reverse=True)

print(f"\n   🌳 Random Forest Feature Importance:")
print(f"      • Top 10 features by importance:")
for i, (feature, importance) in enumerate(importance_ranking[:10]):
    print(f"         {i+1}. {feature}: {importance:.4f}")

# Select top features for final model
top_k_features = 20
selected_features_final = [feat for feat, _ in importance_ranking[:top_k_features]]

print(f"\n   ✅ Final feature selection:")
print(f"      • Selected {len(selected_features_final)} most important features")
print(f"      • Feature reduction: {len(feature_columns)} → {len(selected_features_final)}")

# Create final feature sets
X_train_final_selected = X_train_final[selected_features_final]
X_val_final = X_val[selected_features_final]
X_test_final = X_test[selected_features_final]
X_train_smote_selected = X_train_smote[selected_features_final] if 'X_train_smote' in locals() else X_train_final_selected

print(f"\n📊 Final Dataset Shapes:")
print(f"   • Training (original): {X_train_final_selected.shape}")
print(f"   • Training (SMOTE): {X_train_smote_selected.shape}")
print(f"   • Validation: {X_val_final.shape}")
print(f"   • Test: {X_test_final.shape}")

⚖️ Class Imbalance Handling:
   📊 Imbalance Analysis:
      • Majority class: 4,291 samples
      • Minority class: 74 samples
      • Imbalance ratio: 58.0:1

   🔄 Applying SMOTE for class balancing...
      ✅ SMOTE completed:
         Original training samples: 6,409
         SMOTE training samples: 30,037
      📊 SMOTE class distribution:
         Class 0 (akiec): 4,291 (14.3%)
         Class 1 (bcc): 4,291 (14.3%)
         Class 2 (bkl): 4,291 (14.3%)
         Class 3 (df): 4,291 (14.3%)
         Class 4 (mel): 4,291 (14.3%)
         Class 5 (nv): 4,291 (14.3%)
         Class 6 (vasc): 4,291 (14.3%)

🎯 Feature Selection:
   📊 Mutual Information Feature Selection:
      • Selected features: 15
      • Top 5 features by MI score:
         1. is_high_risk: 0.4427
         2. dx_type_encoded: 0.2621
         3. diagnosis_confidence_scaled: 0.2606
         4. diagnosis_confidence: 0.2579
         5. diagnosis_confidence_minmax: 0.2456
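
Class weights, listed in the key steps as the alternative to SMOTE, are not computed in this notebook. The following is a minimal sketch of the common "balanced" heuristic, `n_samples / (n_classes * count)`, using the training-set counts printed in the Step 4 stratification check; the variable names are illustrative.

```python
# Training-set class counts, keyed by encoded class index (from the Step 4 output)
train_counts = {0: 209, 1: 329, 2: 703, 3: 74, 4: 712, 5: 4291, 6: 91}

n_samples = sum(train_counts.values())   # 6409
n_classes = len(train_counts)            # 7

# "Balanced" heuristic: rarer classes receive proportionally larger weights
class_weights = {
    cls: n_samples / (n_classes * count)
    for cls, count in train_counts.items()
}

for cls, weight in class_weights.items():
    print(f"Class {cls}: {weight:.3f}")
```

A dictionary like this can be passed as `class_weight` to estimators such as `RandomForestClassifier`, which leaves the training data itself unchanged instead of adding synthetic samples.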

## Step 6: Save Processed Datasets

In [19]:
# Save Processed Datasets
print("💾 Saving Processed Datasets:")

import os
import pickle

# Create processed data directory
processed_dir = '../Dataset/processed'
os.makedirs(processed_dir, exist_ok=True)

# Save train/validation/test splits
datasets = {
    'X_train': X_train_final_selected,
    'X_train_smote': X_train_smote_selected,
    'y_train': y_train_final,
    'y_train_smote': y_train_smote,
    'X_val': X_val_final,
    'y_val': y_val,
    'X_test': X_test_final,
    'y_test': y_test
}

# Save as CSV files
for name, data in datasets.items():
    filepath = os.path.join(processed_dir, f'{name}.csv')
    if isinstance(data, pd.DataFrame):
        data.to_csv(filepath, index=False)
    else:  # Series (target variables)
        pd.DataFrame(data).to_csv(filepath, index=False)
    print(f"   ✅ Saved {name}: {filepath}")

# Save metadata and mappings
metadata = {
    'feature_columns': selected_features_final,
    'target_mapping': target_mapping,
    'reverse_target_mapping': reverse_target_mapping,
    'label_encoders': label_encoders,
    'scalers': scalers,
    'feature_importance_ranking': importance_ranking[:20],
    'dataset_shapes': {
        'original': df.shape,
        'engineered': df_fe.shape,
        'final_features': len(selected_features_final),
        'train': X_train_final_selected.shape,
        'train_smote': X_train_smote_selected.shape,
        'validation': X_val_final.shape,
        'test': X_test_final.shape
    },
    'class_distribution': {
        'original': dict(y.value_counts().sort_index()),
        'train': dict(y_train_final.value_counts().sort_index()),
        'train_smote': dict(pd.Series(y_train_smote).value_counts().sort_index()),
        'validation': dict(y_val.value_counts().sort_index()),
        'test': dict(y_test.value_counts().sort_index())
    }
}

# Save metadata as pickle
metadata_path = os.path.join(processed_dir, 'preprocessing_metadata.pkl')
with open(metadata_path, 'wb') as f:
    pickle.dump(metadata, f)
print(f"   ✅ Saved metadata: {metadata_path}")

# Save feature list as text file
feature_list_path = os.path.join(processed_dir, 'selected_features.txt')
with open(feature_list_path, 'w') as f:
    f.write("Selected Features for Model Training:\n")
    f.write("="*50 + "\n")
    for i, feature in enumerate(selected_features_final, 1):
        f.write(f"{i:2d}. {feature}\n")
print(f"   ✅ Saved feature list: {feature_list_path}")

# Create preprocessing summary report
summary_path = os.path.join(processed_dir, 'preprocessing_summary.txt')
with open(summary_path, 'w', encoding='utf-8') as f:  # UTF-8 so the emoji headers are written on any platform
    f.write("PREPROCESSING SUMMARY REPORT\n")
    f.write("="*60 + "\n\n")
    
    f.write("📊 DATASET OVERVIEW:\n")
    f.write(f"   • Original samples: {df.shape[0]:,}\n")
    f.write(f"   • Original features: {df.shape[1]}\n")
    f.write(f"   • Final features: {len(selected_features_final)}\n")
    # Reduction is measured against the candidate modeling features, not the raw columns
    f.write(f"   • Feature reduction: {(1 - len(selected_features_final)/len(modeling_features))*100:.1f}%\n\n")
    
    f.write("🔧 FEATURE ENGINEERING:\n")
    f.write(f"   • Age groups created\n")
    f.write(f"   • Binary indicators added\n")
    f.write(f"   • Body region grouping\n")
    f.write(f"   • Risk categorization\n\n")
    
    f.write("⚖️ CLASS BALANCING:\n")
    f.write(f"   • Imbalance ratio: {imbalance_ratio:.1f}:1\n")
    f.write(f"   • SMOTE applied: {'Yes' if imbalance_ratio > 2 else 'No'}\n")
    f.write(f"   • Final training samples: {len(X_train_smote_selected):,}\n\n")
    
    f.write("✂️ DATA SPLITTING:\n")
    f.write(f"   • Training: {X_train_final_selected.shape[0]:,} ({X_train_final_selected.shape[0]/len(X)*100:.1f}%)\n")
    f.write(f"   • Validation: {X_val_final.shape[0]:,} ({X_val_final.shape[0]/len(X)*100:.1f}%)\n")
    f.write(f"   • Test: {X_test_final.shape[0]:,} ({X_test_final.shape[0]/len(X)*100:.1f}%)\n\n")
    
    f.write("🎯 TOP FEATURES:\n")
    for i, (feature, importance) in enumerate(importance_ranking[:10], 1):
        f.write(f"   {i:2d}. {feature}: {importance:.4f}\n")

print(f"   ✅ Saved summary report: {summary_path}")

print(f"\n🎉 PREPROCESSING COMPLETED SUCCESSFULLY!")
print(f"\n📁 Processed files saved to: {processed_dir}")
print(f"   📊 Ready for model training in the Models/ folder")
print(f"   🚀 Use these datasets to train your ML models")

💾 Saving Processed Datasets:
   ✅ Saved X_train: ../Dataset/processed\X_train.csv
   ✅ Saved X_train_smote: ../Dataset/processed\X_train_smote.csv
   ✅ Saved y_train: ../Dataset/processed\y_train.csv
   ✅ Saved y_train_smote: ../Dataset/processed\y_train_smote.csv
   ✅ Saved X_val: ../Dataset/processed\X_val.csv
   ✅ Saved y_val: ../Dataset/processed\y_val.csv
   ✅ Saved X_test: ../Dataset/processed\X_test.csv
   ✅ Saved y_test: ../Dataset/processed\y_test.csv
   ✅ Saved metadata: ../Dataset/processed\preprocessing_metadata.pkl
   ✅ Saved feature list: ../Dataset/processed\selected_features.txt
   ✅ Saved summary report: ../Dataset/processed\preprocessing_summary.txt

🎉 PREPROCESSING COMPLETED SUCCESSFULLY!

📁 Processed files saved to: ../Dataset/processed
   📊 Ready for model training in the Models/ folder
   🚀 Use these datasets to train your ML models
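
Because the targets are written with `pd.DataFrame(data).to_csv(...)`, each `y_*.csv` is a one-column table and has to be squeezed back into a Series when it is loaded for training. A minimal round-trip sketch in a temporary directory; the toy values and the temporary path are stand-ins, not the notebook's files:

```python
import os
import tempfile

import pandas as pd

# Toy target vector standing in for y_train
y_toy = pd.Series([2, 5, 5, 4, 0], name="dx_encoded")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "y_train.csv")
    # Same saving pattern as in the cell above
    pd.DataFrame(y_toy).to_csv(path, index=False)

    # Reading returns a one-column DataFrame; squeeze it into a Series
    y_loaded = pd.read_csv(path).squeeze("columns")

print(y_loaded.tolist())  # [2, 5, 5, 4, 0]
print(y_loaded.name)      # dx_encoded
```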
