# Data Preprocessing for Hepatitis C Prediction

This notebook handles data cleaning, feature engineering, and preparation for machine learning. We'll transform the raw data into a format suitable for neural network training.

## Objectives
- Clean and handle missing values
- Encode categorical variables
- Scale numerical features
- Split data into train/test sets
- Save processed data for modeling

In [8]:
import pandas as pd
import numpy as np
import sys
import pickle
import os

sys.path.append('../src')
from data import load_raw_data, clean_data, prepare_features, split_and_scale_data

print("Libraries imported successfully")

Libraries imported successfully


## 1. Load and Clean Data

In [9]:
df = load_raw_data('../data/raw/hepatitis_data.csv')

if df is not None:
    print("Raw data loaded successfully")
    
    cleaned_data, sex_encoder = clean_data(df)
    
    if cleaned_data is not None:
        print(f"Cleaned data shape: {cleaned_data.shape}")
        print("Target distribution:")
        print(cleaned_data['target'].value_counts())
    else:
        print("Data cleaning failed")
else:
    print("Failed to load data")

Dataset loaded successfully: (615, 14)
Raw data loaded successfully
Data cleaned successfully
Healthy: 540 samples
Hepatitis C: 75 samples
Cleaned data shape: (615, 16)
Target distribution:
target
0    540
1     75
Name: count, dtype: int64
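
`clean_data` lives in `../src/data` and is not shown in this notebook. The sketch below is a guess at the kind of work it does, based on the output above: two columns are added, one of them a binary `target`, and a fitted `LabelEncoder` is returned. The function name `clean_data_sketch`, the column names `Category` and `Sex`, and the label convention are assumptions for illustration, not the project's actual code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def clean_data_sketch(df: pd.DataFrame):
    """Hypothetical stand-in for clean_data: adds 'target' and 'sex_encoded'."""
    out = df.copy()
    # Assumption: category labels starting with "0" mark healthy donors,
    # and every other label counts as a positive case.
    is_healthy = out["Category"].astype(str).str.startswith("0")
    out["target"] = (~is_healthy).astype(int)
    # LabelEncoder assigns integer codes in sorted order of the labels.
    encoder = LabelEncoder()
    out["sex_encoded"] = encoder.fit_transform(out["Sex"])
    return out, encoder


# Tiny made-up frame, only to exercise the sketch.
demo = pd.DataFrame({
    "Category": ["0=Blood Donor", "1=Hepatitis", "0=Blood Donor", "3=Cirrhosis"],
    "Sex": ["m", "f", "f", "m"],
    "Age": [32, 45, 51, 60],
})
cleaned_demo, demo_encoder = clean_data_sketch(demo)
print(cleaned_demo[["target", "sex_encoded"]])
```

Returning the fitted encoder alongside the frame, as the real `clean_data` does, lets the same mapping be reused when new samples are scored later.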


## 2. Prepare Features

In [10]:
if 'cleaned_data' in locals() and cleaned_data is not None:
    X, y, imputer = prepare_features(cleaned_data)
    
    if X is not None:
        print(f"Features prepared: {X.shape}")
        print(f"Feature columns: {list(X.columns)}")
        print(f"Target distribution: {y.value_counts().to_dict()}")
        
        print("\nFeature Summary:")
        display(X.describe())
    else:
        print("Feature preparation failed")
else:
    print("No cleaned data available")

Features prepared: (615, 12)
Missing values after imputation: 0
Features prepared: (615, 12)
Feature columns: ['Age', 'ALB', 'ALP', 'ALT', 'AST', 'BIL', 'CHE', 'CHOL', 'CREA', 'GGT', 'PROT', 'sex_encoded']
Target distribution: {0: 540, 1: 75}

Feature Summary:


| | Age | ALB | ALP | ALT | AST | BIL | CHE | CHOL | CREA | GGT | PROT | sex_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 | 615.0 |
| mean | 47.40813 | 41.620732 | 68.222927 | 28.441951 | 34.786341 | 11.396748 | 8.196634 | 5.366992 | 81.287805 | 39.533171 | 72.04439 | 0.613008 |
| std | 10.055105 | 5.775935 | 25.646364 | 25.449889 | 33.09069 | 19.67315 | 2.205657 | 1.123499 | 49.756166 | 54.661071 | 5.398238 | 0.487458 |
| min | 19.0 | 14.9 | 11.3 | 0.9 | 10.6 | 0.8 | 1.42 | 1.43 | 8.0 | 4.5 | 44.8 | 0.0 |
| 25% | 39.0 | 38.8 | 52.95 | 16.4 | 21.6 | 5.3 | 6.935 | 4.62 | 67.0 | 15.7 | 69.3 | 0.0 |
| 50% | 47.0 | 41.95 | 66.2 | 23.0 | 25.9 | 7.3 | 8.26 | 5.3 | 77.0 | 23.3 | 72.2 | 1.0 |
| 75% | 54.0 | 45.2 | 79.3 | 33.05 | 32.9 | 11.2 | 9.59 | 6.055 | 88.0 | 40.2 | 75.4 | 1.0 |
| max | 77.0 | 82.2 | 416.6 | 325.3 | 324.0 | 254.0 | 16.41 | 9.67 | 1079.1 | 650.9 | 90.0 | 1.0 |
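
`prepare_features` returns a fitted `SimpleImputer` and reports zero missing values afterwards. The sketch below shows what median imputation does on a small made-up frame. The median strategy matches what the Summary section states, but the frame, its values, and the variable names here are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values with one gap per column.
demo = pd.DataFrame({
    "ALB": [38.5, np.nan, 46.9, 43.2],
    "CHOL": [3.2, 4.8, np.nan, 5.2],
})

# Each NaN is replaced by the median of the observed values in its column.
imputer_demo = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer_demo.fit_transform(demo), columns=demo.columns)

print(filled)
print("Missing values after imputation:", int(filled.isna().sum().sum()))
```

The median is less sensitive than the mean to extreme values such as the `CREA` and `GGT` maxima in the summary table, which is one plausible reason to prefer it here.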


## 3. Split and Scale Data

In [11]:
if 'X' in locals() and X is not None:
    X_train, X_test, y_train, y_test, scaler = split_and_scale_data(X, y)
    
    print("Data split and scaled successfully!")
    print(f"Training features shape: {X_train.shape}")
    print(f"Test features shape: {X_test.shape}")
    print(f"Training targets shape: {y_train.shape}")
    print(f"Test targets shape: {y_test.shape}")
    
    # Check class distribution in splits
    for split_name, targets in [("Training", y_train), ("Test", y_test)]:
        print(f"\n{split_name} set class distribution:")
        for label, value in [("Healthy", 0), ("Hepatitis C", 1)]:
            count = (targets == value).sum()
            print(f"  {label}: {count} ({count / len(targets) * 100:.1f}%)")
else:
    print("No features available for splitting")

✅ Data split and scaled:
   Training set: (492, 12)
   Test set: (123, 12)
Data split and scaled successfully!
Training features shape: (492, 12)
Test features shape: (123, 12)
Training targets shape: (492,)
Test targets shape: (123,)

Training set class distribution:
  Healthy: 432 (87.8%)
  Hepatitis C: 60 (12.2%)

Test set class distribution:
  Healthy: 108 (87.8%)
  Hepatitis C: 15 (12.2%)
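
Both splits show the same 87.8% / 12.2% class balance, which is what a stratified split produces. The sketch below reproduces that pattern on synthetic data with the same class counts. It assumes `split_and_scale_data` uses `train_test_split` with `stratify` and fits `StandardScaler` on the training rows only; the real function is not shown here, and the `random_state` value and variable names are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the notebook's shape and class counts.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(615, 12))
y_demo = np.array([0] * 540 + [1] * 75)

# stratify=y_demo keeps the class proportions equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fit the scaler on the training rows only, then reuse it for the test rows,
# so no test-set statistics leak into training.
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
print("Positive cases (train/test):", int(y_tr.sum()), int(y_te.sum()))
```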


## 4. Save Processed Data

In [12]:
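# NOTE: this guard fails as written. Inside a generator expression, locals()
# returns the generator's own scope (only `var`), not the notebook namespace,
# so all(...) is False even when every variable exists. Use globals() instead.
if all(var in locals() for var in ['X_train', 'X_test', 'y_train', 'y_test', 'scaler', 'imputer', 'sex_encoder']):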
    
    os.makedirs('../data/processed', exist_ok=True)
    
    np.save('../data/processed/X_train.npy', X_train)
    np.save('../data/processed/X_test.npy', X_test)
    np.save('../data/processed/y_train.npy', y_train)
    np.save('../data/processed/y_test.npy', y_test)
    
    preprocessing_info = {
        'scaler': scaler,
        'imputer': imputer,
        'sex_encoder': sex_encoder,
        'feature_names': list(X.columns),
        'n_features': X.shape[1],
        'n_samples_train': len(y_train),
        'n_samples_test': len(y_test)
    }
    
    with open('../data/processed/preprocessing_info.pkl', 'wb') as f:
        pickle.dump(preprocessing_info, f)
    
    print("Processed data saved successfully!")
    print("Files saved:")
    print("X_train.npy, X_test.npy (features)")
    print("y_train.npy, y_test.npy (targets)")
    print("preprocessing_info.pkl (preprocessing objects)")
    
    print(f"\nFinal dataset summary:")
    print(f"Features: {preprocessing_info['n_features']}")
    print(f"Training samples: {preprocessing_info['n_samples_train']}")
    print(f"Test samples: {preprocessing_info['n_samples_test']}")
    print(f"Feature names: {preprocessing_info['feature_names']}")
    
else:
    print("Not all variables are available for saving")

Not all variables are available for saving
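
The guard above reports missing variables even though the next cell finds all of them. A likely cause is Python scoping: a generator expression runs in its own frame, so `locals()` called inside it sees only the loop variable. The sketch below isolates that behavior with hypothetical helper functions; at notebook top level, the simplest fix is to test against `globals()` instead.

```python
def check_inside_generator():
    marker = 1
    # locals() is evaluated in the generator's own frame, which holds
    # only the loop variable `name`, so 'marker' is never found.
    return all(name in locals() for name in ["marker"])


def check_with_captured_namespace():
    marker = 1
    # Capture the enclosing namespace first, then test against that dict.
    namespace = locals()
    return all(name in namespace for name in ["marker"])


print(check_inside_generator())         # False
print(check_with_captured_namespace())  # True
```

A plain `for` loop, as in the diagnostic cell that follows, avoids the issue because the loop body runs in the enclosing scope.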


In [13]:
required_vars = ['X_train', 'X_test', 'y_train', 'y_test', 'scaler', 'imputer', 'sex_encoder']
available_vars = []
missing_vars = []

for var in required_vars:
    if var in locals():
        available_vars.append(var)
        print(f"{var}: {type(locals()[var])}")
    else:
        missing_vars.append(var)
        print(f"{var}: not available")

print(f"\nAvailable: {available_vars}")
print(f"Missing: {missing_vars}")

X_train: <class 'numpy.ndarray'>
X_test: <class 'numpy.ndarray'>
y_train: <class 'pandas.core.series.Series'>
y_test: <class 'pandas.core.series.Series'>
scaler: <class 'sklearn.preprocessing._data.StandardScaler'>
imputer: <class 'sklearn.impute._base.SimpleImputer'>
sex_encoder: <class 'sklearn.preprocessing._label.LabelEncoder'>

Available: ['X_train', 'X_test', 'y_train', 'y_test', 'scaler', 'imputer', 'sex_encoder']
Missing: []


In [14]:
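# Save unconditionally: the diagnostic cell above confirmed that every
# required variable exists, so the failing guard is not needed.
os.makedirs('../data/processed', exist_ok=True)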

y_train_np = y_train.values if hasattr(y_train, 'values') else y_train
y_test_np = y_test.values if hasattr(y_test, 'values') else y_test

np.save('../data/processed/X_train.npy', X_train)
np.save('../data/processed/X_test.npy', X_test)
np.save('../data/processed/y_train.npy', y_train_np)
np.save('../data/processed/y_test.npy', y_test_np)

preprocessing_info = {
    'scaler': scaler,
    'imputer': imputer,
    'sex_encoder': sex_encoder,
    'feature_names': list(X.columns),
    'n_features': X.shape[1],
    'n_samples_train': len(y_train),
    'n_samples_test': len(y_test)
}

with open('../data/processed/preprocessing_info.pkl', 'wb') as f:
    pickle.dump(preprocessing_info, f)

print("Processed data saved successfully!")
print("Files saved:")
print("   - X_train.npy, X_test.npy (features)")
print("   - y_train.npy, y_test.npy (targets)")
print("   - preprocessing_info.pkl (preprocessing objects)")

print("\nFinal dataset summary:")
print(f"   - Features: {preprocessing_info['n_features']}")
print(f"   - Training samples: {preprocessing_info['n_samples_train']}")
print(f"   - Test samples: {preprocessing_info['n_samples_test']}")
print(f"   - Feature names: {preprocessing_info['feature_names']}")

Processed data saved successfully!
Files saved:
   - X_train.npy, X_test.npy (features)
   - y_train.npy, y_test.npy (targets)
   - preprocessing_info.pkl (preprocessing objects)

Final dataset summary:
   - Features: 12
   - Training samples: 492
   - Test samples: 123
   - Feature names: ['Age', 'ALB', 'ALP', 'ALT', 'AST', 'BIL', 'CHE', 'CHOL', 'CREA', 'GGT', 'PROT', 'sex_encoded']
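
A later notebook will need to read these files back. The sketch below shows the round trip with `np.load` and `pickle.load`, writing to a temporary directory so it runs on its own. The two-feature arrays and the reduced dictionary are stand-ins; only the file names and dictionary keys mirror what was saved above.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.preprocessing import StandardScaler

with tempfile.TemporaryDirectory() as tmp:
    processed = Path(tmp)  # stand-in for ../data/processed

    # Write small stand-in artifacts using the same file names and keys.
    X_demo = np.array([[30.0, 40.0], [50.0, 44.0]])
    np.save(processed / "X_train.npy", X_demo)
    info_out = {
        "scaler": StandardScaler().fit(X_demo),
        "feature_names": ["Age", "ALB"],
        "n_features": X_demo.shape[1],
    }
    with open(processed / "preprocessing_info.pkl", "wb") as f:
        pickle.dump(info_out, f)

    # Read them back the way a training notebook might.
    X_loaded = np.load(processed / "X_train.npy")
    with open(processed / "preprocessing_info.pkl", "rb") as f:
        info = pickle.load(f)

# Reuse the restored scaler on a new row with the same feature order.
new_row = np.array([[40.0, 46.0]])
scaled_row = info["scaler"].transform(new_row)
print(X_loaded.shape, info["feature_names"], scaled_row)
```

Pickle files should only be loaded from trusted sources, and the scikit-learn version used for loading should match the one used for saving.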


## Summary

**Data preprocessing completed successfully!**

**What we accomplished:**
1. Loaded raw hepatitis C dataset
2. Cleaned data and created binary target (Healthy vs Hepatitis C)
3. Handled missing values using median imputation
4. Encoded categorical variable (Sex)
5. Scaled numerical features using StandardScaler
6. Split data into train/test sets (80/20)
7. Saved processed data and preprocessing objects

**Next steps:**
👉 Run notebook `03-model-training.ipynb` to train the neural network model