# Data Preprocessing

## Overview
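This notebook cleans the raw BRFSS 2015 diabetes health indicators dataset and prepares it for the ML pipeline: it removes duplicate rows, standardizes the features, and writes a stratified 80/20 train/test split to disk.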

## Responsibilities
- Load raw dataset from `data/raw/`
- Check for missing values and duplicates
- Remove duplicate rows (the dataset contains no missing values)
- Feature scaling with `StandardScaler`
- Stratified train-test split (80/20)
- Save processed data to `data/processed/`

## Output Files
After running this notebook, the following files will be created in `data/processed/`:
- `X_train.csv` - Training features
- `X_test.csv` - Testing features
- `y_train.csv` - Training labels
- `y_test.csv` - Testing labels
- `scaler.pkl` - Fitted `StandardScaler`, serialized with `joblib`

## Status
**Done** - Mohamed Abdelkader


## Import Libraries

In [20]:
import os
import joblib
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## 1. Load Dataset

In [2]:
df = pd.read_csv('../data/raw/diabetes_012_health_indicators_BRFSS2015.csv')
print(f"Dataset shape: {df.shape}")

Dataset shape: (253680, 22)


## 2. Explore Data Structure

### Display basic information

In [5]:
print("DATASET OVERVIEW:")
print(f"\nNumber of samples: {df.shape[0]:,}")
print(f"Number of columns: {df.shape[1]}")
print(f"\nColumn names:\n{df.columns.tolist()}")


DATASET OVERVIEW:

Number of samples: 253,680
Number of columns: 22

Column names:
['Diabetes_012', 'HighBP', 'HighChol', 'CholCheck', 'BMI', 'Smoker', 'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth', 'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age', 'Education', 'Income']


### Display first few rows

In [10]:
print("FIRST 5 ROWS:")
print(df.head())

FIRST 5 ROWS:
   Diabetes_012  HighBP  HighChol  CholCheck   BMI  Smoker  Stroke  \
0           0.0     1.0       1.0        1.0  40.0     1.0     0.0   
1           0.0     0.0       0.0        0.0  25.0     1.0     0.0   
2           0.0     1.0       1.0        1.0  28.0     0.0     0.0   
3           0.0     1.0       0.0        1.0  27.0     0.0     0.0   
4           0.0     1.0       1.0        1.0  24.0     0.0     0.0   

   HeartDiseaseorAttack  PhysActivity  Fruits  ...  AnyHealthcare  \
0                   0.0           0.0     0.0  ...            1.0   
1                   0.0           1.0     0.0  ...            0.0   
2                   0.0           0.0     1.0  ...            1.0   
3                   0.0           1.0     1.0  ...            1.0   
4                   0.0           1.0     1.0  ...            1.0   

   NoDocbcCost  GenHlth  MentHlth  PhysHlth  DiffWalk  Sex   Age  Education  \
0          0.0      5.0      18.0      15.0       1.0  0.0   9.0       

### Data types and info

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253680 entries, 0 to 253679
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Diabetes_012          253680 non-null  float64
 1   HighBP                253680 non-null  float64
 2   HighChol              253680 non-null  float64
 3   CholCheck             253680 non-null  float64
 4   BMI                   253680 non-null  float64
 5   Smoker                253680 non-null  float64
 6   Stroke                253680 non-null  float64
 7   HeartDiseaseorAttack  253680 non-null  float64
 8   PhysActivity          253680 non-null  float64
 9   Fruits                253680 non-null  float64
 10  Veggies               253680 non-null  float64
 11  HvyAlcoholConsump     253680 non-null  float64
 12  AnyHealthcare         253680 non-null  float64
 13  NoDocbcCost           253680 non-null  float64
 14  GenHlth               253680 non-null  float64
 ...

### Statistical summary

In [8]:
print(df.describe())

        Diabetes_012         HighBP       HighChol      CholCheck  \
count  253680.000000  253680.000000  253680.000000  253680.000000   
mean        0.296921       0.429001       0.424121       0.962670   
std         0.698160       0.494934       0.494210       0.189571   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       1.000000   
50%         0.000000       0.000000       0.000000       1.000000   
75%         0.000000       1.000000       1.000000       1.000000   
max         2.000000       1.000000       1.000000       1.000000   

                 BMI         Smoker         Stroke  HeartDiseaseorAttack  \
count  253680.000000  253680.000000  253680.000000         253680.000000   
mean       28.382364       0.443169       0.040571              0.094186   
std         6.608694       0.496761       0.197294              0.292087   
min        12.000000       0.000000       0.000000              0.000000  

## 3. Data Quality Checks

### Check for missing values

In [11]:
print("MISSING VALUES CHECK")
missing_values = df.isnull().sum()
print(f"\nTotal missing values: {missing_values.sum()}")
if missing_values.sum() > 0:
    print("\nMissing values per column:")
    print(missing_values[missing_values > 0])
else:
    print("No missing values found!")

MISSING VALUES CHECK

Total missing values: 0
No missing values found!


### Check for duplicates

In [12]:
print("DUPLICATE ROWS CHECK")
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates:,}")

if duplicates > 0:
    print(f"Removing {duplicates:,} duplicate rows...")
    df = df.drop_duplicates()
    print(f"Duplicates removed! New shape: {df.shape}")
else:
    print("No duplicates found!")

DUPLICATE ROWS CHECK
Number of duplicate rows: 23,899
Removing 23,899 duplicate rows...
Duplicates removed! New shape: (229781, 22)
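
To illustrate what the check above counts, the sketch below runs `duplicated()` and `drop_duplicates()` on a toy frame. The values are hypothetical and not taken from the dataset. With the default `keep='first'`, only the second and later copies of a row are flagged, so the first occurrence always survives.

```python
import pandas as pd

# Toy frame with hypothetical values; rows 0, 2 and 3 are identical.
toy = pd.DataFrame({
    "HighBP": [1.0, 0.0, 1.0, 1.0],
    "BMI": [40.0, 25.0, 40.0, 40.0],
})

# Default keep='first': the first occurrence is not counted as a duplicate.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates keeps the original index labels of the surviving rows.
deduped = toy.drop_duplicates()
surviving_index = deduped.index.tolist()

print(n_duplicates)      # 2
print(surviving_index)   # [0, 1]
```

Because surviving rows keep their original labels, the deduplicated `df` in this notebook no longer has a contiguous index. Calling `reset_index(drop=True)` is one way to renumber it if later steps rely on index labels.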


### Check target variable distribution

In [13]:
print("TARGET VARIABLE DISTRIBUTION")
target_col = 'Diabetes_012'
print(df[target_col].value_counts().sort_index())
print("\nPercentage distribution:")
print(df[target_col].value_counts(normalize=True).sort_index() * 100)


TARGET VARIABLE DISTRIBUTION
Diabetes_012
0.0    190055
1.0      4629
2.0     35097
Name: count, dtype: int64

Percentage distribution:
Diabetes_012
0.0    82.711364
1.0     2.014527
2.0    15.274109
Name: proportion, dtype: float64


## 4. Separate Features and Target

### Split features and target

In [14]:
X = df.drop(columns=[target_col])
y = df[target_col]

print("Features and target separated")
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

Features and target separated
Features shape: (229781, 21)
Target shape: (229781,)


## 5. Feature Scaling/Normalization

### Apply StandardScaler

In [16]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

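# Convert back to a DataFrame, preserving column names and the row index of X
# (the index is no longer contiguous after duplicates were dropped, and y still uses it)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)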

print("Features scaled using StandardScaler")
print("\nScaled features - first 5 rows:")
print(X_scaled.head())

Features scaled using StandardScaler

Scaled features - first 5 rows:
     HighBP  HighChol  CholCheck       BMI    Smoker    Stroke  \
0  1.095675  1.124132   0.205356  1.667220  1.071208 -0.216455   
1 -0.912679 -0.889575  -4.869594 -0.543101  1.071208 -0.216455   
2  1.095675  1.124132   0.205356 -0.101037 -0.933526 -0.216455   
3  1.095675 -0.889575   0.205356 -0.248391 -0.933526 -0.216455   
4  1.095675  1.124132   0.205356 -0.690456 -0.933526 -0.216455   

   HeartDiseaseorAttack  PhysActivity    Fruits   Veggies  ...  AnyHealthcare  \
0             -0.339257     -1.658403 -1.258473  0.508092  ...       0.238745   
1             -0.339257      0.602990 -1.258473 -1.968149  ...      -4.188578   
2             -0.339257     -1.658403  0.794614 -1.968149  ...       0.238745   
3             -0.339257      0.602990  0.794614  0.508092  ...       0.238745   
4             -0.339257      0.602990  0.794614  0.508092  ...       0.238745   

   NoDocbcCost   GenHlth  MentHlth  PhysHlth  
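
For reference, `StandardScaler` replaces each value `x` in a column with `(x - mean) / std`, where `std` is the population standard deviation (`ddof=0`). The sketch below checks this on a toy column; the values are hypothetical and only stand in for a feature such as `BMI`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical BMI-like column, for illustration only.
values = np.array([[20.0], [25.0], [30.0], [45.0]])

scaler_demo = StandardScaler()
scaled = scaler_demo.fit_transform(values)

# Manual z-score using the population standard deviation (ddof=0).
manual = (values - values.mean(axis=0)) / values.std(axis=0, ddof=0)

matches = bool(np.allclose(scaled, manual))
print(matches)               # True
print(scaler_demo.mean_[0])  # 30.0
```

The same formula explains the large negative values in the output above, such as -4.87 for `CholCheck`: a binary column that is 1 for most rows has a small standard deviation, so the rare 0 lands several standard deviations below the mean.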

## 6. Train-Test Split

### Split data into training and testing sets

In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled,
    y,
    test_size=0.2,  # 80% train, 20% test
    random_state=42,  # For reproducibility
    stratify=y  # Maintain class distribution
)

print("Data split into train and test sets")
print(f"\nTraining set size: {X_train.shape[0]:,} samples ({X_train.shape[0]/len(df)*100:.1f}%)")
print(f"Testing set size: {X_test.shape[0]:,} samples ({X_test.shape[0]/len(df)*100:.1f}%)")

# Check class distribution in splits
print("\nTarget distribution in training set:")
print(y_train.value_counts().sort_index())
print("\nTarget distribution in testing set:")
print(y_test.value_counts().sort_index())

Data split into train and test sets

Training set size: 183,824 samples (80.0%)
Testing set size: 45,957 samples (20.0%)

Target distribution in training set:
Diabetes_012
0.0    152043
1.0      3703
2.0     28078
Name: count, dtype: int64

Target distribution in testing set:
Diabetes_012
0.0    38012
1.0      926
2.0     7019
Name: count, dtype: int64
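
In section 5 the scaler is fitted on all rows before the split, so the test rows also contribute to the fitted mean and standard deviation. A common alternative is to split first and fit the scaler on the training portion alone. The sketch below shows that ordering on synthetic data; the arrays and variable names are hypothetical and are not the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and labels (hypothetical data).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=28.0, scale=6.0, size=(1000, 3))
y_demo = rng.choice([0.0, 1.0, 2.0], size=1000, p=[0.8, 0.05, 0.15])

# Split first, keeping class proportions with stratify.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit on the training rows only, then apply the same transform to the test rows.
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

# Training columns are centred at 0 by construction; test columns are only close to 0.
train_centred = bool(np.allclose(X_tr_scaled.mean(axis=0), 0.0))
print(train_centred)
print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering, the saved scaler reflects only the training data, which matches how it would be applied to unseen records later.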


## 7. Save Processed Data

### Save training and testing sets and the scaler

In [22]:
# Make sure the output directory exists, then save the training and testing sets
os.makedirs('../data/processed', exist_ok=True)
X_train.to_csv('../data/processed/X_train.csv', index=False)
X_test.to_csv('../data/processed/X_test.csv', index=False)
y_train.to_csv('../data/processed/y_train.csv', index=False)
y_test.to_csv('../data/processed/y_test.csv', index=False)

# Also save the scaler for future use
joblib.dump(scaler, '../data/processed/scaler.pkl')

print("All processed data saved successfully")
print("\nSaved files:")
print("  - X_train.csv")
print("  - X_test.csv")
print("  - y_train.csv")
print("  - y_test.csv")
print("  - scaler.pkl")

All processed data saved successfully

Saved files:
  - X_train.csv
  - X_test.csv
  - y_train.csv
  - y_test.csv
  - scaler.pkl
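
A downstream notebook would typically read these files back with `pd.read_csv` and `joblib.load`. The sketch below round-trips a toy frame and scaler through a temporary directory. The toy values and the temporary path are assumptions made for illustration; the real files live in `data/processed/`.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data with hypothetical values, standing in for the real features.
raw = pd.DataFrame({"BMI": [20.0, 25.0, 30.0, 45.0], "Age": [3.0, 9.0, 11.0, 13.0]})
toy_scaler = StandardScaler().fit(raw)
scaled = pd.DataFrame(toy_scaler.transform(raw), columns=raw.columns)

with tempfile.TemporaryDirectory() as out_dir:
    # Same file names as the notebook, written to a throwaway directory.
    scaled.to_csv(os.path.join(out_dir, "X_train.csv"), index=False)
    joblib.dump(toy_scaler, os.path.join(out_dir, "scaler.pkl"))

    X_loaded = pd.read_csv(os.path.join(out_dir, "X_train.csv"))
    loaded_scaler = joblib.load(os.path.join(out_dir, "scaler.pkl"))

# The reloaded scaler can undo the scaling, recovering the original units.
recovered = loaded_scaler.inverse_transform(X_loaded)
round_trip_ok = bool(np.allclose(recovered, raw.to_numpy()))
print(round_trip_ok)  # True
```

Saving with `index=False`, as the notebook does, means the CSV files carry no row labels, so features and labels are matched by row position when they are loaded again.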


## 8. Summary Report

In [23]:
print("DATA PREPROCESSING SUMMARY:\n")
print(f"Dataset size after removing duplicates: {df.shape[0]:,} samples")
print(f"Features: {X.shape[1]}")
print(f"Missing values: {missing_values.sum()}")
print(f"Duplicates removed: {duplicates:,}")
print("Scaling method: StandardScaler")
print(f"Train set: {X_train.shape[0]:,} samples")
print(f"Test set: {X_test.shape[0]:,} samples")
print("Data saved to: ../data/processed/")
print("\nData preprocessing completed successfully!")

DATA PREPROCESSING SUMMARY:

Dataset size after removing duplicates: 229,781 samples
Features: 21
Missing values: 0
Duplicates removed: 23,899
Scaling method: StandardScaler
Train set: 183,824 samples
Test set: 45,957 samples
Data saved to: ../data/processed/

Data preprocessing completed successfully!
