# Data Preprocessing

This notebook handles data cleaning, encoding, and preparation for modeling.

## Objectives
- Clean the raw dataset
- Handle missing values
- Encode categorical features
- Normalize/standardize numerical features
- Create processed dataset for modeling
- Save cleaned data to `data/processed/`

## 1. Setup and Imports

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder, OrdinalEncoder

# Custom modules
import sys
sys.path.append('..')
from src.data.preprocessing import (
    load_raw_data,
    clean_data,
    encode_categorical_features,
    handle_missing_values,
    normalize_features
)

%matplotlib inline

## 2. Load Raw Data

In [None]:
# Load the raw dataset from ../data/raw/rawdata.csv
df = pd.read_csv('../data/raw/rawdata.csv')

print(f"Original dataset shape: {df.shape}")
df.head()

## 3. Data Cleaning

In [None]:
# TODO: Remove duplicates
# Example: df = df.drop_duplicates()
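
A minimal sketch of the duplicate-removal step, reporting how many rows were dropped. The toy frame and its column names (`isolate_id`, `species`) are hypothetical stand-ins for the real dataset.

```python
import pandas as pd

# Hypothetical toy data; the real columns come from rawdata.csv
toy = pd.DataFrame({
    "isolate_id": [1, 1, 2],
    "species": ["E. coli", "E. coli", "K. pneumoniae"],
})

n_before = len(toy)
# Drop rows that are identical across all columns, then renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)
n_removed = n_before - len(deduped)
print(f"Removed {n_removed} duplicate rows")
```

If only some columns define identity (for example a record ID), `drop_duplicates(subset=[...])` may be the better fit.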

In [None]:
# TODO: Handle missing values
# Decide on a strategy: drop affected rows/columns, or impute with the mean, median, or mode
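
One possible imputation strategy, sketched under the assumption that numeric columns are filled with the median and all other columns with the most frequent value. The helper `impute_missing` and the toy columns are illustrative, not part of `src.data.preprocessing`.

```python
import numpy as np
import pandas as pd

def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and other columns with the mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            mode = out[col].mode(dropna=True)
            if not mode.empty:  # an all-missing column is left untouched
                out[col] = out[col].fillna(mode.iloc[0])
    return out

# Hypothetical toy frame; column names are illustrative only
toy = pd.DataFrame({
    "age": [30.0, np.nan, 50.0],
    "ward": ["ICU", None, "ICU"],
})
toy_imputed = impute_missing(toy)
```

Whether imputation is appropriate at all depends on why values are missing; dropping rows may be safer when only a small fraction is affected.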

In [None]:
# TODO: Handle outliers if necessary
# Consider using IQR method or domain knowledge
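
The IQR method mentioned above can be sketched as follows. The helper name `iqr_outlier_mask` and the multiplier `k = 1.5` (the conventional default) are assumptions; the sample values are made up.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (series < lower) | (series > upper)

# Made-up measurements with one obvious extreme value
values = pd.Series([10, 12, 11, 13, 12, 100])
mask = iqr_outlier_mask(values)
filtered = values[~mask]
```

The mask only flags values. Whether to drop, cap, or keep them is a domain decision.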

## 4. Feature Encoding

In [None]:
# TODO: Encode antibiotic resistance columns (R/S/I to numerical)
# R (Resistant) = 1, S (Susceptible) = 0, I (Intermediate) = 0.5 or handle separately
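
A sketch of the R/S/I mapping described above, assuming the numeric scheme S = 0, I = 0.5, R = 1. The helper `encode_sir` and the column name `ampicillin` are hypothetical; unrecognized or missing entries become `NaN` so they can be handled in the missing-value step.

```python
import numpy as np
import pandas as pd

# Assumed scheme: Susceptible = 0, Intermediate = 0.5, Resistant = 1
SIR_MAP = {"S": 0.0, "I": 0.5, "R": 1.0}

def encode_sir(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Map R/S/I strings to numbers, tolerating case and stray whitespace."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].map(
            lambda v: SIR_MAP.get(str(v).strip().upper(), np.nan)
        )
    return out

# Hypothetical antibiotic column with messy entries
toy = pd.DataFrame({"ampicillin": ["R", "s", " I ", None]})
encoded_sir = encode_sir(toy, ["ampicillin"])
```

Mapping I to 0.5 implies it sits exactly halfway between S and R. If that assumption is not justified, a separate indicator column for I is an alternative.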

In [None]:
# TODO: Encode other categorical features
# Use OrdinalEncoder (with an explicit category order) for ordinal features and
# OneHotEncoder for nominal ones; LabelEncoder is intended for target labels

## 5. Feature Normalization

In [None]:
# TODO: Normalize/standardize numerical features
# scaler = StandardScaler()
# df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
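
A runnable version of the commented-out pattern above, assuming `numerical_cols` is derived from the column dtypes. The toy frame is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical frame with one numeric and one non-numeric column
toy = pd.DataFrame({"age": [20.0, 40.0, 60.0], "label": ["a", "b", "c"]})

# Select numeric columns automatically; adjust if encoded 0/1 columns
# should be excluded from scaling
numerical_cols = toy.select_dtypes(include="number").columns.tolist()

scaler = StandardScaler()
scaled = toy.copy()
# Each selected column is rescaled to zero mean and unit variance
scaled[numerical_cols] = scaler.fit_transform(toy[numerical_cols])
```

If the data is later split into training and test sets, fitting the scaler on the training portion only avoids leaking test-set statistics into the model.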

## 6. Verify Cleaned Data

In [None]:
# Check for remaining missing values
print("Missing values after cleaning:")
print(df.isnull().sum())

# Check data types
print("\nData types:")
print(df.dtypes)

In [None]:
# Display summary statistics
df.describe()

## 7. Save Processed Data

In [None]:
# TODO: Save cleaned and encoded dataset
# df.to_csv('../data/processed/cleaned_data.csv', index=False)
# print("Processed data saved to data/processed/cleaned_data.csv")
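
A sketch of the save step that also creates the output directory if it does not exist yet. The helper `save_processed` is hypothetical, and the demo writes to a temporary directory rather than `../data/processed/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_processed(df: pd.DataFrame, output_dir, filename: str = "cleaned_data.csv") -> Path:
    """Write df as CSV into output_dir, creating the directory when needed."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    df.to_csv(out_path, index=False)
    return out_path

# Demo with a throwaway directory and a hypothetical frame
toy = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.0]})
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_processed(toy, Path(tmp) / "processed")
    reloaded = pd.read_csv(saved_path)
```

In the notebook itself the call would presumably be `save_processed(df, "../data/processed")`.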