# 03 - Data Preprocessing

## Objective
**WHY are we doing this preprocessing?**

Based on findings from Data Understanding and EDA, we now prepare the data for machine learning. Each preprocessing step is **motivated by specific observations** from our analysis.

### Preprocessing Goals:
1. **Handle categorical variables** - Convert to numerical format for ML algorithms
2. **Feature scaling** - Normalize features with different scales
3. **Feature engineering** - Create new features based on domain knowledge
4. **Train/Test split** - Properly separate data for unbiased evaluation
5. **Address class imbalance** - If necessary based on model performance

**Important:** We will explain **WHY** we make each preprocessing decision.

---

## 1. Setup and Load Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.utils import resample
import pickle
import warnings

warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'retina'

# Load raw data
df = pd.read_csv('../data/raw/pokemon.csv')
print(f"✓ Loaded {df.shape[0]} samples with {df.shape[1]} features")
print(f"\nOriginal columns: {list(df.columns)}")

✓ Loaded 1543 samples with 23 features

Original columns: ['id', 'user_id', 'QoA_VLCresolution', 'QoA_VLCbitrate', 'QoA_VLCframerate', 'QoA_VLCdropped', 'QoA_VLCaudiorate', 'QoA_VLCaudioloss', 'QoA_BUFFERINGcount', 'QoA_BUFFERINGtime', 'QoS_type', 'QoS_operator', 'QoD_model', 'QoD_os-version', 'QoD_api-level', 'QoU_sex', 'QoU_age', 'QoU_Ustedy', 'QoF_begin', 'QoF_shift', 'QoF_audio', 'QoF_video', 'MOS']


## 2. Remove Unnecessary Features

**WHY:** Some features don't contribute to prediction:
- `id`: Just an identifier, no predictive power
- `user_id`: User identity shouldn't directly predict MOS (unless we're doing personalization)
- `QoD_model`, `QoD_os-version`: Too many unique values (high cardinality) - would need special encoding
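
The high-cardinality claim can be backed with numbers by comparing unique-value counts against the row count. The sketch below uses a small made-up frame (the values are invented, not taken from the dataset) and a hypothetical cut-off `max_unique`; both are assumptions for illustration only.

```python
import pandas as pd

# Toy data standing in for the real table; values are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "QoD_model": ["A1", "B2", "C3", "D4", "E5", "A1"],
    "QoS_type": [1, 2, 2, 5, 5, 5],
})


def cardinality_report(frame):
    """Return unique-value counts and their share of the row count per column."""
    report = pd.DataFrame({"n_unique": frame.nunique()})
    report["unique_ratio"] = report["n_unique"] / len(frame)
    return report.sort_values("n_unique", ascending=False)


report = cardinality_report(toy)
max_unique = 4  # hypothetical cut-off; tune for the real data
high_cardinality = report.index[report["n_unique"] > max_unique].tolist()

print(report)
print("High-cardinality candidates:", high_cardinality)
```

Running the same report on the real `df` before the drop would show whether `QoD_model` and `QoD_os-version` really stand out from the other columns.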

In [2]:
# Features to remove and why
features_to_remove = {
    'id': 'Just an identifier',
    'user_id': 'User identity (not using personalization)',
    'QoD_model': 'Too many unique values (high cardinality)',
    'QoD_os-version': 'Too many unique values (high cardinality)'
}

print("Removing features:")
for feat, reason in features_to_remove.items():
    if feat in df.columns:
        print(f"  - {feat}: {reason}")
        df = df.drop(columns=[feat])

print(f"\n✓ Remaining features: {df.shape[1]}")
print(f"Columns: {list(df.columns)}")

Removing features:
  - id: Just an identifier
  - user_id: User identity (not using personalization)
  - QoD_model: Too many unique values (high cardinality)
  - QoD_os-version: Too many unique values (high cardinality)

✓ Remaining features: 19
Columns: ['QoA_VLCresolution', 'QoA_VLCbitrate', 'QoA_VLCframerate', 'QoA_VLCdropped', 'QoA_VLCaudiorate', 'QoA_VLCaudioloss', 'QoA_BUFFERINGcount', 'QoA_BUFFERINGtime', 'QoS_type', 'QoS_operator', 'QoD_api-level', 'QoU_sex', 'QoU_age', 'QoU_Ustedy', 'QoF_begin', 'QoF_shift', 'QoF_audio', 'QoF_video', 'MOS']


## 3. Handle Categorical Variables

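**WHY:** Most machine learning algorithms require numerical input, and the numbers must not imply an order that doesn't exist. Our categorical features are already stored as integer codes, so the question is which codes are meaningful as-is (ordinal) and which need re-encoding (nominal).

### Encoding Strategy:
- **Ordinal features** (with natural order): Keep the integer codes (label encoding)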
  - `QoS_type`: EDGE(1) < UMTS(2) < HSPA(3) < HSPAP(4) < LTE(5) ✓ Already encoded
  - `QoU_Ustedy`: Education level (1-5) ✓ Already encoded
  
- **Nominal features** (no natural order): Use one-hot encoding
  - `QoS_operator`: Network operators (no inherent order)
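
The difference between the two strategies can be sketched on a toy frame. Here `QoS_type` is written as strings for readability (in the real data it is already an integer code), and the rows are invented for illustration.

```python
import pandas as pd

# Toy rows; invented for illustration.
toy = pd.DataFrame({
    "QoS_type": ["EDGE", "LTE", "HSPA", "UMTS"],
    "QoS_operator": [1, 3, 2, 1],
})

# Ordinal: map categories to integers that respect the stated order.
type_order = {"EDGE": 1, "UMTS": 2, "HSPA": 3, "HSPAP": 4, "LTE": 5}
toy["QoS_type_encoded"] = toy["QoS_type"].map(type_order)

# Nominal: one indicator column per category, so no order is implied.
operator_dummies = pd.get_dummies(toy["QoS_operator"], prefix="Operator", dtype=int)

encoded = pd.concat([toy[["QoS_type_encoded"]], operator_dummies], axis=1)
print(encoded)
```

Each row has exactly one `Operator_*` column set to 1, whereas the ordinal column keeps a single integer whose magnitude carries meaning.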

In [3]:
# Check data types
print("Current data types:")
print(df.dtypes)

# Identify categorical features that need encoding
print("\n" + "="*60)
print("Categorical Features Analysis:")
print("="*60)

categorical_cols = ['QoS_operator']  # Nominal feature

for col in categorical_cols:
    if col in df.columns:
        print(f"\n{col}:")
        print(f"  Unique values: {df[col].nunique()}")
        print(f"  Values: {sorted(df[col].unique())}")

Current data types:
QoA_VLCresolution       int64
QoA_VLCbitrate        float64
QoA_VLCframerate      float64
QoA_VLCdropped          int64
QoA_VLCaudiorate      float64
QoA_VLCaudioloss        int64
QoA_BUFFERINGcount      int64
QoA_BUFFERINGtime       int64
QoS_type                int64
QoS_operator            int64
QoD_api-level           int64
QoU_sex                 int64
QoU_age                 int64
QoU_Ustedy              int64
QoF_begin               int64
QoF_shift               int64
QoF_audio               int64
QoF_video               int64
MOS                     int64
dtype: object

Categorical Features Analysis:

QoS_operator:
  Unique values: 4
  Values: [1, 2, 3, 4]


In [4]:
# One-hot encode QoS_operator
print("\nApplying One-Hot Encoding to QoS_operator...")

# Create mapping for better column names
operator_mapping = {1: 'SFR', 2: 'BOUYEGUES', 3: 'ORANGE', 4: 'FREE'}

# One-hot encode
operator_encoded = pd.get_dummies(df['QoS_operator'], prefix='Operator')
operator_encoded.columns = [f'Operator_{operator_mapping.get(int(col.split("_")[1]), col.split("_")[1])}' 
                            for col in operator_encoded.columns]

# Drop original and add encoded
df = df.drop(columns=['QoS_operator'])
df = pd.concat([df, operator_encoded], axis=1)

print(f"✓ One-hot encoding complete")
print(f"New columns: {[col for col in df.columns if 'Operator' in col]}")


Applying One-Hot Encoding to QoS_operator...
✓ One-hot encoding complete
New columns: ['Operator_SFR', 'Operator_BOUYEGUES', 'Operator_ORANGE', 'Operator_FREE']


## 4. Feature Engineering

**WHY:** Based on EDA insights, we can create new features that might better capture QoE patterns.

### New Features to Create:
1. **Buffering_Severity**: Combines buffering count and time
2. **Network_Generation**: Grouped network types (2G, 3G, 4G)
3. **Video_Quality_Index**: Composite score from bitrate, framerate, resolution
4. **Audio_Quality**: Audio rate discounted by audio loss

In [5]:
print("Creating engineered features...\n")

# 1. Buffering Severity (count × time, divided by 1000 to keep values small; not a true normalization)
df['Buffering_Severity'] = (
    df['QoA_BUFFERINGcount'] * df['QoA_BUFFERINGtime'] / 1000
).fillna(0)
print("✓ Buffering_Severity: count × time / 1000")

# 2. Network Generation
network_gen_mapping = {
    1: 2,  # EDGE -> 2G
    2: 3,  # UMTS -> 3G
    3: 3,  # HSPA -> 3G
    4: 3,  # HSPAP -> 3G
    5: 4   # LTE -> 4G
}
df['Network_Generation'] = df['QoS_type'].map(network_gen_mapping)
print("✓ Network_Generation: Grouped by technology generation (2G/3G/4G)")

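# 3. Video Quality Index (each component divided by its column maximum)
# Higher bitrate + higher framerate + higher resolution = better quality
# Note: the 0.4/0.3/0.3 weights are heuristic, and .max() is taken over the full dataset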
df['Video_Quality_Index'] = (
    (df['QoA_VLCbitrate'] / df['QoA_VLCbitrate'].max()) * 0.4 +
    (df['QoA_VLCframerate'] / df['QoA_VLCframerate'].max()) * 0.3 +
    (df['QoA_VLCresolution'] / df['QoA_VLCresolution'].max()) * 0.3
)
print("✓ Video_Quality_Index: Weighted combination of bitrate, framerate, resolution")

# 4. Audio Quality (audio rate discounted by loss; assumes QoA_VLCaudioloss is a percentage in [0, 100])
df['Audio_Quality'] = df['QoA_VLCaudiorate'] * (1 - df['QoA_VLCaudioloss'] / 100)
print("✓ Audio_Quality: Audio rate adjusted by loss percentage")

print(f"\nTotal columns (including target MOS): {df.shape[1]}")

Creating engineered features...

✓ Buffering_Severity: count × time / 1000
✓ Network_Generation: Grouped by technology generation (2G/3G/4G)
✓ Video_Quality_Index: Weighted combination of bitrate, framerate, resolution
✓ Audio_Quality: Audio rate adjusted by loss percentage

Total columns (including target MOS): 26
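
Because `Video_Quality_Index` divides by `.max()` over the whole table, its maxima include rows that later land in the test set. A minimal sketch of one alternative follows, assuming the split is done first and the maxima are learned from training rows only. The helper names `fit_max_values` and `video_quality_index` are hypothetical, and the toy values are invented.

```python
import pandas as pd


def fit_max_values(train, columns):
    """Learn per-column maxima from the training rows only."""
    return train[columns].max()


def video_quality_index(frame, max_values, weights):
    """Weighted sum of max-normalized columns, using previously fitted maxima."""
    columns = list(weights)
    normalized = frame[columns] / max_values[columns]
    return (normalized * pd.Series(weights)).sum(axis=1)


# Same heuristic weights as in the notebook cell above.
weights = {"QoA_VLCbitrate": 0.4, "QoA_VLCframerate": 0.3, "QoA_VLCresolution": 0.3}

# Toy train/test rows; invented for illustration.
train = pd.DataFrame({
    "QoA_VLCbitrate": [200.0, 400.0],
    "QoA_VLCframerate": [15.0, 30.0],
    "QoA_VLCresolution": [240, 360],
})
test = pd.DataFrame({
    "QoA_VLCbitrate": [800.0],
    "QoA_VLCframerate": [30.0],
    "QoA_VLCresolution": [360],
})

max_values = fit_max_values(train, list(weights))
train_vqi = video_quality_index(train, max_values, weights)
test_vqi = video_quality_index(test, max_values, weights)

print(train_vqi.tolist())
print(test_vqi.tolist())
```

With training-only maxima the index can exceed 1 on unseen rows, as the toy test row shows. That is expected and usually harmless, but worth knowing before interpreting the feature.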


## 5. Train-Test Split

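**WHY:** We must split the data BEFORE fitting the scaler, so that the scaler's mean and standard deviation come from training rows only and no test-set statistics leak into training.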

### Split Strategy:
- **80/20 split** - A common default that leaves 309 samples for testing
- **Stratified** - Maintain MOS class distribution (important due to class imbalance)
- **Random state** - For reproducibility
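
The effect of `stratify` is easiest to see on synthetic labels. This sketch uses invented class counts (10/20/70) rather than the MOS data, and compares class proportions across the full set, the train set, and the test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels; counts are invented for illustration.
labels = pd.Series([1] * 10 + [2] * 20 + [3] * 70, name="label")
features = pd.DataFrame({"x": np.arange(len(labels))})

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# Side-by-side class proportions: all three columns should match.
comparison = pd.DataFrame({
    "full": labels.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).sort_index()
print(comparison)
```

Without `stratify`, a rare class can end up over- or under-represented in a small test set purely by chance.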

In [6]:
# Separate features and target
X = df.drop(columns=['MOS'])
y = df['MOS']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nTarget distribution:")
print(y.value_counts().sort_index())

# Stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y  # Maintain class distribution
)

print(f"\n✓ Train set: {X_train.shape[0]} samples")
print(f"✓ Test set: {X_test.shape[0]} samples")

# Verify stratification
print("\nMOS distribution in train set:")
print(y_train.value_counts(normalize=True).sort_index().round(3))
print("\nMOS distribution in test set:")
print(y_test.value_counts(normalize=True).sort_index().round(3))

Features shape: (1543, 25)
Target shape: (1543,)

Target distribution:
MOS
1     93
2    118
3    246
4    784
5    302
Name: count, dtype: int64

✓ Train set: 1234 samples
✓ Test set: 309 samples

MOS distribution in train set:
MOS
1    0.060
2    0.076
3    0.160
4    0.508
5    0.196
Name: proportion, dtype: float64

MOS distribution in test set:
MOS
1    0.061
2    0.078
3    0.159
4    0.508
5    0.194
Name: proportion, dtype: float64


## 6. Feature Scaling

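**WHY:** Features have very different scales (e.g., bitrate ~500, age ~25). Distance- and gradient-based algorithms (SVM, KNN, Neural Nets) perform better with scaled features; tree-based models are largely insensitive to scale, which is why we also keep the unscaled data.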

**IMPORTANT:** We fit the scaler on training data only, then transform both train and test to prevent data leakage!
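
What "fit on train, transform both" means in practice is that every row is standardized as `z = (x - mean_train) / std_train`. The sketch below checks this by hand on toy numbers (invented for illustration), using the name `demo_scaler` to avoid clashing with the notebook's `scaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy train/test arrays; values are invented for illustration.
train = np.array([[100.0, 20.0], [300.0, 30.0], [500.0, 40.0]])
test = np.array([[700.0, 25.0]])

demo_scaler = StandardScaler().fit(train)  # statistics come from train only
train_scaled = demo_scaler.transform(train)
test_scaled = demo_scaler.transform(test)

# Manual equivalent using the training mean and standard deviation.
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std (ddof=0), matching StandardScaler
test_manual = (test - mu) / sigma

print("Train column means after scaling:", train_scaled.mean(axis=0))
print("Test row, scaler vs. manual:", test_scaled, test_manual)
```

The scaled training columns have mean 0 by construction. The scaled test rows generally do not, which is expected, because they were never part of the fit.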

In [7]:
# Initialize scaler
scaler = StandardScaler()

# Fit on training data only!
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use fitted scaler

# Convert back to DataFrame for convenience
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

print("✓ Features scaled using StandardScaler")
print(f"\nScaled features - Train: {X_train_scaled.shape}")
print(f"Scaled features - Test: {X_test_scaled.shape}")

# Show scaling effect
print("\nBefore scaling (sample):")
print(X_train[['QoA_VLCbitrate', 'QoU_age', 'QoA_BUFFERINGtime']].head(3))
print("\nAfter scaling (same sample):")
print(X_train_scaled[['QoA_VLCbitrate', 'QoU_age', 'QoA_BUFFERINGtime']].head(3))

✓ Features scaled using StandardScaler

Scaled features - Train: (1234, 25)
Scaled features - Test: (309, 25)

Before scaling (sample):
     QoA_VLCbitrate  QoU_age  QoA_BUFFERINGtime
604       393.35184       29               2416
117       391.83496       28               1413
888       185.53412       25               2998

After scaling (same sample):
     QoA_VLCbitrate   QoU_age  QoA_BUFFERINGtime
604       -0.377626 -0.009995          -0.240226
117       -0.382027 -0.137142          -0.305573
888       -0.980521 -0.518582          -0.202308


## 7. Save Preprocessed Data

**WHY:** Save processed data and scaler for:
- Reproducibility
- Use in modeling notebooks
- Future predictions on new data
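
For the "future predictions" use case, a later notebook would reload the pickled scaler and apply it to new rows with the same columns in the same order. This is a minimal round-trip sketch under those assumptions; it writes to a temporary directory instead of `../models/`, and the data is invented.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training data; invented for illustration.
feature_names = ["QoA_VLCbitrate", "QoU_age"]
train = pd.DataFrame([[200.0, 20], [400.0, 30]], columns=feature_names)
fitted = StandardScaler().fit(train)

# Round-trip through a pickle file, as the notebook does with scaler.pkl.
with tempfile.TemporaryDirectory() as tmp:
    scaler_path = Path(tmp) / "scaler.pkl"
    with open(scaler_path, "wb") as f:
        pickle.dump(fitted, f)
    with open(scaler_path, "rb") as f:
        restored = pickle.load(f)

# New rows must have the same columns, in the same order, as at fit time.
new_rows = pd.DataFrame({"QoU_age": [25], "QoA_VLCbitrate": [300.0]})
new_scaled = restored.transform(new_rows[feature_names])
print(new_scaled)
```

Reordering with the saved feature list before calling `transform` is the reason `feature_names.txt` is worth saving alongside the scaler. Pickle files should only be loaded from trusted sources.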

In [8]:
# Save processed datasets
X_train_scaled.to_csv('../data/processed/X_train_scaled.csv', index=False)
X_test_scaled.to_csv('../data/processed/X_test_scaled.csv', index=False)
y_train.to_csv('../data/processed/y_train.csv', index=False)
y_test.to_csv('../data/processed/y_test.csv', index=False)

# Save unscaled versions too (for tree-based models)
X_train.to_csv('../data/processed/X_train.csv', index=False)
X_test.to_csv('../data/processed/X_test.csv', index=False)

# Save scaler
with open('../models/scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

# Save feature names
feature_names = list(X_train.columns)
with open('../data/processed/feature_names.txt', 'w') as f:
    f.write('\n'.join(feature_names))

print("✓ Saved preprocessed data:")
print("  - X_train_scaled.csv, X_test_scaled.csv")
print("  - X_train.csv, X_test.csv (unscaled)")
print("  - y_train.csv, y_test.csv")
print("  - scaler.pkl")
print("  - feature_names.txt")

✓ Saved preprocessed data:
  - X_train_scaled.csv, X_test_scaled.csv
  - X_train.csv, X_test.csv (unscaled)
  - y_train.csv, y_test.csv
  - scaler.pkl
  - feature_names.txt


## 8. Preprocessing Summary

### What We Did:

1. **Feature Removal:**
   - Removed: id, user_id, QoD_model, QoD_os-version
   - Why: No predictive value or too high cardinality

2. **Encoding:**
   - One-hot encoded: QoS_operator
   - Kept as-is: Already encoded ordinal features

3. **Feature Engineering:**
   - Created: Buffering_Severity, Network_Generation, Video_Quality_Index, Audio_Quality
   - Why: Capture domain knowledge about QoE factors

4. **Train/Test Split:**
   - 80/20 split with stratification
   - Why: Maintain class distribution, enable unbiased evaluation

5. **Scaling:**
   - StandardScaler (mean=0, std=1)
   - Why: Different feature scales, better for many algorithms
   - Important: Fitted on train only!

### Critical Assessment:

**Limitations:**
- Removed high-cardinality features (device model) - might lose information
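- Feature engineering based on assumptions - the 0.4/0.3/0.3 weights in Video_Quality_Index are heuristic, and its per-column maxima are computed on the full dataset before the split, so a small amount of test-set information reaches the training features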
- Class imbalance still present - may address during modeling if needed
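
One lightweight way to address the remaining imbalance during modeling is class weighting rather than resampling. The sketch below derives "balanced" weights from the MOS counts printed in the Train-Test Split section. In practice the weights would be computed from `y_train` only; the full-dataset counts are used here purely for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# MOS class counts as printed in the Train-Test Split section (full dataset).
mos_counts = {1: 93, 2: 118, 3: 246, 4: 784, 5: 302}
y_all = np.repeat(list(mos_counts.keys()), list(mos_counts.values()))

classes = np.array(sorted(mos_counts))
# "balanced" weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_all)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}

for mos, weight in class_weight.items():
    print(f"MOS {mos}: weight {weight:.3f}")
```

Many scikit-learn classifiers accept such a dictionary, or simply the string `"balanced"`, through their `class_weight` parameter.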

**Assumptions:**
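- Standardization (a linear rescaling) is appropriate for all features, including the ordinal codes and the binary one-hot columns
- Keeping all four operator indicator columns (no category dropped) won't cause multicollinearity issues; this mainly matters for linear models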
- Engineered features capture meaningful patterns
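
The multicollinearity assumption can be checked directly: with all four operator indicators plus an intercept, the columns are linearly dependent, because the indicators sum to 1 in every row. The sketch below demonstrates this with a matrix rank check on invented operator codes and shows `drop_first=True` as one possible remedy, not necessarily the choice to make here.

```python
import numpy as np
import pandas as pd

# Invented operator codes covering all four categories.
operators = pd.Series([1, 2, 3, 4, 1, 2], name="QoS_operator")

full = pd.get_dummies(operators, prefix="Operator", dtype=int)
reduced = pd.get_dummies(operators, prefix="Operator", dtype=int, drop_first=True)

# Add an intercept column, as a linear model would.
design_full = np.column_stack([np.ones(len(full)), full.to_numpy()])
design_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])

rank_full = np.linalg.matrix_rank(design_full)
rank_reduced = np.linalg.matrix_rank(design_reduced)

print(f"Full:    {design_full.shape[1]} columns, rank {rank_full}")
print(f"Reduced: {design_reduced.shape[1]} columns, rank {rank_reduced}")
```

Tree-based models and regularized linear models usually tolerate the redundant column, so dropping it is mostly relevant for unregularized linear or logistic regression.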

### Next Steps:
1. Baseline model creation
2. Model comparison and selection
3. Hyperparameter tuning
4. Address class imbalance if necessary