# Data Preprocessing - Enhanced Student Engagement Dataset

**Purpose**: Load and preprocess the enhanced dataset, which adds initial questions, participant tracking, and completion status.

**Datasets Used**:
- `Merge_Enhanced.csv` - Main dataset with initial questions
- `Participant_Tracking.csv` - Real-time participant presence

**Key Features**:
1. Keep only students who joined a session
2. Separate initial questions (Quiz# = 0) from quiz questions (Quiz# > 0)
3. Apply different feature selection for Completed vs Not Completed questions
4. Cross-check Not Completed questions against participant tracking

## 1. Setup and Mount Google Drive

In [None]:
# Install required packages
!pip install pandas numpy matplotlib seaborn scikit-learn -q
print("✅ Packages installed")

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
print("\n✅ Drive mounted")
print("\nUpload these files to /content/drive/MyDrive/FYP_Data/:")
print("  1. Merge_Enhanced.csv")
print("  2. Participant_Tracking.csv")

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

pd.set_option('display.max_columns', None)
print("✅ Libraries imported")

## 2. Load Enhanced Datasets

In [None]:
# Paths (modify if needed)
DATA_PATH = '/content/drive/MyDrive/FYP_Data/'

# Load datasets
print("Loading datasets...")
df_enhanced = pd.read_csv(DATA_PATH + 'Merge_Enhanced.csv')
participant_df = pd.read_csv(DATA_PATH + 'Participant_Tracking.csv')

print(f"✅ Enhanced dataset loaded: {df_enhanced.shape}")
print(f"✅ Participant tracking loaded: {participant_df.shape}")
print(f"\nTotal students: {df_enhanced['Admission No'].nunique()}")
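
Later cells index both frames by column name, so a missing or renamed column only surfaces as a `KeyError` several cells down. The sketch below shows one way to check for that right after loading. The helper `missing_columns` and the two `REQUIRED_*` sets are hypothetical; the column names are inferred from how the rest of this notebook uses the frames, not from the CSV files themselves.

```python
import pandas as pd

# Assumed column sets, inferred from how later cells use the two frames.
REQUIRED_ENHANCED = {
    'Admission No', 'Quiz#', 'Attempt Status', 'Is Correct',
    'Engagement Level', 'Network Quality', 'Response Time (sec)',
    'RTT (ms)', 'Jitter (ms)', 'Stability (%)',
}
REQUIRED_TRACKING = {
    'Admission No', 'Quiz#', 'Event Type',
    'Completion Rate (%)', 'Had Network Issue',
}

def missing_columns(frame, required):
    """Return the required column names absent from frame, sorted."""
    return sorted(set(required) - set(frame.columns))

# Toy frame standing in for participant_df, with one column left out.
toy_tracking = pd.DataFrame({
    'Admission No': [101], 'Quiz#': [1], 'Event Type': ['Joined'],
    'Completion Rate (%)': [85.0],
})
print(missing_columns(toy_tracking, REQUIRED_TRACKING))
# ['Had Network Issue']
```

On the real data the same call would be `missing_columns(df_enhanced, REQUIRED_ENHANCED)` and `missing_columns(participant_df, REQUIRED_TRACKING)`; an empty list means every expected column is present.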

## 3. Filter Only Participating Students

In [None]:
# Get students who actually participated (joined sessions)
participated_students = participant_df[
    participant_df['Event Type'] == 'Joined'
]['Admission No'].unique()

print(f"Students who participated: {len(participated_students)}")

# Filter dataset to only include participating students
df = df_enhanced[df_enhanced['Admission No'].isin(participated_students)].copy()

print(f"✅ Filtered to participating students only")
print(f"Records: {len(df)} (from {len(df_enhanced)})")
print(f"Students: {df['Admission No'].nunique()}")
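
The `isin` filter above only matches when `Admission No` has the same type and formatting in both files. If one CSV is parsed as integers and the other as text, the filter can silently return zero rows. The following is a minimal sketch of that failure mode on toy data; `normalize_ids` is a hypothetical helper, and whether the real files need it depends on how their IDs are stored.

```python
import pandas as pd

def normalize_ids(series):
    """Cast identifiers to stripped strings so both files compare equal."""
    return series.astype(str).str.strip()

# Toy data: IDs parsed as integers in one file and as padded text in the other.
main = pd.DataFrame({'Admission No': [101, 102, 103]})
tracking = pd.DataFrame({
    'Admission No': ['101 ', '103'],
    'Event Type': ['Joined', 'Left'],
})

joined_ids = normalize_ids(
    tracking.loc[tracking['Event Type'] == 'Joined', 'Admission No']
).unique()

# Without normalization nothing matches; with it, student 101 is found.
raw_match = int(main['Admission No'].isin(tracking['Admission No']).sum())
clean_match = int(normalize_ids(main['Admission No']).isin(joined_ids).sum())
print(raw_match, clean_match)
# 0 1
```

Note that `astype(str)` turns a float column into text such as `'101.0'`, so this approach assumes the IDs are read as integers or strings.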

## 4. Separate Question Types

In [None]:
# Separate initial questions from quiz questions
initial_questions = df[df['Quiz#'] == 0].copy()
quiz_questions = df[df['Quiz#'] > 0].copy()

print("Dataset Separation:")
print(f"  Initial Questions (Quiz# = 0): {len(initial_questions)}")
print(f"  Quiz Questions (Quiz# > 0): {len(quiz_questions)}")

# Further separate by completion status
completed = quiz_questions[quiz_questions['Attempt Status'] == 'Completed'].copy()
not_completed = quiz_questions[quiz_questions['Attempt Status'] == 'Not Completed'].copy()

# Guard against division by zero when no quiz questions remain after filtering
total_quiz = max(len(quiz_questions), 1)

print("\nQuiz Questions Breakdown:")
print(f"  Completed: {len(completed)} ({len(completed)/total_quiz*100:.1f}%)")
print(f"  Not Completed: {len(not_completed)} ({len(not_completed)/total_quiz*100:.1f}%)")
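
The split above assigns a row to a subset only if `Quiz#` is a number and `Attempt Status` is exactly `'Completed'` or `'Not Completed'`. Rows with a missing `Quiz#` or any other status drop out without a message. The sketch below counts such rows on a toy frame; the `'In Progress'` label and the missing `Quiz#` are hypothetical values chosen only to illustrate the effect.

```python
import pandas as pd

# Toy frame standing in for df; 'In Progress' and the missing Quiz#
# are hypothetical values used only to show rows falling through.
toy = pd.DataFrame({
    'Quiz#': [0, 1, 1, 2, None],
    'Attempt Status': ['Completed', 'Completed', 'Not Completed',
                       'In Progress', 'Completed'],
})

initial = toy[toy['Quiz#'] == 0]
quiz = toy[toy['Quiz#'] > 0]
done = quiz[quiz['Attempt Status'] == 'Completed']
not_done = quiz[quiz['Attempt Status'] == 'Not Completed']

# Rows that landed in none of the three analysis subsets.
unassigned = len(toy) - len(initial) - len(done) - len(not_done)
print(unassigned)
# 2
```

A non-zero count on the real data would be worth inspecting with `df['Attempt Status'].value_counts(dropna=False)` before moving on.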

## 5. Feature Engineering

In [None]:
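# Encode categorical columns as numeric features
def prepare_features(data_df):
    """Return a copy of data_df with encoded correctness, engagement, and network columns."""
    df_prep = data_df.copy()
    
    # Binary encoding: 'Yes' (any case, surrounding spaces ignored) -> 1, anything else -> 0
    df_prep['Is_Correct_Binary'] = (
        df_prep['Is Correct'].astype(str).str.strip().str.lower() == 'yes'
    ).astype(int)
    
    # Ordinal engagement encoding; labels missing from the map become NaN
    engagement_map = {'Passive': 0, 'Moderate': 1, 'Active': 2}
    df_prep['Engagement_Encoded'] = df_prep['Engagement Level'].map(engagement_map)
    
    # Ordinal network quality encoding; missing or unknown labels default to 1 ('Fair').
    # Assigning the result replaces the chained inplace fillna, which newer
    # pandas versions warn about and which may not modify df_prep at all.
    network_map = {'Poor': 0, 'Fair': 1, 'Good': 2, 'Excellent': 3}
    df_prep['Network_Quality_Encoded'] = (
        df_prep['Network Quality'].map(network_map).fillna(1)
    )
    
    return df_prep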

# Apply to all datasets
initial_questions = prepare_features(initial_questions)
completed = prepare_features(completed)
not_completed = prepare_features(not_completed)

print("✅ Feature engineering complete")
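
`Series.map` with a dictionary returns `NaN` for any label that is not a key, so a differently cased or misspelled engagement level ends up as a missing target rather than an error. Here is a small sketch of how to surface those labels, using toy values; the lowercase `'active'` and the `None` are deliberate mismatches, not values known to occur in the dataset.

```python
import pandas as pd

engagement_map = {'Passive': 0, 'Moderate': 1, 'Active': 2}

# Toy labels; 'active' (lowercase) and the None are deliberate mismatches.
labels = pd.Series(['Active', 'Passive', 'active', None, 'Moderate'])
encoded = labels.map(engagement_map)

# Count the rows the mapping could not encode.
n_unmapped = int(encoded.isna().sum())

# List the distinct non-null labels that the mapping did not recognise.
unknown_labels = sorted(labels[encoded.isna()].dropna().unique())

print(n_unmapped, unknown_labels)
# 2 ['active']
```

Running the same two lines against `df['Engagement Level']` would show whether `Engagement_Encoded` contains gaps before it is saved as a target.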

## 6. Feature Selection by Question Type

In [None]:
# STAGE 1: Initial questions (for baseline clustering)
initial_features = [
    'Response Time (sec)',
    'RTT (ms)',
    'Jitter (ms)',
    'Stability (%)'
]

X_initial = initial_questions[initial_features].copy()
y_initial = initial_questions['Engagement_Encoded'].copy()

print("Stage 1 - Initial Questions:")
print(f"  Features: {initial_features}")
print(f"  Shape: {X_initial.shape}")

# STAGE 2: Completed questions (NO network params)
completed_features = [
    'Response Time (sec)',
    'Is_Correct_Binary'
]

X_completed = completed[completed_features].copy()
y_completed = completed['Engagement_Encoded'].copy()

print("\nStage 2 - Completed Questions:")
print(f"  Features: {completed_features}")
print(f"  Shape: {X_completed.shape}")
print(f"  ⚠️ Network params excluded (student succeeded)")

# STAGE 3: Not completed (USE network params)
not_completed_features = [
    'Response Time (sec)',
    'RTT (ms)',
    'Jitter (ms)',
    'Stability (%)',
    'Network_Quality_Encoded'
]

X_not_completed = not_completed[not_completed_features].copy()
y_not_completed = not_completed['Engagement_Encoded'].copy()

print("\nStage 3 - Not Completed Questions:")
print(f"  Features: {not_completed_features}")
print(f"  Shape: {X_not_completed.shape}")
print(f"  ✅ Network params included for validation")
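
The three stages repeat the same select-features-then-select-target pattern, which could also be written once. The sketch below keeps the feature lists from the cell above in a dictionary and adds an optional step that drops rows with missing values. `STAGE_FEATURES` and `select_stage` are hypothetical names, and dropping incomplete rows is an assumption about what a downstream model needs, not something the original cell does.

```python
import pandas as pd

# Feature lists copied from the cell above, keyed by stage name.
STAGE_FEATURES = {
    'initial': ['Response Time (sec)', 'RTT (ms)', 'Jitter (ms)', 'Stability (%)'],
    'completed': ['Response Time (sec)', 'Is_Correct_Binary'],
    'not_completed': ['Response Time (sec)', 'RTT (ms)', 'Jitter (ms)',
                      'Stability (%)', 'Network_Quality_Encoded'],
}

def select_stage(frame, stage, target='Engagement_Encoded'):
    """Return (X, y) for a stage, keeping only rows with no missing values."""
    cols = STAGE_FEATURES[stage]
    complete = frame.dropna(subset=cols + [target])
    return complete[cols].copy(), complete[target].copy()

# Toy frame standing in for `completed`; the second row lacks a target.
toy = pd.DataFrame({
    'Response Time (sec)': [4.2, 9.8, 6.1],
    'Is_Correct_Binary': [1, 0, 1],
    'Engagement_Encoded': [2.0, None, 1.0],
})
X, y = select_stage(toy, 'completed')
print(X.shape, len(y))
# (2, 2) 2
```

Because features and target are filtered together, `X` and `y` stay aligned row for row.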

## 7. Cross-Check with Participant Tracking

In [None]:
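# Merge not-completed questions with participant tracking.
# Keep one 'Joined' row per student and quiz so that repeated join events
# (for example after a reconnect) do not duplicate question rows.
joined_tracking = participant_df[participant_df['Event Type'] == 'Joined'][
    ['Admission No', 'Quiz#', 'Completion Rate (%)', 'Had Network Issue']
].drop_duplicates(subset=['Admission No', 'Quiz#'])

not_completed_with_tracking = not_completed.merge(
    joined_tracking,
    on=['Admission No', 'Quiz#'],
    how='left'
)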

print("Not Completed Questions Analysis:")
print(f"\nWith Valid Network Issue:")
valid_network = not_completed_with_tracking[
    (not_completed_with_tracking['Completion Rate (%)'] < 70) &
    (not_completed_with_tracking['Had Network Issue'] == True)
]
print(f"  Count: {len(valid_network)}")

print(f"\nLikely Engagement Issue (not network):")
engagement_issue = not_completed_with_tracking[
    (not_completed_with_tracking['Completion Rate (%)'] < 70) &
    (not_completed_with_tracking['Had Network Issue'] == False)
]
print(f"  Count: {len(engagement_issue)}")
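
A left merge repeats every left-hand row once per matching right-hand row, so a student with two `'Joined'` events for the same quiz would inflate the counts above. The sketch below reproduces this on toy data and shows the `validate` argument of `DataFrame.merge`, which raises `MergeError` instead of duplicating rows. The toy frames and the `Question` column are illustrative only.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy stand-ins: student 101 has two 'Joined' events for quiz 1 (a rejoin).
questions = pd.DataFrame({'Admission No': [101, 101, 102],
                          'Quiz#': [1, 1, 1],
                          'Question': ['Q1', 'Q2', 'Q1']})
joined = pd.DataFrame({'Admission No': [101, 101, 102],
                       'Quiz#': [1, 1, 1],
                       'Had Network Issue': [True, True, False]})
keys = ['Admission No', 'Quiz#']

# A plain left merge silently repeats each question once per join event.
naive = questions.merge(joined, on=keys, how='left')

# validate='many_to_one' raises when the right-hand keys are not unique.
try:
    questions.merge(joined, on=keys, how='left', validate='many_to_one')
    raised = False
except MergeError:
    raised = True

# One tracking row per key restores the original row count.
deduped = questions.merge(joined.drop_duplicates(subset=keys),
                          on=keys, how='left', validate='many_to_one')
print(len(questions), len(naive), raised, len(deduped))
# 3 5 True 3
```

Comparing `len(not_completed)` with `len(not_completed_with_tracking)` is a quick way to confirm the real merge did not add rows.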

## 8. Scale Features

In [None]:
# Standardize features
scaler_initial = StandardScaler()
scaler_completed = StandardScaler()
scaler_not_completed = StandardScaler()

X_initial_scaled = scaler_initial.fit_transform(X_initial)
X_completed_scaled = scaler_completed.fit_transform(X_completed)
X_not_completed_scaled = scaler_not_completed.fit_transform(X_not_completed)

print("✅ Features scaled")
print(f"\nScaled shapes:")
print(f"  Initial: {X_initial_scaled.shape}")
print(f"  Completed: {X_completed_scaled.shape}")
print(f"  Not Completed: {X_not_completed_scaled.shape}")
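
For reference, standardization subtracts each column's mean and divides by its standard deviation, giving every feature zero mean and unit variance. The NumPy sketch below performs the same arithmetic by hand on a toy matrix; `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default. The values are made up for illustration.

```python
import numpy as np

# Toy feature matrix: two columns on very different scales.
X = np.array([[2.0, 100.0],
              [4.0, 300.0],
              [6.0, 500.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)          # ddof=0, the population standard deviation
X_scaled = (X - mean) / std

# Undo the scaling to recover the original units.
X_restored = X_scaled * std + mean

print(np.allclose(X_scaled.mean(axis=0), 0.0))
print(np.allclose(X_scaled.std(axis=0), 1.0))
print(np.allclose(X_restored, X))
# True
# True
# True
```

The last step corresponds to what `scaler.inverse_transform` does with the stored mean and scale, which is why the fitted scalers are saved alongside the arrays.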

## 9. Save Preprocessed Data

In [None]:
# Save to Drive
import os

OUTPUT_PATH = '/content/drive/MyDrive/FYP_Data/Preprocessed/'
os.makedirs(OUTPUT_PATH, exist_ok=True)  # create the folder if it does not exist

# Save feature matrices
np.save(OUTPUT_PATH + 'X_initial_scaled.npy', X_initial_scaled)
np.save(OUTPUT_PATH + 'y_initial.npy', y_initial.values)

np.save(OUTPUT_PATH + 'X_completed_scaled.npy', X_completed_scaled)
np.save(OUTPUT_PATH + 'y_completed.npy', y_completed.values)

np.save(OUTPUT_PATH + 'X_not_completed_scaled.npy', X_not_completed_scaled)
np.save(OUTPUT_PATH + 'y_not_completed.npy', y_not_completed.values)

# Save scalers
import pickle
with open(OUTPUT_PATH + 'scaler_initial.pkl', 'wb') as f:
    pickle.dump(scaler_initial, f)
with open(OUTPUT_PATH + 'scaler_completed.pkl', 'wb') as f:
    pickle.dump(scaler_completed, f)
with open(OUTPUT_PATH + 'scaler_not_completed.pkl', 'wb') as f:
    pickle.dump(scaler_not_completed, f)

print("✅ All preprocessed data saved to:")
print(OUTPUT_PATH)
print("\nFiles saved:")
print("  - X_initial_scaled.npy, y_initial.npy")
print("  - X_completed_scaled.npy, y_completed.npy")
print("  - X_not_completed_scaled.npy, y_not_completed.npy")
print("  - scaler_*.pkl (3 files)")
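
The saved files follow a fixed naming pattern (`X_<stage>_scaled.npy`, `y_<stage>.npy`, `scaler_<stage>.pkl`), so a later notebook can reload one stage with a small helper. The sketch below is one possible version; `load_stage` is a hypothetical name, and the round trip uses a temporary folder with a plain dictionary standing in for a fitted scaler.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

def load_stage(folder, stage):
    """Load the scaled features, labels, and scaler saved for one stage."""
    folder = Path(folder)
    X = np.load(folder / f'X_{stage}_scaled.npy')
    y = np.load(folder / f'y_{stage}.npy')
    with open(folder / f'scaler_{stage}.pkl', 'rb') as f:
        scaler = pickle.load(f)
    return X, y, scaler

# Round trip in a temporary folder; the dict is a stand-in for a fitted scaler.
with tempfile.TemporaryDirectory() as tmp:
    np.save(Path(tmp) / 'X_initial_scaled.npy', np.zeros((3, 4)))
    np.save(Path(tmp) / 'y_initial.npy', np.array([0.0, 1.0, 2.0]))
    with open(Path(tmp) / 'scaler_initial.pkl', 'wb') as f:
        pickle.dump({'stand_in': True}, f)
    X, y, scaler = load_stage(tmp, 'initial')

print(X.shape, y.tolist(), scaler)
# (3, 4) [0.0, 1.0, 2.0] {'stand_in': True}
```

With the real output folder the call would be `load_stage(OUTPUT_PATH, 'initial')`. Pickle files should only be loaded from a trusted source, and a pickled scaler is best reloaded with the same scikit-learn version that created it.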

## Summary

**Preprocessing Complete!**

- ✅ Kept only students who joined a session
- ✅ Separated initial, completed, and not-completed questions
- ✅ Applied stage-specific feature selection:
  - Initial: Response Time + Network metrics (RTT, Jitter, Stability)
  - Completed: Response Time + Correctness only
  - Not Completed: Response Time + Network metrics + encoded Network Quality
- ✅ Cross-checked not-completed questions against participant tracking
- ✅ Scaled and saved all data

**Next**: Use `02_Model_Training.ipynb` for model training