# 🤖 Machine Learning Model Training

Welcome to the exciting world of machine learning! In this notebook, we'll:

1. **Load** our cleaned and preprocessed data
2. **Split** data into training and testing sets
3. **Train multiple ML models** (Linear Regression, Random Forest, XGBoost)
4. **Compare model performance** with various metrics
5. **Select the best model** for our problem
6. **Save the trained model** for deployment

## 🎯 Learning Goals
- Understand train/validation/test splits
- Learn different ML algorithms and when to use them
- Practice model evaluation and comparison
- Understand overfitting and cross-validation
- Learn to save and load trained models

## 📚 Step 1: Import Libraries & Load Data

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
import xgboost as xgb

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Model persistence
import joblib
import pickle

# System libraries
import os
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

print("✅ Libraries imported successfully!")
print("📦 Available ML algorithms: Linear Regression, Ridge, Lasso, Random Forest, Gradient Boosting, XGBoost, SVR")

In [None]:
# Load our cleaned and preprocessed data
print("📂 Loading cleaned datasets...")

# Load scaled data for training
df_scaled = pd.read_csv('../data/processed/student_performance_cleaned.csv')
print(f"✅ Scaled dataset loaded: {df_scaled.shape}")

# Load unscaled data for interpretability
df_unscaled = pd.read_csv('../data/processed/student_performance_cleaned_unscaled.csv')
print(f"✅ Unscaled dataset loaded: {df_unscaled.shape}")

# Load the saved scaler
scaler = joblib.load('../models/scaler.pkl')
print(f"✅ Scaler loaded for future predictions")

print("\n📋 Dataset Overview:")
display(df_scaled.head())

# Build the feature list first so the count is derived from it rather than hard-coded
feature_columns = [col for col in df_scaled.columns if col not in ['student_id', 'exam_score']]
print(f"\n📊 Features available: {len(feature_columns)} (excluding student_id and exam_score)")
print(f"Feature list: {feature_columns}")

## ✂️ Step 2: Split Data for Training & Testing

We'll split our data into training and testing sets to evaluate model performance properly:

In [None]:
# Prepare features and target variables
print("🎯 Preparing Features and Target:")
print("="*40)

# Features (X) and target (y) from scaled data
X = df_scaled[feature_columns]
y = df_scaled['exam_score']

print(f"📊 Feature matrix shape: {X.shape}")
print(f"🎯 Target vector shape: {y.shape}")
print(f"📈 Target statistics:")
print(f"   Mean: {y.mean():.2f}")
print(f"   Std:  {y.std():.2f}")
print(f"   Min:  {y.min():.2f}")
print(f"   Max:  {y.max():.2f}")

# Check for any remaining missing values
print(f"\n❓ Missing values check:")
print(f"   Features: {X.isnull().sum().sum()}")
print(f"   Target: {y.isnull().sum()}")

In [None]:
# Split data into training and testing sets
print("✂️ Splitting Data:")
print("="*20)

# 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=None  # We'll use simple random split for regression
)

print(f"📊 Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"📊 Testing set:  {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")

print(f"\n🎯 Target distribution:")
print(f"   Training - Mean: {y_train.mean():.2f}, Std: {y_train.std():.2f}")
print(f"   Testing  - Mean: {y_test.mean():.2f}, Std: {y_test.std():.2f}")

# Visualize the split
plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.hist(y, bins=20, alpha=0.7, color='blue', label='Full Dataset')
plt.title('Full Dataset Distribution')
plt.xlabel('Exam Score')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 3, 2)
plt.hist(y_train, bins=20, alpha=0.7, color='green', label='Training Set')
plt.title('Training Set Distribution')
plt.xlabel('Exam Score')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 3, 3)
plt.hist(y_test, bins=20, alpha=0.7, color='red', label='Testing Set')
plt.title('Testing Set Distribution')
plt.xlabel('Exam Score')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("✅ Data split completed! Ready for model training.")
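
The learning goals mention train/validation/test splits, while this notebook uses a single 80/20 split plus cross-validation. As a minimal sketch of the three-way variant, the pure-Python helper below shuffles row indices once and cuts them into three disjoint groups. The name `three_way_split` and the 60/20/20 fractions are illustrative assumptions, not part of the notebook.

```python
import random


def three_way_split(n_samples, val_frac=0.2, test_frac=0.2, seed=42):
    """Return disjoint (train, val, test) index lists covering range(n_samples)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible

    n_test = int(round(n_samples * test_frac))
    n_val = int(round(n_samples * val_frac))

    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]  # everything that is left
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

With scikit-learn, the same effect is usually obtained by calling `train_test_split` twice: once to hold out the test set, then again to carve a validation set out of the remainder.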

## 🤖 Step 3: Train Multiple ML Models

Let's train several different algorithms and compare their performance:

In [None]:
# Initialize different ML models
print("🤖 Initializing ML Models:")
print("="*30)

models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0, random_state=42),
    'Lasso Regression': Lasso(alpha=1.0, random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'XGBoost': xgb.XGBRegressor(n_estimators=100, random_state=42),
    'Support Vector Regression': SVR(kernel='rbf', C=1.0)
}

print(f"📦 Initialized {len(models)} different models:")
for name in models.keys():
    print(f"   • {name}")

print("\n🎯 Each model will be trained and evaluated using:")
print("   • Mean Squared Error (MSE)")
print("   • Mean Absolute Error (MAE)")
print("   • R² Score (coefficient of determination)")
print("   • Cross-validation scores")
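
The cross-validation scores listed above come from `cross_val_score(..., cv=5)`, which for a regressor splits the training rows into five consecutive folds and validates on each fold in turn. The sketch below reproduces only the index bookkeeping in plain Python; `k_fold_indices` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous folds, without shuffling."""
    # The first (n_samples % k) folds get one extra row so every row is used once.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, k=5))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "| train on", len(train_idx), "rows")
```

Each row is used for validation exactly once, which is why the mean and standard deviation of the five scores give a steadier estimate than a single split.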

In [None]:
# Train all models and collect results
print("🚀 Training Models:")
print("="*20)

results = {}
trained_models = {}

for name, model in models.items():
    print(f"\n🔄 Training {name}...")
    
    # Train the model
    model.fit(X_train, y_train)
    trained_models[name] = model
    
    # Make predictions
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    # Calculate metrics
    train_mse = mean_squared_error(y_train, y_pred_train)
    test_mse = mean_squared_error(y_test, y_pred_test)
    train_mae = mean_absolute_error(y_train, y_pred_train)
    test_mae = mean_absolute_error(y_test, y_pred_test)
    train_r2 = r2_score(y_train, y_pred_train)
    test_r2 = r2_score(y_test, y_pred_test)
    
    # Cross-validation
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
    
    # Store results
    results[name] = {
        'train_mse': train_mse,
        'test_mse': test_mse,
        'train_mae': train_mae,
        'test_mae': test_mae,
        'train_r2': train_r2,
        'test_r2': test_r2,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'predictions_train': y_pred_train,
        'predictions_test': y_pred_test
    }
    
    print(f"   ✅ {name} trained successfully!")
    print(f"      Test R²: {test_r2:.4f}, Test MAE: {test_mae:.4f}")

print("\n🎉 All models trained successfully!")
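
To make the three metrics collected above concrete, here is a small worked example that computes MSE, MAE, and R² by hand. It is a sketch on made-up numbers; `regression_metrics` is a hypothetical helper, and in the notebook the scikit-learn functions do this work.

```python
def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE and R² from two equal-length lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n

    # R² = 1 - (residual sum of squares) / (total sum of squares around the mean)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"mse": mse, "mae": mae, "r2": r2}


y_true = [60.0, 70.0, 80.0, 90.0]   # made-up exam scores
y_pred = [62.0, 68.0, 83.0, 87.0]   # made-up predictions
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Here the squared errors sum to 26 and the total sum of squares is 500, so R² is 1 - 26/500 = 0.948. A model that always predicts the mean would score 0, and a model worse than that scores below 0.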

## 📊 Step 4: Model Evaluation & Comparison

Let's compare the performance of all our models:

In [None]:
# Create results comparison table
print("📊 Model Performance Comparison:")
print("="*40)

# Create DataFrame for easy comparison
# Keep the metrics numeric: sorting formatted strings would order them
# character by character and misplace negative R² values
comparison_data = []
for name, metrics in results.items():
    comparison_data.append({
        'Model': name,
        'Train R²': metrics['train_r2'],
        'Test R²': metrics['test_r2'],
        'Test MAE': metrics['test_mae'],
        'Test MSE': metrics['test_mse'],
        'CV Mean R²': metrics['cv_mean'],
        'CV Std': metrics['cv_std']
    })

comparison_df = pd.DataFrame(comparison_data)
comparison_df = comparison_df.sort_values('Test R²', ascending=False)

# Round to 4 decimals only when printing
print(comparison_df.to_string(index=False, float_format=lambda v: f"{v:.4f}"))

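# Identify best model
# Note: choosing the winner by test R² makes the reported test score slightly optimistic;
# the CV Mean R² column is a less biased basis for selection
best_model_name = max(results.keys(), key=lambda x: results[x]['test_r2'])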
best_r2 = results[best_model_name]['test_r2']

print(f"\n🏆 Best performing model: {best_model_name}")
print(f"🎯 Best Test R² Score: {best_r2:.4f}")
print(f"📈 This means the model explains {best_r2*100:.1f}% of the variance in exam scores!")
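
One pitfall worth knowing when building a ranking table like this: a column of formatted strings sorts lexicographically, not numerically, so negative scores can end up in the wrong order. The sketch below uses made-up scores and placeholder model names to show the difference.

```python
# Made-up R² values; the model names are placeholders
scores = {"Model A": 0.9123, "Model B": -0.2500, "Model C": -0.0312, "Model D": 0.8741}

# Sorting the formatted strings compares characters, not values.
as_text = sorted((f"{v:.4f}" for v in scores.values()), reverse=True)

# Sorting the floats first and formatting afterwards keeps numeric order.
as_numbers = [f"{v:.4f}" for v in sorted(scores.values(), reverse=True)]

print("string sort: ", as_text)
print("numeric sort:", as_numbers)
```

The string sort ranks `-0.2500` above `-0.0312`, even though -0.0312 is the better score. Sorting before formatting avoids the problem.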

In [None]:
# Visualize model performance comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Model Performance Comparison', fontsize=16)

model_names = list(results.keys())

# R² Score comparison
train_r2_scores = [results[name]['train_r2'] for name in model_names]
test_r2_scores = [results[name]['test_r2'] for name in model_names]

x = np.arange(len(model_names))
width = 0.35

axes[0, 0].bar(x - width/2, train_r2_scores, width, label='Train R²', alpha=0.8)
axes[0, 0].bar(x + width/2, test_r2_scores, width, label='Test R²', alpha=0.8)
axes[0, 0].set_xlabel('Models')
axes[0, 0].set_ylabel('R² Score')
axes[0, 0].set_title('R² Score Comparison')
axes[0, 0].set_xticks(x)
axes[0, 0].set_xticklabels(model_names, rotation=45, ha='right')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# MAE comparison
test_mae_scores = [results[name]['test_mae'] for name in model_names]
axes[0, 1].bar(model_names, test_mae_scores, alpha=0.8, color='orange')
axes[0, 1].set_xlabel('Models')
axes[0, 1].set_ylabel('Mean Absolute Error')
axes[0, 1].set_title('Test MAE Comparison (Lower is Better)')
axes[0, 1].tick_params(axis='x', rotation=45)
axes[0, 1].grid(True, alpha=0.3)

# Cross-validation scores
cv_means = [results[name]['cv_mean'] for name in model_names]
cv_stds = [results[name]['cv_std'] for name in model_names]
axes[1, 0].bar(model_names, cv_means, yerr=cv_stds, alpha=0.8, color='green', capsize=5)
axes[1, 0].set_xlabel('Models')
axes[1, 0].set_ylabel('Cross-Validation R² Score')
axes[1, 0].set_title('Cross-Validation Performance')
axes[1, 0].tick_params(axis='x', rotation=45)
axes[1, 0].grid(True, alpha=0.3)

# Overfitting analysis (difference between train and test R²)
overfitting = [results[name]['train_r2'] - results[name]['test_r2'] for name in model_names]
colors = ['red' if x > 0.1 else 'orange' if x > 0.05 else 'green' for x in overfitting]
axes[1, 1].bar(model_names, overfitting, alpha=0.8, color=colors)
axes[1, 1].set_xlabel('Models')
axes[1, 1].set_ylabel('Train R² - Test R²')
axes[1, 1].set_title('Overfitting Analysis (Lower is Better)')
axes[1, 1].tick_params(axis='x', rotation=45)
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].axhline(y=0.05, color='red', linestyle='--', alpha=0.5, label='Overfitting Threshold')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

print("\n📊 Performance Analysis:")
print(f"   🏆 Highest Test R²: {best_model_name} ({best_r2:.4f})")
print(f"   📉 Lowest Test MAE: {min(model_names, key=lambda x: results[x]['test_mae'])} ({min(results[x]['test_mae'] for x in model_names):.4f})")
print(f"   🎯 Best CV Score: {max(model_names, key=lambda x: results[x]['cv_mean'])} ({max(results[x]['cv_mean'] for x in model_names):.4f})")
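
The colour coding in the overfitting panel boils down to a simple rule on the gap between train and test R². A minimal sketch of that rule follows, using the same 0.05 and 0.10 cut-offs as the plot. The function name `overfitting_flag` is an assumption made for illustration, and the cut-offs are rules of thumb rather than fixed standards.

```python
def overfitting_flag(train_r2, test_r2, warn=0.05, severe=0.10):
    """Return the train/test R² gap and the colour the bar chart would use."""
    gap = train_r2 - test_r2
    if gap > severe:
        return gap, "red"      # large gap: the model likely memorised the training data
    if gap > warn:
        return gap, "orange"   # moderate gap: worth a closer look
    return gap, "green"        # small gap: train and test behave similarly


# Made-up score pairs, one per colour band
examples = [(0.99, 0.80), (0.90, 0.83), (0.85, 0.84)]
flags = [overfitting_flag(train, test)[1] for train, test in examples]
print(flags)
```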

In [None]:
# Analyze predictions for the best model
print(f"🔍 Detailed Analysis of Best Model: {best_model_name}")
print("="*50)

best_model = trained_models[best_model_name]
best_predictions = results[best_model_name]['predictions_test']

# Prediction vs Actual scatter plot
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.scatter(y_test, best_predictions, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Exam Scores')
plt.ylabel('Predicted Exam Scores')
plt.title(f'{best_model_name}\nPredictions vs Actual')
plt.grid(True, alpha=0.3)

# Residuals plot
residuals = y_test - best_predictions
plt.subplot(1, 3, 2)
plt.scatter(best_predictions, residuals, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Exam Scores')
plt.ylabel('Residuals (Actual - Predicted)')
plt.title('Residuals Plot')
plt.grid(True, alpha=0.3)

# Residuals distribution
plt.subplot(1, 3, 3)
plt.hist(residuals, bins=20, alpha=0.7, edgecolor='black')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Residuals Distribution')
plt.axvline(residuals.mean(), color='red', linestyle='--', label=f'Mean: {residuals.mean():.3f}')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Prediction statistics
print(f"\n📊 Prediction Analysis:")
print(f"   Mean Residual: {residuals.mean():.4f} (should be close to 0)")
print(f"   Residual Std: {residuals.std():.4f}")
print(f"   Max Absolute Error: {abs(residuals).max():.4f}")
print(f"   95% of predictions within ±{np.percentile(abs(residuals), 95):.2f} points")
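
The last line of the cell above reports the 95th percentile of the absolute residuals. As a sketch of what that number means, the helper below reimplements the linear interpolation that `np.percentile` uses by default; `percentile` here is a hypothetical stand-in, and the residuals are made up.

```python
def percentile(values, q):
    """Linearly interpolated percentile for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0   # fractional position in the sorted list
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


residuals = [-4.0, -1.0, 0.0, 2.0, 3.0]          # made-up residuals (actual - predicted)
abs_errors = [abs(r) for r in residuals]

mean_residual = sum(residuals) / len(residuals)  # near 0 means no systematic bias
p95 = percentile(abs_errors, 95)                 # 95% of errors are at most this large

print(f"mean residual: {mean_residual:.2f}, 95th percentile of |error|: {p95:.2f}")
```

A mean residual far from zero would suggest the model systematically over- or under-predicts. The percentile describes the size of the typical worst-case error.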

## 🎯 Step 5: Feature Importance Analysis

Let's understand which features are most important for predicting exam scores:

In [None]:
# Feature importance analysis for tree-based models
print("🎯 Feature Importance Analysis:")
print("="*35)

# Get feature importance from tree-based models
tree_models = ['Random Forest', 'Gradient Boosting', 'XGBoost']
available_tree_models = [name for name in tree_models if name in trained_models]

if available_tree_models:
    plt.figure(figsize=(15, 5 * len(available_tree_models)))
    
    for i, model_name in enumerate(available_tree_models, 1):
        model = trained_models[model_name]
        
        # Get feature importance
        if hasattr(model, 'feature_importances_'):
            importance = model.feature_importances_
            
            # Create importance DataFrame
            importance_df = pd.DataFrame({
                'feature': feature_columns,
                'importance': importance
            }).sort_values('importance', ascending=False)
            
            plt.subplot(len(available_tree_models), 1, i)
            bars = plt.bar(importance_df['feature'], importance_df['importance'], alpha=0.8)
            plt.title(f'Feature Importance - {model_name}')
            plt.xlabel('Features')
            plt.ylabel('Importance')
            plt.xticks(rotation=45, ha='right')
            plt.grid(True, alpha=0.3)
            
            # Color the bars by importance
            max_importance = importance_df['importance'].max()
            for bar, imp in zip(bars, importance_df['importance']):
                if imp > 0.7 * max_importance:
                    bar.set_color('red')
                elif imp > 0.4 * max_importance:
                    bar.set_color('orange')
                else:
                    bar.set_color('lightblue')
    
    plt.tight_layout()
    plt.show()
    
    # Print top features for each model
    for model_name in available_tree_models:
        model = trained_models[model_name]
        if hasattr(model, 'feature_importances_'):
            importance = model.feature_importances_
            importance_df = pd.DataFrame({
                'feature': feature_columns,
                'importance': importance
            }).sort_values('importance', ascending=False)
            
            print(f"\n🏆 Top 5 features for {model_name}:")
            for idx, row in importance_df.head().iterrows():
                print(f"   {row['feature']:20s}: {row['importance']:.4f}")
else:
    print("No tree-based models available for feature importance analysis.")

# Linear model coefficients
if 'Linear Regression' in trained_models:
    linear_model = trained_models['Linear Regression']
    coefficients = pd.DataFrame({
        'feature': feature_columns,
        'coefficient': linear_model.coef_
    }).sort_values('coefficient', key=abs, ascending=False)
    
    print(f"\n📊 Linear Regression Coefficients (Top 5):")
    for idx, row in coefficients.head().iterrows():
        direction = "↗️" if row['coefficient'] > 0 else "↘️"
        print(f"   {row['feature']:20s}: {row['coefficient']:8.4f} {direction}")
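
The `feature_importances_` values above are specific to tree-based models. A model-agnostic alternative is permutation importance: shuffle one feature column and measure how much the error grows. The sketch below shows the idea in plain Python with a toy prediction function standing in for a fitted model. All names are hypothetical, and scikit-learn ships its own implementation in `sklearn.inspection`.

```python
import random


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance_sketch(predict, rows, y_true, n_features, seed=0):
    """Increase in MSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mse(y_true, [predict(r) for r in rows])
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled_rows = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        shuffled_error = mse(y_true, [predict(r) for r in shuffled_rows])
        importances.append(shuffled_error - baseline)
    return importances


def toy_predict(row):
    """Toy stand-in for a fitted model: it depends on feature 0 only."""
    return 5.0 * row[0]


rows = [[float(i), float(i % 3)] for i in range(20)]
y = [toy_predict(r) for r in rows]
importances = permutation_importance_sketch(toy_predict, rows, y, n_features=2)
print(importances)
```

Shuffling feature 1 changes nothing because the toy model ignores it, so its importance is exactly zero. Shuffling feature 0 increases the error.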

## ⚙️ Step 6: Hyperparameter Tuning (Optional)

Let's optimize our best model with hyperparameter tuning:

In [None]:
# Hyperparameter tuning for the best model
print(f"⚙️ Hyperparameter Tuning for {best_model_name}:")
print("="*45)

# Define parameter grids for different models
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10]
    },
    'XGBoost': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 9],
        'learning_rate': [0.01, 0.1, 0.2]
    },
    'Gradient Boosting': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 6, 9],
        'learning_rate': [0.01, 0.1, 0.2]
    },
    'Ridge Regression': {
        'alpha': [0.1, 1.0, 10.0, 100.0]
    }
}

if best_model_name in param_grids:
    print(f"🔍 Performing Grid Search for {best_model_name}...")
    
    # Get the base model
    base_model = models[best_model_name]
    param_grid = param_grids[best_model_name]
    
    # Perform grid search
    grid_search = GridSearchCV(
        base_model, 
        param_grid, 
        cv=5, 
        scoring='r2', 
        n_jobs=-1, 
        verbose=1
    )
    
    grid_search.fit(X_train, y_train)
    
    # Get the best model
    best_tuned_model = grid_search.best_estimator_
    
    # Evaluate tuned model
    y_pred_tuned = best_tuned_model.predict(X_test)
    tuned_r2 = r2_score(y_test, y_pred_tuned)
    tuned_mae = mean_absolute_error(y_test, y_pred_tuned)
    
    print(f"\n🎯 Tuning Results:")
    print(f"   Original R² Score: {results[best_model_name]['test_r2']:.4f}")
    print(f"   Tuned R² Score:    {tuned_r2:.4f}")
    print(f"   Improvement:       {tuned_r2 - results[best_model_name]['test_r2']:.4f}")
    print(f"\n   Original MAE:      {results[best_model_name]['test_mae']:.4f}")
    print(f"   Tuned MAE:         {tuned_mae:.4f}")
    print(f"   Improvement:       {results[best_model_name]['test_mae'] - tuned_mae:.4f}")
    
    print(f"\n🏆 Best Parameters:")
    for param, value in grid_search.best_params_.items():
        print(f"   {param}: {value}")
    
    # Update best model if tuned version is better
    if tuned_r2 > results[best_model_name]['test_r2']:
        print(f"\n✅ Tuned model is better! Using tuned version as final model.")
        final_model = best_tuned_model
        final_model_name = f"{best_model_name} (Tuned)"
    else:
        print(f"\n📊 Original model performs better. Using original version.")
        final_model = trained_models[best_model_name]
        final_model_name = best_model_name
        
else:
    print(f"⚠️ Hyperparameter tuning not available for {best_model_name}")
    final_model = trained_models[best_model_name]
    final_model_name = best_model_name

print(f"\n🎉 Final Model Selected: {final_model_name}")
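
Grid search cost grows multiplicatively with every parameter added. As a rough sketch, the snippet below enumerates the Random Forest grid from the cell above with `itertools.product` and counts the model fits that 5-fold cross-validation implies.

```python
from itertools import product

# Same grid as the 'Random Forest' entry in param_grids
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}

names = list(param_grid)
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

cv_folds = 5
total_fits = len(candidates) * cv_folds  # one fit per candidate per fold

print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

That is 27 candidates and 135 fits, plus one final refit on the full training set when `GridSearchCV` keeps its default `refit=True`. Adding a fourth parameter with three values would triple the total.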

## 💾 Step 7: Save the Best Model

Let's save our trained model for future use and deployment:

In [None]:
# Save the final model and related information
print("💾 Saving Final Model:")
print("="*25)

# Create models directory if it doesn't exist
os.makedirs('../models', exist_ok=True)

# Save the model
model_filename = '../models/best_student_performance_model.pkl'
joblib.dump(final_model, model_filename)
print(f"✅ Model saved: {model_filename}")

# Compute final test metrics for the metadata
# (final_model is always defined by this point, so no fallback branch is needed)
final_predictions = final_model.predict(X_test)
final_r2 = r2_score(y_test, final_predictions)
final_mae = mean_absolute_error(y_test, final_predictions)

model_metadata = {
    'model_name': final_model_name,
    'model_type': type(final_model).__name__,
    'training_date': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S'),
    'features': feature_columns,
    'performance': {
        'test_r2_score': final_r2,
        'test_mae': final_mae,
        'test_rmse': np.sqrt(mean_squared_error(y_test, final_predictions))
    },
    'training_data_shape': X_train.shape,
    'test_data_shape': X_test.shape
}

# Save metadata
metadata_filename = '../models/model_metadata.pkl'
joblib.dump(model_metadata, metadata_filename)
print(f"✅ Metadata saved: {metadata_filename}")

# Save feature names for deployment
feature_names_filename = '../models/feature_names.pkl'
joblib.dump(feature_columns, feature_names_filename)
print(f"✅ Feature names saved: {feature_names_filename}")

# Create a simple prediction function
def predict_exam_score(model, scaler, **features):
    """
    Predict exam score based on student features.
    
    Parameters:
    model: trained ML model
    scaler: fitted StandardScaler
    **features: student features as keyword arguments
    
    Returns:
    predicted exam score
    """
    # Create DataFrame with features
    feature_df = pd.DataFrame([features])
    
    # Ensure all required features are present
    for col in feature_columns:
        if col not in feature_df.columns:
            feature_df[col] = 0  # Default value
    
    # Reorder columns to match training data
    feature_df = feature_df[feature_columns]
    
    # Scale features
    scaled_features = scaler.transform(feature_df)
    
    # Make prediction
    prediction = model.predict(scaled_features)[0]
    
    return prediction

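# Save prediction function
# Note: pickle/joblib store a plain function by reference (module and name), not its code.
# Loading this file only works where predict_exam_score is already defined or importable,
# so for deployment keep the function in a module and import it there.
prediction_func_filename = '../models/prediction_function.pkl'
joblib.dump(predict_exam_score, prediction_func_filename)
print(f"✅ Prediction function saved: {prediction_func_filename}")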

print(f"\n📊 Final Model Summary:")
print(f"   Model: {final_model_name}")
print(f"   Test R² Score: {final_r2:.4f}")
print(f"   Test MAE: {final_mae:.4f}")
print(f"   Features: {len(feature_columns)}")
print(f"   Training samples: {X_train.shape[0]}")
print(f"   Test samples: {X_test.shape[0]}")
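
Metadata like the dictionary saved above contains only plain values, so it could also be stored as JSON, which stays human-readable and does not depend on pickle. The sketch below round-trips a small example in memory. The values are placeholders for illustration, not results from this notebook.

```python
import json

# Placeholder values for illustration only; in the notebook these would come from
# final_model_name, final_r2, final_mae, X_train.shape and feature_columns.
model_metadata = {
    "model_name": "example-model",
    "features": ["study_hours", "attendance"],
    "performance": {"test_r2_score": 0.0, "test_mae": 0.0},
    "training_data_shape": (80, 2),
}

encoded = json.dumps(model_metadata, indent=2)
restored = json.loads(encoded)

print(encoded)
print(restored["training_data_shape"])  # tuples come back as lists
```

To write to disk instead, open a file and call `json.dump(model_metadata, f)`. The trained model itself still needs `joblib` or `pickle`.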

## 🧪 Step 8: Test the Saved Model

Let's test our saved model to make sure it works correctly:

In [None]:
# Test the saved model
print("🧪 Testing Saved Model:")
print("="*25)

# Load the saved model
loaded_model = joblib.load('../models/best_student_performance_model.pkl')
loaded_scaler = joblib.load('../models/scaler.pkl')
loaded_features = joblib.load('../models/feature_names.pkl')
loaded_metadata = joblib.load('../models/model_metadata.pkl')
loaded_predict_func = joblib.load('../models/prediction_function.pkl')

print(f"✅ Model loaded successfully!")
print(f"📊 Model type: {loaded_metadata['model_type']}")
print(f"🕒 Training date: {loaded_metadata['training_date']}")
print(f"🎯 Test R² Score: {loaded_metadata['performance']['test_r2_score']:.4f}")

# Test prediction with sample data
print(f"\n🔮 Testing Predictions:")

# Create sample students for testing
sample_students = [
    {
        'name': 'High Achiever',
        'study_hours': 15.0,
        'attendance': 95.0,
        'previous_grade': 85.0,
        'sleep_hours': 8.0,
        'extra_activities': 2,
        'family_support': 5,
        'study_efficiency': 15.79,
        'sleep_quality': 0.93,
        'work_life_balance': 0.33,
        'preparation_score': 0.85,
        'improvement_potential': 15.0
    },
    {
        'name': 'Average Student',
        'study_hours': 8.0,
        'attendance': 75.0,
        'previous_grade': 65.0,
        'sleep_hours': 7.0,
        'extra_activities': 3,
        'family_support': 3,
        'study_efficiency': 10.67,
        'sleep_quality': 1.0,
        'work_life_balance': 0.25,
        'preparation_score': 0.65,
        'improvement_potential': 35.0
    },
    {
        'name': 'Struggling Student',
        'study_hours': 3.0,
        'attendance': 60.0,
        'previous_grade': 45.0,
        'sleep_hours': 5.0,
        'extra_activities': 5,
        'family_support': 2,
        'study_efficiency': 5.0,
        'sleep_quality': 0.67,
        'work_life_balance': 0.17,
        'preparation_score': 0.35,
        'improvement_potential': 55.0
    }
]

for student in sample_students:
    name = student.pop('name')
    predicted_score = loaded_predict_func(loaded_model, loaded_scaler, **student)
    print(f"\n👤 {name}:")
    print(f"   Study Hours: {student['study_hours']}, Attendance: {student['attendance']:.0f}%")
    print(f"   Previous Grade: {student['previous_grade']:.0f}, Sleep: {student['sleep_hours']} hours")
    print(f"   🎯 Predicted Exam Score: {predicted_score:.1f}")

# Verify model performance on test set
# X_test comes from the already-scaled dataset, so it must not go through the scaler again
loaded_test_predictions = loaded_model.predict(X_test)
loaded_test_r2 = r2_score(y_test, loaded_test_predictions)

print(f"\n✅ Model Verification:")
print(f"   Original Test R²: {final_r2:.4f}")
print(f"   Loaded Test R²:   {loaded_test_r2:.4f}")
print(f"   Match: {'✅ Yes' if abs(final_r2 - loaded_test_r2) < 0.0001 else '❌ No'}")

print(f"\n🎉 Model successfully saved and tested!")
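
When verifying a reloaded model, each input should be scaled exactly once. Raw feature values (like the sample students) go through the scaler, while rows taken from the scaled dataset do not. The sketch below uses made-up numbers and plain Python in place of `StandardScaler` to show how scaling twice distorts the inputs.

```python
from statistics import mean, pstdev

raw = [40.0, 55.0, 70.0, 85.0, 100.0]   # made-up raw feature values
mu, sigma = mean(raw), pstdev(raw)      # what a standard scaler would learn from raw data


def scale(values):
    """Standardise values with the statistics learned from the raw data."""
    return [(v - mu) / sigma for v in values]


scaled_once = scale(raw)           # correct: centred on 0
scaled_twice = scale(scaled_once)  # wrong: already-scaled values scaled again

print([round(v, 2) for v in scaled_once])
print([round(v, 2) for v in scaled_twice])
```

After the second pass every value sits more than three standard deviations below the mean, so a model would see inputs unlike anything it was trained on.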

## 📋 Step 9: Training Summary

Let's summarize everything we accomplished in this model training phase:

In [None]:
print("🎯 MACHINE LEARNING TRAINING SUMMARY")
print("="*55)

print("✅ COMPLETED TASKS:")
print(f"   1. 📂 Data Loading:")
print(f"      • Loaded {df_scaled.shape[0]} cleaned student records")
print(f"      • Used {len(feature_columns)} engineered features")

print(f"\n   2. ✂️ Data Splitting:")
print(f"      • Training set: {X_train.shape[0]} samples (80%)")
print(f"      • Testing set: {X_test.shape[0]} samples (20%)")

print(f"\n   3. 🤖 Model Training:")
print(f"      • Trained {len(models)} different algorithms")
print(f"      • Linear, Ridge, Lasso, Random Forest, Gradient Boosting, XGBoost, SVR")
print(f"      • Used cross-validation for robust evaluation")

print(f"\n   4. 📊 Model Evaluation:")
print(f"      • Best model: {final_model_name}")
print(f"      • Test R² Score: {final_r2:.4f} ({final_r2*100:.1f}% variance explained)")
print(f"      • Test MAE: {final_mae:.4f} points")
print(f"      • Model explains real patterns in student performance!")

print(f"\n   5. 🎯 Feature Analysis:")
if available_tree_models:
    # Get top features from best tree model
    if best_model_name in available_tree_models:
        model = trained_models[best_model_name]
        if hasattr(model, 'feature_importances_'):
            importance = model.feature_importances_
            importance_df = pd.DataFrame({
                'feature': feature_columns,
                'importance': importance
            }).sort_values('importance', ascending=False)
            top_feature = importance_df.iloc[0]['feature']
            print(f"      • Most important feature: {top_feature}")
            print(f"      • Feature importance analysis completed")

print(f"\n   6. 💾 Model Persistence:")
print(f"      • Saved trained model for deployment")
print(f"      • Saved preprocessing scaler")
print(f"      • Created prediction function")
print(f"      • Saved model metadata and feature names")

print(f"\n📊 MODEL PERFORMANCE BREAKDOWN:")
print(f"   • R² Score of {final_r2:.4f} means:")
print(f"     - Model explains {final_r2*100:.1f}% of exam score variance")
print(f"     - The closer this is to 1.0, the stronger the predictive power")
print(f"   • MAE of {final_mae:.4f} means:")
print(f"     - Predictions are off by {final_mae:.1f} points on average")
print(f"     - Compare with the exam score spread (std: {y.std():.2f}) to judge accuracy")

print(f"\n🚀 NEXT STEPS:")
print(f"   • Create model evaluation notebook (04_model_evaluation.ipynb)")
print(f"   • Build web application for model deployment")
print(f"   • Create interactive dashboard for predictions")
print(f"   • Deploy model to help students improve performance")

print(f"\n💡 KEY INSIGHTS:")
print(f"   • Machine learning can predict student performance accurately")
print(f"   • Study habits and attendance are likely key factors")
print(f"   • Model is ready for real-world application")
print(f"   • Can help identify students who need extra support")

print(f"\n🎉 Congratulations! Your ML model is trained and ready!")

## 🎉 Congratulations!

You've successfully completed machine learning model training! You've learned:

✅ **Data Splitting** - Train/test splits and cross-validation  
✅ **Multiple Algorithms** - Linear, Tree-based, and Ensemble methods  
✅ **Model Evaluation** - R², MAE, MSE, and residual analysis  
✅ **Feature Importance** - Understanding what drives predictions  
✅ **Hyperparameter Tuning** - Optimizing model performance  
✅ **Model Persistence** - Saving models for deployment  

### 🚀 What's Next?

In the next notebook (`04_model_evaluation.ipynb`), we'll dive deeper into:
- Advanced model evaluation techniques
- Error analysis and model diagnostics
- Model interpretation and explainability
- Preparing for deployment

Then we'll create a web application in `05_model_deployment.ipynb`!

The summary cell above prints your model's final test R², which is the share of exam score variance the model explains (not a classification-style accuracy). 🌟