# 🎓 Student Performance ML Analysis
## Comprehensive Machine Learning Pipeline

This notebook applies a range of **standard machine learning techniques** to a student performance dataset containing Math, Physics, and Chemistry scores and a letter grade for each student (seven levels, F through A+).

### Techniques covered:
- **Classification**: Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, XGBoost, SVM, KNN, Naive Bayes, MLP Neural Network
- **Regression**: Linear, Ridge, Lasso, Random Forest Regressor, Gradient Boosting Regressor
- **Clustering**: K-Means, DBSCAN
- **Dimensionality Reduction**: PCA
- **Ensemble Methods**: Voting Classifier, Stacking Classifier
- **Evaluation**: Cross-validation, Confusion Matrices, ROC Curves, Learning Curves, Feature Importance
- **Hyperparameter Tuning**: GridSearchCV, RandomizedSearchCV
- **Final Output**: Interactive HTML report with all results and explanations

In [2]:
# ============================================================
# Section 1: Import Libraries and Configure Environment
# ============================================================
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns
from io import BytesIO
import base64
import os
import json

# Scikit-learn: Classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              VotingClassifier, StackingClassifier,
                              RandomForestRegressor, GradientBoostingRegressor)
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# Scikit-learn: Regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Scikit-learn: Clustering & Dimensionality Reduction
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA

# Scikit-learn: Preprocessing & Evaluation
from sklearn.preprocessing import LabelEncoder, StandardScaler, label_binarize
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     GridSearchCV, RandomizedSearchCV,
                                     StratifiedKFold, learning_curve)
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, classification_report, confusion_matrix,
                             roc_curve, auc, roc_auc_score,
                             mean_absolute_error, mean_squared_error, r2_score,
                             silhouette_score)
from sklearn.inspection import permutation_importance

# XGBoost (optional)
try:
    from xgboost import XGBClassifier
    HAS_XGBOOST = True
    print("✅ XGBoost available")
except Exception:  # covers ImportError and the OSError raised when libomp is missing
    HAS_XGBOOST = False
    print("⚠️ XGBoost not available (needs libomp), will use sklearn GradientBoosting instead")

# Jinja2 for HTML report
try:
    from jinja2 import Template
    HAS_JINJA2 = True
    print("✅ Jinja2 available")
except ImportError:
    HAS_JINJA2 = False
    print("⚠️ Jinja2 not available, will use string formatting for HTML report")

# Configure plotting
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 12

# Random seed for reproducibility
SEED = 42
np.random.seed(SEED)

# Output directory
os.makedirs('outputs', exist_ok=True)
os.makedirs('outputs/plots', exist_ok=True)

# Dictionary to store all model results
results = {}
report_images = {}

print("✅ All libraries loaded successfully!")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"Scikit-learn: {__import__('sklearn').__version__}")

⚠️ XGBoost not available (needs libomp), will use sklearn GradientBoosting instead
✅ Jinja2 available
✅ All libraries loaded successfully!
NumPy: 2.4.2
Pandas: 3.0.1
Scikit-learn: 1.8.0


## Section 2: Load and Explore the Dataset

In [3]:
# ============================================================
# Section 2: Load and Explore the Dataset
# ============================================================
df = pd.read_csv('student_dataset.csv')

print("=" * 60)
print("DATASET OVERVIEW")
print("=" * 60)
print(f"\nüìä Shape: {df.shape[0]} rows √ó {df.shape[1]} columns")
print(f"\nüìã Columns: {list(df.columns)}")
print(f"\nüîç Data Types:\n{df.dtypes}")
print(f"\n‚ùì Missing Values:\n{df.isnull().sum()}")
print(f"\nüîÑ Duplicates: {df.duplicated().sum()}")
print(f"\nüìà Statistical Summary:")
df.describe()

DATASET OVERVIEW

📊 Shape: 9000 rows × 10 columns

📋 Columns: ['Student_Names', 'Phone_No.', 'Math', 'Physics', 'Chemistry', 'Grade', 'Comment', 'Roll No.', 'School Name', 'Student Address']

🔍 Data Types:
Student_Names        str
Phone_No.          int64
Math               int64
Physics            int64
Chemistry          int64
Grade                str
Comment              str
Roll No.           int64
School Name          str
Student Address      str
dtype: object

❓ Missing Values:
Student_Names      0
Phone_No.          0
Math               0
Physics            0
Chemistry          0
Grade              0
Comment            0
Roll No.           0
School Name        0
Student Address    0
dtype: int64

🔄 Duplicates: 0

📈 Statistical Summary:


| | Phone_No. | Math | Physics | Chemistry | Roll No. |
|---|---|---|---|---|---|
| count | 9000.0 | 9000.0 | 9000.0 | 9000.0 | 9000.0 |
| mean | 9498521000.0 | 55.276111 | 54.697556 | 54.854889 | 550174.095667 |
| std | 286563000.0 | 26.10914 | 26.232446 | 26.26132 | 28955.471076 |
| min | 9000052000.0 | 10.0 | 10.0 | 10.0 | 500002.0 |
| 25% | 9251158000.0 | 33.0 | 32.0 | 32.0 | 524968.25 |
| 50% | 9498910000.0 | 56.0 | 55.0 | 55.0 | 550274.5 |
| 75% | 9745590000.0 | 78.0 | 77.0 | 77.0 | 575254.75 |
| max | 9999838000.0 | 100.0 | 100.0 | 100.0 | 599994.0 |


In [4]:
# Display first and last rows
print("First 5 rows:")
display(df.head())
print("\nLast 5 rows:")
display(df.tail())

# Unique values for categorical columns
print(f"\nüìù Unique Grades ({df['Grade'].nunique()}): {sorted(df['Grade'].unique())}")
print(f"\nüí¨ Unique Comments ({df['Comment'].nunique()}): {df['Comment'].unique()}")
print(f"\nüè´ Unique Schools: {df['School Name'].unique()}")

# Grade distribution
print(f"\nüìä Grade Distribution:")
print(df['Grade'].value_counts().sort_index())

First 5 rows:


| | Student_Names | Phone_No. | Math | Physics | Chemistry | Grade | Comment | Roll No. | School Name | Student Address |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Donald Contreras | 9208625450 | 76 | 84 | 54 | B+ | Good Pursuance | 524613 | Martin Luther School | 478 Mooney Park, New Valerie, VI 28836 |
| 1 | Joseph Horton | 9886408555 | 91 | 75 | 78 | A | Very Good Achivement | 561635 | Martin Luther School | 037 Matthew Shores, Greeneton, CA 98399 |
| 2 | Savannah Burns MD | 9047592659 | 64 | 98 | 20 | C | Below Average Achivement | 560985 | Martin Luther School | 96124 Lloyd Streets, Edwardmouth, DC 61677 |
| 3 | William Carter | 9048473864 | 15 | 95 | 32 | D | Poor Pursuance | 535126 | Martin Luther School | 11959 Clark Village, Ivanview, NH 43940 |
| 4 | John Rodriguez | 9685225730 | 86 | 86 | 66 | B+ | Good Pursuance | 559410 | Martin Luther School | 051 Weaver Glen Apt. 724, West Davidborough, M... |



Last 5 rows:


| | Student_Names | Phone_No. | Math | Physics | Chemistry | Grade | Comment | Roll No. | School Name | Student Address |
|---|---|---|---|---|---|---|---|---|---|---|
| 8995 | Kimberly Stevens | 9129352703 | 40 | 87 | 65 | B | Average Performance | 569342 | Martin Luther School | 27054 Adrian Streets, Diazmouth, OH 81346 |
| 8996 | Kelsey Bonilla | 9649715711 | 56 | 84 | 75 | B+ | Good Pursuance | 530124 | Martin Luther School | 570 Christopher Run, Williammouth, ND 11535 |
| 8997 | Kelly Dunn | 9825362271 | 80 | 70 | 16 | C | Below Average Achivement | 592266 | Martin Luther School | 32283 Carpenter Summit, North Patricia, PR 51483 |
| 8998 | Joseph Nichols | 9363540473 | 24 | 95 | 59 | C | Below Average Achivement | 583028 | Martin Luther School | 2336 Blackburn Fall Apt. 905, South Shelby, ND... |
| 8999 | Susan Armstrong | 9879539785 | 31 | 76 | 18 | D | Poor Pursuance | 503637 | Martin Luther School | 2328 Jennifer Extension, Lake David, OR 11243 |



📝 Unique Grades (7): ['A', 'A+', 'B', 'B+', 'C', 'D', 'F']

💬 Unique Comments (7): <StringArray>
[          'Good Pursuance',     'Very Good Achivement',
 'Below Average Achivement',           'Poor Pursuance',
                   'Failed',      'Average Performance',
    'Excellent Performance']
Length: 7, dtype: str

🏫 Unique Schools: <StringArray>
['Martin Luther School']
Length: 1, dtype: str

📊 Grade Distribution:
Grade
A      360
A+      49
B     1797
B+    1014
C     2187
D     2887
F      706
Name: count, dtype: int64


## Section 3: Data Cleaning and Preprocessing

In [5]:
# ============================================================
# Section 3: Data Cleaning and Preprocessing
# ============================================================

# Drop irrelevant columns (identifiers, no-variance columns)
drop_cols = ['Student_Names', 'Phone_No.', 'Roll No.', 'School Name', 'Student Address']
# Comment is 1:1 with Grade (leaky feature), drop it too
drop_cols.append('Comment')
df_clean = df.drop(columns=drop_cols)

print(f"✅ Dropped columns: {drop_cols}")
print(f"Remaining columns: {list(df_clean.columns)}")
print(f"Shape after cleaning: {df_clean.shape}")

# Check data types
print(f"\nData types:\n{df_clean.dtypes}")

# Outlier detection using IQR
print("\nüìä Outlier Analysis (IQR Method):")
for col in ['Math', 'Physics', 'Chemistry']:
    Q1 = df_clean[col].quantile(0.25)
    Q3 = df_clean[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = df_clean[(df_clean[col] < lower) | (df_clean[col] > upper)].shape[0]
    print(f"  {col}: Q1={Q1:.0f}, Q3={Q3:.0f}, IQR={IQR:.0f}, "
          f"Bounds=[{lower:.0f}, {upper:.0f}], Outliers={outliers}")

# Create Total and Average scores
df_clean['Total_Score'] = df_clean['Math'] + df_clean['Physics'] + df_clean['Chemistry']
df_clean['Average_Score'] = df_clean['Total_Score'] / 3

print(f"\n✅ Added Total_Score and Average_Score")
print(f"\nCleaned dataset preview:")
df_clean.head()

✅ Dropped columns: ['Student_Names', 'Phone_No.', 'Roll No.', 'School Name', 'Student Address', 'Comment']
Remaining columns: ['Math', 'Physics', 'Chemistry', 'Grade']
Shape after cleaning: (9000, 4)

Data types:
Math         int64
Physics      int64
Chemistry    int64
Grade          str
dtype: object

📊 Outlier Analysis (IQR Method):
  Math: Q1=33, Q3=78, IQR=45, Bounds=[-34, 146], Outliers=0
  Physics: Q1=32, Q3=77, IQR=45, Bounds=[-36, 144], Outliers=0
  Chemistry: Q1=32, Q3=77, IQR=45, Bounds=[-36, 144], Outliers=0

✅ Added Total_Score and Average_Score

Cleaned dataset preview:


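| | Math | Physics | Chemistry | Grade | Total_Score | Average_Score |
|---|---|---|---|---|---|---|
| 0 | 76 | 84 | 54 | B+ | 214 | 71.333333 |
| 1 | 91 | 75 | 78 | A | 244 | 81.333333 |
| 2 | 64 | 98 | 20 | C | 182 | 60.666667 |
| 3 | 15 | 95 | 32 | D | 142 | 47.333333 |
| 4 | 86 | 86 | 66 | B+ | 238 | 79.333333 |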
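
The IQR bounds printed in the outlier analysis follow directly from the quartiles. As a minimal sketch (the helper name `iqr_bounds` is hypothetical and not part of the notebook), the Math bounds can be reproduced from the printed Q1 and Q3:

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles for Math as printed in the outlier analysis
lower, upper = iqr_bounds(33, 78)
print(lower, upper)  # -34.5 145.5
```

Every score in this dataset lies between 10 and 100, so no value can fall outside fences this wide, which is consistent with the outlier count of 0 for all three subjects.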


## Section 4: Exploratory Data Analysis (EDA) with Visualizations

In [6]:
# ============================================================
# Section 4: EDA Visualizations
# ============================================================

def save_plot(fig, name):
    """Save plot to file and encode as base64 for HTML report."""
    path = f'outputs/plots/{name}.png'
    fig.savefig(path, dpi=150, bbox_inches='tight', facecolor='white')
    buf = BytesIO()
    fig.savefig(buf, format='png', dpi=150, bbox_inches='tight', facecolor='white')
    buf.seek(0)
    report_images[name] = base64.b64encode(buf.read()).decode('utf-8')
    plt.close(fig)
    return path

# 4a. Grade Distribution
grade_order = ['F', 'D', 'C', 'B', 'B+', 'A', 'A+']
fig, ax = plt.subplots(figsize=(10, 6))
grade_counts = df_clean['Grade'].value_counts().reindex(grade_order)
colors = ['#e74c3c', '#e67e22', '#f39c12', '#3498db', '#2980b9', '#27ae60', '#1abc9c']
bars = ax.bar(grade_order, grade_counts.values, color=colors, edgecolor='black', linewidth=0.5)
for bar, count in zip(bars, grade_counts.values):
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 30,
            str(count), ha='center', va='bottom', fontweight='bold')
ax.set_title('Grade Distribution', fontsize=16, fontweight='bold')
ax.set_xlabel('Grade', fontsize=13)
ax.set_ylabel('Count', fontsize=13)
save_plot(fig, 'grade_distribution')
print("✅ Grade distribution plot saved")

# 4b. Score distributions
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for idx, col in enumerate(['Math', 'Physics', 'Chemistry']):
    axes[idx].hist(df_clean[col], bins=30, color=colors[idx+3], edgecolor='black',
                   alpha=0.8, linewidth=0.5)
    axes[idx].axvline(df_clean[col].mean(), color='red', linestyle='--',
                      label=f'Mean: {df_clean[col].mean():.1f}')
    axes[idx].axvline(df_clean[col].median(), color='green', linestyle='--',
                      label=f'Median: {df_clean[col].median():.1f}')
    axes[idx].set_title(f'{col} Score Distribution', fontsize=14, fontweight='bold')
    axes[idx].set_xlabel('Score')
    axes[idx].set_ylabel('Frequency')
    axes[idx].legend()
fig.tight_layout()
save_plot(fig, 'score_distributions')
print("✅ Score distribution plots saved")

# 4c. Correlation heatmap
fig, ax = plt.subplots(figsize=(8, 6))
numeric_cols = ['Math', 'Physics', 'Chemistry', 'Total_Score', 'Average_Score']
corr_matrix = df_clean[numeric_cols].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, fmt='.3f',
            square=True, linewidths=1, ax=ax)
ax.set_title('Correlation Heatmap', fontsize=14, fontweight='bold')
save_plot(fig, 'correlation_heatmap')
print("✅ Correlation heatmap saved")

# 4d. Boxplots of scores by Grade
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
for idx, col in enumerate(['Math', 'Physics', 'Chemistry']):
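    # Passing palette without hue is deprecated in seaborn >= 0.13; map hue to x instead
    sns.boxplot(data=df_clean, x='Grade', y=col, order=grade_order,
                hue='Grade', hue_order=grade_order, palette=colors,
                legend=False, ax=axes[idx])
    axes[idx].set_title(f'{col} by Grade', fontsize=14, fontweight='bold')
fig.tight_layout()
save_plot(fig, 'boxplots_by_grade')
print("✅ Boxplots saved")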

# 4e. Violin plots
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
for idx, col in enumerate(['Math', 'Physics', 'Chemistry']):
    # Passing palette without hue is deprecated in seaborn >= 0.13; map hue to x instead
    sns.violinplot(data=df_clean, x='Grade', y=col, order=grade_order,
                   hue='Grade', hue_order=grade_order, palette=colors,
                   legend=False, ax=axes[idx], inner='box')
    axes[idx].set_title(f'{col} Distribution by Grade', fontsize=14, fontweight='bold')
fig.tight_layout()
save_plot(fig, 'violin_plots')
print("✅ Violin plots saved")

print("\n✅ All EDA visualizations generated!")

✅ Grade distribution plot saved
✅ Score distribution plots saved
✅ Correlation heatmap saved
✅ Boxplots saved
✅ Violin plots saved

✅ All EDA visualizations generated!


## Section 5: Feature Engineering

In [7]:
# ============================================================
# Section 5: Feature Engineering
# ============================================================

# Additional engineered features
df_clean['Max_Score'] = df_clean[['Math', 'Physics', 'Chemistry']].max(axis=1)
df_clean['Min_Score'] = df_clean[['Math', 'Physics', 'Chemistry']].min(axis=1)
df_clean['Score_Range'] = df_clean['Max_Score'] - df_clean['Min_Score']
df_clean['Score_Std'] = df_clean[['Math', 'Physics', 'Chemistry']].std(axis=1)

# Binary Pass/Fail (F=0, else=1)
df_clean['Pass_Fail'] = (df_clean['Grade'] != 'F').astype(int)

# Above average flag
overall_avg = df_clean['Average_Score'].mean()
df_clean['Is_Above_Average'] = (df_clean['Average_Score'] > overall_avg).astype(int)

# Subject-wise performance bins
for col in ['Math', 'Physics', 'Chemistry']:
    df_clean[f'{col}_Level'] = pd.cut(df_clean[col],
                                       bins=[0, 40, 70, 100],
                                       labels=['Low', 'Medium', 'High'],
                                       include_lowest=True)

print("✅ Engineered Features Created:")
print(f"  • Max_Score, Min_Score, Score_Range, Score_Std")
print(f"  • Pass_Fail (F=0, else=1): {df_clean['Pass_Fail'].value_counts().to_dict()}")
print(f"  • Is_Above_Average: {df_clean['Is_Above_Average'].value_counts().to_dict()}")
print(f"  • Subject performance levels (Low/Medium/High)")
print(f"\nDataset shape: {df_clean.shape}")

# Save cleaned dataset
df_clean.to_csv('outputs/student_cleaned.csv', index=False)
print("✅ Cleaned dataset saved to outputs/student_cleaned.csv")

df_clean.head()

✅ Engineered Features Created:
  • Max_Score, Min_Score, Score_Range, Score_Std
  • Pass_Fail (F=0, else=1): {1: 8294, 0: 706}
  • Is_Above_Average: {1: 4511, 0: 4489}
  • Subject performance levels (Low/Medium/High)

Dataset shape: (9000, 15)
✅ Cleaned dataset saved to outputs/student_cleaned.csv


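| | Math | Physics | Chemistry | Grade | Total_Score | Average_Score | Max_Score | Min_Score | Score_Range | Score_Std | Pass_Fail | Is_Above_Average | Math_Level | Physics_Level | Chemistry_Level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 76 | 84 | 54 | B+ | 214 | 71.333333 | 84 | 54 | 30 | 15.534907 | 1 | 1 | High | High | Medium |
| 1 | 91 | 75 | 78 | A | 244 | 81.333333 | 91 | 75 | 16 | 8.504901 | 1 | 1 | High | High | High |
| 2 | 64 | 98 | 20 | C | 182 | 60.666667 | 98 | 20 | 78 | 39.106692 | 1 | 1 | Medium | High | Low |
| 3 | 15 | 95 | 32 | D | 142 | 47.333333 | 95 | 15 | 80 | 42.14657 | 1 | 0 | Low | High | Low |
| 4 | 86 | 86 | 66 | B+ | 238 | 79.333333 | 86 | 66 | 20 | 11.547005 | 1 | 1 | High | High | Medium |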
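
`Score_Std` is computed with pandas' `.std(axis=1)`, which defaults to the sample standard deviation (`ddof=1`). The sketch below uses only the standard library to check this against row 0 of the preview; it illustrates the convention and is not code from the notebook.

```python
import statistics

# Math, Physics, Chemistry for row 0 of the preview table
scores = [76, 84, 54]

sample_std = statistics.stdev(scores)       # divides by n - 1, like pandas' default
population_std = statistics.pstdev(scores)  # divides by n, for comparison

print(f"sample std:     {sample_std:.4f}")  # agrees with Score_Std for row 0 (about 15.5349)
print(f"population std: {population_std:.4f}")
```

With only three subjects per student the two conventions differ noticeably, so it is worth knowing which one a feature uses before comparing it with values computed elsewhere.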


## Section 6 & 7: Encode Variables, Train-Test Split, and Feature Scaling

In [8]:
# ============================================================
# Section 6 & 7: Encode, Split, Scale
# ============================================================

# Ordinal encoding for Grade (preserving order)
grade_map = {'F': 0, 'D': 1, 'C': 2, 'B': 3, 'B+': 4, 'A': 5, 'A+': 6}
grade_names = ['F', 'D', 'C', 'B', 'B+', 'A', 'A+']
df_clean['Grade_Encoded'] = df_clean['Grade'].map(grade_map)

# Feature matrix (numeric features only)
feature_cols = ['Math', 'Physics', 'Chemistry', 'Total_Score', 'Average_Score',
                'Max_Score', 'Min_Score', 'Score_Range', 'Score_Std']

X = df_clean[feature_cols].values
y_multi = df_clean['Grade_Encoded'].values  # Multiclass target
y_binary = df_clean['Pass_Fail'].values      # Binary target

print(f"Feature matrix X shape: {X.shape}")
print(f"Multiclass target classes: {np.unique(y_multi)} ({len(np.unique(y_multi))} classes)")
print(f"Binary target distribution: Pass={y_binary.sum()}, Fail={(1-y_binary).sum()}")

# Train-Test Split (80/20, stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_multi, test_size=0.2, random_state=SEED, stratify=y_multi)

X_train_bin, X_test_bin, y_train_bin, y_test_bin = train_test_split(
    X, y_binary, test_size=0.2, random_state=SEED, stratify=y_binary)

print(f"\nTrain set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Feature Scaling (one scaler per split, each fit on its own training data only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

scaler_bin = StandardScaler()
X_train_bin_scaled = scaler_bin.fit_transform(X_train_bin)
X_test_bin_scaled = scaler_bin.transform(X_test_bin)

print("✅ Feature scaling applied (StandardScaler)")

# Regression targets: predict Total_Score and Average_Score from the three subject scores
X_reg = df_clean[['Math', 'Physics', 'Chemistry']].values
y_reg_total = df_clean['Total_Score'].values
y_reg_avg = df_clean['Average_Score'].values

# A single call splits X and both targets with identical row assignments
(X_reg_train, X_reg_test,
 y_total_train, y_total_test,
 y_avg_train, y_avg_test) = train_test_split(
    X_reg, y_reg_total, y_reg_avg, test_size=0.2, random_state=SEED)

scaler_reg = StandardScaler()
X_reg_train_scaled = scaler_reg.fit_transform(X_reg_train)
X_reg_test_scaled = scaler_reg.transform(X_reg_test)

print("✅ Data prepared for all model types!")

Feature matrix X shape: (9000, 9)
Multiclass target classes: [0 1 2 3 4 5 6] (7 classes)
Binary target distribution: Pass=8294, Fail=706

Train set: 7200 samples
Test set: 1800 samples
✅ Feature scaling applied (StandardScaler)
✅ Data prepared for all model types!
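
The scaling step learns its statistics from the training split and reuses them on the test split. The following sketch shows the same idea with the standard library only; the helper names and the toy numbers are made up for illustration. It uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import statistics


def fit_standardizer(column):
    """Learn the mean and population std (ddof=0) from training values."""
    return statistics.fmean(column), statistics.pstdev(column)


def standardize(column, mean, std):
    """Apply previously learned statistics to any column of values."""
    return [(x - mean) / std for x in column]


train = [10.0, 20.0, 30.0, 40.0]  # toy training scores (made up)
test = [25.0, 50.0]               # toy test scores (made up)

mean, std = fit_standardizer(train)          # statistics come from train only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)   # test reuses the train statistics

print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and unit variance by construction; the test values generally do not, which is expected because they are measured on the training scale.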


## Section 8–15: Classification Models

We train eight classifier families on the student grade prediction task. SVM is fit with both a linear and an RBF kernel, and XGBoost is added when it is installed.

In [10]:
# ============================================================
# Helper function to evaluate and store classifier results
# ============================================================
def evaluate_classifier(name, model, X_tr, X_te, y_tr, y_te, scale=False):
    """Train, predict, evaluate, and store results for a classifier."""
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    
    acc = accuracy_score(y_te, y_pred)
    prec = precision_score(y_te, y_pred, average='weighted', zero_division=0)
    rec = recall_score(y_te, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_te, y_pred, average='weighted', zero_division=0)
    
    results[name] = {
        'accuracy': acc,
        'precision': prec,
        'recall': rec,
        'f1_score': f1,
        'y_pred': y_pred,
        'model': model
    }
    
    print(f"\n{'='*50}")
    print(f"üìä {name}")
    print(f"{'='*50}")
    print(f"  Accuracy:  {acc:.4f}")
    print(f"  Precision: {prec:.4f}")
    print(f"  Recall:    {rec:.4f}")
    print(f"  F1-Score:  {f1:.4f}")
    
    return model, y_pred

# ============================================================
# Model 1: Logistic Regression
# ============================================================
lr_model, lr_pred = evaluate_classifier(
    'Logistic Regression',
    LogisticRegression(solver='lbfgs', max_iter=1000, random_state=SEED),
    X_train_scaled, X_test_scaled, y_train, y_test
)

# ============================================================
# Model 2: Decision Tree Classifier
# ============================================================
dt_model, dt_pred = evaluate_classifier(
    'Decision Tree',
    DecisionTreeClassifier(max_depth=10, random_state=SEED),
    X_train, X_test, y_train, y_test
)

# Feature importance from Decision Tree
dt_importances = pd.Series(dt_model.feature_importances_, index=feature_cols)
print(f"\n  Top features: {dt_importances.nlargest(3).to_dict()}")

# ============================================================
# Model 3: Random Forest Classifier
# ============================================================
rf_model, rf_pred = evaluate_classifier(
    'Random Forest',
    RandomForestClassifier(n_estimators=100, max_depth=15, random_state=SEED, n_jobs=-1),
    X_train, X_test, y_train, y_test
)

rf_importances = pd.Series(rf_model.feature_importances_, index=feature_cols)
print(f"\n  Top features: {rf_importances.nlargest(3).to_dict()}")

# ============================================================
# Model 4: Gradient Boosting / XGBoost
# ============================================================
gb_model, gb_pred = evaluate_classifier(
    'Gradient Boosting',
    GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=5, random_state=SEED),
    X_train, X_test, y_train, y_test
)

if HAS_XGBOOST:
    xgb_model, xgb_pred = evaluate_classifier(
        'XGBoost',
        # use_label_encoder is deprecated in recent XGBoost releases, so it is omitted
        XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=5,
                      random_state=SEED,
                      eval_metric='mlogloss', verbosity=0),
        X_train, X_test, y_train, y_test
    )

# ============================================================
# Model 5: Support Vector Machine (SVM)
# ============================================================
for kernel in ['linear', 'rbf']:
    svm_model, svm_pred = evaluate_classifier(
        f'SVM ({kernel})',
        SVC(kernel=kernel, random_state=SEED, probability=True),
        X_train_scaled, X_test_scaled, y_train, y_test
    )

# ============================================================
# Model 6: K-Nearest Neighbors
# ============================================================
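# Find optimal k
# Note: k is selected on the test set here, so the KNN test accuracy reported
# below is optimistically biased; cross-validating on the training set avoids this.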
k_scores = {}
for k in [3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    k_scores[k] = accuracy_score(y_test, knn.predict(X_test_scaled))

best_k = max(k_scores, key=k_scores.get)
print(f"\n  KNN accuracy by k: {k_scores}")
print(f"  Best k = {best_k}")

knn_model, knn_pred = evaluate_classifier(
    f'KNN (k={best_k})',
    KNeighborsClassifier(n_neighbors=best_k),
    X_train_scaled, X_test_scaled, y_train, y_test
)

# ============================================================
# Model 7: Naive Bayes
# ============================================================
nb_model, nb_pred = evaluate_classifier(
    'Naive Bayes',
    GaussianNB(),
    X_train_scaled, X_test_scaled, y_train, y_test
)

# ============================================================
# Model 8: MLP Neural Network
# ============================================================
mlp_model, mlp_pred = evaluate_classifier(
    'MLP Neural Network',
    MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500,
                  random_state=SEED, early_stopping=True),
    X_train_scaled, X_test_scaled, y_train, y_test
)

print("\n\n✅ All 8+ classification models trained!")


📊 Logistic Regression
  Accuracy:  0.9711
  Precision: 0.9716
  Recall:    0.9711
  F1-Score:  0.9702

📊 Decision Tree
  Accuracy:  1.0000
  Precision: 1.0000
  Recall:    1.0000
  F1-Score:  1.0000

  Top features: {'Average_Score': 0.9877311189852754, 'Total_Score': 0.0122688810147245, 'Math': 0.0}

📊 Random Forest
  Accuracy:  1.0000
  Precision: 1.0000
  Recall:    1.0000
  F1-Score:  1.0000

  Top features: {'Total_Score': 0.4492175293994171, 'Average_Score': 0.40343376414492554, 'Min_Score': 0.05308465747399933}

📊 Gradient Boosting
  Accuracy:  1.0000
  Precision: 1.0000
  Recall:    1.0000
  F1-Score:  1.0000

📊 SVM (linear)
  Accuracy:  0.9917
  Precision: 0.9918
  Recall:    0.9917
  F1-Score:  0.9916

📊 SVM (rbf)
  Accuracy:  0.9739
  Precision: 0.9742
  Recall:    0.9739
  F1-Score:  0.9738

  KNN accuracy by k: {3: 0.9438888888888889, 5: 0.9466666666666667, 7: 0.9405555555555556, 9: 0.9438888888888889, 11: 0.9416666666666667}
  Best k = 5

📊 KNN (k=5)
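
`evaluate_classifier` reports precision, recall, and F1 with `average='weighted'`, meaning each class's score is weighted by its share of the true labels. As a minimal sketch of that averaging (the function `weighted_f1` and the toy labels are hypothetical, written here only to illustrate the definition):

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        score += (n / total) * f1              # weight by class support
    return score


y_true = [0, 0, 0, 0, 1, 1]  # toy labels (made up)
y_pred = [0, 0, 0, 1, 1, 1]
print(round(weighted_f1(y_true, y_pred), 4))
```

Under the same weighting, recall reduces to overall accuracy, because the per-class supports cancel. That is why the Recall and Accuracy figures coincide for every model in the output above.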

## Section 16–19: Regression Models

In [11]:
# ============================================================
# Section 16–19: Regression Models
# ============================================================
reg_results = {}

def evaluate_regressor(name, model, X_tr, X_te, y_tr, y_te):
    """Train and evaluate a regression model."""
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    
    r2 = r2_score(y_te, y_pred)
    mae = mean_absolute_error(y_te, y_pred)
    mse = mean_squared_error(y_te, y_pred)
    rmse = np.sqrt(mse)
    
    reg_results[name] = {'R2': r2, 'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'model': model}
    
    print(f"\n{'='*50}")
    print(f"üìà {name}")
    print(f"{'='*50}")
    print(f"  R¬≤:   {r2:.4f}")
    print(f"  MAE:  {mae:.4f}")
    print(f"  RMSE: {rmse:.4f}")
    
    return model, y_pred

# Model 9: Linear Regression (predict Total_Score from individual scores)
lr_reg, lr_reg_pred = evaluate_regressor(
    'Linear Regression', LinearRegression(),
    X_reg_train, X_reg_test, y_total_train, y_total_test)

# Model 10: Ridge Regression
ridge_reg, ridge_pred = evaluate_regressor(
    'Ridge Regression', Ridge(alpha=1.0),
    X_reg_train, X_reg_test, y_total_train, y_total_test)

# Model 10b: Lasso Regression
lasso_reg, lasso_pred = evaluate_regressor(
    'Lasso Regression', Lasso(alpha=0.1),
    X_reg_train, X_reg_test, y_total_train, y_total_test)

# Model 11: Random Forest Regressor (predict Average_Score)
rfr_reg, rfr_pred = evaluate_regressor(
    'Random Forest Regressor', RandomForestRegressor(n_estimators=100, random_state=SEED),
    X_reg_train, X_reg_test, y_avg_train, y_avg_test)

# Model 12: Gradient Boosting Regressor
gbr_reg, gbr_pred = evaluate_regressor(
    'Gradient Boosting Regressor', GradientBoostingRegressor(n_estimators=100, random_state=SEED),
    X_reg_train, X_reg_test, y_avg_train, y_avg_test)

# Plot: Actual vs Predicted for Linear Regression
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].scatter(y_total_test, lr_reg_pred, alpha=0.3, s=10, color='steelblue')
axes[0].plot([y_total_test.min(), y_total_test.max()],
             [y_total_test.min(), y_total_test.max()], 'r--', lw=2)
axes[0].set_xlabel('Actual Total Score')
axes[0].set_ylabel('Predicted Total Score')
axes[0].set_title('Linear Regression: Actual vs Predicted', fontweight='bold')

# Residuals
residuals = y_total_test - lr_reg_pred
axes[1].scatter(lr_reg_pred, residuals, alpha=0.3, s=10, color='coral')
axes[1].axhline(0, color='red', linestyle='--', lw=2)
axes[1].set_xlabel('Predicted Total Score')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot', fontweight='bold')
fig.tight_layout()
save_plot(fig, 'regression_results')

# Regression comparison table (drop the fitted model objects, keep the metrics)
reg_df = pd.DataFrame({name: {metric: value for metric, value in metrics.items()
                              if metric != 'model'}
                       for name, metrics in reg_results.items()}).T
print("\n📊 Regression Model Comparison:")
display(reg_df.round(4))
print("\n✅ All regression models trained!")


📈 Linear Regression
  R²:   1.0000
  MAE:  0.0000
  RMSE: 0.0000

📈 Ridge Regression
  R²:   1.0000
  MAE:  0.0000
  RMSE: 0.0000

📈 Lasso Regression
  R²:   1.0000
  MAE:  0.0052
  RMSE: 0.0064

📈 Random Forest Regressor
  R²:   0.9980
  MAE:  0.5199
  RMSE: 0.6724

📈 Gradient Boosting Regressor
  R²:   0.9969
  MAE:  0.6497
  RMSE: 0.8298

📊 Regression Model Comparison:


| | R2 | MAE | MSE | RMSE |
|---|---|---|---|---|
| Linear Regression | 1.0 | 0.0 | 0.0 | 0.0 |
| Ridge Regression | 1.0 | 0.0 | 0.0 | 0.0 |
| Lasso Regression | 1.0 | 0.0052 | 0.0 | 0.0064 |
| Random Forest Regressor | 0.998 | 0.5199 | 0.4521 | 0.6724 |
| Gradient Boosting Regressor | 0.9969 | 0.6497 | 0.6886 | 0.8298 |



✅ All regression models trained!
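
The perfect scores for the linear models are expected: `Total_Score` is defined as Math + Physics + Chemistry, so a linear model with coefficients (1, 1, 1) and zero intercept reproduces it exactly. The sketch below computes the same metrics by hand on the first three preview rows; `regression_metrics` is a hypothetical helper, not the notebook's `evaluate_regressor`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R2, MAE, MSE, and RMSE from paired true/predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mse = ss_res / n
    return {"R2": 1 - ss_res / ss_tot, "MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}


# (Math, Physics, Chemistry) and Total_Score for the first three preview rows
rows = [(76, 84, 54), (91, 75, 78), (64, 98, 20)]
totals = [214, 244, 182]

# A linear model with unit coefficients simply sums the three subjects
preds = [m + p + c for m, p, c in rows]
metrics = regression_metrics(totals, preds)
print(metrics)
```

The tree-based regressors approximate this sum with piecewise-constant predictions, which would explain their small but nonzero errors on `Average_Score`.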


## Section 20–22: Clustering and Dimensionality Reduction (K-Means, PCA, DBSCAN)

In [12]:
# ============================================================
# Section 20: K-Means Clustering
# ============================================================
X_cluster = scaler.fit_transform(df_clean[['Math', 'Physics', 'Chemistry']].values)

# Elbow method
inertias = []
sil_scores = []
K_range = range(2, 11)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=SEED, n_init=10)
    km.fit(X_cluster)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(X_cluster, km.labels_))

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].plot(K_range, inertias, 'bo-', linewidth=2)
axes[0].set_xlabel('Number of Clusters (k)')
axes[0].set_ylabel('Inertia')
axes[0].set_title('Elbow Method', fontweight='bold')

axes[1].plot(K_range, sil_scores, 'ro-', linewidth=2)
axes[1].set_xlabel('Number of Clusters (k)')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Analysis', fontweight='bold')
fig.tight_layout()
save_plot(fig, 'elbow_silhouette')

# Use k=7 (matching number of grades)
best_k_cluster = 7
km_final = KMeans(n_clusters=best_k_cluster, random_state=SEED, n_init=10)
km_labels = km_final.fit_predict(X_cluster)

print(f"✅ K-Means (k={best_k_cluster}): Silhouette Score = "
      f"{silhouette_score(X_cluster, km_labels):.4f}")

# ============================================================
# Section 21: PCA
# ============================================================
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_cluster)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# PCA colored by actual grade
scatter1 = axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=df_clean['Grade_Encoded'],
                           cmap='viridis', alpha=0.4, s=10)
axes[0].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} var)')
axes[0].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} var)')
axes[0].set_title('PCA: Colored by Grade', fontweight='bold')
plt.colorbar(scatter1, ax=axes[0], label='Grade (0=F, 6=A+)')

# PCA colored by K-Means cluster
scatter2 = axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=km_labels,
                           cmap='tab10', alpha=0.4, s=10)
axes[1].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} var)')
axes[1].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} var)')
axes[1].set_title('PCA: Colored by K-Means Cluster', fontweight='bold')
plt.colorbar(scatter2, ax=axes[1], label='Cluster')
fig.tight_layout()
save_plot(fig, 'pca_clusters')

print(f"PCA Explained Variance: {pca.explained_variance_ratio_.round(3)}")
print(f"Total variance explained by 3 components: {pca.explained_variance_ratio_.sum():.1%}")

# ============================================================
# Section 22: DBSCAN
# ============================================================
dbscan = DBSCAN(eps=0.5, min_samples=10)
db_labels = dbscan.fit_predict(X_cluster)
n_clusters_db = len(set(db_labels)) - (1 if -1 in db_labels else 0)
n_noise = (db_labels == -1).sum()

print(f"\n✅ DBSCAN: {n_clusters_db} clusters found, {n_noise} noise points")

fig, ax = plt.subplots(figsize=(8, 6))
scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], c=db_labels, cmap='tab10', alpha=0.4, s=10)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_title(f'DBSCAN Clustering ({n_clusters_db} clusters, {n_noise} noise)', fontweight='bold')
plt.colorbar(scatter, ax=ax, label='Cluster (-1=noise)')
save_plot(fig, 'dbscan_clusters')

print("\n✅ Clustering and PCA complete!")

✅ K-Means (k=7): Silhouette Score = 0.2803
PCA Explained Variance: [0.341 0.334 0.325]
Total variance explained by 3 components: 100.0%

✅ DBSCAN: 1 clusters found, 0 noise points

✅ Clustering and PCA complete!
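
To put the reported silhouette score of 0.2803 in context, the coefficient can be worked out by hand on a toy example. The sketch below is written from the standard definition, s(i) = (b - a) / max(a, b), where a is the mean distance to points in the same cluster and b is the mean distance to the nearest other cluster. It is an illustration only, not the notebook's code: the `silhouette` helper and the four sample points are assumptions introduced here. Values near 1 indicate compact, well-separated clusters, while values near 0 indicate overlapping ones.

```python
from math import dist


def silhouette(points, labels):
    """Mean silhouette coefficient computed directly from its definition."""
    scores = []
    for i, p in enumerate(points):
        # a: mean distance to the other members of the point's own cluster
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:  # a singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two tight pairs that lie far apart
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # labels follow the real structure
bad = silhouette(points, [0, 1, 0, 1])   # each cluster straddles both pairs
print(f"good split: {good:.3f}, bad split: {bad:.3f}")
```

A labelling that follows the real structure scores close to 1, and one that cuts across it goes negative. A score around 0.28, as reported for k=7 here, sits much closer to the overlapping end of that range.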


## Section 23: Hyperparameter Tuning (GridSearchCV & RandomizedSearchCV)

In [13]:
# ============================================================
# Section 23: Hyperparameter Tuning
# ============================================================
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

# GridSearchCV for Random Forest
print("🔍 GridSearchCV: Random Forest...")
rf_params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5]
}
rf_grid = GridSearchCV(RandomForestClassifier(random_state=SEED),
                       rf_params, cv=cv, scoring='accuracy', n_jobs=-1)
rf_grid.fit(X_train, y_train)
print(f"  Best params: {rf_grid.best_params_}")
print(f"  Best CV accuracy: {rf_grid.best_score_:.4f}")
print(f"  Test accuracy: {rf_grid.score(X_test, y_test):.4f}")

# Store tuned result
y_pred_tuned_rf = rf_grid.predict(X_test)
results['Random Forest (Tuned)'] = {
    'accuracy': accuracy_score(y_test, y_pred_tuned_rf),
    'precision': precision_score(y_test, y_pred_tuned_rf, average='weighted', zero_division=0),
    'recall': recall_score(y_test, y_pred_tuned_rf, average='weighted', zero_division=0),
    'f1_score': f1_score(y_test, y_pred_tuned_rf, average='weighted', zero_division=0),
    'y_pred': y_pred_tuned_rf,
    'model': rf_grid.best_estimator_
}

# RandomizedSearchCV for Gradient Boosting
print("\n🔍 RandomizedSearchCV: Gradient Boosting...")
gb_params = {
    'n_estimators': [50, 100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10]
}
gb_random = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=SEED),
    gb_params, n_iter=20, cv=cv, scoring='accuracy',
    random_state=SEED, n_jobs=-1)
gb_random.fit(X_train, y_train)
print(f"  Best params: {gb_random.best_params_}")
print(f"  Best CV accuracy: {gb_random.best_score_:.4f}")
print(f"  Test accuracy: {gb_random.score(X_test, y_test):.4f}")

# Store tuned result (predict once and reuse the predictions)
y_pred_tuned_gb = gb_random.predict(X_test)
results['Gradient Boosting (Tuned)'] = {
    'accuracy': accuracy_score(y_test, y_pred_tuned_gb),
    'precision': precision_score(y_test, y_pred_tuned_gb, average='weighted', zero_division=0),
    'recall': recall_score(y_test, y_pred_tuned_gb, average='weighted', zero_division=0),
    'f1_score': f1_score(y_test, y_pred_tuned_gb, average='weighted', zero_division=0),
    'y_pred': y_pred_tuned_gb,
    'model': gb_random.best_estimator_
}

print("\n✅ Hyperparameter tuning complete!")

🔍 GridSearchCV: Random Forest...
  Best params: {'max_depth': 10, 'min_samples_split': 2, 'n_estimators': 200}
  Best CV accuracy: 0.9999
  Test accuracy: 0.9994

🔍 RandomizedSearchCV: Gradient Boosting...
  Best params: {'n_estimators': 100, 'min_samples_split': 10, 'max_depth': 10, 'learning_rate': 0.01}
  Best CV accuracy: 0.9999
  Test accuracy: 1.0000

✅ Hyperparameter tuning complete!
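
The cost difference between the two searches follows from simple counting. The sketch below expands the two parameter spaces from the cell above and counts cross-validation fits. It uses only the standard library. The `expand_grid` helper, the seed value 42, and the use of `random.sample` as a stand-in for candidate sampling are assumptions made for illustration and are not scikit-learn's own machinery.

```python
from itertools import product
import random

# Search spaces copied from the tuning cell above
rf_params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5],
}
gb_params = {
    'n_estimators': [50, 100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
}


def expand_grid(space):
    """Return every parameter combination in the space as a dict."""
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in product(*(space[k] for k in keys))]


N_FOLDS = 5  # matches StratifiedKFold(n_splits=5)
N_ITER = 20  # matches RandomizedSearchCV(n_iter=20)

rf_candidates = expand_grid(rf_params)
gb_candidates = expand_grid(gb_params)

# Stand-in for random candidate sampling (hypothetical seed, not the notebook's SEED)
rng = random.Random(42)
gb_sampled = rng.sample(gb_candidates, N_ITER)

rf_cv_fits = len(rf_candidates) * N_FOLDS
gb_cv_fits = len(gb_sampled) * N_FOLDS
gb_coverage = len(gb_sampled) / len(gb_candidates)

print(f"Grid search: {len(rf_candidates)} candidates, {rf_cv_fits} CV fits")
print(f"Randomized search: {len(gb_sampled)} of {len(gb_candidates)} candidates "
      f"({gb_coverage:.1%}), {gb_cv_fits} CV fits")
```

The randomized search evaluates only a fraction of the Gradient Boosting space, which keeps its cost comparable to the much smaller Random Forest grid. Each search also performs one additional fit when it refits the best candidate on the full training set.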


## Sections 24–28: Cross-Validation, Feature Importance, Confusion Matrices, ROC, Learning Curves

In [None]:
# ============================================================
# Section 24: Cross-Validation and Model Comparison
# ============================================================
print("📊 5-Fold Cross-Validation Results:")
print("=" * 60)

cv_models = {
    'Logistic Regression': LogisticRegression(solver='lbfgs', max_iter=1000, random_state=SEED),
    'Decision Tree': DecisionTreeClassifier(max_depth=10, random_state=SEED),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=SEED, n_jobs=-1),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=SEED),
    'Naive Bayes': GaussianNB(),
}

cv_results_data = []
for name, model in cv_models.items():
    # Use scaled data for all for fair comparison
    scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='accuracy')
    f1_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='f1_weighted')
    cv_results_data.append({
        'Model': name,
        'CV Accuracy (mean)': scores.mean(),
        'CV Accuracy (std)': scores.std(),
        'CV F1 (mean)': f1_scores.mean(),
        'CV F1 (std)': f1_scores.std()
    })
    print(f"  {name:25s}: Acc={scores.mean():.4f}±{scores.std():.4f}  "
          f"F1={f1_scores.mean():.4f}±{f1_scores.std():.4f}")

cv_df = pd.DataFrame(cv_results_data)

# Model comparison bar chart
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(cv_df))
width = 0.35
bars1 = ax.bar(x - width/2, cv_df['CV Accuracy (mean)'], width, label='Accuracy',
               yerr=cv_df['CV Accuracy (std)'], capsize=3, color='steelblue')
bars2 = ax.bar(x + width/2, cv_df['CV F1 (mean)'], width, label='F1 Score',
               yerr=cv_df['CV F1 (std)'], capsize=3, color='coral')
ax.set_xlabel('Model')
ax.set_ylabel('Score')
ax.set_title('Cross-Validation: Model Comparison', fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(cv_df['Model'], rotation=30, ha='right')
ax.legend()
ax.set_ylim(0, 1.05)
fig.tight_layout()
save_plot(fig, 'cv_comparison')

# ============================================================
# Section 25: Feature Importance Analysis
# ============================================================
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for idx, (name, importances) in enumerate([
    ('Decision Tree', dt_importances),
    ('Random Forest', rf_importances),
    ('Gradient Boosting', pd.Series(gb_model.feature_importances_, index=feature_cols))
]):
    importances.sort_values().plot(kind='barh', ax=axes[idx], color='steelblue')
    axes[idx].set_title(f'{name}\nFeature Importance', fontweight='bold')
    axes[idx].set_xlabel('Importance')
fig.tight_layout()
save_plot(fig, 'feature_importance')
print("\n✅ Feature importance analysis complete!")

# ============================================================
# Section 26: Confusion Matrices
# ============================================================
clf_models_for_cm = {k: v for k, v in results.items()
                     if 'y_pred' in v and k not in ['Random Forest (Tuned)', 'Gradient Boosting (Tuned)']}

n_models = len(clf_models_for_cm)
n_cols = 3
n_rows = (n_models + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(6*n_cols, 5*n_rows))
# With n_cols=3, plt.subplots always returns an array of axes, even for a single model
axes_flat = np.atleast_1d(axes).ravel()

for idx, (name, data) in enumerate(clf_models_for_cm.items()):
    cm = confusion_matrix(y_test, data['y_pred'])
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes_flat[idx],
                xticklabels=grade_names, yticklabels=grade_names)
    axes_flat[idx].set_title(f'{name}\nAcc: {data["accuracy"]:.3f}', fontsize=10, fontweight='bold')
    axes_flat[idx].set_xlabel('Predicted')
    axes_flat[idx].set_ylabel('Actual')

# Hide unused subplots
for idx in range(n_models, len(axes_flat)):
    axes_flat[idx].set_visible(False)

fig.tight_layout()
save_plot(fig, 'confusion_matrices')
print("✅ Confusion matrices saved!")

# ============================================================
# Section 27: ROC Curves (Binary Pass/Fail)
# ============================================================
# Train binary classifiers for ROC
binary_models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=SEED),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=SEED),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=SEED),
    'SVM (rbf)': SVC(kernel='rbf', probability=True, random_state=SEED),
    'Naive Bayes': GaussianNB(),
}

fig, ax = plt.subplots(figsize=(10, 8))
for name, model in binary_models.items():
    model.fit(X_train_bin_scaled, y_train_bin)
    if hasattr(model, 'predict_proba'):
        y_prob = model.predict_proba(X_test_bin_scaled)[:, 1]
    else:
        y_prob = model.decision_function(X_test_bin_scaled)
    fpr, tpr, _ = roc_curve(y_test_bin, y_prob)
    auc_score = auc(fpr, tpr)
    ax.plot(fpr, tpr, linewidth=2, label=f'{name} (AUC={auc_score:.3f})')

ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random (AUC=0.500)')
ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curves (Binary: Pass vs Fail)', fontsize=14, fontweight='bold')
ax.legend(loc='lower right')
ax.set_xlim([0, 1])
ax.set_ylim([0, 1.05])
save_plot(fig, 'roc_curves')
print("✅ ROC curves saved!")

# ============================================================
# Section 28: Learning Curves
# ============================================================
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
top_models = [
    ('Random Forest', RandomForestClassifier(n_estimators=100, random_state=SEED)),
    ('Gradient Boosting', GradientBoostingClassifier(n_estimators=100, random_state=SEED)),
    ('Logistic Regression', LogisticRegression(max_iter=1000, random_state=SEED)),
]

for idx, (name, model) in enumerate(top_models):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X_train_scaled, y_train, cv=5,
        train_sizes=np.linspace(0.1, 1.0, 10), scoring='accuracy', n_jobs=-1)
    
    axes[idx].plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Train', color='steelblue')
    axes[idx].fill_between(train_sizes,
                           train_scores.mean(axis=1) - train_scores.std(axis=1),
                           train_scores.mean(axis=1) + train_scores.std(axis=1), alpha=0.1, color='steelblue')
    axes[idx].plot(train_sizes, val_scores.mean(axis=1), 'o-', label='Validation', color='coral')
    axes[idx].fill_between(train_sizes,
                           val_scores.mean(axis=1) - val_scores.std(axis=1),
                           val_scores.mean(axis=1) + val_scores.std(axis=1), alpha=0.1, color='coral')
    axes[idx].set_xlabel('Training Set Size')
    axes[idx].set_ylabel('Accuracy')
    axes[idx].set_title(f'Learning Curve: {name}', fontweight='bold')
    axes[idx].legend(loc='lower right')
    axes[idx].set_ylim(0.3, 1.05)

fig.tight_layout()
save_plot(fig, 'learning_curves')
print("✅ Learning curves saved!")

📊 5-Fold Cross-Validation Results:
  Logistic Regression      : Acc=0.9693±0.0047  F1=0.9687±0.0045
  Decision Tree            : Acc=0.9999±0.0003  F1=0.9999±0.0003
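
The `f1_weighted` scorer and the `average='weighted'` metrics used throughout this notebook average per-class F1 scores, weighting each class by its number of true samples. The sketch below computes that quantity by hand to show how a rare class affects the result. The `weighted_f1` helper and the seven sample labels are assumptions made up for illustration and are not data from this notebook.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # same idea as zero_division=0
        total += n * f1
    return total / len(y_true)


# Hypothetical labels: a common class, a mid-sized class, and one rare sample
y_true = ['D', 'D', 'D', 'D', 'C', 'C', 'A+']
y_pred = ['D', 'D', 'D', 'C', 'C', 'C', 'C']

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.4f}  weighted F1={weighted_f1(y_true, y_pred):.4f}")
```

Because the weights are class supports, a completely missed rare class lowers the weighted F1 only slightly. For imbalanced grades, a macro average or per-class report may therefore be a useful complement.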


## Section 29: Ensemble Methods (Voting & Stacking Classifiers)

In [None]:
# ============================================================
# Section 29: Ensemble Methods
# ============================================================

# Voting Classifier (soft voting)
estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=SEED)),
    ('gb', GradientBoostingClassifier(n_estimators=100, random_state=SEED)),
    ('lr', LogisticRegression(max_iter=1000, random_state=SEED)),
]

voting_clf = VotingClassifier(estimators=estimators, voting='soft')
voting_clf.fit(X_train_scaled, y_train)
y_pred_voting = voting_clf.predict(X_test_scaled)
voting_acc = accuracy_score(y_test, y_pred_voting)
results['Voting Ensemble'] = {
    'accuracy': voting_acc,
    'precision': precision_score(y_test, y_pred_voting, average='weighted', zero_division=0),
    'recall': recall_score(y_test, y_pred_voting, average='weighted', zero_division=0),
    'f1_score': f1_score(y_test, y_pred_voting, average='weighted', zero_division=0),
    'y_pred': y_pred_voting,
    'model': voting_clf
}
print(f"✅ Voting Classifier Accuracy: {voting_acc:.4f}")

# Stacking Classifier
stacking_clf = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=SEED)),
        ('gb', GradientBoostingClassifier(n_estimators=50, random_state=SEED)),
        ('svm', SVC(kernel='rbf', probability=True, random_state=SEED)),
    ],
    final_estimator=LogisticRegression(max_iter=1000, random_state=SEED),
    cv=5
)
stacking_clf.fit(X_train_scaled, y_train)
y_pred_stacking = stacking_clf.predict(X_test_scaled)
stacking_acc = accuracy_score(y_test, y_pred_stacking)
results['Stacking Ensemble'] = {
    'accuracy': stacking_acc,
    'precision': precision_score(y_test, y_pred_stacking, average='weighted', zero_division=0),
    'recall': recall_score(y_test, y_pred_stacking, average='weighted', zero_division=0),
    'f1_score': f1_score(y_test, y_pred_stacking, average='weighted', zero_division=0),
    'y_pred': y_pred_stacking,
    'model': stacking_clf
}
print(f"✅ Stacking Classifier Accuracy: {stacking_acc:.4f}")

# ============================================================
# Final Model Comparison Summary
# ============================================================
comparison_data = []
for name, data in results.items():
    comparison_data.append({
        'Model': name,
        'Accuracy': data['accuracy'],
        'Precision': data['precision'],
        'Recall': data['recall'],
        'F1 Score': data['f1_score']
    })

comparison_df = pd.DataFrame(comparison_data).sort_values('Accuracy', ascending=False)
comparison_df.index = range(1, len(comparison_df) + 1)

print("\n" + "=" * 70)
print("📊 FINAL MODEL COMPARISON (sorted by accuracy)")
print("=" * 70)
display(comparison_df)

# Save comparison plot
fig, ax = plt.subplots(figsize=(14, 7))
x = np.arange(len(comparison_df))
width = 0.2
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
colors_met = ['#2ecc71', '#3498db', '#e74c3c', '#f39c12']

for i, (metric, color) in enumerate(zip(metrics, colors_met)):
    ax.bar(x + i*width, comparison_df[metric], width, label=metric, color=color)

ax.set_xlabel('Model')
ax.set_ylabel('Score')
ax.set_title('All Models Performance Comparison', fontweight='bold', fontsize=14)
ax.set_xticks(x + width * 1.5)
ax.set_xticklabels(comparison_df['Model'], rotation=45, ha='right', fontsize=9)
ax.legend()
ax.set_ylim(0, 1.1)
fig.tight_layout()
save_plot(fig, 'model_comparison')

best_model_name = comparison_df.iloc[0]['Model']
best_accuracy = comparison_df.iloc[0]['Accuracy']
print(f"\n🏆 Best Model: {best_model_name} (Accuracy: {best_accuracy:.4f})")
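
The voting ensemble above uses `voting='soft'`, which averages the class probabilities of the base models (with equal weights by default) instead of counting their predicted labels. The sketch below contrasts the two rules on a single sample. The `hard_vote` and `soft_vote` helpers and the probability values are assumptions made up for illustration and are not outputs of the models in this notebook.

```python
from collections import Counter


def hard_vote(prob_rows):
    """Majority vote over each model's most probable class index."""
    votes = [max(range(len(row)), key=row.__getitem__) for row in prob_rows]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(prob_rows):
    """Average the per-class probabilities, then pick the largest average."""
    n_classes = len(prob_rows[0])
    avg = [sum(row[c] for row in prob_rows) / len(prob_rows)
           for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg


# Hypothetical probabilities for one student over two classes
probs = [
    [0.55, 0.45],  # model 1: weak preference for class 0
    [0.60, 0.40],  # model 2: weak preference for class 0
    [0.10, 0.90],  # model 3: strong preference for class 1
]

hard = hard_vote(probs)
soft, avg = soft_vote(probs)
print(f"hard vote -> class {hard}, soft vote -> class {soft}, averages={avg}")
```

Two lukewarm votes win under hard voting, while soft voting lets the single confident model tip the decision. This is why soft voting requires every base estimator to expose calibrated-enough probabilities.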

## Section 30: Generate Final HTML Report with Results and Explanations

In [None]:
# ============================================================
# Section 30: Generate Final HTML Report
# ============================================================

# Build model results table rows
model_rows = ""
for _, row in comparison_df.iterrows():
    model_rows += f"""
    <tr>
        <td>{row['Model']}</td>
        <td>{row['Accuracy']:.4f}</td>
        <td>{row['Precision']:.4f}</td>
        <td>{row['Recall']:.4f}</td>
        <td>{row['F1 Score']:.4f}</td>
    </tr>"""

# Regression results table rows
reg_rows = ""
for name, data in reg_results.items():
    reg_rows += f"""
    <tr>
        <td>{name}</td>
        <td>{data['R2']:.4f}</td>
        <td>{data['MAE']:.4f}</td>
        <td>{data['RMSE']:.4f}</td>
    </tr>"""

# Build image tags from base64 encoded images
def img_tag(name, width="100%"):
    if name in report_images:
        return f'<img src="data:image/png;base64,{report_images[name]}" style="width:{width}; max-width:900px;">'
    return f'<p><em>Image {name} not available</em></p>'

best_model = comparison_df.iloc[0]

html_content = f"""<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Student Performance ML Analysis Report</title>
    <style>
        * {{ margin: 0; padding: 0; box-sizing: border-box; }}
        body {{ font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
               background: #f0f2f5; color: #333; line-height: 1.6; }}
        .container {{ max-width: 1100px; margin: 0 auto; padding: 20px; }}
        
        header {{ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
                 color: white; padding: 40px 20px; text-align: center;
                 border-radius: 12px; margin-bottom: 30px; }}
        header h1 {{ font-size: 2.2em; margin-bottom: 10px; }}
        header p {{ font-size: 1.1em; opacity: 0.9; }}
        
        .card {{ background: white; border-radius: 12px; padding: 25px;
                margin-bottom: 25px; box-shadow: 0 2px 12px rgba(0,0,0,0.08); }}
        .card h2 {{ color: #4a5568; border-bottom: 3px solid #667eea;
                   padding-bottom: 10px; margin-bottom: 20px; font-size: 1.5em; }}
        .card h3 {{ color: #2d3748; margin: 15px 0 10px 0; }}
        
        .dashboard {{ display: grid; grid-template-columns: repeat(auto-fit, minmax(220px, 1fr));
                     gap: 15px; margin-bottom: 20px; }}
        .stat-box {{ background: linear-gradient(135deg, #667eea, #764ba2);
                    color: white; padding: 20px; border-radius: 10px; text-align: center; }}
        .stat-box .value {{ font-size: 2em; font-weight: bold; }}
        .stat-box .label {{ font-size: 0.9em; opacity: 0.9; }}
        
        table {{ width: 100%; border-collapse: collapse; margin: 15px 0; }}
        th, td {{ padding: 12px 15px; text-align: left; border-bottom: 1px solid #e2e8f0; }}
        th {{ background: #667eea; color: white; font-weight: 600; }}
        tr:hover {{ background: #f7fafc; }}
        tr:nth-child(even) {{ background: #f8f9fa; }}
        
        .explanation {{ background: #ebf8ff; border-left: 4px solid #4299e1;
                       padding: 15px; margin: 15px 0; border-radius: 0 8px 8px 0; }}
        .highlight {{ background: #f0fff4; border-left: 4px solid #48bb78;
                     padding: 15px; margin: 15px 0; border-radius: 0 8px 8px 0; }}
        .warning {{ background: #fffaf0; border-left: 4px solid #ed8936;
                   padding: 15px; margin: 15px 0; border-radius: 0 8px 8px 0; }}
        
        .img-container {{ text-align: center; margin: 20px 0; }}
        .img-container img {{ border-radius: 8px; box-shadow: 0 2px 8px rgba(0,0,0,0.1); }}
        
        .badge {{ display: inline-block; padding: 4px 12px; border-radius: 20px;
                 font-size: 0.85em; font-weight: 600; }}
        .badge-gold {{ background: #ffd700; color: #333; }}
        .badge-silver {{ background: #c0c0c0; color: #333; }}
        .badge-bronze {{ background: #cd7f32; color: white; }}
        
        footer {{ text-align: center; padding: 20px; color: #718096; font-size: 0.9em; }}
    </style>
</head>
<body>
<div class="container">

<header>
    <h1>🎓 Student Performance ML Analysis Report</h1>
    <p>Comprehensive Machine Learning Analysis of Student Academic Performance</p>
    <p>Dataset: {df.shape[0]} students | {df.shape[1]} original features | 7 Grade categories</p>
</header>

<!-- Dashboard Summary -->
<div class="dashboard">
    <div class="stat-box">
        <div class="value">{df.shape[0]}</div>
        <div class="label">Total Students</div>
    </div>
    <div class="stat-box">
        <div class="value">{len(results)}</div>
        <div class="label">Models Trained</div>
    </div>
    <div class="stat-box">
        <div class="value">{best_model['Accuracy']:.1%}</div>
        <div class="label">Best Accuracy</div>
    </div>
    <div class="stat-box">
        <div class="value">{best_model['Model']}</div>
        <div class="label">🏆 Best Model</div>
    </div>
</div>

<!-- Section 1: Dataset Overview -->
<div class="card">
    <h2>📋 1. Dataset Overview</h2>
    <p>The dataset contains <strong>{df.shape[0]} student records</strong> from Martin Luther School
       with scores in <strong>Math, Physics, and Chemistry</strong> (range: 10–100).</p>
    <h3>Grade Distribution</h3>
    <table>
        <tr><th>Grade</th><th>Count</th><th>Percentage</th><th>Description</th></tr>
        <tr><td>A+</td><td>{(df['Grade']=='A+').sum()}</td><td>{(df['Grade']=='A+').mean():.1%}</td><td>Excellent Performance</td></tr>
        <tr><td>A</td><td>{(df['Grade']=='A').sum()}</td><td>{(df['Grade']=='A').mean():.1%}</td><td>Very Good Achievement</td></tr>
        <tr><td>B+</td><td>{(df['Grade']=='B+').sum()}</td><td>{(df['Grade']=='B+').mean():.1%}</td><td>Good Pursuance</td></tr>
        <tr><td>B</td><td>{(df['Grade']=='B').sum()}</td><td>{(df['Grade']=='B').mean():.1%}</td><td>Average Performance</td></tr>
        <tr><td>C</td><td>{(df['Grade']=='C').sum()}</td><td>{(df['Grade']=='C').mean():.1%}</td><td>Below Average Achievement</td></tr>
        <tr><td>D</td><td>{(df['Grade']=='D').sum()}</td><td>{(df['Grade']=='D').mean():.1%}</td><td>Poor Pursuance</td></tr>
        <tr><td>F</td><td>{(df['Grade']=='F').sum()}</td><td>{(df['Grade']=='F').mean():.1%}</td><td>Failed</td></tr>
    </table>
    <div class="explanation">
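        <strong>Key Insight:</strong> The dataset is imbalanced: Grade {df['Grade'].value_counts().idxmax()} is the most common
        ({df['Grade'].value_counts(normalize=True).max():.1%}), while A+ is extremely rare ({(df['Grade']=='A+').mean():.1%}).
        This class imbalance affects model performance, especially for minority classes.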
    </div>
</div>

<!-- Section 2: EDA Visualizations -->
<div class="card">
    <h2>📊 2. Exploratory Data Analysis</h2>
    <div class="img-container">{img_tag('grade_distribution')}</div>
    <div class="explanation">
        <strong>Interpretation:</strong> The grade distribution is roughly bell-shaped but skewed toward
        lower grades. D is the most frequent grade, suggesting many students struggle across subjects.
        The rare A+ class ({(df['Grade']=='A+').sum()} students) is likely to be the hardest for models to predict.
    </div>
    <div class="img-container">{img_tag('score_distributions')}</div>
    <div class="explanation">
        <strong>Interpretation:</strong> All three subjects show approximately uniform distributions
        across the 10–100 range, with means around 53–56. No subject appears inherently harder or easier.
    </div>
    <div class="img-container">{img_tag('correlation_heatmap')}</div>
    <div class="explanation">
        <strong>Interpretation:</strong> Math, Physics, and Chemistry scores show near-zero correlation
        with each other (~0.00), meaning performance in one subject has no linear relationship with the others.
        This is an interesting finding: performing well in Math does not predict Physics or Chemistry scores.
    </div>
    <div class="img-container">{img_tag('boxplots_by_grade')}</div>
    <div class="img-container">{img_tag('violin_plots')}</div>
</div>

<!-- Section 3: Classification Results -->
<div class="card">
    <h2>🤖 3. Classification Model Results (Grade Prediction)</h2>
    <p>We trained <strong>{len(comparison_df)} classification models</strong> to predict student grades
       from their Math, Physics, and Chemistry scores plus engineered features.</p>
    <table>
        <tr><th>#</th><th>Model</th><th>Accuracy</th><th>Precision</th><th>Recall</th><th>F1 Score</th></tr>
        {''.join(f"<tr><td>{rank}</td><td>{row['Model']}</td><td>{row['Accuracy']:.4f}</td><td>{row['Precision']:.4f}</td><td>{row['Recall']:.4f}</td><td>{row['F1 Score']:.4f}</td></tr>" for rank, row in comparison_df.iterrows())}
    </table>
    <div class="img-container">{img_tag('model_comparison')}</div>
    <div class="highlight">
        <strong>🏆 Best Model: {best_model['Model']}</strong><br>
        Accuracy: {best_model['Accuracy']:.4f} | F1 Score: {best_model['F1 Score']:.4f}<br><br>
        Tree-based ensemble methods (Random Forest, Gradient Boosting) typically perform best on this
        dataset because they can capture the non-linear decision boundaries between grade categories
        based on the combination of three independent score features.
    </div>
</div>

<!-- Section 4: Confusion Matrices -->
<div class="card">
    <h2>🔢 4. Confusion Matrices</h2>
    <div class="img-container">{img_tag('confusion_matrices')}</div>
    <div class="explanation">
        <strong>Interpretation:</strong> Confusion matrices show where models make mistakes.
        The diagonal represents correct predictions. Errors tend to fall between adjacent grades
        (e.g., B vs B+, C vs D), which is expected because adjacent grades share a score threshold.
        The rare A+ class is the most exposed to misclassification because it has few training examples.
    </div>
</div>

<!-- Section 5: ROC Curves -->
<div class="card">
    <h2>📈 5. ROC Curves (Pass/Fail Classification)</h2>
    <div class="img-container">{img_tag('roc_curves')}</div>
    <div class="explanation">
        <strong>Interpretation:</strong> ROC curves show the tradeoff between true positive rate and
        false positive rate for binary Pass/Fail classification. All models achieve high AUC scores,
        indicating that distinguishing between passing and failing students is relatively straightforward
        based on score features. Models with AUC &gt; 0.90 are considered excellent classifiers.
    </div>
</div>

<!-- Section 6: Feature Importance -->
<div class="card">
    <h2>🎯 6. Feature Importance Analysis</h2>
    <div class="img-container">{img_tag('feature_importance')}</div>
    <div class="explanation">
        <strong>Key Findings:</strong>
        <ul>
            <li><strong>Total_Score and Average_Score</strong> are the most important features, as grades are
                primarily determined by the combined performance across all subjects.</li>
            <li><strong>Min_Score</strong> is also highly important: a very low score in any subject
                can significantly lower the overall grade.</li>
            <li>Individual subject scores (Math, Physics, Chemistry) contribute roughly equally,
                confirming that no single subject dominates grade determination.</li>
            <li><strong>Score_Range and Score_Std</strong> capture the consistency of performance:
                students with high variance across subjects tend to receive different grades than
                consistently performing students.</li>
        </ul>
    </div>
</div>

<!-- Section 7: Cross-Validation -->
<div class="card">
    <h2>🔄 7. Cross-Validation Results</h2>
    <div class="img-container">{img_tag('cv_comparison')}</div>
    <div class="explanation">
        <strong>Interpretation:</strong> Cross-validation provides a more reliable estimate of model
        performance by training and testing on different data splits. Low standard deviation in CV
        scores indicates stable, reliable models. Gradient Boosting and Random Forest typically show
        the best balance of high accuracy and low variance.
    </div>
</div>

<!-- Section 8: Learning Curves -->
<div class="card">
    <h2>📚 8. Learning Curves Analysis</h2>
    <div class="img-container">{img_tag('learning_curves')}</div>
    <div class="explanation">
        <strong>Interpretation:</strong>
        <ul>
            <li>If training and validation curves converge at a high score → model generalizes well</li>
            <li>Large gap between training and validation → overfitting (model memorizes training data)</li>
            <li>Both curves plateau at a low score → underfitting (model is too simple)</li>
            <li>Random Forest may show signs of slight overfitting (high training score, lower validation)</li>
            <li>Logistic Regression curves converge quickly, suggesting the model is simpler but stable</li>
        </ul>
    </div>
</div>

<!-- Section 9: Clustering -->
<div class="card">
    <h2>🔮 9. Clustering Results (Unsupervised Learning)</h2>
    <div class="img-container">{img_tag('elbow_silhouette')}</div>
    <div class="img-container">{img_tag('pca_clusters')}</div>
    <div class="img-container">{img_tag('dbscan_clusters')}</div>
    <div class="explanation">
        <strong>Interpretation:</strong> K-Means with k={best_k_cluster} partitions students by score pattern.
        The PCA visualization shows that grade labels only roughly correspond to these clusters, with
        significant overlap between adjacent grades. DBSCAN (eps=0.5, min_samples=10) found
        {n_clusters_db} cluster(s) and {n_noise} noise points in the standardized score space.
    </div>
</div>

<!-- Section 10: Regression -->
<div class="card">
    <h2>📈 10. Regression Model Results</h2>
    <table>
        <tr><th>Model</th><th>R²</th><th>MAE</th><th>RMSE</th></tr>
        {reg_rows}
    </table>
    <div class="img-container">{img_tag('regression_results')}</div>
    <div class="explanation">
        <strong>Interpretation:</strong> Linear Regression achieves perfect R² = 1.000 for predicting
        Total_Score from individual subject scores because Total_Score = Math + Physics + Chemistry
        (a perfect linear relationship). For Average_Score prediction, tree-based regressors capture
        non-linear patterns slightly better than linear models.
    </div>
</div>

<!-- Section 11: Conclusions -->
<div class="card">
    <h2>🎯 11. Key Conclusions &amp; Recommendations</h2>
    <div class="highlight">
        <h3>Summary of Findings:</h3>
        <ol>
            <li><strong>Best classification model: {best_model['Model']}</strong> with {best_model['Accuracy']:.1%} accuracy</li>
            <li><strong>Subject scores are uncorrelated</strong>: Math, Physics, and Chemistry show ~0 pairwise correlation</li>
            <li><strong>Grade is determined by total/average score</strong>, not by any single subject</li>
            <li><strong>Class imbalance</strong> affects prediction of rare grades (A+ and F)</li>
            <li><strong>Ensemble methods</strong> (Voting, Stacking) provide robust predictions</li>
            <li><strong>Feature engineering</strong> (Total, Average, Min, Max, Range, Std) significantly improves model performance</li>
        </ol>
    </div>
    <div class="warning">
        <h3>⚠️ Important Notes:</h3>
        <ul>
            <li>The 'Comment' column was dropped as it maps 1:1 to Grade (would cause data leakage)</li>
            <li>All students are from the same school, so school-level variation cannot be analyzed</li>
            <li>The grade boundaries appear to be based on total/average score thresholds</li>
            <li>With only 3 raw score features behind every engineered feature, simpler models often perform comparably to complex ones</li>
        </ul>
    </div>
</div>

<footer>
    <p>🎓 Student Performance ML Analysis Report | Generated with Python, Scikit-learn, and ❤️</p>
    <p>Models: Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, SVM, KNN,
       Naive Bayes, MLP, Voting &amp; Stacking Ensembles</p>
</footer>

</div>
</body>
</html>"""

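# Write the HTML report (create the output directory if it does not exist yet)
report_path = 'outputs/student_ml_analysis_report.html'
os.makedirs(os.path.dirname(report_path), exist_ok=True)
with open(report_path, 'w', encoding='utf-8') as f:
    f.write(html_content)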

print(f"✅ HTML Report generated: {report_path}")
print(f"   File size: {os.path.getsize(report_path) / 1024:.1f} KB")
print(f"   Embedded images: {len(report_images)}")
print("\n🎉 Analysis complete! Open the HTML file in a browser to view the full report.")
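
The report template relies on two f-string details: literal CSS braces must be doubled (`{{` and `}}`), and any text interpolated into HTML is safer when escaped. The sketch below shows both in a much smaller template. The `render_rows` and `render_page` helpers and the sample scores are assumptions introduced for illustration and are not part of the notebook's report code.

```python
import html


def render_rows(rows):
    """Build ranked table rows, escaping the model name for HTML."""
    return "\n".join(
        f"<tr><td>{rank}</td><td>{html.escape(name)}</td><td>{acc:.4f}</td></tr>"
        for rank, (name, acc) in enumerate(rows, start=1)
    )


def render_page(title, rows):
    """Render a minimal page; doubled braces become literal CSS braces."""
    return f"""<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>{html.escape(title)}</title>
    <style>
        td {{ padding: 4px 8px; }}
    </style>
</head>
<body>
<table>
<tr><th>#</th><th>Model</th><th>Accuracy</th></tr>
{render_rows(rows)}
</table>
</body>
</html>"""


# Hypothetical rows; the names and scores are made up for the example
page = render_page("Demo <report>", [("Random Forest", 0.9994),
                                     ("Voting & Stacking", 0.9871)])
print(page)
```

Building rows with `enumerate` keeps the rank column next to the data it describes, and `html.escape` turns characters such as `&` and `<` into entities so that a model name cannot break the markup.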