# Phase 2: Supervised Learning - Student Performance Prediction

**Course:** SWE485 - Selected Topics in Software Engineering  
**Project:** Student Academic Success Prediction System  
**Phase:** 2 - Supervised Learning  

---

## Objective
Build and compare supervised learning models that predict whether a student will **pass or fail** from features such as study hours, attendance rate, previous grades, and extracurricular participation.

---

## 1. Algorithm Selection & Justification

For this binary classification problem (Pass/Fail), we selected **two algorithms** with different strengths:

### **Algorithm 1: Logistic Regression**
**Why this algorithm?**
- **Interpretability:** Provides clear coefficients showing how each feature affects pass/fail probability
- **Baseline Model:** Industry-standard baseline for binary classification
- **Fast Training:** Efficient on small-to-medium datasets like ours (~40K samples)
- **Probabilistic Output:** Gives probability scores, useful for understanding prediction confidence
- **Well-suited for roughly linear relationships:** Performs well when the log-odds of passing change approximately linearly with the features
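
The interpretability point above can be made concrete: exponentiating a logistic regression coefficient gives the multiplicative change in the odds of the positive class for a one-unit increase in that feature, with the other features held fixed. The sketch below is illustrative only. It uses synthetic data from `make_classification` and hypothetical feature names, not the project dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: NOT the real student dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
feature_names = ["study_hours", "attendance_rate", "previous_grade"]  # hypothetical names

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# exp(coef) = factor by which the odds of class 1 change per one-unit feature increase
odds_ratios = np.exp(model.coef_[0])
for name, coef, ratio in zip(feature_names, model.coef_[0], odds_ratios):
    print(f"{name:16s} coef={coef:+.3f}  odds ratio={ratio:.3f}")
```

An odds ratio above 1 means the feature pushes predictions toward class 1, below 1 toward class 0. On normalized features, "one unit" refers to one unit on the normalized scale.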

### **Algorithm 2: Random Forest Classifier**
**Why this algorithm?**
- **Handles Non-linearity:** Can capture complex, non-linear relationships between features
- **Feature Importance:** Automatically ranks features by importance for interpretability
- **Robust to Outliers:** Less sensitive to extreme values than linear models
- **Ensemble Learning:** Combines multiple decision trees to reduce overfitting
- **No Feature Scaling Required:** Works well even without normalization (though we normalized already)
- **Strong Performance:** Often achieves high accuracy on structured/tabular data

### **Why NOT other algorithms?**
- **Neural Networks:** Overkill for this dataset size and complexity; require more tuning and data
- **SVM:** Kernel SVMs scale poorly with sample count (training cost grows roughly quadratically or worse), so fitting on 40K samples can be slow; also less interpretable than our choices
- **Naive Bayes:** Assumes feature independence, which may not hold (e.g., study hours and grades are correlated)

**Our hypothesis:** Random Forest will outperform Logistic Regression if there are non-linear patterns in the data.
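
To illustrate the hypothesis, the following sketch compares the two algorithms on a deliberately non-linear toy problem (concentric circles from `make_circles`). The data is synthetic and chosen to exaggerate the effect, so it says nothing about how the models will rank on the student dataset.

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data with a circular (non-linear) class boundary; assumption: illustrative only
X_demo, y_demo = make_circles(n_samples=1000, noise=0.1, factor=0.5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# A linear decision boundary cannot separate an inner circle from an outer ring
lr_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

# Tree ensembles split the feature space into boxes and can approximate the ring
rf_acc = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr).score(X_te, y_te)

print(f"Logistic Regression accuracy: {lr_acc:.3f}")
print(f"Random Forest accuracy:       {rf_acc:.3f}")
```

If the two models score similarly on the real data, that would suggest the relationships are close to linear and the simpler model may be preferable.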

---

## 2. Implementation

### 2.1 Setup & Import Libraries

In [None]:
# Install required libraries (if not already installed)
import sys
print("Installing required libraries...")
!{sys.executable} -m pip install pandas numpy matplotlib seaborn scikit-learn imbalanced-learn --quiet

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Sklearn imports
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score, roc_curve
)

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

print("="*80)
print("LIBRARIES IMPORTED SUCCESSFULLY")
print("="*80)

### 2.2 Load Cleaned Dataset

In [None]:
# Load the cleaned dataset from Phase 1
df = pd.read_csv('../Dataset/student_performance_cleaned.csv')

print(f"Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"\nFirst 5 rows:")
display(df.head())

print(f"\nColumn names:")
print(df.columns.tolist())

print(f"\nData types:")
print(df.dtypes)

### 2.3 Data Preparation

In [None]:
# Separate features and target
# Drop 'Student ID' if it exists (not useful for prediction)
X = df.drop(columns=['Passed', 'Student ID'], errors='ignore')
y = df['Passed']

print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")
print(f"\nFeature columns: {X.columns.tolist()}")

In [None]:
# Encode categorical features (if any)
# Check for categorical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
print(f"Categorical columns found: {categorical_cols}")

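# Encode categorical features using Label Encoding
# Note: integer codes impose an arbitrary order on the categories, which a linear
# model treats as meaningful; one-hot encoding is a safer choice for nominal columns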
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])
    label_encoders[col] = le
    print(f"Encoded '{col}': {le.classes_}")

print(f"\nFeatures after encoding:")
display(X.head())

In [None]:
# Encode target variable if it's not already numeric
if y.dtype == 'object':
    target_encoder = LabelEncoder()
    y = target_encoder.fit_transform(y)
    print(f"Target encoded: {target_encoder.classes_} -> {np.unique(y)}")
else:
    print(f"Target is already numeric: {np.unique(y)}")

# Check class distribution
print(f"\nClass distribution:")
print(pd.Series(y).value_counts())
print(f"\nClass distribution (%):")
print(round(pd.Series(y).value_counts(normalize=True) * 100, 2))

In [None]:
# Split data into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]} ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Test set size: {X_test.shape[0]} ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"\nTraining set class distribution:")
print(pd.Series(y_train).value_counts())
print(f"\nTest set class distribution:")
print(pd.Series(y_test).value_counts())

---

## 3. Model Training & Evaluation

### 3.1 Model 1: Logistic Regression

In [None]:
# Train Logistic Regression model
print("="*80)
print("TRAINING LOGISTIC REGRESSION MODEL")
print("="*80)

# Initialize model with balanced class weights to handle any class imbalance
lr_model = LogisticRegression(
    random_state=42,
    max_iter=1000,
    class_weight='balanced'  # Handles class imbalance
)

# Train the model
lr_model.fit(X_train, y_train)

# Make predictions
y_pred_lr = lr_model.predict(X_test)
y_pred_proba_lr = lr_model.predict_proba(X_test)[:, 1]  # Probability of class 1

print("✓ Logistic Regression model trained successfully!")

In [None]:
# Evaluate Logistic Regression
print("\n" + "="*80)
print("LOGISTIC REGRESSION - EVALUATION METRICS")
print("="*80)

lr_accuracy = accuracy_score(y_test, y_pred_lr)
lr_precision = precision_score(y_test, y_pred_lr, average='binary')
lr_recall = recall_score(y_test, y_pred_lr, average='binary')
lr_f1 = f1_score(y_test, y_pred_lr, average='binary')
lr_roc_auc = roc_auc_score(y_test, y_pred_proba_lr)

print(f"Accuracy:  {lr_accuracy:.4f}")
print(f"Precision: {lr_precision:.4f}")
print(f"Recall:    {lr_recall:.4f}")
print(f"F1-Score:  {lr_f1:.4f}")
print(f"ROC-AUC:   {lr_roc_auc:.4f}")

print(f"\n{'-'*80}")
print("Classification Report:")
print(f"{'-'*80}")
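# target_names assumes class 0 = Fail and class 1 = Pass (check the encoding printed in Section 2.3)
print(classification_report(y_test, y_pred_lr, target_names=['Fail', 'Pass']))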

In [None]:
# Confusion Matrix for Logistic Regression
cm_lr = confusion_matrix(y_test, y_pred_lr)

plt.figure(figsize=(6, 5))
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Fail', 'Pass'], 
            yticklabels=['Fail', 'Pass'])
plt.title('Confusion Matrix - Logistic Regression', fontsize=14, weight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

print(f"True Negatives:  {cm_lr[0,0]}")
print(f"False Positives: {cm_lr[0,1]}")
print(f"False Negatives: {cm_lr[1,0]}")
print(f"True Positives:  {cm_lr[1,1]}")

### 3.2 Model 2: Random Forest Classifier

In [None]:
# Train Random Forest model
print("="*80)
print("TRAINING RANDOM FOREST CLASSIFIER")
print("="*80)

# Initialize model with hand-picked hyperparameters (not tuned; see Section 5.4)
rf_model = RandomForestClassifier(
    n_estimators=100,       # Number of trees
    max_depth=10,           # Maximum depth of trees
    min_samples_split=10,   # Minimum samples to split a node
    min_samples_leaf=4,     # Minimum samples in leaf node
    random_state=42,
    class_weight='balanced', # Handle class imbalance
    n_jobs=-1               # Use all CPU cores
)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test)
y_pred_proba_rf = rf_model.predict_proba(X_test)[:, 1]

print("✓ Random Forest model trained successfully!")

In [None]:
# Evaluate Random Forest
print("\n" + "="*80)
print("RANDOM FOREST - EVALUATION METRICS")
print("="*80)

rf_accuracy = accuracy_score(y_test, y_pred_rf)
rf_precision = precision_score(y_test, y_pred_rf, average='binary')
rf_recall = recall_score(y_test, y_pred_rf, average='binary')
rf_f1 = f1_score(y_test, y_pred_rf, average='binary')
rf_roc_auc = roc_auc_score(y_test, y_pred_proba_rf)

print(f"Accuracy:  {rf_accuracy:.4f}")
print(f"Precision: {rf_precision:.4f}")
print(f"Recall:    {rf_recall:.4f}")
print(f"F1-Score:  {rf_f1:.4f}")
print(f"ROC-AUC:   {rf_roc_auc:.4f}")

print(f"\n{'-'*80}")
print("Classification Report:")
print(f"{'-'*80}")
print(classification_report(y_test, y_pred_rf, target_names=['Fail', 'Pass']))

In [None]:
# Confusion Matrix for Random Forest
cm_rf = confusion_matrix(y_test, y_pred_rf)

plt.figure(figsize=(6, 5))
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Greens',
            xticklabels=['Fail', 'Pass'],
            yticklabels=['Fail', 'Pass'])
plt.title('Confusion Matrix - Random Forest', fontsize=14, weight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

print(f"True Negatives:  {cm_rf[0,0]}")
print(f"False Positives: {cm_rf[0,1]}")
print(f"False Negatives: {cm_rf[1,0]}")
print(f"True Positives:  {cm_rf[1,1]}")

### 3.3 Feature Importance (Random Forest Only)

In [None]:
# Get feature importances from Random Forest
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print("Feature Importance (Random Forest):")
display(feature_importance)

# Plot feature importance
plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance, x='Importance', y='Feature', palette='viridis')
plt.title('Feature Importance - Random Forest', fontsize=14, weight='bold')
plt.xlabel('Importance Score')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()

---

## 4. Model Comparison

In [None]:
# Create comparison dataframe
comparison = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest'],
    'Accuracy': [lr_accuracy, rf_accuracy],
    'Precision': [lr_precision, rf_precision],
    'Recall': [lr_recall, rf_recall],
    'F1-Score': [lr_f1, rf_f1],
    'ROC-AUC': [lr_roc_auc, rf_roc_auc]
})

print("="*80)
print("MODEL COMPARISON")
print("="*80)
display(comparison)

# Highlight best scores
print("\n📊 Best Performance:")
for metric in ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']:
    best_idx = comparison[metric].idxmax()
    best_model = comparison.loc[best_idx, 'Model']
    best_score = comparison.loc[best_idx, metric]
    print(f"  {metric:12s}: {best_model:20s} ({best_score:.4f})")

In [None]:
# Visualize model comparison
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']
lr_scores = [lr_accuracy, lr_precision, lr_recall, lr_f1, lr_roc_auc]
rf_scores = [rf_accuracy, rf_precision, rf_recall, rf_f1, rf_roc_auc]

x = np.arange(len(metrics))
width = 0.35

fig, ax = plt.subplots(figsize=(12, 6))
bars1 = ax.bar(x - width/2, lr_scores, width, label='Logistic Regression', color='#4C72B0')
bars2 = ax.bar(x + width/2, rf_scores, width, label='Random Forest', color='#55A868')

ax.set_xlabel('Metrics', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Model Performance Comparison', fontsize=14, weight='bold')
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.legend()
ax.set_ylim([0, 1])
ax.grid(axis='y', alpha=0.3)

# Add value labels on bars
for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.3f}',
                ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

### 4.1 ROC Curve Comparison

In [None]:
# Plot ROC curves for both models
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_pred_proba_lr)
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_proba_rf)

plt.figure(figsize=(8, 6))
plt.plot(fpr_lr, tpr_lr, label=f'Logistic Regression (AUC = {lr_roc_auc:.3f})', 
         color='#4C72B0', linewidth=2)
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {rf_roc_auc:.3f})', 
         color='#55A868', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier', linewidth=1)

plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curve Comparison', fontsize=14, weight='bold')
plt.legend(loc='lower right', fontsize=10)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

### 4.2 Cross-Validation (Optional - for robustness check)

In [None]:
# Perform 5-fold cross-validation on both models
print("Performing 5-Fold Cross-Validation...\n")

# Logistic Regression CV
lr_cv_scores = cross_val_score(lr_model, X_train, y_train, cv=5, scoring='accuracy')
print(f"Logistic Regression CV Accuracy: {lr_cv_scores.mean():.4f} (+/- {lr_cv_scores.std():.4f})")

# Random Forest CV
rf_cv_scores = cross_val_score(rf_model, X_train, y_train, cv=5, scoring='accuracy')
print(f"Random Forest CV Accuracy:       {rf_cv_scores.mean():.4f} (+/- {rf_cv_scores.std():.4f})")

# Plot CV scores
cv_results = pd.DataFrame({
    'Fold': list(range(1, 6)) * 2,
    'Accuracy': list(lr_cv_scores) + list(rf_cv_scores),
    'Model': ['Logistic Regression']*5 + ['Random Forest']*5
})

plt.figure(figsize=(10, 5))
sns.barplot(data=cv_results, x='Fold', y='Accuracy', hue='Model', palette=['#4C72B0', '#55A868'])
plt.title('Cross-Validation Accuracy by Fold', fontsize=14, weight='bold')
plt.ylabel('Accuracy')
plt.xlabel('Fold')
plt.ylim([0.5, 1.0])
plt.legend(title='Model')
plt.tight_layout()
plt.show()

---

## 5. Results Interpretation & Key Findings

### 5.1 Which Model Performed Best?

Based on the evaluation metrics:

**Winner: [Model Name]** (Fill this in after running the notebook)

**Justification:**
- **Accuracy:** [Describe which model had higher accuracy]
- **Precision:** [Discuss precision - important if we want to avoid false positives (predicting Pass when actually Fail)]
- **Recall:** [Discuss recall - important if we want to catch all students who might fail]
- **F1-Score:** [Balanced metric - good for overall assessment]
- **ROC-AUC:** [Measures the model's ability to distinguish between classes]
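
When filling in the justification, it can help to see how each metric follows from the four confusion-matrix counts. The sketch below uses a small set of made-up labels (1 = Pass, 0 = Fail, an assumed encoding) and checks the hand-computed values against scikit-learn.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Hypothetical labels for illustration only (1 = Pass, 0 = Fail)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 0])

# For binary labels {0, 1}, ravel() yields counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted passes, how many really passed
recall = tp / (tp + fn)      # of the real passes, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# Cross-check against scikit-learn's implementations
assert np.isclose(accuracy, accuracy_score(y_true, y_pred))
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(f1, f1_score(y_true, y_pred))
```

Note that precision and recall here describe the Pass class. If the goal is to catch students at risk of failing, the recall of the Fail class, TN / (TN + FP), is the more relevant number.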

### 5.2 Key Findings

1. **Model Performance:**
   - Both models achieved [high/moderate/low] accuracy (>XX%)
   - Random Forest [outperformed/underperformed] Logistic Regression, suggesting [linear/non-linear] relationships in the data

2. **Feature Importance (from Random Forest):**
   - **Most Important Feature:** [Feature name] - This makes sense because [explain why]
   - **Least Important Feature:** [Feature name] - Could potentially be removed in future iterations

3. **Class Imbalance Impact:**
   - Used `class_weight='balanced'` to handle imbalance
   - [Discuss if the imbalance affected performance, check confusion matrix]

4. **Confusion Matrix Insights:**
   - **False Positives:** [Number] students predicted to pass but actually failed - [Discuss implications]
   - **False Negatives:** [Number] students predicted to fail but actually passed - [Discuss implications]
   - For educational advice systems, [False Positives/False Negatives] are more critical because...

5. **Cross-Validation Results:**
   - [Model name] showed more consistent performance across folds (lower standard deviation)
   - A lower spread means the score depends less on which samples land in the training folds, which supports [better/worse] generalization to unseen data
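
For finding 3, it may be useful to know what `class_weight='balanced'` does: scikit-learn weights each class by `n_samples / (n_classes * np.bincount(y))`, so the rarer class gets a proportionally larger weight. The sketch below reproduces this on a hypothetical 20/80 Fail/Pass split, which is not the actual class distribution of the dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced target: 20 Fail (0) and 80 Pass (1)
y_demo = np.array([0] * 20 + [1] * 80)
classes = np.unique(y_demo)

# Weights as scikit-learn computes them for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# The same formula written out by hand
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight {w:.3f}")
```

With these weights, each class contributes the same total weight to the loss (20 × 2.5 = 80 × 0.625 = 50), which is why the minority class is no longer drowned out.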

### 5.3 Practical Implications

- **For Educators:** The model can identify at-risk students early based on [key features]
- **For Students:** Understanding that [most important feature] has the strongest impact can guide study strategies
- **For System Design:** [Winning model] should be used for the advice system due to its superior [metric]

### 5.4 Limitations & Future Improvements

- **Data Size:** ~40K samples is moderate; more data could improve performance
- **Feature Engineering:** Could explore interactions between features (e.g., study hours × attendance)
- **Hyperparameter Tuning:** Hyperparameters were set by hand; `GridSearchCV` (already imported in Section 2.1) could tune them systematically
- **Algorithm Exploration:** Could try XGBoost, LightGBM, or Neural Networks in future phases
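
As a starting point for the hyperparameter tuning item, a grid search over the Random Forest could look roughly like the sketch below. The grid values, the `f1` scoring choice, and the synthetic data are all assumptions made for illustration. On the real data, `X_train` and `y_train` from Section 2.3 would take the place of the demo arrays.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: NOT the real student dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Illustrative grid; a real search would likely cover wider ranges
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10],
    "min_samples_leaf": [1, 4],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42, class_weight="balanced"),
    param_grid,
    scoring="f1",  # assumed metric; pick the one that matches the project goal
    cv=3,
)
search.fit(X_tr, y_tr)  # tuning uses only the training split

print("Best parameters:", search.best_params_)
print(f"Best CV F1:      {search.best_score_:.4f}")
print(f"Held-out F1:     {search.score(X_te, y_te):.4f}")
```

Keeping the test split out of the search matters: the held-out score is only an honest estimate if the test data played no part in choosing the hyperparameters.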

---

**Conclusion:**  
[Winning model] is recommended for Phase 3 integration due to its [best metric] of [score] and ability to [key strength]. The system can now predict student outcomes with [XX%] accuracy, enabling proactive academic support.

---

## 6. Save Models (Optional)

In [None]:
# Save trained models for future use
import joblib

# Create models directory if it doesn't exist
import os
os.makedirs('models', exist_ok=True)

# Save models
joblib.dump(lr_model, 'models/logistic_regression_model.pkl')
joblib.dump(rf_model, 'models/random_forest_model.pkl')

print("✓ Models saved successfully!")
print("  - models/logistic_regression_model.pkl")
print("  - models/random_forest_model.pkl")
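
A saved model can later be restored with `joblib.load`. The round trip below is a minimal, self-contained sketch that uses a temporary directory and a small synthetic model, so it does not touch the files in `models/`. The file name `demo_model.pkl` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic model used only to demonstrate the save/load round trip
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("Predictions identical after reload:", same)
```

Pickled models are tied to the library versions that produced them, so it is worth loading them with the same scikit-learn version used for training, and only from sources you trust.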

---

**End of Phase 2**

**Next Steps:** Phase 3 will apply unsupervised learning (clustering) to discover patterns in student data and potentially improve the supervised model.