# Quiz 7: Ensemble Methods for Student Pass/Fail Prediction

**Objective:** To build, compare, and evaluate ensemble machine learning models (Random Forest, AdaBoost, Gradient Boosting) for predicting whether a student will pass or fail based on academic and behavioral features.

## 1. Setup and Library Imports

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
import warnings

# Ignore warnings for cleaner output
warnings.filterwarnings('ignore')

# Optional: Improve plot aesthetics using seaborn
try:
    import seaborn as sns
    sns.set_style("whitegrid")
except ImportError:
    pass # Keep default matplotlib style if seaborn is not installed

## 2. Data Loading and Preparation

Since no specific dataset is provided, we will create a synthetic dataset representative of student academic and behavioral features.

In [None]:
# Create a synthetic dataset
np.random.seed(42) # for reproducibility
n_students = 1000

data = {
    'StudyHours': np.random.uniform(1, 20, n_students),
    'PreviousGPA': np.random.uniform(1.5, 4.0, n_students),
    'AttendancePercentage': np.random.uniform(50, 100, n_students),
    'AssignmentsCompleted': np.random.randint(0, 11, n_students), # 0 to 10 assignments
    'EngagementLevel': np.random.choice(['Low', 'Medium', 'High'], n_students, p=[0.3, 0.5, 0.2]),
    'HasTutor': np.random.choice([0, 1], n_students, p=[0.7, 0.3]) # 0: No, 1: Yes
}

df = pd.DataFrame(data)

# Create a synthetic target variable 'Pass' (1 for Pass, 0 for Fail)
# Probability of passing increases with better metrics
prob_pass = (df['StudyHours']/20 +
             df['PreviousGPA']/4 +
             df['AttendancePercentage']/100 +
             df['AssignmentsCompleted']/10 +
             df['EngagementLevel'].map({'Low': 0.1, 'Medium': 0.5, 'High': 0.9}) +
             df['HasTutor']*0.1) / 5 # Divide by 5, the maximum possible sum of the six terms, so the score lies in [0, 1]

# Add some noise
prob_pass = np.clip(prob_pass + np.random.normal(0, 0.15, n_students), 0, 1)

# Determine pass/fail based on probability threshold (e.g., 0.5)
df['Pass'] = (prob_pass > 0.5).astype(int)

print("Dataset Head:")
print(df.head())
print("\nDataset Info:")
df.info()
print("\nTarget Variable Distribution:")
print(df['Pass'].value_counts(normalize=True))
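
As a sanity check on the scoring formula in the cell above, the noise-free score can be bounded and averaged by hand. The sketch below recomputes those quantities from the feature ranges and category probabilities used in that cell. The names `terms`, `min_score`, `max_score`, and `expected_score` are introduced here for illustration only and are not part of the notebook.

```python
# Noise-free pass score = (sum of six terms) / 5, using the ranges from the data cell.
# Each entry: (minimum contribution, maximum contribution, expected contribution).
terms = {
    "StudyHours":           (1 / 20,   20 / 20,   ((1 + 20) / 2) / 20),
    "PreviousGPA":          (1.5 / 4,  4.0 / 4,   ((1.5 + 4.0) / 2) / 4),
    "AttendancePercentage": (50 / 100, 100 / 100, ((50 + 100) / 2) / 100),
    "AssignmentsCompleted": (0 / 10,   10 / 10,   5 / 10),  # integers 0..10, mean 5
    "EngagementLevel":      (0.1,      0.9,       0.3 * 0.1 + 0.5 * 0.5 + 0.2 * 0.9),
    "HasTutor":             (0.0,      0.1,       0.3 * 0.1),
}

min_score = sum(t[0] for t in terms.values()) / 5
max_score = sum(t[1] for t in terms.values()) / 5
expected_score = sum(t[2] for t in terms.values()) / 5

print(f"min={min_score:.4f}, max={max_score:.4f}, expected={expected_score:.4f}")
# min=0.2050, max=1.0000, expected=0.5905
```

Because the expected score sits above the 0.5 threshold, the generated labels should contain more Pass than Fail cases, which is one reason the later split uses `stratify=y`.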

## 3. Feature Engineering and Preprocessing

We need to:
1.  Identify numerical and categorical features.
2.  Encode categorical features (e.g., using One-Hot Encoding).
3.  Optionally scale numerical features (tree-based models split on thresholds, so they are largely insensitive to feature scaling).

In [None]:
# Define features (X) and target (y)
X = df.drop('Pass', axis=1)
y = df['Pass']

# Identify feature types
numerical_features = X.select_dtypes(include=np.number).columns.tolist()
categorical_features = X.select_dtypes(exclude=np.number).columns.tolist()

print(f"Numerical Features: {numerical_features}")
print(f"Categorical Features: {categorical_features}")

# Create preprocessing pipelines for numerical and categorical features
# For tree models, scaling isn't strictly necessary, but we'll include it for completeness
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Create a column transformer to apply different transformations to different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# The preprocessor is not fit here. Fitting it on the whole dataset would leak
# test-set statistics, so it is fit on the training data only, inside the
# pipelines built in Section 5.
# For a quick shape check (demonstration only), uncomment:
# X_processed = preprocessor.fit_transform(X)
# print(f"\nShape of processed features: {X_processed.shape}")

## 4. Train-Test Split

Split the data into training and testing sets to evaluate model performance on unseen data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

print(f"Training set shape: X={X_train.shape}, y={y_train.shape}")
print(f"Testing set shape: X={X_test.shape}, y={y_test.shape}")
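
The effect of `stratify` can be checked on a small hypothetical label vector. The sketch below assumes an 80/20 class mix and a placeholder feature column; both are illustrative and unrelated to the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 80% pass (1), 20% fail (0).
y_demo = np.array([1] * 80 + [0] * 20)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# With stratification, both subsets keep roughly the original pass rate.
print(f"train pass rate: {y_tr.mean():.2f}, test pass rate: {y_te.mean():.2f}")
```

Without `stratify`, a random split of an imbalanced target can leave the test set with a noticeably different class mix, which makes accuracy harder to interpret.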

## 5. Model Building and Training

We will create pipelines that include the preprocessing steps and the respective ensemble models.

In [None]:
# --- Random Forest --- 
rf_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('classifier', RandomForestClassifier(random_state=42, n_estimators=100))])

# --- AdaBoost --- 
ab_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                             ('classifier', AdaBoostClassifier(random_state=42, n_estimators=50))]) 
# Note: scikit-learn's AdaBoostClassifier uses DecisionTreeClassifier(max_depth=1), a decision stump, as its default base estimator

# --- Gradient Boosting --- 
gb_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('classifier', GradientBoostingClassifier(random_state=42, n_estimators=100))])

# Train the models
print("Training Random Forest...")
rf_pipeline.fit(X_train, y_train)
print("Training AdaBoost...")
ab_pipeline.fit(X_train, y_train)
print("Training Gradient Boosting...")
gb_pipeline.fit(X_train, y_train)

print("\nAll models trained.")
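
Once a pipeline is fitted, its steps can be inspected through `named_steps`. The sketch below, which assumes a small `make_classification` dataset and a simplified pipeline rather than the notebook's own, retrieves the fitted AdaBoost step and checks the depth of its weak learners.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset, used only for this illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", AdaBoostClassifier(n_estimators=10, random_state=0)),
])
demo_pipeline.fit(X_demo, y_demo)

# Pull the fitted classifier out of the pipeline by its step name.
fitted_ada = demo_pipeline.named_steps["classifier"]

# Each weak learner is a fitted decision tree; with the default base
# estimator these are stumps, so no tree should be deeper than 1.
depths = [tree.get_depth() for tree in fitted_ada.estimators_]
print(f"{len(depths)} weak learners, max depth = {max(depths)}")
```

The same `named_steps` access works for `rf_pipeline` and `gb_pipeline`, for example to read fitted attributes of the classifier after training.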

## 6. Model Evaluation

Evaluate each model on the test set using standard classification metrics.

In [None]:
# Make predictions
y_pred_rf = rf_pipeline.predict(X_test)
y_pred_ab = ab_pipeline.predict(X_test)
y_pred_gb = gb_pipeline.predict(X_test)

# Get prediction probabilities for AUC
y_prob_rf = rf_pipeline.predict_proba(X_test)[:, 1]
y_prob_ab = ab_pipeline.predict_proba(X_test)[:, 1]
y_prob_gb = gb_pipeline.predict_proba(X_test)[:, 1]

# Evaluate each model
results = {}

print("--- Random Forest Evaluation ---")
acc_rf = accuracy_score(y_test, y_pred_rf)
auc_rf = roc_auc_score(y_test, y_prob_rf)
print(f"Accuracy: {acc_rf:.4f}")
print(f"ROC AUC: {auc_rf:.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_rf))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))
results['Random Forest'] = {'Accuracy': acc_rf, 'AUC': auc_rf}

print("\n--- AdaBoost Evaluation ---")
acc_ab = accuracy_score(y_test, y_pred_ab)
auc_ab = roc_auc_score(y_test, y_prob_ab)
print(f"Accuracy: {acc_ab:.4f}")
print(f"ROC AUC: {auc_ab:.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_ab))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_ab))
results['AdaBoost'] = {'Accuracy': acc_ab, 'AUC': auc_ab}

print("\n--- Gradient Boosting Evaluation ---")
acc_gb = accuracy_score(y_test, y_pred_gb)
auc_gb = roc_auc_score(y_test, y_prob_gb)
print(f"Accuracy: {acc_gb:.4f}")
print(f"ROC AUC: {auc_gb:.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_gb))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_gb))
results['Gradient Boosting'] = {'Accuracy': acc_gb, 'AUC': auc_gb}
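
The three evaluation blocks above repeat the same steps, so they could be folded into a loop. The following is one possible sketch; `evaluate_model` is a hypothetical helper, not something defined in the notebook, and the data comes from `make_classification` so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_model(model, X_test, y_test):
    """Return accuracy and ROC AUC for a fitted binary classifier."""
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    return {
        "Accuracy": accuracy_score(y_test, y_pred),
        "AUC": roc_auc_score(y_test, y_prob),
    }


# Small synthetic dataset, used only for this illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=20, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=20, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=20, random_state=0),
}

# Fit each model and collect its metrics in one dictionary.
demo_results = {
    name: evaluate_model(model.fit(X_tr, y_tr), X_te, y_te)
    for name, model in models.items()
}
for name, metrics in demo_results.items():
    print(f"{name}: accuracy={metrics['Accuracy']:.4f}, AUC={metrics['AUC']:.4f}")
```

The resulting dictionary has the same shape as `results`, so it can be passed to `pd.DataFrame(...).T` in the same way.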

## 7. Model Comparison

Summarize the performance of the three ensemble models.

In [None]:
print("--- Model Performance Summary ---")
results_df = pd.DataFrame(results).T # Transpose for better readability
print(results_df)

# Find the best model based on a chosen metric (e.g., ROC AUC)
best_model_auc = results_df['AUC'].idxmax()
best_model_acc = results_df['Accuracy'].idxmax()

print(f"\nBest model based on ROC AUC: {best_model_auc} (AUC: {results_df.loc[best_model_auc, 'AUC']:.4f})")
print(f"Best model based on Accuracy: {best_model_acc} (Accuracy: {results_df.loc[best_model_acc, 'Accuracy']:.4f})")

## 8. Visual Comparison

In [None]:
# --- Performance Bar Chart --- 
fig, ax = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy Plot
results_df['Accuracy'].plot(kind='bar', ax=ax[0], color=['skyblue', 'lightcoral', 'lightgreen'])
ax[0].set_title('Model Accuracy Comparison')
ax[0].set_ylabel('Accuracy')
ax[0].set_xlabel('Model')
ax[0].tick_params(axis='x', rotation=0)
ax[0].set_ylim(bottom=max(0, results_df['Accuracy'].min() - 0.05), top=min(1.0, results_df['Accuracy'].max() + 0.05)) # Adjust y-lim dynamically

# AUC Plot
results_df['AUC'].plot(kind='bar', ax=ax[1], color=['skyblue', 'lightcoral', 'lightgreen'])
ax[1].set_title('Model ROC AUC Comparison')
ax[1].set_ylabel('ROC AUC Score')
ax[1].set_xlabel('Model')
ax[1].tick_params(axis='x', rotation=0)
ax[1].set_ylim(bottom=max(0, results_df['AUC'].min() - 0.05), top=min(1.0, results_df['AUC'].max() + 0.05)) # Adjust y-lim dynamically

plt.tight_layout()
plt.show()

# --- ROC Curve Plot ---
plt.figure(figsize=(8, 6))

# Random Forest
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_prob_rf)
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {auc_rf:.4f})')

# AdaBoost
fpr_ab, tpr_ab, _ = roc_curve(y_test, y_prob_ab)
plt.plot(fpr_ab, tpr_ab, label=f'AdaBoost (AUC = {auc_ab:.4f})')

# Gradient Boosting
fpr_gb, tpr_gb, _ = roc_curve(y_test, y_prob_gb)
plt.plot(fpr_gb, tpr_gb, label=f'Gradient Boosting (AUC = {auc_gb:.4f})')

# Plotting the diagonal line (random guessing)
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')

# Customizing the plot
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('Receiver Operating Characteristic (ROC) Curves')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()
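
To see what `roc_curve` and the AUC represent, it helps to work through a four-sample toy case. The labels and scores below are hypothetical values chosen so the result can be verified by hand: of the four (positive, negative) pairs, three are ranked correctly, so the AUC is 3/4.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and predicted scores.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve sweeps the decision threshold and returns one (FPR, TPR) point per step.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the piecewise-linear curve, via the trapezoidal rule.
area = auc(fpr, tpr)
print(f"AUC from curve: {area:.2f}")                                 # 0.75
print(f"AUC from roc_auc_score: {roc_auc_score(y_true, y_score):.2f}")  # 0.75
```

The two values agree because `roc_auc_score` computes the same area directly from the labels and scores.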

## 9. Conclusion

This notebook demonstrated the process of building, training, and evaluating three common ensemble classifiers (Random Forest, AdaBoost, Gradient Boosting) for a binary classification task (student pass/fail prediction) using a synthetic dataset.

Based on the evaluation metrics (specifically Accuracy and ROC AUC) and the visualizations above, we compared their performance on the unseen test data.

* **Random Forest** typically performs well out-of-the-box; adding more trees reduces variance rather than increasing overfitting, although very deep individual trees can still fit noise.
* **AdaBoost** focuses on misclassified samples, which can be powerful but sometimes sensitive to noisy data or outliers.
* **Gradient Boosting** builds trees sequentially, correcting errors from previous trees, often leading to high accuracy but potentially requiring more careful tuning.
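
The sequential nature of boosting can be observed with `staged_predict`, which yields the ensemble's predictions after each added tree. This is a minimal sketch on a `make_classification` dataset (an assumption for illustration, not the notebook's data). Test accuracy generally improves over the first stages, though it is not guaranteed to increase at every step.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Small synthetic dataset, used only for this illustration.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

gb = GradientBoostingClassifier(n_estimators=30, random_state=0).fit(X_tr, y_tr)

# One accuracy value per boosting stage: after 1 tree, 2 trees, ..., 30 trees.
stage_acc = [accuracy_score(y_te, pred) for pred in gb.staged_predict(X_te)]

print(f"after  1 tree : {stage_acc[0]:.4f}")
print(f"after 30 trees: {stage_acc[-1]:.4f}")
```

Plotting `stage_acc` against the stage index is a simple way to judge whether `n_estimators` is larger than it needs to be.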

The 'best' model depends on the specific dataset characteristics and the chosen evaluation metric. In this synthetic example, the best-performing model by ROC AUC and by accuracy is the one reported in the Section 7 summary; because the features and labels are randomly generated, differences between the three models are likely to be small. The visualizations in Section 8 show the same comparison graphically.

**Further Steps:**
* Hyperparameter tuning (e.g., using GridSearchCV or RandomizedSearchCV) for each model could further optimize performance.
* Feature importance analysis could reveal which academic/behavioral factors are most predictive (especially useful with Random Forest and Gradient Boosting).
* Trying other ensemble techniques (e.g., XGBoost, LightGBM, CatBoost) could yield different results.
* Using a real-world dataset would provide more meaningful insights.