# Student Admission Prediction - BentoML Project

This notebook demonstrates the complete machine learning pipeline for predicting university admission chances using student data. We will:

1. **Load and explore** the processed dataset
2. **Build a linear regression model** to predict admission chances
3. **Evaluate model performance** using various metrics
4. **Prepare the model** for BentoML deployment

## Dataset Overview

The dataset contains information about students applying to universities with features like:
- GRE Score, TOEFL Score, University Rating
- Statement of Purpose (SOP), Letter of Recommendation (LOR)
- CGPA, Research Experience
- **Target**: Chance of Admit (0-1 probability)

## 1. Import Required Libraries

Import all necessary libraries for data manipulation, modeling, and visualization.

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning libraries
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Model persistence
import joblib
import os

# Set style for better plots
# Note: 'seaborn-v0_8' requires matplotlib >= 3.6 (older releases call this style 'seaborn')
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("✅ All libraries imported successfully!")

## 2. Load Processed Data

Load the preprocessed training and test datasets that were created by our `prepare_data.py` script.
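The script itself is not shown in this notebook, so the following is only a minimal sketch of what a step like `prepare_data.py` might do, assuming an 80/20 split with `train_test_split`. The function name `split_and_save`, the column names, the seed, and the synthetic data are all hypothetical stand-ins.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_and_save(df, target_col, out_dir, test_size=0.2, seed=42):
    """Split df into train/test parts and write the four CSVs this notebook loads."""
    X = df.drop(columns=[target_col])
    y = df[target_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    os.makedirs(out_dir, exist_ok=True)
    parts = {"X_train": X_train, "X_test": X_test, "y_train": y_train, "y_test": y_test}
    for name, part in parts.items():
        # index=False keeps the shuffled row index out of the file,
        # so pd.read_csv later yields a clean 0..n-1 index for X and y alike
        part.to_csv(os.path.join(out_dir, f"{name}.csv"), index=False)
    return {name: part.shape for name, part in parts.items()}


# Synthetic stand-in data (assumption); the real script would read the raw dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "GRE_Score": rng.integers(290, 341, size=50),
    "CGPA": rng.uniform(6.5, 10.0, size=50),
    "Chance_of_Admit": rng.uniform(0.3, 1.0, size=50),
})

with tempfile.TemporaryDirectory() as tmp:
    shapes = split_and_save(demo, "Chance_of_Admit", tmp)
    written = sorted(os.listdir(tmp))

print(shapes)
print(written)
```

Writing the files without the index matters for the exploration step: `corrwith` aligns on the index, so features and target must share the same row labels after loading.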

In [None]:
# Load processed data
data_path = 'data/processed/'

print("Loading processed datasets...")
X_train = pd.read_csv(os.path.join(data_path, 'X_train.csv'))
X_test = pd.read_csv(os.path.join(data_path, 'X_test.csv'))
y_train = pd.read_csv(os.path.join(data_path, 'y_train.csv'))
y_test = pd.read_csv(os.path.join(data_path, 'y_test.csv'))

# Convert y dataframes to series for easier handling
y_train = y_train.squeeze()
y_test = y_test.squeeze()

print(f"✅ Data loaded successfully!")
print(f"Training set: X_train {X_train.shape}, y_train {y_train.shape}")
print(f"Test set: X_test {X_test.shape}, y_test {y_test.shape}")
print(f"\nFeatures: {list(X_train.columns)}")

## 3. Data Exploration and Visualization

Explore the dataset to understand the distribution of features and their relationships with the target variable.
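The correlations reported in this section are Pearson coefficients, which is the default for pandas `corrwith`. As a rough illustration of what that number measures, the sketch below computes it by hand on synthetic data (the columns `a` and `b` are made up) and compares the result with pandas.

```python
import numpy as np
import pandas as pd


def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # center both variables
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


# Synthetic data (assumption): the target depends strongly on "a" and not on "b".
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
y_demo = 0.8 * X_demo["a"] + 0.1 * rng.normal(size=100)

by_hand = pd.Series({col: pearson_r(X_demo[col], y_demo) for col in X_demo.columns})
by_pandas = X_demo.corrwith(y_demo)

print(by_hand.round(4).to_dict())
print(np.allclose(by_hand.to_numpy(), by_pandas.to_numpy()))
```

Pearson correlation only captures linear association, which suits a linear regression model but can understate the relevance of a feature with a curved relationship to the target.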

In [None]:
# Display basic statistics
print("📊 TRAINING DATA OVERVIEW")
print("=" * 50)
print(f"Dataset shape: {X_train.shape}")
print(f"\nFeature statistics:")
print(X_train.describe())

print(f"\n📈 TARGET VARIABLE STATISTICS")
print("=" * 50)
print(f"Chance of Admit statistics:")
print(y_train.describe())

# Create comprehensive visualizations
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
fig.suptitle('Feature Distributions and Relationships', fontsize=16, fontweight='bold')

# Plot distributions of each feature
features = X_train.columns
for i, feature in enumerate(features):
    row, col = i // 4, i % 4
    
    # Distribution plot
    axes[row, col].hist(X_train[feature], bins=20, alpha=0.7, color='skyblue', edgecolor='black')
    axes[row, col].set_title(f'{feature} Distribution', fontweight='bold')
    axes[row, col].set_xlabel(feature)
    axes[row, col].set_ylabel('Frequency')
    axes[row, col].grid(True, alpha=0.3)

# Remove any unused subplots (works for any feature count up to 8)
for ax in axes.ravel()[len(features):]:
    fig.delaxes(ax)

plt.tight_layout()
plt.show()

# Correlation analysis
print(f"\n🔗 FEATURE CORRELATIONS")
print("=" * 50)
correlation_with_target = X_train.corrwith(y_train).sort_values(ascending=False)
print("Correlation with Chance of Admit:")
for feature, corr in correlation_with_target.items():
    print(f"{feature:20}: {corr:.4f}")

In [None]:
# Create correlation heatmap
plt.figure(figsize=(12, 8))
correlation_matrix = X_train.corr()
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))

sns.heatmap(correlation_matrix, mask=mask, annot=True, cmap='RdYlBu_r', center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink": .8})
plt.title('Feature Correlation Heatmap', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Scatter plots of most correlated features with target
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Feature vs Chance of Admit Relationships', fontsize=16, fontweight='bold')

top_features = correlation_with_target.abs().nlargest(6).index
for i, feature in enumerate(top_features):
    row, col = i // 3, i % 3
    
    axes[row, col].scatter(X_train[feature], y_train, alpha=0.6, color='coral')
    axes[row, col].set_xlabel(feature)
    axes[row, col].set_ylabel('Chance of Admit')
    axes[row, col].set_title(f'{feature} vs Chance of Admit\n(Correlation: {correlation_with_target[feature]:.4f})')
    axes[row, col].grid(True, alpha=0.3)
    
    # Add trend line
    z = np.polyfit(X_train[feature], y_train, 1)
    p = np.poly1d(z)
    axes[row, col].plot(X_train[feature], p(X_train[feature]), "b--", alpha=0.8)

plt.tight_layout()
plt.show()

## 4. Linear Regression Model Training

Build and train a linear regression model to predict the chance of admission.
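Because the model is fitted on standardized features, each coefficient is the change in predicted admission chance per one standard deviation of that feature, not per raw unit. The sketch below, on synthetic data with made-up scales, shows one way to convert such coefficients back to original units and checks the result against a model fitted on the raw features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumption, loosely resembling a test score and a GPA).
rng = np.random.default_rng(2)
X_demo = np.column_stack([rng.uniform(290, 340, 200), rng.uniform(6.0, 10.0, 200)])
y_demo = 0.004 * X_demo[:, 0] + 0.1 * X_demo[:, 1] - 1.5 + rng.normal(0, 0.02, 200)

demo_scaler = StandardScaler()
demo_model = LinearRegression().fit(demo_scaler.fit_transform(X_demo), y_demo)

# Standardization is z = (x - mean) / scale, so substituting it into
# y = b + sum(w * z) gives per-unit slopes w / scale and a shifted intercept.
coef_original = demo_model.coef_ / demo_scaler.scale_
intercept_original = demo_model.intercept_ - np.sum(
    demo_model.coef_ * demo_scaler.mean_ / demo_scaler.scale_
)

# Ordinary least squares on the raw features should give the same parameters.
raw_model = LinearRegression().fit(X_demo, y_demo)
coefs_match = np.allclose(coef_original, raw_model.coef_)
intercepts_match = np.allclose(intercept_original, raw_model.intercept_)
print(coefs_match, intercepts_match)
```

For plain linear regression, scaling therefore does not change the predictions. Its main benefit here is that the standardized coefficients can be compared across features.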

In [None]:
# Initialize the StandardScaler for feature normalization
print("🔧 FEATURE SCALING")
print("=" * 50)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("✅ Features scaled successfully!")
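# Report the range of the first feature column, whatever it is named
first_feature = X_train.columns[0]
print(f"Original feature range example ({first_feature}): {X_train[first_feature].min():.2f} to {X_train[first_feature].max():.2f}")
print(f"Scaled feature range ({first_feature}): {X_train_scaled[:, 0].min():.2f} to {X_train_scaled[:, 0].max():.2f}")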

# Initialize and train the Linear Regression model
print(f"\n🤖 MODEL TRAINING")
print("=" * 50)

model = LinearRegression()
model.fit(X_train_scaled, y_train)

print("✅ Linear Regression model trained successfully!")
print(f"Model coefficients shape: {model.coef_.shape}")
print(f"Model intercept: {model.intercept_:.6f}")

# Display feature coefficients
print(f"\n📊 FEATURE COEFFICIENTS")
print("=" * 50)
feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': model.coef_,
    'Abs_Coefficient': np.abs(model.coef_)
}).sort_values('Abs_Coefficient', ascending=False)

print(feature_importance.to_string(index=False))

## 5. Model Evaluation and Performance Metrics

Evaluate the trained model using various regression metrics and visualizations.
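For reference, the four metrics used in this section can be written out directly with NumPy. The sketch below uses a handful of made-up values and a hypothetical helper, `regression_metrics`, and compares each result with the corresponding scikit-learn function.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics(y_true, y_pred):
    """Compute R², MSE, RMSE and MAE from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)                       # mean squared error
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,                 # share of variance explained
        "mse": mse,
        "rmse": np.sqrt(mse),                        # same units as the target
        "mae": np.mean(np.abs(errors)),              # average absolute error
    }


# Made-up values for illustration only.
y_true_demo = np.array([0.90, 0.75, 0.60, 0.45])
y_pred_demo = np.array([0.85, 0.80, 0.55, 0.50])

m = regression_metrics(y_true_demo, y_pred_demo)
print({name: round(float(value), 4) for name, value in m.items()})
```

RMSE squares the errors before averaging, so it weights large misses more heavily than MAE does. A gap between the two usually points to a few poorly predicted samples.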

In [None]:
# Make predictions on both training and test sets
print("🎯 MODEL PREDICTIONS")
print("=" * 50)

y_train_pred = model.predict(X_train_scaled)
y_test_pred = model.predict(X_test_scaled)

print("✅ Predictions completed!")

# Calculate performance metrics
print(f"\n📈 PERFORMANCE METRICS")
print("=" * 50)

# Training metrics
train_r2 = r2_score(y_train, y_train_pred)
train_mse = mean_squared_error(y_train, y_train_pred)
train_rmse = np.sqrt(train_mse)
train_mae = mean_absolute_error(y_train, y_train_pred)

# Test metrics
test_r2 = r2_score(y_test, y_test_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
test_rmse = np.sqrt(test_mse)
test_mae = mean_absolute_error(y_test, y_test_pred)

# Create metrics comparison table
metrics_df = pd.DataFrame({
    'Metric': ['R² Score', 'MSE', 'RMSE', 'MAE'],
    'Training': [train_r2, train_mse, train_rmse, train_mae],
    'Test': [test_r2, test_mse, test_rmse, test_mae]
})

print("REGRESSION METRICS COMPARISON:")
print(metrics_df.to_string(index=False, float_format='%.6f'))

# Interpretation
print(f"\n🎯 MODEL INTERPRETATION")
print("=" * 50)
print(f"• R² Score on test set: {test_r2:.4f} ({test_r2*100:.2f}% of the variance in admission chances explained)")
print(f"• RMSE on test set: {test_rmse:.6f} (root mean squared error, in Chance of Admit units)")
print(f"• MAE on test set: {test_mae:.6f} (average absolute prediction error)")

if test_r2 > 0.8:
    print("• 🟢 Excellent model performance!")
elif test_r2 > 0.6:
    print("• 🟡 Good model performance")
else:
    print("• 🔴 Model needs improvement")

In [None]:
# Create comprehensive evaluation visualizations
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Model Evaluation Visualizations', fontsize=16, fontweight='bold')

# 1. Actual vs Predicted scatter plot
axes[0, 0].scatter(y_test, y_test_pred, alpha=0.6, color='coral')
axes[0, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'b--', lw=2)
axes[0, 0].set_xlabel('Actual Chance of Admit')
axes[0, 0].set_ylabel('Predicted Chance of Admit')
axes[0, 0].set_title(f'Actual vs Predicted\n(R² = {test_r2:.4f})')
axes[0, 0].grid(True, alpha=0.3)

# 2. Residuals plot
residuals = y_test - y_test_pred
axes[0, 1].scatter(y_test_pred, residuals, alpha=0.6, color='lightgreen')
axes[0, 1].axhline(y=0, color='red', linestyle='--')
axes[0, 1].set_xlabel('Predicted Values')
axes[0, 1].set_ylabel('Residuals')
axes[0, 1].set_title('Residuals Plot')
axes[0, 1].grid(True, alpha=0.3)

# 3. Histogram of residuals
axes[1, 0].hist(residuals, bins=20, alpha=0.7, color='skyblue', edgecolor='black')
axes[1, 0].set_xlabel('Residuals')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Distribution of Residuals')
axes[1, 0].axvline(x=0, color='red', linestyle='--')
axes[1, 0].grid(True, alpha=0.3)

# 4. Feature importance plot
feature_importance_sorted = feature_importance.sort_values('Abs_Coefficient', ascending=True)
axes[1, 1].barh(feature_importance_sorted['Feature'], feature_importance_sorted['Coefficient'], 
                color='lightcoral', alpha=0.7)
axes[1, 1].set_xlabel('Coefficient Value')
axes[1, 1].set_title('Feature Importance (Coefficients)')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Additional analysis: Prediction accuracy distribution
print(f"\n📊 PREDICTION ACCURACY ANALYSIS")
print("=" * 50)

# Calculate absolute percentage errors (undefined if any actual value in y_test is 0)
ape = np.abs((y_test - y_test_pred) / y_test) * 100
mape = np.mean(ape)

print(f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%")
print(f"Predictions within 5% error: {np.sum(ape <= 5)}/{len(ape)} ({np.sum(ape <= 5)/len(ape)*100:.1f}%)")
print(f"Predictions within 10% error: {np.sum(ape <= 10)}/{len(ape)} ({np.sum(ape <= 10)/len(ape)*100:.1f}%)")
print(f"Predictions within 15% error: {np.sum(ape <= 15)}/{len(ape)} ({np.sum(ape <= 15)/len(ape)*100:.1f}%)")

## 6. Model Saving and BentoML Preparation

Save the trained model and scaler for use with BentoML deployment.
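This notebook saves the model and the scaler as separate files. One alternative, sketched here on synthetic data, is to wrap both in a scikit-learn `Pipeline` and persist a single artifact so the two cannot drift apart. The step names and the file name `admission_pipeline.joblib` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption) standing in for the admission features.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([0.2, -0.1, 0.05]) + 0.7

# The pipeline scales and predicts in one call, so callers pass raw features.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
pipeline.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "admission_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
round_trip_ok = np.allclose(pipeline.predict(X_demo[:5]), restored.predict(X_demo[:5]))
print(round_trip_ok)
```

Whichever layout is used, joblib files are pickles, so they should be loaded with the same scikit-learn version that wrote them and only from trusted sources.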

In [None]:
# Create models directory if it doesn't exist
models_dir = 'models'
os.makedirs(models_dir, exist_ok=True)

print("💾 SAVING TRAINED MODEL AND SCALER")
print("=" * 50)

# Save the trained linear regression model
model_path = os.path.join(models_dir, 'admission_model.joblib')
joblib.dump(model, model_path)

# Save the scaler
scaler_path = os.path.join(models_dir, 'scaler.joblib')
joblib.dump(scaler, scaler_path)

# Save feature names for reference
feature_names_path = os.path.join(models_dir, 'feature_names.joblib')
joblib.dump(list(X_train.columns), feature_names_path)

print(f"✅ Model saved to: {model_path}")
print(f"✅ Scaler saved to: {scaler_path}")
print(f"✅ Feature names saved to: {feature_names_path}")

# Test loading the model to ensure it works
print(f"\n🧪 TESTING MODEL LOADING")
print("=" * 50)

try:
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)
    loaded_features = joblib.load(feature_names_path)
    
    # Test prediction through the reloaded scaler and model (full pipeline)
    raw_sample = X_test.iloc[:1]
    original_pred = model.predict(scaler.transform(raw_sample))[0]
    loaded_pred = loaded_model.predict(loaded_scaler.transform(raw_sample))[0]
    
    print(f"Original model prediction: {original_pred:.6f}")
    print(f"Loaded model prediction: {loaded_pred:.6f}")
    print(f"Predictions match: {np.allclose(original_pred, loaded_pred)}")
    print("✅ Model loading test successful!")
    
except Exception as e:
    print(f"❌ Error loading model: {e}")

# Model summary for BentoML integration
print(f"\n📋 MODEL SUMMARY FOR BENTOML")
print("=" * 50)
print(f"Model Type: Linear Regression")
print(f"Features: {len(X_train.columns)} features")
print(f"Feature Names: {list(X_train.columns)}")
print(f"Training Samples: {len(X_train)}")
print(f"Test R² Score: {test_r2:.4f}")
print(f"Test RMSE: {test_rmse:.6f}")
print(f"Model File: {model_path}")
print(f"Scaler File: {scaler_path}")
print(f"Ready for BentoML deployment! 🚀")

## 7. Summary and Next Steps

### Model Performance Summary
- **Algorithm**: Linear Regression with StandardScaler normalization
- **Features**: 7 input features (GRE Score, TOEFL Score, University Rating, SOP, LOR, CGPA, Research)
- **Target**: Chance of Admit (a probability between 0 and 1; note that linear regression output is not constrained to this range)
- **Dataset**: 500 samples (400 training, 100 testing)

### Key Findings
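1. **Most Important Features**: Ranked by correlation with the target in Section 3 and by absolute standardized coefficient in the Section 4 table
2. **Model Performance**: Test-set R², MSE, RMSE, and MAE are reported in the Section 5 metrics table
3. **Prediction Accuracy**: Section 5 reports MAPE and the share of test predictions within 5%, 10%, and 15% error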

### Next Steps for BentoML Deployment
1. **Create BentoML Service**: Build the API service using the saved model
2. **Define API Endpoints**: Create prediction endpoints for single and batch predictions
3. **Build Bento**: Package the service into a deployable bento
4. **Containerize**: Create Docker container for deployment
5. **Test API**: Validate the deployed service with test data
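The BentoML service API differs between releases, so it is not reproduced here. Instead, the sketch below shows the framework-agnostic prediction logic that a single or batch endpoint would likely wrap. The function name `predict_admission`, the two-feature schema, and the clipping of outputs to [0, 1] are assumptions for illustration, not part of the saved model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def predict_admission(records, model, scaler, feature_names):
    """Score one record (dict) or several (list of dicts); returns a list of floats."""
    if isinstance(records, dict):
        records = [records]  # treat a single record as a batch of one
    frame = pd.DataFrame(records)
    missing = [name for name in feature_names if name not in frame.columns]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # Reorder columns to the training order before scaling
    ordered = frame[feature_names].astype(float)
    preds = model.predict(scaler.transform(ordered))
    # Assumption: clip so the response is always a valid probability
    return np.clip(preds, 0.0, 1.0).tolist()


# Tiny synthetic setup (assumption) with a hypothetical two-feature schema.
feature_names = ["GRE_Score", "CGPA"]
rng = np.random.default_rng(4)
train = pd.DataFrame({
    "GRE_Score": rng.uniform(290, 340, 80),
    "CGPA": rng.uniform(6.0, 10.0, 80),
})
target = 0.004 * train["GRE_Score"] + 0.1 * train["CGPA"] - 1.4
demo_scaler = StandardScaler().fit(train)
demo_model = LinearRegression().fit(demo_scaler.transform(train), target)

# Key order in the request does not matter; columns are reordered internally.
single = predict_admission({"CGPA": 9.0, "GRE_Score": 320}, demo_model, demo_scaler, feature_names)
batch = predict_admission(
    [{"GRE_Score": 300, "CGPA": 7.0}, {"GRE_Score": 340, "CGPA": 10.0}],
    demo_model, demo_scaler, feature_names,
)
print(single, batch)
```

In a real service, `model`, `scaler`, and `feature_names` would be loaded once at startup from the files saved in Section 6 rather than refitted.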

The model and preprocessing components are now ready for BentoML integration! 🎯