# Student Dropout Prediction & Academic Success
**Milestone 1: Prototype Implementation**

This notebook implements a supervised machine learning pipeline that predicts whether a student will drop out, stay enrolled, or graduate. The pipeline covers data preprocessing, feature analysis, class-imbalance handling with SMOTE, and a comparison of three algorithms: Logistic Regression, SVM, and Random Forest.

In [None]:
# 1. Import Essential Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 2. Data Preprocessing Libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from imblearn.over_sampling import SMOTE  # Crucial for handling class imbalance

# 3. Machine Learning Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# 4. Evaluation Metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score

# 5. Configuration
import warnings

sns.set_style("whitegrid")
# Hides ALL warnings (including convergence and deprecation notices) to keep output clean
warnings.filterwarnings('ignore')


## 1. Data Loading and Initial Inspection
Here we load the dataset and perform a basic check of the structure, dimensions, and missing values.

In [None]:
df = pd.read_csv('data.csv') 
df.head()

In [None]:
df.shape

In [None]:
df.isnull().sum().sum()
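
The single total above can hide where the gaps are. As a minimal sketch, the check can be broken down per column and extended to flag non-numeric features, which would need encoding before scaling or SMOTE. The toy frame and its column names are hypothetical stand-ins for `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `df`; the column names are hypothetical.
toy = pd.DataFrame({
    "Age at enrollment": [19, 22, np.nan, 20],
    "Admission grade": [127.3, np.nan, 140.0, 118.9],
    "Course": ["Nursing", "Design", "Nursing", None],
    "Target": ["Graduate", "Dropout", "Enrolled", "Graduate"],
})

# Missing values per column, keeping only columns that have any
missing_per_column = toy.isnull().sum()
missing_per_column = missing_per_column[missing_per_column > 0]
print(missing_per_column)

# Non-numeric feature columns would need encoding before scaling or SMOTE
non_numeric = (
    toy.drop(columns="Target").select_dtypes(exclude="number").columns.tolist()
)
print("Non-numeric feature columns:", non_numeric)
```

On the real data, the same two checks can be run by replacing `toy` with `df`.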

## 2. Target Encoding
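The target variable (`Target`) is categorical text. We convert it to integer codes (0, 1, 2) with Label Encoding so the machine learning models can process it. `LabelEncoder` assigns codes in sorted label order, and `le.classes_` keeps the original names for later reports.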

In [None]:
le = LabelEncoder()
df['Target_Encoded'] = le.fit_transform(df['Target'])
class_names = le.classes_ 
print(f"Target Classes: {class_names}")
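
To make the integer codes explicit, the mapping can be printed and reversed. This is a small sketch that assumes the label strings are `Dropout`, `Enrolled`, and `Graduate` (as the class names in Section 4.2 suggest); the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label strings; check them against df['Target'].unique()
labels = ["Graduate", "Dropout", "Enrolled", "Graduate", "Dropout"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so each integer code is the position in that sorted array
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns predicted integers back into readable labels
decoded = encoder.inverse_transform([2, 0, 1])
print(list(decoded))
```

Keeping the fitted encoder around means model predictions can always be translated back to the original status names.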

## 3. Exploratory Data Analysis (EDA)
### 3.1 Class Distribution
We visualize the target variable to check for class imbalance. This step justifies the need for resampling techniques (SMOTE) later in the pipeline.

In [None]:
# 1. Class Distribution Analysis
plt.figure(figsize=(6, 4))
sns.countplot(x='Target', data=df, palette='viridis')
plt.title("Class Distribution (Before Balancing)")
plt.xlabel("Student Status")
plt.ylabel("Count")
plt.show()
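
The bar chart can be backed by numbers. The sketch below computes per-class counts, shares, and a majority-to-minority ratio. The label counts are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up label counts for illustration only; not the real dataset
toy_target = pd.Series(["Graduate"] * 50 + ["Dropout"] * 30 + ["Enrolled"] * 20)

counts = toy_target.value_counts()
shares = toy_target.value_counts(normalize=True)
imbalance_ratio = counts.max() / counts.min()

print(pd.DataFrame({"count": counts, "share": shares.round(3)}))
print(f"Majority-to-minority ratio: {imbalance_ratio:.2f}")
```

In the notebook, the same calls would be applied to `df['Target']`.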

### 3.2 Feature Importance Analysis
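Before training the final models, we fit a temporary Random Forest classifier on the full dataset and use its impurity-based feature importances to see which features (columns) contribute most to the prediction. This step is exploratory only: it helps explain the underlying data patterns and plays no part in model evaluation.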

In [None]:

X_temp = df.drop(['Target', 'Target_Encoded'], axis=1)
y_temp = df['Target_Encoded']

rf_temp = RandomForestClassifier(n_estimators=100, random_state=42)
rf_temp.fit(X_temp, y_temp)

# Create a DataFrame of feature importance
feature_importance = pd.DataFrame({
    'Feature': X_temp.columns,
    'Importance': rf_temp.feature_importances_
}).sort_values(by='Importance', ascending=False)

# Plot the Top 10 Most Important Features
plt.figure(figsize=(12, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance.head(10), palette='magma')
plt.title("Top 10 Most Influential Features on Student Dropout")
plt.show()
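
Impurity-based importances are computed from the training data and can favor features with many distinct values. One possible cross-check, which is a different technique from the one used above, is permutation importance measured on held-out data. The sketch below uses synthetic data from `make_classification` as a stand-in; all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the student data (3 classes, 8 numeric features)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42,
)
X_fit, X_held, y_fit, y_held = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_fit, y_fit)

# Shuffle one feature at a time on held-out data and measure the accuracy drop
perm = permutation_importance(forest, X_held, y_held, n_repeats=5, random_state=42)

ranking = perm.importances_mean.argsort()[::-1]
for idx in ranking[:3]:
    print(f"feature_{idx}: mean accuracy drop = {perm.importances_mean[idx]:.3f}")
```

If both rankings agree on the top features, that gives more confidence in the interpretation.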

## 4. Data Preprocessing
### 4.1 Splitting and Scaling
We define our features (X) and target (y), split the data into training (80%) and testing (20%) sets, and apply Standard Scaling. Scaling matters for SVM and Logistic Regression because both are sensitive to the relative magnitudes of the features; Random Forest is largely unaffected by it.

In [None]:
# 1. Define Features (X) and Target (y)
X = df.drop(['Target', 'Target_Encoded'], axis=1)
y = df['Target_Encoded']

# 2. Train-Test Split (80% Training, 20% Testing)
# 'stratify=y' keeps the class proportions the same in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# 3. Standardization (Scaling)
# Fit on TRAIN, transform on TEST (Prevents data leakage)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
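
Both properties of this cell can be verified directly: stratification should preserve the class shares, and a scaler fitted on the training split should give training features a mean of about 0, with test features only close to 0. The sketch below checks this on synthetic data; the class weights and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced 3-class data standing in for X and y
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, weights=[0.5, 0.3, 0.2], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Class shares should match between the two splits
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print("train:", train_share.round(3), "test:", test_share.round(3))

# The scaler's statistics come from the training split only
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)
print("train means:", X_tr_scaled.mean(axis=0).round(3))  # approximately 0
print("test means: ", X_te_scaled.mean(axis=0).round(3))  # close to 0, not exactly 0
```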



### 4.2 Handling Class Imbalance (SMOTE)
Since the dataset is imbalanced (as seen in Section 3.1), we apply SMOTE (Synthetic Minority Over-sampling Technique) to the training data only, leaving the test set at its original class distribution. SMOTE creates synthetic examples of the minority classes (Dropout/Enrolled) so the models are less biased toward Graduates.

In [None]:
print(f"Original Training Size: {len(X_train)}")
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_scaled, y_train)
print(f"Balanced Training Size (After SMOTE): {len(X_train_balanced)}")
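
The core idea behind SMOTE is interpolation: a synthetic row is placed at a random point on the segment between a minority sample and one of its nearest minority neighbors. The sketch below illustrates that idea in plain NumPy. It is a simplified illustration, not the `imblearn` implementation, and the function name, parameters, and toy rows are hypothetical.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=42):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority rows
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))    # random minority row
        other = rng.choice(neighbours[base])  # one of its k nearest neighbours
        gap = rng.random()                    # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[other] - minority[base])
    return synthetic

# Five made-up minority rows with two features
minority_rows = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
new_rows = smote_like_samples(minority_rows, n_new=4)
print(new_rows.round(3))
```

Because every synthetic row lies between two real minority rows, the new points stay inside the region the minority class already occupies.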

## 5. Model Training and Evaluation
We will now train three different models: Logistic Regression, Support Vector Machine (SVM), and Random Forest. We evaluate each using Accuracy, Classification Reports, and Confusion Matrices.

### 5.1 Logistic Regression

In [None]:
print("\n" + "="*40)
print("TRAINING MODEL 1: LOGISTIC REGRESSION")
print("="*40)

# Initialize and Train
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train_balanced, y_train_balanced)

# Predict
lr_pred = lr_model.predict(X_test_scaled)

# Evaluate
lr_acc = accuracy_score(y_test, lr_pred)
print(f"Accuracy: {lr_acc*100:.2f}%")
print("\nDetailed Classification Report:")
print(classification_report(y_test, lr_pred, target_names=class_names))

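# Confusion Matrix (rows = true class, columns = predicted class)
plt.figure(figsize=(5, 4))
sns.heatmap(confusion_matrix(y_test, lr_pred), annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix - Logistic Regression")
plt.show()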

### 5.2 Support Vector Machine (SVM)

In [None]:
print("\n" + "="*40)
print("TRAINING MODEL 2: SUPPORT VECTOR MACHINE (SVM)")
print("="*40)

# Initialize and Train (probability=True enables predict_proba for final inference, at the cost of slower training)
svm_model = SVC(kernel='rbf', probability=True, random_state=42)
svm_model.fit(X_train_balanced, y_train_balanced)

# Predict
svm_pred = svm_model.predict(X_test_scaled)

# Evaluate
svm_acc = accuracy_score(y_test, svm_pred)
print(f"Accuracy: {svm_acc*100:.2f}%")
print("\nDetailed Classification Report:")
print(classification_report(y_test, svm_pred, target_names=class_names))

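# Confusion Matrix (rows = true class, columns = predicted class)
plt.figure(figsize=(5, 4))
sns.heatmap(confusion_matrix(y_test, svm_pred), annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix - SVM")
plt.show()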

### 5.3 Random Forest Classifier

In [None]:
print("\n" + "="*40)
print("TRAINING MODEL 3: RANDOM FOREST")
print("="*40)

# Initialize and Train
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_balanced, y_train_balanced)

# Predict
rf_pred = rf_model.predict(X_test_scaled)

# Evaluate
rf_acc = accuracy_score(y_test, rf_pred)
print(f"Accuracy: {rf_acc*100:.2f}%")
print("\nDetailed Classification Report (Precision, Recall, F1):")
print(classification_report(y_test, rf_pred, target_names=class_names))

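# Confusion Matrix (rows = true class, columns = predicted class)
plt.figure(figsize=(5, 4))
sns.heatmap(confusion_matrix(y_test, rf_pred), annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix - Random Forest")
plt.show()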
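
The confusion matrices in this section can also be read numerically. In scikit-learn's convention, rows are true classes and columns are predicted classes, so per-class recall is the diagonal divided by the row sums and per-class precision is the diagonal divided by the column sums. The sketch below shows this on made-up labels, assuming the encoding 0 = Dropout, 1 = Enrolled, 2 = Graduate.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = Dropout, 1 = Enrolled, 2 = Graduate (assumed encoding)
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_hat = [0, 0, 1, 1, 0, 2, 2, 2, 0, 2]

cm = confusion_matrix(y_true, y_hat)
print(cm)

# Recall: share of each TRUE class that was predicted correctly (row-wise)
recall_per_class = cm.diagonal() / cm.sum(axis=1)
# Precision: share of each PREDICTED class that was correct (column-wise)
precision_per_class = cm.diagonal() / cm.sum(axis=0)

print("recall:   ", recall_per_class.round(3))
print("precision:", precision_per_class.round(3))
```

For dropout prediction, recall on the Dropout row is often the number to watch, since it counts how many at-risk students the model actually catches.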

## 6. Model Comparison and Conclusion
Finally, we compare all three models side by side using Accuracy and Weighted F1-Score (per-class F1 averaged by class support) to determine the best approach.

In [None]:
# Create a comparison dataframe
results_df = pd.DataFrame({
    'Model': ['Logistic Regression', 'SVM', 'Random Forest'],
    'Accuracy': [lr_acc, svm_acc, rf_acc],
    'F1 Score (Weighted)': [
        f1_score(y_test, lr_pred, average='weighted'),
        f1_score(y_test, svm_pred, average='weighted'),
        f1_score(y_test, rf_pred, average='weighted')
    ]
})

print("\nFinal Performance Comparison:")
print(results_df)

# Plot Comparison
results_df.set_index('Model')[['Accuracy', 'F1 Score (Weighted)']].plot(kind='bar', figsize=(10, 6), color=['skyblue', 'salmon'])
plt.title("Model Performance Comparison")
plt.ylim(0.6, 0.9)  # Zoom in to see differences; the axis is truncated, so widen the range if a score falls outside it
plt.ylabel("Score (0-1)")
plt.xticks(rotation=0)
plt.legend(loc='lower right')
plt.show()
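
The weighted F1 score used in the comparison is the per-class F1 averaged with each class's support (its number of true samples) as the weight. The sketch below reproduces that calculation by hand on made-up labels and compares it with scikit-learn's result. The macro average is shown as a contrast because it treats all classes equally regardless of size.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels for illustration only
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_hat = np.array([0, 0, 1, 1, 0, 2, 2, 2, 0, 2])

per_class_f1 = f1_score(y_true, y_hat, average=None)
support = np.bincount(y_true)  # number of true samples per class

# Weighted F1 = sum(F1_c * support_c) / total samples
manual_weighted = np.sum(per_class_f1 * support) / support.sum()
library_weighted = f1_score(y_true, y_hat, average="weighted")
macro = f1_score(y_true, y_hat, average="macro")

print("per-class F1:", per_class_f1.round(3))
print(f"manual weighted: {manual_weighted:.4f}")
print(f"sklearn weighted: {library_weighted:.4f}")
print(f"macro (unweighted): {macro:.4f}")
```

Since the test set keeps the original imbalance, the weighted score leans toward the majority class. Reporting the macro average alongside it is one way to see how the smaller classes fare.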

