The cell below downloads the dataset to your Kaggle environment and prints the local path where the files are saved. You can then use that path to load the data into a DataFrame.

In [None]:
import kagglehub
 
# Download latest version
path = kagglehub.dataset_download("uciml/student-alcohol-consumption")
 
print("Path to dataset files:", path)
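The returned `path` is a local directory, so the CSV can be loaded by joining it with the file name instead of hard-coding a location. A minimal sketch, assuming the download directory holds `student-mat.csv` at its top level; the helper name `load_dataset_csv` is hypothetical, and a temporary directory stands in for the real download:

```python
import os
import tempfile

import pandas as pd


def load_dataset_csv(dataset_dir, filename="student-mat.csv"):
    """Load one CSV from a downloaded dataset directory (hypothetical helper)."""
    csv_path = os.path.join(dataset_dir, filename)
    if not os.path.isfile(csv_path):
        # List what is actually there to make a wrong layout easy to spot.
        available = sorted(os.listdir(dataset_dir))
        raise FileNotFoundError(f"{filename} not found; directory contains: {available}")
    return pd.read_csv(csv_path)


# Stand-in for the directory returned by kagglehub (assumption: flat layout).
demo_dir = tempfile.mkdtemp()
pd.DataFrame({"G3": [12, 8], "Dalc": [1, 3]}).to_csv(
    os.path.join(demo_dir, "student-mat.csv"), index=False
)

demo_df = load_dataset_csv(demo_dir)
print(demo_df.shape)
```

In the notebook itself, the call would presumably be `load_dataset_csv(path)` with the `path` printed above.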

In [None]:
# Import Libraries and Load Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
 
# Load data
df = pd.read_csv('/kaggle/input/student-alcohol-consumption/student-mat.csv')
df.head()

- Import essential Python libraries for data analysis, visualization, and machine learning.
- Load the Student Alcohol Consumption dataset using pandas.
 
The dataset student-mat.csv contains information about students’ academic performance and alcohol consumption. This will help us explore behavioral patterns and predict outcomes using classification models.

In [None]:
# Create the Target Variable
# Create binary label
df['high_grade'] = (df['G3'] >= 10).astype(int)

To prepare the data for classification, we created a new binary target variable called high_grade:
 
- Students with a final grade (G3) greater than or equal to 10 are labeled as 1 (high performance).
- Students with a grade below 10 are labeled as 0 (low performance).
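To see the labeling rule in isolation, here is a small sketch on made-up grades (the values are illustrative only, not taken from the dataset). It also prints the class proportions, which is a quick way to check how balanced the target is before modeling:

```python
import pandas as pd

# Toy grades for illustration only; the real values come from student-mat.csv.
toy = pd.DataFrame({"G3": [0, 9, 10, 15, 20]})

# Same rule as above: 1 when G3 >= 10, otherwise 0.
toy["high_grade"] = (toy["G3"] >= 10).astype(int)

# Share of each class in the target column.
class_share = toy["high_grade"].value_counts(normalize=True).sort_index()
print(toy)
print(class_share)
```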


- Dropped G1, G2, and G3: G3 defines the target (`high_grade`), and G1 and G2 are earlier period grades closely tied to it, so all three were removed to prevent data leakage.
- Encoded categorical variables using one-hot encoding with drop_first=True to avoid multicollinearity and convert categorical features into numerical format.
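The preprocessing steps described here can be sketched as follows. This is an illustration on a made-up frame, not the notebook's original cell: the toy column values, the 25% test size, the stratified split, and `random_state=42` are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a few columns shaped like the dataset; values are made up.
toy = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "studytime": [2, 1, 3, 2, 4, 1, 2, 3],
    "G1": [10, 7, 14, 9, 16, 6, 11, 12],
    "G2": [11, 8, 13, 9, 17, 5, 10, 13],
    "G3": [11, 8, 14, 9, 17, 5, 10, 13],
})
toy["high_grade"] = (toy["G3"] >= 10).astype(int)

# Remove the grade columns (leakage) and the target itself from the features.
X = toy.drop(columns=["G1", "G2", "G3", "high_grade"])
y = toy["high_grade"]

# One-hot encode categorical columns, dropping the first level of each.
X = pd.get_dummies(X, drop_first=True)

# Assumed split: 25% test, stratified on the target, fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X.columns.tolist())
```

Stratifying on `y` keeps the share of high and low performers similar in both splits, which matters when one class is noticeably smaller.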

In [None]:
# Modeling
# Train a few classifiers
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
    "Decision Tree": DecisionTreeClassifier()
}
 
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"\n{name} Results:")
    print(classification_report(y_test, y_pred))

- Logistic Regression: A linear model suitable for binary classification problems.
- Random Forest Classifier: An ensemble method using multiple decision trees for improved accuracy and robustness.
- Decision Tree Classifier: A simple, interpretable tree-based model.

In [None]:
# Hyperparameter Tuning
# Tune Random Forest
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15]
}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='f1')
grid.fit(X_train, y_train)
print("Best Params:", grid.best_params_)
print("Best F1 Score:", grid.best_score_)

The Random Forest was tuned with `GridSearchCV` over two hyperparameters:

- n_estimators: Number of trees in the forest (tested values: 50, 100, 200)
- max_depth: Maximum depth of each tree (tested values: 5, 10, 15)
 
Key details:
- Cross-validation: 5-fold (`cv=5`) to ensure reliable performance estimates
- Scoring metric: F1 Score, which balances precision and recall
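Note that `best_score_` is a cross-validated estimate computed on the training folds only, so it is worth checking the tuned model on the held-out test set as well. A minimal sketch on synthetic data, with a deliberately tiny grid so it runs quickly (the data, grid values, and seeds are assumptions, not the notebook's settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the notebook would use its own X_train / X_test.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Deliberately small grid so the sketch finishes fast.
small_grid = {"n_estimators": [10, 20], "max_depth": [3, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), small_grid, cv=5, scoring="f1"
)
search.fit(X_tr, y_tr)

# GridSearchCV refits the best configuration on all of X_tr by default,
# so best_estimator_ can be scored directly on the held-out split.
test_f1 = f1_score(y_te, search.best_estimator_.predict(X_te))
print("CV F1:", round(search.best_score_, 3), "Test F1:", round(test_f1, 3))
```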

Final Comparison & Reflection
 
Model Performance Comparison
 
After training and evaluating three models (Logistic Regression, Decision Tree, and Random Forest), we compared their results based on key classification metrics (Precision, Recall, F1-Score, and Accuracy):
 
- Logistic Regression offered solid baseline performance with high interpretability, especially useful when transparency is important.
- Decision Tree provided clear decision rules but tended to overfit on training data, showing slightly lower generalization performance.
- Random Forest, especially after hyperparameter tuning, achieved the highest F1-score, indicating the best balance between precision and recall.
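One way to check for the kind of overfitting mentioned for the Decision Tree is to compare its score on the training data with its score on the test data; a large gap suggests the tree has memorized the training set. A hedged sketch on noisy synthetic data (the data and seeds are assumptions and say nothing about the notebook's actual results):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise to make memorization visible.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, flip_y=0.2, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=1
)

# An unrestricted tree keeps splitting until every training point is fit.
tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)

train_f1 = f1_score(y_tr, tree.predict(X_tr))
test_f1 = f1_score(y_te, tree.predict(X_te))
print("Train F1:", round(train_f1, 3), "Test F1:", round(test_f1, 3))
```

Limiting `max_depth` or raising `min_samples_leaf` are common ways to narrow such a gap.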
 
Recommended Model
 
Random Forest Classifier is recommended for deployment in a school analytics tool due to the following reasons:
 
- Strong performance across all evaluation metrics, especially after tuning.
- Robustness to noise and overfitting thanks to ensemble learning.
- Feature importance insights that can help educators understand key predictors of student success.
- Scalability to work on larger or more complex datasets if extended in the future.
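The feature importance insights mentioned in the list can be read from a fitted forest's `feature_importances_` attribute. A minimal sketch on synthetic data with placeholder feature names (both are assumptions; in the notebook the names would come from the columns of the encoded feature matrix):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names; real names would be the encoded dataset columns.
feature_names = [f"feature_{i}" for i in range(6)]
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances sum to 1; sort so the strongest come first.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best treated as a rough ranking and not as causal evidence.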
 
Reflection
 
Building this classification pipeline offered valuable hands-on experience with:
 
- Real-world education-related data
- End-to-end machine learning workflow
- Trade-offs between interpretability and performance
 
By using such models, schools can better identify students at academic risk early and target interventions more effectively.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
import pandas as pd
 
# Assuming models dict and predictions from before, recompute metrics on test set
model_metrics = {}
 
for name, model in models.items():
    y_pred = model.predict(X_test)
    model_metrics[name] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1 Score": f1_score(y_test, y_pred)
    }
 
# Create DataFrame for easy viewing
metrics_df = pd.DataFrame(model_metrics).T
metrics_df = metrics_df.round(3)
 
print("Evaluation Metrics Comparison:")
print(metrics_df)
 
# Plot F1 scores for visual comparison
plt.figure(figsize=(8,5))
metrics_df["F1 Score"].plot(kind='bar', color=['skyblue', 'lightgreen', 'salmon'])
plt.title('Model Comparison: F1 Scores')
plt.ylabel('F1 Score')
plt.ylim(0, 1)
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

 
Final Reflection
 
This project demonstrates the importance of:
 
- Careful preprocessing and feature engineering
- Model selection and hyperparameter tuning
- Balancing interpretability with predictive performance
 
Deploying such models can help educators identify students who may benefit from additional support early, improving academic outcomes.