In [1]:
import pandas as pd
import os
import sys
# Add project root to Python path so we can import modules from src/
# Notebook is in 'notebooks/' folder, src/ is in parent directory
sys.path.append("../")
from src.models import train_linear_regression, train_decision_tree_classifier, train_logistic_regression


In [2]:
# Load preprocessed data
df = pd.read_csv("../data/processed.csv")

### Why I Chose the Following Three Models

**1. Linear Regression (Predicting Final Grades `G3`)**
- Linear Regression is a simple, interpretable regression model, well suited to predicting continuous outcomes such as final grades.
- It quantifies the relationship between features and the target using coefficients.

**2. Decision Tree Classifier (Predicting Pass/Fail)**
- Decision Trees can capture nonlinear relationships between features and the binary target (pass/fail).
- They are easy to visualize and interpret, and they show which features split the data most effectively for predicting success.
- They handle mixed feature types (numeric + categorical) without extensive preprocessing such as feature scaling.

**3. Logistic Regression (Predicting Pass/Fail)**
- Logistic Regression is a standard model for binary classification, predicting the probability of passing or failing.
- Coefficients provide interpretable insights into how each feature affects the likelihood of passing.
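
To make the coefficient interpretation concrete: in logistic regression, a coefficient β on a feature multiplies the odds of passing by e^β for each one-unit increase in that feature. The sketch below illustrates this relationship only. The coefficient value `0.4` is hypothetical and is not taken from the fitted model.

```python
import math


def odds_ratio(coefficient: float) -> float:
    """Multiplicative change in the odds for a one-unit feature increase."""
    return math.exp(coefficient)


def probability_from_log_odds(log_odds: float) -> float:
    """Logistic (sigmoid) function: map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficient, for illustration only (not from the fitted model).
beta = 0.4
print(f"odds ratio for one extra unit: {odds_ratio(beta):.3f}")

# A student starting at log-odds 0 (probability 0.5) who gains one unit:
print(f"probability after the increase: {probability_from_log_odds(0.0 + beta):.3f}")
```

A positive coefficient gives an odds ratio above 1 (passing becomes more likely), and a negative one gives an odds ratio below 1.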

In [3]:
# 1. Feature Selection

# Regression: Predict final grade (G3)
regression_features = [
    "age","Medu","Fedu","traveltime","studytime",
    "failures","famrel","freetime","goout","Dalc",
    "Walc","health","absences","support_score","total_alcohol"
]
X_reg = df[regression_features]
y_reg = df["G3"]

# Classification: Predict pass/fail
classification_features = regression_features + ["sex"]
X_clf = df[classification_features]
y_clf = df["pass_fail"]

In [4]:
# 2. Train Linear Regression
linear_model, linear_metrics = train_linear_regression(X_reg, y_reg, save_path="../reports/results")
print("Linear Regression Metrics:")
for k,v in linear_metrics.items():
    print(f"{k}: {v:.3f}")

Linear Regression Metrics:
R2: 0.127
MSE: 17.895


In [5]:
# 3. Train Decision Tree Classifier
dt_model, dt_metrics = train_decision_tree_classifier(X_clf, y_clf, save_path="../reports/results")
print("\nDecision Tree Classifier Metrics:")
for k,v in dt_metrics.items():
    if k != "Confusion_Matrix":
        print(f"{k}: {v:.3f}")
    else:
        print(f"{k}:\n{v}")


Decision Tree Classifier Metrics:
Accuracy: 0.684
Precision: 0.729
Recall: 0.827
Confusion_Matrix:
[[11, 16], [9, 43]]


In [6]:
# 4. Train Logistic Regression Classifier
logreg_model, logreg_metrics = train_logistic_regression(X_clf, y_clf, save_path="../reports/results")
print("\nLogistic Regression Metrics:")
for k,v in logreg_metrics.items():
    if k != "Confusion_Matrix":
        print(f"{k}: {v:.3f}")
    else:
        print(f"{k}:\n{v}")


Logistic Regression Metrics:
Accuracy: 0.747
Precision: 0.742
Recall: 0.942
Confusion_Matrix:
[[10, 17], [3, 49]]


## Model Comparison and Performance Discussion

### 1. Linear Regression (Predicting Final Grade G3)

- **R<sup>2</sup> = 0.127** - The features explain only about 12.7% of the variance in final grades.
- **MSE = 17.895** - This corresponds to a root-mean-square error of about 4.23 grade points, which is a relatively high error for predicting the final grade.
- Conclusion: Linear Regression does not predict final grades well on this dataset. The low R<sup>2</sup> suggests that either the chosen features are not strongly (or not linearly) related to G3, or that grades are influenced by factors that are not in the dataset.
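
The two reported numbers can be put on the same scale as the grades themselves. The sketch below converts the MSE to an RMSE and compares it with the error of always predicting the mean grade. It assumes that R<sup>2</sup> and MSE were computed on the same evaluation set, in which case R<sup>2</sup> = 1 - MSE / Var(y). The internals of `train_linear_regression` are not shown here, so treat this as an approximation under that assumption.

```python
import math

mse = 17.895  # MSE reported for Linear Regression
r2 = 0.127    # R^2 reported for Linear Regression

# RMSE is expressed in grade points, the same unit as G3.
rmse = math.sqrt(mse)

# Assumption: both metrics come from the same evaluation set, so
# R^2 = 1 - MSE / Var(y)  =>  Var(y) = MSE / (1 - R^2).
implied_variance = mse / (1.0 - r2)

# RMSE of a trivial model that always predicts the mean grade.
baseline_rmse = math.sqrt(implied_variance)

print(f"model RMSE:    {rmse:.2f}")
print(f"baseline RMSE: {baseline_rmse:.2f}")
```

Under this assumption the model's typical error is only slightly smaller than that of the mean-only baseline, which is another way of reading the low R<sup>2</sup>.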

### 2. Decision Tree Classifier (Pass/Fail Prediction)

- Accuracy = 0.684, Precision = 0.729, Recall = 0.827
- The Decision Tree performs reasonably well but produces 16 false positives (failing students classified as passing) and 9 false negatives (passing students classified as failing).
- Confusion Matrix: [[11, 16], [9, 43]], so 25 of the 79 test students are misclassified.
- Conclusion: The model captures some of the pass/fail pattern, but it trails Logistic Regression on accuracy, precision, and recall.
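
The reported metrics can be re-derived from the confusion matrix as a sanity check. The sketch below assumes the scikit-learn layout `[[TN, FP], [FN, TP]]` with "pass" as the positive class. This layout is consistent with the precision and recall printed above. The helper name `metrics_from_confusion` is introduced here for illustration and is not part of `src/models`.

```python
def metrics_from_confusion(cm):
    """Compute accuracy, precision and recall from a 2x2 confusion matrix.

    Assumes the layout [[TN, FP], [FN, TP]] with the positive class last.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "Accuracy": (tp + tn) / total,
        "Precision": tp / (tp + fp),
        "Recall": tp / (tp + fn),
    }


# Confusion matrix printed for the Decision Tree Classifier.
dt_cm = [[11, 16], [9, 43]]

for name, value in metrics_from_confusion(dt_cm).items():
    print(f"{name}: {value:.3f}")
```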

### 3. Logistic Regression (Pass/Fail Prediction)

- Accuracy = 0.747, Precision = 0.742, Recall = 0.942
- Confusion Matrix: [[10, 17], [3, 49]]
- Conclusion: Logistic Regression performs best overall for pass/fail classification. Its high recall (0.942) means it correctly identifies 49 of the 52 students who pass. Its precision is lower (0.742) because 17 of the 27 failing students are predicted to pass, so the model is much better at recognizing passing students than failing ones. It still beats the Decision Tree on all three metrics.
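
Because passing students outnumber failing ones in the evaluation split (52 vs. 27 according to the confusion matrix), accuracy and recall alone can flatter a classifier. The sketch below adds two reference points: the accuracy of always predicting the majority class, and the specificity (the share of failing students correctly identified). It assumes the layout `[[TN, FP], [FN, TP]]` with "pass" as the positive class.

```python
# Confusion matrix printed for Logistic Regression: [[TN, FP], [FN, TP]].
logreg_cm = [[10, 17], [3, 49]]
(tn, fp), (fn, tp) = logreg_cm

n_fail = tn + fp  # students who actually failed
n_pass = fn + tp  # students who actually passed
total = n_fail + n_pass

# Accuracy of a trivial classifier that always predicts the larger class.
majority_baseline = max(n_fail, n_pass) / total

# Specificity: fraction of failing students that the model flags as failing.
specificity = tn / n_fail

accuracy = (tp + tn) / total

print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"model accuracy:                   {accuracy:.3f}")
print(f"specificity (fail detection):     {specificity:.3f}")
```

If the goal is to find students at risk of failing, specificity is the more relevant number, and it is considerably weaker than the recall on passing students.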

### Which Model Performs Better?

- For regression: Linear Regression is weak (low R<sup>2</sup>), so it does not predict final grades accurately.
- For classification: Logistic Regression outperforms the Decision Tree at predicting pass/fail outcomes, with higher accuracy (0.747 vs. 0.684) and higher recall (0.942 vs. 0.827).

### Factors Influencing Student Success

#### **From the models:**

**Key predictors for final grades (G3) include:**

- **Study time** - More study time tends to correlate with higher grades.

- **Failures** - Past failures negatively affect final grades.

- **Parental education** - Higher parental education levels are associated with better student performance.

- **Alcohol consumption** - Higher alcohol use may negatively impact grades.

**Classification pass/fail prediction shows that:**

- **Sex and study-related features** are important for predicting success.

- Logistic Regression shows that behavioral and academic features together predict reasonably well which students pass.
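
One way to back up statements like these is to rank the fitted coefficients by magnitude. The helper below is a sketch: `rank_by_magnitude` is a name introduced here, and the demo names and values are placeholders rather than results from this notebook. It also assumes that the returned models expose their coefficients as scikit-learn estimators do through `coef_`, which is not confirmed by the code shown above.

```python
def rank_by_magnitude(feature_names, coefficients):
    """Pair each feature with its coefficient, sorted by absolute value (largest first)."""
    pairs = list(zip(feature_names, coefficients))
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)


# Placeholder inputs for illustration only (not values from the fitted models).
demo_names = ["feature_a", "feature_b", "feature_c"]
demo_coefficients = [0.2, -0.9, 0.5]

for name, coef in rank_by_magnitude(demo_names, demo_coefficients):
    print(f"{name}: {coef:+.2f}")

# Assumed usage with the notebook's objects, if they are scikit-learn models:
# rank_by_magnitude(classification_features, logreg_model.coef_[0])
# rank_by_magnitude(regression_features, linear_model.coef_)
```

Coefficient magnitudes are only comparable across features when the features are on similar scales, so standardizing the inputs first makes such a ranking more meaningful.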