In [9]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)


# Week 5: Supervised Learning – Regression
**Assignment 5:** Train/test split and apply Linear Regression on the student performance dataset.  
**Dataset:** cleaned_students.csv


---





In [10]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Optional: nicer plots
sns.set(style="whitegrid", palette="pastel", font_scale=1.1)
plt.rcParams['figure.figsize'] = (10,6)

# Load dataset
df = pd.read_csv("cleaned_students.csv")

# Use top features identified in Week 4
top_features = ['Total_Score', 'Participation_Score', 'Assignments_Avg']  # example, replace with your top 3
X = df[top_features]
y = df["Final_Score"]


In [11]:
# Split dataset: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)


Training set size: (4000, 3)
Test set size: (1000, 3)


In [14]:
# Initialize and train Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)
print("Sample Predictions:")
print(y_pred[:10])



Sample Predictions:
[65.37884567 70.42374197 79.79490002 74.791321   56.01789263 74.46231139
 64.95262026 84.20448279 86.99778981 77.30851688]


In [13]:
# Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)

# Root Mean Squared Error (RMSE)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")


Mean Absolute Error (MAE): 11.08
Root Mean Squared Error (RMSE): 13.57


## Week 5: Regression Results Summary

In this week’s task, a **Linear Regression model** was implemented to predict students' **Final Scores** using the most correlated features identified in Week 4.

### Dataset: cleaned_students.csv

### Steps Applied:

1. Selected top predictive features: Total_Score, Participation_Score, Assignments_Avg (example).  
2. Split the dataset into **training (80%)** and **testing (20%)** sets.  
3. Trained a **Linear Regression model** using Scikit-learn.  
4. Evaluated performance using **Mean Absolute Error (MAE)** and **Root Mean Squared Error (RMSE)**.
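
Step 1 depends on the Week 4 correlation analysis, which is not reproduced in this notebook. The following is a minimal sketch of how such a ranking could be computed, assuming the features are numeric. The helper `top_correlated_features` and the `feat_*` column names are hypothetical, and the data is synthetic rather than taken from cleaned_students.csv.

```python
import numpy as np
import pandas as pd


def top_correlated_features(frame, target, k=3):
    """Return the k numeric columns with the largest absolute
    Pearson correlation to the target column."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()


# Synthetic stand-in data; the feat_* names are hypothetical.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "feat_strong": rng.normal(size=n),
    "feat_weak": rng.normal(size=n),
    "feat_noise": rng.normal(size=n),
})
demo["Final_Score"] = (
    70
    + 8 * demo["feat_strong"]
    + 2 * demo["feat_weak"]
    + rng.normal(scale=3, size=n)
)

selected = top_correlated_features(demo, "Final_Score", k=2)
print(selected)
```

A column that is itself computed from the target would rank first in this kind of list, so it is worth checking how each candidate feature is derived before using it as a predictor.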

### Results:

- Mean Absolute Error (MAE): 11.08
- Root Mean Squared Error (RMSE): 13.57
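
For reference, MAE is the mean of the absolute errors and RMSE is the square root of the mean squared error. The sketch below computes both by hand with NumPy on a few made-up values (not taken from the dataset) to show what the scikit-learn calls above are measuring. The helper name `mae_rmse` is illustrative only.

```python
import numpy as np


def mae_rmse(y_true, y_pred):
    """Compute MAE and RMSE directly from their definitions."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(errors))          # average size of the errors
    rmse = np.sqrt(np.mean(errors ** 2))   # penalizes large errors more heavily
    return mae, rmse


# Made-up values for illustration only.
actual = [70.0, 80.0, 90.0, 60.0]
predicted = [72.0, 75.0, 90.0, 70.0]

mae_demo, rmse_demo = mae_rmse(actual, predicted)
print(f"MAE:  {mae_demo:.2f}")
print(f"RMSE: {rmse_demo:.2f}")
```

Because squaring weights large errors more heavily, RMSE is never smaller than MAE, and the gap between the two grows as the error sizes become more uneven.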

### Interpretation:
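Lower MAE and RMSE values indicate more accurate predictions. An MAE of 11.08 means the predicted Final_Score differs from the actual value by about 11 points on average, and the larger RMSE of 13.57 indicates that the error sizes vary, with some predictions off by noticeably more than the average. Whether this is acceptable depends on the score scale and on how a trivial predictor performs on the same test set.  
These metrics serve as the **baseline performance** for future model improvements (e.g., feature scaling, advanced models).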
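
One way to put a baseline error into context is to compare it with a predictor that ignores the features entirely. The sketch below does this with scikit-learn's `DummyRegressor` on synthetic data. The arrays `X_demo` and `y_demo` are assumptions made for illustration, so the printed numbers say nothing about cleaned_students.csv.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: three numeric features).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3))
y_demo = 70 + X_demo @ np.array([5.0, 3.0, 1.0]) + rng.normal(scale=4.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Trivial reference: always predict the mean of the training target.
dummy = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
linear = LinearRegression().fit(X_tr, y_tr)

dummy_mae = mean_absolute_error(y_te, dummy.predict(X_te))
linear_mae = mean_absolute_error(y_te, linear.predict(X_te))

print(f"Mean-only MAE: {dummy_mae:.2f}")
print(f"Linear MAE:    {linear_mae:.2f}")
```

Running the same comparison on the notebook's `X_train`, `X_test`, `y_train`, and `y_test` would show how much of the 11.08 MAE is an improvement over simply predicting the average score.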

### Project Milestone:
Baseline regression model successfully built and evaluated.

### Conclusion:

- This baseline regression model predicts Final_Score from the three selected features, with a test-set MAE of 11.08 and RMSE of 13.57.  
- Next steps: Try feature engineering, scaling, or other regression algorithms for improvement.
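
As a rough sketch of one such next step, the example below combines scaling with a regularized model in a scikit-learn pipeline and scores it with 5-fold cross-validation. Scaling alone does not change the predictions of an unregularized linear regression, which is why it is paired with `Ridge` here. The synthetic data, the choice of `Ridge`, and `alpha=1.0` are all assumptions for illustration, not results from this assignment.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on deliberately different scales (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=[10.0, 1.0, 5.0], size=(400, 3))
y_demo = (
    0.5 * X_demo[:, 0]
    + 4.0 * X_demo[:, 1]
    + 1.0 * X_demo[:, 2]
    + rng.normal(scale=4.0, size=400)
)

# The scaler is fit inside each fold, so no test-fold statistics leak into training.
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# scikit-learn reports error metrics as negative scores; flip the sign to get MAE.
scores = cross_val_score(
    pipeline, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
)
cv_mae = -scores

print("MAE per fold:", np.round(cv_mae, 2))
print(f"Mean CV MAE: {cv_mae.mean():.2f}")
```

Cross-validated scores vary less with the particular split than a single 80/20 evaluation, which makes them a steadier basis for comparing candidate models against the baseline.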
