# 🎓 Project: Predicting Student Exam Scores using Multiple Linear Regression

## 🧠 Goal:
Use a multiple linear regression model to predict students' final grades (`G3`) from study-related features such as:
- Study time
- Past grades (G1, G2)
- Number of failures
- Absences

---

## 📦 Dataset: Student Performance Dataset

- Source: UCI ML Repository  
  → https://archive.ics.uci.edu/ml/datasets/Student+Performance  
- File to use: `student-mat.csv` (Math scores)  
- Target variable: `G3` (final grade)  
- Features to consider: `G1`, `G2`, `studytime`, `failures`, `absences`

---

# 🔹 Step 1: Load the Dataset
- Load the CSV file into a dataframe.
- Preview first few rows.
- Check shape, column names, and data types.

---

# 🔹 Step 2: Understand the Features
- Print summary statistics.
- Identify numeric features.
- Decide which columns to use as independent variables (X).
- Define the dependent variable (`G3`) as your target (y).

---

# 🔹 Step 3: Clean the Data
- Drop irrelevant columns if any (like school name, address).
- Focus on numeric columns only.
- Check for missing values or outliers.

---

# 🔹 Step 4: Exploratory Data Analysis (EDA)
- Plot distributions of numeric features (histograms, boxplots).
- Create correlation heatmap.
- Scatterplots of each feature vs `G3` to observe trends.

---

# 🔹 Step 5: Prepare the Data
- Create feature matrix `X` and target vector `y`.
- Split the dataset into training and test sets (e.g., 80/20).
- (Optional) Scale features if needed.

---

# 🔹 Step 6: Train the Model
- Use Linear Regression from a library like scikit-learn.
- Fit the model on training data.

---

# 🔹 Step 7: Evaluate the Model
- Predict on test data.
- Evaluate using:
  - Mean Absolute Error (MAE)
  - Mean Squared Error (MSE)
  - R² Score
- Comment on model accuracy.

---

# 🔹 Step 8: Visualize Results
- Plot Actual vs Predicted values.
- Plot residuals (errors).
- Highlight any obvious overfitting or underfitting patterns.

---

# 🔹 Step 9: Conclusion
- Summarize how well the model performed.
- Which features were most influential?
- Any next steps or improvements you would make?

---

# 🚀 Bonus Ideas (Optional)
- Add categorical features by encoding them (like `school`, `sex`, `schoolsup`).
- Try using only early grades (`G1`, `G2`) vs. using all features.
- Use the second dataset (`student-por.csv`) and compare results.
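
The first bonus idea can be sketched with one-hot encoding. This is a minimal example, assuming `pd.get_dummies` is an acceptable encoder here; the `toy` frame holds made-up rows that only mimic the dataset's `school`, `sex`, and `schoolsup` columns.

```python
import pandas as pd

# Made-up rows that mimic three categorical columns of student-mat.csv
toy = pd.DataFrame({
    "school": ["GP", "MS", "GP", "MS"],
    "sex": ["F", "M", "M", "F"],
    "schoolsup": ["yes", "no", "no", "yes"],
    "G2": [10, 12, 8, 15],
})

# drop_first=True keeps one indicator column per binary feature,
# which avoids perfectly collinear columns in a linear model
encoded = pd.get_dummies(
    toy,
    columns=["school", "sex", "schoolsup"],
    drop_first=True,
    dtype=int,
)
print(encoded)
```

The encoded frame can then be used as `X` in the same way as the numeric-only frame.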


-------

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### Step 1: Load the Dataset

In [None]:
df = pd.read_csv('data/student-mat.csv', sep = ';')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

----
#### Step 2: Understand the Features

In [None]:
df.describe()

- For Linear Regression, we only want to consider numeric attributes as independent variables (X)
- For y, we will consider G3

----
#### Step 3: Clean the Data

In [None]:
df.head()

In [None]:
# drop irrelevant columns by building a second dataframe that keeps only the numeric columns we need
num_df = df[['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel', 'freetime', 'goout', 'Dalc', 'health', 'absences', 'G1', 'G2', 'G3']]


In [None]:
num_df.head()

In [None]:
num_df.isnull().sum() # missing values per column; all zeros means nothing is missing
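
Step 3 of the plan also mentions outliers, which the cell above does not check. One common rule of thumb flags values more than 1.5 × IQR beyond the quartiles. The helper below is a sketch under that assumption; `iqr_outlier_counts` is a hypothetical name and the `toy` frame contains made-up values, not rows from the dataset.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count, per numeric column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Toy stand-in for num_df (made-up values)
toy = pd.DataFrame({
    "absences": [0, 2, 4, 6, 4, 2, 75],
    "G3": [10, 11, 12, 13, 12, 11, 10],
})
outlier_counts = iqr_outlier_counts(toy)
print(outlier_counts)
```

In this notebook the call would be `iqr_outlier_counts(num_df)`. A flagged value is not necessarily an error, so inspect it before dropping anything.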

In [None]:
# quick insight: age 20 has the highest mean G3 (check the group size before reading much into it)
num_df.groupby('age').mean(numeric_only = True)

---
#### Step 4: Exploratory Data Analysis (EDA)

In [None]:
sns.displot(data = num_df, x = 'age', kde = True)
plt.title('Age Distribution')

In [None]:
sns.scatterplot(data = num_df, x = 'G1', y = 'G3')
plt.title('G1 VS G3')

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(13, 6)) # 2 rows, 2 columns

sns.boxplot(data = num_df, x = 'health', y = 'G3', ax = axes[0][0])
sns.boxplot(data = num_df, x = 'freetime', y = 'G3', ax = axes[0][1])
sns.boxplot(data = num_df, x = 'Dalc', y = 'G3', ax = axes[1][0])
sns.boxplot(data = num_df, x = 'goout', y = 'G3', ax = axes[1][1])

fig.subplots_adjust(wspace = 0.3, hspace = 0.3)
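
The plan for Step 4 also calls for a correlation view, which the cells above do not include. As a sketch, the hypothetical helper `correlation_with_target` ranks features by the strength of their Pearson correlation with the target; the `toy` frame holds made-up values.

```python
import pandas as pd

def correlation_with_target(frame: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric column with `target`, strongest first."""
    corr = frame.select_dtypes(include="number").corr()
    return corr[target].drop(target).sort_values(
        key=lambda s: s.abs(), ascending=False
    )

# Toy stand-in for num_df (made-up values)
toy = pd.DataFrame({
    "G2": [5, 8, 10, 12, 15],
    "absences": [10, 2, 6, 4, 0],
    "G3": [6, 8, 10, 13, 15],
})
ranked = correlation_with_target(toy, "G3")
print(ranked)
```

In this notebook the call would be `correlation_with_target(num_df, 'G3')`. Passing `num_df.corr()` to `sns.heatmap(..., annot=True)` draws the full matrix as a heatmap.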

-----
#### Step 5: Prepare the Data

In [None]:
# separate the features from the label
X = num_df.drop('G3', axis=1)
y = num_df['G3']

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# creating a train test split (80 - 20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [None]:
X_train.head()

In [None]:
y_train.head() # the index should match X_train.head() above
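
The optional scaling step from the plan is not carried out in this notebook. For plain least-squares regression with an intercept, standardizing the features does not change the predictions, but it puts the coefficients on a comparable scale. The sketch below assumes `StandardScaler` is the scaler of choice and uses randomly generated stand-in data (`X_demo`) rather than the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in for the feature matrix (not the real data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[2.0, 10.0], scale=[1.0, 5.0], size=(50, 2))

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from the training rows only
X_te_scaled = scaler.transform(X_te)      # reuse the training statistics on the test rows
```

Fitting the scaler on the training split alone keeps information from the test rows out of the preprocessing.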

----
#### Step 6: Train the Model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model = LinearRegression()
model

In [None]:
# fit the model on the training set
model.fit(X_train, y_train)
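
Once a model is fitted, its coefficients give a first hint about which features matter. The sketch below uses synthetic data with a made-up relationship (the names `demo_X`, `demo_y`, and `demo_model` are illustrative), so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data with a made-up relationship between features and target
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "G2": rng.integers(0, 21, size=100),
    "absences": rng.integers(0, 30, size=100),
})
demo_y = 0.9 * demo_X["G2"] - 0.05 * demo_X["absences"] + rng.normal(0, 0.5, size=100)

demo_model = LinearRegression().fit(demo_X, demo_y)

# Pair each coefficient with its column name, largest magnitude first
coefs = pd.Series(demo_model.coef_, index=demo_X.columns).sort_values(
    key=lambda s: s.abs(), ascending=False
)
print(coefs)
print("intercept:", demo_model.intercept_)
```

For the model in this notebook the equivalent is `pd.Series(model.coef_, index=X_train.columns)`. Raw coefficients depend on each feature's units, so compare magnitudes with care.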

----
#### Step 7: Evaluate the Model

In [None]:
# predict on unseen data (testing our model)
predictions = model.predict(X_test)
predictions

In [None]:
y_test # actual G3 values, in the same order as the predictions array above

- Let's check how accurate our model is on average!

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [None]:
# mean absolute error
mean_absolute_error(y_test, predictions)
# on average, predictions are off by about 1.3 grade points (G3 ranges from 0 to 20), which is not bad at all

In [None]:
# root mean squared error
np.sqrt(mean_squared_error(y_test, predictions))
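
Step 7 of the plan also lists the R² score, which the cells above do not compute. R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below checks that definition against scikit-learn's `r2_score` on made-up numbers.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted grades, for illustration only
y_true = np.array([10, 12, 8, 15, 11])
y_pred = np.array([11, 11, 9, 14, 12])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

For this notebook the call would be `r2_score(y_test, predictions)`. A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean grade.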

---
#### Step 8: Visualize Results

In [None]:
# actual grades vs. predicted grades
sns.scatterplot(x = y_test, y = predictions)
plt.title('Actual vs. Predicted Grades')
plt.xlabel('Actual Grades')
plt.ylabel('Predicted Grades')

In [None]:
# plotting the residuals (error)
test_residuals = y_test - predictions
sns.scatterplot(x = y_test, y = test_residuals)
plt.axhline(y = 0, ls = '--', color = 'red')

In [None]:
sns.displot(test_residuals, bins = 25, kde = True)

## 📌 Basic Conclusion from Residual Plot

- The residuals are mostly centered around **0**, which indicates that the model is not biased toward over- or under-predicting.
- Most prediction errors fall within a small range (e.g., ±2), suggesting that the model is **reasonably accurate** for the majority of students.
- The distribution is roughly bell-shaped, showing a **normal-like error pattern**, which is expected in a well-performing linear regression model.
- There are **no extreme outliers or skew**, meaning the model does not make large, frequent mistakes.

✅ **Conclusion**: The model performs well overall, with small, balanced errors. For most students it predicts the final grade to within about 2 points.
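
The statements above are read off the plots. They can also be checked numerically. The hypothetical helper `residual_summary` below reports the mean residual, the share of predictions within a tolerance, and the largest absolute error; the sample arrays are made up.

```python
import numpy as np

def residual_summary(y_true, y_pred, tolerance=2.0):
    """Summarize residuals (actual minus predicted) with three simple numbers."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    abs_residuals = np.abs(residuals)
    return {
        "mean_residual": float(residuals.mean()),
        "share_within_tolerance": float(np.mean(abs_residuals <= tolerance)),
        "max_abs_residual": float(abs_residuals.max()),
    }

# Made-up actual and predicted grades, for illustration only
summary = residual_summary([10, 12, 0, 15, 11], [11, 11, 6, 14, 12])
print(summary)
```

In this notebook the call would be `residual_summary(y_test, predictions)`. A mean residual near 0 supports the "not biased" claim, and the share within ±2 quantifies the "~2-point margin".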
