## Load the Preprocessed Dataset
- Verify that the dataset is properly cleaned and ready for modeling.

In [14]:
import pandas as pd

# Load the preprocessed dataset
df = pd.read_csv('student_performance_clean.csv')

# Display the first few rows to verify the data
print("Dataset preview:")
df.head()

Dataset preview:


| | Hours Studied | Previous Scores | Extracurricular Activities | Sleep Hours | Sample Question Papers Practiced | Performance Index |
|---|---|---|---|---|---|---|
| 0 | 7.0 | 99 | Yes | 9.0 | 1.0 | 91.0 |
| 1 | 4.0 | 82 | No | 4.0 | 2.0 | 65.0 |
| 2 | 8.0 | 51 | Yes | 7.0 | 2.0 | 45.0 |
| 3 | 5.0 | 52 | Yes | 5.0 | 2.0 | 36.0 |
| 4 | 7.0 | 75 | No | 8.0 | 5.0 | 66.0 |
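
A preview of five rows cannot confirm that the whole dataset is clean. The sketch below shows one possible programmatic check for missing values, duplicate rows, and column types. The helper name `summarize_quality` is hypothetical, and `sample` is a stand-in built from the preview rows; in the notebook you would pass `df` instead.

```python
import pandas as pd

def summarize_quality(frame: pd.DataFrame) -> dict:
    """Return simple cleanliness indicators for a DataFrame (hypothetical helper)."""
    return {
        "missing_values": int(frame.isna().sum().sum()),   # total NaN cells
        "duplicate_rows": int(frame.duplicated().sum()),   # fully repeated rows
        "dtypes": frame.dtypes.astype(str).to_dict(),      # column name -> dtype
    }

# Stand-in for df: the five rows shown in the preview
sample = pd.DataFrame({
    "Hours Studied": [7.0, 4.0, 8.0, 5.0, 7.0],
    "Previous Scores": [99, 82, 51, 52, 75],
    "Extracurricular Activities": ["Yes", "No", "Yes", "Yes", "No"],
    "Sleep Hours": [9.0, 4.0, 7.0, 5.0, 8.0],
    "Sample Question Papers Practiced": [1.0, 2.0, 2.0, 2.0, 5.0],
    "Performance Index": [91.0, 65.0, 45.0, 36.0, 66.0],
})

report = summarize_quality(sample)
print(report)
```

A nonzero `missing_values` or `duplicate_rows` count would be a reason to revisit preprocessing before modeling.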


## Feature Selection and Data Preparation
Linear regression requires numeric inputs, so any categorical feature must be encoded before modeling.
### Converting the Yes/No Column
Since the `Extracurricular Activities` column contains "Yes" or "No" values, convert it to numerical format. One common approach is mapping "Yes" to 1 and "No" to 0. The mapping is case-sensitive: any value that does not match a key exactly becomes `NaN`.

In [15]:
# Map 'Yes' to 1 and 'No' to 0 in the 'Extracurricular Activities' column
df['Extracurricular Activities'] = df['Extracurricular Activities'].map({'Yes': 1, 'No': 0})

# Verify the conversion
print("Unique values in 'Extracurricular Activities':", df['Extracurricular Activities'].unique())

Unique values in 'Extracurricular Activities': [1 0]
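
Because `Series.map` silently turns unmatched labels into `NaN`, a defensive variant can normalize the labels first and fail loudly on anything unexpected. This is a minimal sketch, not the notebook's method; `encode_yes_no` is a hypothetical helper and the `labels` series is made up for illustration.

```python
import pandas as pd

def encode_yes_no(series: pd.Series) -> pd.Series:
    """Map Yes/No labels to 1/0, ignoring case and surrounding whitespace."""
    normalized = series.astype(str).str.strip().str.lower()
    encoded = normalized.map({"yes": 1, "no": 0})
    if encoded.isna().any():
        # Report the original labels that could not be mapped
        bad = sorted(series[encoded.isna()].astype(str).unique())
        raise ValueError(f"Unexpected labels: {bad}")
    return encoded.astype(int)

# Illustrative labels with inconsistent casing and spacing
labels = pd.Series(["Yes", "No", "yes ", "NO"])
encoded = encode_yes_no(labels)
print(encoded.tolist())
```

Raising an error is a design choice: it surfaces dirty labels immediately instead of letting `NaN` values reach `model.fit`, which would fail later with a less specific message.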


### Selecting Features and the Target
Select the features (predictors) and target variable. In this scenario, we'll use:

- **Features**: Hours Studied, Previous Scores, Extracurricular Activities, Sleep Hours, Sample Question Papers Practiced
- **Target**: Performance Index

- Keep the target out of the feature matrix: if `Performance Index` appeared in `X`, the model would fit near-perfectly while learning nothing useful.

In [16]:
# Define the feature matrix X and target vector y
X = df[['Hours Studied', 'Previous Scores', 'Extracurricular Activities', 'Sleep Hours', 'Sample Question Papers Practiced']]
y = df['Performance Index']

In [17]:
# Preview the feature matrix
print("Features preview:")
X.head()

Features preview:


| | Hours Studied | Previous Scores | Extracurricular Activities | Sleep Hours | Sample Question Papers Practiced |
|---|---|---|---|---|---|
| 0 | 7.0 | 99 | 1 | 9.0 | 1.0 |
| 1 | 4.0 | 82 | 0 | 4.0 | 2.0 |
| 2 | 8.0 | 51 | 1 | 7.0 | 2.0 |
| 3 | 5.0 | 52 | 1 | 5.0 | 2.0 |
| 4 | 7.0 | 75 | 0 | 8.0 | 5.0 |


In [18]:
print("Target preview:")
y.head()

Target preview:


0    91.0
1    65.0
2    45.0
3    36.0
4    66.0
Name: Performance Index, dtype: float64

## Splitting the Dataset
Split the dataset into training and testing sets. An 80/20 or 70/30 split is common.
- Holding out a test set lets you estimate performance on unseen data and detect overfitting; it does not by itself prevent overfitting.


In [19]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)

Training set size: (7968, 5)
Testing set size: (1993, 5)
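
The printed sizes are not an exact 80/20 division because the row count is not a multiple of five. The sketch below reproduces the shapes on placeholder arrays, assuming the total of 9961 rows implied by the output (7968 + 1993). With a fractional `test_size`, scikit-learn rounds the test share up and gives the remainder to training.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 9961  # 7968 + 1993, the total implied by the printed shapes
X_demo = np.zeros((n_samples, 5))  # placeholder features; only shapes matter here
y_demo = np.zeros(n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 0.2 * 9961 = 1992.2, which rounds up to 1993 test rows
expected_test = math.ceil(0.2 * n_samples)
print(X_tr.shape, X_te.shape, expected_test)
```

Fixing `random_state` makes the shuffle reproducible, so rerunning the cell yields the same rows in each set.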


## Training the Linear Regression Model
Use scikit-learn's `LinearRegression` to fit a model on the training data.
- Understand the process of fitting a model to the training data using scikit-learn.

In [20]:
from sklearn.linear_model import LinearRegression

# Create an instance of the LinearRegression model
model = LinearRegression()

# Train (fit) the model using the training data
model.fit(X_train, y_train)

print("Model training complete.")

Model training complete.


## Making Predictions and Evaluating the Model
After training, use the model to predict performance on the test set. 

Then, evaluate its performance using metrics such as the Mean Squared Error (MSE) and the $R^2$ score.
- Evaluate the regression model to understand its predictive accuracy and error.

In [21]:
from sklearn.metrics import mean_squared_error, r2_score

# Make predictions on the testing set
y_pred = model.predict(X_test)

# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("R² Score:", r2)


Mean Squared Error (MSE): 9.615715629516592
R² Score: 0.9740478788441724


# Evaluating Regression Models: MSE and R²

## Mean Squared Error (MSE)

Mean Squared Error (MSE) is a common metric for evaluating the performance of regression models. It measures the average squared difference between the predicted values and the actual values.

**Formula:**

$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$

Where:
- $y_i$ is the actual value for the $i^{\text{th}}$ observation.
- $\hat{y}_i$ is the predicted value for the $i^{\text{th}}$ observation.
- $n$ is the total number of observations.

**Key Point:**  
Squaring the errors means that larger errors are penalized more heavily, which makes MSE sensitive to outliers.
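
To make the formula and the outlier sensitivity concrete, here is a small sketch that computes MSE by hand and compares it with `mean_squared_error`. The targets come from the preview rows; the predictions in `y_hat` are hypothetical values chosen for easy arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([91.0, 65.0, 45.0, 36.0, 66.0])  # targets from the preview
y_hat = np.array([90.0, 67.0, 44.0, 38.0, 66.0])   # hypothetical predictions

# Direct translation of the formula: mean of squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)

# RMSE puts the error back in the units of the target
rmse = np.sqrt(mse_manual)

# Make a single prediction worse (error of 10 instead of 1)
y_hat_outlier = y_hat.copy()
y_hat_outlier[0] = 81.0
mse_outlier = np.mean((y_true - y_hat_outlier) ** 2)

print(mse_manual, mse_sklearn, rmse, mse_outlier)
```

The squared errors are 1, 4, 1, 4, 0, giving an MSE of 2.0. Changing one error from 1 to 10 raises the MSE to 21.8, which illustrates how a single large miss dominates the metric. Applied to the test-set MSE of about 9.62 reported earlier, the square root gives an RMSE of roughly 3.10 index points.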

---

## Coefficient of Determination (R² Score)

The R² score, also known as the coefficient of determination, quantifies how well the independent variables explain the variability of the dependent variable in a regression model.

**Formula:**

$R^2 = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}$

Where:
- $\text{SS}_{\text{res}} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$ is the residual sum of squares.
- $\text{SS}_{\text{tot}} = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2$ is the total sum of squares.
- $\bar{y}$ is the mean of the actual values.

**Interpretation:**
- An $R^2$ value of **1** indicates that the model perfectly explains the variance in the data.
- An $R^2$ value of **0** means that the model does no better than predicting the mean.
- Negative values suggest that the model performs worse than simply predicting the mean.
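
The three cases above can be checked numerically. The sketch below implements the formula directly and compares it with `r2_score`; `r2_manual` is a hypothetical helper, and the "worse" predictions are simply the targets in reverse order, chosen only to produce a poor fit.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true: np.ndarray, y_hat: np.ndarray) -> float:
    """R^2 = 1 - SS_res / SS_tot, as in the formula above."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([91.0, 65.0, 45.0, 36.0, 66.0])  # targets from the preview

perfect = y_true.copy()                         # every prediction exact
mean_only = np.full_like(y_true, y_true.mean()) # always predict the mean
worse = y_true[::-1].copy()                     # deliberately poor predictions

r2_perfect = r2_manual(y_true, perfect)
r2_mean = r2_manual(y_true, mean_only)
r2_worse = r2_manual(y_true, worse)

print(r2_perfect, r2_mean, r2_worse, r2_score(y_true, worse))
```

The perfect predictions score 1, the mean-only baseline scores 0, and the reversed predictions score below 0 because their residual sum of squares exceeds the total sum of squares.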

---

Together, these metrics give complementary views of a regression model. MSE measures the magnitude of the prediction errors in squared units of the target, while the R² score gives the unitless proportion of variance explained by the model.


## Interpreting the Model Coefficients
Understanding the coefficients can offer insight into how each predictor influences the target variable.
- Model coefficients help in interpreting the influence of each feature on the outcome.

In [22]:
# Retrieve the coefficients and the intercept
coefficients = model.coef_
intercept = model.intercept_

# Display the model parameters
feature_names = X.columns
print("Intercept:", intercept)
print("Coefficients:")
for feature, coef in zip(feature_names, coefficients):
    print(f"  {feature}: {coef}")


Intercept: -34.1210594325544
Coefficients:
  Hours Studied: 2.8521269892855154
  Previous Scores: 1.0192038825540717
  Extracurricular Activities: 0.7487257123811044
  Sleep Hours: 0.46664605215555943
  Sample Question Papers Practiced: 0.19887583660555258


# Model Interpretation

The linear regression model was fitted to predict the **Performance Index** based on several predictors. Below is an interpretation of the intercept and each coefficient in the model:

### Intercept
- **Intercept:** -34.121  
  This value is the predicted **Performance Index** when every predictor equals zero: zero hours studied, a previous score of zero, no extracurricular activities, zero hours of sleep, and zero sample question papers practiced. The resulting prediction of about -34.1 is not a meaningful index value. That combination of inputs lies far outside the observed data, so the intercept is best read as an offset that positions the fitted plane rather than as a realistic prediction.

### Coefficients
Each coefficient quantifies the expected change in the **Performance Index** for a one-unit increase in the predictor variable, holding all other variables constant.

- **Hours Studied:** 2.852  
  For each additional hour a student studies, the **Performance Index** is predicted to increase by approximately 2.85 units, holding the other predictors constant. This is the largest per-unit effect in the model.

- **Previous Scores:** 1.019  
  An increase of one unit in previous scores is associated with an increase of about 1.02 units in the **Performance Index**. This suggests that a student's past academic performance is a valuable predictor of future performance.

- **Extracurricular Activities:** 0.749  
  Since the **Extracurricular Activities** variable was converted to numeric (with 1 representing "yes" and 0 representing "no"), this coefficient indicates that students who participate in extracurricular activities have a **Performance Index** that is, on average, 0.75 units higher than those who do not, all else being equal.

- **Sleep Hours:** 0.467  
  Each additional hour of sleep is associated with an increase of approximately 0.47 units in the **Performance Index**. The effect is positive but smaller per unit than those of study hours and previous scores.

- **Sample Question Papers Practiced:** 0.199  
  For every additional sample question paper practiced, the **Performance Index** is predicted to increase by about 0.20 units. This is a positive but relatively minor per-unit effect.
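
As a worked example of how the intercept and coefficients combine, the sketch below rebuilds the prediction for the first preview row by hand, using the fitted values printed earlier. It is an illustration of the linear formula, equivalent in principle to calling `model.predict` on that row.

```python
import numpy as np

# Fitted parameters copied from the printed model output
intercept = -34.1210594325544
coefficients = np.array([
    2.8521269892855154,   # Hours Studied
    1.0192038825540717,   # Previous Scores
    0.7487257123811044,   # Extracurricular Activities
    0.46664605215555943,  # Sleep Hours
    0.19887583660555258,  # Sample Question Papers Practiced
])

# First preview row: 7 hours, score 99, activities = 1, 9 hours sleep, 1 paper
row0 = np.array([7.0, 99.0, 1.0, 9.0, 1.0])
prediction = intercept + coefficients @ row0

# One extra study hour, everything else held constant
row0_plus_hour = row0.copy()
row0_plus_hour[0] += 1.0
delta = (intercept + coefficients @ row0_plus_hour) - prediction

print(round(prediction, 2), round(delta, 3))
```

The prediction comes out near 91.89, close to the actual value of 91.0 for that row, and adding one study hour changes it by exactly the **Hours Studied** coefficient.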

### Overall Interpretation
The model highlights the importance of study time and previous academic performance in predicting the **Performance Index**. **Hours Studied** has the largest per-unit coefficient, but raw coefficients are measured in different units and are not directly comparable. **Previous Scores** varies over a much wider range (51 to 99 in the preview rows alone, versus 4 to 8 for **Hours Studied**), so its total contribution to a prediction is large despite a coefficient near 1. **Extracurricular Activities**, **Sleep Hours**, and **Sample Question Papers Practiced** also contribute positively, with smaller per-unit coefficients. These are associations in observational data rather than demonstrated causal effects. With that caveat, the results suggest to educators and students that study time and prior performance are the most informative predictors of academic outcomes.
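
Raw coefficients depend on each feature's units, so one common way to compare features is to refit on standardized inputs. The sketch below uses synthetic data, not the student dataset: the variable names, ranges, and true coefficients are assumptions picked to show that the ranking by raw coefficient and the ranking by standardized coefficient can differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic features (assumed ranges, for illustration only)
hours = rng.uniform(1, 9, n)     # narrow range, large per-unit effect
scores = rng.uniform(40, 99, n)  # wide range, small per-unit effect
target = 3.0 * hours + 1.0 * scores + rng.normal(0, 2, n)

X_raw = np.column_stack([hours, scores])

# Coefficients in original units: change in target per one unit of the feature
raw_coefs = LinearRegression().fit(X_raw, target).coef_

# Coefficients after scaling: change in target per one standard deviation
X_std = StandardScaler().fit_transform(X_raw)
std_coefs = LinearRegression().fit(X_std, target).coef_

print("raw:", raw_coefs)
print("standardized:", std_coefs)
```

In this synthetic setup the first feature has the larger raw coefficient, while the second has the larger standardized coefficient because it varies over a much wider range. Applying the same comparison to `X_train` would show whether a similar reversal holds for the student data.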
