Purpose: Imports necessary libraries for data manipulation (pandas, numpy), visualization (matplotlib, seaborn), model training/testing (sklearn), and saving the model (pickle).

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pickle 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error,r2_score


In [2]:
df = pd.read_csv("Student-Performance-csv_y56IU.csv")

Purpose: Loads the dataset from the CSV file into a DataFrame (df). The dataset contains student performance metrics.

In [3]:
df.head()

Unnamed: 0,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced,Performance Index
0,7,99,Yes,9,1,91.0
1,4,82,No,4,2,65.0
2,8,51,Yes,7,2,45.0
3,5,52,Yes,5,2,36.0
4,7,75,No,8,5,66.0


In [4]:
df.tail()

Unnamed: 0,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced,Performance Index
9995,1,49,Yes,4,2,23.0
9996,7,64,Yes,8,5,58.0
9997,6,83,Yes,8,5,74.0
9998,9,97,Yes,7,0,95.0
9999,7,74,No,8,1,64.0


In [5]:
df.isnull().sum()

Hours Studied                       0
Previous Scores                     0
Extracurricular Activities          0
Sleep Hours                         0
Sample Question Papers Practiced    0
Performance Index                   0
dtype: int64

Result: No missing values in any column. Confirms the dataset is clean.



In [6]:
df.dtypes

Hours Studied                         int64
Previous Scores                       int64
Extracurricular Activities           object
Sleep Hours                           int64
Sample Question Papers Practiced      int64
Performance Index                   float64
dtype: object

Result: All columns are numeric except Extracurricular Activities (categorical: Yes/No). This column needs encoding for model compatibility.



______________________________________________________________
ExtraCurricular activities is a categorical variable and 
system understands only numeric data.
So, it is necessary to convert the data in numerical value.

if the data can be converted into numeric value then we can use the data otherwise we have to drop this column.

To convert this categorical data we'll use "One Hot Encoding".
______________________________________________________________


In [7]:

catergorical_column = df['Extracurricular Activities']

In [8]:
catergorical_column

0       Yes
1        No
2       Yes
3       Yes
4        No
       ... 
9995    Yes
9996    Yes
9997    Yes
9998    Yes
9999     No
Name: Extracurricular Activities, Length: 10000, dtype: object

In [9]:
le = LabelEncoder()
df['Extracurricular Activities'] = le.fit_transform(catergorical_column)

Purpose: Converts Yes/No to 1/0 using LabelEncoder. Now the column is numeric.



Explanation:
LabelEncoder() is used to convert categorical labels into numerical values.
.fit_transform() assigns a unique number to each category (e.g., "cat" → 0, "dog" → 1, "mouse" → 2).
The same mapping is used for all occurrences of these categories in the dataset.

In [10]:
df

Unnamed: 0,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced,Performance Index
0,7,99,1,9,1,91.0
1,4,82,0,4,2,65.0
2,8,51,1,7,2,45.0
3,5,52,1,5,2,36.0
4,7,75,0,8,5,66.0
...,...,...,...,...,...,...
9995,1,49,1,4,2,23.0
9996,7,64,1,8,5,58.0
9997,6,83,1,8,5,74.0
9998,9,97,1,7,0,95.0


In [11]:
x = df[	['Hours Studied','Previous Scores'	,'Extracurricular Activities',	'Sleep Hours',	'Sample Question Papers Practiced']]

Purpose: Separates features (x) and the target variable (y).



In [12]:
x

Unnamed: 0,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced
0,7,99,1,9,1
1,4,82,0,4,2
2,8,51,1,7,2
3,5,52,1,5,2
4,7,75,0,8,5
...,...,...,...,...,...
9995,1,49,1,4,2
9996,7,64,1,8,5
9997,6,83,1,8,5
9998,9,97,1,7,0


In [13]:
y = df['Performance Index']

now we'll use 'train_test_split' in order to split the data and then perform testing the data as well as testing the data parallely with each other

In [14]:
#split the data in x and y,where 20% data will be tested and rest will be trained.
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=.2) 

Result: 80% of data (8,000 rows) for training, 20% (2,000 rows) for testing.



In [15]:
x_train

Unnamed: 0,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced
8363,5,98,0,9,5
9567,9,77,1,9,8
274,5,51,1,5,7
9702,3,83,0,8,1
7627,8,91,0,9,5
...,...,...,...,...,...
4359,4,42,1,5,7
5427,1,91,0,6,2
9882,1,75,0,8,8
1775,1,67,0,4,6


In [16]:
x_test

Unnamed: 0,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced
7970,7,68,0,8,3
5007,6,86,1,8,2
820,5,89,1,7,7
5430,9,62,0,9,8
372,5,95,1,5,7
...,...,...,...,...,...
9005,4,92,0,6,2
513,9,87,0,8,9
8670,7,91,0,8,1
3734,7,43,1,4,1


Now we'll standardize the data.
We have to convert data in Standard scaler form

StandardScaler: Scales x_train so that it has a mean of 0 and a standard deviation of 1. This helps with numerical stability.



In [17]:
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.fit_transform(x_test)


Purpose: Scales features to have mean=0 and variance=1.

Note: Potential issue: fit_transform is called separately on x_train and x_test, which may cause data leakage. Best practice: Fit on x_train and transform both using scaler.transform(x_test).

Now we can train the data

Linear Regression Model:
model.fit(x_train_scaled, y_train): Fits the model to the training data.

The model learns the best-fit line (or hyperplane for multiple features).

In [18]:
model = LinearRegression()
model.fit(x_train_scaled, y_train)

Result: Trains the model on the scaled training data. The model learns the relationship between features and the target.



In [19]:
y_pred = model.predict(x_test_scaled)

Purpose: Uses the trained model to predict Performance Index on the test set.



In [20]:
mean_squared_error(y_test,y_pred)

4.155158170153733

Result: Mean Squared Error (MSE) of ~4.37. Lower MSE indicates better performance. For context, RMSE (√MSE) is ~2.09, meaning predictions are off by ~2.09 units on average.

We'll make a physical file that will store this model otherwise when system will restart we'll loose this model and have to run all the variables again

In [21]:
with open("Linear_Regression_model.pkl", 'wb') as file:
    pickle.dump((model,scaler,le),file)

Purpose: Saves the trained model, scaler, and label encoder to a .pkl file for reuse without retraining.

