We have taken a simple dataset, 'Student-Performance.csv,' from which our model can predict the performance of any student.

In [1]:
import pandas as pd

In [2]:
# read the entire dataset
df = pd.read_csv("Student-Performance.csv")

In [3]:
# first 5 rows
df.head()

Unnamed: 0,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced,Performance Index
0,7,99,Yes,9,1,91.0
1,4,82,No,4,2,65.0
2,8,51,Yes,7,2,45.0
3,5,52,Yes,5,2,36.0
4,7,75,No,8,5,66.0


In [4]:
# last 5 rows
df.tail()

Unnamed: 0,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced,Performance Index
9995,1,49,Yes,4,2,23.0
9996,7,64,Yes,8,5,58.0
9997,6,83,Yes,8,5,74.0
9998,9,97,Yes,7,0,95.0
9999,7,74,No,8,1,64.0


In [5]:
#Statistical dataset details 
df.describe()

Unnamed: 0,Hours Studied,Previous Scores,Sleep Hours,Sample Question Papers Practiced,Performance Index
count,10000.0,10000.0,10000.0,10000.0,10000.0
mean,4.9929,69.4457,6.5306,4.5833,55.2248
std,2.589309,17.343152,1.695863,2.867348,19.212558
min,1.0,40.0,4.0,0.0,10.0
25%,3.0,54.0,5.0,2.0,40.0
50%,5.0,69.0,7.0,5.0,55.0
75%,7.0,85.0,8.0,7.0,71.0
max,9.0,99.0,9.0,9.0,100.0


In [6]:
# Check for null values and display the count of null values in each column.
df.isnull().sum()  

Hours Studied                       0
Previous Scores                     0
Extracurricular Activities          0
Sleep Hours                         0
Sample Question Papers Practiced    0
Performance Index                   0
dtype: int64

In [7]:
# 
df.dtypes

Hours Studied                         int64
Previous Scores                       int64
Extracurricular Activities           object
Sleep Hours                           int64
Sample Question Papers Practiced      int64
Performance Index                   float64
dtype: object

In [8]:
# Label encoder will change the labels 
from sklearn.preprocessing import LabelEncoder, StandardScaler

In [9]:
# override categorical data to numerical data
le = LabelEncoder()
df["Extracurricular Activities"] = le.fit_transform(df["Extracurricular Activities"])

In [10]:
# Updated dataframe
df 

Unnamed: 0,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced,Performance Index
0,7,99,1,9,1,91.0
1,4,82,0,4,2,65.0
2,8,51,1,7,2,45.0
3,5,52,1,5,2,36.0
4,7,75,0,8,5,66.0
...,...,...,...,...,...,...
9995,1,49,1,4,2,23.0
9996,7,64,1,8,5,58.0
9997,6,83,1,8,5,74.0
9998,9,97,1,7,0,95.0


In [11]:
#Partician of dataset into x 
x = df[["Hours Studied",	"Previous Scores",	"Extracurricular Activities",	"Sleep Hours",	"Sample Question Papers Practiced"]]
x

Unnamed: 0,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced
0,7,99,1,9,1
1,4,82,0,4,2
2,8,51,1,7,2
3,5,52,1,5,2
4,7,75,0,8,5
...,...,...,...,...,...
9995,1,49,1,4,2
9996,7,64,1,8,5
9997,6,83,1,8,5
9998,9,97,1,7,0


In [12]:
# Partician of dataset into y
y = df["Performance Index"]
y

0       91.0
1       65.0
2       45.0
3       36.0
4       66.0
        ... 
9995    23.0
9996    58.0
9997    74.0
9998    95.0
9999    64.0
Name: Performance Index, Length: 10000, dtype: float64

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
# Split the data into training data and testing data with respect 80% and 20%
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [15]:
# 8000 data for train
x_train

Unnamed: 0,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced
4701,1,68,0,5,7
9503,7,46,1,9,8
3415,9,64,0,4,3
3408,1,55,0,7,3
5548,5,64,0,7,3
...,...,...,...,...,...
5492,3,54,1,6,0
3334,6,81,1,4,5
7697,3,59,1,6,7
8128,9,95,0,5,1


In [16]:
# 2000 data for test
x_test

Unnamed: 0,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced
9245,9,87,1,8,3
6965,2,70,1,9,5
4850,8,61,0,8,0
8468,9,89,0,9,9
7493,9,66,1,5,3
...,...,...,...,...,...
7887,9,62,1,7,2
831,3,79,1,9,5
8245,4,80,0,9,6
1266,2,63,0,7,8


My data is on different scales, which may make it difficult for my model to learn patterns effectively. Converting the entire dataset into a standard normal distribution following a Gaussian curve might help.

In [18]:
# standardize entire dataset using z-score 
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.fit_transform(x_test)

Standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:


    z = (x - u) / s
where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.

Now pass a data into training, from where i am able to get library for linear regression training.
Hold a varible for deployment because we need physical file 

In [19]:
from sklearn.linear_model import LinearRegression
import pickle
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

In [20]:
# train the model
model = LinearRegression()
model.fit(x_train_scaled, y_train)

In [22]:
x_test_scaled

array([[ 1.54016765,  1.03483026,  0.99104014,  0.89955766, -0.57651865],
       [-1.13009205,  0.04067274,  0.99104014,  1.50369309,  0.12891867],
       [ 1.15870198, -0.48564595, -1.00904087,  0.89955766, -1.63467463],
       ...,
       [-0.36716071,  0.62547128, -1.00904087,  1.50369309,  0.48163733],
       [-1.13009205, -0.36868624, -1.00904087,  0.29542223,  1.18707465],
       [ 0.77723631,  1.6196288 ,  0.99104014,  0.89955766,  1.18707465]],
      shape=(2000, 5))

In [23]:
# prediction with respect to the single data
model.predict([[ 1.54016765,  1.03483026,  0.99104014,  0.89955766, -0.57651865]])

array([85.69900606])

In [26]:
# prediction with respect to entire data
y_pred = model.predict(x_test_scaled)

We don't know how accurate this prediction, i have to print maybe a means sqare error  or mean absolute error or maybe r2 i have to print. Then only will be able to know that how accurate my model.

In [29]:
# MSE from y test value and predicted value  
mean_squared_error(y_test, y_pred)

4.128420135603799

In [33]:
# check the accuracy of model in terms of percentage like 0 to 1
r2_score(y_test, y_pred)

0.9886792925933731

Creating physical a binary file for model with the help of dump library then i have to pass model. There was a column where categorical data was available and i have converted those categorical data into a numerical value by using a label encoding. So when i am save a model, i have to pass that part because when i will try to do prediction with the new data with this physical file. So again whatever data which i am trying to pass i have to convert that with the same standard, a scale at which i have used at the time of creating a model. 

Those informations are available in those variables. If i would have not implemented standard scaler them it's simple "pickle.dump(model). But as ia have done some sort of preprocessing, I have to store those preprocessing as well. one is scaler because it know how we have scaled the data and label encoder also.

Any data come in between, it will try to follow the same thing. with that i'm trying to store even those variables. which has helped me to scale the data and label encode the data. this entire thing will stored inside a particular file.

In [35]:
# create binary file
with open("Student_lr_final_model.pkl", "wb") as file :
    pickle.dump((model, scaler, le), file)