# Student Performance (Multiple Linear Regression)

## About Dataset

### Variables:

- Hours Studied: The total number of hours spent studying by each student.
- Previous Scores: The scores obtained by students in previous tests.
- Extracurricular Activities: Whether the student participates in extracurricular activities (Yes or No).
- Sleep Hours: The average number of hours of sleep the student had per day.
- Sample Question Papers Practiced: The number of sample question papers the student practiced.

#### Target Variable:

- Performance Index: A measure of the overall performance of each student. The performance index represents the student's academic performance and has been rounded to the nearest integer. The index ranges from 10 to 100, with higher values indicating better performance.


In [None]:
import pandas as pd
import numpy as np

path = '/td/student_performance/Student_Performance.csv'

df = pd.read_csv(path)

In [12]:
df

Unnamed: 0,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced,Performance Index
0,7,99,Yes,9,1,91.0
1,4,82,No,4,2,65.0
2,8,51,Yes,7,2,45.0
3,5,52,Yes,5,2,36.0
4,7,75,No,8,5,66.0
...,...,...,...,...,...,...
9995,1,49,Yes,4,2,23.0
9996,7,64,Yes,8,5,58.0
9997,6,83,Yes,8,5,74.0
9998,9,97,Yes,7,0,95.0


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9873 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Hours Studied                     9873 non-null   int64  
 1   Previous Scores                   9873 non-null   int64  
 2   Extracurricular Activities        9873 non-null   object 
 3   Sleep Hours                       9873 non-null   int64  
 4   Sample Question Papers Practiced  9873 non-null   int64  
 5   Performance Index                 9873 non-null   float64
dtypes: float64(1), int64(4), object(1)
memory usage: 539.9+ KB


In [31]:
print('Number of Nan :\n', df.isna().sum())
print('Number of Duplicates :', df.duplicated().sum())

Number of Nan :
 Hours Studied                       0
Previous Scores                     0
Extracurricular Activities          0
Sleep Hours                         0
Sample Question Papers Practiced    0
Performance Index                   0
dtype: int64
Number of Duplicates : 0
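
The two checks above can be tried on a small frame. This is a minimal sketch on toy values (hypothetical, not taken from the dataset); it also shows that `drop_duplicates()` keeps the original index labels, which is consistent with `df.info()` reporting 9873 entries labelled 0 to 9999.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical values, not rows from Student_Performance.csv
toy = pd.DataFrame({
    "Hours Studied": [7, 4, 7, 5],
    "Sleep Hours": [9.0, 4.0, 9.0, np.nan],
})

nan_per_column = toy.isna().sum()      # missing values counted per column
n_duplicates = toy.duplicated().sum()  # rows identical to an earlier row

# Keeps the first occurrence of each row and does not renumber the index
deduplicated = toy.drop_duplicates()

print(nan_per_column)
print("duplicates:", n_duplicates)
print("remaining index labels:", list(deduplicated.index))
```

If a contiguous index is preferred after removing rows, `reset_index(drop=True)` is one option.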


Sleep Hours,count
8,1784
7,1653
6,1645
9,1606
4,1605
5,1580


# Modifications to Apply to the Dataset

In [14]:
df = df.drop_duplicates()  # remove duplicate rows

# The 'Extracurricular Activities' column has dtype object:
# encode Yes as 1 and No as 0, then make the integer dtype explicit
df['Extracurricular Activities'] = (
    df['Extracurricular Activities'].replace({'Yes': 1, 'No': 0}).astype('int64')
)



In [17]:
df.info()  # no 'object' dtype columns remain

<class 'pandas.core.frame.DataFrame'>
Index: 9873 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Hours Studied                     9873 non-null   int64  
 1   Previous Scores                   9873 non-null   int64  
 2   Extracurricular Activities        9873 non-null   int64  
 3   Sleep Hours                       9873 non-null   int64  
 4   Sample Question Papers Practiced  9873 non-null   int64  
 5   Performance Index                 9873 non-null   float64
dtypes: float64(1), int64(5)
memory usage: 539.9 KB
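
A minimal sketch of a check that can accompany the Yes/No encoding: with an explicit mapping passed to `Series.map`, any label other than `Yes` or `No` turns into `NaN`, so unexpected values are easy to detect before casting. The toy column below is hypothetical.

```python
import pandas as pd

# Toy column: hypothetical values, same labels as the dataset column
toy = pd.DataFrame({"Extracurricular Activities": ["Yes", "No", "Yes"]})

mapping = {"Yes": 1, "No": 0}
encoded = toy["Extracurricular Activities"].map(mapping)

# Labels missing from the mapping (e.g. "yes", "N/A") would show up as NaN here
unmapped = toy.loc[encoded.isna(), "Extracurricular Activities"].unique()
if len(unmapped) > 0:
    raise ValueError(f"Unexpected labels: {list(unmapped)}")

toy["Extracurricular Activities"] = encoded.astype("int64")
print(toy.dtypes)
```

Unlike `replace`, `map` is not idempotent: running it a second time on the already encoded column would produce `NaN` everywhere, which the check above would catch.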


# Split & Scale

In [24]:
from sklearn.model_selection import train_test_split
# used to split the dataset into training and test sets

X = df.drop(columns='Performance Index')  # features
y = df['Performance Index']  # target (column to predict)

# train/test split: 35% of the rows go to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.35)

In [25]:
# X_test is smaller than X_train; a cell displays only its last expression,
# so only X_train.shape appears in the output
X_test.shape
X_train.shape

(6417, 5)
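
To see where the 6417 comes from, here is a small sketch that reproduces the split sizes on placeholder arrays with the same number of rows as `df.info()` reports (9873). The arrays `X_toy` and `y_toy` are stand-ins filled with zeros, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 9873  # row count reported by df.info()

# Placeholder arrays: only the shapes matter for this illustration
X_toy = np.zeros((n_rows, 5))
y_toy = np.zeros(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.35, random_state=42
)

# The test size is rounded up (ceil(0.35 * 9873) = 3456); the rest is training data
print(X_tr.shape, X_te.shape)  # (6417, 5) (3456, 5)
```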

In [26]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# learn the mean and standard deviation from the training set only
scaler.fit(X_train)

# transform returns the scaled data as NumPy arrays
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# convert the arrays back to DataFrames
X_train_df = pd.DataFrame(X_train_scaled)
X_test_df = pd.DataFrame(X_test_scaled)

In [27]:
# .describe().round(2) rounds every statistic to 2 decimals, which makes the table easier to read
X_train_df.describe().round(2)

Unnamed: 0,0,1,2,3,4
count,6417.0,6417.0,6417.0,6417.0,6417.0
mean,0.0,-0.0,0.0,-0.0,0.0
std,1.0,1.0,1.0,1.0,1.0
min,-1.53,-1.7,-0.98,-1.5,-1.6
25%,-0.76,-0.89,-0.98,-0.91,-0.9
50%,0.01,-0.03,-0.98,0.26,0.14
75%,0.79,0.9,1.02,0.85,0.84
max,1.56,1.71,1.02,1.44,1.54
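
The means of 0 and standard deviations of 1 above follow from the standardization `z = (x - mean) / std`, where both statistics come from the training set. A minimal sketch on toy arrays (hypothetical values) that compares `StandardScaler` with the manual formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays: hypothetical values standing in for X_train and X_test
train = np.array([[1.0, 50.0], [5.0, 70.0], [9.0, 90.0]])
test = np.array([[3.0, 60.0]])

scaler = StandardScaler().fit(train)  # statistics come from train only

# Manual z-score with the training mean and population std (np.std uses ddof=0)
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(scaler.transform(test), manual))
```

`StandardScaler` uses the population standard deviation (`ddof=0`), whereas `DataFrame.describe()` reports the sample standard deviation (`ddof=1`). With 6417 rows the difference disappears once the values are rounded to 2 decimals.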


In [28]:
X_train_df

Unnamed: 0,0,1,2,3,4
0,0.786862,1.359904,-0.983156,1.436641,-1.600542
1,1.558807,0.724682,1.017133,-0.323704,-1.600542
2,1.172834,-0.719004,-0.983156,-0.323704,-1.251840
3,-1.143001,-0.488014,-0.983156,-0.910485,-1.251840
4,-0.757028,-0.314772,-0.983156,-1.497267,1.189077
...,...,...,...,...,...
6412,-1.143001,-1.354226,-0.983156,-0.323704,0.142970
6413,1.558807,-0.545761,1.017133,-1.497267,-0.554435
6414,0.014917,1.475399,1.017133,-0.910485,0.491672
6415,-1.528973,-1.296478,-0.983156,1.436641,-1.600542
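
The scaled DataFrame above has columns named 0 to 4 and a fresh 0-based index, because `transform` returns a plain NumPy array. One possible way to keep the original labels, sketched on a toy frame (hypothetical values and index labels), is to pass `columns` and `index` when rebuilding the DataFrame:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training frame: hypothetical values, with a shuffled index as after a split
X_train_toy = pd.DataFrame(
    {"Hours Studied": [7, 4, 8], "Sleep Hours": [9, 4, 7]},
    index=[10, 3, 25],
)

scaler = StandardScaler().fit(X_train_toy)

# Reuse the original column names and index labels for the scaled data
scaled_df = pd.DataFrame(
    scaler.transform(X_train_toy),
    columns=X_train_toy.columns,
    index=X_train_toy.index,
)
print(scaled_df)
```

Keeping the index matters if the scaled features are later joined with `y_train`, which still carries the original row labels.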
