# Student Performance (Multiple Linear Regression)

## Exploring Factors Affecting Student Performance

## Machine Learning Lifecycle

**The machine learning lifecycle is a process that guides the development and deployment of machine learning models in a structured way. It consists of the following steps:**

- Problem Definition
- Data Collection
- Data Quality Check
- Exploratory Data Analysis (EDA)
- Data Preprocessing
- Model Training
- Choose Best Model

## 1. Problem Definition

- This project explores how a student's performance (Performance Index) is affected by factors such as Hours Studied, Previous Scores, Extracurricular Activities, Sleep Hours, and Sample Question Papers Practiced.
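
Since the title names multiple linear regression, the relationship described above can be sketched as the model below. This is only an illustrative form: it assumes Extracurricular Activities is encoded as a 0/1 indicator, which is an assumption and not something stated in this section.

$$
\text{Performance Index} = \beta_0 + \beta_1\,\text{Hours Studied} + \beta_2\,\text{Previous Scores} + \beta_3\,\text{Extracurricular Activities} + \beta_4\,\text{Sleep Hours} + \beta_5\,\text{Sample Question Papers Practiced} + \varepsilon
$$

Here $\beta_0$ is the intercept, $\beta_1, \dots, \beta_5$ are the coefficients to be estimated, and $\varepsilon$ is the error term.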

## 2. Data Collection

- Dataset Source - [Link](https://www.kaggle.com/datasets/nikhil7280/student-performance-multiple-linear-regression)
- The dataset consists of 10,000 student records.

### 2.1) Import Data & Libraries

**Import the pandas, NumPy, Seaborn, Matplotlib, and warnings libraries.**

In [1]:
import pandas as pd  # tabular data loading and manipulation with DataFrames
import numpy as np  # numerical arrays
import seaborn as sns  # statistical graphics for data visualization
import matplotlib.pyplot as plt  # general plotting: histograms, bar charts, scatter plots
%matplotlib inline
# renders Matplotlib plots directly below code cells

import warnings
warnings.filterwarnings('ignore')  # suppress all warnings

**Load the CSV file into a pandas DataFrame**

In [2]:
data = pd.read_csv('data/Student_Performance.csv')

**Top 5 Records**

In [3]:
data.head()

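| | Hours Studied | Previous Scores | Extracurricular Activities | Sleep Hours | Sample Question Papers Practiced | Performance Index |
|---|---|---|---|---|---|---|
| 0 | 7 | 99 | Yes | 9 | 1 | 91.0 |
| 1 | 4 | 82 | No | 4 | 2 | 65.0 |
| 2 | 8 | 51 | Yes | 7 | 2 | 45.0 |
| 3 | 5 | 52 | Yes | 5 | 2 | 36.0 |
| 4 | 7 | 75 | No | 8 | 5 | 66.0 |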


**Shape of the Dataset**

In [4]:
data.shape

(10000, 6)

### 2.2) About Dataset

#### Description:

The Student Performance Dataset is designed to examine the factors that influence students' academic performance. It consists of 10,000 student records, each containing several predictors and a performance index.

#### Variables:

**Hours Studied:** The total number of hours spent studying by each student.

**Previous Scores:** The scores obtained by students in previous tests.

**Extracurricular Activities:** Whether the student participates in extracurricular activities (Yes or No).

**Sleep Hours:** The average number of hours of sleep the student had per day.

**Sample Question Papers Practiced:** The number of sample question papers the student practiced.

#### Target Variable:

**Performance Index:** A measure of each student's overall academic performance, rounded to the nearest integer. The index ranges from 10 to 100, with higher values indicating better performance.
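
The documented properties of the target (values between 10 and 100, rounded to whole numbers) can be checked programmatically. The sketch below is a minimal illustration: `check_target` is a hypothetical helper, and `sample` holds only the five Performance Index values shown by `data.head()`, not the full dataset.

```python
import pandas as pd

# Sample only: the five Performance Index values shown by data.head().
sample = pd.DataFrame({"Performance Index": [91.0, 65.0, 45.0, 36.0, 66.0]})


def check_target(df, col="Performance Index", low=10, high=100):
    """Report whether the target stays within [low, high] and holds whole numbers."""
    values = df[col]
    return {
        "in_range": bool(values.between(low, high).all()),  # inclusive bounds
        "all_integers": bool((values % 1 == 0).all()),      # no fractional part
    }


print(check_target(sample))
```

On the real data, the same helper could be called as `check_target(data)`.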

## 3. Data Quality Check

- Check Missing values
- Check Data-types
- Check Duplicates
- Check unique value counts of each variable
- Check descriptive statistics of the dataset

### 3.1) Check Missing values

In [5]:
data.isnull().sum()

Hours Studied                       0
Previous Scores                     0
Extracurricular Activities          0
Sleep Hours                         0
Sample Question Papers Practiced    0
Performance Index                   0
dtype: int64

**There are no missing values in the dataset.**
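
When a dataset does contain gaps, it often helps to see the share of missing values alongside the raw counts. The sketch below is one possible way to do that: `missing_summary` is a hypothetical helper, and `toy` is a made-up frame with one deliberately missing value, not the student dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return per-column missing counts and percentages, largest first."""
    counts = df.isnull().sum()
    percent = (df.isnull().mean() * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)


# Toy frame (not the real dataset) with one deliberately missing value.
toy = pd.DataFrame({
    "Hours Studied": [7, 4, 8, 5],
    "Sleep Hours": [9, np.nan, 7, 5],
})

print(missing_summary(toy))
```

Calling `missing_summary(data)` on the student dataset would be expected to report zero for every column, in line with the output above.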

### 3.2) Check Data-types

In [6]:
data.dtypes

Hours Studied                         int64
Previous Scores                       int64
Extracurricular Activities           object
Sleep Hours                           int64
Sample Question Papers Practiced      int64
Performance Index                   float64
dtype: object

**Numeric variables:** Hours Studied, Previous Scores, Sleep Hours, Sample Question Papers Practiced, Performance Index (Target)

**Categorical variables:** Extracurricular Activities
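
The numeric/categorical split can also be derived from the dtypes instead of being read off by eye. The sketch below uses `select_dtypes` on a small frame built from the five rows shown by `data.head()`; the variable names `sample`, `numeric_cols`, and `categorical_cols` are illustrative.

```python
import pandas as pd

# The five rows shown by data.head(), used here as a small stand-in for `data`.
sample = pd.DataFrame({
    "Hours Studied": [7, 4, 8, 5, 7],
    "Previous Scores": [99, 82, 51, 52, 75],
    "Extracurricular Activities": ["Yes", "No", "Yes", "Yes", "No"],
    "Sleep Hours": [9, 4, 7, 5, 8],
    "Sample Question Papers Practiced": [1, 2, 2, 2, 5],
    "Performance Index": [91.0, 65.0, 45.0, 36.0, 66.0],
})

# Numeric columns (int64 and float64) versus everything else.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```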

### 3.3) Check Duplicates

In [7]:
print('Duplicate records count:', data.duplicated().sum())

Duplicate records count: 127


**127 records are exact duplicates of other rows. Removing them keeps repeated rows from being over-weighted during training and from appearing in both the training and test splits.**
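
The notebook does not show the removal at this point. As a minimal sketch, duplicates could be dropped with `drop_duplicates`; the frame `toy` below is made up for illustration (its third row repeats the first) and is not the student dataset.

```python
import pandas as pd

# Toy frame (not the real dataset): the third row repeats the first.
toy = pd.DataFrame({
    "Hours Studied": [7, 4, 7, 8],
    "Previous Scores": [99, 82, 99, 51],
    "Performance Index": [91.0, 65.0, 91.0, 45.0],
})

n_dupes = int(toy.duplicated().sum())  # rows identical to an earlier row

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print("Duplicate records count:", n_dupes)
print("Shape after removal:", deduped.shape)
```

Applied to the student dataset, the same two calls would be expected to leave 10,000 - 127 = 9,873 rows.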

### 3.4) Check unique value counts of each variable

In [8]:
print('Unique values count of each variable\n\n', data.nunique(), sep="")

Unique values count of each variable

Hours Studied                        9
Previous Scores                     60
Extracurricular Activities           2
Sleep Hours                          6
Sample Question Papers Practiced    10
Performance Index                   91
dtype: int64


**The Extracurricular Activities variable indicates whether a student participates in extracurricular activities (Yes or No).**
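
A two-valued column like this can be validated against its expected labels. The sketch below also shows one possible 1/0 encoding; the encoding is an assumption about a later preprocessing step, not something this section specifies. The values come from the five rows shown by `data.head()`.

```python
import pandas as pd

# Values taken from the five rows shown by data.head().
activities = pd.Series(
    ["Yes", "No", "Yes", "Yes", "No"], name="Extracurricular Activities"
)

expected = {"Yes", "No"}
unexpected = set(activities.unique()) - expected  # empty when the column is clean

# Hypothetical encoding for later modelling: Yes -> 1, No -> 0.
encoded = activities.map({"Yes": 1, "No": 0})

print("Unexpected labels:", unexpected)
print("Encoded values:", encoded.tolist())
```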

### 3.5) Check descriptive statistics of the dataset

In [9]:
data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Hours Studied | 10000.0 | 4.9929 | 2.589309 | 1.0 | 3.0 | 5.0 | 7.0 | 9.0 |
| Previous Scores | 10000.0 | 69.4457 | 17.343152 | 40.0 | 54.0 | 69.0 | 85.0 | 99.0 |
| Sleep Hours | 10000.0 | 6.5306 | 1.695863 | 4.0 | 5.0 | 7.0 | 8.0 | 9.0 |
| Sample Question Papers Practiced | 10000.0 | 4.5833 | 2.867348 | 0.0 | 2.0 | 5.0 | 7.0 | 9.0 |
| Performance Index | 10000.0 | 55.2248 | 19.212558 | 10.0 | 40.0 | 55.0 | 71.0 | 100.0 |


In [10]:
print('Categories count of', data.select_dtypes('object').value_counts(), sep=" ")

Categories count of Extracurricular Activities
No                            5052
Yes                           4948
Name: count, dtype: int64


**Insights**

- All numeric variables take discrete values.
- The mean of Previous Scores (69.45) is about 14 points higher than the mean of Performance Index (55.22).
- The mean and median are close for every numeric variable (they differ by less than 0.5 in each case).
- The minimum of Previous Scores (40) equals the 25th percentile of Performance Index (40), which suggests that Performance Index values tend to be lower than students' Previous Scores.
- The Extracurricular Activities categories are well balanced (5,052 No vs. 4,948 Yes), so there is no class imbalance.
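
The figures behind these insights can be recomputed from the summary outputs. The sketch below copies the means, medians, and category counts from the `describe()` and `value_counts()` outputs above; the names `summary`, `score_gap`, and `shares` are illustrative. On the full dataset the same quantities could be derived directly from `data`.

```python
import pandas as pd

# Means and medians copied from the describe() output above.
summary = pd.DataFrame(
    {
        "mean": [4.9929, 69.4457, 6.5306, 4.5833, 55.2248],
        "median": [5.0, 69.0, 7.0, 5.0, 55.0],
    },
    index=[
        "Hours Studied",
        "Previous Scores",
        "Sleep Hours",
        "Sample Question Papers Practiced",
        "Performance Index",
    ],
)

# Absolute mean-median gap per variable; small gaps hint at little skew.
summary["abs_gap"] = (summary["mean"] - summary["median"]).abs()

# Difference between the mean of Previous Scores and the mean of Performance Index.
score_gap = (
    summary.loc["Previous Scores", "mean"]
    - summary.loc["Performance Index", "mean"]
)

# Category counts copied from the value_counts() output above.
counts = pd.Series({"No": 5052, "Yes": 4948})
shares = counts / counts.sum()

print(summary)
print(f"Mean gap (Previous Scores - Performance Index): {score_gap:.2f}")
print(shares.round(4))
```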