# Student Performance (Multiple Linear Regression)

The Student Performance Dataset is designed for examining the factors that influence students' academic performance. It consists of 10,000 student records, each containing several predictor variables and a performance index.

Variables:
- Hours Studied: The total number of hours each student spent studying.
- Previous Scores: The scores the student obtained in previous tests.
- Extracurricular Activities: Whether the student participates in extracurricular activities (Yes or No).
- Sleep Hours: The average number of hours the student slept per day.
- Sample Question Papers Practiced: The number of sample question papers the student practiced.

Target Variable:

- Performance Index: A measure of each student's overall academic performance, rounded to the nearest integer. The index ranges from 10 to 100, with higher values indicating better performance.

The dataset is intended to provide insight into the relationship between the predictor variables and the performance index. Researchers and data analysts can use it to explore how hours studied, previous scores, extracurricular activities, sleep hours, and sample question papers practiced relate to student performance.
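As a rough sketch of what "multiple linear regression" in the title would mean for these variables, the model can be written as below. The coefficient symbols, the shortened variable names, and the 1/0 coding of Extracurricular Activities (1 for Yes, 0 for No) are assumptions made for illustration; they are not taken from the dataset documentation.

$$
\text{Performance Index}_i = \beta_0 + \beta_1\,\text{Hours Studied}_i + \beta_2\,\text{Previous Scores}_i + \beta_3\,\text{Extracurricular}_i + \beta_4\,\text{Sleep Hours}_i + \beta_5\,\text{Sample Papers}_i + \varepsilon_i
$$

Here $i$ indexes the student records, $\beta_0$ is the intercept, each $\beta_j$ is the expected change in the index per unit change in its predictor with the others held fixed, and $\varepsilon_i$ is the error term.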

Note: This dataset is synthetic and was created for illustrative purposes. The relationships between the variables and the performance index may not reflect real-world scenarios.
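To illustrate what a synthetic regression dataset can look like, here is a minimal sketch that generates data from an assumed linear model and recovers the coefficients with ordinary least squares. The coefficients in `true_beta`, the noise level, and the uniform integer ranges are all hypothetical; how the actual dataset was generated is not documented here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # number of synthetic records (arbitrary choice)

# Hypothetical predictors drawn uniformly over illustrative integer ranges
hours = rng.integers(1, 10, size=n)    # 1..9
prev = rng.integers(40, 100, size=n)   # 40..99
extra = rng.integers(0, 2, size=n)     # 0 = No, 1 = Yes (assumed coding)
sleep = rng.integers(4, 10, size=n)    # 4..9
papers = rng.integers(0, 10, size=n)   # 0..9

# Hypothetical coefficients: intercept, then one slope per predictor
true_beta = np.array([-30.0, 3.0, 1.0, 0.5, 0.5, 0.2])

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), hours, prev, extra, sleep, papers])

# Response = linear combination of predictors + Gaussian noise
y = X @ true_beta + rng.normal(0.0, 2.0, size=n)

# Ordinary least squares estimate of the coefficients
beta_hat, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))
```

With enough records and modest noise, `beta_hat` should land close to `true_beta`, which is why a synthetic dataset of this kind tends to fit a linear model unusually well.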

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt



In [4]:
# read dataset from csv file
data = pd.read_csv("./Student_Performance.csv")

In [11]:
# show top 10 rows
data.head(10)

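|   | Hours Studied | Previous Scores | Extracurricular Activities | Sleep Hours | Sample Question Papers Practiced | Performance Index |
|---|---|---|---|---|---|---|
| 0 | 7 | 99 | Yes | 9 | 1 | 91.0 |
| 1 | 4 | 82 | No | 4 | 2 | 65.0 |
| 2 | 8 | 51 | Yes | 7 | 2 | 45.0 |
| 3 | 5 | 52 | Yes | 5 | 2 | 36.0 |
| 4 | 7 | 75 | No | 8 | 5 | 66.0 |
| 5 | 3 | 78 | No | 9 | 6 | 61.0 |
| 6 | 7 | 73 | Yes | 5 | 6 | 63.0 |
| 7 | 8 | 45 | Yes | 4 | 6 | 42.0 |
| 8 | 5 | 77 | No | 8 | 2 | 61.0 |
| 9 | 4 | 89 | No | 4 | 0 | 69.0 |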


In [6]:
# show each column's data type and non-null count
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Hours Studied                     10000 non-null  int64  
 1   Previous Scores                   10000 non-null  int64  
 2   Extracurricular Activities        10000 non-null  object 
 3   Sleep Hours                       10000 non-null  int64  
 4   Sample Question Papers Practiced  10000 non-null  int64  
 5   Performance Index                 10000 non-null  float64
dtypes: float64(1), int64(4), object(1)
memory usage: 468.9+ KB
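The `info()` output shows five numeric columns and one non-numeric column. A minimal sketch of splitting columns by type with `select_dtypes` follows; the `toy` frame is a hypothetical three-column stand-in built from the first two rows shown earlier, not the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame (subset of columns, two rows)
toy = pd.DataFrame({
    "Hours Studied": [7, 4],
    "Extracurricular Activities": ["Yes", "No"],
    "Performance Index": [91.0, 65.0],
})

# Integer and float columns
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else (here, the Yes/No text column)
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```

Separating the two groups this way makes it easier to summarize or transform numeric and categorical columns independently.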


In [8]:
# count the missing values in each column
data.isna().sum()

Hours Studied                       0
Previous Scores                     0
Extracurricular Activities          0
Sleep Hours                         0
Sample Question Papers Practiced    0
Performance Index                   0
dtype: int64

In [9]:
# show summary statistics for the numeric columns
data.describe()

|   | Hours Studied | Previous Scores | Sleep Hours | Sample Question Papers Practiced | Performance Index |
|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 4.9929 | 69.4457 | 6.5306 | 4.5833 | 55.2248 |
| std | 2.589309 | 17.343152 | 1.695863 | 2.867348 | 19.212558 |
| min | 1.0 | 40.0 | 4.0 | 0.0 | 10.0 |
| 25% | 3.0 | 54.0 | 5.0 | 2.0 | 40.0 |
| 50% | 5.0 | 69.0 | 7.0 | 5.0 | 55.0 |
| 75% | 7.0 | 85.0 | 8.0 | 7.0 | 71.0 |
| max | 9.0 | 99.0 | 9.0 | 9.0 | 100.0 |


In [12]:
# show summary statistics for the object (categorical) columns
data.describe(include = object)

|   | Extracurricular Activities |
|---|---|
| count | 10000 |
| unique | 2 |
| top | No |
| freq | 5052 |
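Since Extracurricular Activities takes only two values, it would need a numeric encoding before it can enter a linear regression. One possible approach, sketched on a hypothetical four-row `toy` frame, is to map Yes to 1 and No to 0; the mapping itself is an assumption and not something the notebook has specified.

```python
import pandas as pd

# Hypothetical stand-in for the categorical column
toy = pd.DataFrame({"Extracurricular Activities": ["Yes", "No", "Yes", "No"]})

# Assumed encoding: Yes -> 1, No -> 0
mapping = {"Yes": 1, "No": 0}
toy["Extracurricular Activities"] = toy["Extracurricular Activities"].map(mapping)

print(toy["Extracurricular Activities"].tolist())
```

With this coding, the regression coefficient on the column can be read as the expected difference in Performance Index between students who do and do not participate, holding the other predictors fixed.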
