# Final project
In this project, the team studies and analyzes the Student Performance Factors dataset published on Kaggle.

The analysis aims to give a comprehensive overview of the factors that affect students' academic performance in exams.

(Last update: 2/12/2024)

Group

---

## 1. Import necessary libraries

In [27]:
import pandas as pd

## 2. Exploring data

In [None]:
df = pd.read_csv('data/data.csv')
df

,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
0,23,84,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,67
1,19,64,Low,Medium,No,8,59,Low,Yes,2,Medium,Medium,Public,Negative,4,No,College,Moderate,Female,61
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,Medium,High,Public,Neutral,4,No,College,Near,Female,70
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6602,25,69,High,Medium,No,7,76,Medium,Yes,1,High,Medium,Public,Positive,2,No,High School,Near,Female,68
6603,23,76,High,Medium,No,8,81,Medium,Yes,3,Low,High,Public,Positive,2,No,High School,Near,Female,69
6604,20,90,Medium,Low,Yes,6,65,Low,Yes,3,Low,Medium,Public,Negative,2,No,Postgraduate,Near,Female,68
6605,10,86,High,High,Yes,6,91,High,Yes,2,Low,Medium,Private,Positive,3,No,High School,Far,Female,68


- Determine the number of rows and columns.

In [29]:
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

Rows: 6607, Columns: 20


- Identify the attributes in the dataset.

In [30]:
print(df.columns.tolist())

['Hours_Studied', 'Attendance', 'Parental_Involvement', 'Access_to_Resources', 'Extracurricular_Activities', 'Sleep_Hours', 'Previous_Scores', 'Motivation_Level', 'Internet_Access', 'Tutoring_Sessions', 'Family_Income', 'Teacher_Quality', 'School_Type', 'Peer_Influence', 'Physical_Activity', 'Learning_Disabilities', 'Parental_Education_Level', 'Distance_from_Home', 'Gender', 'Exam_Score']


- Determine the data type for each attribute.

In [31]:
print(df.dtypes)

Hours_Studied                  int64
Attendance                     int64
Parental_Involvement          object
Access_to_Resources           object
Extracurricular_Activities    object
Sleep_Hours                    int64
Previous_Scores                int64
Motivation_Level              object
Internet_Access               object
Tutoring_Sessions              int64
Family_Income                 object
Teacher_Quality               object
School_Type                   object
Peer_Influence                object
Physical_Activity              int64
Learning_Disabilities         object
Parental_Education_Level      object
Distance_from_Home            object
Gender                        object
Exam_Score                     int64
dtype: object
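
The non-numeric (`object`) columns above are the categorical attributes that need encoding later. A minimal sketch for listing their distinct values, assuming a small hypothetical `sample` frame in place of the notebook's `df`:

```python
import pandas as pd

# Hypothetical sample standing in for the notebook's `df`
sample = pd.DataFrame({
    "Hours_Studied": [23, 19, 24],
    "Motivation_Level": ["Low", "Low", "Medium"],
    "School_Type": ["Public", "Public", "Private"],
})

# Everything that is not numeric is treated as a categorical attribute
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Distinct values per categorical column, with missing values left out
categories = {
    col: sorted(sample[col].dropna().unique()) for col in categorical_cols
}
print(categories)
```

Listing the categories first makes it easier to confirm that every value has an entry in the mapping dictionaries used during preprocessing.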


- Compute the percentage of missing values for each attribute.

In [32]:
missing_percentage = (df.isnull().sum() / len(df)) * 100
print(missing_percentage)

Hours_Studied                 0.000000
Attendance                    0.000000
Parental_Involvement          0.000000
Access_to_Resources           0.000000
Extracurricular_Activities    0.000000
Sleep_Hours                   0.000000
Previous_Scores               0.000000
Motivation_Level              0.000000
Internet_Access               0.000000
Tutoring_Sessions             0.000000
Family_Income                 0.000000
Teacher_Quality               1.180566
School_Type                   0.000000
Peer_Influence                0.000000
Physical_Activity             0.000000
Learning_Disabilities         0.000000
Parental_Education_Level      1.362192
Distance_from_Home            1.014076
Gender                        0.000000
Exam_Score                    0.000000
dtype: float64
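
Most attributes have no missing values, so it can help to show only the affected ones. A minimal sketch, again on a hypothetical `sample` frame rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the notebook's `df`
sample = pd.DataFrame({
    "Teacher_Quality": ["Medium", np.nan, "High", "Low"],
    "Gender": ["Male", "Female", "Male", "Female"],
})

# The mean of the boolean null mask is the fraction of missing values
missing_pct = sample.isnull().mean() * 100

# Keep only the columns that actually contain missing values
missing_only = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_only)
```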


- Identify the min and max values of the numerical attributes and check whether any of them are abnormal.

In [33]:
print(df.min(numeric_only=True))
print(df.max(numeric_only=True))

Hours_Studied         1
Attendance           60
Sleep_Hours           4
Previous_Scores      50
Tutoring_Sessions     0
Physical_Activity     0
Exam_Score           55
dtype: int64
Hours_Studied         44
Attendance           100
Sleep_Hours           10
Previous_Scores      100
Tutoring_Sessions      8
Physical_Activity      6
Exam_Score           101
dtype: int64
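
In the output above, the maximum `Exam_Score` is 101, which would be out of range if exams are scored out of 100. The source does not state the scale, so the bounds below are assumptions. A minimal sketch of a range check on a hypothetical `sample` frame:

```python
import pandas as pd

# Hypothetical sample standing in for the notebook's `df`
sample = pd.DataFrame({
    "Attendance": [84, 64, 100],
    "Previous_Scores": [73, 59, 91],
    "Exam_Score": [67, 101, 74],
})

# Assumed valid ranges; these are not taken from the dataset description
expected_ranges = {
    "Attendance": (0, 100),
    "Previous_Scores": (0, 100),
    "Exam_Score": (0, 100),
}

# Count the values that fall outside each assumed range (bounds inclusive)
out_of_range = {}
for col, (low, high) in expected_ranges.items():
    mask = ~sample[col].between(low, high)
    out_of_range[col] = int(mask.sum())
print(out_of_range)
```

Whether such rows should be dropped, clipped, or kept depends on how the scores were recorded, which the notebook does not say.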


## 3. Preprocessing

- Remove rows that contain missing values, then remove duplicate rows.

In [35]:
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)
df.isnull().sum()

Hours_Studied                 0
Attendance                    0
Parental_Involvement          0
Access_to_Resources           0
Extracurricular_Activities    0
Sleep_Hours                   0
Previous_Scores               0
Motivation_Level              0
Internet_Access               0
Tutoring_Sessions             0
Family_Income                 0
Teacher_Quality               0
School_Type                   0
Peer_Influence                0
Physical_Activity             0
Learning_Disabilities         0
Parental_Education_Level      0
Distance_from_Home            0
Gender                        0
Exam_Score                    0
dtype: int64
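
Since `dropna` removes whole rows, it is worth checking how many rows the cleaning step discards. A minimal sketch on a hypothetical `sample` frame; resetting the index afterwards is an optional choice, not something the notebook does:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the notebook's `df`
sample = pd.DataFrame({
    "Hours_Studied": [23, 19, 19, 24],
    "Teacher_Quality": ["Medium", "High", "High", np.nan],
})

rows_before = len(sample)

# Drop rows with any missing value, then drop exact duplicate rows
cleaned = sample.dropna().drop_duplicates().reset_index(drop=True)

rows_after = len(cleaned)
print(f"Removed {rows_before - rows_after} of {rows_before} rows")
```

Returning a new frame instead of using `inplace=True` keeps the original data available for comparison.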

- Perform data mapping and normalization. First, map each categorical attribute to integer codes.

In [36]:
ordinal_mapping = {
    'Low': 0,
    'Medium': 1,
    'High': 2
}

binary_mapping = {
    'Yes': 1,
    'No': 0,
    'Public': 0,
    'Private': 1,
    'Male': 0,
    'Female': 1
}

peer_influence_mapping = {
    'Negative': 0,
    'Neutral': 1,
    'Positive': 2
}

parental_education_mapping = {
    'High School': 0,
    'College': 1,
    'Postgraduate': 2
}

distance_mapping = {
    'Near': 0,
    'Moderate': 1,
    'Far': 2
}

df['Parental_Involvement'] = df['Parental_Involvement'].map(ordinal_mapping)
df['Access_to_Resources'] = df['Access_to_Resources'].map(ordinal_mapping)
df['Motivation_Level'] = df['Motivation_Level'].map(ordinal_mapping)
df['Family_Income'] = df['Family_Income'].map(ordinal_mapping)
df['Teacher_Quality'] = df['Teacher_Quality'].map(ordinal_mapping)

df['Extracurricular_Activities'] = df['Extracurricular_Activities'].map(binary_mapping)
df['Internet_Access'] = df['Internet_Access'].map(binary_mapping)

df['School_Type'] = df['School_Type'].map(binary_mapping)

df['Peer_Influence'] = df['Peer_Influence'].map(peer_influence_mapping)

df['Parental_Education_Level'] = df['Parental_Education_Level'].map(parental_education_mapping)

df['Distance_from_Home'] = df['Distance_from_Home'].map(distance_mapping)

df['Gender'] = df['Gender'].map(binary_mapping)

df['Learning_Disabilities'] = df['Learning_Disabilities'].map(binary_mapping)
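
`Series.map` silently turns any value that is missing from the dictionary into `NaN`, so a misspelled or unexpected category would go unnoticed. A minimal sketch that checks coverage before mapping, assuming a hypothetical `sample` frame and a hypothetical `column_mappings` dictionary keyed by column name:

```python
import pandas as pd

# Hypothetical sample standing in for the notebook's `df`
sample = pd.DataFrame({
    "Motivation_Level": ["Low", "Medium", "High"],
    "Peer_Influence": ["Negative", "Neutral", "Positive"],
})

# Hypothetical per-column mappings, using the same codes as the cell above
column_mappings = {
    "Motivation_Level": {"Low": 0, "Medium": 1, "High": 2},
    "Peer_Influence": {"Negative": 0, "Neutral": 1, "Positive": 2},
}

encoded = sample.copy()
for col, mapping in column_mappings.items():
    # Fail loudly if a category has no code instead of producing NaN
    unmapped = set(sample[col].dropna().unique()) - set(mapping)
    if unmapped:
        raise ValueError(f"{col}: no code for {sorted(unmapped)}")
    encoded[col] = sample[col].map(mapping)
print(encoded)
```

Because the sketch writes to a copy, the cell can be re-run safely; mapping `df` in place a second time would turn the already encoded integers into `NaN`.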

In [37]:
df.head(10)

,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
0,23,84,0,2,0,7,73,0,1,0,0,1,0,2,3,0,0,0,0,67
1,19,64,0,1,0,8,59,0,1,2,1,1,0,0,4,0,1,1,1,61
2,24,98,1,1,1,7,91,1,1,2,1,1,0,1,4,0,2,0,0,74
3,29,89,0,1,1,8,98,1,1,1,1,1,0,0,4,0,0,1,0,71
4,19,92,1,1,1,6,65,1,1,3,1,2,0,1,4,0,1,0,1,70
5,19,88,1,1,1,8,89,1,1,3,1,1,0,2,3,0,2,0,0,71
6,29,84,1,0,1,7,68,0,1,1,0,1,1,1,2,0,0,1,0,67
7,25,78,0,2,1,6,50,1,1,1,2,2,0,0,2,0,0,2,0,66
8,17,94,1,2,0,6,80,2,1,0,1,0,1,1,1,0,1,0,0,69
9,23,98,1,1,1,8,71,1,1,0,2,2,0,2,5,0,0,1,0,72
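
The cells above only map categories to integers; no normalization is applied yet. One common option, assumed here rather than taken from the notebook, is min-max scaling of the feature columns to the range [0, 1]. A minimal sketch on a hypothetical `sample` frame, leaving the target `Exam_Score` unscaled:

```python
import pandas as pd

# Hypothetical sample standing in for the encoded `df`
sample = pd.DataFrame({
    "Hours_Studied": [10, 20, 30],
    "Attendance": [60, 80, 100],
    "Exam_Score": [61, 67, 74],
})

# Scale the features only; the target column is left as-is
feature_cols = ["Hours_Studied", "Attendance"]
col_min = sample[feature_cols].min()
col_max = sample[feature_cols].max()

normalized = sample.copy()
# Min-max scaling: (x - min) / (max - min); a constant column would divide by zero
normalized[feature_cols] = (sample[feature_cols] - col_min) / (col_max - col_min)
print(normalized)
```

If the data is later split into training and test sets, the minimum and maximum should come from the training set alone to avoid leaking test information.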
