# Student Performance Prediction - Data Understanding

**Goals**:
* Understand the data and its columns.
* Identify the data type of each column (e.g., integer or string/categorical).
* Determine the shape of the data.
* Summarize each numeric column (mean, minimum, maximum, etc.).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
students = pd.read_csv("../data/raw/StudentPerformanceFactors.csv")
students.head()

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
0,23,84,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,67
1,19,64,Low,Medium,No,8,59,Low,Yes,2,Medium,Medium,Public,Negative,4,No,College,Moderate,Female,61
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,Medium,High,Public,Neutral,4,No,College,Near,Female,70


In [3]:
students.shape

(6607, 20)

In [4]:
students.describe()

Unnamed: 0,Hours_Studied,Attendance,Sleep_Hours,Previous_Scores,Tutoring_Sessions,Physical_Activity,Exam_Score
count,6607.0,6607.0,6607.0,6607.0,6607.0,6607.0,6607.0
mean,19.975329,79.977448,7.02906,75.070531,1.493719,2.96761,67.235659
std,5.990594,11.547475,1.46812,14.399784,1.23057,1.031231,3.890456
min,1.0,60.0,4.0,50.0,0.0,0.0,55.0
25%,16.0,70.0,6.0,63.0,1.0,2.0,65.0
50%,20.0,80.0,7.0,75.0,1.0,3.0,67.0
75%,24.0,90.0,8.0,88.0,2.0,4.0,69.0
max,44.0,100.0,10.0,100.0,8.0,6.0,101.0
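
The `describe()` output shows an `Exam_Score` maximum of 101, which looks suspicious if scores are meant to lie between 0 and 100. The following is a minimal sketch for flagging such rows. It assumes a valid range of 0 to 100 (an assumption, not something the dataset states) and uses a small toy frame in place of `students`:

```python
import pandas as pd

# Toy stand-in for `students`; the values are illustrative only.
toy = pd.DataFrame({
    "Hours_Studied": [23, 19, 24, 29],
    "Exam_Score": [67, 61, 101, 71],
})

# Assumed valid range for exam scores (hypothetical, not documented in the data).
SCORE_MIN, SCORE_MAX = 0, 100

# Boolean mask of rows whose score falls outside the assumed range.
out_of_range = ~toy["Exam_Score"].between(SCORE_MIN, SCORE_MAX)
suspicious_rows = toy[out_of_range]
print(suspicious_rows)
```

Running the same mask against `students` would show how many rows exceed the assumed range before deciding whether to cap, drop, or keep them.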


In [5]:
students.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6607 entries, 0 to 6606
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Hours_Studied               6607 non-null   int64 
 1   Attendance                  6607 non-null   int64 
 2   Parental_Involvement        6607 non-null   object
 3   Access_to_Resources         6607 non-null   object
 4   Extracurricular_Activities  6607 non-null   object
 5   Sleep_Hours                 6607 non-null   int64 
 6   Previous_Scores             6607 non-null   int64 
 7   Motivation_Level            6607 non-null   object
 8   Internet_Access             6607 non-null   object
 9   Tutoring_Sessions           6607 non-null   int64 
 10  Family_Income               6607 non-null   object
 11  Teacher_Quality             6529 non-null   object
 12  School_Type                 6607 non-null   object
 13  Peer_Influence              6607 non-null   obje

In [6]:
students.dtypes.value_counts()

object    13
int64      7
Name: count, dtype: int64

* 13 columns have the `object` dtype, which here holds string values.
* 7 columns have the `int64` (integer) dtype.

In [7]:
int_cols = students.select_dtypes(include="int64").columns
str_cols = students.select_dtypes(include="object").columns

In [8]:
print("Integer columns\n", list(int_cols))
print("String columns\n", list(str_cols))

Integer columns
 ['Hours_Studied', 'Attendance', 'Sleep_Hours', 'Previous_Scores', 'Tutoring_Sessions', 'Physical_Activity', 'Exam_Score']
String columns
 ['Parental_Involvement', 'Access_to_Resources', 'Extracurricular_Activities', 'Motivation_Level', 'Internet_Access', 'Family_Income', 'Teacher_Quality', 'School_Type', 'Peer_Influence', 'Learning_Disabilities', 'Parental_Education_Level', 'Distance_from_Home', 'Gender']


In [9]:
# Print the frequency of each category for every string column.
# str_cols was built in cell 7, so the column names need not be retyped.
for col in str_cols:
    print(students[col].value_counts(), "\n")

Parental_Involvement
Medium    3362
High      1908
Low       1337
Name: count, dtype: int64 

Access_to_Resources
Medium    3319
High      1975
Low       1313
Name: count, dtype: int64 

Extracurricular_Activities
Yes    3938
No     2669
Name: count, dtype: int64 

Motivation_Level
Medium    3351
Low       1937
High      1319
Name: count, dtype: int64 

Internet_Access
Yes    6108
No      499
Name: count, dtype: int64 

Family_Income
Low       2672
Medium    2666
High      1269
Name: count, dtype: int64 

Teacher_Quality
Medium    3925
High      1947
Low        657
Name: count, dtype: int64 

School_Type
Public     4598
Private    2009
Name: count, dtype: int64 

Peer_Influence
Positive    2638
Neutral     2592
Negative    1377
Name: count, dtype: int64 

Learning_Disabilities
No     5912
Yes     695
Name: count, dtype: int64 

Parental_Education_Level
High School     3223
College         1989
Postgraduate    1305
Name: count, dtype: int64 

Distance_from_Home
Near        3884
Moderate

In [10]:
students.isna().sum()

Hours_Studied                  0
Attendance                     0
Parental_Involvement           0
Access_to_Resources            0
Extracurricular_Activities     0
Sleep_Hours                    0
Previous_Scores                0
Motivation_Level               0
Internet_Access                0
Tutoring_Sessions              0
Family_Income                  0
Teacher_Quality               78
School_Type                    0
Peer_Influence                 0
Physical_Activity              0
Learning_Disabilities          0
Parental_Education_Level      90
Distance_from_Home            67
Gender                         0
Exam_Score                     0
dtype: int64

In [11]:
students['Teacher_Quality'].isna().sum()

78

In [12]:
students['Parental_Education_Level'].isna().sum()

90

In [13]:
students['Distance_from_Home'].isna().sum()

67
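
The three per-column checks above can be folded into one table that lists only the columns with missing values, together with their share of all rows. This is a minimal sketch on a toy frame (the data and the names `null_counts` and `null_summary` are illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy stand-in for `students`; the values are illustrative only.
toy = pd.DataFrame({
    "Teacher_Quality": ["Medium", np.nan, "High", "Low"],
    "Distance_from_Home": ["Near", "Far", np.nan, np.nan],
    "Gender": ["Male", "Female", "Male", "Female"],
})

# Count nulls per column, then express each count as a percentage of all rows.
null_counts = toy.isna().sum()
null_summary = pd.DataFrame({
    "null_count": null_counts,
    "null_pct": (null_counts / len(toy) * 100).round(2),
})

# Keep only the columns that actually contain missing values.
null_summary = null_summary[null_summary["null_count"] > 0]
print(null_summary)
```

For the real data, 78, 90, and 67 nulls out of 6607 rows correspond to roughly 1.18%, 1.36%, and 1.01% of the rows.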

In [14]:
students[students['Teacher_Quality'].isna() & students['Parental_Education_Level'].isna() & students['Distance_from_Home'].isna()]

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score


**No row has null values in all three of Teacher_Quality, Parental_Education_Level, and Distance_from_Home at once.**
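
The filter in cell 14 only matches rows where all three columns are null together, so it does not reveal rows where two of them are null. A possible follow-up, sketched here on toy data (the frame and variable names are illustrative), counts the nulls per row across the three columns:

```python
import numpy as np
import pandas as pd

cols = ["Teacher_Quality", "Parental_Education_Level", "Distance_from_Home"]

# Toy stand-in for `students`; the values are illustrative only.
toy = pd.DataFrame({
    "Teacher_Quality": ["Medium", np.nan, "High", np.nan],
    "Parental_Education_Level": ["College", "College", np.nan, np.nan],
    "Distance_from_Home": ["Near", "Far", "Near", "Moderate"],
})

# Number of null values in each row, restricted to the three columns.
nulls_per_row = toy[cols].isna().sum(axis=1)

# How many rows have 0, 1, 2, or 3 of these columns missing.
overlap_counts = nulls_per_row.value_counts().sort_index()
print(overlap_counts)
```

A nonzero count at 2 or 3 would indicate that the missing values overlap in some rows, which matters when choosing between dropping rows and imputing.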

In [15]:
students.duplicated().sum()

0

**There are no duplicate rows.**

## Conclusion

The following conclusions are drawn from the data understanding above:
* Data contains 6607 rows and 20 columns.
* 13 columns have string datatype, 7 columns have integer datatype.
* Integer columns are **Hours_Studied**, **Attendance**, **Sleep_Hours**, **Previous_Scores**, **Tutoring_Sessions**, **Physical_Activity**, **Exam_Score**
* String columns are **Parental_Involvement**, **Access_to_Resources**, **Extracurricular_Activities**, **Motivation_Level**, **Internet_Access**, **Family_Income**, **Teacher_Quality**, **School_Type**, **Peer_Influence**, **Learning_Disabilities**, **Parental_Education_Level**, **Distance_from_Home**, **Gender**
* The data contains no duplicate rows.
* **Teacher_Quality**, **Parental_Education_Level**, and **Distance_from_Home** contain 78, 90, and 67 null values, respectively.
* No row has null values in all three of these columns at once.
* The following is the summary of all string and numeric columns.

### 1. Categorical Feature Analysis & Data Understanding

1. **Parental Involvement**

| **Level** | **Count** | **Description** |
|:---------:|:---------:|:---------------:|
| Medium | 3362 | Parents are moderately involved in the student’s academic activities |
| High | 1908 | Parents actively participate in academic guidance and monitoring |
| Low | 1337 | Minimal parental engagement in academic matters |

---

2.  **Access to Resources**

| **Level** | **Count** | **Description** |
|:---------:|:---------:|:---------------:|
| Medium | 3319 | Student has average access to learning resources (books, devices, etc.) |
| High | 1975 | Student has strong access to educational resources |
| Low | 1313 | Student has limited access to learning materials |

---

3.  **Extracurricular Activities**

| **Participation** | **Count** | **Description** |
|:-----------------:|:---------:|:---------------:|
| Yes | 3938 | Student participates in extracurricular activities |
| No | 2669 | Student does not participate in extracurricular activities |

---

4. **Motivation Level**

| **Level** | **Count** | **Description** |
|:---------:|:---------:|:---------------:|
| Medium | 3351 | Student shows moderate academic motivation |
| Low | 1937 | Student lacks motivation toward studies |
| High | 1319 | Student is highly motivated academically |

---

5. **Internet Access**

| **Access** | **Count** | **Description** |
|:----------:|:---------:|:---------------:|
| Yes | 6108 | Student has access to the internet at home |
| No | 499 | Student does not have reliable internet access |

---

6. **Family Income**

| **Level** | **Count** | **Description** |
|:---------:|:---------:|:---------------:|
| Low | 2672 | Family has low household income |
| Medium | 2666 | Family has moderate household income |
| High | 1269 | Family has high household income |

---

7. **Teacher Quality**

| **Level** | **Count** | **Description** |
|:---------:|:---------:|:---------------:|
| Medium | 3925 | Student is taught by teachers of average quality |
| High | 1947 | Student is taught by highly qualified and effective teachers |
| Low | 657 | Student is taught by teachers with lower effectiveness |

---

8. **School Type**

| **Type** | **Count** | **Description** |
|:--------:|:---------:|:---------------:|
| Public | 4598 | Student attends a government-funded public school |
| Private | 2009 | Student attends a privately funded school |

---

9. **Peer Influence**

| **Influence** | **Count** | **Description** |
|:-------------:|:---------:|:---------------:|
| Positive | 2638 | Peers positively influence academic behavior |
| Neutral | 2592 | Peers have minimal influence on academics |
| Negative | 1377 | Peers negatively affect academic performance |

---

10. **Learning Disabilities**

| **Status** | **Count** | **Description** |
|:----------:|:---------:|:---------------:|
| No | 5912 | Student has no reported learning disabilities |
| Yes | 695 | Student has a diagnosed learning disability |

---

11. **Parental Education Level**

| **Level** | **Count** | **Description** |
|:---------:|:---------:|:---------------:|
| High School | 3223 | Highest parental education level is high school |
| College | 1989 | At least one parent has a college degree |
| Postgraduate | 1305 | At least one parent has postgraduate education |

---

12. **Distance from Home**

| **Distance** | **Count** | **Description** |
|:------------:|:---------:|:---------------:|
| Near | 3884 | Student lives close to the school |
| Moderate | 1998 | Student lives at a moderate distance from school |
| Far | 658 | Student lives far from the school |

---

13. **Gender**

| **Gender** | **Count** | **Description** |
|:----------:|:---------:|:---------------:|
| Male | 3814 | Student identifies as male |
| Female | 2793 | Student identifies as female |

---
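
The count columns in the tables above come from `value_counts()`. A minimal sketch of producing such a table programmatically, including missing values and each category's share, is given below. The helper `category_table` and the toy frame are hypothetical and only illustrate the idea:

```python
import pandas as pd

# Toy stand-in for `students`; the values are illustrative only.
toy = pd.DataFrame({
    "School_Type": ["Public", "Public", "Private", "Public"],
    "Gender": ["Male", "Female", "Male", "Male"],
})


def category_table(frame, column):
    """Return counts and percentage shares for one categorical column.

    Hypothetical helper: dropna=False keeps missing values as their own row,
    so the counts always add up to the number of rows in the frame.
    """
    counts = frame[column].value_counts(dropna=False)
    shares = (counts / len(frame) * 100).round(1)
    return pd.DataFrame({"Count": counts, "Percent": shares})


school_table = category_table(toy, "School_Type")
print(school_table)
```

Passing `dropna=False` is a design choice: with the default `dropna=True`, columns such as `Teacher_Quality` would show counts that sum to fewer than 6607 rows without making the gap explicit.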


### 2. Numerical Feature Summary

| **Feature** | **Range (Min – Max)** | **Mean** | **Standard Deviation** |
|:-----------:|:---------------------:|:--------:|:----------------------:|
| Hours_Studied | 1 – 44 | 19.98 | 5.99 |
| Attendance | 60 – 100 | 79.98 | 11.55 |
| Sleep_Hours | 4 – 10 | 7.03 | 1.47 |
| Previous_Scores | 50 – 100 | 75.07 | 14.40 |
| Tutoring_Sessions | 0 – 8 | 1.49 | 1.23 |
| Physical_Activity | 0 – 6 | 2.97 | 1.03 |
| Exam_Score | 55 – 101 | 67.24 | 3.89 |
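
The table above can also be derived directly from the data rather than copied by hand from `describe()`. The following is a minimal sketch on a toy frame (the data and the names `numeric` and `summary` are illustrative):

```python
import pandas as pd

# Toy stand-in for `students`; the values are illustrative only.
toy = pd.DataFrame({
    "Hours_Studied": [10, 20, 30],
    "Sleep_Hours": [6, 7, 8],
    "Gender": ["Male", "Female", "Male"],
})

# Keep numeric columns only, then compute the four statistics per column.
numeric = toy.select_dtypes(include="number")
summary = numeric.agg(["min", "max", "mean", "std"]).T.round(2)
print(summary)
```

Here `std` is the sample standard deviation (pandas divides by n - 1 by default), which matches what `describe()` reports.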
