# Categorical Analysis

## Dataset: Exam Score Prediction

Dataset link: https://www.kaggle.com/datasets/kundanbedmutha/exam-score-prediction-dataset

Total number of columns: 13

## Details about the dataset

This dataset provides an extensive and realistic representation of the factors that contribute to student exam performance.

- `student_id`: unique ID for each student
- `age`: age of the student
- `gender`: gender of the student
- `course`: course the student is enrolled in
- `study_hours`: number of hours spent studying per day
- `class_attendance`: attendance percentage
- `internet_access`: whether the student has internet access (yes/no)
- `sleep_hours`: daily sleep duration in hours
- `sleep_quality`: quality of sleep (poor/average/good)
- `study_method`: primary study technique
- `facility_rating`: rating of the academic facilities (low/medium/high)
- `exam_difficulty`: difficulty of the exam (easy/moderate/hard)
- `exam_score`: final exam score (from 1 to 100)

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv("Data/Exam_Score_Prediction.csv")

In [3]:
df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   student_id        20000 non-null  int64  
 1   age               20000 non-null  int64  
 2   gender            20000 non-null  object 
 3   course            20000 non-null  object 
 4   study_hours       20000 non-null  float64
 5   class_attendance  20000 non-null  float64
 6   internet_access   20000 non-null  object 
 7   sleep_hours       20000 non-null  float64
 8   sleep_quality     20000 non-null  object 
 9   study_method      20000 non-null  object 
 10  facility_rating   20000 non-null  object 
 11  exam_difficulty   20000 non-null  object 
 12  exam_score        20000 non-null  float64
dtypes: float64(4), int64(2), object(7)
memory usage: 2.0+ MB


Every column reports 20000 non-null values out of 20000 entries, so the dataset has no missing values and no imputation is needed.
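The non-null counts from `df.info()` can also be checked programmatically. The following is a minimal sketch rather than part of the original notebook: `missing_report` is a hypothetical helper, and `sample` is a small hand-built frame standing in for the full `df`.

```python
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.Series:
    """Return missing-value counts, keeping only columns that have any."""
    counts = frame.isna().sum()
    return counts[counts > 0]


# Hypothetical stand-in for df: a few hand-copied rows, for illustration only.
sample = pd.DataFrame({
    "student_id": [1, 2, 3],
    "gender": ["male", "other", "male"],
    "exam_score": [58.9, 54.8, 90.3],
})

report = missing_report(sample)
print(report.empty)  # an empty report means no column has missing values
```

Running `missing_report(df)` on the full dataset should give an empty Series if the `df.info()` output above is accurate.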

In [4]:
df.head()

| | student_id | age | gender | course | study_hours | class_attendance | internet_access | sleep_hours | sleep_quality | study_method | facility_rating | exam_difficulty | exam_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 17 | male | diploma | 2.78 | 92.9 | yes | 7.4 | poor | coaching | low | hard | 58.9 |
| 1 | 2 | 23 | other | bca | 3.37 | 64.8 | yes | 4.6 | average | online videos | medium | moderate | 54.8 |
| 2 | 3 | 22 | male | b.sc | 7.88 | 76.8 | yes | 8.5 | poor | coaching | high | moderate | 90.3 |
| 3 | 4 | 20 | other | diploma | 0.67 | 48.4 | yes | 5.8 | average | online videos | low | moderate | 29.7 |
| 4 | 5 | 20 | female | diploma | 0.89 | 71.6 | yes | 9.8 | poor | coaching | low | moderate | 43.7 |


In [6]:
df = df.set_index('student_id')
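Using `student_id` as the index only makes sense if every ID is unique. As a hedged sketch (not something the original notebook does), `set_index` accepts `verify_integrity=True`, which raises a `ValueError` when the new index contains duplicates. The `sample` frame below is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for df, for illustration only.
sample = pd.DataFrame({
    "student_id": [1, 2, 3],
    "age": [17, 23, 22],
})

# verify_integrity=True makes pandas raise ValueError on duplicate IDs
# instead of silently creating a non-unique index.
indexed = sample.set_index("student_id", verify_integrity=True)
print(indexed.index.is_unique)
```

The check costs one extra pass over the column, which is negligible for 20000 rows.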

In [12]:
df.columns

Index(['age', 'gender', 'course', 'study_hours', 'class_attendance',
       'internet_access', 'sleep_hours', 'sleep_quality', 'study_method',
       'facility_rating', 'exam_difficulty', 'exam_score'],
      dtype='object')

In [8]:
df['sleep_quality'].value_counts()

sleep_quality
average    6694
poor       6687
good       6619
Name: count, dtype: int64
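`value_counts()` sorts by frequency, so the natural order of the quality levels is lost. One possible approach, sketched here under the assumption that the levels rank as poor < average < good, is to convert the column to an ordered categorical. The `sample` Series and the name `quality_dtype` are illustrative only.

```python
import pandas as pd

# Assumed ranking of the levels; adjust if the intended order differs.
quality_dtype = pd.CategoricalDtype(
    categories=["poor", "average", "good"], ordered=True
)

# Illustrative values standing in for df['sleep_quality'].
sample = pd.Series(["poor", "average", "good", "poor"], name="sleep_quality")
ordered = sample.astype(quality_dtype)

# sort_index() on a categorical index follows the category order,
# so the counts come out as poor, average, good.
counts = ordered.value_counts().sort_index()
print(counts)

# Ordered categoricals also support comparisons and min/max.
print(ordered.max())
```

The same pattern could apply to `facility_rating` and `exam_difficulty`, which also have a natural order.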

In [10]:
df['gender'].value_counts()

gender
other     6726
male      6695
female    6579
Name: count, dtype: int64

In [11]:
df['course'].value_counts()

course
bca        2902
ba         2896
b.sc       2878
b.com      2864
bba        2836
diploma    2826
b.tech     2798
Name: count, dtype: int64
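The per-column `value_counts()` calls above can be generalised to all categorical columns at once. This is a minimal sketch, not the notebook's own code: `categorical_shares` is a hypothetical helper, it assumes every non-numeric column is categorical (which matches the `df.info()` output), and `sample` is a small stand-in for `df`.

```python
import pandas as pd


def categorical_shares(frame: pd.DataFrame) -> dict:
    """Map each non-numeric column to the relative frequency of its values."""
    # Assumption: every non-numeric column is a categorical feature.
    cat_cols = frame.select_dtypes(exclude="number").columns
    return {col: frame[col].value_counts(normalize=True) for col in cat_cols}


# Stand-in for df: a few hand-copied rows, for illustration only.
sample = pd.DataFrame({
    "age": [17, 23, 22, 20, 20],
    "gender": ["male", "other", "male", "other", "female"],
    "course": ["diploma", "bca", "b.sc", "diploma", "diploma"],
})

shares = categorical_shares(sample)
for col, dist in shares.items():
    print(col, dist.round(2).to_dict())
```

Relative frequencies make it easier to judge balance: the counts shown above for `gender` and `course` are close to uniform across their categories.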

In [16]:
df.describe(include='all')

| | age | gender | course | study_hours | class_attendance | internet_access | sleep_hours | sleep_quality | study_method | facility_rating | exam_difficulty | exam_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 20000.0 | 20000 | 20000 | 20000.0 | 20000.0 | 20000 | 20000.0 | 20000 | 20000 | 20000 | 20000 | 20000.0 |
| unique | | 3 | 7 | | | 2 | | 3 | 5 | 3 | 3 | |
| top | | other | bca | | | yes | | average | self-study | medium | moderate | |
| freq | | 6726 | 2902 | | | 16988 | | 6694 | 4079 | 6760 | 9878 | |
| mean | 20.4733 | | | 4.007604 | 70.017365 | | 7.00856 | | | | | 62.513225 |
| std | 2.284458 | | | 2.308313 | 17.282262 | | 1.73209 | | | | | 18.908491 |
| min | 17.0 | | | 0.08 | 40.6 | | 4.1 | | | | | 19.599 |
| 25% | 18.0 | | | 2.0 | 55.1 | | 5.5 | | | | | 48.8 |
| 50% | 20.0 | | | 4.04 | 69.9 | | 7.0 | | | | | 62.6 |
| 75% | 22.0 | | | 6.0 | 85.0 | | 8.5 | | | | | 76.3 |
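With `include='all'`, statistics that do not apply to a column's type are left empty, which makes the combined table sparse. A possible alternative, sketched here on a small hypothetical `sample` frame rather than the full `df`, is to describe the numeric and categorical columns separately.

```python
import pandas as pd

# Stand-in for df: a few hand-copied rows, for illustration only.
sample = pd.DataFrame({
    "age": [17, 23, 22, 20, 20],
    "gender": ["male", "other", "male", "other", "female"],
    "exam_score": [58.9, 54.8, 90.3, 29.7, 43.7],
})

# Numeric columns: count, mean, std, min, quartiles, max.
num_summary = sample.describe(include=["number"])

# Non-numeric columns: count, unique, top, freq.
cat_summary = sample.describe(exclude=["number"])

print(num_summary)
print(cat_summary)
```

Each resulting table contains only the rows that are meaningful for its column type.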
