## **Student Performance Indicator**

***Life cycle of Machine learning Project***

1. Understanding the Problem Statement
2. Data Collection
3. Data Checks to perform
4. Exploratory data analysis
5. Data Pre-Processing
6. Model Training
7. Choose best model

### 1) Problem statement

This project examines how student performance (test scores) is affected by other variables such as gender, ethnicity, parental level of education, lunch, and test preparation course.
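
As a minimal sketch of the kind of comparison this implies, the snippet below groups a handful of rows by one categorical variable and averages the three scores. The rows are the first five records of the dataset as displayed by `df.head()` in this notebook; the helper name `mean_scores_by` is hypothetical and not part of the original analysis.

```python
import pandas as pd

# First five records of the dataset, as displayed by df.head() in this notebook.
sample = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "test_preparation_course": ["none", "completed", "none", "none", "none"],
    "math_score": [72, 69, 90, 47, 76],
    "reading_score": [72, 90, 95, 57, 78],
    "writing_score": [74, 88, 93, 44, 75],
})

SCORE_COLUMNS = ["math_score", "reading_score", "writing_score"]


def mean_scores_by(frame, factor):
    """Return the mean of each score column for every level of `factor` (hypothetical helper)."""
    return frame.groupby(factor)[SCORE_COLUMNS].mean()


by_gender = mean_scores_by(sample, "gender")
print(by_gender)
```

Five rows are far too few to support any conclusion; the same call on the full `df` is what would carry weight.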
### 2) Data Collection

Dataset source: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977

The data consists of 8 columns and 1000 rows.

### 2.1 Import Data and Required Packages

Importing the Pandas, NumPy, Matplotlib, Seaborn, and Warnings libraries.

In [45]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from plotnine import ggplot, geom_histogram, aes

Import the CSV data as a pandas DataFrame

In [2]:
df = pd.read_csv('data/stud.csv')

In [9]:
df.shape    

(1000, 8)

In [10]:
df.head()

| | gender | race_ethnicity | parental_level_of_education | lunch | test_preparation_course | math_score | reading_score | writing_score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


### Data checks to perform

1. Check missing values
2. Check Duplicates
3. Check data types
4. Check the number of unique values in each column
5. Check statistics of the data set
6. Check the categories present in each categorical column
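
The six checks above can also be gathered into a single report. The sketch below is one possible way to do that; the function name `run_data_checks` and the three-row `toy` frame are illustrative assumptions, not part of the original notebook or dataset.

```python
import pandas as pd


def run_data_checks(frame):
    """Collect the six checks listed above into one dictionary (hypothetical helper)."""
    # Treat every non-numeric column as categorical.
    categorical = frame.select_dtypes(exclude="number").columns
    return {
        "missing_values": frame.isna().sum(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "dtypes": frame.dtypes,
        "unique_counts": frame.nunique(),
        "statistics": frame.describe(include="all"),
        "categories": {col: frame[col].unique().tolist() for col in categorical},
    }


# Made-up rows for illustration only; the third row duplicates the first.
toy = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard"],
    "math_score": [72, 47, 72],
})

report = run_data_checks(toy)
print(report["duplicate_rows"])
print(report["categories"])
```

Calling `run_data_checks(df)` on the real frame should return the same information as the individual cells that follow, just in one place.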

In [11]:
# Checking Missing Values
df.isna().sum()

gender                         0
race_ethnicity                 0
parental_level_of_education    0
lunch                          0
test_preparation_course        0
math_score                     0
reading_score                  0
writing_score                  0
dtype: int64

In [12]:
# Checking Duplicates
df.duplicated().sum()

0

In [13]:
# Number of unique values in each column
df.nunique()

gender                          2
race_ethnicity                  5
parental_level_of_education     6
lunch                           2
test_preparation_course         2
math_score                     81
reading_score                  72
writing_score                  77
dtype: int64

In [14]:
# Check non-null counts and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race_ethnicity               1000 non-null   object
 2   parental_level_of_education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test_preparation_course      1000 non-null   object
 5   math_score                   1000 non-null   int64 
 6   reading_score                1000 non-null   int64 
 7   writing_score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


In [15]:
# Checking the descriptive statistics
df.describe(include="all")

| | gender | race_ethnicity | parental_level_of_education | lunch | test_preparation_course | math_score | reading_score | writing_score |
|---|---|---|---|---|---|---|---|---|
| count | 1000 | 1000 | 1000 | 1000 | 1000 | 1000.0 | 1000.0 | 1000.0 |
| unique | 2 | 5 | 6 | 2 | 2 | | | |
| top | female | group C | some college | standard | none | | | |
| freq | 518 | 319 | 226 | 645 | 642 | | | |
| mean | | | | | | 66.089 | 69.169 | 68.054 |
| std | | | | | | 15.16308 | 14.600192 | 15.195657 |
| min | | | | | | 0.0 | 17.0 | 10.0 |
| 25% | | | | | | 57.0 | 59.0 | 57.75 |
| 50% | | | | | | 66.0 | 70.0 | 69.0 |
| 75% | | | | | | 77.0 | 79.0 | 79.0 |


In [19]:
# Exploring the data    
print("Categories in 'gender' variable: ", end=" ")
print(df['gender'].unique())

print("Categories in 'race/ethnicity' variable: ", end=" ")
print(df['race_ethnicity'].unique())

print("Categories in 'parental level of education' variable: ", end=" ")
print(df['parental_level_of_education'].unique())

print("Categories in 'lunch' variable: ", end=" ")
print(df['lunch'].unique())

print("Categories in 'test preparation course' variable: ", end=" ")
print(df['test_preparation_course'].unique())

Categories in 'gender' variable:  ['female' 'male']
Categories in 'race/ethnicity' variable:  ['group B' 'group C' 'group A' 'group D' 'group E']
Categories in 'parental level of education' variable:  ["bachelor's degree" 'some college' "master's degree" "associate's degree"
 'high school' 'some high school']
Categories in 'lunch' variable:  ['standard' 'free/reduced']
Categories in 'test preparation course' variable:  ['none' 'completed']


In [22]:
# Define categorical and numerical columns
numerical_features = df.select_dtypes(exclude="object").columns.tolist()
categorical_features = df.select_dtypes(include="object").columns.tolist()

In [24]:
print("We have {} numerical features: {}".format(len(numerical_features), numerical_features))
print("We have {} categorical features: {}".format(len(categorical_features), categorical_features))

We have 3 numerical features: ['math_score', 'reading_score', 'writing_score']
We have 5 categorical features: ['gender', 'race_ethnicity', 'parental_level_of_education', 'lunch', 'test_preparation_course']


In [26]:
# Adding columns for Total Score and Average
df['total_score'] = df['math_score'] + df['reading_score'] + df['writing_score']
df['average'] = df['total_score'] / 3
df.head()

| | gender | race_ethnicity | parental_level_of_education | lunch | test_preparation_course | math_score | reading_score | writing_score | total_score | average |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 | 218 | 72.666667 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 | 247 | 82.333333 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 | 278 | 92.666667 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 | 148 | 49.333333 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 | 229 | 76.333333 |
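
An equivalent way to derive both columns, sketched here as an alternative rather than as the notebook's own method, is row-wise aggregation with `sum(axis=1)` and `mean(axis=1)`. The `scores` frame below holds only the three score columns of the five rows displayed above, so the result can be checked against that output.

```python
import pandas as pd

# Score columns of the five rows displayed above.
scores = pd.DataFrame({
    "math_score": [72, 69, 90, 47, 76],
    "reading_score": [72, 90, 95, 57, 78],
    "writing_score": [74, 88, 93, 44, 75],
})

score_columns = ["math_score", "reading_score", "writing_score"]

# Row-wise sum and mean across the three score columns.
scores["total_score"] = scores[score_columns].sum(axis=1)
scores["average"] = scores[score_columns].mean(axis=1)

print(scores[["total_score", "average"]])
```

Listing the columns once in `score_columns` avoids repeating them if another subject is added later.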


In [33]:
# Count students with a full score (100) in each subject
reading_full = int((df["reading_score"] == 100).sum())
writing_full = int((df["writing_score"] == 100).sum())
math_full = int((df["math_score"] == 100).sum())

In [34]:
print(f"The number of people scoring full marks in reading is {reading_full}")
print(f"The number of people scoring full marks in writing is {writing_full}")
print(f"The number of people scoring full marks in math is {math_full}")

The number of people scoring full marks in reading is 17
The number of people scoring full marks in writing is 14
The number of people scoring full marks in math is 7


In [35]:
# Count students scoring 40 or less in each subject
reading_less_40 = int((df["reading_score"] <= 40).sum())
writing_less_40 = int((df["writing_score"] <= 40).sum())
math_less_40 = int((df["math_score"] <= 40).sum())

In [36]:
print(f"The number of people scoring 40 or less in reading is {reading_less_40}")
print(f"The number of people scoring 40 or less in writing is {writing_less_40}")
print(f"The number of people scoring 40 or less in math is {math_less_40}")

The number of people scoring 40 or less in reading is 27
The number of people scoring 40 or less in writing is 35
The number of people scoring 40 or less in math is 50
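
Both tallies follow the same pattern: apply a condition to a score column and count the rows that satisfy it. A small helper can make that explicit. The sketch below assumes a hypothetical function `count_scores` and uses made-up values in `toy`, not rows from the dataset.

```python
import pandas as pd


def count_scores(frame, columns, condition):
    """Count, per column, the rows whose score satisfies `condition` (hypothetical helper)."""
    return {col: int(condition(frame[col]).sum()) for col in columns}


# Made-up scores for illustration only.
toy = pd.DataFrame({
    "math_score": [100, 35, 40, 76],
    "reading_score": [100, 100, 41, 78],
})
cols = ["math_score", "reading_score"]

full_marks = count_scores(toy, cols, lambda s: s == 100)
at_most_40 = count_scores(toy, cols, lambda s: s <= 40)

print(full_marks)
print(at_most_40)
```

Passing the condition as a function keeps the threshold and the comparison operator in one visible place, which makes a mismatch between a label and its condition easier to spot.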


In [41]:
import plotnine  # the earlier cell imported only names from plotnine, not the module itself

print(plotnine.__all__)

['coord_cartesian', 'coord_fixed', 'coord_equal', 'coord_flip', 'coord_trans', 'facet_grid', 'facet_null', 'facet_wrap', 'label_value', 'label_both', 'label_context', 'labeller', 'as_labeller', 'annotate', 'annotation_logticks', 'annotation_stripes', 'geom_abline', 'geom_area', 'geom_bar', 'geom_bin_2d', 'geom_bin2d', 'geom_blank', 'geom_boxplot', 'geom_col', 'geom_count', 'geom_crossbar', 'geom_density', 'geom_density_2d', 'geom_dotplot', 'geom_errorbar', 'geom_errorbarh', 'geom_freqpoly', 'geom_histogram', 'geom_hline', 'geom_jitter', 'geom_label', 'geom_line', 'geom_linerange', 'geom_map', 'arrow', 'geom_path', 'geom_point', 'geom_pointdensity', 'geom_pointrange', 'geom_quantile', 'geom_qq', 'geom_qq_line', 'geom_polygon', 'geom_raster', 'geom_rect', 'geom_ribbon', 'geom_rug', 'geom_segment', 'geom_sina', 'geom_smooth', 'geom_spoke', 'geom_step', 'geom_text', 'geom_tile', 'geom_violin', 'geom_vline', 'guides', 'ggplot', 'ggsave', 'save_as_pdf_pages', 'guide_colorbar', 'guide_colourb