### **STUDENT PERFORMANCE INDICATOR** 


### ***LIFE CYCLE OF MACHINE LEARNING PROJECT***

1. Understanding problem statement
2. Data Collection
3. Data checks to perform
4. Exploratory Data Analysis
5. Data Pre-processing
6. Model Training
7. Choose best model

### **1. Problem Statement**

This project aims to understand how students' performance (test scores) is affected by factors such as gender, ethnicity, parental level of education, lunch, and test preparation course.

### **2. Data Collection**
- Data Source : https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977
- The dataset has 8 columns and 1000 rows

### 2.1 Import Data and Required Packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Import CSV Data as Python DataFrame

In [None]:
df = pd.read_csv('data/stud.csv')

### Show Top 5 Records

In [None]:
df.head()

### Shape Of Dataset

In [None]:
df.shape

### 2.2 Dataset Information

1. Gender : Sex of the student (Male/Female)
2. Ethnicity : Group (A/B/C/D/E)
3. Lunch : Standard or Free/Reduced
4. Parental Level of Education : bachelor's degree, some college, master's degree, associate's degree, high school, some high school
5. Test Preparation Course : None, Completed
6. Reading Score
7. Math Score
8. Writing Score





### **3. Data Checks To Perform**
- Check missing value
- Check duplicates
- Check Data Type
- Check Number of Unique Values for each column
- Check statistics of dataset
- Check various categories present in different categorical columns
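
The checks listed above can be bundled into one helper, which makes it easy to rerun them after any cleaning step. This is a minimal sketch: the function name `run_basic_checks` and the toy data are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def run_basic_checks(frame: pd.DataFrame) -> dict:
    """Collect the basic data-quality checks in one dictionary."""
    return {
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "unique_per_column": frame.nunique().to_dict(),
    }


# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "gender": ["female", "male", "male", "female"],
    "math_score": [72, 47, 47, None],
})

report = run_basic_checks(toy)
for check, result in report.items():
    print(check, ":", result)
```

On the real data the same call would be `run_basic_checks(df)`.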

### 3.1 Check Missing Value

In [None]:
df.isna().sum()

### There are no missing values in the dataset

### 3.2 Check Duplicates

In [None]:
df.duplicated().sum()

### There are no duplicates in the dataset 

### 3.3 Check Data Types

In [None]:
df.info()

### 3.4 Checking the number of unique values of each column

In [None]:
df.nunique()

### 3.5 Statistics of data

In [None]:
df.describe()

### Insights
- From the above description, the means of all the numerical features are very close to each other.
- All standard deviations are also close, between 14.6 and 15.19.
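
One way to back up the "close to each other" reading is to pull only the `mean` and `std` rows out of `describe()` and look at their range across columns. The sketch below uses hypothetical toy scores, so the numbers it prints are not the dataset's.

```python
import pandas as pd

# Hypothetical toy scores, not the real dataset
toy = pd.DataFrame({
    "math_score": [60, 70, 80],
    "reading_score": [62, 72, 82],
    "writing_score": [61, 71, 81],
})

# Keep only the two rows the insight talks about
summary = toy.describe().loc[["mean", "std"]]

# A small range means the columns are on a similar scale
mean_range = summary.loc["mean"].max() - summary.loc["mean"].min()
std_range = summary.loc["std"].max() - summary.loc["std"].min()

print(summary)
print("Range of means:", mean_range)
print("Range of standard deviations:", std_range)
```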


### 3.6 Exploring Data


In [None]:
df.head()

In [None]:
# Print the distinct categories of every non-numeric column
for cat in df.select_dtypes(exclude='number').columns:
    print(f"Categories in {cat} : {df[cat].unique()}")

In [None]:
# Define numerical and categorical features
numerical_features = df.select_dtypes(include='number').columns.tolist()
categorical_features = df.select_dtypes(exclude='number').columns.tolist()

# Print the features
print(f'We have {len(numerical_features)} numerical features : {numerical_features}')
print(f'We have {len(categorical_features)} categorical features : {categorical_features}')

### 3.7 Adding new columns for Average and Total Scores

In [None]:
df['Total'] = df['math_score'] + df['reading_score'] + df['writing_score']
df['Average'] = df['Total'] / 3
df.head()
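
An equivalent way to build the two columns is to aggregate row-wise over a list of score columns, which avoids retyping the column names and the divisor if a subject is added later. This is a sketch on hypothetical toy data; the list name `score_cols` is an assumption.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "math_score": [72, 90],
    "reading_score": [72, 95],
    "writing_score": [74, 93],
})

score_cols = ["math_score", "reading_score", "writing_score"]

# axis=1 aggregates across the columns of each row
toy["Total"] = toy[score_cols].sum(axis=1)
toy["Average"] = toy[score_cols].mean(axis=1)

print(toy)
```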

In [None]:
print("Number of students with full marks in Math:",df[df['math_score']==100]['Average'].count())
print("Number of students with full marks in Reading:",df[df['reading_score']==100]['Average'].count())
print("Number of students with full marks in Writing:",df[df['writing_score']==100]['Average'].count())


In [None]:
print("Number of students who scored <=20 in Math:",df[df['math_score']<=20]['Average'].count())
print("Number of students who scored <=20 in Reading:",df[df['reading_score']<=20]['Average'].count())
print("Number of students who scored <=20 in Writing:",df[df['writing_score']<=20]['Average'].count())

### Insights
- Students performed best in reading.
- Students performed worst in math.
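
The six counts printed above can also be gathered into one small table, which makes the subjects easier to compare side by side. The sketch below runs on hypothetical toy scores; the thresholds 100 and 20 are the ones used in the cells above.

```python
import pandas as pd

# Hypothetical toy scores, not the real dataset
toy = pd.DataFrame({
    "math_score": [100, 18, 55, 70],
    "reading_score": [100, 30, 100, 65],
    "writing_score": [95, 20, 60, 100],
})

score_cols = ["math_score", "reading_score", "writing_score"]

# eq/le return boolean frames; summing counts the True values per column
summary = pd.DataFrame({
    "full_marks": toy[score_cols].eq(100).sum(),
    "at_most_20": toy[score_cols].le(20).sum(),
})

print(summary)
```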

### 4. Data Visualization

#### 4.1 Visualize average score distribution to make some conclusion. 
- Histogram
- Kernel Density Estimate (KDE)

#### 4.1.1 Histogram & KDE

In [None]:
fig, ax =plt.subplots(1,2, figsize=(12,5))
plt.subplot(1,2,1)
sns.histplot(df['Average'],kde=True,bins=30,color='blue')
plt.title('Distribution of Average Scores')
plt.subplot(1,2,2)
sns.histplot(data=df,x='Average',kde=True,bins=30,hue='gender')
plt.show()

In [None]:
fig, ax =plt.subplots(1,2, figsize=(12,5))
plt.subplot(1,2,1)
sns.histplot(df['Total'],kde=True,bins=30,color='blue')
plt.title('Distribution of Total Scores')
plt.subplot(1,2,2)
sns.histplot(data=df,x='Total',kde=True,bins=30,hue='gender')
plt.show()

#### Insights
- Female students tend to score higher than male students.


In [None]:
plt.subplots(1,3,figsize=(10,6))
plt.subplot(1,3,1)
sns.histplot(data=df,x='Average',kde=True,bins=30,hue='lunch')
plt.subplot(1,3,2)
sns.histplot(data=df[df['gender']=='female'],x='Average',kde=True,bins=30,hue='lunch')
plt.subplot(1,3,3)
sns.histplot(data=df[df['gender']=='male'],x="Average",kde=True,bins=30,hue='lunch')
plt.show()

### Insight
- Students with a "standard" lunch tend to have higher average scores, for both female and male students.
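
The same comparison can be read off a pivot table of mean scores by gender and lunch type, which is easier to quote than overlapping histograms. This is a sketch on hypothetical toy values, so the gap it prints says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "gender": ["female", "female", "female", "female",
               "male", "male", "male", "male"],
    "lunch": ["standard", "standard", "free/reduced", "free/reduced",
              "standard", "standard", "free/reduced", "free/reduced"],
    "Average": [74.0, 76.0, 64.0, 66.0, 69.0, 71.0, 58.0, 62.0],
})

# Mean Average for every gender/lunch combination
table = toy.pivot_table(index="gender", columns="lunch",
                        values="Average", aggfunc="mean")

# Positive values mean the standard-lunch group scored higher
gap = table["standard"] - table["free/reduced"]

print(table)
print(gap)
```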

In [None]:
plt.subplots(1,3,figsize=(15,8))
plt.subplot(1,3,1)
sns.histplot(data=df,x='Average',kde=True,bins=30,hue='parental_level_of_education')
plt.subplot(1,3,2)
sns.histplot(data=df[df['gender']=='female'],x='Average',kde=True,bins=30,hue='parental_level_of_education')
plt.subplot(1,3,3)
sns.histplot(data=df[df['gender']=='male'],x="Average",kde=True,bins=30,hue='parental_level_of_education')
plt.show()

### Insights
- In these histograms, the parental level of education shows no clear effect on student performance.

In [None]:
plt.subplots(1,3,figsize=(10,6))
plt.subplot(1,3,1)
sns.histplot(data=df,x='Average',kde=True,bins=30,hue='race_ethnicity')
plt.subplot(1,3,2)
sns.histplot(data=df[df['gender']=='female'],x='Average',kde=True,bins=30,hue='race_ethnicity')
plt.subplot(1,3,3)
sns.histplot(data=df[df['gender']=='male'],x="Average",kde=True,bins=30,hue='race_ethnicity')
plt.show()

### Insights
- Most students from Group C perform well.
- Students from Group A tend to perform poorly in exams.

### Score distribution of students in all three subjects

In [None]:
plt.figure(figsize=(18,8))
# Three box plots side by side, one per subject
plt.subplot(1, 3, 1)
plt.title('MATH SCORES')
sns.boxplot(y='math_score',data=df,color='red',linewidth=3)
plt.subplot(1, 3, 2)
plt.title('READING SCORES')
sns.boxplot(y='reading_score',data=df,color='green',linewidth=3)
plt.subplot(1, 3, 3)
plt.title('WRITING SCORES')
sns.boxplot(y='writing_score',data=df,color='blue',linewidth=3)
plt.show()

### Insights
- Most students scored between 60 and 80 in all three subjects.
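
To put a number on "most students", the share of scores inside the 60 to 80 band can be computed per subject. The sketch below uses hypothetical toy scores; `Series.between` includes both endpoints by default.

```python
import pandas as pd

# Hypothetical toy scores, not the real dataset
toy = pd.DataFrame({
    "math_score": [55, 65, 75, 85],
    "reading_score": [60, 70, 80, 90],
    "writing_score": [61, 62, 63, 64],
})

score_cols = ["math_score", "reading_score", "writing_score"]

# The mean of a boolean Series is the fraction of True values
share_in_band = toy[score_cols].apply(lambda s: s.between(60, 80).mean())

print(share_in_band)
```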

#### 4.3 Univariate analysis using pie plots

In [None]:
plt.rcParams['figure.figsize'] = (40, 20)

plt.subplot(2, 3, 1)
size = df['gender'].value_counts()

color = ['red','green']


plt.pie(size, colors = color, labels = size.index,autopct = '%.2f%%')
plt.title('Gender', fontsize = 25)
plt.axis('off')



plt.subplot(2, 3, 2)
size = df['race_ethnicity'].value_counts()

color = ['red', 'green', 'blue', 'cyan','orange']

plt.pie(size, colors = color,labels = size.index,autopct = '%.2f%%')
plt.title('Race/Ethnicity', fontsize = 25)
plt.axis('off')



plt.subplot(2, 3, 3)
size = df['lunch'].value_counts()

color = ['red','green']

plt.pie(size, colors = color,labels = size.index,autopct = '%.2f%%')
plt.title('Lunch', fontsize = 25)
plt.axis('off')


plt.subplot(2, 3, 4)
size = df['test_preparation_course'].value_counts()

color = ['red','green']

plt.pie(size, colors = color,labels = size.index,autopct = '%.2f%%')
plt.title('Test Course', fontsize = 25)
plt.axis('off')


plt.subplot(2, 3, 5)
size = df['parental_level_of_education'].value_counts()

color = ['red', 'green', 'blue', 'cyan','orange','grey']

plt.pie(size, colors = color,labels = size.index,autopct = '%.2f%%')
plt.title('Parental Education', fontsize = 25)
plt.axis('off')


plt.tight_layout()
plt.grid()

plt.show()

### Insights
- The number of male and female students is almost equal.
- The largest group of students is Group C.
- Most students have a standard lunch.
- More students have not enrolled in a test preparation course than have completed one.
- The most common parental education level is "some college", followed closely by "associate's degree".
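
The percentages shown on the pie charts can also be obtained as numbers with `value_counts(normalize=True)`, which is handy when a claim such as "almost equal" needs an exact figure. This is a sketch on a hypothetical toy column.

```python
import pandas as pd

# Hypothetical toy column, not the real dataset
toy = pd.Series(["group C", "group C", "group D", "group A"],
                name="race_ethnicity")

# normalize=True returns fractions; convert them to rounded percentages
shares = toy.value_counts(normalize=True).mul(100).round(1)

print(shares)
```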

#### 4.4 Feature Wise Analysis

#### 4.4.1 Gender (does gender impact the performance of students?)

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(data=df,x='gender',hue='gender')
plt.title('Gender Distribution', fontsize=20)
plt.show()

#### Insight
- The gender ratio is almost equal.

In [None]:
gender_group = df.groupby('gender')['Average'].mean()
gender_group

In [None]:
plt.figure(figsize=(8,5))
gender_group.plot(kind='bar', color=['pink','lightblue'])
plt.title('Average Scores by Gender')
plt.xlabel('Gender')
plt.ylabel('Average Score')
plt.show()

### Insights
- On average, female students score higher than male students.

#### 4.4.2 RACE/ETHNICITY COLUMN
- How are students distributed across the groups?
- Does race/ethnicity have any impact on students' performance?

In [None]:
plt.figure(figsize=(8,5))
plt.pie(x=df['race_ethnicity'].value_counts(),labels=df['race_ethnicity'].value_counts().index,autopct='%1.1f%%',colors=sns.color_palette('pastel'))
plt.title("Race/Ethnicity Distribution")
plt.show()

In [None]:
race_group=df.groupby('race_ethnicity')['Average'].mean()
race_group

In [None]:
plt.figure(figsize=(8,5))
race_group.plot(kind='bar', color=sns.color_palette('pastel'))
plt.title('Average Scores by Race/Ethnicity')
plt.xlabel('Race/Ethnicity')
plt.ylabel('Average Score')
plt.show()

### Insights 
- Group C has the most students, followed by Group D.
- Group A has the fewest students.
- Group E performed best among all the groups.
- Group A performed the poorest.
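
Group size and group performance are separate questions, so it can help to compute both in a single table. The sketch below uses named aggregation on hypothetical toy data; the column names `students` and `mean_average` are assumptions chosen for readability.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "race_ethnicity": ["group A", "group A", "group C",
                       "group C", "group C", "group E"],
    "Average": [50.0, 60.0, 65.0, 70.0, 75.0, 80.0],
})

# Count and mean per group, best-performing group first
summary = (
    toy.groupby("race_ethnicity")["Average"]
    .agg(students="count", mean_average="mean")
    .sort_values("mean_average", ascending=False)
)

print(summary)
```

In this toy example the largest group is not the best-performing one, which is why the two columns are worth reading together.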

#### 4.4.3 PARENTAL LEVEL OF EDUCATION COLUMN
- What is the educational background of the students' parents?
- Does parental education have any impact on students' performance?

In [None]:
plt.figure(figsize=(8,5))
plt.pie(x=df['parental_level_of_education'].value_counts(),labels=df['parental_level_of_education'].value_counts().index,autopct='%1.1f%%',colors=sns.color_palette('pastel'))
plt.title("Parent Level of Education Distribution")
plt.show()

In [None]:
parent_group=df.groupby('parental_level_of_education')['Average'].mean()
parent_group

In [None]:
plt.figure(figsize=(8,5))
parent_group.plot(kind='bar', color=sns.color_palette('pastel'))
plt.title('Average Scores by Parental Level of Education')
plt.xlabel('Parental Level of Education')
plt.ylabel('Average Score')
plt.show()

#### Insights
- The largest group of parents has "some college" education.
- Students whose parents have a master's or bachelor's degree performed well.
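
By default the group means are listed alphabetically, which hides any trend across education levels. One option is to treat the column as an ordered categorical before grouping. This is a sketch on hypothetical toy data, and the ranking in `education_order` is an assumption (in particular the relative position of "some college" and "associate's degree").

```python
import pandas as pd

# Assumed ranking from lowest to highest level of education
education_order = [
    "some high school", "high school", "some college",
    "associate's degree", "bachelor's degree", "master's degree",
]

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "parental_level_of_education": ["master's degree", "high school",
                                    "bachelor's degree", "high school"],
    "Average": [80.0, 60.0, 75.0, 64.0],
})

toy["parental_level_of_education"] = pd.Categorical(
    toy["parental_level_of_education"],
    categories=education_order,
    ordered=True,
)

# observed=True keeps only the levels that actually occur in the data
means = toy.groupby("parental_level_of_education", observed=True)["Average"].mean()

print(means)
```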

#### 4.4.4 LUNCH COLUMN 
- Which type of lunch is most common among students?
- What is the effect of lunch type on test results?

In [None]:
plt.figure(figsize=(8,5))
plt.pie(x=df['lunch'].value_counts(),labels=df['lunch'].value_counts().index,autopct='%1.1f%%',colors=sns.color_palette('pastel'))
plt.title("Lunch Distribution")
plt.show()

In [None]:
lunch_group=df.groupby('lunch')['Average'].mean()
lunch_group

In [None]:
plt.figure(figsize=(8,5))
lunch_group.plot(kind='bar', color=sns.color_palette('pastel'))
plt.title('Average Scores by Lunch')
plt.xlabel('Lunch')
plt.ylabel('Average Score')
plt.show()

### Insights
- Most students have a standard lunch, and they tend to perform better than students with a free/reduced lunch.

#### 4.4.5 TEST PREPARATION COURSE COLUMN 
- Does the test preparation course have any impact on students' performance?

In [None]:
test_group=df.groupby('test_preparation_course')['Average'].mean()
test_group

In [None]:
plt.figure(figsize=(8,5))
test_group.plot(kind='bar', color=sns.color_palette('pastel'))
plt.title('Average Scores by Test Preparation Course')
plt.xlabel('Test Preparation Course')
plt.ylabel('Average Score')
plt.show()

#### Insights
- Students who completed the test preparation course tend to perform better than those who did not.
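
A bar chart shows the direction of the difference but not its size relative to the spread of scores. One possible way to express that is a standardized mean difference (Cohen's d with a pooled standard deviation). This measure is not used in the original notebook; the sketch below is an optional addition on hypothetical toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "test_preparation_course": ["completed"] * 3 + ["none"] * 3,
    "Average": [70.0, 75.0, 80.0, 60.0, 65.0, 70.0],
})

# Mean, sample variance and size of each group
stats = toy.groupby("test_preparation_course")["Average"].agg(["mean", "var", "count"])
done = stats.loc["completed"]
not_done = stats.loc["none"]

# Pooled variance weights each group's variance by its degrees of freedom
pooled_var = (
    (done["count"] - 1) * done["var"] + (not_done["count"] - 1) * not_done["var"]
) / (done["count"] + not_done["count"] - 2)

mean_gap = done["mean"] - not_done["mean"]
cohens_d = mean_gap / np.sqrt(pooled_var)

print("Difference in means:", mean_gap)
print("Cohen's d:", cohens_d)
```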