## Student Performance Indicator


#### Life cycle of a Machine Learning Project

- Understanding the Problem Statement
- Data Collection
- Data Checks to perform
- Exploratory data analysis
- Data Pre-Processing
- Model Training
- Choose best model

### 1) Problem statement
- This project examines how student performance (test scores) is affected by other variables such as gender, ethnicity, parental level of education, lunch, and test preparation course.


### 2) Data Collection
- Dataset Source - https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977
- The data consists of 8 columns and 1000 rows.

### 2.1 Import Data and Required Packages
#### Importing the Pandas, NumPy, Matplotlib, Seaborn and Warnings libraries.

In [137]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [170]:
A = {

                "gender" : ['female'],
                "race_ethnicity": ['group C'],
                "parental_level_of_education": ["bachelor's degree"],
                "lunch": ['standard'],
                "test_preparation_course": ['completed'],
                "writing_score": [88],
                "reading_score": [90],
}

In [171]:
DF = pd.DataFrame(A)
DF

Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,writing_score,reading_score
0,female,group C,bachelor's degree,standard,completed,88,90


In [140]:
from src.utils import load_object
import os

In [179]:

pre=load_object(r'C:\Users\Swamesh\Desktop\UDemy\KrishNaik\MLProject1\artifact\preprocessor.pkl')
mod = load_object(r'C:\Users\Swamesh\Desktop\UDemy\KrishNaik\MLProject1\artifact\model.pkl')

In [180]:
data_scaled = pre.transform(DF)
data_scaled

array([[1.29635617, 1.4146351 , 2.00276196, 0.        , 0.        ,
        0.        , 2.13504205, 0.        , 0.        , 0.        ,
        3.07728727, 0.        , 0.        , 0.        , 0.        ,
        0.        , 2.10183809, 2.09830697, 0.        ]])

In [182]:
num_features = pre.transformers_[0][2]
cat_features = pre.transformers_[1][2]
cat_encoder = pre.named_transformers_['categorical_pipeline']
cat_columns = cat_encoder.get_feature_names_out(cat_features)
column_names = num_features + cat_columns.tolist()

In [183]:
R = pd.DataFrame(data_scaled,columns=column_names)

In [184]:
R

Unnamed: 0,writing_score,reading_score,gender_female,gender_male,race_ethnicity_group A,race_ethnicity_group B,race_ethnicity_group C,race_ethnicity_group D,race_ethnicity_group E,parental_level_of_education_associate's degree,parental_level_of_education_bachelor's degree,parental_level_of_education_high school,parental_level_of_education_master's degree,parental_level_of_education_some college,parental_level_of_education_some high school,lunch_free/reduced,lunch_standard,test_preparation_course_completed,test_preparation_course_none
0,1.296356,1.414635,2.002762,0.0,0.0,0.0,2.135042,0.0,0.0,0.0,3.077287,0.0,0.0,0.0,0.0,0.0,2.101838,2.098307,0.0


In [185]:
mod.predict(R)

array([77.38640939])
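
The non-zero one-hot entries in the transformed row are roughly 2.0 to 3.1 rather than 1.0. One plausible explanation, and it is only an assumption because the steps inside the saved preprocessor are not shown here, is that the categorical pipeline one-hot encodes and then divides each column by its standard deviation without centering. The sketch below shows the arithmetic under that assumption; `scaled_one_hot_value` is a hypothetical helper, not part of the project.

```python
import numpy as np


def scaled_one_hot_value(share: float) -> float:
    """Value a one-hot 1 takes after dividing by the column's standard deviation.

    Assumption: the column is scaled without centering, so zeros stay zero.
    """
    std = np.sqrt(share * (1.0 - share))  # population std of a 0/1 column
    return 1.0 / std


# Hypothetical input: 51.8% is the full-dataset share of female students
# reported later in this notebook; a training split would differ slightly.
print(round(scaled_one_hot_value(0.518), 4))
```

Rarer categories have a smaller standard deviation, so they end up with larger scaled values, which would match the larger entry for the bachelor's degree column.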

#### Import the CSV Data as Pandas DataFrame

In [168]:
df = pd.read_csv('data/stud.csv')

#### Show Top 5 Records

In [169]:
df.head()

Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


#### Shape of the dataset

In [None]:
df.shape

### 2.2 Dataset information

- gender : sex of the student -> (male/female)
- race/ethnicity : ethnicity of the student -> (group A, B, C, D, E)
- parental level of education : parents' highest level of education -> (bachelor's degree, some college, master's degree, associate's degree, high school, some high school)
- lunch : type of lunch before the test (standard or free/reduced)
- test preparation course : whether the course was completed before the test (completed or none)
- math score
- reading score
- writing score

### 3. Data Checks to perform

- Check Missing values
- Check Duplicates
- Check data type
- Check the number of unique values of each column
- Check statistics of data set
- Check the categories present in each categorical column
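
The checks listed above can also be gathered into one helper. This is a minimal sketch: `run_data_checks` is a hypothetical name, and the toy rows are made up for illustration rather than taken from `data/stud.csv`.

```python
import pandas as pd


def run_data_checks(frame: pd.DataFrame) -> dict:
    """Collect the basic quality checks into one dictionary."""
    return {
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "unique_per_column": frame.nunique().to_dict(),
    }


# Toy rows for illustration only
toy = pd.DataFrame({
    "gender": ["female", "male", "male", "female"],
    "math_score": [72, 47, 90, None],
})
# Append a copy of the second row so there is exactly one duplicate
toy = pd.concat([toy, toy.iloc[[1]]], ignore_index=True)

report = run_data_checks(toy)
print(report)
```

The sections below run the same checks one at a time on the real DataFrame.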

### 3.1 Check Missing values

In [None]:
df.isna().sum()

#### There are no missing values in the data set

### 3.2 Check Duplicates

In [None]:
df.duplicated().sum()

#### There are no duplicate values in the data set

### 3.3 Check data types

In [None]:
# Check Null and Dtypes
df.info()

### 3.4 Checking the number of unique values of each column

In [None]:
df.nunique()

### 3.5 Check statistics of data set

In [None]:
df.describe()

#### Insight
- From the above description of the numerical data, all means are very close to each other, between 66 and 68.05.
- All standard deviations are also close, between 14.6 and 15.19.
- While the minimum score for math is 0, the minimum is much higher for writing (10) and higher still for reading (17).

### 3.7 Exploring Data

In [None]:
df.head()

In [None]:
print("Categories in 'gender' variable:     ",end=" " )
print(df['gender'].unique())

print("Categories in 'race_ethnicity' variable:  ",end=" ")
print(df['race_ethnicity'].unique())

print("Categories in 'parental level of education' variable:",end=" " )
print(df['parental_level_of_education'].unique())

print("Categories in 'lunch' variable:     ",end=" " )
print(df['lunch'].unique())

print("Categories in 'test preparation course' variable:     ",end=" " )
print(df['test_preparation_course'].unique())

In [None]:
# define numerical & categorical columns
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))
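
Comparing each dtype with `'O'` depends on text columns being stored as `object`. A sketch of an alternative that selects columns by dtype family with `select_dtypes` is shown below; the toy frame only mirrors a few of the notebook's column names and is not the real data.

```python
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    "gender": ["female", "male"],
    "lunch": ["standard", "free/reduced"],
    "math_score": [72, 47],
    "reading_score": [72, 57],
})

# Pick columns by dtype family instead of comparing each dtype with 'O'
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(f"We have {len(numeric_cols)} numerical features : {numeric_cols}")
print(f"We have {len(categorical_cols)} categorical features : {categorical_cols}")
```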

In [None]:
df.head(2)

### 3.8 Adding columns for "Total Score" and "Average"

In [None]:
df['total score'] = df['math_score'] + df['reading_score'] + df['writing_score']
df['average'] = df['total score']/3
df.head()
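
The same two columns can be computed row-wise over a list of score columns, which avoids hard-coding the divisor 3. A small sketch, using the first three rows shown by `df.head()` earlier; `score_cols` is a name introduced here for illustration.

```python
import pandas as pd

score_cols = ["math_score", "reading_score", "writing_score"]

# First three rows of the dataset as displayed by df.head() above
toy = pd.DataFrame({
    "math_score": [72, 69, 90],
    "reading_score": [72, 90, 95],
    "writing_score": [74, 88, 93],
})

toy["total score"] = toy[score_cols].sum(axis=1)  # row-wise sum
toy["average"] = toy[score_cols].mean(axis=1)     # row-wise mean
print(toy)
```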

In [None]:
reading_full = df[df['reading_score'] == 100]['average'].count()
writing_full = df[df['writing_score'] == 100]['average'].count()
math_full = df[df['math_score'] == 100]['average'].count()

print(f'Number of students with full marks in Maths: {math_full}')
print(f'Number of students with full marks in Writing: {writing_full}')
print(f'Number of students with full marks in Reading: {reading_full}')

In [None]:
reading_less_20 = df[df['reading_score'] <= 20]['average'].count()
writing_less_20 = df[df['writing_score'] <= 20]['average'].count()
math_less_20 = df[df['math_score'] <= 20]['average'].count()

print(f'Number of students with 20 marks or fewer in Maths: {math_less_20}')
print(f'Number of students with 20 marks or fewer in Writing: {writing_less_20}')
print(f'Number of students with 20 marks or fewer in Reading: {reading_less_20}')

#####  Insights
 - From the values above, students performed worst in Maths.
 - The best performance is in the reading section.
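
The counts above can also be taken directly from boolean masks, since `True` sums as 1. A minimal sketch on made-up scores (not the real dataset); `summary` is a name introduced for illustration.

```python
import pandas as pd

# Toy scores for illustration only
toy = pd.DataFrame({
    "math_score": [100, 18, 20, 65],
    "reading_score": [100, 35, 17, 70],
})

summary = {}
for col in ["math_score", "reading_score"]:
    full_marks = int((toy[col] == 100).sum())  # each True counts as 1
    at_most_20 = int((toy[col] <= 20).sum())   # inclusive threshold, as in the cells above
    summary[col] = (full_marks, at_most_20)

print(summary)
```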

### 4. Exploring Data ( Visualization )
#### 4.1 Visualize the average score distribution to draw some conclusions.
- Histogram
- Kernel Density Estimate (KDE)

#### 4.1.1 Histogram & KDE

In [None]:
plt.figure(figsize=(15,7))
plt.subplot(121)
sns.histplot(data=df,x='average',bins=30,kde=True,color='g')
plt.subplot(122)
sns.histplot(data=df,x='average',kde=True,hue='gender')
plt.show()

In [None]:
df.head()

In [None]:
plt.figure(figsize=(15,7))
plt.subplot(121)
sns.histplot(data=df,x='total score',bins=30,kde=True,color='g')
plt.subplot(122)
sns.histplot(data=df,x='total score',kde=True,hue='gender')
plt.show()

#####  Insights
- Female students tend to perform better than male students.

In [None]:
plt.figure(figsize=(25,6))
plt.subplot(141)
sns.histplot(data=df,x='average',kde=True,hue='lunch')
plt.title('Whole data')
plt.subplot(142)
sns.histplot(data=df[df.gender=='female'],x='average',kde=True,hue='lunch')
plt.title('Female')
plt.subplot(143)
sns.histplot(data=df[df.gender=='male'],x='average',kde=True,hue='lunch')
plt.title('Male')
plt.show()

#####  Insights
- Students with standard lunch tend to score higher in exams.
- This holds for both male and female students.
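
A visual impression like this can be cross-checked numerically by averaging per group. The sketch below uses made-up rows purely to show the shape of the result; it says nothing about the real numbers.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "female", "male"],
    "lunch": ["standard", "free/reduced", "standard", "free/reduced", "standard", "standard"],
    "average": [80.0, 60.0, 70.0, 50.0, 90.0, 74.0],
})

# Mean average score for every gender and lunch combination:
# rows are genders, columns are lunch types
mean_by_group = toy.groupby(["gender", "lunch"])["average"].mean().unstack()
print(mean_by_group)
```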

KDE and histogram plots of the average score for all students, male students, and female students with respect to parental education
- No need for plt.legend() if hue= is used.

In [None]:

plt.figure(figsize=(25,6))
plt.subplot(141)
sns.histplot(data=df,x='average',kde=True,hue='parental_level_of_education')
plt.subplot(142)
sns.histplot(data=df[df.gender=='male'],x='average',kde=True,hue='parental_level_of_education')
plt.subplot(143)
sns.histplot(data=df[df.gender=='female'],x='average',kde=True,hue='parental_level_of_education')


plt.figure(figsize=(25,4))
plt.subplot(141)
sns.kdeplot(data=df,x='average',hue='parental_level_of_education')
plt.title('Whole data')

plt.subplot(142)
sns.kdeplot(data=df[df.gender=='male'],x='average',hue='parental_level_of_education')
plt.title('Male')

plt.subplot(143)
sns.kdeplot(data=df[df.gender=='female'],x='average',hue='parental_level_of_education')
plt.title('Female')

plt.show()

#####  Insights
- In general, parents' education does not appear to help students perform well in exams.
- The 2nd plot shows that male students whose parents hold an associate's or master's degree tend to perform well in exams.
- In the 3rd plot, parents' education shows no visible effect on female students.

In [None]:

plt.figure(figsize=(25,6))

plt.subplot(141)
sns.histplot(data=df,x='average',kde=True,hue='race_ethnicity')

plt.subplot(142)
sns.histplot(data=df[df.gender=='male'],x='average',kde=True,hue='race_ethnicity')

plt.subplot(143)
sns.histplot(data=df[df.gender=='female'],x='average',kde=True,hue='race_ethnicity')


plt.figure(figsize=(25,4))

plt.subplot(141)
sns.kdeplot(data=df,x='average',hue='race_ethnicity')
plt.title('Whole data')

plt.subplot(142)
sns.kdeplot(data=df[df.gender=='male'],x='average',hue='race_ethnicity')
plt.title('Male')

plt.subplot(143)
sns.kdeplot(data=df[df.gender=='female'],x='average',hue='race_ethnicity')
plt.title('Female')

plt.show()




#####  Insights
- Students of group A and group B tend to perform poorly in exams.
- This holds irrespective of whether they are male or female.

- In these histograms, the y-axis shows the number of students for a given average score (x-axis).

In [None]:
plt.figure(figsize=(25,6))

plt.subplot(141)
sns.histplot(data=df,x='average',kde=True,hue='test_preparation_course',)

plt.subplot(142)
sns.histplot(data=df[df.gender=='male'],x='average',kde=True,hue='test_preparation_course')

plt.subplot(143)
sns.histplot(data=df[df.gender=='female'],x='average',kde=True,hue='test_preparation_course')


plt.figure(figsize=(25,4))

plt.subplot(141)
sns.kdeplot(data=df,x='average',hue='test_preparation_course')
plt.title('Whole data')

plt.subplot(142)
sns.kdeplot(data=df[df.gender=='male'],x='average',hue='test_preparation_course')
plt.title('Male')

plt.subplot(143)
sns.kdeplot(data=df[df.gender=='female'],x='average',hue='test_preparation_course')
plt.title('Female')

plt.show()

##### Insights

Irrespective of gender:
- More students have NOT completed the test prep than have completed it.
- The overall average score of the students who have COMPLETED the test prep is better.
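
Both statements, group size and group mean, can be read from a single aggregation. A minimal sketch on made-up rows; `summary` and its `share` column are names introduced for illustration.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "test_preparation_course": ["none", "none", "completed", "none", "completed"],
    "average": [60.0, 66.0, 75.0, 63.0, 81.0],
})

# Number of students and mean average score per group
summary = toy.groupby("test_preparation_course")["average"].agg(["count", "mean"])
# Fraction of all students that falls in each group
summary["share"] = summary["count"] / summary["count"].sum()
print(summary)
```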

#### 4.2 Maximum score of students in all three subjects

In [None]:
df.head(2)

In [None]:

plt.figure(figsize=(15,6))
plt.subplot(1, 4, 1)
plt.title('MATH SCORES')
sns.violinplot(y='math_score',data=df,color='red',linewidth=3)
plt.subplot(1, 4, 2)
plt.title('READING SCORES')
sns.violinplot(y='reading_score',data=df,color='green',linewidth=3)
plt.subplot(1, 4, 3)
plt.title('WRITING SCORES')
sns.violinplot(y='writing_score',data=df,color='blue',linewidth=3)
plt.show()

#### Insights
- From the above three plots it is clearly visible that most students score between 60 and 80 in Maths, whereas in reading and writing most score between 50 and 80.
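
How many students fall inside a score band can be quantified with `Series.between`, which is inclusive on both ends by default. The sketch below uses made-up scores, so the printed shares are not results for the real dataset.

```python
import pandas as pd

# Toy scores for illustration only
toy = pd.DataFrame({
    "math_score": [55, 62, 71, 78, 80, 93],
    "reading_score": [48, 52, 66, 79, 85, 90],
})

# The mean of a boolean mask is the fraction of rows inside the band
share_math = toy["math_score"].between(60, 80).mean()
share_reading = toy["reading_score"].between(50, 80).mean()
print(f"Math 60-80: {share_math:.0%}, Reading 50-80: {share_reading:.0%}")
```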

#### 4.3 Multivariate analysis using pie plots (one way, as coded by me)

In [None]:
df.head(3)

In [None]:
df.columns[0]

In [None]:
df[df.columns[0]].value_counts()


In [None]:
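plt.figure(figsize=(20, 4))

# The first five columns are the categorical features; the last five are the
# three scores plus the derived 'total score' and 'average' columns.
n_categorical = len(df.columns) - 5

for i in range(n_categorical):
    counts = df[df.columns[i]].value_counts()  # computed once, used for sizes and labels
    plt.subplot(1, n_categorical, i + 1)
    plt.pie(counts, labels=counts.index, autopct='%1.2f%%')
    plt.title(df.columns[i], fontsize=20)

plt.tight_layout(w_pad=2, h_pad=8)
plt.show()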



##### Insights
- The numbers of male and female students are almost equal.
- Group C has the largest number of students.
- More students have standard lunch than free/reduced lunch.
- More students have not enrolled in any test preparation course than have completed one.
- "Some College" is the most common parental education level, followed closely by "Associate's Degree".

#### Alternate way as done by Krish Naik


In [None]:
plt.rcParams['figure.figsize'] = (30, 12)

plt.subplot(1, 5, 1)
size = df['gender'].value_counts()
labels = 'Female', 'Male'
color = ['red','green']


plt.pie(size, colors = color, labels = labels,autopct = '%.2f%%')
plt.title('Gender', fontsize = 20)
# plt.axis('off')



plt.subplot(1, 5, 2)
size = df['race_ethnicity'].value_counts()
labels = 'Group C', 'Group D','Group B','Group E','Group A'
color = ['red', 'green', 'blue', 'cyan','orange']

plt.pie(size, colors = color,labels = labels,autopct = '%.2f%%')
plt.title('Race Ethnicity', fontsize = 20)
# plt.axis('off')



plt.subplot(1, 5, 3)
size = df['lunch'].value_counts()
labels = 'Standard', 'Free'
color = ['red','green']

plt.pie(size, colors = color,labels = labels,autopct = '%.2f%%')
plt.title('Lunch', fontsize = 20)
# plt.axis('off')


plt.subplot(1, 5, 4)
size = df['test_preparation_course'].value_counts()
labels = 'None', 'Completed'
color = ['red','green']

plt.pie(size, colors = color,labels = labels,autopct = '%.2f%%')
plt.title('Test Course', fontsize = 20)
# plt.axis('off')


plt.subplot(1, 5, 5)
size = df['parental_level_of_education'].value_counts()
labels = 'Some College', "Associate's Degree",'High School','Some High School',"Bachelor's Degree","Master's Degree"
color = ['red', 'green', 'blue', 'cyan','orange','grey']

plt.pie(size, colors = color,labels = labels,autopct = '%.2f%%')
plt.title('Parental Education', fontsize = 20)
# plt.axis('off')


plt.tight_layout()
plt.grid()

plt.show()

##### Insights
- The numbers of male and female students are almost equal.
- Group C has the largest number of students.
- More students have standard lunch than free/reduced lunch.
- More students have not enrolled in any test preparation course than have completed one.
- "Some College" is the most common parental education level, followed closely by "Associate's Degree".

#### 4.4 FEATURE WISE VISUALIZATION

#### 4.4.1 Gender Feature
- How is gender distributed?
- Does gender have any impact on student performance?

#### UNIVARIATE ANALYSIS ( How is gender distributed? )

In [None]:
f,ax=plt.subplots(1,2,figsize=(8,3))
sns.countplot(x=df['gender'],data=df,palette ='bright',ax=ax[0],saturation=0.95)
for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=6)
    
plt.pie(x=df['gender'].value_counts(),labels=df['gender'].value_counts().index.str.capitalize(),explode=[0,0.1],autopct='%1.1f%%',shadow=True,colors=['#ff4d4d','#ff8000'])
plt.show()

#### Insights 
- The gender data is balanced: 518 female students (51.8%) and 482 male students (48.2%).
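
Counts and percentages like these can be produced in one small table with `value_counts`. A sketch on a made-up column; the numbers it prints are not the dataset's.

```python
import pandas as pd

# Toy column for illustration only
toy = pd.DataFrame({"gender": ["female", "male", "female", "female", "male"]})

counts = toy["gender"].value_counts()
percent = toy["gender"].value_counts(normalize=True).mul(100).round(1)

# Both Series share the category index, so they align row by row
table = pd.DataFrame({"count": counts, "percent": percent})
print(table)
```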

### BIVARIATE ANALYSIS ( Does gender have any impact on student performance? )

#### ONE WAY TO DO USING PLT.BAR, TOO MUCH CODE !!

In [None]:
df.head(2)

In [None]:
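# numeric_only=True skips the remaining text columns, which recent pandas versions refuse to average
gender_group = df.groupby('gender').mean(numeric_only=True).round(decimals=3)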
gender_group

In [None]:
gender_group1 = gender_group.drop(['total score','average'],axis=1)

In [None]:
plt.figure(figsize=(10, 3))

X = ['Math_Average','Reading_Avg','Writing_Avg']

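# Select each gender's row by label; positional access such as [i][0] on a
# label-indexed Series is deprecated in recent pandas versions
female_scores = gender_group1.loc['female'].tolist()
male_scores = gender_group1.loc['male'].tolist()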



X_axis = np.arange(len(X))
  
plt.bar(X_axis - 0.2, male_scores, 0.4, label = 'Male')
plt.bar(X_axis + 0.2, female_scores, 0.4, label = 'Female')
  
plt.xticks(X_axis, X)
plt.ylabel("Marks")
plt.title("Gender comparison across average values", fontweight='bold',fontsize=12)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

#### SIMPLE WAY TO PLOT in NORMAL AND INVERTED WAYS
- USE GROUPED_DATAFRAME.PLOT() TO GET INDEX VALUES ON X-AXIS AND COLUMN VALUES ON Y, WHERE HUE WOULD BE COLUMN NAMES

- Use Grouped_Dataframe.TRANSPOSE().plot() to interchange index and columns names.
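
What the transpose changes is easiest to see on a tiny grouped frame. The numbers below are hypothetical placeholders, not the real group means.

```python
import pandas as pd

# Hypothetical grouped means: index = group labels, columns = subjects
grouped = pd.DataFrame(
    {"math_score": [63.0, 68.0], "reading_score": [72.0, 65.0]},
    index=pd.Index(["female", "male"], name="gender"),
)

flipped = grouped.transpose()  # index and columns swap places

print(grouped)  # .plot(kind='bar') would put gender on the x-axis, one bar colour per subject
print(flipped)  # .plot(kind='bar') would put subjects on the x-axis, one bar colour per gender
```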

In [None]:
ax = gender_group1.plot(kind='bar',figsize=(10,4))
for container in ax.containers:
    ax.bar_label(container,fontsize=11)

plt.xticks(rotation='horizontal')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

#####  USING TRANSPOSE ON GROUPED DATA

In [None]:
ax = gender_group1.transpose().plot(kind='bar',figsize=(10,4))
for container in ax.containers:
    ax.bar_label(container,fontsize=11)

plt.xticks(rotation='horizontal')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

#### Insights 
- On average, females have a better overall score than males.
- Males, however, have scored higher in Maths.

### 4.4.2 RACE/ETHNICITY COLUMN
- How are students distributed across groups?
- Does race/ethnicity have any impact on student performance?

#### UNIVARIATE ANALYSIS ( How are students distributed across groups? )

In [None]:
f,ax=plt.subplots(1,2,figsize=(10,3))
sns.countplot(x=df['race_ethnicity'],data=df,ax=ax[0],saturation=0.95)


for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=9)
    
plt.pie(x = df['race_ethnicity'].value_counts(),labels=df['race_ethnicity'].value_counts().index.str.capitalize(),explode=[0.1,0,0,0,0],autopct='%1.1f%%',shadow=True)
plt.show()

#### Insights 
- Most students belong to group C or group D.
- The fewest students belong to group A.

#### BIVARIATE ANALYSIS ( Does race/ethnicity have any impact on student performance? )

In [None]:
# numeric_only=True averages only the score columns and skips the text columns
group_date_race = df.groupby('race_ethnicity').mean(numeric_only=True).round(decimals=2)

group_date_race.rename(index={'group A':'group_A','group B':'group_B','group C':'group_C','group D':'group_D','group E':'group_E'},inplace=True) #To remove space between group and A,B,...E
group_date_race

In [None]:
group_date_race1 = group_date_race.drop(['total score','average'],axis=1)

#### MY OWN WAY ( SIMILAR TO GENDER ONE) TOO MUCH CODE !!!

In [None]:
# Create one empty list per group so that each group's average values can be stored in its own list
group_A,group_B,group_C,group_D,group_E = [[] for i in range(len(group_date_race.index))]  

type(group_A)

In [None]:
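for i in group_date_race1:
    # .iloc gives explicit positional access; plain [0] on a label-indexed Series is deprecated
    group_A.append(group_date_race1[i].iloc[0])
    group_B.append(group_date_race1[i].iloc[1])
    group_C.append(group_date_race1[i].iloc[2])
    group_D.append(group_date_race1[i].iloc[3])
    group_E.append(group_date_race1[i].iloc[4])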


In [None]:
plt.figure(figsize=(10, 4))

X = ['Math_Average','Reading_Avg','Writing_Avg']

X_axis = np.arange(len(X))

width = 0.05 # DONE TO ADJUST SPACING ON X AXIS SINCE THERE ARE MANY CATEGORIES


'''
x_AXIS+WIDTH IS DONE TO SET UP GROUP WISE AVERAGE ON THE AXIS AT APPROPRIATE LENGTHS
'''

plt.bar(X_axis+width ,group_A, 0.1, label = 'group_A')


plt.bar(X_axis+width*3, group_B, 0.1, label = 'group_B')


plt.bar(X_axis+width*5 , group_C, 0.1, label = 'group_C')


plt.bar(X_axis+width*7 , group_D, 0.1, label = 'group_D')


plt.bar(X_axis+width*9, group_E, 0.1, label = 'group_E')



plt.xticks(X_axis+width*5, X)
plt.ylabel("Marks")
plt.title("Race Ethnicity comparison across average values", fontweight='bold')
plt.legend(fontsize=8)
plt.show()

#### KRISH NAIK METHOD TO PLOT RACE_ETHNICITY VS AVERAGE

In [None]:

f,ax=plt.subplots(1,3,figsize=(15,5))

sns.barplot(x=group_date_race1.index, y=group_date_race1['math_score'],palette = 'mako',ax=ax[0])
ax[0].set_title('Math score',color='#005ce6',size=15)

for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=9)


sns.barplot(x=group_date_race1.index,y=group_date_race1['reading_score'],palette = 'flare',ax=ax[1])
ax[1].set_title('Reading score',color='#005ce6',size=15)

for container in ax[1].containers:
    ax[1].bar_label(container,color='black',size=9)
    

sns.barplot(x=group_date_race1.index,y=group_date_race1['writing_score'],palette = 'coolwarm',ax=ax[2])
ax[2].set_title('Writing score',color='#005ce6',size=15)

for container in ax[2].containers:
    ax[2].bar_label(container,color='black',size=9)

#### SIMPLE WAY TO PLOT in NORMAL AND INVERTED WAYS


In [None]:
ax = group_date_race1.plot(kind='bar',figsize=(11,4))
for container in ax.containers:
    ax.bar_label(container,fontsize=7)
    
plt.xticks(rotation='horizontal')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

#####  USING TRANSPOSE ON GROUPED DATA

In [None]:

ax = group_date_race1.transpose().plot(kind='bar',figsize=(10,6))
for container in ax.containers:
    ax.bar_label(container,fontsize=8)
    
plt.xticks(rotation='horizontal')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

#### Insights 
- Group E students have scored the highest marks. 
- Group A students have scored the lowest marks. 
- Students from a lower socioeconomic status appear to have a lower average in all subjects, although the dataset does not state how the groups relate to socioeconomic status.
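
The highest and lowest scoring group per subject can be read off a grouped frame with `idxmax` and `idxmin` instead of from the bar heights. The values below are hypothetical placeholders for three of the groups, not the real means.

```python
import pandas as pd

# Hypothetical grouped means for illustration only
grouped = pd.DataFrame(
    {"math_score": [61.0, 64.0, 73.0], "reading_score": [64.0, 69.0, 72.0]},
    index=pd.Index(["group_A", "group_C", "group_E"], name="race_ethnicity"),
)

best = grouped.idxmax()   # index label of the highest mean in each column
worst = grouped.idxmin()  # index label of the lowest mean in each column
print(best)
print(worst)
```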

### 4.4.3 PARENTAL LEVEL OF EDUCATION COLUMN
- What is the educational background of the students' parents?
- Does parental education have any impact on student performance?

#### UNIVARIATE ANALYSIS ( What is the educational background of the students' parents? )

In [None]:
# plt.rcParams['figure.figsize'] = (15, 6)
plt.figure(figsize=(13,4))
plt.style.use('fivethirtyeight')
sns.countplot(data=df,x=df['parental_level_of_education'],palette='Blues')
plt.title('Comparison of Parental Education', fontweight = 30, fontsize = 15)
plt.xlabel('Degree')

plt.ylabel('count')
plt.show()

#### Insights 
- The largest group of parents has a "some college" education.

#### BIVARIATE ANALYSIS ( Does parental education have any impact on student performance? )

In [None]:
# numeric_only=True averages only the score columns and skips the text columns
df.groupby('parental_level_of_education').mean(numeric_only=True).plot(kind='barh',figsize=(5,5))
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=1)
plt.show()

#### Insights 
- Students whose parents hold a master's or bachelor's degree score higher than the others.

### 4.4.4 LUNCH COLUMN 
- Which type of lunch is most common among students?
- What is the effect of lunch type on test results?


#### UNIVARIATE ANALYSIS ( Which type of lunch is most common among students? )

In [None]:
# plt.rcParams['figure.figsize'] = (15, 9)
plt.figure(figsize=(10,4))

a = sns.countplot(data=df,x='lunch', palette = 'PuBu',)
for container in a.containers:
    a.bar_label(container,color='black')


plt.title('Comparison of different types of lunch', fontweight = 30, fontsize = 20)
plt.xlabel('types of lunch')
plt.ylabel('count')
plt.show()

#### Insights 
- More students were served standard lunch than free/reduced lunch.

#### BIVARIATE ANALYSIS ( Does lunch type have any impact on student performance? )

In [None]:
# numeric_only=True averages only the score columns and skips the text columns
group_lunch = df.groupby('lunch').mean(numeric_only=True).round(decimals=3).drop(['average'],axis=1)
group_lunch

In [None]:
a = group_lunch.plot(kind='bar')

for container in a.containers:
    a.bar_label(container,fontsize=6)

plt.xticks(rotation='horizontal')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=1)
plt.show()

#####  USING TRANSPOSE ON GROUPED DATA

In [None]:
a = group_lunch.transpose().plot(kind='bar')

for container in a.containers:
    a.bar_label(container,fontsize=6)

plt.xticks(rotation='horizontal')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=1)
plt.show()

#### As done by Krish Naik

In [None]:
f,ax=plt.subplots(1,2,figsize=(15,4))


sns.countplot(x=df['parental_level_of_education'],data=df,palette = 'bright',hue='test_preparation_course',saturation=0.95,ax=ax[0])
ax[0].set_title('Students vs test preparation course ',color='black',size=25)
for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=20)
    
sns.countplot(x=df['parental_level_of_education'],data=df,palette = 'bright',hue='lunch',saturation=0.95,ax=ax[1])
for container in ax[1].containers:
    ax[1].bar_label(container,color='black',size=20)   

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=1)

#### Insights 
- Students who get standard lunch tend to perform better than students who get free/reduced lunch.

#### 4.4.5 TEST PREPARATION COURSE COLUMN 
- How many students have completed the test preparation course?
- Does the test preparation course have any impact on student performance?

#### UNIVARIATE ANALYSIS

In [None]:
df.head(2)

In [None]:
plt.figure(figsize=(8,4))
ax = sns.countplot(data=df,x='test_preparation_course')
for container in ax.containers:
    ax.bar_label(container)

plt.show()

####  Insights 
- The number of students who have not completed any test prep course is far greater than the number who completed one.

#### BIVARIATE ANALYSIS ( Does the test preparation course have any impact on student performance? )

In [None]:
# numeric_only=True averages only the score columns and skips the text columns
group_test_prep = df.groupby('test_preparation_course').mean(numeric_only=True).round(decimals=2).drop(['average'],axis=1)
group_test_prep

### Simple way using grouped_feature.DataFrame.plot

In [None]:
ax = group_test_prep.plot(kind='barh')
for container in ax.containers:
    ax.bar_label(container,fontsize=8)

plt.xticks(rotation='horizontal')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=1)
plt.show()

#### Using transpose

In [None]:
ax = group_test_prep.transpose().plot(kind='barh')
for container in ax.containers:
    ax.bar_label(container,fontsize=7)

plt.xticks(rotation='horizontal')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=1)
plt.show()

##### As done by Krish Naik

In [None]:
plt.figure(figsize=(12,6))
plt.subplot(2,2,1)
sns.barplot (x=df['lunch'], y=df['math_score'], hue=df['test_preparation_course'])
plt.subplot(2,2,2)
sns.barplot (x=df['lunch'], y=df['reading_score'], hue=df['test_preparation_course'])
plt.subplot(2,2,3)
sns.barplot (x=df['lunch'], y=df['writing_score'], hue=df['test_preparation_course'])

plt.show()

#### Insights  
- Students who have completed the test preparation course score higher in all three categories than those who haven't taken the course.

#### 4.4.6 CHECKING OUTLIERS

- In sns.boxplot(), passing only x with numeric values draws a single horizontal box; a categorical axis is allowed only when the other axis is numeric.

In [None]:
df.head(2)

In [None]:
plt.subplots(1,4,figsize=(15,3))
plt.subplot(141)
sns.boxplot(x = df['math_score'],color='skyblue')

plt.subplot(142)
sns.boxplot(x=df['reading_score'],color='hotpink')

plt.subplot(143)
sns.boxplot(x=df['writing_score'],color='yellow')

plt.subplot(144)
sns.boxplot(x=df['average'],color='lightgreen')
plt.show()

#### Insights

As we can see, there are a few outliers.
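
The points a box plot draws beyond its whiskers follow, by default, the 1.5 x IQR rule, so the outliers can also be listed explicitly. A minimal sketch on made-up scores; `iqr_outliers` is a hypothetical helper name.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Toy scores for illustration only
toy = pd.Series([0, 55, 60, 62, 65, 68, 70, 72, 75, 80], name="math_score")
outliers = iqr_outliers(toy)
print(outliers.tolist())
```

On the real data the same helper could be applied to each score column in turn, for example `iqr_outliers(df['math_score'])`.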

#### 4.4.7 MULTIVARIATE ANALYSIS USING PAIRPLOT

In [None]:
sns.pairplot(df,hue ='race_ethnicity')
plt.show()

#### Insights
- From the above plot it is clear that all the scores increase linearly with each other.
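
The strength of a linear relationship like this can be put into numbers with a correlation matrix. The sketch below uses made-up scores, so its coefficients are not those of the real dataset.

```python
import pandas as pd

# Toy scores for illustration only
toy = pd.DataFrame({
    "math_score": [40, 55, 70, 85],
    "reading_score": [45, 60, 72, 90],
    "writing_score": [42, 58, 74, 88],
})

# Pearson correlation between every pair of score columns
corr = toy.corr()
print(corr.round(3))
```

Values close to 1 indicate that two scores rise together almost linearly.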

### 5. Conclusions
- Student performance is related to lunch, race/ethnicity, and parental level of education.
- Females lead in pass percentage and are also the top scorers.
- Student performance is less strongly related to the test preparation course, but finishing the course is still beneficial.