### Student Performance Indicator

#### Life Cycle of a Machine Learning Project

* Understanding the Problem Statement
* Data Collection
* Data Checks to perform
* Exploratory data analysis
* Data Pre-Processing
* Model Training
* Choose best model

### Problem Statement

* This project examines how student performance (test scores) is affected by other variables such as gender, ethnicity, parental level of education, lunch and test preparation course.

### Data Collection

* The data consists of 8 columns and 1000 rows.

### 2.1 Import Data and Required Packages
##### Importing the Pandas, NumPy, Matplotlib, Seaborn and Warnings libraries.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import os
import warnings
warnings.filterwarnings('ignore')


In [3]:
# Read the data as dataframe

df = pd.read_csv('data/stud.csv')

In [4]:
# First 5 records
df.head()

Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


In [5]:
# last 5 records
df.tail()

Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score
995,female,group E,master's degree,standard,completed,88,99,95
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77
999,female,group D,some college,free/reduced,none,77,86,86


In [6]:
# shape of the data
df.shape

(1000, 8)

#### 2.2 Dataset information
* gender : sex of the student -> (male/female)
* race/ethnicity : ethnicity of the student -> (group A, B, C, D, E)
* parental level of education : parents' highest level of education -> (bachelor's degree, some college, master's degree, associate's degree, high school, some high school)
* lunch : type of lunch the student had before the test -> (standard or free/reduced)
* test preparation course : whether the course was completed before the test -> (completed/none)
* math score
* reading score
* writing score

#### 3. Data Checks to perform
* Check Missing values
* Check Duplicates
* Check data type
* Check the number of unique values of each column
* Check statistics of data set
* Check the categories present in each categorical column
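
The checks listed above can be bundled into a single helper. The following is only a minimal sketch: `run_basic_checks` and the `toy` frame are hypothetical names and data made up for illustration, not part of the original notebook or of `data/stud.csv`.

```python
import pandas as pd


def run_basic_checks(frame: pd.DataFrame) -> dict:
    """Collect the basic data checks into one dictionary (hypothetical helper)."""
    # Treat every non-numeric column as categorical for this sketch
    categorical = frame.select_dtypes(exclude="number").columns
    return {
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "unique_per_column": frame.nunique().to_dict(),
        "categories": {
            col: sorted(frame[col].dropna().unique()) for col in categorical
        },
    }


# Toy data (an assumption, not the real dataset): one missing value, one duplicate row
toy = pd.DataFrame(
    {
        "gender": ["female", "male", "male", None],
        "math_score": [72, 47, 47, 90],
    }
)

report = run_basic_checks(toy)
for check_name, result in report.items():
    print(f"{check_name}: {result}")
```

On the real data, the same helper could be called as `run_basic_checks(df)`; descriptive statistics are left to `df.describe()`.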


##### 3.1 Check Missing values

In [7]:
df.isna().sum()

gender                         0
race_ethnicity                 0
parental_level_of_education    0
lunch                          0
test_preparation_course        0
math_score                     0
reading_score                  0
writing_score                  0
dtype: int64

In [8]:
df.isnull().sum()

gender                         0
race_ethnicity                 0
parental_level_of_education    0
lunch                          0
test_preparation_course        0
math_score                     0
reading_score                  0
writing_score                  0
dtype: int64

##### 3.2 Check Duplicates



In [9]:
df.duplicated().sum()

0

In [10]:
df[df.duplicated()]

Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score


There are no duplicate rows in the dataset.

##### 3.3 Check data types

In [11]:
# Summary of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race_ethnicity               1000 non-null   object
 2   parental_level_of_education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test_preparation_course      1000 non-null   object
 5   math_score                   1000 non-null   int64 
 6   reading_score                1000 non-null   int64 
 7   writing_score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


##### 3.4 Checking the number of unique values of each column

In [13]:
# Number of unique values in each column

df.nunique()

gender                          2
race_ethnicity                  5
parental_level_of_education     6
lunch                           2
test_preparation_course         2
math_score                     81
reading_score                  72
writing_score                  77
dtype: int64

In [14]:
#Descriptive statistics of the dataset

df.describe()

Unnamed: 0,math_score,reading_score,writing_score
count,1000.0,1000.0,1000.0
mean,66.089,69.169,68.054
std,15.16308,14.600192,15.195657
min,0.0,17.0,10.0
25%,57.0,59.0,57.75
50%,66.0,70.0,69.0
75%,77.0,79.0,79.0
max,100.0,100.0,100.0


##### Insights drawn from the dataset:
* From the description of the numerical data above, the three means are close to each other, between 66.09 and 69.17.
* The standard deviations are also close, between 14.60 and 15.20.
* The minimum math score is 0, while the minimums are higher for writing (10) and higher still for reading (17).
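
Ranges like these can be read off programmatically rather than by eye. The sketch below assumes a small made-up `toy` frame in place of the real data; the names `stats`, `ranges` and `lowest_min_subject` are illustrative only.

```python
import pandas as pd

# Toy scores (an assumption, not the real dataset)
toy = pd.DataFrame(
    {
        "math_score": [0, 60, 90],
        "reading_score": [30, 70, 95],
        "writing_score": [20, 65, 95],
    }
)

# Keep only the rows of describe() that the insights refer to
stats = toy.describe().loc[["mean", "std", "min"]]

# For each statistic, the lowest and highest value across the three subjects
ranges = pd.DataFrame(
    {"lowest": stats.min(axis=1), "highest": stats.max(axis=1)}
)

# Subject with the smallest minimum score
lowest_min_subject = stats.loc["min"].idxmin()

print(ranges.round(2))
print(f"Lowest minimum score is in: {lowest_min_subject}")
```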

##### 3.7 Exploring Data

In [15]:
df.head()

Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


In [23]:
# Function to display the unique values in each categorical column

def display_unique_cat():
    categorical_column = [feature for feature in df.columns if df[feature].dtype =='O' ]
    for cat_column in categorical_column:
        print(f"Categories in '{cat_column}'  ", end=" ")
        print(df[cat_column].unique())
    
    
display_unique_cat()

Categories in 'gender'   ['female' 'male']
Categories in 'race_ethnicity'   ['group B' 'group C' 'group A' 'group D' 'group E']
Categories in 'parental_level_of_education'   ["bachelor's degree" 'some college' "master's degree" "associate's degree"
 'high school' 'some high school']
Categories in 'lunch'   ['standard' 'free/reduced']
Categories in 'test_preparation_course'   ['none' 'completed']


In [32]:
# Separate the columns into numerical and categorical

categorical_col = [feature for feature in df.columns if df[feature].dtypes =='O']
numerical_col = [feature for feature in df.columns if df[feature].dtypes !='O']

print(f'We have {len(categorical_col)} categorical columns  {categorical_col}')
print(f'We have {len(numerical_col)} numerical columns {numerical_col}')

We have 5 categorical columns  ['gender', 'race_ethnicity', 'parental_level_of_education', 'lunch', 'test_preparation_course']
We have 3 numerical columns ['math_score', 'reading_score', 'writing_score']
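
As an alternative to comparing each column's dtype with `'O'`, pandas offers `DataFrame.select_dtypes`. The sketch below uses a made-up `toy` frame (an assumption, not the real data) and treats every non-numeric column as categorical.

```python
import pandas as pd

# Toy frame (an assumption) with two text columns and one numeric column
toy = pd.DataFrame(
    {
        "gender": ["female", "male"],
        "lunch": ["standard", "free/reduced"],
        "math_score": [72, 47],
    }
)

# Numeric columns are selected directly; everything else counts as categorical here
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(f"We have {len(categorical_cols)} categorical columns {categorical_cols}")
print(f"We have {len(numerical_cols)} numerical columns {numerical_cols}")
```

Selecting by `"number"` avoids depending on how a particular pandas version labels text columns.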


##### 3.8 Adding columns for "Total Score" and "Average"

In [39]:
df['total_score'] = df['math_score'] + df['reading_score'] + df['writing_score']
df['average'] = df['total_score']/3

In [55]:
df.head(3)

Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score,total_score,average
0,female,group B,bachelor's degree,standard,none,72,72,74,218,72.666667
1,female,group C,some college,standard,completed,69,90,88,247,82.333333
2,female,group B,master's degree,standard,none,90,95,93,278,92.666667
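
The same two columns can also be derived with row-wise aggregation, which avoids hard-coding the divisor 3. This is a minimal sketch: `score_cols` and `toy` are illustrative names, and the two rows are copied from the first two records shown in `df.head()`.

```python
import pandas as pd

score_cols = ["math_score", "reading_score", "writing_score"]

# First two records from df.head(), reproduced here so the sketch is self-contained
toy = pd.DataFrame(
    {
        "math_score": [72, 69],
        "reading_score": [72, 90],
        "writing_score": [74, 88],
    }
)

# axis=1 aggregates across the score columns of each row
toy["total_score"] = toy[score_cols].sum(axis=1)
toy["average"] = toy[score_cols].mean(axis=1)

print(toy)
```

If a score column is added or removed later, only `score_cols` needs to change.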


In [59]:
# Number of students with full marks in math, reading and writing

math_full = df[df['math_score'] == 100]['average'].count()
reading_full = df[df['reading_score'] == 100]['average'].count()
writing_full = df[df['writing_score'] == 100]['average'].count()

print(f'Students with full marks in math: {math_full}')
print(f'Students with full marks in reading: {reading_full}')
print(f'Students with full marks in writing: {writing_full}')


Students with full marks in math: 7
Students with full marks in reading: 17
Students with full marks in writing: 14


In [65]:
# Number of students who scored 30 or less in math, reading and writing

def count_of_student(limit_of_score):
    # numerical_col holds the three score columns identified earlier
    for i in numerical_col:
        low_score_count = df[df[i] <= limit_of_score]['average'].count()
        print(f'Number of students who scored {limit_of_score} or less in {i}: {low_score_count}')
        
count_of_student(30)
        

Number of students who scored 30 or less in math_score: 16
Number of students who scored 30 or less in reading_score: 8
Number of students who scored 30 or less in writing_score: 10
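
The per-column loop can also be written as one vectorized comparison. The sketch below is an illustration on made-up data: `count_at_or_below` and `toy` are hypothetical names, and the comparison is inclusive (`<=`), matching the cell above.

```python
import pandas as pd

score_cols = ["math_score", "reading_score", "writing_score"]


def count_at_or_below(frame: pd.DataFrame, columns: list, limit: int) -> pd.Series:
    """Count, per column, the rows whose value is less than or equal to limit."""
    # The comparison yields a boolean frame; summing counts the True values
    return (frame[columns] <= limit).sum()


# Toy scores (an assumption, not the real dataset)
toy = pd.DataFrame(
    {
        "math_score": [0, 30, 31, 72],
        "reading_score": [17, 45, 90, 72],
        "writing_score": [10, 30, 29, 74],
    }
)

low_counts = count_at_or_below(toy, score_cols, 30)
print(low_counts)
```

Returning a Series instead of printing inside the function keeps the counts available for later comparison or plotting.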


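#### Insights
* Judging by the counts above, students performed worst in math: it has the fewest full marks (7) and the most scores of 30 or less (16).
* Performance is best in reading, with the most full marks (17) and the fewest scores of 30 or less (8).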
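
To make the comparison explicit, the figures reported in the outputs above (means from `df.describe()`, full-mark counts and low-score counts) can be placed side by side. This is a sketch only: the `reported` frame is assembled by hand from those printed outputs, and its column names are illustrative.

```python
import pandas as pd

# Values copied from the notebook outputs above, not recomputed from the CSV
reported = pd.DataFrame(
    {
        "mean": [66.089, 69.169, 68.054],
        "full_marks": [7, 17, 14],
        "at_or_below_30": [16, 8, 10],
    },
    index=["math_score", "reading_score", "writing_score"],
)

weakest_by_mean = reported["mean"].idxmin()
strongest_by_mean = reported["mean"].idxmax()
most_low_scores = reported["at_or_below_30"].idxmax()
most_full_marks = reported["full_marks"].idxmax()

print(reported)
print(f"Lowest mean: {weakest_by_mean}, highest mean: {strongest_by_mean}")
print(f"Most low scores: {most_low_scores}, most full marks: {most_full_marks}")
```

All three indicators point the same way, which is what the insight relies on.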