## Life Cycle of a Machine Learning Project
#### Understanding the Problem Statement
#### Data Collection
#### Data Checks to Perform
#### Exploratory Data Analysis
#### Data Pre-Processing
#### Model Training
#### Choose the Best Model

1) Problem statement:
   
   This project examines how student performance (test scores) is affected by other variables such as gender, ethnicity, parental level of education, lunch, and test preparation course.

2) Data Collection:
   
   Dataset Source - https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977  
   The data consists of 8 columns and 1000 rows.

2.1 Import Data and Required Packages

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings

In [2]:
df = pd.read_csv('data/StudentsPerformance.csv')

In [3]:
df.head()

| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


In [4]:
df.shape

(1000, 8)

2.2 Dataset information:  

    gender : sex of the student -> (male/female)
    race/ethnicity : ethnicity group of the student -> (group A, B, C, D, E)
    parental level of education : parents' highest education -> (bachelor's degree, some college, master's degree, associate's degree, high school, some high school)
    lunch : lunch type -> (standard or free/reduced)
    test preparation course : whether the course was completed before the test -> (none or completed)
    math score
    reading score
    writing score
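
The column descriptions above can be turned into a quick sanity check. The following is a minimal sketch, not part of the original notebook: `EXPECTED_COLUMNS`, `SCORE_COLUMNS`, and `validate_schema` are hypothetical names, the 0 to 100 score range is an assumption, and the sample rows are illustrative rather than the real file.

```python
import pandas as pd

# Column names as listed in the dataset description above.
EXPECTED_COLUMNS = [
    "gender", "race/ethnicity", "parental level of education", "lunch",
    "test preparation course", "math score", "reading score", "writing score",
]
SCORE_COLUMNS = ["math score", "reading score", "writing score"]


def validate_schema(frame: pd.DataFrame) -> list[str]:
    """Return a list of readable problems; an empty list means no issue found."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    for col in SCORE_COLUMNS:
        # Assumption: valid scores lie between 0 and 100 inclusive.
        if col in frame.columns and not frame[col].between(0, 100).all():
            problems.append(f"{col} has values outside 0-100")
    return problems


# Tiny illustrative frame, not the real dataset.
sample = pd.DataFrame({
    "gender": ["female", "male"],
    "race/ethnicity": ["group B", "group A"],
    "parental level of education": ["some college", "high school"],
    "lunch": ["standard", "free/reduced"],
    "test preparation course": ["none", "completed"],
    "math score": [72, 47],
    "reading score": [72, 57],
    "writing score": [74, 44],
})
schema_problems = validate_schema(sample)
print(schema_problems)
```

On the real data the same call would be `validate_schema(df)`; an empty list suggests the file matches the description.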

3. Data checks to perform:  
   Check missing values  
   Check duplicates  
   Check data types  
   Check the number of unique values in each column  
   Check statistics of the dataset  
   Check the categories present in each categorical column  
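
The checks listed above could also be gathered into a single helper. This is a minimal sketch under stated assumptions: `run_basic_checks` is a hypothetical function name, and the `demo` frame is made-up data used only to show the shape of the result.

```python
import pandas as pd


def run_basic_checks(frame: pd.DataFrame) -> dict:
    """Collect the basic data checks into one dictionary."""
    # Treat every non-numeric column as categorical.
    categorical = frame.select_dtypes(exclude="number").columns
    return {
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "unique_per_column": frame.nunique().to_dict(),
        "statistics": frame.describe(),
        "categories": {
            col: sorted(frame[col].dropna().unique()) for col in categorical
        },
    }


# Made-up frame with one missing value and one duplicated row.
demo = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard", None],
    "math score": [72, 47, 72, 90],
})
report = run_basic_checks(demo)
print(report["missing_per_column"])
print(report["duplicate_rows"])
print(report["categories"])
```

The cells that follow run the same checks one at a time, which is easier to read in a notebook; a helper like this is mainly useful when the checks need to be repeated on several files.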

In [5]:
df.isna().sum()

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

In [6]:
df.duplicated().sum()

0

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


In [9]:
df.nunique()

gender                          2
race/ethnicity                  5
parental level of education     6
lunch                           2
test preparation course         2
math score                     81
reading score                  72
writing score                  77
dtype: int64

In [10]:
df.describe()

| | math score | reading score | writing score |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 66.089 | 69.169 | 68.054 |
| std | 15.16308 | 14.600192 | 15.195657 |
| min | 0.0 | 17.0 | 10.0 |
| 25% | 57.0 | 59.0 | 57.75 |
| 50% | 66.0 | 70.0 | 69.0 |
| 75% | 77.0 | 79.0 | 79.0 |
| max | 100.0 | 100.0 | 100.0 |
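
The quartiles reported by `describe()` can be combined into an interquartile range (IQR) to flag unusually low or high scores. The sketch below is an illustration, not a step from the original notebook: `iqr_bounds` is a hypothetical helper, the 1.5 × IQR threshold is a common convention assumed here, and `demo_scores` is made-up data.

```python
import pandas as pd


def iqr_bounds(scores: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) bounds computed as Q1 - k*IQR and Q3 + k*IQR."""
    q1 = scores.quantile(0.25)
    q3 = scores.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


# Made-up scores; the real call would use df['math score'] and so on.
demo_scores = pd.Series([0, 57, 60, 66, 70, 77, 80], name="math score")
low, high = iqr_bounds(demo_scores)

# Keep only the values that fall outside the bounds.
flagged = demo_scores[(demo_scores < low) | (demo_scores > high)]
print(low, high, flagged.tolist())
```

Applied by hand to the math score quartiles shown above (57 and 77), the same rule gives bounds of 27 and 107, so the minimum of 0 would fall below the lower bound.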


In [11]:
df.head()

| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


In [13]:
# Print the distinct categories of each categorical column
for col in ['gender', 'race/ethnicity', 'parental level of education',
            'lunch', 'test preparation course']:
    print(f"Categories in '{col}' variable:", df[col].unique())

Categories in 'gender' variable: ['female' 'male']
Categories in 'race/ethnicity' variable: ['group B' 'group C' 'group A' 'group D' 'group E']
Categories in 'parental level of education' variable: ["bachelor's degree" 'some college' "master's degree" "associate's degree"
 'high school' 'some high school']
Categories in 'lunch' variable: ['standard' 'free/reduced']
Categories in 'test preparation course' variable: ['none' 'completed']
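
Listing the categories shows which values exist but not how often each occurs. As a hedged sketch, `value_counts()` can add the frequencies; `category_counts` is a hypothetical helper and the `demo` frame is made-up data rather than the real dataset.

```python
import pandas as pd


def category_counts(frame: pd.DataFrame, columns: list[str]) -> dict:
    """Map each column name to a {category: count} dictionary."""
    return {col: frame[col].value_counts().to_dict() for col in columns}


# Made-up rows for illustration only.
demo = pd.DataFrame({
    "gender": ["female", "female", "male"],
    "lunch": ["standard", "free/reduced", "standard"],
})
counts = category_counts(demo, ["gender", "lunch"])
for col, table in counts.items():
    print(col, table)
```

On the real data, passing the five categorical column names would show whether any category is too rare to analyse on its own.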


In [14]:
# Define numerical and categorical columns

In [15]:
# Object-dtype columns hold the categorical strings; the rest are the numeric scores
numerical_features = [col for col in df.columns if df[col].dtype != 'O']
categorical_features = [col for col in df.columns if df[col].dtype == 'O']


In [16]:
# Print the column lists
print(f'We have {len(numerical_features)} numerical features : {numerical_features}')

print(f'We have {len(categorical_features)} categorical features : {categorical_features}')


We have 3 numerical features : ['math score', 'reading score', 'writing score']
We have 5 categorical features : ['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']
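
An alternative to comparing dtypes against `'O'` is pandas' `select_dtypes`, which does not rely on string columns being stored as `object` (an assumption that may not hold in every pandas version or configuration). This is a minimal sketch on a made-up `demo` frame; `numeric_cols` and `categorical_cols` are illustrative names.

```python
import pandas as pd

# Made-up frame with two string columns and two numeric columns.
demo = pd.DataFrame({
    "gender": ["female", "male"],
    "lunch": ["standard", "free/reduced"],
    "math score": [72, 47],
    "reading score": [72, 57],
})

# "number" matches all integer and float dtypes.
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
# Everything that is not numeric is treated as categorical here.
categorical_cols = demo.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```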


Adding columns for "Total Score" and "Average"

In [17]:
df['total_score']=df['math score']+df['reading score']+df['writing score']
df['avarage']=df['total_score']/3
df.head()

| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | total_score | avarage |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 | 218 | 72.666667 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 | 247 | 82.333333 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 | 278 | 92.666667 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 | 148 | 49.333333 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 | 229 | 76.333333 |
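
The same totals and averages can be computed row-wise with `sum(axis=1)` and `mean(axis=1)`, which avoids hard-coding the divisor if the set of score columns changes. This is a sketch, not the notebook's own cell: `score_cols`, `totals`, and `averages` are illustrative names, and `demo` reuses the scores of the first two rows shown above.

```python
import pandas as pd

score_cols = ["math score", "reading score", "writing score"]

# Scores of the first two rows from the table above.
demo = pd.DataFrame({
    "math score": [72, 69],
    "reading score": [72, 90],
    "writing score": [74, 88],
})

# axis=1 aggregates across the columns of each row.
totals = demo[score_cols].sum(axis=1)
averages = demo[score_cols].mean(axis=1)

print(totals.tolist())
print(averages.round(6).tolist())
```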
