# Student Performance Indicator

### Life cycle of a machine learning project
- Understanding the problem statement
- Data collection
- Data checks to perform
- Exploratory data analysis
- Data pre-processing
- Model training
- Choose the best model

## 1. Problem statement
- This project examines how student performance (test scores) is affected by other variables such as gender, ethnicity, parental level of education, lunch, and test preparation course.

## 2. Data collection
- Data source: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977
- The data consists of 8 columns and 1000 rows.

### 2.1 Import required data and packages

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from tabulate import tabulate

### Import the CSV file


In [3]:
df = pd.read_csv("data/stud.csv")
df.head(10)

| | gender | race_ethnicity | parental_level_of_education | lunch | test_preparation_course | math_score | reading_score | writing_score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |
| 5 | female | group B | associate's degree | standard | none | 71 | 83 | 78 |
| 6 | female | group B | some college | standard | completed | 88 | 95 | 92 |
| 7 | male | group B | some college | free/reduced | none | 40 | 43 | 39 |
| 8 | male | group D | high school | free/reduced | completed | 64 | 64 | 67 |
| 9 | female | group B | high school | free/reduced | none | 38 | 60 | 50 |


Shape of the dataset


In [4]:
df.shape

(1000, 8)

### 2.2 Dataset information

- Gender: male or female
- Race/ethnicity: group A, B, C, D, or E
- Parental level of education: bachelor's degree, some college, master's degree, associate's degree, high school, some high school
- Lunch: type of lunch before the test (standard or free/reduced)
- Test preparation course: completed or none
- Math score
- Reading score
- Writing score

## 3. Data checks to perform

- Missing values
- Duplicates
- Data types
- Number of unique values in each column
- Statistics of the data
- Categories present in the different categorical columns
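
The checks listed above can be bundled into one helper for a quick first pass. This is a minimal sketch, not part of the original notebook: the function name `run_basic_checks` and the small sample frame are assumptions made for illustration, with values taken from rows shown earlier.

```python
import pandas as pd


def run_basic_checks(frame: pd.DataFrame) -> dict:
    """Collect basic data-quality checks in one dictionary (hypothetical helper)."""
    return {
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "unique_per_column": frame.nunique().to_dict(),
    }


# Small illustrative frame; the real notebook would pass the full df instead
sample = pd.DataFrame({
    "gender": ["female", "female", "male"],
    "lunch": ["standard", "standard", "free/reduced"],
    "math_score": [72, 69, 47],
})

report = run_basic_checks(sample)
for name, result in report.items():
    print(name, result)
```

Each check is still run individually in the sections that follow, which makes the intermediate output easier to inspect.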

### 3.1 Missing values

In [7]:
df.isna().sum()

gender                         0
race_ethnicity                 0
parental_level_of_education    0
lunch                          0
test_preparation_course        0
math_score                     0
reading_score                  0
writing_score                  0
dtype: int64

### 3.2 Duplicates

In [8]:
df.duplicated().sum()

0

### 3.3 Data types

In [11]:
#df.dtypes 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race_ethnicity               1000 non-null   object
 2   parental_level_of_education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test_preparation_course      1000 non-null   object
 5   math_score                   1000 non-null   int64 
 6   reading_score                1000 non-null   int64 
 7   writing_score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


### 3.4 Number of unique values in each column

In [9]:
total_unique_value=df.nunique().sum()
unique_value_at_column = df.nunique()
print(f'total unique value in all column:{total_unique_value}'+"\n\n\n")

print(tabulate(pd.DataFrame(unique_value_at_column), headers= ["columns", "Number of unique values"], tablefmt= "outline"))


total unique value in all column:247



+-----------------------------+---------------------------+
| columns                     |   Number of unique values |
| gender                      |                         2 |
| race_ethnicity              |                         5 |
| parental_level_of_education |                         6 |
| lunch                       |                         2 |
| test_preparation_course     |                         2 |
| math_score                  |                        81 |
| reading_score               |                        72 |
| writing_score               |                        77 |
+-----------------------------+---------------------------+


### 3.5 Statistics of the dataset


In [28]:
df.describe()

| | math_score | reading_score | writing_score |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 66.089 | 69.169 | 68.054 |
| std | 15.16308 | 14.600192 | 15.195657 |
| min | 0.0 | 17.0 | 10.0 |
| 25% | 57.0 | 59.0 | 57.75 |
| 50% | 66.0 | 70.0 | 69.0 |
| 75% | 77.0 | 79.0 | 79.0 |
| max | 100.0 | 100.0 | 100.0 |


#### Insight

- Each of the three score columns (math, reading, writing) contains 1000 values.
- The mean scores are close to each other: math 66.09, reading 69.17, writing 68.05.
- The minimum scores are: math 0, reading 17, writing 10.
- The standard deviations are also close to each other: math 15.16, reading 14.60, writing 15.20.
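
The figures quoted in these insights can be computed directly rather than read off the `describe()` table. A minimal sketch, assuming a frame with the same three score columns; the sample below holds only the first five rows shown earlier, so its output will differ from the full-dataset statistics. The names `score_columns` and `summary` are introduced here for illustration.

```python
import pandas as pd

score_columns = ["math_score", "reading_score", "writing_score"]

# First five rows of the dataset as displayed by df.head() (sample only)
sample = pd.DataFrame({
    "math_score": [72, 69, 90, 47, 76],
    "reading_score": [72, 90, 95, 57, 78],
    "writing_score": [74, 88, 93, 44, 75],
})

# Mean, standard deviation and minimum per subject, rounded to two decimals
summary = sample[score_columns].agg(["mean", "std", "min"]).round(2)
print(summary)
```

Running the same `agg` call on the full `df` should give the rounded values listed above.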

### 3.6 Exploring the data

In [30]:
df.head()

| | gender | race_ethnicity | parental_level_of_education | lunch | test_preparation_course | math_score | reading_score | writing_score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


In [46]:

data = {
    'Variable': ['gender', 'race_ethnicity', 'parental_level_of_education', 'lunch', 'test_preparation_course'],
    'Categories': [
        ', '.join(df['gender'].unique()),
        ', '.join(df['race_ethnicity'].unique()),
        ', '.join(df['parental_level_of_education'].unique()),
        ', '.join(df['lunch'].unique()),
        ', '.join(df['test_preparation_course'].unique())
    ]
}


df_categories = pd.DataFrame(data)

print(tabulate(df_categories, headers='keys', tablefmt="outline"))



+----+-----------------------------+-----------------------------------------------------------------------------------------------------+
|    | Variable                    | Categories                                                                                          |
|  0 | gender                      | female, male                                                                                        |
|  1 | race_ethnicity              | group B, group C, group A, group D, group E                                                         |
|  2 | parental_level_of_education | bachelor's degree, some college, master's degree, associate's degree, high school, some high school |
|  3 | lunch                       | standard, free/reduced                                                                              |
|  4 | test_preparation_course     | none, completed                                                                                     |
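+----+-----------------------------+-----------------------------------------------------------------------------------------------------+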

### 3.7 Define numerical and categorical columns


In [15]:
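# Columns whose dtype is not object ('O') are treated as numerical
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']

# Columns with object dtype are treated as categorical
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']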

print(tabulate([
    ['Numerical Features', len(numeric_features), numeric_features],
    ['Categorical Features', len(categorical_features), categorical_features]
], headers=['Feature Type', 'Count', 'Features'], tablefmt= "outline"))


print()

+----------------------+---------+-------------------------------------------------------------------------------------------------+
| Feature Type         |   Count | Features                                                                                        |
| Numerical Features   |       3 | ['math_score', 'reading_score', 'writing_score']                                                |
| Categorical Features |       5 | ['gender', 'race_ethnicity', 'parental_level_of_education', 'lunch', 'test_preparation_course'] |
+----------------------+---------+-------------------------------------------------------------------------------------------------+
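
An alternative to comparing each column's dtype with `'O'` is pandas' `select_dtypes`, which selects columns by dtype family and so also handles numeric dtypes other than `int64`. A minimal sketch on a small sample frame (the sample values are illustrative, copied from rows shown earlier):

```python
import pandas as pd

sample = pd.DataFrame({
    "gender": ["female", "male"],
    "lunch": ["standard", "free/reduced"],
    "math_score": [72, 47],
    "reading_score": [72, 57],
})

# Columns with a numeric dtype (int, float, ...)
numeric_features = sample.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categorical in this sketch
categorical_features = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_features)
print(categorical_features)
```

On the full dataset this should yield the same split as the list comprehensions, since the only non-numeric columns there are the five object columns.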



### 3.8 Adding columns for total score and average score



In [17]:
df["total score"]= df["math_score"] + df["reading_score"] + df["writing_score"]
df["average score"]= df["total score"]/3
df.head()

| | gender | race_ethnicity | parental_level_of_education | lunch | test_preparation_course | math_score | reading_score | writing_score | total score | average score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 | 218 | 72.666667 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 | 247 | 82.333333 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 | 278 | 92.666667 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 | 148 | 49.333333 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 | 229 | 76.333333 |
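
The same two columns can be derived row-wise from a list of score columns, which avoids retyping each column name if more subjects are added later. A minimal sketch on the five rows shown above; `score_columns` and `sample` are names introduced here for illustration.

```python
import pandas as pd

score_columns = ["math_score", "reading_score", "writing_score"]

# Scores from the first five rows of the dataset (sample only)
sample = pd.DataFrame({
    "math_score": [72, 69, 90, 47, 76],
    "reading_score": [72, 90, 95, 57, 78],
    "writing_score": [74, 88, 93, 44, 75],
})

# Row-wise sum and mean across the score columns
sample["total score"] = sample[score_columns].sum(axis=1)
sample["average score"] = sample[score_columns].mean(axis=1)
print(sample)
```

Both derived columns are computed from `score_columns` only, so adding `total score` first does not affect the average.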


In [38]:

# Count the students who scored the maximum of 100 in each subject
reading_full = int((df['reading_score'] == 100).sum())
writing_full = int((df['writing_score'] == 100).sum())
math_full = int((df['math_score'] == 100).sum())

print(tabulate([{"Number of students with full marks in reading": reading_full,
                 "Number of student with full marks in maths": math_full,
                 "Number of student with full marks in writing":writing_full}], headers="keys", tablefmt="outline" ))



+-------------------------------------------------+----------------------------------------------+------------------------------------------------+
|   Number of students with full marks in reading |   Number of student with full marks in maths |   Number of student with full marks in writing |
|                                              17 |                                            7 |                                             14 |
+-------------------------------------------------+----------------------------------------------+------------------------------------------------+
