# Student Missing Value

In this notebook we will learn how to deal with missing values in a dataset and how to tackle this obstacle during exploratory data analysis.

### 1. Importing necessary libraries

In [4]:
import numpy as np # numpy provides array storage and numerical operations
import pandas as pd # pandas helps in storing and representing data frames
import matplotlib.pyplot as plt # matplotlib helps in data visualization

### 2. Loading the data set


The "student_missing" dataset comprises the scores of 50 students across three distinct categories: Score 1, Score 2, and Score 3.

In [7]:
df = pd.read_csv("student_missing.csv") # load the csv file containing the score information into a pandas data frame
df.head()

Unnamed: 0,StudentID,Score1,Score2,Score3
0,1,76.0,90.0,85.0
1,2,95.0,,92.0
2,3,86.0,70.0,
3,4,72.0,,78.0
4,5,,88.0,80.0


This dataset has many missing student scores, so let us dig deeper and find out how many values are missing.

In [6]:
missing = df.isnull().sum() # isnull() flags each missing value and sum() counts the flags in every column
print(missing)

StudentID     0
Score1       26
Score2       32
Score3       29
dtype: int64


From this we can observe the total number of missing values in each column.
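
Raw counts are easier to judge as a share of the data, and counting per student shows how the gaps are spread across rows. The sketch below is illustrative only: `sample` is a hypothetical stand-in rebuilt from the five rows shown by `df.head()`, not the full 50-student file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for student_missing.csv: only the five rows shown by df.head()
sample = pd.DataFrame({
    "StudentID": [1, 2, 3, 4, 5],
    "Score1": [76.0, 95.0, 86.0, 72.0, np.nan],
    "Score2": [90.0, np.nan, 70.0, np.nan, 88.0],
    "Score3": [85.0, 92.0, np.nan, 78.0, 80.0],
})

# Share of missing entries per column, as a percentage
missing_pct = sample.isna().mean() * 100

# Number of missing values per student (row-wise count)
missing_per_student = sample.isna().sum(axis=1)

print(missing_pct)
print(missing_per_student)
```

On the real data, the same two lines can be run on `df` directly.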

### 3. Dealing with Missing Values


We have two approaches to handle missing values, tailored to the nature of the data and the analytical needs: either we **discard** the entries or **impute** values in place of the missing data.

### a.) Dropping the values


We can drop either the rows or the columns that contain missing values and continue the analysis with the remaining data, provided enough data is left to work with, as demonstrated below:

In [17]:
# Dropping rows
drop = df.dropna(axis=0)  # dropna with axis=0 drops every row that contains a missing value
print(drop.shape)
print(drop)

(2, 4)
    StudentID  Score1  Score2  Score3
0           1    76.0    90.0    85.0
16         17    78.0    97.0    95.0


In [18]:
# Dropping columns
drop = df.dropna(axis=1)  # axis=1 drops every column that contains a missing value
print(drop.shape)
drop.head()

(50, 1)


Unnamed: 0,StudentID
0,1
1,2
2,3
3,4
4,5



After employing the drop method to clean our data, it becomes evident that if we drop all rows with missing values, we are left with only 2 rows for analysis. Conversely, dropping the columns that contain missing values eliminates every score column, leaving only StudentID and nothing to analyze. Thus, dropping the data is not a feasible approach for this dataset.
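
Dropping does not have to be all or nothing. `dropna` also accepts `subset` (which columns to inspect) and `thresh` (the minimum number of non-missing values a row needs to be kept). The sketch below is a hedged illustration on a hypothetical frame `sample` rebuilt from the five rows shown by `df.head()`; the threshold of two scores is an assumption chosen for the example, not a recommendation from the data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for student_missing.csv: only the five rows shown by df.head()
sample = pd.DataFrame({
    "StudentID": [1, 2, 3, 4, 5],
    "Score1": [76.0, 95.0, 86.0, 72.0, np.nan],
    "Score2": [90.0, np.nan, 70.0, np.nan, 88.0],
    "Score3": [85.0, 92.0, np.nan, 78.0, 80.0],
})
score_cols = ["Score1", "Score2", "Score3"]

# Strict drop: keep only students with every score recorded
complete = sample.dropna()

# Keep students with at least two of the three scores recorded
kept = sample.dropna(subset=score_cols, thresh=2)

# Drop a student only when Score1 specifically is missing
only_score1 = sample.dropna(subset=["Score1"])

print(len(complete), len(kept), len(only_score1))
```

Whether a looser rule like this retains enough of the 50 students depends on how the gaps are distributed in the full file.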

### b.) Data Imputation


Data imputation is the process of filling in missing values in a dataset using estimation techniques based on observed data.

### 1. Filling in 0's

In this method we fill the missing values with 0.

In [24]:
fill = df.fillna(0)  # fillna replaces each missing value with its argument, here 0
fill.head()

Unnamed: 0,StudentID,Score1,Score2,Score3
0,1,76.0,90.0,85.0
1,2,95.0,0.0,92.0
2,3,86.0,70.0,0.0
3,4,72.0,0.0,78.0
4,5,0.0,88.0,80.0



For this dataset, treating a missing score as an absence and recording it as 0 could be a reasonable approach. However, if the professor prefers to award some credit rather than a 0, the alternatives below can be explored:
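
One thing worth checking before settling on zeros is how they shift summary statistics, since pandas skips missing values when averaging but counts a 0 like any other score. A minimal sketch, assuming a hypothetical frame `sample` rebuilt from the five rows shown by `df.head()`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for student_missing.csv: only the five rows shown by df.head()
sample = pd.DataFrame({
    "StudentID": [1, 2, 3, 4, 5],
    "Score1": [76.0, 95.0, 86.0, 72.0, np.nan],
    "Score2": [90.0, np.nan, 70.0, np.nan, 88.0],
    "Score3": [85.0, 92.0, np.nan, 78.0, 80.0],
})

zero_filled = sample.fillna(0)

# Mean over the recorded Score2 values only (NaN entries are skipped)
mean_recorded = sample["Score2"].mean()

# Mean with the two absences counted as 0
mean_zero_filled = zero_filled["Score2"].mean()

print(mean_recorded, mean_zero_filled)
```

In this small sample the Score2 average falls sharply once absences count as zeros, which is the intended effect if a missed test should be penalised and a distortion otherwise.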

### 2.  Forward fill (ffill)

The ffill method propagates the last known value forward into the missing value. In this scenario, let us assume the professor wishes to assign the Score 1 grade if the student is missing Score 2, and so on.

In [11]:
fill = df.ffill(axis=1)  # forward fill along each row (axis=1); equivalent to fillna(method='ffill', axis=1), which newer pandas deprecates
fill.head()

Unnamed: 0,StudentID,Score1,Score2,Score3
0,1.0,76.0,90.0,85.0
1,2.0,95.0,95.0,92.0
2,3.0,86.0,70.0,70.0
3,4.0,72.0,72.0,78.0
4,5.0,5.0,88.0,80.0
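
Note the last row of the output above: student 5 has no Score1, so the value carried forward is the StudentID (5.0), and StudentID itself has been converted to float. One way to avoid this is to forward fill only the score columns. The sketch below assumes a hypothetical frame `sample` rebuilt from the five rows shown by `df.head()`; `score_cols` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for student_missing.csv: only the five rows shown by df.head()
sample = pd.DataFrame({
    "StudentID": [1, 2, 3, 4, 5],
    "Score1": [76.0, 95.0, 86.0, 72.0, np.nan],
    "Score2": [90.0, np.nan, 70.0, np.nan, 88.0],
    "Score3": [85.0, 92.0, np.nan, 78.0, 80.0],
})
score_cols = ["Score1", "Score2", "Score3"]

# Forward fill across the score columns only, leaving StudentID untouched
filled = sample.copy()
filled[score_cols] = sample[score_cols].ffill(axis=1)

print(filled)
```

With this restriction a missing Score1 stays missing, because there is no earlier score in the row to carry forward.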


### 3. Backward fill (bfill)

The bfill method propagates the next known value backward into the missing value. In this scenario, let us assume the professor wishes to assign the next recorded grade, for example Score 2 if the student is missing Score 1, and so on.

In [27]:
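fill = df.bfill(axis=1)  # backward fill along each row (axis=1): a missing score takes the next recorded score to its right
fill.head()              # a missing Score3 has no later score to copy, so it stays missing

Unnamed: 0,StudentID,Score1,Score2,Score3
0,1.0,76.0,90.0,85.0
1,2.0,95.0,92.0,92.0
2,3.0,86.0,70.0,
3,4.0,72.0,78.0,78.0
4,5.0,88.0,88.0,80.0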



While forward fill (ffill) and backward fill (bfill) can replace missing values, they may not suit this dataset. Awarding a student the grade from a previous or subsequent test for one they did not attend may not accurately reflect their performance. Hence, exploring a more appropriate approach is advisable.

### 4. Mean Imputation


In the given situation, mean imputation emerges as the most feasible approach. This method assigns a student who missed a test the average of the tests they did take, which is more meaningful if the professor opts to provide partial credit instead of a zero for the missed test.

In [29]:
row_means = df.mean(axis=1)  # row-wise mean over every numeric column, which here includes StudentID
for i, row in df.iterrows():  # replace each row's missing values with that row's mean
    df.loc[i] = row.fillna(row_means[i])
df.head()

Unnamed: 0,StudentID,Score1,Score2,Score3
0,1,76.0,90.0,85.0
1,2,95.0,63.0,92.0
2,3,86.0,70.0,53.0
3,4,72.0,51.333333,78.0
4,5,57.666667,88.0,80.0
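
The filled values above deserve a second look: the 63.0 given to student 2 equals (2 + 95 + 92) / 3, so the row mean has averaged in the StudentID along with the two recorded scores, whose own mean is 93.5. A minimal sketch that restricts the mean to the score columns follows. It assumes a hypothetical frame `sample` rebuilt from the five rows shown by `df.head()`, and `score_cols` is a name introduced for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for student_missing.csv: only the five rows shown by df.head()
sample = pd.DataFrame({
    "StudentID": [1, 2, 3, 4, 5],
    "Score1": [76.0, 95.0, 86.0, 72.0, np.nan],
    "Score2": [90.0, np.nan, 70.0, np.nan, 88.0],
    "Score3": [85.0, 92.0, np.nan, 78.0, 80.0],
})
score_cols = ["Score1", "Score2", "Score3"]

# Each student's mean over the scores they actually have (StudentID excluded)
row_means = sample[score_cols].mean(axis=1)

# Fill every score column from row_means; fillna aligns the two on the row index
imputed = sample.copy()
imputed[score_cols] = sample[score_cols].apply(lambda col: col.fillna(row_means))

print(imputed)
```

A student with no recorded scores at all has no row mean, so their entries would stay missing under this variant and would need a separate rule.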


Lastly, let's check again for any missing values.

In [22]:
missing=df.isnull().sum()
print(missing)

StudentID    0
Score1       0
Score2       0
Score3       0
dtype: int64


After verification, we can conclude that we have successfully imputed the missing values, resulting in a complete dataset ready for analysis.

### Conclusion


We can conclude that this notebook has provided us with a variety of techniques for handling missing values, including dropping, filling with 0s, forward fill (ffill), backward fill (bfill), and mean imputation. These methods offer flexibility and can be applied based on the specific situation and characteristics of the data being analyzed, enhancing our ability to effectively manage missing data in future analyses.