<h1>Dataset 2: "Students performance in Exams"</h1>
<h2>Task: Data Imputation</h2>
Shacha Parker (300235525)\
Callum Frodsham (300199446)\
Group 79

### Setup Instructions To Reproduce this Data Cleaning Notebook:
(Step 1 Optional)
1. Create a virtual python environment in the project directory (if you want) for all of the packages required:  
``` 
python -m venv .venv
```
To enter the virtual environment:
```
.venv/Scripts/activate.ps1 # on windows
source .venv/bin/activate # on mac/linux
```
2. Download all of the required packages (run in cmd/shell of choice):
```
pip install pandas
pip install numpy

pip install scikit-learn
```
3. VSCode: Ensure you have the correct python kernel selected!
<br> 
If you are using a virtual environment, make sure to select the python interpreter for that virtual environment otherwise this will not work! If you have everything done globally, then just make sure the correct python kernel you are using is selected.

In [484]:
# imports
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

data = pd.read_csv("https://raw.githubusercontent.com/CLFrod/Assignment2CSI4142/refs/heads/master/StudentsPerformance.csv")

### Dataset Information

Author: Jakki Seshapanpu

"This data set consists of the marks secured by [US high school] students in various subjects...To understand the influence of the parents background, test preparation etc on students performance"

<h3>Dataset Specifications & Features</h3>

The dataset has 8 columns and 1000 rows.

Link: <a href="https://www.kaggle.com/datasets/spscientist/students-performance-in-exams/data">Students Performance in Exams</a>

### Feature List
#### Gender
Categorical - Nominal
<br>
Male/Female
#### Race/Ethnicity
Categorical - Nominal
<br>
Referred to as groups ranging from A-E
#### Parental Level of Education
Categorical - Ordinal
<br>
Level of education of the student's parents.
<br>
Arguably, this data type could be nominal. I chose ordinal because the diploma types can be ranked by level of education.
#### Lunch
Categorical - Nominal
<br>
The type of lunch plan the student is paying for.
#### Test Preparation Course
Categorical - Nominal
<br>
Whether or not the student took the course to prepare for the test.
#### Math Score
Numerical - Discrete
#### Reading Score
Numerical - Discrete
#### Writing Score
Numerical - Discrete

### Imputation Test #1
We will simulate "Missing Completely At Random"-type missing values.
<br>
For this, the 'gender' column will be used.
<br>
We can randomly delete entries in the column amounting to a fraction of the total data.
<br>
Then, we will use Random Sample Imputation to re-fill the empty entries.

In [485]:
# Make 2 copies of the 'gender' column
D1 = data['gender'].copy()
D2 = data['gender'].copy()

# Calculate the number of entries to drop
drop = 0.50 # a simulated percentage
num_to_drop = int(len(D2) * drop)

# Randomly select entries to drop
np.random.seed(0)  # for reproducibility. comment out for multiple tries
drop_indices = np.random.choice(D2.index, num_to_drop, replace=False)
D2.loc[drop_indices] = np.nan

# Fill empty entries in D2 with random samples from D1
missing = D2.isnull()
num_missing = missing.sum()
if num_missing > 0:
    sampled_values = D1.dropna().sample(num_missing, replace=True).values
    D2.loc[missing] = sampled_values

# Compare D1 to D2 in similarity
similarity = (D1 == D2).mean()
print(f"Similarity between D1 and D2: {similarity * 100}%")

Similarity between D1 and D2: 75.6%
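
The cell above draws its replacements from `D1`, the complete column, so the sampling pool still contains the values that were deleted. A stricter variant samples only from the entries that remain observed. The sketch below shows that variant on a small made-up column; the function name `random_sample_impute` and the toy data are illustrative assumptions, not part of the assignment code.

```python
import numpy as np
import pandas as pd

def random_sample_impute(column: pd.Series, seed: int = 0) -> pd.Series:
    """Fill NaN entries with values drawn at random from the observed entries."""
    rng = np.random.default_rng(seed)
    filled = column.copy()
    missing = filled.isna()
    observed = filled.dropna().to_numpy()
    # Draw one observed value (with replacement) for every missing entry
    filled.loc[missing] = rng.choice(observed, size=int(missing.sum()), replace=True)
    return filled

# Toy stand-in for the 'gender' column (hypothetical data)
toy = pd.Series(["male", "female"] * 10, dtype=object)
toy.iloc[[1, 4, 7, 12]] = np.nan

imputed = random_sample_impute(toy)
print(imputed.isna().sum())  # number of entries still missing after imputation
```

Because the pool excludes the deleted entries, this version never uses information that would be unavailable with real missing data.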


#### Test Results and Discussion
Results of Random Sample Imputation with random seed 0, for several drop percentages:
<br>
Format: **drop percentage : D1-D2 similarity percentage : accuracy rating** (the accuracy rating is the share of imputed entries that match the original value)
<br>
5% : 96.5% : 30%
<br>
10% : 95.1% : 51%
<br>
15% : 93% : 53%
<br>
20% : 89.9% : 50%
<br>
33% : 83.8% : 54%
<br>
50% : 75.6% : 51%
<br>
For this column, where the two categories are split close to 50/50, Random Sample Imputation is not very effective at recreating the original entries: the average accuracy rating is about 50%, which is what guessing between two equally common values would give. I would expect the accuracy to fall further as the number of categories increases past 2. This imputation is fine for preserving the proportion of each category, but it cannot confidently recover the original value of an individual cell. It does as well as it does here only because there are just two possible entries, Male or Female, which is why I chose this column for the experiment. Only accuracy was calculated because precision and recall are trivial for this column's values.
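
The 50% figure can be reasoned about directly. If a replacement is drawn at random from the column's own distribution, it matches the true value with probability equal to the sum of the squared category shares. The sketch below computes that quantity for balanced columns with two and five categories; the helper name `expected_match_rate` and the toy columns are illustrative assumptions.

```python
import pandas as pd

def expected_match_rate(column: pd.Series) -> float:
    """Probability that a random draw from the column equals another random draw."""
    shares = column.value_counts(normalize=True)
    return float((shares ** 2).sum())

# Hypothetical balanced columns with 2 and 5 categories
two_categories = pd.Series(["male", "female"] * 50)
five_categories = pd.Series(list("ABCDE") * 20)

rate_two = expected_match_rate(two_categories)
rate_five = expected_match_rate(five_categories)
print(rate_two, rate_five)
```

For balanced categories the rate is one divided by the number of categories, so it comes out near 0.5 for two categories and near 0.2 for five.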

### Imputation Test #2
We will simulate "Missing At Random"-type missing values.
For this, the 'reading score' will be the target for bivariate imputation, with 'writing score' serving as a contextual variable to predict from. In this situation, reading score data is more likely to be missing for those who scored low on the writing score.
<br>
We'll randomly delete entries in reading score where the writing score is on the lower end (< 50). Then, we'll use correlational imputation to try to accurately restore the original values.
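
A minimal sketch of the deletion step as described here, where the candidates are chosen by the writing score alone. It runs on made-up scores, and the variable names and the 20% drop fraction are assumptions for illustration. Note that the notebook cell further down picks its candidates from below the median reading score instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical scores standing in for the real columns
toy = pd.DataFrame({
    "reading score": rng.integers(30, 100, size=200),
    "writing score": rng.integers(30, 100, size=200),
})

reading_with_gaps = toy["reading score"].astype(float)

# Candidates are chosen by the writing score only, never by the reading score
candidates = toy.index[toy["writing score"] < 50].to_numpy()
n_drop = int(len(candidates) * 0.2)  # assumed drop fraction
drop_idx = rng.choice(candidates, size=n_drop, replace=False)
reading_with_gaps.loc[drop_idx] = np.nan

print(n_drop, "reading scores removed")
```

Under this selection, whether a reading score is missing depends only on an observed column, which is what distinguishes MAR from MNAR.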

In [486]:
# Calculate the correlation matrix
correlation_matrix = data[['math score', 'reading score', 'writing score']].corr()

# Print the correlation matrix
print(correlation_matrix)

               math score  reading score  writing score
math score       1.000000       0.817580       0.802642
reading score    0.817580       1.000000       0.954598
writing score    0.802642       0.954598       1.000000


We can see that reading and writing scores are very highly correlated (0.954598).

In [487]:
reading_mean = data['reading score'].mean()
reading_std = data['reading score'].std()
writing_mean = data['writing score'].mean()
writing_std = data['writing score'].std()
# Make 2 copies of the 'reading score' column
D1 = data['reading score'].copy()
D2 = data['reading score'].copy()

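# Candidates for deletion: entries below the median reading score.
# Note that this selects on the reading score itself rather than on the writing score.
# `drop` is a fraction of the whole column; about half of the entries fall below
# the median, so drop * 2 of the candidates are removed.
drop = 0.1  # 10%
below_median_indices = D2[D2 < D2.median()].index
num_to_drop = int(len(below_median_indices) * drop * 2)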

# Randomly select entries to drop from below the median, then turn their value to NaN
np.random.seed(0)  # for reproducibility
drop_indices = np.random.choice(below_median_indices, num_to_drop, replace=False)
D2.loc[drop_indices] = np.nan

In [488]:
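# Pearson correlation used as r in the formula below.
# D2 is the reading score column with gaps, so this correlates reading with
# itself and evaluates to 1 (up to rounding).
correlation = D2.corr(data['reading score'])

def correlationFormula(W):
    # Estimate reading from writing score W:
    # reading_mean + r * (reading_std / writing_std) * (W - writing_mean)
    return correlation * (reading_std / writing_std) * (W - writing_mean) + reading_mean

# Apply the correlation formula to every NaN value in D2
missing = D2.isna()
D2.loc[missing] = correlationFormula(data.loc[missing, 'writing score'])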
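
The formula in `correlationFormula` is the least-squares line of reading on writing, written in terms of the correlation $r$: $\hat{R} = \bar{R} + r \frac{s_R}{s_W}(W - \bar{W})$. In the cell above, `D2.corr(data['reading score'])` compares the reading column with itself, so $r$ evaluates to 1. The sketch below is one possible variant, run on made-up data with hypothetical names. It estimates $r$, the means, and the standard deviations from the rows where reading is still observed. Running it on the real columns would give numbers that differ from the output recorded in this notebook.

```python
import numpy as np
import pandas as pd

def correlation_impute(target: pd.Series, helper: pd.Series) -> pd.Series:
    """Fill NaN values in `target` from `helper` using statistics of observed rows only."""
    observed = target.notna()
    r = target[observed].corr(helper[observed])
    slope = r * target[observed].std() / helper[observed].std()
    estimate = target[observed].mean() + slope * (helper - helper[observed].mean())
    return target.fillna(estimate)

rng = np.random.default_rng(0)

# Synthetic, strongly correlated scores (not the assignment data)
writing = pd.Series(rng.normal(68, 15, size=300))
reading = writing + rng.normal(1, 4.5, size=300)

reading_with_gaps = reading.copy()
reading_with_gaps.iloc[:30] = np.nan

filled = correlation_impute(reading_with_gaps, writing)
rmse = float(np.sqrt(((filled.iloc[:30] - reading.iloc[:30]) ** 2).mean()))
print(round(rmse, 2))
```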

In [489]:
# Evaluation of results
MSE = 0
closenum = 0
closerange = 5
for i in drop_indices:
    MSE += (D1[i] - D2[i])**2
    if -closerange <= (D1[i] - D2[i]) <= closerange:
        closenum += 1

MSE = MSE / num_to_drop
print("Mean Squared Error:", MSE, "\nRoot Mean Squared Error:", MSE**(1/2), "\nNumber of Predictions within +/-", closerange, ":", closenum, "out of", len(drop_indices))

Mean Squared Error: 23.757924386104854 
Root Mean Squared Error: 4.8742101294573725 
Number of Predictions within +/- 5 : 63 out of 97


With a root mean squared error of about 4.9 for this column, where scores have a valid range of 0-100, I would say this method is a good estimator of the original values in D1.
<br>The accuracy ratio only makes sense once a tolerance is defined for what counts as "accurate". Here I defined it as a 'radius' of 5 points in either direction, so a total range of 10. With 63/97 (about 65%) of predictions inside that range, this method is moderately successful at getting a close score.
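
The same evaluation is used again for the next test, so it can be wrapped in a small helper. The sketch below is one way to do that; the function name `evaluate_imputation` and the four toy values are illustrative assumptions.

```python
import pandas as pd

def evaluate_imputation(original, imputed, dropped_idx, tolerance=5):
    """Return MSE, RMSE, and the count of imputed values within +/- tolerance."""
    errors = original.loc[dropped_idx] - imputed.loc[dropped_idx]
    mse = float((errors ** 2).mean())
    within = int((errors.abs() <= tolerance).sum())
    return {"mse": mse, "rmse": mse ** 0.5, "within": within, "total": len(dropped_idx)}

# Toy example: entries 0 and 2 were imputed, with errors of 3 and 7 points
original = pd.Series([60.0, 72.0, 45.0, 90.0])
imputed = pd.Series([63.0, 72.0, 52.0, 90.0])

result = evaluate_imputation(original, imputed, [0, 2])
print(result)
```

In the toy example the squared errors are 9 and 49, so the MSE is 29 and only one of the two predictions falls within 5 points.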

### Imputation Test #3
We will simulate "Missing Not At Random"-type missing values.
<br>
We'll use the numerical values again, this time including math score. Suppose that for some reason, students who have a high math score are permitted to not record their scores, and a certain portion opt into this. This is an unusual case but a valid representation of MNAR. We will then use regression imputation to try to replicate the original values.

In [490]:
D1 = data['math score'].copy()
D2 = data['math score'].copy()

# First step - remove a fraction of the values classified as "high grades"
threshold = 80 # changeable if needed
high_math_indices = data[data['math score'] > threshold].index

drop_percentage = 0.3  # removes a percentage of high math indices
num_to_drop = int(len(high_math_indices) * drop_percentage)

# Randomly select entries to drop
np.random.seed(0)  # for reproducibility
drop_indices = np.random.choice(high_math_indices, num_to_drop, replace=False)

# Drop the selected entries in the math scores
D2.loc[drop_indices] = np.nan

In [491]:
# Predictors: reading and writing scores. Target: math score.
# Note: the model is fit on every row of `data`, including the rows whose
# math score was removed from D2.
X = data[['reading score', 'writing score']].values
y = data['math score'].values

# Create model and fit
model = LinearRegression()
model.fit(X, y)

# Apply model for imputation: predict the math score for every NaN entry in D2
missing = D2.isna()
D2.loc[missing] = model.predict(data.loc[missing, ['reading score', 'writing score']].values)
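
The cell above fits the model on every row of `data`, so the model has already seen the math scores it is later asked to recover. With real missing data only the observed rows would be available. The sketch below fits on those rows only; it uses synthetic data, and the function name `regression_impute` and the toy coefficients are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def regression_impute(frame, target, predictors):
    """Fill NaN values in `target` using a linear model fit on the observed rows only."""
    observed = frame[target].notna()
    model = LinearRegression()
    model.fit(frame.loc[observed, predictors], frame.loc[observed, target])
    filled = frame[target].copy()
    if (~observed).any():
        filled.loc[~observed] = model.predict(frame.loc[~observed, predictors])
    return filled

rng = np.random.default_rng(0)

# Synthetic scores (not the assignment data)
toy = pd.DataFrame({
    "reading score": rng.normal(69, 14, size=200),
    "writing score": rng.normal(68, 15, size=200),
})
true_math = 0.5 * toy["reading score"] + 0.4 * toy["writing score"] + 5 + rng.normal(0, 2, size=200)
toy["math score"] = true_math.copy()
toy.loc[toy.index[:20], "math score"] = np.nan

filled = regression_impute(toy, "math score", ["reading score", "writing score"])
max_abs_error = float((filled.iloc[:20] - true_math.iloc[:20]).abs().max())
print(max_abs_error)
```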

In [492]:
# Evaluation of results
MSE = 0
closenum = 0
closerange = 5
for i in drop_indices:
    MSE += (D1[i] - D2[i])**2
    if -closerange <= (D1[i] - D2[i]) <= closerange:
        closenum += 1

MSE = MSE / num_to_drop
print("Mean Squared Error:", MSE, "\nRoot Mean Squared Error:", MSE**(1/2), "\nNumber of Predictions within +/-", closerange, ":", closenum, "out of", len(drop_indices))

Mean Squared Error: 100.57248346484052 
Root Mean Squared Error: 10.028583322924556 
Number of Predictions within +/- 5 : 24 out of 52


Regression imputation applied to this missing data gave mostly unsatisfactory results. A root mean squared error of about 10 points and only 24 of 52 predictions (about 46%) within +/-5 suggest that this is not the best type of imputation for this situation. The likely reason is that all of the removed values sit in one narrow region (math scores above 80) instead of being spread across the whole range. A linear regression predicts the typical math score for a given reading and writing score, so it tends to underestimate students who were selected precisely because their math score was unusually high.
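
One way to probe this explanation is to look at the signed error (actual minus predicted) rather than the squared error. The sketch below uses synthetic scores, not the assignment dataset, and a one-predictor least-squares fit via `np.polyfit`; all names and coefficients are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic scores (not the assignment data)
reading = rng.normal(69, 14.6, size=1000)
math_score = 0.85 * reading + 7 + rng.normal(0, 8.7, size=1000)

# One-predictor least-squares fit of math on reading
slope, intercept = np.polyfit(reading, math_score, deg=1)
predicted = slope * reading + intercept

# Compare the average signed error on the high scorers with the overall average
high = math_score > 80
signed_error_high = float((math_score[high] - predicted[high]).mean())
signed_error_all = float((math_score - predicted).mean())

print(f"mean(actual - predicted), math > 80: {signed_error_high:.2f}")
print(f"mean(actual - predicted), all rows:  {signed_error_all:.2f}")
```

Over all rows the signed error averages out to zero. On the rows selected for a high math score it is positive, meaning the predictions fall short of the actual values on average.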

## Conclusion

In this assignment, we used various methods of imputation to repair 3 simulated missing data scenarios: MCAR, MAR, and MNAR. We removed sections of data according to each scenario's conditions, and then applied a chosen method of imputation. We used one univariate, one bivariate, and one multivariate type of imputation: Random Sample Imputation, Correlation Imputation, and Regression Imputation, respectively. After the imputation was performed, we evaluated its effectiveness using metrics like similarity percentage, Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). Random Sample Imputation proved not very effective at predicting the original values of a binary column, averaging about 50% correct predictions. Correlation Imputation was moderately successful (about two thirds of the predictions were within 5 points, with an RMSE of about 4.9). Regression Imputation did not work well (RMSE of about 10), likely because the removed values were all concentrated in a narrow, high range instead of being spread among the other entries. In the future, we could test different imputation methods to see whether they give better results in these situations.

### References

Lecture Slides<br>
<a href="https://pandas.pydata.org/docs/">Pandas Documentation</a><br>
<a href="https://numpy.org/doc/stable/">NumPy Documentation</a><br>
<a href="https://scikit-learn.org/stable/user_guide.html">Scikit-Learn User Guide</a>