### Setup Instructions To Reproduce this Data Cleaning Notebook:
1. (Optional) Create a Python virtual environment in the project directory to hold the required packages:
``` 
python -m venv .venv
```
To activate the virtual environment:
```
.venv/Scripts/Activate.ps1 # on Windows (PowerShell)
source .venv/bin/activate # on macOS/Linux
```
2. Install the required packages (run in the command prompt or shell of your choice):
```
pip install pandas
pip install numpy
```
3. VS Code: Make sure the correct Python kernel is selected.
<br>
If you are using a virtual environment, select the Python interpreter that belongs to it; otherwise the notebook will not find the installed packages. If you installed everything globally, make sure the selected kernel is the interpreter the packages were installed into.
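
One way to confirm which interpreter the kernel is actually using is to run a short check in the first cell. The sketch below is not part of the original notebook; it relies only on the Python standard library and assumes the environment was created with `python -m venv`.

```python
import importlib.util
import sys

# Path of the interpreter running this kernel; inside a virtual environment
# it should point into the project's .venv directory.
print(sys.executable)

# For environments created with `python -m venv`, the two prefixes differ.
in_venv = sys.prefix != sys.base_prefix
print(f"Inside a virtual environment: {in_venv}")

# Check that the required packages can be located, without importing them.
for package in ("pandas", "numpy"):
    found = importlib.util.find_spec(package) is not None
    print(f"{package}: {'found' if found else 'MISSING'}")
```

If a package is reported as missing, the kernel is probably not the interpreter that `pip install` was run with.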

In [59]:
# imports
import pandas as pd
import numpy as np

data = pd.read_csv("https://raw.githubusercontent.com/CLFrod/Assignment2CSI4142/refs/heads/master/StudentsPerformance.csv")

<h1>Dataset 2: "Students performance in Exams"</h1>
<h3>Task: Data Imputation</h3>

Author: Jakki Seshapanpu

"This data set consists of the marks secured by [US high school] students in various subjects...To understand the influence of the parents background, test preparation etc on students performance"

<h3>Dataset Specifications & Features</h3>

The dataset has 8 columns and 1000 rows.

Link: <a href="https://www.kaggle.com/datasets/spscientist/students-performance-in-exams/data">Students Performance in Exams</a>

### Feature List
#### Gender
Categorical - Nominal
<br>
Male/Female
#### Race/Ethnicity
Categorical - Nominal
<br>
Referred to as groups ranging from A-E
#### Parental Level of Education
Categorical - Ordinal
<br>
Level of education of the student's parents.
<br>
Arguably, this feature could be treated as nominal. I chose ordinal because the diploma types can be ranked by level of education.
#### Lunch
Categorical - Nominal
<br>
The type of lunch plan the student is paying for.
#### Test Preparation Course
Categorical - Nominal
<br>
Whether or not the student took the course to prepare for the test.
#### Math Score
Numerical - Discrete
#### Reading Score
Numerical - Discrete
#### Writing Score
Numerical - Discrete

### Imputation Test #1
We will simulate "Missing Completely At Random"-type missing values.
<br>
For this, the 'gender' column will be used.
<br>
We can randomly delete entries in the column amounting to a fraction of the total data.
<br>
Then, we will use Random Sample Imputation to re-fill the empty entries.

In [60]:
# Make 2 copies of the 'gender' column
D1 = data['gender'].copy()
D2 = data['gender'].copy()

# Calculate the number of entries to drop
drop = 0.50 # simulated fraction of entries to delete (0.50 = 50%)
num_to_drop = int(len(D2) * drop)

# Randomly select entries to drop
np.random.seed(0)  # for reproducibility. comment out for multiple tries
drop_indices = np.random.choice(D2.index, num_to_drop, replace=False)
D2.loc[drop_indices] = np.nan

# Fill empty entries in D2 with random samples from D1
missing = D2.isnull()
num_missing = missing.sum()
if num_missing > 0:
    sampled_values = D1.dropna().sample(num_missing, replace=True).values
    D2.loc[missing] = sampled_values

# Compare D1 to D2 in similarity
similarity = (D1 == D2).mean()
print(f"Similarity between D1 and D2: {similarity * 100}%")

Similarity between D1 and D2: 75.6%
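
Note that the cell above draws replacement values from `D1`, the complete original column, which would not be available if the values were truly missing. A minimal sketch of the same technique that samples only from the entries still observed is given here. The function name `random_sample_impute`, its `seed` parameter, and the toy column are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd


def random_sample_impute(column, seed=0):
    """Fill NaN entries with values drawn at random from the observed entries."""
    rng = np.random.default_rng(seed)
    filled = column.copy()
    missing = filled.isna()
    observed = filled[~missing].to_numpy()
    if missing.any() and len(observed) > 0:
        # Sample with replacement, one donor value per missing entry
        filled[missing] = rng.choice(observed, size=int(missing.sum()), replace=True)
    return filled


# Hypothetical toy column standing in for a partially deleted 'gender' column
toy = pd.Series(["female", "male", np.nan, "female", np.nan, "male"])
imputed = random_sample_impute(toy)
print(imputed.tolist())
```

Because the donors come from the observed entries, the filled column keeps roughly the same category proportions as the surviving data, which is the property this method is designed to preserve.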


#### Test Results and Discussion
D1-D2 similarity and accuracy of Random Sample Imputation with random seed 0:

| Drop percentage | D1-D2 similarity percentage | Accuracy rating |
| --- | --- | --- |
| 5% | 96.5% | 30% |
| 10% | 95.1% | 51% |
| 15% | 93% | 53% |
| 20% | 89.9% | 50% |
| 33% | 83.8% | 54% |
| 50% | 75.6% | 51% |

For this column, where the two categories are split close to 50/50, Random Sample Imputation does not reliably recreate the original entries: the average accuracy rating is about 50%. I would expect its ability to recover the correct value to drop further as the number of distinct categories grows beyond two.

In general, this method is suitable when the goal is to preserve the proportion of each category, but it cannot confidently reproduce the original value of an individual cell. It is only moderately effective here because the chance of error is minimized when Male and Female are the only possible entries, which is why I chose this column for the experiment. Only accuracy was calculated because precision and recall are trivial for this column's values.
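
Two quantities help in reading these results. If the accuracy rating is taken to be the share of imputed entries that match the original (an assumption; the notebook does not show how it was computed), it can be measured on the deleted positions only. And if replacement values are drawn from the column's own distribution, the chance that a draw matches the original is the sum of the squared category proportions, which is 0.5 for two equally likely categories and shrinks as categories are added. The helper names below are hypothetical.

```python
import pandas as pd


def expected_match_rate(column):
    """Chance that two independent random draws from the column are equal."""
    proportions = column.value_counts(normalize=True)
    return float((proportions ** 2).sum())


def imputed_accuracy(original, imputed, was_missing):
    """Share of imputed entries (not all entries) that equal the original value."""
    return float((original[was_missing] == imputed[was_missing]).mean())


# Toy columns with equally likely categories (not the real dataset)
two_way = pd.Series(["a", "b"] * 50)
five_way = pd.Series(list("abcde") * 20)
print(expected_match_rate(two_way))
print(expected_match_rate(five_way))

# Accuracy measured only where values had been deleted
original = pd.Series(["a", "b", "a", "b"])
imputed = pd.Series(["a", "b", "b", "b"])
was_missing = pd.Series([False, False, True, True])
accuracy = imputed_accuracy(original, imputed, was_missing)
print(accuracy)
```

Measuring on the deleted positions only avoids the inflation seen in the overall D1-D2 similarity, where every untouched entry counts as a match.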

### Imputation Test #2
We will simulate "Missing At Random"-type missing values.
For this, 'reading score' will be the target of bivariate imputation, with 'writing score' serving as the contextual variable to predict from. In this scenario, a reading score is more likely to be missing for students who scored low in writing.
<br>
We'll randomly delete entries in 'reading score' where the writing score is on the lower end (\<50).
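
The deletion step described above can be sketched as follows. This is an illustration on synthetic scores rather than the real dataset: the 50% `drop_fraction`, the seed, and the variable names are assumptions, while the threshold of 50 comes from the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the two score columns (not the real dataset)
writing = rng.integers(10, 101, size=200)
reading = np.clip(writing + rng.integers(-8, 9, size=200), 0, 100)
scores = pd.DataFrame({"writing score": writing, "reading score": reading}).astype(float)

threshold = 50        # from the text: "lower end" means writing score < 50
drop_fraction = 0.5   # assumed share of eligible rows to blank out

# Only rows with a low writing score are eligible to lose their reading score
eligible = scores[scores["writing score"] < threshold].index.to_numpy()
n_drop = int(len(eligible) * drop_fraction)
drop_idx = rng.choice(eligible, size=n_drop, replace=False)

mar = scores.copy()
mar.loc[drop_idx, "reading score"] = np.nan
print(mar["reading score"].isna().sum(), "reading scores removed")
```

The missingness depends only on the observed 'writing score', not on the deleted reading values themselves, which is what makes this pattern "Missing At Random" rather than "Missing Not At Random".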

In [63]:
# Calculate the correlation matrix
correlation_matrix = data[['math score', 'reading score', 'writing score']].corr()

# Print the correlation matrix
print(correlation_matrix)

               math score  reading score  writing score
math score       1.000000       0.817580       0.802642
reading score    0.817580       1.000000       0.954598
writing score    0.802642       0.954598       1.000000
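
Since the matrix shows that 'reading score' and 'writing score' are strongly correlated (about 0.95), a straight-line fit on 'writing score' is one plausible way to carry out the bivariate imputation. The sketch below is one possible approach, not necessarily the one used in this notebook; `regression_impute` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def regression_impute(frame, target, predictor):
    """Fill missing `target` values from a straight-line fit on `predictor`."""
    filled = frame.copy()
    missing = filled[target].isna()
    known = filled[~missing]
    # Least-squares line fitted on the rows where the target is observed
    slope, intercept = np.polyfit(known[predictor], known[target], deg=1)
    filled.loc[missing, target] = slope * filled.loc[missing, predictor] + intercept
    return filled


# Toy data in which reading score = writing score + 4 (not the real dataset)
toy = pd.DataFrame({
    "writing score": [40.0, 50.0, 60.0, 70.0, 80.0],
    "reading score": [44.0, 54.0, np.nan, 74.0, 84.0],
})
result = regression_impute(toy, "reading score", "writing score")
print(result)
```

Because the scores are whole numbers, the predicted values could be rounded before being compared with the originals.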


In [62]:
# Group by 'test preparation course' and describe 'writing score'
writing_scores_description = data.groupby('test preparation course')['writing score'].describe()
print(writing_scores_description)

                         count       mean        std   min   25%   50%   75%  \
test preparation course                                                        
completed                358.0  74.418994  13.375335  36.0  66.0  76.0  83.0   
none                     642.0  64.504673  14.999661  10.0  54.0  65.0  74.0   

                           max  
test preparation course         
completed                100.0  
none                     100.0  


### Imputation Test #3
We will simulate "Missing Not At Random"-type missing values.