## Data Importation

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load in the dataset
df = pd.read_csv("original_dataset.csv")

# Display basic information about the dataset
df.info()


The original dataset was loaded into a pandas DataFrame. Based on the `.info()` output, the dataset contains 6,607 records and 20 columns, with a mix of numerical and categorical data types.

**Numerical Values**
* `Hours_Studied`
* `Attendance`
* `Sleep_Hours`
* `Previous_Scores`
* `Tutoring_Sessions`
* `Physical_Activity`
* `Exam_Score`

**Categorical Values**
* `Parental_Involvement`
* `Access_to_Resources`
* `Extracurricular_Activities`
* `Motivation_Level`
* `Internet_Access`
* `Family_Income`
* `Teacher_Quality`
* `School_Type`
* `Peer_Influence`
* `Learning_Disabilities`
* `Parental_Education_Level`
* `Distance_from_Home`
* `Gender`
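
The split between numerical and categorical columns can also be derived programmatically instead of read off the `.info()` output. The following is a minimal sketch using `select_dtypes`; `toy_df` is a hypothetical stand-in with made-up values, and on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Hours_Studied": [23, 19, 24],
    "Exam_Score": [67, 61, 74],
    "Parental_Involvement": ["Low", "Medium", "High"],
    "Gender": ["Male", "Female", "Female"],
})

# Columns with a numeric dtype versus everything else
numerical_cols = toy_df.select_dtypes(include="number").columns.tolist()
categorical_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```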

## Initial Exploration

We first used `.describe()` to generate a statistical summary of the numerical columns. A histogram and a box plot were then used to examine the distribution of exam scores and check for outliers.

In [None]:
# Create summary statistics for numerical columns
df.describe()

In [None]:
# Plot 1: Histogram of Exam Score
plt.figure(figsize=(8, 6))
plt.hist(df['Exam_Score'], bins=30, edgecolor='black')
plt.title('Distribution of Exam Scores')
plt.xlabel('Exam Score')
plt.ylabel('Frequency')
plt.show()

### Plot 1 Insights
* The bell-shaped curve suggests that most students score within a similar range (60-75), with some outliers.
* The peak of the plot is around 67, which aligns with the mean of `Exam_Score` in the statistical summary above.
* There are a few outliers slightly above 100.
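
The scores above 100 can be confirmed numerically rather than read off the histogram. Below is a small sketch on made-up values; `MAX_SCORE` is an assumed upper bound of the grading scale, and on the real data `scores` would be `df['Exam_Score']`.

```python
import pandas as pd

# Toy scores, made up for illustration; the real column is df['Exam_Score'].
scores = pd.Series([58, 65, 67, 70, 72, 101], name="Exam_Score")

MAX_SCORE = 100  # assumed upper bound of the grading scale

# Select every score that exceeds the assumed maximum
above_max = scores[scores > MAX_SCORE]

print(f"Scores above {MAX_SCORE}: {len(above_max)}")
print(above_max)
```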

In [None]:
# Plot 2: Box Plot of Exam Score by Parental Involvement
plt.figure(figsize=(8,6))
categories = df['Parental_Involvement'].dropna().unique()
data_to_plot = [df[df['Parental_Involvement'] == cat]['Exam_Score'] for cat in categories]

plt.boxplot(data_to_plot, labels=categories)
plt.title('Exam Score by Parental Involvement')
plt.xlabel('Parental Involvement')
plt.ylabel('Exam Score')
plt.grid()
plt.show()

### Plot 2 Insights
* Students with **High** parental involvement have the highest median exam score, followed by **Medium** and then **Low**, which suggests a positive relationship between parental involvement and exam scores.
* Each group contains several outliers towards the higher end of exam scores. This shows that some students perform well regardless of how involved their parents are, or it may simply reflect inconsistencies in reporting and grading.
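
The ordering of the medians can be checked directly with a `groupby`. This is a sketch on a hypothetical `toy_df` with made-up scores, so the numbers it prints are not results from the real dataset. The `order` list assumes the three category labels are `Low`, `Medium`, and `High`.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy_df = pd.DataFrame({
    "Parental_Involvement": ["Low", "Low", "Medium", "Medium", "High", "High"],
    "Exam_Score": [62, 66, 65, 69, 68, 72],
})

# Assumed category labels, listed from least to most involved
order = ["Low", "Medium", "High"]

# Median exam score per involvement level, shown in a meaningful order
medians = (
    toy_df.groupby("Parental_Involvement")["Exam_Score"]
    .median()
    .reindex(order)
)
print(medians)
```

Reindexing by an explicit order keeps the groups from being sorted alphabetically, which would otherwise place `High` before `Low`.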

## Cleaning Dataset

### Missing Data:
There were three columns that had missing values:
* `Teacher_Quality` - 78
* `Parental_Education_Level` - 90
* `Distance_from_Home` - 67

We filled these missing values with the **mode** of each column, because all three columns are categorical and the mode represents the most common and likely value.

### Outliers:
We saw from the two plots that some exam scores were above 100. To handle these outliers, we chose to cap the scores at 100 so that we could keep the full dataset.

In [None]:
# Check for any missing values
missing_values = df.isnull().sum()

print(f"Missing Values per Column:\n{missing_values}")

In [7]:
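# Replace missing categorical values with the mode.
# fillna returns a new Series, so the result must be assigned back to the column.
df['Teacher_Quality'] = df['Teacher_Quality'].fillna(df['Teacher_Quality'].mode()[0])
df['Parental_Education_Level'] = df['Parental_Education_Level'].fillna(df['Parental_Education_Level'].mode()[0])
df['Distance_from_Home'] = df['Distance_from_Home'].fillna(df['Distance_from_Home'].mode()[0])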

# Cap Exam_Score at 100 in case any value exceeds it
df['Exam_Score'] = df['Exam_Score'].clip(upper=100)

# Save Cleaned Dataset
file_path = "cleaned_dataset.csv"
df.to_csv(file_path, index=False)
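
A quick sanity check after cleaning can confirm that no missing values remain and that no score exceeds 100. The sketch below runs the two cleaning steps on a hypothetical `toy_df` with made-up values and then verifies the result; on the real data the same two checks could be run against `df` before saving.

```python
import pandas as pd

# Toy data, made up for illustration: one missing label and one score above 100.
toy_df = pd.DataFrame({
    "Teacher_Quality": ["High", None, "Medium", "Medium"],
    "Exam_Score": [67, 101, 72, 64],
})

# Mode imputation: fillna returns a new Series, so assign it back
mode_value = toy_df["Teacher_Quality"].mode()[0]
toy_df["Teacher_Quality"] = toy_df["Teacher_Quality"].fillna(mode_value)

# Cap scores at 100
toy_df["Exam_Score"] = toy_df["Exam_Score"].clip(upper=100)

# Verify the cleaning took effect
remaining_missing = int(toy_df.isnull().sum().sum())
max_score = int(toy_df["Exam_Score"].max())
print(f"Remaining missing values: {remaining_missing}")
print(f"Maximum exam score: {max_score}")
```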