# Data Analysis of the Kaggle Student Exam Scores Dataset
## Source
The dataset can be found on [Kaggle](https://www.kaggle.com/datasets/desalegngeb/students-exam-scores).
## Objective
The objective is to practise loading CSV data, cleaning it, and analysing it with pandas.

In [12]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd

import sys
sys.path.append('..')
from utils.data_io import check_file

In [13]:
original_data_path = './datasets/student_exam_scores/Original_data_with_more_rows.csv'
expanded_data_path = './datasets/student_exam_scores/Expanded_data_with_more_features.csv'

# Check if these are valid file paths.
check_file(original_data_path)
check_file(expanded_data_path)
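
The helper `check_file` comes from the project's own `utils.data_io` module, whose source is not shown here. As a rough sketch only, assuming it simply verifies that a path points to an existing regular file, such a check might look like the following. The name `check_file_sketch` and its behaviour are hypothetical and are not the project's actual implementation.

```python
from pathlib import Path
import tempfile


def check_file_sketch(path: str) -> Path:
    """Hypothetical stand-in for utils.data_io.check_file (real code not shown)."""
    p = Path(path)
    if not p.is_file():
        # Fail early with a clear message instead of a later read_csv error.
        raise FileNotFoundError(f"Not a valid file path: {p}")
    return p


# Demonstrate on a temporary file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.csv"
    demo_path.write_text("a,b\n1,2\n")
    checked = check_file_sketch(str(demo_path))

    missing_detected = False
    try:
        check_file_sketch(str(demo_path.with_name("missing.csv")))
    except FileNotFoundError:
        missing_detected = True
```

Failing before `pd.read_csv` is called keeps a path typo from surfacing as a longer traceback further down the notebook.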

In [14]:
original_df = pd.read_csv(original_data_path)
print(f"Original dataframe shape {original_df.shape}")
original_columns = original_df.columns.tolist()
print(f"Original dataframe columns {original_columns}")

expanded_df = pd.read_csv(expanded_data_path)
print(f"Expanded dataframe shape {expanded_df.shape}")
print(f"Expanded dataframe columns {expanded_df.columns}")

Original dataframe shape (30641, 9)
Original dataframe columns ['Unnamed: 0', 'Gender', 'EthnicGroup', 'ParentEduc', 'LunchType', 'TestPrep', 'MathScore', 'ReadingScore', 'WritingScore']
Expanded dataframe shape (30641, 15)
Expanded dataframe columns Index(['Unnamed: 0', 'Gender', 'EthnicGroup', 'ParentEduc', 'LunchType',
       'TestPrep', 'ParentMaritalStatus', 'PracticeSport', 'IsFirstChild',
       'NrSiblings', 'TransportMeans', 'WklyStudyHours', 'MathScore',
       'ReadingScore', 'WritingScore'],
      dtype='object')


## Initial comments
- Both datasets have the same number of rows (30641), but the expanded dataset has 15 columns against the original's 9. The six extra columns are additional features.
- Both have an unnamed index column (`Unnamed: 0`), which may be dropped later on.
- Column names are abbreviated in both datasets (for example `ParentEduc`, `NrSiblings`, `WklyStudyHours`), so they may be renamed later on.
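
The observations above can also be checked programmatically. The sketch below uses two tiny stand-in frames with made-up values (only the column names mirror the real data) to compare row counts and list the shared and extra columns.

```python
import pandas as pd

# Tiny stand-in frames; values are invented for illustration.
small_original = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Gender": ["female", "male"],
    "MathScore": [71, 69],
})
small_expanded = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Gender": ["female", "male"],
    "NrSiblings": [3.0, 0.0],
    "MathScore": [71, 69],
})

same_row_count = len(small_original) == len(small_expanded)

# Columns present only in the expanded frame, kept in their original order.
extra_columns = [c for c in small_expanded.columns if c not in small_original.columns]

# Columns shared by both frames, in the original frame's order.
common_columns = [c for c in small_original.columns if c in small_expanded.columns]

print(same_row_count, extra_columns, common_columns)
```

List comprehensions are used instead of set operations so that the column order is preserved.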

In [15]:
print(expanded_df.head(5))
# info() prints its summary directly and returns None, so no print() is needed.
expanded_df.info()

   Unnamed: 0  Gender EthnicGroup          ParentEduc     LunchType TestPrep  \
0           0  female         NaN   bachelor's degree      standard     none   
1           1  female     group C        some college      standard      NaN   
2           2  female     group B     master's degree      standard     none   
3           3    male     group A  associate's degree  free/reduced     none   
4           4    male     group C        some college      standard     none   

  ParentMaritalStatus PracticeSport IsFirstChild  NrSiblings TransportMeans  \
0             married     regularly          yes         3.0     school_bus   
1             married     sometimes          yes         0.0            NaN   
2              single     sometimes          yes         4.0     school_bus   
3             married         never           no         1.0            NaN   
4             married     sometimes          yes         0.0     school_bus   

  WklyStudyHours  MathScore  ReadingScore  W

## Compare the two datasets
- We can check if the common columns of these two datasets are equal.

In [16]:
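# Select the shared columns by label. DataFrame.isin() tests cell values,
# not column names, so it cannot be used for column selection.
expanded_df_selected = expanded_df[original_columns]
print(f"Are the common columns equal? {original_df.equals(expanded_df_selected)}")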

Are the common columns equal? False
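
A `False` from `DataFrame.equals` only says that the frames differ somewhere, not where. As a minimal sketch, assuming both frames share the same index and column labels, the mismatches can be counted per column while treating two missing values in the same cell as a match. The frames below are small stand-ins with made-up values.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({
    "Gender": ["female", "male"],
    "TestPrep": ["none", "completed"],
    "MathScore": [71, 69],
})
right = pd.DataFrame({
    "Gender": ["female", "male"],
    "TestPrep": ["none", np.nan],
    "MathScore": [71, 69],
})

# A cell matches if the values are equal or if both sides are missing.
matches = (left == right) | (left.isna() & right.isna())

# Number of differing cells in each column.
mismatches_per_column = (~matches).sum()
print(mismatches_per_column)
```

Columns with a non-zero count are the ones worth inspecting first.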


## Delete redundant columns
- The first column, `Unnamed: 0`, is an index column carried over from the CSV file. We don't need it, so we drop it.

In [17]:
expanded_df = expanded_df.drop(columns=['Unnamed: 0'])
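
As an alternative to dropping the column after loading, `pd.read_csv` accepts `index_col=0`, which uses the first CSV column as the row index instead of a data column. The sketch below shows both routes on a small in-memory CSV with made-up values.

```python
import io

import pandas as pd

# The header starts with an empty field, which pandas names 'Unnamed: 0'.
csv_text = ",Gender,MathScore\n0,female,71\n1,male,69\n"

# Route 1: read everything, then drop the unnamed index column.
with_index_column = pd.read_csv(io.StringIO(csv_text))
dropped = with_index_column.drop(columns=["Unnamed: 0"])

# Route 2: treat the first column as the index at read time.
read_as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(dropped.columns), list(read_as_index.columns))
```

Route 2 keeps the file's index values as the frame's index, which matters only if those values carry meaning.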

## Drop duplicates
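
This step is not filled in yet. A minimal sketch of one way it might be done, assuming exact row duplicates are unwanted, uses `duplicated` to count them and `drop_duplicates` to remove them. The frame below is a made-up stand-in. Note that an index-like column such as `Unnamed: 0` has to be removed first, otherwise no two rows would compare as identical.

```python
import pandas as pd

scores = pd.DataFrame({
    "Gender": ["female", "male", "female"],
    "MathScore": [71, 69, 71],
})

# duplicated() flags every repeat of an earlier row (the first occurrence is kept).
duplicate_count = int(scores.duplicated().sum())

# Drop the repeats and renumber the rows.
deduplicated = scores.drop_duplicates().reset_index(drop=True)
print(duplicate_count, len(deduplicated))
```

In a student dataset, two identical rows can also be two different students with the same answers and scores, so it is worth inspecting the flagged rows before deleting them.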

## Fill in missing values in the dataset
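
This step is not filled in yet. As a sketch under stated assumptions (the imputation strategy is an illustration, not a decision made in this notebook), categorical columns could be filled with their most frequent value and numeric columns with their median. The values below are made up.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "TransportMeans": ["school_bus", np.nan, "school_bus", "private"],
    "NrSiblings": [3.0, 0.0, np.nan, 1.0],
})

# Count missing values per column before filling.
missing_before = sample.isna().sum()

filled = sample.copy()
# Categorical column: fill with the most frequent value (the mode).
most_common = filled["TransportMeans"].mode()[0]
filled["TransportMeans"] = filled["TransportMeans"].fillna(most_common)
# Numeric column: fill with the median of the observed values.
filled["NrSiblings"] = filled["NrSiblings"].fillna(filled["NrSiblings"].median())

print(missing_before.to_dict(), int(filled.isna().sum().sum()))
```

Filling with the mode or median changes the column's distribution, so dropping rows or adding an explicit "unknown" category may suit some analyses better.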

## Make column names clear/verbose
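
This step is not filled in yet. A minimal sketch using `DataFrame.rename` with a mapping from abbreviated to verbose names follows. The verbose names are hypothetical suggestions and the cell values are made up.

```python
import pandas as pd

sample = pd.DataFrame({
    "ParentEduc": ["some college"],
    "NrSiblings": [0.0],
    "WklyStudyHours": ["< 5"],
})

# Hypothetical verbose names; choose whatever reads best for the analysis.
verbose_names = {
    "ParentEduc": "ParentEducation",
    "NrSiblings": "NumberOfSiblings",
    "WklyStudyHours": "WeeklyStudyHours",
}

# rename() returns a new frame; columns missing from the mapping stay unchanged.
renamed = sample.rename(columns=verbose_names)
print(list(renamed.columns))
```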

## Visualize the data

## Convert data

## Save the cleaned dataframe
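
This step is not filled in yet. A minimal sketch of saving with `DataFrame.to_csv` follows. The file name is hypothetical, and a temporary directory is used so the example does not touch the project's `datasets` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

cleaned = pd.DataFrame({"Gender": ["female", "male"], "MathScore": [71, 69]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "cleaned_scores.csv"  # hypothetical file name
    # index=False skips the row index, which would otherwise come back
    # as an 'Unnamed: 0' column when the file is read again.
    cleaned.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```

Writing with `index=False` avoids reintroducing the kind of index column that was dropped earlier in this notebook.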