
    ## CRISP-DM Phase 3: Data Preparation

    This notebook covers the data preparation phase for the graduate admission dataset.
    The dataset is split into train/validation/test sets, saved to disk, and profiled for missing values.
    

In [None]:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Load the dataset
    data = pd.read_csv('graduate_admission1.csv')

    # Check initial data structure
    data.head()
    

In [None]:

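    from pathlib import Path

    # Split the data into training, validation, and test sets (80-10-10).
    # First hold out 10% of all rows as the test set.
    temp_data, test_data = train_test_split(data, test_size=0.1, random_state=42)
    # A second test_size=0.1 would take 10% of the remaining 90% (9% overall),
    # so request as many validation rows as test rows instead.
    train_data, val_data = train_test_split(temp_data, test_size=len(test_data), random_state=42)

    # Save the splits to parquet files for reproducibility
    # (to_parquet needs a parquet engine such as pyarrow to be installed)
    Path('data/processed').mkdir(parents=True, exist_ok=True)
    train_data.to_parquet('data/processed/train.parquet', index=False)
    val_data.to_parquet('data/processed/val.parquet', index=False)
    test_data.to_parquet('data/processed/test.parquet', index=False)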
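
The arithmetic behind a two-stage split is easy to get wrong: a fraction passed to the second call applies to the rows left after the first call, not to the full dataset. The sketch below works through the resulting sizes in plain Python. The helper names and the row count of 1000 are hypothetical, and rounding up with `math.ceil` reflects our understanding of how scikit-learn sizes a fractional test set, so treat that detail as an assumption.

```python
import math


def two_stage_sizes(n_rows, test_frac, second_frac):
    """Return (train, val, test) sizes when both splits use fractions."""
    n_test = math.ceil(test_frac * n_rows)
    n_remaining = n_rows - n_test
    # The second fraction is taken from the remaining rows only.
    n_val = math.ceil(second_frac * n_remaining)
    return n_remaining - n_val, n_val, n_test


def matched_sizes(n_rows, test_frac):
    """Return (train, val, test) sizes when val gets as many rows as test."""
    n_test = math.ceil(test_frac * n_rows)
    n_val = n_test
    return n_rows - n_test - n_val, n_val, n_test


# Hypothetical dataset of 1000 rows, for illustration only
naive = two_stage_sizes(1000, 0.1, 0.1)
matched = matched_sizes(1000, 0.1)
print(naive)    # (810, 90, 100): an 81-9-10 split
print(matched)  # (800, 100, 100): an 80-10-10 split
```

Passing an absolute row count to the second split avoids both the shrinking-fraction problem and any floating-point rounding in a derived fraction such as 0.1 / 0.9.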
    


### Quality Profiling and Checks

We profiled the training split for missing values before modeling.
A heatmap of the null mask is saved to `assets/missingness_after.png`, and the per-column counts of missing values are written to `reports/prep_checks.json`.


In [None]:

from pathlib import Path

import matplotlib.pyplot as plt
import seaborn as sns

# Load the train dataset for profiling (pd is imported in the first cell)
train_data = pd.read_parquet('data/processed/train.parquet')

# Count missing values per column and plot the null mask as a heatmap
missing_values = train_data.isnull().sum()
plt.figure(figsize=(10, 6))
sns.heatmap(train_data.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values After Preprocessing')

# Create the output directory if needed and save before showing the figure
Path('assets').mkdir(parents=True, exist_ok=True)
plt.savefig('assets/missingness_after.png', bbox_inches='tight')
plt.show()
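
A heatmap shows where values are missing but not how many. As a complement, a small table of counts and fractions per column can be easier to scan. The following is a minimal sketch, not part of the original notebook: `summarize_missing` is a hypothetical helper, and the toy frame uses made-up column names rather than the real dataset.

```python
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and fractions, most-missing first."""
    summary = pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "frac_missing": df.isnull().mean(),
    })
    return summary.sort_values("n_missing", ascending=False)


# Toy frame with hypothetical column names, for illustration only
toy = pd.DataFrame({
    "score_a": [320.0, None, 310.0, 300.0],
    "score_b": [8.5, 9.0, None, None],
    "label": [0.9, 0.8, 0.7, 0.6],
})
missing_summary = summarize_missing(toy)
print(missing_summary)
```

In the notebook, the same helper could be called on `train_data` right after it is loaded.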


In [None]:

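import json
from pathlib import Path

# Save missing values to JSON report, keeping only columns that have any.
# Counts are cast to plain int so they are always JSON-serializable.
prep_checks = {
    'missing_values_after_preprocessing': {
        column: int(count)
        for column, count in missing_values[missing_values > 0].items()
    }
}

Path('reports').mkdir(parents=True, exist_ok=True)
with open('reports/prep_checks.json', 'w') as json_file:
    json.dump(prep_checks, json_file, indent=4)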
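
A later step can read the report back and decide whether the data is ready for modeling. The sketch below assumes the report layout written above, with a single `missing_values_after_preprocessing` key mapping column names to counts. The helper `columns_with_missing` and the demo column name are hypothetical, and the demo writes to a temporary directory instead of touching `reports/`.

```python
import json
import tempfile
from pathlib import Path

# Key used by the report-writing cell
REPORT_KEY = "missing_values_after_preprocessing"


def columns_with_missing(report_path):
    """Return {column: count} for columns the report lists as incomplete."""
    with open(report_path) as report_file:
        report = json.load(report_file)
    return report.get(REPORT_KEY, {})


# Demonstration on a throwaway report with a made-up column name
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "prep_checks.json"
    demo_path.write_text(json.dumps({REPORT_KEY: {"example_column": 3}}))
    demo_missing = columns_with_missing(demo_path)

print(demo_missing)
```

An empty dictionary means no column in the profiled split has missing values; a non-empty one names the columns that still need imputation or removal.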
