# Data Cleaning Notebook

This notebook cleans and preprocesses the training and testing datasets in four steps:

- **Typo Correction**: Rename the misspelled columns `austim` and `contry_of_res`.
- **Binary Encoding**: Map the two-valued columns `gender`, `jaundice`, and `autism` to 1/0.
- **One-Hot Encoding**: Expand the nominal columns `relation`, `ethnicity`, and `country_of_res` into indicator columns.
- **Column Dropping**: Remove `ID`, `age_desc`, and `used_app_before`, which are not needed for the analysis.

In [None]:

import os
import warnings

import numpy as np
import pandas as pd

# Suppress all warnings to keep cell output readable
warnings.filterwarnings("ignore")

## Load Data from Files

Load the training and testing datasets for preprocessing.

In [6]:
train_df = pd.read_csv(r'DataFiles/train.csv')
test_df = pd.read_csv(r'DataFiles/test.csv')
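
Right after loading, it can help to confirm that both files share the expected columns. The following is a minimal sketch, not part of the original notebook: it uses small in-memory CSVs in place of the real files, and the `target` column and the `column_diff` helper are hypothetical names chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-ins for DataFiles/train.csv and DataFiles/test.csv
train_csv = io.StringIO("ID,gender,austim,target\n1,m,no,0\n2,f,yes,1\n")
test_csv = io.StringIO("ID,gender,austim\n3,f,no\n")

toy_train = pd.read_csv(train_csv)
toy_test = pd.read_csv(test_csv)


def column_diff(left, right):
    """Return (columns only in left, columns only in right), each sorted."""
    only_left = sorted(set(left.columns) - set(right.columns))
    only_right = sorted(set(right.columns) - set(left.columns))
    return only_left, only_right


only_train, only_test = column_diff(toy_train, toy_test)
print(toy_train.shape, toy_test.shape)
print("only in train:", only_train, "| only in test:", only_test)
```

A column that appears only in the training data is usually the label; anything else in either list is worth investigating before encoding.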

## Fix Typos in Dataset

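The raw files misspell two column names, `austim` and `contry_of_res`. Rename them to `autism` and `country_of_res` in both datasets so later cells can refer to them consistently.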

In [7]:

# Fix misspelled column names in both datasets
train_df.rename(columns={'austim': 'autism'}, inplace=True)
test_df.rename(columns={'austim': 'autism'}, inplace=True)

train_df.rename(columns={'contry_of_res': 'country_of_res'}, inplace=True)
test_df.rename(columns={'contry_of_res': 'country_of_res'}, inplace=True)
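
The four `rename` calls above could also be driven by a single mapping applied to both datasets. A minimal sketch on toy frames (the `age` column and the row values are made up for illustration):

```python
import pandas as pd

# Toy frames carrying the two misspelled column names
toy_train = pd.DataFrame({"austim": ["no"], "contry_of_res": ["India"], "age": [30]})
toy_test = pd.DataFrame({"austim": ["yes"], "contry_of_res": ["Jordan"], "age": [25]})

# One mapping shared by both datasets
column_fixes = {"austim": "autism", "contry_of_res": "country_of_res"}

for frame in (toy_train, toy_test):
    # rename skips keys that are not present, so rerunning the cell is harmless
    frame.rename(columns=column_fixes, inplace=True)

print(list(toy_train.columns))
print(list(toy_test.columns))
```

Keeping the corrections in one dictionary means a newly discovered typo only has to be added in one place.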

## Replace Non-Numeric Column Values

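Map the binary columns (`gender`, `jaundice`, `autism`) to 1/0, one-hot encode the nominal columns (`relation`, `ethnicity`, `country_of_res`), and then align the test columns with the training columns.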
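
One caveat when mapping with a dictionary: `Series.map` returns `NaN` for any value that is not a key, so an unexpected spelling or placeholder silently becomes missing data. A minimal sketch of how to surface such values, using a toy column whose `"?"` entry is an assumption for illustration:

```python
import pandas as pd

# Toy column with inconsistent casing and one unexpected value ("?")
toy = pd.DataFrame({"jaundice": ["yes", "No", "YES", "?"]})

mapping = {"yes": 1, "no": 0}
encoded = toy["jaundice"].str.lower().map(mapping)

# Values not found in the mapping come back as NaN; list the originals
unmapped = toy.loc[encoded.isna(), "jaundice"].unique().tolist()
print("unmapped values:", unmapped)
print("rows affected:", int(encoded.isna().sum()))
```

Running a check like this on each mapped column before continuing makes it clear whether the mapping covers every value in the data.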

In [8]:
# Replace gender with 1 or 0
train_df['gender'] = train_df['gender'].str.lower().map({'m': 1, 'f': 0})
test_df['gender'] = test_df['gender'].str.lower().map({'m': 1, 'f': 0})

# Replace jaundice with 1 or 0
train_df['jaundice'] = train_df['jaundice'].str.lower().map({'yes': 1, 'no': 0})
test_df['jaundice'] = test_df['jaundice'].str.lower().map({'yes': 1, 'no': 0})

# Replace autism with 1 or 0
train_df['autism'] = train_df['autism'].str.lower().map({'yes': 1, 'no': 0})
test_df['autism'] = test_df['autism'].str.lower().map({'yes': 1, 'no': 0})

# One-hot encode the nominal columns, dropping the first level of each
categorical_columns = ['relation', 'ethnicity', 'country_of_res']

train_df = pd.get_dummies(train_df, columns=categorical_columns, drop_first=True)
test_df = pd.get_dummies(test_df, columns=categorical_columns, drop_first=True)

# Align test_df with train_df: add missing columns filled with 0,
# drop columns absent from train_df, and match the column order
test_df = test_df.reindex(columns=train_df.columns, fill_value=0)
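
Encoding each dataset separately with `drop_first=True` can drop a different level in each one when a category is absent from one of the files. A test row holding the level that was dropped in the test set is then encoded as all zeros, which after alignment reads as the training baseline level. One way to avoid this, sketched below on toy data as an assumption rather than the notebook's method, is to fix the category list from the training data before calling `pd.get_dummies`.

```python
import pandas as pd

# Toy data: "Others" appears in train but never in test
toy_train = pd.DataFrame({"relation": ["Others", "Parent", "Self"]})
toy_test = pd.DataFrame({"relation": ["Parent", "Self"]})

# Fix the levels from the training data so both frames encode identically
levels = sorted(toy_train["relation"].unique())
for frame in (toy_train, toy_test):
    frame["relation"] = pd.Categorical(frame["relation"], categories=levels)

# With a categorical dtype, get_dummies creates a column for every level,
# so drop_first removes the same level ("Others") in both frames
train_enc = pd.get_dummies(toy_train, columns=["relation"], drop_first=True)
test_enc = pd.get_dummies(toy_test, columns=["relation"], drop_first=True)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

Test values that never occur in the training data become missing under this scheme and are encoded as all zeros, so they are worth counting separately.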

## Drop Unneeded Columns

Drop `ID`, `age_desc`, and `used_app_before` from both datasets, since they are not required for the analysis.

In [9]:

# Drop the identifier, age description, and app-usage columns from both datasets
columns_to_drop = ['ID', 'age_desc', 'used_app_before']
train_df.drop(columns=columns_to_drop, inplace=True)
test_df.drop(columns=columns_to_drop, inplace=True)
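
Because `DataFrame.drop` raises a `KeyError` when a label is missing, rerunning this cell in the same session fails once the columns are gone. A minimal sketch of a rerunnable variant, using `errors="ignore"` on a toy frame (the row values are made up for illustration):

```python
import pandas as pd

# Toy frame: "used_app_before" is deliberately absent
toy = pd.DataFrame({"ID": [1, 2], "age_desc": ["adult", "adult"], "age": [30, 25]})
columns_to_drop = ["ID", "age_desc", "used_app_before"]

# errors="ignore" skips labels that are not present instead of raising KeyError
cleaned = toy.drop(columns=columns_to_drop, errors="ignore")

# Dropping a second time is a no-op rather than an error
cleaned_again = cleaned.drop(columns=columns_to_drop, errors="ignore")

print(list(cleaned_again.columns))
```

The trade-off is that a misspelled column name in `columns_to_drop` is also ignored silently, so the strict default is safer when the cell is only run once.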