# Data Cleaning Notebook

This notebook performs data cleaning and preprocessing steps, including:

- **Typo Correction**: Fix misspelled column names (`austim`, `contry_of_res`) for consistency.
- **Label Encoding**: Map `gender`, `jaundice`, `autism`, and `relation` to numeric codes.
- **Category Encoding**: Convert the nominal features `ethnicity` and `country_of_res` to integer category codes.
- **Column Dropping**: Remove `ID`, `age_desc`, and `used_app_before` to streamline the dataset.

In [6]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

## Load Data from Files

Load the training and testing datasets for preprocessing.

In [7]:
train_df = pd.read_csv(r'DataFiles/train.csv')
test_df = pd.read_csv(r'DataFiles/test.csv')
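
After loading, a quick comparison of the two column sets can confirm that later steps apply to both frames. The following is a minimal sketch on hypothetical in-memory CSV data; the column `target` and the toy rows are assumptions for illustration, not the contents of the real files.

```python
import io
import pandas as pd

# Hypothetical stand-ins for train.csv and test.csv (not the real data).
toy_train = pd.read_csv(io.StringIO("ID,gender,target\n1,m,0\n2,f,1\n"))
toy_test = pd.read_csv(io.StringIO("ID,gender\n3,f\n"))

# Columns present in one frame but not the other.
only_in_train = sorted(set(toy_train.columns) - set(toy_test.columns))
only_in_test = sorted(set(toy_test.columns) - set(toy_train.columns))

print("only in train:", only_in_train)
print("only in test:", only_in_test)
```

A column that appears only in the training frame is often the prediction target; anything else that differs is worth checking before the cleaning steps run.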

## Fix Typos in Dataset

Correct typos in column names to ensure consistency and accuracy.

In [8]:

# Fix misspelled column names: 'austim' -> 'autism', 'contry_of_res' -> 'country_of_res'
train_df.rename(columns={'austim': 'autism'}, inplace=True)
test_df.rename(columns={'austim': 'autism'}, inplace=True)

train_df.rename(columns={'contry_of_res': 'country_of_res'}, inplace=True)
test_df.rename(columns={'contry_of_res': 'country_of_res'}, inplace=True)
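
The same renames can be expressed as one mapping applied to each frame. This is a minimal sketch on hypothetical toy frames (`toy_train`, `toy_test`, and `column_fixes` are illustrative names); it relies on `DataFrame.rename` ignoring keys that are not present in a frame.

```python
import pandas as pd

# Hypothetical toy frames standing in for train_df / test_df.
toy_train = pd.DataFrame({'austim': ['yes'], 'contry_of_res': ['India']})
toy_test = pd.DataFrame({'austim': ['no'], 'contry_of_res': ['Jordan']})

# One mapping from misspelled to corrected column names.
column_fixes = {'austim': 'autism', 'contry_of_res': 'country_of_res'}

# rename returns a new frame; keys missing from a frame are silently skipped.
toy_train = toy_train.rename(columns=column_fixes)
toy_test = toy_test.rename(columns=column_fixes)

print(list(toy_train.columns))
print(list(toy_test.columns))
```

Keeping the fixes in a single dictionary means a newly discovered typo only has to be added in one place.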

## Replace Non-Numeric Column Values

Convert non-numeric column values into respective numeric codes for easier processing.
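
It helps to know how `Series.map` treats values that are missing from the dictionary: they become `NaN` rather than raising an error. A minimal sketch on a hypothetical toy series (`toy_gender` and the `'?'` value are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy column with mixed case and one unexpected value ('?').
toy_gender = pd.Series(['m', 'F', 'M', '?'])

# Lower-case first, then map; values absent from the dict become NaN.
encoded = toy_gender.str.lower().map({'m': 1, 'f': 0})

# Count the values the mapping did not cover.
n_unmapped = int(encoded.isna().sum())
print(encoded.tolist())
print("unmapped values:", n_unmapped)
```

Checking the unmapped count after each mapping is a cheap way to catch spellings or placeholder values the dictionary does not anticipate.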

In [9]:

# Replace gender with 1 (m) or 0 (f)
train_df['gender'] = train_df['gender'].str.lower().map({'m': 1, 'f': 0})
test_df['gender'] = test_df['gender'].str.lower().map({'m': 1, 'f': 0})

#replace jaundice with 1 or 0
train_df['jaundice'] = train_df['jaundice'].str.lower().map({'yes': 1, 'no': 0})
test_df['jaundice'] = test_df['jaundice'].str.lower().map({'yes': 1, 'no': 0})

#replace autism with 1 or 0
train_df['autism'] = train_df['autism'].str.lower().map({'yes': 1, 'no': 0})
test_df['autism'] = test_df['autism'].str.lower().map({'yes': 1, 'no': 0})

# Replace relation with 1-5
train_df['relation'] = train_df['relation'].str.lower().map({'self': 1, 'parent': 2, 'relative': 3, 'health care professional': 4, 'others': 5})
test_df['relation'] = test_df['relation'].str.lower().map({'self': 1, 'parent': 2, 'relative': 3, 'health care professional': 4, 'others': 5})

# Replace ethnicity with integer category codes
train_df['ethnicity'] = train_df['ethnicity'].astype('category')

# Save the mapping for ethnicity (position in this index = code)
ethnicity_categories = train_df['ethnicity'].cat.categories

# Reuse the training categories so test codes match the training codes;
# test values not seen in training get code -1
test_df['ethnicity'] = pd.Categorical(test_df['ethnicity'], categories=ethnicity_categories)

# Convert 'ethnicity' to category codes
train_df['ethnicity'] = train_df['ethnicity'].cat.codes
test_df['ethnicity'] = test_df['ethnicity'].cat.codes

# Replace country_of_res with integer category codes
train_df['country_of_res'] = train_df['country_of_res'].astype('category')

# Save the mapping for country (position in this index = code)
country_categories = train_df['country_of_res'].cat.categories

# Reuse the training categories so test codes match the training codes;
# test values not seen in training get code -1
test_df['country_of_res'] = pd.Categorical(test_df['country_of_res'], categories=country_categories)

# Convert 'country_of_res' to category codes
train_df['country_of_res'] = train_df['country_of_res'].cat.codes
test_df['country_of_res'] = test_df['country_of_res'].cat.codes
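
Category codes are only comparable across frames when both frames use the same category list, because each categorical column numbers its own sorted set of values. A minimal sketch on hypothetical toy data (the labels and the names `toy_train` and `toy_test` are illustrative):

```python
import pandas as pd

# Hypothetical toy columns; 'Latino' never appears in the training data.
toy_train = pd.Series(['Asian', 'Black', 'White'])
toy_test = pd.Series(['White', 'Black', 'Latino'])

# Training codes: Asian=0, Black=1, White=2.
train_categories = toy_train.astype('category').cat.categories

# Independent conversion numbers the test frame's own categories
# (Black, Latino, White), so 'Black' gets 0 here but 1 in training.
independent_codes = toy_test.astype('category').cat.codes.tolist()

# Shared conversion reuses the training categories; unseen values get -1.
shared_codes = pd.Categorical(toy_test, categories=train_categories).codes.tolist()

print("independent:", independent_codes)
print("shared:", shared_codes)
```

With the shared list, a given label always maps to the same integer, and `-1` flags test values that need separate handling.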


## Drop Unneeded Columns

Remove columns that are not required for the analysis to streamline the dataset.

In [10]:

# Drop the identifier, age description, and prior-app-usage columns
columns_to_drop = ['ID', 'age_desc', 'used_app_before']
train_df.drop(columns=columns_to_drop, inplace=True)
test_df.drop(columns=columns_to_drop, inplace=True)
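
Dropping a column that is already gone raises a `KeyError`, which makes a drop cell fail when it is run a second time. A minimal sketch of a re-runnable drop on a hypothetical toy frame (`toy_df` and its values are illustrative), using the `errors='ignore'` option of `DataFrame.drop`:

```python
import pandas as pd

# Hypothetical toy frame with the three unneeded columns plus one to keep.
toy_df = pd.DataFrame({
    'ID': [1, 2],
    'age_desc': ['18 and more', '18 and more'],
    'used_app_before': ['no', 'no'],
    'age': [24.0, 31.5],
})

columns_to_drop = ['ID', 'age_desc', 'used_app_before']

# errors='ignore' skips labels that are not present instead of raising.
cleaned = toy_df.drop(columns=columns_to_drop, errors='ignore')

# A second pass is a no-op rather than a KeyError.
cleaned_again = cleaned.drop(columns=columns_to_drop, errors='ignore')

print(list(cleaned.columns))
```

The trade-off is that `errors='ignore'` also hides a misspelled column name, so it suits cells that are re-run often more than one-off scripts.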