# Data Cleaning and Preprocessing
Machine learning (ML) is a powerful tool for analyzing and interpreting data, but before applying any algorithm we need to make sure the data is clean and well prepared.

Garbage in, garbage out (GIGO) is a common phrase in data science, emphasizing that the quality of the input data directly affects the quality of the output. Can you list some reasons why data might be considered "dirty", and how does this affect the results of your analysis?

We'll use pandas and seaborn for data manipulation and visualization, respectively. We'll work with the `titanic` dataset, which is a classic dataset for demonstrating data cleaning and preprocessing techniques.

# Titanic Dataset (Seaborn Version) - Feature Descriptions

This dataset contains demographic and survival information about passengers on the Titanic. It is often used for demonstrating data cleaning, visualization, and classification tasks.

## Target Variable

- **`survived`** *(int)*  
  Survival indicator:  
  - `0` = Did not survive  
  - `1` = Survived  

---

## Passenger Details

- **`pclass`** *(int)*  
  Passenger class (1st, 2nd, or 3rd) — proxy for socio-economic status.

- **`sex`** *(object)*  
  Gender of the passenger: `male` or `female`.

- **`age`** *(float)*  
  Age in years. Some missing values exist.

- **`sibsp`** *(int)*  
  Number of siblings or spouses aboard.

- **`parch`** *(int)*  
  Number of parents or children aboard.

- **`fare`** *(float)*  
  Fare paid for the ticket.

- **`embarked`** *(object)*  
  Port of embarkation:  
  - `C` = Cherbourg  
  - `Q` = Queenstown  
  - `S` = Southampton

- **`class`** *(category)*  
  Duplicate of `pclass`, but as a readable label: "First", "Second", or "Third".

- **`who`** *(object)*  
  Simplified category of passenger type: "man", "woman", or "child".

- **`adult_male`** *(bool)*  
  True if the passenger is an adult male.

- **`deck`** *(category)*  
  Deck letter extracted from cabin — many missing values.

- **`embark_town`** *(object)*  
  Full name of embarkation town (duplicate of `embarked` but more readable).

- **`alive`** *(object)*  
  Human-readable survival status: "yes" or "no" (duplicate of `survived`).

- **`alone`** *(bool)*  
  True if the passenger had no family aboard (`sibsp + parch == 0`).

### AIM: We'll use the cleaned dataset to build a simple machine learning model to predict survival on the Titanic.

In [None]:
# import pandas as pd
# import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Load the Titanic dataset
df = sns.load_dataset("titanic")

In [None]:
# Explore the dataset
df.info()

In [None]:
df.describe()

In [None]:
df.isnull().sum()

In [None]:
df.head(10)

NOTE: Features are proxies for real-world measurements. For example, `pclass` stands in for socio-economic status.

# Let's start the cleaning process!

Should we drop rows with missing values, or impute them? And what about duplicate rows?

In [None]:
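# Handle missing values and drop duplicates

# Drop rows where `age` or `embarked` is missing
df.dropna(subset=['age', 'embarked'], inplace=True)

# Alternative: impute instead of dropping (mean for numeric, mode for categorical)
# df['age'] = df['age'].fillna(df['age'].mean())
# df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# Caution: this dataset has no passenger ID, so identical rows may be different passengers
df.drop_duplicates(inplace=True)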
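
The trade-off between dropping and imputing can be made concrete by counting how many rows each strategy keeps. The sketch below uses a small hypothetical frame (`toy`) rather than the Titanic data, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for two Titanic columns
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan, 58.0],
    "embarked": ["S", "C", None, "S", "S"],
})

# Strategy 1: drop every row with a missing value in either column
dropped = toy.dropna(subset=["age", "embarked"])

# Strategy 2: impute (mean for the numeric column, mode for the categorical one)
imputed = toy.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())
imputed["embarked"] = imputed["embarked"].fillna(imputed["embarked"].mode()[0])

print(f"rows kept after dropping: {len(dropped)} of {len(toy)}")
print(f"rows kept after imputing: {len(imputed)} of {len(toy)}")
```

Dropping is simple but shrinks the dataset; imputing keeps every row but replaces unknown values with guesses, which can distort the distribution of the imputed column.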

### Should every feature count?

The following features are generally considered **redundant, derived** from other features, or too sparse to be useful:

- `class` → derived from `pclass`
- `who` → derived from `sex` and `age`
- `adult_male` → derived from `sex` and `age`
- `alive` → duplicate of `survived`
- `embark_town` → readable version of `embarked`
- `alone` → derived from `sibsp` and `parch`
- `deck` → sparse, many missing values

For basic analysis, you may want to drop these to simplify your dataset.

In [None]:
# Drop columns
df.drop(columns=["class", "who", "adult_male", "alive", "embark_town", "alone", "deck"], inplace=True)
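
Before dropping a column as redundant, it can be worth confirming that it really is recoverable from the columns that remain. The sketch below checks two of the claims above on a hypothetical frame (`toy`) that mimics the relevant Titanic columns; the same comparisons could be run on the real data before the drop.

```python
import pandas as pd

# Hypothetical rows mimicking the relevant Titanic columns
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "alive": ["no", "yes", "yes", "no"],
    "sibsp": [1, 0, 0, 2],
    "parch": [0, 0, 1, 0],
    "alone": [False, True, False, False],
})

# `alive` should be a plain relabelling of `survived`
alive_matches = bool((toy["alive"].map({"no": 0, "yes": 1}) == toy["survived"]).all())

# `alone` should be True exactly when no family members are aboard
alone_matches = bool((toy["alone"] == ((toy["sibsp"] + toy["parch"]) == 0)).all())

print("alive duplicates survived:", alive_matches)
print("alone derived from sibsp and parch:", alone_matches)
```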

# Some visualisations

### Equi-distributional?

In [None]:
# Plot distribution of 0 and 1 samples
sns.countplot(x="survived", data=df)
plt.title("Distribution of Survived")
plt.xlabel("Survived")
plt.ylabel("Count")
plt.show()
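
The same balance can be read off as proportions, which also gives the accuracy of always predicting the majority class, a useful baseline for any later model. A minimal sketch on hypothetical labels (`target` is made up here, not the Titanic column):

```python
import pandas as pd

# Hypothetical labels: three non-survivors, two survivors
target = pd.Series([0, 0, 0, 1, 1], name="survived")

# Share of each class, ordered by label
proportions = target.value_counts(normalize=True).sort_index()

# Accuracy of a model that always predicts the most common class
majority_baseline = proportions.max()

print(proportions)
print(f"majority-class baseline accuracy: {majority_baseline:.2f}")
```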

In [None]:
# Get target
objective = df['survived']
df.drop(columns=["survived"], inplace=True)

In [None]:
# Get categorical and numerical columns
categorical_cols = ['pclass', 'sex', 'embarked']
numeric_cols = [col for col in df.columns if col not in categorical_cols]

In [None]:
def plot_categorical_feature(objective, feature):
    """Count plot of a categorical column of the global `df`, split by the target."""
    plt.figure(figsize=(6, 3))
    sns.countplot(x=feature, hue=objective, data=df)
    plt.title(f"Count of {feature} by Survival")
    plt.show()


def plot_numeric_feature(objective, feature):
    """Box plot of a numeric column of the global `df`, grouped by the target."""
    plt.figure(figsize=(6, 3))
    sns.boxplot(x=objective, y=feature, data=df)
    plt.title(f"Distribution of {feature} by Survival")
    plt.show()

In [None]:
for feature in categorical_cols:
    plot_categorical_feature(objective, feature)

In [None]:
for feature in numeric_cols:
    plot_numeric_feature(objective, feature)

### Correlated Features

In [None]:
corr_map = df[numeric_cols].corr()
plt.figure(figsize=(len(numeric_cols), len(numeric_cols)))
sns.heatmap(corr_map, annot=True, cmap='coolwarm')
plt.show()
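
A heatmap is easy to scan for a handful of columns; with more features it can help to list the strongly correlated pairs directly. The sketch below does this on a hypothetical frame (`toy`), and the `threshold` of 0.8 is an arbitrary choice for illustration, not a rule.

```python
from itertools import combinations

import pandas as pd

# Hypothetical numeric features; `b` is exactly twice `a`
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy.corr()  # Pearson correlation by default
threshold = 0.8    # assumed cut-off for "strongly correlated"

# Visit each unordered pair of columns once and keep the strong ones
strong = [
    (x, y, corr.loc[x, y])
    for x, y in combinations(corr.columns, 2)
    if abs(corr.loc[x, y]) > threshold
]

for x, y, r in strong:
    print(f"{x} vs {y}: r = {r:.2f}")
```

Pairs that appear in this list carry largely overlapping information, so one feature of each pair is a candidate for removal.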

# A fun reading exercise
Data Processing Inequality: _Post-processing cannot increase information_

https://en.wikipedia.org/wiki/Data_processing_inequality
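
Stated a little more formally (this is the standard information-theoretic form, added here for orientation): if the target $X$, the raw data $Y$, and the processed data $Z$ form a Markov chain $X \to Y \to Z$, meaning $Z$ is computed from $Y$ alone, then

$$
I(X; Z) \le I(X; Y)
$$

where $I(\cdot\,;\cdot)$ denotes mutual information. Mapping $X$, $Y$, and $Z$ onto this notebook is an interpretation rather than part of the theorem: cleaning and dropping columns can make the data easier for a model to use, but it cannot add information about survival that the raw data lacked.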