# Data Cleaning and Preprocessing Pipeline in Pandas

### What Is Data Cleaning and Preprocessing, and Why Does It Matter?

Data cleaning and preprocessing is the process of preparing raw data for analysis or machine learning. This step removes errors, handles missing values, fixes inconsistent formatting, encodes categorical variables, scales numerical features, and ensures the dataset is ready for modeling. In AI/ML, poor preprocessing can lead to inaccurate predictions, bias, and wasted computation. The Titanic dataset is an ideal example because it contains missing ages, categorical variables like *Sex* and *Embarked*, and numerical columns like *Fare* that need scaling.

### Loading and Inspecting Data

In [None]:
import pandas as pd
df = pd.read_csv("data/train.csv")
df.info()
df.head()
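
Before choosing a cleaning strategy, it can help to see how many values are missing in each column. The sketch below is one way to do that; the `sample` frame is a hypothetical stand-in for a few Titanic rows, not the real data, and `missing_summary` is just an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few Titanic rows (not the real dataset).
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Embarked": ["S", "C", None, "S"],
})

# Count and percentage of missing values per column.
missing_summary = pd.DataFrame({
    "missing": sample.isna().sum(),
    "percent": sample.isna().mean() * 100,
})
print(missing_summary)
```

Running the same two expressions on `df` should show which columns need attention in the next section.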

### Handling Missing Values

Handling missing values means identifying and addressing gaps in the dataset so that they do not cause errors or biased predictions. Missing data can arise from data entry errors, lost records, or information that was never recorded. Common strategies are deletion (removing rows or columns with missing values), imputation (filling gaps with the mean, median, mode, or a model prediction), and marking missing entries as a separate category. In the Titanic dataset, *Age*, *Cabin*, and *Embarked* contain missing values; this section handles *Age* and *Embarked*, and *Cabin* is left for the exercises. Dropping the affected rows would discard useful data, so we impute *Age* with the median (less affected by outliers than the mean) and *Embarked* with the mode (the most frequent value). This keeps every row available for analysis, at the cost of slightly reducing the variance of the imputed column. Careful treatment of missing values helps the dataset stay representative and keeps algorithms from misinterpreting gaps as patterns.

In [None]:
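# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated in recent pandas versions and may not modify df.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])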
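
To illustrate why the median is preferred over the mean here, the following sketch uses a small made-up series with one extreme value. It also records which entries were missing before filling, as one possible way of "marking" them; the values and the names `ages`, `was_missing`, and `filled` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up ages with one extreme value and one gap.
ages = pd.Series([20.0, 22.0, 24.0, 80.0, np.nan], name="Age")

mean_age = ages.mean()      # pulled upward by the 80
median_age = ages.median()  # stays near the typical values

# Remember which entries were missing, then fill with the median.
was_missing = ages.isna()
filled = ages.fillna(median_age)

print(f"mean={mean_age}, median={median_age}")
print(filled.tolist())
```

The single large value moves the mean well away from the bulk of the data, while the median barely changes.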

### Removing or Capping Outliers

Outliers are data points that differ markedly from the rest of the dataset. They can distort statistical summaries and hurt model performance, especially for algorithms that are sensitive to extreme values, such as linear regression or KNN. In the Titanic dataset, the *Fare* column contains a few very large values from rare, expensive tickets. Instead of removing these rows, which might discard useful information, we use **capping** (a one-sided form of Winsorizing): values above a threshold (here, the 99th percentile) are replaced with the threshold itself. This leaves most of the distribution untouched while limiting the influence of extreme points, which helps keep models from becoming unstable because of a few large deviations.

In [None]:
# Cap Fare at its 99th percentile; clip() is the vectorized equivalent of the row-wise lambda.
fare_cap = df['Fare'].quantile(0.99)
df['Fare'] = df['Fare'].clip(upper=fare_cap)
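
The cell above caps only the upper tail. A two-sided variant, which also raises unusually small values to a lower threshold, might look like the sketch below. The fares and the 5th/95th percentile cut-offs are made up for illustration and are not a recommendation for the real data.

```python
import pandas as pd

# Made-up fares with one very large value.
fares = pd.Series([0.0, 7.25, 8.05, 13.0, 26.0, 512.33], name="Fare")

# Thresholds chosen only for this small example.
lower = fares.quantile(0.05)
upper = fares.quantile(0.95)

# Values outside [lower, upper] are replaced by the nearest bound.
capped = fares.clip(lower=lower, upper=upper)
print(capped.tolist())
```

No rows are dropped: the series keeps its length, and only the extreme entries change.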

### Encoding Categorical Variables

Machine learning algorithms generally require numeric input, so categorical features must be transformed into numbers. This process is called encoding. There are two common approaches:

- **Label Encoding:** Assigns each category a unique integer.
- **One-Hot Encoding:** Creates binary columns for each category.

For the Titanic dataset, *Sex* and *Embarked* are categorical variables. We use **One-Hot Encoding** because it prevents models from assuming an ordinal relationship between categories. The categories have no inherent order or magnitude: encoding *Embarked* as 0, 1, 2, for example, would wrongly suggest that one port is "greater" than another. By creating binary columns (e.g., `Sex_male`, `Embarked_Q`, `Embarked_S`), we let models learn a separate effect for each category without introducing an artificial hierarchy. Dropping the first category (`drop_first=True`) avoids the **dummy variable trap**: if every category gets its own column, the columns always sum to 1 and are perfectly collinear, which causes multicollinearity problems in some models, such as linear models with an intercept.

In [None]:
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)
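
The difference between the two encodings can be seen on a tiny made-up column. In this sketch, label encoding is done with pandas category codes (one of several ways to do it), and `dtype=int` is passed to `get_dummies` only to print 0/1 instead of booleans.

```python
import pandas as pd

# Made-up embarkation ports.
ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Label encoding: categories are sorted (C, Q, S) and numbered 0, 1, 2.
label_encoded = ports["Embarked"].astype("category").cat.codes

# One-hot encoding with the first category (C) dropped as the baseline.
one_hot = pd.get_dummies(ports, columns=["Embarked"], drop_first=True, dtype=int)

print(label_encoded.tolist())
print(one_hot)
```

A row with `Embarked_Q = 0` and `Embarked_S = 0` represents the dropped baseline category `C`.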

### Feature Scaling

Feature scaling puts numeric features on a comparable range so that no feature dominates simply because of its units. Without scaling, features with larger ranges (like *Fare*) may outweigh those with smaller ranges (like *Age*), which can skew predictions. Scaling matters most for algorithms that rely on distance metrics (e.g., KNN, SVM) or gradient-based optimization (e.g., Logistic Regression, Neural Networks). Common methods include **Standardization** (subtract the mean, divide by the standard deviation) and **Min-Max Scaling** (rescale to the range 0 to 1). Here, we use **StandardScaler** to center each feature around zero with unit variance, which typically speeds up convergence. Because standardization is a linear transformation, scaling *Fare* and *Age* does not change the relative differences between passengers.

In [None]:
from sklearn.preprocessing import StandardScaler

# Fit one scaler on both columns; each column is standardized independently.
scaler = StandardScaler()
df[['Fare', 'Age']] = scaler.fit_transform(df[['Fare', 'Age']])
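
As a check on what standardization does, the sketch below compares `StandardScaler` with the formula written out by hand on a few made-up fares. Note that `StandardScaler` uses the population standard deviation, so the manual version passes `ddof=0`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up fares for illustration.
fares = pd.DataFrame({"Fare": [7.25, 8.05, 13.0, 26.0, 71.28]})

# Library version.
scaled = StandardScaler().fit_transform(fares).ravel()

# Manual version: (x - mean) / population standard deviation.
col = fares["Fare"]
manual = ((col - col.mean()) / col.std(ddof=0)).to_numpy()

print(np.allclose(scaled, manual))
```

After scaling, the column has mean approximately 0 and standard deviation approximately 1, while the ordering of the values is unchanged.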

### Final Verification

Final verification checks that preprocessing was done correctly and that the dataset is clean, consistent, and ready for modeling. This step includes checking for any remaining missing values, confirming data types, validating that categorical variables were properly encoded, and reviewing summary statistics to detect anomalies. A clean dataset should have no missing values in the columns used for modeling, all features in the right format, and consistent scales across numeric variables. For the Titanic dataset, we verify that imputation, outlier treatment, encoding, and scaling were all applied correctly. Note that *Cabin* was not imputed above, so it will still show missing values until it is handled (see Exercise Q1). Catching such issues here avoids errors during model training.

In [None]:
df.info()
print(df.isna().sum())  # remaining missing values per column
df.head()
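
The checks described above can also be automated. The helper below is a minimal sketch, not part of pandas: `find_problems` and its `numeric_cols` parameter are hypothetical names, and the `toy` frame is made up to trigger both kinds of problem.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def find_problems(frame, numeric_cols):
    """Return a list of readable problems; an empty list means all checks passed."""
    problems = []
    # Check 1: any column that still contains missing values.
    missing = frame.isna().sum()
    for col, count in missing[missing > 0].items():
        problems.append(f"{col}: {count} missing value(s)")
    # Check 2: columns that should be numeric but are not.
    for col in numeric_cols:
        if not is_numeric_dtype(frame[col]):
            problems.append(f"{col}: expected numeric dtype, got {frame[col].dtype}")
    return problems

# Made-up frame: Age has a gap, Fare was read as text.
toy = pd.DataFrame({"Age": [22.0, None, 30.0], "Fare": ["7.25", "8.05", "13.0"]})
problems = find_problems(toy, numeric_cols=["Age", "Fare"])
for problem in problems:
    print(problem)
```

Calling `find_problems(df, numeric_cols=['Age', 'Fare'])` on the preprocessed frame would list any columns that still need work.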

### Exercises

Q1. Detect and replace missing *Cabin* values with `"Unknown"`.

In [None]:
df['Cabin'] = df['Cabin'].fillna("Unknown")

Q2. Normalize the *Fare* column using MinMaxScaler.

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df['Fare'] = scaler.fit_transform(df[['Fare']])

Q3. Encode *Pclass* as categorical instead of numeric.

In [None]:
df['Pclass'] = df['Pclass'].astype(str)  # Convert to string
df = pd.get_dummies(df, columns=['Pclass'], drop_first=True)

Q4. Identify and remove rows with extreme outliers in *Age*.

In [None]:
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
df = df[(df['Age'] >= lower_bound) & (df['Age'] <= upper_bound)]

Q5. Create a reusable preprocessing function for Titanic data.

In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def preprocess_titanic(data_path):
    """Load the Titanic CSV at data_path and return a cleaned DataFrame."""
    df = pd.read_csv(data_path)

    # Handle missing values
    df['Cabin'] = df['Cabin'].fillna("Unknown")
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

    # Remove Age outliers (1.5 * IQR rule); copy() avoids SettingWithCopyWarning below
    Q1, Q3 = df['Age'].quantile(0.25), df['Age'].quantile(0.75)
    IQR = Q3 - Q1
    lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
    df = df[(df['Age'] >= lower) & (df['Age'] <= upper)].copy()

    # Normalize Fare to the range [0, 1]
    scaler = MinMaxScaler()
    df['Fare'] = scaler.fit_transform(df[['Fare']])

    # Encode categorical features
    df['Pclass'] = df['Pclass'].astype(str)
    df = pd.get_dummies(df, columns=['Pclass', 'Sex', 'Embarked'], drop_first=True)

    return df

# Example usage (same path as in the loading step)
clean_df = preprocess_titanic("data/train.csv")

### Summary

Data cleaning and preprocessing is the bridge between raw data and actionable insights. Using the Titanic dataset, we addressed missing values, treated outliers, encoded categorical variables, and scaled numerical features. These steps transform inconsistent and incomplete data into a clean, structured format that is ready for machine learning. A well-designed preprocessing pipeline tends to improve model accuracy, helps reduce bias, and lets each feature contribute meaningfully to predictions. Without this step, even the most advanced models risk producing misleading results due to poor-quality data.