# ML Internship Task: Data Cleaning & Preprocessing

This notebook demonstrates the complete data preprocessing workflow for the Titanic dataset, including:
1. Data exploration and basic information
2. Handling missing values
3. Encoding categorical features
4. Normalizing/standardizing numerical features
5. Visualizing and removing outliers

The notebook also includes answers to common interview questions related to data preprocessing.

## Setup and Import Libraries

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from scipy import stats

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

# Set plot style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('viridis')

## 1. Data Loading and Exploration

In [None]:
# Load the dataset
df = pd.read_csv('titanic.csv')

# Display basic information about the dataset
print("Dataset shape:", df.shape)
print("\nFirst 5 rows of the dataset:")
df.head()

In [None]:
# Check data types and non-null counts
df.info()

In [None]:
# Get summary statistics
df.describe()

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values in each column:")
print(missing_values)

### Visualize the data distribution

In [None]:
# Visualize the distribution of the target variable (Survived)
plt.figure(figsize=(8, 6))
sns.countplot(x='Survived', data=df)
plt.title('Survival Distribution')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()

In [None]:
# Visualize the distribution of categorical features
plt.figure(figsize=(15, 10))
categorical_features = ['Pclass', 'Sex']
for i, feature in enumerate(categorical_features, 1):
    plt.subplot(1, 2, i)
    sns.countplot(x=feature, data=df, hue='Survived')
    plt.title(f'{feature} Distribution by Survival')
plt.tight_layout()
plt.show()

In [None]:
# Visualize the distribution of numerical features
plt.figure(figsize=(15, 10))
numerical_features = ['Age', 'Fare', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']
for i, feature in enumerate(numerical_features, 1):
    plt.subplot(2, 2, i)
    sns.histplot(df[feature], kde=True)
    plt.title(f'{feature} Distribution')
plt.tight_layout()
plt.show()

## 2. Handling Missing Values

The null check above confirms that this dataset has no missing values. In real-world data, however, missing values are common and are typically handled with techniques such as:

1. **Deletion**: Remove rows or columns with missing values
2. **Imputation**: Fill missing values with mean, median, mode, or predicted values
3. **Advanced methods**: Use algorithms like KNN or regression for imputation

Below is an example of how we would handle missing values if they existed:

In [None]:
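# Create a copy of the dataframe for demonstration
df_with_missing = df.copy()

# Artificially blank out ~5% of the feature columns that are imputed below.
# Masking the whole frame would also blank columns such as 'Survived',
# which the imputers in the next cell never touch.
np.random.seed(42)
feature_cols = numerical_features + categorical_features
mask = np.random.random(df_with_missing[feature_cols].shape) < 0.05
df_with_missing[feature_cols] = df_with_missing[feature_cols].mask(mask)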

# Check the artificially introduced missing values
print("Artificially introduced missing values:")
print(df_with_missing.isnull().sum())

In [None]:
# Handle missing values for numerical features using mean imputation
numerical_imputer = SimpleImputer(strategy='mean')
df_with_missing[numerical_features] = numerical_imputer.fit_transform(df_with_missing[numerical_features])

# Handle missing values for categorical features using most frequent value
categorical_imputer = SimpleImputer(strategy='most_frequent')
df_with_missing[categorical_features] = categorical_imputer.fit_transform(df_with_missing[categorical_features])

# Check if missing values are handled
print("\nMissing values after imputation:")
print(df_with_missing.isnull().sum())
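The cells above use mean and most-frequent imputation. The KNN approach listed under advanced methods could look roughly like the sketch below. It runs on a small made-up frame rather than the Titanic file, and the column values and `n_neighbors=2` are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data standing in for two numerical columns
toy = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
    'Fare': [7.25, 71.28, 7.92, np.nan, 51.86, 21.08],
})

# Each missing entry is filled with the mean of that feature over the
# n_neighbors closest rows, where distance is computed on the features
# that both rows have observed
knn_imputer = KNNImputer(n_neighbors=2)
toy_imputed = pd.DataFrame(
    knn_imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

print(toy_imputed)
```

Because the neighbours are found by distance, features on very different scales (such as age and fare) can dominate the result, so scaling before KNN imputation is often worth considering.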

## 3. Encoding Categorical Features

The 'Sex' column is categorical, so it must be converted to numbers before most models can use it. We'll demonstrate both label encoding and one-hot encoding.

In [None]:
# Create a copy of the dataframe for preprocessing
df_processed = df.copy()

# 1. Label Encoding for Sex column
print("Label Encoding for 'Sex' column")
label_encoder = LabelEncoder()
df_processed['Sex_Label'] = label_encoder.fit_transform(df_processed['Sex'])
print("Label encoding mapping:", dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_))))
print("\nFirst 5 rows after label encoding:")
df_processed[['Sex', 'Sex_Label']].head()

In [None]:
# 2. One-Hot Encoding for Sex column
print("One-Hot Encoding for 'Sex' column")
# Using pandas get_dummies for one-hot encoding
df_onehot = pd.get_dummies(df_processed['Sex'], prefix='Sex')
df_processed = pd.concat([df_processed, df_onehot], axis=1)
print("\nFirst 5 rows after one-hot encoding:")
df_processed[['Sex', 'Sex_male', 'Sex_female']].head()

In [None]:
# Visualize the distribution of encoded features
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.countplot(x='Sex_Label', data=df_processed, hue='Survived')
plt.title('Sex (Label Encoded) vs Survival')
plt.xlabel('Sex (0=female, 1=male)')

plt.subplot(1, 2, 2)
# Calculate survival rate by sex
survival_by_sex = df_processed.groupby('Sex')['Survived'].mean().reset_index()
sns.barplot(x='Sex', y='Survived', data=survival_by_sex)
plt.title('Survival Rate by Sex')
plt.ylabel('Survival Rate')
plt.tight_layout()
plt.show()

## 4. Normalizing/Standardizing Numerical Features

We'll demonstrate both standardization (z-score normalization) and min-max normalization on the numerical features.

In [None]:
# Create a DataFrame to store original and scaled values for comparison
scaling_comparison = pd.DataFrame()
for feature in numerical_features:
    scaling_comparison[f'{feature}_Original'] = df_processed[feature]

In [None]:
# 1. Standardization (Z-score normalization)
print("Standardization (Z-score normalization)")
standard_scaler = StandardScaler()
df_standardized = df_processed.copy()
df_standardized[numerical_features] = standard_scaler.fit_transform(df_processed[numerical_features])

# Add standardized values to comparison DataFrame
for feature in numerical_features:
    scaling_comparison[f'{feature}_Standardized'] = df_standardized[feature]

print("\nFirst 5 rows after standardization:")
df_standardized[numerical_features].head()

In [None]:
# Print standardization parameters
print("Standardization parameters:")
for i, feature in enumerate(numerical_features):
    print(f"{feature}: mean = {standard_scaler.mean_[i]:.4f}, std = {standard_scaler.scale_[i]:.4f}")

In [None]:
# 2. Min-Max Normalization
print("Min-Max Normalization")
minmax_scaler = MinMaxScaler()
df_normalized = df_processed.copy()
df_normalized[numerical_features] = minmax_scaler.fit_transform(df_processed[numerical_features])

# Add normalized values to comparison DataFrame
for feature in numerical_features:
    scaling_comparison[f'{feature}_MinMax'] = df_normalized[feature]

print("\nFirst 5 rows after min-max normalization:")
df_normalized[numerical_features].head()

In [None]:
# Print min-max normalization parameters
print("Min-Max normalization parameters:")
for i, feature in enumerate(numerical_features):
    print(f"{feature}: min = {minmax_scaler.data_min_[i]:.4f}, max = {minmax_scaler.data_max_[i]:.4f}")
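As a sanity check on what the two scalers compute, their formulas can be reproduced by hand: standardization is `(x - mean) / std` with the population standard deviation, and min-max normalization is `(x - min) / (max - min)`. The sketch below uses a small made-up array, not Titanic values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature data (mean 5, population std 2)
x = np.array([[2.0], [4.0], [4.0], [4.0], [5.0], [5.0], [7.0], [9.0]])

# Standardization: z = (x - mean) / std, where std uses ddof=0
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)
z_sklearn = StandardScaler().fit_transform(x)

# Min-max normalization: x' = (x - min) / (max - min), giving values in [0, 1]
mm_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
mm_sklearn = MinMaxScaler().fit_transform(x)

print("Standardization matches:", np.allclose(z_manual, z_sklearn))
print("Min-max matches:", np.allclose(mm_manual, mm_sklearn))
```

Note that `StandardScaler` divides by the population standard deviation, so results differ slightly from pandas' `Series.std()`, which defaults to the sample version (`ddof=1`).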

In [None]:
# Visualize the distribution of original vs scaled features for Age
plt.figure(figsize=(15, 5))

# Original distribution
plt.subplot(1, 3, 1)
sns.histplot(scaling_comparison['Age_Original'], kde=True)
plt.title('Original Age')

# Standardized distribution
plt.subplot(1, 3, 2)
sns.histplot(scaling_comparison['Age_Standardized'], kde=True)
plt.title('Standardized Age')

# Min-Max normalized distribution
plt.subplot(1, 3, 3)
sns.histplot(scaling_comparison['Age_MinMax'], kde=True)
plt.title('Min-Max Normalized Age')

plt.tight_layout()
plt.show()

In [None]:
# Visualize the distribution of original vs scaled features for Fare
plt.figure(figsize=(15, 5))

# Original distribution
plt.subplot(1, 3, 1)
sns.histplot(scaling_comparison['Fare_Original'], kde=True)
plt.title('Original Fare')

# Standardized distribution
plt.subplot(1, 3, 2)
sns.histplot(scaling_comparison['Fare_Standardized'], kde=True)
plt.title('Standardized Fare')

# Min-Max normalized distribution
plt.subplot(1, 3, 3)
sns.histplot(scaling_comparison['Fare_MinMax'], kde=True)
plt.title('Min-Max Normalized Fare')

plt.tight_layout()
plt.show()

## 5. Visualizing and Removing Outliers

We'll use boxplots to visualize outliers and the IQR method to identify and remove them.

In [None]:
# Visualize outliers using boxplots
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features, 1):
    plt.subplot(2, 2, i)
    sns.boxplot(x=feature, data=df_processed)
    plt.title(f'Boxplot of {feature}')
plt.tight_layout()
plt.show()

In [None]:
# Identify outliers using IQR method
outliers_summary = {}
for feature in numerical_features:
    Q1 = df_processed[feature].quantile(0.25)
    Q3 = df_processed[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = df_processed[(df_processed[feature] < lower_bound) | (df_processed[feature] > upper_bound)]
    outliers_count = len(outliers)
    outliers_percent = (outliers_count / len(df_processed)) * 100
    
    outliers_summary[feature] = {
        'count': outliers_count,
        'percentage': outliers_percent,
        'lower_bound': lower_bound,
        'upper_bound': upper_bound
    }
    
    print(f"{feature}: {outliers_count} outliers ({outliers_percent:.2f}% of data)")
    print(f"  - Lower bound: {lower_bound:.2f}")
    print(f"  - Upper bound: {upper_bound:.2f}")
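To make the IQR rule used above concrete, here is the same calculation on a small made-up series of fares (hypothetical values, not read from the dataset). The 1.5 multiplier matches the cell above.

```python
import pandas as pd

# Hypothetical fares with one extreme value
fares = pd.Series([7.25, 7.90, 8.05, 13.00, 26.00, 30.00, 512.33], name='Fare')

q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR below Q1 or above Q3 are flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = fares[(fares < lower) | (fares > upper)]

print(f"Q1 = {q1:.3f}, Q3 = {q3:.3f}, IQR = {iqr:.3f}")
print(f"Bounds: [{lower:.3f}, {upper:.3f}]")
print("Flagged values:", outliers.tolist())  # only 512.33 falls outside the bounds
```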

In [None]:
# Visualize outliers with scatter plots
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features, 1):
    plt.subplot(2, 2, i)
    
    # Get bounds from summary
    lower_bound = outliers_summary[feature]['lower_bound']
    upper_bound = outliers_summary[feature]['upper_bound']
    
    # Create a boolean mask for outliers
    values = df_processed[feature].to_numpy()
    is_outlier = (values < lower_bound) | (values > upper_bound)
    positions = np.arange(len(values))
    
    # Plot non-outliers and outliers separately so each gets its own legend entry
    plt.scatter(positions[~is_outlier], values[~is_outlier], c='blue', alpha=0.5, label='Within bounds')
    plt.scatter(positions[is_outlier], values[is_outlier], c='red', alpha=0.5, label='Outlier')
    
    # Add horizontal lines for bounds
    plt.axhline(y=lower_bound, color='green', linestyle='--', label=f'Lower bound: {lower_bound:.2f}')
    plt.axhline(y=upper_bound, color='green', linestyle='--', label=f'Upper bound: {upper_bound:.2f}')
    
    plt.title(f'Outliers in {feature}')
    plt.xlabel('Index')
    plt.ylabel(feature)
    plt.legend()
    
plt.tight_layout()
plt.show()

In [None]:
# Remove outliers
# Create a copy of the dataframe before removing outliers
df_no_outliers = df_processed.copy()
total_rows_before = len(df_no_outliers)

# Create a mask for rows to keep (not outliers in any feature)
keep_mask = pd.Series(True, index=df_no_outliers.index)

for feature in numerical_features:
    lower_bound = outliers_summary[feature]['lower_bound']
    upper_bound = outliers_summary[feature]['upper_bound']
    feature_mask = (df_no_outliers[feature] >= lower_bound) & (df_no_outliers[feature] <= upper_bound)
    keep_mask = keep_mask & feature_mask

# Apply the mask to keep only non-outlier rows
df_no_outliers = df_no_outliers[keep_mask]
total_rows_after = len(df_no_outliers)
rows_removed = total_rows_before - total_rows_after
percent_removed = (rows_removed / total_rows_before) * 100

print(f"Total rows before outlier removal: {total_rows_before}")
print(f"Total rows after outlier removal: {total_rows_after}")
print(f"Rows removed: {rows_removed} ({percent_removed:.2f}% of data)")
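One thing worth checking in the bounds printed earlier: on a zero-heavy count column, Q1 and Q3 can both be 0, so the IQR collapses to 0 and every non-zero value is flagged as an outlier. The family-count columns may behave this way. A minimal sketch on hypothetical toy data, not the Titanic file:

```python
import pandas as pd

# Hypothetical count column where most rows are 0
counts = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 2], name='count')

q1 = counts.quantile(0.25)
q3 = counts.quantile(0.75)
iqr = q3 - q1

# With iqr == 0 both bounds equal 0, so any non-zero count is flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
flagged = counts[(counts < lower) | (counts > upper)]

print(f"IQR = {iqr}, bounds = [{lower}, {upper}]")
print("Flagged values:", flagged.tolist())
```

If both bounds printed for a column are 0, restricting the IQR filter to continuous columns such as 'Age' and 'Fare' may be a more reasonable choice than dropping every row with a non-zero count.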

In [None]:
# Compare distributions before and after outlier removal for Age
plt.figure(figsize=(12, 5))

# Before removal
plt.subplot(1, 2, 1)
sns.histplot(df_processed['Age'], kde=True)
plt.title('Age Before Outlier Removal')

# After removal
plt.subplot(1, 2, 2)
sns.histplot(df_no_outliers['Age'], kde=True)
plt.title('Age After Outlier Removal')

plt.tight_layout()
plt.show()

In [None]:
# Compare distributions before and after outlier removal for Fare
plt.figure(figsize=(12, 5))

# Before removal
plt.subplot(1, 2, 1)
sns.histplot(df_processed['Fare'], kde=True)
plt.title('Fare Before Outlier Removal')

# After removal
plt.subplot(1, 2, 2)
sns.histplot(df_no_outliers['Fare'], kde=True)
plt.title('Fare After Outlier Removal')

plt.tight_layout()
plt.show()

## 6. Save the Cleaned Dataset

In [None]:
# Save the cleaned dataset
df_no_outliers.to_csv('titanic_cleaned.csv', index=False)
print("Cleaned dataset saved to 'titanic_cleaned.csv'")

## 7. Summary of Preprocessing Steps

In this notebook, we have performed the following preprocessing steps on the Titanic dataset:

1. **Data Exploration**:
   - Loaded the dataset and examined its structure
   - Checked for missing values (none found)
   - Visualized distributions of features

2. **Missing Value Handling**:
   - Demonstrated imputation techniques on artificially introduced missing values

3. **Categorical Feature Encoding**:
   - Applied label encoding to the 'Sex' column
   - Applied one-hot encoding to the 'Sex' column

4. **Feature Scaling**:
   - Applied standardization (z-score normalization)
   - Applied min-max normalization
   - Compared the distributions before and after scaling

5. **Outlier Detection and Removal**:
   - Visualized outliers using boxplots
   - Identified outliers using the IQR method
   - Removed outliers and compared distributions

6. **Saved the Cleaned Dataset**:
   - Saved the encoded, outlier-filtered dataset to `titanic_cleaned.csv` (the scaled copies are kept in separate DataFrames and are not part of the saved file)

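These steps leave the saved data free of IQR outliers and with 'Sex' encoded numerically. Scaling was demonstrated on separate copies (`df_standardized`, `df_normalized`), so apply a scaler to the saved data before training any model that is sensitive to feature scale.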
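For modeling, the imputation, encoding, and scaling steps are often bundled into a single scikit-learn object so they can be fitted once and reapplied consistently. The sketch below is one possible arrangement, not what this notebook does: the toy frame, the step names, the median strategy, and `sparse_threshold=0` (to force a dense array) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with gaps in both a numerical and a categorical column
toy = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan, 35.0],
    'Fare': [7.25, 71.28, 7.92, 53.10],
    'Sex': ['male', 'female', 'female', np.nan],
})

# Numerical columns: fill gaps with the median, then standardize
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical_steps = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

preprocess = ColumnTransformer(
    [
        ('num', numeric_steps, ['Age', 'Fare']),
        ('cat', categorical_steps, ['Sex']),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocess.fit_transform(toy)
print(X.shape)
```

In practice `fit_transform` would be called on the training split only and `transform` on the test split, so that statistics such as the mean and the most frequent category are not learned from test data.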

## 8. Interview Questions and Answers

Please refer to the separate markdown file `interview_questions.md` for detailed answers to the following interview questions:

1. What are the different types of missing data?
2. How do you handle categorical variables?
3. What is the difference between normalization and standardization?
4. How do you detect outliers?
5. Why is preprocessing important in ML?
6. What is one-hot encoding vs label encoding?
7. How do you handle data imbalance?
8. Can preprocessing affect model accuracy?