# Task 2: Data Preprocessing for Machine Learning – AI Bootcamp

Download the Titanic dataset here: https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv

#### About this file

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (e.g., name, age, gender, and socio-economic class).

## Section 1: Data Loading & Exploration

### **Task 1**: Load and Inspect a Dataset

*Instruction*: Load the `titanic.csv` dataset and display the first 5 rows. Show basic info and describe statistics of the dataset.

In [None]:
import pandas as pd

df = pd.read_csv('titanic.csv')
print(df.head())
df.info()  # info() prints its summary itself and returns None, so no print() is needed
print(df.describe())

   Survived  Pclass                                               Name  \
0         0       3                             Mr. Owen Harris Braund   
1         1       1  Mrs. John Bradley (Florence Briggs Thayer) Cum...   
2         1       3                              Miss. Laina Heikkinen   
3         1       1        Mrs. Jacques Heath (Lily May Peel) Futrelle   
4         0       3                            Mr. William Henry Allen   

      Sex   Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare  
0    male  22.0                        1                        0   7.2500  
1  female  38.0                        1                        0  71.2833  
2  female  26.0                        0                        0   7.9250  
3  female  35.0                        1                        0  53.1000  
4    male  35.0                        0                        0   8.0500  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 8

## Section 2: Handling Missing Values

### **Task 2**: Identify and Handle Missing Data

*Instruction*:



*   Display the number of missing values per column.
*   Fill missing `Age` values with the median.
*   Drop the second row in the dataset.



In [None]:
import pandas as pd
import numpy as np

# Sample DataFrame (replace with your actual data)
data = {'Name': ['Alice', 'Bob', None, 'Eve'],
        'Age': [25, None, 30, None],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Miami']}
df = pd.DataFrame(data)

# 1. Display missing values per column
print("Missing values per column:\n", df.isna().sum())

# 2. Fill missing Age values with the median
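# Assign the result back; calling fillna(..., inplace=True) on a selected column
# raises a chained-assignment FutureWarning in recent pandas versions
df['Age'] = df['Age'].fillna(df['Age'].median())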
print("\nDataFrame after filling missing Age values:\n", df)

# 3. Drop the second row
df.drop(1, inplace=True)
print("\nDataFrame after dropping the second row:\n", df)



Missing values per column:
 Name    1
Age     2
City    0
dtype: int64

DataFrame after filling missing Age values:
     Name   Age         City
0  Alice  25.0     New York
1    Bob  27.5  Los Angeles
2   None  30.0      Chicago
3    Eve  27.5        Miami

DataFrame after dropping the second row:
     Name   Age      City
0  Alice  25.0  New York
2   None  30.0   Chicago
3    Eve  27.5     Miami


Note: pandas emits a `FutureWarning` when an inplace method is called on a selected column, as in `df['Age'].fillna(df['Age'].median(), inplace=True)`. The behavior will change in pandas 3.0: this inplace call will never work, because the intermediate object on which the values are set always behaves as a copy. To perform the operation on the original object, use `df.method({col: value}, inplace=True)` or `df[col] = df[col].method(value)` instead.
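
The two alternatives to a chained `inplace=True` call can be sketched as follows. This is a minimal illustration on a hypothetical toy frame (the data and the variable names `filled_by_assignment` and `filled_by_mapping` are assumptions, not part of the Titanic dataset); both forms should give the same result.

```python
import pandas as pd

# Hypothetical toy frame; the column names mirror the sample cell above.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Eve"],
    "Age": [25, None, 30, None],
})

# Median of the non-missing ages (25 and 30).
age_median = df["Age"].median()

# Option 1: assign the filled column back to the DataFrame.
filled_by_assignment = df.copy()
filled_by_assignment["Age"] = filled_by_assignment["Age"].fillna(age_median)

# Option 2: call fillna on the DataFrame with a {column: value} mapping.
# Only the Age column is filled; the missing Name is left untouched.
filled_by_mapping = df.fillna({"Age": age_median})

print(filled_by_assignment)
print(filled_by_mapping)
```

Either form avoids the warning because the operation is applied to the DataFrame itself rather than to a temporary column selection.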


## Section 3: Encoding Categorical Features

### **Task 3**: Convert Categorical to Numeric

*Instruction*: Convert `Sex` and `Pclass` columns to numeric using:


*   Label Encoding for `Sex`
*   One-Hot Encoding for `Pclass`



In [None]:



import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Sample DataFrame (replace with your actual DataFrame)
data = {'Pclass': [1, 2, 3, 1, 2, 3],
        'Sex': ['male', 'female', 'male', 'female', 'male', 'female']}
df = pd.DataFrame(data)

# Label Encoding for Sex
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])

# One-Hot Encoding for Pclass
ohe = OneHotEncoder(sparse_output=False)
pclass_reshaped = df['Pclass'].values.reshape(-1, 1)
ohe.fit(pclass_reshaped)
pclass_encoded = ohe.transform(pclass_reshaped)
pclass_encoded_df = pd.DataFrame(pclass_encoded, columns=ohe.get_feature_names_out(['Pclass']))

df = df.drop('Pclass', axis=1)
df = pd.concat([df, pclass_encoded_df], axis=1)

print(df)


   Sex  Pclass_1  Pclass_2  Pclass_3
0    1       1.0       0.0       0.0
1    0       0.0       1.0       0.0
2    1       0.0       0.0       1.0
3    0       1.0       0.0       0.0
4    1       0.0       1.0       0.0
5    0       0.0       0.0       1.0
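
To see why `male` becomes 1 and `female` becomes 0 in the output above, it can help to inspect the fitted encoder. The sketch below uses a hypothetical four-row frame; it also shows `pd.get_dummies` as a pandas-only alternative for one-hot encoding, which is a swapped-in technique and not the `OneHotEncoder` approach used in the cell above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data reusing the column names from the cell above.
df = pd.DataFrame({
    "Pclass": [1, 2, 3, 1],
    "Sex": ["male", "female", "male", "female"],
})

le = LabelEncoder()
codes = le.fit_transform(df["Sex"])

# classes_ is sorted, so the position of each label is its integer code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes.
decoded = le.inverse_transform(codes)
print(list(decoded))

# Alternative one-hot encoding with pandas only (one column per class value).
one_hot = pd.get_dummies(df["Pclass"], prefix="Pclass", dtype=float)
print(one_hot)
```

`get_dummies` is convenient for quick exploration, while a fitted `OneHotEncoder` remembers the categories seen during fitting and can be reused on new data.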


## Section 4: Feature Scaling

### **Task 4**: Scale Numerical Features

*Instruction*: Use `StandardScaler` to scale the `Age` and `Fare` columns.

In [None]:


import pandas as pd
from sklearn.preprocessing import StandardScaler

# Sample DataFrame (replace with your actual data)
data = {'Age': [25, 30, 22, 35, 28], 'Fare': [10, 20, 15, 25, 18]}
df = pd.DataFrame(data)

# Create a StandardScaler object
scaler = StandardScaler()

# Fit and transform the 'Age' and 'Fare' columns
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])

# Print the scaled DataFrame
print(df)


        Age      Fare
0 -0.677631 -1.518785
1  0.451754  0.479616
2 -1.355262 -0.519584
3  1.581139  1.478817
4  0.000000  0.079936
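
The scaled values above can be reproduced by hand. As a minimal sketch on the same sample numbers, the snippet below applies z = (x - mean) / std with the population standard deviation (`ddof=0`), which is what `StandardScaler` uses, and compares the result with the scaler's output. The variable names `scaled` and `manual` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same sample values as in the cell above.
df = pd.DataFrame({"Age": [25, 30, 22, 35, 28], "Fare": [10, 20, 15, 25, 18]})

scaled = StandardScaler().fit_transform(df[["Age", "Fare"]])

# Manual z-score: subtract the column mean, divide by the population std.
# pandas defaults to ddof=1, so ddof=0 is passed explicitly.
manual = (df - df.mean()) / df.std(ddof=0)

print(manual)
print(np.allclose(scaled, manual.to_numpy()))
```

Using pandas' default `ddof=1` here would give slightly different numbers, which is a common source of confusion when checking scaler output.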


## Section 5: Preprocessing Pipeline

### **Task 5**: Build Preprocessing Pipeline

*Instruction*: Using `ColumnTransformer` and `Pipeline` from `sklearn`, build a pipeline that:



*   Imputes missing values
*   Scales numeric data
*   Encodes categorical data



In [None]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def build_preprocessing_pipeline(numeric_features, categorical_features):
    """
    Builds a preprocessing pipeline using ColumnTransformer and Pipeline.

    Args:
        numeric_features (list): List of column names that are numeric.
        categorical_features (list): List of column names that are categorical.

    Returns:
        sklearn.pipeline.Pipeline: A preprocessing pipeline.
    """

    # Define transformers
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

    # Create column transformer
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ])

    # Create pipeline
    pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

    return pipeline
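
A pipeline of this shape might be used as sketched below. The toy DataFrame and the choice of `Age`/`Fare` as numeric and `Sex`/`Pclass` as categorical features are assumptions for illustration; the pipeline is rebuilt inline with the same steps as the function above so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows with a few missing numeric values.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, np.nan],
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1],
})
numeric_features = ["Age", "Fare"]
categorical_features = ["Sex", "Pclass"]

# Same steps as the function above: impute then scale numeric columns,
# impute then one-hot encode categorical columns.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])
pipeline = Pipeline(steps=[("preprocessor", preprocessor)])

features = pipeline.fit_transform(toy)

# ColumnTransformer can return a sparse matrix; convert it if so.
if hasattr(features, "toarray"):
    features = features.toarray()

# 2 scaled numeric columns + 2 Sex columns + 2 Pclass columns.
print(features.shape)
```

In a real workflow the pipeline would be fitted on the training split only and then applied to the test split with `transform`, so that imputation and scaling statistics do not leak from test data.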

## Section 6: Feature Engineering

### **Task 6**: Create a New Feature

*Instruction*: Create a new feature `FamilySize` = `Siblings/Spouses Aboard` + `Parents/Children Aboard` + 1.
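
The formula in the instruction can be sketched directly with pandas column arithmetic. The rows below are hypothetical; only the column names are taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows using the dataset's column names.
df = pd.DataFrame({
    "Siblings/Spouses Aboard": [1, 1, 0, 3],
    "Parents/Children Aboard": [0, 0, 0, 2],
})

# FamilySize counts relatives aboard plus the passenger themselves (+ 1).
df["FamilySize"] = (
    df["Siblings/Spouses Aboard"] + df["Parents/Children Aboard"] + 1
)

print(df)
```

A passenger travelling alone therefore gets a `FamilySize` of 1 rather than 0.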

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set the global default size of matplotlib figures
plt.rc('figure', figsize=(10, 5))

# Size of matplotlib figures that contain subplots
fizsize_with_subplots = (10, 10)

# Size of matplotlib histogram bins
bin_size = 10
