# Task 2: Data Preprocessing for Machine Learning – AI Bootcamp

Download Titanic Dataset here: https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv

#### About this file

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
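As a rough sketch of how the "which groups survived" question can be explored, the example below computes survival rates per group with `groupby`. It assumes only the first five rows of the dataset (the ones printed by `df.head()` in this notebook), so the rates are purely illustrative and say nothing about the full passenger list.

```python
import pandas as pd

# First five rows of the dataset, typed in by hand for illustration only
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
})

# The mean of a 0/1 column is the share of survivors in each group
survival_by_sex = sample.groupby("Sex")["Survived"].mean()
survival_by_class = sample.groupby("Pclass")["Survived"].mean()

print(survival_by_sex)
print(survival_by_class)
```

On the full dataset the same two `groupby` calls can be run on the loaded DataFrame instead of `sample`.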

## Section 1: Data Loading & Exploration

### **Task 1**: Load and Inspect a Dataset

*Instruction*: Load the `titanic.csv` dataset and display the first 5 rows. Show basic info and describe statistics of the dataset.

In [None]:
import pandas as pd

df = pd.read_csv('titanic.csv')
print(df.head())
df.info()  # info() prints its summary directly and returns None
print(df.describe())

   Survived  Pclass                                               Name  \
0         0       3                             Mr. Owen Harris Braund   
1         1       1  Mrs. John Bradley (Florence Briggs Thayer) Cum...   
2         1       3                              Miss. Laina Heikkinen   
3         1       1        Mrs. Jacques Heath (Lily May Peel) Futrelle   
4         0       3                            Mr. William Henry Allen   

      Sex   Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare  
0    male  22.0                        1                        0   7.2500  
1  female  38.0                        1                        0  71.2833  
2  female  26.0                        0                        0   7.9250  
3  female  35.0                        1                        0  53.1000  
4    male  35.0                        0                        0   8.0500  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 8

## Section 2: Handling Missing Values

### **Task 2**: Identify and Handle Missing Data

*Instruction*:

*   Display the number of missing values per column.
*   Fill missing `Age` values with the median.
*   Drop the `Cabin` column.

In [None]:
import pandas as pd

# Load the Titanic dataset
titanic_df = pd.read_csv('titanic.csv')

# Display the number of missing values per column
print("Number of missing values per column:")
print(titanic_df.isnull().sum())

# Fill missing Age values with the median.
# Assigning the result back avoids the chained `inplace=True` call,
# which pandas warns will stop working in pandas 3.0.
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].median())

# Drop the Cabin column if it exists. This version of the dataset has
# no Cabin column, so errors='ignore' prevents a KeyError.
titanic_df = titanic_df.drop(columns=['Cabin'], errors='ignore')

# Display DataFrame info to confirm the changes
print("\nDataFrame after handling missing data:")
titanic_df.info()


Number of missing values per column:
Survived                   0
Pclass                     0
Name                       0
Sex                        0
Age                        0
Siblings/Spouses Aboard    0
Parents/Children Aboard    0
Fare                       0
dtype: int64
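
As a minimal sketch of the two steps in this task, the example below fills missing values by assigning the result of `fillna` back to the column, and drops a column only when it exists. The small `toy_df` frame and its values are made up for illustration and are not taken from `titanic.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing Age and a mostly empty Cabin column
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
})

# Count missing values per column before any cleaning
missing_before = toy_df.isnull().sum()

# The median of the observed ages (22, 26, 35) is 26
toy_df["Age"] = toy_df["Age"].fillna(toy_df["Age"].median())

# errors="ignore" turns the drop into a no-op when the column is absent,
# so calling it a second time does not raise a KeyError
toy_df = toy_df.drop(columns=["Cabin"], errors="ignore")
toy_df = toy_df.drop(columns=["Cabin"], errors="ignore")

print(missing_before)
print(toy_df)
```

Assigning the result back works the same way across pandas versions, whereas `inplace=True` on a column selection may act on a temporary copy.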

# New section

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Section 3: Encoding Categorical Features

### **Task 3**: Convert Categorical to Numeric

*Instruction*: Convert the `Sex` and `Embarked` columns to numeric using:

*   Label Encoding for `Sex`
*   One-Hot Encoding for `Embarked`

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the Titanic dataset
titanic_df = pd.read_csv('titanic.csv')

# Display the first 5 rows for context
print("First 5 rows of the Titanic dataset:")
print(titanic_df.head())

# Convert 'Sex' column using Label Encoding (female -> 0, male -> 1)
label_encoder = LabelEncoder()
titanic_df['Sex'] = label_encoder.fit_transform(titanic_df['Sex'])

# Convert 'Embarked' column using One-Hot Encoding.
# This version of the dataset has no 'Embarked' column, so encode it only if present.
if 'Embarked' in titanic_df.columns:
    titanic_df = pd.get_dummies(titanic_df, columns=['Embarked'], drop_first=True)

# Display the modified DataFrame
print("\nDataFrame after encoding:")
print(titanic_df.head())

First 5 rows of the Titanic dataset:
   Survived  Pclass                                               Name  \
0         0       3                             Mr. Owen Harris Braund   
1         1       1  Mrs. John Bradley (Florence Briggs Thayer) Cum...   
2         1       3                              Miss. Laina Heikkinen   
3         1       1        Mrs. Jacques Heath (Lily May Peel) Futrelle   
4         0       3                            Mr. William Henry Allen   

      Sex   Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare  
0    male  22.0                        1                        0   7.2500  
1  female  38.0                        1                        0  71.2833  
2  female  26.0                        0                        0   7.9250  
3  female  35.0                        1                        0  53.1000  
4    male  35.0                        0                        0   8.0500  
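
To show both encodings side by side, here is a minimal sketch on a made-up frame that does contain an `Embarked` column. The `toy_df` name and the port codes `S`, `C`, `Q` are illustrative assumptions, not values read from `titanic.csv`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; the Embarked codes are illustrative
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Label encoding: classes are sorted, so female -> 0 and male -> 1
encoder = LabelEncoder()
toy_df["Sex"] = encoder.fit_transform(toy_df["Sex"])

# One-hot encoding: drop_first=True drops the first category ("C"),
# leaving one indicator column each for "Q" and "S"
toy_df = pd.get_dummies(toy_df, columns=["Embarked"], drop_first=True)

print(list(encoder.classes_))
print(toy_df)
```

Label encoding suits a two-valued column such as `Sex`. For a column with three or more unordered categories, one-hot encoding avoids implying an order between them.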

## Section 4: Feature Scaling

### **Task 4**: Scale Numerical Features

*Instruction*: Use `StandardScaler` to scale the `Age` and `Fare` columns.

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the Titanic dataset
titanic_df = pd.read_csv('titanic.csv')

# Display the first 5 rows for context
print("First 5 rows of the Titanic dataset:")
print(titanic_df.head())

# Initialize StandardScaler
scaler = StandardScaler()

# Scale the Age and Fare columns
titanic_df[['Age', 'Fare']] = scaler.fit_transform(titanic_df[['Age', 'Fare']])

# Display the modified DataFrame to see the scaled features
print("\nDataFrame after scaling Age and Fare:")
print(titanic_df[['Age', 'Fare']].head())



First 5 rows of the Titanic dataset:
   Survived  Pclass                                               Name  \
0         0       3                             Mr. Owen Harris Braund   
1         1       1  Mrs. John Bradley (Florence Briggs Thayer) Cum...   
2         1       3                              Miss. Laina Heikkinen   
3         1       1        Mrs. Jacques Heath (Lily May Peel) Futrelle   
4         0       3                            Mr. William Henry Allen   

      Sex   Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare  
0    male  22.0                        1                        0   7.2500  
1  female  38.0                        1                        0  71.2833  
2  female  26.0                        0                        0   7.9250  
3  female  35.0                        1                        0  53.1000  
4    male  35.0                        0                        0   8.0500  

DataFrame after scaling Age and Fare:
        Age      
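
As a small sketch of what `StandardScaler` computes, the example below scales the `Age` and `Fare` values from the five rows printed above and compares the result with the formula $z = (x - \mu) / \sigma$ applied by hand. Using only five rows is an assumption made to keep the example self-contained, so the numbers differ from those obtained on the full dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Fare from the five rows printed above
X = np.array([
    [22.0, 7.2500],
    [38.0, 71.2833],
    [26.0, 7.9250],
    [35.0, 53.1000],
    [35.0, 8.0500],
])

scaled = StandardScaler().fit_transform(X)

# Same result by hand: subtract the column mean and divide by the
# population standard deviation (ddof=0), which StandardScaler uses
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, each column has mean 0 and standard deviation 1, which puts `Age` and `Fare` on a comparable scale.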

## Section 5: Preprocessing Pipeline

### **Task 5**: Build Preprocessing Pipeline

*Instruction*: Using `ColumnTransformer` and `Pipeline` from `sklearn`, build a pipeline that:

*   Imputes missing values
*   Scales numeric data
*   Encodes categorical data

In [None]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Load the Titanic dataset
titanic_df = pd.read_csv('titanic.csv')

# Define the features and target variable
features = titanic_df.drop('Survived', axis=1)  # Assuming 'Survived' is the target column
target = titanic_df['Survived']

# Column names for preprocessing.
# 'Embarked' is not in this version of the dataset, so keep only the
# categorical columns that are actually present.
numeric_features = ['Age', 'Fare']
categorical_features = [col for col in ['Sex', 'Embarked'] if col in features.columns]

# Create a preprocessing pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with the mean
    ('scaler', StandardScaler())                  # Scale the data
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),  # Impute missing values with a constant
    ('onehot', OneHotEncoder(handle_unknown='ignore'))                      # One-hot encode categorical data
])

# Combine preprocessing for numeric and categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Create the full preprocessing pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor)
])

# Fit the pipeline and transform the features
X_transformed = pipeline.fit_transform(features)

# Display the shape of the transformed feature array
print("Transformed feature array shape:", X_transformed.shape)
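
To make the three steps (impute, scale, encode) visible on data that actually has a gap, here is a minimal sketch on a made-up frame. The `toy_X` data, the `toy_preprocessor` name, and the choice of `median` and `most_frequent` imputation are assumptions for illustration. They are a variation on the cell above, not a replacement for it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing Age
toy_X = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Sex": ["male", "female", "female", "male"],
})

# Numeric columns: fill gaps with the median, then standardize
numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 asks ColumnTransformer to always return a dense array
toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric, ["Age", "Fare"]),
        ("cat", categorical, ["Sex"]),
    ],
    sparse_threshold=0,
)

toy_out = toy_preprocessor.fit_transform(toy_X)
print(toy_out.shape)  # 2 scaled numeric columns + 2 one-hot columns
```

Each transformer only sees the columns listed for it, so numeric and categorical columns can be imputed with different strategies in a single `fit_transform` call.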

## Section 6: Feature Engineering

### **Task 6**: Create a New Feature

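*Instruction*: Create a new feature `FamilySize` = `SibSp` + `Parch` + 1. In this version of the dataset, those two columns are named `Siblings/Spouses Aboard` and `Parents/Children Aboard`.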

In [None]:

import pandas as pd

# Load the Titanic dataset
titanic_df = pd.read_csv('titanic.csv')

# Display the first 5 rows for context
print("First 5 rows of the Titanic dataset:")
print(titanic_df.head())

# Create the FamilySize feature.
# This version of the dataset names the SibSp and Parch columns
# 'Siblings/Spouses Aboard' and 'Parents/Children Aboard'.
sibsp_col = 'Siblings/Spouses Aboard'
parch_col = 'Parents/Children Aboard'
titanic_df['FamilySize'] = titanic_df[sibsp_col] + titanic_df[parch_col] + 1

# Display the modified DataFrame with the new FamilySize feature
print("\nDataFrame with FamilySize feature:")
print(titanic_df[[sibsp_col, parch_col, 'FamilySize']].head())


First 5 rows of the Titanic dataset:
   Survived  Pclass                                               Name  \
0         0       3                             Mr. Owen Harris Braund   
1         1       1  Mrs. John Bradley (Florence Briggs Thayer) Cum...   
2         1       3                              Miss. Laina Heikkinen   
3         1       1        Mrs. Jacques Heath (Lily May Peel) Futrelle   
4         0       3                            Mr. William Henry Allen   

      Sex   Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare  
0    male  22.0                        1                        0   7.2500  
1  female  38.0                        1                        0  71.2833  
2  female  26.0                        0                        0   7.9250  
3  female  35.0                        1                        0  53.1000  
4    male  35.0                        0                        0   8.0500  
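
As a quick worked example of the `FamilySize` formula, the sketch below uses the sibling/spouse and parent/child counts from the five rows printed above. Renaming the columns to `SibSp` and `Parch` is an optional convenience assumed here so the code matches the instruction's wording.

```python
import pandas as pd

# Counts from the five rows printed above
sample = pd.DataFrame({
    "Siblings/Spouses Aboard": [1, 1, 0, 1, 0],
    "Parents/Children Aboard": [0, 0, 0, 0, 0],
})

# Optional: rename to the short names used in the instruction
sample = sample.rename(columns={
    "Siblings/Spouses Aboard": "SibSp",
    "Parents/Children Aboard": "Parch",
})

# Family size counts the passenger (+1) plus relatives aboard
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

print(sample["FamilySize"].tolist())  # [2, 2, 1, 2, 1]
```

A passenger travelling alone therefore has `FamilySize` 1, not 0.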