# Task 2: Data Preprocessing for Machine Learning – AI Bootcamp

Download Titanic Dataset here: https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv

#### About this file

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

## Section 1: Data Loading & Exploration

### **Task 1**: Load and Inspect a Dataset

*Instruction*: Load the `titanic.csv` dataset and display the first 5 rows. Show basic info and describe statistics of the dataset.

In [5]:
import pandas as pd
from google.colab import files
import io

def analyze_uploaded_data():
    """
    Allows the user to upload a CSV file and then displays the first 5 rows,
    basic information, and descriptive statistics of the first uploaded CSV.
    """
    print("Please upload the CSV file (e.g., titanic.csv).")
    uploaded = files.upload()

    if uploaded:
        first_uploaded_filename = list(uploaded.keys())[0]
        print(f"\nProcessing uploaded file: {first_uploaded_filename}")
        try:
            df = pd.read_csv(io.BytesIO(uploaded[first_uploaded_filename]))
            print("\nFirst 5 rows of the dataset:")
            print(df.head())
            print("\nBasic information about the dataset:")
            df.info()  # info() prints its own summary and returns None, so wrapping it in print() would also print "None"
            print("\nDescriptive statistics of the numerical columns:")
            print(df.describe())
        except Exception as e:
            print(f"An error occurred while processing the file: {e}")
    else:
        print("Error: No file was uploaded.")

if __name__ == "__main__":
    analyze_uploaded_data()

Please upload the CSV file (e.g., titanic.csv).


Saving titanic.csv to titanic (3).csv

Processing uploaded file: titanic (3).csv

First 5 rows of the dataset:
   Survived  Pclass                                               Name  \
0         0       3                             Mr. Owen Harris Braund   
1         1       1  Mrs. John Bradley (Florence Briggs Thayer) Cum...   
2         1       3                              Miss. Laina Heikkinen   
3         1       1        Mrs. Jacques Heath (Lily May Peel) Futrelle   
4         0       3                            Mr. William Henry Allen   

      Sex   Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare  
0    male  22.0                        1                        0   7.2500  
1  female  38.0                        1                        0  71.2833  
2  female  26.0                        0                        0   7.9250  
3  female  35.0                        1                        0  53.1000  
4    male  35.0                        0                   
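
The upload-based cell above only runs inside Google Colab. As a rough alternative, `pd.read_csv` also accepts a local path, a URL, or a file-like object, so the same inspection can be sketched without `files.upload()`. The helper name `load_and_inspect` and the two-row in-memory CSV (a trimmed stand-in with fewer columns than the real file) are assumptions made for illustration.

```python
import io

import pandas as pd

# URL taken from the download link at the top of this notebook.
TITANIC_URL = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"


def load_and_inspect(source):
    """Read a CSV from a path, URL, or file-like object and print a quick overview."""
    df = pd.read_csv(source)
    print("First 5 rows:")
    print(df.head())
    print("\nBasic information:")
    df.info()  # prints directly; the return value is None
    print("\nDescriptive statistics:")
    print(df.describe())
    return df


# Trimmed in-memory stand-in so the sketch runs without a network connection.
sample_csv = io.StringIO(
    "Survived,Pclass,Name,Sex,Age,Fare\n"
    "0,3,Mr. Owen Harris Braund,male,22.0,7.25\n"
    "1,3,Miss. Laina Heikkinen,female,26.0,7.925\n"
)
sample_df = load_and_inspect(sample_csv)

# With network access, the same helper should work on the real file:
# titanic_df = load_and_inspect(TITANIC_URL)
```

Returning the DataFrame lets later cells reuse it instead of asking for a fresh upload each time.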

## Section 2: Handling Missing Values

### **Task 2**: Identify and Handle Missing Data

*Instruction*:

*   Display the number of missing values per column.
*   Fill missing `Age` values with the median.
*   Drop the second row in the dataset.

In [7]:
import pandas as pd
from google.colab import files
import io

def handle_missing_data_dynamic_filename():
    """
    Allows the user to upload a CSV file, displays missing values,
    fills missing 'Age', and drops the second row, using the actual
    uploaded filename.
    """
    print("Please upload the titanic.csv file.")
    uploaded = files.upload()

    if uploaded:
        uploaded_filename = list(uploaded.keys())[0]  # Get the name of the first uploaded file
        print(f"\nProcessing uploaded file: {uploaded_filename}")
        try:
            df = pd.read_csv(io.BytesIO(uploaded[uploaded_filename]))

            # Display the number of missing values per column
            print("\nNumber of missing values per column:")
            print(df.isnull().sum())

            # Fill missing 'Age' values with the median
            median_age = df['Age'].median()
            df['Age'].fillna(median_age, inplace=True)
            print("\nMissing values in 'Age' column after filling:")
            print(df['Age'].isnull().sum())

            # Drop the second row (index 1)
            df.drop(1, inplace=True, errors='ignore')
            print("\nFirst 5 rows after dropping the second row:")
            print(df.head())

        except Exception as e:
            print(f"An error occurred: {e}")
    else:
        print("Error: No file was uploaded.")

if __name__ == "__main__":
    handle_missing_data_dynamic_filename()


Please upload the titanic.csv file.


Saving titanic.csv to titanic (5).csv

Processing uploaded file: titanic (5).csv

Number of missing values per column:
Survived                   0
Pclass                     0
Name                       0
Sex                        0
Age                        0
Siblings/Spouses Aboard    0
Parents/Children Aboard    0
Fare                       0
dtype: int64

Missing values in 'Age' column after filling:
0

First 5 rows after dropping the second row:
   Survived  Pclass                                         Name     Sex  \
0         0       3                       Mr. Owen Harris Braund    male   
2         1       3                        Miss. Laina Heikkinen  female   
3         1       1  Mrs. Jacques Heath (Lily May Peel) Futrelle  female   
4         0       3                      Mr. William Henry Allen    male   
5         0       3                              Mr. James Moran    male   

    Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare  
0  22.0         

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Age'].fillna(median_age, inplace=True)
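
The warning above is triggered by calling `fillna(..., inplace=True)` on the column selection `df['Age']`. The sketch below shows the assignment form the warning itself suggests, on a small toy frame with one invented missing age. A toy frame is used because the recorded counts show no missing values in this copy of the dataset, so the fill is a no-op there. It also drops the second row by position rather than by label, which matters only if the index is not the default 0, 1, 2, ...

```python
import numpy as np
import pandas as pd

# Toy frame (assumption): one age deliberately set to NaN for demonstration.
toy = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0, 35.0],
        "Fare": [7.25, 71.2833, 7.925, 53.1],
    }
)

print("Missing values per column:")
print(toy.isnull().sum())

# Assign the filled column back instead of using inplace=True on a selection.
median_age = toy["Age"].median()  # median of 22, 26, 35
toy["Age"] = toy["Age"].fillna(median_age)

# Drop the second row by position; toy.index[1] is whatever label sits there.
toy = toy.drop(toy.index[1])
print(toy)
```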


## Section 3: Encoding Categorical Features

In [9]:

import pandas as pd
from google.colab import files
import io
from sklearn.preprocessing import LabelEncoder

def encode_titanic_data():
    """
    Allows the user to upload a CSV file (expecting titanic.csv),
    performs label encoding on 'Sex' and one-hot encoding on 'Pclass',
    and prints the head of the modified DataFrame.
    """
    print("Please upload the titanic.csv file.")
    uploaded = files.upload()

    if uploaded:
        uploaded_filename = list(uploaded.keys())[0]  # Get the name of the first uploaded file
        print(f"\nProcessing uploaded file: {uploaded_filename}")
        try:
            df = pd.read_csv(io.BytesIO(uploaded[uploaded_filename]))

            # Label Encoding for 'Sex'
            le = LabelEncoder()
            df['Sex'] = le.fit_transform(df['Sex'])
            print("\n'Sex' column after Label Encoding:")
            print(df['Sex'].head())

            # One-Hot Encoding for 'Pclass'
            df = pd.get_dummies(df, columns=['Pclass'], prefix='Pclass')
            print("\nDataFrame after One-Hot Encoding 'Pclass':")
            print(df.head())
            return df  # Return the encoded DataFrame

        except Exception as e:
            print(f"An error occurred: {e}")
            return None
    else:
        print("Error: No file was uploaded.")
        return None

if __name__ == "__main__":
    encoded_df = encode_titanic_data()
    if encoded_df is not None:
        print(encoded_df.head())



Please upload the titanic.csv file.


Saving titanic.csv to titanic (6).csv

Processing uploaded file: titanic (6).csv

'Sex' column after Label Encoding:
0    1
1    0
2    0
3    0
4    1
Name: Sex, dtype: int64

DataFrame after One-Hot Encoding 'Pclass':
   Survived                                               Name  Sex   Age  \
0         0                             Mr. Owen Harris Braund    1  22.0   
1         1  Mrs. John Bradley (Florence Briggs Thayer) Cum...    0  38.0   
2         1                              Miss. Laina Heikkinen    0  26.0   
3         1        Mrs. Jacques Heath (Lily May Peel) Futrelle    0  35.0   
4         0                            Mr. William Henry Allen    1  35.0   

   Siblings/Spouses Aboard  Parents/Children Aboard     Fare  Pclass_1  \
0                        1                        0   7.2500     False   
1                        1                        0  71.2833      True   
2                        0                        0   7.9250     False   
3                    
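
The encoded output above shows `male` as 1 and `female` as 0. That follows from `LabelEncoder` sorting its classes, which can be checked through `classes_`. The sketch below makes the mapping explicit on a toy frame (invented values) and passes `dtype=int` to `pd.get_dummies` so the one-hot columns hold 0/1 rather than `True`/`False`. Note that scikit-learn documents `LabelEncoder` as intended for target labels, so using it on a feature column is a convenience here rather than the canonical choice.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (assumption): a handful of rows with the two columns being encoded.
toy = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Pclass": [3, 1, 3, 2],
    }
)

# Label-encode 'Sex'; classes_ holds the sorted original labels.
le = LabelEncoder()
toy["Sex"] = le.fit_transform(toy["Sex"])
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)  # {'female': 0, 'male': 1}

# One-hot encode 'Pclass' as integer 0/1 columns.
toy = pd.get_dummies(toy, columns=["Pclass"], prefix="Pclass", dtype=int)
print(toy)
```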

## Section 4: Feature Scaling

### **Task 4**: Scale Numerical Features

*Instruction*: Use `StandardScaler` to scale the `Age` and `Fare` columns.

In [10]:
import pandas as pd
from google.colab import files
import io
from sklearn.preprocessing import LabelEncoder, StandardScaler

def process_titanic_data():
    """
    Uploads the titanic.csv file, encodes categorical features ('Sex', 'Pclass'),
    scales numerical features ('Age', 'Fare'), and returns the processed DataFrame.
    """
    print("Please upload the titanic.csv file.")
    uploaded = files.upload()

    if uploaded:
        # Get the filename of the uploaded file
        uploaded_filename = list(uploaded.keys())[0]
        print(f"\nProcessing uploaded file: {uploaded_filename}")

        try:
            # Read the CSV file into a pandas DataFrame
            df = pd.read_csv(io.BytesIO(uploaded[uploaded_filename]))

            # 1. Label Encoding for 'Sex'
            label_encoder = LabelEncoder()
            df['Sex'] = label_encoder.fit_transform(df['Sex'])
            print("\n'Sex' column after Label Encoding:")
            print(df.head()['Sex'])

            # 2. One-Hot Encoding for 'Pclass'
            df = pd.get_dummies(df, columns=['Pclass'], prefix='Pclass')
            print("\nDataFrame after One-Hot Encoding 'Pclass':")
            print(df.head())

            # 3. Scaling 'Age' and 'Fare'
            scaler = StandardScaler()
            df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
            print("\n'Age' and 'Fare' columns after Scaling:")
            print(df.head()[['Age', 'Fare']])

            return df  # Return the processed DataFrame

        except Exception as e:
            print(f"An error occurred during processing: {e}")
            return None  # Return None in case of an error
    else:
        print("Error: No file was uploaded.")
        return None  # Return None if no file was uploaded

if __name__ == "__main__":
    # Call the function to process the data
    processed_df = process_titanic_data()

    # Print the head of the processed DataFrame if successful
    if processed_df is not None:
        print("\nProcessed DataFrame (first 5 rows):")
        print(processed_df.head())



Please upload the titanic.csv file.


Saving titanic.csv to titanic (7).csv

Processing uploaded file: titanic (7).csv

'Sex' column after Label Encoding:
0    1
1    0
2    0
3    0
4    1
Name: Sex, dtype: int64

DataFrame after One-Hot Encoding 'Pclass':
   Survived                                               Name  Sex   Age  \
0         0                             Mr. Owen Harris Braund    1  22.0   
1         1  Mrs. John Bradley (Florence Briggs Thayer) Cum...    0  38.0   
2         1                              Miss. Laina Heikkinen    0  26.0   
3         1        Mrs. Jacques Heath (Lily May Peel) Futrelle    0  35.0   
4         0                            Mr. William Henry Allen    1  35.0   

   Siblings/Spouses Aboard  Parents/Children Aboard     Fare  Pclass_1  \
0                        1                        0   7.2500     False   
1                        1                        0  71.2833      True   
2                        0                        0   7.9250     False   
3                    
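
To make the scaling step less of a black box, the sketch below compares `StandardScaler` with the formula it applies, z = (x - mean) / std, where the standard deviation is the population one (`ddof=0`). The four-row toy frame reuses the `Age` and `Fare` values visible in the first rows of the output above; the variable names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame built from the first four Age/Fare values shown above.
toy = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0, 35.0],
        "Fare": [7.25, 71.2833, 7.925, 53.1],
    }
)
cols = ["Age", "Fare"]

# Scale with scikit-learn.
scaler = StandardScaler()
scaled = scaler.fit_transform(toy[cols])

# Same computation by hand: subtract the column mean, divide by the population std.
manual = (toy[cols] - toy[cols].mean()) / toy[cols].std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1 up to floating-point error. Pandas' own `.std()` defaults to `ddof=1`, so the explicit `ddof=0` is needed for the two results to agree.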

## Section 5: Preprocessing Pipeline

### **Task 5**: Build Preprocessing Pipeline

*Instruction*: Using `ColumnTransformer` and `Pipeline` from `sklearn`, build a pipeline that:

*   Imputes missing values
*   Scales numeric data
*   Encodes categorical data

In [12]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder
from google.colab import files
import io

def preprocess_titanic_data():
    """
    Prompts the user to upload the titanic.csv dataset, preprocesses it using
    a pipeline, and returns the preprocessed features, the target, and the pipeline.
    """

    print("Please upload the titanic.csv file.")
    uploaded = files.upload()

    if uploaded:
        # Get the filename of the uploaded file
        uploaded_filename = list(uploaded.keys())[0]
        print(f"\nProcessing uploaded file: {uploaded_filename}")

        try:
            # Load the titanic.csv dataset
            df = pd.read_csv(io.BytesIO(uploaded[uploaded_filename]))

            # Print the columns of the DataFrame to help with debugging
            print("\nColumns in the DataFrame:")
            print(df.columns)

            # Separate features and target variable
            X = df.drop('Survived', axis=1)  # Features
            y = df['Survived']  # Target variable


            # Define the numeric and categorical features
            numeric_features = ['Age', 'Fare']
            categorical_features = ['Sex', 'Pclass', 'Embarked']

            # 1. Numeric Pipeline
            numeric_pipeline = Pipeline([
                ('imputer', SimpleImputer(strategy='median')),  # Impute missing values with the median
                ('scaler', StandardScaler())  # Scale the numeric features
            ])

            # 2. Categorical Pipeline
            categorical_pipeline = Pipeline([
                ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values with the most frequent value
                ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-Hot Encode the categorical features
            ])

            # 3. Column Transformer
            preprocessor = ColumnTransformer([
                ('num', numeric_pipeline, numeric_features),
                ('cat', categorical_pipeline, categorical_features)
            ])

            # 4. Full Pipeline
            full_pipeline = Pipeline([
                ('preprocessor', preprocessor)
            ])

            # Fit and transform the data
            X_processed = full_pipeline.fit_transform(X)



            # Convert the processed data to a DataFrame (optional, for better readability)
            feature_names = full_pipeline.named_steps['preprocessor'].get_feature_names_out()
            X_processed_df = pd.DataFrame(X_processed, columns=feature_names, index=X.index)


            print("Shape of processed data:", X_processed_df.shape)
            print("\nFirst 5 rows of processed data:")
            print(X_processed_df.head())

            return X_processed_df, y, full_pipeline

        except Exception as e:
            print(f"An error occurred: {e}")
            return None, None, None
    else:
        print("Error: No file was uploaded.")
        return None, None, None



if __name__ == "__main__":
    X_processed, y, full_pipeline = preprocess_titanic_data()
    if X_processed is not None:
        print("Preprocessing completed.")



Please upload the titanic.csv file.


Saving titanic.csv to titanic (9).csv

Processing uploaded file: titanic (9).csv

Columns in the DataFrame:
Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'Siblings/Spouses Aboard',
       'Parents/Children Aboard', 'Fare'],
      dtype='object')
An error occurred: A given column is not a column of the dataframe


## Section 6: Feature Engineering

### **Task 6**: Create a New Feature

*Instruction*: Create a new feature `FamilySize` = `Siblings/Spouses Aboard` + `Parents/Children Aboard` + 1.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Your code here
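
The cell above is still a placeholder. A minimal sketch of the requested feature, assuming the column names printed earlier in this notebook and a four-row toy frame in place of the uploaded file, could look like this:

```python
import pandas as pd

# Toy frame (assumption): stands in for the uploaded Titanic DataFrame.
toy = pd.DataFrame(
    {
        "Siblings/Spouses Aboard": [1, 1, 0, 1],
        "Parents/Children Aboard": [0, 0, 0, 0],
    }
)

# FamilySize counts the passenger (+1) plus siblings/spouses and parents/children.
toy["FamilySize"] = (
    toy["Siblings/Spouses Aboard"] + toy["Parents/Children Aboard"] + 1
)
print(toy["FamilySize"].tolist())  # [2, 2, 1, 2]
```

A passenger travelling alone gets `FamilySize` 1, which is why the formula adds one to the two counts.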
