# Task 2: Data Preprocessing for Machine Learning – AI Bootcamp

Download the Titanic dataset here: https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv

#### About this file

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
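
As a minimal sketch of what "more likely to survive" means in practice, the snippet below computes per-group survival rates with a pandas `groupby`. It uses only the `Survived`, `Pclass` and `Sex` values of the first five passengers shown in the Task 1 output, so the numbers are illustrative and far too few to draw conclusions from. The variable names (`sample`, `rate_by_sex`, `rate_by_class`) are assumptions made for this example.

```python
import pandas as pd

# First five rows of the Titanic dataset (illustrative only).
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
})

# The mean of a 0/1 column is the share of survivors within each group.
rate_by_sex = sample.groupby("Sex")["Survived"].mean()
rate_by_class = sample.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

On the full dataset the same two lines give a first look at which groups fared better, before any model is trained.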

## Section 1: Data Loading & Exploration

In [27]:
import pandas as pd

def load_and_explore_data(file_path):
    """
    Loads data from a CSV file into a Pandas DataFrame and performs basic exploration.

    Args:
        file_path (str): The path to the CSV file.

    Returns:
        pandas.DataFrame: The loaded DataFrame.
    """
    try:
        df = pd.read_csv(file_path)
        print("Data loaded successfully!")
        print("\nFirst 5 rows of the DataFrame:")
        print(df.head())
        print("\nInformation about the DataFrame:")
        print(df.info())
        print("\nDescriptive statistics of numerical columns:")
        print(df.describe())
        return df
    except FileNotFoundError:
        print(f"Error: File not found at '{file_path}'")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

if __name__ == "__main__":
    # Replace 'your_data.csv' with the actual path to your CSV file
    file_path = 'your_data.csv'

    # Create a sample CSV file for demonstration if 'your_data.csv' doesn't exist
    try:
        with open(file_path, 'r') as f:
            pass
    except FileNotFoundError:
        data = {'col1': [1, 2, 3, 4, 5],
                'col2': ['A', 'B', 'A', 'C', 'B'],
                'col3': [10.5, 20.3, 15.0, 22.1, 18.7]}
        sample_df = pd.DataFrame(data)
        sample_df.to_csv(file_path, index=False)
        print(f"Created a sample '{file_path}' for demonstration. Please replace it with your actual data file.")

    loaded_df = load_and_explore_data(file_path)

    if loaded_df is not None:
        print("\nShape of the DataFrame:", loaded_df.shape)
        print("\nNumber of missing values per column:")
        print(loaded_df.isnull().sum())
        print("\nUnique values in categorical columns:")
        for col in loaded_df.select_dtypes(include='object').columns:
            print(f"- {col}: {loaded_df[col].nunique()} unique values ({loaded_df[col].unique()})")

Created a sample 'your_data.csv' for demonstration. Please replace it with your actual data file.
Data loaded successfully!

First 5 rows of the DataFrame:
   col1 col2  col3
0     1    A  10.5
1     2    B  20.3
2     3    A  15.0
3     4    C  22.1
4     5    B  18.7

Information about the DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col1    5 non-null      int64  
 1   col2    5 non-null      object 
 2   col3    5 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 252.0+ bytes
None

Descriptive statistics of numerical columns:
           col1       col3
count  5.000000   5.000000
mean   3.000000  17.320000
std    1.581139   4.624067
min    1.000000  10.500000
25%    2.000000  15.000000
50%    3.000000  18.700000
75%    4.000000  20.300000
max    5.000000  22.100000

Shape of the DataFrame: (5, 3)

Number of missing

### **Task 1**: Load and Inspect a Dataset

*Instruction*: Load the `titanic.csv` dataset and display the first 5 rows. Show basic info and describe statistics of the dataset.

In [29]:
import pandas as pd

# Load the Titanic dataset
titanic_df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

# Display the first 5 rows
print("First 5 rows:")
print(titanic_df.head())

# Display basic info of the dataset
print("\nBasic Info:")
titanic_df.info()  # info() prints its summary directly and returns None

# Display descriptive statistics of the dataset
print("\nDescriptive Statistics:")
print(titanic_df.describe())

First 5 rows:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN

## Section 2: Handling Missing Values

### **Task 2**: Identify and Handle Missing Data

*Instruction*:



*   Display the number of missing values per column.
*   Fill missing `Age` values with the median.
*   Drop the `Cabin` column.



In [12]:
import pandas as pd
import numpy as np

def handle_missing_data(df):
    """
    Displays the number of missing values per column,
    fills missing 'Age' values with the median, and
    drops the 'Cabin' column.

    Args:
        df (pandas.DataFrame): The input DataFrame.

    Returns:
        pandas.DataFrame: The modified DataFrame.
    """
    print("Number of missing values per column before handling:")
    print(df.isnull().sum())

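    # Fill missing 'Age' values with the median.
    # Assign the result back instead of calling fillna(..., inplace=True) on the
    # column selection, which pandas flags as chained assignment.
    median_age = df['Age'].median()
    df['Age'] = df['Age'].fillna(median_age)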
    print("\nMissing 'Age' values filled with the median.")

    # Drop the 'Cabin' column
    df.drop('Cabin', axis=1, inplace=True)
    print("\n'Cabin' column dropped.")

    print("\nNumber of missing values per column after handling:")
    print(df.isnull().sum())

    return df

if __name__ == "__main__":
    # Create a sample DataFrame with missing values for demonstration
    data = {'PassengerId': [1, 2, 3, 4, 5, 6],
            'Survived': [0, 1, 1, 0, 1, 0],
            'Pclass': [3, 1, 3, 1, 3, 3],
            'Name': ['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', 'Heikkinen, Miss. Laina', 'Futrelle, Mrs. Jacques Heath (Lily May Peel)', 'Allen, Mr. William Henry', 'Moran, Mr. James'],
            'Sex': ['male', 'female', 'female', 'female', 'male', 'male'],
            'Age': [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
            'SibSp': [1, 1, 0, 1, 0, 0],
            'Parch': [0, 0, 0, 0, 0, 0],
            'Ticket': ['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450', '330877'],
            'Fare': [7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583],
            'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan, np.nan],
            'Embarked': ['S', 'C', 'S', 'S', 'S', 'Q']}
    df = pd.DataFrame(data)

    handled_df = handle_missing_data(df.copy()) # Use .copy() to avoid modifying the original DataFrame
    print("\nModified DataFrame (first 5 rows):")
    print(handled_df.head())

Number of missing values per column before handling:
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            1
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          4
Embarked       0
dtype: int64

Missing 'Age' values filled with the median.

'Cabin' column dropped.

Number of missing values per column after handling:
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

Modified DataFrame (first 5 rows):
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         0       1   
4            5         1       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   

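Note: the chained call `df['Age'].fillna(median_age, inplace=True)` triggers the pandas warning below. Assigning the result back, `df['Age'] = df['Age'].fillna(median_age)`, avoids it.

FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.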
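
The two alternatives named in the pandas warning above can be sketched as follows. This is a small illustration on the six sample `Age` values from the demonstration DataFrame, not the full dataset; the names `df_assigned` and `df_mapped` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan]})
median_age = df["Age"].median()  # median() skips NaN by default

# Option 1: assign the filled column back to the DataFrame.
df_assigned = df.copy()
df_assigned["Age"] = df_assigned["Age"].fillna(median_age)

# Option 2: call fillna on the DataFrame itself with a {column: value} mapping.
df_mapped = df.copy()
df_mapped.fillna({"Age": median_age}, inplace=True)

print(df_assigned["Age"].tolist())
print(df_mapped["Age"].tolist())
```

Both options operate on the DataFrame rather than on a temporary column selection, which is why they do not trigger the chained-assignment warning.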


## Section 3: Encoding Categorical Features

### **Task 3**: Convert Categorical to Numeric

*Instruction*: Convert `Sex` and `Embarked` columns to numeric using:


*   Label Encoding for `Sex`
*   One-Hot Encoding for `Embarked`



In [11]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

def convert_categorical_to_numeric(df):
    """
    Converts the 'Sex' column to numeric using Label Encoding
    and the 'Embarked' column to numeric using One-Hot Encoding.

    Args:
        df (pandas.DataFrame): The input DataFrame containing 'Sex' and 'Embarked' columns.

    Returns:
        pandas.DataFrame: A new DataFrame with the 'Sex' and 'Embarked' columns
                          converted to numeric. The original columns are dropped.
    """
    df_converted = df.copy()

    # Label Encoding for 'Sex'
    label_encoder = LabelEncoder()
    df_converted['Sex_Encoded'] = label_encoder.fit_transform(df_converted['Sex'])
    df_converted.drop('Sex', axis=1, inplace=True)
    print("Label Encoding applied to 'Sex' column.")

    # One-Hot Encoding for 'Embarked'
    onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    encoded_embarked = onehot_encoder.fit_transform(df_converted[['Embarked']])
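    # Reuse the input index so pd.concat aligns rows even when the index is not 0..n-1
    encoded_df = pd.DataFrame(
        encoded_embarked,
        columns=onehot_encoder.get_feature_names_out(['Embarked']),
        index=df_converted.index,
    )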
    df_converted = pd.concat([df_converted.drop('Embarked', axis=1), encoded_df], axis=1)
    print("One-Hot Encoding applied to 'Embarked' column.")

    return df_converted

if __name__ == "__main__":
    # Create a sample DataFrame with Sex and Embarked columns
    data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
            'Sex': ['female', 'male', 'female', 'male', 'female'],
            'Ticket': ['A1', 'B2', 'C3', 'D4', 'E5'],
            'Embarked': ['S', 'C', 'S', 'Q', 'S']}
    df = pd.DataFrame(data)

    print("Original DataFrame:")
    print(df)

    numeric_df = convert_categorical_to_numeric(df)

    print("\nDataFrame with Categorical Features Converted to Numeric:")
    print(numeric_df)


Original DataFrame:
      Name     Sex Ticket Embarked
0    Alice  female     A1        S
1      Bob    male     B2        C
2  Charlie  female     C3        S
3    David    male     D4        Q
4      Eve  female     E5        S
Label Encoding applied to 'Sex' column.
One-Hot Encoding applied to 'Embarked' column.

DataFrame with Categorical Features Converted to Numeric:
      Name Ticket  Sex_Encoded  Embarked_C  Embarked_Q  Embarked_S
0    Alice     A1            0         0.0         0.0         1.0
1      Bob     B2            1         1.0         0.0         0.0
2  Charlie     C3            0         0.0         0.0         1.0
3    David     D4            1         0.0         1.0         0.0
4      Eve     E5            0         0.0         0.0         1.0
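
For comparison, the same two encodings can be sketched with pandas alone. This is an alternative to the scikit-learn encoders used above, not a replacement for them: the explicit `sex_codes` mapping is an assumption chosen to reproduce the codes in the output above (`female` → 0, `male` → 1), and `pd.get_dummies` stands in for `OneHotEncoder`.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Sex": ["female", "male", "female", "male", "female"],
    "Embarked": ["S", "C", "S", "Q", "S"],
})

# Label encoding by hand: an explicit mapping makes the meaning of each code visible.
sex_codes = {"female": 0, "male": 1}
encoded = df.assign(Sex_Encoded=df["Sex"].map(sex_codes)).drop(columns="Sex")

# One-hot encoding: one indicator column per category; dtype=int yields 0/1 values.
encoded = pd.get_dummies(encoded, columns=["Embarked"], dtype=int)

print(encoded)
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so the scikit-learn encoders remain the safer choice when the same transformation must be applied to new data later.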


## Section 4: Feature Scaling

### **Task 4**: Scale Numerical Features

*Instruction*: Use `StandardScaler` to scale the `Age` and `Fare` columns.

In [14]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale_numerical_features(df):
    """
    Scales the 'Age' and 'Fare' columns of a DataFrame using StandardScaler.

    Args:
        df (pandas.DataFrame): The input DataFrame containing 'Age' and 'Fare' columns.

    Returns:
        pandas.DataFrame: A copy of the input in which 'Age' and 'Fare' are
                          replaced by their standardized values; other columns
                          are unchanged. Returns None if a column is missing.
    """
    scaler = StandardScaler()

    # Make a copy to avoid modifying the original DataFrame directly
    df_scaled = df.copy()

    # Select the columns to scale
    numerical_cols = ['Age', 'Fare']

    # Check if the columns exist in the DataFrame
    for col in numerical_cols:
        if col not in df_scaled.columns:
            print(f"Warning: Column '{col}' not found in the DataFrame.")
            return None

    # Scale the numerical columns
    df_scaled[numerical_cols] = scaler.fit_transform(df_scaled[numerical_cols])

    return df_scaled

if __name__ == "__main__":
    # Create a sample DataFrame with Age and Fare columns
    data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
            'Age': [25, 30, 22, 35, 28],
            'Ticket': ['A1', 'B2', 'C3', 'D4', 'E5'],
            'Fare': [10.50, 50.75, 7.90, 100.20, 25.00]}
    df = pd.DataFrame(data)

    print("Original DataFrame:")
    print(df)

    scaled_df = scale_numerical_features(df)

    if scaled_df is not None:
        print("\nDataFrame with Scaled 'Age' and 'Fare':")
        print(scaled_df)

Original DataFrame:
      Name  Age Ticket    Fare
0    Alice   25     A1   10.50
1      Bob   30     B2   50.75
2  Charlie   22     C3    7.90
3    David   35     D4  100.20
4      Eve   28     E5   25.00

DataFrame with Scaled 'Age' and 'Fare':
      Name       Age Ticket      Fare
0    Alice -0.677631     A1 -0.828776
1      Bob  0.451754     B2  0.347052
2  Charlie -1.355262     C3 -0.904730
3    David  1.581139     D4  1.791640
4      Eve  0.000000     E5 -0.405186
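
To make the scaled values above less opaque, here is a short sketch that reproduces them by hand with NumPy. Standardization computes z = (x - mean) / std for each column, and `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default. The helper name `standardize` is an assumption made for this example.

```python
import numpy as np

ages = np.array([25, 30, 22, 35, 28], dtype=float)
fares = np.array([10.50, 50.75, 7.90, 100.20, 25.00])

def standardize(values):
    """Return z-scores using the population standard deviation (ddof=0)."""
    return (values - values.mean()) / values.std()

age_z = standardize(ages)
fare_z = standardize(fares)

print(np.round(age_z, 6))
print(np.round(fare_z, 6))
```

Eve's scaled `Age` is exactly 0 because her age, 28, equals the column mean.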


## Section 5: Preprocessing Pipeline

### **Task 5**: Build Preprocessing Pipeline

*Instruction*: Using `ColumnTransformer` and `Pipeline` from `sklearn`, build a pipeline that:



*   Imputes missing values
*   Scales numeric data
*   Encodes categorical data



In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Sample dataset with some missing values
data = {
    'Age': [22, 38, 26, 35, None],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'Q', 'S', None],
}

df = pd.DataFrame(data)

# Target variable (dummy)
target = [0, 1, 1, 0, 0]  # Binary target for classification example

# Split data into features (X) and target (y)
X = df
y = target

# Define column types
numeric_features = ['Age', 'Fare']
categorical_features = ['Sex', 'Embarked']

# Create preprocessing steps for numeric features (impute, then scale)
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with mean
    ('scaler', StandardScaler())  # Scale numeric features
])

# Create preprocessing steps for categorical features (impute, then encode)
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values with most frequent value
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # Encode categorical features
])

# Combine both transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create a final pipeline with a classifier (RandomForest in this case)
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())  # You can replace this with any model
])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Predict on the test data
y_pred = pipeline.predict(X_test)

# Output predictions
print("Predictions:", y_pred)

# Optionally, you can check the transformed data
transformed_data = pipeline.named_steps['preprocessor'].transform(X_test)
print("Transformed Data (numeric and encoded categorical features):")
print(transformed_data)

Predictions: [0]
Transformed Data (numeric and encoded categorical features):
[[2.19477621 2.65752716 1.         0.         0.         0.
  0.        ]]


## Section 6: Feature Engineering

### **Task 6**: Create a New Feature

*Instruction*: Create a new feature `FamilySize` = `SibSp` + `Parch` + 1.

In [5]:
import pandas as pd

# Sample DataFrame (assuming it's similar to Titanic dataset with 'SibSp' and 'Parch' columns)
data = {
    'SibSp': [1, 0, 3, 1, 0],
    'Parch': [0, 0, 1, 2, 0]
}

df = pd.DataFrame(data)

# Creating the new feature 'FamilySize'
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# Output the DataFrame
print(df)


   SibSp  Parch  FamilySize
0      1      0           2
1      0      0           1
2      3      1           5
3      1      2           4
4      0      0           1
