<div style="text-align:center; font-size: 48px;">Kaggle Titanic Competition</div>
<div style="text-align:center; font-size: 36px;">Feature Engineering</div>

In [1]:
import pandas as pd

# 1. Reload raw data

In [2]:
train_df = pd.read_csv('../data/raw/train.csv')
test_df = pd.read_csv('../data/raw/test.csv')

# 2. Handle missing values

In [3]:
MEDIAN_AGE = train_df['Age'].median()  # Default median age for filling missing values in the 'Age' column
EMBARKED_MODE = train_df['Embarked'].mode()[0]  # Default mode for filling missing values in the 'Embarked' column


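def fill_missing_values(df: pd.DataFrame, median_age=MEDIAN_AGE, embarked_mode=EMBARKED_MODE) -> pd.DataFrame:
    """Fill or drop columns with missing values. Modifies `df` in place and returns it."""

    # Age: fill with the training-set median.
    # Assign the result back instead of calling fillna(inplace=True) on a column selection,
    # which raises a FutureWarning and stops working in pandas 3.0.
    df['Age'] = df['Age'].fillna(median_age)

    # Cabin: drop the column
    df.drop(columns=['Cabin'], inplace=True)

    # Fare: fill with the median fare of the passenger's class (computed on `df` itself)
    df['Fare'] = df['Fare'].fillna(
        df.groupby('Pclass')['Fare'].transform('median')
    )

    # Embarked: fill with the most frequent training-set port
    df['Embarked'] = df['Embarked'].fillna(embarked_mode)

    return df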
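
The fill values above are learned from the training set and then reused for any other frame. A minimal sketch of that idea on toy data follows; the frames and the `toy_` names are hypothetical and only illustrate the pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for the train and test sets.
toy_train = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan, 35.0],
    'Embarked': ['S', 'C', 'S', None],
})
toy_test = pd.DataFrame({
    'Age': [np.nan, 47.0],
    'Embarked': [None, 'Q'],
})

# Statistics come from the training frame only.
toy_median_age = toy_train['Age'].median()
toy_embarked_mode = toy_train['Embarked'].mode()[0]

# Assign the filled column back rather than filling a column selection in place.
toy_test['Age'] = toy_test['Age'].fillna(toy_median_age)
toy_test['Embarked'] = toy_test['Embarked'].fillna(toy_embarked_mode)

print(toy_test)
```

Reusing training statistics keeps the test set from influencing the values it is filled with.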

# 3. Preprocessing

In [4]:
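# Compute per-class IQR bounds for Fare on the training set (the capping itself is applied in preprocess_data, using Upper_Bound only)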
fare_bounds = train_df.groupby('Pclass')['Fare'].agg(
    Q1=lambda x: x.quantile(0.25),
    Q3=lambda x: x.quantile(0.75)
).round(2)

fare_bounds['IQR'] = fare_bounds['Q3'] - fare_bounds['Q1']
fare_bounds['Upper_Bound'] = fare_bounds['Q3'] + 1.5 * fare_bounds['IQR']
fare_bounds['Lower_Bound'] = (fare_bounds['Q1'] - 1.5 * fare_bounds['IQR']).clip(lower=0)

fare_bounds.drop(columns=['Q1', 'Q3', 'IQR'], inplace=True)
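
To make the bound arithmetic concrete, here is a small worked sketch on made-up fares. The data and the `FareCapped` column are hypothetical, and capping with `Series.clip` is shown as one possible vectorized alternative to a per-class loop, not as the notebook's method.

```python
import pandas as pd

# Hypothetical fares for two classes; the values are illustrative only.
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 1, 1, 3, 3, 3, 3, 3],
    'Fare': [30.0, 50.0, 60.0, 80.0, 500.0, 7.0, 8.0, 8.0, 9.0, 40.0],
})

# Quartiles per class, then the usual Q3 + 1.5 * IQR upper fence.
q1 = toy.groupby('Pclass')['Fare'].quantile(0.25)
q3 = toy.groupby('Pclass')['Fare'].quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)

# Cap each fare at the upper bound of its own class.
toy['FareCapped'] = toy['Fare'].clip(upper=toy['Pclass'].map(upper))

print(upper)
print(toy)
```

In this toy data the first-class fence is 80 + 1.5 × 30 = 125 and the third-class fence is 9 + 1.5 × 1 = 10.5, so only the two extreme fares are changed.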


In [5]:
# mapping titles to categories
title_mapping = {
    'Mr': 'Mr',
    'Mrs': 'Mrs',
    'Miss': 'Miss',
    'Ms': 'Miss',         # converting 'Ms' to 'Miss' for consistency
    'Mlle': 'Miss',       # old French title for Miss
    'Master': 'Master',
    'Dr': 'Rare',
    'Rev': 'Rare',
    'Col': 'Rare',
    'Major': 'Rare',
    'Capt': 'Rare',
    'Sir': 'Mr',
    'Lady': 'Miss',
    'Don': 'Rare',
    'the Countess': 'Rare',
    'Jonkheer': 'Rare', 
    'Mme': 'Mrs',         # French title for Mrs
    'Dona': 'Mrs'         # Portuguese title for Mrs
}

In [6]:
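def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    """Encode categorical columns, cap fares, and extract titles. Modifies `df` in place and returns it."""

    # Sex: binary encoding
    df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

    # Fare: cap values above the per-class upper bound computed from the training set
    for pclass, upper in fare_bounds['Upper_Bound'].items():
        df.loc[(df['Pclass'] == pclass) & (df['Fare'] > upper), 'Fare'] = upper

    # Embarked: integer codes for the three ports
    df['Embarked'] = df['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})

    # Name: the title sits between the comma and the first period
    df['Title'] = df['Name'].str.split(',').str[1].str.split('.').str[0].str.strip()
    df['Title'] = df['Title'].map(title_mapping)  # titles missing from title_mapping become NaN

    return df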
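
The title extraction relies on names having the form `Surname, Title. Given names`. A short sketch of that split chain on three sample names is below; `demo_mapping` and the `fillna('Rare')` fallback are assumptions added for illustration and are not part of the notebook.

```python
import pandas as pd

# Sample names in the dataset's "Surname, Title. Given names" format.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Take the text after the comma, then everything before the first period.
titles = names.str.split(',').str[1].str.split('.').str[0].str.strip()
print(titles.tolist())

# Hypothetical, deliberately incomplete mapping: unmapped titles become NaN,
# so a fallback category is one way to guard against unseen titles.
demo_mapping = {'Mr': 'Mr', 'Mrs': 'Mrs'}
mapped = titles.map(demo_mapping).fillna('Rare')
print(mapped.tolist())
```

Multi-word titles such as `the Countess` survive the split intact, which is why the mapping needs that exact key.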

# 4. Feature engineering

In [7]:
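def feature_engineering(df: pd.DataFrame) -> pd.DataFrame:
    """Derive model features and drop the raw columns they replace."""

    # Age: right-closed bins, 0=Child (0, 12], 1=Teen (12, 18], 2=Young (18, 35], 3=Adult (35, 60], 4=Senior (60, 100]
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 12, 18, 35, 60, 100], labels=[0, 1, 2, 3, 4])

    # SibSp & Parch: 1 if the passenger travels with any family member, else 0
    df['isfamilyonboard'] = ((df['SibSp'] > 0) | (df['Parch'] > 0)).astype(int)

    # Ticket: number of passengers in this frame sharing the same ticket
    df['TicketGroupSize'] = df.groupby('Ticket')['Ticket'].transform('count')

    # Title: one-hot encode, dropping the first category as the baseline
    # (the original 'Title' text column is kept alongside its dummies)
    dummies = pd.get_dummies(df['Title'], prefix='Title', drop_first=True).astype(int)
    df = df.join(dummies)

    # Drop raw columns that are now represented by engineered features
    df.drop(columns=['Name', 'SibSp', 'Parch', 'Ticket'], inplace=True)

    return df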
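
Because `pd.cut` uses right-closed intervals by default, a boundary age falls into the lower group. The sketch below checks that behavior on a few hand-picked ages; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

bins = [0, 12, 18, 35, 60, 100]
labels = [0, 1, 2, 3, 4]

# Boundary values land in the lower group: 12 -> 0, 18 -> 1, and so on.
ages = pd.Series([0.42, 12.0, 12.5, 18.0, 35.0, 60.0, 80.0])
groups = pd.cut(ages, bins=bins, labels=labels)
print(groups.tolist())

# An age of exactly 0 (or above 100) falls outside every bin and becomes NaN.
out_of_range = pd.cut(pd.Series([0.0, 101.0]), bins=bins, labels=labels)
print(out_of_range.isna().tolist())
```

If ages of exactly 0 were possible, passing `include_lowest=True` to `pd.cut` would keep them in the first group.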

In [8]:
train_df = fill_missing_values(train_df)
train_df = preprocess_data(train_df)
train_df = feature_engineering(train_df)
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   PassengerId      891 non-null    int64   
 1   Survived         891 non-null    int64   
 2   Pclass           891 non-null    int64   
 3   Sex              891 non-null    int64   
 4   Age              891 non-null    float64 
 5   Fare             891 non-null    float64 
 6   Embarked         891 non-null    int64   
 7   Title            891 non-null    object  
 8   AgeGroup         891 non-null    category
 9   isfamilyonboard  891 non-null    int64   
 10  TicketGroupSize  891 non-null    int64   
 11  Title_Miss       891 non-null    int64   
 12  Title_Mr         891 non-null    int64   
 13  Title_Mrs        891 non-null    int64   
 14  Title_Rare       891 non-null    int64   
dtypes: category(1), float64(2), int64(11), object(1)
memory usage: 98.7+ KB




In [9]:
test_df = fill_missing_values(test_df)
test_df = preprocess_data(test_df)
test_df = feature_engineering(test_df)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   PassengerId      418 non-null    int64   
 1   Pclass           418 non-null    int64   
 2   Sex              418 non-null    int64   
 3   Age              418 non-null    float64 
 4   Fare             418 non-null    float64 
 5   Embarked         418 non-null    int64   
 6   Title            418 non-null    object  
 7   AgeGroup         418 non-null    category
 8   isfamilyonboard  418 non-null    int64   
 9   TicketGroupSize  418 non-null    int64   
 10  Title_Miss       418 non-null    int64   
 11  Title_Mr         418 non-null    int64   
 12  Title_Mrs        418 non-null    int64   
 13  Title_Rare       418 non-null    int64   
dtypes: category(1), float64(2), int64(10), object(1)
memory usage: 43.2+ KB
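
Both frames end up with the same four `Title_*` columns because every title group occurs in both files. That is not guaranteed in general, since `get_dummies` is applied to each frame separately. The sketch below shows one possible guard, reindexing the test dummies to the training columns; the toy frames and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical frames: the test set is missing the 'Master' and 'Rare' groups.
toy_train = pd.DataFrame({'Title': ['Mr', 'Mrs', 'Miss', 'Master', 'Rare']})
toy_test = pd.DataFrame({'Title': ['Mr', 'Mrs', 'Miss']})

train_dummies = pd.get_dummies(toy_train['Title'], prefix='Title', drop_first=True).astype(int)
test_dummies = pd.get_dummies(toy_test['Title'], prefix='Title', drop_first=True).astype(int)

# Encoded separately, the two frames disagree on columns and on the dropped baseline.
print(list(train_dummies.columns))
print(list(test_dummies.columns))

# Guard: encode the test titles without drop_first, then reindex to the training columns.
test_full = pd.get_dummies(toy_test['Title'], prefix='Title').astype(int)
test_aligned = test_full.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```

After reindexing, categories absent from the test frame appear as all-zero columns, so both frames present the same feature layout to a model.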




# 5. Save processed data to `../data/processed/`

In [10]:
train_df.to_csv('../data/processed/train_processed.csv', index=False)
test_df.to_csv('../data/processed/test_processed.csv', index=False)