<div style="text-align:center; font-size: 48px;">Titanic Competition Kaggle</div>
<div style="text-align:center; font-size: 36px;">Feature Engineering</div>

In [1]:
import pandas as pd


# 1. Reload raw data

In [2]:
train_df = pd.read_csv('../data/raw/train.csv')
test_df = pd.read_csv('../data/raw/test.csv')

In [3]:
print(test_df.shape)
test_df.info()

(418, 11)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.1+ KB


# 2. Handle missing values

In [4]:
MEDIAN_AGE = train_df['Age'].median()  # Training-set median, used to fill missing 'Age' values
EMBARKED_MODE = train_df['Embarked'].mode()[0]  # Training-set mode, used to fill missing 'Embarked' values


def fill_missing_values(df: pd.DataFrame, median_age=MEDIAN_AGE, embarked_mode=EMBARKED_MODE) -> pd.DataFrame:
    """Fill or drop missing values. Modifies `df` in place and returns it."""

    # Age: fill with the training-set median.
    # Assign the result back; the chained inplace form triggers a pandas FutureWarning.
    df['Age'] = df['Age'].fillna(median_age)

    # Cabin: mostly missing, so drop the column
    df.drop(columns=['Cabin'], inplace=True)

    # Fare: fill with the median fare of the passenger's class (computed on df itself)
    df['Fare'] = df['Fare'].fillna(
        df.groupby('Pclass')['Fare'].transform('median')
    )

    # Embarked: fill with the training-set mode
    df['Embarked'] = df['Embarked'].fillna(embarked_mode)

    return df
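
The imputation step above learns its statistics on the training data and reuses them on the test data. A minimal sketch of that pattern on made-up toy frames (the names `toy_train`, `toy_test`, and `filled_test` are illustrative only, not part of the notebook):

```python
import pandas as pd

# Toy frames standing in for train_df / test_df (values are made up).
toy_train = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Embarked": ["S", "C", "S", None],
})
toy_test = pd.DataFrame({
    "Age": [None, 47.0],
    "Embarked": ["Q", None],
})

# Statistics are learned from the training frame only.
age_median = toy_train["Age"].median()
embarked_mode = toy_train["Embarked"].mode()[0]

# Assigning the filled columns (rather than a chained inplace fillna)
# avoids the pandas FutureWarning and leaves toy_test untouched.
filled_test = toy_test.assign(
    Age=toy_test["Age"].fillna(age_median),
    Embarked=toy_test["Embarked"].fillna(embarked_mode),
)
print(filled_test)
```

Using `assign` returns a new frame, which is one way to avoid mutating the caller's DataFrame; the notebook's function modifies its argument instead.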

# 3. Preprocessing

In [5]:
# Per-class IQR bounds for Fare, computed on the training data (used to cap outliers)
fare_bounds = train_df.groupby('Pclass')['Fare'].agg(
    Q1=lambda x: x.quantile(0.25),
    Q3=lambda x: x.quantile(0.75)
).round(2)

fare_bounds['IQR'] = fare_bounds['Q3'] - fare_bounds['Q1']
fare_bounds['Upper_Bound'] = fare_bounds['Q3'] + 1.5 * fare_bounds['IQR']
fare_bounds['Lower_Bound'] = (fare_bounds['Q1'] - 1.5 * fare_bounds['IQR']).clip(lower=0)

fare_bounds.drop(columns=['Q1', 'Q3', 'IQR'], inplace=True)
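
To make the IQR rule concrete, here is a small worked sketch for a single group of fares. The values are invented for illustration; only the formula matches the cell above.

```python
import pandas as pd

# Made-up fares for one passenger class, with one obvious outlier.
fares = pd.Series([10.0, 20.0, 30.0, 40.0, 1000.0])

q1 = fares.quantile(0.25)          # first quartile
q3 = fares.quantile(0.75)          # third quartile
iqr = q3 - q1                      # interquartile range

upper = q3 + 1.5 * iqr             # values above this count as outliers
lower = max(q1 - 1.5 * iqr, 0.0)   # a fare cannot be negative, so floor at 0

print(q1, q3, iqr, upper, lower)
```

With these values the quartiles are 20 and 40, so the upper bound is 70 and the raw lower bound of -10 is floored at 0.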


In [6]:
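def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    """Encode categorical columns and cap Fare outliers. Modifies `df` in place and returns it."""

    # Sex: encode as 0 (male) / 1 (female)
    df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

    # Fare: cap values above the per-class upper bound learned from the training data
    for pclass in fare_bounds.index:
        upper_bound = fare_bounds.loc[pclass, 'Upper_Bound']
        df.loc[
            (df['Pclass'] == pclass) & (df['Fare'] > upper_bound),
            'Fare'
        ] = upper_bound

    # Embarked: encode the port as an integer
    df['Embarked'] = df['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})

    return df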
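
The per-class capping loop can also be written without an explicit loop. The following is a sketch of one possible vectorized alternative, not the notebook's method; the frame `toy` and the bounds in `toy_upper` are hypothetical values chosen for illustration.

```python
import pandas as pd

# Toy passengers (made-up values).
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3],
    "Fare": [50.0, 500.0, 8.0, 70.0],
})

# Hypothetical per-class upper bounds, indexed by Pclass.
toy_upper = pd.Series({1: 160.0, 2: 45.0, 3: 27.0}, name="Upper_Bound")

# Look up each row's bound from its class, then cap the fare at that bound.
row_bound = toy["Pclass"].map(toy_upper)
toy["FareCapped"] = toy["Fare"].clip(upper=row_bound)

print(toy)
```

Fares already below their class bound are left unchanged; only the 500.0 and 70.0 entries are pulled down to 160.0 and 27.0.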

# 4. Create new features

In [7]:
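def freature_engineering(df: pd.DataFrame) -> pd.DataFrame:
    """Add derived features. Modifies `df` in place and returns it."""

    # AgeGroup: 0=Child (0-12], 1=Teen (12-18], 2=Young (18-35], 3=Adult (35-60], 4=Senior (60-100]
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 12, 18, 35, 60, 100], labels=[0, 1, 2, 3, 4])

    # isfamilyonboard: True if the passenger has any sibling, spouse, parent, or child aboard
    df['isfamilyonboard'] = (df['SibSp'] > 0) | (df['Parch'] > 0)

    # TicketGroupSize: number of passengers in df sharing the same ticket
    df['TicketGroupSize'] = df.groupby('Ticket')['Ticket'].transform('count')

    return df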
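
The `pd.cut` call uses right-inclusive bins by default, so a boundary age such as 12 falls into the lower group and an age of exactly 0 would fall outside every bin. A short sketch with made-up ages shows this behavior:

```python
import pandas as pd

bins = [0, 12, 18, 35, 60, 100]
labels = [0, 1, 2, 3, 4]

# Made-up ages, including values that sit exactly on bin edges.
ages = pd.Series([0.42, 12.0, 12.5, 18.0, 35.0, 60.0, 80.0])
groups = pd.cut(ages, bins=bins, labels=labels)
print(groups.tolist())

# The left edge of the first bin is exclusive, so age 0 gets no group (NaN).
zero_age_group = pd.cut(pd.Series([0.0]), bins=bins, labels=labels)
print(zero_age_group.isna().tolist())
```

If ages of exactly 0 could occur, passing `include_lowest=True` to `pd.cut` is one way to include the left edge of the first bin.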

In [8]:
train_df = fill_missing_values(train_df)
train_df = preprocess_data(train_df)
train_df = freature_engineering(train_df)
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   PassengerId      891 non-null    int64   
 1   Survived         891 non-null    int64   
 2   Pclass           891 non-null    int64   
 3   Name             891 non-null    object  
 4   Sex              891 non-null    int64   
 5   Age              891 non-null    float64 
 6   SibSp            891 non-null    int64   
 7   Parch            891 non-null    int64   
 8   Ticket           891 non-null    object  
 9   Fare             891 non-null    float64 
 10  Embarked         891 non-null    int64   
 11  AgeGroup         891 non-null    category
 12  isfamilyonboard  891 non-null    bool    
 13  TicketGroupSize  891 non-null    int64   
dtypes: bool(1), category(1), float64(2), int64(8), object(2)
memory usage: 85.6+ KB



In [9]:
test_df = fill_missing_values(test_df)
test_df = preprocess_data(test_df)
test_df = freature_engineering(test_df)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   PassengerId      418 non-null    int64   
 1   Pclass           418 non-null    int64   
 2   Name             418 non-null    object  
 3   Sex              418 non-null    int64   
 4   Age              418 non-null    float64 
 5   SibSp            418 non-null    int64   
 6   Parch            418 non-null    int64   
 7   Ticket           418 non-null    object  
 8   Fare             418 non-null    float64 
 9   Embarked         418 non-null    int64   
 10  AgeGroup         418 non-null    category
 11  isfamilyonboard  418 non-null    bool    
 12  TicketGroupSize  418 non-null    int64   
dtypes: bool(1), category(1), float64(2), int64(7), object(2)
memory usage: 37.1+ KB



# 5. Save processed data to `../data/processed/`

In [10]:
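# index=False keeps the RangeIndex from being written as an extra unnamed column
train_df.to_csv('../data/processed/train_processed.csv', index=False)
test_df.to_csv('../data/processed/test_processed.csv', index=False)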
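
By default `to_csv` writes the row index as a first column, which comes back as `Unnamed: 0` when the file is reloaded. The sketch below demonstrates this on a made-up frame, using in-memory buffers instead of the notebook's file paths:

```python
import io

import pandas as pd

# Toy frame (made-up values) with a default RangeIndex.
toy = pd.DataFrame({"PassengerId": [1, 2], "Fare": [7.25, 71.28]})

# Default behavior: the index is written as an unnamed first column.
buf_with_index = io.StringIO()
toy.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with_index = pd.read_csv(buf_with_index).columns.tolist()

# With index=False, only the real columns are written.
buf_without_index = io.StringIO()
toy.to_csv(buf_without_index, index=False)
buf_without_index.seek(0)
cols_without_index = pd.read_csv(buf_without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```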