## Feature Engineering 2
In Part 1 of Feature Engineering, I preprocessed the data without first splitting it into train and test sets. As I noted there, this can cause what is referred to as **data leakage**: information from outside the training dataset is accidentally used to create the model.

So, in Feature Engineering 2, I will preprocess the data again. This time I will split it into train and test sets first, and then repeat:
1. Filling the missing data
2. Encoding the categorical column
3. Standardizing the data
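
To make the leakage idea concrete, here is a minimal sketch with made-up numbers (the arrays below are hypothetical and not taken from the Titanic data). It compares standardization statistics computed on train and test together with statistics computed on train alone:

```python
import numpy as np

# Hypothetical toy feature: six training values and two test values.
train = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
test = np.array([50.0, 60.0])

# Leaky: the mean and std are computed on train + test together,
# so the test values influence how the training data is scaled.
full = np.concatenate([train, test])
leaky_mean, leaky_std = full.mean(), full.std()
train_scaled_leaky = (train - leaky_mean) / leaky_std

# Leak-free: the statistics come from the training values only,
# and the same statistics are then reused for the test values.
clean_mean, clean_std = train.mean(), train.std()
train_scaled_clean = (train - clean_mean) / clean_std
test_scaled_clean = (test - clean_mean) / clean_std

print("leaky mean:", leaky_mean, "| train-only mean:", clean_mean)
```

The same rule applies to every step in this notebook: anything that is learned from the data (a mode, a median, a category list, a mean and standard deviation) is learned from train and only applied to test.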

### Import libraries and Load Dataset

**Import Libraries**

In [1]:
# Import the libraries; this time I am introducing scikit-learn (sklearn)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split  # for splitting the data
from sklearn.preprocessing import OneHotEncoder  # for one-hot encoding (dummy variables)
from sklearn.preprocessing import StandardScaler  # for standardizing

ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
scaler = StandardScaler()

**Load the Titanic Dataset**

In [2]:
#Load the data
dataset = pd.read_csv(r"C:\Users\KOLADE\OneDrive\Documents\Practices\Titanic\data\Titanic-Dataset.csv")
df = dataset.copy()
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


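Before splitting, I will add the `Family_size` column: the number of passengers who share the same ticket. Passengers whose ticket is not shared get a missing value for now, which is handled later.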

In [3]:
duplicated_ticket = df[df['Ticket'].duplicated(keep=False)]
grouped = duplicated_ticket.groupby('Ticket')
data = {'Ticket': [], 'Family_size': []}
for ticket, group in grouped:
    data['Ticket'].append(ticket)
    data['Family_size'].append(len(group))

family = pd.DataFrame(data)
df = df.merge(family, on='Ticket', how='left')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Family_size
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,2.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,
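
As a side note, the same shared-ticket count can be sketched more compactly with `groupby(...).transform("size")`. This is only an alternative illustration on a hypothetical toy frame (`toy` and its ticket values are made up), not the code used in this notebook:

```python
import pandas as pd

# Hypothetical toy tickets: B2 is shared by two people, D4 by three.
toy = pd.DataFrame({"Ticket": ["A1", "B2", "B2", "C3", "D4", "D4", "D4"]})

# Count how many rows share each ticket, aligned back to every row.
counts = toy.groupby("Ticket")["Ticket"].transform("size")

# Keep the count only for shared tickets; unique tickets become NaN,
# which mirrors the result of the left merge above.
toy["Family_size"] = counts.where(counts > 1)

print(toy)
```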


### **Splitting the Data**

In [4]:
X = df.drop(columns=['PassengerId', 'Survived', 'Ticket'])
y = df['Survived']

display(X.head(2))
y

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Family_size
0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,,S,
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C85,C,


0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print(f"X_Train contains {X_train.shape}, X_Test contains {X_test.shape}, y_train contains {len(y_train)}, y_test contains {len(y_test)}")

X_Train contains (596, 10), X_Test contains (295, 10), y_train contains 596, y_test contains 295
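
The split above is purely random. A possible variation, which this notebook does not use, is to pass `stratify=y` so that both sets keep roughly the same share of survivors. The sketch below uses hypothetical toy arrays (`X_toy`, `y_toy`) with 891 rows to show that the split sizes stay the same:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for X and y: 891 rows, about 38% positive labels.
X_toy = np.arange(891).reshape(-1, 1)
y_toy = np.array([0] * 549 + [1] * 342)

# Same test_size and random_state as above, plus stratification on the label.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Whether stratification helps here is a judgment call; with a binary target and several hundred rows, a random split is usually close to balanced anyway.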


### **Handling Missing Data**

In [6]:
display(X_train.isnull().sum())
X_test.isnull().sum()

Pclass           0
Name             0
Sex              0
Age            118
SibSp            0
Parch            0
Fare             0
Cabin          462
Embarked         1
Family_size    358
dtype: int64

Pclass           0
Name             0
Sex              0
Age             59
SibSp            0
Parch            0
Fare             0
Cabin          225
Embarked         1
Family_size    189
dtype: int64

In [7]:
def process_missing(data1, data2):
    """
    Handle missing values for train (data1) and test (data2).
    Ensures imputations are based on TRAIN only to avoid data leakage.
    """
    # -----------------
    # Cabin missing
    # -----------------
    data1['Cabin_status'] = np.where(data1['Cabin'].isnull(), 'unknown', 'known')
    data1.drop(labels=['Cabin'], axis=1, inplace=True)

    data2['Cabin_status'] = np.where(data2['Cabin'].isnull(), 'unknown', 'known')
    data2.drop(labels=['Cabin'], axis=1, inplace=True)

    # -----------------
    # Family_size missing
    # Add the Alone column (whether the passenger is alone on the boat or not)
    # -----------------
    data1['Alone'] = np.where(data1['Family_size'].isnull(), 'Yes', 'No')
    data1['Family_size'] = data1['Family_size'].fillna(1).astype('int64')

    data2['Alone'] = np.where(data2['Family_size'].isnull(), 'Yes', 'No')
    data2['Family_size'] = data2['Family_size'].fillna(1).astype('int64')

    # -----------------
    # Embarked missing
    # -----------------
    data1['Embarked_nan'] = np.where(data1['Embarked'].isnull(), 1, 0)
    embarked_mode = data1['Embarked'].mode()[0]   # mode from TRAIN
    data1['Embarked'] = data1['Embarked'].fillna(embarked_mode)

    data2['Embarked_nan'] = np.where(data2['Embarked'].isnull(), 1, 0)
    data2['Embarked'] = data2['Embarked'].fillna(embarked_mode)  # use train mode

    # -----------------
    # Age missing (group-based imputation)
    # -----------------
    # Add missing indicators
    data1['Age_nan'] = np.where(data1['Age'].isnull(), 1, 0)
    data2['Age_nan'] = np.where(data2['Age'].isnull(), 1, 0)

    # Compute group medians on TRAIN only
    age_medians = data1.groupby(['Pclass', 'Sex'])['Age'].median()
    global_median = data1['Age'].median()

    # Impute train
    train_keys = pd.Series(list(zip(data1['Pclass'], data1['Sex'])), index=data1.index)
    group_medians_train = train_keys.map(age_medians)
    data1['Age'] = data1['Age'].fillna(group_medians_train).fillna(global_median)

    # Impute test (using TRAIN medians only)
    test_keys = pd.Series(list(zip(data2['Pclass'], data2['Sex'])), index=data2.index)
    group_medians_test = test_keys.map(age_medians)
    data2['Age'] = data2['Age'].fillna(group_medians_test).fillna(global_median)

    return data1, data2

In [8]:
X_train_processed, X_test_processed = process_missing(X_train.copy(), X_test.copy())

In [9]:
display(X_train_processed.isnull().sum())
X_test_processed.isnull().sum()

Pclass          0
Name            0
Sex             0
Age             0
SibSp           0
Parch           0
Fare            0
Embarked        0
Family_size     0
Cabin_status    0
Alone           0
Embarked_nan    0
Age_nan         0
dtype: int64

Pclass          0
Name            0
Sex             0
Age             0
SibSp           0
Parch           0
Fare            0
Embarked        0
Family_size     0
Cabin_status    0
Alone           0
Embarked_nan    0
Age_nan         0
dtype: int64

In [10]:
display(X_train.head(3))
X_train_processed.head(3)

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Family_size
6,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,51.8625,E46,S,
718,3,"McEvoy, Mr. Michael",male,,0,0,15.5,,Q,
685,2,"Laroche, Mr. Joseph Philippe Lemercier",male,25.0,1,2,41.5792,,C,3.0


Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Family_size,Cabin_status,Alone,Embarked_nan,Age_nan
6,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,51.8625,S,1,known,Yes,0,0
718,3,"McEvoy, Mr. Michael",male,26.0,0,0,15.5,Q,1,unknown,Yes,0,1
685,2,"Laroche, Mr. Joseph Philippe Lemercier",male,25.0,1,2,41.5792,C,3,unknown,No,0,0


In [11]:
display(X_test.head(3))
X_test_processed.head(3)

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Family_size
709,3,"Moubarek, Master. Halim Gonios (""William George"")",male,,1,1,15.2458,,C,2.0
439,2,"Kvillner, Mr. Johan Henrik Johannesson",male,31.0,0,0,10.5,,S,
840,3,"Alhomaki, Mr. Ilmari Rudolf",male,20.0,0,0,7.925,,S,


Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Family_size,Cabin_status,Alone,Embarked_nan,Age_nan
709,3,"Moubarek, Master. Halim Gonios (""William George"")",male,26.0,1,1,15.2458,C,2,unknown,No,0,1
439,2,"Kvillner, Mr. Johan Henrik Johannesson",male,31.0,0,0,10.5,S,1,unknown,Yes,0,0
840,3,"Alhomaki, Mr. Ilmari Rudolf",male,20.0,0,0,7.925,S,1,unknown,Yes,0,0


In [12]:
X_train.groupby(['Pclass', 'Sex'])['Age'].median()

Pclass  Sex   
1       female    35.0
        male      41.0
2       female    28.0
        male      29.0
3       female    22.0
        male      26.0
Name: Age, dtype: float64

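The missing data has been filled appropriately. Note that nothing from the test set leaks into the training set: every imputation value (the `Embarked` mode and the `Age` medians) is computed from the training data only. For example, the third-class male passengers at index 718 (train) and 709 (test) both receive an Age of 26.0, which is the training median for that group.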
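
The group-based imputation can also be checked on a small example. The sketch below uses hypothetical toy frames and a hypothetical helper `impute_age`, and it looks up the group medians with `DataFrame.join` rather than the tuple mapping used in `process_missing`; the idea is the same: medians come from train, with the overall train median as a fallback for groups never seen in train.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the last train row and all test rows lack Age.
train = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Sex": ["male", "male", "female", "female", "female"],
    "Age": [40.0, 50.0, 20.0, 30.0, np.nan],
})
test = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex": ["male", "female", "male"],
    "Age": [np.nan, np.nan, np.nan],
})

# Statistics are computed on TRAIN only.
group_medians = (
    train.groupby(["Pclass", "Sex"])["Age"].median().rename("Age_group_median")
)
global_median = train["Age"].median()

def impute_age(frame):
    """Fill Age with the train group median, then the train global median."""
    looked_up = frame.join(group_medians, on=["Pclass", "Sex"])["Age_group_median"]
    return frame["Age"].fillna(looked_up).fillna(global_median)

train["Age"] = impute_age(train)
test["Age"] = impute_age(test)

print(test)
```

The second-class male in the toy test set belongs to a group that never appears in the toy train set, so it falls back to the global train median.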

### **Encoding the categorical data**

In [13]:
display(X_train_processed.head(3))
X_test_processed.head(3)

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Family_size,Cabin_status,Alone,Embarked_nan,Age_nan
6,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,51.8625,S,1,known,Yes,0,0
718,3,"McEvoy, Mr. Michael",male,26.0,0,0,15.5,Q,1,unknown,Yes,0,1
685,2,"Laroche, Mr. Joseph Philippe Lemercier",male,25.0,1,2,41.5792,C,3,unknown,No,0,0


Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Family_size,Cabin_status,Alone,Embarked_nan,Age_nan
709,3,"Moubarek, Master. Halim Gonios (""William George"")",male,26.0,1,1,15.2458,C,2,unknown,No,0,1
439,2,"Kvillner, Mr. Johan Henrik Johannesson",male,31.0,0,0,10.5,S,1,unknown,Yes,0,0
840,3,"Alhomaki, Mr. Ilmari Rudolf",male,20.0,0,0,7.925,S,1,unknown,Yes,0,0


In [14]:
def categorical_encoding(data1, data2):
    # Copy to avoid modifying original
    data1 = data1.copy()
    data2 = data2.copy()

    # =============================
    # 1. Sex encoding (binary map)
    # =============================
    sex = {'female': 0, 'male': 1}
    data1['Sex'] = data1['Sex'].map(sex)
    data2['Sex'] = data2['Sex'].map(sex)

    # =============================
    # 2. Cabin status encoding (binary map)
    # =============================
    cabin = {'unknown': 0, 'known': 1}
    data1['Cabin_status'] = data1['Cabin_status'].map(cabin)
    data2['Cabin_status'] = data2['Cabin_status'].map(cabin)

    # =============================
    # 3. Alone status encoding (binary map)
    # =============================
    alone = {'Yes': 1, 'No': 0}
    data1['Alone'] = data1['Alone'].map(alone)
    data2['Alone'] = data2['Alone'].map(alone)

    # =============================
    # 4. Embarked (OneHotEncode)
    # =============================
    # Uses the module-level `ohe` created in the import cell
    ohe.fit(data1[['Embarked']])  # fit only on train

    embarked_train = ohe.transform(data1[['Embarked']])
    embarked_test = ohe.transform(data2[['Embarked']])

    cols_embarked = ohe.get_feature_names_out(['Embarked'])
    embarked_train_df = pd.DataFrame(embarked_train, columns=cols_embarked, index=data1.index)
    embarked_test_df = pd.DataFrame(embarked_test, columns=cols_embarked, index=data2.index)

    data1 = pd.concat([data1.drop('Embarked', axis=1), embarked_train_df], axis=1)
    data2 = pd.concat([data2.drop('Embarked', axis=1), embarked_test_df], axis=1)

    # =============================
    # 5. Title Extraction
    # =============================
    data1['Title'] = data1['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
    data2['Title'] = data2['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())

    # Replace rare titles with "Other" (based on TRAIN only)
    rare_titles = [title for title, count in data1['Title'].value_counts().items() if count < 10]
    data1['Title'] = data1['Title'].replace(rare_titles, 'Other')
    data2['Title'] = data2['Title'].replace(rare_titles, 'Other')

    # =============================
    # 6. Title (OneHotEncode)
    # =============================
    # The same module-level `ohe` is refit here, so after this call it holds
    # the Title categories rather than the Embarked ones
    ohe.fit(data1[['Title']])  # fit only on train

    title_train = ohe.transform(data1[['Title']])
    title_test = ohe.transform(data2[['Title']])

    cols_title = ohe.get_feature_names_out(['Title'])
    title_train_df = pd.DataFrame(title_train, columns=cols_title, index=data1.index)
    title_test_df = pd.DataFrame(title_test, columns=cols_title, index=data2.index)

    data1 = pd.concat([data1.drop('Title', axis=1), title_train_df], axis=1)
    data2 = pd.concat([data2.drop('Title', axis=1), title_test_df], axis=1)

    # =============================
    # 7. Drop Name column (since Title extracted)
    # =============================
    data1.drop('Name', axis=1, inplace=True)
    data2.drop('Name', axis=1, inplace=True)

    return data1, data2

In [15]:
processed_X_train, processed_X_test = categorical_encoding(X_train_processed, X_test_processed)
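
To see what the title step does, here is a small sketch on a few names from the dataset. It uses `str.extract` with a regular expression instead of the `split` chain in the function, and it groups rare titles with `isin`, so that a title that never occurs in train is also sent to "Other". Both are my own variations for illustration; the title lists and the threshold of 2 are toy assumptions.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# Capture the text between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())

# Hypothetical title columns for a train and a test set.
train_titles = pd.Series(["Mr"] * 3 + ["Miss"] * 2 + ["Dr"])
test_titles = pd.Series(["Mr", "Dr", "Dona"])  # "Dona" is absent from train

# Decide which titles are frequent using TRAIN counts only (toy threshold).
counts = train_titles.value_counts()
frequent = counts[counts >= 2].index

# Anything that is not a frequent train title becomes "Other".
train_grouped = train_titles.where(train_titles.isin(frequent), "Other")
test_grouped = test_titles.where(test_titles.isin(frequent), "Other")
print(test_grouped.tolist())
```

With the `replace`-based version in the function, a title that only appears in test is not in `rare_titles`, so it stays as-is and is then handled by the encoder's `handle_unknown="ignore"` setting.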

In [16]:
display(processed_X_train.head(3))
processed_X_test.head(3)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Family_size,Cabin_status,Alone,Embarked_nan,Age_nan,Embarked_C,Embarked_Q,Embarked_S,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Other
6,1,1,54.0,0,0,51.8625,1,1,1,0,0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
718,3,1,26.0,0,0,15.5,1,0,1,0,1,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
685,2,1,25.0,1,2,41.5792,3,0,0,0,0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Family_size,Cabin_status,Alone,Embarked_nan,Age_nan,Embarked_C,Embarked_Q,Embarked_S,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Other
709,3,1,26.0,1,1,15.2458,2,0,0,0,1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
439,2,1,31.0,0,0,10.5,1,0,1,0,0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
840,3,1,20.0,0,0,7.925,1,0,1,0,0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


### **Scaling**

In [17]:
def scale_age_fare(train, test, drop_age=False, drop_original_fare=True):
    """
    Scale Age and log+scale Fare for train and test.
    - drop_age: if True, drop the original Age column after scaling.
    - drop_original_fare: if True, drop raw Fare and intermediate Fare_log.
    """
    train = train.copy()
    test  = test.copy()

    # create log fare
    train['Fare_log'] = np.log1p(train['Fare'])
    test['Fare_log']  = np.log1p(test['Fare'])

    cols_to_scale = ['Age', 'Fare_log']

    # fit scaler on train only
    # scaler = StandardScaler()
    scaler.fit(train[cols_to_scale])

    # transform both sets
    train_scaled = scaler.transform(train[cols_to_scale])
    test_scaled  = scaler.transform(test[cols_to_scale])

    scaled_cols = [f'{c}_scaled' for c in cols_to_scale]
    train[scaled_cols] = train_scaled
    test[scaled_cols]  = test_scaled

    # optional drops
    if drop_original_fare:
        train.drop(['Fare', 'Fare_log'], axis=1, inplace=True)
        test.drop(['Fare', 'Fare_log'], axis=1, inplace=True)

    if drop_age:
        train.drop('Age', axis=1, inplace=True)
        test.drop('Age', axis=1, inplace=True)

    return train, test

In [18]:
X_train_final, X_test_final = scale_age_fare(processed_X_train.copy(), processed_X_test.copy())

In [19]:
display(X_train_final.head(3))
X_test_final.head(3)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Family_size,Cabin_status,Alone,Embarked_nan,Age_nan,Embarked_C,Embarked_Q,Embarked_S,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Other,Age_scaled,Fare_log_scaled
6,1,1,54.0,0,0,1,1,1,0,0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.888773,1.053619
718,3,1,26.0,0,0,1,0,1,0,1,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,-0.232122,-0.159147
685,2,1,25.0,1,2,3,0,0,0,0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,-0.307868,0.828292


Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Family_size,Cabin_status,Alone,Embarked_nan,Age_nan,Embarked_C,Embarked_Q,Embarked_S,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Other,Age_scaled,Fare_log_scaled
709,3,1,26.0,1,1,2,0,0,0,1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,-0.232122,-0.175318
439,2,1,31.0,0,0,1,0,1,0,0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.14661,-0.535177
840,3,1,20.0,0,0,1,0,1,0,0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,-0.686599,-0.799211
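
For reference, here is a small sketch of what the Fare branch of `scale_age_fare` computes, on hypothetical toy fares. `np.log1p` is used rather than `np.log` so that a fare of 0 maps to 0 instead of negative infinity, and `StandardScaler` subtracts the train mean and divides by the train (population) standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy fares (one column), including a zero fare and a large one.
train_fare = np.array([[0.0], [7.25], [8.05], [71.2833], [512.3292]])
test_fare = np.array([[7.925], [53.1]])

# Log-transform first to reduce the right skew.
train_log = np.log1p(train_fare)
test_log = np.log1p(test_fare)

# Fit on train only, then transform both sets.
sc = StandardScaler().fit(train_log)
train_scaled = sc.transform(train_log)
test_scaled = sc.transform(test_log)

# The same result by hand, using the train mean and std.
manual = (test_log - train_log.mean(axis=0)) / train_log.std(axis=0)
print(np.allclose(test_scaled, manual))
```

Only the training columns are guaranteed to have mean 0 and standard deviation 1 after scaling; the test columns will be close to that only if the two sets are similarly distributed.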


### Conclusion

Finally, the dataset is fully ready for modelling without any data leakage. For the modelling experiments, I kept both `Age` and `Age_scaled` for now so that I can compare them.

From here, I will move on to the next stage, which is feature selection. I will save the train and test data for future use.

In [20]:
train_df = pd.concat([X_train_final, y_train], axis=1)
test_df = pd.concat([X_test_final, y_test], axis=1)

display(train_df.head(3))
test_df.head(3)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Family_size,Cabin_status,Alone,Embarked_nan,Age_nan,...,Embarked_Q,Embarked_S,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Other,Age_scaled,Fare_log_scaled,Survived
6,1,1,54.0,0,0,1,1,1,0,0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.888773,1.053619,0
718,3,1,26.0,0,0,1,0,1,0,1,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,-0.232122,-0.159147,0
685,2,1,25.0,1,2,3,0,0,0,0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,-0.307868,0.828292,0


Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Family_size,Cabin_status,Alone,Embarked_nan,Age_nan,...,Embarked_Q,Embarked_S,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Other,Age_scaled,Fare_log_scaled,Survived
709,3,1,26.0,1,1,2,0,0,0,1,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,-0.232122,-0.175318,1
439,2,1,31.0,0,0,1,0,1,0,0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.14661,-0.535177,0
840,3,1,20.0,0,0,1,0,1,0,0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,-0.686599,-0.799211,0


In [21]:
train_df.to_csv(r"C:\Users\KOLADE\OneDrive\Documents\Practices\Titanic\data\Train.csv", index=False)
test_df.to_csv(r"C:\Users\KOLADE\OneDrive\Documents\Practices\Titanic\data\Test.csv",index=False)
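
When the saved files are loaded again in the next stage, the target has to be separated from the features once more. A minimal round-trip sketch, assuming a hypothetical toy frame and a temporary directory instead of the real paths:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical miniature version of train_df, with the shuffled index kept.
toy_train = pd.DataFrame(
    {"Pclass": [1, 3], "Age_scaled": [1.888773, -0.232122], "Survived": [0, 0]},
    index=[6, 718],
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "Train.csv"
    toy_train.to_csv(path, index=False)  # index=False drops the original row labels
    reloaded = pd.read_csv(path)

# Split the reloaded frame back into features and target.
X_reloaded = reloaded.drop(columns="Survived")
y_reloaded = reloaded["Survived"]

print(reloaded.index.tolist())
```

Since the files are written with `index=False`, the reloaded frames get a fresh 0-based index, so the original passenger row labels are no longer available after reloading.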

In [22]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 596 entries, 6 to 102
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Pclass           596 non-null    int64  
 1   Sex              596 non-null    int64  
 2   Age              596 non-null    float64
 3   SibSp            596 non-null    int64  
 4   Parch            596 non-null    int64  
 5   Family_size      596 non-null    int64  
 6   Cabin_status     596 non-null    int64  
 7   Alone            596 non-null    int64  
 8   Embarked_nan     596 non-null    int64  
 9   Age_nan          596 non-null    int64  
 10  Embarked_C       596 non-null    float64
 11  Embarked_Q       596 non-null    float64
 12  Embarked_S       596 non-null    float64
 13  Title_Master     596 non-null    float64
 14  Title_Miss       596 non-null    float64
 15  Title_Mr         596 non-null    float64
 16  Title_Mrs        596 non-null    float64
 17  Title_Other      596 

In [23]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 295 entries, 709 to 173
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Pclass           295 non-null    int64  
 1   Sex              295 non-null    int64  
 2   Age              295 non-null    float64
 3   SibSp            295 non-null    int64  
 4   Parch            295 non-null    int64  
 5   Family_size      295 non-null    int64  
 6   Cabin_status     295 non-null    int64  
 7   Alone            295 non-null    int64  
 8   Embarked_nan     295 non-null    int64  
 9   Age_nan          295 non-null    int64  
 10  Embarked_C       295 non-null    float64
 11  Embarked_Q       295 non-null    float64
 12  Embarked_S       295 non-null    float64
 13  Title_Master     295 non-null    float64
 14  Title_Miss       295 non-null    float64
 15  Title_Mr         295 non-null    float64
 16  Title_Mrs        295 non-null    float64
 17  Title_Other      29