<a href="https://colab.research.google.com/github/Hafeezali366/Spaceship-Titanic/blob/main/Hamza_(Data_Preprocessing_%26_Feature_Engineering).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



---



---


# **3. Data Preprocessing & Feature Engineering**


---



---




# **1. Data Preprocessing**

This section focuses on handling missing values, removing duplicates, and cleaning the dataset.



# **Step 1: Import Libraries**
We import the necessary libraries, including pandas for data manipulation, numpy for numerical operations, and KNNImputer for more advanced imputation if needed.



In [47]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer


# **Step 2: Load the Dataset**

Load the training and test datasets from the provided URLs.

In [48]:
# Load the training and test datasets
train_url = "https://raw.githubusercontent.com/Hafeezali366/Spaceship-Titanic/refs/heads/main/dataset/train.csv"
test_url = "https://raw.githubusercontent.com/Hafeezali366/Spaceship-Titanic/refs/heads/main/dataset/test.csv"

train_df = pd.read_csv(train_url)
test_df = pd.read_csv(test_url)

train_df.head()


| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


# **Step 3: Handle Missing Values (Imputation)**


## **3.1 Impute Numerical Features Using Median**

This step fills missing values in the numerical columns with the median, which is less sensitive than the mean to outliers in skewed distributions. The median is computed on the training set and applied to both datasets, so train and test are filled with the same values.

In [49]:
# Impute numerical columns with the training-set median (Age, RoomService, FoodCourt, etc.)
numerical_cols = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
for col in numerical_cols:
    # Compute the statistic once on train and reuse it for test
    median_value = train_df[col].median()
    train_df[col] = train_df[col].fillna(median_value)
    test_df[col] = test_df[col].fillna(median_value)
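To illustrate why the median is preferred here, the following is a minimal sketch on a made-up spending column with one extreme value. The frames `toy_train` and `toy_test` and their numbers are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy spending column with one extreme value (illustrative data only)
toy_train = pd.DataFrame({"Spa": [0.0, 0.0, 10.0, np.nan, 5000.0]})
toy_test = pd.DataFrame({"Spa": [np.nan, 20.0]})

spa_mean = toy_train["Spa"].mean()      # pulled upward by the 5000.0 outlier
spa_median = toy_train["Spa"].median()  # stays close to the typical value

# Fill both frames with the statistic learned from the training data
toy_train_filled = toy_train["Spa"].fillna(spa_median)
toy_test_filled = toy_test["Spa"].fillna(spa_median)

print(f"mean={spa_mean}, median={spa_median}")
```

In this toy case the mean is 1252.5 while the median is 5.0, so a mean-filled passenger would look like a heavy spender even though most passengers spent almost nothing.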


## **3.2 Impute Categorical Features Using Mode**

This step fills missing categorical values with the most frequent category (mode), a simple baseline for handling missing categorical data. As with the numerical columns, the mode is taken from the training set and applied to both datasets.

In [50]:
# Impute categorical columns with the training-set mode (HomePlanet, Destination, VIP)
categorical_cols = ['HomePlanet', 'Destination', 'VIP']
for col in categorical_cols:
    # mode() returns a Series; [0] selects the first (most frequent) value
    mode_value = train_df[col].mode()[0]
    train_df[col] = train_df[col].fillna(mode_value)
    test_df[col] = test_df[col].fillna(mode_value)
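The `[0]` after `mode()` matters because `Series.mode()` returns a Series rather than a single value, and on a tie it returns every tied value in sorted order. A small sketch on made-up data (the `toy` Series is an assumption for illustration):

```python
import pandas as pd

# Two categories tie for most frequent; the missing entry is ignored by mode()
toy = pd.Series(["Earth", "Mars", "Earth", "Mars", None])

modes = toy.mode()      # Series holding both tied values, sorted
fill_value = modes[0]   # first tied value is used as the fill
filled = toy.fillna(fill_value)

print(list(modes), fill_value)
```

On a tie the fill value is therefore decided by sort order, which is worth keeping in mind when two categories are nearly equally common.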





## **3.3 Special CryoSleep Imputation**

This special logic imputes CryoSleep based on a passenger's spending behavior. If a passenger spent nothing (zero total spending), we assume they were in CryoSleep.

In [51]:
# Special CryoSleep imputation: if CryoSleep is missing and total spending is zero, set CryoSleep to True
spend_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
for df in (train_df, test_df):
    # Vectorized mask instead of a row-wise apply
    no_spend = df[spend_cols].sum(axis=1) == 0
    df.loc[df['CryoSleep'].isna() & no_spend, 'CryoSleep'] = True

# For remaining missing CryoSleep values, impute with the training-set mode
cryo_mode = train_df['CryoSleep'].mode()[0]
train_df['CryoSleep'] = train_df['CryoSleep'].fillna(cryo_mode)
test_df['CryoSleep'] = test_df['CryoSleep'].fillna(cryo_mode)
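The rule can be checked on a small hand-made frame. This is only a sketch: the six rows in `toy` are invented for illustration, and the final `astype(bool)` is an added convenience rather than part of the cell above.

```python
import numpy as np
import pandas as pd

spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Invented rows: two passengers have a missing CryoSleep flag
toy = pd.DataFrame({
    "CryoSleep":    [np.nan, np.nan, False, False, False, True],
    "RoomService":  [0.0, 50.0, 10.0, 0.0, 7.0, 0.0],
    "FoodCourt":    [0.0, 0.0, 0.0, 30.0, 0.0, 0.0],
    "ShoppingMall": [0.0, 0.0, 5.0, 0.0, 0.0, 0.0],
    "Spa":          [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "VRDeck":       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
})

# Rule: missing flag and zero total spending implies CryoSleep = True
no_spend = toy[spend_cols].sum(axis=1) == 0
toy.loc[toy["CryoSleep"].isna() & no_spend, "CryoSleep"] = True

# Fallback: any flag still missing takes the most frequent value
cryo_mode = toy["CryoSleep"].mode()[0]
toy["CryoSleep"] = toy["CryoSleep"].fillna(cryo_mode).astype(bool)

print(toy["CryoSleep"].tolist())
```

The first passenger (no spending) is set to True by the rule, while the second (who spent 50.0) falls through to the mode, which is False in this toy frame.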





## **3.4 Impute Missing Values in the 'Cabin' Column**
As 'Cabin' is a categorical feature, missing values are imputed with the most frequent category (mode). Note that this assigns the same deck, cabin number, and side to every passenger whose cabin was missing.

In [52]:
# Impute missing values in 'Cabin' with the training-set mode (since it's categorical)
cabin_mode = train_df['Cabin'].mode()[0]
train_df['Cabin'] = train_df['Cabin'].fillna(cabin_mode)
test_df['Cabin'] = test_df['Cabin'].fillna(cabin_mode)


# **Step 4: Handle Duplicates**

Remove duplicate rows from the dataset so that repeated records do not bias model training.


In [53]:
# Remove duplicate rows from the dataset to ensure data integrity
train_df.drop_duplicates(inplace=True)
test_df.drop_duplicates(inplace=True)
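Before dropping anything, it can help to count how many rows are exact duplicates. A minimal sketch on an invented frame (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Invented frame in which the last row repeats the second one exactly
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_01"],
    "Age": [39.0, 24.0, 24.0],
})

# duplicated() marks every repeat after the first occurrence
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence by default
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"{n_dupes} duplicate row(s) removed, {len(deduped)} rows remain")
```

If the count is zero for a dataset, `drop_duplicates` is a no-op there. For a test set this is the desirable outcome, since removing rows would change how many predictions are produced.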



# **Step 5: Remove Irrelevant Features**
The 'Name' column is dropped because it is unlikely to carry useful information for prediction.


In [54]:
# Remove irrelevant features (e.g., Name) which are unlikely to be predictive for the model
train_df.drop(columns=['Name'], inplace=True)
test_df.drop(columns=['Name'], inplace=True)


# **Step 6: Verify Missing Data**

This step ensures that there are no remaining missing values after the imputation process.

In [55]:
# Verify that no missing values remain in the data after imputation
missing_train = train_df.isnull().sum()
missing_test = test_df.isnull().sum()

print("Missing values in training data:")
print(missing_train[missing_train > 0])

print("\nMissing values in test data:")
print(missing_test[missing_test > 0])



Missing values in training data:
Series([], dtype: int64)

Missing values in test data:
Series([], dtype: int64)
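The same check can be wrapped in a small helper so it is easy to rerun after each imputation step. The function name `report_missing` and the `toy` frame are hypothetical and used only for this sketch.

```python
import numpy as np
import pandas as pd

def report_missing(df: pd.DataFrame) -> pd.Series:
    """Return per-column missing counts, keeping only columns with at least one gap."""
    counts = df.isnull().sum()
    return counts[counts > 0]

# Invented frame with one missing Age value
toy = pd.DataFrame({"Age": [39.0, np.nan], "VIP": [False, True]})

before = report_missing(toy)   # lists Age with one missing value
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
after = report_missing(toy)    # empty Series once nothing is missing

print(before)
print(after)
```

An empty Series, as in the output above, means no column has missing values left.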


# **Step 7: Data Preprocessing Output**

After running the preprocessing steps, we confirm that missing values have been handled and that duplicates and irrelevant columns have been removed.

In [56]:
# Check for missing values in the final dataset
missing_train = train_df.isnull().sum().sum()
missing_test = test_df.isnull().sum().sum()

# Print final missing values check
print(f"Remaining missing values in training data: {missing_train}")
print(f"Remaining missing values in test data: {missing_test}")

# Display first few rows of the final preprocessed data
train_df.head()


Remaining missing values in training data: 0
Remaining missing values in test data: 0


| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True |


# **2. Feature Engineering**

This section focuses on creating new features and transforming existing features to be used in the machine learning model.


# **Step 1: Create New Features**


This step creates the following new features:


*   'Deck', 'CabinNum', and 'Side', by splitting the 'Cabin' column
*   'GroupId' and 'GroupSize', derived from 'PassengerId' to group passengers who travel together
*   'TotalSpending' (total amount spent across all services) and 'NoSpending' (1 if a passenger spent nothing, otherwise 0)



In [57]:
# Split the Cabin column into three parts: Deck, CabinNum, and Side
train_df[['Deck', 'CabinNum', 'Side']] = train_df['Cabin'].str.split('/', expand=True)
test_df[['Deck', 'CabinNum', 'Side']] = test_df['Cabin'].str.split('/', expand=True)

# Create GroupId from PassengerId and calculate GroupSize
train_df['GroupId'] = train_df['PassengerId'].apply(lambda x: x.split('_')[0])
test_df['GroupId'] = test_df['PassengerId'].apply(lambda x: x.split('_')[0])

# GroupSize - Count the number of passengers in the same group
train_df['GroupSize'] = train_df.groupby('GroupId')['GroupId'].transform('count')
test_df['GroupSize'] = test_df.groupby('GroupId')['GroupId'].transform('count')

# Create spending features: TotalSpending and NoSpending
train_df['TotalSpending'] = train_df[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].sum(axis=1)
test_df['TotalSpending'] = test_df[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].sum(axis=1)

# Create binary feature for NoSpending (True if the total spending is 0)
train_df['NoSpending'] = (train_df['TotalSpending'] == 0).astype(int)
test_df['NoSpending'] = (test_df['TotalSpending'] == 0).astype(int)
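To make the derived columns concrete, here is a minimal sketch on three invented rows. The `toy` frame is an assumption, only two spending columns are included for brevity, and `GroupSize` is counted over `PassengerId` here rather than over `GroupId` itself.

```python
import pandas as pd

# Three invented passengers; the first two share group 0003
toy = pd.DataFrame({
    "PassengerId": ["0003_01", "0003_02", "0004_01"],
    "Cabin": ["A/0/S", "A/0/S", "F/1/S"],
    "RoomService": [43.0, 0.0, 0.0],
    "Spa": [6715.0, 3329.0, 0.0],
})

# Cabin has the form Deck/CabinNum/Side
toy[["Deck", "CabinNum", "Side"]] = toy["Cabin"].str.split("/", expand=True)

# PassengerId has the form GroupId_MemberNumber
toy["GroupId"] = toy["PassengerId"].str.split("_").str[0]
toy["GroupSize"] = toy.groupby("GroupId")["PassengerId"].transform("count")

# Spending totals and a zero-spending indicator
toy["TotalSpending"] = toy[["RoomService", "Spa"]].sum(axis=1)
toy["NoSpending"] = (toy["TotalSpending"] == 0).astype(int)

print(toy[["Deck", "CabinNum", "Side", "GroupSize", "TotalSpending", "NoSpending"]])
```

Note that `CabinNum` comes out of `str.split` as a string, so it would need an explicit conversion (for example `astype(int)`) before being used as a numeric feature.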




# **Step 2: Encoding Categorical Features**

This step encodes the categorical variables:
* Label Encoding for binary features (CryoSleep, VIP, Side).
* One-Hot Encoding for multi-class features (HomePlanet, Destination, Deck).







In [58]:
# Label Encoding for binary features (CryoSleep, VIP, Side)
from sklearn.preprocessing import LabelEncoder
label_cols = ['CryoSleep', 'VIP', 'Side']
label_encoder = LabelEncoder()

for col in label_cols:
    train_df[col] = label_encoder.fit_transform(train_df[col])
    test_df[col] = label_encoder.transform(test_df[col])

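# One-Hot Encoding for multi-class features (HomePlanet, Destination, Deck)
train_df = pd.get_dummies(train_df, columns=['HomePlanet', 'Destination', 'Deck'], drop_first=True)
test_df = pd.get_dummies(test_df, columns=['HomePlanet', 'Destination', 'Deck'], drop_first=True)

# Align test columns with train so a category missing from one set does not change the feature layout
feature_cols = train_df.columns.drop('Transported')
test_df = test_df.reindex(columns=feature_cols, fill_value=False)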
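Because `pd.get_dummies` is called on each dataset separately, a category that appears in only one of them produces different columns. The sketch below shows the mismatch and one way to align it with `reindex`; the frames `toy_train` and `toy_test` are invented for illustration.

```python
import pandas as pd

# Invented data: deck "T" appears in train but not in test
toy_train = pd.DataFrame({"Deck": ["A", "B", "T"]})
toy_test = pd.DataFrame({"Deck": ["A", "B", "B"]})

train_enc = pd.get_dummies(toy_train, columns=["Deck"], drop_first=True)
test_enc = pd.get_dummies(toy_test, columns=["Deck"], drop_first=True)

print(list(train_enc.columns))  # Deck_B and Deck_T
print(list(test_enc.columns))   # Deck_B only

# Give test the same columns as train, filling absent dummies with False
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=False)
print(list(test_aligned.columns))
```

A model trained on the two-column layout would reject or misread the one-column test frame, which is why aligning the columns before prediction is a sensible safeguard.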


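# **Step 3: Scaling Numerical Features**

The numerical features Age and TotalSpending are scaled with RobustScaler, which centers each feature on its median and divides by its interquartile range (IQR). Extreme values therefore influence the scaling far less than they would with mean and standard deviation. The scaler is fit on the training data and then applied to the test data.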


In [59]:
# Scale numerical features (Age, TotalSpending) using RobustScaler to handle outliers
from sklearn.preprocessing import RobustScaler

numerical_cols = ['Age', 'TotalSpending']
scaler = RobustScaler()

train_df[numerical_cols] = scaler.fit_transform(train_df[numerical_cols])
test_df[numerical_cols] = scaler.transform(test_df[numerical_cols])
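With its default settings, `RobustScaler` computes `(x - median) / (Q3 - Q1)`. The following sketch reproduces that by hand on an invented column and compares it with scikit-learn's result; the array `x` is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Invented column with one extreme value
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual computation: center on the median, divide by the interquartile range
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

# Same transformation through scikit-learn (default quantile_range is 25 to 75)
scaled = RobustScaler().fit_transform(x)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

The scaled Age values shown in the Step 4 output are consistent with this formula for a training median of 27 and an IQR of 17, for example (39 - 27) / 17 ≈ 0.705882. Those two numbers are inferred from the displayed rows, not read from the fitted scaler.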

# **Step 4: Feature Engineering Output**

After feature engineering, new columns such as TotalSpending, GroupId, GroupSize, and NoSpending have been added to the dataset.


In [60]:
train_df.head()

| | PassengerId | CryoSleep | Cabin | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | ... | HomePlanet_Mars | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e | Deck_B | Deck_C | Deck_D | Deck_E | Deck_F | Deck_G | Deck_T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | 0 | B/0/P | 0.705882 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | False | False | True | True | False | False | False | False | False | False |
| 1 | 0002_01 | 0 | F/0/S | -0.176471 | 0 | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | ... | False | False | True | False | False | False | False | True | False | False |
| 2 | 0003_01 | 0 | A/0/S | 1.823529 | 1 | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | ... | False | False | True | False | False | False | False | False | False | False |
| 3 | 0003_02 | 0 | A/0/S | 0.352941 | 0 | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | ... | False | False | True | False | False | False | False | False | False | False |
| 4 | 0004_01 | 0 | F/1/S | -0.647059 | 0 | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | ... | False | False | True | False | False | False | False | True | False | False |


The final versions of the training and test datasets, after all preprocessing and feature engineering steps, are available for use. The data has been cleaned and transformed: missing values are handled, new features are created, and categorical variables are encoded. The processed files can be accessed through the following links:


https://github.com/Hafeezali366/Spaceship-Titanic/blob/main/dataset/cleaned_train_data.csv

https://github.com/Hafeezali366/Spaceship-Titanic/blob/main/dataset/cleaned_test_data.csv
