<a id="Title"></a>
# <span style="color:teal;font-weight:bold;">Spaceship Titanic 🩹 Data imputation</span>

This notebook is the third part of my <span style="font-weight:bold;color:green">Spaceship Titanic series</span>:

1. <a href="https://www.kaggle.com/code/fertmeneses/spaceship-titanic-getting-familiar">Spaceship Titanic 🏁 Getting familiar</a>.
2. <a href="https://www.kaggle.com/code/fertmeneses/spaceship-titanic-feature-engineering">Spaceship Titanic 💡 Feature engineering</a>.
3. <span style="font-weight:bold">Spaceship Titanic 🩹 Data imputation.</span> [This notebook]
4. Spaceship Titanic 🖥️ Model optimization. (Coming soon)
5. Spaceship Titanic 🔭 Integrated analysis. (Coming soon)

In the <span style="color:orangered;font-weight:bold;">first episode</span>, I studied the Leaderboard (LB) and tried simple Machine Learning models with the original dataset, getting a <span style="color:orangered;font-weight:bold;">submission score of 0.79635</span>. 

From that experience, I learned that **$\approx$75% of the submissions in the LB are below a 0.80 score, while only $\approx$5% are above 0.81**.

In the <span style="color:orange;font-weight:bold;">second episode</span>, I performed feature engineering on the original dataset (without any data correction), getting a <span style="color:orange;font-weight:bold;">submission score of 0.80336</span>.

<div style="color:white;
    display:fill;
    border-radius:15px;
    margin-left: 100px;
    margin-right: 100px;
    background-color:lightblue;
    font-size:105%;
    font-family:Verdana;
    letter-spacing:0.5px">

<p style="padding: 20px;color:black;text-align:center;">
In this <span style="color:green;font-weight:bold;">notebook</span>, I use my previously engineered features and focus on <span style="font-weight:bold;">data imputation</span> employing my own Machine Learning methods, getting a <span style="color:green;font-weight:bold;">submission score of X</span>.

</p>
</div>



In future notebooks, I'll optimize the Machine Learning model and then carry out an integrated analysis based on my results and a close study of other Kagglers' contributions.

<hr>

# <font color='teal'>Outline</font> <a class="anchor"  id="Outline"></a>

[**Load data and preprocess**](#Load_data_and_preprocess)

  - [Load original data](#Load_original_data)

  - [Basic feature engineering](#Basic_feature_engineering)

[**Manual data correction**](#Manual_data_correction)

  - [Feature_X](#Manual_data_correction_Feature_X)

[**ML data imputation**](#ML_data_imputation)

  - [Method description](#Method_description)

  - [Data selection](#Data_selection)

  - [Model tests](#Model_tests)

  - [Data imputation](#Data_imputation)

    - [Feature_X](#Data_imputation_Feature_X)
  
[**Feature engineering**](#Feature_engineering)

  - [Engineer features](#Engineer_features)
   
  - [Correlations](#Correlations)
   
  - [Combined features](#Feature_engineering_combined)
   
[**Submission results**](#Submission_results)

  - [Try models](#Try_models)
    
  - [Analyze results](#Analyze_results)
  
[**Conclusions**](#Conclusions)

<a id="Load_data_and_preprocess"></a>
# <span style="color:teal;font-weight:bold;">Load data and preprocess</span>

In this section, I load the original data and do basic feature engineering, in which I only extract information from single variables or change names. Other feature engineering processes that involve relating two or more features will come later, once I've corrected the missing values.

In [8]:
import datetime
print(f"Notebook last run (end-to-end): {datetime.datetime.now()}")

Notebook last run (end-to-end): 2024-09-28 17:07:29.208300


<a id="Load_original_data"></a>
## <span style="color:teal;font-weight:bold;">Load original data</span>

In the following lines, I load the original datasets and get this information:

- Ten random example rows from the training dataset.

- Number of rows in both datasets.

- Features' names and data types.

- Number of missing values in both datasets, per feature and per row.

In [16]:
import pandas as pd
import numpy as np
from termcolor import colored

# Load original datasets:
train_df = pd.read_csv('kaggle/input/spaceship-titanic/train.csv') # Training dataset
test_df = pd.read_csv('kaggle/input/spaceship-titanic/test.csv') # Testing dataset
# Keep the "PassengerId" feature separately:
train_ID = train_df['PassengerId']
test_ID = test_df['PassengerId']
# Display 10 random examples:
np.random.seed(1) # Ensure reproducibility
samples = np.random.choice(range(len(train_df)), 10, replace=False)
display(train_df.iloc[samples]) # Examples
# Print global information:
print('\nNumber of rows in train/test datasets:\n')
print(len(train_df),'/',len(test_df))
print('\nFeatures: names and data types:\n')
print(train_df.dtypes)
# Print number of missing values per feature:
print('\nMissing values in train/test datasets:\n')
for col in test_df.columns:
    # Count missing values and obtain percentages:
    N_train = train_df[col].isna().sum() 
    N_test = test_df[col].isna().sum()
    p_train = N_train/len(train_df)*100 # [%]
    p_test = N_test/len(test_df)*100 # [%]
    # Print results:
    color_train = 'red' if N_train else 'green'
    color_test = 'red' if N_test else 'green'
    rmargin = 60-len(col)
    text_train = colored(f'{N_train} ({p_train:.1f}%)', color_train)
    text_test = colored(f'{N_test} ({p_test:.1f}%)', color_test)
    print(f'{col}:',f'{text_train} / {text_test}'.rjust(rmargin))
# Count missing values in each row:
N_nan_train = train_df.isna().sum(axis=1)
N_nan_test = test_df.isna().sum(axis=1)
# Print number of rows with n missing values (train/test), in ascending order of n:
for n in sorted(set(N_nan_train).union(set(N_nan_test))):
    print(f'Number of rows with {n} missing values: {sum(N_nan_train==n)}/{sum(N_nan_test==n)}')

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
1454,1539_01,Europa,,A/17/S,55 Cancri e,32.0,False,54.0,3782.0,0.0,21.0,5.0,Alyadum Barmant,True
218,0232_01,Earth,True,G/36/S,PSO J318.5-22,27.0,False,0.0,,0.0,0.0,0.0,Nica Bakerrison,False
7866,8392_01,Earth,False,F/1610/S,PSO J318.5-22,24.0,False,86.0,669.0,1.0,0.0,0.0,Therly Brightez,False
7622,8141_01,Earth,True,G/1310/S,TRAPPIST-1e,38.0,False,0.0,0.0,0.0,0.0,0.0,Stenny Belley,True
4108,4387_01,Mars,False,F/902/P,PSO J318.5-22,32.0,False,192.0,0.0,441.0,18.0,0.0,Apix Wala,False
4363,4645_01,Europa,False,B/184/S,TRAPPIST-1e,48.0,False,0.0,9633.0,0.0,1.0,2.0,Aton Bacistion,True
343,0379_01,Earth,False,G/63/P,TRAPPIST-1e,31.0,False,198.0,0.0,591.0,0.0,164.0,Brita Moodson,True
5966,6324_01,Earth,False,E/420/S,,31.0,False,19.0,509.0,0.0,0.0,177.0,Lesley Hinetthews,False
669,0699_01,Mars,True,F/126/S,,18.0,False,0.0,0.0,0.0,0.0,0.0,Roswal Sha,True
6506,6865_01,Europa,False,D/208/S,TRAPPIST-1e,27.0,,69.0,2878.0,0.0,4232.0,3798.0,Thabih Peducting,False



Number of rows in train/test datasets:

8693 / 4277

Features: names and data types:

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

Missing values in train/test datasets:

PassengerId:        0 (0.0%) / 0 (0.0%)
HomePlanet:       201 (2.3%) / 87 (2.0%)
CryoSleep:        217 (2.5%) / 93 (2.2%)
Cabin:            199 (2.3%) / 100 (2.3%)
Destination:      182 (2.1%) / 92 (2.2%)
Age:              179 (2.1%) / 91 (2.1%)
VIP:              203 (2.3%) / 93 (2.2%)
RoomService:      181 (2.1%) / 82 (1.9%)
FoodCourt:        

For more comments about the features, please visit <a href="https://www.kaggle.com/code/fertmeneses/spaceship-titanic-feature-engineering">Spaceship Titanic 💡 Feature engineering</a>.
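
As a cross-check on the per-feature and per-row counts printed above, the same summary can be computed in a vectorized way. This is only a minimal sketch: `missing_summary`, `toy_train` and `toy_test` are hypothetical names and toy data that I introduce here for illustration; they are not the competition datasets.

```python
import numpy as np
import pandas as pd

def missing_summary(train, test):
    """Per-feature missing counts and percentages for the columns shared by both datasets."""
    cols = [c for c in test.columns if c in train.columns]
    summary = pd.DataFrame({
        "train_n": train[cols].isna().sum(),           # Number of missing values (train)
        "train_pct": train[cols].isna().mean() * 100,  # [%]
        "test_n": test[cols].isna().sum(),             # Number of missing values (test)
        "test_pct": test[cols].isna().mean() * 100,    # [%]
    })
    return summary.round(1)

# Toy data (hypothetical, only to illustrate the idea):
toy_train = pd.DataFrame({
    "Age": [20.0, np.nan, 31.0, 45.0],
    "VIP": [False, True, None, None],
    "Transported": [True, False, True, False],
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 18.0],
    "VIP": [False, False],
})

summary = missing_summary(toy_train, toy_test)
print(summary)

# Number of rows having n missing values, sorted by n:
rows_by_nan_count = toy_train.isna().sum(axis=1).value_counts().sort_index()
print(rows_by_nan_count)
```

Calling `missing_summary(train_df, test_df)` on the real datasets should reproduce the counts listed above, without the colored formatting.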

<a id="Basic_feature_engineering"></a>
## <span style="color:teal;font-weight:bold;">Basic feature engineering</span>

Except for **PassengerId**, each input feature has around 2% missing values in both the training and testing datasets. In order to correct them, I'll make some fair assumptions and deductions using the original information.

Therefore, **in this Basic feature engineering process I won't generate new features that combine two or more features, because that would multiply the missing values**: a combined feature is missing whenever any of its inputs is missing. Instead, I will just extract information from the original features or make simple changes such as renaming a feature.
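
To see why combining features multiplies the missing values, here is a small sketch with synthetic data. I assume (for illustration only) two independent features, hypothetical columns `A` and `B`, each with about 2% missing values. Under that independence assumption, a feature built from $k$ inputs is missing with probability $1-(1-p)^k$, which gives roughly 4% for $k=2$ and roughly 9.6% for $k=5$ when $p=0.02$.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # Ensure reproducibility
n = 10_000

# Synthetic data (hypothetical): two independent features, each ~2% missing
toy = pd.DataFrame({"A": rng.random(n), "B": rng.random(n)})
toy.loc[rng.random(n) < 0.02, "A"] = np.nan
toy.loc[rng.random(n) < 0.02, "B"] = np.nan

# A combined feature is NaN whenever any of its inputs is NaN:
toy["A_plus_B"] = toy["A"] + toy["B"]

missing_rate = toy.isna().mean()  # Fraction of missing values per column
print((missing_rate * 100).round(2))
```

The combined column is never less complete than its inputs, which is the reason to impute the single features first and only then build the multi-feature ones.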

See my previous notebook <a href="https://www.kaggle.com/code/fertmeneses/spaceship-titanic-feature-engineering">Spaceship Titanic 💡 Feature engineering</a> for details on how each single feature is engineered. Below, I simply apply that code.

In [27]:
# First, create copies of the original datasets to hold the engineered features:
train_df_SF = train_df.copy()
test_df_SF = test_df.copy()

# # # "PassengerId": new features "IDgroup" and "GroupMembers" # # #

train_df_SF['IDgroup'] = train_df['PassengerId'].apply(
    lambda x: int(x.split('_')[0]))
test_df_SF['IDgroup'] = test_df['PassengerId'].apply(
    lambda x: int(x.split('_')[0]))
# Count occurrences of every unique value in IDgroup (train and test combined):
ocurrences = pd.concat([train_df_SF['IDgroup'], test_df_SF['IDgroup']]).value_counts().to_dict()
for dataset in [train_df_SF,test_df_SF]:
    dataset['GroupMembers'] = dataset['IDgroup'].apply(lambda x: ocurrences[x])
# Drop unnecessary feature:
train_df_SF = train_df_SF.drop('PassengerId',axis=1)
test_df_SF = test_df_SF.drop('PassengerId',axis=1)

# # # "CryoSleep": make boolean/numeric # # #

for dataset in [train_df_SF,test_df_SF]:
    dataset['CryoSleep'] = dataset['CryoSleep'].apply(
        lambda x: np.nan if x!=x else (1 if x else 0))

# # # "Cabin": new features "Cabin_Deck", "Cabin_num" and "Cabin_isPort" # # #

# Separate "Cabin" (format: deck/num/side) into three parts.
# Note: all three parts, including "Cabin_num", are kept as strings here.
cabin_X = ['Cabin_Deck','Cabin_num','Cabin_Side']
for i,cabin_part in enumerate(cabin_X):
    train_df_SF[cabin_part] = train_df['Cabin'].apply(
        lambda x: np.nan if x!=x else (
            x.split('/')[i]))
    test_df_SF[cabin_part] = test_df['Cabin'].apply(
        lambda x: np.nan if x!=x else (
            x.split('/')[i]))
# Change 'Cabin_Side' to 'Cabin_isPort' and make it boolean/numeric:
train_df_SF['Cabin_isPort'] = train_df_SF['Cabin_Side'].apply(
    lambda x: np.nan if x!=x else (1 if x=='P' else 0))
test_df_SF['Cabin_isPort'] = test_df_SF['Cabin_Side'].apply(
    lambda x: np.nan if x!=x else (1 if x=='P' else 0))
# Drop unnecessary features:
for feature in ['Cabin','Cabin_Side']:
    train_df_SF = train_df_SF.drop(feature,axis=1)
    test_df_SF = test_df_SF.drop(feature,axis=1)

# # # "Destination" redefinition # # #

train_df_SF["Destination"] = train_df["Destination"].apply(
    lambda x: np.nan if x!=x else (
    "Cancri" if x=="55 Cancri e" else (
        "PSO" if x=="PSO J318.5-22" else "Trappist"))
)
test_df_SF["Destination"] = test_df["Destination"].apply(
    lambda x: np.nan if x!=x else (
    "Cancri" if x=="55 Cancri e" else (
        "PSO" if x=="PSO J318.5-22" else "Trappist"))
)

# # # Expense-features redefinition # # #

for dataset in [train_df_SF, test_df_SF]:
    dataset.rename(columns={
        'RoomService': 'ExpRS',
        'FoodCourt': 'ExpFC',
        'ShoppingMall': 'ExpSM',
        'Spa': 'ExpSpa',
        'VRDeck': 'ExpVR'
        }, inplace=True)

# # # "Name" feature: new features "Name_Last" and "Ocurrence_LastName" # # #

# Training dataset:
train_df_SF['Name_Last'] = train_df['Name'].apply(
    lambda x: np.nan if x!=x else (
        x.split(' ')[-1]))
# Testing dataset:
test_df_SF['Name_Last'] = test_df['Name'].apply(
    lambda x: np.nan if x!=x else (
        x.split(' ')[-1]))
# Count occurrences of every unique value in Name_Last (train and test combined):
ocurrences = pd.concat([train_df_SF['Name_Last'], test_df_SF['Name_Last']]).value_counts().to_dict()
for dataset in [train_df_SF,test_df_SF]:
    dataset['Ocurrence_LastName'] = dataset['Name_Last'].apply(
        lambda x: np.nan if x!=x else ocurrences[x])
# Drop unnecessary feature:
train_df_SF = train_df_SF.drop('Name',axis=1)
test_df_SF = test_df_SF.drop('Name',axis=1)

Below, I summarize all features after the basic engineering process:

| Feature | Definition |
| :---: | :--- |
| **IDgroup** | Indicates the group the passenger is travelling with. People in a group are often family members, but not always. |
| **GroupMembers** | Number of passengers sharing the same **IDgroup** (including self). |
| **Name_Last** | Last name of the passenger. |
| **Ocurrence_LastName** | Number of passengers sharing the same last name (including self). |
| **HomePlanet** | The planet the passenger departed from, typically their planet of permanent residence. |
| **CryoSleep** | Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins. |
| **Destination** | The planet the passenger will be debarking to. |
| **Age** | The age of the passenger. |
| **VIP** | Whether the passenger has paid for special VIP service during the voyage. |
| **Cabin_Deck** | Designation of the deck in which the passenger's cabin is located. |
| **Cabin_num** | Passenger's cabin number. |
| **Cabin_isPort** | Side of the starship in which the passenger's cabin is located: a value 1 means Port, 0 means Starboard. |
| **ExpRS** | Amount the passenger has billed at the Room Service luxury amenity. |
| **ExpFC** | Amount the passenger has billed at the Food Court luxury amenity. |
| **ExpSM** | Amount the passenger has billed at the Shopping Mall luxury amenity. |
| **ExpSpa** | Amount the passenger has billed at the Spa luxury amenity. |
| **ExpVR** | Amount the passenger has billed at the VRDeck luxury amenity. |
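
To make some of these definitions concrete, here is a toy example that derives **IDgroup**, **GroupMembers** and the three cabin features with pandas string methods. It is a sketch of an alternative, vectorized formulation and not the exact code used in this notebook; the `toy` DataFrame and its three rows are made up for illustration, and group sizes are counted within this single toy table rather than across train and test combined.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical rows in the same format as the original features):
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02"],
    "Cabin": ["B/0/P", np.nan, "F/12/S"],
})

# "PassengerId" has the form gggg_pp: the part before "_" identifies the group
toy["IDgroup"] = toy["PassengerId"].str.split("_").str[0].astype(int)
# Number of passengers sharing the same IDgroup (including self)
toy["GroupMembers"] = toy["IDgroup"].map(toy["IDgroup"].value_counts())

# "Cabin" has the form deck/num/side; missing cabins stay missing in all parts
parts = toy["Cabin"].str.split("/", expand=True)
parts.columns = ["Cabin_Deck", "Cabin_num", "Cabin_Side"]
toy = toy.join(parts)
# 1 means Port, 0 means Starboard, NaN if the cabin is unknown
toy["Cabin_isPort"] = toy["Cabin_Side"].map({"P": 1, "S": 0})

print(toy.drop(columns=["PassengerId", "Cabin", "Cabin_Side"]))
```

Note that the missing cabin in the second row propagates to every derived cabin feature, so these columns still need to be corrected later.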

<a id="Manual_data_correction"></a>
# <span style="color:teal;font-weight:bold;">Manual data correction</span>

Xxxx

<a id="Manual_data_correction_Feature_X"></a>
## <span style="color:teal;font-weight:bold;">Feature X</span>

Xxxx

<a id="ML_data_imputation"></a>
# <span style="color:teal;font-weight:bold;">ML data imputation</span>

Xxxx

<a id="Feature_engineering"></a>
# <span style="color:teal;font-weight:bold;">Feature engineering</span>

Xxxx

<a id="Submission_results"></a>
# <span style="color:teal;font-weight:bold;">Submission results</span>

Xxxx

<a id="Conclusions"></a>
# <span style="color:teal;font-weight:bold;">Conclusions</span>

Xxxx