In this notebook, we conduct a thorough EDA on the Spaceship Titanic dataset to investigate patterns and identify which features are likely to be useful for a model. The notebook also includes some straightforward feature engineering.

This is Part 1 of a three-part series.

* Part 1: [Spaceship Titanic - Exploratory Data Analysis](https://www.kaggle.com/code/defcodeking/spaceship-titanic-exploratory-data-analysis) (you are here!)
* Part 2: [Spaceship Titanic - Logistic Regression Baselines](https://www.kaggle.com/code/defcodeking/spaceship-titanic-logistic-regression-baselines)
* Part 3: [Ensembling (And Optuna 😉) Is All You Need!](https://www.kaggle.com/code/defcodeking/ensembling-and-optuna-is-all-you-need)

# Imports


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os

sns.set_theme()
sns.set_style("ticks")
sns.despine()

%matplotlib inline

# Config


In [None]:
DATA_DIR = "../input/spaceship-titanic"

In [None]:
def get_data_path(filename):
    return os.path.join(DATA_DIR, filename)

# Loading Dataset


In [None]:
filepath = get_data_path("train.csv")
df = pd.read_csv(filepath)

In [None]:
df.head()

# Data Summary

The data is summarised below. The main takeaway is that every feature except `PassengerId` has missing values, which may or may not need to be handled. Moreover, the `Name` feature may not be that useful.


In [None]:
df.info()
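
Since `df.info()` reports non-null counts rather than percentages, a per-column missing percentage can be easier to read. Below is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration; the same expression should work on `df`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the training data.
toy = pd.DataFrame(
    {
        "PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02"],
        "HomePlanet": ["Europa", None, "Earth", "Earth"],
        "Age": [39.0, 24.0, np.nan, np.nan],
    }
)

# isna().mean() gives the fraction of missing entries per column.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the columns with the most missing values at the top.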

# Basic Feature Engineering

There are some features that can be engineered without much exploration. Their usability in a model can then be judged.


## Using `PassengerId`

`PassengerId` is formatted as `gggg_pp`, where `gggg` is the ID of the group the passenger was traveling with and `pp` is their number in the group.

> Note: Here, number in the group is not the group size but the position of the passenger in the group.

There are two features that can be extracted from this:

- `GroupId`, which identifies the group a passenger belongs to.
- `GroupSize`, which gives the number of passengers in that group.


In [None]:
df["PassengerId"].head()

In [None]:
# expand=True returns a DataFrame with numerical columns 0, 1, ...
split_id = df["PassengerId"].str.split("_", expand=True)
split_id.head()

In [None]:
df["GroupId"] = split_id[0]
df["GroupSize"] = df.groupby("GroupId")["GroupId"].transform("count")
df.head()

## Using `Cabin`

`Cabin` is formatted as `deck/num/side`, where `deck` is the deck the cabin is on, `num` is the cabin number and `side` is one of `P` or `S`, for port and starboard respectively.

The following features can be extracted from this:

- `CabinDeck`: Deck on which the passenger's cabin is.
- `CabinId`: Combination of the `deck` and `num` components to get a single cabin number, without the side.
- `CabinSide`: Side the cabin is on.


In [None]:
split_cabin = df["Cabin"].str.split("/", expand=True)
split_cabin.head()

In [None]:
df["CabinDeck"] = split_cabin[0]
df.head()

In [None]:
df["CabinId"] = split_cabin[0] + split_cabin[1]
df.head()

In [None]:
df["CabinSide"] = split_cabin[2]
df.head()

## Using Expenditure Columns

The expenditure columns can be summed up together to get a total expenditure of the passenger while on board.


In [None]:
expenditure_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
df["TotalExpense"] = df[expenditure_cols].sum(axis=1)
df.head()
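
Note that `sum(axis=1)` skips missing values by default, so a passenger with every expenditure column missing gets a total of 0 rather than a null. The sketch below illustrates this on hypothetical toy data and shows `min_count=1` as one possible alternative if a null total is preferred in that case; this notebook keeps the default behaviour.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two expenditure columns.
toy = pd.DataFrame(
    {
        "RoomService": [100.0, np.nan, np.nan],
        "Spa": [50.0, 20.0, np.nan],
    }
)

# Default behaviour: NaN is skipped, so an all-NaN row sums to 0.
total_default = toy.sum(axis=1)

# min_count=1 requires at least one non-null value; otherwise the result is NaN.
total_strict = toy.sum(axis=1, min_count=1)

print(total_default.tolist())
print(total_strict.tolist())
```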

In [None]:
# Move Transported to the end
transported = df["Transported"]
df = df.drop("Transported", axis=1)
df["Transported"] = transported
df.head()

# Types of Features

In this section, we figure out which features are categorical/ordinal and which are numerical. This can be achieved by using `value_counts()` on the columns.


## Categorical/Ordinal Features

We can see that the following features are categorical/ordinal:

- `HomePlanet`
- `CryoSleep`
- `Destination`
- `VIP`
- `GroupId`
- `GroupSize`
- `CabinDeck`
- `CabinId`
- `CabinSide`

None of these features is encoded in a model-ready form, so they will require an encoder.


In [None]:
df["HomePlanet"].value_counts()

In [None]:
df["CryoSleep"].value_counts()

In [None]:
df["Destination"].value_counts()

In [None]:
df["VIP"].value_counts()

In [None]:
df["GroupId"].value_counts()

In [None]:
df["GroupSize"].value_counts()

In [None]:
df["CabinDeck"].value_counts()

In [None]:
df["CabinId"].value_counts()

In [None]:
df["CabinSide"].value_counts()

## Numerical Features

All remaining features are numerical.
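
As a rough cross-check of this split, the dtype and number of unique values of each column can be tabulated in one go. The sketch below uses a hypothetical toy frame, and the `max_levels` threshold is an arbitrary assumption, not a rule used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy frame with one string, one integer and one float column.
toy = pd.DataFrame(
    {
        "HomePlanet": ["Earth", "Mars", "Earth", "Europa"],
        "GroupSize": [1, 2, 2, 1],
        "Age": [24.0, 58.0, 33.0, 16.0],
    }
)

# One row per column: its dtype and how many distinct values it takes.
summary = pd.DataFrame(
    {"dtype": toy.dtypes.astype(str), "n_unique": toy.nunique()}
)

# Assumed threshold: columns with few distinct values are candidates
# for being treated as categorical/ordinal.
max_levels = 3
low_cardinality = summary.index[summary["n_unique"] <= max_levels].tolist()
print(summary)
print(low_cardinality)
```

Such a table only flags candidates. High-cardinality identifiers like `GroupId` and `CabinId` are still categorical even though they have many levels.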


# Class Imbalance

There isn't a lot of imbalance in the dataset, which is a good thing. There's an almost even split between number of passengers transported and not transported.

In [None]:
df["Transported"].value_counts()
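
The balance is easier to judge as proportions than as raw counts. A minimal sketch on a hypothetical toy target (on the real data, the same calls would be applied to `df["Transported"]`):

```python
import pandas as pd

# Hypothetical toy target column.
target = pd.Series([True, False, True, True, False, False], name="Transported")

# normalize=True returns class shares instead of counts.
shares = target.value_counts(normalize=True)

# For a boolean target, the mean is the share of True values.
transported_share = target.mean()

print(shares)
print(transported_share)
```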

# Per-feature Insights

## `Age`


### Null values

There are ~2.06% null values in the `Age` feature.


In [None]:
(df["Age"].isna().sum() / len(df)) * 100

An interesting thing to explore is the relationship between these null `Age` values and `VIP` status.


In [None]:
age_df = df[df["Age"].isna()]
age_df.head()

Most of these passengers are not VIPs.


In [None]:
age_df["VIP"].value_counts()

It might also be interesting to see their expenditure while they were on board the spaceship. As can be seen, ~75% of these passengers spent $1092.5 or less while on board.


In [None]:
age_df["TotalExpense"].describe()

It can be seen that most of these passengers didn't spend any money. There are a few outliers where the passengers spent a lot of money.


In [None]:
sns.histplot(x=age_df["TotalExpense"])

The passenger Achira Unhaftimle is the only passenger who spent the maximum amount.


In [None]:
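# Select the row(s) with the maximum total expense instead of hard-coding the value
age_df[age_df["TotalExpense"] == age_df["TotalExpense"].max()]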

An interesting question here is whether this passenger is associated with any VIP passenger. There are three possibilities.

- There is a passenger who is a VIP and has the same surname as this passenger.
- There is a passenger who is a VIP and has a similar ID as this passenger.
- There is no affiliated passenger.

As can be seen below, no one else with this surname was on board and no one with a similar passenger ID was on board. Seems weird that the agency would not record the age of such a high spender, doesn't it?


In [None]:
names = df[["PassengerId", "Name", "VIP"]]
# na=False treats missing names as non-matches
names[names["Name"].str.contains("Unhaftimle", na=False)]

In [None]:
names[names["PassengerId"].str.contains("6348")]

Let's also take a look at the sole VIP member among these passengers. Wow, a Martian!


In [None]:
age_df[age_df["VIP"] == True]

### Summary Statistics

The summary statistics suggest the following:

- Most of the passengers on board were 38 or younger. This might be because younger people are more enthusiastic about space travel and healthy enough for it.
- The variance is not that high.
- There were passengers whose age was recorded as 0 years. Maybe these are newborns.


In [None]:
df["Age"].describe()

An interesting question to ask is what proportion of passengers were aged 65 or older. These people must really enjoy adventure!


In [None]:
df[df["Age"] >= 65].head()

Just ~1.23% of the passengers were what we would call Senior Citizens.


In [None]:
len(df[df["Age"] >= 65]) / len(df) * 100

Let's also take a look at those passengers whose age was recorded as 0 years. An interesting thing to observe would be how many of these passengers were in a group of 1 person.


In [None]:
df[df["Age"] == 0].head()

Only 7 such passengers were on board. Most of them were in a group of 3 people.


In [None]:
df[df["Age"] == 0]["GroupSize"].value_counts()

Let's take a look at just one of these three-person groups. The members share a surname, suggesting they are part of the same family.


In [None]:
g = df[df["GroupSize"] == 3].groupby("GroupId").filter(lambda x: x["Age"].eq(0).any())
g = g.groupby("GroupId")
g.groups

In [None]:
g.get_group("0067")

### Distribution

We visualize `Age` as a density plot. The distribution seems fairly normal with a very slight skew. It likely doesn't need any transformation for non-tree-based algorithms and can be used as is in any tree-based algorithm.


In [None]:
sns.kdeplot(data=df, x="Age", fill=True)

### Distribution by Target

Below, we visualize the distribution of `Age` for each value of `Transported`. There is a very small difference between the two distributions.


In [None]:
sns.kdeplot(data=df, x="Age", hue="Transported", fill=True)

As can be seen, the distributions are almost identical.


In [None]:
sns.histplot(data=df, x="Age", hue="Transported")

### Null Values and Target

There isn't much difference between the transportation rates. This suggests that a missing-value indicator for `Age` may not be a useful feature.

In [None]:
sns.countplot(data=df[df["Age"].isna()], x="Transported")
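
The count plot only shows passengers whose `Age` is missing. To judge a missing-value indicator, it can help to compare the transported rate of those rows with the rate for rows where the value is present. The helper below is a hypothetical sketch (`transported_rate_by_missingness` is not part of the original notebook), demonstrated on toy data.

```python
import numpy as np
import pandas as pd


def transported_rate_by_missingness(frame, column, target="Transported"):
    """Return the mean of `target` for rows where `column` is present vs. missing."""
    is_missing = frame[column].isna().rename(f"{column}_missing")
    return frame.groupby(is_missing)[target].mean()


# Hypothetical toy data: two rows with missing Age, four with Age present.
toy = pd.DataFrame(
    {
        "Age": [np.nan, np.nan, 30.0, 41.0, 19.0, 52.0],
        "Transported": [True, False, True, False, False, False],
    }
)

rates = transported_rate_by_missingness(toy, "Age")
print(rates)
```

If the two rates are close, the indicator carries little information about the target. The same helper could be reused for the other features analyzed in this notebook.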

## Expenditure Features

Since there are only 6 expenditure features, they can be analyzed together.


In [None]:
expenditure_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'TotalExpense']

### Null Values

None of them have more than ~2.4% null values. There are no null values in `TotalExpense` since we decided to skip nulls when calculating it.


In [None]:
(df[expenditure_cols].isna().sum() / len(df)) * 100

### Summary Statistics

The summary statistics highlight the following:

- Most people on board didn't spend a lot of money, since the median is $0 for the individual categories and only $716 for the total expense.
- There was at least one passenger on board who spent almost $36,000 (!) while on board.
- There's a huge variation in the expenditures since all the standard deviations are large.


In [None]:
df[expenditure_cols].describe()

Let's take a look at the passenger who spent almost $36,000. There's only one such passenger. It's interesting that this passenger is not a VIP. They also seem to love food and the spa. In fact, they spent the highest amount of money on the spa among all the passengers! They must be _really_ stressed out.


In [None]:
df[df["TotalExpense"] > 35000]

In [None]:
df[df["Spa"] == df["Spa"].max()]

### Distribution

Clearly, all the expenditure features are heavily right-skewed. This is natural since very few passengers spent a lot of money while on board. For non-tree-based algorithms, they will likely need some sort of transformation that reduces the skew and brings them closer to a normal distribution.


In [None]:
fig, axes = plt.subplots(3, 2, figsize=(10, 10))

# Iterating through axes and names
for name, ax in zip(expenditure_cols, axes.flatten()):
    sns.kdeplot(data=df, x=name, fill=True, ax=ax)

plt.tight_layout()
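
One common option for right-skewed, non-negative data with many zeros is `np.log1p`, which computes `log(1 + x)` and therefore maps 0 to 0. The sketch below uses hypothetical toy values and is only one possible choice; the transformation actually used for modelling may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy spending values: mostly zero with one large outlier.
spend = pd.Series([0.0, 0.0, 0.0, 10.0, 50.0, 200.0, 5000.0], name="TotalExpense")

# log1p compresses large values while keeping zeros at zero.
log_spend = np.log1p(spend)

# Compare the sample skewness before and after the transformation.
print(spend.skew(), log_spend.skew())
```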

### Distribution by Target

In all the cases, there seems to be a significant difference between transportation chances based on expenditure. Most of the people who did not spend a lot of money were transported. It also shows that there is much more variability in the expense of people who were not transported than in those who were.

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(10, 10))

# Iterating through axes and names
for name, ax in zip(expenditure_cols, axes.flatten()):
    sns.kdeplot(data=df, x=name, fill=True, hue="Transported", ax=ax)

plt.tight_layout()

In [None]:
false_std = df[df["Transported"] == False]["TotalExpense"].std()
true_std = df[df["Transported"] == True]["TotalExpense"].std()
false_std, true_std

### Null Values And Target

There are no null values in `TotalExpense`. Let's look at null values in the other expenditure columns. Clearly, only `RoomService`, `FoodCourt` and `ShoppingMall` show some difference in transportation rates. In the case of `Spa`, the difference is just 1 passenger, and in the case of `VRDeck`, it is just 8 passengers.

This suggests that, if we were planning to use the presence of missing values as a feature, using `VRDeck` and `Spa` may not add much to the prediction power of a model.

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(13, 10))

# Iterating through axes and names
for name, ax in zip(expenditure_cols, axes.flatten()):
    if name == "TotalExpense":
        continue
    sns.countplot(data=df[df[name].isna()], x="Transported", ax=ax)
    ax.set_title(name)

fig.delaxes(axes.flatten()[-1])
plt.tight_layout()

In [None]:
df[df["Spa"].isna()]["Transported"].value_counts()

In [None]:
df[df["VRDeck"].isna()]["Transported"].value_counts()

## `CabinDeck`

### Null Values

There are ~2.3% null values in the feature.

In [None]:
(df["CabinDeck"].isna().sum() / len(df)) * 100

### Summary statistics

The most common deck is `F`, with 2794 passengers.

In [None]:
df["CabinDeck"].describe()

### Distribution by Target

According to the plot, most passengers on decks `B` and `G` were transported, but the difference is smaller for deck `G`.

In [None]:
sns.catplot(data=df, x="CabinDeck", hue="Transported", kind="count")
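
Count plots make it hard to compare decks of very different sizes. A per-deck transported rate alongside the deck size can complement the plot. A minimal sketch on hypothetical toy data (the column names match the notebook, the values are made up):

```python
import pandas as pd

# Hypothetical toy data with three decks of different sizes.
toy = pd.DataFrame(
    {
        "CabinDeck": ["B", "B", "B", "F", "F", "F", "F", "G", "G"],
        "Transported": [True, True, False, False, False, True, False, True, False],
    }
)

# The mean of a boolean column is the share of True values;
# size gives the number of passengers on each deck.
deck_summary = toy.groupby("CabinDeck")["Transported"].agg(rate="mean", count="size")
print(deck_summary)
```

Keeping the count next to the rate helps avoid over-reading rates that come from very small decks.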

## `GroupSize`

In [None]:
df["GroupSize"] = df["GroupSize"].astype("category")

### Null values

There are no null values in `GroupSize` since this is derived from `PassengerId`.

In [None]:
(df["GroupSize"].isna().sum() / len(df)) * 100

### Summary Statistics

Most passengers (4805) were traveling alone.

In [None]:
df["GroupSize"].describe()

### Distribution by Target

Since most of the passengers were traveling alone, they dominate the share of passengers who were transported. Most of those who were traveling alone were not transported. But, as soon as we get to a group of size 2 or more, we see that passengers were more likely to be transported, except for groups of size 8.

In [None]:
sns.catplot(data=df, x="GroupSize", hue="Transported", kind="count")

## `CryoSleep`

### Null values

There are ~2.5% null values in this feature.

In [None]:
(df["CryoSleep"].isna().sum() / len(df)) * 100

Let's take a look at the VIP status of these null values. Most of them were non-VIP passengers, with only 3 VIP passengers. It's interesting how the feature was not recorded for these 3 VIP passengers.

In [None]:
df[df["CryoSleep"].isna()]["VIP"].value_counts()

### Summary Statistics

Most of the passengers (5439) did not opt for cryosleep.

In [None]:
df["CryoSleep"].describe()

Let's take a look at the expenditure of these passengers. As expected, passengers in cryosleep did not spend any money.

In [None]:
df.groupby("CryoSleep")["TotalExpense"].describe()

### Distribution by Target

Most of the passengers who did not opt for cryosleep were not transported while most of those who did were transported. This might be because the passengers in cryosleep were confined to their cabins and wouldn't have been able to take any action to save themselves.

In [None]:
sns.catplot(data=df, x="CryoSleep", kind="count", hue="Transported")

There were 554 passengers who were in cryosleep but still were not transported. Since there isn't any direct way of ascertaining how these passengers survived, we can just assume that they were lucky. It is also possible that they were woken up by someone. 

In [None]:
# CryoSleep contains nulls, so compare explicitly with True (nulls become False)
len(df[df["CryoSleep"].eq(True) & (~df["Transported"])])

### Null Values and Target

Among passengers with a null `CryoSleep`, the transported and not-transported counts differ by just 5 passengers. A missing-value indicator for `CryoSleep` may not be that helpful.

In [None]:
sns.countplot(data=df[df["CryoSleep"].isna()], x="Transported")

In [None]:
df[df["CryoSleep"].isna()]["Transported"].value_counts()

## `HomePlanet`

### Null Values

There are ~2.3% null values in the feature. So far, looking at the trends in other features, it seems likely that these null values have been artificially created and hence, follow a uniform distribution across features.

In [None]:
(df["HomePlanet"].isna().sum() / len(df)) * 100

### Summary Statistics

Most of the passengers (4602) were traveling from `Earth`.

In [None]:
df["HomePlanet"].describe()

### Distribution by Target

More passengers from `Earth` were not transported. The difference in transportation rates for passengers from `Mars` is not that significant.

In [None]:
sns.catplot(data=df, x="HomePlanet", hue="Transported", kind="count")

### Null Values and Target

The difference in transportation rates based on null values is not that significant, with just 5 more passengers being transported than not transported. A missing-value indicator for `HomePlanet` may not be that useful.

In [None]:
sns.countplot(data=df[df["HomePlanet"].isna()], x="Transported")

In [None]:
df[df["HomePlanet"].isna()]["Transported"].value_counts()

## `Destination`

### Null Values

There are ~2.1% null values in the feature.

In [None]:
(df["Destination"].isna().sum() / len(df)) * 100

### Summary Statistics

Most of the passengers (5915) were traveling to `TRAPPIST-1e`.

In [None]:
df["Destination"].describe()

### Distribution by Target

Among the passengers traveling to `55 Cancri e`, most were transported. The difference is not that significant for passengers traveling to `PSO J318.5-22`. Among those traveling to `TRAPPIST-1e`, most were not transported.

In [None]:
sns.catplot(data=df, x="Destination", hue="Transported", kind="count")

### Null Values and Target

There isn't much difference between the transportation rates. `Destination` may not be a good candidate when using presence of missing values as a feature.

In [None]:
sns.countplot(data=df[df["Destination"].isna()], x="Transported")

## `CabinSide`

### Null Values

There are ~2.3% null values in the feature.

In [None]:
(df["CabinSide"].isna().sum() / len(df)) * 100

### Summary Statistics

Passengers were split almost 50-50 between cabins on the starboard and port sides.

In [None]:
df["CabinSide"].describe()

### Distribution by Target

Among those transported, most were passengers with cabins on the starboard (`S`) side. Port (`P`) side passengers were less likely to be transported. 

In [None]:
sns.catplot(data=df, x="CabinSide", hue="Transported", kind="count")

### Null values and Target

There's almost no difference between transportation rates based on null values, suggesting that `CabinSide` is not a good candidate when using presence of missing values as a feature.

In [None]:
sns.countplot(data=df[df["CabinSide"].isna()], x="Transported")

In [None]:
df[df["CabinSide"].isna()]["Transported"].value_counts()

## `VIP`

### Null Values

There are ~2.3% null values in the feature.

In [None]:
(df["VIP"].isna().sum() / len(df)) * 100

### Summary Statistics

There were very few VIPs on board. More than 97% of the passengers were non-VIPs.

In [None]:
df["VIP"].describe()

### Distribution by Target

VIPs were slightly less likely to be transported while it was the opposite for non-VIPs.

In [None]:
sns.catplot(data=df, x="VIP", hue="Transported", kind="count")

### Null Values and Target

There's almost no difference between transportation rates based on null values, making this a poor candidate for using presence of missing values as a feature.

In [None]:
sns.countplot(data=df[df["VIP"].isna()], x="Transported")

In [None]:
df[df["VIP"].isna()]["Transported"].value_counts()

## `GroupId`

`GroupId` is a high-cardinality categorical feature, with 6217 levels. It doesn't make sense to try to plot this feature. A better way to judge its usability is to train the same model once with the feature and once without it, and see which one yields better results.

### Null Values

There are no null values in `GroupId` since it is derived from `PassengerId`.

In [None]:
(df["GroupId"].isna().sum() / len(df)) * 100

### Summary Statistics

This doesn't really give us much insight since there are multiple group IDs with a frequency of 8. What it tells is something we already know: The largest groups had 8 passengers in them.

In [None]:
df["GroupId"].astype("category").describe()

## `CabinId`

`CabinId` is a high-cardinality categorical feature, with 4453 levels. Similar to `GroupId`, a better way to judge its usability is to train the same model once with the feature and once without it, and see which one yields better results.

### Null Values

There are ~2.3% null values in the feature.

In [None]:
(df["CabinId"].isna().sum() / len(df)) * 100

### Summary Statistics

This tells us that the most frequent cabin ID was shared by 11 passengers.

In [None]:
df["CabinId"].describe()

The number of people with the same cabin ID could also become a feature.

In [None]:
df["CabinOccupancy"] = df.groupby("CabinId")["CabinId"].transform("count")
df["CabinOccupancy"] = df["CabinOccupancy"].astype("category")
df["CabinOccupancy"].value_counts()

The plot below shows that there are differences between transportation rates based on this feature.

In [None]:
sns.catplot(data=df, x="CabinOccupancy", hue="Transported", kind="count")

# Feature-to-feature Interactions

## Correlation

Correlation is only relevant for numerical features. None of the numerical features are strongly correlated, except for `TotalExpense` and `FoodCourt`, `TotalExpense` and `Spa`, and `TotalExpense` and `VRDeck`, all of which are relatively strongly positively correlated.

In [None]:
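# Keep only numeric columns; pandas 2.0+ raises an error if corr() sees non-numeric columns.
# This also excludes the boolean Transported column.
corr = df.select_dtypes(include="number").corr()

# Boolean mask that hides the upper triangle (including the diagonal)
mask = np.triu(np.ones_like(corr, dtype=bool))

f, ax = plt.subplots(figsize=(7, 5))
sns.heatmap(corr, mask=mask, annot=True, linewidths=.5, square=True)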

## `TotalExpense` and Its Correlated Columns

In [None]:
correlated = ["FoodCourt", "Spa", "VRDeck"]

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Iterating through axes and names
for name, ax in zip(correlated, axes.flatten()):
    sns.scatterplot(data=df, x="TotalExpense", y=name, hue="Transported", ax=ax)

plt.tight_layout()

## `Age` and Expenditure Features

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(13, 10))

# Iterating through axes and names
for name, ax in zip(expenditure_cols, axes.flatten()):
    sns.scatterplot(data=df, x="Age", y=name, hue="Transported", ax=ax)

plt.tight_layout()

## `Age` and `VIP`

There is more variability in the age of non-VIP passengers than in that of VIP passengers, since their interquartile range (the box) is larger. There are also more outliers, as is evident from the larger cluster of points above the box. Moreover, VIP passengers who were transported were slightly older than those who were not. Meanwhile, ages are almost the same for non-VIP passengers across transported status.

In [None]:
f, ax = plt.subplots(figsize=(7, 5))
sns.boxplot(data=df, x="VIP", y="Age", hue="Transported")

## `Age` and `GroupSize`

Groups of size 1, 2 and 8 passengers have almost the same variability in age across transported status. But, there are significant differences in other group sizes.

In [None]:
f, ax = plt.subplots(figsize=(10, 7))
sns.boxplot(data=df, x="GroupSize", y="Age", hue="Transported")

## `Age` and `CabinOccupancy`

Notice the huge difference in variability in age across transportation status for cabin IDs with 11 passengers. People who were transported were much younger than those who were not.

In [None]:
f, ax = plt.subplots(figsize=(10, 7))
sns.boxplot(data=df, x="CabinOccupancy", y="Age", hue="Transported")

In [None]:
df[df["CabinOccupancy"] == 11].groupby("Transported")["Age"].describe()

## `Age` and `CabinDeck`

Deck `G` seemed to have the youngest passengers on board.

In [None]:
f, ax = plt.subplots(figsize=(10, 5))
sns.boxplot(data=df, x="CabinDeck", y="Age", hue="Transported")

## `Age` and `CryoSleep`

The distribution of age across cryosleep status and transported status is almost identical.

In [None]:
f, ax = plt.subplots(figsize=(7, 5))
sns.boxplot(data=df, x="CryoSleep", y="Age", hue="Transported")

## `CabinDeck` and `TotalExpense`

In [None]:
f, ax = plt.subplots(figsize=(7, 7))
sns.boxplot(data=df, x="CabinDeck", y="TotalExpense", hue="Transported")

In [None]:
order = ["A", "B", "C", "D", "E", "F", "G", "T"]
g = sns.FacetGrid(
    data=df[df["CabinDeck"].notna()],
    col="CabinDeck",
    hue="Transported",
    col_wrap=4,
    col_order=order,
)
g.map(sns.kdeplot, "TotalExpense", fill=True)
g.add_legend()

It's visible that the peak observed in `TotalExpense` for people who were transported (see `Distribution by Target` section in `Expenditure Features`) is mostly contributed by passengers on deck `G`. This tells us, based on the dependence of transportation chances on expenditure, that `G` is a more modest deck, possibly occupied by low spenders.

As can be seen in the summary below, deck `G` has the lowest mean and median expenditure. The next is `F`, and the two plots above show that together they dominate the number of people transported.
 
Moreover, `T`, the deck with no passengers who were transported, has the highest mean and median expenditure.

In [None]:
df.groupby("CabinDeck").agg({"TotalExpense": ["mean", "median"]})

## `Destination` and Expenditure Features

Overall, people traveling to `55 Cancri e` spent more than passengers headed to other destinations.

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(13, 10), sharex=True)

# Iterating through axes and names
for name, ax in zip(expenditure_cols, axes.flatten()):
    sns.boxplot(data=df, x="Destination", y=name, hue="Transported", ax=ax)

plt.tight_layout()

## `HomePlanet` and Expenditure Features

Overall, passengers from Europa spent more than passengers from other planets.

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(13, 10), sharex=True)

# Iterating through axes and names
for name, ax in zip(expenditure_cols, axes.flatten()):
    sns.boxplot(data=df, x="HomePlanet", y=name, hue="Transported", ax=ax)

plt.tight_layout()

# Conclusion

This has been an extremely detailed EDA into the dataset. Some observations:

- There are obvious features that can be engineered without any EDA like `GroupId` and `GroupSize`.
- All features except `PassengerId` have null values. These were most probably randomly added to the dataset, which might explain their uniform distribution across different features. We either need to impute these or use a model that handles missing values natively (like XGBoost).
- Not all features are equally good candidates for missing-value indicator features.
- The variables are not strongly correlated, which possibly means that a good model with proper hyperparameters is likely to generalize well.
- Features like `GroupId` and `CabinId` can't really be judged using visualization due to their dense nature. We need to train models with and without these features to see their usefulness.
- There are still other features that can be engineered. For example, the surnames can be extracted from the given names, the gender can be guessed using the names, families can be grouped according to surnames, etc.
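
As an illustration of the last point, the surname can be taken as the last whitespace-separated token of `Name`. This is a minimal sketch on hypothetical toy rows; `Surname` and `FamilySize` are assumed feature names that are not created anywhere in this notebook.

```python
import pandas as pd

# Hypothetical toy rows in the "First Last" name format.
toy = pd.DataFrame(
    {"Name": ["Maham Ofracculy", "Juanna Vines", "Altark Susent", "Solam Susent"]}
)

# Take the last whitespace-separated token as the surname.
# Rows with a missing Name would get a missing Surname.
toy["Surname"] = toy["Name"].str.split().str[-1]

# Number of passengers sharing each surname.
toy["FamilySize"] = toy.groupby("Surname")["Surname"].transform("count")
print(toy)
```

Unrelated passengers can share a surname, so combining it with `GroupId` may give a more reliable family signal.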

Thank you for reading!