# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [102]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [4]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


**Check the shape of your data**

In [6]:
#your code here
spaceship.shape

(8693, 14)

**Check for data types**

In [8]:
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


**Check for missing values**

In [10]:
#your code here
spaceship.isna().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

There are multiple strategies to handle missing data

- Removing all rows or all columns containing missing data.
- Filling all missing values with a fixed value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values with an algorithm.

For this exercise, because we have so few null values, we will drop rows containing any missing value.
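The three strategies listed above can be sketched on a small toy frame. The values below are hypothetical, and `KNNImputer` is just one possible choice for the "algorithm" option, not necessarily the one intended by the lab.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame (hypothetical values) with one numeric and one categorical column.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "HomePlanet": ["Earth", "Mars", None, "Earth"],
})

# 1) Drop every row that contains at least one missing value.
dropped = toy.dropna()

# 2) Fill with a statistic: mean for the numeric column, mode for the categorical one.
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

# 3) Fill with an algorithm: KNNImputer estimates each gap from the most similar rows.
num = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Spa": [0.0, 10.0, 500.0, 20.0],
})
imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(num), columns=num.columns
)

print(dropped.shape, filled.isna().sum().sum(), imputed.isna().sum().sum())
```

Dropping rows is the simplest option but discards data; the two filling options keep every row at the cost of introducing estimated values.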

In [12]:
#your code here
spaceship['Cabin'].unique()

array(['B/0/P', 'F/0/S', 'A/0/S', ..., 'G/1499/S', 'G/1500/S', 'E/608/S'],
      dtype=object)

- **Cabin** is too granular. Transform it to keep only the deck letter, so that the values are {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [14]:
# Keep only the first character of each Cabin value (the deck letter), using .apply; missing values stay missing.
spaceship['Cabin']=spaceship['Cabin'].apply(lambda x : x[0] if pd.notna(x) else None)
spaceship['Cabin'].unique()

array(['B', 'F', 'A', 'G', None, 'E', 'D', 'C', 'T'], dtype=object)
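As an alternative to the `.apply` call above, the pandas string accessor can do the same extraction without an explicit lambda. This is a minimal sketch on a hypothetical sample Series; the column names `Deck`, `Num` and `Side` are chosen here for illustration only.

```python
import pandas as pd

# Hypothetical sample in the same deck/num/side format as the Cabin column.
cabin = pd.Series(["B/0/P", "F/0/S", None, "G/1499/S"])

# .str[0] takes the first character and leaves missing values missing.
deck = cabin.str[0]

# .str.split can recover all three parts if they are needed later.
parts = cabin.str.split("/", expand=True)
parts.columns = ["Deck", "Num", "Side"]

print(deck.tolist())
print(parts)
```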

In [15]:
spaceship.isna().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

- Drop PassengerId and Name

In [17]:
#your code here
df_spaceship_new=spaceship.copy()
df_spaceship_new=df_spaceship_new.drop(columns=['PassengerId' ,'Name'])

In [18]:
df_spaceship_new_clean = df_spaceship_new.dropna(subset=['Age','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck','HomePlanet','CryoSleep','Cabin','Destination','VIP'])

In [19]:
df_spaceship_new_clean.isna().sum()

HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Transported     0
dtype: int64

In [20]:
df_spaceship_new_clean.head()

| | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Europa | False | B | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False |
| 1 | Earth | False | F | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True |
| 2 | Europa | False | A | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False |
| 3 | Europa | False | A | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False |
| 4 | Earth | False | F | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True |


In [21]:
df_spaceship_new_clean.shape

(6764, 12)

In [22]:
# define features as X
X=df_spaceship_new_clean.drop(columns='Transported')

In [23]:
# define target as y
y=df_spaceship_new_clean['Transported']

**Perform Train Test Split**

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=0)
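When the target classes are unbalanced, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both splits. The sketch below uses a hypothetical target with 20% `True` values; whether stratification is needed for this lab's data is not stated in the source.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 False values and 20 True values.
X_demo = pd.DataFrame({"feature": range(100)})
y_demo = pd.Series([False] * 80 + [True] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo
)

# With stratify, both splits keep the same share of True values.
print(y_tr.mean(), y_te.mean())
```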

## Convert the categorical columns below to numerical values

In [26]:
# Number of unique values in each categorical column (a notebook cell displays only the last expression's result)
df_spaceship_new['HomePlanet'].nunique()
df_spaceship_new['CryoSleep'].nunique()
df_spaceship_new['Cabin'].nunique()
df_spaceship_new['Destination'].nunique()
df_spaceship_new['VIP'].nunique()

2

In [27]:
# One-hot encode CryoSleep, VIP, HomePlanet and Destination
from sklearn.preprocessing import OneHotEncoder


ohe = OneHotEncoder(sparse_output=False)

ohe.fit(X_train[['CryoSleep','VIP','HomePlanet', 'Destination']])
X_train_trans_np = ohe.transform(X_train[['CryoSleep','VIP','HomePlanet', 'Destination']])
X_train_trans_np

array([[1., 0., 1., ..., 1., 0., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       [0., 1., 1., ..., 1., 0., 0.],
       ...,
       [1., 0., 1., ..., 1., 0., 0.],
       [1., 0., 1., ..., 0., 0., 1.],
       [1., 0., 1., ..., 0., 0., 1.]])

In [28]:
X_train_trans_df = pd.DataFrame(X_train_trans_np, columns=ohe.get_feature_names_out(), index=X_train.index)
X_train_trans_df

| | CryoSleep_False | CryoSleep_True | VIP_False | VIP_True | HomePlanet_Earth | HomePlanet_Europa | HomePlanet_Mars | Destination_55 Cancri e | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e |
|---|---|---|---|---|---|---|---|---|---|---|
| 6201 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 6106 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 7095 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 3060 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4101 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6362 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4216 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2153 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3385 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


In [29]:
# Map the ordinal categorical Cabin column (deck letter) to integer codes
deck_mapping = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7, 'T': 8} 
X_train['Cabin_Deck'] = X_train['Cabin'].map(deck_mapping)
X_test['Cabin_Deck'] = X_test['Cabin'].map(deck_mapping)
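The dictionary mapping above treats the deck letters as ordered. The same idea could be expressed with scikit-learn's `OrdinalEncoder` by passing the order explicitly; this is a sketch on hypothetical toy data, and note that the resulting codes start at 0 rather than 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Deck order assumed to match the mapping used in the notebook.
decks = ["A", "B", "C", "D", "E", "F", "G", "T"]
demo = pd.DataFrame({"Cabin": ["E", "G", "A", "T"]})

# Passing categories fixes the order explicitly instead of relying on sorting.
oe = OrdinalEncoder(categories=[decks])
demo["Cabin_Deck"] = oe.fit_transform(demo[["Cabin"]]).ravel()

print(demo)
```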

In [30]:
X_train.head()

| | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Cabin_Deck |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6201 | Europa | False | E | 55 Cancri e | 38.0 | False | 0.0 | 201.0 | 0.0 | 1308.0 | 4085.0 | 5 |
| 6106 | Earth | True | G | PSO J318.5-22 | 38.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7 |
| 7095 | Mars | True | F | 55 Cancri e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6 |
| 3060 | Earth | False | G | 55 Cancri e | 0.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7 |
| 4101 | Mars | False | E | TRAPPIST-1e | 42.0 | False | 954.0 | 0.0 | 1.0 | 0.0 | 0.0 | 5 |


In [31]:
# Drop the original categorical columns from X_train, since they have been transformed.
X_train_numeric = X_train.drop(columns=['CryoSleep','VIP','HomePlanet', 'Destination','Cabin'])

In [32]:
X_train_numeric

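| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Cabin_Deck |
|---|---|---|---|---|---|---|---|
| 6201 | 38.0 | 0.0 | 201.0 | 0.0 | 1308.0 | 4085.0 | 5 |
| 6106 | 38.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7 |
| 7095 | 39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6 |
| 3060 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7 |
| 4101 | 42.0 | 954.0 | 0.0 | 1.0 | 0.0 | 0.0 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 6362 | 48.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 4216 | 46.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7 |
| 2153 | 27.0 | 0.0 | 12077.0 | 0.0 | 75.0 | 0.0 | 3 |
| 3385 | 46.0 | 105.0 | 335.0 | 0.0 | 72.0 | 388.0 | 6 |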


In [33]:
# Concatenate the numeric columns with the one-hot encoded columns
X_train_final=pd.concat([X_train_numeric, X_train_trans_df], axis=1)

In [34]:
X_train_final

| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Cabin_Deck | CryoSleep_False | CryoSleep_True | VIP_False | VIP_True | HomePlanet_Earth | HomePlanet_Europa | HomePlanet_Mars | Destination_55 Cancri e | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6201 | 38.0 | 0.0 | 201.0 | 0.0 | 1308.0 | 4085.0 | 5 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 6106 | 38.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 7095 | 39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 3060 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4101 | 42.0 | 954.0 | 0.0 | 1.0 | 0.0 | 0.0 | 5 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6362 | 48.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4216 | 46.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2153 | 27.0 | 0.0 | 12077.0 | 0.0 | 75.0 | 0.0 | 3 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3385 | 46.0 | 105.0 | 335.0 | 0.0 | 72.0 | 388.0 | 6 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


- For non-numerical columns, create dummy variables.
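Dummy variables can also be produced directly with `pd.get_dummies`. Because it has no separate fit step, the test columns have to be aligned to the training columns by hand. The sketch below shows one way to do that on hypothetical toy data; it is not the approach used in this notebook, which relies on `OneHotEncoder`.

```python
import pandas as pd

# Hypothetical toy splits: the test split contains a destination unseen in training.
train = pd.DataFrame({"Destination": ["TRAPPIST-1e", "55 Cancri e", "TRAPPIST-1e"]})
test = pd.DataFrame({"Destination": ["PSO J318.5-22", "TRAPPIST-1e"]})

train_d = pd.get_dummies(train, columns=["Destination"], dtype=float)

# Align test to the training columns: missing columns are added as 0,
# and columns that only exist in test are dropped.
test_d = pd.get_dummies(test, columns=["Destination"], dtype=float).reindex(
    columns=train_d.columns, fill_value=0.0
)

print(test_d)
```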

In [75]:
#your code here
# Apply the same encoding to the test set: reuse the encoder fitted on the training data (transform only, no refit).

X_test_trans_np = ohe.transform(X_test[['CryoSleep','VIP','HomePlanet', 'Destination']])
X_test_trans_np

array([[0., 1., 1., ..., 0., 0., 1.],
       [0., 1., 1., ..., 0., 0., 1.],
       [0., 1., 1., ..., 0., 1., 0.],
       ...,
       [1., 0., 1., ..., 1., 0., 0.],
       [0., 1., 1., ..., 0., 0., 1.],
       [1., 0., 1., ..., 1., 0., 0.]])

In [79]:
X_test_trans_df = pd.DataFrame(X_test_trans_np, columns=ohe.get_feature_names_out(), index=X_test.index)
X_test_trans_df

| | CryoSleep_False | CryoSleep_True | VIP_False | VIP_True | HomePlanet_Earth | HomePlanet_Europa | HomePlanet_Mars | Destination_55 Cancri e | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e |
|---|---|---|---|---|---|---|---|---|---|---|
| 3476 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2602 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 5359 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4701 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3649 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7160 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 6521 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 39 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2408 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |


In [81]:
X_test.head()

| | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Cabin_Deck |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3476 | Europa | True | B | TRAPPIST-1e | 29.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 |
| 2602 | Earth | True | G | TRAPPIST-1e | 31.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7 |
| 5359 | Earth | True | G | PSO J318.5-22 | 20.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7 |
| 4701 | Earth | False | F | TRAPPIST-1e | 23.0 | False | 0.0 | 0.0 | 2102.0 | 0.0 | 0.0 | 6 |
| 3649 | Earth | True | G | TRAPPIST-1e | 9.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7 |


In [85]:
# Drop the original categorical columns from X_test, since they have been transformed.
X_test_numeric = X_test.drop(columns=['CryoSleep','VIP','HomePlanet', 'Destination','Cabin'])

In [89]:
# Concatenate the numeric columns with the one-hot encoded columns
X_test_final=pd.concat([X_test_numeric, X_test_trans_df], axis=1)

In [91]:
X_test_final

| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Cabin_Deck | CryoSleep_False | CryoSleep_True | VIP_False | VIP_True | HomePlanet_Earth | HomePlanet_Europa | HomePlanet_Mars | Destination_55 Cancri e | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3476 | 29.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2602 | 31.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 5359 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4701 | 23.0 | 0.0 | 0.0 | 2102.0 | 0.0 | 0.0 | 6 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3649 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7160 | 21.0 | 618.0 | 33.0 | 1.0 | 2.0 | 0.0 | 6 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 6521 | 32.0 | 0.0 | 326.0 | 31.0 | 5.0 | 545.0 | 7 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 39 | 20.0 | 554.0 | 195.0 | 0.0 | 2606.0 | 0.0 | 6 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2408 | 29.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |


**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [114]:
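#your code here
# Note: Transported is boolean, so KNeighborsRegressor predicts the share of
# neighbours that were transported, and score() returns R^2 rather than accuracy.
knn = KNeighborsRegressor(n_neighbors=16)
knn.fit(X_train_final, y_train)

print(f"The value of R2 on the TEST set is: {knn.score(X_test_final, y_test): .2f}")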

The value of R2 on the TEST set is:  0.45
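Because `Transported` is a boolean target, a KNN classifier may be a more natural fit than the regressor used above, and since KNN relies on distances, scaling the features first usually matters. The sketch below uses synthetic stand-in data (not the Spaceship Titanic data) and combines `MinMaxScaler` with `KNeighborsClassifier`; the resulting accuracy says nothing about what this lab's data would give.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two features on very different scales and a boolean target.
X_demo = np.column_stack([rng.uniform(0, 80, 200), rng.uniform(0, 10000, 200)])
y_demo = X_demo[:, 0] > 40

# The pipeline fits the scaler on the training rows only, then the classifier.
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=16))
model.fit(X_demo[:150], y_demo[:150])

# For classifiers, score() returns accuracy (share of correct predictions).
accuracy = model.score(X_demo[150:], y_demo[150:])
print(f"Accuracy on the held-out rows: {accuracy:.2f}")
```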


- Evaluate your model's performance and comment on it.

In [43]:
#your code here
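One possible way to evaluate a binary prediction task like this one is with accuracy and a confusion matrix. The sketch below uses small hypothetical label lists rather than the notebook's predictions, so the numbers are illustrative only.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for six passengers.
y_true = [True, False, True, True, False, False]
y_pred = [True, False, False, True, False, True]

# Accuracy: share of predictions that match the true label.
acc = accuracy_score(y_true, y_pred)

# Confusion matrix: rows are true classes (False, True), columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(f"Accuracy: {acc:.2f}")
print(cm)
```

The off-diagonal cells of the matrix show which kind of mistake the model makes more often, which accuracy alone does not reveal.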