# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [59]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [5]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


**Check the shape of your data**

In [7]:
#displaying df info
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB
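
`info()` reports the shape indirectly (8693 entries, 14 columns). As a minimal sketch, the `shape` attribute gives the same information as a `(rows, columns)` tuple. The small frame below is invented for illustration and is not the lab data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Mars"],
    "Age": [39.0, 24.0, 58.0],
})

# shape is an attribute (no parentheses) holding (rows, columns)
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```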


**Check for data types**

In [9]:
#displaying df data types
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

**Check for missing values**

In [11]:
#check for duplicates
spaceship.duplicated().sum()

0

In [13]:
#identify missing / null values
spaceship.isnull().sum()
# each affected column has roughly 200 missing values out of 8693 rows

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values with an algorithm.

For this exercise, because each column has only a small number of null values, we will drop rows containing any missing value.

In [15]:
#dropping null values
spaceship.dropna(inplace=True)

In [17]:
#checking null values after drop (expected 0)
spaceship.isnull().sum()

PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Name            0
Transported     0
dtype: int64
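
The lab drops incomplete rows, but the filling strategy from the list above can be sketched as follows, assuming a mean fill for a numeric column and a mode fill for a categorical one. The toy frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "HomePlanet": ["Earth", "Earth", None],
})

filled = toy.copy()
# Numeric column: replace missing values with the column mean
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
# Categorical column: replace missing values with the most frequent value
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

# For comparison: dropping rows keeps only fully populated rows
dropped = toy.dropna()

print(filled)
print(f"rows kept by dropna: {len(dropped)} of {len(toy)}")
```

Filling keeps every row at the cost of introducing estimated values, while dropping keeps only observed values at the cost of fewer rows.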

- **Cabin** is too granular - transform it in order to obtain {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [27]:
# taking only first character of values in column cabin
spaceship['Cabin'] = spaceship['Cabin'].str[0]
spaceship.head(30)

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True
5,0005_01,Earth,False,F,PSO J318.5-22,44.0,False,0.0,483.0,0.0,291.0,0.0,Sandie Hinetthews,True
6,0006_01,Earth,False,F,TRAPPIST-1e,26.0,False,42.0,1539.0,3.0,0.0,0.0,Billex Jacostaffey,True
8,0007_01,Earth,False,F,TRAPPIST-1e,35.0,False,0.0,785.0,17.0,216.0,0.0,Andona Beston,True
9,0008_01,Europa,True,B,55 Cancri e,14.0,False,0.0,0.0,0.0,0.0,0.0,Erraiam Flatic,True
11,0008_03,Europa,False,B,55 Cancri e,45.0,False,39.0,7295.0,589.0,110.0,124.0,Wezena Flatic,True


In [25]:
unique_values = spaceship['Cabin'].unique()
print(unique_values)

['B' 'F' 'A' 'G' 'E' 'C' 'D' 'T']
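
The `Cabin` values appear to follow a `deck/num/side` pattern (for example `B/0/P`), so the first character is the deck letter. A minimal sketch on made-up values, showing that `.str[0]` and splitting on `/` give the same result:

```python
import pandas as pd

# Made-up sample values in the same format as the Cabin column
cabin = pd.Series(["B/0/P", "F/1/S", "A/0/S"])

# Option 1: first character of each string
deck_by_index = cabin.str[0]
# Option 2: first field before the slash
deck_by_split = cabin.str.split("/").str[0]

print(deck_by_index.tolist())
print(deck_by_split.tolist())
```

Splitting is the more defensive choice if a deck label could ever be longer than one character; for the eight single-letter decks found here, both options are equivalent.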


- Drop PassengerId and Name

In [29]:
#dropping passenger ID & name
spaceship.drop(['PassengerId', 'Name'], axis=1, inplace=True)
spaceship.head(30)

Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False
1,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True
2,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False
3,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False
4,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True
5,Earth,False,F,PSO J318.5-22,44.0,False,0.0,483.0,0.0,291.0,0.0,True
6,Earth,False,F,TRAPPIST-1e,26.0,False,42.0,1539.0,3.0,0.0,0.0,True
8,Earth,False,F,TRAPPIST-1e,35.0,False,0.0,785.0,17.0,216.0,0.0,True
9,Europa,True,B,55 Cancri e,14.0,False,0.0,0.0,0.0,0.0,0.0,True
11,Europa,False,B,55 Cancri e,45.0,False,39.0,7295.0,589.0,110.0,124.0,True


- For non-numerical columns, create dummy variables.

In [49]:
spaceship.dtypes

HomePlanet       object
CryoSleep          bool
Cabin            object
Destination      object
Age             float64
VIP                bool
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Transported        bool
dtype: object

In [37]:
#although its dtype was object when loaded, CryoSleep only contains boolean values
spaceship['CryoSleep'].unique()

array([False, True], dtype=object)

In [41]:
#casting CryoSleep to boolean so that it does not need get_dummies
spaceship['CryoSleep'] = spaceship['CryoSleep'].astype(bool)

In [45]:
#same approach for VIP
spaceship['VIP'].unique()

array([False, True], dtype=object)

In [47]:
spaceship['VIP'] = spaceship['VIP'].astype(bool)
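
The cast is safe here only because rows with missing values were already dropped. As a hedged illustration on invented data, an object column that still contains `NaN` turns the missing entry into `True`, since `bool(float("nan"))` is `True` in Python:

```python
import numpy as np
import pandas as pd

# Invented object column with one missing value
raw = pd.Series([True, False, np.nan], dtype=object)

# Casting directly silently maps NaN to True
naive = raw.astype(bool)
# Dropping the missing value first keeps only real observations
safe = raw.dropna().astype(bool)

print(naive.tolist())
print(safe.tolist())
```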

In [None]:
#result: only 3 non-numerical, non-boolean columns remain: HomePlanet, Cabin and Destination

In [55]:
print(spaceship['HomePlanet'].unique())
print(spaceship['Cabin'].unique())
print(spaceship['Destination'].unique())

['Europa' 'Earth' 'Mars']
['B' 'F' 'A' 'G' 'E' 'C' 'D' 'T']
['TRAPPIST-1e' 'PSO J318.5-22' '55 Cancri e']


In [67]:
#one-hot encoding the 3 columns identified above with get_dummies (drop_first=True omits the first category of each column)
spaceship_dd = pd.get_dummies(spaceship, columns=['HomePlanet', 'Cabin', 'Destination'], drop_first=True)
spaceship_dd.head(30)

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Europa,HomePlanet_Mars,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,False,39.0,False,0.0,0.0,0.0,0.0,0.0,False,True,False,True,False,False,False,False,False,False,False,True
1,False,24.0,False,109.0,9.0,25.0,549.0,44.0,True,False,False,False,False,False,False,True,False,False,False,True
2,False,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,True,False,False,False,False,False,False,False,False,False,True
3,False,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,True,False,False,False,False,False,False,False,False,False,True
4,False,16.0,False,303.0,70.0,151.0,565.0,2.0,True,False,False,False,False,False,False,True,False,False,False,True
5,False,44.0,False,0.0,483.0,0.0,291.0,0.0,True,False,False,False,False,False,False,True,False,False,True,False
6,False,26.0,False,42.0,1539.0,3.0,0.0,0.0,True,False,False,False,False,False,False,True,False,False,False,True
8,False,35.0,False,0.0,785.0,17.0,216.0,0.0,True,False,False,False,False,False,False,True,False,False,False,True
9,True,14.0,False,0.0,0.0,0.0,0.0,0.0,True,True,False,True,False,False,False,False,False,False,False,False
11,False,45.0,False,39.0,7295.0,589.0,110.0,124.0,True,True,False,True,False,False,False,False,False,False,False,False


In [69]:
spaceship_dd.dtypes

CryoSleep                       bool
Age                          float64
VIP                             bool
RoomService                  float64
FoodCourt                    float64
ShoppingMall                 float64
Spa                          float64
VRDeck                       float64
Transported                     bool
HomePlanet_Europa               bool
HomePlanet_Mars                 bool
Cabin_B                         bool
Cabin_C                         bool
Cabin_D                         bool
Cabin_E                         bool
Cabin_F                         bool
Cabin_G                         bool
Cabin_T                         bool
Destination_PSO J318.5-22       bool
Destination_TRAPPIST-1e         bool
dtype: object
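
To see what `drop_first=True` does, here is a small sketch on an invented frame: the first category in sorted order (`Earth`) gets no column of its own and is implied when all remaining dummies are false.

```python
import pandas as pd

# Invented toy frame with one categorical and one numeric column
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Mars", "Earth"],
    "Age": [39.0, 24.0, 58.0, 33.0],
})

# One column per category
full = pd.get_dummies(toy, columns=["HomePlanet"])
# One column fewer: the first category becomes the implicit baseline
reduced = pd.get_dummies(toy, columns=["HomePlanet"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

This matches the encoded frame above, where `HomePlanet_Earth`, `Cabin_A` and `Destination_55 Cancri e` do not appear.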

**Perform Train Test Split**

In [79]:
#creating the features & target dfs
X = spaceship_dd.drop('Transported', axis=1)
y = spaceship_dd['Transported']

In [85]:
#perform train / test split (train_test_split was already imported in the first cell)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

In [93]:
#checking test is 20% of dataset
len(X_test)/len(X)

0.20012110202845898
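
The ratio is slightly above 0.20 because the test set must contain a whole number of rows, and scikit-learn appears to round a fractional `test_size` up. The value above is consistent with 1322 test rows out of 6606 (0.2 * 6606 = 1321.2). A toy sketch with 11 invented samples shows the same effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 11 invented samples, chosen so that 20% is not a whole number (2.2)
X_toy = np.arange(22).reshape(11, 2)
y_toy = np.array([0, 1] * 5 + [0])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0
)

# The test set gets 3 rows (2.2 rounded up), the train set the remaining 8
print(len(X_te), len(X_tr))
print(len(X_te) / len(X_toy))
```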

**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [97]:
#since the target value is boolean (True / False), we should use the KNN classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
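
As a rough sketch of what the classifier does, each prediction is a majority vote among the `n_neighbors` closest training points. The one-dimensional data below is invented, and `n_neighbors=3` is used only to keep the example small:

```python
from sklearn.neighbors import KNeighborsClassifier

# Invented 1-D training data: low values are False, high values are True
X_toy = [[0], [1], [2], [10], [11], [12]]
y_toy = [False, False, False, True, True, True]

toy_knn = KNeighborsClassifier(n_neighbors=3)
toy_knn.fit(X_toy, y_toy)

# 1.5 is closest to three False points, 10.5 to three True points
pred = toy_knn.predict([[1.5], [10.5]])
print(pred)
```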

- Evaluate your model's performance and comment on it.

In [117]:
#accuracy is the only metric used below; regression metrics do not apply to a classifier
from sklearn.metrics import accuracy_score
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

correct_predictions = (y_test == y_pred).sum()
print(f"Correct Predictions: {correct_predictions} out of {len(y_test)}")

Accuracy: 78.44%
Correct Predictions: 1037 out of 1322
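
Accuracy is the share of correct predictions, so the two printed lines describe the same quantity (1037 / 1322 ≈ 0.7844). A small sketch on invented labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented labels: 3 of the 4 predictions match
y_true_toy = np.array([True, False, True, True])
y_pred_toy = np.array([True, False, False, True])

acc = accuracy_score(y_true_toy, y_pred_toy)
# Same value computed by hand: mean of the element-wise matches
manual = (y_true_toy == y_pred_toy).mean()

print(acc, manual)
```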


In [None]:
#improvement of about 1.9 percentage points vs the previous ML model
#Accuracy: 76.55% => 78.44%
#Correct Predictions: 1012 out of 1322 => 1037 out of 1322 (25 more)
