# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


**Check the shape of your data**

In [3]:
#your code here
spaceship.shape

(8693, 14)

**Check for data types**

In [4]:
#your code here
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

**Check for missing values**

In [5]:
#your code here
spaceship.isna().sum()  

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64
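
Raw counts are easier to judge as a share of the rows. The sketch below uses a small made-up frame (the `toy` data is an assumption, not the real dataset) to show the fraction of missing values per column and how many rows contain at least one missing value, which is the number of rows `dropna()` would remove.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Europa"],
    "Transported": [False, True, False, False],
})

# Fraction of missing values in each column.
missing_share = toy.isna().mean()

# Number of rows with at least one missing value.
rows_with_any_na = int(toy.isna().any(axis=1).sum())

print(missing_share)
print(rows_with_any_na)
```

In this notebook the shapes reported before and after dropping (8693 rows, then 6606 rows) indicate that the per-column gaps add up across rows, so the second number is worth checking before choosing a strategy.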

There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a single value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values with an algorithm.

For this exercise, because each column is missing only about 2% of its values, we will drop every row that contains a missing value.

In [6]:
#your code here
spaceship = spaceship.dropna()
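
For comparison, the "fill with a value" strategy listed above could look roughly like the sketch below. It is not what this lab does; the `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with one gap in a continuous and one in a categorical column.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "HomePlanet": ["Earth", "Earth", None],
})

filled = toy.copy()

# Continuous column: fill with the mean of the observed values.
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())

# Categorical column: fill with the most frequent value (the mode).
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

print(filled)
```

When a train/test split is involved, the fill values would normally be computed on the training set only and then reused on the test set.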

- **Cabin** is too granular - transform it in order to obtain {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [7]:
#your code here
#keep only the first letter of each value in the column Cabin
spaceship['Cabin'] = spaceship['Cabin'].str[0]
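
The sample values shown earlier (for example `B/0/P`) have three parts separated by `/`. As a sketch, the first letter can be taken with `.str[0]`, or all three parts can be split into separate columns. The column names `deck`, `num`, and `side` are assumptions chosen for illustration.

```python
import pandas as pd

# Cabin values in the same format as the notebook output above.
cabins = pd.Series(["B/0/P", "F/1/S", "A/0/S"])

# First character only, as done in this lab.
deck = cabins.str[0]

# Alternative: split on "/" into three columns.
parts = cabins.str.split("/", expand=True)
parts.columns = ["deck", "num", "side"]  # hypothetical column names

print(deck.tolist())
print(parts)
```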

- Drop PassengerId and Name

In [8]:
#your code here
#drop the column Name and PassengerId
spaceship = spaceship.drop(columns=['Name', 'PassengerId'])
spaceship

,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False
1,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True
2,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False
3,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False
4,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...
8688,Europa,False,A,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,False
8689,Earth,True,G,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,False
8690,Earth,False,G,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,True
8691,Europa,False,E,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,False


- For non-numerical columns, create dummy variables (one-hot encoding).

In [9]:
#get the non numerical columns
non_numerical = spaceship.select_dtypes(include=['object'])
non_numerical

,HomePlanet,CryoSleep,Cabin,Destination,VIP
0,Europa,False,B,TRAPPIST-1e,False
1,Earth,False,F,TRAPPIST-1e,False
2,Europa,False,A,TRAPPIST-1e,True
3,Europa,False,A,TRAPPIST-1e,False
4,Earth,False,F,TRAPPIST-1e,False
...,...,...,...,...,...
8688,Europa,False,A,55 Cancri e,True
8689,Earth,True,G,PSO J318.5-22,False
8690,Earth,False,G,TRAPPIST-1e,False
8691,Europa,False,E,55 Cancri e,False


In [10]:
#your code here
#get dummies from the non-numerical columns
spaceship_cat = pd.get_dummies(non_numerical)
spaceship_cat.shape

(6606, 18)

In [11]:
#convert all the values in spaceship_cat to integers
spaceship_cat = spaceship_cat.astype(int)

In [12]:
spaceship_cat.columns

Index(['HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars',
       'CryoSleep_False', 'CryoSleep_True', 'Cabin_A', 'Cabin_B', 'Cabin_C',
       'Cabin_D', 'Cabin_E', 'Cabin_F', 'Cabin_G', 'Cabin_T',
       'Destination_55 Cancri e', 'Destination_PSO J318.5-22',
       'Destination_TRAPPIST-1e', 'VIP_False', 'VIP_True'],
      dtype='object')

In [13]:
#drop the columns ['Cabin', 'CryoSleep', 'Destination', 'HomePlanet', 'VIP'] from spaceship
spaceship = spaceship.drop(columns=['Cabin', 'CryoSleep', 'Destination', 'HomePlanet', 'VIP'])

In [14]:
#concatenate spaceship and spaceship_cat
spaceship = pd.concat([spaceship, spaceship_cat], axis=1)
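
The select, encode, drop, and concatenate steps above can also be written as a single `pd.get_dummies` call. The sketch below runs on a made-up frame; passing `columns=` encodes the listed columns in place and `dtype=int` yields 0/1 integers directly.

```python
import pandas as pd

# Made-up frame with one numerical and two categorical columns.
toy = pd.DataFrame({
    "Age": [39.0, 24.0, 58.0],
    "HomePlanet": ["Europa", "Earth", "Europa"],
    "VIP": [False, False, True],
})

# Encode the listed columns and keep the remaining columns untouched.
encoded = pd.get_dummies(toy, columns=["HomePlanet", "VIP"], dtype=int)

print(encoded.columns.tolist())
```

The original categorical columns are replaced by their dummy columns, so no separate `drop` or `concat` is needed in this variant.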

**Perform Train Test Split**

In [15]:
#separate the features from the target before the train test split
features = spaceship.drop(columns=['Transported'])
target = spaceship['Transported']

In [16]:
features

,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,...,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
0,39.0,0.0,0.0,0.0,0.0,0.0,0,1,0,1,...,0,0,0,0,0,0,0,1,1,0
1,24.0,109.0,9.0,25.0,549.0,44.0,1,0,0,1,...,0,0,1,0,0,0,0,1,1,0
2,58.0,43.0,3576.0,0.0,6715.0,49.0,0,1,0,1,...,0,0,0,0,0,0,0,1,0,1
3,33.0,0.0,1283.0,371.0,3329.0,193.0,0,1,0,1,...,0,0,0,0,0,0,0,1,1,0
4,16.0,303.0,70.0,151.0,565.0,2.0,1,0,0,1,...,0,0,1,0,0,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,41.0,0.0,6819.0,0.0,1643.0,74.0,0,1,0,1,...,0,0,0,0,0,1,0,0,0,1
8689,18.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,...,0,0,0,1,0,0,1,0,1,0
8690,26.0,0.0,0.0,1872.0,1.0,0.0,1,0,0,1,...,0,0,0,1,0,0,0,1,1,0
8691,32.0,0.0,1049.0,0.0,353.0,3235.0,0,1,0,1,...,0,1,0,0,0,1,0,0,1,0


In [17]:
features.shape

(6606, 24)

In [18]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=0)
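
`train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both parts. The notebook does not use it; the sketch below shows the effect on made-up data with 30% positive labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 30 of them labelled True.
X = np.arange(100).reshape(-1, 1)
y = np.array([True] * 30 + [False] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

# Share of True labels in each part.
print(y_tr.mean(), y_te.mean())
```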

Scaling our training and testing data (the scaler is fit on the training set only and then applied to both sets)

In [19]:
from sklearn.preprocessing import MinMaxScaler

In [20]:
normalizer = MinMaxScaler()

In [21]:
X_train_norm = normalizer.fit_transform(X_train)

X_test_norm = normalizer.transform(X_test)
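
As a quick check of what `MinMaxScaler` computes, the sketch below compares it with the formula `(x - min) / (max - min)`, where the minimum and maximum come from the training data. The arrays are made up; note that a test value outside the training range ends up outside the interval from 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data.
train = np.array([[0.0], [50.0], [100.0]])
test = np.array([[25.0], [150.0]])

# Fit on the training data only, then transform the test data.
scaler = MinMaxScaler().fit(train)
scaled_test = scaler.transform(test)

# Same computation by hand, using the training min and max.
train_min = train.min(axis=0)
train_max = train.max(axis=0)
manual = (test - train_min) / (train_max - train_min)

print(scaled_test.ravel())
print(np.allclose(scaled_test, manual))
```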

**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [22]:
#your code here
from sklearn.neighbors import KNeighborsClassifier

Creating an instance of the model. For now, we will use n_neighbors=18 (we will see how to optimize this hyperparameter later).

In [23]:
knn = KNeighborsClassifier(n_neighbors=18)
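
One possible way to compare values of `n_neighbors` is cross-validation. The sketch below is an assumption about how that could be done, not the method used later in the course. It uses synthetic data from `make_classification` rather than the spaceship features, and the candidate values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real features and target.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Mean 5-fold cross-validated accuracy for each candidate k.
scores = {}
for k in (3, 9, 18):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

# Candidate with the highest mean accuracy.
best_k = max(scores, key=scores.get)

print(scores)
print(best_k)
```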

Training the model. 

In [24]:
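#note: the model is fit on the unscaled X_train, so X_train_norm and X_test_norm are not used;
#the predictions and scores below are likewise computed on the unscaled X_test
knn.fit(X_train, y_train)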
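
KNN relies on distances between points, so features with large ranges can dominate features that only take values 0 and 1. One way to make sure scaling is always applied before the classifier is a scikit-learn pipeline. The sketch below uses synthetic data as a stand-in for the spaceship features, so its score says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real features and target.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=0)

# The scaler is fit on the training data inside fit(), then reused in score().
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=18))
pipe.fit(X_tr, y_tr)

accuracy = pipe.score(X_te, y_te)
print(accuracy)
```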

In [25]:
X_test


,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,...,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
2453,50.0,0.0,0.0,0.0,0.0,0.0,0,0,1,0,...,0,0,1,0,0,1,0,0,1,0
1334,18.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,...,0,0,0,1,0,0,1,0,1,0
8272,15.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,...,0,0,0,1,0,1,0,0,1,0
5090,52.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,...,0,0,0,1,0,0,1,0,1,0
4357,62.0,0.0,1633.0,0.0,1742.0,0.0,0,1,0,1,...,0,0,0,0,0,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,19.0,0.0,0.0,670.0,1.0,34.0,1,0,0,1,...,0,0,0,1,0,0,0,1,1,0
6816,37.0,300.0,3434.0,0.0,1.0,171.0,0,1,0,1,...,0,0,0,0,0,0,0,1,1,0
5926,43.0,2.0,5329.0,0.0,7.0,0.0,0,1,0,1,...,0,0,0,0,0,0,0,1,1,0
3793,14.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,...,0,0,0,1,0,0,1,0,1,0


Now that our model is trained, we can make predictions for new data points.

In [26]:
pred = knn.predict(X_test)
pred

array([ True,  True,  True, ...,  True,  True,  True])

In [27]:
knn.score(X_test, y_test)

0.7859304084720121

- Evaluate your model's performance. Comment on it.

The F1-score is the harmonic mean of precision and recall, providing a balance between both metrics.
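
To make the harmonic mean concrete, the sketch below applies `F1 = 2 * P * R / (P + R)` to the rounded per-class precision and recall values printed in the classification report at the end of this notebook. The helper `f1` is a hypothetical function written for this example.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values taken from the classification report in this notebook.
f1_false = f1(0.80, 0.76)  # class False
f1_true = f1(0.77, 0.81)   # class True

print(round(f1_false, 2), round(f1_true, 2))
```

Because the inputs are already rounded to two decimals, the results are approximations of the values scikit-learn computes from the raw counts.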


In [28]:
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report
# Precision
precision = precision_score(y_test, pred, average='macro')
print(f'Precision: {precision}')

Precision: 0.7864818381948266


In [29]:
# Recall
recall = recall_score(y_test, pred, average='macro')
print(f'Recall: {recall}')

Recall: 0.7859304084720121


In [30]:
report = classification_report(y_test, pred)
print(report)


              precision    recall  f1-score   support

       False       0.80      0.76      0.78       661
        True       0.77      0.81      0.79       661

    accuracy                           0.79      1322
   macro avg       0.79      0.79      0.79      1322
weighted avg       0.79      0.79      0.79      1322
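
The report above summarizes counts that can also be inspected directly with a confusion matrix. The sketch below uses made-up labels rather than the notebook's `y_test` and `pred`, and also shows `f1_score`, which is imported above but not called.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Made-up true labels and predictions.
y_true = [False, False, False, True, True, True]
y_pred = [False, False, True, True, True, False]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[False, True])
tn, fp, fn, tp = cm.ravel()

# Unweighted mean of the per-class F1-scores.
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(cm)
print(macro_f1)
```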

