# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [58]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns

In [59]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


**Check the shape of your data**

In [61]:
spaceship.shape

(8693, 14)

**Check for data types**

In [63]:
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

**Check for missing values**

In [65]:
print("Null values:\n \n", spaceship.isnull().sum(), "\n")  # Check for nulls (plain string: no f-string placeholders needed)
print("Duplicates:", spaceship.duplicated().sum()) # Check for duplicates

Null values:
 
 PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64 

Duplicates: 0


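There are multiple strategies to handle missing data:

- Removing all rows or all columns that contain missing data.
- Filling missing values with a fixed statistic (for example, the mean for continuous columns or the mode for categorical ones).
- Filling missing values with values estimated by an algorithm.

For this exercise, because each column has only a small share of null values (roughly 2% of its rows), we will drop every row that contains a missing value. Note that the nulls are spread across different rows, so this step removes 2,087 of the 8,693 rows (about 24%).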

In [67]:
spaceship = spaceship.dropna().reset_index(drop=True)  # Drop nulls
spaceship.shape

(6606, 14)
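
As an alternative to dropping rows, the fill strategy listed above could look roughly like the sketch below. It uses a small made-up frame (the values and the `toy` / `filled` names are illustrative assumptions, not part of the lab data) and fills a numeric column with its mean and a categorical column with its mode.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (illustrative values only)
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
})

filled = toy.copy()
# Continuous column: fill with the column mean
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
# Categorical column: fill with the most frequent value (mode)
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

print(filled)
print("Rows kept:", len(filled), "of", len(toy))
```

Unlike `dropna()`, this keeps every row, at the cost of introducing estimated values. In a modelling workflow the fill statistics would normally be computed on the training split only.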

- **Cabin** is too granular (thousands of distinct values). Transform it to keep only its first letter, which yields one of {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}.

In [69]:
spaceship.Cabin.value_counts()

Cabin
G/1476/S    7
E/13/S      7
C/137/S     7
G/734/S     7
B/11/S      7
           ..
E/233/S     1
E/209/P     1
G/548/S     1
D/108/P     1
B/153/P     1
Name: count, Length: 5305, dtype: int64

In [70]:
# Keep only the first letter

spaceship['Cabin'] = spaceship['Cabin'].str[0]

print(spaceship.Cabin.value_counts())

Cabin
F    2152
G    1973
E     683
B     628
C     587
D     374
A     207
T       2
Name: count, dtype: int64
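
The cabin strings shown earlier (for example `B/0/P`) appear to have three slash-separated parts. If more than the first letter is wanted, a split along the lines of the sketch below would keep each part as its own column. The column names `Deck`, `CabinNum` and `Side` are hypothetical choices for this illustration, and the sample values are copied from the `head()` output.

```python
import pandas as pd

# A few cabin values taken from the head() output above
cabins = pd.DataFrame({"Cabin": ["B/0/P", "F/0/S", "A/0/S", "F/1/S"]})

# Split on "/" into three columns; the new column names are our own choice
parts = cabins["Cabin"].str.split("/", expand=True)
cabins["Deck"] = parts[0]                  # same as cabins["Cabin"].str[0] here
cabins["CabinNum"] = parts[1].astype(int)  # middle part as an integer
cabins["Side"] = parts[2]                  # last part of the string

print(cabins)
```

The lab itself only needs the first letter, so `str[0]` is sufficient there. The split is useful only if the other parts are to be tested as features.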


- Drop PassengerId and Name

In [72]:
spaceship = spaceship.drop(['PassengerId', 'Name'], axis=1)

- For non-numerical columns, create dummy variables.

In [74]:
# Get non-numerical columns 
non_num_columns = spaceship.select_dtypes(include=['object']).columns

# Use get_dummies to transform these columns
spaceship_with_dummies = pd.get_dummies(spaceship, columns=non_num_columns, dtype=int)  # dtype=int yields 0/1 columns; without it I got booleans

print(spaceship_with_dummies)

       Age  RoomService  FoodCourt  ShoppingMall     Spa  VRDeck  Transported  \
0     39.0          0.0        0.0           0.0     0.0     0.0        False   
1     24.0        109.0        9.0          25.0   549.0    44.0         True   
2     58.0         43.0     3576.0           0.0  6715.0    49.0        False   
3     33.0          0.0     1283.0         371.0  3329.0   193.0        False   
4     16.0        303.0       70.0         151.0   565.0     2.0         True   
...    ...          ...        ...           ...     ...     ...          ...   
6601  41.0          0.0     6819.0           0.0  1643.0    74.0        False   
6602  18.0          0.0        0.0           0.0     0.0     0.0        False   
6603  26.0          0.0        0.0        1872.0     1.0     0.0         True   
6604  32.0          0.0     1049.0           0.0   353.0  3235.0        False   
6605  44.0        126.0     4688.0           0.0     0.0    12.0         True   

      HomePlanet_Earth  Hom

In [75]:
spaceship_with_dummies.dtypes

Age                          float64
RoomService                  float64
FoodCourt                    float64
ShoppingMall                 float64
Spa                          float64
VRDeck                       float64
Transported                     bool
HomePlanet_Earth               int64
HomePlanet_Europa              int64
HomePlanet_Mars                int64
CryoSleep_False                int64
CryoSleep_True                 int64
Cabin_A                        int64
Cabin_B                        int64
Cabin_C                        int64
Cabin_D                        int64
Cabin_E                        int64
Cabin_F                        int64
Cabin_G                        int64
Cabin_T                        int64
Destination_55 Cancri e        int64
Destination_PSO J318.5-22      int64
Destination_TRAPPIST-1e        int64
VIP_False                      int64
VIP_True                       int64
dtype: object
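
The encoded frame above contains both `CryoSleep_False` and `CryoSleep_True` (and likewise for `VIP`), which carry the same information twice. One possible variation, sketched here on a made-up frame (the `toy` data is an assumption for illustration), is to pass `drop_first=True` so that one level per column is dropped.

```python
import pandas as pd

# Toy frame; CryoSleep is stored as object to mirror the dtypes shown earlier
toy = pd.DataFrame({
    "CryoSleep": pd.Series([False, True, False, True], dtype=object),
    "HomePlanet": ["Europa", "Earth", "Mars", "Earth"],
})

full = pd.get_dummies(toy, columns=["CryoSleep", "HomePlanet"], dtype=int)
reduced = pd.get_dummies(toy, columns=["CryoSleep", "HomePlanet"], dtype=int, drop_first=True)

print(full.columns.tolist())     # one column per category level
print(reduced.columns.tolist())  # first level of each column dropped
```

For a distance-based model such as KNN, keeping both columns of a binary feature effectively counts that feature twice in the distance, so dropping one level may be worth testing. It is a design choice rather than a requirement.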

**Perform Train Test Split**

In [77]:
# Use Transported as the target; all remaining columns are features
features = spaceship_with_dummies.drop('Transported', axis=1)
target = spaceship_with_dummies['Transported']


In [78]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)
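
When the target classes are imbalanced, `train_test_split` can be asked to preserve the class proportions in both splits via its `stratify` parameter. The sketch below shows this on a made-up imbalanced target (the `X`, `y` data and the `_tr` / `_te` names are assumptions for illustration only).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 80 False, 20 True
X = pd.DataFrame({"x": np.arange(100)})
y = pd.Series([False] * 80 + [True] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y  # keep the 80/20 class ratio in both splits
)

print("Share of True in train:", y_tr.mean())
print("Share of True in test: ", y_te.mean())
```

In this lab the test split already turns out balanced (661 rows per class in the classification report), so stratifying is optional here.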

**Scaling**

In [114]:
# Scale features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)   # fit scaler on training data, transform training data
X_test_scaled = scaler.transform(X_test)         # only transform test data
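
To make the "fit on train, transform test" rule concrete, the following small sketch reproduces what `StandardScaler` does by hand on made-up numbers (the arrays and the `toy_scaler` name are illustrative assumptions). The test value is standardized with the training mean and standard deviation, not its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[0.0], [10.0], [20.0]])
test = np.array([[30.0]])

# Statistics are learned from the training data only
toy_scaler = StandardScaler().fit(train)

# Manual equivalent: z = (x - train_mean) / train_std, with the population std (ddof=0)
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(toy_scaler.transform(test))
print(manual)
```

Both lines print the same value, which shows that the test data never influences the scaling parameters.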

**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [121]:
# Use a classifier because the target (Transported) is boolean

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

In [123]:
knn.fit(X_train_scaled, y_train);

- Evaluate your model's performance and comment on it.

In [134]:
y_pred = knn.predict(X_test_scaled)
score = knn.score(X_test_scaled, y_test)
print(score)

0.762481089258699


In [128]:
from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[501 160]
 [154 507]]
              precision    recall  f1-score   support

       False       0.76      0.76      0.76       661
        True       0.76      0.77      0.76       661

    accuracy                           0.76      1322
   macro avg       0.76      0.76      0.76      1322
weighted avg       0.76      0.76      0.76      1322
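
As a sanity check, the headline numbers in the report can be recomputed from the confusion matrix above. This is a small worked example using the printed counts; the variable names are our own.

```python
import numpy as np

# Confusion matrix printed above: rows = actual class, columns = predicted class (False, True)
cm = np.array([[501, 160],
               [154, 507]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()    # correct predictions / all predictions
precision_true = tp / (tp + fp)    # of the rows predicted True, how many were True
recall_true = tp / (tp + fn)       # of the rows that were True, how many were found

print(accuracy, round(precision_true, 2), round(recall_true, 2))
```

The accuracy works out to 1008 / 1322, which matches the score printed earlier. The two kinds of error (160 false positives, 154 false negatives) are nearly equal, so the model does not noticeably favour either class.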



In [132]:
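# Try MinMaxScaler instead - same accuracy as with StandardScaler

from sklearn.preprocessing import MinMaxScaler

minmax_scaler = MinMaxScaler()

# Separate names so the StandardScaler results above are not overwritten
X_train_minmax = minmax_scaler.fit_transform(X_train)
X_test_minmax = minmax_scaler.transform(X_test)

knn_minmax = KNeighborsClassifier()
knn_minmax.fit(X_train_minmax, y_train)
y_pred_minmax = knn_minmax.predict(X_test_minmax)
score_minmax = knn_minmax.score(X_test_minmax, y_test)
print("MinMaxScaler KNN accuracy:", score_minmax)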



MinMaxScaler KNN accuracy: 0.762481089258699
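
A single train/test split gives only one accuracy figure per configuration. One way to compare scalers and values of `n_neighbors` more robustly is cross-validation with a pipeline, sketched below. The data here is synthetic (`make_classification`), and the names `X_demo`, `y_demo` and `results` are assumptions for illustration; in the lab, `X_train` and `y_train` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data; not the Spaceship Titanic set
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

results = {}
for name, scaler_cls in [("standard", StandardScaler), ("minmax", MinMaxScaler)]:
    for k in (3, 5, 11):
        # The pipeline refits the scaler inside every CV fold,
        # so the held-out fold never influences the scaling parameters
        model = make_pipeline(scaler_cls(), KNeighborsClassifier(n_neighbors=k))
        results[(name, k)] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

for key, acc in sorted(results.items()):
    print(key, round(acc, 3))
```

The scores on synthetic data say nothing about the lab's dataset. The point is the structure: scaler and classifier are bundled together, and each configuration is scored on five folds instead of one split.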
