# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [2]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [3]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


**Check the shape of your data**

In [5]:
#your code here
spaceship.shape

(8693, 14)

**Check for data types**

In [6]:
#your code here
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

**Check for missing values**

In [7]:
#your code here
spaceship.isna().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a value (for example, the mean for continuous columns or the mode for categorical columns).
- Filling all missing values with an algorithm.
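
The first two strategies can be sketched on a small frame. This is only an illustration: `demo_missing` and its values are made up for the example and are not part of the Spaceship Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column
demo_missing = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Europa"],
})

# Strategy 1: drop every row that contains at least one missing value
dropped = demo_missing.dropna()

# Strategy 2: fill with the mean (continuous) or the mode (categorical)
filled = demo_missing.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

print(dropped)
print(filled)
```

Dropping is the simplest option but discards whole rows, while filling keeps every row at the cost of introducing estimated values.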

For this exercise, because there are relatively few null values, we will drop rows containing any missing value.

In [10]:
#your code here
spaceship.dropna(inplace=True)
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


- **Cabin** is too granular - transform it in order to obtain {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [8]:
#your code here
# Keep only the deck letter, i.e. the first character of "deck/num/side"
spaceship['Cabin'] = spaceship['Cabin'].str[0]

# Check that only the expected deck letters remain
print(sorted(spaceship['Cabin'].unique()))


- Drop PassengerId and Name

In [33]:
#your code here
spaceship.drop(['PassengerId','Name'], axis=1, inplace=True)
spaceship

Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False
1,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True
2,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False
3,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False
4,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...
8688,Europa,False,A,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,False
8689,Earth,True,G,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,False
8690,Earth,False,G,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,True
8691,Europa,False,E,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,False


- For non-numerical columns, create dummy variables.

In [34]:
spaceship.dtypes.items()

<zip at 0x1e2fce87800>

In [35]:

# Initialize an empty DataFrame to store the transformed data
spaceship_transformed = pd.DataFrame()

# Iterate over columns and create dummy variables for categorical columns
for col, dtype in spaceship.dtypes.items():
    if dtype == 'object':  # Check if the column is categorical (object type)
        # Apply pd.get_dummies to create dummy variables for the column
        dummies = pd.get_dummies(spaceship[col], drop_first=True)
        # Concatenate the dummy variables with the spaceship_transformed DataFrame
        if spaceship_transformed.empty:
            spaceship_transformed = dummies
        else:
            spaceship_transformed = pd.concat([spaceship_transformed, dummies], axis=1)

# Display the transformed DataFrame
spaceship_transformed.head()





Unnamed: 0,Europa,Mars,True,B,C,D,E,F,G,T,PSO J318.5-22,TRAPPIST-1e,True.1
0,True,False,False,True,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,True,False,False,False,True,False
2,True,False,False,False,False,False,False,False,False,False,False,True,True
3,True,False,False,False,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,False,True,False,False,False,True,False
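
The `True` and `True.1` columns in the output above come from `CryoSleep` and `VIP`, which both produce a dummy named `True`, so the two are hard to tell apart. One possible way to avoid the clash, sketched here on a made-up frame (`demo_cat` is hypothetical), is to call `pd.get_dummies` on the whole DataFrame with the `columns` argument, which prefixes each dummy with its source column name.

```python
import pandas as pd

# Hypothetical toy frame mimicking the object-typed columns of the lab data
demo_cat = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Mars", "Earth"],
    "CryoSleep": pd.Series([False, True, False, True], dtype="object"),
    "VIP": pd.Series([False, False, True, False], dtype="object"),
    "Age": [39.0, 24.0, 58.0, 33.0],
})

# Encode the listed columns; each dummy is named "<column>_<level>"
encoded = pd.get_dummies(
    demo_cat,
    columns=["HomePlanet", "CryoSleep", "VIP"],
    drop_first=True,
)

print(encoded.columns.tolist())
```

With prefixed names the encoded columns stay unique, and the original categorical columns are replaced in the same step, so no separate join is needed.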


In [41]:
spaceship_new = spaceship.join(spaceship_transformed)
spaceship_new

Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,...,B,C,D,E,F,G,T,PSO J318.5-22,TRAPPIST-1e,True
0,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,...,True,False,False,False,False,False,False,False,True,False
1,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,...,False,False,False,False,True,False,False,False,True,False
2,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,...,False,False,False,False,False,False,False,False,True,True
3,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,...,False,False,False,False,False,False,False,False,True,False
4,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,...,False,False,False,False,True,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,Europa,False,A,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,...,False,False,False,False,False,False,False,False,False,True
8689,Earth,True,G,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,...,False,False,False,False,False,True,False,True,False,False
8690,Earth,False,G,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,...,False,False,False,False,False,True,False,False,True,False
8691,Europa,False,E,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,...,False,False,False,True,False,False,False,False,False,False


**Perform Train Test Split**

In [56]:
features = spaceship_new.drop(columns=["Transported", "HomePlanet","Cabin","Destination"])  # the string columns must be dropped as well
target = spaceship_new["Transported"].astype(int)

In [60]:
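# Cast column names to strings, since the boolean dummy names (True) are not strings
features.columns=features.columns.astype(str)
# Drop both dummy columns named "True"; the original CryoSleep and VIP columns are kept
features=features.drop(columns=["True"])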

In [62]:
features

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Europa,Mars,B,C,D,E,F,G,T,PSO J318.5-22,TRAPPIST-1e
0,False,39.0,False,0.0,0.0,0.0,0.0,0.0,True,False,True,False,False,False,False,False,False,False,True
1,False,24.0,False,109.0,9.0,25.0,549.0,44.0,False,False,False,False,False,False,True,False,False,False,True
2,False,58.0,True,43.0,3576.0,0.0,6715.0,49.0,True,False,False,False,False,False,False,False,False,False,True
3,False,33.0,False,0.0,1283.0,371.0,3329.0,193.0,True,False,False,False,False,False,False,False,False,False,True
4,False,16.0,False,303.0,70.0,151.0,565.0,2.0,False,False,False,False,False,False,True,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,False,41.0,True,0.0,6819.0,0.0,1643.0,74.0,True,False,False,False,False,False,False,False,False,False,False
8689,True,18.0,False,0.0,0.0,0.0,0.0,0.0,False,False,False,False,False,False,False,True,False,True,False
8690,False,26.0,False,0.0,0.0,1872.0,1.0,0.0,False,False,False,False,False,False,False,True,False,False,True
8691,False,32.0,False,0.0,1049.0,0.0,353.0,3235.0,True,False,False,False,False,True,False,False,False,False,False


In [63]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)

**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [67]:
#your code here
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training set only, so no test-set information leaks in
normalizer = MinMaxScaler()
normalizer.fit(X_train)



- Evaluate your model's performance and comment on it.

In [65]:
#your code here
X_train_norm = normalizer.transform(X_train)
X_test_norm = normalizer.transform(X_test)

In [71]:
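# Note: KNeighborsRegressor predicts continuous values, so score() below returns R^2, not accuracy.
# For a binary target such as Transported, KNeighborsClassifier is the usual choice.
knn = KNeighborsRegressor(n_neighbors=100)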

In [74]:
knn.fit(X_train_norm, y_train)

In [75]:
knn.score(X_test_norm, y_test)

0.3146199697428139

In [77]:
pred = knn.predict(X_test_norm)
pred

array([0.97, 0.71, 0.81, ..., 0.46, 0.71, 0.78])

ValueError: Classification metrics can't handle a mix of binary and continuous targets
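
The `ValueError` above is what scikit-learn's classification metrics raise when the predictions are continuous, as they are with `KNeighborsRegressor`. A minimal sketch of the classifier-based alternative follows. It runs on synthetic data (`X_demo`, `y_demo` are invented for the example) and the choice of `n_neighbors=5` is arbitrary, so the printed accuracy says nothing about the Spaceship Titanic data.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic binary classification problem (hypothetical data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# Scale with statistics learned from the training split only
scaler = MinMaxScaler().fit(X_tr)

# A classifier predicts class labels (0 or 1), not averaged neighbour values
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(scaler.transform(X_tr), y_tr)
pred_labels = clf.predict(scaler.transform(X_te))

# Classification metrics now accept the predictions
acc = accuracy_score(y_te, pred_labels)
print(acc)
```

For a classifier, `score()` also returns accuracy, so it becomes directly comparable with `accuracy_score`.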

In [None]:
# For the association between two categorical variables, use a chi-squared test.
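
As a rough sketch of that note, a chi-squared test of independence can be run with `scipy.stats.chi2_contingency` on a contingency table built by `pd.crosstab`. The frame `demo_chi` below is invented for illustration, and with so few rows the p-value is not meaningful. It only shows the mechanics.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data with two categorical variables
demo_chi = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", "Europa", "Europa",
                   "Mars", "Mars", "Earth", "Europa"],
    "CryoSleep": [False, True, False, True, False, True, False, True],
})

# Contingency table of observed counts (rows: HomePlanet, columns: CryoSleep)
table = pd.crosstab(demo_chi["HomePlanet"], demo_chi["CryoSleep"])

# Test statistic, p-value, degrees of freedom, and expected counts
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value, dof)
```

A small p-value would suggest the two variables are not independent. The degrees of freedom equal (rows - 1) × (columns - 1) of the contingency table.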