# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

The metadata is described here:

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [3]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


**Check the shape of your data**

In [5]:
spaceship.shape

(8693, 14)

**Check for data types**

In [7]:
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


**Check for missing values**

In [9]:
spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

There are multiple strategies to handle missing data:

- Removing all rows or all columns that contain missing data.
- Filling all missing values with a fixed value (for example, the mean for continuous columns or the mode for categorical columns).
- Imputing all missing values with an algorithm.
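As a rough illustration of the three strategies above, here is a minimal sketch on a toy DataFrame. The values are made up for the example and are not taken from the Spaceship Titanic file; `KNNImputer` is just one possible choice of imputation algorithm, not something this lab prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "Spa": [0.0, 549.0, 6715.0, np.nan],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
})

# Strategy 1: drop every row that contains at least one missing value
dropped = toy.dropna()

# Strategy 2: fill with a single value per column
# (mean for a continuous column, mode for a categorical column)
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["Spa"] = filled["Spa"].fillna(filled["Spa"].mean())
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

# Strategy 3: let an algorithm estimate the missing numeric values
# from the most similar rows
imputer = KNNImputer(n_neighbors=2)
imputed = imputer.fit_transform(toy[["Age", "Spa"]])

print(dropped.shape)
print(filled)
print(imputed)
```

Dropping rows is the simplest option, but it discards every row with even a single gap, so it is worth checking how many rows survive.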

For this exercise, because each column is missing only about 2% of its values, we will drop every row that contains a missing value. This leaves 6,606 of the original 8,693 rows.

In [11]:
spaceship_dropped = spaceship.dropna()

In [13]:
spaceship_dropped.isnull().sum()

PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Name            0
Transported     0
dtype: int64

- **Cabin** is too granular. Transform it to keep only its first letter, so that its values are in {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [17]:
# Copy first so the assignment does not raise SettingWithCopyWarning; keep only the leading letter ('B/0/P' -> 'B')
spaceship_dropped = spaceship_dropped.copy()
spaceship_dropped['Cabin'] = spaceship_dropped['Cabin'].str.extract('^([A-Z])')[0]
print(spaceship_dropped)

     PassengerId HomePlanet CryoSleep Cabin    Destination   Age    VIP  \
0        0001_01     Europa     False     B    TRAPPIST-1e  39.0  False   
1        0002_01      Earth     False     F    TRAPPIST-1e  24.0  False   
2        0003_01     Europa     False     A    TRAPPIST-1e  58.0   True   
3        0003_02     Europa     False     A    TRAPPIST-1e  33.0  False   
4        0004_01      Earth     False     F    TRAPPIST-1e  16.0  False   
...          ...        ...       ...   ...            ...   ...    ...   
8688     9276_01     Europa     False     A    55 Cancri e  41.0   True   
8689     9278_01      Earth      True     G  PSO J318.5-22  18.0  False   
8690     9279_01      Earth     False     G    TRAPPIST-1e  26.0  False   
8691     9280_01     Europa     False     E    55 Cancri e  32.0  False   
8692     9280_02     Europa     False     E    TRAPPIST-1e  44.0  False   
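An alternative to the regular expression is to split the Cabin string on its separator. The sketch below assumes every value has three `/`-separated parts, as in the previewed rows (`B/0/P`, `F/1/S`). The column names `Part1`, `Part2` and `Part3` are placeholders chosen for this example, not names from the dataset.

```python
import pandas as pd

# Cabin values copied from the data preview
cabins = pd.Series(["B/0/P", "F/0/S", "A/0/S", "F/1/S"], name="Cabin")

# Split each string on "/" into three separate columns
parts = cabins.str.split("/", expand=True)
parts.columns = ["Part1", "Part2", "Part3"]  # placeholder names

# The first part is the letter kept in this lab
first_letter = parts["Part1"]
print(first_letter.tolist())
```

Splitting keeps the other two parts available as candidate features, whereas the regular expression keeps only the letter.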


- Drop PassengerId and Name

In [19]:
spaceship_dropped = spaceship_dropped.drop(columns=['PassengerId', 'Name'])

- For non-numerical columns, create dummy variables.

In [None]:
spaceship_dropped = pd.get_dummies(spaceship_dropped, columns=['HomePlanet', 'Cabin', 'Destination', 'VIP'], drop_first=True)
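To see what `pd.get_dummies` with `drop_first=True` produces, here is a small sketch on toy data (the rows are invented for illustration). It also shows one possible way to turn a True/False column stored with `object` dtype, as `CryoSleep` is in the `info()` output above, into integers. That conversion is an optional extra and is not part of the lab's own cell.

```python
import pandas as pd

# Toy data (hypothetical rows, for illustration only)
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Mars", "Earth"],
    "CryoSleep": pd.Series([False, True, False, True], dtype=object),
    "Age": [39.0, 24.0, 58.0, 33.0],
})

# One column per category, minus the first one (alphabetically),
# which becomes the implicit baseline
encoded = pd.get_dummies(toy, columns=["HomePlanet"], drop_first=True)

# Optional: make the object-typed boolean column explicitly numeric
encoded["CryoSleep"] = encoded["CryoSleep"].astype(int)

print(encoded.columns.tolist())
print(encoded)
```

With three categories, `drop_first=True` leaves two dummy columns; a row with zeros in both belongs to the dropped category.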


**Perform Train Test Split**

In [27]:
from sklearn.model_selection import train_test_split

# Define features and target variable
X = spaceship_dropped.drop(columns=['Transported'])  # features: every column except the target
y = spaceship_dropped['Transported']                 # target: whether the passenger was transported

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 80/20 split, a common default for a dataset of this size

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

# Check that the split proportions are correct
train_proportion = len(X_train) / len(X)
test_proportion = len(X_test) / len(X)

print(f"Proportion of training set: {train_proportion:.2f}")
print(f"Proportion of test set: {test_proportion:.2f}")

X_train shape: (5284, 19)
X_test shape: (1322, 19)
y_train shape: (5284,)
y_test shape: (1322,)
Proportion of training set: 0.80
Proportion of test set: 0.20
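The split above is purely random. When the target classes are unbalanced, passing `stratify` to `train_test_split` keeps the class proportions the same in both sets. The following is a minimal sketch on synthetic data; the names `X_demo` and `y_demo` and the 30/70 class balance are assumptions made for this example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 3 features, 30% positive class (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([True] * 30 + [False] * 70)

# stratify=y_demo preserves the 30/70 class balance in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```

In this lab the two classes are already close to balanced (653 versus 669 in the test set), so stratification would likely change little here.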


**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [29]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize the KNN model (creates a classifier instance)
knn = KNeighborsClassifier(n_neighbors=10)

# Fit the model to the training data. KNN stores X_train and y_train internally
# so they can be used later, during the prediction phase.
knn.fit(X_train, y_train)

# Prediction example
predictions = knn.predict(X_test)
print("Predictions:", predictions)

Predictions: [ True  True False ...  True  True  True]
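KNN classifies by distance, so a column with a large range (for example `Spa`, which reaches 6715.0 in the preview) can outweigh 0/1 dummy columns. One common remedy, not used in this lab, is to standardize the features before fitting. The sketch below shows the idea on synthetic data from `make_classification`; the data and the names `X_demo`, `y_demo` and `scaled_knn` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=42
)
X_demo[:, 0] *= 1000  # inflate one feature to mimic a large-range column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data only,
# then applies the same scaling before every prediction
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
scaled_knn.fit(X_tr, y_tr)

demo_predictions = scaled_knn.predict(X_te)
print("Accuracy with scaling:", scaled_knn.score(X_te, y_te))
```

Wrapping the scaler and the classifier in one pipeline avoids leaking test-set statistics into the scaling step.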


- Evaluate your model's performance and comment on it.

In [31]:
from sklearn.metrics import classification_report, confusion_matrix

# Make predictions
y_pred = knn.predict(X_test)

# Evaluate the model
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[538 115]
 [150 519]]
              precision    recall  f1-score   support

       False       0.78      0.82      0.80       653
        True       0.82      0.78      0.80       669

    accuracy                           0.80      1322
   macro avg       0.80      0.80      0.80      1322
weighted avg       0.80      0.80      0.80      1322
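The figures in the classification report can be recomputed by hand from the confusion matrix, which helps when commenting on the result. In scikit-learn's layout, rows are true classes and columns are predicted classes, so the matrix above contains 538 true negatives, 115 false positives, 150 false negatives and 519 true positives. A short sketch using those four numbers:

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
cm = np.array([[538, 115],
               [150, 519]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # share of all predictions that are correct
precision_true = tp / (tp + fp)   # of those predicted True, how many are True
recall_true = tp / (tp + fn)      # of the actual True, how many were found

print(f"Accuracy:         {accuracy:.2f}")
print(f"Precision (True): {precision_true:.2f}")
print(f"Recall (True):    {recall_true:.2f}")
```

The model is right for about 80% of the test passengers, and its errors are fairly evenly split: it misses 150 transported passengers and wrongly flags 115 who were not transported.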



In [33]:
from sklearn.metrics import accuracy_score, classification_report

# Compute accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

# Classification report to see details like precision, recall, F1-score
report = classification_report(y_test, predictions)
print("Classification Report:\n", report)

Accuracy: 0.7995461422087746
Classification Report:
               precision    recall  f1-score   support

       False       0.78      0.82      0.80       653
        True       0.82      0.78      0.80       669

    accuracy                           0.80      1322
   macro avg       0.80      0.80      0.80      1322
weighted avg       0.80      0.80      0.80      1322
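The model above uses 10 neighbors, which is one choice among many. A possible next step, sketched here on synthetic data rather than on the Spaceship Titanic set, is to compare several values of `n_neighbors` with cross-validation. The candidate values and the names `X_demo`, `y_demo` and `search` are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=42
)

# Candidate values for k (arbitrary choices for this sketch)
param_grid = {"n_neighbors": [3, 5, 10, 15]}

# 5-fold cross-validation on accuracy for each candidate
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best k:", search.best_params_["n_neighbors"])
print("Mean CV accuracy:", search.best_score_)
```

On the real data, the search would be fitted on `X_train` and `y_train` only, keeping the test set for the final evaluation.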

