# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [96]:
# Libraries
import pandas as pd
from sklearn.metrics import accuracy_score  # needed for the accuracy check after training
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Load dataset
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()


Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


**Check the shape of your data**

In [98]:
# Number of rows and columns
spaceship.shape

(8693, 14)

**Check for data types**

In [100]:
# Data type of each column
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

**Check for missing values**

In [102]:
# Count of missing values per column
spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

There are multiple strategies for handling missing data:

- Removing all rows or all columns that contain missing data.
- Filling all missing values with a fixed value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values using an algorithm.

For this exercise, because the number of null values is so low, we will drop rows containing any missing value.
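
The first two strategies above can be sketched on a small toy frame. The column values below are illustrative assumptions, not rows taken from the Spaceship Titanic data, and algorithmic imputation is left out to keep the sketch short.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only, not taken from the real dataset)
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Europa"],
})

# Strategy 1: drop every row that contains at least one missing value
dropped = toy.dropna()

# Strategy 2: fill numeric columns with the mean, categorical ones with the mode
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

print(len(dropped))                 # rows left after dropping
print(filled.isnull().sum().sum())  # missing values left after filling
```

Dropping is the simplest option but discards whole rows; filling keeps every row at the cost of introducing estimated values.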

- **Cabin** is too granular: transform it to keep only the deck letter, giving values in {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [104]:
# Keep only the first character of Cabin (the deck letter)
spaceship["Cabin"] = spaceship["Cabin"].astype(str).str[0]
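
One caveat with casting to `str` first: depending on the pandas version, a missing cabin can become the literal string `"nan"`, whose first character `"n"` is not one of the decks listed above. The sketch below, on assumed toy values rather than the real data, compares that with applying the `.str` accessor directly, which leaves missing entries as NaN.

```python
import numpy as np
import pandas as pd

# Toy cabin values in deck/num/side format, plus one missing entry (assumed data)
cabin = pd.Series(["B/0/P", "F/1/S", np.nan])

with_cast = cabin.astype(str).str[0]   # the cast may turn NaN into the text "nan"
without_cast = cabin.str[0]            # NaN stays NaN

print(with_cast.tolist())
print(without_cast.isna().tolist())
```

Keeping NaN intact means a later `dropna` can still see which rows had no cabin recorded.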


- Drop PassengerId and Name

In [106]:
spaceship = spaceship.drop(columns=["PassengerId", "Name"])


- For non-numerical columns, create dummy variables.

In [108]:
spaceship = pd.get_dummies(spaceship)


In [110]:
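# Drop rows that still contain missing values. Because this runs after get_dummies,
# only rows with missing numeric values are removed: missing categoricals were already encoded.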
spaceship = spaceship.dropna()
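
By default, `pd.get_dummies` encodes a missing categorical value as a row of zeros instead of keeping a NaN, so the order of encoding and dropping matters. If the goal is to drop rows with any missing value, one option is to call `dropna` before encoding. A minimal sketch on a toy frame (values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing numeric and one missing categorical value
toy = pd.DataFrame({
    "Age": [39.0, 24.0, np.nan, 16.0],
    "HomePlanet": ["Europa", None, "Earth", "Earth"],
})

# Encode first, then drop: the row with a missing HomePlanet survives
encode_then_drop = pd.get_dummies(toy).dropna()

# Drop first, then encode: every row with any missing value is removed
drop_then_encode = pd.get_dummies(toy.dropna())

print(len(encode_then_drop), len(drop_then_encode))
```

Either order can be a valid choice; the point is that the two produce different row counts and therefore different training sets.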


**Perform Train Test Split**

**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [112]:
# Split features and target
X = spaceship.drop("Transported", axis=1)
y = spaceship["Transported"]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Predict and check accuracy
y_pred = knn.predict(X_test_scaled)
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.7821522309711286


- Evaluate your model's performance and comment on it.

In [114]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.7821522309711286

Classification Report:
              precision    recall  f1-score   support

       False       0.77      0.80      0.79       758
        True       0.80      0.76      0.78       766

    accuracy                           0.78      1524
   macro avg       0.78      0.78      0.78      1524
weighted avg       0.78      0.78      0.78      1524


Confusion Matrix:
[[609 149]
 [183 583]]
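
The figures in the classification report follow directly from the confusion matrix. As a quick check, this sketch recomputes them by hand, treating `True` (transported) as the positive class; the variable names are illustrative.

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
tn, fp = 609, 149   # actual False
fn, tp = 183, 583   # actual True

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted True, how many were actually True
recall = tp / (tp + fn)      # of the actual True, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.78 0.8 0.76 0.78
```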


In [None]:
# The model reaches about 78% accuracy on the test set, a reasonable baseline for a first attempt with KNN. Precision and recall are close for both classes (0.77/0.80 for False, 0.80/0.76 for True), so the model does not strongly favor one outcome. The confusion matrix shows 149 false positives and 183 false negatives, so it misses transported passengers slightly more often than it wrongly flags them.

# KNN is sensitive to feature scaling (already handled here with StandardScaler) and to the choice of k. Performance could likely be improved by:

# - Tuning the n_neighbors parameter using cross-validation
# - Trying other models such as Logistic Regression or Random Forest
# - Engineering new features or adjusting the ones used
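
As a rough illustration of tuning `n_neighbors` with cross-validation, the sketch below runs a small grid search. It uses synthetic data from `make_classification` as a stand-in (an assumption, so the scores say nothing about the Spaceship Titanic data), and the candidate values of k are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); in the lab, use X_train and y_train instead
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Scaling sits inside the pipeline so each CV fold is scaled using only its own training part
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Candidate k values are arbitrary choices for illustration
param_grid = {"kneighborsclassifier__n_neighbors": [3, 5, 7, 11, 15]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Passing the unscaled training data to the pipeline, rather than pre-scaled arrays, avoids leaking validation-fold statistics into the scaler.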