# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [2]:
#Libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [3]:
spaceship = pd.read_csv("/Users/skyler/Documents/GitHub/Homework/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


**Check the shape of your data**

In [4]:
#your code here
spaceship.shape

(8693, 14)

**Check for data types**

In [5]:
#your code here
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

**Check for missing values**

In [6]:
#your code here
spaceship.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a fixed value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values with an algorithm.

For this exercise, because each column has only a small share of null values (roughly 2%), we will drop every row that contains a missing value.

In [7]:
#your code here
spaceship = spaceship.dropna()
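
For comparison with the drop-rows strategy used in this lab, here is a minimal sketch of the fill strategy from the list above. It runs on a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the real dataset).

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({
    "Age": [39.0, None, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
})

# Strategy 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Strategy 2: fill numeric gaps with the mean and categorical gaps with the mode.
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

print(f"rows after dropna: {len(dropped)}, rows after filling: {len(filled)}")
print(filled)
```

Dropping is simpler but shrinks the dataset, while filling keeps every row at the cost of introducing estimated values.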

- **Cabin** is too granular: transform it to keep only its first letter, giving one of {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [8]:
#your code here
spaceship['Cabin'].value_counts()

# Keep only the first letter of each Cabin value
spaceship['Cabin'] = spaceship['Cabin'].str[0]
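
As an alternative to `.str[0]`, the sketch below splits a slash-separated cabin string into its parts, which keeps the other components available if they are needed later. The column names `Deck`, `Num` and `Side` are hypothetical labels chosen for this example; the sample values are taken from the `head()` output above.

```python
import pandas as pd

# A few Cabin values as they appear in the raw data.
cabins = pd.Series(["B/0/P", "F/0/S", "A/0/S", "F/1/S"], name="Cabin")

# Split on "/" into three columns; the names below are illustrative only.
parts = cabins.str.split("/", expand=True)
parts.columns = ["Deck", "Num", "Side"]

# The first part matches what .str[0] returns for these values.
print(parts["Deck"].tolist())
print((parts["Deck"] == cabins.str[0]).all())
```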

- Drop PassengerId and Name

In [9]:
#your code here
spaceship_no_name_ID = spaceship.drop(['Name', 'PassengerId'], axis=1)
spaceship_no_name_ID

Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False
1,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True
2,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False
3,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False
4,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...
8688,Europa,False,A,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,False
8689,Earth,True,G,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,False
8690,Earth,False,G,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,True
8691,Europa,False,E,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,False


In [10]:
spaceship_no_name_ID['Destination'].value_counts()

Destination
TRAPPIST-1e      4576
55 Cancri e      1407
PSO J318.5-22     623
Name: count, dtype: int64

**Define the feature and target dataframes**

In [11]:
X = spaceship_no_name_ID.drop('Transported', axis=1)
y = spaceship_no_name_ID['Transported']

- For non-numerical columns, create dummy variables (one-hot encode the categorical columns).

In [12]:
# Keep only the columns to encode that actually exist in the feature dataframe
columns_to_encode = ['HomePlanet', 'Destination', 'Cabin']
columns_to_encode = [col for col in columns_to_encode if col in X.columns]

# Apply pd.get_dummies() only to those columns
X_encoded = pd.get_dummies(X, columns=columns_to_encode)
X_encoded

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,...,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T
0,False,39.0,False,0.0,0.0,0.0,0.0,0.0,False,True,...,False,True,False,True,False,False,False,False,False,False
1,False,24.0,False,109.0,9.0,25.0,549.0,44.0,True,False,...,False,True,False,False,False,False,False,True,False,False
2,False,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,True,...,False,True,True,False,False,False,False,False,False,False
3,False,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,True,...,False,True,True,False,False,False,False,False,False,False
4,False,16.0,False,303.0,70.0,151.0,565.0,2.0,True,False,...,False,True,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,False,41.0,True,0.0,6819.0,0.0,1643.0,74.0,False,True,...,False,False,True,False,False,False,False,False,False,False
8689,True,18.0,False,0.0,0.0,0.0,0.0,0.0,True,False,...,True,False,False,False,False,False,False,False,True,False
8690,False,26.0,False,0.0,0.0,1872.0,1.0,0.0,True,False,...,False,True,False,False,False,False,False,False,True,False
8691,False,32.0,False,0.0,1049.0,0.0,353.0,3235.0,False,True,...,False,False,False,False,False,False,True,False,False,False
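
To make the effect of `pd.get_dummies()` easier to see, here is a small sketch on a made-up frame (the `toy` values are assumptions, not real rows). It also shows the optional `drop_first=True` argument, which drops one dummy per category to avoid fully redundant columns; the lab itself keeps all dummies.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Mars", "Earth"],
    "Age": [39.0, 24.0, 58.0, 33.0],
})

# One column per category, encoded as 0/1 integers.
full = pd.get_dummies(toy, columns=["HomePlanet"], dtype=int)

# Same encoding, but the first category (alphabetically) is dropped.
reduced = pd.get_dummies(toy, columns=["HomePlanet"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

For a distance-based model such as KNN, keeping all dummies is usually harmless; `drop_first` matters more for linear models, where redundant columns can make coefficients harder to interpret.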


**Perform Train Test Split**

In [13]:
#your code here
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

# Standardize the features: fit the scaler on the training set only,
# then apply the same transformation to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
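
The sketch below illustrates why the scaler is fitted on the training set only: the training features end up with mean 0 and standard deviation 1, while the test features are transformed with the same statistics and so are only approximately centered. It uses synthetic data (`X_toy`, `y_toy` are assumptions) and adds `stratify`, an optional argument not used in the cell above, to keep class proportions equal across the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and balanced binary labels, for illustration only.
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_toy = np.array([0, 1] * 50)

# stratify keeps the 50/50 class balance in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Learn mean and standard deviation from the training set only.
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
# Reuse the training statistics on the test set.
X_te_s = scaler.transform(X_te)

print("train means:", X_tr_s.mean(axis=0).round(6))
print("test means: ", X_te_s.mean(axis=0).round(2))
```

Fitting the scaler on the full dataset before splitting would leak information about the test set into training.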


**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [16]:
#your code here
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Define models
models = {
  'Logistic Regression': LogisticRegression(random_state=42),
  'Decision Tree': DecisionTreeClassifier(random_state=42),
  'Random Forest': RandomForestClassifier(random_state=42),
  'SVM': SVC(random_state=42),
  'KNN': KNeighborsClassifier()
}

# Train and evaluate models
results = {}

for name, model in models.items():
  # Train the model
  model.fit(X_train_scaled, y_train)
  
  # Make predictions
  y_pred = model.predict(X_test_scaled)
  
  # Calculate accuracy
  accuracy = accuracy_score(y_test, y_pred)
  
  # Store results
  results[name] = {
      'accuracy': accuracy,
      'report': classification_report(y_test, y_pred)
  }

# Print results
for name, result in results.items():
  print(f"\n{name}:")
  print(f"Accuracy: {result['accuracy']:.4f}")
  print("Classification Report:")
  print(result['report'])

# Compare accuracies
accuracies = {name: result['accuracy'] for name, result in results.items()}
best_model = max(accuracies, key=accuracies.get)

print("\nModel Accuracy Comparison:")
for name, accuracy in accuracies.items():
  print(f"{name}: {accuracy:.4f}")

print(f"\nBest performing model: {best_model} with accuracy {accuracies[best_model]:.4f}")


Logistic Regression:
Accuracy: 0.7920
Classification Report:
              precision    recall  f1-score   support

       False       0.80      0.77      0.79       653
        True       0.79      0.81      0.80       669

    accuracy                           0.79      1322
   macro avg       0.79      0.79      0.79      1322
weighted avg       0.79      0.79      0.79      1322


Decision Tree:
Accuracy: 0.7670
Classification Report:
              precision    recall  f1-score   support

       False       0.77      0.76      0.76       653
        True       0.77      0.77      0.77       669

    accuracy                           0.77      1322
   macro avg       0.77      0.77      0.77      1322
weighted avg       0.77      0.77      0.77      1322


Random Forest:
Accuracy: 0.8071
Classification Report:
              precision    recall  f1-score   support

       False       0.79      0.82      0.81       653
        True       0.82      0.79      0.81       669

    accu
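
Since KNN is the model named for this exercise, one natural follow-up is choosing the number of neighbors. The sketch below compares a few values of `n_neighbors` with 5-fold cross-validation. It is an illustration under assumptions: the data comes from `make_classification` rather than the Spaceship Titanic features, and the candidate values of `k` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the real features.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)

scores = {}
for k in (3, 5, 11, 21):
    # The pipeline refits the scaler inside each fold, so no fold sees
    # statistics computed from its own validation data.
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(pipe, X_syn, y_syn, cv=5).mean()

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: mean CV accuracy {score:.4f}")
print(f"best k: {best_k}")
```

On the real data, the same loop could be run on `X_train` and `y_train`, leaving the test set untouched until the final evaluation.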

- Evaluate your model's performance and comment on it.

In [15]:
#your code here
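
One possible starting point for this evaluation is a confusion matrix, which breaks accuracy down into the four kinds of outcome. The sketch below uses a handful of made-up labels (`y_true` and `y_hat` are assumptions); on the real data, `y_test` and a model's predictions would take their place.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions, for illustration only.
y_true = [True, False, True, True, False, False]
y_hat = [True, False, False, True, True, False]

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_hat, labels=[False, True])
tn, fp, fn, tp = cm.ravel()

acc = accuracy_score(y_true, y_hat)
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp} accuracy={acc:.4f}")
```

Comparing false positives with false negatives shows whether a model leans toward predicting `True` or `False`, which a single accuracy figure hides.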