# LAB | Feature Engineering

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


**Check the shape of your data**

In [3]:
#your code here

print("The shape of this dataset is: ", spaceship.shape)

The shape of this dataset is:  (8693, 14)


**Check for data types**

In [4]:
#your code here
print("These are data types of our dataset: ", spaceship.dtypes)

These are data types of our dataset:  PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object


**Check for missing values**

In [5]:
#your code here

miss_values = spaceship.isnull().sum()
print("These are the missing values in our dataset: ", miss_values)

These are the missing values in our dataset:  PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64


In [6]:
# isna() is an alias of isnull(), so this reproduces the counts from the previous cell
nan_values = spaceship.isna().sum()
print("These are the NaN values in our dataset: ", nan_values)

These are the NaN values in our dataset:  PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64


There are multiple strategies for handling missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a single value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values with an algorithm.

For this exercise, because each column is missing only about 2% of its values, we will drop every row that contains a missing value. Note that the gaps fall on different rows, so this reduces the dataset from 8693 to 6606 rows.
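The three strategies listed above can be sketched on a small toy frame. This is a minimal illustration, not part of the lab solution: the values in `toy` are made up, and `KNNImputer` from scikit-learn is only one possible choice for the algorithm-based strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (hypothetical values, not read from the Spaceship Titanic file)
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "Spa": [0.0, 549.0, 6715.0, 3329.0],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
})

# Strategy 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Strategy 2: fill with a single value (mean for numeric, mode for categorical)
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

# Strategy 3: fill numeric gaps with an algorithm (here: average of the 2 nearest rows)
imputed = KNNImputer(n_neighbors=2).fit_transform(toy[["Age", "Spa"]])

print(dropped.shape)
print(filled)
print(imputed)
```

Dropping rows is the simplest option but discards information; the two filling strategies keep every row at the cost of introducing estimated values.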

In [7]:
new_spaceship = spaceship.dropna()
new_spaceship.shape

(6606, 14)

- **Cabin** is too granular - transform it in order to obtain {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}

In [8]:
#your code here

new_spaceship.value_counts('Cabin')

Cabin
G/1476/S    7
B/11/S      7
E/13/S      7
C/137/S     7
G/734/S     7
           ..
F/1190/S    1
F/1188/S    1
F/1187/S    1
F/1187/P    1
T/3/P       1
Name: count, Length: 5305, dtype: int64

In [9]:
#your code here
# Keep only the deck letter, i.e. the part of "Cabin" before the first '/' (e.g. 'B/0/P' -> 'B').
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
new_spaceship = new_spaceship.assign(Cabin=new_spaceship['Cabin'].str.split('/').str[0])

# Sanity check: every remaining value is one of the expected deck letters
valid = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}
decks = set(new_spaceship['Cabin'].unique())
assert decks <= valid

new_spaceship.head()


Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [10]:
new_spaceship.value_counts('Cabin')

Cabin
F    2152
G    1973
E     683
B     628
C     587
D     374
A     207
T       2
Name: count, dtype: int64
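
If the other parts of the cabin string are also of interest, the split can keep all of them. The sketch below assumes the `deck/num/side` layout described in the dataset metadata and uses the four cabin values visible in the first rows of the data; the column names `Deck`, `Num` and `Side` are chosen here for illustration.

```python
import pandas as pd

# Cabin values taken from the first rows shown earlier
cabins = pd.Series(["B/0/P", "F/0/S", "A/0/S", "F/1/S"], name="Cabin")

# expand=True returns one column per piece instead of a list per row
parts = cabins.str.split("/", expand=True)
parts.columns = ["Deck", "Num", "Side"]  # illustrative names

deck_counts = parts["Deck"].value_counts()
print(parts)
print(deck_counts)
```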

- Drop PassengerId and Name

In [11]:
#your code here

# Rebinding the result (instead of using inplace=True) avoids pandas' SettingWithCopyWarning
new_spaceship = new_spaceship.drop(columns=['PassengerId', 'Name'])


- For non-numerical columns, create dummy variables.

In [12]:
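#your code here

import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# One-hot encode every non-numeric column and pass the numeric ones through unchanged.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
pre = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), selector(dtype_exclude=np.number))
    ],
    remainder="passthrough"
)

# This pipeline is only defined here, never fitted, so new_spaceship itself is unchanged.
# The KNN pipeline built further down is the model that is actually trained.
clf = Pipeline(steps=[
    ("prep", pre),
    ("model", LogisticRegression(max_iter=1000))
])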

In [13]:
new_spaceship.head()

Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,Europa,False,B,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False
1,Earth,False,F,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True
2,Europa,False,A,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False
3,Europa,False,A,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False
4,Earth,False,F,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True
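
As the output above shows, the categorical columns are still strings at this point, because the encoder is applied only inside a pipeline. For a quick look at what dummy variables look like, `pd.get_dummies` can be applied directly. The sketch below uses a made-up three-row frame; it is an alternative illustration, not the approach used in the rest of this lab.

```python
import pandas as pd

# Toy frame (hypothetical rows mirroring the columns of the dataset)
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Europa"],
    "Cabin": ["B", "F", "A"],
    "Age": [39.0, 24.0, 58.0],
})

# Each category becomes its own 0/1 column; numeric columns are left untouched
dummies = pd.get_dummies(toy, columns=["HomePlanet", "Cabin"], dtype=int)
print(dummies)
```

One design consideration: `get_dummies` derives the columns from whatever data it is given, whereas a `OneHotEncoder` fitted on the training set guarantees that the test set is encoded with the same columns.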


**Perform Train Test Split**

In [None]:
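#your code here

from sklearn.model_selection import train_test_split

# Separate the features from the target column
X = new_spaceship.drop(columns=['Transported'])
y = new_spaceship['Transported']

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)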

In [20]:
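# "Transported" is already boolean, so this normalisation is only a safeguard against
# string labels; the model below is fitted on y_train directly.
y_train_b = pd.Series(y_train).astype(str).str.strip().str.lower().map({'true': True, 'false': False})
y_test_b  = pd.Series(y_test ).astype(str).str.strip().str.lower().map({'true': True, 'false': False})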
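
The split above is purely random. When the classes are imbalanced, an optional variant is to pass `stratify` so that both parts keep the same class proportions. The sketch below demonstrates this on made-up labels (70 `False`, 30 `True`); it is not required for this dataset, where the two classes are close to balanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 False followed by 30 True
y_toy = np.array([False] * 70 + [True] * 30)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 30% share of True in both the train and the test part
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```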



**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

In [15]:
#your code here
# "Transported" is a True/False label, so we need the KNN classifier rather than the regressor
from sklearn.neighbors import KNeighborsClassifier

In [16]:
X_train.select_dtypes(exclude='number').columns.tolist()


['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP']

In [22]:
import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier

num_sel = selector(dtype_include=np.number)
cat_sel = selector(dtype_exclude=np.number)

pre = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("imp", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler())
        ]), num_sel),
        ("cat", Pipeline([
            ("imp", SimpleImputer(strategy="most_frequent")),
            ("ohe", OneHotEncoder(handle_unknown="ignore"))  # sparse output by default
        ]), cat_sel),
    ]
)

model = Pipeline(steps=[
    ("prep", pre),
    ("knn", KNeighborsClassifier(n_neighbors=5, algorithm="brute"))  # use brute-force search if the output is sparse
])

model.fit(X_train, y_train)


Pipeline(steps=[('prep', ColumnTransformer(...)), ('knn', KNeighborsClassifier(...))])
- prep: ColumnTransformer(remainder='drop', sparse_threshold=0.3, verbose_feature_names_out=True)
  - num: SimpleImputer(strategy='median'), then StandardScaler(with_mean=True, with_std=True)
  - cat: SimpleImputer(strategy='most_frequent'), then OneHotEncoder(categories='auto', sparse_output=True, handle_unknown='ignore')
- knn: KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='brute', leaf_size=30, p=2, metric='minkowski')
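
The pipeline scales the numeric columns before KNN because KNN relies on distances, and a feature with a large range can dominate them. The sketch below illustrates the effect on synthetic data, assuming one informative small-scale feature and one large-scale noise feature; the data and variable names are made up and do not come from the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=400)

# Informative feature on a small scale, pure-noise feature on a large scale
informative = y_toy + rng.normal(0, 0.1, size=400)
noise = rng.normal(0, 1000, size=400)
X_toy = np.column_stack([informative, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# Without scaling, distances are dominated by the noise feature
raw_acc = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).score(X_te, y_te)

# With scaling, both features contribute on a comparable scale
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_acc = scaled_knn.fit(X_tr, y_tr).score(X_te, y_te)

print(raw_acc, scaled_acc)
```

In the real dataset the spending columns (up to several thousand) and `Age` (tens) differ in range in a similar way, which is a plausible reason for including `StandardScaler` in the numeric branch.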


- Evaluate your model's performance and comment on it.

In [23]:
#your code here

y_pred = model.predict(X_test)

In [24]:
from sklearn.metrics import classification_report, confusion_matrix

# Evaluation: confusion matrix (rows = true class, columns = predicted class) and per-class metrics
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[516 137]
 [142 527]]
              precision    recall  f1-score   support

       False       0.78      0.79      0.79       653
        True       0.79      0.79      0.79       669

    accuracy                           0.79      1322
   macro avg       0.79      0.79      0.79      1322
weighted avg       0.79      0.79      0.79      1322
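
The figures in the report can be recomputed from the confusion matrix, which helps when commenting on the result. The sketch below hard-codes the matrix printed above and assumes the scikit-learn convention that rows are true classes and columns are predicted classes, with `True` treated as the positive class.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[516, 137],
               [142, 527]])

# Row-major order for a binary problem: TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all passengers classified correctly
precision_true = tp / (tp + fp)   # of those predicted True, how many really are True
recall_true = tp / (tp + fn)      # of those really True, how many were found

print(round(accuracy, 2), round(precision_true, 2), round(recall_true, 2))
```

The model classifies about 79% of the 1322 test passengers correctly, and the errors are nearly symmetric (137 false positives against 142 false negatives), so it does not favour either class.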

