# Spaceship Titanic Classification with Binary Logistic Regression

#### Dataset: https://www.kaggle.com/competitions/spaceship-titanic/overview
##### Dataset License: https://creativecommons.org/licenses/by/4.0/

###### Author: Cody Weaver

### Load and Process Dataset

In [20]:
import pandas as pd
from lr import LogisticRegressionModel

### Training Set

In [21]:
# load raw train dataset
train_set = pd.read_csv('../data/train.csv')
print(train_set.head())
print(train_set.dtypes)

# convert bool labels to 0-1
def convert_labels(df, label_col='Transported'):
    return df[label_col].apply(lambda l: 0 if l == False else 1)

train_set['Transported'] = convert_labels(train_set)

  PassengerId HomePlanet CryoSleep  Cabin  Destination   Age    VIP  \
0     0001_01     Europa     False  B/0/P  TRAPPIST-1e  39.0  False   
1     0002_01      Earth     False  F/0/S  TRAPPIST-1e  24.0  False   
2     0003_01     Europa     False  A/0/S  TRAPPIST-1e  58.0   True   
3     0003_02     Europa     False  A/0/S  TRAPPIST-1e  33.0  False   
4     0004_01      Earth     False  F/1/S  TRAPPIST-1e  16.0  False   

   RoomService  FoodCourt  ShoppingMall     Spa  VRDeck               Name  \
0          0.0        0.0           0.0     0.0     0.0    Maham Ofracculy   
1        109.0        9.0          25.0   549.0    44.0       Juanna Vines   
2         43.0     3576.0           0.0  6715.0    49.0      Altark Susent   
3          0.0     1283.0         371.0  3329.0   193.0       Solam Susent   
4        303.0       70.0         151.0   565.0     2.0  Willy Santantines   

   Transported  
0        False  
1         True  
2        False  
3        False  
4         True  
Pa

### Classification using amount billed for amenities only

#### Process and Normalize data

In [22]:
amenities_columns = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

amenities_data = train_set.copy()

def normalize_amenity_data(df, col_names):
    # work on a copy of the requested columns so the input frame is not modified
    amenities = df[col_names].copy()

    # normalize to mean zero and unit variance (mean/std skip missing values)
    for column in col_names:
        column_mean = amenities[column].mean()
        column_std = amenities[column].std()
        amenities[column] = (amenities[column] - column_mean) / column_std

    # fill in missing values for amenities with the normalized mean (0)
    return amenities.fillna(0)

amenities_data[amenities_columns] = normalize_amenity_data(train_set, amenities_columns)
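
As a quick sanity check on this step, the sketch below applies the same standardize-then-fill idea to a small made-up frame. The `toy` data and the `standardize` helper are hypothetical names used only for illustration. Filling with 0 after standardizing is equivalent to imputing each column's mean.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two amenity columns; the values are made up for illustration.
toy = pd.DataFrame({
    'RoomService': [0.0, 109.0, 43.0, np.nan, 303.0],
    'Spa': [0.0, 549.0, 6715.0, 3329.0, np.nan],
})

def standardize(df, cols):
    """Return z-scored copies of `cols`, with missing values set to 0."""
    out = df[cols].copy()
    # mean() and std() skip NaN by default, so missing entries
    # do not distort the statistics used for scaling
    out = (out - out.mean()) / out.std()
    # after scaling, 0 corresponds to the column mean
    return out.fillna(0)

scaled = standardize(toy, ['RoomService', 'Spa'])
print(scaled)
print(scaled.isna().sum())
```

Note that the column standard deviation after filling is slightly below 1, because the imputed rows all sit exactly at the mean.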

#### Train Model

In [23]:
amenities_model = LogisticRegressionModel(dim=len(amenities_columns))
num_epochs = 5
amenities_model.fit_model(
    amenities_data[amenities_columns].to_numpy(),
    amenities_data['Transported'],
    num_epochs=num_epochs,
)

Epoch: 0: 100%|██████████| 8693/8693 [00:00<00:00, 113241.97it/s]

Average Log Loss for epoch 0: 0.69

Epoch: 1: 100%|██████████| 8693/8693 [00:00<00:00, 107285.18it/s]

Average Log Loss for epoch 1: 0.68

Epoch: 2: 100%|██████████| 8693/8693 [00:00<00:00, 107941.69it/s]

Average Log Loss for epoch 2: 0.67

Epoch: 3: 100%|██████████| 8693/8693 [00:00<00:00, 112366.36it/s]

Average Log Loss for epoch 3: 0.67

Epoch: 4: 100%|██████████| 8693/8693 [00:00<00:00, 114163.15it/s]

Average Log Loss for epoch 4: 0.66
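
The `lr` module is not shown in this notebook, so the following is only a minimal sketch of what per-example stochastic gradient descent for logistic regression could look like, not the actual `LogisticRegressionModel` implementation. The names `sigmoid`, `log_loss`, and `sgd_epoch`, the learning rate, and the synthetic data are all assumptions made for illustration. One useful reference point: a model that predicts 0.5 for every row has a log loss of ln 2 ≈ 0.693, which helps when reading the per-epoch values above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    # clip to avoid log(0) for very confident predictions
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def sgd_epoch(w, b, X, y, lr=0.05):
    """One pass of per-example SGD; returns updated parameters and the mean loss."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(x_i @ w + b)
        total += log_loss(y_i, p)
        grad = p - y_i              # derivative of the loss w.r.t. the logit
        w = w - lr * grad * x_i
        b = b - lr * grad
    return w, b, total / len(X)

# Synthetic data with 5 features, mirroring the 5 amenity columns (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 1.5])
y = (sigmoid(X @ true_w) > rng.uniform(size=200)).astype(float)

w, b = np.zeros(5), 0.0
losses = []
for epoch in range(5):
    w, b, loss = sgd_epoch(w, b, X, y)
    losses.append(loss)
    print('Average Log Loss for epoch %d: %.2f' % (epoch, loss))
```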





#### Evaluate Amenities Model on Train Set

In [24]:
X = amenities_data[amenities_columns].to_numpy()
Y = amenities_data['Transported']
print('Accuracy: %.2f, Precision: %.2f, Recall: %.2f' % amenities_model.evaluate(X, Y))

Accuracy: 0.70, Precision: 0.64, Recall: 0.93
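
The `evaluate` method also lives in the unshown `lr` module. Assuming it reports the standard binary metrics, they could be computed from predicted and true labels roughly as follows. The `binary_metrics` helper and the toy labels are hypothetical and only illustrate the definitions.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    accuracy = float(np.mean(y_pred == y_true))
    precision = float(tp / (tp + fp)) if (tp + fp) > 0 else 0.0
    recall = float(tp / (tp + fn)) if (tp + fn) > 0 else 0.0
    return accuracy, precision, recall

# Toy labels: the classifier over-predicts the positive class.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 0]
acc, prec, rec = binary_metrics(y_true, y_pred)
print('Accuracy: %.2f, Precision: %.2f, Recall: %.2f' % (acc, prec, rec))
```

A recall much higher than precision, as in the result above, indicates that the model labels many passengers as transported and accepts a larger number of false positives in exchange for missing few true positives.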


### Full Feature Logistic Regression

In [25]:
data = pd.DataFrame(train_set[['PassengerId', 'Transported']])

# normalize continuous numerical features
NUMERICAL_FEATS = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

def normalize_numerical_feats(df, cols):
    numerical_cols = df[cols].copy()

    for col in cols:
        mean = numerical_cols[col].mean()
        std = numerical_cols[col].std()

        # vectorized z-score; missing values stay NaN until filled below
        numerical_cols[col] = (numerical_cols[col] - mean) / std

    # fill in missing values with mean (0); fillna returns a new frame,
    # so the result must be returned (or assigned) to take effect
    return numerical_cols.fillna(0)

data[NUMERICAL_FEATS] = normalize_numerical_feats(train_set, NUMERICAL_FEATS)

# convert boolean values to 0, 1
BOOLEAN_FEATS = ['CryoSleep', 'VIP']

def convert_bool_feats(df, cols):
    boolean_cols = df[cols].copy()

    for col in cols:
        # compare against True explicitly: NaN is truthy in Python, so
        # `1 if x else 0` would map missing values to 1; here they become 0
        boolean_cols[col] = boolean_cols[col].eq(True).astype(int)

    return boolean_cols

data[BOOLEAN_FEATS] = convert_bool_feats(train_set, BOOLEAN_FEATS)

print(train_set['Cabin'].value_counts())

Cabin
G/734/S     8
B/11/S      7
F/1411/P    7
B/82/S      7
G/981/S     7
           ..
G/543/S     1
B/106/P     1
G/542/S     1
F/700/P     1
G/559/P     1
Name: count, Length: 6560, dtype: int64
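
With 6560 distinct values, `Cabin` is too sparse to use directly as a categorical feature. The values above appear to follow a deck/number/side pattern, so one possible approach (an assumption here, not necessarily what the notebook does next) is to split the string into its parts. The column names `Deck`, `Num`, and `Side` and the sample values below are chosen for illustration.

```python
import pandas as pd

# A few sample cabin strings, including a missing value.
cabins = pd.Series(['B/0/P', 'F/0/S', 'A/0/S', None, 'G/734/S'], name='Cabin')

# Split on '/' into three columns; missing cabins yield missing values in every part.
parts = cabins.str.split('/', expand=True)
parts.columns = ['Deck', 'Num', 'Side']

# The middle component is numeric; coerce anything unparseable to NaN.
parts['Num'] = pd.to_numeric(parts['Num'], errors='coerce')

print(parts)
print(parts['Deck'].value_counts())
```

The resulting `Deck` and `Side` columns have only a handful of levels each, which makes them candidates for one-hot encoding.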
