# Spaceship Titanic 
A Kaggle competition to predict which passengers were transported to an alternate dimension.

The data is available at: https://www.kaggle.com/competitions/spaceship-titanic/data

Data Dictionary:

- `PassengerId`: A unique Id for each passenger. Each Id takes the form gggg_pp, where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
- `HomePlanet`: The planet the passenger departed from, typically their planet of permanent residence.
- `CryoSleep`: Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- `Cabin`: The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- `Destination`: The planet the passenger will be debarking to.
- `Age`: The age of the passenger.
- `VIP`: Whether the passenger has paid for special VIP service during the voyage.
- `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck`: Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- `Name`: The first and last names of the passenger.
- `Transported`: Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
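
The composite fields described above (`PassengerId` and `Cabin`) can be split into their parts with pandas string methods. The following is a minimal sketch on a small hand-made frame; the `Group` and `NumberInGroup` column names are hypothetical and are not used elsewhere in this project.

```python
import pandas as pd

# Illustrative rows only; the values follow the formats in the data dictionary.
df = pd.DataFrame({
    "PassengerId": ["0003_01", "0003_02", "0004_01"],
    "Cabin": ["A/0/S", "A/0/S", "F/1/S"],
})

# gggg_pp -> group id and number within the group (hypothetical column names).
df[["Group", "NumberInGroup"]] = df["PassengerId"].str.split("_", expand=True)

# deck/num/side -> three separate columns.
df[["Deck", "Cabin_number", "Side"]] = df["Cabin"].str.split("/", expand=True)

# Passengers travelling together share a Group value, so group size is a simple count.
group_size = df.groupby("Group")["PassengerId"].transform("count")
```

Here the two `0003_*` passengers end up in the same group, which is the kind of relationship the `gggg` prefix encodes.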

Information about the evaluation can be found at: https://www.kaggle.com/competitions/spaceship-titanic/overview/evaluation

In this project I used a LogisticRegression model to make the predictions. The submission scored an accuracy of 0.78887 on Kaggle.

In [178]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

In [74]:
data = pd.read_csv("train.csv")
data.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [75]:
# Convert booleans to 0/1. Note that astype(bool) turns missing values (NaN) into True.
data["VIP"] = data["VIP"].astype(bool).astype(int)
data["CryoSleep"] = data["CryoSleep"].astype(bool).astype(int)
data["Transported"] = data["Transported"].astype(int)
# Split Cabin (deck/num/side) into three columns, then drop the original column.
data[["Deck", "Cabin_number", "Side"]] = data["Cabin"].str.split("/", expand=True)
data = data.drop("Cabin", axis=1)
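
One thing worth checking in the cell above: `astype(bool)` treats a missing value (`NaN`) as `True`, so passengers with an unknown `VIP` or `CryoSleep` value end up encoded as 1. The sketch below shows the effect on a hand-made series, together with one possible alternative that keeps the value missing so it can be imputed later. The alternative is an assumption about what might be preferable, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Hand-made example: a boolean column with one missing entry (object dtype).
vip = pd.Series([True, False, np.nan], dtype=object)

# What the notebook does: NaN is truthy, so it becomes True and then 1.
as_notebook = vip.astype(bool).astype(int)

# Possible alternative (assumption): map explicitly so the missing value stays NaN.
kept_missing = vip.map({True: 1, False: 0})
```

With `kept_missing`, the later median-fill step would decide the value instead of it silently defaulting to 1.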

In [82]:
# Turn every string column into an ordered categorical.
for label, content in data.items():
    if pd.api.types.is_string_dtype(content):
        data[label] = content.astype("category").cat.as_ordered()


In [84]:
datasets = data.copy()

In [90]:
# Replace non-numeric columns with integer category codes (missing values become 0).
for label, content in datasets.items():
    if not pd.api.types.is_numeric_dtype(content):
        datasets[label] = pd.Categorical(content).codes + 1
# Fill missing numeric values with the column median.
for label, content in datasets.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            datasets[label] = content.fillna(content.median())
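
To illustrate how the encoding above treats missing values: `pd.Categorical(...).codes` assigns -1 to missing entries, so after adding 1 they become 0 and the real categories are numbered from 1. A small example with made-up values:

```python
import pandas as pd

# Made-up column with one missing planet.
home = pd.Series(["Earth", "Europa", None, "Mars", "Earth"])

cat = pd.Categorical(home)
# Categories are sorted: Earth, Europa, Mars. Missing entries get code -1.
codes = cat.codes + 1  # missing -> 0, Earth -> 1, Europa -> 2, Mars -> 3
```

Because missing categorical values are already mapped to 0 at this point, the median fill that follows only affects columns that were numeric to begin with.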

In [94]:
X = datasets.drop("Transported", axis = 1)
y = datasets["Transported"]

In [107]:
X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=0.2,random_state=42)

model = LogisticRegression(solver='lbfgs',max_iter=1000)
model.fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
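
The warning above suggests scaling the features. A minimal sketch of that suggestion, using a `Pipeline` with `StandardScaler`, is shown below on synthetic data from `make_classification` rather than on the competition data, so the score it produces says nothing about the Kaggle result. The variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_demo[:, 0] *= 10_000  # exaggerate one feature's scale, like the spending columns

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fitted on the training split only and reused at predict time.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)
val_accuracy = scaled_model.score(X_va, y_va)
```

Putting the scaler inside the pipeline also means that a cross-validated search over the pipeline would refit the scaler on each fold's training part.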


In [106]:
model.score(X_val,y_val)

0.7722829212190915

In [108]:
log_reg_grid = {"C": np.logspace(-4, 4, 20),
                "solver": ["liblinear"]}

In [115]:
np.random.seed(42)
rs_log_reg = RandomizedSearchCV(LogisticRegression(),
                                param_distributions=log_reg_grid,
                                cv=5,
                                n_iter=20,
                                verbose=True)
rs_log_reg.fit(X_train, y_train)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


In [116]:
rs_log_reg.best_params_

{'solver': 'liblinear', 'C': 0.012742749857031334}
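
Because `log_reg_grid` holds only fixed lists of values and contains 20 combinations, a randomized search with `n_iter=20` ends up trying every one of them, which makes it equivalent to a grid search over the same values. A quick check using scikit-learn's `ParameterSampler`, used here purely for illustration:

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

log_reg_grid = {"C": np.logspace(-4, 4, 20), "solver": ["liblinear"]}

# When no parameter is a distribution, sampling is done without replacement.
sampled = list(ParameterSampler(log_reg_grid, n_iter=20, random_state=42))
unique_C = {params["C"] for params in sampled}
```

All 20 values of `C` appear exactly once, so a smaller `n_iter` or a continuous distribution for `C` would be needed for the search to be a genuine random subsample.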

In [117]:
rs_log_reg.score(X_val,y_val)

0.7705577918343876

In [118]:
log_reg_grid = {"C": np.logspace(-4, 4, 30),
                "solver": ["liblinear"]}

gs_log_reg = GridSearchCV(LogisticRegression(),
                          param_grid=log_reg_grid,
                          cv=5,
                          verbose=True)

gs_log_reg.fit(X_train, y_train)

Fitting 5 folds for each of 30 candidates, totalling 150 fits


In [119]:
gs_log_reg.best_params_

{'C': 0.008531678524172805, 'solver': 'liblinear'}

In [120]:
gs_log_reg.score(X_val,y_val)

0.7722829212190915

In [166]:
test_data = pd.read_csv("test.csv")
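# Keep an untouched copy: preprocess_data() below modifies test_data in place,
# and the original PassengerId values are needed for the submission file.
test_data_copy = test_data.copy()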
test_data.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name
0,0013_01,Earth,True,G/3/S,TRAPPIST-1e,27.0,False,0.0,0.0,0.0,0.0,0.0,Nelly Carsoning
1,0018_01,Earth,False,F/4/S,TRAPPIST-1e,19.0,False,0.0,9.0,0.0,2823.0,0.0,Lerome Peckers
2,0019_01,Europa,True,C/0/S,55 Cancri e,31.0,False,0.0,0.0,0.0,0.0,0.0,Sabih Unhearfus
3,0021_01,Europa,False,C/1/S,TRAPPIST-1e,38.0,False,0.0,6652.0,0.0,181.0,585.0,Meratz Caltilter
4,0023_01,Earth,False,F/5/S,TRAPPIST-1e,20.0,False,10.0,0.0,635.0,0.0,0.0,Brence Harperez


In [167]:
def preprocess_data(test_data):
    """Apply the same preprocessing steps used on the training data.

    The frame is modified in place and also returned.
    """
    # Convert booleans to 0/1 (missing values become True under astype(bool)).
    test_data["VIP"] = test_data["VIP"].astype(bool).astype(int)
    test_data["CryoSleep"] = test_data["CryoSleep"].astype(bool).astype(int)

    # Split Cabin (deck/num/side); astype(str) turns missing cabins into "nan".
    test_data["Cabin"] = test_data["Cabin"].astype(str)
    test_data[["Deck", "Cabin_number", "Side"]] = test_data["Cabin"].str.split("/", expand=True)

    # String columns -> ordered categoricals -> integer codes (missing -> 0).
    for label, content in test_data.items():
        if pd.api.types.is_string_dtype(content):
            test_data[label] = content.astype("category").cat.as_ordered()
    for label, content in test_data.items():
        if not pd.api.types.is_numeric_dtype(content):
            test_data[label] = pd.Categorical(content).codes + 1

    # Fill missing numeric values with the column median.
    for label, content in test_data.items():
        if pd.api.types.is_numeric_dtype(content):
            if pd.isnull(content).sum():
                test_data[label] = content.fillna(content.median())
    return test_data
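
Note that `pd.Categorical(...).codes` numbers categories from the values present in the frame it is given, so the same string can receive different codes in the training and test sets when their sets of values differ. One possible way to keep them aligned, sketched here as an assumption rather than as what this notebook does, is to reuse the training categories when encoding the test column:

```python
import pandas as pd

# Made-up columns; "Venus" never appears in the training data.
train_col = pd.Series(["Earth", "Europa", "Mars"])
test_col = pd.Series(["Mars", "Earth", "Venus"])

# Independent encoding: codes depend only on the values in the test column.
independent = pd.Categorical(test_col).codes + 1

# Aligned encoding: reuse the training categories; unseen values become 0.
train_categories = pd.Categorical(train_col).categories
aligned = pd.Categorical(test_col, categories=train_categories).codes + 1
```

In the independent version "Mars" is coded 2, while under the training categories it is coded 3, which is the value a model fitted on the training data would associate with it.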

In [168]:
df_test = preprocess_data(test_data)
df_test.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Deck,Cabin_number,Side
0,1,1,1,2785,3,27.0,0,0.0,0.0,0.0,0.0,0.0,2913,7,821,2
1,2,1,0,1868,3,19.0,0,0.0,9.0,0.0,2823.0,0.0,2407,6,928,2
2,3,2,1,258,1,31.0,0,0.0,0.0,0.0,0.0,0.0,3377,3,1,2
3,4,2,0,260,3,38.0,0,0.0,6652.0,0.0,181.0,585.0,2712,3,2,2
4,5,1,0,1941,3,20.0,0,10.0,0.0,635.0,0.0,0.0,669,6,1030,2


In [169]:
df_test = df_test.drop("Cabin", axis = 1)

In [171]:
test_preds = gs_log_reg.predict(df_test)
test_preds

array([1, 0, 1, ..., 1, 1, 1])

In [172]:
df_preds = pd.DataFrame()
# Use the untouched copy: PassengerId in test_data was replaced by category codes.
df_preds["PassengerId"] = test_data_copy["PassengerId"]
df_preds["Transported"] = test_preds
df_preds["Transported"] = df_preds["Transported"].astype(bool)
df_preds

Unnamed: 0,PassengerId,Transported
0,0013_01,True
1,0018_01,False
2,0019_01,True
3,0021_01,True
4,0023_01,True
...,...,...
4272,9266_02,True
4273,9269_01,False
4274,9271_01,True
4275,9273_01,True


In [174]:
df_preds.to_csv("test_prediction.csv", index=False)