# Space Titanic
> Predict if passengers of an intergalatic transporter transported its passengers or not. 

## Current Task
- Create more parameters from cabin and passengerId 
- Increase the accuracy (research solutions)
    - Is the data properly cleaned? 
    - Am I using the correct type of model? 
    - What hyperparameters should I be using

#### Understanding the data

![SpaceTitanicImage](<./SpaceTitanicData.png>)

#### Libraries

In [1]:
# import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score, classification_report

### Retrieve and View Data

In [2]:
# Read in the train and test data
train = pd.read_csv('./spaceship-titanic/train.csv')
test = pd.read_csv('./spaceship-titanic/test.csv')
sampleSubmission = pd.read_csv('./spaceship-titanic/sample_submission.csv')
print(train.head())

  PassengerId HomePlanet CryoSleep  Cabin  Destination   Age    VIP  \
0     0001_01     Europa     False  B/0/P  TRAPPIST-1e  39.0  False   
1     0002_01      Earth     False  F/0/S  TRAPPIST-1e  24.0  False   
2     0003_01     Europa     False  A/0/S  TRAPPIST-1e  58.0   True   
3     0003_02     Europa     False  A/0/S  TRAPPIST-1e  33.0  False   
4     0004_01      Earth     False  F/1/S  TRAPPIST-1e  16.0  False   

   RoomService  FoodCourt  ShoppingMall     Spa  VRDeck               Name  \
0          0.0        0.0           0.0     0.0     0.0    Maham Ofracculy   
1        109.0        9.0          25.0   549.0    44.0       Juanna Vines   
2         43.0     3576.0           0.0  6715.0    49.0      Altark Susent   
3          0.0     1283.0         371.0  3329.0   193.0       Solam Susent   
4        303.0       70.0         151.0   565.0     2.0  Willy Santantines   

   Transported  
0        False  
1         True  
2        False  
3        False  
4         True  


#### FIll in missing

In [3]:
train.fillna(0, inplace=True)
test.fillna(0, inplace=True)

#### Notes on Data
> Predicting if the passenger was simiply transported or not (boolean).
> This is because that is the only values in the train which isn't in the test

### Process Data

#### Notes
> Now we need to remove unneeded values and turn non-numeric values into numbers

##### Remove
- Name
- PassengerId (removed on test later)
- Cabin

##### Turn into Numbers
- Destination
- Cabin
- VIP
- CryoSleep
- HomePlanet

#### Break Down Cabins

In [4]:
cabinTrain = train['Cabin']

deck = []
num = []
side = []


for cabin in cabinTrain:
    if isinstance(cabin, str):  # Ensure cabin is not NaN and is a valid string
        deck.append(cabin[0])   # Index 0 for deck
        num.append(cabin[2])    # Index 2 for num
        side.append(cabin[4])   # Index 4 for side
    else:
        # Handle cases where the value is NaN or invalid
        deck.append(None)
        num.append(None)
        side.append(None)

train["Deck"] = deck
train["Num"] = num
train["Side"] = side

train.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,Deck,Num,Side
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False,B,0,P
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True,F,0,S
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,A,0,S
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,A,0,S
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True,F,1,S


In [5]:
cabinTest = test['Cabin']

deck = []
num = []
side = []


for cabin in cabinTest:
    if isinstance(cabin, str):  # Ensure cabin is not NaN and is a valid string
        deck.append(cabin[0])   # Index 0 for deck
        num.append(cabin[2])    # Index 2 for num
        side.append(cabin[4])   # Index 4 for side
    else:
        # Handle cases where the value is NaN or invalid
        deck.append(None)
        num.append(None)
        side.append(None)

test["Deck"] = deck
test["Num"] = num
test["Side"] = side

#### Remove Name

In [6]:
train = train.drop(columns=["Name"])
train = train.drop(columns="PassengerId")
test = test.drop(columns=["Name"])

train = train.drop(columns="Cabin")
test = test.drop(columns="Cabin")


#### Destination to Number

In [7]:
uniqueDestinations = train['Destination'].unique()
print(uniqueDestinations)
# Looks like there are some missing values that I need to account for

['TRAPPIST-1e' 'PSO J318.5-22' '55 Cancri e' 0]


In [8]:
train['Destination'], _ = pd.factorize(train['Destination'])

In [9]:
test['Destination'], _ = pd.factorize(test['Destination'])

In [10]:
# Confirm there should be 4 different values
uniqueDestinationsNumbers = train['Destination'].unique()
print(uniqueDestinationsNumbers)

[0 1 2 3]


In [11]:
# Deck Side
train['Deck'], _ = pd.factorize(train['Deck'])
train['Deck'], _ = pd.factorize(train['Deck'])

test['Side'], _ = pd.factorize(test['Side'])
test['Side'], _ = pd.factorize(test['Side'])

#### VIP to Number

In [12]:
train['VIP'], _ = pd.factorize(train['VIP'])
test['VIP'], _ = pd.factorize(test['VIP'])

#### CryoSleep to Number

In [13]:
train['CryoSleep'], _ = pd.factorize(train['CryoSleep'])
test['CryoSleep'], _ = pd.factorize(test['CryoSleep'])

#### HomePlanet to Number

In [14]:
train['HomePlanet'], _ = pd.factorize(train['HomePlanet'])
test['HomePlanet'], _ = pd.factorize(test['HomePlanet'])

In [15]:
train = train.applymap(lambda x: x.replace('/', '0') if isinstance(x, str) else x)
test = test.applymap(lambda x: x.replace('/', '0') if isinstance(x, str) else x)

# Convert all cells to numeric; non-numeric values become NaN
train = train.applymap(lambda x: pd.to_numeric(x, errors='coerce'))
# Convert all cells to numeric; non-numeric values become NaN
train.head()

  train = train.applymap(lambda x: x.replace('/', '0') if isinstance(x, str) else x)
  test = test.applymap(lambda x: x.replace('/', '0') if isinstance(x, str) else x)
  train = train.applymap(lambda x: pd.to_numeric(x, errors='coerce'))


Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Deck,Num,Side
0,0,0,0,39.0,0,0.0,0.0,0.0,0.0,0.0,False,0,0.0,
1,1,0,0,24.0,0,109.0,9.0,25.0,549.0,44.0,True,1,0.0,
2,0,0,0,58.0,1,43.0,3576.0,0.0,6715.0,49.0,False,2,0.0,
3,0,0,0,33.0,0,0.0,1283.0,371.0,3329.0,193.0,False,2,0.0,
4,1,0,0,16.0,0,303.0,70.0,151.0,565.0,2.0,True,1,1.0,


#### Split Train Data Set

In [16]:
# Shuffle the data and reset the index
train = train.sample(frac=1, random_state=42).reset_index(drop=True)

Y = train['Transported']

X = train.drop(columns='Transported')



X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

### Train

In [17]:
# model = RandomForestClassifier(max_depth=30, n_estimators=40, min_samples_split=50, min_samples_leaf=14, random_state=88)
model = RandomForestClassifier(random_state=42) # This works better than the one above, even though it has worse accuracy

# Train the model
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Classification report
print(classification_report(y_test, y_pred))




Accuracy: 0.78
              precision    recall  f1-score   support

       False       0.77      0.80      0.79       887
        True       0.78      0.75      0.77       852

    accuracy                           0.78      1739
   macro avg       0.78      0.78      0.78      1739
weighted avg       0.78      0.78      0.78      1739



### Run Model on test

In [18]:
test_ID = test["PassengerId"]
test = test.applymap(lambda x: pd.to_numeric(x, errors='coerce'))

test = test.drop(columns="PassengerId")

predictions = model.predict(test)

  test = test.applymap(lambda x: pd.to_numeric(x, errors='coerce'))


### Format Answer / Create CSV

In [19]:
Answer = pd.DataFrame({
    "PassengerId": test_ID,
    "Transported": predictions
})

print(Answer.head())

  PassengerId  Transported
0     0013_01         True
1     0018_01        False
2     0019_01         True
3     0021_01         True
4     0023_01         True


In [20]:
Answer.to_csv("SpaceTitanic_Answer.csv", index=False)