# Space Titanic
> Predict whether each passenger of an intergalactic transporter was transported or not.

## Current Task
- Create more features from Cabin and PassengerId
- Increase the accuracy (research solutions)
    - Is the data properly cleaned? 
    - Am I using the correct type of model? 
    - What hyperparameters should I be using?
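
As a starting point for the first task, here is a minimal sketch of feature extraction. It assumes `PassengerId` has the form `gggg_pp` (travel group, then number within the group), which matches the IDs printed later in this notebook, and that `Cabin` has the form `deck/num/side`, which is taken from the competition's data description and not verified here. The helper name `add_id_and_cabin_features` and the toy rows are hypothetical.

```python
import pandas as pd

# Toy rows standing in for train.csv; the Cabin values are illustrative only.
demo = pd.DataFrame({
    "PassengerId": ["0013_01", "0018_01", "0018_02"],
    "Cabin": ["G/3/S", "F/4/S", None],
})

def add_id_and_cabin_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: derive group and cabin features on a copy of df."""
    out = df.copy()

    # PassengerId: the part before the underscore identifies the travel group.
    id_parts = out["PassengerId"].str.split("_", expand=True)
    out["Group"] = id_parts[0].astype(int)
    out["GroupSize"] = out.groupby("Group")["PassengerId"].transform("count")

    # Cabin is assumed to be "deck/num/side"; missing cabins stay missing.
    cabin_parts = out["Cabin"].str.split("/", expand=True)
    out["Deck"] = cabin_parts[0]
    out["CabinNum"] = pd.to_numeric(cabin_parts[1])
    out["Side"] = cabin_parts[2]
    return out

features = add_id_and_cabin_features(demo)
print(features)
```

This would have to run before `Cabin` and `PassengerId` are dropped and before missing values are overwritten. `Deck` and `Side` are still strings, so they would need encoding like the other categorical columns.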

In [398]:
# import libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, classification_report

### Retrieve and View Data

In [399]:
# Read in the train and test data
train = pd.read_csv('./spaceship-titanic/train.csv')
test = pd.read_csv('./spaceship-titanic/test.csv')
sampleSubmission = pd.read_csv('./spaceship-titanic/sample_submission.csv')

#### Fill in Missing Values

In [400]:
train.fillna(0, inplace=True)
test.fillna(0, inplace=True)
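
Filling everything with 0 mixes a numeric placeholder into text columns and treats a missing numeric value as a real zero. One possible alternative, sketched below, fills numeric columns with the median and other columns with the most frequent value, both computed from a reference frame (normally the train set). The helper name `impute` is hypothetical, and `Age` and `RoomService` are assumed column names used only for this toy example.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; the values are illustrative only.
demo = pd.DataFrame({
    "Age": [24.0, np.nan, 58.0],
    "RoomService": [0.0, 109.0, np.nan],
    "HomePlanet": ["Europa", None, "Europa"],
})

def impute(df: pd.DataFrame, reference: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fill gaps in df using statistics from reference."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(reference[col]):
            fill_value = reference[col].median()        # numeric: median
        else:
            fill_value = reference[col].mode().iloc[0]  # other: most frequent
        out[col] = out[col].fillna(fill_value)
    return out

filled = impute(demo, reference=demo)
print(filled)
```

Passing the train frame as `reference` for both train and test keeps the fill values identical across the two sets. Whether this beats the 0 fill on this data is untested here.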

#### Notes on Data
> The target is simply whether the passenger was transported or not (boolean).
> This is because `Transported` is the only column in the train set that isn't in the test set.

### Process Data

#### Notes
> Now we need to remove unneeded columns and turn non-numeric values into numbers.

##### Remove
- Name
- PassengerId (removed on test later)
- Cabin

##### Turn into Numbers
- Destination
- Cabin (not yet; dropped for now)
- VIP
- CryoSleep
- HomePlanet

#### Remove Name and PassengerId

In [401]:
train = train.drop(columns=["Name", "PassengerId"])
# Keep PassengerId on test for now; it is needed for the submission file
test = test.drop(columns=["Name"])

#### Remove Cabin

In [402]:
# There are many distinct cabins and I don't know how to use them yet,
# so drop the column for now to get a working baseline.
train = train.drop(columns='Cabin')
test = test.drop(columns='Cabin')

#### Destination to Number

In [403]:
uniqueDestinations = train['Destination'].unique()
print(uniqueDestinations)
# The 0 is the placeholder that fillna(0) wrote over missing destinations; it needs to be accounted for

['TRAPPIST-1e' 'PSO J318.5-22' '55 Cancri e' 0]


In [404]:
train['Destination'], _ = pd.factorize(train['Destination'])

In [405]:
test['Destination'], _ = pd.factorize(test['Destination'])

In [406]:
# Confirm there are 4 distinct codes
uniqueDestinationsNumbers = train['Destination'].unique()
print(uniqueDestinationsNumbers)

[0 1 2 3]


#### VIP to Number

In [407]:
train['VIP'], _ = pd.factorize(train['VIP'])
test['VIP'], _ = pd.factorize(test['VIP'])

#### CryoSleep to Number

In [408]:
train['CryoSleep'], _ = pd.factorize(train['CryoSleep'])
test['CryoSleep'], _ = pd.factorize(test['CryoSleep'])

#### HomePlanet to Number

In [409]:
train['HomePlanet'], _ = pd.factorize(train['HomePlanet'])
test['HomePlanet'], _ = pd.factorize(test['HomePlanet'])
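
One thing to check in the encoding cells above: `pd.factorize` assigns codes in order of first appearance, so factorizing train and test separately can give the same value different codes in the two frames. A minimal sketch of one way to keep them aligned is to factorize train once and look up the test values in the resulting categories. The toy frames and the value `"Unknown"` are illustrative only.

```python
import pandas as pd

# Toy frames; note that the order of first appearance differs between them.
train_demo = pd.DataFrame(
    {"Destination": ["TRAPPIST-1e", "PSO J318.5-22", "55 Cancri e"]}
)
test_demo = pd.DataFrame(
    {"Destination": ["55 Cancri e", "TRAPPIST-1e", "Unknown"]}
)

# Learn the value -> code mapping from train only.
codes, categories = pd.factorize(train_demo["Destination"])
train_demo["Destination"] = codes

# Reuse the train categories for test; values never seen in train become -1.
test_demo["Destination"] = categories.get_indexer(test_demo["Destination"])

print(train_demo["Destination"].tolist())
print(test_demo["Destination"].tolist())
```

The same pattern would apply to `VIP`, `CryoSleep`, and `HomePlanet`. How to treat the `-1` code for unseen values is a separate decision that this sketch leaves open.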

#### Split Train Data Set

In [410]:
# Shuffle the data and reset the index
train = train.sample(frac=1, random_state=42).reset_index(drop=True)

Y = train['Transported']

X = train.drop(columns='Transported')


X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

### Train

In [None]:
# model = RandomForestClassifier(max_depth=30, n_estimators=40, min_samples_split=50, min_samples_leaf=14, random_state=88)
model = RandomForestClassifier(random_state=42) # This works better than the one above, even though it has worse accuracy

# Train the model
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Classification report
print(classification_report(y_test, y_pred))


Accuracy: 0.79
              precision    recall  f1-score   support

       False       0.81      0.76      0.79       887
        True       0.77      0.81      0.79       852

    accuracy                           0.79      1739
   macro avg       0.79      0.79      0.79      1739
weighted avg       0.79      0.79      0.79      1739
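
`GridSearchCV` is imported at the top but not used yet. For the hyperparameter question in the task list, a minimal sketch of a cross-validated search could look like the following. The synthetic data stands in for `X_train` and `y_train`, and the grid values are arbitrary starting points, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train so the sketch runs on its own.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f0", "f1", "f2", "f3"])
y_demo = (X_demo["f0"] + X_demo["f1"]) > 0

# Small, arbitrary grid; every combination is scored with 3-fold CV.
param_grid = {
    "n_estimators": [20, 40],
    "max_depth": [5, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Mean CV accuracy: {search.best_score_:.2f}")
```

Because the score is averaged over folds, it is less sensitive to one lucky or unlucky split than a single hold-out accuracy. `search.best_estimator_` is refit on all the data passed to `fit` and can be used for prediction directly.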



### Run Model on Test

In [412]:
test_ID = test["PassengerId"]

test = test.drop(columns="PassengerId")

predictions = model.predict(test)

### Format Answer / Create CSV

In [413]:
Answer = pd.DataFrame({
    "PassengerId": test_ID,
    "Transported": predictions
})

print(Answer.head())

  PassengerId  Transported
0     0013_01         True
1     0018_01        False
2     0019_01         True
3     0021_01         True
4     0023_01         True


In [414]:
Answer.to_csv("SpaceTitanic_Answer.csv", index=False)