# Spaceship Titanic Data Science Solutions

### Porting issues

- Specify plot dimensions, bring legend into plot.


### Best practices

- Performing feature correlation analysis early in the project.
- Using multiple plots instead of overlays for readability.

### Import function

In [None]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd
from scipy import stats

# visualization
import seaborn as sns  # plotting
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder

## Acquire data

The Python pandas package helps us work with our datasets. We start by loading the training and testing datasets into pandas DataFrames. We also collect both DataFrames in a list so that certain operations can be run on both datasets together.
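A minimal sketch of how a list of DataFrames behaves, using two made-up frames (`train`, `test`, `both`, and the `AgeKnown` column are hypothetical names, not part of the notebook). Assigning a column inside the loop changes the original frames, but rebinding a name to a new DataFrame leaves the list pointing at the old object, which is presumably why the list has to be rebuilt after such steps.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real train/test data.
train = pd.DataFrame({"Age": [20.0, None]})
test = pd.DataFrame({"Age": [None, 40.0]})
both = [train, test]

# Column assignment inside the loop modifies the same objects the names refer to.
for frame in both:
    frame["Age"] = frame["Age"].fillna(frame["Age"].median())

# Rebinding `train` creates a new DataFrame; the list still holds the old one.
train = train.assign(AgeKnown=True)
list_is_stale = both[0] is not train

# Rebuild the list so later loops see the new frame.
both = [train, test]
print(list_is_stale, both[0] is train)
```
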

In [None]:
train_df = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
test_df = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')
combine = [train_df, test_df]

## Analyze by describing data

pandas also helps describe the datasets, answering the following questions early in our project.

**Which features are available in the dataset?**

Note the feature names so that we can manipulate or analyze them directly. These feature names are described on the [Kaggle data page here](https://www.kaggle.com/competitions/spaceship-titanic/data).

In [None]:
print(train_df.columns.values)

### Feature descriptions
- PassengerId : A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
- HomePlanet : The planet the passenger departed from, typically their planet of permanent residence.
- CryoSleep : Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- Cabin : The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- Destination : The planet the passenger will be debarking to.
- Age : The age of the passenger.
- VIP : Whether the passenger has paid for special VIP service during the voyage.
- RoomService, FoodCourt, ShoppingMall, Spa, VRDeck : Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- Name : The first and last names of the passenger.
- Transported : Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
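The `gggg_pp` and `deck/num/side` formats described above can be split into separate columns. This is only a sketch on made-up rows that follow the documented formats; the `member` column name is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical rows that follow the documented formats; not taken from the real data.
sample = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02"],
    "Cabin": ["B/0/P", "F/12/S", "F/12/S"],
})

# PassengerId has the form gggg_pp: group number, then number within the group.
id_parts = sample["PassengerId"].str.split("_", expand=True)
sample["group"] = id_parts[0].astype(int)
sample["member"] = id_parts[1].astype(int)

# Cabin has the form deck/num/side.
cabin_parts = sample["Cabin"].str.split("/", expand=True)
sample["deck"] = cabin_parts[0]
sample["num"] = cabin_parts[1].astype(int)
sample["side"] = cabin_parts[2]

print(sample)
```
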

In [None]:
train_df.dtypes

**Which features are categorical?**

These values classify the samples into sets of similar samples. Within categorical features, are the values nominal, ordinal, ratio, or interval based? Among other things, this helps us select the appropriate plots for visualization.

- Categorical: Transported, PassengerId, CryoSleep, Cabin, HomePlanet, Destination and VIP. 

**Which features are numerical?**

These values change from sample to sample. Within numerical features, are the values discrete, continuous, or time-series based? Among other things, this helps us select the appropriate plots for visualization.

- Continuous: Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck.
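One way to get a first split between numerical and non-numerical columns is `select_dtypes`. The sketch below uses a made-up frame (`toy` and its values are hypothetical); note that a boolean column such as CryoSleep is not counted as numeric, which matches treating it as categorical.

```python
import pandas as pd

# Hypothetical toy frame with a few of the competition's column names.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Mars"],
    "CryoSleep": [False, True, False],
    "Age": [39.0, 24.0, 58.0],
    "Spa": [0.0, 549.0, 6715.0],
})

# Numeric dtypes (ints, floats) versus everything else (strings, booleans).
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```
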

In [None]:
# preview the data
train_df.head()

**Which features are mixed data types?**

Numerical and alphanumeric data within the same feature. These are candidates for the correcting goal.

- Cabin is alphanumeric.

**Which features may contain errors or typos?**

This is harder to review for a large dataset. However, reviewing a few samples from a smaller dataset may tell us outright which features require correcting.

- Name feature may contain errors or typos as there are several ways used to describe a name including titles, round brackets, and quotes used for alternative or short names.

In [None]:
train_df.tail()

### Missing values

In [None]:
train_df.isnull().sum()

In [None]:
test_df.isnull().sum()

In [None]:
train_df.info()
print('_'*40)
test_df.info()

In [None]:
train_df.describe()

**What is the distribution of categorical features?**

- Names are unique across the dataset (count=unique=8493)
- PassengerIds are unique across the dataset (count=unique=8693)
- The CryoSleep variable has two possible values, with 64% False (top=False, freq=5439/count=8476).
- The VIP variable has two possible values, with 98% False (top=False, freq=8291/count=8490).
- Cabin values have several duplicates across samples, which means that several passengers shared a cabin.
- HomePlanet takes three possible values. Earth port used by most passengers (top=Earth)
- Destination takes three possible values. TRAPPIST-1e port used by most passengers (top=TRAPPIST-1e)
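Figures such as count, unique, top, and the share of the most frequent value can be reproduced with `count`, `nunique`, and `value_counts`. A minimal sketch on a made-up column (the values below are hypothetical and do not reflect the real distribution):

```python
import pandas as pd

# Hypothetical values; the missing entry is ignored by count() and value_counts().
planet = pd.Series(
    ["Earth", "Earth", "Europa", "Mars", "Earth", None], name="HomePlanet"
)

count = planet.count()                          # non-missing entries
unique = planet.nunique()                       # distinct non-missing values
counts = planet.value_counts()                  # frequency of each value
shares = planet.value_counts(normalize=True)    # same, as fractions of count
top = counts.idxmax()                           # most frequent value

print(count, unique, top, shares[top])
```
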

In [None]:
train_df.describe(include=['O'])

### Clean data

In [None]:
# Impute categorical features with the mode and numerical features with the median.
# Assigning the result back avoids chained in-place fillna, which newer pandas versions warn about.
mode_cols = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP']
median_cols = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
for dataset in combine:
    for col in mode_cols:
        dataset[col] = dataset[col].fillna(dataset[col].mode()[0])
    for col in median_cols:
        dataset[col] = dataset[col].fillna(dataset[col].median())

In [None]:
train_df.isnull().sum()

In [None]:
test_df.isnull().sum()

### EDA

In [None]:
# Categorical features
cat_feats=['HomePlanet', 'CryoSleep', 'Destination', 'VIP']

# Plot categorical features
fig=plt.figure(figsize=(10,16))
for i, var_name in enumerate(cat_feats):
    ax=fig.add_subplot(4,1,i+1)
    sns.countplot(data=train_df, x=var_name, ax=ax, hue='Transported')  # draw on this subplot
    ax.set_title(var_name)
fig.tight_layout()  # Improves appearance a bit
plt.show()

**DataFrame.sort_values**(by='', axis=0, ascending=True, inplace=False, na_position='last')
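A small sketch of the parameters in the signature above, on a made-up frame (`toy`, `key`, and `rate` are hypothetical names). With `inplace=False` the original frame is left untouched, and `na_position='last'` keeps missing values at the end regardless of sort direction.

```python
import pandas as pd

# Hypothetical values, including one missing rate.
toy = pd.DataFrame({"key": ["a", "b", "c", "d"], "rate": [0.2, None, 0.9, 0.5]})

# Sort by rate, highest first; rows with a missing rate go last.
ranked = toy.sort_values(by="rate", ascending=False, na_position="last")

print(ranked)
```
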

In [None]:
train_df[['HomePlanet', 'Transported']].groupby(['HomePlanet'], as_index=False).mean().sort_values(by='Transported', ascending=False)

In [None]:
train_df = pd.get_dummies(train_df, columns = ['HomePlanet'])
train_df.head()

In [None]:
test_df = pd.get_dummies(test_df,columns = ['HomePlanet'])
combine = [train_df,test_df]

In [None]:
train_df[["Destination", "Transported"]].groupby(['Destination'], as_index=False).mean().sort_values(by='Transported', ascending=False)

In [None]:
train_df = pd.get_dummies(train_df,columns = ['Destination'])
test_df = pd.get_dummies(test_df,columns = ['Destination'])
combine = [train_df,test_df]
train_df.head()

In [None]:
train_df[["VIP", "Transported"]].groupby(['VIP'], as_index=False).mean().sort_values(by='Transported', ascending=False)

In [None]:
for dataset in combine:
    le = LabelEncoder()
    dataset['VIP'] = le.fit_transform(dataset['VIP'])
train_df.head()

In [None]:
train_df[["CryoSleep", "Transported"]].groupby(['CryoSleep'], as_index=False).mean().sort_values(by='Transported', ascending=False)

In [None]:
train_df = pd.get_dummies(train_df,columns = ['CryoSleep'])
test_df = pd.get_dummies(test_df,columns = ['CryoSleep'])
combine = [train_df,test_df]
train_df.head()

In [None]:
g = sns.FacetGrid(train_df, col='Transported')  # FacetGrid shows several plots side by side
g.map(plt.hist, 'Age', bins=20)

### Feature Engineering

In [None]:
for dataset in combine:   
    dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 5)
train_df.head()

In [None]:
train_df[['AgeBin','Transported']].groupby(['AgeBin'],as_index=False).mean().sort_values(by='Transported',ascending=False)

In [None]:
for dataset in combine:
    dataset.loc[ dataset['Age'] <= 15, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 15) & (dataset['Age'] <= 31), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 31) & (dataset['Age'] <= 47), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 47) & (dataset['Age'] <= 63), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 63), 'Age'] = 4
    dataset['Age'] = dataset['Age'].astype(int)

train_df.head()

In [None]:
fig=plt.figure(figsize=(10,12))
plt.subplot(3,1,1)
sns.countplot(data=train_df, x='Age', hue='Transported', order=[0,1,2,3,4])
plt.title('Age')

In [None]:
for dataset in combine:
    # Placeholder cabin for missing values; it parses to deck 'Z', num 9999, side 'Z'
    dataset['Cabin'] = dataset['Cabin'].fillna('Z/9999/Z')

In [None]:
for dataset in combine:
    # Cabin has the form deck/num/side; missing cabins were filled with 'Z/9999/Z' above
    dataset['deck'] = dataset['Cabin'].apply(lambda x: str(x)[:1])
    dataset['num'] = dataset['Cabin'].apply(lambda x: x.split('/')[1]).astype(int)
    dataset['side'] = dataset['Cabin'].apply(lambda x: str(x)[-1:])
    dataset['deck'] = dataset['deck'].fillna(dataset['deck'].mode()[0])
    dataset['num'] = dataset['num'].fillna(dataset['num'].mode()[0])
    dataset['side'] = dataset['side'].fillna(dataset['side'].mode()[0])

In [None]:
train_df[['deck','Transported']].groupby(['deck'],as_index=False).mean().sort_values(by='Transported',ascending=False)

In [None]:
fig=plt.figure(figsize=(10,12))
plt.subplot(3,1,1)
sns.countplot(data=train_df, x='deck', hue='Transported', order=['A','B','C','D','E','F','G','T','Z'])
plt.title('Cabin deck')

In [None]:
deck_mapping = {"B": 1, "C": 1, "G": 2,"Z": 2, "A": 2, "F": 3, "D": 3, "E": 4, "T": 5}
for dataset in combine:
    dataset['deck'] = dataset['deck'].map(deck_mapping)

train_df.head()

In [None]:
side_map = {'P':1,'S':0}
for dataset in combine:
    dataset['side'] = dataset['side'].map(side_map)

In [None]:
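for dataset in combine:
    # 'Z' (the placeholder for a missing cabin) is not in side_map, so it maps to NaN; fill it with the mode
    dataset['side'] = dataset['side'].fillna(dataset['side'].mode()[0])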

In [None]:
train_df.isnull().sum()

In [None]:
plt.subplot(3,1,2)
sns.histplot(data=train_df, x='num', hue='Transported',binwidth=20)
plt.title('Cabin number')
plt.xlim([0,2000])

In [None]:
for dataset in combine:
    dataset['region1']=(dataset['num']<300).astype(int)  
    dataset['region2']=((dataset['num']>=300)& (dataset['num']<600)).astype(int)  
    dataset['region3']=((dataset['num']>=600)& (dataset['num']<900)).astype(int)  
    dataset['region4']=((dataset['num']>=900)& (dataset['num']<1200)).astype(int)  
    dataset['region5']=((dataset['num']>=1200)& (dataset['num']<1500)).astype(int)  
    dataset['region6']=((dataset['num']>=1500)& (dataset['num']<1800)).astype(int)  
    dataset['region7']=(dataset['num']>=1800).astype(int)  # >= so that num == 1800 falls into a region

In [None]:
for dataset in combine:
    dataset['group'] = dataset.PassengerId.apply(lambda x:x.split('_')[0])
    dataset['group'] = dataset['group'].astype(int)

In [None]:
train_df.head()

In [None]:
plt.scatter(x=train_df['ShoppingMall'],y=train_df['Transported'])

In [None]:
plt.scatter(x=train_df['FoodCourt'],y=train_df['Transported'])

In [None]:
plt.scatter(x=train_df['Spa'],y=train_df['Transported'])

In [None]:
plt.scatter(x=train_df['VRDeck'],y=train_df['Transported'])

In [None]:
plt.scatter(x=train_df['RoomService'],y=train_df['Transported'])

In [None]:
grid = sns.FacetGrid(train_df, row='Transported', col='deck', height=2.2, aspect=1.6)  # 'size' was renamed to 'height' in seaborn
grid.map(plt.hist, 'VRDeck', alpha=.5, bins=20)
grid.add_legend()

In [None]:
for dataset in combine:
    dataset['sum'] = dataset['VRDeck'] + dataset['Spa'] + dataset['ShoppingMall'] + dataset['RoomService'] + dataset['FoodCourt']

In [None]:
for dataset in combine:
    dataset['vr'] = dataset['VRDeck'] / dataset['sum']
    dataset['spa'] = dataset['Spa'] / dataset['sum']
    dataset['room'] = dataset['RoomService'] / dataset['sum']
    dataset['shop'] = dataset['ShoppingMall'] / dataset['sum']
    dataset['food'] = dataset['FoodCourt'] / dataset['sum']

In [None]:
for dataset in combine:
    # Passengers with zero total spend produce 0/0 = NaN; treat their shares as 0
    ratio_cols = ['vr', 'spa', 'room', 'shop', 'food']
    dataset[ratio_cols] = dataset[ratio_cols].fillna(0)

In [None]:
train_df.isnull().sum()

In [None]:
train_df.head()

### Preprocessing

In [None]:
train_df = train_df.drop(['Name', 'PassengerId','AgeBin','Cabin','num'], axis=1)
test_df = test_df.drop(['Name', 'PassengerId','AgeBin','Cabin','num'], axis=1)
combine = [train_df, test_df]
train_df.shape, test_df.shape

In [None]:
train_df.columns.values

In [None]:
corrmat = train_df.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);

In [None]:
X_train = train_df.drop("Transported", axis=1)
Y_train = train_df["Transported"]
X_test  = test_df.copy()
X_train.shape, Y_train.shape, X_test.shape

Logistic Regression is a useful model to run early in the workflow. Logistic regression measures the relationship between the categorical dependent variable (target) and one or more independent variables (features) by estimating probabilities using a logistic function, which is the cumulative logistic distribution. Reference [Wikipedia](https://en.wikipedia.org/wiki/Logistic_regression).
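To make the logistic function concrete, here is a minimal sketch in plain Python. The helper names `logistic` and `predict_proba` and the example weights are assumptions for illustration; they are not the scikit-learn implementation.

```python
import math

def logistic(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class for one sample under a linear model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)

# Hypothetical weights: the linear score is 1*2 + (-2)*1 = 0, so the probability is 0.5.
print(predict_proba([1.0, -2.0], 0.0, [2.0, 1.0]))
```
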

Note the confidence score generated by the model, which here is its accuracy on our training dataset.

In [None]:
# Logistic Regression

logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log

We can use Logistic Regression to validate our assumptions and decisions for the feature creating and completing goals. This can be done by inspecting the coefficients of the features in the decision function.

Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).
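Since a coefficient acts on the log-odds, exponentiating it gives the factor by which the odds change for a one-unit increase in that feature. A sketch with made-up coefficients (the feature names and values below are hypothetical, not model output):

```python
import math

# Hypothetical coefficients for illustration only.
coefficients = {"feature_a": 0.7, "feature_b": -1.2, "feature_c": 0.0}

# exp(coefficient) is the multiplicative change in the odds per unit of the feature.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers" if ratio < 1 else "does not change"
    print(f"{name}: odds ratio {ratio:.2f} ({direction} the odds)")
```
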

In [None]:
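# Take the feature names from X_train so that they line up with logreg.coef_;
# dropping column 0 of train_df would remove 'Age' rather than the 'Transported' target.
coeff_df = pd.DataFrame(X_train.columns)
coeff_df.columns = ['Feature']
coeff_df["Correlation"] = pd.Series(logreg.coef_[0])

coeff_df.sort_values(by='Correlation', ascending=False)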

Next we model using Support Vector Machines which are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training samples, each marked as belonging to one or the other of **two categories**, an SVM training algorithm builds a model that assigns new test samples to one category or the other, making it a non-probabilistic binary linear classifier. Reference [Wikipedia](https://en.wikipedia.org/wiki/Support_vector_machine).

Note that the model generates a confidence score which is higher than that of the Logistic Regression model.

In [None]:
# Support Vector Machines

svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc

In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. A sample is classified by a majority vote of its neighbors, with the sample being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. Reference [Wikipedia](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm).
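The majority vote described above can be sketched in a few lines of plain Python. The function name `knn_predict` and the toy points are assumptions for illustration; scikit-learn's `KNeighborsClassifier` uses more efficient neighbor searches.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Return the most common label among the k training points closest to query."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D points: one cluster near the origin, one near (5, 5).
points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
labels = [False, False, True, True, True]

near_origin = knn_predict(points, labels, (0.2, 0.2), k=3)
near_far_cluster = knn_predict(points, labels, (5.5, 5.0), k=3)
print(near_origin, near_far_cluster)
```
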

The KNN confidence score is better than Logistic Regression but worse than SVM.

In [None]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn

In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features) in a learning problem. Reference [Wikipedia](https://en.wikipedia.org/wiki/Naive_Bayes_classifier).
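Under the independence assumption, the posterior for each class is proportional to the class prior times the product of the per-feature likelihoods. A minimal sketch with made-up probabilities (the function name and all numbers are hypothetical; `GaussianNB` estimates the likelihoods from Gaussian densities instead):

```python
def naive_bayes_posterior(priors, likelihoods):
    """Combine class priors with per-feature likelihoods, then normalize."""
    unnormalized = {}
    for label, prior in priors.items():
        score = prior
        for p in likelihoods[label]:
            score *= p  # independence assumption: likelihoods simply multiply
        unnormalized[label] = score
    total = sum(unnormalized.values())
    return {label: score / total for label, score in unnormalized.items()}

# Hypothetical priors and likelihoods P(feature_i | class) for two observed features.
priors = {"yes": 0.5, "no": 0.5}
likelihoods = {"yes": [0.8, 0.5], "no": [0.2, 0.5]}

posterior = naive_bayes_posterior(priors, likelihoods)
print(posterior)
```
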

The model generated confidence score is the lowest among the models evaluated so far.

In [None]:
# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian

The perceptron is an algorithm for supervised learning of binary classifiers (functions that can decide whether an input, represented by a vector of numbers, belongs to some specific class or not). It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time. Reference [Wikipedia](https://en.wikipedia.org/wiki/Perceptron).
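The online update can be sketched as follows: each sample is visited in turn, and the weights move only when the prediction is wrong. This is a textbook-style sketch with assumed names (`train_perceptron`, `predict`) and a toy AND-gate dataset, not scikit-learn's `Perceptron`.

```python
def predict(weights, bias, x):
    """Linear predictor followed by a threshold at zero."""
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation > 0 else 0

def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Process one sample at a time and correct the weights on each mistake."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)  # 0 when the prediction is right
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias

# Hypothetical toy data: the logical AND of two inputs, which is linearly separable.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
print(predictions)
```
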

In [None]:
# Perceptron

perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
acc_perceptron

In [None]:
# Linear SVC

linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc

In [None]:
# Stochastic Gradient Descent

sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd

This model uses a decision tree as a predictive model which maps features (tree branches) to conclusions about the target value (tree leaves). Tree models where the target variable can take a finite set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Reference [Wikipedia](https://en.wikipedia.org/wiki/Decision_tree_learning).

The model confidence score is the highest among models evaluated so far.

In [None]:
# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree

The next model Random Forests is one of the most popular. Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees (n_estimators=100) at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Reference [Wikipedia](https://en.wikipedia.org/wiki/Random_forest).
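The "mode of the classes" step can be illustrated with a hard majority vote over per-tree predictions. The function name `forest_vote` and the votes below are made up; note also that scikit-learn's `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes, so this is a simplification.

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Return, for each sample, the class predicted by the most trees."""
    return [
        Counter(votes).most_common(1)[0][0]
        for votes in zip(*tree_predictions)  # regroup predictions per sample
    ]

# Hypothetical predictions from three trees for three samples (one row per tree).
tree_predictions = [
    [True, False, True],
    [True, True, False],
    [False, False, True],
]

ensemble = forest_vote(tree_predictions)
print(ensemble)
```
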

The model confidence score is the highest among models evaluated so far. We decide to use this model's output (Y_pred) for creating our competition submission of results.

In [None]:
# Random Forest

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest

### Model evaluation

We can now rank our evaluation of all the models to choose the best one for our problem. While both Decision Tree and Random Forest score the same, we choose to use Random Forest, as random forests correct for decision trees' habit of overfitting to their training set.
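The overfitting mentioned above is easiest to see by comparing accuracy on the training data with cross-validated accuracy. The sketch below uses synthetic data from `make_classification` with deliberately noisy labels (all parameters are assumptions for illustration, and the numbers say nothing about the competition data).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% randomly flipped labels, so perfect accuracy is not achievable on unseen data.
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)

# An unrestricted tree can memorize the training set.
tree = DecisionTreeClassifier(random_state=0)
train_acc = tree.fit(X, y).score(X, y)

# 5-fold cross-validation scores each fold with a tree that never saw it.
cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()

print(f"training accuracy: {train_acc:.3f}")
print(f"cross-validated accuracy: {cv_acc:.3f}")
```

The same comparison could be applied to `X_train` and `Y_train` for any of the models ranked here, since their scores are all measured on the training set.
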

In [None]:
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 
              'Stochastic Gradient Descent', 'Linear SVC', 
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)

In [None]:
%pip install catboost

In [None]:
from catboost import Pool, CatBoostClassifier
# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=100,
                           learning_rate=0.15,
                           depth=4,
                           cat_features = [0],
                           loss_function='MultiClass')
# Fit model
model.fit(X_train, Y_train)
# Get predicted classes
preds_class = model.predict(X_test)

In [None]:
preds_class = preds_class.T

In [None]:
preds_class[0,:]

## Best score

In [None]:
test_df = pd.read_csv('/kaggle/input/spaceship-titanic/sample_submission.csv')
submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Transported": preds_class[0,:]
    })
submission.to_csv('submission.csv', index=False)
#Best score: 0.80593
# array(['Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa',
#        'VRDeck', 'Transported', 'HomePlanet_Earth', 'HomePlanet_Europa',
#        'HomePlanet_Mars', 'Destination_55 Cancri e',
#        'Destination_PSO J318.5-22', 'Destination_TRAPPIST-1e',
#        'CryoSleep_False', 'CryoSleep_True', 'deck', 'side', 'region1',
#        'region2', 'region3', 'region4', 'region5', 'region6', 'region7',
#        'group', 'sum', 'vr', 'spa', 'room', 'shop', 'food'], dtype=object)


## References

This notebook has been created based on great work done solving the Titanic competition and other sources.

- [Spaceship Titanic: A complete guide](https://www.kaggle.com/code/samuelcortinhas/spaceship-titanic-a-complete-guide#Preprocessing)
- [Titanic Data Science Solutions](https://www.kaggle.com/code/startupsci/titanic-data-science-solutions)