# Spaceship Titanic

This notebook might help those who are doing Kaggle for the first time. It contains basic approaches to machine learning problems. The main goal is to process the data and implement various machine learning models.

### Loading Data

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df_train = pd.read_csv('../input/spaceship-titanic/train.csv')
df_test =  pd.read_csv('../input/spaceship-titanic/test.csv')

In [None]:
df_train.head()

### Handling Missing Values

In [None]:
df_train.isnull().sum()

**Categorical Variables:**
1. HomePlanet
2. Transported (Target)
3. CryoSleep
4. VIP
5. Destination

**Converting into categories**: Setting NaN values as another category

In [None]:
df_train.HomePlanet.unique()
home_planet = {
    "Europa" : 0,
    "Earth" : 1,
    "Mars" : 2,
}
df_train['HomePlanet']  = df_train.HomePlanet.map(home_planet)
df_test['HomePlanet']  = df_test.HomePlanet.map(home_planet)

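# Assign the result back; inplace fillna on a selected column is unreliable in recent pandas
df_train['HomePlanet'] = df_train['HomePlanet'].fillna(-1)
df_test['HomePlanet'] = df_test['HomePlanet'].fillna(-1)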

In [None]:
vip = {
    False : 0,
    True : 1
}
df_train['VIP']  = df_train.VIP.map(vip)
df_test['VIP']  = df_test.VIP.map(vip)

df_train['VIP'] = df_train['VIP'].fillna(-1)
df_test['VIP'] = df_test['VIP'].fillna(-1)

In [None]:
destination = {
    'TRAPPIST-1e': 0,
    'PSO J318.5-22': 1,
    '55 Cancri e': 2
} 
df_train['Destination']  = df_train.Destination.map(destination)
df_test['Destination']  = df_test.Destination.map(destination)

df_test['Destination'] = df_test['Destination'].fillna(-1)
df_train['Destination'] = df_train['Destination'].fillna(-1)

In [None]:
df_train['CryoSleep'] = df_train.CryoSleep*1
df_test['CryoSleep'] = df_test.CryoSleep*1

df_test['CryoSleep'] = df_test['CryoSleep'].fillna(-1)
df_train['CryoSleep'] = df_train['CryoSleep'].fillna(-1)

In [None]:
# df_train['HomePlanet']  = df_train.HomePlanet.astype('category').cat.codes
# df_train['CryoSleep']   = df_train.CryoSleep.astype('category').cat.codes
# df_train['VIP']         = df_train.VIP.astype('category').cat.codes
# df_train['Destination'] = df_train.Destination.astype('category').cat.codes
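The commented-out alternative above relies on the fact that pandas category codes use -1 for missing values. A minimal sketch on a made-up column (the `planets`, `planet_levels` and `fixed` names are illustrative, not part of the notebook) shows this, and also why passing an explicit category list may be safer when train and test are encoded separately:

```python
import numpy as np
import pandas as pd

# Toy column standing in for HomePlanet (values are illustrative only).
planets = pd.Series(["Europa", "Earth", np.nan, "Mars", "Earth"])

# Categories are inferred and sorted alphabetically (Earth=0, Europa=1, Mars=2);
# missing values automatically receive the code -1.
codes = planets.astype("category").cat.codes
print(codes.tolist())

# With an explicit category list the codes no longer depend on which values
# happen to appear in a given frame, so train and test stay consistent.
planet_levels = ["Europa", "Earth", "Mars"]
fixed = pd.Categorical(planets, categories=planet_levels).codes
print(list(fixed))
```

With inferred categories the codes follow alphabetical order, which differs from the hand-written `home_planet` mapping, so the two approaches should not be mixed.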

**Cabin**

In [None]:
df_train['cabin_split'] = df_train.Cabin.astype('str').apply(lambda x: x.split('/'))
df_test['cabin_split'] = df_test.Cabin.astype('str').apply(lambda x: x.split('/'))

In [None]:
df_train['deck'] = df_train.cabin_split.apply(lambda x: x[0] if len(x)== 3 else None)
df_train['num']  = df_train.cabin_split.apply(lambda x: x[1] if len(x)== 3 else None)
df_train['side'] = df_train.cabin_split.apply(lambda x: x[2] if len(x)== 3 else None)

df_test['deck'] = df_test.cabin_split.apply(lambda x: x[0] if len(x)== 3 else None)
df_test['num']  = df_test.cabin_split.apply(lambda x: x[1] if len(x)== 3 else None)
df_test['side'] = df_test.cabin_split.apply(lambda x: x[2] if len(x)== 3 else None)
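The same deck/num/side split can probably be written more compactly with the vectorised string accessor. This is only a sketch on a made-up frame (the `cabins` and `parts` names are hypothetical); rows with a missing cabin simply stay missing in all three parts:

```python
import numpy as np
import pandas as pd

# Toy frame with the same "deck/num/side" cabin layout (values are illustrative).
cabins = pd.DataFrame({"Cabin": ["B/0/P", "F/12/S", np.nan]})

# expand=True returns one column per piece; a missing cabin yields missing pieces.
parts = cabins["Cabin"].str.split("/", expand=True)
parts.columns = ["deck", "num", "side"]

cabins = cabins.join(parts)
print(cabins)
```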

**Deck and Side are categorical**: so we will convert them into categories.

In [None]:
cab_deck = {
    'B' : -4,
    'F' : -3,
    'A' : -2, 
    'G' :-1,
    'E' : 1,
    'D' : 2,
    'C' : 3,
    'T' : 4
}
df_train['deck'] = df_train['deck'].map(cab_deck)
df_test['deck'] = df_test['deck'].map(cab_deck)

# Missing decks get 0, which is unused in cab_deck; -1 would collide with deck 'G'
df_test['deck'] = df_test['deck'].fillna(0)
df_train['deck'] = df_train['deck'].fillna(0)

In [None]:
df_train.side.unique()
cab_side = {
    'P' : 0,
    'S' : 1
}
df_train['side'] = df_train['side'].map(cab_side)
df_test['side'] = df_test['side'].map(cab_side)

df_test['side'] = df_test['side'].fillna(-1)
df_train['side'] = df_train['side'].fillna(-1)

In [None]:
# Target variable
df_train['target'] = df_train.Transported.astype('category').cat.codes

Checking Correlation before filling the missing values of continuous variables

In [None]:
plt.figure(figsize = (20,10))
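# numeric_only=True skips text columns such as Name and Cabin, which would otherwise raise an error
sns.heatmap(df_train.corr(numeric_only = True), annot = True, fmt = '3.2f' , annot_kws={'size' : 15}, cmap="Set1")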
plt.xticks(fontsize = 14)
plt.yticks(fontsize = 14)
plt.show()

Looking at this correlation table, the variables RoomService, ShoppingMall, deck and VRDeck don't seem to contribute much to the output.
Feature engineering is needed to confirm this conclusion. But first, the missing values in the continuous variables have to be filled.
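Reading the target row off a large heatmap can be error-prone, so it may help to rank the features by their correlation with the target directly. The sketch below uses a small made-up frame (`toy`, `target_corr` and `ranked` are illustrative names); on the real data the same two lines would be applied to `df_train`:

```python
import pandas as pd

# Toy frame: two numeric features, one text column and a binary target.
toy = pd.DataFrame({
    "CryoSleep": [1, 0, 1, 0, 1, 0],
    "Spa": [0.0, 300.0, 0.0, 120.0, 10.0, 500.0],
    "Name": list("abcdef"),
    "target": [1, 0, 1, 0, 1, 0],
})

# Correlation of every numeric column with the target (text columns are skipped).
target_corr = toy.corr(numeric_only=True)["target"].drop("target")

# Order by absolute strength while keeping the sign visible.
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Pearson correlation only captures linear relationships, so a low value here does not by itself prove that a feature is useless for tree-based models.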

## Handling Missing Values in Continuous Variables

**Continuous Variables**
1. RoomService
2. FoodCourt
3. ShoppingMall
4. Spa
5. VRDeck
6. num (from cabin)
7. *Age*

Of these variables, num is the room number taken from the cabin and Age is the passenger's age, whereas the other variables represent expenditure. To fill the missing values of the expenditure variables and Age, the **mean** is used. For the cabin number, the **mode** should be fine.

**Filling the missing values with mean**

In [None]:
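cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Age']

for col in cols:
    train_mean = df_train[col].mean()  # statistic taken from the training set only
    df_train[col] = df_train[col].fillna(train_mean)
    df_test[col] = df_test[col].fillna(train_mean)  # the test set has missing values too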

**Filling the missing values with mode**

In [None]:
df_train['num'] = df_train.num.fillna(df_train.num.mode()[0])
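As an alternative to filling column by column, scikit-learn's `SimpleImputer` can learn the fill values on the training set and reuse them on the test set. This is a sketch on made-up frames, not the notebook's own approach; the `train`, `test` and `num_cols` names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for df_train / df_test (values are illustrative).
train = pd.DataFrame({"Spa": [0.0, 100.0, np.nan, 200.0],
                      "Age": [20.0, np.nan, 40.0, 30.0]})
test = pd.DataFrame({"Spa": [np.nan, 50.0],
                     "Age": [np.nan, 25.0]})
num_cols = ["Spa", "Age"]

# Learn the column means on the training data only ...
imputer = SimpleImputer(strategy="mean")
train[num_cols] = imputer.fit_transform(train[num_cols])

# ... and apply those same means to the test data.
test[num_cols] = imputer.transform(test[num_cols])
print(test)
```

Passing `strategy="most_frequent"` instead gives the mode-based filling used for the cabin number.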

### Handling Age

Age is a continuous variable; we can convert it into a categorical variable, which may improve prediction accuracy (in classification).

In [None]:
df_train.Age.hist()

### Binning

In [None]:
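# Take the bin edges from the training set and reuse them, so a bin means the same age range in both sets
df_train['age_bin'], age_edges = pd.cut(df_train["Age"], bins = 10, labels= False, retbins = True)
df_test['age_bin'] = pd.cut(df_test["Age"], bins = age_edges, labels= False)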
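Equal-width binning with `pd.cut` derives its edges from whatever data it is given, so two frames only share the same bins if the edges are passed explicitly. A minimal sketch on made-up ages (all names and values here are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy ages standing in for the train and test Age columns.
train_age = pd.Series([0, 10, 20, 40, 80])
test_age = pd.Series([5, 35, 79])

# retbins=True also returns the computed edges (4 bins -> 5 edges).
train_bin, edges = pd.cut(train_age, bins=4, labels=False, retbins=True)

# Reusing the edges guarantees that bin k covers the same age range in both sets.
test_bin = pd.cut(test_age, bins=edges, labels=False)

print(edges)
print(train_bin.tolist(), test_bin.tolist())
```

One caveat of this approach: a test value outside the training range falls into no bin and comes back as NaN, so it would need separate handling.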

In [None]:
df_train.drop(['Name', 'Cabin', 'Transported'], axis = 1, inplace = True)
df_test.drop(['Name', 'Cabin'], axis = 1, inplace = True)
df_train.isnull().sum()

In [None]:
plt.figure(figsize = (20,10))
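# numeric_only=True skips non-numeric columns such as PassengerId and cabin_split
sns.heatmap(df_train.corr(numeric_only = True), annot = True, fmt = '3.2f' , annot_kws={'size' : 15}, cmap="Set1")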
plt.xticks(fontsize = 14)
plt.yticks(fontsize = 14)
plt.show()

In [None]:
df_train.head(20)

## Feature Engineering

The expenditure columns RoomService, FoodCourt, ShoppingMall, Spa and VRDeck have very low correlation with the target. Expressing each one as a share of the passenger's total expenditure might improve the correlation, and it also normalises these values to the range 0 to 1.

#### TOTAL EXPENDITURE = RoomService + FoodCourt + ShoppingMall + Spa + VRDeck

**Normalisation:**
* **% RoomService**  = RoomService /TOTAL EXPENDITURE 
* **% FoodCourt**    = FoodCourt /TOTAL EXPENDITURE 
* **% ShoppingMall** = ShoppingMall /TOTAL EXPENDITURE 
* **% Spa**          = Spa /TOTAL EXPENDITURE 
* **% VRDeck**       = VRDeck /TOTAL EXPENDITURE 

In [None]:
df_train['Expenditure'] = df_train['RoomService'] + df_train['Spa'] + df_train['FoodCourt'] + df_train['ShoppingMall'] + df_train['VRDeck']

df_train['RoomService'] = df_train['RoomService'] / df_train['Expenditure'] 
df_train['Spa']          = df_train['Spa'] / df_train['Expenditure'] 
df_train['FoodCourt']    = df_train['FoodCourt'] / df_train['Expenditure'] 
df_train['ShoppingMall'] = df_train['ShoppingMall'] / df_train['Expenditure'] 
df_train['VRDeck']       = df_train['VRDeck'] / df_train['Expenditure'] 
df_train['Expenditure']  = df_train['Expenditure']/ df_train['Expenditure'].max()

df_test['Expenditure'] = df_test['RoomService'] + df_test['Spa'] + df_test['FoodCourt'] + df_test['ShoppingMall'] + df_test['VRDeck']
df_test['RoomService'] = df_test['RoomService'] / df_test['Expenditure'] 
df_test['Spa']          = df_test['Spa'] / df_test['Expenditure'] 
df_test['FoodCourt']    = df_test['FoodCourt'] / df_test['Expenditure'] 
df_test['ShoppingMall'] = df_test['ShoppingMall'] / df_test['Expenditure'] 
df_test['VRDeck']       = df_test['VRDeck'] / df_test['Expenditure'] 
df_test['Expenditure']  = df_test['Expenditure']/ df_test['Expenditure'].max()

If the total expenditure is 0, the division is 0/0 and the output is NaN, so those shares are set to 0.

In [None]:
share_cols = ['RoomService', 'Spa', 'FoodCourt', 'ShoppingMall', 'VRDeck']

# A zero total gives 0/0 = NaN above, so treat those shares as 0
for df in (df_train, df_test):
    df[share_cols] = df[share_cols].fillna(0)
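The share computation and the zero-total handling can also be wrapped in one helper that is applied to both frames. The following is only a sketch under a few assumptions: `add_spend_shares` is a hypothetical helper name, the data is made up, and the total is kept unscaled here rather than divided by its maximum:

```python
import numpy as np
import pandas as pd

spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Toy frame: one passenger who spent 100 in total and one who spent nothing.
toy = pd.DataFrame([[10.0, 30.0, 0.0, 60.0, 0.0],
                    [0.0, 0.0, 0.0, 0.0, 0.0]], columns=spend_cols)

def add_spend_shares(df, cols):
    """Return a copy with each spend column replaced by its share of the row total."""
    out = df.copy()
    total = out[cols].sum(axis=1)
    # Replace a zero total by NaN before dividing, then map the resulting NaN shares to 0.
    out[cols] = out[cols].div(total.replace(0, np.nan), axis=0).fillna(0)
    out["Expenditure"] = total
    return out

shares = add_spend_shares(toy, spend_cols)
print(shares)
```

Calling the same helper on the train and test frames keeps the two processed identically and avoids repeating the column list.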

In [None]:
# Safety net: deck and side were already filled earlier, so these should change nothing
df_train['deck'] = df_train['deck'].fillna(-1)
df_train['side'] = df_train['side'].fillna(-1)

df_test['deck'] = df_test['deck'].fillna(-1)
df_test['side'] = df_test['side'].fillna(-1)

In [None]:
df_train.isnull().sum()

In [None]:
df_train.head(10)

### Final Correlation

In [None]:
plt.figure(figsize = (20,10))
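# numeric_only=True skips non-numeric columns such as PassengerId and cabin_split
sns.heatmap(df_train.corr(numeric_only = True), annot = True, fmt = '3.2f' , annot_kws={'size' : 15}, cmap="Set1")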
plt.xticks(fontsize = 14)
plt.yticks(fontsize = 14)
plt.show()

### Dropping Unnecessary Columns

Dropping columns that are no longer needed, such as cabin_split

In [None]:
df_train.drop(['cabin_split', 'Age', 'PassengerId'], axis = 1, inplace = True)

In [None]:
df_test.drop(['cabin_split', 'Age'], axis = 1, inplace = True)

In [None]:
df_train.head(20)

### Preprocessing

Columns like num have high variance and hence need to be normalised

In [None]:
df_train['num']  = df_train.num.astype('int')
df_train['num']  = df_train['num']/ df_train['num'].max()

In [None]:
df_test['num'] = df_test['num'].fillna(-1)
df_test['num']  = df_test.num.astype('int')
df_test['num']  = df_test['num']/ df_test['num'].max()
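Dividing each set by its own maximum puts train and test on slightly different scales. One possible alternative, sketched here on made-up numbers and not what the notebook does, is scikit-learn's `MinMaxScaler`, fitted on the training column and then reused for the test column (the `num_scaled` column name is hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy cabin numbers standing in for the num column (values are illustrative).
train = pd.DataFrame({"num": [0, 50, 100]})
test = pd.DataFrame({"num": [25, 150]})

# Learn the minimum and maximum from the training set only.
scaler = MinMaxScaler()
train["num_scaled"] = scaler.fit_transform(train[["num"]]).ravel()

# The test set is scaled with the training range, so values may fall outside [0, 1].
test["num_scaled"] = scaler.transform(test[["num"]]).ravel()
print(test)
```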

## Training and Testing

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [None]:
X = df_train.drop(['target'], axis = 1)
y = df_train.target

**Train and Valid split**

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.1, random_state=42)

## Logistic Regression

In [None]:
log_clf = LogisticRegression(max_iter = 1000)
log_clf.fit(X_train, y_train)

print('training Accuracy: ',log_clf.score(X_train, y_train, sample_weight=None))
print('validation Accuracy: ',log_clf.score(X_valid, y_valid, sample_weight=None))

## Random Forest

In [None]:
clf_rf1 = RandomForestClassifier(max_depth=3, random_state=0) # max depth = 3
clf_rf1.fit(X_train, y_train)

print('training Accuracy: ',clf_rf1.score(X_train, y_train, sample_weight=None))
print('validation Accuracy: ',clf_rf1.score(X_valid, y_valid, sample_weight=None))

In [None]:
clf_rf2 = RandomForestClassifier(max_depth=6, random_state=0)
clf_rf2.fit(X_train, y_train)
print('training Accuracy: ',clf_rf2.score(X_train, y_train, sample_weight=None))
print('validation Accuracy: ',clf_rf2.score(X_valid, y_valid, sample_weight=None))

In [None]:
clf_rf3 = RandomForestClassifier(max_depth=8, random_state=0)
clf_rf3.fit(X_train, y_train)
print('training Accuracy: ',clf_rf3.score(X_train, y_train, sample_weight=None))
print('validation Accuracy: ',clf_rf3.score(X_valid, y_valid, sample_weight=None))

In [None]:
clf_rf4 = RandomForestClassifier(max_depth=8, max_samples=100, bootstrap=True, n_jobs=-1, n_estimators=500)
clf_rf4.fit(X_train, y_train)
print('training Accuracy: ',clf_rf4.score(X_train, y_train, sample_weight=None))
print('validation Accuracy: ',clf_rf4.score(X_valid, y_valid, sample_weight=None))
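Comparing depths on a single 10% validation split can be noisy. A hedged sketch of the same comparison with 5-fold cross-validation follows; it runs on synthetic data from `make_classification` so that it is self-contained, and `X_demo`, `y_demo` and `depth_scores` are illustrative names. On the real data, `X` and `y` would take their place:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

depth_scores = {}
for depth in (3, 6, 8):
    model = RandomForestClassifier(max_depth=depth, n_estimators=50, random_state=0)
    # Mean accuracy over 5 folds is a steadier estimate than one hold-out split.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    depth_scores[depth] = scores.mean()

best_depth = max(depth_scores, key=depth_scores.get)
print(depth_scores, best_depth)
```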

## XGBOOST

In [None]:
from xgboost import XGBClassifier
clf_xgb = XGBClassifier()
clf_xgb.fit(X_train, y_train)

print('training Accuracy: ',clf_xgb.score(X_train, y_train, sample_weight=None))
print('validation Accuracy: ',clf_xgb.score(X_valid, y_valid, sample_weight=None))

In [None]:
from sklearn.metrics import classification_report

In [None]:
xgb2 = XGBClassifier(n_estimators=100, max_depth=8, learning_rate=0.1, subsample=0.5)

train_model = xgb2.fit(X_train, y_train)
pred7 = train_model.predict(X_valid)
print("Accuracy for model xgb2: %.2f" % (accuracy_score(y_valid, pred7) * 100))

## Support Vector Machine

In [None]:
from sklearn.svm import SVC
clf_svc = SVC(kernel = 'poly')
clf_svc.fit(X_train, y_train)

print('training Accuracy: ',clf_svc.score(X_train, y_train, sample_weight=None))
print('validation Accuracy: ',clf_svc.score(X_valid, y_valid, sample_weight=None))

## Conclusion

Based on the training and validation accuracy, the random forest with depth 8 (`clf_rf3`) looks like the better choice. Please feel free to comment with suggestions. It will help me a lot.✌🏻