# Data science skills are needed to solve a cosmic mystery

# Instruction

Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination, the torrid 55 Cancri E, the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a fate similar to that of its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

Help save them and change history!

# In short, we have to predict whether each passenger was transported or reached their destination, and find out which factors influence the outcome most and least. There are almost 13,000 passengers on board.

In [148]:
#CSDA 5310 Personal project.
#Presented by Anjesh Sahani

#####################################################
# Import the required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
from sklearn.impute import KNNImputer

In [149]:
#Reading csv data from local machine

df_train = pd.read_csv("C:/Users/anjes/Desktop/FALL/Visualization/Wk8/spaceship-titanic-dataset/train.csv")
df_test = pd.read_csv("C:/Users/anjes/Desktop/FALL/Visualization/Wk8/spaceship-titanic-dataset/test.csv")
###############################################################################################################


# Let's get an overview of the data first.
# The shape attribute gives the number of rows and columns.
print(df_train.shape)
print("<!-----------------------------------------------------------------------!>")
print(df_test.shape)

(8693, 14)
<!-----------------------------------------------------------------------!>
(4277, 13)


In [150]:
# Merge both datasets into one so the same preprocessing is applied to each.
# The test set has no labels, so add a placeholder Transported column first.
df_test['Transported'] = False

# Concatenate train and test (train rows first) before any preprocessing
df = pd.concat([df_train, df_test], sort=False)

# Drop the columns we don't need
df.drop(['Name', 'PassengerId'], axis=1, inplace=True)

# Display the first 5 records
df.head()

| | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False |
| 1 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True |
| 2 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False |
| 3 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False |
| 4 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True |


In [151]:
# Check that the concatenation worked:
# True means no rows were lost or duplicated.
df.shape[0] == df_train.shape[0] + df_test.shape[0]

True

In [152]:
# The first step is to identify NULL values.
# The output below shows the number of NULL values in each column.
df.isna().sum()

HomePlanet      288
CryoSleep       310
Cabin           299
Destination     274
Age             270
VIP             296
RoomService     263
FoodCourt       289
ShoppingMall    306
Spa             284
VRDeck          268
Transported       0
dtype: int64
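
Raw counts are easier to judge as a share of all rows. The following is a minimal sketch on a small made-up frame; the `toy` data and the `missing_summary` helper are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest first."""
    counts = frame.isna().sum()
    percent = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)

# Hypothetical toy data standing in for the combined passenger table
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "VIP": [False, False, None, None],
    "Spa": [0.0, 549.0, 6715.0, 3329.0],
})
print(missing_summary(toy))
```

Applied to the real `df`, the same helper would show each column's missing share out of the 12,970 combined rows.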

In [153]:
# Split the Cabin column into its three parts (deck/num/side) using str.split.
df[['Deck','Num', 'Side']] = df['Cabin'].str.split('/', expand=True)


In [154]:
# Drop the Cabin column now that its three parts are stored in separate columns
df = df.drop(columns=['Cabin'])

In [155]:
# Mark missing cabin parts with sentinel values ('U' = unknown, -1 for Num)
df['Deck'] = df['Deck'].fillna("U")
df['Num'] = df['Num'].fillna(-1)
df['Side'] = df['Side'].fillna('U')
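
The split-and-fill steps can be checked on a few hand-written cabins. This is a sketch under assumptions: the `cabins` frame is made up, and converting `Num` with `pd.to_numeric` is an optional extra step that the notebook does not perform. `str.split` returns strings, so filling `Num` with the integer `-1` alone leaves a column of mixed types.

```python
import numpy as np
import pandas as pd

# Hypothetical cabins in the deck/num/side format used above
cabins = pd.DataFrame({"Cabin": ["B/0/P", np.nan, "F/12/S"]})

# expand=True returns one column per piece; a missing Cabin yields NaN in all three
parts = cabins["Cabin"].str.split("/", expand=True)
parts.columns = ["Deck", "Num", "Side"]

# Fill the gaps with sentinels, then make Num numeric (split produces strings)
parts["Deck"] = parts["Deck"].fillna("U")
parts["Side"] = parts["Side"].fillna("U")
parts["Num"] = pd.to_numeric(parts["Num"]).fillna(-1).astype(int)
print(parts)
```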

In [156]:
df['Deck'].value_counts()

Deck
F    4239
G    3781
E    1323
B    1141
C    1102
D     720
A     354
U     299
T      11
Name: count, dtype: int64

In [157]:
df.isna().sum()

HomePlanet      288
CryoSleep       310
Destination     274
Age             270
VIP             296
RoomService     263
FoodCourt       289
ShoppingMall    306
Spa             284
VRDeck          268
Transported       0
Deck              0
Num               0
Side              0
dtype: int64

In [158]:
df.head()

| | HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | Deck | Num | Side |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Europa | False | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | B | 0 | P |
| 1 | Earth | False | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True | F | 0 | S |
| 2 | Europa | False | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False | A | 0 | S |
| 3 | Europa | False | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False | A | 0 | S |
| 4 | Earth | False | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True | F | 1 | S |


In [159]:
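# Encode Deck as integer codes ('U' marks an unknown cabin)
df['Deck'] = df['Deck'].map({
    'G':0,
    'F':1,
    'E':2,
    'B':3,
    'C':4,
    'D':5,
    'A':6,
    'U':7,
    'T':8
})

# Encode Side as integer codes (-1 marks an unknown cabin)
df['Side'] = df['Side'].map({
    'U':-1,
    'P':1,
    'S':2
})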
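
One thing to watch with `Series.map` and a dictionary is that any label missing from the dictionary silently becomes `NaN`. A small sketch of a coverage check follows; the `decks` values are made up, and `"Z"` is deliberately left out of the mapping.

```python
import pandas as pd

deck_codes = {"G": 0, "F": 1, "E": 2, "B": 3, "C": 4, "D": 5, "A": 6, "U": 7, "T": 8}

# Hypothetical values; "Z" is deliberately absent from the mapping
decks = pd.Series(["F", "G", "Z", "U"])
encoded = decks.map(deck_codes)

# Any label not in the dictionary shows up as NaN after map()
unmapped = decks[encoded.isna()].unique().tolist()
print(encoded.tolist())
print("unmapped labels:", unmapped)
```

Running a check like this after the mapping cell would confirm that every deck and side label was covered before imputation.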

In [160]:
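# Columns to fill with KNN imputation (numeric or boolean-like after the encoding above)
Impute_list = ['Age','VIP', 'Num','CryoSleep','Side','Deck','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']
# Remaining columns, kept in their original order
rest_columns = [col for col in df.columns if col not in Impute_list]

df_rest = df[rest_columns]

# KNNImputer returns a NumPy array, so rebuild a DataFrame with the same column names
imp = KNNImputer()
df_imputed = imp.fit_transform(df[Impute_list])
df_imputed = pd.DataFrame(df_imputed, columns=Impute_list)
# Reset both indexes so the two parts align row by row when concatenated
df = pd.concat([df_rest.reset_index(drop=True), df_imputed.reset_index(drop=True)], axis=1)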
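
To see what `KNNImputer` does, here is a minimal sketch on a tiny array. The array and `n_neighbors=2` are chosen only for illustration; the notebook calls `KNNImputer()` with the library default of 5 neighbors. Each missing entry is replaced by the mean of that feature over the nearest rows that have it.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Small illustrative matrix with two missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# n_neighbors=2 is chosen for this tiny example; the notebook uses the default
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Because the neighbors are found by distance, columns with large ranges (such as the spending columns) can dominate the neighbor search. Scaling the features first is a common option, though the notebook does not do so.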

In [161]:
# Fill missing categorical values with an "unknown" label
df['Destination'] = df['Destination'].fillna('Un')
df['HomePlanet'] = df['HomePlanet'].fillna('U')

# One-hot encode the categorical columns
category_columns = ['Destination','HomePlanet']
for col in category_columns:
    df = pd.concat([df, pd.get_dummies(df[col], prefix=col)], axis=1)

In [162]:
# Drop the original categorical columns now that they are one-hot encoded
df = df.drop(columns=category_columns)
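
The fill, encode, and drop sequence can be traced on a four-row example. This is a sketch on made-up data; `dtype=int` is an optional choice here that produces 0/1 columns rather than booleans.

```python
import pandas as pd

# Hypothetical column with one missing value
toy = pd.DataFrame({"HomePlanet": ["Earth", None, "Europa", "Earth"]})
toy["HomePlanet"] = toy["HomePlanet"].fillna("U")

# dtype=int gives 0/1 columns instead of booleans
dummies = pd.get_dummies(toy["HomePlanet"], prefix="HomePlanet", dtype=int)
encoded = pd.concat([toy, dummies], axis=1).drop(columns=["HomePlanet"])
print(encoded)
```

Each row ends up with exactly one indicator set, and the missing value gets its own `HomePlanet_U` column.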

# Feature Engineering

In [163]:
bills_columns = ['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']

# Total amount spent by each passenger
df['amount_spent'] = df[bills_columns].sum(axis=1)

# Standard deviation and mean of spending across the five amenities
# (the mean is amount_spent / 5, so these two features are perfectly correlated)
df['std_amount_spent'] = df[bills_columns].std(axis=1)
df['mean_amount_spent'] = df[bills_columns].mean(axis=1)

# Sum of the three columns most positively correlated with Transported
df['3_high_columns'] = df['CryoSleep'] + df['HomePlanet_Europa'] + df['Destination_55 Cancri e']

# Sum of three columns negatively correlated with Transported
df['3_low_columns'] = df['amount_spent'] + df['mean_amount_spent'] + df['HomePlanet_Earth']
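
The row-wise spending features can be verified by hand on the first three passengers from the `df.head()` output above. This is a small sketch; the `bills` frame simply copies those rows, and the "any spending" flag is an extra illustration that the notebook does not create.

```python
import pandas as pd

# Spending of the first three passengers shown in df.head()
bills = pd.DataFrame({
    "RoomService": [0.0, 109.0, 43.0],
    "FoodCourt": [0.0, 9.0, 3576.0],
    "ShoppingMall": [0.0, 25.0, 0.0],
    "Spa": [0.0, 549.0, 6715.0],
    "VRDeck": [0.0, 44.0, 49.0],
})

total = bills.sum(axis=1)
mean = bills.mean(axis=1)
spread = bills.std(axis=1)

# With no missing values, the row mean is just the row total divided by 5
print(total.tolist())
print((mean * 5).round(6).tolist())
print("any spending:", (total > 0).tolist())
```

Since the mean is a constant multiple of the total, the two features carry the same information for any model.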

In [164]:
# Correlation of every other column with the Transported column.
# Note: df still contains the test rows, whose Transported value is the placeholder False.
df.corr()['Transported'].sort_values(ascending=False)

Transported                  1.000000
CryoSleep                    0.324525
3_high_columns               0.284257
HomePlanet_Europa            0.131977
Destination_55 Cancri e      0.083625
Side                         0.059872
Deck                         0.041775
FoodCourt                    0.034766
HomePlanet_U                 0.006403
HomePlanet_Mars              0.005643
ShoppingMall                 0.004189
Destination_PSO J318.5-22    0.000760
Destination_Un              -0.000554
VIP                         -0.018569
Num                         -0.035240
Age                         -0.050592
Destination_TRAPPIST-1e     -0.072731
HomePlanet_Earth            -0.119644
std_amount_spent            -0.121134
amount_spent                -0.140416
mean_amount_spent           -0.140416
3_low_columns               -0.140440
VRDeck                      -0.142770
Spa                         -0.154816
RoomService                 -0.174750
Name: Transported, dtype: float64
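
Because the combined frame includes test rows labeled with a placeholder `False`, the correlations above are likely pulled away from what the labeled rows alone would show. The sketch below illustrates the effect on made-up data; the `combined` frame and `n_labeled` are hypothetical.

```python
import pandas as pd

# Hypothetical combined frame: 4 labeled rows followed by 2 unlabeled rows
combined = pd.DataFrame({
    "CryoSleep": [1, 0, 1, 0, 1, 1],
    # The last two labels are placeholders, not real outcomes
    "Transported": [True, False, True, False, False, False],
}).astype(float)
n_labeled = 4

# Correlation over all rows versus over the labeled rows only
corr_all = combined.corr()["Transported"]["CryoSleep"]
corr_labeled = combined.iloc[:n_labeled].corr()["Transported"]["CryoSleep"]
print(round(corr_all, 3), round(corr_labeled, 3))
```

In the notebook, the equivalent would be to compute `df.corr()` on only the first `df_train.shape[0]` rows.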

In [167]:
# Split the combined data back into train and test sets for modeling
df_train, df_test = df[:df_train.shape[0]] , df[df_train.shape[0]:]
df_test = df_test.drop(columns= 'Transported')

df_train.shape, df_test.shape

((8693, 25), (4277, 24))
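
Splitting by position works only because the training rows were concatenated first and the row order never changed. A minimal sketch of the same round trip on made-up frames (`train`, `test`, and the column values are hypothetical):

```python
import pandas as pd

train = pd.DataFrame({"Age": [39.0, 24.0, 58.0], "Transported": [False, True, False]})
test = pd.DataFrame({"Age": [27.0, 19.0]})
test["Transported"] = False  # placeholder label, as in the notebook

# Train rows come first, so the first len(train) rows are the labeled ones
combined = pd.concat([train, test], sort=False).reset_index(drop=True)

n_train = len(train)
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:].drop(columns="Transported")
print(train_part.shape, test_part.shape)
```

Any step that sorts or shuffles the combined frame would break this slicing, so an explicit marker column is a safer alternative if the pipeline grows.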

# Now create 5 different models to figure out which one fits best.

In [168]:
import sys
!{sys.executable} -m pip install xgboost
!{sys.executable} -m pip install lightgbm 



# Import the classifiers, the train/test split helper, and the accuracy metric
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score



In [145]:
# Features and target
X = df_train.drop(columns='Transported')

y = df_train['Transported']

# Hold out 20% of the labeled data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# scikit-learn classifiers expect a 1-D target, so y is left as a Series
X_train.shape, y_train.shape
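
For a repeatable split, `train_test_split` accepts `random_state`, and `stratify` keeps the class balance the same in both parts. The sketch below uses synthetic data; the column values, the seed, and the use of `stratify` are illustrative assumptions rather than the notebook's settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature table: 100 rows, balanced target
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Age": rng.integers(0, 80, size=100),
    "CryoSleep": rng.integers(0, 2, size=100),
})
y = pd.Series([True, False] * 50, name="Transported")

# random_state fixes the shuffle; stratify=y preserves the True/False ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, y_train.shape)
print(y_test.mean())
```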

# Models

In [169]:
model_1 = LogisticRegression()
model_2 = DecisionTreeClassifier()
model_3 = RandomForestClassifier()
model_4 = XGBClassifier()
model_5 = LGBMClassifier()
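
Rather than fitting each model in its own cell, the candidates can be compared in one loop. This is a sketch on synthetic data from `make_classification`, limited to the three scikit-learn models so it runs without extra installs. The `models` dictionary, the seeds, and `max_iter=1000` are illustrative choices, and the scores say nothing about the Spaceship Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem standing in for the real features
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Fit every model on the same split and record its held-out accuracy
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

`XGBClassifier` and `LGBMClassifier` follow the same `fit`/`predict` interface, so they could be added to the dictionary in the same way.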


In [171]:
model_1.fit(X_train, y_train)
prediction = model_1.predict(X_test)
accuracy_score(y_test, prediction)

ValueError: could not convert string to float: 'Earth'
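
The error indicates that `X_train` still contains raw `HomePlanet` strings. One plausible cause, judging by the execution counts, is that the split cell (`In [145]`) ran before the preprocessing cells (`In [148]` onward), so `X` was built from an unencoded `df_train`. If so, re-running the split cell after `In [167]` should refresh it. A quick way to spot leftover text columns, sketched on a hypothetical frame:

```python
import pandas as pd

# Hypothetical feature frame with one column that was never encoded
X = pd.DataFrame({
    "Age": [39.0, 24.0],
    "CryoSleep": [0.0, 1.0],
    "HomePlanet": ["Europa", "Earth"],
})

# Columns that are neither numeric nor boolean cannot be fed to the model as-is
non_numeric = X.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print(non_numeric)
```

An empty list from this check on the real `X_train` would mean every feature is ready for `fit`.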