### <a id="cont"> The Challenge 

Predict which passengers were transported by the anomaly. We suspect the anomaly followed a pattern, and we want to find that pattern.

We have the records recovered from the spaceship's damaged computer system. Here is the game plan.

Start by visualising the dataset features. There are multiple [helper functions](#helpers) that take in the numbers and provide the visuals.

[How many passengers were Transported?](#vis_0)

[Which planet do the majority of passengers belong to?](#vis_1)

[Which planet's passengers were mostly transported?](#vis_2)

[Did CryoSleep have an impact?](#vis_3)

[Are the passengers travelling as Single or Groups?](#pair)

The train dataset has 8,693 passengers but only 6,561 distinct cabin values, so some cabins must hold more than one passenger. The cabin names are split into their constituent parts, which are used for further analysis.
  
[Did the Cabin location have an impact?](#vis_4)

[Did the Cabin type have an impact?](#vis_4_type)

[Which Cabin numbers have seen the most transports?](#vis_4_number)

[Did the Destination have an impact?](#vis_5)

[Did Age have an impact?](#vis-6)

[Did VIP status have an impact?](#vis_7)

[Supporting Box plot visuals for further analysis of the dataset](#vis_sup)

[How are the passengers' ages distributed?](#Age)

[How does the money spent on Room Service affect the probability of being Transported?](#vis_8)

[How do Total Spend and Age relate?](#vis_9)

[How do Total Spend and Age relate to being Transported?](#vis_10)

[How are the names distributed?](#vis_11)

[Do Last_Names have any correlation with being Transported?](#vis_12)
    
# [Let's Begin Machine Learning](#LogML)
    
[Training](#train)
    
[Predicting](#res)
    
[Submission](#sub)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px
import plotly.graph_objects as go

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
train = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
train.head()

In [None]:
# Always ask: are there missing values in the dataset?
train.info()

In [None]:
test = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')
test.head()

### How to deal with the missing values?

Option 1: Simply drop the rows that have null values.

Option 2: Fill the null values intelligently, after seeing how they are spread across the dataset. (This option is chosen for numerical values.)

Option 3: Mark the null values as 'Unknown' for categorical / string values. (This option is chosen for categorical values.)
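
Before choosing between these options, it helps to count the nulls per column. The sketch below uses a small made-up frame (the `toy` DataFrame and its values are illustrative assumptions, not the competition data) to show the per-column null count and the two kinds of fill described above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the values are made up for illustration.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", None, "Mars", "Europa"],
    "Age": [24.0, np.nan, 58.0, 33.0],
    "Spa": [0.0, 549.0, np.nan, 0.0],
})

# Count missing values per column before deciding how to fill them.
null_counts = toy.isna().sum()
print(null_counts)

# Option 3: label missing categorical values as 'Unknown'.
toy["HomePlanet"] = toy["HomePlanet"].fillna("Unknown")

# Option 2: fill numerical values (median for Age, 0.0 for spending).
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Spa"] = toy["Spa"].fillna(0.0)
print(toy)
```

Option 1 would instead be `toy.dropna()`, which in this toy frame would discard two of the four rows.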

In [None]:
#Filling the string null values with Unknown
for col in ['HomePlanet','CryoSleep','Cabin','Destination','VIP','Name']:
    train.loc[train[col].isna(),col] = 'Unknown'

In [None]:
#Filling the string null values with Unknown in test dataset
for col in ['HomePlanet','CryoSleep','Cabin','Destination','VIP','Name']:
    test.loc[test[col].isna(),col] = 'Unknown'

In [None]:
print("The median age of passenger in train set is {}".format(train.loc[~train.Age.isna(),'Age'].median()))
print("The median age of passenger in test set is {}".format(test.loc[~test.Age.isna(),'Age'].median()))

In [None]:
#Filling the numerical null values with 0.0 in case of spending. Using Median Age for the Age feature
for col in ['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']:
    train.loc[train[col].isna(),col] = 0.0
train.loc[train.Age.isna(),'Age'] = train.loc[~train.Age.isna(),'Age'].median()

for col in ['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']:
    test.loc[test[col].isna(),col] = 0.0
test.loc[test.Age.isna(),'Age'] = test.loc[~test.Age.isna(),'Age'].median()

 <a id="helpers">

In [None]:
#There are going to be a lot of visuals, and the helper functions below make it a breeze to work with the dataset... and FUN!!!

def histogram_onefac(dataset,feature1,title):
    hist_vis = px.histogram(data_frame=dataset,x=feature1)
    hist_vis.update_layout(height=400,title=title)
    hist_vis.show()

def box_plot(dataset,cols):
    boxes = go.Figure()
    # use the dataset passed in, not the global train frame
    boxes.add_trace(go.Box(x=dataset[cols],name=cols))
    boxes.show()    
    
def scatter_twofac(dataset,feature1,feature2,title):
    # use the dataset passed in, not the global train frame
    scatter_vis_two = px.scatter(data_frame=dataset,x=feature1,y=feature2)
    scatter_vis_two.update_layout(height=800,title=title)
    scatter_vis_two.show()
    
def scatter_threefac(dataset,feature1,feature2,color_feat,title):
    # use the dataset passed in, not the global train frame
    scatter_vis_three = px.scatter(data_frame=dataset,x=feature1,y=feature2,color=color_feat)
    scatter_vis_three.update_layout(height=800,title=title)
    scatter_vis_three.show()
    
def histogram_visual(dataset,feature1,feature2,title):
    hist_vis = px.histogram(data_frame=dataset,x=feature1,color=feature2)
    hist_vis.update_layout(height=800,title=title)
    hist_vis.show()
    
def singleFactor_visualisation(dataset,feature):
    vis_singleFactor = dataset.groupby(feature)["PassengerId"].count().reset_index()
    vis_onef = px.bar(data_frame=vis_singleFactor,x=feature,y='PassengerId',color=feature)
    vis_onef.update_layout(height=800,title = "Count distribution of "+ feature + " of the passengers" )
    vis_onef.show()
    
def doubleFactor_visualisation(dataset,feature1,feature2,title):
    vis_doubleFactor = dataset.groupby([feature1,feature2])["PassengerId"].count().reset_index()
    vis_twof = px.bar(data_frame=vis_doubleFactor,x=feature1,y='PassengerId',color=feature2,barmode='group')
    vis_twof.update_layout(height=800,title = title )
    vis_twof.show()

[Back to Contents](#cont)

### <a id="vis_0">How many passengers were Transported? 

Out of the 8,693 passengers, 4,378 have been transported. That is a bit higher than 50%.

The dataset given to us is balanced, so feature learning will most likely be easier.

A balanced dataset also means that understanding the pattern with traditional analysis will be challenging. Still, we visualise the data thoroughly before going into any machine learning or deep learning modelling.
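
As a quick check of the balance claim, the arithmetic can be worked through directly from the two counts quoted above. The variable names are illustrative only.

```python
# Counts quoted in the text above.
total_passengers = 8693
transported = 4378

share_transported = transported / total_passengers
not_transported = total_passengers - transported

print(f"Transported:     {share_transported:.2%}")
print(f"Not transported: {1 - share_transported:.2%}")
```

A model that always predicts "Transported" would therefore score roughly 50.4% accuracy on the training data, which is a useful baseline to compare any classifier against.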

In [None]:
singleFactor_visualisation(train,'Transported')

[Back to Contents](#cont)

### <a id="vis_1">Which planet do the majority of passengers belong to? 

In [None]:
singleFactor_visualisation(train,'HomePlanet')

[Back to Contents](#cont)

### <a id="vis_2">Which planet's passengers were mostly transported? 

In [None]:
doubleFactor_visualisation(dataset=train,feature1='HomePlanet',feature2='Transported',title='Transported passengers by HomePlanet')

[Back to Contents](#cont)

### <a id="vis_3">Did CryoSleep have an impact? 

In [None]:
doubleFactor_visualisation(dataset=train,feature1='CryoSleep',feature2='Transported',title='Did CryoSleep have an impact')

[Back to Contents](#cont)

### <a id="pair">Are the passengers travelling as Single or Groups? 

A cabin can hold more than one passenger, since there will obviously be families with children. Let's visualise that first...

Some cabins are pretty crowded, with up to 6 passengers... That leads to another question.
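
To make the idea of cabin occupancy concrete, here is a minimal sketch on made-up data (the `toy` frame and its cabin labels are assumptions for illustration). It counts passengers per cabin and then tallies how many cabins hold 1, 2, 3... passengers.

```python
import pandas as pd

# Hypothetical cabin assignments; IDs and cabin labels are made up.
toy = pd.DataFrame({
    "PassengerId": ["p1", "p2", "p3", "p4", "p5", "p6"],
    "Cabin": ["B/0/P", "F/0/S", "A/0/S", "A/0/S", "F/1/S", "A/0/S"],
})

# Passengers per cabin.
passengers_per_cabin = toy.groupby("Cabin")["PassengerId"].count()

# Number of cabins for each occupancy level.
occupancy = passengers_per_cabin.value_counts().sort_index()
print(occupancy)
```

In this toy frame three cabins hold a single passenger and one cabin holds three.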

In [None]:
print("There are {} cabins in the space craft".format(len(train.Cabin.unique())))

In [None]:
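cabin_members = train.groupby('Cabin')['PassengerId'].count()
cabin_plot = go.Figure()
# leave out the 'Unknown' placeholder group so it does not show up as one huge cabin
cabin_plot.add_trace(go.Histogram(x=cabin_members.drop('Unknown',errors='ignore').values))
cabin_plot.show()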

[Back to Contents](#cont)

In [None]:
#Splitting the Name column so that further analysis can be done on who is using the cabins
#n is passed by keyword because newer pandas versions no longer accept it positionally
train = train.join(train.Name.str.split(' ',n=1,
                                         expand=True).rename(columns={1:'Last_name', 
                                                                      0:'First_name'}))
test = test.join(test.Name.str.split(' ',n=1,
                                         expand=True).rename(columns={1:'Last_name', 
                                                                      0:'First_name'}))

In [None]:
#Splitting the cabin names so that further analysis can be drilled down.
train = train.join(train.Cabin.str.split('/',n=2,
                                         expand=True).rename(columns={0:'Cabin_type', 
                                                                      1:'Cabin_no',
                                                                      2:'Cabin_loc'}))

In [None]:
test = test.join(test.Cabin.str.split('/',n=2,
                                         expand=True).rename(columns={0:'Cabin_type', 
                                                                      1:'Cabin_no',
                                                                      2:'Cabin_loc'}))

### <a id="vis_4">Did the Cabin location have an impact? 

In [None]:
doubleFactor_visualisation(dataset=train,feature1='Cabin_loc',feature2='Transported',title='Did the Cabin location have an impact')

[Back to Contents](#cont)

### <a id="vis_4_type">Did the Cabin type have an impact? 

In [None]:
doubleFactor_visualisation(dataset=train,feature1='Cabin_type',feature2='Transported',title="Did the Cabin type have an impact")

[Back to Contents](#cont)

### <a id="vis_4_number">Which cabin numbers have seen maximum transport? 

In [None]:
fig = px.histogram(data_frame=train,y='Cabin_no',color='Transported')
fig.update_layout(height=1000,title='Which cabin numbers have seen maximum transports')
fig.show()

[Back to Contents](#cont)

### <a id="vis_5">Did the Destination have an impact? 

In [None]:
doubleFactor_visualisation(dataset=train,feature1='Destination',feature2='Transported',title='Did the Destination have an impact')

[Back to Contents](#cont)

### <a id="vis_7">Did VIP status have an impact?

There seems to be a slight negative correlation between VIP status and the probability of being transported.

In [None]:
doubleFactor_visualisation(dataset=train,feature1='VIP',feature2='Transported',title='Did VIP status have an impact')

[Back to Contents](#cont)

### <a id="vis_sup"> Understanding the distribution of the numerical values in the dataset

These visuals were created after visuals 8 to 10 below. Those plots raised doubts about how the individual numerical features are distributed, so each feature's distribution is plotted here as a supporting chart.

For a clearer visual, rows where a feature is 0 are removed from that feature's plot; the Age feature is the exception.

In [None]:
for features in ['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']:
    dataset = train[train[features] != 0]
    histogram_onefac(dataset=dataset,feature1=features,title=features)

[Back to Contents](#cont)

### <a id="Age"> Distribution of Passengers' Age

A surprise: there are passengers with Age 0. Are they newborn babies, or is this an error?

An error seems unlikely, because each of these passengers has their own PassengerId. They were most likely infants carried by their parents...

In [None]:
histogram_onefac(dataset=train,feature1='Age',title='Age of Passengers')

[Back to Contents](#cont)

### <a id="vis-6"> Did Age have an impact?

In [None]:
histogram_visual(dataset=train,feature1='Age',feature2='Transported',title='Age & Transported relation')

[Back to Contents](#cont)

### <a id="vis_8">How does the money spent on Room Service affect the probability of being transported?

It looks like the passengers who spent the least on Room Service were transported... Are stingy people targeted? A bit of drilling down is required.

In [None]:
histogram_visual(dataset=train[train.RoomService != 0],feature1='RoomService',feature2='Transported',title='Did the Room Service spend have an impact')

[Back to Contents](#cont)

### <a id="vis_9">How do the total money spent and Age relate?

It seems children below age 13 are not given any pocket money on the ship...

In [None]:
#Doing the calculation for test dataset also
test['TotalSpend'] = test.RoomService+test.VRDeck+test.Spa+test.ShoppingMall+test.FoodCourt

In [None]:
train['TotalSpend'] = train.RoomService+train.VRDeck+train.Spa+train.ShoppingMall+train.FoodCourt
histogram_visual(dataset=train[train.RoomService > 1],feature1='TotalSpend',feature2='Age',title="Relation between total spend and Age")

[Back to Contents](#cont)

### <a id="vis_10">How do Total Spend and Age relate to being transported?

Inconclusive, but there is some level of correlation between the total spend and being transported.
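
One way to put a number on this kind of relationship is to compare the share of transported passengers between those who spent nothing and those who spent something. The sketch below does that on made-up data; the `toy` frame, the `SpendGroup` column and all values are assumptions for illustration, not results from the competition data.

```python
import pandas as pd

# Hypothetical passengers; spend values and outcomes are made up.
toy = pd.DataFrame({
    "TotalSpend": [0.0, 0.0, 0.0, 120.0, 900.0, 3500.0],
    "Transported": [True, True, False, False, True, False],
})

# Label each passenger by whether they spent anything at all.
toy["SpendGroup"] = toy["TotalSpend"].gt(0).map({True: "spent", False: "no spend"})

# The mean of a boolean column is the share of True values in each group.
rate_by_spend = toy.groupby("SpendGroup")["Transported"].mean()
print(rate_by_spend)
```

Running the same two lines on the real `train` frame (after `TotalSpend` has been computed) would show whether the visual impression from the scatter plot holds up.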

In [None]:
scatter_threefac(dataset=train,feature1='Age',feature2='TotalSpend',color_feat='Transported',
                 title="Relation between total spend, Age and being Transported")

[Back to Contents](#cont)

### <a id="vis_11">Names, Families and their distribution.

Once the name is split into first and last name, the visualisation below comes up.

In [None]:
Firstname_df = train.groupby('First_name')['PassengerId'].count().reset_index()
Lastname_df = train.groupby('Last_name')['PassengerId'].count().reset_index()

Firstname_df.sort_values('PassengerId',ascending=False,inplace=True)
Lastname_df.sort_values('PassengerId',ascending=False,inplace=True)

fig_1 = px.bar(data_frame=Firstname_df[:50],y='First_name',x='PassengerId')
fig_2 = px.bar(data_frame=Lastname_df[:50],y='Last_name',x='PassengerId')

fig_1.update_layout(yaxis={'categoryorder':'total descending'},height=1000)
fig_2.update_layout(yaxis={'categoryorder':'total descending'},height=1000)

fig_1.show()
fig_2.show()

[Back to Contents](#cont)

### <a id="vis_12">Last_Names, do they have any correlation with being Transported?

Some families were taken away completely, while in other families only a few members were taken. No direct correlation is visible with the current way of looking at the data.
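
The distinction between families taken completely and families taken only in part can be computed rather than read off a chart. Here is a minimal sketch on made-up data (the family names `Alpha`, `Beta`, `Gamma` and the outcomes are hypothetical).

```python
import pandas as pd

# Hypothetical families; names and outcomes are made up for illustration.
toy = pd.DataFrame({
    "Last_name": ["Alpha", "Alpha", "Beta", "Beta", "Beta", "Gamma"],
    "Transported": [True, True, True, False, False, False],
})

# Per family: how many members, and what share of them were transported.
family = toy.groupby("Last_name")["Transported"].agg(
    members="count", share_transported="mean"
)

share = family["share_transported"]
fully_taken = family[share == 1.0].index.tolist()
partly_taken = family[(share > 0) & (share < 1)].index.tolist()
print(family)
print("Fully taken:", fully_taken)
print("Partly taken:", partly_taken)
```

On real data, single-member families will always land at a share of exactly 0 or 1, so filtering on `members > 1` first is probably worthwhile.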

In [None]:
Last_tported = train.groupby(['Last_name','Transported'])['PassengerId'].count().reset_index()
Last_tported

In [None]:
fig = px.bar(data_frame=Last_tported[:500],y='Last_name',x='PassengerId',color='Transported')

fig.update_layout(yaxis={'categoryorder':'total descending'},height=1000)
fig.show()

[Back to Contents](#cont)

#  <a id="LogML"> Let's begin some Machine learning 

Let's begin with the simplest of them all: logistic regression. The objective is to check whether the algorithm can provide insights that our naked eyes have missed.

We will use the scikit-learn logistic regression and then explore the results. Further steps will be decided by analysing those results.
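
One way a logistic regression can surface such insights is through its fitted coefficients. The sketch below is an illustration on synthetic data only: the feature names, the 90% rule used to generate the target, and the `clf` and `coefs` names are all assumptions, not results from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400

# Hypothetical features: one informative flag and one pure-noise column.
X_toy = pd.DataFrame({
    "CryoSleep_True": rng.integers(0, 2, size=n),
    "Noise": rng.normal(size=n),
})

# Synthetic target that follows the flag about 90% of the time.
flip = rng.random(n) < 0.1
y_toy = np.where(flip, 1 - X_toy["CryoSleep_True"], X_toy["CryoSleep_True"]).astype(bool)

clf = LogisticRegression(solver="liblinear").fit(X_toy, y_toy)

# Pair each coefficient with its column name, largest magnitude first.
coefs = pd.Series(clf.coef_[0], index=X_toy.columns).sort_values(key=abs, ascending=False)
print(coefs)
```

Coefficient magnitudes are only comparable when the features are on similar scales, so unscaled spending columns would need standardising before reading much into a ranking like this.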

In [None]:
#Making the unknowns of Cryo Sleepers into False. If they were asleep then the computer system must have known
train.loc[train.CryoSleep == 'Unknown','CryoSleep'] = False
train.loc[train.VIP == 'Unknown','VIP'] = False
train.CryoSleep = train.CryoSleep.apply(lambda x : str(x))
train.VIP = train.VIP.apply(lambda x : str(x))
train.loc[train.Cabin_loc.isna(),'Cabin_loc'] = 'Special'

In [None]:
#Making the unknowns of Cryo Sleepers into False. If they were asleep then the computer system must have known
test.loc[test.CryoSleep == 'Unknown','CryoSleep'] = False
test.loc[test.VIP == 'Unknown','VIP'] = False
test.CryoSleep = test.CryoSleep.apply(lambda x : str(x))
test.VIP = test.VIP.apply(lambda x : str(x))
test.loc[test.Cabin_loc.isna(),'Cabin_loc'] = 'Special'

In [None]:
from sklearn.preprocessing import OneHotEncoder

def one_hotE(dataf,column):
    data = dataf[column].values.reshape(-1,1)
    # define one hot encoding
    encoder = OneHotEncoder()
    # transform data; toarray() returns a dense array regardless of scikit-learn version
    onehot_list = encoder.fit_transform(data).toarray()
    # name the columns from encoder.categories_, which matches the (sorted) order of the
    # encoded columns; the order of unique() does not
    variables = [column + '_' + str(vals) for vals in encoder.categories_[0]]
    print(variables)
    df_ot = pd.DataFrame(onehot_list,columns=variables,index=dataf.index)
    return df_ot

In [None]:
#One hot encoded all the categorical values in the testing data, using the helper function
hp_ds_t = one_hotE(test,'HomePlanet')
cs_ds_t = one_hotE(test,'CryoSleep')
cab_type_t = one_hotE(test,'Cabin_type')
destiDf_t = one_hotE(test,'Destination')
vipdf_t = one_hotE(test,'VIP')
cab_loc_t = one_hotE(test,'Cabin_loc')

In [None]:
test = test.join(hp_ds_t)
test = test.join(cs_ds_t)
test = test.join(cab_type_t)
test = test.join(destiDf_t)
test = test.join(vipdf_t)
test = test.join(cab_loc_t)

In [None]:
test.drop(['Cabin_loc','VIP','Destination',
            'Cabin_type','CryoSleep','HomePlanet',
           'Cabin','Name','PassengerId'],axis=1,inplace=True)

In [None]:
#One hot encoded all the categorical values in the training data, using the helper function
hp_ds = one_hotE(train,'HomePlanet')
cs_ds = one_hotE(train,'CryoSleep')
cab_type = one_hotE(train,'Cabin_type')
destiDf = one_hotE(train,'Destination')
vipdf = one_hotE(train,'VIP')
cab_loc = one_hotE(train,'Cabin_loc')

In [None]:
train = train.join(hp_ds)
train = train.join(cs_ds)
train = train.join(cab_type)
train = train.join(destiDf)
train = train.join(vipdf)
train = train.join(cab_loc)
#dropping all the categorical columns

In [None]:
train.drop(['Cabin_loc','VIP','Destination',
            'Cabin_type','CryoSleep','HomePlanet',
           'Cabin','Name','PassengerId',
           'Last_name','First_name'],axis=1,inplace=True)

In [None]:
#Cabin number is still object, so converting that to numbers. Null values are made to 0
train.loc[train.Cabin_no.isna(),'Cabin_no'] = 0
train.Cabin_no = train.Cabin_no.apply(lambda x: float(x))

In [None]:
#Cabin number is still object, so converting that to numbers. Null values are made to 0 in test set also
test.loc[test.Cabin_no.isna(),'Cabin_no'] = 0
test.Cabin_no = test.Cabin_no.apply(lambda x: float(x))

[Back to Contents](#cont)
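
Because the training and test frames are encoded separately above, the model only works if both end up with the same columns in the same order. The sketch below shows one way to enforce that. It swaps in `pd.get_dummies` plus `reindex` instead of the notebook's `one_hotE` helper, and the toy frames and their categories are made up for illustration.

```python
import pandas as pd

# Hypothetical frames; the test frame lacks the 'T' category.
toy_train = pd.DataFrame({"Cabin_type": ["A", "B", "T", "B"]})
toy_test = pd.DataFrame({"Cabin_type": ["B", "A", "A"]})

train_encoded = pd.get_dummies(toy_train, columns=["Cabin_type"], dtype=float)
test_encoded = pd.get_dummies(toy_test, columns=["Cabin_type"], dtype=float)

# Give the test frame exactly the training columns, in the same order;
# categories absent from the test data become all-zero columns.
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0.0)
print(test_encoded)
```

A quick `list(X.columns) == list(test.columns)` check before calling `model.predict` would catch any remaining mismatch early.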

###  <a id="train"> Starting to train 

In [None]:
# split a dataset into train and test sets
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
#Creating Training and Testing split, and popping the target
y = train.pop('Transported')
X = train

In [None]:
make_blobs(n_samples=2,n_features=38)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
# fit a logistic regression model on the training split
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)
# predict on the held-out split
y_pred = model.predict(X_test)

[Back to Contents](#cont)

###  <a id="res"> Training result 
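
With `labels=[1,0]`, the rows of scikit-learn's confusion matrix are the true labels and the columns are the predictions, so flattening it gives `tp, fn, fp, tn` in that order. The sketch below counts the same four cells by hand on made-up labels (the `y_true_toy` and `y_pred_toy` lists are illustrative assumptions) and derives accuracy, precision and recall from them.

```python
# Hypothetical true labels and predictions, for illustration only.
y_true_toy = [True, True, True, False, False, False, False, True]
y_pred_toy = [True, False, True, False, True, False, False, True]

pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(t and p for t, p in pairs)              # transported, predicted transported
fn = sum(t and not p for t, p in pairs)          # transported, predicted not transported
fp = sum((not t) and p for t, p in pairs)        # not transported, predicted transported
tn = sum((not t) and (not p) for t, p in pairs)  # not transported, predicted not transported

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fn, fp, tn)
print(accuracy, precision, recall)
```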

In [None]:
# confusion matrix in sklearn
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report


# confusion matrix
matrix = confusion_matrix(y_test,y_pred, labels=[1,0])
print('Confusion matrix : \n',matrix)

# outcome values order in sklearn
tp, fn, fp, tn = confusion_matrix(y_test,y_pred, labels=[1,0]).reshape(-1)
print('Outcome values : \n', tp, fn, fp, tn)

# classification report for precision, recall f1-score and accuracy
matrix = classification_report(y_test,y_pred, labels=[1,0])
print('Classification report : \n',matrix)

In [None]:
test.drop(['Last_name','First_name'],axis=1,inplace=True)

In [None]:
#predicting the values for submission
y_sub = model.predict(test)

In [None]:
# reload the raw test file to recover the PassengerId column that was dropped during preprocessing
test = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')

[Back to Contents](#cont)

###  <a id="sub"> Submission 

In [None]:
my_submission = pd.DataFrame({'PassengerId': test.PassengerId, 'Transported': y_sub})
# you could use any filename. We choose submission here
my_submission.to_csv('submission.csv', index=False)

In [None]:
my_submission.sample(5)

[Back to Contents](#cont)

The analysis will be continued...