<div style='font-size:200%;'>
    <a id='nan'></a>
    <h1 style='color: darkslateblue; font-weight: bold; font-family: Cascadia code;'>
        <center> Multiple Model analysis with EDA on Spaceship Titanic Dataset </center>
    </h1>
</div>

<center> <img src="https://64.media.tumblr.com/debec6caf9c8a009a73207228ffde853/5866f6929b208337-e6/s540x810/aacff5ed83186c696f4c3c6796eccdf2f8f68bd7.gifv"> </center>

- - -

<div style='font-size:200%;'>
    <a id='nan'></a>
    <h1 style='color: chartreuse; font-weight: bold; font-family: Cascadia code;'> Contents </h1>
</div>

- [Importing necessary libraries](#import)
- [Importing the data](#data)
- [Exploratory Data Analysis](#eda)
    - [Null values heat-map](#heatmap)
    - [Correlation between different features and our target variable](#corr)
    - [Distribution of Individuals based on HomePlanet](#planet)
    - [Distribution of transported individuals](#trans)
    - [Age distribution of the passengers](#age)
- [Auto-Visualization](#auto)
    - [Training Data](#auto-train)
    - [Testing Data](#auto-test)
- [Data Pre-processing](#preprocess)
    - [Imputing missing data](#impute)
    - [Typecasting and dropping columns](#typecast)
    - [One-Hot Encoding](#ohe)
    - [Splitting data into x (Values) and y (labels)](#split)
- [Classifying](#classify)
    - [Building and fitting the models](#build)
    - [Performance Analysis of the different models](#anal)
- [Submission](#submit)

- - -

<div style='font-size:200%;'>
    <a id='import'></a>
    <h1 style='color: maroon; font-weight: bold; font-family: Cascadia code;'>
        <center> Importing necessary libraries 📚 </center>
    </h1>
    <img src="https://miro.medium.com/max/1400/1*RIrPOCyMFwFC-XULbja3rw.png">
</div>

In [None]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

<div style='font-size:200%;'>
    <a id='data'></a>
    <h1 style='color: orangered; font-weight: bold; font-family: Cascadia code;'>
        <center> Importing the dataset ⬇️ </center>
    </h1>
</div>

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
dataTrain = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
dataTest = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')

<div style='font-size:200%;'>
    <a id='eda'></a>
    <h1 style='color: lightgreen; font-weight: bold; font-family: Cascadia code;'>
        <center> Exploratory Data Analysis 📊 </center>
    </h1>
</div>

<h1 align="center" ><a id='heatmap'><b>Null values heat-map</b></a></h1>

In [None]:
fig, axes = plt.subplots(1, 2, sharex=True, figsize=(20,10))
sns.heatmap(ax=axes[0], yticklabels=False, data=dataTrain.isnull(), cbar=False, cmap="viridis")
sns.heatmap(ax=axes[1], yticklabels=False, data=dataTest.isnull(), cbar=False, cmap="tab20c")
axes[0].set_title('Heatmap of missing values in training data')
axes[1].set_title('Heatmap of missing values in testing data')
plt.show()

In [None]:
print('Unique HomePlanet:', dataTrain.HomePlanet.unique(), '\nUnique Destination:', dataTrain.Destination.unique())

<h1 align="center" ><a id='corr'><b>Correlation between different features and our target variable</b></a></h1>

## **Correlation coefficient**

The correlation coefficient is a statistical measure of the strength and direction of the linear relationship between two variables. Its value always lies between -1.0 and 1.0, so a computed value outside that range indicates an error in the calculation. A correlation of -1.0 is a perfect negative correlation, 1.0 is a perfect positive correlation, and 0.0 means there is no linear relationship between the two variables.
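As a small illustration of the definition, the sketch below computes the Pearson correlation coefficient by hand on made-up toy data. The helper `pearson_r` and the toy lists are hypothetical names used only for this example; they are not part of the notebook's pipeline.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()   # centre both variables
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Toy data, invented for illustration (not taken from the dataset)
hours = [1, 2, 3, 4, 5]
score_up = [2, 4, 6, 8, 10]     # rises in step with hours
score_down = [10, 8, 6, 4, 2]   # falls in step with hours

r_up = pearson_r(hours, score_up)       # perfect positive relationship
r_down = pearson_r(hours, score_down)   # perfect negative relationship
print(r_up, r_down)
```

The same quantity is what `np.corrcoef` and `DataFrame.corr` report for each pair of numeric columns.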

In [None]:
plt.figure(figsize=(12,8))
# numeric_only=True skips text columns (pandas >= 2.0 raises an error on them otherwise)
data = dataTrain.corr(numeric_only=True)["Transported"].sort_values(ascending=False)
# Drop the target's correlation with itself (always 1.0)
data = data.drop("Transported")
sns.barplot(x=data.values, y=data.index.tolist(), palette='magma')
plt.title('Correlation coefficient between different features and Transported')
plt.show()

In [None]:
plt.figure(figsize=(18, 9))
sns.heatmap(dataTrain.corr(numeric_only=True), cmap="YlGnBu", annot=True)
plt.show()

<h1 align="center" ><a id='planet'><b>Distribution of Individuals based on HomePlanet</b></a></h1>

In [None]:
tPlanet = pd.crosstab(dataTrain['Transported'], dataTrain['HomePlanet'])
tDest = pd.crosstab(dataTrain['Transported'], dataTrain['Destination'])

In [None]:
plt.figure(figsize=(12,8))
colors = sns.color_palette('pastel')
# Take the labels from value_counts() itself so they always match the slice order
homeCounts = dataTrain.HomePlanet.value_counts()
plt.pie(homeCounts, labels=homeCounts.index, colors=colors, autopct='%.0f%%')
plt.show()

<h1 align="center" ><a id='trans'><b>Distribution of transported individuals</b></a></h1>

In [None]:
plt.figure(figsize=(20,15))
plt.subplot(2,2,1)
sns.countplot(x = dataTrain.HomePlanet, hue = dataTrain.Transported, palette="viridis")
plt.title('Transported individuals - Home Planets', fontsize=15)
plt.xlabel('HomePlanet', fontsize=15)
plt.ylabel('Number of Individuals', fontsize=15)

plt.subplot(2,2,2)
sns.countplot(x = dataTrain.HomePlanet, hue = dataTrain.CryoSleep, palette="viridis")
plt.title('CryoSleep passengers - Home Planets', fontsize=15)
plt.xlabel('HomePlanet', fontsize=15)
plt.ylabel('Number of Individuals', fontsize=15)
plt.show()

<h1 align="center" ><a id='age'><b>Age distribution of the passengers</b></a></h1>

In [None]:
plt.figure(figsize=(20,8))
sns.histplot(dataTrain.Age, color=sns.color_palette('magma')[2])
plt.show()

In [None]:
trainAge = dataTrain.copy()
testAge = dataTest.copy()
trainAge["type"] = "Train"
testAge["type"] = "Test"
ageDf = pd.concat([trainAge, testAge])
fig = px.histogram(data_frame = ageDf, 
                   x="Age",
                   color= "type",
                   color_discrete_sequence =  ['#FFA500','#87CEEB'],
                   marginal="box",
                   nbins= 100,
                   template="plotly_white"
                )
fig.update_layout(title = "Distribution of Age" , title_x = 0.5)
fig.show()

<div style='font-size:200%;'>
    <a id='auto'></a>
    <h1 style='color: green; font-weight: bold; font-family: Cascadia code;'>
        <center> Auto-Visualization 🤖 </center>
    </h1>
</div>

<h1 align="center" ><a id='auto-train'><b>Training Data</b></a></h1>

In [None]:
!pip install autoviz

In [None]:
plt.figure(figsize = (10, 5))
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
df_av = AV.AutoViz('/kaggle/input/spaceship-titanic/train.csv')
plt.show()

<h1 align="center" ><a id='auto-test'><b>Testing Data</b></a></h1>

In [None]:
plt.figure(figsize = (10, 5))
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
df_av = AV.AutoViz('/kaggle/input/spaceship-titanic/test.csv')
plt.show()

<div style='font-size:200%;'>
    <a id='preprocess'></a>
    <h1 style='color: purple; font-weight: bold; font-family: Cascadia code;'>
        <center> Data Pre-processing ⌛ </center>
    </h1>
</div>

<h1 align="center" ><a id='impute'><b>Imputing missing data</b></a></h1>

## **Data imputation**

Imputation replaces missing data with substitute values so that most of the information in the dataset is retained. Dropping every row that has a missing value is often not feasible: it can shrink the dataset to a large extent, which not only risks biasing the dataset but can also lead to incorrect analysis.
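A minimal sketch of the idea on a made-up toy frame follows. The `toy_train` and `toy_test` frames are hypothetical, and the pattern of fitting on the training rows and then reusing the learned fill values on the test rows is one common convention, shown here as an assumption rather than as this notebook's exact procedure.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames, invented for illustration
toy_train = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", np.nan, "Mars"],
    "Age": [30.0, np.nan, 30.0, 41.0],
})
toy_test = pd.DataFrame({
    "HomePlanet": [np.nan, "Europa"],
    "Age": [np.nan, 19.0],
})

# 'most_frequent' works for both text and numeric columns
imputer = SimpleImputer(strategy="most_frequent")
train_filled = pd.DataFrame(imputer.fit_transform(toy_train), columns=toy_train.columns)
# Reuse the fill values learned from the training rows
test_filled = pd.DataFrame(imputer.transform(toy_test), columns=toy_test.columns)

print(train_filled)
print(test_filled)
```

Each missing cell is filled with the most common value of its column in the training frame, so no rows have to be dropped.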

In [None]:
idCol = dataTest.PassengerId.to_numpy()
dataTrain.set_index('PassengerId', inplace=True)
dataTest.set_index('PassengerId', inplace=True)

In [None]:
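from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
featureCols = dataTest.columns  # every column except the target 'Transported'
# Learn the fill values from the training features only, then reuse them on the test set
trainFeatures = pd.DataFrame(imputer.fit_transform(dataTrain[featureCols]), columns=featureCols, index=dataTrain.index)
dataTrain = trainFeatures.assign(Transported=dataTrain.Transported)
dataTest = pd.DataFrame(imputer.transform(dataTest), columns=featureCols, index=dataTest.index)
dataTrain = dataTrain.reset_index(drop=True)
dataTest = dataTest.reset_index(drop=True)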

<h1 align="center" ><a id='typecast'><b>Typecasting and dropping columns</b></a></h1>

In [None]:
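dataTrain.Transported = dataTrain.Transported.astype('int')
# Cast the boolean flags to 0/1 in both sets so train and test stay consistent
dataTrain.VIP = dataTrain.VIP.astype('int')
dataTrain.CryoSleep = dataTrain.CryoSleep.astype('int')
dataTest.VIP = dataTest.VIP.astype('int')
dataTest.CryoSleep = dataTest.CryoSleep.astype('int')
dataTrain.drop(columns=['Cabin', 'Name'], inplace=True)
dataTest.drop(columns=['Cabin', 'Name'], inplace=True)
dataTrain.head()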

<h1 align="center" ><a id='ohe'><b>One-Hot Encoding 🔥</b></a></h1>

## **One-Hot Encoding**

One-hot encoding is one method of converting categorical data into a numeric form that an algorithm can use. Each distinct category of a feature becomes its own new column holding a binary value of 1 or 0. A row's category is thus represented as a binary vector: every entry is zero except the one at the index of that category, which is marked with a 1.
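The sketch below shows the encoding on a made-up toy column. The frames `toy_train` and `toy_test` are hypothetical. The `reindex` step is an optional safeguard (an assumption, not something this notebook does) for the case where a category appears in one set but not the other.

```python
import pandas as pd

# Toy frames, invented for illustration
toy_train = pd.DataFrame({"HomePlanet": ["Earth", "Europa", "Mars", "Earth"]})
toy_test = pd.DataFrame({"HomePlanet": ["Earth", "Mars"]})  # no "Europa" here

# One new 0/1 column per category
train_ohe = pd.get_dummies(toy_train, columns=["HomePlanet"], dtype=int)
test_ohe = pd.get_dummies(toy_test, columns=["HomePlanet"], dtype=int)

# Align the test columns to the training columns;
# categories missing from the test set become all-zero columns
test_ohe = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)

print(train_ohe)
print(test_ohe)
```

Every row of the encoded frame contains exactly one 1, in the column that matches its original category.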

In [None]:
dataTrain = pd.get_dummies(dataTrain, columns=['HomePlanet', 'CryoSleep', 'Destination'])
dataTest = pd.get_dummies(dataTest, columns=['HomePlanet', 'CryoSleep', 'Destination'])
dataTrain.head()

<h1 align="center" ><a id='split'><b>Splitting data into x (Values) and y (labels) 🪓</b></a></h1>

In [None]:
yTrain = dataTrain.pop('Transported').to_numpy()
xTrain = dataTrain.to_numpy()
xTest = dataTest.to_numpy()
xTrain.shape, yTrain.shape, xTest.shape

<div style='font-size:200%;'>
    <a id='classify'></a>
    <h1 style='color: orange; font-weight: bold; font-family: Cascadia code;'>
        <center> Classifying 🔃 </center>
    </h1>
</div>

<h1 align="center" ><a id='build'><b>Building and fitting the models 🏗️</b></a></h1>

## **The K-nearest neighbours**

The K-nearest neighbours (KNN) classifier uses proximity to make classifications or predictions about individual data points. The technique can be used for both classification and regression, and the form of the output differs between the two. In classification, the decision is made by majority vote, i.e., the class assigned to a new data point is the one seen most frequently among its K nearest neighbours. KNN is also known as a lazy learner because no model is learned. Instead, the raw data is stored and used every time a prediction must be made.
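To make the majority-vote idea concrete, here is a minimal hand-written sketch on made-up 2-D points. The function `knn_predict` and the toy arrays are hypothetical names for illustration; scikit-learn's `KNeighborsClassifier` is more general and far more efficient.

```python
from collections import Counter

import numpy as np

def knn_predict(x_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest stored points."""
    distances = np.linalg.norm(x_train - x_new, axis=1)  # Euclidean distance to every stored point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())           # count the labels of those neighbours
    return votes.most_common(1)[0][0]

# Toy data, invented for illustration: one cluster near the origin, one near (5, 5)
x_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_near_origin = knn_predict(x_toy, y_toy, np.array([0.5, 0.5]))
pred_far = knn_predict(x_toy, y_toy, np.array([5.5, 5.5]))
print(pred_near_origin, pred_far)
```

Note that nothing is "trained" here: the stored points are consulted afresh for each prediction, which is what makes KNN a lazy learner.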

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knnClassifier = KNeighborsClassifier(3)
knnClassifier.fit(xTrain, yTrain)
knnClassifier.score(xTrain, yTrain)

## **Support Vector Classifier**

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used for Classification as well as Regression problems. However, primarily, it is used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to find the best decision boundary that separates n-dimensional space into classes, so that new data points can easily be put in the correct category in the future. This best decision boundary is called a hyperplane. It is determined by the extreme points closest to the boundary, which are called support vectors and give the algorithm its name.
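As a rough illustration, the sketch below fits a linear-kernel `SVC` on made-up, clearly separable 2-D points and prints the support vectors it selected. The toy arrays and the choice of `kernel="linear"` are assumptions for this example only; the notebook itself uses `SVC()` with its default settings.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data, invented for illustration: two well-separated groups
x_toy = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0],
                  [3.0, 3.0], [4.0, 4.0], [4.0, 3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(x_toy, y_toy)

# Only the points nearest the boundary are kept as support vectors
print(clf.support_vectors_)

preds = clf.predict(np.array([[0.5, 0.5], [3.5, 3.5]]))
print(preds)
```

Points far from the boundary do not influence the fitted hyperplane, so removing them would leave the classifier unchanged.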

In [None]:
from sklearn.svm import SVC

svClassifier = SVC()
svClassifier.fit(xTrain, yTrain)
svClassifier.score(xTrain, yTrain)

## **Random forest classifier**

The random forest classifier is an improvement over a single decision tree classifier. Based on ensemble learning, it trains a number of decision trees on various random subsets of the given dataset and combines their predictions (by averaging or majority vote) to improve predictive accuracy. In general, a greater number of trees makes the predictions more stable and helps reduce overfitting compared with a single tree, though the gains level off as more trees are added.
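The averaging step can be checked directly. The sketch below, on synthetic data from `make_classification` (an assumption, chosen so the example is self-contained), averages the class probabilities of the individual trees and compares the result with the forest's own output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, generated only for this illustration
x_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(x_demo, y_demo)

# Class probabilities from every individual tree: shape (n_trees, n_samples, n_classes)
tree_probas = np.stack([tree.predict_proba(x_demo) for tree in forest.estimators_])

mean_proba = tree_probas.mean(axis=0)          # average over the trees
forest_proba = forest.predict_proba(x_demo)    # what the forest reports
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]

print(np.allclose(mean_proba, forest_proba))
```

Each tree sees a different bootstrap sample, so individual trees disagree; the averaged vote is what smooths out their individual errors.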

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfClassifier = RandomForestClassifier()
rfClassifier.fit(xTrain, yTrain)
rfClassifier.score(xTrain, yTrain)

## **Naive Bayes classifier**

Naive Bayes makes the assumption that the features are independent of one another given the class. This means that we are still assuming class-specific covariance matrices (as in QDA), but because of the independence assumption those covariance matrices are diagonal.

So, given a training dataset of N input vectors x with corresponding target variables t, (Gaussian) Naive Bayes assumes that the class-conditional densities are normally distributed.
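A hand-written sketch of this model on synthetic data is shown below. The function `gaussian_nb_predict` and the toy arrays are hypothetical names for illustration. It fits one mean and one variance per feature and per class (the diagonal covariance), then picks the class with the highest log-posterior. scikit-learn's `GaussianNB` additionally applies a tiny variance smoothing, which should not matter on data this well separated.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic data, generated only for this illustration: two separated Gaussian blobs
rng = np.random.default_rng(0)
x_toy = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                   rng.normal(6.0, 1.0, size=(50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

def gaussian_nb_predict(x_train, y_train, x_new):
    """Gaussian Naive Bayes: per-class, per-feature normal densities."""
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        xc = x_train[y_train == c]
        mean, var = xc.mean(axis=0), xc.var(axis=0)   # diagonal covariance only
        log_prior = np.log(len(xc) / len(x_train))
        # Independence assumption: the joint log-density is a sum over features
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (x_new - mean) ** 2 / var).sum(axis=1)
        scores.append(log_prior + log_lik)
    return classes[np.argmax(np.vstack(scores), axis=0)]

manual_pred = gaussian_nb_predict(x_toy, y_toy, x_toy)
sklearn_pred = GaussianNB().fit(x_toy, y_toy).predict(x_toy)
print((manual_pred == sklearn_pred).mean())
```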

In [None]:
from sklearn.naive_bayes import GaussianNB

nbClassifier = GaussianNB()
nbClassifier.fit(xTrain, yTrain)
nbClassifier.score(xTrain, yTrain)

<h1 align="center" ><a id='anal'><b>Performance Analysis of the different models 📈</b></a></h1>

In [None]:
dataPerf = pd.DataFrame(data={'Model': ['SVM', 'RandomForest', 'Naive-Bayes','KNN'], 'Score': [svClassifier.score(xTrain, yTrain), rfClassifier.score(xTrain, yTrain), nbClassifier.score(xTrain, yTrain), knnClassifier.score(xTrain, yTrain)]})

plt.figure(figsize=(12, 8))
sns.barplot(x="Model", y="Score", data=dataPerf, palette="magma")
plt.title('Performance analysis of different classifiers')
plt.show()

----
#### Based on the training-set accuracy shown above, the RandomForestClassifier scores highest of the four models.
#### Hence we will use this model for our final submission.
----
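The scores compared above are computed on the same rows the models were fitted on. As a complementary check, one could estimate accuracy on held-out data with k-fold cross-validation. The sketch below uses synthetic data from `make_classification` so that it runs on its own; the names `x_demo`, `y_demo`, and the choice of 5 folds are assumptions for illustration, and in the notebook `xTrain` and `yTrain` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (xTrain, yTrain), generated only for this illustration
x_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Accuracy on the rows the model was fitted on
train_score = model.fit(x_demo, y_demo).score(x_demo, y_demo)

# Accuracy on held-out folds: each fold is scored by a model that never saw it
cv_scores = cross_val_score(model, x_demo, y_demo, cv=5)

print(train_score, cv_scores.mean())
```

A large gap between the two numbers suggests the model is memorising its training rows, so the cross-validated mean is usually the fairer basis for comparing classifiers.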

<div style='font-size:200%;'>
    <a id='submit'></a>
    <h1 style='color: lightskyblue; font-weight: bold; font-family: Cascadia code;'>
        <center> Submission ✅ </center>
    </h1>
</div>

In [None]:
# Build the submission frame directly; idCol holds the test PassengerId values saved earlier
submission = pd.DataFrame({
    "PassengerId": idCol,
    "Transported": rfClassifier.predict(xTest).astype(bool),
})
submission

In [None]:
submission.to_csv('submission.csv', index=False)
print('Submission successful!')

<div style='font-size:200%;'>
    <a id='import'></a>
    <h3 style='color: orange; font-weight: bold; font-family: Cascadia code;'>
        <center> Alright, that's it folks, Thank you for the visit! </center>
    </h3>
    <h1 style='color: skyblue; font-weight: bold; font-family: Cascadia code;'>
        <center> Happy Kaggling! </center>
    </h1>
    <center><img style='height: 40%; width: 40%' src="https://miro.medium.com/max/3150/2*tQb2DNhHAMPj6u3peTXOFQ.png"> </center>
</div>