# <center> **Kaggle’s Spaceship Titanic Competition**
# <center> **Overview**

![image.png](attachment:image.png)

# **Introduction**

In the year 2912 the interstellar Spaceship Titanic has collided with a spacetime anomaly. Some of the passengers were transported to an alternate dimension. In this analysis, I will use the records recovered from the spaceship’s computer system to predict which passengers were affected.

# **Goal**

My goal is to get a high score on the Spaceship Titanic competition.

# **Dataset**

**Train file (spaceship_titanic_train.csv)** — Contains personal records of the passengers that would be used to build the machine learning model.

**Test file (spaceship_titanic_test.csv)** — Contains personal records for the remaining passengers, but not the target variable. It will be used to see how well our model performs on unseen data.

**Sample Submission File (sample_submission.csv)** — Contains the format to submit predictions.

# **Technical Requirements**

1. Exploratory data analysis
2. Pre-Processing of the data
3. Application of various machine learning models to predict which passengers were transported
4. Clear explanations of findings
5. Final conclusions
6. Suggestions on how the analysis can be improved

# **Standards**

> **Standard 1:** My standard for an acceptable accuracy score is approximately 80%. <BR>
> **Standard 2:** I treat two features as collinear when the absolute value of their Pearson correlation coefficient is approximately 0.8 or higher. <BR>
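
Standard 2 can be checked mechanically. The following is a minimal sketch, assuming the threshold applies to the absolute Pearson coefficient between numeric columns; `high_correlation_pairs` and the toy frame are hypothetical names used only for illustration and are not part of the notebook's `functions` module.

```python
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.8) -> list:
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr(method="pearson")
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs


# Toy data (illustrative only): b is exactly 2 * a, c is only weakly related.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

print(high_correlation_pairs(toy))
```

Calling the same function on `train` would list any numeric feature pairs that meet the 0.8 standard.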

# **Biases**

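The main source of bias is missing data. Every feature except PassengerId and the target, Transported, is missing roughly 2% of its values. If those gaps fall mostly in different rows, up to roughly a quarter of the passenger records contain at least one missing value.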

# **Domain Knowledge**

I have no experience with space travel or alternate dimensions, so I may have overlooked important parts of the data and given weight to parts that have little significance.

# **Libraries**

In [2]:
import pandas as pd

import functions
import importlib
importlib.reload(functions)

<module 'functions' from 'c:\\Users\\Dell\\Documents\\AI\\Titanic\\Notebooks\\functions.py'>

# **Load Data**

In [3]:
train = pd.read_csv(
    r"C:\Users\Dell\Documents\AI\Titanic\Data\Data\train.csv",
    index_col=False
)

test = pd.read_csv(
    r"C:\Users\Dell\Documents\AI\Titanic\Data\Data\test.csv",
    index_col=False
)

# **Overview**

## **Dataset Features**

1. **PassengerId:** A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
2. **HomePlanet:** The planet the passenger departed from, typically their planet of permanent residence.
3. **CryoSleep:** Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
4. **Cabin:** The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
5. **Destination:** The planet the passenger will be debarking to.
6. **Age:** The age of the passenger.
7. **VIP:** Whether the passenger has paid for special VIP service during the voyage.
8. **RoomService:** Amount the passenger has billed for Room Service.
9. **FoodCourt:** Amount the passenger has billed for Food Court.
10. **ShoppingMall:** Amount the passenger has billed for Shopping Mall.
11. **Spa:** Amount the passenger has billed for Spa.
12. **VRDeck:** Amount the passenger has billed for VRDeck.
13. **Name:** The first and last names of the passenger.
14. **Transported:** Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
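
The `gggg_pp` and `deck/num/side` formats described above can be split into separate columns. This is a minimal sketch on toy rows, not the notebook's own preprocessing; the new column names (`Group`, `GroupPosition`, `Deck`, `CabinNum`, `Side`, `GroupSize`) are assumptions chosen for illustration.

```python
import pandas as pd

# Toy rows in the documented formats (values are illustrative, not taken from the dataset).
sample = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02"],
    "Cabin": ["B/0/P", "F/12/S", None],
})

# PassengerId -> group id (gggg) and position within the group (pp).
sample[["Group", "GroupPosition"]] = sample["PassengerId"].str.split("_", expand=True)

# Cabin -> deck / num / side; a missing cabin stays missing in all three parts.
sample[["Deck", "CabinNum", "Side"]] = sample["Cabin"].str.split("/", expand=True)

# Number of passengers sharing each group id.
sample["GroupSize"] = sample.groupby("Group")["PassengerId"].transform("count")

print(sample)
```

Splitting the identifiers this way turns two high-cardinality text columns into a handful of lower-cardinality parts that are easier to explore.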

## **Number of Rows and Columns**

In [4]:
print('Train Set Shape:', train.shape)
print('Test Set Shape:', test.shape)

Train Set Shape: (8693, 14)
Test Set Shape: (4277, 13)


### **Insights**

> * **Train Dataset:** There are 8,693 rows and 14 columns. The final column is "Transported", which indicates whether the passenger was transported to an alternate dimension.
> * **Test Dataset:** There are 4,277 rows and 13 columns. The "Transported" column is withheld because it is the value to predict.

## **Data Types**

In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


### **Insights**

> * **Categorical variables (object):** PassengerId, HomePlanet, Cabin, Destination, and Name. CryoSleep and VIP are also stored as object because they contain missing values, although their non-missing values are True or False.
> * **Float variables (float64):** The numerical variables in the train dataset are Age, RoomService, FoodCourt, ShoppingMall, Spa, and VRDeck.
> * **Boolean variables (bool):** Transported is the only column stored as bool. CryoSleep and VIP are Boolean in meaning and can be converted once their missing values are handled.
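
The reason CryoSleep and VIP are listed as object in `train.info()` can be reproduced on a small example. This sketch uses toy series rather than the dataset, and shows pandas' nullable `"boolean"` dtype as one possible way to keep the True/False meaning while the gaps are still present.

```python
import numpy as np
import pandas as pd

# A True/False column with a gap falls back to object dtype,
# while a complete True/False column is stored as bool.
with_gap = pd.Series([True, False, np.nan])
complete = pd.Series([True, False, True])

print(with_gap.dtype)   # object
print(complete.dtype)   # bool

# The nullable "boolean" dtype keeps True/False and marks the gap as <NA>.
as_nullable = with_gap.astype("boolean")
print(as_nullable.dtype)
```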

## **Numerical Features**

In [83]:
train.describe(include="number").map("{:,.2f}".format)

| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|
| count | 8514.0 | 8512.0 | 8510.0 | 8485.0 | 8510.0 | 8505.0 |
| mean | 28.83 | 224.69 | 458.08 | 173.73 | 311.14 | 304.85 |
| std | 14.49 | 666.72 | 1611.49 | 604.7 | 1136.71 | 1145.72 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 19.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 27.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 38.0 | 47.0 | 76.0 | 27.0 | 59.0 | 46.0 |
| max | 79.0 | 14327.0 | 29813.0 | 23492.0 | 22408.0 | 24133.0 |


## **Categorical and Boolean Features**

In [6]:
functions.UniqueValues(train)

Unique Values in PassengerId: 8693
Unique Values in HomePlanet: 3
Unique Values in CryoSleep: 2
Unique Values in Cabin: 6560
Unique Values in Destination: 3
Unique Values in VIP: 2
Unique Values in Name: 8473


In [9]:
print(train['HomePlanet'].unique())

['Europa' 'Earth' 'Mars' nan]


In [10]:
print(train['Destination'].unique())

['TRAPPIST-1e' 'PSO J318.5-22' '55 Cancri e' nan]


In [12]:
print(train['CryoSleep'].unique())

[False True nan]


In [13]:
print(train['VIP'].unique())

[False True nan]


### **Insights**

> * **HomePlanets:** Europa, Earth, Mars
> * **Destinations:** TRAPPIST-1e, PSO J318.5-22, 55 Cancri e
> * **CryoSleep:** True and False
> * **VIP:** True and False
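
The source of `functions.UniqueValues` is not shown in this notebook. As a rough sketch of how counts like the ones above could be produced, the hypothetical helper `count_unique` below counts distinct non-missing values in every column that is neither numeric nor bool; this is an assumption about the behaviour, not the actual implementation.

```python
import pandas as pd


def count_unique(df: pd.DataFrame) -> pd.Series:
    """Distinct non-missing values per column that is neither numeric nor bool."""
    categorical = df.select_dtypes(exclude=["number", "bool"])
    return categorical.nunique(dropna=True)


# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Mars", None, "Earth"],
    "VIP": [False, True, None, False],          # object dtype because of the gap
    "Age": [30.0, 41.0, 19.0, None],            # numeric, skipped
    "Transported": [True, False, True, True],   # bool, skipped
})

for column, n_unique in count_unique(toy).items():
    print(f"Unique Values in {column}: {n_unique}")
```

Because missing values are dropped before counting, a column such as VIP reports 2 unique values even though `unique()` also lists `nan`.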

## **Outliers**

In [85]:
functions.Outliers(train)

Age               77
RoomService     1861
FoodCourt       1823
ShoppingMall    1829
Spa             1788
VRDeck          1809
dtype: int64


### **Insights**

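> * **Outliers:** All six numerical features contain outliers. Each of the five spending features has roughly 1,800 flagged rows, which is about one in five passengers.
> * **Age:** Age has the fewest outliers (77).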
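
The rule used inside `functions.Outliers` is not shown here. One common choice is the 1.5 × IQR rule, sketched below under that assumption; `count_iqr_outliers` and the toy data are hypothetical and may not match the notebook's helper or reproduce its counts.

```python
import pandas as pd


def count_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons involving NaN are False, so missing values are never counted.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Toy data (illustrative only): one extreme age, and a spend column that is mostly zero.
toy = pd.DataFrame({
    "Age": [19, 27, 38, 25, 30, 22, 35, 28, 31, 79],
    "Spend": [0, 0, 0, 0, 0, 0, 0, 0, 10, 500],
})

print(count_iqr_outliers(toy))
```

In the toy `Spend` column both quartiles are 0, so the IQR is 0 and every non-zero value is flagged. Under this rule, a column dominated by zeros can therefore report many outliers.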

## **Missing Values**

In [86]:
missing_values = functions.MissingValues(train)
missing_values

| | NumberMissing | PercentageMissing |
|---|---|---|
| HomePlanet | 201 | 2.31 |
| CryoSleep | 217 | 2.5 |
| Cabin | 199 | 2.29 |
| Destination | 182 | 2.09 |
| Age | 179 | 2.06 |
| VIP | 203 | 2.34 |
| RoomService | 181 | 2.08 |
| FoodCourt | 183 | 2.11 |
| ShoppingMall | 208 | 2.39 |
| Spa | 183 | 2.11 |


In [87]:
missing_values = functions.MissingValues(test)
missing_values

| | NumberMissing | PercentageMissing |
|---|---|---|
| HomePlanet | 87 | 2.03 |
| CryoSleep | 93 | 2.17 |
| Cabin | 100 | 2.34 |
| Destination | 92 | 2.15 |
| Age | 91 | 2.13 |
| VIP | 93 | 2.17 |
| RoomService | 82 | 1.92 |
| FoodCourt | 106 | 2.48 |
| ShoppingMall | 98 | 2.29 |
| Spa | 101 | 2.36 |


### **Insights**

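> * **Percentage Overall:** Approximately 26% of the data is affected. The per-feature missing counts in the train set add up to about a quarter of the row count, so up to roughly that share of rows contains at least one missing value, even though only about 2% of individual values are missing.
> * **Percentage per Feature:** Every feature except PassengerId (and Transported in the train set) is missing roughly 2% of its values, ranging from 1.92% to 2.50% in the tables above.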
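
The per-feature figures and the overall figure measure different things, which a small example makes concrete. The sketch below is an assumption about how a summary like `functions.MissingValues` could be built; `missing_summary` and the toy frame are hypothetical, and the column names simply mirror the tables above.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values for each column that has any."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "NumberMissing": n_missing,
        "PercentageMissing": (100 * n_missing / len(df)).round(2),
    })
    return summary[summary["NumberMissing"] > 0]


# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01", "0004_01"],
    "HomePlanet": ["Earth", None, "Mars", "Europa"],
    "Age": [30.0, 41.0, None, 25.0],
})

summary = missing_summary(toy)

# Share of rows with at least one gap versus share of individual cells that are empty.
rows_with_gap = toy.isna().any(axis=1).mean() * 100
cells_missing = toy.isna().to_numpy().mean() * 100

print(summary)
print(f"Rows with a missing value: {rows_with_gap:.2f}%")
print(f"Cells missing: {cells_missing:.2f}%")
```

In the toy frame each affected column is 25% missing, half of the rows contain a gap, and about 17% of all cells are empty. The same distinction explains how 2% per feature can add up to a much larger share of affected rows.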

## **Duplicate Data**

In [88]:
functions.Duplicates(train)


Duplicates: 0, (0.0%)


In [89]:
functions.Duplicates(test)

Duplicates: 0, (0.0%)


### **Insights**

> * **Number of Duplicates**: There are no duplicates in the train or test datasets.
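
For reference, a duplicate check of this kind can be written with `DataFrame.duplicated`. The sketch below is an assumption about what `functions.Duplicates` reports, with the output string modelled on the cells above; `report_duplicates` is a hypothetical name.

```python
import pandas as pd


def report_duplicates(df: pd.DataFrame) -> str:
    """Count fully identical rows, excluding the first occurrence of each."""
    n_duplicates = int(df.duplicated().sum())
    percentage = 100 * n_duplicates / len(df)
    return f"Duplicates: {n_duplicates}, ({percentage:.1f}%)"


# Toy frame (illustrative only): the last row repeats the first.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Mars", "Europa", "Earth"],
    "Age": [30.0, 41.0, 19.0, 30.0],
})

print(report_duplicates(toy))
```

Since every PassengerId in the train set is unique, a check over whole rows cannot find duplicates there; passing `subset=` to `duplicated` would be one way to compare rows on selected columns only.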