### Labels

- PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.

- HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.

- CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

- Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.

- Destination - The planet the passenger will be debarking to.

- Age - The age of the passenger.

- VIP - Whether the passenger has paid for special VIP service during the voyage.

- RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.

- Name - The first and last names of the passenger.

- Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
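
The `gggg_pp` and `deck/num/side` formats described above can be split into their parts. Here is a minimal sketch in plain Python, assuming well-formed, non-missing strings. The helper names `parse_passenger_id` and `parse_cabin` and the sample values are illustrative assumptions, not part of the dataset or of this notebook.

```python
def parse_passenger_id(passenger_id):
    """Split 'gggg_pp' into (group, number within the group)."""
    group, number = passenger_id.split("_")
    return group, int(number)


def parse_cabin(cabin):
    """Split 'deck/num/side' into (deck, num, side)."""
    deck, num, side = cabin.split("/")
    return deck, int(num), side


# Sample values are made up for illustration.
print(parse_passenger_id("0003_02"))  # ('0003', 2)
print(parse_cabin("F/12/S"))          # ('F', 12, 'S')
```

Real rows can have a missing `Cabin`, so a production version would need to handle NaN before splitting.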

### Importing principal libraries


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Importing Data

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
index_test = test['PassengerId']

### 1- Checking the datasets

#### 1.1 - Missing Data

In [3]:
# Test: percentage of missing values per column
test.isnull().sum() / len(test) * 100

PassengerId     0.000000
HomePlanet      2.034136
CryoSleep       2.174421
Cabin           2.338087
Destination     2.151040
Age             2.127660
VIP             2.174421
RoomService     1.917232
FoodCourt       2.478373
ShoppingMall    2.291326
Spa             2.361468
VRDeck          1.870470
Name            2.197802
dtype: float64

In [4]:
# Train: percentage of missing values per column
train.isnull().sum() / len(train) * 100

PassengerId     0.000000
HomePlanet      2.312205
CryoSleep       2.496261
Cabin           2.289198
Destination     2.093639
Age             2.059128
VIP             2.335212
RoomService     2.082135
FoodCourt       2.105142
ShoppingMall    2.392730
Spa             2.105142
VRDeck          2.162660
Name            2.300702
Transported     0.000000
dtype: float64
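
The percentages above are the number of nulls in each column divided by the row count. An equivalent shortcut is to take the mean of the boolean null mask. The sketch below uses a made-up frame (`toy` is illustrative only, not taken from `train.csv`).

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration; not taken from train.csv.
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "VIP": [False, True, np.nan, np.nan],
})

# isnull() gives a True/False mask; the mean of a boolean column
# is the share of True values, i.e. the fraction of missing entries.
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```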

#### 1.2 - Data Types


In [5]:
# Test
test.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
dtype: object

In [6]:
# Train
train.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object
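
`CryoSleep` and `VIP` hold True/False values but are reported as `object`, while `Transported` is `bool`. A likely reason, given the missing values found in 1.1, is that a plain `bool` column cannot store NaN, so pandas falls back to `object`. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration.
complete = pd.Series([True, False, True])    # no gaps
with_gap = pd.Series([True, False, np.nan])  # one missing value

print(complete.dtype)  # bool
print(with_gap.dtype)  # object
```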

#### 1.3 - Deleting the variables we won't use in this case

In [7]:
train.drop(['PassengerId','Name','Cabin'],axis = 1,inplace = True)
test.drop(['PassengerId','Name','Cabin'],axis = 1,inplace = True)

### 2 - Missing Data Analysis (Understanding the Problem)
#### Asking some questions

#### 2.1 - Question 001 - Is there social inequality between the planets?


In [8]:
# Let's pick one measure of wealth to analyze: VIP status

vip_true = train[train['VIP'] == True]

print('\nPlanet With More VIP\n')
print(vip_true['HomePlanet'].value_counts())
print('\n')
print('Total Planets\n')
print(train['HomePlanet'].value_counts())


Planet With More VIP

Europa    131
Mars       63
Name: HomePlanet, dtype: int64


Total Planets

Earth     4602
Europa    2131
Mars      1759
Name: HomePlanet, dtype: int64
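
The raw counts above are easier to compare as rates. The sketch below copies the counts from the output and divides them. Earth does not appear in the VIP counts, so its VIP count is taken as 0; the dictionary names are illustrative.

```python
# Counts copied from the output above (rows with a known HomePlanet).
# Earth is absent from the VIP counts, so it is entered as 0.
vip_counts = {"Earth": 0, "Europa": 131, "Mars": 63}
totals = {"Earth": 4602, "Europa": 2131, "Mars": 1759}

# Share of VIP passengers per planet, in percent.
vip_rate = {planet: vip_counts[planet] / totals[planet] * 100 for planet in totals}

for planet, rate in vip_rate.items():
    print(f"{planet}: {rate:.1f}% VIP")
```

This gives roughly 6.1% for Europa, 3.6% for Mars and 0% for Earth.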


##### The answer is YES: every VIP passenger with a known home planet comes from Europa or Mars, none from Earth. Let's use this information to replace the missing data.


In [9]:
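# Target encoding: Earth = 0, Europa = 1, Mars = 2
# Missing HomePlanet values temporarily take the passenger's VIP flag (True/False);
# the flag is converted to a planet code by transform_data below.
train['HomePlanet'] = train['HomePlanet'].fillna(train['VIP'])
test['HomePlanet'] = test['HomePlanet'].fillna(test['VIP'])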

In [10]:
train['HomePlanet'].value_counts()

Earth     4602
Europa    2131
Mars      1759
False      193
True         5
Name: HomePlanet, dtype: int64

In [11]:
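def transform_data(value):
    """Encode HomePlanet; values filled from VIP map True -> 1 (Europa), False -> 0 (Earth)."""
    if value == 'Earth':
        return 0
    elif value == 'Europa':
        return 1
    elif value == 'Mars':
        return 2
    elif value == True:   # missing planet, VIP passenger: assume Europa
        return 1
    else:                 # VIP is False, or both fields are missing (NaN): assume Earth
        return 0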

In [12]:
train['HomePlanet'] = train['HomePlanet'].map(transform_data)
test['HomePlanet'] = test['HomePlanet'].map(transform_data)

In [13]:
train['HomePlanet'].value_counts()

0    4798
1    2136
2    1759
Name: HomePlanet, dtype: int64
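
The counts are consistent with the rule applied in the cells above: Earth grows from 4602 to 4798 (193 rows filled from `VIP == False` plus 3 rows that fell through to the final `else`), and Europa grows from 2131 to 2136 (5 rows filled from `VIP == True`). The same rule can be sketched more explicitly on a made-up frame. The names `toy`, `planet_codes`, `fallback` and `HomePlanet_code` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not taken from train.csv.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Mars", np.nan, np.nan, np.nan],
    "VIP": [False, True, False, True, False, np.nan],
})

planet_codes = {"Earth": 0, "Europa": 1, "Mars": 2}

# Known planets get their code; unknown planets stay NaN for now.
coded = toy["HomePlanet"].map(planet_codes)

# Fallback guess from VIP: True -> 1 (Europa), False or missing -> 0 (Earth).
fallback = toy["VIP"].eq(True).astype(int)

# Use the fallback only where the planet code is missing.
toy["HomePlanet_code"] = coded.fillna(fallback).astype(int)
print(toy["HomePlanet_code"].tolist())  # [0, 1, 2, 1, 0, 0]
```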

In [14]:
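# false = 0
# true = 1
# Note: HomePlanet now holds the codes 0/1/2, so a missing VIP on a Mars row (code 2) is filled with 2.
train['VIP'] = train['VIP'].fillna(train['HomePlanet'])
test['VIP'] = test['VIP'].fillna(test['HomePlanet'])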