## Hello, Kagglers!

You joined this competition because you are already tired of the Titanic competition and want to improve your skills further, right?

Me, too!

I made this notebook for those who are in the same situation as me.

Enjoy!

## What is the Task?

Here is the description from the "Overview" page of this competition.

> Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.<br><br>
The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.<br><br>
While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!<br><br>
To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.<br><br>
Help save them and change history!<br>

## What can we use from our data?

Here are the column descriptions in our data.

> - PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always. <br>
> - HomePlanet - The planet the passenger departed from, typically their planet of permanent residence. <br>
> - CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.<br>
> - Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard. <br>
> - Destination - The planet the passenger will be debarking to. <br>
> - Age - The age of the passenger. <br>
> - VIP - Whether the passenger has paid for special VIP service during the voyage. <br>
> - RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the  Spaceship Titanic's many luxury amenities. <br>
> - Name - The first and last names of the passenger. <br>
> - Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict. <br>
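
The `gggg_pp` and `deck/num/side` formats described above can be split into separate columns. Here is a minimal sketch on a few made-up rows. The new column names (`Group`, `NumInGroup`, `Deck`, `CabinNum`, `Side`, `GroupSize`) are my own choice for illustration and are not part of the dataset.

```python
import pandas as pd

# Made-up rows that follow the documented formats
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02"],
    "Cabin": ["B/0/P", "A/0/S", None],
})

# PassengerId: gggg_pp -> group number and number within the group
toy[["Group", "NumInGroup"]] = (
    toy["PassengerId"].str.split("_", expand=True).astype(int)
)

# Cabin: deck/num/side; a missing cabin stays missing in all three parts
toy[["Deck", "CabinNum", "Side"]] = toy["Cabin"].str.split("/", expand=True)

# Passengers sharing a Group value travel together
toy["GroupSize"] = toy.groupby("Group")["PassengerId"].transform("count")
print(toy)
```

Features like these may or may not help the model, so it is worth checking them against a validation score.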

## So, what should we expect?

In this competition, you are supposed to predict which passengers were transported by the anomaly, using records recovered from the spaceship’s damaged computer system.

Your model may help rescue a lot of people.

Getting excited, right?😄

Let's get started!

# Loading the Necessary Libraries

In [None]:
# these libraries are for data manipulation
import pandas as pd
import numpy as np

# these libraries are for visualization
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

# this library is for the model
from xgboost import XGBClassifier

import warnings
warnings.simplefilter('ignore')

# Loading the data

First, we load the training data, the test data, and the sample submission file, then look at the first rows of each.

In [None]:
# Load the training data, the test data, and the sample submission file

train_data = pd.read_csv('../input/spaceship-titanic/train.csv')
test_data = pd.read_csv('../input/spaceship-titanic/test.csv')
sub = pd.read_csv('../input/spaceship-titanic/sample_submission.csv')

In [None]:
# Displaying first rows of the training data

train_data.head()

In [None]:
test_data.head()

In [None]:
sub.head()

# Exploring Data

Let's look at the data more closely!

In [None]:
# info() shows each column's name, non-null count, and dtype
train_data.info()

In [None]:
# describe() shows summary statistics such as count and mean for the numeric columns
train_data.describe()

In [None]:
test_data.info()

In [None]:
test_data.describe()

# Are there any Nulls in our data?

It is very important to check whether the data has missing values (pandas shows them as `NaN`, and they are often called "nulls").

Let's check it out!

In [None]:
# isnull() marks every missing cell as True
# sum() then counts the missing cells in each column
train_data.isnull().sum()

In [None]:
test_data.isnull().sum()

Of course there are some missing values...

We will handle this problem later.
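
Raw counts are easier to judge as a share of all rows. Here is a small sketch on a made-up frame. The helper name `missing_summary` is hypothetical, and the same call should work on `train_data` and `test_data`.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    summary = pd.DataFrame({
        "missing": df.isnull().sum(),
        "percent": df.isnull().mean() * 100,
    })
    # Show the columns with the most missing values first
    return summary.sort_values("missing", ascending=False)

# Made-up rows, only for illustration
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, None],
    "VIP": [False, False, True, False],
})
result = missing_summary(toy)
print(result)
```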

# Let's visualize the data!

Visualizing the data is also important because it can reveal trends or relationships in the data.

In [None]:
ax = sns.countplot(x='Transported', data=train_data)
ax.set_title('Transported Counts');

In [None]:
plt.rc('font', size=13) # Set font size

ax = sns.countplot(x='HomePlanet', data=train_data)
ax.set_title('HomePlanet Counts');

In [None]:
ax = sns.countplot(x='CryoSleep', data=train_data) 
ax.set_title('CryoSleep Counts');

In [None]:
ax = sns.countplot(x='Destination', data=train_data)
ax.set_title('Destination Counts');

In [None]:
ax = sns.countplot(x='VIP', data=train_data)
ax.set_title('VIP Counts');

In [None]:
plt.figure(figsize=(8, 4)) # Set figure size

ax = sns.histplot(x='Age', data=train_data)
ax.set_title('Age Counts');
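
The count plots above show each column on its own. To see how a feature relates to the target, one option is to compare the share of transported passengers per category. This is a sketch on made-up rows; the values are invented for illustration and are not results from the real data.

```python
import pandas as pd

# Made-up rows, only for illustration
toy = pd.DataFrame({
    "HomePlanet":  ["Earth", "Earth", "Europa", "Europa", "Europa", "Earth"],
    "Transported": [False, True, True, True, False, False],
})

# The mean of a boolean column is the share of True values in each group
rate = toy.groupby("HomePlanet")["Transported"].mean()
print(rate)

# With seaborn, a similar comparison could be drawn on the real data as:
# sns.countplot(x='HomePlanet', hue='Transported', data=train_data)
```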

# Let's prepare the data for prediction!

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

In [None]:
train_df = pd.read_csv("../input/spaceship-titanic/train.csv")
test_df = pd.read_csv("../input/spaceship-titanic/test.csv")

In [None]:
# Keep PassengerId for the submission file
test_df_ID = test_df['PassengerId']

In [None]:
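# Clean the data by filling the missing values
imputer_cols = ["Age", "FoodCourt", "ShoppingMall", "Spa", "VRDeck", "RoomService"]
imputer = SimpleImputer(strategy="median")
# The medians are learned from the training data only and reused for the test data
imputer.fit(train_df[imputer_cols])
train_df[imputer_cols] = imputer.transform(train_df[imputer_cols])
test_df[imputer_cols] = imputer.transform(test_df[imputer_cols])
# Missing HomePlanet values get the placeholder category 'Z'
# (assigning the result back avoids the chained inplace fillna, which newer pandas warns about)
train_df["HomePlanet"] = train_df["HomePlanet"].fillna('Z')
test_df["HomePlanet"] = test_df["HomePlanet"].fillna('Z')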

In [None]:
label_cols = ["HomePlanet", "CryoSleep", "Cabin", "Destination", "VIP"]
def label_encoder(train, test, columns):
    for col in columns:
        train[col] = train[col].astype(str)
        test[col] = test[col].astype(str)
        # Fit one encoder on the train and test values together so that
        # the same category gets the same integer in both frames
        encoder = LabelEncoder().fit(pd.concat([train[col], test[col]]))
        train[col] = encoder.transform(train[col])
        test[col] = encoder.transform(test[col])
    return train, test

train_df, test_df = label_encoder(train_df, test_df, label_cols)

In [None]:
X_train = train_df.drop(["Transported", "Name"], axis=1)
y_train = train_df["Transported"]
X_test = test_df.drop("Name", axis=1)

In [None]:
X_train.dtypes

OK, PassengerId still has the object dtype.

We should convert it to int.

In [None]:
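# PassengerId has the form gggg_pp, so dropping the underscore gives an integer
# that means the same thing in train and test (two separate LabelEncoders would not)
X_train["PassengerId"] = X_train["PassengerId"].str.replace("_", "").astype(int)
X_test["PassengerId"] = X_test["PassengerId"].str.replace("_", "").astype(int)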

In [None]:
X_train.dtypes

Well done! You can now predict which passengers were transported.

# Let's build a model and predict the transported passengers!

In [None]:
import xgboost as xgb
my_model = xgb.XGBClassifier()
my_model.fit(X_train, y_train)
   
# Predicting the Test set results
y_pred = my_model.predict(X_test)

In [None]:
y_pred
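
The test set has no labels, so the predictions above cannot be scored locally. `train_test_split` and `accuracy_score` were imported earlier but not used; the sketch below shows one way they could give a rough accuracy estimate before submitting. So that it runs without the competition files, it uses synthetic data and scikit-learn's `DecisionTreeClassifier` as a stand-in model. The helper name `holdout_accuracy` is hypothetical. On Kaggle you would pass `xgb.XGBClassifier()`, `X_train`, and `y_train` instead.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def holdout_accuracy(model, X, y, valid_size=0.2, seed=0):
    """Fit on one part of the labeled data and score on the held-out rest."""
    X_fit, X_valid, y_fit, y_valid = train_test_split(
        X, y, test_size=valid_size, random_state=seed, stratify=y
    )
    model.fit(X_fit, y_fit)
    return accuracy_score(y_valid, model.predict(X_valid))

# Synthetic stand-in data: the label depends only on the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] > 0

acc = holdout_accuracy(DecisionTreeClassifier(random_state=0), X_demo, y_demo)
print(f"validation accuracy: {acc:.3f}")
```

A single hold-out split is noisy, so treat the number as a rough guide rather than an exact leaderboard prediction.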

In [None]:
submission = pd.DataFrame(
    {'PassengerId': test_df_ID,
     # the sample submission uses True/False, so cast the predictions to bool
     'Transported': y_pred.astype(bool)}, columns=['PassengerId', 'Transported'])

In [None]:
submission.to_csv("submission.csv",index=False)

In [None]:
submission.head()
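
Before uploading, it can help to confirm that the file has the same layout as `sample_submission.csv`. This is a sketch on made-up frames. The helper name `check_submission` is hypothetical; on Kaggle it would be called with `submission` and `sub`.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, sample: pd.DataFrame) -> None:
    """Raise AssertionError if the submission does not match the sample layout."""
    assert list(submission.columns) == list(sample.columns), "column names differ"
    assert len(submission) == len(sample), "row count differs"
    assert submission["PassengerId"].equals(sample["PassengerId"]), "ids differ"
    assert submission["Transported"].dtype == bool, "Transported must be True/False"

# Made-up frames standing in for the sample submission and our predictions
sample_toy = pd.DataFrame({"PassengerId": ["0013_01", "0018_01"],
                           "Transported": [False, False]})
# Predictions may arrive as 0/1 integers; cast them to bool first
submission_toy = pd.DataFrame({"PassengerId": ["0013_01", "0018_01"],
                               "Transported": pd.Series([1, 0]).astype(bool)})

check_submission(submission_toy, sample_toy)
print("submission layout looks fine")
```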

# That's it!

This is a very simple baseline. Keep improving it to save all of the passengers!