- [Overview](#overview)
    - [Etiquette](#etiquette)
    - [Titanic Sinks](#titanic_sanks)
    - [Timeline](#timeline)
    
- [Step 1. Setup](#setup)
- [Step 2. Data Import](#data_import)
    - [(1) Check Dataset](#check_dataset)
        - [A. Definition of Variables](#definition_variables)
    - [(2) Duplicated Rows?](#duplicated)
    - [(3) Missing Values Visualisation](#missing_value_visuslisation)
    
- [Step 3. EDA with Variables](#eda_with_variables)
    - [(1) Dependent Variable - Survived](#dependent_variable)
    - [(2) Independent Variable](#independent_variable)
        - [A. Survived ~ Pclass](#survived_pclass)
        - [B. Helper Graph](#helper_graph)
        - [C. Survived ~ Sex](#survived_sex)
        - [D. Survived ~ SibSp](#survived_sibsp)
        - [E. Survived ~ Parch](#survived_parch)
        - [F. Survived ~ Embarked](#survived_embarked)
        - [G. Embarked ~ Pclass](#embarked_pclass)
        - [H. Survived ~ Cabin](#survived_cabin)
        - [I. Survived ~ Fare](#survived_fare)
    - [(3) Summary](#summary)
    
- [Step 4. Machine Learning](#machine_learning)
> To learn how to create a table of contents in Kaggle Notebooks, see the article [Create Table of Contents in a Notebook](https://www.kaggle.com/dcstang/create-table-of-contents-in-a-notebook) by David Tang.

<a id="overview"></a>
## Overview
- This is a personal tutorial that I share with my students as an example. 
- It covers the whole process, from EDA to modeling and evaluation, and finally submission. 
- The well-known public notebooks referenced here should be enough for students to learn Kaggle at an entry level. 

> Happy to Code

<a id='etiquette'></a>
### Etiquette
- When you take code or ideas from other notebooks, please leave a reference and upvote the original as well. 👆👆👆

<a id='titanic_sanks'></a>
### Titanic Sinks
- A sad tragedy, but also a story of humanity.
- Have you watched this movie? 

![](https://kathafmcki.files.wordpress.com/2015/11/titanic_movie-hd.jpg?w=700)


- Yes, we will work with Titanic data in this tutorial. 

<a id='timeline'></a>
### Timeline
- This code is from Subin An's [[TPS-Apr] Highlighting the Data](https://www.kaggle.com/subinium/tps-apr-highlighting-the-data). Thank you so much for your beautiful visualisation notebooks. (Upvoted).  
- For students, it is worth analyzing the code below line by line, because it uses many features of Matplotlib. 
    + Always remember this picture, taken from https://matplotlib.org/stable/gallery/showcase/anatomy.html
![](https://matplotlib.org/stable/_images/sphx_glr_anatomy_001.png)

In [None]:
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


tl_dates = [
    "WED April 10",
    "SUN April 14",
    "MON April 15",
    "THU April 18"
]

tl_x = [1, 2, 6, 9]

tl_sub_x = [1.5, 2.4, 2.9, 3.4, 3.8, 4.5, 5.0, 6.5, 7, 7.6, 8]
tl_sub_times = [
    "1:30 PM",
    "9:00 AM",
    "1:42 PM",
    "7:15 PM",
    "10:00 PM",
    "11:30 PM",
    "11:40 PM",
    "12:20 AM",
    "12:45 AM",
    "2:00 AM",
    "2:20 AM",
]

tl_text = [
    "Titanic sets sail.",
    "Receive Message.",
    "Baltic Warns Titanic\nof icebergs.", 
    "Smith requests the\n return of the message.",
    "Second Officer\n Lightoller is\n relieved from duty.",
    "Warning bells, iceberg\n sighting.",
    "Titanic hits an iceberg.",
    "Life boats are being\n lowered.",
    "Passengers slowly arrive\n on deck.",
    "Rear of boat begins to\n raise.",
    "Titanic sinks."
]

# Set figure & Axes
fig, ax = plt.subplots(figsize=(15, 5), constrained_layout=True)
ax.set_ylim(-2, 2)
ax.set_xlim(0, 10)


# Timeline : line
ax.axhline(0, xmin=0.1, xmax=0.95, c='#4a4a4a', zorder=1)
# Timeline : Date Points
ax.scatter(tl_x, np.zeros(len(tl_x)), s=120, c='#4a4a4a', zorder=2)
ax.scatter(tl_x, np.zeros(len(tl_x)), s=30, c='#fafafa', zorder=3)
# Timeline : Time Points
ax.scatter(tl_sub_x, np.zeros(len(tl_sub_x)), s=50, c='#4a4a4a',zorder=4)

# Date Text
for x, date in zip(tl_x, tl_dates):
    ax.text(x, -0.2, date, ha='center', 
            fontfamily='serif', fontweight='bold',
            color='#4a4a4a')
    

# Stemplot : vertical line
levels = np.zeros(len(tl_sub_x))    
levels[::2] = 0.3
levels[1::2] = -0.3
# use_line_collection was removed in Matplotlib 3.8; stems are drawn as a LineCollection by default since 3.3
markerline, stemline, baseline = ax.stem(tl_sub_x, levels)
plt.setp(baseline, zorder=0)
plt.setp(markerline, marker=',', color='#4a4a4a')
plt.setp(stemline, color='#4a4a4a')

# Text
for idx, x, time, txt in zip(range(1, len(tl_sub_x)+1), tl_sub_x, tl_sub_times, tl_text):
    ax.text(x, 1.3*(idx%2)-0.5, time, ha='center', 
            fontfamily='serif', fontweight='bold',
            color='#4a4a4a' if idx!=len(tl_sub_x) else '#e3120b', fontsize=11)
    
    ax.text(x, 1.3*(idx%2)-0.6, txt, va='top', ha='center', 
        fontfamily='serif',color='#4a4a4a' if idx!=len(tl_sub_x) else '#e3120b')

# Spine
for spine in ["left", "top", "right", "bottom"]:
    ax.spines[spine].set_visible(False)

# Ticks    
ax.set_xticks([]) 
ax.set_yticks([]) 

# Title
ax.set_title("Titanic Timeline", fontweight="bold", fontfamily='serif', fontsize=16, color='#4a4a4a')

plt.show()

- The ship, Titanic, sank in the very early morning hours of 15 April 1912. 
- Most passengers would have been asleep at that time, which suggests that many families were together when it happened. 

<a id='setup'></a>
## Step 1. Setup
- Let's set up the basic libraries for Exploratory Data Analysis.

In [None]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns  # same alias as the plotting cells below
import os 

print("Version Pandas", pd.__version__)
print("Version Matplotlib", matplotlib.__version__)
print("Version Numpy", np.__version__)
print("Version Seaborn", sns.__version__)

<a id='data_import'></a>
## Step 2. Data Import
- Three files are available to load: sample_submission, train, and test.

In [None]:
os.listdir('../input/tabular-playground-series-apr-2021/')

- Let's check datasets

In [None]:
train = pd.read_csv('../input/tabular-playground-series-apr-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-apr-2021/test.csv')
sample_submission = pd.read_csv('../input/tabular-playground-series-apr-2021/sample_submission.csv')

<a id='check_dataset'></a>
### (1) Check Dataset

- Both train and test have 100,000 rows, which is interesting. 
- The test set has one column fewer than train. Let's see which one it is. 

In [None]:
train.shape, test.shape, sample_submission.shape

- The only difference is the column `Survived`.
- Some columns have missing values. They need to be handled in the Feature Engineering section.

In [None]:
train.info()

In [None]:
train.head()

In [None]:
test.info()

<a id='definition_variables'></a>
#### A. Definition of Variables
- Let's check [here](https://www.kaggle.com/c/tabular-playground-series-apr-2021/data)


<a id='duplicated'></a>
### (2) Duplicated Rows?
- Let's see what duplicated rows look like. 

In [None]:
temp = pd.DataFrame({"id": [1, 1, 2, 3, 4], 
                     "values":["a", "a", "b", "c", "c"]})
temp

- The 1st and 2nd rows are the same. 
- So duplicated rows need to be removed if there are any. 
    + Step 1. Check whether duplicated rows exist, and remove them. 

In [None]:
temp.duplicated().sum()

In [None]:
temp.drop_duplicates(inplace=True)

In [None]:
temp.reset_index(drop=True)

- If train or test has duplicated rows, remove them. 

In [None]:
train.duplicated().sum()

In [None]:
test.duplicated().sum()

- Okay. Very good. Let's move on to the Missing Values section. 
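
- The check-and-remove steps above can be wrapped in one small function. This is only a sketch: the name `drop_duplicate_rows` is my own and does not appear elsewhere in this notebook, and the toy table is the same one used above.

```python
import pandas as pd


def drop_duplicate_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Report how many fully duplicated rows exist and return a copy without them."""
    n_dup = int(df.duplicated().sum())
    print(f"duplicated rows: {n_dup}")
    # keep the first occurrence of each row and renumber the index from 0
    return df.drop_duplicates().reset_index(drop=True)


# Toy data: the 1st and 2nd rows are identical
toy = pd.DataFrame({"id": [1, 1, 2, 3, 4],
                    "values": ["a", "a", "b", "c", "c"]})
clean = drop_duplicate_rows(toy)
print(clean)
```

- Returning a new DataFrame instead of using `inplace=True` keeps the original table untouched, which is handy when you want to compare before and after.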

<a id='missing_value_visuslisation'></a>
### (3) Missing Values Visualisation


- First, extract the columns that have any missing values. 
- Second, check both the train and test datasets. 
- Third, compare the missing values between train and test in one visualisation.

In [None]:
def check_na(data):
    """Summarise missing values per column (only columns that have any)."""
    # number of missing values, keeping only columns with at least one
    isnull_total = data.isnull().sum()
    data_total = isnull_total[isnull_total > 0].sort_values(ascending=False)
    # the same counts expressed as a percentage of all rows
    isnull_ratio = isnull_total / len(data) * 100
    data_ratio = isnull_ratio[isnull_ratio > 0].sort_values(ascending=False)

    missing_data = pd.DataFrame({'Missing Ratio': data_ratio,
                                 'Missing Total': data_total,
                                 'Data Type': data.dtypes[data_ratio.index]})

    return missing_data

In [None]:
check_na(train)

In [None]:
check_na(test)

In [None]:
sns.set_style("white")
sns.set_context("talk")

# dataset
miss_df_train = check_na(train)
# align test with train's column order so each pair of bars shares one x tick
miss_df_test = check_na(test).reindex(miss_df_train.index)

# x-axis index
x = np.arange(0, len(miss_df_train.index))
ratio_text = np.round(miss_df_train['Missing Ratio'].tolist() + miss_df_test['Missing Ratio'].tolist(), 1)

fig, ax = plt.subplots(figsize=(16, 10), facecolor="w")

# draw basic two graphs
ax.bar(x - 0.15, miss_df_train['Missing Total'], color='b', width = 0.3)
ax.bar(x + 0.15, miss_df_test['Missing Total'], color = "k", width = 0.3)

# Text
for i, p in enumerate(ax.patches):
  h = p.get_height()
  if i <= 4:
    ax.text(i-0.15, h + 1000, str(ratio_text[i]) + "%", ha = "center")
  else:
    ax.text(i-4.85, h + 1000, str(ratio_text[i]) + "%", ha = "center")

# X axis 
plt.xticks(x, miss_df_train.index)

# add grid
ax.grid(axis="y")

# delete some spines
for s in ["left", "right", "top"]:
    ax.spines[s].set_visible(False)

# legend
colors = {'train':'blue', 'test':'black'}         
labels = list(colors.keys())
handles = [plt.Rectangle((0,0),1,1, color=colors[label]) for label in labels]
plt.legend(handles, labels, bbox_to_anchor = (0.9, 0.9))

# shade the columns that have a relatively small share of missing values
ax.axvspan(0.3, 4.3, fc="gray", alpha=0.2)
ax.text(1.5, 40000, "Relatively Small Portion of Missing Values\n", color="k", fontdict={"size":20})

# title
plt.title("Missing Values in Each Column", fontsize=30)

fig.tight_layout()
plt.show()

- What does this graph explain? 
    + Cabin is the room where a passenger stayed on the Titanic. Since passengers of many different classes were on the ship together, it might be important for classifying who survived. 
    + Unfortunately, it has many missing values. Should we remove it or not? If not, how should the missing values be filled in? This could be the main issue for Kagglers.
    + The missing values in the other columns are relatively few, so they can easily be replaced with the most frequent value or the median of each column.
    
- If you want to deal with this now, read the article [Basic Feature Engineering with the Titanic Data](https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/)
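
- As a rough sketch of the "most frequent value or median" idea above: numeric columns are filled with their median and the other columns with their most frequent value. The toy table below is made up for illustration and is not the competition data.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (made-up numbers)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Embarked": ["S", "C", None, "S"],
})

filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # numeric column: fill with the median of the observed values
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        # non-numeric column: fill with the most frequent value
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```

- This simple rule is probably not a good fit for Cabin, where most values are missing, so that column needs separate treatment.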

> Happy to Code. 

<a id='eda_with_variables'></a>
## Step 3. EDA with Variables
- When analyzing data, split the variables into two groups: independent variables and the dependent variable. 
- The train dataset has 12 variables: 11 independent variables and 1 dependent variable.
- It is easier to start with the dependent variable, the column `Survived`.


<a id="dependent_variable"></a>
### (1) Dependent Variable - Survived
- Let's count the frequency of each value. 

In [None]:
train['Survived'].value_counts()

- The value `0` means the passenger did not survive, whereas `1` means the passenger survived. 
- Bare numbers can be confusing, so let's replace them with descriptive labels. 
- Tip: when visualising, work on a copy of the data and leave the original untouched for feature engineering.

In [None]:
# Copy Data
train_viz = train.copy()
train_viz.info()

In [None]:
# relabel only the Survived column; running this cell twice is harmless
train_viz['Survived'] = train_viz['Survived'].replace({0: 'Not Survived', 1: 'Survived'})
train_viz['Survived'].value_counts()
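
- Counts can also be read as percentages with `value_counts(normalize=True)`. A small sketch on made-up labels (not the real data):

```python
import pandas as pd

# Toy labels: 3 of 5 passengers did not survive (made-up data)
toy = pd.Series(
    ["Not Survived", "Survived", "Not Survived", "Survived", "Not Survived"],
    name="Survived",
)

# normalize=True returns proportions; multiply by 100 for percentages
share = (toy.value_counts(normalize=True) * 100).round(1)
print(share)
```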

<a id='independent_variable'></a>
### (2) Independent Variable
- The split is 57.2% Not Survived and 42.8% Survived. 
- There is little value in plotting the dependent variable alone, because it gives no insight by itself. 
- Now it's time to look at the independent variables in relation to the column Survived, with a simple question: is this variable helpful for classification? 
    + PassengerId: not at all.
    + Pclass: yes. 
    + Name: maybe not at this time. 
    + Sex: yes.
    + Age: yes.
    + SibSp: yes, related to family.
    + Parch: yes, related to family.
    + Ticket: maybe not at this time, because it has 75331 distinct values. 
    + Fare: it implies social class, so it can be helpful.
    + Cabin: it has many distinct values, but the letter part of each value needs to be considered.
        * If you want to know more, check the [Discussion](https://www.kaggle.com/c/tabular-playground-series-apr-2021/discussion?search=Cabin). In general, this kind of variable is always related to feature engineering. 
    + Embarked: each value is the port where the passenger boarded. So, it might be helpful. 

In [None]:
train_viz['Ticket'].value_counts()

In [None]:
train_viz['Cabin'].value_counts()

<a id='survived_pclass'></a>
#### A. Survived ~ Pclass
- Step 1: convert Pclass from int64 to string (object).
- Step 2: create a new table, a so-called pivot table, using crosstab( ).
- Step 3: draw a bar graph based on the new table. 
    + I strongly recommend reading this book on visualisation. It gives many suggestions in the form of `rules`, `rules`, and `rules`, always keeping in mind the readers who will read the graphs we draw here.  
    + Book: [Fundamentals of Data Visualization](https://clauswilke.com/dataviz/)
    
![](https://clauswilke.com/dataviz/cover.png)

- Let's change the Dtype of "Pclass" to object

In [None]:
train_viz['Pclass'] = train_viz['Pclass'].astype('object')
train_viz.info()

- Let's make pivot table using crosstab( ). 

In [None]:
crosstab_df = pd.crosstab(train_viz['Survived'], 
                          train_viz['Pclass'],margins = False)

crosstab_df

- Get the counts for each row with .loc['index'], replacing 'index' with the row label.

In [None]:
survived_df = crosstab_df.loc['Survived']
non_survived_df = crosstab_df.loc['Not Survived']
print("Survived values:", survived_df.values)
print("Not Survived values", non_survived_df.values)

- When visualizing, the main idea is to compare the Survived and the Not Survived groups for each independent variable.
- Each column has a distinctive value that helps classify the dependent variable. 
    + Thus, the largest value in each group is drawn in a strong color, while the smaller, less important values are drawn in muted colors so that the representative value stands out. 
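
- Besides raw counts, the same comparison can be made with rates. The sketch below uses `normalize="columns"` in `pd.crosstab` so that each Pclass column sums to 1. The toy table is made up and much smaller than the real data.

```python
import pandas as pd

# Toy data (made-up): six passengers in three classes
toy = pd.DataFrame({
    "Survived": ["Survived", "Not Survived", "Survived",
                 "Not Survived", "Not Survived", "Survived"],
    "Pclass": [1, 1, 1, 2, 3, 3],
})

# normalize="columns" turns counts into a share within each Pclass
rate = pd.crosstab(toy["Survived"], toy["Pclass"], normalize="columns")
print(rate)
```

- Rates are easier to compare than counts when the groups have very different sizes.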

In [None]:
survived_df = crosstab_df.loc['Survived']
non_survived_df = crosstab_df.loc['Not Survived']
survived_max_val = survived_df.values.max()
non_survived_max_val = non_survived_df.values.max()

# x-axis index
x = np.arange(0, len(survived_df.index))

fig, ax = plt.subplots(figsize=(16, 10), facecolor="w")

# draw basic two graphs
ax.bar(x - 0.15, survived_df.values, 
       color=['#0095FF' if survived_df.values[idx] == survived_max_val else "#AECFE6" for idx in range(0, len(survived_df))], 
       width = 0.3)
ax.bar(x + 0.15, non_survived_df.values, 
       color = ['#FF6123' if non_survived_df.values[idx] == non_survived_max_val else "#E6C0B1" for idx in range(0, len(non_survived_df))], 
       width = 0.3)

# Text
for i, p in enumerate(ax.patches):
  h = p.get_height()
  fontweight = "normal"
  if i <= 2:
    ax.text(i-0.15, h + 1000, h, ha = "center", 
            fontsize=16 if h == survived_max_val else 12, 
            fontweight = "bold" if h == survived_max_val else fontweight, 
            color = "#0095FF" if h == survived_max_val else "#AECFE6"
           )
  else:
    ax.text(i-2.85, h + 1000, h, ha = "center", 
            fontsize=16 if h == non_survived_max_val else 12, 
            fontweight = "bold" if h == non_survived_max_val else fontweight, 
            color = "#FF6123" if h == non_survived_max_val else "#E6C0B1"
           )

# X axis 
plt.xticks(x, survived_df.index)
ax.set_xlabel("Pclass")

# add grid
ax.grid(axis="y")

# delete some spines
for s in ["left", "right", "top"]:
    ax.spines[s].set_visible(False)

# legend
colors = {'Survived':'#0095FF', 'Not Survived':'#FF6123'}         
labels = list(colors.keys())
handles = [plt.Rectangle((0,0),1,1, color=colors[label]) for label in labels]
plt.legend(handles, labels, bbox_to_anchor = (0.25, 0.9))

plt.show()

<a id="helper_graph"></a>
#### B. Helper Graph
- The same code above can be re-used, since many columns can be treated as object.
    + Sex, SibSp, Parch, Cabin, Embarked, and even other columns. 
    
- So, let's create a helper_graph function.

In [None]:
def helper_graph(data, 
                 dependent_variable = "Survived", 
                 independent_variable="Pclass", 
                 val_1 = "Survived", 
                 val_2 = "Not Survived", 
                 val_1_colors = ["#0095FF", "#AECFE6"], 
                 val_2_colors = ["#FF6123", "#E6C0B1"]
                ):
    # build the table from the `data` argument, not from the global train_viz
    crosstab_df = pd.crosstab(data[dependent_variable], 
                              data[independent_variable],margins = False)
    
    # data transforming
    val_1_df = crosstab_df.loc[val_1]
    val_2_df = crosstab_df.loc[val_2]
    
    val_1_max_val = val_1_df.values.max()
    val_2_max_val = val_2_df.values.max()

    # x-axis index
    x = np.arange(0, len(val_1_df.index))

    fig, ax = plt.subplots(figsize=(16, 10), facecolor="w")

    # draw basic two graphs
    ax.bar(x - 0.15, val_1_df.values, 
           color=[val_1_colors[0] if val_1_df.values[idx] == val_1_max_val else val_1_colors[1] for idx in range(0, len(val_1_df))], 
           width = 0.3)
    ax.bar(x + 0.15, val_2_df.values, 
           color = [val_2_colors[0] if val_2_df.values[idx] == val_2_max_val else val_2_colors[1] for idx in range(0, len(val_2_df))], 
           width = 0.3)

    # Text
    for i, p in enumerate(ax.patches):
      h = p.get_height()
      fontweight = "normal"
      if i <= len(val_1_df)-1:
        ax.text(i-0.15, h + 1000, h, ha = "center", 
                fontsize=16 if h == val_1_max_val else 12, 
                fontweight = "bold" if h == val_1_max_val else fontweight, 
                color = val_1_colors[0] if h == val_1_max_val else val_1_colors[1]
               )
      else:
        ax.text(i-(len(val_1_df) - 0.15), h + 1000, h, ha = "center", 
                fontsize=16 if h == val_2_max_val else 12, 
                fontweight = "bold" if h == val_2_max_val else fontweight, 
                color = val_2_colors[0] if h == val_2_max_val else val_2_colors[1]
               )

    # X axis 
    plt.xticks(x, val_1_df.index)
    ax.set_xlabel(independent_variable)

    # add grid
    ax.grid(axis="y")

    # delete some spines
    for s in ["left", "right", "top"]:
        ax.spines[s].set_visible(False)

    # legend
    colors = {val_1:val_1_colors[0], val_2: val_2_colors[0]}         
    labels = list(colors.keys())
    handles = [plt.Rectangle((0,0),1,1, color=colors[label]) for label in labels]
    plt.legend(handles, labels, bbox_to_anchor = (0.25, 0.9))

    plt.show()

- This function can be used with other datasets as well.
- Wrapping code in a function makes it reusable for similar tasks. 
- And if you find a bug in this function, you can update it for your own task. 
    + I will show you. 

In [None]:
helper_graph(train_viz, 
             dependent_variable = "Survived", 
             independent_variable="Pclass", 
             val_1 = "Survived", 
             val_2 = "Not Survived", 
             val_1_colors = ["#0095FF", "#AECFE6"], 
             val_2_colors = ["#FF6123", "#E6C0B1"]
            )

- Okay, very good. Same result shown. 

<a id='survived_sex'></a>
#### C. Survived ~ Sex
- Let's visualize the relationship between Survived and Sex

In [None]:
helper_graph(train_viz, 
             dependent_variable = "Survived", 
             independent_variable="Sex", 
             val_1 = "Survived", 
             val_2 = "Not Survived", 
             val_1_colors = ["#0095FF", "#AECFE6"], 
             val_2_colors = ["#FF6123", "#E6C0B1"]
            )

- This graph shows that more male passengers than female passengers died, in absolute numbers. 

<a id='survived_sibsp'></a>
#### D. Survived ~ SibSp
- Let's visualize the relationship between Survived and SibSp

In [None]:
train_viz['SibSp'] = train_viz['SibSp'].astype('object')
helper_graph(train_viz, 
             dependent_variable = "Survived", 
             independent_variable="SibSp", 
             val_1 = "Survived", 
             val_2 = "Not Survived", 
             val_1_colors = ["#0095FF", "#AECFE6"], 
             val_2_colors = ["#FF6123", "#E6C0B1"]
            )

- What does category 0 mean? 
    + These passengers travelled without any siblings or spouse on board.

<a id='survived_parch'></a>
#### E. Survived ~ Parch
- Let's visualize the relationship between Survived and Parch

In [None]:
train_viz['Parch'] = train_viz['Parch'].astype('object')
helper_graph(train_viz, 
             dependent_variable = "Survived", 
             independent_variable="Parch", 
             val_1 = "Survived", 
             val_2 = "Not Survived", 
             val_1_colors = ["#0095FF", "#AECFE6"], 
             val_2_colors = ["#FF6123", "#E6C0B1"]
            )

- This result is similar to the result for the column "SibSp".
    + One interesting point: in most categories, except for category 1, the number of survivors was higher than that of the other group.
    + If you have time, it is worth figuring out why more of them survived. 
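
- Since SibSp and Parch both describe family on board, one common way to dig further is to combine them. The columns `FamilySize` and `IsAlone` below are my own hypothetical names, not part of the dataset, and the toy rows are made up.

```python
import pandas as pd

# Toy data (made-up): four passengers
toy = pd.DataFrame({"SibSp": [0, 1, 0, 3], "Parch": [0, 0, 2, 1]})

# siblings/spouses + parents/children + the passenger themselves
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
# True when the passenger travelled with no family at all
toy["IsAlone"] = toy["FamilySize"] == 1

print(toy)
```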

<a id='survived_embarked'></a>
#### F. Survived ~ Embarked
- Let's visualize the relationship between Survived and Embarked

In [None]:
helper_graph(train_viz, 
             dependent_variable = "Survived", 
             independent_variable="Embarked", 
             val_1 = "Survived", 
             val_2 = "Not Survived", 
             val_1_colors = ["#0095FF", "#AECFE6"], 
             val_2_colors = ["#FF6123", "#E6C0B1"]
            )

<a id='embarked_pclass'></a>
#### G. Embarked ~ Pclass
- Let's visualize the relationship between Embarked and Pclass
- C is Cherbourg, Q is Queenstown, S is Southampton
- Another question comes up. Why did more passengers from C survive than passengers from S?
    + It is easy to figure out by drawing a new graph. 

In [None]:
temp = train.copy()
temp.dropna(subset=['Pclass', 'Embarked'], inplace=True)
temp_df = pd.crosstab(temp['Embarked'], temp['Pclass'], normalize='index')
fig, ax = plt.subplots(figsize=(10, 6))
ax = sns.heatmap(temp_df, annot=True, fmt=".1%")
plt.show()

- Passengers from C and Q travelling in 1st class (Pclass 1) were likely to survive more often than others. 
- This is important when conducting feature engineering. 
    + A new derived variable can be created by combining Embarked and Pclass. 
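
- One possible way to build such a combined variable is plain string concatenation. The column name `Embarked_Pclass` and the "Unknown" placeholder are assumptions for this sketch, and the toy rows are made up.

```python
import pandas as pd

# Toy data (made-up): one passenger has no Embarked value
toy = pd.DataFrame({"Embarked": ["C", "S", "Q", None], "Pclass": [1, 3, 1, 2]})

# fill missing ports first, then join port and class into one label
toy["Embarked_Pclass"] = (
    toy["Embarked"].fillna("Unknown") + "_" + toy["Pclass"].astype(str)
)

print(toy)
```

- The new column is categorical, so it still needs encoding before it can be used in most models.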

<a id='survived_cabin'></a>
#### H. Survived ~ Cabin
- How should we deal with Cabin?
- It's better to replace the missing values with a placeholder such as "No Cabin", using fillna

In [None]:
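# assign the result back; fillna(inplace=True) on a selected column is unreliable in recent pandas
train_viz["Cabin"] = train_viz["Cabin"].fillna("No Cabin")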
train_viz["Cabin"].unique()

- Let's get first letter from each value. 

In [None]:
train_viz["Cabin_code"] = train_viz["Cabin"].str[0]
train_viz["Cabin_code"].unique()

In [None]:
helper_graph(train_viz, 
             dependent_variable = "Survived", 
             independent_variable="Cabin_code", 
             val_1 = "Survived", 
             val_2 = "Not Survived", 
             val_1_colors = ["#0095FF", "#AECFE6"], 
             val_2_colors = ["#FF6123", "#E6C0B1"]
            )

- N stands for the missing values ("No Cabin"). In this group, the number of Not Survived is much higher than Survived, and the same holds for code A. 
- What is a Cabin? 

![](https://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Olympic_%26_Titanic_cutaway_diagram.png/440px-Olympic_%26_Titanic_cutaway_diagram.png)

- The image is from https://en.wikipedia.org/wiki/First-class_facilities_of_the_Titanic
- It is worth wondering why passengers in some cabin codes died more often than those in others. 
- Now, let's focus on the continuous variables

<a id='survived_fare'></a>
#### I. Survived ~ Fare
- Density plot is a good choice to see the trends with fare. 


In [None]:
temp = train_viz['Survived'].value_counts()
# look the counts up by label instead of relying on the sort order
not_survived_num = temp['Not Survived']
survived_num = temp['Survived']
print(not_survived_num, survived_num)

fig, ax = plt.subplots(figsize=(10, 6))
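# map each label to its color explicitly so the colors do not depend on row order
sns.kdeplot(data=train_viz, x="Fare", hue="Survived", multiple="fill", ax=ax, 
           palette={"Survived": "#0095FF", "Not Survived": "#FF6123"})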

# delete some spines
for s in ["left", "right", "top"]:
    ax.spines[s].set_visible(False)
    
# annotate each area with its label and passenger count
ax.text(250, 0.7, f"Survived {survived_num:,} people", color='w')
ax.text(150, 0.07, f"Not Survived {not_survived_num:,} people", color='w')

ax.set_ylabel("Rate (0 ~ 1)")

plt.show()

- It shows that the more a passenger paid, the more likely they were to survive.
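
- The same trend can be checked with numbers by cutting Fare into quartiles with `pd.qcut` and taking the mean of Survived in each bin. This sketch assumes the numeric 0/1 coding of Survived, as in `train`, and uses made-up toy values. `FareBin` is a hypothetical column name.

```python
import pandas as pd

# Toy data (made-up): eight passengers, Survived coded as 0/1
toy = pd.DataFrame({
    "Fare": [5.0, 8.0, 10.0, 15.0, 30.0, 60.0, 120.0, 250.0],
    "Survived": [0, 0, 0, 1, 0, 1, 1, 1],
})

# split Fare into four equally sized groups, cheapest (Q1) to most expensive (Q4)
toy["FareBin"] = pd.qcut(toy["Fare"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# the mean of a 0/1 column is the survival rate within each bin
rate_by_bin = toy.groupby("FareBin", observed=True)["Survived"].mean()
print(rate_by_bin)
```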

<a id='summary'></a>
### (3) Summary
- The ratio of the dependent variable is 57.2 (Not Survived) : 42.8 (Survived).
    + The data is not badly imbalanced.
- More passengers died in the male group than in the female group.
- The wealthy class survived more often than the other groups. 
    + Embarked, Pclass, Cabin. 
    + If a new variable is needed, I would combine these three variables to classify whether each passenger was wealthy or not at that time.
- My personal insight from EDA is that the wealthy female group survived more often than the other groups. 
- Now, let's move on to feature engineering and modeling in other notebooks. 
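
- A quick way to test an insight like the one about wealthy female passengers is a grouped survival rate. The sketch below groups by Sex and Pclass, using Pclass as a rough stand-in for wealth, which is my own assumption. The toy values are made up and are not results from the real data.

```python
import pandas as pd

# Toy data (made-up): Survived coded as 0/1
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "female", "male"],
    "Pclass": [1, 3, 1, 3, 1, 3],
    "Survived": [1, 0, 1, 0, 1, 0],
})

# survival rate for every Sex x Pclass combination, one column per Pclass
rate = toy.groupby(["Sex", "Pclass"])["Survived"].mean().unstack("Pclass")
print(rate)
```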

> Thank you for reviewing my notebooks. 

<a id="machine_learning"></a>
## Step 4. Machine Learning
- Please click here: [TPS-April SkLearn, PyCaret, LAML for Newbies](https://www.kaggle.com/j2hoon85/tps-april-sklearn-pycaret-laml-for-newbies). Let's continue. 

> If you enjoyed this notebook, please upvote :D