# Introduction
The sinking of the Titanic is one of the most notorious shipwrecks in history. In 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.

<font color = "gray">
Content:

1. [Load and Check Data](#1)
2. [Variable Description](#2)
    * [Univariate Variable Analysis](#3)
        * [Categorical Variable Analysis](#4)
        * [Numerical Variable Analysis](#5)
3. [Basic Data Analysis](#6)
4. [Outlier Detection](#7)
</font>
        

In [34]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

<a id = "1"></a>
# Load and Check Data 

In [35]:
train_df = pd.read_csv("C:/Users/MRE/Documents/GitHub/Titanic-Project/Dataset/train.csv")
test_df = pd.read_csv("C:/Users/MRE/Documents/GitHub/Titanic-Project/Dataset/test.csv")
test_PassangerId = test_df["PassengerId"]

In [None]:
train_df.head()

In [None]:
train_df.describe()

<a id = "2"></a><br>
# Variable Description

1. PassengerId: Unique ID number of the passenger
2. Survived: Whether the passenger survived (1) or died (0)
3. Pclass: Passenger class
4. Name: Name of the passenger
5. Sex: Gender of the passenger
6. Age: Age of the passenger
7. SibSp: Number of siblings/spouses aboard
8. Parch: Number of parents/children aboard
9. Ticket: Ticket number
10. Fare: Amount of money spent on the ticket
11. Cabin: Cabin category
12. Embarked: Port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)

In [None]:
train_df.info()

* float64(2): Fare and Age
* int64(5): Pclass, SibSp, Parch, PassengerId and Survived
* object(5): Cabin, Embarked, Ticket, Name and Sex
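
The dtype summary above can also be derived programmatically instead of being read off the `info()` output. The following is a minimal sketch on a toy frame: `toy_df` and its values are hypothetical, and only the column names mirror the Titanic data.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the dataset.
toy_df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, None],   # a missing value keeps the column as float64
    "Fare": [7.25, 71.28, 7.92],
    "Sex": ["male", "female", "female"],
})

# Number of columns per dtype (dtype names converted to plain strings)
dtype_counts = toy_df.dtypes.astype(str).value_counts()
print(dtype_counts)

# Column names split into numeric and non-numeric dtypes
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
other_cols = toy_df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(other_cols)
```

The same two calls should work on `train_df` unchanged; the exact name of the text dtype may depend on the installed pandas version.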

<a id="3"></a><br>
# Univariate Variable Analysis
* Categorical Variables: Survived, Sex, Pclass, Embarked, Cabin, Name, Ticket, SibSp and Parch
* Numerical Variables: Fare, Age and PassengerId
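
Note that the split above does not follow the dtypes: Survived and Pclass are stored as integers but treated as categories. One rough way to spot such columns is to count distinct values. This is only a heuristic sketch; `toy_df`, its values, and the `max_categories` threshold are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the dataset.
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46],
})

max_categories = 3  # hypothetical threshold; tune it for the real data

# Number of distinct values in each column
unique_counts = toy_df.nunique()

# Few distinct values suggests a categorical variable, many suggests a numerical one
categorical_like = unique_counts[unique_counts <= max_categories].index.tolist()
numerical_like = unique_counts[unique_counts > max_categories].index.tolist()

print(categorical_like)
print(numerical_like)
```

A column such as Name or Ticket has many distinct values yet is still categorical, so the result needs a manual check.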

<a id="4"></a><br>
## Categorical Variable Analysis

In [None]:
def bar_plot(variable):
    """
    Plot a bar chart of a categorical column and print its value counts.

    input: variable, a column name of train_df, e.g. "Sex"
    output: bar plot & value counts
    """

    var = train_df[variable]  # Select the feature column
    varValue = var.value_counts()  # Number of samples per category

    #Visualize
    plt.figure(figsize=(9,3))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{}: \n {}".format(variable, varValue))

In [None]:
category1 = ["Survived", "Sex", "Pclass", "Embarked", "SibSp", "Parch"]
for c in category1:
    bar_plot(c)

<a id="5"></a><br>
## Numerical Variable Analysis

In [None]:
def plot_hist(variable):
    plt.figure(figsize=(9, 3))
    plt.hist(train_df[variable])
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with hist.".format(variable))
    plt.show()

In [None]:
numericVar = ["Fare", "Age", "PassengerId"]
for n in numericVar:
    plot_hist(n)

<a id="6"></a><br>
# Basic Data Analysis
* Pclass - Survived
* Sex - Survived
* Parch - Survived
* SibSp - Survived
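
Because Survived is coded as 0/1, the mean of Survived within a group is the fraction of that group that survived. A minimal sketch of this pattern on toy data (`toy_df` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the dataset.
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Mean of a 0/1 column per group = survival rate of that group
survival_by_sex = (
    toy_df[["Sex", "Survived"]]
    .groupby("Sex", as_index=False)
    .mean()
    .sort_values(by="Survived", ascending=False)
)
print(survival_by_sex)
```

In this toy frame two of the three women and one of the three men survive, so the sorted result lists female first.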

In [None]:
# Pclass & Survived
train_df[["Pclass","Survived"]].groupby(["Pclass"], as_index=False).mean().sort_values(by = "Survived", ascending=False)

In [None]:
train_df[["Sex","Survived"]].groupby(["Sex"], as_index=False).mean().sort_values(by = "Survived", ascending=False)

In [None]:
train_df[["Parch","Survived"]].groupby(["Parch"], as_index=False).mean().sort_values(by = "Survived", ascending=False)

In [None]:
train_df[["SibSp","Survived"]].groupby(["SibSp"], as_index=False).mean().sort_values(by = "Survived", ascending=False)

<a id="7"></a><br>
# Outlier Detection

In [None]:
from collections import Counter


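def detect_outlier(df, features):
    """Return indices of rows that are IQR outliers in more than two of the given features."""
    outlier_indices = []

    for c in features:
        # 1st quartile (np.percentile returns NaN if the column contains NaN,
        # in which case no row is flagged for that column)
        Q1 = np.percentile(df[c], 25)
        # 3rd quartile
        Q3 = np.percentile(df[c], 75)
        # Interquartile range
        IQR = Q3 - Q1
        # Outlier step
        outlier_step = IQR * 1.5
        # Detect outliers and their indices: below Q1 - step or above Q3 + step
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        # Store indices
        outlier_indices.extend(outlier_list_col)

    # Count how many features flagged each row
    outlier_indices = Counter(outlier_indices)
    # Keep only rows flagged in more than two features
    multiple_outliers = [i for i, v in outlier_indices.items() if v > 2]

    return multiple_outliers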

In [None]:
train_df.loc[detect_outlier(train_df, ["Age", "SibSp", "Parch", "Fare"])]
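
The 1.5 × IQR rule (Tukey's fences) behind this detection can be checked by hand on a small series. The sketch below uses hypothetical toy values and hypothetical flagged row indices. It shows the fences for one feature, then the counting step that keeps only rows flagged in more than two features.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from the Titanic data
values = pd.Series([1, 2, 3, 4, 5, 100])

q1 = np.percentile(values, 25)   # 2.25
q3 = np.percentile(values, 75)   # 4.75
iqr = q3 - q1                    # 2.5
step = 1.5 * iqr                 # 3.75

lower_fence = q1 - step          # -1.5
upper_fence = q3 + step          # 8.5

# Values outside the fences are treated as outliers
outliers = values[(values < lower_fence) | (values > upper_fence)].tolist()
print(outliers)  # [100]

# Hypothetical row indices flagged across several features:
# row 0 is flagged three times, row 5 twice, row 7 once
flagged = [0, 5, 0, 7, 0, 5]
counts = Counter(flagged)
multiple_outliers = [i for i, n in counts.items() if n > 2]
print(multiple_outliers)  # [0]
```

Requiring more than two flags is a conservative choice: a row that is extreme in a single feature is kept.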

In [None]:
# Drop outliers
train_df = train_df.drop(detect_outlier(train_df, ["Age", "SibSp", "Parch", "Fare"]), axis=0).reset_index(drop=True)