 # Introduction
 
 The RMS Titanic sank in the early morning hours of 15 April 1912 in the North Atlantic Ocean, four days into her maiden voyage from Southampton to New York City. The largest ocean liner in service at the time, Titanic had an estimated 2,224 people on board. Her sinking resulted in the deaths of 1,502 people, making it one of the deadliest peacetime maritime disasters in history.
 
 **Content**
1. [Load and Check Data](#1)
2. [Variable Description](#2)
    * [Univariate Variable Analysis](#3)
        * [Categorical Variable Analysis](#4)
        * [Numerical Variable Analysis](#5)
3. [Basic Data Analysis](#6)
4. [Outlier Detection](#7)
5. [Missing Value](#8)
    * [Find Missing Value](#9)
    * [Fill Missing Value](#10)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
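# matplotlib 3.6 renamed this style to "seaborn-v0_8-darkgrid"; use whichever name is available
plt.style.use("seaborn-v0_8-darkgrid" if "seaborn-v0_8-darkgrid" in plt.style.available else "seaborn-darkgrid") # plot style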
import seaborn as sns # data visualization
from collections import Counter
import warnings # to ignore warnings
warnings.filterwarnings("ignore")

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Load and Check Data <a id = "1"></a>

In [None]:
train_df = pd.read_csv("/kaggle/input/titanic/train.csv")
test_df = pd.read_csv("/kaggle/input/titanic/test.csv") # test set to predict on
passenger_id = test_df["PassengerId"]

In [None]:
test_df.head()

In [None]:
test_df.columns

# Variable Description <a id = "2"></a>

1. PassengerId   --> Unique id of the passenger
1. Survived   --> Survival (0 = No, 1 = Yes)
1. Pclass   --> Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
1. Name   --> Name of the passenger
1. Sex   --> Male/Female
1. Age   --> Age of the passenger
1. SibSp   --> Number of siblings or spouses aboard
1. Parch   --> Number of parents or children aboard
1. Ticket   --> Ticket Number
1. Fare   --> Fare of ticket
1. Cabin   --> Cabin number
1. Embarked   --> Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

In [None]:
train_df.info()

* float64(2): Age - Fare
* int64(5): PassengerId - Survived - Pclass - SibSp - Parch 
* object(5): Name - Sex - Ticket - Cabin - Embarked
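
The same tally can be computed instead of being read off the `info()` printout. A minimal sketch, assuming a hypothetical three-row frame (`toy_df` is an illustration, not the competition data):

```python
import pandas as pd

# Hypothetical miniature frame: one integer, one float and one text column
toy_df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Age": [22.0, 38.0, None],
    "Name": ["A", "B", "C"],
})

# Number of columns per dtype
dtype_counts = toy_df.dtypes.value_counts()
print(dtype_counts)

# Column names split into numeric and non-numeric
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy_df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, non_numeric_cols)
```

On the real data, replacing `toy_df` with `train_df` should reproduce the grouping listed above.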

# Univariate Variable Analysis <a id = "3"></a>

* Categorical Variable: Survived, Pclass, Name, Sex, SibSp, Parch, Ticket, Cabin, Embarked
* Numerical Variable: Fare, Age, PassengerId

## Categorical Variable Analysis <a id = "4"></a>

In [None]:
def bar_plot(column_name):
    
    var = train_df[column_name] # selected column
    var_value = var.value_counts() # number of rows in each category
    
    # Visualization
    # X axis: categories of the feature
    # Y axis: frequency of each category
    plt.figure(figsize=(9,3))
    plt.bar(var_value.index,var_value)
    plt.xticks(var_value.index)
    plt.ylabel("Frequency")
    plt.title(column_name)
    plt.show()
    print("{} : \n {}".format(column_name,var_value))
    

In [None]:
category1 = ["Survived","Sex","Pclass","Embarked","SibSp","Parch"]
for c in category1:
    bar_plot(c)

## From the bar plots we can see:
1. Survived is imbalanced: fewer passengers survived than died.
2. Sex is imbalanced: male frequency ≈ 65%.
3. Pclass: 3rd class frequency ≈ 55%, 1st class frequency ≈ 24%, 2nd class frequency ≈ 21%.
4. Embarked is very imbalanced: most passengers boarded at Southampton.
5. SibSp: most passengers have no siblings or spouses aboard.
6. Parch behaves similarly to SibSp: most passengers have no parents or children aboard.
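
Shares like these can be computed directly rather than estimated from bar heights. A minimal sketch, assuming a hypothetical four-row sample (on the real data the series would be `train_df["Sex"]`):

```python
import pandas as pd

# Hypothetical sample, not the competition data
sex = pd.Series(["male", "male", "female", "male"], name="Sex")

# normalize=True returns shares instead of raw counts; scale to percent
shares = sex.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

The same call could be added to `bar_plot` next to the raw counts, if percentages are wanted for every feature.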

## Numerical Variable Analysis <a id = "5"></a>

In [None]:
def hist_plot(column):
    plt.figure(figsize = (9,3))
    plt.hist(train_df[column],bins = 50)
    plt.xlabel(column)
    plt.ylabel("Frequency")
    plt.title("{} distribution with histogram".format(column))
    plt.show()

In [None]:
numericVar = ["Fare","Age","PassengerId"]
for i in numericVar:
    hist_plot(i)

## From the histograms we can see:
1. Fare Distribution: The distribution is strongly right-skewed. At least one passenger paid about 500 for a ticket; he or she may have been wealthy or may have paid for companions as well.
2. Age Distribution: Most passengers are between 20 and 30 years old.
3. PassengerId Distribution: Don't mind the PassengerId distribution. It is a unique id, and the histogram looks uneven only because of the number of bins (50 bins do not split the ids evenly).
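
The uneven PassengerId histogram can be reproduced without plotting. The sketch below assumes ids running from 1 to 891, one row per id, as in `train.csv`, and compares 50 bins with 9 bins (the value 9 is chosen for illustration only):

```python
import numpy as np

# One row per id, from 1 to 891
ids = np.arange(1, 892)

# 50 bins give a bin width of 17.8, so bins cannot all hold the same number of ids
counts_50, _ = np.histogram(ids, bins=50)

# 9 bins give a width just under 99, and every bin receives the same number of ids
counts_9, _ = np.histogram(ids, bins=9)

print(counts_50.min(), counts_50.max())
print(counts_9.min(), counts_9.max())
```

The data are perfectly uniform in both cases; only the binning changes how the bars look.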

# Basic Data Analysis <a id = "6"></a>

In this section, we analyze how some features relate to survival:
* Pclass - Survived
* Sex - Survived
* SibSp - Survived
* Parch - Survived


In [None]:
# Pclass - Survived

train_df[["Pclass","Survived"]].groupby(["Pclass"], as_index = False).mean().sort_values(by = "Survived",ascending = False)

In [None]:
# Sex - Survived

train_df[["Sex","Survived"]].groupby(["Sex"], as_index = False).mean().sort_values(by = "Survived",ascending = False)

In [None]:
# SibSp - Survived

train_df[["SibSp","Survived"]].groupby(["SibSp"], as_index = False).mean().sort_values(by = "Survived",ascending = False)

In [None]:
# Parch - Survived

train_df[["Parch","Survived"]].groupby(["Parch"], as_index = False).mean().sort_values(by = "Survived",ascending = False)
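
The four cells above repeat one pattern: group by a feature, average `Survived`, sort. A possible refactoring is sketched below; `survival_rate_by` is a hypothetical helper name and `toy_df` is made-up data, neither comes from the original notebook.

```python
import pandas as pd

def survival_rate_by(df, feature):
    """Mean of Survived per category of `feature`, highest rate first."""
    return (
        df.groupby(feature, as_index=False)["Survived"]
          .mean()
          .sort_values(by="Survived", ascending=False)
          .reset_index(drop=True)
    )

# Hypothetical toy data, not the competition file
toy_df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1],
})

rates = survival_rate_by(toy_df, "Pclass")
print(rates)
```

Because `Survived` is coded 0/1, its mean per group is the survival rate of that group.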

# Outlier Detection <a id = "7"></a>


In [None]:
def detect_outliers(df,features):
    outlier_indices = []
    
    for c in features:
        # 1st and 3rd quartiles; nanpercentile skips missing values (Age has NaNs,
        # and np.percentile would return NaN for it, so no Age outlier would be found)
        Q1 = np.nanpercentile(df[c],25)
        Q3 = np.nanpercentile(df[c],75)
        # IQR
        IQR = Q3 - Q1
        # Outlier step
        outlier_step = IQR * 1.5
        # detect outliers and their indices
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        # store indices
        outlier_indices.extend(outlier_list_col)
    
    outlier_indices = Counter(outlier_indices)
    # keep only the rows flagged as outliers in more than 2 features
    multiple_outliers = [i for i, v in outlier_indices.items() if v > 2]
    
    return multiple_outliers

In [None]:
train_df.loc[detect_outliers(train_df,["Age","SibSp","Parch","Fare"])]

In [None]:
# drop outliers
train_df = train_df.drop(detect_outliers(train_df,["Age","SibSp","Parch","Fare"]),axis = 0).reset_index(drop = True)
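
The rule applied per feature inside `detect_outliers` is the 1.5 × IQR fence. A small worked example, assuming a hypothetical array of fares in which the last value is deliberately extreme:

```python
import numpy as np

# Hypothetical values, not taken from the dataset
values = np.array([7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 100.0])

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Anything outside [lower, upper] counts as an outlier for this feature
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)
```

`detect_outliers` goes one step further: it drops a row only when the row falls outside these fences for more than two of the listed features.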

# Missing Value <a id = "8"></a>
* Find Missing Value
* Fill Missing Value

In [None]:
train_df_len = len(train_df)
train_df = pd.concat([train_df,test_df],axis = 0).reset_index(drop = True)
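
The length stored in `train_df_len` is presumably kept so that the combined frame can be split back into train and test rows later; this section does not show that step. A sketch under that assumption, with hypothetical stand-in frames (`train_part`, `test_part`):

```python
import pandas as pd

# Hypothetical stand-ins for the train and test frames
train_part = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Fare": [7.25, 71.28]})
test_part = pd.DataFrame({"PassengerId": [3], "Fare": [8.05]})

n_train = len(train_part)
combined = pd.concat([train_part, test_part], axis=0).reset_index(drop=True)

# ... missing values would be filled on `combined` here ...

# Rows before n_train are the training rows; the remaining rows are test rows,
# whose Survived column is NaN only because the test set has no such column
train_again = combined.iloc[:n_train]
test_again = combined.iloc[n_train:].drop(columns=["Survived"])
print(train_again.shape, test_again.shape)
```

Concatenating first means every fill rule is applied to both sets in one pass.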

## Find Missing Value <a id = "9"></a>

In [None]:
train_df.columns[train_df.isnull().any()]

In [None]:
train_df.isnull().sum()

## Fill Missing Value <a id = "10"></a>
* Embarked has 2 missing values
* Fare has 1 missing value


In [None]:
train_df[train_df["Embarked"].isnull()]

In [None]:
train_df.boxplot(column = "Fare",by = "Embarked")

In [None]:
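# Fill the missing Embarked values using the Fare feature:
# the passengers with a missing port paid a fare of 80, and in the boxplot
# the fares for port C lie closest to 80, so we use "C"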
train_df["Embarked"] = train_df["Embarked"].fillna("C")
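
The choice of port is made by eye from the boxplot. One way to make "closest" explicit is to compare each port's median fare with the fare in question; using the median is an assumption of this sketch, and `toy_df` holds hypothetical fares rather than the real ones.

```python
import pandas as pd

# Hypothetical fares per port, not the competition data
toy_df = pd.DataFrame({
    "Embarked": ["C", "C", "C", "Q", "Q", "S", "S", "S"],
    "Fare":     [60.0, 78.0, 90.0, 7.0, 8.0, 10.0, 26.0, 30.0],
})
target_fare = 80.0  # fare paid by the passengers whose port is missing

# Median fare per port, then the port whose median is nearest the target
median_fare = toy_df.groupby("Embarked")["Fare"].median()
closest_port = (median_fare - target_fare).abs().idxmin()
print(median_fare)
print(closest_port)
```

Restricting the comparison to passengers of the same ticket class would be a natural refinement.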

In [None]:
train_df[train_df["Fare"].isnull()]

In [None]:
train_df["Fare"] = train_df["Fare"].fillna(train_df[train_df["Pclass"] == 3]["Fare"].mean())
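
The cell above hard-codes `Pclass == 3`, the class of the one row with a missing fare. A more general variant fills each missing fare with the mean of that row's own class. A minimal sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing 3rd-class fare
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Fare":   [80.0, 100.0, 7.0, 9.0, np.nan],
})

# Mean fare of each row's own class, aligned to the original index
class_mean = toy_df.groupby("Pclass")["Fare"].transform("mean")

# Only the missing entries are replaced; existing fares stay unchanged
toy_df["Fare"] = toy_df["Fare"].fillna(class_mean)
print(toy_df)
```

With a single missing value both approaches give the same result; the group-wise version only matters when gaps occur in several classes.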

In [None]:
train_df[train_df["PassengerId"] == 1044]