# Titanic Dataset

* The goal is to predict whether a passenger survived, based on attributes such as age, sex, passenger class, and port of embarkation.

* Attribute Descriptions:
    * **PassengerId**: a unique identifier for each passenger
    * **Survived**: the target; 0 means the passenger did not survive, 1 means the passenger survived.
    * **Pclass**: passenger class (1, 2, or 3).
    * **Name**, **Sex**, **Age**: self-explanatory
    * **SibSp**: number of siblings and spouses of the passenger aboard the Titanic.
    * **Parch**: number of children and parents of the passenger aboard the Titanic.
    * **Ticket**: ticket ID
    * **Fare**: price paid (in pounds)
    * **Cabin**: passenger's cabin number
    * **Embarked**: port where the passenger boarded the Titanic (C = Cherbourg, Q = Queenstown, S = Southampton)

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np
import seaborn as sns

import eda_helper as eda

### Get the Data:

In [2]:
train_data = pd.read_csv('https://raw.githubusercontent.com/Spin8Cycle/data/main/data_samples/titanic/train.csv')
test_data = pd.read_csv('https://raw.githubusercontent.com/Spin8Cycle/data/main/data_samples/titanic/test.csv')

### Explore the Data:

In [3]:
eda.custom_info(train_data)

Column,Data Type,Non-Null Count,Null Count,% Missing,Distinct Values
PassengerId,int64,891,0,0.0,891
Survived,int64,891,0,0.0,2
Pclass,int64,891,0,0.0,3
Name,object,891,0,0.0,891
Sex,object,891,0,0.0,2
Age,float64,714,177,19.87,88
SibSp,int64,891,0,0.0,7
Parch,int64,891,0,0.0,7
Ticket,object,891,0,0.0,681
Fare,float64,891,0,0.0,248
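
The source of `eda_helper` is not shown here, so how `eda.custom_info` is implemented is unknown. As a rough sketch only, a helper that produces the same columns could be built from standard pandas calls. The name `custom_info_sketch` and the small `demo` frame are made up for illustration.

```python
import pandas as pd


def custom_info_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: dtype, null counts, % missing, distinct values.

    Hypothetical stand-in for eda.custom_info, whose source is not shown.
    """
    n_rows = len(df)
    null_count = df.isna().sum()
    return pd.DataFrame(
        {
            "Data Type": df.dtypes.astype(str),
            "Non-Null Count": df.notna().sum(),
            "Null Count": null_count,
            # Guard against an empty frame to avoid dividing by zero.
            "% Missing": (null_count / max(n_rows, 1) * 100).round(2),
            "Distinct Values": df.nunique(),  # NaN is not counted as a value
        }
    )


# Tiny made-up frame, only to exercise the function.
demo = pd.DataFrame(
    {"Age": [22.0, None, 35.0, 35.0], "Sex": ["male", "female", "female", "male"]}
)
summary = custom_info_sketch(demo)
print(summary)
```

Calling `custom_info_sketch(train_data)` should give a table of the same shape as the one above, with one row per column of the DataFrame.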


In [14]:
# Numerical Data
train_data.describe()

Statistic,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699113,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526507,1.102743,0.806057,49.693429
min,0.0,1.0,0.4167,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [15]:
# Categorical data (object columns)
train_data.describe(include='object')

Statistic,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Braund, Mr. Owen Harris",male,347082,B96 B98,S
freq,1,577,7,4,644


In [16]:
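def category_distribution(df, colnames):
    """
    Return the counts and percentage distribution of categorical columns.

    Parameters
    ----------
    df : DataFrame
        Source DataFrame.

    colnames : list
        Names of the columns that hold categorical values.

    Returns
    -------
    dist_df : DataFrame
        Indexed by (column name, category), with the columns
        "Counts" and "Distribution, %". Missing values are excluded.

    Raises
    ------
    KeyError
        If a name in colnames is not a column of df.
    """
    # Fail loudly instead of printing a message and returning None.
    missing = [n for n in colnames if n not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in df: {missing}")

    data = []
    index_1 = []  # outer index level: column name
    index_2 = []  # inner index level: category value
    for n in colnames:
        vc = df[n].value_counts()
        vc2 = df[n].value_counts(normalize=True)
        for i in vc.index:
            index_1.append(n)
            index_2.append(i)
            data.append([vc.loc[i], round(vc2.loc[i] * 100, 2)])

    dist_df = pd.DataFrame(
        data, index=[index_1, index_2], columns=["Counts", "Distribution, %"]
    )

    return dist_df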

In [18]:
category_distribution(train_data, ['Sex', 'Survived', 'Embarked', 'Pclass'])

Column,Value,Counts,"Distribution, %"
Sex,male,577,64.76
Sex,female,314,35.24
Survived,0,549,61.62
Survived,1,342,38.38
Embarked,S,644,72.44
Embarked,C,168,18.9
Embarked,Q,77,8.66
Pclass,3,491,55.11
Pclass,1,216,24.24
Pclass,2,184,20.65
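
Note that the `Embarked` percentages are relative to the 889 non-null values rather than all 891 rows (644 / 889 ≈ 72.44%), because `value_counts` drops missing values by default. The sketch below illustrates the difference on a small made-up Series; `ports` is an invented stand-in, not the real column.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value, standing in for Embarked.
ports = pd.Series(["S", "S", "C", np.nan], name="Embarked")

# Default: NaN is dropped, so shares are relative to the 3 non-null rows.
share_non_null = ports.value_counts(normalize=True)

# dropna=False: NaN gets its own row and shares are relative to all 4 rows.
share_all = ports.value_counts(normalize=True, dropna=False)

print((share_non_null * 100).round(2))
print((share_all * 100).round(2))
```

Passing `dropna=False` inside `category_distribution` would be one way to make missing values visible in the table, at the cost of a `NaN` entry in the index.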


### Prepare the Data: