# Kaggle Analysis of Titanic Dataset

This notebook aims to predict which passengers survived the Titanic shipwreck, using the [Kaggle Titanic dataset](https://www.kaggle.com/c/titanic/overview).

The analysis proceeds as follows:
1. Exploratory Data Analysis & Hypothesizing
2. Data Cleaning
3. Shortlist Best ML Models with PyCaret
4. Verify PyCaret Results with Manual Implementation
5. Discussion/Conclusion

The training dataset contains the following features:
* PassengerId – unique ID number of the passenger
* Survived – whether or not that passenger survived (0 = No, 1 = Yes)
* Pclass – ticket class (1st, 2nd, or 3rd class)
* Name – name of the passenger
* Sex – sex of the passenger
* Age – age of the passenger, in years
* SibSp – # of siblings / spouses aboard the Titanic
* Parch – # of parents / children aboard the Titanic
* Ticket – ticket number of the passenger
* Fare – fare paid by the passenger
* Cabin – passenger's cabin number
* Embarked – port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

## Exploratory Data Analysis & Hypothesizing

In [2]:
# Import data and check first few rows.
import pandas as pd

df = pd.read_csv('train.csv')
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [8]:
print(f'Number of rows: {len(df)}')

# Verify data types.
df.dtypes

Number of rows: 891


PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [9]:
# Count null values.
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [3]:
# Get basic stats from numerical columns.
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [13]:
print(f'Number of passengers who survived: {len(df[df["Survived"] == 1])}')
print(f'Number of passengers who did not survive: {len(df[df["Survived"] == 0])}')

Number of passengers who survived: 342
Number of passengers who did not survive: 549


Hypotheses:
1. Younger passengers should be more likely to survive.
2. *Ticket*, *Fare*, and *Embarked* should have no bearing on survival rate.
3. 1st class passengers should be more likely to survive. They may have been given lifeboat priority because of their social status, and the lifeboats were probably closer to them, since 1st class passengers were housed on the upper decks.
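
One possible way to check these hypotheses against the data is to compare survival rates across groups. The sketch below is an illustration, not part of the original analysis: the helper names `survival_rate_by` and `survival_rate_by_age_band` are hypothetical, and the age band edges are an arbitrary choice. It only assumes a DataFrame with the `Survived`, `Age`, and grouping columns listed above.

```python
import pandas as pd


def survival_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Return the mean of Survived (the survival rate) for each value of `column`."""
    return df.groupby(column)["Survived"].mean()


def survival_rate_by_age_band(
    df: pd.DataFrame, edges=(0, 12, 18, 40, 60, 100)
) -> pd.Series:
    """Return the survival rate per age band.

    Rows with a missing Age are left out. The band edges are an
    illustrative assumption, not taken from the notebook.
    """
    known = df.dropna(subset=["Age"])
    # pd.cut builds right-closed intervals: (0, 12], (12, 18], ...
    bands = pd.cut(known["Age"], bins=list(edges))
    # observed=True keeps only the bands that actually contain passengers.
    return known.groupby(bands, observed=True)["Survived"].mean()
```

With the frame loaded earlier, `survival_rate_by(df, 'Pclass')` and `survival_rate_by(df, 'Embarked')` would speak to hypotheses 3 and 2, and `survival_rate_by_age_band(df)` to hypothesis 1. A difference in group rates is only suggestive, since the features are likely correlated with each other.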

In [16]:
# See a list of the unique cabin numbers.
print(f'Number of unique cabin numbers: {df.Cabin.nunique()}')
df.Cabin.unique()

Number of unique cabin numbers: 147


array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64',

## Data Cleaning

## Shortlist Best ML Models with PyCaret

## Verify PyCaret Results with Manual Implementation

In [None]:
# Build pipeline for chosen model since train and test are separated
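
Because the train and test files are separate, any imputation or encoding should be learned from the training data and then reapplied unchanged to the test data. A scikit-learn `Pipeline` is one way to get that behavior. The following is a minimal sketch under stated assumptions: the notebook does not name the chosen model or the retained columns, so `LogisticRegression`, the `NUMERIC` and `CATEGORICAL` lists, and the `build_pipeline` helper are all placeholders.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumed feature split; the notebook does not state which columns it keeps.
NUMERIC = ["Age", "SibSp", "Parch", "Fare"]
CATEGORICAL = ["Pclass", "Sex", "Embarked"]


def build_pipeline() -> Pipeline:
    """Bundle preprocessing and a classifier into one estimator.

    Calling fit() learns imputation values and category levels from the
    training data only; predict() reuses them on unseen data.
    """
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Ignore categories that appear only in the test data.
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), NUMERIC),
        ("cat", categorical, CATEGORICAL),
    ])
    # Placeholder classifier, NOT the model selected in this notebook.
    return Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=1000)),
    ])
```

Usage would look like `pipe = build_pipeline()`, then `pipe.fit(train[NUMERIC + CATEGORICAL], train["Survived"])` and `pipe.predict(test[NUMERIC + CATEGORICAL])`, where `train` and `test` stand for the two loaded DataFrames. Swapping in the shortlisted model only requires replacing the `"model"` step.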

## Discussion/Conclusion