# Analyzing survival probability on the Titanic 

### Content
 - Starting point 
 - Data analysis 
    - Data preparation 
    - Descriptive analysis and model development

#### 1. Starting Point

Search the internet for data science competitions and one of the first results is Kaggle. Founded in 2010, Kaggle is an online community of data scientists and machine learning practitioners that offers, among other things, machine learning competitions. It has been a subsidiary of Google since 2017. A good way to get familiar with the platform and to dive into ML competitions is to start with the well-known Titanic ML competition. The first step is to install the Kaggle API and download the Titanic data set.


In [2]:
# Installation of Kaggle API and download of Titanic data set 

#pip install kaggle

Collecting kaggle
  Downloading kaggle-1.5.12.tar.gz (58 kB)
Collecting python-slugify
  Downloading python_slugify-5.0.2-py2.py3-none-any.whl (6.7 kB)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py): started
  Building wheel for kaggle (setup.py): finished with status 'done'
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73057 sha256=b5651653bcf1b3f52c44b4e7fed568e40a94b1c2034ef457f8093331ff8f36f1
  Stored in directory: c:\users\marcw\appdata\local\pip\cache\wheels\29\da\11\144cc25aebdaeb4931b231e25fd34b394e6a5725cbb2f50106
Successfully built kaggle
Installing collected packages: text-unidecode, python-slugify, kaggle
Successfully installed kaggle-1.5.12 python-slugify-5.0.2 text-unidecode-1.3
Note: you may need to restart the kernel to use updated packages.


The following commands:
- give a quick overview of the existing competitions
- download the data sets for the Titanic competition

In [16]:
#!kaggle competitions list

#!kaggle competitions download -c titanic

ref                                            deadline             category            reward  teamCount  userHasEntered  
---------------------------------------------  -------------------  ---------------  ---------  ---------  --------------  
contradictory-my-dear-watson                   2030-07-01 23:59:00  Getting Started     Prizes        201           False  
gan-getting-started                            2030-07-01 23:59:00  Getting Started     Prizes        327           False  
tpu-getting-started                            2030-06-03 23:59:00  Getting Started  Knowledge        975           False  
digit-recognizer                               2030-01-01 00:00:00  Getting Started  Knowledge       6049           False  
titanic                                        2030-01-01 00:00:00  Getting Started  Knowledge      50726            True  
house-prices-advanced-regression-techniques    2030-01-01 00:00:00  Getting Started  Knowledge      13271           False  
connectx
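
The download command typically delivers the competition files as a single zip archive. The following is a minimal sketch of how such an archive could be unpacked with the standard library. The function name `extract_competition_archive` and the file and folder names in the usage note are assumptions for illustration and are not taken from the notebook.

```python
import zipfile
from pathlib import Path


def extract_competition_archive(archive_path, target_dir):
    """Unpack a downloaded archive into target_dir and return the member names.

    Hypothetical helper: the name and signature are illustrative only.
    """
    target = Path(target_dir)
    # Create the target folder if it does not exist yet
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())
```

Assuming the archive was saved as `titanic.zip` in the working directory, a call such as `extract_competition_archive("titanic.zip", "01_data")` would place the CSV files in a `01_data` folder.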

#### 2. Data analysis
#### 2.1. Data preparation


The goal of this competition is to build a model that predicts, as accurately as possible, the survival probability of a Titanic passenger from the characteristics we know about her or him. As a first step we import the Python libraries needed to read the Titanic data sets from CSV files, to manipulate the data and to display plots and graphs.

In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set() # setting seaborn default for plots

Two data sets are relevant. The train.csv data set contains, in addition to the features of each passenger, the survival information, i.e. whether the passenger survived the Titanic tragedy or not. This data set can therefore be used to build the prediction model. The test.csv data set, on the other hand, contains all passenger features except the dependent variable Survived. It is the data set used in the competition: our prediction model will score each of its passengers, i.e. provide a prediction of his or her likelihood of surviving the Titanic tragedy.

In [3]:
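# Folder that holds the downloaded Kaggle files (local path, adjust as needed)
locpath1 = "C:/Users/marcw/01_projects/jupyterlab/01_kaggle_titanic/01_data/"

# pandas parses the CSV files itself, so the csv module is not needed
test_df = pd.read_csv(locpath1 + "test.csv")
train_df = pd.read_csv(locpath1 + "train.csv")

# Only the last expression of a cell is displayed
train_df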

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
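
The relation between the two files described above (same features, with Survived present only in the train data) can be checked directly. The sketch below uses two small made-up frames in place of `train_df` and `test_df`. The helper name `column_difference` is an assumption for illustration.

```python
import pandas as pd

# Toy frames standing in for train_df and test_df (made-up values)
toy_train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1]})
toy_test = pd.DataFrame({"PassengerId": [3, 4], "Pclass": [2, 3]})


def column_difference(train, test):
    """Return (columns only in train, columns only in test), each sorted."""
    only_train = sorted(set(train.columns) - set(test.columns))
    only_test = sorted(set(test.columns) - set(train.columns))
    return only_train, only_test


only_train, only_test = column_difference(toy_train, toy_test)
print(only_train, only_test)
```

Applied to the real data, `column_difference(train_df, test_df)` would be expected to report Survived as the only column missing from the test set.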


Looking at the train data set, we see that it contains 12 columns and 891 rows, i.e. passengers. There is a "PassengerId", the dependent variable "Survived" and ten further characteristics for each passenger, which might be used as explanatory variables in the prediction model. For the further analysis it is useful to look at the data type of each column.

In [5]:
train_df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Moreover, it is important to have a description of the data set that explains the content of the available attributes.

- survival:       Survival 0 = No, 1 = Yes
- pclass:         Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
- sex:            Sex
- Age:            Age in years
- sibsp:          # of siblings / spouses aboard the Titanic
- parch:          # of parents / children aboard the Titanic
- ticket:         Ticket number
- fare:           Passenger fare
- cabin:          Cabin number
- embarked:       Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton


Here is some further information on some attributes:

- pclass: A proxy for socio-economic status: 1st = Upper; 2nd = Middle; 3rd = Lower
- age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
- sibsp: The dataset defines family relations in this way...; Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored)
- parch: The dataset defines family relations in this way...; Parent = mother, father; Child = daughter, son, stepdaughter, stepson; Some children travelled only with a nanny, therefore parch=0 for them.

After looking at the raw data, the next step of the data preparation is to compute some simple univariate descriptive statistics for each attribute in the data set. These already reveal characteristics of the data that need to be considered when preparing and analyzing it further.
- For example, the attributes Age, Cabin and Embarked contain missing values, a common issue in data science projects (handled e.g. by imputing estimated values).
- Moreover, we get some first insights into the Titanic tragedy: only ~38% of the passengers in the train data set survived, and fares go up to about £512 while 75% of all passengers paid a fare of £31 or less.

In [6]:
train_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [7]:
train_df.describe(include=['O'])

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Moor, Mrs. (Beila)",male,CA. 2343,C23 C25 C27,S
freq,1,577,7,4,644
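
The imputation mentioned above can be sketched as follows. This is one simple option among many, not necessarily the approach used later in this analysis: the median fills a numeric column and the most frequent value fills a categorical one. The toy frame and the helper name `impute_simple` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as the Titanic data (made-up values)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})


def impute_simple(df):
    """Return a copy with Age filled by its median and Embarked by its mode."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # mode() returns a Series of the most frequent values; take the first one
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode().iloc[0])
    return out


imputed = impute_simple(toy)
print(imputed)
```

Working on a copy keeps the original frame untouched, so the effect of different imputation choices can be compared later.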


Similarly, we can look at the descriptive statistics for the test data set and compare them with the train data set to check for obvious structural differences, which ideally should not exist.

In [8]:
test_df.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,3.0,39.0,1.0,0.0,31.5
max,1309.0,3.0,76.0,8.0,9.0,512.3292


In [9]:
test_df.describe(include=['O'])

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,418,418,418,91,418
unique,418,2,363,76,3
top,"Samaan, Mr. Elias",male,PC 17608,B57 B59 B63 B66,S
freq,1,266,5,3,270
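
Comparing the two `describe()` outputs by eye works, but the statistics can also be placed side by side. The sketch below is one possible way to do this, shown on made-up toy frames. The helper name `compare_describe` and the chosen statistics are assumptions for illustration.

```python
import pandas as pd

# Toy frames standing in for train_df and test_df (made-up values)
toy_train = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({"Age": [25.0, 35.0], "Fare": [5.0, 15.0]})


def compare_describe(train, test, columns):
    """Put selected descriptive statistics of both data sets side by side."""
    stats = ["mean", "std", "min", "max"]
    return pd.concat(
        {
            "train": train[columns].describe().loc[stats],
            "test": test[columns].describe().loc[stats],
        },
        axis=1,  # the dict keys become the top level of the column index
    )


summary = compare_describe(toy_train, toy_test, ["Age", "Fare"])
print(summary)
```

With the real data, a call like `compare_describe(train_df, test_df, ["Pclass", "Age", "SibSp", "Parch", "Fare"])` would cover the numeric features shared by both files.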


In the next step we look at bivariate relations between the attribute Survived (i.e. the survival rate) and different features in the data set such as Pclass, Sex, Age, etc. The results should be plausible when checked against our expectations or hypotheses. E.g. I would expect that
- passengers in 1st and 2nd class have higher survival rates than those in 3rd class
- female passengers have a higher survival rate than male passengers
- young and elderly passengers might also have higher survival chances
- ...

In [10]:
train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Unnamed: 0,Pclass,Survived
0,1,0.62963
1,2,0.472826
2,3,0.242363


In [11]:
train_df[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Unnamed: 0,Sex,Survived
0,female,0.742038
1,male,0.188908
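
The age hypothesis cannot be checked with a plain `groupby` on Age, because Age is continuous. One possible approach, sketched here on made-up toy data, is to cut Age into bands first. The band edges, the labels and the helper name `survival_by_age_band` are assumptions for illustration, not values taken from the data set description.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values); the missing Age is excluded from all bands
toy = pd.DataFrame({
    "Age": [4.0, 10.0, 25.0, 40.0, 70.0, np.nan],
    "Survived": [1, 1, 0, 1, 0, 1],
})


def survival_by_age_band(df, edges=(0, 16, 60, 120), labels=("child", "adult", "senior")):
    """Mean of Survived per age band; intervals are right-closed, e.g. (0, 16]."""
    bands = pd.cut(df["Age"], bins=list(edges), labels=list(labels))
    return df.groupby(bands, observed=False)["Survived"].mean()


rates = survival_by_age_band(toy)
print(rates)
```

Rows with a missing Age fall into no band and are left out of the rates, which matters for the train data set, where Age is not always recorded.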


In [8]:
ana_col = [
           "Survived", 
           "Pclass", 
           "Sex", 
           "Age", 
           "SibSp", 
           "Parch",
           "Ticket",
           "Fare",
           "Cabin",
           "Embarked"
          ]



In [8]:
train_df.info()
train_df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
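
The absolute counts above are easier to judge as shares: Cabin is missing for 687 of 891 rows (about 77%) and Age for 177 of 891 (about 20%). A minimal sketch of computing such shares, shown on a made-up toy frame with an illustrative helper name `missing_share`:

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with gaps in Age and Cabin
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, None, None, "C85"],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})


def missing_share(df):
    """Fraction of missing values per column, largest first."""
    # The mean of a boolean mask equals the share of True values
    return df.isnull().mean().sort_values(ascending=False)


shares = missing_share(toy)
print(shares)
```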





In [None]:
#

import os
raw_data_path = os.path.join(os.path.pardir,'01_data')
raw_data_path