# Titanic - Kaggle Competition

### Introduction

The sinking of the Titanic was one of the world's greatest tragedies as well as one of the most unexpected: the ship - the biggest in the world at the time, and widely regarded as "unsinkable" - sank on its maiden voyage after hitting an iceberg. Most of the roughly 2,200 people on board did not survive. Our analysis uses the available information to explore who survived - that is, which characteristics determined whether a passenger got a seat on a lifeboat. Was it status, sex, age, all of these and more? Or was it mere luck?

##### Target audience
Although nothing can be done for the victims of the Titanic anymore, the tragedy is worth exploring in order to understand human behaviour in critical situations. Social science scholars may want to explore how people decide which life is worth saving or, in other words, how the value of a life is measured (or at least how it was measured in the days of the Titanic). Is it the case that in life-and-death situations all lives are equal? Or are some lives valued more than others and, if so, what determines that value?

### Data and Feature Engineering
For the purpose of this analysis, the dataset provided by Kaggle will be used. The data on the approximately 1300 passengers is already split into training and testing subsets. Below is a quick overview of the data structure:

In [2]:
import pandas as pd
train = pd.read_csv("https://raw.githubusercontent.com/LucianChiriac/Coursera_Capstone/master/titanic_datasets/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/LucianChiriac/Coursera_Capstone/master/titanic_datasets/test.csv")
train.head(5)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Before commencing any analysis, all columns were examined thoroughly in order to spot trends and relationships between variables.

- PassengerId

A unique numerical value associated with each passenger. It contains no additional information that could assist the analysis, so this column will be dropped before feeding the data into our algorithms.

- Survived

This is the dependent (target) variable, with values 1 and 0 indicating whether a passenger survived or not. In the training set, 62% of the passengers did not survive the sinking of the ship. Our goal is to build a model that predicts this variable for each passenger - that is, predicts whether a passenger lives or dies based on the information provided about them.

- Pclass

This is a categorical variable, with values from 1 to 3 corresponding to travel classes (1 being the highest). One can hypothesize that richer passengers (i.e. those in first class) were more likely to survive than poorer ones. Indeed, a preliminary analysis reveals that 63% of first class travellers lived, as opposed to only 24% of those in third class. We therefore expect the Pclass column to play an important role in the analysis.
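The per-group survival rates quoted in this section can be obtained with a simple `groupby`. The sketch below is only illustrative: it runs on a small made-up sample rather than on `train`, and the helper name `survival_rate_by` is our own, not something from the original notebook.

```python
import pandas as pd

# Toy sample with the same column names as train.csv (values are illustrative only).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3, 2],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 1],
})

def survival_rate_by(df, column):
    """Return the share of survivors for each distinct value of `column`."""
    return df.groupby(column)["Survived"].mean()

# Survived is coded 0/1, so its mean is the overall survival rate.
overall_rate = toy["Survived"].mean()
rate_by_class = survival_rate_by(toy, "Pclass")
print(overall_rate)
print(rate_by_class)
```

The same helper can be applied to any other categorical column (for example `Sex` or `Embarked`) by changing the second argument.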

- Name

Again, this is unique to each passenger and not very useful in itself. We do notice, however, that each name comes with a title attached (e.g. Mr, Miss, Mrs). Could these titles serve our analysis? After extracting them we concluded that they are mainly a proxy for Sex, but there is some information to be gained, particularly for some subgroups. For example, the title "Master" is associated with very young boys, and we can use it to infer the age of those passengers where it is missing.
We ended up keeping the title column in the analysis, and also using it to fill in missing values in the Age column, based on the age distribution of each title.
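One possible way to extract the title is a regular expression that captures the text between the comma and the first full stop of the name. This is a minimal sketch; the exact pattern used in the original analysis is not shown, so the regex here is an assumption.

```python
import pandas as pd

# A few names in the format used by the Name column.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Palsson, Master. Gosta Leonard",
])

# Capture everything after "Surname, " up to (not including) the first full stop.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

On the full dataset, rare titles would probably need to be grouped into a handful of categories before being used as a feature.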

- Sex

Categorical variable with the values male and female. Perhaps unsurprisingly, we discovered that sex plays a huge role in determining the fate of a passenger. We first observed that the majority of passengers were male (65%); however, only 26% of them survived, in contrast to 74% of the women. This suggests that there was indeed a "women and children first" policy when assigning places on lifeboats.

- Age

Numerical value corresponding to age. The column has a large number of missing values, which were filled in based on the title of each passenger - the median age for each title was used.
Upon analysing the age distribution we observed no significant anomalies. For most ages the ratio of survivors to deaths was similar, with the exception of very young children (<10 years old). Although Age does not seem to play a big role otherwise, we decided to bin the column into groups with distinct survivability. The bin edges were, however, chosen by hand ("eyeballing"); a future iteration of the analysis should use a more principled method to separate the age groups that differ most in survivability.
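The two steps described above (median imputation per title, then binning) could look roughly as follows. The `Title` column name, the bin edges and the labels are hypothetical; the hand-picked bins of the original analysis are not listed in the text.

```python
import numpy as np
import pandas as pd

# Toy data: a Title column (as extracted from Name) and Age with gaps.
df = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Master", "Master", "Miss", "Miss"],
    "Age":   [22.0, 35.0, np.nan, 2.0, np.nan, 26.0, np.nan],
})

# Median age of each title, aligned row by row (NaN values are ignored by median).
title_median = df.groupby("Title")["Age"].transform("median")
df["Age"] = df["Age"].fillna(title_median)

# Hypothetical bin edges and labels; intervals are closed on the left, e.g. [0, 10).
age_bins = [0, 10, 18, 40, 60, np.inf]
age_labels = ["child", "teen", "adult", "middle_aged", "senior"]
df["AgeGroup"] = pd.cut(df["Age"], bins=age_bins, labels=age_labels, right=False)
print(df)
```

In a train/test setting, the medians would normally be computed on the training set only and then reused for the test set.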

- SibSp / Parch

Indicates whether a passenger was accompanied by siblings/spouses (SibSp) or parents/children (Parch), and by how many. We hypothesized that accompanied passengers may have had somewhat better chances of survival (a mother is less likely to be separated from her child, or a husband from his wife, when boat seats are assigned). We computed a new variable, family size, as the sum of the two columns. When we checked the relationship between family size and survival, we noticed that those travelling alone or with a big family (more than 3 members) were less likely to survive than those travelling with 1-3 family members. The differences seem to be substantial (30% survival for those alone, 55 to 72% for those travelling with up to 3 family members). As such, we created a new column corresponding to a family category: 0 if the passenger travelled either alone or accompanied by more than 3 family members, 1 otherwise.
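The family-size feature and the derived binary category can be sketched in a couple of lines. The column names `FamilySize` and `FamilyCat` are our own placeholders, and the data is a toy sample.

```python
import pandas as pd

# Toy sample with the two original columns.
df = pd.DataFrame({
    "SibSp": [1, 0, 1, 3, 0, 5],
    "Parch": [0, 0, 2, 2, 0, 2],
})

# Number of accompanying family members.
df["FamilySize"] = df["SibSp"] + df["Parch"]

# 1 for passengers with 1-3 family members, 0 for those alone or with more than 3.
df["FamilyCat"] = df["FamilySize"].between(1, 3).astype(int)
print(df)
```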


- Ticket

A ticket number - an alphanumeric column from which we could not extract any useful features, so it adds no value to the analysis.

- Fare

The price paid for the ticket. Although the vast majority of tickets cost less than 10 dollars, there are some extreme outliers (500+ for a ticket). This column may serve as a proxy for social status (similar to the Pclass column): we hypothesize that richer passengers (those who paid more) were more likely to survive. Indeed, our analysis indicates that passengers who paid more for their tickets had better chances of survival. Due to the extreme range of the values, we binned the column into 3 categories, corresponding to a small, medium or big ticket price.
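The text does not say how the three fare categories were delimited. One option, shown here purely as an assumption, is quantile-based binning with `pd.qcut`, which puts roughly the same number of passengers into each bin and is not distorted by the outliers.

```python
import pandas as pd

# Toy sample of fares, including one extreme value.
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05, 512.3292])

# Split into three equally populated bins (terciles); labels are hypothetical.
fare_cat = pd.qcut(fares, q=3, labels=["small", "medium", "big"])
print(fare_cat.tolist())
```

Fixed thresholds with `pd.cut` would work just as well if specific price boundaries are preferred.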

- Cabin

The code of the cabin assigned to a passenger. Most of the values in this column are missing. At first this looks like a data collection error or oversight. However, it may be that a value was recorded only for passengers who were actually assigned a cabin, while the rest had to take whatever empty place they could find. We tested this hypothesis by creating a new column - cabin assigned / not assigned. Indeed, our analysis revealed that this new feature is related to survival: 60+% of the passengers who were assigned a cabin survived, in contrast to only 30% of the others. Again, this column may well be another proxy for social status - we assume that reserving a cabin is associated with higher costs.
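The cabin indicator described above amounts to checking whether the value is missing. A minimal sketch on the five rows shown in the data overview (the column name `HasCabin` is a placeholder of ours):

```python
import numpy as np
import pandas as pd

# Cabin and Survived values of the first five training rows.
df = pd.DataFrame({
    "Cabin":    [np.nan, "C85", np.nan, "C123", np.nan],
    "Survived": [0, 1, 1, 1, 0],
})

# 1 if a cabin code is recorded, 0 if the value is missing.
df["HasCabin"] = df["Cabin"].notna().astype(int)

# Survival rate with and without a recorded cabin.
rate_by_cabin = df.groupby("HasCabin")["Survived"].mean()
print(rate_by_cabin)
```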

- Embarked

Indicates the port at which the passenger boarded the ship. Upon analysis, we discovered a significant difference in survival between ports of embarkation, with 55% survival amongst those embarking at port C, and only 34% amongst those embarking at port S.


### Modelling

For the modelling stage we decided on 4 of the most commonly used algorithms: Logistic Regression, Decision Trees, Support Vector Machines and K Nearest Neighbors. 

We applied 5-fold cross-validation and compared the four models using their default settings. Without any parameter tuning, the SVC model performed best, with an 82% mean accuracy on the held-out folds. We then took the analysis one step further and tested a variety of hyperparameters for each model in turn, using scikit-learn's GridSearchCV. After tuning, SVC remained the best performing model, with a slightly increased accuracy of 83.4%, so we used the best SVC estimator to make the predictions for the Kaggle test set.
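The comparison and tuning workflow can be sketched as below. To keep the example self-contained it uses synthetic data from `make_classification` in place of the engineered Titanic features, and the parameter grid is a small hypothetical one; the grid actually searched in the analysis is not listed. `max_iter=1000` is set on the logistic regression only to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and the Survived target.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(),
}

# Mean accuracy over 5 cross-validation folds for each untuned model.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
print(cv_scores)

# Hypothetical grid for the SVC; it includes the default settings.
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["rbf", "linear"],
    "gamma": ["scale", 0.1],
}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# Refit on all data with the best parameters; ready for predict().
best_model = search.best_estimator_
print(search.best_params_, search.best_score_)
```

Because the grid contains the default SVC settings and the same folds are used, the tuned cross-validation score cannot be lower than the untuned one, which matches the small improvement reported above.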

### Results

The baseline score is 76% - the score obtained by predicting that all males die and all females live. Using our SVM predictions and, alternatively, our Decision Tree predictions, we only managed to increase accuracy by 1%, up to 77%. This result, disappointing as it is, seems to be in line with the fact that Sex is the main predictor of whether someone survives or not. Future iterations of our workflow will attempt to improve the score by revisiting our feature engineering choices.
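For reference, the sex-only baseline mentioned above takes only a few lines to produce. The sketch uses a three-row stand-in for the test set; with the real data, `test` would be the DataFrame loaded at the top of the notebook.

```python
import pandas as pd

# Small stand-in for the test set (illustrative rows only).
test_sample = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# Baseline rule: every female survives (1), every male dies (0).
baseline = pd.DataFrame({
    "PassengerId": test_sample["PassengerId"],
    "Survived": (test_sample["Sex"] == "female").astype(int),
})
print(baseline)
```

Any model that cannot beat this rule on the leaderboard is effectively learning little beyond the Sex column.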