# Titanic dataset 
This analysis is based on the tutorial at https://www.kaggle.com/startupsci/titanic-data-science-solutions .
A tutorial on how to submit to the competition can be found at https://www.kaggle.com/alexisbcook/titanic-tutorial .
* On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.
* One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.
* Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

## Problem definition

Predict which passengers survived (or died) in the Titanic shipwreck using machine learning.

In [2]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

In [3]:
# load data
train_df = pd.read_csv('./data/train.csv')
test_df = pd.read_csv('./data/test.csv')
combine = [train_df, test_df]

In [4]:
train_df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [5]:
train_df.columns.values

array(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype=object)

# Workflow goal

* **Classifying**. We may want to classify or categorize our samples. We may also want to understand the implications or correlation of different classes with our solution goal.

* **Correlating**. Is there a correlation between a feature and the solution goal? This is based on studying how the feature values change with the solution value. We may also want to study how features correlate with other features.

* **Converting**. For the modeling stage, one needs to prepare the data, for instance by converting text categorical values to numerical values.

* **Completing**. Data preparation may also require us to estimate any missing values within a feature. Model algorithms may work best when there are no missing values.

* **Correcting**. We may also analyze the given training dataset for errors or possibly inaccurate values within features, and try to correct these values or exclude the samples containing the errors. One way to do this is to detect any outliers among our samples or features. We can also completely discard a feature if we believe that it does not contribute to our analysis in a significant way, or that it skews the results.

* **Creating**. We can create new features based on an existing feature or a set of features, such that the new feature follows the correlation, conversion, and completeness goals.

* **Charting**. We need to select the right visualization plots and charts depending on the nature of the data and the solution goals.
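
As a rough illustration of the converting and completing goals, the sketch below works on a small hand-made frame rather than the real `train_df`. The name `toy_df`, its values, and the choice of median and mode as fill values are assumptions made for the example, not choices taken from this analysis.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of train_df (values are illustrative)
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, None, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Converting: map a text category to integer codes
toy_df["Sex"] = toy_df["Sex"].map({"male": 0, "female": 1})

# Completing: fill missing Age with the median and Embarked with the mode
# (assumed fill strategies; other estimates are possible)
toy_df["Age"] = toy_df["Age"].fillna(toy_df["Age"].median())
toy_df["Embarked"] = toy_df["Embarked"].fillna(toy_df["Embarked"].mode()[0])

print(toy_df)
```

Median and mode are simple defaults; a fill value conditioned on other features (for example, class and sex) may be a better estimate on the real data.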

# Feature analysis
* **Categorical features**: Survived, Sex, and Embarked. Ordinal: Pclass.
* **Numerical features**: 
    1. Continuous: Age, Fare.
    2. Discrete: SibSp, Parch.
* **Mixed data types**: Some features hold numerical and alphanumeric data within the same column and may need correction. Ticket is a mix of numeric and alphanumeric values. Cabin is alphanumeric.
* **Errors or typos**: The Name feature may contain errors or typos, as names are written in several ways, including titles, round brackets, and quotes used for alternative or short names.
* **Blank, null, or empty values**:
    1. Cabin > Age > Embarked contain null values, in that order of frequency, in the training dataset (687, 177, and 2 of 891 rows, going by the counts reported by `describe`).
    2. Cabin > Age are incomplete in the test dataset.
* **Data types**:
    1. Seven features are integers or floats in the training data frame, and six in the test data frame (which has no Survived column).
    2. Five features are strings (objects): Name, Sex, Ticket, Cabin, Embarked.
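
The null-value ordering and the numeric/string split listed above can also be read off programmatically. The following is a minimal sketch on a made-up frame; `toy_df` and its values are hypothetical, and the same calls would apply to `train_df` and `test_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries (not the real data)
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
    "Pclass": [3, 1, 3, 1],
})

# Null count per column, most incomplete first
null_counts = toy_df.isnull().sum().sort_values(ascending=False)
print(null_counts)

# Split columns into numeric and non-numeric by dtype
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
other_cols = toy_df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, other_cols)
```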

In [6]:
print(train_df.tail(5))
print('_'*40)
print(train_df.info())
print('_'*40)
print(test_df.info())

     PassengerId  Survived  Pclass                                      Name  \
886          887         0       2                     Montvila, Rev. Juozas   
887          888         1       1              Graham, Miss. Margaret Edith   
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"   
889          890         1       1                     Behr, Mr. Karl Howell   
890          891         0       3                       Dooley, Mr. Patrick   

        Sex   Age  SibSp  Parch      Ticket   Fare Cabin Embarked  
886    male  27.0      0      0      211536  13.00   NaN        S  
887  female  19.0      0      0      112053  30.00   B42        S  
888  female   NaN      1      2  W./C. 6607  23.45   NaN        S  
889    male  26.0      0      0      111369  30.00  C148        C  
890    male  32.0      0      0      370376   7.75   NaN        Q  
________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 89

# Numerical feature values

This helps us determine, among other early insights, how representative the training dataset is of the actual problem domain.

* Total samples are 891, or 40\% of the 2224 passengers and crew on board the Titanic.
* Around 38\% of the samples survived, compared with the actual survival rate of 32\%.
* Most passengers (>75\%) did not travel with parents or children.
* Nearly 30\% of the passengers had siblings and/or a spouse aboard.
* Fares varied significantly, with few passengers (<1\%) paying as much as \$512.
* Few passengers (<1\%) were elderly, within the age range 65-80.

In [7]:
train_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
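
Percentage statements like the ones listed before the `describe()` table can be checked with boolean masks and quantiles. This is a sketch only: `toy_df` is a hypothetical ten-row frame, so its shares are not the real dataset's.

```python
import pandas as pd

# Hypothetical ten-row sample (values are illustrative, not the real data)
toy_df = pd.DataFrame({
    "Parch": [0, 0, 0, 0, 1, 0, 0, 2, 0, 0],
    "SibSp": [1, 0, 0, 1, 0, 0, 3, 0, 0, 0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08, 11.13, 512.33],
})

# The mean of a boolean mask is the share of rows where it is True
share_no_parch = (toy_df["Parch"] == 0).mean()
share_with_sibsp = (toy_df["SibSp"] > 0).mean()

# Quantiles show how far the top fares sit from the bulk of the data
fare_q = toy_df["Fare"].quantile([0.5, 0.9, 0.99])

print(share_no_parch, share_with_sibsp)
print(fare_q)
```

On the real data, `train_df.describe(percentiles=[...])` accepts a list of extra percentiles and gives the same kind of information in one table.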


# Categorical features
* Names are unique across the dataset (count=unique=891).
* Sex has two possible values, with 65\% male (top=male, freq=577/count=891).
* Cabin values have several duplicates across samples; that is, several passengers shared a cabin.
* Embarked takes three possible values. The S port was used by most passengers (top=S).
* Ticket has a high ratio (22\%) of duplicate values (unique=681).

In [8]:
train_df.describe(include=['O'])  # 'O' is the NumPy dtype code for object (string) columns

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Hawksford, Mr. Walter James",male,1601,B96 B98,S
freq,1,577,7,4,644
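
One way to quantify how much a column repeats itself is sketched below. It assumes the duplicate ratio is defined as `1 - unique / count`; the text does not say how its 22\% figure was computed, so this definition is an assumption. The `tickets` series is made up for the example.

```python
import pandas as pd

# Hypothetical ticket values (illustrative only)
tickets = pd.Series(
    ["1601", "1601", "113803", "PC 17599", "113803", "373450"],
    name="Ticket",
)

n_unique = tickets.nunique()               # distinct non-null values
dup_ratio = 1 - n_unique / tickets.count()  # assumed definition of the ratio
shared = tickets.value_counts()            # how many rows share each value

print(n_unique, round(dup_ratio, 3))
print(shared.head())
```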


# Some assumptions

## Correlating
We want to know how well each feature correlates with survival. We want to do this early in our project and match these quick correlations with modelled correlations later in the project.

## Completing
1. We may want to complete the Age feature, as it is definitely correlated with survival.
2. We may want to complete the Embarked feature, as it may also correlate with survival or with another important feature.

## Correcting (drop features)
1. The Ticket feature may be dropped from our analysis, as it contains a high ratio of duplicates (22\%) and there may not be a correlation between Ticket and survival.
2. The Cabin feature may be dropped, as it is highly incomplete: it contains many null values in both the training and test datasets.
3. PassengerId may be dropped from the training dataset, as it does not contribute to survival.
4. The Name feature is relatively non-standard and may not contribute directly to survival, so it may be dropped.
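
Dropping features could look like the sketch below. The frames `toy_train` and `toy_test` are hypothetical; the point is that the same columns are removed from both so that the two frames stay aligned.

```python
import pandas as pd

# Hypothetical miniature train/test frames (values are illustrative)
toy_train = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Ticket": ["A/5 21171", "PC 17599"],
    "Cabin": [None, "C85"],
})
toy_test = pd.DataFrame({
    "PassengerId": [892, 893],
    "Pclass": [3, 3],
    "Ticket": ["330911", "363272"],
    "Cabin": [None, None],
})

# Drop the same feature columns from both frames
drop_cols = ["Ticket", "Cabin"]
toy_train = toy_train.drop(columns=drop_cols)
toy_test = toy_test.drop(columns=drop_cols)

# PassengerId is dropped from the training frame only
toy_train = toy_train.drop(columns=["PassengerId"])

print(toy_train.columns.tolist(), toy_test.columns.tolist())
```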

## Creating
1. We may want to create a new feature called Family, based on Parch and SibSp, to get the total count of family members on board.
2. We may want to engineer the Name feature to extract Title as a new feature.
3. We may want to create a new feature for Age bands. This turns a continuous numerical feature into an ordinal categorical feature.
4. We may also want to create a Fare range feature if it helps our analysis.
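
The first three ideas can be sketched on a few rows as follows. The frame `toy_df`, the regular expression used for the title, and the age bin edges are all assumptions made for illustration; in particular, whether Family should also count the passenger (`+ 1`) is a design choice left open here.

```python
import pandas as pd

# Hypothetical rows; names follow the "Surname, Title. Given names" pattern
toy_df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
        "Montvila, Rev. Juozas",
    ],
    "Age": [22.0, 26.0, 2.0, 27.0],
    "SibSp": [1, 0, 3, 0],
    "Parch": [0, 0, 1, 0],
})

# Family: relatives aboard (siblings/spouses plus parents/children)
toy_df["Family"] = toy_df["SibSp"] + toy_df["Parch"]

# Title: the first word that is preceded by a space and followed by a period
toy_df["Title"] = toy_df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# AgeBand: ordinal codes for fixed-width age ranges (bin edges are assumed)
age_bins = [0, 16, 32, 48, 64, 80]
toy_df["AgeBand"] = pd.cut(toy_df["Age"], bins=age_bins, labels=False)

print(toy_df[["Family", "Title", "AgeBand"]])
```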

## Classifying
We may also add to our assumptions based on the problem description noted earlier.
1. Women (Sex=female) were more likely to have survived.
2. Children (Age<?) were more likely to have survived.
3. The upper-class passengers (Pclass=1) were more likely to have survived.

# Analyze by pivoting features
This is a quantitative analysis to confirm some of our observations and assumptions. It is based on pivoting features against each other. This only makes sense for features that have no empty values and are categorical, ordinal, or discrete.
* **Categorical: Sex**. Females survived at a much higher rate than males (about 74\% versus 19\%).
* **Ordinal: Pclass**. There is a clear correlation between class and survival: the survival rate falls from about 63\% for Pclass=1 to 47\% for Pclass=2 and 24\% for Pclass=3.
* **Discrete: SibSp and Parch**. Survival rates do not change monotonically with these features, and several values show a zero survival rate.

In [24]:
train_df[['Pclass','Survived']].mean()

Pclass      2.308642
Survived    0.383838
dtype: float64

In [104]:
# Mean and standard deviation of Survived for each passenger class
print('Pclass')
print(train_df[['Pclass','Survived']].groupby(['Pclass'],as_index=False).mean())
print(train_df[['Pclass','Survived']].groupby(['Pclass'],as_index=False).std())

# Survival rate by sex
print('Sex')
print('_'*40)
print(train_df[['Sex','Survived']].groupby(['Sex'],as_index=False).mean())
# print(train_df[['Sex','Survived']].groupby(['Sex'],as_index=False).std())

# Survival rate by number of siblings/spouses aboard
print('SibSp')
print('_'*40)
print(train_df[['SibSp','Survived']].groupby(['SibSp'],as_index=False).mean())
# print(train_df[['SibSp','Survived']].groupby(['SibSp'],as_index=False).std())

# Survival rate by number of parents/children aboard
print('Parch')
print('_'*40)
print(train_df[['Parch','Survived']].groupby(['Parch'],as_index=False).mean())
# print(train_df[['Parch','Survived']].groupby(['Parch'],as_index=False).std())

Pclass
   Pclass  Survived
0       1  0.629630
1       2  0.472826
2       3  0.242363
     Pclass  Survived
0  1.000000  0.484026
1  1.414214  0.500623
2  1.732051  0.428949
Sex
________________________________________
      Sex  Survived
0  female  0.742038
1    male  0.188908
SibSp
________________________________________
   SibSp  Survived
0      0  0.345395
1      1  0.535885
2      2  0.464286
3      3  0.250000
4      4  0.166667
5      5  0.000000
6      8  0.000000
Parch
________________________________________
   Parch  Survived
0      0  0.343658
1      1  0.550847
2      2  0.500000
3      3  0.600000
4      4  0.000000
5      5  0.200000
6      6  0.000000
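
Group means for rare values (for example, large SibSp or Parch) may rest on very few passengers, so it can help to report the group size next to the mean. The sketch below does this on a hypothetical frame `toy_df`; the same `groupby(...).agg(...)` pattern would apply to `train_df`.

```python
import pandas as pd

# Hypothetical sample (values are illustrative, not the real data)
toy_df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Survival rate and number of passengers per class, highest rate first
summary = (
    toy_df.groupby("Pclass")["Survived"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```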


In [64]:
# to convert this into a numpy array, simply use
Pclass=train_df[['Pclass','Survived']].values
print(Pclass.shape)

(891, 2)


In [88]:
Pclass1=train_df[train_df['Pclass']==1][['Pclass','Survived']].values
Pclass1.shape
Pclass2=train_df[train_df['Pclass']==2][['Pclass','Survived']].values
Pclass2.shape
Pclass3=train_df[train_df['Pclass']==3][['Pclass','Survived']].values
Pclass3.shape

(491, 2)

# Analyze by visualizing data
## correlating numerical features