# Titanic Dataset Kaggle
## Description
https://www.kaggle.com/c/titanic
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.


## Workflow stages

The competition solution workflow goes through seven stages described in the Data Science Solutions book.

    1. Question or problem definition.
    2. Acquire training and testing data.
    3. Wrangle, prepare, cleanse the data.
    4. Analyze, identify patterns, and explore the data.
    5. Model, predict and solve the problem.
    6. Visualize, report, and present the problem solving steps and final solution.
    7. Supply or submit the results.
    
## Workflow Goals

The data science solutions workflow solves for seven major goals.

1. **Classifying**. We may want to classify or categorize our samples. We may also want to understand the implications or correlation of different classes with our solution goal.

2. **Correlating**. One can approach the problem based on the features available in the training dataset. Which features within the dataset contribute significantly to our solution goal? Statistically speaking, is there a correlation between a feature and the solution goal? As the feature values change, does the solution state change as well, and vice versa? This can be tested both for numerical and categorical features in the given dataset. We may also want to determine correlation among features other than survival for subsequent goals and workflow stages. Correlating certain features may help in creating, completing, or correcting features.

3. **Converting**. For the modeling stage, one needs to prepare the data. Depending on the choice of model algorithm, one may need to convert all features to equivalent numerical values, for instance converting text categorical values to numeric values.

4. **Completing**. Data preparation may also require us to estimate any missing values within a feature. Model algorithms may work best when there are no missing values.

5. **Correcting**. We may also analyze the given training dataset for errors or possibly inaccurate values within features and try to correct these values or exclude the samples containing the errors. One way to do this is to detect any outliers among our samples or features. We may also completely discard a feature if it is not contributing to the analysis or may significantly skew the results.

6. **Creating**. Can we create new features based on an existing feature or a set of features, such that the new feature follows the correlation, conversion, and completeness goals?

7. **Charting**. How do we select the right visualization plots and charts depending on the nature of the data and the solution goals?
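
As a rough illustration of the converting, completing, and creating goals, here is a minimal sketch on a small invented DataFrame. The rows, the `SexCode` and `FamilySize` column names, and the choice of the median for filling are all assumptions made for this example; they are not taken from the competition data or from the book.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; not taken from train.csv.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 2],
})

# Converting: map a text category to numeric codes (hypothetical column name).
toy["SexCode"] = toy["Sex"].map({"male": 0, "female": 1})

# Completing: fill the missing Age with the median of the known ages.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Creating: derive a hypothetical FamilySize feature from two existing ones.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

print(toy)
```

The median is only one possible fill strategy; other statistics or a model-based estimate could be used instead.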


## 1. Question (Problem Definition)

Given a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine, for a test dataset that does not contain the survival information, whether each of its passengers survived or not?

# Let's load the data, describe what we have, etc.
1. Load data (headers, column names, parse dates, etc.)
2. Describe dataframe (head, describe, info, NaNs, type, etc.)
3. ??

In [24]:
# load modules
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

%matplotlib inline 

#load data 
df_train = pd.read_csv('train.csv')

# methods to see df and get some info
print ('Shape: \n', df_train.shape,'\n')
print ('The Dataframe... \n', df_train.head(), '\n ________________________________\n')
print ('The columns types... \n', df_train.dtypes, '\n ________________________________\n')
print ('Some statistics: \n', df_train.describe(), '\n ________________________________\n')
print ('Where are the null values: ',df_train.isnull().any(), '\n ________________________________\n')


Shape: 
 (891, 12) 

The Dataframe... 
    PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0       

There are 11 features. Let's look at the distinct values each feature takes, to see whether it is binary, categorical, or continuous.

In [55]:
print ('Values of Survived: ', df_train.Survived.unique(), '\n ________________________________\n')
print ('Values of PClass: ', df_train.Pclass.unique(), '\n ________________________________\n')
print ('Values of Age: ', sorted(df_train.Age.unique()), '\n ________________________________\n')
print ('Values of Sex: ', df_train.Sex.unique(), '\n ________________________________\n')
print ('Values of SibSp: ', sorted(df_train.SibSp.unique()), '\n ________________________________\n')
print ('Values of Parch: ', df_train.Parch.unique(), '\n ________________________________\n')
#print ('Values of Ticket: ', df_train.Ticket.unique())
print ('Values of Fare: ', sorted(df_train.Fare.unique()), '\n ________________________________\n')
print ('Values of Cabin: ', df_train.Cabin.unique(), '\n ________________________________\n')
print ('Values of Embarked: ', df_train.Embarked.unique(), '\n ________________________________\n')

Values of Survived:  [0 1] 
 ________________________________

Values of PClass:  [3 1 2] 
 ________________________________

Values of Age:  [0.83, 2.0, 3.0, 4.0, 5.0, 7.0, 8.0, 11.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 28.5, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 38.0, nan, 0.42, 0.67, 0.75, 0.92, 1.0, 6.0, 9.0, 10.0, 12.0, 13.0, 14.5, 20.5, 23.5, 24.5, 30.5, 32.5, 34.5, 36.0, 36.5, 37.0, 39.0, 40.0, 40.5, 41.0, 42.0, 43.0, 44.0, 45.0, 45.5, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 55.5, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 70.0, 70.5, 71.0, 74.0, 80.0] 
 ________________________________

Values of Sex:  ['male' 'female'] 
 ________________________________

Values of SibSp:  [0, 1, 2, 3, 4, 5, 8] 
 ________________________________

Values of Parch:  [0 1 2 5 3 4 6] 
 ________________________________

Values of Fare:  [0.0, 4.0125, 5.0, 6.2375, 6.4375, 6.45, 6.4958, 6.75, 6.8583, 6
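
Note that the Age output above is not fully ordered: after `nan` the values start again from 0.42. Comparisons with NaN are always false, so Python's `sorted` cannot place it consistently. A minimal sketch of one way around this, on a few invented ages (not values read from `train.csv`), is to drop the missing values before sorting and count them separately:

```python
import numpy as np
import pandas as pd

# Toy ages invented for illustration, including one missing value.
ages = pd.Series([22.0, np.nan, 0.42, 38.0, 0.83, 22.0])

# Drop NaN before sorting so that every comparison is well defined.
sorted_ages = sorted(ages.dropna().unique().tolist())
print(sorted_ages)

# Report missing values separately instead of mixing them into the list.
n_missing = int(ages.isna().sum())
print("missing:", n_missing)
```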

## What do we have?
1. 12 columns: PassengerId + 11 features
    1. **Survived**: int64, [0, 1] (binary), no null values. The mean gives the survival rate (38% in df_train). Comparing this with the survival rate of everyone on board, passengers and crew (1 - 1502/2224 = 0.32), indicates that **df_train is not well balanced with respect to the whole ship (its survival rate is around 6 percentage points higher than the real one)**.
    2. **Pclass**: int64, [1, 2, 3], no null values. It indicates the ticket class.
    3. **Age**: float64, [0.42 to 80, continuous], **with NaN values**. Under 1 the age is fractional; if the **age is estimated, it is in the form xx.5** (info from Kaggle).
    4. **Sex**: object, ['male', 'female'], no null values. Shouldn't we convert it to categorical?
    5. **SibSp**: int64, [0, 1, 2, 3, 4, 5, 8]. **I have doubts about this one**, because the Kaggle documentation says: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister / Spouse = husband, wife (mistresses and fiancés were ignored). **So, does it mean the number of related people (through sibling or spouse ties)?**
    6. **Parch**: int64, [0, 1, 2, 3, 4, 5, 6]. This is a direct relationship: The dataset defines family relations in this way... Parent = mother, father / Child = daughter, son, stepdaughter, stepson / Some children travelled only with a nanny, therefore parch=0 for them.
    7. **Cabin**: object, letter indicating the deck (A, B, C, D, E) + number. There are NaNs! **Is deck A the most luxurious one?** Do the numbers also indicate location?
    8. **Ticket**: ticket number, object, letters + numbers, no null data, no evident pattern.
    9. **Fare**: float64, [0 to 512.3292], no null data. Passenger fare.
    10. **Name**: object (string), [Surname, Name (what is inside the parentheses?)]. Can we use this to relate people?
    11. **Embarked**: object, ['S', 'C', 'Q', nan], with null values. Can we use the ticket number to fill in the missing Embarked values?
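
The comparison of survival rates made for Survived can be spelled out as a short calculation. This sketch only reuses the figures quoted in this notebook (1502 deaths out of 2224 people on board, and a rounded training mean of 0.38); it does not recompute anything from `train.csv`.

```python
# Figures quoted in this notebook: 1502 deaths out of 2224 people on board,
# and a mean of Survived of about 0.38 in df_train.
deaths, on_board = 1502, 2224
overall_rate = 1 - deaths / on_board
train_rate = 0.38  # rounded value quoted in the text, not recomputed here

# Difference between the training sample and the whole ship.
gap = train_rate - overall_rate
print(f"overall: {overall_rate:.3f}, train: {train_rate:.2f}, gap: {gap:.3f}")
```

Keep in mind that the overall figure covers passengers and crew, so the two rates do not describe exactly the same population.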
    
    
**NaNs or missing Values** --> Cabin, Age and Embarked
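
To see how many values are missing in each column, and not only whether any are missing, one option is `isnull().sum()` together with `isnull().mean()`. The sketch below uses a few invented rows rather than `df_train`, so the counts it prints are not the real ones.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; the real notebook would use df_train.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
})

# Count of missing values and share of rows affected, per column.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "share": toy.isnull().mean(),
})

# Keep only the columns that actually have gaps, worst first.
missing = missing[missing["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(missing)
```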

# Let's do some basic pivot tables and see what we have

In [84]:
# Convert 'Sex' to categorical
df_train['Sex']=df_train['Sex'].astype('category')

In [100]:
table1 = pd.pivot_table(df_train, index=['Pclass','Sex'],values=['Age','Fare','Parch','SibSp','Survived'],aggfunc='mean')
print(table1)

                     Age        Fare     Parch     SibSp  Survived
Pclass Sex                                                        
1      female  34.611765  106.125798  0.457447  0.553191  0.968085
       male    41.281386   67.226127  0.278689  0.311475  0.368852
2      female  28.722973   21.970121  0.605263  0.486842  0.921053
       male    30.740707   19.741782  0.222222  0.342593  0.157407
3      female  21.750000   16.118810  0.798611  0.895833  0.500000
       male    26.507589   12.661633  0.224784  0.498559  0.135447


This table is very interesting, and it shows that:
1. There are notable differences in the mean values of **age, fare and survived depending on sex and Pclass**.
    1. This can help us fill in the missing values of Age --> we can take the mean that corresponds to the passenger's Pclass and sex.
    2. There is a clear gradient in fare.
2. The relationship of **Pclass and Sex with Survived** is very interesting.
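
The idea of filling a missing Age with the mean of its Pclass and Sex group could be sketched as follows. The rows and the `AgeFilled` column name are invented for this example, and `groupby(...).transform("mean")` is just one way to do it.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; the real notebook would use df_train.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 30.0],
})

# Mean Age of each (Pclass, Sex) group, aligned row by row with the frame.
group_mean = toy.groupby(["Pclass", "Sex"])["Age"].transform("mean")

# Fill only the missing ages; known ages are left untouched.
toy["AgeFilled"] = toy["Age"].fillna(group_mean)
print(toy)
```

Writing to a separate column keeps the original Age available for comparison; overwriting `Age` directly would work as well.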

How could you add a column with the total number of individuals in each class?

In [103]:
table1 = pd.pivot_table(df_train, index=['Pclass','Sex'],aggfunc=len, values='Survived')
print('Here Survived counts both those who survived and those who did not: \n',table1)

Here Survived counts both those who survived and those who did not: 
                Survived
Pclass Sex             
1      female        94
       male         122
2      female        76
       male         108
3      female       144
       male         347


In [109]:
table1 = pd.pivot_table(df_train, index=['Survived','Sex'],columns='Pclass',aggfunc=np.mean)
print(table1)

                       Age                              Fare             \
Pclass                   1          2          3           1          2   
Survived Sex                                                              
0        female  25.666667  36.000000  23.818182  110.604167  18.250000   
         male    44.581967  33.369048  27.255814   62.894910  19.488965   
1        female  34.939024  28.080882  19.329787  105.978159  22.288989   
         male    36.248000  16.022000  22.274211   74.637320  21.095100   

                               Parch                     PassengerId  \
Pclass                   3         1         2         3           1   
Survived Sex                                                           
0        female  19.773093  1.333333  0.166667  1.097222  325.000000   
         male    12.204469  0.259740  0.142857  0.213333  413.623377   
1        female  12.464526  0.428571  0.642857  0.500000  473.967033   
         male    15.579696  0.311111  0.64