# Titanic - Machine Learning from Disaster

### Objective
Create a model that predicts which passengers survived the Titanic shipwreck.

### Data
The *train.csv* dataset is used to train the machine learning models, and the *test.csv* dataset is used to check how well the models perform on unseen data.

**DATA DICTIONARY**

|Variable|Definition|Key or Observation|
|:------:|:--------:|----------------|
|PassengerId|Passenger identification||
|Survived|Survival|0 = No, 1 = Yes|
|Pclass|Ticket class|1 = 1st/Upper, 2 = 2nd/Middle, 3 = 3rd/Lower|
|Name|Passenger name||
|Sex|Sex|male or female|
|Age|Age in years|Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5|
|SibSp|# of siblings / spouses aboard the Titanic|Sibling = brother, sister, stepbrother, stepsister <br> Spouse = husband, wife (mistresses and fiancés were ignored)|
|Parch|# of parents / children aboard the Titanic|Parent = mother, father <br> Child = daughter, son, stepdaughter, stepson <br> Some children travelled only with a nanny, therefore parch=0 for them.|
|Ticket|Ticket number||
|Fare|Passenger fare||
|Cabin|Cabin number||
|Embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|
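
The coded columns in the dictionary above can be turned into readable labels when inspecting the data. The following is a minimal sketch, assuming a small made-up frame in place of the real dataset; the names `survived_names`, `embarked_names`, `toy` and `decoded` are illustrative and not part of the original notebook.

```python
import pandas

# lookup tables taken from the "Key or Observation" column of the data dictionary
survived_names = {0: 'No', 1: 'Yes'}
embarked_names = {'C': 'Cherbourg', 'Q': 'Queenstown', 'S': 'Southampton'}

# toy stand-in for the real data (values are illustrative, not read from train.csv)
toy = pandas.DataFrame({'Survived': [0, 1, 1], 'Embarked': ['S', 'C', 'Q']})

# replace each code with its label; the original frame is left unchanged
decoded = toy.assign(
    Survived=toy['Survived'].map(survived_names),
    Embarked=toy['Embarked'].map(embarked_names),
)
print(decoded)
```

Codes that are missing from a lookup table come back as `NaN` from `map`, which makes unexpected values easy to spot.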

### 1. Setting up the machine environment


#### 1.1 Importing packages

In [20]:
import pandas # for data manipulation and analysis
import numpy # for numeric, matrix, array, logical operations

#### 1.2 Importing datasets

In [5]:
train = pandas.read_csv('train.csv', sep=',')
test = pandas.read_csv('test.csv', sep=',')

### 2. Analysing raw data

#### 2.1 Data structure 

In [5]:
# preview the first 10 rows of the training data
train.head(10)

| |PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
|:-:|:-:|:-:|:-:|---|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|0|1|0|3|Braund, Mr. Owen Harris|male|22.0|1|0|A/5 21171|7.25|NaN|S|
|1|2|1|1|Cumings, Mrs. John Bradley (Florence Briggs Th...|female|38.0|1|0|PC 17599|71.2833|C85|C|
|2|3|1|3|Heikkinen, Miss. Laina|female|26.0|0|0|STON/O2. 3101282|7.925|NaN|S|
|3|4|1|1|Futrelle, Mrs. Jacques Heath (Lily May Peel)|female|35.0|1|0|113803|53.1|C123|S|
|4|5|0|3|Allen, Mr. William Henry|male|35.0|0|0|373450|8.05|NaN|S|
|5|6|0|3|Moran, Mr. James|male|NaN|0|0|330877|8.4583|NaN|Q|
|6|7|0|1|McCarthy, Mr. Timothy J|male|54.0|0|0|17463|51.8625|E46|S|
|7|8|0|3|Palsson, Master. Gosta Leonard|male|2.0|3|1|349909|21.075|NaN|S|
|8|9|1|3|Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)|female|27.0|0|2|347742|11.1333|NaN|S|
|9|10|1|2|Nasser, Mrs. Nicholas (Adele Achem)|female|14.0|1|0|237736|30.0708|NaN|C|


In [8]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
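
The non-null counts above imply missing values in three columns: 177 in Age (891 - 714), 687 in Cabin (891 - 204) and 2 in Embarked (891 - 889). These gaps can also be counted directly. A minimal sketch, assuming a small made-up frame in place of `train`; the names `toy`, `missing_count`, `missing_share` and `missing` are illustrative.

```python
import numpy
import pandas

# toy stand-in for `train` (values are illustrative, not read from train.csv)
toy = pandas.DataFrame({
    'Age': [22.0, numpy.nan, 26.0, numpy.nan],
    'Cabin': [None, 'C85', None, None],
    'Embarked': ['S', 'C', 'S', None],
})

# number of missing values per column
missing_count = toy.isna().sum()

# share of missing values per column, in percent
missing_share = (toy.isna().mean() * 100).round(2)

# side-by-side summary, one row per column of the toy frame
missing = pandas.DataFrame({'Missing': missing_count, 'Missing (%)': missing_share})
print(missing)
```

Running the same two expressions on `train` instead of `toy` should reproduce the counts derived from `train.info()`.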


#### 2.2 Basic stats

**SURVIVORS x VICTIMS**

In [39]:
# calculating totals of each group
total = train['Survived'].size
survivors = train['Survived'].sum()
victims = total - survivors

# % of each group
survivors_percentage = str(round(survivors/total*100,2)) + '%'
victims_percentage = str(round(victims/total*100,2)) + '%'

print('# of survivors: ',survivors,' (',survivors_percentage,')', sep='')
print('# of victims: ',victims,' (',victims_percentage,')', sep='')

# of survivors: 342 (38.38%)
# of victims: 549 (61.62%)
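
The same counts and percentages can be obtained in one step with `value_counts`. This is an alternative sketch rather than the notebook's own method, shown on a small made-up frame; `toy`, `counts`, `shares` and `summary` are illustrative names.

```python
import pandas

# toy stand-in for `train` (values are illustrative, not read from train.csv)
toy = pandas.DataFrame({'Survived': [0, 1, 1, 0, 0]})

# absolute counts per outcome (0 = victim, 1 = survivor)
counts = toy['Survived'].value_counts()

# relative frequencies per outcome, in percent
shares = (toy['Survived'].value_counts(normalize=True) * 100).round(2)

# combine both into one labelled table
summary = pandas.DataFrame({'Count': counts, 'Percentage': shares})
summary = summary.rename(index={0: 'Victims', 1: 'Survivors'})
print(summary)
```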


**PASSENGERS BY TICKET CLASS**

In [36]:
# get unique values and counts of each ticket class
unique, count = numpy.unique(train['Pclass'], return_counts=True)

# put unique values and counts in a single array
pclass_unique_count = numpy.asarray((unique, count)).T

# retrieve counts for each class
pclass_1 = pclass_unique_count[0][1]
pclass_2 = pclass_unique_count[1][1]
pclass_3 = pclass_unique_count[2][1]

# how many survived in each class
survivors_by_class = train.groupby('Pclass')['Survived'].sum()
pclass_1_survivors = survivors_by_class[1]
pclass_2_survivors = survivors_by_class[2]
pclass_3_survivors = survivors_by_class[3]

# how many victims in each class
pclass_1_victims = pclass_1 - pclass_1_survivors
pclass_2_victims = pclass_2 - pclass_2_survivors
pclass_3_victims = pclass_3 - pclass_3_survivors

# % of survivors per class
pclass_1_survivors_percentage = str(round(pclass_1_survivors/pclass_1*100,2)) + '%'
pclass_2_survivors_percentage = str(round(pclass_2_survivors/pclass_2*100,2)) + '%'
pclass_3_survivors_percentage = str(round(pclass_3_survivors/pclass_3*100,2)) + '%'


# table of results
pandas.DataFrame({'Ticket': ['Class 1','Class 2','Class 3'],
                  'Survivors': [pclass_1_survivors, pclass_2_survivors, pclass_3_survivors],
                  'Victims': [pclass_1_victims, pclass_2_victims, pclass_3_victims],
                  'Total': [pclass_1, pclass_2, pclass_3],
                  "Percentage of Survivors": [pclass_1_survivors_percentage, pclass_2_survivors_percentage, pclass_3_survivors_percentage]})

| |Ticket|Survivors|Victims|Total|Percentage of Survivors|
|:-:|:-:|:-:|:-:|:-:|:-:|
|0|Class 1|136|80|216|62.96%|
|1|Class 2|87|97|184|47.28%|
|2|Class 3|119|372|491|24.24%|
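
The per-class table above can also be built without one variable per class, by letting `groupby` compute the survivors and totals together. This is an alternative sketch on a small made-up frame, not the notebook's original approach; `toy` and `by_class` are illustrative names.

```python
import pandas

# toy stand-in for `train` (values are illustrative, not read from train.csv)
toy = pandas.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

# survivors (sum of the 0/1 flag) and passengers (row count) per ticket class
by_class = toy.groupby('Pclass')['Survived'].agg(Survivors='sum', Total='count')

# derive the remaining columns from those two
by_class['Victims'] = by_class['Total'] - by_class['Survivors']
by_class['Percentage of Survivors'] = (
    by_class['Survivors'] / by_class['Total'] * 100
).round(2)
print(by_class)
```

Because the result is indexed by `Pclass`, it keeps working if a class is absent from the data, whereas positional lookups into an array of unique values would shift.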
