# The Sinking of RMS Titanic


   <img src=Stöwer_Titanic.jpg>

- It was without doubt a tragedy (the very word "titanic" now evokes tragedy).
- The images of men and women panicking, with mixed feelings of fear, hope, love, anger, sadness and self-blame.
- It reflects a mix of human traits: arrogance, carelessness, courage, selflessness and selfishness.
- What made it more tragic is that there were not enough lifeboats to carry all the passengers, on the claim that the Titanic was unsinkable and that a ship that big would not sink before another ship came to the rescue.
- What made it even worse is that the crew members in charge allowed some boats to leave half full.
- In such difficult situations it is really hard to set rules or strict criteria for which passengers are to be saved.
- The captain of the Titanic ordered that women and children be given priority, but did the crew actually give priority to women and children?
- Of course, not all of those saved were women and children: many men were also saved, and many women did not make it.
- But how did the crew choose? What were the general criteria?
- Was priority given to men of first and/or second class? Or was it given to men whose children had no one else in the world to live with?

We will try to answer the questions above using the [Titanic Data](https://d17h27t6h515a5.cloudfront.net/topher/2016/September/57e9a84c_titanic-data/titanic-data.csv), which contains demographics and passenger information for 891 of the 2224 passengers and crew on board the Titanic. You can view a description of this dataset on the [Kaggle website](https://www.kaggle.com/c/titanic/data), where the data was obtained.

In [1]:
#importing essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


## Data Import and wrangling
As a first step, we will import our data and get it ready for exploratory data analysis.

### Data Import
Import the data and display the first few rows.

In [2]:
titanic_data=pd.read_csv("titanic-data.csv")
titanic_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


#### Some notes about the data:

pclass: A proxy for socio-economic status (SES)
- 1st = Upper
- 2nd = Middle
- 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

sibsp: The dataset defines family relations in this way...
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
- Some children travelled only with a nanny, therefore parch=0 for them.
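
The age encoding described in the notes above can be turned into a small check. The sketch below is only an illustration: `describe_age` is a hypothetical helper (it is not part of the dataset or of pandas), and it assumes the notes hold exactly as stated, i.e. values below 1 are fractional infant ages and values ending in .5 are estimates.

```python
import math


def describe_age(age):
    """Classify an Age value according to the dataset notes (hypothetical helper)."""
    # Missing ages show up as NaN once the CSV is loaded with pandas.
    if age is None or math.isnan(age):
        return "missing"
    # Ages below 1 are stored as fractions of a year.
    if age < 1:
        return "infant (fractional age)"
    # Estimated ages are stored as xx.5; .5 is exactly representable as a float,
    # so an equality test on the fractional part is safe here.
    if age % 1 == 0.5:
        return "estimated"
    return "reported"


for value in (0.42, 22.0, 28.5, float("nan")):
    print(value, "->", describe_age(value))
```

A helper like this could be applied to the Age column to see how many ages are estimates before relying on them.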

### Data wrangling 
Let's start by showing descriptive statistics for the DataFrame. This will show us whether there is any surprising data, or any data that needs to be fixed.

In [3]:
titanic_data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


The first thing we notice is that the Age column looks anomalous: its count (714) is lower than that of the other columns (891). Otherwise there don't seem to be any extreme data points (all values are within the expected range). We also need to convert the passenger ID to a string rather than an integer, since we don't need to perform any arithmetic on it. We will also create a numerical representation of "Sex" so that the descriptive statistics show whether the majority of passengers were male or female. Let's modify the data first, then look into the Age issue.
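
The gap in the Age count can be quantified directly by counting missing values per column. The sketch below is a minimal illustration on a hypothetical four-row `sample` frame built from rows shown in this notebook's outputs; in the notebook itself the same calls would be made on `titanic_data`.

```python
import numpy as np
import pandas as pd

# A few rows copied from the outputs in this notebook (illustrative sample only).
sample = pd.DataFrame({
    "PassengerId": [1, 2, 6, 18],
    "Age": [22.0, 38.0, np.nan, np.nan],
    "Fare": [7.25, 71.2833, 8.4583, 13.0],
})

# isnull() marks NaN cells; summing the booleans counts them per column.
missing_per_column = sample.isnull().sum()
# count() is the complement: the number of non-missing values per column,
# which is what describe() reports in its "count" row.
non_missing = sample.count()

print(missing_per_column)
```

On the full dataset, the describe() output above implies 891 - 714 = 177 missing ages.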

Now let's convert the passenger IDs to strings and add the Sex_numeric column.



In [4]:
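# Store the ID as a string under a new name and drop the original integer column
titanic_data['Passenger_Id'] = titanic_data['PassengerId'].astype(str)
del titanic_data['PassengerId']
# Encode Sex as 1 for female and 0 for male
titanic_data['Sex_numeric'] = (titanic_data['Sex'] == 'female').astype(int)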

Now we will investigate the records with a missing age.

In [5]:
titanic_data[titanic_data['Age'].isnull()].head(2)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Passenger_Id,Sex_numeric
5,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,6,0
17,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0,,S,18,0


It seems that the age of some passengers is actually missing (NaN value). We will leave the original DataFrame as it is and keep in mind, when drawing conclusions, that some passengers have missing data. To that end we will create a new DataFrame containing only the rows with a known age, and we will use it whenever age is a factor.

In [6]:
titanic_data_full_Age = titanic_data[titanic_data['Age'].notnull()]
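
As a side note, the same known-age subset can also be built with `dropna`. The sketch below compares the two approaches on a hypothetical four-row `sample` frame (rows taken from this notebook's outputs), so it is an illustration rather than a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Illustrative sample only; the notebook itself works on titanic_data.
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Moran, Mr. James",
        "Williams, Mr. Charles Eugene",
        "Heikkinen, Miss. Laina",
    ],
    "Age": [22.0, np.nan, np.nan, 26.0],
    "Survived": [0, 0, 1, 1],
})

# Boolean-mask filtering, as in the cell above.
by_mask = sample[sample["Age"].notnull()]
# dropna restricted to the Age column removes the same rows.
by_dropna = sample.dropna(subset=["Age"])
same_rows = by_mask.equals(by_dropna)

# pandas skips NaN by default, so the mean already ignores the missing ages.
mean_age = sample["Age"].mean()

print(same_rows, len(sample), len(by_dropna), mean_age)
```

Neither approach modifies the original frame, which matches the intent of keeping `titanic_data` intact and using the filtered copy only when age matters.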

In [7]:
titanic_data.describe()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_numeric
count,891.0,891.0,714.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208,0.352413
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429,0.47799
min,0.0,1.0,0.42,0.0,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104,0.0
50%,0.0,3.0,28.0,0.0,0.0,14.4542,0.0
75%,1.0,3.0,38.0,1.0,0.0,31.0,1.0
max,1.0,3.0,80.0,8.0,6.0,512.3292,1.0


Now the Passenger_Id column no longer appears in the descriptive statistics, while the Sex_numeric column does. Its mean of about 0.35 shows that roughly 35% of the passengers in the dataset were female, so the majority were male. The data is ready for exploration.
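
The reason the mean of Sex_numeric can be read as a proportion is that the mean of a 0/1 indicator equals the share of ones. A minimal sketch, using only the Sex values of the five rows shown by `head()` earlier (so the resulting share describes that tiny sample, not the full dataset):

```python
import pandas as pd

# Sex values of the first five rows displayed by head() (illustrative sample).
sex = pd.Series(["male", "female", "female", "female", "male"])

# Same encoding as Sex_numeric: 1 for female, 0 for male.
sex_numeric = (sex == "female").astype(int)

# The mean of a 0/1 column is the fraction of rows equal to 1.
share_female = sex_numeric.mean()

# value_counts(normalize=True) gives the same shares without the encoding step.
shares = sex.value_counts(normalize=True)

print(share_female)
print(shares)
```

Applied to the full data, the same reading turns the reported mean of 0.352413 into about 35% female and 65% male passengers.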