In [4]:
import pandas as pd
import sklearn

# Titanic Survival Prediction

This project is from Kaggle: https://www.kaggle.com/c/titanic/overview


#### All of the following code was written by Richard Xue.

## 1. Question Framing

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

Is it possible to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.)?

Objective: The models I decided to use for prediction are:
- Feature Engineering

## 2. Loading data

We have two datasets: train and test.
- train.csv contains passenger data, including whether each passenger survived
- test.csv contains passenger data, but we don't know whether they survived

#### Data Explanation

In [11]:
# Survived      Survival        0 = No, 1 = Yes
# Pclass        Ticket class    1 = 1st, 2 = 2nd, 3 = 3rd
# Sex           Sex
# Age           Age             in years
# SibSp         # of siblings / spouses aboard the Titanic
# Parch         # of parents / children aboard the Titanic
# Ticket        Ticket number
# Fare          Passenger fare
# Cabin         Cabin number
# Embarked      Port of Embarkation     C = Cherbourg, Q = Queenstown, S = Southampton

In [21]:
train = pd.read_csv('train.csv')
print(train.shape)
train.head()

(891, 12)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [9]:
test = pd.read_csv('test.csv')
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
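
To make the difference between the two files concrete, the sketch below compares their column names. It does not read the CSV files; `train_demo` and `test_demo` are hypothetical empty frames built only from the headers shown in the outputs above, so the snippet runs on its own.

```python
import pandas as pd

# Column names copied from the train.head() and test.head() outputs above.
train_cols = ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
              "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]
test_cols = ["PassengerId", "Pclass", "Name", "Sex", "Age",
             "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]

# Hypothetical stand-ins for the real frames (headers only, no rows).
train_demo = pd.DataFrame(columns=train_cols)
test_demo = pd.DataFrame(columns=test_cols)

# Columns present in train but absent from test.
only_in_train = train_demo.columns.difference(test_demo.columns).tolist()
print(only_in_train)
```

With the real data, the same check would be `train.columns.difference(test.columns)`; the only column it should report is the target, `Survived`.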


## 3. Data Cleaning 

In [17]:
# We first deal with null values in our dataframe

def null_columns(df):
    """
    Prints the name of every column that contains at least one NaN.

    Input: df - target dataframe
    Output: None (column names are printed, not returned)
    """
    for col in df.columns:
        # Check only this column instead of building a full NaN mask on every iteration
        if df[col].isna().any():
            print(col)


null_columns(train)

Age
Cabin
Embarked
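
Knowing which columns contain NaN is a start, but how many values are missing usually decides how to handle each one. The sketch below is one possible extension, not part of the original notebook: `null_summary` is a hypothetical helper, and `sample` is a tiny stand-in frame built from a few of the rows shown above so the snippet runs without `train.csv`.

```python
import pandas as pd

def null_summary(df):
    """Return the count and percentage of missing values for each column that has any."""
    counts = df.isna().sum()
    counts = counts[counts > 0]  # keep only columns with at least one NaN
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("n_missing", ascending=False)

# Stand-in data: PassengerId 1, 2, 3 and 6 from the outputs above.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 6],
    "Age": [22.0, 38.0, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "Q"],
})

summary = null_summary(sample)
print(summary)
```

Calling `null_summary(train)` on the full training set would give the same kind of table for `Age`, `Cabin`, and `Embarked`.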


In [18]:
# We see that three columns contain NaN. Let's fix them one by one.

train[['PassengerId', 'Age']]

Unnamed: 0,PassengerId,Age
0,1,22.0
1,2,38.0
2,3,26.0
3,4,35.0
4,5,35.0
...,...,...
886,887,27.0
887,888,19.0
888,889,
889,890,26.0


In [19]:
train[train['Age'].isna()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
26,27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
859,860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S
868,869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S
878,879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S
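
One common way to handle missing ages is to fill them with the median of the known ages. The sketch below illustrates that option only and is not necessarily the approach this notebook takes; `sample` is a stand-in frame using the first six passengers shown above (the sixth, "Moran, Mr. James", has no recorded age).

```python
import pandas as pd

# Stand-in data: PassengerId 1 to 6 with the Age values shown in the outputs above.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None],
})

# median() skips NaN by default, so it is computed from the five known ages.
median_age = sample["Age"].median()

# assign() returns a new frame and leaves the original untouched.
filled = sample.assign(Age=sample["Age"].fillna(median_age))
print(filled)
```

The median is less sensitive to a few very old or very young passengers than the mean, which is why it is often preferred for a skewed column like age.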
