## Titanic Dataset
-------------------------
- This is one of the most intuitive and easy-to-understand datasets. The aim is to predict whether a particular passenger survived.
> All of us are familiar with the tragedy of 1912: the sinking of the Titanic. Although there was some element of luck involved in surviving the sinking, some groups, such as women, children and the upper class, had a better chance of survival because they were given preference in the lifeboats.
>- The data contains features describing each passenger, together with whether or not they survived.
>- On the basis of that data we have to build a model that predicts the most likely outcome for other passengers.

### Exercises we will cover in this notebook
- Clean the data:
  - Convert the string features into numerical values [ordered and unordered]
  - Fill the missing data
- Build a Random Forest model and predict on the test dataset

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier

In [2]:
train_df = pd.read_csv('data/train.csv')

In [3]:
train_df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [4]:
print(train_df.shape)
# Convert the string features to integer codes.
# female = 0, male = 1
train_df['Gender'] = train_df['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
# A new Gender column has been added to the training DataFrame
print(train_df.shape)

(891, 12)
(891, 13)
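
The dictionary-based encoding above relies on every value in `Sex` matching a key exactly. The following is a minimal sketch on a made-up three-row frame (the `toy` frame and its values are illustrative assumptions, not part of the Titanic data) showing what `map` does with an unexpected value and one possible way to guard against it:

```python
import pandas as pd

# Toy frame standing in for the real data; the third value is deliberately unexpected.
toy = pd.DataFrame({'Sex': ['female', 'male', 'Male']})

# Series.map returns NaN for any value that is not a key of the dictionary.
gender = toy['Sex'].map({'female': 0, 'male': 1})
unmapped = int(gender.isnull().sum())
print(unmapped)

# Normalising the case first removes the mismatch, so the cast to int is safe.
toy['Gender'] = toy['Sex'].str.lower().map({'female': 0, 'male': 1}).astype(int)
print(toy['Gender'].tolist())
```

Checking for unmapped values before calling `.astype(int)` is worthwhile because the cast fails when the column still contains `NaN`.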


In [5]:
# Filling the missing values
# Embarked takes the values 'C', 'Q', 'S'
# The ports are unordered categories: port "2" is not twice port "1", so the integer codes below are labels only, not magnitudes.

# All missing Embarked -> assume the most common port
# Assign through .loc so the original DataFrame is modified rather than a temporary copy
if train_df['Embarked'].isnull().any():
    train_df.loc[train_df['Embarked'].isnull(), 'Embarked'] = train_df['Embarked'].mode()[0]

Ports = list(enumerate(np.unique(train_df['Embarked'])))    # determine all values of Embarked,
Ports_dict = { name : i for i, name in Ports }              # set up a dictionary in the form  port name : index
train_df.Embarked = train_df.Embarked.map( lambda x: Ports_dict[x]).astype(int)     # Convert all Embarked strings to int

# All the ages with no data -> use the median of all ages
median_age = train_df['Age'].dropna().median()
if len(train_df.Age[ train_df.Age.isnull() ]) > 0:
    train_df.loc[ (train_df.Age.isnull()), 'Age'] = median_age
# Sex is now encoded in Gender, and the remaining text and ID columns are not used as features, so drop them
train_df = train_df.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'PassengerId'], axis=1) 


In [6]:
print(train_df.shape)
train_df.head(3)

(891, 8)


Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Embarked,Gender
0,0,3,22.0,1,0,7.25,2,1
1,1,1,38.0,1,0,71.2833,0,0
2,1,3,26.0,0,0,7.925,2,0
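
The `Embarked` column above holds integer codes even though the ports have no natural order. Tree-based models usually cope with such codes, but one alternative is one-hot encoding, which gives each port its own indicator column. The following is a minimal sketch using `pd.get_dummies` on a made-up frame (the `toy` frame is an assumption for illustration; the notebook itself keeps the integer codes):

```python
import pandas as pd

# Toy frame; the port letters follow the 'C', 'Q', 'S' values of the Embarked column.
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# One indicator column per port, so no artificial ordering is implied.
dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked').astype(int)

# Replace the original string column with the indicator columns.
encoded = pd.concat([toy.drop('Embarked', axis=1), dummies], axis=1)
print(encoded.columns.tolist())
print(encoded.iloc[0].tolist())
```

Each row has exactly one indicator set to 1, so the model sees three independent yes/no features instead of a single number that suggests a ranking.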


In [7]:
# TEST DATA
test_df = pd.read_csv('data/test.csv', header=0)        # Load the test file into a dataframe

# Apply the same encoding to the Sex column
test_df['Gender'] = test_df['Sex'].map( {'female': 0, 'male': 1} ).astype(int)

# Embarked takes the values 'C', 'Q', 'S'
# Assign through .loc so the original DataFrame is modified rather than a temporary copy
if test_df['Embarked'].isnull().any():
    test_df.loc[test_df['Embarked'].isnull(), 'Embarked'] = test_df['Embarked'].mode()[0]

# Again convert all Embarked strings to int, reusing the dictionary built from the training data
test_df.Embarked = test_df.Embarked.map( lambda x: Ports_dict[x]).astype(int)

# Filling the missing data
# All the ages with no data -> use the median of all ages
median_age = test_df['Age'].dropna().median()
if len(test_df.Age[ test_df.Age.isnull() ]) > 0:
    test_df.loc[ (test_df.Age.isnull()), 'Age'] = median_age

# All the missing Fares -> assume median of their respective class
if len(test_df.Fare[ test_df.Fare.isnull() ]) > 0:
    median_fare = np.zeros(3)
    for f in range(0,3):                                              # loop 0 to 2
        median_fare[f] = test_df[ test_df.Pclass == f+1 ]['Fare'].dropna().median()
    for f in range(0,3):                                              # loop 0 to 2
        test_df.loc[ (test_df.Fare.isnull()) & (test_df.Pclass == f+1 ), 'Fare'] = median_fare[f]

# Collect the test data's PassengerIds before dropping it
ids = test_df['PassengerId'].values
# Removing redundant columns
test_df = test_df.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'PassengerId'], axis=1) 

In [8]:
test_df.head(3)

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Embarked,Gender
0,3,34.5,0,0,7.8292,1,1
1,3,47.0,1,0,7.0,2,0
2,2,62.0,0,0,9.6875,1,1
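
The per-class fare imputation in the cell above loops over the three classes explicitly. The same idea can be sketched more compactly with `groupby(...).transform('median')`, shown here on a made-up frame (the `toy` values are assumptions chosen so the medians are easy to verify by hand):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing fare in class 1 and one in class 3.
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Fare':   [80.0, 60.0, np.nan, 8.0, np.nan, 7.0],
})

# For every row, the median fare of that row's class (NaN values are skipped).
class_median = toy.groupby('Pclass')['Fare'].transform('median')

# Fill only the missing fares with the median of their own class.
toy['Fare'] = toy['Fare'].fillna(class_median)
print(toy['Fare'].tolist())
```

The class 1 gap is filled with 70.0 (the median of 80 and 60) and the class 3 gap with 7.5 (the median of 8 and 7), which matches what the explicit loop computes for each class.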


In [9]:
# The data is now ready to go, so let's fit on the training set, then predict on the test set!
# Convert back to NumPy arrays
train_data = train_df.values
test_data = test_df.values


print('Training...')
forest = RandomForestClassifier(n_estimators=100)
# Features are columns 1 onward; the Survived label is column 0
forest = forest.fit( train_data[:, 1:], train_data[:, 0] )

print('Predicting...')
output = forest.predict(test_data).astype(int)

Training...
Predicting...
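
The cell above fits on every training row and predicts on test rows that have no labels, so it gives no indication of how accurate the model is. One common way to get an estimate is k-fold cross-validation on the training data. The following is a minimal sketch on synthetic data (the random features, the synthetic label rule, and `random_state=0` are all assumptions for illustration; they are not the Titanic data or the notebook's settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 200 rows and 7 feature columns, like the cleaned feature set.
rng = np.random.RandomState(0)
X = rng.rand(200, 7)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # made-up label, not real survival data

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Accuracy on each of 5 held-out folds; the model is refitted for every fold.
scores = cross_val_score(forest, X, y, cv=5)
print(scores.mean())
```

With the notebook's own arrays, `X` and `y` would correspond to `train_data[:, 1:]` and `train_data[:, 0]`.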


In [10]:
print('Output:')
pd.concat([pd.DataFrame(ids,columns = ['Passenger Ids']),pd.DataFrame(output,columns =['Survived'])],axis=1)

Output:


Unnamed: 0,Passenger Ids,Survived
0,892,0
1,893,0
2,894,0
3,895,1
4,896,0
5,897,0
6,898,0
7,899,0
8,900,1
9,901,0
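
If the predictions need to be saved, the two arrays can be combined into a single DataFrame and written as CSV. The following is a minimal sketch with placeholder values (the three ids and predictions are made up, and the `PassengerId` column name is an assumption taken from the header of the training file), writing to an in-memory buffer instead of a file:

```python
import io

import numpy as np
import pandas as pd

# Placeholder arrays standing in for `ids` and `output` from the cells above.
ids = np.array([892, 893, 894])
output = np.array([0, 0, 1])

# A dict of columns avoids concatenating two separate DataFrames.
submission = pd.DataFrame({'PassengerId': ids, 'Survived': output})

# index=False keeps the row index out of the file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file path to `to_csv` in place of the buffer writes the same content to disk.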
