# Kaggle Titanic competition

This is my attempt at cracking the Titanic challenge posted by Kaggle. I primarily code in Python, so it's my obvious language choice.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, they ask us to complete the analysis of what sorts of people were likely to survive.

Stick around, it should be fun!

OK, we first import the pandas library and read in the training data.

Download the titanic dataset here: <https://www.kaggle.com/c/titanic>

Pandas has a nice built-in function to read CSV files: **read_csv**

In [2]:
import pandas as pd

titanic = pd.read_csv('/home/vagrant/titanic/dataset/train.csv')

The **titanic** variable in our case is known as a data frame (df). Representing the data as a data frame makes manipulating it much easier, as you'll find out later!

Print all the columns in the data frame, for reference!

In [3]:
print(titanic.columns)

Index([u'PassengerId', u'Survived', u'Pclass', u'Name', u'Sex', u'Age',
       u'SibSp', u'Parch', u'Ticket', u'Fare', u'Cabin', u'Embarked'],
      dtype='object')


Use **describe** to get basic statistics of all the columns like count, mean, max, min etc. 

In [4]:
print(titanic.describe())

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  


**Important:** As you can see, the count for the `Age` column (714) is lower than for every other column (891), so 177 rows have no age recorded. We need to make a decision here: either discard those rows or fill the gaps with some value.

We have missing data, in other words!
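
A quick way to confirm which columns have gaps is to count the missing cells directly instead of reading them off the `describe` output. The sketch below uses a small made-up frame (`toy` and its values are assumptions, not the Kaggle data):

```python
import numpy as np
import pandas as pd

# Made-up toy rows standing in for the real training data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 8.05, 53.10, 8.46],
})

# isnull() marks each missing cell; summing per column counts them.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Running the same call on the real `titanic` frame should list every column with its number of missing values.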

Here, we'll choose the latter: we'll fill the empty values with the median of the `Age` column.

In [5]:
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
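
To make the trade-off concrete, here is a minimal sketch of both options on made-up ages (the `toy` frame is an assumption for illustration only). Dropping rows shrinks the data set, while filling with the median keeps every row:

```python
import numpy as np
import pandas as pd

# Made-up ages; two of the five are missing.
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan, 35.0]})

# Option 1: discard rows with a missing Age (only three rows survive).
dropped = toy.dropna(subset=["Age"])

# Option 2: keep every row and fill the gaps with the median of the known ages.
median_age = toy["Age"].median()
filled = toy["Age"].fillna(median_age)

print(len(dropped), median_age, filled.isnull().sum())
```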

## Non-numeric Data

Several of our columns are non-numeric, which is a problem when it comes time to make predictions -- we can't feed non-numeric columns into a machine learning algorithm and expect it to make sense of them.

We have to either exclude our non-numeric columns (Name, Sex, Cabin, Embarked, and Ticket) when we train our algorithm, or find a way to convert them to numeric columns.

We'll ignore the Ticket, Cabin, and Name columns, since there isn't much information we can extract from them. Most of the values in the Cabin column are missing (only 204 of the 891 rows have a value), and it likely isn't a particularly informative column in the first place. The Ticket and Name columns are unlikely to tell us much without some domain knowledge about what the ticket numbers mean, and about which names correlate with characteristics like large or rich families.

## Converting the Sex column


The Sex column is non-numeric, but we want to keep it around -- it could be very informative. We can convert it to a numeric column by replacing each gender with a numeric code.

In our case, we'll replace `female` values with 1 and `male` values with 0.

In [6]:
# Find all the unique genders -- the column appears to contain only male and female.
print(titanic["Sex"].unique())

# Replace all the occurrences of male with the number 0.
# Replace all the occurrences of female with the number 1.
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1

['male' 'female']
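
As an alternative sketch (not what the cell above does), the same conversion can be written as a single dictionary lookup with `Series.map`. The `toy` frame and the `sex_codes` name are made up for illustration:

```python
import pandas as pd

# Made-up rows; the real column holds the same two labels.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# One lookup table instead of two separate .loc assignments.
sex_codes = {"male": 0, "female": 1}
toy["Sex"] = toy["Sex"].map(sex_codes)

print(toy["Sex"].tolist())
```

One thing to keep in mind with `map`: any label that is missing from the dictionary becomes NaN, so it is worth checking `unique()` first, as the cell above does.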


## Converting the Embarked column


We now can convert the Embarked column to codes the same way we converted the Sex column. The unique values in Embarked are S, C, Q, and missing (nan). Each letter is an abbreviation of an embarkation port name.

In [7]:
print(titanic["Embarked"].unique())

# Replace missing values with S (assuming S is the most common embarkation port).
# Then replace all the occurrences of S, C and Q with the numbers 0, 1 and 2.

titanic["Embarked"] = titanic["Embarked"].fillna('S')
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2

['S' 'C' 'Q' nan]
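
The choice of `S` as the fill value rests on it being the most common port. A small sketch of how that assumption could be checked before filling, using made-up rows (`toy`, `most_common` and `port_codes` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Made-up ports with one missing value.
toy = pd.DataFrame({"Embarked": ["S", "C", "S", np.nan, "Q", "S"]})

# value_counts() skips missing values by default, so idxmax() returns
# the most frequent port among the known ones.
most_common = toy["Embarked"].value_counts().idxmax()

# Fill the gap with that port, then convert the letters to codes.
port_codes = {"S": 0, "C": 1, "Q": 2}
toy["Embarked"] = toy["Embarked"].fillna(most_common).map(port_codes)

print(most_common, toy["Embarked"].tolist())
```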


## Machine Learning section

We can now use linear regression to make predictions on our training set.

We can use the excellent `scikit-learn` library to make predictions. We'll use a helper from sklearn to split the data up into cross validation folds, and then train an algorithm for each fold, and make predictions.


In [14]:
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import KFold

#The columns we'll use to predict the target

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

#Initialize our algorithm class
alg = LinearRegression()

#Generate cross validation folds for the titanic dataset. It returns the row
#indices corresponding to train and test.

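#We set random_state for reproducibility, although it only affects the split when
#shuffle=True. Without shuffling, the folds are consecutive blocks of rows, so the
#splits are the same every time we run this.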

kf = KFold(titanic.shape[0], n_folds=3,random_state=1)

predictions = []
for train, test in kf:
    train_pred = (titanic[predictors].iloc[train,:])
    # The target we're using to train.
    train_target = titanic["Survived"].iloc[train]
    
    # Train the algorithm
    alg.fit(train_pred, train_target)
    
    #Test the model
    test_pred = alg.predict(titanic[predictors].iloc[test,:])
    
    predictions.append(test_pred)
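
The cell above relies on the `sklearn.cross_validation` module, which newer scikit-learn releases replaced with `sklearn.model_selection`. The following is a hedged sketch of the same loop with the newer API, run on synthetic stand-in data (`X`, `y` and the random values are assumptions, not the Titanic set):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data with two of the predictor columns.
rng = np.random.RandomState(1)
X = pd.DataFrame({
    "Pclass": rng.randint(1, 4, size=30),
    "Sex": rng.randint(0, 2, size=30),
})
y = pd.Series(rng.randint(0, 2, size=30), name="Survived")

alg = LinearRegression()

# Without shuffling, the three folds are consecutive blocks of rows.
kf = KFold(n_splits=3)

fold_predictions = []
for train, test in kf.split(X):
    # Fit on the training rows of this fold, predict the held-out rows.
    alg.fit(X.iloc[train, :], y.iloc[train])
    fold_predictions.append(alg.predict(X.iloc[test, :]))

# Because the folds are in row order, concatenating restores the original order.
all_predictions = np.concatenate(fold_predictions, axis=0)
print(all_predictions.shape)
```

The main differences are that `KFold` takes `n_splits` instead of the row count and `n_folds`, and that the index pairs come from `kf.split(X)` rather than from iterating over `kf` itself.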

## Evaluating Error

Now that we have our predictions, we need to evaluate the error.

We'll first define our error metric. 

The metric will basically involve finding the number of values in `predictions` that are exactly the same as their counterparts in `titanic["Survived"]`, then dividing by the total number of people.

Before we do this, we need to combine the 3 sets of predictions into one column.

In [15]:
import numpy as np

# The predictions are in three separate numpy arrays. Concatenate them into one.

predictions = np.concatenate(predictions, axis=0)

# Map predictions to outcomes.

predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0

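# Count the predictions that match the actual outcome.
# (Named `correct` so that we don't shadow the built-in sum function.)
correct = 0

for i in range(len(predictions)):
    if predictions[i] == titanic["Survived"].iloc[i]:
        correct += 1

# Calculate accuracy. float() avoids integer division under Python 2,
# which would otherwise truncate the result to 0.

accuracy = float(correct) / len(predictions)

print(accuracy) # 78.3% (printed as a fraction, roughly 0.783)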
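
The counting loop can also be expressed as a vectorized comparison. This is a minimal sketch on made-up numbers (`raw` and `actual` are hypothetical arrays, not the notebook's predictions):

```python
import numpy as np

# Made-up regression outputs and true outcomes.
raw = np.array([0.9, 0.2, 0.6, 0.4, 0.5])
actual = np.array([1, 0, 0, 0, 1])

# Values above 0.5 count as "survived" (1), everything else as 0.
predicted = (raw > 0.5).astype(int)

# The mean of the element-wise matches is the fraction predicted correctly.
accuracy = (predicted == actual).mean()
print(accuracy)
```

Here three of the five predictions match, so the accuracy is 0.6. The result is always a float, so the integer-division pitfall does not arise.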

## Logistic Regression

We have our first predictions! They aren't very good, though, with only 78.3% accuracy.

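Linear regression can output values below 0 or above 1, which makes no sense for a survival probability. Logistic regression fixes this by passing the linear output through the logistic function, which squeezes it into the range between 0 and 1.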
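
To illustrate the squeezing step, here is a small sketch of the logistic (sigmoid) function applied to some made-up linear outputs. The `sigmoid` helper is written here for illustration and is not part of the notebook:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up outputs of a linear model, including values outside [0, 1].
linear_outputs = np.array([-4.0, 0.0, 4.0])
probabilities = sigmoid(linear_outputs)

print(probabilities)
```

A linear output of 0 maps to exactly 0.5, large negative outputs approach 0, and large positive outputs approach 1.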

Sklearn has a class for logistic regression that we can use.

In [20]:
from sklearn import cross_validation

from sklearn.linear_model import LogisticRegression

# Initialize our algorithm

alg = LogisticRegression(random_state=1)

# Compute the accuracy for all the cross validation folds.

scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

# Take the mean of the scores (because we have one for each fold)

print(scores.mean())

0.787878787879


## Process our test set

To make a submission, we need to do the exact same steps on the test data that we took on the training data.



In [22]:
titanic_test = pd.read_csv("/home/vagrant/titanic/dataset/test.csv")

titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())

titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1

titanic_test["Embarked"] = titanic_test["Embarked"].fillna('S')

titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2

titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median())
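
Since the same steps have to run on both files, one way to avoid repeating them is a small helper. The sketch below is an assumption about how that could look, not the notebook's own code: `preprocess` and the toy frames are hypothetical, and the `Fare` fill is left out for brevity.

```python
import numpy as np
import pandas as pd

def preprocess(df, age_median):
    """Hypothetical helper: apply the cleaning steps to a copy of df."""
    df = df.copy()
    df["Age"] = df["Age"].fillna(age_median)
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    df["Embarked"] = df["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
    return df

# Made-up stand-ins for the training and test files.
toy_train = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0],
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 50.0],
    "Sex": ["female", "male"],
    "Embarked": [np.nan, "Q"],
})

# The median comes from the training data and is reused for the test data.
age_median = toy_train["Age"].median()
clean_train = preprocess(toy_train, age_median)
clean_test = preprocess(toy_test, age_median)

print(clean_test)
```

Passing the training median in explicitly mirrors the cell above, where the test ages are filled with `titanic["Age"].median()`.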

## Generate a submission file

Now that we have everything, it's time to train an algorithm on the full training data and use it to make predictions on the test data.

Finally, we'll generate a CSV file with the predictions and passenger IDs.

In [23]:
# Initialize the algorithm class
alg = LogisticRegression(random_state=1)

# Train the algorithm using all the training data
alg.fit(titanic[predictors], titanic["Survived"])

# Make predictions using the test set.
predictions = alg.predict(titanic_test[predictors])

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pd.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })
    
submission.to_csv("/vagrant/kaggle.csv", index=False)
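
Before uploading, it can help to check that the file has the expected two columns and no index column. This sketch writes a made-up submission to an in-memory buffer instead of a file (`toy_ids` and `toy_predictions` are hypothetical values):

```python
import io

import numpy as np
import pandas as pd

# Made-up passenger IDs and predictions.
toy_ids = pd.Series([892, 893, 894])
toy_predictions = np.array([0, 1, 0])

submission = pd.DataFrame({
    "PassengerId": toy_ids,
    "Survived": toy_predictions,
})

# Write to a string buffer so the CSV text can be inspected directly.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()

print(csv_lines)
```

With `index=False`, the first line holds only the two column names and each following line holds one passenger.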