# Introduction

Today, we have our first-ever Kaggle Hack Session! We're going to compete in the Titanic competition. The goal in this competition is to predict who survived and who passed away during this tragedy, given information about the people involved.

We know that getting started with one of these competitions can be difficult, so we've provided this starter notebook to help you get up and running. Let's think about what we need to do when approaching any machine learning competition or problem.

1) Determine your problem space. Do you have a classification problem, or a regression problem?

2) Determine what model you want to use (Always good to start off with simple models).

3) Load in and preprocess your dataset. Examine your dataset to see if there are any NULL or non-numeric values.

4) Split up your dataset into training and testing components. 

5) Create your model. This entails defining your function, your placeholders, the loss function, and the optimizer. 

6) Train, evaluate, and iterate on your model!

7) Once you have a model that you're satisfied with, load in test.csv (the test set for the Titanic competition), compute your predictions, save them to a CSV file, and submit to Kaggle. 

In [118]:
import pandas as pd
import numpy as np
import tensorflow as tf

# Load in Data

You can download the data from the Kaggle website. The direct link is [here](https://www.kaggle.com/c/titanic/data), but we've already downloaded it for you. It's located in the Data subfolder. 

In [119]:
# Use the Pandas read_csv() function to load in the train.csv
titanicTrain = pd.read_csv('Data/train.csv')

# Examine Data

In [120]:
# Use the head() function to see what the first few rows of the dataframe look like
titanicTrain.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [121]:
# Figure out what the different column names are
titanicTrain.columns.tolist()

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

Use other functions such as `describe()`, `max()`, `mean()`, `value_counts()`, etc. to learn more about the dataset you're dealing with.
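
As a rough illustration of those exploration functions, the sketch below rebuilds the five rows shown by `head()` above (trimmed to a few columns) as a stand-in frame named `sample`, which is a name made up for this example. On the real data you would call the same methods on `titanicTrain`.

```python
import pandas as pd

# Stand-in frame: the five rows printed by head() above, trimmed to a few columns.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Summary statistics (count, mean, std, min, quartiles, max) for the numeric columns.
summary = sample.describe()

# How many passengers fall into each category of a column.
sex_counts = sample["Sex"].value_counts()

# Survival rate per group: the mean of a 0/1 column is the fraction of ones.
survival_by_sex = sample.groupby("Sex")["Survived"].mean()

print(summary)
print(sex_counts)
print(survival_by_sex)
```

Five rows are far too few to draw conclusions from; the point is only to show which calls answer which question.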

# Clean Data

This is one of the most important parts of any machine learning pipeline. We want to make sure that the inputs we feed into any machine learning model are valid, non-null, numerical values. To get you started with data preprocessing, we'll show you one example of a column you may want to drop in this dataset.

In [122]:
# Summarize the columns: their types and how many non-null values each one has
titanicTrain.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


So, as you can see above, some passengers are missing values for the Age and Cabin attributes, and two are missing Embarked. There are several ways to deal with this (for example, replacing the null values with the median of the other values, or replacing them with 0), but the simplest is to drop the column entirely. That is a reasonable choice for Cabin, where only 204 of the 891 rows have a value.

In [123]:
# Drop the Cabin column (axis = 1 selects columns, axis = 0 selects rows)
titanicTrain.drop(['Cabin'], axis = 1, inplace = True)
# Alternatively, if you don't wish to modify the original dataframe, assign the result of .drop() to a new variable:
#titanicTrain_dropped = titanicTrain.drop(['Cabin'], axis = 1)


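Another column that needs processing is Age. Here we fill its missing values with the median age, and fill the two missing Embarked values with 'S'.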

In [124]:
# Fill the missing values (Cabin was already dropped in the previous cell,
# so dropping it again here would raise a KeyError)
medianAge = titanicTrain['Age'].median()
embarkedFill = 'S'
titanicTrain['Age'] = titanicTrain['Age'].fillna(medianAge)
titanicTrain['Embarked'] = titanicTrain['Embarked'].fillna(embarkedFill)

Now, try it on your own! The functions you will probably be using are (although you're not limited to just these!):
- [drop()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html)
- [fillna()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html)
- [get_dummies()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html)
- [dropna()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html)
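
Of the functions listed, `get_dummies()` is the only one not demonstrated in this notebook, so here is a minimal sketch of how it might turn the string columns into numbers. The mini-frame `df` is hypothetical, and mapping `Sex` to a single 0/1 column is one possible choice, not the only one.

```python
import pandas as pd

# Hypothetical mini-frame with the same kinds of problems as the Titanic data:
# string columns and a few missing values.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
    "Age": [22.0, 38.0, None, 35.0],
})

# Fill the nulls first, as in the cells above.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna("S")

# A two-valued string column fits in a single 0/1 column.
df["Sex"] = (df["Sex"] == "female").astype(int)

# A column with more than two categories becomes one indicator column per category.
encoded = pd.get_dummies(df, columns=["Embarked"])
print(encoded)
```

The resulting `Embarked_C` and `Embarked_S` columns can be added to the feature matrix like any other numeric column.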

In [125]:
# TODO Find the other attributes that may give us trouble later on! Once you find these
# columns, figure out if you just want to drop the attribute altogether or replace with 
# median, or something else!
# HINT: The name attribute is something you may want to look at. We don't want strings in our ML model!

# Cabin was dropped and Age/Embarked were filled in the cells above, so only the
# remaining free-text columns need handling here
titanicTrain.drop(['Name'], axis = 1, inplace = True)
titanicTrain.drop(['Ticket'], axis = 1, inplace = True)

# Confirm that no null values remain
titanicTrain.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64

Now that you know a couple of ways of dealing with null values and string values, feel free to be creative! One of the best ways to get a more accurate machine learning model is to understand how to visualize and clean your data. This is one of the most important steps in any ML pipeline.

# Create Training/Testing Matrices

So, now that we've made our final changes to our dataframe, we want to convert it into a matrix of numbers. We want our Y Matrix to be filled with binary labels indicating whether the person survived or not. Our X Matrix should contain all of the features that represent each individual.  

In [126]:
titanicTrain.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 62.7+ KB


In [127]:
# Convert to matrices. 
# TODO Add/Remove columns as you see fit
# .values returns the underlying NumPy array
X = titanicTrain[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].values
Y = titanicTrain['Survived'].values
Y = Y.reshape([Y.shape[0], 1]) # Reshaping from (891,) to (891,1)
print (X.shape)
print (Y.shape)

(891, 5)
(891, 1)


(OPTIONAL) Remember that whenever we have a dataset, it's good practice to separate it into two parts: one that we will use to train the model, and one that we will use as a test/validation set to check how our model is doing.

In [128]:
# Divide into xTrain, yTrain, xTest, and yTest. Take the last 100 examples as test
numExamples = X.shape[0]
numTestExamples = 100

xTrain = X[:numExamples - numTestExamples] # xTrain contains examples 0-790
yTrain = Y[:numExamples - numTestExamples] # yTrain contains the labels for examples 0-790
xTest = X[numExamples - numTestExamples:] # xTest contains examples 791-890
yTest = Y[numExamples - numTestExamples:] # yTest contains the labels for examples 791-890
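
Taking the last 100 rows works well when the rows are in no particular order. If you would rather not rely on that, a shuffled split is a common alternative. The sketch below uses a made-up helper name (`shuffled_split`) and small dummy arrays in place of `X` and `Y`.

```python
import numpy as np

def shuffled_split(features, labels, num_test, seed=0):
    """Shuffle row indices once, then hold out the first num_test rows as the test set."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(features.shape[0])
    test_idx, train_idx = order[:num_test], order[num_test:]
    # Index features and labels with the same indices so each row keeps its label.
    return features[train_idx], labels[train_idx], features[test_idx], labels[test_idx]

# Dummy data: row i has features (2i, 2i + 1) and label i, so alignment is easy to check.
demo_X = np.arange(20, dtype=np.float32).reshape(10, 2)
demo_Y = np.arange(10).reshape(10, 1)

x_tr, y_tr, x_te, y_te = shuffled_split(demo_X, demo_Y, num_test=3)
print(x_tr.shape, y_tr.shape, x_te.shape, y_te.shape)
```

Fixing the seed keeps the split reproducible between runs, which makes it easier to compare models.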


# Create Model

Now that we have all of our data loaded in and preprocessed, we can start creating our model. This component is pretty open-ended: you have the freedom to choose whichever model you'd like to create. If you need inspiration, take a look at the code for linear regression and logistic regression in the week2 and week3 folders. A few other reminders:

- Think about what types of objects you'll need to create: placeholders, variables, optimizers, etc.
- Think back to how we created the linear regression and logistic regression models. 
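
If it helps to see the moving parts without any TensorFlow machinery, here is a plain NumPy sketch of logistic regression trained by gradient descent. It is not the notebook's model. The function names, the toy data, and the hyperparameters are all made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Squash raw scores (logits) into probabilities between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, p, eps=1e-7):
    """Mean binary cross-entropy between labels y and predicted probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

def train_logistic_regression(x, y, lr=0.1, n_steps=2000):
    """Full-batch gradient descent on the mean cross-entropy loss."""
    w = np.zeros((x.shape[1], 1))  # one weight per feature
    b = 0.0                        # a single bias for the single output
    for _ in range(n_steps):
        p = sigmoid(x @ w + b)
        error = p - y              # gradient of the loss with respect to the logits
        w -= lr * (x.T @ error) / x.shape[0]
        b -= lr * float(np.mean(error))
    return w, b

# Toy data: the label is 1 when the two features sum to a positive number.
rng = np.random.RandomState(0)
toy_x = rng.randn(200, 2)
toy_y = (toy_x[:, :1] + toy_x[:, 1:] > 0).astype(np.float64)

w, b = train_logistic_regression(toy_x, toy_y)
toy_p = sigmoid(toy_x @ w + b)
toy_acc = float(np.mean((toy_p > 0.5) == (toy_y > 0.5)))
print(cross_entropy(toy_y, toy_p), toy_acc)
```

In the TensorFlow version, the placeholders play the role of `x` and `y`, the variables play the role of `w` and `b`, and the optimizer performs the two update lines for you.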

In [152]:
import pandas as pd
import numpy as np
import tensorflow as tf

# Use the Pandas read_csv() function to load in the train.csv
titanicTrain = pd.read_csv('Data/train.csv')

# TODO Find the other attributes that may give us trouble later on! Once you find these
# columns, figure out if you just want to drop the attribute altogether or replace with 
# median, or something else!


# Do the preprocessing
titanicTrain.drop(['Cabin'], axis = 1, inplace = True)
medianAge = titanicTrain['Age'].median()
titanicTrain['Age'] = titanicTrain['Age'].fillna(medianAge)
titanicTrain['Embarked'] = titanicTrain['Embarked'].fillna('S')
titanicTrain.drop(['Name'], axis = 1, inplace = True)
titanicTrain.drop(['Ticket'], axis = 1, inplace = True)
titanicTrain.isnull().sum()


# Convert to matrices. 
# TODO Add/Remove columns as you see fit
X = titanicTrain[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].values
Y = titanicTrain['Survived'].values
Y = Y.reshape([Y.shape[0], 1]) # Reshaping from (891,) to (891,1)
print (X.shape)
print (Y.shape)


# Divide into xTrain, yTrain, xTest, and yTest. Take the last 100 examples as test
numExamples = X.shape[0]
numTestExamples = 100

xTrain = X[:numExamples - numTestExamples] # xTrain contains examples 0-790
yTrain = Y[:numExamples - numTestExamples] # yTrain contains the labels for examples 0-790
xTest = X[numExamples - numTestExamples:] # xTest contains examples 791-890
yTest = Y[numExamples - numTestExamples:] # yTest contains the labels for examples 791-890


# TODO Create your model here
lr = .01 # the learning rate
n_epochs = 10000 # the number of passes we will make over the training set

# TODO: Create placeholders for X (our features) and Y (our labels)
X = tf.placeholder(tf.float32, [None, 5])
Y = tf.placeholder(tf.float32, [None, 1])

# TODO: create Variables for w (our weights) and b (our bias)
w = tf.Variable(tf.truncated_normal(shape = [5, 1], stddev=0.01), name = 'w')
b = tf.Variable(tf.zeros([1]), name = 'b') # one bias for the single output unit

logits = tf.matmul(X, w) + b
normalized_logits = tf.nn.sigmoid(logits)
# TODO: write code to compute the cross_entropy_loss and the mean_squared_loss.
# Experiment with both losses: which one performs better? Why might this be?
cross_entropy = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels = Y, logits = logits))
mean_squared_loss = tf.reduce_mean(tf.square(Y - normalized_logits))
loss = mean_squared_loss
# TODO: Create a GradientDescentOptimizer that minimizes our loss. 
opt = tf.train.GradientDescentOptimizer(learning_rate = lr).minimize(loss)

# operations that help us monitor our accuracy:
# predict 1 when the sigmoid output is at least 0.5, then compare with the labels
cp = tf.equal(tf.cast(tf.greater_equal(normalized_logits, 0.5), tf.float32), Y)
acc = tf.reduce_mean(tf.cast(cp, tf.float32))

# TODO: create a global_variables_initializer, launch the graph, and run the optimization step for n_epochs iterations.
init = tf.global_variables_initializer()
sess = tf.InteractiveSession()
sess.run(init)
for i in range(n_epochs):
    # Each step uses the whole training set (full-batch gradient descent)
    sess.run(opt, feed_dict = {X: xTrain, Y: yTrain})
    if i % 500 == 0:
        l = loss.eval(feed_dict = {X: xTrain, Y: yTrain})
        print("Loss: {}".format(l))
trainAcc = acc.eval(feed_dict = {X: xTrain, Y: yTrain})
testAcc = acc.eval(feed_dict = {X: xTest, Y: yTest}) # held-out examples the model never trained on
print("train acc: {}".format(trainAcc))
print("test acc: {}".format(testAcc))

(891, 5)
(891, 1)


# Train Model

Now that you've created your model by defining your computational graph, you're ready to start training it. Training a model means repeatedly running our optimizer object over (different parts of) our training dataset. A few other reminders:
- Remember to create a TensorFlow session and initialize all of your variables
- Run your optimizer object at every iteration
- Keep track of how your model is doing every now and again
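
One way to run the optimizer over "different parts" of the training data is to cut a shuffled copy of it into mini-batches. The generator below is a sketch under assumptions: the name `iterate_minibatches` and the batch size of 128 are illustrative, and dummy arrays with 791 rows stand in for `xTrain` and `yTrain`.

```python
import numpy as np

def iterate_minibatches(features, labels, batch_size, seed=0):
    """Yield (features, labels) chunks that together cover every row exactly once."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(features.shape[0])
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield features[idx], labels[idx]

# Dummy stand-ins with the same shapes as a 791-example training set with 5 features.
demo_X = np.zeros((791, 5), dtype=np.float32)
demo_Y = np.zeros((791, 1), dtype=np.float32)

batch_sizes = [xb.shape[0] for xb, _ in iterate_minibatches(demo_X, demo_Y, batch_size=128)]
print(batch_sizes)
```

Inside a training loop, each yielded pair would be passed to the optimizer through `feed_dict`, with a new seed per epoch so the batches differ from pass to pass. The last batch is smaller whenever the batch size does not divide the number of examples.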

# Test Model

By now, you have a trained model and you're almost ready to submit! We want to now see how our model does on data that it has never seen before. We want to compute our predictions for the test set. We will then submit these predictions to Kaggle in order to see how accurate we are. A few reminders:
- Remember that preprocessing you did for the training dataset? You'll need to do that same preprocessing for this test set as well. 
- No need to initialize variables or anything. Everything is already trained! We just want to compute our predictions for this new set of data. 
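
To keep the train and test preprocessing in sync, one option is to put the steps in a single function and call it on both dataframes. The sketch below assumes you fill nulls with medians computed from the training data only (a common choice, but a choice). The `preprocess` helper and the two tiny frames are hypothetical stand-ins for your own code and for `train.csv` / `test.csv`.

```python
import numpy as np
import pandas as pd

FEATURES = ["Pclass", "Age", "SibSp", "Parch", "Fare"]

def preprocess(df, fill_values):
    """Select the feature columns and fill nulls with precomputed values."""
    features = df[FEATURES].copy()
    for column, value in fill_values.items():
        features[column] = features[column].fillna(value)
    return features.values.astype(np.float32)

# Hypothetical stand-ins for the train and test dataframes.
train_df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "SibSp": [1, 1, 0, 1],
    "Parch": [0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})
test_df = pd.DataFrame({
    "Pclass": [3, 2],
    "Age": [np.nan, 47.0],
    "SibSp": [0, 1],
    "Parch": [0, 0],
    "Fare": [7.8292, np.nan],
})

# Compute the fill values once, from the training data only.
fill_values = {"Age": train_df["Age"].median(), "Fare": train_df["Fare"].median()}

train_X = preprocess(train_df, fill_values)
test_X = preprocess(test_df, fill_values)
print(train_X.shape, test_X.shape)
```

Note that a column with no nulls in the training set can still have nulls in the test set, so it is worth running `isnull().sum()` on both.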

In [151]:
# TODO Do the same data preprocessing you did for the train set
# TODO Compute the predictions for the testing set by evaluating your logits/normalized logits variables
# TODO Check that the predictions are the correct dimensionality 

# Use the Pandas read_csv() function to load in the test.csv
titanicTest = pd.read_csv('Data/test.csv')

# Do the same preprocessing as for the train set
titanicTest.drop(['Cabin'], axis = 1, inplace = True)
medianAge = titanicTest['Age'].median()
titanicTest['Age'] = titanicTest['Age'].fillna(medianAge)
titanicTest['Embarked'] = titanicTest['Embarked'].fillna('S')
titanicTest.drop(['Name'], axis = 1, inplace = True)
titanicTest.drop(['Ticket'], axis = 1, inplace = True)
# Fare is one of our features too, so fill it in case the test set has nulls there
titanicTest['Fare'] = titanicTest['Fare'].fillna(titanicTest['Fare'].median())

# test.csv has no Survived column (that is what we are predicting),
# so we only build the feature matrix. We use a new name so that the
# X placeholder from the model cell is not overwritten.
xKaggle = titanicTest[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].values
print (xKaggle.shape)

# Evaluate the sigmoid outputs and threshold them at 0.5 to get 0/1 predictions
predictions = (normalized_logits.eval(feed_dict = {X: xKaggle}) >= 0.5).astype(int)
print (predictions.shape)


# Create Kaggle Submission

It's very important to be familiar with the exact Kaggle submission format. We want to create a CSV file whose first line holds the column names (these differ from competition to competition). Each of the following lines contains the id number of a test example and the prediction for that example.

In [66]:
import csv

firstRow = [['id', 'pred']]
# newline = "" lets the csv module control line endings (avoids blank rows on Windows)
with open("predictions.csv", "w", newline = "") as f:
    writer = csv.writer(f)
    writer.writerows(firstRow)
    # TODO write the id number and the predictions you got from the last step!
    # HINT: Using a for loop might be helpful
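
As one possible way to fill in that TODO, the sketch below writes a header row followed by one row per test example. It assumes the Titanic competition expects the headers `PassengerId` and `Survived` (check the competition's sample submission to confirm), and that predictions arrive as probabilities to be thresholded at 0.5. The helper name, ids, and probabilities are made up, and the example writes to an in-memory buffer instead of a file so it can run anywhere.

```python
import csv
import io

def write_submission(out_file, passenger_ids, probabilities, threshold=0.5):
    """Write a header row, then one (id, 0/1 prediction) row per test example."""
    writer = csv.writer(out_file)
    writer.writerow(["PassengerId", "Survived"])  # assumed header names
    for pid, prob in zip(passenger_ids, probabilities):
        writer.writerow([int(pid), int(prob >= threshold)])

# Illustrative ids and probabilities, not real model output.
buffer = io.StringIO()
write_submission(buffer, [892, 893, 894], [0.12, 0.81, 0.5])
submission_text = buffer.getvalue()
print(submission_text)
```

To write a real file, pass `open("predictions.csv", "w", newline="")` in place of the buffer, with the `PassengerId` column of the test dataframe as the ids.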

Once you have the predictions.csv file, you can go ahead and submit to Kaggle! Great job!