# Let's Submit a Model to Kaggle! -> Titanic Competition

* Data is from: https://www.kaggle.com/c/titanic
* Recommended notebook: https://github.com/justmarkham/DAT8/blob/master/notebooks/13_advanced_model_evaluation.ipynb

## Load your libraries and read in your data:

In [1]:
import pandas as pd

# Data to train your model on - this data has the answers of who did and did not survive
train = pd.read_csv("train.csv")

# Data to apply the trained model to, so we can predict who did and did not survive, and submit the answers to Kaggle:
test = pd.read_csv("test.csv", index_col='PassengerId')

# view shapes of train and of test:
print("Train data rows and columns: ", train.shape)
print("Test data rows and columns:  ", test.shape)

Train data rows and columns:  (891, 12)
Test data rows and columns:   (418, 10)


The train data has 891 rows (passengers) and 12 columns (variables, or features, that describe each passenger).

The test data has 418 rows (passengers) and only 10 columns instead of 12. One of the two missing columns is `Survived`, which is absent because it is what we are trying to predict. The other is `PassengerId`, which I read in as the index instead of as a column.
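To see how `index_col` changes the column count, here is a minimal sketch on a made-up three-row CSV. The string `csv_text` and the names `as_column` and `as_index` are illustrative assumptions, not part of the Kaggle data.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for test.csv (not the real Kaggle file)
csv_text = "PassengerId,Pclass,Sex\n892,3,male\n893,3,female\n894,2,male\n"

# Read once with PassengerId as an ordinary column, once as the index
as_column = pd.read_csv(io.StringIO(csv_text))
as_index = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")

print(as_column.shape)       # PassengerId counted as a column
print(as_index.shape)        # one column fewer, same number of rows
print(as_index.index.name)   # the index keeps the PassengerId label
```

Moving a column into the index drops the column count by one without losing the values, which is why the test shape above reports 10 columns rather than 11.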

In [2]:
train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
test.head()

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


## Preprocessing / data wrangling:
Anything we do here, we need to do to both the training and the testing data. 

### Check for missing values
Some missing values might not matter if we intend not to use those columns (also called features or variables) in our model. For example, many values are missing from Cabin, but we're not going to use Cabin in this particular model.

In [4]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [5]:
test.isnull().sum()

Pclass        0
Name          0
Sex           0
Age          86
SibSp         0
Parch         0
Ticket        0
Fare          1
Cabin       327
Embarked      0
dtype: int64
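Raw counts are hard to compare between two datasets of different sizes, so it can help to look at the share of missing values per column instead. A minimal sketch, using a small made-up frame (`toy` is a hypothetical stand-in, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with gaps in Age and Cabin
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

# isnull() gives True/False per cell; the mean of booleans is the fraction that are True
missing_share = toy.isnull().mean()
print(missing_share)
```

The same call on `train` and `test` would give fractions that can be compared directly between the two.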

Age plausibly affects whether someone survives a sinking ship. For example, the very young and the very old may be more sensitive to extreme temperatures, and children may have been given preference for rescue. Age therefore seems likely to be a useful predictor of survival, so we're going to use it in the model. However, both the train and test sets are missing values for `Age`. We can't _remove_ the test rows where `Age` is null, because Kaggle expects a prediction for every passenger in the test set. Nor can we leave the nulls in place, because the model cannot make predictions from missing values. Instead, we can **impute** the missing ages, that is, fill them in with a reasonable guess such as the mean, median or mode.

In [6]:
# shape of the training data if we removed all rows that contained a NULL in the Age column:
train[train.Age.notnull()].shape

(714, 12)

In [7]:
print("Mean Age: ", train.Age.mean())
print("Mode (most frequent) Age: ", train.Age.mode())
print("Median Age: ", train.Age.median())

Mean Age:  29.69911764705882
Mode (most frequent) Age:  0    24.0
dtype: float64
Median Age:  28.0


Let's go with the median, and impute age in the train and test datasets.

In [8]:
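# Compute the median once, from the *training* data, before any values are filled in
age_median = train['Age'].median()

# Assign the filled column back; fillna(..., inplace=True) on train.Age may not update the dataframe in recent pandas
train['Age'] = train['Age'].fillna(age_median)

# Here, I've gone with the median Age calculated from the *training* data, rather than from the test data,
# as it is more realistic this way. Plus, there is more training data than test data, so the calculated median age is
# more likely to be representative.
test['Age'] = test['Age'].fillna(age_median)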
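The same "learn the fill value from train, apply it to test" idea can also be written with scikit-learn's `SimpleImputer`. This is only a sketch of an alternative, not what the notebook does, and the frames `toy_train` and `toy_test` are made-up stand-ins for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-ins for the Age column of train and test
toy_train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, 26.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 47.0]})

# fit() learns the median from the training data only
imputer = SimpleImputer(strategy="median")
imputer.fit(toy_train[["Age"]])

# transform() fills the gaps in both sets with that training median
train_filled = imputer.transform(toy_train[["Age"]])
test_filled = imputer.transform(toy_test[["Age"]])

print(imputer.statistics_)  # the learned median, one value per column
print(test_filled)
```

Keeping the fitted imputer around means the test set can never influence the fill value by accident.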

In [9]:
train.shape

(891, 12)

We also have one missing value for `Fare` in the **test** dataset. As with `Age`, we fill it using the median from the training data.

In [10]:
print("Mean Fare: ", train.Fare.mean())
print("Median Fare: ", train.Fare.median())

Mean Fare:  32.2042079685746
Median Fare:  14.4542


In [11]:
test['Fare'] = test['Fare'].fillna(train['Fare'].median())

## Models don't like strings, so let's make `Sex` a binary:

In [12]:
# add new column to train and test dataframes for binarised sex:
train['Sex_Female'] = train.Sex.map({'male':0, 'female':1})
test['Sex_Female'] = test.Sex.map({'male':0, 'female':1})
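One thing to watch with `map`: any value that is not a key in the dictionary becomes `NaN`, which would break the model later. A minimal sketch of a sanity check, using a made-up frame (`toy`, with a deliberately mis-capitalised value) rather than the real data:

```python
import pandas as pd

# Hypothetical column with one unexpected spelling ("Female")
toy = pd.DataFrame({"Sex": ["male", "female", "Female", "male"]})

# Same mapping as above; values missing from the dict become NaN
toy["Sex_Female"] = toy["Sex"].map({"male": 0, "female": 1})

# Count how many rows failed to map
unmapped = toy["Sex_Female"].isnull().sum()
print(unmapped)
```

Running the same `isnull().sum()` check on `train['Sex_Female']` and `test['Sex_Female']` would confirm that every row was mapped.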

In [13]:
train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_Female
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0


In [14]:
test.head()

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_Female
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,0
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,1
894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,0
895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,0
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,1


## Select columns to include in model

In [15]:
# list of available features:
train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Sex_Female'],
      dtype='object')

In [16]:
# Selected features to use in our model. You can play around with this, include other features, or remove some listed here
features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_Female' ]

# assign the features used to train the model to X
X = train[features]

# Assign the answers that the model will learn to predict to y
y = train['Survived']

## Train our model using a [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) for `survived` vs `not survived`

There are several ways to split training data into its own training and testing sets. One way is a train/test split.
Another way is [cross validation](http://scikit-learn.org/stable/modules/cross_validation.html).

In [17]:
# We use the training set for both training and testing here, because we need the answers to evaluate how well our 
# model does. 

# we'll train the model on 80% of the data (train_size=0.8), and test it on the remaining 20% 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

# train a logistic regression model which will learn the two categories: survived, not survived
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)

# make predictions for testing set
y_pred_class = logreg.predict(X_test)

# calculate testing accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.810055865922
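A single split gives one accuracy figure that depends on which rows happened to land in the 20%. Cross validation, mentioned above as the alternative, repeats the split several times and averages the scores. Here is a minimal sketch on synthetic data; `X_toy` and `y_toy` are made up for illustration, so the score it prints says nothing about the Titanic model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (hypothetical, not the Titanic passengers)
rng = np.random.default_rng(0)
n = 200
X_toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex_Female": rng.integers(0, 2, size=n),
})

# Label follows Sex_Female, with roughly 20% of rows flipped as noise
flip = rng.random(n) < 0.2
y_toy = pd.Series(np.where(flip, 1 - X_toy["Sex_Female"], X_toy["Sex_Female"]))

# 5-fold cross validation: five train/test splits, one accuracy score each
scores = cross_val_score(LogisticRegression(), X_toy, y_toy, cv=5, scoring="accuracy")
print(scores)
print(scores.mean())
```

With the real data, passing `X` and `y` in place of the toy frames would give five accuracy scores whose mean is usually a steadier estimate than a single split.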


## Great! Let's use this model.  
However, this model has only been trained on 80% of the training data. It might be more accurate if trained on the whole dataset. So, let's **retrain** the model using _all_ the training data, and then apply it to `test.csv` to make our predictions.

In [18]:
# select only test dataset columns that we used as features in our training
# (.copy() gives an independent dataframe, so adding a column to it later is safe)
test = test[features].copy()
test.head()

PassengerId,Pclass,Age,SibSp,Parch,Fare,Sex_Female
892,3,34.5,0,0,7.8292,0
893,3,47.0,1,0,7.0,1
894,2,62.0,0,0,9.6875,0
895,3,27.0,0,0,8.6625,0
896,3,22.0,1,1,12.2875,1


In [19]:
# Train model on ALL train data
logreg.fit(X, y)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [20]:
# apply model to test data
results = logreg.predict(test)

In [21]:
# 0 = not survived, 1 = survived
results

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1,
       0, 0,

In [22]:
# add `results` as a column to our test dataframe
test['Survived'] = results

# remove all other columns - keep new `Survived` col, and assign to a new dataframe:
submission_export = test[['Survived']]

In [23]:
submission_export.head()

PassengerId,Survived
892,0
893,0
894,0
895,0
896,1


In [24]:
# save as csv. Often you'll see people adding `index=False`, but we want to keep the index because it holds the `PassengerId`
submission_export.to_csv('Titanic_model_submission.csv')
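Before uploading, it can be worth reading the file back to confirm it has the two columns Kaggle expects, `PassengerId` and `Survived`. A minimal sketch, using a made-up three-row frame (`toy_submission`) and an in-memory buffer in place of the real file:

```python
import io
import pandas as pd

# Hypothetical three-row submission with PassengerId as the index
toy_submission = pd.DataFrame(
    {"Survived": [0, 0, 1]},
    index=pd.Index([892, 893, 894], name="PassengerId"),
)

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
toy_submission.to_csv(buffer)
buffer.seek(0)
check = pd.read_csv(buffer)

print(list(check.columns))                 # index name becomes the first column header
print(check["Survived"].isin([0, 1]).all())
print(check["PassengerId"].is_unique)
```

For the real file, `pd.read_csv('Titanic_model_submission.csv')` followed by the same checks, plus confirming there are 418 rows, would catch most formatting mistakes before submission.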