Welcome to the Titanic machine learning practice exercise! Your job is to build a model to predict whether or not a particular passenger survived the disaster. This is a binary classification problem where the outcome $Y = 1$ if the passenger survived, and $Y = 0$ if not. The goal of this exercise is to get familiar with a typical machine learning model-building workflow and to practice working with data and models.

!["Titanic"](images/titanic.jpeg)

You have the following data about the passengers (some values may be missing, and you might need to figure out how to fill them in):


|Variable|Definition|Key|
| :- | :- | :- |
|survival|Survival|0 = No, 1 = Yes|
|pclass|Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd|
|sex|Sex| |
|Age|Age in years| |
|sibsp|# of siblings / spouses aboard the Titanic| |
|parch|# of parents / children aboard the Titanic| |
|ticket|Ticket number| |
|fare|Passenger fare| |
|cabin|Cabin number| |
|embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|


You can build any sort of model you want, but if you are a beginner then you should start with logistic regression, which is a simple yet surprisingly powerful classification model that is important for understanding modern neural network technologies.

Logistic regression takes in a set of input data $X \in \mathbb{R}^{N_{data} \times N_{feat}}$ and learns a set of weights $\beta \in \mathbb{R}^{N_{feat}}$, where 
$N_{data}, N_{feat}$ are the number of data points and number of predictive features, respectively. You do not have to worry about how the logistic model is trained (at first) for this exercise, because you can use model code from Scikit-learn and simply call the .fit() method. Internally, the model will solve a convex optimization problem that determines $\beta$ using your data $X$ and your set of outcome labels $Y$.

During prediction time, the logistic regression model makes a prediction $\hat{y_i}$ for a new data point (passenger) $x_i$ by first computing the probability of survival:

$P(y_i = 1) = \sigma(x_i \cdotp \beta)$

where 
 
$\sigma(t) = \frac{1}{1 + e^{-t}}$.
 
Since these predictions are probabilities, you can turn them into hard predictions by using a threshold of 0.5, that is 

$\hat{y_i} = 1 \quad \text{if } \sigma(x_i \cdotp \beta) > 0.5, \quad \text{and } \hat{y_i} = 0 \text{ otherwise}$.
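
As a minimal sketch of the prediction rule above, the snippet below applies the logistic function and the 0.5 threshold with numpy. The names `sigmoid`, `X_demo` and `beta_demo` are hypothetical, and the weights are made-up numbers rather than values fitted to the Titanic data. The first column of `X_demo` is assumed to be a constant 1 that acts as an intercept.

```python
import numpy as np

def sigmoid(t):
    """Logistic function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical values for illustration only (not fitted to the Titanic data).
# Columns: intercept, female indicator, age.
X_demo = np.array([[1.0, 1.0, 30.0],
                   [1.0, 0.0, 30.0]])
beta_demo = np.array([0.5, 2.0, -0.05])

prob_demo = sigmoid(X_demo @ beta_demo)     # P(y_i = 1) for each row
y_hat_demo = (prob_demo > 0.5).astype(int)  # hard 0/1 predictions
print(prob_demo, y_hat_demo)
```

With these made-up weights the linear scores are $1$ and $-1$, so the first row lands above the threshold and the second below it.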

To complete this task you will need to:
- Manipulate the data so that it can be passed to the scikit-learn LogisticRegression class. You will need to recode any categorical variables that you want to use (why is that?). To keep things simple you can use one-hot encoding (google it!), but be careful to eliminate one category from your one-hot encoding (why is this necessary?). What could go wrong if you encode categorical variables with more than two categories as numbers? 

- Train your model on the training data, using the features which you think are important. You can use penalty=None to train a simple unregularized model. Then make a prediction on the test data. You can submit your results to [Kaggle](https://www.kaggle.com/c/titanic) to get your accuracy score and see how good your model is. Using as many variables as possible can help you to get a good training accuracy, but this doesn't necessarily mean that your model will generalize well to the test set! Finding a good model usually takes some insight into the data and the problem, as well as machine learning skill. You can also use cross-validation on the training data to pick a good model before going to the test set. 
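
One way to try out one-hot encoding is `pandas.get_dummies`. The sketch below uses a made-up toy column (not rows from train.csv) and the hypothetical names `toy`, `full` and `reduced`. It compares the full encoding with the version where one category is dropped. In the full encoding the indicator columns always add up to 1, which is a hint for the question about eliminating one category.

```python
import pandas as pd

# Toy data for illustration; the values are made up, not taken from train.csv.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Full one-hot encoding: one indicator column per category.
full = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)

# Drop the first category ("C" here) so that it becomes the baseline.
reduced = pd.get_dummies(toy["Embarked"], prefix="Embarked",
                         drop_first=True, dtype=int)

# Every row of the full encoding sums to 1, duplicating an intercept column.
print(full.sum(axis=1).tolist())
print(reduced.columns.tolist())
```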

If you complete these tasks very quickly and would like to go further, you can try the following bonus tasks.

- Bonus task 1: Try out some feature engineering. Make a new data column in the training data by using transformations of existing columns. Ratios of columns, log transforms, and power transforms (e.g. $x^2$) are all popular choices that you can play with. Can you improve your test-set classification accuracy by feature engineering?

- Bonus task 2: Implement your own logistic regression model using numpy. You can use scipy.optimize to train your model using the method of maximum likelihood. To do this you will need to solve the following optimization problem

$\max_{\beta} l(\beta, Y, X)$

where the log-likelihood function $l$ is given by

$l(\beta, Y, X) = \sum_{i =1}^{N_{data}} y_i \log(\sigma(x_i\cdotp \beta)) + (1 -y_i) \log(1 - \sigma(x_i\cdotp \beta))$.

You can derive this function by taking the log of the likelihood

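$L(\beta, Y, X) = \prod_{i =1}^{N_{data}} P(Y_i = y_i | \beta, x_i) = \prod_{i =1}^{N_{data}} \sigma(x_i\cdotp \beta)^{y_i} (1 - \sigma(x_i\cdotp \beta))^{1 - y_i}$,

where $y_i$ is the true (observed) outcome for passenger $i$, and $P(Y_i = y_i | \beta, x_i)$ is the probability that the model assigns to that outcome.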
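
If you would like to pass an analytic gradient to scipy.optimize instead of relying on numerical differentiation, one can be worked out from the log-likelihood above. This is a side derivation, not something the exercise requires. Using the identity $\sigma'(t) = \sigma(t)(1 - \sigma(t))$, each term of the sum differentiates to $(y_i - \sigma(x_i \cdotp \beta)) x_i$, so

$\nabla_{\beta} l(\beta, Y, X) = \sum_{i =1}^{N_{data}} (y_i - \sigma(x_i\cdotp \beta)) x_i = X^T (Y - \sigma(X \beta))$,

where $\sigma$ is applied element-wise to the vector $X \beta$. Because scipy minimizes, you would supply the negative of this gradient together with the negative log-likelihood.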



In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

In [2]:
titanic_train = pd.read_csv("data/titanic/train.csv")
titanic_test = pd.read_csv("data/titanic/test.csv")
titanic_train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [3]:
from sklearn.model_selection import cross_val_score

#Here I have made some simple choices, but you can make this as complex as you want.
#For some competitive solutions check out the "code" section of the Kaggle Titanic competition. 
feature_cols = ["Sex", "Age", "Pclass", "Parch", "SibSp"]

def recode_data(df_raw):
    """Here I have done the encoding manually so that you can see it. 
       But for convenience when you build your own models 
       you can use the sklearn preprocessing tools. """

    df_encoded = df_raw.copy()
    df_encoded["Female"] = (df_encoded["Sex"] == "female").astype(int)

    df_encoded["Pclass_2"] = (df_encoded["Pclass"] == 2).astype(int)
    df_encoded["Pclass_3"] = (df_encoded["Pclass"] == 3).astype(int)
    #Pclass = 1 is not encoded as it would introduce a linear dependence. 
    #Pclass = 1 corresponds to Pclass_2 and Pclass_3 = 0

    df_encoded = df_encoded.drop(columns = ["Sex", "Pclass"])
    #Fill in missing ages with the mean of the observed ages
    df_encoded["Age"] = df_encoded["Age"].fillna(df_encoded["Age"].mean())
    return df_encoded


X_encoded = recode_data(titanic_train[feature_cols])
X_encoded

Unnamed: 0,Age,Parch,SibSp,Female,Pclass_2,Pclass_3
0,22.000000,0,1,0,0,1
1,38.000000,0,1,1,0,0
2,26.000000,0,0,1,0,1
3,35.000000,0,1,1,0,0
4,35.000000,0,0,0,0,1
...,...,...,...,...,...,...
886,27.000000,0,0,0,1,0
887,19.000000,0,0,1,0,0
888,29.699118,2,1,1,0,1
889,26.000000,0,0,0,0,0
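
The docstring of `recode_data` mentions the sklearn preprocessing tools. A rough sketch of how the same recoding might look with them is given below. It runs on a handful of made-up rows rather than the real training data, and the names `toy_X`, `toy_y`, `preprocess` and `toy_model` are hypothetical. Unlike the manual function, the imputer stores the mean age it saw during `fit` and reuses it on new data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows with made-up values, shaped like the feature columns used above.
toy_X = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0, 4.0, 54.0],
    "Pclass": [3, 1, 3, 1, 2, 2],
    "Parch": [0, 0, 0, 0, 1, 0],
    "SibSp": [1, 1, 0, 1, 1, 0],
})
toy_y = np.array([0, 1, 1, 0, 1, 0])

preprocess = ColumnTransformer(
    transformers=[
        # One-hot encode and drop the first level of each categorical column.
        ("onehot", OneHotEncoder(drop="first"), ["Sex", "Pclass"]),
        # Replace missing ages with the mean age seen during fit.
        ("age", SimpleImputer(strategy="mean"), ["Age"]),
    ],
    remainder="passthrough",  # keep Parch and SibSp unchanged
)

# Default (L2-regularized) settings are used here because the toy data is tiny.
toy_model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
toy_model.fit(toy_X, toy_y)
toy_pred = toy_model.predict(toy_X)
print(toy_pred)
```

Wrapping the preprocessing and the classifier in one `Pipeline` also means that `cross_val_score` refits the encoder and the imputer inside each fold.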


In [4]:
#Train the model
Nfolds = 10
Y_train = titanic_train["Survived"]

#penalty=None fits an unregularized model (scikit-learn >= 1.2; older versions use penalty="none")
clf = LogisticRegression(penalty = None)
cv_accuracy = cross_val_score(clf, X_encoded, Y_train, cv = Nfolds)
print("Mean Accuracy ({}-folds) = {:.3f}".format(Nfolds, cv_accuracy.mean()))

Mean Accuracy (10-folds) = 0.787


In [5]:
#Train on the entire training set and make a prediction
clf.fit(X_encoded, Y_train)
Y_hat = clf.predict(recode_data(titanic_test[feature_cols]))

output = pd.DataFrame(titanic_test["PassengerId"])
output["Survived"] = Y_hat
output.to_csv("my_titanic_predictions.csv",
              index = False)

#When I submit this result to Kaggle I get an accuracy of 0.75119.

In [6]:
#Custom Logistic Regression Model for bonus task #2
from scipy.optimize import minimize
from functools import partial

class MyLogisticRegression(object):
    def __init__(self):
        self.beta = None
        
    def fit(self, Y, X):
        X_copy = X.copy()
        X_copy["intercept"] = 1.0
        X_copy = X_copy.astype(float)
        beta0 = np.zeros(X_copy.shape[1])
        
        options ={"iprint": 2,
                  "gtol": 1.0e-4,
                  "maxiter": 100}
        
        llh_partial = partial(self.llh,
                              Y=Y.values,
                              X=X_copy)
        
        nllh_partial = lambda beta: -1*llh_partial(beta)
        
        sol = minimize(nllh_partial,
                       beta0,
                       method = "L-BFGS-B",
                       jac = False,
                       options=options)
        self.beta = sol.x
    
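    def llh(self, beta, Y, X):
        """Log-likelihood of the labels Y given the features X and weights beta."""
        p = self.predict_probability(beta, X)
        
        #Only the y_i = 1 terms contribute log(p_i) and only the y_i = 0 terms
        #contribute log(1 - p_i), so select each group with a boolean mask.
        #This also avoids evaluating 0*log(0), which would give nan.
        return np.log(p[Y == 1]).sum() + np.log(1 - p[Y == 0]).sum()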
        
    def predict_probability(self, beta, X):
        mu = np.dot(X, beta)
        return 1.0/(1 + np.exp(-mu))
    
    def predict(self, X):
        X_copy = X.copy()
        X_copy["intercept"] = 1.0
        p = self.predict_probability(self.beta, X_copy.values)
        return (p > 0.5).astype(int)
    
mylogit = MyLogisticRegression()
mylogit.fit(Y_train, X_encoded)
my_Yhat = mylogit.predict(recode_data(titanic_test[feature_cols]))
print("Do my predictions match those from SK_learn?", (my_Yhat == Y_hat).all())
        

Do my predictions match those from SK_learn? True
