Welcome to the Titanic machine learning practice exercise! Your job is to build a model that predicts whether or not a particular passenger survived the disaster. This is a binary classification problem where the outcome $Y = 1$ if the passenger survived, and $Y = 0$ if not. The goal of this exercise is to get familiar with a typical machine learning model-building workflow, and to practice working with data and models.

!["Titanic"](images/titanic.jpeg)

You have the following data about the passengers (some values may be missing, and you may need to decide how to fill them in):


|Variable|Definition|Key|
| :- | :- | :- |
|survival|Survival|0 = No, 1 = Yes|
|pclass|Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd|
|sex|Sex| |
|Age|Age in years| |
|sibsp|# of siblings / spouses aboard the Titanic| |
|parch|# of parents / children aboard the Titanic| |
|ticket|Ticket number| |
|fare|Passenger fare| |
|cabin|Cabin number| |
|embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|


You can build any sort of model you want, but if you are a beginner then you should start with logistic regression, a simple yet surprisingly powerful classification model that is also important for understanding modern neural network technologies.

Logistic regression takes in a set of input data $X \in \mathbb{R}^{N_{data} \times N_{feat}}$ and learns a set of weights $\beta \in \mathbb{R}^{N_{feat}}$, where $N_{data}$ and $N_{feat}$ are the number of data points and the number of predictive features, respectively. You do not have to worry about how the logistic model is trained (at first) for this exercise, because you can use the model code from Scikit-learn and simply call the .fit() method. Internally, the model solves a convex optimization problem that determines $\beta$ from your data $X$ and your set of outcome labels $Y$.

At prediction time, the logistic regression model makes a prediction $\hat{y_i}$ for a new data point (passenger) $x_i$ by first computing the probability of survival with the following formula:

$P(y_i = 1) = \sigma(x_i \cdotp \beta)$

where 
 
$\sigma(t) = \frac{1}{1 + e^{-t}}$.
 
Since these predictions are probabilities, you can turn them into hard predictions by using a threshold of 0.5, that is,

$\hat{y_i} = 1$ if $\sigma(x_i \cdotp \beta) > 0.5$, and $\hat{y_i} = 0$ otherwise.
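The prediction rule can be sketched in a few lines of numpy. This is only an illustration: `X_toy` and `beta_toy` are made-up values, not fitted weights or real passenger data.

```python
import numpy as np

def sigmoid(t):
    """Logistic function sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical toy inputs: 3 passengers, 2 features (values are made up).
X_toy = np.array([[2.0, 1.0],
                  [0.0, 0.0],
                  [-1.0, 0.5]])
beta_toy = np.array([0.5, -0.25])  # made-up weights, not fitted

probs = sigmoid(X_toy @ beta_toy)   # soft predictions P(y_i = 1)
y_hat = (probs > 0.5).astype(int)   # hard predictions with a 0.5 threshold

print(probs)
print(y_hat)
```

Note that the second passenger has $x_i \cdotp \beta = 0$, so the probability is exactly 0.5 and the strict inequality assigns the label 0.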

To complete this task you will need to:
- Manipulate the data so that it can be passed to the scikit-learn LogisticRegression class. You will need to recode any categorical variables that you want to use (why is that?). To keep things simple you can use one-hot encoding (google it!), but be careful to eliminate one category from your one-hot encoding (why?). What could go wrong if you encode categorical variables with more than two categories as numbers in a single column?

- Train your model on the training data, using the features that you think are important. You can use penalty=None to train a simple unregularized model. Then make a prediction on the test data. You can submit your results to [Kaggle](https://www.kaggle.com/c/titanic) to get your accuracy score and see how good your model is. Using as many variables as possible can help you get a good training accuracy, but this doesn't necessarily mean that your model will generalize well to the test set! Finding a good model usually takes some insight into the data and the problem, as well as machine learning skill. You can also use cross-validation on the training data to pick a good model before going to the test set.
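As a rough sketch of the mechanics of these two steps, the example below one-hot encodes two categorical columns with `pd.get_dummies` (dropping one category per variable), then scores a `LogisticRegression` with 5-fold cross-validation. The DataFrame `toy` is hypothetical: it borrows Titanic-style column names, but all values are made up, so the scores say nothing about the real data. The model uses scikit-learn's default settings here; which features and which penalty to use on the real data is left to you.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical toy data with Titanic-style column names (values are made up).
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "Sex": np.resize(["male", "female"], n),
    "Embarked": np.resize(["C", "Q", "S"], n),
    "Fare": rng.uniform(5.0, 80.0, size=n),
})
# Made-up outcome: mostly determined by Sex, with about 20% of labels flipped.
toy["Survived"] = ((toy["Sex"] == "female") ^ (rng.random(n) < 0.2)).astype(int)

# One-hot encode the categorical columns, dropping one category per variable.
X = pd.get_dummies(toy[["Sex", "Embarked", "Fare"]],
                   columns=["Sex", "Embarked"], drop_first=True).astype(float)
y = toy["Survived"]

# Estimate out-of-sample accuracy with 5-fold cross-validation.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(X.columns.tolist())
print(scores.mean())

# Fit on all of the (toy) training data and make hard 0/1 predictions.
model.fit(X, y)
y_hat = model.predict(X)
```

Comparing the mean cross-validation score across different feature sets is one way to choose between candidate models without touching the test set.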

If you complete these tasks very quickly and would like to go further, you can try the following bonus tasks.

- Bonus task 1: Try out some feature engineering. Make a new data column in the training data by using transformations of existing columns. Ratios of columns, log transforms, and power transforms (e.g. $x^2$) are all popular choices that you can play with. Can you improve your test-set classification accuracy by feature engineering?

- Bonus task 2: Implement your own logistic regression model using numpy. You can use scipy.optimize to train your model by the method of maximum likelihood. To do this you will need to solve the following optimization problem:

$\min_{\beta} l(\beta, Y, X)$

where the cross-entropy loss function $l$ is given by

$l(\beta, Y, X) = -\sum_{i=1}^{N_{data}} \left[ y_i \log(\sigma(x_i \cdotp \beta)) + (1 - y_i) \log(1 - \sigma(x_i \cdotp \beta)) \right]$.

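You can derive this function by taking the log of the likelihood and multiplying by -1. The likelihood is

$L(\beta, Y, X) = \prod_{i=1}^{N_{data}} P(Y_i = y_i \mid \beta, x_i) = \prod_{i=1}^{N_{data}} \sigma(x_i \cdotp \beta)^{y_i} \, (1 - \sigma(x_i \cdotp \beta))^{1 - y_i}$,

where $y_i$ is the true outcome and $P(Y_i = y_i \mid \beta, x_i)$ is the probability that the model assigns to that outcome.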
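The relationship between the two functions can be checked numerically. The sketch below is an illustration on made-up values (`X_toy`, `Y_toy`, and `beta_toy` are hypothetical, and the helper names are not part of the exercise template); it computes the cross-entropy loss directly and compares it with minus the log of the product of the probabilities that the model assigns to the observed outcomes.

```python
import numpy as np

def sigmoid(t):
    """Logistic function sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy(beta, Y, X):
    """Negative sum of y*log(p) + (1 - y)*log(1 - p), with p = sigma(X @ beta)."""
    p = sigmoid(X @ beta)
    return -np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))

def likelihood(beta, Y, X):
    """Product of the probabilities the model assigns to the observed outcomes."""
    p = sigmoid(X @ beta)
    return np.prod(np.where(Y == 1, p, 1 - p))

# Hypothetical toy data: 4 points, a constant column plus one feature.
X_toy = np.array([[1.0, 0.5],
                  [1.0, -1.0],
                  [1.0, 2.0],
                  [1.0, 0.0]])
Y_toy = np.array([1, 0, 1, 0])
beta_toy = np.array([0.1, 0.8])  # made-up weights, not fitted

loss = cross_entropy(beta_toy, Y_toy, X_toy)
neg_log_lik = -np.log(likelihood(beta_toy, Y_toy, X_toy))
print(np.isclose(loss, neg_log_lik))
```

On real data it is usually safer to work with the sum of logs than with the raw product, since a product of many probabilities can underflow to zero.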



In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

In [2]:
titanic_train = pd.read_csv("data/titanic/train.csv")
titanic_test = pd.read_csv("data/titanic/test.csv")
titanic_train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [None]:
#Preprocess your training data here

In [None]:
#Train your LogisticRegression model here

In [5]:
#Train on the entire training set and make a prediction
Y_hat = None #Fill this in with your prediction

output = pd.DataFrame(titanic_test["PassengerId"])
output["Survived"] = Y_hat
output.to_csv("my_titanic_predictions.csv",
              index = False)

#Submit your result to Kaggle, what accuracy do you get?

In [1]:
#Custom Logistic Regression Model for bonus task #2
from scipy.optimize import minimize #You can try the optimization algorithms from this package

class MyLogisticRegression(object):
    def __init__(self):
        pass
        
    def fit(self, Y, X):
        #optimize the likelihood function here to get the "beta" coefficients.
        pass
        
    def llh(self, beta, Y, X):
        #likelihood function
        pass
        
    def predict_probability(self, beta, X):
        #Soft probability predictions
        pass
        
    def predict(self, X):
        #Hard 0-1 predictions
        pass
