## Logistic Regression - How to interpret its coefficients

Not gonna bore ya with the history of logistic regression, its pervasiveness and popularity among practitioners yada yada yada. 

Instead, we'll take a *practical* approach to understand how logistic regression works. If you feel I've missed any details, I'd recommend checking out this <a href='http://statweb.stanford.edu/~tibs/ElemStatLearn/'>book</a>. Without further ado, let's get started!

1. [Quick Primer](#Quick-Primer)
2. [Titanic Example](#Titanic-Example)

### Quick Primer
Logistic Regression is commonly defined as:
$$y = \frac{1}{1+e^{-\Theta^Tx}}$$

You already know that. What's more interesting is that the above equation can also be rewritten as follows:
$$\log\left(\frac{y}{1-y}\right) = \Theta^Tx$$

Notice how the linear combination, $\Theta^T x$, is expressed as the log odds (logit) of $y$. This segues nicely into our next section on how to interpret the coefficients and intercepts from logistic regressions.
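
To convince ourselves that the two forms really are the same thing, here's a minimal sketch in plain Python. The function names `sigmoid` and `logit` are my own choice for illustration, and the sketch assumes the usual convention $y = \frac{1}{1+e^{-z}}$ with $z = \Theta^T x$. Feeding a value of $z$ through the logistic function and then taking the log odds should hand us back the same $z$.

```python
import math

def sigmoid(z):
    """Logistic function: maps a log odds value z to a probability y."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(y):
    """Log odds of a probability y; the inverse of the logistic function."""
    return math.log(y / (1.0 - y))

# Round trip: z -> y -> log(y / (1 - y)) should return the original z.
for z in (-2.0, 0.0, 0.5, 3.0):
    y = sigmoid(z)
    print(f"z = {z:+.1f}  ->  y = {y:.4f}  ->  log odds = {logit(y):+.4f}")
```

So whatever value the linear combination takes, it can be read directly as the log odds of the predicted probability.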

### Titanic Example

<a href='https://www.kaggle.com/c/titanic/data'>Kaggle</a> is a great platform for budding data scientists to get more practice. I'm currently working through the Titanic dataset, so we'll use it as the case study for our logistic regression.

In [None]:
# Let's load some python libraries
import pandas as pd
import matplotlib.pylab as plt
import numpy as np

%matplotlib inline

In [None]:
# Read in our dataset
train = pd.read_csv('train.csv')
train.head()

Before any data crunching, let's clean our data. We'll be interested in the field Sex later, so let's map males to the number 0 and females to 1. We'll also separate the x's and y's into two dataframes for ease of use.

In [None]:
train.Sex = train.Sex.apply(lambda x: 0 if x == 'male' else 1)

y_train = train.Survived
x_train = train.drop('Survived', axis=1)

Now that we've cleaned our data, let's feed it through sklearn's <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">logistic regression</a> estimator to get the coefficients, $\Theta$, out. Then we'll compute the same numbers by hand to convince ourselves of what's happening.

Note: sklearn applies L2 regularization by default. The parameter $C$ is the inverse of the regularization strength, so we'll set it to a large value to emulate no regularization.
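
If you're wondering why a large $C$ matters, here's a rough sketch of the effect. This is not sklearn's solver: it's a hand-rolled Newton's method for a single 0/1 feature, with a penalty of $\frac{1}{2C}b_1^2$ on the slope only (my simplifying assumption). The function name `fit_one_binary_feature` and the counts are made up for illustration and are not the Titanic data.

```python
import math

def fit_one_binary_feature(n0, s0, n1, s1, C, iterations=50):
    """Fit log odds = b0 + b1 * x by Newton's method on grouped counts.

    n0, s0: total and positive counts for x = 0; n1, s1: same for x = 1.
    Assumption of this sketch: penalty (1 / (2 * C)) * b1**2 on the slope only.
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        p0 = 1.0 / (1.0 + math.exp(-b0))
        p1 = 1.0 / (1.0 + math.exp(-(b0 + b1)))
        w0 = n0 * p0 * (1.0 - p0)
        w1 = n1 * p1 * (1.0 - p1)
        # Gradient of the penalized negative log-likelihood.
        g0 = (n0 * p0 - s0) + (n1 * p1 - s1)
        g1 = (n1 * p1 - s1) + b1 / C
        # Entries of the symmetric 2 x 2 Hessian.
        h00, h01, h11 = w0 + w1, w1, w1 + 1.0 / C
        det = h00 * h11 - h01 * h01
        # Newton update: b <- b - H^{-1} g.
        b0 -= (h11 * g0 - h01 * g1) / det
        b1 -= (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up counts: 20 of 100 positives when x = 0, 70 of 100 when x = 1.
n0, s0, n1, s1 = 100, 20, 100, 70
for C in (0.01, 1.0, 1e10):
    b0, b1 = fit_one_binary_feature(n0, s0, n1, s1, C)
    print(f"C = {C:g}: intercept = {b0:.5f}, coefficient = {b1:.5f}")

# Unpenalized targets computed straight from the counts.
log_odds_0 = math.log(s0 / (n0 - s0))
log_odds_1 = math.log(s1 / (n1 - s1))
print("from counts:", log_odds_0, log_odds_1 - log_odds_0)
```

With a small $C$ the slope gets pulled toward zero; with a huge $C$ the penalty is negligible and the fit lines up with the log odds computed straight from the counts, which is the behaviour we want for the rest of this walkthrough.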



#### Explaining coefficient of a single dichotomous feature

Dichotomous just means the value can only be one of two things, coded here as 0 or 1, such as the field Sex in our Titanic dataset. In this section, we'll explore what the coefficients mean when regressing against only one dichotomous feature.

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1e10)

feature = ['Sex']
clf.fit(x_train[feature], y_train)

Now that we've fitted the logistic regression, we can ask sklearn to give us the two terms in $\Theta$, namely the intercept and the coefficient.

In [None]:
print('intercept:', clf.intercept_)
print('coefficient:', clf.coef_)

With our newly fitted $\Theta$, our logistic regression is now of the form:
$$y = \frac{1}{1 + e^{-(-1.45707 + 2.51366x)}}$$
or
$$\log\left(\frac{y}{1-y}\right) = -1.45707 + 2.51366x$$

So, when $x = 0$, meaning the passenger is male, our equation boils down to:
$$\log\left(\frac{p(survived|x=male)}{1-p(survived|x=male)}\right) = \log\left(\frac{p(survived|x=male)}{p(\overline{survived}|x=male)}\right) = -1.45707$$

Exponentiating both sides gives us:
$$\frac{p(survived|x=male)}{p(\overline{survived}|x=male)} = 0.232917$$
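
Here's that exponentiation step as a quick sketch, using only the fitted intercept quoted above. The helper name `odds_and_probability` is mine, purely for illustration. It also converts the odds back to a probability via $p = \frac{odds}{1 + odds}$, which is often the number people actually want.

```python
import math

intercept = -1.45707  # fitted intercept quoted above (log odds when x = 0)

def odds_and_probability(log_odds):
    """Turn a log odds value into (odds, probability)."""
    odds = math.exp(log_odds)
    return odds, odds / (1.0 + odds)

male_odds, male_prob = odds_and_probability(intercept)
print(f"odds of survival for men:        {male_odds:.6f}")
print(f"probability of survival for men: {male_prob:.6f}")
```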

Let's check this ourselves with some Python magic:

In [None]:
# Count survivors per sex (0 = male, 1 = female)
survived_by_sex = train[train.Survived == 1].groupby(train.Sex).count()[['Survived']]
# Total passengers per sex
survived_by_sex['Total'] = train.Survived.groupby(train.Sex).count()
survived_by_sex['NotSurvived'] = survived_by_sex.Total - survived_by_sex.Survived
# Odds = survived / not survived; probability = survived / total
survived_by_sex['OddsOfSurvival'] = survived_by_sex.Survived / survived_by_sex.NotSurvived
survived_by_sex['ProbOfSurvival'] = survived_by_sex.Survived / survived_by_sex.Total

# The first row is Sex == 0, i.e. males
survived_by_sex[['Survived', 'NotSurvived', 'OddsOfSurvival']].iloc[0]

As you can see, 109 men survived and 468 did not. The odds of survival are $\frac{109}{468} = 0.232906$, which is essentially the same as the 0.232917 we got above.

If we take the log of the odds of survival for men, $\log(0.232906) \approx -1.45712$, we get back the intercept sklearn gave us ($-1.45707$), up to a small numerical difference from the fitting procedure.

In essence, the intercept term from the logistic regression is the log odds of survival for our reference group, the one coded as $x = 0$, which here is men.
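
The same reasoning extends to the coefficient. When $x = 1$ (female), the log odds become intercept + coefficient, so the coefficient is the difference in log odds between women and men, and exponentiating it gives an odds ratio. Here's a small sketch using only the fitted values quoted earlier; the variable names are mine, and I haven't checked the female numbers against the raw counts here, so treat them as what the fitted model implies.

```python
import math

intercept = -1.45707   # fitted intercept quoted earlier (log odds for men, x = 0)
coefficient = 2.51366  # fitted coefficient for Sex quoted earlier

# Log odds for each group under the fitted model.
male_log_odds = intercept
female_log_odds = intercept + coefficient

male_odds = math.exp(male_log_odds)
female_odds = math.exp(female_log_odds)

# exp(coefficient) is the odds ratio: female odds divided by male odds.
odds_ratio = math.exp(coefficient)

print(f"odds of survival, men:   {male_odds:.4f}")
print(f"odds of survival, women: {female_odds:.4f}")
print(f"odds ratio (women vs men): {odds_ratio:.4f}")
```

In other words, the model says the odds of survival for women were roughly 12 times the odds for men.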

#### Explaining coefficient of a single continuous feature

In [None]:
X