# Logistic Regression

In linear regression, the parameter $\beta_{1}$ represents the change in the response variable (output) for a unit change in $x_{1}$.

$$
y = \beta_{0} + \beta_{1}x_{1}
$$

In logistic regression, the parameter $\beta_{1}$ represents the change in the **log-odds** for a unit change in $x_{1}$.

$$
\log{(\frac{p}{1 - p})} = \beta_{0} + \beta_{1}x_{1}
$$

where $p$ is the probability of class membership.

So, the output is:

$$
p = \frac{\exp{(\beta_{0} + \beta_{1}x_{1})}}{1 + \exp{(\beta_{0} + \beta_{1}x_{1})}}
$$

(Note: The odds of an event are the ratio of the probability of the event to the probability of its complement, i.e. $p / (1 - p)$.)
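As a quick numerical check of the relationship between log-odds, odds and probability, the following sketch converts a linear predictor to a probability and back. The coefficient values `beta0 = -1.0` and `beta1 = 0.5` are arbitrary assumptions chosen only for illustration; they are not fitted to any data.

```python
import numpy as np

# Illustrative coefficients (assumed values, not fitted to any data)
beta0, beta1 = -1.0, 0.5
x1 = np.array([0.0, 1.0, 2.0, 3.0])

log_odds = beta0 + beta1 * x1                  # linear predictor
p = np.exp(log_odds) / (1 + np.exp(log_odds))  # probability of class membership
odds = p / (1 - p)                             # odds of the event

# Taking the log of the odds recovers the linear predictor
print(np.allclose(np.log(odds), log_odds))

# A unit increase in x1 multiplies the odds by exp(beta1)
print(np.allclose(odds[1:] / odds[:-1], np.exp(beta1)))
```

Both checks should print `True`: $\beta_{1}$ is additive on the log-odds scale and multiplicative on the odds scale.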

More generally, logistic regression is a [Generalized Linear Model](https://en.wikipedia.org/wiki/Generalized_linear_model) with the [Logit](https://en.wikipedia.org/wiki/Logit) link function and the [Bernoulli Distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution) as the distribution of the response.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Exercise 1

Plot the logistic sigmoid function: $p$ for varying values of $x_{1}$. Let $\beta_{0} = 0$ and $\beta_{1} = 1$.



In [2]:
# Code Here
def sigmoid(z):
    ...
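One possible solution sketch for Exercise 1. It uses the form $p = 1 / (1 + e^{-z})$, which is algebraically equivalent to the expression for $p$ given earlier. The plotting range for $x_{1}$ (-10 to 10) is an assumption; the exercise does not specify one.

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """Logistic sigmoid: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta1 = 0.0, 1.0            # values given in the exercise
x1 = np.linspace(-10, 10, 200)     # assumed plotting range
p = sigmoid(beta0 + beta1 * x1)

plt.plot(x1, p)
plt.xlabel("$x_1$")
plt.ylabel("$p$")
plt.title("Logistic sigmoid")
plt.grid(True)
```

The curve passes through $p = 0.5$ at $x_{1} = 0$ and flattens towards 0 and 1 at the extremes.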

## GK or not? Revisited

This time we'll use Logistic Regression on a slightly larger version of the dataset (800 GK; 1200 outfield).

In [4]:
data = pd.read_csv('../data/players2.csv')

In [5]:
data.head()

Unnamed: 0,player_api_id,player_name,height,weight,Position
0,35626,Emanuele Belardi,187.96,176,GK
1,71352,Marcelo Jose Oliveira,185.42,176,Outfield
2,489240,Jaume,187.96,168,GK
3,41292,Angel Rodriguez,172.72,150,Outfield
4,38318,Jurgen Sierens,190.5,196,GK


### Exercise 2

Create a new column `Target` by mapping the `Position` column: `GK` becomes `1` and `Outfield` becomes `0`.

In [6]:
# Code Here
data['Target'] = data['Position'].map({"GK": 1, "Outfield": 0})
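A caveat worth checking when mapping with a dictionary: values missing from the dictionary become `NaN` rather than raising an error. The sketch below shows this on a small made-up frame; the `"Striker"` label is hypothetical and does not appear in the dataset.

```python
import pandas as pd

# Small made-up frame; "Striker" is a hypothetical label not in the mapping
toy = pd.DataFrame({"Position": ["GK", "Outfield", "Striker"]})
toy["Target"] = toy["Position"].map({"GK": 1, "Outfield": 0})

# Count rows whose Position had no entry in the mapping
n_unmapped = toy["Target"].isna().sum()
print(n_unmapped)
```

Running the same `isna().sum()` check on `data['Target']` is a cheap way to confirm that every row was mapped.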

In [7]:
data.head()

Unnamed: 0,player_api_id,player_name,height,weight,Position,Target
0,35626,Emanuele Belardi,187.96,176,GK,1
1,71352,Marcelo Jose Oliveira,185.42,176,Outfield,0
2,489240,Jaume,187.96,168,GK,1
3,41292,Angel Rodriguez,172.72,150,Outfield,0
4,38318,Jurgen Sierens,190.5,196,GK,1


### Exercise 3

Split the data into train (80%) and test (20%) sets. The training set will be further split using cross-validation (through a grid search).

Use sklearn's `train_test_split`. Specify the `stratify` option so that both splits keep the class proportions of the full dataset.

In [21]:
data.Target.value_counts()

0    1200
1     800
Name: Target, dtype: int64

In [8]:
from sklearn.model_selection import train_test_split

In [54]:
X = data[['height', 'weight']].values
y = data['Target'].values

In [22]:
X.shape

(2000, 2)

In [49]:
# Code Here
train_x, test_x, train_y, test_y = train_test_split(X, y, 
                                                    test_size= 0.2,
                                                    stratify=y,
                                                    random_state=1)

In [23]:
train_x.shape

(1600, 2)

In [24]:
test_x.shape

(400, 2)
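To see what `stratify` does, the class proportions of each split can be compared. This sketch uses synthetic labels with the same class counts as the dataset (1200 zeros, 800 ones) and a placeholder feature column; `X_demo` and `y_demo` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset
y_demo = np.array([0] * 1200 + [1] * 800)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)   # placeholder feature

_, _, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                    stratify=y_demo, random_state=1)

# Fraction of class 1 in the train and test splits
print(y_tr.mean(), y_te.mean())
```

With stratification, both fractions should match the overall share of class 1 (800 / 2000 = 0.4) instead of drifting with the random draw.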

## Grid Search

We'll now do grid search, a method for [Hyperparameter optimization](https://en.wikipedia.org/wiki/Hyperparameter_optimization).

In [40]:
from sklearn.model_selection import GridSearchCV

In [41]:
from sklearn.linear_model import LogisticRegression

### Exercise 4

Find the best parameter values for `penalty` and `C` among those given below using GridSearchCV.

Use 5-fold cross-validation.

In [43]:
params = {'penalty': ['l1', 'l2'], 'C': [0.001,0.01,0.1,1,10,100,1000]}

In [44]:
# liblinear supports both the 'l1' and 'l2' penalties
gs = GridSearchCV(LogisticRegression(solver='liblinear'), params, cv=5, scoring='accuracy')

In [46]:
train_x.shape

(1600, 2)

In [47]:
train_y.shape

(1600,)

In [50]:
gs.fit(train_x, train_y)

GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=0)

In [56]:
gs.best_params_

{'C': 1000, 'penalty': 'l2'}
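Conceptually, a grid search cross-validates every combination of parameter values and keeps the one with the best mean score. The sketch below does this by hand on synthetic stand-in data from `make_classification` (an assumption, not the players dataset) with a shortened list of `C` values, so the numbers it prints say nothing about the result above.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the players dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=2,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)

grid = {'penalty': ['l1', 'l2'], 'C': [0.01, 1, 100]}

results = {}
for penalty, C in product(grid['penalty'], grid['C']):
    # liblinear supports both the l1 and l2 penalties
    model = LogisticRegression(penalty=penalty, C=C, solver='liblinear')
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    results[(penalty, C)] = scores.mean()   # mean accuracy over the 5 folds

best = max(results, key=results.get)
print(best, results[best])
```

`GridSearchCV` wraps this loop and, with its default `refit=True`, also refits the best combination on all the data passed to `fit`. The per-combination scores are available afterwards in `gs.cv_results_`.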

## Making predictions

### Exercise 5

Create a model with the best parameters found above and fit it on the whole training set.

In [57]:
# Code Here
final_model = LogisticRegression(C=1000, penalty='l2', solver='liblinear')
final_model.fit(train_x, train_y)

LogisticRegression(C=1000, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

### Exercise 6

Is someone with a height of 180 and weight of 170 a GK?

In [65]:
# Code Here
final_model.predict([[180, 170]])

array([0])

In [68]:
final_model.predict_proba([[200, 180]])

array([[ 0.09058536,  0.90941464]])
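The second column of `predict_proba` is the logistic sigmoid applied to the fitted linear predictor, which ties the model back to the formula for $p$ at the top of the notebook. The sketch below checks this on synthetic stand-in data; `X_demo`, `y_demo` and `model` are illustrative names, not the notebook's variables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: not the players dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=2,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)
model = LogisticRegression(solver='liblinear').fit(X_demo, y_demo)

# Linear predictor: beta_0 + beta_1 * x_1 + beta_2 * x_2
z = model.intercept_ + X_demo @ model.coef_.ravel()
p_manual = 1.0 / (1.0 + np.exp(-z))

# Column 1 of predict_proba is the probability of class 1
print(np.allclose(p_manual, model.predict_proba(X_demo)[:, 1]))
```

The same computation with `final_model.intercept_` and `final_model.coef_` should reproduce the probabilities shown above, and `predict` returns class 1 whenever that probability exceeds 0.5.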

## Testing the model

### Exercise 7

Compute the accuracy on the test set.

In [61]:
from sklearn.metrics import accuracy_score

In [62]:
# Code Here
predictions = final_model.predict(test_x)

In [63]:
accuracy_score(test_y, predictions)

0.77000000000000002
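Accuracy is the fraction of predictions that match the true labels, so it is best read against a majority-class baseline. Since 1200 of the 2000 players are outfield, always predicting `Outfield` would already score about 0.6 on a stratified test set. The sketch below shows both quantities on made-up labels and predictions, used for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions for illustration only
y_true = np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# accuracy_score is the mean of the element-wise matches
acc = accuracy_score(y_true, y_pred)
print(np.isclose(acc, np.mean(y_true == y_pred)))

# Baseline: always predict the most frequent class
majority = np.bincount(y_true).argmax()
baseline = np.mean(y_true == majority)
print(acc, baseline)
```

The same baseline computed from `test_y` gives the reference point for the test accuracy reported above.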