# Logistic Regression

In linear regression, the parameter $\beta_{1}$ represents the change in the response variable (output) for a unit change in $x_{1}$.

$$
y = \beta_{0} + \beta_{1}x_{1}
$$

In logistic regression, the parameter $\beta_{1}$ represents the change in the **log-odds** for a unit change in $x_{1}$.

$$
\log{(\frac{p}{1 - p})} = \beta_{0} + \beta_{1}x_{1}
$$

where $p$ is the probability of class membership.

Solving for $p$ gives the model output:

$$
p = \frac{\exp(\beta_{0} + \beta_{1}x_{1})}{1 + \exp(\beta_{0} + \beta_{1}x_{1})}
$$

(Note: The odds of an event are the ratio of the probability of the event to the probability of its complement, $p / (1 - p)$.)
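
As a small numeric illustration of how the log-odds, odds and probability relate, here is a sketch using only the standard library. The coefficient values and the input `x1` are made up for the example and are not fitted to any data.

```python
import math

def logistic(z):
    """Map a log-odds value z to a probability."""
    return math.exp(z) / (1.0 + math.exp(z))

# Illustrative coefficients (assumed, not estimated from data)
beta0, beta1 = -1.0, 0.5
x1 = 4.0

log_odds = beta0 + beta1 * x1   # linear predictor, here 1.0
p = logistic(log_odds)          # probability of class membership
odds = p / (1.0 - p)            # odds = p / (1 - p)

# Taking the log of the odds recovers the linear predictor
assert math.isclose(math.log(odds), log_odds)

# A unit increase in x1 multiplies the odds by exp(beta1)
p_next = logistic(beta0 + beta1 * (x1 + 1.0))
odds_next = p_next / (1.0 - p_next)
assert math.isclose(odds_next / odds, math.exp(beta1))
```

The second assertion is the practical reading of $\beta_{1}$: each unit increase in $x_{1}$ adds $\beta_{1}$ to the log-odds, which multiplies the odds by $e^{\beta_{1}}$.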

More generally, logistic regression is a [Generalized Linear Model](https://en.wikipedia.org/wiki/Generalized_linear_model) with the [Logit](https://en.wikipedia.org/wiki/Logit) link function and a [Bernoulli Distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution) for the response variable.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Exercise 1

Plot the logistic sigmoid function: $p$ for varying values of $x_{1}$. Let $\beta_{0} = 0$ and $\beta_{1} = 1$.



In [2]:
# Code Here
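
One possible approach to this exercise is sketched below. The range of $x_{1}$ (from -10 to 10) and the plot styling are arbitrary choices, and the formula is written in the equivalent form $1 / (1 + e^{-z})$.

```python
import numpy as np
import matplotlib.pyplot as plt

beta0, beta1 = 0.0, 1.0          # values given in the exercise
x1 = np.linspace(-10, 10, 201)   # range chosen for illustration

# Equivalent to exp(z) / (1 + exp(z)) with z = beta0 + beta1 * x1
p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x1)))

fig, ax = plt.subplots()
ax.plot(x1, p)
ax.axhline(0.5, linestyle="--", linewidth=0.8)  # p = 0.5 where the log-odds are 0
ax.set_xlabel("$x_1$")
ax.set_ylabel("$p$")
ax.set_title("Logistic sigmoid")
plt.show()
```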

## GK or not? Revisited

This time we'll use Logistic Regression on a slightly larger version of the dataset (800 GK; 1200 outfield).

In [None]:
data = pd.read_csv('../data/players2.csv')

### Exercise 2

Create a new column `Target` by mapping the `Position` column: `GK` becomes `1` and `Outfield` becomes `0`.

In [1]:
# Code Here
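
A minimal sketch of the mapping, using a tiny hand-made DataFrame in place of `players2.csv`. Only the `Position` labels come from the exercise text; the rows themselves are invented.

```python
import pandas as pd

# Tiny stand-in for the real dataset (invented rows)
data = pd.DataFrame({"Position": ["GK", "Outfield", "Outfield", "GK"]})

# Map each label to its numeric class; labels missing from the dict would become NaN
data["Target"] = data["Position"].map({"GK": 1, "Outfield": 0})
print(data)
```

Checking `data["Target"].isna().sum()` afterwards is a quick way to catch unexpected labels in the real file.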

### Exercise 3

Split the data into train (80%) and test (20%) sets. The training set will be split further by cross-validation during the grid search.

Use sklearn's `train_test_split`. Specify the `stratify` option to preserve the class balance in both sets.

In [2]:
from sklearn.model_selection import train_test_split

In [None]:
# Code Here

# train_set = 
# test_set = 
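
A sketch of a stratified split on synthetic data with the class counts mentioned earlier (800 GK, 1200 outfield). The `Height` and `Weight` column names, their distributions and the `random_state` value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the players data (assumed columns and values)
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Height": rng.normal(180, 7, size=2000),
    "Weight": rng.normal(170, 15, size=2000),
    "Target": [1] * 800 + [0] * 1200,
})

train_set, test_set = train_test_split(
    data,
    test_size=0.2,             # 80% train, 20% test
    stratify=data["Target"],   # keep the 40/60 class ratio in both parts
    random_state=42,           # fixed seed so the split is reproducible
)

print(train_set["Target"].mean(), test_set["Target"].mean())
```

With `stratify`, both printed class proportions match the proportion in the full data, which would not be guaranteed by a purely random split.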

## Grid Search

We'll now do grid search, a method for [Hyperparameter optimization](https://en.wikipedia.org/wiki/Hyperparameter_optimization).

In [3]:
from sklearn.model_selection import GridSearchCV

### Exercise 4

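Find the best parameter values for `penalty` and `C` among those given below using `GridSearchCV`. Note that the `l1` penalty requires a solver that supports it, such as `liblinear` or `saga`.

Use 5-fold cross-validation.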

In [None]:
params = {'penalty': ['l1', 'l2'], 'C': [0.001,0.01,0.1,1,10,100,1000]}
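
The search could be set up along the following lines. This is a sketch on synthetic data, assuming a scikit-learn version in which `LogisticRegression` accepts the `penalty` argument; the `liblinear` solver is chosen here because it supports both `l1` and `l2`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic two-feature data standing in for the training set (assumed)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 2))
noise = rng.normal(scale=0.5, size=400)
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] + noise > 0).astype(int)

params = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),  # handles both l1 and l2 penalties
    param_grid=params,
    cv=5,                                    # 5-fold cross-validation
    scoring="accuracy",
)
grid.fit(X_train, y_train)

print(grid.best_params_)  # best combination of `penalty` and `C`
print(grid.best_score_)   # mean cross-validated accuracy for that combination
```

With the default `refit=True`, `grid.best_estimator_` is already refit on all of the data passed to `fit`, so it can be used directly for predictions.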

## Making predictions

### Exercise 5

Create a model with the best parameters found above and fit it on the whole `train_set`.

In [None]:
# Code Here

### Exercise 6

According to the fitted model, is someone with a height of 180 and a weight of 170 a GK?

In [None]:
# Code Here
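
A sketch of fitting a model and querying a single player. The training data is synthetic, the `Height` and `Weight` column names are assumptions, and the hyperparameters are placeholders to be replaced by the values found in the grid search.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for train_set (assumed columns and distributions)
rng = np.random.default_rng(0)
n_gk, n_out = 80, 120
train_set = pd.DataFrame({
    "Height": np.concatenate([rng.normal(190, 5, n_gk), rng.normal(180, 6, n_out)]),
    "Weight": np.concatenate([rng.normal(185, 12, n_gk), rng.normal(165, 12, n_out)]),
    "Target": [1] * n_gk + [0] * n_out,
})

features = ["Height", "Weight"]
# Placeholder hyperparameters; substitute the grid search results
model = LogisticRegression(penalty="l2", C=1.0, solver="liblinear")
model.fit(train_set[features], train_set["Target"])

# Wrap the query in a DataFrame with the same columns used for fitting
query = pd.DataFrame({"Height": [180], "Weight": [170]})
label = model.predict(query)[0]              # 1 = GK, 0 = Outfield
prob_gk = model.predict_proba(query)[0, 1]   # column 1 is the probability of class 1
print(label, round(prob_gk, 3))
```

`predict` returns the hard class label, while `predict_proba` shows how confident the model is, which is often more informative for a borderline case.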

## Testing the model

### Exercise 7

Compute the model's accuracy on `test_set`.

In [None]:
# Code Here
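
Test accuracy can be computed along these lines. The sketch uses synthetic data and a default `LogisticRegression` as stand-ins for the real `train_set`, `test_set` and tuned model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the players dataset (assumed)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)

# Accuracy = fraction of test samples whose predicted label matches the true label
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

# Equivalent shortcut: a classifier's .score() returns mean accuracy
assert np.isclose(test_accuracy, model.score(X_test, y_test))
print(test_accuracy)
```

The test set should be used only for this final check; model selection belongs to the cross-validation on the training set.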