# Logistic Regression and Regularisation

In [None]:
#Import libraries


### Loading the data

For this example we use a smaller dataset to keep CPU time low. The dataset comes from [the UCI repository](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/) and records measurements of tissue samples observed in a breast cancer study.

We won't look in detail at what the features are, but each column corresponds to a measurement of interest to the study.

Note that some columns record the standard deviation over measurements and may therefore be highly correlated with other columns (*why is that a bad thing?*). We will come back to this in the regularisation part.

Load `data/breastcancerdata.csv` and have a look at it with `data.head()`.
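
As a minimal sketch of the loading step, the snippet below reads a headerless CSV with `pandas.read_csv`. It assumes the file has no header row (hence `header=None`, which gives integer column labels such as `1`), and it uses a tiny in-memory CSV with made-up values in place of `data/breastcancerdata.csv` so that it runs on its own.

```python
import io

import pandas as pd

# Made-up stand-in for data/breastcancerdata.csv: ID, diagnosis, then measurements.
toy_csv = io.StringIO(
    "101,M,17.9,10.4\n"
    "102,B,13.5,14.4\n"
    "103,M,20.6,17.8\n"
)

# header=None (an assumption about the file) labels the columns 0, 1, 2, ...
data = pd.read_csv(toy_csv, header=None)
print(data.head())
```

For the real file, the path would take the place of `toy_csv`.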



In [None]:
# add your code here to import and inspect the data ...


### Extracting the response

The first column contains the patient IDs, so we ignore it. The second column is the response of interest, with values `M` (malignant) and `B` (benign). You need to tell scikit-learn that these are the two classes of interest. One way of doing this is to use the `LabelEncoder` tool from `sklearn.preprocessing`. In the cell below we:

* create a `LabelEncoder` and call it `le`
* fit the encoder on the unique values of that column, using `loc` to select the column and `unique` to extract its distinct values
* apply the encoder with `le.transform` on the column and name the result `response`

In [None]:
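from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(data.loc[:, 1].unique())          # learn the two classes, 'B' and 'M'
response = le.transform(data.loc[:, 1])  # map each label to an integer code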

You can check what `response` looks like by outputting the first few values with 

```python
print(response[0:20])
```

The result should be as follows (`M` is encoded as 1 and `B` as 0):

> `[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0]`
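
To see where the 0/1 codes come from, here is a small self-contained sketch on a made-up label array (a stand-in for the diagnosis column). `LabelEncoder` stores the classes in sorted order in `le.classes_`, and the position of a class in that array is its integer code.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up diagnosis values standing in for data.loc[:, 1].
labels = np.array(["M", "B", "B", "M"])

le = LabelEncoder()
le.fit(np.unique(labels))
encoded = le.transform(labels)

print(le.classes_)                    # classes in sorted order; index = integer code
print(encoded)                        # integer codes for each label
print(le.inverse_transform(encoded))  # back to the original letters
```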

### Extracting the feature matrix

The feature matrix consists of the remaining columns. Using `loc` again, extract those columns and store them as `featmatrix`.

The resulting matrix should have `30` columns; you can check this with `featmatrix.shape`, which should return `(569, 30)`.
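
One possible way to select "everything from the third column onwards" with `loc` is a label slice. The sketch below uses a made-up two-row frame with integer column labels (the names `toy` and `toy_features` are illustrative only), assuming the real data also has integer column labels.

```python
import pandas as pd

# Toy frame with integer column labels 0..4: ID, diagnosis, three measurements.
toy = pd.DataFrame([[101, "M", 1.0, 2.0, 3.0],
                    [102, "B", 4.0, 5.0, 6.0]])

# With .loc the slice 2: is label-based and keeps every column from label 2 onwards.
toy_features = toy.loc[:, 2:]
print(toy_features.shape)
```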

In [None]:
# add your code here...


### Train Test Split

Now that we have the feature matrix and the response, split them into training and test sets using the `train_test_split` function from `sklearn.model_selection`.
Name the results `X_train, X_test, y_train, y_test`. Use `random_state=321` so that the results are reproducible.
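
The call pattern can be sketched on synthetic arrays (stand-ins for `featmatrix` and `response`, generated here only for illustration). With no `test_size` given, `train_test_split` holds out 25% of the rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for featmatrix and response.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# The default test_size is 0.25; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=321)
print(X_train.shape, X_test.shape)
```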

In [None]:
# add your code here...


### Applying the Logistic Regression

The `LogisticRegression` model is located in `sklearn.linear_model` ([sklearn documentation here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)). 
Define a default model, fit it on the training data and predict on `X_test`. Using `accuracy_score` from `sklearn.metrics`, display the accuracy.

**Note**: `.predict` returns a class label (0 or 1), while `.predict_proba` returns the estimated probability of each class (values in $[0, 1]$, one column per class).
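
A minimal sketch of the fit, predict and score steps follows. To stay self-contained it assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV (note that this copy encodes malignant as 0 and benign as 1, the reverse of the `LabelEncoder` result), and it raises `max_iter`, which is a choice made here rather than part of the exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for the CSV; this copy encodes malignant as 0 and benign as 1.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=321)

# max_iter is raised (an assumption) to give the solver room on unscaled features.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)         # class labels, 0 or 1
y_proba = model.predict_proba(X_test)  # shape (n_samples, 2); each row sums to 1
accuracy = accuracy_score(y_test, y_pred)
print("accuracy:", accuracy)
```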

In [None]:
# add your code here...


# Regularisation

## Motivation

When some features are almost colinear, meaning that a feature could be expressed with high accuracy by a linear combination of other features, the resulting model can be unstable and generalise badly. 

A simple way to test for colinearity is to compute the condition number of the feature matrix (this is relatively cheap when $\min(n, p)$ is less than 1000). 

The condition number is given by the ratio of the largest singular value to the smallest singular value. 
If it is very large, the first principal component carries significantly more information (variance) than the last principal component.
In other words, the last few components add very little information, which suggests that the *effective dimensionality* of the dataset is less than $p$.

A condition number much higher than 1000 is typically a sign of colinearity and a hint that either PCA or regularisation should be applied.

1. Compute the condition number of the feature matrix, both from its singular values and with `np.linalg.cond`, and discuss the result.
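
The two computations can be sketched on synthetic matrices (made up here, not the breast cancer features): one with independent columns, and one where an extra column is almost a linear combination of two others. The ratio of extreme singular values from `np.linalg.svd` should agree with `np.linalg.cond`, which uses the 2-norm by default.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 3))          # three independent columns
noise = 1e-6 * rng.normal(size=200)
# Fourth column is almost exactly the sum of the first two.
B = np.column_stack([A, A[:, 0] + A[:, 1] + noise])

for name, M in [("independent", A), ("nearly colinear", B)]:
    s = np.linalg.svd(M, compute_uv=False)  # singular values, largest first
    ratio = s[0] / s[-1]                    # largest over smallest
    cond = np.linalg.cond(M)                # 2-norm condition number by default
    print(f"{name}: ratio={ratio:.3g}, np.linalg.cond={cond:.3g}")
```

On the real data, the same two lines would be applied to `featmatrix`.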

In [None]:
# add your code here to compute the condition number both ways


## Why does colinearity matter? 

Go back to the cell in your notebook where you did the `train_test_split`.
Change the random state from `321` to `123` and re-run the following few cells. 
What is the accuracy now? 
What does that tell you? 

Once you have looked at that, set the random state back to `321` and re-execute the cells (so that we're all on the same page).
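
Instead of editing and re-running cells by hand, the comparison can also be sketched as a loop over the two seeds. As before, this assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV and a raised `max_iter`; the dictionary name `accuracies` is illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the CSV

accuracies = {}
for seed in (321, 123):
    # Same data and model, different random split.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
    accuracies[seed] = accuracy_score(y_te, model.predict(X_te))

print(accuracies)
```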

## Adding regularisation to the logistic regression

With scikit-learn it is easy to specify a regularisation penalty (`l1` or `l2`) and to control its strength through the parameter `C`.

To see this, define a logistic regression as before but this time, instead of just using `LogisticRegression()`, specify the `penalty` as being `l1` and specify `C` to be `2.0` (which corresponds to $\lambda = 0.5$).
Recent versions of scikit-learn default to the `lbfgs` solver, which does not support the `l1` penalty, so you may also need to pass `solver='liblinear'`.
Fit it, predict on the test set and show the accuracy.

**Note**: the `C` parameter is the **inverse** of the regularisation strength. In other words, **lower** `C` means **more** regularisation.
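
A minimal sketch of an `l1`-penalised fit with `C=2.0` is given below. It assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV and picks the `liblinear` solver, one of the solvers that support the `l1` penalty.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the CSV
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=321)

# C is the inverse regularisation strength: C = 2.0 corresponds to lambda = 0.5.
model_l1 = LogisticRegression(penalty="l1", C=2.0, solver="liblinear")
model_l1.fit(X_train, y_train)

acc_l1 = accuracy_score(y_test, model_l1.predict(X_test))
n_zero = int(np.sum(model_l1.coef_ == 0))  # coefficients set exactly to zero
print("accuracy:", acc_l1, "| zero coefficients:", n_zero)
```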

In [None]:
# add your code here


## (Bonus) Having a look at the coefficients

We will discuss this in more detail in a minute, but if you have already reached this point you can try to build some intuition:

* display the coefficients for the basic logistic regression
* display the coefficients for the logistic regression with l1 regularisation (fitted in the previous section)

What do you observe, and why do you think that is a desirable trait of l1 regularisation?

**Note**: you may want to use `plt.stem` to display the coefficients.
Use the same `y`-axis scale for both plots so that the amplitudes can be compared.
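
One way to lay out the comparison is two stem plots side by side with a shared y-axis (`sharey=True`). The sketch assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV; the names `basic` and `sparse` and the figure size are illustrative choices.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the CSV
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=321)

basic = LogisticRegression(max_iter=5000).fit(X_train, y_train)
sparse = LogisticRegression(penalty="l1", C=2.0, solver="liblinear").fit(X_train, y_train)

# sharey=True forces both panels onto the same y-axis scale.
fig, axes = plt.subplots(1, 2, figsize=(10, 3), sharey=True)
axes[0].stem(basic.coef_.ravel())
axes[0].set_title("default (l2, C=1.0)")
axes[1].stem(sparse.coef_.ravel())
axes[1].set_title("l1, C=2.0")
for ax in axes:
    ax.set_xlabel("feature index")
```

In a notebook the figure is displayed when the cell finishes; in a script, a call to `plt.show()` would be needed.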

In [None]:
# your code here


## (Bonus) Using cross validation to find the best hyperparameter value

We would now like to test a range of values of `C` for both `l1` and `l2` in order to find the "best" set of parameters. `GridSearchCV` is here to help! Note that you could also use `LogisticRegressionCV`, which has cross-validation built in.

Take the penalty to be either `l1` or `l2` and `C` to be $2^{-5}, 2^{-4}, \dots, 2^4, 2^5$ (it is standard to explore regularisation strengths on a logarithmic scale).
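
The grid described above can be sketched with `GridSearchCV` as follows. The sketch assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV, the `liblinear` solver (which handles both penalties), and 5-fold cross-validation; the last two are choices made here, not requirements of the exercise.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the CSV
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=321)

param_grid = {
    "penalty": ["l1", "l2"],
    "C": 2.0 ** np.arange(-5, 6),  # 2^-5, 2^-4, ..., 2^5
}

# Every (penalty, C) pair is scored by 5-fold cross-validation on the training set.
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)
search.fit(X_train, y_train)

# By default the best model is refitted on the whole training set, so it can predict directly.
test_acc = accuracy_score(y_test, search.predict(X_test))
print(search.best_params_)
print("test accuracy:", test_acc)
```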

In [None]:
# your code here

# use the best parameters and check the accuracy
