## Morning practical 2 day 2

Welcome to the second practical of today. Here, you will work on implementing regularised logistic regression, as well as implementing cross-validation on some data and making an ROC curve yourself. First run the two cells below to set things up.



In [None]:
#run this cell to set things up
import ipywidgets as widgets, numpy as np, pandas as pd
from numpy.random import default_rng
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import math
import seaborn as sns
from IPython.display import display, Markdown
from scipy.optimize import fmin_bfgs

In [None]:
#important functions
def mySigmoid(data):
    output = 1/(1+ np.exp(-data))
    return output

def linAlgRegHypothesis(data, thetas):
    data = np.array(data)
    oneFeatToAdd = np.ones(len(data))
    newFeatArray = np.c_[oneFeatToAdd, data]
    #make sure thetas are always of the form np.array([[theta1], [theta2]]), i.e. column vector
    if thetas.ndim < 2:
        thetas = thetas[:, np.newaxis]
    predictions = newFeatArray @ thetas
    return predictions

def linAlgLogRegHypothesis(data, thetas):
    output = mySigmoid(linAlgRegHypothesis(data, thetas))
    return output

def costFuncLogReg(x, y, thetas):
    predictions      = linAlgLogRegHypothesis(x, thetas)
    costsPerSample   = -y * np.log(predictions) - (1-y) * np.log(1 - predictions)
    totalCosts       = np.nansum(1/len(x) * costsPerSample)
    return totalCosts

def makeCrossValData(dataFrame, k=10):
    '''Function to make splits into training and validation sets.
    Returns the shuffled DataFrame plus two lists of length k: for each fold, the positional indices
    (into the returned, shuffled DataFrame) of the samples to train on and of the samples to test on, respectively.'''
    #shuffle data
    dataFrame = dataFrame.sample(frac=1)
    m = len(dataFrame)
    #see how many equal-sized sets you can make
    dataPerSplit = int(np.floor(m/k))
    dataPartitions = []
    counter = 0

    for i in range(0,k):
        #make a list of all the samples for each fold
        dataPartitions.append(list(range(counter,counter+dataPerSplit)))
        counter += dataPerSplit

    samplesEquallySplit = k * dataPerSplit
    if not samplesEquallySplit == m:
        #after making equal splits there will be samples left, i.e. you cannot always make k exactly evenly sized subsets.
        #randomly assign left over samples to folds after
        toDivide = m-samplesEquallySplit
        for extraSampleIndex in range(counter, counter+toDivide):
            #only assign to lists of samples that have the current minimum amount of samples
            currentSubsetSizes = np.array([len(subset) for subset in dataPartitions])
            assignTo = np.random.choice(np.where(currentSubsetSizes == np.min(currentSubsetSizes))[0])
            dataPartitions[assignTo].append(extraSampleIndex)
    
    #Now make the final cross-validation set: make k sets, each set has (k-1)/k folds to train on, and 1 fold to test on.
    testSet = []
    trainSet = []
    for validationSetIndex in range(0,k):
        #put 1 fold in the test set
        testSet.append(dataPartitions[validationSetIndex])
        #put all other folds in the train set
        trainSet.append(dataPartitions.copy())
        trainSet[validationSetIndex].pop(validationSetIndex)
        #this line makes sure all training set indices are in one big list, rather than k-1 small lists. 
        trainSet[validationSetIndex] = [item for sublist in trainSet[validationSetIndex] for item in sublist]
    
    return dataFrame, trainSet, testSet

## Regularisation

Regularisation is a method of automatically constraining how much your model can (over)fit on the training data. We add some factor (the regularisation weight $\lambda$) times the sum of squares of the parameters (excluding the intercept $\theta_0$) to the cost function. In this way, the model cannot pick extremely large values for the parameters: when you have 100 features, for example, the model is forced to reserve high $\theta$ values for those features that matter a lot for correct classification, while keeping extremely low or even 0 values for features that don't. Hence, regularisation also automatically selects features that are of importance to your problem: feature selection! Note that once you have trained the model and want to know the cost on the validation/test set, you should not use the regularised cost: you care about your performance in the end (which you hope is better because you constrained the parameters during fitting).

* To get started, change your `costFuncLogReg` to have an extra argument `lambda_ = 0` (with a trailing `_` because `lambda` is a reserved keyword for anonymous functions) that, if set to a value higher than 0, causes regularisation to be performed.
* Make sure to exclude the bias/intercept term ($\theta_0$) from this. By convention this is not regularised.
* While you are at it, also reorder the arguments to `thetas, x, y, lambda_=0` so it is easier to use BFGS or another optimizer if we want to!

Hint:
* Remember that the regularised logistic regression cost function is:
![APicture](RegLogRegEq.PNG) You already had the first part implemented, you only need to add the second part!
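
In case the picture does not render: assuming it shows the usual form, the regularised cost is $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$. The small sketch below computes only the second (penalty) part on made-up numbers, to show how $\theta_0$ is left out. The function name `l2PenaltySketch` and the toy values are illustrative assumptions, not part of the practical's code.

```python
import numpy as np

def l2PenaltySketch(thetas, m, lambda_=0):
    """Penalty term lambda/(2m) * sum(theta_j^2) for j >= 1; theta_0 is skipped."""
    thetas = np.ravel(thetas)              # accept column vectors as well as flat arrays
    return lambda_ / (2 * m) * np.sum(thetas[1:] ** 2)

# made-up parameters: theta_0 = 5 is the intercept and should not be penalised
exampleThetas = np.array([5.0, 1.0, -2.0])

noPenalty   = l2PenaltySketch(exampleThetas, m=10, lambda_=0)
withPenalty = l2PenaltySketch(exampleThetas, m=10, lambda_=4)   # 4/20 * (1 + 4)
print(noPenalty, withPenalty)
```

Note that the large intercept does not contribute at all: only $\theta_1$ and $\theta_2$ enter the sum.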



In [None]:
# your answer here


## Changing gradient descent

The gradients should also change. Luckily, since all that's added is a plus term, the change is extremely minor:
![gradients_logreg](GradientsRegLogReg.PNG)

* Up to you to implement the changes in the `linAlgGradientDescent` function. Add another `lambda_ = 0` argument and change the gradients as needed.
* After that's done, let's refactor: make one function called `computeGradients()` that computes and returns the gradients. Make another function called `gradientDescentStep()` that takes a step using current thetas, those gradients, and an alpha value. In this way, we can use the first one if we want to use BFGS, and the second one if we want to use gradient descent proper.
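
Once you have written your own version, one way to check regularised gradients is to compare them against finite differences of the cost. The sketch below does this on random toy data. It assumes the standard gradients (the unregularised gradient plus $\frac{\lambda}{m}\theta_j$ for $j \geq 1$) and, unlike the notebook's functions, uses flat arrays instead of column vectors. All names ending in `Sketch`, as well as `numericalGradients`, are hypothetical helpers for illustration only.

```python
import numpy as np

def sigmoidSketch(z):
    return 1 / (1 + np.exp(-z))

def costSketch(thetas, x, y, lambda_=0):
    """Regularised logistic regression cost; thetas is a flat array, theta_0 first."""
    m = len(x)
    xWithOnes = np.c_[np.ones(m), x]
    preds = sigmoidSketch(xWithOnes @ thetas)
    dataCost = np.mean(-y * np.log(preds) - (1 - y) * np.log(1 - preds))
    return dataCost + lambda_ / (2 * m) * np.sum(thetas[1:] ** 2)

def gradientsSketch(thetas, x, y, lambda_=0):
    """Analytic gradients; the intercept (index 0) gets no regularisation term."""
    m = len(x)
    xWithOnes = np.c_[np.ones(m), x]
    errors = sigmoidSketch(xWithOnes @ thetas) - y
    grads = xWithOnes.T @ errors / m
    grads[1:] += lambda_ / m * thetas[1:]
    return grads

def numericalGradients(thetas, x, y, lambda_=0, eps=1e-6):
    """Central finite differences of costSketch, one parameter at a time."""
    grads = np.zeros_like(thetas)
    for j in range(len(thetas)):
        step = np.zeros_like(thetas)
        step[j] = eps
        costPlus  = costSketch(thetas + step, x, y, lambda_)
        costMinus = costSketch(thetas - step, x, y, lambda_)
        grads[j] = (costPlus - costMinus) / (2 * eps)
    return grads

# random toy data: 20 samples, 3 features
rng = np.random.default_rng(0)
xToy = rng.normal(size=(20, 3))
yToy = (rng.random(20) < 0.5).astype(float)
thetasToy = rng.normal(size=4)

analytic = gradientsSketch(thetasToy, xToy, yToy, lambda_=2.0)
numeric  = numericalGradients(thetasToy, xToy, yToy, lambda_=2.0)
print(np.max(np.abs(analytic - numeric)))   # should be tiny if the gradients are right
```

If the largest difference is not close to zero for your own functions, the most likely culprits are regularising $\theta_0$ or forgetting the $\frac{1}{m}$ factor on the penalty.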

In [None]:
#old function
def linAlgGradientDescent(x, y, thetas, alpha, hypothesis = "linAlgLogRegHypothesis") :
    m = len(x)
    if thetas.ndim < 2:
        thetas = thetas[:, np.newaxis]
    preds  = globals()[hypothesis](x, thetas)
    if preds.shape != (m, 1):
        preds  = preds[:, np.newaxis]
    if y.ndim < 2:
        y = y[:, np.newaxis]
    errors = preds - y
    gradientSummation  = errors.T @ np.c_[np.ones(len(errors)), x]
    finalGradientSteps = alpha/m * gradientSummation
    newThetas          = thetas - finalGradientSteps.T
    return newThetas


# your answer


# refactor into separate functions


## Loading in some data for testing

Let's train on the Pima Indians dataset, which contains information on multiple clinical variables and whether or not patients have diabetes. The below code loads in the data. Up to you to investigate this data somewhat:

* Are there any NaNs in the data?
* Are there other values that seem suspect? Name 2 examples. How many of these suspect values are there in these features?
* How many cases and controls are there? Is this a balanced dataset?

Hint(s):
* Use the `.describe` method of the dataframe to help you answer these questions.
* Remember that you can index a dataframe using `df.loc[df["colName"] < 12, :]`
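
As a rough illustration of the kind of checks meant here, the sketch below runs them on a tiny made-up DataFrame. The column names and values are hypothetical and will not match `PimaIndiansDiabetes.csv`, so adapt them to what `.describe()` shows you.

```python
import numpy as np
import pandas as pd

# hypothetical toy data; column names are made up for illustration
toyData = pd.DataFrame({
    "glucose": [148, 85, 0, 89, 137],
    "bmi":     [33.6, 26.6, 23.3, 0.0, 43.1],
    "label":   [1, 0, 1, 0, 1],
})

nanCounts   = toyData.isna().sum()                       # NaNs per column
zeroCounts  = (toyData[["glucose", "bmi"]] == 0).sum()   # zeros per feature column
classCounts = toyData["label"].value_counts()            # cases vs controls
lowBmiRows  = toyData.loc[toyData["bmi"] < 12, :]        # rows with an implausible BMI

print(nanCounts, zeroCounts, classCounts, lowBmiRows, sep="\n\n")
```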

In [None]:
diabetesData = pd.read_csv("PimaIndiansDiabetes.csv")

#your answers below 




## Cleaning up the dataset

The dirty secret of ML is that you spend most of your time cleaning data. So you'll have to spend some time on that here. Do the following:

* Replace the 0 values with `np.nan` (**Note**: be aware that you shouldn't do this for all columns. Think about it.)
* Use [sklearn.impute.KNNImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html) to impute values that are missing for those columns where you inserted NaNs. Those who have followed the BiBC Essentials Course might remember K-Nearest Neighbours (KNN). This function determines the (by default) 5 most similar samples (based on data that is _not_ missing) and sets the bmi/glucose level, etc. to the mean of their values. Euclidean distance is used. We will discuss K-Nearest Neighbours in two days. For now, you can just use it. To do so, use `a = KNNImputer(missing_values = np.nan)` followed by `imputedData = a.fit_transform(nonImputedData)`.
* Note that this turns the DataFrame into a numpy array, which is not a problem but it's good to know. To make it into a dataframe again, use `pd.DataFrame(yourArray)`. Be sure to add back features that you might have removed because you didn't want to impute them.
* Mean-normalise (i.e. subtract the mean and divide by the standard deviation) the features using the function from yesterday (provided below); since its default mode is `"range"`, pass `mode = "SD"` to divide by the standard deviation. This should be done on all the data except the labels. Note that this function expects a DataFrame as input (DataFrames automatically apply methods like `.mean()` by column, numpy arrays don't do that).
* Put the class into a `np.array` (a column vector) called `diabetesClassLabels`
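
The replace-then-impute steps can be sketched on a tiny made-up table as follows. The column names, the choice of which columns get their zeros replaced, and `n_neighbors=2` (chosen only because the toy table has six rows) are all assumptions for illustration, not the answer for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# hypothetical toy data: a zero glucose or BMI is impossible, zero pregnancies is not
toyFrame = pd.DataFrame({
    "glucose":     [148.0, 85.0, 0.0, 89.0, 137.0, 116.0],
    "bmi":         [33.6, 26.6, 23.3, 0.0, 43.1, 25.6],
    "pregnancies": [6, 1, 8, 1, 0, 5],
})

# only replace zeros in the columns where zero cannot be a real measurement
colsWithFakeZeros = ["glucose", "bmi"]
toyFrame[colsWithFakeZeros] = toyFrame[colsWithFakeZeros].replace(0, np.nan)

# fit_transform returns a numpy array, so wrap it in a DataFrame again
imputer = KNNImputer(missing_values=np.nan, n_neighbors=2)
imputedArray = imputer.fit_transform(toyFrame)
imputedFrame = pd.DataFrame(imputedArray, columns=toyFrame.columns)
print(imputedFrame)
```

Passing `columns=` when rebuilding the DataFrame keeps the feature names, which otherwise become 0, 1, 2, and so on.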

In [None]:
from sklearn.impute import KNNImputer

def createNormalisedFeatures(dataFrame, mode = "range"):
    featureMeans = dataFrame.mean()
    if mode == "range":
        featureRanges = dataFrame.max() - dataFrame.min()
        normalisedFeatures = (dataFrame - featureMeans)/featureRanges
        return [normalisedFeatures, featureMeans, featureRanges]
    elif mode == "SD":
        featureSDs = dataFrame.std()
        normalisedFeatures = (dataFrame - featureMeans)/featureSDs
        return [normalisedFeatures, featureMeans, featureSDs]
    return None


# your answer




## Testing your new functions' mettle

Okay, now we can train regularised logistic regression on this data. Let's **use lambda values of 0, 0.5, 1, 5, 10, 100, and 1000**. We'll downsample the data so we have equal amounts of the positive and negative class, and train the classifier on 80% of the data while testing on the 20% that is held out (normally we'd use cross-validation but let's not put that extra level of complication in here as well). 

The visualisation of a decision boundary/what has been learned is somewhat complex: we can't just draw some boundary in 2D as our data isn't 2D but 8D.
We'll reduce the dimensionality to two dimensions using PCA, and then show in those two dimensions which points are positive or negative for diabetes, and what the classifier predicts everywhere in that plane. This is done for you. We'll talk about dimensionality reduction on the last day of this week. For now, know that, by its nature, dimensionality reduction will lose some of the true differences in your data, so visualisation of the decision boundary in this 2D space is bound to be an approximation, and cannot capture completely what your classifier is doing (as it's separating things in 8 dimensions rather than 2)! 
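
To get a feel for the information loss mentioned above, the sketch below projects random 8-dimensional toy data to 2 dimensions and back again. The data is made up (independent Gaussian noise, which is close to a worst case for PCA), so the numbers say nothing about the diabetes data; the point is only that the round trip does not return the original values.

```python
import numpy as np
from sklearn.decomposition import PCA

# made-up data: 100 samples with 8 unrelated features
rng = np.random.default_rng(1)
toyFeatures = rng.normal(size=(100, 8))

pcaSketch = PCA(n_components=2)
coords2D = pcaSketch.fit_transform(toyFeatures)          # 8D -> 2D
reconstructed = pcaSketch.inverse_transform(coords2D)    # 2D -> 8D again

explainedFraction   = pcaSketch.explained_variance_ratio_.sum()
reconstructionError = np.mean((toyFeatures - reconstructed) ** 2)
print(coords2D.shape, explainedFraction, reconstructionError)
```

The fraction of variance explained is below 1 and the reconstruction error is above 0: whatever varies outside the two retained directions is invisible in the 2D plot.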

Your job:
* Make a list of the lambda values to train on (`lambdaValues`), an empty list to store the test cost in (`testCostList`), and a list for the final thetas after gradient descent (`finalThetaList`).
* Downsample the normalised diabetesData: remove random rows of the controls so you have an equal number of non-diabetes and diabetes cases. You could use `np.random.choice(a=rowIndicesOfRowsWithoutDiabetes, size = howManySamplesNeedToBeRemoved, replace = False)`, where you then remove (`np.delete()` can be useful) those rows from the feature and class label array. You'll probably also need `np.ravel(diabetesClassLabels)` and `np.where()`. <br> Save the new data as `equalClassSizeDiabetesData` and `equalClassSizeClassLabels` for the labels.
* Randomly sample 80% of that for training, and save the rest for testing. **_Code for this is given below!_**.
* Now make a `for`-loop that loops over the different lambdaValues.
* In that loop, make another loop that performs `nSteps` gradient descent steps with step size `alpha` (both are set at the top of the cell below) on `trainDataDiabetes`.
* After that's done, calculate the cost on `testDataDiabetes` **without regularisation (lambda of 0)**. Remember: you don't use the regularisation parameter in the final predictions, because you use it _during training_ to prevent overfitting, and then want to know how well you really do on the test data. 
* Append the result to the `testCostList`.
* Finally, look at the DataFrame containing the theta parameters found for the different values of lambdas, and the cost calculated on the test set (code to make it is given below). What do you see? 

Hints:
* There are many steps here. If you get stuck on one, ask a question or look at the answers to see how to do that step.
* `np.where` returns a tuple, of which you need the first element.
* Note that you can always insert a new cell above or below the current one for testing or debugging using `escape + a` or `escape + b`.
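
The downsampling step is probably the fiddliest one, so here is a minimal sketch of it on made-up arrays (7 negatives, 3 positives). The `toy...` and `balanced...` names are hypothetical; in the practical you would use your normalised features and `diabetesClassLabels` instead.

```python
import numpy as np

np.random.seed(0)

# made-up data: 10 samples, 2 features, labels as a column vector
toyFeats  = np.arange(20).reshape(10, 2)
toyLabels = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])[:, np.newaxis]

# np.where returns a tuple; its first element holds the row indices
negativeRows = np.where(np.ravel(toyLabels) == 0)[0]
nToRemove    = len(negativeRows) - int(np.sum(toyLabels == 1))

# pick that many negative rows at random, without replacement, and drop them
rowsToRemove   = np.random.choice(a=negativeRows, size=nToRemove, replace=False)
balancedFeats  = np.delete(toyFeats, rowsToRemove, axis=0)
balancedLabels = np.delete(toyLabels, rowsToRemove, axis=0)

print(balancedFeats.shape, np.sum(balancedLabels == 0), np.sum(balancedLabels == 1))
```

Using the same `rowsToRemove` for both arrays keeps features and labels aligned, and `axis=0` makes `np.delete` remove whole rows rather than flattening the array.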

In [None]:
#make sure everyone gets the same split
np.random.seed(500)
startThetas = np.array([0] * 9)[:,np.newaxis]
nSteps      = 500
alpha       = 0.2

#make lists

#downsample the data

#     code for dividing into 80% and 20%
#     uncomment when you have done the above using Ctrl + /


# nrSamplesToTake        = int(np.ceil(0.8*np.sum(equalClassSizeClassLabels == 0)))
# negativeSampleIdxTrain = np.random.choice(np.arange(0,np.sum(equalClassSizeClassLabels == 0)), 
#                                        size = nrSamplesToTake, replace = False)
# positiveSampleIdxTrain = np.random.choice(np.arange(0,np.sum(equalClassSizeClassLabels == 1)), 
#                                        size = nrSamplesToTake, replace = False)
# positiveSamplesTrain, positiveClassLabelsTrain = equalClassSizeDiabetesData[np.ravel(equalClassSizeClassLabels) == 1,:][positiveSampleIdxTrain,:], equalClassSizeClassLabels[np.ravel(equalClassSizeClassLabels) == 1,:][positiveSampleIdxTrain,:]
# negativeSamplesTrain, negativeClassLabelsTrain = equalClassSizeDiabetesData[np.ravel(equalClassSizeClassLabels) == 0,:][negativeSampleIdxTrain,:], equalClassSizeClassLabels[np.ravel(equalClassSizeClassLabels) == 0,:][negativeSampleIdxTrain,:]
# trainDataDiabetes        = np.vstack([positiveSamplesTrain, negativeSamplesTrain])
# trainClassLabelsDiabetes = np.vstack([positiveClassLabelsTrain, negativeClassLabelsTrain])


# negativeSampleIdxTest = np.array([i for i in np.arange(0,np.sum(equalClassSizeClassLabels == 0)) if i not in negativeSampleIdxTrain])
# positiveSampleIdxTest = np.array([i for i in np.arange(0,np.sum(equalClassSizeClassLabels == 1)) if i not in positiveSampleIdxTrain])
# positiveSamplesTest, positiveClassLabelsTest = equalClassSizeDiabetesData[np.ravel(equalClassSizeClassLabels) == 1,:][positiveSampleIdxTest,:], equalClassSizeClassLabels[np.ravel(equalClassSizeClassLabels) == 1,:][positiveSampleIdxTest,:]
# negativeSamplesTest, negativeClassLabelsTest = equalClassSizeDiabetesData[np.ravel(equalClassSizeClassLabels) == 0,:][negativeSampleIdxTest,:], equalClassSizeClassLabels[np.ravel(equalClassSizeClassLabels) == 0,:][negativeSampleIdxTest,:]
# testDataDiabetes        = np.vstack([positiveSamplesTest, negativeSamplesTest])
# testClassLabelsDiabetes = np.vstack([positiveClassLabelsTest, negativeClassLabelsTest])

# your looping code, performing gradient descent for each lambda and
# calculating the cost on the test set after it's done, should go here:



#     code to make a final DataFrame to show what happens:
#     Uncomment all this code at once by selecting it and pressing Ctrl + /

# finalThetas = [np.ravel(elem) for elem in finalThetaList]
# dataFrame = pd.DataFrame(np.c_[np.vstack(finalThetas), np.array(testCostList)])
# columnNames = ["theta_" + str(elem) for elem in [0, 1, 2, 3, 4, 5, 6, 7, 8]]
# columnNames.append("testSetCost")
# dataFrame.columns = columnNames
# dataFrame.set_index(lambdaValues, inplace = True)
# display(dataFrame)



## Regularised logistic regression results + visualisation

If all goes well, you will see that the cost on the test set is lowest when using $\lambda = 10$. In other words: unregularised logistic regression would have overfit, tuning the parameters _too specifically_ to the values in the training set, which increases the cost on the unseen test set. Regularisation guards against this by penalising too large parameters. Note that if you change the seed, sometimes you actually get lower cost values without regularisation: in that case there was apparently a lucky split of the data such that overfitting was very difficult anyway. However, you don't want to be dependent on lucky splits, so you can use regularisation to make sure that there's no overfitting to the training data.

We can visualise what logistic regression is doing in a dimension-reduced space. Below, the first plot shows how the data points look when reduced to 2 dimensions using PCA. The following plots show how the decision boundary changes for different $\lambda$ values. Since we evaluate it for the test data only, I have kept only those points in the plots that show what the classifier has learned.

In [None]:
from sklearn.decomposition import PCA

PCAForTwoD = PCA(n_components = 2)
PCAForTwoD.fit(normDiabFeats)
coordsXYOriginalData  = list(zip(*PCAForTwoD.transform(normDiabFeats)))
coordsXYTrainData     = list(zip(*PCAForTwoD.transform(trainDataDiabetes)))
coordsXYTestData      = list(zip(*PCAForTwoD.transform(testDataDiabetes)))
fig, ax = plt.subplots(figsize=(10,10))
ax.scatter(np.array(coordsXYTrainData[0])[np.ravel(trainClassLabelsDiabetes) == 1],
           np.array(coordsXYTrainData[1])[np.ravel(trainClassLabelsDiabetes) == 1],
           facecolor = "green", edgecolor = 'black', label = "Positive samples training set")
ax.scatter(np.array(coordsXYTrainData[0])[np.ravel(trainClassLabelsDiabetes) == 0],
           np.array(coordsXYTrainData[1])[np.ravel(trainClassLabelsDiabetes) == 0],
           facecolor = "red", edgecolor = 'black', label = "Negative samples training set")
ax.scatter(np.array(coordsXYTestData[0])[np.ravel(testClassLabelsDiabetes) == 0],
           np.array(coordsXYTestData[1])[np.ravel(testClassLabelsDiabetes) == 0],
           facecolor = "blue", edgecolor = 'black', label = "Negative samples test set")
ax.scatter(np.array(coordsXYTestData[0])[np.ravel(testClassLabelsDiabetes) == 1],
           np.array(coordsXYTestData[1])[np.ravel(testClassLabelsDiabetes) == 1],
           facecolor = "yellow", edgecolor = 'black', label = "Positive samples test set")
ax.legend()

for index, lambda_ in enumerate(lambdaValues):
    #predict on the test set given learned thetas
    predictionsOnTestSet         = linAlgLogRegHypothesis(testDataDiabetes, finalThetaList[index])
    predictedClassLabels         = np.ravel(np.where(predictionsOnTestSet <= 0.5, 0, 1))
    #make a series of values in the PCA space to transform back, and predict using the learned classifier
    #in that way, we can colour the background with the predicted probabilities.
    PCAX, PCAY                   = np.meshgrid(np.linspace(-6, 6, 1000), np.linspace(-6, 6, 1000))
    backTransformedFeatures      = PCAForTwoD.inverse_transform(np.c_[np.ravel(PCAX), np.ravel(PCAY)])
    classifierContourPredictions = linAlgLogRegHypothesis(backTransformedFeatures, finalThetaList[index])
    classifierContourPredictions = classifierContourPredictions.reshape(PCAX.shape)
    #make the figure
    
    fig, ax = plt.subplots(figsize=(10,10))
    contour = ax.contourf(PCAX, PCAY, classifierContourPredictions, cmap=plt.cm.coolwarm, alpha=0.8, levels = 10)
    fig.colorbar(contour, label = "probability of being class 1", ticks = list(np.arange(0,1.1,0.1)))
    ax.scatter(np.array(coordsXYTrainData[0]),
               np.array(coordsXYTrainData[1]),
               facecolor = "none", edgecolor = 'black', alpha = 0.3, label = "Training data")
    ax.scatter(np.array(coordsXYTestData[0])[np.logical_and(predictedClassLabels == 1, np.ravel(testClassLabelsDiabetes) == 1)],
               np.array(coordsXYTestData[1])[np.logical_and(predictedClassLabels == 1, np.ravel(testClassLabelsDiabetes) == 1)],
               facecolor = "#1b9e77", edgecolor = 'black', label = "Predicted positive and really positive")
    ax.scatter(np.array(coordsXYTestData[0])[np.logical_and(predictedClassLabels == 1, np.ravel(testClassLabelsDiabetes) == 0)],
               np.array(coordsXYTestData[1])[np.logical_and(predictedClassLabels == 1, np.ravel(testClassLabelsDiabetes) == 0)],
               facecolor = "#d95f02", edgecolor = 'black', label = "Predicted positive but really negative")
    ax.scatter(np.array(coordsXYTestData[0])[np.logical_and(predictedClassLabels == 0, np.ravel(testClassLabelsDiabetes) == 0)],
               np.array(coordsXYTestData[1])[np.logical_and(predictedClassLabels == 0, np.ravel(testClassLabelsDiabetes) == 0)],
               facecolor = "#7570b3", edgecolor = 'black', label = "Predicted negative and really negative")
    ax.scatter(np.array(coordsXYTestData[0])[np.logical_and(predictedClassLabels == 0, np.ravel(testClassLabelsDiabetes) == 1)],
               np.array(coordsXYTestData[1])[np.logical_and(predictedClassLabels == 0, np.ravel(testClassLabelsDiabetes) == 1)],
               facecolor = "#e7298a", edgecolor = 'black', label = "Predicted negative but really positive")
    ax.set_title("Dimension reduced classifier visualisation for lambda = " + str(lambda_) + "\n mean cost: " + str(testCostList[index]))
    ax.legend()

## Conclusion visualisation

The main thing to note is, of course, that such a dimension-reduced visualisation is imperfect. You do clearly see that as the lambda values increase, the banded regions become more uniform in size: the classifier has learned less because with lambdas of 100 and 1000 we are hugely penalising it for any fit to the training data.

## Classifier performance

We've talked in the lectures about the performance of a classification algorithm. We want to know the true positive rate and false positive rates for a given threshold, but also the classifier's performance over a range of thresholds. It is not too difficult to make a ROC curve yourself. Let's do that now for the best classifier (with the lowest mean cost on the test set).

Up to you to:
* Make a range of 200 thresholds (from 1 to 0) for saying something belongs to the positive class (use `np.linspace` for this).
* Make two empty lists: `truePositiveRates` and `trueNegativeRates`.
* Make predictions on the `testDataDiabetes` using the best set of learned thetas (which you can manually select).
* Make a for loop over the different thresholds you defined. Within that loop:
    * Turn the predictions into class labels using `np.where` and the current threshold value.
    * Calculate the true positive rate (sensitivity/recall) and append it to the list.
    * Calculate the true negative rate and append it to the list.
* Finally make a plot of the sensitivity (true positive rate) on the y-axis and 1-specificity (1-TNR) on the x-axis. (use `fig, ax = plt.subplots()` and `ax.plot()`). Don't forget to set the axis labels and a title!

See the relevant excerpt from the slide below, and look [here](https://glassboxmedicine.com/2019/02/23/measuring-performance-auc-auroc/) for more explanation if you want it! <br> ![SensitivityAndSpecificity](SensitivityAndSpecificity.PNG)
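
If the picture does not render: sensitivity (TPR) is TP / (TP + FN) and specificity (TNR) is TN / (TN + FP). The sketch below walks through the loop on eight made-up predictions with only five thresholds, so you can check each value by hand. The toy numbers, the variable names, and the choice of `>=` when comparing against the threshold are assumptions for illustration.

```python
import numpy as np

# made-up predicted probabilities and true labels (4 positives, 4 negatives)
toyProbabilities = np.array([0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.05])
toyTrueLabels    = np.array([1,    1,    0,    1,    0,    1,    0,    0])

thresholds = np.linspace(1, 0, 5)      # 1, 0.75, 0.5, 0.25, 0
tprList, tnrList = [], []
for threshold in thresholds:
    # everything at or above the threshold is called positive
    predictedLabels = np.where(toyProbabilities >= threshold, 1, 0)
    truePositives = np.sum((predictedLabels == 1) & (toyTrueLabels == 1))
    trueNegatives = np.sum((predictedLabels == 0) & (toyTrueLabels == 0))
    tprList.append(truePositives / np.sum(toyTrueLabels == 1))
    tnrList.append(trueNegatives / np.sum(toyTrueLabels == 0))

truePositiveRates  = np.array(tprList)
falsePositiveRates = 1 - np.array(tnrList)

# area under the curve via the trapezoid rule on the (FPR, TPR) points
auc = np.sum(np.diff(falsePositiveRates)
             * (truePositiveRates[1:] + truePositiveRates[:-1]) / 2)
print(falsePositiveRates, truePositiveRates, auc)
```

Plotting `falsePositiveRates` against `truePositiveRates` with `ax.plot()` then gives the ROC curve; with 200 thresholds instead of 5 the curve follows the data much more closely.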

In [None]:
#your answer here



## What I'd like you to remember here:
* What regularisation is, and how it works (penalising large weights for parameters, thereby forcing the algorithm to focus on those that really give it a lot of _bang for its buck_ and decreasing overfitting)
* How to implement regularisation, and how the parameter $\lambda$ affects it
* How to do some basic cleaning on a dataset, and what _the idea_ of imputation is (specifically of a KNNImputer)
* How to make a ROC plot, and what exactly is depicted on it, as well as why we might want to compare something like ROC AUC between classifiers, rather than accuracy.

## Final words

Congratulations. You've implemented regularised logistic regression on a real dataset (that you cleaned up yourself) and made your own ROC curve. We'll now move on to multiclass logistic regression and then to neural networks!

## Survey
"I want a Survey, hey! Giving feedback for the very first time. I want a su-u-u-u-rvey, got some feedback, on my mi-i-i-n-d". Thanks Weird Al, [very cool](https://www.youtube.com/watch?v=notKtAgfwDA). Here you go: [clickety-click](https://docs.google.com/forms/d/e/1FAIpQLSfaeqtRTz5KMqcmxQuOI5GYWHMejjh5_yuiCNSnNblpdKb0hQ/viewform?usp=sf_link).