Available at http://www.comp.nus.edu.sg/~cs3244/1910/04.colab

![Machine Learning](https://www.comp.nus.edu.sg/~cs3244/1910/img/banner-1910.png)
---
See **Credits** below for acknowledgements and rights.  For NUS class credit, you'll need to do the corresponding _Assessment_ in [CS3244 in Coursemology](http://coursemology.org/courses/1677) by the respective deadline (as in Coursemology). 

**You must acknowledge that your submitted Assessment is your independent work, see questions in the Assessment at the end.**


**Learning Outcomes for Week 04** 

After finishing these exercises and watching the videos, you should be able to:
* Linear Classification:
  * Describe the basic idea of linear classification.
  * Understand one linear algorithm's conceptual idea: the Perceptron learning algorithm (PLA);
  * Apply PLA on real examples;
  * Understand the mathematical proof of why PLA converges;
  * Understand the concept of non-linear transformations.

* Logistic regression:
  * Understand how both linear and logistic regression work;
  * Be able to build a basic logistic regression classifier from scratch;
  * Understand gradient descent methods using the cost functions to optimize a model's parameters $\theta$ for smooth cost functions;
  * Be able to derive the closed form expression for linear regression using squared error (L2);
  

_Welcome to the Week 04 Python notebook._ This week we will learn about classification algorithms.  We introduce **Linear Classifiers** and **Logistic Regression** in the lecture videos, and will be reviewing this material in the third tutorial.


In this notebook, we will go through different programming exercises in the Pre-tutorial part and some more programming and _mathematical proofs_ in the Post-tutorial part. In the Pre-tutorial, the programming exercises involve using the Perceptron Algorithm and Logistic Regression from SciKit Learn (a cornerstone traditional ML toolkit, or `sklearn`). We will implement the Perceptron Algorithm from scratch in the pre-tutorial exercise and Logistic Regression in the post-tutorial exercise.

---
# Week 04: Pre-tutorial Work

* Watch the CS 3244 video playlist for Week 04 Pre.  This will introduce two basic classification methods for this week's class: _Linear Classifiers_ and _Logistic Regression_.
* After watching the videos, complete the pre-tutorial exercises and questions below.

## 1 Linear Classifiers



As the name suggests, a linear classifier classifies items using a line or, more generally for $k$-dimensional data, a hyperplane with $k-1$ dimensions. In simpler terms, if we have data corresponding to two or more classes, the linear classifier puts a straight boundary line between them to differentiate the classes from each other.
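
As a minimal sketch of this idea, a two-dimensional linear classifier labels a point according to which side of the line $w_1 x_1 + w_2 x_2 + b = 0$ it falls on. The weights `w`, the offset `b`, and the four points below are made up for illustration and do not come from the lecture or the figure.

```python
import numpy as np

# Hypothetical boundary line: x1 + x2 - 3 = 0 (values chosen only for illustration)
w = np.array([1.0, 1.0])
b = -3.0

# Four made-up points, one per row
points = np.array([[0.5, 1.0],
                   [1.0, 1.5],
                   [3.0, 2.0],
                   [2.5, 2.5]])

# The signal is positive on one side of the line and negative on the other
signal = points @ w + b

# Assign class +1 when the signal is positive, otherwise class -1
labels = np.where(signal > 0, 1, -1)
print(labels)
```

Changing `w` or `b` moves the line, and with it the labels of any points that end up on the other side.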

Let's examine the diagram below.  This is a two-class problem: there are two types of data points in the diagram, white circles and black circles; i.e., $y \in \{black,white\}$.  We want to build a classifier (put a boundary line) between these two classes. We can see three straight lines, $H_1, H_2, H_3$, dividing the data points in different ways. Of these three, $H_1$ and $H_2$ both classify the data points correctly, but $H_3$ does not. These are our linear classifiers. This week, we will learn how to draw such straight lines to maximize the accuracy of our classification on our data points $X$.
<div align="center">
<img src="https://upload.wikimedia.org/wikipedia/commons/2/20/Svm_separating_hyperplanes.png" width=300 />
 </div>
 
 _By Cyc [CC BY 2.0](https://creativecommons.org/licenses/by/2.0), via Wikimedia Commons._


**Your Turn (Question 1):** Which of the three lines classifies the data points accurately?

Choose from: $H_1$, $H_2$, $H_3$

**Your Turn (Question 2)**: Which machine learning paradigm does logistic regression belong to?

Choose from: _Supervised Learning, Unsupervised Learning, Reinforcement Learning, None of the above_

**Your Turn (Question 3)**: Both the Perceptron algorithm and Logistic Regression are used for binary classification. What's the difference between them?

_Replace with your answer_

**Your Turn (Question 4)**: Which of the following methods do we use to best fit the data in Logistic Regression?

_Choose from: Maximum Likelihood, Least Square Error, Jaccard distance, None of the above_

**Your Turn (Question 5)**: What is the disadvantage of the pocket algorithm compared to perceptron?

Choose from: _It requires additional memory to store the weights, It requires more iteration than the general Perceptron, It needs additional computation, No disadvantage_

## 2 Programming : Perceptron Algorithm from Sklearn

In this section, we will learn a linear classification algorithm: _Perceptron_ and apply it on the popular _Digits Dataset_. This is a very popular problem, where we have to classify handwritten digits. Let's start!

### .a MNIST Digits Dataset

The _Digits_ dataset is a preprocessed, simplified database of handwritten digits, originally adapted from the MNIST (Modified National Institute of Standards and Technology database) database. MNIST is widely used for training and testing machine learning algorithms. The input for each digit is a $8\times 8$ matrix, where each element is an integer ranging from $0...16$, representing the grayscale strength of the pixel.

There are $5,620$ instances in the database. You can download the dataset from [here](http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits). There are two files, one each for the training and testing data. Many benchmark experiments have been done on this dataset using different machine learning algorithms, making it historically important.  For example, the resurgence of the connectionist approach via deep learning was heralded by very good results on the larger MNIST dataset.  We'll use this dataset to build a pairwise linear classifier to distinguish two different digits.

<div align="center">
<img src="https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png" width=512 />
 </div>
 
  _By Josef Steppan (MNIST Examples) [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/deed.en), via Wikimedia Commons._


### .b Loading and Visualizing Input Data

Enough talk.  Let's do.  Let's fetch the digits dataset using `sklearn`. Then we can view one of the training examples as a  grayscale image.  Run the code and explore the output.

In [0]:
import matplotlib.pyplot as plt

# Import datasets
from sklearn import datasets
digits = datasets.load_digits() # load in the digits dataset

# How many samples are in this dataset?
n_samples = len(digits.images)
print ("The data size is "+ str(n_samples))

## Try changing the index from 0 to other values to see different numbers ##
print ("Here is the first instance:")
plt.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

We can use subplots to visualize more than one image at a time.

In [0]:
# merging the labels with the data
images_and_labels = list(zip(digits.images, digits.target))

for index, (image, label) in enumerate(images_and_labels[:8]): # cycle through the first 8 images
    plt.subplot(2, 4, index + 1) # declare a 2x4 grid of plots, and declare we are working on the current indexth subplot
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training: %i' % label)

# flatten each 8x8 image into a 64-dimensional feature vector
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

# show the sample images from the dataset
plt.show()

The raw input we get from the dataset is a vector of 64 pixel values. Prepending the bias term $x_0 = 1$ gives $\mathbf{x} = (x_0,x_1,x_2,...,x_{64})$.

For our linear model, we need to determine the matching weight vector, $\mathbf{\theta} = (\theta_0,\theta_1,\theta_2,...,\theta_{64})$.

We have $64$ features ($8\times 8$ input matrix) in the dataset for each digit, but not every feature is equally important. We can do some feature extraction and only use the useful information out of those $64$. For example, we can use **Intensity** as $x_1$ and **Vertical Symmetry** as $x_2$. So our feature vector is now instead $\mathbf{x} = (x_0,x_1,x_2)$ and the weights to the linear model are $\theta = (\theta_0,\theta_1,\theta_2)$.

<div align="center">
<img src="https://www.comp.nus.edu.sg/~neamul/Images/perceptron2.png" width=512 />
 </div>
 
 From the above figure we can see the plot of data points corresponding to the digits **1** and **5**. For these two digits, intensity and symmetry are two useful features to discriminate between them. We can see from the plot, the data points are divided into two, almost completely separated clusters. We can get a good straight line to distinguish between these two digits.
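
The notebook does not say exactly how intensity and symmetry are computed, so the sketch below is only one plausible choice. It assumes intensity is the average pixel value, and vertical symmetry is the negative mean absolute difference between an image and its left-right mirror (so $0$ means perfectly symmetric). The helper names `intensity` and `vertical_symmetry` are hypothetical.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

def intensity(image):
    """Average pixel value of an 8x8 image (assumed definition)."""
    return image.mean()

def vertical_symmetry(image):
    """Negative mean absolute difference between the image and its
    left-right mirror (assumed definition); 0 means perfectly symmetric."""
    return -np.abs(image - np.fliplr(image)).mean()

# Build the reduced feature vectors (x0, x1, x2) with the bias term x0 = 1
features = np.array([[1.0, intensity(img), vertical_symmetry(img)]
                     for img in digits.images])
print(features.shape)  # (1797, 3)
```

A scatter plot of `features[:, 1]` against `features[:, 2]`, restricted to two digits, gives a picture in the spirit of the figure above, although the exact values depend on the definitions chosen.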
 

**Your Turn (Question 6)**: Why is symmetry a good feature to distinguish between 1 and 5? 

Choose from: _One of them is rotationally symmetric, glide reflection symmetric, reflection symmetric, translation symmetric, symmetry is not a good feature to distinguish between 1 and 5_

### .b Applying Perceptron Algorithm for Classification


Now, we will run the perceptron algorithm using the `sklearn` library. It's pretty simple. We'll do the following tasks step-by-step using the library functions.
1. First, we load the dataset from the library (sklearn). 
2. Then, we divide the dataset into training and testing sets.  
3. Then we create an instance of the Perceptron classifier from the sklearn library with the necessary parameters.
4. **Train**: Then we train the classifier by calling the **fit** function, where we provide the training data $S$ and labels $T$. This one line yields our $h_\theta$.
5. **Test**: We can now apply $h_\theta$ (by calling the **predict** function) to classify new, unseen instances $\mathbf{x}^{(*)}$ from the test set.

You can see the summary accuracy of the classifier in the first line.   

Run the code and see the results for yourself.

In [0]:
# Import datasets, classifiers and performance metrics
from sklearn import datasets, metrics
from sklearn.linear_model import Perceptron

# The digits dataset
digits = datasets.load_digits()

n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

## You can adjust this value to get better score, later.
# Fraction of the data held out for testing
percentage = 0.2

# Set aside the first 80% of the data for training and the remaining 20% for testing,
# so that the training and test sets do not overlap
split = int(n_samples * (1 - percentage))
X_train = data[:split]
X_test = data[split:]

Y_train = digits.target[:split]
Y_test = digits.target[split:]

# Create an instance of the perceptron classifier
classifier = Perceptron(tol=None, max_iter=1000)

# Learn the digits on the training portion of the data
classifier.fit(X_train, Y_train)

# Now predict the value of the digit on the held-out test portion:
expected = Y_test
predicted = classifier.predict(X_test)

print("The classification score %.5f\n" % classifier.score(X_test, Y_test))


This is followed by additional detailed tables on the experiments (but we don't expect you to know what these tables mean yet; rest assured we'll explain them soon).


In [0]:
print("Classification report for classifier %s:\n%s" % (classifier, metrics.classification_report(expected, predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))

### .c Visualize the Predictions

We added a few lines of code to visualize the digits and their prediction together (which is not that important, but is useful for us to verify whether we indeed predicted the right numbers!).


In [0]:
import matplotlib.pyplot as plt

# Showing the images and prediction from the classifier
images_and_predictions = list(zip(digits.images[int (n_samples * (1 - percentage)):], predicted))
for index, (image, prediction) in enumerate(images_and_predictions[:8]):
    plt.subplot(2, 4, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Prediction: %i' % prediction)
	
plt.show()

## 3 Programming : Implementing Perceptron Algorithm

We have seen the `sklearn` implementation of the perceptron algorithm. We have used the digits dataset to train and test the classifier. While we used the library, we just called the `fit` function to train our classifier using the training data and it returns us the classifier. That's what Machine Learning is all about, right?

Not really. It's not enough for us to learn how to use the _fit_ function; the more important part is what this *fit* function does. It determines the _parameters_ (or _weights_) $\theta$ that classify the training examples as well as possible in the given setting. When we call the `predict` function to run our classifier on the test data, it uses the same parameters (those learned by _fit_) to classify the test data.

In this section, we look under the hood of this algorithm. Instead of calling the library functions, our goal is to implement the **fit** and **predict** functions ourselves: i.e., we'll determine the parameters $\theta$ using the training examples and then predict the test examples using those learned parameters.

### .a Selecting a subset of data


We will use the digits dataset to train and test our classifier. As we know, the dataset contains data for 10 digits. To make it easier for us, we will build a binary classifier that uses only the data for two digits.  We will thus only be distinguishing between **two digits** (say _digit1_ and _digit2_). Before we finally use our classifier on the digits data, you will select any two digits which you want to distinguish from one another. For our two-digit classes, we will think of _digit1_ as class **+1** and _digit2_ as class **-1**. We will use these labels for our digit data and give predictions as +1 or -1.

We have implemented a function (`extract_digits`) for you, with which you can easily extract the data of the given digits to form the dataset. This simple function takes four parameters: $X$, $Y$ (the data and their labels), as well as _digit1_ and _digit2_, the two digits to distinguish. It returns the filtered data and filtered labels.

We will use it in our implementation to extract the desired digits data later.

In [0]:
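import numpy as np

def extract_digits(X, Y, digit1, digit2):
    # add the bias term x_0 = 1 to each feature vector
    # (np.int was removed from recent NumPy releases, so we use the built-in int)
    x0 = np.ones((X.shape[0], 1), dtype=int)
    X = np.concatenate((x0, X), axis = 1)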

    # filter X to get the data (i.e., find the rows of data that match digit1 and digit2)
    X_1 = X[Y == digit1]
    X_2 = X[Y == digit2]
    # merge them together in a single array, X_data
    X_data = np.concatenate((X_1, X_2), axis = 0)

    # fill the column of answers, Y, with the correct labels
    Y_1 = np.full(X_1.shape[0], +1)
    Y_2 = np.full(X_2.shape[0], -1)
    # merge them together in a single array
    Y_data = np.concatenate((Y_1, Y_2), axis = 0)

    return X_data, Y_data

### .b Loading the data


In this phase, you will select the two digits you want to work with using the function `extract_digits`. Put your favourite numbers in the variables **digit1** and **digit2** at lines $17-18$.

To train our classifier, we will use $80\%$ of the data; the remaining $20\%$ will be used for testing.

In [0]:
# import necessary libraries
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split # Library to split train and test set

# Load the digits dataset
digits = datasets.load_digits()

# Set the percentage of test data
percentage = 0.2

X = digits.data[:, :]
Y = digits.target

############### Your Turn: Put your favourite digits in the variables below
digit1 = 1
digit2 = 5

# We will now extract our digit data
X, Y = extract_digits(X, Y, digit1, digit2)

# Split the data according to the percentage declared above
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = percentage, random_state = 51)

print("Total training examples : %d\n Total test examples : %d" % (len(X_train), len(X_test)))

### .c Writing prediction function


We will first write a simple function, *predict_class*, which will predict the class of a given sample. Let's assume we have already learned the `weights` from our training examples. So we have two parameters for this function: *weights* and *test_sample*. 
* *weights*: the weight vector of our classifier
* *test_sample*: the sample for which we want a prediction

To predict the class, we will multiply the weight vector with the test sample to yield the signal, $\theta^Tx$.  As we have only two digits in our data, we will treat _digit1_ as **+1** class and _digit2_ as **-1** class. If the multiplication result is greater than zero, the class prediction will be  +1 (*digit1*), otherwise -1  (*digit2*).

**Your Turn (Question 7):** Complete the code below to get prediction from `predict_class`.

_Copy the code you added or modified into the respective assessment question in Colaboratory_

In [0]:
def predict_class(weights, test_sample):
    """Equivalent .predict() function in sklearn's classifier

    Args:
        weights (array of floats): The weight vector of our classifier
        test_sample (array of floats): The sample we want to test or make prediction

    Returns:
        int: Prediction for the test_sample; +1 or -1
    """
    # Initialize the prediction variable
    prediction = 0

    # multiplying weight vector with the test_sample using dot product
    product = np.dot(weights, test_sample)

    ######################################################
    # Your Turn: write your own code here
    #
    # Put prediction = +1 if product > 0, otherwise prediction = -1
    #
    ######################################################

    return prediction


### .d Training the classifier


We will write a function named `perceptron_classifier`, which is equivalent to the *fit* function we used in `sklearn`. This function does the same job as the library call. It has the following parameters:

* $X$ : an array of training examples (with the bias term) 
* $Y$ : column vector of training example labels
* `num_iter` : total number of iterations we will run our update formula

As we learned in the algorithm, we have to add one extra attribute to our training data $(x_0 = 1)$, the bias term. Then we will initialize our weight vector $\theta$ to all zeroes. Then we will run the update rule for the given number of iterations. At each iteration, we will find a misclassified example $(x^{(j)}, y^{(j)})$, then we will update our current weight vector using the following weight update rule: $\theta' = \theta + y^{(j)}x^{(j)}$
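
To see what a single application of the update rule does, here is a small worked step on made-up numbers (the weight vector and the example are hypothetical and are not taken from the digits data). It only illustrates one update; finding the misclassified example and looping are left to your implementation.

```python
import numpy as np

# Made-up current weights and one made-up training example (x[0] = 1 is the bias term)
theta = np.array([0.0, 1.0, -1.0])
x = np.array([1.0, 2.0, 3.0])
y = 1  # true label of this example

# Signal before the update: 0*1 + 1*2 + (-1)*3 = -1, so the example is misclassified
signal_before = np.dot(theta, x)

# One application of the update rule: theta' = theta + y * x
theta_new = theta + y * x

# Signal after the update: 1*1 + 3*2 + 2*3 = 13, now on the correct side
signal_after = np.dot(theta_new, x)

print(signal_before, theta_new, signal_after)
```

The update increases $y\,\theta^\top x$ by exactly $||x||^2$ (here $14$), which is why it pushes the misclassified example toward the correct side of the boundary.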

**Your Turn (Question 8):** Complete the code for `perceptron_classifier` by writing necessary statements where needed.

_Copy the code you added or modified in the assessment_


In [0]:
def perceptron_classifier(X, Y, num_iter):
    """ Equivalent .fit() function to sklearn's classifier

    Args:
        X: an array of training examples (with the bias term)
        Y: column vector of training example labels
        num_iter: total number of iterations we will run our update formula

    Returns:
        array of floats: The optimal weights if the dataset is linearly
        separable, and converges within num_iter iterations
    """

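    # Initialize the weights as zeros (floats, so that updates with float features are safe;
    # np.int was removed from recent NumPy releases)
    weights = np.zeros(X[0].shape, dtype=float)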

    # Number of training examples
    train_size = len(X)

    # Training the classifier
    for step in range(num_iter):
        ## Find a misclassified example
        ######################################
        #
        # Your turn: write your own code here
        #
        #
        ######################################

        ## Update the weight vector based on the misclassified example
        ######################################
        #
        # Your Turn: write your own code here
        #
        ######################################

    return weights

### .e Putting it all together

Now that we have completed the prediction function and the classifier function, we have everything we need to test our implementation: the classifier training function, the prediction function, and the data, loaded and divided into training and testing sets. We just need to call our classifier function to train it, and then test the accuracy of the trained classifier.

Let's do it!

In [0]:
# Make sure you run all the code blocks above
# Training phase with 1000 iterations
theta = perceptron_classifier(X_train, Y_train, 1000)

test_size = len(X_test)

correct_predictions = 0 # Initialize the number of correct predictions

# Apply the model with the learned theta parameters on each of the testing examples, one at a time
for item in range(test_size):
    prediction = predict_class(theta, X_test[item])

    if prediction == Y_test[item]: # Accumulate correct results
        correct_predictions += 1

# Calculate the accuracy for our classifier
accuracy = correct_predictions / test_size

print("The classification accuracy : {}\n".format(accuracy))

How did you do?  Most results (even when you choose different digits to classify) will be above 90%.  Great work!

**Your turn (optional, for you to think about):** What happens when you cannot find a misclassified example?  What happens when the dataset is not linearly separable? \[ _These are great study questions for continual assessments!_ \]

## 4 Programming : Logistic Regression from Sklearn

Let's do a simple hands-on exercise on _Logistic regression_.  We'll use the full version of logistic regression from `sklearn`, which we would like you to execute. 

We'll again use a dataset from the popular [UCI dataset repository](https://archive.ics.uci.edu/ml/index.php); in particular, the [Iris](https://archive.ics.uci.edu/ml/datasets/iris) dataset. This is perhaps the best known database to be found in the pattern recognition literature.  The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. 

We'll load in the data, take a look around and train a logistic regression classifier on it using `sklearn`. 

Feel free to look at other features and change the _hyperparameter_ values – the code is provided to you for your studying and experimentation.


### .a Load the Iris dataset

In [0]:
# Do some standard library imports for data science and machine learning
import numpy as np
import math
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
from sklearn.linear_model import LogisticRegression

Since Iris is a built-in dataset in `sklearn`, we can load it directly with a single line of code. 

In [0]:
iris = datasets.load_iris() # Load the Iris dataset from sklearn

By running the following code, we see what the dataset looks like.  Again, this is generally good practice, to know something about the semantics of the dataset.  The UCI repository has a file per dataset it hosts to describe the data.  In this case, you can read up on what the fields' values mean [here](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names). 

In [0]:
# Manufacture a dataframe for the raw data for sample inspection
iris_pd = pd.DataFrame(data= np.c_[iris['data'], iris['target']], columns= iris['feature_names'] + ['target'])
iris_pd.head()

We can also take a look at some general statistics.

In [0]:
iris_pd.describe()

To simplify things, we take just the first two feature columns. Also, just to simplify things, we'll force the two non-linearly separable classes to be labeled with the same category, ending up with a binary classification problem.

In [0]:
# Take the first two columns to populate a binary classification problem
X = iris.data[:, :2]
# Collapse the other two non-zero class irises as one class
y = (iris.target != 0) * 1 

### .b Visualize the Data

Now each data point is a 2-dimensional vector, so we can easily visualize it by drawing a figure using the `plt` _matplotlib_ object.  

In [0]:
# Produce a scatterplot for the values
plt.figure(figsize=(10, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='b', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='r', label='1')
plt.legend()
plt.show()

### .c Applying Logistic Regression for Classification

Now we could use the above dataset as training data, to train a logistic regression model on it using `sklearn`. To train a model in `sklearn` is simple and easy: you just need to call the `fit` function. 

In [0]:
# Create an instance of Logistic Regression classifier
model = LogisticRegression(C=1e20, solver = 'liblinear')
# Train the model
model.fit(X, y)

Now that the trained model is stored in the variable `model`, we can print out the learned weights and see the classification accuracy on the training data. 

In [0]:
print('The learned weights are {} {}'.format(model.intercept_, model.coef_)) 
# Your turn (to think about): Can you figure out what the outputs here mean?

preds = model.predict(X) # Predict on our training set.  
# Your turn (to think about): Is this indicative of testing performance?
print('The classification accuracy: {}'.format(((preds == y).mean())))
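
One way to make sense of `model.intercept_` and `model.coef_` is to rebuild the predictions by hand. The sketch below retrains the same model so that it is self-contained, then passes the signal $\theta_0 + \theta^\top \mathbf{x}$ through the sigmoid. It uses `scipy.special.expit` as a numerically stable sigmoid, and the names `theta_0`, `theta`, `manual_probs` and `manual_preds` are introduced here for illustration.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

# Same data and model settings as in the cells above
iris = datasets.load_iris()
X = iris.data[:, :2]
y = (iris.target != 0) * 1

model = LogisticRegression(C=1e20, solver='liblinear')
model.fit(X, y)

theta_0 = model.intercept_[0]   # bias weight
theta = model.coef_[0]          # one weight per feature

# Signal and the probability of class 1, computed by hand
signal = theta_0 + X @ theta
manual_probs = expit(signal)
manual_preds = (manual_probs > 0.5).astype(int)

# Compare against what sklearn reports
print(np.allclose(manual_probs, model.predict_proba(X)[:, 1]))
print(np.array_equal(manual_preds, model.predict(X)))
```

If both comparisons print `True`, the learned weights are exactly the $\theta$ of the logistic model: the intercept plays the role of $\theta_0$ and the coefficients multiply the two features.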

### .d Visualize the Decision Boundary

See? The classification accuracy is 100%, perfect! The model successfully finds a decision boundary that separates the two classes. Of course, this is based on the premise that the dataset is linearly separable. We can also superimpose the decision boundary on our earlier plot, using the following code:  

In [0]:
# First four lines are identical to the earlier cell
plt.figure(figsize=(10, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='b', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='r', label='1')
plt.legend()

# Set up the Logistic Regressions threshold for plotting 
# You can try changing this too, or give more than one value
colors = ['black']
confidence = [0.5] 

# Define parameters for plotting the LR boundary
x1_min, x1_max = X[:,0].min(), X[:,0].max()
x2_min, x2_max = X[:,1].min(), X[:,1].max()
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max), np.linspace(x2_min, x2_max))
grid = np.c_[xx1.ravel(), xx2.ravel()]
probs = model.predict_proba(grid)[:, 1].reshape(xx1.shape)

# Plot the contour(s) at the chosen threshold(s) and show the plot
plt.contour(xx1, xx2, probs, levels=confidence, linewidths=2, colors=colors)
plt.show()
plt.show()
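
The $0.5$ contour drawn above is a straight line: the model outputs probability $0.5$ exactly where the signal $\theta_0 + \theta_1 x_1 + \theta_2 x_2$ is zero. As a rough check, the sketch below (self-contained, so it retrains the same model) solves that equation for $x_2$ and confirms that the predicted probability on those points is about $0.5$. The variable names are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

# Same data and model settings as in the cells above
iris = datasets.load_iris()
X = iris.data[:, :2]
y = (iris.target != 0) * 1
model = LogisticRegression(C=1e20, solver='liblinear').fit(X, y)

theta_0 = model.intercept_[0]
theta_1, theta_2 = model.coef_[0]

# Points on the boundary: theta_0 + theta_1 * x1 + theta_2 * x2 = 0
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 5)
x2 = -(theta_0 + theta_1 * x1) / theta_2

# The predicted probability of class 1 on the boundary should be about 0.5
boundary_probs = model.predict_proba(np.c_[x1, x2])[:, 1]
print(boundary_probs)
```

Other thresholds in `confidence` give parallel lines, shifted to where the signal equals the log-odds of that threshold.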

Congrats, you have successfully trained a logistic regression classifier on the *Iris* dataset!  

And whew!  You've completed a rather long Pre-tutorial notebook!  You (almost) deserve that shiny achievement...



----
# Week 04: Post-tutorial Work

Watch the Week 04 post-videos on the lecture topics introduced this week, then attempt the following exercises.  

## 5 Convergence Proof of Perceptron Algorithm


We have learned how the algorithm works in the previous sections. For the algorithm to work, we assume that the data points we want to classify are linearly separable. So far we have visualized the decision boundary as a straight line in our 2D representation, but in general it is a hyperplane separating the two sets of instances. Linear separability means that such a separating hyperplane exists. Now, can we guarantee that PLA will be able to find this hyperplane? That is, we want a convergence proof for our perceptron algorithm. Let's prove the convergence of PLA now.



### .a Problem Definition

Let's recall our notation before proceeding:
*  Input: $\mathbf{x} = (x_1,x_2,...,x_n)\hspace{4mm}$ $n$-dimensional feature vector
* Output: $y \in \{+1, -1\}\hspace{4mm}$ one of two classes
* Data: $\mathbf{X} = \{(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}),...,(x^{(m)},y^{(m)})\}\hspace{4mm}$ training data
* Target function: $f : \mathcal{X} \rightarrow \mathcal{Y}$
* Our hypothesis: $h_\theta : \mathcal{X} \rightarrow \mathcal{Y}$

To differentiate between two classes, we calculate the value of $\theta^\top \mathbf{x}$: if the value is greater than zero then it belongs to one class (+1), otherwise it belongs to the other one (-1); here $\theta = (\theta_1,\theta_2,...,\theta_n)$ is our weight vector. To make our notation simple, we multiply the training examples belonging to the -1 class by their class label, i.e., -1. So now, our target is to find an appropriate parameter vector $\theta$, so that for all training examples $(j = 1,2,...,m)$, $\theta^{\top} \mathbf{x}^{(j)} > 0$.

Previously we had two conditions for classification: for a feature vector $x$, its multiplication with the weight vector $\theta$ is greater than zero, then it belongs to +1 class; if less than zero, then it belongs to the -1 class. 

**Your Turn (Question 1)**: After multiplying the feature vector with the class label, how will only one condition correctly classify both classes?

_Replace with your answer_

### .b Some notations

We'll need to distinguish some terms with respect to their values before or after a particular iteration.  Let's introduce $(k)$ indicating a value at iteration $k$ and $(k+1)$ indicating a value at iteration $k+1$.  So for example, $x_{miss}(k)$ means the misclassified example from $k$-th iteration of the algorithm.

We'll also need a definition of the L2 norm (Euclidean distance).  We'll use $||*||^2$ to define the square ($*^2$) of the norm ($||*||$) of the argument $*$.
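
As a quick illustration of this notation, the squared norm of a vector is simply its dot product with itself. The vector below is made up for the example.

```python
import numpy as np

x = np.array([3.0, 4.0])    # made-up vector

norm = np.linalg.norm(x)    # ||x||   = sqrt(3^2 + 4^2) = 5
norm_sq = np.dot(x, x)      # ||x||^2 = 3^2 + 4^2       = 25

print(norm, norm_sq)
```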

### .c Proof Formulation

Then we can say that PLA updates the weight when $\theta(k)^\top \mathbf{x}_{miss}(k) \leq 0$, at any $k$-th iteration. And the update rule is as follows:
$\theta(k+1) = \theta(k) + \mathbf{x}_{miss}(k)$ (previously we had  $\theta' = \theta + y^{(i)}{\mathbf{x}_{miss}}^{(i)}$, as we have multiplied the class label $y$ with the feature vector,  so $y^{(i)}{\mathbf{x}_{miss}}^{(i)}$ becomes only ${\mathbf{x}_{miss}}^{(i)}$)

One more thing, for the simplicity of our calculation, we will only increment our iteration number ($k$) if the weight is updated; we don't increment if the weight remains unchanged. So under PLA we have:
$$\theta(k+1) = \theta(k) + \mathbf{x}_{miss}(k),\  \text{when  }\theta(k)^\top \mathbf{x}_{miss}(k)\leq 0;\  \forall k$$
Standard PLA terminates when it finds a separating hyperplane, and keeps running otherwise (this is a little different than our coded example, which keeps iterating until *num_iter* iterations is reached).

We want to show that the algorithm finds a separating hyperplane, if the data is linearly separable. 

We will prove this by contradiction, by assuming the algorithm fails to find a separating hyperplane. If the algorithm fails, it never stops updating the weight vector as above. We'll prove that this updating process must stop after a finite number of iterations.



### .d Proof by Contradiction

By contradiction, we'll assume our algorithm fails to find a hyperplane, so it must keep on updating. So at any $k$-th iteration we have:
$\theta(k+1) = \theta(k) + \mathbf{x}_{miss}(k)$ 



**Linear Upper Bound** 

Let's take the square of the norms of the vectors of both sides:
$$\begin{equation}
  \begin{split}
     ||\theta(k+1)||^2 & = ||\theta(k) + \mathbf{x}_{miss}(k)||^2 \\
      & = ||\theta(k)||^2 + ||\mathbf{x}_{miss}(k)||^2 + 2\theta(k)^\top \mathbf{x}_{miss}(k) \hspace{1cm} \text{(expanding the squares)}\\
      & \leq ||\theta(k)||^2 + ||\mathbf{x}_{miss}(k)||^2
   \end{split}
\end{equation}$$
The last line holds because we only update the weight vector in the $k$-th iteration when $\theta(k)^\top \mathbf{x}_{miss}(k)\leq 0$, so the cross term $2\theta(k)^\top \mathbf{x}_{miss}(k)$ is never positive and dropping it can only make the right-hand side larger.
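
The expansion can be checked numerically on a small made-up example, chosen so that $\theta^\top \mathbf{x}_{miss} \leq 0$ as in an actual update step. The numbers below are hypothetical.

```python
import numpy as np

theta = np.array([2.0, -1.0])    # made-up current weights
x_miss = np.array([-1.0, 1.0])   # made-up misclassified (label-folded) example

cross = np.dot(theta, x_miss)    # 2*(-1) + (-1)*1 = -3, so the example is misclassified

# Left-hand side: squared norm after the update
lhs = np.dot(theta + x_miss, theta + x_miss)

# Expanded form and the upper bound obtained by dropping the cross term
expanded = np.dot(theta, theta) + np.dot(x_miss, x_miss) + 2 * cross
upper = np.dot(theta, theta) + np.dot(x_miss, x_miss)

print(lhs, expanded, upper)
```

Here the squared norm after the update is $1$, the expansion gives the same value, and the bound without the cross term is $7$.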

So, now we can recursively apply this equation:
$$ \begin{equation}
\begin{split}
||\theta(k)||^2 & \leq ||\theta(k-1)||^2 + ||\mathbf{x}_{miss}(k-1)||^2 \\
  & \leq ||\theta(k-2)||^2 + ||\mathbf{x}_{miss}(k-2)||^2 + ||\mathbf{x}_{miss}(k-1)||^2  \hspace{1cm} \text{(expanding the first term)}\\
  & \ \ \vdots \\
  & \leq ||\theta(0)||^2 + \sum_{t=0}^{k-1}||\mathbf{x}_{miss}(t)||^2 \hspace{1cm} \text{(collecting all but the first term into the summation)}
\end{split}
\end{equation}$$

Let us assume the initial weight vector is zero, so $||\theta(0)|| = 0$, and define $M = \max_t \ ||{\mathbf{x}_{miss}} {(t)}||^2$. Since every $\mathbf{x}_{miss}(t)$ is one of our $m$ training examples, $M$ is at most the maximum squared norm across all $m$ examples.

Then we have, $$\begin{equation}
\begin{split}
||\theta(k)||^2 &  \leq & \ \  ||\theta(0)||^2 & + \sum_{t=0}^{k-1}||\mathbf{x}_{miss}(t)||^2 \\
&  \leq & \ \ 0 & + \sum_{t=0}^{k-1}M \\
& = & \ \  kM & & ... \hspace{1cm} (1)
\end{split}
\end{equation}$$
So as long as PLA keeps making errors, Equation (1) keeps applying: the squared norm of $\theta(k)$ can grow at most linearly in $k$ (the number of updates).



**Quadratic Lower Bound**  

Let's find a lower bound for how much $\theta$ must change from its initial value after $k$ iterations.  Again, we have the following update rule for PLA:

$\theta(k+1) = \theta(k) + \mathbf{x}_{miss}(k)$.

Hence we have,

$$\begin{equation}
\begin{split}
\theta(k) & = &\ \  \theta(k-1) + \mathbf{x}_{miss}(k-1) \\
 & = & \ \ \theta(k-2) + \mathbf{x}_{miss}(k-2) + \mathbf{x}_{miss}(k-1) \\
 & . & \\
  & . & \\
   & . & \\
   & = &\ \  \theta(0) + \sum_{i=0}^{k-1}\mathbf{x}_{miss}(i) & ...& \hspace{1cm}(2)
\end{split}
\end{equation}$$

Since the data is linearly separable, there must exist a weight vector $\theta^*$ which classifies all instances accurately.
We can say $\exists\  \theta^*$ such that $\theta^{*^\top}{\mathbf{x}^{(i)}} > 0, \forall i$; 

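We take the inner product of both sides of equation $(2)$ with $\theta^*$ and set $\theta(0) = 0$; then we get $\theta(k)^\top \theta^* = \sum_{i=0}^{k-1}{\mathbf{x}_{miss}(i)}^\top \theta^*  $

Now, let's define another variable $\gamma = \min_i \ \ {\mathbf{x}^{(i)}}^\top\theta^{*}$, the minimum over all $m$ training examples ($\gamma > 0$, by the choice of $\theta^*$). Every misclassified example is a training example, so ${\mathbf{x}_{miss}(i)}^\top\theta^{*} \geq \gamma$.

So the equation becomes, 
$\theta(k)^\top \theta^* = \sum_{i=0}^{k-1}\mathbf{x}_{miss}(i)^\top \theta^* \geq k\gamma $.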



Putting this all together, we have:
$$\begin{equation}
\begin{split}
k\gamma & \leq &\ \  \theta(k)^\top \theta^* \\
k^2\gamma^2 & \leq &\ \  ||\theta(k)^\top \theta^*||^2 & \hspace{1cm} \text{(Squaring both sides; as they are both scalars)}\\
& \leq &\ \  ||\theta(k)||^2 ||\theta^*||^2 & \hspace{1cm}\text{(By the Cauchy-Schwarz inequality)}\\
& \leq &\ \  kM \ ||\theta^*||^2 &\hspace{1cm}\text{(Invoke Eq. (1))}
\end{split}
\end{equation}$$
This equation should be true for all $k$ if the algorithm keeps updating the weight vector indefinitely.

If the algorithm keeps updating $\theta(k)$, we must have $k^2\gamma^2   \leq \ \  kM||\theta^*||^2\ $.

Dividing both sides by $k\gamma^2$ shows that this can be true only while $k \leq \frac{M||\theta^*||^2\ }{\gamma^2}$, so it fails for any larger $k$.   So there is a contradiction!

The above equation implies that the number of iterations where the weight vector is updated must be less than some constant. That means the algorithm will not update the weight vector forever, given the condition that the data points are linearly separable. Hence, the algorithm finds a separating hyperplane in finitely many iterations.

Therefore, the PLA will converge in finitely many iterations.
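To make the bound concrete, here is a small numerical sanity check. It is only a sketch under stated assumptions: the data are synthetic, the labels are already absorbed into the examples (so a correct classification means $\theta^\top \mathbf{x} > 0$), and the names `pla_update_count` and `theta_star` are hypothetical, not part of the earlier exercises.

```python
import numpy as np

def pla_update_count(X, max_updates=100_000):
    """Run PLA on label-absorbed examples (rows of X).

    Returns the final weight vector and the number of weight updates.
    """
    theta = np.zeros(X.shape[1])
    updates = 0
    while updates < max_updates:
        # An example is misclassified when theta^T x <= 0
        miss = np.flatnonzero(X @ theta <= 0)
        if miss.size == 0:
            break  # separating hyperplane found
        theta = theta + X[miss[0]]
        updates += 1
    return theta, updates

rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0, 0.5])   # assumed separator, for illustration only
raw = rng.normal(size=(200, 3))
margin = raw @ theta_star
keep = np.abs(margin) > 0.3               # drop points too close to the hyperplane
# Absorb the labels: flip examples on the negative side so theta_star^T x > 0 for all
X = raw[keep] * np.sign(margin[keep])[:, None]

theta, k = pla_update_count(X)
M = np.max(np.sum(X ** 2, axis=1))        # largest squared norm
gamma = np.min(X @ theta_star)            # smallest inner product with theta_star
bound = M * (theta_star @ theta_star) / gamma ** 2
print('updates: {}, bound: {:.1f}'.format(k, bound))
```

The number of updates should never exceed the bound, although in practice it is usually far smaller, since the bound is a worst-case guarantee.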




**Your Turn (Question 2)**: How did we prove the convergence of the algorithm?

Choose from: _By determining a fixed value for k (number of iteration) , By showing k depends on the optimal weight, By putting an upper limit on k_

## 6 Non-Linear Transformations
We have seen examples of using linear classifiers on different datasets. But the data we worked on up to now were largely linearly separable, i.e. we can draw a hyperplane to cleanly distinguish between the two sets. In real life that is often not the case: much of the data we find may not be linearly separable, so we cannot distinguish between the two classes just by drawing a straight line. Now, what should we do in such a case? Is there a way to apply **Linear Classifiers** on **Linearly Non-Separable Data**?

The solution to this problem is simple: apply a **Non-Linear Transformation**. If we want to apply linear classifier on non-linear data points, then we can perform a non-linear transformation to those data points, so that they become linearly separable. In the figure below, we can see an example of non-linear data. We can see the data points of two classes are not separable by a straight line; but it looks like a circular boundary could divide them into two separate clusters.

So what we need to do is transform the data points!

<div align="center">
<img src="https://www.comp.nus.edu.sg/~neamul/Images/nonlinear_new1.png" width=700 />
 </div>



### .a Example of Non-linear Transformation

Now, there are a lot of ways to transform data points from one space to another. Depending on the original data points, we decide which transformation function we should use. In this example, taking the squares of each dimension: i.e., 
$(x_1,x_2,...,x_n) \xrightarrow\phi (x_1^2,x_2^2,...,x_n^2)$ works to manufacture a new dataset $Z$ that is linearly separable. We are not limited to just squaring though -- Instead of squaring, we can think of any transformation $\textbf{x} \xrightarrow\phi \textbf{z}$, where the transformation results in a linearly separable space between both clusters. Once we have transformed the data points to a different space, then we can apply our linear model to classify the data points.

<div align="center">
<img src="https://www.comp.nus.edu.sg/~neamul/Images/nonlinear_new2.png" width=800 />
 </div>
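The squaring transformation can be tried out on synthetic data. The sketch below is illustrative only: the two rings of points, the helper `ring`, and the hand-picked weights `w` and `b` are all assumptions made for this example, not the data shown in the figures.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(radius_low, radius_high, n):
    """Sample n 2-D points whose radius is uniform in [radius_low, radius_high)."""
    r = rng.uniform(radius_low, radius_high, size=n)
    angle = rng.uniform(0, 2 * np.pi, size=n)
    return np.column_stack((r * np.cos(angle), r * np.sin(angle)))

inner = ring(0.0, 1.0, 100)   # class 0: points near the origin
outer = ring(2.0, 3.0, 100)   # class 1: points on a surrounding ring
X = np.vstack((inner, outer))
y = np.concatenate((np.zeros(100), np.ones(100)))

# phi: (x1, x2) -> (x1^2, x2^2)
Z = X ** 2

# In Z-space, z1 + z2 is the squared radius, so the straight line
# z1 + z2 = 2.25 (radius 1.5 in the original space) separates the classes.
w, b = np.array([1.0, 1.0]), -2.25
pred = (Z @ w + b > 0).astype(float)
accuracy = (pred == y).mean()
print('Accuracy of the linear rule in Z-space: {}'.format(accuracy))
```

No straight line in the original $(x_1, x_2)$ space can separate an inner cluster from a ring around it, but a linear rule in the transformed space corresponds to a circular boundary in the original space.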

**Your Turn (Question 3)**: Transform this non-linear equation to a linear one: $y = ax^b$?

_Replace with your answer_

### .b Exercise: Classify Non-linear data with Linear Classifier

Let's look at another problem: suppose we want to classify the output of XOR (see figure below). So there are two classes, 1 and 0 (diamond and circle). From the figure, we can see that we cannot classify these points using a line. The easiest solution is to draw a curve (parabola) to classify these data points. However this solution is non-linear; drawing a curve is not possible with our linear model. We can apply a _Non-linear Transformation_,  but there is an alternative method (we will learn more about it in the second half of the module!).

<div align="center">
<img src="https://www.comp.nus.edu.sg/~neamul/Images/xmpl1.png" width=800 />
 </div>


**Your Turn (Question 4)**: We want to classify these points using perceptron, is it possible? Give reasons for your answer.

_Replace with your answer_

## 7 Programming : Implementing Logistic Regression


In this exercise, we are going to implement logistic regression from scratch. Logistic regression is a generalized linear model for classification that predicts the probability of a binary event. For example, we might use logistic regression to predict whether someone will be denied or approved for a loan, or whether an email is spam or not.

Similar to linear regression, in logistic regression, input values ($\mathbf{X}$) are combined linearly using weights or coefficient values to predict an output value ($\mathbf{y}$). A key difference from linear regression is that the output value being modeled is a binary value ($0$ or $1$), rather than a numeric value.

In logistic regression, we’re essentially trying to find the weights that **maximize the likelihood of the training data** $\mathbf{X}$ and use them to categorize the target variable. Unlike linear regression, the likelihood maximization in logistic regression doesn’t have a closed form solution, and we'll need to solve the optimization problem with **gradient descent**. 

In [0]:
# Do some standard library imports for data science and machine learning
import numpy as np
import math
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
from sklearn.linear_model import LogisticRegression

### a. Logistic Function



Before we dive into logistic regression, let’s take a look at the **logistic function** (or equivalently, sigmoid function), the heart of the logistic regression technique.  The logistic function is defined as:

$$
g(z) = \frac{1}{1 + e^{-z}}
$$

It is an "S"-shaped curve that maps any real value to the range between $0$ and $1$:

<div align="center">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/600px-Logistic-curve.svg.png" width=400 />
 </div>
 
 By Qef (The Standard Logistic Regression) [CC BY 4.0](https://creativecommons.org/licenses/by-sa/4.0/deed.en), via Wikimedia Commons.


Let's get started.  

**Your Turn (Question 5):** Implement the `sigmoid` function below.

_Copy the code you added or modified in the assessment_

In [0]:
def sigmoid(z):
    """Calculate the sigmoid function on an input

    Args:
        z (float or array of floats): The input value(s) to transform

    Returns:
        float or array of floats: Transformed value(s); bounded between 0.0 and 1.0
    """

    #############################################################
    #
    # Your Turn: write your own code here
    #
    #############################################################

You can run the following code to plot the figure of your sigmoid function, and check if your implementation is correct. 

In [0]:
## Plotting testing harness
%matplotlib inline

# set the input space and define the function to plot
x = np.linspace(-10, 10)
y = sigmoid(x)

fig = plt.figure(figsize=(6,4))
ax = fig.add_subplot(111)

# stylize the plot
plt.style.use('ggplot')
plt.xlim(-11,11)
plt.ylim(0,1.1)

ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data',0))
ax.set_xticks([-10,-5,0,5,10])

ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data',0))
ax.set_yticks([0,0.5,1])
 
# perform and show the plot
plt.plot(x,y,label="Sigmoid",color = "blue")
plt.legend()
plt.show()

### .b Prediction Function



In linear regression, for a data point $x$, we predict its value $y$ by using a linear function $y = h_{\theta}(x) = \theta^\top x$. This is not a great solution for predicting binary-valued labels ($y\in \{0,1\}$). In logistic regression, we first use a linear function $z = h_{\theta}(x) = \theta^\top x$ to get the "score" of $x$ belonging to the $1$ class, and then squash the score $z$ into the range $(0,1)$ using the sigmoid function,  i.e., $y = \text{sigmoid}(z)$. In this way, we can interpret the prediction $y$ as the probability that $x$ belongs to the $1$ class. 

In summary, the prediction function of logistic regression is as follows:

$$
P(y = 1 | x) = h_{\theta}(x) = \text{sigmoid}(\theta^\top x) \\
P(y = 0 | x) = 1 - P(y = 1 | x) = 1 - h_{\theta}(x)
$$



Now, implement the prediction function of logistic regression by yourself. The function takes as input the input matrix $\mathbf{X}$ and the parameter vector $\theta$, and returns $h_{\theta}(\mathbf{X})$. 

Inputs:
- `theta`: the weight vector $\theta$ (an $n$-dimensional vector).
- `X`: an $m \times n$ matrix, where the $j$-th row is the $j$-th data point $x^{(j)}$. 

Returns: $h_{\theta}(\mathbf{X})$: A numpy array where the $j$-th entry is $h_{\theta}(x^{(j)})$. 



**Your Turn (Question 6):** Implement the `prediction` function of logistic regression.

_Copy the code you added or modified in the assessment_

In [0]:
def prediction(theta, X):
    """Analogous to .predict_proba() in sklearn's classifier (probability of the 1 class)

    Args:
        theta (array of floats; n): The weight vector of our classifier
        X (array of floats; m x n): The samples we want to test or make predictions on

    Returns:
        array of floats; m: Predicted probability of the 1 class for each sample
    """
    # One prediction per row of X; plain `float` replaces the removed `np.float` alias
    predictions = np.zeros(X.shape[0], float)
    #############################################################
    #
    # Your Turn: write your own code here
    #
    #############################################################
    
    return predictions

### .c Calculating the Log-Likelihood



After we build up the logistic regression model, our goal is to search for a value of $\theta$ so that the probability $P(y = 1 | x) = h_{\theta}(x)$ is large when $x$ belongs to the $1$ class, and small when $x$ belongs to the $0$ class (since this is a binary classification task, this implies that $P(y = 0 | x)$ is large). We will learn $\theta$ from the training data. 

For a set of training examples with binary labels $\{(x^{(i)}, y^{(i)}) : i = 1, \cdots , m \}$, the log-likelihood of the training data measures how well our model fits the training data. Our cost function $J(\theta)$ is the negative of the average log-likelihood, so maximizing the likelihood is the same as minimizing $J(\theta)$. It is calculated as follows (refer to the course lecture notes to see its derivation):

$$
J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left(y^{(i)} \log( h_\theta(x^{(i)}) ) + (1 - y^{(i)}) \log( 1 - h_\theta(x^{(i)}) ) \right).
$$

Note that only one of the two terms in the summation is non-zero for each training example, depending on whether the label $y^{(i)}$ is $0$ or $1$. When $y^{(i)}=1$, minimizing the cost function implies that we need to make $h_{\theta}(x^{(i)})$ large; and when $y^{(i)}=0$, we want to make $1 - h_{\theta}(x^{(i)})$ large, as explained above. 
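
As a quick worked example (the value $0.9$ is made up for illustration), suppose the model outputs $h_\theta(x^{(i)}) = 0.9$ for some example. Using the natural logarithm, that example's contribution to the sum, with the leading minus sign applied, would be:

$$
\begin{eqnarray*}
y^{(i)} = 1: & \quad & -\log(0.9) \approx 0.105 \\
y^{(i)} = 0: & \quad & -\log(1 - 0.9) = -\log(0.1) \approx 2.303
\end{eqnarray*}
$$

So a confident prediction that agrees with the label costs little, while the same confident prediction paired with the opposite label costs roughly twenty times more.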







Let's now implement this cost function. 

Inputs:
- `X`: an $m \times n$ matrix, where the $j$-th row is the feature vector of $x^{(j)}$. 
- `y`: an $m$-dimensional vector, where the $j$-th element is $y^{(j)}$. 
- `theta`: the weight vector $\theta$ (an $n$-dimensional vector). 

Returns: $J(\theta)$: A scalar. 

**Your Turn (Question 7):** Implement the `log_likelihood` function:

_Copy the code you added or modified in the assessment_

In [0]:
def log_likelihood(X, y, theta):
    """Calculates the cost J(theta), the negative average log-likelihood of the training data

    Args:
        X (array of floats; m x n): The input data
        y (array of floats; m): The target values for the data
        theta (array of floats; n): The weight vector theta

    Returns:
        float: cost, the negative average log-likelihood of the data X
    """
    likelihood = 0.0
    
    #############################################################
    #
    # Your Turn: write your own code here
    #
    #############################################################
    
    return likelihood

### .d Calculating the Gradient



We now have a cost function that measures how well a given hypothesis $h_{\theta}$ fits our training data. We can learn to classify our training data by minimizing $J(\theta)$ to find the best choice of $\theta$. 

Here we use gradient descent to optimize $J(\theta)$. So we need to provide a function that computes the gradient of our cost function $J(\theta)$ for any requested choice of $\theta$. We'll denote this as $\nabla_\theta J(\theta)$.  

The detailed derivation of $\nabla_\theta J(\theta)$ is given later below in the post-class section of this notebook. For now, we simply write the answer as follows:

$$
\nabla_\theta J(\theta) = \frac{1}{m} X^\top (h_{\theta}(X) - y)
$$

$\nabla_\theta J(\theta)$ is the gradient and should be an $n$-dimensional vector, where the $i$-th element of $\nabla_\theta J(\theta)$ is the partial derivative of the loss function $J(\theta)$ with respect to $\theta_i$. 



It's your turn again! Please write your own code in the following function to calculate the gradient $\nabla_\theta J(\theta)$, given:  
- `X`: an $m \times n$ matrix, where the $j$-th row is the feature vector of $x^{(j)}$. 
- `y`: an $m$-dimensional vector, where the $j$-th element is $y^{(j)}$. 
- `theta`: the weight vector $\theta$. (an $n$-dimensional vector)

Returns: $\nabla_\theta J(\theta)$ (an $n$-dimensional vector)

**Tip**: Please call the `prediction` function you implemented to calculate $h_{\theta}(X)$. 

**Your Turn (Question 8):** Implement the `gradient` function.

_Copy the code you added or modified in the assessment_

In [0]:
def gradient(X, y, theta):
    """Calculates the gradient of the cost function J(theta)

    Args:
        X (array of floats; m x n): The input data
        y (array of floats; m): The target values for the data
        theta (array of floats; n): The weight vector theta

    Returns:
        array of floats; n: the gradient of the cost function
    """
    gradient = 0.0
    
    #############################################################
    #
    # Your Turn: write your own code here
    #
    #############################################################
    
    return gradient

### .e Building the Logistic Regression Function



Finally, we are ready to train the model using gradient descent. The following code provides you with a training framework. 





The weight update rule in batch gradient descent is given by:
$$
\theta(t+1) = \theta(t) - \alpha \cdot \nabla_{\theta(t)} J(\theta(t))
$$
where $\theta(t)$ represents the weights in the $t$-th iteration, and $\alpha$ denotes the learning rate.  
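
To see the update rule in action without giving away the exercise, here is a minimal sketch on a one-dimensional toy problem. The function $f(\theta) = (\theta - 3)^2$, the helper name `gradient_descent_1d`, and the chosen step count are all assumptions made for illustration.

```python
def gradient_descent_1d(grad_fn, theta0, learning_rate, num_steps):
    """Repeatedly apply theta <- theta - learning_rate * grad_fn(theta)."""
    theta = theta0
    for _ in range(num_steps):
        theta = theta - learning_rate * grad_fn(theta)
    return theta

# Toy cost f(theta) = (theta - 3)^2 has gradient 2 * (theta - 3) and its minimum at theta = 3
theta_hat = gradient_descent_1d(lambda t: 2.0 * (t - 3.0),
                                theta0=0.0, learning_rate=0.1, num_steps=100)
print('theta after 100 steps: {:.6f}'.format(theta_hat))
```

With this learning rate, each step shrinks the distance to the minimum by a constant factor, so the iterate approaches $3$ quickly. A learning rate that is too large (here, above $1.0$) would make the iterates diverge instead.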

In the following function, your task is to implement the weight update of batch gradient descent according to the above equation. Please call the `gradient` function to help you calculate the gradient in your code.

Inputs: 
- `X`: (np.array) an $m \times n$ matrix, where the $j$-th row is the feature vector of $x^{(j)}$. 
- `y`: (np.array) an $m$-dimensional vector, where the $j$-th element is $y^{(j)}$. 
- `num_steps`: (int) the number of training iterations. 
- `learning_rate`: (float) learning rate of gradient descent, $\alpha$. 
- `verbose`: (Boolean) should we print the training information or not?

Returns: the optimized weights $\theta$. 

**Your Turn (Question 9):** Complete the code to train the `logistic_regression` model.

_Copy the code you added or modified in the assessment_

In [0]:
def logistic_regression(X, y, num_steps, learning_rate, verbose):
    """Optimises the weights for logistic regression, given a training dataset.

    Args:
        X (array of floats; m x n): The input data
        y (array of floats; m): The target values for the data
        num_steps (int): number of training iterations before termination
        learning_rate (float): learning rate for the gradient descent
        verbose (Boolean): print cost statistics?

    Returns:
        array of floats; n + 1: The weight vector theta, bias weight first
    """
    # Add the bias
    bias = np.ones((X.shape[0], 1))
    X = np.concatenate((bias, X), axis=1)

    # Initialize the weights
    weights = np.zeros(X.shape[1])

    # Training with gradient descent
    for step in range(num_steps):

        # Update weights with gradient
        #############################################################
        #
        # Your Turn: write your own code here
        #
        #############################################################

        # Print the cost (negative average log-likelihood) every 10000 steps
        if verbose and step % 10000 == 0:
            cost = log_likelihood(X, y, weights)
            print('Number of iterations: {}; cost: {:.5f}'.format(step, cost))

    return weights

### .f Putting it into practice: Classification on the *Iris* dataset

Congratulations! You have implemented your own version of logistic regression. To see how well it works in practice, let's test it on a real-world dataset. 

We will again use the same, modified _Iris_ dataset as in our Pre-Class work. Please run the following code to load and visualize the data first.  

In [0]:
# Load the original dataset
iris = datasets.load_iris()

# Obtain the training data
X = iris.data[:, :2]
y = (iris.target != 0) * 1

# Visualize the training data
plt.figure(figsize=(10, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='b', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='r', label='1')
plt.legend()
plt.show()

# Your turn: Sanity check -- verify the dataset is the same as in the earlier plot

### .g Binary Classification with Logistic Regression



Excited? It's time to run our model to do binary classification on our modified  _Iris_ dataset. 

The following code trains your implemented logistic regression model.  You should expect to see a continuing decrease of the cost function. The training process will typically take 10–60 seconds in our experience, but your mileage may vary.

In [0]:
weights = logistic_regression(X, y, num_steps = 300000, learning_rate = 0.1, verbose = True)

Let's inspect what we've done.  The following code prints the weights, classification accuracy, and plots the decision boundary.  

In [0]:
############################################################
def predict_prob(X, weights):
    """Returns prediction probabilities for the input data
    Args:
        X (array of floats; m x n): The input data
        weights (array of floats; n + 1): The weight vector theta, bias weight first

    Returns:
        array of floats; m: The predicted probability of the 1 class, bounded (0,1)
    """
    # Add the bias, x_0, for each example
    bias = np.ones((X.shape[0], 1))
    X = np.concatenate((bias, X), axis=1)

    return sigmoid(np.dot(X, weights))
###########################################################

print('The learned weights are {}'.format(weights))

preds = predict_prob(X, weights).round()
# Calculate the accuracy
accu = (preds == y).mean() 
print('The classification accuracy: {}'.format(accu))

confidence = [0.5]
boundary_colors = ['black']

plt.figure(figsize=(10, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='b', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='r', label='1')
plt.legend()
x1_min, x1_max = X[:,0].min(), X[:,0].max(),
x2_min, x2_max = X[:,1].min(), X[:,1].max(),
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max), np.linspace(x2_min, x2_max))
grid = np.c_[xx1.ravel(), xx2.ravel()]
probs = predict_prob(grid, weights).reshape(xx1.shape)
plt.contour(xx1, xx2, probs, confidence, linewidths=1, colors=boundary_colors)
plt.show()

If you got everything right, you should obtain a perfect classification accuracy (i.e., 1.0), and a decision boundary that separates the data points correctly. 


### .h Compare your version with `sklearn`



Now compare your implementation with `sklearn`.  Print the training time, learned weights, and the classification accuracy. Do you notice a remarkable difference in training time? `sklearn` is an order of magnitude faster. 



In [0]:
def test_sklearn_logistic_regression(X, y):
    """Testing harness for the sklearn logistic regression 
    Args:
        X (array of floats; m x n): The input data
        y (array of floats; m): The target values for the data
    """
    print('sklearn')
    # A very large C effectively disables regularization, to match our unregularized version
    model = LogisticRegression(C=1e20, solver='liblinear')
    %time model.fit(X, y)
    preds = model.predict(X)
    print('The learned weights are {} {}'.format(model.intercept_, model.coef_))
    print('The classification accuracy: {}'.format(((preds == y).mean())))
    print()
    return

def test_your_version_logistic_regression(X, y):
    """Testing harness for your version of logistic regression 
    Args:
        X (array of floats; m x n): The input data
        y (array of floats; m): The target values for the data
    """
    print('Your version')
    %time weights = logistic_regression(X, y, num_steps = 300000, learning_rate = 0.1, verbose = False)
    print('The learned weights are {}'.format(weights))
    preds = predict_prob(X, weights).round()
    print('The classification accuracy: {}'.format((preds == y).mean()))
    return
  
test_sklearn_logistic_regression(X, y)
test_your_version_logistic_regression(X, y)

Great job!  We're done here!

For more understanding:
* Your turn (optional): Hey, you know, we tested on our training data (a no-no)!  Modify your code to use the `train_test_split` function to divide your dataset properly into a training and testing dataset, with 50% for training and 50% for testing.
* Your turn (optional): Change the reporting to visualize the decision boundary (0.5 or greater) at different iterations during training. You can even plot them with different shades of gray.
* Your turn (optional): Try out a linear regression problem.  You'll have to change your dataset from a classification one to one with real valued outputs (regression).  You'll need to write out your own linear regression trainer.  
  * You can try with the same gradient descent algorithm you used for logistic regression.
  * Or with the closed-form analytic solution (harder)

**Your Turn (Question 10)**: Not optional.  Describe how you would apply a logistic regression on a 3-class classification problem.

_Replace with your answer_

## 8 Derivation of the logistic regression gradient



Here we derive the gradient of the cost function in logistic regression, i.e., $\nabla_{\theta} J(\theta)$, step by step. To recap, the cost function $J(\theta)$, is as follows:
$$
J(\theta) = - \frac{1}{m} \sum_i^m \left(y^{(i)} \log( h_\theta(x^{(i)}) ) + (1 - y^{(i)}) \log( 1 - h_\theta(x^{(i)}) ) \right).
$$

Before we calculate the gradient $\nabla_{\theta} J(\theta)$, we first calculate the gradient of the sigmoid function $g(z)$, which will be very useful in our derivation. The gradient of the sigmoid function $\nabla_{z} g(z)$ is as follows. 
$$
\nabla_{z} g(z) = \nabla_{z} \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \left(1 - \frac{1}{1 + e^{-z}}\right) = g(z) (1 - g(z))
$$

So, we find that $\nabla_{z} g(z) = g(z) (1 - g(z))$. This is a very useful property of the sigmoid function, which greatly simplifies the derivation of  $\nabla_{\theta} J(\theta)$. 
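
The identity can be checked numerically with a central finite difference. This is only a sketch: it uses SciPy's `scipy.special.expit` as the logistic function (an assumption made so the check does not depend on your own `sigmoid`), and the grid of test points and step size `eps` are arbitrary choices.

```python
import numpy as np
from scipy.special import expit  # SciPy's implementation of the logistic sigmoid

z = np.linspace(-5, 5, 11)
eps = 1e-6

# Central finite-difference approximation of g'(z)
numeric = (expit(z + eps) - expit(z - eps)) / (2 * eps)
# Closed form from the derivation above: g(z) * (1 - g(z))
analytic = expit(z) * (1 - expit(z))

max_abs_diff = np.max(np.abs(numeric - analytic))
print('Largest absolute difference: {:.2e}'.format(max_abs_diff))
```

The two agree up to finite-difference and rounding error. The same kind of check can be a handy way to test a hand-derived gradient before trusting it in training.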

**Your Turn (Question 11)**: What is the gradient of $log(1 - sigmoid(z))$? Here $log$ denotes the natural logarithm, i.e., $ln(x)$, and $sigmoid(z)$ is our choice for $g(z)$.

Choose from: _$sigmoid(z)$, $- sigmoid(z)$, $1 - sigmoid(z)$, $1 + sigmoid(z)$_

Now, we derive the gradient $\nabla_{\theta} J(\theta)$ by calculating its $i$-th element, i.e., the partial derivative of $J(\theta)$ with respect to $\theta_i$, as follows:

(Note that $h_{\theta}(x^{(j)}) = g(\theta^{\top} x^{(j)})$)

$$
\begin{eqnarray*} 
\frac {\partial J(\theta)}{\partial \theta_i} &=&- \frac 1 m \sum_{j=1}^m \left[ y^{(j)} \frac{1}{g(\theta^{\top} x^{(j)})}g(\theta^{\top} x^{(j)})(1-g(\theta^{\top} x^{(j)}))x_i^{(j)}+(1-y^{(j)})\frac{-1}{1-g(\theta^{\top} x^{(j)})}g(\theta^{\top} x^{(j)})(1-g(\theta^{\top} x^{(j)}))x_i^{(j)} \right] \\ 
&=& - \frac 1 m \sum_{j=1}^m \left[ y^{(j)}(1-g(\theta^{\top} x^{(j)}))x_i^{(j)}+(y^{(j)}-1)g(\theta^{\top} x^{(j)})x_i^{(j)} \right] \\ 
&=& \frac 1 m \sum_{j=1}^m (g(\theta^{\top} x^{(j)}) - y^{(j)})x_i^{(j)} 
\end{eqnarray*}
$$

If we write the above equation in vectorized form, then it's a lot simpler.  We have:
$$
\frac {\partial J(\theta)}{\partial \theta_i} = \frac 1 m (g(\mathbf{X}\theta) - y)^{\top} \mathbf{X}_{:i}
$$
where $\mathbf{X}_{:i}$ represents the $i$-th column of $\mathbf{X}$ (an $m$ dimensional column vector), and $g$ is applied element-wise to the $m$-dimensional vector $\mathbf{X}\theta$. Finally, based on the above equation, we can write $\nabla_{\theta} J(\theta)$ in the vectorized form, as follows:

$$
\begin{eqnarray*} 
\nabla_{\theta} J(\theta) & = & \frac 1 m \mathbf{X}^{\top} (g(\mathbf{X}\theta) - y) \\
& = & \frac 1 m \mathbf{X}^{\top} (h_{\theta}(\mathbf{X}) - y)
\end{eqnarray*}
$$

---
# Credits


Authored by Yip Ji Keong, Alvin; Mohammad Neamul Kabir; [Liangming Pan](http://www.liangmingpan.com/) and [Min-Yen Kan](http://www.comp.nus.edu.sg/~kanmy)  (2019), affiliated with [WING](http://wing.comp.nus.edu.sg), [NUS School of Computing](http://www.comp.nus.edu.sg) and [ALSET](http://www.nus.edu.sg/alset). Inspired in part by Andrew Ng's Coursera course and Yaser S. Abu-Mostafa's Caltech course.
Licensed as: [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/ ) (CC BY 4.0).
Please retain and add to this credits cell if using this material as a whole or in part.   Credits for photos given in their captions.