# Logistic Regression Notebook

## Sentiment Analysis with LR: Hotel reviews

Let's start with a simple sentiment example from SLP Section 5.1.1: given a set of weights and a single example, compute the prediction.


In [257]:
import numpy as np

After feature extraction, our example review boils down to a vector of 6 features, and we know that its correct class is positive (1). Let's hardcode the example and the correct class.

In [258]:
example = np.array([3,2,1,3,0,4.15])

In [259]:
correct = 1

Our initial weights and bias, as given in the example.

In [294]:
weights = np.array([2.5, -5, -1.2, 0.5, 2.0, 0.7])

In [261]:
bias = .1

We score our example with the usual method, a dot product of the weight vector and feature vector plus the bias.

In [262]:
rawscore = np.dot(weights, example) + bias

In [263]:
rawscore

0.8050000000000003
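
To see where that number comes from, the dot product can be broken into per-feature contributions. This is only an illustrative sketch; the names `contributions` and `rawscore_by_hand` are not part of the original notebook.

```python
import numpy as np

example = np.array([3, 2, 1, 3, 0, 4.15])
weights = np.array([2.5, -5, -1.2, 0.5, 2.0, 0.7])
bias = 0.1

# Element-wise products: how much each feature adds to (or subtracts from) the score.
contributions = weights * example

# Summing the contributions and adding the bias reproduces
# np.dot(weights, example) + bias.
rawscore_by_hand = contributions.sum() + bias

print(contributions)
print(rawscore_by_hand)
```

The second feature (weight -5, value 2) pulls the score down by 10, while the first and last features push it back up, leaving a small positive score.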

And then turn that into a proper probability with the sigmoid.

In [264]:
def classprob(score): return 1/(1+np.exp(-score))

In [265]:
score = classprob(rawscore)

In [266]:
score

0.6910430124157229

Let's compute the cross-entropy (CE) loss for this single example. Recall that the CE loss is just the negative log probability assigned to the right answer. The correct answer for this example is positive (y = 1).


In [267]:
def CE_loss(y, y_hat):
    return y * -np.log(y_hat) + (1 - y) * -np.log(1 - y_hat)

In [268]:
CE_loss(1, score)

0.36955321052982226
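
One practical caveat: if a predicted probability is exactly 0 or 1, the formula above multiplies 0 by an infinite log and returns `nan`. A common workaround, sketched here under the assumption that a tiny clipping constant is acceptable, is to clip the probability first. The function name `ce_loss_clipped` and the value of `eps` are choices made for this illustration, not part of the original notebook.

```python
import numpy as np

def ce_loss_clipped(y, y_hat, eps=1e-12):
    """Cross-entropy loss with y_hat clipped away from exactly 0 and 1."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

score = 0.6910430124157229  # probability of the positive class computed above

loss_if_positive = ce_loss_clipped(1, score)  # the true label for this example
loss_if_negative = ce_loss_clipped(0, score)  # the loss if the label had been 0
loss_at_one = ce_loss_clipped(1, 1.0)         # stays finite thanks to the clipping

print(loss_if_positive, loss_if_negative, loss_at_one)
```

For probabilities strictly between 0 and 1 the clipped version agrees with `CE_loss`, and it shows that the same prediction would be penalized more heavily if the true label were negative.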

Now that we have the probability assigned to the example let's perform a weight update using this as a training example.  To do this we need the gradient (partial derivative of the CE loss with respect to each feature's weight). 
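
For reference, the standard form of that gradient is sketched below. The symbols are notation chosen for this note: $\hat{y} = \sigma(w \cdot x + b)$ is the predicted probability (`score`), $w$ the `weights`, $x$ the `example`, $b$ the `bias`, and $y$ the correct class.

```latex
L_{CE}(\hat{y}, y) = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right]

\frac{\partial L_{CE}}{\partial w_j} = (\hat{y} - y)\, x_j
\qquad
\frac{\partial L_{CE}}{\partial b} = \hat{y} - y
```

So each weight's gradient is the prediction error times the corresponding feature value, and the bias gradient is the prediction error itself.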

Here we'll use numpy's array capabilities more fully. Instead of writing a loop where we multiply (score - correct) by each feature value, we'll multiply 'example' directly; the result is an array where each element of example is multiplied by (score - correct).

In [269]:
gradient = (score - correct) * example

In [270]:
gradient

array([-0.92687096, -0.61791398, -0.30895699, -0.92687096, -0.        ,
       -1.2821715 ])
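
A quick way to gain confidence in a gradient formula is to compare it with a finite-difference estimate. The sketch below does that for this example; the helper names (`sigmoid`, `ce_loss`, `loss_at`) and the step size `eps` are choices made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ce_loss(y, y_hat):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

x = np.array([3, 2, 1, 3, 0, 4.15])
w = np.array([2.5, -5, -1.2, 0.5, 2.0, 0.7])
b = 0.1
y = 1

def loss_at(weights):
    """CE loss of the single example for a given weight vector."""
    return ce_loss(y, sigmoid(np.dot(weights, x) + b))

# Analytic gradient, as computed in the notebook.
analytic = (sigmoid(np.dot(w, x) + b) - y) * x

# Central finite differences: nudge one weight at a time.
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    w_plus = w.copy()
    w_minus = w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    numeric[i] = (loss_at(w_plus) - loss_at(w_minus)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```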

With the gradient in hand, we can update the weights. We'll assume a learning rate of 1, just to keep things consistent with the slides.

In [271]:
learningrate = 1

In [272]:
new_weights = weights - learningrate * gradient

In [273]:
new_weights

array([ 3.42687096, -4.38208602, -0.89104301,  1.42687096,  2.        ,
        1.9821715 ])

Now that we have the new weights we can reassess the class probability of our example.

In [274]:
new_score = classprob(np.dot(new_weights, example) + bias)

And not surprisingly, with only one example and a big learning rate, we make a big leap towards the right answer (not realistic).

In [275]:
new_score

0.9999982077239783

And the loss goes down correspondingly.

In [276]:
CE_loss(1, new_score)

1.79227762783454e-06

There are a couple of things we missed here. The first is that we forgot to deal with the bias correctly: the update above changed the weights but left the bias at 0.1. Let's fix that by folding the bias into the features and the weights. That is, add an additional feature, which we'll fix at 1, and a corresponding weight of 0.1.

So now our example is:

In [277]:
example = np.array([3,2,1,3,0,4.15,1])

In [278]:
np.shape(example)

(7,)

And the original weights are now:

In [295]:
weights = np.array([2.5, -5, -1.2, 0.5, 2.0, 0.7, 0.1])

In [296]:
weights

array([ 2.5, -5. , -1.2,  0.5,  2. ,  0.7,  0.1])

In [297]:
np.shape(weights)

(7,)

Now the score is just a dot product.

In [282]:
rawscore = np.dot(weights, example)

In [283]:
score = classprob(rawscore)

Let's make sure we get the same answer as before.

In [284]:
score

0.691043012415723

With that new formulation of the bias we can do the weight update again. I'll leave that to you.
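
If you want something to check your work against, here is one possible sketch of that update. It simply reuses the single-example steps from above with the 7-element vectors, so the last element of the gradient is the gradient for the bias.

```python
import numpy as np

def classprob(score):
    return 1 / (1 + np.exp(-score))

# Last entry of example is the bias pseudo-feature; last entry of weights is the bias.
example = np.array([3, 2, 1, 3, 0, 4.15, 1])
weights = np.array([2.5, -5, -1.2, 0.5, 2.0, 0.7, 0.1])
correct = 1
learningrate = 1

score = classprob(np.dot(weights, example))
gradient = (score - correct) * example          # gradient[-1] belongs to the bias
new_weights = weights - learningrate * gradient
new_score = classprob(np.dot(new_weights, example))

print(new_weights)
print(new_score)
```

The first six weights come out the same as in the earlier update, while the bias now moves as well, from 0.1 to roughly 0.41.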

## Batch Processing

The next task is to do some batch processing. That is, we want to do a single update from a set of training instances. This could be a true batch over the whole training set or a mini-batch. To do this, we accumulate the gradients over a set of examples and then divide by the number of examples in the batch to get the gradient we need for the update.

Let's assume we have a development batch stored away as a CSV file, with examples as rows. We can read that in using numpy's genfromtxt.


In [285]:
def loadcsvtrain(filename):
    """Return a 2-d array of training features and a 1-d array of gold labels, skipping the header row."""

    x = np.genfromtxt(filename, delimiter=',', skip_header=1, usecols=(0,1,2,3,4,5))
    y = np.genfromtxt(filename, delimiter=',', skip_header=1, usecols=6)

    return x, y
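
As an aside, the file can also be parsed once and then sliced into features and labels. The sketch below shows that variation on a made-up in-memory CSV, since `HotelDev.csv` itself is not reproduced here; the column names and values in `fake_csv` and the function name `loadcsv_once` are invented for illustration.

```python
import io
import numpy as np

# Made-up stand-in for the file: a header row, then 6 feature columns and 1 label column.
fake_csv = io.StringIO(
    "f1,f2,f3,f4,f5,f6,label\n"
    "3,2,1,3,0,4.15,1\n"
    "1,4,0,2,1,3.20,0\n"
    "2,0,1,5,0,4.90,1\n"
)

def loadcsv_once(source):
    """Read the data once, then split off the last column as the labels."""
    data = np.genfromtxt(source, delimiter=',', skip_header=1)
    return data[:, :-1], data[:, -1]

X_demo, y_demo = loadcsv_once(fake_csv)
print(X_demo.shape, y_demo.shape)
```

This assumes the label is always the last column, which matches the layout implied by `usecols` in `loadcsvtrain`.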

In [286]:
X, y = loadcsvtrain("HotelDev.csv")

That should give us 20 examples (rows) with 6 feature values each (columns).

In [287]:
np.shape(X)

(20, 6)

And 20 answers. One for each example.

In [288]:
np.shape(y)

(20,)

Now, we need to update X a bit to deal with the bias term. Specifically, we need to add an additional feature value (1) to each row representing the pseudo-feature for the bias term. In other words we need to add a 1 to the end of each row.  

In [289]:
X = np.append(X, np.ones((20,1)), 1)

In [290]:
np.shape(X)

(20, 7)
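
The call above hardcodes the batch size of 20. A small variation, sketched here on a made-up array, takes the row count from the array itself so the same line works for any batch size; `X_demo` and `X_aug` are names invented for this illustration.

```python
import numpy as np

X_demo = np.arange(12, dtype=float).reshape(2, 6)  # made-up batch: 2 examples, 6 features

# Build the column of 1s from the array's own row count instead of a literal.
ones = np.ones((X_demo.shape[0], 1))
X_aug = np.hstack([X_demo, ones])

print(X_aug.shape)
```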

The weights also need an initial bias at the end. The 7-element weights array we defined above already has it.

In [298]:
np.shape(weights)

(7,)

And let's now get the predictions for each of the 20 examples with a single matrix operation. 

In [225]:
scores = classprob(np.dot(X, weights))

In [226]:
scores

array([9.99999977e-01, 9.99973631e-01, 9.99897414e-01, 9.99999986e-01,
       9.99999999e-01, 1.00000000e+00, 9.99734290e-01, 1.00000000e+00,
       1.00000000e+00, 1.00000000e+00, 3.89674161e-07, 2.43759588e-07,
       4.99105297e-01, 1.05299427e-02, 3.61003058e-03, 4.69311313e-02,
       8.95640383e-01, 4.28695286e-07, 5.31867380e-01, 1.53819056e-01])

When we look at y we can see that the first 10 examples are positive and the next 10 are negative. So these outputs aren't too bad.

In [227]:
y

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0.])

In [251]:
def predict(score):
    return np.round(score)


To be specific, we're getting 2 of the 20 examples wrong: two negative examples (at indices 16 and 18) are predicted as positive.

In [253]:
predict(scores)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 1.,
       0., 1., 0.])
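
The error count can also be computed rather than read off by eye. The sketch below copies the `scores` values from the output displayed above (so they are rounded to the printed precision) and uses a 0.5 threshold, which gives the same decisions as rounding for these values; `accuracy` and `wrong` are names introduced here.

```python
import numpy as np

# Values copied from the printed scores above.
scores = np.array([
    9.99999977e-01, 9.99973631e-01, 9.99897414e-01, 9.99999986e-01,
    9.99999999e-01, 1.00000000e+00, 9.99734290e-01, 1.00000000e+00,
    1.00000000e+00, 1.00000000e+00, 3.89674161e-07, 2.43759588e-07,
    4.99105297e-01, 1.05299427e-02, 3.61003058e-03, 4.69311313e-02,
    8.95640383e-01, 4.28695286e-07, 5.31867380e-01, 1.53819056e-01])
y = np.array([1.0] * 10 + [0.0] * 10)

predictions = (scores >= 0.5).astype(float)
accuracy = np.mean(predictions == y)      # fraction of correct predictions
wrong = np.flatnonzero(predictions != y)  # indices of the misclassified examples

print(accuracy)  # 0.9
print(wrong)     # [16 18]
```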

More importantly, we'd like to know the average CE loss for these examples. 

In [230]:
1/20 * np.sum(CE_loss(y, scores))

0.1969981910855772

Now we can move on to getting the gradients for a batch update. Let's first get the vector of differences between the predicted scores and the correct scores.

In [231]:
deltas = scores - y

In [232]:
deltas

array([-2.33201815e-08, -2.63692710e-05, -1.02586310e-04, -1.37349574e-08,
       -1.13339471e-09, -1.09021236e-10, -2.65709799e-04, -1.84297022e-14,
       -1.75415238e-12, -9.39981426e-12,  3.89674161e-07,  2.43759588e-07,
        4.99105297e-01,  1.05299427e-02,  3.61003058e-03,  4.69311313e-02,
        8.95640383e-01,  4.28695286e-07,  5.31867380e-01,  1.53819056e-01])

Given these deltas, we need to multiply each delta by its corresponding example feature vector in X, then accumulate the gradients across each feature, and finally divide by the number of examples to get the average update. We can do that all at once using a dot product.

In [233]:
avg_gradients = (1 / np.shape(X)[0]) * np.dot(X.T, deltas)

In [234]:
avg_gradients

array([0.15141652, 0.18950249, 0.        , 0.00820286, 0.15379752,
       0.46055176, 0.10705548])
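
To make the dot product less mysterious, here is the same accumulation written as an explicit loop and compared with the vectorized form. The data is randomly generated purely for illustration (it is not the hotel batch), and the `_demo` names are invented here.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 7))   # made-up batch: 5 examples, 7 features
deltas_demo = rng.normal(size=5)   # made-up (score - label) values

# Explicit version: scale each example by its delta, accumulate, then average.
accumulated = np.zeros(X_demo.shape[1])
for x_i, delta_i in zip(X_demo, deltas_demo):
    accumulated += delta_i * x_i
avg_loop = accumulated / X_demo.shape[0]

# Vectorized version, in the same form the notebook uses.
avg_dot = (1 / X_demo.shape[0]) * np.dot(X_demo.T, deltas_demo)

print(np.allclose(avg_loop, avg_dot))  # True
```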

Update the weights using the average gradient.

In [235]:
new_weights = weights - learningrate * avg_gradients

In [236]:
new_scores = classprob(np.dot(X, new_weights))

Checking the predictions for the new scores, we see that we now get the whole training set correct.

In [239]:
[predict(score) for score in new_scores]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

And the average loss (previously about 0.20) goes down as well.

In [240]:
(1/20) * np.sum(CE_loss(y, new_scores))

0.025768427482170342

And here are the new weights. Note how the bias term moved towards 0, from 0.1 to roughly -0.007.

In [241]:
new_weights

array([ 2.34858348, -5.18950249, -1.2       ,  0.49179714,  1.84620248,
        0.23944824, -0.00705548])
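
In practice this batch update is repeated many times. The sketch below runs the same update in a loop on randomly generated data (not the hotel reviews) and records the average loss at each step. The data, the labeling rule, the learning rate of 0.5, and the 50 steps are all arbitrary choices made for this illustration.

```python
import numpy as np

def classprob(score):
    return 1 / (1 + np.exp(-score))

def CE_loss(y, y_hat):
    return y * -np.log(y_hat) + (1 - y) * -np.log(1 - y_hat)

# Made-up data: 20 examples, 6 random features plus a bias column of 1s.
rng = np.random.default_rng(0)
X_demo = np.hstack([rng.normal(size=(20, 6)), np.ones((20, 1))])
y_demo = (X_demo[:, 0] - X_demo[:, 1] > 0).astype(float)  # arbitrary labeling rule

w = np.zeros(7)
learningrate = 0.5
losses = []
for step in range(50):
    scores = classprob(np.dot(X_demo, w))
    losses.append(np.mean(CE_loss(y_demo, scores)))
    avg_gradients = np.dot(X_demo.T, scores - y_demo) / X_demo.shape[0]
    w = w - learningrate * avg_gradients

print(losses[0], losses[-1])
```

With all-zero starting weights every prediction is 0.5, so the first recorded loss is log(2); later entries show how far the repeated updates bring it down.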

