# Logistic regression

In this session we will develop a simple implementation of logistic regression trained with stochastic gradient descent (SGD). The goal is to build an understanding of gradient descent, the logistic regression model, and the practical use of numpy.

## Modules
We use this opportunity to also practice the use of modules. A module is a Python file containing a number of definitions. A module can be imported and used in a notebook, or in another module. Modules are a good way of organizing reusable Python code.



First we'll load some toy data to use with our functions.  We'll make this into a binary problem by keeping only two species.

In [2]:
import numpy
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# skip rows with the label 2
data = iris.data[iris.target != 2]
target = iris.target[iris.target != 2]
X_train, X_val, y_train, y_val = train_test_split(data, target, 
                                                  test_size=1/3, random_state=123)


# Z-score the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)

print(X_train.shape)

(66, 4)


## Model definition


We'll first define the interface of our model:

- `predict` - compute predicted classes on new examples given a trained model
- `predict_proba` - compute class probabilities on new examples given a trained model
- `fit` - train a model using features and labels from the training set

as well as some auxiliary functions.

Create a Python file named `logisticregression.py` in your home directory. You will put the function definitions in this file, and import them into the notebook. Remember that if you change something in the module file, you will need to restart the notebook kernel to reload the module into the notebook.
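As an alternative to restarting the kernel, the standard library function `importlib.reload` can re-execute a module that has already been imported. The sketch below illustrates this with a throwaway module written to a temporary directory; the module name `demo_module` and its function `version` are hypothetical stand-ins for `logisticregression.py` and its contents.

```python
import importlib
import os
import sys
import tempfile

# Skip bytecode caching so that a quick rewrite of the file is always picked up
sys.dont_write_bytecode = True

# Write a throwaway module (hypothetical name) to a temporary directory
module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, "demo_module.py")
with open(module_path, "w") as f:
    f.write("def version():\n    return 'old'\n")

# Make the directory importable and import the module
sys.path.insert(0, module_dir)
import demo_module
first = demo_module.version()

# Change the module file, then reload it without restarting the interpreter
with open(module_path, "w") as f:
    f.write("def version():\n    return 'new version'\n")
importlib.reload(demo_module)
second = demo_module.version()

print(first, "->", second)
```

Note that names brought in with `from module import name` keep pointing at the old definitions after a reload, so that import statement has to be run again. Restarting the kernel remains the simplest way to be sure everything is fresh.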


### Exercise 1

Define function `inverse_logit` which takes an input `z` and returns $\mathrm{logit}^{-1}(z)$, defined as:
$$
\mathrm{logit}^{-1}(z) = \frac{1}{1+\exp(-z)}
$$

Note that `z` can be a number or a numpy array.

**Example call** `inverse_logit(0.0)` should return 0.5. 

**Example call** `inverse_logit(numpy.array([-0.5, 0, 0.5]))` should return `array([0.37754067, 0.5, 0.62245933])`, where the inverse logit function is applied to each element of the array.

Keep in mind that any variables or functions you use in the function definition need to be imported inside the module.
After defining this function, import it into the notebook:

In [4]:
from logisticregression import inverse_logit

In [5]:
print(inverse_logit(0.5))
print(inverse_logit(-10.0))
print(inverse_logit(0.0))
print(inverse_logit(40.0))
print(inverse_logit(40.0) == inverse_logit(100.0))


0.6224593312018546
4.5397868702434395e-05
0.5
1.0
True


In [6]:
inverse_logit(numpy.array([-0.5,0.0, 0.5]))

array([0.37754067, 0.5       , 0.62245933])

(Due to limited precision of floating point numbers, past a certain absolute value of the input, our function becomes a constant 1 or 0.)
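The saturation can be checked directly. The sketch below uses one possible implementation of `inverse_logit` (yours may differ) and compares $\exp(-40)$ with the machine epsilon of 64-bit floats: once $\exp(-z)$ is smaller than half of epsilon, $1 + \exp(-z)$ rounds to exactly 1. The input `-800.0` is an arbitrary choice, large enough that `numpy.exp` overflows to infinity.

```python
import numpy

def inverse_logit(z):
    # One possible implementation of the formula from Exercise 1
    return 1.0 / (1.0 + numpy.exp(-z))

# Machine epsilon: the gap between 1.0 and the next larger float
eps = numpy.finfo(float).eps
tiny = numpy.exp(-40.0)

# exp(-40) is below eps / 2, so 1 + exp(-40) rounds to exactly 1.0
saturated_high = inverse_logit(40.0) == 1.0

# For very negative inputs exp(-z) overflows to inf and the result is 0.0;
# errstate silences the overflow warning numpy would otherwise emit
with numpy.errstate(over="ignore"):
    saturated_low = inverse_logit(-800.0) == 0.0

print(tiny < eps / 2, saturated_high, saturated_low)
```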

### Exercise 2 

Define function `predict_proba`, with two arguments:

- dictionary of model parameters `{'w':w,'b':b}`, where `w` is a numpy array of coefficients and `b` a scalar intercept
- numpy array (matrix) `X` of the features of new examples

The function should return an array of probabilities of the positive class.
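For logistic regression, the probability of the positive class is the inverse logit of the linear score $\mathbf{x} \cdot \mathbf{w} + b$. A minimal sketch of this computation is shown below, assuming the parameter dictionary described above; the values in `X_demo` and `wb_demo` are made up for illustration. Using `numpy.dot` lets the same code handle both a matrix of examples and a single feature vector.

```python
import numpy

def inverse_logit(z):
    return 1.0 / (1.0 + numpy.exp(-z))

def predict_proba(wb, X):
    # X has shape (n_examples, n_features) or (n_features,)
    # The linear score has shape (n_examples,) or is a scalar
    z = numpy.dot(X, wb['w']) + wb['b']
    return inverse_logit(z)

# Made-up data: three examples with two features each
X_demo = numpy.array([[1.0, 2.0], [0.0, 0.0], [-1.0, -2.0]])
wb_demo = {'w': numpy.array([1.0, 0.0]), 'b': 0.0}

p_demo = predict_proba(wb_demo, X_demo)       # one probability per row
p_single = predict_proba(wb_demo, X_demo[0])  # a single probability
print(p_demo.shape, p_demo)
```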

In [9]:
from logisticregression import predict_proba

In [10]:
# Initial model parameters
w = numpy.zeros((X_train.shape[1],))
b = 0
wb = {'w':w,'b':b}
# Use this initial model for prediction
p_pred = predict_proba(wb, X_val)
print(X_val.shape)
print(p_pred.shape)
print(p_pred)

(34, 4)
(34,)
[0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]


### Exercise 3
Define function `predict` which takes the same input as `predict_proba` but returns the class labels (0 or 1) instead of probabilities. 

**Hint:** Call the `predict_proba` function on the same inputs and obtain the probability outputs. Return 1 for items that are greater than or equal to 0.5 at the output of calling `predict_proba` and return 0 otherwise.  
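The thresholding step from the hint can be written as a vectorized comparison. The sketch below shows only that step on made-up probabilities; `labels_from_proba` is a hypothetical helper name, and in your `predict` the probabilities would come from `predict_proba`.

```python
import numpy

def labels_from_proba(p):
    # Comparison gives a boolean array; astype(int) maps False/True to 0/1
    return (numpy.asarray(p) >= 0.5).astype(int)

# Made-up probabilities; note that exactly 0.5 maps to class 1
labels = labels_from_proba([0.2, 0.5, 0.9])
print(labels)
```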

In [12]:
from logisticregression import predict

In [13]:
y_pred = predict(wb, X_val)
print(X_val.shape)
print(y_pred.shape)
print(y_pred)

(34, 4)
(34,)
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


Our model interface is complete.

## Training
We will now implement the interface of the SGD training algorithm:

- `fit` which takes initial model parameters and trains the model for one pass over the given training data

We will start with an auxiliary function `update` which does a single step of SGD.


### Exercise 4

Define function `update` which is given a single training example, and first uses the `predict_proba` function to get the predicted probability of the positive class, and then updates the weights and the bias of
the model depending on the difference between this probability and the actual target. 

The function is given these arguments:

- `wb` - the model weights and bias (dictionary of model parameters `{'w':w,'b':b}`, where `w` is a numpy array of coefficients and `b` a scalar intercept)
- `x`  - the feature vector of the training example
- `y`  - the class label of the training example
- `eta`- learning rate

The update should change the given parameters by implementing the following operations:
$$
\mathbf{w}_{new} = \mathbf{w}_{old} + \eta(y-p_{pred})\mathbf{x}
$$

and

$$
b_{new} = b_{old} + \eta (y-p_{pred})
$$

Finally, the function should return the value of the loss for the current example, computed with the probability predicted before the update, that is:
$$
-y \log_2(p_{pred}) - (1-y)\log_2(1-p_{pred})
$$
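To make the arithmetic concrete, here is a sketch of a single update step under the definitions above. The feature vector `x_demo` is made up. With all-zero parameters the predicted probability is 0.5, so for $y = 1$ and $\eta = 0.1$ the bias moves by $0.1 \cdot 0.5 = 0.05$, the weights move by $0.05\,\mathbf{x}$, and the loss is $-\log_2(0.5) = 1$.

```python
import numpy

def inverse_logit(z):
    return 1.0 / (1.0 + numpy.exp(-z))

def update(wb, x, y, eta):
    # Probability of the positive class under the current parameters
    p_pred = inverse_logit(numpy.dot(x, wb['w']) + wb['b'])
    # Loss of this example, computed before the parameters change
    loss = -y * numpy.log2(p_pred) - (1 - y) * numpy.log2(1 - p_pred)
    # Move the parameters in proportion to the prediction error
    wb['w'] = wb['w'] + eta * (y - p_pred) * x
    wb['b'] = wb['b'] + eta * (y - p_pred)
    return loss

# Made-up example with two features and label 1
x_demo = numpy.array([1.0, -2.0])
wb_demo = {'w': numpy.zeros(2), 'b': 0.0}
loss_demo = update(wb_demo, x_demo, 1, eta=0.1)
print(loss_demo, wb_demo)
```

The dictionary is modified in place, which is why the caller sees the new parameters without the function returning them.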


In [15]:
from logisticregression import update

In [30]:
from pprint import pprint
wb = {'w':numpy.zeros((X_train.shape[1],)), 'b':0}
eta = 0.1
# Show P(y=1) before and after update

# Process example 1
i = 0
print("Actual class: {}".format(y_train[i]))
print("P(y=1): {:.3}".format(predict_proba(wb, X_train[i])))
loss = update(wb, X_train[i], y_train[i], eta)
print("Loss: {:.3}".format(loss))
pprint(wb)
print("P(y=1): {:.3}".format(predict_proba(wb, X_train[i])))


print()
# Process the example at index 5
i = 5
print("Actual class: {}".format(y_train[i]))
print("P(y=1): {:.3}".format(predict_proba(wb, X_train[i])))
loss = update(wb, X_train[i], y_train[i], eta)
print("Loss: {:.3}".format(loss))
pprint(wb)
print("P(y=1): {:.3}".format(predict_proba(wb, X_train[i])))




Actual class: 1
P(y=1): 0.5
Loss: 1.0
{'b': 0.05, 'w': array([-0.00510916, -0.01013003,  0.0586053 ,  0.06574479])}
P(y=1): 0.552

Actual class: 0
P(y=1): 0.48
Loss: 0.943
{'b': 0.002031277565616642,
 'w': array([ 0.04831826, -0.00041154,  0.1069583 ,  0.12408812])}
P(y=1): 0.423


### Exercise 5 

Define function `fit`, which will use the `update` function on each training example in turn, for a single pass (epoch) of SGD over the training data. The function takes the following arguments:

- `wb` - the current model weights and bias
- `X` - the matrix of training example features
- `y` - the vector of training example classes
- `eta=0.1` - the learning rate, with default 0.1

The function returns the sum of the losses on all the examples, as given by `update`.
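One epoch is then a plain loop over the training examples. The sketch below assumes an `update` function with the behaviour described in Exercise 4 (a compact version is repeated here so the block runs on its own) and uses a made-up two-example dataset. With zero initial parameters both examples have loss 1 in the first epoch, and the total loss drops in the second.

```python
import numpy

def inverse_logit(z):
    return 1.0 / (1.0 + numpy.exp(-z))

def update(wb, x, y, eta):
    # Compact single-step update, as specified in Exercise 4
    p = inverse_logit(numpy.dot(x, wb['w']) + wb['b'])
    loss = -y * numpy.log2(p) - (1 - y) * numpy.log2(1 - p)
    wb['w'] = wb['w'] + eta * (y - p) * x
    wb['b'] = wb['b'] + eta * (y - p)
    return loss

def fit(wb, X, y, eta=0.1):
    # Visit every training example once and accumulate the losses
    total_loss = 0.0
    for x_i, y_i in zip(X, y):
        total_loss += update(wb, x_i, y_i, eta)
    return total_loss

# Made-up data: one feature, two examples, one per class
X_demo = numpy.array([[-1.0], [1.0]])
y_demo = numpy.array([0, 1])
wb_demo = {'w': numpy.zeros(1), 'b': 0.0}

first_epoch = fit(wb_demo, X_demo, y_demo, eta=0.1)
second_epoch = fit(wb_demo, X_demo, y_demo, eta=0.1)
print(first_epoch, second_epoch)
```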


In [33]:
from logisticregression import fit

In [35]:
wb = {'w':numpy.zeros((4,)), 'b':0}
eta = 0.1
J = 10

# Let's run 10 epochs of SGD
print("epoch loss")
for j in range(J):
    loss = fit(wb, X_train, y_train, eta=eta)
    print("{} {:.3}".format(j, loss))

epoch loss
0 17.6
1 4.59
2 2.81
3 2.04
4 1.61
5 1.34
6 1.14
7 0.999
8 0.889
9 0.801


### Exercise 6

Train your model for one pass (epoch) using the `fit` function on the training data and check classification accuracy on validation data. You can use `accuracy_score` function from `sklearn.metrics` to find accuracy.

In [60]:
from sklearn.metrics import accuracy_score

model = {'w':numpy.zeros((4,)), 'b':0}
fit(model, X_train, y_train, eta = 0.1)
acc = accuracy_score(y_val, predict(model, X_val))
acc

1.0

## SGD classifier 

The scikit-learn SGD classifier is suitable to use on large datasets, as well as on sparse data such as character or word ngram counts.

We'll use the scikit-learn implementation of Logistic Regression with SGD to learn to classify posts on various discussion groups into topics.  There are twenty groups:

In [63]:
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
for group in data.target_names:
    print(group)

alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc


The data is in the form of raw text, so we'll need to extract some features from it.

In [65]:
print(data.data[0])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


We will split the data into train and validation, and then extract word counts from the texts.

In [66]:
text_train, text_val, y_train, y_val = train_test_split(data.data, data.target, test_size=1/2, random_state=123)

In [67]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(analyzer='word', ngram_range=(1,1), lowercase=True)
X_train = vec.fit_transform(text_train)
X_val = vec.transform(text_val)

We can now try the SGDClassifier on this data.

In [68]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

In [69]:
# 'log_loss' selects logistic regression; scikit-learn releases before 1.1 call this loss 'log'
model = SGDClassifier(loss='log_loss', random_state=123)
model.fit(X_train, y_train)
y_pred = model.predict(X_val)
print("{:.3}".format(accuracy_score(y_val, y_pred)))

### Exercise 7

Experiment with different features and model hyperparameters, and find a well performing setting.

**Hint:** You can have a look at the parameters of CountVectorizer you can update (e.g. ngram_range, lowercase, etc.) here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html and the parameters of SGDClassifier (e.g. learning_rate, eta0) here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html