# Logistic regression

In this exercise you will train a logistic regression model via gradient descent in two simple scenarios.

The general setup is as follows:
* we are given a set of pairs $(x, y)$, where $x \in R^D$ is a vector of real numbers representing the features, and $y \in \{0,1\}$ is the target,
* for a given $x$ we model the probability of $y=1$ by $h(x):=g(w^Tx)$, where $g$ is the sigmoid function: $g(z) = \frac{1}{1+e^{-z}}$,
* to find the right $w$ we will minimize the so-called logarithmic loss: $J(w) = -\frac{1}{n}\sum_{i=1}^n \left[ y_i \log{h(x_i)} + (1-y_i) \log{(1-h(x_i))} \right]$,
* with the loss function in hand we can improve our guesses iteratively:
    * $w_j^{t+1} = w_j^t - \text{step\_size} \cdot \frac{\partial J(w)}{\partial w_j}$,
* we can end the process after some predefined number of epochs (or when the changes are no longer meaningful).
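
The partial derivative in the update rule has a simple closed form. As a short derivation, using $g'(z) = g(z)(1-g(z))$ and writing $x_{ij}$ for the $j$-th feature of the $i$-th example (a notation introduced here only for illustration):

$$
\frac{\partial J(w)}{\partial w_j}
= -\frac{1}{n}\sum_{i=1}^n \left[\frac{y_i}{h(x_i)} - \frac{1-y_i}{1-h(x_i)}\right] h(x_i)\,(1-h(x_i))\,x_{ij}
= \frac{1}{n}\sum_{i=1}^n \left(h(x_i) - y_i\right) x_{ij}.
$$

So each step moves $w_j$ against the average of the prediction errors $h(x_i) - y_i$, weighted by the corresponding feature values.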

Let's start with the simplest example: points on a plane that are linearly separable up to a little label noise.

In [2]:
import numpy as np

np.random.seed(123)

# these parametrize the line
a = 0.3
b = -0.2
c = 0.001

# True/False mapping
def lin_rule(x, noise=0.):
    return a * x[0] + b * x[1] + c + noise < 0.

# Just for plotting
def get_y_fun(a, b, c):
    def y(x):
        return - x * a / b - c / b
    return y

lin_fun = get_y_fun(a, b, c)

In [3]:
# Training data

n = 500
range_points = 1
sigma = 0.05

X = range_points * 2 * (np.random.rand(n, 2) - 0.5)
y = [lin_rule(x, sigma * np.random.normal()) for x in X]

print(X[:10])
print(y[:10])

Let's plot the data.

In [4]:
import plotly.express as px

# plotly has a problem with coloring boolean values, hence stringify
# see https://community.plotly.com/t/plotly-express-scatter-color-not-showing/25962
fig = px.scatter(x=X[:, 0], y=X[:, 1], color=list(map(str, y)))
x_range = [np.min(X[:, 0]), np.max(X[:, 0])]
fig.add_scatter(x=x_range, y=list(map(lin_fun, x_range)), name='ground truth border')
fig.show()

Now, let's implement and train a logistic regression model. You should obtain an accuracy of at least 96%.



In [5]:
################################################################
# TODO: Implement logistic regression and compute its accuracy #
################################################################
features = 3

lr = 5 # step size
treshold = 1/2
n_epochs = 100 # number of passes over the training data

x = np.hstack([np.ones(n).reshape(n,1), X])
w = np.array([0.001] * features).reshape(-1,1)
y_ = np.array(y).reshape(n,1)

def sigm(z):
    return 1/(1+np.e**(-z))

def h(x, w):
    #print(sigm(x @ w)[:10])
    return sigm(x @ w)

def log_loss_eval(x, w_vec, y_vec):

    #here boolean values automatically cast
    arr = (y_vec * np.log(h(x, w_vec)) + (1 - y_vec) * np.log(1 - h(x, w_vec)))

    #print(arr[:10])
    return -1/n * np.sum(arr)

log_loss_eval(x, w, y_)

def predict(w, x=x):
    return h(x, w) > treshold

losses = [log_loss_eval(x, w, y_)]

for i in range(n_epochs):

    yhat = h(x, w)
    dJdwj = (np.sum((yhat - y_) * x, axis = 0))

    new_w = w - (1/n) * (lr * dJdwj).reshape(-1, 1)
    w = new_w

    loss = log_loss_eval(x, w, y_)
    losses.append(loss)

    # Report progress ten times over the course of training
    if ((i + 1) % (n_epochs // 10) == 0):
        print(f'Iter: {i:>3} Loss: {loss:8.8f}\n w: {w}')
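
The setup mentions that training can also end "when the changes are no longer meaningful", while the loop above always runs for a fixed number of epochs. The block below is a minimal sketch of such a stopping rule, not the reference solution: the function name `fit_logistic`, the tolerance `tol`, and the default values are assumptions made for illustration.

```python
import numpy as np

def fit_logistic(x, y, step_size=0.5, max_epochs=1000, tol=1e-6):
    """Batch gradient descent with a loss-based stopping rule (illustrative sketch).

    x: (n, d) feature matrix that already contains a column of ones.
    y: (n,) array of 0/1 labels.
    Returns the weights and the number of epochs actually run.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1)
    n, d = x.shape
    w = np.zeros(d)
    prev_loss = np.inf
    epochs_run = 0
    for _ in range(max_epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))  # h(x) for every example
        grad = x.T @ (p - y) / n            # gradient of the mean log loss
        w = w - step_size * grad
        epochs_run += 1

        # Mean log loss written in terms of z = x @ w:
        # -[y log g(z) + (1 - y) log(1 - g(z))] = log(1 + e^z) - y z
        z = x @ w
        loss = np.mean(np.logaddexp(0.0, z) - y * z)

        # Stop once the loss improved by less than tol
        # (this also triggers if the loss got worse).
        if prev_loss - loss < tol:
            break
        prev_loss = loss
    return w, epochs_run
```

With the arrays from the cell above, a call could look like `fit_logistic(x, np.array(y))`. A suitable `tol` depends on the step size and on the scale of the features, so treat the default as a placeholder.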

Let's visually assess our model. We can do this by using our estimates for $a,b,c$.
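
To connect the weights to a line: the model predicts $y=1$ when $h(x) > 1/2$, which happens exactly when $w^Tx > 0$, so the estimated border is the set of points where $w^Tx = 0$. Assuming the bias feature comes first, so that $w = (w_0, w_1, w_2)$ multiplies $(1, x_1, x_2)$ as in the training cell, solving for the second coordinate gives:

$$
w_0 + w_1 x_1 + w_2 x_2 = 0 \iff x_2 = -\frac{w_1}{w_2}\,x_1 - \frac{w_0}{w_2},
$$

which has the same form as the ground-truth line $x_2 = -\frac{a}{b}\,x_1 - \frac{c}{b}$. Rescaling $w$ by any nonzero constant leaves this border unchanged, so the slope $-w_1/w_2$ and the intercept $-w_0/w_2$ are the quantities worth comparing, rather than the raw weights.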

In [7]:
w = w.reshape(-1)
lin_fun2 = get_y_fun(w[1], w[2], w[0])

print(-a/b, -c/b)
print("Estimated to:")
print(-w[1]/w[2], -w[0]/w[2])

training_set_accuracy = np.sum((h(x, w) > 1/2) == np.array(y))/n
print(f'Accuracy on training data: {training_set_accuracy}')

fig = px.line(y=losses, labels={'y':'loss'})
fig.show()

In [8]:
#################################################################
# TODO: Pass your estimates for a,b,c to the get_y_fun function #
#################################################################

fig = px.scatter(x=X[:, 0], y=X[:, 1], color=list(map(str, y)))
x_range = [np.min(X[:, 0]), np.max(X[:, 0])]
fig.add_scatter(x=x_range, y=list(map(lin_fun, x_range)), name='ground truth border')
fig.add_scatter(x=x_range, y=list(map(lin_fun2, x_range)), name='estimated border')
fig.show()

Let's now complicate things a little and make our next problem nonlinear.

In [9]:
# Parameters of the ellipse
s1 = 1.
s2 = 2.
r = 0.75
m1 = 0.15
m2 = 0.125

# 0/1 mapping, checks whether we are inside the ellipse
def circle_rule(x, y, noise=0.):
    return 1 if s1 * (x - m1) ** 2 + s2 * (y - m2) ** 2 + noise < r ** 2 else 0

In [10]:
# Training data

n = 500
range_points = 1

sigma = 0.1

X = range_points * 2 * (np.random.rand(n, 2) - 0.5)

y = [circle_rule(x, y, sigma * np.random.normal()) for x, y in X]

print(X[:10])
print(y[:10])

Let's plot the data.

In [11]:
import plotly.graph_objects as go

fig = px.scatter(x=X[:, 0], y=X[:, 1], color=list(map(str, y)))

xgrid = np.arange(np.min(X[:, 0]), np.max(X[:, 0]), 0.003)
ygrid = np.arange(np.min(X[:, 1]), np.max(X[:, 1]), 0.003)
contour =  go.Contour(
        z=np.vectorize(circle_rule)(*np.meshgrid(xgrid, ygrid, indexing="ij")),
        x=xgrid,
        y=ygrid
    )
fig.add_trace(contour)
fig.show()

Now, let's train a logistic regression model to tackle this problem. Note that we now need a nonlinear decision boundary. You should obtain an accuracy of at least 90%.

Hint:
<sub><sup><sub><sup><sub><sup>
Use feature engineering.
</sup></sub></sup></sub></sup></sub>
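
To see why extra features can help here, one can expand the rule that generated the labels (ignoring the noise term, and writing a point as $(x, y)$ as in `circle_rule`):

$$
s_1 (x - m_1)^2 + s_2 (y - m_2)^2 < r^2
\iff
s_1 x^2 + s_2 y^2 - 2 s_1 m_1\,x - 2 s_2 m_2\,y + \left(s_1 m_1^2 + s_2 m_2^2 - r^2\right) < 0.
$$

The left-hand side is linear in $(1, x, y, x^2, y^2)$. A logistic regression trained on these five features can therefore represent the elliptical border, even though that border is nonlinear in the original coordinates.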

In [12]:
features = 5

lr = 0.5 # step size
treshold = 1/2
n_epochs = 7000 # number of passes over the training data

def ed(np_arr):
    return np.expand_dims(np_arr, axis = 1)

new_features = [np.ones(n), X[:,0]**2, X[:,1]**2]
new_features = list(map(ed, new_features))
new_features.append(X)

x = np.hstack(new_features)
w = np.array([0] * features).reshape(-1,1)
y_ = np.array(y).reshape(n,1)

In [13]:
### Setup animation
res = .02
frames = 100

xgrid = np.arange(np.min(X[:, 0]), np.max(X[:, 0]), res)
ygrid = np.arange(np.min(X[:, 1]), np.max(X[:, 1]), res)

xx, yy = np.meshgrid(xgrid, ygrid, indexing="ij")
X_plot = np.c_[xx.ravel(), yy.ravel()]

_X = np.concatenate([np.ones(len(xgrid)*len(ygrid)).reshape(-1,1), X_plot**2, X_plot], axis=1)

all_preds = []

In [14]:
################################################################
# TODO: Implement logistic regression and compute its accuracy #
################################################################
def sigm(z):
    return 1/(1+np.e**(-z))

def h(x, w):
    #print(sigm(x @ w)[:10])
    return sigm(x @ w)

def log_loss_eval(x, w_vec, y_vec):

    #here boolean values automatically cast
    arr = (y_vec * np.log(h(x, w_vec)) + (1 - y_vec) * np.log(1 - h(x, w_vec)))

    #print(arr[:10])
    return -1/n * np.sum(arr)

log_loss_eval(x, w, y_)

def predict(w, x=x):
    return h(x, w) > treshold

losses = [log_loss_eval(x, w, y_)]

for i in range(n_epochs):
    if((i+1) % (n_epochs / frames) == 0):
        preds = (h(_X, w) > treshold).astype(int).reshape(len(xgrid), len(ygrid))
        all_preds.append(preds)

    yhat = h(x, w)
    dJdwj = (np.sum((yhat - y_) * x, axis = 0))

    new_w = w - (1/n) * (lr * dJdwj).reshape(-1, 1)
    w = new_w

    loss = log_loss_eval(x, w, y_)
    losses.append(loss)

    if ((i+1) % (n_epochs/10) == 0):
        print(f'Iter: {i:>3} Loss: {loss:8.8f}\n w: {w}')
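
One caveat about `log_loss_eval` above: it takes `np.log` of `h` and `1 - h` directly, and in floating point the sigmoid can round to exactly 0 or 1 for large $|w^Tx|$, which yields `-inf` or `nan`. A possible alternative, offered as a sketch under the assumption that the loss is only needed for monitoring, computes the same quantity from $z = w^Tx$ with `np.logaddexp`. The name `stable_log_loss` is hypothetical.

```python
import numpy as np

def stable_log_loss(x, w, y):
    """Mean log loss computed from z = x @ w without ever evaluating log(0).

    Uses the identity
        -[y log g(z) + (1 - y) log(1 - g(z))] = log(1 + e^z) - y z,
    where log(1 + e^z) is evaluated stably as logaddexp(0, z).
    """
    z = (np.asarray(x, dtype=float) @ np.asarray(w, dtype=float)).reshape(-1)
    y = np.asarray(y, dtype=float).reshape(-1)
    return np.mean(np.logaddexp(0.0, z) - y * z)
```

For moderate weights this should agree with `log_loss_eval(x, w, y_)` up to rounding, and it stays finite when the sigmoid saturates.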

In [16]:
w = w.reshape(-1)

training_set_accuracy = np.sum((h(x, w) > 1/2) == np.array(y))/n
print(f'Accuracy on training data: {training_set_accuracy}')

fig = px.line(y=losses, labels={'y':'loss'})
fig.show()

Let's visually assess our model.

Contrary to the previous scenario, converting our weights to parameters of the ground truth curve may not be straightforward. It's easier to just provide predictions for a set of points in $R^2$.
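
That said, for the specific feature order used above, $(1, x^2, y^2, x, y)$, the parameters of the border can be recovered by completing the square. The helper below is a hypothetical sketch: the name `ellipse_from_weights` and the normalization $s_1 = 1$ are assumptions, and it only applies when the weights on the two squared features share a sign, i.e. when the learned border really is an ellipse.

```python
import numpy as np

def ellipse_from_weights(w):
    """Convert weights for features (1, x^2, y^2, x, y) into (s1, s2, m1, m2, r).

    The border w0 + w_xx x^2 + w_yy y^2 + w_x x + w_y y = 0 is divided by w_xx
    and matched term by term with s1 (x - m1)^2 + s2 (y - m2)^2 - r^2 = 0, s1 = 1.
    """
    w0, w_xx, w_yy, w_x, w_y = np.asarray(w, dtype=float).reshape(-1)
    if w_xx * w_yy <= 0:
        raise ValueError("squared-feature weights must share a sign for an ellipse")
    s1 = 1.0
    s2 = w_yy / w_xx
    m1 = -w_x / (2.0 * w_xx)
    m2 = -w_y / (2.0 * w_yy)
    r_squared = s1 * m1 ** 2 + s2 * m2 ** 2 - w0 / w_xx
    if r_squared <= 0:
        raise ValueError("weights do not describe a real ellipse")
    return s1, s2, m1, m2, float(np.sqrt(r_squared))
```

Calling `ellipse_from_weights(w)` on the trained weights gives values that can be compared with `s1`, `s2`, `m1`, `m2` and `r` from the data-generating cell. Because of the label noise, only an approximate match should be expected.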

In [18]:
fig = go.Figure()

# Add initial contour plot for the first frame
contour = go.Contour(
    z=all_preds[0],
    x=xgrid,
    y=ygrid,
    opacity=0.6,
    colorscale="Blues",
    name="contour",
)

scatter = go.Scatter(
    x=X[:, 0],
    y=X[:, 1],
    mode='markers',
    marker=dict(color=['green' if l == 0 else 'red' for l in y], size=10),
    name="points",
)

fig.add_trace(contour)
fig.add_trace(scatter)

# Create frames for the animation
# Use a separate name so the integer `frames` from the animation setup is not overwritten
animation_frames = []
for i, preds in enumerate(all_preds):
    contour = go.Contour(
        z=preds,
        x=xgrid,
        y=ygrid,
        opacity=0.6,
        colorscale="Blues",
        name="contour",
    )

    frame = go.Frame(data=[contour])
    animation_frames.append(frame)

# Add the frames to the figure
fig.frames = animation_frames

# Define the animation settings
animation_settings = dict(
    frame=dict(duration=100, redraw=True),
    fromcurrent=True,
    mode='immediate'
)

# Add play/pause buttons for the animation
fig.update_layout(
    updatemenus=[{
        'type': 'buttons',
        'showactive': False,
        'buttons': [
            {
                'label': 'Play',
                'method': 'animate',
                'args': [None, animation_settings]
            },
            {
                'label': 'Pause',
                'method': 'animate',
                'args': [[None], dict(frame=dict(duration=0, redraw=False), mode='immediate')]
            }
        ]
    }]
)

# Show the animated contour plot
fig.show()