# Exercises

1. With a very large number of features, use SGD or mini-batch GD. If the training set fits in memory, batch GD also works. The normal equation and SVD are poor choices here because their computational cost grows quickly with the number of features.
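2. Gradient Descent converges slowly when features have very different scales, because the cost function takes the shape of an elongated bowl. The normal equation and SVD work fine without scaling. To fix this, scale the features before training, for example with `StandardScaler()`.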
3. No. The cost function for Logistic Regression is convex, so Gradient Descent cannot get stuck in a local minimum and will eventually approach the global minimum.
4. If the cost function is convex and the learning rate is not too high, all three approach the global minimum and produce very similar models. Batch GD converges to it, while mini-batch GD and SGD keep bouncing around it unless you gradually reduce the learning rate with a learning schedule.
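5. If the validation error keeps rising while the training error stays stable or keeps falling, the model is overfitting the training set: stop training or use a regularization technique. If the training error is rising as well, the learning rate is probably too high and the algorithm is diverging, so reduce it.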
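6. No. Mini-batch GD is noisy, so the validation error can rise for a few steps and then fall again; stopping at the first increase may halt training well before the optimum. A safer approach is to save the model at regular intervals and roll back to the best saved model if the validation error has not improved for a long time.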
7. SGD reaches the vicinity of the optimum fastest because it updates the parameters after every training instance. Only batch GD actually converges, given enough training time; the other two keep bouncing around the global minimum unless you gradually reduce the learning rate with a learning schedule.
8. A large gap between the training and validation learning curves, with the training error much lower, means the model has overfit the training data. To fix this you can add more training instances, regularize the model, or reduce the polynomial degree in the transformer.
9. The model has high bias (it is underfitting). To reduce it, decrease alpha, the hyperparameter that controls how strongly the model is regularized.
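10. a) Ridge instead of plain Linear Regression: some regularization reduces the chance of overfitting. b) Lasso instead of Ridge: when you want the model to use fewer features, since the ℓ1 penalty tends to push the weights of unimportant features to exactly zero. c) Elastic Net instead of Lasso: Lasso can behave erratically when there are more features than training instances or when several features are highly correlated.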
11. Two Logistic Regression classifiers, since the problem consists of two independent binary classifications (a multilabel, or multioutput, problem). A Softmax Regression classifier is multiclass, not multilabel: it selects exactly 1 of k mutually exclusive classes.
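
The distinction in answer 11 can be sketched numerically. The snippet below is only an illustration: the scores are made-up numbers, and `sigmoid` and `softmax` are small helpers defined here rather than library calls. It shows that two sigmoid outputs are independent probabilities, while softmax outputs compete and always sum to 1.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps each score to an independent probability
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    # Normalized exponentials: each row sums to 1
    exps = np.exp(logits)
    return exps / exps.sum(axis=1, keepdims=True)

# Hypothetical scores for a single instance (illustrative values only)
two_label_scores = np.array([[2.0, -1.0]])             # one score per binary label
four_class_scores = np.array([[2.0, -1.0, 0.5, 0.0]])  # one score per exclusive class

label_probas = sigmoid(two_label_scores)   # both labels can be likely at once
class_probas = softmax(four_class_scores)  # exactly one class is selected

print(label_probas, label_probas.sum())
print(class_probas, class_probas.sum())
```

Because the sigmoid outputs are not tied together, both labels can be predicted as positive for the same instance, which is what a two-label problem needs.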

In [79]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [6]:
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

iris.data

index,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [86]:
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = iris["target"].values

In [96]:
X_with_bias = np.c_[np.ones(len(X)), X]

total_size = len(X_with_bias)
valid_size, test_size = int(total_size * 0.2), int(total_size * 0.2)
train_size = total_size - valid_size - test_size

np.random.seed(42)
indices = np.random.permutation(total_size)

# Slice the shuffled indices once, then reuse them for inputs and targets
train_idx = indices[:train_size]
valid_idx = indices[train_size : train_size + valid_size]
test_idx = indices[train_size + valid_size :]

X_train, y_train = X_with_bias[train_idx], y[train_idx]
X_valid, y_valid = X_with_bias[valid_idx], y[valid_idx]
X_test, y_test = X_with_bias[test_idx], y[test_idx]

In [97]:
assert (train_size + valid_size + test_size) == total_size

In [101]:
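# One-hot encode the targets
# (pd.get_dummies() is an alternative)

def to_one_hot(a):
    """Return a (len(a), a.max() + 1) one-hot matrix for the integer labels in `a`."""
    # The width is inferred from a.max(), so every split must contain the
    # highest class label for the encoded shapes to match.
    b = np.zeros((a.size, a.max() + 1), dtype=np.int64)
    b[np.arange(a.size), a] = 1
    return b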



Y_train_one_hot = to_one_hot(y_train)
Y_valid_one_hot = to_one_hot(y_valid)
Y_test_one_hot = to_one_hot(y_test)
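
Since `to_one_hot` infers the number of columns from the largest label it sees, a split that happens to miss the highest class would produce a narrower matrix. One possible alternative, sketched below, takes the number of classes as an explicit argument. The name `to_one_hot_fixed` and the toy labels are assumptions made for this illustration, not part of the notebook.

```python
import numpy as np

def to_one_hot_fixed(labels, n_classes):
    """One-hot encode integer labels into exactly n_classes columns."""
    labels = np.asarray(labels)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(n_classes, dtype=np.int64)[labels]

# Toy split in which class 2 never appears (illustrative values only)
y_small = np.array([0, 1, 1, 0])
Y_small = to_one_hot_fixed(y_small, n_classes=3)
print(Y_small.shape)
```

The encoded matrix keeps three columns even though only two classes are present, so it stays compatible with a `Theta` trained on all three.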

In [102]:
# Standardize input
mean = X_train[:, 1:].mean(axis=0)
std = X_train[:, 1:].std(axis=0)

X_train[:, 1:] = (X_train[:, 1:] - mean) / std
X_valid[:, 1:] = (X_valid[:, 1:] - mean) / std
X_test[:, 1:] = (X_test[:, 1:] - mean) / std

In [107]:
def softmax(logits):
    exps = np.exp(logits)
    exp_sums = exps.sum(axis=1, keepdims=True)
    
    return exps / exp_sums
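
The `softmax` above works for the small, standardized logits in this notebook, but `np.exp` overflows for large logits. A common variant, sketched here under the hypothetical name `softmax_stable`, subtracts each row's maximum before exponentiating. The result is mathematically identical because the shift cancels in the ratio.

```python
import numpy as np

def softmax_stable(logits):
    # Shifting by the row maximum leaves the result unchanged
    # but keeps every exponent <= 0, so np.exp cannot overflow
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

# Logits this large would overflow a naive np.exp(logits)
large_logits = np.array([[1000.0, 1001.0, 1002.0]])
probas = softmax_stable(large_logits)
print(probas, probas.sum())
```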

In [111]:
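# Batch gradient descent for softmax regression
# BGD: controlled by the number of epochs and the learning rate
# The validation loss is printed every 1000 epochs; note that this loop only
# monitors it and does not stop early when the validation loss starts rising
m, n = X_train.shape  # number of instances, number of inputs (bias included)
k = len(np.unique(y_train))  # number of classes
learn_rate, n_epochs = 0.5, 5001
epsilon = 1e-5  # avoids log(0) in the cross-entropy loss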

np.random.seed(42)
Theta = np.random.randn(n, k)

for epoch in range(n_epochs):
    Y_proba = softmax(X_train @ Theta)
    if epoch % 1000 == 0:
        Y_proba_valid = softmax(X_valid @ Theta)
        xentropy_losses = -(Y_valid_one_hot * np.log(Y_proba_valid + epsilon))
        print(epoch, xentropy_losses.sum(axis=1).mean())
        
    error = Y_proba - Y_train_one_hot
    gradients = 1 / m * X_train.T @ error
    Theta = Theta - learn_rate * gradients

0 3.7085808486476917
1000 0.14519367480830644
2000 0.1301309575504088
3000 0.12009639326384539
4000 0.11372961364786884
5000 0.11002459532472425
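
The loop above prints the validation loss but always runs for the full number of epochs. A minimal sketch of how early stopping could be added is shown below. The function name `fit_with_early_stopping`, the `patience` parameter, and the synthetic three-cluster data are all assumptions made for this illustration; it is not the notebook's own training code.

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits)
    return exps / exps.sum(axis=1, keepdims=True)

def cross_entropy(Y_one_hot, Y_proba, epsilon=1e-5):
    return -(Y_one_hot * np.log(Y_proba + epsilon)).sum(axis=1).mean()

def fit_with_early_stopping(X_tr, Y_tr, X_va, Y_va, learn_rate=0.5,
                            n_epochs=5001, patience=50, seed=42):
    """Batch GD for softmax regression; returns the Theta with the lowest validation loss."""
    rng = np.random.default_rng(seed)
    Theta = rng.standard_normal((X_tr.shape[1], Y_tr.shape[1]))
    best_loss, best_Theta, best_epoch = np.inf, Theta.copy(), 0
    for epoch in range(n_epochs):
        valid_loss = cross_entropy(Y_va, softmax(X_va @ Theta))
        if valid_loss < best_loss:
            # New best model: remember its parameters and epoch
            best_loss, best_Theta, best_epoch = valid_loss, Theta.copy(), epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop early
        gradients = X_tr.T @ (softmax(X_tr @ Theta) - Y_tr) / len(X_tr)
        Theta = Theta - learn_rate * gradients
    return best_Theta, best_loss, best_epoch

# Synthetic data (made up): three 2-D clusters plus a bias column
rng = np.random.default_rng(0)
centers = np.array([[-2.0, 0.0], [0.0, 2.0], [2.0, 0.0]])
labels = rng.integers(0, 3, size=120)
points = centers[labels] + 0.5 * rng.standard_normal((120, 2))
X_toy = np.c_[np.ones(120), points]
Y_toy = np.eye(3)[labels]

best_Theta, best_loss, best_epoch = fit_with_early_stopping(
    X_toy[:90], Y_toy[:90], X_toy[90:], Y_toy[90:], n_epochs=501)
print(best_epoch, best_loss)
```

With the notebook's variables, the call would presumably take `X_train`, `Y_train_one_hot`, `X_valid`, and `Y_valid_one_hot`. If the validation loss never rises, as in the printed output above, the function simply runs to the last epoch.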


In [112]:
logits = X_valid @ Theta
Y_proba = softmax(logits)
y_predict = Y_proba.argmax(axis=1)

accuracy_score = (y_predict == y_valid).mean()
accuracy_score

np.float64(0.9333333333333333)
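
The score above is measured on the validation set. The held-out test split (`X_test`, `y_test`) can be scored the same way once the model is final. The helpers below are a sketch with hypothetical names (`predict_classes`, `accuracy`), checked on made-up toy parameters rather than on the notebook's trained `Theta`.

```python
import numpy as np

def predict_classes(X, Theta):
    """Return the index of the highest-scoring class for each row of X."""
    # The argmax of the logits equals the argmax of the softmax probabilities
    return (X @ Theta).argmax(axis=1)

def accuracy(X, y, Theta):
    """Fraction of rows in X whose predicted class matches y."""
    return (predict_classes(X, Theta) == y).mean()

# Toy check with made-up parameters: a bias row followed by one feature row
Theta_toy = np.array([[0.0, 0.0], [1.0, -1.0]])
X_toy = np.array([[1.0, 2.0], [1.0, -3.0], [1.0, 0.5]])
y_toy = np.array([0, 1, 1])
toy_accuracy = accuracy(X_toy, y_toy, Theta_toy)
print(toy_accuracy)
```

With the notebook's variables, the call would be `accuracy(X_test, y_test, Theta)`.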