2. Implement Logistic Regression (8 points): Implement a regularized Logistic Regression from scratch. Pick any dataset of your choice for this (you can use the UCI Machine Learning repository: https://archive.ics.uci.edu/ml/index.php or you can use a dataset we used in the past). Please verify the following:

• First, start with the regular feature set with regularization = 0. Determine whether you are underfitting or overfitting.

• If you are not overfitting, create polynomial features the way we did in the Linear Regression demo.

• Again, train a logistic regression with zero regularization and vary the polynomial degree until you overfit the training data.

• Once you overfit on the train data, start adding regularization. What happens with regularization? Can you reduce the overfitting with regularization?
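
The polynomial-feature step in the list above can be sketched as follows. This is only an illustration, assuming scikit-learn's `PolynomialFeatures` as a stand-in for the expansion used in the Linear Regression demo (which is not shown here); the helper name `expand_features` is hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import PolynomialFeatures


def expand_features(X, degree):
    """Return X expanded with all polynomial terms up to `degree`.

    Hypothetical helper: include_bias=False because the model is assumed
    to keep its own separate bias term.
    """
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    return poly.fit_transform(X)


X_raw = load_breast_cancer().data  # 569 samples, 30 features

# The number of columns grows quickly with the degree, which is what
# makes it possible to overfit the training data.
shapes = {degree: expand_features(X_raw, degree).shape for degree in (1, 2, 3)}
for degree, shape in shapes.items():
    print(f"degree={degree}: {shape[1]} features")
```

The expanded features would normally be standardized again before training, since products of features can differ in scale by several orders of magnitude.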

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

In [None]:
# my logistic regression implementation
class LogisticRegression:
  def __init__(self, lr, iterations, C=1, penalty=None):
    """ lr: learning rate
        iterations: number of epochs
        C: regularization strength (larger values penalize more)
        penalty: type of regularization ("l1", "l2", or None)"""
    self.penalties = {None: self.none_gradient, "l1" : self.l1_gradient, "l2" : self.l2_gradient}
    self.lr = lr
    self.iterations = iterations
    self.C = C
    self.weights = None
    self.bias = None
    self.penalty = self.penalties[penalty]

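  # fit function: iteratively updates the weights and bias
  def fit(self, X, y):
    self.weights = np.random.rand(X.shape[1])
    self.bias = np.array([0.0])
    # column sums of X, computed once and reused in every update
    Xsum = X.sum(axis=0)
    for i in range(self.iterations):
      # summed error of the thresholded 0/1 predictions, scaled by the learning rate
      gradient = (self.predict(X) - y).sum() * self.lr

      # update step; the penalty term is applied to both the weights and the bias
      # and is not scaled by the learning rate
      self.weights -= gradient * Xsum + self.penalty(self.weights)
      self.bias -= gradient + self.penalty(self.bias)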


  # predict function
  def predict(self, X):
    return np.where(self.sigmoid(np.dot(X, self.weights) + self.bias) >= 0.5, 1, 0)

  # gradient of L1 regularization (Lasso)
  def l1_gradient(self, weights):
    return self.C * np.sign(weights)

  # gradient of L2 regularization (Ridge)
  def l2_gradient(self, weights):
    return self.C * weights

  # in case penalty is None: do nothing
  def none_gradient(self, weights):
    return np.zeros(weights.shape)

  # sigmoid function
  def sigmoid(self, z):
    return 1 / (1 + np.exp(-z))

# print train and test accuracies
def print_results(y_train, y_test, y_pred_train, y_pred_test):
  print("Train Accuracy: ", accuracy_score(y_train, y_pred_train))
  print("Test Accuracy: ", accuracy_score(y_test, y_pred_test))
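
For comparison with the update in `fit` above, the textbook gradient of the mean log-loss uses the predicted probabilities rather than thresholded 0/1 labels, and weights each sample's error by that sample's own features. The following is a minimal sketch under those assumptions, not the notebook's implementation; `fit_logistic_gd`, its `l2` parameter, and the toy data are hypothetical names introduced for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_gd(X, y, lr=0.1, iterations=1000, l2=0.0, seed=0):
    """Batch gradient descent on the mean log-loss with an optional L2 penalty."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    weights = rng.normal(scale=0.01, size=n_features)
    bias = 0.0
    for _ in range(iterations):
        p = sigmoid(X @ weights + bias)            # predicted probabilities
        error = p - y                              # per-sample residuals
        grad_w = X.T @ error / n_samples + l2 * weights
        grad_b = error.mean()                      # bias left unregularized (assumption)
        weights -= lr * grad_w
        bias -= lr * grad_b
    return weights, bias


# Toy, linearly separable data (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

w_plain, b_plain = fit_logistic_gd(X_toy, y_toy)
w_ridge, b_ridge = fit_logistic_gd(X_toy, y_toy, l2=1.0)

toy_accuracy = np.mean((sigmoid(X_toy @ w_plain + b_plain) >= 0.5) == y_toy)
print("toy accuracy:", toy_accuracy)
print("weight norm without / with L2:",
      np.linalg.norm(w_plain), np.linalg.norm(w_ridge))
```

In this formulation the penalty gradient is scaled by the learning rate along with the data gradient, so the effective strength of the regularization does not change when the learning rate changes.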





In [None]:
# load data from sklearn
data = load_breast_cancer()
x = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Standardizing the features (zero mean, unit variance) before giving them to logistic regression
scaler = StandardScaler()
x = scaler.fit_transform(x)

# splitting into 2 datasets: 80% goes to train dataset and 20% goes to test dataset
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# training logistic regression without any regularization
lr = LogisticRegression(lr=0.00001, iterations=3000)

lr.fit(X_train, y_train)

y_pred_train = lr.predict(X_train)
y_pred_test = lr.predict(X_test)

print_results(y_train, y_test, y_pred_train, y_pred_test)

Train Accuracy:  0.7868131868131868
Test Accuracy:  0.6929824561403509


## Model is overfitting on the training set!
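
One caveat when reading these numbers: the cell above fits `StandardScaler` on the full dataset before splitting, so the test rows contribute to the scaling statistics. A minimal sketch of the alternative order (split first, then fit the scaler on the training rows only), assuming the same split parameters; the `_scaled` variable names are hypothetical, and the accuracies reported above would not necessarily stay the same under this change.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Split the raw features first, with the same 80/20 split as above
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

print(X_train_scaled.shape, X_test_scaled.shape)
```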

In [None]:
# Logistic Regression with L1 regularization (same learning rate but more iterations compared to no regularization model)
lr = LogisticRegression(lr=0.00001, iterations=5000, C=1, penalty="l1")

lr.fit(X_train, y_train)

y_pred_train = lr.predict(X_train)
y_pred_test = lr.predict(X_test)

print("L1 regularization: ", end="")
print_results(y_train, y_test, y_pred_train, y_pred_test)

L1 regularization: Train Accuracy:  0.7736263736263737
Test Accuracy:  0.8333333333333334


In [None]:
# Logistic Regression with L2 regularization (same learning rate but more iterations compared to no regularization model)
lr = LogisticRegression(lr=0.00001, iterations=5000, C=0.0001, penalty="l2")

lr.fit(X_train, y_train)

y_pred_train = lr.predict(X_train)
y_pred_test = lr.predict(X_test)

print("L2 regularization: ", end="")
print_results(y_train, y_test, y_pred_train, y_pred_test)

L2 regularization: Train Accuracy:  0.7956043956043956
Test Accuracy:  0.7017543859649122


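After trying many different values for C, I noticed that L1 regularization does a better job for my Logistic Regression implementation and that models with regularization take longer to train (I used 5000 iterations instead of 3000). L1 removed the overfitting: the test accuracy (0.833) is now higher than the train accuracy (0.774). L2 still overfits (train 0.796 versus test 0.702), although both accuracies are slightly better than without regularization.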
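
The search over values of C described above can be made explicit with a small sweep. The sketch below is only an illustration: it swaps in scikit-learn's `LogisticRegression` (imported under the alias `SkLogisticRegression` to avoid clashing with the class defined in this notebook) rather than the from-scratch implementation. Note that scikit-learn's `C` is the inverse of the regularization strength, the opposite convention from the `C` used in this notebook. The candidate values are arbitrary assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression as SkLogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# Fit the scaler on the training rows only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

results = {}
for C in (0.001, 0.01, 0.1, 1.0, 10.0):
    # In scikit-learn, smaller C means stronger regularization (default penalty: L2)
    model = SkLogisticRegression(C=C, max_iter=5000)
    model.fit(X_train_s, y_train)
    results[C] = (
        accuracy_score(y_train, model.predict(X_train_s)),
        accuracy_score(y_test, model.predict(X_test_s)),
    )

# A shrinking gap between train and test accuracy suggests less overfitting
for C, (train_acc, test_acc) in results.items():
    print(f"C={C:<6} train={train_acc:.3f} test={test_acc:.3f}")
```

Comparing train and test accuracy side by side for each value makes it easier to see where regularization starts to hurt the fit instead of helping generalization.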