### In this exercise we will attempt to build a Logistic Regression Estimator from the ground up.

Initially we use a Base class as a foundation and we build functionality on each step of the inheritance process. 

In [1]:
import numpy as np

## Exercise 1

In the BaseClassifier class below complete the classmethod from_list. The method is going to help create instances of the class from a python list. Specifically we need the method to:
* have one position argument named "param_list" of type list.
* assign each element of the "param_list" to an instance initialization argument in sequence.
* throw an exception if there are missing arguments or more arguments than expected with an error message.

In [2]:
class BaseClassifier:
    def __init__(self, theta=0.1, alpha=0.1, max_it=10, pred_threshold=0.5):
        self.theta = theta
        self.alpha = alpha
        self.max_it = max_it
        self.pred_threshold = pred_threshold
        self.name = "Binary Classifier"
    
    # Here you need to implement your code.
    @classmethod
    def from_list(self, param_list):
        a = BaseClassifier()
        if (len(param_list) == 4):
            a.theta, a.alpha, a.max_it, a.pred_threshold = param_list
            return a
        else:
            raise Exception('Length is incorrect!')
        
    def __repr__(self):
        return "Hi I am a " + self.name
    
    def __call__(self, *args, **kwargs):
        return self.predict(*args, **kwargs)
    
    def predict(self, *args, **kwargs):
        assert not hasattr(super(), 'predict')
    
    def train(self, *args, **kwargs):
        assert not hasattr(super(), 'train')
        


In [3]:
# This should print 1 and the appropriate error message

base_clf = BaseClassifier.from_list([1,1,1,1])
print(base_clf.theta)
base_clf = BaseClassifier.from_list([1,1,1])

1


Exception: Length is incorrect!

We need data to train our model on. We define our class 1 to be Normally distributed around +2 and our class 0 to be normally distributed around -2.

In [4]:
x1 = np.random.randn(100,2) + 2
x2 = np.random.randn(100,2) - 2
X = np.concatenate([x1,x2], axis=0)
y  = np.concatenate([np.ones(100), np.zeros(100)], axis=0)

## Exercise 2

We define a class LogisticRegression that inherits from the previously defined BaseClassifier which is a binary classifier based on the sigmoid neuron. In this exercise we are looking to implement the __init__ funtion. Specifically we want:
* to be able to have all the functionality of the BaseClassifier
* change the name of the class to "Logistic Regression Classifier"

## Exercise 3

During the training step of the gradient descent algorithm we need to calculate the value of the cost function. Since this is a binary classifier we can leverage the binary cross entropy as our cost function. The goal of this exercise is the development of the method "cost_fcn". Specifically we want to:
* take 2 arguments that have been declared below x and y. x is type np.array(train_size, 2) and y is type np.array(train_size,).
* calculate the binary cross entropy of the sigmoid output and true label "J" as scalar.
* return the scalar "J"

In [5]:
class LogisticRegression(BaseClassifier):
    # Here you are going to implement the answer for Exercise 2
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.name = 'LogisticRegressionClassifier'

    def sigmoid(self, z):
        return 1.0 / (1 + np.exp(-z))
    
    # Here you are going to implement the answer for Exercise 3
    def cost_fcn(self, x, y):
        z = np.dot(x, self.theta)
        y_pred = self.sigmoid(z)
        cost = - np.dot(y, np.log(y_pred)) - np.dot((1-y), np.log(1-y_pred))
        return cost

    def gradients(self, x, y):
        h = self.sigmoid(np.dot(x, self.theta))
        return (1.0 /self.train_size) * np.dot(x.T, (h-y))
    
    def predict(self, X):
        pred = np.dot(X, self.theta)
        pred[pred >= 0.5] = 1
        pred[pred < 0.5] = 0
        return pred
        
    def train(self, X, y):
        cost = []
        self.theta = np.random.rand(X.shape[1])
        self.train_size = X.shape[0]
        for it in range(self.max_it): 
            cost.append(self.cost_fcn(X,y))
            grads = self.gradients(X, y)
            self.theta = self.theta - self.alpha * grads
        print("Cost Function per iteration:")
        print(cost)


In [6]:
clf1 = LogisticRegression(alpha=0.01)
print(clf1)
print("Model alpha: ", clf1.alpha)

Hi I am a LogisticRegressionClassifier
Model alpha:  0.01


In [7]:
clf1.train(X,y)
print("Training accuracy: ", sum(clf1.predict(X))/(X.shape[0]/2))

Cost Function per iteration:
[26.279754786609033, 26.144236576520335, 26.01037713090243, 25.878145976349096, 25.74751337713285, 25.61845031321458, 25.490928459027366, 25.36492016300317, 25.240398427812515, 25.117336891288762]
Training accuracy:  0.99


It is very common to have data points close to the decision boundary to have observation errors or being misslabeled. Below we are introducing to our training dataset misslabeled datapoints. We add 10 additional datapoints and end up with 210 training datapoints in general and both are again type of np.array.

In [8]:
x1_misslabel = np.random.randn(5,2) + 1
x2_misslabel = np.random.randn(5,2) - 1
X_misslabel = np.concatenate([x1_misslabel,x2_misslabel], axis=0)    
y_misslabel  = np.concatenate([np.zeros(5), np.ones(5)], axis=0)
X_prime = np.concatenate([X, X_misslabel], axis=0)
y_prime = np.concatenate([y, y_misslabel], axis=0)

## Exercise 4

In order to be able to handle misslabeled training data points and avoid overfitting we wish to leverage regularization. Below we define a class that extends the LogisticRegression class and updates its functions to incorporate regularization. 

### Part 1
We want to update the current "cost_fcn" with the appropriate regularization. 
* We are expanding the inherited function and we are keeping the same number of arguments and types, x: np.array(train_size, 2) and y: np.array(train_size,)
* Return a scalar that incorporates the regularization value
(tip: you can use the np.linalg library)

### Part 2
We want to update the current "gradients" method to reflect the change in the "cost_fcn". Specifically:
* We are expanding the inherited function and we are keeping the same number of arguments and types, x: np.array(train_size, 2) and y: np.array(train_size,)
* Return an np.array(2,) that contains the gradients for "theta: based on the updated "cost_fnc"

In [9]:
class LogisticRegressionRegularized(LogisticRegression):
    # Part 0: set a good initial value for regularization_coef
    def __init__(self, regularization_coef=0.1, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.regularization_coef = regularization_coef
        
    def cost_fcn(self, x, y):
        cost = super().cost_fcn(x, y)
        cost += self.regularization_coef/(2* x.shape[0])*(np.dot(self.theta,self.theta))
        return cost
    
    def gradients(self, x, y):
        h = self.sigmoid(np.dot(x, self.theta))
        return (1.0 /self.train_size) * np.dot(x.T, (h-y)) + (1.0 /self.train_size)*self.regularization_coef*self.theta

In [10]:
# Bonus 0: do you think other arguments like max_it play role in avoiding the current problem?
clf2 = LogisticRegressionRegularized(max_it=10)

In [11]:
clf2.train(X_prime, y_prime)
print("Training accuracy: ", sum(clf2.predict(X_prime))/(X_prime.shape[0]/2))

Cost Function per iteration:
[138.53778998659521, 108.99922683468404, 90.16869337332128, 77.58911189136103, 68.75737362913752, 62.28267198350173, 57.363560598055315, 53.515906260043444, 50.43360185832649, 47.915056615062305]
Training accuracy:  0.9619047619047619
