### In this exercise I will attempt to build a Logistic Regression Estimator from the ground up.

Initially, I use a Base class as a foundation and I build functionality on each step of the inheritance process. 

In [1]:
import numpy as np

In [2]:
class BaseClassifier:
    def __init__(self, theta=0.1, alpha=0.1, max_it=10, pred_threshold=0.5):
        self.theta = theta
        self.alpha = alpha
        self.max_it = max_it
        self.pred_threshold = pred_threshold
        self.name = "Binary Classifier"
    
    @classmethod
    def from_list(self, param_list):
        a = BaseClassifier()
        if (len(param_list) == 4):
            a.theta, a.alpha, a.max_it, a.pred_threshold = param_list
            return a
        else:
            raise Exception('Length is incorrect!')
        
    def __repr__(self):
        return "Hi I am a " + self.name
    
    def __call__(self, *args, **kwargs):
        return self.predict(*args, **kwargs)
    
    def predict(self, *args, **kwargs):
        assert not hasattr(super(), 'predict')
    
    def train(self, *args, **kwargs):
        assert not hasattr(super(), 'train')
        


In [3]:
# This should print 1 and the appropriate error message

base_clf = BaseClassifier.from_list([1,1,1,1])
print(base_clf.theta)
base_clf = BaseClassifier.from_list([1,1,1])

1


Exception: ignored

I need data to train the model on. I define class 1 to be Normally distributed around +2 and  class 0 to be normally distributed around -2.

In [4]:
x1 = np.random.randn(100,2) + 2
x2 = np.random.randn(100,2) - 2
X = np.concatenate([x1,x2], axis=0)
y  = np.concatenate([np.ones(100), np.zeros(100)], axis=0)

I define a class LogisticRegression that inherits from the previously defined BaseClassifier which is a binary classifier based on the sigmoid neuron. In this section, I want to implement the __init__ funtion:
* to be able to have all the functionality of the BaseClassifier
* change the name of the class to "Logistic Regression Classifier"

In [14]:
class LogisticRegression(BaseClassifier):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.name = 'LogisticRegressionClassifier'

    def sigmoid(self, z):
        return 1.0 / (1 + np.exp(-z))
    
    def cost_fcn(self, x, y):
        z = np.dot(x, self.theta)
        y_pred = self.sigmoid(z)
        cost = - np.dot(y, np.log(y_pred)) - np.dot((1-y), np.log(1-y_pred))
        return cost

    def gradients(self, x, y):
        z = np.dot(x, self.theta)
        h = self.sigmoid(z)
        return (1.0 /self.train_size) * np.dot(x.T, (h-y))
    
    def predict(self, X):
        pred = np.dot(X, self.theta)
        pred[pred >= 0.5] = 1
        pred[pred < 0.5] = 0
        return pred
        
    def train(self, X, y):
        cost = []
        self.theta = np.random.rand(X.shape[1])
        self.train_size = X.shape[0]
        for it in range(self.max_it): 
            cost.append(self.cost_fcn(X,y))
            grads = self.gradients(X, y)
            self.theta = self.theta - self.alpha * grads
        print("Cost Function per iteration:")
        print(cost)


In [15]:
clf1 = LogisticRegression(alpha=0.01)
print(clf1)
print("Model alpha: ", clf1.alpha)

Hi I am a LogisticRegressionClassifier
Model alpha:  0.01


In [16]:
clf1.train(X,y)
print("Training accuracy: ", sum(clf1.predict(X))/(X.shape[0]/2))

Cost Function per iteration:
[28.7935103246232, 28.608019567485318, 28.42526347048498, 28.245181257001285, 28.06771393039164, 27.89280420988623, 27.720396469203145, 27.550436677751875, 27.382872344300324, 27.21765246298712]
Training accuracy:  1.0


It is very common to have data points close to the decision boundary to have observation errors or be mislabeled. Below,I will introducing mislabeledslabeled datapoints to the training dataset. I add 10 additional datapoints and end up with 210 training datapoints in general and both are again type of np.array.

In [17]:
x1_misslabel = np.random.randn(5,2) + 1
x2_misslabel = np.random.randn(5,2) - 1
X_misslabel = np.concatenate([x1_misslabel,x2_misslabel], axis=0)    
y_misslabel  = np.concatenate([np.zeros(5), np.ones(5)], axis=0)
X_prime = np.concatenate([X, X_misslabel], axis=0)
y_prime = np.concatenate([y, y_misslabel], axis=0)

In order to be able to handle mislabeled training data points and avoid overfitting I wish to leverage regularization. Below I define a class that extends the LogisticRegression class and updates its functions to incorporate regularization. 

In [21]:
class LogisticRegressionRegularized(LogisticRegression):
    def __init__(self, regularization_coef=0.01, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.regularization_coef = regularization_coef
        
    def cost_fcn(self, x, y):
        cost = super().cost_fcn(x, y)
        cost += self.regularization_coef/(2* x.shape[0])*(np.dot(self.theta,self.theta))
        return cost
    
    def gradients(self, x, y):
        gradient = super().gradients(x, y)
        gradient += (1.0 /self.train_size)*self.regularization_coef*self.theta
        return gradient

In [22]:
clf2 = LogisticRegressionRegularized(max_it=10)

In [23]:
clf2.train(X_prime, y_prime)
print("Training accuracy: ", sum(clf2.predict(X_prime))/(X_prime.shape[0]/2))

Cost Function per iteration:
[102.45323571695855, 84.3341347188418, 72.57681806534605, 64.47258124017553, 58.60146832691392, 54.176574447252946, 50.73502987408694, 47.98957208451409, 45.75353729691718, 43.90081243453098]
Training accuracy:  0.9809523809523809
