<h4>Implementing logistic regression from scratch in Python</h4>

In [1]:
import numpy as np
from utils import *

<p>Logistic regression is a standard model for classification tasks. It is simple, which makes it a good starting point for learning how to generate probabilities, classify samples, and understand gradient descent. This tutorial walks through the mathematical equations and pairs them with practical Python examples so that you can see exactly how to train your own binary logistic regression model.</p>

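<h4>First we initialize some parameters</h4>
<h5>In machine learning, settings chosen before training are called hyperparameters; values learned during training are called parameters</h5>

<ul>
  <li>learning_rate: hyperparameter that sets the step size of each update</li>
  <li>n_iters: hyperparameter that sets the number of gradient descent iterations</li>
  <li>weights: learned parameter; outside a from-scratch implementation, Keras/TensorFlow creates it automatically</li>
  <li>bias: learned parameter; outside a from-scratch implementation, Keras/TensorFlow creates it automatically</li>
</ul>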



In [32]:
def __init__(self, learning_rate=0.001, n_iters=1000):
        self.lr = learning_rate
        self.n_iters = n_iters
        self.weights = None
        self.bias = None

<h4>Why sigmoid, and why not a linear function?</h4>
<h5>A linear activation function has two major problems:</h5>
<ul>
  <li>Backpropagation cannot learn anything useful from it, because the derivative of the function is a constant and has no relation to the input x.</li>
  <li>All layers of the neural network collapse into one if a linear activation function is used. No matter how many layers the network has, the last layer is still a linear function of the first layer. So, essentially, a linear activation function turns the neural network into just one layer.</li>
  
</ul> 
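<p>The second point above can be checked numerically. The sketch below is only an illustration: the layer sizes, the random weights, and the names <code>W1</code>, <code>b1</code>, <code>W2</code>, <code>b2</code> are assumptions made for the demo and are not part of the model built in this tutorial.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked "layers" with a linear (identity) activation.
# Shapes and values are arbitrary, chosen only for illustration.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)

X = rng.normal(size=(5, 4))  # 5 samples, 4 features

# Forward pass through both linear layers.
two_layers = (X @ W1 + b1) @ W2 + b2

# The same mapping written as one linear layer.
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = X @ W + b

layers_collapse = np.allclose(two_layers, one_layer)
print(layers_collapse)
```

<p>The two outputs agree, so the extra layer adds no expressive power when the activation is linear.</p>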

<h4>Non Linear Activation</h4>
<p>The linear activation function shown above is simply a linear regression model. 
Because of its limited power, this does not allow the model to create complex mappings between the network’s inputs and outputs. 
</p>
<p>Non-linear activation functions solve the following limitations of linear activation functions:</p>

<ul>
  <li>They allow backpropagation because now the derivative function would be related to the input, and it’s possible to go back and understand which weights in the input neurons can provide a better prediction.</li>
  <li>They allow multiple layers of neurons to be stacked, because the output is now a non-linear combination of the input passed through those layers. This lets a network represent far more complex input-output mappings than a single linear layer can.</li>
  
</ul> 
<h5>***Reference</h5>
<a href="https://timvieira.github.io/blog/post/2014/02/11/exp-normalize-trick/">Exp Normalization Trick: Numerically stable sigmoid function</a>

In [67]:
def _sigmoid(x):
    """Numerically stable sigmoid function (scalar input only)."""
    if x >= 0:
        # exp(-x) <= 1 here, so it cannot overflow
        z = np.exp(-x)
        return 1 / (1 + z)
    else:
        # if x is less than zero then z will be small, denom can't be
        # zero because it's 1+z.
        z = np.exp(x)
        return z / (1 + z)

In [68]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [72]:
sigmoid(np.inf),_sigmoid(np.inf)

(1.0, 1.0)
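<p>The check above uses only +inf, where both versions agree. They differ for large negative inputs, where <code>np.exp(-x)</code> overflows in the naive version. The sketch below is a hedged illustration: <code>naive_sigmoid</code> and <code>stable_sigmoid</code> are hypothetical helper names, and the vectorized form is one possible way to apply the same trick to arrays, not the implementation used later in this tutorial.</p>

```python
import numpy as np

def naive_sigmoid(x):
    return 1 / (1 + np.exp(-x))

def stable_sigmoid(x):
    """Vectorized, numerically stable sigmoid (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) <= 1, so it cannot overflow.
    out[pos] = 1 / (1 + np.exp(-x[pos]))
    # For x < 0, exp(x) < 1, so it cannot overflow either.
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1 + ex)
    return out

x = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])

# Turn overflow warnings into errors so the difference is visible.
with np.errstate(over="raise"):
    try:
        naive_sigmoid(x)
        naive_overflowed = False
    except FloatingPointError:
        naive_overflowed = True

    stable_values = stable_sigmoid(x)  # no overflow is raised here

print(naive_overflowed)
print(stable_values)
```

<p>By default NumPy only warns on overflow and the naive result still comes out as 0.0, which is why the problem is easy to miss.</p>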

<center><img src="sigmoid.png" alt="Plot of the sigmoid function" width="800" height="700"></center>
<p>This function takes any real value as input and outputs a value between 0 and 1. 
The larger (more positive) the input, the closer the output is to 1.0; the smaller (more negative) the input, the closer the output is to 0.0, as shown above.</p>
<p>It is commonly used in models that must output a probability. Since a probability always lies between 0 and 1, the sigmoid's range makes it a natural choice.</p>

<ul>
  <li>The derivative of the function is f'(x) = sigmoid(x)*(1-sigmoid(x)).</li>
</ul>
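<p>The derivative formula can be sanity-checked against a finite-difference approximation. This is a minimal sketch; the step size <code>h</code> and the test points are arbitrary choices, and <code>sigmoid_derivative</code> is a helper name introduced here for illustration.</p>

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # f'(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1 - s)

x = np.linspace(-5, 5, 11)
h = 1e-5

# Central difference approximation of the derivative.
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic = sigmoid_derivative(x)

max_abs_diff = np.max(np.abs(numeric - analytic))
print(max_abs_diff)
```

<p>The difference should be tiny, and the derivative peaks at x = 0 with a value of 0.25.</p>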

<h4>Cross Entropy Loss</h4>

In [73]:
def compute_loss(self, y_true, y_pred):
        # binary cross entropy
        epsilon = 1e-9
        y1 = y_true * np.log(y_pred + epsilon)
        y2 = (1-y_true) * np.log(1 - y_pred + epsilon)
        return -np.mean(y1 + y2)

<center><img src="bce.png" alt="Binary cross entropy loss" width="800" height="700"></center>
<center>Binary Cross Entropy Loss</center>
<br>
<p>A loss function evaluates how well your algorithm models your dataset. If your predictions are far off, the loss is high; if they are good, it is low. As you change pieces of your algorithm to improve your model, the loss tells you whether you are getting anywhere. Binary cross entropy does this by comparing the ground truth y_true with the predictions y_pred (also written as y hat).</p>
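<p>A small worked example may make this concrete. The labels and predicted probabilities below are made up for illustration, and <code>binary_cross_entropy</code> is a standalone copy of the <code>compute_loss</code> logic shown above.</p>

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, epsilon=1e-9):
    # epsilon keeps log() away from log(0)
    y1 = y_true * np.log(y_pred + epsilon)
    y2 = (1 - y_true) * np.log(1 - y_pred + epsilon)
    return -np.mean(y1 + y2)

y_true = np.array([1, 0, 1, 0])

# Hypothetical predictions: one set close to the labels, one far from them.
good = np.array([0.9, 0.1, 0.8, 0.2])
bad = np.array([0.1, 0.9, 0.2, 0.8])

loss_good = binary_cross_entropy(y_true, good)
loss_bad = binary_cross_entropy(y_true, bad)

print(loss_good, loss_bad)
```

<p>The confident, correct predictions give a loss of roughly 0.16, while the confident, wrong ones give roughly 1.96.</p>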

<h5>***Reference</h5>
<a href='https://www.deeplearningbook.org/'>Deep Learning, by Ian Goodfellow, Yoshua Bengio and Aaron Courville.</a>

<h4>Gradient Descent (Forward + BackProp)</h4>

<center><img src="gradient.png" alt="Gradient descent" width="800" height="700"></center>
<p>Gradient descent is an algorithm that optimizes the cost function, that is, the error of the model, by searching for the minimum error the model can reach. At each step, the negative gradient gives the direction to move in to reduce the error.</p>
<p>So here’s the definition: when you differentiate a function, you get an equation that gives the gradient (slope) of the function at any given value of x.
Looking back at the graph of the cost function, if we differentiate the cost function, we can find its gradient at the current parameter values.
</p>
<p>Here 'w' sets the orientation of the decision boundary and 'b' shifts it back and forth.</p>
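<p>The update rule is easiest to see on a one-dimensional toy problem. The function <code>f(x) = (x - 3)^2</code>, the starting point, and the learning rate below are assumptions chosen for illustration; they are not taken from the model in this tutorial.</p>

```python
def f(x):
    # Toy cost function with its minimum at x = 3.
    return (x - 3.0) ** 2

def grad_f(x):
    # Derivative of f with respect to x.
    return 2.0 * (x - 3.0)

x = 0.0              # arbitrary starting point
learning_rate = 0.1  # arbitrary step size

for _ in range(100):
    # Step against the gradient to reduce f(x).
    x -= learning_rate * grad_f(x)

print(x, f(x))
```

<p>After 100 steps x is very close to 3, the minimum of the toy cost function. The logistic regression training loop follows the same pattern, with the weights and bias in place of x.</p>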

<p>***References</p>

<a href='https://see.stanford.edu/materials/aimlcs229/cs229-notes1.pdf'>Stanford CS229, 2018, Andrew Ng</a>

<a href='https://www.youtube.com/watch?v=4b4MUYve_U8'>Stanford Lecture: CS229</a>

<a href='https://www.deeplearningbook.org/'>Deep Learning, by Ian Goodfellow, Yoshua Bengio and Aaron Courville.</a>


In [74]:
def fit(self, X, y):
        n_samples, n_features = X.shape

        # init parameters
        self.weights = np.zeros(n_features)
        self.bias = 0

        # gradient descent
        for _ in range(self.n_iters):
            A = self.feed_forward(X)
            # dLoss/dz for sigmoid + binary cross entropy simplifies to (A - y)
            dz = A - y

            # compute gradients
            dw = (1 / n_samples) * np.dot(X.T, dz)
            db = (1 / n_samples) * np.sum(dz)
            # update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
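<p>The gradients used in <code>fit</code> come from the chain rule: for a sigmoid output with binary cross entropy, the derivative of the loss with respect to z simplifies to (A - y). A hedged way to convince yourself is a finite-difference gradient check. The random data, seed, and step size below are assumptions made only for this check.</p>

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, b, X, y):
    # Mean binary cross entropy of a logistic regression model.
    a = sigmoid(X @ w + b)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))     # synthetic features
y = rng.integers(0, 2, size=20)  # synthetic binary labels
w = rng.normal(size=3)
b = 0.5

# Analytic gradients, same formulas as in fit().
A = sigmoid(X @ w + b)
dz = A - y
dw = X.T @ dz / len(y)
db = np.sum(dz) / len(y)

# Numerical gradients via central differences.
h = 1e-6
dw_num = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = h
    dw_num[j] = (loss(w + step, b, X, y) - loss(w - step, b, X, y)) / (2 * h)
db_num = (loss(w, b + h, X, y) - loss(w, b - h, X, y)) / (2 * h)

gradients_match = bool(
    np.allclose(dw, dw_num, atol=1e-6) and np.isclose(db, db_num, atol=1e-6)
)
print(gradients_match)
```

<p>If the analytic and numerical gradients agree, the update rule is consistent with the loss being minimized.</p>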

<h4>Here is the full code:</h4>

In [30]:
class LogisticRegression:
    def __init__(self, learning_rate=0.001, n_iters=1000):
        self.lr = learning_rate
        self.n_iters = n_iters
        self.weights = None
        self.bias = None
         
    #Sigmoid method
    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def compute_loss(self, y_true, y_pred):
        # binary cross entropy
        epsilon = 1e-9
        y1 = y_true * np.log(y_pred + epsilon)
        y2 = (1-y_true) * np.log(1 - y_pred + epsilon)
        return -np.mean(y1 + y2)

    def feed_forward(self,X):
        z = np.dot(X, self.weights) + self.bias
        A = self._sigmoid(z)
        return A

    def fit(self, X, y):
        n_samples, n_features = X.shape

        # init parameters
        self.weights = np.zeros(n_features)
        self.bias = 0

        # gradient descent
        for _ in range(self.n_iters):
            A = self.feed_forward(X)
            # dLoss/dz for sigmoid + binary cross entropy simplifies to (A - y)
            dz = A - y

            # compute gradients
            dw = (1 / n_samples) * np.dot(X.T, dz)
            db = (1 / n_samples) * np.sum(dz)
            # update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
            
    def predict(self, X):
        # probability of class 1, thresholded at 0.5
        y_predicted = self.feed_forward(X)
        y_predicted_cls = [1 if i > 0.5 else 0 for i in y_predicted]

        return np.array(y_predicted_cls)

    def accuracy(self,y, y_hat):
        accuracy = np.sum(y == y_hat) / len(y)
        return accuracy
    
  

<h4>Output of the Regression</h4>
<p>Here we use a scikit-learn toy dataset (Breast Cancer) to test our regression model. Compared with the scikit-learn Logistic Regression model, it gives good output, with 93% accuracy and 91.7% precision.
</p>

In [31]:
from sklearn.model_selection import train_test_split
from sklearn import datasets

dataset = datasets.load_breast_cancer()
X, y = dataset.data, dataset.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234
)

regressor = LogisticRegression(learning_rate=0.0001, n_iters=1000)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)
# confusion_matrix is not defined in this notebook; it presumably comes from `utils` (imported at the top)
cm, accuracy, sens, precision, f_score = confusion_matrix(np.asarray(y_test), np.asarray(predictions))
print("Test accuracy: {0:.3f}".format(accuracy))
print("Confusion Matrix:", np.array(cm))

Test accuracy: 0.930
Confusion Matrix: [[39  6]
 [ 2 67]]
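<p>The reported metrics can be recomputed from the confusion matrix printed above. This sketch assumes rows are true labels and columns are predicted labels (the scikit-learn convention); the layout of the <code>confusion_matrix</code> helper used here is not shown, so treat that as an assumption.</p>

```python
import numpy as np

# Confusion matrix copied from the output above.
# Assumed layout: rows = true class, columns = predicted class.
cm = np.array([[39, 6],
               [2, 67]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("accuracy:  {0:.3f}".format(accuracy))
print("precision: {0:.3f}".format(precision))
print("recall:    {0:.3f}".format(recall))
print("f1:        {0:.3f}".format(f1))
```

<p>Under that assumption, accuracy comes out at about 0.930 and precision at about 0.918, in line with the figures quoted earlier.</p>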


<h5>Thank you for reading the article</h5>
<h5>Don't forget to follow me on GitHub and Medium.</h5>
<h6>Here is the <a href=''>Code</a></h6>