# Logistic Regression

A regression model helps us understand the relationship between input variables (independent variables) and output variables (dependent variables). There are several types of regression techniques, such as linear regression and logistic regression.

A linear regression model is used if the relationship between the dependent and independent variables is expected to be linear. Applying this model gives a best-fit line that predicts an approximate output variable for the given input variable(s).

A logistic regression model is used when the output is binary, that is, either 0 or 1, True or False, and so on: the output can only be one of two alternatives. While a linear regression model tries to find a best-fit linear relation (a line), the logistic regression model fits the data to a logistic sigmoid function.

A sigmoid function is an S-shaped, monotonic, differentiable function that is bounded; the bounds are often, but not necessarily, 0 and 1. Examples of sigmoid functions include the logistic sigmoid (bounded between 0 and 1) and the hyperbolic tangent (bounded between -1 and 1).

The logistic sigmoid function is defined as:
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

The hyperbolic tangent function is defined as:
$$
\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
$$
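
To make the two definitions concrete, here is a small sketch that evaluates both functions at a few points. The helper name `logistic_sigmoid` is chosen here for illustration; the last column checks the standard identity $\tanh(x) = 2\sigma(2x) - 1$, which shows that the hyperbolic tangent is a rescaled logistic sigmoid.

```python
import math

def logistic_sigmoid(x):
    """Logistic sigmoid: maps any real x into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-4.0, 0.0, 4.0):
    s = logistic_sigmoid(x)
    t = math.tanh(x)
    # tanh is a rescaled logistic sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
    rescaled = 2.0 * logistic_sigmoid(2.0 * x) - 1.0
    print(f"x={x:+.1f}  sigmoid={s:.4f}  tanh={t:.4f}  2*sigmoid(2x)-1={rescaled:.4f}")
```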

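The logistic regression model uses the logistic sigmoid function. Strictly speaking, the logit function is the inverse of the logistic sigmoid, but this article uses the name "logit function" loosely for the sigmoid itself. For a dataset containing 2 independent variables, the function can be written as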

$$
\sigma(x_1,x_2) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \theta_2 x_2)}}
$$

where $x_1$ and $x_2$ are the independent variables and $\theta_0$, $\theta_1$ and $\theta_2$ are the parameters that we have to find. Once these parameters are found, we can use them to evaluate the logit function for given values of the independent variables. This value is the model's estimate of the probability that the event happens (output 1) for those values.

However, logistic regression is used here as a classification model, so the final output must be either 0 or 1. We therefore need a threshold value: if the probability is at or above the threshold, the output is 1, and if it is below, the output is 0. The threshold value that we will be using is 0.5.
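
As a minimal sketch of this prediction step, the snippet below computes the probability for one data point and applies the 0.5 threshold. The parameter values in `theta_example` are made up purely for illustration, and the function names are not part of the code written later in this article.

```python
import math

def predict_probability(x1, x2, theta):
    """Probability of output 1 for one data point with two features."""
    theta0, theta1, theta2 = theta
    z = theta0 + theta1 * x1 + theta2 * x2
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x1, x2, theta, threshold=0.5):
    """Turn the probability into a 0/1 label using the threshold."""
    return 1 if predict_probability(x1, x2, theta) >= threshold else 0

# Hypothetical parameters (theta_0, theta_1, theta_2), not fitted to any data
theta_example = (-1.0, 2.0, 0.5)

p = predict_probability(1.0, 2.0, theta_example)  # z = -1 + 2*1 + 0.5*2 = 2
print(f"probability = {p:.4f}, class = {predict_class(1.0, 2.0, theta_example)}")
```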

But how do we find the correct parameter values? To answer that, we need to know about the cost function and gradient descent.

### Cost Function

In general, a cost function takes the model's parameters (scalars or vectors) and returns a single scalar that measures how badly the model's predictions fit the data. A low value of the cost function therefore means that the predictions are close to the actual outputs. There are many types of cost functions, for example mean squared error and mean absolute error.

The cost function that we use for logistic regression model is
$$
J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(h_\theta(x_i)) + (1 - y_i) \log(1 - h_\theta(x_i)) \right]   
$$

where $n$ is the number of data points, $y_i$ is the actual output (either 0 or 1) for the $i$-th data point, and $h_\theta(x_i)$ is the predicted probability for that data point, that is, the value of the logit function at $x_i$. Note that $x_i$ is the vector of independent variables for the $i$-th data point and $\theta$ is the vector of parameters.
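
To get a feel for this formula, here is a small sketch that evaluates it on four made-up data points. The function name `cross_entropy_cost` and the example probabilities are assumptions for illustration only; the point is that predictions close to the true labels give a lower cost than confident wrong predictions.

```python
import numpy as np

def cross_entropy_cost(y_true, h_pred):
    """Average of -[y*log(h) + (1-y)*log(1-h)] over all data points."""
    y_true = np.asarray(y_true, dtype=float)
    h_pred = np.asarray(h_pred, dtype=float)
    return -np.mean(y_true * np.log(h_pred) + (1 - y_true) * np.log(1 - h_pred))

# Made-up labels and predicted probabilities, for illustration only
y_example = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]
mostly_wrong = [0.1, 0.9, 0.2, 0.8]

cost_right = cross_entropy_cost(y_example, mostly_right)
cost_wrong = cross_entropy_cost(y_example, mostly_wrong)
print(f"cost when mostly right: {cost_right:.4f}")
print(f"cost when mostly wrong: {cost_wrong:.4f}")
```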

The cost function gives us a way to improve the model: to make the predictions as accurate as possible, we minimise the cost function, which can be done using various algorithms. The one we are going to use is gradient descent.

### Gradient Descent

Consider the function $f(x)=x^2+2x+1$. At a minimum of a differentiable function, the gradient is 0. Using this fact, one can find the minimum directly by differentiating the quadratic polynomial and equating the derivative to 0. But the minimum can also be found by another method.
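
For this particular function the direct route takes one line. As a worked step:

$$
f'(x) = 2x + 2 = 0 \implies x = -1, \qquad f(-1) = (-1)^2 + 2(-1) + 1 = 0
$$

Since $f''(x) = 2 > 0$, this stationary point is indeed a minimum.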

We know that the gradient points in the direction of steepest increase of the function, and its magnitude represents how fast the function rises. This means that the direction exactly opposite to the gradient is the direction of steepest decrease, and the magnitude of the gradient represents how fast the function drops.

So, if we select a random point on the curve and move in the direction opposite to the gradient by a distance proportional to the magnitude of the gradient, then, for a suitable choice of proportionality constant, the new position will be closer to the point of minima. If this process is repeated enough times, we reach a point so close to the point of minima that, for all practical purposes, it can be treated as the minimum.

The proportionality constant here is called the learning rate. It is a hyperparameter, so it must be chosen by the user before training the model rather than learned by the machine during training.

Let us test this technique on $f(x)$, whose minimum is at $x=-1$.

In [1]:
# Derivative of f(x) = x^2 + 2x + 1
def gradient(x):
    return 2*x + 2

def gradient_descent(x, number_of_iterations, learning_rate):
    # Repeatedly step against the gradient, scaled by the learning rate
    for i in range(number_of_iterations):
        x = x - gradient(x)*learning_rate
    return x

x = 1  # starting point
learning_rate = 0.1
number_of_iterations = 5
xn = gradient_descent(x, number_of_iterations, learning_rate)
xn

-0.34464000000000006

With a learning rate of 0.1 and 5 iterations, we get about -0.34. Let us try again with different values.

In [2]:
learning_rate=0.2
number_of_iterations=5
xn=gradient_descent(x,number_of_iterations,learning_rate)
xn

-0.84448

By increasing the learning rate to 0.2, we get about -0.84, which is a better estimate since the true minimum is at $x=-1$.

In [3]:
learning_rate=0.1
number_of_iterations=10
xn=gradient_descent(x,number_of_iterations,learning_rate)
xn

-0.7852516352000001

Increasing the number of iterations also gives a better estimate.

In [4]:
learning_rate=0.8
number_of_iterations=5
xn=gradient_descent(x,number_of_iterations,learning_rate)
xn

-1.15552

With a higher learning rate, however, we overshoot the point of minima: each step travels so far that it crosses to the other side of the minimum.
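
The overshoot can be read off from the update rule itself. As a short worked step, writing $\alpha$ for the learning rate (a symbol introduced here for illustration) and using $f'(x)=2x+2$:

$$
x_{k+1} + 1 = x_k - \alpha(2x_k + 2) + 1 = (1 - 2\alpha)(x_k + 1)
$$

So each step multiplies the distance to the minimum at $x=-1$ by $(1-2\alpha)$. For $\alpha=0.8$ this factor is $-0.6$: the sign flips at every step, which is the overshoot, but the magnitude still shrinks. Starting from $x=1$, five steps give $-1 + 2(-0.6)^5 = -1.15552$, matching the output above.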

In [5]:
learning_rate=0.8
number_of_iterations=50
xn=gradient_descent(x,number_of_iterations,learning_rate)
xn

-0.9999999999838344

However, setting the number of iterations to a higher value brings us back very close to the point of minima. So, can we set the learning rate to any value and simply make the number of iterations very high? It may work, but it is not recommended.

In our case, increasing the number of iterations from 5 to 50 did not noticeably affect the computation time, because both are small numbers of iterations and the calculation itself is simple. In real-life applications, however, each iteration is far more expensive and the number of iterations is much higher. If, on top of that, we pick the learning rate arbitrarily and compensate with a much larger number of iterations, the computation time becomes too large. So even though it may work, it is better not to set the learning rate arbitrarily: estimate it by calculation if possible, and otherwise make an informed guess.
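
To illustrate how strongly the learning rate drives the number of iterations, here is a rough sketch on the same $f(x)$. The helper `iterations_until_flat`, its tolerance and its divergence cut-off are assumptions made for this illustration; it counts steps until the gradient is nearly zero and returns `None` when the iterates move away from the minimum instead.

```python
def gradient_of_f(x):
    """Derivative of f(x) = x^2 + 2x + 1."""
    return 2 * x + 2

def iterations_until_flat(learning_rate, start=1.0, tolerance=1e-6, max_iterations=10_000):
    """Count gradient descent steps until |f'(x)| < tolerance.

    Returns None if the iterates blow up or the step budget runs out.
    """
    x = start
    for step in range(1, max_iterations + 1):
        x = x - learning_rate * gradient_of_f(x)
        if abs(x) > 1e6:  # the iterates are moving away from the minimum
            return None
        if abs(gradient_of_f(x)) < tolerance:
            return step
    return None

for lr in (0.01, 0.1, 0.5, 0.8, 1.1):
    print(f"learning rate {lr}: {iterations_until_flat(lr)} iterations")
```

For this quadratic, a very small learning rate needs many more steps than a moderate one, and a learning rate above 1 never converges at all, no matter how many iterations are allowed.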

To find a good learning rate, there are techniques such as grid search and random search. We shall use random search: we first guess a range in which the learning rate is likely to lie, then try a number of learning rates sampled at random from that range and keep the best one.
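
Before applying random search to the logistic regression model, here is a minimal sketch of the idea on the quadratic $f(x)$. All names, the fixed budget of 20 iterations and the seed are assumptions for illustration; each sampled learning rate is scored by the value of $f$ it reaches, and the lowest value wins.

```python
import random

def f(x):
    return x**2 + 2 * x + 1

def gradient_of_f(x):
    return 2 * x + 2

def run_gradient_descent(learning_rate, start=1.0, iterations=20):
    """Plain gradient descent on f with a fixed number of steps."""
    x = start
    for _ in range(iterations):
        x = x - learning_rate * gradient_of_f(x)
    return x

def random_search_learning_rate(low, high, trials, seed=0):
    """Sample learning rates uniformly from [low, high] and keep the best one."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    best_lr, best_value = None, float("inf")
    for _ in range(trials):
        lr = rng.uniform(low, high)
        value = f(run_gradient_descent(lr))
        if value < best_value:
            best_lr, best_value = lr, value
    return best_lr, best_value

best_lr, best_value = random_search_learning_rate(0.0, 1.0, trials=25)
print(f"best learning rate: {best_lr:.4f}, f at the end: {best_value:.3e}")
```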

Although gradient descent looks much bulkier than the direct method for the quadratic polynomial, it is still useful. In that example we know the function explicitly and it is a curve in 2 dimensions, so the minimum is easy to find directly. For higher-dimensional functions, where solving for a zero gradient directly is often impractical, gradient descent is much more helpful. So we will use it to minimise our cost function.

The gradient of the cost function is
$$
\nabla J(\theta)=\frac{1}{n} X^T (h-y)
$$

where $X$ is the matrix containing the independent variables of all the data points in the dataset (one row per data point), $h$ is the vector containing the predicted values for all the data points and $y$ is the vector containing the correct values for all the data points.
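
One way to gain confidence in this gradient formula is to compare it with a finite-difference estimate of the cost. The sketch below does that on a small synthetic dataset; the data, the parameter values and all function names are made up for this check and are not the dataset used later.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_gradient(theta, X, y):
    """Gradient from the formula (1/n) * X^T (h - y)."""
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / len(y)

def numerical_gradient(theta, X, y, eps=1e-6):
    """Central finite differences of the cost, one parameter at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad

# Synthetic data: 20 points, 2 features, plus a leading column of ones
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 2))
X_demo = np.column_stack([np.ones(20), features])
y_demo = (features[:, 0] + features[:, 1] > 0).astype(float)
theta_demo = np.array([0.1, -0.2, 0.3])

analytic = analytic_gradient(theta_demo, X_demo, y_demo)
numerical = numerical_gradient(theta_demo, X_demo, y_demo)
max_difference = np.max(np.abs(analytic - numerical))
print("largest difference between the two gradients:", max_difference)
```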

With this knowledge, let us now start writing the code.

### The Code

Importing the necessary libraries and datasets:

In [6]:
import numpy as np
import pandas as pd

df_train=pd.read_csv(r"D:\data2_train.csv")
df_test=pd.read_csv(r"D:\data2_test.csv")

Now we initialise the $\theta$ vector. In this vector, the first coordinate is the constant term $\theta_0$ and the next two are the coefficients $\theta_1$ and $\theta_2$.

In [7]:
theta=np.ones((3,1))

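Next, we define the logit function:

In [8]:
def logitfunc(x,t):
    # Logistic sigmoid of the linear combination x.t, written as 1/(1+e^(-z))
    # so that a large positive z gives 1 instead of overflowing to nan
    z=np.dot(x,t)
    return 1/(1+np.exp(-z))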

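Let us now prepare the arrays needed.

In [9]:
x=df_train.iloc[:,:2].to_numpy()   # the two feature columns as a NumPy array
y=df_train.iloc[:,2:3]             # the target column, kept as a one-column DataFrame
x=np.insert(x,0,1,axis=1)          # prepend a column of ones for theta_0
n=len(df_train)                    # number of training examples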

Here we have added a column of ones. These ones never change; they are added so that the dimensions match those of $\theta$, since we need the dot product of the two. When they are multiplied, we get $\theta_0 + \theta_1 x_1 + \theta_2 x_2$, which is the expression that appears inside the logit function.
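
The effect of the column of ones can be checked on a tiny example. The feature values and parameters below are made up for illustration; the sketch only shows that the dot product with the augmented matrix reproduces $\theta_0 + \theta_1 x_1 + \theta_2 x_2$ for every row.

```python
import numpy as np

# Two made-up data points with two features each
features_demo = np.array([[2.0, 3.0],
                          [4.0, 5.0]])
# Made-up parameters: theta_0 = 0.5, theta_1 = 1.0, theta_2 = -2.0
theta_demo = np.array([[0.5], [1.0], [-2.0]])

# Prepend a column of ones, as done for the training data
X_with_ones = np.insert(features_demo, 0, 1, axis=1)
z = np.dot(X_with_ones, theta_demo)

# The same quantity computed term by term for comparison
z_by_hand = 0.5 + 1.0 * features_demo[:, 0] - 2.0 * features_demo[:, 1]

print(X_with_ones)
print(z.ravel(), z_by_hand)
```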

Let us also define the gradient of the cost function.

In [10]:
def gradientfunc(x,y,n,theta):
    # Gradient of the cost, (1/n) * X^T (h - y), evaluated at the given theta
    h=logitfunc(x,theta)
    return (1/n)*np.dot(x.T,(h-y))

To update the $\theta$ vector iteratively, let us write another function.

In [11]:
def gradientoptim(x,y,n,theta,lr,num_iter):
    # Updates theta in place, so the caller's array holds the trained parameters
    for i in range(num_iter):
        g=gradientfunc(x,y,n,theta)
        theta-=g*lr

Let us now set the number of iterations and the learning rate.

In [12]:
num_iter=10000
lro=0.31
n=len(y)

Let us now perform the gradient descent

In [13]:
gradientoptim(x,y,n,theta,lro,num_iter)
h=logitfunc(x,theta)

In [14]:
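# First three predicted probabilities
for i in range(3):
    print(h[i])

# Last three predicted probabilities, starting from the final row
for i in range(3):
    print(h[-1-i])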

[5.89334441e-31]
[3.31214771e-37]
[2.31392988e-33]
[1.]
[1.]
[1.]


Clearly, the values at the start (which are supposed to be 0) are very close to zero and the values at the end (which are supposed to be 1) are very close to one. This suggests the model is working, so let us check its accuracy on the training data.

In [15]:
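def setnormal(h):
    # Threshold the probabilities in place: below 0.5 becomes 0, otherwise 1
    for i in range(len(h)):
        if h[i]<0.5:
            h[i]=0
        else:
            h[i]=1
setnormal(h)
def accuracyfunc(h,y):
    # Percentage of predictions that match the true labels
    count=(h==y).sum()
    accuracy=(count.values[0]/len(y))*100
    return accuracy
print(f"The accuracy on training data is :{accuracyfunc(h,y)}%")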

The accuracy on training data is :99.125%


But is this the maximum that can be attained? Let us run a random search on the learning rate.

In [16]:
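def random_search(x,y,theta,n_trials,rangelo,rangehi):
    hi_lr=0
    hi_accuracy=0
    m=len(y)
    for i in range(n_trials):
        # Sample a candidate learning rate uniformly from the assumed range
        lrn=np.random.uniform(rangelo,rangehi)
        # Run the descent on a copy so every trial starts from the same theta
        theta_c=theta.copy()
        for j in range(num_iter):
            hc=logitfunc(x,theta_c)
            theta_c-=lrn*(1/m)*np.dot(x.T,(hc-y))
        # Score the parameters obtained with this learning rate
        hn=logitfunc(x,theta_c)
        setnormal(hn)
        accuracy_n=accuracyfunc(hn,y)
        if accuracy_n>hi_accuracy:
            hi_accuracy=accuracy_n
            hi_lr=lrn
    return hi_lr,hi_accuracy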

In [17]:
p=random_search(x,y,theta,10,0,1)
print(f"Learning rate={p[0]} and Accuracy={p[1]}")

Learning rate=0.2779272825673684 and Accuracy=99.125


Next, let us measure the accuracy on the test data.

In [18]:
x2=df_test.iloc[:,:2]
y2=df_test.iloc[:,2:3]
x2=np.insert(x2,0,1,axis=1)
n=len(df_test)
h2=logitfunc(x2,theta)
setnormal(h2)
acc_test=accuracyfunc(h2,y2)
print(f"Accuracy on testing data:{acc_test}%")

Accuracy on testing data:98.5%


Let us compare this with the accuracy of the model trained by scikit-learn.

In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd

X_train = df_train[['Feature_1', 'Feature_2']]
y_train = df_train['Target']

X_test = df_test[['Feature_1', 'Feature_2']]
y_test = df_test['Target']

model = LogisticRegression()

model.fit(X_train, y_train)

y_train_pred = model.predict(X_train)
accuracy_train = accuracy_score(y_train, y_train_pred)*100
print(f"Accuracy on training dataset: {accuracy_train}%")

y_test_pred = model.predict(X_test)
accuracy_test = accuracy_score(y_test, y_test_pred)*100
print(f"Accuracy on testing dataset: {accuracy_test}%")

Accuracy on training dataset: 98.625%
Accuracy on testing dataset: 99.0%
