# Subtask 1 : Mathematical Understanding

## Definition of Logistic Regression

Logistic regression is a predictive modelling technique used for classification. It estimates the relationship between a binary dependent variable and one or more independent variables. It builds on linear regression, but passes the linear combination of the features through the sigmoid function so that the output is a probability between 0 and 1. That probability is then mapped to a binary class (either 0 or 1).

## Underlying principles

Logistic regression assumes a linear relationship between the features and the log-odds (logit) of the target variable. It applies the sigmoid function (an S-shaped curve) to the linear combination of the independent variables to obtain the probability that the dependent variable belongs to the positive class. The threshold is set to 0.5: a probability above it is assigned to the positive class (binary value 1), and a probability at or below it is assigned to the negative class (binary value 0).

## Assumptions

The first assumption is that the dependent variable is binary. The second is that there is a linear relationship between the independent variables and the log-odds of the dependent variable. The model also assumes that the observations are independent of each other, and that there is little or no multicollinearity among the independent variables.
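
One quick way to screen for multicollinearity is to look at the pairwise correlations between feature columns. The sketch below runs on synthetic data; the helper name `max_abs_correlation` is hypothetical and not part of this notebook, and a high pairwise correlation is only a rough warning sign rather than a complete test.

```python
import numpy as np

def max_abs_correlation(X):
    """Largest absolute off-diagonal Pearson correlation between the columns of X."""
    corr = np.corrcoef(X, rowvar=False)  # rowvar=False: columns are features
    np.fill_diagonal(corr, 0.0)          # ignore each feature's correlation with itself
    return np.max(np.abs(corr))

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)

X_independent = np.column_stack([a, b])
# Second column is almost an exact multiple of the first: strongly collinear.
X_collinear = np.column_stack([a, 2 * a + 0.01 * b])

print(max_abs_correlation(X_independent))
print(max_abs_correlation(X_collinear))
```

A value close to 1 for a pair of columns suggests that one of them carries little extra information and may make the learned weights unstable.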

## Equations Involved

### Linear Regression Equation

f(x) = c + b1x1 + b2x2 + ... + bnxn

### Logistic Regression Equation

log(y/(1-y)) = c + b1x1 + b2x2 + ... + bnxn

### Sigmoid Function

y = 1/(1 + e^(-x))
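
The sigmoid and the log-odds are inverses of each other, which is why a model that is linear in the log-odds yields probabilities once the sigmoid is applied. A minimal sketch, where `z` stands for the linear combination c + b1x1 + b2x2 + ... and the function name `logit` is chosen here for illustration:

```python
import numpy as np

def sigmoid(z):
    # Maps any real number to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    # Log-odds of a probability p; the inverse of the sigmoid.
    return np.log(p / (1.0 - p))

z = np.array([-2.0, 0.0, 3.0])
p = sigmoid(z)

print(p)          # probabilities
print(logit(p))   # recovers z up to floating-point error
```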

### Calculating Error Formula

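J(w,b) = -(1/N) (Σ(i=1 to N)[(yi * log(f(xi))) + ((1-yi) * log(1-f(xi)))])

J'(w) = 1/N (Σ(i=1 to N)[xi * (ŷi - yi)])

J'(b) = 1/N (Σ(i=1 to N)[(ŷi - yi)])

where N is the number of samples, f(xi) is the sigmoid of the linear combination for sample i, and ŷi = f(xi) is the predicted probability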
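
The cost and its gradients can be checked numerically. The sketch below uses the usual negative log-likelihood form of the cross-entropy cost (with a leading minus sign) and gradients without a factor of 2, then compares the analytic gradients against central finite differences on synthetic data. The function names `cost` and `gradients` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    # Binary cross-entropy averaged over the samples.
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradients(w, b, X, y):
    # Analytic gradients of the cost with respect to w and b.
    p = sigmoid(X @ w + b)
    dw = X.T @ (p - y) / len(y)
    db = np.mean(p - y)
    return dw, db

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w = rng.normal(size=3)
b = 0.1

dw, db = gradients(w, b, X, y)

# Central finite differences, one coordinate at a time.
eps = 1e-6
basis = np.eye(3)
dw_numeric = np.array([
    (cost(w + eps * basis[j], b, X, y) - cost(w - eps * basis[j], b, X, y)) / (2 * eps)
    for j in range(3)
])
db_numeric = (cost(w, b + eps, X, y) - cost(w, b - eps, X, y)) / (2 * eps)

print(np.allclose(dw, dw_numeric, atol=1e-6), np.isclose(db, db_numeric, atol=1e-6))
```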

## Working of the Model

### Training

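The model parameters, the weights and the bias, are learned by minimizing J(w,b) with an optimization algorithm, here gradient descent. At each iteration the parameters move a small step, scaled by the learning rate, against the gradient of the cost.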
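
To see gradient descent at work, the cost can be recorded at every iteration. This is a sketch on synthetic, linearly separable data with an assumed learning rate of 0.1 and 200 iterations; the `history` list and the clipping constant are illustrative choices, not part of the class defined later.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)
b = 0.0
lr = 0.1          # assumed learning rate for this demo
history = []      # cost before each update

for _ in range(200):
    p = sigmoid(X @ w + b)
    # Clip only for the log, so the cost stays finite if p reaches 0 or 1.
    p_safe = np.clip(p, 1e-12, 1 - 1e-12)
    history.append(-np.mean(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe)))
    # Gradient step on the weights and the bias.
    w -= lr * X.T @ (p - y) / len(y)
    b -= lr * np.mean(p - y)

print(history[0], history[-1])
```

With zero initial weights every prediction is 0.5, so the first recorded cost equals log(2); it should fall from there as training proceeds.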

### Prediction

The probability is calculated using the sigmoid function. If the probability exceeds the threshold (0.5), the sample is classified as the positive class (value = 1); otherwise, it is classified as the negative class (value = 0).
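
The thresholding step can be written as a single vectorized comparison. In this small sketch the probabilities are made-up values; note that a probability of exactly 0.5 falls in the negative class, matching the rule above.

```python
import numpy as np

probabilities = np.array([0.10, 0.50, 0.51, 0.93])  # illustrative values
threshold = 0.5

# True where the probability is strictly above the threshold, then cast to 0/1.
labels = (probabilities > threshold).astype(int)
print(labels)  # [0 0 1 1]
```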

# Subtask 2 : Training and Prediction

In [1]:
#importing the necessary libraries
import numpy as np
import pandas as pd

In [2]:
#python function to define the sigmoid function
def sigmoid(x): 
    y = 1/(1+np.exp(-x))
    return y
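
For large negative inputs, `np.exp(-x)` can overflow, and NumPy then emits a RuntimeWarning even though the returned value is still 0.0. One possible alternative, sketched here under the hypothetical name `stable_sigmoid`, rewrites the same function through `np.logaddexp`, which avoids the overflow.

```python
import numpy as np

def stable_sigmoid(x):
    # 1/(1+exp(-x)) == exp(-log(1+exp(-x))) == exp(-logaddexp(0, -x)).
    # np.logaddexp evaluates log(exp(a) + exp(b)) without overflowing.
    return np.exp(-np.logaddexp(0.0, -np.asarray(x, dtype=float)))

x = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
print(stable_sigmoid(x))
```

For moderate inputs the result agrees with the plain formula, so it could be swapped in without changing the rest of the notebook.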

In [3]:
#defining the logistic regression algorithm
class Logistic_Regression(): 

    def __init__(self, lr = 0.01, n_iterations = 1000):
        self.lr = lr
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
    
    def fit(self,X,y): 
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        for i in range(self.n_iterations): #to run for the n number of iterations
            sigmoid_x = np.dot(X,self.weights) + self.bias
            prediction = sigmoid(sigmoid_x) #calculating the sigmoid function
            # calculating the gradients of the cost with respect to the weights and bias
            dw = (1/n_samples)*(np.dot(X.T,(prediction-y)))
            db = (1/n_samples)*(np.sum(prediction - y))
            #updating the weights and bias of the model
            self.weights = self.weights - self.lr*dw
            self.bias = self.bias - self.lr*db
    
    def predict(self,X): #function to predict y for the given x
        sigmoid_x = np.dot(X,self.weights) + self.bias
        prediction = sigmoid(sigmoid_x)
        y_pred = [0 if y<=0.5 else 1 for y in prediction]
        return y_pred
    
    def accuracy(self,y_pred,y): #fraction of predictions that match the true labels
        acc = (np.sum(np.asarray(y_pred)==y))/len(y)
        return acc

In [4]:
#reading the files
train1 = pd.read_csv('ds1_train.csv')
test1 = pd.read_csv('ds1_test.csv')
train2 = pd.read_csv('ds2_train.csv')
test2 = pd.read_csv('ds2_test.csv')

In [5]:
#separating the x and y values
X_train1 = train1.iloc[:,:-1].values
y_train1 = train1.iloc[:,-1].values
X_test1 = test1.iloc[:,:-1].values
y_test1 = test1.iloc[:,-1].values
X_train2 = train2.iloc[:,:-1].values
y_train2 = train2.iloc[:,-1].values
X_test2 = test2.iloc[:,:-1].values
y_test2 = test2.iloc[:,-1].values

In [6]:
#calling the class of logistic regression
log_reg = Logistic_Regression()

In [7]:
#Training the model for dataset 1
log_reg.fit(X_train1,y_train1)

In [8]:
#Predicting the values of dataset 1
y_test1_pred = log_reg.predict(X_test1)
y_train1_pred = log_reg.predict(X_train1)

In [9]:
#Training the model for dataset 2
log_reg.fit(X_train2,y_train2)

In [10]:
#Predicting the values for dataset 2
y_test2_pred = log_reg.predict(X_test2)
y_train2_pred = log_reg.predict(X_train2)

In [11]:
#Printing the accuracies of the predictions done by the model
print('accuracy of the ds1_test.csv',log_reg.accuracy(y_test1_pred,y_test1))
print('accuracy of the ds1_train.csv',log_reg.accuracy(y_train1_pred,y_train1))
print('accuracy of the ds2_test.csv',log_reg.accuracy(y_test2_pred,y_test2))
print('accuracy of the ds2_train.csv',log_reg.accuracy(y_train2_pred,y_train2))

accuracy of the ds1_test.csv 0.78
accuracy of the ds1_train.csv 0.785
accuracy of the ds2_test.csv 0.91
accuracy of the ds2_train.csv 0.90875


# Subtask 3 : Hyperparameter Tuning

The two hyperparameters tuned here are the learning rate and the number of iterations. Each is varied on its own while the other keeps its default value.
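
A common variation is to tune both hyperparameters together and to score each candidate on a validation split held out from the training data, so that the test set stays untouched. The sketch below illustrates this on synthetic data; the helper names `train` and `accuracy`, the 200/100 split, and the shortened iteration grid are all assumptions made for the example.

```python
import itertools
import numpy as np

def train(X, y, lr, n_iterations):
    # Plain gradient descent for logistic regression, starting from zeros.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iterations):
        p = np.exp(-np.logaddexp(0.0, -(X @ w + b)))  # sigmoid, overflow-safe
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(w, b, X, y):
    # sigmoid(z) > 0.5 is equivalent to z > 0.
    return np.mean(((X @ w + b) > 0).astype(int) == y)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Hold out the last 100 rows as a validation split.
X_fit, y_fit = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

results = {}
for lr, n in itertools.product([0.1, 0.01, 0.001], [500, 1000]):
    w, b = train(X_fit, y_fit, lr, n)
    results[(lr, n)] = accuracy(w, b, X_val, y_val)

best = max(results, key=results.get)
print('Best (learning rate, iterations):', best, 'validation accuracy:', results[best])
```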

In [12]:
#Defining a function to find the best learning rate
def best_lr(X_train, y_train, X_test, y_test):
    best_lr = 0
    best_acc = 0
    lr = [0.1,0.01,0.001]
    for i in lr:
        log_reg = Logistic_Regression(lr = i)
        log_reg.fit(X_train,y_train)
        y_pred = log_reg.predict(X_test)
        accuracy = log_reg.accuracy(y_pred,y_test)
    
        if accuracy>best_acc:
            best_acc = accuracy
            best_lr = i

    print('Best Learning Rate for Dataset : ',best_lr)
    print('Best Accuracy for Dataset corresponding to the best learning rate : ',best_acc)

In [13]:
#Getting the best learning rate for dataset 1
print('Dataset 1')
best_lr(X_train1,y_train1,X_test1,y_test1)

Dataset 1
Best Learning Rate for Dataset :  0.01
Best Accuracy for Dataset corresponding to the best learning rate :  0.78


In [14]:
#Getting the best learning rate for dataset 2
print('Dataset 2')
best_lr(X_train2,y_train2,X_test2,y_test2)

Dataset 2
Best Learning Rate for Dataset :  0.1
Best Accuracy for Dataset corresponding to the best learning rate :  0.92


In [15]:
#Defining a function to find the best number of iterations
def best_n_iterations(X_train,y_train,X_test,y_test):
    best_n_iterations = 0
    best_acc = 0
    n_iterations = [500,1000,5000,10000]
    for n in n_iterations:
        log_reg = Logistic_Regression(n_iterations = n)
        log_reg.fit(X_train,y_train)
        y_pred = log_reg.predict(X_test)
        accuracy = log_reg.accuracy(y_pred,y_test)
    
        if accuracy>best_acc:
            best_acc = accuracy
            best_n_iterations = n

    print('Best No of Iterations for Dataset : ',best_n_iterations)
    print('Best Accuracy for Dataset corresponding to the best No of iterations : ',best_acc)

In [16]:
#Getting the best number of iterations for dataset 1
print('Dataset - 1')
best_n_iterations(X_train1,y_train1,X_test1,y_test1)

Dataset - 1
Best No of Iterations for Dataset :  10000
Best Accuracy for Dataset corresponding to the best No of iterations :  0.83


In [17]:
#Getting the best number of iterations for dataset 2
print('Dataset - 2')
best_n_iterations(X_train2,y_train2,X_test2,y_test2)

Dataset - 2
Best No of Iterations for Dataset :  5000
Best Accuracy for Dataset corresponding to the best No of iterations :  0.92


# Subtask 4 : Comparison with Scikit-Learn

In [18]:
#importing the scikit-learn library
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [19]:
#defining a logistic regression function using the scikit-learn library
def scikit_logistic_regression(X_train,y_train,X_test,y_test):
    sk_log_reg = LogisticRegression()
    sk_log_reg.fit(X_train,y_train)
    y_pred = sk_log_reg.predict(X_test)
    accuracy = metrics.accuracy_score(y_test, y_pred)
    return accuracy

In [20]:
#printing the accuracies for the datasets
acc_test_1 = scikit_logistic_regression(X_train1,y_train1,X_test1,y_test1)
print('Accuracy for the test dataset 1 : ',acc_test_1)
acc_train_1 = scikit_logistic_regression(X_train1,y_train1,X_train1,y_train1)
print('Accuracy for the train dataset 1 : ',acc_train_1)
acc_test_2 = scikit_logistic_regression(X_train2,y_train2,X_test2,y_test2)
print('Accuracy for the test dataset 2 : ',acc_test_2)
acc_train_2 = scikit_logistic_regression(X_train2,y_train2,X_train2,y_train2)
print('Accuracy for the train dataset 2 : ',acc_train_2)

Accuracy for the test dataset 1 :  0.9
Accuracy for the train dataset 1 :  0.8825
Accuracy for the test dataset 2 :  0.9
Accuracy for the train dataset 2 :  0.915


## Comparison between the two models

In [21]:
log_reg = Logistic_Regression()

In [22]:
print('Scikit-Learn Logistic Regression Model :: ds1_test.csv :: ',acc_test_1)
print('My Model :: ds1_test.csv :: ',log_reg.accuracy(y_test1_pred,y_test1))
print('Scikit-Learn Logistic Regression Model :: ds1_train.csv :: ',acc_train_1)
print('My Model :: ds1_train.csv :: ',log_reg.accuracy(y_train1_pred,y_train1))
print('Scikit-Learn Logistic Regression Model :: ds2_test.csv :: ',acc_test_2)
print('My Model :: ds2_test.csv :: ',log_reg.accuracy(y_test2_pred,y_test2))
print('Scikit-Learn Logistic Regression Model :: ds2_train.csv :: ',acc_train_2)
print('My Model :: ds2_train.csv :: ',log_reg.accuracy(y_train2_pred,y_train2))

Scikit-Learn Logistic Regression Model :: ds1_test.csv ::  0.9
My Model :: ds1_test.csv ::  0.78
Scikit-Learn Logistic Regression Model :: ds1_train.csv ::  0.8825
My Model :: ds1_train.csv ::  0.785
Scikit-Learn Logistic Regression Model :: ds2_test.csv ::  0.9
My Model :: ds2_test.csv ::  0.91
Scikit-Learn Logistic Regression Model :: ds2_train.csv ::  0.915
My Model :: ds2_train.csv ::  0.90875
