# Logistic Regression (22MF10006)
---
        Note: ALL THE SUBTASKS ARE IN THIS SINGLE PYTHON NOTEBOOK ONLY
# Subtask 1: Mathematical Understanding

- Logistic regression is a supervised machine learning model used for binary classification. Despite its name, it is a classification algorithm, not a regression algorithm. It applies the sigmoid (logistic) function to a linear combination of the input features, which maps the result to a value between 0 and 1 that is interpreted as the probability of the positive class.
    
### Principles and Assumptions:

- It is designed for binary classification, where the target variable can take only two values (e.g. 0 or 1). For example, predicting whether a student will pass the end-semester examinations, with the number of hours studied and the number of hours slept as input features.

- It assumes that the log-odds (also called the logit, where logit(p) = log(p/(1-p))) of the outcome are a linear combination of the input features.

- It also assumes that the observations, and hence the errors in the predicted log-odds, are independent of each other.

- The independent variables (input features) should not be highly correlated with each other.

- A large data set generally gives better results.

### Mathematical Explanation:

- The sigmoid function is used to transform the linear combination of input features and their weights (coefficients) into a value between 0 and 1. It is defined as follows:

        sigma(z) = 1/(1+exp(-z))

        where sigma(z) is the output between 0 to 1,
        exp is the exponential function,
        z is the linear combination of the input features and their respective coefficients or weights

- Logistic regression establishes a linear relationship between the features and the log-odds (logit) of the target variable.

        logit(y) = log( p(y=1)/ (1- p(y=1) ) )

        where p(y=1) is the probability that target variable y belongs to class 1 or the Positive class

        logit(y) = w0+ w1x1+ w2x2 + w3x3 + .....wnxn

        where w0, w1, w2, w3 ...wn represents the coefficients or weights associated with each feature
        x1, x2, x3 ... xn represents the input features
        logit(y) is the logit of the target variable
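
As a quick check of the two formulas above, the sketch below shows that the sigmoid undoes the logit: the linear score z is the log-odds, and sigma(z) is the probability. The function names `sigmoid` and `logit`, the weights and the sample are illustrative assumptions, not part of the notebook's code.

```python
import math

def sigmoid(z):
    """Map a real number z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

# Hypothetical weights and one sample with two features
w0, w = 0.5, [1.0, -2.0]
x = [2.0, 0.75]

z = w0 + sum(wi * xi for wi, xi in zip(w, x))   # linear combination w0 + w1x1 + w2x2
p = sigmoid(z)                                   # probability of class 1

print(z)          # linear score, which is also the log-odds
print(p)          # probability of the positive class
print(logit(p))   # recovers z up to floating-point error
```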


### How model learns and makes predictions:

- Initially the model starts with random values or zero values for all its coefficients w0, w1, w2....wn.

- To measure how well the model is performing, we use a loss function (also called the cost function), which training then minimises. It is defined as

        loss = -(1/n) * sum( y*log(y^) + (1-y)*log(1-y^) )
        where y^ is the predicted probability, y is the actual output, n is the number of training samples, and the sum runs over all training samples


- For every data point in the training set, the logit z is calculated using the current coefficients, and the sigmoid function is then applied to obtain sigma(z), which is used to decide whether the data point belongs to class 1 (the positive class).

- Once we obtain the predicted probabilities, we compare them to the actual target values by calculating the prediction errors, i.e. the differences between the predicted probabilities and the actual target values.

- To minimise the loss function, we update the coefficients using an iterative optimisation algorithm such as gradient descent, which computes the gradient of the loss function with respect to each coefficient and adjusts the coefficients in the direction that reduces the loss.

- These steps repeat until a set number of iterations is reached or a set of optimal coefficients is obtained.

- Once the loss function is sufficiently minimised, we run the model on the test set, applying a threshold such as 0.5 to the probability sigma(z): probabilities of 0.5 or above are assigned to class 1 (the positive class), and probabilities below 0.5 to class 0 (the negative class).
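
The learning steps above can be sketched on a tiny made-up data set. This is a minimal illustration, assuming one centred feature, a learning rate of 0.1 and 200 iterations; the helper name `log_loss` and all the numbers are assumptions chosen for the example, not values from the notebook.

```python
import numpy as np

def sigmoid(z):
    # Numerically stable form of 1 / (1 + exp(-z))
    return np.exp(-np.logaddexp(0, -z))

def log_loss(y, y_hat, eps=1e-12):
    # Mean binary cross-entropy; clipping avoids taking log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Made-up data: one feature, four samples, classes split around 0
X = np.array([[-1.5], [-0.5], [0.5], [1.5]])
y = np.array([0, 0, 1, 1])

w = np.zeros(X.shape[1])   # weights start at zero
b = 0.0                    # bias (w0) starts at zero
learning_rate = 0.1

initial_loss = log_loss(y, sigmoid(X @ w + b))

for _ in range(200):
    y_hat = sigmoid(X @ w + b)                    # predicted probabilities
    error = y_hat - y                             # prediction errors
    w -= learning_rate * (X.T @ error) / len(y)   # gradient step for the weights
    b -= learning_rate * np.sum(error) / len(y)   # gradient step for the bias

final_loss = log_loss(y, sigmoid(X @ w + b))
labels = (sigmoid(X @ w + b) >= 0.5).astype(int)  # threshold at 0.5

print(initial_loss, final_loss, labels)
```

The loss at the zero initialisation is log(2), since every prediction is 0.5; each gradient step then lowers it.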










 ---
# Subtask 2: Training and Prediction

In [1]:
import pandas as pd
import numpy as np

In [2]:
class MyLogisticRegression:

    def __init__(self, learning_rate = 0.001, iters = 10000):

        self.learning_rate = learning_rate
        self.iters = iters
        self.weights = None
        self.bias = None

    

    def sigmoid(self, z):
        return np.exp(-np.logaddexp(0, -z))         # equivalent to 1/(1+e^-z), but avoids the overflow error I was facing in subtask 3
    
    def fit(self, X, y):                            #   X is a NumPy array of shape (no_of_samples, no_of_features)
        samples_count, features_count = X.shape
        self.weights = np.zeros(features_count)
        self.bias = 0

        # Gradient descent, one of several possible optimisation algorithms (another option is the Newton-Raphson method)

        for _ in range(self.iters):                 # loop variable unused; avoids shadowing the built-in iter

            #for getting z = w0 + w1x1 + w2x2 + w3x3 .... wnxn
            linear_eqn = np.dot(X, self.weights) + self.bias
            
            predicted = self.sigmoid(linear_eqn)

            #gradients
            dw = (1 / samples_count) * np.dot(X.T, (predicted - y))
            db = (1 / samples_count) * np.sum(predicted - y)

            #here the bias and weights are updated

            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):

        linear_eqn = np.dot(X, self.weights) + self.bias
        predicted = self.sigmoid(linear_eqn)
        return (predicted >= 0.5).astype(int)
    
    def f1_score(self, y, predictions):                 # used in subtask 3 for hyperparameter tuning
        true_positives = np.sum((y == 1) & (predictions == 1))
        false_positives = np.sum((y == 0) & (predictions == 1))
        false_negatives = np.sum((y == 1) & (predictions == 0))

        # With no true positives, precision and recall are 0 or undefined, so F1 is taken as 0
        if true_positives == 0:
            return 0.0

        precision = true_positives / (true_positives + false_positives)
        recall = true_positives / (true_positives + false_negatives)

        f1 = 2 * (precision * recall) / (precision + recall)
        return f1
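
The expressions for `dw` and `db` in `fit` follow from differentiating the cross-entropy loss. A short derivation, writing the predicted probability as \hat{y}_i = \sigma(z_i) with z_i = w^T x_i + b (here b plays the role of w0):

```latex
\begin{aligned}
L(w, b) &= -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \Bigr],
\qquad \hat{y}_i = \sigma(z_i), \quad z_i = w^{\top} x_i + b \\[4pt]
\sigma'(z) &= \sigma(z)\,\bigl(1 - \sigma(z)\bigr) \\[4pt]
\frac{\partial L}{\partial z_i}
  &= -\frac{1}{n}\left[\frac{y_i}{\hat{y}_i} - \frac{1 - y_i}{1 - \hat{y}_i}\right]\hat{y}_i\,(1 - \hat{y}_i)
   = \frac{1}{n}\,(\hat{y}_i - y_i) \\[4pt]
\frac{\partial L}{\partial w}
  &= \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)\, x_i
   = \frac{1}{n}\, X^{\top}(\hat{y} - y) \\[4pt]
\frac{\partial L}{\partial b}
  &= \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)
\end{aligned}
```

The last two lines correspond to `np.dot(X.T, (predicted - y)) / samples_count` and `np.sum(predicted - y) / samples_count` in the code.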

In [3]:
train = pd.read_csv('ds1_train.csv')
X_train = train[['x_1', 'x_2']].to_numpy()
y_train = train['y'].to_numpy()

In [4]:
model = MyLogisticRegression()
model.fit(X_train, y_train)

In [5]:
predictions = model.predict(X_train)
accuracy = np.mean(predictions == y_train)
print("For training data, accuracy = ", accuracy)

For training data, accuracy =  0.8


In [6]:
test = pd.read_csv('ds1_test.csv')
X_test = test[['x_1', 'x_2']].to_numpy()
y_test = test['y'].to_numpy()

In [7]:
predictions = model.predict(X_test)
accuracy = np.mean(predictions == y_test)
print("For test data, accuracy = ", accuracy)

For test data, accuracy =  0.79


----
# Subtask 3: Hyperparameter Tuning

- The hyperparameters that affect this logistic regression model's performance are the learning rate and the number of iterations performed.

## Fine-tuning the learning rate and number of iterations using the grid search technique

In [8]:
import numpy as np
import pandas as pd

In [9]:
hyper_param = []
learning_rates = [0.01, 0.02, 0.03, 0.04, 0.05, 0.001, 0.002, 0.003, 0.004, 0.005, 0.0001, 0.0002, 0.0003, 0.0004, 0.0005]
iteration_choices = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000]

for i in learning_rates:
    for j in iteration_choices:
        hyper_param.append((i, j))

print("Possible options: ", hyper_param)

Possible options:  [(0.01, 1000), (0.01, 2000), (0.01, 3000), (0.01, 4000), (0.01, 5000), (0.01, 6000), (0.01, 7000), (0.01, 8000), (0.01, 9000), (0.01, 10000), (0.02, 1000), (0.02, 2000), (0.02, 3000), (0.02, 4000), (0.02, 5000), (0.02, 6000), (0.02, 7000), (0.02, 8000), (0.02, 9000), (0.02, 10000), (0.03, 1000), (0.03, 2000), (0.03, 3000), (0.03, 4000), (0.03, 5000), (0.03, 6000), (0.03, 7000), (0.03, 8000), (0.03, 9000), (0.03, 10000), (0.04, 1000), (0.04, 2000), (0.04, 3000), (0.04, 4000), (0.04, 5000), (0.04, 6000), (0.04, 7000), (0.04, 8000), (0.04, 9000), (0.04, 10000), (0.05, 1000), (0.05, 2000), (0.05, 3000), (0.05, 4000), (0.05, 5000), (0.05, 6000), (0.05, 7000), (0.05, 8000), (0.05, 9000), (0.05, 10000), (0.001, 1000), (0.001, 2000), (0.001, 3000), (0.001, 4000), (0.001, 5000), (0.001, 6000), (0.001, 7000), (0.001, 8000), (0.001, 9000), (0.001, 10000), (0.002, 1000), (0.002, 2000), (0.002, 3000), (0.002, 4000), (0.002, 5000), (0.002, 6000), (0.002, 7000), (0.002, 8000), (0.0
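
The same grid can also be built with `itertools.product` from the standard library instead of nested loops. A minimal sketch, using shortened example lists rather than the full lists from the cell above:

```python
from itertools import product

# Shortened example lists; the notebook uses 15 learning rates and 10 iteration counts
learning_rates = [0.01, 0.001, 0.0001]
iteration_choices = [1000, 5000, 10000]

# Every (learning_rate, iterations) pair, in the same order the nested loops would give
hyper_param = list(product(learning_rates, iteration_choices))

print(len(hyper_param))   # 3 * 3 combinations
print(hyper_param[:3])    # first three pairs share the first learning rate
```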

In [10]:
train = pd.read_csv('ds1_train.csv')
X_train = train[['x_1', 'x_2']].to_numpy()
y_train = train['y'].to_numpy()

In [11]:
test = pd.read_csv('ds1_test.csv')
X_test = test[['x_1', 'x_2']].to_numpy()
y_test = test['y'].to_numpy()

### Finding best possible hyperparameter combination based on Accuracy of the model


In [12]:
max_accuracy = 0
best_learning_rate = 0
best_iterations = 0

for l in range(len(hyper_param)):
    model = MyLogisticRegression( learning_rate= hyper_param[l][0], iters = hyper_param[l][1])
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)         #predicting on the test data set

    curr_accuracy = np.mean(predictions == y_test)          #evaluating the accuracy on test data set
    if max_accuracy < curr_accuracy:
        max_accuracy = curr_accuracy
        best_learning_rate = hyper_param[l][0]
        best_iterations = hyper_param[l][1]

print(f"Maximum Accuracy  = {max_accuracy} achieved by our model with learning rate {best_learning_rate} and no of iterations = {best_iterations}")

#this cell takes approximately a minute to run, as I am considering 150 combinations of learning rate and number of iterations

Maximum Accuracy  = 0.87 achieved by our model with learning rate 0.003 and no of iterations = 9000


### Finding best possible hyperparameter combination based on F1 score of the model

- The F1 score takes both precision and recall into account; it is the harmonic mean of the two.
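
A small worked example may help here. The labels and predictions below are made up for illustration; the computation mirrors the `f1_score` method of `MyLogisticRegression`.

```python
import numpy as np

# Made-up true labels and predictions for eight samples
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives: 2
fp = np.sum((y_true == 0) & (y_pred == 1))   # false positives: 1
fn = np.sum((y_true == 1) & (y_pred == 0))   # false negatives: 2

precision = tp / (tp + fp)   # 2 / 3
recall = tp / (tp + fn)      # 2 / 4
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean: 4 / 7

print(precision, recall, f1)
```

Because it is a harmonic mean, the F1 score (4/7) sits below the arithmetic mean of precision and recall (7/12), so it is pulled towards the weaker of the two.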

In [13]:
best_f1_score = 0
best_learning_rate = 0
best_iterations = 0

for l in range(len(hyper_param)):
    model = MyLogisticRegression( learning_rate= hyper_param[l][0], iters = hyper_param[l][1])
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)         #predicting on the test data set

    curr_f1 = model.f1_score(y_test, predictions)          #evaluating the f1 score on test data set; named curr_f1 so it cannot clash with sklearn's f1_score in subtask 4
    if curr_f1 > best_f1_score:
        best_f1_score = curr_f1
        best_learning_rate = hyper_param[l][0]
        best_iterations = hyper_param[l][1]
        
print(f"Best f1 score  = {best_f1_score} achieved by our model with learning rate {best_learning_rate} and no of iterations = {best_iterations}")

#this cell takes approximately half a minute to run, as I am considering 150 combinations of learning rate and number of iterations to check which is the best

Best f1 score  = 0.8659793814432989 achieved by our model with learning rate 0.003 and no of iterations = 9000


- The best combination of learning rate and number of iterations may differ depending on the environment in which the code is run. For example, I got a different result when I ran my code on Google Colab than when I ran it locally in VS Code on a Mac.

- I also avoided iteration counts in the 100,000 range, as I did not want to increase the load on the computer. The grid search could be extended to larger iteration counts to cover more possibilities and possibly find a better-fitting model.
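
Both searches above choose the hyperparameters by scoring on `ds1_test.csv`, so the reported test scores are also the scores used for selection. One common variation, which this notebook does not use, is to hold out part of the training data as a validation set and keep the test set for a single final evaluation. A minimal sketch, assuming an 80/20 split; the helper name `train_val_split` is hypothetical and synthetic data stands in for the arrays loaded from `ds1_train.csv`.

```python
import numpy as np

def train_val_split(X, y, val_fraction=0.2, seed=0):
    """Shuffle the rows and split them into training and validation parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))          # random ordering of row indices
    n_val = int(len(y) * val_fraction)       # number of validation rows
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Synthetic stand-in for X_train and y_train (assumption: two features, binary labels)
rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

X_fit, y_fit, X_val, y_val = train_val_split(X_train, y_train)
print(X_fit.shape, X_val.shape)
```

Each (learning rate, iterations) pair would then be fitted on `X_fit`, `y_fit` and scored on `X_val`, `y_val`, with the test set used once for the chosen pair.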

----
# Subtask 4: Comparison with Scikit-Learn

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

In [15]:
train = pd.read_csv('ds1_train.csv')
X_train = train[['x_1', 'x_2']].to_numpy()
y_train = train['y'].to_numpy()

test = pd.read_csv('ds1_test.csv')
X_test = test[['x_1', 'x_2']].to_numpy()
y_test = test['y'].to_numpy()

In [16]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

model = MyLogisticRegression(learning_rate=0.003, iters=9000)           #   learning_rate 0.003 and 9000 iterations were chosen
model.fit(X_train, y_train)                                             #   as they gave the best test accuracy and f1 score in VS Code;
                                                                        #   alternatively one can pass best_learning_rate and best_iterations
                                                                        #   from the grid search in subtask 3

In [17]:
predictions_sk_train = logmodel.predict(X_train)
predictions_my_train = model.predict(X_train)

print("On Training DATA SET ds1_train: ")
print("F1_Score for Scikit-Learn model of logistic regression is ", f1_score(y_train, predictions_sk_train))
print("Accuracy for Scikit-Learn model of logistic regression is ", accuracy_score(y_train, predictions_sk_train))
print('')

print("F1_Score for My Model of logistic regression is ", model.f1_score(y_train, predictions_my_train))
print("Accuracy for My Model of logistic regression is ", np.mean(predictions_my_train == y_train))


On Training DATA SET ds1_train: 
F1_Score for Scikit-Learn model of logistic regression is  0.8839506172839507
Accuracy for Scikit-Learn model of logistic regression is  0.8825

F1_Score for My Model of logistic regression is  0.8010973936899863
Accuracy for My Model of logistic regression is  0.81875


In [18]:
predictions_sk_test = logmodel.predict(X_test)
predictions_my_test = model.predict(X_test)


print("On Test DATA SET ds1_test: ")
print("F1_Score for Scikit-Learn model of logistic regression is ", f1_score(y_test, predictions_sk_test))
print("Accuracy for Scikit-Learn model of logistic regression is ", accuracy_score(y_test, predictions_sk_test))
print('')

print("F1_Score for My Model of logistic regression is ",model.f1_score(y_test, predictions_my_test ))
print("Accuracy for My Model of logistic regression is ", np.mean(predictions_my_test == y_test))

On Test DATA SET ds1_test: 
F1_Score for Scikit-Learn model of logistic regression is  0.9056603773584904
Accuracy for Scikit-Learn model of logistic regression is  0.9

F1_Score for My Model of logistic regression is  0.8659793814432989
Accuracy for My Model of logistic regression is  0.87
