# Hyperparameter Tuning of Logistic Regression

#### In this subtask, hyperparameters are tuned using the scikit-learn library

NumPy is used to create arrays and for operations such as the exponential function, the dot product and the matrix transpose.

Pandas is used to read data from the CSV files; the data is later converted into NumPy arrays.

sklearn.linear_model provides the LogisticRegression model with its built-in features.

sklearn.model_selection provides GridSearchCV, which searches a given grid of hyperparameter values and reports the combination with the best cross-validated score (for example, accuracy).
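
The cells below run the search with hand-written loops, so for comparison here is a minimal sketch of how `GridSearchCV` could be combined with `LogisticRegression`. The synthetic data from `make_classification`, the grid of `C` values and the 5-fold cross-validation are illustrative assumptions, not values taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic two-feature binary dataset (assumption: a stand-in for ds1_train.csv)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# Hypothetical grid over C, the inverse regularisation strength
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

Unlike the manual search later in this notebook, `GridSearchCV` scores each combination on held-out folds rather than on the data used for fitting.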

In [18]:
# Importing libraries

import numpy as np
import pandas as pd

The data is loaded with the pandas.read_csv function, which reads a CSV file and returns its contents as a DataFrame.
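
As a small illustration of what `read_csv` returns, the sketch below parses an in-memory CSV string instead of a file. The two data rows and the name `df_demo` are made up for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content with the same column names as the dataset
csv_text = "x_1,x_2,y\n2.9,60.3,0.0\n4.0,12.2,1.0\n"

# read_csv accepts a file path or any file-like object
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.shape)
print(list(df_demo.columns))
```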

In [19]:
# Initialising the training and test examples
# df_train represents the training DataFrame

df_train = pd.read_csv('ds1_train.csv')
df_test = pd.read_csv('ds1_test.csv')

In [20]:
# Printing the DataFrame df_train to see the type of data
# DataFrame.head(n) returns the first n rows of the DataFrame. By default n=5

df_train.head(10)

        x_1         x_2    y
0  2.911809   60.359613  0.0
1  3.774746  344.149284  0.0
2  2.615488  178.222087  0.0
3  2.013694   15.259472  0.0
4  2.757625   66.194174  0.0
5  0.973922   41.677665  0.0
6  3.067275  143.275590  0.0
7  2.763094   35.969906  0.0
8  2.775772   29.569079  0.0
9  2.109830   76.636721  0.0


In [21]:
# It's important to check for the null values in both the training and the test dataset

print(df_train.isnull().sum())
print(df_test.isnull().sum())

x_1    0
x_2    0
y      0
dtype: int64
x_1    0
x_2    0
y      0
dtype: int64


In [22]:
# Printing the data entries of column y to see the types of labels
# Series.value_counts() returns how many times each distinct value occurs in a particular column

df_train['y'].value_counts()

0.0    400
1.0    400
Name: y, dtype: int64

In [23]:
# Printing the total number of entries in a column
# Series.count() returns the number of non-null entries in a column

df_train['y'].count()

800

The checks above show that the dataset has three columns: 'x_1', 'x_2' and 'y'.
There are 800 entries in total, and column 'y' contains only two labels, 0 and 1.

Because the target is binary, Logistic Regression is a suitable model.

To fit an intercept term, we add a column labelled 'x_0' whose entries are all equal to 1.
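
To see what the column of ones does, the short sketch below checks that multiplying the augmented matrix by theta gives the same result as adding an intercept explicitly. The arrays `features` and `theta_demo` and the numbers in them are made up for illustration.

```python
import numpy as np

# Two hypothetical samples with features x_1 and x_2
features = np.array([[2.0, 60.0],
                     [3.0, 10.0]])

# Prepend a column of ones so that theta_demo[0] acts as the intercept
X_demo = np.column_stack([np.ones(len(features)), features])

# Hypothetical parameters: [intercept, weight for x_1, weight for x_2]
theta_demo = np.array([0.5, 1.0, -0.1])

with_ones = X_demo.dot(theta_demo)
explicit = theta_demo[0] + features.dot(theta_demo[1:])

print(with_ones)
print(np.allclose(with_ones, explicit))
```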

In [71]:
# Adding a column of data entries as 1

df_train['x_0'] = 1
df_test['x_0'] = 1

# Separating the X and y of the training and test data
# The DataFrame.values attribute returns the data of the DataFrame as a NumPy array

X_train = df_train[['x_0', 'x_1', 'x_2']].values
y_train = df_train['y'].values

X_test = df_test[['x_0', 'x_1', 'x_2']].values
y_test = df_test['y'].values

X_train

array([[  1.        ,   2.91180854,  60.35961272],
       [  1.        ,   3.77474554, 344.1492843 ],
       [  1.        ,   2.61548828, 178.22208681],
       ...,
       [  1.        ,   2.96909526,  20.24997848],
       [  1.        ,   3.95753102,  27.26196973],
       [  1.        ,   4.02533402,  12.23316511]])

For reference, Logistic Regression uses the sigmoid function as its hypothesis function.

In [72]:
# In Logistic Regression hypothesis function is the sigmoid function
# Defining the sigmoid function

def sigmoid(x):
    return 1/(1 + np.exp(-x))
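
For inputs that are large and negative, `np.exp(-x)` can overflow and NumPy emits a RuntimeWarning. One common way around this, sketched below under the assumption that array inputs are used, is to pick a formula per element so that the exponent is never positive. The name `stable_sigmoid` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid that keeps the argument of np.exp non-positive."""
    # Scalars are promoted to one-element arrays
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.empty_like(x)
    pos = x >= 0

    # For x >= 0, exp(-x) <= 1, so the usual formula is safe
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))

    # For x < 0, use the equivalent form exp(x) / (1 + exp(x))
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)
    return out

z = np.array([-1000.0, 0.0, 1000.0])
print(stable_sigmoid(z))
```

Both branches are algebraically the same function, so the results agree with `1/(1 + np.exp(-x))` wherever that expression does not overflow.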

Next comes the gradient ascent function, in which theta is updated every time the loop runs.

alpha is the learning rate and num_iter is the number of iterations the loop runs.
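
As a sketch of what each pass of the loop is meant to compute, with $m$ the number of training examples, $X$ the feature matrix including the column of ones, and $\sigma$ the sigmoid function:

$$
h = \sigma(X\theta), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

$$
\theta \leftarrow \theta - \alpha \cdot \frac{1}{m} X^{\top}(h - y)
$$

Subtracting $\frac{1}{m} X^{\top}(h - y)$ is the same as adding $\frac{1}{m} X^{\top}(y - h)$, the gradient of the average log-likelihood, so this update can be read either as gradient descent on the loss or as gradient ascent on the likelihood.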

In [77]:
# Declaring gradient ascent function
# array.shape gives the dimensions of array

def gradient_ascent(X, y, alpha, num_iter):
    
    m, n = X.shape
    
    # Initialising theta as an array of zeros
    # numpy.zeros(n) creates a one-dimensional array of length n
    
    theta = np.zeros(n)
    
    # The following for loop calculates the theta value required for prediction
    # No transpose is needed because theta is a 1-D array, so np.dot(X, theta) is already a matrix-vector product.
    
    # numpy.dot computes the product of X and theta
    # The sigmoid function is then applied to that product to get the predictions h
    
    for i in range(num_iter):
        h = sigmoid(np.dot(X,theta))
        
        # h holds the predicted probabilities and y holds the true labels passed to gradient_ascent
        
        gradient = np.dot(X.T, h - y) / m
        theta = theta - alpha * gradient
    return theta

# This function thus returns theta 

### LogisticRegression() has many hyperparameter values such as penalty, solver and C.

'alpha' is the learning rate, a hyperparameter used in optimization algorithms such as gradient descent to control the step size at each iteration when updating the model's parameters during training.

The number of iterations, often referred to as "epochs" or "iterations," is a critical hyperparameter that determines how many times the learning algorithm will update the model's parameters during training. Each iteration involves passing the entire training dataset through the algorithm and updating the model's parameters based on the gradients of the loss function.

### GridSearch

GridSearch is a technique for hyperparameter tuning and model selection in machine learning. It is a systematic way of searching through a predefined set of hyperparameter values to find the combination that yields the best model performance. GridSearch is commonly used to optimize the hyperparameters of a machine learning model and improve its generalization on unseen data.
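
As a minimal sketch of what "searching through a predefined set" means, the grid is simply every combination of the candidate values. The lists below are illustrative, and `itertools.product` is one convenient way to enumerate the combinations; nested for loops do the same thing.

```python
from itertools import product

# Illustrative candidate values for two hyperparameters
alpha_values = [0.001, 0.01, 0.1]
num_iter_values = [10000, 100000, 200000]

# Every (alpha, num_iter) pair in the grid
grid = list(product(alpha_values, num_iter_values))

print(len(grid))
for a, n in grid[:3]:
    print(a, n)
```

The number of combinations is the product of the list lengths, so the cost of a grid search grows quickly as more hyperparameters or values are added.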

In the following function, nested for loops go through every combination of alpha and number of iterations and calculate the theta corresponding to those values. For each combination the function calculates and prints the training accuracy and keeps the best-performing combination. Finally, it calculates the test accuracy for that best combination.

In [87]:
# Defining GridSearch function
# It takes the parameters X_train, y_train, X_test, y_test, alpha, num_iter
# X_train is the training set and contains the features used to fit the model
# y_train contains the training labels that the training predictions are compared with
# X_test is the test set and contains the features used for evaluation
# y_test contains the test labels that the test predictions are compared with
# alpha is the list of learning rates to try
# num_iter is the list of iteration counts to try

def GridSearch(X_train, y_train, X_test, y_test, alpha, num_iter):
    
    # Initialising each calculating term to zero
    
    max_accur = 0
    max_accur_iter = 0
    max_accur_alpha = 0
    max_accur_theta = np.zeros(3)
    
    # Implementing GridSearch
    # Defining the for loop to run 
    # First loop is the loop running through alpha parameters list
    # Second loop is the loop running through number of iterations parameters list
    
    # a runs through every value in the alpha list
    
    for a in alpha:
        
        # n_iter runs through every value in the num_iter list
        # (named n_iter to avoid shadowing the built-in iter)
        
        for n_iter in num_iter:
            
            # Calculating theta by calling gradient_ascent function
            
            theta = gradient_ascent(X_train, y_train, a, n_iter)

            # Calculating the sigmoid values and accuracy
            
            h_train = sigmoid(X_train.dot(theta))
            y_pred_train = (h_train >= 0.5).astype(int)
            train_accur = np.mean(y_pred_train == y_train)
            print('Accur: ', a, ' ', n_iter, ' ', train_accur)
            
            # Keeps the combination with the highest training accuracy (ties go to the later combination)

            if train_accur >= max_accur:
                max_accur = train_accur
                max_accur_iter = n_iter
                max_accur_alpha = a
                max_accur_theta = theta

    # Calculating accuracy on the test set using the best theta found above
    
    h_test = sigmoid(X_test.dot(max_accur_theta))
    y_pred_test = (h_test >= 0.5).astype(int)
    test_accuracy = np.mean(y_pred_test == y_test)
    
    # Returning the calculated parameters

    return max_accur, max_accur_iter, max_accur_alpha, max_accur_theta, test_accuracy
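
The accuracy calculation used in the function can be checked on a tiny worked example. The probabilities and labels below are made up; the thresholding and averaging steps are the same ones the function applies.

```python
import numpy as np

# Hypothetical sigmoid outputs and true labels for four samples
probs = np.array([0.9, 0.4, 0.5, 0.2])
labels = np.array([1.0, 1.0, 1.0, 0.0])

# Probabilities at or above 0.5 are predicted as class 1
preds = (probs >= 0.5).astype(int)

# Accuracy is the fraction of predictions that match the labels
accuracy = np.mean(preds == labels)

print(preds)
print(accuracy)
```

Here three of the four predictions match the labels, so the accuracy is 0.75.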

In [88]:
# Declaring the alpha and num_iter parameters

alpha = [0.001, 0.01, 0.1]
num_iter = [10000, 100000, 200000, 300000, 400000]

In [90]:
# Calculating the required parameters from GridSearch

max_accur, max_accur_iter, max_accur_alpha, max_accur_theta, test_accuracy = GridSearch(X_train, y_train, X_test, y_test, alpha, num_iter)

Accur:  0.001   10000   0.8
Accur:  0.001   100000   0.8925
Accur:  0.001   200000   0.8925
Accur:  0.001   300000   0.88125
Accur:  0.001   400000   0.88125
Accur:  0.01   10000   0.78
Accur:  0.01   100000   0.8775
Accur:  0.01   200000   0.84625
Accur:  0.01   300000   0.84875
Accur:  0.01   400000   0.85


Accur:  0.1   10000   0.6525
Accur:  0.1   100000   0.87625
Accur:  0.1   200000   0.87875
Accur:  0.1   300000   0.88375
Accur:  0.1   400000   0.87875


In [91]:
# Printing the best parameters

print("Best hyperparameters:")
print("Iterations:", max_accur_iter)
print("Learning Rate:", max_accur_alpha)
print("Training Accuracy:", max_accur)
print("Test Accuracy:", test_accuracy)

Best hyperparameters:
Iterations: 200000
Learning Rate: 0.001
Training Accuracy: 0.8925
Test Accuracy: 0.9


Therefore, the best hyperparameters are:
    
alpha = 0.001

num_iter = 200000

## Thank You