# Logistic_Regression

In this subtask, a Logistic Regression model is trained on 'ds1_train.csv' and then tested on 'ds1_test.csv'.

The required libraries for this task are numpy and pandas.

Numpy is used to create arrays and provides functions such as the exponential, the dot product and the matrix transpose.

Pandas is used to read the data from the csv files into DataFrames, which are then converted into numpy arrays.

In [1]:
# Importing required libraries

import numpy as np
import pandas as pd

The data is loaded with the pandas.read_csv function, which reads a csv file and returns its contents as a DataFrame.

In [64]:
# Initialising the training and test examples
# df_train is the training DataFrame and df_test is the test DataFrame

df_train = pd.read_csv('ds1_train.csv')
df_test = pd.read_csv('ds1_test.csv')
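
The loading step can be tried without the data files. The sketch below is only an illustration: it reads a small in-memory CSV whose header layout (x_1, x_2, y) is an assumption, and `csv_text`, `df_sample`, `X_sample` and `y_sample` are hypothetical names that do not appear in the notebook.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for 'ds1_train.csv'; the header layout is an assumption
csv_text = """x_1,x_2,y
2.911809,60.359613,0.0
3.774746,344.149284,0.0
2.615488,178.222087,0.0
"""

# read_csv accepts a file path or any file-like object
df_sample = pd.read_csv(io.StringIO(csv_text))

# Column names come from the header row
print(df_sample.columns.tolist())
print(df_sample.shape)

# .to_numpy() (like .values) exposes the data as a numpy array
X_sample = df_sample[['x_1', 'x_2']].to_numpy()
y_sample = df_sample['y'].to_numpy()
```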

In [65]:
# Printing the DataFrame df_train to see the type of data
# DataFrame.head(n) returns the first n rows of the DataFrame. By default n=5

df_train.head(10)

        x_1         x_2    y
0  2.911809   60.359613  0.0
1  3.774746  344.149284  0.0
2  2.615488  178.222087  0.0
3  2.013694   15.259472  0.0
4  2.757625   66.194174  0.0
5  0.973922   41.677665  0.0
6  3.067275  143.275590  0.0
7  2.763094   35.969906  0.0
8  2.775772   29.569079  0.0
9  2.109830   76.636721  0.0


In [95]:
# It's important to check for the null values in both the training and the test dataset

print(df_train.isnull().sum())
print(df_test.isnull().sum())

x_1    0
x_2    0
y      0
dtype: int64
x_1    0
x_2    0
y      0
dtype: int64


In [66]:
# Printing the counts of the entries in column y to see the types of labels
# Series.value_counts() returns how many times each distinct value occurs in a column

df_train['y'].value_counts()

0.0    400
1.0    400
Name: y, dtype: int64

In [67]:
# Printing the total number of entries in a column
# Series.count() returns the number of non-null entries in a column

df_train['y'].count()

800

The checks above show that the dataset has 3 columns named 'x_1', 'x_2' and 'y', with no missing values.
The training set has 800 entries in total. Column 'y' contains only two labels, 0 and 1, with 400 entries each.

Since the target is binary, Logistic Regression is a suitable model for this data.

Logistic Regression needs an intercept term, so a column labelled 'x_0' with every entry equal to 1 is added to the features.
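
To see what the column of ones does, the sketch below compares a single dot product against the same score written out term by term. The parameter values and the two feature rows are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical parameters and two feature rows, chosen only for illustration
theta = np.array([-1.0, 2.0, 0.5])           # [theta_0, theta_1, theta_2]
features = np.array([[3.0, 4.0],
                     [1.0, 2.0]])             # columns: x_1, x_2

# Prepend x_0 = 1 to every row
X = np.column_stack([np.ones(len(features)), features])

# With the column of ones, a single dot product includes the intercept theta_0
z_dot = X.dot(theta)

# The same scores written out term by term
z_manual = theta[0] + theta[1] * features[:, 0] + theta[2] * features[:, 1]

print(z_dot)
print(z_manual)
```

Both lines print the same values, so theta_0 acts as the intercept without needing separate handling in the code.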

In [2]:
# Adding the intercept column x_0 with every entry equal to 1

df_train['x_0'] = 1
df_test['x_0'] = 1

# Separating the X and y of the training and test data
# DataFrame.values returns the data of the DataFrame as a numpy array

X_train = df_train[['x_0', 'x_1', 'x_2']].values
y_train = df_train['y'].values

X_test = df_test[['x_0', 'x_1', 'x_2']].values
y_test = df_test['y'].values

X_train


For reference, Logistic Regression uses the sigmoid function as its hypothesis function.

In [69]:
# In Logistic Regression the hypothesis function is the sigmoid function
# Defining the sigmoid function

def sigmoid(x):
    return 1/(1 + np.exp(-x))
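
For inputs that are large and negative, np.exp(-x) can overflow and trigger a runtime warning. One common workaround, sketched here under the assumption that the input is a numpy array, is to evaluate the two sign branches separately. The name `stable_sigmoid` is hypothetical and is not used elsewhere in the notebook.

```python
import numpy as np

def stable_sigmoid(z):
    # Work on a 1-D float array so boolean masks can be used
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0

    # For z >= 0, exp(-z) is at most 1, so it cannot overflow
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))

    # For z < 0, rewrite 1 / (1 + exp(-z)) as exp(z) / (1 + exp(z))
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```

For moderate inputs this returns the same values as the plain definition; it only differs in how extreme inputs are handled.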

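Next comes the gradient ascent function, in which theta is updated every time the loop runs. The code subtracts alpha times the gradient of the average negative log-likelihood, which is the same step as gradient ascent on the average log-likelihood.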

alpha is the learning rate and num_iter is the number of iterations the loop runs.
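
For reference, the update can be derived from the log-likelihood. The notation below (m examples, x^{(i)} for the i-th row of X, h for the vector of predictions) follows the usual textbook formulation and is an assumption, since the notebook does not state it.

```latex
\begin{aligned}
h_\theta(x) &= \frac{1}{1 + e^{-\theta^{T} x}} \\
\ell(\theta) &= \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] \\
\nabla_\theta \ell(\theta) &= \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)} = X^{T} (y - h) \\
\theta &:= \theta + \frac{\alpha}{m} X^{T} (y - h) = \theta - \frac{\alpha}{m} X^{T} (h - y)
\end{aligned}
```

The last line shows why an update written with a minus sign and (h - y) is still an ascent step on the log-likelihood.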

In [80]:
# Declaring gradient ascent function
# array.shape gives the dimensions of the array as (rows, columns)

m_train, n_train = X_train.shape
m_test, n_test = X_test.shape
def gradient_ascent(X, y, alpha, num_iter):
    
    # Taking the dimensions from the X passed in, so the function does not depend on the global X_train
    m, n = X.shape
    
    # Declaring theta as an array of zeros
    # numpy.zeros(n) creates a one-dimensional array of length n
    
    theta = np.zeros(n)
    
    # The following for loop calculates the theta value required for prediction
    # No transpose is needed because theta is a one-dimensional array, so X.dot(theta) returns one value per example
    
    # numpy.dot does the matrix product
    # X.dot(theta) is calculated first and the sigmoid function is then applied to the result
    
    for i in range(num_iter):
        h = sigmoid(X.dot(theta))
        
        # h is the predicted probability for each example and y is the label array passed to gradient_ascent
        
        gradient = np.dot(X.T, h - y) / m
        theta = theta - alpha * gradient
    return theta

# This function thus returns theta
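
One way to gain confidence in the gradient expression is to compare it against finite differences. The sketch below does this on synthetic data rather than the ds1 dataset. All helper names (`avg_neg_log_likelihood`, `analytic_gradient`, `numeric_gradient`) and the demo arrays are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def avg_neg_log_likelihood(theta, X, y):
    # Average negative log-likelihood, the quantity the update decreases
    h = sigmoid(X.dot(theta))
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_gradient(theta, X, y):
    # Same expression as in the training loop: X^T (h - y) / m
    h = sigmoid(X.dot(theta))
    return X.T.dot(h - y) / X.shape[0]

def numeric_gradient(theta, X, y, eps=1e-6):
    # Central finite differences, one coordinate at a time
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (avg_neg_log_likelihood(theta + step, X, y)
                   - avg_neg_log_likelihood(theta - step, X, y)) / (2 * eps)
    return grad

# Hypothetical synthetic data, not the ds1 dataset
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y_demo = (rng.random(50) < 0.5).astype(float)
theta_demo = np.array([0.1, -0.2, 0.3])

max_diff = np.max(np.abs(analytic_gradient(theta_demo, X_demo, y_demo)
                         - numeric_gradient(theta_demo, X_demo, y_demo)))
print(max_diff)
```

A very small difference between the two gradients suggests the analytic expression matches the objective being optimised.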

In [85]:
# Defining the alpha and number of iterations parameters

alpha = 0.01
num_iter = 100000

# Calling the gradient_ascent function and defining the updated theta 

updated_theta = gradient_ascent(X_train, y_train, alpha, num_iter)
updated_theta

array([-15.73156603,   6.60957114,  -0.13703499])
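
The learned parameters define a linear decision boundary where theta_0 + theta_1 * x_1 + theta_2 * x_2 = 0. As a rough sketch, the snippet below solves that equation for x_2 using the theta values printed above. The helper name `boundary_x2` and the choice of x_1 = 3.0 are hypothetical.

```python
import numpy as np

# Parameters as printed by the notebook above
theta = np.array([-15.73156603, 6.60957114, -0.13703499])

def boundary_x2(x_1, theta):
    # Solve theta_0 + theta_1 * x_1 + theta_2 * x_2 = 0 for x_2
    return -(theta[0] + theta[1] * x_1) / theta[2]

x_1 = 3.0
x_2 = boundary_x2(x_1, theta)

# On the boundary the linear score is zero, so the sigmoid gives 0.5
score = theta.dot(np.array([1.0, x_1, x_2]))
print(x_2, score)
```

Because theta_2 is negative, points with x_2 below the boundary value for a given x_1 get a probability of at least 0.5 and are predicted as 1.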

After getting the theta parameters, the model can predict the labels of the training and test data, and the accuracy of those predictions can be calculated.

In [87]:
# Calculating the predicted probabilities for the training and test data
h_train = sigmoid(X_train.dot(updated_theta))
h_test = sigmoid(X_test.dot(updated_theta))

# Converting the probabilities into labels 0 or 1
# If the sigmoid function returns a value less than 0.5, the prediction is 0
# whereas if it returns a value greater than or equal to 0.5, the prediction is 1

# Calculating the accuracy by counting the correct predictions with numpy.sum and dividing by the total number of examples

# Calculating the accuracy of training dataset
h_train[h_train < 0.5] = 0
h_train[h_train >= 0.5] = 1
k_train = np.sum(h_train == y_train)
print('Accuracy of training set: ', k_train / m_train )

# Calculating the accuracy of test dataset
h_test[h_test < 0.5] = 0
h_test[h_test >= 0.5] = 1
k_test = np.sum(h_test == y_test)
print('Accuracy of test set: ', k_test / m_test)

Accuracy of training set:  0.8775
Accuracy of test set:  0.9


The model reaches an accuracy of 0.8775 on the training dataset.

The accuracy on the test dataset is 0.9.
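
The thresholding and accuracy steps can also be written without overwriting the probability arrays. The following is a minimal sketch on toy data, not the ds1 dataset. `predict_labels` and `accuracy` are hypothetical helper names.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict_labels(X, theta, threshold=0.5):
    # Probabilities at or above the threshold become 1.0, the rest 0.0
    return (sigmoid(X.dot(theta)) >= threshold).astype(float)

def accuracy(y_pred, y_true):
    # Fraction of predictions that match the labels
    return np.mean(y_pred == y_true)

# Hypothetical toy data, not the ds1 dataset
X_demo = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -3.0]])
theta_demo = np.array([0.0, 1.0])
y_demo = np.array([1.0, 0.0, 0.0, 0.0])

pred = predict_labels(X_demo, theta_demo)
print(pred, accuracy(pred, y_demo))
```

Keeping the probabilities and the labels in separate arrays makes it possible to reuse the probabilities later, for example to try a different threshold.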