<h3>Logistic Regression implemented on a Dataset with Binary Output</h3>

In [1]:
#Import Libraries
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.metrics import classification_report
from sklearn import preprocessing
from tqdm import tqdm
from time import time

<li>The dataset we are using is Breast Cancer Wisconsin (Diagnostic) Data Set.
<li>The dataset consists of 8 different independent parameters listed below-
                       <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;['id','clump_size','cell_size','cell_shape','marginal_adhesion','ep_cell_size','bare_nuclei','bland_chromatin','nor_nucleoli','mitosis','class']
<li>The parameters are ordinal categorical variables
<li>Our goal is to predict the class. 2 stands for benign cancer & 4 stands for malignant cancer. We changed 2 to 0 and 4 to 1 for convenience.
<li>Other details about the dataset can be found in the 
    <a href="https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)">link</a>.


In [2]:
names=['id','clump_size','cell_size','cell_shape','marginal_adhesion','ep_cell_size','bare_nuclei','bland_chromatin','nor_nucleoli','mitosis','class']
data=pd.read_csv('breast-cancer-wisconsin.data',names=names)
data=data.drop('id',axis=1)
#Dropped the columns with missing data in the bare_nuclei column
data=data[data['bare_nuclei']!='?']

In [3]:
#Output data points renamed from 2 & 4 to 0 & 2
data['class'][data['class']==2]=0
data['class'][data['class']==4]=1

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [4]:
#Normalizing the Dataset
x = data.values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
data = pd.DataFrame(x_scaled)
data.columns=['clump_size','cell_size','cell_shape','marginal_adhesion','ep_cell_size','bare_nuclei','bland_chromatin','nor_nucleoli','mitosis','class']
data=data.sample(frac=1)

In [5]:
#Displaying the dataset
data

Unnamed: 0,clump_size,cell_size,cell_shape,marginal_adhesion,ep_cell_size,bare_nuclei,bland_chromatin,nor_nucleoli,mitosis,class
372,0.666667,0.555556,0.555556,0.222222,0.111111,1.000000,0.666667,0.000000,0.000000,1.0
68,0.444444,0.000000,0.222222,0.000000,0.111111,0.000000,0.111111,0.000000,0.000000,0.0
145,0.777778,0.777778,0.666667,0.333333,1.000000,1.000000,0.666667,0.777778,0.666667,1.0
635,0.000000,0.111111,0.000000,0.222222,0.111111,0.000000,0.111111,0.000000,0.000000,0.0
295,0.000000,0.000000,0.000000,0.000000,0.111111,0.000000,0.222222,0.000000,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...
531,0.555556,1.000000,1.000000,1.000000,0.333333,1.000000,0.666667,1.000000,0.000000,1.0
521,0.444444,0.000000,0.000000,0.000000,0.111111,0.000000,0.222222,0.000000,0.000000,0.0
316,1.000000,0.333333,0.666667,0.111111,0.111111,0.777778,0.555556,0.000000,0.000000,1.0
265,0.666667,0.111111,0.333333,0.000000,0.222222,0.333333,0.222222,0.222222,0.000000,1.0


In [6]:
#Dividing the data into 80% training & 20% testing
x_train=data[:546]
x_test=data[547:]

In [7]:
#Separating the Outcomes (Actual Y) from Predictors
#The prediction is arranged in the format (Columns,Rows)
y_train=x_train['class']
y_test=x_test['class']
y_train=y_train.to_numpy()
y_test=y_test.to_numpy()
y_train=y_train.reshape(y_train.shape[0],-1)
y_test=y_test.reshape(y_test.shape[0],-1)
y_train=y_train.T
y_test=y_test.T

In [8]:
#Deleting the Outcome Row & arranging the data in (Columns, Row) format
x_train=x_train.drop('class',axis=1)
x_test=x_test.drop('class',axis=1)
x_train=x_train.to_numpy()
x_test=x_test.to_numpy()
x_train=x_train.T
x_test=x_test.T

In [9]:
#Sigmoid Function
def sigmoid(a):
    return (1/(1+np.exp(-a)))

In [10]:
#Function to Calculate the Cost J
def calc_cost(m,Y,A):
    return -(1/m)*np.sum(Y*(np.log(A))+(1-Y)*(np.log(1-A)))

In [11]:
#Function to calculate Weight
def calc_weights(w,X):
    A = sigmoid(np.dot(w.T,X))
    return A

In [12]:
#Weight Updation Function
def update_weights(m,A,X,Y):
    dw=(1/m)*np.dot(X,(A-Y).T)
    return dw

In [13]:
#Gradient Descent Implementation
def gradient_descent(w, X, Y, num_iterations, learning_rate):
    costs=[]
    m=X.shape[1]
    for i in range(num_iterations):
        A=calc_weights(w,X)
        cost=calc_cost(m,Y,A)
        
        #Update Weights
        dw=update_weights(m,A,X,Y)
        w=w-learning_rate*dw
        
        #Save Cost each Iteration
        costs.append(cost)
    return w,dw,costs

In [14]:
#Calculate Accuracy of Predictions after n number of Iterations
def prediction(w, X, Y):
    m=X.shape[1]
    predict=np.zeros((1,m))
    A=np.dot(w.T,X)
    Y_prediction = np.zeros((1,m))
    
    count=0
    for i in range(1,m):
        if A[0][i]<=0.5:
            Y_prediction[0][i]=0
        else:
            Y_prediction[0][i]=1
            
    for i in range(1,m):
        if Y_prediction[0][i]==Y[0][i]:
            count+=1
        
    return (count/X.shape[1])*100

In [15]:
#Final Implementation by Calculating 
def logistic_regression(X_train, Y_train, X_test, Y_test, learning_rate=0.005,num_iterations=1000):
    
    #Initializing Weights
    w = np.zeros([X_train.shape[0],1])
    
    w, dw, costs = gradient_descent(w,X_train, Y_train, num_iterations, learning_rate)
    
    # Predict Test & Train
    train_accuracy = prediction(w, X_train, Y_train)
    test_accuracy = prediction(w, X_test, Y_test)
    
    return train_accuracy,test_accuracy,costs

In [16]:
train_accuracy,test_accuracy,costs=logistic_regression(x_train, y_train, x_test, y_test)

In [17]:
print('The Accuracy in percentage calculated on the Training Data is '+str(train_accuracy))

The Accuracy in percentage calculated on the Training Data is 96.15384615384616


In [18]:
print('The Accuracy in percentage calculated on the Test Data is '+str(test_accuracy))

The Accuracy in percentage calculated on the Test Data is 94.11764705882352


<h3>Finding Accuracy using Sklearn Library</h3>

In [19]:
#Arranging the Datasets in (Row, Column) format as required by Sklearn Library
x_train_lib=x_train.T
y_train_lib=y_train.T
x_test_lib=x_test.T
y_test_lib=y_test.T

In [20]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0).fit(x_train_lib, y_train_lib)

  y = column_or_1d(y, warn=True)


In [21]:
train_accuracy=clf.score(x_train_lib, y_train_lib)
print('The Accuracy in percentage calculated on the Train Data using sklearn Library is '+str(train_accuracy*100))

The Accuracy in percentage calculated on the Train Data using sklearn Library is 97.25274725274726


In [22]:
test_accuracy=clf.score(x_test_lib, y_test_lib)
print('The Accuracy in percentage calculated on the Test Data using sklearn Library is '+str(test_accuracy*100))

The Accuracy in percentage calculated on the Test Data using sklearn Library is 94.85294117647058
