## Implementing Logistic Regression from Scratch on the Breast Cancer Dataset


Dataset Details:

The Breast Cancer dataset includes ten real-valued features computed for each cell nucleus:

- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)

The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
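The 10 × 3 layout described above can be sketched in a few lines. This is only an illustration: the underscore spelling of the names follows the CSV columns shown later in this notebook, and the statistic-major ordering is an assumption drawn from the "field 3 / 13 / 23" example (fields 1 and 2 being the id and the diagnosis).

```python
# The ten base measurements, spelled the way the CSV columns spell them.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
STATISTICS = ["mean", "se", "worst"]

# Statistic-major order: all means, then all standard errors, then all "worst" values.
feature_names = [f"{base}_{stat}" for stat in STATISTICS for base in BASE_FEATURES]

print(len(feature_names))                                        # 30
print(feature_names[0], feature_names[10], feature_names[20])    # radius_mean radius_se radius_worst
```

Positions 0, 10 and 20 in this list correspond to fields 3, 13 and 23 once the two leading fields are counted.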

All feature values are recoded with four significant digits.


In [1]:
# Import the required libraries

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split  # a from-scratch version is in the Simple Linear Regression file
from sklearn.metrics import mean_squared_error  # a from-scratch version is in the Simple Linear Regression file
from Algo_LogisticRegression_FromScratch import LogisticRegression  # the from-scratch implementation used below

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/Sanchit028/Machine-Learning-from-scratch/main/03.%20Logistic%20Regression/breast_cancer.csv")
df.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave_points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave_points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


#### Data Pre-Processing


In [3]:
df.shape

(569, 32)

In [4]:
df.dtypes

id                           int64
diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave_points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave_points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst     

In [5]:
df.info()  # We can see there are no null values in the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave_points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

In [6]:
# Encode the target as integers: malignant ('M') -> 1, everything else (benign) -> 0
df['diagnosis'] = df['diagnosis'].apply(lambda val: 1 if val == 'M' else 0)

In [7]:
# Absolute correlation of every column with the target variable
target_corr = np.abs(df.corrwith(df["diagnosis"]))
target_corr.sort_values(ascending=False)  # strongest correlation first

diagnosis                  1.000000
concave_points_worst       0.793566
perimeter_worst            0.782914
concave_points_mean        0.776614
radius_worst               0.776454
perimeter_mean             0.742636
area_worst                 0.733825
radius_mean                0.730029
area_mean                  0.708984
concavity_mean             0.696360
concavity_worst            0.659610
compactness_mean           0.596534
compactness_worst          0.590998
radius_se                  0.567134
perimeter_se               0.556141
area_se                    0.548236
texture_worst              0.456903
smoothness_worst           0.421465
symmetry_worst             0.416294
texture_mean               0.415185
concave_points_se          0.408042
smoothness_mean            0.358560
symmetry_mean              0.330499
fractal_dimension_worst    0.323872
compactness_se             0.292999
concavity_se               0.253730
fractal_dimension_se       0.077972
smoothness_se              0

In [8]:
# Drop the target, the id column, and the features judged irrelevant from the correlation ranking
drop_cols = [
    "diagnosis", "id",
    "smoothness_mean", "symmetry_mean", "fractal_dimension_mean",
    "texture_se", "smoothness_se", "compactness_se", "concavity_se",
    "symmetry_se", "fractal_dimension_se",
    "fractal_dimension_worst",
]
X = df.drop(drop_cols, axis=1)
y = df["diagnosis"]  # Target attribute
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((455, 20), (114, 20), (455,), (114,))
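The feature selection above was done by hand from the correlation ranking. As a rough sketch of how the same idea could be automated, the helper below keeps columns whose absolute Pearson correlation with the target reaches a cutoff. The function name, the synthetic data, and the cutoff of 0.4 are all assumptions for illustration; the notebook does not state a threshold, although the values visible in the ranking are consistent with one somewhere between about 0.36 and 0.41.

```python
import numpy as np

def select_by_target_correlation(X, y, names, threshold=0.4):
    """Return the names of columns whose |Pearson r| with y is at least `threshold`."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)          # center each column
    yc = y - y.mean()                # center the target
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = (Xc * yc[:, None]).sum(axis=0) / denom
    return [name for name, value in zip(names, np.abs(r)) if value >= threshold]

# Synthetic demo data (not the breast cancer data): one informative column, one pure-noise column.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
informative = y_demo + rng.normal(scale=0.5, size=200)
noise = rng.normal(size=200)
X_demo = np.column_stack([informative, noise])

kept = select_by_target_correlation(X_demo, y_demo, ["informative", "noise"])
print(kept)
```

The helper assumes no column is constant, since a zero variance would make the correlation undefined.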

#### Model Training and prediction


In [11]:
# Create and train the from-scratch logistic regression model
model = LogisticRegression(lr=0.00001, n_iters=50000)
m, b = model.fit(X_train, y_train, True)  # the third argument presumably enables the per-iteration cost log

Iteration: 1	Cost: 0.2500000000
Iteration: 2	Cost: 0.2833751898
Iteration: 3	Cost: 0.3275286329
Iteration: 4	Cost: 0.5139285764
Iteration: 5	Cost: 0.3151332790
Iteration: 6	Cost: 0.5029673180
Iteration: 7	Cost: 0.3141255848
Iteration: 8	Cost: 0.5036424693
Iteration: 9	Cost: 0.3055046555
Iteration: 10	Cost: 0.4959517590
Iteration: 11	Cost: 0.3020473087
Iteration: 12	Cost: 0.4932875656
Iteration: 13	Cost: 0.2948678961
Iteration: 14	Cost: 0.4860531104
Iteration: 15	Cost: 0.2907444409
Iteration: 16	Cost: 0.4816122799
Iteration: 17	Cost: 0.2845799644
Iteration: 18	Cost: 0.4741631697
Iteration: 19	Cost: 0.2805147312
Iteration: 20	Cost: 0.4686124024
Iteration: 21	Cost: 0.2752013462
Iteration: 22	Cost: 0.4609272213
Iteration: 23	Cost: 0.2713234473
Iteration: 24	Cost: 0.4545815599
Iteration: 25	Cost: 0.2666617821
Iteration: 26	Cost: 0.4467100692
Iteration: 27	Cost: 0.2629433444
Iteration: 28	Cost: 0.4397087954
Iteration: 29	Cost: 0.2587922235
Iteration: 30	Cost: 0.4316824109
Iteration: 31	Cost:
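The `Algo_LogisticRegression_FromScratch` module is not shown in this notebook, so the class below is a hypothetical stand-in, not the author's code. It is a minimal sketch of logistic regression trained by batch gradient descent on the cross-entropy gradient, with the same general interface (`lr`, `n_iters`, `fit` returning coefficients and intercept, `predict`). The log above starts at a cost of exactly 0.25, which is what a mean squared error gives when every prediction is 0.5, so the original class may track a squared-error cost; this sketch does not try to reproduce that log.

```python
import numpy as np

class TinyLogisticRegression:
    """Hypothetical minimal logistic regression; an illustration only."""

    def __init__(self, lr=0.1, n_iters=1000):
        self.lr = lr
        self.n_iters = n_iters
        self.weights = None
        self.bias = 0.0

    @staticmethod
    def _sigmoid(z):
        # Clip z so np.exp cannot overflow for very large |z|.
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        for _ in range(self.n_iters):
            p = self._sigmoid(X @ self.weights + self.bias)
            error = p - y                                  # gradient of the log loss w.r.t. the logits
            self.weights -= self.lr * (X.T @ error) / n_samples
            self.bias -= self.lr * error.mean()
        return self.weights, self.bias

    def predict(self, X):
        p = self._sigmoid(np.asarray(X, dtype=float) @ self.weights + self.bias)
        return (p >= 0.5).astype(int)                      # threshold probabilities at 0.5

# Toy one-feature data that is separable at x = 0.
X_toy = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = TinyLogisticRegression(lr=0.5, n_iters=500)
w, b0 = clf.fit(X_toy, y_toy)
print(clf.predict(X_toy))  # [0 0 0 1 1 1]
```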

In [12]:
m # Coefficients

array([-0.02810895,  0.0146906 , -0.11408513, -0.01554075,  0.00164514,
        0.00247303,  0.00100004, -0.00039168,  0.00448887,  0.04287577,
        0.00011752, -0.02960902,  0.04761501, -0.06819914,  0.03128081,
        0.00028611,  0.00596678,  0.00739865,  0.00201963,  0.0011111 ])

In [13]:
b # Intercept

-0.004279732540599209

In [14]:
y_pred = model.predict(X_test)

In [15]:
mean_squared_error(y_test, y_pred)  # with 0/1 labels this is the fraction of misclassified samples

0.06140350877192982
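When both the true labels and the predictions are 0 or 1, each squared difference is 1 for a mistake and 0 otherwise, so the mean squared error is the misclassification rate and accuracy is one minus it. The short check below uses made-up labels to show the identity. Applied to the value above, 0.0614 × 114 test samples corresponds to 7 misclassified samples, an accuracy of about 0.939.

```python
# Made-up labels for illustration; two of the eight predictions are wrong.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

n = len(y_true)
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / n
error_rate = sum(t != p for t, p in zip(y_true, y_hat)) / n
accuracy = 1.0 - error_rate

print(mse, error_rate, accuracy)  # 0.25 0.25 0.75
```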

#### Checking our result by comparing it with the sklearn model


In [16]:
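from sklearn.linear_model import LogisticRegression  # note: this rebinds the name of the from-scratch class imported earlier

reg = LogisticRegression()
reg.fit(X_train, y_train)  # May emit a ConvergenceWarning because the features are not scaled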

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
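The warning above suggests scaling the data. The columns here live on very different scales (`area_mean` is around 1000 while `smoothness_mean` is around 0.1), which tends to slow down gradient-based solvers. A minimal sketch of standardization is shown below; the helper name and the demo arrays are assumptions for illustration, and scikit-learn's `StandardScaler` offers the same transformation as a library class.

```python
import numpy as np

def standardize(train, test):
    """Scale both splits using the mean and standard deviation of the training split only."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero on constant columns
    return (train - mean) / std, (test - mean) / std

# Made-up rows with one large-scale and one small-scale column.
train_demo = np.array([[1000.0, 0.10], [1200.0, 0.20], [800.0, 0.15]])
test_demo = np.array([[1100.0, 0.12]])

train_scaled, test_scaled = standardize(train_demo, test_demo)
print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
```

Fitting the statistics on the training split and reusing them for the test split keeps test information out of the preprocessing step.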

In [17]:
reg.predict(X_test)

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 0], dtype=int64)

In [18]:
reg.intercept_

array([-0.28365114])

In [19]:
reg.coef_

array([[-1.43825303, -0.278992  , -0.12334629,  0.00560196,  0.27158436,
         0.3678748 ,  0.15273438, -0.0772142 , -0.36840292,  0.11830439,
         0.01865096, -1.5724038 ,  0.37103834,  0.26434012,  0.02184983,
         0.10018613,  0.87509209,  1.09269982,  0.30516869,  0.29482979]])

In [20]:
mean_squared_error(y_test, reg.predict(X_test))  # Close to the from-scratch result: 6 of 114 misclassified here vs 7 of 114

0.05263157894736842