# Breast Cancer Detection using Logistic Regression

Logistic regression predicts a categorical dependent variable, so the outcome must be a categorical or discrete value such as Yes or No, 0 or 1, or True or False. Rather than returning exactly 0 or 1, the model returns a probability between 0 and 1, which is then thresholded into a class. Logistic regression is therefore used for solving classification problems.
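To make the "probability between 0 and 1" idea concrete, here is a minimal sketch of the sigmoid function that logistic regression applies to a linear score. The scores and the 0.5 threshold are illustrative assumptions, not values taken from the model trained in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (w.x + b) for three samples; illustrative only
scores = np.array([-2.0, 0.0, 3.0])
probabilities = sigmoid(scores)

# Threshold at 0.5 (an assumed cut-off) to turn probabilities into class labels
labels = (probabilities >= 0.5).astype(int)

print(probabilities)
print(labels)
```

A large negative score gives a probability near 0, a large positive score gives a probability near 1, and a score of 0 sits exactly at 0.5.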

A breast tumor can be benign or malignant. Since there are only two possible outcomes, we can use logistic regression to predict which one applies.

1. Load the Dataset

In [22]:
#import necessary libraries.
import numpy as np
import sklearn.datasets

In [23]:
#loading dataset

breast_cancer = sklearn.datasets.load_breast_cancer()
print(breast_cancer)

{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]]), 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
 

In [24]:
#separating data and target of our dataset

X = breast_cancer.data
Y = breast_cancer.target

print(X)
print(Y)
print("Shape of X and Y: ",X.shape,Y.shape)

[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 ...
 [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
 [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
 [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 

2. Analysing the dataset

In [25]:
#import pandas library for analysis and convert the dataset into DataFrame

import pandas as pd
data = pd.DataFrame(breast_cancer.data, columns = breast_cancer.feature_names)
data['class'] = breast_cancer.target

data.head(10) #viewing upper 10 rows of dataframe

index,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,class
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0
5,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,0.2087,0.07613,...,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244,0
6,18.25,19.98,119.6,1040.0,0.09463,0.109,0.1127,0.074,0.1794,0.05742,...,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368,0
7,13.71,20.83,90.2,577.9,0.1189,0.1645,0.09366,0.05985,0.2196,0.07451,...,28.14,110.6,897.0,0.1654,0.3682,0.2678,0.1556,0.3196,0.1151,0
8,13.0,21.82,87.5,519.8,0.1273,0.1932,0.1859,0.09353,0.235,0.07389,...,30.73,106.2,739.3,0.1703,0.5401,0.539,0.206,0.4378,0.1072,0
9,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,0.203,0.08243,...,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075,0


In [26]:
data.tail(10) #viewing lower 10 rows of dataframe

index,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,class
559,11.51,23.93,74.52,403.5,0.09261,0.1021,0.1112,0.04105,0.1388,0.0657,...,37.16,82.28,474.2,0.1298,0.2517,0.363,0.09653,0.2112,0.08732,1
560,14.05,27.15,91.38,600.4,0.09929,0.1126,0.04462,0.04304,0.1537,0.06171,...,33.17,100.2,706.7,0.1241,0.2264,0.1326,0.1048,0.225,0.08321,1
561,11.2,29.37,70.67,386.0,0.07449,0.03558,0.0,0.0,0.106,0.05502,...,38.3,75.19,439.6,0.09267,0.05494,0.0,0.0,0.1566,0.05905,1
562,15.22,30.62,103.4,716.9,0.1048,0.2087,0.255,0.09429,0.2128,0.07152,...,42.79,128.7,915.0,0.1417,0.7917,1.17,0.2356,0.4089,0.1409,0
563,20.92,25.09,143.0,1347.0,0.1099,0.2236,0.3174,0.1474,0.2149,0.06879,...,29.41,179.1,1819.0,0.1407,0.4186,0.6599,0.2542,0.2929,0.09873,0
564,21.56,22.39,142.0,1479.0,0.111,0.1159,0.2439,0.1389,0.1726,0.05623,...,26.4,166.1,2027.0,0.141,0.2113,0.4107,0.2216,0.206,0.07115,0
565,20.13,28.25,131.2,1261.0,0.0978,0.1034,0.144,0.09791,0.1752,0.05533,...,38.25,155.0,1731.0,0.1166,0.1922,0.3215,0.1628,0.2572,0.06637,0
566,16.6,28.08,108.3,858.1,0.08455,0.1023,0.09251,0.05302,0.159,0.05648,...,34.12,126.7,1124.0,0.1139,0.3094,0.3403,0.1418,0.2218,0.0782,0
567,20.6,29.33,140.1,1265.0,0.1178,0.277,0.3514,0.152,0.2397,0.07016,...,39.42,184.6,1821.0,0.165,0.8681,0.9387,0.265,0.4087,0.124,0
568,7.76,24.54,47.92,181.0,0.05263,0.04362,0.0,0.0,0.1587,0.05884,...,30.37,59.16,268.6,0.08996,0.06444,0.0,0.0,0.2871,0.07039,1


In [27]:
#counting the data based on the class

data['class'].value_counts()

1    357
0    212
Name: class, dtype: int64

In [28]:
print(breast_cancer.target_names)

['malignant' 'benign']


In [29]:
data.groupby('class').mean()

class,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.46283,21.604906,115.365377,978.376415,0.102898,0.145188,0.160775,0.08799,0.192909,0.06268,...,21.134811,29.318208,141.37033,1422.286321,0.144845,0.374824,0.450606,0.182237,0.323468,0.09153
1,12.146524,17.914762,78.075406,462.790196,0.092478,0.080085,0.046058,0.025717,0.174186,0.062867,...,13.379801,23.51507,87.005938,558.89944,0.124959,0.182673,0.166238,0.074444,0.270246,0.079442


From this analysis, the class codes are:

0 - Malignant
1 - Benign

Cases for each class:

Malignant - 212
Benign - 357
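The mapping between class codes and names can also be derived in code rather than read off by eye. The sketch below assumes the same `load_breast_cancer` dataset; `label_names` and `counts` are names introduced here for illustration.

```python
import numpy as np
import sklearn.datasets

breast_cancer = sklearn.datasets.load_breast_cancer()

# target_names is ordered by class code, so position 0 names class 0, and so on
label_names = dict(enumerate(breast_cancer.target_names))

# Count how many samples carry each class code
counts = np.bincount(breast_cancer.target)

for code, name in label_names.items():
    print(code, name, counts[code])
```

Deriving the mapping this way avoids hard-coding which code means malignant when the labels are used later.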

3. Splitting the dataset into training and testing data.

In [30]:
#import train_test_split module
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.15, stratify=Y, random_state=1)
#test_size --> fraction of the data held out for testing (0.15 = 15%).
#stratify --> keeps the class proportions of Y the same in the training and test sets.
#random_state --> seed for the shuffle; the same value reproduces the same split, a different value gives a different split.

In [31]:
#checking the sizes of the splits and whether the class proportions are preserved
print(Y.shape, Y_train.shape, Y_test.shape)
print(Y.mean(), Y_train.mean(), Y_test.mean())

(569,) (483,) (86,)
0.6274165202108963 0.6273291925465838 0.627906976744186


Here, we can see that 15% of the data (86 of 569 samples) is held out for testing. Because the mean of the target is almost identical in the full, training, and test sets, the proportion of benign cases is preserved in both splits.
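The same check can be written per class instead of through the mean. This is a minimal sketch that repeats the split with the parameters used in this notebook; `class_fractions` is a helper introduced here for illustration.

```python
import numpy as np
import sklearn.datasets
from sklearn.model_selection import train_test_split

X, Y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.15, stratify=Y, random_state=1
)

def class_fractions(y):
    """Return the fraction of samples in class 0 and class 1."""
    return np.bincount(y, minlength=2) / len(y)

# With stratify=Y the fractions should be nearly equal across all three sets
for name, y in [("all", Y), ("train", Y_train), ("test", Y_test)]:
    print(name, len(y), class_fractions(y))
```

Dropping `stratify=Y` from the call is an easy way to see how much the fractions can drift in an unstratified split.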

3. Build and Train the model using Logistic Regression

In [32]:
#import LogisticRegression model
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(solver='liblinear', max_iter=100) #creating the model
classifier.fit(X_train, Y_train) #training the model on training data

LogisticRegression(solver='liblinear')

4. Evaluate the model

In [33]:
#import accuracy_score
from sklearn.metrics import accuracy_score

4.a Checking accuracy by predicting training data

In [34]:
prediction_on_training_data = classifier.predict(X_train)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)

print("Accuracy on training data: ",accuracy_on_training_data)

Accuracy on training data:  0.9503105590062112


4.b Checking accuracy by predicting test data

In [35]:
prediction_on_test_data = classifier.predict(X_test)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)

print("Accuracy on test data: ", accuracy_on_test_data)

Accuracy on test data:  0.9651162790697675
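Accuracy does not show which class the errors fall in. As a hedged sketch, assuming the same split and model settings as in this notebook, a confusion matrix separates missed malignant cases from missed benign ones. `confusion_matrix` comes from `sklearn.metrics` and is not part of the original notebook.

```python
import sklearn.datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, Y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.15, stratify=Y, random_state=1
)

classifier = LogisticRegression(solver='liblinear', max_iter=100)
classifier.fit(X_train, Y_train)

# Rows are true classes, columns are predicted classes, in the order [0, 1]
cm = confusion_matrix(Y_test, classifier.predict(X_test), labels=[0, 1])
print(cm)

# cm[0, 1] counts malignant tumors (class 0) that were predicted as benign (class 1)
print("Malignant predicted as benign:", cm[0, 1])
```

For a screening task, the malignant-predicted-as-benign count is usually the number to watch, since those are the cases a model would miss.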


5. Predicting whether a tumor is benign or malignant

In [38]:
input_data = (9.268,12.87,61.49,248.7,0.1634,0.2239,0.0973,0.05252,0.2378,0.09502,0.4076,1.093,3.014,20.04,0.009783,0.04542,0.03483,0.02188,0.02542,0.01045,10.28,16.38,69.05,300.2,0.1902,0.3441,0.2099,0.1025,0.3038,0.1252)

#converting the tuple into numpy array
input_data_array = np.asarray(input_data)

#reshaping the array to use it for predicting output for one instance
input_data_reshaped = input_data_array.reshape(1,-1)

#predicting
prediction = classifier.predict(input_data_reshaped)
#returns a one-element array: [0] if malignant, [1] if benign.

if prediction[0]==0:
    print("The tumor is predicted to be malignant.")
else:
    print("The tumor is predicted to be benign.")

The tumor is predicted to be benign.


Our Model is ready!
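A single-sample prediction can also report the class name and the predicted probabilities instead of only the class code. The sketch below assumes the same data, split, and model settings as this notebook; `describe_prediction` is a hypothetical helper introduced here, and a row of the test set stands in for new patient data.

```python
import numpy as np
import sklearn.datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

breast_cancer = sklearn.datasets.load_breast_cancer()
X, Y = breast_cancer.data, breast_cancer.target
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.15, stratify=Y, random_state=1
)

classifier = LogisticRegression(solver='liblinear', max_iter=100)
classifier.fit(X_train, Y_train)

def describe_prediction(model, features, target_names):
    """Return (class name, per-class probabilities) for one sample."""
    # A single sample must be reshaped to (1, n_features) before predicting
    sample = np.asarray(features, dtype=float).reshape(1, -1)
    code = model.predict(sample)[0]
    return target_names[code], model.predict_proba(sample)[0]

# Use the first test row as a stand-in for a new measurement
name, proba = describe_prediction(classifier, X_test[0], breast_cancer.target_names)
print(name, proba)
```

The probabilities are ordered by class code, so the first entry is the estimated probability of malignant and the second of benign.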

# Solving the same problem using K-Nearest Neighbors

In [58]:
import sklearn.datasets

breastCancer = sklearn.datasets.load_breast_cancer()

In [59]:
X1 = breastCancer.data
Y1 = breastCancer.target

print(X1.shape, Y1.shape)

(569, 30) (569,)


In [60]:
from sklearn.model_selection import train_test_split
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, Y1, test_size=0.15, stratify=Y1, random_state=1)

print(Y1.shape, Y1_train.shape, Y1_test.shape)
print(Y1.mean(), Y1_train.mean(), Y1_test.mean())

(569,) (483,) (86,)
0.6274165202108963 0.6273291925465838 0.627906976744186


In [61]:
#import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=7) #creating model
clf.fit(X1_train, Y1_train) #training model

KNeighborsClassifier(n_neighbors=7)

In [62]:
prediction = clf.predict(X1_test) #prediction on test data
acc = accuracy_score(Y1_test, prediction) #accuracy on test data
print("Accuracy on test data: ",acc)

Accuracy on test data:  0.9418604651162791


In [63]:
input_data1 = (19.89,20.26,130.5,1214,0.1037,0.131,0.1411,0.09431,0.1802,0.06188,0.5079,0.8737,3.654,59.7,0.005089,0.02303,0.03052,0.01178,0.01057,0.003391,23.73,25.23,160.5,1646,0.1417,0.3309,0.4185,0.1613,0.2549,0.09136)

#converting the tuple into numpy array
input_data_array1 = np.asarray(input_data1)

#reshaping the array to use it for predicting output for one instance
input_data_reshaped1 = input_data_array1.reshape(1,-1)

#predicting
predicted = clf.predict(input_data_reshaped1)
#returns a one-element array: [0] if malignant, [1] if benign.

if predicted[0]==0:
    print("The tumor is predicted to be malignant.")
else:
    print("The tumor is predicted to be benign.")

The tumor is predicted to be malignant.


Both models classified their sample input correctly, but on the test data Logistic Regression reached an accuracy of 0.9651162790697675 while KNN reached 0.9418604651162791.

So, for this train/test split, the Logistic Regression model is more accurate.
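The comparison above rests on a single 86-sample test set. As a hedged sketch, stratified 5-fold cross-validation gives a less split-dependent estimate for both models; the fold count and the `random_state` value are assumptions chosen for illustration, not settings from this notebook.

```python
import sklearn.datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, Y = sklearn.datasets.load_breast_cancer(return_X_y=True)

# Same model settings as above; 5 stratified folds is an assumed choice
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
models = {
    "logistic_regression": LogisticRegression(solver='liblinear', max_iter=100),
    "knn_7": KNeighborsClassifier(n_neighbors=7),
}

mean_accuracy = {}
for name, model in models.items():
    # One accuracy value per fold; report the mean and the spread
    scores = cross_val_score(model, X, Y, cv=cv, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(name, scores.mean(), scores.std())
```

If the two means differ by less than the spread across folds, the ranking between the models should be treated as tentative.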