# Logistic Regression

## Importing the libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Importing the dataset

Ref: http://ijarcs.info/index.php/Ijarcs/article/download/4602/4132#:~:text=2.,cells%20are%20cancerous%20or%20not.

1. Clump Thickness: Assesses whether cells are mono-layered or multi-layered. Benign cells tend to be grouped in mono-layers, while cancerous cells are often grouped in multi-layers.
2. Uniformity of Cell Size: Evaluates the consistency of cell size in the sample. Cancer cells tend to vary in size, which makes this parameter valuable in determining whether the cells are cancerous.
3. Uniformity of Cell Shape: Estimates how uniform the cell shapes are and identifies marginal variances, because cancer cells tend to vary in shape.
4. Marginal Adhesion: Normal cells tend to stick together, while cancer cells tend to lose this ability, so loss of adhesion is a sign of malignancy.
5. Single Epithelial Cell Size: Related to the uniformity described above. Epithelial cells that are significantly enlarged may be malignant.
6. Bare Nuclei: Nuclei that are not surrounded by cytoplasm. These are typically seen in benign tumors.
7. Bland Chromatin: Describes a uniform “texture” of the nucleus seen in benign cells. In cancer cells, the chromatin tends to be coarser.
8. Normal Nucleoli: Nucleoli are small structures seen in the nucleus. In normal cells the nucleolus is usually very small, if visible at all. In cancer cells the nucleoli become much more prominent, and sometimes there are more of them.
9. Mitoses: An estimate of the number of mitoses that have taken place. The larger the value, the greater the chance of malignancy.

In [2]:
dataset = pd.read_csv('breast_cancer.csv')
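# Features: every column except the first (Sample code number) and the last (Class)
X = dataset.iloc[:, 1:-1]
# Target: the last column (Class)
y = dataset.iloc[:, -1]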
print(dataset.head())
print(X.head())
print(y.head())
print(dataset.info())
print(dataset.describe())

   Sample code number  Clump Thickness  Uniformity of Cell Size  \
0             1000025                5                        1   
1             1002945                5                        4   
2             1015425                3                        1   
3             1016277                6                        8   
4             1017023                4                        1   

   Uniformity of Cell Shape  Marginal Adhesion  Single Epithelial Cell Size  \
0                         1                  1                            2   
1                         4                  5                            7   
2                         1                  1                            2   
3                         8                  1                            3   
4                         1                  3                            2   

   Bare Nuclei  Bland Chromatin  Normal Nucleoli  Mitoses  Class  
0            1                3                1       

## Splitting the dataset into training and test set

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Training the Logistic Regression Model on the training set

In [4]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

LogisticRegression()
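
For intuition, a binary logistic regression model scores a sample with a weighted sum of its features, passes that score through the sigmoid function to obtain a probability, and predicts the positive class when the probability exceeds 0.5. The sketch below illustrates this with made-up numbers: `sigmoid`, `predict_proba_one`, the weights, the bias, and the sample are all hypothetical and are not the coefficients learned by `classifier`.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_one(features, weights, bias):
    """Probability of the positive class for one sample (hypothetical helper)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for three features; NOT the fitted model's coefficients.
weights = [0.5, 0.3, 0.4]
bias = -4.0

sample = [5, 1, 1]  # hypothetical feature values
p = predict_proba_one(sample, weights, bias)
label = 4 if p > 0.5 else 2  # 4 = malignant, 2 = benign, as in the Class column
print(round(p, 3), label)
```

With a fitted scikit-learn model, the learned values live in `classifier.coef_` and `classifier.intercept_`, and `classifier.predict_proba` returns the per-class probabilities.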

## Predicting the test set results 

In [5]:
y_pred = classifier.predict(X_test)
print(y_pred)
# Show actual (left column) and predicted (right column) labels side by side
print(np.column_stack((y_test, y_pred)))
# score() returns the mean accuracy on the given test data and labels
print(classifier.score(X_test, y_test))

[2 2 4 4 2 2 2 4 2 2 4 2 4 2 2 2 4 4 4 2 2 2 4 2 4 4 2 2 2 4 2 4 4 2 2 2 4
 4 2 4 2 2 2 2 2 2 2 4 2 2 4 2 4 2 2 2 4 4 2 4 2 2 2 2 2 2 2 2 4 4 2 2 2 2
 2 2 4 2 2 2 4 2 4 2 2 4 2 4 4 2 4 2 4 4 2 4 4 4 4 2 2 2 4 4 2 2 4 2 2 2 4
 2 2 4 2 2 2 2 2 2 2 4 2 2 4 4 2 4 2 4 2 2 4 2 2 4 2]
[[2 2]
 [2 2]
 [4 4]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [2 2]
 [2 2]
 [4 4]
 [2 2]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [4 4]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [2 2]
 [4 4]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [2 2]
 [4 4]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [4 4]
 [2 2]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [2 2]
 [2 2]
 [4 4]
 [2 2]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [2 4]
 [2 2]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [2 2]
 [4 4]
 [2 2]
 [2 2]
 [4 4]
 [2 2]
 [2 4]
 [4 4]
 [2 2]
 [4 4]
 [2 2]
 [4 4]
 [4 4]
 [4 2]
 [2 4]
 [4 4]
 [4 4]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [4 4]


## Making a confusion matrix 

In [6]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[84  3]
 [ 3 47]]


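Reading the matrix row by row (rows are actual classes, columns are predicted classes):
84 = benign tumors correctly predicted as benign
3  = benign tumors incorrectly predicted as malignant
3  = malignant tumors incorrectly predicted as benign
47 = malignant tumors correctly predicted as malignant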
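
The four counts can also be unpacked in code to derive other metrics. The sketch below hard-codes the matrix printed above and assumes that malignant (class 4) is treated as the positive class; the names `tn`, `fp`, `fn`, and `tp` are illustrative. The same unpacking should work on the 2x2 array returned by `confusion_matrix`.

```python
# Confusion matrix copied from the output above:
# rows = actual class (2, 4), columns = predicted class (2, 4).
cm = [[84, 3],
      [3, 47]]

# Assumption: malignant (class 4) is the positive class.
(tn, fp), (fn, tp) = cm

precision = tp / (tp + fp)    # of tumors predicted malignant, share truly malignant
recall = tp / (tp + fn)       # of truly malignant tumors, share detected
specificity = tn / (tn + fp)  # of truly benign tumors, share predicted benign

print(f"Precision:   {precision:.3f}")
print(f"Recall:      {recall:.3f}")
print(f"Specificity: {specificity:.3f}")
```

In a screening setting, recall on the malignant class is often the number to watch, since a missed malignant tumor is usually the costlier error.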

## Computing the accuracy with K-Fold Cross Validation

In [7]:
# Accuracy on the test set, computed from the confusion matrix: correct / total
(84 + 47) / (84 + 47 + 3 + 3)

0.9562043795620438

In [8]:
from sklearn.model_selection import cross_val_score
accuracy = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
print("Accuracy: {:.2f} %".format(accuracy.mean() * 100))
print("Standard Deviation: {:.2f} %".format(accuracy.std() * 100))

Accuracy: 96.70 %
Standard Deviation: 1.97 %
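
To make concrete what setting `cv` to 10 does: k-fold cross-validation splits the training data into k folds, fits the model k times with a different fold held out for validation each time, and reports the mean and standard deviation of the k scores. The sketch below shows only the index splitting, in plain Python. `k_fold_indices` is a hypothetical helper, and it neither shuffles nor stratifies, whereas `cross_val_score` stratifies the folds by class when given a classifier and an integer `cv`.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k consecutive folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

# Hypothetical example: 10 samples, 3 folds.
for train, validation in k_fold_indices(10, 3):
    print("train:", train, "validation:", validation)
```

Each sample lands in a validation fold exactly once, so every training row contributes to exactly one of the k scores that are averaged.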
