# Description 

Logistic regression has many applications in data science; in healthcare, its predictions can inform decisions that change patients' lives.

The task is to detect breast cancer by fitting a logistic regression model to a real-world dataset and predicting whether a tumor is benign (not breast cancer) or malignant (breast cancer) based on its characteristics.

We build a logistic regression model that relates the following nine independent variables to the class of the tumor (benign or malignant):

Clump thickness

Uniformity of cell size

Uniformity of cell shape

Marginal adhesion

Single epithelial cell size

Bare Nuclei

Bland chromatin

Normal nucleoli

Mitoses

Logistic regression can identify important predictors of breast cancer through odds ratios, and confidence intervals around those ratios provide additional information for decision-making. Model performance depends on how accurately the tumor characteristics listed above are assessed when each sample is recorded.
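
To make the odds-ratio idea concrete, here is a minimal sketch. The coefficient and standard error are hypothetical values chosen for illustration, not results from this dataset, and the interval uses the common Wald approximation, which is an assumption here rather than something this notebook computes.

```python
import math

def odds_ratio(coefficient):
    """Odds ratio for a one-unit increase in a feature, given its log-odds coefficient."""
    return math.exp(coefficient)

def wald_interval(coefficient, standard_error, z=1.96):
    """Approximate 95% confidence interval for the odds ratio (Wald method)."""
    return (math.exp(coefficient - z * standard_error),
            math.exp(coefficient + z * standard_error))

# Hypothetical values for illustration only
beta, se = 0.5, 0.1
low, high = wald_interval(beta, se)
print(f"odds ratio: {odds_ratio(beta):.2f}, 95% CI: ({low:.2f}, {high:.2f})")
```

An odds ratio above 1 means the odds of the positive class rise as the feature increases; an interval that excludes 1 suggests the effect is unlikely to be noise.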

Here we conduct a three-part case study:

Part 1: Data Preprocessing

Importing the dataset

Splitting the dataset into a training set and test set

Part 2: Training and Inference

Training the logistic regression model on the training set

Predicting the test set results

Part 3: Evaluating the Model

Making the confusion matrix

Computing the accuracy with k-Fold cross-validation

## Step 1: Data preprocessing

### Importing required libraries

In [1]:
import pandas as pd

### Importing the dataset


In [2]:
dataset = pd.read_csv("breast_cancer.csv")

In [3]:
dataset

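|  | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 678 | 776715 | 3 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 2 |
| 679 | 841769 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 |
| 680 | 888820 | 5 | 10 | 10 | 3 | 7 | 3 | 8 | 10 | 2 | 4 |
| 681 | 897471 | 4 | 8 | 6 | 4 | 3 | 4 | 10 | 6 | 1 | 4 |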


#### Divide the dataset into independent variables and dependent variables

In [4]:
x = dataset.iloc[:, 1:-1]  # features: every column except the first (sample ID) and the last (Class)

In [5]:
x

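|  | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 |
| 1 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 |
| 2 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 |
| 3 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 |
| 4 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 678 | 3 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 |
| 679 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 |
| 680 | 5 | 10 | 10 | 3 | 7 | 3 | 8 | 10 | 2 |
| 681 | 4 | 8 | 6 | 4 | 3 | 4 | 10 | 6 | 1 |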


In [6]:
y = dataset.iloc[:, -1]  # target: the last column, Class

In [7]:
y

0      2
1      2
2      2
3      2
4      2
      ..
678    2
679    2
680    4
681    4
682    4
Name: Class, Length: 683, dtype: int64
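
The Class column uses the codes 2 and 4. Assuming 2 marks benign and 4 marks malignant (the coding used by the original Wisconsin breast cancer data, which this file appears to follow), a quick tally of labels might look like the sketch below. The short list stands in for `y` and contains only the ten values visible in the output above.

```python
from collections import Counter

# Assumed mapping from class codes to names (not stated in the notebook itself)
CLASS_NAMES = {2: "benign", 4: "malignant"}

def tally_classes(labels):
    """Count how many samples fall in each named class."""
    return Counter(CLASS_NAMES[label] for label in labels)

# The ten labels visible in the output above (rows 0-4 and 678-682)
visible_labels = [2, 2, 2, 2, 2, 2, 2, 4, 4, 4]
counts = tally_classes(visible_labels)
print(counts)
```

On the real data, `y.value_counts()` gives the same kind of tally for all 683 rows.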

#### Splitting the data into a training set and a test set

In [37]:
from sklearn.model_selection import train_test_split

In [38]:
# Hold out 20% of the rows as the test set; random_state=0 makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

In [39]:
x_train 

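|  | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses |
|---|---|---|---|---|---|---|---|---|---|
| 312 | 10 | 1 | 1 | 1 | 2 | 10 | 5 | 4 | 1 |
| 202 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 1 |
| 263 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 |
| 395 | 3 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 1 |
| 101 | 8 | 2 | 3 | 1 | 6 | 3 | 7 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 |
| 359 | 5 | 1 | 1 | 2 | 2 | 1 | 2 | 1 | 1 |
| 192 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 |
| 629 | 3 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 1 |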
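
With 683 rows and `test_size=0.2`, the split sizes can be checked by hand. The sketch below assumes scikit-learn rounds the test set size up and gives the remainder to the training set, which is consistent with the 137 test predictions counted in the confusion matrix later in this notebook.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_train, n_test = split_sizes(683, 0.2)
print(n_train, n_test)  # 546 137
```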


## Step 2: Training and inference

#### Training the logistic regression model on the training set

In [40]:
from sklearn.linear_model import LogisticRegression

In [41]:
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

LogisticRegression(random_state=0)
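
Under the hood, a fitted logistic regression model turns a weighted sum of the features into a probability with the sigmoid function. The sketch below illustrates that calculation; the weights and intercept are made-up numbers, not the values learned by `classifier`.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, intercept):
    """Probability of the positive class for one sample."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for three features (illustration only)
weights = [0.5, 0.3, 0.4]
intercept = -5.0
sample = [5, 1, 1]  # e.g. clump thickness, cell size uniformity, cell shape uniformity
print(round(predict_probability(sample, weights, intercept), 3))
```

In scikit-learn, the learned values are available as `classifier.coef_` and `classifier.intercept_` after fitting.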

#### Predicting the test set results 


In [42]:
y_pred=classifier.predict(x_test)
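
`predict` returns class labels rather than probabilities. For a binary problem this amounts to thresholding the positive-class probability at 0.5. A sketch with hypothetical probabilities, assuming the labels 2 and 4 used in this dataset:

```python
def to_labels(probabilities, negative=2, positive=4, threshold=0.5):
    """Convert positive-class probabilities into class labels."""
    return [positive if p > threshold else negative for p in probabilities]

# Hypothetical probabilities of the positive class (illustration only)
probs = [0.03, 0.91, 0.48, 0.66]
print(to_labels(probs))  # [2, 4, 2, 4]
```

The probabilities themselves are available from `classifier.predict_proba(x_test)`.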

## Step 3: Evaluating the model

#### Making the confusion matrix

In [43]:
from sklearn.metrics import confusion_matrix
confusionMatrix=confusion_matrix(y_test,y_pred)
confusionMatrix

array([[84,  3],
       [ 3, 47]], dtype=int64)
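
In scikit-learn's layout, rows are true classes and columns are predicted classes, ordered by label. Assuming 2 is benign and 4 is malignant, and treating malignant as the positive class, the four cells can be unpacked into per-class metrics as sketched below.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
matrix = [[84, 3],
          [3, 47]]

(tn, fp), (fn, tp) = matrix  # positive class assumed to be 4 (malignant)

sensitivity = tp / (tp + fn)  # share of malignant tumors that were caught
specificity = tn / (tn + fp)  # share of benign tumors correctly cleared
precision = tp / (tp + fp)    # share of malignant predictions that were right

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} precision={precision:.3f}")
```

The three false negatives are malignant tumors predicted as benign, which in a screening setting is usually the costlier kind of error.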

##### Computing accuracy


In [44]:
(84+47)/(84+47+3+3)  # correct predictions (diagonal of the confusion matrix) over all test samples

0.9562043795620438
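
The same figure can be computed without typing the counts by hand: accuracy is the sum of the diagonal divided by the sum of all cells. A small sketch using the matrix shown above:

```python
def accuracy_from_matrix(matrix):
    """Accuracy = correct predictions (diagonal) / all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

print(accuracy_from_matrix([[84, 3], [3, 47]]))  # 0.9562043795620438
```

With scikit-learn, `accuracy_score(y_test, y_pred)` from `sklearn.metrics` returns the same value.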

##### Computing the accuracy with k-fold cross-validation

In [54]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = x_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 96.70 %
Standard Deviation: 1.97 %
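
k-fold cross-validation splits the training set into k parts, trains on k-1 of them, and scores on the remaining one, rotating until every part has served as the validation fold once. The sketch below shows only the index bookkeeping for plain, unshuffled k-fold; for a classifier, `cross_val_score` with an integer `cv` uses a stratified variant by default, so the actual folds differ.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

# 546 training rows (683 minus the 137 test rows), 10 folds
fold_sizes = [len(val) for _, val in k_fold_indices(546, 10)]
print(fold_sizes)  # [55, 55, 55, 55, 55, 55, 54, 54, 54, 54]
```

The reported mean and standard deviation are then taken over the ten per-fold accuracies.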
