# Applying a Logistic Regression Model to the Breast Cancer Dataset

This Breast Cancer dataset from [the UCI repository](http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) contains one record per observed tumor case. Each record holds a number of observations and a categorisation in the `class` column: 2 for benign (good), 4 for malignant (bad). Your task is to build a logistic regression model to classify these cases.

The data is provided as a CSV file. In a small number of cases no value is available; these are marked in the data with `?`. I have used the `na_values` keyword of `read_csv` so that they are read as `NaN` (Not a Number).
  

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.feature_selection import RFE

# Examining the Data:

In [2]:
bcancer = pd.read_csv("Data/breast-cancer-wisconsin.csv", na_values="?")
bcancer.head()

Unnamed: 0,sample_code_number,clump_thickness,uniformity_cell_size,uniformity_cell_shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class
0,1000025,5,1,1,1,2,1.0,3,1,1,2
1,1002945,5,4,4,5,7,10.0,3,2,1,2
2,1015425,3,1,1,1,2,2.0,3,1,1,2
3,1016277,6,8,8,1,3,4.0,3,7,1,2
4,1017023,4,1,1,3,2,1.0,3,1,1,2


In [3]:
# Checking the number of rows and number of columns
bcancer.shape

(699, 11)

In [4]:
# Looking at the statistical summary of the dataset
bcancer.describe()

Unnamed: 0,sample_code_number,clump_thickness,uniformity_cell_size,uniformity_cell_shape,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class
count,699.0,699.0,699.0,699.0,699.0,699.0,683.0,699.0,699.0,699.0,699.0
mean,1071704.0,4.41774,3.134478,3.207439,2.806867,3.216023,3.544656,3.437768,2.866953,1.589413,2.689557
std,617095.7,2.815741,3.051459,2.971913,2.855379,2.2143,3.643857,2.438364,3.053634,1.715078,0.951273
min,61634.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0
25%,870688.5,2.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,2.0
50%,1171710.0,4.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0,2.0
75%,1238298.0,6.0,5.0,5.0,4.0,4.0,6.0,5.0,4.0,1.0,4.0
max,13454350.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,4.0


In [5]:
# Checking how many classes are in the "class" column
set(bcancer['class'])

{2, 4}

In [6]:
# Counting the samples in each class to see whether the dataset is balanced
print("No. of benign samples: ", bcancer[bcancer['class'] == 2].shape[0])
print("No. of malignant samples: ", bcancer[bcancer['class'] == 4].shape[0])

No. of benign samples:  458
No. of malignant samples:  241
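
The two counts above can be turned into class proportions, which also give the accuracy of a trivial classifier that always predicts the majority class. The following is a minimal sketch that hard-codes the printed counts (taken before any rows are dropped) rather than reading the CSV; the variable names are my own.

```python
# Class counts copied from the output above (before dropping NaN rows)
counts = {"benign (2)": 458, "malignant (4)": 241}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of a model that always predicts the most frequent class
majority_baseline = max(proportions.values())

# How many benign samples there are per malignant sample
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in proportions.items():
    print(f"{label}: {share:.3f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

Any accuracy reported later is best read against this baseline: a model is only useful if it clearly beats always answering "benign".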


In [7]:
# Counting the NaN values in each column
bcancer.isna().sum()

sample_code_number              0
clump_thickness                 0
uniformity_cell_size            0
uniformity_cell_shape           0
marginal_adhesion               0
single_epithelial_cell_size     0
bare_nuclei                    16
bland_chromatin                 0
normal_nucleoli                 0
mitoses                         0
class                           0
dtype: int64

In [8]:
# Drop the rows that contain NaN values
bcancer = bcancer.dropna()

In [9]:
# check shape again
bcancer.shape

(683, 11)
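
To make the `?` handling concrete, here is a small sketch on a made-up three-row table (not the real data) showing how `na_values="?"` turns the marker into `NaN` and how `dropna` then removes the affected row. Passing `subset=` is an optional variation that restricts the check to the one column known to have gaps.

```python
import io
import pandas as pd

# Toy CSV invented for illustration; only the column names mirror the real file
raw = io.StringIO(
    "clump_thickness,bare_nuclei,class\n"
    "5,1,2\n"
    "8,?,4\n"
    "3,10,4\n"
)

# "?" is parsed as NaN, so bare_nuclei becomes a float column
toy = pd.read_csv(raw, na_values="?")
n_missing = int(toy["bare_nuclei"].isna().sum())

# Drop only the rows where bare_nuclei is missing
cleaned = toy.dropna(subset=["bare_nuclei"])
rows_dropped = len(toy) - len(cleaned)

print(toy.dtypes)
print("Missing values:", n_missing, "| rows dropped:", rows_dropped)
```

This also explains why `bare_nuclei` is displayed as `1.0`, `10.0` and so on in the outputs: a column containing `NaN` is stored as floating point.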

# Applying Logistic Regression Predictive Model

In [10]:
# Splitting the data into training (80%) and testing (20%) sets, using random_state=142
train, test = train_test_split(bcancer, test_size = 0.2, random_state=142)
print(train.shape)
print(test.shape)

(546, 11)
(137, 11)
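
The split above is purely random, so the share of malignant cases can differ slightly between the two parts. As an optional variation (not what this notebook does), `train_test_split` accepts a `stratify` argument that keeps the class proportions nearly identical in both parts. The sketch below uses a synthetic label array built from the class counts printed earlier, not the real data frame.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts printed earlier (assumption: pre-dropna counts)
labels = np.array([2] * 458 + [4] * 241)
rows = np.arange(len(labels))  # stand-in for the rows of the data frame

# stratify=labels asks for the same class mix in both parts
train_idx, test_idx = train_test_split(
    rows, test_size=0.2, random_state=142, stratify=labels
)

overall_frac = (labels == 4).mean()
train_frac = (labels[train_idx] == 4).mean()
test_frac = (labels[test_idx] == 4).mean()

print(f"Malignant share overall: {overall_frac:.3f}")
print(f"Malignant share in train: {train_frac:.3f}")
print(f"Malignant share in test:  {test_frac:.3f}")
```

With the real data this would read `train_test_split(bcancer, test_size=0.2, random_state=142, stratify=bcancer['class'])`, which would produce a different split from the one used in the rest of this notebook.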


In [11]:
# Separating the features (X) from the target (y), dropping the ID column
X_train = train.drop(['class', 'sample_code_number'], axis=1)
y_train = train['class']
X_test = test.drop(['class', 'sample_code_number'], axis=1)
y_test = test['class']

print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape)
print("X_test shape: ", X_test.shape)
print("y_test shape: ", y_test.shape)
print(X_train.head())
print(y_train.head())

X_train shape:  (546, 9)
y_train shape:  (546,)
X_test shape:  (137, 9)
y_test shape:  (137,)
     clump_thickness  uniformity_cell_size  uniformity_cell_shape  \
566                3                     1                      2   
174                8                     6                      5   
565                5                     7                     10   
206               10                    10                      9   
569               10                    10                      8   

     marginal_adhesion  single_epithelial_cell_size  bare_nuclei  \
566                  1                            2          1.0   
174                  4                            3         10.0   
565                 10                            5         10.0   
206                  3                            7          5.0   
569                 10                            6          5.0   

     bland_chromatin  normal_nucleoli  mitoses  
566                3             

In [12]:
# Training logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()
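
For intuition about what the fitted model does at prediction time: logistic regression computes a weighted sum of the features plus an intercept and passes it through the sigmoid function to get a probability for the higher-sorted class (here 4, malignant). The sketch below uses two features and weights that are entirely hypothetical, chosen only to illustrate the arithmetic; the real values live in `model.coef_` and `model.intercept_`.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and intercept, NOT the fitted values from this notebook
weights = [0.5, 0.3]      # e.g. for clump_thickness and uniformity_cell_size
intercept = -4.0

features = [8, 6]         # one made-up sample

# Linear score: intercept + sum of weight * feature
score = intercept + sum(w * x for w, x in zip(weights, features))
p_malignant = sigmoid(score)

# Default decision rule: predict the positive class when p >= 0.5
predicted_class = 4 if p_malignant >= 0.5 else 2

print(f"score = {score:.2f}, P(malignant) = {p_malignant:.3f}, class = {predicted_class}")
```

On the fitted model, `model.predict_proba(X_test)` returns these probabilities with one column per class, in the order given by `model.classes_`.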

In [13]:
# Predictions on the training and test sets
y_hat_train = model.predict(X_train)
y_hat_test = model.predict(X_test)
y_hat_train

array([2, 4, 4, 4, 4, 2, 2, 4, 2, 4, 4, 4, 4, 4, 2, 4, 2, 4, 4, 2, 2, 2,
       2, 2, 2, 2, 4, 2, 2, 2, 4, 2, 4, 2, 2, 2, 4, 2, 4, 2, 2, 2, 2, 2,
       4, 4, 2, 4, 4, 2, 2, 4, 4, 2, 2, 4, 4, 2, 4, 2, 4, 2, 2, 4, 4, 2,
       2, 4, 4, 2, 4, 4, 2, 2, 2, 4, 2, 2, 4, 2, 4, 2, 2, 4, 4, 2, 2, 4,
       2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 4, 2, 2, 2, 2, 4, 2, 4, 2, 2, 2, 2,
       2, 4, 4, 2, 2, 2, 4, 2, 4, 2, 4, 2, 2, 4, 2, 4, 2, 2, 4, 2, 2, 2,
       2, 4, 2, 2, 2, 2, 2, 2, 2, 4, 4, 2, 2, 4, 2, 4, 2, 2, 2, 4, 2, 2,
       2, 4, 4, 2, 2, 4, 4, 2, 2, 2, 2, 2, 4, 2, 2, 4, 2, 4, 2, 2, 2, 4,
       4, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 4, 2, 2, 2, 4, 4, 4, 2,
       2, 4, 4, 2, 2, 2, 2, 2, 2, 2, 4, 4, 2, 2, 2, 2, 4, 4, 2, 2, 2, 2,
       4, 4, 2, 2, 2, 4, 2, 4, 4, 2, 4, 2, 2, 2, 2, 2, 4, 2, 4, 2, 2, 2,
       4, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 4, 4, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 4, 4, 2, 2, 4, 4, 2, 2, 4, 2, 4, 2, 2, 4, 2, 4, 2, 2, 2,
       4, 4, 4, 2, 4, 2, 2, 2, 2, 4, 4, 2, 4, 4, 4,

# Model Evaluation

In [14]:
# Evaluating the performance of the model
print("Accuracy score on training set: ", accuracy_score(y_train, y_hat_train))
print("Accuracy score on testing set: ", accuracy_score(y_test, y_hat_test))

Accuracy score on training set:  0.9688644688644689
Accuracy score on testing set:  0.9635036496350365


In [15]:
# Checking confusion matrix on test set
print("Confusion matrix on test set: ")
print(confusion_matrix(y_test, y_hat_test))

Confusion matrix on test set: 
[[83  2]
 [ 3 49]]


In [16]:
# Checking confusion matrix on train set
print("Confusion matrix on train set: ")
print(confusion_matrix(y_train, y_hat_train))

Confusion matrix on train set: 
[[350   9]
 [  8 179]]
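
Accuracy alone hides which kind of error the model makes, and in this setting a missed malignant case is usually the more serious one. The sketch below derives precision, recall and specificity from the two matrices printed above, treating class 4 (malignant) as the positive class. It relies on the scikit-learn convention that rows are true labels and columns are predicted labels, in sorted label order (2, then 4); the helper name `binary_metrics` is my own.

```python
def binary_metrics(cm):
    """Compute basic metrics from a 2x2 confusion matrix [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),    # of predicted malignant, how many are malignant
        "recall": tp / (tp + fn),       # of true malignant, how many were found
        "specificity": tn / (tn + fp),  # of true benign, how many were cleared
    }

# Matrices copied from the outputs above
test_cm = [[83, 2], [3, 49]]
train_cm = [[350, 9], [8, 179]]

test_metrics = binary_metrics(test_cm)
train_metrics = binary_metrics(train_cm)

for name, metrics in [("test", test_metrics), ("train", train_metrics)]:
    summary = ", ".join(f"{k}={v:.3f}" for k, v in metrics.items())
    print(f"{name}: {summary}")
```

On the test set this shows 3 of the 52 malignant cases classified as benign, which is the figure to watch if the model were used for screening.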
