# Breast cancer detector with tumor type

We will use a logistic regression model. In the class column, 2 means that the tumor is benign and 4 means that it is malignant.
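As a small illustration of this encoding, the sketch below maps the numeric class codes to readable names. The names `CLASS_NAMES` and `class_codes` are hypothetical and the values are made up; neither is part of the notebook.

```python
import pandas as pd

# Hypothetical lookup table that follows the encoding described above.
CLASS_NAMES = {2: "benign", 4: "malignant"}

# Made-up class codes standing in for the last column of the dataset.
class_codes = pd.Series([2, 4, 2])

# Series.map replaces each code with its entry in the dictionary.
class_labels = class_codes.map(CLASS_NAMES)
print(class_labels.tolist())
```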

## Importing the libraries

In [1]:
import pandas as pd
import numpy as np

## Importing the dataset

In [2]:
dataset = pd.read_csv("breast-cancer-wisconsin.csv")
x = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values

In [3]:
print(x)

[[5 '?' '?' ... 3 1 1]
 [5 '4' '4' ... 3 2 1]
 [3 '1' '1' ... 3 1 1]
 ...
 [5 '10' '10' ... 8 10 2]
 [4 '8' '6' ... 10 6 1]
 [4 '8' '8' ... 10 4 1]]


**Some columns contain strings where numbers are expected. These values are strings because missing entries are recorded as the '?' character.** To deal with this, we pass each column through to_numeric() with errors='coerce', which turns every "?" into NaN and every numeric string into a number. We use a for loop because the to_numeric function only accepts 1-D input.

In [4]:
# Convert one column at a time; anything non-numeric (the "?") becomes NaN.
for i in range(x.shape[1]):
    x[:, i] = pd.to_numeric(x[:, i], errors='coerce')
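To see what `errors='coerce'` does in isolation, here is a minimal sketch on made-up values (not taken from the dataset). It also shows a loop-free variant, which assumes the features are held in a DataFrame rather than a NumPy array.

```python
import numpy as np
import pandas as pd

# Made-up column mixing numeric strings with the '?' placeholder.
raw = np.array(["5", "?", "10"], dtype=object)

# Unparseable entries become NaN; numeric strings become numbers.
converted = pd.to_numeric(raw, errors="coerce")
print(converted)

# Loop-free variant: apply the same conversion to every DataFrame column.
frame = pd.DataFrame({"a": ["1", "?"], "b": [3, "4"]})
numeric_frame = frame.apply(pd.to_numeric, errors="coerce")
print(numeric_frame)
```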

## Taking care of missing data

We have missing data, which was represented by "?" and is now stored as NaN. We have to replace these entries with meaningful values before training. To do this we will use the mean strategy of the SimpleImputer class, which fills each gap with the average of its column.

**Note: Another approach is to use the most_frequent strategy instead of mean. With that strategy we would not have to convert the whole set to numeric values first, so the for loop above would not be needed. It doesn't change the accuracy of the model on this dataset.**
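A minimal sketch of the most_frequent alternative, on made-up string data, could look like this. It assumes that "?" is passed directly as the missing-value marker. The final `astype(float)` is also an assumption: the imputed values are still strings, so some numeric conversion is needed before training.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up string data; '?' marks a missing entry, as in the dataset.
raw = np.array([["1", "?"],
                ["1", "2"],
                ["3", "2"]], dtype=object)

# Treat '?' itself as the missing marker and fill it with the column's mode.
mode_imputer = SimpleImputer(missing_values="?", strategy="most_frequent")
filled = mode_imputer.fit_transform(raw)
print(filled)

# The values are still strings here, so convert them once at the end.
filled_numeric = filled.astype(float)
print(filled_numeric)
```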

In [5]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(x[:,:])
x[:,:] = imputer.transform(x[:,:])

In [6]:
print(x)

[[5.0 3.1375358166189113 3.2106017191977076 ... 3.0 1.0 1.0]
 [5.0 4.0 4.0 ... 3.0 2.0 1.0]
 [3.0 1.0 1.0 ... 3.0 1.0 1.0]
 ...
 [5.0 10.0 10.0 ... 8.0 10.0 2.0]
 [4.0 8.0 6.0 ... 10.0 6.0 1.0]
 [4.0 8.0 8.0 ... 10.0 4.0 1.0]]
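The fractional values in the first row are column means that replaced the NaN entries. A minimal sketch on made-up numbers shows the same behavior on a scale that can be checked by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data with one gap in each column.
demo = np.array([[1.0, np.nan],
                 [3.0, 4.0],
                 [np.nan, 8.0]])

mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
demo_filled = mean_imputer.fit_transform(demo)

# Column means ignore NaN: (1 + 3) / 2 = 2 and (4 + 8) / 2 = 6.
print(mean_imputer.statistics_)
print(demo_filled)
```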


## Splitting the dataset into the Training set and Test set

In [7]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 0)
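With `test_size = 0.20`, one fifth of the rows is held out for testing, and a fixed `random_state` makes the shuffle reproducible. The sketch below checks both properties on made-up data. The `stratify` argument is not used in the notebook; it is shown only as one optional way to keep the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, two features, imbalanced 2/4 labels.
features = np.arange(40).reshape(20, 2)
labels = np.array([2] * 15 + [4] * 5)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.20, random_state=0, stratify=labels
)
print(f_train.shape, f_test.shape)

# Repeating the call with the same random_state gives the same split.
f_train_again, f_test_again, _, _ = train_test_split(
    features, labels, test_size=0.20, random_state=0, stratify=labels
)
print(np.array_equal(f_test, f_test_again))
```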

## Training the Logistic Regression model on the Training set

In [8]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

LogisticRegression(random_state=0)

## Predicting the Test set results

In [9]:
y_pred = classifier.predict(x_test)
print(np.concatenate((y_pred.reshape(-1, 1), y_test.reshape(-1, 1)), axis = 1))

[[2 2]
 [2 2]
 [4 2]
 [2 2]
 [4 4]
 [2 2]
 [4 4]
 [2 2]
 [4 4]
 [4 4]
 [2 2]
 [2 2]
 [4 4]
 [4 4]
 [4 4]
 [2 2]
 [2 2]
 [4 4]
 [4 4]
 [2 2]
 [4 4]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [2 2]
 [2 2]
 [4 4]
 [4 2]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [4 4]
 [2 2]
 [4 4]
 [2 2]
 [4 4]
 [4 4]
 [2 2]
 [2 2]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [2 2]
 [2 2]
 [4 4]
 [4 4]
 [4 4]
 [4 4]
 [2 2]
 [2 2]
 [4 4]
 [2 2]
 [2 2]
 [4 4]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [2 2]
 [4 4]
 [4 4]
 [2 2]
 [2 4]
 [4 2]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [2 2]
 [4 4]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [4 4]
 [4 4]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [2 2]
 [4 4]
 [4 4]
 [4 4]
 [4 4]
 [2 2]
 [4 4]
 [2 2]
 [2 2]
 [4 4]
 [4 4]
 [4 4]
 [4 4]
 [4 4]
 [2 2]
 [2 2]
 [4 4]
 [4 4]
 [2 2]
 [2 2]
 [4 4]
 [2 2]
 [2 2]]


## Making the Confusion Matrix and Accuracy Score

In [10]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)

In [11]:
print(cm)
print(accuracy_score(y_test, y_pred))

[[82  3]
 [ 1 54]]
0.9714285714285714
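The accuracy can be read off the confusion matrix directly. Rows are the true classes and columns are the predicted classes, in label order (2, then 4), so the diagonal holds the correct predictions. The sketch below recomputes the accuracy from the matrix printed above. The recall and precision for the malignant class are additions for illustration and are not computed in the notebook.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm_values = np.array([[82, 3],
                      [1, 54]])

# Correct predictions (diagonal) over all test samples: (82 + 54) / 140.
accuracy = np.trace(cm_values) / cm_values.sum()

# Malignant tumors that were caught: 54 / (1 + 54).
malignant_recall = cm_values[1, 1] / cm_values[1].sum()

# Malignant predictions that were right: 54 / (3 + 54).
malignant_precision = cm_values[1, 1] / cm_values[:, 1].sum()

print(f"accuracy:  {accuracy:.4f}")
print(f"recall:    {malignant_recall:.4f}")
print(f"precision: {malignant_precision:.4f}")
```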


## Computing the accuracy with k-Fold Cross Validation

In [12]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = x_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 96.60 %
Standard Deviation: 2.58 %
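The two numbers above summarize ten per-fold accuracy scores: their mean and their standard deviation. The sketch below reproduces that procedure on synthetic stand-in data, not on the breast cancer dataset. It also wraps the imputer and the classifier in a pipeline so that the column means are recomputed inside each fold. That wrapping is a variation on the notebook, not what the cells above do.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data with a few entries knocked out as NaN.
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)
X_demo[::25, 0] = np.nan

# Imputation happens inside the pipeline, so each fold fits its own means.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    LogisticRegression(random_state=0),
)

# One accuracy score per fold; cv=10 gives ten of them.
fold_scores = cross_val_score(model, X_demo, y_demo, cv=10)
print(fold_scores.shape)
print("Accuracy: {:.2f} %".format(fold_scores.mean() * 100))
print("Standard Deviation: {:.2f} %".format(fold_scores.std() * 100))
```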
