# Breast cancer detector

## Importing the libraries

In [1]:
import pandas as pd
import numpy as np

## Importing the dataset

In [2]:
dataset = pd.read_csv("breast-cancer-wisconsin.csv")
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [3]:
print(x)

[[1000025 5 '?' ... 3 1 1]
 [1002945 5 '4' ... 3 2 1]
 [1015425 3 '1' ... 3 1 1]
 ...
 [888820 5 '10' ... 8 10 2]
 [897471 4 '8' ... 10 6 1]
 [897471 4 '8' ... 10 4 1]]


**We have string values that should be numeric. These values are strings because of the '?' character.** To deal with this, we convert every column with to_numeric() and errors='coerce', which turns each "?" into NaN. We use a for loop because to_numeric takes a 1D array.

In [4]:
# Convert each column to numeric; entries that cannot be parsed (such as '?') become NaN
for i in range(x.shape[1]):
    x[:, i] = pd.to_numeric(x[:, i], errors='coerce')
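
To see the coercion in isolation, here is a minimal sketch on a hypothetical toy column (the array `raw` and its values are made up for illustration and are not taken from the dataset). It assumes only that the placeholder is the string "?".

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: numeric strings mixed with the '?' placeholder.
raw = np.array(["4", "?", "10", "1"], dtype=object)

# errors='coerce' replaces every entry that cannot be parsed with NaN
# instead of raising an error.
converted = pd.to_numeric(raw, errors="coerce")

print(converted)        # the '?' entry is now NaN
print(converted.dtype)  # a float dtype, because NaN is a float value
```

Because NaN is a floating-point value, the converted column is floating-point even though the original entries were integers.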

## Taking care of missing data

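We have missing data, which was represented by "?" in the raw file and is now NaN. We have to replace these values with meaningful data before processing. To do this we use the mean strategy of the SimpleImputer class, which fills each missing entry with the mean of its column. The first column (the sample ID) is left out of the imputation.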

In [5]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(x[:, 1:])
x[:, 1:] = imputer.transform(x[:, 1:])
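
As a rough illustration of what the mean strategy does, the sketch below runs SimpleImputer on a hypothetical 3x2 array (the name `toy` and its values are assumptions for the example, not part of the dataset). Each NaN is replaced by the mean of the non-missing values in the same column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy matrix with one missing value per column.
toy = np.array([[1.0, 2.0],
                [np.nan, 4.0],
                [3.0, np.nan]])

# strategy="mean" fills each NaN with the mean of its column,
# computed from the non-missing entries only.
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = mean_imputer.fit_transform(toy)

print(filled)
# Column 0: mean of 1 and 3 is 2, so the NaN becomes 2.0
# Column 1: mean of 2 and 4 is 3, so the NaN becomes 3.0
```

Other strategies, such as "median" or "most_frequent", can be selected with the same `strategy` parameter if the column mean is not a sensible fill value.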

In [6]:
print(x)

[[1000025 5.0 3.1375358166189113 ... 3.0 1.0 1.0]
 [1002945 5.0 4.0 ... 3.0 2.0 1.0]
 [1015425 3.0 1.0 ... 3.0 1.0 1.0]
 ...
 [888820 5.0 10.0 ... 8.0 10.0 2.0]
 [897471 4.0 8.0 ... 10.0 6.0 1.0]
 [897471 4.0 8.0 ... 10.0 4.0 1.0]]


## Splitting the dataset into the Training set and Test set

In [7]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

## Feature Scaling

In [8]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

## Training

## Predicting the Test set results

## Making the Confusion Matrix

## Visualising the Test set results