## Breast Cancer Diagnostic - Automatically choosing the best algorithm

The goal here is to predict a breast cancer diagnosis (malignant or benign) from several measurements taken on cell nuclei.

## Understanding the dataset

### Attribute information

#### 1) ID number

#### 2) Diagnosis

- M = malignant
- B = benign

#### 3-32) Features (Mean, Std Err, Worst/Largest)

Ten real-valued features are computed for each cell nucleus:

- a) **radius** (mean of distances from center to points on the perimeter)
- b) **texture** (standard deviation of gray-scale values)
- c) **perimeter**
- d) **area**
- e) **smoothness** (local variation in radius lengths)
- f) **compactness** (perimeter^2 / area - 1.0)
- g) **concavity** (severity of concave portions of the contour)
- h) **concave points** (number of concave portions of the contour)
- i) **symmetry**
- j) **fractal dimension** ("coastline approximation" - 1)

The **mean**, **standard error** and **"worst" or largest** (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
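
As a rough illustration of how the three summary statistics relate, the sketch below collapses per-nucleus measurements from one image into mean, standard error, and "worst" values. The function name `summarize_feature` and the sample radii are hypothetical, and the standard error is assumed to be the sample standard deviation divided by the square root of the count; the attribute description above does not spell out the exact formula.

```python
import math
import statistics

def summarize_feature(values):
    """Collapse per-nucleus measurements from one image into three summaries."""
    mean = statistics.mean(values)
    # Assumption: standard error = sample standard deviation / sqrt(n)
    std_err = statistics.stdev(values) / math.sqrt(len(values))
    # "Worst": mean of the three largest values
    worst = statistics.mean(sorted(values)[-3:])
    return mean, std_err, worst

# Hypothetical radii measured for five nuclei in a single image
radii = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, std_err, worst = summarize_feature(radii)
print(f"mean={mean:.3f}, std_err={std_err:.3f}, worst={worst:.3f}")
```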

### Observations

- All feature values are recoded with four significant digits.
- Missing attribute values: none.
- Class distribution: 357 benign, 212 malignant.
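
The class distribution gives a useful reference point: a trivial classifier that always predicts the majority class already reaches a certain accuracy, and any model should beat it. A quick calculation from the counts listed above:

```python
# Class counts as listed above
benign, malignant = 357, 212
total = benign + malignant

# Accuracy of a trivial classifier that always predicts the majority class
baseline_accuracy = max(benign, malignant) / total
print(f"{total} samples, majority-class baseline accuracy: {baseline_accuracy:.3f}")
```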

In [None]:
# import standard libraries for linear algebra, handling data and plotting
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

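# use the full option name; the bare 'precision' key is ambiguous in recent pandas versions
pd.set_option('display.precision', 2)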
sns.set(style="white", color_codes=True)

%matplotlib inline

In [None]:
# read the data into a Pandas DataFrame
data = pd.read_csv('../input/data.csv')

# show some samples
data.head()

In [None]:
# remove the unnamed, empty column at positional index 32
data.drop(data.columns[32], axis=1, inplace=True)

# save identifications
ids = data['id']

# remove unnecessary id column
data.drop(['id'], axis=1, inplace=True)

In [None]:
# how many rows and columns are there in the data?
data.shape

In [None]:
# what are the column names and their datatypes?
data.info()

In [None]:
# is there any column with null values?
data.isnull().sum()

In [None]:
# describing numerical features
data.describe()

In [None]:
# describing categorical features
data.describe(include=['O'])

In [None]:
# which are the possible values for the categorical attribute?
data['diagnosis'].value_counts()

In [None]:
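# standardizing numeric values (zero mean, unit variance) so that large-scale features do not dominate
# note: the scaler is fitted on the full dataset here, before the train/test split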
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data.iloc[:,1:] = scaler.fit_transform(data.iloc[:,1:])
data.head()
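
For reference, `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The pure-Python sketch below reproduces that arithmetic for a single column. It also shows the statistics being learned from training values only and then reused on held-out values, which is one way to keep test-set information out of preprocessing. The helper names and the sample numbers are hypothetical.

```python
import math

def fit_standardizer(values):
    """Return the mean and population standard deviation of a column."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    if std == 0:
        std = 1.0  # constant column: avoid dividing by zero
    return mean, std

def standardize(values, mean, std):
    """Apply previously learned statistics to a column."""
    return [(v - mean) / std for v in values]

train_column = [10.0, 12.0, 14.0]  # hypothetical training values
test_column = [12.0, 16.0]         # hypothetical held-out values

mean, std = fit_standardizer(train_column)
train_scaled = standardize(train_column, mean, std)
test_scaled = standardize(test_column, mean, std)
print(train_scaled, test_scaled)
```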

In [None]:
# analyzing correlation between the numeric features
# (the 'diagnosis' column holds strings, so it is excluded)
data.iloc[:, 1:].corr()

In [None]:
# plotting this correlation between features
fig, ax = plt.subplots(figsize=(17, 17))
sns.heatmap(data=data.iloc[:, 1:].corr().round(2), annot=True, cmap="PiYG", ax=ax)
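
`DataFrame.corr()` computes the Pearson coefficient by default. As a minimal sketch of what each heatmap cell represents, the snippet below computes that coefficient by hand for circles of hypothetical radii. Geometrically linked quantities such as radius and perimeter would be expected to correlate almost perfectly, which is worth keeping in mind when reading the heatmap.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical radii of perfectly circular shapes
radii = [1.0, 2.0, 3.0, 4.0, 5.0]
perimeters = [2 * math.pi * r for r in radii]
areas = [math.pi * r ** 2 for r in radii]

r_perimeter = pearson(radii, perimeters)  # linear relation
r_area = pearson(radii, areas)            # quadratic relation, still strongly correlated
print(f"radius vs perimeter: {r_perimeter:.3f}, radius vs area: {r_area:.3f}")
```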

There are too many features to pair-plot all at once, so let's plot them in three groups of ten.

In [None]:
# pair plot diagnosis + 10 features (mean)
sns.pairplot(data=data.iloc[:,0:11], hue='diagnosis', diag_kind='kde')

In [None]:
# pair plot diagnosis + 10 features (std error)
sns.pairplot(data=data.iloc[:,np.append([0],np.arange(11,21))], hue='diagnosis', diag_kind='kde')

In [None]:
# pair plot diagnosis + 10 features (worst/largest)
sns.pairplot(data=data.iloc[:,np.append([0],np.arange(21,31))], hue='diagnosis', diag_kind='kde')
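
The column selections in the three pair plots follow a regular layout: position 0 holds `diagnosis`, followed by three consecutive blocks of ten features. A small helper makes that pattern explicit. `feature_group_columns` is a hypothetical name, and the sketch assumes the `id` column and the empty trailing column have already been dropped, as done earlier in the notebook.

```python
def feature_group_columns(group, features_per_group=10):
    """Positional indices for `diagnosis` (0) plus one block of features.

    group: 0 = mean, 1 = standard error, 2 = worst/largest
    """
    start = 1 + group * features_per_group
    return [0] + list(range(start, start + features_per_group))

for group, label in enumerate(["mean", "std error", "worst"]):
    print(label, feature_group_columns(group))
```

Passing one of these lists to `data.iloc[:, ...]` should select the same columns as the slicing expressions used in the pair plots.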

In these three plots, no pair of numerical attributes shows a clear separation between the two classes (diagnosis). Let's try a scatter plot of one specific pair of attributes.

In [None]:
# scatter plotting radius (mean) versus concave points (mean)
sns.FacetGrid(data, hue='diagnosis', height=5) \
   .map(plt.scatter, 'radius_mean', 'concave points_mean') \
   .add_legend()

The classes still overlap. Let's try a box plot of a single feature, radius (mean), grouped by diagnosis.

In [None]:
# box plotting diagnosis against radius (mean)
sns.boxplot(x='diagnosis', y='radius_mean', data=data)

The distributions still overlap. Let's try a histogram of the same attribute.

In [None]:
# create a histogram of radius (mean)
sns.FacetGrid(data, hue='diagnosis')\
   .map(plt.hist, 'radius_mean', alpha=.5, bins=20)\
   .add_legend()

There's a region in the middle where the two classes overlap and cannot be told apart. Let's try another feature.

In [None]:
# create a histogram of concave points (mean)
sns.FacetGrid(data, hue='diagnosis')\
   .map(plt.hist, 'concave points_mean', alpha=.5, bins=20)\
   .add_legend()

No single feature cleanly separates the classes (i.e., gives a reliable clue to the diagnosis). Therefore, we'll include every numeric feature in the model.
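
One rough way to quantify how far a single feature gets is to look for the cut-off that classifies the most samples correctly. The sketch below does this in plain Python. The function name and the handful of standardized values are made up for illustration and are not taken from the dataset.

```python
def best_threshold(values, labels, positive="M"):
    """Try each observed value as a cut-off; predict `positive` when value >= cut-off."""
    best_cut, best_acc = None, 0.0
    for cut in sorted(set(values)):
        correct = sum(
            (value >= cut) == (label == positive)
            for value, label in zip(values, labels)
        )
        acc = correct / len(values)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut, best_acc

# Hypothetical standardized values with overlapping classes
values = [-1.2, -0.8, -0.3, 0.1, 0.2, 0.4, 0.9, 1.5]
labels = ["B", "B", "B", "M", "B", "M", "M", "M"]

cut, acc = best_threshold(values, labels)
print(f"best cut-off: {cut}, accuracy: {acc:.3f}")
```

When the classes overlap, as in this toy sample, even the best cut-off leaves some samples misclassified.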

In [None]:
# select the features
#X = data[data.columns[[1, 3, 4, 7, 8]]] # use only selected 5 features
X = data.iloc[:,1:] # use all numeric features
X.head()

In [None]:
# select the class column
y = data.diagnosis
y.tail()

Let's start creating a model for the problem based on the data!

In [None]:
# importing packages used in model selection and metrics evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# importing all the necessary packages to use the various classification algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

First of all, let's split the data we have prepared so far into two sets: training and testing. The latter should contain fewer rows.

In [None]:
# separate data for training (70%) and testing (30%)

print('original data shapes:', X.shape, y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print('split data shapes:', X_train.shape, X_test.shape, y_train.shape, y_test.shape)
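
To see what the split amounts to, the sketch below works out the expected set sizes for the 569 rows and performs a plain shuffle-and-slice split on row indices. It mirrors the idea only: the helper name is hypothetical, and the shuffling will not reproduce the exact rows chosen by `train_test_split`, even with the same seed.

```python
import math
import random

def shuffle_split(n_rows, test_size=0.3, seed=42):
    """Shuffle row indices and slice them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Assumption: the test share is rounded up to a whole number of rows
    n_test = math.ceil(test_size * n_rows)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(569)
print('train rows:', len(train_idx), 'test rows:', len(test_idx))
```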

Next, we'll instantiate each algorithm to be checked. We'll insert them in a single list.

In [None]:
# instantiate the algorithms to be checked
models = []
models.append(('Support Vector Machines (SVM)', SVC()))
models.append(('Logistic Regression', LogisticRegression()))
models.append(('Decision Tree', DecisionTreeClassifier()))
models.append(('K-Nearest Neighbours (3)', KNeighborsClassifier(n_neighbors=3)))
models.append(('K-Nearest Neighbours (7)', KNeighborsClassifier(n_neighbors=7)))
models.append(('K-Nearest Neighbours (11)', KNeighborsClassifier(n_neighbors=11)))
models.append(('Random Forest', RandomForestClassifier()))
models.append(('Random Forest (10)', RandomForestClassifier(n_estimators=10)))
models.append(('Random Forest (100)', RandomForestClassifier(n_estimators=100)))
models.append(('Gaussian Naïve Bayes', GaussianNB()))
models.append(('Perceptron (5)', Perceptron(max_iter=5)))
models.append(('Perceptron (10)', Perceptron(max_iter=10)))
models.append(('Perceptron (50)', Perceptron(max_iter=50)))
models.append(('Stochastic Gradient Descent (SGD)', SGDClassifier(max_iter=50)))
models.append(('Linear SVC', LinearSVC()))

For each algorithm, let's train the model, predict the test values, and measure the accuracy. We also compute a confusion matrix and record the number of false negatives.
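
As a reminder of what the four confusion-matrix cells mean here, the sketch below counts them by hand, treating malignant (`'M'`) as the positive class. scikit-learn's `confusion_matrix` expects the true labels first and the predictions second; with the labels sorted as `'B'`, `'M'`, `ravel()` then yields the counts in the order tn, fp, fn, tp. The function name and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive="M"):
    """Count tn, fp, fn, tp for a binary problem."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1  # malignant, correctly detected
        elif truth == positive:
            fn += 1  # malignant, predicted benign (the costly error)
        elif pred == positive:
            fp += 1  # benign, predicted malignant
        else:
            tn += 1  # benign, correctly identified
    return tn, fp, fn, tp

# Toy labels for illustration only
y_true = ["M", "M", "B", "B", "M"]
y_pred = ["M", "B", "B", "M", "M"]
print(confusion_counts(y_true, y_pred))
```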

In [None]:
names = []
scores = []
falnegs = []

best_model = None
highest_score = 0.0
false_negatives = None

for name, model in models:
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    
    y_pred = model.predict(X_test)

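    # confusion_matrix expects (y_true, y_pred); with labels sorted as 'B', 'M',
    # ravel() returns tn, fp, fn, tp with malignant as the positive class
    cm = confusion_matrix(y_test, y_pred)
    tn, fp, fn, tp = cm.ravel()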
    print(name, '\n', cm, '\n')

    names.append(name)
    scores.append(score)
    falnegs.append(fn)

    if ((score > highest_score) or (score == highest_score and fn < false_negatives)):
        best_model = model
        highest_score = score
        false_negatives = fn
        
print('Best model:', best_model, '\n[Score: %.3f, False Negatives: %d]' % (highest_score, false_negatives))

The cell above gives us the best model found. We choose it by the highest accuracy, breaking ties by the fewest false negatives.
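
The selection rule can also be expressed as a single sort key: maximize the score, then minimize the false negatives. The result tuples below are hypothetical placeholders, not measured values.

```python
# (name, accuracy, false negatives): hypothetical numbers for illustration only
candidates = [
    ("model A", 0.96, 4),
    ("model B", 0.97, 3),
    ("model C", 0.97, 1),
]

# Highest accuracy first; ties broken by the fewest false negatives
best = min(candidates, key=lambda c: (-c[1], c[2]))
print(best)
```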

In [None]:
results = pd.DataFrame({'Model': names, 'Score': scores, 'FN': falnegs})
results.sort_values(by=['Score', 'FN'], ascending=[False, True])

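With the best model found, let's predict the values for the entire dataset and then print the score and the confusion matrix. Note that the entire dataset includes the rows used for training, so this score tends to be more optimistic than the test score.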

In [None]:
# consider the best algorithm found
model = best_model

# train the model with the training dataset
model.fit(X_train, y_train)

# calculate the score against the whole dataset
score = model.score(X, y)
print('Final score:', score)

# produce the confusion matrix
y_pred = model.predict(X)
print('Confusion matrix:\n', confusion_matrix(y, y_pred), '\n')

Finally, let's create a DataFrame containing a possible submission to the competition.

In [None]:
submission = pd.DataFrame({
  "ID": ids,
  "Diagnosis": y,
  "Predicted": y_pred,
  "Correct": (y == y_pred).map({True: 1, False: 0})
})
submission.head(10)

If there are any incorrect classifications, which ones are they?

In [None]:
# show the incorrectly classified cases

incorrectly = submission[submission["Correct"] == 0]

incorrect = len(incorrectly.index)
total_cases = len(submission)
# multiply by 100 so the value matches the percent sign
print('Incorrectly classified cases:', incorrect,
      'of', total_cases, '(%.3f%%)' % (100 * incorrect / total_cases))

incorrectly

False negatives (malignant cases predicted as benign) are the least forgivable errors in this study. Which are they?

In [None]:
unforgiven_incorrectly = submission.query("Diagnosis == 'M' & Predicted == 'B'")
unforgiven_incorrectly.head()
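
Since false negatives matter most here, recall on the malignant class (the share of malignant cases that were caught) is a natural companion to accuracy. A minimal sketch with hypothetical counts:

```python
def recall(true_positives, false_negatives):
    """Share of actual positives that the model detected."""
    actual_positives = true_positives + false_negatives
    return true_positives / actual_positives if actual_positives else 0.0

# Hypothetical counts: 60 malignant cases caught, 3 missed
print(f"recall = {recall(60, 3):.3f}")
```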

The last thing: submitting the final file.

In [None]:
submission.to_csv("predicted.csv", index=False)