<a href="https://colab.research.google.com/github/bpesquet/machine-learning-katas/blob/master/classic-datasets/Breast_Cancer.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open in Google Colaboratory"></a>

# Breast Cancer Dataset

| Learning type | Activity type | Objective |
| - | - | - |
| Supervised | Binary classification | Predict if a tumor is benign or malignant |


## About the dataset

The [Breast Cancer][1] dataset is used for multivariate binary classification. There are 569 total samples with 30 features each. Features were computed from a digitized image of a fine needle aspirate of a breast mass. They describe characteristics of the cell nuclei present in the image.

![](images/breast-cancer-logo.jpg)

[1]: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

## Package setup

In [None]:
#DO THE NECESSARY IMPORTS
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn import model_selection
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Display plots inline, change default figure size and change plot resolution to retina
# %matplotlib inline
# %config InlineBackend.figure_format = 'retina'
# Set Seaborn aesthetic parameters to defaults
sns.set()

## Step 1: Loading the data
**EXERCISE: Load the data in a similar way as you did with iris and wine ;).**

In [None]:
dataset = load_breast_cancer()

In [None]:
# Put data in a pandas DataFrame
df_breast = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df_breast.head()

In [None]:
print(dataset["DESCR"])

In [None]:
# Add target and class to DataFrame
df_breast['target'] = dataset.target
df_breast['class'] = dataset.target_names[dataset.target]
df_breast.head()
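
As a quick sanity check on the labels just added, the class balance can be counted. This is a minimal sketch that reloads the data so it runs on its own; `df_check` and `class_counts` are hypothetical names, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Reload the dataset so this cell does not depend on earlier cells
dataset = load_breast_cancer()
df_check = pd.DataFrame(dataset.data, columns=dataset.feature_names)

# target_names is indexed by the integer label (0 = malignant, 1 = benign)
df_check["class"] = dataset.target_names[dataset.target]

# Number of samples per class
class_counts = df_check["class"].value_counts()
print(class_counts)
```

The classes are not perfectly balanced, which is worth keeping in mind when reading the scatter plot later on.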

Since the original labels are in 0/1 format (0 = malignant, 1 = benign), we could replace them with the strings `Malignant` and `Benign` using the `.replace` function and assign the result back to `df_breast['target']`.
**Optional**

In [None]:
#df_breast['target'] = df_breast['target'].replace({0: 'Malignant', 1: 'Benign'})
#if we do this we should also drop the class
df=df_breast.drop(["class"], axis=1)
df.head()

## Visualizing the Breast Cancer data

### Select your X and y's

In [None]:
# YOUR CODE HERE
X = df.drop(columns=["target"])  # features only: keep the label out of X
y = df["target"]

y.head()

### Step 2: Normalize the data in X

In [None]:
#do the necessary imports (you can also do them all above)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
X_scaled

### Optional

Convert this to a pandas dataframe again in order to visualize it. 
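
One possible way to do this, as a minimal sketch: wrap the scaled NumPy array in a DataFrame and reuse the original column names. The name `df_scaled` is an assumption made for illustration, and the data is reloaded so the cell runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

# Reload the 30 features so this sketch is self-contained
dataset = load_breast_cancer()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)

# StandardScaler returns a plain NumPy array, so column names are lost
X_scaled = StandardScaler().fit_transform(X)

# Put the names (and the row index) back by building a new DataFrame
df_scaled = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)

print(df_scaled.shape)
df_scaled.head()
```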

### Short exercise
Let's check whether the normalized data has a mean of zero and a standard deviation of one. **Those are indicators of proper normalization**

In [None]:
np.mean(X_scaled), np.std(X_scaled)
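
A global mean and standard deviation can hide a single badly scaled column, so it may be more informative to check each feature separately. The sketch below assumes the 30 raw features are standardized with `StandardScaler`; `feature_means` and `feature_stds` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

# Standardize the 30 raw features
X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# axis=0 computes one statistic per feature (column)
feature_means = X_scaled.mean(axis=0)
feature_stds = X_scaled.std(axis=0)

# Means are only zero up to floating-point error, hence the tolerance
print(np.allclose(feature_means, 0.0, atol=1e-8))
print(np.allclose(feature_stds, 1.0))
```

Both checks should print `True` if every column was standardized.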

## Apply the PCA Method

In [None]:
#do the necessary imports (PCA was already imported in the package setup)
from sklearn.decomposition import PCA

Next, let's create a **new DataFrame** that will have the principal component values for all the samples.

In [None]:
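pca_breast = PCA(n_components=2)  # keep the first two principal components
df_principal_breast = pd.DataFrame(pca_breast.fit_transform(X_scaled), columns=['principal component 1', 'principal component 2'])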

Once you have the principal components, you can inspect **explained_variance_ratio_**. It gives the fraction of the total variance (information) that each principal component retains after projecting the data onto a lower-dimensional subspace.

In [None]:
print('Explained variation per principal component: {}'.format(pca_breast.explained_variance_ratio_))

From the above output, you can observe that principal component 1 holds **44.2%** of the information while principal component 2 holds only **19%**. Note also that projecting the thirty-dimensional data onto two dimensions loses the remaining 36.8% of the information.
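
The retained and lost fractions quoted above can be recomputed directly from the ratios. This is a minimal sketch assuming a 2-component PCA fitted on the 30 standardized features; `ratios`, `retained` and `lost` are hypothetical names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the features, then fit a 2-component PCA
X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)
pca_breast = PCA(n_components=2).fit(X_scaled)

# Fraction of total variance held by each kept component
ratios = pca_breast.explained_variance_ratio_

# Variance kept by the 2D projection, and what the projection discards
retained = ratios.sum()
lost = 1.0 - retained

print(f"Retained: {retained:.1%}, lost: {lost:.1%}")
```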

## Plotting the visualization

In [None]:
plt.figure(figsize=(10,10))
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Principal Component - 1',fontsize=20)
plt.ylabel('Principal Component - 2',fontsize=20)
plt.title("Principal Component Analysis of Breast Cancer Dataset",fontsize=20)
targets = [0, 1]
colors = ['r', 'g']
for target, color in zip(targets,colors):
    indicesToKeep = df_breast['target'] == target
    plt.scatter(df_principal_breast.loc[indicesToKeep, 'principal component 1']
               , df_principal_breast.loc[indicesToKeep, 'principal component 2'], c = color, s = 50)

plt.legend(targets,prop={'size': 15})

### Additional Challenge: Can you fix the labels in the plot so they are the original classes? 
Benign and Malignant instead of 1 and 0
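
One possible approach to the challenge, offered as a sketch rather than the only solution: give each scatter call a `label` taken from `dataset.target_names` and let `legend()` pick the labels up. The names `df_pca` and `mask` are assumptions for illustration, and the PCA is recomputed so the cell runs on its own.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Recompute the 2D projection so this sketch is self-contained
dataset = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(dataset.data)
components = PCA(n_components=2).fit_transform(X_scaled)
df_pca = pd.DataFrame(
    components, columns=["principal component 1", "principal component 2"]
)

fig, ax = plt.subplots(figsize=(10, 10))
# target_names is ['malignant', 'benign'], so label 0 is malignant
for target, color in zip([0, 1], ["r", "g"]):
    mask = dataset.target == target
    ax.scatter(
        df_pca.loc[mask, "principal component 1"],
        df_pca.loc[mask, "principal component 2"],
        c=color,
        s=50,
        label=dataset.target_names[target].capitalize(),
    )

ax.set_xlabel("Principal Component - 1", fontsize=20)
ax.set_ylabel("Principal Component - 2", fontsize=20)
ax.set_title("Principal Component Analysis of Breast Cancer Dataset", fontsize=20)
# legend() uses the label passed to each scatter call
ax.legend(prop={"size": 15})
```

Passing `label=` per series keeps each legend entry tied to its own points, so the names cannot drift out of order with the colors.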