# BREAST CANCER WISCONSIN ANALYSIS

## Purpose

The purpose of this notebook is to use the Breast Cancer Wisconsin dataset to determine, from **some characteristics** of cell nuclei, whether the diagnosis is **malignant** or **benign**.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid')

from sklearn.model_selection import train_test_split

pd.set_option('display.max_columns', None)

## Comprehension of the data

A Fine Needle Aspiration (FNA) Biopsy is a simple procedure that involves passing a thin needle through the skin to sample fluid or tissue from a cyst or solid mass, as can be seen in the picture below. 

<img src="https://healthengine.com.au/info/uploads/VMC/DiseaseImages/1270_Breast_FNA.jpg">

### The cell nuclei of the sample are then examined and the following characteristics are measured:

**a)** radius (mean of distances from center to points on the perimeter)

**b)** texture (standard deviation of gray-scale values)

**c)** perimeter

**d)** area

**e)** smoothness (local variation in radius lengths)

**f)** compactness (perimeter^2 / area - 1.0)

**g)** concavity (severity of concave portions of the contour)

**h)** concave points (number of concave portions of the contour)

**i)** symmetry

**j)** fractal dimension ("coastline approximation" - 1)

### For each of these characteristics, there are 3 values:

**1)** the mean over all cell nuclei

**2)** the standard error over all cell nuclei (the `_se` columns)

**3)** finally, the "worst", which is the mean of the three largest values
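
As a rough illustration of these three summaries, here is a minimal sketch on made-up values. The helper `summarize` and the radii are hypothetical, and the use of the sample standard deviation (`ddof=1`) for the standard error is an assumption, not something the dataset documentation confirms.

```python
import numpy as np

def summarize(values):
    """Return (mean, standard error, worst) for one characteristic of one image."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    # Standard error of the mean; ddof=1 (sample standard deviation) is an assumption
    se = values.std(ddof=1) / np.sqrt(len(values))
    # "Worst" = mean of the three largest values
    worst = np.sort(values)[-3:].mean()
    return mean, se, worst

# Hypothetical radii measured on five nuclei of a single image
radii = [10.0, 12.0, 14.0, 16.0, 18.0]
mean, se, worst = summarize(radii)
print(mean, se, worst)
```

With these toy values the "worst" radius is the mean of 14, 16 and 18.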

### Some examples : 

- smoothness_mean: the local variation is computed for each cell, and we take the mean
- symmetry_worst: the symmetry is computed for each cell, the 3 largest values are selected, and their mean is taken

### Here are the first rows of the dataframe:

In [None]:
data = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')
display(data.head())
data.info()

The shape of this dataframe is (569, 33). It has 569 rows (= examples) and 33 columns. Furthermore, there
In these 33 columns, we have:
- the id column: identifier of the analysed sample
- an unnamed column at the end (`Unnamed: 32`), which contains only NaN values, so we will drop it
- all the characteristics: they take 30 columns because we have 10 different characteristics and 3 different values for each of them, so 10*3 = 30
- finally, the most important column: the **"label"** (`diagnosis`). This is what we want to predict given the measurements of the cell nuclei.
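
The 10 x 3 = 30 count can be sketched by building the expected column names. The exact spellings below (for example `concave points` with a space, and the `_se` suffix) are assumed from the usual Kaggle CSV and should be checked against `data.columns`.

```python
# Assumed base names of the 10 characteristics (to be checked against data.columns)
characteristics = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry", "fractal_dimension",
]
# One suffix per summary value: mean, standard error, worst
suffixes = ["mean", "se", "worst"]

feature_columns = [f"{c}_{s}" for s in suffixes for c in characteristics]
all_columns = ["id", "diagnosis"] + feature_columns + ["Unnamed: 32"]

print(len(feature_columns), len(all_columns))
```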


## Let's look at the correlation between the features

In [None]:
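# numeric_only=True skips the string column 'diagnosis'
corr = data.corr(numeric_only=True)
# Boolean mask hiding the upper triangle (and the diagonal) of the heatmap
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True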
plt.figure(figsize=(20,20))
sns.heatmap(corr, mask=mask, center=0, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()
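
To see what the mask does, here is a small sketch on a made-up 3 x 3 correlation matrix: the entries set to `True` are the ones the heatmap hides, so only the strict lower triangle stays visible.

```python
import numpy as np

# Toy symmetric correlation matrix (values invented for illustration)
corr_toy = np.array([[1.0, 0.8, 0.1],
                     [0.8, 1.0, 0.3],
                     [0.1, 0.3, 1.0]])

# True on the upper triangle and the diagonal, False elsewhere
mask_toy = np.zeros_like(corr_toy, dtype=bool)
mask_toy[np.triu_indices_from(mask_toy)] = True
print(mask_toy)

# Values that remain visible: each pair of features appears exactly once
visible = corr_toy[~mask_toy]
print(visible)
```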

# Let's understand the label


In [None]:
y = data['diagnosis'].copy()

print('There are only {} different labels, which are {}'.format(len(np.unique(y)), np.unique(y)))
print('----------------------------------------------')
proportion_B = (y == 'B').sum()/len(y) *100
print('''Proportion of label = 'B': {:0.2f} %'''.format(proportion_B))
print('''Proportion of label = 'M': {:0.2f} %'''.format(100 - proportion_B))

## But what do M and B mean?

M stands for **Malignant** and B for **Benign**

## What does that mean?

The samples taken during the FNA are examined by a pathologist under a microscope. A detailed report will then be provided about the type of cells that were seen, including any suggestion that the cells might be cancerous. It is important to remember that having a lump or mass does not necessarily mean that it is cancerous; many fine needle aspiration biopsies reveal that suspicious lumps or masses are benign (non-cancerous) or cysts. Aspirate samples may be described as one of the following types:

* **Inadequate/insufficient:** The sample taken was not adequate to exclude or confirm a diagnosis.
* **Benign:** There are no cancerous cells present. The lump or growth is under control and has not spread to other areas of the body.
* **Atypical/indeterminate, or suspicious of malignancy:** The results are unclear. Some cells appear abnormal but are not definitely cancerous. A surgical biopsy may be required to adequately sample the cells.
* **Malignant:** The cells are cancerous and uncontrolled, and have the potential to spread, or have already spread, to other areas of the body.

In our dataset, **only** Benign and Malignant aspirate samples are described.

Now that we have understood the dataset, let's do some **preprocessing**! (This step puts the dataset in the best possible form to feed the machine learning model.)

## Let's change M to 1 and B to 0

In [None]:
from sklearn.preprocessing import LabelEncoder
def our_binary_encoder(y):
    le = LabelEncoder()
    y_enc = le.fit_transform(y)
    return y_enc

y_enc = our_binary_encoder(y)
y_enc.shape

### y_enc is a NumPy array where 1 means Malignant (cancerous) and 0 means Benign (non-cancerous)
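
The mapping comes from `LabelEncoder` sorting the class names alphabetically, so `'B'` gets 0 and `'M'` gets 1. A minimal sketch on a few toy labels (not the real column):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["M", "B", "B", "M", "B"])  # toy labels

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(le.classes_)  # class names in the order of their integer codes
print(encoded)

# The encoder can also map the integers back to the original letters
decoded = le.inverse_transform(encoded)
```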

# Preprocessing

Here the main task is to create separate X and y (our examples and their labels) in order to find a **mapping** between them.

We will also **rescale** the data, because features measured on larger scales would otherwise dominate the others and bias the model. After scaling, each column will have a mean of 0 and a variance of 1.
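
As a small sketch of what rescaling does, here is `StandardScaler` applied to two made-up columns with very different scales. The toy array and variable names are for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features: the second is 1000 times larger than the first
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 2000.0],
                  [3.0, 3000.0],
                  [4.0, 4000.0]])

scaler_toy = StandardScaler()
X_scaled = scaler_toy.fit_transform(X_toy)

print(X_scaled.mean(axis=0))  # close to 0 for every column
print(X_scaled.std(axis=0))   # close to 1 for every column
```

After scaling, both columns hold the same values, so neither dominates because of its unit.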

In [None]:
def preprocessing(df):
    df = df.copy()
    
    ##drop useless columns (id and last) and the label
    df = df.drop(['id', 'Unnamed: 32','diagnosis'], axis=1)
    
    return df

X = preprocessing(data)
X

Now the data is clean: there are no missing values, and we assume there are no outliers. Thus, preprocessing is very fast.

Now we split our data with train_test_split from sklearn.model_selection

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y_enc, test_size = 0.2, random_state = 42)
print(f'The shape of X_train is {X_train.shape}')
print(f'The shape of X_test is {X_test.shape}')
print(f'The shape of y_train is {y_train.shape}')
print(f'The shape of y_test is {y_test.shape}')
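
The expected shapes can be checked on a dummy array of the same size. This sketch also passes `stratify`, which the cell above does not use; it is shown only as an optional variation that keeps the class proportions similar in both splits. The class counts (357 benign, 212 malignant) are the ones usually reported for this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data with the same shape as X, and the usual class counts of the dataset
X_dummy = np.zeros((569, 30))
y_dummy = np.array([0] * 357 + [1] * 212)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.2, random_state=42, stratify=y_dummy
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # share of class 1 in each split
```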

# Now, it's time to train our algorithms in 2 different ways

## 1. Let's use a Pipeline to show its strength

A **pipeline** combines a **transformer** and an **estimator** in a single block!

We will build the following pipeline:

<img src="https://tse4.mm.bing.net/th?id=OIP.n6CHRAuypD7Qw5cWFBpwsgHaFa&pid=Api">

* Scaling: we scale every feature so that mean(column) = 0 and std(column) = 1
* Dimensionality reduction: we use PCA (Principal Component Analysis), which reduces the number of dimensions while keeping as much information (variance) as possible
* Learning algorithm: we choose the one that best fits the training set AND also performs well on the test set (to avoid overfitting)
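
The three steps above can be sketched on synthetic data. `make_classification` stands in for the real dataset here, and the logistic regression with default settings is just one possible choice of estimator.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 30 features of the real dataset
X_syn, y_syn = make_classification(n_samples=300, n_features=30,
                                   n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

# Scaling -> PCA -> estimator, chained in one object
pipe = make_pipeline(StandardScaler(), PCA(n_components=3), LogisticRegression())
pipe.fit(X_tr, y_tr)

step_names = list(pipe.named_steps)
test_accuracy = pipe.score(X_te, y_te)
print(step_names)
print(test_accuracy)
```

Calling `fit` on the pipeline fits the scaler and the PCA on the training set only, then trains the estimator on the transformed data; `predict` and `score` reuse those fitted transformers on the test set.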

# Libraries

### We import our libraries first

In [None]:
from sklearn.pipeline import make_pipeline

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import precision_score, confusion_matrix, classification_report

### Then we instantiate them with simpler names

In [None]:
scaler = StandardScaler()
pca = PCA(n_components = 3) # n_components is the number of dimension after PCA

log_reg = LogisticRegression(solver ='lbfgs', C = 10**10)
svc = LinearSVC(C = 1, loss = 'hinge', max_iter=1000)
tree_clf = DecisionTreeClassifier(max_depth = 5)
forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)

liste_estimator = [log_reg, svc, tree_clf, forest_clf]
liste_estimator_str = ['log_reg', 'svc', 'tree_clf', 'forest_clf']

### Print the classification report (precision, recall, F1-score) for each estimator

In [None]:
def show_classification_report_with_pipeline(X_train, y_train, X_test, y_test, liste_estimator):

    for estimator in liste_estimator:
        
        ## Define and train the pipeline on the trainset
        model = make_pipeline(scaler, pca, estimator)
        model.fit(X_train, y_train)
        
        ## Predict the label from the testset
        y_pred = model.predict(X_test)
        
        ## Do some report for each estimator (4 in our example)
        report = classification_report(y_test, y_pred)
        
        ## Print it
        print('Estimator : ',estimator)
        print(report)
        
show_classification_report_with_pipeline(X_train, y_train, X_test, y_test, liste_estimator)
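
To read the report, it may help to see how precision and recall follow from the counts of a confusion matrix. A minimal sketch on invented labels, where 1 stands for Malignant:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy ground truth and predictions (invented for illustration)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])

# For binary labels, ravel() returns tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # among predicted malignant, how many really are
recall = tp / (tp + fn)     # among real malignant, how many were found

print(precision, precision_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
```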

## 2. Without a Pipeline

### Let's do exactly the same thing but without a Pipeline, and plot a confusion matrix for each estimator


### Here is what the pipeline does automatically

In [None]:
n_components = 8 # final number of dimensions after PCA (the pipeline above used 3)

def our_transformator_X(X_train, X_test, n_components):
    scaler = StandardScaler()
    pca = PCA(n_components = n_components)
    
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    pca_train = pca.fit_transform(X_train_scaled)
    pca_train_df = pd.DataFrame(pca_train, columns=["PCA" + str(i) for i in range(n_components)])

    pca_test = pca.transform(X_test_scaled)
    pca_test_df = pd.DataFrame(pca_test, columns=["PCA" + str(i) for i in range(n_components)])
    
    return pca, pca_train_df, pca_test_df

pca, pca_train_df, pca_test_df = our_transformator_X(X_train, X_test, n_components = n_components)
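
As a sanity check that the manual steps match what a pipeline does, here is a sketch on synthetic data comparing both routes. The data and variable names are made up, and `svd_solver="full"` is set explicitly only to keep the comparison deterministic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_syn, _ = make_classification(n_samples=200, n_features=30, random_state=0)
X_a, X_b = X_syn[:150], X_syn[150:]  # stand-ins for the train and test sets

# Manual route: fit on the training part only, then transform the test part
sc = StandardScaler().fit(X_a)
pc = PCA(n_components=8, svd_solver="full").fit(sc.transform(X_a))
manual = pc.transform(sc.transform(X_b))

# Pipeline route: the same two transformers chained together
pipe = make_pipeline(StandardScaler(), PCA(n_components=8, svd_solver="full")).fit(X_a)
piped = pipe.transform(X_b)

same = np.allclose(manual, piped)
print(same)
```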

### Let's plot them

In [None]:
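from sklearn.metrics import ConfusionMatrixDisplay

# Fit each estimator on the PCA-transformed training set
# and plot its confusion matrix on the test set
def plot_confusion_matrices(X_train, y_train, X_test, y_test, liste_estimator):
    for estimator in liste_estimator:
        estimator.fit(X_train, y_train)
        disp = ConfusionMatrixDisplay.from_estimator(estimator, X_test, y_test)
        disp.ax_.set_title(type(estimator).__name__)
    plt.show()

print(liste_estimator)
plot_confusion_matrices(pca_train_df, y_train, pca_test_df, y_test, liste_estimator)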

## We will skip the hyperparameter optimization part and keep logistic regression, which seems to be the best performer

#### Let's make some plots to understand how PCA and LogisticRegression work

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(x = pca.explained_variance_ratio_, y = ["PCA" + str(i) for i in range(n_components)])
plt.xlim(0,1)
plt.xlabel('Proportion of variance of the original data')
plt.ylabel('PCA')
plt.show()
print(f'{pca.explained_variance_ratio_.sum()*100:.2f}% of the variance is preserved when going from 30 to {n_components} dimensions')

Dimensionality reduction is useful when the number of features is large! However, the components that are dropped carry part of the variance of the data, so some information is lost.
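
One way to choose the number of components is to look at the cumulative explained variance. The sketch below uses synthetic data and an arbitrary 95 % threshold, both of which are assumptions for illustration. Passing a float to `n_components` asks scikit-learn to pick the count itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_syn, _ = make_classification(n_samples=300, n_features=30, n_informative=5,
                               n_redundant=20, random_state=0)
X_std = StandardScaler().fit_transform(X_syn)

# Keep all components and accumulate their share of the variance
pca_full = PCA(svd_solver="full").fit(X_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share exceeds 95 %
n_needed = int(np.argmax(cumulative > 0.95)) + 1

# Same selection delegated to scikit-learn via a float n_components
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_std)
print(n_needed, pca_95.n_components_)
```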

# Let's do a 2-D plot of the decision boundary with the best estimator, LogisticRegression

In [None]:
## Define X and y as : X is X_train after scaling and PCA(2) and y is y_train
pca, X, pca_test_df = our_transformator_X(X_train, X_test, n_components = 2)
y = y_train

## We use LogisticRegression and fit it
log_reg = LogisticRegression(solver="lbfgs", C=10**10, random_state=42)
log_reg.fit(X, y)

### Let's do a nice plot in 2D (n_components = 2)

In [None]:
x0, x1 = np.meshgrid(
        np.linspace(-7, 17, 500).reshape(-1, 1),
        np.linspace(-9, 13, 200).reshape(-1, 1),
    )

X_new = np.c_[x0.ravel(), x1.ravel()]

y_proba = log_reg.predict_proba(X_new)

plt.figure(figsize=(12, 6))
plt.plot(X[y==0].iloc[:,0], X[y==0].iloc[:,1], "bs")
plt.plot(X[y==1].iloc[:,0], X[y==1].iloc[:,1], "g^")

zz = y_proba[:, 1].reshape(x0.shape)
contour = plt.contour(x0, x1, zz, cmap=plt.cm.brg)

left_right = np.array([-7, 17])
boundary = -(log_reg.coef_[0][0] * left_right + log_reg.intercept_[0]) / log_reg.coef_[0][1]

plt.clabel(contour, inline=True, fontsize=12)
plt.plot(left_right, boundary, "k--", linewidth=3)
# Blue squares are class 0 (Benign), green triangles are class 1 (Malignant)
plt.text(-5, 7.5, "Benign", fontsize=14, color="b", ha="center")
plt.text(10, 7.5, "Malignant", fontsize=14, color="g", ha="center")
plt.xlabel("PCA0", fontsize=14)
plt.ylabel("PCA1", fontsize=14)
plt.axis([-7, 17, -9, 13])
plt.show()
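
The dashed line in the plot comes from solving `w0 * x0 + w1 * x1 + b = 0` for `x1`, which is where the predicted probability equals 0.5. A minimal sketch on synthetic 2-D blobs (made-up data, not the PCA output) checks this property:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Two overlapping 2-D clusters standing in for the two PCA components
X_2d, y_2d = make_blobs(n_samples=200, centers=[[-2, 0], [3, 1]],
                        cluster_std=1.5, random_state=0)
clf = LogisticRegression(solver="lbfgs").fit(X_2d, y_2d)

w0, w1 = clf.coef_[0]
b = clf.intercept_[0]

# Points on the boundary: x1 = -(w0 * x0 + b) / w1
x0_line = np.array([-7.0, 17.0])
x1_line = -(w0 * x0_line + b) / w1

# On the boundary the model is undecided between the two classes
proba_on_line = clf.predict_proba(np.c_[x0_line, x1_line])[:, 1]
print(proba_on_line)
```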