<a href="https://colab.research.google.com/github/Jeevana023/FMML/blob/main/Copy_of_FMML_Module_9_Project_Breast_Cancer_Prediction_with_MLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### Project for Module 9: Multi Layer Perceptron
#### Module Coordinator: Shantanu Agrawal

# Breast cancer prediction with an MLP Classifier



### Dataset Used: Breast Cancer Wisconsin (Diagnostic) Dataset
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
The linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

It can also be found on the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Attribute Information:

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
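
The field layout described above can be reproduced in a few lines. This is only a sketch: the `_mean`, `_se` and `_worst` suffixes follow the column names in `data.csv`, while the names `base_features` and `field_names` are made up for this illustration.

```python
# Ten base measurements, in the order listed above (a-j).
base_features = [
    'radius', 'texture', 'perimeter', 'area', 'smoothness',
    'compactness', 'concavity', 'concave points', 'symmetry',
    'fractal_dimension',
]
suffixes = ['mean', 'se', 'worst']

# Fields 1 and 2 are the ID and the diagnosis, so features start at field 3.
field_names = {}
field = 3
for suffix in suffixes:
    for base in base_features:
        field_names[field] = f'{base}_{suffix}'
        field += 1

print(len(field_names))
print(field_names[3], field_names[13], field_names[23])
```

Each statistic occupies a block of ten consecutive fields, which is why the three radius fields sit exactly ten positions apart.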

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant

In [None]:
# Load the dataset
!pip install gdown
!gdown 1ndwj3XxQpw7rEuAYiImeGonCay7KGS9p
!unzip archive.zip

Downloading...
From: https://drive.google.com/uc?id=1ndwj3XxQpw7rEuAYiImeGonCay7KGS9p
To: /content/archive.zip
100% 49.8k/49.8k [00:00<00:00, 60.9MB/s]
Archive:  archive.zip
  inflating: data.csv                


In [None]:
# Required imports
# Note that we shall be using the sklearn module for easy experimentation.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score, classification_report
from sklearn.preprocessing import Normalizer, MinMaxScaler, StandardScaler, RobustScaler, QuantileTransformer, LabelEncoder
from sklearn.pipeline import Pipeline

### Step 1: Exploratory Data Analysis (EDA)
We perform EDA to understand the data before modelling it. This step helps us choose sensible preprocessing, get better results, and interpret those results correctly.

In [None]:
# Get first 5 rows of the data to see features
breast_cancer = pd.read_csv('data.csv')
breast_cancer.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [None]:
# Get counts and data types for the attributes
breast_cancer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

In [None]:
# Print out the shape of the dataframe
breast_cancer.shape

(569, 33)

In [None]:
# Print out some statistics for the data
breast_cancer.describe()

Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,0.0
mean,30371830.0,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,
std,125020600.0,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,
min,8670.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,
25%,869218.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,
50%,906024.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,
75%,8813129.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,
max,911320500.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,


In [None]:
# Number of samples in each class
breast_cancer.groupby('diagnosis').size()

diagnosis,count
B,357
M,212


### Step 2: Feature Engineering

In [None]:
# Drop "id" (column 0) and the empty "Unnamed: 32" (last column);
# "diagnosis" (column 1) is the label, so it is not kept as a feature either
feature_names = breast_cancer.columns[2:-1]
X = breast_cancer[feature_names]
# The "diagnosis" column is the class label
y = breast_cancer.diagnosis

#### Transforming the prediction target

In [None]:
class_le = LabelEncoder()
# M -> 1 and B -> 0
y = class_le.fit_transform(breast_cancer.diagnosis.values)
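
To check that the encoding really is M -> 1 and B -> 0, a small sketch on a toy list can help. `LabelEncoder` sorts the classes before numbering them; the names `demo_le`, `encoded` and `decoded` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

demo_le = LabelEncoder()
encoded = demo_le.fit_transform(['M', 'B', 'B', 'M'])

# classes_ is sorted, so index 0 is 'B' and index 1 is 'M'.
print(list(demo_le.classes_))
print(list(encoded))

# inverse_transform maps the integers back to the original letters.
decoded = demo_le.inverse_transform([0, 1])
print(list(decoded))
```

Because malignant is encoded as 1, metrics such as F1 computed later treat malignant as the positive class.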

#### Correlation Matrix
A matrix of correlations provides useful insight into relationships between pairs of variables.

In [None]:
sns.heatmap(
    data=X.corr(),
    annot=True,
    fmt='.2f',
    cmap='RdYlGn'
)

fig = plt.gcf()
fig.set_size_inches(20, 16)

plt.show()
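
A 30 x 30 heatmap can be hard to read, so it may help to list only the strongly correlated pairs. The sketch below uses a small synthetic frame rather than the real data; the helper `highly_correlated_pairs` and the 0.9 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(frame, threshold=0.9):
    """Return (column_a, column_b, |r|) for pairs with |r| >= threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if r >= threshold:
                pairs.append((a, b, round(float(r), 3)))
    return pairs


# Synthetic stand-in: perimeter is an exact function of radius,
# texture is drawn independently.
rng = np.random.default_rng(0)
radius = rng.uniform(5, 25, size=200)
demo = pd.DataFrame({
    'radius': radius,
    'perimeter': 2 * np.pi * radius,
    'texture': rng.uniform(10, 40, size=200),
})

pairs = highly_correlated_pairs(demo)
print(pairs)
```

On the real data the same call would be `highly_correlated_pairs(X)`.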

### Step 3: Multi-layer Perceptron classifier evaluation after Pipeline and GridSearchCV usage

We use the sklearn MLPClassifier to create our classifier and train it. In case you do not understand any of the code, I encourage you to check out the documentation first, and if you still do not understand, reach out to a TA.

#### Model Parameter Tuning
[GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) evaluates every combination in a parameter grid with cross-validation and reports the combination with the best score. Parameter tuning can be combined with other steps, such as data preprocessing, by wrapping them in a [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html), so that every step is refit inside each cross-validation split.

#### Data standardization
[Preprocessing data](http://scikit-learn.org/stable/modules/preprocessing.html) provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.
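
As a small illustration of two such transformers, the sketch below applies `StandardScaler` and `MinMaxScaler` to a toy column. The array `toy` is made up; it is not part of the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature with three samples.
toy = np.array([[1.0], [2.0], [3.0]])

# StandardScaler: subtract the column mean, divide by the column standard deviation.
standardized = StandardScaler().fit_transform(toy)

# MinMaxScaler: map the column minimum to 0 and the maximum to 1.
rescaled = MinMaxScaler().fit_transform(toy)

print(standardized.ravel())
print(rescaled.ravel())
```

Features such as `area_worst` (thousands) and `smoothness_mean` (around 0.1) live on very different scales, which is why a scaling step is placed in front of the MLP.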

Let's start by defining the Pipeline instance. We can use different approaches such as `Normalizer`, `MinMaxScaler`, `StandardScaler`, `RobustScaler` or `QuantileTransformer` for data preprocessing, and `MLPClassifier` for classification.

In [None]:
pipe = Pipeline(steps=[
    ('preprocess', StandardScaler()),
    ('classification', MLPClassifier())
])

Next, we define the candidate values that the parameter tuning process should try for each step: `activation`, `solver`, `max_iter` and `alpha` for the classifier, plus the choice of scaler.

In [None]:
random_state = 42
mlp_activation = ['tanh', 'relu']
mlp_solver = ['sgd', 'adam']
mlp_max_iter = range(1000, 10000, 5000)  # yields 1000 and 6000
mlp_alpha = [0.01, 0.1, 1]
preprocess = [MinMaxScaler(), StandardScaler()]

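Next, we combine these values into a parameter grid for the pipeline. Keys of the form `classification__<name>` set a parameter on the `classification` step, while the `preprocess` key swaps the whole preprocessing step. For the Multi-layer Perceptron classifier we do not use PCA or any other feature selection technique.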

In [None]:
mlp_param_grid = [
    {
        'preprocess': preprocess,
        'classification__activation': mlp_activation,
        'classification__solver': mlp_solver,
        'classification__random_state': [random_state],
        'classification__max_iter': mlp_max_iter,
        'classification__alpha': mlp_alpha
    }
]
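
Before launching the search it is worth estimating how much work the grid implies. The sketch below counts the combinations with plain Python; the scalers are replaced by their names as strings, and `n_folds = 3` matches the `cv=3` used in the search.

```python
from math import prod

# Same shape as mlp_param_grid, with scaler objects replaced by labels.
grid = {
    'preprocess': ['MinMaxScaler', 'StandardScaler'],
    'classification__activation': ['tanh', 'relu'],
    'classification__solver': ['sgd', 'adam'],
    'classification__random_state': [42],
    'classification__max_iter': list(range(1000, 10000, 5000)),
    'classification__alpha': [0.01, 0.1, 1],
}

# Every value of every key is combined with every value of every other key.
n_candidates = prod(len(values) for values in grid.values())
n_folds = 3
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

With `refit=True`, one more fit on the full data follows once the best combination is known.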

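Next, we choose the cross-validation splitting strategy and pass it, together with the pipeline and the parameter grid, to `GridSearchCV`. Passing an integer as `cv` makes scikit-learn use `StratifiedKFold` for classifiers, so `cv=3` gives three stratified folds. We use the F1 score as the metric.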

In [None]:
mlp_grid = GridSearchCV(
    pipe,
    param_grid=mlp_param_grid,
    cv=3,  # Reducing the number of cross-validation folds
    scoring='f1',
    n_jobs=-1,  # Parallelize
    verbose=2,
    refit=True
)

mlp_grid.fit(X, y)
print(mlp_grid.best_params_)
print('\nBest F1 score for MLP: {:.2f}%'.format(mlp_grid.best_score_ * 100))
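
To see what stratification does, the sketch below splits a label vector with the same class counts as this dataset (357 benign, 212 malignant) and prints the malignant fraction in each held-out fold. The feature matrix is a placeholder of zeros, since only the labels matter for the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Same class balance as the dataset: 0 = benign, 1 = malignant.
labels = np.array([0] * 357 + [1] * 212)
placeholder_X = np.zeros((len(labels), 1))

skf = StratifiedKFold(n_splits=3)
fold_fractions = []
for _, test_idx in skf.split(placeholder_X, labels):
    # Fraction of malignant samples in this held-out fold.
    fold_fractions.append(labels[test_idx].mean())

print([round(f, 3) for f in fold_fractions])
print(round(212 / 569, 3))
```

Each fold keeps roughly the overall class ratio, which keeps the per-fold F1 scores comparable.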

#### Model evaluation

Finally, we take the best parameter values and pass them to new preprocessing and classifier instances. For example, if `best_params_` returned `StandardScaler` for data preprocessing and `1000`, `1`, `'tanh'` and `'adam'` for the `max_iter`, `alpha`, `activation` and `solver` classifier attributes, we would use the code below. Your result may vary, so use whatever parameters scored best for you.


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    random_state=42,
    test_size=0.32
)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
scaler = StandardScaler()

print('\nData preprocessing with {scaler}\n'.format(scaler=scaler))

X_train_scaler = scaler.fit_transform(X_train)
X_test_scaler = scaler.transform(X_test)

mlp = MLPClassifier(
    max_iter=1000,
    alpha=1,
    activation='tanh',
    solver='adam',
    random_state=42
)
mlp.fit(X_train_scaler, y_train)

mlp_predict = mlp.predict(X_test_scaler)
mlp_predict_proba = mlp.predict_proba(X_test_scaler)[:, 1]

print('MLP Accuracy: {:.2f}%'.format(accuracy_score(y_test, mlp_predict) * 100))
print('MLP AUC: {:.2f}%'.format(roc_auc_score(y_test, mlp_predict_proba) * 100))
print('MLP Classification report:\n\n', classification_report(y_test, mlp_predict))
print('MLP Training set score: {:.2f}%'.format(mlp.score(X_train_scaler, y_train) * 100))
print('MLP Testing set score: {:.2f}%'.format(mlp.score(X_test_scaler, y_test) * 100))
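
Instead of copying the best values by hand, they can also be applied to a pipeline with `set_params`, because the keys of `best_params_` use the same `step__parameter` names. The dictionary `example_best_params` below is hypothetical and only mirrors the example values mentioned above; in practice it would be `mlp_grid.best_params_`.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

final_pipe = Pipeline(steps=[
    ('preprocess', MinMaxScaler()),
    ('classification', MLPClassifier()),
])

# Hypothetical search result, shaped like mlp_grid.best_params_.
example_best_params = {
    'preprocess': StandardScaler(),
    'classification__activation': 'tanh',
    'classification__alpha': 1,
    'classification__max_iter': 1000,
    'classification__random_state': 42,
    'classification__solver': 'adam',
}

# Replaces the 'preprocess' step and updates the classifier's attributes.
final_pipe.set_params(**example_best_params)

clf = final_pipe.named_steps['classification']
print(type(final_pipe.named_steps['preprocess']).__name__, clf.activation, clf.solver, clf.alpha)
```

The configured pipeline can then be fitted on `X_train` and scored on `X_test` directly, with scaling handled inside it.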

#### Confusion Matrix

A confusion matrix, also known as an error matrix, is a table layout that visualizes the performance of a classifier. For a binary problem it has two rows and two columns that report the number of True Negatives (TN), False Positives (FP), False Negatives (FN) and True Positives (TP). This allows a more detailed analysis than accuracy alone.

In [None]:
outcome_labels = sorted(breast_cancer.diagnosis.unique())

# Confusion Matrix for MLPClassifier
sns.heatmap(
    confusion_matrix(y_test, mlp_predict),
    annot=True,
    fmt="d",
    xticklabels=outcome_labels,
    yticklabels=outcome_labels
)
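
The four cells of the matrix are enough to derive the metrics asked for in the conclusions. The counts below are hypothetical, not results from this notebook; they are laid out the way `confusion_matrix` returns them for labels 0 and 1, with true classes in rows and predicted classes in columns.

```python
# Hypothetical counts: [[TN, FP], [FN, TP]]
matrix = [[35, 5],
          [15, 45]]
(tn, fp), (fn, tp) = matrix

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted malignant cases, how many were malignant
recall = tp / (tp + fn)      # of the truly malignant cases, how many were found
f1 = 2 * precision * recall / (precision + recall)

print('Accuracy:  {:.3f}'.format(accuracy))
print('Precision: {:.3f}'.format(precision))
print('Recall:    {:.3f}'.format(recall))
print('F1-score:  {:.3f}'.format(f1))
```

In a screening setting, recall on the malignant class is usually the number to watch, since a false negative is a missed cancer.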

#### Receiver Operating Characteristic (ROC)

[ROC curve](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html) is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.

In [None]:
# ROC for MLPClassifier
fpr, tpr, thresholds = roc_curve(y_test, mlp_predict_proba)

plt.plot([0,1],[0,1],'k--')
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.rcParams['font.size'] = 12
plt.title('ROC curve for MLPClassifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
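
To make the idea of a varying threshold concrete, the sketch below builds ROC points by hand for four toy predictions and integrates them with the trapezoid rule. The helpers `roc_points` and `trapezoid_auc` are illustrative names, not scikit-learn functions, and the toy labels and scores are made up.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs, one per distinct threshold, starting at (0, 0)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; everything scored >= threshold is predicted positive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_points = roc_points(toy_labels, toy_scores)
toy_auc = trapezoid_auc(toy_points)
print(toy_points)
print(toy_auc)
```

An AUC of 0.5 corresponds to the dashed diagonal in the plot, and 1.0 to a classifier that ranks every malignant case above every benign one.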

#### F1-score after 5-fold cross-validation

In [None]:
strat_k_fold = StratifiedKFold(
    n_splits=5,
    )

scaler = StandardScaler()

X_std = scaler.fit_transform(X)

fe_score = cross_val_score(
    mlp,
    X_std,
    y,
    cv=strat_k_fold,
    scoring='f1'
)

# Report the mean and the spread on the same percentage scale
print("MLP: F1 after 5-fold cross-validation: {:.2f}% (+/- {:.2f}%)".format(
    fe_score.mean() * 100,
    fe_score.std() * 2 * 100
))
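
One caveat with scaling the full dataset before cross-validation is that each fold's validation rows influence the scaler. A possible alternative, sketched below, wraps the scaler and the classifier in a `Pipeline` so the scaler is refit on the training part of every fold. To keep the example self-contained it loads scikit-learn's bundled copy of the same dataset instead of `data.csv`; that copy encodes malignant as 0, so the labels are flipped to match the M -> 1 convention used here. The hyperparameters are the example values from above, not tuned results.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_demo = data.data
# Bundled copy: 0 = malignant, 1 = benign. Flip so that malignant = 1.
y_demo = 1 - data.target

cv_pipe = Pipeline(steps=[
    ('preprocess', StandardScaler()),
    ('classification', MLPClassifier(
        max_iter=1000,
        alpha=1,
        activation='tanh',
        solver='adam',
        random_state=42,
    )),
])

# The pipeline is cloned and fitted per fold, so scaling never sees validation rows.
cv_scores = cross_val_score(
    cv_pipe,
    X_demo,
    y_demo,
    cv=StratifiedKFold(n_splits=5),
    scoring='f1',
)

print('F1: {:.2f}% (+/- {:.2f}%)'.format(
    cv_scores.mean() * 100,
    cv_scores.std() * 2 * 100,
))
```

On a dataset this size the difference from the pre-scaled version is likely small, but the pipeline form gives the more honest estimate.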

### Final step: Conclusions (Fill in your results)

After the application of data standardization and tuning the classifier parameters we achieve the following results:

* Accuracy:
* F1-score:
* Precision:
* Recall:

