# Mini-Batch Gradient Descent and Online Learning
The San Giuseppe del Santissimo Cuore hospital has asked you to train a model that can recognize malignant breast tumors, to support its doctors during diagnosis. To do so, it has provided [this data](https://github.com/ProfAI/machine-learning-modelli-e-algoritmi/blob/main/datasets/breast_cancer.csv).
<br>
Train the model with mini-batch gradient descent, testing several batch sizes: 8, 16, 32, 64, 128.
<br>
Select the model with the best metrics on the test set.
<br>
The hospital then uses your model to classify new cases and provides these [new data](https://github.com/ProfAI/machine-learning-modelli-e-algoritmi/blob/main/datasets/breast_cancer_update.csv); use them to improve the model.

In [1]:
import numpy as np
import pandas as pd
BASE_URL = "https://raw.githubusercontent.com/ProfAI/machine-learning-modelli-e-algoritmi/main/datasets/"

data = pd.read_csv(BASE_URL+"breast_cancer.csv",index_col=0)

data.head()

| ID number | diagnosis | radius mean | texture mean | perimeter mean | area mean | smoothness mean | compactness mean | concavity mean | concave points mean | symmetry mean | ... | radius worst | texture worst | perimeter worst | area worst | smoothness worstse | compactness worst | concavity worst | concave points worst | symmetry worst | fractal dimension worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |
| 843786 | M | 12.45 | 15.7 | 82.57 | 477.1 | 0.1278 | 0.17 | 0.1578 | 0.08089 | 0.2087 | ... | 15.47 | 23.75 | 103.4 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.1244 |


Split the dataset into a training set and a test set.

In [2]:
from sklearn.model_selection import train_test_split
X = data.drop("diagnosis", axis=1).values
y = data["diagnosis"].values
X_train, X_test, y_train, y_test = train_test_split(X,y)
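The split above relies on the defaults of `train_test_split`, so the metrics change from run to run. A minimal sketch of a reproducible, stratified variant follows; the toy arrays, `test_size=0.25` and `random_state=0` are illustrative assumptions, not values taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 100 samples, 30% labelled "M" (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = np.array(["M"] * 30 + ["B"] * 70)

# stratify keeps the M/B ratio roughly equal in both splits;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

print((y_tr == "M").mean(), (y_te == "M").mean())
```

Stratification matters most when one class is rare, because a purely random split can leave the test set with very few malignant cases.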

Remember to standardize the features: logistic regression trained with gradient descent is sensitive to feature scale.

In [3]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
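What `fit_transform` and `transform` do can be reproduced by hand. The sketch below uses made-up arrays (`train` and `test` are hypothetical names) to show that the mean and standard deviation are estimated on the training set only and then reused unchanged on the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # hypothetical training features
test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))    # hypothetical test features

# Statistics are estimated on the training set only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # same parameters, no refitting

print(train_scaled.mean(axis=0).round(3), train_scaled.std(axis=0).round(3))
```

The test set will not have exactly zero mean and unit variance after scaling, and that is expected: refitting the scaler on it would leak information from the test set into the evaluation.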

The task calls for mini-batch gradient descent, trying each of the batch sizes listed above. As the classifier, use scikit-learn's SGDClassifier
with log loss as the cost function (that is, logistic regression).

In [4]:
from time import time
from sklearn.utils import shuffle

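def mini_batch_fit(model, X, y, batch_size=32, epochs=50):
    """
    Train `model` incrementally on shuffled mini-batches.

    Parameters:
    - model: classifier that implements partial_fit
    - X, y: training set (X = feature matrix, y = target)
    - batch_size: number of samples per batch
    - epochs: number of passes over the training set
    """
    # Only full batches are used: leftover samples are skipped in each epoch.
    n_batches = X.shape[0] // batch_size
    classes = np.unique(y)
    tic = time()
    for epoch in range(epochs):
        # Reshuffle every epoch so the batches differ between passes.
        X_shuffled, y_shuffled = shuffle(X, y)
        for batch in range(n_batches):
            batch_start = batch * batch_size
            batch_end = (batch + 1) * batch_size
            X_batch = X_shuffled[batch_start:batch_end, :]
            y_batch = y_shuffled[batch_start:batch_end]
            # classes is required on the first partial_fit call,
            # since a single batch may not contain every label.
            model.partial_fit(X_batch, y_batch, classes=classes)
        #print(f"Epoca {epoch+1} terminata")
    # Prints the total training time ("Allenamento terminato" = "training finished").
    print(f"Allenamento terminato in {(time()-tic):.3f} s")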
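For intuition, the textbook version of mini-batch gradient descent for logistic regression can be sketched in plain NumPy. This is a simplified illustration, not scikit-learn's internals: it has no regularization and a constant learning rate, and it averages the gradient over each batch, whereas `partial_fit` runs one pass of stochastic gradient descent over the samples it receives. The function name `minibatch_gd_logreg`, the learning rate `lr` and the toy data are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_gd_logreg(X, y, batch_size=32, epochs=50, lr=0.1, seed=0):
    """Fit logistic regression with mini-batch gradient descent.

    y must contain 0/1 labels. Returns (weights, bias).
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(n_samples)  # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]  # the last batch may be smaller
            error = sigmoid(X[idx] @ w + b) - y[idx]
            # Gradient of the mean log loss over the batch.
            w -= lr * X[idx].T @ error / len(idx)
            b -= lr * error.mean()
    return w, b

# Toy, well-separated data (assumption, not the hospital dataset).
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(-2, 1, size=(100, 2)),
                   rng.normal(2, 1, size=(100, 2))])
y_toy = np.array([0] * 100 + [1] * 100)

w, b = minibatch_gd_logreg(X_toy, y_toy, batch_size=16)
accuracy = ((sigmoid(X_toy @ w + b) >= 0.5) == y_toy).mean()
print(f"training accuracy: {accuracy:.2f}")
```

Unlike `mini_batch_fit`, this sketch also trains on the final, smaller batch instead of skipping the leftover samples.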

In [13]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, recall_score, log_loss
batch_sizes = [8, 16, 32, 64, 128]

for batch_size in batch_sizes:
    print(f"Batch_size = {batch_size}:")
    # Logistic regression (log loss) with elastic-net regularization.
    model = SGDClassifier(loss="log_loss", penalty="elasticnet", l1_ratio=0.10)
    mini_batch_fit(model, X_train, y_train, batch_size=batch_size)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)
    acc = accuracy_score(y_test, y_pred)
    # Sensitivity = recall on the malignant class ("M").
    recall = recall_score(y_test, y_pred, pos_label="M")
    # predict_proba columns follow model.classes_, so pass the labels in that order.
    loss = log_loss(y_test, y_prob, labels=model.classes_)
    print(f"accuracy: {acc:.2f}")
    print(f"sensitivity: {recall:.2f}")
    print(f"Log loss: {loss:.3f}")



Batch_size = 8:
Allenamento terminato in 1.033 s
accuracy: 0.99
sensitivity: 1.00
Log loss: 0.040
Batch_size = 16:
Allenamento terminato in 0.617 s
accuracy: 0.98
sensitivity: 0.98
Log loss: 0.109
Batch_size = 32:
Allenamento terminato in 0.382 s
accuracy: 0.98
sensitivity: 1.00
Log loss: 0.065
Batch_size = 64:
Allenamento terminato in 0.178 s
accuracy: 0.97
sensitivity: 1.00
Log loss: 0.087
Batch_size = 128:
Allenamento terminato in 0.078 s
accuracy: 0.98
sensitivity: 0.98
Log loss: 0.164


In [14]:
# Same comparison as above, but with 500 epochs instead of the default 50.
for batch_size in batch_sizes:
    print(f"Batch_size = {batch_size}:")
    model = SGDClassifier(loss="log_loss", penalty="elasticnet", l1_ratio=0.10)
    mini_batch_fit(model, X_train, y_train, batch_size=batch_size, epochs=500)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)
    acc = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred, pos_label="M")
    # predict_proba columns follow model.classes_, so pass the labels in that order.
    loss = log_loss(y_test, y_prob, labels=model.classes_)
    print(f"accuracy: {acc:.2f}")
    print(f"sensitivity: {recall:.2f}")
    print(f"Log loss: {loss:.3f}")

Batch_size = 8:
Allenamento terminato in 6.482 s
accuracy: 0.98
sensitivity: 1.00
Log loss: 0.029
Batch_size = 16:
Allenamento terminato in 3.373 s
accuracy: 0.98
sensitivity: 1.00
Log loss: 0.031
Batch_size = 32:
Allenamento terminato in 1.897 s
accuracy: 0.98
sensitivity: 1.00
Log loss: 0.029
Batch_size = 64:
Allenamento terminato in 1.405 s
accuracy: 0.98
sensitivity: 1.00
Log loss: 0.028
Batch_size = 128:
Allenamento terminato in 0.574 s
accuracy: 0.98
sensitivity: 1.00
Log loss: 0.034


Increasing the number of epochs from 50 to 500 does not substantially improve accuracy or sensitivity, but it makes training considerably longer.
The first pick for the best model was the one with batch_size = 64 and epochs = 50 (highest sensitivity).

EDIT: the data actually needed to be standardized; once that is done, the models perform very similarly. A further tie-breaker is to evaluate the log loss on the training set and choose the model with the lowest value. In the 500-epoch runs above the log loss is lower (0.028 to 0.034 on the test set, versus 0.040 to 0.164 with 50 epochs), which suggests the models get closer to convergence; batch_size = 64 has the lowest value.

In [15]:
# Recreate the best model (batch_size=64, trained for 500 epochs)
best_model = SGDClassifier(loss="log_loss", penalty="elasticnet", l1_ratio=0.10)
mini_batch_fit(best_model, X_train, y_train, batch_size=64, epochs=500)


Allenamento terminato in 0.786 s


## Online learning

In [8]:
new_data = pd.read_csv(BASE_URL+"breast_cancer_update.csv",index_col=0)
new_data.head()

| ID number | diagnosis | radius mean | texture mean | perimeter mean | area mean | smoothness mean | compactness mean | concavity mean | concave points mean | symmetry mean | ... | radius worst | texture worst | perimeter worst | area worst | smoothness worstse | compactness worst | concavity worst | concave points worst | symmetry worst | fractal dimension worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 91544001 | B | 12.22 | 20.04 | 79.47 | 453.1 | 0.1096 | 0.1152 | 0.08175 | 0.02166 | 0.2124 | ... | 13.16 | 24.17 | 85.13 | 515.3 | 0.1402 | 0.2315 | 0.3535 | 0.08088 | 0.2709 | 0.08839 |
| 91544002 | B | 11.06 | 17.12 | 71.25 | 366.5 | 0.1194 | 0.1071 | 0.04063 | 0.04268 | 0.1954 | ... | 11.69 | 20.74 | 76.08 | 411.1 | 0.1662 | 0.2031 | 0.1256 | 0.09514 | 0.278 | 0.1168 |
| 915452 | B | 16.3 | 15.7 | 104.7 | 819.8 | 0.09427 | 0.06712 | 0.05526 | 0.04563 | 0.1711 | ... | 17.32 | 17.76 | 109.8 | 928.2 | 0.1354 | 0.1361 | 0.1947 | 0.1357 | 0.23 | 0.0723 |
| 915460 | M | 15.46 | 23.95 | 103.8 | 731.3 | 0.1183 | 0.187 | 0.203 | 0.0852 | 0.1807 | ... | 17.11 | 36.33 | 117.7 | 909.4 | 0.1732 | 0.4967 | 0.5911 | 0.2163 | 0.3013 | 0.1067 |
| 91550 | B | 11.74 | 14.69 | 76.31 | 426.0 | 0.08099 | 0.09661 | 0.06726 | 0.02639 | 0.1499 | ... | 12.45 | 17.6 | 81.25 | 473.8 | 0.1073 | 0.2793 | 0.269 | 0.1056 | 0.2604 | 0.09879 |


In [9]:
X_new = new_data.drop("diagnosis",axis=1).values
y_new = new_data["diagnosis"].values

In [10]:
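# Transform the new data with the scaler fitted on the original training set:
# the same scaling parameters as in the first training run must be used.
X_new = ss.transform(X_new)
# Update the existing model with one incremental pass over the new samples.
best_model.partial_fit(X_new, y_new)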

In [11]:
print("Dopo online learning:")
y_pred = best_model.predict(X_test)
acc = accuracy_score(y_test,y_pred)
recall = recall_score(y_test,y_pred, pos_label="M")
print(f"accuracy: {acc:.2f}")
print(f"sensitivity: {recall:.2f}")

Dopo online learning:
accuracy: 0.98
sensitivity: 1.00
