# Index

- **Python Libraries**
- **Task 2**
    - *Reading Preprocessed Datasets*
    - *Exercise 1*
    - *Exercise 2*
        - *Balancing Strategy Definition*
        - *Exercise 2.1*
        - *Exercise 2.2*
        - *Exercise 2.3*
    - *Exercise 3 - Concentration*
    - *Writing Balanced Datasets*


# Pyhton Libraries

In [318]:
import pandas as pd
import numpy as np
import datetime as dt
import random

import transformers
from transformers import AutoModel, AutoTokenizer

import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

# Task 2

## Reading Preprocessed Datasets

In [319]:
read_data_path  = "../../data/dataset_preprocessed"

ap   = pd.read_csv(read_data_path+'/anagraficapazientiattivi.csv', header=0 ,names=['idcentro','idana','sesso','annodiagnosidiabete','tipodiabete','scolarita','statocivile','professione','origine','annonascita','annoprimoaccesso','annodecesso','label'])
diag = pd.read_csv(read_data_path+'/diagnosi.csv', header=0 ,names=['idcentro','idana','data','codiceamd','valore'])
elp  = pd.read_csv(read_data_path+'/esamilaboratorioparametri.csv', header=0 ,names=['idcentro','idana','data','codiceamd','valore'])
ei   = pd.read_csv(read_data_path+'/esamistrumentali.csv', header=0 ,names=['idcentro','idana','data','codiceamd','valore'])
pdf  = pd.read_csv(read_data_path+'/prescrizionidiabetefarmaci.csv', header=0 ,names=['idcentro','idana','data','codiceatc','quantita','idpasto','descrizionefarmaco'])
pdnf = pd.read_csv(read_data_path+'/prescrizionidiabetenonfarmaci.csv', header=0 ,names=['idcentro','idana','data','codiceamd','valore'])
pnd  = pd.read_csv(read_data_path+'/prescrizioninondiabete.csv', header=0 ,names=['idcentro','idana','data','codiceamd','valore'])

In [320]:
print(ap.shape)
ap.head()

(1555, 13)


Unnamed: 0,idcentro,idana,sesso,annodiagnosidiabete,tipodiabete,scolarita,statocivile,professione,origine,annonascita,annoprimoaccesso,annodecesso,label
0,1,5,M,1986.0,5,2.0,2.0,9.0,,1942,1990.0,2014.0,1
1,1,65,M,1997.0,5,2.0,1.0,9.0,,1942,1997.0,2018.0,1
2,1,74,M,1995.0,5,2.0,2.0,9.0,,1930,1996.0,2013.0,0
3,1,143,M,1991.0,5,2.0,,9.0,,1934,1991.0,2015.0,0
4,1,173,M,1988.0,5,2.0,2.0,9.0,,1939,1988.0,2012.0,1


In [321]:
print(diag.shape)
diag.head()

(106999, 5)


Unnamed: 0,idcentro,idana,data,codiceamd,valore
0,1,5,1997-12-01,AMD247,410
1,1,5,2000-06-01,AMD247,434.01
2,1,5,2000-06-01,AMD247,434.91
3,1,5,2000-12-02,AMD049,S
4,1,5,2000-12-02,AMD247,36.10


In [322]:
print(elp.shape)
elp.head()

(312389, 5)


Unnamed: 0,idcentro,idana,data,codiceamd,valore
0,1,5,2005-01-18,AMD001,169.0
1,1,5,2005-01-18,AMD002,76.0
2,1,5,2005-01-18,AMD004,90.232558
3,1,5,2005-01-18,AMD005,49.6
4,1,5,2005-06-06,AMD007,138.225058


In [323]:
print(ei.shape)
ei.head()

(15114, 5)


Unnamed: 0,idcentro,idana,data,codiceamd,valore
0,1,5,2006-01-04,AMD051,N
1,1,5,2006-11-14,AMD041,P
2,1,5,2006-11-20,AMD040,P
3,1,5,2007-06-01,AMD040,N
4,1,5,2008-01-16,AMD040,N


In [324]:
print(pdf.shape)
pdf.head()

(94228, 7)


Unnamed: 0,idcentro,idana,data,codiceatc,quantita,idpasto,descrizionefarmaco
0,1,5,2005-01-18,A10BA02,2.0,1.0,METFORAL*50CPR RIV 500mg
1,1,5,2005-01-18,A10BA02,2.0,3.0,METFORAL*50CPR RIV 500mg
2,1,5,2005-01-18,A10BA02,2.0,5.0,METFORAL*50CPR RIV 500mg
3,1,5,2005-01-18,A10BB12,1.0,3.0,AMARYL*30CPR 2mg
4,1,5,2005-06-21,A10BA02,2.0,1.0,METFORAL*50CPR RIV 500mg


In [325]:
print(pdnf.shape)
pdnf.head()

(9954, 5)


Unnamed: 0,idcentro,idana,data,codiceamd,valore
0,1,5,2008-06-20,AMD152,
1,1,5,2013-08-27,AMD152,
2,1,5,2013-12-31,AMD086,S
3,1,5,2013-12-31,AMD228,S
4,1,65,2014-12-19,AMD228,S


In [326]:
print(pnd.shape)
pnd.head()

(105297, 5)


Unnamed: 0,idcentro,idana,data,codiceamd,valore
0,1,5,2005-01-18,AMD121,C09AA05
1,1,5,2005-01-18,AMD124,C10AA05
2,1,5,2005-01-18,AMD124,C10AX06
3,1,5,2005-06-21,AMD121,C09AA05
4,1,5,2005-06-21,AMD124,C10AA05


## Exercise 1

*Class imbalance #1* - not all patients will have a cardiovascular event within the stabilised six-month period. Thus, we would expect that the class distribution is highly imbalanced. 

- For each patient $p_i$ such that $y(p_i)=0$, eliminate the last six months of history to avoid giving the model prediction hints into the future. 
- For each patient $p_i$ such that $y(p_i)=1$, create $m$ copies $\lbrace p_i^1, \cdots, p_i^m \rbrace$ such that all the cardiovascular events in the last six months are eliminated, and the other events are shuffled and cancelled at random.

In this way, you have a sort of balancing criterion (i.e., up-sampling the minority class).

First we divide the **ap** dataframe into:
- **ap_class0** each patient $p_i$ has $y(p_i)=0$
- **ap_class1** each patient $p_i$ has $y(p_i)=1$

In [327]:
print(f"ap has size        => {ap.shape}")
ap_class0 = ap[ap.label == 0]
print(f"ap_class0 has size => {ap_class0.shape}")
ap_class1 = ap[ap.label == 1]
print(f"ap_class1 has size => {ap_class1.shape}")


ap has size        => (1555, 13)
ap_class0 has size => (868, 13)
ap_class1 has size => (687, 13)


And now the rest of the dataframes are also divided into **df_class0** and **df_class1** following the same criteria:

In [328]:
diag_class0 = pd.merge(diag, ap_class0, on=['idcentro','idana'])[diag.columns]
elp_class0  = pd.merge(elp, ap_class0, on=['idcentro','idana'])[elp.columns]
ei_class0   = pd.merge(ei, ap_class0, on=['idcentro','idana'])[ei.columns]
pdf_class0  = pd.merge(pdf, ap_class0, on=['idcentro','idana'])[pdf.columns]
pdnf_class0 = pd.merge(pdnf, ap_class0, on=['idcentro','idana'])[pdnf.columns]
pnd_class0  = pd.merge(pnd, ap_class0, on=['idcentro','idana'])[pnd.columns]

diag_class1 = pd.merge(diag, ap_class1, on=['idcentro','idana'])[diag.columns]
elp_class1  = pd.merge(elp, ap_class1, on=['idcentro','idana'])[elp.columns]
ei_class1   = pd.merge(ei, ap_class1, on=['idcentro','idana'])[ei.columns]
pdf_class1  = pd.merge(pdf, ap_class1, on=['idcentro','idana'])[pdf.columns]
pdnf_class1 = pd.merge(pdnf, ap_class1, on=['idcentro','idana'])[pdnf.columns]
pnd_class1  = pd.merge(pnd, ap_class1, on=['idcentro','idana'])[pnd.columns]

We define a function that eliminates the events during the last 6 months for each patient

In [329]:
def dropLastSixMonths(df:pd.DataFrame, patients: pd.DataFrame) -> pd.DataFrame:

    newDF = pd.DataFrame(columns=df.columns)

    # We iterate for each patient on patients
    for p in patients.itertuples():
        # Take the cardiovascular events of the patient p
        aux = df[(df.idcentro==p.idcentro) & (df.idana==p.idana)]

        # Calculate the last valid data
        aux = aux.sort_values('data', ascending=False)
        lastData = next(aux.itertuples()).data
        lastData = dt.date.fromisoformat(lastData)
        limitData = lastData - dt.timedelta(days=30*6)

        # Take only the events before this limit data
        aux = aux[aux.data < limitData.strftime("%Y-%m-%d")]
        
        # We concatenate this with the new DataFrame
        newDF = pd.concat([newDF, aux])

    return newDF

Now we execute the function to every dataframe in our preprocessed dataset for the **ap_class0** patients:

In [330]:
aux = diag_class0.shape[0]
diag_class0 = dropLastSixMonths(diag_class0, ap_class0)
print(f"Diag_class0: {aux} => {diag_class0.shape[0]}")

aux = elp_class0.shape[0]
elp_class0 = dropLastSixMonths(elp_class0, ap_class0)
print(f"Elp_class0: {aux} => {elp_class0.shape[0]}")

aux = ei_class0.shape[0]
ei_class0 = dropLastSixMonths(ei_class0, ap_class0)
print(f"Ei_class0: {aux} => {ei_class0.shape[0]}")

aux = pdf_class0.shape[0]
pdf_class0 = dropLastSixMonths(pdf_class0, ap_class0)
print(f"Pdf_class0: {aux} => {pdf_class0.shape[0]}")

aux = pdnf_class0.shape[0]
pdnf_class0 = dropLastSixMonths(pdnf_class0, ap_class0)
print(f"Pdnf_class0: {aux} => {pdnf_class0.shape[0]}")

aux = pnd_class0.shape[0]
pnd_class0 = dropLastSixMonths(pnd_class0, ap_class0)
print(f"Pnd_class0: {aux} => {pnd_class0.shape[0]}")

Diag_class0: 49914 => 44071
Elp_class0: 168320 => 146637
Ei_class0: 7825 => 6291
Pdf_class0: 49264 => 44101
Pdnf_class0: 5718 => 4299
Pnd_class0: 51834 => 45656


Now for the **ap_class1** patients:

In [331]:
aux = diag_class1.shape[0]
diag_class1 = dropLastSixMonths(diag_class1, ap_class1)
print(f"Diag_class1: {aux} => {diag_class1.shape[0]}")

aux = elp_class1.shape[0]
elp_class1 = dropLastSixMonths(elp_class1, ap_class1)
print(f"Elp_class1: {aux} => {elp_class1.shape[0]}")

aux = ei_class1.shape[0]
ei_class1 = dropLastSixMonths(ei_class1, ap_class1)
print(f"Ei_class1: {aux} => {ei_class1.shape[0]}")

aux = pdf_class1.shape[0]
pdf_class1 = dropLastSixMonths(pdf_class1, ap_class1)
print(f"Pdf_class1: {aux} => {pdf_class1.shape[0]}")

aux = pdnf_class1.shape[0]
pdnf_class1 = dropLastSixMonths(pdnf_class1, ap_class1)
print(f"Pdnf_class1: {aux} => {pdnf_class1.shape[0]}")

aux = pnd_class1.shape[0]
pnd_class1 = dropLastSixMonths(pnd_class1, ap_class1)
print(f"Pnd_class1: {aux} => {pnd_class1.shape[0]}")

Diag_class1: 57085 => 49112
Elp_class1: 144069 => 126055
Ei_class1: 7289 => 5822
Pdf_class1: 44964 => 40007
Pdnf_class1: 4236 => 3153
Pnd_class1: 53463 => 47128


Now let's assign idcopy = 1 to every single patient on ap_class0

In [332]:
def insertIdCopyClass0(df:pd.DataFrame) -> pd.DataFrame:

    cols = list(df.columns)
    if not(np.isin('idcopy',cols)):
        cols.insert(2,'idcopy')
        df['idcopy'] = 1
        df = df[cols]
    return df

In [333]:
ap_class0   = insertIdCopyClass0(ap_class0)
diag_class0 = insertIdCopyClass0(diag_class0)
elp_class0  = insertIdCopyClass0(elp_class0)
ei_class0   = insertIdCopyClass0(ei_class0)
pdf_class0  = insertIdCopyClass0(pdf_class0)
pdnf_class0 = insertIdCopyClass0(pdnf_class0)
pnd_class0  = insertIdCopyClass0(pnd_class0)

diag_class0.head()

Unnamed: 0,idcentro,idana,idcopy,data,codiceamd,valore
30,1,74,1,2011-10-19,AMD130,S
28,1,74,1,2011-07-13,AMD247,353
29,1,74,1,2011-07-13,AMD247,401
46,1,74,1,2011-07-13,AMD083,401.0
22,1,74,1,2011-04-28,AMD067,S


At this point, we are done with the **ap_class0** patients and their events, so we focus on the **ap_class1** ones. We now have to create $m$ (= 10 for example) copies of each patient $p_i$ such that $y(p_i)=1$, and we shuffle or eliminate cardiovascular events randomly just to *oversample* the dataset:

In [334]:
def newMCopies(patients:pd.DataFrame, copies:int) -> pd.DataFrame:

    cols = patients.columns.tolist()
    cols = cols[0:2]+['idcopy']+cols[2:]
    newPatients = pd.DataFrame(columns=cols)

    for p in patients.iterrows():
        for i in range(copies):
            newP = dict(p[1])
            newP['idcopy'] = i+1
            newPatients.loc[newPatients.shape[0]] = newP

    return newPatients

In [335]:
M = 3
ap_class1_copies = newMCopies(ap_class1, M)

We can check that every patient has been cloned M times, and the only identification for each clone $p_i^j$ is $i=\lbrace$ idcentro, idana $\rbrace$ and $j = \lbrace$ idcopy $\rbrace$

In [336]:
print("Copies of Patient 1:")
ap_class1_copies.iloc[0:1*M]

Copies of Patient 1:


Unnamed: 0,idcentro,idana,idcopy,sesso,annodiagnosidiabete,tipodiabete,scolarita,statocivile,professione,origine,annonascita,annoprimoaccesso,annodecesso,label
0,1,5,1,M,1986.0,5,2.0,2.0,9.0,,1942,1990.0,2014.0,1
1,1,5,2,M,1986.0,5,2.0,2.0,9.0,,1942,1990.0,2014.0,1
2,1,5,3,M,1986.0,5,2.0,2.0,9.0,,1942,1990.0,2014.0,1


In [337]:
print("Copies of Patient 2:")
ap_class1_copies.iloc[M:2*M]

Copies of Patient 2:


Unnamed: 0,idcentro,idana,idcopy,sesso,annodiagnosidiabete,tipodiabete,scolarita,statocivile,professione,origine,annonascita,annoprimoaccesso,annodecesso,label
3,1,65,1,M,1997.0,5,2.0,1.0,9.0,,1942,1997.0,2018.0,1
4,1,65,2,M,1997.0,5,2.0,1.0,9.0,,1942,1997.0,2018.0,1
5,1,65,3,M,1997.0,5,2.0,1.0,9.0,,1942,1997.0,2018.0,1


In [338]:
# def newMCopiesRandom(df:pd.DataFrame, copies:int) -> pd.DataFrame:

#     # We add the column idcopy to our table
#     cols = df.columns.tolist()
#     cols = cols[0:2]+['idcopy']+cols[2:]
#     newDf = pd.DataFrame(columns=cols)

#     # For each event we make copies shuffling and cancelling randomly
#     for p in df.iterrows():
#         for i in range(copies):
#           y = random.randrange(100)
#           if (y%3 != 0):
#             newEvent = dict(p[1])
#             newEvent['idcopy'] = i % copies + 1
#             newDf.loc[newDf.shape[0]] = newEvent
#     return newDf

In [339]:
def newMCopiesRandom(df: pd.DataFrame, copies:int) -> pd.DataFrame:
  
  list_of_dataframes = []   

  # We iterate over the patients with class = 1
  for i in ap_class1.itertuples():

    # We get the events of df associated with the patient p
    events_patient = df.loc[(df['idcentro'] == i.idcentro) & (df['idana'] == i.idana)]
    events_patient['idcopy']=0

    cols = list(df.columns)
    cols.insert(2,'idcopy')
    events_patient = events_patient[cols]

    for j in range(copies):
      
      # We create a new dataframe with the events of patient (p_i^j) and assign the idcopy to (j+1)
      events_patient['idcopy'] = j+1

      # Now we decide randomly if each copy of the patient p is going to have the events generating a boolean decision list
      # We also include a random parameter weight in order to become False choice more probable than True.
      weight = random.randint(50,70)/100
      choice = random.choices([False,True],weights=[weight,1-weight], k=events_patient.shape[0])
      
      # And add the new dataframe to the list of dataframes
      list_of_dataframes.append(events_patient[choice])
   
  newDf = pd.concat(list_of_dataframes, axis=0)
  return newDf.reset_index(drop=True)

In [340]:
diag_class1_copies = newMCopiesRandom(diag_class1, M)

In [341]:
print("Events of Patient 1 Copy 1:", diag_class1_copies.loc[(diag_class1_copies['idcentro']==1)&
                                                            (diag_class1_copies['idana']==5)&
                                                            (diag_class1_copies['idcopy']==1)].shape[0])
print("Events of Patient 1 Copy 2:", diag_class1_copies.loc[(diag_class1_copies['idcentro']==1)&
                                                            (diag_class1_copies['idana']==5)&
                                                            (diag_class1_copies['idcopy']==2)].shape[0])
print("Events of Patient 1 Copy 3:", diag_class1_copies.loc[(diag_class1_copies['idcentro']==1)&
                                                            (diag_class1_copies['idana']==5)&
                                                            (diag_class1_copies['idcopy']==3)].shape[0])

Events of Patient 1 Copy 1: 77
Events of Patient 1 Copy 2: 81
Events of Patient 1 Copy 3: 87


In [342]:
elp_class1_copies  = newMCopiesRandom(df=elp_class1, copies=M)


In [343]:
ei_class1_copies   = newMCopiesRandom(df=ei_class1, copies=M)

In [344]:
pdf_class1_copies  = newMCopiesRandom(df=pdf_class1, copies=M)


In [345]:
pdnf_class1_copies = newMCopiesRandom(df=pdnf_class1, copies=M)


In [346]:
pnd_class1_copies  = newMCopiesRandom(df=pnd_class1, copies=M)

## Exercise 2

*Class Imbalance #2* - Action item *#1* isn't going to be sufficient for balancing purposes. Propose your balancing strategy - possibly an advanced approach - and evaluate: 
- **vanilla-LTSM** (*Exercise 2.1*)
- **T-LTSM** (*Exercise 2.2*)
- **PubMedBert** (*Exercise 2.3*) 

on the balanced version of the dataset.

### Balancing Strategy Definition

### Exercise 2.1

Let's try to use **cuda** (if possible) for a better performance:

In [347]:
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

In [348]:
# First we concatenate all the dataframes, the ones with class = 0 and the others, with class = 1
ap_class   = pd.concat([ap_class0,ap_class1_copies],    axis=0)
diag_class = pd.concat([diag_class0,diag_class1_copies],axis=0)
elp_class  = pd.concat([elp_class0,elp_class1_copies],  axis=0)
ei_class   = pd.concat([ei_class0,ei_class1_copies],    axis=0)
pdf_class  = pd.concat([pdf_class0,pdf_class1_copies],  axis=0)
pdnf_class = pd.concat([pdnf_class0,pdnf_class1_copies],axis=0)
pnd_class  = pd.concat([pnd_class0,pnd_class1_copies],  axis=0)

We read the *.csv* that contain all the information to translate *codiceamd* and *codiceatc* into descriptive texts

In [349]:
read_data_path  = "../../data/codici/"

codiciAMD = pd.read_csv(read_data_path+'/amd_codes_for_bert.csv', header=0 ,names=['codiceamd','meaning'])
codiciATC = pd.read_csv(read_data_path+'/atc_info_nodup.csv', header=0 ,names=['codiceatc','first','last','len','numpatients','numdates','atc_nome'])
codiciATC = codiciATC.filter(['codiceatc','atc_nome']).rename(columns={'atc_nome':'meaning'})

In [350]:
codiciAMD.head()

Unnamed: 0,codiceamd,meaning
0,AMD090,Diet only
1,AMD140,Self control
2,AMD215,Number of strips prescribed per week
3,AMD228,Integrated management
4,AMD086,Self-monitoring of blood glucose


In [351]:
codiciATC.head()

Unnamed: 0,codiceatc,meaning
0,A10AB01,insulin (human)
1,A10AB04,insulin lispro
2,A10AB05,insulin aspart
3,A10AB06,insulin glulisine
4,A10AC01,insulin (human)


Now, we define a function to become the dataset like the one showed in this table:

| comment_text | list | label |
|----------|----------|-------|
| Diet only | [*, *, $\cdots$, *] |$\hspace{0.25cm}0$|
| Insulin pump | [*, *, $\cdots$, *] |$\hspace{0.25cm}1$|
| Heart failure | [*, *, $\cdots$, *] |$\hspace{0.25cm}1$|
| Maculopathy | [*, *, $\cdots$, *] |$\hspace{0.25cm}0$|
| $\hspace{1cm}\vdots$ | $\hspace{0.6cm}\vdots$ |$\hspace{0.3cm}\vdots$|

In [352]:
def dataframeToBertFrame(df:pd.DataFrame, patients:pd.DataFrame) -> pd.DataFrame:
    # Do we have to translate AMD or ATC codes?
    if np.isin('codiceamd',list(df.columns)):
        codice = 'codiceamd'
        codiciDF = codiciAMD
    else:
        codice = 'codiceatc'
        codiciDF = codiciATC

    # At first, we add a column into the dataframe with the descriptive text of each code
    df = pd.merge(df, codiciDF, on=[codice]) 

    # Now we add the information about the patient
    df = pd.merge(df, patients, on=['idcentro','idana','idcopy'])

    # We are not interested in id's, codici,
    df = df.drop(columns=['idcentro','idcopy','idana',codice])

    # We take all the columns minus 'meaning'
    cols = list(df.columns)
    cols.remove('meaning')
    cols.remove('label')

    # Create the 'list' columns with a list with all the values associated to 'cols'
    df['list'] = df[cols].values.tolist()

    # We just want 'meaning' and 'list' columns
    df = df.filter(['meaning','list','label'])

    # Rename 'meaning' column to 'comment_text'
    df = df.rename(columns={'meaning':'comment_text'})

    return df

In [353]:
diag_class_bert = dataframeToBertFrame(diag_class, ap_class)

In [354]:
diag_class_bert.head()

Unnamed: 0,comment_text,list,label
0,Non diabetic retinopathy,"[2011-10-19, S, M, 1995.0, 5, 2.0, 2.0, 9.0, n...",0
1,Non diabetic retinopathy,"[2011-04-28, S, M, 1995.0, 5, 2.0, 2.0, 9.0, n...",0
2,Non diabetic retinopathy,"[2010-05-03, S, M, 1995.0, 5, 2.0, 2.0, 9.0, n...",0
3,Non diabetic retinopathy,"[2009-06-29, S, M, 1995.0, 5, 2.0, 2.0, 9.0, n...",0
4,Other comorbidities,"[2011-07-13, 353, M, 1995.0, 5, 2.0, 2.0, 9.0,...",0


[EJEMPLO DE USO PUBMEDBERT](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb#scrollTo=NLxxwd1scQNv)

In [355]:
# We define de model repository
model_name="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

# Then we download the pytorch model
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [356]:
# CONFIG SECTION
MAX_LEN = 200
TRAIN_BATCH_SIZE = 8
VALID_BATCH_SIZE = 4
EPOCHS = 1
LEARNING_RATE = 1e-05

In [357]:
class CustomDataset(Dataset):

    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.comment_text = dataframe.comment_text
        self.targets = self.data.label
        self.max_len = max_len

    def __len__(self):
        return len(self.comment_text)

    def __getitem__(self, index):
        comment_text = str(self.comment_text[index])
        comment_text = " ".join(comment_text.split())

        inputs = self.tokenizer.encode_plus(
            comment_text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,
            return_token_type_ids=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]


        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.targets[index], dtype=torch.float)
        }

In [358]:
# We create the dataset and the data loader for the neural network
train_size = 0.75
dataset = diag_class_bert
train_dataset = dataset.sample(frac=train_size, random_state=200)
test_dataset = dataset.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)

print(f"FULL  Dataset: {dataset.shape}")
print(f"TRAIN Dataset: {train_dataset.shape}")
print(f"TEST  Dataset: {test_dataset.shape}")

training_set = CustomDataset(train_dataset, tokenizer, MAX_LEN)
testing_set = CustomDataset(test_dataset, tokenizer, MAX_LEN)

FULL  Dataset: (102871, 3)
TRAIN Dataset: (77153, 3)
TEST  Dataset: (25718, 3)


In [359]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

In [360]:
# Creating the customized model, by adding a drop out and a dense layer on top of distil bert to get the final output for the model. 
class PubMedBERTClass(torch.nn.Module):
    def __init__(self):
        super(PubMedBERTClass, self).__init__()
        self.l1 = transformers.BertModel.from_pretrained(model_name)
        self.l2 = torch.nn.Dropout(0.3)
        self.l3 = torch.nn.Linear(768, 6)
    
    def forward(self, ids, mask, token_type_ids):
        _, output_1= self.l1(ids, attention_mask = mask, token_type_ids = token_type_ids)
        output_2 = self.l2(output_1)
        output = self.l3(output_2)
        return output

model = PubMedBERTClass()
model.to(device)

Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


PubMedBERTClass(
  (l1): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True

In [361]:
# We define a loss function
def loss_fn(outputs, targets):
    return torch.nn.BCEWithLogitsLoss()(outputs,targets)
    
# And an optimizer
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

Let's train our model

In [362]:
def train(epoch):
    model.train()
    for _,data in enumerate(training_loader, 0):
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.float)

        outputs = model(ids, mask, token_type_ids)

        optimizer.zero_grad()
        loss = loss_fn(outputs, targets)
        if _%5000==0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

In [363]:
for epoch in range(EPOCHS):
    train(epoch)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 4.00 GiB total capacity; 3.29 GiB already allocated; 0 bytes free; 3.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

### Exercise 2.2

### Exercise 2.3