# Title: AIDI 1002 Final Term Project Report

# Members' Names or Individual's Name: JOLSNA JAYADEVAN 
 


# Emails: **jolsnaj.mec@gmail.com**


# <u>Introduction:</u>

 ### Problem Description:
 
Natural language processing (NLP) is becoming increasingly important in health and healthcare. By converting unstructured data into structured, standardised data, NLP assists physicians in making decisions, predicting health outcomes, preventing adverse events, and improving the quality of treatment. In recent years, AI advances based on pre-trained transformer architectures have revolutionised NLP. These advances have enabled researchers to create generalizable language models and use them to obtain higher accuracy on downstream tasks. Pre-trained transformer models have since become commonplace for language tasks involving contextual, long-distance dependencies, and have been integrated into commercial services such as Google Search and Amazon Alexa.

Despite these recent achievements, clinical abbreviations and acronyms continue to hamper NLP performance and practical use in health and healthcare. Abbreviations account for 30-50% of clinical text, such as doctors' notes, but barely 1% of general text, such as news media. As a result, recognising, disambiguating, and expanding abbreviations is critical in clinical NLP, and even small improvements would increase performance and practical applicability. Furthermore, recognising, disambiguating, and expanding abbreviations can help physicians, nurses, carers, and patients understand them, which has been shown in trials to prevent medically harmful misinterpretation.

### Context of the Problem:

Recognising, disambiguating, and expanding medical abbreviations and acronyms is critical for preventing medically dangerous misinterpretation in natural language processing. We present the Medical Abbreviation and Acronym Meta-Inventory, a comprehensive database of medical abbreviations, to aid in recognition, disambiguation, and expansion. A comprehensive harmonisation of eight source inventories from a variety of healthcare disciplines and settings yielded 104,057 abbreviations and 170,426 corresponding senses. Automated cross-mapping of synonymous records using state-of-the-art machine learning minimises redundancy, simplifying future applications.

The inventories used to develop the solution are:

| Source | Description | Underlying Corpus | Medical Speciality |
| --- | --- | --- | --- |
| UMLS-LRABR | Unified Medical Language System Lexical Resource for Abbreviations and Acronyms | Biomedical research | Multiple |
| ADAM | Another Database of Abbreviations in Medline | Biomedical research | Multiple |
| Berman | Manually-curated general pathology abbreviations | Clinical Notes | Pathology |
| Wikipedia | Publicly-curated list of medical and clinical trial abbreviations | Clinical Notes | Multiple |
| Vanderbilt1 | Semi-automatically derived from the medical record | Sign-out Notes | Medicine |
| Vanderbilt2 | Semi-automatically derived from the medical record | Discharge Notes | Medicine |
| Stetson | Manually-curated from the general medical record | Sign-out Notes | Medicine |
| Columbia | Manually-curated from the obstetric medical record | Clinical Notes | Obstetrics |

### Limitations of Other Approaches:

Two fundamental challenges stand in the way of compiling a comprehensive sense inventory:
1. Errors have been identified in multiple sources, so quality control is needed to correct them.
2. Because abbreviations differ by specialty and setting, many individual sense inventories from various specialties and settings are required.

The use of multiple inventories raises the likelihood of significant redundancy, necessitating cross-mapping (internal structure) to eliminate redundancy and facilitate future application. Because every record may have to be compared with every other record, the number of pairwise comparisons grows quadratically with the number of records, making manual cross-mapping impractical. <br>
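
To give a feel for how quickly the comparison count grows, the sketch below counts unordered record pairs. It assumes a naive scheme in which each record is compared once with every other record, giving n(n-1)/2 comparisons. The function name `pairwise_comparisons` and the record counts are illustrative assumptions, not figures from the paper.

```python
from math import comb


def pairwise_comparisons(n_records: int) -> int:
    """Return the number of unordered record pairs, n * (n - 1) / 2."""
    return comb(n_records, 2)


# Illustrative record counts (assumptions, not taken from the paper)
for n in (10, 1_000, 100_000):
    print(f"{n:>7,} records -> {pairwise_comparisons(n):>13,} pairs")
```

Even at a hundred thousand records the naive count runs into the billions, which is why an automated approach is needed.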

### Solution:

A high-quality, complete, relevant, and non-redundant deep sense inventory could overcome the difficulties of interoperability and generalizability. Such an inventory would require the extraction, collation, and organisation of various source inventories. <br>
<br>
We propose a comprehensive database of medical abbreviations and acronyms that combines many source sense inventories from various corpora, medical disciplines, and medical contexts into a single **Meta-Inventory**.<br>
<br>
The Meta-Inventory addresses the above-mentioned challenges with two major features:
1. Semi-automated quality control using heuristics to identify errors and improve reliability
2. Automated cross-mapping of synonyms using cutting-edge machine learning to eliminate redundancy and simplify future downstream tasks.

# Background

The related work is summarised in the following table.

| Reference | Explanation | Dataset/Input | Weakness |
| --- | --- | --- | --- |
| Tom et al. [1] | They trained a BERT-based transformer to predict answers to a question from a passage | SQuAD dataset for QA | Only 80% accuracy |
| George et al. [2] | They trained an attention-based sequence-to-sequence model using LSTM to predict answers to a question from a passage | SQuAD v2 dataset for QA | High accuracy but poor on unknown answers |



# Methodology

From each source, the authors of the selected paper extracted the following **data fields**: [A] short form, or the abbreviation (e.g., "MS"); [B] long form, or the spelled-out version of the abbreviation (e.g., "Multiple Sclerosis"); [C] source, or the name of the source inventory. Each row represents a single abbreviation (short form) and its associated sense (long form). <br>
<br>
Lastly, they added the following new data fields: [D] normalised short form, a lexically normalised version of each short form, to reduce linguistic variation; [E] normalised long form, a lexically normalised version of each long form, to reduce linguistic variation; [F] unique identifiers for each individual record, each unique short form, and each unique long form, to ease future database maintenance and use; and [G] group identifiers for each group of synonymous (cross-mapped) records, to reduce redundancy.
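
As a rough illustration of how fields [A] to [G] fit together in one record, the sketch below models a single row as a dataclass. The class and attribute names are hypothetical. The values are copied from a sample row shown in the Step 3 output of this notebook, where the group identifier is still empty.

```python
from dataclasses import dataclass, asdict


@dataclass
class InventoryRecord:
    """One inventory row; attribute names are hypothetical."""
    group_id: str         # [G] group of synonymous records
    record_id: str        # [F] unique record identifier
    short_form: str       # [A] abbreviation
    short_form_id: str    # [F] unique short form identifier
    norm_short_form: str  # [D] normalised short form
    long_form: str        # [B] spelled-out sense
    long_form_id: str     # [F] unique long form identifier
    norm_long_form: str   # [E] normalised long form
    source: str           # [C] source inventory


record = InventoryRecord(
    group_id='',  # empty in the sample row
    record_id='R368729',
    short_form='T-ALL',
    short_form_id='S068102',
    norm_short_form='t_all',
    long_form='T cell acute lymphoblastic leukemias',
    long_form_id='L032523',
    norm_long_form='T cell acute lymphoblastic leukemia',
    source='ADAM',
)
print(asdict(record))
```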



### My Contribution 

I have tried to develop a BERT model using transformers for prediction on a large corpus.

# Implementation

**The project is implemented in 3 steps:**
1. Preprocessing
2. Data Field Entry
3. Quality Control

In [8]:
# The data fields from the inventory and the new normalised data fields described in the
# cell above are implemented in master_functions.py, which is loaded here.



# %load master_functions.py
'''
Master Functions
master_functions.py
'''
!pip install configupdater
import pandas as pd
import string
import subprocess
from configupdater import ConfigUpdater


# Function to clean data frame
def clean(df):
  # Remove leading and trailing white space
  cols = df.select_dtypes(['object']).columns
  df[cols] = df[cols].apply(lambda x: x.str.strip())


# Function to unnest columns in data frame
def expand_col(df, col, d='|'):
  # Split on the delimiter and stack individual entries into rows
  s = df[col].str.split(d, expand=True).stack()
  # Match up with df indices
  s.index = s.index.droplevel(-1)
  # Name new column
  s.name = col
  # Delete old column
  del df[col]
  # Merge new column with df
  df = df.join(s)
  return df


# Function to define normalized short form
def normalized_short_form(sf):
  # Convert to lowercase
  sf = sf.lower()
  # Strip leading and trailing whitespace
  sf = sf.strip()
  # Remove all periods
  sf = sf.replace(".", "")
  # Convert all punctuation to underscore
  sf = sf.translate(str.maketrans(string.punctuation, '_'*len(string.punctuation)))
  return sf


# Function to execute command line LVG program
def lvg(input_file, flow, output_file, lvg_path):
  # Specify command
  command = [lvg_path, # Specify path
             '-i:' + input_file, # Input
             '-f:' + flow, # Normalization flow
             '-o:' + output_file, # Output
             '-R:1', # Restrict
             '-n'] # Suppress output
  # Execute command
  lvg_process = subprocess.check_output(command)
  return lvg_process


# Function to standardize CUI appearance
def standardize_cui(cui):
  # Use comma delimited CUIs
  cui = cui.replace('|',',')
  # Use CUIs with a capital C
  cui = cui.replace('c', 'C')
  return cui


# Function to add new SFUI
def add_new_SFUI(df_final):
  updater = ConfigUpdater()
  updater.read('setup.cfg')
  # Subset into assigned and unassigned
  df = df_final[df_final['SFUI']=='']
  df_final = df_final[df_final['SFUI']!='']
  if df.empty:
    return df_final
  else:
    # Sort by SF
    df = df.sort_values(by=['SF'])
    df = df.reset_index(drop=True)
    # Assign SFUI
    assignment = int(updater['metadata']['sfui_last_assignment'].value) + 1
    for index, row in df.iterrows():
      if index == 0:
        df['SFUI'].iat[index] = assignment
      elif df['SF'].at[index] == df['SF'].at[index-1]:
        df['SFUI'].iat[index] = assignment
      else:
        assignment += 1
        df['SFUI'].iat[index] = assignment
    # Format SFUI
    df['SFUI'] = 'S' + (df.SFUI.map('{:06}'.format))
    # Add back newly assigned
    df_final = pd.concat([df_final, df])
    df_final = df_final.reset_index(drop=True)
    # Update config file
    updater['metadata']['sfui_last_assignment'].value = assignment
    updater.update_file()
    # Return dataframe
    return df_final


# Function to add new LFUI
def add_new_LFUI(df_final):
  updater = ConfigUpdater()
  updater.read('setup.cfg')
  # Subset into assigned and unassigned
  df = df_final[df_final['LFUI']=='']
  df_final = df_final[df_final['LFUI']!='']
  if df.empty:
    return df_final
  else:
    # Sort by LF
    df = df.sort_values(by=['LF'])
    df = df.reset_index(drop=True)
    # Assign LFUI
    assignment = int(updater['metadata']['lfui_last_assignment'].value) + 1
    for index, row in df.iterrows():
      if index == 0:
        df['LFUI'].iat[index] = assignment
      elif df['LF'].at[index] == df['LF'].at[index-1]:
        df['LFUI'].iat[index] = assignment
      else:
        assignment += 1
        df['LFUI'].iat[index] = assignment
    # Format LFUI
    df['LFUI'] = 'L' + (df.LFUI.map('{:06}'.format))
    # Add back newly assigned
    df_final = pd.concat([df_final, df])
    df_final = df_final.reset_index(drop=True)
    # Update config file
    updater['metadata']['lfui_last_assignment'].value = assignment
    updater.update_file()
    # Return dataframe
    return df_final





## Step 1: Preprocessing

In [9]:
import pandas as pd
from master_functions import *

In [10]:
# this is the common data model developed for formatting all the 8 inventories selected

out_db = pd.DataFrame(columns=['GroupID', 'RecordID', 'SF', 'SFUI', 'NormSF', 
                               'LF', 'LFUI', 'NormLF', 'Source', 
                               # Auxiliary data fields
                               'SFEUI', 'LFEUI', 'Type', 'PrefSF', 'Score',
                               'Count', 'Frequency', 'UMLS.CUI'])

### Source #1: UMLS

In [11]:
# Loading the first inventory source

umls_db = pd.read_csv('https://raw.githubusercontent.com/lisavirginia/clinical-abbreviations/master/sources/1-umls/LRABR',
                      sep='|',
                      header=None,
                      names=['SFEUI', 'SF', 'Type', 'LFEUI', 'LF'],
                      na_filter=False,
                      index_col=False)

In [12]:
# cleaning the data and printing first 3 rows of the source

clean(umls_db)
umls_db.sample(3, random_state=0)

Unnamed: 0,SFEUI,SF,Type,LFEUI,LF
135790,E0672087,G. agilis,abbreviation,E0672086,Giardia agilis
58499,E0520697,ME,acronym,E0039236,median eminence
282286,E0761579,MYLK2,acronym,E0761578,myosin light chain kinase 2


In [13]:
# Fill Output Frame

umls_out = out_db.copy()
umls_out['SF'] = umls_db['SF']
umls_out['LF'] = umls_db['LF']
umls_out['Source'] = 'UMLS'
umls_out['SFEUI'] = umls_db['SFEUI']
umls_out['LFEUI'] = umls_db['LFEUI']
umls_out['Type'] = umls_db['Type']

In [14]:
umls_out.sample(3, random_state=0)

Unnamed: 0,GroupID,RecordID,SF,SFUI,NormSF,LF,LFUI,NormLF,Source,SFEUI,LFEUI,Type,PrefSF,Score,Count,Frequency,UMLS.CUI
135790,,,G. agilis,,,Giardia agilis,,,UMLS,E0672087,E0672086,abbreviation,,,,,
58499,,,ME,,,median eminence,,,UMLS,E0520697,E0039236,acronym,,,,,
282286,,,MYLK2,,,myosin light chain kinase 2,,,UMLS,E0761579,E0761578,acronym,,,,,


In [15]:
# append the output

out_list = []
out_list.append(umls_out)

### Source #2: ADAM

In [16]:
# Loading the second inventory source

adam_db = pd.read_csv('https://raw.githubusercontent.com/lisavirginia/clinical-abbreviations/master/sources/2-adam/adam_database',
                      sep='\t',
                      skiprows=38,  # skips readme portion
                      header=None,
                      names=['Pref_SF', 'Alt_SF', 'All_LF', 'Score', 'Count'],
                      na_filter=False,
                      index_col=False)

In [17]:
# cleaning the data by calling the function clean which was developed inside master_functions.py

clean(adam_db)
adam_db.sample(3, random_state=0)

Unnamed: 0,Pref_SF,Alt_SF,All_LF,Score,Count
13054,DMN,DMN:15,dysplastic melanocytic nevi:15:0.8045,0.8045,15
5739,BM,BM:8|Bm:1,bicuculline methiodide:9:0.6794,0.6794,9
19192,GISSI-2,GISSI-2:10,Gruppo Italiano per lo Studio della Sopravvive...,0.547,10


In [18]:
adam_out = out_db.copy()
adam_out['SF'] = adam_db['Alt_SF']
adam_out['LF'] = adam_db['All_LF']
adam_out['Source'] = 'ADAM'
adam_out['PrefSF'] = adam_db['Pref_SF']

In [19]:
adam_out.sample(3, random_state=0)

Unnamed: 0,GroupID,RecordID,SF,SFUI,NormSF,LF,LFUI,NormLF,Source,SFEUI,LFEUI,Type,PrefSF,Score,Count,Frequency,UMLS.CUI
13054,,,DMN:15,,,dysplastic melanocytic nevi:15:0.8045,,,ADAM,,,,DMN,,,,
5739,,,BM:8|Bm:1,,,bicuculline methiodide:9:0.6794,,,ADAM,,,,BM,,,,
19192,,,GISSI-2:10,,,Gruppo Italiano per lo Studio della Sopravvive...,,,ADAM,,,,GISSI-2,,,,


In [20]:
adam_out = expand_col(adam_out, 'SF')
adam_out = expand_col(adam_out, 'LF')
adam_out.drop_duplicates(inplace=True)

In [21]:
adam_out.sample(3, random_state=0)

Unnamed: 0,GroupID,RecordID,SFUI,NormSF,LFUI,NormLF,Source,SFEUI,LFEUI,Type,PrefSF,Score,Count,Frequency,UMLS.CUI,SF,LF
2789,,,,,,,ADAM,,,,ANF,,,,,ANF:19,atrial natriuretic peptide:19:0.7658
12925,,,,,,,ADAM,,,,DLS,,,,,DLS:159,dynamic light scattering:150:0.9220
23483,,,,,,,ADAM,,,,Ids,,,,,ids:1,idiotypes:14:0.0378


In [22]:
# assigning count information

temp = adam_out['SF'].str.split(':', expand=True)
adam_out['SF'] = temp[0]
adam_out['Count'] = temp[1] 

In [23]:
# assigning score information

temp = adam_out['LF'].str.split(':', expand=True)
adam_out['LF'] = temp[0]
adam_out['Score'] = temp[2]
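
The two cells above rely on ADAM's colon-delimited packing: a short form carries its count (`SF:count`) and a long form carries a count and a score (`LF:count:score`). The sketch below replays that split on two values from the sample output plus one made-up entry. It uses `str.rsplit` with `n=2` for the long form, which is an assumption on my part rather than what the notebook does, so that a colon inside the long form itself would not shift the score column. Whether any ADAM long form actually contains a colon is not checked here.

```python
import pandas as pd

demo = pd.DataFrame({
    'SF': ['DMN:15', 'BM:8', 'XX:2'],
    'LF': ['dysplastic melanocytic nevi:15:0.8045',
           'bicuculline methiodide:9:0.6794',
           'made-up term: with colon:2:0.5000'],  # hypothetical entry
})

# Short form: "SF:count"
sf_parts = demo['SF'].str.split(':', expand=True)
# Long form: "LF:count:score"; split from the right so only the
# last two colons act as delimiters
lf_parts = demo['LF'].str.rsplit(':', n=2, expand=True)

demo['SF'], demo['Count'] = sf_parts[0], sf_parts[1]
demo['LF'], demo['Score'] = lf_parts[0], lf_parts[2]
print(demo)
```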

In [24]:
# Reordering the columns

adam_out = adam_out[out_db.columns]
adam_out.sample(3, random_state=0)

Unnamed: 0,GroupID,RecordID,SF,SFUI,NormSF,LF,LFUI,NormLF,Source,SFEUI,LFEUI,Type,PrefSF,Score,Count,Frequency,UMLS.CUI
2789,,,ANF,,,atrial natriuretic peptide,,,ADAM,,,,ANF,0.7658,19,,
12925,,,DLS,,,dynamic light scattering,,,ADAM,,,,DLS,0.922,159,,
23483,,,ids,,,idiotypes,,,ADAM,,,,Ids,0.0378,1,,


In [25]:
# appending the output

out_list.append(adam_out)

### Source #3: Berman

In [26]:
# loading the third inventory source

berm_db = pd.read_csv('https://raw.githubusercontent.com/lisavirginia/clinical-abbreviations/master/sources/3-berman/12000_pathology_abbreviations.txt',
                      sep='=',
                      header=None,
                      names=['SF', 'LF'],
                      na_filter=False,
                      index_col=False)

In [27]:
clean(berm_db)
berm_db.sample(3, random_state=0)

Unnamed: 0,SF,LF
9783,au,arbitrary unit
3706,npo,nothing by mouth
3234,mdm,mid diastolic murmur


In [28]:
berm_out = out_db.copy()
berm_out['SF'] = berm_db['SF']
berm_out['LF'] = berm_db['LF']
berm_out['Source'] = 'Berman'

In [29]:
berm_out.sample(3, random_state=0)

Unnamed: 0,GroupID,RecordID,SF,SFUI,NormSF,LF,LFUI,NormLF,Source,SFEUI,LFEUI,Type,PrefSF,Score,Count,Frequency,UMLS.CUI
9783,,,au,,,arbitrary unit,,,Berman,,,,,,,,
3706,,,npo,,,nothing by mouth,,,Berman,,,,,,,,
3234,,,mdm,,,mid diastolic murmur,,,Berman,,,,,,,,


In [30]:
out_list.append(berm_out)

### Source #4 and #5: Vanderbilt

In [31]:
vcln_db = pd.read_csv('https://raw.githubusercontent.com/lisavirginia/clinical-abbreviations/master/sources/4-vanderbilt/vanderbilt_clinic_notes.txt',
                      sep='\t',
                      na_filter=False,
                      index_col=False)

In [32]:
clean(vcln_db)
vcln_db.sample(3, random_state=0)

Unnamed: 0,abbreviation,sense,variation,CUI,frequency
563,cmt,charcot-marie-tooth,CMT_6,c0007959,0.014
824,xray,energetic high-frequency electromagnetic radia...,Xray_5|xray_13|XRay_2,c0337030,1.0
436,gtt,glucose tolerance test,GTT_2,c0017741,0.005


In [33]:
vdis_db = pd.read_csv('https://raw.githubusercontent.com/lisavirginia/clinical-abbreviations/master/sources/4-vanderbilt/vanderbilt_discharge_sums.txt',
                      sep='\t',
                      na_filter=False,
                      index_col=False)

In [34]:
clean(vdis_db)
vdis_db.sample(3, random_state=0)

Unnamed: 0,abbreviation,sense,variation,CUI,frequency
979,q,22q (chromosome),q_1,c1521100,0.003
984,q2,every two hours,q2_4,c0585322,1.0
746,nabs,normal active bowel sounds,nabs_2|NAbs_1|NABS_16|NABS._1,c0278005,1.0


In [35]:
vcln_out = out_db.copy()
vcln_out['SF'] = vcln_db['variation']
vcln_out['LF'] = vcln_db['sense']
vcln_out['Source'] = 'Vanderbilt Clinic Notes'
vcln_out['Frequency'] = vcln_db['frequency']
vcln_out['UMLS.CUI'] = vcln_db['CUI']

In [36]:
vcln_out.sample(3, random_state=0)

Unnamed: 0,GroupID,RecordID,SF,SFUI,NormSF,LF,LFUI,NormLF,Source,SFEUI,LFEUI,Type,PrefSF,Score,Count,Frequency,UMLS.CUI
563,,,CMT_6,,,charcot-marie-tooth,,,Vanderbilt Clinic Notes,,,,,,,0.014,c0007959
824,,,Xray_5|xray_13|XRay_2,,,energetic high-frequency electromagnetic radia...,,,Vanderbilt Clinic Notes,,,,,,,1.0,c0337030
436,,,GTT_2,,,glucose tolerance test,,,Vanderbilt Clinic Notes,,,,,,,0.005,c0017741


In [37]:
vdis_out = out_db.copy()
vdis_out['SF'] = vdis_db['variation']
vdis_out['LF'] = vdis_db['sense']
vdis_out['Source'] = 'Vanderbilt Discharge Sums'
vdis_out['Frequency'] = vdis_db['frequency']
vdis_out['UMLS.CUI'] = vdis_db['CUI']

In [38]:
vdis_out.sample(3, random_state=0)

Unnamed: 0,GroupID,RecordID,SF,SFUI,NormSF,LF,LFUI,NormLF,Source,SFEUI,LFEUI,Type,PrefSF,Score,Count,Frequency,UMLS.CUI
979,,,q_1,,,22q (chromosome),,,Vanderbilt Discharge Sums,,,,,,,0.003,c1521100
984,,,q2_4,,,every two hours,,,Vanderbilt Discharge Sums,,,,,,,1.0,c0585322
746,,,nabs_2|NAbs_1|NABS_16|NABS._1,,,normal active bowel sounds,,,Vanderbilt Discharge Sums,,,,,,,1.0,c0278005


In [39]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
vand_out = pd.concat([vcln_out, vdis_out])
vand_out = vand_out.reset_index(drop=True)
vand_out.shape

(2827, 17)

In [40]:
vand_out = expand_col(vand_out, 'SF')
vand_out.sample(3, random_state=0)

Unnamed: 0,GroupID,RecordID,SFUI,NormSF,LF,LFUI,NormLF,Source,SFEUI,LFEUI,Type,PrefSF,Score,Count,Frequency,UMLS.CUI,SF
2023,,,,,intensive care unit,,,Vanderbilt Discharge Sums,,,,,,,1.0,c0021708,ICU_19
2670,,,,,total body surface area,,,Vanderbilt Discharge Sums,,,,,,,1.0,c0005902,tbsa_4
2464,,,,,every four,,,Vanderbilt Discharge Sums,,,,,,,0.998,c0585324,Q4_2


In [41]:
temp = vand_out['SF'].str.split('_', expand=True)
vand_out['SF'] = temp[0]
vand_out['Count'] = temp[1] 

# reordering columns
vand_out = vand_out[out_db.columns]
vand_out.sample(3, random_state=0)


out_list.append(vand_out)

### Source #6: Wikipedia

In [42]:
wabr_db = pd.read_csv('https://raw.githubusercontent.com/lisavirginia/clinical-abbreviations/master/sources/5-wikipedia/wikipedia_abbreviation_database.csv',
                      sep=',',
                      na_filter=False,
                      index_col=False)

In [43]:
clean(wabr_db)
wabr_db.sample(3, random_state=0)

Unnamed: 0,abr,long_form
346,BEP,"bleomycin, etoposide, and cisplatin"
1355,I&O,inputs and outputs
1261,HSM,hepatosplenomegaly


In [44]:
wtrl_db = pd.read_csv('https://raw.githubusercontent.com/lisavirginia/clinical-abbreviations/master/sources/5-wikipedia/wikipedia_clinical_trials.txt',
                      sep=':',
                      header=None,
                      names=['abr', 'long_form'],
                      na_filter=False,
                      index_col=False,
                      skipinitialspace=True)

In [45]:
clean(wtrl_db)
wtrl_db.sample(3, random_state=0)

Unnamed: 0,abr,long_form
252,SURTAVI,Safety and Efficacy Study of the Medtronic Cor...
111,EVEREST,Efficacy of Vasopressin Antagonism in Heart Fa...
226,CYTO-PV,Cytoreductive Therapy in Polycythemia Vera


In [53]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
wiki_db = pd.concat([wabr_db, wtrl_db])
wiki_db.shape

(2952, 2)

In [54]:
wiki_out = out_db.copy()
wiki_out['SF'] = wiki_db['abr']
wiki_out['LF'] = wiki_db['long_form']
wiki_out['Source'] = 'Wikipedia'

In [55]:

wiki_out.sample(3, random_state=0)

Unnamed: 0,GroupID,RecordID,SF,SFUI,NormSF,LF,LFUI,NormLF,Source,SFEUI,LFEUI,Type,PrefSF,Score,Count,Frequency,UMLS.CUI
1301,,,ICCU,,,intensive cardiac care unit,,,Wikipedia,,,,,,,,
95,,,TRICC,,,Transfusion Requirements in Critical Care,,,Wikipedia,,,,,,,,
2286,,,SGB,,,stellate ganglion block,,,Wikipedia,,,,,,,,


In [56]:
out_list.append(wiki_out)

### Source #7 and #8: Stetson and Columbia

In [48]:
stet_db = pd.read_csv('https://raw.githubusercontent.com/lisavirginia/clinical-abbreviations/master/sources/6-stetson/sense_distribution_448.txt',
                      sep='\t',
                      header=None,
                      names=['SF', 'LF', 'Frequency'],
                      na_filter=False,
                      index_col=False)

In [49]:
clean(stet_db)
stet_db.sample(3, random_state=0)

Unnamed: 0,SF,LF,Frequency
733,med,medicine,0.386
122,d/c,discharge,0.884
113,na,normal axis,0.02002


In [50]:
stet_out = out_db.copy()
stet_out['SF'] = stet_db['SF']
stet_out['LF'] = stet_db['LF']
stet_out['Source'] = 'Stetson'
stet_out['Frequency'] = stet_db['Frequency']

In [51]:
stet_out.sample(3, random_state=0)

Unnamed: 0,GroupID,RecordID,SF,SFUI,NormSF,LF,LFUI,NormLF,Source,SFEUI,LFEUI,Type,PrefSF,Score,Count,Frequency,UMLS.CUI
733,,,med,,,medicine,,,Stetson,,,,,,,0.386,
122,,,d/c,,,discharge,,,Stetson,,,,,,,0.884,
113,,,na,,,normal axis,,,Stetson,,,,,,,0.02002,


In [52]:
out_list.append(stet_out)

### Merging all 8 data sources (inventories)

In [64]:

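# Look up each frame's variable name by object identity. The loop variable
# `item` is itself a global, so it can be reported instead of the original name.
for item in out_list:
  name = [x for x in globals() if globals()[x] is item][0]
  print(name, item.shape)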

umls_out (294484, 17)
adam_out (94657, 17)
berm_out (12087, 17)
vand_out (4504, 17)
item (2952, 17)
stet_out (765, 17)
wiki_out (2952, 17)


In [65]:
db = pd.concat(out_list)
db.shape

(412401, 17)

### Save the merged data as Step1Output.csv

In [63]:
# final file is exported as Step1Output.csv file

db.to_csv('Step1Output.csv',
          index=False,
          header=True,
          sep='|')

## Step 2: Add Data Fields

In [90]:
import os

#loading the Output dataset obtained from step 1 preprocessing
df = pd.read_csv('https://raw.githubusercontent.com/lisavirginia/clinical-abbreviations/master/code/Step1Output.csv',
                 sep='|',
                 header=0,
                 index_col=False,
                 na_filter=False,
                 dtype=object)

In [91]:
df.sample(3, random_state=0)

Unnamed: 0,GroupID,RecordID,SF,SFUI,NormSF,LF,LFUI,NormLF,Source,SFEUI,LFEUI,Type,PrefSF,Score,Count,Frequency,UMLS.CUI
368728,,,T-ALL,,,T cell acute lymphoblastic leukemias,,,ADAM,,,,T-ALL,0.7357,193,,
320521,,,Emax,,,maximal responses,,,ADAM,,,,Emax,0.2113,24,,
311671,,,CTG,,,connective tissue graft,,,ADAM,,,,CTG,0.7103,10,,


#### Add and Assign Record Identifier

In [92]:
# Assign sequential record numbers starting at 1 (vectorised, which avoids
# chained assignment inside a row-by-row loop)
df['RecordID'] = range(1, len(df) + 1)

#### Format Record Identifier

In [93]:
df['RecordID'] = 'R' + (df.RecordID.map('{:06}'.format))
df.head(3)

Unnamed: 0,GroupID,RecordID,SF,SFUI,NormSF,LF,LFUI,NormLF,Source,SFEUI,LFEUI,Type,PrefSF,Score,Count,Frequency,UMLS.CUI
0,,R000001,AA,,,achievement age,,,UMLS,E0000048,E0006859,acronym,,,,,
1,,R000002,AA,,,Alcoholics Anonymous,,,UMLS,E0000048,E0000204,acronym,,,,,
2,,R000003,AA,,,alcohol abuse,,,UMLS,E0000048,E0356324,acronym,,,,,


### Add Normalized Short Forms

The normalized short form is created by:
1. converting all text to lowercase;
2. stripping leading and trailing whitespace;
3. removing all periods;
4. converting all remaining punctuation to an underscore.
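
The normalization can be checked on a few short forms taken from the outputs in this notebook. The function body below repeats `normalized_short_form` from `master_functions.py`, which also removes periods before converting punctuation, so that the sketch runs on its own.

```python
import string


def normalized_short_form(sf):
    # Convert to lowercase
    sf = sf.lower()
    # Strip leading and trailing whitespace
    sf = sf.strip()
    # Remove all periods
    sf = sf.replace(".", "")
    # Convert all remaining punctuation to underscore
    sf = sf.translate(
        str.maketrans(string.punctuation, '_' * len(string.punctuation)))
    return sf


for sf in ['T-ALL', '$Can', '%LN', 'MK-801', 'G. agilis']:
    print(repr(sf), '->', repr(normalized_short_form(sf)))
```

Note that internal spaces are kept, so only case, periods, and punctuation are collapsed.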

In [94]:
df['NormSF'] = df['SF'].apply(normalized_short_form)

In [95]:
df.sample(3, random_state=0)

Unnamed: 0,GroupID,RecordID,SF,SFUI,NormSF,LF,LFUI,NormLF,Source,SFEUI,LFEUI,Type,PrefSF,Score,Count,Frequency,UMLS.CUI
368728,,R368729,T-ALL,,t_all,T cell acute lymphoblastic leukemias,,,ADAM,,,,T-ALL,0.7357,193,,
320521,,R320522,Emax,,emax,maximal responses,,,ADAM,,,,Emax,0.2113,24,,
311671,,R311672,CTG,,ctg,connective tissue graft,,,ADAM,,,,CTG,0.7103,10,,


### Add Normalized Long Forms

In [96]:
lvg_path = 'C:/Users/lvg2104/Documents/clinical-abbreviations/code/lvg2019/bin/lvg.bat'

In [97]:
df['ASCII'] = 'Y'
df.loc[df.LF.str.contains('[^\x00-\x7F]') == True, 'ASCII'] = 'N'

In [98]:
uniq_LFs = pd.Series(df.loc[df['ASCII']=='Y']['LF'].unique())

In [99]:
uniq_LFs.to_csv('uniq_LFs.temp',
                index=False,
                header=False,
                encoding='ascii')

### Add Short Form Unique Identifier

#### Sort by SF

In [100]:
df = df.sort_values(by=['SF'])
df = df.reset_index(drop=True)

#### Assign SFUI

In [101]:
assignment = 1
for index, row in df.iterrows():
    if index == 0:
        df['SFUI'].iat[index] = assignment
    elif df['SF'].at[index] == df['SF'].at[index-1]:
        df['SFUI'].iat[index] = assignment
    else:
        assignment += 1
        df['SFUI'].iat[index] = assignment
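
The row-by-row loop above gives consecutive numbers to runs of identical short forms in a sorted frame. As a hedged alternative, the same numbering can be sketched without a loop by marking where the sorted key changes and taking a cumulative sum. The helper name `assign_uids` and the toy data are assumptions for illustration, and the sketch does not cover the `setup.cfg` bookkeeping in `add_new_SFUI`.

```python
import pandas as pd


def assign_uids(df, key, uid_col, prefix):
    """Number each distinct value of `key` consecutively after sorting."""
    out = df.sort_values(by=[key]).reset_index(drop=True)
    # True wherever the sorted key differs from the previous row
    is_new = out[key].ne(out[key].shift())
    # Running count of changes gives 1, 1, 2, 2, 3, ...
    out[uid_col] = prefix + is_new.cumsum().map('{:06}'.format)
    return out


demo = pd.DataFrame({'SF': ['AA', '%LN', 'AA', '%LN', 'BM']})
demo = assign_uids(demo, 'SF', 'SFUI', 'S')
print(demo)
```

The same helper could be called with `'LF'`, `'LFUI'`, and `'L'` for long forms.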

#### Format SFUI

In [102]:
df['SFUI'] = 'S' + (df.SFUI.map('{:06}'.format))
df.head(5)

Unnamed: 0,GroupID,RecordID,SF,SFUI,NormSF,LF,LFUI,NormLF,Source,SFEUI,LFEUI,Type,PrefSF,Score,Count,Frequency,UMLS.CUI,ASCII
0,,R294485,$Can,S000001,_can,Canadian dollars,,,ADAM,,,,$Can,0.8365,18.0,,,Y
1,,R389142,%,S000002,_,percent,,,Berman,,,,,,,,,Y
2,,R126776,%LN,S000003,_ln,percent lumenal narrowing,,,UMLS,E0665149,E0665148,abbreviation,,,,,,Y
3,,R126777,%LN,S000003,_ln,percent luminal narrowing,,,UMLS,E0665149,E0665148,abbreviation,,,,,,Y
4,,R126778,%LN,S000003,_ln,per cent lumenal narrowing,,,UMLS,E0665149,E0665148,abbreviation,,,,,,Y


### Add Long Form Unique Identifier

#### Sort by LF

In [103]:
df = df.sort_values(by=['LF'])
df = df.reset_index(drop=True)

#### Assign LFUI

In [104]:
assignment = 1
for index, row in df.iterrows():
    if index == 0:
        df['LFUI'].iat[index] = assignment
    elif df['LF'].at[index] == df['LF'].at[index-1]:
        df['LFUI'].iat[index] = assignment
    else:
        assignment += 1
        df['LFUI'].iat[index] = assignment

#### Format LFUI

In [105]:
df['LFUI'] = 'L' + (df.LFUI.map('{:06}'.format))
df.head(5)

Unnamed: 0,GroupID,RecordID,SF,SFUI,NormSF,LF,LFUI,NormLF,Source,SFEUI,LFEUI,Type,PrefSF,Score,Count,Frequency,UMLS.CUI,ASCII
0,,R315532,DNIC,S019528,dnic,'diffuse noxious inhibitory controls',L000001,,ADAM,,,,DNIC,0.7571,95.0,,,Y
1,,R315587,DNR,S019572,dnr,'do not resuscitate',L000002,,ADAM,,,,DNR,0.5856,196.0,,,Y
2,,R135924,PHNO,S054985,phno,(+)-4-propyl-9-hydroxynaphthoxazine,L000003,,UMLS,E0672582,,acronym,,,,,,Y
3,,R354603,PHNO,S054985,phno,(+)-4-propyl-9-hydroxynaphthoxazine,L000003,,ADAM,,,,PHNO,0.5417,14.0,,,Y
4,,R342751,MK-801,S044572,mk_801,"(+)-5-methyl-10,11-dihydro-5H-dibenzo[a,d]cycl...",L000004,,ADAM,,,,MK-801,0.0146,15.0,,,Y


In [106]:
#discarding irrelevant columns

columns_to_drop = ['NormLF', 'LFEUI','Type','Frequency','UMLS.CUI']
df = df.drop(columns=columns_to_drop)
df.head(5)

Unnamed: 0,GroupID,RecordID,SF,SFUI,NormSF,LF,LFUI,Source,SFEUI,PrefSF,Score,Count,ASCII
0,,R315532,DNIC,S019528,dnic,'diffuse noxious inhibitory controls',L000001,ADAM,,DNIC,0.7571,95.0,Y
1,,R315587,DNR,S019572,dnr,'do not resuscitate',L000002,ADAM,,DNR,0.5856,196.0,Y
2,,R135924,PHNO,S054985,phno,(+)-4-propyl-9-hydroxynaphthoxazine,L000003,UMLS,E0672582,,,,Y
3,,R354603,PHNO,S054985,phno,(+)-4-propyl-9-hydroxynaphthoxazine,L000003,ADAM,,PHNO,0.5417,14.0,Y
4,,R342751,MK-801,S044572,mk_801,"(+)-5-methyl-10,11-dihydro-5H-dibenzo[a,d]cycl...",L000004,ADAM,,MK-801,0.0146,15.0,Y


### Export

In [107]:
df = df.sort_values(by=['RecordID'])
df = df.reset_index(drop=True)

In [108]:
df.to_csv('Step2Output.csv',
          index=False,
          header=True,
          sep='|')

## Step 3: Quality Control

In [112]:
!pip install spellchecker



In [132]:

import re
import pandas as pd
import numpy as np



In [120]:
# Suppress false positive warnings
import warnings
warnings.filterwarnings("ignore")

In [121]:
df = pd.read_csv('https://raw.githubusercontent.com/lisavirginia/clinical-abbreviations/master/code/Step2Output.csv',
                 sep='|',
                 header=0,
                 index_col=False,
                 na_filter=False,
                 dtype=object)

In [122]:
df.sample(3, random_state=0)

Unnamed: 0,GroupID,RecordID,SF,SFUI,NormSF,LF,LFUI,NormLF,Source,SFEUI,LFEUI,Type,PrefSF,Score,Count,Frequency,UMLS.CUI
368728,,R368729,T-ALL,S068102,t_all,T cell acute lymphoblastic leukemias,L032523,T cell acute lymphoblastic leukemia,ADAM,,,,T-ALL,0.7357,193,,
320521,,R320522,Emax,S023535,emax,maximal responses,L107849,,ADAM,,,,Emax,0.2113,24,,
311671,,R311672,CTG,S015579,ctg,connective tissue graft,L060880,connective tissue graft,ADAM,,,,CTG,0.7103,10,,


In [123]:
df.shape

(409668, 17)

### Identify Errors

#### Heuristic 1: Duplicates

Identify which records exactly duplicate another record from the same source.

In [124]:
Extract1 = df[df.duplicated(['SF', 'LF', 'Source']) == True]
Extract1.shape

(3933, 17)
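
To make the behaviour of `duplicated` with a column subset concrete, here is a small sketch on made-up rows. With the default `keep='first'`, the first occurrence of each (SF, LF, Source) combination is left unflagged and only later repeats are marked, so the same pair coming from a different source is not treated as a duplicate.

```python
import pandas as pd

# Illustrative rows (combinations are made up for the example)
demo = pd.DataFrame({
    'SF': ['AA', 'AA', 'AA', 'OCC'],
    'LF': ['alcohol abuse', 'alcohol abuse', 'alcohol abuse', 'occiput'],
    'Source': ['UMLS', 'UMLS', 'ADAM', 'UMLS'],
})

# Flag rows that repeat an earlier (SF, LF, Source) combination
is_dup = demo.duplicated(['SF', 'LF', 'Source'])
print(demo[is_dup])
```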

#### Heuristic 2: Punctuation

Identify excess punctuation in the long form (e.g. "nitric oxide;").

In [125]:
# Punctuation after LF (excludes .+%()[])
Extract2_1 = df[df['LF'].str.contains(r'.*[,\/#!\$\^&@\?<>\*:;{}=\-_\'~\"]$') == True]
Extract2_1.shape

(55, 17)

In [126]:
# Punctuation before LF (excludes .+%()[])
Extract2_2 = df[df['LF'].str.contains(r'^[,\/#!\$\^&@\?<>\*:;{}=\-_\'~\"].*') == True]
Extract2_2.shape

(7, 17)

In [127]:
# Excess periods before SF
Extract2_3 = df[df['SF'].str.contains(r'^[\.]+.*') == True]
Extract2_3.shape

(76, 17)
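
The three punctuation checks can be tried on individual strings with the standard `re` module. The sketch below uses the same character class as the cells above, written as a raw string. The helper name `punctuation_errors` and the short forms paired with each long form are assumptions for illustration.

```python
import re

# Same punctuation set as the notebook (excludes . + % ( ) [ ])
EDGE_PUNCT = r'[,/#!$^&@?<>*:;{}=\-_\'~"]'
AFTER_LF = re.compile(EDGE_PUNCT + r'$')
BEFORE_LF = re.compile(r'^' + EDGE_PUNCT)
BEFORE_SF = re.compile(r'^\.+')


def punctuation_errors(sf, lf):
    """Return the punctuation error labels that apply to one record."""
    errors = []
    if AFTER_LF.search(lf):
        errors.append('punctuation after LF')
    if BEFORE_LF.search(lf):
        errors.append('punctuation before LF')
    if BEFORE_SF.search(sf):
        errors.append('punctuation before SF')
    return errors


print(punctuation_errors('NO', 'nitric oxide;'))
print(punctuation_errors('DNR', "'do not resuscitate'"))
print(punctuation_errors('%', 'percent'))
```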

#### Heuristic 4: Content

Identify records in which alphabetic characters of the short form do not occur anywhere in the long form.

In [133]:
# Include problematic sources
subset = df[(df['Source'] == 'Vanderbilt Clinic Notes') | 
            (df['Source'] == 'Vanderbilt Discharge Sums')]

# Instantiate output
missing_character = []
missing_char_data = []

In [134]:
# Iterate over dataframe
for index, row in subset.iterrows():
    
    # Extract alphabetic characters
    alph_SF = set(re.sub('[^A-Za-z]+', '', row['SF']).lower())
    alph_LF = set(re.sub('[^A-Za-z]+', '', row['LF']).lower())
    
    if alph_SF.issubset(alph_LF) == False:
        if (alph_SF - alph_LF) != {'x'}:
            missing_character.append(row['RecordID'])
            missing_char_data.append(alph_SF - alph_LF)

In [135]:
# Extract LFs missing characters
Extract4 = df[df['RecordID'].isin(missing_character)]
Extract4.shape

(217, 17)
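
The content check reduces to a set difference on letters, which is easy to try on single records. The sketch below wraps the loop body in two hypothetical helpers, `missing_letters` and `is_flagged`. The `q12` and `cmt` pairs come from outputs in this notebook, and the `cxr` pair is a made-up example of the case the notebook ignores, where only `x` is missing.

```python
import re


def missing_letters(sf, lf):
    """Letters of the short form that never appear in the long form."""
    alph_sf = set(re.sub('[^A-Za-z]+', '', sf).lower())
    alph_lf = set(re.sub('[^A-Za-z]+', '', lf).lower())
    return alph_sf - alph_lf


def is_flagged(sf, lf):
    """Flag a record unless nothing is missing or only 'x' is missing."""
    missing = missing_letters(sf, lf)
    return bool(missing) and missing != {'x'}


for sf, lf in [('q12', 'every twelve'),
               ('cmt', 'charcot-marie-tooth'),
               ('cxr', 'chest radiograph')]:  # last pair is hypothetical
    print(sf, '|', lf, '|', missing_letters(sf, lf), '|', is_flagged(sf, lf))
```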

#### Heuristic 5: User-Identified

Identify records whose long form contains a user-entered error marker (e.g. "typo" or "not an abbreviation").

In [136]:
Extract5 = df[(df['LF'].str.contains("#000066") |
              df['LF'].str.contains("typo") |
              df['LF'].str.contains("not an abbreviation") | 
              df['LF'].str.contains("not an acronym"))]
Extract5.shape

(49, 17)

### Format

#### Add Columns

In [138]:
# Error type, decision, modification
# assign() returns new frames, which avoids SettingWithCopyWarning on the filtered slices
Extract1 = Extract1.assign(error="duplicate", action="retire")
Extract2_1 = Extract2_1.assign(error="punctuation after LF", action="modify")
Extract2_2 = Extract2_2.assign(error="punctuation before LF", action="modify")
Extract2_3 = Extract2_3.assign(error="punctuation before SF", action="modify")
Extract4 = Extract4.assign(error=missing_char_data, action="modify")
Extract5 = Extract5.assign(error="user identified", action="retire")

### Merge

In [140]:
errors = pd.concat([Extract1, Extract2_1, Extract2_2, Extract2_3, Extract4, Extract5])
errors.shape

(4337, 19)

In [141]:
errors = errors.drop_duplicates(subset="RecordID")
errors.shape

(4331, 19)

### Export

In [143]:
errors.to_csv('Step3Output.csv',
              index=False,
              header=True,
              sep='|')

### Import Errors

#### Import Annotated

In [144]:
errors = pd.read_csv('Step3Output.csv',
                     sep='|',
                     header=0,
                     index_col=False,
                     na_filter=False,
                     dtype=object)

In [145]:
errors.sample(3, random_state=0)

Unnamed: 0,GroupID,RecordID,SF,SFUI,NormSF,LF,LFUI,NormLF,Source,SFEUI,LFEUI,Type,PrefSF,Score,Count,Frequency,UMLS.CUI,error,action
4223,,R405161,q12,S097317,q12,every twelve,L074790,,Vanderbilt Discharge Sums,,,,,,6.0,1.0,c0585327,{'q'},modify
3431,,R253676,CCO,S012752,cco,cytochrom-c-oxidase,L063951,cytochrom c oxidase,UMLS,E0744336,E0020556,acronym,,,,,,duplicate,retire
2134,,R184147,OCC,S050915,occ,occiput,L120684,occiput,UMLS,E0698129,E0043489,abbreviation,,,,,,duplicate,retire


In [146]:
errors.shape

(4331, 19)

In [147]:
errors['action'].value_counts()

retire    3979
modify     352
Name: action, dtype: int64

### Remove None

In [148]:
errors = errors[(errors['action'] != 'none')]
errors.shape

(4331, 19)

### Subset Crosswalk

In [149]:
df_all = df # Keep unsubsetted version
df = df[~df['RecordID'].isin(errors['RecordID'])]
df.shape

(405337, 17)

### Subset Errors

In [150]:
retire = df_all[df_all['RecordID'].isin(errors[(errors['action'] == 'retire')]['RecordID'])]
retire.shape

(3979, 17)

In [151]:
modify = errors[(errors['action'] == 'modify')].iloc[:, 0:19]
modify.shape

(352, 19)

### Modify

#### Retire Duplicates

In [152]:
# Identify duplicates
dups = pd.concat([df, modify])
dups = dups[dups.duplicated(['SF', 'LF', 'Source']) == True]
dups.shape

(0, 19)

In [153]:
# Remove from modify
modify = modify[~modify['RecordID'].isin(dups['RecordID'])]
modify = modify.reset_index(drop=True)
modify.shape

(352, 19)

In [154]:
# Add to retire
retire = pd.concat([retire, df_all[df_all['RecordID'].isin(dups['RecordID'])]])
retire = retire.reset_index(drop=True)
retire.shape

(3979, 17)

### Strip Source Data

Once a record is modified, the identifiers and metadata inherited from its source may no longer be valid, so they are cleared here.

In [155]:
modify['SFUI'], modify['NormSF'], modify['NSFUI'], modify['PrefSF'] = ['', '', '', '']
modify['LFUI'], modify['NormLF'], modify['PrefLF'], modify['SFEUI'] = ['', '', '', '']
modify['LFEUI'], modify['Type'], modify['Score'], modify['Count'] = ['', '', '', '']
modify['Frequency'], modify['UMLS.CUI'] = ['', '']

In [156]:
modify.sample(3, random_state=0)

Unnamed: 0,GroupID,RecordID,SF,SFUI,NormSF,LF,LFUI,NormLF,Source,SFEUI,...,Type,PrefSF,Score,Count,Frequency,UMLS.CUI,error,action,NSFUI,PrefLF
6,,R296152,99mTc,,,technetium-99m-,,,ADAM,,...,,,,,,,punctuation after LF,modify,,
52,,R391364,gugals,,,"combined term for ""guys and gals""",,,Berman,,...,,,,,,,punctuation after LF,modify,,
259,,R404476,I,,,one (1),,,Vanderbilt Discharge Sums,,...,,,,,,,{'i'},modify,,


### Reassign Normalized Short Form

In [157]:
modify['NormSF'] = modify['SF'].apply(normalized_short_form)

In [158]:
modify.sample(3, random_state=0)

Unnamed: 0,GroupID,RecordID,SF,SFUI,NormSF,LF,LFUI,NormLF,Source,SFEUI,...,Type,PrefSF,Score,Count,Frequency,UMLS.CUI,error,action,NSFUI,PrefLF
6,,R296152,99mTc,,99mtc,technetium-99m-,,,ADAM,,...,,,,,,,punctuation after LF,modify,,
52,,R391364,gugals,,gugals,"combined term for ""guys and gals""",,,Berman,,...,,,,,,,punctuation after LF,modify,,
259,,R404476,I,,i,one (1),,,Vanderbilt Discharge Sums,,...,,,,,,,{'i'},modify,,


### Reassign LFUI

In [162]:
# Search existing LFUIs
# (modify was re-indexed 0..n-1 above, so index labels match row positions)
for index, row in modify.iterrows():
    temp = df_all[(df_all['LF'] == row['LF'])]
    if temp.empty:
        modify.at[index, 'LFUI'] = ''
    else:
        modify.at[index, 'LFUI'] = temp.iloc[0]['LFUI']
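The row-by-row search above scans the whole crosswalk once per modified record. As a rough alternative, the same "first matching LFUI, else empty string" rule can be expressed as a lookup table. This is only a sketch on toy data: the two small frames and the LFUI values in them are made up for illustration and are not taken from the crosswalk.

```python
import pandas as pd

# Toy stand-ins for df_all and modify (hypothetical values)
df_all = pd.DataFrame({
    "LF": ["nitric oxide", "nitric oxide", "connective tissue graft"],
    "LFUI": ["L000001", "L000002", "L000003"],
})
modify = pd.DataFrame({"LF": ["nitric oxide", "every twelve"], "LFUI": ["", ""]})

# Keep the first LFUI seen for each long form, indexed by the long form
first_lfui = df_all.drop_duplicates(subset="LF").set_index("LF")["LFUI"]

# Long forms with no match map to NaN, which is replaced by an empty string
modify["LFUI"] = modify["LF"].map(first_lfui).fillna("")
print(modify)
```

On the toy data, "nitric oxide" receives the first of its two LFUIs and "every twelve" stays empty, which matches what the loop would produce.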

### Add "Modified" Column

In [164]:
modify["Modified"] = "modified"
df["Modified"] = ""

### Append to Crosswalk

In [165]:
df = pd.concat([df, modify])
df = df.sort_values(by=['RecordID'])
df = df.reset_index(drop=True)
df.shape

(405689, 22)

### Export

#### Export Modify

In [166]:
# Get original rows
modify = df_all[df_all['RecordID'].isin(modify['RecordID'])]
modify.shape

(352, 17)

In [167]:
modify.to_csv('ModifiedRecords.csv',
              index=False,
              header=True,
              sep='|')

### Export Retire

In [168]:

retire.to_csv('RetiredRecords.csv',
              index=False,
              header=True,
              sep='|')


### Export Crosswalk

In [169]:
df.to_csv('Step3aOutput.csv',
          index=False,
          header=True,
          sep='|')

This is the final crosswalk file, exported as Step3aOutput.csv.

### <u>Development of BERT model using transformers</u>

In [171]:
!pip install torch

Collecting torch
  Downloading torch-2.1.2-cp39-cp39-win_amd64.whl (192.2 MB)
     ------------------------------------- 192.2/192.2 MB 14.5 MB/s eta 0:00:00
Installing collected packages: torch
Successfully installed torch-2.1.2


In [172]:
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset

MAX_SEQ_LENGTH = 20


    
def _tokenize_string(tokenizer, text):
    """Tokenizes a string using the provided tokenizer"""

    # Use the tokenizer's special-token ids directly: encode() adds special
    # tokens around its input by default, so encode(...)[0] is always <s>.
    cls_token = tokenizer.cls_token_id
    eos_token = tokenizer.eos_token_id

    words = text.split()
    if len(words) > MAX_SEQ_LENGTH:
        raise AssertionError("Passed text that contains too many tokens to tokenizer. "
                             "Max tokens: {}. Passed tokens: {}".format(MAX_SEQ_LENGTH, len(words)))

    # Encode word by word without special tokens, then flatten into one list
    flattened_tokenized_sequence = []
    for word in words:
        flattened_tokenized_sequence.extend(tokenizer.encode(word, add_special_tokens=False))

    encoded_text = [cls_token] + flattened_tokenized_sequence + [eos_token]
    return encoded_text
        
def _tokenize_train_data(df, tokenizer):
    """Tokenizes the LF1 and LF2 columns of the training dataframe"""

    df["tokenized_1"] = df["LF1"].apply(lambda x: _tokenize_string(tokenizer, x))
    df["tokenized_2"] = df["LF2"].apply(lambda x: _tokenize_string(tokenizer, x))

    return df

def _create_labels(df):
    """Transforms labels from Y/N to int"""
    
    df["label"] = df["Synonym"].apply(lambda x: int(x == "Y"))
    return df

def load_data(data_path, tokenizer):
    """Loads train data and tokenizes with Roberta"""

    df = pd.read_csv(data_path)

    expected_columns = ["LF1", "LF2", "Synonym"]
    if len(df.columns) != 3:
        raise AssertionError("Passed dataframe with incorrect number of columns. Expected 3. "
                             "Received columns: {}".format(df.columns))
    if sum(df.columns == expected_columns) != 3:
        raise AssertionError("Loaded dataframe does not match training data format. "
                             "Expected columns: {}, received columns: {}.".format(expected_columns, df.columns))
      
    tokenized_df = _tokenize_train_data(df, tokenizer)
    
    df_with_labels = _create_labels(tokenized_df)
    
    
    tokenized_array_1 = np.zeros((len(df), MAX_SEQ_LENGTH))
    tokenized_array_2 = np.zeros((len(df), MAX_SEQ_LENGTH))
    tokenized_label_array = np.zeros((len(df), 1))
    
    for inx, (tokens_1, tokens_2, labels) in enumerate(zip(df["tokenized_1"], df["tokenized_2"], df["label"])):
        tokenized_array_1[inx, -min(len(tokens_1), MAX_SEQ_LENGTH):] = tokens_1[:MAX_SEQ_LENGTH]
        tokenized_array_2[inx, -min(len(tokens_2), MAX_SEQ_LENGTH):] = tokens_2[:MAX_SEQ_LENGTH]
        tokenized_label_array[inx, 0] = labels
        
    return df_with_labels[["tokenized_1", "tokenized_2", "label"]], tokenized_array_1,\
        tokenized_array_2, tokenized_label_array
    

class MatchingDataset(Dataset):
    """Dataset of long-form pairs for synonym matching."""

    def __init__(self, data_path, feature_array, tokenizer):
        """
        Args:
            data_path (string): Path to the train csv
            tokenizer: Model-specific tokenizer (from huggingface)
        """
        self.tokenizer = tokenizer   
        self.train_df, self.tokenized_text_1, self.tokenized_text_2, self.labels = load_data(data_path, tokenizer)
        self.feature_array = feature_array

        if len(self.feature_array) != len(self.labels):
            raise AssertionError("Passed incorrect number of additional features. "
                                 "Received {}, expected {}.".format(len(self.feature_array), len(self.labels)))

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx_list):
        if torch.is_tensor(idx_list):
            idx_list = idx_list.tolist()

        sample_text_1 = torch.LongTensor(self.tokenized_text_1[idx_list])
        sample_text_2 = torch.LongTensor(self.tokenized_text_2[idx_list])
        additional_feats = torch.FloatTensor(self.feature_array[idx_list])
        sample_labels = torch.FloatTensor(self.labels[idx_list])
        sample = {'text_1': sample_text_1, 'text_2': sample_text_2, 'labels': sample_labels,
                  'additional_feats': additional_feats}

        return sample
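The loop in `load_data` writes each token list into the right-hand end of a zero-filled row, so short sequences are left-padded with 0 and long ones are truncated to `MAX_SEQ_LENGTH`. The sketch below reproduces that slicing rule with plain lists; `left_pad` is a hypothetical helper name, and the short `max_len` is used only to keep the output readable.

```python
MAX_SEQ_LENGTH = 20


def left_pad(token_ids, max_len=MAX_SEQ_LENGTH, pad_value=0):
    """Right-align token ids in a fixed-length row, truncating any overflow."""
    row = [pad_value] * max_len
    kept = token_ids[:max_len]
    if kept:  # row[-0:] would address the whole row, so skip empty input
        row[-len(kept):] = kept
    return row


print(left_pad([5, 6, 7], max_len=5))
print(left_pad([1, 2, 3, 4, 5, 6, 7], max_len=5))
```

One caveat worth keeping in mind: in the `roberta-base` vocabulary, id 0 is the `<s>` token and the padding token has id 1, and no attention mask is passed to the model here, so the padded positions are treated as real tokens.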

In [174]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.36.1-py3-none-any.whl (8.3 MB)

In [176]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import RobertaConfig, RobertaModel

class MatchHead(nn.Module):
    """Roberta Head for Matching."""
    def __init__(
        self,
        base_model_feature_size,
        additional_feature_size,
        num_classes,
        rnn_dimension,
        linear_1_dimension,
    ):
        """Model architecture definition for the matching head in torch."""
        super(MatchHead, self).__init__()
        self.GRU_1 = nn.GRU(base_model_feature_size, rnn_dimension, bidirectional=False)
        self.GRU_2 = nn.GRU(base_model_feature_size, rnn_dimension, bidirectional=False)
        self.linear_1 = nn.Linear(rnn_dimension * 2 + additional_feature_size, linear_1_dimension)
        self.linear_2 = nn.Linear(linear_1_dimension, num_classes)

    def forward(self, data_1, data_2, additional_feats):
        """Forward pass"""

        # batch second is faster
        features_1 = data_1.permute(1, 0, 2)
        features_2 = data_2.permute(1, 0, 2)

        gru_1_output, _ = self.GRU_1(features_1)
        gru_2_output, _ = self.GRU_2(features_2)

        # Undoing the above permutation now that we are through GRU
        gru_1_output_permute = gru_1_output.permute(1, 0, 2)
        gru_2_output_permute = gru_2_output.permute(1, 0, 2)

        # Take the last time step. Plain indexing keeps the batch dimension
        # even for a batch of one, which torch.squeeze would drop.
        final_gru_state_1 = gru_1_output_permute[:, -1, :]
        final_gru_state_2 = gru_2_output_permute[:, -1, :]

        linear_input = torch.cat((final_gru_state_1, final_gru_state_2, additional_feats), -1)
        linear_output = self.linear_1(linear_input)
        activated_linear_output = F.relu(linear_output)
        pre_sigmoid_output = self.linear_2(activated_linear_output)
        sigmoid_output = torch.sigmoid(pre_sigmoid_output)

        return sigmoid_output


class MatchArchitecture(nn.Module):
    "Transformer base model for matching."
    def __init__(
        self,
        base_model_path,
        base_model_name,
        is_custom_pretrained,
        base_model_feature_size,
        additional_feature_size,
        num_classes,
        rnn_dimension,
        linear_1_dimension,
    ):
        super(MatchArchitecture, self).__init__()

        if not is_custom_pretrained:
            self.base_model = RobertaModel.from_pretrained(base_model_name)
        else:
            self.base_model = RobertaModel.from_pretrained(base_model_path)

        for param in self.base_model.parameters():
            param.requires_grad = False

        self.match_head = MatchHead(
            base_model_feature_size, additional_feature_size, num_classes, rnn_dimension, linear_1_dimension
        )

    def forward(
            self,
            input_ids_1,
            input_ids_2,
            additional_feats,
            attention_mask=None,
            token_type_ids=None,
            position_ids=None,
            head_mask=None,
            labels=None,
    ):
        """Forward pass"""


        outputs_1 = self.base_model(
            input_ids_1,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
        )
        outputs_2 = self.base_model(
            input_ids_2,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
        )

        # Outputs[0] is seq output, outputs[1] is pooled if you want to do a seq level task
        sequence_output_1 = outputs_1[0]
        sequence_output_2 = outputs_2[0]

        match_classification = self.match_head(sequence_output_1, sequence_output_2, additional_feats)

        return match_classification


class FFMatchHead(nn.Module):
    """Feed-forward head for matching on the additional features only."""
    def __init__(
        self,
        additional_feature_size,
        num_classes,
        linear_1_dimension,
    ):
        """Model architecture definition for the feed-forward matching head in torch."""
        # super() must name this class; naming MatchHead raises a TypeError
        super(FFMatchHead, self).__init__()
        self.linear_1 = nn.Linear(additional_feature_size, linear_1_dimension)
        self.linear_2 = nn.Linear(linear_1_dimension, num_classes)

    def forward(self, additional_feats):
        """Forward pass"""

        linear_input = additional_feats
        linear_output = self.linear_1(linear_input)
        activated_linear_output = F.relu(linear_output)
        pre_sigmoid_output = self.linear_2(activated_linear_output)
        sigmoid_output = torch.sigmoid(pre_sigmoid_output)

        return sigmoid_output


class FFMatchArchitecture(nn.Module):
    "Feed-forward model for matching that uses only the additional features."
    def __init__(
        self,
        additional_feature_size,
        num_classes,
        linear_1_dimension,
    ):
        super(FFMatchArchitecture, self).__init__()
        self.match_head = FFMatchHead(
            additional_feature_size, num_classes, linear_1_dimension
        )

    def forward(self, additional_feats):
        """Forward pass"""

        match_classification = self.match_head(additional_feats)
        return match_classification

In [229]:

from math import ceil
import numpy as np
import pandas as pd
from transformers import RobertaConfig, RobertaTokenizer, RobertaModel

from sklearn.preprocessing import MinMaxScaler
import sklearn.metrics as mt
from sklearn.model_selection import KFold
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.utils.data import DataLoader

from model import MatchArchitecture
from data_utils import MatchingDataset

RANDOM_SEED = 117
SEQ_LEN = 10
RNN_DIM = 64
LINEAR_DIM = 64
CLASSES = 1
ROBERTA_FEAT_SIZE = 768
F1_POS_THRESHHOLD = .3
epsilon = 1e-8

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')


def lr_scheduler(epoch):
    """Step schedule for the learning rate."""
    if epoch < 5:
        return 1e-3
    if epoch < 8:
        return 1e-4
    return 1e-5


train_config = {
    "batch_size": 16,
    "base_lr": .0001,
    "lr_scheduler": lr_scheduler,
    "n_epochs": 6
}


def _run_training_loop(model, train_config, train_dataset, val_dataset, save_path):
    """Runs the training loop to train the matcher."""
    criterion = nn.BCELoss(reduction='none')
    opt = Adam(model.parameters(), lr=train_config["base_lr"])
    batch_size = train_config["batch_size"]

    epoch_learn_rates = []
    epoch_train_losses = []
    epoch_train_f1s = []
    epoch_validation_losses = []
    epoch_validation_f1s = []
    train_steps_per_epoch = int(len(train_dataset) / batch_size)
    validation_steps_per_epoch = int(len(val_dataset) / batch_size)

    for epoch in range(train_config["n_epochs"]):
        train_generator = iter(DataLoader(train_dataset, batch_size=batch_size, shuffle=True,
                                          num_workers=4))
        val_generator = iter(DataLoader(val_dataset, batch_size=batch_size, shuffle=False,
                                        num_workers=4))

        adjusted_lr = train_config["lr_scheduler"](epoch)
        for param_group in opt.param_groups:
            param_group["lr"] = adjusted_lr
        epoch_learn_rates.append(adjusted_lr)
        print("Epoch: {}. LR: {}.".format(epoch, adjusted_lr))

        model.train(True)
        running_train_loss = 0
        target_true = 0
        predicted_true = 0
        correct_true_preds = 0
        y_sum = 0
        for step in range(train_steps_per_epoch):
            sample = next(train_generator)
            X_batch_1 = sample['text_1'].cuda()
            X_batch_2 = sample['text_2'].cuda()
            y_batch = sample['labels'].cuda()
            additional_feats = sample['additional_feats'].cuda()

            # Calculate losses
            y_sum += torch.sum(y_batch).item() / batch_size
            model.zero_grad()
            sigmoid_output = model(X_batch_1, X_batch_2, additional_feats)
            loss = torch.mean(criterion(sigmoid_output, y_batch))

            # Calculate metrics
            y_batch = y_batch.cpu()
            running_train_loss += loss.item()
            threshold_output = (sigmoid_output > F1_POS_THRESHHOLD).cpu().type(torch.IntTensor)
            target_true += torch.sum(y_batch == 1).float().item()
            predicted_true += torch.sum(threshold_output).float().item()
            correct_true_preds += torch.sum(
                ((threshold_output == y_batch) * threshold_output) == 1).float().item()

            # Propagate
            loss.backward()
            opt.step()

            if step % 50 == 0:
                print("train step: ", step, "loss: ", running_train_loss / (step + 1))
                print("y_sum: ", y_sum / (step + 1))

        recall = correct_true_preds / (target_true + .1)
        precision = correct_true_preds / (predicted_true + .1)
        epoch_train_f1 = 2 * (precision * recall) / (precision + recall + epsilon)
        epoch_train_f1s.append(epoch_train_f1)
        epoch_train_loss = running_train_loss / train_steps_per_epoch
        epoch_train_losses.append(epoch_train_loss)

        model.train(False)
        running_validation_loss = 0
        val_target_true = 0
        val_predicted_true = 0
        val_correct_true_preds = 0
        with torch.no_grad():  # no gradients are needed for validation
            for step in range(validation_steps_per_epoch):
                sample = next(val_generator)
                X_batch_1 = sample['text_1'].cuda()
                X_batch_2 = sample['text_2'].cuda()
                y_batch = sample['labels'].cuda()
                additional_feats = sample['additional_feats'].cuda()

                sigmoid_output = model(X_batch_1, X_batch_2, additional_feats)
                loss = torch.mean(criterion(sigmoid_output, y_batch))

                # Calculate metrics
                y_batch = y_batch.cpu()
                running_validation_loss += loss.item()
                threshold_output = (sigmoid_output > F1_POS_THRESHHOLD).cpu().type(torch.IntTensor)
                val_target_true += torch.sum(y_batch == 1).float().item()
                val_predicted_true += torch.sum(threshold_output).float().item()
                val_correct_true_preds += torch.sum(
                    ((threshold_output == y_batch) * threshold_output) == 1).float().item()

        recall = val_correct_true_preds / (val_target_true + epsilon)
        precision = val_correct_true_preds / (val_predicted_true + epsilon)
        epoch_validation_f1 = 2 * (precision * recall) / (precision + recall + epsilon)
        epoch_validation_f1s.append(epoch_validation_f1)
        epoch_validation_loss = running_validation_loss / validation_steps_per_epoch
        epoch_validation_losses.append(epoch_validation_loss)
        print("Epoch {}, train loss of {}.".format(epoch, epoch_train_loss))
        print("Epoch {}, train f1 of {}.".format(epoch, epoch_train_f1))
        print("Epoch {}, validation loss of {}.".format(epoch, epoch_validation_loss))
        print("Epoch {}, validation f1 of {}.".format(epoch, epoch_validation_f1))

    # state_dict is a method; save its result to the checkpoint path
    torch.save(model.state_dict(), save_path)
    train_history = {
        "f1": epoch_train_f1s,
        "loss": epoch_train_losses,
        "val_f1": epoch_validation_f1s,
        "val_loss": epoch_validation_losses,
        "lr": epoch_learn_rates,
    }
    return model, train_history


# Load the data once, outside the fold loop
path = 'D:\\ARTIFICIAL_INTELLIGENCE\\SEM_1\\mlp\\clinical-abbreviations-master\\'
train_data_path = 'D:\\ARTIFICIAL_INTELLIGENCE\\SEM_1\\mlp\\clinical-abbreviations-master\\Train1.csv'
val_data_path = 'D:\\ARTIFICIAL_INTELLIGENCE\\SEM_1\\mlp\\clinical-abbreviations-master\\Train2.csv'
features_path = 'D:\\ARTIFICIAL_INTELLIGENCE\\SEM_1\\mlp\\clinical-abbreviations-master\\Train3.csv'

positives = pd.read_csv(path + 'Train1.csv', sep='|')
negatives = pd.read_csv(path + 'Train2.csv', sep='|')
train_strings = pd.concat((positives, negatives), axis=0)

additional_feats = pd.read_csv(features_path)
if "target" in additional_feats.columns:
    additional_feats.drop("target", axis=1, inplace=True)
ADDITIONAL_FEAT_SIZE = additional_feats.shape[1]

train_size = 8870
oof_preds = np.zeros((train_size, 1))
oof_labels = np.zeros((train_size, 1))
kf = KFold(n_splits=5, random_state=RANDOM_SEED, shuffle=True)

for fold, (train_inx, val_inx) in enumerate(kf.split(train_strings)):
    VERSION = '1.1_fold_{}'.format(fold)
    SAVE_DIR = '/ssd-1/clinical/clinical-abbreviations/checkpoints/{}.pt'.format(VERSION)

    X_train = train_strings.iloc[train_inx, :].reset_index(drop=True)
    X_test = train_strings.iloc[val_inx, :].reset_index(drop=True)
    X_feats = np.array(additional_feats.iloc[train_inx, :])
    X_feats_test = np.array(additional_feats.iloc[val_inx, :])

    # Fit the scaler on the training fold only, then apply it to the validation fold
    scaler = MinMaxScaler()
    X_feats = scaler.fit_transform(X_feats)
    X_feats_test = scaler.transform(X_feats_test)

    # NOTE: train_data_path and val_data_path are the same files that positives and
    # negatives were read from, so these writes replace those input files on disk.
    X_train.to_csv(train_data_path, index=False)
    X_test.to_csv(val_data_path, index=False)

    train_dataset = MatchingDataset(train_data_path, X_feats, tokenizer)
    val_dataset = MatchingDataset(val_data_path, X_feats_test, tokenizer)

    model = MatchArchitecture(
        None,
        'roberta-base',
        False,
        ROBERTA_FEAT_SIZE,
        ADDITIONAL_FEAT_SIZE,
        CLASSES,
        RNN_DIM,
        LINEAR_DIM,
    ).cuda()

    model, train_history = _run_training_loop(model, train_config, train_dataset, val_dataset, SAVE_DIR)

    # Out-of-fold predictions for this fold's validation rows
    model.train(False)
    val_loader = DataLoader(val_dataset, batch_size=train_config["batch_size"], shuffle=False,
                            num_workers=4)
    fold_preds = []
    fold_labels = []
    with torch.no_grad():
        for sample in val_loader:
            X_batch_1 = sample['text_1'].cuda()
            X_batch_2 = sample['text_2'].cuda()
            feats_batch = sample['additional_feats'].cuda()
            sigmoid_output = model(X_batch_1, X_batch_2, feats_batch)
            fold_preds.append(sigmoid_output.cpu().numpy())
            fold_labels.append(sample['labels'].numpy())
    preds = np.concatenate(fold_preds, axis=0)
    labels = np.concatenate(fold_labels, axis=0)

    print("Length of val_inx:", len(val_inx))
    print("Length of preds:", len(preds))
    oof_preds[val_inx, 0] = preds[:len(val_inx), 0]
    oof_labels[val_inx, 0] = labels[:len(val_inx), 0]
    del model

train_filename = 'D:\\ARTIFICIAL_INTELLIGENCE\\SEM_1\\mlp\\clinical-abbreviations-master\\Train1.csv'
train_dataframe = pd.read_csv(train_filename, na_filter=False).drop("Unnamed: 0", axis=1)
target = train_dataframe['target']
thresholds = [.3, .4, .5, .6, .7]
for threshold in thresholds:
    print('F1 at {}: '.format(threshold), mt.f1_score(target, oof_preds > threshold))
    print('Recall at {}: '.format(threshold), mt.recall_score(target, oof_preds > threshold))
    print('Precision at {}: '.format(threshold), mt.precision_score(target, oof_preds > threshold))


IndexError: positional indexers are out-of-bounds
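
The precision, recall and F1 bookkeeping used inside the training loop can be checked in isolation, without a GPU or the model. The sketch below repeats the same counting and the same epsilon-stabilised formulas on plain Python lists; `count_batch` and `f1_from_counts` are hypothetical helper names, and the probabilities are made-up values, not model output.

```python
def count_batch(probs, labels, threshold=0.3):
    """Count true positives, predicted positives and actual positives."""
    preds = [int(p > threshold) for p in probs]
    correct_true_preds = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    predicted_true = sum(preds)
    target_true = sum(1 for y in labels if y == 1)
    return correct_true_preds, predicted_true, target_true


def f1_from_counts(correct_true_preds, predicted_true, target_true, epsilon=1e-8):
    """F1 from running counts; epsilon guards against division by zero."""
    recall = correct_true_preds / (target_true + epsilon)
    precision = correct_true_preds / (predicted_true + epsilon)
    return 2 * (precision * recall) / (precision + recall + epsilon)


counts = count_batch([0.9, 0.2, 0.4, 0.1], [1, 1, 0, 0])
print(counts, f1_from_counts(*counts))
```

With the 0.3 threshold, the toy batch has one true positive out of two predicted and two actual positives, so precision and recall are both about 0.5 and so is F1.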