# NER using GPT-3.5

### Project name: Honos
Date: 24th May 2024

Author: Milindi Kodikara | Supervisor: Professor Karin Verspoor


Before running this notebook:
1. [Install Jupyter notebook](https://jupyter.org/install) 


2. [Setting up Azure OpenAI model](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/working-with-models?tabs=powershell#model-updates)


3. [Setting up connection to GPT-3.5 using Azure OpenAI service](https://learn.microsoft.com/en-us/azure/ai-services/openai/quickstart?tabs=command-line%2Cpython-new&pivots=programming-language-python)
   - In the Environment variables section, instead of following the steps in the link, add `API-KEY`, `API-VERSION`, `ENDPOINT` and `DEPLOYMENT-NAME` to a `.env` file in the root folder.
        
4. Set the correct file path for `data` in Step 1 and the file path of the gold annotated data passed to the `bratify()` function in Step 4.
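
Before running the cells, it can help to confirm that the `.env` file supplies every variable the notebook reads in Step 2. The following is a minimal sketch: `missing_env_vars` and `REQUIRED_VARS` are hypothetical names introduced here, and the example uses a plain dictionary with placeholder values in place of `os.environ`.

```python
import os

# Variable names as read from os.environ in Step 2 of this notebook.
REQUIRED_VARS = ["API-KEY", "API-VERSION", "ENDPOINT", "DEPLOYMENT-NAME"]


def missing_env_vars(env=None):
    """Return the required variable names that are absent or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# A plain dict stands in for os.environ here; the values are placeholders.
example_env = {"API-KEY": "<key>", "ENDPOINT": "https://example.invalid/"}
print(missing_env_vars(example_env))  # ['API-VERSION', 'DEPLOYMENT-NAME']
```

Calling `missing_env_vars()` with no argument after `load_dotenv()` checks the real environment.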


In [1]:
import pandas as pd
import re

import os
from openai import AzureOpenAI

from dotenv import load_dotenv
load_dotenv() 

True


### Step 1: Load and pre-process data and prompt library 


#### Step 1.1: Load datasets

In [2]:
# train_text.tsv
# pmid\tfilename\ttext

# TODO: Replace filepath for related data file
data = pd.read_csv("./genovardis_train_dev/train_text.tsv", sep='\t', header=0)

data.head(5)

Unnamed: 0,pmid,filename,text
0,12672033,pmid-12672033.txt,12672033|t|Análisis de mutaciones en DMBT1 en ...
1,12673366,pmid-12673366.txt,12673366|t|Análisis del polimorfismo G/C en la...
2,12701064,pmid-12701064.txt,12701064|t|Una nueva mutación compuesta hetero...
3,12716337,pmid-12716337.txt,12716337|t|Polimorfismo en la posición -174 de...
4,12719097,pmid-12719097.txt,12719097|t|Una nueva mutación en CACNA1F en un...


In [3]:
len(data)

427

In [4]:
# TODO: remove this after testing
data = data.head(2)

data

Unnamed: 0,pmid,filename,text
0,12672033,pmid-12672033.txt,12672033|t|Análisis de mutaciones en DMBT1 en ...
1,12673366,pmid-12673366.txt,12673366|t|Análisis del polimorfismo G/C en la...



#### Step 1.2: Load prompt library

Prompt id structure:
`p_<index>_<task>_<language>_<output>`

TODO: Figure out `<guideline>_<paradigm>`
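
As an illustration of the id structure, the sketch below splits a prompt id into its four parts. `parse_prompt_id` is a hypothetical helper, not part of the notebook, and the pattern assumes a numeric index and lowercase letters for the other fields; it would need extending once `<guideline>_<paradigm>` is settled.

```python
import re

# Assumed shape: p_<index>_<task>_<language>_<output>
PROMPT_ID_PATTERN = re.compile(
    r"^p_(?P<index>\d+)_(?P<task>[a-z]+)_(?P<language>[a-z]+)_(?P<output>[a-z]+)$"
)


def parse_prompt_id(prompt_id):
    """Split a prompt id into its named components, or raise ValueError."""
    match = PROMPT_ID_PATTERN.match(prompt_id)
    if match is None:
        raise ValueError(f"Unexpected prompt id: {prompt_id!r}")
    return match.groupdict()


print(parse_prompt_id("p_001_ner_en_tsv"))
# {'index': '001', 'task': 'ner', 'language': 'en', 'output': 'tsv'}
```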

In [5]:
# TODO: Ask for mutations, variants, SNPs etc.
prompt_library = pd.read_json('prompts.json')

prompt_library

Unnamed: 0,prompt_id,instruction,text
0,p_001_ner_en_tsv,Target: Find variant on DNA sequence entities ...,Text: {}
1,p_002_ner_es_tsv,Objetivo: encontrar variantes en entidades de ...,Text: {}


In [6]:
# TODO: remove this after testing
# prompt_library = prompt_library.head(1)
# 
# prompt_library


#### Step 1.3: Create data+prompt dataset

In [7]:
# TODO: Buff up the prompts with guidelines and examples (shots)
# pmid prompt_id embedded_prompt
def embed_data_in_prompts(row_data):
    prompts = []
    pmid = row_data['pmid']
    data_text = row_data['text']
    
    for index, row_prompt in prompt_library.iterrows():
        instruction = row_prompt['instruction']
        prompt_text = row_prompt['text'].format(data_text)
        # TODO: Figure out the new line characters 
        concatenated_prompt = '{}\n"{}"'.format(instruction, prompt_text)
        
        prompt = {'prompt_id': row_prompt['prompt_id'], 'prompt': concatenated_prompt}
        prompts.append(prompt)
    
    return {'pmid': pmid, 'prompts': prompts}


In [8]:

embedded_prompt_data_list = [embed_data_in_prompts(row_data) for index, row_data in data.iterrows()]

In [9]:
embedded_prompt_data_list[0]

{'pmid': 12672033,
 'prompts': [{'prompt_id': 'p_001_ner_en_tsv',
   'prompt': 'Target: Find variant on DNA sequence entities as \'DNAMutation\', RS number entities as \'SNP\', COSMIC mutation entities as \'SNP\', Allele on DNA sequence entities as \'DNAAllele\', wild type and mutations as \'NucleotideChange\', variant entities with insufficient information as \'OtherMutation\', gene entities as \'Gene\', disease entities as \'Disease\' and Transcript ID entities as \'Transcript\' in the provided spanish language text. Display results in the tsv format with the headers \'label\' to annotate the entity as one of \'DNAMutation\', \'SNP\', \'SNP\', \'DNAAllele\', \'NucleotideChange\', \'OtherMutation\', \'Gene\', \'Disease\', \'Transcript\' and \'span\' for the identified entity. Provide each label and span in a new line.\n"Text: 12672033|t|Análisis de mutaciones en DMBT1 en glioblastoma, meduloblastoma y tumores oligodendrogliales\n12672033|a|DMBT1 ha sido implicado como un posible gen s


### Step 2: Setting up GPT-3.5

In [10]:

client = AzureOpenAI(
    api_key=os.environ["API-KEY"],  
    api_version=os.environ["API-VERSION"],
    azure_endpoint=os.environ["ENDPOINT"]
    )
    
deployment_name=os.environ["DEPLOYMENT-NAME"]


In [11]:
# Testing the connection
test_response = client.chat.completions.create(model=deployment_name, messages=[{"role": "user", "content": "Hello, World!"}])
print(test_response.choices[0].message.content)

Hello! How can I assist you today?


In [12]:
# TODO: Run each prompt multiple times to compare what GPT generates (agreed with Karin; to do later)
results_list = []
def generate_results(prompt_items):
    
    pmid = prompt_items['pmid']
    
    for prompt_item in prompt_items['prompts']:
    
        prompt_id = prompt_item['prompt_id']
        prompt = prompt_item['prompt']
        
        # TODO: Look into hyper params like temp 
        response = client.chat.completions.create(model=deployment_name, messages=[{"role": "user", "content": prompt}])
        
        response_result = response.choices[0].message.content
        
        results_list.append({'pmid': pmid, 'prompt_id': prompt_id, 'result': response_result})
    
        # print(f'Prompt:\n{prompt}\n\nResponse:\n{response_result} \n----------\n')
        print(f'Prompt_id:\n{prompt_id}\n\npmid:\n{pmid}\n----------\n')
    
    return results_list
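
On the hyperparameter TODO above: `temperature` is one of the sampling parameters accepted by `client.chat.completions.create`. The sketch below only builds the request arguments, so it runs without a connection. `build_chat_request` is a hypothetical helper, and the default of `0.0` is an assumed starting point rather than a tuned value.

```python
def build_chat_request(deployment_name, prompt, temperature=0.0):
    """Collect the keyword arguments for client.chat.completions.create()."""
    return {
        "model": deployment_name,
        "messages": [{"role": "user", "content": prompt}],
        # Lower values make sampling less random; 0.0 is an assumption here.
        "temperature": temperature,
    }


# "my-deployment" is a placeholder deployment name.
request = build_chat_request("my-deployment", "Hello, World!")
print(request["temperature"])  # 0.0
# With a configured client: client.chat.completions.create(**request)
```

Recording the temperature alongside each result would make repeated runs easier to compare.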
    

In [13]:
for embedded_prompt_data in embedded_prompt_data_list:
    generate_results(embedded_prompt_data)

Prompt_id:
p_001_ner_en_tsv

pmid:
12672033
----------
Prompt_id:
p_002_ner_es_tsv

pmid:
12672033
----------
Prompt_id:
p_001_ner_en_tsv

pmid:
12673366
----------
Prompt_id:
p_002_ner_es_tsv

pmid:
12673366
----------


In [14]:
results_list

[{'pmid': 12672033,
  'prompt_id': 'p_001_ner_en_tsv',
  'result': 'label   span\nGene    DMBT1\nDisease glioblastoma\nDisease meduloblastoma\nDisease tumores oligodendrogliales\nDNAMutation    mutación\nDNAMutation    sustituciones de bases\nSNP RS number  12672033\nSNP RS number  10q\nSNP RS number  cáncer de cerebro\nSNP RS number  gastrointestinal\nSNP RS number  pulmón\nDNAAllele  polimorfismos genéticos\nDNAAllele  deleción homocigota\nDNAAllele  deleciones hemizigotas\nDNAAllele  una región entre los intrones 10 y 26 de DMBT1\nNucleotideChange  cambios de bases\nNucleotideChange  cambios de aminoácidos\nNucleotideChange  silenciosas\nOtherMutation  sin embargo\nTranscript ID  12672033'},
 {'pmid': 12672033,
  'prompt_id': 'p_002_ner_es_tsv',
  'result': 'label   span\nDNAMutation    174-181\nDNAMutation    253-260\nDNAMutation    266-273\nDNAMutation    282-289\nDNAMutation    351-358\nDNAMutation    401-408\nDNAMutation    537-544\nDNAMutation    691-698\nDNAMutation    767-774

In [15]:
len(results_list)

4

### Step 3: Post-processing

In [16]:
# create df from results list and data df
# columns = pmid, prompt_id, filename, label, offset_checked, offset1, offset2, span
extracted_entity_results = pd.DataFrame(columns=['pmid','prompt_id','filename','label', 'offset_checked', 'offset1','offset2','span'])

In [17]:
len(extracted_entity_results)

0

In [18]:
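# Raw string so that \s and \w are passed to the regex engine unchanged.
# NOTE: 'Transcript' is requested in the English prompt but is not matched by this pattern.
label_entity_pattern = r'^(?P<label>DNAMutation|SNP|DNAAllele|NucleotideChange|OtherMutation|Gene|Disease)\s+(?P<span>[\w\W]+)$'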

def extract_tuple(tuple_string):
    stripped_tuple_string = tuple_string.strip()
    matches = re.search(label_entity_pattern, stripped_tuple_string)
    
    if not matches:
        return
    
    label = matches.group("label").strip()
    span = matches.group("span").strip()
    
    return {'label': label, 'span': span}
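
To show how the pattern behaves on typical model output, here is a self-contained restatement of the parsing step applied to a few lines taken from the results in Step 2. `parse_line` is a hypothetical stand-in for `extract_tuple`, written so the example runs on its own.

```python
import re

LABEL_ENTITY_PATTERN = re.compile(
    r"^(?P<label>DNAMutation|SNP|DNAAllele|NucleotideChange|OtherMutation|Gene|Disease)"
    r"\s+(?P<span>[\w\W]+)$"
)


def parse_line(line):
    """Return {'label', 'span'} for a matching line, otherwise None."""
    match = LABEL_ENTITY_PATTERN.search(line.strip())
    if match is None:
        return None
    return {"label": match.group("label"), "span": match.group("span").strip()}


sample = "label   span\nGene    DMBT1\nSNP RS number  10q\nTranscript ID  12672033"
for line in sample.splitlines():
    print(repr(line), "->", parse_line(line))
```

The header line and the `Transcript ID` line return `None`, while `SNP RS number  10q` is parsed with the span `RS number  10q`, because the model echoed part of the label description. Such spans are not expected to occur in the source text, so they end up with `-1` offsets in the offset step.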

In [19]:
# extract each entity from the combined result string from gpt-3.5
# add each extracted tuple as a new row in extracted_entity_results df
def extract_ner_results(pmid, prompt_id, result_string):
    extracted_list = result_string.splitlines()
    # use a distinct loop variable so the function argument is not shadowed
    extracted_tuple_list = [extract_tuple(line) for line in extracted_list]
    
    for extracted_tuple in extracted_tuple_list:
        if extracted_tuple:
            row = {
                    "pmid": pmid,
                    "prompt_id": prompt_id,
                    "filename" : data.loc[data['pmid'] == pmid, 'filename'].iloc[0],
                    "label": extracted_tuple['label'],
                    "offset_checked": False,
                    "offset1": '',
                    "offset2": '',
                    "span": extracted_tuple['span']
                }
        
            extracted_entity_results.loc[len(extracted_entity_results)] = row
    

In [20]:
# extract the concatenated results strings into a new line for each tuple 
for result_dict in results_list:
    extract_ner_results(result_dict['pmid'], result_dict['prompt_id'], result_dict['result'])


In [21]:
extracted_entity_results

Unnamed: 0,pmid,prompt_id,filename,label,offset_checked,offset1,offset2,span
0,12672033,p_001_ner_en_tsv,pmid-12672033.txt,Gene,False,,,DMBT1
1,12672033,p_001_ner_en_tsv,pmid-12672033.txt,Disease,False,,,glioblastoma
2,12672033,p_001_ner_en_tsv,pmid-12672033.txt,Disease,False,,,meduloblastoma
3,12672033,p_001_ner_en_tsv,pmid-12672033.txt,Disease,False,,,tumores oligodendrogliales
4,12672033,p_001_ner_en_tsv,pmid-12672033.txt,DNAMutation,False,,,mutación
...,...,...,...,...,...,...,...,...
359,12673366,p_001_ner_en_tsv,pmid-12673366.txt,OtherMutation,False,,,G/C
360,12673366,p_001_ner_en_tsv,pmid-12673366.txt,Gene,False,,,RAD51
361,12673366,p_001_ner_en_tsv,pmid-12673366.txt,Disease,False,,,cáncer de mama
362,12673366,p_001_ner_en_tsv,pmid-12673366.txt,OtherMutation,False,,,G/C


In [22]:
len(extracted_entity_results)

364

In [23]:
# Find offsets 

# loop df, find each span, calculate the span length, find the indexes of each occurrence
for _, row in extracted_entity_results.iterrows():
    pmid = row['pmid']
    prompt_id = row['prompt_id']
    text = data.loc[data['pmid'] == pmid, 'text'].iloc[0]
    
    if not row['offset_checked'] and row['offset1'] == '':
        span = row['span']
        span_length = len(span)
        span_start_indexes = [m.start() for m in re.finditer(re.escape(span), text)]
        span_count = 0
        
        matching_spans = extracted_entity_results[(extracted_entity_results['pmid']==pmid) & (extracted_entity_results['prompt_id']==prompt_id) & (extracted_entity_results['span']==span) & (extracted_entity_results['offset1']=='') & (extracted_entity_results['offset_checked']==False)]
        
        for index, matched_span in matching_spans.iterrows(): 
            if span_start_indexes and span_count < len(span_start_indexes):
                extracted_entity_results.loc[index, 'offset1'] = str(span_start_indexes[span_count])
                extracted_entity_results.loc[index, 'offset2'] = str(span_start_indexes[span_count] + span_length)
                
                span_count = span_count + 1
            else: 
                # Add -1 to extra or missing ones 
                extracted_entity_results.loc[index, 'offset1'] = '-1'
                extracted_entity_results.loc[index, 'offset2'] = '-1'
                
            extracted_entity_results.loc[index, 'offset_checked'] = True
            
        # testing code
        # test_matching_spans = extracted_entity_results[(extracted_entity_results['pmid']==pmid) & (extracted_entity_results['prompt_id']==prompt_id) & (extracted_entity_results['span']==span)]
        # 
        # print(test_matching_spans)
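
The matching rule used in the loop above can be stated independently of the DataFrame: the n-th mention of a span receives the n-th occurrence of that span in the text, and mentions beyond the last occurrence receive `-1`. The sketch below restates that rule on plain Python lists. `assign_offsets` is a hypothetical function and the example text is made up.

```python
import re
from collections import defaultdict


def assign_offsets(text, spans):
    """Give the n-th mention of each span the n-th occurrence in `text`.

    Mentions with no remaining occurrence get (-1, -1).
    """
    seen = defaultdict(int)  # how many mentions of each span were placed so far
    offsets = []
    for span in spans:
        starts = [m.start() for m in re.finditer(re.escape(span), text)]
        n = seen[span]
        if n < len(starts):
            offsets.append((starts[n], starts[n] + len(span)))
        else:
            offsets.append((-1, -1))
        seen[span] += 1
    return offsets


text = "RAD51 G/C in cancer; RAD51 again"
print(assign_offsets(text, ["RAD51", "RAD51", "RAD51", "BRCA1"]))
# [(0, 5), (21, 26), (-1, -1), (-1, -1)]
```

The third `RAD51` has no occurrence left and `BRCA1` never appears, so both are marked `-1`.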

In [24]:
extracted_entity_results

Unnamed: 0,pmid,prompt_id,filename,label,offset_checked,offset1,offset2,span
0,12672033,p_001_ner_en_tsv,pmid-12672033.txt,Gene,True,37,42,DMBT1
1,12672033,p_001_ner_en_tsv,pmid-12672033.txt,Disease,True,46,58,glioblastoma
2,12672033,p_001_ner_en_tsv,pmid-12672033.txt,Disease,True,60,74,meduloblastoma
3,12672033,p_001_ner_en_tsv,pmid-12672033.txt,Disease,True,77,103,tumores oligodendrogliales
4,12672033,p_001_ner_en_tsv,pmid-12672033.txt,DNAMutation,True,371,379,mutación
...,...,...,...,...,...,...,...,...
359,12673366,p_001_ner_en_tsv,pmid-12673366.txt,OtherMutation,True,-1,-1,G/C
360,12673366,p_001_ner_en_tsv,pmid-12673366.txt,Gene,True,-1,-1,RAD51
361,12673366,p_001_ner_en_tsv,pmid-12673366.txt,Disease,True,-1,-1,cáncer de mama
362,12673366,p_001_ner_en_tsv,pmid-12673366.txt,OtherMutation,True,-1,-1,G/C


In [25]:
len(extracted_entity_results)

364

In [26]:
# remove hallucinations
# TODO: Find a better way for this 
extracted_entity_results = extracted_entity_results[(extracted_entity_results['offset1'] != '-1') & (extracted_entity_results['offset2'] != '-1')]

In [27]:
extracted_entity_results

Unnamed: 0,pmid,prompt_id,filename,label,offset_checked,offset1,offset2,span
0,12672033,p_001_ner_en_tsv,pmid-12672033.txt,Gene,True,37,42,DMBT1
1,12672033,p_001_ner_en_tsv,pmid-12672033.txt,Disease,True,46,58,glioblastoma
2,12672033,p_001_ner_en_tsv,pmid-12672033.txt,Disease,True,60,74,meduloblastoma
3,12672033,p_001_ner_en_tsv,pmid-12672033.txt,Disease,True,77,103,tumores oligodendrogliales
4,12672033,p_001_ner_en_tsv,pmid-12672033.txt,DNAMutation,True,371,379,mutación
5,12672033,p_001_ner_en_tsv,pmid-12672033.txt,DNAMutation,True,797,819,sustituciones de bases
11,12672033,p_001_ner_en_tsv,pmid-12672033.txt,DNAAllele,True,1105,1128,polimorfismos genéticos
12,12672033,p_001_ner_en_tsv,pmid-12672033.txt,DNAAllele,True,251,270,deleción homocigota
13,12672033,p_001_ner_en_tsv,pmid-12672033.txt,DNAAllele,True,1527,1549,deleciones hemizigotas
14,12672033,p_001_ner_en_tsv,pmid-12672033.txt,DNAAllele,True,1564,1610,una región entre los intrones 10 y 26 de DMBT1


In [28]:
len(extracted_entity_results)

36


### Step 4: Evaluation

*Skip this step for the evaluation dataset, as it has no gold standard annotations to compare against.*

In [29]:
# train_annotation.tsv
# pmid\tfilename\tmark\tlabel\toffset1\toffset2\tspan

# TODO: Keep track of the variations between the runs eg: hyperparams (fixed), prompt that worked best etc. to add the metrics for result 
# Read and find what other people have done 

# brat format for NER
# <unique_id>   <label>  <offset1> <offset2>   <span> 
def bratify(eval_filepath=None, results=None):
    if eval_filepath is not None:
        
        gold_standard_annotations = pd.read_csv(eval_filepath, sep='\t', header=0)
        # TODO: Get gold standard data in brat format for evaluation
        print(gold_standard_annotations.sample(5))
        # TODO: Save file in desired output file     
        
    if results is not None:
        # TODO: Get results in the brat format for evaluation
        # TODO: Remove extra whitespaces and new lines from the response for JSON format
        # TODO: Extract each new line as a row in the results 
        # formatted_response = re.sub('[^\S\t]', '', response.choices[0].message.content)
        results = results
        # TODO: Save file in desired output format
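
As a starting point for the TODOs above, the sketch below formats entities as brat text-bound annotation lines, following the layout in the cell comment (`<unique_id>`, then label and offsets, then span). `to_brat_lines` is a hypothetical helper, and the sequential `T<n>` numbering is an assumption. The example rows reuse values from the Step 3 output.

```python
def to_brat_lines(rows):
    """Format (label, offset1, offset2, span) tuples as brat text-bound lines."""
    lines = []
    for number, (label, start, end, span) in enumerate(rows, start=1):
        # Layout: ID, TAB, "label start end", TAB, covered text
        lines.append(f"T{number}\t{label} {start} {end}\t{span}")
    return lines


rows = [("Gene", 37, 42, "DMBT1"), ("Disease", 46, 58, "glioblastoma")]
print("\n".join(to_brat_lines(rows)))
```

One `.ann` file per document would be written by grouping the rows by `filename` before formatting.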
    

In [30]:
# TODO: Replace with the filepath of the annotation file to convert to brat format
bratify("./genovardis_train_dev/train_annotation.tsv")

          pmid           filename mark    label  offset1  offset2  \
5442  19521089  pmid-19521089.ann  T20      SNP      879      886   
2379  16181814  pmid-16181814.ann  T14  Disease      309      339   
124   12736721  pmid-12736721.ann   T6     Gene      464      473   
4292  18272172  pmid-18272172.ann   T4     Gene      150      171   
6363  20534142  pmid-20534142.ann  T20      SNP     1428     1434   

                                span  
5442                         rs25531  
2379  enfermedad cerebral y hepática  
124                        M6P/IGF2R  
4292           ether-a-go-go-related  
6363                          rs6235  


In [31]:
# https://github.com/READ-BioMed/brateval


### Step 5: Saving output

`.tsv` file containing the annotations in the following format: 

`pmid   filename   label   offset1   offset2   span`.



In [32]:
# Extracting results of a specific prompt
def save_output(prompt_id):
    extracted_entity_results_subset = extracted_entity_results[(extracted_entity_results['prompt_id']==prompt_id)]
    extracted_entity_results_subset = extracted_entity_results_subset.drop(['prompt_id', 'offset_checked'], axis=1)
    print(f'Original len: {len(extracted_entity_results)}, subset len: {len(extracted_entity_results_subset)}\n\n')
    # NOTE: sample(5) raises a ValueError when the subset has fewer than 5 rows (e.g. an empty subset)
    print('Sample:\n', extracted_entity_results_subset.sample(5))
    
    # get results for tsv in the format
    # `pmid   filename   label   offset1   offset2   span`.
    filename = f'genovardis_{prompt_id}.tsv'
    extracted_entity_results_subset.to_csv(filename, sep ='\t', index=False, header=True)
    
    print(f'\nSaved to {filename}\n------------\n')
    

In [33]:
for _, prompt in prompt_library.iterrows():
    save_output(prompt['prompt_id'])

Original len: 36, subset len: 36


Sample:
          pmid           filename             label offset1 offset2  \
13   12672033  pmid-12672033.txt         DNAAllele    1527    1549   
145  12673366  pmid-12673366.txt              Gene      78      83   
161  12673366  pmid-12673366.txt           Disease     794     808   
15   12672033  pmid-12672033.txt  NucleotideChange     960     976   
160  12673366  pmid-12673366.txt     OtherMutation     787     790   

                       span  
13   deleciones hemizigotas  
145                   RAD51  
161          cáncer de mama  
15         cambios de bases  
160                     G/C  

Saved to genovardis_p_001_ner_en_tsv.tsv
------------

Original len: 36, subset len: 0



ValueError: a must be greater than 0 unless no samples are taken
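
The `ValueError` above comes from calling `sample(5)` on an empty frame: no rows for `p_002_ner_es_tsv` remain after the filtering in Step 3 (subset len: 0). One possible guard is sketched below. `safe_sample` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def safe_sample(df, n=5):
    """Sample up to n rows; return an empty frame unchanged."""
    if df.empty:
        return df
    return df.sample(min(n, len(df)))


empty = pd.DataFrame({"span": []})
small = pd.DataFrame({"span": ["DMBT1", "RAD51", "G/C"]})
print(len(safe_sample(empty)), len(safe_sample(small)))  # 0 3
```

Using `safe_sample(extracted_entity_results_subset)` inside `save_output` would let the loop continue for prompts with no surviving rows, at the cost of writing a header-only `.tsv` file for them.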