# NER using GPT-3.5

### Project name: Honos
Date: 24th May 2024

Author: Milindi Kodikara | Supervisor: Professor Karin Verspoor


Before running this notebook:
1. [Install Jupyter notebook](https://jupyter.org/install) 


2. [Setting up Azure OpenAI model](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/working-with-models?tabs=powershell#model-updates)


3. [Setting up connection to GPT-3.5 using Azure OpenAI service](https://learn.microsoft.com/en-us/azure/ai-services/openai/quickstart?tabs=command-line%2Cpython-new&pivots=programming-language-python)
    - In the "Environment variables" section, instead of following the linked steps, add `API-KEY`, `API-VERSION`, `ENDPOINT` and `DEPLOYMENT-NAME` to a `.env` file in the root folder. These are the names read via `os.environ` in Step 2.
        
4. Set the correct file path for `data` in Step 1 and the file path of the gold annotated data passed to the `bratify()` function in Step 4.
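
As a quick sanity check before Step 2, the sketch below reports which of the expected variables are missing. It is only an illustration: `missing_env_vars` and `REQUIRED_ENV_VARS` are hypothetical names that are not part of this notebook, and the values in `example_env` are placeholders.

```python
import os

# Variable names as they are read in Step 2 of this notebook.
REQUIRED_ENV_VARS = ("API-KEY", "API-VERSION", "ENDPOINT", "DEPLOYMENT-NAME")

def missing_env_vars(env=None, required=REQUIRED_ENV_VARS):
    """Return the required variable names that are absent or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# A plain dict stands in for os.environ here; the values are placeholders.
example_env = {"API-KEY": "<your-key>", "ENDPOINT": "<your-endpoint>"}
print(missing_env_vars(example_env))
```

Calling `missing_env_vars()` with no arguments after `load_dotenv()` checks the real environment instead of the example dictionary.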


In [1]:
import pandas as pd
import re

import os
from openai import AzureOpenAI

from dotenv import load_dotenv
load_dotenv() 

True


### Step 1: Load and pre-process data and prompt library 


#### Step 1.1: Load datasets

In [2]:
# train_text.tsv
# pmid\tfilename\ttext

# TODO: Replace filepath for related data file
data = pd.read_csv("./genovardis_train_dev/train_text.tsv", sep='\t', header=0)

data.head(5)

Unnamed: 0,pmid,filename,text
0,12672033,pmid-12672033.txt,12672033|t|Análisis de mutaciones en DMBT1 en ...
1,12673366,pmid-12673366.txt,12673366|t|Análisis del polimorfismo G/C en la...
2,12701064,pmid-12701064.txt,12701064|t|Una nueva mutación compuesta hetero...
3,12716337,pmid-12716337.txt,12716337|t|Polimorfismo en la posición -174 de...
4,12719097,pmid-12719097.txt,12719097|t|Una nueva mutación en CACNA1F en un...


In [3]:
len(data)

427

In [4]:
# TODO: remove this after testing
# .copy() avoids a SettingWithCopyWarning when the 'text' column is overwritten below
data = data.head(2).copy()

data

Unnamed: 0,pmid,filename,text
0,12672033,pmid-12672033.txt,12672033|t|Análisis de mutaciones en DMBT1 en ...
1,12673366,pmid-12673366.txt,12673366|t|Análisis del polimorfismo G/C en la...


In [5]:
# Clean up the text by removing the pmid and title/abstract tags prepended to each section

# Raw string so that \d and \| are not treated as (invalid) string escapes
pattern = r'(?:\d{1,10}\|t\|)(?P<title>[\w\W]+)(?:\n\d{1,20}\|a\|)(?P<abstract>[\w\W]+)'

def clean_text(text):
    matches = re.search(pattern, text)
    if matches is None:
        # Leave the text unchanged if it does not follow the expected layout
        return text
    reformatted_text = f'{matches.group("title")}\n{matches.group("abstract")}'
    return reformatted_text

data['text'] = [clean_text(text) for text in data['text']]


data

Unnamed: 0,pmid,filename,text
0,12672033,pmid-12672033.txt,Análisis de mutaciones en DMBT1 en glioblastom...
1,12673366,pmid-12673366.txt,Análisis del polimorfismo G/C en la región no ...
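
To make the effect of the cleaning pattern concrete, here is a small self-contained sketch that applies the same regular expression to a synthetic record. The sample string is made up for illustration and is not taken from the dataset; `split_title_abstract` is a hypothetical helper name.

```python
import re

# Same pattern as in the cleaning cell, compiled from a raw string.
PATTERN = re.compile(
    r'(?:\d{1,10}\|t\|)(?P<title>[\w\W]+)(?:\n\d{1,20}\|a\|)(?P<abstract>[\w\W]+)'
)

def split_title_abstract(text):
    """Return (title, abstract), or None if the text lacks the |t| and |a| tags."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return match.group("title"), match.group("abstract")

# Synthetic record in the "<pmid>|t|<title>\n<pmid>|a|<abstract>" layout.
sample = "123|t|Example title\n123|a|Example abstract."
print(split_title_abstract(sample))
```

Because the `title` group is greedy, a record that happened to contain a second `\n<digits>|a|` marker would be split at the last one.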


In [6]:
len(data)

2


#### Step 1.2: Load prompt library

Prompt id structure:
`p_<index>_<task>_<language>_<output>`

TODO: Figure out `<guideline>_<paradigm>`
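
The id structure above can be split back into its parts when results are grouped by task, language or output format. The following is a minimal sketch under the assumption that ids contain exactly the five underscore-separated parts shown; `PromptId` and `parse_prompt_id` are hypothetical names, and the sketch would need to change once the `<guideline>_<paradigm>` parts are decided.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptId:
    index: str
    task: str
    language: str
    output: str

def parse_prompt_id(prompt_id):
    """Split an id of the form p_<index>_<task>_<language>_<output>."""
    parts = prompt_id.split("_")
    if len(parts) != 5 or parts[0] != "p":
        raise ValueError(f"Unexpected prompt id: {prompt_id!r}")
    return PromptId(*parts[1:])

print(parse_prompt_id("p_001_ner_en_json"))
```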

In [7]:
prompt_library = pd.read_json('prompts.json')

prompt_library

Unnamed: 0,prompt_id,instruction,text
0,p_001_ner_en_json,Target: Find the gene and disease entities in ...,Text: {}
1,p_002_ner_es_json,Objetivo: encontrar las entidades genéticas y ...,Text: {}



#### Step 1.3: Create data+prompt dataset

In [8]:
# TODO: Buff up the prompts with guidelines and examples (shots)
# pmid prompt_id embedded_prompt
def embed_data_in_prompts(row_data):
    prompts = []
    pmid = row_data['pmid']
    data_text = row_data['text']
    
    for index, row_prompt in prompt_library.iterrows():
        instruction = row_prompt['instruction']
        prompt_text = row_prompt['text'].format(data_text)
        # TODO: Figure out the new line characters 
        concatenated_prompt = '{}\n"{}"'.format(instruction, prompt_text)
        
        prompt = {'prompt_id': row_prompt['prompt_id'], 'prompt': concatenated_prompt}
        prompts.append(prompt)
    
    return {'pmid': pmid, 'prompts': prompts}


In [9]:

embedded_prompt_data_list = [embed_data_in_prompts(row_data) for index, row_data in data.iterrows()]

In [10]:
embedded_prompt_data_list[0]

{'pmid': 12672033,
 'prompts': [{'prompt_id': 'p_001_ner_en_json',
   'prompt': 'Target: Find the gene and disease entities in the provided spanish language text. Display the results in JSON tuples with \'label\' attribute for identified entity type as either gene or disease and \'span\' for identified entities.\n"Text: Análisis de mutaciones en DMBT1 en glioblastoma, meduloblastoma y tumores oligodendrogliales\nDMBT1 ha sido implicado como un posible gen supresor de tumores en el cromosoma 10q en cáncer de cerebro, gastrointestinal y pulmón. La deleción homocigota y la falta de expresión son dos mecanismos conocidos para la inactivación de DMBT1. Evaluamos si la mutación somática, que representa un mecanismo importante de inactivación para la mayoría de los genes supresores de tumores, ocurre en el gen DMBT1. Se analizaron un total de 102 tumores cerebrales primarios, que consistían en 25 glioblastomas multiforme, 24 meduloblastomas y 53 tumores oligodendrogliales, mediante electrofor


### Step 2: Setting up GPT-3.5

In [11]:

client = AzureOpenAI(
    api_key=os.environ["API-KEY"],  
    api_version=os.environ["API-VERSION"],
    azure_endpoint=os.environ["ENDPOINT"]
    )
    
deployment_name=os.environ["DEPLOYMENT-NAME"]


In [12]:
# Testing the connection
test_response = client.chat.completions.create(model=deployment_name, messages=[{"role": "user", "content": "Hello, World!"}])
print(test_response.choices[0].message.content)

Hello! How can I assist you today?


In [13]:
# TODO: Ask Karin whether we should run again and again to see what gpt generates
results_list = []
def generate_results(prompt_items):
    
    pmid = prompt_items['pmid']
    
    for prompt_item in prompt_items['prompts']:
    
        prompt_id = prompt_item['prompt_id']
        prompt = prompt_item['prompt']

        response = client.chat.completions.create(model=deployment_name, messages=[{"role": "user", "content": prompt}])
        
        results_list.append({'pmid': pmid, 'prompt_id': prompt_id, 'result': response.choices[0].message.content.replace('\r', '').replace('\n', '')})
    
        print(f'Prompt:\n{prompt}\n\nResponse:\n{response.choices[0].message.content} \n----------\n')
    
    return results_list
    

In [14]:
for embedded_prompt_data in embedded_prompt_data_list:
    generate_results(embedded_prompt_data)

Prompt:
Target: Find the gene and disease entities in the provided spanish language text. Display the results in JSON tuples with 'label' attribute for identified entity type as either gene or disease and 'span' for identified entities.
"Text: Análisis de mutaciones en DMBT1 en glioblastoma, meduloblastoma y tumores oligodendrogliales
DMBT1 ha sido implicado como un posible gen supresor de tumores en el cromosoma 10q en cáncer de cerebro, gastrointestinal y pulmón. La deleción homocigota y la falta de expresión son dos mecanismos conocidos para la inactivación de DMBT1. Evaluamos si la mutación somática, que representa un mecanismo importante de inactivación para la mayoría de los genes supresores de tumores, ocurre en el gen DMBT1. Se analizaron un total de 102 tumores cerebrales primarios, que consistían en 25 glioblastomas multiforme, 24 meduloblastomas y 53 tumores oligodendrogliales, mediante electroforesis en gel sensible a la conformación en los 54 exones codificantes de DMBT1. 

In [15]:
results_list

[{'pmid': 12672033,
  'prompt_id': 'p_001_ner_en_json',
  'result': '{  "entities": [    {      "label": "gene",      "span": "DMBT1"    },    {      "label": "disease",      "span": "glioblastoma"    },    {      "label": "disease",      "span": "meduloblastoma"    },    {      "label": "disease",      "span": "tumores oligodendrogliales"    },    {      "label": "disease",      "span": "cáncer de cerebro"    },    {      "label": "disease",      "span": "gastrointestinal"    },    {      "label": "disease",      "span": "pulmón"    }  ]}'},
 {'pmid': 12672033,
  'prompt_id': 'p_002_ner_es_json',
  'result': '[    {        "label": "gen",        "span": "DMBT1"    },    {        "label": "patología",        "span": "glioblastoma"    },    {        "label": "patología",        "span": "meduloblastoma"    },    {        "label": "patología",        "span": "tumores oligodendrogliales"    },    {        "label": "patología",        "span": "cáncer de cerebro"    },    {        "label": "
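
The two prompts return different label vocabularies (`gene`/`disease` for the English prompt, `gen`/`patología` for the Spanish one), while the gold annotations use `Gene` and `Disease`. One possible way to reconcile them is a lookup table, sketched below. The mapping only covers the labels visible in the output above and would need extending if the model produces other variants; `LABEL_MAP` and `normalize_label` are hypothetical names.

```python
import unicodedata

# Assumed mapping, covering only the labels seen in the responses above.
LABEL_MAP = {
    "gene": "Gene",
    "gen": "Gene",
    "disease": "Disease",
    "patología": "Disease",
}

def normalize_label(label):
    """Map a model-produced label to the gold label set, or None if unknown."""
    # NFC normalization makes accented characters compare consistently.
    key = unicodedata.normalize("NFC", label).strip().lower()
    return LABEL_MAP.get(key)

print(normalize_label("patología"), normalize_label("gen"), normalize_label("protein"))
```

Returning `None` for unknown labels keeps unexpected model output visible instead of silently assigning it to a class.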

### Step 3: Post-processing

In [16]:
# create df from results list and data df
# columns = pmid, prompt_id, filename, label, offset1, offset2, span
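
A possible starting point for this step is sketched below: it parses a response string into entity dictionaries and locates each span in the source text to obtain character offsets. It assumes the responses are valid JSON in one of the two shapes seen in Step 2 (an object with an `entities` list, or a bare list) and that exact string matching is an acceptable way to recover offsets. Whether these offsets line up with the gold annotations depends on how the gold offsets were computed, which is not verified here. `parse_entities` and `find_offsets` are hypothetical names.

```python
import json
import re

def parse_entities(result):
    """Return a list of {'label', 'span'} dicts parsed from a response string."""
    try:
        parsed = json.loads(result)
    except json.JSONDecodeError:
        return []  # malformed or truncated response
    if isinstance(parsed, dict):
        parsed = parsed.get("entities", [])
    if not isinstance(parsed, list):
        return []
    return [e for e in parsed if isinstance(e, dict) and "label" in e and "span" in e]

def find_offsets(text, span):
    """Return (start, end) character offsets of every exact occurrence of span."""
    if not span:
        return []
    return [(m.start(), m.end()) for m in re.finditer(re.escape(span), text)]

text = "Análisis de mutaciones en DMBT1 en glioblastoma"
response = '{"entities": [{"label": "gene", "span": "DMBT1"}]}'
for entity in parse_entities(response):
    print(entity["label"], entity["span"], find_offsets(text, entity["span"]))
```

Each `(label, span, offsets)` triple could then be combined with `pmid`, `prompt_id` and `filename` to build the rows of the DataFrame described in the comment above.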


### Step 4: Evaluation

*Skip this step for the evaluation dataset, as there is no gold standard data to compare against.*

In [17]:
# train_annotations.tsv
# pmid\tfilename\tmark\tlabel\toffset1\toffset2\tspan

# brat format for NER
# <unique_id>   <label>  <offset1> <offset2>   <span> 
def bratify(eval_filepath=None, results=None):
    if eval_filepath is not None:
        
        gold_standard_annotations = pd.read_csv(eval_filepath, sep='\t', header=0)
        # TODO: Get gold standard data in brat format for evaluation
        print(gold_standard_annotations.sample(5))
        # TODO: Save file in desired output file     
        
    if results is not None:
        # TODO: Get results in the brat format for evaluation
        # TODO: Save file in desired output format
        pass
    

In [18]:
# TODO: Replace with the filepath of the annotation file to convert to brat format
bratify("./genovardis_train_dev/train_annotation.tsv")

          pmid           filename mark    label  offset1  offset2  \
1039  15122711  pmid-15122711.ann  T17  Disease      511      529   
665   14630830  pmid-14630830.ann  T16     Gene     1061     1064   
6840  21070631  pmid-21070631.ann  T14  Disease      737      739   
620   14623461  pmid-14623461.ann   T1     Gene       40       44   
3711  17437275  pmid-17437275.ann  T10     Gene      688      692   

                    span  
1039  síndrome de Alpers  
665                  PML  
6840                  EA  
620                 EXO1  
3711                ZEB1  
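
The conversion that `bratify()` still needs could look roughly like the sketch below, which turns annotation rows into brat standoff lines (`<mark>`, a tab, `<label> <offset1> <offset2>`, a tab, `<span>`) grouped by annotation file. The two example rows are copied from the sample output above; `to_brat_line` and `group_brat_lines` are hypothetical names, and writing the grouped lines to `.ann` files is left out.

```python
from collections import defaultdict

def to_brat_line(mark, label, offset1, offset2, span):
    """Format one text-bound annotation as a brat standoff line."""
    return f"{mark}\t{label} {offset1} {offset2}\t{span}"

def group_brat_lines(rows):
    """Group brat lines by annotation filename; `rows` is an iterable of dicts."""
    lines_by_file = defaultdict(list)
    for row in rows:
        lines_by_file[row["filename"]].append(
            to_brat_line(row["mark"], row["label"], row["offset1"], row["offset2"], row["span"])
        )
    return dict(lines_by_file)

rows = [
    {"filename": "pmid-14623461.ann", "mark": "T1", "label": "Gene",
     "offset1": 40, "offset2": 44, "span": "EXO1"},
    {"filename": "pmid-14630830.ann", "mark": "T16", "label": "Gene",
     "offset1": 1061, "offset2": 1064, "span": "PML"},
]
print(group_brat_lines(rows))
```

With a DataFrame such as `gold_standard_annotations`, `df.to_dict("records")` yields rows in the shape this sketch expects.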


In [19]:
# https://github.com/READ-BioMed/brateval
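
Until the brateval tool linked above is wired in, a rough exact-match score can be computed directly in Python. The sketch below is not brateval and does not reproduce its matching options; it simply treats an annotation as correct when its `(pmid, label, offset1, offset2)` tuple appears in the gold set. The second predicted tuple is invented for illustration, and `precision_recall_f1` is a hypothetical name.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 over two collections of tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = {(14623461, "Gene", 40, 44), (14630830, "Gene", 1061, 1064)}
predicted = {(14623461, "Gene", 40, 44), (14623461, "Gene", 50, 54)}  # second tuple is made up
print(precision_recall_f1(gold, predicted))
```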


### Step 5: Saving output

`.tsv` file containing the annotations in the following format: 

`pmid   filename   label   offset1   offset2   span`.



In [20]:
# get results for tsv in the format
# `pmid   filename   label   offset1   offset2   span`.
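
A minimal sketch of writing this file is shown below. It uses the standard-library `csv` module to stay dependency-free; with a DataFrame, `to_csv(path, sep='\t', index=False)` would be the pandas counterpart. The example row is taken from the gold sample printed in Step 4, and `write_annotations_tsv` is a hypothetical name.

```python
import csv
import io

COLUMNS = ["pmid", "filename", "label", "offset1", "offset2", "span"]

def write_annotations_tsv(rows, handle):
    """Write annotation dicts to `handle` as tab-separated values with a header row."""
    writer = csv.DictWriter(
        handle,
        fieldnames=COLUMNS,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # drop extra keys such as 'mark' or 'prompt_id'
    )
    writer.writeheader()
    writer.writerows(rows)

rows = [{"pmid": 14623461, "filename": "pmid-14623461.ann", "label": "Gene",
         "offset1": 40, "offset2": 44, "span": "EXO1"}]

# An in-memory buffer stands in for an open file in this example.
buffer = io.StringIO()
write_annotations_tsv(rows, buffer)
print(buffer.getvalue())
```

To write to disk, open the target file with `open(path, "w", newline="", encoding="utf-8")` and pass the handle instead of the buffer.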
