Precision: Calculating how many of the extracted relationships are valid
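
As a quick illustration of the metric: precision is the number of extracted relationships judged valid divided by the total number extracted. The sketch below uses hypothetical counts and a hypothetical helper named `precision`; neither comes from the notebook itself.

```python
def precision(num_valid, num_extracted):
    """Fraction of extracted relationships that are valid."""
    if num_extracted == 0:
        return 0.0  # assumption: define precision as 0 when nothing was extracted
    return num_valid / num_extracted

# Hypothetical counts, for illustration only
example = precision(3, 4)
print(f"Precision: {100 * example:.1f}%")
```

In the cells that follow, "valid" means that a predicted triplet could be matched against a sentence from the PDFs or against a hand annotation.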

In [None]:
!pip install deepeval

In [None]:
import nltk
from nltk.stem import PorterStemmer
nltk.download("punkt")

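# Initialize the NLTK Porter stemmer
ps = PorterStemmer()

def lemmatize(sent):
    # Despite the name, this applies Porter stemming (not lemmatization)
    # to each whitespace-separated token.
    return [ps.stem(word) for word in sent.split()]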

In [57]:
import os
import fitz  # PyMuPDF
from sentence_transformers import SentenceTransformer, util

# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Reading the PDF and hand-annotation files
def read_pdf(pdf_file):
    sentences = []
    with fitz.open(pdf_file) as doc:
        for page in doc:
            # Lowercase the page text and split it naively on ". "
            text = page.get_text("text").lower()
            sentences.extend(text.split(". "))
    return sentences


def read_files(root_dir, hand):
    lines = []
    for filename in os.listdir(root_dir):
        # Skip anything that is not a PDF
        if not filename.endswith('.pdf'):
            continue
        lines.extend(read_pdf(os.path.join(root_dir, filename)))

    # Append the hand annotations as "subj rel obj" strings
    for _, row in hand.iterrows():
        lines.append(f"{row['subj']} {row['rel']} {row['obj']}")

    return lines

#computing cosine similarity
def vec(sentences):
    # Encode sentences
    embeddings = model.encode([sentences[0], sentences[1]])
    
    # Compute cosine similarity
    similarity = util.cos_sim(embeddings[0], embeddings[1])
    return similarity.item() # Value close to 1 indicates high similarity
    
# Check whether the target string (relation triplet) appears in src (PDF sentences + hand annotations)
def find(target, src):
    # Stem the target once rather than on every iteration
    pred = " ".join(lemmatize(target))

    for sentence in src:
        test = " ".join(lemmatize(sentence))
        cos = vec([pred, test])
        # Accept a stemmed substring match or a high cosine similarity
        if pred in test or cos > 0.7:
            # The two branches below are only reachable via the substring match
            if 0.65 < cos < 0.7:
                print(f"Got a match for {pred}: {sentence}")
            elif cos <= 0.65:
                print(f"Closest match to {pred} was {test}")
            return True, sentence

    return False, ""

sentences = read_files("Docs",hand)
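
To make the similarity score used by `vec` and `find` more concrete, here is a minimal pure-Python sketch of cosine similarity on plain lists. The helper name `cosine_similarity` and the toy vectors are assumptions for illustration; the notebook itself relies on `util.cos_sim` applied to sentence embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for sentence embeddings
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, which is why `find` treats a score above 0.7 as a match.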

In [62]:
print(f"Example of a sentence from the pdf: {sentences[4]}\n\n")
print(f"Example from the hand annot. {sentences[-1]}")

Example of a sentence from the pdf: pharmacotherapy, especially selective 
serotonin reuptake inhibitors antidepressants, remains the most frequent option 
for treating depression during the acute phase, while other promising pharmaco-
logical options are still competing for the attention of practitioners


Example from the hand annot. Duloxetine High efficiency for  Neuropathic pain, fibromyalgia, anxiety
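
The PDF sentence printed above still contains line breaks and a word hyphenated across lines ("pharmaco-" / "logical"), which can hurt substring matching. One possible clean-up step, shown here only as a sketch, is to rejoin such words and collapse whitespace before matching. The helper name `normalize_pdf_text` is hypothetical and is not used elsewhere in this notebook.

```python
import re

def normalize_pdf_text(text):
    """Undo line-break hyphenation and collapse whitespace."""
    # Join words split across lines, e.g. "pharmaco-\nlogical" -> "pharmacological".
    # Caveat: this also merges genuine hyphenated compounds that happen to
    # break at the hyphen.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Replace remaining newlines and runs of whitespace with a single space
    return re.sub(r"\s+", " ", text).strip()

sample = "other promising pharmaco-\nlogical options are still competing \nfor the attention"
cleaned = normalize_pdf_text(sample)
print(cleaned)
```

If adopted, such a step would most naturally run on the page text in `read_pdf`, before the split on ". ".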


In [55]:
# Find the hand annotation closest to the target
def find_in_hand(hand, target):
    max_ = -1.0
    best = None
    best_subj = ""
    # Scan every annotation and keep the one most similar to the target
    for _, row in hand.iterrows():
        hand_rel = f"{row['subj']} {row['rel']} {row['obj']}"
        cos = vec([hand_rel, target])
        if cos > max_:
            max_ = cos
            best = hand_rel
            best_subj = row['subj']

    # Accept the best candidate only if its subject occurs in the target
    # and the similarity reaches the 0.7 threshold
    if best is None or best_subj.lower() not in target.lower() or max_ < 0.7:
        print(f"Closest match for {target} is {best} with {max_}")
        return None
    return best
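
The pattern behind `find_in_hand` (score every candidate, keep the best, accept it only above a threshold) can be tried without loading the embedding model. The sketch below substitutes `difflib.SequenceMatcher` for the embedding cosine similarity, so its scores are not comparable with those of `vec`. The names `string_similarity` and `best_match`, and the annotation strings, are hypothetical.

```python
from difflib import SequenceMatcher

def string_similarity(a, b):
    # Stand-in for the embedding cosine similarity; returns a ratio in [0, 1]
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(target, candidates, similarity=string_similarity, threshold=0.7):
    """Return (candidate, score) for the most similar candidate,
    or (None, score) if the best score is below the threshold."""
    best, best_score = None, -1.0
    for candidate in candidates:
        score = similarity(candidate, target)
        if score > best_score:
            best, best_score = candidate, score
    if best_score < threshold:
        return None, best_score
    return best, best_score

# Hypothetical annotation strings, for illustration only
annotations = ["drug a treats condition x", "drug b causes side effect y"]
match, score = best_match("Drug A treats condition X", annotations)
```

Passing the similarity function as a parameter keeps the selection logic independent of the scoring method, so the same loop could be reused with `vec`.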
            
    
            

In [58]:
from deepeval.metrics import ContextualRecallMetric
from deepeval.test_case import LLMTestCase
import pandas as pd
hand = pd.read_csv("Annotations.csv")
pred_files = ["NewRels_Skip3.csv", "NewRels_Skip2.csv"]

for pred_file in pred_files:
    preds = pd.read_csv(pred_file)
    score = 0  # number of predicted triplets found in the sources

    for _, p in preds.iterrows():
        # Build the "subj rel obj" string for the predicted triplet
        out = f"{p['subj']} {p['rel']} {p['obj']}"

        found, match = find(out, sentences)
        if found:
            score += 1
        else:
            print("Couldn't find a match for  ", out)
    print(f"Precision for {pred_file} is {score/len(preds)}")

Couldn't find a match for   selective serotonin reuptake inhibitors (ssris) side effects sexual and digestive issues, irritability, anxiety, insomnia, and headache
Couldn't find a match for   all antidepressants side effects nausea, vomiting, sexual dysfunction, sedation, priapism, cardiotoxicity
Couldn't find a match for   cognitive-behavioral therapy (cbt) first-line treatment mild to moderate major depressive disorder (mdd)
Couldn't find a match for   monoamine oxidase inhibitors (maois) less commonly used major depressive disorder (mdd)
Couldn't find a match for   interpersonal therapy (ipt) first-line treatment mild to moderate major depressive disorder (mdd)
Couldn't find a match for   cognitive behavioral therapy (cbt) first-line treatment depression
Couldn't find a match for   selective serotonin reuptake inhibitors (ssris) side effects sexual and digestive (nausea and loss of appetite), as well as irritability, anxiety, insomnia, and headache
Couldn't find a match for   select

In [51]:
print(f"Precision: {100*score/len(preds)}%")

Precision: 85.84905660377359%
