## Evaluate the results of ChatGPT

Currently, the J-sim (Jaccard similarity), recall (R), and precision (P) between the ground truth and the GPT predictions are low, because GPT tends to select terms that are either more specific or broader than the ground truth terms.

We need systematic evaluation metrics so the results can be compared without repeatedly running the comparison by hand.

The approach:
1. Calculate J-similarity, precision, and recall, then remove the terms in common
2. Check if a prediction term is within the same tree as a ground truth term
3. If two terms are within the same tree:
  * Treat the two terms as a match
  * Determine which term is closer to the root
    * Calculate a weight to apply depending on how close the closest term is to the root (Weight 1)
      * If the closest term is one step away from the root, it should score lower than a closest term that is two steps away. This lowers the weight of excessively generic terms: use 1-(1/(# of steps from the root to the closest term))
    * Calculate a weight to apply depending on the number of steps between the two 'matching' terms (Weight 2)
      * If the closest term to root is the ground truth term, use (1/(# of steps between the terms))
      * If the closest term to root is the prediction term, use -(1/(# of steps between the terms)). The negative sign serves only to record the direction, so it can be inspected later
  * The overall absolute weight should be a combination of the two, for example:
    * Overall absolute weight = Weight 1 + ABS(Weight 2)
  * Weighted similarity: Sum the overall absolute weights of all matches, then add the J-sim (since exact matches were removed earlier)
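
The steps above can be sketched roughly as follows. This is a minimal sketch, not a settled implementation. It makes several assumptions that the notes do not state: the ontology is available as a hypothetical child-to-parent dictionary (`parent`), two terms are "within the same tree" only when one is an ancestor of the other, the number of steps is the difference in depth, and a closest term that is the root itself gets a Weight 1 of 0. The function names and the toy terms are illustrative only.

```python
def path_to_root(term, parent):
    """Return [term, its parent, ..., root] using a child -> parent mapping."""
    path = [term]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def match_weights(gold_term, pred_term, parent):
    """Return (weight_1, weight_2), or None if neither term is an ancestor of the other."""
    gold_path = path_to_root(gold_term, parent)
    pred_path = path_to_root(pred_term, parent)
    gold_depth, pred_depth = len(gold_path) - 1, len(pred_path) - 1
    if pred_term in gold_path[1:]:
        # Prediction is closer to the root: negative sign records the direction
        closest_depth = pred_depth
        w2 = -1 / (gold_depth - pred_depth)
    elif gold_term in pred_path[1:]:
        # Ground truth is closer to the root: positive sign
        closest_depth = gold_depth
        w2 = 1 / (pred_depth - gold_depth)
    else:
        return None
    # Assumption: a closest term that is the root itself (0 steps) gets weight 0
    w1 = 1 - 1 / closest_depth if closest_depth > 0 else 0.0
    return w1, w2


def weighted_similarity(gold, pred, parent):
    """Return (weighted similarity, list of Weight 2 values for the matches)."""
    gold, pred = set(gold), set(pred)
    common, union = gold & pred, gold | pred
    total = len(common) / len(union) if union else 0.0  # J-sim on exact matches
    w2_values = []
    for g in sorted(gold - common):
        for p in sorted(pred - common):
            weights = match_weights(g, p, parent)
            if weights is not None:
                w1, w2 = weights
                total += w1 + abs(w2)  # overall absolute weight
                w2_values.append(w2)
    return total, w2_values


# Toy ontology (hypothetical): root > disease > infection > viral infection > influenza
parent = {
    "disease": "root", "infection": "disease",
    "viral infection": "infection", "influenza": "viral infection",
    "method": "root", "sequencing": "method",
}
total, w2_values = weighted_similarity(
    {"influenza", "sequencing"}, {"infection", "sequencing"}, parent
)
print(round(total, 3), w2_values)  # 1.333 [-0.5]
```

In the toy example the J-sim is 1/3 (one exact match out of three distinct terms), and the pair "influenza"/"infection" contributes Weight 1 = 0.5 and |Weight 2| = 0.5.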

Adjusting for prediction number biases:
* The weighted similarity will be advantageous to ChatGPT 4 due to its tendency to dump every relevant term
* To account for that, we can penalize it for making excessive guesses by multiplying the weighted similarity by the ratio (# of gold standard terms)/(# of predicted terms).
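
A minimal sketch of this adjustment, assuming the weighted similarity has already been computed. The function name `penalized_similarity` and the choice to return 0 when there are no predictions are illustrative assumptions.

```python
def penalized_similarity(weighted_sim, n_gold, n_pred):
    """Scale the weighted similarity by (# gold standard terms) / (# predicted terms)."""
    if n_pred == 0:
        # Assumption: with no predictions there is nothing to score
        return 0.0
    return weighted_sim * (n_gold / n_pred)


# Twice as many predictions as gold standard terms halves the score
print(penalized_similarity(1.5, n_gold=3, n_pred=6))  # 0.75
```

As written, the ratio exceeds 1 when the model predicts fewer terms than the gold standard contains. Whether to cap it at 1 is left open here.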

Evaluating whether the predictions tend to be more specific or less specific than the gold standard:
* Broadness evaluation: (# of positive Weight 2 values)/(# of negative Weight 2 values)
  * If the broadness evaluation is >1, the LLM predictions are more specific than the ground truth/gold standard terms
  * If the broadness evaluation is <1, the LLM predictions are less specific than the ground truth/gold standard terms
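
The ratio can be computed from the signed Weight 2 values, as in the sketch below. The function name `broadness` and the handling of the case with no negative values (returning infinity, or `None` when there are no matches at all) are assumptions that the notes do not specify.

```python
def broadness(weight_2_values):
    """Ratio of positive to negative Weight 2 values."""
    positives = sum(1 for w in weight_2_values if w > 0)
    negatives = sum(1 for w in weight_2_values if w < 0)
    if negatives == 0:
        # Assumption: no denominator, so report infinity (or None if there are no matches)
        return float("inf") if positives else None
    return positives / negatives


# Two predictions more specific than the gold standard, one less specific
print(broadness([0.5, 1.0, -0.5]))  # 2.0
```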


In [None]:
import os
import pandas as pd

script_path = os.getcwd()
data_path = os.path.join(script_path,'data')
gpt_results_file = os.path.join(data_path,'gpt3_results.tsv')

gpt_results = pd.read_csv(gpt_results_file,delimiter='\t',header=0)
print(gpt_results.head(n=2))


## Generating data for Topic Category Evaluation

In [1]:
import os
import pandas as pd
import requests
import json

In [None]:
datafile = os.path.join('data','GPT categorization validation - data.tsv')
data = pd.read_csv(datafile,delimiter = '\t',header=0)
print(data.head(n=2))

In [None]:
namelist = data['Name'].unique().tolist()
resultlist = []
fail = []
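for eachname in namelist:
    try:
        # Pass the query via `params` so names with spaces or symbols are URL-encoded
        r = requests.get('https://api.data.niaid.nih.gov/v1/query',
                         params={'q': eachname, 'fields': '_id,name'},
                         timeout=30)
        r.raise_for_status()
        result = json.loads(r.text)
        hit = result['hits'][0]  # keep only the top hit
        url = f"https://data.niaid.nih.gov/resources?id={hit['_id']}"
        resultlist.append({'_id': hit['_id'],'name':hit['name'],'url':url})
    except Exception:
        # Request failed, response was not valid JSON, or no hits were returned
        fail.append(eachname)
resultdf = pd.DataFrame(resultlist)
os.makedirs('result', exist_ok=True)  # make sure the output folder exists
resultdf.to_csv(os.path.join('result','topic_category_evaluation.tsv'),sep='\t',header=True)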