# Final Exam CS/INFO 662/762 Fall 2023
CS/INFO 762: 100 points; CS/INFO 662: 90 points

### <font color='red'>Due Dec 9th, 11:59am</font> - Submission via Canvas (.ipynb file)

## STUDENT NAME: <font color='red'>YOUR_NAME_HERE</font>


* Question 1a: Medical Mention Normalization with SAPBERT (PhD Students must include one graph feature) - 35/25 points
* Question 1b: Compute Recall - 15 points
* Question 1c: Random Forest: Feature Importance - 10 points
* Question 2: Language Model Questions (Long Written Answer) - 40 points

<font color='red'>As always, WORK ON YOUR OWN for this final exam. Like last year, the final exam will be run through plagiarism detection software. You may email me for clarification, but don't post on Stack Overflow, Quora, Reddit, etc. You MAY use ChatGPT for ANY question, but the usual rules for citation and prompt inclusion in your answer apply.</font>


## Imports (if needed)

In [None]:
# If needed
#!pip uninstall --yes flair
!pip install obonet
!pip install py-rouge
!pip install node2vec
!pip install rouge-score


In [2]:
import time
import networkx
import obonet
import os
from nltk.corpus import stopwords  
from nltk.tokenize import word_tokenize
from rouge_score import rouge_scorer
import numpy as np
import heapq
import pandas as pd
#import scispacy
import spacy
import torch
from io import StringIO
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel  

tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")  
model = AutoModel.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext").cuda()

do_url = 'https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/main/src/ontology/HumanDO.obo'
hpo_url = 'http://purl.obolibrary.org/obo/hp.obo'
do = obonet.read_obo(do_url)
hpo = obonet.read_obo(hpo_url)
print('Disease Ontology is currently size:'+str(len(do))+" with "+str(do.number_of_edges())+' edges')
print('Human Phenotype Ontology is currently size:'+str(len(hpo))+" with "+str(hpo.number_of_edges())+' edges')



Disease Ontology is currently size:11432 with 11462 edges
Human Phenotype Ontology is currently size:17664 with 21975 edges


## Question 1 - Concept Normalization
This question requires you to use the SAPBERT embeddings you are familiar with from assignment #2 to generate candidate concepts for each input medical mention, using a merged, overlapping knowledge graph of both the Disease Ontology (DO) and the Human Phenotype Ontology (HPO).

### Set Up Knowledge Graph and Corpus Preparation 
This code is provided to you and creates:
* The merged knowledge graph (kgs) from both the Disease Ontology (DO) and the Human Phenotype Ontology (HPO) as a dataframe. You also have access to the original graphs in obo format to get graph features; for example, you can use node2vec.
* The input corpus and medical mentions (labelled data) as a dataframe, "mention_mapping". It is built from the input corpus, and you can assume that NER has already been done to identify the mentions to map. The mentions are in the "mention" column, and the correct concept (CUI) each one should be mapped to is in the "CUI" column.


In [3]:
def createIndex(graph,prefix):
    id2cui = {}
    cui2id = {}
    id_to_xref = {id_: data.get('xref') for id_, data in graph.nodes(data=True)}
    for graph_id,xrefs in id_to_xref.items():
        if(xrefs is None):
            cui = None
        else:
            cui = next((x for x in xrefs if x.startswith(prefix)),None)
            if(cui is not None):
                cui = cui.replace(prefix,'')
        id2cui[graph_id]=cui
        if(cui is not None):
            cui2id[cui]=graph_id
    return(id2cui,cui2id)


def convertCui2Doid(cui):
    if cui in cui2do:
        return cui2do[cui]
    return None

def hpoId2Name(oboid):
    return hpoid_to_name[oboid]

def doId2Name(oboid):
    if(oboid is None):
        return None
    if (doid_to_name[oboid]):
        return doid_to_name[oboid]
    else:
        return None


def get_mentions(filename,bardoc):
    all_mentions = []
    with open(filename, 'r') as file: 
        textdoc = file.read()
        for line in bardoc.splitlines():
            #print(line)
            start = int(line.split("||")[2])
            stop = int(line.split("||")[3])
            mention = textdoc[start:stop]
            if(not line.endswith("||||||")):
                start = int(line.split("||")[4])
                stop = int(line.split("||")[5])
                extramention = textdoc[start:stop]
                mention = mention+' '+extramention
                if(not line.endswith("||||")):
                    start = int(line.split("||")[6])
                    stop = int(line.split("||")[7])
                    extramention = textdoc[start:stop]
                    mention = mention+' '+extramention
            #print(mention)
            all_mentions.append(mention)
    return all_mentions

def read_files(directory):
    all_data = []
    for filename in os.listdir(directory):
        # Only the .norm annotation files are parsed; skip everything else.
        if not filename.endswith(".norm"):
            continue
        file_path = os.path.join(directory, filename)
        with open(file_path, 'r') as norm_file:
            csv_string = norm_file.read()
        # Pad every row to the full 8-field layout (up to three start/stop spans).
        normed = [line if line.count('|') == 14 else (line+"||||" if line.count('|') == 10 else line+"||||||||") for line in csv_string.splitlines()]
        clean = '\n'.join(normed)
        # The note text sits in the sibling train_note directory with a .txt extension.
        note_file = file_path.replace("train_norm","train_note").replace("norm","txt")
        mentions = get_mentions(note_file,clean)
        df = pd.read_csv(StringIO(clean),engine='python',names=['ID', 'CUI', 'start1', 'stop1','start2','stop2','start3','stop3'],sep=r"\|\|")
        df['mention']=mentions
        # Append inside the loop body so only freshly parsed .norm files are collected.
        all_data.append(df)
    return pd.concat(all_data, ignore_index=True)


hpo2cui,cui2hpo = createIndex(hpo,'UMLS:')
do2cui,cui2do = createIndex(do,'UMLS_CUI:')

hpoid_to_name = {id_: data.get('name') for id_, data in hpo.nodes(data=True)}
doid_to_name = {id_: data.get('name') for id_, data in do.nodes(data=True)}

df = pd.DataFrame(list(hpo2cui.items()))
df.columns=['HPOID','CUI']
df['DOID'] = df['CUI'].apply(convertCui2Doid)
df['HPO:Name'] = df['HPOID'].apply(hpoId2Name)
df['DO:Name'] = df['DOID'].apply(doId2Name)
hpokg = df.copy()
print("HPO Vocabulary: hpokg")
print(hpokg)
kgs = df.mask(df.eq('None')).dropna()

# Graph properties that may be useful
id_to_isa = {id_: data.get('is_a') for id_, data in hpo.nodes(data=True)}
id_to_xref = {id_: data.get('xref') for id_, data in do.nodes(data=True)}
result = next(iter(id_to_xref.values()))   

print("HPO and DO Joint Vocabulary:kgs")
print(kgs)
mention_mapping = read_files("train/train_norm/")
print("Input Corpus Mentions:mention_mapping")
mention_mapping

HPO Vocabulary: hpokg
            HPOID       CUI  DOID                          HPO:Name DO:Name
0      HP:0000001  C0444868  None                               All    None
1      HP:0000002  C4025901  None        Abnormality of body height    None
2      HP:0000003  C3714581  None      Multicystic kidney dysplasia    None
3      HP:0000005  C1708511  None               Mode of inheritance    None
4      HP:0000006  C0443147  None    Autosomal dominant inheritance    None
...           ...       ...   ...                               ...     ...
17659  HP:5201010      None  None  Microform cleft of the upper lip    None
17660  HP:5201011      None  None      Complete bilateral cleft lip    None
17661  HP:5201012      None  None    Incomplete bilateral cleft lip    None
17662  HP:5201013      None  None     Microform bilateral cleft lip    None
17663  HP:5201014      None  None    Asymmetric bilateral cleft lip    None

[17664 rows x 5 columns]
HPO and DO Joint Vocabulary:kgs
        

| | ID | CUI | start1 | stop1 | start2 | stop2 | start3 | stop3 | mention |
|---|---|---|---|---|---|---|---|---|---|
| 0 | N000 | C0011854 | 248 | 283 | | | | | insulin dependent diabetes mellitus |
| 1 | N001 | C4303631 | 298 | 327 | | | | | a right above-knee amputation |
| 2 | N003 | C0085671 | 537 | 553 | | | | | dressing changes |
| 3 | N004 | C0011079 | 558 | 569 | | | | | debridement |
| 4 | N005 | C0003232 | 611 | 622 | | | | | antibiotics |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6679 | N139 | C0442519 | 4695 | 4699 | | | | | home |
| 6680 | N140 | C0699203 | 4731 | 4737 | | | | | motrin |
| 6681 | N141 | C0593507 | 4740 | 4745 | | | | | advil |
| 6682 | N142 | C0332575 | 4863 | 4870 | | | | | redness |


## Question 1a: Generation of Candidate Concepts and their Features (35 points PhD/ 25 points MS)


#### Write code to find the best N candidate concepts for the mention using SAPBERT in the small (for final exam performance purposes) merged kgs vocabulary.

The signature of the function should look something like this:
``` 
def getCandidates(mention_embeddings, vocabulary_embeddings, max_candidates):
```
* mention_embeddings would be SAPBERT embeddings of the mentions
* vocabulary_embeddings would be SAPBERT embeddings of the kgs vocabulary. You can generate them using just DO concept text, just HPO concept text, or a function that aggregates both.
* max_candidates (max candidates to return from kgs)

This function returns a list of the best N matches between the mention and the target merged knowledge graph, based on feature similarity between the input node and the target node. Each match in the list is a tuple that can contain any elements you need, but it should at least contain:
 * a reference to the target concept, i.e., a row index or vocabulary_id
 * score (optional) or anything else you think you need
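
As a rough illustration of the candidate-generation step, here is a minimal sketch that ranks vocabulary rows by cosine similarity. It is not the expected solution: it uses plain Python lists and toy 3-dimensional vectors in place of real SAPBERT embeddings, and it assumes the function returns one candidate list per mention, with each candidate a `(row_index, score)` tuple. The helper `cosine` and the toy data are hypothetical.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def getCandidates(mention_embeddings, vocabulary_embeddings, max_candidates):
    """For each mention, return the max_candidates best (row_index, score) tuples."""
    results = []
    for mention_vec in mention_embeddings:
        scored = (
            (row, cosine(mention_vec, vocab_vec))
            for row, vocab_vec in enumerate(vocabulary_embeddings)
        )
        # nlargest keeps only the top matches, sorted by descending score.
        results.append(heapq.nlargest(max_candidates, scored, key=lambda t: t[1]))
    return results


# Toy vectors standing in for SAPBERT embeddings (assumption, not real data).
vocab = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
mentions = [[0.9, 0.1, 0.0]]
candidates = getCandidates(mentions, vocab, max_candidates=2)
print(candidates)
```

With real embeddings, the same ranking would normally be computed as one matrix product over normalized numpy or torch arrays rather than a Python loop; the loop is used here only to keep the sketch dependency-free.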
 
 
#### Write code to get a set of features for each candidate concept that can be used to rank the top N concepts and pick the correct one
The getFeatures function should generate features for an input mention text and one possible candidate mapping.
```
def getFeatures(mention_text, candidate_tuple_from_getCandidates):
```
These features will be used in Part 1b) to generate training data for a machine learning ranking algorithm.

Master's students need at least 2 features in their getFeatures code. Some examples of lexical features include:
* counts of matching words or characters
* longest common subsequence (RougeL)
* ngram overlap, etc...
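
To make the lexical features concrete, the sketch below computes two of them by hand: a token-level Jaccard overlap and a ROUGE-L style F1 based on the longest common subsequence. It is one possible shape for `getFeatures`, not the required one. The tuple layout `(row_index, concept_name, similarity_score)` and the feature names are assumptions made for illustration, and whitespace tokenization is a deliberate simplification.

```python
def tokens(text):
    """Lowercase whitespace tokenization (a simplification for this sketch)."""
    return text.lower().split()


def jaccard_overlap(mention, name):
    """Shared tokens divided by all distinct tokens in either string."""
    a, b = set(tokens(mention)), set(tokens(name))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(mention, name):
    """Harmonic mean of LCS precision and recall over tokens."""
    m, n = tokens(mention), tokens(name)
    lcs = lcs_length(m, n)
    if lcs == 0:
        return 0.0
    precision = lcs / len(n)
    recall = lcs / len(m)
    return 2 * precision * recall / (precision + recall)


def getFeatures(mention_text, candidate_tuple):
    # Hypothetical tuple layout: (row_index, concept_name, similarity_score).
    _, concept_name, score = candidate_tuple
    return {
        "token_jaccard": jaccard_overlap(mention_text, concept_name),
        "rouge_l_f1": rouge_l_f1(mention_text, concept_name),
        "sapbert_score": score,
    }


print(getFeatures("insulin dependent diabetes mellitus", (0, "diabetes mellitus", 0.9)))
```

The `rouge_score` package imported at the top of the notebook offers a tested ROUGE-L implementation; the hand-rolled version here only shows what the metric measures.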

PhD students will need an additional graph-based feature that uses relations in the ontology or ontology node vector representations such as node2vec. For example, one relevant feature may be the similarity of the input mention to the parent node of the target concept. Node features can also be generated from random walks, as in node2vec.
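
One simple way to turn the parent-node idea into a number is sketched below: take the direct `is_a` parents of the candidate and keep the best token overlap between the mention and any parent name. This is only an illustration under stated assumptions. It expects a mapping from concept ID to a list of parent IDs (or `None`), which is the shape `id_to_isa` appears to have in the setup code, and the `TOY:` identifiers and names are invented for the example rather than taken from HPO or DO.

```python
def parent_name_overlap(mention_text, candidate_id, id_to_isa, id_to_name):
    """Best token Jaccard overlap between the mention and any direct is_a parent."""
    mention_tokens = set(mention_text.lower().split())
    best = 0.0
    # Root concepts may map to None, so fall back to an empty parent list.
    for parent_id in id_to_isa.get(candidate_id) or []:
        parent_tokens = set((id_to_name.get(parent_id) or "").lower().split())
        union = mention_tokens | parent_tokens
        if union:
            best = max(best, len(mention_tokens & parent_tokens) / len(union))
    return best


# Toy ontology fragment (hypothetical IDs and names, not real HPO/DO entries).
toy_names = {
    "TOY:0001": "abnormality of the kidney",
    "TOY:0002": "renal cyst",
    "TOY:0003": "multicystic kidney dysplasia",
}
toy_isa = {
    "TOY:0001": None,
    "TOY:0002": ["TOY:0001"],
    "TOY:0003": ["TOY:0002", "TOY:0001"],
}

print(parent_name_overlap("kidney cyst", "TOY:0003", toy_isa, toy_names))
```

A stronger variant would compare SAPBERT or node2vec vectors of the mention and the parent instead of raw tokens; the token version is shown because it needs no model.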


Hints:
 * stop words, stemming, lemmatization, and headword matching are nice but not required for this tiny (mostly matching) gold data set
 * my advice is to do the minimal amount of work and come back later if you want to add more features
 * you may use ANY additional libraries as needed

In [None]:
%%time
from numpy.linalg import norm

# Candidate Generation Code


## Question 1b: Compute Recall@3 and Generate Data for ML Algorithm in Part 1c (15 points)
 * Use your function in Part 1a) to generate 3 candidates for every mention and compute recall at n=3 candidates (across all mentions, the fraction for which getCandidates returned the correct concept (CUI) among its 3 candidates). Your re-ranking algorithm will not be able to do better than this. (5 points)
 * Many mentions represent concepts not included in our small merged kgs. Despite this, your recall performance may still not match your expectations using just SAPBERT embeddings. Explain why this might be (5 points).  
 * Create a labelled candidate ranking data set (5 points). For each mention, there will be 3 examples, of which at most 1 will have the correct CUI. Each example will have features (X) from part 1a and a label (Y). The label will be 1 if the features are sourced from the correct CUI and 0 if not. Use your getFeatures function to populate X. 
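
The bookkeeping for these steps can be sketched as follows. This is a minimal illustration, assuming the candidates have already been mapped to CUIs and the per-candidate feature vectors already computed. The function names `recall_at_k` and `build_ranking_data` are hypothetical, the gold CUIs are taken from the mention table above, and the `CUI_X`-style strings and feature values are placeholders.

```python
def recall_at_k(gold_cuis, candidate_cuis_per_mention, k=3):
    """Fraction of mentions whose gold CUI appears among the first k candidates."""
    if not gold_cuis:
        return 0.0
    hits = sum(
        1
        for gold, cands in zip(gold_cuis, candidate_cuis_per_mention)
        if gold in cands[:k]
    )
    return hits / len(gold_cuis)


def build_ranking_data(gold_cuis, candidate_cuis_per_mention, features_per_mention):
    """Flatten candidates into feature rows X and 0/1 labels Y."""
    X, Y = [], []
    for gold, cands, feats in zip(
        gold_cuis, candidate_cuis_per_mention, features_per_mention
    ):
        for cui, feature_row in zip(cands, feats):
            X.append(feature_row)
            Y.append(1 if cui == gold else 0)
    return X, Y


# Gold CUIs from the mention table; candidate CUIs and features are placeholders.
gold = ["C0011854", "C4303631", "C0085671"]
cands = [
    ["C0011854", "CUI_X", "CUI_Y"],
    ["CUI_X", "CUI_Y", "CUI_Z"],
    ["CUI_Z", "C0085671", "CUI_X"],
]
feats = [
    [[1.0, 0.9], [0.1, 0.2], [0.0, 0.1]],
    [[0.2, 0.3], [0.1, 0.1], [0.0, 0.0]],
    [[0.3, 0.2], [0.8, 0.9], [0.1, 0.0]],
]

print(recall_at_k(gold, cands, k=3))
X, Y = build_ranking_data(gold, cands, feats)
print(len(X), Y)
```

In this toy data the second mention has no correct candidate, so all three of its rows are labelled 0; the same happens whenever the gold concept is missing from kgs.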



In [None]:
%%time
# Compute Recall for Candidate Generation Code


# Create X (data), Y (label) for ranking algorithm.



## Question 1c: Random Forest Candidate Ranker and Feature Analysis (10 points)

 * Split your data into training and testing data and then train scikit-learn's RandomForestClassifier to predict if a candidate node is the correct match. Output a classification report with accuracy.
 
 * Use scikit-learn's RandomForestClassifier to compute the relative importance of your features for this algorithm and graph them. Give your features reasonable names so they look nice on a graph.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from matplotlib import pyplot



## Question 2a (20 points)

One of the issues with medical normalization is that training data is sparse: some diseases are over-represented, whereas some rare diseases have a dictionary entry but few examples in clinical text. Making at least one reference to a paper discussed in class:


* Describe how you could use an LLM (like GPT-4) to generate a synthetic corpus for concept normalization to an ontology like the Human Phenotype Ontology described here. Assume you would like to generate synthetic data for concepts not included in typical training data. (10 points)


* Propose an evaluation method for your synthetic text generation method. How would you evaluate whether your approach is successful? (10 points)



## Question 2b (20 points)

As of 2023, transfer learning using large language models such as GPT-4 is the current best practice for a large number of tasks. There has been speculation in the popular press that these models will function as artificial general intelligences, making domain-specific models redundant.

* Making references to at least one paper discussed in class, describe performance results indicating that this is not the case. (10 points)

* Describe at least 2 benefits of using a domain-specific language model that has been fine-tuned on a task, relative to a model like GPT-4. (10 points)