# User Data Dictionary 

## Input

Reranked RAG results from the Cohere Rerank model. See the [Cohere Rerank Documentation](https://docs.cohere.com/reference/rerank-1).

## Outcome

Produces a dataframe that suggests candidate variable matches for each variable in the data dictionary the user supplied in phase 1.

### Output Dataframe structure:

The first 3 columns describe a row of user data:

`user_study` | `user_var` | `user_label`

The remaining columns hold the model output. Cohere returns the top 3 matches for each variable, and each option `X` (1 to 3) contributes the following columns:

`option_X_var` | `option_X_score` | `option_X_label` | `option_X_study`
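
As a rough illustration of the column layout described above, the sketch below flattens one user row and up to three options into a single flat record. The function name `build_output_row` and the option keys (`var`, `score`, `label`, `study`) are assumptions made for this example, and the user row values are made up; only the output column names come from this notebook.

```python
from typing import Any, Dict, List

OPTION_FIELDS = ("var", "score", "label", "study")


def build_output_row(user_row: Dict[str, str],
                     options: List[Dict[str, Any]],
                     max_options: int = 3) -> Dict[str, Any]:
    """Flatten a user row plus ranked options into one output record."""
    row: Dict[str, Any] = dict(user_row)  # user_study, user_var, user_label
    for position in range(1, max_options + 1):
        # Pad with None so every row has the same set of columns.
        option = options[position - 1] if position <= len(options) else {}
        for field in OPTION_FIELDS:
            row[f"option_{position}_{field}"] = option.get(field)
    return row


# Hypothetical user row; the option values are taken from the example data.
example_row = build_output_row(
    {"user_study": "MY_STUDY", "user_var": "covid_spec", "user_label": "covid"},
    [{"var": "CX_COVID1_SPECTYPE", "score": 0.08,
      "label": "COVID-19 specimen #1 type (e.g. nasopharyngeal swab",
      "study": "REDCORAL"}],
)
```

A list of such records could then be passed to `pandas.DataFrame` to obtain the output dataframe.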

#### Conditional output scenarios

1) If one option scores significantly higher than the rest and has no close competitors, we return only that option
2) If none of the options is a good match, we return all three options AND prompt our LLM to return an explanation of why the three options are poor matches
3) If several or all options have similarly high match scores, we return all three options AND query a vector database to find which option's study most closely matches the user-submitted study (TBD)
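
One possible way to express the three scenarios in code is sketched below. The notebook does not define what counts as a "significantly high" score or a "close" competitor, so `HIGH_SCORE` and `CLOSE_MARGIN` are placeholder values, and `choose_scenario` is a hypothetical helper name.

```python
from typing import List

# Placeholder thresholds (assumptions); tune them against real rerank scores.
HIGH_SCORE = 0.80
CLOSE_MARGIN = 0.10


def choose_scenario(scores: List[float],
                    high: float = HIGH_SCORE,
                    margin: float = CLOSE_MARGIN) -> str:
    """Classify a set of option scores into one of the three scenarios."""
    ranked = sorted(scores, reverse=True)
    if not ranked or ranked[0] < high:
        return "no_good_option"    # scenario 2: ask the LLM for an explanation
    runners_up = [s for s in ranked[1:] if ranked[0] - s <= margin]
    if runners_up:
        return "multiple_close"    # scenario 3: compare studies (TBD)
    return "single_winner"         # scenario 1: return only the top option
```

With these placeholder thresholds, the example scores used later in this notebook (0.08, 0.05, 0.03) would fall into the "no good options" scenario.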


### Setup

In [2]:
import os
import json
from langchain.docstore.document import Document
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
import cohere

from getpass import getpass
# from langchain.embeddings.openai import OpenAIEmbeddings
from langchain_openai import OpenAIEmbeddings
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or getpass("OpenAI API Key: ")


In [3]:
docs = [Document(page_content='Was information on the COVID Ordinal Outcome Scale', metadata={'Study': 'ORCHID', 'Variable': 'OUT_OOSYN_29', 'id': 361188.0}),
 Document(page_content='Was information on the COVID Ordinal Outcome Scale', metadata={'Study': 'ORCHID', 'Variable': 'OUT_OOSYN_15', 'id': 361187.0}),
 Document(page_content='Was information on the COVID Ordinal Outcome Scale', metadata={'Study': 'ORCHID', 'Variable': 'OUT_OOSYN_8', 'id': 361189.0}),
 Document(page_content='Swine H1N1 Vaccination INFO: baseline_variables', metadata={'Study': 'H1N1', 'Variable': 'SWINE_H1N1_VACC', 'id': 208960.0}),
 Document(page_content='Who is the primary source of information?', metadata={'Study': 'SHEP', 'Variable': 'SH51_INFO_SOURCE', 'id': 407061.0}),
 Document(page_content='Abstracting for cohort or surveillance Q3', metadata={'Study': 'ARIC', 'Variable': 'HRAA03', 'id': 24922.0}),
 Document(page_content='Method of Data Collection', metadata={'Study': 'PROPPR', 'Variable': 'DOMOIVTR', 'id': 379440.0}),
 Document(page_content='Method of Data Collection', metadata={'Study': 'PROPPR', 'Variable': 'DOMOBLDT', 'id': 379439.0}),
 Document(page_content='COVID-19 specimen #1 type (e.g. nasopharyngeal swab', metadata={'Study': 'REDCORAL', 'Variable': 'CX_COVID1_SPECTYPE', 'id': 384005.0})]

docs


In [4]:
variable_label = "covid"
# question = f"Find all variables that match this {variable_label}"
question = f"From the provided variables, find all that would be helpful in capturing information about {variable_label}"


### Cohere Reranking

This section picks up the data processing where `jupyter/notebooks/search_and_retreival/reranking.ipynb` leaves off.

In [11]:
api_key = os.environ['COHERE_API_KEY']
coh = cohere.Client(api_key)

# Cohere's rerank endpoint takes plain strings, so keep only the text of each document
rerank_docs = [doc.page_content for doc in docs]

In [5]:
# Example Cohere rerank results (top 3), hard-coded as a stand-in for a live API response
results = [
    {
      "document": {
        "text": "COVID-19 specimen #1 type (e.g. nasopharyngeal swab"
      },
      "index": 2,
      "relevance_score": 0.08
    },
    {
      "document": {
        "text": "Was information on the COVID Ordinal Outcome Scale"
      },
      "index": 0,
      "relevance_score":  0.05
    },
    {
      "document": {
        "text": "Abstracting for cohort or surveillance Q3"
      },
      "index": 1,
      "relevance_score":  0.03
    },
]

In [12]:
# results = coh.rerank(query=question, documents=rerank_docs, top_n=10, model='rerank-multilingual-v2.0')

# for idx, r in enumerate(results):
#   print(f"Document Rank: {idx + 1}, Document Index: {r.index}")
#   print(f"Document: {r.document['text']}")
#   print(f"Relevance Score: {r.relevance_score:.2f}")
#   print("\n")

CohereAPIError: invalid api token

In [6]:
# Sort the list of dictionaries based on 'relevance_score' in descending order
sorted_results = sorted(results, key=lambda x: x['relevance_score'], reverse=True)

# Get the top 3 results
top_3_results = sorted_results[:3]
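
The rerank results only carry the document text, an index, and a score, while the output dataframe also needs each option's variable name and study. A minimal sketch of joining the two is shown below. It assumes that `index` refers to the position of the document in the list sent to the reranker, and it uses plain dicts as stand-ins for the LangChain `Document` objects. `attach_metadata` is a hypothetical helper, and the small `example_docs` list is built only for this illustration (the indexes in the hard-coded example results above do not line up with `docs`).

```python
from typing import Any, Dict, List


def attach_metadata(results: List[Dict[str, Any]],
                    documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Look up study and variable metadata for each rerank result."""
    options = []
    for result in results:
        source = documents[result["index"]]  # position in the reranked list
        options.append({
            "var": source["metadata"]["Variable"],
            "study": source["metadata"]["Study"],
            "label": source["page_content"],
            "score": result["relevance_score"],
        })
    return options


# Plain-dict stand-ins for two of the documents defined earlier.
example_docs = [
    {"page_content": "Abstracting for cohort or surveillance Q3",
     "metadata": {"Study": "ARIC", "Variable": "HRAA03"}},
    {"page_content": "COVID-19 specimen #1 type (e.g. nasopharyngeal swab",
     "metadata": {"Study": "REDCORAL", "Variable": "CX_COVID1_SPECTYPE"}},
]
example_results = [
    {"index": 1, "relevance_score": 0.08},
    {"index": 0, "relevance_score": 0.03},
]
example_options = attach_metadata(example_results, example_docs)
```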

## Adding the Generation in RAG

Needed To Dos:
- Change the prompt template to more accurately support our end goal
  - ... which is?

Optional To Dos:
- Play around with temperature
- Select different OpenAI model

In [7]:
llm = ChatOpenAI(temperature=0.1, openai_api_key=OPENAI_API_KEY)

In [18]:
s = "Below I am putting a user query and retrieved context. Read both inputs and make sure to output one score from 0.0 to 1.0 that indicates how relevant the retrieved context is to answer the user query."
# Build one relevancy prompt per reranked result
relevancy_prompt_list = [
    f"{s} Query: {variable_label} Context: {result['document']['text']}"
    for result in top_3_results
]
relevancy_prompt_list

['Below I am putting a user query and retrieved context. Read both inputs and make sure to output one score from 0.0 to 1.0 that indicates how relevant the retrieved context is to answer the user query. Query: covid Context: COVID-19 specimen #1 type (e.g. nasopharyngeal swab',
 'Below I am putting a user query and retrieved context. Read both inputs and make sure to output one score from 0.0 to 1.0 that indicates how relevant the retrieved context is to answer the user query. Query: covid Context: Was information on the COVID Ordinal Outcome Scale',
 'Below I am putting a user query and retrieved context. Read both inputs and make sure to output one score from 0.0 to 1.0 that indicates how relevant the retrieved context is to answer the user query. Query: covid Context: Abstracting for cohort or surveillance Q3']
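
The prompts above ask the model for a single score between 0.0 and 1.0, but chat models often wrap the number in extra text. The sketch below shows one way the score could be pulled out of a free-text reply. `parse_relevance_score` is a hypothetical helper, not part of LangChain or OpenAI, and it makes the simplifying assumption that the first number in the range [0, 1] is the score.

```python
import re
from typing import Optional

_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_relevance_score(reply: str) -> Optional[float]:
    """Return the first number in [0.0, 1.0] found in the reply, else None."""
    for token in _NUMBER.findall(reply):
        value = float(token)
        if 0.0 <= value <= 1.0:
            return value
    return None
```

Numbers outside the range, such as the "19" in "COVID-19", are skipped. A reply with no usable number yields `None`, which the caller can treat as a failed scoring attempt.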

In [40]:
# Prompt V1
# We cannot set a seed for OpenAI

QA_PROMPT = PromptTemplate(
    input_variables=["query", "contexts"],
    template = """
    You are a data curator whose role is to harmonize biological variables in the NHLBI (National Heart, Lung, and Blood
    Institute) data repository. You are tasked with evaluating input variables from a data dictionary that describes new
    data that will be added to the existing pool of variables in the repository. For each new variable, a vector search
    engine has returned the three nearest existing variables found in the data repository.

    Your job is to determine which of the contexts identified by the search, if any, is the best fit for harmonizing
    the new variable. You must explain why the selected context was chosen over the others and provide information
    about the relevancy of each context to the new variable.

    Then you are to provide the user with as much information as you can on how they can align their new variable with
    the selected context.

    When there is no obvious match, provide additional information on why you can't make a determination.

    Contexts:
    {contexts}

    Original question: {query}""",
)

qa_chain = LLMChain(llm=llm, prompt=QA_PROMPT)

In [12]:
print("\n---\n".join(d['document']['text'] for d in top_3_results))

COVID-19 specimen #1 type (e.g. nasopharyngeal swab
---
Was information on the COVID Ordinal Outcome Scale
---
Abstracting for cohort or surveillance Q3


In [37]:
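# Join the contexts first: reusing double quotes and a backslash inside an f-string
# expression is a syntax error before Python 3.12
contexts = "\n---\n".join(d['document']['text'] for d in top_3_results)
response = qa_chain(inputs={'query': variable_label, 'contexts': contexts})
print(f"Relevance Score: {response['text']}")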

top_3_results

Relevance Score: Based on the provided contexts, it seems that the new variable is related to COVID-19 data. The three nearest existing variables found in the data repository are related to COVID-19 specimen type, COVID Ordinal Outcome Scale, and abstracting for cohort or surveillance.

In this case, the best fit for harmonizing the new variable would be the "COVID-19 specimen #1 type" context. This is because the new variable likely pertains to a specific type of COVID-19 specimen, such as a nasopharyngeal swab, which aligns closely with this context. The relevance of the other contexts, such as the COVID Ordinal Outcome Scale and abstracting for cohort or surveillance, may not directly match the nature of the new variable.

To align the new variable with the selected context of "COVID-19 specimen #1 type," the user should ensure that the data collected for the new variable is specific to the type of specimen being analyzed for COVID-19. They should also follow any standardized protoc

[{'document': {'text': 'COVID-19 specimen #1 type (e.g. nasopharyngeal swab'},
  'index': 2,
  'relevance_score': 0.08},
 {'document': {'text': 'Was information on the COVID Ordinal Outcome Scale'},
  'index': 0,
  'relevance_score': 0.05},
 {'document': {'text': 'Abstracting for cohort or surveillance Q3'},
  'index': 1,
  'relevance_score': 0.03}]

In [41]:
inputs = {'query': variable_label, 'contexts': "\n---\n".join(d['document']['text'] for d in top_3_results)}

inputs 

{'query': 'covid',
 'contexts': 'COVID-19 specimen #1 type (e.g. nasopharyngeal swab\n---\nWas information on the COVID Ordinal Outcome Scale\n---\nAbstracting for cohort or surveillance Q3'}

### Prompt V1

#### Prompt

"""
    Your task is to compare the results of a user query to three retrieved context values.
    The user would like to know how closely their question aligns with a context.
    Your first step is to select a winning context by comparing the user query to the three contexts.
    You should explain your rationale in selecting the winning context and explain how the user could alter their query to better align with the context.

    You should then discuss the other contexts and explain their relevance to the user query and why they were not selected. 
    
    We want to provide the user with as much information as we can on how to
    align their variables with the data corpus.
    
    Contexts:
    COVID-19 specimen #1 type (e.g. nasopharyngeal swab
    ---
    Was information on the COVID Ordinal Outcome Scale
    ---
    Abstracting for cohort or surveillance Q3

    Original question: covid"""

#### Output

Relevance Score: In comparing the user query "covid" to the three contexts provided, the most relevant context appears to be "COVID-19 specimen #1 type (e.g. nasopharyngeal swab)." This is because the user query directly relates to COVID-19, which aligns with the topic of this context. To better align their query with this context, the user could specify their question further, such as asking about the most common type of specimen used for COVID-19 testing or the accuracy of nasopharyngeal swab tests.

The context "Was information on the COVID Ordinal Outcome Scale" was not selected as the winning context because the user query does not specifically mention outcomes or scales related to COVID-19. However, if the user is interested in understanding the severity or progression of COVID-19 cases, they could modify their query to include terms like "COVID-19 outcome scale" or "severity of COVID-19 cases."

The context "Abstracting for cohort or surveillance Q3" was also not chosen as the winning context because the user query does not indicate a specific interest in cohort studies or surveillance related to COVID-19. If the user is looking for information on COVID-19 surveillance data or cohort studies, they could refine their query to include terms like "COVID-19 surveillance data" or "COVID-19 cohort studies."

Overall, by providing more specific details in their query, the user can better align their question with the relevant context and receive more accurate and helpful information.

In [9]:
# Query the chain once per reranked result, passing a single context each time
for result in top_3_results:
    context = result['document']['text']
    response = qa_chain(inputs={'query': variable_label, 'contexts': context})
    print(f"Prompt: {context}")
    print(f"Relevance Score: {response['text']}")
    print("\n")
# out = qa_chain(
#     inputs={
#         "query": variable_label,
#         "contexts": "\n---\n".join([d['document']['text'] for d in top_3_results])
#     }
# )
# out["text"]

Prompt: COVID-19 specimen #1 type (e.g. nasopharyngeal swab
Relevance Score: Based on the user query "covid", the winning context would be "COVID-19 specimen #1 type (e.g. nasopharyngeal swab)". This is because the user query directly mentions COVID-19, which aligns closely with this context. The user could alter their query to be more specific by asking about the type of specimen used for COVID-19 testing, such as "What is the most common specimen type used for COVID-19 testing?".

The other two contexts, which were not selected as the winning context, are still relevant to the user query but to a lesser extent. 

The second context, "COVID-19 testing locations", could be relevant if the user is looking for information on where to get tested for COVID-19. To align their query with this context, the user could ask "Where can I get tested for COVID-19 near me?".

The third context, "COVID-19 symptoms", could also be relevant if the user is looking for information on the symptoms of COVID-19. To align their query with this context, the user could

## Prompt V2 Testing

Let's start by testing the LLM on variable labels that more closely resemble the data we are actually trying to match.

In [None]:
variable_label_1 = "Highest level of formal/academic education achieved"
variable_label_2 = "lesion MLD and %DS in the persistent region"
variable_label_3 = "Doing things that make you feel valued"
variable_label_4 = "Acute Exacerbation, Medical Monitor Identified Even"
variable_label_5 = "COVID-19 specimen #1 type (e.g. nasopharyngeal swab"


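variable_labels = [variable_label_1, variable_label_2, variable_label_3, variable_label_4, variable_label_5]

# Build one question per test variable label
questions = [
    f"From the provided variables, find all that would be helpful in capturing information about {label}"
    for label in variable_labels
]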

