# User Data Dictionary 

## Input

Reranked RAG results from the Cohere Rerank model. See the [Cohere Rerank Documentation](https://docs.cohere.com/reference/rerank-1).

## Outcome

Produces a dataframe that suggests candidate variables for each variable in the user-supplied data dictionary from phase 1.

### Output Dataframe structure

The first 3 columns describe a row of user data:

| user_study | user_var | user_label |
| ---------- | -------- | ---------- |

The remaining columns hold the model output. Cohere returns the top 3 candidates for each variable, and each option X (1 to 3) contributes the following columns:

| option_X_var | option_X_score | option_X_label | option_X_study |
| ------------ | -------------- | -------------- | -------------- |
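
As a minimal sketch of this layout, one row could be flattened as shown below. The helper name `build_output_row` and the per-option keys `var`, `score`, `label`, and `study` are assumptions made for illustration; they are not taken from the pipeline. Missing options are filled with `None` so every row has the same columns.

```python
from typing import Any, Dict, List


def build_output_row(
    user_row: Dict[str, str],
    options: List[Dict[str, Any]],
    n_options: int = 3,
) -> Dict[str, Any]:
    """Flatten one user variable and up to n_options candidates into one row.

    The option keys ('var', 'score', 'label', 'study') are hypothetical names.
    """
    # The first three columns come straight from the user data dictionary.
    row: Dict[str, Any] = {
        "user_study": user_row["user_study"],
        "user_var": user_row["user_var"],
        "user_label": user_row["user_label"],
    }
    # Each option contributes four columns; absent options become None.
    for i in range(1, n_options + 1):
        option = options[i - 1] if i <= len(options) else {}
        for field in ("var", "score", "label", "study"):
            row[f"option_{i}_{field}"] = option.get(field)
    return row


# Illustrative input only: the user row is made up for this example.
example_row = build_output_row(
    {"user_study": "MY_STUDY", "user_var": "covid_test", "user_label": "covid"},
    [{"var": "CX_COVID1_SPECTYPE", "score": 0.08,
      "label": "COVID-19 specimen #1 type", "study": "REDCORAL"}],
)
```

A list of such rows can be passed to `pandas.DataFrame` to obtain the dataframe described above.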

#### Conditional output scenarios

1) If one option returns a significantly high match score and no other option is close, we return only that first option.
2) If there are no good options, we return the options AND prompt our LLM to explain why the three options are faulty.
3) If multiple (or all) options have close, significant match scores, we return all three options AND query a vector database for which variable's study most closely matches the user-submitted study (TBD).
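
The three scenarios can be sketched as a small decision function. The thresholds `HIGH_SCORE`, `LOW_SCORE`, and `CLOSE_MARGIN` are placeholder assumptions, since no cut-off values are defined here, and the `"return_all"` fallback for cases the list does not cover is also an assumption.

```python
from typing import List

# Hypothetical thresholds; real values would need tuning against actual scores.
HIGH_SCORE = 0.8    # minimum score for a "significantly high" match
LOW_SCORE = 0.3     # below this, the best option is not considered good
CLOSE_MARGIN = 0.1  # options within this distance of the top score compete


def choose_scenario(scores: List[float]) -> str:
    """Pick an output scenario for relevance scores sorted in descending order."""
    if not scores or scores[0] < LOW_SCORE:
        return "explain"      # scenario 2: no good options, ask the LLM why
    top = scores[0]
    competitors = [s for s in scores[1:] if top - s <= CLOSE_MARGIN]
    if top >= HIGH_SCORE and not competitors:
        return "single"       # scenario 1: one clear winner
    if competitors:
        return "tie_break"    # scenario 3: close scores, compare studies
    return "return_all"       # not covered by the list above (assumed default)
```

With the example scores used later in this notebook (0.08, 0.05, 0.03), this sketch falls into the `"explain"` branch because the best score is below `LOW_SCORE`.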


### Setup

In [5]:
import os
import json
from langchain.docstore.document import Document
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
import cohere

from getpass import getpass
# from langchain.embeddings.openai import OpenAIEmbeddings
from langchain_openai import OpenAIEmbeddings
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or getpass("OpenAI API Key: ")


In [7]:
docs = [Document(page_content='Was information on the COVID Ordinal Outcome Scale', metadata={'Study': 'ORCHID', 'Variable': 'OUT_OOSYN_29', 'id': 361188.0}),
 Document(page_content='Was information on the COVID Ordinal Outcome Scale', metadata={'Study': 'ORCHID', 'Variable': 'OUT_OOSYN_15', 'id': 361187.0}),
 Document(page_content='Was information on the COVID Ordinal Outcome Scale', metadata={'Study': 'ORCHID', 'Variable': 'OUT_OOSYN_8', 'id': 361189.0}),
 Document(page_content='Swine H1N1 Vaccination INFO: baseline_variables', metadata={'Study': 'H1N1', 'Variable': 'SWINE_H1N1_VACC', 'id': 208960.0}),
 Document(page_content='Who is the primary source of information?', metadata={'Study': 'SHEP', 'Variable': 'SH51_INFO_SOURCE', 'id': 407061.0}),
 Document(page_content='Abstracting for cohort or surveillance Q3', metadata={'Study': 'ARIC', 'Variable': 'HRAA03', 'id': 24922.0}),
 Document(page_content='Method of Data Collection', metadata={'Study': 'PROPPR', 'Variable': 'DOMOIVTR', 'id': 379440.0}),
 Document(page_content='Method of Data Collection', metadata={'Study': 'PROPPR', 'Variable': 'DOMOBLDT', 'id': 379439.0}),
 Document(page_content='COVID-19 specimen #1 type (e.g. nasopharyngeal swab', metadata={'Study': 'REDCORAL', 'Variable': 'CX_COVID1_SPECTYPE', 'id': 384005.0})]

docs


In [9]:
variable_label = "covid"
# question = f"Find all variables that match this {variable_label}"
question = f"From the provided variables, find all that would be helpful in capturing information about {variable_label}"


### Cohere Reranking

This section picks up the data processing where `jupyter/notebooks/search_and_retreival/reranking.ipynb` leaves off.

In [11]:
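# Fall back to an interactive prompt when the variable is unset, as with the OpenAI key
api_key = os.getenv('COHERE_API_KEY') or getpass("Cohere API Key: ")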
coh = cohere.Client(api_key)

# Cohere rerank takes plain strings, so keep only the text of each Document
rerank_docs = [doc.page_content for doc in docs]

In [14]:
# Example Cohere rerank results (top 3), hard-coded so the rest of the notebook runs without an API call
results = [
    {
      "document": {
        "text": "COVID-19 specimen #1 type (e.g. nasopharyngeal swab"
      },
      "index": 2,
      "relevance_score": 0.08
    },
    {
      "document": {
        "text": "Was information on the COVID Ordinal Outcome Scale"
      },
      "index": 0,
      "relevance_score":  0.05
    },
    {
      "document": {
        "text": "Abstracting for cohort or surveillance Q3"
      },
      "index": 1,
      "relevance_score":  0.03
    },
]

In [12]:
# results = coh.rerank(query=question, documents=rerank_docs, top_n=10, model='rerank-multilingual-v2.0')

# for idx, r in enumerate(results):
#   print(f"Document Rank: {idx + 1}, Document Index: {r.index}")
#   print(f"Document: {r.document['text']}")
#   print(f"Relevance Score: {r.relevance_score:.2f}")
#   print("\n")

CohereAPIError: invalid api token

In [15]:
# Sort the list of dictionaries based on 'relevance_score' in descending order
sorted_results = sorted(results, key=lambda x: x['relevance_score'], reverse=True)

# Get the top 3 results
top_3_results = sorted_results[:3]

## Adding the Generation in RAG

Needed To Dos:
- Change the prompt template to more accurately support our end goal
  - ... which is?

Optional To Dos:
- Play around with temperature
- Select different OpenAI model

In [27]:
llm = ChatOpenAI(temperature=0.1, openai_api_key=OPENAI_API_KEY)

In [18]:
s = "Below I am putting a user query and retrieved context. Read both inputs and make sure to output one score from 0.0 to 1.0 that indicates how relevant the retrieved context is to answer the user query."
# Build one scoring prompt per reranked result
relevancy_prompt_list = [
    f"{s} Query: {variable_label} Context: {result['document']['text']}"
    for result in top_3_results
]
relevancy_prompt_list

['Below I am putting a user query and retrieved context. Read both inputs and make sure to output one score from 0.0 to 1.0 that indicates how relevant the retrieved context is to answer the user query. Query: covid Context: COVID-19 specimen #1 type (e.g. nasopharyngeal swab',
 'Below I am putting a user query and retrieved context. Read both inputs and make sure to output one score from 0.0 to 1.0 that indicates how relevant the retrieved context is to answer the user query. Query: covid Context: Was information on the COVID Ordinal Outcome Scale',
 'Below I am putting a user query and retrieved context. Read both inputs and make sure to output one score from 0.0 to 1.0 that indicates how relevant the retrieved context is to answer the user query. Query: covid Context: Abstracting for cohort or surveillance Q3']
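
Because these prompts ask for a single score between 0.0 and 1.0, the reply has to be turned back into a number before it can be compared with other scores. The helper below is a minimal sketch of that step; `parse_relevance_score` is a hypothetical name, and it assumes the reply contains little besides the score. It returns the first standalone number in the range, so a reply that mentions another number first (for example "#1") would be misread.

```python
import re
from typing import Optional

# Matches a standalone number in [0, 1], e.g. "0", "0.85", "1", "1.0".
# The lookarounds reject digits that are part of a larger number such as "10.5".
_SCORE_RE = re.compile(r"(?<![\d.])(?:0(?:\.\d+)?|1(?:\.0+)?)(?!\d|\.\d)")


def parse_relevance_score(reply: str) -> Optional[float]:
    """Return the first score in [0.0, 1.0] found in an LLM reply, else None."""
    match = _SCORE_RE.search(reply)
    return float(match.group()) if match else None
```

Returning `None` rather than raising lets the caller decide how to handle a reply that contains prose instead of a score.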

In [24]:
QA_PROMPT = PromptTemplate(
    input_variables=["query", "contexts"],
    template = """Human: Below I am putting a user query and retrieved context. Read both inputs and make sure to provide an explanation of how relevant the retrieved context is to answer the user query.


    Contexts:
    {contexts}

    Question: {query}""",
)

qa_chain = LLMChain(llm=llm, prompt=QA_PROMPT)

In [26]:
print("\n---\n".join(d['document']['text'] for d in top_3_results))

COVID-19 specimen #1 type (e.g. nasopharyngeal swab
---
Was information on the COVID Ordinal Outcome Scale
---
Abstracting for cohort or surveillance Q3


In [28]:
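# Note: despite the "Relevance Score" label, QA_PROMPT asks for an explanation, so the chain returns prose
for result in top_3_results:
    context = result['document']['text']
    answer = qa_chain(inputs={'query': variable_label, 'contexts': context})['text']
    print(f"Prompt: {context}")
    print(f"Relevance Score: {answer}")
    print("\n")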
# out = qa_chain(
#     inputs={
#         "query": variable_label,
#         "contexts": "\n---\n".join([d['document']['text'] for d in top_3_results])
#     }
# )
# out["text"]

Prompt: COVID-19 specimen #1 type (e.g. nasopharyngeal swab
Relevance Score: The retrieved context is directly related to the user query about COVID-19. It provides information about the type of specimen that is commonly used for COVID-19 testing, which is a nasopharyngeal swab. This information is relevant as it helps to understand the testing process for COVID-19 and the type of sample that is typically collected for diagnosis.


Prompt: Was information on the COVID Ordinal Outcome Scale
Relevance Score: The retrieved context is not directly relevant to the user query about "covid." The context provided seems to be incomplete and does not provide enough information to determine its relevance to the user query. It would be helpful to have more context or a clearer connection to the user query in order to provide a relevant explanation.


Prompt: Abstracting for cohort or surveillance Q3
Relevance Score: The retrieved context "Abstracting for cohort or surveillance" does not seem direc