# Modeling - RAG

This notebook tests several models before we settle on the final one.

Let's recall the candidate strategies:

1. Extractive QA: the answer is a span of text taken from the context.
    1. Here we could run a retriever over the query and use the model only to extract the answer from the retrieved context.
    2. The drawback (at this point) is that we don't know how well each topic is represented in the dataset.
2. Open Generative QA: the answer is generated from the retrieved options.
    1. At a higher cost in memory and time, we can use a model to generate the answer. We rely on the pre-trained model's knowledge to better handle query variants.
    2. This means more complexity, longer inference time, a possible need for a GPU, and the risk of generating unsafe answers (whereas the dataset contains only true/safe answers).
3. Generative QA: the answer is free text.
    1. Same as Open Generative QA, but the model does not always rely on the retrieved options.


## Libs and Variables

In [1]:
import random
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch


def clean_repeated_ws(text):
    """Collapse runs of whitespace into a single space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

In [2]:
def seed_all():
    """
    Seed the random number generators to ensure reproducibility.

    This function sets the seed of Python's built-in random module, the NumPy random module,
    and PyTorch to 42, so that random operations performed with these libraries produce the
    same results each time the code is run.
    """
    random.seed(42)
    np.random.seed(42)
    torch.manual_seed(42)


seed_all()

In [3]:
DATA_PATH = "../data/data.csv"

## Load data

In [4]:
data = pd.read_csv(DATA_PATH)

data["question"] = data["question"].astype(str)
data["answer"] = data["answer"].astype(str)

data["question"] = data["question"].apply(clean_repeated_ws)
data["answer"] = data["answer"].apply(clean_repeated_ws)

In [5]:
data.head(5).style

Unnamed: 0,question,answer
0,What is (are) Glaucoma ?,"Glaucoma is a group of diseases that can damage the eye's optic nerve and result in vision loss and blindness. The most common form of the disease is open-angle glaucoma. With early treatment, you can often protect your eyes against serious vision loss. (Watch the video to learn more about glaucoma. To enlarge the video, click the brackets in the lower right-hand corner. To reduce the video, press the Escape (Esc) button on your keyboard.) See this graphic for a quick overview of glaucoma, including how many people it affects, whos at risk, what to do if you have it, and how to learn more. See a glossary of glaucoma terms."
1,What is (are) Glaucoma ?,The optic nerve is a bundle of more than 1 million nerve fibers. It connects the retina to the brain.
2,What is (are) Glaucoma ?,"Open-angle glaucoma is the most common form of glaucoma. In the normal eye, the clear fluid leaves the anterior chamber at the open angle where the cornea and iris meet. When the fluid reaches the angle, it flows through a spongy meshwork, like a drain, and leaves the eye. Sometimes, when the fluid reaches the angle, it passes too slowly through the meshwork drain, causing the pressure inside the eye to build. If the pressure damages the optic nerve, open-angle glaucoma -- and vision loss -- may result."
3,Who is at risk for Glaucoma? ?,"Anyone can develop glaucoma. Some people are at higher risk than others. They include - African-Americans over age 40 - everyone over age 60, especially Hispanics/Latinos - people with a family history of glaucoma. African-Americans over age 40 everyone over age 60, especially Hispanics/Latinos people with a family history of glaucoma. See this graphic for a quick overview of glaucoma, including how many people it affects, whos at risk, what to do if you have it, and how to learn more."
4,How to prevent Glaucoma ?,"At this time, we do not know how to prevent glaucoma. However, studies have shown that the early detection and treatment of glaucoma, before it causes major vision loss, is the best way to control the disease. So, if you fall into one of the higher risk groups for the disease, make sure to have a comprehensive dilated eye exam at least once every one to two years. Get tips on finding an eye care professional. Learn what a comprehensive dilated eye exam involves."


## Data split

In [6]:
from sklearn.model_selection import train_test_split

# DEVELOPMENT
# data = data.head(100)

# split train test val (80-10-10)
train, test = train_test_split(data, test_size=0.2, random_state=42, shuffle=True)
val, test = train_test_split(test, test_size=0.5, random_state=42, shuffle=True)
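
The two-step split above first holds out 20% of the rows and then halves that hold-out, which gives roughly 80/10/10. The same arithmetic can be sketched with the standard library only. `split_indices` is a hypothetical helper written for illustration, not part of the notebook, and its rounding may differ by a row from what scikit-learn does.

```python
import random


def split_indices(n_rows, holdout_frac=0.2, seed=42):
    """Shuffle row indices and split them into train/val/test (hypothetical helper)."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)

    n_holdout = round(n_rows * holdout_frac)  # rows held out of training (20%)
    n_val = n_holdout // 2                    # half of the hold-out becomes validation

    holdout, train = indices[:n_holdout], indices[n_holdout:]
    val, test = holdout[:n_val], holdout[n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 800 100 100
```

Since the dataset contains repeated questions, a row-level split like this one can place the same question in both train and test. Grouping rows by question before splitting would be one way to avoid that, if it turns out to matter for evaluation.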

## RAG

### Vector Database

In [7]:
from langchain_community.document_loaders import DataFrameLoader

loader = DataFrameLoader(train, page_content_column="question")
docs = loader.load()

As we observed in 01_data_exploration:
1. The dataset contains repeated (or near-duplicate) questions with different answers.
2. Some answers exceed 5k tokens (using the tiktoken tokenizer).

For the first point:
1. We could concatenate each question with its answer chunks so that every entry is distinct, which may improve retrieval when the answer is more specific.
2. We could also summarize the answers.

For the second point:
1. If we concatenate question and answer, we need to pay attention to the chunking strategy (truncation/padding).
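
The question/answer concatenation idea can be sketched as follows. This is a minimal illustration under stated assumptions: `chunk_with_question` is a hypothetical helper, it uses plain fixed-size character windows (not a separator-aware splitter), and the `chunk_size`/`chunk_overlap` defaults are illustrative values only.

```python
def chunk_with_question(question, answer, chunk_size=150, chunk_overlap=50):
    """Split `answer` into overlapping character windows, each prefixed with `question`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(answer), step):
        window = answer[start:start + chunk_size]
        chunks.append(f"Question: {question}\nAnswer: {window}")
        if start + chunk_size >= len(answer):
            break  # this window already reaches the end of the answer
    return chunks


example = chunk_with_question(
    "What is (are) Glaucoma ?",
    "Glaucoma is a group of diseases that can damage the eye's optic nerve. " * 5,
)
print(len(example))
print(example[0])
```

Prefixing every window with the question keeps entries for repeated questions distinguishable, at the cost of storing the question text once per chunk.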

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=150, chunk_overlap=50)
splits = text_splitter.split_documents(docs)

In [None]:
from langchain_community.vectorstores import FAISS
from langchain_ollama import OllamaEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings

# Remove pandas warning from Sentence Transformers lib
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)

# embed = OllamaEmbeddings(model="llama3.2:3b")
embeddings_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embed = HuggingFaceEmbeddings(model_name=embeddings_model_name)

vectorstore = FAISS.from_documents(documents=splits, embedding=embed)



In [62]:
from langchain_ollama.llms import OllamaLLM
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts.prompt import PromptTemplate


# LLM
llm = OllamaLLM(model="llama3.2:3b")

# Prompt
prompt = """
You are a medical assistant AI Bot oriented by document search.
Based on the query of the user and the given context, extract the answer.
If you don't have certain about the answer, please let the user know and don't try to guess.
You can only answer direct information from the given context.
Question:
{query}
Context:
{context}
"""

prompt = PromptTemplate.from_template(prompt)

### Define the flow

In [89]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

import tiktoken

tokenizer = tiktoken.get_encoding("o200k_base")


class State(TypedDict):
    query: str
    context: List[Document]
    answer: str


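def retrieve(state: State):
    """Fetch the 5 indexed question chunks most similar to the query."""
    retrieved_docs = vectorstore.similarity_search(state["query"], k=5)
    return {"context": retrieved_docs}


def generate(state: State):
    """Fill the prompt with the retrieved answers and query the LLM."""
    # The indexed text is the question; the matching answer is stored in the metadata.
    docs_content = "\n\n".join(doc.metadata["answer"] for doc in state["context"])
    messages = prompt.invoke({"query": state["query"], "context": docs_content})

    # Debug output: the rendered prompt and its token count.
    print(messages)
    print(f"n_tokens = {len(tokenizer.encode(str(messages)))}")

    response = llm.invoke(messages)
    return {"answer": response}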


graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
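
Because the context is built by joining the answers of the retrieved documents, and some answers are very long, the prompt can grow quickly. One possible mitigation is sketched below: drop repeated answers and stop once a size budget is reached. `build_context` and `max_chars` are hypothetical names, and the character budget is only a rough stand-in for a token count (the notebook itself counts tokens with tiktoken).

```python
def build_context(answers, max_chars=4000, separator="\n\n"):
    """Join unique answers in retrieval order until a character budget is reached."""
    seen = set()
    parts = []
    used = 0
    for answer in answers:
        if answer in seen:
            continue  # several retrieved chunks may carry the same answer
        seen.add(answer)
        cost = len(answer) + (len(separator) if parts else 0)
        if used + cost > max_chars:
            break  # stop before exceeding the budget
        parts.append(answer)
        used += cost
    return separator.join(parts)


retrieved = ["Answer A.", "Answer A.", "Answer B."]
print(build_context(retrieved))
```

In the flow above, this could replace the plain `"\n\n".join(...)` by passing it the list of `doc.metadata["answer"]` values. Whether deduplication or truncation helps answer quality is something to verify on the validation split.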

In [101]:
llm.get_name()

'OllamaLLM'

In [90]:
question = train.iloc[0].question
answer = train.iloc[0].answer

print(f"Question: {question}\n")
print(f"True Answer: {answer}\n")

inference = graph.invoke(
    {
        "query": question,
    }
)

print(f"Model Answer: \n{inference['answer']}")

Question: Where to find support for people with Alcohol Use and Older Adults ?

True Answer: Many people with alcohol problems find it helpful to talk with others who have faced similar problems. Mutual help groups, such as Alcoholics Anonymous (AA) 12-step programs, help people recover from alcohol use disorder. AA meetings are open to anyone who wants to stop drinking. Attending mutual-help groups is beneficial for many people who want to stop drinking. Many people continue to go to support/mutual help groups even after medical treatment for their alcohol problems ends. There are other mutual help groups available such as Smart Recovery, Life Ring, and Moderation Management. Learn more about available types of treatment for alcohol problems.

text="\nYou are a medical assistant AI Bot oriented by document search.\nBased on the query of the user and the given context, extract the answer.\nIf you don't have certain about the answer, please let the user know and don't try to guess.\nYou

## Tests

In [73]:
vectorstore.similarity_search(question, k=5)[0].metadata["answer"]

'Many people with alcohol problems find it helpful to talk with others who have faced similar problems. Mutual help groups, such as Alcoholics Anonymous (AA) 12-step programs, help people recover from alcohol use disorder. AA meetings are open to anyone who wants to stop drinking. Attending mutual-help groups is beneficial for many people who want to stop drinking. Many people continue to go to support/mutual help groups even after medical treatment for their alcohol problems ends. There are other mutual help groups available such as Smart Recovery, Life Ring, and Moderation Management. Learn more about available types of treatment for alcohol problems.'
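
The single lookup above checks one question by eye. A hit-rate over many questions could be sketched as below. This is an assumption-laden illustration: `hit_rate_at_k`, `toy_index`, and `toy_retrieve` are hypothetical, and the stub retriever stands in for a wrapper around `vectorstore.similarity_search(question, k=k)` that reads `doc.metadata["answer"]`.

```python
def hit_rate_at_k(examples, retrieve_answers, k=5):
    """Fraction of (question, answer) pairs whose answer is among the top-k retrieved answers."""
    if not examples:
        return 0.0
    hits = 0
    for question, expected in examples:
        if expected in retrieve_answers(question, k):
            hits += 1
    return hits / len(examples)


# Stub retriever for illustration only (not the notebook's vector store).
toy_index = {"q1": ["a1", "x"], "q2": ["y", "z"]}


def toy_retrieve(question, k):
    return toy_index.get(question, [])[:k]


print(hit_rate_at_k([("q1", "a1"), ("q2", "a2")], toy_retrieve))  # 0.5
```

Note that the question used in the test above comes from the train split, which is also what the vector store indexes, so a hit is expected there. Running a metric like this on the validation split would give a more informative number.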

## Development

In [102]:
# llm.invoke(f"Split the following question in three similar questions: {question}")
prompt.invoke({"query": question, "context": "This is a test context"})

StringPromptValue(text="\nYou are a medical assistant AI Bot oriented by document search.\nBased on the query of the user and the given context, extract the answer.\nIf you don't have certain about the answer, please let the user know and don't try to guess.\nYou can only answer direct information from the given context.\nQuestion:\nWhere to find support for people with Alcohol Use and Older Adults ?\nContext:\nThis is a test context\n")