# Building a RAG application from scratch

Let's start by loading the environment variables we need to use.

In [1]:
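import os
from getpass import getpass

# Never hard-code credentials in a notebook. Read them from the environment
# and prompt only for the ones that are missing.
for key in ("OPENAI_API_KEY", "PINECONE_API_KEY"):
    if not os.environ.get(key):
        os.environ[key] = getpass(f"{key}: ")
os.environ.setdefault("PINECONE_API_ENV", "us-east-1")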
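Before running the rest of the notebook, it can help to confirm that every required variable is actually set. The following is a minimal sketch; the helper name `missing_env_vars` is hypothetical and not part of any library.

```python
import os

# Variables this notebook expects to find in the environment.
REQUIRED_VARS = ("OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_API_ENV")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"OPENAI_API_KEY": "placeholder", "PINECONE_API_ENV": "us-east-1"}
print(missing_env_vars(env=example_env))  # ['PINECONE_API_KEY']
```

Calling `missing_env_vars()` with no arguments checks the real environment instead.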

Let's define the LLM that we'll use as part of the workflow.

In [3]:
from langchain_openai.chat_models import ChatOpenAI

# The API key is read from the OPENAI_API_KEY environment variable.
model = ChatOpenAI(model="gpt-4o")

For this example, we'll use a simple `StrOutputParser` to extract the answer as a string.

In [4]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()

We want to provide the model with some context and the question. [Prompt templates](https://python.langchain.com/docs/modules/model_io/prompts/quick_start) are a simple way to define and reuse prompts.

In [5]:
from langchain_core.prompts import ChatPromptTemplate

template = """
You are an AI assistant, trained to provide understandable and accurate information about pharmacogenomics and drugs.
You will base your responses on the context and information provided. Output both your answer and a score of how confident you are,
and cite the references. Also provide the source of the document chunks used for the response.
If the information related to the question is not in the context or in the information provided in the prompt,
you will say 'I don't know'.
You are not a healthcare provider and you will not provide medical care or make assumptions about treatment.

Context: {context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
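To see what the two placeholders do, here is a plain-Python analogue that fills `{context}` and `{question}` with `str.format`. It is only a sketch of the idea, not how `ChatPromptTemplate` is implemented, and the shortened template and the `fill_prompt` helper are hypothetical.

```python
# Shortened stand-in for the template above, with the same two placeholders.
toy_template = """
Context: {context}

Question: {question}
"""


def fill_prompt(context: str, question: str) -> str:
    """Substitute the retrieved context and the user question into the template."""
    return toy_template.format(context=context, question=question)


filled = fill_prompt(
    context="(retrieved document chunks go here)",
    question="(user question goes here)",
)
print(filled)
```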

Next, let's load the source documents (PDF, CSV, and JSON files) into memory and split them into chunks:

In [6]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import JSONLoader
from langchain_community.document_loaders.csv_loader import CSVLoader

folder_path = "/home/dhanushb/Wellytics/RAG_data/all_files"
jsondata = []
csvdata = []
pdfdocs = []
for filename in os.listdir(folder_path):
    if filename.endswith(".pdf"):
        file_path = os.path.join(folder_path, filename)
        loader = PyPDFLoader(file_path)
        doc = loader.load()
        pdfdocs.extend(doc)
    elif filename.endswith(".csv"):
        file_path = os.path.join(folder_path, filename)
        loader = CSVLoader(file_path)
        data = loader.load()
        csvdata.extend(data)
    elif filename.endswith(".json"):
        file_path = os.path.join(folder_path, filename)
        loader = JSONLoader(file_path, jq_schema=".", json_lines=False, text_content=False)
        data = loader.load()
        jsondata.extend(data)

for doc in pdfdocs:
    doc.page_content = doc.page_content.replace('\t', ' ')

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=5000, chunk_overlap=1000)
documents = text_splitter.split_documents(pdfdocs)
jsondocs = text_splitter.split_documents(jsondata)

documents += jsondocs + csvdata 

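For our specific application, `RecursiveCharacterTextSplitter` is configured with chunks of up to 5000 characters and a 1000-character overlap. Only the PDF and JSON documents are split; the CSV documents are added unsplit, since `CSVLoader` already produces one document per row.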
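To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter. It is a sketch under a stated assumption: it cuts purely by character position, whereas `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either side.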

Next, let's create the embedding model that will turn each chunk into a vector:

In [7]:
from langchain_openai.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

## Setting up Pinecone

The first step is to create a Pinecone account, set up an index, get an API key, and store the key in the `PINECONE_API_KEY` environment variable. The index name must match `index_name` below.

In [8]:
from langchain_pinecone import PineconeVectorStore

index_name = "pdfs-rag"

pinecone = PineconeVectorStore.from_documents(
    documents, embeddings, index_name=index_name
)

On later runs, we can connect to the existing index instead of embedding the documents again, and then get a retriever directly from the vector store:

In [10]:
from langchain_pinecone import PineconeVectorStore

pinecone = PineconeVectorStore(embedding=embeddings, index_name="pdfs-rag")

In [9]:
retriever = pinecone.as_retriever()
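Conceptually, the retriever embeds the question and returns the stored chunks whose vectors are most similar to it. The toy example below sketches that idea with hand-written two-dimensional vectors. Cosine similarity is an assumption here (the metric actually used depends on how the Pinecone index was configured), and `cosine_similarity`, `top_k`, and `toy_index` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query vector."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Made-up vectors standing in for real embeddings.
toy_index = {"doc_a": [1.0, 0.0], "doc_b": [0.7, 0.7], "doc_c": [0.0, 1.0]}
print(top_k([1.0, 0.1], toy_index, k=2))  # ['doc_a', 'doc_b']
```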

We can create a map with the two inputs by using the [`RunnableParallel`](https://python.langchain.com/docs/expression_language/how_to/map) and [`RunnablePassthrough`](https://python.langchain.com/docs/expression_language/how_to/passthrough) classes. This will allow us to pass the context and question to the prompt as a map with the keys "context" and "question."

In [11]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

setup = RunnableParallel(context=retriever, question=RunnablePassthrough())
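The shape of the data this step produces can be sketched in plain Python: each branch receives the same input, and the results are collected under the branch names. This is only an analogue that runs the branches one after another, not LangChain's implementation; `run_parallel`, `fake_retriever`, and `passthrough` are hypothetical stand-ins.

```python
def run_parallel(branches, value):
    """Feed the same input to every branch and collect the results by name."""
    return {name: branch(value) for name, branch in branches.items()}


def fake_retriever(question):
    """Stand-in for a retriever: returns a list of placeholder chunks."""
    return [f"chunk relevant to: {question}"]


def passthrough(value):
    """Return the input unchanged, like a passthrough step."""
    return value


result = run_parallel(
    {"context": fake_retriever, "question": passthrough},
    "What is CYP3A5?",
)
print(result)
# {'context': ['chunk relevant to: What is CYP3A5?'], 'question': 'What is CYP3A5?'}
```

The resulting dictionary has exactly the two keys the prompt template expects.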

Let's set up the chain using Pinecone as the vector store:

In [16]:
chain = setup | prompt | model | parser
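The `|` operator feeds the output of each step into the next one. A minimal sketch of that composition pattern, using a hypothetical `Step` class and toy stand-ins for the real components (this is an illustration of the idea, not LangChain's code):

```python
class Step:
    """Wrap a function so that steps can be composed with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Build a new step that runs this one first, then the next one.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


# Toy stand-ins for setup, prompt, model, and parser.
toy_setup = Step(lambda question: {"context": "none", "question": question})
toy_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
toy_model = Step(lambda text: {"content": text.upper()})
toy_parser = Step(lambda message: message["content"])

toy_chain = toy_setup | toy_prompt | toy_model | toy_parser
print(toy_chain.invoke("hi"))
# CONTEXT: NONE
# QUESTION: HI
```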

In [17]:
chain.invoke(" As part of my liver transplant, I take tacrolimus. My doctor recently informed me that I had a high chance of graft rejection and performed a pharmacogenetic test to determine whether my dosage needs to be adjusted. What does it indicate that I have CYP3A5 extensive metabolizer, according to my test results that I received today?")

'According to the pharmacogenetic test results indicating that you have a CYP3A5 extensive metabolizer status, it suggests that you may require a lower tacrolimus dose compared to individuals who are CYP3A5 poor metabolizers. This is because CYP3A5 extensive metabolizers tend to metabolize tacrolimus more efficiently, leading to potentially lower dose requirements to achieve optimal therapeutic levels and reduce the risk of graft rejection.\n\nConfidence Score: 9\n\nSource: Uesugi M et al. Impact of cytochrome P450 3A5 polymorphism in graft livers on the frequency of acute cellular rejection in living-donor liver transplantation. Pharmacogenet Genomics 2014;24:356-66.'
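Because the prompt asks for a confidence score, the answer string can be post-processed to pull that number out. The sketch below assumes the model labels it `Confidence Score:` as in the output above, which is not guaranteed for every response; `extract_confidence` is a hypothetical helper.

```python
import re


def extract_confidence(answer: str):
    """Return the integer after 'Confidence Score:', or None if it is absent."""
    match = re.search(r"Confidence Score:\s*(\d+)", answer)
    return int(match.group(1)) if match else None


sample = "Some answer text.\n\nConfidence Score: 9\n\nSource: (citation)"
print(extract_confidence(sample))  # 9
print(extract_confidence("I don't know"))  # None
```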