# Question Answering

## Overview

Recall the overall workflow for retrieval augmented generation (RAG):

![image.png](./img/RAG.jpg)

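The previous sections covered `Document Loading` and `Splitting`, as well as `Storage` and `Retrieval`. This section covers the last step: passing the retrieved documents to an LLM so that it can answer a question.

Let's start by loading the vector database that was persisted earlier.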

In [2]:
import os
from langchain_openai import OpenAI
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=os.getenv("OPENAI_API_KEY")
)
llm_name = "gpt-3.5-turbo"

In [3]:
from langchain_community.vectorstores.chroma import Chroma
from langchain_openai import OpenAIEmbeddings

persist_directory = "docs/chroma/"
embedding = OpenAIEmbeddings()
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

In [4]:
print(vectordb._collection.count())

209


In [7]:
question = "What are major topics for this class?"
docs = vectordb.similarity_search(question, k=3)
print("Length of docs: ", len(docs))

Length of docs:  3
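
To make the retrieval step less of a black box, here is a minimal sketch of what a top-k similarity search does conceptually: score every stored vector against the query vector and keep the `k` best. The helper names (`cosine_similarity`, `top_k`) and the toy 2-D vectors are made up for illustration; the metric and index that the vector store actually uses may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy 2-D "embeddings" (hypothetical; real embeddings have many more dimensions).
toy_docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
toy_query = [1.0, 0.05]

print(top_k(toy_query, toy_docs, k=3))  # [0, 1, 3]
```

In the notebook, `similarity_search(question, k=3)` plays the role of `top_k`: the question is embedded first, and the three closest chunks come back as documents.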


In [8]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model_name=llm_name, temperature=0)

### RetrievalQA chain

In [9]:
from langchain.chains.retrieval_qa.base import RetrievalQA

In [10]:
qa_chain = RetrievalQA.from_chain_type(llm, retriever=vectordb.as_retriever())

In [12]:
result = qa_chain.invoke({"query": question})

In [16]:
print(result["result"])

The major topics for this class include machine learning, statistics, and algebra. Additionally, there will be discussions on extensions of the material covered in the main lectures.


### Prompt

In [17]:
# Build prompt
from langchain.prompts import PromptTemplate


template = """Use the following pieces of context to answer the question at the end. \
    If you don't know the answer, just say that you don't know, don't try to make up an answer. \
    Use three sentences maximum. Keep the answer as concise as possible. \
    Always say "thanks for asking!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
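
It can help to see what the model actually receives once the placeholders are filled in. The sketch below uses plain `str.format` as a stand-in for the prompt template, with a shortened template and two hypothetical chunks. It assumes the default "stuff" behaviour, where the retrieved chunks are joined into a single `{context}` string; the exact separator the chain uses is an assumption here.

```python
# Plain-Python stand-in for prompt formatting (no LangChain needed).
demo_template = (
    "Use the following pieces of context to answer the question at the end.\n"
    "{context}\n"
    "Question: {question}\n"
    "Helpful Answer:"
)

# Hypothetical retrieved chunks; in the notebook these come from the retriever.
demo_chunks = ["Chunk one text.", "Chunk two text."]

filled = demo_template.format(
    context="\n\n".join(demo_chunks),  # assumed separator between chunks
    question="Is probability a class topic?",
)
print(filled)
```

The model sees one block of text, so anything in the template (length limits, the required closing phrase) applies to every answer the chain produces.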

In [21]:
# Run chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)

In [22]:
question = "Is probability a class topic?"
result = qa_chain.invoke({"query": question})

In [36]:
print("Result: ", result["result"])
print("Source: ", result["source_documents"][0].metadata["source"])

Result:  Yes, probability is a class topic as the instructor assumes familiarity with basic probability and statistics. Thanks for asking!
Source:  docs/cs229_lectures/MachineLearning-Lecture01.pdf


### RetrievalQA chain types

In [37]:
qa_chain_mr = RetrievalQA.from_chain_type(
    llm, retriever=vectordb.as_retriever(), chain_type="map_reduce"
)

In [38]:
result = qa_chain_mr.invoke({"query": question})

In [39]:
print("Result: ", result["result"])

Result:  Yes, probability is a class topic in the document.
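
The idea behind `map_reduce` can be sketched without any model at all. In the sketch below, `map_reduce_answer` and `stub_llm` are hypothetical names, and the prompt wording is invented; it only illustrates the call pattern (one call per chunk, then one combining call), not the prompts LangChain actually sends.

```python
def map_reduce_answer(llm, chunks, question):
    """Answer a question by querying each chunk separately, then combining."""
    # Map step: one LLM call per chunk, each seeing only that chunk.
    partial_answers = [
        llm(f"Context:\n{chunk}\n\nQuestion: {question}\nAnswer:")
        for chunk in chunks
    ]
    # Reduce step: one final call that sees only the partial answers.
    summaries = "\n".join(partial_answers)
    return llm(f"Partial answers:\n{summaries}\n\nQuestion: {question}\nFinal answer:")


calls = []


def stub_llm(prompt):
    """Hypothetical stand-in for a model call: records the prompt, returns a canned string."""
    calls.append(prompt)
    return f"answer #{len(calls)}"


final = map_reduce_answer(
    stub_llm, ["chunk A", "chunk B", "chunk C"], "Is probability a class topic?"
)
print(len(calls))  # 4 (3 map calls + 1 reduce call)
print(final)       # answer #4
```

Two consequences follow from this pattern: it costs more model calls than stuffing everything into one prompt, and no single call sees all the chunks together, which may be one reason the answer above is terser than the earlier one.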


You can ignore this part for now.
If you wish to experiment on the `LangChain plus platform`:

 * Go to [LangSmith](https://www.langchain.com/langsmith) and sign up
 * Create an API key from your account's settings
 * Use this API key in the code below
 * Uncomment the code

Note: the endpoint in the video differs from the one below. Use the one below.

In [49]:
qa_chain_mr = RetrievalQA.from_chain_type(
    llm, retriever=vectordb.as_retriever(), chain_type="map_reduce"
)
result = qa_chain_mr.invoke({"query": question})
print("Result: ", result["result"])

Result:  The prerequisites of basic computer science knowledge, probability and statistics, and linear algebra are needed for the class to ensure that students have the foundational knowledge required to understand and apply the advanced concepts and techniques of machine learning that will be taught in the course.


In [None]:
# Same setup as above, but with the "refine" chain type
qa_chain_refine = RetrievalQA.from_chain_type(
    llm, retriever=vectordb.as_retriever(), chain_type="refine"
)
result = qa_chain_refine.invoke({"query": question})
print("Result: ", result["result"])

Result:  The additional context provided does not significantly impact the original answer, as it still remains clear that probability is a class topic in the course being discussed. The mention of using probability concepts to derive the next learning algorithm and the importance of probability in understanding classification problems are still relevant. Therefore, the original answer stands as an appropriate response to the question.
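
The phrasing of the answer above ("the original answer stands") hints at how `refine` works: the chunks are processed one after another, and each step is asked to revise a running answer. A minimal sketch of that loop follows; `refine_answer` and `stub_refine_llm` are hypothetical names and the prompt text is invented for illustration.

```python
def refine_answer(llm, chunks, question):
    """Build an answer from the first chunk, then revise it with each later chunk."""
    # The first chunk produces an initial answer.
    answer = llm(f"Context:\n{chunks[0]}\n\nQuestion: {question}\nAnswer:")
    # Each later chunk is used to revise the running answer.
    for chunk in chunks[1:]:
        answer = llm(
            f"Existing answer: {answer}\n"
            f"New context:\n{chunk}\n\n"
            f"Question: {question}\n"
            "Refine the existing answer if the new context helps; otherwise keep it."
        )
    return answer


refine_calls = []


def stub_refine_llm(prompt):
    """Hypothetical stand-in for a model call: records the prompt, returns a canned draft."""
    refine_calls.append(prompt)
    return f"draft {len(refine_calls)}"


refined = refine_answer(
    stub_refine_llm, ["chunk A", "chunk B", "chunk C"], "Is probability a class topic?"
)
print(len(refine_calls))  # 3 (one call per chunk, run sequentially)
print(refined)            # draft 3
```

Unlike the map step in `map_reduce`, these calls cannot run in parallel, because each one depends on the previous draft.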


### RetrievalQA limitations
 
`RetrievalQA` does not preserve conversational history: each call is answered independently of the questions and answers that came before it.

In [43]:
qa_chain = RetrievalQA.from_chain_type(llm, retriever=vectordb.as_retriever())

In [44]:
question = "Is probability a class topic?"
result = qa_chain.invoke({"query": question})
print("Result: ", result["result"])

Result:  Yes, probability is a class topic mentioned in the context provided. The instructor assumes familiarity with basic probability and statistics for the class.


In [45]:
question = "Why are those prerequisites needed?"
result = qa_chain.invoke({"query": question})
print("Result: ", result["result"])

Result:  The prerequisites of basic knowledge in computer science, probability and statistics, and linear algebra are needed for the machine learning class because they form the foundational understanding required to grasp the concepts and algorithms taught in the course. Understanding these subjects helps students apply machine learning algorithms effectively to real-world problems and conduct research in the field.


Note: the LLM response varies. Some responses **do** include a reference to probability, which might be gleaned from the retrieved documents. The point is simply that the model does not have access to past questions or answers. This will be covered in the next section.
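
The limitation can be illustrated with a small sketch of how a stateless prompt is assembled. `build_prompt` is a hypothetical helper, not part of LangChain, and the chunk text is invented; the assumption is simply that each call builds its prompt from the current question and the chunks retrieved for it, and nothing else.

```python
def build_prompt(question, retrieved_chunks):
    """Assumed stateless assembly: only the current question and its retrieved chunks."""
    context = "\n\n".join(retrieved_chunks)
    return f"{context}\nQuestion: {question}\nHelpful Answer:"


first_prompt = build_prompt(
    "Is probability a class topic?", ["chunk about prerequisites"]
)
second_prompt = build_prompt(
    "Why are those prerequisites needed?", ["chunk about prerequisites"]
)

# The first question never reaches the second call.
print("Is probability a class topic?" in second_prompt)  # False
```

Whatever "those" refers to in the follow-up question has to be inferred from the retrieved chunks alone, which is why the answer may or may not mention probability.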