# Question Answering

## Overview

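Recall the overall workflow for retrieval augmented generation (RAG).

We have already covered `Document Loading` and `Splitting`, as well as `Storage` and `Retrieval`. This section covers the last step: passing the retrieved documents to a language model to answer a question.

Let's load our vectorDB.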

In [1]:
import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

The code below sets the OpenAI model to the version used when the course was filmed, for as long as that version remains available (deprecation is currently scheduled for Sept 2023).
LLM responses often vary from run to run, but they may differ significantly when a different model version is used.

In [2]:
llm_name = "gpt-3.5-turbo"

In [3]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'resources/docs/chroma'
embedding = OpenAIEmbeddings()
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

In [4]:
print(vectordb._collection.count())

209


In [5]:
question = "What are major topics for this class?"
docs = vectordb.similarity_search(question,k=3)
len(docs)

0
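For intuition about what `similarity_search(question, k=3)` does, here is a minimal sketch of top-k retrieval over a toy store. It assumes cosine similarity and hand-written two-dimensional vectors; the names `cosine_similarity`, `top_k`, and `toy_store` are hypothetical, and the real vector store may use a different distance metric and index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=3):
    """Return up to k texts from store, a list of (text, vector) pairs, most similar first."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Hypothetical toy data: three "documents" with made-up embeddings.
toy_store = [
    ("syllabus", [1.0, 0.0]),
    ("logistics", [0.0, 1.0]),
    ("prerequisites", [0.7, 0.7]),
]

print(top_k([1.0, 0.1], toy_store, k=2))
print(top_k([1.0, 0.1], [], k=2))  # an empty store yields an empty list
```

An empty list is a valid return value, so it is worth checking `len(docs)` before relying on the retrieved documents.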

In [6]:
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name=llm_name, temperature=0)

### RetrievalQA chain

In [7]:
from langchain.chains import RetrievalQA

In [8]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)

In [9]:
result = qa_chain({"query": question})

In [10]:
result["result"]

'The major topics for this class are computer science, programming languages, algorithms, data structures, software engineering, and computer networks.'
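Conceptually, the chain above retrieves documents, places them into one prompt, and sends that prompt to the LLM. The sketch below illustrates that flow with stand-in functions; `answer_with_stuffing`, `fake_retrieve`, and `fake_llm` are hypothetical names, and the prompt wording is an assumption rather than the text LangChain actually uses.

```python
def answer_with_stuffing(question, retrieve, llm):
    """Retrieve documents, put them all into a single prompt, and ask the LLM once."""
    docs = retrieve(question)
    context = "\n\n".join(docs)
    prompt = (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return {"query": question, "result": llm(prompt), "prompt": prompt}

# Stand-ins so the sketch runs without an API key or a vector store.
def fake_retrieve(question):
    return ["Doc A text", "Doc B text"]

def fake_llm(prompt):
    return f"(stub answer based on a prompt of {len(prompt)} characters)"

out = answer_with_stuffing("What are major topics for this class?", fake_retrieve, fake_llm)
print(out["result"])
```

If `retrieve` returns no documents, the context is empty and the model can only answer from its general knowledge.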

### Prompt

In [11]:
from langchain.prompts import PromptTemplate

# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
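To see what the chain will send to the model, it can help to fill the placeholders by hand. The sketch below uses plain `str.format` on a shortened copy of the template; `short_template` and the sample context string are made up for illustration.

```python
# Shortened, hypothetical copy of the template above, with the same two placeholders.
short_template = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Helpful Answer:"""

filled = short_template.format(
    context="(retrieved document text would go here)",
    question="Is probability a class topic?",
)
print(filled)
```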


In [12]:
# Run chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

In [13]:
question = "Is probability a class topic?"

In [14]:
result = qa_chain({"query": question})

In [15]:
result["result"]

'Yes, probability is a class topic in many math and statistics courses. Thanks for asking!'

In [16]:
result["source_documents"][0]

IndexError: list index out of range
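The `IndexError` above means `result["source_documents"]` was an empty list in this run. A small guard avoids the exception; `first_source` is a hypothetical helper, shown here on plain dictionaries shaped like the chain's result.

```python
def first_source(result):
    """Return the first source document, or None when nothing was retrieved."""
    sources = result.get("source_documents") or []
    return sources[0] if sources else None

# Hypothetical results shaped like the chain output.
empty_result = {"result": "some answer", "source_documents": []}
full_result = {"result": "some answer", "source_documents": ["doc-1", "doc-2"]}

print(first_source(empty_result))  # None
print(first_source(full_result))   # doc-1
```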

### RetrievalQA chain types

In [None]:
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_reduce"
)

In [None]:
result = qa_chain_mr({"query": question})

In [None]:
result["result"]

'Based on the information provided, it is not clear if probability is a specific topic in the class.'
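As a rough sketch of the `map_reduce` idea: each document is sent to the LLM on its own, and the individual answers are then combined in one final call. The function below uses a stub LLM and assumed prompt wording; `map_reduce_answer` and `counting_llm` are hypothetical names, not LangChain's implementation.

```python
def map_reduce_answer(question, docs, llm):
    """Ask the LLM about each document separately, then combine the partial answers."""
    # Map step: one LLM call per document.
    partial_answers = [
        llm(f"Context:\n{doc}\n\nQuestion: {question}\nAnswer from this context only:")
        for doc in docs
    ]
    # Reduce step: one final call over the partial answers.
    combined = "\n".join(partial_answers)
    return llm(f"Partial answers:\n{combined}\n\nQuestion: {question}\nFinal answer:")

# Stub LLM that records every prompt it receives.
calls = []
def counting_llm(prompt):
    calls.append(prompt)
    return f"answer {len(calls)}"

final = map_reduce_answer(
    "Is probability a class topic?",
    ["doc one", "doc two", "doc three"],
    counting_llm,
)
print(len(calls), final)
```

Because each map call sees only one document, information spread across several documents can be lost, which may explain a less definite answer than the one-prompt approach gives.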

If you wish to experiment on the `LangChain plus platform`:

 * Go to the [LangChain plus platform](https://www.langchain.plus/) and sign up.
 * Create an API key from your account's settings.
 * Use this API key in the code below.
 * Uncomment the code.

Note: the endpoint in the video differs from the one below. Use the one below.

In [None]:
#import os
#os.environ["LANGCHAIN_TRACING_V2"] = "true"
#os.environ["LANGCHAIN_ENDPOINT"] = "https://api.langchain.plus"
#os.environ["LANGCHAIN_API_KEY"] = "..." # replace dots with your api key

In [None]:
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_reduce"
)
result = qa_chain_mr({"query": question})
result["result"]

'Based on the information provided, it is not clear if probability is a specific topic covered in the class.'

In [None]:
# Same setup as above, but with the "refine" chain type
qa_chain_refine = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="refine"
)
result = qa_chain_refine({"query": question})
result["result"]

'Based on the additional context provided, it is clear that probability is indeed a class topic. The instructor assumes familiarity with basic probability and statistics, and some review of probability will be provided in the discussion sections for those who need a refresher.'
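The `refine` idea can be sketched as a loop: answer from the first document, then revise that answer once per remaining document. The code below uses a stub LLM and assumed prompt wording; `refine_answer` and `drafting_llm` are hypothetical names, not LangChain's implementation.

```python
def refine_answer(question, docs, llm):
    """Build an answer from the first document, then refine it with each later document."""
    answer = None
    for doc in docs:
        if answer is None:
            prompt = f"Context:\n{doc}\n\nQuestion: {question}\nAnswer:"
        else:
            prompt = (
                f"Question: {question}\n"
                f"Existing answer: {answer}\n"
                f"Additional context:\n{doc}\n\n"
                "Refine the existing answer if the new context helps; otherwise repeat it."
            )
        answer = llm(prompt)
    return answer  # None when docs is empty

# Stub LLM that records prompts and returns numbered drafts.
refine_calls = []
def drafting_llm(prompt):
    refine_calls.append(prompt)
    return f"draft {len(refine_calls)}"

final_draft = refine_answer(
    "Is probability a class topic?",
    ["doc one", "doc two", "doc three"],
    drafting_llm,
)
print(len(refine_calls), final_draft)
```

Each call after the first carries the previous answer forward, which is consistent with the response above referring to "additional context".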

### RetrievalQA limitations
 
`RetrievalQA` does not preserve conversational history: each call is answered independently of earlier questions and answers.

In [None]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)

In [None]:
question = "Is probability a class topic?"
result = qa_chain({"query": question})
result["result"]

'Yes, probability is a topic that will be covered in the class. The instructor assumes that students have familiarity with basic probability and statistics.'

In [None]:
question = "why are those prerequisites needed?"
result = qa_chain({"query": question})
result["result"]

'The prerequisites are needed because they provide a foundation of basic knowledge and skills that are necessary for understanding and applying machine learning algorithms. Having a basic knowledge of computer science and understanding concepts like big-O notation allows students to grasp the underlying principles and techniques used in machine learning. It ensures that all students are starting from a similar level of understanding and can effectively engage with the material taught in the class.'

Note: the LLM response varies. Some responses **do** include a reference to probability, which might be gleaned from the retrieved documents. The point is simply that the model does not have access to past questions or answers. This will be covered in the next section.
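The statelessness can be made concrete with a stub. In the sketch below, a stand-in chain builds its prompt from the current query only, so the second prompt contains nothing from the first exchange; `make_stateless_chain` and `recording_llm` are hypothetical names and the prompt format is an assumption.

```python
def make_stateless_chain(llm):
    """Return a chain-like callable that only ever sees the current query."""
    def chain(inputs):
        prompt = f"Question: {inputs['query']}\nAnswer:"
        return {"query": inputs["query"], "result": llm(prompt)}
    return chain

# Stub LLM that records every prompt it receives.
seen_prompts = []
def recording_llm(prompt):
    seen_prompts.append(prompt)
    return "(stub answer)"

chain = make_stateless_chain(recording_llm)
chain({"query": "Is probability a class topic?"})
chain({"query": "why are those prerequisites needed?"})

# The second prompt has no trace of the first question or its answer.
print("probability" in seen_prompts[1])  # False
```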

## Experiment on your own

In [None]:
# Set up Chroma DB from a list of example texts
# (Chroma, OpenAIEmbeddings, and ChatOpenAI were imported in earlier cells)

embedding = OpenAIEmbeddings()

machine_learning_texts = [
    "Machine Learning, a branch of artificial intelligence, focuses on the development of algorithms that enable computers to learn from and make predictions or decisions based on data. Unlike traditional programming, where instructions are explicitly given, ML algorithms improve automatically through experience.",
    "Supervised learning, a primary method within Machine Learning, involves training an algorithm on a labeled dataset, where the correct output is known, enabling the model to learn over time to predict the output when given new data.",
    "Unsupervised learning, another core technique in Machine Learning, deals with input data without labeled responses. The system tries to learn the patterns and the structure from such data by itself, often used for clustering or anomaly detection.",
    "Reinforcement learning is a type of Machine Learning where an algorithm learns to perform an action from experience by making decisions in a game-like scenario. It optimizes its decision-making process based on the rewards received for actions taken.",
    "Deep Learning, a subset of Machine Learning, utilizes neural networks with many layers. These algorithms are particularly powerful in handling vast amounts of data and are behind many advances in image and speech recognition, language translation, and autonomous vehicles.",
    "Feature engineering is a crucial step in the Machine Learning process, involving the selection, modification, and creation of input variables or features that are used to train the model. It significantly impacts the performance of ML algorithms.",
    "Overfitting in Machine Learning occurs when a model learns the detail and noise in the training data to the extent that it performs poorly on new data. Techniques such as cross-validation, regularization, and pruning are used to avoid this issue.",
    "Machine Learning models require large volumes of data for training. However, the quality and relevance of the data are equally important, as biased or incomplete data can lead to inaccurate models that perpetuate biases or fail to perform as intended.",
    "The ethical implications of Machine Learning are a growing concern, as the technology can inadvertently reinforce existing biases, infringe on privacy, or be used in manipulative ways. Ensuring fairness, accountability, and transparency in ML models is crucial.",
    "The future of Machine Learning promises advancements in personalized medicine, environmental protection, and automated systems. However, it also poses challenges in job displacement, security, and ensuring ethical use of the technology.",
]

machine_learning_metadatas = [
    {"summary": "Introduction to ML", "category": "Basics"},
    {"summary": "Supervised Learning Overview", "category": "Learning Types"},
    {"summary": "Unsupervised Learning Explained", "category": "Learning Types"},
    {"summary": "Basics of Reinforcement Learning", "category": "Learning Types"},
    {"summary": "Deep Learning and Neural Networks", "category": "Advanced Topics"},
    {"summary": "Feature Engineering Techniques", "category": "Model Preparation"},
    {"summary": "Handling Overfitting in Models", "category": "Model Evaluation"},
    {"summary": "Importance of Data Quality", "category": "Data Preparation"},
    {"summary": "Ethical Considerations in ML", "category": "Ethics and Future"},
    {"summary": "Future Trends in Machine Learning", "category": "Ethics and Future"}
]


ml_db = Chroma.from_texts(
    texts=machine_learning_texts,
    metadatas=machine_learning_metadatas,
    embedding=embedding)

llm = ChatOpenAI(model_name=llm_name, temperature=0)
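In the cell above, `texts` and `metadatas` are paired by position, so the two lists need the same length and order. The sketch below checks that pairing on a small made-up sample and groups summaries by category; `group_by_category`, `sample_texts`, and `sample_metadatas` are hypothetical names.

```python
from collections import defaultdict

def group_by_category(texts, metadatas):
    """Pair each text with its metadata by position and group summaries by category."""
    if len(texts) != len(metadatas):
        raise ValueError("texts and metadatas must have the same length")
    groups = defaultdict(list)
    for _text, meta in zip(texts, metadatas):
        groups[meta["category"]].append(meta["summary"])
    return dict(groups)

# Made-up sample in the same shape as the lists above.
sample_texts = ["text about basics", "text about supervised", "text about unsupervised"]
sample_metadatas = [
    {"summary": "Introduction to ML", "category": "Basics"},
    {"summary": "Supervised Learning Overview", "category": "Learning Types"},
    {"summary": "Unsupervised Learning Explained", "category": "Learning Types"},
]

print(group_by_category(sample_texts, sample_metadatas))
```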


In [None]:
# Retrieve Relevant Documents for Question
question = "What are some of the main techniques used in Machine Learning?"
docs = ml_db.similarity_search(question, k=3)
for doc in docs:
    print(doc)

page_content='Deep Learning, a subset of Machine Learning, utilizes neural networks with many layers. These algorithms are particularly powerful in handling vast amounts of data and are behind many advances in image and speech recognition, language translation, and autonomous vehicles.' metadata={'summary': 'Deep Learning and Neural Networks', 'category': 'Advanced Topics'}
page_content='Machine Learning, a branch of artificial intelligence, focuses on the development of algorithms that enable computers to learn from and make predictions or decisions based on data. Unlike traditional programming, where instructions are explicitly given, ML algorithms improve automatically through experience.' metadata={'summary': 'Introduction to ML', 'category': 'Basics'}
page_content='Supervised learning, a primary method within Machine Learning, involves training an algorithm on a labeled dataset, where the correct output is known, enabling the model to learn over time to predict the output when giv

In [None]:
# Retrieval QA Chain (Prompt Template)

template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=ml_db.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

result = qa_chain({"query": question})

print(result['result'])
print(result['source_documents'])


Some of the main techniques used in Machine Learning include supervised learning, unsupervised learning, and reinforcement learning. Thanks for asking!
[Document(page_content='Deep Learning, a subset of Machine Learning, utilizes neural networks with many layers. These algorithms are particularly powerful in handling vast amounts of data and are behind many advances in image and speech recognition, language translation, and autonomous vehicles.', metadata={'summary': 'Deep Learning and Neural Networks', 'category': 'Advanced Topics'}), Document(page_content='Machine Learning, a branch of artificial intelligence, focuses on the development of algorithms that enable computers to learn from and make predictions or decisions based on data. Unlike traditional programming, where instructions are explicitly given, ML algorithms improve automatically through experience.', metadata={'summary': 'Introduction to ML', 'category': 'Basics'}), Document(page_content='Supervised learning, a primary me

In [None]:
# Retrieval QA Chain (Stuff)

qa_chain_stuff = RetrievalQA.from_chain_type(
    llm,
    retriever=ml_db.as_retriever(),
    return_source_documents=True,
    chain_type="stuff"
)

result = qa_chain_stuff({"query": question})

print(result['result'])
print(result['source_documents'])

Some of the main techniques used in Machine Learning include:

1. Supervised Learning: This technique involves training a model on labeled data, where the correct output is known. The model learns to make predictions or decisions based on this labeled data.

2. Unsupervised Learning: In this technique, the model learns from unlabeled data, finding patterns or structures within the data without any predefined labels. Clustering and dimensionality reduction are common tasks in unsupervised learning.

3. Reinforcement Learning: This technique involves training a model to make decisions in an environment, where it receives feedback in the form of rewards or penalties. The model learns to take actions that maximize the cumulative reward over time.

4. Deep Learning: Deep Learning is a subset of Machine Learning that uses neural networks with many layers. These networks are capable of learning complex patterns and representations from large amounts of data.

5. Transfer Learning: This techni

In [None]:
# Retrieval QA Chain (Map Reduce)

qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=ml_db.as_retriever(),
    return_source_documents=True,
    chain_type="map_reduce"
)

result = qa_chain_mr({"query": question})

print(result['result'])
print(result['source_documents'])

Some of the main techniques used in Machine Learning include:

1. Supervised Learning: This technique involves training a model on labeled data, where the input data is paired with the corresponding correct output. The model learns to make predictions based on this labeled data.

2. Unsupervised Learning: In this technique, the model is trained on unlabeled data, where there are no predefined output labels. The model learns to find patterns, structures, or relationships in the data without any guidance.

3. Reinforcement Learning: This technique involves training a model to make decisions in an environment to maximize a reward. The model learns through trial and error, receiving feedback in the form of rewards or penalties based on its actions.

4. Deep Learning: Deep Learning is a subset of Machine Learning that focuses on training deep neural networks with multiple layers. These networks can learn complex patterns and representations from large amounts of data.

5. Transfer Learning:

In [None]:
# Retrieval QA Chain (Refine)

qa_chain_refine = RetrievalQA.from_chain_type(
    llm,
    retriever=ml_db.as_retriever(),
    return_source_documents=True,
    chain_type="refine"
)

result = qa_chain_refine({"query": question})

print(result['result'])
print(result['source_documents'])

The original answer remains relevant and comprehensive in addressing the main techniques used in Machine Learning. However, considering the additional context, it is important to emphasize the significance of data quality and relevance in training Machine Learning models. Biased or incomplete data can indeed lead to inaccurate models that perpetuate biases or fail to perform as intended. Therefore, it is crucial to ensure that the data used for training is representative, diverse, and free from biases to achieve reliable and unbiased models.
[Document(page_content='Deep Learning, a subset of Machine Learning, utilizes neural networks with many layers. These algorithms are particularly powerful in handling vast amounts of data and are behind many advances in image and speech recognition, language translation, and autonomous vehicles.', metadata={'summary': 'Deep Learning and Neural Networks', 'category': 'Advanced Topics'}), Document(page_content='Machine Learning, a branch of artificia