
## Retrieval augmented generation

In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution.

This is useful if we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).
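To make the retrieve-then-generate idea concrete, here is a minimal sketch that uses plain word overlap instead of embeddings and stops at building the prompt, without calling an LLM. The corpus and the helper names (`DOCUMENTS`, `tokenize`, `retrieve`, `build_prompt`) are made up for illustration and are not part of LangChain or the OpenAI client.

```python
import re

# Hypothetical mini-corpus standing in for the external dataset.
DOCUMENTS = [
    "CS229 is a machine learning class taught at Stanford.",
    "Chroma is a vector database for storing embeddings.",
    "PDF files can be loaded page by page.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=1):
    """Rank documents by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_docs):
    """Place the retrieved text ahead of the question, as a RAG prompt would."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


question = "What is CS229?"
top_docs = retrieve(question, DOCUMENTS, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
```

In the rest of this notebook, the word-overlap ranking is replaced by embedding similarity in a vector store, and the prompt is sent to a chat model.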

In [None]:
! pip install langchain
! pip install openai
! pip install langchain-community
! pip install pypdf
! pip install tiktoken
! pip install chromadb
! pip install lark


In [None]:
import os
import openai
import sys
sys.path.append('/content/')

In [None]:
from google.colab import userdata
userdata.get('DUMMY')

'apple'

In [None]:
from google.colab import userdata

openai_api_key = userdata.get('OPENAI_API_KEY')
userdata.get('DUMMY_API_KEY')

'sk-XXXXXXXXXXXXXXX'

In [None]:
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

In [None]:
from openai import OpenAI
client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "how many r in strawberry?"
        }
    ]
)

# Display the assistant message returned by the API
completion.choices[0].message

ChatCompletionMessage(content='The word "strawberry" contains three instances of the letter \'r\'.', refusal=None, role='assistant', function_call=None, tool_calls=None)


In [None]:
completion.choices[0].message.content

'The word "strawberry" contains three instances of the letter \'r\'.'

## PDFs

Let's load a PDF [transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf) from Andrew Ng's famous CS229 course! These documents are the result of automated transcription, so words and sentences are sometimes split unexpectedly.

In [None]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("/content/MachineLearning-Lecture01.pdf")
pages = loader.load()

Each page is a `Document`.

A `Document` contains text (`page_content`) and `metadata`.
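As a rough mental model, a document object can be pictured as a small record holding the text and a metadata dictionary. The `SimpleDocument` class below is a hypothetical stand-in written for illustration; it is not LangChain's `Document` class, and the file name `example.pdf` is made up.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded page: text plus descriptive metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# One object per page, mirroring the shape of what a PDF loader returns.
example_pages = [
    SimpleDocument("Text of the first page.", {"source": "example.pdf", "page": 0}),
    SimpleDocument("Text of the second page.", {"source": "example.pdf", "page": 1}),
]

for doc in example_pages:
    print(doc.metadata["page"], doc.page_content[:20])
```

Keeping the source and page number next to the text is what later allows an answer to be traced back to the page it came from.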

In [None]:
len(pages)

2

In [None]:
page = pages[0]

In [None]:
print(page.page_content[0:500])

MachineLearning-Lecture01  
Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is ju st spend a little time going over the logistics 
of the class, and then we'll start to  talk a bit about machine learning.  
By way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so 
I personally work in machine learning, and I' ve worked on it for about 15 years now, and 
I actually think that machine learning i


In [None]:
page.metadata

{'source': 'Sample.pdf', 'page': 0}

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
chunk_size = 26
chunk_overlap = 4

In [None]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Why doesn't this split the string below?

In [None]:
text1 = 'abcdefghijklmnopqrstuvwxyz'

In [None]:
r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

In [None]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'

In [None]:
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

OK, this splits the string. The second chunk starts with `wxyz`, which matches the overlap of 4 set in `chunk_overlap`. Try other values to see how the chunks change.
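To see how `chunk_size` and `chunk_overlap` interact on a string with no separators, here is a simplified fixed-window chunker. It is a sketch written for this explanation (the function name `chunk_chars` is hypothetical), not the algorithm `RecursiveCharacterTextSplitter` uses internally, although it gives the same chunks for these two strings.

```python
def chunk_chars(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


alphabet = "abcdefghijklmnopqrstuvwxyz"      # 26 characters
longer = "abcdefghijklmnopqrstuvwxyzabcdefg"  # 33 characters

print(chunk_chars(alphabet, chunk_size=26, chunk_overlap=4))
# ['abcdefghijklmnopqrstuvwxyz']
print(chunk_chars(longer, chunk_size=26, chunk_overlap=4))
# ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
```

The 26-character string fits in a single chunk because its length equals `chunk_size`, so nothing needs to be split. The 33-character string needs a second window, which starts 22 characters in (26 minus 4) and therefore repeats `wxyz`.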

In [None]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

Try your own examples!

## Recursive splitting details

`RecursiveCharacterTextSplitter` is recommended for generic text.

In [None]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [None]:
len(some_text)

496

In [None]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)

In [None]:
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
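The idea behind recursive splitting can be sketched as follows: try the coarsest separator first, merge neighbouring pieces while they still fit in `chunk_size`, and fall back to the next separator only for pieces that are still too long. This is a simplified illustration without overlap handling; the function `recursive_split` and the sample text are made up here and do not reproduce the library's exact behaviour.

```python
def recursive_split(text, chunk_size, separators):
    """Split text on the first usable separator, recursing into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small pieces
            continue
        if current:
            chunks.append(current)       # flush what has been merged so far
        if len(piece) > chunk_size:
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "para one is here.\n\npara two is a bit longer than one."
parts = recursive_split(sample, chunk_size=30, separators=["\n\n", "\n", " ", ""])
for part in parts:
    print(repr(part))
```

The first paragraph fits as a whole, so it stays intact; only the second paragraph, which exceeds 30 characters, is broken further at spaces.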

# Vectorstores and Embeddings

Recall the overall workflow for retrieval augmented generation (RAG):

We just discussed `Document Loading` and `Splitting`.

In [None]:
from langchain_community.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Only one PDF is loaded here; add more loaders to this list as needed
    PyPDFLoader("/content/MachineLearning-Lecture01.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [None]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [None]:
splits = text_splitter.split_documents(docs)

In [None]:
len(splits)

4

## Embeddings

Let's take our splits and embed them.

In [None]:
from langchain_community.embeddings import OpenAIEmbeddings
embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)

In [None]:
word1 = "i love burgers"
word2 = "the weather is great today"
word3 = "Its raining outside"

In [None]:
embedding1 = embedding.embed_query(word1)
embedding2 = embedding.embed_query(word2)
embedding3 = embedding.embed_query(word3)

In [None]:
import numpy as np

In [None]:
np.dot(embedding1, embedding3)

0.7477729189424241
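A raw dot product is a fair similarity score only when the vectors have the same length. Assuming the embedding vectors are normalized to unit length (treat this as an assumption to verify for the embedding model in use), the dot product equals the cosine similarity. The sketch below checks that relationship on small made-up vectors using only the standard library.

```python
import math


def dot(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def normalize(u):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]


# Made-up 3-dimensional vectors; real embeddings have far more dimensions.
a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

print(cosine_similarity(a, b))           # cosine similarity of the raw vectors
print(dot(normalize(a), normalize(b)))   # dot product after normalizing both
```

Both printed values agree up to floating-point rounding, which is why a plain `np.dot` can serve as a similarity score for unit-length embeddings.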

## Vectorstores

In [None]:
from langchain_community.vectorstores import Chroma

In [None]:
persist_directory = 'docs/chroma/'

In [None]:
!rm -rf ./docs/chroma  # remove old database files if any

In [None]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

In [None]:
print(vectordb._collection.count())

4


### Similarity Search

In [None]:
question = "What is CS229?"

In [None]:
docs = vectordb.similarity_search(question,k=3)

In [None]:
len(docs)

3
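Conceptually, a similarity search with `k=3` scores every stored vector against the query vector and keeps the three best matches. The brute-force sketch below illustrates this with made-up two-dimensional vectors; the `store` contents and the function `top_k_search` are hypothetical, and a real vector store such as Chroma typically relies on an index rather than scanning every entry.

```python
import heapq


def dot(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_search(query_vec, store, k=3):
    """Return the k (text, score) pairs whose vectors score highest against the query."""
    scored = ((text, dot(query_vec, vec)) for text, vec in store)
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical store of (chunk text, embedding) pairs.
store = [
    ("chunk about course logistics", [0.9, 0.1]),
    ("chunk about homework", [0.6, 0.4]),
    ("chunk about linear algebra", [0.2, 0.8]),
    ("chunk about office hours", [0.7, 0.3]),
]

query_vec = [1.0, 0.0]
results = top_k_search(query_vec, store, k=3)
for text, score in results:
    print(f"{score:.2f}  {text}")
```

With four stored chunks and `k=3`, the lowest-scoring chunk is left out, which mirrors why `len(docs)` above is 3 even though the collection holds 4 entries.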

Let's save this so we can use it later!

In [None]:
vectordb.persist()


# Retrieval

Retrieval is the centerpiece of our retrieval augmented generation (RAG) flow.

Let's get our vectorDB from before.

## Vectorstore retrieval


### Similarity Search

In [None]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OpenAIEmbeddings
persist_directory = 'docs/chroma/'

In [None]:
embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)


In [None]:
print(vectordb._collection.count())

4


# Question Answering

We discussed `Document Loading` and `Splitting` as well as `Storage` and `Retrieval`.

Let's load our vectorDB.

The cell below sets the OpenAI model name explicitly. The course originally pinned the model version used during filming until its deprecation (September 2023 at the time of writing).
LLM responses often vary, and they may differ significantly between model versions.

In [None]:
llm_name = "gpt-3.5-turbo"

In [None]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OpenAIEmbeddings
persist_directory = 'docs/chroma/'
embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

In [None]:
print(vectordb._collection.count())

4


In [None]:
question = "What is CS229?"
docs = vectordb.similarity_search(question,k=3)
len(docs)

3

In [None]:
from langchain_community.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name=llm_name, temperature=0, openai_api_key=openai_api_key)

### RetrievalQA chain

In [None]:
from langchain.chains import RetrievalQA

In [None]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)

In [None]:
result = qa_chain.invoke({"query": question})

In [None]:
result["result"]

'CS229 is a machine learning class taught by Andrew Ng.'

### Prompt

In [None]:
from langchain.prompts import PromptTemplate

# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
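The `{context}` and `{question}` placeholders are filled in before the prompt reaches the model. The sketch below imitates that step with plain `str.format` on a shortened copy of the template, assuming the retrieved chunks are simply joined with blank lines; the chunk texts are made up for illustration, and the chain's exact formatting may differ.

```python
# Shortened copy of the prompt template, using the same two placeholders.
template = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Helpful Answer:"""

# Hypothetical retrieved chunks standing in for the retriever output.
retrieved_chunks = ["Hypothetical chunk one.", "Hypothetical chunk two."]

# Join the chunks into one context string and substitute both placeholders.
filled = template.format(
    context="\n\n".join(retrieved_chunks),
    question="What is CS229?",
)
print(filled)
```

Printing the filled prompt like this is a quick way to check what the model actually sees when an answer looks off.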


In [None]:
# Run chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

In [None]:
llm.invoke("Please summarize the logistics of CS229.").content

'CS229 is a graduate-level course at Stanford University that covers machine learning and statistical pattern recognition. The course covers topics such as supervised learning, unsupervised learning, deep learning, and reinforcement learning. Students are expected to complete programming assignments, a final project, and a final exam. The course is typically taught in a lecture format with additional discussion sections and office hours for students to ask questions and get help with the material.'

In [None]:
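# Re-run the chain built with return_source_documents=True so the key below exists
result = qa_chain.invoke({"query": question})
result["source_documents"][0]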

Document(metadata={'page': 0, 'source': 'Sample.pdf'}, page_content='Arif)\n10\nModule\n5:\nHands-on\nwith\nGenerative\nAI\nModels\n(Trainer:\nZubair\nZafar)\n11\nModule\n6:\nDeveloping\nGenerative\nAI\nApplications\n(Trainer:\nMuhammad\nDanish\nIqbal)\n12\nModule\n7:\nResponsible\nUse\nof\nGenerative\nAI\n(Trainer:\nAbdullah\nArif)\n12\nModule\n8:\nHow\nto\nSell\nThese\nGenerative\nAI\nSkills\nLike\na\nPro\n(Trainer:\nHassan\nSyed\n,\nNaeem \nZafar,\nM.\nAnwar\nKhan)\n12\n6.0\nFollow\nup\nTechnical\nSessions:\nAll\nWeekdays\nOnline\n13\nPurpose\nof\nFollow\nup\nOnline\nSessions\n13\n7.0\nHackathon\nConducted\nat\nthe\nEnd\nof\nGen-AI\nTraining\n15\nIntroduction\nto\nHackathons\n15\niCodeGuru\nand\nHackathons\n15')

Note: the LLM response varies. Some responses **do** include a reference to probability, which might be gleaned from the referenced documents. The point is simply that the model does not have access to past questions or answers; this will be covered in the next section.

# References
https://learn.deeplearning.ai/courses/langchain-chat-with-your-data/


