# **LangChain:** Embeddings and Vectorstores

Recall the overall workflow for retrieval augmented generation (RAG):

![overview.jpeg](Images/Document_Loading.jpg)

In [1]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']  # raises KeyError if the variable is not set

We just discussed `Document Loading` and `Splitting`.

In [2]:
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("/home/centrox_ai/Desktop/ABDULLAH/llama2/rag/1.pdf"),
    PyPDFLoader("/home/centrox_ai/Desktop/ABDULLAH/llama2/rag/1.pdf"),
    PyPDFLoader("/home/centrox_ai/Desktop/ABDULLAH/llama2/rag/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("/home/centrox_ai/Desktop/ABDULLAH/llama2/rag/a.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [3]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [4]:
splits = text_splitter.split_documents(docs)

In [5]:
len(splits)

2225
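
To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and newlines), and the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Toy example: 10-character windows that overlap by 3 characters.
sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each window repeats the last 3 characters of the previous one, which is the role `chunk_overlap = 150` plays above: context at a chunk boundary is not lost entirely.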

## Embeddings

Let's take our splits and embed them.

In [6]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

In [7]:
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

In [9]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [10]:
import numpy as np

In [11]:
np.dot(embedding1, embedding2)

0.9631227500523626

In [12]:
np.dot(embedding1, embedding3)

0.7703257495981698

In [13]:
np.dot(embedding2, embedding3)

0.759162740110803
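
The dot products above work as similarity scores because OpenAI documents its embedding vectors as normalized to length 1, in which case the dot product equals cosine similarity. For vectors that are not normalized, the norms have to be divided out explicitly. A minimal sketch with made-up 3-dimensional vectors (real embeddings have far more dimensions, and the helper name is hypothetical):

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors, not real embeddings.
v1 = [1.0, 2.0, 0.0]
v2 = [2.0, 4.0, 0.0]   # same direction as v1, different length
v3 = [0.0, 0.0, 5.0]   # orthogonal to v1

print(cosine_similarity(v1, v2))  # close to 1.0: same direction
print(cosine_similarity(v1, v3))  # 0.0: no shared direction
```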

## Vectorstores

In [14]:
# ! pip install chromadb==0.4.3

In [8]:
from langchain.vectorstores import Chroma

In [9]:
persist_directory = '/home/centrox_ai/Desktop/ABDULLAH/langchain/LangChain-Chat-with-your-Data/chroma/'

In [11]:
!rm -rf /home/centrox_ai/Desktop/ABDULLAH/langchain/LangChain-Chat-with-your-Data/chroma  # remove old database files if any (absolute path, matching persist_directory)

In [12]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

In [13]:
print(vectordb._collection.count())

2225


### Similarity Search

In [28]:
question = "who was the successor of Aurelian after he was murdered?"

In [29]:
docs = vectordb.similarity_search(question,k=3)

In [30]:
len(docs)

3

In [31]:
docs[0].page_content

'314 A History of Rome to 565 A. D.\nsuppression of the debased silver currency and the issuing of\na much improved coinage. Aurelian regarded himself as an\nabsolute monarch and employed on his coins the titles dominus\net deus natus —“born Lord and God. ”He likewise reëstablished\nin Rome the official cult of the Unconquered Sun God, previously\nintroduced by Elagabalus. One of the characteristics of this cult\nwas the belief that the monarch was the incarnation of the divine\nspirit, a belief which gave a moral justification to absolutism.\nProbus, 276 –282 A. D. Aurelian was murdered in 275 A. D.,\nand was succeeded by Tacitus, who met a like fate after a rule of\nless than two years. He was followed by Marcus Aurelius Probus,\nan able Illyrian officer. Probus was called upon to repel fresh\ninvasions of Germanic peoples, to subdue the rebellious Isaurians [263]\nin Asia Minor and suppress a revolt in Egypt. Everywhere he\nsuccessfully upheld the authority of the empire, but his st
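
Conceptually, `similarity_search` embeds the question and returns the `k` stored chunks whose vectors are closest to it. The sketch below does this with an exhaustive scan over made-up 2-dimensional vectors. It is an illustration only: production vector stores generally rely on specialized indexes rather than a full scan, and `top_k` is a hypothetical helper, not part of LangChain or Chroma.

```python
import heapq


def top_k(query_vec, store, k):
    """Return the k (score, text) pairs with the highest dot product to query_vec.

    store is a list of (text, vector) pairs; vectors are assumed unit-length,
    so the dot product acts as cosine similarity.
    """
    scored = (
        (sum(q * v for q, v in zip(query_vec, vec)), text)
        for text, vec in store
    )
    return heapq.nlargest(k, scored)


# Made-up unit vectors standing in for chunk embeddings.
store = [
    ("chunk about Roman emperors", [1.0, 0.0]),
    ("chunk about coinage", [0.8, 0.6]),
    ("chunk about linear regression", [0.0, 1.0]),
]
query = [0.6, 0.8]  # hypothetical embedding of a question

for score, text in top_k(query, store, k=2):
    print(f"{score:.2f}  {text}")
```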

Let's save this so we can use it later!

In [32]:
vectordb.persist()

## Failure modes

This seems great, and basic similarity search will get you 80% of the way there very easily. 

But there are some failure modes that can creep up. 

Here are some edge cases that can arise - we'll fix them in the next class.

In [33]:
question = "what did they say about matlab?"

In [34]:
docs = vectordb.similarity_search(question,k=5)

Notice that duplicate chunks can appear in the results, because one PDF was deliberately loaded twice above and each of its chunks is therefore indexed twice.

Semantic search fetches the most similar chunks, but does not enforce diversity among them.

When a duplicated chunk ranks highest, `docs[0]` and `docs[1]` are identical.
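
One simple way to spot or drop exact duplicates is to compare chunk text after retrieval. The sketch below is a hypothetical post-processing helper that operates on plain strings (for LangChain documents you would pass each `doc.page_content`). It is not the fix covered in the next class, and it only catches exact matches.

```python
def drop_exact_duplicates(texts):
    """Keep the first occurrence of each text, preserving the original order."""
    seen = set()
    unique = []
    for text in texts:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


# Made-up retrieval results where the top two chunks are the same text.
retrieved = ["chunk A", "chunk A", "chunk B", "chunk C", "chunk B"]
print(drop_exact_duplicates(retrieved))
```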

In [35]:
docs[0]

Document(page_content='algorithm then? So what’s different? How come  I was making all that noise earlier about \nleast squares regression being a bad idea for classification problems and then I did a \nbunch of math and I skipped some steps, but I’m, sort of, claiming at the end they’re \nreally the same learning algorithm?  \nStudent: [Inaudible] constants?  \nInstructor (Andrew Ng) :Say that again.  \nStudent: [Inaudible]  \nInstructor (Andrew Ng) :Oh, right. Okay, cool.', metadata={'page': 13, 'source': '/home/centrox_ai/Desktop/ABDULLAH/llama2/rag//MachineLearning-Lecture03.pdf'})

In [36]:
docs[1]

Document(page_content="amount of notation. We'll probably all get used  to it in a few days and we'll standardize \nnotation and make a lot of our descripti ons of learning algorithms a lot easier.  \nBut again, if you see me write some symbol and you don't quite remember what it means, \nchances are there are others in  this class who've forgotten too. So please raise your hand \nand ask if you're ever wondering what some  symbol means. Any questions you have \nabout any of this?  \nYeah?  \nStudent: The variable can be anything? [Inaudible]?  \nInstructor (Andrew Ng) :Say that again.  \nStudent: [inaudible] zero theta one?  \nInstructor (Andrew Ng) :Right, so, well let me – this was going to be next, but the theta \nor the theta Is are called the parameters. Th e thetas are called the parameters of our \nlearning algorithm and theta zero, theta one, th eta two are just real numbers. And then it \nis the job of the learning algor ithm to use the training set to choose or to learn appr

We can see a new failure mode.

The question below asks about the third lecture, but the results include chunks from other lectures as well.

In [37]:
question = "what did they say about regression in the third lecture?"

In [38]:
docs = vectordb.similarity_search(question,k=5)

In [39]:
for doc in docs:
    print(doc.metadata)

{'page': 0, 'source': '/home/centrox_ai/Desktop/ABDULLAH/llama2/rag//MachineLearning-Lecture03.pdf'}
{'page': 2, 'source': '/home/centrox_ai/Desktop/ABDULLAH/llama2/rag//MachineLearning-Lecture02.pdf'}
{'page': 14, 'source': '/home/centrox_ai/Desktop/ABDULLAH/llama2/rag//MachineLearning-Lecture03.pdf'}
{'page': 4, 'source': '/home/centrox_ai/Desktop/ABDULLAH/llama2/rag//MachineLearning-Lecture03.pdf'}
{'page': 0, 'source': '/home/centrox_ai/Desktop/ABDULLAH/llama2/rag//MachineLearning-Lecture02.pdf'}
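
The metadata printed above shows the problem: the phrase "third lecture" is matched semantically rather than treated as a constraint on `source`. As a rough illustration, results can be filtered by metadata after retrieval. The sketch uses plain dictionaries shaped like the output above, with shortened placeholder paths, and `filter_by_source` is a hypothetical helper.

```python
def filter_by_source(metadatas, filename):
    """Keep only metadata entries whose 'source' path ends with filename."""
    return [m for m in metadatas if m["source"].endswith(filename)]


# Dictionaries shaped like doc.metadata; paths are shortened placeholders.
results = [
    {"page": 0, "source": "rag/MachineLearning-Lecture03.pdf"},
    {"page": 2, "source": "rag/MachineLearning-Lecture02.pdf"},
    {"page": 14, "source": "rag/MachineLearning-Lecture03.pdf"},
    {"page": 4, "source": "rag/MachineLearning-Lecture03.pdf"},
    {"page": 0, "source": "rag/MachineLearning-Lecture02.pdf"},
]

for meta in filter_by_source(results, "MachineLearning-Lecture03.pdf"):
    print(meta)
```

Filtering after retrieval can leave fewer than `k` results, which is one reason to prefer applying the constraint as part of the search itself.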


In [40]:
print(docs[4].page_content)

really makes a difference between a good so lution and amazing solution. And to give 
everyone to just how we do points assignments, or what is it that causes a solution to get 
full marks, or just how to write amazing so lutions. Becoming a grad er is usually a good 
way to do that.  
Graders are paid positions and you also get free  food, and it's usually fun for us to sort of 
hang out for an evening and grade all the a ssignments. Okay, so I will send email. So 
don't email me yet if you want to be a grader. I'll send email to the entire class later with 
the administrative details and to solicit app lications. So you can email us back then, to 
apply, if you'd be interested in being a grader.  
Okay, any questions about that? All right, okay, so let's get started with today's material. 
So welcome back to the second lecture. What  I want to do today is talk about linear 
regression, gradient descent, and the norma l equations. And I should also say, lecture 
notes have been posted

Approaches discussed in the next lecture can be used to address both!