<a href="https://colab.research.google.com/github/Alex112525/LangChain-with-LLMs/blob/main/Embeddings_and_VectorStores_Langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install langchain pypdf sentence_transformers chromadb

In [2]:
import numpy as np

## Embeddings

An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors.

Embeddings make it easier to do machine learning on large inputs such as sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space.

#### HuggingFace Embeddings

In [3]:
from langchain.embeddings import HuggingFaceEmbeddings
embedding = HuggingFaceEmbeddings()

In [4]:
sentences = ["I like Hamburguers", "I like Pizza", "I like Programming", "The weather is cool outside"]
embeddings = [embedding.embed_query(sen) for sen in sentences]

In [5]:
print(embeddings[0])
len(embeddings[0])

[-0.02939228154718876, 0.08986327797174454, -0.0007837651646696031, -0.04436204954981804, -0.010849258862435818, 0.01615643873810768, -0.02946549840271473, -0.02205762080848217, 0.03789498284459114, 0.00710664689540863, -0.036479778587818146, -0.0306805782020092, -0.008992488496005535, -0.016738422214984894, -0.01777016371488571, -0.027559597045183182, 0.006301789544522762, -9.311235044151545e-05, -0.05282406881451607, -0.027767103165388107, -0.025167597457766533, -0.004394760355353355, -0.0152291813865304, 0.019534755498170853, 0.026362599804997444, -0.0005381685914471745, 0.0051782988011837006, -0.02530023083090782, -0.016055455431342125, -0.0001226386521011591, -0.002834279090166092, -0.03776457533240318, -0.03368791192770004, 0.016238775104284286, 1.3363209063754766e-06, 0.02781008929014206, -0.05031516030430794, 0.015401276759803295, -0.009767613373696804, 0.030515333637595177, 0.004960845224559307, -0.04259418323636055, -0.041071947664022446, -0.0077319275587797165, 0.01189866289

768

To determine whether two sentences are similar, you can compute the cosine similarity between their embeddings. Cosine similarity is the cosine of the angle between two non-zero vectors, so it ranges from -1 (opposite directions) to 1 (same direction). The cells below use `np.dot`, which equals the cosine similarity only when both vectors have unit length.
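The definition above can be written out explicitly. The following is a minimal sketch, not part of the original notebook: `cosine_similarity` is a helper name chosen here, and the toy vectors are made up for illustration rather than taken from a real embedding model. It shows that the raw dot product grows with vector length, while the cosine depends only on direction.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors (illustrative only, not real embeddings).
u = [3.0, 4.0]
v = [6.0, 8.0]   # same direction as u, twice as long
w = [-4.0, 3.0]  # perpendicular to u

print(np.dot(u, v))             # raw dot product, scales with vector length
print(cosine_similarity(u, v))  # vectors pointing in the same direction
print(cosine_similarity(u, w))  # perpendicular vectors
```

If an embedding model already returns unit-length vectors, `np.dot` and this helper give the same value; otherwise only the helper yields a score bounded by -1 and 1.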

In [6]:
simil = np.dot(embeddings[0], embeddings[1])
print(f"The similarity between '{sentences[0]}' and '{sentences[1]}' is: {round(simil*100,2)}%")

The similarity between 'I like Hamburguers' and 'I like Pizza' is: 38.55%


In [7]:
simil = np.dot(embeddings[1], embeddings[3])
print(f"The similarity between '{sentences[1]}' and '{sentences[3]}' is: {round(simil*100,2)}%")

The similarity between 'I like Pizza' and 'The weather is cool outside' is: 6.31%


#### SpacyEmbeddings

In [8]:
from langchain.embeddings import SpacyEmbeddings
s_emb = SpacyEmbeddings()

In [9]:
sentences = ["I like Hamburguers", "I like Pizza", "I like Programming", "The weather is cool outside"]
s_embeddings = [s_emb.embed_query(sen) for sen in sentences]

In [10]:
print(s_embeddings[0])
len(s_embeddings[0])

[-0.9977489113807678, 0.362231969833374, -0.662943959236145, 0.5147702097892761, 0.31709548830986023, -0.7147340774536133, 1.5253783464431763, 0.020086606964468956, 0.5353927612304688, -0.23669223487377167, 0.47885391116142273, 0.08887312561273575, 0.053243715316057205, -0.24007479846477509, -0.6280664801597595, -0.6196134686470032, 0.2873343825340271, 0.4676531255245209, -0.015774646773934364, -0.22300869226455688, -0.6867689490318298, -0.6164547801017761, -0.6381105780601501, 0.027278026565909386, 0.3923916816711426, -0.24323658645153046, 0.016015464439988136, 0.3272424638271332, -0.05785775184631348, -0.05975094437599182, 0.3496571481227875, 0.0060227313078939915, -0.00799520779401064, 0.6956599354743958, -0.674931526184082, -0.14988763630390167, 0.002886096714064479, 0.22291512787342072, 0.2595116198062897, -0.6202573776245117, -0.45861172676086426, 0.24878716468811035, -0.3522361218929291, 0.15946926176548004, -0.6310560703277588, -0.3818489611148834, 0.859626829624176, 1.07424688

96

In [11]:
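# The product below uses sentences 0 and 1, so the label must reference the same pair.
# Note: these spaCy vectors are not unit-length, so the value is a raw dot product rather than a true percentage.
simil = np.dot(s_embeddings[0], s_embeddings[1])
print(f"The similarity between '{sentences[0]}' and '{sentences[1]}' is: {round(simil,2)}%")

The similarity between 'I like Hamburguers' and 'I like Pizza' is: 23.09%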


In [12]:
simil = np.dot(s_embeddings[1], s_embeddings[3])
print(f"The similarity between '{sentences[1]}' and '{sentences[3]}' is: {round(simil,2)}%")

The similarity between 'I like Pizza' and 'The weather is cool outside' is: 4.31%


How well an embedding model works depends on several factors, such as the size of the dataset, the complexity of the data, and the type of problem you are trying to solve. No single model is best everywhere: some models perform better than others for a given domain, so it is worth comparing candidates on your own data.

## Vector Stores

**Chroma** is an open-source, lightweight embedding database (a vector store) designed to make it easy to build AI applications with embeddings. It can be used to store embeddings locally.
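To make the idea of a vector store concrete, here is a minimal in-memory sketch. It is an illustration under simplifying assumptions, not how Chroma is implemented: `ToyVectorStore` is a hypothetical class, the vectors are hand-written stand-ins for embeddings, and the search is a brute-force cosine comparison against every stored vector.

```python
import numpy as np

class ToyVectorStore:
    """Hypothetical in-memory store: brute-force cosine search over all vectors."""

    def __init__(self):
        self._vectors = []
        self._texts = []

    def add(self, text, vector):
        v = np.asarray(vector, dtype=float)
        self._vectors.append(v / np.linalg.norm(v))  # store unit-length vectors
        self._texts.append(text)

    def similarity_search(self, query_vector, k=3):
        q = np.asarray(query_vector, dtype=float)
        q = q / np.linalg.norm(q)
        scores = np.stack(self._vectors) @ q  # cosine score for each stored text
        top = np.argsort(-scores)[:k]         # indices of the k highest scores
        return [(self._texts[i], float(scores[i])) for i in top]

# Hand-written 2-D vectors standing in for real embeddings.
store = ToyVectorStore()
store.add("pizza", [1.0, 0.1])
store.add("burger", [0.9, 0.2])
store.add("weather", [0.0, 1.0])

results = store.similarity_search([1.0, 0.0], k=2)
print(results)
```

A real vector store adds persistence, metadata, and indexing so that it does not need to compare the query against every stored vector.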

In [13]:
from langchain.vectorstores import Chroma
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [14]:
loaders = [
    PyPDFLoader("/content/Attention.pdf"), # https://arxiv.org/abs/1706.03762
    PyPDFLoader("/content/Bert.pdf")      # https://arxiv.org/abs/1810.04805v2
]

docs = []
for loader in loaders:
  docs.extend(loader.load())

In [15]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

splits = text_splitter.split_documents(docs)

In [16]:
len(splits)

90
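The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified fixed-window splitter. This is only a sketch: `split_with_overlap` is a hypothetical helper, and `RecursiveCharacterTextSplitter` behaves differently because it prefers to cut at separators such as paragraph and line breaks, so its chunks rarely land exactly on these boundaries.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-window splitter: each chunk starts `chunk_size - chunk_overlap`
    characters after the previous one, so neighbours share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

The overlap keeps a sentence that straddles a boundary at least partly visible in both neighbouring chunks, at the cost of storing some text twice.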

In [17]:
splits[1]

Document(page_content='our model establishes a new single-model state-of-the-art BLEU score of 41.8 after\ntraining for 3.5 days on eight GPUs, a small fraction of the training costs of the\nbest models from the literature. We show that the Transformer generalizes well to\nother tasks by applying it successfully to English constituency parsing both with\nlarge and limited training data.\n∗Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started\nthe effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and\nhas been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head\nattention and the parameter-free position representation and became the other person involved in nearly every\ndetail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and\ntensor2tensor. Llion also experimented with novel m

In [18]:
persist_dir = "docs/chroma"

In [19]:
# !rm -rf ./docs/chroma/  # remove old database files

In [20]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding = HuggingFaceEmbeddings(),
    persist_directory=persist_dir
)

In [21]:
print(vectordb._collection.count())

354


### Similarity search

In [22]:
question = "What are attention mechanism"

In [23]:
docs_f = vectordb.similarity_search(question, k=3)

In [24]:
docs_f[0]

Document(page_content='Attention Visualizations\nInput-Input Layer5\nIt\nis\nin\nthis\nspirit\nthat\na\nmajority\nof\nAmerican\ngovernments\nhave\npassed\nnew\nlaws\nsince\n2009\nmaking\nthe\nregistration\nor\nvoting\nprocess\nmore\ndifficult\n.\n<EOS>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\nIt\nis\nin\nthis\nspirit\nthat\na\nmajority\nof\nAmerican\ngovernments\nhave\npassed\nnew\nlaws\nsince\n2009\nmaking\nthe\nregistration\nor\nvoting\nprocess\nmore\ndifficult\n.\n<EOS>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\nFigure 3: An example of the attention mechanism following long-distance dependencies in the\nencoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of\nthe verb ‘making’, completing the phrase ‘making...more difficult’. Attentions here shown only for\nthe word ‘making’. Different colors represent different heads. Best viewed in color.\n13', metadata={'page': 12, 'source': '/content/Attention.pdf'})

In [25]:
for doc in docs_f:
  print(doc.metadata)

{'page': 12, 'source': '/content/Attention.pdf'}
{'page': 12, 'source': '/content/Attention.pdf'}
{'page': 12, 'source': '/content/Attention.pdf'}
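All three hits above come from page 12 of the same PDF, and the collection count printed earlier (354) is larger than the number of splits (90). One plausible explanation, which is an assumption rather than something this notebook confirms, is that the persisted collection in `docs/chroma` already held copies of the chunks from earlier runs. The sketch below filters repeated hits after retrieval. `unique_documents` is a hypothetical helper, not part of LangChain, and the sample objects are stand-ins that only mimic the `page_content` and `metadata` attributes of the results.

```python
from types import SimpleNamespace

def unique_documents(docs):
    """Keep the first occurrence of each (text, source, page) combination."""
    seen = set()
    unique = []
    for doc in docs:
        key = (doc.page_content, doc.metadata.get("source"), doc.metadata.get("page"))
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

# Stand-in objects with the same two attributes as the search results.
hits = [
    SimpleNamespace(page_content="chunk A", metadata={"page": 12, "source": "a.pdf"}),
    SimpleNamespace(page_content="chunk A", metadata={"page": 12, "source": "a.pdf"}),
    SimpleNamespace(page_content="chunk B", metadata={"page": 2, "source": "a.pdf"}),
]

for doc in unique_documents(hits):
    print(doc.metadata, doc.page_content)
```

Filtering after retrieval only hides duplicates; clearing the persist directory before rebuilding the database avoids storing them in the first place.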


In [26]:
question_2 = "What is a transformer"
docs_f = vectordb.similarity_search(question_2, k=5)

In [29]:
docs_f[0]

Document(page_content='Figure 1: The Transformer - model architecture.\nThe Transformer follows this overall architecture using stacked self-attention and point-wise, fully\nconnected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,\nrespectively.\n3.1 Encoder and Decoder Stacks\nEncoder: The encoder is composed of a stack of N= 6 identical layers. Each layer has two\nsub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-\nwise fully connected feed-forward network. We employ a residual connection [ 11] around each of\nthe two sub-layers, followed by layer normalization [ 1]. That is, the output of each sub-layer is\nLayerNorm( x+ Sublayer( x)), where Sublayer( x)is the function implemented by the sub-layer\nitself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding\nlayers, produce outputs of dimension dmodel = 512 .', metadata={'page': 2, 'source': '/cont

In [30]:
for doc in docs_f:
  print(doc.metadata)

{'page': 2, 'source': '/content/Attention.pdf'}
{'page': 2, 'source': '/content/Attention.pdf'}
{'page': 2, 'source': '/content/Attention.pdf'}
{'page': 1, 'source': '/content/Attention.pdf'}
{'page': 1, 'source': '/content/Attention.pdf'}
