<a href="https://colab.research.google.com/github/Rami-RK/Retrieval_Augmented_Generation_RAG/blob/main/RAG_Step_2_Vectorstores_and_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **RAG: Step 2 - Vectorstores and Embeddings**


### **Objectives**

At the end of the experiment you will be able to:

1. convert documents into embeddings
2. understand how the similarity between two documents is measured
3. understand vector stores and store embeddings in a `Vector Store Database`
4. download and store the vector database for future reuse

### **Installing and importing packages**

In [55]:
!pip install openai

In [None]:
!pip install langchain

In [57]:
!pip install pypdf



In [None]:
!pip install chromadb

In [59]:
import os
import openai
import numpy as np

#### **Authentication for OpenAI API**

In [6]:
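# Read the key with a context manager so the file is closed afterwards,
# and strip the trailing newline that would otherwise end up in the key.
with open('/content/openapi_key.txt') as f:
    api_key = f.read().strip()
type(api_key)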

str

In [7]:
os.environ['OPENAI_API_KEY'] = api_key
openai.api_key = os.getenv('OPENAI_API_KEY')

### **Loading the documents**

In [60]:
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("/content/doc1.pdf"),
    PyPDFLoader("/content/doc1.pdf"),
    PyPDFLoader("/content/doc2.pdf"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [61]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 20
)

In [62]:
splits = text_splitter.split_documents(docs)

In [63]:
len(splits)

10

In [None]:
splits

### **Embeddings**

Let's take our splits and embed them.

In [64]:
!pip install tiktoken

In [13]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

### **Understanding similarity search with a toy example**

In [66]:
sentence1 = "i like dogs"
sentence2 = "i like cats"
sentence3 = "the weather is ugly, too hot outside"

In [67]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [68]:
len(embedding1), len(embedding2), len(embedding3)

(1536, 1536, 1536)

In [None]:
embedding1

In [70]:
np.dot(embedding1, embedding2)

0.9205685499367217

In [71]:
np.dot(embedding1, embedding3)

0.7648378172547102

In [72]:
np.dot(embedding2, embedding3)

0.7595550064713039
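
The three dot products above can be read as cosine similarities, assuming the embedding vectors have unit length (OpenAI's documentation describes its embeddings as normalized to length 1). Under that assumption the two "i like ..." sentences score highest because their vectors point in nearly the same direction. The sketch below shows the general formula on hypothetical toy vectors that are not unit length, so the normalization step is visible; the function name and values are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (hypothetical values, not real embeddings)
u = [3.0, 4.0, 0.0]
v = [4.0, 3.0, 0.0]
w = [0.0, 0.0, 5.0]

sim_uv = cosine_similarity(u, v)  # vectors pointing in similar directions
sim_uw = cosine_similarity(u, w)  # orthogonal vectors
print(sim_uv, sim_uw)
```

For vectors that are already unit length, both norms equal 1 and the result reduces to the plain `np.dot` used in the cells above.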

### **Vectorstores**

In [73]:
from langchain.vectorstores import Chroma # Lightweight vector store that runs in-process

In [75]:
persist_directory = 'docs/chroma/'

In [76]:
!rm -rf ./docs/chroma  # remove old database files if any

In [77]:
vectordb = Chroma.from_documents(
    documents=splits, # splits we created earlier
    embedding=embedding,
    persist_directory=persist_directory # directory where the database is saved on disk
)

In [78]:
print(vectordb._collection.count()) # same as the number of splits

10


### **Similarity Search**

In [83]:
question = "What does it say about Mahatma Gandhi?"

In [84]:
docs = vectordb.similarity_search(question, k=3) # k = number of documents to return

In [85]:
len(docs)

3
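
Conceptually, a similarity search embeds the question, scores every stored chunk against it, and returns the `k` best. The sketch below does this exhaustively on hypothetical two-dimensional vectors; it is only an illustration of the idea, since Chroma relies on its own index and distance settings rather than a full scan. The names `top_k` and `store` are made up for this example.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec, stored, k):
    """Return the texts of the k stored items with the highest dot product, best first."""
    ranked = sorted(stored, key=lambda item: dot(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Hypothetical (text, vector) pairs standing in for embedded chunks
store = [
    ("chunk about history", [1.0, 0.0]),
    ("chunk about economy", [0.6, 0.8]),
    ("chunk about cuisine", [0.0, 1.0]),
]
query = [0.8, 0.6]  # hypothetical embedded question

best_two = top_k(query, store, k=2)
print(best_two)
```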

In [86]:
print(docs[0].page_content)

India's struggle for independence from British colonial rule is a significant chapter in its history. 
Led by Mahatma Gandhi and other freedom fighters, the non-violent resistance movement 
gained momentum and eventually led to India's independence on August 15, 1947. This day is 
celebrated annually as Independence Day.
India's economy is one of the fastest-growing in the world. It has transitioned from an agrarian 
economy to a service-oriented and industrialized economy. The country is known for its 
software and information technology services, pharmaceuticals, textiles, agriculture, and 
manufacturing sectors. Major cities like Mumbai, Delhi, Bangalore, and Chennai are hubs of 
business and commerce, attracting investments and fostering innovation.
Delhi is the capital of India


In [87]:
docs[1].page_content

"India's struggle for independence from British colonial rule is a significant chapter in its history. \nLed by Mahatma Gandhi and other freedom fighters, the non-violent resistance movement \ngained momentum and eventually led to India's independence on August 15, 1947. This day is \ncelebrated annually as Independence Day.\nIndia's economy is one of the fastest-growing in the world. It has transitioned from an agrarian \neconomy to a service-oriented and industrialized economy. The country is known for its \nsoftware and information technology services, pharmaceuticals, textiles, agriculture, and \nmanufacturing sectors. Major cities like Mumbai, Delhi, Bangalore, and Chennai are hubs of \nbusiness and commerce, attracting investments and fostering innovation.\nDelhi is the capital of India"

In [88]:
docs[2].page_content

"India's diplomatic influence is also growing on the global stage. The country actively \nparticipates in international forums and has strong bilateral relations with nations around the \nworld. India is a founding member of the Non-Aligned Movement and plays an active role in \nvarious international organizations, such as the United Nations and World Trade Organization.\nIn conclusion, India is a vast and diverse country with a rich cultural heritage, stunning \nlandscapes, and a rapidly growing economy. It is a nation where ancient traditions coexist with \nmodern aspirations. Despite its challenges, India continues to evolve and leave an indelible \nmark on the world, making it a fascinating and dynamic country to explore."

Let's save this so we can use it later!

In [43]:
vectordb.persist()

### **Edge case where failure may happen**

This seems great, and basic similarity search will get you 80% of the way there very easily.

But there are some failure modes that can creep up.


**Failure Example 1**

In [89]:
question= 'what did they say about cultural heritage?'

In [90]:
docs = vectordb.similarity_search(question,k=5)

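Because `doc1.pdf` was loaded twice, the index contains duplicate chunks, so the same text can appear more than once in the results.

Semantic search returns the most similar chunks but does not enforce diversity among them.

The earlier Mahatma Gandhi query showed this: `docs[0]` and `docs[1]` were identical. Inspect the top two results for this query: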

In [91]:
docs[0]

Document(page_content="India's cultural heritage is preserved in its ancient monuments and UNESCO World Heritage \nSites. From the intricate carvings of Khajuraho temples to the majestic forts of Rajasthan, these\narchitectural marvels reflect India's rich history and artistic traditions.\nIndia's diversity extends to its languages as well. While Hindi and English are the official \nlanguages at the national level, there are 22 officially recognized regional languages, including \nBengali, Tamil, Telugu, Marathi, Urdu, Punjabi, and Gujarati, among others. This linguistic \ndiversity is a testament to India's multicultural ethos.\nIn recent years, India has made significant strides in space exploration. The Indian Space \nResearch Organization (ISRO) has successfully launched satellites and missions, including the\nMars Orbiter Mission (MOM), also known as Mangalyaan. These achievements have placed \nIndia among the elite group of nations with advanced space programs.", metadata={'page'

In [92]:
docs[1]

Document(page_content="India has a rich cultural heritage that has evolved over thousands of years. It is home to various\nreligions, including Hinduism, Islam, Christianity, Sikhism, Buddhism, and Jainism, among \nothers. These religions coexist harmoniously, contributing to India's multicultural fabric. \nFestivals like Diwali, Eid, Christmas, and Holi are celebrated with great enthusiasm and bring \npeople from different communities together.\nThe history of India is characterized by ancient civilizations, invasions, and the establishment of\npowerful empires. The Indus Valley Civilization, one of the world's oldest urban civilizations, \nflourished in the northwestern part of the Indian subcontinent around 2500 BCE. Over the \ncenturies, India witnessed the rise and fall of several dynasties, including the Maurya, Gupta, \nand Mughal empires. The Mughal period, in particular, left a lasting impact on Indian culture, \nart, and architecture.", metadata={'page': 0, 'source': '/conten
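
One simple way to cope with duplicated chunks is to drop exact repeats after retrieval. The sketch below assumes results expose a `page_content` attribute, as the documents above do; `Chunk` is a hypothetical stand-in class so the example runs without a vector store, and `drop_duplicate_chunks` is an illustrative helper, not part of LangChain.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieved document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def drop_duplicate_chunks(chunks):
    """Keep the first occurrence of each distinct page_content, preserving rank order."""
    seen = set()
    unique = []
    for chunk in chunks:
        key = chunk.page_content.strip()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

# Hypothetical search results containing one exact repeat
results = [
    Chunk("India has a rich cultural heritage", {"source": "/content/doc1.pdf"}),
    Chunk("India has a rich cultural heritage", {"source": "/content/doc1.pdf"}),
    Chunk("Tourism is a significant contributor", {"source": "/content/doc2.pdf"}),
]
unique_results = drop_duplicate_chunks(results)
print(len(unique_results))
```

This only removes exact repeats. LangChain vector stores also provide `max_marginal_relevance_search`, which trades some relevance for diversity among the returned chunks and may be a better fit when near-duplicates are the problem.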

**Failure Example 2**

This query exposes a second failure mode.

The question below concerns content from the second document (`doc2.pdf`), but the results also include a chunk from `doc1.pdf`.

In [93]:
question = "what did they say about Tourism?"

In [94]:
docs = vectordb.similarity_search(question,k=5)

In [95]:
for doc in docs:
    print(doc.metadata) # metadata records the source file and page each chunk came from

{'page': 0, 'source': '/content/doc2.pdf'}
{'page': 1, 'source': '/content/doc2.pdf'}
{'page': 0, 'source': '/content/doc2.pdf'}
{'page': 0, 'source': '/content/doc2.pdf'}
{'page': 0, 'source': '/content/doc1.pdf'}


In [97]:
print(docs[0].page_content)

India's cultural heritage is preserved in its ancient monuments and UNESCO World Heritage 
Sites. From the intricate carvings of Khajuraho temples to the majestic forts of Rajasthan, these
architectural marvels reflect India's rich history and artistic traditions.
India's diversity extends to its languages as well. While Hindi and English are the official 
languages at the national level, there are 22 officially recognized regional languages, including 
Bengali, Tamil, Telugu, Marathi, Urdu, Punjabi, and Gujarati, among others. This linguistic 
diversity is a testament to India's multicultural ethos.
In recent years, India has made significant strides in space exploration. The Indian Space 
Research Organization (ISRO) has successfully launched satellites and missions, including the
Mars Orbiter Mission (MOM), also known as Mangalyaan. These achievements have placed 
India among the elite group of nations with advanced space programs.


In [98]:
print(docs[1].page_content)

India's diplomatic influence is also growing on the global stage. The country actively 
participates in international forums and has strong bilateral relations with nations around the 
world. India is a founding member of the Non-Aligned Movement and plays an active role in 
various international organizations, such as the United Nations and World Trade Organization.
In conclusion, India is a vast and diverse country with a rich cultural heritage, stunning 
landscapes, and a rapidly growing economy. It is a nation where ancient traditions coexist with 
modern aspirations. Despite its challenges, India continues to evolve and leave an indelible 
mark on the world, making it a fascinating and dynamic country to explore.


In [99]:
print(docs[2].page_content)

worldwide.
Indian cuisine is renowned for its flavors, spices, and regional specialties. Each state has its 
own culinary traditions, offering a wide range of vegetarian and non-vegetarian dishes. Indian 
food has gained international popularity, with dishes like curry, biryani, dosa, and tandoori 
being enjoyed by people worldwide.
The Indian rupee is the official currency in the Republic of India. The rupee is subdivided into 
100 paise. The issuance of the currency is controlled by the Reserve Bank of India.
₹ The Indian rupee sign ( ) is the currency symbol for the Indian rupee the official currency of 
India
Tourism is a significant contributor to India's economy. The country attracts millions of visitors 
each year who come to explore its historical sites, architectural wonders, wildlife sanctuaries, 
and scenic landscapes. Iconic landmarks such as the Taj Mahal, Jaipur's palaces, Kerala's 
backwaters, and the beaches of Goa are popular tourist destinations.


In [None]:
print(docs[3].page_content)

In [96]:
print(docs[4].page_content)

India's struggle for independence from British colonial rule is a significant chapter in its history. 
Led by Mahatma Gandhi and other freedom fighters, the non-violent resistance movement 
gained momentum and eventually led to India's independence on August 15, 1947. This day is 
celebrated annually as Independence Day.
India's economy is one of the fastest-growing in the world. It has transitioned from an agrarian 
economy to a service-oriented and industrialized economy. The country is known for its 
software and information technology services, pharmaceuticals, textiles, agriculture, and 
manufacturing sectors. Major cities like Mumbai, Delhi, Bangalore, and Chennai are hubs of 
business and commerce, attracting investments and fostering innovation.
Delhi is the capital of India
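
Since every result carries its `source` in the metadata, one possible remedy is to restrict results to the intended file. The sketch below post-filters the metadata printed above; `keep_source` is a hypothetical helper written for this example. As an assumption about the installed LangChain version, the Chroma wrapper's `similarity_search` also accepts a `filter` argument (for example `filter={"source": "/content/doc2.pdf"}`) that applies the restriction inside the store.

```python
# Metadata of the five results, copied from the output above
retrieved_metadata = [
    {'page': 0, 'source': '/content/doc2.pdf'},
    {'page': 1, 'source': '/content/doc2.pdf'},
    {'page': 0, 'source': '/content/doc2.pdf'},
    {'page': 0, 'source': '/content/doc2.pdf'},
    {'page': 0, 'source': '/content/doc1.pdf'},
]

def keep_source(metadata_list, wanted_source):
    """Return the positions of results whose 'source' matches wanted_source."""
    return [
        index
        for index, metadata in enumerate(metadata_list)
        if metadata.get('source') == wanted_source
    ]

kept = keep_source(retrieved_metadata, '/content/doc2.pdf')
print(kept)  # positions of the results that come from doc2.pdf
```

Post-filtering can leave fewer than `k` results, so filtering inside the store is usually preferable when it is available.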


### **Download the vector DB**

In [100]:
# Zip the entire folder
!zip -r /content/docs.zip /content/docs

  adding: content/docs/ (stored 0%)
  adding: content/docs/chroma/ (stored 0%)
  adding: content/docs/chroma/2578d79f-00c7-47e3-bcf8-77e2933257f7/ (stored 0%)
  adding: content/docs/chroma/2578d79f-00c7-47e3-bcf8-77e2933257f7/header.bin (deflated 61%)
  adding: content/docs/chroma/2578d79f-00c7-47e3-bcf8-77e2933257f7/link_lists.bin (stored 0%)
  adding: content/docs/chroma/2578d79f-00c7-47e3-bcf8-77e2933257f7/length.bin (deflated 13%)
  adding: content/docs/chroma/2578d79f-00c7-47e3-bcf8-77e2933257f7/data_level0.bin (deflated 100%)
  adding: content/docs/chroma/chroma.sqlite3 (deflated 68%)


In [101]:
from google.colab import files
files.download("/content/docs.zip")

<IPython.core.display.Javascript object>
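
To reuse the downloaded database later, the archive has to be unpacked again. The sketch below uses only the standard library and builds a throwaway archive with a placeholder file so that it runs anywhere; `restore_archive` and the temporary paths are hypothetical. The zip listing above shows entries stored under `content/docs/chroma/`, so that sub-path is where the persisted files would land inside the target directory.

```python
import os
import tempfile
import zipfile

def restore_archive(zip_path, target_dir):
    """Extract zip_path into target_dir and return the sorted list of archived names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return sorted(archive.namelist())

# Build a throwaway archive that mimics the layout of docs.zip (placeholder content)
work_dir = tempfile.mkdtemp()
demo_zip = os.path.join(work_dir, "docs.zip")
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("content/docs/chroma/chroma.sqlite3", b"placeholder")

target_dir = os.path.join(work_dir, "restored")
restored = restore_archive(demo_zip, target_dir)
restored_file = os.path.join(target_dir, "content", "docs", "chroma", "chroma.sqlite3")
print(restored, os.path.isfile(restored_file))
```

After unpacking the real archive, the store could then be reopened by pointing Chroma at the extracted directory, for example `Chroma(persist_directory=..., embedding_function=embedding)`; this assumes the same LangChain and chromadb versions that created the files.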