# **Vector DB - Weaviate DB**

**Note: This script was executed in Google Colab**

- https://console.weaviate.cloud/


- **Weaviate is a cloud-based** vector DB. It also allows **in-memory storage**.
- **Pinecone and Weaviate** are **cloud-based DBs**. We need to **take a subscription**, but both provide initial free credits, with which we can create only 1 cluster. If we **don't want to save our private data** there, we should use Chroma DB/FAISS instead.
- Set the **API key** in the Weaviate Cloud console (link above)
- We need to **define/create a cluster**. At that point we get **WEAVIATE_CLUSTER**.
	- **WEAVIATE_CLUSTER = 'xxxx'**
- In our experience, **Pinecone** is faster than **Weaviate**, but both are good.

- When **connecting to Weaviate via the API key and cluster URL**, **import the weaviate library directly** and use it
- When **passing embeddings and registering data in Weaviate**, use LangChain's wrapper: **from langchain.vectorstores import Weaviate**

## **Terminology:**
- **Vector DB clients** such as **chromadb/pinecone-client** (here **weaviate-client**) are **pip installed**, then called via **LangChain's vectorstores** module
- **sentence-transformers** (Hugging Face) is a **framework that generates embeddings for each input sentence**; this notebook uses **OpenAI embeddings** instead
- **Chunking/chunk_size:** A document/dataset usually contains more tokens than an embedding model or LLM accepts in one request (a **token limit**, e.g. 4k tokens), so to fit within that limit we **split our data into chunks**
- **chunk_overlap=50:** each new chunk **repeats up to the last 50 units of the previous chunk** (characters by default for **RecursiveCharacterTextSplitter**, tokens for token-based splitters), so context is not lost at chunk boundaries
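
The interplay of chunk size and overlap can be sketched in plain Python. This is a simplified stand-in, not LangChain's implementation: it cuts fixed-size character windows, whereas **RecursiveCharacterTextSplitter** prefers to split on separators such as paragraphs and sentences. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# Each chunk starts with the last 3 characters of the previous one.
```

A larger overlap means more repeated text (and more vectors to store), while a smaller one risks cutting a sentence off from its context.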


## **Below steps followed:**
- Log in to the **Weaviate Cloud console (https://console.weaviate.cloud/)** and create
	- **API key** (stored as **WEAVIATE_API_KEY**)
	- **Cluster** (its URL is stored as **WEAVIATE_CLUSTER**)

- **Download some document**
- Then **split it into chunks**
- Then import an **OpenAI or Hugging Face embedding model**, or some other embedding that converts **tokens/text to vectors**
- Then use the **Weaviate vector store** and pass
    - the **document chunks** to be converted to vectors
    - the **embedding model** (here configured as the class vectorizer)
    - the **class (index) name**
- This converts the **chunks to vectors/embeddings**, which are **saved inside that class in the Weaviate cluster**
- Each chunk is stored as 1 vector; we can verify this by **querying our class in the cluster**
- Then we **use this vector DB**, which we just created, through the **vectorstore** object
- Then use **as_retriever** to **read the vector DB** and **do semantic search on it**
- This **semantic/similarity search** returns the **K=4 most relevant chunks** by default; these, along **with the user's question**, are **fed to the LLM** to produce a **meaningful response to that question**.
- We can set the number of relevant chunks by passing **search_kwargs={"k": 2}** to **as_retriever**
- Here the **vector DB does the similarity search based on the user's question**, and the **LLM only structures the retrieved text into an answer**. The LLM does nothing else. **This is called RAG (Retrieval-Augmented Generation)**
- **RetrievalQA** passes the question to the vector DB **retriever**, then passes the retrieved output together with the question to the LLM, which **summarizes** it internally
- We can use LangChain's chain operations **RetrievalQA** or **load_qa_chain** for this
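
The data flow in the steps above can be sketched without any external service. Everything here is a toy assumption made for illustration: word overlap stands in for vector similarity, `fake_llm` stands in for a real LLM call, and the sample chunks are invented.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation removed."""
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(question: str, chunks: list[str], k: int = 4) -> list[str]:
    """Rank chunks by word overlap with the question (toy stand-in for vector search)."""
    question_words = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(question_words & tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Put the retrieved chunks and the question into one prompt."""
    context = "\n\n".join(context_chunks)
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM; it only reports the prompt size."""
    return f"(an LLM would answer here, given a {len(prompt)}-character prompt)"


chunks = [
    "RAG combines retrieval with text generation.",
    "Weaviate stores vectors in a cluster.",
    "Chunk overlap repeats text between neighbouring chunks.",
]
question = "What is RAG?"
top = retrieve(question, chunks, k=2)      # step 1: retrieve the k best chunks
prompt = build_prompt(question, top)       # step 2: combine them with the question
print(fake_llm(prompt))                    # step 3: hand the prompt to the LLM
```

In the real notebook the retriever and the prompt assembly are handled by the vector store and the chain; the sketch only shows which piece receives what.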


In [None]:
!pip install weaviate-client langchain openai pypdf -q

## **0. Use the OpenAI key and Weaviate key**
- Use the OpenAI API key and the Weaviate API key to connect to OpenAI and Weaviate separately

In [None]:
from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

import os
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

In [None]:
WEAVIATE_API_KEY = userdata.get('WEAVIATE_API_KEY')
WEAVIATE_CLUSTER = userdata.get('WEAVIATE_CLUSTER')

import os
# Make them environment variables
os.environ["WEAVIATE_API_KEY"] = WEAVIATE_API_KEY
os.environ["WEAVIATE_CLUSTER"] = WEAVIATE_CLUSTER
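
Outside Colab there is no `userdata` store, so the keys usually come from the environment directly. A minimal sketch of a fail-fast lookup follows; the helper name `require_env` and the demo variable are hypothetical and not part of the notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a readable error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demo with a placeholder value; a real key would come from Colab secrets or the shell.
os.environ["DEMO_WEAVIATE_CLUSTER"] = "https://example.invalid"
cluster_url = require_env("DEMO_WEAVIATE_CLUSTER")
print(cluster_url)
```

Checking up front gives a clear message at the top of the notebook instead of an authentication error several cells later.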

## **1. Read the Document**
- Create the directory **data** and keep the PDF file there; it will be used to create the DB
- This folder is created inside the Colab environment, so it will be deleted once the session ends

In [None]:
!mkdir -p data


### **Extract the Text from the PDFs**

In [None]:
from langchain.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("data")
data = loader.load()
data[:1]

[Document(page_content='Retrieval-Augmented Generation for Large Language Models: A Survey\nYunfan Gao1,Yun Xiong2,Xinyu Gao2,Kangxiang Jia2,Jinliu Pan2,Yuxi Bi3,Yi\nDai1,Jiawei Sun1,Qianyu Guo4,Meng Wang3and Haofen Wang1,3∗\n1Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University\n2Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University\n3College of Design and Innovation, Tongji University\n4School of Computer Science, Fudan University\nAbstract\nLarge Language Models (LLMs) demonstrate\nsignificant capabilities but face challenges such\nas hallucination, outdated knowledge, and non-\ntransparent, untraceable reasoning processes.\nRetrieval-Augmented Generation (RAG) has\nemerged as a promising solution by incorporating\nknowledge from external databases. This enhances\nthe accuracy and credibility of the models, particu-\nlarly for knowledge-intensive tasks, and allows for\ncontinuous knowledge updates and integration of\ndomai

### Split the whole document into chunks
- Split it into chunks with **chunk_size=1000, chunk_overlap=20** using **RecursiveCharacterTextSplitter**

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
docs = text_splitter.split_documents(data)

In [None]:
len(docs)

140

## **2. Creating Vector DB**

- Then import an **OpenAI or Hugging Face embedding model**, or some other embedding that converts **tokens/text to vectors**
- Then use the **Weaviate vector store** and pass
    - the **document chunks** to be converted to vectors
    - the **embedding model** (here the class vectorizer, **text2vec-openai**)
    - the **class name** from the schema
- This converts the **chunks to vectors/embeddings**, which are **saved in the Weaviate cluster**

### **Initialize Embedding**

- Uses OpenAI embeddings

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

embeddings

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x785937fe7af0>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x785937e23ee0>, model='text-embedding-ada-002', deployment='text-embedding-ada-002', openai_api_version='', openai_api_base=None, openai_api_type='', openai_proxy='', embedding_ctx_length=8191, openai_api_key='<redacted>', openai_organization=None, allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None)

### **Initialize Weaviate Vector DB**

- Follow the documentation on the Weaviate page and use the same code to connect to the cloud/server

In [None]:
import weaviate
from langchain.vectorstores import Weaviate

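# Connect to the Weaviate cluster (weaviate-client v3 API)
auth_config = weaviate.auth.AuthApiKey(api_key=WEAVIATE_API_KEY)
WEAVIATE_URL = WEAVIATE_CLUSTER

client = weaviate.Client(
    url=WEAVIATE_URL,
    # The OpenAI key lets the text2vec-openai module embed text on the server side
    additional_headers={"X-OpenAI-Api-Key": OPENAI_API_KEY},
    auth_client_secret=auth_config,
    startup_period=10,  # seconds to wait for the cluster to become ready
)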

In [None]:
client.is_ready()

True

### **Create Weaviate Schema**
- Follow the documentation on the Weaviate page and use the same code to define the structure

In [None]:
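# Define the input structure.
# WARNING: delete_all() removes every class (and its data) in the cluster.
client.schema.delete_all()
client.schema.get()  # returns the current schema; the value is not used here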
schema = {
    "classes": [
        {
            "class": "Chatbot",
            "description": "Documents for chatbot",
            "vectorizer": "text2vec-openai",
            "moduleConfig": {"text2vec-openai": {"model": "ada", "type": "text"}},
            "properties": [
                {
                    "dataType": ["text"],
                    "description": "The content of the paragraph",
                    "moduleConfig": {
                        "text2vec-openai": {
                            "skip": False,
                            "vectorizePropertyName": False,
                        }
                    },
                    "name": "content",
                },
            ],
        },
    ]
}

client.schema.create(schema)

vectorstore = Weaviate(client, "Chatbot", "content", attributes=["source"])
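
The class name and the text property appear twice above: once in the schema and once in the `Weaviate(...)` constructor. One way to keep them in sync is to build the schema from two constants. This is only a sketch; `build_schema`, `CLASS_NAME` and `TEXT_KEY` are hypothetical names, and the dict simply mirrors the schema shown above.

```python
CLASS_NAME = "Chatbot"
TEXT_KEY = "content"


def build_schema(class_name: str, text_key: str) -> dict:
    """Build a one-class schema dict with a single vectorized text property."""
    return {
        "classes": [
            {
                "class": class_name,
                "description": "Documents for chatbot",
                "vectorizer": "text2vec-openai",
                "moduleConfig": {"text2vec-openai": {"model": "ada", "type": "text"}},
                "properties": [
                    {
                        "name": text_key,
                        "dataType": ["text"],
                        "description": "The content of the paragraph",
                        "moduleConfig": {
                            "text2vec-openai": {
                                "skip": False,
                                "vectorizePropertyName": False,
                            }
                        },
                    }
                ],
            }
        ]
    }


schema = build_schema(CLASS_NAME, TEXT_KEY)
property_names = [prop["name"] for prop in schema["classes"][0]["properties"]]
print(schema["classes"][0]["class"], property_names)
```

The same two constants can then be passed to the vector store constructor, so a renamed class cannot silently drift from the schema.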

### **Create Vector DB**
- Then use the **vectorstore** created with the schema above and pass it the data (docs); it converts the text to vectors and saves them in **Weaviate**

In [None]:
# Load text into the vectorstore: split each chunk into its text and its metadata
text_meta_pair = [(doc.page_content, doc.metadata) for doc in docs]
texts, meta = list(zip(*text_meta_pair))

# Pass the texts and their metadata to the vectorstore, which converts them to vectors
vectorstore.add_texts(texts, meta)
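
The `zip(*pairs)` idiom used above turns a list of (text, metadata) pairs into two parallel tuples. A small sketch with plain tuples standing in for LangChain `Document` objects; the file name `data/paper.pdf` and the helper `unzip_pairs` are hypothetical.

```python
# Plain tuples stand in for (doc.page_content, doc.metadata).
pairs = [
    ("first chunk", {"source": "data/paper.pdf", "page": 0}),
    ("second chunk", {"source": "data/paper.pdf", "page": 0}),
    ("third chunk", {"source": "data/paper.pdf", "page": 1}),
]

# zip(*pairs) regroups the pairs column-wise: all texts, then all metadata dicts.
texts, metadatas = zip(*pairs)
print(texts)
print(metadatas[2]["page"])


def unzip_pairs(pairs):
    """Like zip(*pairs), but returns two empty tuples when there is nothing to unpack."""
    if not pairs:
        return (), ()
    texts, metadatas = zip(*pairs)
    return texts, metadatas


empty_texts, empty_metadatas = unzip_pairs([])
```

The guard matters because unpacking `zip(*[])` into two names raises a `ValueError`, which is what would happen if the PDF folder were empty.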

### Use the loaded vector DB and run a similarity search

In [None]:
query = "what is a RAG?"

# Retrieve the 3 chunks most related to the query (the LangChain argument is k, not top_k)
docs = vectorstore.similarity_search(query, k=3)
docs

In [None]:
len(docs)

3
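
What a similarity search does with the stored vectors can be sketched with cosine similarity, a common metric for this kind of ranking. The three-dimensional vectors and chunk labels below are invented for illustration; real embeddings have hundreds or thousands of dimensions and are computed by the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=3):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


stored = [
    ("chunk about RAG", [0.9, 0.1, 0.0]),
    ("chunk about databases", [0.1, 0.9, 0.0]),
    ("chunk about cooking", [0.0, 0.1, 0.9]),
]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of the question
results = top_k(query_vec, stored, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

A vector DB performs the same ranking, but with an index that avoids comparing the query against every stored vector.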

## **3. Semantic Search**
- Then use **as_retriever** to read the vector DB and do **semantic search** on it
- This semantic/similarity search returns the K=4 most relevant chunks by default; these, along with the user's question, are fed to the LLM to produce a meaningful response to that question.
- We can use LangChain's chain operations **RetrievalQA** or **load_qa_chain** for this

In [None]:
from langchain.llms import OpenAI

from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(llm=OpenAI(),
                                 chain_type="stuff",
                                 retriever=vectorstore.as_retriever()
                                 )

# Another way, but the example below is incomplete: the retrieved (RAG) output is not added

"""
from langchain.chains.question_answering import load_qa_chain
# define chain
chain = load_qa_chain(
                    llm = OpenAI(),
                    chain_type="stuff")

# create answer
chain.run(input_documents=docs, question=query)
"""

In [None]:
query = "What is RAG?"
print("\n",qa.run(query))

# **END**