<a href="https://colab.research.google.com/github/AyeshaIjazTabassum/PythonAIBootcamp/blob/main/Day27.6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

***Retrieval-Augmented Generation (RAG)***

* RAG gives LLMs additional knowledge.
* In other words, we use RAG to give an LLM an external source of information so that it can answer our prompts better.

*Example*

We can use RAG to supply an LLM with relevant articles or book passages to draw on when answering a question. When you ask a question, the RAG system first retrieves the passages most relevant to it and passes them to the LLM along with the question, so the answer is grounded in that material.

*Challenge*

Context Window Limitation: An LLM can only consider a limited amount of text at a time, which limits how much retrieved material it can use when answering a question. Because the context window has a fixed size, we cannot pass an entire book to the model. The source text has to be split into chunks, and only the most relevant chunks are sent along with the question.
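
A minimal sketch of this constraint, assuming the chunks are already ranked by relevance and using character count as a rough stand-in for token count. The names `select_chunks` and `max_chars` are hypothetical and exist only for illustration:

```python
def select_chunks(ranked_chunks, max_chars):
    """Keep the highest-ranked chunks whose combined length fits the budget."""
    selected = []
    used = 0
    for chunk in ranked_chunks:
        # Stop as soon as the next chunk would exceed the budget
        if used + len(chunk) > max_chars:
            break
        selected.append(chunk)
        used += len(chunk)
    return selected


# Hand-written example chunks, ordered from most to least relevant
ranked = [
    "Frodo lives in Hobbiton.",
    "Gandalf is a wizard.",
    "The Ring must be destroyed.",
]
print(select_chunks(ranked, max_chars=50))
```

With a budget of 50 characters only the first two chunks fit, so the third is dropped even though it was retrieved.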




**What are tokens?**

In the context of language models, tokens are the smallest units of text that the model processes.
* They can be words, subwords (smaller units of words), or even characters. The choice of tokenization depends on the specific model and the task at hand.
* Tokens are crucial because LLMs have a limit on how many tokens they can process in a single input. This limit is often referred to as the "sequence length" or "context length".
  * This means that a long piece of text has to be broken into smaller chunks, each short enough in tokens to fit within that limit, before the model can process it.
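
To make the idea concrete, here is a toy sketch of word-level and character-level tokenization with a length check. Real LLMs use learned subword tokenizers, so the functions below (`word_tokens`, `char_tokens`, `fits_context`) are hypothetical helpers for illustration and will not match any model's actual token counts:

```python
import re


def word_tokens(text):
    # Words and punctuation marks become separate tokens
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokens(text):
    # Every character, including spaces, is its own token
    return list(text)


def fits_context(tokens, context_length):
    # True if the token sequence is within the model's limit
    return len(tokens) <= context_length


text = "Frodo met Gandalf."
print(word_tokens(text))  # ['Frodo', 'met', 'Gandalf', '.']
print(len(char_tokens(text)))
print(fits_context(word_tokens(text), context_length=3))
```

The same sentence produces far more tokens at character level than at word level, which is one reason the choice of tokenization affects how much text fits in the context length.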

**Token Embedding**

Token embedding is a technique used in NLP to convert tokens into numerical vectors that the model can process.
* Each token is mapped to a unique vector in a high-dimensional space, called the *embedding space*.
* The vectors are learned during the training process and capture the semantic meaning of the tokens.
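
The mapping from tokens to vectors can be pictured as a lookup table. The sketch below is only an illustration: the vectors are random numbers standing in for values that a real model would learn during training, and `build_embedding_table` and `embed` are hypothetical names:

```python
import random


def build_embedding_table(vocab, dim, seed=0):
    # Random values stand in for learned embeddings (assumption for this demo)
    rng = random.Random(seed)
    return {token: [rng.uniform(-1, 1) for _ in range(dim)] for token in vocab}


def embed(tokens, table):
    # Look up the vector for each token
    return [table[token] for token in tokens]


vocab = ["frodo", "gandalf", "ring"]
table = build_embedding_table(vocab, dim=4)
vectors = embed(["frodo", "ring"], table)
print(len(vectors), len(vectors[0]))  # 2 4
```

Because these vectors are random, they carry no semantic meaning. In a trained model, tokens with related meanings end up close together in the embedding space.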
  
**Vector DBs**

Vector databases store and manage vectors, such as embeddings, in a way that allows efficient querying and retrieval of similar vectors.
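
The core operation of a vector database, finding the stored vectors most similar to a query vector, can be sketched with a plain list and cosine similarity. This is a simplified illustration with hand-made 3-dimensional vectors. Real vector databases typically rely on specialised indexes rather than the linear scan shown here, and the names `cosine_similarity`, `search`, and `store` are hypothetical:

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(store, query_vector, k=1):
    # Rank every stored item by similarity to the query and keep the top k
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(item[1], query_vector),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# (text, vector) pairs; the vectors are made up for this demo
store = [
    ("Frodo lives in Hobbiton", [0.9, 0.1, 0.0]),
    ("Gandalf is a wizard", [0.1, 0.9, 0.0]),
    ("The Ring is dangerous", [0.0, 0.2, 0.9]),
]
print(search(store, [0.8, 0.2, 0.0], k=1))  # ['Frodo lives in Hobbiton']
```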

In [1]:
!pip install langchain langchain-community langchain-chroma langchain-google-genai python-dotenv



In [2]:
import os
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_chroma import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from dotenv import load_dotenv



In [3]:
# Define paths
file_path = r"/content/lord_of_the_rings.txt"  # Text file location
persistent_directory = "db/chroma_db"             # Folder to store embeddings
# Make sure the vector store folder exists
os.makedirs(persistent_directory, exist_ok=True)

In [4]:
# Load the text file
loader = TextLoader(file_path)
documents = loader.load()

**Chunk Overlap**

Chunk overlap is the amount of text shared between two consecutive chunks. With `chunk_overlap=0`, neighbouring chunks share no text. With `chunk_overlap=50`, up to 50 characters from the end of one chunk are repeated at the start of the next (the value is a character count, not a percentage). Overlap helps keep a sentence that straddles a chunk boundary intact in at least one chunk.
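
The effect of chunk overlap can be seen with a simple fixed-width sliding window. This is a sketch under simplifying assumptions: `CharacterTextSplitter` splits on a separator and then merges pieces, so its chunks will not look exactly like these, and `chunk_text` is a hypothetical helper:

```python
def chunk_text(text, chunk_size, chunk_overlap):
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new chunk starts (chunk_size - chunk_overlap) characters later
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


text = "abcdefghijklmnopqrstuvwxyz"
print(chunk_text(text, chunk_size=10, chunk_overlap=0))
# ['abcdefghij', 'klmnopqrst', 'uvwxyz']
print(chunk_text(text, chunk_size=10, chunk_overlap=3))
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

With an overlap of 3, the last three characters of each chunk (`hij`, `opq`) reappear at the start of the next one.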

In [5]:
# Split the document into chunks
splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
chunks = splitter.split_documents(documents)

print(f"Loaded and split into {len(chunks)} chunks")



Loaded and split into 43 chunks


In [10]:
from google.colab import userdata

# Load variables from a .env file, if one exists
load_dotenv()

# Read the API key from Colab secrets (stored here under the name 'API')
# and expose it as GOOGLE_API_KEY for the Google client
GOOGLE_API_KEY = userdata.get('API')
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

# Create embeddings using a Google model
embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

In [11]:
# Save to Chroma vector store
db = Chroma.from_documents(chunks, embeddings, persist_directory=persistent_directory)

print("Embeddings created and saved in Chroma DB!")

Embeddings created and saved in Chroma DB!


In [12]:
# Loop through the first 3 document chunks
for i, doc in enumerate(chunks[:3]):
    # Generate the embedding vector for the text content of the chunk
    vector = embeddings.embed_query(doc.page_content)

    # Print the chunk index, vector length, and first 10 dimensions of the embedding
    print(f"Vector for chunk {i+1} (length {len(vector)}):\n{vector[:10]}...")

Vector for chunk 1 (length 768):
[-0.023168352, 0.014601793, 0.019198675, 0.00056040805, 0.0049894024, 0.07484264, 0.02419907, -0.077219784, -0.004445791, -0.052807443]...
Vector for chunk 2 (length 768):
[-0.02177426, 0.038907796, -0.025998116, 0.020623866, 0.026585685, 0.03816532, 0.017811857, -0.02841414, -0.0048194085, -0.003418056]...
Vector for chunk 3 (length 768):
[-0.023126155, 0.034766562, -0.0005128155, 0.01075273, -0.00939121, 0.04057802, 0.010675339, -0.042543054, 0.02203473, -0.0047027995]...


In [13]:
# Load from existing Chroma DB
db = Chroma(
    persist_directory=persistent_directory,
    embedding_function=embeddings
)

# Search with a sample question
results = db.similarity_search("Who is Frodo?", k=1)
for doc in results:
    print(doc.page_content)

Frodo was in his home in Hobbiton when Gandalf arrived, his old friend and mentor. It was here that Gandalf began to speak seriously about the Ring, and Frodo’s life would never be the same after that fateful meeting.


### **Part 2**

In [14]:
# Use current working directory
current_dir = os.getcwd()
vector_db_folder = os.path.join(current_dir, "db", "chroma_db")

In [17]:
embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")
db = Chroma(persist_directory=vector_db_folder, embedding_function=embeddings)

In [18]:
# Define the user's question
query = "Where does Gandalf meet Frodo?"

retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)
relevant_docs = retriever.invoke(query)

# Display the relevant results with metadata
print("Relevant Documents")
for i, doc in enumerate(relevant_docs, 1):
    print(f"Document {i}:\n{doc.page_content}\n")
    if doc.metadata:
        print(f"Source: {doc.metadata.get('source', 'Unknown')}\n")

Relevant Documents
Document 1:
Frodo was in his home in Hobbiton when Gandalf arrived, his old friend and mentor. It was here that Gandalf began to speak seriously about the Ring, and Frodo’s life would never be the same after that fateful meeting.

Source: /content/lord_of_the_rings.txt
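
Once relevant text has been retrieved, the remaining step in RAG is to hand it to the LLM together with the question. The sketch below only builds the prompt string and does not call any model. The function name `build_prompt` and the prompt wording are assumptions for illustration, not part of LangChain:

```python
def build_prompt(question, retrieved_texts):
    # Join the retrieved passages into one context block
    context = "\n\n".join(retrieved_texts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Example passage copied from the retrieval output above
retrieved = [
    "Frodo was in his home in Hobbiton when Gandalf arrived, "
    "his old friend and mentor."
]
prompt = build_prompt("Where does Gandalf meet Frodo?", retrieved)
print(prompt)
```

In the notebook, `retrieved` would be `[doc.page_content for doc in relevant_docs]`, and the resulting string would be sent to the LLM of your choice.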

