<a href="https://colab.research.google.com/github/Prakum14/Testfiles/blob/master/Telecom_RAG_with_Open_Source_LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG : Retrieval Augmented Generation

**(with Open-source LLMs)**

OBJECTIVES:

1. Load the Documents
2. Splitting the documents into chunks
3. Embedding the chunks and storing them in vector db
4. Retrieving the relevant chunks to the query
 * Addressing Diversity
 * Addressing Specificity
5. Connecting with LLM to get a final grounded answer

## Introduction

**RAG diagram:**

<img src='https://drive.google.com/uc?id=1sCVvpsmtZEU1WSK1FFGMGHbEjrgtCNLi'>

**Vector Store and Retrieval:**

<img src='https://drive.google.com/uc?id=1_zX5gtSNrV8Qdx7Nz4_gMR8dCwvxCDS7' width=750px>

**Embedding Model:**

<img src='https://miro.medium.com/v2/resize:fit:1400/format:webp/1*ghzXkvn08B4XVHw-D8lJPg.png'>

**Retrieval in Action:**

<img src='https://miro.medium.com/v2/resize:fit:1400/format:webp/1*8w8_uIsagwvy1V8icxOfdQ.png' width=800px>

**Example workflow with embedding model:**

<br>

<img src='https://www.researchgate.net/publication/381125820/figure/fig2/AS:11431281249185289@1717499737731/Illustration-of-a-Retrieval-Augmented-Generation-RAG-workflow-Documents-are-loaded-and.ppm'>

### Install Dependencies

In [1]:
# Suppresses the output of the following installation commands to keep the notebook clean.
%%capture

# Installs the core LangChain library, which provides essential components for building LLM-based applications.
!pip -q install langchain-core

# Installs community-supported integrations for LangChain, including APIs, databases, and tool connectors.
!pip -q install langchain-community

# Installs Sentence Transformers, which is used for generating text embeddings for tasks like semantic search.
!pip -q install sentence-transformers

# Installs Hugging Face integration for LangChain, enabling the use of Hugging Face models.
!pip -q install langchain-huggingface

# Installs ChromaDB integration for LangChain, which allows interaction with the Chroma vector database.
!pip -q install langchain-chroma

# Installs ChromaDB, an open-source vector database for storing and retrieving high-dimensional embeddings.
!pip -q install chromadb

# Installs PyPDF, a library for reading and extracting text from PDF files, useful in document processing.
!pip -q install pypdf

### Import Required Packages

In [2]:
# Importing the os module to interact with the operating system, such as setting environment variables.
import os

# Importing NumPy, a library for numerical operations and handling arrays.
import numpy as np

# Importing getpass to securely take user input, such as API keys, without displaying them on the screen.
from getpass import getpass

# Importing HuggingFaceEndpoint from langchain_huggingface to interact with Hugging Face models via API.
from langchain_huggingface import HuggingFaceEndpoint

# Importing PyPDFLoader from langchain_community to load and extract text from PDF documents.
from langchain_community.document_loaders import PyPDFLoader

# Importing HuggingFaceEmbeddings from langchain_huggingface to generate text embeddings using Hugging Face models.
from langchain_huggingface import HuggingFaceEmbeddings

# Importing Chroma from langchain_chroma to use ChromaDB as a vector database for storing and retrieving embeddings.
from langchain_chroma import Chroma

# Importing PromptTemplate from langchain_core to create structured prompts for LLM-based applications.
from langchain_core.prompts import PromptTemplate

# Importing StrOutputParser from langchain_core to parse the output of an LLM into a simple string format.
from langchain_core.output_parsers import StrOutputParser

# Importing RunnablePassthrough from langchain.schema.runnable, which acts as a simple passthrough in chains.
from langchain.schema.runnable import RunnablePassthrough

In [3]:
# Downloading a PDF file from Google Drive using gdown.
# The file corresponds to "pca_d1.pdf".
#!gdown https://drive.google.com/uc?id=1Wy00e_FEBVwMx-jZBklNk9dzEW9a-LHc
!gdown /content/5G-NR-in-Bullets.pdf

# Downloading another PDF file from Google Drive using gdown.
# This file corresponds to "ens_d2.pdf".
#!gdown https://drive.google.com/uc?id=1gMv6Ew7oGCPD0CA4D5iN_zAUBWY-SSJQ
!gdown /content/Fundamentals_of_5G_Mobile_Networks-Wiley.pdf

Failed to retrieve file url:

	Cannot retrieve the public link of the file. You may need to change
	the permission to 'Anyone with the link', or have had many accesses.
	Check FAQ in https://github.com/wkentaro/gdown?tab=readme-ov-file#faq.

You may still be able to access the file from the browser:

	https://drive.google.com/uc?id=/content/5G-NR-in-Bullets.pdf

but Gdown can't. Please check connections and permissions.
Failed to retrieve file url:

	Cannot retrieve the public link of the file. You may need to change
	the permission to 'Anyone with the link', or have had many accesses.
	Check FAQ in https://github.com/wkentaro/gdown?tab=readme-ov-file#faq.

You may still be able to access the file from the browser:

	https://drive.google.com/uc?id=/content/Fundamentals_of_5G_Mobile_Networks-Wiley.pdf

but Gdown can't. Please check connections and permissions.


#### **Authentication for Huggingface API**

In [4]:
# Importing the os module to interact with the operating system, including setting environment variables.
import os

# Importing userdata from google.colab to securely access stored user credentials in Google Colab.
from google.colab import userdata

# Retrieving the stored Hugging Face API token from Colab's secure storage.
hfapi_key = userdata.get('HF_TOKEN')

# Storing the API token in an environment variable named "HF_TOKEN".
os.environ["HF_TOKEN"] = hfapi_key

# Storing the same API token in another environment variable "HUGGINGFACEHUB_API_TOKEN",
# which is commonly used for authentication in Hugging Face-based applications.
os.environ["HUGGINGFACEHUB_API_TOKEN"] = hfapi_key

In [None]:
# If your access token is in a text file, use this code cell.


# import os
# f = open('/content/hfapi_key.txt')
# hfapi_key=f.read()
# os.environ["HF_TOKEN"] = hfapi_key
# os.environ["HUGGINGFACEHUB_API_TOKEN"] = hfapi_key

### Prepare Open Source LLM

In [5]:
# importing HuggingFace model abstraction class from langchain
from langchain_huggingface import HuggingFaceEndpoint

In [6]:
# Initializing a Hugging Face endpoint for text generation using a specific model.
llm = HuggingFaceEndpoint(
    repo_id="HuggingFaceH4/zephyr-7b-beta",  # Specifies the model to be used (Zephyr-7B Beta), available on Hugging Face.

    task="text-generation",  # Defines the task type as text generation (commonly used for chatbots, summarization, etc.).

    max_new_tokens=512,  # Limits the maximum number of new tokens the model can generate in a response.

    top_k=30,  # Controls the diversity of the generated text by restricting the number of top probable tokens considered at each step.

    temperature=0.1,  # A low temperature value makes the model’s output more deterministic and focused.

    repetition_penalty=1.03,  # Slightly penalizes repeated tokens to reduce redundancy in responses.
)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [7]:
# Sending a general query to the Hugging Face model for text generation.
response = llm.invoke("What is 5G? give 5 points")

# Printing the model's response to the query.
print(response)



 about it.

5G is the fifth generation of wireless communication technology, succeeding 4G. Here are five key points about 5G:

1. **Speed and Latency**: 5G is designed to provide much faster data speeds and lower latency compared to its predecessors. While 4G can reach up to 1 Gbps, 5G can theoretically reach up to 20 Gbps. Lower latency means quicker response times, which is crucial for applications like autonomous vehicles and remote surgery.

2. **Network Capacity**: 5G is designed to support a much larger number of devices in the same area without a significant drop in performance. This is particularly important in crowded places like stadiums or city centers, where many people are using their devices at the same time.

3. **New Spectrum Bands**: 5G uses new, higher frequency bands (like mmWave) that were not used in previous generations. These bands offer more bandwidth, but they also have shorter ranges and are more easily blocked by buildings and other obstacles.

4. **Massive 

In [8]:
# Specific query
llm.invoke("What is QOS in 5G?")



" What are the different types of QOS in 5G?\n\nQOS (Quality of Service) in 5G refers to the ability to provide different priority levels to different applications, users, or data flows. This is crucial for ensuring that critical services, such as emergency communications or autonomous vehicles, have reliable and low-latency connectivity, while less critical services can use whatever resources are left over.\n\nIn 5G, there are three main types of QOS:\n\n1. **GBR (Guaranteed Bit Rate)**: This is the highest level of QOS. It guarantees a certain bit rate for a service, even during network congestion. This is ideal for real-time applications like voice calls or streaming services.\n\n2. **Non-GBR**: This is a best-effort QOS. It doesn't guarantee a certain bit rate, but it does prioritize certain types of traffic over others. This is suitable for services that can tolerate some variation in bandwidth, like web browsing or email.\n\n3. **Non-GBR with Prioritized Bit Rate (PBR)**: This is

In [9]:
# Specific query
llm.invoke("How is the SLOT format in 5G?")



" I am trying to understand the slot format in 5G. I have read that the slot format in 5G is different from LTE. Can someone explain how it is different?\n\nComment: Welcome to Stack Overflow! Please take the [tour](https://stackoverflow.com/tour), have a look around, and read through the [help center](https://stackoverflow.com/help) and [How to Ask](https://stackoverflow.com/help/how-to-ask) to get an idea of how the site works. When you have finished doing that, please come back and [edit](https://stackoverflow.com/posts/64183079/edit) your question to provide more details about what you are asking.\n\n## Answer (2)\n\nIn 5G, the slot format is defined by the slot format indicator (SFI) field in the DCI. The SFI field indicates the format of the slot, i.e., whether it is a DL slot, UL slot, or flexible slot. The flexible slot can be used for either DL or UL transmission depending on the configuration of the network.\n\nIn LTE, the slot format is determined by the special subframe con

### **Loading the documents**

[PDF Loader](https://python.langchain.com/docs/how_to/document_loader_pdf/)

In [11]:
# UPLOAD the Docs first to this notebook, then run this cell

from langchain_community.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    #PyPDFLoader("/content/pca_d1.pdf"),
    #PyPDFLoader("/content/ens_d2.pdf"),
    #PyPDFLoader("/content/ens_d2.pdf"),
    PyPDFLoader("/content/5G-NR-in-Bullets.pdf"),
    PyPDFLoader("/content/Fundamentals_of_5G_Mobile_Networks-Wiley.pdf"),
    PyPDFLoader("/content/Fundamentals_of_5G_Mobile_Networks-Wiley.pdf"), # Loading duplicate documents on purpose
]

docs = []
for loader in loaders:
    docs.extend(loader.load())

In [12]:
len(docs)        # 7 pages were there in total from above documents

1264

In [13]:
docs

[Document(metadata={'producer': 'Corel PDF Engine Version 20.1.0.708', 'creator': 'CorelDRAW 2018', 'creationdate': '2019-10-14T15:44:51+05:30', 'author': 'DS', 'moddate': '2019-12-03T12:49:16+05:30', 'title': 'Book Scanning and printing 10 copies.cdr', 'source': '/content/5G-NR-in-Bullets.pdf', 'total_pages': 592, 'page': 0, 'page_label': '1'}, page_content=''),
 Document(metadata={'producer': 'Corel PDF Engine Version 20.1.0.708', 'creator': 'CorelDRAW 2018', 'creationdate': '2019-10-14T15:44:51+05:30', 'author': 'DS', 'moddate': '2019-12-03T12:49:16+05:30', 'title': 'Book Scanning and printing 10 copies.cdr', 'source': '/content/5G-NR-in-Bullets.pdf', 'total_pages': 592, 'page': 1, 'page_label': '2'}, page_content='5G New Radio \nIN B[!LLETS \n1'),
 Document(metadata={'producer': 'Corel PDF Engine Version 20.1.0.708', 'creator': 'CorelDRAW 2018', 'creationdate': '2019-10-14T15:44:51+05:30', 'author': 'DS', 'moddate': '2019-12-03T12:49:16+05:30', 'title': 'Book Scanning and printing 

In [15]:
print(docs[1].page_content)

5G New Radio 
IN B[!LLETS 
1


### **Splitting of document**

[Recursively split by character](https://python.langchain.com/docs/how_to/recursive_text_splitter/)

[Split by character](https://python.langchain.com/docs/how_to/character_text_splitter/)

In [16]:
# Importing RecursiveCharacterTextSplitter from langchain_text_splitters
# This is used to split large text documents into smaller chunks for efficient processing.
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [17]:
# Initializing a RecursiveCharacterTextSplitter to split text into smaller chunks for processing.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # Each text chunk will have a maximum of 500 characters.

    chunk_overlap=50  # Overlapping 50 characters between consecutive chunks to maintain context.
)

In [18]:
# Using the text splitter to divide the documents into smaller chunks.
splits = text_splitter.split_documents(docs)

# Printing the total number of splits (chunks) generated from the documents.
print(len(splits))

# Printing the length of the content of the first chunk to inspect its size.
print(len(splits[0].page_content))

# Displaying the content of the first chunk to examine the text that was split.
splits[0].page_content

10051
28


'5G New Radio \nIN B[!LLETS \n1'

In [19]:
splits[0]

Document(metadata={'producer': 'Corel PDF Engine Version 20.1.0.708', 'creator': 'CorelDRAW 2018', 'creationdate': '2019-10-14T15:44:51+05:30', 'author': 'DS', 'moddate': '2019-12-03T12:49:16+05:30', 'title': 'Book Scanning and printing 10 copies.cdr', 'source': '/content/5G-NR-in-Bullets.pdf', 'total_pages': 592, 'page': 1, 'page_label': '2'}, page_content='5G New Radio \nIN B[!LLETS \n1')

### **Embeddings**

Let's take our splits and embed them.

In [20]:
# Importing the torch module to interact with PyTorch for deep learning tasks.
import torch

# Checking if a CUDA-compatible GPU is available; if so, use it for faster processing, otherwise use the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Printing the selected device (either "cuda" for GPU or "cpu" for CPU).
print(f"Device: {device}")

Device: cuda


In [21]:
# Embedding Model

from langchain_huggingface import HuggingFaceEmbeddings

modelPath ="mixedbread-ai/mxbai-embed-large-v1"                  # Model card: https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1
                                                                 # Find other Emb. models at: https://huggingface.co/spaces/mteb/leaderboard

# Create a dictionary with model configuration options, specifying to use the CPU for computations
model_kwargs = {'device': device}      # cuda/cpu

# Create a dictionary with encoding options, specifically setting 'normalize_embeddings' to False
encode_kwargs = {'normalize_embeddings': False}

embedding =  HuggingFaceEmbeddings(
    model_name=modelPath,     # Provide the pre-trained model's path
    model_kwargs=model_kwargs, # Pass the model configuration options
    encode_kwargs=encode_kwargs # Pass the encoding options
)

modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/266 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/114k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/677 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/670M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.24k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

1_Pooling%2Fconfig.json:   0%|          | 0.00/297 [00:00<?, ?B/s]

In [22]:
embedding

HuggingFaceEmbeddings(model_name='mixedbread-ai/mxbai-embed-large-v1', cache_folder=None, model_kwargs={'device': 'cuda'}, encode_kwargs={'normalize_embeddings': False}, multi_process=False, show_progress=False)

### **Understanding similarity search with a toy example**

In [24]:
# Assigning the first sentence to the variable 'sentence1'.
#sentence1 = "i like dogs"
sentence1 = "i like mobile networks"

# Assigning the second sentence to the variable 'sentence2'.
#sentence2 = "i like cats"
sentence2 = "i like internet"

# Assigning the third sentence to the variable 'sentence3'.
#sentence3 = "the weather is ugly, too hot outside"
sentence3 = "the speed is too slow in 2G, difficult to watch videos"

In [25]:
# Embedding the first sentence to convert it into a vector representation.
embedding1 = embedding.embed_query(sentence1)

# Embedding the second sentence to convert it into a vector representation.
embedding2 = embedding.embed_query(sentence2)

# Embedding the third sentence to convert it into a vector representation.
embedding3 = embedding.embed_query(sentence3)

In [26]:
len(embedding1), len(embedding2), len(embedding3)

(1024, 1024, 1024)

In [27]:
embedding1[:10]

[-0.3834555447101593,
 -0.36403974890708923,
 -0.512786865234375,
 -0.16479787230491638,
 0.4292484223842621,
 0.08855875581502914,
 0.6033209562301636,
 -0.21614167094230652,
 1.4115427732467651,
 1.3808118104934692]

In [28]:
import numpy as np

def cosine_similarity(vector1, vector2):
    # Ensure that the vectors are numpy arrays
    vector1 = np.array(vector1)
    vector2 = np.array(vector2)

    # Calculate the dot product of the vectors
    dot_product = np.dot(vector1, vector2)

    # Calculate the magnitude (norm) of the vectors
    norm_vector1 = np.linalg.norm(vector1)
    norm_vector2 = np.linalg.norm(vector2)

    # Compute cosine similarity
    if norm_vector1 == 0 or norm_vector2 == 0:
        return 0  # Avoid division by zero
    return dot_product / (norm_vector1 * norm_vector2)


In [29]:
cosine_similarity(embedding1, embedding2), cosine_similarity(embedding1, embedding3), cosine_similarity(embedding2, embedding3)

(0.7756011355403276, 0.5681826946920143, 0.565947348340234)

### **Vectorstores**

In [30]:
# Light-weight and in memory
from langchain_chroma import Chroma

In [31]:
# Defining the directory path where Chroma (vector database) will store embeddings.
persist_directory = 'docs/chroma/'

# Removing the old Chroma database directory and its contents if they exist.
# This ensures that the previous embeddings or data do not interfere with the new ones.
!rm -rf ./docs/chroma

In [32]:
# Creating a Chroma vector database from the split documents and embeddings.
vectordb = Chroma.from_documents(
    documents=splits,  # Providing the document splits (chunks) created earlier to store in the database.

    embedding=embedding,  # Using the previously generated embeddings to represent the document chunks in vector form.

    persist_directory=persist_directory,  # Specifying the directory where the vector database will be saved.
)

In [33]:
# Same as number of splits
print(vectordb._collection.count())

10051


### **Similarity Search in Vector store**

Algorithms for retrieving relevant chunks In Vector databases,

In vector databases, algorithms for retrieving relevant chunks to a query are often based on **similarity search techniques**, primarily using nearest neighbor search.

Here are some common approaches:

>**Approximate Nearest Neighbor (ANN) Search:** Vector databases frequently use ANN algorithms to improve efficiency when searching for vectors that
are close to the query vector.
>
>Popular **ANN** algorithms include:
>
>1. HNSW (Hierarchical Navigable Small World Graph): This is a graph-based approach that finds approximate nearest neighbors using a multi-
layered graph structure.
>
>2. Faiss: An open-source library developed by Facebook, which uses various algorithms for fast similarity search, such as Product Quantization and
Inverted File System (IVF).
>
>3. Annoy (Approximate Nearest Neighbors Oh Yeah): Developed by Spotify, it uses a forest of random projection trees for approximate nearest
neighbor search.


In [37]:
# Defining a question to be used for querying or retrieving relevant information from the database or model.
#question = "How does ensemble method works?"
question = "How 5G network works?"

In [38]:
# Performing a similarity search in the Chroma vector database to find the most relevant document chunks for the given question.
docs = vectordb.similarity_search(question, k=6)  # 'k=6' specifies that the top 6 most similar document chunks will be returned.

In [39]:
# Printing the total number of retrieved document chunks (based on the similarity search).
print(len(docs))

# Looping through each retrieved document chunk to display its content.
for i in range(len(docs)):
    # Printing the content of each document chunk (page content).
    print(docs[i].page_content)

    # Printing a separator line for readability between document chunks.
    print('='*140)

6
24 Fundamentals of 5G Mobile Networks
and beyond that, to mobile users that can form a virtual pool of resources to be managed by 
the network. Bringing the applications through the cloud closer to the end user reduces the 
communication latency to support delay‐sensitive real‐time control applications.
It is envisaged that 5G will seamlessly integrate the existing RATs (e.g. GSM, HSPA, LTE 
and WiFi) with the complementary new ones invented in mmWave bands. MmWave technol-
24 Fundamentals of 5G Mobile Networks
and beyond that, to mobile users that can form a virtual pool of resources to be managed by 
the network. Bringing the applications through the cloud closer to the end user reduces the 
communication latency to support delay‐sensitive real‐time control applications.
It is envisaged that 5G will seamlessly integrate the existing RATs (e.g. GSM, HSPA, LTE 
and WiFi) with the complementary new ones invented in mmWave bands. MmWave technol-
should cooperate to jointly promote and 

### **Edge cases where failure may happen**

1. Lack of Diversity : Semantic search fetches all similar documents, but does not enforce diversity.

    - Notice that we're getting duplicate chunks (because of the duplicate `ens_d2.pdf` in the index). `docs[0]` and `docs[1]` are indentical.

  **Addressing Diversity - MMR (Maximum Marginal Relevance)**

Maximum Marginal Relevance (MMR) is a method used to retrieve relevant items to a query while avoiding redundancy. It does this by ensuring a balance between relevancy and diversity in the items retrieved.

<img src='https://miro.medium.com/v2/resize:fit:828/format:webp/1*U-9mPt5tBfPBPrwC4_oD1w.png'>

In [40]:
# Defining a question to retrieve relevant documents from the Chroma vector database.
question = 'How 5G network works?'

# Performing a similarity search in the vector database to find the most relevant documents for the given question (without MMR).
docs = vectordb.similarity_search(question, k=3)  # 'k=3' specifies that the top 3 most relevant document chunks will be returned.

# Printing the total number of documents retrieved.
print(len(docs))

# Looping through each retrieved document chunk to display its content.
for i in range(len(docs)):
    # Printing the content of each document chunk (page content).
    print(docs[i].page_content)

    # Printing a separator line for readability between document chunks.
    print('='*140)

3
24 Fundamentals of 5G Mobile Networks
and beyond that, to mobile users that can form a virtual pool of resources to be managed by 
the network. Bringing the applications through the cloud closer to the end user reduces the 
communication latency to support delay‐sensitive real‐time control applications.
It is envisaged that 5G will seamlessly integrate the existing RATs (e.g. GSM, HSPA, LTE 
and WiFi) with the complementary new ones invented in mmWave bands. MmWave technol-
24 Fundamentals of 5G Mobile Networks
and beyond that, to mobile users that can form a virtual pool of resources to be managed by 
the network. Bringing the applications through the cloud closer to the end user reduces the 
communication latency to support delay‐sensitive real‐time control applications.
It is envisaged that 5G will seamlessly integrate the existing RATs (e.g. GSM, HSPA, LTE 
and WiFi) with the complementary new ones invented in mmWave bands. MmWave technol-
should cooperate to jointly promote and 

**Example 1. Addressing Diversity - MMR-Maximum Marginal Relevance**

In [41]:
# Performing a similarity search using Maximum Marginal Relevance (MMR) to retrieve diverse document chunks.
docs_with_mmr = vectordb.max_marginal_relevance_search(question, k=3, fetch_k=6)

# Printing the total number of retrieved document chunks.
print(len(docs_with_mmr))

# Looping through each retrieved document chunk to display its content.
for i in range(len(docs_with_mmr)):
    # Printing the content of each document chunk (page content).
    print(docs_with_mmr[i].page_content)

    # Printing a separator line for readability between document chunks.
    print('='*140)

3
24 Fundamentals of 5G Mobile Networks
and beyond that, to mobile users that can form a virtual pool of resources to be managed by 
the network. Bringing the applications through the cloud closer to the end user reduces the 
communication latency to support delay‐sensitive real‐time control applications.
It is envisaged that 5G will seamlessly integrate the existing RATs (e.g. GSM, HSPA, LTE 
and WiFi) with the complementary new ones invented in mmWave bands. MmWave technol-
should cooperate to jointly promote and lead the development of global mobile communica-
tion technologies.
Jointly built by big South Korean companies such as Samsung and LG and its Electronic 
Communication Academy, the new 5G network architecture consists of three layers: Layer 1 
is the server gateway; Layer 2 is the outer cellular; and Layer 3 is the inner cellular. The inner 
cellular first transmits data to the outer cellular through the backhaul; then, the outer cellular
8 Fundamentals of 5G Mobile Network

2. Lack of specificity:  The question may be from a particular doc but answer may contain information from other doc.

  **Addressing Specificity: Working with metadata - Manually**

  **Working with metadata using self-query retriever - Automatically**

**Example 2. Addressing Specificity: Working with metadata - Manually**

In [42]:
# Defining a question to retrieve relevant documents from the Chroma vector database.
#question = "What is variance?"
question = "What is an NG interface?"

# Performing a similarity search to find the top 5 most relevant document chunks.
docs = vectordb.similarity_search(question, k=5)

# Looping through each retrieved document chunk to display its metadata.
for doc in docs:
    # Printing metadata information, which contains details about the source document.
    print(doc.metadata)

{'author': 'DS', 'creationdate': '2019-10-14T15:44:51+05:30', 'creator': 'CorelDRAW 2018', 'moddate': '2019-12-03T12:49:16+05:30', 'page': 51, 'page_label': '52', 'producer': 'Corel PDF Engine Version 20.1.0.708', 'source': '/content/5G-NR-in-Bullets.pdf', 'title': 'Book Scanning and printing 10 copies.cdr', 'total_pages': 592}
{'author': 'DS', 'creationdate': '2019-10-14T15:44:51+05:30', 'creator': 'CorelDRAW 2018', 'moddate': '2019-12-03T12:49:16+05:30', 'page': 51, 'page_label': '52', 'producer': 'Corel PDF Engine Version 20.1.0.708', 'source': '/content/5G-NR-in-Bullets.pdf', 'title': 'Book Scanning and printing 10 copies.cdr', 'total_pages': 592}
{'author': 'DS', 'creationdate': '2019-10-14T15:44:51+05:30', 'creator': 'CorelDRAW 2018', 'moddate': '2019-12-03T12:49:16+05:30', 'page': 60, 'page_label': '61', 'producer': 'Corel PDF Engine Version 20.1.0.708', 'source': '/content/5G-NR-in-Bullets.pdf', 'title': 'Book Scanning and printing 10 copies.cdr', 'total_pages': 592}
{'author':

We can filter the results based on metadata.

In [43]:
# Defining a question to retrieve relevant documents from a specific source (filtered by metadata).
#question = "What is the role of variance in PCA?"
question = "what is Machine Type Communication?"

# Performing a similarity search with a metadata filter to retrieve the top 5 most relevant document chunks.
docs = vectordb.similarity_search(
    question,
    k=5,  # Retrieving the top 5 relevant document chunks.
    filter={"source": "/content/ens_d2.pdf"}  # Applying a metadata filter to retrieve only results from 'ens_d2.pdf'.
)

# Looping through each retrieved document chunk to display its metadata.
for doc in docs:
    # Printing metadata information to track the source of the retrieved document chunks.
    print(doc.metadata)

In [44]:
# Performing a similarity search with Maximum Marginal Relevance (MMR) while applying a metadata filter.
docs_with_mmr = vectordb.max_marginal_relevance_search(
    question,     # The query to retrieve relevant document chunks.
    k=2,          # Number of diverse document chunks to return.
    fetch_k=5,    # Initial number of retrieved chunks before selecting the most diverse ones.
    filter={"source": "/content/ens_d2.pdf"}  # Applying a metadata filter to only fetch results from 'ens_d2.pdf'.
)

In [45]:
# Looping through each retrieved document chunk to display its content.
for i in range(len(docs_with_mmr)):
    # Printing the content of the retrieved document chunk.
    print(docs_with_mmr[i].page_content)

    # Printing a separator line for readability between document chunks.
    print('='*140)

[**Addressing Specificity -Automatically: Working with metadata using self-query retriever**](https://python.langchain.com/docs/how_to/self_query/)

### **Additional tricks: Compression**

Another approach for improving the quality of retrieved docs is compression. Information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

[Contextual compression](https://python.langchain.com/docs/how_to/contextual_compression/) is meant to fix this.

## **Retrieval**

**[Vectorstore as a retriever](https://python.langchain.com/docs/how_to/vectorstore_retriever/)**

**Better Approach**

In [46]:
# Defining the query for retrieving relevant document chunks.
#question = "What is principal component analysis?"
question = "How 5G network works?"

# Converting the Chroma vector database into a retriever object.
# `search_kwargs={"k": 3}` specifies that we want to retrieve the top 3 most relevant document chunks.
retriever = vectordb.as_retriever(search_kwargs={"k": 3})

# Retrieving the relevant documents by invoking the retriever with the query.
docs = retriever.invoke(question)

# Displaying the retrieved document chunks.
docs

[Document(id='ece11b3c-b69d-47f8-b84d-7c9ed9a124c5', metadata={'author': 'Jonathan Rodriguez', 'creationdate': '2015-05-24T04:41:05+00:00', 'creator': 'PyPDF', 'ebx_publisher': 'John Wiley & Sons, Inc.', 'moddate': '2015-05-24T15:43:58+01:00', 'page': 65, 'page_label': '24', 'producer': 'iTextSharp™ 5.5.0 ©2000-2013 iText Group NV (AGPL-version); modified using iTextSharp™ 5.5.0 ©2000-2013 iText Group NV (AGPL-version)', 'source': '/content/Fundamentals_of_5G_Mobile_Networks-Wiley.pdf', 'title': 'Fundamentals of 5G Mobile Networks', 'total_pages': 336}, page_content='24 Fundamentals of 5G Mobile Networks\nand beyond that, to mobile users that can form a virtual pool of resources to be managed by \nthe network. Bringing the applications through the cloud closer to the end user reduces the \ncommunication latency to support delay‐sensitive real‐time control applications.\nIt is envisaged that 5G will seamlessly integrate the existing RATs (e.g. GSM, HSPA, LTE \nand WiFi) with the complem

In [47]:
# Converting the Chroma vector database into a retriever with Maximum Marginal Relevance (MMR).
# `search_type="mmr"` enables MMR-based retrieval.
# `k=2` specifies that we want to return 2 final document chunks.
# `fetch_k=5` retrieves the top 5 most relevant chunks initially, then selects the 2 most diverse ones.
retriever = vectordb.as_retriever(search_type="mmr", search_kwargs={"k": 2, "fetch_k": 5})

# Retrieving the relevant document chunks using MMR.
docs = retriever.invoke(question)

# Displaying the retrieved document chunks.
docs

[Document(id='ece11b3c-b69d-47f8-b84d-7c9ed9a124c5', metadata={'author': 'Jonathan Rodriguez', 'creationdate': '2015-05-24T04:41:05+00:00', 'creator': 'PyPDF', 'ebx_publisher': 'John Wiley & Sons, Inc.', 'moddate': '2015-05-24T15:43:58+01:00', 'page': 65, 'page_label': '24', 'producer': 'iTextSharp™ 5.5.0 ©2000-2013 iText Group NV (AGPL-version); modified using iTextSharp™ 5.5.0 ©2000-2013 iText Group NV (AGPL-version)', 'source': '/content/Fundamentals_of_5G_Mobile_Networks-Wiley.pdf', 'title': 'Fundamentals of 5G Mobile Networks', 'total_pages': 336}, page_content='24 Fundamentals of 5G Mobile Networks\nand beyond that, to mobile users that can form a virtual pool of resources to be managed by \nthe network. Bringing the applications through the cloud closer to the end user reduces the \ncommunication latency to support delay‐sensitive real‐time control applications.\nIt is envisaged that 5G will seamlessly integrate the existing RATs (e.g. GSM, HSPA, LTE \nand WiFi) with the complem

## **Augmentation**

In [48]:
from langchain_core.prompts import PromptTemplate                                    # To format prompts
from langchain_core.output_parsers import StrOutputParser                            # to transform the output of an LLM into a more usable format
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough          # Required by LCEL (LangChain Expression Language)

In [49]:
# Defining a prompt template for answering questions based on retrieved context.
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, that is if the answer is not in the context, then just say that you don't know, don't try to make up an answer.
Always say "thanks for asking!" at the end of the answer.

{context}
Question: {question}
Helpful Answer:"""

# Creating a PromptTemplate object using the defined template.
# The template takes two input variables: 'context' (retrieved documents) and 'question' (user query).
QA_PROMPT = PromptTemplate(input_variables=["context", "question"], template=template)

## **Creating final RAG Chain**

> <img src='https://www.pinecone.io/_next/image/?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2Fvr8gru94%2Fproduction%2F63f8a8482c9ec06a8d7d1041514f87c06dd108a9-3442x942.png&w=3840&q=75' width=1200px>

[[Image source](https://www.pinecone.io/learn/series/langchain/langchain-expression-language/)]

In [50]:
# Function to retrieve relevant document chunks based on a given question.
def get_context_info(question):
    # Creating a retriever using the Chroma vector database with Maximum Marginal Relevance (MMR).
    # `search_type="mmr"` ensures diversity in retrieved results.
    # `fetch_k=5` first fetches the top 5 most relevant chunks.
    # `k=3` selects the 3 most diverse and relevant chunks from the 5 fetched.
    retriever = vectordb.as_retriever(search_type="mmr", search_kwargs={"k": 3, "fetch_k": 5})

    # Retrieving the document chunks using the retriever.
    docs = retriever.invoke(question)

    # Returning the retrieved document chunks.
    return docs

In [51]:
# Importing RunnableLambda for wrapping functions as runnable components.
from langchain_core.runnables import RunnableLambda

# Creating a retrieval pipeline using RunnableParallel.
retrieval = RunnableParallel(
    {
        # Retrieving relevant document chunks based on the input question.
        # RunnableLambda ensures that `get_context_info` runs dynamically when called.
        "context": RunnableLambda(lambda x: get_context_info(x["question"])),

        # Simply passing the question as is.
        "question": RunnableLambda(lambda x: x["question"])
    }
)

In [52]:
#retrieval.invoke({"question": "What is PCA ?"})
retrieval.invoke({"question": "What is X2 interface ?"})

{'context': [Document(id='b1a02e0f-fd94-4456-9b78-f30a56aa8245', metadata={'author': 'DS', 'creationdate': '2019-10-14T15:44:51+05:30', 'creator': 'CorelDRAW 2018', 'moddate': '2019-12-03T12:49:16+05:30', 'page': 55, 'page_label': '56', 'producer': 'Corel PDF Engine Version 20.1.0.708', 'source': '/content/5G-NR-in-Bullets.pdf', 'title': 'Book Scanning and printing 10 copies.cdr', 'total_pages': 592}, page_content='5G NR in BULLETS\n1.6.5 X2 INTERFACE \n* The X2 interface was originally introduced within the release 8 version of the 3GPP specifications. Its primary function is to provide\ncontrol plane and user plane connectivity between two LTE Base Stations (cNode 13). The X2 interface was updated within the release\n15 version of the specifications to allow connectivity between an eNode B and a gNodc B. This is a requirement for the Non\xad\nStandalone Base Station architectures 3, 3a and 3x'),
  Document(id='4e3fd4fb-fc46-4387-bf16-f3b3af4d8d22', metadata={'author': 'DS', 'creation

In [53]:
#retrieval.invoke({"question": "How ensemble methods works?"})
retrieval.invoke({"question": "What is Network Slicing ?"})

{'context': [Document(id='d8270290-b73e-4a2c-95b3-d6da175c3800', metadata={'author': 'DS', 'creationdate': '2019-10-14T15:44:51+05:30', 'creator': 'CorelDRAW 2018', 'moddate': '2019-12-03T12:49:16+05:30', 'page': 79, 'page_label': '80', 'producer': 'Corel PDF Engine Version 20.1.0.708', 'source': '/content/5G-NR-in-Bullets.pdf', 'title': 'Book Scanning and printing 10 copies.cdr', 'total_pages': 592}, page_content='sG NR in BULLETS\n1.15 NETWORK SLICING \n* Network Slicing refers to the selection and allo<.:allon of network r,esour<.:es to suit the requiremt:nts of a specific service. For example,\nan eM!3B user is likely to require high throughputs so that user should be allocated network resources which support high throughputs.\nIn contrast, a URLLC user is likely to require low latency so that user should be allocated network resources which support low latency'),
  Document(id='35d18ede-00e6-4d58-a259-bfe4591b39e4', metadata={'author': 'DS', 'creationdate': '2019-10-14T15:44:51+05

In [54]:
# Constructing a Retrieval-Augmented Generation (RAG) chain.

rag_chain = (retrieval       # Step 1: Retrieve relevant document chunks based on the user's question.
             | QA_PROMPT     # Step 2: Format the retrieved context and question into a structured prompt.
             | llm           # Step 3: Pass the prompt to the language model (LLM) for answer generation.
             | StrOutputParser()  # Step 4: Convert the LLM's output into a plain string format.
             )

In [55]:
#response = rag_chain.invoke({"question": "What is PCA ?"})
response = rag_chain.invoke({"question": "What is EDGE computing ?"})

response



' EDGE Computing allows a UE to access services which are hosted close to the serving Base Station. This approach helps to improve both end-user experience and network efficiency. End-user experience can be improved by lower latencies while network efficiency can be improved by reduced backhaul transport requirements. Hosting services close to the serving Base Station means that there is a User Plane Function (UPF) and a Data Network (DN) or Local. Thanks for asking!'

In [57]:
#response = rag_chain.invoke({"question": "What is principal component analysis?"})
response = rag_chain.invoke({"question": "What is EDGE computing in 5G?"})

response



' EDGE Computing in 5G allows a User Equipment (UE) to access services hosted close to the serving Base Station, improving both end-user experience and network efficiency. It reduces latencies for users and reduces backhaul transport requirements for the network. This is achieved by having a User Plane Function (UPF) and a Data Network (DN) or Local DN near the serving Base Station. Thanks for asking!'

In [59]:
#response = rag_chain.invoke({"question": "How ensemble method works?"})
response = rag_chain.invoke({"question" : "What is Network Slicing?"})
response



' Network Slicing refers to the selection and allocation of network resources to suit the requirements of a specific service. For example, an eMBB user is likely to require high throughputs so they should be allocated network resources which support high throughputs. In contrast, a URLLC user is likely to require low latency so they should be allocated network resources which support low latency. Thanks for asking!'

In [64]:
# For queries that is not in documents
#response = rag_chain.invoke({"question": "Who is the CEO of OpenAI "})
response = rag_chain.invoke({"question": "Who is the CEO of Gemini "})

print(response)           # It should return "I don't know. Thanks for asking!". The open-source model used is not that great.



 I don't know. Thanks for asking!


[**Details of Chroma through LangChain**](https://python.langchain.com/docs/integrations/vectorstores/chroma/)

### **Download the vector DB**

In [65]:
# Zip the entire folder
!zip -r /content/docs.zip /content/docs

  adding: content/docs/ (stored 0%)
  adding: content/docs/chroma/ (stored 0%)
  adding: content/docs/chroma/chroma.sqlite3 (deflated 82%)
  adding: content/docs/chroma/4106c588-dc3d-42e6-9b74-fe42ad0f66d7/ (stored 0%)
  adding: content/docs/chroma/4106c588-dc3d-42e6-9b74-fe42ad0f66d7/data_level0.bin (deflated 9%)
  adding: content/docs/chroma/4106c588-dc3d-42e6-9b74-fe42ad0f66d7/index_metadata.pickle (deflated 43%)
  adding: content/docs/chroma/4106c588-dc3d-42e6-9b74-fe42ad0f66d7/header.bin (deflated 56%)
  adding: content/docs/chroma/4106c588-dc3d-42e6-9b74-fe42ad0f66d7/link_lists.bin (deflated 80%)
  adding: content/docs/chroma/4106c588-dc3d-42e6-9b74-fe42ad0f66d7/length.bin (deflated 32%)


In [66]:
from google.colab import files
files.download("/content/docs.zip")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

### **Upload the vector db from previous step and unzip**

In [67]:
!unzip /content/docs.zip  -d /

Archive:  /content/docs.zip
replace /content/docs/chroma/chroma.sqlite3? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: /content/docs/chroma/chroma.sqlite3  
  inflating: /content/docs/chroma/4106c588-dc3d-42e6-9b74-fe42ad0f66d7/data_level0.bin  
  inflating: /content/docs/chroma/4106c588-dc3d-42e6-9b74-fe42ad0f66d7/index_metadata.pickle  
  inflating: /content/docs/chroma/4106c588-dc3d-42e6-9b74-fe42ad0f66d7/header.bin  
  inflating: /content/docs/chroma/4106c588-dc3d-42e6-9b74-fe42ad0f66d7/link_lists.bin  
  inflating: /content/docs/chroma/4106c588-dc3d-42e6-9b74-fe42ad0f66d7/length.bin  


In [68]:
# Importing necessary modules for embeddings and vector database storage.
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

# Initializing the Hugging Face embedding model.
embedding =  HuggingFaceEmbeddings(
    model_name="mixedbread-ai/mxbai-embed-large-v1",  # Using a pre-trained embedding model for text vectorization.

    model_kwargs={'device': "cuda" if torch.cuda.is_available() else "cpu"},  # Configuring the model to use GPU if available.

    encode_kwargs={'normalize_embeddings': False}  # Disabling normalization to preserve the original embedding scale.
)

# Creating a Chroma vector database instance.
vectordb = Chroma(
    persist_directory='docs/chroma/',  # Setting the directory where embeddings will be stored.
    embedding_function=embedding  # Specifying the embedding function to convert text into vector representations.
)

### **Re-ranking example with Open-source model**

* [Retrieve & Re-Rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html)
* [MS MARCO Cross-Encoders](https://www.sbert.net/docs/pretrained-models/ce-msmarco.html) for Re-ranking
  * Usage with **SentenceTransformers
Pre-trained models** can be used like this:

In [69]:
# Define a query and some candidate sentences
query = "I love programming in Python."

# Some toy data representing candidate sentences/documents
candidates = [
    "Python is a great programming language.",
    "I enjoy long walks on the beach.",
    "Machine learning can be used to build models.",
    "I like writing code in Python.",
    "Artificial intelligence is fascinating."
]

In [70]:
Paragraph1 = candidates[0]
Paragraph2 = candidates[1]
Paragraph3 = candidates[2]

In [71]:
# Importing CrossEncoder from sentence-transformers to perform pairwise relevance scoring.
from sentence_transformers import CrossEncoder

# Specifying the pre-trained cross-encoder model for relevance ranking.
model_name = 'cross-encoder/ms-marco-TinyBERT-L-2-v2'

# Initializing the cross-encoder model with a max sequence length of 512 tokens.
model = CrossEncoder(model_name, max_length=512)

# Using the cross-encoder to compute relevance scores for different paragraphs against the query.
scores = model.predict([
    (query, Paragraph1),  # Comparing query with the first paragraph.
    (query, Paragraph2),  # Comparing query with the second paragraph.
    (query, Paragraph3)   # Comparing query with the third paragraph.
])

# Printing the computed relevance scores.
print(scores)

config.json:   0%|          | 0.00/787 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/17.6M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/525 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

[  5.3639755 -11.360193  -11.534376 ]


In [72]:
print(scores)

[  5.3639755 -11.360193  -11.534376 ]


* **Usage with Transformers**

In [73]:
# Import necessary modules from the transformers library
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch  # Import PyTorch for tensor operations

# Load a pre-trained model for sequence classification using the specified model name
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Load the corresponding tokenizer for the model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize the query and paragraph pairs, ensuring proper formatting for model input
features = tokenizer(
    [query, query, query],        # Query repeated for each paragraph
    [Paragraph1, Paragraph2, Paragraph3],  # Each paragraph to compare with the query
    padding=True,                 # Pads sequences to the same length
    truncation=True,               # Truncates sequences if too long
    return_tensors="pt"            # Returns PyTorch tensors
)

# Set the model to evaluation mode (disables training-specific layers like dropout)
model.eval()

# Perform inference without computing gradients (saves memory and speeds up computation)
with torch.no_grad():
    scores = model(**features).logits  # Get logits (raw output scores before softmax)
    print(scores)  # Print the similarity scores for each query-paragraph pair

tensor([[  5.3640],
        [-11.3602],
        [-11.5344]])


In [74]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: write

In [75]:
model.push_to_hub("praveenku1479/TelecomQnARAG")
tokenizer.push_to_hub("praveenku1479/TelecomQnARAG")

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.
No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/praveenku1479/TelecomQnARAG/commit/f73a817b14b1ebf296880324efe791f83831b26e', commit_message='Upload tokenizer', commit_description='', oid='f73a817b14b1ebf296880324efe791f83831b26e', pr_url=None, repo_url=RepoUrl('https://huggingface.co/praveenku1479/TelecomQnARAG', endpoint='https://huggingface.co', repo_type='model', repo_id='praveenku1479/TelecomQnARAG'), pr_revision=None, pr_num=None)

In [76]:
from transformers import AutoModelWithLMHead, AutoTokenizer

In [77]:
loaded_model = AutoModelForSequenceClassification.from_pretrained("praveenku1479/TelecomQnARAG")

config.json:   0%|          | 0.00/852 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

In [78]:
loaded_tokenizer = AutoTokenizer.from_pretrained("praveenku1479/TelecomQnARAG")

tokenizer_config.json:   0%|          | 0.00/1.27k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

In [81]:
response = rag_chain.invoke({"question": "What is 5G?"})
response



' 300000-200000-200000-20000-20000-20000-200000-20000-20000-20000-20000-2000000-20000-20000-20000-20000-20000-200000-20000-20000-20000-20000-200000-2000000-2000-2000-20000-20000-2000000-200000-2000-20000-200000-20000-20000-2000-20000-2000000-200000-2000-200000-2000-200000-200000-2000-20000-20000-20000-200000-20000-20000-20000-200000-20000-200000-20000-20000-200000-2000-200000-20000-200000-20000-2000000-20000-20000-20000-20000-20000-200000-20000-20000-20000-20000-2000000-200000-2000-2000-200000-20000-200000-2'

In [82]:
!pip -q install gradio

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.2/62.2 MB[0m [31m13.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m321.9/321.9 kB[0m [31m25.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.5/12.5 MB[0m [31m48.9 MB/s[0m eta [36m0:00:00[0m
[?25h

In [83]:
import gradio

In [84]:
def generate_query_response(prompt, max_length):
  response = rag_chain.invoke({"question": prompt})
  return response


In [85]:
import gradio as gr

# Gradio elements

# Input from user
in_prompt = gr.Textbox(label="Enter your query")
in_max_length = gr.Slider(minimum=50, maximum=500, value=200, step=10, label="Maximum Response Length")

# Output response
out_response = gr.Textbox(label="Generated Response")


iface = gr.Interface(
    fn=generate_query_response,
    inputs=[in_prompt, in_max_length],
    outputs=out_response,
    title="Technical Query Response Generator",
)

iface.launch()


Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://de684b6222a6dad970.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


