# Part 1: Retrieval-Augmented Generation (RAG) Model for QA Bot

### Problem Statement:

Develop a Retrieval-Augmented Generation (RAG) model for a Question Answering (QA) bot for a business. Use a vector database like Pinecone DB and a generative model like Cohere API (or any other available alternative). The QA bot should be able to retrieve relevant information from a dataset and generate coherent answers.


## Installation Requirements


In [1]:
!pip install faiss-cpu cohere PyPDF2 numpy


Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.7 kB)
Collecting cohere
  Downloading cohere-5.9.2-py3-none-any.whl.metadata (3.4 kB)
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting boto3<2.0.0,>=1.34.0 (from cohere)
  Downloading boto3-1.35.19-py3-none-any.whl.metadata (6.6 kB)
Collecting fastavro<2.0.0,>=1.9.4 (from cohere)
  Downloading fastavro-1.9.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.5 kB)
Collecting httpx>=0.21.2 (from cohere)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting httpx-sse==0.4.0 (from cohere)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting parameterized<0.10.0,>=0.9.0 (from cohere)
  Downloading parameterized-0.9.0-py2.py3-none-any.whl.metadata (18 kB)
Collecting types-requests<3.0.0,>=2.0.0 (from cohere)
  Downloading types_requests-2.32.0.20240914-py3-none-a

In [41]:
!pip list

Package                            Version
---------------------------------- ---------------
accelerate                         0.33.0
aiohttp                            3.9.5
aiohttp-retry                      2.8.3
aiosignal                          1.3.1
alembic                            1.13.2
altair                             5.4.1
amqp                               5.2.0
aniso8601                          9.0.1
annotated-types                    0.7.0
antlr4-python3-runtime             4.9.3
anyio                              3.7.1
anyio                              3.7.1
appdirs                            1.4.4
asttokens                          2.4.1
asyncssh                           2.15.0
atpublic                           5.0
attrs                              23.2.0
autocommand                        2.2.2
backoff                            2.2.1
backports.tarfile                  1.2.0
beautifulsoup4                     4.12.3
billiard                           4.2.0
b

DEPRECATION: Loading egg at c:\users\cyril\appdata\local\programs\python\python312\lib\site-packages\anyio-3.7.1-py3.12.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
DEPRECATION: Loading egg at c:\users\cyril\appdata\local\programs\python\python312\lib\site-packages\cohere-5.9.2-py3.12.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
DEPRECATION: Loading egg at c:\users\cyril\appdata\local\programs\python\python312\lib\site-packages\fastavro-1.9.7-py3.12-win-amd64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
DEPRECATION: Loading egg at c:\users\cyri

In [13]:
!pip uninstall httpx

^C


## Read contexts from PDF file


In [15]:
import PyPDF2

def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text()
    return text


## Splitting the context into chunks


In [16]:
def split_text(text, chunk_size=1000):
    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
    return chunks
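
Fixed-size chunks can cut a sentence in half at a chunk boundary, so the text needed to answer a query may be split across two chunks. One common mitigation is to let consecutive chunks overlap. The sketch below is an assumption layered on top of `split_text`, not part of the original notebook; the name `split_text_with_overlap` and the `overlap` parameter are hypothetical.

```python
def split_text_with_overlap(text, chunk_size=1000, overlap=200):
    """Split text into character chunks; neighbours share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Tiny demonstration with a short string instead of a PDF.
demo_chunks = split_text_with_overlap("abcdefghij", chunk_size=4, overlap=2)
print(demo_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With `overlap=0` this behaves like `split_text`. A larger overlap produces more chunks and therefore more embedding calls.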


## Using Cohere API services for Text Embedding.


In [36]:
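import os

import cohere
import numpy as np

# Read the API key from the environment rather than hard-coding it in the notebook.
# The variable name COHERE_API_KEY is an assumption; use whatever name you export.
co = cohere.Client(os.environ["COHERE_API_KEY"])

def create_embeddings(texts, batch_size=40):
    """Embed `texts` in batches and return one array with a row per text."""
    embeddings = []
    for i in range(0, len(texts), batch_size):

        batch = texts[i:i+batch_size]
        print(batch)  # debug output: shows the batch being embedded
        # input_type="search_document" marks these texts as documents to be indexed
        response = co.embed(texts=batch, model="embed-english-v3.0", input_type="search_document")
        embeddings.append(response.embeddings)

    # Each entry holds one batch of vectors; stack them into a single 2-D array
    return np.vstack(embeddings)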
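
The batching logic can be checked without calling the API by separating it from the embedding call. The sketch below is only an illustration: `batched`, `embed_in_batches`, and the stand-in `fake_embed` are hypothetical names, and `fake_embed` returns made-up 2-dimensional vectors rather than real embeddings.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_fn, batch_size=40):
    """Call `embed_fn` once per batch and return one flat list of vectors."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))  # one vector per input text
    return vectors


def fake_embed(batch):
    """Stand-in for the real embedding call (assumption, for testing only)."""
    return [[float(len(text)), 1.0] for text in batch]


demo_vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(demo_vectors)  # [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
```

In the notebook, `embed_fn` would wrap the `co.embed(...)` call and return `response.embeddings`.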


## Initializing FAISS Indexing


In [37]:
import faiss
import numpy as np

dimension = 1024
index = faiss.IndexFlatL2(dimension)
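
`IndexFlatL2` is an exhaustive index: every query is compared against every stored vector, and the reported distances are squared Euclidean distances. The pure-Python sketch below illustrates that computation on toy data. It is not how FAISS is implemented, and `flat_l2_search` is a hypothetical name.

```python
def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors closest to `query`."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared L2 distance between the stored vector and the query
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


stored = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
distances, indices = flat_l2_search(stored, [1.0, 0.0], k=2)
print(distances, indices)  # [0.0, 1.0] [1, 0]
```

The returned indices are row positions in insertion order, which is why the notebook can later map them straight back to entries of `chunks`.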


## Processing: Reading a PDF file, extracting its text, splitting it into chunks, embedding the chunks, and adding the embeddings to the FAISS index


In [40]:
pdf_path = r"C:\Users\cyril\Downloads\Applied-Generative-AI-for-Beginners.pdf"
text = extract_text_from_pdf(pdf_path)
chunks = split_text(text)

# Generate embeddings
chunk_embeddings = create_embeddings(chunks)
print(chunk_embeddings)
# Add embeddings to FAISS index
index.add(np.array(chunk_embeddings).astype(np.float32))


['Applied \nGenerative AI for Beginners\nPractical Knowledge on Diffusion Models, \nChatGPT, and Other LLMs\n—\nAkshay Kulkarni\nAdarsha ShivanandaAnoosh KulkarniDilip GudivadaApplied Generative AI for \nBeginners\nPractical Knowledge on\xa0Diffusion \nModels, ChatGPT, and\xa0Other LLMs\nAkshay\xa0Kulkarni\nAdarsha\xa0Shivananda\nAnoosh\xa0Kulkarni\nDilip\xa0GudivadaApplied Generative AI for Beginners: Practical Knowledge on Diffusion Models, \nChatGPT, and Other LLMs\nISBN-13 (pbk): 978-1-4842-9993-7   ISBN-13 (electronic): 978-1-4842-9994-4\nhttps://doi.org/10.1007/978-1-4842-9994-4\nCopyright © 2023 by Akshay Kulkarni, Adarsha Shivananda, Anoosh Kulkarni,  \nDilip Gudivada\nThis work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the \nmaterial is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, \nbroadcasting, reproduction on microfilms or in any other physical way, and transmission o

AttributeError: 'list' object has no attribute 'raw_items'

## Function to retrieve relevant information from the documents.


In [79]:
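def retrieve(query, index, k=3):
    # Embed the query; for v3 embedding models, queries use input_type="search_query"
    query_embed = co.embed(texts=[query], model="embed-english-v3.0", input_type="search_query").embeddings

    # D holds the distances and I the indices of the k nearest chunks
    D, I = index.search(np.array(query_embed).astype(np.float32), k)

    # Map the indices back to the text chunks (relies on the global `chunks` list)
    return [chunks[i] for i in I[0]]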

## Generation of the text from prompt


In [89]:
query = "What is Generative AI and its applications?"
context = retrieve(query, index)

contexts = ""
for cont in context:
  contexts = contexts + cont

prompt = f"**Context/Knowledge**: {contexts} \n\n **Query**: {query}"
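
The loop above concatenates the retrieved chunks with nothing between them, so the last word of one chunk runs into the first word of the next. A small helper can join them with a visible separator instead. This is a sketch under assumptions: the name `build_prompt` and the `---` separator are illustrative choices, not part of the original notebook.

```python
def build_prompt(query, retrieved_chunks, separator="\n\n---\n\n"):
    """Join retrieved chunks with a separator and append the query."""
    context = separator.join(chunk.strip() for chunk in retrieved_chunks)
    return f"**Context/Knowledge**: {context} \n\n **Query**: {query}"


demo_prompt = build_prompt("What is RAG?", ["chunk one ", " chunk two"])
print(demo_prompt)
```

In the notebook this would be called as `build_prompt(query, retrieve(query, index))`.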

## Streaming the output.


In [90]:
import cohere

stream = co.chat_stream(
  model='command-r-plus-08-2024',
  message=prompt,
  temperature=0.4,
  chat_history=[],
  prompt_truncation='AUTO',
  #connectors=[{"id":"web-search"}],
  max_tokens=4096
)

for event in stream:
  if event.event_type == "text-generation":
    print(event.text, end='')

## Generative AI:
Generative AI is a fascinating and rapidly evolving field within artificial intelligence. It involves the development of advanced algorithms and models that can create new and diverse content, mimicking the creative process of humans. Unlike traditional AI systems that are designed for specific tasks, generative AI focuses on learning patterns and structures from existing data to produce novel outputs.

## Applications of Generative AI:
1. **Text Generation**:
   - Generative AI models can write creative stories, news articles, poetry, and even code. For example, GPT (Generative Pre-trained Transformer) models have gained significant attention for their ability to generate coherent and contextually relevant text.
   - These models can assist content creators, writers, and marketers in generating ideas, outlines, and drafts, thereby increasing productivity.

2. **Image Generation**:
   - Generative AI can create realistic images, artwork, and even modify or enhance exi
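
The streaming cell prints text as it arrives but does not keep the answer. If the full answer is needed afterwards (for logging or evaluation, say), the text events can be collected into one string. The sketch below uses stand-in event objects so that it runs without an API call; `collect_stream_text` is a hypothetical helper, and the fake events only mimic the `event_type` and `text` attributes used in the cell above.

```python
from types import SimpleNamespace


def collect_stream_text(events):
    """Concatenate the text of every "text-generation" event in `events`."""
    parts = []
    for event in events:
        # Events without generated text are skipped
        if getattr(event, "event_type", None) == "text-generation":
            parts.append(event.text)
    return "".join(parts)


# Stand-in events (assumption): real events come from co.chat_stream(...)
fake_events = [
    SimpleNamespace(event_type="stream-start"),
    SimpleNamespace(event_type="text-generation", text="Hello, "),
    SimpleNamespace(event_type="text-generation", text="world"),
    SimpleNamespace(event_type="stream-end"),
]
answer = collect_stream_text(fake_events)
print(answer)  # Hello, world
```

Passing the `stream` object from the cell above in place of `fake_events` should work the same way, provided its events expose those two attributes.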

## Example queries


In [91]:
query = "How do diffusion models work in generating images?"
context = retrieve(query, index)

contexts = ""
for cont in context:
  contexts = contexts + cont
prompt = f"**Context/Knowledge**: {contexts} \n\n **Query**: {query}"

import cohere

stream = co.chat_stream(
  model='command-r-plus-08-2024',
  message=prompt,
  temperature=0.4,
  chat_history=[],
  prompt_truncation='AUTO',
  #connectors=[{"id":"web-search"}],
  max_tokens=4096
)

for event in stream:
  if event.event_type == "text-generation":
    print(event.text, end='')

Diffusion models offer a unique and innovative approach to image generation by employing a process that can be likened to a series of steps, each adding a layer of complexity to the image generation process. Here's a breakdown of how diffusion models generate images:

**1. Noise Schedule and Markov Chain:** The process begins with the definition of a noise schedule, which is a sequence of noise levels ranging from minimal to significant. This schedule is crucial as it determines the progression of noise introduction. The model then employs a Markov chain, a sequential process where each step corresponds to a noise level in the schedule. 

**2. Adding Noise and Latent Representation:** At each step of the Markov chain, the model introduces noise to the image. This is a controlled process, where the amount of noise added is determined by the diffusion rate. Simultaneously, the model also uses a latent representation model, typically a neural network, to encode the image into a latent rep

In [92]:
query = "What is the architecture of ChatGPT and how is it fine-tuned?"
context = retrieve(query, index)

contexts = ""
for cont in context:
  contexts = contexts + cont
prompt = f"**Context/Knowledge**: {contexts} \n\n **Query**: {query}"

import cohere

stream = co.chat_stream(
  model='command-r-plus-08-2024',
  message=prompt,
  temperature=0.4,
  chat_history=[],
  prompt_truncation='AUTO',
  #connectors=[{"id":"web-search"}],
  max_tokens=4096
)

for event in stream:
  if event.event_type == "text-generation":
    print(event.text, end='')

The architecture of ChatGPT is based on the Transformer model, a powerful neural network architecture initially introduced by Vaswani et al. in 2017. Specifically, ChatGPT utilizes a "decoder-only" version of the Transformer, which is well-suited for language generation tasks. The Transformer architecture consists of an encoder and a decoder, but in the case of ChatGPT, only the decoder component is used.

**Architecture Components**:
- **Decoder-Only Transformer**: The decoder in the Transformer architecture is responsible for generating output sequences. It takes an input and generates a corresponding response. In ChatGPT, the decoder is trained to produce coherent and contextually appropriate responses to user queries.
- **Attention Mechanism**: ChatGPT employs the self-attention mechanism, a key feature of the Transformer architecture. This mechanism allows the model to weigh the importance of different parts of the input sequence when generating a response. It enables the model to

In [93]:
query = "What are the key differences between Google Bard and ChatGPT?"
context = retrieve(query, index)

contexts = ""
for cont in context:
  contexts = contexts + cont
prompt = f"**Context/Knowledge**: {contexts} \n\n **Query**: {query}"

import cohere

stream = co.chat_stream(
  model='command-r-plus-08-2024',
  message=prompt,
  temperature=0.4,
  chat_history=[],
  prompt_truncation='AUTO',
  #connectors=[{"id":"web-search"}],
  max_tokens=4096
)

for event in stream:
  if event.event_type == "text-generation":
    print(event.text, end='')

The key differences between Google Bard and ChatGPT can be summarized as follows:

**Architecture**: The most significant distinction lies in their architectural design. ChatGPT employs a decoder-only architecture, which means it is optimized for generating text. It takes input and generates a response based on that input. On the other hand, Google Bard utilizes an encoder-decoder architecture. This architecture allows Bard to both encode input and decode it to generate a response. The encoder-decoder setup enables Bard to handle tasks that require understanding and processing the input before generating an output.

**Capabilities**: Both models are large language models with impressive capabilities, but they excel in different areas. ChatGPT, with its decoder-only architecture, is particularly skilled at generating text, making it excellent for tasks like language translation, summarization, and creative writing. Google Bard, however, is better at tasks that require real-world knowled

In [94]:
query = "How can Large Language Models (LLMs) be applied in enterprise solutions?"
context = retrieve(query, index)

contexts = ""
for cont in context:
  contexts = contexts + cont
prompt = f"**Context/Knowledge**: {contexts} \n\n **Query**: {query}"

import cohere

stream = co.chat_stream(
  model='command-r-plus-08-2024',
  message=prompt,
  temperature=0.4,
  chat_history=[],
  prompt_truncation='AUTO',
  #connectors=[{"id":"web-search"}],
  max_tokens=4096
)

for event in stream:
  if event.event_type == "text-generation":
    print(event.text, end='')

The application of Large Language Models (LLMs) in enterprise solutions offers a wide range of possibilities and benefits, as outlined in the provided context. Here are some key ways LLMs can be utilized in enterprise settings:

- **Private Generalized LLM API**: This approach focuses on data privacy, customization, and control. By developing a private LLM API, enterprises can create tailored language models that cater to their specific industry or use case. This allows for better control over sensitive data, ensuring privacy and security. Enterprises can use this to build applications for customer support, content generation, or personalized recommendations, ensuring that the model aligns with their unique requirements.

- **LLMs for Enterprise and LLM Ops**: Integrating LLMs into enterprise operations can revolutionize various processes. For instance, LLMs can be used for automated customer support, generating personalized responses to inquiries, and handling a vast array of customer

In [95]:
query = "What are the benefits and limitations of the Transformer architecture?"
context = retrieve(query, index)

contexts = ""
for cont in context:
  contexts = contexts + cont
prompt = f"**Context/Knowledge**: {contexts} \n\n **Query**: {query}"

import cohere

stream = co.chat_stream(
  model='command-r-plus-08-2024',
  message=prompt,
  temperature=0.4,
  chat_history=[],
  prompt_truncation='AUTO',
  #connectors=[{"id":"web-search"}],
  max_tokens=4096
)

for event in stream:
  if event.event_type == "text-generation":
    print(event.text, end='')

The Transformer architecture has brought about significant advancements in the field of natural language processing (NLP) and has several advantages:

**Benefits of Transformer Architecture:**
1. **Parallel Processing and Efficiency:** One of the key strengths of the Transformer is its ability to process input sequences in parallel. Unlike traditional sequential models, which process data step by step, the Transformer can handle all input elements simultaneously. This parallel processing capability leads to faster training and inference times, making it highly efficient for large-scale language tasks.
2. **Attention Mechanism:** The Transformer's attention mechanism is a powerful tool that allows the model to focus on relevant parts of the input sequence. It assigns attention weights to different elements, enabling the model to weigh the importance of each word or token in the context. This mechanism helps the model capture long-range dependencies and understand the relationships betwe

In [96]:
query = "How does the attention mechanism in Transformer models work?"
context = retrieve(query, index)

contexts = ""
for cont in context:
  contexts = contexts + cont
prompt = f"**Context/Knowledge**: {contexts} \n\n **Query**: {query}"

import cohere

stream = co.chat_stream(
  model='command-r-plus-08-2024',
  message=prompt,
  temperature=0.4,
  chat_history=[],
  prompt_truncation='AUTO',
  #connectors=[{"id":"web-search"}],
  max_tokens=4096
)

for event in stream:
  if event.event_type == "text-generation":
    print(event.text, end='')

The attention mechanism in Transformer models is a crucial component that enables the model to focus on relevant parts of the input sequence and capture important dependencies and relationships between elements. Here's a detailed explanation of how it works:

**1. Scaled Dot-Product Attention:** The attention mechanism used in Transformers is often the Scaled Dot-Product Attention. This process involves the following steps:
   - **Query, Key, and Value Vectors:** The input to the attention mechanism is a set of vectors: the query vector (Q), key vectors (K), and value vectors (V). In the context of the Transformer, these vectors are derived from the input embeddings and are learned during the training process.
   - **Dot Product:** The attention weights are calculated by taking the dot product between the query vector and each key vector. The dot product measures the similarity between the query and each key. Higher dot products indicate higher similarity.
   - **Scaling:** To prevent 