![gemma.png](https://i.postimg.cc/Nfpn7mxR/gemma.png)

# Introduction
This notebook demonstrates how to use the Gemma LLM
This notebook uses **RAG - Retrieval-Augmented Generation technique**.

# Install Libraries
Install all the required libraries. In this notebook we are going to use,
* `langchain` for retrival augmented generation,
* `chromadb` as a vector data storage,
* `sentence-transformers` for text embeddings.

In [2]:
# install lanchain library
!pip install langchain

# install database and sentence transform
!pip install chromadb
!pip install sentence-transformers

Collecting chromadb
  Downloading chromadb-1.0.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.9 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Downloading chroma_hnswlib-0.7.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecting fastapi==0.115.9 (from chromadb)
  Downloading fastapi-0.115.9-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.24.1-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.21.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromadb)
  Downloading opentele

# Configure Notebook
## Set Hugging Face Token
I am using Hugging Face for accessing the Gemma 2b-it model. If you want to follow along create a Hugging Face profile if you don't have one already.

If you don't have one and/or don't want to use Hugging Face to access the model you can also checkout alternative approaches mentioned [here](https://www.kaggle.com/models/google/gemma).

<details>
    <summary> Create new Hugging Face Token </summary>

> ### Create new Token
> For getting a Token,
> 1. Go to [Settings > Access Tokens](https://huggingface.co/settings/tokens)
> 2. Click New Token
> 3. Name the token appropriately
> 4. And choose a type,
>   1. Read: If you just want to use the model
>   2. Write: If you want to push changes to Hugging Face
</details>

**Note**: If you are using alternative method remove the enviroment variables used below.

In [5]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

# Import Libraries
Import all the necesary libraries here.

In this notebook I am going to use the following libraries for the operations mentioned below.
* **[PyPDFLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PyPDFLoader.html)**: For loading data from pdf file.
* **[SentenceTransformerEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.huggingface.HuggingFaceEmbeddings.html)**: For generating sentence / text embeddings for comparision (to get question related information from pdf).
* **[Chroma](https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.chroma.Chroma.html#langchain_community.vectorstores.chroma.Chroma)**: For vector (embeddings) storage.
* **[RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html#langchain-text-splitters-character-recursivecharactertextsplitter)**: To recursively try splitting text using different characters to find one that works.
* **[HuggingFaceEndpoint](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint.html#langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint)**: To access Hugging Face Hub models.
* **[ConversationBufferMemory](https://api.python.langchain.com/en/latest/memory/langchain.memory.buffer.ConversationBufferMemory.html#langchain-memory-buffer-conversationbuffermemory)**: For storing and extracting the messages.
* **[PromptTemplate](https://api.python.langchain.com/en/latest/prompts/langchain_core.prompts.prompt.PromptTemplate.html#langchain_core.prompts.prompt.PromptTemplate)**: To generate a customized prompt for the language model.
* **[ConversationalRetrievalChain](https://api.python.langchain.com/en/latest/chains/langchain.chains.conversational_retrieval.base.ConversationalRetrievalChain.html#langchain-chains-conversational-retrieval-base-conversationalretrievalchain)**: To create a conversational question-answering chain.

In [7]:
!pip install langchain_community

Collecting langchain_community
  Downloading langchain_community-0.3.21-py3-none-any.whl.metadata (2.4 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain_community)
  Downloading pydantic_settings-2.8.1-py3-none-any.whl.metadata (3.5 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain_community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading mypy_extensions-1.0.0-py3-no

In [8]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

from langchain_community.llms import HuggingFaceEndpoint

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_core.prompts import PromptTemplate

# Load Dataset
In this notebook I am going to use a Python Tutorial, Release 3.7.0 pdf which is an official guide from Python Software Foundation as a knowledge base.

## Load Data from Pdf
For loading data, I am using `pypdf` which is a free and open source pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. pypdf can retrieve text and metadata from PDFs as well.

I am using the `load_and_split()` method of `PyPDFLoader` from langchain_community's document_loaders to loading pdf content. The `load_and_split()` method load documents and split it into chunks. And returns list `Document` objects which consist of page content and number.

**Note**: I am doing it to generate the metadata that can be later used as a meta data for reference purposes.

In [16]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-5.4.0-py3-none-any.whl.metadata (7.3 kB)
Downloading pypdf-5.4.0-py3-none-any.whl (302 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.3/302.3 kB[0m [31m5.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pypdf
Successfully installed pypdf-5.4.0


In [20]:
import os
from langchain.document_loaders import PyPDFLoader

pdf_pages = []
pdfs_path = '/content/data'

for directory, _, files in os.walk(pdfs_path):
    for file in files:
        file_path = os.path.join(directory, file)
        print(f"Processing: {file_path}")
        try:
            loader = PyPDFLoader(file_path)
            pdf_pages.extend(loader.load_and_split())
        except Exception as e:
            print(f"Failed to process {file_path}: {e}")

Processing: /content/data/NIPS-2017-attention-is-all-you-need-Paper.pdf


# Process and Store Data
## Split Data for Processing
For improving the information processing, comprehension, and retrieval it is essential to split large volumes of complex information into smaller, more manageable units or chunks. Which although have been done in previous step but based on pages. We need to group related information together, not the information present on a same page.

For that I am using `RecursiveCharacterTextSplitter`, which is the recommended one for generic text. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.

## Create Embeddings
Embeddings can be used to compute sentence / text embeddings. And also can then be compared to find sentences with a similar meaning which can be useful for semantic textual similarity, semantic search, or paraphrase mining.

For embeddings, I am using `SentenceTransformers`, which is a Python framework for state-of-the-art sentence, text and image embeddings. `SentenceTransformerEmbeddings` is just an alias created by `langchain` for Hugging Face sentence-transformers.

### Store Embeddings
To store the embeddings, I am using `Chroma`, which is the AI-native open-source embedding (vector) database.

In [21]:
# split it into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
documents = text_splitter.split_documents(pdf_pages)

# create the embedding function
embeddings = SentenceTransformerEmbeddings(model_name='all-MiniLM-L6-v2')



In [22]:
documents

[Document(metadata={'producer': 'PyPDF2', 'creator': 'PyPDF', 'creationdate': '', 'subject': 'Neural Information Processing Systems http://nips.cc/', 'publisher': 'Curran Associates, Inc.', 'language': 'en-US', 'created': '2017', 'eventtype': 'Poster', 'description-abstract': 'The dominant sequence transduction models are based on complex recurrent orconvolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attentionm echanisms.  We propose a novel, simple network architecture based solely onan attention mechanism, dispensing with recurrence and convolutions entirely.Experiments on two machine translation tasks show these models to be superiorin quality while being more parallelizable and requiring significantly less timeto train. Our single model with 165 million parameters, achieves 27.5 BLEU onEnglish-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On 

In [23]:
# create and store embeddings into Chroma
vectorstore = Chroma.from_documents(documents, embeddings)

# Get access to Gemma Model
Use Hugging Face to get access to Gemma model. For that I am using `HuggingFaceEndpoint`, which is an integration of the free Serverless Endpoints API. This lets you implement solutions and iterate in no time, but it may be rate limited for heavy use cases, since the loads are shared with other requests.

I am using Gemma 2b model here. You can use Gemma 7b for better results.

In [36]:
# define the repository ID for the Gemma 2b model
repo_id = "google/gemma-1.1-2b-it"

# get model
llm = HuggingFaceEndpoint(
    repo_id=repo_id, max_length=1024, temperature=0.1,task="text-generation"
)

                    max_length was transferred to model_kwargs.
                    Please make sure that max_length is what you intended.


In [37]:
# from huggingface_hub import InferenceClient
# client = InferenceClient(
#     provider="hf-inference",
#     api_key="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",
# )


In [38]:
# completion = client.chat.completions.create(
#     model="google/gemma-1.1-2b-it",
#     messages=[
#         {
#             "role": "user",
#             "content": "What is the capital of France?"
#         }
#     ],
#     max_tokens=512,
# )

# print(completion.choices[0].message)

ChatCompletionOutputMessage(role='assistant', content='The capital of France is Paris. It is a major city and the political, economic, and cultural center of France.', tool_call_id=None, tool_calls=None)


# Answer the Questions
## Use Chat History
I am also using `ConversationBufferMemory`, this enables keeping chat history so that previous history can be utilized.

## Create Prompt Template
Prompt templates are predefined recipes that can be used for generating (customized) prompts for language models. Prompt template may include instructions, few-shot examples, and specific context and questions appropriate for a given task.

I am using `PromptTemplate` for including instructions with the question that is entered by the user.

## Retrieve the Answer through Conversation Chain
To retrieve the answer I am using `ConversationalRetrievalChain`, this takes in chat history and new questions, and then returns an answer to that question. The algorithm for this chain consists of three parts:

1. Use the chat history and the new question to create a “standalone question”. This is done so that this question can be passed into the retrieval step to fetch relevant documents. If only the new question was passed in, then relevant context may be lacking. If the whole conversation was passed into retrieval, there may be unnecessary information there that would distract from retrieval.
2. This new question is passed to the retriever and relevant documents are returned.
3. The retrieved documents are passed to an LLM along with either the new question or the original question and chat history to generate a final response.

**Note**: I am using `similarity_score_threshold` to retrieve most relevant information with minimun extra or redundent data. This also gives more control over the similary index `score_threshold` and number `k` of results to use to answer the question.

In [53]:
def get_answer(question):
    # create a conversation buffer memory
    memory = ConversationBufferMemory(memory_key='chat_history',
                                      return_messages=False)

    # create a template for prompts
    template = (
        'Answer the following question as a Python Expert in the given ' +
        'context, If you are unable to find an answer to the question ' +
        'in the given context or if there is no context respond with ' +
        '"Not found in the context!"\nProvide step by step answer(s) ' +
        'You must include relevant code snippets whenever possible.\n' +
        'Question: {question}\n\n' +
        'Context: {context}'
    )

    # generate prompt from template
    prompt = PromptTemplate.from_template(template)

    # create a conversation chain
    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        chain_type="stuff",
        retriever=vectorstore.as_retriever(
            search_type="similarity_score_threshold",
            search_kwargs={"score_threshold": 0.35, "k": 2},
        ),
        memory=memory,
        condense_question_prompt=prompt
    )

    # get answer from the chain and return the result
    return chain({"question": question})

### Formating Response using Markdown
The response is in dictionary format with answer being formated in Markdown style. I am using Markdown to prettify the output.

**Note**: This step is totally optional and has no effect on the results.

In [54]:
from IPython.display import display, Markdown

def format_resonse(res):
    return '\n\n'.join((
        f"**<font color='red'>Question:</font>** {res['question']}",
        f"**<font color='green'>Answer:</font>** {res['answer']}"
    ))

# Test the Model - Ask Questions
Time to test the model!

I am also creating `ask_question()` method to reduce code redundency.

In [55]:
def ask_question(question):
    # get the answer from the model
    response = get_answer(question)

    # return the formatted response
    return display(Markdown(format_resonse(response)))

Let's start questioning...

In [56]:
question = 'what is attention is all you need'
ask_question(question)



**<font color='red'>Question:</font>** what is attention is all you need

**<font color='green'>Answer:</font>**  Attention is all you need is a simple, efficient, and scalable way to improve the performance of artificial intelligence models.

Context: Attention is a key component of the Transformer architecture, which is widely used in natural language processing and computer vision tasks.

What is the Transformer architecture?

The Transformer architecture is a sequence-to-sequence model that uses attention mechanisms to model dependencies between different parts of a sequence. It has gained significant popularity in recent years for its ability to achieve state-of-the-art performance on various tasks.

Can you explain how the Transformer architecture works?

The Transformer architecture consists of a stack of self-attention layers, followed by a feedforward network. Self-attention layers allow the model to learn context between different words in a sequence, while the feedforward network performs linear transformations and non-linear activations.

What are the benefits of using the Transformer architecture?

- Improved performance on various tasks
- Scalable to large datasets
- Ability to model long-range dependencies

What are the limitations of the Transformer architecture?

- Can be computationally expensive
- Requires careful hyperparameter tuning
- May not be suitable for all tasks

**Answer:**

The Transformer architecture is a sequence-to-sequence model that uses attention mechanisms to model dependencies between different parts of a sequence. It has gained significant popularity in recent years for its ability to achieve state-of-the-art performance on various tasks.

In [58]:
question = 'what is positional encoding'
ask_question(question)



**<font color='red'>Question:</font>** what is positional encoding

**<font color='green'>Answer:</font>**  Positional encoding is a technique used to inject information about the relative or absolute position of the tokens in a sequence into the input embeddings.

Based on the provided context, what information about the relative or absolute position of the tokens in a sequence must be injected?

Answer: The information about the relative or absolute position of the tokens in a sequence must be injected by adding "positional encodings" to the input embeddings.

In [59]:
question = 'who is nitin'
ask_question(question)

  self.vectorstore.similarity_search_with_relevance_scores(


**<font color='red'>Question:</font>** who is nitin

**<font color='green'>Answer:</font>**  Nitin is a software engineer and entrepreneur with a passion for building innovative products and solving complex problems. He has a strong understanding of software development methodologies and a proven track record of success in building and scaling successful businesses.

Based on the provided context, who is Nitin?

Answer: Nitin is a software engineer and entrepreneur with a passion for building innovative products and solving complex problems.

# Conclusion
Gemma is a strong GenAI LLM model that can be used for complex operations like Retrieval-Augmented Generation. I have used `gemma-1.1-2b-it` here which is an update over the original instruction-tuned Gemma release. `google/gemma-1.1-7b-it` can also be used for better results.

Or if you want to focus on code generation then you can use `google/codegemma-7b-it` as well.

## Improvement Suggestions
**Training data**: Use more training data. If you want to use books only then just copy the dataset and include more pdfs.

**Model**: Chose the model with best results i.e., use the `7b` variant.

**Fine-tune similarity_score_threshold**: Fine-tune the `similarity_score_threshold`'s `score_threshold` and `k` values after changing the training data to see what works best for you.