<a href="https://colab.research.google.com/github/Rajspark/GenAi-Project/blob/main/Copy_of_3_Hands_on_using_Open_Source_LLMs_Locally.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<center> <h1> Using Local Open Source LLMs</h1> </center>


_____

In this notebook we will learn how to download and run open-source LLMs locally, using Google Colab. You can also run this notebook on your own machine or server, as long as it has a GPU with enough memory for the model.

The model we will try here is:

__[Google RecurrentGemma 2B IT](https://huggingface.co/google/recurrentgemma-2b-it)__, a 2B-parameter LLM built by Google. It is the instruction-tuned version of [Google RecurrentGemma 2B](https://huggingface.co/google/recurrentgemma-2b).

RecurrentGemma is a family of open language models built on a novel recurrent architecture developed at Google. Both pre-trained and instruction-tuned versions are available in English.

Like Gemma, RecurrentGemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Because of its novel architecture, RecurrentGemma requires less memory than Gemma and achieves faster inference when generating long sequences.

__You will need more than 5 GB of GPU memory to run inference smoothly with RecurrentGemma 2B IT.__


When using Google Colab, remember to change the runtime type as shown below and select an available GPU so the LLM runs faster:

![](https://i.imgur.com/a26Qmdw.png)

## Check Your Available GPU Memory

In [None]:
!nvidia-smi  # run only if you have connected to a GPU runtime

Mon Jul  7 11:36:19 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   42C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Install Necessary Dependencies

In [None]:
pip install transformers accelerate

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.wh

__Restart the runtime from the Runtime menu above to make sure the installed libraries are ready to be used in Colab__

## Log in to Hugging Face Using Your Token

Get your token [here](https://huggingface.co/settings/tokens) and log in using the following code.

In [None]:
from huggingface_hub import notebook_login

notebook_login()


## Load the LLM Locally Using Hugging Face

In [None]:
# import locale
# locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model_id = "google/recurrentgemma-2b-it"
dtype = torch.bfloat16
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=dtype,
)

tokenizer_config.json:   0%|          | 0.00/47.0k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/915 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/42.8k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/399M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

### Try out a basic prompt

In [None]:
chat = [
    { "role": "user", "content": "Explain what is AI in 3 bullet points" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

In [None]:
print(prompt)

<bos><start_of_turn>user
Explain what is AI in 3 bullet points<end_of_turn>
<start_of_turn>model
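To make the printed prompt less opaque, here is a minimal sketch that rebuilds the same layout by hand. The helper `format_gemma_chat` and the `ROLE_NAMES` mapping are hypothetical names introduced only for illustration; the real template lives in the tokenizer and may handle more cases (for example whitespace trimming), so `tokenizer.apply_chat_template` remains the call to use in practice.

```python
# Hypothetical helper that mirrors the prompt layout printed above.
# Assumption: the "assistant" role is rendered as "model" in Gemma-style prompts.
ROLE_NAMES = {"user": "user", "assistant": "model"}

def format_gemma_chat(chat, add_generation_prompt=True):
    """Render a list of {"role", "content"} messages as a single prompt string."""
    parts = ["<bos>"]
    for message in chat:
        role = ROLE_NAMES[message["role"]]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    if add_generation_prompt:
        # Open the model's turn so generation continues from here.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)

demo_chat = [{"role": "user", "content": "Explain what is AI in 3 bullet points"}]
manual_prompt = format_gemma_chat(demo_chat)
print(manual_prompt)
```

The sketch shows why `add_generation_prompt=True` matters: without the trailing `<start_of_turn>model` line, the model would not be cued to start its own reply.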



In [None]:
model.device

device(type='cuda', index=0)

In [None]:
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device),
                         max_new_tokens=150)
print(tokenizer.decode(outputs[0]))

<bos><start_of_turn>user
Explain what is AI in 3 bullet points<end_of_turn>
<start_of_turn>model
- **AI (Artificial Intelligence) is the ability of a computer or machine to perform tasks that typically require human intelligence, such as learning, problem-solving, and decision-making.**
- **AI uses algorithms and data to learn and improve over time, without being explicitly programmed.**
- **AI has the potential to revolutionize many aspects of our lives, from healthcare and education to transportation and manufacturing.**<eos>
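The decoded text above repeats the prompt and ends with `<eos>`. One option is to slice off the prompt tokens before decoding (`outputs[0][inputs.shape[-1]:]`) and pass `skip_special_tokens=True` to `tokenizer.decode`. The same idea can be sketched on plain strings; `extract_reply` and the sample strings below are hypothetical and exist only to illustrate the clean-up step.

```python
def extract_reply(decoded, prompt, eos_token="<eos>"):
    """Return only the model's reply from a decoded prompt-plus-reply string."""
    # Drop the echoed prompt if the decoded text starts with it.
    reply = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    # Remove the end-of-sequence marker and surrounding whitespace.
    return reply.replace(eos_token, "").strip()

# Made-up strings in the same shape as the output above.
prompt_text = "<bos><start_of_turn>user\nHi<end_of_turn>\n<start_of_turn>model\n"
decoded_text = prompt_text + "Hello there!<eos>"
print(extract_reply(decoded_text, prompt_text))  # prints: Hello there!
```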


Remember to always refer to the [__documentation__](https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate), which describes all the arguments of `generate` in detail. Most notably:

- **max_length:** The maximum length of the generated sequence, counting the prompt tokens
- **max_new_tokens:** The maximum number of tokens to generate, ignoring the number of tokens in the prompt. Use either `max_new_tokens` or `max_length`, not both, since both cap the output length
- **do_sample:** Whether or not to use sampling. `False` means greedy decoding, i.e. always picking the most likely next token (similar in effect to a temperature close to 0)
- **temperature:** A positive value, usually between 0 and 1, used to modulate the next-token probabilities. It only takes effect when `do_sample=True`. A higher temperature means the results may vary more and be more creative
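To see what temperature does to the next-token probabilities, here is a small sketch in plain Python. The logits are made-up numbers for three candidate tokens, and `softmax_with_temperature` is an illustrative helper, not part of the `transformers` API.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.1, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature almost all of the probability mass sits on the top-scoring token, which is why low-temperature sampling behaves much like greedy decoding. At higher temperatures the distribution flattens and less likely tokens get picked more often.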

In [None]:
outputs = model.generate(input_ids=inputs.to(model.device),
                         max_new_tokens=150,
                         do_sample=True,
                         temperature=0.5
                         )
print(tokenizer.decode(outputs[0]))

<bos><start_of_turn>user
Explain what is AI in 3 bullet points<end_of_turn>
<start_of_turn>model
- **AI (Artificial Intelligence) is the ability of a computer or machine to perform tasks that typically require human intelligence, such as learning, problem-solving, and decision-making.**
- **AI uses algorithms and data to learn and improve over time, without being explicitly programmed.**
- **AI has the potential to revolutionize many aspects of our lives, from healthcare to transportation to agriculture.**<eos>


### Pipelines make it easier to send prompts

You don't need to encode and decode your inputs and outputs every time.

In [None]:
gemma_pipe = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="cuda",
)

Device set to use cuda


In [None]:
prompt

'<bos><start_of_turn>user\nExplain what is AI in 3 bullet points<end_of_turn>\n<start_of_turn>model\n'

In [None]:
response = gemma_pipe(prompt,
                      max_new_tokens=150,
                      do_sample=True,
                      temperature=0.5,
                      return_full_text=False) # don't return the input prompt; return only the response

In [None]:
response

[{'generated_text': '- **AI (Artificial Intelligence) is the ability of a computer or machine to perform tasks that typically require human intelligence, such as learning, problem-solving, and decision-making.**\n- **AI utilizes algorithms and data to learn and improve over time, enabling it to perform tasks like recognizing objects, understanding language, and making predictions.**\n- **AI has the potential to revolutionize various industries by automating tasks, improving efficiency, and solving complex problems.**'}]

In [None]:
print(response[0]['generated_text'])

- **AI (Artificial Intelligence) is the ability of a computer or machine to perform tasks that typically require human intelligence, such as learning, problem-solving, and decision-making.**
- **AI utilizes algorithms and data to learn and improve over time, enabling it to perform tasks like recognizing objects, understanding language, and making predictions.**
- **AI has the potential to revolutionize various industries by automating tasks, improving efficiency, and solving complex problems.**


In [None]:
from IPython.display import display, Markdown

display(Markdown(response[0]['generated_text']))

- **AI (Artificial Intelligence) is the ability of a computer or machine to perform tasks that typically require human intelligence, such as learning, problem-solving, and decision-making.**
- **AI utilizes algorithms and data to learn and improve over time, enabling it to perform tasks like recognizing objects, understanding language, and making predictions.**
- **AI has the potential to revolutionize various industries by automating tasks, improving efficiency, and solving complex problems.**

## Check how much GPU Memory the LLM Uses

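Remember that RecurrentGemma 2B IT uses more than 5 GB of GPU memory. The output below shows about 6.5 GB in use after loading the model and running inference.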

In [None]:
!nvidia-smi

Mon Jul  7 11:52:08 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   74C    P0             32W /   70W |    6526MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
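As a rough sanity check on the number above, the arithmetic can be sketched in a few lines. The shard sizes come from the download log earlier in this notebook and are assumed here to be decimal gigabytes; the split between weights and "everything else" is an estimate, not a measurement.

```python
MIB = 1024 ** 2  # bytes per mebibyte, the unit nvidia-smi reports

# Checkpoint shard sizes from the download log (assumed to be decimal GB).
shard_sizes_gb = [4.97, 0.399]
weights_gb = sum(shard_sizes_gb)

# Memory in use according to the nvidia-smi output above.
used_mib = 6526
used_gb = used_mib * MIB / 1e9

# Whatever is left over is not model weights.
overhead_gb = used_gb - weights_gb
print(f"weights ~{weights_gb:.2f} GB, in use ~{used_gb:.2f} GB, other ~{overhead_gb:.2f} GB")
```

The remainder most likely covers the CUDA context, activations, and generation state, which is why the usage reported by `nvidia-smi` is higher than the size of the checkpoint files alone.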
                                                

In [None]:
gemma_pipe

<transformers.pipelines.text_generation.TextGenerationPipeline at 0x7f4871e7c350>

## Prompting with Open-Source LLM

Now we will use our locally loaded LLM and try some tasks with prompting

### 1. Basic Q & A

In [None]:
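def create_gemma_prompt(prompt_text):
  """Wrap a user message in Gemma's chat format and return the prompt string."""
  chat = [
    { "role": "user", "content": prompt_text },
  ]
  # add_generation_prompt=True appends the opening of the model's turn
  prompt = tokenizer.apply_chat_template(chat, tokenize=False,
                                         add_generation_prompt=True)
  return prompt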
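Since most of the tasks in this notebook follow the same pattern (build a prompt, call the pipeline, read `generated_text`), the three steps can be folded into one helper. This is only a sketch: `ask_llm`, `fake_pipe`, and `fake_build_prompt` are hypothetical names, and the two stand-ins exist so the example runs without a GPU.

```python
def ask_llm(pipe, build_prompt, text, **generation_kwargs):
    """Format `text` as a chat prompt, run the pipeline, and return the reply string."""
    # Default to returning only the newly generated text.
    generation_kwargs.setdefault("return_full_text", False)
    response = pipe(build_prompt(text), **generation_kwargs)
    return response[0]["generated_text"]

# Stand-ins that mimic the pipeline's return shape (a list with one dict).
def fake_pipe(prompt, **kwargs):
    return [{"generated_text": f"(reply to {len(prompt)} prompt characters)"}]

def fake_build_prompt(text):
    return f"<start_of_turn>user\n{text}<end_of_turn>\n<start_of_turn>model\n"

demo_answer = ask_llm(fake_pipe, fake_build_prompt,
                      "Can you explain what is mortgage?", max_new_tokens=50)
print(demo_answer)
```

With the real objects from this notebook the call would look like `ask_llm(gemma_pipe, create_gemma_prompt, "Can you explain what is mortgage?", max_new_tokens=1000, do_sample=True, temperature=0.5)`.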

In [None]:
prompt_txt = "Can you explain what is mortgage?"
prompt = create_gemma_prompt(prompt_txt)
print(prompt)

<bos><start_of_turn>user
Can you explain what is mortgage?<end_of_turn>
<start_of_turn>model



In [None]:
response = gemma_pipe(prompt,
                      max_new_tokens=1000,
                      do_sample=True,
                      temperature=0.5,
                      return_full_text=False)
print(response[0]['generated_text'])

**Definition:**

A mortgage is a legal loan that allows a borrower to purchase a home by putting down a down payment and borrowing the remaining amount. The lender provides the funds for the purchase, and the borrower signs a legally binding contract with the lender that outlines the terms of the loan.

**Key Features:**

* **Loan type:** Mortgage loans are secured loans, meaning that the borrower provides collateral (the home) to secure the loan.
* **Purpose:** Mortgages are used to purchase homes, cover expenses associated with homeownership (e.g., rent, utilities), or consolidate debt.
* **Term:** Mortgage terms vary depending on the lender, but they typically range from 15 to 30 years.
* **Down payment:** A down payment is the initial payment that the borrower makes towards the purchase price of the home.
* **Interest rate:** The interest rate on a mortgage is determined by the borrower's creditworthiness, the current market conditions, and other factors.
* **Loan amounts:** Mortga

In [None]:
display(Markdown(response[0]['generated_text']))

**Definition:**

A mortgage is a legal loan that allows a borrower to purchase a home by putting down a down payment and borrowing the remaining amount. The lender provides the funds for the purchase, and the borrower signs a legally binding contract with the lender that outlines the terms of the loan.

**Key Features:**

* **Loan type:** Mortgage loans are secured loans, meaning that the borrower provides collateral (the home) to secure the loan.
* **Purpose:** Mortgages are used to purchase homes, cover expenses associated with homeownership (e.g., rent, utilities), or consolidate debt.
* **Term:** Mortgage terms vary depending on the lender, but they typically range from 15 to 30 years.
* **Down payment:** A down payment is the initial payment that the borrower makes towards the purchase price of the home.
* **Interest rate:** The interest rate on a mortgage is determined by the borrower's creditworthiness, the current market conditions, and other factors.
* **Loan amounts:** Mortgage amounts vary widely depending on the down payment, credit score, and other factors.

**Process:**

1. **Determine eligibility:** The borrower must meet the lender's requirements for eligibility, which typically include a minimum credit score, income, and employment history.
2. **Get pre-approved:** A pre-approval process involves obtaining a letter from the lender stating the maximum loan amount and interest rate that the borrower is eligible for.
3. **Find a home:** The borrower identifies a home that meets their needs and budget.
4. **Make an offer:** The borrower submits an offer to purchase the home, which includes the purchase price, down payment, and any contingencies (e.g., inspection).
5. **Close the loan:** Once the offer is accepted, the borrower works with their lender to finalize the loan, including obtaining a mortgage insurance (if needed) and signing the mortgage documents.

**Benefits:**

* **Homeownership:** Mortgages allow individuals to own a home, which provides stability and security.
* **Equity:** As the borrower makes payments on the mortgage, they build equity in the home, which can be used as a down payment on future purchases or as a source of cash.
* **Tax benefits:** Mortgage interest deductions and other tax benefits can reduce the amount of income tax owed.
* **Financial stability:** Owning a home can improve financial stability and provide a sense of community.

**Considerations:**

* **Financial responsibility:** It is crucial to be financially responsible and understand the terms of the loan before entering into a mortgage.
* **Credit score:** A high credit score can improve the borrower's chances of getting a favorable interest rate.
* **Market conditions:** Interest rates and home prices can fluctuate, so it is important to be aware of these factors when making a mortgage decision.

### 2. Report Summarization

In [None]:
report = """
Generative AI is a type of artificial intelligence technology that can produce various types of content, including text, imagery, audio and synthetic data. The recent buzz around generative AI has been driven by the simplicity of new user interfaces for creating high-quality text, graphics and videos in a matter of seconds.
The technology, it should be noted, is not brand-new. Generative AI was introduced in the 1960s in chatbots. But it was not until 2014, with the introduction of generative adversarial networks, or GANs -- a type of machine learning algorithm -- that generative AI could create convincingly authentic images, videos and audio of real people.
On the one hand, this newfound capability has opened up opportunities that include better movie dubbing and rich educational content. It also unlocked concerns about deepfakes -- digitally forged images or videos -- and harmful cybersecurity attacks on businesses, including nefarious requests that realistically mimic an employee's boss.
Two additional recent advances that will be discussed in more detail below have played a critical part in generative AI going mainstream: transformers and the breakthrough language models they enabled. Transformers are a type of machine learning that made it possible for researchers to train ever-larger models without having to label all of the data in advance. New models could thus be trained on billions of pages of text, resulting in answers with more depth. In addition, transformers unlocked a new notion called attention that enabled models to track the connections between words across pages, chapters and books rather than just in individual sentences. And not just words: Transformers could also use their ability to track connections to analyze code, proteins, chemicals and DNA.
The rapid advances in so-called large language models (LLMs) -- i.e., models with billions or even trillions of parameters -- have opened a new era in which generative AI models can write engaging text, paint photorealistic images and even create somewhat entertaining sitcoms on the fly. Moreover, innovations in multimodal AI enable teams to generate content across multiple types of media, including text, graphics and video. This is the basis for tools like Dall-E that automatically create images from a text description or generate text captions from images.
These breakthroughs notwithstanding, we are still in the early days of using generative AI to create readable text and photorealistic stylized graphics. Early implementations have had issues with accuracy and bias, as well as being prone to hallucinations and spitting back weird answers. Still, progress thus far indicates that the inherent capabilities of this generative AI could fundamentally change enterprise technology how businesses operate. Going forward, this technology could help write code, design new drugs, develop products, redesign business processes and transform supply chains.
"""

prompt_txt = f"""
Summarize the following report delimited by triple backticks on Generative AI in max 5 lines

Report:
```{report}```
"""

prompt = create_gemma_prompt(prompt_txt)

llm_response = gemma_pipe(prompt,
                          max_new_tokens=500,
                          do_sample=False,
                          return_full_text=False)

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


In [None]:
print(llm_response[0]['generated_text'])

- Generative AI can produce various types of content, including text, imagery, audio, and synthetic data.
- It uses techniques like transformers and large language models to create realistic images, videos, and text.
- Generative AI has both opportunities and concerns, such as deepfakes and cybersecurity risks.
- Recent advances have made generative AI more powerful and versatile, enabling it to write text, create images, and generate content across multiple media.
- Despite its potential, generative AI is still in its early stages and requires further development to address accuracy and bias issues.


In [None]:
display(Markdown(llm_response[0]['generated_text']))

- Generative AI can produce various types of content, including text, imagery, audio, and synthetic data.
- It uses techniques like transformers and large language models to create realistic images, videos, and text.
- Generative AI has both opportunities and concerns, such as deepfakes and cybersecurity risks.
- Recent advances have made generative AI more powerful and versatile, enabling it to write text, create images, and generate content across multiple media.
- Despite its potential, generative AI is still in its early stages and requires further development to address accuracy and bias issues.
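The summarization prompt above relies on triple backticks to mark where the report starts and ends. A minimal sketch of a reusable builder follows; `build_summary_prompt` is a hypothetical helper, and stripping any backtick fences found inside the report is an assumption about how one might keep the delimiters unambiguous.

```python
def build_summary_prompt(report_text, topic, max_lines=5):
    """Wrap a report in triple backticks and ask for a bounded summary."""
    fence = "`" * 3
    # Remove fences inside the report so the delimiters stay unambiguous.
    cleaned = report_text.replace(fence, "").strip()
    return (
        f"Summarize the following report delimited by triple backticks "
        f"on {topic} in max {max_lines} lines\n\n"
        f"Report:\n{fence}{cleaned}{fence}\n"
    )

demo_prompt = build_summary_prompt(
    "Generative AI can produce text, imagery and audio.", "Generative AI"
)
print(demo_prompt)
```

The resulting string can be passed to `create_gemma_prompt` in the same way as `prompt_txt` above.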

### 3. Basic Sentiment Analysis

In [None]:
review = """I recently worked with this real estate company to purchase my first home,
    and the experience was outstanding. The agent was knowledgeable, patient, and incredibly responsive.
    They guided me through every step of the process, making what could have been a stressful
    experience very smooth and enjoyable.
    """

In [None]:
prompt_txt = f"""
Act as a customer review analyst, given the following customer review text,
do the following tasks:
- Find the sentiment (positive, negative or neutral)
- Extract max 5 key topics or phrases of the good or bad in the review

Review Text:
{review}
"""
prompt = create_gemma_prompt(prompt_txt)
llm_response = gemma_pipe(prompt,
                          max_new_tokens=150,
                          do_sample=False,
                          return_full_text=False)
response = llm_response[0]['generated_text']

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


In [None]:
print(response)

**Sentiment Analysis:**
The overall sentiment of the review is positive. The reviewer is expressing satisfaction with the service provided by the real estate agent.

**Key Topics/Phrases:**
- Outstanding experience
- Knowledgeable agent
- Patient
- Responsive
- Smooth and enjoyable experience


In [None]:
display(Markdown(response))

**Sentiment Analysis:**
The overall sentiment of the review is positive. The reviewer is expressing satisfaction with the service provided by the real estate agent.

**Key Topics/Phrases:**
- Outstanding experience
- Knowledgeable agent
- Patient
- Responsive
- Smooth and enjoyable experience
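If the sentiment and topics need to feed into other code, the free-text reply has to be parsed. Below is a heuristic sketch that assumes the reply looks like the one above (a sentence naming the sentiment, then `- ` bullets). `parse_review_analysis` is a hypothetical helper, and the model's format is not guaranteed to stay the same across prompts or runs.

```python
import re

def parse_review_analysis(text):
    """Pull the sentiment label and bullet phrases out of a free-text reply."""
    # Take the first sentiment keyword that appears anywhere in the reply.
    match = re.search(r"\b(positive|negative|neutral)\b", text, flags=re.IGNORECASE)
    sentiment = match.group(1).lower() if match else None
    # Treat every line that starts with "- " as one topic.
    topics = [line[2:].strip() for line in text.splitlines() if line.startswith("- ")]
    return {"sentiment": sentiment, "topics": topics}

sample_reply = """**Sentiment Analysis:**
The overall sentiment of the review is positive. The reviewer is expressing satisfaction with the service provided by the real estate agent.

**Key Topics/Phrases:**
- Outstanding experience
- Knowledgeable agent
- Patient
- Responsive
- Smooth and enjoyable experience
"""
parsed = parse_review_analysis(sample_reply)
print(parsed)
```

Because it picks the first keyword it finds, a reply such as "not negative" would be misread. Asking the model for a fixed output format (for example JSON) is usually a sturdier choice than parsing prose.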