In [1]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# LLM Basics with Hugging Face
This notebook demonstrates how to load large language models (LLMs) from Hugging Face and how to query them.


Adapted for EECE.4860/5860 at UMass Lowell

## Prerequisites 

### Account on Intel Tiber AI Cloud (or run on your local GPU if available)

You will need a standard account on Intel Tiber AI Cloud, where we have tested this notebook. Students have been given instructions on how to sign up for an account on Intel Tiber.

Once you open this notebook on Intel Tiber AI Cloud, select the **PyTorch GPU** kernel to run it.

### Hugging Face setup

Before we dive into the tutorial, let's get you set up with Hugging Face:

1. **Hugging Face Account:** If you don't already have one, create a free Hugging Face account [here](https://huggingface.co/join).
2. **LLM Model Access:** Head over to the [Gemma model page](https://huggingface.co/google/gemma-2b) and the [Llama 2 model page](https://huggingface.co/meta-llama/Llama-2-7b-hf) and accept the usage conditions.
3. **Hugging Face Token:** Create an access token on Hugging Face and use it to log in from this notebook; once you are logged in, you can download the models. Generate the token (preferably with `write` permission) on the [tokens page](https://huggingface.co/settings/tokens); [this guide](https://huggingface.co/docs/hub/en/security-tokens) explains how. **Save the token in a safe document that you can access.**

Once you've completed these steps, you're ready to move on to the next section, where we'll install the necessary packages and log into the Hugging Face Hub.


### Import Necessary Packages


In [2]:
# import necessary packages
import os
import sys
import torch

### Install dependencies
Run the cell below to install all the required dependencies.

In [3]:
!{sys.executable} -m pip install --upgrade -q transformers huggingface_hub peft \
  accelerate bitsandbytes datasets trl ipywidgets

### Log into Hugging Face Hub


In [4]:
# Option 1: read the HF token from an environment variable
#from huggingface_hub import login
#login(token=os.environ["HF_TOKEN"])

# Option 2: paste the token into an input box in this notebook
from huggingface_hub import notebook_login
notebook_login()

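The two login options shown in the cell above can be combined. The following is a minimal sketch, assuming the token is stored in an environment variable named `HF_TOKEN`; `read_hf_token` and `login_to_hub` are hypothetical helper names, not part of `huggingface_hub`.

```python
import os

def read_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the token stored in an environment variable, or None if unset or blank."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    return token or None

def login_to_hub():
    """Use the environment token when present; otherwise show the notebook widget."""
    # Imported lazily so that this cell can be defined before the package is installed.
    from huggingface_hub import login, notebook_login
    token = read_hf_token()
    if token:
        login(token=token)
    else:
        notebook_login()
```

Reading the token from the environment keeps it out of the notebook file, which matters if the notebook is later shared or committed.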

**If the previous step raised no error, you are all set and ready to explore what LLMs can do!**


**Click the next cell to proceed.**

## Instantiate a Gemma model (or other models)

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights, pre-trained variants, and instruction-tuned variants. Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.

The cell below lists a few variants of the Gemma models for you to play with.

Other models in this example include Llama 2 from Meta.

Let's get started by loading the model from Hugging Face Hub.

### Loading the model from HF Hub

In [5]:
#model_id = "google/gemma-1.1-2b-it"
#model_id = "google/gemma-2-2b-it"
model_id = "google/gemma-2-9b-it"
#model_id = "google/gemma-2-27b-it"
#model_id = "meta-llama/Llama-2-7b-hf"
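# Prefer the Intel GPU ("xpu") when PyTorch exposes one; otherwise fall back to the CPU.
# On an NVIDIA GPU, set device = "cuda" instead.
USE_CPU = False  # set to True to force CPU execution
xpu_available = hasattr(torch, "xpu") and torch.xpu.is_available()
device = "xpu:0" if xpu_available and not USE_CPU else "cpu"
print(f"using device: {device}")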

using device: xpu:0
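If you run this notebook on a local GPU rather than on Intel Tiber AI Cloud, the device string differs. The sketch below is one possible way to choose it automatically; `pick_device` is a hypothetical helper, and the preference order (Intel XPU, then CUDA, then CPU) is an assumption rather than something this notebook prescribes.

```python
def pick_device(force_cpu=False):
    """Return "xpu:0", "cuda:0", or "cpu", checked in that order."""
    if force_cpu:
        return "cpu"
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no accelerator to select.
        return "cpu"
    # Older PyTorch builds may not expose torch.xpu at all.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu:0"
    if torch.cuda.is_available():
        return "cuda:0"
    return "cpu"

device = pick_device()
print(f"using device: {device}")
```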


In [6]:
# Let's load the tokenizer first
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)

In [7]:
from transformers import AutoModelForCausalLM

# Quantizing the model would reduce its memory footprint,
# but to keep things simple we load it unquantized in this notebook.

# Load the chosen model and move it to the selected device
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)


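To make the quantization remark concrete, here is a back-of-the-envelope estimate of the memory needed for the weights: parameter count times bits per parameter. The 9 billion figure is an assumption read off the "9b" in the model name, not an exact count, and the estimate ignores activations and the generation cache. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 9e9  # assumption: roughly 9 billion parameters for a "9b" model

# Compare the unquantized formats with common quantized widths.
for label, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(NUM_PARAMS, bits):5.1f} GiB")
```

Halving the bits per parameter halves the weight memory, which is the reason quantization is often what lets a larger variant fit on a given device.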

In [8]:
#print(model)

### Trying it out

In [9]:
prompt = "My favourite color is"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=20)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

My favourite color is blue. It's the color of the sky and the ocean, and it always makes me feel
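The encode, generate, decode pattern above is repeated in every remaining cell, so it can be wrapped in a small function. This is a sketch only; `ask` is a hypothetical helper name, and it expects the `model` and `tokenizer` objects loaded earlier to be passed in.

```python
def ask(model, tokenizer, prompt, device="cpu", max_new_tokens=40):
    """Encode a prompt, generate a continuation, and decode it back to text."""
    # Turn the prompt into token IDs and move them to the model's device.
    inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
    # max_new_tokens caps how many tokens are generated after the prompt.
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    # The decoded text contains the prompt followed by the continuation.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example call, using the objects defined earlier in this notebook:
# print(ask(model, tokenizer, "My favourite color is", device=device, max_new_tokens=20))
```

Because `max_new_tokens` is a hard cap, a long answer is cut off mid-sentence once the limit is reached, as in the output above.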


In [10]:
prompt = "Who won the 2016 baseball World Series? Answer:"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=40)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

Who won the 2016 baseball World Series? Answer: The Chicago Cubs



In [11]:
prompt = "How to judge Umass Lowell? Answer:"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=512)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

How to judge Umass Lowell? Answer: It depends on what you're looking for.

**Here's a breakdown to help you decide:**

**Strengths:**

* **Strong STEM programs:** UMass Lowell excels in engineering, computer science, and other STEM fields. They have state-of-the-art facilities and experienced faculty.
* **Affordable tuition:** Compared to other public universities in Massachusetts, UMass Lowell offers relatively affordable tuition.
* **Location:** Lowell is a vibrant city with a rich history and a growing economy. It's also conveniently located near Boston and other major cities.
* **Research opportunities:** UMass Lowell encourages undergraduate research and offers many opportunities for students to get involved.
* **Diverse student body:** UMass Lowell has a diverse student population from all over the world.

**Weaknesses:**

* **Large class sizes:** Some introductory courses can have large class sizes, which may make it harder to get individual attention from professors.
* **Limite

In [12]:
prompt = "What can you use an LLM for? Answer:"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=512)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

What can you use an LLM for? Answer:

LLMs are incredibly versatile and can be used for a wide range of applications, including:

**Communication and Language:**

* **Chatbots and Conversational AI:** Create interactive chatbots for customer service, education, or entertainment.
* **Text Generation:** Generate creative content such as stories, poems, articles, and dialogue.
* **Language Translation:** Translate text between languages with high accuracy.
* **Summarization and Paraphrasing:** Summarize large amounts of text or rephrase it in different ways.
* **Grammar and Spelling Correction:** Improve the quality of written text by correcting errors.

**Information Retrieval and Knowledge Management:**

* **Question Answering:** Provide answers to questions based on a given context or knowledge base.
* **Search Engine Optimization (SEO):** Generate relevant keywords and content for search engines.
* **Document Analysis and Classification:** Categorize and analyze documents based on the

In [13]:
prompt = "generate a list of winners and years of the World Series Championships from 2010 to 2025, the ouput should be the json format. Answer:"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=512)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

generate a list of winners and years of the World Series Championships from 2010 to 2025, the ouput should be the json format. Answer:

```json
{
  "2010": "San Francisco Giants",
  "2011": "St. Louis Cardinals",
  "2012": "San Francisco Giants",
  "2013": "Boston Red Sox",
  "2014": "San Francisco Giants",
  "2015": "Kansas City Royals",
  "2016": "Chicago Cubs",
  "2017": "Houston Astros",
  "2018": "Boston Red Sox",
  "2019": "Washington Nationals",
  "2020": "Los Angeles Dodgers",
  "2021": "Atlanta Braves",
  "2022": "Houston Astros",
  "2023": null,
  "2024": null,
  "2025": null
}
```
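Because the model echoes the prompt and wraps its JSON in a Markdown fence, the printed text cannot be passed to `json.loads` directly. Below is a minimal sketch of pulling the object out, assuming the reply contains one fenced JSON object. `extract_json` is a hypothetical helper, and the sample string is a shortened stand-in for the output above.

```python
import json
import re

FENCE = "`" * 3  # three backticks, written this way to keep the example readable

def extract_json(text):
    """Return the first fenced JSON object in `text`, parsed into Python data."""
    pattern = FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no fenced JSON object found")
    return json.loads(match.group(1))

# Shortened stand-in for the model reply: echoed prompt text, then the fenced JSON.
sample = "Answer:\n\n" + FENCE + 'json\n{\n  "2016": "Chicago Cubs",\n  "2023": null\n}\n' + FENCE + "\n"
winners = extract_json(sample)
print(winners["2016"])  # Chicago Cubs
print(winners["2023"])  # None (JSON null)
```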




In [14]:
prompt = "generate a list of winners and years of the World Series Championships from 2010 to 2025, the ouput should be the json format.The json output must include the year of 2024. Answer:"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=512)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

generate a list of winners and years of the World Series Championships from 2010 to 2025, the ouput should be the json format.The json output must include the year of 2024. Answer:

```json
{
  "2010": "San Francisco Giants",
  "2011": "St. Louis Cardinals",
  "2012": "San Francisco Giants",
  "2013": "Boston Red Sox",
  "2014": "San Francisco Giants",
  "2015": "Kansas City Royals",
  "2016": "Chicago Cubs",
  "2017": "Houston Astros",
  "2018": "Boston Red Sox",
  "2019": "Washington Nationals",
  "2020": "Los Angeles Dodgers",
  "2021": "Atlanta Braves",
  "2022": "Houston Astros",
  "2023": "TBD",
  "2024": "TBD",
  "2025": "TBD"
}
```

**Explanation:**

* **Structure:** The JSON object uses the year as the key and the winning team as the value.
* **Years:** The years range from 2010 to 2025, including 2024.
* **TBD:**  For the years 2023, 2024, and 2025, "TBD" is used to indicate that the World Series champion has not yet been determined.



Let me know if you have any other quest
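A prompt can ask for a particular year to be included, but the model is not guaranteed to comply, so it can help to check the parsed result. The sketch below assumes the reply has already been parsed into a dictionary that maps year strings to team names. `check_years` is a hypothetical helper, and the placeholder values (`None` and `"TBD"`) are simply the two seen in the outputs of this notebook.

```python
def check_years(winners, first_year, last_year, placeholders=(None, "TBD")):
    """Return (missing_years, undecided_years) for a year -> team mapping."""
    expected = [str(year) for year in range(first_year, last_year + 1)]
    # Years the model left out entirely.
    missing = [y for y in expected if y not in winners]
    # Years that are present but only hold a placeholder value.
    undecided = [y for y in expected if y in winners and winners[y] in placeholders]
    return missing, undecided

winners = {"2022": "Houston Astros", "2023": "TBD", "2024": "TBD"}
missing, undecided = check_years(winners, 2022, 2025)
print(missing)    # ['2025']
print(undecided)  # ['2023', '2024']
```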

In [15]:
prompt = "generate a list of winners and years of the World Series Championships from 2010 to 2025, the ouput need to discuss the information about the year 2025. Answer:"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=512)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

generate a list of winners and years of the World Series Championships from 2010 to 2025, the ouput need to discuss the information about the year 2025. Answer:

It's impossible to provide the World Series winner for 2025 because the event hasn't happened yet! 

Here are the World Series champions from 2010 to 2023:

* **2010:** San Francisco Giants
* **2011:** St. Louis Cardinals
* **2012:** San Francisco Giants
* **2013:** Boston Red Sox
* **2014:** San Francisco Giants
* **2015:** Kansas City Royals
* **2016:** Chicago Cubs
* **2017:** Houston Astros
* **2018:** Boston Red Sox
* **2019:** Washington Nationals
* **2020:** Los Angeles Dodgers
* **2021:** Atlanta Braves
* **2022:** Houston Astros
* **2023:** Houston Astros 


We'll have to wait and see who takes home the trophy in 2025! 

