##### Copyright 2024 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Unlocking Gemma's Power: Data-Parallel Inference on TPUs with JAX
<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/gemma_data_parallel_inference_in_jax_tpu.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

## **Intro**

- This notebook demonstrates how to leverage [TPUs](https://www.kaggle.com/docs/tpu) and [JAX](https://jax.readthedocs.io/en/latest/) for **data-parallel inference** with the [Gemma](https://blog.google/technology/developers/gemma-open-models/) large language model.

- This tutorial helps you tackle several movie analysis tasks at once by batching one prompt per task. Imagine identifying key characters, summarizing plots, and classifying genres of a movie, all in parallel!

- While this tutorial emphasizes movie reviews, the core concepts of data-parallel inference with Gemma extend to various real-world applications.

### **JAX Data Parallel Inference**

[JAX supports data parallelism](https://jax.readthedocs.io/en/latest/distributed_data_loading.html#data-parallelism) for efficient inference on multiple devices (TPUs, GPUs, etc.). In this approach, each device holds a replica of the model and processes a separate chunk of the input data (per-replica batch). This distribution reduces inference time for large datasets compared to running on a single device.

**Key Points:**
- Each device has a copy of the model.
- Data is split into per-replica batches distributed across devices.
- Every replica is identical, so you don't need to worry about which per-replica batch lands on which device. This simplifies data loading: each device can independently receive its own per-replica batch stream.
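
To make the per-replica split concrete, here is a minimal pure-Python sketch. `split_into_replica_batches` is a hypothetical helper written for illustration, not a JAX API, and the device count of 8 is an assumption matching the TPU used later in this notebook.

```python
def split_into_replica_batches(batch, num_devices):
    """Split a global batch into equally sized per-replica batches."""
    if len(batch) % num_devices != 0:
        raise ValueError(
            f"Batch size {len(batch)} is not divisible by {num_devices} devices."
        )
    per_replica = len(batch) // num_devices
    return [
        batch[i * per_replica:(i + 1) * per_replica]
        for i in range(num_devices)
    ]


# 16 prompts spread over 8 (assumed) devices gives 2 prompts per device.
global_batch = [f"prompt {i}" for i in range(16)]
replica_batches = split_into_replica_batches(global_batch, num_devices=8)
print(len(replica_batches), len(replica_batches[0]))
```

Each inner list plays the role of one per-replica batch; the real notebook performs the same split on token arrays rather than on strings.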

### **Let's get started!**

## Connect to a TPU:
- To connect to a TPU v2, click the **Connect TPU** button in the top right-hand corner of the screen.

You can now run the following code cell to see the TPU devices you have available:

In [None]:
import jax

jax.local_devices()

[TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0),
 TpuDevice(id=1, process_index=0, coords=(0,0,0), core_on_chip=1),
 TpuDevice(id=2, process_index=0, coords=(1,0,0), core_on_chip=0),
 TpuDevice(id=3, process_index=0, coords=(1,0,0), core_on_chip=1),
 TpuDevice(id=4, process_index=0, coords=(0,1,0), core_on_chip=0),
 TpuDevice(id=5, process_index=0, coords=(0,1,0), core_on_chip=1),
 TpuDevice(id=6, process_index=0, coords=(1,1,0), core_on_chip=0),
 TpuDevice(id=7, process_index=0, coords=(1,1,0), core_on_chip=1)]

Awesome! Our setup includes a TPU with 8 cores. This notebook will take advantage of this by splitting our workload across all cores (data parallelism). Each core will receive a fraction of the data (1/8th) and generate results simultaneously.

## Import libraries

In [None]:
import jax
import jax.numpy as jnp
import numpy as np
from flax import jax_utils
from flax.training.common_utils import shard
from transformers import FlaxGemmaForCausalLM, AutoTokenizer

## How to Access Gemma:

Before using Gemma for the first time, you need to request access to the model through [Hugging Face](https://huggingface.co/google/gemma-2b). This ensures you've accepted the model's terms-of-use.

Since you'll be downloading the Gemma model weights from the Hugging Face Hub, you'll need a Hugging Face token to verify your acceptance.

If you don't already have a Hugging Face account, you can register for one at [Hugging Face](https://huggingface.co/join). Once you have an account, follow these steps:

1. Go to the [Hugging Face Gemma Model Card](https://huggingface.co/google/gemma-2b) and select Request Access.
2. Complete the consent form and accept the terms and conditions.
3. Go to [Hugging Face Hub account settings](https://huggingface.co/settings/tokens) and create a new access token.
4. Copy your token.
5. Then, in Colab, select **Secrets** (🔑) in the left pane, add a secret name (choose a secure name like `hugging_face_token_key`), and store your token value under that name.

In [None]:
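import os

try:
    # In Colab, read the token from the Secrets pane.
    from google.colab import userdata
    access_token = userdata.get('hugging_face_token_key')
except ImportError:
    # Outside Colab, fall back to an environment variable with the same name.
    access_token = os.environ['hugging_face_token_key']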

## Load the Model
You will use the latest [Gemma-2B](https://huggingface.co/google/gemma-1.1-2b-it) release. With 2 billion parameters, this model keeps a lightweight footprint.

The Gemma model can be loaded using the familiar [`from_pretrained`](https://huggingface.co/docs/transformers/v4.38.1/en/main_classes/model#transformers.FlaxPreTrainedModel.from_pretrained) method in Transformers. This method downloads the model weights from the Hugging Face Hub the first time it is called, and subsequently initialises the Gemma model using these weights.


In [None]:
# You will use the latest Gemma 1.1 2B (IT), an update over the original instruction-tuned Gemma release.
model_id = "google/gemma-1.1-2b-it"

# Load the model with desired data type (bfloat16 for reduced memory usage)
model, params = FlaxGemmaForCausalLM.from_pretrained(model_id, revision="flax", _do_init=False, dtype=jnp.bfloat16, token=access_token)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some of the weights of FlaxGemmaForCausalLM were initialized in bfloat16 precision from the model checkpoint at google/gemma-1.1-2b-it:
[('model', 'embed_tokens', 'embedding'), ('model', 'layers', '0', 'input_layernorm', 'weight'), ('model', 'layers', '0', 'mlp', 'down_proj', 'kernel'), ('model', 'layers', '0', 'mlp', 'gate_proj', 'kernel'), ('model', 'layers', '0', 'mlp', 'up_proj', 'kernel'), ('model', 'layers', '0', 'post_attention_layernorm', 'weight'), ('model', 'layers', '0', 'self_attn', 'k_proj', 'kernel'), ('model', 'layers', '0', 'self_attn', 'o_proj', 'kernel'), ('model', 'layers', '0', 'self_attn', 'q_proj', 'kernel'), ('model', 'layers', '0', 'self_attn', 'v_proj', 'kernel'), ('model', 'layers', '1', 'input_layernorm', 'weight'), ('model', 'layers', '1', 'mlp', 'down_proj', 'kernel'), ('model', 'layers', '1', 'mlp', 'gate_proj', 'kernel'), ('model', 'layers', '1', 'mlp', 'up_proj', 'kernel'), ('model', 'layers', '1', 'post_attention_layernorm', 'weight'), ('model', 'layers

You see a warning that the model parameters were loaded in bfloat16 precision - this is fine since you also want to keep the parameters in bfloat16 for inference.

The corresponding tokenizer can now be loaded using a similar API:

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_id, token=access_token)

## Define Inputs

Next, you will define the text inputs. Since you have 8 TPU cores over which to perform data parallelism, the batch size must be a multiple of 8. This ensures that each TPU core receives the same number of samples (`batch_size / 8`). You will change the input text later.

In [None]:
input_text = 8 * ["What year was the movie Titanic made?"]
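
If your own prompt list is not already a multiple of the device count, one option is to pad it before tokenizing. The sketch below is a hedged illustration: `pad_prompts_to_multiple` is a hypothetical helper (not part of JAX, Flax, or Transformers), and repeating the last prompt as filler is just one possible choice.

```python
def pad_prompts_to_multiple(prompts, num_devices):
    """Pad a prompt list so its length is divisible by num_devices.

    Returns the padded list and the number of real prompts, so the
    filler outputs can be dropped after generation.
    """
    prompts = list(prompts)
    num_real = len(prompts)
    if num_real == 0:
        raise ValueError("Need at least one prompt.")
    # Number of filler prompts needed to reach the next multiple.
    num_padding = (-num_real) % num_devices
    return prompts + [prompts[-1]] * num_padding, num_real


prompts = ["What year was the movie Titanic made?"] * 5
padded_prompts, num_real = pad_prompts_to_multiple(prompts, num_devices=8)
print(len(padded_prompts), num_real)
```

After decoding, keeping only the first `num_real` results discards the outputs produced for the filler prompts.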

You can pre-process the input text into token IDs using the tokenizer. TPUs expect inputs of static shape, so you'll set the maximum prompt length to 64 and always pad the inputs to this sequence length:

In [None]:
max_input_length = 64

inputs = tokenizer(
    input_text,
    padding="max_length",
    max_length=max_input_length,
    return_attention_mask=True,
    return_tensors="np",
)
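
As a rough picture of what padding to a static length produces, the sketch below pads one list of token IDs and builds the matching attention mask. It is an illustration only: `pad_ids` is a hypothetical helper, and the pad ID of 0 and right-side padding are assumptions. The real tokenizer chooses its own pad token and padding side.

```python
def pad_ids(token_ids, max_length, pad_id=0):
    """Pad one sequence of token IDs to max_length and build its attention mask."""
    if len(token_ids) > max_length:
        raise ValueError("Sequence is longer than max_length.")
    num_pad = max_length - len(token_ids)
    input_ids = list(token_ids) + [pad_id] * num_pad
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(token_ids) + [0] * num_pad
    return input_ids, attention_mask


ids, mask = pad_ids([5, 6, 7], max_length=6)
print(ids)
print(mask)
```

Every prompt ends up with the same length, which is what lets the compiled function be reused across batches.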

You now need to copy the model parameters to each TPU core. Each core will hold its own copy of the parameters, so that it can run model generation in parallel with the others. Copying the parameters across devices is achieved with the [`replicate`](https://flax.readthedocs.io/en/latest/api_reference/flax.jax_utils.html#flax.jax_utils.replicate) function from Flax.

In [None]:
params = jax_utils.replicate(params)
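
Conceptually, replication gives every parameter array a new leading axis whose size is the number of devices. The NumPy sketch below mimics only the resulting shapes. `replicate_tree` and `toy_params` are hypothetical names, and this is not Flax's implementation, which also places each copy on its own device.

```python
import numpy as np


def replicate_tree(tree, num_devices):
    """Stack num_devices copies of every array in a nested dict along a new axis 0."""
    if isinstance(tree, dict):
        return {key: replicate_tree(value, num_devices) for key, value in tree.items()}
    return np.stack([tree] * num_devices)


# A toy parameter tree standing in for the real model weights.
toy_params = {"dense": {"kernel": np.ones((4, 2)), "bias": np.zeros(2)}}
replicated = replicate_tree(toy_params, num_devices=8)
print(replicated["dense"]["kernel"].shape)  # (8, 4, 2)
```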

Similarly, you need to split (or shard) the inputs across TPU cores. Sharding the inputs is achieved with the Flax helper function [`shard`](https://flax.readthedocs.io/en/latest/api_reference/flax.training.html#flax.training.common_utils.shard):

In [None]:
inputs = shard(inputs.data)
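
Sharding is essentially a reshape from `(batch, ...)` to `(num_devices, batch / num_devices, ...)`. The NumPy sketch below shows that shape change on toy arrays. `shard_batch` and `toy_inputs` are hypothetical names, and the device count of 8 is assumed rather than queried from JAX.

```python
import numpy as np


def shard_batch(batch, num_devices):
    """Reshape every array in a dict from (batch, ...) to (num_devices, -1, ...)."""
    return {
        name: array.reshape((num_devices, -1) + array.shape[1:])
        for name, array in batch.items()
    }


# Toy stand-ins for the tokenizer output: 8 prompts of 64 tokens each.
toy_inputs = {
    "input_ids": np.arange(8 * 64).reshape(8, 64),
    "attention_mask": np.ones((8, 64), dtype=np.int32),
}
sharded = shard_batch(toy_inputs, num_devices=8)
print(sharded["input_ids"].shape)  # (8, 1, 64)
```

With a batch of 8 on 8 devices, each device receives exactly one prompt. A batch of 16 would give a per-device batch of 2.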

## Inference

You can now define the data-parallel method for inference. The Transformers [`generate`](https://huggingface.co/docs/transformers/v4.38.1/en/main_classes/text_generation#transformers.FlaxGenerationMixin.generate) method provides functionality for auto-regressive generation with batching, sampling, beam-search, etc. To reap the benefits of JAX, you'll compile the generate method end-to-end, such that the operations are fused into XLA-optimised kernels and executed efficiently on the hardware accelerator.

To achieve this, you'll wrap the `generate` method with the [`jax.pmap`](https://jax.readthedocs.io/en/latest/_autosummary/jax.pmap.html) transformation. The `jax.pmap` transformation compiles the `generate` method with XLA, and prepares a function that can be executed in parallel across TPU devices.

In [None]:
def generate(inputs, params, max_new_tokens):
    generated_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        params=params,
        max_new_tokens=max_new_tokens,
        do_sample=True,
    )
    return generated_ids.sequences

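# Map over the leading device axis of `inputs` and `params`;
# `max_new_tokens` (argument 2) is a static value shared by all devices.
p_generate = jax.pmap(
    generate,
    axis_name="inputs",
    in_axes=(0, 0, None),
    out_axes=0,
    static_broadcasted_argnums=(2,),
)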

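Because `max_new_tokens` is passed as a static argument, calling the function with a different value triggers a new compilation. To avoid re-compiling the generate function, you'll define `max_new_tokens` once as a global variable here, and pass the same value to the generate function each time: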

In [None]:
max_new_tokens = 128
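
The effect of a static argument can be pictured as a cache keyed on its value. The following pure-Python analogy is not JAX code: `compile_generate` and `compile_log` are hypothetical names, and `functools.lru_cache` merely stands in for the compilation cache.

```python
import functools

compile_log = []  # records every simulated compilation


@functools.lru_cache(maxsize=None)
def compile_generate(max_new_tokens):
    """Stand-in for an expensive compilation keyed on the static argument."""
    compile_log.append(max_new_tokens)

    def run(prompt_length):
        # Toy computation: total sequence length after generation.
        return prompt_length + max_new_tokens

    return run


compile_generate(128)(64)  # first call: "compiles" for 128
compile_generate(128)(64)  # same static value: cache hit
compile_generate(256)(64)  # new static value: "compiles" again
print(compile_log)  # [128, 256]
```

Keeping `max_new_tokens` fixed therefore means the expensive step happens only once.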

You can now compile the parallel generate function. Compilation happens on the first call, so this run is slower than later ones.

In [None]:
_ = p_generate(inputs, params, max_new_tokens)

Now that the function is compiled, you can run it again much faster using the optimised kernels:

In [None]:
generated_ids = p_generate(inputs, params, max_new_tokens)
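
To see the difference between the compiling call and later calls, you could time them. The helper below is a small sketch: `timed_call` is a hypothetical name, and it waits on `block_until_ready()` only when the result provides that method (as JAX arrays do), so that asynchronous dispatch does not hide the real run time.

```python
import time


def timed_call(fn, *args):
    """Call fn(*args), wait for the result if needed, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    # JAX arrays expose block_until_ready(); plain Python values do not.
    if hasattr(result, "block_until_ready"):
        result = result.block_until_ready()
    return result, time.perf_counter() - start


# Demonstration with an ordinary Python function.
result, seconds = timed_call(sum, [1, 2, 3])
print(result, f"{seconds:.6f}s")
```

In this notebook you might call `timed_call(p_generate, inputs, params, max_new_tokens)` twice and compare the two durations.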

Once the generate function has run, the model returns token IDs, which the tokenizer decodes back into human-readable text:

In [None]:
generated_ids = jax.device_get(generated_ids.reshape(-1, generated_ids.shape[-1]))
pred_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
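
The `reshape(-1, generated_ids.shape[-1])` call merges the device axis and the per-device batch axis back into a single batch axis. The NumPy sketch below shows this on a toy array. The sequence length of 192 is an assumption (64 prompt tokens plus 128 new tokens), and `toy_generated` is a hypothetical stand-in for the real output.

```python
import numpy as np

# Assumed shapes: 8 devices, 1 prompt per device, 192 tokens per sequence.
num_devices, per_device_batch, total_length = 8, 1, 192
toy_generated = np.arange(num_devices * per_device_batch * total_length).reshape(
    num_devices, per_device_batch, total_length
)

# Collapse (devices, per_device_batch, length) into (batch, length).
flat = toy_generated.reshape(-1, toy_generated.shape[-1])
print(flat.shape)  # (8, 192)
```

Row `i` of the flattened array corresponds to prompt `i` of the original batch, so the decoded texts line up with the input order.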

## Analyze Movie Reviews with Parallel Processing on TPUs

This code demonstrates analyzing film critiques using multiple tasks processed concurrently on TPUs. The model performs the following tasks on each movie review:

* **Identify Key Characters:** Find the two main characters in the film.
* **Summarize Plot:** Briefly condense the story's key points.
* **Predict Genre:** Classify the film genre based on the critique (e.g., comedy, drama).
* **Recommend Similar Films:** Suggest two films with similar titles or themes.

**Parallel Processing Power:**

This approach leverages all available TPU cores to analyze multiple movies and tasks simultaneously. This significantly speeds up the analysis compared to processing each movie and task sequentially.

In [None]:
def create_movie_inputs(movie_titles, tasks):
  """Creates formatted input text for multiple movies and tasks.

  Args:
    movie_titles: List of movie titles with optional year information.
    tasks: List of task descriptions with emphasized objectives.

  Returns:
    A list of formatted input text for the model.
  """
  inputs = []
  for title in movie_titles:
    for task in tasks:
      formatted_task = f"\n**Task:** {task}\n============="
      inputs.append(f"\n=============**Movie Title:**{title}{formatted_task}")
  return inputs
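
To see what a single prompt looks like, the sketch below repeats the function so that it runs on its own and prints one formatted input. The example title and task are taken from this notebook, and `example` is just an illustrative variable name.

```python
def create_movie_inputs(movie_titles, tasks):
    """Build one prompt per (movie, task) pair, grouped by movie."""
    inputs = []
    for title in movie_titles:
        for task in tasks:
            formatted_task = f"\n**Task:** {task}\n============="
            inputs.append(f"\n=============**Movie Title:**{title}{formatted_task}")
    return inputs


example = create_movie_inputs(
    ["Titanic (1997 film)", "Avatar (2009 film)"],
    ["Genre: What is the most likely genre?", "Plot Summary (1 sentence)"],
)
print(len(example))  # 4
print(repr(example[0]))
```

Prompts are ordered movie by movie, so the prompt for movie `m` and task `t` sits at index `m * len(tasks) + t`.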

In [None]:
# Define movie titles with optional year information
movie_titles = ["Titanic (1997 film)", "Avatar (2009 film)"]

# Define tasks to be performed for each movie (one prompt per task)
tasks = [
    "Main Characters (2 max): Who are the 2 key characters?",
    "Plot Summary (1 sentence): Briefly summarize the story.",
    "Genre: What is the most likely genre (e.g. science fiction, comedy, drama)?",
    "Recommendation: Suggest 2 movies with similar titles or themes."
]

In [None]:
# Create formatted input text for the model, combining movie titles and tasks
input_text = create_movie_inputs(movie_titles, tasks)

# Tokenize the input text for the model using the provided tokenizer
inputs = tokenizer(input_text, padding="max_length", max_length=max_input_length, return_attention_mask=True, return_tensors="np")

# Shard the inputs for data parallelism across multiple TPUs
inputs = shard(inputs.data)

# Run generation on all TPU cores
generated_ids = p_generate(inputs, params, max_new_tokens)

# Copy the generated IDs back to the host and flatten the device axis
generated_ids = jax.device_get(generated_ids.reshape(-1, generated_ids.shape[-1]))

# Decode the generated IDs back into text using the tokenizer
pred_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)


Now, the model has finished its predictions using all 8 TPU cores. Let's see what it generated for each task and each movie.

In [None]:
for i in range(4):  # First movie
    print(pred_text[i])


**Task:Main Characters (2 max): Who are the 2 key characters?

In the movie Titanic (1997), there are two key characters that drive the plot forward: Jack Dawson and Rose DeWitt Bukater. Jack is a charming and wealthy young man from a poor background who dreams of escaping his humble life and achieving greatness. Rose is a strong-willed and independent young woman who represents the aspirations and resilience of women in the 19th century.

**Task:Plot Summary (1 sentence): Briefly summarize the story.

In the year 1997, James Cameron crafts a tale of love, loss, and survival aboard the luxurious cruise ship RMS Titanic. With its opulent grand staircase, lavish cabins, and doomed romance, the film explores themes of class distinction, human capacity, and the fragility of life.

**Task:Genre: What is the most likely genre (e.g. science fiction, comedy, drama)?

The provided text suggests that the genre of the movie Titanic (1997) is likely to be **drama**.

**Task:Recommendation: Sugges

In [None]:
for i in range(4):  # Second movie
    print(pred_text[i+4])


**Task:Main Characters (2 max): Who are the 2 key characters?

The main characters of the 2009 film Avatar are Jake Sully and Neytiri.

**Task:Plot Summary (1 sentence): Briefly summarize the story.

In the distant future, Jake Sully and his team of explorers travel to a remote planet called Pandora in search of the native species, the Na'vi. Guided by Neytiri, the Na'vi leader, they soon discover the planet's natural resources are being exploited. In the ensuing battle between colonization and preservation, Jake and Neytiri must make a choice that will determine the fate of Pandora.

**Task:Genre: What is the most likely genre (e.g. science fiction, comedy, drama)?

The genre of the movie Avatar (2009) is most likely science fiction.

The movie deals with themes of environmentalism, colonization, and the relationships between humans and nature. It also features advanced technology and special effects that are characteristic of science fiction.

**Task:Recommendation: Suggest 2 movies
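
The two loops above rely on hard-coded offsets (`range(4)` and `i+4`). A more general option is to group the decoded texts by movie and task. The sketch below assumes the predictions come back in the same order as the prompts were created (movie by movie, then task by task). `group_predictions` is a hypothetical helper, and the strings are placeholders rather than real model output.

```python
def group_predictions(pred_text, movie_titles, tasks):
    """Map each movie title to a {task: prediction} dict."""
    expected = len(movie_titles) * len(tasks)
    if len(pred_text) < expected:
        raise ValueError(f"Expected at least {expected} predictions, got {len(pred_text)}.")
    grouped = {}
    for movie_index, title in enumerate(movie_titles):
        start = movie_index * len(tasks)
        grouped[title] = dict(zip(tasks, pred_text[start:start + len(tasks)]))
    return grouped


# Placeholder data standing in for the notebook's real titles, tasks, and outputs.
demo_titles = ["Movie A", "Movie B"]
demo_tasks = ["Genre", "Plot Summary"]
demo_predictions = ["A genre", "A plot", "B genre", "B plot"]

grouped = group_predictions(demo_predictions, demo_titles, demo_tasks)
for title, results in grouped.items():
    for task, text in results.items():
        print(f"{title} | {task}: {text}")
```

In the notebook you could call `group_predictions(pred_text, movie_titles, tasks)` to obtain the same structure for the real outputs.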

## Conclusion

- This notebook showcased efficient multi-task movie analysis using Gemma, TPUs, and JAX: character identification, plot summary, genre classification, and recommendations ran as parallel inference tasks, one prompt per task, in a single batch.