##### Copyright 2024 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Gemma - Run with Ollama Python library

Author: Sitam Meur

*   GitHub: [github.com/sitamgithub-MSIT](https://github.com/sitamgithub-MSIT/)
*   X: [@sitammeur](https://x.com/sitammeur)

Description: This notebook demonstrates how to run inference on a Gemma model using the [Ollama Python library](https://github.com/ollama/ollama-python). The Ollama Python library provides the easiest way to integrate Python 3.8+ projects with Ollama.

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/[Gemma_2]Using_with_Ollama_Python.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

## Setup

### Select the Colab runtime
To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:

1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **T4 GPU**.

## Installation

Install Ollama using the official installation script.

In [None]:
!curl -fsSL https://ollama.com/install.sh | sh

Install the Ollama Python library, the official Python client for Ollama.

In [None]:
!pip install -q ollama

## Start Ollama

Start the Ollama server in the background using `nohup`, writing its output to `ollama.log`.

In [None]:
!nohup ollama serve > ollama.log 2>&1 &
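
Because the server starts in the background, the next cells can run before it is ready to accept requests. The sketch below polls the server until it answers. It assumes Ollama is listening on its default address, `http://localhost:11434`; `http_probe` and `wait_for_server` are hypothetical helper names, not part of the Ollama Python library.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str) -> bool:
    """Return True if an HTTP GET to `url` succeeds with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, etc.: the server is not ready yet.
        return False


def wait_for_server(url="http://localhost:11434", timeout=30.0,
                    interval=0.5, probe=http_probe):
    """Poll `probe(url)` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the notebook, calling `wait_for_server()` right after starting Ollama returns `True` once the server responds, or `False` if it has not come up within the timeout (in which case `ollama.log` is the place to look).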

## Prerequisites

*   Ollama should be installed and running. (This was already completed in previous steps.)
*   Pull the `gemma2:2b` model to use with the library. The cells below do this with `ollama.pull`, the Python equivalent of the CLI command `ollama pull gemma2:2b`.
    *  See [Ollama.com](https://ollama.com/) for more information on the models available.

In [None]:
import ollama

In [None]:
ollama.pull('gemma2:2b')

## Inference

Run inference using the Ollama Python library.

### Generate

In [None]:
from ollama import generate

# Generate a response to a prompt
response = generate("gemma2:2b", "Explain the process of photosynthesis.")
print(response["response"])

#### Streaming Responses

To enable response streaming, set `stream=True`.

In [None]:
# Stream the generated response
response = generate('gemma2:2b', 'Explain the process of photosynthesis.', stream=True)

for part in response:
  print(part['response'], end='', flush=True)
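
When streaming, each part carries only a fragment of the text, so the full answer has to be assembled if you want to reuse it later. The following is a minimal sketch of that idea, assuming each streamed part exposes its text under the `'response'` key as in the loop above; `collect_stream` is a hypothetical helper name.

```python
def collect_stream(chunks, on_text=None):
    """Join the 'response' text of streamed generate parts into one string.

    `on_text`, if given, is called with each fragment as it arrives,
    for example to print it incrementally.
    """
    parts = []
    for chunk in chunks:
        text = chunk["response"]
        if on_text is not None:
            on_text(text)
        parts.append(text)
    return "".join(parts)


# Hypothetical usage with the streaming call shown above:
# full_text = collect_stream(
#     generate('gemma2:2b', 'Explain the process of photosynthesis.', stream=True),
#     on_text=lambda t: print(t, end='', flush=True),
# )
```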

#### Async client

To make asynchronous requests, use the `AsyncClient` class.

In [None]:
import asyncio
import nest_asyncio
from ollama import AsyncClient

nest_asyncio.apply()


# Named generate_async so it does not shadow the `generate` function imported earlier
async def generate_async():
    """
    Asynchronously generates a response to a given prompt using the AsyncClient.

    This function creates an instance of AsyncClient and sends a request to generate
    a response for the specified prompt. The response is then printed.
    """
    # Create an instance of the AsyncClient
    client = AsyncClient()

    # Send a request to generate a response to the prompt
    response = await client.generate(
        "gemma2:2b", "Explain the process of photosynthesis."
    )
    print(response["response"])

# Run the async function
asyncio.run(generate_async())
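
One reason to use the async client is to issue several requests without waiting for each one to finish before sending the next. The sketch below shows this with `asyncio.gather`. It assumes a client object whose `generate(model, prompt)` coroutine returns a mapping with a `'response'` key, as `AsyncClient` does in the cell above; `generate_many` is a hypothetical helper name, and whether the server actually processes the requests in parallel depends on how Ollama is configured.

```python
import asyncio


async def generate_many(client, model, prompts):
    """Send one generate request per prompt concurrently.

    Returns the response texts in the same order as `prompts`.
    """
    tasks = [client.generate(model, prompt) for prompt in prompts]
    responses = await asyncio.gather(*tasks)
    return [r["response"] for r in responses]


# Hypothetical usage in this notebook:
# answers = asyncio.run(
#     generate_many(AsyncClient(), "gemma2:2b",
#                   ["What is photosynthesis?", "What is respiration?"])
# )
```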

### Chat

In [None]:
from ollama import chat

# Start a conversation with the model
messages = [
    {
        "role": "user",
        "content": "What is keras?",
    },
]

# Get the model's response to the message
response = chat("gemma2:2b", messages=messages)
print(response["message"]["content"])
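
The `messages` list is what gives the model its conversational context: to ask a follow-up question, the earlier user and assistant turns are sent again along with the new message. The sketch below illustrates that pattern. `ask` is a hypothetical helper, and `chat_fn` stands in for the `chat` function imported above (it is passed as a parameter so the helper has no hidden dependencies).

```python
def ask(chat_fn, model, history, user_text):
    """Send `user_text` together with the prior `history`.

    Returns (reply_text, updated_history). `history` itself is not modified.
    """
    messages = history + [{"role": "user", "content": user_text}]
    response = chat_fn(model, messages=messages)
    reply = response["message"]["content"]
    return reply, messages + [{"role": "assistant", "content": reply}]


# Hypothetical usage with the `chat` function from the cell above:
# reply, history = ask(chat, "gemma2:2b", [], "What is keras?")
# reply, history = ask(chat, "gemma2:2b", history, "How do I install it?")
```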

#### Streaming Responses

To enable response streaming, set `stream=True`.

In [None]:
# Stream the chat response
stream = chat(
    model="gemma2:2b",
    messages=[{"role": "user", "content": "What is keras?"}],
    stream=True,
)

for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)

#### Async client + Streaming

To make asynchronous requests, use the `AsyncClient` class, and for streaming, use `stream=True`.

In [None]:
import asyncio
import nest_asyncio
from ollama import AsyncClient

nest_asyncio.apply()


# Named chat_async so it does not shadow the `chat` function imported earlier
async def chat_async():
    """
    Asynchronously sends a chat message to the specified model and streams the response.

    This function sends a message with the role "user" and the content "What is keras?"
    to the model "gemma2:2b" using the AsyncClient's chat method. The response is
    printed part by part as it arrives.
    """
    # Define the message to send to the model
    message = {"role": "user", "content": "What is keras?"}

    # Send the message to the model and print each part of the streamed response
    async for part in await AsyncClient().chat(
        model="gemma2:2b", messages=[message], stream=True
    ):
        print(part["message"]["content"], end="", flush=True)

# Run the async chat function
asyncio.run(chat_async())

## Conclusion

Congratulations! You have successfully run inference on a Gemma model using the Ollama Python library. You can now integrate this into your Python projects.