In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Multimodal retail recommendation: using Gemini to recommend items based on images and image reasoning

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/retail/multimodal_retail_recommendations.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fretail%2Fmultimodal_retail_recommendations.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Run in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/retail/multimodal_retail_recommendations.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/retail/multimodal_retail_recommendations.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>



| | |
|-|-|
|Author(s) | [Thu Ya Kyaw](https://github.com/iamthuya) |

## Overview

For retail companies, recommendation systems improve customer experience and thus can increase sales.

This notebook shows how you can use the multimodal capabilities of the Gemini 1.5 Pro model to rapidly build a multimodal recommendation system out of the box.

## Scenario

The customer shows you their living room:

|Customer photo |
|:-----:|
|<img src="https://storage.googleapis.com/github-repo/img/gemini/retail-recommendations/rooms/spacejoy-c0JoR_-2x3E-unsplash.jpg" width="80%">  |



Below are four chair options that the customer is trying to decide between:

|Chair 1| Chair 2 | Chair 3 | Chair 4 |
|:-----:|:----:|:-----:|:----:|
| <img src="https://storage.googleapis.com/github-repo/img/gemini/retail-recommendations/furnitures/cesar-couto-OB2F6CsMva8-unsplash.jpg" width="80%">|<img src="https://storage.googleapis.com/github-repo/img/gemini/retail-recommendations/furnitures/daniil-silantev-1P6AnKDw6S8-unsplash.jpg" width="80%">|<img src="https://storage.googleapis.com/github-repo/img/gemini/retail-recommendations/furnitures/ruslan-bardash-4kTbAMRAHtQ-unsplash.jpg" width="80%">|<img src="https://storage.googleapis.com/github-repo/img/gemini/retail-recommendations/furnitures/scopic-ltd-NLlWwR4d3qU-unsplash.jpg" width="80%">|


How can you use Gemini 1.5 Pro, a multimodal model, to help the customer choose the best option, and also explain why?

### Objectives

Your main objective is to learn how to create a recommendation system that can provide both recommendations and explanations using a multimodal model: Gemini 1.5 Pro.

In this notebook, you will begin with a scene (e.g. a living room) and use the Gemini 1.5 Pro model to perform visual understanding. You will also investigate how the model can recommend an item (e.g. a chair) from a list of furniture items provided as input.

By going through this notebook, you will learn:
- how to use the Gemini 1.5 Pro model to perform visual understanding
- how to take multimodality into consideration when prompting the Gemini 1.5 Pro model
- how the Gemini 1.5 Pro model can be used to create retail recommendation applications out-of-the-box

### Costs
This tutorial uses billable components of Google Cloud:

- Vertex AI

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Getting Started

### Install Vertex AI SDK for Python

In [None]:
%pip install --upgrade --user google-cloud-aiplatform

### Restart current runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.

In [None]:
# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>
</div>


### Authenticate your notebook environment (Colab only)

If you are running this notebook on Google Colab, run the following cell to authenticate your environment. This step is not required if you are using [Vertex AI Workbench](https://cloud.google.com/vertex-ai-workbench).

In [None]:
import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth

    auth.authenticate_user()

### Define Google Cloud project information and initialize Vertex AI

Initialize the Vertex AI SDK for Python for your project:

In [1]:
# We define two variables, PROJECT_ID and LOCATION, which are crucial to 
# identifying the Google Cloud project and the region where Vertex AI resources are located.
# The strings "[insert project here]" and "[insert location here]" are placeholders.
# Replace them with your actual GCP project ID and region (e.g., "us-central1").

PROJECT_ID = "[insert project here]"  # @param {type:"string"}
LOCATION = "[insert location here]"   # @param {type:"string"}

# Next, we import the vertexai library so we can initialize it for our project.
import vertexai

# We initialize Vertex AI with the specified project and location.
# This step ensures that all Vertex AI functionalities will be associated 
# with the correct resources in the specified project and region.
vertexai.init(project=PROJECT_ID, location=LOCATION)
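
Because the cell above ships with placeholder strings, a quick check before calling `vertexai.init` can catch a forgotten substitution. The following is a minimal sketch, not part of the original notebook: `is_placeholder` and `resolve_setting` are hypothetical helper names, and falling back to an environment variable such as `GOOGLE_CLOUD_PROJECT` is an assumption about how your environment is configured.

```python
import os


def is_placeholder(value: str) -> bool:
    """Return True if the value is empty or still looks like "[insert ... here]"."""
    stripped = value.strip()
    return not stripped or (stripped.startswith("[") and stripped.endswith("]"))


def resolve_setting(value: str, env_var: str) -> str:
    """Use the notebook value if it was filled in, else fall back to an environment variable."""
    if not is_placeholder(value):
        return value
    fallback = os.environ.get(env_var, "")
    if is_placeholder(fallback):
        raise ValueError(
            f"Set a real value or export {env_var} before initializing Vertex AI."
        )
    return fallback
```

You could then write, for example, `PROJECT_ID = resolve_setting(PROJECT_ID, "GOOGLE_CLOUD_PROJECT")` before the `vertexai.init` call, so that a leftover placeholder fails early with a readable message.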


### Import libraries

In [2]:
# We import GenerativeModel and Image from the Vertex AI generative models library.
# GenerativeModel is our interface for sending prompts to a Gemini model,
# and Image wraps image data so it can be included in a multimodal prompt.

from vertexai.generative_models import (
    GenerativeModel,  # The interface for sending prompts to a Gemini model
    Image,            # A container for image data passed to the model as input
)


## Using Gemini 1.5 Pro model

The Gemini 1.5 Pro model (`gemini-1.5-pro`) is a multimodal model that accepts images and video alongside text in single-turn or chat prompts and returns a text response.

### Load Gemini 1.5 Pro model

In [3]:
# We create an instance of a generative model named "gemini-1.5-pro".
# This tells Vertex AI which specific model we want to use for our multimodal tasks.
multimodal_model = GenerativeModel("gemini-1.5-pro")


### Define helper functions

In [4]:
# We import modules required for handling and displaying images within a Jupyter environment.
import http.client           # Enables us to handle low-level HTTP protocol requests and responses
import io                    # Allows for in-memory binary I/O (BytesIO) operations
import typing                # Provides type hints to help clarify function parameters and return types
import urllib.request        # Helps us open and read URLs (for example, to retrieve images)

import IPython.display       # Lets us display images or other media directly in a Jupyter environment
from PIL import Image as PIL_Image       # A popular Python imaging library (Pillow) for image manipulation
from PIL import ImageOps as PIL_ImageOps # Contains operations for image resizing, flipping, etc.


def display_image(
    image: Image,
    max_width: int = 600,
    max_height: int = 350
) -> None:
    """
    We display a Vertex AI Image object in the notebook while ensuring
    it's converted to a standard RGB mode and resized if it exceeds
    the specified dimensions.
    """
    # We access the PIL image that backs the Vertex AI Image object.
    # Note that '_pil_image' is a private attribute, so it may change between SDK versions.
    pil_image = typing.cast(PIL_Image.Image, image._pil_image)

    # We check if the image mode is not RGB (e.g., RGBA, CMYK), then convert
    # it to RGB. The image is later saved as JPEG, which cannot store an alpha channel.
    if pil_image.mode != "RGB":
        pil_image = pil_image.convert("RGB")

    # We grab the current dimensions of the PIL image.
    image_width, image_height = pil_image.size

    # We check if the image exceeds our desired dimensions. If it does,
    # we resize it while keeping the aspect ratio intact.
    if max_width < image_width or max_height < image_height:
        pil_image = PIL_ImageOps.contain(pil_image, (max_width, max_height))

    # We hand the PIL image off to a helper function that compresses
    # and displays the image inline in the Jupyter notebook.
    display_image_compressed(pil_image)


def display_image_compressed(pil_image: PIL_Image.Image) -> None:
    """
    We compress and display a PIL Image object in a Jupyter notebook
    to reduce file size while retaining decent image quality.
    """
    # We create an in-memory binary stream where we can save the compressed image.
    image_io = io.BytesIO()

    # We save the image as a JPEG with 80% quality, which strikes a balance
    # between visual fidelity and file size. 'optimize=True' helps reduce size further.
    pil_image.save(image_io, "jpeg", quality=80, optimize=True)

    # We retrieve the compressed bytes from the in-memory stream.
    image_bytes = image_io.getvalue()

    # We create an IPython Image object and display it inline in the notebook.
    ipython_image = IPython.display.Image(image_bytes)
    IPython.display.display(ipython_image)


def get_image_bytes_from_url(image_url: str) -> bytes:
    """
    We fetch raw image bytes from a given URL. This function raises an exception
    if the server response indicates the content is not in PNG or JPEG format.
    """
    # We open the URL and read the response into memory.
    with urllib.request.urlopen(image_url) as response:
        # We cast the response object for clarity when typing.
        response = typing.cast(http.client.HTTPResponse, response)

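        # We ensure the returned data is either PNG or JPEG, otherwise we raise an error.
        # get_content_type() ignores header parameters such as "; charset=..."
        # and falls back to "text/plain" if the header is missing.
        if response.headers.get_content_type() not in ("image/png", "image/jpeg"):
            raise ValueError("Image can only be in PNG or JPEG format")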

        # We read all the raw bytes of the image from the response.
        image_bytes = response.read()

    # We return the entire byte array of the downloaded image.
    return image_bytes


def load_image_from_url(image_url: str) -> Image:
    """
    We load an image from a given URL and convert it into
    a Vertex AI Image object, which we can feed into generative models.
    """
    # We first get the raw bytes for the image from the URL.
    image_bytes = get_image_bytes_from_url(image_url)

    # We then convert those raw bytes into a Vertex AI Image object.
    return Image.from_bytes(image_bytes)


def print_multimodal_prompt(contents: list):
    """
    We iterate through a list of text strings and Image objects,
    displaying images in the notebook and printing text to provide
    a complete overview of what we're sending to the model.
    """
    for content in contents:
        # If the item is a Vertex AI Image, display it inline in the notebook.
        if isinstance(content, Image):
            display_image(content)
        else:
            # Otherwise, we treat it as text and simply print it.
            print(content)


### Visual understanding with Gemini 1.5 Pro

Here you will ask the Gemini 1.5 Pro model to describe a room in detail from its image. To do that, you have to **combine text and image in a single prompt**.

In [None]:
# We define a URL that points to an image of a room. This image 
# will be used as a prompt for our multimodal model.
room_image_url = (
    "https://storage.googleapis.com/github-repo/img/gemini/retail-recommendations/rooms/"
    "spacejoy-c0JoR_-2x3E-unsplash.jpg"
)

# We convert the above URL into a Vertex AI Image object so it can be passed to our model.
room_image = load_image_from_url(room_image_url)

# This prompt asks the model to describe what is visible in the room 
# and the overall atmosphere of the setting.
prompt = "Describe what's visible in this room and the overall atmosphere:"

# We combine our text prompt and the room image into one list, 
# so both the text and the image will be processed by the model.
contents = [
    prompt,
    room_image,
]

# We call the 'generate_content' method on our multimodal model, 
# asking it to produce output (responses) in a streaming manner.
responses = multimodal_model.generate_content(contents, stream=True)

# We display the combined prompt, which includes the text prompt and the room image.
print("-------Prompt--------")
print_multimodal_prompt(contents)

# Finally, we print the model's response. The 'end=""' prevents 
# extra newlines from being inserted between streamed chunks.
print("\n-------Response--------")
for response in responses:
    print(response.text, end="")


### Generating open recommendations based on built-in knowledge

Using the same image, you can ask the model to recommend **a piece of furniture** that would fit in the room and to explain its reasoning.

Note that the model can choose **any furniture** to recommend in this case, and does so using only its built-in knowledge.

In [None]:
# We define two separate prompt strings:
# 1) "Recommend a new piece of furniture for this room:"
# 2) "and explain the reason in detail"
# 
# Then, we combine these text prompts and the previously loaded 'room_image' into a single list.
# The multimodal model will analyze both the text and the image together.

prompt1 = "Recommend a new piece of furniture for this room:"
prompt2 = "and explain the reason in detail"

contents = [
    prompt1,
    room_image,
    prompt2
]

# We generate content from our multimodal model in a streaming fashion.
# As soon as the model starts producing partial responses, we can read them in real-time.
responses = multimodal_model.generate_content(contents, stream=True)

# For clarity, we first print the prompt contents (text + image) so we can see
# exactly what the model is receiving.
print("-------Prompt--------")
print_multimodal_prompt(contents)

# Then, we print the model's response. The 'end=""' argument ensures that 
# we don't add unnecessary newlines between each streaming chunk.
print("\n-------Response--------")
for response in responses:
    print(response.text, end="")


In the next cell, you will ask the model to describe the room and recommend **a type of chair** that would fit in it.

Note that the model can choose **any type of chair** to recommend in this case.

In [None]:
# We define two separate prompt strings:
# 1) "Describe this room:"
# 2) "and recommend a type of chair that would fit in it"
#
# Then, we combine these text prompts and the previously loaded 'room_image' into a single list
# so the multimodal model will analyze both the text and the image together.

prompt1 = "Describe this room:"
prompt2 = "and recommend a type of chair that would fit in it"

contents = [
    prompt1,
    room_image,
    prompt2
]

# We generate content from our multimodal model in a streaming fashion.
# As soon as the model starts producing partial responses, we can read them in real-time.
responses = multimodal_model.generate_content(contents, stream=True)

# For clarity, we first print the prompt contents (text + image) so we can see
# exactly what the model is receiving.
print("-------Prompt--------")
print_multimodal_prompt(contents)

# Then, we print the model's response. The 'end=""' argument ensures
# we don't add unnecessary newlines between each streaming chunk.
print("\n-------Response--------")
for response in responses:
    print(response.text, end="")


### Generating recommendations based on provided images

Instead of keeping the recommendation open, you can also provide a list of items for the model to choose from. Here you will download a few chair images and set them as options for the Gemini model to recommend from. This is particularly useful for retail companies that want to recommend items based on the customer's room and the items the store has available.

In [None]:
# We start by defining a list of URLs pointing to different chair images. 
# We'll later load these into Vertex AI Image objects for display and analysis.
furniture_image_urls = [
    "https://storage.googleapis.com/github-repo/img/gemini/retail-recommendations/furnitures/cesar-couto-OB2F6CsMva8-unsplash.jpg",
    "https://storage.googleapis.com/github-repo/img/gemini/retail-recommendations/furnitures/daniil-silantev-1P6AnKDw6S8-unsplash.jpg",
    "https://storage.googleapis.com/github-repo/img/gemini/retail-recommendations/furnitures/ruslan-bardash-4kTbAMRAHtQ-unsplash.jpg",
    "https://storage.googleapis.com/github-repo/img/gemini/retail-recommendations/furnitures/scopic-ltd-NLlWwR4d3qU-unsplash.jpg",
]

# We convert each furniture image URL into a Vertex AI Image object 
# using the 'load_image_from_url' function. This allows us to include them 
# in our multimodal prompt.
furniture_images = [
    load_image_from_url(url) for url in furniture_image_urls
]

# We compose a list of prompt items (text + images). Labeling each chair
# in the prompt gives the model an unambiguous way to refer to each image,
# which can help reduce hallucinations and tends to produce better results.
contents = [
    "Consider the following chairs:",
    "chair 1:",
    furniture_images[0],
    "chair 2:",
    furniture_images[1],
    "chair 3:",
    furniture_images[2],
    "chair 4:",
    furniture_images[3],
    "room:",
    room_image,  # Previously loaded room image
    "You are an interior designer. For each chair, explain whether it would be appropriate for the style of the room:",
]

# We invoke 'generate_content' on our multimodal model, 
# passing in the prompt (contents) and asking for a streaming response.
responses = multimodal_model.generate_content(contents, stream=True)

# We display the prompt to see what we're sending to the model,
# which shows both the text and inline images (if in a Jupyter environment).
print("-------Prompt--------")
print_multimodal_prompt(contents)

# Lastly, we print the model's response in real-time as it's streamed.
# Using end="" ensures we don't add extra blank lines between chunks.
print("\n-------Response--------")
for response in responses:
    print(response.text, end="")
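
The prompt above hard-codes four labeled chairs. For a catalog of varying size, the same label-then-image pattern can be generated in a loop. The sketch below is one possible way to do that: `build_labeled_contents` is a hypothetical helper, and plain strings stand in for the Vertex AI `Image` objects so the example runs without downloading anything.

```python
from typing import Any, List, Sequence


def build_labeled_contents(
    items: Sequence[Any],
    room: Any,
    instruction: str,
    item_label: str = "chair",
) -> List[Any]:
    """Interleave numbered labels with items, then append the room and the instruction."""
    contents: List[Any] = [f"Consider the following {item_label}s:"]
    for index, item in enumerate(items, start=1):
        contents.append(f"{item_label} {index}:")
        contents.append(item)
    contents.extend(["room:", room, instruction])
    return contents


# Strings stand in for Image objects here; in the notebook you would pass
# furniture_images and room_image instead.
demo_contents = build_labeled_contents(
    items=["<chair image A>", "<chair image B>"],
    room="<room image>",
    instruction="You are an interior designer. For each chair, explain whether it fits the room:",
)
for part in demo_contents:
    print(part)
```

Keeping the labels as separate list entries, rather than one long string, preserves the ordering the model sees: each label is immediately followed by the image it names.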


You can also ask the model to return its response in JSON format, which makes it easier to plug the recommendations into a downstream recommendation system:

In [None]:
# We create a list of prompt items (text + images). We label each chair 
# to help the model reference them accurately. We also include a final instruction 
# asking the model to respond in JSON format, indicating whether each chair 
# would fit into the room, along with an explanation.

contents = [
    "Consider the following chairs:",
    "chair 1:",
    furniture_images[0],
    "chair 2:",
    furniture_images[1],
    "chair 3:",
    furniture_images[2],
    "chair 4:",
    furniture_images[3],
    "room:",
    room_image,
    (
        "You are an interior designer. Return in JSON, for each chair, "
        "whether it would fit in the room, with an explanation:"
    ),
]

# We generate responses using our multimodal model in a streaming fashion, 
# which allows us to retrieve partial results as they are produced.
responses = multimodal_model.generate_content(contents, stream=True)

# We print out the combined prompt first, which attempts to display 
# or print each item (text or images) in the notebook.
print("-------Prompt--------")
print_multimodal_prompt(contents)

# We then print the model's responses as they are streamed. 
# The 'end=""' parameter avoids adding extra newlines after each chunk.
print("\n-------Response--------")
for response in responses:
    print(response.text, end="")
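
To use the JSON downstream, you would first join the streamed chunks into one string and then parse it. Models sometimes wrap JSON in a Markdown code fence, so a tolerant parser can be handy. The following is a minimal sketch under that assumption: `parse_json_response` is a hypothetical helper, and the sample payload (including the `fits` and `explanation` keys) is invented for illustration, since the prompt above does not fix a schema.

```python
import json
import re
from typing import Any

# Three backticks, built programmatically so this example nests cleanly in Markdown.
FENCE = "`" * 3
FENCED_JSON = re.compile(
    rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$",
    flags=re.DOTALL | re.IGNORECASE,
)


def parse_json_response(text: str) -> Any:
    """Parse JSON from model output, tolerating an optional Markdown code fence."""
    cleaned = text.strip()
    match = FENCED_JSON.match(cleaned)
    if match:
        cleaned = match.group(1)
    # json.loads raises json.JSONDecodeError if the text is still not valid JSON.
    return json.loads(cleaned)


# Invented sample output, for illustration only.
sample = (
    FENCE
    + 'json\n{"chair 1": {"fits": true, "explanation": "Matches the wood tones."}}\n'
    + FENCE
)
recommendations = parse_json_response(sample)
print(recommendations["chair 1"]["explanation"])
```

Because the model's output format is not guaranteed, it is worth catching `json.JSONDecodeError` where you call this helper and deciding whether to retry or fall back to the raw text.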


## Conclusion

This notebook showed how you can easily build a multimodal recommendation system for furniture using Gemini, but you can also use a similar approach for:

- recommending clothes based on an occasion or an image of the venue
- recommending wallpaper based on the room and settings

You may also want to explore how you can build a RAG (retrieval-augmented generation) system that retrieves relevant images from your store inventory and presents them to users, who can then use Gemini to help identify the best choice among the options provided and to explain the rationale.
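
As a rough illustration of the retrieval step in such a system, the sketch below ranks catalog items by cosine similarity to a query vector and keeps the top matches, which could then be passed to Gemini as the candidate images. Everything here is an assumption made for illustration: the helper names are hypothetical, and the three-dimensional vectors are toy values, whereas real embeddings would come from an embedding model and have many more dimensions.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_items(
    query: Sequence[float],
    catalog: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[str]:
    """Return the names of the k catalog items most similar to the query."""
    ranked = sorted(
        catalog,
        key=lambda name: cosine_similarity(query, catalog[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors invented for illustration only.
catalog = {
    "chair_a": [0.9, 0.1, 0.0],
    "chair_b": [0.1, 0.9, 0.0],
    "chair_c": [0.8, 0.2, 0.1],
}
room_query = [1.0, 0.0, 0.0]
candidates = top_k_items(room_query, catalog)
print(candidates)
```

A linear scan like this is only practical for small catalogs. A larger inventory would typically call for a dedicated vector index.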