In [1]:
!apt install tesseract-ocr libtesseract-dev
!pip install -q -U google-generativeai chromadb pytesseract

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libtesseract-dev is already the newest version (4.1.1-2.1build1).
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 38 not upgraded.


In [2]:
import time
from tqdm import tqdm
import pathlib
import google.generativeai as genai
import chromadb
from chromadb import Documents, EmbeddingFunction, Embeddings
import pandas as pd
from PIL import Image
import pytesseract
from IPython.display import Markdown

In [3]:
from google.colab import userdata
GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')

genai.configure(api_key=GOOGLE_API_KEY)

# Gemini Final Exercise

You're an astronomy student who's very curious about the Apollo 11 missions,
and through your research, you've found a lot of different types of data (otherwise known as multimodal) from NASA's
public archive.

1. Text: You have the full final NASA report post-mission, spanning over 300
pages of incredibly informative content that details a summary of everything
that happened as well as conclusions that NASA researchers and engineers
came to. For the sake of this exercise, we've selected 3 particularly interesting pages, and converted them to images (you'll see soon why).

2. Video: You also have several clips of the famous Neil Armstrong and Buzz Aldrin footage as they
first stepped onto the moon, containing highlights of their moonwalks as well
as raising the American flag.

3. Audio: Finally, you have highlights from the audio recorded throughout the
mission, which provides insights into how communication between the astronauts
occurred as well as from the astronauts to mission control.

Now, you want to search through and summarize this information for your
upcoming research paper. Using your newfound skills from this course, you
can accomplish this using Gemini! In particular, we will build a Retrieval Augmented Generation (RAG) system that you can directly interact with.

## Data Preparation

Before we begin, ensure that you've uploaded the resources.zip folder and unzipped it using the following command:

In [4]:
!wget -O resources.zip "https://video.udacity-data.com/topher/2024/June/66744e79_resources/resources.zip"

--2025-09-27 08:51:31--  https://video.udacity-data.com/topher/2024/June/66744e79_resources/resources.zip
Resolving video.udacity-data.com (video.udacity-data.com)... 172.64.148.171, 104.18.39.85, 2a06:98c1:3102::ac40:94ab, ...
Connecting to video.udacity-data.com (video.udacity-data.com)|172.64.148.171|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 286142532 (273M) [application/zip]
Saving to: ‘resources.zip’


2025-09-27 08:51:33 (152 MB/s) - ‘resources.zip’ saved [286142532/286142532]



In [5]:
!unzip resources.zip

Archive:  resources.zip
replace __MACOSX/._resources? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: __MACOSX/._resources    
replace __MACOSX/resources/._video? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: __MACOSX/resources/._video  
replace resources/.DS_Store? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: resources/.DS_Store     
replace __MACOSX/resources/._.DS_Store? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: __MACOSX/resources/._.DS_Store  
replace __MACOSX/resources/._audio? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: __MACOSX/resources/._audio  
replace __MACOSX/resources/._text? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: __MACOSX/resources/._text  
replace resources/video/Apollo11PlaqueComparison.mov? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: resources/video/Apollo11PlaqueComparison.mov  
replace __MACOSX/resources/video/._Apollo11PlaqueComparison.mov? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: __MACOSX/re


As we saw throughout this course, when working with different types of data,
we first need to parse it in a way that Gemini can understand. We will prepare our data by extracting all file names from the `resources` directory.

In [6]:
data_dir = pathlib.Path("resources/")
all_file_names = [str(file) for file in data_dir.rglob("*") if file.is_file() and not file.name.startswith('.')]

In [7]:
for file_name in all_file_names:
    print(file_name)

print(len(all_file_names))

resources/audio/Apollo11OnboardAudioHighlightClip3.mp3
resources/audio/Apollo11OnboardAudioHighlightClip2.mp3
resources/audio/Apollo11OnboardAudioHighlightClip5.mp3
resources/audio/Apollo11OnboardAudioHighlightClip4.mp3
resources/audio/Apollo11OnboardAudioHighlightClip1.mp3
resources/video/RaisingTheAmericanFlag.mov
resources/video/BuzzDescendsCompilation.mov
resources/video/Apollo11PlaqueComparison.mov
resources/video/Apollo11MoonwalkMontage.mov
resources/video/OneSmallStepCompilation.mov
resources/video/Apollo11Intro.mov
resources/text/images-020.jpg
resources/text/images-333.jpg
resources/text/images-023.jpg
14


You should expect to see 14 files.

## Retrieval Augmented Generation (RAG)

To showcase how we build a RAG, we will first build one for the Text case, and generalize it further after. Here is the general idea:
1. **Data Preparation** (done above): We first collected various types of data from NASA's public archive related to the Apollo 11 mission, including text, video, and audio files.
2. **Data Extraction and Summarization**: Extract the multimodal data from images, e.g. extract text from images using Optical Character Recognition (OCR), and use Gemini to generate summaries using a specialized prompt.
3. **Embedding Generation**: Convert the generated summaries into vector embeddings using Gemini's Text Embedding Model. These embeddings represent the summaries in a numerical format suitable for efficient similarity searches.
4. **Creating a Vector Database**: A Vector database was created to store the embeddings. This database facilitates fast and efficient retrieval of relevant documents based on similarity searches. We chose to use Chroma DB.
5. **Querying the RAG System**: For a given query, the system retrieves the most relevant documents (based on their embeddings) and generates a response using the retrieved documents as context.

Something important to note is that RAGs are usually used only when there is a surplus of data. In other words, if the data can't fit into the model prompt. In this case, the data we provided likely can fit into Gemini's 1 million token window, but for the sake of simplicity and restrictions of Google Colab's runtime, we opted to use a smaller set of data.

### Text

We will use Tesseract OCR (Optical Character Recognition) to extract text from images of the NASA report.

In [8]:
pytesseract.pytesseract.tesseract_cmd = (r'/usr/bin/tesseract')

Let's create a function to take in our images of a PDF, transcribe them into text, and summarize each of them.

In [9]:
def create_text_summary():
  path = pathlib.Path("resources/text")

  text_summary_prompt = f"""You are an assistant tailored for summarizing text for retrieval.
  These summaries will be turned into vector embeddings and used to retrieve the raw text.
  Give a concise summary of the text that is well optimized for retrieval. Here is the text."""

  images = []
  text_summaries = []

  for f in path.glob("*"):
    if f.is_dir() or f.name.startswith('.'):
      continue

    image = Image.open(f)
    response = model.generate_content(
                [text_summary_prompt, image]
              )

    images.append(f) # TODO
    text_summaries.append(response.text.strip()) # TODO

  return images, text_summaries

In [13]:
safety_settings = [
    {
        "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
        "threshold": "BLOCK_NONE",
    },
]

model = genai.GenerativeModel('models/gemini-2.5-flash', safety_settings=safety_settings)

In [14]:
image_files, text_summaries = create_text_summary()

Now, we can check out the generated summaries of the three pages we have!

In [15]:
for text_summary in text_summaries:
  print(text_summary)

This document outlines the Apollo 11 Flight Plan, prepared by the Flight Planning Branch and Flight Crew Support Division with TRW Systems, for the AS-506/CSM-107/LM-5 lunar landing mission. It details operations and crew activities based on July 16, 1969 launch parameters and a 72° azimuth. The plan is under Crew Procedures Control Board (CPCB) configuration control. Proposed changes, categorized by impact on crew training, test objectives, budget, activity scheduling, or flight data, must be submitted via a Crew Procedures Change Request. Mr. T. A. Guillory coordinates changes, and Mr. W. J. North handles requests for copies or distribution list updates.
This document outlines the ground rules and assumptions for an IM EPS analysis related to a lunar mission. Key assumptions include: descent stage batteries activated 30 minutes before Earth liftoff; a 3.8-hour lunar orbit checkout; ascent and descent batteries paralleled for powered descent and pre-liftoff; S-band equipment 100% acti

We create the Chroma database using the generated summaries. You might be wondering what Vector DB and Chroma DB are.

**Vector Database**: A specialized database designed to store and manage high-dimensional vectors, which are numerical representations of data points. It allows efficient similarity searches to find vectors (and their corresponding data) that are close to a given query vector.

**Chroma DB**: An implementation of a vector database used to store and retrieve vector embeddings. These embeddings are generated from our summaries and allow us to perform efficient similarity searches.


In [16]:
class GeminiEmbeddingFunction(EmbeddingFunction):
  def __call__(self, input: Documents) -> Embeddings:
    model = 'models/text-embedding-004'
    title = "Custom query"
    return genai.embed_content(model=model,
                                content= input, # TODO: What would we pass in as content?
                                task_type="retrieval_document", # TODO: What sort of task_type would this be? Check out the Gemini docs for what options there are.
                                title=title)["embedding"]

In [17]:
def create_chroma_db(documents, name):
  chroma_client = chromadb.Client()
  db = chroma_client.get_or_create_collection(
        name=name,
        embedding_function=GeminiEmbeddingFunction()
    )    # TODO: Create a chroma db using the name and above embedding function. Hint: check out `get_or_create_collection`

  for i, d in enumerate(documents):
    db.add(
      documents=d,
      ids=str(i)
    )
  return db

In [18]:
text_db = create_chroma_db(text_summaries, name="text_summaries") # TODO: Create a db for all the text summaries using the above function

  embedding_function=GeminiEmbeddingFunction()


Let's also take a peak at the `text_db` and ensure that embeddings were generated:

In [19]:
# Create a row for each (document, embedding) pair
data = [
    {"document": doc, "embedding": emb}
    for doc, emb in zip(text_db.peek()["documents"], text_db.peek()["embeddings"])
]

df = pd.DataFrame(data)
df


Unnamed: 0,document,embedding
0,This document outlines the Apollo 11 Flight Pl...,"[0.08514145016670227, 0.0008094013319350779, 0..."
1,This document outlines the ground rules and as...,"[0.05311640724539757, -0.020012691617012024, -..."
2,This document describes a mission's two primar...,"[0.07370813190937042, 0.0243355892598629, -0.0..."





Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.



You should see a column called `embeddings` with what are seemingly random values, but these values are actually high-dimensional vectors that represent the semantic meaning of your summaries.

Now let's actually try querying our information. We'll test a simple example like getting some file that has to do with the Apollo 11 Flight Plan.

In [20]:
def get_relevant_files(query, db):
  results = ... # TODO: query from the db to get the top few results
  return ... # TODO

In [21]:
files = get_relevant_files("Apollo 11 Flight Plan", text_db)
print(files)

Ellipsis


You should expect to see something like `['1', '0', '2']`. This means that the first entry in the `text_db` is most similar. If you look above at our `pd.DataFrame` output, the document with id 1 is the document about the Apollo 11 Flight Plan, so this is working as we expected!

### Video and Audio

Congrats! You've successfully built a working RAG for text. Now, let's extend this concept to Video and Audio, and build out some more complex queries. We'll begin by generalizing the above summary creation function to all sorts of modalities.

In [22]:
import pathlib
import time
import mimetypes

# If you don't already have data_dir defined, uncomment the next line:
# data_dir = pathlib.Path("resources")

# Allowed file extensions for text/audio/video (adjust if needed)
ALLOWED_TEXT_EXTS = {".txt", ".md", ".csv", ".json", ".html", ".htm", ".xml"}
ALLOWED_AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac", ".aac"}
ALLOWED_VIDEO_EXTS = {".mp4", ".mov", ".mkv", ".avi", ".webm"}

def _wait_for_file_processing(uploaded_file, poll_interval=3, timeout=300):
    """Poll uploaded_file until its state is not PROCESSING. Returns final file object.
       Retries on transient server errors when calling genai.get_file."""
    start = time.time()
    file_obj = uploaded_file
    while True:
        # check timeout
        if time.time() - start > timeout:
            raise TimeoutError(f"Timeout waiting for file {uploaded_file.name} to finish processing.")
        try:
            file_obj = genai.get_file(file_obj.name)
        except Exception as e:
            # transient server response conversion error/500 - wait and retry
            print(f"Warning: get_file raised {type(e).__name__}: {e}. Retrying in {poll_interval}s...")
            time.sleep(poll_interval)
            continue

        state_name = getattr(file_obj.state, "name", None)
        if state_name == "PROCESSING":
            print(f"Waiting for file {uploaded_file.name} to finish processing...")
            time.sleep(poll_interval)
            continue
        # if failed, raise
        if state_name == "FAILED":
            raise RuntimeError(f"Uploaded file {uploaded_file.name} processing FAILED.")
        # finished (READY or equivalent)
        return file_obj

def create_summary(modality: str):
    path = pathlib.Path(data_dir) / modality

    summary_prompt = (
        f"You are an assistant tailored for summarizing {modality} for retrieval.\n"
        f"These summaries will be turned into vector embeddings and used to retrieve the raw {modality}.\n"
        f"Give a concise summary of the {modality} that is well optimized for retrieval. Here is the {modality}."
    )

    files = []
    summaries = []

    for f in path.glob("*"):
        # skip directories and hidden files
        if f.is_dir() or f.name.startswith("."):
            continue

        print("Processing:", f)

        # --- TEXT modality ---
        if modality == "text":
            # skip non-text extensions (prevents opening images like images-020.jpg)
            if f.suffix.lower() not in ALLOWED_TEXT_EXTS:
                mime_type, _ = mimetypes.guess_type(f)
                if mime_type is None or not (mime_type.startswith("text") or f.suffix.lower() in ALLOWED_TEXT_EXTS):
                    print(f"Skipping non-text file: {f}")
                    continue

            # try UTF-8 read, fallback to latin-1 if needed
            try:
                with open(f, "r", encoding="utf-8") as fh:
                    raw_text = fh.read()
            except UnicodeDecodeError:
                print(f"UTF-8 decode failed for {f}, trying latin-1 with errors='ignore'.")
                with open(f, "r", encoding="latin-1", errors="ignore") as fh:
                    raw_text = fh.read()

            # send to model (string input)
            response = model.generate_content([summary_prompt, raw_text])

            # extract text safely
            summary_text = ""
            if hasattr(response, "text"):
                summary_text = response.text.strip()
            elif isinstance(response, dict) and "text" in response:
                summary_text = response["text"].strip()
            else:
                # last resort: str()
                summary_text = str(response).strip()

            files.append(f)
            summaries.append(summary_text)

        # --- AUDIO / VIDEO modality ---
        elif modality in ("audio", "video"):
            # validate extension (optional but helpful)
            allowed_exts = ALLOWED_AUDIO_EXTS if modality == "audio" else ALLOWED_VIDEO_EXTS
            if f.suffix.lower() not in allowed_exts:
                mime_type, _ = mimetypes.guess_type(f)
                if mime_type is None or not mime_type.startswith(modality):
                    print(f"Skipping non-{modality} file: {f}")
                    continue

            # upload file and wait for processing to finish robustly
            uploaded_file = genai.upload_file(f)
            try:
                ready_file = _wait_for_file_processing(uploaded_file, poll_interval=3, timeout=600)
            except Exception as e:
                print(f"Skipping {f} due to upload/processing error: {e}")
                continue

            # send the file object reference to the model
            response = model.generate_content([summary_prompt, ready_file])

            # extract text safely
            summary_text = ""
            if hasattr(response, "text"):
                summary_text = response.text.strip()
            elif isinstance(response, dict) and "text" in response:
                summary_text = response["text"].strip()
            else:
                summary_text = str(response).strip()

            files.append(f)
            summaries.append(summary_text)

        else:
            # unknown modality
            print(f"Unknown modality: {modality} (skipping {f})")
            continue

    return files, summaries

Now, we will create a folder with all of our data of different modalities. In particular, the first 5 are audio files, next 3 are text files, and final 6 are video files.

In [23]:
all_files = []
all_summaries = []
for modality_type in ["audio", "text", "video"]:
    files, summaries = create_summary(modality_type)
    all_files.extend(files)
    all_summaries.extend(summaries)

    print("Collected", len(all_summaries), "summaries.")

Processing: resources/audio/Apollo11OnboardAudioHighlightClip3.mp3
Processing: resources/audio/Apollo11OnboardAudioHighlightClip2.mp3
Processing: resources/audio/Apollo11OnboardAudioHighlightClip5.mp3
Processing: resources/audio/Apollo11OnboardAudioHighlightClip4.mp3
Processing: resources/audio/Apollo11OnboardAudioHighlightClip1.mp3
Collected 5 summaries.
Processing: resources/text/images-020.jpg
Skipping non-text file: resources/text/images-020.jpg
Processing: resources/text/images-333.jpg
Skipping non-text file: resources/text/images-333.jpg
Processing: resources/text/images-023.jpg
Skipping non-text file: resources/text/images-023.jpg
Collected 5 summaries.
Processing: resources/video/RaisingTheAmericanFlag.mov
Waiting for file files/vc8gec4ns83d to finish processing...
Processing: resources/video/BuzzDescendsCompilation.mov
Waiting for file files/ulqvbmykm9tw to finish processing...
Processing: resources/video/Apollo11PlaqueComparison.mov
Waiting for file files/93wtdptu4noe to fini

In [29]:
video_audio_db = create_chroma_db(text_summaries, name="summaries") # TODO: Create a db for all the text summaries using the above function
peek = video_audio_db.peek()

embeddings = video_audio_db.peek()['embeddings']
documents = video_audio_db.peek()['documents']

df = pd.DataFrame({
    "document": documents,
    "embedding": [list(e) for e in embeddings]  # force into object column
})

print(df.head())



  embedding_function=GeminiEmbeddingFunction()


                                            document  \
0  This document outlines the Apollo 11 Flight Pl...   
1  This document outlines the ground rules and as...   
2  This document describes a mission's two primar...   

                                           embedding  
0  [0.08514145016670227, 0.0008094013319350779, 0...  
1  [0.05311640724539757, -0.020012691617012024, -...  
2  [0.07370813190937042, 0.0243355892598629, -0.0...  


Again, ensure that the embeddings were generated. Notice that now, we have audio, video, and text data.

In [31]:
import numpy as np

embeddings = video_audio_db.peek()['embeddings']
documents = video_audio_db.peek()['documents']

emb_array = np.vstack(embeddings)   # shape = (n_samples, embedding_dim)

df = pd.DataFrame(emb_array)        # embedding dims = numeric columns
df["document"] = documents

print(df.head())


          0         1         2         3         4         5         6  \
0  0.085141  0.000809  0.025561  0.042306  0.071648 -0.001135  0.045755   
1  0.053116 -0.020013 -0.061028  0.009171  0.010112  0.003292  0.080573   
2  0.073708  0.024336 -0.001695  0.015251  0.060523  0.013323  0.045922   

          7         8         9  ...       759       760       761       762  \
0  0.053318  0.030082  0.040133  ... -0.004881 -0.022323 -0.024749 -0.003635   
1  0.059754 -0.038403  0.037040  ... -0.028587 -0.003267 -0.004826 -0.043701   
2  0.044094  0.000835  0.061872  ... -0.027183  0.030712  0.003261  0.001619   

        763       764       765       766       767  \
0  0.015273 -0.037634  0.034557  0.023508  0.000775   
1 -0.055209 -0.019669 -0.019569  0.076849 -0.025311   
2  0.010639 -0.015047  0.001038 -0.010422 -0.019432   

                                            document  
0  This document outlines the Apollo 11 Flight Pl...  
1  This document outlines the ground rules and 

In [34]:
data = {
    "embeddings": video_audio_db.peek()["embeddings"],
    "documents": video_audio_db.peek()["documents"]
}

# ✅ Build DataFrame safely (embeddings stored as list objects)
df = pd.DataFrame({
    "document": data["documents"],
    "embedding": [list(e) for e in data["embeddings"]]  # force object dtype
})

print(df.head())


                                            document  \
0  This document outlines the Apollo 11 Flight Pl...   
1  This document outlines the ground rules and as...   
2  This document describes a mission's two primar...   

                                           embedding  
0  [0.08514145016670227, 0.0008094013319350779, 0...  
1  [0.05311640724539757, -0.020012691617012024, -...  
2  [0.07370813190937042, 0.0243355892598629, -0.0...  


In [44]:
files = get_relevant_files("communication with Mission Control", video_audio_db)
print(files)

Ellipsis


Can we do more than just return the most relevant file? Yes we can! We can ask Gemini to return a response to the query using the files it thinks are most relevant, provide an answer and tell us what files it used! This is really exciting, and has vast applications in many industries.

In [49]:
def get_relevant_files(query, db, top_k=3):
    results = db.query(query_texts=[query], n_results=top_k)
    # results["ids"] will be a list of lists -> flatten
    return [int(idx) for idx in results["ids"][0]]


In [50]:
for response in get_relevant_files("Explain what happened with the Apollo 11 Mission.", video_audio_db):
    print(response)

0
2
1


In [51]:
for response in get_relevant_files("What happens at the Translunar Coast in the Mission Description?", video_audio_db):
    print(response)

2
0
1


In [52]:
for response in get_relevant_files("REPLACE ME: Ask any questions you'd like here about Apollo 11!", video_audio_db):
    print(response)

0
2
1


Congrats! You've built a full end to end multimodal RAG with just a few tools. We hope you enjoyed following along in this notebook and learned a lot on the way.