# Import the necessary packages

In [1]:
import os
import re
import requests
from datetime import datetime

from bs4 import BeautifulSoup, NavigableString

import vertexai
from vertexai.generative_models import (
    GenerativeModel,
    GenerationConfig,
    Tool,
    FunctionDeclaration,
    Part)

Next, run the following code to initialize Vertex AI.

In [2]:
project_id = !gcloud config get project
project_id = project_id[0]

vertexai.init(project=project_id, location="us-central1")

# Download SEC filings and parse text to analyze

In this section, you will tackle the limitations of processing massive SEC filings with Gemini. You will explore utilizing the document structure to efficiently extract specific sections and confirm their suitability for analysis within Gemini's token limit.

Run the following code in your notebook and update the values of `PARTNER_COMPANY` and `PARTNER_WEBSITE` to prepare a header to call **EDGAR** on behalf of your company.

In [3]:
# IMPORTANT! You must set these variables with REAL VALUES for the correct headers to be in place to access the API
PARTNER_COMPANY = "YOUR COMPANY'S NAME"
PARTNER_WEBSITE = "https://your-companys-website"

headers = {"User-Agent": f"{PARTNER_COMPANY} +{PARTNER_WEBSITE}"}

Use the following code to name a local directory for storing filings and define the following two functions:

- download_single_filing - Downloads a single SEC filing identified by a company's Central Index Key (CIK), the form type (an annual 10-K or a quarterly 10-Q), and the filing date.
- download_range_of_filings - Queries all of a company's filings, finds those of a particular form type within a date range, and downloads them.

In [4]:
BASE_DIR = "filings" # Directory for storing raw filings

def download_single_filing(url, cik, form, date, file_extension):
    """
    This function downloads and saves an SEC filing from a specified URL.

    Args:
        url (str): URL of the filing to download.
        cik (str): Central Index Key (CIK) for the company.
        form (str): The type of SEC filing (e.g., 10-K, 10-Q).
        date (str): The filing date in YYYY-MM-DD format.
        file_extension (str): File extension for saved file (usually 'txt' or 'zip').
    """

    # Send the request with the User-Agent header that identifies your company to EDGAR
    response = requests.get(url, headers=headers)

    if response.status_code == 200:

        # Create a directory for this CIK if it does not already exist
        dir_name = f"{BASE_DIR}/{cik}"
        os.makedirs(dir_name, exist_ok=True)
        file_path = f"{dir_name}/{form}_{date}.{file_extension}"

        with open(file_path, 'wb') as file:
            file.write(response.content)

        print(f"Downloaded {form} for CIK {cik} on {date} to {file_path}")

        return file_path

    else:
        print(f"Failed to download {form} for CIK {cik} on {date}. Status code: {response.status_code}")

        return None

def download_range_of_filings(cik, starting_year_and_quarter,
                            ending_year_and_quarter, include_10q = False):
    """
    Download filings from EDGAR for a given CIK and clean them up.

    Args:
        cik (str): Central Index Key (CIK)
        starting_year_and_quarter (str): Specified in the format "2023 Q1"
        ending_year_and_quarter (str): Specified in the format "2024 Q4"
        include_10q (bool): Whether to include 10-Qs in addition to 10-Ks
    """

    url = f"https://data.sec.gov/submissions/CIK{cik}.json"
    # Reuse the module-level headers that identify your company to EDGAR
    response = requests.get(url, headers=headers)

    forms_to_download = {"10-K", "10-Q"} if include_10q else {"10-K"}

    start_year, start_quarter = starting_year_and_quarter.split()
    end_year, end_quarter = ending_year_and_quarter.split()

    if response.status_code == 200:
        data = response.json()

        filing_paths = []
        for filing_date, form, accession_number in zip(data['filings']['recent']['filingDate'], data['filings']['recent']['form'], data['filings']['recent']['accessionNumber']):
            if (form in forms_to_download) and (int(start_year) <= datetime.strptime(filing_date, '%Y-%m-%d').year <= int(end_year)):
                file_path = download_single_filing(
                    f"https://www.sec.gov/Archives/edgar/data/{cik}/{accession_number.replace('-', '')}/{accession_number}.txt",
                    cik, form, filing_date, 'htm'
                )
                # Skip failed downloads so the list contains only valid paths
                if file_path:
                    filing_paths.append(file_path)
        return filing_paths
    else:
        print("Error from call.")
        print(response)
        # Return an empty list so callers always receive a list
        return []

Next, test the functions using the **CIK of Alphabet**, Google's parent company, which is 0001652044.

In [5]:
alphabet_cik = "0001652044"

download_range_of_filings(cik=alphabet_cik,
                starting_year_and_quarter="2024 Q1",
                ending_year_and_quarter="2024 Q4")

Downloaded 10-K for CIK 0001652044 on 2024-01-31 to filings/0001652044/10-K_2024-01-31.htm


['filings/0001652044/10-K_2024-01-31.htm']
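Note that `download_range_of_filings` parses the quarter out of each argument but filters on the filing year only. If you want the quarter to matter, one possible approach is sketched below. The helper names `quarter_bounds` and `filed_within` are hypothetical and not part of the lab code, and the sketch assumes calendar quarters applied to the filing date (not the fiscal period the filing covers).

```python
from datetime import date, timedelta

def quarter_bounds(year_and_quarter: str) -> tuple[date, date]:
    """Convert a string such as "2024 Q1" into that quarter's first and last day."""
    year_text, quarter_text = year_and_quarter.split()
    year = int(year_text)
    quarter = int(quarter_text.upper().lstrip("Q"))
    if quarter not in (1, 2, 3, 4):
        raise ValueError(f"Unknown quarter: {quarter_text}")
    first_month = 3 * (quarter - 1) + 1
    start = date(year, first_month, 1)
    if quarter == 4:
        end = date(year, 12, 31)
    else:
        # The quarter ends the day before the next quarter begins
        end = date(year, first_month + 3, 1) - timedelta(days=1)
    return start, end

def filed_within(filing_date: str, starting: str, ending: str) -> bool:
    """Check whether a YYYY-MM-DD filing date falls inside the quarter range."""
    start, _ = quarter_bounds(starting)
    _, end = quarter_bounds(ending)
    return start <= date.fromisoformat(filing_date) <= end

print(filed_within("2024-01-31", "2024 Q1", "2024 Q4"))  # True
print(filed_within("2024-01-31", "2024 Q2", "2024 Q4"))  # False
```

A check like `filed_within(filing_date, starting_year_and_quarter, ending_year_and_quarter)` could stand in for the year comparison inside the loop, at the cost of excluding filings that land just outside the requested quarters.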

Open the directory filings/0001652044, right-click the .htm file, and select Open in New Browser Tab to preview the document. Notice that the table of contents is highly structured and that all 10-K filings follow the same structure.

Instantiate a GenerativeModel using the Gemini model version gemini-2.0-flash.

In [6]:
model = GenerativeModel("gemini-2.0-flash",
                        generation_config=GenerationConfig(temperature=0),
                       )

To determine if you can send the entire document to **Gemini** to analyze at once, use your GenerativeModel's count_tokens method to see the number of tokens in the raw file.

In [7]:
downloaded_path = "filings/0001652044/10-K_2024-01-31.htm"

with open(downloaded_path, 'r') as f:
    filing_text = f.read()

response = model.count_tokens(filing_text)
print(response)    

total_tokens: 5798629
total_billable_characters: 13029439
prompt_tokens_details {
  modality: TEXT
  token_count: 5798629
}



With a token count of 5,798,629, you can see that even with Gemini 2.0 Flash's large token window of 1,048,576 tokens, these documents are too long to read in a single pass.
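As a quick check of the arithmetic, the following sketch compares the two numbers quoted above. The variable names are illustrative only.

```python
import math

filing_tokens = 5_798_629    # total_tokens reported by count_tokens for the raw filing
context_window = 1_048_576   # token limit quoted above for Gemini 2.0 Flash

ratio = filing_tokens / context_window
min_chunks = math.ceil(ratio)  # fewest pieces that would each fit in the window

print(f"The filing is {ratio:.2f}x the window; it would need at least {min_chunks} chunks.")
# The filing is 5.53x the window; it would need at least 6 chunks.
```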

You could implement a Retrieval-Augmented Generation (RAG) framework to query small chunks of these documents, but here you are looking for a broader understanding of sections as a whole rather than smaller chunks consisting of a few facts in the document. Here, you don't mind passing a large number of tokens to Gemini, as this will be an internal tool used by a relatively small number of analysts, not a public tool handling thousands of queries.

SEC filings are required to adhere to a strict structure with named sections. You can use this standard structure of the documents to read the text between one section header and the next.
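To illustrate the idea on plain text before applying it to HTML, here is a minimal sketch that returns everything between one header and the next. The function name `text_between_headers` and the sample text are made up for this example; the lab code works on the parsed HTML instead.

```python
import re

def text_between_headers(text: str, start_header: str, end_header: str) -> str:
    """Return the text after start_header and before the next end_header."""
    start = re.search(re.escape(start_header), text, flags=re.IGNORECASE)
    if start is None:
        return ""
    # Look for the closing header only in the text that follows the opening one
    end = re.search(re.escape(end_header), text[start.end():], flags=re.IGNORECASE)
    stop = start.end() + end.start() if end else len(text)
    return text[start.end():stop].strip()

sample = (
    "Item 1. Business\nWe sell widgets.\n"
    "Item 1A. Risk Factors\nWidgets may go out of fashion.\n"
    "Item 1B. Unresolved Staff Comments\nNone.\n"
)

print(text_between_headers(sample, "Item 1A. Risk Factors", "Item 1B."))
# Widgets may go out of fashion.
```

If the closing header is missing, the sketch reads to the end of the text, which is a convenient fallback for the last section of a document.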

Run the following code in your notebook to define a list of the sections (items) and functions to help retrieve all of the text from one item header until the next item header.

In [8]:
items = ['Business',
        'Risk Factors',
        'Unresolved Staff Comments',
        'Properties', 'Legal Proceedings',
        'Mine Safety Disclosures',
        'Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities',
        'Management’s Discussion and Analysis of Financial Condition and Results of Operations',
        'Quantitative and Qualitative Disclosures About Market Risk',
        'Financial Statements and Supplementary Data',
        'Changes in and Disagreements with Accountants on Accounting and Financial Disclosure',
        'Controls and Procedures',
        'Other Information',
        'Disclosure Regarding Foreign Jurisdictions that Prevent Inspections',
        'Directors, Executive Officers and Corporate Governance',
        'Executive Compensation',
        'Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters',
        'Certain Relationships and Related Transactions, and Director Independence',
        'Principal Accountant Fees and Services',
        'Exhibit and Financial Statement Schedules',
        'Form 10-K Summary']

def find_div_id(soup, item_name: str):
    found_tag = soup.find('a', string=item_name)
    if found_tag:
        return found_tag["href"].strip('#')
    else:
        print(f"Couldn't find a matching tag for: {item_name}")
        return None

def get_text_between(soup, cur_name, end_name):
    cur = soup.find('div', id=find_div_id(soup, cur_name)).next_sibling
    end = soup.find('div', id=find_div_id(soup, end_name))
    while cur and cur != end:
        if isinstance(cur, NavigableString):
            text = cur.strip()
            if len(text):
                yield text
        cur = cur.next_element

def get_items_from_filings(item_names, filing_paths):

    print("Items of interest: " + ", ".join(item_names) + "\n")
    item_strings = {item: f"<{item}>\n" for item in item_names}

    for path in filing_paths:
        with open(path, 'r', encoding='utf-8') as file:
            content = file.read()

        soup = BeautifulSoup(content, 'html.parser')

        for item in item_names:
            item_index = items.index(item)
            item_index = item_index if item_index < len(items) - 1 else 0
            item_output = ' '.join(text for text in get_text_between(soup, item, items[item_index + 1]))
            item_strings[item] += f"From {os.path.basename(path)}" + "\n" + item_output + "\n"

    return "\n\n".join(item_strings.values())    

Next, test the above functions by running the following command and save your notebook.

In [9]:
item_names = ["Management’s Discussion and Analysis of Financial Condition and Results of Operations"]

report = get_items_from_filings(item_names, [downloaded_path])

# Print the first 2,000 characters as an example.
print(report[0:2000] + "...")

Items of interest: Management’s Discussion and Analysis of Financial Condition and Results of Operations

<Management’s Discussion and Analysis of Financial Condition and Results of Operations>
From 10-K_2024-01-31.htm
Table of Contents Alphabet Inc. ITEM 7. MANAGEMENT’S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS Please read the following discussion and analysis of our financial condition and results of operations together with “Note about Forward-Looking Statements,” Part I, Item 1 "Business," Part I, Item 1A "Risk Factors," and our consolidated financial statements and related notes included under Item 8 of this Annual Report on Form 10-K. The following section generally discusses 2023 results compared to 2022 results. Discussion of 2022 results compared to 2021 results to the extent not included in this report can be found in Item 7 of our 2022 Annual Report on Form 10-K. Understanding Alphabet’s Financial Results Alphabet is a collection of businesses 

Count the number of tokens in this section to see if it is under Gemini's token window of 1,048,576 tokens.

In [10]:
response = model.count_tokens(report)
print(response)


total_tokens: 12688
total_billable_characters: 50620
prompt_tokens_details {
  modality: TEXT
  token_count: 12688
}



# Let Gemini look up companies' CIKs via Google Search

You were able to download Alphabet's annual report because you were provided its Central Index Key (CIK) number. Gemini already knows some CIKs for large public companies like Alphabet, but for some smaller companies, it may hallucinate and invent inaccurate CIKs. You can instead look up correct numbers by having Gemini use everyone's favorite well-known, public, up-to-date source of information: Google Search.

Run the following command to see Gemini provide a CIK from its internal knowledge. The CIK of Summit Therapeutics is actually 0001599298, but without the ability to look it up, Gemini may return a different, incorrect number.

In [11]:
model.generate_content("What is Summit Therapeutics' CIK with the SEC?").text

"Summit Therapeutics Inc.'s CIK (Central Index Key) with the SEC is **0001349558**.\n"

Use [Grounding with Google Search](https://cloud.google.com/vertex-ai/generative-ai/docs/grounding/overview) to provide Gemini with a tool to conduct Google searches.

In [12]:
from google import genai
# Alias the Gen AI SDK's Tool so it does not shadow the vertexai Tool imported earlier
from google.genai.types import Tool as SearchTool, GenerateContentConfig, GoogleSearch, HttpOptions
from IPython.display import display, HTML
import os

# Reuse the project ID retrieved when you initialized Vertex AI
os.environ['GOOGLE_CLOUD_PROJECT'] = project_id
location = os.environ['GOOGLE_CLOUD_LOCATION'] = 'us-central1'
os.environ['GOOGLE_GENAI_USE_VERTEXAI'] = 'True'

# Initialize the client, explicitly passing project and location for Vertex AI
client = genai.Client(project=project_id, location=location, http_options=HttpOptions(api_version="v1"))
model_id = "gemini-2.0-flash"

# Configure Google Search as a tool for grounding
search_tool = SearchTool(
    google_search=GoogleSearch()
)

# Helper function to lookup a company's SEC CIK
def lookup_cik(company_name: str) -> str:
    prompt = f"""
Find the Securities and Exchange Commission's Central Index Key (CIK)
for {company_name}.
Return only the 10-digit integer CIK including leading zeroes.
"""
    # Generate content with grounding via Google Search
    response = client.models.generate_content(
        model=model_id,
        contents=prompt,
        config=GenerateContentConfig(
            tools=[search_tool],
            response_modalities=["TEXT"],
        )
    )

    # Concatenate all text parts of the response
    cik = "".join(part.text for part in response.candidates[0].content.parts).strip()
    print(f"Lookup for CIK of {company_name} resulted in: {cik}\n")

    # Display the grounding metadata (search suggestions and sources), if any was returned
    metadata = getattr(response.candidates[0], 'grounding_metadata', None)
    if metadata and metadata.search_entry_point:
        display(HTML(metadata.search_entry_point.rendered_content))

    return cik

**Note**: When you use Grounding with Google Search, you are required to display the corresponding Google Search Suggestions, which help you understand what Google search was conducted and allow you to investigate the results. The **lookup_cik** function above includes a **display** block that renders the provided HTML in this notebook.

Test the function and confirm you see the [Google Search Suggestion](https://cloud.google.com/vertex-ai/generative-ai/docs/grounding/grounding-search-suggestions) results.

In [13]:
lookup_cik("Summit Therapeutics")

Lookup for CIK of Summit Therapeutics resulted in: The Securities and Exchange Commission's (SEC) Central Index Key (CIK) for Summit Therapeutics is 0001599298.



"The Securities and Exchange Commission's (SEC) Central Index Key (CIK) for Summit Therapeutics is 0001599298."

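Notice that the reply is a full sentence even though the prompt asks for only the number. If you need the bare CIK, for example to build an EDGAR URL, one option is to pull the 10-digit run out of the reply. The helper `extract_cik` below is a hypothetical addition, not part of the lab code, and it assumes the reply contains the CIK already zero-padded to 10 digits.

```python
import re
from typing import Optional

def extract_cik(model_reply: str) -> Optional[str]:
    """Return the first standalone 10-digit number in the reply, or None."""
    # The lookarounds reject 10-digit slices of longer digit runs
    match = re.search(r"(?<!\d)\d{10}(?!\d)", model_reply)
    return match.group(0) if match else None

reply = (
    "The Securities and Exchange Commission's (SEC) Central Index Key (CIK) "
    "for Summit Therapeutics is 0001599298."
)
print(extract_cik(reply))                 # 0001599298
print(extract_cik("I could not find it."))  # None
```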
Create a FunctionDeclaration that tells Gemini about the function for looking up a **CIK**.

In [14]:
lookup_cik_fd = FunctionDeclaration(
    name="lookup_cik",
    description="Look up a company's CIK used for its SEC filings.",
    parameters={
        "type": "object",
        "properties": {
            "company_name": {
                "type": "string",
                "description": "The name of the company to look up."
            },
        },
        "required": [
            "company_name"
        ]
    },
)

**Note**: The **search_tool** defined above cannot be combined with other Tools in a single request to Gemini. To use Google Search alongside other functions, wrap the grounded search call in its own function, as **lookup_cik** does with a dedicated client, and expose that function to Gemini through a FunctionDeclaration, as you are doing here.

# Empower Gemini to retrieve document sections from relevant years

In this section, you will equip Gemini to analyze specific sections of public company documents across different timeframes. You will define functions to retrieve relevant filings and extract information from desired sections within those documents for a more comprehensive analysis.

In your notebook, paste and run the following `FunctionDeclaration` that you can use to let Gemini know about the function you used earlier to download SEC filings and review its parameters. In particular, pay attention to how the `items_of_interest` section allows Gemini to provide an array of strings, but consisting only of strings matching the enum values you provide (the names of the various sections in the reports).

In [15]:
retrieve_filings_fd = FunctionDeclaration(
    name="retrieve_filings",
    description="Retrieve filings from the SEC EDGAR API for a company within a date range.",
    parameters={
        "type": "object",
        "properties": {
            "cik": {
                "type": "string",
                "description": "The CIK of a company whose documents will be retrieved"
            },
            "starting_year_and_quarter": {
                "type": "string",
                "description": "The first year and quarter of the date range, in the format: 2024 Q1"
            },
            "ending_year_and_quarter": {
                "type": "string",
                "description": "The last year and quarter of the date range, in the format: 2024 Q4"
            },
            "include_quarterly_reports": {
                "type": "boolean",
                "description": "Whether to include 10-Q quarterly filings in addition to annual 10-K filings"
            },
            "items_of_interest": {
                "description": "An array of one or more sections (called items) of interest from 10-K or 10-Q filings",
                "type": "array",
                "items": {
                "type": "string",
                "enum": [
                    "Business",
                    "Risk Factors",
                    "Unresolved Staff Comments",
                    "Properties",
                    "Legal Proceedings",
                    "Mine Safety Disclosures",
                    "Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities",
                    "Management’s Discussion and Analysis of Financial Condition and Results of Operations",
                    "Quantitative and Qualitative Disclosures About Market Risk",
                    "Financial Statements and Supplementary Data",
                    "Changes in and Disagreements with Accountants on Accounting and Financial Disclosure",
                    "Controls and Procedures",
                    "Other Information",
                    "Disclosure Regarding Foreign Jurisdictions that Prevent Inspections",
                    "Directors, Executive Officers and Corporate Governance",
                    "Executive Compensation",
                    "Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters",
                    "Certain Relationships and Related Transactions, and Director Independence",
                    "Principal Accountant Fees and Services",
                    "Exhibit and Financial Statement Schedules",
                    "Form 10-K Summary"
                ]
                }
        }
        },
        "required": [
            "cik",
            "starting_year_and_quarter",
            "ending_year_and_quarter",
            "items_of_interest"
        ]
    },
)    
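The enum guides what Gemini sends, but you may still want a defensive check before looking names up in your own list. The sketch below is one way to do that. The names `ALLOWED_ITEMS`, `normalize_item`, and `split_items` are hypothetical, the list is abbreviated, and the apostrophe handling assumes the model might return a straight apostrophe where the item names use a curly one.

```python
# Abbreviated for the example; in the notebook this would be the full items list
ALLOWED_ITEMS = [
    "Business",
    "Risk Factors",
    "Management’s Discussion and Analysis of Financial Condition and Results of Operations",
]

def normalize_item(name: str) -> str:
    """Trim whitespace and use the curly apostrophe found in the item names."""
    return name.strip().replace("'", "’")

def split_items(requested, allowed):
    """Separate requested item names into those that are allowed and those that are not."""
    valid, unknown = [], []
    for name in requested:
        cleaned = normalize_item(name)
        (valid if cleaned in allowed else unknown).append(cleaned)
    return valid, unknown

valid, unknown = split_items(
    ["Risk Factors",
     "Management's Discussion and Analysis of Financial Condition and Results of Operations",
     "Outlook"],
    ALLOWED_ITEMS,
)
print(valid)
print(unknown)  # ['Outlook']
```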

Run the code snippet below to create a **Tool** that combines both of the `FunctionDeclarations` above.

In [16]:
sec_tool = Tool(function_declarations=[retrieve_filings_fd,
                                    lookup_cik_fd])    

Create a **system_instruction** and instantiate a new model that will follow the instructions to create analyses using the **SEC filings**. In your instructions, you will provide Gemini the current date so that it can calculate relevant dates for queries using relative terminology.

In [17]:
response1 = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Find the SEC CIK for Acme Corp. Return only the 10-digit CIK.",
    config=GenerateContentConfig(tools=[search_tool], response_modalities=["TEXT"])
)
cik = "".join(p.text for p in response1.candidates[0].content.parts).strip()

from vertexai.generative_models import GenerativeModel, GenerationConfig, Tool, FunctionDeclaration

# Define your function declarations (lookup_cik_fd, retrieve_filings_fd)
sec_tool = Tool(function_declarations=[lookup_cik_fd, retrieve_filings_fd])

system_instruction = """
    - You are a research assistant to a financial analyst.
    - Answer the user's question with an analysis, basing your response on information
    used in filings from the SEC EDGAR database.
    - Quote the SEC filing documents to support your analysis.
    - If you are not certain about a CIK, use your tool to look it up.
    - The current date is {current_date}.
    """.format(
        current_date=datetime.today().strftime("%Y-%m-%d")
    )

model = GenerativeModel(
    model_name="gemini-2.0-flash",
    generation_config=GenerationConfig(temperature=0),
    system_instruction=system_instruction,
    tools=[sec_tool]
)

response2 = model.generate_content(
    contents=f"Retrieve the 10-K for CIK {cik} covering 2023 Q4, and extract 'Risk Factors'."
)

When you ask Gemini a question related to a public company, it may return a response, or it may return a request for a function call. Starting with `Gemini 2.0`, it may ask for function calls in separate rounds of chat, or **multiple parallel function calls** in a single round of response. If multiple function calls are requested, your response **must include a function response for each function call** Gemini has requested (the Part.from_function_response section). To handle these cases, it is usually very helpful to define a handle_response function:

In [18]:
def handle_response(response):

    parts_for_inner_response = []
    for part in response.candidates[0].content.parts:
        # If the content part has an attribute called 'text'
        if hasattr(part, "text"):
            print("\n" + part.text)
        # If the content has an attribute called 'function_call'
        if part.function_call:
            function_call = part.function_call
            try:
                if function_call.name == "lookup_cik":
                    cik = lookup_cik(function_call.args["company_name"])
                    parts_for_inner_response.append(
                        Part.from_function_response(
                            name="lookup_cik",
                            response={
                                "content": cik,
                            },
                        )
                    )
                if function_call.name == "retrieve_filings":
                    context = ""
                    filing_paths = download_range_of_filings(function_call.args["cik"],
                                    function_call.args["starting_year_and_quarter"],
                                    function_call.args["ending_year_and_quarter"],
                                    # Key must match the parameter name declared in retrieve_filings_fd
                                    function_call.args.get("include_quarterly_reports", False)
                    )
                    if filing_paths:
                        # TODO: Load cleaned filing docs
                        report = get_items_from_filings(function_call.args["items_of_interest"], filing_paths)
                        parts_for_inner_response.append(
                            Part.from_function_response(
                                name="retrieve_filings",
                                response={
                                    "content": report,
                                },
                            )
                        )
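                    else:
                        print("No valid filings found or an error was encountered in retrieving them.")
                        # Gemini expects a response for every function call, so report the failure too
                        parts_for_inner_response.append(
                            Part.from_function_response(
                                name="retrieve_filings",
                                response={
                                    "content": "No valid filings found.",
                                },
                            )
                        )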
            except AttributeError as e:
                print("Exception:")
                print(response)
                print(part)
                print(e)
    if parts_for_inner_response:
        inner_response = chat.send_message(parts_for_inner_response)
        handle_response(inner_response)   
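The rule that every requested function call needs its own function response can be illustrated without the SDK. The sketch below uses plain dictionaries in place of the SDK's function call and Part objects. The function `answer_function_calls` and the stub handler are hypothetical stand-ins, not part of the lab code.

```python
def answer_function_calls(calls, handlers):
    """Return exactly one response dict per requested call, in the same order."""
    responses = []
    for call in calls:
        handler = handlers.get(call["name"])
        if handler is None:
            content = f"Unknown function: {call['name']}"
        else:
            try:
                content = handler(**call["args"])
            except Exception as error:
                # Report the failure instead of silently dropping the response
                content = f"Error: {error}"
        responses.append({"name": call["name"], "response": {"content": content}})
    return responses

# Two parallel calls, as in a comparison of two companies
calls = [
    {"name": "lookup_cik", "args": {"company_name": "Alphabet"}},
    {"name": "lookup_cik", "args": {"company_name": "Amazon"}},
]
# Stub handler standing in for the real lookup_cik
handlers = {"lookup_cik": lambda company_name: f"CIK for {company_name}"}

responses = answer_function_calls(calls, handlers)
print(len(responses))  # 2
```

A lookup table of handlers keeps the dispatch logic in one place, so adding a new function means adding one entry and does not require another `if` branch.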

Start a chat session with the model.

In [19]:
chat = model.start_chat()

Run the following cell to ask a question that compares **a corresponding section across two companies' reports**.

In [20]:
response = chat.send_message("How do Alphabet's risks in 2024 compare to Amazon's?")
handle_response(response)

Lookup for CIK of Alphabet resulted in: The Securities and Exchange Commission (SEC) Central Index Key (CIK) for Alphabet Inc. is 0001652044.



Lookup for CIK of Amazon resulted in: The Securities and Exchange Commission's (SEC) Central Index Key (CIK) for Amazon is 0001018724.



Downloaded 10-K for CIK 0001652044 on 2024-01-31 to filings/0001652044/10-K_2024-01-31.htm
Items of interest: Risk Factors

Downloaded 10-K for CIK 0001018724 on 2024-02-02 to filings/0001018724/10-K_2024-02-02.htm
Items of interest: Risk Factors


Based on the 2024 SEC filings, here's a comparison of the risk factors for Alphabet and Amazon:

**Alphabet's Key Risks:**

*   **Competition:** Alphabet faces intense competition across various industries, including online advertising and AI. They must continue to innovate to remain competitive.
    > *"We face intense competition. If we do not continue to innovate and provide products and services that are useful to users, customers, and other partners, we may not remain competitive, which could harm our business, financial condition, and operating results."*
*   **Advertising Revenue Dependence:** A significant portion of Alphabet's revenue comes from advertising, making them vulnerable to reduced spending by advertisers or technologies t

You can also compare sections across time. Run the following code snippet to see an example of Gemini **comparing one section across multiple 'versions' (or filing years) of a document**.

In [21]:
chat = model.start_chat()
response = chat.send_message("How has Home Depot changed the way it describes its business over the past 3 years?")
handle_response(response)

Lookup for CIK of Home Depot resulted in: The Securities and Exchange Commission's (SEC) Central Index Key (CIK) for Home Depot is 0000354950.



Downloaded 10-K for CIK 0000354950 on 2025-03-21 to filings/0000354950/10-K_2025-03-21.htm
Downloaded 10-K for CIK 0000354950 on 2024-03-13 to filings/0000354950/10-K_2024-03-13.htm
Downloaded 10-K for CIK 0000354950 on 2023-03-15 to filings/0000354950/10-K_2023-03-15.htm
Downloaded 10-K for CIK 0000354950 on 2022-03-23 to filings/0000354950/10-K_2022-03-23.htm
Items of interest: Business


Over the past three years, Home Depot has consistently emphasized its interconnected shopping experience and strategic investments in supply chain and technology to enhance customer experience. Here's a breakdown of the key changes:

*   **Focus on Interconnected Experience:** All filings from 2022 to 2025 emphasize the interconnected shopping experience, highlighting the integration of digital and physical channels. For instance, the 2025 10-K states, "We continued our strategic investments aimed at creating an interconnected, frictionless shopping experience that enables our customers to seamlessl