# Gemini API: Data extraction

This notebook demonstrates how large language models can be used as a **practical data-cleaning tool for unstructured text**. Many social science datasets—interviews, reports, policy documents, open-ended survey responses—contain valuable information that is difficult to analyse in raw form. Entity extraction turns this text into structured variables (people, organisations, locations, dates) that can be reviewed, filtered, and analysed at scale. The focus here is not prediction, but **reducing manual cleaning and coding effort while preserving interpretability and research control**.

**Example use cases**

* Interview transcripts: extract referenced people, institutions, places, and time periods
* Policy documents: identify agencies, programs, laws, and jurisdictions mentioned
* Media and news analysis: track actors, organisations, and locations across articles
* Survey free-text responses: summarise named entities for downstream coding
* Archival or historical texts: standardise names and locations for quantitative analysis


## Setup

In [None]:
%pip install -U -q "google-genai>=1.0.0"

## Configure your API key

To run the following cell, your API key must be stored in a Colab Secret named `GOOGLE_API_KEY`. If you don't already have an API key, or you're not sure how to create a Colab Secret, see [Authentication](https://github.com/google-gemini/cookbook/blob/main/quickstarts/Authentication.ipynb) for an example.

In [2]:
from google import genai
from google.colab import userdata

# Read the key from the Colab Secret named GOOGLE_API_KEY
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
client = genai.Client(api_key=GOOGLE_API_KEY)
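
Outside Colab, `google.colab` cannot be imported, so the cell above fails. The following is a minimal sketch of a fallback, assuming the key is exported in an environment variable that is also named `GOOGLE_API_KEY`. The helper name `load_api_key` is hypothetical and not part of any library.

```python
import os


def load_api_key(name: str = "GOOGLE_API_KEY") -> str:
    """Return the API key from a Colab Secret, else from an environment variable."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        userdata = None

    key = None
    if userdata is not None:
        try:
            key = userdata.get(name)
        except Exception:
            # Secret missing or notebook access not granted; fall through.
            key = None

    # Assumption: outside Colab the key lives in an environment variable.
    key = key or os.environ.get(name)
    if not key:
        raise RuntimeError(f"No API key found under the name {name!r}.")
    return key
```

With this helper, the client could be created with `genai.Client(api_key=load_api_key())`.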

## Select the model

Set `MODEL_ID` to the model you want to use. This notebook uses a Gemini Flash preview model:

In [29]:
MODEL_ID = "gemini-3-flash-preview"

## Examples

### Extracting a few entities at once

This example shows how entity extraction can be used as a data-cleaning step for unstructured research text.
The text below resembles an excerpt from an interview transcript or policy document.

Our goal is to extract key entities—such as organisations, locations, and time periods—that are embedded in free text and are otherwise time-consuming to code manually.

In [30]:
interview_excerpt = """
In an interview conducted as part of the regional governance study, officials from the Department of
Education described the rollout of the Early Skills Development Program across Victoria and New South
Wales. According to a senior policy adviser at the Department of Education, the program was first
announced in mid-2019 and formally implemented in early 2020.

Researchers from Monash University and Deakin University reported collaborating with local education
authorities in Melbourne, Geelong, and Sydney to evaluate the program’s initial outcomes. One
interviewee from Monash University noted that coordination with the Victorian Department of Education
occurred through regular workshops held in Melbourne between 2020 and 2021.


"""

You will use a Gemini Flash model for fast responses. To see which Gemini models your API key can access, list them:

In [24]:
for m in client.models.list():
    if "gemini" in m.name.lower():
        print(m.name)

models/gemini-2.5-flash
models/gemini-2.5-pro
models/gemini-2.0-flash
models/gemini-2.0-flash-001
models/gemini-2.0-flash-exp-image-generation
models/gemini-2.0-flash-lite-001
models/gemini-2.0-flash-lite
models/gemini-exp-1206
models/gemini-2.5-flash-preview-tts
models/gemini-2.5-pro-preview-tts
models/gemini-flash-latest
models/gemini-flash-lite-latest
models/gemini-pro-latest
models/gemini-2.5-flash-lite
models/gemini-2.5-flash-image
models/gemini-2.5-flash-preview-09-2025
models/gemini-2.5-flash-lite-preview-09-2025
models/gemini-3-pro-preview
models/gemini-3-flash-preview
models/gemini-3-pro-image-preview
models/gemini-robotics-er-1.5-preview
models/gemini-2.5-computer-use-preview-10-2025
models/gemini-embedding-001
models/gemini-2.5-flash-native-audio-latest
models/gemini-2.5-flash-native-audio-preview-09-2025
models/gemini-2.5-flash-native-audio-preview-12-2025


In [31]:
from IPython.display import Markdown

cleaning_prompt = f"""
From the given text, extract the following entities and return them as lists.

Entities to extract:
- organisations
- locations
- time periods

Text:
{interview_excerpt}

Organisations = []
Locations = []
Time_periods = []
"""


response = client.models.generate_content(
    model=MODEL_ID,
    contents=cleaning_prompt,
)
# Render the response as Markdown in the notebook
Markdown(response.text)

Organisations = ["Department of Education", "Monash University", "Deakin University", "Victorian Department of Education"]
Locations = ["Victoria", "New South Wales", "Melbourne", "Geelong", "Sydney"]
Time_periods = ["mid-2019", "early 2020", "2020 and 2021"]
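
The response is plain text that merely looks like Python assignments. To use it as data, one option is to parse each `Name = [...]` line with `ast.literal_eval`, which evaluates literals only and never executes code. This is a sketch, assuming the model keeps the one-assignment-per-line format requested in the prompt. The helper `parse_assignments` is hypothetical, and `sample_output` is copied from the output above.

```python
import ast
import re


def parse_assignments(text: str) -> dict:
    """Parse lines such as `Name = [...]` or `Name = {...}` into a dict."""
    parsed = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(\w+)\s*=\s*(.+?)\s*$", line)
        if not match:
            continue  # skip blank lines, code fences, and commentary
        name, literal = match.groups()
        try:
            parsed[name] = ast.literal_eval(literal)
        except (ValueError, SyntaxError):
            continue  # not a valid Python literal; leave for manual review
    return parsed


# Copied from the model output shown above.
sample_output = """
Organisations = ["Department of Education", "Monash University", "Deakin University", "Victorian Department of Education"]
Locations = ["Victoria", "New South Wales", "Melbourne", "Geelong", "Sydney"]
Time_periods = ["mid-2019", "early 2020", "2020 and 2021"]
"""

entities = parse_assignments(sample_output)
print(entities["Locations"])
```

In the notebook, pass `response.text` instead of `sample_output`. Lines that fail to parse are skipped silently here; logging them would make manual review easier.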

You can also change the output format, for example by asking how often each entity appears:

In [32]:
entities_list_prompt = f"""
Extract entities and count how often each appears.

Return exactly:

Organisations = {{name: count}}
Locations = {{name: count}}
Time_periods = {{period: count}}

Text:
{interview_excerpt}
"""

response = client.models.generate_content(
    model=MODEL_ID,
    contents=entities_list_prompt
)

Markdown(response.text)

Organisations = {'Department of Education': 2, 'Monash University': 2, 'Deakin University': 1, 'Victorian Department of Education': 1}
Locations = {'Victoria': 1, 'New South Wales': 1, 'Melbourne': 2, 'Geelong': 1, 'Sydney': 1}
Time_periods = {'mid-2019': 1, 'early 2020': 1, '2020 and 2021': 1}
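
Counts reported by a model are worth checking before they enter an analysis. One quick check is to compare them with literal substring counts in the source text. The sketch below assumes exact, case-sensitive matching is acceptable. The helper `compare_counts` is hypothetical, and `sample_text` and `reported` are shortened stand-ins, not the notebook's actual excerpt or output.

```python
def compare_counts(text: str, reported: dict) -> dict:
    """Map each entity to (reported count, literal substring count in text)."""
    # Collapse line breaks so entities wrapped across lines still match.
    flat = " ".join(text.split())
    return {name: (count, flat.count(name)) for name, count in reported.items()}


# Shortened stand-in for the interview excerpt (illustrative only).
sample_text = """
Officials from the Department of
Education described the rollout across Victoria and New South Wales.
Coordination with the Victorian Department of Education occurred
through workshops held in Melbourne.
"""

# Hypothetical model-reported counts for the stand-in text.
reported = {
    "Department of Education": 1,
    "Victorian Department of Education": 1,
    "Melbourne": 1,
}

for name, (model_count, literal_count) in compare_counts(sample_text, reported).items():
    flag = "" if model_count == literal_count else "  <- review"
    print(f"{name}: model={model_count}, literal={literal_count}{flag}")
```

A literal count also includes occurrences inside longer names ("Department of Education" is a substring of "Victorian Department of Education"), so a mismatch is a prompt for manual review rather than proof of a model error.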

### Numbers

Try extracting phone numbers from a customer service email.

In [33]:
customer_service_email = """
  Hello,
  Thank you for reaching out to our customer support team regarding your
  recent purchase of our premium subscription service.
  Your activation code has been sent to +87 668 098 344
  Additionally, if you require immediate assistance, feel free to contact us
  directly at +1 (800) 555-1234.
  Our team is available Monday through Friday from 9:00 AM to 5:00 PM PST.
  For after-hours support, please call our
  dedicated emergency line at +87 455 555 678.
  Thanks for your business and look forward to resolving any issues
  you may encounter promptly.
  Thank you.
"""

In [34]:
phone_prompt = f"""
  From the given text, extract the following entities and return a list of them.
  Entities to extract: phone numbers.
  Text: {customer_service_email}
  Return your answer in a list:
"""

response = client.models.generate_content(
    model=MODEL_ID,
    contents=phone_prompt
)

Markdown(response.text)

- +87 668 098 344
- +1 (800) 555-1234
- +87 455 555 678
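
Extracted values can be checked against the source before use. The sketch below strips the Markdown bullets, confirms that each number occurs verbatim in the source text, and adds a digits-only form for de-duplication. It assumes the bullets start with `-` or `*`. The helper `check_phone_numbers` is hypothetical, and the sample strings are abbreviated from the cells above, with one deliberately altered number to show a failed check.

```python
import re


def check_phone_numbers(markdown_list: str, source_text: str) -> list:
    """Return one record per bullet: raw value, digits-only form, found-in-source flag."""
    records = []
    for line in markdown_list.splitlines():
        # Remove a leading "- " or "* " bullet marker, if present.
        item = re.sub(r"^\s*[-*]\s+", "", line).strip()
        if not item:
            continue
        records.append({
            "raw": item,
            "digits": re.sub(r"\D", "", item),   # keep digits only
            "in_source": item in source_text,    # verbatim match in the source
        })
    return records


# Abbreviated from the email above.
sample_source = """
Your activation code has been sent to +87 668 098 344
contact us directly at +1 (800) 555-1234.
"""

# The last entry is deliberately altered to illustrate a failed check.
sample_list = """
- +87 668 098 344
- +1 (800) 555-1234
- +87 455 555 679
"""

for record in check_phone_numbers(sample_list, sample_source):
    print(record)
```

In the notebook, pass `response.text` and `customer_service_email` instead of the sample strings. Any record with `in_source` set to `False` should be reviewed by hand.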

### URLs


Try extracting URLs and returning them as clickable links.

In [None]:
url_text = """
  Gemini API billing FAQs

  This page provides answers to frequently asked questions about billing
  for the Gemini API. For pricing information, see the pricing page
  https://ai.google.dev/pricing.
  For legal terms, see the terms of service
  https://ai.google.dev/gemini-api/terms#paid-services.

  What am I billed for?
  Gemini API pricing is based on total token count, with different prices
  for input tokens and output tokens. For pricing information,
  see the pricing page https://ai.google.dev/pricing.

  Where can I view my quota?
  You can view your quota and system limits in the Google Cloud console
  https://console.cloud.google.com/apis/api/generativelanguage.googleapis.com/quotas.

  Is GetTokens billed?
  Requests to the GetTokens API are not billed,
  and they don't count against inference quota.
"""

In [None]:
url_prompt = f"""
  From the given text, extract the following entities and return a list of them.
  Entities to extract: URLs.
  Text: {url_text}
  Do not duplicate entities.
  Return your answer in a markdown format:
"""

response = client.models.generate_content(
    model=MODEL_ID,
    contents=url_prompt
)

Markdown(response.text)

```
- https://ai.google.dev/pricing
- https://ai.google.dev/gemini-api/terms#paid-services
- https://console.cloud.google.com/apis/api/generativelanguage.googleapis.com/quotas
```
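
"Do not duplicate entities" is an instruction to the model, not a guarantee. The following sketch de-duplicates the returned URLs and checks their basic structure with `urllib.parse`. It assumes the URLs contain no parentheses or square brackets, so that Markdown link syntax can be split off safely. The helper `clean_urls` is hypothetical, and `sample_response` is an illustrative string, not actual model output.

```python
import re
from urllib.parse import urlparse


def clean_urls(markdown_text: str) -> list:
    """Extract http(s) URLs, drop duplicates, and keep first-seen order."""
    seen = []
    for candidate in re.findall(r"https?://[^\s<>()\[\]]+", markdown_text):
        url = candidate.rstrip(".,;:")  # drop trailing sentence punctuation
        parts = urlparse(url)
        if parts.scheme in ("http", "https") and parts.netloc and url not in seen:
            seen.append(url)
    return seen


# Illustrative response containing a duplicate and a Markdown-style link.
sample_response = """
- https://ai.google.dev/pricing
- [https://ai.google.dev/pricing](https://ai.google.dev/pricing)
- https://ai.google.dev/gemini-api/terms#paid-services.
"""

for url in clean_urls(sample_response):
    print(url)
```

In the notebook, pass `response.text` instead of `sample_response`.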