# LLM-for-Metadata-Harvesting

This notebook demonstrates the use of Large Language Models (LLMs) for automated metadata extraction from web-based dataset portals.  
It showcases an experiment on the [Actual probability distribution for Quercus robur (2000–2020)](https://stac.ecodatacube.eu/veg_quercus.robur_anv.eml/collection.json?.language=en) dataset, using a combination of web scraping and LLM-based entity extraction to populate metadata fields according to the Croissant data standard.

Key features:
- Web scraping utilities for extracting full page text from dataset portals
- Environment configuration for flexible API and model usage
- LLM client support for OpenAI and Gemini models (with extensibility for custom clients)
- Automated extraction of core metadata fields such as description, license, creator, keywords, and more

The results of the experiment are presented at the end of the notebook.  
Some code cells are included for future development and may not be directly relevant to this specific experiment.


# Scrape the dataset's web portal

You can write your own function to fetch the page content, or use the predefined function in `webutils`.

Before scraping, check the site's `robots.txt` to confirm that automated access to the page is permitted.
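As a minimal sketch of such a check, Python's standard `urllib.robotparser` module can evaluate a URL against a set of `robots.txt` rules. The helper name `is_scraping_allowed` and the sample rules below are illustrative assumptions, not part of `llm_metadata_harvester`. Against a live site you would instead call `set_url(...)` and `read()` on the parser to download the real `robots.txt`.

```python
from urllib.robotparser import RobotFileParser


def is_scraping_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, used only to illustrate the check
sample_rules = """User-agent: *
Disallow: /private/
"""

print(is_scraping_allowed(sample_rules, "https://example.org/data/collection.json"))
print(is_scraping_allowed(sample_rules, "https://example.org/private/page"))
```

Note that `robots.txt` only expresses the site operator's crawling preferences; a site's terms of use may add further restrictions.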

In [1]:
from llm_metadata_harvester.webutils import extract_full_page_text
import nest_asyncio

url = "https://stac.ecodatacube.eu/veg_quercus.robur_anv.eml/collection.json?.language=en"

# Apply nest_asyncio to allow asyncio.run() in Jupyter
nest_asyncio.apply()

# Run the async function
full_text = await extract_full_page_text(url)

# Optionally display or save it
print(full_text[:10])  # Print the first 10 characters

EcoDataCub


## 🔧 Environment Configuration

To configure your API keys or other environment variables, you can use a `.env` file or set them directly in your shell.

### 📄 Using a `.env` File

Place the `.env` file in **one** of the following locations:

- The **root directory** of your project  
- The **same directory** as the script you're running  
- Or any directory, **as long as it's the current working directory**

> ℹ️ The `load_dotenv()` function automatically looks for a `.env` file in the current working directory by default.

#### 💡 Example `.env` File
```env
OPENAI_API_KEY=your_api_key_here
ANOTHER_SECRET=value_here
```

### Using a Global Environment Variable

Alternatively, you can set environment variables directly in your shell:

```bash
export OPENAI_API_KEY=your_api_key_here
```

In [2]:
from tqdm import tqdm
from llm_metadata_harvester.harvester_operations import extract_entities
from llm_metadata_harvester.llm_client import LLMClient
from dotenv import load_dotenv

# Put your .env file in the project root or next to this script,
# or set the environment variables directly in your shell.
# load_dotenv() looks for a .env file in the current working directory.
load_dotenv()

True
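Once the environment has been loaded, a quick check can confirm that the expected keys are actually set before any API call is made. The sketch below uses only the standard library; the helper name `missing_env_vars` is a hypothetical illustration, and the list of required names should be adapted to the provider you use.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Explicit mappings are passed here so the example is reproducible;
# omit the second argument to check the real process environment.
print(missing_env_vars(["OPENAI_API_KEY"], {"OPENAI_API_KEY": "your_api_key_here"}))
print(missing_env_vars(["OPENAI_API_KEY"], {}))
```

An empty list means every required variable is available.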

## Metadata Fields

The metadata fields defined below follow the **Croissant data standard**.

In [3]:
# Define the metadata fields and their descriptions
# These fields come from the Croissant data standard
meta_field_dict = {
    "description": "Description of the dataset.",
    "license": "The license of the dataset. Croissant recommends using the URL of a known license, e.g., one of the licenses listed at https://spdx.org/licenses/.",
    "name": "The name of the dataset.",
    "creator": "The creator(s) of the dataset.",
    "datePublished": "The date the dataset was published.",
    "keywords": "A set of keywords associated with the dataset, either as free text, or a DefinedTerm with a formal definition.",
    "publisher": "The publisher of the dataset, which may be distinct from its creator.",
    "sameAs": "The URL of another Web resource that represents the same dataset as this one.",
    "dateModified": "The date the dataset was last modified.",
    "inLanguage": "The language(s) of the content of the dataset."
}

## LLM Client Support

`LLMClient` currently supports **OpenAI** and **Gemini** models.

To use other models, you can define your own LLM client class.  
Your custom class should implement a `chat` method that returns a string as the LLM response.
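A custom client could be sketched roughly as follows. The class name `EchoLLMClient` is hypothetical, and the assumption that `chat` receives a single prompt string is ours: check the package's `LLMClient` for the exact arguments it passes. This stand-in simply echoes the prompt so that it runs without any API key.

```python
class EchoLLMClient:
    """Hypothetical stand-in client that returns the prompt instead of calling a model."""

    def __init__(self, model_name: str = "echo", temperature: float = 0.0):
        self.model_name = model_name
        self.temperature = temperature

    def chat(self, prompt: str) -> str:
        # A real client would send the prompt to its model API here
        # and return the text of the response.
        return f"[{self.model_name}] {prompt}"


client = EchoLLMClient()
reply = client.chat("Extract the license.")
print(reply)
```

The only requirement stated above is that `chat` returns a string; how the class reaches its model is left to the implementation.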

In [4]:
llm = LLMClient(model_name="gemini-2.5-flash-preview-05-20", temperature=0.0)

clean_nodes = extract_entities(
    text=full_text,
    meta_field_dict=meta_field_dict,
    llm=llm
)

The output of `extract_entities` holds one list of dictionaries per metadata field, where each dictionary is structured like this:

```python
'license': [{'entity_name': 'license',
             'entity_value': 'CC-BY-SA-4.0',
             'source_id': 'chunk_0',
             'file_path': 'unknown_source'}]
```
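To work with these results, it can help to reduce each field to its plain values. The sketch below assumes the result is a dictionary keyed by field name, as the excerpt above suggests; the helper `collect_values` and the second sample entry (`chunk_1`) are hypothetical and serve only to show how duplicates from different chunks are merged.

```python
def collect_values(nodes):
    """Map each field name to its extracted values, dropping duplicates in order."""
    values = {}
    for field, entities in nodes.items():
        seen = []
        for entity in entities:
            value = entity["entity_value"]
            if value not in seen:
                seen.append(value)
        values[field] = seen
    return values


# Sample data modelled on the excerpt above; the chunk_1 entry is made up
sample_nodes = {
    "license": [
        {"entity_name": "license", "entity_value": "CC-BY-SA-4.0",
         "source_id": "chunk_0", "file_path": "unknown_source"},
        {"entity_name": "license", "entity_value": "CC-BY-SA-4.0",
         "source_id": "chunk_1", "file_path": "unknown_source"},
    ]
}

print(collect_values(sample_nodes))
```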