# PNDB API Data Extraction Pipeline

This notebook extracts dataset metadata from PNDB (Pôle National de Données de Biodiversité), the French national biodiversity data hub, through its OpenDataSoft API and saves the result as a CSV file.

## Overview
- **Data Source**: PNDB (Pôle National de Données de Biodiversité)
- **API**: OpenDataSoft platform v1.0
- **Endpoint**: https://pndb.opendatasoft.com/api/datasets/1.0/search/
- **Goal**: Extract the title, description, and other metadata of biodiversity datasets for further analysis
- **Output**: Structured CSV file with dataset information

## 1. Import Required Libraries

In [2]:
import requests
import pandas as pd
import json
import time
from datetime import datetime
import os
from typing import List, Dict, Any
import re

print("Libraries imported successfully")

Libraries imported successfully


## 2. API Configuration

In [3]:
# PNDB API Configuration
BASE_URL = "https://pndb.opendatasoft.com/api/datasets/1.0/search/"
# NOTE: the key is hardcoded here for convenience; avoid committing real keys to shared notebooks.
API_KEY = "7de48c5bd7100022a4527661269206a9e92125f940f32dcd749b5da1"

# API Parameters (OpenDataSoft v1.0 format)
PARAMS = {
    'rows': 100,
    'start': 0,
    'apikey': API_KEY
}

# Headers for the API request
HEADERS = {
    'User-Agent': 'PNDB-Data-Extractor/1.0',
    'Accept': 'application/json'
}

print(f"API Base URL: {BASE_URL}")
print(f"Using API Key: {API_KEY[:20]}...")
# Caution: PARAMS contains the full API key, so this line echoes it into the cell output.
print(f"Initial parameters: {PARAMS}")

API Base URL: https://pndb.opendatasoft.com/api/datasets/1.0/search/
Using API Key: 7de48c5bd7100022a452...
Initial parameters: {'rows': 100, 'start': 0, 'apikey': '7de48c5bd7100022a4527661269206a9e92125f940f32dcd749b5da1'}
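
The configuration cell above hardcodes the key and, through the `PARAMS` dump, prints it in full. A minimal sketch of an alternative follows, assuming the key is supplied through an environment variable. The variable name `PNDB_API_KEY` and the helpers `load_api_key`, `mask_secret`, and `printable_params` are hypothetical; none of them is part of the notebook or of the OpenDataSoft API.

```python
import os


def load_api_key(env_var: str = "PNDB_API_KEY") -> str:
    """Read the API key from the environment (the variable name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key


def mask_secret(value: str, visible: int = 4) -> str:
    """Return a printable form of a secret that shows only its first characters."""
    if len(value) <= visible:
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible)


def printable_params(params: dict, secret_keys=("apikey",)) -> dict:
    """Copy params with secret values masked, so the copy is safe to print."""
    return {
        name: mask_secret(str(value)) if name in secret_keys else value
        for name, value in params.items()
    }
```

In the notebook this would amount to `API_KEY = load_api_key()` and `print(printable_params(PARAMS))`, which keeps the key out of both the source and the saved output.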


## 3. Test API Connection

In [4]:
# Test the API connection
test_params = PARAMS.copy()
test_params['rows'] = 1

try:
    print("Testing API connection...")
    response = requests.get(BASE_URL, params=test_params, headers=HEADERS, timeout=30)
    response.raise_for_status()
    
    test_data = response.json()
    
    print(f"API Connection Successful")
    print(f"Status Code: {response.status_code}")
    print(f"Response Keys: {list(test_data.keys())}")
    
    if 'datasets' in test_data and len(test_data['datasets']) > 0:
        print("\nSample Dataset Structure:")
        sample_dataset = test_data['datasets'][0]
        for key in sample_dataset.keys():
            print(f"  - {key}: {type(sample_dataset[key])}")
        
        if 'metas' in sample_dataset:
            print("\nAvailable Metadata Fields:")
            metas = sample_dataset['metas']
            for key in list(metas.keys())[:10]:
                print(f"  - metas.{key}: {type(metas[key])}")
    
    total_datasets = test_data.get('nhits', 0)
    print(f"\nTotal datasets available: {total_datasets:,}")
    
except requests.exceptions.RequestException as e:
    print(f"API Connection Failed: {e}")
    print("Please check your internet connection and API endpoint.")
except Exception as e:
    print(f"Unexpected error: {e}")

Testing API connection...
API Connection Successful
Status Code: 200
Response Keys: ['nhits', 'parameters', 'datasets']

Sample Dataset Structure:
  - datasetid: <class 'str'>
  - metas: <class 'dict'>
  - has_records: <class 'bool'>
  - data_visible: <class 'bool'>
  - features: <class 'list'>
  - attachments: <class 'list'>
  - alternative_exports: <class 'list'>
  - fields: <class 'list'>
  - basic_metas: <class 'dict'>
  - interop_metas: <class 'dict'>
  - extra_metas: <class 'dict'>

Available Metadata Fields:
  - metas.domain: <class 'str'>
  - metas.staged: <class 'bool'>
  - metas.visibility: <class 'str'>
  - metas.metadata_processed: <class 'str'>
  - metas.modified: <class 'str'>
  - metas.license: <class 'str'>
  - metas.description: <class 'str'>
  - metas.publisher: <class 'str'>
  - metas.theme: <class 'list'>
  - metas.title: <class 'str'>

Total datasets available: 10,795
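
With `nhits` known from the test call, the number of requests a full crawl needs follows from simple arithmetic. The sketch below is illustrative only; `plan_pages` is a hypothetical helper that the notebook does not define.

```python
from typing import Iterator, Tuple


def plan_pages(total: int, rows: int = 100) -> Iterator[Tuple[int, int]]:
    """Yield (start, rows) pairs covering `total` items in pages of `rows`."""
    if rows <= 0:
        raise ValueError("rows must be positive")
    for start in range(0, total, rows):
        # The last page may be shorter than `rows`.
        yield start, min(rows, total - start)


total_datasets = 10_795  # value reported by the test call above
pages = list(plan_pages(total_datasets, rows=100))
print(len(pages))   # 108 requests
print(pages[-1])    # (10700, 95): offset and size of the final page
```

Knowing the planned page count up front also gives a simple way to notice when a crawl ends earlier than expected.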


## 4. Data Extraction Function

In [5]:
def extract_pndb_data(max_records: int = None) -> List[Dict[Any, Any]]:
    """
    Extract data from PNDB API with pagination support.
    
    Args:
        max_records: Maximum number of records to extract (None for all)
    
    Returns:
        List of dataset dictionaries
    """
    all_datasets = []
    start = 0
    rows = 100
    
    print(f"Starting data extraction from PNDB API...")
    print(f"Target: {'All available records' if max_records is None else f'{max_records:,} records'}")
    
    while True:
        current_params = PARAMS.copy()
        current_params['start'] = start
        current_params['rows'] = rows
        
        try:
            print(f"\nFetching records {start + 1:,} to {start + rows:,}...")
            response = requests.get(BASE_URL, params=current_params, headers=HEADERS, timeout=30)
            response.raise_for_status()
            
            data = response.json()
            current_results = data.get('datasets', [])
            
            if not current_results:
                print("No more results found. Extraction complete.")
                break
            
            all_datasets.extend(current_results)
            print(f"Retrieved {len(current_results)} records. Total so far: {len(all_datasets):,}")
            
            if max_records and len(all_datasets) >= max_records:
                all_datasets = all_datasets[:max_records]
                print(f"Reached target of {max_records:,} records.")
                break
            
            total_count = data.get('nhits', 0)
            if start + rows >= total_count:
                print(f"Reached end of available data ({total_count:,} total records).")
                break
            
            start += rows
            time.sleep(0.1)
            
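        except requests.exceptions.RequestException as e:
            # Stop paginating on the first failed request; records fetched so far are still returned.
            print(f"Error during API request: {e}")
            break
        except json.JSONDecodeError as e:
            # Malformed response body: stop here as well rather than skip a page.
            print(f"Error parsing JSON response: {e}")
            break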
    
    print(f"\nExtraction completed! Total records extracted: {len(all_datasets):,}")
    return all_datasets
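
`extract_pndb_data` stops at the first failed request and returns whatever was collected up to that point. One way to make a long crawl more tolerant of transient failures is to retry each page a few times with an increasing delay. The sketch below is a generic pattern, not the notebook's method; `fetch_with_retry` and its parameters are hypothetical, and the `fetch` callable stands in for the `requests.get(...)` call.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: tuple = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch` up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * (2 ** attempt))
```

Inside the pagination loop, the request and its `raise_for_status()` call could be wrapped in a small function and passed as `fetch`, with `retry_on=(requests.exceptions.RequestException,)`, so that HTTP errors are retried as well as connection errors. The `sleep` parameter exists mainly so the delay can be replaced in tests.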

## 5. Data Processing Function

In [6]:
def process_dataset_metadata(datasets: List[Dict]) -> pd.DataFrame:
    """
    Process raw dataset metadata into a structured DataFrame.
    
    Args:
        datasets: List of raw dataset dictionaries from API
    
    Returns:
        Processed pandas DataFrame
    """
    processed_data = []
    
    print(f"Processing {len(datasets):,} datasets...")
    
    for i, dataset in enumerate(datasets):
        try:
            metas = dataset.get('metas', {})
            
            title = metas.get('title', dataset.get('title', ''))
            description = metas.get('description', dataset.get('description', ''))
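            # 'creator' falls back through publisher, creator, then source,
            # so it mirrors the 'publisher' column whenever that field is set.
            creator = metas.get('publisher', metas.get('creator', metas.get('source', '')))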
            theme = metas.get('theme', metas.get('keyword', ''))
            keywords = metas.get('keyword', metas.get('keywords', ''))
            language = metas.get('language', 'fr')
            geographic_coverage = metas.get('geographic_coverage', metas.get('spatial', ''))
            
            if isinstance(theme, list):
                theme = '; '.join(theme)
            if isinstance(keywords, list):
                keywords = '; '.join(keywords)
            
            # Clean HTML from description
            clean_description = re.sub(r'<[^>]+>', ' ', str(description))
            clean_description = re.sub(r'\s+', ' ', clean_description).strip()
            
            dataset_info = {
                'dataset_id': dataset.get('datasetid', ''),
                'title': str(title).strip(),
                'description': clean_description,
                'creator': str(creator).strip(),
                'theme': str(theme).strip(),
                'keywords': str(keywords).strip(),
                'language': str(language).strip(),
                'geographic_coverage': str(geographic_coverage).strip(),
                'records_count': metas.get('records_count', 0),
                'modified': metas.get('modified', ''),
                'publisher': metas.get('publisher', ''),
                'license': metas.get('license', ''),
                'territory': (
                    ', '.join(metas.get('territory', []))
                    if isinstance(metas.get('territory'), list)
                    else str(metas.get('territory', ''))
                )
            }
            
            processed_data.append(dataset_info)
            
            if (i + 1) % 1000 == 0:
                print(f"  Processed {i + 1:,}/{len(datasets):,} datasets...")
                
        except Exception as e:
            print(f"Error processing dataset {i}: {e}")
            continue
    
    df = pd.DataFrame(processed_data)
    
    print(f"\nProcessing completed")
    print(f"DataFrame shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")
    
    return df
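
The two normalisation steps inside `process_dataset_metadata`, stripping HTML tags and flattening list-valued fields, can be isolated and checked on small inputs. The helper names `strip_html` and `join_values` are hypothetical; the regular expressions are the ones used in the function above. The call to `html.unescape` is an addition the notebook does not make, included on the assumption that descriptions may contain entities such as `&amp;`. The second theme value in the example is made up.

```python
import html
import re
from typing import Any


def strip_html(text: Any) -> str:
    """Replace tags with spaces, decode entities, and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", str(text))
    decoded = html.unescape(no_tags)  # addition: not in the notebook's version
    return re.sub(r"\s+", " ", decoded).strip()


def join_values(value: Any, sep: str = "; ") -> str:
    """Flatten a list-valued metadata field into one delimited string."""
    if isinstance(value, list):
        return sep.join(str(item) for item in value)
    return "" if value is None else str(value).strip()


print(strip_html("<p>Lien vers les <a href='#'>données</a> &amp; notes</p>"))
# Lien vers les données & notes
print(join_values(["Species distribution", "Example theme"]))
# Species distribution; Example theme
```

Replacing tags with a space rather than an empty string keeps words on either side of a tag from being fused together; the whitespace pass then removes the doubled spaces this leaves behind.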

## 6. Execute Data Extraction

In [7]:
print("=== PNDB Data Extraction Started ===")
print(f"Timestamp: {datetime.now()}")

# Extract all available datasets
raw_datasets = extract_pndb_data(max_records=None)

print(f"\nRaw extraction completed: {len(raw_datasets):,} datasets retrieved")

=== PNDB Data Extraction Started ===
Timestamp: 2025-11-06 12:49:17.628455
Starting data extraction from PNDB API...
Target: All available records

Fetching records 1 to 100...
Retrieved 100 records. Total so far: 100

Fetching records 101 to 200...
Retrieved 100 records. Total so far: 200

Fetching records 201 to 300...
Retrieved 100 records. Total so far: 300

Fetching records 301 to 400...
Retrieved 100 records. Total so far: 400

Fetching records 401 to 500...
Retrieved 100 records. Total so far: 500

Fetching records 501 to 600...
Retrieved 100 records. Total so far: 600

Fetching records 601 to 700...
Retrieved 100 records. Total so far: 700

Fetching records 701 to 800...
Retrieved 100 records. Total so far: 800

Fetching records 801 to 900...
Retrieved 100 records. Total so far: 900

Fetching records 901 to 1,000...
Retrieved 100 records. Total so far: 1,000

Fetching records 1,001 to 1,100...
Retrieved 100 records. Total so far: 1,100

Fetching records 1,101 to 1,200...
Retrieved 100 records. Total so far: 1,200

[output truncated]

## 7. Process and Structure the Data

In [8]:
if raw_datasets:
    df_pndb = process_dataset_metadata(raw_datasets)
    
    print("\n=== Data Processing Results ===")
    print(f"Total datasets processed: {len(df_pndb):,}")
    print(f"DataFrame shape: {df_pndb.shape}")
    
    print("\nFirst 3 rows:")
    display(df_pndb.head(3))
    
    print("\nData types:")
    print(df_pndb.dtypes)
    
    print("\nBasic statistics:")
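    # Caveat: empty strings are not NaN, so notna() counts them as present.
    # These two figures are therefore non-null counts rather than truly non-empty ones.
    print(f"  Non-empty titles: {df_pndb['title'].notna().sum():,}")
    print(f"  Non-empty descriptions: {df_pndb['description'].notna().sum():,}")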
    print(f"  Average records per dataset: {df_pndb['records_count'].mean():.2f}")
    
else:
    print("No data was extracted. Please check the API connection and parameters.")

Processing 10,000 datasets...
  Processed 1,000/10,000 datasets...
  Processed 2,000/10,000 datasets...
  Processed 3,000/10,000 datasets...
  Processed 4,000/10,000 datasets...
  Processed 5,000/10,000 datasets...
  Processed 6,000/10,000 datasets...
  Processed 7,000/10,000 datasets...
  Processed 8,000/10,000 datasets...
  Processed 9,000/10,000 datasets...
  Processed 10,000/10,000 datasets...

Processing completed
DataFrame shape: (10000, 13)
Columns: ['dataset_id', 'title', 'description', 'creator', 'theme', 'keywords', 'language', 'geographic_coverage', 'records_count', 'modified', 'publisher', 'license', 'territory']

=== Data Processing Results ===
Total datasets processed: 10,000
DataFrame shape: (10000, 13)

First 3 rows:


| | dataset_id | title | description | creator | theme | keywords | language | geographic_coverage | records_count | modified | publisher | license | territory |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | dendrophyllia-cornigera-scleractinia-presence-... | Dendrophyllia cornigera (Scleractinia) presenc... | Lien vers les données Data were extracted from... | Ifremer | Species distribution | | fr | | 0 | 2021-06-14T14:27:24+00:00 | Ifremer | CC BY-NC-SA 4.0 | |
| 1 | suivi-floristiques-en-exclos-marais-de-sougeal... | Suivi floristiques en exclos (Marais de Sougea... | Lien vers les données Objectif : Cette tâche v... | ECOBIO UMR 6553 CNRS Université de Rennes 1 | | | fr | | 0 | 2018-06-15T09:18:52+00:00 | ECOBIO UMR 6553 CNRS Université de Rennes 1 | | |
| 2 | suivi-floristique-observation-aquatique-marais... | Suivi floristique -observation aquatique- (Mar... | Lien vers les données Cette opération vise à m... | ECOBIO UMR 6553 CNRS Université de Rennes 1 | | | fr | | 0 | 2018-03-15T14:25:51+00:00 | ECOBIO UMR 6553 CNRS Université de Rennes 1 | | |



Data types:
dataset_id             object
title                  object
description            object
creator                object
theme                  object
keywords               object
language               object
geographic_coverage    object
records_count           int64
modified               object
publisher              object
license                object
territory              object
dtype: object

Basic statistics:
  Non-empty titles: 10,000
  Non-empty descriptions: 10,000
  Average records per dataset: 0.01


## 8. Save Data to CSV

In [9]:
if 'df_pndb' in locals() and not df_pndb.empty:
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    csv_filename = f"pndb_metadata_extracted_{timestamp}.csv"
    
    try:
        df_pndb.to_csv(csv_filename, index=False, encoding='utf-8-sig')
        
        print(f"Data Export Completed ")
        print(f"File saved as: {csv_filename}")
        print(f"File size: {os.path.getsize(csv_filename) / 1024 / 1024:.2f} MB")
        print(f"Records saved: {len(df_pndb):,}")
        print(f"Columns saved: {len(df_pndb.columns)}")
        
        if os.path.exists(csv_filename):
            print(f"File successfully created and verified")
            
            test_df = pd.read_csv(csv_filename, encoding='utf-8-sig', nrows=5)
            print(f"File can be read back successfully")
            print(f"Sample verification: {len(test_df)} rows read")
        else:
            print("Error: File was not created")
            
    except Exception as e:
        print(f"Error saving to CSV: {e}")
        
else:
    print("No data available to save.")

Data Export Completed 
File saved as: pndb_metadata_extracted_20251106_125020.csv
File size: 7.66 MB
Records saved: 10,000
Columns saved: 13
File successfully created and verified
File can be read back successfully
Sample verification: 5 rows read
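
The export uses the `utf-8-sig` codec, which writes a byte order mark (BOM) at the start of the file; spreadsheet tools such as Excel typically rely on it to detect UTF-8 and show accented French text correctly. The sketch below uses only the standard library to show the effect on a throwaway file. The sample row and the helper `write_rows` are illustrative.

```python
import csv
import os
import tempfile


def write_rows(path: str, rows: list, encoding: str = "utf-8-sig") -> None:
    """Write a list of dicts to CSV; utf-8-sig prepends a UTF-8 BOM."""
    with open(path, "w", newline="", encoding=encoding) as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


sample = [{"dataset_id": "demo-1", "title": "Suivi floristique (données)"}]
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
write_rows(path, sample)

# The first three bytes of the file are the UTF-8 encoding of the BOM.
with open(path, "rb") as handle:
    has_bom = handle.read(3) == b"\xef\xbb\xbf"

# Reading with utf-8-sig strips the BOM; plain utf-8 would leave it
# attached to the first column name.
with open(path, newline="", encoding="utf-8-sig") as handle:
    round_trip = list(csv.DictReader(handle))

print(has_bom, round_trip == sample)  # True True
```

The practical consequence is that any downstream reader should also open the file with `utf-8-sig`, as the verification step above does, or the first column will be named `\ufeffdataset_id`.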


## 9. Summary

In [12]:
print("PNDB Data Extraction Summary")
print(f"Extraction completed at: {datetime.now()}")

if 'df_pndb' in locals() and not df_pndb.empty:
    print(f"\nResults:")
    print(f"  Total datasets extracted: {len(df_pndb):,}")
    print(f"  Data fields captured: {len(df_pndb.columns)}")
    print(f"  CSV file created: {csv_filename if 'csv_filename' in locals() else 'Not created'}")
    
    print(f"\nNext Steps:")
    print(f"  1. Apply Named Entity Recognition (NER) to extract species names")
    print(f"  2. Filter datasets by geographic regions of interest")
    print(f"  3. Create visualizations of data distribution and themes")
    print(f"  4. Explore individual datasets for detailed biodiversity data")
    
    print(f"\nTechnical Notes:")
    print(f"  API Endpoint: {BASE_URL}")
    print(f"  Pagination: 100 records per request")
    print(f"  Encoding: UTF-8 with BOM for French text")
    print(f"  Rate limiting: 0.1s delay between requests")
    print(f"  API Format: OpenDataSoft v1.0")
    
else:
    print("\nNo data was successfully extracted.")
    print("Please check:")
    print("  Internet connection")
    print("  API endpoint availability")
    print("  API key validity")
    print("  Request parameters")


PNDB Data Extraction Summary
Extraction completed at: 2025-11-06 12:52:55.334577

Results:
  Total datasets extracted: 10,000
  Data fields captured: 13
  CSV file created: pndb_metadata_extracted_20251106_125020.csv

Next Steps:
  1. Apply Named Entity Recognition (NER) to extract species names
  2. Filter datasets by geographic regions of interest
  3. Create visualizations of data distribution and themes
  4. Explore individual datasets for detailed biodiversity data

Technical Notes:
  API Endpoint: https://pndb.opendatasoft.com/api/datasets/1.0/search/
  Pagination: 100 records per request
  Encoding: UTF-8 with BOM for French text
  Rate limiting: 0.1s delay between requests
  API Format: OpenDataSoft v1.0
