# xDD Data Extraction: Retrieving Scientific Snippets

> **Licensing & Attribution**  
> Code and methods in this notebook: **MIT License** (this repo).  
> Data and outputs derived from **xDD** (snippets, per-term CSVs, BibJSON from xDD metadata, figures that reproduce xDD content): **CC BY-NC 4.0** — attribution required; no commercial use.  
> Source: xDD (UW–Madison). “All xDD output is licensed under CC-BY-NC.”



## 📌 Overview
This notebook automates the retrieval of **text snippets** from the **xDD API** and stores them as structured files for later analysis. We focus on the term **"volcanic arc"**, but the method extends to other geological concepts.

**Workflow:**
1. Define API query parameters (e.g., search term, result limits).
2. Call the `process_and_save_results()` function from `xdd_api.py` to extract data.
3. Store the extracted snippets in **structured CSV files** inside dedicated folders.

---


## 1. Import Required Libraries

In [1]:
# Import core libraries
import os
import sys
import pandas as pd

# Import custom functions
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..", "src")))

# Import the extraction function
from xdd_api import process_and_save_results, build_bibliography_for_term

# Move the working directory one level up 
os.chdir(os.path.abspath(os.path.join(os.getcwd(), "..")))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Drew\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Drew\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 2. Define Search Parameters

The xDD API allows querying scientific literature using **keyword-based search**. Below, we define:
- **Search terms**: The geological terms we want to extract.
- **API parameters**: Controls the type of results retrieved.
- **Storage folder**: Defines where the extracted data will be saved.

---

In [6]:
# List of terms to retrieve from xDD
search_terms = [
    "volcanic arc",
]


# Parameters for the xDD API call
params = {
    "full_results": True,          # Enable pagination so all matching documents are retrieved
    "clean": True,                 # Remove HTML formatting from snippets
    "fragment_limit": 2000,        # Max number of snippets per document
    "max_acquired": "2023-01-01",  # Fixes the corpus for reproducibility; xDD held 16,023,367 documents at this date
}

headers = {}  # Not required for xDD call

# Root folder for data
root_folder = r"D:\MSc\data"

### 🔹 Query Parameters for xDD API Extraction
Each parameter controls how the xDD API processes and filters search results when retrieving document snippets.

| Parameter          | Description                                                                                                                              | Usage Notes                                             |
| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
| `term`             | **Required.** Specifies the search term(s) or phrase(s). Supports wildcards (`*`) with at least 3 characters. Terms must be URL-encoded. | e.g. `volcanic arc`                                     |
| `case_sensitive`   | Optional. If included (no value needed), the search is case-sensitive. By default, searches are case-insensitive.                        | Not set (default behavior is used)                      |
| `no_word_stemming` | Optional. If true, disables stemming (e.g., "stromatolite" ≠ "stromatolites"). Default is to use stemming.                               | Not set (default behavior is used)                      |
| `full_results`     | Optional. Enables pagination and returns the total number of matching documents.                                                         | `true` – Necessary for full-text searches               |
| `clean`            | Optional. Removes HTML tags from highlights in results.                                                                                  | `true` – Necessary for raw text outputs                 |
| `fragment_limit`   | Optional. Limits number of text fragments per document. Default is `5`.                                                                  | `2000` – Set arbitrarily high to retrieve all fragments |
| `max_acquired`     | Optional. Limits results to documents acquired before a given time (ISO 8601 format or relative date).                                   | e.g., `2023-01-01` – Ensures reproducibility            |
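
To make the table concrete, the sketch below shows how these parameters could be encoded into a single GET request URL. The endpoint URL and the helper name `build_snippet_url` are assumptions for illustration; the actual request logic lives in `xdd_api.py` and may differ.

```python
from urllib.parse import urlencode, quote

# Assumed endpoint; check the xDD API documentation before relying on it.
XDD_SNIPPETS_URL = "https://xdd.wisc.edu/api/snippets"


def build_snippet_url(term, params):
    """Return a GET URL for one search term (hypothetical helper)."""
    query = {"term": term}
    for key, value in params.items():
        # Send booleans as lowercase strings ("true"/"false").
        query[key] = str(value).lower() if isinstance(value, bool) else value
    # quote_via=quote encodes spaces as %20 rather than "+".
    return f"{XDD_SNIPPETS_URL}?{urlencode(query, quote_via=quote)}"


example_params = {
    "full_results": True,
    "clean": True,
    "fragment_limit": 2000,
    "max_acquired": "2023-01-01",
}
url = build_snippet_url("volcanic arc", example_params)
print(url)
```

The URL-encoding step matters because search phrases such as `volcanic arc` contain spaces, which the table notes must be encoded.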


## 3. Run Data Extraction

This step will:
- Send requests to the **xDD API** for each search term.
- **Retrieve and store** relevant scientific text snippets.
- Handle **pagination** automatically to extract all available results.
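
The pagination step can be pictured with the following minimal sketch. It assumes the response wraps results in a `success` object with a `data` list and a `next_page` URL that is empty or absent on the last page; those field names and the helper `collect_all_pages` are assumptions, not a description of what `process_and_save_results()` actually does. The fetch function is passed in so the loop can be tried offline.

```python
def collect_all_pages(first_url, fetch_json):
    """Follow `next_page` links until the API stops returning one.

    `fetch_json` is any callable mapping a URL to a parsed JSON dict,
    e.g. `lambda u: requests.get(u, timeout=60).json()`.
    """
    records = []
    seen = set()  # guards against a page that links back to itself
    url = first_url
    while url and url not in seen:
        seen.add(url)
        payload = fetch_json(url).get("success", {})
        records.extend(payload.get("data", []))
        url = payload.get("next_page")  # assumed field name
    return records


# Offline demonstration: fake pages stand in for real API responses.
fake_pages = {
    "page-1": {"success": {"data": [{"_gddid": "a"}, {"_gddid": "b"}], "next_page": "page-2"}},
    "page-2": {"success": {"data": [{"_gddid": "c"}], "next_page": ""}},
}
all_records = collect_all_pages("page-1", fake_pages.__getitem__)
print(len(all_records))
```

Injecting `fetch_json` keeps the loop independent of any HTTP library and makes it easy to test without network access.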

---


In [7]:
df = process_and_save_results(search_terms, params, headers, root_folder, overwrite=True)
build_bibliography_for_term(root_folder="D:/MSc/data", term="central mineral belt")

[STATUS UPDATE] Elapsed Time: 0.00 hours - Processing 1 terms...
[INFO] (1/1) Processing term 'volcanic arc'...
Processing term 'volcanic arc'...
Saved processed text for term 'volcanic arc'.
[INFO] Completed 'volcanic arc' successfully.
[BIBJSON] Building bibliography for 'volcanic arc' from D:\MSc\data\volcanic arc\volcanic arc_processed_text.csv ...
[BIBJSON] Wrote 20058 records → D:\MSc\data\volcanic arc\volcanic arc.bibjson
[BIBJSON] Wrote snippet→bib map → D:\MSc\data\volcanic arc\volcanic arc_bib_map.csv
[COMPLETED] Processed 1/1 terms in 1.75 minutes.
[BIBJSON] Wrote 69 records → D:/MSc/data\central mineral belt\central mineral belt.bibjson
[BIBJSON] Wrote snippet→bib map → D:/MSc/data\central mineral belt\central mineral belt_bib_map.csv


## 4. Review Extracted Data

Once the script has completed, we can examine the **retrieved text snippets**.

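Each **search term** has its own **folder** containing:
- A **CSV file** with the extracted text (`<term>_processed_text.csv`).
- A **log file** tracking API runtime and document count.
- A **BibJSON file** (`<term>.bibjson`) and a snippet-to-bibliography map (`<term>_bib_map.csv`).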

Let's now preview some of the extracted data.

---


In [4]:
# Define the root directory where data is stored (already absolute, so no join with the working directory is needed)
root_folder = r"D:\MSc\data"

# Construct the correct file path (keeping spaces in folder/filenames)
term_folder = os.path.join(root_folder, search_terms[0])  
csv_file_path = os.path.join(term_folder, f"{search_terms[0]}_processed_text.csv")  # Keep original naming

# Check if the file exists and load it
if os.path.exists(csv_file_path):
    df = pd.read_csv(csv_file_path)
    print(f"✅ Successfully loaded {len(df)} records from {csv_file_path}")
    display(df.head(10))  # Show first 10 rows
else:
    print(f"⚠ No extracted data found at: {csv_file_path}")
    print("🔍 Check if the API query ran successfully and if the correct file structure is used.")


✅ Successfully loaded 238 records from D:\MSc\data\central mineral belt\central mineral belt_processed_text.csv


Unnamed: 0,pubname,publisher,_gddid,title,doi,coverDate,URL,authors,highlight,processed_text,search_term
0,Earth and Planetary Science Letters,Elsevier,54cc15a9e138236bcc929e97,The importance of late- and post-orogenic crus...,10.1016/0012-821X(94)90207-0,July 1994,http://www.sciencedirect.com/science/article/p...,"Kerr, Andrew; Fryer, Brian J.",". Bailey, Geology of the Kaipokok Bay-Big Rive...",bailey geology kaipokok bay big river area cen...,central mineral belt
1,Open-File Report,USGS,55a4631ce13823757cc6f221,New information resources of the U.S. Geologic...,10.3133/ofr87400J,1987,https://pubs.er.usgs.gov/publication/ofr87400J,U.S. Geological Survey,". Geology of the central mineral belt, [Labrador]",geology central mineral belt labrador,central mineral belt
2,Circular,USGS,55b8d895e13823bd29ba8a6a,Uranium in the metal-mining districts of Colorado,10.3133/cir215,1953,https://pubs.er.usgs.gov/publication/cir215,"King, Robert Ugstad; Leonard, B.F.; Moore, F.B...","for uranium ore deposits, exclusive of the Col...",uranium ore deposits exclusive colorado platea...,central mineral belt
3,Circular,USGS,55b8d895e13823bd29ba8a6a,Uranium in the metal-mining districts of Colorado,10.3133/cir215,1953,https://pubs.er.usgs.gov/publication/cir215,"King, Robert Ugstad; Leonard, B.F.; Moore, F.B...","materials. Excluding the Colorado Plateaus, t...",materials excluding colorado plateaus metalmin...,central mineral belt
4,Circular,USGS,55b8d895e13823bd29ba8a6a,Uranium in the metal-mining districts of Colorado,10.3133/cir215,1953,https://pubs.er.usgs.gov/publication/cir215,"King, Robert Ugstad; Leonard, B.F.; Moore, F.B...",". However, recent discoveries of pitchblende o...",however recent discoveries pitchblende outside...,central mineral belt
5,Circular,USGS,55b8d895e13823bd29ba8a6a,Uranium in the metal-mining districts of Colorado,10.3133/cir215,1953,https://pubs.er.usgs.gov/publication/cir215,"King, Robert Ugstad; Leonard, B.F.; Moore, F.B...",confined to those districts from which metal p...,confined districts metal production high centr...,central mineral belt
6,Circular,USGS,55b8d895e13823bd29ba8a6a,Uranium in the metal-mining districts of Colorado,10.3133/cir215,1953,https://pubs.er.usgs.gov/publication/cir215,"King, Robert Ugstad; Leonard, B.F.; Moore, F.B...",", 1916; Bastin and Hill, 1917; Phair, 1952). W...",bastin hill phair within fringing central mine...,central mineral belt
7,Circular,USGS,55b8d895e13823bd29ba8a6a,Uranium in the metal-mining districts of Colorado,10.3133/cir215,1953,https://pubs.er.usgs.gov/publication/cir215,"King, Robert Ugstad; Leonard, B.F.; Moore, F.B...",of the central mineral belt have not,central mineral belt,central mineral belt
8,Canadian Journal of Earth Sciences,Canadian Science Publishing,5750a8e6cf58f19f0eea0f18,New Fossil Localities in the Lush's Bight Terr...,10.1139/e72-140,November 1972,http://www.nrcresearchpress.com/doi/abs/10.113...,"Strong, D. F.; Kean, B. F.","* SNELGROCA'E. ,K. 1928. The geology sf the c...",snelgroca geology sf central mineral belt sf n...,central mineral belt
9,Canadian Journal of Earth Sciences,Canadian Science Publishing,575137dacf58f1170ef3b0d2,Hydrothermal alteration and lithogeochemistry ...,10.1139/cjes-2015-0237,May 2016,http://www.nrcresearchpress.com/doi/10.1139/cj...,"Buschette, Michael J.; Piercey, Stephen J.; Po...","Mineral Belt, Newfoundland. In Geological Asso...",mineral belt newfoundland geological association,central mineral belt


## 5. Understanding the Extracted Data

Each row in the extracted dataset represents a **text snippet** retrieved from the xDD database.
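
Because each row is a snippet, one document can contribute several rows, and the row count is not the document count. The sketch below illustrates this with the `_gddid` values copied from preview rows 0 to 7 above. It assumes `_gddid` identifies a document, which the repeated values in the preview suggest.

```python
from collections import Counter

# _gddid values copied from preview rows 0-7 shown earlier in this notebook
preview_gddids = [
    "54cc15a9e138236bcc929e97",
    "55a4631ce13823757cc6f221",
    "55b8d895e13823bd29ba8a6a",
    "55b8d895e13823bd29ba8a6a",
    "55b8d895e13823bd29ba8a6a",
    "55b8d895e13823bd29ba8a6a",
    "55b8d895e13823bd29ba8a6a",
    "55b8d895e13823bd29ba8a6a",
]

snippets_per_doc = Counter(preview_gddids)

print(f"Snippets: {len(preview_gddids)}")     # rows
print(f"Documents: {len(snippets_per_doc)}")  # distinct _gddid values
print(snippets_per_doc.most_common(1))        # document with the most snippets
```

With the full DataFrame loaded, `df["_gddid"].nunique()` gives the document count and `df["_gddid"].value_counts()` gives snippets per document.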

### 🔹 Key Columns:
| Column          | Description |
|----------------|------------|
| `pubname`      | Journal name |
| `title`        | Article title |
| `doi`          | Digital Object Identifier (DOI) |
| `coverDate`    | Publication date |
| `URL`          | Direct link to the article |
| `authors`      | List of contributing authors |
| `highlight`    | **Raw text snippet** retrieved from xDD |
| `processed_text` | **Cleaned version** of the extracted text |
| `search_term` | Term used in the extraction call to xDD API |
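
To illustrate how `processed_text` relates to `highlight`, here is a rough approximation of the cleaning step: lowercase the text, keep only alphabetic tokens, and drop stop words. It is not the code in `xdd_api.py`. The function name `clean_snippet` and the short stop-word list are assumptions (the import output suggests the real pipeline uses NLTK's stop-word list), and details such as hyphen handling may differ.

```python
import re

# Small illustrative stop-word list; an assumption standing in for NLTK's English list.
STOPWORDS = {"of", "the", "have", "not", "in", "and", "for", "to", "a"}


def clean_snippet(text, stopwords=STOPWORDS):
    """Lowercase, keep alphabetic tokens only, and remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in stopwords)


# Two complete highlights taken from the preview rows above
print(clean_snippet(". Geology of the central mineral belt, [Labrador]"))
print(clean_snippet("of the central mineral belt have not"))
```

On these two preview highlights the sketch gives the same strings as the `processed_text` column, which suggests the real cleaning follows broadly similar steps.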

---


In [5]:
# Each row is one snippet, so len(df) counts snippets rather than documents
num_snippets = len(df)
unique_journals = df["pubname"].nunique()
# Counts distinct author-list strings, not individual people
unique_author_lists = df["authors"].nunique()

print(f"📊 Total Snippets Retrieved: {num_snippets}")
print(f"📚 Unique Journals: {unique_journals}")
print(f"👨‍🔬 Unique Author Lists: {unique_author_lists}")


📊 Total Snippets Retrieved: 238
📚 Unique Journals: 20
👨‍🔬 Unique Author Lists: 63
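
The `authors` column stores each document's author list as a single string, so `nunique()` on it counts distinct author lists rather than individual people. A minimal sketch for counting individual names follows. It assumes names are separated by `;`, as in the preview rows, and the helper name `unique_authors` is hypothetical.

```python
def unique_authors(author_fields, sep=";"):
    """Split each author-list string and return the set of distinct names."""
    names = set()
    for field in author_fields:
        if not isinstance(field, str):  # skip NaN or other missing values
            continue
        names.update(part.strip() for part in field.split(sep) if part.strip())
    return names


# Complete author strings taken from the preview output above
sample = [
    "Kerr, Andrew; Fryer, Brian J.",
    "U.S. Geological Survey",
    "Strong, D. F.; Kean, B. F.",
    "Strong, D. F.; Kean, B. F.",  # repeated to mimic several snippets from one document
]
authors = unique_authors(sample)
print(len(authors))
```

On the loaded data this would be called as `unique_authors(df["authors"])`. Name variants such as initials versus full first names are not reconciled, so the result is an upper bound on distinct people.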


## 6. Next Steps

Now that we have **retrieved and stored the data**, the next step is to **compute and save** statistics from these text snippets.

📌 **Next Notebook: `2_Calculate Statistics.ipynb`**
- **Compute and save** word statistics.
- Store results for **statistical & semantic processing**.

---
