# xDD Data Extraction: Retrieving Scientific Snippets

## 📌 Overview
This notebook automates retrieving **text snippets** from the **xDD API** and stores them in a structured form for later analysis of geological literature. The example run below queries the term **"sask craton"**, but the same method applies to any other geological concept (e.g., "volcanic arc").

**Workflow:**
1. Define API query parameters (e.g., search term, result limits).
2. Call the `process_and_save_results()` function from `xdd_api.py` to extract data.
3. Store the extracted snippets in **structured CSV files** inside dedicated folders.

---


## 1. Import Required Libraries

In [1]:
# Import core libraries
import os
import sys
import pandas as pd

# Import custom functions
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..", "src")))

# Import the extraction function
from xdd_api import process_and_save_results

# Move the working directory one level up 
os.chdir(os.path.abspath(os.path.join(os.getcwd(), "..")))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Drew\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Drew\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 2. Define Search Parameters

The xDD API allows querying scientific literature using **keyword-based search**. Below, we define:
- **Search terms**: The geological terms we want to extract.
- **API parameters**: Controls the type of results retrieved.
- **Storage folder**: Defines where the extracted data will be saved.

---

In [2]:
# List of terms to retrieve from xDD
search_terms = [
    "sask craton",
]


# Parameters for the xDD API call
params = {
    "full_results": True,          # Retrieve all matching documents
    "clean": True,                 # Remove HTML formatting
    "fragment_limit": 2000,        # Max number of snippets per document
    "max_acquired": "2023-01-01",  # Ensures reproducibility; xDD held 16,023,367 documents in total at this date
}

headers = {}  # Not required for xDD call

# Root folder for data
root_folder = r"D:\MSc\data"

### 🔹 Query Parameters for xDD API Extraction
Each parameter controls how the xDD API processes and filters search results when retrieving document snippets.

| Parameter          | Description                                                                                                                              | Usage Notes                                             |
| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
| `term`             | **Required.** Specifies the search term(s) or phrase(s). Supports wildcards (`*`) with at least 3 characters. Terms must be URL-encoded. | e.g. `volcanic arc`                                     |
| `case_sensitive`   | Optional. If included (no value needed), the search is case-sensitive. By default, searches are case-insensitive.                        | `default`                                               |
| `no_word_stemming` | Optional. If true, disables stemming (e.g., "stromatolite" ≠ "stromatolites"). Default is to use stemming.                               | `default`                                               |
| `full_results`     | Optional. Enables pagination and returns the total number of matching documents.                                                         | `true` – Necessary for full-text searches               |
| `clean`            | Optional. Removes HTML tags from highlights in results.                                                                                  | `true` – Necessary for raw text outputs                 |
| `fragment_limit`   | Optional. Limits number of text fragments per document. Default is `5`.                                                                  | `2000` – Set arbitrarily high to retrieve all fragments |
| `max_acquired`     | Optional. Limits results to documents acquired before a given time (ISO 8601 format or relative date).                                   | e.g., `2023-01-01` – Ensures reproducibility            |
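
To make the URL-encoding note in the table concrete, here is a minimal sketch of how a term and the parameters could be combined into a request URL. The endpoint `https://xdd.wisc.edu/api/snippets` and the helper name `build_snippets_url` are assumptions for illustration; how `process_and_save_results()` builds its requests is not shown in this notebook.

```python
from urllib.parse import quote, urlencode

# Assumption: the snippets route of the public xDD API; verify against the xDD docs.
XDD_SNIPPETS_URL = "https://xdd.wisc.edu/api/snippets"


def build_snippets_url(term, params, base_url=XDD_SNIPPETS_URL):
    """Return a URL for one search term (hypothetical helper, not part of xdd_api.py)."""
    query = {"term": term}
    for key, value in params.items():
        # Send booleans as lowercase strings ("true"/"false").
        query[key] = str(value).lower() if isinstance(value, bool) else value
    # quote_via=quote encodes spaces as %20 instead of "+".
    return f"{base_url}?{urlencode(query, quote_via=quote)}"


params = {
    "full_results": True,
    "clean": True,
    "fragment_limit": 2000,
    "max_acquired": "2023-01-01",
}
url = build_snippets_url("sask craton", params)
print(url)
```

Libraries such as `requests` perform this encoding automatically when the parameters are passed as a dictionary, so building the URL by hand is mainly useful for logging or debugging a query.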


## 3. Run Data Extraction

This step will:
- Send requests to the **xDD API** for each search term.
- **Retrieve and store** relevant scientific text snippets.
- Handle **pagination** automatically to extract all available results.
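
The pagination step can be sketched as a loop that follows a "next page" link until none is returned. The response layout used here (`success` → `data` and `next_page`) is an assumption about the xDD response, and `collect_all_pages` is a hypothetical helper, not the code in `xdd_api.py`. The fetch function is injected so the sketch runs without network access.

```python
def collect_all_pages(first_url, fetch_json, max_pages=1000):
    """Follow `next_page` links until none is returned, gathering all records."""
    records = []
    url = first_url
    for _ in range(max_pages):  # hard cap guards against an endless loop
        if not url:
            break
        # Assumed response layout: {"success": {"data": [...], "next_page": "..."}}
        payload = fetch_json(url).get("success", {})
        records.extend(payload.get("data", []))
        url = payload.get("next_page")
    return records


# Stand-in for the network: two fake pages keyed by a placeholder URL.
FAKE_PAGES = {
    "page-1": {"success": {"data": [{"_gddid": "a"}, {"_gddid": "b"}], "next_page": "page-2"}},
    "page-2": {"success": {"data": [{"_gddid": "c"}], "next_page": ""}},
}

snippets = collect_all_pages("page-1", FAKE_PAGES.__getitem__)
print(len(snippets))
```

In real use, `fetch_json` could be something like `lambda u: requests.get(u, timeout=30).json()`, assuming the `requests` package is installed.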

---


In [3]:
df = process_and_save_results(search_terms, params, headers, root_folder, overwrite=False)

[STATUS UPDATE] Elapsed Time: 0.00 hours - Processing 1 terms...
[INFO] (1/1) Processing term 'sask craton'...
Skipping term 'sask craton' as processed file already exists. Use overwrite=True to reprocess.
[INFO] No data retrieved or processing skipped for 'sask craton'.
[COMPLETED] Processed 0/1 terms in 0.00 minutes.
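
The log above shows that a term is skipped when its processed file already exists. A minimal sketch of such a check, assuming the `<root>/<term>/<term>_processed_text.csv` layout used later in this notebook; the helper names are hypothetical, and the actual logic inside `process_and_save_results()` may differ.

```python
import tempfile
from pathlib import Path


def processed_csv_path(root_folder, term):
    """Path of the processed CSV for a term (layout taken from this notebook)."""
    return Path(root_folder) / term / f"{term}_processed_text.csv"


def should_process(root_folder, term, overwrite=False):
    """Process a term if overwriting is requested or no processed file exists yet."""
    return overwrite or not processed_csv_path(root_folder, term).exists()


# Demonstration in a throwaway directory instead of the real data folder.
with tempfile.TemporaryDirectory() as root:
    before = should_process(root, "sask craton")
    csv_path = processed_csv_path(root, "sask craton")
    csv_path.parent.mkdir(parents=True)
    csv_path.touch()
    after = should_process(root, "sask craton")
    forced = should_process(root, "sask craton", overwrite=True)

print(before, after, forced)
```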


## 4. Review Extracted Data

Once the script has completed, we can examine the **retrieved text snippets**.

Each **search term** has its own **folder** containing:
- A **CSV file** with extracted text.
- A **log file** tracking API runtime and document count.

Let's now preview some of the extracted data.

---


In [4]:
# Root directory where data is stored (same absolute path as in Section 2;
# joining it onto os.getcwd() has no effect because the path is absolute)
root_folder = r"D:\MSc\data"

# Construct the correct file path (keeping spaces in folder/filenames)
term_folder = os.path.join(root_folder, search_terms[0])  
csv_file_path = os.path.join(term_folder, f"{search_terms[0]}_processed_text.csv")  # Keep original naming

# Check if the file exists and load it
if os.path.exists(csv_file_path):
    df = pd.read_csv(csv_file_path)
    print(f"✅ Successfully loaded {len(df)} records from {csv_file_path}")
    display(df.head(10))  # Show first 10 rows
else:
    print(f"⚠ No extracted data found at: {csv_file_path}")
    print("🔍 Check if the API query ran successfully and if the correct file structure is used.")


✅ Successfully loaded 894 records from D:\MSc\data\sask craton\sask craton_processed_text.csv


| Unnamed: 0 | pubname | publisher | _gddid | title | doi | coverDate | URL | authors | highlight | processed_text | search_term |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Canadian Journal of Earth Sciences | Canadian Science Publishing | 574b0e80cf58f14b93b599a2 | Archean rocks in the southern Rottenstone Doma... | 10.1139/e01-027 | July 2001 | http://www.nrcresearchpress.com/doi/abs/10.113... | Bickford, M E; Hamilton, M A; Wortman, G L; Hi... | THO. It is unlikely that the Archean rocks are... | tho unlikely archean rocks part archean sask c... | sask craton |
| 1 | Canadian Journal of Earth Sciences | Canadian Science Publishing | 574b0e80cf58f14b93b599a2 | Archean rocks in the southern Rottenstone Doma... | 10.1139/e01-027 | July 2001 | http://www.nrcresearchpress.com/doi/abs/10.113... | Bickford, M E; Hamilton, M A; Wortman, G L; Hi... | Domain, for Lithoprobe seismic sections indica... | domain lithoprobe seismic sections indicate sa... | sask craton |
| 2 | Canadian Journal of Earth Sciences | Canadian Science Publishing | 574b0e80cf58f14b93b599a2 | Archean rocks in the southern Rottenstone Doma... | 10.1139/e01-027 | July 2001 | http://www.nrcresearchpress.com/doi/abs/10.113... | Bickford, M E; Hamilton, M A; Wortman, G L; Hi... | Sci. Vol. 38, 2001 Fig. 4. Common Pb data for... | sci vol common pb data archean basement rocks ... | sask craton |
| 3 | Canadian Journal of Earth Sciences | Canadian Science Publishing | 574b0e80cf58f14b93b599a2 | Archean rocks in the southern Rottenstone Doma... | 10.1139/e01-027 | July 2001 | http://www.nrcresearchpress.com/doi/abs/10.113... | Bickford, M E; Hamilton, M A; Wortman, G L; Hi... | al. (1997) for exposed Archean rocks of the Sa... | exposed archean rocks sask craton including sa... | sask craton |
| 4 | Canadian Journal of Earth Sciences | Canadian Science Publishing | 574b0e80cf58f14b93b599a2 | Archean rocks in the southern Rottenstone Doma... | 10.1139/e01-027 | July 2001 | http://www.nrcresearchpress.com/doi/abs/10.113... | Bickford, M E; Hamilton, M A; Wortman, G L; Hi... | rocks of the exposed Sask craton. Rather, the ... | rocks exposed sask craton rather pb data plot ... | sask craton |
| 5 | Canadian Journal of Earth Sciences | Canadian Science Publishing | 574b0e80cf58f14b93b599a2 | Archean rocks in the southern Rottenstone Doma... | 10.1139/e01-027 | July 2001 | http://www.nrcresearchpress.com/doi/abs/10.113... | Bickford, M E; Hamilton, M A; Wortman, G L; Hi... | the elevated 208Pb/204Pb that is noted in some... | elevated pb pb noted samples sask craton rocks... | sask craton |
| 6 | Canadian Journal of Earth Sciences | Canadian Science Publishing | 574b0e80cf58f14b93b599a2 | Archean rocks in the southern Rottenstone Doma... | 10.1139/e01-027 | July 2001 | http://www.nrcresearchpress.com/doi/abs/10.113... | Bickford, M E; Hamilton, M A; Wortman, G L; Hi... | from the Sask craton. Comparisons with the Hea... | sask craton comparisons hearne craton | sask craton |
| 7 | Canadian Journal of Earth Sciences | Canadian Science Publishing | 574b0e80cf58f14b93b599a2 | Archean rocks in the southern Rottenstone Doma... | 10.1139/e01-027 | July 2001 | http://www.nrcresearchpress.com/doi/abs/10.113... | Bickford, M E; Hamilton, M A; Wortman, G L; Hi... | Lake Domain) or other Archean rocks within the... | lake domain archean rocks within internides sa... | sask craton |
| 8 | Canadian Journal of Earth Sciences | Canadian Science Publishing | 574b0e80cf58f14b93b599a2 | Archean rocks in the southern Rottenstone Doma... | 10.1139/e01-027 | July 2001 | http://www.nrcresearchpress.com/doi/abs/10.113... | Bickford, M E; Hamilton, M A; Wortman, G L; Hi... | of the Sask craton in the Iskwatikan and Hunte... | sask craton iskwatikan hunter bay windows | sask craton |
| 9 | Canadian Journal of Earth Sciences | Canadian Science Publishing | 574b0e80cf58f14b93b599a2 | Archean rocks in the southern Rottenstone Doma... | 10.1139/e01-027 | July 2001 | http://www.nrcresearchpress.com/doi/abs/10.113... | Bickford, M E; Hamilton, M A; Wortman, G L; Hi... | Collerson, and J.F. Lewry. In preparation. Dis... | collerson lewry preparation distribution arche... | sask craton |


## 5. Understanding the Extracted Data

Each row in the extracted dataset represents a **text snippet** retrieved from the xDD database.

### 🔹 Key Columns:
| Column          | Description |
|----------------|------------|
| `pubname`      | Journal name |
| `title`        | Article title |
| `doi`          | Digital Object Identifier (DOI) |
| `coverDate`    | Publication date |
| `URL`          | Direct link to the article |
| `authors`      | List of contributing authors |
| `highlight`    | **Raw text snippet** retrieved from xDD |
| `processed_text` | **Cleaned version** of the extracted text |
| `search_term` | Term used in the extraction call to xDD API |
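
To illustrate how a `processed_text` value relates to its `highlight`, here is a minimal sketch of one plausible cleaning step: lowercase the text, keep letters only, and drop stopwords. The stopword set below is a small hand-picked assumption, and `clean_snippet` is a hypothetical helper. The actual cleaning in `xdd_api.py` appears to do more (the preview suggests some additional tokens are removed), so this only approximates it.

```python
import re

# Assumption: a tiny illustrative stopword set, not the list used by xdd_api.py.
STOPWORDS = {
    "a", "an", "and", "are", "for", "from", "in", "is", "it",
    "of", "or", "other", "some", "that", "the", "to", "with",
}


def clean_snippet(text, stopwords=STOPWORDS):
    """Lowercase, replace non-letters with spaces, and remove stopwords."""
    letters_only = re.sub(r"[^a-z]+", " ", text.lower())
    return " ".join(word for word in letters_only.split() if word not in stopwords)


print(clean_snippet("the elevated 208Pb/204Pb that is noted in some samples"))
# → elevated pb pb noted samples
```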

---


In [5]:
# Count snippets (one per row), journals, and distinct author lists
num_snippets = len(df)
unique_journals = df["pubname"].nunique()
unique_author_lists = df["authors"].nunique()  # distinct author strings, not individual authors

print(f"📊 Total Snippets Retrieved: {num_snippets}")
print(f"📚 Unique Journals: {unique_journals}")
print(f"👨‍🔬 Unique Author Lists: {unique_author_lists}")


📊 Total Snippets Retrieved: 894
📚 Unique Journals: 21
👨‍🔬 Unique Author Lists: 114
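
Because each row is a snippet, several rows can belong to the same document, and the `authors` column holds a whole author list per row. A minimal sketch of counting distinct documents (via `_gddid`) and individual authors, assuming authors are separated by `;` as in the preview. The helper name and the sample rows (including the `doc-a`/`doc-b` identifiers) are illustrative only.

```python
def summarize_snippets(rows):
    """Count snippets, distinct documents, and distinct individual authors."""
    doc_ids = set()
    authors = set()
    n_snippets = 0
    for row in rows:
        n_snippets += 1
        doc_ids.add(row["_gddid"])
        names = row.get("authors")
        # Missing values load as NaN (a float) in pandas, so only split strings.
        if isinstance(names, str):
            authors.update(name.strip() for name in names.split(";") if name.strip())
    return {"snippets": n_snippets, "documents": len(doc_ids), "authors": len(authors)}


# Illustrative rows with placeholder document IDs.
sample_rows = [
    {"_gddid": "doc-a", "authors": "Bickford, M E; Hamilton, M A"},
    {"_gddid": "doc-a", "authors": "Bickford, M E; Hamilton, M A"},
    {"_gddid": "doc-b", "authors": "Hamilton, M A; Wortman, G L"},
    {"_gddid": "doc-b", "authors": float("nan")},
]

summary = summarize_snippets(sample_rows)
print(summary)
```

With the notebook's DataFrame, the same function could be called as `summarize_snippets(df.to_dict("records"))`.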


## 6. Next Steps

Now that we have **retrieved and stored the data**, the next step is to **compute and save** statistics from these text snippets.

📌 **Next Notebook: `2_Calculate Statistics.ipynb`**
- **Compute and save** word statistics.
- Store results for **statistical & semantic processing**.

---
