# BiS₂ Superconductor Literature Mining: Epistemological Claims Extraction

## 📋 Project Overview

This notebook implements a **two-stage information extraction pipeline** for BiS₂-based superconductor research papers:

- **Stage 1 (This Notebook):** Extract epistemological claims with structured attributes from scientific text using Gemma 2 9B-IT
- **Stage 2 (Future):** Convert extracted claims into graph-ready JSON entities and relationships

### Input
- Scientific paper excerpts (abstracts + key sections, 100-1000 words)
- Text segments tagged with source identifiers (e.g., `ARTICLE: 1306.3346v2`)
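
As a minimal sketch of how a source tag could be read from a segment before extraction, the helper below uses a regular expression. The function name `find_source_ids` and the pattern are assumptions made for illustration; the pattern simply captures whatever follows `ARTICLE:` up to the next semicolon, space, or closing brace.

```python
import re

# Assumed tag format: "ARTICLE:" followed by an identifier such as 1306.3346v2.
SOURCE_TAG = re.compile(r"ARTICLE:\s*([^;\s}]+)")


def find_source_ids(segment: str) -> list[str]:
    """Return every source identifier tagged in a text segment, in order."""
    return SOURCE_TAG.findall(segment)


example = "{ARTICLE: 1306.3346v2; We synthesized polycrystalline LaO0.5F0.5BiS2.}"
print(find_source_ids(example))
```

Collecting the identifiers up front makes it possible to check later that every extracted claim cites a source that actually appears in the input.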

### Output
- Structured Markdown lists containing:
  - **Claims**: Self-contained scientific findings
  - **Meta-attributes**: Source, study type, epistemic classification
  - **Physical attributes**: Materials, drivers, effects, mechanisms, trends

---




## 1. Environment Setup and Dependency Installation

In this notebook, we use **Gemma 2 9B-IT** for claim extraction. This requires the `transformers` ecosystem, plus `accelerate` for device placement and `bitsandbytes` for 4-bit or 8-bit quantization, so that inference fits on the available GPU hardware.



In [None]:
# Install core libraries for model loading and inference
!pip install -q -U transformers accelerate bitsandbytes

print("✅ Packages installed successfully.")

✅ Packages installed successfully.



### 1.1 Authentication and Secret Management

Accessing gated models like **Gemma 2 9B** requires authentication via a Hugging Face token. The following script detects the running environment (Google Colab or Kaggle), retrieves the `HF_TOKEN` stored in the respective secrets manager, and falls back to the `HF_TOKEN` environment variable elsewhere.


In [None]:
# -----------------------------------------------------------------------------
# Authentication and secret management
# -----------------------------------------------------------------------------
import os
from huggingface_hub import login

try:
    # 1. Try Google Colab Secrets
    from google.colab import userdata
    HF_TOKEN = userdata.get('HF_TOKEN')
    print("✅ Detected Google Colab environment.")

except ImportError:
    try:
        # 2. Try Kaggle Secrets
        from kaggle_secrets import UserSecretsClient
        user_secrets = UserSecretsClient()
        HF_TOKEN = user_secrets.get_secret("HF_TOKEN")
        print("✅ Detected Kaggle environment.")

    except ImportError:
        # 3. Fallback to local environment variable (None if unset)
        HF_TOKEN = os.getenv("HF_TOKEN")
        if HF_TOKEN:
            print("✅ Detected Local/Generic environment.")

# Validate and Login
if HF_TOKEN:
    os.environ["HF_TOKEN"] = HF_TOKEN
    try:
        login(token=HF_TOKEN)
        print("✅ Successfully logged in to Hugging Face Hub.")
    except Exception as e:
        print(f"❌ Login failed: {e}")
else:
    raise ValueError(
        "❌ Error: 'HF_TOKEN' not found. Add it to Colab Secrets, Kaggle Secrets, "
        "or the HF_TOKEN environment variable."
    )

✅ Detected Google Colab environment.


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


✅ Successfully logged in to Hugging Face Hub.


## 2. Model Initialization (Gemma 2 9B)

To perform claim extraction, we load the instruction-tuned **Gemma 2 9B** model (`google/gemma-2-9b-it`). Given the memory constraints of Colab and Kaggle GPUs, we use `BitsAndBytesConfig` to load the model with **4-bit quantization** (NF4 format). This significantly reduces VRAM usage while maintaining inference quality suitable for semantic extraction tasks.
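
A rough back-of-the-envelope estimate shows why 4-bit loading matters. The sketch below assumes a nominal 9 billion parameters and counts weight storage only; activations, the KV cache, and quantization metadata add to the real footprint. The function name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 9e9  # assumption: nominal parameter count implied by the "9B" name

fp16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4)
print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{nf4_gb:.1f} GB")
```

Under these assumptions the weights shrink by roughly a factor of four, from about 18 GB to about 4.5 GB.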

### 2.1 Configuration and Loading
We initialize the tokenizer and the model with `device_map="auto"` to automatically distribute layers across available GPU/CPU resources.

In [None]:
# -----------------------------------------------------------------------------
# Import libraries
# -----------------------------------------------------------------------------

import pandas as pd
import os
import time
import json
import torch
import re  # Added for regex sanitization
from typing import Dict, List
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# -----------------------------------------------------------------------------
# Mount External Storage
# -----------------------------------------------------------------------------

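try:
    from google.colab import drive
except ImportError:
    drive = None  # Not running in Colab (e.g., Kaggle or local)

mount_path = '/content/drive'

if drive is None:
    print("Google Drive mount skipped (not running in Colab).")
elif not os.path.exists(mount_path):
    print("🔄 Mounting Google Drive...")
    drive.mount(mount_path)
    print("✅ Google Drive mounted successfully.")
else:
    print("✅ Google Drive is already mounted.")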


🔄 Mounting Google Drive...
Mounted at /content/drive
✅ Google Drive mounted successfully.


In [None]:
# -----------------------------------------------------------------------------
# Configuration
# -----------------------------------------------------------------------------
# Hugging Face model ID for the instruction-tuned Gemma 2 9B
MODEL_ID = "google/gemma-2-9b-it"

# Configure 4-bit quantization to fit the 9B model in Colab/Kaggle memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

# -----------------------------------------------------------------------------
# Model & Tokenizer Loading
# -----------------------------------------------------------------------------
try:
    print(f"🔄 Loading Tokenizer: {MODEL_ID}...")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    print(f"🔄 Loading Model: {MODEL_ID}...")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=bnb_config,
        device_map="auto",
        torch_dtype=torch.float16
    )
    print("✅ Model loaded successfully.")

except Exception as e:
    print(f"❌ Error loading model: {e}")
    raise


🔄 Loading Tokenizer: google/gemma-2-9b-it...

🔄 Loading Model: google/gemma-2-9b-it...

`torch_dtype` is deprecated! Use `dtype` instead!

✅ Model loaded successfully.


### 2.2 Hardware Verification
Before proceeding with inference, we verify the available PyTorch version and GPU resources. Confirming CUDA availability is critical for the efficient execution of the 4-bit quantized Gemma model.

In [None]:
# -----------------------------------------------------------------------------
# Hardware Resource Check
# -----------------------------------------------------------------------------
print(f"🔹 PyTorch version: {torch.__version__}")
print(f"🔹 CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"✅ GPU Detected: {gpu_name}")
    print(f"✅ GPU Memory: {gpu_mem:.2f} GB")
else:
    print("⚠️  No GPU detected. Inference will be extremely slow on CPU.")

## 3. Prompt Engineering: Epistemological Claim Extraction

To extract high-quality structured data, we design a comprehensive extraction prompt for **Gemma 2 9B-IT**, sent to the model as a single user message. The prompt enforces a specific **JSON-like Markdown structure** and includes:

1.  **System Role:** Establishing the persona of a Materials Physicist to bias the model towards technical accuracy.
2.  **Pronoun Resolution:** A critical instruction to replace ambiguous terms (e.g., "it", "the sample") with specific chemical formulas (e.g., $LaO_{0.5}F_{0.5}BiS_2$).
3.  **Epistemic Classification:** Categorizing claims into *Observations* (raw data), *Inferences* (conclusions), or *Speculations* (hypotheses).
4.  **Few-Shot Examples:** Providing concrete examples of "Input -> Output" pairs to guide the model's reasoning.
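
The pronoun-resolution rule lends itself to a simple automated check on extracted claims. The sketch below is a heuristic and not part of the original pipeline: the function name `find_unresolved_references` and the phrase list (taken from the examples named in the prompt guidelines) are assumptions, and a hit only means that a claim deserves a manual look.

```python
import re

# Phrases the prompt forbids inside {CLAIM} text (assumed list, extend as needed).
AMBIGUOUS_PHRASES = ["it", "they", "the sample", "this system"]

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in AMBIGUOUS_PHRASES) + r")\b",
    flags=re.IGNORECASE,
)


def find_unresolved_references(claim: str) -> list[str]:
    """Return the ambiguous phrases still present in a claim, lowercased."""
    return [match.lower() for match in _PATTERN.findall(claim)]


flagged = find_unresolved_references("It showed higher Tc than the sample annealed at 700 C.")
clean = find_unresolved_references("Uniaxial strain enhances superconductivity in LaO0.5F0.5BiS2.")
print(flagged)
print(clean)
```

Word boundaries keep the check from firing on substrings such as the "it" inside "superconductivity".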

In [None]:
# Cell 5: Define the Extraction Prompt Template

EXTRACTION_PROMPT = """# SYSTEM ROLE
You are an expert Materials Physicist and Data Scientist specializing in BiS2-based layered superconductors. You have a deep understanding of scientific epistemology (distinguishing between raw data, logical conclusions, and theoretical models).

# TASK
Your goal is to extract **Epistemological Claims** from the provided scientific text. You must structure these claims logically, resolve any ambiguous pronouns, and categorize the nature of the finding.

**Input:** A segment of scientific text containing Source IDs (e.g., `ARTICLE: 1306.3346v2`).
**Output:** A structured Markdown list of Claims and their Attributes.

---

# GUIDELINES FOR EXTRACTION

### 1. TEXT STANDARDIZATION (CRITICAL)
* **Resolve Pronouns:** Never use pronouns like "it", "they", "the sample", or "this system" in the extracted `{CLAIM}` text. **You must replace them with the specific material name** (e.g., *LaO0.5F0.5BiS2*, *The BiS2 layer*) or the specific condition found in the text.
* **Self-Contained:** Each claim must make sense if read in isolation, without the surrounding paragraph.

### 2. META-ATTRIBUTES (Classify the nature of the claim)
These attributes help categorize the reliability and type of knowledge.

* **{SOURCE ID}** *(Mandatory)*: The exact tag from the text (e.g., `1306.3346v2`).
* **{STUDY TYPE}** *(Mandatory)*:
    * `Experimental`: Involves synthesis, physical measurements (XRD, Resistivity, SQUID), or fabrication.
    * `Theoretical`: Involves DFT, band structure calculations, modeling, or simulations.
* **{EPISTEMIC TYPE}** *(Mandatory)*:
    * `Observation`: Direct reporting of data, values, or behaviors (e.g., "Tc is 4.5K", "Lattice constant is 13.2Å").
    * `Inference`: A conclusion or causal link drawn from data (e.g., "Strain causes the Tc enhancement").
    * `Speculation`: Future predictions, possibilities, or unverified hypotheses (e.g., "We anticipate finding new materials").
* **{POLARITY}** *(Mandatory)*:
    * `Positive`: Enhancement, increase, promotion, or positive correlation.
    * `Negative`: Degradation, suppression, decrease, or negative correlation.
    * `Neutral`: No change, independence, or a simple existence description.

### 3. PHYSICAL ATTRIBUTES (The Physics content)
Extract these only if present. If an attribute is not explicitly stated, **OMIT** it.

* **{SUBJECT}**: The specific material system (e.g., *LaO0.5F0.5BiS2*). *Prefer specific formulas over general family names.*
* **{DRIVER}**: The variable changed or the intervention applied (e.g., *Se substitution*, *High Pressure Annealing*).
* **{EFFECT}**: The observed outcome (e.g., *Tc enhancement*, *Lattice contraction*).
* **{TREND}**: The shape/direction of the relationship (e.g., *Monotonically increases*, *Dome-shaped*).
* **{MECHANISM}**: The physical explanation for *why* the effect occurred (e.g., *Due to in-plane chemical pressure*, *Fermi surface nesting*).
* **{SCOPE}**: Constraints (e.g., *Universal mechanism*, *Polycrystalline samples only*).

---

# FEW-SHOT EXAMPLES

#### Example 1: Experimental Inference
**Input:**
"ARTICLE: 1306.3346v2; We synthesized polycrystalline LaO0.5F0.5BiS2. It was found that annealing under high pressure generated uniaxial strain. Consequently, the highly-strained sample showed higher superconducting properties. We conclude that uniaxial strain enhances superconductivity."

**Output:**
- {CLAIM}: High pressure annealing generates uniaxial strain in polycrystalline LaO0.5F0.5BiS2.
  - {META DATA}:
    - Source ID: 1306.3346v2
    - Study Type: Experimental
    - Epistemic Type: Observation
    - Polarity: Neutral
  - {PHYSICAL ATTRIBUTES}:
    - Subject: LaO0.5F0.5BiS2
    - Driver: High Pressure Annealing
    - Effect: Uniaxial strain

- {CLAIM}: Uniaxial strain enhances superconducting properties in LaO0.5F0.5BiS2.
  - {META DATA}:
    - Source ID: 1306.3346v2
    - Study Type: Experimental
    - Epistemic Type: Inference
    - Polarity: Positive
  - {PHYSICAL ATTRIBUTES}:
    - Subject: LaO0.5F0.5BiS2
    - Driver: Uniaxial strain
    - Effect: Enhancement of superconductivity

#### Example 2: Theoretical Observation
**Input:**
"ARTICLE: 1454.454v1; We performed DFT calculations on the CeOBiS2 system. The results indicate that the Sulfur p-orbital hybridizes strongly with the Bismuth p-orbital, forming the conduction band."

**Output:**
- {CLAIM}: DFT calculations indicate that Sulfur p-orbitals hybridize strongly with Bismuth p-orbitals to form the conduction band in CeOBiS2.
  - {META DATA}:
    - Source ID: 1454.454v1
    - Study Type: Theoretical
    - Epistemic Type: Observation
    - Polarity: Neutral
  - {PHYSICAL ATTRIBUTES}:
    - Subject: CeOBiS2
    - Mechanism: Orbital hybridization (S-p and Bi-p)
    - Effect: Formation of conduction band

#### Example 3: Negative/Suppression Effect
**Input:**
"ARTICLE: 1404.6359v2; In the NdO1-xFxBiS2 system, substituting Bi with Sb caused the lattice constant c to decrease. This substitution led to a rapid degradation of Tc."

**Output:**
- {CLAIM}: Substituting Bi with Sb in NdO1-xFxBiS2 causes the lattice constant c to decrease.
  - {META DATA}:
    - Source ID: 1404.6359v2
    - Study Type: Experimental
    - Epistemic Type: Observation
    - Polarity: Negative
  - {PHYSICAL ATTRIBUTES}:
    - Subject: NdO1-xFxBiS2
    - Driver: Sb substitution (for Bi)
    - Effect: Decrease of lattice constant c

- {CLAIM}: Sb substitution leads to a rapid degradation of Tc in NdO1-xFxBiS2.
  - {META DATA}:
    - Source ID: 1404.6359v2
    - Study Type: Experimental
    - Epistemic Type: Observation
    - Polarity: Negative
  - {PHYSICAL ATTRIBUTES}:
    - Subject: NdO1-xFxBiS2
    - Driver: Sb substitution
    - Effect: Degradation of Tc

---

# INSTRUCTIONS
1. Analyze the INPUT text below.
2. Extract **Claims** and structure them using the template above.
3. **Mandatory:** Replace all pronouns in the Claim text with specific entities.
4. **Mandatory:** Include the `Source ID` (taken from the `ARTICLE:` tag) for every claim.
5. Do not extract Entities (e.g., do not list "Entity 1", "Entity 2"). Focus only on Claims and Attributes.

# INPUT TEXT:
\"\"\"
{TEXT}
\"\"\"
"""

print("✅ Extraction prompt template defined")

✅ Extraction prompt template defined
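
The list format requested by the prompt is regular enough to parse line by line, which is one possible starting point for the Stage 2 conversion. The following is a minimal sketch under that assumption: the function name `parse_claims` and the dictionary layout are hypothetical, and lines that do not match the expected shape (such as an introductory sentence from the model) are ignored.

```python
import re

CLAIM_RE = re.compile(r"^-\s*\{CLAIM\}:\s*(.+)$")
SECTION_RE = re.compile(r"^-\s*\{(META DATA|PHYSICAL ATTRIBUTES)\}:\s*$")
FIELD_RE = re.compile(r"^-\s*([^:{}]+):\s*(.+)$")


def parse_claims(markdown: str) -> list[dict]:
    """Parse the {CLAIM} list format into a list of dictionaries."""
    claims: list[dict] = []
    section = None  # "meta" or "physical" once a section header has been seen
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        if match := CLAIM_RE.match(line):
            claims.append({"claim": match.group(1).strip(), "meta": {}, "physical": {}})
            section = None
        elif match := SECTION_RE.match(line):
            section = "meta" if match.group(1) == "META DATA" else "physical"
        elif claims and section and (match := FIELD_RE.match(line)):
            claims[-1][section][match.group(1).strip()] = match.group(2).strip()
    return claims


sample_output = """Here are the extracted claims:

- {CLAIM}: Uniaxial strain enhances superconducting properties in LaO0.5F0.5BiS2.
  - {META DATA}:
    - Source ID: 1306.3346v2
    - Study Type: Experimental
    - Epistemic Type: Inference
    - Polarity: Positive
  - {PHYSICAL ATTRIBUTES}:
    - Subject: LaO0.5F0.5BiS2
    - Driver: Uniaxial strain
"""
parsed = parse_claims(sample_output)
print(parsed[0]["meta"]["Epistemic Type"])
```

Because each claim becomes a plain dictionary, the result can be passed directly to `json.dumps` or checked against the allowed values for each mandatory attribute.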


### 3.1 Inference Configuration

To ensure reproducibility, a critical requirement for scientific data extraction, and to remove sampling randomness as a source of spurious content, we use **Greedy Decoding** (`do_sample=False`). The model selects the most probable token at each step, so identical inputs yield identical outputs on the same hardware and software stack. With sampling disabled, `temperature`, `top_p`, and `top_k` have no effect, and `transformers` may warn that such flags are ignored.

**Note on Repetition Penalty:** We explicitly disable repetition penalties (`repetition_penalty=1.0`). Since our target output format relies on a repetitive schema (e.g., repeatedly using keys like `{CLAIM}` and `{META DATA}`), a penalty above 1.0 would suppress the necessary structural tags.
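
The interaction between greedy decoding and a repetition penalty can be seen on a toy example. The sketch below is an illustration in plain Python, not the library's internals: it assumes the commonly used multiplicative scheme in which the logits of already-generated tokens are divided by the penalty when positive and multiplied by it when negative. The function names and the three-token vocabulary are made up for the example.

```python
def apply_repetition_penalty(logits: list[float], seen_ids: set[int], penalty: float) -> list[float]:
    """Rescale logits of already-generated tokens; penalty=1.0 leaves them unchanged."""
    adjusted = list(logits)
    for token_id in seen_ids:
        score = adjusted[token_id]
        # Positive scores are divided and negative scores multiplied, so both move down.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


def greedy_pick(logits: list[float]) -> int:
    """Greedy decoding: always take the index of the highest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy vocabulary: token 0 stands for a structural tag that was already emitted.
logits = [2.0, 1.5, -1.0]
seen = {0}

without_penalty = greedy_pick(apply_repetition_penalty(logits, seen, penalty=1.0))
with_penalty = greedy_pick(apply_repetition_penalty(logits, seen, penalty=1.5))
print(without_penalty, with_penalty)
```

With the penalty at 1.0 the structural tag (token 0) is still chosen; at 1.5 its logit drops below that of token 1 and the tag is displaced, which is the failure mode the note above warns about.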

In [None]:
# -----------------------------------------------------------------------------
# Generation Parameters (Deterministic)
# -----------------------------------------------------------------------------
generation_params = {
    "max_new_tokens": 2048,       # Allow sufficient length for multi-claim extraction
    "temperature": 0.0,           # Has no effect when do_sample=False (may trigger a warning)
    "top_p": 1.0,                 # Has no effect under greedy decoding; kept for completeness
    "top_k": 50,                  # Has no effect under greedy decoding; kept for completeness
    "repetition_penalty": 1.0,    # KEEP AT 1.0 - structured lists repeat keys naturally
    "do_sample": False            # CRITICAL: greedy search (most likely token at each step)
}

print("✅ Generation parameters configured:")
print(json.dumps(generation_params, indent=2))

✅ Generation parameters configured:
{
  "max_new_tokens": 2048,
  "temperature": 0.0,
  "top_p": 1.0,
  "top_k": 50,
  "repetition_penalty": 1.0,
  "do_sample": false
}


## 4. Test Data Preparation

Before processing the full corpus, we validate the prompt logic using three representative samples of varying complexity:
1.  **Sample 1 ($LaO_{0.5}F_{0.5}BiS_2$):** Focuses on "High Pressure Annealing" and uniaxial strain (Observation + Inference).
2.  **Sample 2 ($LaOBiS_{2-x}Se_x$):** Focuses on thermoelectric properties and element substitution (Quantitative data).
3.  **Sample 3 ($Bi_2(O,F)S_2$):** A complex, multi-paragraph text comparing different measurement techniques (Resistivity vs. Annealing) with conflicting trends.

These samples ensure the model can handle different study types and resolve complex sentence structures.

In [None]:
# -----------------------------------------------------------------------------
# Sample Input Texts for Validation
# -----------------------------------------------------------------------------

# Sample 1: High Pressure Annealing Study (LaO0.5F0.5BiS2)
# Contains clear causal links: High Pressure -> Strain -> Superconductivity
SAMPLE_TEXT_1 = """{ARTICLE: 1306.3346v2; Correlation between crystal structure and superconducting properties of the BiS2-based superconductor LaO0.5F0.5BiS2 was investigated. We have prepared LaO0.5F0.5BiS2 polycrystalline samples with various lattice constants. It was found that the annealing the sample under high pressure generated uniaxial strain along the c axis. Further, the highly-strained sample showed higher superconducting properties. We concluded that the uniaxial strain along the c axis was positively linked with the enhancement of superconductivity in the LaO1-xFxBiS2 system. The correlation between crystal structure and superconducting properties of the BiS2-based superconductor LaO0.5F0.5BiS2 has been investigated. We have synthesized LaO0.5F0.5BiS2 polycrystalline samples with various annealing conditions up to 3 steps. The HP annealing generates uniaxial strain along the c axis. The generated strain is returned to the initial state of the As-grown sample by annealing the sample in an evacuated quartz tube at 700 ºC. The highest superconducting properties, Tc and shielding fraction, are observed in the HP sample, and the superconducting properties is degraded by reducing the uniaxial strain. On the basis of those results, we conclude that the enhancement of the superconducting properties in LaO1-xFxBiS2 by applying post-annealing under high pressure is caused by the generation of the uniaxial strain along the c axis.}"""

# Sample 2: Thermoelectric Properties (LaOBiS2-xSex)
# Contains quantitative data (ZT, Power Factor) and greek letters
SAMPLE_TEXT_2 = """{ARTICLE: 1409.2189v2; We have investigated the thermoelectric properties of the novel layered bismuth chalcogenides LaOBiS2-xSex. The partial substitution of S by Se produced the enhancement of electrical conductivity (metallic characteristics) in LaOBiS2-xSex. The power factor largely increased with increasing Se concentration. The highest power factor was 4.5 μW/cmK2 at around 470 ºC for LaOBiS1.2Se0.8. The obtained dimensionless figure-of-merit (ZT) was 0.17 at around 470 ºC in LaOBiS1.2Se0.8. In conclusion, we have synthesized polycrystalline samples of novel layered bismuth chalcogenides LaOBiS2-xSex and systematically investigated thermoelectric properties. It was found that a partial substitution of S by Se enhanced metallic conductivity. The power factor largely increased with increasing Se concentration. The highest power factor was 4.5 μW/cmK2 at around 470 ºC for LaOBiS1.2Se0.8. We found that the thermal conductivity for LaOBiS2-xSex is independent of both temperature and Se concentration. Using an average value of thermal conductivity, κ = 2 W/m·K, we calculated the dimensionless figure-of-merit (ZT) as a function of temperature. The highest ZT was 0.17 at around 470 ºC in LaOBiS1.2Se0.8. Optimization of the carrier concentration and/or the local structure will further enhance the thermoelectric performance of the layered bismuth chalcogenides.}"""

# Sample 3: Complex Multi-method comparison (Bi2(O,F)S2)
# Contains conflicting trends based on measurement method (HP resistivity vs HP annealing)
SAMPLE_TEXT_3 = """{ARTICLE: 1508.01656v1; Pressure effects on a recently discovered BiS2-based superconductor Bi2(O,F)S2 (Tc = 5.1 K) were examined via two different methods; high pressure resistivity measurement and high pressure annealing. The effects of these two methods on the superconducting properties of Bi2(O,F)S2 were significantly different although in both methods hydrostatic pressure is applied to the sample by the cubic-anvil-type apparatus. In high pressure resistivity measurement, Tc linearly decreased at the rate of -1.2 K GPa-1. In contrast, the Tc of 5.1 K is maintained after high pressure annealing under 2 GPa and 470°C of optimally doped sample despite significant change of lattice parameters. In addition, superconductivity was observed in fluorine-free Bi2OS2 after high pressure annealing. These results suggest that high pressure annealing would cause a unique effect on physical properties of layered compounds. Figure 5(a) shows the Tcs at ambient pressure and at ~2 GPa for various BiS2-based superconductors reported to date as a function of a-axis length. The values of a-axis lengths are measured at room temperature and ambient pressure. Tcs are determined by the onset of diamagnetic transition or zero resistivity. In the doped samples, the Tc and a-axis length values are those of the optimally-doped ones. (Sr,La)FBiS2 has the longest a-axis among these compounds. In RE(O,F)BiS2, a-axis shrinks and Tc increases with increasing atomic number of RE from La to Nd. The a-axis lengths of Bi2(O,F)S2 and Bi4O4S3 / Bi3O2S3, whose blocking layers contain fluorite-type BiO layers, are shorter than that of Nd(O,F)BiS2, although ionic radius (coordination number 6)27) of Bi is between Nd and Pr. Tcs of BiS2-based superconductors under ambient pressure show a dome-like tendency with the top of Tc ~5.5 K at a ~3.98 Å in (Nd0.2Sm0.8)(O0.7F0.3)BiS2. When a-axis in longer than ~4.0 Å, significant increase of Tc is observed in HP resistivity measurement. 
In contrast, compounds with shorter a-axis lengths, Bi2(O,F)S2 and Bi4O4S3, show rapid decrease of Tc in HP resistivity measurement. The relation between lattice parameter and Tc for as-synthesized and HP annealed Bi2(O,F)S2 is summarized in Fig. 6. In the optimally-doped samples with a- and c-axes longer than ~3.97 Å and shorter than ~13.73 Å, the value of Tc is maintained at ~5.1 K. In the underdoped samples, a-axis is shorter than ~3.97 Å and c-axis is longer than ~13.73 Å, and Tc increases as a- and c-axis expands and shrinks by HP anenaling. Tcs for undoped Bi2OS2 sintered under high pressures are also in this trend. It should be emphasized that in HP annealed / synthesized undoped Bi2OS2, superconductivity is achieved without intentional carrier doping. In Bi2OS2, the Bi-S planes are not very flat, the in-plane S-Bi-S angle being 159.8°. The expansion of a-axis may lead to flatter Bi-S plane. In LaOBiS2, F-doping not only increases the carrier concentration but also flattens the buckling of the Bi–S plane and this structural transformation is also related to the appearance of superconductivity29). Similar phenomena would happen in the undoped and underdoped Bi2(O,F)S2 by HP annealing, which resulted in the increases of Tcs in these samples. The decrease of Tc in HP resistivity measurement might be explained by the tendency shown in Fig. 6(a). In HP resistivity measurement, a-axis might shrink by applying high pressures at low temperatures, and superconductivity could be disappeared. Structural analysis on Bi2(O,F)S2 under high pressures at low temperatures would provide fruitful information to clear this point. 5. Conclusion High pressure (HP) resistivity measurement and HP annealing were performed for a BiS2-based superconductor Bi2(O,F)S2, which caused different variation of Tc. In HP resistivity measurement, Tc linearly decreased at the rate of -1.2 K GPa-1. 
In contrast, by HP annealing at 2 GPa and 470°C, Tc increased in undoped and underdoped samples, and maintained at 5.1 K in optimally-doped sample. In HP resistivity measurement high pressure is applied in-situ at low temperatures, while HP annealing quenches the high pressure and high temperature phase to ambient pressure. Although in both cases hydrostatic high pressure is applied to the sample by a cubic-anvil-type apparatus, the difference between the two methods should be considered carefully. HP annealing technique have been mainly developed on BiS2-based superconductors, but this method can cause unique effects on physical properties of various layered compounds.}"""

# Compile all samples
SAMPLE_TEXTS = {
    "Sample_1": SAMPLE_TEXT_1,
    "Sample_2": SAMPLE_TEXT_2,
    "Sample_3": SAMPLE_TEXT_3
}

print(f"✅ Loaded {len(SAMPLE_TEXTS)} sample texts for extraction")
for name in SAMPLE_TEXTS:
    print(f"  - {name}")

✅ Loaded 3 sample texts for extraction
  - Sample_1
  - Sample_2
  - Sample_3


## 5. Inference Logic Definition

We encapsulate the extraction pipeline into a single, unified function `extract_claims`. This function supports two modes of operation controlled by the `streaming` flag:

1.  **Streaming Mode (`True`):** Uses `TextIteratorStreamer` and Python threading to print tokens in real-time. This is essential for monitoring long-context generation and detecting hallucinations early.
2.  **Batch/Silent Mode (`False`):** Uses standard blocking generation. This is preferred when running the full corpus processing loop where console output needs to be minimized.

When `verbose=True`, both modes report performance telemetry (tokens per second) to quantify computational efficiency.
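
The streaming mode relies on a producer/consumer pattern: a blocking generation call runs in a background thread while the main thread consumes text as it arrives. The sketch below imitates that pattern with the standard library only, so that it runs without a model. `fake_generate` and `stream_text` are hypothetical stand-ins, not the `transformers` API.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of generation


def fake_generate(chunks: list[str], out: queue.Queue) -> None:
    """Stand-in for a blocking generate call: pushes text chunks as they are produced."""
    for chunk in chunks:
        out.put(chunk)
    out.put(_DONE)


def stream_text(chunks: list[str]) -> str:
    """Run the producer in a background thread and consume chunks as they arrive."""
    out: queue.Queue = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(chunks, out))
    worker.start()

    pieces = []
    while (item := out.get()) is not _DONE:
        pieces.append(item)
        print(item, end="", flush=True)

    worker.join()  # make sure the producer has finished before returning
    return "".join(pieces)


result = stream_text(["- {CLAIM}: ", "Uniaxial strain ", "enhances Tc."])
```

Joining the worker thread before returning keeps a later call from overlapping with a generation that is still running.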

In [None]:
from threading import Thread

from transformers import TextIteratorStreamer


def extract_claims(text: str, streaming: bool = True, verbose: bool = True) -> str:
    """
    Extracts epistemological claims from scientific text using a chat LLM.

    Args:
        text (str): Scientific text with source ID tags.
        streaming (bool): If True, prints tokens in real-time.
        verbose (bool): If True, prints timing and speed statistics.

    Returns:
        str: Generated structured Markdown output.
    """

    # ------------------------------------------------------------------
    # 1. Prepare Prompt
    # ------------------------------------------------------------------
    formatted_prompt = EXTRACTION_PROMPT.replace("{TEXT}", text)
    messages = [{"role": "user", "content": formatted_prompt}]

    # Chat template: return_dict=True yields a dict-like BatchEncoding
    # (input_ids + attention_mask) rather than a bare tensor
    model_inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt"
    )

    # Move ALL tensors to device
    model_inputs = {k: v.to(model.device) for k, v in model_inputs.items()}
    input_ids = model_inputs["input_ids"]

    # ------------------------------------------------------------------
    # 2. Generation Parameters
    # ------------------------------------------------------------------
    generation_kwargs = dict(**model_inputs, **generation_params)

    # Stability for chat LLMs
    generation_kwargs.setdefault("eos_token_id", tokenizer.eos_token_id)
    generation_kwargs.setdefault("pad_token_id", tokenizer.eos_token_id)

    if verbose:
        print(f"🔹 Input tokens: {input_ids.shape[-1]}")
        mode_str = "Streaming" if streaming else "Batch"
        print(f"⏳ Generating claims ({mode_str} Mode)...")
        if streaming:
            print("-" * 40)
        start_time = time.time()

    # ------------------------------------------------------------------
    # 3. Generation
    # ------------------------------------------------------------------
    response = ""

    if streaming:
        streamer = TextIteratorStreamer(
            tokenizer,
            skip_prompt=True,
            skip_special_tokens=True
        )
        generation_kwargs["streamer"] = streamer

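        # generate() blocks, so run it in a background thread and consume the streamer here
        thread = Thread(target=model.generate, kwargs=generation_kwargs)
        thread.start()

        for new_text in streamer:
            response += new_text
            print(new_text, end="", flush=True)

        thread.join()  # Ensure the generation thread has finished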

    else:
        with torch.no_grad():
            outputs = model.generate(**generation_kwargs)

        prompt_len = input_ids.shape[-1]
        response_tokens = outputs[0][prompt_len:]
        response = tokenizer.decode(response_tokens, skip_special_tokens=True)

    # ------------------------------------------------------------------
    # 4. Performance Stats
    # ------------------------------------------------------------------
    if verbose:
        elapsed = time.time() - start_time
        # Exclude special tokens (e.g., BOS) so the count reflects generated text only
        output_tokens = len(tokenizer.encode(response, add_special_tokens=False))
        tps = output_tokens / elapsed if elapsed > 0 else 0

        if streaming:
            print("\n" + "-" * 40)
        print(f"✅ Generation complete in {elapsed:.2f}s")
        print(f"📊 Speed: {output_tokens} tokens ({tps:.1f} tokens/sec)")

    return response


## 6. Validation on Sample Data

We now execute the unified extraction pipeline on our three test samples. This step serves two purposes:
1.  **Qualitative Verification:** Ensuring the model correctly resolves pronouns (e.g., replacing "It" with "LaO0.5F0.5BiS2") and captures the correct physical attributes.
2.  **Performance Benchmarking:** Establishing a baseline for tokens/second to estimate the time required for the full corpus.

The output below demonstrates the model's ability to structure complex causal chains into discrete JSON-like claims.
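
The tokens-per-second baseline from the benchmarking step can be turned into a rough time estimate for the full corpus. The sketch below is simple arithmetic under stated assumptions: the function name is illustrative, the numbers are placeholders rather than measurements, and prompt processing time is ignored.

```python
def estimate_corpus_hours(n_segments: int, avg_output_tokens: float, tokens_per_second: float) -> float:
    """Estimate total generation time in hours from a measured decoding speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    total_seconds = n_segments * avg_output_tokens / tokens_per_second
    return total_seconds / 3600


# Placeholder numbers for illustration only; substitute the measured values.
hours = estimate_corpus_hours(n_segments=500, avg_output_tokens=1200, tokens_per_second=10.0)
print(f"Estimated generation time: {hours:.1f} h")
```

Since the estimate scales linearly with output length, capping `max_new_tokens` or splitting long segments changes the total in a predictable way.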

In [None]:
# -----------------------------------------------------------------------------
# Run Validation on All Samples
# -----------------------------------------------------------------------------

# Store results for inspection if needed later
validation_results = {}

print(f"🚀 Starting validation on {len(SAMPLE_TEXTS)} samples...\n")

for name, text in SAMPLE_TEXTS.items():
    print("=" * 80)
    print(f"📄 PROCESSING: {name}")
    print("=" * 80)

    # We use streaming=True to visually inspect the generation quality in real-time
    extracted_output = extract_claims(text, streaming=True, verbose=True)

    # Store result
    validation_results[name] = extracted_output

    print("\n" + " " * 80 + "\n") # Spacing between samples

print("✅ Validation complete.")

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚀 Starting validation on 3 samples...

📄 PROCESSING: Sample_1
🔹 Input tokens: 2042
⏳ Generating claims (Streaming Mode)...
----------------------------------------
Here are the extracted Epistemological Claims from the provided text:

- {CLAIM}: High pressure annealing generates uniaxial strain along the c axis in LaO0.5F0.5BiS2.
  - {META DATA}:
    - Source ID: 1306.3346v2
    - Study Type: Experimental
    - Epistemic Type: Observation
    - Polarity: Neutral
  - {PHYSICAL ATTRIBUTES}:
    - Subject: LaO0.5F0.5BiS2
    - Driver: High Pressure Annealing
    - Effect: Uniaxial strain along the c axis

- {CLAIM}: Annealing LaO0.5F0.5BiS2 in an evacuated quartz tube at 700 ºC returns the uniaxial strain to the initial state of the As-grown sample.
  - {META DATA}:
    - Source ID: 1306.3346v2
    - Study Type: Experimental
    - Epistemic Type: Observation
    - Polarity: Neutral
  - {PHYSICAL ATTRIBUTES}:
    - Subject: LaO0.5F0.5BiS2
    - Driver: Annealing at 700 ºC in a

## 7. Performance Analysis & Optimization Strategy

Following the validation on our reference samples, we evaluate both the **qualitative quality** of the extraction and the **computational efficiency** of the pipeline.

### 7.1 Qualitative Assessment: Gemma 2 9B-IT
**Overall Grade:** A (Excellent)
**Status:** Production-Ready

The performance of **Gemma 2 9B-IT** confirms that our "prompt surgery" strategy (stripping broad entity extraction to focus on logic and metadata) was highly effective. The model handles scientific text with high precision, successfully navigating complex distinctions that typically confuse smaller models.

#### Key Strengths
* **Text Standardization (Pronoun Resolution):** The model consistently avoided the "context trap," successfully replacing ambiguous pronouns ("It," "The sample") with specific chemical identifiers (e.g., *LaO0.5F0.5BiS2*). This ensures every extracted claim is self-contained.
* **Epistemic & Logic Precision:**
    * **Classification:** Accurately distinguished between *Observations* (direct measurements), *Inferences* (conclusions), and *Mechanisms* (causal drivers).
    * **Nuance:** It successfully parsed "null results" (e.g., thermal conductivity being independent of temperature) and correctly assigned *Polarity* based on the physical desirability of the outcome.
* **Methodological Disambiguation:** It displayed sophisticated reasoning by distinguishing between similar conditions with opposite effects, such as "High Pressure Resistivity Measurement" vs. "High Pressure Annealing."

#### Minor Limitations
* **Recall on Speculative Content:** While precision was excellent, the model occasionally omitted *Speculations* (future predictions) or highly specific *Mechanistic details* (e.g., subtle structural atomic angles). *Note: This conservative behavior is generally safer for automated knowledge graph construction, as it avoids hallucinated edges.*

---

### 7.2 Computational Efficiency & Optimization
**Current Benchmark (Google Colab T4):** ~5.5 - 6.4 tokens/sec

The current inference speed reflects the limitations of 4-bit quantization on a single NVIDIA T4 GPU. While sufficient for this dissertation, the pipeline can be optimized for larger corpora, particularly if migrating to **Kaggle (2x T4 GPUs)**:

1.  **Data Parallelism (Throughput Doubling):**
    * Instead of splitting the model across GPUs (Tensor Parallelism), which is unnecessary for a 4-bit 9B model that fits on one T4 (approx. 6GB VRAM for the weights), we can load **two independent instances** of the model, one on each GPU.
    * This allows us to process two abstracts simultaneously, effectively doubling the processing speed to **~12+ tokens/sec (aggregate)**.
2.  **Batch Inference:**
    * Currently, we process texts sequentially (Batch Size = 1). Grouping short abstracts into batches of 4 or 8 could significantly improve GPU utilization.
3.  **Flash Attention 2:**
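    * Integrating Flash Attention 2 (supported on Ampere-or-newer GPUs such as the A100 or L4, but not on the Turing-based T4) would not change the quadratic compute cost of attention, but it avoids materializing the full attention matrix, which reduces memory traffic and speeds up generation on longer context windows.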
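
The first two ideas can be sketched without touching the model: split the records into one shard per GPU, then group each shard into fixed-size batches. The helper names below are hypothetical, the paper IDs are placeholders, and the per-device model loading and the batched `generate` call are deliberately left out.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def shard_round_robin(items: Sequence[T], n_shards: int) -> List[List[T]]:
    """Assign item i to shard i % n_shards (one shard per GPU worker)."""
    if n_shards < 1:
        raise ValueError("n_shards must be >= 1")
    return [list(items[k::n_shards]) for k in range(n_shards)]


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


paper_ids = [f"paper_{i}" for i in range(10)]   # placeholder IDs
shards = shard_round_robin(paper_ids, n_shards=2)
batches_gpu0 = list(iter_batches(shards[0], batch_size=4))
```

For batched generation with a decoder-only model, the tokenizer typically also needs left padding (`padding_side="left"`); that step is not shown here.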

## 8. Data Ingestion & Structure Analysis

We load the full normalized corpus generated in *Notebook 4*. The following script includes automatic environment detection (Kaggle vs. Colab) and robust JSON parsing to handle potential nesting variations (e.g., whether the root is a list or a dictionary wrapper).

Finally, we verify the integrity of the **Primary Keys** (Article IDs) to ensure we can accurately map extracted claims back to their source.

In [None]:
# -----------------------------------------------------------------------------
# 1. Automatic Path Detection
# -----------------------------------------------------------------------------
# Define potential paths for different environments
POTENTIAL_PATHS = [
    "/kaggle/input/bis2-corpus-v1-normalized-20260119-130825/bis2_corpus_v1_normalized_20260119_130825.json",
    "/content/drive/MyDrive/TFM/data/corpora/03_normalized/bis2_corpus_v1_normalized_20260119_130825.json"
]

# Select the first path that exists
corpus_path = next((path for path in POTENTIAL_PATHS if os.path.exists(path)), None)

if corpus_path:
    print(f"✅ Corpus found at: {corpus_path}")
    print(f"   Filename: {os.path.basename(corpus_path)}")
else:
    raise FileNotFoundError("❌ Error: Corpus file not found in known paths. Check mount or input directory.")

# -----------------------------------------------------------------------------
# 2. Robust Data Loading
# -----------------------------------------------------------------------------
try:
    # Explicit encoding keeps the read consistent across Kaggle and Colab
    with open(corpus_path, 'r', encoding='utf-8') as f:
        raw_data = json.load(f)
    print("✅ JSON loaded successfully.")
except json.JSONDecodeError as e:
    raise ValueError(f"❌ Invalid JSON format: {e}") from e

# Heuristic to flatten nested JSON structures
if isinstance(raw_data, dict):
    # Check for common wrapper keys
    if "papers" in raw_data:
        records = raw_data["papers"]
        print("   -> Structure detected: Dictionary with 'papers' key.")
    else:
        # Fallback: Find the key containing the largest list
        longest_key = max(raw_data.keys(), key=lambda k: len(raw_data[k]) if isinstance(raw_data[k], list) else 0)
        records = raw_data[longest_key]
        print(f"   -> Structure detected: Dictionary with key '{longest_key}'.")
elif isinstance(raw_data, list):
    records = raw_data
    print("   -> Structure detected: Direct List of records.")
else:
    raise ValueError("❌ Error: JSON structure is neither a list nor a recognized dict wrapper.")

# -----------------------------------------------------------------------------
# 3. DataFrame Creation & Key Verification
# -----------------------------------------------------------------------------
df = pd.DataFrame(records)
print(f"✅ DataFrame created with {len(df)} records.")

print("\n" + "="*50)
print("KEY INTEGRITY CHECK")
print("="*50)

possible_primary = []
possible_secondary = []

for col in df.columns:
    try:
        # Skip unhashable types (lists/dicts) for uniqueness checks
        # We sample the first value to check type
        if isinstance(df[col].iloc[0], (list, dict)):
            continue

        # Check for Primary Key (Unique & Non-Null)
        if df[col].is_unique and df[col].notna().all():
            possible_primary.append(col)

        # Check for Secondary Keys (Categorical: <50% unique but >1)
        n_unique = df[col].nunique()
        if 1 < n_unique < (len(df) * 0.5):
            possible_secondary.append(col)
    except Exception:
        continue

print(f"🔑 Candidate PRIMARY Keys (Unique IDs): {possible_primary}")
print(f"🏷️  Candidate SECONDARY Keys (Categories): {possible_secondary[:5]}")

print("\n" + "="*50)
print("FIRST RECORD PREVIEW")
print("="*50)
# Transpose (.T) is easier to read for wide datasets
display(df.head(1).T)

✅ Corpus found at: /content/drive/MyDrive/TFM/data/corpora/03_normalized/bis2_corpus_v1_normalized_20260119_130825.json
   Filename: bis2_corpus_v1_normalized_20260119_130825.json
✅ JSON loaded successfully.
   -> Structure detected: Dictionary with 'papers' key.
✅ DataFrame created with 122 records.

KEY INTEGRITY CHECK
🔑 Candidate PRIMARY Keys (Unique IDs): ['arxiv_id', 'entry_id', 'title', 'abstract', 'published', 'updated', 'pdf_url', 'extraction']
🏷️  Candidate SECONDARY Keys (Categories): ['year', 'primary_category']

FIRST RECORD PREVIEW


| Field | 0 |
| --- | --- |
| arxiv_id | 2406.01263v2 |
| entry_id | http://arxiv.org/abs/2406.01263v2 |
| doi | 10.7566/JPSJ.93.074707 |
| title | Pb Substitution Effects on Lattice and Electro... |
| abstract | We examined the effect of Pb substitution in t... |
| authors | [Miku Sasaki, Kotaro Inada, Fumito Mori, Takaa... |
| authors_str | Miku Sasaki, Kotaro Inada, Fumito Mori, Takaak... |
| published | 2024-06-03T12:24:51+00:00 |
| updated | 2024-06-05T06:50:28+00:00 |
| year | 2024 |
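
Because the generic scan flags every unique column (including `title` and `abstract`) as a candidate, a targeted check on the intended key can be more informative. The following is a minimal sketch on plain record dictionaries, assuming `arxiv_id` is the key and that a trailing `vN` marks the version. The function name and the sample records are made up for illustration.

```python
import re
from collections import Counter
from typing import Dict, List


def check_primary_key(records: List[dict], key: str = "arxiv_id") -> Dict[str, list]:
    """Report missing values, exact duplicates, and IDs that differ only by version."""
    values = [r.get(key) for r in records]
    missing = [i for i, v in enumerate(values) if v in (None, "")]
    present = [str(v) for v in values if v not in (None, "")]

    # Exact duplicates would make claim-to-paper mapping ambiguous
    duplicates = sorted(v for v, n in Counter(present).items() if n > 1)

    # Strip a trailing version suffix such as "v2" to compare base identifiers
    base_ids = Counter(re.sub(r"v\d+$", "", v) for v in set(present))
    multi_version = sorted(b for b, n in base_ids.items() if n > 1)

    return {"missing": missing, "duplicates": duplicates, "multi_version": multi_version}


sample = [
    {"arxiv_id": "0000.00001v1"},
    {"arxiv_id": "0000.00001v2"},   # same paper, different version
    {"arxiv_id": "0000.00002v1"},
    {"arxiv_id": "0000.00002v1"},   # exact duplicate
    {"arxiv_id": None},             # missing key
]
report = check_primary_key(sample)
```

On the real corpus this could be called as `check_primary_key(records)` before the DataFrame is built; three empty lists would indicate a clean key.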


## 9. Full Corpus Extraction Pipeline

We execute the extraction pipeline on the sorted dataframe. To handle the long runtime safely, we implement:
1.  **JSONL Streaming:** Writing each record immediately to disk (`.jsonl` format) to prevent data loss in case of a crash.
2.  **Resume Capability:** Automatically detecting existing output files and resuming from the last processed row.
3.  **Progress Monitoring:** Using `tqdm` for a visual progress bar and periodic logging.

**Note:** We sort by `arxiv_id` to ensure deterministic processing order.

In [None]:
import os
import json
from tqdm.auto import tqdm

# -----------------------------------------------------------------------------
# 1. Configuration & Path Setup
# -----------------------------------------------------------------------------

# Define the directory where results will be saved (Google Drive recommended for persistence)
OUTPUT_DIR = "/content/drive/MyDrive/TFM/data/output/"
OUTPUT_FILENAME = "Gemma_2_9B_claims_output_v1.jsonl"

# Create directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Full path
output_file_path = os.path.join(OUTPUT_DIR, OUTPUT_FILENAME)
print(f"📂 Output will be saved to: {output_file_path}")

# -----------------------------------------------------------------------------
# 2. Data Preparation
# -----------------------------------------------------------------------------

# Sort dataframe to ensure deterministic order (crucial for resuming correctly)
df_sorted = df.sort_values(by='arxiv_id', ascending=True)

# Check for existing progress to resume
lines_processed = 0
if os.path.exists(output_file_path):
    with open(output_file_path, "r", encoding="utf-8") as f:
        lines_processed = sum(1 for _ in f)
    print(f"🔄 Found existing output. Resuming from record #{lines_processed}...")
else:
    print("✨ No existing output found. Starting from scratch.")

# -----------------------------------------------------------------------------
# 3. Execution Loop
# -----------------------------------------------------------------------------

print(f"🚀 Starting extraction on {len(df_sorted)} papers...")

# Open file in append mode ('a')
with open(output_file_path, "a", encoding="utf-8") as f_out:

    # Slice off the rows that were already written. Iterating over them and
    # skipping with 'continue' would make tqdm (which already starts at
    # 'initial') count them a second time and overshoot the total.
    remaining_rows = df_sorted.iloc[lines_processed:]

    # enumerate(start=...) keeps 'i' equal to the row's position in df_sorted
    for i, (idx, row) in tqdm(enumerate(remaining_rows.iterrows(), start=lines_processed),
                              total=len(df_sorted),
                              initial=lines_processed):

        try:
            # A. Construct Input Text
            # We combine Abstract and Extraction (Conclusion) columns
            abstract = str(row.get("abstract", "")).strip()
            extraction_col = str(row.get("extraction", "")).strip() # Previous regex extraction
            arxiv_id = str(row.get("arxiv_id", "Unknown"))

            # Format: {ID} {Content}
            input_text = f"ARTICLE: {arxiv_id};\n{abstract}\n{extraction_col}"

            # B. Run Inference
            # streaming=False for batch processing (faster, no console spam)
            # verbose=False to keep logs clean
            model_output = extract_claims(input_text, streaming=False, verbose=False)

            # C. Construct Result Record
            result_record = {
                "arxiv_id": arxiv_id,
                "processed_index": i,
                "input_length_chars": len(input_text),
                "model_version": "gemma-2-9b-it",
                "raw_output": model_output  # The markdown string containing claims
            }

            # D. Write to JSONL immediately
            f_out.write(json.dumps(result_record) + "\n")
            f_out.flush() # Ensure data hits the disk

        except Exception as e:
            print(f"❌ Error processing ID {row.get('arxiv_id', 'Unknown')}: {e}")

            # Log error record to maintain sequence alignment
            error_record = {
                "arxiv_id": row.get("arxiv_id"),
                "error": str(e),
                "processed_index": i
            }
            f_out.write(json.dumps(error_record) + "\n")
            f_out.flush()

print(f"\n✅ Batch processing complete. Results saved to: {output_file_path}")

Found existing output. Resuming from index 1...
Starting extraction on 122 papers...


  1%|          | 1/122 [00:00<?, ?it/s]

❌ Error processing ID 1602.05320v1: CUDA out of memory. Tried to allocate 4.89 GiB. GPU 0 has a total capacity of 14.74 GiB of which 3.33 GiB is free. Process 5103 has 11.41 GiB memory in use. Of the allocated memory 10.36 GiB is allocated by PyTorch, and 914.68 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

✅ Batch processing complete. Results saved to Gemma_2_9B-it_claims_output.jsonl
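
Counting lines ties the resume point to the sort order and treats failed records (such as the out-of-memory error above) as finished. An alternative sketch, which is not the approach used in this notebook, collects the IDs that already have a `raw_output` and skips only those, so failed papers are retried on the next run. The helper name and the demo records are hypothetical.

```python
import json
import os
import tempfile
from typing import Set


def load_completed_ids(jsonl_path: str) -> Set[str]:
    """Return arxiv_ids that already have a successful record in the JSONL file."""
    completed: Set[str] = set()
    if not os.path.exists(jsonl_path):
        return completed
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a partially written last line after a crash
            if "raw_output" in record and "error" not in record:
                completed.add(str(record.get("arxiv_id")))
    return completed


# Demo with made-up records: one success and one logged failure
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"arxiv_id": "0000.00001v1", "raw_output": "- {CLAIM}: ..."}) + "\n")
        f.write(json.dumps({"arxiv_id": "0000.00002v1", "error": "CUDA out of memory"}) + "\n")
    done = load_completed_ids(demo_path)
```

Inside the extraction loop, a check such as `if arxiv_id in done: continue` would then replace the index comparison. The trade-off is that retried papers leave both an error line and a success line in the file, so downstream code would need to prefer the successful record.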


## 10. JSONL to JSON Schema Transformation

This section processes the JSONL file containing raw scientific claims and transforms them into a structured JSON dictionary.

**Logic:**

1. **Iterate:** The code reads the JSONL input line by line.
2. **Split:** It isolates individual claims within the `raw_output` field using the `- {CLAIM}:` delimiter.
3. **Parse:** For each claim block, it extracts the claim text and iteratively parses nested sections (like `{META DATA}` and `{PHYSICAL ATTRIBUTES}`) by identifying key-value pairs (lines starting with `- Key: Value`).
4. **Index:** A global counter assigns a unique `claim_id` (e.g., `claim_0001`) to each entry.
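
The Split step can be seen on a small made-up string. Note that `re.split` also returns whatever text precedes the first delimiter, so a conversational preamble from the model shows up as the first chunk and has to be handled explicitly. The sample text below is illustrative and not taken from the corpus.

```python
import re

# Same delimiter pattern as the parser: "- {CLAIM}:" at the start of a line
CLAIM_DELIMITER = re.compile(r'(?:^|\n)-\s*\{CLAIM\}\s*:\s*')

raw_output = (
    "Here are the extracted claims:\n"
    "\n"
    "- {CLAIM}: A causes B.\n"
    "  - {META DATA}:\n"
    "    - Polarity: Neutral\n"
    "- {CLAIM}: C is observed.\n"
)

chunks = CLAIM_DELIMITER.split(raw_output)

# Element 0 is the text before the first delimiter, not a claim
preamble, claim_chunks = chunks[0], chunks[1:]

for n, chunk in enumerate(claim_chunks, start=1):
    first_line = chunk.strip().split("\n")[0]
    print(f"claim_{n:04d}: {first_line}")
```

Indented section headers such as `  - {META DATA}:` are not matched by the delimiter, because the pattern requires the dash to follow the newline directly.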

**Target JSON Skeleton:**
The output will follow this structure:



```json
{
  "claim_0001": {
    "claim_id": "claim_0001",
    "arxiv_id": "2406.01263v2",
    "claim_text": "Measurements of resistivity... show anomalies...",
    "metadata": {
      "Source ID": "2406.01263v2",
      "Study Type": "Experimental",
      "Epistemic Type": "Observation",
      "Polarity": "Neutral"
    },
    "physical_attributes": {
      "Subject": "LaO0.5F0.5Bi1-xPbxS2 (x≥0.08)",
      "Driver": "Pb substitution",
      "Effect": "Anomalies at T*"
    }
  },
  "claim_0002": { ... }
}

```
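
Once claims are in this shape, a light consistency check can catch truncated or malformed generations before they reach the graph stage. The following is a minimal sketch, assuming the four metadata keys shown in the skeleton are required and that `Source ID` should equal `arxiv_id`. The function name and the sample entries are hypothetical.

```python
from typing import Dict, List

REQUIRED_METADATA = ("Source ID", "Study Type", "Epistemic Type", "Polarity")


def find_schema_issues(claims: Dict[str, dict]) -> Dict[str, List[str]]:
    """Map claim_id -> list of problems; claims without problems are omitted."""
    issues: Dict[str, List[str]] = {}
    for claim_id, claim in claims.items():
        problems: List[str] = []
        if not claim.get("claim_text"):
            problems.append("empty claim_text")
        metadata = claim.get("metadata", {})
        for key in REQUIRED_METADATA:
            if key not in metadata:
                problems.append(f"missing metadata: {key}")
        source_id = metadata.get("Source ID")
        if source_id is not None and source_id != claim.get("arxiv_id"):
            problems.append("Source ID does not match arxiv_id")
        if not claim.get("physical_attributes"):
            problems.append("no physical attributes")
        if problems:
            issues[claim_id] = problems
    return issues


example = {
    "claim_0001": {
        "claim_id": "claim_0001", "arxiv_id": "0000.00001v1", "claim_text": "A causes B.",
        "metadata": {"Source ID": "0000.00001v1", "Study Type": "Experimental",
                     "Epistemic Type": "Observation", "Polarity": "Neutral"},
        "physical_attributes": {"Subject": "A", "Effect": "B"},
    },
    "claim_0002": {
        "claim_id": "claim_0002", "arxiv_id": "0000.00001v1", "claim_text": "C is observed.",
        "metadata": {"Source ID": "0000.00009v1", "Study Type": "Experimental"},
        "physical_attributes": {},
    },
}
issues = find_schema_issues(example)
```

Whether a missing attribute should reject a claim or only flag it for review is a design choice that depends on how strict the Stage 2 graph conversion needs to be.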


In [None]:
import json
import re
from typing import List, Dict, Any

# -----------------------------------------------------------------------------
# Parsing Logic (State Machine)
# -----------------------------------------------------------------------------

def parse_claims_to_schema(records: List[Dict]) -> Dict[str, Any]:
    """
    Parses raw LLM markdown output into structured JSON objects.

    Args:
        records (list): List of dicts containing 'raw_output' and 'arxiv_id'.

    Returns:
        dict: A dictionary of parsed claims keyed by a unique ID.
    """
    structured_output = {}
    claim_counter = 1

    for record in records:
        arxiv_id = record.get("arxiv_id", "unknown")
        raw_output = record.get("raw_output", "")

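        # 1. Split text into chunks starting with "- {CLAIM}:"
        # The non-capturing group anchors the delimiter to the start of a line.
        # Element 0 is whatever precedes the first delimiter (e.g., a model
        # preamble such as "Here are the extracted ... claims"), so it is dropped
        # to avoid storing the preamble as a claim.
        claim_chunks = re.split(r'(?:^|\n)-\s*\{CLAIM\}\s*:\s*', raw_output)[1:]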

        for chunk in claim_chunks:
            if not chunk.strip():
                continue

            # Initialize container for this specific claim
            claim_id = f"claim_{claim_counter:04d}"
            claim_data = {
                "claim_id": claim_id,
                "arxiv_id": arxiv_id,
                "claim_text": "",
                "metadata": {},
                "physical_attributes": {}
            }

            # 2. Process line by line (State Machine)
            lines = chunk.split('\n')
            current_section = None

            for subline in lines:
                stripped = subline.strip()
                if not stripped:
                    continue

                # A. Detect Section Headers (e.g., "- {META DATA}:")
                # Matches: "- {SECTION NAME}:" with variable spacing
                section_match = re.match(r'-\s*\{(.*?)\}\s*:', stripped)
                if section_match:
                    section_name = section_match.group(1).upper()
                    if "META" in section_name:
                        current_section = "metadata"
                    elif "PHYSICAL" in section_name:
                        current_section = "physical_attributes"
                    continue

                # B. Detect Key-Value Pairs (e.g., "- Source ID: 1234.5678")
                # Matches: "- Key: Value"
                kv_match = re.match(r'-\s*([^:]+?):\s*(.*)', stripped)

                if kv_match and current_section:
                    key = kv_match.group(1).strip()
                    val = kv_match.group(2).strip()
                    claim_data[current_section][key] = val

                # C. Capture Claim Text (If no section is active)
                elif current_section is None:
                    # Append text that appears before the first section header
                    claim_data["claim_text"] += stripped + " "

            # Cleanup
            claim_data["claim_text"] = claim_data["claim_text"].strip()

            # Only add if valid claim text exists
            if claim_data["claim_text"]:
                structured_output[claim_id] = claim_data
                claim_counter += 1

    return structured_output

print("✅ Parsing function defined.")

✅ Parsing function defined.


### 10.1 Execution and Saving
Now we run the parser on the loaded data and save the final clean JSON.

In [None]:
# -----------------------------------------------------------------------------
# Execute Parsing
# -----------------------------------------------------------------------------

# Path to the JSONL output from the previous step.
# NOTE: this filename differs from the OUTPUT_FILENAME currently set in Section 9
# ("Gemma_2_9B_claims_output_v1.jsonl"); point it at whichever file holds the run to parse.

# input_file_path = "/kaggle/input/gemma-claims/Gemma_2_9B-it_claims_output.jsonl"  # Kaggle alternative
input_file_path = "/content/drive/MyDrive/TFM/data/output/Gemma_2_9B-it_claims_output.jsonl"

print(f"📂 Loading raw data from: {input_file_path}")

try:
    with open(input_file_path, "r", encoding="utf-8") as f:
        # Filter empty lines just in case
        raw_records = [json.loads(line) for line in f if line.strip()]

    print(f"✅ Loaded {len(raw_records)} raw records.")

    # Run Parser
    print("⚙️ Parsing raw text into structured schema...")
    transformed_data = parse_claims_to_schema(raw_records)

    print(f"✅ Extracted {len(transformed_data)} unique claims.")

    # -----------------------------------------------------------------------------
    # Save Final Structured Data
    # -----------------------------------------------------------------------------
    output_json_path = "Gemma_2_9B_parsed_claims_v1.json"

    with open(output_json_path, "w", encoding="utf-8") as f_out:
        json.dump(transformed_data, f_out, indent=2)

    print(f"💾 Saved structured data to: {output_json_path}")

    # Preview
    print("\n--- Preview (First 2 Claims) ---")
    print(json.dumps(list(transformed_data.values())[:2], indent=2))

except FileNotFoundError:
    print(f"❌ Error: Input file not found at {input_file_path}")
except Exception as e:
    print(f"❌ Error during processing: {e}")

📂 Loading raw data from: /content/drive/MyDrive/TFM/data/output/Gemma_2_9B-it_claims_output.jsonl
✅ Loaded 122 raw records.
⚙️ Parsing raw text into structured schema...
✅ Extracted 730 unique claims.
💾 Saved structured data to: Gemma_2_9B_parsed_claims_v1.json

--- Preview (First 2 Claims) ---
[
  {
    "claim_id": "claim_0001",
    "arxiv_id": "2406.01263v2",
    "claim_text": "Measurements of resistivity, thermal expansion, specific heat, and Seebeck coefficient show anomalies at certain temperatures (T*) for LaO0.5F0.5Bi1-xPbxS2 (x\u22650.08).",
    "metadata": {
      "Source ID": "2406.01263v2",
      "Study Type": "Experimental",
      "Epistemic Type": "Observation",
      "Polarity": "Neutral"
    },
    "physical_attributes": {
      "Subject": "LaO0.5F0.5Bi1-xPbxS2 (x\u22650.08)",
      "Driver": "Pb substitution",
      "Effect": "Anomalies at T*"
    }
  },
  {
    "claim_id": "claim_0002",
    "arxiv_id": "2406.01263v2",
    "claim_text": "Large thermal expa

## 11. Final Export

We save the processed dictionary to a persistent JSON file. This file (`Gemma_2_9b-it_processed_claims.json`) serves as the input for **Notebook 5.2**, where we will perform Named Entity Recognition (NER) on the specific chemical formulas identified in these claims.

In [None]:
# -----------------------------------------------------------------------------
# Configuration
# -----------------------------------------------------------------------------
local_filename = "Gemma_2_9b-it_processed_claims.json"
drive_path = "/content/drive/MyDrive/TFM/data/output/" # Update path if needed

# -----------------------------------------------------------------------------
# Save to Local Environment (Working Directory)
# -----------------------------------------------------------------------------
try:
    with open(local_filename, "w", encoding="utf-8") as f:
        json.dump(transformed_data, f, indent=2, ensure_ascii=False)

    print(f"✅ Successfully saved {len(transformed_data)} claims to local file: '{local_filename}'")

    # -----------------------------------------------------------------------------
    # Backup to Google Drive (Optional but Recommended)
    # -----------------------------------------------------------------------------
    if os.path.exists(drive_path):
        drive_full_path = os.path.join(drive_path, local_filename)
        with open(drive_full_path, "w", encoding="utf-8") as f:
            json.dump(transformed_data, f, indent=2, ensure_ascii=False)
        print(f"✅ Backup saved to Drive: '{drive_full_path}'")
    else:
        print("⚠️ Drive path not found. File only saved locally.")

except Exception as e:
    print(f"❌ Error saving file: {e}")

✅ Successfully saved 730 claims to local file: 'Gemma_2_9b-it_processed_claims.json'
✅ Backup saved to Drive: '/content/drive/MyDrive/TFM/data/output/Gemma_2_9b-it_processed_claims.json'
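
A quick round-trip check can confirm that the exported file is readable and complete before it is handed to the next notebook. The sketch below uses a temporary file and a made-up dictionary in place of `transformed_data`; the helper name is hypothetical.

```python
import json
import os
import tempfile


def verify_json_export(path: str, expected: dict) -> bool:
    """Reload the exported JSON and compare it with the in-memory dictionary."""
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)
    return reloaded == expected


# Made-up stand-in for transformed_data, including a non-ASCII character
demo_claims = {"claim_0001": {"claim_text": "Anomalies for x\u22650.08", "metadata": {}}}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_claims.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(demo_claims, f, indent=2, ensure_ascii=False)
    export_ok = verify_json_export(demo_path, demo_claims)
```

The equality test is reliable here because the parsed claims contain only strings and nested dictionaries, which survive a JSON round trip unchanged.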
