## Phase 1: Setup & License Application Ingestion

In [27]:
# Install Unstructured and its PDF dependencies to extract structured content from PDFs
!pip install unstructured pypdf python-docx nltk --quiet
!pip install pdfminer.six --quiet
!pip install "unstructured[pdf]" --quiet

# Download the NLTK "punkt" tokenizer for sentence splitting (used later)
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [28]:
# Import libraries and load the PDF

from unstructured.partition.pdf import partition_pdf
import os
from collections import Counter
import re


# Path to the license application PDF
pdf_path = "/content/NoMarkUP30%_LA_Ch1_General Info_FINAdsL.pdf"

# Use Unstructured's partition_pdf to break the PDF into structured elements
elements = partition_pdf(filename=pdf_path)

# Preview total elements found
print(f"Total elements extracted: {len(elements)}")
print("First element preview:")
print(elements[0])


# Note: many elements come back as UncategorizedText (see the output below)
for i, el in enumerate(elements[:10]):
    print(f"\n=== Element {i} ===")
    print(f"Category: {el.category}")
    print(f"Text:\n{el.text.strip()}")


Total elements extracted: 918
First element preview:
IKE Enrichment Facility

=== Element 0 ===
Category: Title
Text:
IKE Enrichment Facility

=== Element 1 ===
Category: Title
Text:
License Application

=== Element 2 ===
Category: Title
Text:
TABLE OF CONTENTS

=== Element 3 ===
Category: Title
Text:
Page

=== Element 4 ===
Category: UncategorizedText
Text:
1.0

=== Element 5 ===
Category: UncategorizedText
Text:
GENERAL INFORMATION ............................................................................................. 1

=== Element 6 ===
Category: UncategorizedText
Text:
1.1

=== Element 7 ===
Category: UncategorizedText
Text:
FACILITY AND PROCESS OVERVIEW ........................................................ 1.1-1

=== Element 8 ===
Category: UncategorizedText
Text:
1.1.1 Facility Layout Description ................................................................. 1.1-3

=== Element 9 ===
Category: UncategorizedText
Text:
1.1.2 Process Overview ............................

## Phase 2: Structure Detection & Mapping, Embedding & Vector Search

TODO: return to this step and categorize the table-of-contents entries.


In [29]:
# Step 1: Collect all short blocks from the document
# Focus on blocks with fewer than 10 words, which are likely headers, footers, or labels.
# Caveat: section headings and table-of-contents entries are also short, so this
# cutoff catches them too; it likely needs to be tightened or replaced.
from collections import Counter

short_blocks = [
    el.text.strip()
    for el in elements
    if hasattr(el, "text") and 0 < len(el.text.strip().split()) < 10
]

#  Count how often each short block appears
footer_counts = Counter(short_blocks)

# If something appears too often (e.g., >5 times), it's probably a footer/header
# You can tune this threshold depending on document size
common_footers = {
    text for text, count in footer_counts.items() if count > 5
}

print(footer_counts)

print(" Probable footers/headers:")
for item in common_footers:
    print("-", item)

Counter({'Eagle Rock Enrichment Facility SAR': 47, 'Rev. 5': 24, '--': 24, 'IKE Enrichment Facility': 15, 'NA': 15, 'Trace5': 3, '1.0': 2, '1.1': 2, '1.2': 2, '1.3': 2, '1.4': 2, 'Process Overview': 2, 'Source material and SNM are used in this area.': 2, 'Total Mass kg (lb)': 2, 'Uranium Content kg (lb)': 2, 'None': 2, '40 each': 2, '11,360 L (3,000 gal)': 2, '379 L (100 gal)': 2, 'Chemical: UF6, UF4, UO2F2, oxides and other compounds': 2, 'License Application': 1, 'TABLE OF CONTENTS': 1, 'Page': 1, 'GENERAL INFORMATION ............................................................................................. 1': 1, 'FACILITY AND PROCESS OVERVIEW ........................................................ 1.1-1': 1, '1.1.1 Facility Layout Description ................................................................. 1.1-3': 1, '1.1.2 Process Overview ............................................................................. 1.1-3': 1, '1.1.3 Site Overview ............................
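
The counts above mix page furniture (`'Eagle Rock Enrichment Facility SAR': 47`) with bare section numbers (`'1.0': 2`), which is the concern raised in the cell's comments. The following is a minimal sketch of one way to keep the two apart. It assumes that section numbers look like `1.0` or `1.1.2.3` and reuses the repeat threshold of 5 from the cell above; `looks_like_section_number` and `split_boilerplate` are hypothetical helpers, not part of the notebook.

```python
import re
from collections import Counter

# Assumption: bare section numbers look like "1.0" or "1.1.2.3".
SECTION_NUMBER = re.compile(r"^\d+(\.\d+)*$")

def looks_like_section_number(text):
    return bool(SECTION_NUMBER.match(text.strip()))

def split_boilerplate(short_blocks, min_repeats=5):
    """Return (boilerplate, candidates) from a list of short text blocks.

    A block counts as boilerplate only if it repeats more than
    `min_repeats` times and is not a bare section number.
    """
    counts = Counter(short_blocks)
    boilerplate = {
        text for text, n in counts.items()
        if n > min_repeats and not looks_like_section_number(text)
    }
    candidates = [t for t in short_blocks if t not in boilerplate]
    return boilerplate, candidates

# Toy input modelled on the counts printed above.
sample = ["Eagle Rock Enrichment Facility SAR"] * 6 + ["1.0", "1.0", "Process Overview"]
boilerplate, candidates = split_boilerplate(sample)
print(boilerplate)
print(candidates)
```

Short blocks that survive this split can then be passed to the heading and table-of-contents detection instead of being discarded with the footers.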

In [30]:

#  Define a helper function to detect bullet points
def is_bullet_point(text):
    """
    Returns True if a line looks like a bullet point.
    Matches:
      - • Conduct...
      - - Maintain...
      - * Submit...
      - 1. Evaluate...
      - 2) Review...
    """
    text = text.strip()
    return bool(re.match(r"^(\s*[\*\-•]\s+|\d+[\.\)]\s+)", text))


# Step 2: Filter out unwanted text
# Keep text blocks if they are either:
# - Long paragraphs (>15 words), OR
# - Bullet points (even short ones)

filtered_chunks = []
discarded_chunks = []

for el in elements:
    if hasattr(el, "text"):
        text = el.text.strip()
        word_count = len(text.split())

        if word_count > 15 or is_bullet_point(text):
            filtered_chunks.append(text)
        else:
            discarded_chunks.append(text)

# ---- Step 3: Show basic stats ----
print(f" Retained: {len(filtered_chunks)} meaningful paragraphs")
print(f" Discarded: {len(discarded_chunks)} short or noisy blocks")

# Optional: Preview a few filtered results
print("\n Sample Clean Paragraph:\n")
print(filtered_chunks[0][:500])

# Optional: Preview what was discarded (debugging)
print("\n Sample Discarded Text:\n")
print(discarded_chunks[:5])


 Retained: 270 meaningful paragraphs
 Discarded: 648 short or noisy blocks

 Sample Clean Paragraph:

This section contains a general description and purpose of the Orano Enrichment USA LLC (OE), hereafter collectively referred to as the “Company”, OEIKE Enrichment Facility IKEF. The facility enriches uranium for producing nuclear fuel for use in commercial power plants. This License Application follows the format recommended by NUREG-1520, Standard Review Plan for Fuel Cycle Facility Applications (NRC, 2015). The level of detail provided in this chapter is appropriate for general familiarizatio

 Sample Discarded Text:

['IKE Enrichment Facility', 'License Application', 'TABLE OF CONTENTS', 'Page', '1.0']


## Build DocMap

In [31]:
import re

# --- Step 1: Extract TOC entries from top of document ---
toc_entries = []
toc_pattern = re.compile(r"^(\d+(\.\d+)+)\s+(.+?)\.{3,}\s+(\S+)$")

for i, el in enumerate(elements[:150]):
    if hasattr(el, "text"):
        text = el.text.strip()
        match = toc_pattern.match(text)
        if match:
            section = match.group(1)
            title = match.group(3).strip()
            page = match.group(4).strip()
            toc_entries.append({"section": section, "title": title, "toc_page": page})

# --- Step 2: Scan entire doc for actual body sections ---
body_sections = []
header_pattern = re.compile(r"^(\d+(\.\d+)+)\s+(.+)$")

for i, el in enumerate(elements):
    if hasattr(el, "text"):
        text = el.text.strip()
        match = header_pattern.match(text)
        if match:
            section = match.group(1)
            title = match.group(3).strip()
            body_sections.append({"section": section, "title": title, "index": i})

# --- Step 3: Create DocMap ---
doc_map = {}

for idx, entry in enumerate(body_sections):
    section = entry["section"]
    doc_map[section] = {
        "title": entry["title"],
        "start_index": entry["index"],
        "end_index": (
            body_sections[idx + 1]["index"] - 1
            if idx + 1 < len(body_sections)
            else len(elements) - 1
        )
    }

# Merge TOC page references into doc_map
for toc in toc_entries:
    if toc["section"] in doc_map:
        doc_map[toc["section"]]["toc_page"] = toc["toc_page"]

# --- Preview ---
import pprint
pprint.pprint(dict(list(doc_map.items())[:5]))


{'1.1.1': {'end_index': 8,
           'start_index': 8,
           'title': 'Facility Layout Description '
                    '................................................................. '
                    '1.1-3',
           'toc_page': '1.1-3'},
 '1.1.2': {'end_index': 9,
           'start_index': 9,
           'title': 'Process Overview '
                    '............................................................................. '
                    '1.1-3',
           'toc_page': '1.1-3'},
 '1.1.3': {'end_index': 10,
           'start_index': 10,
           'title': 'Site Overview '
                    '.................................................................................... '
                    '1.1-7',
           'toc_page': '1.1-7'},
 '1.1.4': {'end_index': 13,
           'start_index': 11,
           'title': 'Descriptive Summary of Licensed Material '
                    '...................................... 1.1-11',
           'toc_page': '1.1
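
In the preview above, each title still carries its dot leader and page reference, and the start and end indices are within the first few elements. That suggests the header scan is matching table-of-contents lines rather than body headings. Below is a minimal sketch of one way to skip those lines. It works on plain strings and assumes that a TOC line is any numbered line containing a run of three or more dots followed by a page reference; `find_body_headers` is a hypothetical helper.

```python
import re

# Numbered line with a dot leader and a trailing page reference (TOC entry).
TOC_LINE = re.compile(r"^(\d+(?:\.\d+)+)\s+(.+?)\s*\.{3,}\s*(\S+)$")
# Any numbered line (candidate body heading).
HEADER_LINE = re.compile(r"^(\d+(?:\.\d+)+)\s+(.+)$")

def find_body_headers(texts):
    """Return [(section, title, index)] for header-like lines that are not TOC lines."""
    headers = []
    for i, raw in enumerate(texts):
        text = raw.strip()
        if TOC_LINE.match(text):
            continue  # dot leader + page reference: a TOC entry, not a body heading
        m = HEADER_LINE.match(text)
        if m:
            headers.append((m.group(1), m.group(2).strip(), i))
    return headers

# Toy input: one TOC entry followed by the matching body heading.
texts = [
    "1.1.1 Facility Layout Description ........ 1.1-3",
    "Some introductory paragraph.",
    "1.1.1 Facility Layout Description",
    "The facility consists of several buildings.",
]
print(find_body_headers(texts))
```

Whether this is enough depends on how the body headings were extracted; if a section number and its title land in separate elements, they would need to be rejoined first.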

In [32]:
# Map table and figure captions to their element indices

table_figure_entries = []

caption_pattern = re.compile(r"^(Table|Figure)\s+\d+[\.\d\-]*\s+(.*)", re.IGNORECASE)

for i, el in enumerate(elements):
    if hasattr(el, "text"):
        text = el.text.strip()
        if caption_pattern.match(text):
            table_figure_entries.append({
                "index": i,
                "text": text
            })

print(f"✅ Found {len(table_figure_entries)} tables/figures")
for t in table_figure_entries[:5]:
    print(t)


✅ Found 12 tables/figures
{'index': 34, 'text': 'Table 1.1-1 Estimated Annual Gaseous Effluent'}
{'index': 35, 'text': 'Table 1.1-2 Estimated Annual Radiological and Mixed Wastes'}
{'index': 36, 'text': 'Table 1.1-3 Estimated Annual Liquid Effluent'}
{'index': 37, 'text': 'Table 1.1-4 Estimated Annual Non-Radiological Wastes'}
{'index': 38, 'text': 'Table 1.1-5 Annual Hazardous Construction Wastes'}


## Create Output Files

In [33]:


# === Save cleaned output to file (optional) ===
with open("clean_application.txt", "w", encoding="utf-8") as f:
    for para in filtered_chunks:
        f.write(para + "\n\n")



In [34]:
# Save the DocMap to JSON

import json

with open("docmap.json", "w") as f:
    json.dump(doc_map, f, indent=2)


In [35]:
# === Utility Code: Extract full text for any section ===
def get_section_text(section_number, elements, doc_map):
    """Return full paragraph text for a given section number using doc_map."""
    s = doc_map[section_number]["start_index"]
    e = doc_map[section_number]["end_index"]
    return "\n".join(
        el.text.strip() for el in elements[s:e+1]
        if hasattr(el, "text") and len(el.text.strip()) > 0
    )

# ✅ Example usage:
print(get_section_text("1.1.3", elements, doc_map)[:1000])  # Preview first 1000 characters of Section 1.1.3


1.1.3 Site Overview .................................................................................... 1.1-7


## Phase 3: Annotation, Entity Mapping, & Semantic Consistency Checks

In [36]:
!pip install -U spacy --quiet
!python -m spacy download en_core_web_sm --quiet


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [37]:
# Load the spaCy English model
import spacy
from pprint import pprint

nlp = spacy.load("en_core_web_sm")


In [38]:
# Run named-entity recognition (NER) on each cleaned paragraph


ner_results = []

for para in filtered_chunks:
    doc = nlp(para)
    entities = []
    for ent in doc.ents:
        entities.append({
            "text": ent.text,
            "label": ent.label_
        })
    ner_results.append({
        "paragraph": para,
        "entities": entities
    })

print(f" Processed {len(ner_results)} paragraphs with NER")


 Processed 270 paragraphs with NER


In [39]:
# Inspect the first NER results

pprint(ner_results[0])

print("space")
pprint(ner_results[:2])

{'entities': [{'label': 'ORG', 'text': 'the Orano Enrichment USA'},
              {'label': 'ORG', 'text': 'This License Application'},
              {'label': 'ORG', 'text': 'NUREG-1520'},
              {'label': 'ORG',
               'text': 'Standard Review Plan for Fuel Cycle Facility '
                       'Applications'},
              {'label': 'ORG', 'text': 'NRC'},
              {'label': 'DATE', 'text': '2015'},
              {'label': 'ORG', 'text': 'the Integrated Safety Analysis'},
              {'label': 'ORG', 'text': 'ISA'}],
 'paragraph': 'This section contains a general description and purpose of the '
              'Orano Enrichment USA LLC (OE), hereafter collectively referred '
              'to as the “Company”, OEIKE Enrichment Facility IKEF. The '
              'facility enriches uranium for producing nuclear fuel for use in '
              'commercial power plants. This License Application follows the '
              'format recommended by NUREG-1520, Standar

In [40]:
## Global Reference table


from collections import defaultdict

# Dictionary of label -> set of unique entities
global_reference_table = defaultdict(set)

# Loop over all items and collect entities
for item in ner_results:
    for ent in item["entities"]:
        label = ent.get("label") or ent.get("label_")
        text = ent["text"]
        global_reference_table[label].add(text)

# Convert sets to sorted lists for easier viewing
global_reference_table = {k: sorted(list(v)) for k, v in global_reference_table.items()}

# Display
print("Global Reference Table of Entities:")
for label, entries in global_reference_table.items():
    print(f"\n{label}:")
    for e in entries:
        print(f" - {e}")


Global Reference Table of Entities:

ORG:
 - ANSI
 - AREVA Inc. AREVA Inc.
 - AREVA NC Inc.
 - AREVA NP Inc.
 - AREVA NP SAS
 - AREVA NP USA Inc.
 - AREVA SA.Orano Enrichment USA LLC
 - ARLFRD
 - ASTM
 - Access Authorization for Licensed Personnel
 - Administration Building
 - Air
 - Alpha/Beta/Gamma Counting
 - American Nuclear Insurers
 - American Nuclear Insurers and/or Mutual Atomic Energy Liability
 - American Nuclear Insurers and/or Mutual Atomic Energy Liability Underwriters
 - American Society of Civil Engineers
 - An Electrical Services Building
 - Analytical Laboratory
 - Argonne National Lab-West
 - BLM
 - BSPB
 - Barren
 - Blending Donor Stations
 - Blending Receiver Stations
 - Blending, Sampling and
 - Boise
 - C and D (
 - CAA
 - CAB
 - CFR
 - CFR 51.21
 - CFR 51.22
 - CFR 51.32
 - COGEMA Resources Inc.
 - CRSB
 - Centrifuge Assembly
 - Centrifuge Assembly Building
 - Centrifuge Assembly Building First Floor
 - Chemical Process Safety
 - Control Area Boundary
 - Control 

In [41]:
# Parse dates and times
from dateutil.parser import parse as date_parse
from dateutil.parser import ParserError

normalized_dates = []

# Loop over all items and process DATE entities
for item in ner_results:
    para = item["paragraph"]
    for ent in item["entities"]:
        label = ent.get("label") or ent.get("label_")
        text = ent["text"]

        if label == "DATE":
            try:
                dt = date_parse(text, fuzzy=True)
                normalized_dates.append({
                    "original": text,
                    "standardized": dt.isoformat(),
                    "paragraph": para[:100]
                })
            except (ParserError, ValueError):
                # Skip unparseable dates
                pass

# Display
print(f"\n Parsed {len(normalized_dates)} dates.")
for d in normalized_dates:
    print(f"- '{d['original']}' standardized to {d['standardized']} (excerpt: '{d['paragraph']}...')")



 Parsed 128 dates.
- '2015' standardized to 2015-07-21T00:00:00 (excerpt: 'This section contains a general description and purpose of the Orano Enrichment USA LLC (OE), hereaf...')
- '2005' standardized to 2005-07-21T00:00:00 (excerpt: 'The enrichment process at the IKEF is basically the same process described in the SAR for the Nation...')
- '2005' standardized to 2005-07-21T00:00:00 (excerpt: 'The enrichment process at the IKEF is basically the same process described in the SAR for the Nation...')
- 'January 1, 2014' standardized to 2014-01-01T00:00:00 (excerpt: 'Orano Enrichment USA LLC (OE) is a Delaware limited liability company. It has been formed solely to ...')
- 'April 24, 1989' standardized to 1989-04-24T00:00:00 (excerpt: 'Orano Enrichment USA LLC (OE) is a Delaware limited liability company. It has been formed solely to ...')
- 'January 1, 2014' standardized to 2014-01-01T00:00:00 (excerpt: 'Orano Enrichment USA LLC (OE) is a Delaware limited liability company. It has been

In [42]:
pprint(normalized_dates[:3])

[{'original': '2015',
  'paragraph': 'This section contains a general description and purpose of the '
               'Orano Enrichment USA LLC (OE), hereaf',
  'standardized': '2015-07-21T00:00:00'},
 {'original': '2005',
  'paragraph': 'The enrichment process at the IKEF is basically the same '
               'process described in the SAR for the Nation',
  'standardized': '2005-07-21T00:00:00'},
 {'original': '2005',
  'paragraph': 'The enrichment process at the IKEF is basically the same '
               'process described in the SAR for the Nation',
  'standardized': '2005-07-21T00:00:00'}]
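
The bare years above come back with a month and day (`2015-07-21`). `dateutil` fills any missing fields from a default date, which is the current date when no `default` is passed, so the `07-21` is presumably the day the cell was run and not something stated in the application. Below is a minimal standard-library sketch that records the precision instead of filling in a month and day. It handles only the two shapes visible in the output (a bare year and `Month D, YYYY`), assumes an English locale for month names, and `normalize_date_text` is a hypothetical helper.

```python
import re
from datetime import datetime

def normalize_date_text(text):
    """Return (iso_string, precision), or None if the text is not recognised."""
    text = text.strip()
    if re.fullmatch(r"\d{4}", text):
        return text, "year"  # keep only what the source actually states
    try:
        parsed = datetime.strptime(text, "%B %d, %Y")
    except ValueError:
        return None
    return parsed.date().isoformat(), "day"

for raw in ["2015", "January 1, 2014", "April 24, 1989", "annually"]:
    print(raw, "->", normalize_date_text(raw))
```

Keeping the precision alongside the value makes it possible to compare a citation year such as `2015` with a full date later without treating the invented day as real.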


In [43]:
# Acronym library



acronym_dict = {
    "ERIEF": "Eagle Rock Enrichment Facility",
    "SAR": "Safety Analysis Report",
    "UF6": "Uranium Hexafluoride",
    "NRC": "Nuclear Regulatory Commission"
}




In [44]:
# Create an acronym-expanded version of each paragraph

normalized_paragraphs = []

for para in filtered_chunks:
    norm_para = para
    for short, long_form in acronym_dict.items():
        # Escape the key so any regex metacharacters in an acronym match literally
        pattern = re.compile(rf"\b{re.escape(short)}\b")
        norm_para = pattern.sub(long_form, norm_para)
    normalized_paragraphs.append(norm_para)

print(" Acronym normalization complete.")


 Acronym normalization complete.


In [45]:
# Compare original vs. normalized text

print("Original:")
print(filtered_chunks[0][:200])

print("\nNormalized:")
print(normalized_paragraphs[0][:200])


Original:
This section contains a general description and purpose of the Orano Enrichment USA LLC (OE), hereafter collectively referred to as the “Company”, OEIKE Enrichment Facility IKEF. The facility enriches

Normalized:
This section contains a general description and purpose of the Orano Enrichment USA LLC (OE), hereafter collectively referred to as the “Company”, OEIKE Enrichment Facility IKEF. The facility enriches
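
The original and normalized excerpts above are identical. That is what you would expect when none of the keys in `acronym_dict` occurs in the excerpt: it uses `OE` and `IKEF`, and neither is a key. Below is a minimal sketch that counts replacements per acronym so that a no-op like this is visible. `expand_acronyms` is a hypothetical helper, and the two-entry mapping is only a toy subset of the dictionary above.

```python
import re

def expand_acronyms(text, mapping):
    """Replace whole-word acronyms and return (new_text, hits_per_acronym)."""
    hits = {}
    for short, long_form in mapping.items():
        pattern = re.compile(rf"\b{re.escape(short)}\b")
        text, n = pattern.subn(long_form, text)  # subn also reports the match count
        if n:
            hits[short] = n
    return text, hits

mapping = {"SAR": "Safety Analysis Report", "NRC": "Nuclear Regulatory Commission"}
new_text, hits = expand_acronyms("The NRC reviews the SAR for the IKEF.", mapping)
print(new_text)
print(hits)
```

An empty `hits` dictionary for a paragraph is a quick signal that the acronym library does not yet cover the abbreviations that paragraph uses.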


In [46]:

# Reference map for cross-reference questions (e.g., "see Table 2.3" or "Appendix A")

reference_pattern = re.compile(r"(Table|Figure|Appendix)\s+([\w\d\.\-]+)", re.IGNORECASE)

cross_references = []

for i, para in enumerate(filtered_chunks):
    matches = reference_pattern.findall(para)
    if matches:
        refs = []
        for m in matches:
            refs.append({
                "type": m[0],
                "ref": m[1]
            })
        cross_references.append({
            "paragraph_index": i,
            "references": refs
        })

print(f" Found {len(cross_references)} cross-reference mentions")


 Found 31 cross-reference mentions
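
To answer cross-reference questions, each mention has to be tied back to a caption found in the table and figure scan. Below is a minimal sketch of that lookup. It assumes captions look like `Table 1.1-1 Estimated Annual Gaseous Effluent` and strips the sentence-final punctuation that a character class such as `[\w\d\.\-]+` can capture; `build_caption_index` and `resolve_references` are hypothetical helpers.

```python
import re

CAPTION = re.compile(r"^(Table|Figure)\s+(\d[\d.\-]*)\s+(.*)", re.IGNORECASE)
MENTION = re.compile(r"(Table|Figure|Appendix)\s+([\w.\-]+)", re.IGNORECASE)

def build_caption_index(captions):
    """Map (kind, identifier) -> caption title."""
    index = {}
    for text in captions:
        m = CAPTION.match(text.strip())
        if m:
            index[(m.group(1).lower(), m.group(2).rstrip(".-"))] = m.group(3).strip()
    return index

def resolve_references(paragraph, index):
    """Return [(kind, identifier, caption_or_None)] for each mention in the paragraph."""
    resolved = []
    for kind, ref in MENTION.findall(paragraph):
        key = (kind.lower(), ref.rstrip(".-"))  # drop sentence-final punctuation
        resolved.append((key[0], key[1], index.get(key)))
    return resolved

captions = ["Table 1.1-1 Estimated Annual Gaseous Effluent"]
index = build_caption_index(captions)
print(resolve_references("Releases are listed in Table 1.1-1. See also Figure 1.1-9.", index))
```

Mentions that resolve to `None` are either references to another chapter or captions the scan missed, and both cases are worth listing for review.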


In [47]:
# Verify that the DocMap is ordered correctly
# Extract numeric parts for sorting
def section_key(s):
    return [int(part) for part in s.split(".")]

# Sort sections
sorted_sections = sorted(doc_map.keys(), key=section_key)

# Check order and print
print(" Section order verification:")
for s in sorted_sections:
    print(f"{s} - {doc_map[s]['title']}")


 Section order verification:
1.1.1 - Facility Layout Description ................................................................. 1.1-3
1.1.2 - Process Overview ............................................................................. 1.1-3
1.1.2.3 - Materials, By-Products, Wastes, and Finished Products
1.1.3 - Site Overview .................................................................................... 1.1-7
1.1.4 - Descriptive Summary of Licensed Material ...................................... 1.1-11
1.2.1 - Corporate Identity and Ownership ..................................................... 1.2-1
1.2.2 - Financial Qualifications ...................................................................... 1.2-3
1.2.3 - Characteristics of the Material ........................................................... 1.2-4
1.2.4 - Authorized Uses ................................................................................ 1.2-4
1.2.5 - Special Exemptions or Special Authorizations .

## Phase 4:  Ingest and Chunk NUREG Document

In [48]:
!pip install unstructured pdfminer.six nltk --quiet


In [49]:
# Alternative loader for a plain-text (.txt) copy of the NUREG (disabled):
# nureg_path = "/content/NUREG 1520 Compliance.pdf"
# with open(nureg_path, "r", encoding="utf-8") as f:
#     nureg_text = f.read()
# print(f" NUREG loaded: {len(nureg_text):,} characters")

In [50]:
from unstructured.partition.pdf import partition_pdf

# Use separate names so the license application's `pdf_path` and `elements` are not overwritten
nureg_pdf_path = "/content/NUREG 1520 Compliance.pdf"  # Update path
nureg_elements = partition_pdf(filename=nureg_pdf_path)

# Extract text
nureg_text = "\n\n".join([el.text for el in nureg_elements if hasattr(el, "text")])
print(f"PDF parsed into {len(nureg_elements)} elements")




PDF parsed into 7641 elements


In [51]:
import re
from langchain.text_splitter import RecursiveCharacterTextSplitter
from unstructured.partition.pdf import partition_pdf

#  Step 1: Extract text from PDF using Unstructured
#  (separate names keep the license application's `pdf_path` and `elements` intact)
nureg_pdf_path = "/content/NUREG 1520 Compliance.pdf"
nureg_elements = partition_pdf(filename=nureg_pdf_path)

#  Step 2: Combine all text elements into one string
nureg_text = "\n\n".join([el.text for el in nureg_elements if hasattr(el, "text")])

#  Step 3: Normalize whitespace
nureg_text = re.sub(r'\s+', ' ', nureg_text).strip()

#  Step 4: Split into overlapping chunks of up to 1000 characters
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", ".", " "]  # Most to least preferred
)
nureg_chunks = splitter.split_text(nureg_text)

#  Step 5: Preview results
print(f" Total NUREG Chunks: {len(nureg_chunks)}")
print(" Sample Chunk:\n", nureg_chunks[0][:500])

#  Step 6: Save to disk (optional)
with open("nureg_chunks.txt", "w", encoding="utf-8") as f:
    for chunk in nureg_chunks:
        f.write(chunk + "\n\n")




 Total NUREG Chunks: 1259
 Sample Chunk:
 NUREG-1520, Rev. 2 Standard Review Plan for Fuel Cycle Facilities License Applications Final Report Office of Nuclear Material Safety and Safeguards AVAILABILITY OF REFERENCE MATERIALS IN NRC PUBLICATIONS NRC Reference Material Non-NRC Reference Material As of November 1999, you may electronically access NUREG-series publications and other NRC records at NRC’s Library at www.nrc.gov/reading-rm.html. Publicly released records include, to name a few, NUREG-series publications; Federal Register not
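
Step 3 collapses every whitespace run into a single space, including the `\n\n` joins added in Step 2. After that, the `"\n\n"` and `"\n"` separators have nothing to match, and splitting falls through to `"."` and `" "`. Below is a minimal sketch of a normalization that cleans whitespace inside paragraphs but keeps one blank line between them. `normalize_keep_paragraphs` is a hypothetical helper, and whether paragraph-aware chunks improve retrieval here is untested.

```python
import re

def normalize_keep_paragraphs(text):
    """Collapse whitespace inside paragraphs; keep one blank line between them."""
    paragraphs = re.split(r"\n\s*\n", text)  # split on blank lines
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)

raw = "NUREG-1520,   Rev. 2\n\nStandard Review\nPlan\n\n\n\nFinal Report  "
print(repr(normalize_keep_paragraphs(raw)))
```

With paragraph breaks preserved, a splitter configured with `"\n\n"` as its first separator can cut at element boundaries before resorting to sentence or word breaks.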


##  Create Semantic Vector Store

In [52]:
!pip install faiss-cpu sentence-transformers langchain --quiet


In [53]:
!pip install -U langchain-community --quiet


In [54]:
from langchain.embeddings import HuggingFaceEmbeddings  # LangChain is used only for building the vector store

embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [55]:
# Build a FAISS (Facebook AI Similarity Search) vector store
# TODO: evaluate ChromaDB as an alternative

from langchain.vectorstores import FAISS

# Build vector index from the NUREG chunks
vectorstore = FAISS.from_texts(nureg_chunks, embedding_model)

print("Vector store built.")


Vector store built.



In [56]:
from langchain.vectorstores import FAISS

# Save the vectorstore
vectorstore.save_local("nureg_faiss_index")

# Load it safely (you trust your own file)
vectorstore = FAISS.load_local(
    "nureg_faiss_index",
    embedding_model,
    allow_dangerous_deserialization=True
)

print(" FAISS vector store successfully loaded.")


 FAISS vector store successfully loaded.


In [57]:
# Try a search query


query = "hydrology site characteristics"
results = vectorstore.similarity_search(query, k=3)

for i, res in enumerate(results):
    print(f"\n Match #{i+1}:\n")
    print(res.page_content[:500])



🔹 Match #1:

. Uses of land within the licensed facility or its proposed boundaries (i.e., residential, industrial, commercial, or agricultural) f. Description of nearby bodies of water and their uses 3. Meteorology a. Primary wind directions and average windspeeds b. Annual amount and forms of precipitation, as well as the design-basis values for accident analysis of maximum snow or ice load and probable maximum precipitation c. Type, frequency, and magnitude of severe weather (e.g., lightning, tornado, and

🔹 Match #2:

. water bodies within approximately 1.61 km (1 mi) 2. a general area map covering a radius of approximately 16.1 km (10 mi), a U.S. Geological Survey topographical quadrangle (7½-minute series, including the adjacent quadrangle(s) if the site is located less than 1.61 km (1 mi) from the edge of the quadrangle), and a map or aerial photograph indicating onsite and near-site structures within a radius of approximately 1.61 km (1 mi)1 3. stack heights, typical stack flo
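
`similarity_search` returns the `k` stored chunks whose embeddings lie closest to the query embedding. The ranking step can be sketched in pure Python as below. This is an illustration only, not FAISS's implementation: it assumes Euclidean (L2) distance, uses toy 3-dimensional vectors in place of real embeddings, and `top_k_l2` is a hypothetical helper.

```python
import math

def top_k_l2(query_vec, doc_vecs, k=3):
    """Return [(doc_index, distance)] for the k vectors closest to query_vec."""
    scored = [
        (i, math.dist(query_vec, vec))  # Euclidean distance (Python 3.8+)
        for i, vec in enumerate(doc_vecs)
    ]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

# Toy "embeddings": vectors 0 and 2 point in nearly the same direction.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]
print(top_k_l2([1.0, 0.0, 0.0], doc_vecs, k=2))
```

A match is therefore only "closest among the stored chunks"; inspecting the distances is one way to judge whether the top hit is actually relevant to the query.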

In [60]:
from transformers import pipeline

#  Load open-source LLM pipeline (you can swap the model if needed)
llm = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b-instruct",
    max_new_tokens=512,
    temperature=0.3,
    device=0  # Use 0 for GPU, -1 for CPU
)

#  Prompt Template
def build_prompt(app_section_text, nureg_guidance_text):
    return f"""
You are a nuclear compliance expert. Compare the license application section with the NUREG-1520 guidance.

--- APPLICATION SECTION ---
{app_section_text}

--- NUREG GUIDANCE ---
{nureg_guidance_text}

--- TASK ---
Determine:
1. Does the application fully comply with the NUREG guidance?
2. If not, list any missing or vague elements.
3. Suggest specific improvements to bring it into full compliance.
4. Keep response professional and structured.

Respond in the format:
- Compliance Rating: Fully / Partially / Inadequate
- Missing Elements: [...]
- Suggested Revisions: [...]
"""

#  Example query (replace with real content)
app_text = "The facility is located in Sevier County with access to emergency services."
retrieved_nureg = nureg_chunks[15]  # Placeholder: hard-coded chunk index (replace with a similarity-search result)

#  Run LLM
prompt = build_prompt(app_text, retrieved_nureg)
response = llm(prompt)[0]["generated_text"]

print("🧠 LLM Compliance Output:\n")
print(response)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [63]:
from transformers import pipeline

#  Load open-source LLM pipeline (you can swap the model if needed)
llm = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b-instruct",
    max_new_tokens=512,
    temperature=0.3,
    device=0  # Use 0 for GPU, -1 for CPU
)

#  Prompt Template
def build_prompt(app_section_text, nureg_guidance_text):
    return f"""
You are a nuclear compliance expert. Compare the license application section with the NUREG-1520 guidance.

--- APPLICATION SECTION ---
{app_section_text}

--- NUREG GUIDANCE ---
{nureg_guidance_text}

--- TASK ---
Determine:
1. Does the application fully comply with the NUREG guidance?
2. If not, list any missing or vague elements.
3. Suggest specific improvements to bring it into full compliance.
4. Keep response professional and structured.

Respond in the format:
- Compliance Rating: Fully / Partially / Inadequate
- Missing Elements: [...]
- Suggested Revisions: [...]
"""

#  Example query (replace with real content)
app_text = "The facility is located in Sevier County with access to emergency services."
retrieved_nureg = nureg_chunks[15]  # Placeholder: hard-coded chunk index (replace with a similarity-search result)

#  Run LLM
prompt = build_prompt(app_text, retrieved_nureg)
response = llm(prompt)[0]["generated_text"]

print("🧠 LLM Compliance Output:\n")
print(response)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


KeyboardInterrupt: 

In [67]:
from transformers import pipeline

# ✅ Load open-source LLM pipeline (you can swap to another open-access model if needed)
llm = pipeline(
    "text-generation",
    model="tiiuae/falcon-rw-1b",  # Must be downloaded or cached locally
    max_new_tokens=512,
    temperature=0.3,
    device=0  # 0 for GPU, -1 for CPU
)

# ✅ Prompt Template Function
def build_prompt(app_section_text, nureg_guidance_text):
    return f"""
You are a nuclear compliance expert. Compare the license application section with the NUREG-1520 guidance.

--- APPLICATION SECTION ---
{app_section_text}

--- NUREG GUIDANCE ---
{nureg_guidance_text}

--- TASK ---
Determine:
1. Does the application fully comply with the NUREG guidance?
2. If not, list any missing or vague elements.
3. Suggest specific improvements to bring it into full compliance.
4. Keep response professional and structured.

Respond in the format:
- Compliance Rating: Fully / Partially / Inadequate
- Missing Elements: [...]
- Suggested Revisions: [...]
"""

# ✅ Provide the Application Section (example from your app text)
app_section_text = "The facility is located in Sevier County with access to emergency services."

# ✅ Retrieve relevant NUREG chunk using similarity search
retrieved_nureg_chunk = vectorstore.similarity_search(app_section_text, k=1)[0].page_content

# ✅ Build the prompt
prompt = build_prompt(app_section_text, retrieved_nureg_chunk)

# ✅ Run the prompt through the LLM
response = llm(prompt)[0]["generated_text"]

# ✅ Output the result
print("🧠 LLM Compliance Output:\n")
print(response)



Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


🧠 LLM Compliance Output:


You are a nuclear compliance expert. Compare the license application section with the NUREG-1520 guidance.

--- APPLICATION SECTION ---
The facility is located in Sevier County with access to emergency services.

--- NUREG GUIDANCE ---
. The information that the NRC staff will review includes the following (as appropriate for the facility being reviewed): 1. Site geography a. Site location: State, county, municipality, topographic quadrangle (in eight 7-1/2-minute quadrants), site boundary, and controlled-area boundary b. Major nearby highways c. Nearby bodies of water d. Any other significant geographic feature that may affect accident analysis within 1.6 kilometers (1 mile) of the site (e.g., ridges, valleys, specific geologic structures) 2. Demographics a. Latest census results for area of concern b. Description, distance, and direction to nearby population centers c. Description of and distance and direction to nearby public facilities (e.g., schools, hos

In [71]:
from transformers import pipeline

# ✅ Use smaller, open-access model that's fast + instruction-tuned
llm = pipeline(
    "text-generation",
    model="databricks/dolly-v2-3b",
    max_new_tokens=400,
    temperature=0.2,
    do_sample=True,
    device=-1,  # or 0 for GPU
    trust_remote_code=True  # ✅ Enables loading models with custom code
)

# ✅ Prompt Template
def build_prompt(app_section_text, nureg_guidance_text):
    return f"""You are a nuclear compliance expert.

Compare the following LICENSE APPLICATION SECTION to the relevant NUREG-1520 GUIDANCE and answer the following:
1. Does the application fully comply?
2. List any missing or vague content.
3. Suggest improvements for full compliance.

--- LICENSE APPLICATION SECTION ---
{app_section_text}

--- NUREG GUIDANCE SECTION ---
{nureg_guidance_text}

--- RESPONSE FORMAT ---
- Compliance Rating: Fully / Partially / Inadequate
- Missing Elements: [...]
- Suggested Revisions: [...]
"""

# ✅ Example section
app_text = "The facility is located in Sevier County with access to emergency services."
retrieved_nureg = nureg_chunks[15]  # Placeholder: hard-coded chunk index (replace with a similarity-search result)

# ✅ Run prompt through LLM
prompt = build_prompt(app_text, retrieved_nureg)
response = llm(prompt)[0]["generated_text"]

# ✅ Clean up and print
print("🧠 LLM Compliance Output:\n")
print(response.replace(prompt, "").strip())  # remove prompt from output



A new version of the following files was downloaded from https://huggingface.co/databricks/dolly-v2-3b:
- instruct_pipeline.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



Device set to use cpu


🧠 LLM Compliance Output:

The LICENSE APPLICATION SECTION fully complies with NUREG-1520 GUIDANCE.

The NUREG-1520 GUIDANCE contains several requirements that are not fully or partially addressed in the LICENSE APPLICATION SECTION. These requirements include:
- Regulatory Requirements
- Regulatory Guidance
- Regulatory Acceptance Criteria
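
The reply above does not follow the requested format. Its first sentence says the section fully complies, and the next says several requirements are not addressed. It is therefore worth checking each response mechanically before relying on it. Below is a minimal sketch of such a check. It assumes the three field labels appear exactly as written in the prompt template, and `parse_compliance_response` is a hypothetical helper.

```python
import re

FIELDS = ("Compliance Rating", "Missing Elements", "Suggested Revisions")
ALLOWED_RATINGS = {"fully", "partially", "inadequate"}

def parse_compliance_response(response):
    """Return a dict of the three fields, or None if the format is not followed."""
    parsed = {}
    for field in FIELDS:
        # Each field is expected on its own line, optionally prefixed with "-".
        m = re.search(rf"^-?\s*{re.escape(field)}:\s*(.+)$", response, re.MULTILINE)
        if not m:
            return None
        parsed[field] = m.group(1).strip()
    if parsed["Compliance Rating"].lower() not in ALLOWED_RATINGS:
        return None
    return parsed

good = (
    "- Compliance Rating: Partially\n"
    "- Missing Elements: site boundary\n"
    "- Suggested Revisions: add a map"
)
bad = "The LICENSE APPLICATION SECTION fully complies with NUREG-1520 GUIDANCE."
print(parse_compliance_response(good))
print(parse_compliance_response(bad))
```

A `None` result can be used to retry the prompt or to flag the section for manual review instead of recording an unreliable rating.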
