# 📜 YAML to JSON Converter with Enum Processing

The `yaml_to_json()` function converts a YAML schema into JSON and checks that the result has a valid structure.

## 🛠 How to Use:
1. Place your **`schema.yaml`** file inside the **`input/`** directory.
2. Run the following command in a Python script or Jupyter Notebook:
   ```python
   import utils.yaml_to_json
   utils.yaml_to_json.yaml_to_json()
   ```


In [5]:
import importlib
import utils.yaml_to_json
importlib.reload(utils.yaml_to_json)

With_dependency = True

utils.yaml_to_json.yaml_to_json()

✅ YAML successfully converted to JSON!
 - Input: input/schema.yaml
 - Output: generated/schema.json



## 🔗 **Named Entity Class Processing**  

### **Extracting Named Entities**  
Finds and lists every class in the schema that is classified as `NamedEntity`.
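
As a rough illustration of what this extraction step might involve, the sketch below filters a schema dictionary for classes that inherit from `NamedEntity`. The schema layout (a top-level `classes` mapping with an `is_a` parent field) and the helper name `find_named_entity_classes` are assumptions made for this example; the actual `utils.extract_named_entity_classes` module may work differently.

```python
def find_named_entity_classes(schema, root="NamedEntity"):
    """Return the classes that inherit, directly or transitively, from `root`.

    Assumes a hypothetical layout: schema["classes"][name]["is_a"] names the parent.
    """
    classes = schema.get("classes", {})

    def inherits_from_root(name):
        seen = set()  # guards against accidental cycles in the is_a chain
        parent = (classes.get(name) or {}).get("is_a")
        while parent and parent not in seen:
            if parent == root:
                return True
            seen.add(parent)
            parent = (classes.get(parent) or {}).get("is_a")
        return False

    return {name: body for name, body in classes.items() if inherits_from_root(name)}


# Toy schema for illustration only; not the project's real schema.
toy_schema = {
    "classes": {
        "NamedEntity": {"description": "Base class"},
        "Drug": {"is_a": "NamedEntity"},
        "Bacteria": {"is_a": "NamedEntity"},
        "Probiotic": {"is_a": "Bacteria"},
        "Publication": {"description": "Not a named entity"},
    }
}

found = find_named_entity_classes(toy_schema)
print("NamedEntity classes:", ", ".join(found.keys()))
```

Returning a dictionary keyed by class name matches how the cell below uses the result (`named_entity_classes.keys()`).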

In [6]:
import importlib
import utils.extract_named_entity_classes
importlib.reload(utils.extract_named_entity_classes) 

# Extract and print NamedEntity classes
named_entity_classes = utils.extract_named_entity_classes.extract_named_entity_classes()
print("NamedEntity classes:", ", ".join(named_entity_classes.keys()))

NamedEntity classes: AnatomicalLocation, Animal, BiomedicalTechnique, Bacteria, Chemical, Metabolites, DietarySupplement, DiseaseDisorderOrFinding, Drug, Food, Gene, Human, Microbiome, StatisticalTechnique


### Generate Response Formats for Named Entity Classes  
This step generates JSON response formats for named entity classes using the extracted schema.  
The results are saved in the `generated/response_formats/` directory.
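
To make the idea of a "response format" concrete, here is a minimal sketch that builds one JSON-Schema-style object per class. The field names (`label`, `entities`), the `attributes` layout, and the helper `build_response_format` are hypothetical; the files written by `generate_named_entity_response_formats` may be structured differently.

```python
import json


def build_response_format(class_name, class_def):
    """Build a JSON-Schema-like response format for one entity class (illustrative)."""
    # Every entity gets a "label"; schema attributes become optional string fields.
    properties = {"label": {"type": "string"}}
    for attr_name, attr_def in (class_def.get("attributes") or {}).items():
        properties[attr_name] = {
            "type": "string",
            "description": (attr_def or {}).get("description", ""),
        }
    return {
        "name": f"{class_name}_response",
        "schema": {
            "type": "object",
            "properties": {
                "entities": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": properties,
                        "required": ["label"],
                    },
                }
            },
            "required": ["entities"],
        },
    }


# Toy class definitions, assumed for this example only.
toy_classes = {
    "Drug": {"attributes": {"dosage": {"description": "Reported dose"}}},
    "Gene": {},
}

response_formats = {
    name: build_response_format(name, body) for name, body in toy_classes.items()
}
print(json.dumps(response_formats["Drug"], indent=2))
```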

In [7]:
import importlib
import utils.generate_named_entity_response_formats
importlib.reload(utils.generate_named_entity_response_formats) 
# from utils.generate_named_entity_response_formats import generate_named_entity_response_formats

# Define file paths
schema_path = "generated/schema.json"
output_path = "generated/response_formats/named_entity_response_formats.json"

# Generate response formats
utils.generate_named_entity_response_formats.generate_named_entity_response_formats(schema_path, output_path, named_entity_classes)


✅ Named entity response formats saved to generated/response_formats/named_entity_response_formats.json
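
The next cell runs extraction for each PMID and converts the results to a span-based format. As a hedged illustration of what such a conversion can involve, the sketch below locates every occurrence of an extracted mention in the source text and records character offsets. The function name `find_spans`, the output keys, and the case-insensitive matching are assumptions for this example, not the behavior of `convert_extracted_to_span_annotated`.

```python
import re


def find_spans(text, mention, label):
    """Return one span dict per occurrence of `mention` in `text`.

    Offsets are 0-based with an exclusive end, so text[start:end] is the match.
    """
    if not mention:
        return []  # an empty pattern would otherwise match at every position
    spans = []
    for match in re.finditer(re.escape(mention), text, flags=re.IGNORECASE):
        spans.append(
            {
                "start": match.start(),
                "end": match.end(),
                "text": match.group(0),
                "label": label,
            }
        )
    return spans


sample = "Gut microbiota and the gut-brain axis."
spans = find_spans(sample, "gut", "AnatomicalLocation")
for span in spans:
    print(span)
```

Exact string search is the simplest option; it will miss mentions that the model paraphrased or normalized, so a real pipeline may need fuzzier matching.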


In [8]:
import importlib
import os
import json
from utils.extract_named_entity_classes import extract_named_entity_classes

import utils.process_named_entities
importlib.reload(utils.process_named_entities)

# Define constants
schema_path = "generated/schema.json"
response_formats_path = "generated/response_formats/named_entity_response_formats.json"
sample_text_path = "input/sample.txt"  # temporary per PMID
final_predictions_path = "org_T61_BaselineRun_NuNerZero.json"

# Choose dependency setting
With_dependency = False

# Extract classes
named_entity_classes = extract_named_entity_classes()

# Load input documents
with open("dev.json", "r", encoding="utf-8") as f:
    dataset = json.load(f)

# Container for final output
final_predictions = {}

# Loop through PMIDs
for pmid, doc in dataset.items():
    title = doc.get("title", "")
    abstract = doc.get("abstract", "")

    # Save title + abstract to temp sample.txt
    with open(sample_text_path, "w", encoding="utf-8") as f:
        f.write(f'title: "{title}"\nabstract: "{abstract}"')

    # Define output paths (can reuse same ones since we overwrite each time)
    # Both dependency settings currently write to the same responses file
    output_responses_path = "output/generated_responses.json"
    prompts_save_path = (
        "generated/prompts/final_namedentity_prompts.json"
        if With_dependency else "generated/prompts/final_namedentity_without_dependencies_prompts.json"
    )

    # Run entity extraction
    print(f"\n📄 Processing PMID: {pmid}")
    utils.process_named_entities.process_named_entity_classes(
        named_entity_classes,
        schema_path,
        sample_text_path,
        response_formats_path,
        output_responses_path,
        prompts_save_path
    )

    # Convert to span-based format per PMID
    from utils.process_named_entities import convert_extracted_to_span_annotated
    temp_output_path = f"output/tmp_{pmid}_converted.json"
    convert_extracted_to_span_annotated(
        output_responses_path=output_responses_path,
        text_sample_path=sample_text_path,
        final_output_path=temp_output_path,
        pmid=pmid
    )

    # Read result and merge into final dict
    with open(temp_output_path, "r", encoding="utf-8") as f:
        pred = json.load(f)
        final_predictions.update(pred)

# Save final predictions
with open(final_predictions_path, "w", encoding="utf-8") as f:
    json.dump(final_predictions, f, indent=4, ensure_ascii=True)

print(f"\n✅ All done! Saved final predictions to: {final_predictions_path}")



📄 Processing PMID: 30099552
title: "Making Sense of … the Microbiome in Psychiatry."
abstract: "Microorganisms can be found almost anywhere, including in and on the human body. The collection of microorganisms associated with a certain location is called a microbiota, with its collective genetic material referred to as the microbiome. The largest population of microorganisms on the human body resides in the gastrointestinal tract; thus, it is not surprising that the most investigated human microbiome is the human gut microbiome. On average, the gut hosts microbes from more than 60 genera and contains more cells than the human body. The human gut microbiome has been shown to influence many aspects of host health, including more recently the brain.Several modes of interaction between the gut and the brain have been discovered, including via the synthesis of metabolites and neurotransmitters, activation of the vagus nerve, and activation of the immune system. A growing body of work is im

EVALUATION

### 🔍 Extract Named Entities Using GPT
This step generates prompts for extracting named entity relationships, calls GPT for entity and attribute extraction, and saves the responses. It ensures identified entities are correctly linked to their attributes.
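
The prompt-generation part of this step can be pictured with the sketch below, which assembles one prompt per entity class from a class description and the input text. The template wording and the helper `build_prompt` are hypothetical, and the GPT call itself is deliberately left out; the real prompts are produced by `utils.process_named_entities`.

```python
def build_prompt(class_name, description, text):
    """Assemble an extraction prompt for a single entity class (illustrative template)."""
    return (
        f"Extract every mention of the entity class '{class_name}' from the text below.\n"
        f"Class description: {description}\n"
        "Return JSON that matches the supplied response format.\n\n"
        f"Text:\n{text}"
    )


# Toy inputs, assumed for this example only.
class_descriptions = {
    "Bacteria": "A bacterial taxon, species, or strain.",
    "Drug": "A pharmaceutical compound.",
}
sample_text = 'title: "Example title"\nabstract: "Example abstract."'

prompts = {
    name: build_prompt(name, description, sample_text)
    for name, description in class_descriptions.items()
}
print(prompts["Bacteria"])
```

Keeping the prompts in a dictionary keyed by class name makes it straightforward to save them to a single JSON file for later inspection.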


In [None]:
# import importlib
# import utils.process_named_entities
# importlib.reload(utils.process_named_entities)
# # from utils.process_named_entities import process_named_entity_classes
# from utils.extract_named_entity_classes import extract_named_entity_classes

# # Define paths
# schema_path = "generated/schema.json"
# text_sample_path = "input/sample.txt"
# response_formats_path = "generated/response_formats/named_entity_response_formats.json"
# output_responses_path = "output/generated_responses.json"
# prompts_save_path = "generated/prompts/final_namedentity_prompts.json"

# # Extract named entity classes
# named_entity_classes = extract_named_entity_classes()

# # Run entity extraction
# if With_dependency:
#   print("📁 --> Dependency Based Processing")
#   utils.process_named_entities.process_named_entity_classes(
#       named_entity_classes, schema_path, text_sample_path, response_formats_path, output_responses_path, prompts_save_path
#   )
# else:
#   print("📁 --> Independent Processing")
#   output_responses_path= "output/generated_responses_without_dependencies.json"
#   prompts_save_path="generated/prompts/final_namedentity_without_dependencies_prompts.json"
#   utils.process_named_entities.process_named_entity_classes(
#       named_entity_classes, schema_path, text_sample_path, response_formats_path, output_responses_path, prompts_save_path
#   )

📁 --> Independent Processing
title: "Hypothesis of a potential BrainBiota and its relation to CNS autoimmune inflammation.",
abstract: "Infectious agents have been long considered to play a role in the pathogenesis of neurological diseases as part of the interaction between genetic susceptibility and the environment. The role of bacteria in CNS autoimmunity has also been highlighted by changes in the diversity of gut microbiota in patients with neurological diseases such as Parkinson's disease, Alzheimer disease and multiple sclerosis, emphasizing the role of the gut-brain axis. We discuss the hypothesis of a brain microbiota, the BrainBiota: bacteria living in symbiosis with brain cells. Existence of various bacteria in the human brain is suggested by morphological evidence, presence of bacterial proteins, metabolites, transcripts and mucosal-associated invariant T cells. Based on our data, we discuss the hypothesis that these bacteria are an integral part of brain development and imm