**Configuration File Details:**
The configuration file (`nz.yml`) specifies the following for the dataset:
- **Data Type:** `custom`, which suits a structured instruction/input/output format.
- **Tokenizer:** LlamaTokenizer with specific special tokens.

**Dataset Structure:**
- **Fields:** "instruction", "input", "output".
- **Example:**
  ```json
  {
      "instruction": "Claim coders manages claims...",
      "input": "IncidentDescription: While working on a vehicle repair...",
      "output": "Reasoning: Contorting the body... - InjurySource: Bodily motion..."
  }
  ```

**Prompt Function:**
- Constructs the input using the given `incident_description`.
- Example Prompt:
  ```plaintext
  [INST] <<SYS>>
  Medical coders manages claims by reviewing...
  <</SYS>>

  IncidentDescription: {incident_description}
  [/INST]
  ```

### Validation Steps

1. **Check Data Type Compatibility:**
   Confirm that the dataset matches what the configuration expects. The `custom` type signals a tailored structure, so every record must carry the three fields listed above.
   ```yaml
   data_type: "custom"
   ```

### Format Validator

```python
import json

REQUIRED_FIELDS = ('instruction', 'input', 'output')

try:
    with open('synthec_data.json', encoding='utf-8') as f:
        data = json.load(f)

    # Every record must carry all three fields the custom format expects.
    for i, entry in enumerate(data):
        for field in REQUIRED_FIELDS:
            assert field in entry, f"Missing '{field}' in entry {i}"

    print("All entries have 'instruction', 'input', and 'output'.")
except AssertionError as e:
    print(f"AssertionError: {e}")
except Exception as e:
    print(f"An error occurred: {e}")
```
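Beyond key presence, it can help to check that each value is a non-empty string. The sketch below is one way to do that. The helper name `find_invalid_entries` and the sample records are hypothetical, and the non-empty-string rule is an assumption rather than something the configuration states.

```python
REQUIRED_FIELDS = ("instruction", "input", "output")

def find_invalid_entries(records):
    """Return (index, field, reason) tuples for every problem found."""
    problems = []
    for i, entry in enumerate(records):
        for field in REQUIRED_FIELDS:
            if field not in entry:
                problems.append((i, field, "missing"))
            elif not isinstance(entry[field], str):
                problems.append((i, field, "not a string"))
            elif not entry[field].strip():
                problems.append((i, field, "empty"))
    return problems

# Hypothetical records for illustration only.
sample_records = [
    {"instruction": "a", "input": "b", "output": "c"},
    {"instruction": "a", "input": "", "output": 3},
    {"instruction": "a"},
]
for problem in find_invalid_entries(sample_records):
    print(problem)
```

Unlike an assert-based check, this collects every problem in one pass instead of stopping at the first failure, which is convenient when cleaning a larger dataset.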


### Prompt Validator

Verify that the constructed prompt matches the input format used during fine-tuning.

```python
# Build the prompt as it is laid out in the Example Prompt template.
def prompt(incident_description):
    # Template lines are flush-left: indenting them inside the f-string would
    # add leading spaces that the Example Prompt does not contain.
    return f"""[INST] <<SYS>>
Workers Compensation Board manages claims...
<</SYS>>

IncidentDescription: {incident_description}
[/INST]
"""

sample_incident = "While working on a vehicle repair..."
sample_prompt = prompt(sample_incident)
print(sample_prompt)
```
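Printing the prompt relies on visual inspection. As a rough sketch, the check can be automated by confirming that the template markers appear in order and that no line carries stray leading whitespace. The function name `check_prompt_structure` and the two sample strings are hypothetical. The marker list is taken from the Example Prompt, and it is an assumption that these markers and the flush-left layout are what the fine-tuned model expects.

```python
MARKERS = ["[INST]", "<<SYS>>", "<</SYS>>", "IncidentDescription:", "[/INST]"]

def check_prompt_structure(text):
    """Return a list of problems; an empty list means the structure looks right."""
    problems = []
    position = 0
    # Each marker must appear after the previous one.
    for marker in MARKERS:
        found = text.find(marker, position)
        if found == -1:
            problems.append(f"marker {marker!r} missing or out of order")
        else:
            position = found + len(marker)
    # Flag lines that start with whitespace (assumed not to be in the template).
    for number, line in enumerate(text.splitlines(), start=1):
        if line != line.lstrip():
            problems.append(f"line {number} has leading whitespace")
    return problems

# Hypothetical prompts for illustration only.
good = "[INST] <<SYS>>\nSystem text...\n<</SYS>>\n\nIncidentDescription: x\n[/INST]\n"
bad = "[INST] <<SYS>>\n    System text...\n<</SYS>>\n[/INST]\n"
print(check_prompt_structure(good))
print(check_prompt_structure(bad))
```

Calling `check_prompt_structure(sample_prompt)` reports any deviation from this assumed layout.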

```python
# Ensure the model processes the structured input and returns the expected output.
# Relies on the prompt() function defined above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'NimaZahedinameghi/source_of_injury'
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

input_text = prompt("While working on a vehicle repair...")
inputs = tokenizer(input_text, return_tensors='pt')
# Without an explicit budget, generate() may fall back to a short default length
# and truncate the answer; 256 is an assumed value, adjust as needed.
outputs = model.generate(**inputs, max_new_tokens=256)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
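The decoded result usually contains the prompt followed by the completion, because a causal model returns its input tokens along with the new ones. Below is a minimal sketch for separating the two and extracting the fields, assuming the completion follows the `Reasoning: ... - InjurySource: ...` pattern shown in the dataset example. The helper name `parse_completion` and the sample string are hypothetical.

```python
import re

def parse_completion(decoded):
    """Split a decoded generation into reasoning and injury source."""
    # Keep only the text after the closing instruction tag, if present.
    completion = decoded.split("[/INST]")[-1].strip()
    match = re.search(
        r"Reasoning:\s*(?P<reasoning>.*?)\s*-\s*InjurySource:\s*(?P<source>.*)",
        completion,
        flags=re.DOTALL,
    )
    if match is None:
        # The completion did not follow the assumed pattern.
        return None
    return {
        "reasoning": match.group("reasoning").strip(),
        "injury_source": match.group("source").strip(),
    }

# Hypothetical decoded output for illustration only.
sample = (
    "[INST] ... [/INST] "
    "Reasoning: Contorting the body... - InjurySource: Bodily motion..."
)
print(parse_completion(sample))
```

Returning `None` on a mismatch makes it easy to count how many generations deviate from the expected output format.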