# OpenSEM Data Forge Tutorial

This notebook demonstrates how to use the **Data Forge** module in OpenSEM. 
We will cover:
1.  Setting up the environment.
2.  Running the default `TextForge` strategy.
3.  Creating a custom Forge strategy (here, LLM-based PII masking).
4.  Using the custom strategy from the OpenSEM CLI.

## 1. Setup and Configuration

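First, we need to ensure we can import the `opensem` package from the `src` directory. The cell below assumes the notebook is run from a directory one level below the project root; adjust `project_root` if your layout differs.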

In [None]:
import sys
import os

# Add the src directory to the python path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
src_path = os.path.join(project_root, 'src')
if src_path not in sys.path:
    sys.path.insert(0, src_path)

print(f"Project Root: {project_root}")
print(f"Source Path: {src_path}")

# Import OpenSEM modules
from opensem.forge import TextForge, BaseForge

## 2. Initialize the Default Pipeline (TextForge)

Rather than reading `data_config.yaml` from disk, we define an equivalent configuration for the `testsem` project inline and use it to initialize `TextForge`.
`TextForge` is the default strategy for processing text files (.txt, .md, .pdf).

In [None]:
# Define configuration (simulating what's in data_config.yaml)
config = {
    "raw_data_dir": os.path.join(project_root, "data", "testsem", "raw"),
    "processed_data_dir": os.path.join(project_root, "data", "testsem", "processed"),
    "params": {
        "teacher_model": "gemini-2.0-flash",
        "max_chars_per_doc": 1000, # Small limit for tutorial
        "chunk_size": 500
    }
}

# Initialize the Forge
forge = TextForge(config)
print(f"Initialized {forge.__class__.__name__}")
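
This notebook does not show how `TextForge` applies `max_chars_per_doc` and `chunk_size`. One plausible reading, sketched here purely as an assumption, is that each document is first truncated and then split into fixed-size pieces. `chunk_text` is a hypothetical helper for illustration, not part of OpenSEM.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 1000, chunk_size: int = 500) -> List[str]:
    """Truncate text to max_chars, then split it into pieces of at most chunk_size characters."""
    truncated = text[:max_chars]
    return [truncated[i:i + chunk_size] for i in range(0, len(truncated), chunk_size)]


# A 1200-character document is cut to 1000 characters, then split in two.
chunks = chunk_text("a" * 1200)
print([len(c) for c in chunks])  # [500, 500]
```

With the tutorial values above, no document would contribute more than two chunks under this reading, which keeps token usage small.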

## 3. Execute the Standard Workflow

We can run the individual steps manually to see what's happening, or call `forge.run()` for the full pipeline.
Here, we'll step through it.

### Step 3.1: Load Data

In [None]:
raw_data = forge.load_data()
print(f"Loaded {len(raw_data)} documents.")
if raw_data:
    print(f"First document preview:\n{raw_data[0][:200]}...")

### Step 3.2: Synthesize Data

This step uses the configured LLM (Gemini in this case) to generate instruction-output pairs.
*Note: Ensure you have your `.env` file with `GEMINI_API_KEY` in the project root.*

In [None]:
# TextForge falls back to mock synthesis when no API key is found
# and calls the real model otherwise; the class handles this internally.
synthesized_data = forge.synthesize(raw_data)

print(f"Synthesized {len(synthesized_data)} examples.")
if synthesized_data:
    print("Sample Example:")
    print(synthesized_data[0])

### Step 3.3: Format and Save

Finally, `format_data` writes the synthesized examples in JSONL format to `processed_data_dir`.

In [None]:
forge.format_data(synthesized_data)
print(f"Data saved to {config['processed_data_dir']}")
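
To confirm what was written, you can read the JSONL back. The sketch below is a minimal, assumed approach: `read_jsonl` is a hypothetical helper (not part of OpenSEM), and because this notebook does not show the output file name, the demo runs on a temporary file instead of the real output.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Return one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Demo on a throwaway file. In the notebook you would instead loop over
# Path(config["processed_data_dir"]).glob("*.jsonl").
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.jsonl"
    demo_file.write_text(
        json.dumps({"instruction": "Summarize.", "output": "A summary."}) + "\n",
        encoding="utf-8",
    )
    records = read_jsonl(demo_file)

print(len(records), sorted(records[0]))  # 1 ['instruction', 'output']
```

Globbing for `*.jsonl` avoids hard-coding a file name that may differ between OpenSEM versions.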

## 4. Advanced: Custom PII Masking with LLMs

The real power of OpenSEM is its extensibility. 
Suppose you want to create a dataset for **PII Redaction** (specifically removing names). Regex is brittle for names, so we will use an LLM to intelligently identify and redact them.

We will create a `PIIMaskingForge` that:
1.  Inherits from `TextForge`.
2.  Overrides `synthesize` to ask the LLM to replace all names with `[PERSON]`.
3.  Generates training pairs where `input` is the original text and `output` is the redacted text.

In [None]:
import os
from typing import List, Dict, Any
from dotenv import load_dotenv
import google.generativeai as genai
from opensem.forge import TextForge

class PIIMaskingForge(TextForge):
    def synthesize(self, raw_data: List[str]) -> List[Dict[str, Any]]:
        print("Running PII Masking Synthesis (LLM-Based)...")
        
        # Setup Gemini
        load_dotenv()
        api_key = os.getenv("GEMINI_API_KEY")
        if not api_key:
            print("No API Key found. Returning empty list.")
            return []
            
        # Use the configured model or default
        model_name = self.config.get('params', {}).get('teacher_model', 'gemini-2.0-flash')
        genai.configure(api_key=api_key)
        model = genai.GenerativeModel(model_name)
        
        synthesized_data = []
        
        for doc in raw_data:
            # We'll process the first 500 chars for this demo to save tokens
            chunk = doc[:500]
            
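            # Build the prompt line by line so the cell's indentation
            # is not sent to the model as part of the text.
            prompt = (
                "You are a PII Redaction engine.\n"
                "Replace all names of people in the following text with the token [PERSON].\n"
                "Do not change anything else. Do not add any conversational text.\n\n"
                f"Text:\n{chunk}"
            )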
            
            try:
                response = model.generate_content(prompt)
                redacted_text = response.text.strip()
                
                synthesized_data.append({
                    "instruction": "Redact all person names from the text.",
                    "input": chunk,
                    "output": redacted_text
                })
            except Exception as e:
                print(f"Error processing chunk: {e}")
        
        return synthesized_data
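
Because the model's reply is stored as the training `output` without inspection, it can help to filter out replies that ignore the instructions. The following is a rough heuristic sketch, not part of OpenSEM: `check_redaction` and `PLACEHOLDER` are hypothetical names, and the check only verifies that the placeholder is present and that no new words were introduced.

```python
import re

PLACEHOLDER = "[PERSON]"


def check_redaction(original: str, redacted: str) -> bool:
    """Heuristic: the output must contain the placeholder and add no new words."""
    if PLACEHOLDER not in redacted:
        return False
    original_words = set(re.findall(r"\w+", original))
    # Drop the placeholder, then require every remaining word to come from the input.
    remaining_words = re.findall(r"\w+", redacted.replace(PLACEHOLDER, " "))
    return all(word in original_words for word in remaining_words)


source = "Prince Vasili entered, wearing his embroidered court uniform."
print(check_redaction(source, "[PERSON] entered, wearing his embroidered court uniform."))  # True
print(check_redaction(source, "Sure! Here is the text: [PERSON] entered."))  # False
```

Note that this check rejects inputs that contain no names at all, since their correct output has no `[PERSON]` token; relax the first condition if your data includes such texts.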

## 5. Execute Custom Workflow

Now we instantiate and run our `PIIMaskingForge` directly in the notebook.
We'll use some sample text reminiscent of *War and Peace* to test the name redaction.

In [None]:
# Define sample data
sample_docs = [
    "Anna Pavlovna was coughing for the last few days. She had an attack of la grippe, as she said.",
    "Prince Vasili entered, wearing his embroidered court uniform.",
    "He spoke to his daughter, the beautiful Helene, about the party."
]

# Initialize our custom forge
# We pass the model name in the config params
custom_config = {
    "params": {
        "teacher_model": "gemini-2.0-flash"
    }
}
forge_instance = PIIMaskingForge(config=custom_config)

# Run synthesis
results = forge_instance.synthesize(sample_docs)

# Display results
print(f"\nGenerated {len(results)} pairs.\n")
for i, res in enumerate(results):
    print(f"--- Example {i+1} ---")
    print(f"Input:  {res['input']}")
    print(f"Output: {res['output']}")
    print()

## Conclusion

In this tutorial, you learned:
1.  **Default Flow**: How `TextForge` uses an LLM to synthesize instruction pairs from raw text.
2.  **Custom Flow**: How to implement a `PIIMaskingForge` that uses an LLM, rather than brittle regex rules, to create a task-specific dataset (name redaction).
3.  **Integration**: How to plug your custom class into the OpenSEM CLI using `data_config.yaml` (covered in Section 6 below).

You can now build other kinds of data pipelines (image captioning, time-series forecasting, etc.) by extending `BaseForge`!

## 6. How to Use in Your Project

To use this custom strategy in your actual OpenSEM project (via the CLI), follow these steps:

1.  **Save the Class**: Save the `PIIMaskingForge` class code into a Python file within your project structure.
    *   Example: `src/my_project/forge/pii_forge.py`

2.  **Update Configuration**: Edit your `configs/<project_name>/data_config.yaml` file to point to this class.
    ```yaml
    # configs/my_project/data_config.yaml
    forge_class: "my_project.forge.pii_forge.PIIMaskingForge"
    params:
      teacher_model: "gemini-2.0-flash"
    ```

3.  **Run the CLI**:
    ```bash
    python opensem.py run-forge --project my_project
    ```

OpenSEM will dynamically load your class and use it to process your data!
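
For intuition, dynamic loading from a dotted path such as the `forge_class` value is commonly done with `importlib`. The sketch below shows that general pattern under the assumption that the path has the form `package.module.ClassName`; `load_class` is a hypothetical helper and not necessarily how OpenSEM implements it.

```python
import importlib


def load_class(dotted_path: str):
    """Import the module part of 'pkg.module.ClassName' and return the named attribute."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Demonstrated with a standard-library class so the example runs anywhere.
cls = load_class("collections.OrderedDict")
print(cls.__name__)  # OrderedDict
```

For this pattern to find `my_project.forge.pii_forge.PIIMaskingForge`, the directory containing `my_project` must be importable, for example by being on `sys.path` as in the setup cell of this notebook.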