## Example: Structured data extraction, batch inference & evaluation
This notebook demonstrates how to perform basic structured data extraction using `ai_query` ([AWS](https://docs.databricks.com/aws/sql/language-manual/functions/ai_query) | [Azure](https://learn.microsoft.com/azure/databricks/sql/language-manual/functions/ai_query)). 

It shows how to turn raw, unstructured text into structured fields by prompting an LLM and constraining its output to a JSON schema.

This notebook also shows how to use Mosaic AI Agent Evaluation ([AWS](https://docs.databricks.com/aws/generative-ai/agent-evaluation/) | [Azure](https://learn.microsoft.com/azure/databricks/generative-ai/agent-evaluation/)) to evaluate extraction accuracy when ground truth data is available.

In [0]:
%pip install --upgrade "mlflow[databricks]>=3.1.0" 
dbutils.library.restartPython()

## Perform batch inference using `ai_query`
To demonstrate how to use `ai_query` for structured data extraction, this notebook creates a simulated dataset of employment contracts. This dummy dataset serves as a testbed for entity extraction, focusing on key information such as the employer and employee names. Each record also includes the ground truth for the fields to be extracted, which is used later for evaluation.

The notebook then runs batch inference on this dataset using `ai_query` ([AWS](https://docs.databricks.com/aws/sql/language-manual/functions/ai_query) | [Azure](https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/ai_query)).

In [0]:
# Some dummy employment contracts including ground truth

employment_contracts = [
    dict(
        contract_text="This Employment Agreement is made on May 15, 2023, between TechCorp Inc. (the 'Employer') and Sarah Johnson (the 'Employee'). The Employee will commence work as a Software Engineer on June 1, 2023, with an annual salary of $85,000. The Employee agrees to a probationary period of 3 months.",
        ground_truth='{"signature_date": "May 15, 2023", "employer": "TechCorp Inc.", "employee": "Sarah Johnson", "bonuses": ["N/A"]}'
    ),
    dict(
        contract_text="Employment Contract: Effective July 1, 2023, DataSystems LLC (hereinafter 'Employer') agrees to employ Michael Chen (hereinafter 'Employee') as a Data Analyst. The Employee's starting salary shall be $70,000 per annum. This agreement includes a non-compete clause effective for 12 months post-termination.",
        ground_truth='{"signature_date": "July 1, 2023", "employer": "DataSystems LLC", "employee": "Michael Chen", "bonuses": ["N/A"]}'
    ),
    dict(
        contract_text="On August 15, 2023, CloudNet Solutions ('Employer') and Emma Rodriguez ('Employee') enter into this employment agreement. The Employee is hired as a Network Administrator with a starting date of September 1, 2023. The annual compensation is set at $78,000, with a signing bonus of $5,000.",
        ground_truth='{"signature_date": "August 15, 2023", "employer":"CloudNet Solutions", "employee": "Emma Rodriguez", "bonuses":["$5,000 signing bonus"]}'
    ),
    dict(
        contract_text="This contract, dated October 1, 2023, is between AI Innovations Corp ('Employer') and Dr. James Lee ('Employee'). Dr. Lee is appointed as Chief Research Scientist, commencing on November 1, 2023. The base salary is $150,000 per year, with performance-based bonuses as outlined in Appendix A.",
        ground_truth='{"signature_date": "October 1, 2023", "employer": "AI Innovations Corp", "employee": "Dr. James Lee", "bonuses":["performance-based bonuses as outlined in Appendix A"]}'
    ),
]

employment_contracts_df = spark.createDataFrame(employment_contracts)
employment_contracts_df.display()

### Structured data extraction with `ai_query`
The next cell defines the main input required to perform structured data extraction with `ai_query`:
- The LLM endpoint name
- The prompt instructing the LLM to perform data extraction and to use JSON as response format
- The JSON schema of the response

In [0]:
import json

LLM_ENDPOINT_NAME = "databricks-meta-llama-3-3-70b-instruct"

PROMPT = """You are an AI assistant specialized in analyzing legal documents. 
Your task is to extract relevant information from a given contract document. 
Your output must be a structured JSON object.

Instructions:
1. Carefully read the entire contract document provided at the end of this prompt.
2. Extract the relevant information.
3. Present your findings in JSON format as specified below.

Important Notes:
- Extract only relevant information. 
- Consider the context of the entire contract when determining relevance.
- Do not be verbose, only respond with the correct format and information.
- Some questions may have no relevant excerpts. Just return "N/A" or ["N/A"] depending on the expected type in this case.
- Do not include additional JSON keys beyond the ones listed here.
- Do not include the same key multiple times in the JSON.

Expected JSON keys and explanation of what they are:
- signature_date: The signature date of the contract.
- employer: The name of the employer.
- employee: The name of the employee.
- bonuses: A list of any specific bonuses mentioned.

Contract to analyze: 
"""

response_format = json.dumps({
    "type": "json_schema",
    "json_schema": {
        "name": "employment_contract_extraction",
        "schema": {
            "type": "object",
            "properties": {
                "signature_date": {"type": "string"},
                "employer": {"type": "string"},
                "employee": {"type": "string"},
                "bonuses": {"type": "array", "items": {"type": "string"}},
            },
        },
        "strict": True,
    },
})
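
Even with a response format in place, it can be useful to check a response string locally before relying on it. The following is a minimal sketch and not part of the original notebook: `validate_response` and `EXPECTED_TYPES` are hypothetical names, and the check only covers the four keys and types declared in the schema above, using the standard library.

```python
import json

# Expected keys and Python types, mirroring the JSON schema defined above.
EXPECTED_TYPES = {
    "signature_date": str,
    "employer": str,
    "employee": str,
    "bonuses": list,
}


def validate_response(response: str) -> list[str]:
    """Return a list of problems found in an LLM response (empty if none).

    Hypothetical helper for illustration; it is not part of ai_query.
    """
    try:
        data = json.loads(response)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for key, expected_type in EXPECTED_TYPES.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], expected_type):
            problems.append(f"wrong type for key: {key}")
    # The prompt asks the model not to add keys beyond the listed ones.
    for key in sorted(data.keys() - EXPECTED_TYPES.keys()):
        problems.append(f"unexpected key: {key}")
    # Every bonus entry should be a string.
    bonuses = data.get("bonuses")
    if isinstance(bonuses, list) and not all(isinstance(b, str) for b in bonuses):
        problems.append("non-string entry in bonuses")
    return problems
```

An empty list means the response has the expected shape. It says nothing about whether the extracted values are correct, which is what the evaluation step later in this notebook addresses.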


### Batch inference
Below, `ai_query` is applied to the Spark dataframe as a SQL expression using the inputs defined above. The LLM's response, which is a JSON string, is parsed to extract the individual data points.

In [0]:
from pyspark.sql.functions import col, from_json

# define query
ai_query_expr = f"""
  ai_query(
    endpoint => '{LLM_ENDPOINT_NAME}',
    request => CONCAT('{PROMPT}', contract_text),
    responseFormat => '{response_format}',
    modelParameters => named_struct('temperature', 0.)
    ) AS response
  """

# Spark DDL schema used to parse the LLM's JSON response string into a struct
json_schema = "STRUCT<signature_date STRING, employee STRING, employer STRING, bonuses ARRAY<STRING>>"

# run the batch query and unpack the response
employment_contracts_df = employment_contracts_df.selectExpr(
    "*", ai_query_expr
).withColumn("parsed_response", from_json(col("response"), json_schema))

employment_contracts_df.display()
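
The query above interpolates `PROMPT` into a single-quoted SQL literal, so it only works while the prompt contains no single quote. One way to avoid that constraint is to keep the prompt in a DataFrame column and reference it by name. This is a sketch under the assumption that `ai_query` accepts a column-based expression for `request`, as it does for `contract_text` above. `build_ai_query_expr` and the `prompt` column name are hypothetical.

```python
def build_ai_query_expr(endpoint: str, response_format: str,
                        prompt_col: str = "prompt",
                        text_col: str = "contract_text") -> str:
    """Build the ai_query SQL expression without inlining the prompt text.

    Hypothetical helper: the prompt is read from the column `prompt_col`
    instead of being pasted into the SQL string.
    """
    return f"""
  ai_query(
    endpoint => '{endpoint}',
    request => CONCAT({prompt_col}, {text_col}),
    responseFormat => '{response_format}',
    modelParameters => named_struct('temperature', 0.0)
  ) AS response
  """


# Usage in the notebook (requires a Spark session, so shown as comments),
# where contracts_df stands for the DataFrame before ai_query was applied:
# from pyspark.sql.functions import lit
# with_prompt_df = contracts_df.withColumn("prompt", lit(PROMPT))
# result_df = with_prompt_df.selectExpr(
#     "*", build_ai_query_expr(LLM_ENDPOINT_NAME, response_format)
# )
```

The endpoint name and response format are still inlined as SQL literals here, so they must not contain single quotes either.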

## Evaluate the agent with Agent Evaluation

To assess the quality of the extraction, this notebook uses the Agent Evaluation framework ([AWS](https://docs.databricks.com/en/generative-ai/agent-evaluation/index.html) | [Azure](https://learn.microsoft.com/azure/databricks/generative-ai/agent-evaluation/)). This approach uses a correctness judge to compare the expected entities (or facts) with the actual response for each contract.

_Note: An alternative approach would be to compute metrics such as `recall` and `precision` for individual entities, though this would require additional data transformations or [custom metrics](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#mlflow-metrics)._
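
The alternative mentioned in the note can be sketched in plain Python. This is an illustrative sketch, not the notebook's evaluation method: `field_precision_recall` is a hypothetical helper, it assumes predictions and ground truths are paired JSON strings, and it counts a value as correct only on an exact, case-insensitive string match. The sample strings in the usage part are made up for illustration.

```python
import json


def field_precision_recall(predictions, ground_truths, field, missing="N/A"):
    """Compute precision and recall for one field over paired JSON strings.

    Values equal to `missing` (or absent keys) count as "nothing extracted".
    List-valued fields such as `bonuses` are compared item by item.
    """
    def items(record):
        value = json.loads(record).get(field, missing)
        values = value if isinstance(value, list) else [value]
        return {str(v).strip().lower() for v in values if v != missing}

    true_pos = n_predicted = n_expected = 0
    for prediction, truth in zip(predictions, ground_truths):
        predicted, expected = items(prediction), items(truth)
        true_pos += len(predicted & expected)
        n_predicted += len(predicted)
        n_expected += len(expected)

    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_expected if n_expected else 0.0
    return precision, recall


# Made-up example: the second prediction shortens the employer name
# and adds a bonus that is not in the ground truth.
truths = [
    '{"employer": "TechCorp Inc.", "bonuses": ["N/A"]}',
    '{"employer": "CloudNet Solutions", "bonuses": ["$5,000 signing bonus"]}',
]
preds = [
    '{"employer": "TechCorp Inc.", "bonuses": ["N/A"]}',
    '{"employer": "CloudNet", "bonuses": ["$5,000 signing bonus", "annual bonus"]}',
]
print(field_precision_recall(preds, truths, "employer"))  # (0.5, 0.5)
print(field_precision_recall(preds, truths, "bonuses"))   # (0.5, 1.0)
```

Exact matching is strict: a paraphrased bonus description would count as a miss, which is one reason an LLM-based correctness judge can be a better fit for free-text fields.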

In [0]:
import mlflow
from mlflow.genai import evaluate
from mlflow.genai.scorers import Correctness

# prepare the evaluation dataframe expected by mlflow.genai.evaluate()
eval_pdf = employment_contracts_df.select(
    col("contract_text"),
    col("response").alias("outputs"),
    col("ground_truth")
).toPandas()

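# Wrap each contract text in the dict format expected for 'inputs'
eval_pdf["inputs"] = eval_pdf["contract_text"].apply(
    lambda x: {"contract_text": x}
)
# Build one expected fact per ground-truth field. Parsing the JSON (json is
# imported in an earlier cell) avoids splitting on commas inside values
# such as "May 15, 2023".
eval_pdf["expectations"] = eval_pdf["ground_truth"].apply(
    lambda x: {"expected_facts": [f"{k}: {v}" for k, v in json.loads(x).items()]}
)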
eval_pdf = eval_pdf.drop(columns=["contract_text", "ground_truth"])

# run evaluation and track results in mlflow experiment
with mlflow.start_run():
    results = mlflow.genai.evaluate(
        data=eval_pdf,
        scorers=[Correctness()]
    )

## Next steps
For related examples of structured data extraction on Databricks, see these technical blog posts:
- [End-to-End Structured Extraction with LLM – Part 1: Batch Entity Extraction](https://community.databricks.com/t5/technical-blog/end-to-end-structured-extraction-with-llm-part-1-batch-entity/ba-p/98396)
- [End-to-End Structured Extraction with LLM – Part 2: Fine-Tuning with Synthetic Data](https://community.databricks.com/t5/technical-blog/end-to-end-structured-extraction-with-llm-part-2-fine-tuning/ba-p/99900)