# Data Validation and Transformation Workflow

As part of the assignment, and as additional research, the goal of this task was to build a system that validates and transforms order data using a GenAI model. The workflow reads a JSON file of synthetic order transactions and validates each transaction with the model.

The task uses the Meta LLaMA 3.3 70B Instruct model, accessed through NVIDIA Enterprise's API. In this experiment the setup produced high-quality outputs and responded quickly, which suggests the model can handle this kind of multi-step validation task efficiently.

### Advantages if implemented

* Context-aware validation
* Automation of validation workflows
* Multi-task capability (validation, field extraction, and correction suggestions in a single prompt)

In [31]:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes
!pip install -U langchain
!pip install -U langchain-community
!pip install -U sentence-transformers

!pip install -U faiss-gpu


In [26]:
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="API-KEY"
)

def validate_and_transform_order(order_data):
    """
    Validate and transform order data using the GenAI model.
    Args:
        order_data (dict): The order data to validate and transform.
    Returns:
        str: The raw response content from the GenAI model (expected to be JSON text, not yet parsed).
    """
    prompt = f"""
    You are a data validation assistant. Validate the following order data:
    {json.dumps(order_data, indent=2)}

    Tasks:
    1. Ensure 'user_id', 'order_id', and 'order_value' are present and correctly typed.
    2. Verify 'order_value' equals the sum of (quantity * price_per_unit) for all items. If not, suggest corrections.
    3. Extract the fields 'user_id', 'order_value', and 'order_timestamp'.
    4. Highlight any errors or inconsistencies and propose fixes.
    Provide your response in JSON format.
    """
    completion = client.chat.completions.create(
        model="meta/llama-3.3-70b-instruct",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
        top_p=0.7,
        max_tokens=1024
    )

    response_content = completion.choices[0].message.content
    return response_content

input_file_path = "synthetic_order_data.json"
output_file_path = "validated_order_data.json"

with open(input_file_path, "r") as infile:
    orders = json.load(infile)

output_data = []
for order in orders:
    raw_response = validate_and_transform_order(order)
    output_data.append({
        "original_order": order,
        "genai_response": raw_response
    })

with open(output_file_path, "w") as outfile:
    json.dump(output_data, outfile, indent=2)

print(f"Validated data has been saved to {output_file_path}")


Validated data has been saved to validated_order_data.json
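The cell above stores each model reply as a raw string. If a downstream step needs structured data, the reply has to be parsed first, and chat models sometimes surround the JSON with commentary or a Markdown fence. The following is a minimal sketch of a tolerant parser; the helper name `parse_model_json` is hypothetical and not part of the original notebook.

```python
import json
import re


def parse_model_json(raw_response):
    """Best-effort extraction of JSON from a model reply (hypothetical helper).

    Returns the parsed value, or None if no valid JSON is found.
    """
    # Try the whole reply first: the prompt asks for JSON only.
    try:
        return json.loads(raw_response)
    except json.JSONDecodeError:
        pass

    # Fall back to the outermost {...} span, which also covers replies
    # wrapped in a Markdown fence or surrounded by commentary.
    match = re.search(r"\{.*\}", raw_response, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


# Usage: a reply with leading commentary still yields a dict.
sample_reply = 'Here is the result: {"user_id": "u1", "valid": true}'
parsed = parse_model_json(sample_reply)
```

Returning `None` instead of raising keeps the processing loop running when one reply is malformed; such records can then be flagged for manual review.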


**NOTE**: The synthetic data contained 10-15 transactions, a mix of noisy and clean records, which were used to check the model's responses.

In the resulting JSON file, the GenAI responses were 100% accurate on this sample: the model identified every validation check listed in the assignment and additionally captured context-rich information such as payment methods and locations.

**Doubts**

* How can this technique be integrated into data validation checks in cloud data pipelines?
* How can this use case be implemented in real working environments?
* Which additional data validation checks could be used to measure the model's accuracy?
* How feasible is this for real-time use cases?
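One way to approach the accuracy question is to recompute the arithmetic check deterministically and compare the result with the model's verdict for each order. The sketch below assumes each order holds its line items under an `items` key (an assumption, since the input schema is not shown); `quantity`, `price_per_unit`, and `order_value` are the field names used in the prompt. The function name `check_order_total` is hypothetical.

```python
import math


def check_order_total(order, tolerance=0.01):
    """Recompute order_value from line items and compare with the reported value.

    Assumes line items are stored under order["items"] (not confirmed by the source).
    """
    items = order.get("items", [])
    expected = sum(item["quantity"] * item["price_per_unit"] for item in items)
    reported = order.get("order_value")

    # A missing or non-numeric order_value fails the check outright.
    if isinstance(reported, bool) or not isinstance(reported, (int, float)):
        return {"ok": False, "expected": expected, "reported": reported}

    ok = math.isclose(reported, expected, abs_tol=tolerance)
    return {"ok": ok, "expected": expected, "reported": reported}


# Usage with a made-up order in the assumed shape.
example_order = {
    "order_value": 25.5,
    "items": [
        {"quantity": 2, "price_per_unit": 10.0},
        {"quantity": 1, "price_per_unit": 5.5},
    ],
}
result = check_order_total(example_order)
```

Counting how often the model's verdict agrees with this rule-based result over the whole dataset gives a simple accuracy figure for the arithmetic check.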

### Additional Fine Tuning Explorations

In [35]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

from langchain_community.embeddings import HuggingFaceEmbeddings  # moved out of langchain.embeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [36]:
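import os

# Hugging Face access token for the gated Llama weights.
# Assumption: the token is supplied via the HF_TOKEN environment variable.
hf_token = os.environ.get("HF_TOKEN")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct",  # alternatives: meta-llama/Meta-Llama-3.1-405B, meta-llama/Meta-Llama-3-8B-Instruct
    max_seq_length = 2048,
    dtype = None,         # None lets the library pick the dtype automatically
    load_in_4bit = True,  # 4-bit quantization to reduce memory use
    token = hf_token,
)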


In [None]:
# Apply LoRA (Low-Rank Adaptation) adapters to the model for parameter-efficient fine-tuning
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)
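To give a feel for why LoRA is parameter-efficient, the sketch below estimates how many trainable parameters the configuration above adds. For a weight of shape `(d_out, d_in)`, LoRA trains two low-rank matrices with `r * (d_in + d_out)` parameters in total. The layer dimensions used here (hidden size 4096, MLP size 14336, key/value projection size 1024, 32 layers) are assumptions about the Llama 3.1 8B architecture and do not come from the notebook.

```python
def lora_added_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


# Assumed Llama 3.1 8B dimensions (not taken from the notebook).
hidden, intermediate, kv_dim, n_layers = 4096, 14336, 1024, 32
r = 16  # same rank as in the get_peft_model call above

# (d_in, d_out) for each targeted projection in one transformer layer.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

per_layer = sum(lora_added_params(d_in, d_out, r) for d_in, d_out in shapes.values())
total = per_layer * n_layers
print(f"LoRA trainable parameters: {total:,}")
```

Under these assumptions the adapters come to roughly 42 million trainable parameters, well under 1% of an 8-billion-parameter base model. With `lora_alpha = 16` and `r = 16`, the LoRA scaling factor `lora_alpha / r` is 1.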