# Query Generation for Verisk Assistant Testing

This notebook builds a set of test queries for the Verisk OSHA & Risk Assessment Assistant. Queries are grouped into three categories (`OSHA`, `risk_assessment`, and `out_of_scope`) and saved for later test runs.

In [1]:
import pandas as pd
import json
from datetime import datetime
from pathlib import Path

# Create data directory if it doesn't exist
Path('data').mkdir(exist_ok=True)

In [2]:
# Define query categories and example queries
test_queries = {
    "OSHA": [
        # Basic OSHA Questions
        "What PPE is required for welding?",
        "When do I need to wear safety glasses?",
        "What are OSHA requirements for fall protection?",
        "How often should fire extinguishers be inspected?",
        "What are the OSHA regulations for ladder safety?",
        
        # Workplace Safety
        "What are the requirements for emergency exits?",
        "How should hazardous materials be stored?",
        "What are the noise level limits in the workplace?",
        "What are the requirements for proper ventilation?",
        "How should chemical spills be handled?",
        
        # Industry-Specific
        "What are the safety requirements for construction sites?",
        "What PPE is needed for healthcare workers?",
        "What are the safety regulations for warehouse operations?",
        "What are the requirements for machine guarding?",
        "What are the safety standards for electrical work?",
        
        # Documentation & Training
        "How often should safety training be conducted?",
        "What safety documents need to be posted in the workplace?",
        "What records need to be kept for workplace injuries?",
        "How long should safety training records be maintained?",
        "What are the requirements for safety data sheets?"
    ],
    "risk_assessment": [
        # General Risk Assessment
        "What are the risks associated with remote work?",
        "How do I assess cybersecurity risks?",
        "What are common supply chain risks?",
        "How should I evaluate vendor risks?",
        "What are the risks of expanding to international markets?",
        
        # Industry-Specific Risks
        "What are the main risks in manufacturing?",
        "What risks should a retail business consider?",
        "What are the primary risks in healthcare operations?",
        "What are the risks in the construction industry?",
        "What are the main risks in food service?",
        
        # Financial Risks
        "How do I assess credit risk?",
        "What are the risks of different payment methods?",
        "How should I evaluate investment risks?",
        "What are the risks of offering customer financing?",
        "How do I assess currency exchange risks?"
    ],
    "out_of_scope": [
        # General Business
        "How do I start a business?",
        "What's the best marketing strategy?",
        "How do I hire employees?",
        "What accounting software should I use?",
        
        # Technical
        "How do I build a website?",
        "What programming language should I learn?",
        "How do I set up a database?",
        "What's the best CRM system?",
        
        # Other
        "What's the weather like today?",
        "How do I cook pasta?",
        "What's the capital of France?",
        "When is the next solar eclipse?",
        "How do I train my dog?"
    ]
}

In [3]:
# Convert to DataFrame for easier manipulation
queries_list = []
for category, questions in test_queries.items():
    for question in questions:
        queries_list.append({
            'category': category,
            'query': question,
            'expected_category': category  # For validation purposes
        })

df_queries = pd.DataFrame(queries_list)
print(f"Total queries generated: {len(df_queries)}")
display(df_queries.head())

# Show distribution of queries across categories
print("\nDistribution of queries across categories:")
display(df_queries['category'].value_counts())

Total queries generated: 48


| | category | query | expected_category |
|---|---|---|---|
| 0 | OSHA | What PPE is required for welding? | OSHA |
| 1 | OSHA | When do I need to wear safety glasses? | OSHA |
| 2 | OSHA | What are OSHA requirements for fall protection? | OSHA |
| 3 | OSHA | How often should fire extinguishers be inspected? | OSHA |
| 4 | OSHA | What are the OSHA regulations for ladder safety? | OSHA |



Distribution of queries across categories:


category
OSHA               20
risk_assessment    15
out_of_scope       13
Name: count, dtype: int64
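
The introduction says the queries are saved, and the example filename in the helper section (`test_queries_20241204_120000.csv`) suggests a timestamped file under `data/`. The save step itself is not shown here, so the following is only a minimal sketch of what it might look like. The function name `save_test_queries`, the matching JSON file, and the `YYYYMMDD_HHMMSS` timestamp pattern are assumptions inferred from that filename, not the notebook's actual code.

```python
import json
from datetime import datetime
from pathlib import Path

import pandas as pd


def save_test_queries(df: pd.DataFrame, queries: dict, out_dir: str = "data") -> dict:
    """Write queries to timestamped CSV and JSON files (hypothetical helper)."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    # Timestamp pattern assumed from the example filename: YYYYMMDD_HHMMSS
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    csv_file = out_path / f"test_queries_{timestamp}.csv"
    json_file = out_path / f"test_queries_{timestamp}.json"

    # index=False keeps the row index out of the CSV, so reloading it
    # with pd.read_csv does not produce an extra "Unnamed: 0" column.
    df.to_csv(csv_file, index=False)

    # The JSON file keeps the original {category: [queries]} layout.
    with open(json_file, "w", encoding="utf-8") as f:
        json.dump(queries, f, indent=2, ensure_ascii=False)

    return {"csv": csv_file, "json": json_file}
```

In the notebook this could be called as `save_test_queries(df_queries, test_queries)`. Writing both formats means either branch of a CSV/JSON loader can read the result.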

## Helper Function for Testing

The following function loads saved test queries from a CSV or JSON file and returns them as a DataFrame:

In [5]:
def load_test_queries(file_path: str, format: str = 'csv') -> pd.DataFrame:
    """Load test queries from a file.
    
    Args:
        file_path (str): Path to the query file
        format (str): File format ('csv' or 'json')
        
    Returns:
        pd.DataFrame: DataFrame containing the queries
    """
    if format == 'csv':
        return pd.read_csv(file_path)
    elif format == 'json':
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
        queries_list = []
        for category, questions in data.items():
            for question in questions:
                queries_list.append({
                    'category': category,
                    'query': question,
                    'expected_category': category
                })
        return pd.DataFrame(queries_list)
    else:
        raise ValueError(f"Unsupported format: {format}")

# Example usage (assigned to a new name so the `test_queries` dict defined earlier is not overwritten):
# df_loaded = load_test_queries('data/test_queries_20241204_120000.csv', format='csv')
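
The `expected_category` column is described as being for validation, so one possible use is to compare it with the category the assistant assigns to each query. The sketch below computes accuracy per expected category. The function `accuracy_by_category` and the stand-in `keyword_classifier` are hypothetical: the real assistant's routing interface is not shown in this notebook, so a trivial keyword rule takes its place.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Mapping


def accuracy_by_category(
    records: Iterable[Mapping[str, str]],
    classify: Callable[[str], str],
) -> Dict[str, float]:
    """Return the fraction of correctly classified queries per expected category."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        expected = record["expected_category"]
        totals[expected] += 1
        if classify(record["query"]) == expected:
            correct[expected] += 1
    return {category: correct[category] / totals[category] for category in totals}


def keyword_classifier(query: str) -> str:
    """Placeholder for the assistant's routing logic; NOT the real system."""
    text = query.lower()
    if "osha" in text or "ppe" in text:
        return "OSHA"
    if "risk" in text:
        return "risk_assessment"
    return "out_of_scope"


# A few queries taken from the lists defined in this notebook.
records = [
    {"query": "What PPE is required for welding?", "expected_category": "OSHA"},
    {"query": "How should chemical spills be handled?", "expected_category": "OSHA"},
    {"query": "How do I assess credit risk?", "expected_category": "risk_assessment"},
    {"query": "How do I cook pasta?", "expected_category": "out_of_scope"},
]

scores = accuracy_by_category(records, keyword_classifier)
print(scores)
# {'OSHA': 0.5, 'risk_assessment': 1.0, 'out_of_scope': 1.0}
```

The function takes plain dictionaries so it can be tried without the assistant. A loaded DataFrame can be passed in as `df.to_dict('records')`, and `keyword_classifier` would be replaced by a call to the system under test.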