In [9]:
from dotenv import load_dotenv
load_dotenv()

True

In [10]:
import pandas as pd
question_type = [
    "Answerable","Answerable","Answerable","Partially Answerable","Partially Answerable","Unanswerable","Unanswerable","Answerable"
]
inputs = [
    "What are the seven minimum items that should be included in an employee handbook according to this template?",
    "Under what conditions is an employee eligible for leave under the federal Family and Medical Leave Act (FMLA)?",
    "What is the company's policy regarding the use of \"pirated\" software or installing new software on company computers?",
    "How many days of sick leave are employees granted each year, and are they paid or unpaid?",
    "What specific holidays does the company observe, and are employees paid for these days?",
    "What is the specific dollar amount of the replacement fee if an employee loses their paycheck?",
    "Does the company provide a \"Childcare Subsidy\" or \"Commuter Benefits\" for full-time employees?",
    "What is the policy on pirated software?"

]

outputs = [
    "According to the provided template, the seven minimum items that should be included in an employee handbook are:1. **Employment at-will disclaimer (section 1.3)**: A statement indicating that employment is at-will, meaning either the employee or the company may terminate employment at any time, with or without reason.2. **Equal employment opportunity statement (section 2.1)**: A statement regarding equal employment opportunity, emphasizing the company's commitment to providing equal opportunities for all employees.3. **Policy prohibiting unlawful discrimination and harassment (section 2.2)**: A policy that prohibits unlawful discrimination and harassment in the workplace.4. **Section on company property and privacy rules (section 3)**: A section that describes the policy for use of company property and privacy rules, including expectations for employee behavior and consequences for misuse.5. **Section on employment classification and overtime rules**: A section that explains the company's employment classification system (e.g., exempt vs. non-exempt) and overtime rules.6. **A policy on Family and Medical Leave (section 6.3)**: A policy that outlines the company's Family and Medical Leave (FMLA) policy, including eligibility, leave procedures, and return-to-work requirements (applicable to companies with 50 or more employees).7. **A section on Safety (section 9)**: A section that outlines the company's safety policies and procedures, including emergency protocols and employee responsibilities.",
    "According to the provided template, an employee is eligible for leave under the federal Family and Medical Leave Act (FMLA) if they have worked as an employee of the company for at least 1,250 hours for twelve months. This eligibility criterion is a requirement for FMLA leave.Additionally, the employee must be taking leave for one of the following reasons:1. The birth of a child and to care for the newborn child;2. Placement of a child into adoptive or foster care with the employee;3. Care for a spouse, son, daughter, or parent who has a serious health condition; or4. Care for the employee's own serious health condition.It's worth noting that the company may require employees to exhaust their accrued paid leave at the same time they are on FMLA leave, if the company has 50 or more employees.",
    "According to the provided context, the company has the following policy regarding the use of \"pirated\" software or installing new software on company computers:* Employees are prohibited from downloading any \"pirated\" software, files, or programs.* Employees must receive permission from a supervisor before installing any new software on a company computer.* Files or programs stored on company computers may not be copied for personal use.This policy is in place to protect the company's computer systems and data from potential security risks and to ensure that all software used on company computers is legitimate and authorized.",
    "The document states that sick days are granted to regular employees, but the specific number of days is not defined and must be filled in by the employer. Similarly, the handbook provides a placeholder to designate whether these days are paid or unpaid.",
    "The company observes New Year's Day, Martin Luther King, Jr. Day, Memorial Day, Independence Day, Labor Day, Thanksgiving, and Christmas Day. However, whether these are paid or unpaid depends on the specific employer's selection in the template.",
    "The document states that the company reserves the right to charge a replacement fee for lost paychecks,  but it does not specify the actual cost or dollar amount of that fee.",
    "The provided document does not mention childcare subsidies or commuter benefits. It lists general benefits like health insurance, retirement plans, and workers' compensation.",
    "According to Section 3.4, employees are not permitted to download any \"pirated\" software, files, or programs and must receive permission from a supervisor before installing any new software. Additionally, employees should have no expectation of privacy when using company computers.",
]

# Dataset
qa_pairs = [
    {"question_type": qt, "question": q, "answer": a}
    for qt, q, a in zip(question_type, inputs, outputs)
]
df = pd.DataFrame(qa_pairs)

# Write to csv
csv_path = "/home/deblina/Documents/projects/neura dynamics assignment/data/goldens.csv"
df.to_csv(csv_path, index=False)
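
`zip` stops at the shortest input, so a missing entry in any of the three lists would silently drop rows from the CSV. Below is a minimal sketch of a length check before building the pairs. The helper name `build_qa_pairs` is hypothetical and not part of the original notebook, and the inputs are tiny illustrative values rather than the real goldens.

```python
def build_qa_pairs(question_types, questions, answers):
    """Zip the three lists into row dicts, refusing mismatched lengths."""
    lengths = {len(question_types), len(questions), len(answers)}
    if len(lengths) != 1:
        raise ValueError(
            f"List lengths differ: types={len(question_types)}, "
            f"questions={len(questions)}, answers={len(answers)}"
        )
    return [
        {"question_type": qt, "question": q, "answer": a}
        for qt, q, a in zip(question_types, questions, answers)
    ]


# Illustrative inputs only (not the real golden set)
rows = build_qa_pairs(["Answerable"], ["What is X?"], ["X is Y."])
print(rows)
```

On Python 3.10 or later, `zip(..., strict=True)` performs the same check without a helper.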



In [11]:
from langsmith import Client
client = Client()

dataset_name = "company-policy-goldens"

dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Input and expected output pair for Company Policy"
)

client.create_examples(
    inputs=[{"question":q} for q in inputs],
    outputs=[{"answer":a} for a in outputs],
    dataset_id=dataset.id
)



{'example_ids': ['c63e86f8-514e-4b97-bef9-7ff3e667774b',
  '56827991-ef56-49ae-9e39-21c0357d5872',
  'a622cb02-8345-4191-8a0b-51e53ccb96a8',
  'd8dc3f76-5efd-4bf8-b1d6-24e95f50d4b0',
  'c3da2263-adb0-48cb-9126-b81f3c802095',
  'acd74396-c871-455e-9dd3-ce87e25b30ae',
  '532e55d2-ede3-4b72-83b0-d101786abe94',
  'f77afa60-dd6c-4c5c-9fa4-05f6dced4ce0'],
 'count': 8,
 'as_of': '2026-02-07T06:37:42.669516812Z'}
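
Re-running the cell above will typically fail, because a dataset named `company-policy-goldens` already exists after the first run. Below is a hedged sketch of a get-or-create wrapper. `get_or_create_dataset` is a hypothetical helper, and it assumes the installed `langsmith` client exposes `has_dataset` and `read_dataset` with a `dataset_name` keyword (check your SDK version). The `FakeClient` class is a stand-in so the sketch runs without network access. It is not part of LangSmith.

```python
def get_or_create_dataset(client, name, description=""):
    """Return (dataset, created) without failing when the dataset already exists."""
    if client.has_dataset(dataset_name=name):
        return client.read_dataset(dataset_name=name), False
    return client.create_dataset(dataset_name=name, description=description), True


class FakeClient:
    """In-memory stand-in for langsmith.Client, for illustration only."""

    def __init__(self):
        self._datasets = {}

    def has_dataset(self, *, dataset_name):
        return dataset_name in self._datasets

    def read_dataset(self, *, dataset_name):
        return self._datasets[dataset_name]

    def create_dataset(self, dataset_name, description=""):
        self._datasets[dataset_name] = {"name": dataset_name, "description": description}
        return self._datasets[dataset_name]


fake = FakeClient()
first, created_first = get_or_create_dataset(fake, "company-policy-goldens")
second, created_second = get_or_create_dataset(fake, "company-policy-goldens")
print(created_first, created_second)
```

With a real client, calling `create_examples` only when `created` is true avoids uploading the eight examples a second time.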

In [12]:
import sys
sys.path.append("/home/deblina/Documents/projects/neura dynamics assignment")
from pathlib import Path
from company_policy_chat.src.document_ingestion.ingestion import Ingestion
from company_policy_chat.src.document_retrieval.retrieval import Retrieval



class LocalFileAdapter:
    def __init__(self, file_path: str):
        self.path = Path(file_path)
        self.filename = self.path.name # Matches UploadFile property
    
    def read(self, size=-1):
        """Standard file read method."""
        return self.path.read_bytes()

    # Optional: Add seek for compatibility
    def seek(self, offset, whence=0):
        pass


def answer_ai_report_question(
    inputs: dict,
    data_path: str = "/home/deblina/Documents/projects/neura dynamics assignment/data/Small_Business_Administration_Employee_polich_Template.pdf",
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
) -> dict:
    
    try:
        # Extract question from inputs
        question = inputs.get("question", "")
        if not question:
            return {"answer": "No question provided"}
        
        # Check if file exists
        if not Path(data_path).exists():
            return {"answer": f"Data file not found: {data_path}"}
        
        # Create file adapter
        file_adapter = LocalFileAdapter(data_path)
        
        # Set up the ingestor (Ingestion class) with its working directories
        ingestor = Ingestion(
            temp_base="/home/deblina/Documents/projects/neura dynamics assignment/data",
            faiss_base="/home/deblina/Documents/projects/neura dynamics assignment/faiss_index",
        )
        
        # Build (or update) the FAISS index from the file
        ingestor.build_index(
            uploaded_files=[file_adapter],
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
        )
        
        index_path = "/home/deblina/Documents/projects/neura dynamics assignment/faiss_index"
        
        # Create RAG instance and load retriever
        rag = Retrieval()
        rag.load_retriever_from_faiss(
            index_path=index_path,
            index_name="index"
        )
        
        # Get answer
        answer = rag.invoke(question, chat_history=[])
        
        return {"answer": answer}
        
    except Exception as e:
        return {"answer": f"Error: {str(e)}"}
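
`LocalFileAdapter.read` above ignores `size`, and `seek` is a no-op. That is fine if the ingestion code reads the file once in full. If a consumer reads in chunks or seeks back to the start, an adapter backed by `io.BytesIO` behaves like a real file object. Below is a sketch under that assumption. `BufferedFileAdapter` is a hypothetical name, and the notebook does not confirm whether `Ingestion` needs this behavior.

```python
import io
import tempfile
from pathlib import Path


class BufferedFileAdapter:
    """File-like wrapper exposing .filename plus working read/seek."""

    def __init__(self, file_path):
        path = Path(file_path)
        self.filename = path.name  # same attribute the original adapter exposes
        self._buffer = io.BytesIO(path.read_bytes())

    def read(self, size=-1):
        # Honors size and advances the position, like a regular file
        return self._buffer.read(size)

    def seek(self, offset, whence=0):
        return self._buffer.seek(offset, whence)


# Demo on a throwaway file (not the real policy PDF)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.pdf"
    demo_path.write_bytes(b"%PDF-demo")
    adapter = BufferedFileAdapter(demo_path)
    head = adapter.read(4)
    adapter.seek(0)
    whole = adapter.read()
print(adapter.filename, head, whole)
```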



In [13]:


# Test the function with a sample question
test_input = {"question": "Does the company follow a strict \"progressive discipline\" system (e.g., verbal warning, then written, then suspension) for all employee misconduct?"}
result = answer_ai_report_question(test_input)
print("Question:", test_input["question"])
print("\nAnswer:", result["answer"])



2026-02-07 12:07:42 [info     ] Ingestion initialized          faiss_dir='/home/deblina/Documents/projects/neura dynamics assignment/faiss_index' temp_dir='/home/deblina/Documents/projects/neura dynamics assignment/data'
2026-02-07 12:07:42 [info     ] File saved                     file=Small_Business_Administration_Employee_polich_Template.pdf size=3625853
2026-02-07 12:07:43 [info     ] Documents loaded               count=35
2026-02-07 12:07:43 [info     ] Documents split into chunks    chunk_overlap=200 chunk_size=1000 chunks=79


Loading embedding model: BBAI/bge-small-en-v1.5


2026-02-07 12:07:43 [info     ] Loading existing FAISS index   path='/home/deblina/Documents/projects/neura dynamics assignment/faiss_index'
2026-02-07 12:07:43 [info     ] No new documents to add — index is up to date
2026-02-07 12:07:43 [info     ] FAISS index ready              index_path='/home/deblina/Documents/projects/neura dynamics assignment/faiss_index' total_chunks=79


Initializing LLM: groq | Model: llama-3.1-8b-instant


2026-02-07 12:07:43 [info     ] LLM loaded successfully
2026-02-07 12:07:43 [info     ] Retrieval class initialized successfully


Loading embedding model: BBAI/bge-small-en-v1.5


2026-02-07 12:07:43 [info     ] LCEL chain built successfully
2026-02-07 12:07:43 [info     ] FAISS retriever loaded successfully


HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"




However, the company may choose to use a progressive discipline system in certain situations, as indicated by the mention of it in the policy. If you have specific questions about the company's approach to discipline, I recommend consulting with your supervisor or HR representative for more information.

Please note that the company may revise or update its policies at any time, so it's essential to stay informed about any changes to the discipline policy or procedures.


In [20]:

import os  

groq_api_key = os.environ.get('GROQ_API_KEY')
if not groq_api_key:
    raise ValueError("GROQ_API_KEY environment variable is not set. Please set it before running the script.")

from langsmith import Client
from langchain_groq import ChatGroq  
import re  
import time  


client = Client()

from company_policy_chat.src.document_retrieval.retrieval import Retrieval

rag = Retrieval()
index_path = "/home/deblina/Documents/projects/neura dynamics assignment/faiss_index"

rag.load_retriever_from_faiss(
    index_path=index_path,
    index_name="index"
)


def target(inputs: dict) -> dict:
    
    answer = rag.invoke(inputs["question"], chat_history=[])
    return {"answer": answer}


def correctness_evaluator(inputs: dict, outputs: dict, reference_outputs: dict):
    
    llm = ChatGroq(
        model="llama-3.1-8b-instant",
        api_key=groq_api_key,
        temperature=0,  
    )
    
    
    prompt = f"""
    Evaluate the correctness of the following output based on the reference.

    Question: {inputs['question']}
    Output: {outputs['answer']}
    Reference: {reference_outputs.get('answer', '')}

    Is the output factually correct and complete compared to the reference? Respond with only: "Score: 1" if yes, or "Score: 0" if no. Include a brief reasoning if needed, but end with the score.
    """
    
    max_retries = 5
    for attempt in range(max_retries):
        try:
            response = llm.invoke(prompt)
            response_text = response.content.strip()
            
            # Parse the explicit score from the response
            match = re.search(r"Score:\s*([01])", response_text, re.IGNORECASE)
            if match:
                score = int(match.group(1))
                print(f"Score extracted: {score} from '{response_text[:100]}...'")
                return {"key": "correctness", "score": score}
            
            # Fallback heuristic: check negatives first and use word boundaries,
            # so "incorrect" is never read as "correct"
            if re.search(r"\b(no|not|false|incorrect)\b", response_text, re.IGNORECASE):
                score = 0
            elif re.search(r"\b(yes|true|correct)\b", response_text, re.IGNORECASE):
                score = 1
            else:
                score = 0
            print(f"Fallback score: {score} from '{response_text[:100]}...'")
            return {"key": "correctness", "score": score}
        except Exception as e:
            print(f"Evaluator attempt {attempt + 1} failed: {e}")
            if attempt == max_retries - 1:
                break  # no point sleeping after the final attempt
            if "429" in str(e) or "rate limit" in str(e).lower():
                sleep_time = 2 ** attempt * 3
                print(f"Rate limited. Sleeping for {sleep_time}s...")
                time.sleep(sleep_time)
            else:
                time.sleep(1)

    # Every attempt failed: return an explicit score instead of falling through to None
    print("Max retries reached. Defaulting score to 0.")
    return {"key": "correctness", "score": 0}


try:
    experiment_results = client.evaluate(
        target,
        data="company-policy-goldens",  
        evaluators=[correctness_evaluator],
        experiment_prefix="company-policy-rag-eval",
        max_concurrency=1,  
    )

    print("✅ Evaluation completed successfully!")
    
except KeyboardInterrupt:
    print("❌ Evaluation interrupted (e.g., due to rate limits or manual stop). Check partial results in LangSmith.")
except Exception as e:
    print(f"❌ Evaluation failed with error: {e}")

Initializing LLM: groq | Model: llama-3.1-8b-instant


2026-02-07 12:11:16 [info     ] LLM loaded successfully
2026-02-07 12:11:16 [info     ] Retrieval class initialized successfully


Loading embedding model: BBAI/bge-small-en-v1.5


2026-02-07 12:11:16 [info     ] LCEL chain built successfully
2026-02-07 12:11:16 [info     ] FAISS retriever loaded successfully
View the evaluation results for experiment: 'company-policy-rag-eval-96e150b4' at:
https://smith.langchain.com/o/99040f25-cf72-4aea-9165-21a9a528a5b7/datasets/ecb2b661-b036-4472-a7ea-e91800deebc6/compare?selectedSessions=d3d1c57b-55a7-453c-8f53-eb8cb9ad2f6f




0it [00:00, ?it/s]HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


Score extracted: 1 from 'Score: 1

The output is factually correct and complete compared to the reference. It accurately conv...'


HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
1it [00:01,  1.38s/it]HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2it [00:01,  1.20it/s]

Score extracted: 1 from 'Score: 1

The output is factually correct and complete compared to the reference. It accurately stat...'


HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
3it [00:02,  1.40it/s]

Score extracted: 1 from 'Score: 1

The output accurately reflects the information provided in the reference, stating that the...'


HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
Retrying request to /openai/v1/chat/completions in 3.000000 seconds
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
Retrying request to /openai/v1/chat/completions in 9.000000 seconds
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


Score extracted: 1 from 'Score: 1

The output is factually correct and complete because it accurately states that the company...'


HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
Retrying request to /openai/v1/chat/completions in 3.000000 seconds
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
4it [00:15,  5.42s/it]HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
Retrying request to /openai/v1/chat/completions in 2.000000 seconds
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
Retrying request to /openai/v1/chat/completions in 3.000000 seconds
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
Retrying request to /openai/v1/chat/completions in 9.000000 seconds
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
Retrying request to /openai/v1/chat/

Score extracted: 1 from 'Score: 1

The output accurately reflects the information provided in the reference, stating that the...'


HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
Retrying request to /openai/v1/chat/completions in 2.000000 seconds
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
5it [00:28,  8.41s/it]HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
Retrying request to /openai/v1/chat/completions in 2.000000 seconds
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
Retrying request to /openai/v1/chat/completions in 4.000000 seconds
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
Retrying request to /openai/v1/chat/completions in 9.000000 seconds
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
Retrying request to /openai/v1/chat/

Score extracted: 0 from 'Score: 0

The output is missing the eligibility criteria for FMLA leave, which is not just about wor...'


HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
Retrying request to /openai/v1/chat/completions in 4.000000 seconds
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
6it [00:44, 10.91s/it]HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
Retrying request to /openai/v1/chat/completions in 1.000000 seconds
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
Retrying request to /openai/v1/chat/completions in 2.000000 seconds
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
Retrying request to /openai/v1/chat/completions in 9.000000 seconds
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
Retrying request to /openai/v1/chat/

Score extracted: 1 from 'Score: 1

The output is factually correct as it lists the same holidays as the reference. However, i...'


HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
Retrying request to /openai/v1/chat/completions in 3.000000 seconds
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
7it [00:58, 12.01s/it]HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
Retrying request to /openai/v1/chat/completions in 8.000000 seconds
HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
8it [01:07,  8.46s/it]

Score extracted: 0 from 'Score: 0

The output is not factually correct and complete compared to the reference. The main discr...'
✅ Evaluation completed successfully!
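
The log above shows repeated `429 Too Many Requests` responses from the Groq API. With a delay of `2 ** attempt * 3`, the waits between attempts grow as 3, 6, 12, and 24 seconds. Below is a minimal sketch of that retry pattern as a reusable wrapper. `call_with_backoff` and its parameters are hypothetical names, and the `sleep` argument is injectable only so that the demo runs instantly.

```python
import time


def call_with_backoff(fn, max_retries=5, base_delay=3.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt, then retry."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)
    # All attempts failed: surface the last error to the caller
    raise last_error


# Demo: a function that fails twice with a simulated 429, then succeeds
calls = {"n": 0}
waits = []


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "ok"


outcome = call_with_backoff(flaky, sleep=waits.append)
print(outcome, waits)
```

The HTTP client already retries on 429 by itself, as the `Retrying request to ... in N seconds` lines show, so an outer wrapper like this mainly helps once those built-in retries are exhausted.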



