#**HR-Related Q/A App**
- Ask employee- or policy-related questions.
- The app runs a similarity search against a Pinecone vector store and generates a response with an LLM.

###**Prerequisites**
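- Log in to Pinecone: https://www.pinecone.io/
- Generate an API key for a free account (the free tier provides 5 indexes)
- Save it as a Colab secret named PINECONE_API_KEY (the notebook also reads an OPENAI_API_KEY secret for response generation)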

##**Install Dependencies**

In [24]:
!pip install gradio langchain langchain-community sentence-transformers langchain-pinecone

Collecting gradio
  Downloading gradio-5.31.0-py3-none-any.whl.metadata (16 kB)
Collecting aiofiles<25.0,>=22.0 (from gradio)
  Downloading aiofiles-24.1.0-py3-none-any.whl.metadata (10 kB)
Collecting fastapi<1.0,>=0.115.2 (from gradio)
  Downloading fastapi-0.115.12-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.5.0-py3-none-any.whl.metadata (3.0 kB)
Collecting gradio-client==1.10.1 (from gradio)
  Downloading gradio_client-1.10.1-py3-none-any.whl.metadata (7.1 kB)
Collecting groovy~=0.1 (from gradio)
  Downloading groovy-0.1.2-py3-none-any.whl.metadata (6.1 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart>=0.0.18 (from gradio)
  Downloading python_multipart-0.0.20-py3-none-any.whl.metadata (1.8 kB)
Collecting ruff>=0.9.3 (from gradio)
  Downloading ruff-0.11.11-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (25 kB)
Collecting safehttpx<0.2.0,>=0.1.

###**Retrieve API Keys from Secrets and Set Them as Environment Variables**

In [5]:
# Retrieve the API keys from Colab's secrets
import os
from google.colab import userdata

OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")
PINECONE_API_KEY = userdata.get("PINECONE_API_KEY")

# Expose both keys as environment variables so the OpenAI and Pinecone clients can find them
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["PINECONE_API_KEY"] = PINECONE_API_KEY

In [6]:
INDEX_NAME = "genai-hrdataset"
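
`google.colab.userdata` is only importable inside Colab. When the notebook runs elsewhere, a minimal sketch like the one below could read the same keys from environment variables instead. `get_secret` is a hypothetical helper introduced here for illustration, and the environment-variable fallback is an assumption about how the keys are provided outside Colab.

```python
import os


def get_secret(name: str) -> str:
    """Return a secret from Colab's store if available, else from the environment."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata

        value = userdata.get(name)
    except Exception:
        # Assumption: outside Colab (or if the Colab lookup fails),
        # the key was exported as an environment variable.
        value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Secret {name!r} is not set")
    return value
```

With this helper, `PINECONE_API_KEY = get_secret("PINECONE_API_KEY")` would behave the same in both environments and fail early with a clear message when a key is missing.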

###**Connect to Pinecone**

In [15]:
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=PINECONE_API_KEY)

# Create index if it doesn't exist
if INDEX_NAME not in [i.name for i in pc.list_indexes()]:
    pc.create_index(
        name=INDEX_NAME,
        dimension=384,  # matches all-MiniLM-L6-v2 (used below); use 1536 for OpenAI text-embedding-3-small
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

index = pc.Index(INDEX_NAME)
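
The index dimension has to match the length of the vectors produced by the embedding model, otherwise upserts are rejected. A small guard like the following sketch can make a mismatch obvious before any data is written. `check_dimension` is a hypothetical helper; the comments show where its two inputs might come from, assuming the installed clients expose those calls.

```python
def check_dimension(embedding_dim: int, index_dim: int) -> None:
    """Raise a descriptive error when the model and index dimensions differ."""
    if embedding_dim != index_dim:
        raise ValueError(
            f"Embedding model produces {embedding_dim}-dimensional vectors, "
            f"but the index expects {index_dim}. Recreate the index or "
            f"switch the embedding model."
        )


# Possible inputs once the embedding model is loaded (assumptions, not run here):
#   embedding_dim = len(embeddings.embed_query("test"))
#   index_dim = pc.describe_index(INDEX_NAME).dimension
check_dimension(embedding_dim=384, index_dim=384)  # passes silently
```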


###**Download the `hrdataset.zip` File from the CloudYuga GitHub Repo**

The file is saved in the notebook's current working directory (e.g., /content/ in Google Colab).

In [8]:
!wget https://github.com/cloudyuga/mastering-genai-w-python/raw/refs/heads/main/hrdataset.zip

--2025-05-23 13:15:37--  https://github.com/cloudyuga/mastering-genai-w-python/raw/refs/heads/main/hrdataset.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/cloudyuga/mastering-genai-w-python/refs/heads/main/hrdataset.zip [following]
--2025-05-23 13:15:37--  https://raw.githubusercontent.com/cloudyuga/mastering-genai-w-python/refs/heads/main/hrdataset.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9530 (9.3K) [application/zip]
Saving to: ‘hrdataset.zip.1’


2025-05-23 13:15:37 (52.2 MB/s) - ‘hrdataset.zip.1’ saved [9530/9530]



###**Unzip `hrdataset.zip` file**
- This creates the **`hrdataset`** folder in the current working directory (`/content/` in Google Colab).

In [9]:
!unzip hrdataset.zip

Archive:  hrdataset.zip
replace hrdataset/policies/leave_policies.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace hrdataset/policies/training_and_development.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace hrdataset/policies/employee_benefits.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: nn
replace hrdataset/policies/holiday_calendar.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace hrdataset/policies/events_calendar.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: nn
replace hrdataset/surveys/Employee_Culture_Survey_Responses.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace hrdataset/employees/108_Rajesh_Kulkarni.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: nn
replace hrdataset/employees/106_Neha_Malhotra.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace hrdataset/employees/103_Anjali_Das.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace hrdataset/employees/105_Sunita_Patil.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace hrdataset/employees/101_Priya_Sharma.md? [y]es, [n]o, [A
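
The prompts above appear because the folder already exists from an earlier run. As an alternative that never prompts, the archive can be extracted from Python with the standard-library `zipfile` module, which overwrites existing files silently. This is a sketch; `extract_archive` is a hypothetical helper name.

```python
import zipfile


def extract_archive(zip_path: str, dest_dir: str = ".") -> list[str]:
    """Extract every member of zip_path into dest_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        # extractall() replaces files that already exist, so re-running is safe.
        archive.extractall(dest_dir)
        return archive.namelist()
```

In the notebook this would be called as `extract_archive("hrdataset.zip")`.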

###**Read All Markdown Files in hrdataset/**

In [10]:
import glob
from langchain_core.documents import Document

markdown_files = glob.glob("hrdataset/**/*.md", recursive=True)
documents = []

for path in markdown_files:
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
        documents.append(Document(page_content=text, metadata={"source": path}))
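
The archive groups files into subfolders such as `policies` and `employees`. If that grouping is useful later (for example, to filter search results), it could be recorded in the metadata while loading. The sketch below uses plain dictionaries so it runs without LangChain; `load_markdown` and the `category` key are hypothetical names.

```python
from pathlib import Path


def load_markdown(root: str) -> list[dict]:
    """Read every .md file under root and tag it with its parent folder name."""
    records = []
    for path in sorted(Path(root).rglob("*.md")):
        records.append(
            {
                "page_content": path.read_text(encoding="utf-8"),
                # "category" is the immediate parent folder, e.g. "policies".
                "metadata": {"source": str(path), "category": path.parent.name},
            }
        )
    return records
```

Each record has the same two fields as a LangChain `Document`, so `Document(**record)` should convert it if needed.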


###**Chunk the Documents**

In [11]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_documents(documents)
print(f"✅ Total Chunks Prepared: {len(chunks)}")

✅ Total Chunks Prepared: 20
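
To see what `chunk_size` and `chunk_overlap` mean, the following sketch implements a plain sliding window over characters. It is a simplification: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and lines, so its real output will differ. `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is how overlap keeps a sentence that straddles a boundary retrievable from either side.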


###**Embed + Store into Pinecone using LangChain**

In [16]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_pinecone import PineconeVectorStore

# You can also use OpenAIEmbeddings if preferred
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name=INDEX_NAME,
)

print("✅ All records stored in Pinecone!")

✅ All records stored in Pinecone!
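
Re-running the cell above may store the same chunks again if the vector store assigns fresh IDs on every call (an assumption about the default behaviour, worth verifying for the installed version). One way to make the upload repeatable is to derive each ID from the chunk itself, as in this sketch. `make_chunk_id` is a hypothetical helper.

```python
import hashlib


def make_chunk_id(source: str, content: str) -> str:
    """Derive a stable 32-character ID from a chunk's source path and text."""
    digest = hashlib.sha256(f"{source}\n{content}".encode("utf-8")).hexdigest()
    return digest[:32]


# Possible use in the notebook (assumes from_documents forwards an `ids` argument):
#   ids = [make_chunk_id(c.metadata["source"], c.page_content) for c in chunks]
#   PineconeVectorStore.from_documents(chunks, embeddings, index_name=INDEX_NAME, ids=ids)
```

Because identical chunks always map to the same ID, a second upload would overwrite the existing vectors instead of adding new ones.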


##**Optional: Test the Similarity Search**

In [17]:
query = "What is the leave policy?"
results = vectorstore.similarity_search(query, k=3)

for i, doc in enumerate(results, 1):
    print(f"\n🔹 Match {i}:\n{doc.page_content[:300]}...")



🔹 Match 1:
# Leave Policies
- **Annual Leave:** 18 days of paid leave per year, accrued monthly.
- **Sick Leave:** 12 days of paid leave for medical reasons per year.
- **Maternity Leave:** 6 months of paid leave for expecting mothers.
- **Paternity Leave:** 15 days of paid leave for new fathers.
- **Compensa...

🔹 Match 2:
# Employee Benefits
- **Health Insurance:** Covers employee and dependents up to ₹5,00,000.
- **Provident Fund:** 12% of basic salary contributed to the PF account.
- **Gratuity:** Paid on retirement/resignation based on tenure.
- **Travel Allowance:** Reimbursement for official travel expenses.
- *...

🔹 Match 3:
# Payroll Information

| Employee ID | Name           | Salary (INR) | HRA   | Provident Fund | Tax Slab   | Payday                 |
|-------------|----------------|--------------|-------|----------------|------------|-----------------------|
| 101         | Priya Sharma   | 12,00,000    | 40%   | ...
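
The index was created with `metric="cosine"`, so matches are ranked by the cosine of the angle between the query vector and each stored vector. The toy sketch below shows that ranking on hand-written 2-dimensional vectors. It illustrates the metric only and is not how Pinecone performs the search internally; `cosine_similarity` and `top_k` are hypothetical names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return the indices of the k vectors most similar to query_vec."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]


print(top_k([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], k=2))
# [1, 2]
```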


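###**Initialize the OpenAI Client**

In [20]:
from openai import OpenAI

# Reads OPENAI_API_KEY from the environment variable set earlier
openai = OpenAI()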

###**Response Generation Function**

In [21]:
def ask_policy_question(query):
    # Step 1: Similarity search from Pinecone
    results = vectorstore.similarity_search(query, k=3)

    if not results:
        return "❌ No relevant documents found."

    # Step 2: Combine retrieved documents
    context = "\n".join(doc.page_content for doc in results)

    # Step 3: Create prompt
    prompt = (
        f"You are an HR assistant. Answer based on the following policy documents.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        f"Answer (max 100 words):"
    )

    # Step 4: Ask OpenAI LLM directly
    response = openai.chat.completions.create(
        model="gpt-4o",  # or gpt-3.5-turbo
        messages=[
            {"role": "system", "content": "You are an expert HR assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.2
    )

    return response.choices[0].message.content.strip()
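
Prompt assembly (Steps 2 and 3) can also be pulled into its own function so it can be inspected and tested without calling Pinecone or OpenAI. The sketch below additionally labels each passage with its source path, which is an optional variation and not part of the function above. `build_prompt` is a hypothetical name.

```python
def build_prompt(query: str, passages: list[tuple[str, str]], max_words: int = 100) -> str:
    """Build the user prompt from (source, text) pairs retrieved for the query."""
    # Label each passage with its source so the model can tell documents apart.
    context = "\n\n".join(f"[{source}]\n{text}" for source, text in passages)
    return (
        "You are an HR assistant. Answer based on the following policy documents.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        f"Answer (max {max_words} words):"
    )


# Example with made-up passage text:
print(build_prompt("How many sick days?", [("leave_policies.md", "Sick Leave: 12 days")]))
```

Inside `ask_policy_question`, the pairs could be built as `[(doc.metadata["source"], doc.page_content) for doc in results]`.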


###**Gradio App**

In [25]:
import gradio as gr
gr.Interface(
    fn=ask_policy_question,
    inputs=gr.Textbox(label="Ask about HR Policies"),
    outputs=gr.Textbox(label="Answer from LLM"),
    title="🔍 Ask Your HR Policy",
    description="Ask a question like 'What is the leave policy?' and get answers from indexed HR docs."
).launch()


It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://bfbf4e86bf9aa519e7.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
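
If the Pinecone or OpenAI call fails, the exception surfaces in the interface as a generic error. A thin wrapper like the following sketch could turn failures and empty input into readable messages. `with_friendly_errors` is a hypothetical helper, and the message wording is only an example.

```python
import functools


def with_friendly_errors(fn):
    """Wrap a one-argument handler so empty input and exceptions return readable text."""

    @functools.wraps(fn)
    def wrapper(query):
        if not query or not query.strip():
            return "Please enter a question."
        try:
            return fn(query)
        except Exception as exc:
            # Show the reason in the output box instead of raising.
            return f"Something went wrong while answering: {exc}"

    return wrapper
```

It would be passed to the interface as `fn=with_friendly_errors(ask_policy_question)`.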


