#**Load a Vectorstore from a .pkl File and Query It**
We will see how to save a vectorstore's metadata to a .pkl file and use it later to reload and query the vectorstore.
###**Steps:**
1. Load and read the dataset
2. Split the text into chunks
3. Define the embedding model
4. Apply embeddings and create the vectorstore
5. Save the metadata to a .pkl file
6. Remove the Chroma persist directory (`chromadb`)
7. Load the metadata from the .pkl file (saved in step 5)
8. Create the vectorstore using the loaded metadata
9. Query the vectorstore

###**Install Dependencies**

In [1]:
!pip install -qU chromadb langchain-chroma langchain-community langchain-openai

##**1. Upload Policy Documents**

###**Download the `hrdataset.zip` file from the CloudYuga GitHub repo**

The file is saved in the notebook's current working directory (e.g., /content/ in Google Colab).

In [2]:
!wget https://github.com/cloudyuga/mastering-genai-w-python/raw/refs/heads/main/hrdataset.zip

--2025-05-29 05:59:48--  https://github.com/cloudyuga/mastering-genai-w-python/raw/refs/heads/main/hrdataset.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/cloudyuga/mastering-genai-w-python/refs/heads/main/hrdataset.zip [following]
--2025-05-29 05:59:48--  https://raw.githubusercontent.com/cloudyuga/mastering-genai-w-python/refs/heads/main/hrdataset.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9530 (9.3K) [application/zip]
Saving to: ‘hrdataset.zip.2’


2025-05-29 05:59:48 (18.6 MB/s) - ‘hrdataset.zip.2’ saved [9530/9530]



###**Unzip `hrdataset.zip` file**
- This automatically creates the **`hrdataset`** folder in the current working directory (/content/ in Google Colab).

In [3]:
!unzip hrdataset.zip

Archive:  hrdataset.zip
replace hrdataset/policies/leave_policies.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: hrdataset/policies/leave_policies.md  
  inflating: hrdataset/policies/training_and_development.md  
  inflating: hrdataset/policies/employee_benefits.md  
  inflating: hrdataset/policies/holiday_calendar.md  
  inflating: hrdataset/policies/events_calendar.md  
  inflating: hrdataset/surveys/Employee_Culture_Survey_Responses.csv  
  inflating: hrdataset/employees/108_Rajesh_Kulkarni.md  
  inflating: hrdataset/employees/106_Neha_Malhotra.md  
  inflating: hrdataset/employees/103_Anjali_Das.md  
  inflating: hrdataset/employees/105_Sunita_Patil.md  
  inflating: hrdataset/employees/101_Priya_Sharma.md  
  inflating: hrdataset/employees/102_Rohit_Mehra.md  
  inflating: hrdataset/employees/104_Karan_Kapoor.md  
  inflating: hrdataset/employees/109_Meera_Iyer.md  
  inflating: hrdataset/employees/110_Aditya_Jain.md  
  inflating: hrdataset/employees/107_Amit_Verma.md
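
Outside a notebook, or to avoid the interactive overwrite prompt seen in the output above, the same extraction can be sketched with Python's standard `zipfile` module. The helper name `extract_zip` is hypothetical and not part of the original notebook.

```python
import zipfile
from pathlib import Path

def extract_zip(zip_path, dest_dir="."):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)  # existing files are overwritten without a prompt
        return archive.namelist()

# Example (assumes hrdataset.zip is in the working directory):
# names = extract_zip("hrdataset.zip")
```

The returned list of member names can be printed to confirm that the `hrdataset/policies` files are present.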

In [4]:
policy_files_path = 'hrdataset/policies'

In [5]:
import os
def read_markdown_files(directory):
    """Read and load content from all Markdown files in a directory."""
    documents = []
    for filename in os.listdir(directory):
        if filename.endswith(".md"):
            filepath = os.path.join(directory, filename)
            with open(filepath, 'r', encoding='utf-8') as f:
                documents.append({"filename": filename, "content": f.read()})
    return documents

In [6]:
docs = read_markdown_files(policy_files_path)

In [7]:
for item in docs:
    print(f"📄 Filename: {item['filename']}")
    print("📝 Content:")
    print(item['content'])
    print("\n" + "="*80 + "\n")

📄 Filename: training_and_development.md
📝 Content:
# Training and Development

| Employee ID | Name           | Courses Taken                          | Completion Date | Certifications Awarded      |
|-------------|----------------|----------------------------------------|-----------------|----------------------------|
| 101         | Priya Sharma   | Leadership in Operations              | 2022-12-10      | Certified Operations Manager |
| 102         | Rohit Mehra    | Data Analytics for Logistics          | 2021-11-15      | Certified Logistics Analyst |
| 103         | Anjali Das     | HR Management Essentials              | 2023-03-05      | Certified HR Professional   |



📄 Filename: holiday_calendar.md
📝 Content:
# Holiday Calendar

| Festival Name          | Date       | Day         |
|------------------------|------------|-------------|
| Republic Day          | 2023-01-26 | Thursday    |
| Holi                 | 2023-03-08 | Wednesday   |
| Good Friday          | 2023-04-07

##**2. Split Documents into Chunks**

In [8]:
# Recent LangChain releases ship the splitters in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

def split_text(documents, chunk_size=1000, chunk_overlap=20):
    """Split text documents into manageable chunks."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        is_separator_regex=False
    )
    chunks = []
    for doc in documents:
        # Split the document into chunks
        doc_chunks = text_splitter.create_documents([doc["content"]])
        # Add metadata (e.g., filename) to each chunk
        for chunk in doc_chunks:
            chunk.metadata = {"filename": doc["filename"]}
        chunks.extend(doc_chunks)
    return chunks

In [9]:
chunks = split_text(docs)
print(len(chunks))

5
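
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-width sliding window. It is only an illustration of the two parameters: `RecursiveCharacterTextSplitter` first tries to split on separators such as paragraph and line breaks, so its chunks will differ from this sketch. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=20):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is what an overlap of 1 means here.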


###**Retrieve the API Key from Secrets and Set It as an Environment Variable**

In [10]:
# Retrieve the API key from Colab's secrets
from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

In [11]:
# Set OPENAI_API_KEY as an ENV
import os
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
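
`google.colab.userdata` is only available inside Colab. A minimal sketch of a fallback for other environments follows: it reads the key from the environment and otherwise prompts for it. The helper `resolve_api_key` and its parameters are hypothetical, not part of the original notebook.

```python
import os
from getpass import getpass

def resolve_api_key(name="OPENAI_API_KEY", env=None, prompt=getpass):
    """Return the key from the environment, or ask the user if it is not set."""
    env = os.environ if env is None else env
    key = env.get(name)
    if not key:
        key = prompt(f"Enter {name}: ")  # getpass hides the typed value
    return key

# Example (would prompt if the variable is unset):
# os.environ["OPENAI_API_KEY"] = resolve_api_key()
```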

##**3. Define Embedding Model**

In [12]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

## **4. Embed HR Policy Documents and Create the Vector Store**

In [13]:
from langchain_chroma import Chroma

In [14]:
persist_dir = "chromadb"
#collection_name = "my_collection"
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory=persist_dir
)

## **5. Save only the metadata (directory + embedding model) to a .pkl file**


In [15]:
import pickle
metadata = {
    "persist_directory": persist_dir,
    # Default model of OpenAIEmbeddings(), which was used to build the store
    "embedding_model_name": "text-embedding-ada-002",
}
with open("vectorstore_metadata.pkl", "wb") as f:
    pickle.dump(metadata, f)
print("✅ Vectorstore metadata saved.")


✅ Vectorstore metadata saved.
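
The save and load steps can be wrapped in two small helpers that also check the loaded dictionary for the expected keys. This is a sketch under the assumption that the metadata always contains the two keys used in this notebook; the names `save_metadata`, `load_metadata`, and `REQUIRED_KEYS` are hypothetical.

```python
import pickle

REQUIRED_KEYS = ("persist_directory", "embedding_model_name")

def save_metadata(metadata, path="vectorstore_metadata.pkl"):
    """Pickle the small metadata dict (not the vectors themselves)."""
    with open(path, "wb") as f:
        pickle.dump(metadata, f)

def load_metadata(path="vectorstore_metadata.pkl"):
    """Load the metadata dict and fail early if an expected key is missing."""
    with open(path, "rb") as f:
        metadata = pickle.load(f)  # only unpickle files you created yourself
    missing = [key for key in REQUIRED_KEYS if key not in metadata]
    if missing:
        raise KeyError(f"Missing metadata keys: {missing}")
    return metadata
```

Failing early on a missing key gives a clearer error than a `KeyError` raised later, when the vectorstore is being rebuilt.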


##**6. Remove Persist Directory from Local Files**

In [16]:
import shutil
shutil.rmtree("chromadb")  # deletes the persist_directory from step 4, including the data stored on disk
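
The .pkl file from step 5 stores only the directory path and the embedding model name, so a fresh session can find the stored vectors only if that directory is still on disk. A minimal sketch of a guard to run before reopening the store follows; `persist_dir_has_data` is a hypothetical helper, not part of the notebook.

```python
import os

def persist_dir_has_data(persist_directory):
    """Return True if the directory exists and contains at least one entry."""
    return os.path.isdir(persist_directory) and bool(os.listdir(persist_directory))

# Example:
# if not persist_dir_has_data("chromadb"):
#     print("Persist directory is missing or empty; rebuild the vectorstore first.")
```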

## **7. Load the Metadata from the .pkl File**

In [17]:
# -------------------------------------
# 🟩 Later or in a new Colab session:
# -------------------------------------
with open("vectorstore_metadata.pkl", "rb") as f:
    loaded_metadata = pickle.load(f)


In [18]:
# The vectorstore needs an embedding_function to embed incoming queries
embedding_function = OpenAIEmbeddings(model=loaded_metadata["embedding_model_name"])

## **8. Create Vectorstore using Loaded Metadata**

In [19]:
loaded_vectorstore = Chroma(
    persist_directory=loaded_metadata["persist_directory"],
    embedding_function=embedding_function
)
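
Opening a store does not by itself confirm that it contains any chunks. One way to check, sketched below, is to count the IDs returned by the store's `get()` method, which `langchain_chroma.Chroma` provides. The helper name `count_stored_chunks` is hypothetical.

```python
def count_stored_chunks(store):
    """Count the records in a vectorstore that exposes a Chroma-style get()."""
    return len(store.get()["ids"])

# Example with the notebook's object (not executed here):
# print(count_stored_chunks(loaded_vectorstore))
```

For this notebook the expected count is the number of chunks created in step 2.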

## **9. Query the Vectorstore**


In [20]:
query = "What is the leave policy?"
results = loaded_vectorstore.similarity_search(query)

## **10. Print Results**


In [21]:
print(results[0].page_content)

# Leave Policies
- **Annual Leave:** 18 days of paid leave per year, accrued monthly.
- **Sick Leave:** 12 days of paid leave for medical reasons per year.
- **Maternity Leave:** 6 months of paid leave for expecting mothers.
- **Paternity Leave:** 15 days of paid leave for new fathers.L
- **Compensatory Leave:** Leave granted for working on weekends or holidays.
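
Beyond the top result, it can help to see how close each match is and which file it came from. The sketch below uses `similarity_search_with_score`, which `langchain_chroma.Chroma` provides along with the `k` and `filter` arguments. The helper `search_with_scores` is hypothetical, and the `filename` filter relies on the metadata key set in step 2.

```python
def search_with_scores(store, query, k=2, filename=None):
    """Return (filename, score, text) tuples for the top-k matches.

    Assumes `store` offers similarity_search_with_score(query, k=..., filter=...),
    as langchain_chroma.Chroma does.
    """
    metadata_filter = {"filename": filename} if filename else None
    hits = store.similarity_search_with_score(query, k=k, filter=metadata_filter)
    return [(doc.metadata.get("filename"), score, doc.page_content) for doc, score in hits]

# Example with the notebook's object (not executed here):
# for name, score, text in search_with_scores(loaded_vectorstore, "What is the leave policy?"):
#     print(name, score)
```

With Chroma the score is a distance, so a lower value indicates a closer match.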


###**Print Multiple Similar Documents**

In [22]:
results = loaded_vectorstore.similarity_search("is there holiday on diwali?")
for i, doc in enumerate(results):
    print(f"\nResult {i+1}:\n{doc.page_content}")



Result 1:
# Holiday Calendar

| Festival Name          | Date       | Day         |
|------------------------|------------|-------------|
| Republic Day          | 2023-01-26 | Thursday    |
| Holi                 | 2023-03-08 | Wednesday   |
| Good Friday          | 2023-04-07 | Friday      |
| Eid al-Fitr          | 2023-04-22 | Saturday    |
| Independence Day     | 2023-08-15 | Tuesday     |
| Raksha Bandhan       | 2023-08-30 | Wednesday   |
| Ganesh Chaturthi     | 2023-09-19 | Tuesday     |
| Diwali               | 2023-11-12 | Sunday      |
| Christmas            | 2023-12-25 | Monday      |
| Makar Sankranti      | 2023-01-14 | Saturday    |

Result 2:
# Events and Holiday Calendar

| Event Name            | Date       | Time     | Location        | Participants           |
|-----------------------|------------|----------|-----------------|------------------------|
| Annual Company Meet   | 2023-12-15 | 10:00 AM | Mumbai Office   | All employees          |
| Health Camp      

## **(Optional) Script to Delete the Created Directory**

In [23]:
import shutil
import os

# Replace this with your directory path
dir_path = "./chromadb"

# Check if the directory exists
if os.path.exists(dir_path):
    shutil.rmtree(dir_path)
    print(f"✅ Directory '{dir_path}' has been deleted.")
else:
    print(f"⚠️ Directory '{dir_path}' does not exist.")

⚠️ Directory './chromadb' does not exist.
