# Employee Pipeline Part 2

In this notebook, we build a Pinecone vector DB using the initial chunk size and chunk overlap that we decided on. We also write a separate function for adding new documents to the vector DB, in case someone needs to update the organization's files in real time.
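
To make the chunk size and chunk overlap settings concrete, here is a minimal sketch of how a fixed-size window with fractional overlap walks over a token sequence. The function name `chunk_windows` is hypothetical, and the defaults (256 tokens, 0.50 overlap) mirror the values set in the class later in this notebook. It illustrates the idea only and is not the splitter library's implementation.

```python
def chunk_windows(n_tokens, chunk_size=256, overlap_fraction=0.50):
    """Return (start, end) token index pairs for fixed-size chunks with overlap."""
    # Overlap is given as a fraction of the chunk size, so convert it to tokens.
    overlap = int(chunk_size * overlap_fraction)
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the chunk size")

    windows = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            break  # the last window already reaches the end of the text
        start += step
    return windows


# A 600-token text with 256-token chunks and 50% overlap.
print(chunk_windows(600))
```

With a 50% overlap each window starts halfway through the previous one, so most tokens land in two chunks and the vector DB stores roughly twice as many chunks as it would with no overlap.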

# Importing libraries

In [1]:
#pip installing:
%pip install langchain
%pip install langchain_community
%pip install langchain_huggingface
%pip install langchain_pinecone
%pip install pinecone
%pip install pinecone-client
%pip install python-dotenv
%pip install streamlit
%pip install pymupdf
%pip install -qU langchain_community wikipedia
%pip install --upgrade --quiet langchain-text-splitters tiktoken

import os
import langchain #note: this import was giving a ModuleNotFoundError
import langchain_community
import langchain_huggingface
import langchain_pinecone
import pinecone
import dotenv
import streamlit as st

# Additional imports (loading documents):
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import CharacterTextSplitter

#pinecone etc (storage of documents):
from pinecone import Pinecone, ServerlessSpec
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_pinecone import PineconeVectorStore
from uuid import uuid4

#hugging face etc (for generation):
from langchain_huggingface import HuggingFaceEndpoint
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

#memory imports
#I used these documentations: https://python.langchain.com/v0.1/docs/use_cases/chatbots/memory_management/ , https://python.langchain.com/v0.1/docs/modules/memory/types/buffer/ , https://python.langchain.com/v0.1/docs/modules/memory/
from langchain.memory import ConversationBufferMemory
from langchain.chains import LLMChain

from langchain_text_splitters import TokenTextSplitter
#for timing the retrievals
import time

#for parsing:
import re



# Setting up API keys and ENV file


In [2]:
# Replace the placeholders with your own API keys; never commit real keys to a shared notebook
HUGGINGFACE_API_KEY = "<your-huggingface-api-key>"
PINECONE_API_KEY = "<your-pinecone-api-key>"

env_content = f"""
HUGGINGFACE_API_KEY={HUGGINGFACE_API_KEY}
PINECONE_API_KEY={PINECONE_API_KEY}
"""

with open(".env", "w") as file:
    file.write(env_content)

print("Environment variables are saved to .env file.")

dotenv.load_dotenv()

Environment variables are saved to .env file.


True
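
Once the keys are in the environment, it helps to fail early if one is missing, rather than deep inside a Pinecone or Hugging Face call. The sketch below is one way to do that. The helper name `require_env` and the variable `EXAMPLE_API_KEY` are hypothetical and used only for illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value


# Demonstration with a dummy variable (hypothetical name, not a real key).
os.environ["EXAMPLE_API_KEY"] = "dummy-value"
print(require_env("EXAMPLE_API_KEY"))
```

In this notebook the same check could be applied to `PINECONE_API_KEY` and `HUGGINGFACE_API_KEY` right after `dotenv.load_dotenv()`.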

# EmployeePipeline class code:

For now, we test with the same configuration as provided in the RAG assignment, just to get the ball rolling. In the next part, we will run quality checks across different factors and finalize whichever configuration works best.

In [8]:
class EmployeeChatBot:
    """RAG chatbot that answers employee questions from documents stored in a Pinecone vector DB."""
    def __init__(self):
        #loading variables:
        self.combined_text = ""
        self.CHUNK_SIZE = 256 #chunk size in tokens
        self.CHUNK_OVERLAP = 0.50 #overlap as a fraction of the chunk size
        #storing variables:
        self.pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))
        self.index_name = "employee-queries-db" #keep the name small
        self.embeddings = HuggingFaceEmbeddings()
        self.index = self.pc.Index(self.index_name) #this works only because the index has already been created once; otherwise create the index first
        self.vector_store = PineconeVectorStore(index=self.index, embedding=self.embeddings)
        # generating variables
        self.retriever = self.vector_store.as_retriever(
            search_type="similarity_score_threshold",
            search_kwargs={"k": 3, "score_threshold": 0.5},
        ) #tunable
        self.repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1" #tunable
        self.llm = HuggingFaceEndpoint(
            repo_id=self.repo_id,
            temperature=0.8,
            top_k=50,
            huggingfacehub_api_token=os.getenv('HUGGINGFACE_API_KEY'),
        ) #tunable

        #memory variables:
        self.memory_template = """You are an ambiguity clearer. Your task is to examine the human question and check for any "he/she/it/they/them" ambiguities.
        Return an updated human question that fixes those ambiguities using the previous conversation context only.
        If there is not enough relevant context, RETURN THE HUMAN QUESTION AS IT IS.
        YOUR ANSWER SHOULD BE A QUESTION WHICH ONLY CLARIFIES ANY AMBIGUITY IN THE human question BY REPLACING THE PRONOUN WITH THE NAME IT REFERS TO
        RETURN IN FORMAT: New human question: (updated question)
        Previous conversation:
        {chat_history}

        human question: {question}
        New human question:
        """
        self.memory_prompt = PromptTemplate.from_template(self.memory_template)

        self.memory = ConversationBufferMemory(memory_key="chat_history")
        self.conversation = LLMChain(
            llm=self.llm,
            prompt=self.memory_prompt,
            verbose=False,
            memory=self.memory
        )

        #prompt variables
        self.Classifier_template = """
        You are a prompt classifier designed to classify questions from employees in an organization.
        classify the following question into "Relevant" or "Irrelevant", based on whether the query theme is of a question from an organization employee, the question could be about IT, HR, Finance or any other department
        Only answer from the specified classes and one word answers.

        Question: {question}
        Answer:
        """

        self.Employee_Template = """
          You are a chatbot designed to answer questions from employees of an organization.
          Use the following extract from the relevant documents to answer the question.

          Context: {context}
          Question: {question}
          Answer:
        """
        self.Classifier_prompt = PromptTemplate( template=self.Classifier_template, input_variables=["question"] )
        self.Employee_prompt = PromptTemplate(template=self.Employee_Template, input_variables=["context", "question"] )

        #chain variables
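        # the full chain passes in a dict, so pull out the question string instead of forwarding the whole dict to the prompt
        self.classifier_chain = ({"question": lambda x: x["question"]} | self.Classifier_prompt | self.llm  | StrOutputParser() )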
        self.Employee_chain = ({"context": self.retriever | self.format_docs,  "question": RunnablePassthrough()} | self.Employee_prompt | self.llm | StrOutputParser() )
        self.full_chain = {"Relevancy": self.classifier_chain, "question": lambda x: x["question"]} | RunnableLambda(self.route)


    #this function parses the given PDF files (a list of file paths as strings), chunks their text, and adds the chunks to the pinecone vector db
    def AddFileToDB(self, docs_to_load):
      # load each PDF, clean its pages, and collect the text for chunking
      combined_text = ""
      for doc in docs_to_load:
        loader = PyMuPDFLoader(doc)
        documents = loader.load()
        # print(documents)
        for page in documents:
          text = page.page_content
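          # skip table-of-contents pages (crude check: any page that mentions "contents")
          if "contents" in text.lower():
            continue
          text = re.sub(r'\bPage\s+\d+\b', '', text, flags=re.IGNORECASE) #removing "Page N" markers
          text = re.sub(r'\n', '', text).strip() #removing all newlines
          # print(text)
          text = re.sub(r'[^\w\s.,?!:;\'\"()&-]', '', text) #keeping only word characters, whitespace, and basic punctuation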
          combined_text += text + " "
      combined_text = combined_text.strip()
      # print(combined_text)
      text_splitter = TokenTextSplitter(chunk_size=self.CHUNK_SIZE, chunk_overlap=int(self.CHUNK_SIZE*self.CHUNK_OVERLAP))
      # create_documents splits the text itself, so calling split_text first would split every chunk a second time
      docs = text_splitter.create_documents([combined_text])
      print(docs)
      if self.index_name not in self.pc.list_indexes().names():
        self.pc.create_index(  #tunable
          name=self.index_name,
          dimension=768,
          metric="cosine",
          spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
          )
        )
      # reuse the vector store created in __init__ instead of loading the embedding model again
      uuids = [str(uuid4()) for _ in range(len(docs))]
      self.vector_store.add_documents(documents=docs, ids=uuids)


    # Classifies the query and, if it is relevant, answers it from the retrieved context
    def generate(self, query):
        return self.full_chain.invoke({"question": query})


    #Helper functions:
    def format_docs(self, docs):
        return "\n\n".join([d.page_content for d in docs])


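    def route(self, info):
        # "irrelevant" contains "relevant", so match whole words and use the first label the classifier produced
        match = re.search(r"\b(irrelevant|relevant)\b", info["Relevancy"].lower())
        if match and match.group(1) == "relevant":
          print("Question was relevant")
          return self.Employee_chain.invoke(info["question"])
        else:
          return "Your question was not relevant to our organization"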
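
The page-cleaning steps inside `AddFileToDB` can be tried in isolation on a plain string, which makes it easier to see what each regular expression removes. The sketch below copies the three substitutions into a standalone function. The name `clean_page_text` and the sample string are hypothetical and only for illustration.

```python
import re


def clean_page_text(text: str) -> str:
    """Apply the same three clean-up steps as AddFileToDB to one page of text."""
    # 1. Drop "Page N" markers (case-insensitive).
    text = re.sub(r'\bPage\s+\d+\b', '', text, flags=re.IGNORECASE)
    # 2. Remove newlines, then trim leading and trailing whitespace.
    text = re.sub(r'\n', '', text).strip()
    # 3. Keep only word characters, whitespace, and basic punctuation.
    text = re.sub(r'[^\w\s.,?!:;\'\"()&-]', '', text)
    return text


sample = "Fiscal year: July 1 - June 30. \nPage 12\n• Budget (draft) #2"
print(clean_page_text(sample))
```

One thing this makes visible: because newlines are removed outright rather than replaced with a space, two words separated only by a line break are joined together. Whether that matters depends on how the PDF loader lays out its text.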



## Driver Code

This driver code mainly demonstrates the live addition of documents to the vector DB. Some organizations need documents such as minutes of meetings to be available promptly, so timely updates to the DB may be required.
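
One consideration for live updates: `AddFileToDB` assigns a fresh `uuid4` to every chunk, so adding the same file twice stores its chunks twice. A possible alternative, sketched below under the assumption that overwriting is preferred, is to derive each ID from the file path and chunk position. Pinecone upserts replace a vector whose ID already exists, so re-adding a file would then overwrite its earlier chunks. The helper name `chunk_ids` and the ID format are hypothetical.

```python
from uuid import NAMESPACE_URL, uuid5


def chunk_ids(file_path: str, n_chunks: int) -> list[str]:
    """Build deterministic IDs: the same file and chunk position always give the same ID."""
    return [
        str(uuid5(NAMESPACE_URL, f"{file_path}#chunk-{i}"))
        for i in range(n_chunks)
    ]


first = chunk_ids("Finance_Policies_and_Procedures.pdf", 3)
second = chunk_ids("Finance_Policies_and_Procedures.pdf", 3)
print(first == second)  # the IDs are stable across calls
```

This does not handle every case. If an updated file produces fewer chunks than before, the leftover chunks from the older version would remain in the index and would need to be deleted separately.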

In [6]:
bot = EmployeeChatBot()
#the following few cells demonstrate live updates to the DB feature

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful




In [11]:
bot.generate("Tell me about the signature requirements of bank accounts from atmae financial policies")

Question was relevant


" Based on the provided context, Atmae Financial policies do not mention any specific signature requirements for bank accounts. However, employees are expected to maintain the highest standards of personal integrity and be fair and honest in all business dealings. This includes not manipulating any account to generate illegitimate financial gains, such as inflating a balance prior to a payday when a direct deposit will address any overdrafts. Employees are also expected to read, understand, and comply with Comerica's Anti-Money Laundering (AML) Policy and any additional AML policies implemented by their business unit. Failure to adhere to the AML Policy may result in disciplinary action, including termination. Any questions related to the AML Policy should be directed to the Director of AML Compliance. If there is any uncertainty or question regarding any issue, employees should contact an attorney in the Legal Department or speak with the Chief Legal Officer."

In [12]:
bot.AddFileToDB(["Finance_Policies_and_Procedures.pdf"])

[Document(metadata={}, page_content='ATMAE FINANCIAL POLICIES AND PROCEDURES  RELATIONSHIP  TO BY-LAWS These financial policies and procedures shall be implemented at all times consistently with the organizations by-laws.  FISCAL YEAR The fiscal year of the organization is July 1 through June 30.  BUDGET I. \xa0Budget Calendar. The budget year shall be the same as the fiscal year of the organization. II. \xa0Budget Format. The budget shall be in a format substantially consistent with Appendix B. III. \xa0Budget Process. a. \xa0 Presentation of Budget. The Executive Director shall present a general annual budget for the operation of the organization to the board of directors for action no later than the June board meeting each year. b. \xa0 Board action. The board of directors will receive the report and recommendation of the Executive Director and adopt a budget based thereon. IV. \xa0Administration of Budget. Written documentation of expenses shall be provided to the organizations adm



In [14]:
bot.generate("Tell me about the signature requirements of bank accounts from atmae financial policies")

Question was relevant


' According to the ATMAE financial policies, the following individuals have signatory authority on ATMAE bank accounts: The President and The Executive Director. The Management Company has authority to issue checks up to and including 5,000. For any checks that are greater than 5,000, the Treasurer must review the documentation and approve payment in writing before the check is issued. Additionally, no funds shall be transferred by the Management Company to or from investment accounts (excluding transfers to and from a money market account) without approval of the board of directors.'