### LangChain + RAG Exercise

This workbook shows how to use LangChain for RAG (Retrieval-Augmented Generation).

In [None]:
# Package Imports

import os
import numpy as np
from sklearn.manifold import TSNE
import plotly.graph_objects as go

# Import environment variables from .env file
from dotenv import load_dotenv

# Import Google AI Package
import google.generativeai

#Import glob module
import glob

# Import modules and packages from langChain
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain.vectorstores import FAISS
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain_core.callbacks import StdOutCallbackHandler


In [None]:
# Variables Declaration

# Model of Google Gemini used
MODEL = "gemini-1.5-flash"

# Name of Vector Data Store
db_name = "vector_db"

In [None]:
# Load Environment Variables

# Load .env file
load_dotenv()

# Load the Google API key from the .env file (this assignment raises TypeError if the key is missing)
os.environ['GOOGLE_API_KEY'] = os.getenv('GOOGLE_API_KEY')

# Authenticate to Python SDK for future requests
google.generativeai.configure()
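
If `GOOGLE_API_KEY` is absent from the environment, `os.getenv` returns `None` and assigning that to `os.environ` fails with a fairly opaque `TypeError`. A minimal sketch of a guard that fails with a clearer message is shown below; the helper name `require_env` and the dummy key are hypothetical and only for illustration.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable or raise a clear error."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value


# Example with an explicit mapping so the sketch runs without a real key
print(require_env("GOOGLE_API_KEY", {"GOOGLE_API_KEY": "dummy-key"}))
```

In the notebook, the helper could be called right after `load_dotenv()` so that a missing key is reported before any request is made.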

## LangChain Workflow

In this notebook, we will demonstrate the LangChain workflow step by step.

### Step 1: Data Ingestion

In this section, we load a dataset from a specific data source (our predefined knowledge base).

In [None]:
# Take every sub-folder of our knowledge base
folders = glob.glob("knowledge-base/*")

# Configure the encoding (Optional)
text_loader_kwargs = {'encoding': 'utf-8'}

# List of Document objects collected as the dataset is loaded
documents = []

# Loop through all folders in knowledge base
for folder in folders:
    
    # Retrieve doc type of folder
    doc_type = os.path.basename(folder)

    # Load Markdown files from knowledge base directory. Each file is processed by TextLoader, converted into a Document object.
    loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)

    # Execute the document loading process
    folder_docs = loader.load()

    # For each Document Object, append doc_type to metadata
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        documents.append(doc)

documents
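
The loop above derives each document's `doc_type` from the name of the folder it lives in. The sketch below illustrates that folder-to-label mapping using only the standard library, on a temporary directory; the folder and file names are made up, and it does not use LangChain's loaders.

```python
import glob
import os
import tempfile


def collect_markdown_by_type(base_dir):
    """Pair each Markdown file under base_dir with the name of its top-level folder."""
    tagged = []
    for folder in sorted(glob.glob(os.path.join(base_dir, "*"))):
        if not os.path.isdir(folder):
            continue
        # The folder name plays the role of doc_type
        doc_type = os.path.basename(folder)
        pattern = os.path.join(folder, "**", "*.md")
        for path in sorted(glob.glob(pattern, recursive=True)):
            tagged.append((doc_type, os.path.relpath(path, base_dir)))
    return tagged


# Build a tiny stand-in for the knowledge base (hypothetical names)
with tempfile.TemporaryDirectory() as base:
    for doc_type, name in [("employees", "alex.md"), ("products", "car.md")]:
        os.makedirs(os.path.join(base, doc_type))
        with open(os.path.join(base, doc_type, name), "w", encoding="utf-8") as f:
            f.write("# placeholder\n")
    demo = collect_markdown_by_type(base)
    print(demo)
```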

### Step 2. Data Transformation

The following section describes how to break down the Document objects from Step 1 into small text chunks.

In [None]:
# Using RecursiveCharacterTextSplitter as it is better suited to generic text than CharacterTextSplitter.
# There is no need to specify where to split: by default it tries the separators "\n\n", "\n", " " and "" in turn

# chunk_size = specify the size of chunks
# chunk_overlap = specify the number of characters to overlap in consecutive chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# generate chunks by splitting Document Object obtained from step 1
chunks = text_splitter.split_documents(documents)

print(f"Total number of chunks: {len(chunks)}")
print(f"Document types found: {set(doc.metadata['doc_type'] for doc in documents)}")
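
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified pure-Python sketch that slides a fixed-size character window over a string. It is only an illustration of the two parameters: `RecursiveCharacterTextSplitter` additionally prefers to split on separators, so its chunk boundaries will generally differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    # Each new chunk starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


toy_chunks = sliding_window_chunks("abcdefghijklmnopqrstuvwxy", chunk_size=10, chunk_overlap=4)
for chunk in toy_chunks:
    print(chunk)
```

With `chunk_size=1000` and `chunk_overlap=200`, as in the cell above, consecutive chunks would share up to 200 characters, which helps keep context that straddles a boundary.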

### Step 3. Embeddings

This section demonstrates how to convert the text chunks obtained from Step 2 into vectors (vectorization).

In [None]:
# We are using Google Gemini Embedding

# model = specify which embedding model to use
# dimensions = specify the requested number of dimensions for the embedding vectors
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", dimensions=1024)

### Step 4. Store in Vector DataStore

This section demonstrates the final step in the LangChain workflow, where each vector is saved in a vector database.

In [None]:
# Put the chunks of data into a Vector Store that associates a Vector Embedding with each chunk

vectorstore = FAISS.from_documents(chunks, embedding=embeddings)

total_vectors = vectorstore.index.ntotal
dimensions = vectorstore.index.d

print(f"There are {total_vectors} vectors with {dimensions:,} dimensions in the vector store")

# Save vectorstore locally
vectorstore.save_local(db_name)

# Reload Vector Store Database
vectorstore=FAISS.load_local(db_name,embeddings,allow_dangerous_deserialization=True)
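
Conceptually, a similarity query against the store compares the query vector with the stored vectors and returns the closest ones. The sketch below shows that idea in pure Python, assuming Euclidean (L2) distance and an exhaustive scan; FAISS itself relies on optimized index structures, and the two-dimensional toy vectors and labels here are made up.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, stored, k):
    """Return the k (distance, label) pairs closest to the query vector."""
    scored = sorted((l2_distance(query, vec), label) for label, vec in stored.items())
    return scored[:k]


# Toy "vector store": label -> embedding (hypothetical values)
toy_store = {
    "contracts": [0.9, 0.1],
    "employees": [0.1, 0.9],
    "products": [0.8, 0.3],
}

nearest = top_k([1.0, 0.0], toy_store, k=2)
print(nearest)
```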

### Visualise Vector Store

The following plot helps visualise the vector store in 2D.

In [None]:
# Prework

vectors = []
documents = []
doc_types = []
colors = []
color_map = {'products':'blue', 'employees':'green', 'contracts':'red', 'company':'orange'}

for i in range(total_vectors):
    vectors.append(vectorstore.index.reconstruct(i))
    doc_id = vectorstore.index_to_docstore_id[i]
    document = vectorstore.docstore.search(doc_id)
    documents.append(document.page_content)
    doc_type = document.metadata['doc_type']
    doc_types.append(doc_type)
    colors.append(color_map[doc_type])
    
vectors = np.array(vectors)



In [None]:
# We humans find it easier to visualize things in 2D!
# Reduce the dimensionality of the vectors to 2D using t-SNE
# (t-distributed stochastic neighbor embedding)

tsne = TSNE(n_components=2, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Create the 2D scatter plot
fig = go.Figure(data=[go.Scatter(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    mode='markers',
    marker=dict(size=5, color=colors, opacity=0.8),
    text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(doc_types, documents)],
    hoverinfo='text'
)])

fig.update_layout(
    title='2D FAISS Vector Store Visualization',
    # Axis titles for a 2D plot are set directly on the layout (scene only applies to 3D plots)
    xaxis_title='x',
    yaxis_title='y',
    width=800,
    height=600,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()

## Creation of RAG Pipeline

The following section demonstrates how to create a RAG pipeline with LangChain.

In [16]:
# 1. create a new Chat with ChatGoogleGenerativeAI

# model = Specify the model of LLM used
# temperature = Controls randomness (higher = more creative, lower = more deterministic)
llm = ChatGoogleGenerativeAI(model=MODEL, temperature=0.7)

# 2. create a chat memory component in LangChain, allowing your chatbot or agent to remember previous messages in a conversation.

# memory_key = The key under which memory will be passed into the prompt
# return_messages = returns structured messages (HumanMessage, AIMessage) instead of just strings
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)

# 3. the retriever is an abstraction over the VectorStore that will be used during RAG to fetch the most relevant document chunks for a query.

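# search_kwargs is not set here, so the retriever returns its default number of chunks (k=4 at the time of writing)
# To return the top 25 most similar chunks instead, pass search_kwargs={"k": 25} to as_retriever()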
retriever = vectorstore.as_retriever()

# 4. putting it together: set up the conversation chain with the Gemini LLM, the retriever and memory

# callbacks=[StdOutCallbackHandler()] - optional, prints the intermediate chain steps to stdout
conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory, callbacks=[StdOutCallbackHandler()])
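
With the callback enabled, the chain prints the prompt it sends to the model: the retrieved chunks are "stuffed" into a system message ahead of the question. The sketch below imitates that assembly step in plain Python. The system wording is copied from the trace this notebook prints; the `Human:` line, the function name `build_prompt`, and the sample chunks are assumptions for illustration, not LangChain's actual implementation.

```python
SYSTEM_TEMPLATE = (
    "Use the following pieces of context to answer the user's question.\n"
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n"
    "----------------\n"
    "{context}"
)


def build_prompt(chunks, question):
    """Join the retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    system = SYSTEM_TEMPLATE.format(context=context)
    return f"System: {system}\nHuman: {question}"


# Hypothetical retrieved chunks
sample_chunks = ["# HR Record\n\n# Alex Chen", "## Summary\n- **Job Title:** Backend Software Engineer"]
prompt = build_prompt(sample_chunks, "What is Alex Chen's job title?")
print(prompt)
```

Because every retrieved chunk is pasted into the prompt, the number of chunks returned by the retriever directly affects prompt length and cost.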

In [None]:
query = "Can you describe Deloitte in a few sentences"
result = conversation_chain.invoke({"question":query})
print(result["answer"])

# query = "Who received the prestigious IIOTY award in 2023?"
# result = conversation_chain.invoke({"question": query})
# answer = result["answer"]
# print("\nAnswer:", answer)

# query = "How many employees are there in the company based on the number of records present in employee folder?"
# result = conversation_chain.invoke({"question": query})
# answer = result["answer"]
# print("\nAnswer:", answer)



> Entering new ConversationalRetrievalChain chain...

> Entering new StuffDocumentsChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the user's question. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
# HR Record

# Alex Chen

## Summary
- **Date of Birth:** March 15, 1990  
- **Job Title:** Backend Software Engineer  
- **Location:** San Francisco, California  

## Insurellm Career Progression
- **April 2020:** Joined Insurellm as a Junior Backend Developer. Focused on building APIs to enhance customer data security.
- **October 2021:** Promoted to Backend Software Engineer. Took on leadership for a key project developing a microservices architecture to support the company's growing platform.
- **March 2023:** Awarded the title of Senior Backend Software Engineer due to exemplary performance in scaling backend services, reducing downtime by 30% over six months.

## An