# RAG with LangChain and vector stores

Data ingestion techniques
- TextLoader: load plain-text files (.txt)
- WebBaseLoader: load and parse text scraped from an HTML page (uses BeautifulSoup)
- PyPDFLoader: load data from a PDF file (.pdf)

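LTE (Load, Transform, Embed)
- Load: read the raw source into Document objects
- Transform: split the documents into smaller chunks
- Embed: convert the chunks into vectors and save them in a vector store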

## TextLoader

In [2]:
from langchain_community.document_loaders import TextLoader

In [6]:
# Load Documents 
text_loader = TextLoader("example_text.txt")

In [7]:
text_docs = text_loader.load()

In [8]:
text_docs

[Document(metadata={'source': 'example_text.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairn

In [9]:
len(text_docs)

1
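The result above is a single Document for the whole file, with the file path recorded under `metadata['source']`. As a rough mental model only (this is not LangChain's implementation), that behavior can be sketched in plain Python. `SimpleDocument` and `load_text_file` are hypothetical names introduced for illustration, and the sample file is created on the fly.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str, encoding: str = "utf-8") -> list[SimpleDocument]:
    """Read the whole file into ONE document and record where it came from."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Write a small sample file, then load it back as a single document.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = str(Path(tmp_dir) / "sample.txt")
    Path(sample_path).write_text(
        "First paragraph.\n\nSecond paragraph.", encoding="utf-8"
    )
    docs = load_text_file(sample_path)

print(len(docs))
print(docs[0].metadata)
```

Because the file arrives as one document no matter how long it is, any splitting into smaller pieces has to happen in a separate Transform step.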

## WebBaseLoader

In [10]:
import os
from dotenv import load_dotenv

In [11]:
load_dotenv()

True

In [12]:
# Call API Keys

# os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")

In [18]:
from langchain_community.document_loaders import WebBaseLoader
import bs4

In [20]:
# WebBaseLoader: load text from an HTML page, keeping only the elements selected by SoupStrainer

web_loader = WebBaseLoader(
    web_path="https://jobs.lever.co/egen/0e0d9469-08a8-48aa-aefb-f65d50760e80",
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(
        # full class attribute strings copied from the page's HTML (each is matched as one exact string)
        class_=(
            "sort-by-time posting-category medium-category-label width-full capitalize-labels location",
            "sort-by-team posting-category medium-category-label capitalize-labels department",
            "sort-by-commitment posting-category medium-category-label capitalize-labels commitment"
            )
        ))
    )

In [21]:
web_docs = web_loader.load()

In [22]:
web_docs

[Document(metadata={'source': 'https://jobs.lever.co/egen/0e0d9469-08a8-48aa-aefb-f65d50760e80'}, page_content='Naperville, IL or RemoteEngineering and Product Management /Full Time /')]
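The `page_content` above is the text of the matched elements joined with no separator, which is why "Remote" runs straight into "Engineering". The filtering idea can be sketched with the standard library alone. This is a simplified stand-in and not how BeautifulSoup or `SoupStrainer` is implemented; the HTML snippet, the short class names, and `ClassTextExtractor` are all hypothetical.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect text inside elements whose class attribute equals a target string."""

    VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag

    def __init__(self, targets):
        super().__init__()
        self.targets = set(targets)
        self.depth = 0   # > 0 while we are inside a matching element
        self.parts = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1  # nested element inside a match
        elif dict(attrs).get("class") in self.targets:
            self.depth = 1   # entering a matching element

    def handle_endtag(self, tag):
        if tag not in self.VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


# Hypothetical page fragment; the real posting uses much longer class strings.
html = (
    '<div class="posting">'
    '<div class="location">Naperville, IL or Remote</div>'
    '<div class="department">Engineering /</div>'
    '<p class="other">Apply now</p>'
    "</div>"
)

extractor = ClassTextExtractor(targets=("location", "department"))
extractor.feed(html)
extracted = "".join(extractor.parts)
print(extracted)
```

Text outside the selected elements ("Apply now" here) is dropped before it ever reaches the Document, so only the targeted fields remain.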

## PyPDFLoader, Chunking, and Vector Stores

### PyPDFLoader

Note: PyPDFLoader returns one Document per page, so the PDF is already split by page number when it is loaded.

In [23]:
# Load PDF Data
from langchain_community.document_loaders import PyPDFLoader

In [31]:
# pdf_loader = PyPDFLoader("robot.pdf")
pdf_loader = PyPDFLoader("Profile.pdf")
# pdf_loader = PyPDFLoader("movie_theater.pdf")

In [32]:
pdf_docs = pdf_loader.load()

In [33]:
pdf_docs

[Document(metadata={'source': 'Profile.pdf', 'page': 0}, page_content="\xa0 \xa0\nContact\ndruestaples@gmail.com\nwww.linkedin.com/in/drue-\nstaples-65b206182  (LinkedIn)\nTop Skills\nPython (Programming Language)\nTensorFlow\nDeep LearningDrue Staples\nLead AI Machine Learning Engineer at PROLIFIC AI\nLowell, Indiana, United States\nSummary\nSpecific Skills:\nPython, Pandas, Scikit-Learn, Matplotlib, Numpy, Tensorflow,\nOpenCV, Keras, Seaborn, Statistics, Probability, Differential\nCalculus, Linear Algebra, SQL, Machine Learning Algorithms,\nFeature Engineering, Feature Selection, Google Cloud Platform,\nNLP, Jupyter, Excel, Powerpoint, Word, MATLAB, Speech\nRecognition, Sentiment Analysis, Time Series data, Agile\nMethodology, HTML, CSS, JavaScript, PHP\nCertification:\nStanford's Machine Learning Certification with Andrew Ng\nCourse includes machine learning, datamining, and statistical pattern\nrecognition. Topics include: (i) Supervised learning (parametric/non-\nparametric algori

In [34]:
len(pdf_docs)

4

### Chunk Documents 

PyPDFLoader has already split the PDF into one document per page, but a page is often still too coarse for retrieval, so the text is split further into smaller chunks.

This helps with:
- indexing and retrieval, since smaller chunks are easier to embed and to match against a query
- granularity (level of detail), since a single page may cover several topics
- preserving context across chunk boundaries, via the chunk_overlap parameter below

What are chunk size and chunk overlap? (RecursiveCharacterTextSplitter)

- Chunk size: the maximum number of characters in each chunk.
- Chunk overlap: the number of characters at the end of one chunk that are repeated at the start of the next. The repeated text carries context across the boundary, so a passage cut in two is less likely to lose its meaning.
- Example 1 (chunk_size=10, chunk_overlap=5):
    - Chunk 0: 10 characters
    - Chunk 1: last 5 characters of chunk 0 + 5 new characters
    - Chunk 2: last 5 characters of chunk 1 + 5 new characters
    - Chunk 3: last 5 characters of chunk 2 + 5 new characters
- Example 2: input="The dog ran up the hill.", chunk_size=10, chunk_overlap=5, cutting at fixed character offsets
    - Chunk 0: "The dog ra"
    - Chunk 1: "og ran up "
    - Chunk 2: "n up the h"
    - Chunk 3: "the hill."
- Note: RecursiveCharacterTextSplitter prefers to break on separators (paragraphs, then lines, then spaces) before falling back to single characters, so its real chunks tend to end at word boundaries and can be shorter than chunk_size.
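The chunk size and overlap arithmetic can be sketched with a plain sliding window over characters. This is a simplified illustration and not how RecursiveCharacterTextSplitter works internally: the real splitter first tries to break on separators such as blank lines and spaces, so its chunks will usually differ from the fixed-offset ones printed here. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = sliding_window_chunks(
    "The dog ran up the hill.", chunk_size=10, chunk_overlap=5
)
for index, chunk in enumerate(chunks):
    print(index, repr(chunk))
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so a larger overlap produces more chunks for the same text.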


In [42]:
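# Chunk documents (Transform)
# Note: a chunk_size of 20 characters is very small. It keeps this demo's output
# short, but it splits the text into fragments that give the embedding model little context.

from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=20, chunk_overlap=10)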


In [43]:
pdf_docs_split = text_splitter.split_documents(pdf_docs)

In [44]:
pdf_docs_split[:3]

[Document(metadata={'source': 'Profile.pdf', 'page': 0}, page_content='Contact'),
 Document(metadata={'source': 'Profile.pdf', 'page': 0}, page_content='druestaples@gmail.c'),
 Document(metadata={'source': 'Profile.pdf', 'page': 0}, page_content='es@gmail.com')]

### ChromaDB with Ollama embeddings

In [49]:
# Convert data into vector embeddings (Chroma)
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

In [52]:
# Embed only the first 10 chunks and save them to a Chroma vector store
db = Chroma.from_documents(documents=pdf_docs_split[:10], embedding=OllamaEmbeddings())

In [53]:
db

<langchain_community.vectorstores.chroma.Chroma at 0x137fd6df0>

In [71]:
# Query the vector database
# query = "What projects are with GCP"
query = "What the top rated skills?"

In [72]:
search_results = db.similarity_search(query=query, k=3)

In [73]:
search_results

[Document(metadata={'page': 0, 'source': 'Profile.pdf'}, page_content='(LinkedIn)'),
 Document(metadata={'page': 0, 'source': 'Profile.pdf'}, page_content='Top Skills'),
 Document(metadata={'page': 0, 'source': 'Profile.pdf'}, page_content='es@gmail.com')]

In [74]:
# db.similarity_search_by_vector(embedding=embedding, k=k)

In [75]:
search_results[0].dict()['page_content']

'(LinkedIn)'

In [76]:
search_results[0].page_content

'(LinkedIn)'
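Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below shows that ranking step with Euclidean distance on made-up 3-dimensional vectors. Everything numeric here is hypothetical: real embeddings come from the embedding model and have far more dimensions, and the distance metric a vector store actually uses depends on its configuration.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vector, store, k=3):
    """Return the k (text, distance) pairs closest to query_vector."""
    scored = [
        (text, euclidean_distance(query_vector, vector)) for text, vector in store
    ]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Made-up "embeddings" for a few chunk texts (NOT produced by any model).
toy_store = [
    ("Top Skills", [0.9, 0.1, 0.0]),
    ("Contact", [0.0, 1.0, 0.1]),
    ("Python (Programming Language)", [0.8, 0.0, 0.3]),
    ("Lowell, Indiana, United States", [0.1, 0.2, 0.9]),
]
toy_query = [1.0, 0.0, 0.1]  # pretend this is the embedded query

results = top_k(toy_query, toy_store, k=2)
for text, distance in results:
    print(f"{distance:.3f}  {text}")
```

The search can only rank what was stored, so very short chunks such as '(LinkedIn)' compete on equal footing with more informative ones.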

### FAISS Vector Store with Ollama embeddings

In [87]:
# FAISS Vector Database

from langchain_community.vectorstores import FAISS

In [88]:
db1 = FAISS.from_documents(documents=pdf_docs_split, embedding=OllamaEmbeddings())

In [89]:
db1

<langchain_community.vectorstores.faiss.FAISS at 0x29b457b50>

In [90]:
# Query the vector database (FAISS); similarity_search returns k=4 results by default
# query = "What projects are with GCP"
query = "What the top rated skills?"

In [91]:
db1.similarity_search(query)

[Document(metadata={'source': 'Profile.pdf', 'page': 0}, page_content='(LinkedIn)'),
 Document(metadata={'source': 'Profile.pdf', 'page': 0}, page_content='Summary'),
 Document(metadata={'source': 'Profile.pdf', 'page': 1}, page_content='Data Analysis'),
 Document(metadata={'source': 'Profile.pdf', 'page': 2}, page_content='-Data Analysis')]

### Lance Vector Store with Ollama embeddings

In [None]:
# Lance vector database

