### This notebook builds a simple RAG (retrieval-augmented generation) pipeline using LangChain.
It follows the [LangChain RAG tutorial](https://python.langchain.com/docs/tutorials/rag/).

We can use open-source packages from HuggingFace, such as the embedding model loaded below.

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
# Ensure Vertex AI credentials are configured,
# or configure a different chat model here.
from langchain_google_vertexai import ChatVertexAI

model = ChatVertexAI(model="gemini-1.5-flash")
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
vector_store = InMemoryVectorStore(embeddings)

## **Step 1: indexing**

### **Loading documents**
We need to first load the blog post contents. We can use DocumentLoaders for this, which are objects that load in data from a source and return a list of Document objects.

In this case we'll use the WebBaseLoader, which uses urllib to load HTML from web URLs and BeautifulSoup to parse it to text.

Documentation for document loaders is available [here](https://python.langchain.com/docs/how_to/#document-loaders), in the document loaders section.

In [None]:
# BeautifulSoup (bs4) parses HTML and extracts content from web pages.
import bs4

# WebBaseLoader loads documents from web pages.
# A possible next step for this notebook: loading documents
# from other sources such as PDFs or Word documents.
from langchain_community.document_loaders import WebBaseLoader

# Only keep post title, headers, and content from the full HTML.
bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs={"parse_only": bs4_strainer},
)
docs = loader.load()

# Check that exactly one document was loaded;
# otherwise an AssertionError is raised.
assert len(docs) == 1

print(f"Total characters: {len(docs[0].page_content)}")

In [None]:
# Printing the first 500 characters of the document
print(docs[0].page_content[0:500])

### **Splitting documents**
Our loaded document is too long to fit into the context window of many models.

To handle this, we'll split the document into chunks. 

This step is needed for the next steps, where we apply **embedding** (*representing text in a continuous vector space*) and **vector storage** (*saving and retrieving vectors*). It should help us retrieve only the most relevant parts of the blog post at run time.

We will use ***RecursiveCharacterTextSplitter***, which recursively splits the text using common separators such as new lines until each chunk is an appropriate size.
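
To make the idea of recursive splitting concrete, here is a minimal pure-Python sketch. It is a simplification written for this notebook, not LangChain's actual implementation: the function name `recursive_split`, the separator list, and the sample text are assumptions, and chunk overlap is ignored.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Toy recursive splitter (hypothetical helper, not LangChain's code).

    Tries the coarsest separator first and only falls back to finer
    separators for pieces that are still longer than chunk_size.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = part if not current else current + sep + part
            if len(candidate) <= chunk_size:
                # The piece still fits: keep growing the current chunk.
                current = candidate
                continue
            if current:
                chunks.append(current)
            if len(part) > chunk_size:
                # The piece alone is too long: retry with finer separators.
                chunks.extend(recursive_split(part, chunk_size, separators[i + 1:]))
                current = ""
            else:
                current = part
        if current:
            chunks.append(current)
        return chunks

    # No separator applies: fall back to a hard cut every chunk_size characters.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "alpha beta\n\ngamma delta epsilon\n\nzeta"
chunks = recursive_split(sample, chunk_size=12)
print(chunks)
```

The paragraph break is tried first, and only the middle paragraph, which is longer than 12 characters, is split again on spaces.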

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # chunk size (characters)
    chunk_overlap=200,  # overlap between consecutive chunks (characters)
    # If True, each chunk's metadata records the character index
    # at which it begins in the original document.
    add_start_index=True,
)
all_splits = text_splitter.split_documents(docs)

print(f"Split blog post into {len(all_splits)} sub-documents.")

In [None]:
print(all_splits[0])
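
The `chunk_overlap` and `add_start_index` settings can be pictured with a fixed-size sliding window. This is only a rough illustration: the real splitter cuts at separators, so its overlaps are approximate, and the `window_chunks` helper and the sample string are made up for this sketch.

```python
def window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters.

    Hypothetical helper for illustration; each chunk records the index at
    which it starts in the original text, similar in spirit to what
    add_start_index=True stores in the chunk metadata.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


alphabet = "abcdefghijklmnopqrstuvwxyz"
windows = window_chunks(alphabet, chunk_size=10, chunk_overlap=4)
for w in windows:
    print(w["start_index"], w["text"])
```

Each window repeats the last 4 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is short enough.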

### **Storing documents**
Now we need to index the text chunks so we can search over them at runtime. Here we embed the content of each split and insert these embeddings into a vector store. Then, given an input query, we can use vector search to retrieve relevant documents.

We can embed and store all the splits in a single command:

In [None]:
document_ids = vector_store.add_documents(documents=all_splits)

print(document_ids[:3])

We've completed the **indexing** portion of the pipeline. We now have a queryable vector store containing the chunked contents of the blog post. Given a user question, we should ideally be able to return the snippets of the blog post that answer it.
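
As a rough picture of what querying such a store involves, the sketch below embeds texts, keeps the vectors in memory, and ranks them by cosine similarity to the query. Everything in it is an assumption made for illustration: the word-count "embedding" stands in for a real embedding model, `ToyVectorStore` is not LangChain's `InMemoryVectorStore`, and the three snippets are invented sentences rather than text from the blog post.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts, used as a sparse vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store: a list of (text, vector) rows."""

    def __init__(self):
        self._rows = []

    def add_texts(self, texts):
        # Embed each text once at indexing time and return integer ids.
        first_id = len(self._rows)
        self._rows.extend((text, embed(text)) for text in texts)
        return list(range(first_id, len(self._rows)))

    def similarity_search(self, query, k=1):
        # Embed the query, then return the k texts most similar to it.
        query_vector = embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(query_vector, row[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
ids = store.add_texts([
    "Task decomposition breaks a big task into smaller steps.",
    "Memory lets an agent recall earlier observations.",
    "Tool use lets an agent call external APIs.",
])
top = store.similarity_search("How does an agent recall observations?", k=1)
print(ids, top)
```

A real embedding model captures meaning rather than exact word overlap, but the retrieval step has the same shape: embed the query, compare it with the stored vectors, and return the closest chunks.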