### This notebook's objective is to build a simple retrieval-augmented generation (RAG) pipeline using LangChain.
This notebook follows the [LangChain RAG tutorial](https://python.langchain.com/docs/tutorials/rag/).

We can use open-source components from Hugging Face, such as the embedding model loaded below.

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
# Ensure VertexAI credentials are configured
from langchain_google_vertexai import ChatVertexAI

model = ChatVertexAI(model="gemini-1.5-flash")
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
vector_store = InMemoryVectorStore(embeddings)
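
To build some intuition for what an in-memory vector store does with an embedding function, here is a minimal sketch that uses only the Python standard library. It is not LangChain's implementation: `toy_embed`, `cosine_similarity`, and `ToyVectorStore` are hypothetical names, and the bag-of-words "embedding" is a crude stand-in for a real model such as `all-mpnet-base-v2`.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Illustrative in-memory store: keeps (text, vector) pairs in a plain list."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.entries = []

    def add_texts(self, texts):
        # Embed each text once at insertion time and keep it in memory.
        for text in texts:
            self.entries.append((text, self.embed_fn(text)))

    def similarity_search(self, query, k=1):
        # Embed the query, then rank stored texts by similarity to it.
        query_vec = self.embed_fn(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query_vec, entry[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore(toy_embed)
store.add_texts([
    "agents plan tasks and use tools",
    "cats sleep most of the day",
])
top_match = store.similarity_search("which tools do agents use", k=1)
print(top_match)
```

A real embedding model maps text to dense vectors that capture meaning rather than exact word overlap, but the store-then-rank-by-similarity flow is the same idea.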

## **Step 1: indexing**

### **Loading documents**
We need to first load the blog post contents. We can use DocumentLoaders for this, which are objects that load in data from a source and return a list of Document objects.

In this case, we’ll use the WebBaseLoader, which uses urllib to load HTML from web URLs and BeautifulSoup to parse it to text.

Documentation for document loaders is available [here](https://python.langchain.com/docs/how_to/#document-loaders), in the document loaders section.
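
As a rough illustration of the loader idea (take a source, return a list of document objects, and keep only selected parts of the HTML), here is a small sketch built on the standard library's `html.parser`. It does not reflect how WebBaseLoader or BeautifulSoup work internally: `SimpleDocument`, `ClassFilterParser`, and `load_html` are hypothetical names, and the HTML snippet is made up for the example.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class ClassFilterParser(HTMLParser):
    """Collects text only from elements whose class is in `keep_classes`."""

    def __init__(self, keep_classes):
        super().__init__()
        self.keep_classes = set(keep_classes)
        self.depth = 0  # greater than 0 while inside a kept element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0 or self.keep_classes.intersection(classes):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def load_html(html: str, source: str, keep_classes) -> list:
    """Parse one HTML string into a single-element list of SimpleDocument."""
    parser = ClassFilterParser(keep_classes)
    parser.feed(html)
    parser.close()
    text = "\n".join(parser.chunks)
    return [SimpleDocument(page_content=text, metadata={"source": source})]


sample_html = """
<html><body>
  <nav class="menu">Home | About</nav>
  <h1 class="post-title">Agents</h1>
  <div class="post-content"><p>Planning and memory.</p></div>
  <footer>Copyright</footer>
</body></html>
"""

sample_docs = load_html(
    sample_html,
    source="example.html",
    keep_classes=("post-title", "post-header", "post-content"),
)
print(sample_docs[0].page_content)
```

The navigation and footer text are dropped because their elements do not carry any of the kept classes, which is the same effect the class-based filter is meant to achieve on the real blog post.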

In [None]:
# bs4 (BeautifulSoup) parses HTML and extracts content from the web page.
import bs4

# WebBaseLoader loads documents from the web.
# A possible next step for this notebook is loading documents
# from other sources, such as PDFs or Word documents.
from langchain_community.document_loaders import WebBaseLoader

# Only keep post title, headers, and content from the full HTML.
bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs={"parse_only": bs4_strainer},
)
docs = loader.load()

# Check that exactly one document was loaded;
# otherwise an AssertionError is raised.
assert len(docs) == 1

print(f"Total characters: {len(docs[0].page_content)}")