# Create Text Embeddings for a Vector Store using LangChain

This notebook walks through building a question/answer system that retrieves information from a private knowledge base. A pre-trained LLM, or even a fine-tuned one, is usually not sufficient on its own when you want a conversational interface for asking specific questions about specific data. That data, the private knowledge base, can be a collection of documents, websites, research papers, structured data tables, and more.

The steps to set up the private knowledge base are as follows:
1) Split documents into chunks
2) Vectorize (embed) each chunk 
3) Store vectors/embeddings in a database

Once you have a vector store of embeddings (the private knowledge base), the process of using it in a conversational workflow is as follows:
1) Embed the query (question)
2) Run a nearest-neighbors lookup in the vector store with the query embedding to find relevant chunks
3) Use the relevant chunks to formulate a response

This process requires an LLM (such as PaLM) to formulate responses to queries from the relevant chunks found via the nearest-neighbors lookup.
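The two lists above can be sketched end to end in plain Python. This is only a toy illustration: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the "vector store" is a Python list, and splitting on periods stands in for proper chunking. None of these names come from LangChain or Chroma.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_store(document):
    """Setup steps 1-3: split into chunks, embed each chunk, store the pairs."""
    chunks = [s.strip() for s in document.split(".") if s.strip()]
    return [(chunk, toy_embed(chunk)) for chunk in chunks]


def retrieve(store, query, k=2):
    """Query steps 1-2: embed the query, then return the k most similar chunks."""
    query_vector = toy_embed(query)
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:k]]


document = (
    "Chroma stores embeddings on disk. "
    "The transformer architecture uses attention. "
    "Wikipedia articles are split into chunks."
)
store = build_store(document)
top_chunks = retrieve(store, "Which architecture uses attention?", k=1)
print(top_chunks)  # ['The transformer architecture uses attention']
```

Query step 3 (formulating the response) is where the LLM comes in: the retrieved chunks are passed to it as context together with the question.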

There are many options for a vector store, including managed and scalable offerings like [Vertex AI Vector Search](https://cloud.google.com/vertex-ai/docs/vector-search/overview). There are also different options for the underlying LLM. In this walkthrough we will use [Chroma](https://www.trychroma.com/) as the vector store and [PaLM](https://cloud.google.com/vertex-ai/docs/generative-ai/start/quickstarts/api-quickstart) as the language model. In a production environment, consider a more scalable vector store such as Vertex AI Vector Search.


### Setup

In [1]:
!pip3 install --user \
    langchain \
    wikipedia \
    chromadb \
    google-cloud-aiplatform \
    langchain-google-vertexai

Collecting langchain
  Downloading langchain-0.1.17-py3-none-any.whl.metadata (13 kB)
Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Collecting chromadb
  Downloading chromadb-0.5.0-py3-none-any.whl.metadata (7.3 kB)
Collecting langchain-google-vertexai
  Downloading langchain_google_vertexai-1.0.3-py3-none-any.whl.metadata (3.7 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.5-py3-none-any.whl.metadata (25 kB)
Collecting langchain-community<0.1,>=0.0.36 (from langchain)
  Downloading langchain_community-0.0.36-py3-none-any.whl.metadata (8.7 kB)
Collecting langchain-core<0.2.0,>=0.1.48 (from langchain)
  Downloading langchain_core-0.1.48-py3-none-any.whl.metadata (5.9 kB)
Collecting langchain-text-splitters<0.1,>=0.0.1 (from langchain)
  Downloading langchain_text_splitters-0.0.1-py3-none-any.whl.metadata (2.0 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
 

### Restart current runtime
To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.

In [2]:
# Restart kernel after installs so that your environment can access the new packages
import IPython
import time

app = IPython.Application.instance()
app.kernel.do_shutdown(True)


{'status': 'ok', 'restart': True}

In [2]:
from langchain.document_loaders import WikipediaLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain_google_vertexai import VertexAIEmbeddings
from langchain_google_vertexai import VertexAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import RetrievalQA, ConversationalRetrievalChain

In [3]:
# Define project information
import sys

PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

# if not running on colab, try to get the PROJECT_ID automatically
if "google.colab" not in sys.modules:
    import subprocess

    PROJECT_ID = subprocess.check_output(
        ["gcloud", "config", "get-value", "project"], text=True
    ).strip()

print(f"Your project ID is: {PROJECT_ID}")


Your project ID is: qwiklabs-gcp-03-39d1fc7a0525


### Document Loading 
LangChain provides classes to load data from different sources. Some useful data loaders are [Google Cloud Storage Directory Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/google_cloud_storage_directory), [Google Drive Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/google_drive), [Recursive URL Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/recursive_url_loader), [PDF Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/how_to/pdf), [JSON Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/how_to/json), [Wikipedia Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/wikipedia), and [more](https://python.langchain.com/docs/modules/data_connection/document_loaders/).

In this notebook we will use the Wikipedia loader to create a private knowledge base of Wikipedia articles about machine learning, but the overall process is similar regardless of which document loader you use.

In [5]:
# Task 1. Use the WikipediaLoader to load documents related to queries
docs = WikipediaLoader(query="Machine Learning", load_max_docs=10).load()
docs += WikipediaLoader(query="Deep Learning", load_max_docs=10).load() 
docs += WikipediaLoader(query="Neural Networks", load_max_docs=10).load()

# Take a look at a single document
docs[0]

Document(page_content='Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Recently, artificial neural networks have been able to surpass many previous approaches in performance.\nML finds application in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine. When applied to business problems, it is known under the name predictive analytics. Although not all machine learning is statistically based, computational statistics is an important source of the field\'s methods.\nThe mathematical foundations of ML are provided by mathematical optimization (mathematical programming) methods. Data mining is a related (parallel) field of study, focusing on exploratory data analysis (EDA) through unsupervised learning. \nFrom a theo

### Split text into chunks
Now that we have the documents, we will split them into chunks. Each chunk will become one vector in the vector store. To do this we define a chunk size (number of characters) and a chunk overlap (the number of characters shared between consecutive chunks, i.e. a sliding window). The ideal chunk size can be difficult to determine. If chunks are too large, each one contains too much information and is not specific enough; if they are too small, each one contains too little information. In both cases, a nearest-neighbors lookup with a query embedding may struggle to retrieve the relevant chunks, and chunks that are too large may not fit as context in an LLM query at all.

In this notebook we will use a chunk size of 800 characters and a chunk overlap of 400 characters, but feel free to experiment with other sizes! Note: you can specify a custom `length_function` with `RecursiveCharacterTextSplitter` if you want chunk size and overlap to be measured by something other than Python's `len` function. In addition to `RecursiveCharacterTextSplitter`, there are [other text splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/split_by_token) you can consider.
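To make the interaction between chunk size and chunk overlap concrete, here is a simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that class splits on separators such as paragraph and line breaks before merging pieces); the hypothetical `sliding_window_chunks` helper only shows that each new chunk starts `chunk_size - chunk_overlap` characters after the previous one.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# 2,000 characters of repeating digits so overlaps are easy to inspect
text = "".join(str(i % 10) for i in range(2000))
chunks = sliding_window_chunks(text, chunk_size=800, chunk_overlap=400)

print(len(chunks))                        # 4
print([len(c) for c in chunks])           # [800, 800, 800, 800]
print(chunks[0][400:] == chunks[1][:400]) # True: second half of one chunk opens the next
```

With an overlap of half the chunk size, every character (apart from the first and last 400) appears in two chunks, which roughly doubles the number of vectors to store compared with no overlap.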

In [6]:
# Task 2. Use the RecursiveCharacterTextSplitter class to split the documents into chunks for embedding
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=400,
    length_function=len,
)

chunks = text_splitter.split_documents(docs)

# Look at the first two chunks 
chunks[0:2]

[Document(page_content="Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Recently, artificial neural networks have been able to surpass many previous approaches in performance.\nML finds application in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine. When applied to business problems, it is known under the name predictive analytics. Although not all machine learning is statistically based, computational statistics is an important source of the field's methods.", metadata={'title': 'Machine learning', 'summary': "Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus 

In [7]:
print(f'Number of documents: {len(docs)}')
print(f'Number of chunks: {len(chunks)}')

Number of documents: 30
Number of chunks: 233


### Vectorize/Embed Document Chunks
Now we need to embed the document chunks (turn them into vectors) and store them in a vector store. We can use any text embedding model, but we must use the same model when we embed queries at prediction time. To keep things simple we will use the PaLM API for Embeddings. The LangChain library provides a wrapper class around the PaLM Embeddings API, `VertexAIEmbeddings()`.
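The requirement to reuse the same embedding model at query time can be illustrated with a toy example. Here two hypothetical "models" (`VOCAB_A` and `VOCAB_B`, both invented for this sketch) count the same words but assign them to different vector dimensions, so vectors produced by one are not comparable with vectors produced by the other.

```python
import math

# Two toy "embedding models": same vocabulary, different dimension order
VOCAB_A = ["neural", "network", "chroma", "vector"]
VOCAB_B = list(reversed(VOCAB_A))


def embed(text, vocab):
    """Count how often each vocabulary word occurs, in the model's dimension order."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in vocab]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# The documents are always embedded with model A
doc_vectors = {doc: embed(doc, VOCAB_A) for doc in ["neural network", "chroma vector"]}


def best_match(query, query_vocab):
    """Return the stored document closest to the query embedding."""
    query_vector = embed(query, query_vocab)
    return max(doc_vectors, key=lambda doc: cosine(query_vector, doc_vectors[doc]))


print(best_match("neural network", VOCAB_A))  # neural network (same model: correct)
print(best_match("neural network", VOCAB_B))  # chroma vector (different model: wrong)
```

Real embedding models differ in far less obvious ways than a reordered vocabulary, but the consequence is the same: similarity scores across models are meaningless.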

For the purposes of this lab, you will use [Chroma](https://www.trychroma.com/) as the vector store for simplicity. In a real-world scenario with a large private knowledge base, you may not be able to fit everything in memory. LangChain has a wrapper class for Chroma that lets us pass in a list of documents and an embedding class to create the vector store.

In [8]:
# Task 3. Create vector store using embeddings
embedding = VertexAIEmbeddings(model_name="textembedding-gecko@001") # PaLM embedding API 

# set persist directory so the vector store is saved to disk
db = Chroma.from_documents(chunks, embedding, persist_directory="./vectorstore")

### Putting it all together
Now that everything is in place, we can tie it all together with a LangChain chain. A chain orchestrates the multiple steps required to use an LLM for a specific use case. In this case the chain first embeds the query, then performs a nearest-neighbors lookup to find the relevant chunks, then uses those chunks to formulate a response with an LLM. We will use the Chroma database as our vector store and PaLM as our LLM. LangChain provides a wrapper around PaLM, `VertexAI()`.

For this simple Q/A use case we can use LangChain's `RetrievalQA` to link the steps together.
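Conceptually, the `stuff` chain type used with `RetrievalQA` places ("stuffs") all retrieved chunks into a single prompt alongside the question. A minimal sketch of that idea follows; the `build_stuff_prompt` helper and the prompt wording are illustrative assumptions, not LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Combine all retrieved chunks and the question into one prompt string."""
    # Separate chunks with blank lines so the model can tell them apart
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


# Hypothetical retrieved chunks, standing in for the retriever's output
retrieved_chunks = [
    "The transformer architecture was introduced in 2017.",
    "Transformers rely on an attention mechanism.",
]
prompt = build_stuff_prompt("When was the transformer invented?", retrieved_chunks)
print(prompt)
```

Because every chunk goes into one prompt, the number of retrieved chunks multiplied by the chunk size has to fit within the LLM's context window, which is one reason chunk size matters.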

In [9]:
# vector store 
retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={"k":5} # number of nearest neighbors to retrieve  
)

# PaLM API 
# You can also set temperature, top_p, top_k 
llm = VertexAI(
    model_name="text-bison",
    max_output_tokens=1024
)

# q/a chain 
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

### Query 
Now that everything is tied together we can send queries and get answers! 

In [13]:
def ask_question(question: str):
    # Run the retrieval + generation chain; the result includes the source documents
    response = qa.invoke({"query": question})
    print(f"Response: {response['result']}\n")

    # De-duplicate the source URLs of the retrieved chunks
    citations = {doc.metadata['source'] for doc in response['source_documents']}
    print(f"Citations: {citations}\n")

    # comment out the line below to hide the source chunks used
    print(f"Source Chunks Used: {response['source_documents']}")

In [14]:
ask_question("What technology underpins large language models?")

Response:  The provided context does not mention the technology that underpins large language models.

Citations: {'https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)', 'https://en.wikipedia.org/wiki/Mamba_(deep_learning_architecture)'}

Source Chunks Used: [Document(page_content='=== Comparison to Transformers ===\n\n\n== Variants ==\n\n\n=== Token-free language models: MambaByte ===\n\nOperating on byte-sized tokens, transformers scale poorly as every token must "attend" to every other token leading to O(n2) scaling laws, as a result, Transformers opt to use subword tokenization to reduce the number of tokens in text, however, this leads to very large vocabulary tables and word embeddings.\nThis research investigates a novel approach to language modeling, MambaByte, which departs from the standard token-based methods. Unlike traditional models that rely on breaking text into discrete units, MambaByte directly processes raw byte sequences.  This eliminates the need

In [15]:
ask_question("When was the transformer invented?")

Response:  The transformer was invented in 2017.

Citations: {'https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)', 'https://en.wikipedia.org/wiki/Feedforward_neural_network'}

Source Chunks Used: [Document(page_content='== Timeline ==\nIn 1958, a layered network of perceptrons, consisting of an input layer, a hidden layer with randomized weights that did not learn, and an output layer with learning connections, was introduced already by Frank Rosenblatt in his book Perceptron. This extreme learning machine was not yet a deep learning network.\nIn 1965, the first  deep-learning feedforward network, not yet using stochastic gradient descent, was published by Alexey Grigorevich Ivakhnenko and Valentin Lapa, at the time called the Group Method of Data Handling.', metadata={'source': 'https://en.wikipedia.org/wiki/Feedforward_neural_network', 'summary': 'A feedforward neural network (FNN) is one of the two broad types of artificial neural network, characterized by direc