This notebook builds a question-answering chain that uses the content of a website as its context data.

### Setup

Install dependencies

In [None]:
!pip install langchain
!pip install pinecone-client
!pip install openai
!pip install tiktoken
!pip install nest_asyncio


Set up OpenAI API key

In [None]:
import os
os.environ["OPENAI_API_KEY"] = "INSERT_KEY"

Set up Pinecone API keys

In [None]:
from pinecone import Pinecone
import os

# Pinecone API key (find it at app.pinecone.io)
api_key = "INSERT_KEY"
# environment name shown next to the API key in the console
# (kept for reference; it is not passed to the client below)
environment = "gcp-starter"

# also expose the key as an environment variable for libraries that read it from there
os.environ["PINECONE_API_KEY"] = api_key

# configure client
pc = Pinecone(api_key=api_key)

# Index


**Load data from Web**

`SitemapLoader` extends `WebBaseLoader`: it loads a sitemap from a given URL, then scrapes and loads every page listed in the sitemap, returning each page as a document.

The pages are scraped concurrently. Concurrent requests are rate-limited, with a default of 2 per second.

Link to the [documentation](https://python.langchain.com/docs/integrations/document_loaders/sitemap/)
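To make the sitemap step more concrete, here is a minimal standard-library sketch of what "load a sitemap and keep only some URLs" amounts to. It is not the loader's implementation: the function name `extract_urls`, the sample sitemap, and the `example.com` URLs are made up for illustration, and the sketch uses simple prefix matching, which may differ from how the loader interprets `filter_urls`.

```python
import xml.etree.ElementTree as ET

# XML namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml, url_prefixes=None):
    """Return the <loc> URLs of a sitemap.

    If url_prefixes is given, keep only URLs starting with one of them.
    (Hypothetical helper, not part of LangChain.)
    """
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    if url_prefixes:
        urls = [u for u in urls if any(u.startswith(p) for p in url_prefixes)]
    return urls


# Hypothetical two-entry sitemap, used only for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/news/item-1</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

news_urls = extract_urls(SAMPLE_SITEMAP, url_prefixes=["https://example.com/news/"])
print(news_urls)
```

The loader then fetches each of the remaining URLs and turns the page text into a document.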

In [None]:
# @title Link XML
# fixes a bug with asyncio and jupyter
import nest_asyncio
nest_asyncio.apply()

from langchain.document_loaders.sitemap import SitemapLoader


loader = SitemapLoader(
    "https://www.canada.ca/en/veterans-affairs-canada.sitemap.xml",
    filter_urls=["https://www.canada.ca/en/veterans-affairs-canada/news/"]
)

# rate limit for concurrent requests (default is 2 per second)
loader.requests_per_second = 50


docs = loader.load()

In [None]:
# @title Document XML
# fixes a bug with asyncio and jupyter
import nest_asyncio
nest_asyncio.apply()

from langchain.document_loaders.sitemap import SitemapLoader

loader = SitemapLoader(
    "https://raw.githubusercontent.com/QuantumAbstraction/glowing-octo-telegram/main/Sitemap.xml",
    restrict_to_same_domain=False,
    #filter_urls=["https://www.veterans.gc.ca/en/"]
)

# rate limit for concurrent requests (default is 2 per second)
loader.requests_per_second = 50


docs = loader.load()

Fetching pages: 100%|##########| 10343/10343 [04:16<00:00, 40.26it/s]


**Split the text from docs into smaller chunks**

There are many ways to split the text. We use the text splitter that is recommended for generic text. For other ways to split the text, check the [documentation](https://python.langchain.com/docs/how_to/recursive_text_splitter/).

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1200,
    chunk_overlap  = 200,
    length_function = len,
)

docs_chunks = text_splitter.split_documents(docs)
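To see what `chunk_size` and `chunk_overlap` mean, the following sketch splits a string into fixed-size windows that share an overlap. This is a deliberately simplified stand-in: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this sketch does not do. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=1200, chunk_overlap=200):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    (Simplified illustration, not the LangChain splitter.)
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# 3000 characters of filler text, for illustration only.
sample = "".join(str(i % 10) for i in range(3000))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])  # [1200, 1200, 1000]
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.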

Create embeddings

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

**Creating a vectorstore**

A vectorstore stores Documents and associated embeddings, and provides fast ways to look up relevant Documents by embeddings.

There are many ways to create a vectorstore. We are going to use Pinecone. For other types of vectorstores visit the [documentation](https://python.langchain.com/docs/how_to/vectorstores/)

First, go to [Pinecone](https://www.pinecone.io/) and create an index there. Then set `index_name` below to the name of that index.

In [None]:
# note: this import shadows the `Pinecone` client class imported from `pinecone` earlier
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings

index_name = "INSERT_NAME"

# embed the chunks and add them to the index
docsearch = Pinecone.from_documents(docs_chunks, embeddings, index_name=index_name)

# if you already have an index, you can load it like this
#docsearch = Pinecone.from_existing_index(index_name, embeddings)


The vectorstore is ready. Let's query `docsearch` with a similarity search.

In [None]:
query = "What is JobConnex?"
docs = docsearch.similarity_search(query)
print(docs[0])

page_content="Q2. Do I need to qualify for benefits or services through Veterans Affairs Canada to apply as a Veteran on VAC JobConnex?\n\nYou don't have to qualify for benefits or services to apply for a job through VAC JobConnex. If you served in the Canadian Armed Forces or worked for the Royal Canadian Mounted Police, you can apply as a Veteran on VAC JobConnex.\n\n\nQ3. Who qualifies as a Veteran for the purposes of VAC JobConnex?\n\nIf you served in the Canadian Armed Forces or worked for the Royal Canadian Mounted Police, you can apply as a Veteran on VAC JobConnex.\n\n\nQ4. What can I expect when I apply to a job opportunity through GC Jobs?\n\nVisit Applying for Government of Canada Jobs: What to expect for more info.\n\n\nQ5. What can I expect when I apply to VAC JobConnex?\n\n\n\nYou will receive a confirmation email.\nYour application will be added to a pool of candidates that hiring managers can draw from when they need to fill job vacancies.\nIf your skills and experience

# Making a question answering chain
The question answering chain will enable us to generate the answer based on the relevant context chunks. See the [documentation](https://python.langchain.com/en/latest/modules/chains/index_examples/qa_with_sources.html) for more explanation.

Additionally, we can return the source documents used to answer the question by specifying an optional parameter when constructing the chain. For more information visit the [documentation](https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa.html#return-source-documents)
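The `"stuff"` chain type used below places all retrieved chunks into a single prompt together with the question. The following sketch shows that idea in plain Python. The function name `build_stuff_prompt` and the prompt wording are made up for illustration and are not LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate the retrieved chunks and the question into one prompt.

    (Hypothetical helper; the wording is an assumption, not LangChain's.)
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for the retriever's results.
retrieved = ["First retrieved chunk.", "Second retrieved chunk."]
prompt = build_stuff_prompt("What is JobConnex?", retrieved)
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined text fits in the model's context window.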

In [None]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
llm = OpenAI()

qa_with_sources = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=docsearch.as_retriever(), return_source_documents=True)

query = "What is JobConnex?"
result = qa_with_sources({"query": query})
result["result"]

' JobConnex is a recruitment portal specifically for Veterans Affairs Canada, where you can find current job openings and apply for temporary jobs. It is part of the Job Placement Program, which aims to help Veterans find employment in the public service.'

Output the source documents that were retrieved for the query

In [None]:
result["source_documents"]

[Document(page_content="Q2. Do I need to qualify for benefits or services through Veterans Affairs Canada to apply as a Veteran on VAC JobConnex?\n\nYou don't have to qualify for benefits or services to apply for a job through VAC JobConnex. If you served in the Canadian Armed Forces or worked for the Royal Canadian Mounted Police, you can apply as a Veteran on VAC JobConnex.\n\n\nQ3. Who qualifies as a Veteran for the purposes of VAC JobConnex?\n\nIf you served in the Canadian Armed Forces or worked for the Royal Canadian Mounted Police, you can apply as a Veteran on VAC JobConnex.\n\n\nQ4. What can I expect when I apply to a job opportunity through GC Jobs?\n\nVisit Applying for Government of Canada Jobs: What to expect for more info.\n\n\nQ5. What can I expect when I apply to VAC JobConnex?\n\n\n\nYou will receive a confirmation email.\nYour application will be added to a pool of candidates that hiring managers can draw from when they need to fill job vacancies.\nIf your skills and 
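The raw list of documents is long, so it can help to print only where each chunk came from. The sketch below assumes each returned document has a `metadata` dictionary with a `"source"` key holding the page URL. That is an assumption worth checking against your own output. The helper name `unique_sources` and the `example.com` URLs are hypothetical, and `SimpleNamespace` objects stand in for real documents so the snippet runs on its own.

```python
from types import SimpleNamespace


def unique_sources(documents):
    """Return the distinct metadata["source"] values, in first-seen order.

    (Hypothetical helper; assumes a "source" key in each document's metadata.)
    """
    seen = []
    for doc in documents:
        source = doc.metadata.get("source")
        if source and source not in seen:
            seen.append(source)
    return seen


# Stand-ins for result["source_documents"]; the URLs are made up.
fake_docs = [
    SimpleNamespace(page_content="chunk 1", metadata={"source": "https://example.com/a"}),
    SimpleNamespace(page_content="chunk 2", metadata={"source": "https://example.com/a"}),
    SimpleNamespace(page_content="chunk 3", metadata={"source": "https://example.com/b"}),
    SimpleNamespace(page_content="chunk 4", metadata={}),
]

sources = unique_sources(fake_docs)
print(sources)  # ['https://example.com/a', 'https://example.com/b']
```

In the notebook, the same helper could be called as `unique_sources(result["source_documents"])`.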