This notebook builds a question answering chain that uses the content of a specified website as context.

# Setting up

Install dependencies

In [1]:
!pip install langchain
!pip install pinecone-client
!pip install openai
!pip install tiktoken
!pip install nest_asyncio

Set up OpenAI API key

In [2]:
import os
os.environ["OPENAI_API_KEY"] = "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
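Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key was exported in the shell before Jupyter was started; `require_env` is a hypothetical helper written for this example, not part of LangChain or the OpenAI client:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Example (commented out so the cell does not fail when the key is absent):
# openai_key = require_env("OPENAI_API_KEY")
```

Failing early with a clear message is usually easier to debug than an authentication error raised later inside a chain.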

Set up Pinecone API keys

In [None]:
import pinecone

# initialize pinecone
pinecone.init(
    api_key="YOUR_PINECONE_API_KEY",  # find at app.pinecone.io; never commit a real key
    environment="asia-southeast1-gcp-free"  # next to api key in console
)

# Index

**Load data from Web**

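The cells below use UnstructuredURLLoader, which fetches each URL in a list and returns the content of each page as a document. It depends on the `unstructured` package, which is why that package is installed first.

To index a whole site instead, use SitemapLoader. It extends WebBaseLoader: it loads a sitemap from a given URL, then scrapes and loads all the pages in the sitemap, returning each page as a document. The scraping is done concurrently, with a limit on concurrent requests that defaults to 2 per second.

Link to the SitemapLoader [documentation](https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/sitemap.html)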

In [None]:
!pip install unstructured

In [27]:
from langchain.document_loaders import UnstructuredURLLoader

In [28]:
urls = [
    "https://www.naeco.blue/"
]

In [29]:
loader = UnstructuredURLLoader(urls=urls)

In [31]:
data = loader.load()

**Split the text from docs into smaller chunks**

There are many ways to split the text. We use the text splitter that is recommended for generic text. For other ways to split the text, check the [documentation](https://python.langchain.com/en/latest/modules/indexes/text_splitters.html)

In [32]:
len(data)

1

In [33]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap  = 100,
    length_function = len,
)

docs_chunks = text_splitter.split_documents(data)


In [34]:
len(docs_chunks)

2
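To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch of a fixed-size splitter with overlap. It is not the algorithm RecursiveCharacterTextSplitter uses (that one prefers to break on separators such as paragraphs and sentences); `split_with_overlap` is a hypothetical helper that only illustrates how consecutive chunks share characters:

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij"
print(split_with_overlap(sample, chunk_size=4, chunk_overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.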

In [None]:
docs_chunks[1]

Create embeddings

In [36]:
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

**Creating a vectorstore**

A vectorstore stores Documents and associated embeddings, and provides fast ways to look up relevant Documents by embeddings.
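As a rough illustration of that idea, the sketch below keeps texts and their vectors in a list and ranks them by cosine similarity. `ToyVectorStore` and the hand-made three-dimensional vectors are assumptions for this example only; real vectorstores such as Pinecone use approximate nearest-neighbour indexes and much larger embedding vectors:

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Brute-force in-memory store: compares the query against every stored vector."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, List[float]]] = []

    def add(self, text: str, vector: Sequence[float]) -> None:
        self._items.append((text, list(vector)))

    def similarity_search(self, query_vector: Sequence[float], k: int = 4) -> List[str]:
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(query_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add("solar forecast", [1.0, 0.0, 0.0])
store.add("company address", [0.0, 1.0, 0.0])
store.add("wind forecast", [0.9, 0.1, 0.0])
print(store.similarity_search([1.0, 0.0, 0.0], k=2))
# prints ['solar forecast', 'wind forecast']
```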

There are many ways to create a vectorstore. We are going to use Pinecone. For other types of vectorstores visit the [documentation](https://python.langchain.com/en/latest/modules/indexes/vectorstores.html)

First, go to [Pinecone](https://www.pinecone.io/) and create an index there. Then assign the name of that index to `index_name` in the cell below.

In [37]:
from langchain.vectorstores import Pinecone

index_name = "hsharzirina"

# embed the chunks and upload them to the Pinecone index created above
docsearch = Pinecone.from_documents(docs_chunks, embeddings, index_name=index_name)

# if the index already contains your documents, you can load it like this
# docsearch = Pinecone.from_existing_index(index_name, embeddings)


If you cannot create a Pinecone account, try Chroma instead. It creates a transient in-memory vectorstore. For further information check the [documentation](https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/chroma.html).

The following code block uses Chroma to create the vectorstore. Uncomment it and use it instead if you don't have access to Pinecone.

In [None]:
# requires: pip install chromadb
# from langchain.vectorstores import Chroma
# docsearch = Chroma.from_documents(docs_chunks, embeddings)

The vectorstore is ready. Let's query docsearch with a similarity search.

In [39]:
query = "Who is Felix Mertens?"
# use a new name so the loaded documents in `data` are not overwritten
matches = docsearch.similarity_search(query)
print(matches)

[Document(page_content='Business Case\n\nProblem\n\nVolatilität der erneuerbaren Energien\n\nErfahre mehr\n\nLösung\n\nEinspeiseprognose mit Hilfe eines intelligenten Algorithmus\n\nErfahre mehr\n\nAnwendungsbereiche\n\nDirektvermarktung\nPortfoliooptimierung\nAnlagenplanung\nNetzplanung\nEnergieautarke Firmen\nEnergiespeicher\nEnergiewandlung\nIntelligente Stromnetze\nWartungsplanungElektromobilität\nEigenverbrauch\nAutarke Energieversorgung\n\nErfahre mehr\n\nNAECO Blue Management\n\nFelix Mertens\n\nGründer und Geschäftsführer\n\nM.Sc. Wirtschaftsingenieur Energiemanagement\n\n1000\n\nMio. Datenpunkte\n\n10\n\nPartner\n\n110\n\nAnlagen im Portfolio\n\n8000\n\nErstellte Prognosen\n\nUnsere Partner und Unterstützer\n\nDancing Man GmbH\n\nGWGL Hamburg\n\nGateway49 Accelerator\n\nAmazon Web Services\n\nStadtwerke Lübeck GmbH\n\nCapCerta GmbH & Co. KG\n\nEnergiecluster Digitales Lübeck\n\nFunding from the European Regional Development Fund (ERDF)\n\nAddress\n\nNAECO Blue GmbHAmselweg 523

# Making a question answering chain
The question answering chain will enable us to generate the answer based on the relevant context chunks. See the [documentation](https://python.langchain.com/en/latest/modules/chains/index_examples/qa_with_sources.html) for more explanation.

Additionally, we can return the source documents used to answer the question by specifying an optional parameter when constructing the chain. For more information visit the [documentation](https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa.html#return-source-documents)

In [40]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

llm = OpenAI()

qa_with_sources = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
)

query = "Who is Felix Mertens?"
result = qa_with_sources({"query": query})
result["result"]

' Felix Mertens is the founder and CEO of NAECO Blue Management.'
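The `chain_type="stuff"` setting means the retrieved chunks are all placed ("stuffed") into a single prompt together with the question. The sketch below shows the general shape of that step. `build_stuff_prompt` and the prompt wording are assumptions for illustration, not the template LangChain actually ships:

```python
from typing import List


def build_stuff_prompt(question: str, chunks: List[str]) -> str:
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "Who is Felix Mertens?",
    ["NAECO Blue Management", "Felix Mertens\n\nGründer und Geschäftsführer"],
)
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined text fits in the model's context window, which is one reason to keep `chunk_size` moderate.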

Output the source documents that were retrieved for the query

In [41]:
result["source_documents"]

[Document(page_content='Business Case\n\nProblem\n\nVolatilität der erneuerbaren Energien\n\nErfahre mehr\n\nLösung\n\nEinspeiseprognose mit Hilfe eines intelligenten Algorithmus\n\nErfahre mehr\n\nAnwendungsbereiche\n\nDirektvermarktung\nPortfoliooptimierung\nAnlagenplanung\nNetzplanung\nEnergieautarke Firmen\nEnergiespeicher\nEnergiewandlung\nIntelligente Stromnetze\nWartungsplanungElektromobilität\nEigenverbrauch\nAutarke Energieversorgung\n\nErfahre mehr\n\nNAECO Blue Management\n\nFelix Mertens\n\nGründer und Geschäftsführer\n\nM.Sc. Wirtschaftsingenieur Energiemanagement\n\n1000\n\nMio. Datenpunkte\n\n10\n\nPartner\n\n110\n\nAnlagen im Portfolio\n\n8000\n\nErstellte Prognosen\n\nUnsere Partner und Unterstützer\n\nDancing Man GmbH\n\nGWGL Hamburg\n\nGateway49 Accelerator\n\nAmazon Web Services\n\nStadtwerke Lübeck GmbH\n\nCapCerta GmbH & Co. KG\n\nEnergiecluster Digitales Lübeck\n\nFunding from the European Regional Development Fund (ERDF)\n\nAddress\n\nNAECO Blue GmbHAmselweg 523

In [42]:
query = "What does Naeco Blue do?"
result = qa_with_sources({"query": query})
result["result"]

' Naeco Blue develops digital solutions that aim to minimize costs through forecast deviations and incorrect forecasts of energy generated by renewable energies, bringing customers savings and additional profit.'