User story:
- Hey, what is the webpage about: https://www.rottentomatoes.com/m/avengers_endgame
- I want to know the title of the webpage: Avengers: Endgame (2019)

## References
- https://learn.deeplearning.ai/langchain/lesson
- https://towardsdatascience.com/how-to-chunk-text-data-a-comparative-analysis-3858c4a0997a

**Imports**

In [87]:
import openai
import os
from dotenv import load_dotenv, find_dotenv

import urllib.request  # the request submodule must be imported explicitly
from langchain.document_loaders import UnstructuredHTMLLoader, BSHTMLLoader

from langchain.vectorstores import DocArrayInMemorySearch
from langchain.docstore.document import Document
from langchain.indexes import VectorstoreIndexCreator
from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI



In [83]:
# Load and set API key
_ = load_dotenv(find_dotenv())
openai.api_key = os.environ['OPENAI_API_KEY']

model_name = "gpt-3.5-turbo"


**Download the webpage**

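To read the webpage's contents directly into memory instead of saving it to disk:

```python
# `webpage` is the URL string defined in the next cell
with urllib.request.urlopen(webpage) as f:
    myfile = f.read()  # raw bytes of the HTTP response body
print(myfile.decode("utf-8"))
```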

In [2]:
webpage_path = 'webpage.html'
webpage = "https://stackoverflow.com/questions/15138614/how-can-i-read-the-contents-of-an-url-with-python"
urllib.request.urlretrieve(webpage, webpage_path)

('webpage.html', <http.client.HTTPMessage at 0x28c67e4cd50>)
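
The user story asks for the title of the webpage. As a minimal sketch using only the standard library (not part of the LangChain pipeline below), the `<title>` element can be pulled out of the HTML directly. The names `TitleParser`, `extract_title`, and `sample_html` are hypothetical helpers introduced for illustration, and the sample HTML is a made-up stand-in for a downloaded page.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False  # ignore any later <title> tags (e.g. inside SVG)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip()


def extract_title(html: str) -> str:
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title


# Hypothetical stand-in for the contents of a downloaded page
sample_html = (
    "<html><head><title>Avengers: Endgame (2019)</title></head>"
    "<body></body></html>"
)
print(extract_title(sample_html))  # Avengers: Endgame (2019)
```

For the file saved above, the same helper could presumably be applied to the text read from `webpage_path`, assuming the page is UTF-8 encoded.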

**Load the page with LangChain**

In [3]:
loader = UnstructuredHTMLLoader(webpage_path)
loader.load()

### Query the webpage

In [6]:
# Create a vector store index
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
).from_loaders([loader])

# Query the index
query = "What is this webpage about?"
response = index.query(query)

# Print the response
print(response)

## Querying Large Documents

The Wikipedia page about Artificial Intelligence is a large document. It does not fit into the GPT-3.5 context window, so we need to split it into chunks.

In [13]:
# Download the page
webpage = 'https://en.wikipedia.org/wiki/Artificial_intelligence'
webpage_path = 'data/artificial-intelligence.html'
os.makedirs(os.path.dirname(webpage_path), exist_ok=True)  # make sure data/ exists
urllib.request.urlretrieve(webpage, webpage_path)

# Create a loader for the page
loader = UnstructuredHTMLLoader(webpage_path)

# Create a vector store index
# ai_page_index = VectorstoreIndexCreator(
#     vectorstore_cls=DocArrayInMemorySearch,
# ).from_loaders([loader])

# # Query the index
# query = "What is this webpage about?"
# response = ai_page_index.query(query)

# # Print the response
# print(response)

In [100]:

loader = BSHTMLLoader(webpage_path, open_encoding='utf-8')
docs = loader.load()
docs



In [36]:
docs[0].page_content
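
The page content loaded above is one long string, so the chunking step mentioned at the start of this section still has to be applied to it. Below is a minimal sketch of fixed-size character chunking with overlap, written in plain Python. It is not LangChain's own splitter (LangChain ships `RecursiveCharacterTextSplitter` for this purpose), and the function name `split_text` and the default sizes are assumptions chosen for illustration.

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters so that a sentence
    cut at a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# Tiny example so the overlap is easy to see
sample_text = "abcdefghij"
chunks = split_text(sample_text, chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

In this notebook, the same function could be called as `split_text(docs[0].page_content)`. Each chunk would then be embedded and indexed separately, so that only the most relevant chunks are sent to the model instead of the whole page.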

