<a href="https://colab.research.google.com/github/GeorgeCrossIV/Langchain-HTML-Loader/blob/main/Langchain_HTML_Loader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LangChain - using the HTML Loader
This notebook uses the LangChain UnstructuredHTMLLoader class to load data from an HTML page.
- This example loads the HTML page at https://en.wikipedia.org/wiki/Andouille and then asks two questions about it.

# Setup
- Set the Colab secrets openai_api_key, cass_user, and cass_pw.
- Set the keyspace and table variables. The keyspace must already exist; the table will be created.
- Upload your secure connect bundle and update the scb_path variable to point to it.
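
A quick check before connecting can catch a missing or corrupt bundle early. The sketch below is only one way to do this; `check_bundle` is a hypothetical helper introduced for illustration and is not part of the notebook or of the Cassandra driver.

```python
import os
import zipfile


def check_bundle(path):
    """Return True if `path` exists and looks like a zip archive.

    Hypothetical helper: it only inspects the file on disk and does not
    validate the credentials or certificates inside the bundle.
    """
    return os.path.isfile(path) and zipfile.is_zipfile(path)
```

Calling `check_bundle(scb_path)` after the variables are set should return True before the cluster connection is attempted.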

In [None]:
!pip install openai cassandra-driver langchain unstructured

# Imports

In [2]:
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.query import dict_factory
from cassandra.query import SimpleStatement
from langchain.document_loaders import UnstructuredHTMLLoader
from google.colab import userdata
import openai

# Keys & Environment Variables

In [3]:
# keys and tokens here
openai_api_key = userdata.get('openai_api_key')
openai.api_key = openai_api_key
cass_user = userdata.get('cass_user')
cass_pw = userdata.get('cass_pw')
scb_path = '/content/secure-connect-cassio-db.zip'
keyspace="chatbot"
table="wiki_documents"

# Select a model to compute embeddings

In [4]:
model_id = "text-embedding-ada-002"
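
The text-embedding-ada-002 model returns 1536-dimensional vectors, which is why the table created later declares `vector<float, 1536>`. A minimal sketch of a guard against a mismatched model, assuming a hypothetical `validate_embedding` helper and `EMBEDDING_DIM` constant (neither appears in the original notebook):

```python
EMBEDDING_DIM = 1536  # output size of text-embedding-ada-002


def validate_embedding(vector, expected_dim=EMBEDDING_DIM):
    """Return the vector unchanged, or raise ValueError if its length
    does not match the dimension declared for the vector column."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"expected {expected_dim} dimensions, got {len(vector)}"
        )
    return vector
```

If a different embedding model is selected, both the constant and the column definition would need to change together.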

# Connect to the Cluster

In [5]:
cloud_config= {
  'secure_connect_bundle': scb_path
}
auth_provider = PlainTextAuthProvider(cass_user, cass_pw)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider, protocol_version=4)
session = cluster.connect()
session.set_keyspace(keyspace)
session

<cassandra.cluster.Session at 0x7bdcb7293ac0>

# Drop / Create Schema

In [6]:
# only use this to reset the schema
session.execute(f"""DROP INDEX IF EXISTS {keyspace}.openai_desc""")
session.execute(f"""DROP TABLE IF EXISTS {keyspace}.{table}""")

<cassandra.cluster.ResultSet at 0x7bdcb7292aa0>

In [7]:
# Create table
session.execute(f"""
CREATE TABLE {keyspace}.{table} (
    document_id text,
    chunk_id int,
    document_text text,
    embedding_vector vector<float, 1536>,
    metadata_blob text,
    PRIMARY KEY (document_id, chunk_id))
""")

# Create index
session.execute(f"""
CREATE CUSTOM INDEX IF NOT EXISTS openai_desc ON {keyspace}.{table} (embedding_vector)
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
""")


<cassandra.cluster.ResultSet at 0x7bdcb7290c40>

# Load the Wiki page

In [8]:
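# Download the page to a fixed filename. Without -O, wget saves repeat
# downloads as Andouille.1, Andouille.2, ... and the loader would keep
# reading the oldest copy.
!wget -O Andouille https://en.wikipedia.org/wiki/Andouille
loader = UnstructuredHTMLLoader("Andouille")
data = loader.load()
#data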

--2023-11-17 06:48:50--  https://en.wikipedia.org/wiki/Andouille
Resolving en.wikipedia.org (en.wikipedia.org)... 208.80.154.224, 2620:0:861:ed1a::1
Connecting to en.wikipedia.org (en.wikipedia.org)|208.80.154.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 83809 (82K) [text/html]
Saving to: ‘Andouille.3’


2023-11-17 06:48:50 (2.33 MB/s) - ‘Andouille.3’ saved [83809/83809]





# Load the table with data and create text embeddings

In [9]:
text_chunk_length = 400

# The INSERT statement is identical for every chunk, so build it once
query = SimpleStatement(
    f"""
    INSERT INTO {keyspace}.{table}
    (document_id, chunk_id, document_text, embedding_vector, metadata_blob)
    VALUES (%s, %s, %s, %s, %s)
    """
)

document_chunk_id = 0
for page in data:
  # Split the page text into fixed-size chunks of text_chunk_length characters
  text_chunks = [page.page_content[i:i + text_chunk_length] for i in range(0, len(page.page_content), text_chunk_length)]
  for chunk in text_chunks:
    # Running counter across all pages, stored as chunk_id
    document_chunk_id += 1
    metadata_blob = f"Source: {page.metadata['source']}"
    # Create an embedding for the chunk and save both to the database
    embedding = openai.embeddings.create(input=chunk, model=model_id).data[0].embedding
    session.execute(query, (page.metadata['source'], document_chunk_id, chunk, embedding, metadata_blob))
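
The cell above cuts the page into fixed 400-character slices, which can split a sentence across two chunks. A common variation lets consecutive chunks share a few characters. The sketch below is illustrative only; `chunk_text` and its `overlap` parameter are hypothetical and not part of the notebook. With `overlap=0` it produces the same slices as the list comprehension above.

```python
def chunk_text(text, size=400, overlap=0):
    """Split `text` into chunks of at most `size` characters.

    Consecutive chunks share `overlap` characters, so text cut at one
    chunk boundary still appears intact at the start of the next chunk.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # distance between chunk start positions
    return [text[i:i + size] for i in range(0, len(text), step)]
```

Overlap increases the number of chunks, and therefore the number of embedding calls, so it trades cost for retrieval quality.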



---


# Start using the index

In the steps up to this point, we created a schema and loaded the table with data, including embeddings generated through the OpenAI Embeddings API.
Now we will query that table and use the results to give ChatGPT some context to support its response.
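
The queries that follow use the `ANN OF` clause to order rows by how similar their stored embedding is to the query embedding. As an illustration only, the sketch below ranks vectors by brute force, assuming cosine similarity as the metric (the notebook does not set one explicitly, so the index uses the database default). The function names are hypothetical, and the database performs an approximate index search rather than this exhaustive scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, rows, k=5):
    """Return the texts of the k rows most similar to `query_vector`.

    `rows` is a list of (text, vector) pairs standing in for table rows.
    """
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query_vector, row[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```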

# Convert a query string into a text embedding to use as part of the query

First question: "What is Andouille?"

In [10]:
customer_input = "What is Andouille?"

embedding = openai.embeddings.create(input=customer_input, model=model_id).data[0].embedding
#display(embedding)

query = SimpleStatement(
    f"""
    SELECT *
    FROM {keyspace}.{table}
    ORDER BY embedding_vector ANN OF {embedding} LIMIT 5;
    """
    )

results = session.execute(query)
# Materialize the rows through the public iterator rather than the private _current_rows attribute
top_results = list(results)

message_objects = []
message_objects.append({"role":"system",
                        "content":"You're a chatbot answering questions about a web page"})

message_objects.append({"role":"user",
                        "content": customer_input})

# Pass each retrieved chunk to the model as context
context_messages = []

for row in top_results:
    context_messages.append({"role": "assistant", "content": row.document_text})

message_objects.extend(context_messages)
message_objects.append({"role": "assistant", "content":"Here's my best answer:"})

completion = openai.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=message_objects
)
print(completion.choices[0].message.content)



Andouille is a type of sausage that is commonly associated with Louisiana cuisine, particularly Creole and Cajun cooking. It originated in France and is made from smoked pork that is seasoned with garlic, pepper, onions, and other spices. Andouille has a unique and robust flavor and is often used in dishes like jambalaya, gumbo, and red beans and rice. It is known for its smoky and spicy taste and is widely enjoyed for its rich and complex flavor profile.
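
The cell above assembles the chat messages by hand, and the next question repeats the same steps. A minimal sketch of how that assembly could be factored into a function, assuming a hypothetical `build_messages` helper that takes the question and the retrieved chunk texts (it mirrors the message order used in this notebook and is not a LangChain or OpenAI API):

```python
def build_messages(question, context_chunks):
    """Build the chat message list: system prompt, user question,
    one assistant message per retrieved chunk, then the answer cue."""
    messages = [
        {"role": "system",
         "content": "You're a chatbot answering questions about a web page"},
        {"role": "user", "content": question},
    ]
    messages.extend(
        {"role": "assistant", "content": chunk} for chunk in context_chunks
    )
    messages.append({"role": "assistant", "content": "Here's my best answer:"})
    return messages
```

The result can be passed as the `messages` argument of `openai.chat.completions.create`, with `context_chunks` taken from the `document_text` field of each returned row.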


Second question: "What temperature should Andouille be cooked?"

In [11]:
customer_input = "What temperature should Andouille be cooked?"

embedding = openai.embeddings.create(input=customer_input, model=model_id).data[0].embedding
#display(embedding)

query = SimpleStatement(
    f"""
    SELECT *
    FROM {keyspace}.{table}
    ORDER BY embedding_vector ANN OF {embedding} LIMIT 5;
    """
    )

results = session.execute(query)
# Materialize the rows through the public iterator rather than the private _current_rows attribute
top_results = list(results)

message_objects = []
message_objects.append({"role":"system",
                        "content":"You're a chatbot answering questions about a web page"})

message_objects.append({"role":"user",
                        "content": customer_input})

# Pass each retrieved chunk to the model as context
context_messages = []

for row in top_results:
    context_messages.append({"role": "assistant", "content": row.document_text})

message_objects.extend(context_messages)
message_objects.append({"role": "assistant", "content":"Here's my best answer:"})

completion = openai.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=message_objects
)
print(completion.choices[0].message.content)



Andouille sausage is a type of sausage made from pork, garlic, pepper, onions, wine, and various seasonings. It is traditionally smoked and has a coarse texture. In terms of cooking temperature, it is typically cooked at a moderate heat until it reaches an internal temperature of 160°F (71°C). Cooking times may vary depending on the size and thickness of the sausage. It is always best to use a meat thermometer to ensure the sausage is cooked to the desired temperature.
