<a href="https://colab.research.google.com/github/StatsAI/NLP/blob/main/Unstructured_RAG_V3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Author: Hussain Abbas, MSc
# © 2025 Stats AI LLC
# All Rights Reserved

In [2]:
!pip install "unstructured[all-docs]"
!pip install "unstructured[openai]"
!pip install langchain
!pip install chromadb

Collecting unstructured[all-docs]
  Downloading unstructured-0.16.23-py3-none-any.whl.metadata (24 kB)
Collecting filetype (from unstructured[all-docs])
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting python-magic (from unstructured[all-docs])
  Downloading python_magic-0.4.27-py2.py3-none-any.whl.metadata (5.8 kB)
Collecting emoji (from unstructured[all-docs])
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting dataclasses-json (from unstructured[all-docs])
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting python-iso639 (from unstructured[all-docs])
  Downloading python_iso639-2025.2.18-py3-none-any.whl.metadata (14 kB)
Collecting langdetect (from unstructured[all-docs])
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Collec

Collecting chromadb
  Downloading chromadb-0.6.3-py3-none-any.whl.metadata (6.8 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Downloading chroma_hnswlib-0.7.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.115.8-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.14.2-py2.py3-none-any.whl.metadata (2.9 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromadb)
  Downloading opentelemetry_exporter_otlp_proto_grpc-1.30.0-py3-none-any.whl.metadata (2.4 kB)
Collecting opentelemetry-instrumentation-fastapi>=0.41b0 (from chromadb)
  Downloading opentelemetry_instrumentation_fastapi-0

# Unstructured API - Key Features:

1. Precise Document Extraction: Unstructured extracts typed document elements (such as Title, NarrativeText, and ListItem) together with their metadata.

2. Extensive File Support: The platform supports a wide array of file types, including PDF, images, HTML, and many more.

https://docs.unstructured.io/welcome

https://docs.unstructured.io/open-source/introduction/overview


# The Unstructured Core API consists of the following components:

1. Partitioning: Partitioning functions extract structured content from raw, unstructured documents. They break a document down into elements such as Title, NarrativeText, and ListItem, so users can decide what content to keep for their particular application. If you're training a summarization model, for example, you may only be interested in NarrativeText.

2. Cleaning: Data preparation for NLP models often requires cleaning to ensure quality. The Unstructured library includes cleaning functions that assist in sanitizing output, removing unwanted content, and improving the performance of NLP models. This step is essential for maintaining the integrity of data before it is passed to downstream applications.

3. Extracting: This functionality allows for the extraction of specific entities within documents. It is designed to identify and isolate relevant pieces of information, making it easier for users to focus on the most pertinent data in their documents.

4. Staging: Staging functions help prepare your data for ingestion into downstream systems. Please note that this functionality is being deprecated in favor of Destination Connectors.

5. Chunking: The chunking process in Unstructured is distinct from conventional methods. Instead of relying solely on text-based features to form chunks, Unstructured uses a deep understanding of document formats to partition documents into semantic units (document elements).

6. Embedding: The embedding encoder classes in Unstructured leverage document elements detected through partitioning or grouped via chunking to obtain embeddings for each element. This is particularly useful for applications like Retrieval Augmented Generation (RAG), where precise and contextually relevant embeddings are crucial.
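
The idea behind element-aware chunking (item 5) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the library's implementation: `ToyElement`, `chunk_by_title`, and the `max_chars` limit are hypothetical names introduced here, and the sample elements are made up.

```python
from dataclasses import dataclass


@dataclass
class ToyElement:
    """Hypothetical stand-in for a partitioned document element."""
    category: str  # e.g. "Title", "NarrativeText", "ListItem"
    text: str


def chunk_by_title(elements, max_chars=500):
    """Group elements into chunks, starting a new chunk at every Title
    or when adding the next element would exceed max_chars."""
    chunks, current = [], []
    for el in elements:
        size = sum(len(e.text) for e in current)
        starts_section = el.category == "Title"
        too_long = size + len(el.text) > max_chars
        if current and (starts_section or too_long):
            chunks.append(current)
            current = []
        current.append(el)
    if current:
        chunks.append(current)
    # Join the element texts of each chunk into a single string.
    return ["\n".join(e.text for e in chunk) for chunk in chunks]


toy_elements = [
    ToyElement("Title", "Introduction"),
    ToyElement("NarrativeText", "Documents have structure."),
    ToyElement("Title", "Method"),
    ToyElement("ListItem", "Partition the document."),
    ToyElement("ListItem", "Chunk the elements."),
]

chunks = chunk_by_title(toy_elements)
for chunk in chunks:
    print(repr(chunk))
```

Because chunk boundaries follow section titles rather than a fixed character count, each chunk stays within one section of the document.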

## PDF - Extraction
https://docs.unstructured.io/open-source/core-functionality/partitioning


In [3]:
# from unstructured.partition.auto import partition

# #link = 'https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/layout-parser-paper-fast.pdf'

# url = 'https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/pdf/layout-parser-paper-fast.pdf'

# # Use if loading from a url
# elements = partition(url=url)

# # Use if loading from a local file
# #elements = partition("/content/example-docs/layout-parser-paper-fast.pdf")

# elements

In [4]:
# from collections import Counter

# display(Counter(type(element) for element in elements))
# print("")

In [5]:
# display(*[(type(element), element.text) for element in elements])

## HTML - Extraction

In [6]:
# url = 'https://www.businessinsider.com/navalny-death-letters-trump-second-term-agenda-really-scary-2024-2'

# from unstructured.partition.auto import partition

# elements = partition(url=url, strategy='hi_res', html_assemble_articles=True,
#                      chunking_strategy="by_title", multipage_sections=True)
# elements

In [1]:
from unstructured.partition.html import partition_html

url = 'https://www.businessinsider.com/navalny-death-letters-trump-second-term-agenda-really-scary-2024-2'

elements = partition_html(url=url)

elements

[<unstructured.documents.elements.Text at 0x787bcf30c5d0>,
 <unstructured.documents.elements.Text at 0x787bcf3285d0>,
 <unstructured.documents.elements.Title at 0x787bcf2a4e50>,
 <unstructured.documents.elements.Text at 0x787bcf2a6b50>,
 <unstructured.documents.elements.Text at 0x787bcf2a7490>,
 <unstructured.documents.elements.Text at 0x787bcf22df50>,
 <unstructured.documents.elements.NarrativeText at 0x787bcd31d350>,
 <unstructured.documents.elements.NarrativeText at 0x787bcd31e7d0>,
 <unstructured.documents.elements.ListItem at 0x787bcd31eb50>,
 <unstructured.documents.elements.ListItem at 0x787bcd31ec50>,
 <unstructured.documents.elements.ListItem at 0x787bcd31ed90>,
 <unstructured.documents.elements.NarrativeText at 0x787bcd325950>,
 <unstructured.documents.elements.NarrativeText at 0x787bcd327dd0>,
 <unstructured.documents.elements.NarrativeText at 0x787bcd329910>,
 <unstructured.documents.elements.NarrativeText at 0x787bcd32b4d0>,
 <unstructured.documents.elements.NarrativeText 

In [2]:
from collections import Counter

display(Counter(type(element) for element in elements))
print("")

Counter({unstructured.documents.elements.Text: 10,
         unstructured.documents.elements.Title: 3,
         unstructured.documents.elements.NarrativeText: 17,
         unstructured.documents.elements.ListItem: 35})




In [3]:
display(*[(type(element), element.text) for element in elements])

(unstructured.documents.elements.Text, 'Subscribe Newsletters')

(unstructured.documents.elements.Text, 'Military & Defense')

(unstructured.documents.elements.Title,
 "In Navalny's last letters, the Russian dissident called Trump's agenda for a second term 'really scary'")

(unstructured.documents.elements.Text, 'Kelsey Vlamis')

(unstructured.documents.elements.Text, '2024-02-20T01:55:04Z')

(unstructured.documents.elements.Text,
 'Facebook Email X LinkedIn Copy Link Impact Link')

(unstructured.documents.elements.NarrativeText, 'Read in app')

(unstructured.documents.elements.NarrativeText,
 'This story is available exclusively to Business Insider subscribers. Become an Insider and start reading now. Have an account? .')

(unstructured.documents.elements.ListItem,
 "Alexey Navalny, Vladimir Putin's most prominent critic, commented on US politics months before his death.")

(unstructured.documents.elements.ListItem,
 'Navalny expressed concern in letters to a friend over a potential second term for Donald Trump.')

(unstructured.documents.elements.ListItem,
 "Trump briefly mentioned Navalny's death in a Truth Social post on Monday.")

(unstructured.documents.elements.NarrativeText,
 'Alexey Navalny, a dissident and the political nemesis of Russian President Vladimir Putin, spent the past few years of his life behind bars but still managed to stay connected to the outside world.')

(unstructured.documents.elements.NarrativeText,
 "Letters from the final months of his life, obtained by The New York Times, show that Navalny, who'd been imprisoned since January 2021, managed to stay on top of current events — including in the US.")

(unstructured.documents.elements.NarrativeText,
 'In a letter sent to a friend, a photographer named Evgeny Feldman, Navalny said former President Donald Trump\'s agenda for a second term was "really scary," according to the Times.')

(unstructured.documents.elements.NarrativeText,
 'He said if President Joe Biden were to have a health issue, "Trump will become president," adding: "Doesn\'t this obvious thing concern the Democrats?"')

(unstructured.documents.elements.NarrativeText,
 'In another letter to Feldman dated December 3, Navalny again expressed concern over Trump and asked his friend, "Please name one current politician you admire."')

(unstructured.documents.elements.NarrativeText,
 "Trump's office didn't immediately respond to a request for comment from Business Insider.")

(unstructured.documents.elements.NarrativeText,
 'On December 6, Navalny disappeared from the IK-6 penal colony about 120 miles east of Moscow. He turned up again on Christmas Day when his lawyers announced they had located him at the IK-3 penal colony, about 1,000 miles northeast of Moscow, above the Arctic Circle.')

(unstructured.documents.elements.NarrativeText,
 "The Times reported that Navalny's communication ability from his new prison was greatly diminished.")

(unstructured.documents.elements.NarrativeText,
 "The journalist Sergei Parkhomenko said he received a letter from Navalny on February 13, a few days before Navalny's death was announced. In the letter, which Parkhomenko shared on Facebook, Navalny spoke of books and said he only had access to classics at his new prison.")

(unstructured.documents.elements.NarrativeText,
 '"Who could\'ve told me that Chekhov is the most depressing Russian writer?" he wrote.')

(unstructured.documents.elements.NarrativeText,
 "Trump, for his part, didn't mention Navalny in the days after his death, despite condemnations from other leaders who directly blamed Putin.")

(unstructured.documents.elements.NarrativeText,
 'In a Truth Social post on Monday, Trump briefly mentioned Navalny before directing his ire at his own perceived political opponents: "The sudden death of Alexei Navalny has made me more and more aware of what is happening in our Country. It is a slow, steady progression, with CROOKED, Radical Left Politicians, Prosecutors, and Judges leading us down a path to destruction."')

(unstructured.documents.elements.NarrativeText,
 'He mentioned neither Russia nor Putin.')

(unstructured.documents.elements.Title, 'Read next')

(unstructured.documents.elements.ListItem, 'Donald Trump')

(unstructured.documents.elements.ListItem, 'Russia')

(unstructured.documents.elements.Title, 'Recommended video')

(unstructured.documents.elements.NarrativeText,
 'This story is available exclusively to Business Insider subscribers. Become an Insider and start reading now. Have an account? .')

(unstructured.documents.elements.Text, 'Legal & Privacy')

(unstructured.documents.elements.ListItem, 'Terms of Service')

(unstructured.documents.elements.ListItem, 'Terms of Sale')

(unstructured.documents.elements.ListItem, 'Privacy Policy')

(unstructured.documents.elements.ListItem, 'Accessibility')

(unstructured.documents.elements.ListItem, 'Code of Ethics Policy')

(unstructured.documents.elements.ListItem, 'Reprints & Permissions')

(unstructured.documents.elements.ListItem, 'Disclaimer')

(unstructured.documents.elements.ListItem, 'Advertising Policies')

(unstructured.documents.elements.ListItem, 'Conflict of Interest Policy')

(unstructured.documents.elements.ListItem, 'Commerce Policy')

(unstructured.documents.elements.ListItem, 'Coupons Privacy Policy')

(unstructured.documents.elements.ListItem, 'Coupons Terms')

(unstructured.documents.elements.ListItem, 'Your Privacy Choices')

(unstructured.documents.elements.Text, 'Company')

(unstructured.documents.elements.ListItem, 'About Us')

(unstructured.documents.elements.ListItem, 'Careers')

(unstructured.documents.elements.ListItem, 'Advertise With Us')

(unstructured.documents.elements.ListItem, 'Contact Us')

(unstructured.documents.elements.ListItem, 'Company News')

(unstructured.documents.elements.ListItem, 'Masthead')

(unstructured.documents.elements.Text, 'Other')

(unstructured.documents.elements.ListItem, 'Sitemap')

(unstructured.documents.elements.ListItem, 'Stock quotes by finanzen.net')

(unstructured.documents.elements.Text, 'International Editions')

(unstructured.documents.elements.ListItem, 'AT')

(unstructured.documents.elements.ListItem, 'DE')

(unstructured.documents.elements.ListItem, 'ES')

(unstructured.documents.elements.ListItem, 'JP')

(unstructured.documents.elements.ListItem, 'NL')

(unstructured.documents.elements.ListItem, 'PL')

(unstructured.documents.elements.NarrativeText,
 'Copyright © 2025 Insider Inc. All rights reserved. Registration on or use of this site constitutes acceptance of our Terms of Service and Privacy Policy.')

(unstructured.documents.elements.Text, 'Jump to')

(unstructured.documents.elements.ListItem, 'Main content')

(unstructured.documents.elements.ListItem, 'Search')

(unstructured.documents.elements.ListItem, 'Account')

In [4]:
for element in elements:
  print(str(type(element)))

<class 'unstructured.documents.elements.Text'>
<class 'unstructured.documents.elements.Text'>
<class 'unstructured.documents.elements.Title'>
<class 'unstructured.documents.elements.Text'>
<class 'unstructured.documents.elements.Text'>
<class 'unstructured.documents.elements.Text'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.ListItem'>
<class 'unstructured.documents.elements.ListItem'>
<class 'unstructured.documents.elements.ListItem'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<cla

In [5]:
display(*[(type(element), element.text) for element in elements if type(element).__name__ == "Title"])

(unstructured.documents.elements.Title,
 "In Navalny's last letters, the Russian dissident called Trump's agenda for a second term 'really scary'")

(unstructured.documents.elements.Title, 'Read next')

(unstructured.documents.elements.Title, 'Recommended video')

In [6]:
display(*[(type(element), element.text) for element in elements if type(element).__name__ == "ListItem"])

(unstructured.documents.elements.ListItem,
 "Alexey Navalny, Vladimir Putin's most prominent critic, commented on US politics months before his death.")

(unstructured.documents.elements.ListItem,
 'Navalny expressed concern in letters to a friend over a potential second term for Donald Trump.')

(unstructured.documents.elements.ListItem,
 "Trump briefly mentioned Navalny's death in a Truth Social post on Monday.")

(unstructured.documents.elements.ListItem, 'Donald Trump')

(unstructured.documents.elements.ListItem, 'Russia')

(unstructured.documents.elements.ListItem, 'Terms of Service')

(unstructured.documents.elements.ListItem, 'Terms of Sale')

(unstructured.documents.elements.ListItem, 'Privacy Policy')

(unstructured.documents.elements.ListItem, 'Accessibility')

(unstructured.documents.elements.ListItem, 'Code of Ethics Policy')

(unstructured.documents.elements.ListItem, 'Reprints & Permissions')

(unstructured.documents.elements.ListItem, 'Disclaimer')

(unstructured.documents.elements.ListItem, 'Advertising Policies')

(unstructured.documents.elements.ListItem, 'Conflict of Interest Policy')

(unstructured.documents.elements.ListItem, 'Commerce Policy')

(unstructured.documents.elements.ListItem, 'Coupons Privacy Policy')

(unstructured.documents.elements.ListItem, 'Coupons Terms')

(unstructured.documents.elements.ListItem, 'Your Privacy Choices')

(unstructured.documents.elements.ListItem, 'About Us')

(unstructured.documents.elements.ListItem, 'Careers')

(unstructured.documents.elements.ListItem, 'Advertise With Us')

(unstructured.documents.elements.ListItem, 'Contact Us')

(unstructured.documents.elements.ListItem, 'Company News')

(unstructured.documents.elements.ListItem, 'Masthead')

(unstructured.documents.elements.ListItem, 'Sitemap')

(unstructured.documents.elements.ListItem, 'Stock quotes by finanzen.net')

(unstructured.documents.elements.ListItem, 'AT')

(unstructured.documents.elements.ListItem, 'DE')

(unstructured.documents.elements.ListItem, 'ES')

(unstructured.documents.elements.ListItem, 'JP')

(unstructured.documents.elements.ListItem, 'NL')

(unstructured.documents.elements.ListItem, 'PL')

(unstructured.documents.elements.ListItem, 'Main content')

(unstructured.documents.elements.ListItem, 'Search')

(unstructured.documents.elements.ListItem, 'Account')

In [7]:
display(*[(type(element), element.text) for element in elements if type(element).__name__ == "Text"])

(unstructured.documents.elements.Text, 'Subscribe Newsletters')

(unstructured.documents.elements.Text, 'Military & Defense')

(unstructured.documents.elements.Text, 'Kelsey Vlamis')

(unstructured.documents.elements.Text, '2024-02-20T01:55:04Z')

(unstructured.documents.elements.Text,
 'Facebook Email X LinkedIn Copy Link Impact Link')

(unstructured.documents.elements.Text, 'Legal & Privacy')

(unstructured.documents.elements.Text, 'Company')

(unstructured.documents.elements.Text, 'Other')

(unstructured.documents.elements.Text, 'International Editions')

(unstructured.documents.elements.Text, 'Jump to')

In [8]:
display(*[(type(element), element.text) for element in elements if type(element).__name__ == "NarrativeText"])

(unstructured.documents.elements.NarrativeText, 'Read in app')

(unstructured.documents.elements.NarrativeText,
 'This story is available exclusively to Business Insider subscribers. Become an Insider and start reading now. Have an account? .')

(unstructured.documents.elements.NarrativeText,
 'Alexey Navalny, a dissident and the political nemesis of Russian President Vladimir Putin, spent the past few years of his life behind bars but still managed to stay connected to the outside world.')

(unstructured.documents.elements.NarrativeText,
 "Letters from the final months of his life, obtained by The New York Times, show that Navalny, who'd been imprisoned since January 2021, managed to stay on top of current events — including in the US.")

(unstructured.documents.elements.NarrativeText,
 'In a letter sent to a friend, a photographer named Evgeny Feldman, Navalny said former President Donald Trump\'s agenda for a second term was "really scary," according to the Times.')

(unstructured.documents.elements.NarrativeText,
 'He said if President Joe Biden were to have a health issue, "Trump will become president," adding: "Doesn\'t this obvious thing concern the Democrats?"')

(unstructured.documents.elements.NarrativeText,
 'In another letter to Feldman dated December 3, Navalny again expressed concern over Trump and asked his friend, "Please name one current politician you admire."')

(unstructured.documents.elements.NarrativeText,
 "Trump's office didn't immediately respond to a request for comment from Business Insider.")

(unstructured.documents.elements.NarrativeText,
 'On December 6, Navalny disappeared from the IK-6 penal colony about 120 miles east of Moscow. He turned up again on Christmas Day when his lawyers announced they had located him at the IK-3 penal colony, about 1,000 miles northeast of Moscow, above the Arctic Circle.')

(unstructured.documents.elements.NarrativeText,
 "The Times reported that Navalny's communication ability from his new prison was greatly diminished.")

(unstructured.documents.elements.NarrativeText,
 "The journalist Sergei Parkhomenko said he received a letter from Navalny on February 13, a few days before Navalny's death was announced. In the letter, which Parkhomenko shared on Facebook, Navalny spoke of books and said he only had access to classics at his new prison.")

(unstructured.documents.elements.NarrativeText,
 '"Who could\'ve told me that Chekhov is the most depressing Russian writer?" he wrote.')

(unstructured.documents.elements.NarrativeText,
 "Trump, for his part, didn't mention Navalny in the days after his death, despite condemnations from other leaders who directly blamed Putin.")

(unstructured.documents.elements.NarrativeText,
 'In a Truth Social post on Monday, Trump briefly mentioned Navalny before directing his ire at his own perceived political opponents: "The sudden death of Alexei Navalny has made me more and more aware of what is happening in our Country. It is a slow, steady progression, with CROOKED, Radical Left Politicians, Prosecutors, and Judges leading us down a path to destruction."')

(unstructured.documents.elements.NarrativeText,
 'He mentioned neither Russia nor Putin.')

(unstructured.documents.elements.NarrativeText,
 'This story is available exclusively to Business Insider subscribers. Become an Insider and start reading now. Have an account? .')

(unstructured.documents.elements.NarrativeText,
 'Copyright © 2025 Insider Inc. All rights reserved. Registration on or use of this site constitutes acceptance of our Terms of Service and Privacy Policy.')
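
The NarrativeText output above includes a short UI fragment ('Read in app') and the same subscriber notice twice. A minimal cleaning sketch, assuming that exact duplicates and very short fragments are not worth embedding, is shown below. `filter_narrative` and the `min_words` threshold are hypothetical choices for illustration, not part of the Unstructured API.

```python
def filter_narrative(texts, min_words=5):
    """Drop exact duplicates and very short fragments, preserving order."""
    seen = set()
    kept = []
    for text in texts:
        # Collapse runs of whitespace so near-identical strings compare equal.
        normalized = " ".join(text.split())
        if len(normalized.split()) < min_words:
            continue  # e.g. 'Read in app'
        if normalized in seen:
            continue  # e.g. the repeated subscriber notice
        seen.add(normalized)
        kept.append(normalized)
    return kept


sample = [
    "Read in app",
    "This story is available exclusively to Business Insider subscribers. "
    "Become an Insider and start reading now. Have an account? .",
    "He mentioned neither Russia nor Putin.",
    "This story is available exclusively to Business Insider subscribers. "
    "Become an Insider and start reading now. Have an account? .",
]

kept = filter_narrative(sample)
for text in kept:
    print(text)
```

This only removes repeats and fragments. The first copy of the subscriber notice is still kept, so a real pipeline would likely need additional rules for boilerplate.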

## Embeddings

https://docs.unstructured.io/open-source/core-functionality/embedding

In [11]:
for element in elements:
  print(element)

Subscribe Newsletters
Military & Defense
In Navalny's last letters, the Russian dissident called Trump's agenda for a second term 'really scary'
Kelsey Vlamis
2024-02-20T01:55:04Z
Facebook Email X LinkedIn Copy Link Impact Link
Read in app
This story is available exclusively to Business Insider subscribers. Become an Insider and start reading now. Have an account? .
Alexey Navalny, Vladimir Putin's most prominent critic, commented on US politics months before his death.
Navalny expressed concern in letters to a friend over a potential second term for Donald Trump.
Trump briefly mentioned Navalny's death in a Truth Social post on Monday.
Alexey Navalny, a dissident and the political nemesis of Russian President Vladimir Putin, spent the past few years of his life behind bars but still managed to stay connected to the outside world.
Letters from the final months of his life, obtained by The New York Times, show that Navalny, who'd been imprisoned since January 2021, managed to stay on to

In [13]:
!pip install langchain-huggingface

Collecting langchain-huggingface
  Downloading langchain_huggingface-0.1.2-py3-none-any.whl.metadata (1.3 kB)
Downloading langchain_huggingface-0.1.2-py3-none-any.whl (21 kB)
Installing collected packages: langchain-huggingface
Successfully installed langchain-huggingface-0.1.2


In [20]:
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

# Load a pretrained Sentence Transformers model through LangChain.
model = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )

# Process each element returned by partition_html.
for element in elements:
    # Get the element's text.
    text = element.text
    # Generate the embedding for that text.
    query_result = model.embed_query(text)
    # Elements are objects, not dicts, so store the vector on the
    # `embeddings` attribute instead of using item assignment.
    element.embeddings = query_result

# # Print embeddings
# for e in elements:
#     print(e.embeddings, e)

In [9]:
from google.colab import userdata
open_ai_api_key = userdata.get('open_ai_api_key')

from unstructured.documents.elements import Text
# OpenAIEmbeddingConfig must be imported alongside the encoder.
from unstructured.embed.openai import OpenAIEmbeddingConfig, OpenAIEmbeddingEncoder

# Initialize the encoder with OpenAI credentials
embedding_encoder = OpenAIEmbeddingEncoder(config=OpenAIEmbeddingConfig(api_key=open_ai_api_key))

# Embed a list of Elements
elements = embedding_encoder.embed_documents(
    elements=[Text("This is sentence 1"), Text("This is sentence 2")],
)

# Embed a single query string
query = "This is the query"
query_embedding = embedding_encoder.embed_query(query=query)

# Print embeddings
for e in elements:
    print(e.embeddings, e)
print(query_embedding, query)
# Depending on the installed unstructured release, these may be properties
# rather than methods; drop the parentheses if a "not callable" error appears.
print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())
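
The last line of the cell above asks the encoder whether its embeddings are unit vectors and how many dimensions they have. A minimal sketch of what those two properties mean, using plain Python lists, is given below. The helper names are hypothetical and this is not the encoder's implementation. The practical point is that for unit-length vectors the dot product equals the cosine similarity.

```python
import math


def norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def is_unit(vector, tol=1e-6):
    """True when the vector has length 1 within a small tolerance."""
    return abs(norm(vector) - 1.0) <= tol


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    length = norm(vector)
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in vector]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
dot = sum(x * y for x, y in zip(a, b))

print(len(a), is_unit(a), is_unit(b))
print(dot, cosine_similarity([3.0, 4.0], [4.0, 3.0]))
```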

## CNN HTML Links, Vector Databases, Automatic Summarization via LangChain + OpenAI

In the following section we:

1. Get the links for the latest articles from an HTML file (Option 1) or directly from the CNN website (Option 2).  
2. Use the Unstructured document loader in LangChain to load the files.
3. Create embeddings for each file using OpenAIEmbeddings.
4. Store the embeddings in Chroma DB.
5. Query Chroma DB to return relevant articles.   
6. Summarize the relevant articles using LangChain OpenAI integration.

https://unstructured-io.github.io/unstructured/examples/chroma.html
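
Steps 3 to 5 above (embed, store, query) can be sketched without any external service. This is a toy illustration under stated assumptions: the bag-of-words `embed` function stands in for a real embedding model such as OpenAIEmbeddings, `ToyVectorStore` stands in for Chroma DB, and the three headlines are made-up sample data.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Keeps (document, vector) pairs in memory and ranks them by similarity."""

    def __init__(self):
        self._rows = []

    def add(self, doc):
        self._rows.append((doc, embed(doc)))

    def query(self, text, k=1):
        q = embed(text)
        ranked = sorted(self._rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [doc for doc, _ in ranked[:k]]


store = ToyVectorStore()
for doc in [
    "nasa announces workforce layoffs",
    "senate votes on budget resolution",
    "hockey championship final tonight",
]:
    store.add(doc)

top = store.query("budget vote in the senate", k=1)
print(top)
```

A real vector database replaces the linear scan in `query` with an approximate nearest-neighbour index, but the interface (add documents, then query by similarity) is the same shape.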

## Option 1: Import from HTML file stored on Github

In [21]:
from urllib.parse import urljoin

from unstructured.partition.html import partition_html
import requests
from google.colab import userdata
open_ai_api_key = userdata.get('open_ai_api_key')

# Use the raw file URL; the github.com/.../blob/... URL returns GitHub's
# viewer page rather than the saved HTML file itself.
url = "https://raw.githubusercontent.com/StatsAI/NLP/main/Breaking%20News%2C%20Latest%20News%20and%20Videos%20_%20CNN.html"

try:
    response = requests.get(url, allow_redirects=True, timeout=30)
    response.raise_for_status()  # Raise error if download fails
except requests.exceptions.RequestException as e:
    print(f"Error downloading HTML: {e}")
    raise  # Stop the cell here instead of calling exit() inside a notebook

# Save the content to a file
with open("downloaded_html.html", "wb") as f:
    f.write(response.content)

print("HTML downloaded successfully to downloaded_html.html")

elements = partition_html(filename='downloaded_html.html')

links = []
cnn_lite_url = "https://lite.cnn.com/"

# Elements expose their hyperlinks through metadata.link_urls (None when an
# element has no links); the element itself has no `.links` attribute.
for element in elements:
  for link in element.metadata.link_urls or []:
    # Assumption: links in the saved page are relative to the CNN Lite root.
    links.append(urljoin(cnn_lite_url, link))

links

HTML downloaded successfully to downloaded_html.html

## Option 2: Import from URL

In [37]:
elements = partition_html(url=cnn_lite_url)

for element in elements:
  print(element.metadata.link_urls)
  #print(dir(element.metadata))
  #print(dir(element))
  #print(element.links)

['/']
None
['/2025/02/20/sport/4-nations-face-off-championship-spt/index.html']
['/2025/02/20/style/birkenstock-sandals-german-court-ruling/index.html']
['/2025/02/20/middleeast/israel-bus-explosions-hnk-intl/index.html']
['/2025/02/20/travel/new-zealand-bug-year-velvet-worm-intl-hnk/index.html']
['/2025/02/20/us/texas-ice-jocelynn-rojo-carranza/index.html']
['/2025/02/20/science/nasa-layoffs-workforce-firings/index.html']
['/2025/02/20/politics/elon-musk-private-security-deputized-marshals-service/index.html']
['/2025/02/20/politics/senate-budget-resolution-vote-trump-agenda/index.html']
['/2024/05/30/us/kohberger-idaho-killings-pretrial-hearings/index.html']
['/2025/02/20/sport/espn-mlb-end-relationship-spt/index.html']
['/2025/02/20/politics/us-resisting-adding-reference-to-russian-aggression-to-g7-ukraine-anniversary-statement/index.html']
['/2025/02/20/health/deep-vein-thrombosis-explainer-wembanyama-wellness/index.html']
['/2025/02/20/middleeast/israel-bibas-boys-among-dead-hosta

In [39]:
from urllib.parse import urljoin

from unstructured.partition.html import partition_html
from google.colab import userdata
open_ai_api_key = userdata.get('open_ai_api_key')

cnn_lite_url = "https://lite.cnn.com/"

elements = partition_html(url=cnn_lite_url)
links = []

for element in elements:
  # link_urls is None for elements without links, so check before indexing.
  if element.metadata.link_urls:
    relative_link = element.metadata.link_urls[0]
    # urljoin avoids the doubled slash produced by concatenating a base URL
    # ending in "/" with a path starting with "/".
    links.append(urljoin(cnn_lite_url, relative_link))

links

['https://lite.cnn.com/',
 'https://lite.cnn.com/2025/02/20/sport/4-nations-face-off-championship-spt/index.html',
 'https://lite.cnn.com/2025/02/20/style/birkenstock-sandals-german-court-ruling/index.html',
 'https://lite.cnn.com/2025/02/20/middleeast/israel-bus-explosions-hnk-intl/index.html',
 'https://lite.cnn.com/2025/02/20/travel/new-zealand-bug-year-velvet-worm-intl-hnk/index.html',
 'https://lite.cnn.com/2025/02/20/us/texas-ice-jocelynn-rojo-carranza/index.html',
 'https://lite.cnn.com/2025/02/20/science/nasa-layoffs-workforce-firings/index.html',
 'https://lite.cnn.com/2025/02/20/politics/elon-musk-private-security-deputized-marshals-service/index.html',
 'https://lite.cnn.com/2025/02/20/politics/senate-budget-resolution-vote-trump-agenda/index.html',
 'https://lite.cnn.com/2024/05/30/us/kohberger-idaho-killings-pretrial-hearings/index.html',
 'https://lite.cnn.com/2025/02/20/sport/espn-mlb-end-relationship-spt/index.html',
 'https://lite.cnn.com/2025/02/20/politic
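
Link lists scraped from a front page often contain the homepage itself and repeated entries, and each one costs a download in the loading step. A minimal sketch, assuming root-relative links like those printed above, resolves each link with `urllib.parse.urljoin` and keeps only the first occurrence. `build_article_urls` and the sample list (including its repeated entry) are hypothetical.

```python
from urllib.parse import urljoin


def build_article_urls(base_url, relative_links):
    """Resolve links against base_url, dropping duplicates and the homepage."""
    seen = set()
    urls = []
    for link in relative_links:
        absolute = urljoin(base_url, link)
        if absolute == base_url or absolute in seen:
            continue
        seen.add(absolute)
        urls.append(absolute)
    return urls


sample_links = [
    "/",
    "/2025/02/20/sport/4-nations-face-off-championship-spt/index.html",
    "/2025/02/20/style/birkenstock-sandals-german-court-ruling/index.html",
    # Repeated on purpose to show de-duplication (hypothetical input).
    "/2025/02/20/sport/4-nations-face-off-championship-spt/index.html",
]

article_urls = build_article_urls("https://lite.cnn.com/", sample_links)
for article_url in article_urls:
    print(article_url)
```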

In [40]:
# from unstructured.partition.html import partition_html
# from google.colab import userdata
# open_ai_api_key = userdata.get('open_ai_api_key')

# cnn_lite_url = "https://lite.cnn.com/"

# elements = partition_html(url=cnn_lite_url)
# links = []

# for element in elements:
#   try:
#     if element.links[0]["url"][1:]:
#       relative_link = element.links[0]["url"][1:]
#       links.append(f"{cnn_lite_url}{relative_link}")
#   except IndexError:
#     # Handle the case where the "url" key doesn't exist or the index is out of range
#     continue

# links

In [42]:
!pip install -U langchain-community

Collecting langchain-community
  Downloading langchain_community-0.3.18-py3-none-any.whl.metadata (2.4 kB)
Collecting langchain-core<1.0.0,>=0.3.37 (from langchain-community)
  Downloading langchain_core-0.3.37-py3-none-any.whl.metadata (5.9 kB)
Collecting langchain<1.0.0,>=0.3.19 (from langchain-community)
  Downloading langchain-0.3.19-py3-none-any.whl.metadata (7.9 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.7.1-py3-none-any.whl.metadata (3.5 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Downloading langchain_community-0.3.18-py3-none-any.whl (2.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.5/2.5 MB 30.5 MB/s eta 0:00:00
Downloading httpx_sse-0.4.0-py3-none-any.whl (7.8 kB)
Downloading langchain-0.3.19-py3-none-any.whl (1.0 MB)

In [43]:
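# Loaders live in langchain-community (installed above); the langchain.document_loaders path is deprecated
from langchain_community.document_loaders import UnstructuredURLLoader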

# links = ['https://lite.cnn.com/2024/02/22/tech/nvidia-ceo-jensen-huang-20-richest-billionaire/index.html',
#          'https://lite.cnn.com/2024/02/22/us/darryl-george-crown-act-trial-texas-reaj/index.html']

loaders = UnstructuredURLLoader(urls=links, show_progress_bar=True)

docs = loaders.load()

100%|██████████| 103/103 [00:34<00:00,  2.96it/s]


In [45]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 13.3 MB/s eta 0:00:00
Installing collected packages: tiktoken
Successfully installed tiktoken-0.9.0


In [46]:
# Import from langchain-community; the langchain.vectorstores / langchain.embeddings paths are deprecated
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key=open_ai_api_key)
vectorstore = Chroma.from_documents(docs, embeddings)

In [47]:
query_docs = vectorstore.similarity_search("Trump", k=5)
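
`similarity_search` embeds the query string and returns the `k` stored documents whose embeddings are closest to it. The idea can be sketched with plain Python lists and cosine similarity. This is only an illustration: Chroma uses its own index and distance metric, and the toy vectors and the helper names `cosine_similarity` and `top_k` are made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return [idx for _, idx in scored[:k]]


# Toy 2-dimensional "embeddings"; real OpenAI embeddings have far more dimensions.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
toy_query = [1.0, 0.0]
nearest = top_k(toy_query, toy_docs, k=2)
print(nearest)
```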

In [48]:
# Import from langchain-community; the langchain.chat_models path is deprecated
from langchain_community.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k", openai_api_key=open_ai_api_key)
chain = load_summarize_chain(llm, chain_type="stuff")
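
`chain_type="stuff"` places every document passed to the chain into a single prompt, so the summary comes from one model call and all of the text has to fit in the model's context window. The sketch below shows the idea under simplifying assumptions. The prompt wording and the function names `stuff_documents` and `build_summary_prompt` are hypothetical; this is not LangChain's actual template or implementation.

```python
def stuff_documents(texts, separator="\n\n"):
    """Concatenate every document into one block of text for a single prompt."""
    return separator.join(texts)


def build_summary_prompt(texts):
    # Hypothetical prompt wording, not the template LangChain ships.
    return "Summarize the following articles:\n\n" + stuff_documents(texts)


# Placeholder texts standing in for the page_content of retrieved documents.
toy_texts = ["Article one text.", "Article two text."]
prompt = build_summary_prompt(toy_texts)
print(prompt)
```

Passing one document at a time yields one summary per article, while passing the whole list yields a single combined summary.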



In [49]:
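# Summarize each retrieved article on its own and print it with its source URL
for doc in query_docs:
  source = doc.metadata
  # The summarize chain reads its documents from the "input_documents" key
  result = chain.invoke({"input_documents": [doc]})
  print(result['output_text'])
  print(source)
  print('')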

This article from CNN highlights 13 of the biggest lies that former President Donald Trump made during his first month back in office. The lies range from false claims about funding condoms for Hamas to denying the assault on the Capitol on January 6, 2021. Trump also made false statements about birthright citizenship, California water policy, and the 2020 election. He continued to spread misinformation about Olympic boxers, Canada, FAA diversity initiatives, tariffs, autism rates, and China's operation of the Panama Canal. Additionally, Trump falsely claimed to have won the youth vote by 36 points in the 2024 election.
{'source': 'https://lite.cnn.com//2025/02/20/politics/analysis-trumps-13-biggest-lies-first-month-2025/index.html'}

The relationship between President Donald Trump and Ukrainian President Volodymyr Zelensky has deteriorated, with Trump publicly criticizing Zelensky and accusing him of strong-arming the US. Trump's criticism of Zelensky is linked to his own grievances w

In [50]:
chain.invoke(query_docs)

{'input_documents': [Document(metadata={'source': 'https://lite.cnn.com//2025/02/20/politics/analysis-trumps-13-biggest-lies-first-month-2025/index.html'}, page_content='CNN 2/20/2025\n\nAnalysis: Trump’s 13 biggest lies of his first month back in office\n\nBy Daniel Dale, CNN\n\nUpdated: 4:00 AM EST, Thu February 20, 2025\n\nSource: CNN\n\nPresident Donald Trump moved at a blistering pace in his first month back in the White House. He lied fast and furious, too.\n\nIn speeches, interviews, exchanges with reporters and posts on social media, the president filled his public statements not only with exaggerations but outright fabrications. As he did during his first presidency, Trump made false claims with a frequency and variety unmatched by any other elected official in Washington.\n\nHere is our list of Trump’s 13 biggest lies since he was inaugurated on January 20. It was hard to choose.\n\nThe tale of the $50 million – no, make it $100 million – in condoms for Hamas: When press secr