<a href="https://colab.research.google.com/github/StatsAI/NLP/blob/main/Unstructured_RAG_V3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Author: Hussain Abbas, MSc
# © 2025 Stats AI LLC
# All Rights Reserved

In [1]:
!pip install "unstructured[all-docs]"
!pip install "unstructured[openai]"
!pip install langchain
!pip install chromadb

Collecting unstructured[all-docs]
  Downloading unstructured-0.16.23-py3-none-any.whl.metadata (24 kB)
Collecting filetype (from unstructured[all-docs])
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting python-magic (from unstructured[all-docs])
  Downloading python_magic-0.4.27-py2.py3-none-any.whl.metadata (5.8 kB)
Collecting emoji (from unstructured[all-docs])
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting dataclasses-json (from unstructured[all-docs])
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting python-iso639 (from unstructured[all-docs])
  Downloading python_iso639-2025.2.18-py3-none-any.whl.metadata (14 kB)
Collecting langdetect (from unstructured[all-docs])
  Downloading langdetect-1.0.9.tar.gz (981 kB)
     981.5/981.5 kB 13.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collec

Collecting chromadb
  Downloading chromadb-0.6.3-py3-none-any.whl.metadata (6.8 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Downloading chroma_hnswlib-0.7.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.115.8-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.14.2-py2.py3-none-any.whl.metadata (2.9 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromadb)
  Downloading opentelemetry_exporter_otlp_proto_grpc-1.30.0-py3-none-any.whl.metadata (2.4 kB)
Collecting opentelemetry-instrumentation-fastapi>=0.41b0 (from chromadb)
  Downloading opentelemetry_instrumentation_fastapi-0

# Unstructured API - Key Features:

1. Precise Document Extraction: Unstructured offers advanced capabilities in extracting elements and metadata from documents. This includes a variety of document element types and metadata.

2. Extensive File Support: The platform supports a wide array of file types, including PDF, images, HTML, and many more, so it can handle most common document formats.

https://docs.unstructured.io/welcome

https://docs.unstructured.io/open-source/introduction/overview


# The Unstructured Core API consists of the following components:

1. Partitioning: Partitioning functions in Unstructured extract structured content from raw, unstructured documents. They break a document down into elements such as Title, NarrativeText, and ListItem, so users can decide which content to keep for their particular application. If you’re training a summarization model, for example, you may only be interested in NarrativeText.

2. Cleaning: Data preparation for NLP models often requires cleaning to ensure quality. The Unstructured library includes cleaning functions that assist in sanitizing output, removing unwanted content, and improving the performance of NLP models. This step is essential for maintaining the integrity of data before it is passed to downstream applications.

3. Extracting: This functionality allows for the extraction of specific entities within documents. It is designed to identify and isolate relevant pieces of information, making it easier for users to focus on the most pertinent data in their documents.

4. Staging: Staging functions help prepare your data for ingestion into downstream systems. Please note that this functionality is being deprecated in favor of Destination Connectors.

5. Chunking: The chunking process in Unstructured is distinct from conventional methods. Instead of relying solely on text-based features to form chunks, Unstructured uses a deep understanding of document formats to partition documents into semantic units (document elements).

6. Embedding: The embedding encoder classes in Unstructured leverage document elements detected through partitioning or grouped via chunking to obtain embeddings for each element. This is particularly useful for applications like Retrieval Augmented Generation (RAG), where precise and contextually relevant embeddings are crucial.
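
To make the partitioning and chunking ideas above concrete, here is a minimal sketch in plain Python. It does not call Unstructured: `Element` is a hypothetical stand-in for the library's element classes, and `keep_categories` and `chunk_by_title` are illustrative helpers, a simplification of title-based chunking rather than the library's implementation.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    category: str  # e.g. "Title", "NarrativeText", "ListItem"
    text: str


def keep_categories(elements, wanted):
    """Keep only the elements whose category is in `wanted`."""
    return [el for el in elements if el.category in wanted]


def chunk_by_title(elements):
    """Start a new chunk at every Title; other elements join the current chunk."""
    chunks = []
    for el in elements:
        if el.category == "Title" or not chunks:
            chunks.append([])
        chunks[-1].append(el)
    return chunks


elements = [
    Element("Title", "Introduction"),
    Element("NarrativeText", "Layout analysis matters."),
    Element("ListItem", "First contribution"),
    Element("Title", "Method"),
    Element("NarrativeText", "We train a detector."),
]

narrative = keep_categories(elements, {"NarrativeText"})
chunks = chunk_by_title(elements)

print([el.text for el in narrative])
print([[el.text for el in chunk] for chunk in chunks])
```

The point of the sketch is that chunk boundaries follow document structure (titles) rather than a fixed character count, so each chunk stays within one section.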

## PDF - Extraction
https://docs.unstructured.io/open-source/core-functionality/partitioning


In [6]:
from unstructured.partition.auto import partition

#link = 'https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/layout-parser-paper-fast.pdf'

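# raw.githubusercontent.com paths have no "/blob/" segment. The run recorded below
# used a URL containing "/blob/", so the server returned a "404: Not Found" page,
# which partition() parsed as a single Title element.
url = 'https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/pdf/layout-parser-paper-fast.pdf'

# Use if loading from a URL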
elements = partition(url=url)

# Use if loading from a local file
#elements = partition("/content/example-docs/layout-parser-paper-fast.pdf")

elements

[<unstructured.documents.elements.Title at 0x7cffaf909550>]

In [7]:
from collections import Counter

display(Counter(type(element) for element in elements))
print("")

Counter({unstructured.documents.elements.Title: 1})




In [8]:
display(*[(type(element), element.text) for element in elements])

(unstructured.documents.elements.Title, '404: Not Found')
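
The single `404: Not Found` title above suggests that the server returned an error page rather than the PDF, so it is worth checking the URL before partitioning. The sketch below rewrites a GitHub page URL into its raw-content form using only the standard library. `github_blob_to_raw` is a hypothetical helper, it assumes the branch or tag name contains no `/`, and it does not verify that the file actually exists at the resulting path.

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(url: str) -> str:
    """Rewrite a github.com '/blob/' page URL to its raw.githubusercontent.com form."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected path shape: owner / repo / "blob" / ref / path to file
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    owner, repo, _, ref, *file_path = segments
    raw_path = "/" + "/".join([owner, repo, ref, *file_path])
    return urlunsplit(("https", "raw.githubusercontent.com", raw_path, "", ""))


page_url = (
    "https://github.com/Unstructured-IO/unstructured/blob/main/"
    "example-docs/layout-parser-paper-fast.pdf"
)
raw_url = github_blob_to_raw(page_url)
print(raw_url)
```

Note that the raw form drops the `blob` segment entirely; keeping it in a `raw.githubusercontent.com` path is a common cause of 404 responses.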

## HTML - Extraction

In [13]:
url = 'https://www.businessinsider.com/navalny-death-letters-trump-second-term-agenda-really-scary-2024-2'

from unstructured.partition.auto import partition

elements = partition(url=url, strategy='hi_res', html_assemble_articles=True,
                     chunking_strategy="by_title", multipage_sections=True)
elements

[<unstructured.documents.elements.CompositeElement at 0x7cffa29319d0>,
 <unstructured.documents.elements.CompositeElement at 0x7cffa2931750>,
 <unstructured.documents.elements.CompositeElement at 0x7cffa2931490>,
 <unstructured.documents.elements.CompositeElement at 0x7cffa2931250>,
 <unstructured.documents.elements.CompositeElement at 0x7cffa2930f90>,
 <unstructured.documents.elements.CompositeElement at 0x7cffa2930c50>,
 <unstructured.documents.elements.CompositeElement at 0x7cffa2930790>,
 <unstructured.documents.elements.CompositeElement at 0x7cffa29309d0>,
 <unstructured.documents.elements.CompositeElement at 0x7cffa2930610>]

In [26]:
from unstructured.partition.html import partition_html

url = 'https://www.businessinsider.com/navalny-death-letters-trump-second-term-agenda-really-scary-2024-2'

elements = partition_html(url=url)

elements

[<unstructured.documents.elements.Text at 0x7cffa1959dd0>,
 <unstructured.documents.elements.Text at 0x7cffa195ac50>,
 <unstructured.documents.elements.Title at 0x7cffa195ad10>,
 <unstructured.documents.elements.Text at 0x7cffa1f56750>,
 <unstructured.documents.elements.Text at 0x7cffa25c4210>,
 <unstructured.documents.elements.Text at 0x7cffa1af1d90>,
 <unstructured.documents.elements.NarrativeText at 0x7cffa1908710>,
 <unstructured.documents.elements.NarrativeText at 0x7cffa1908450>,
 <unstructured.documents.elements.ListItem at 0x7cffa1af24d0>,
 <unstructured.documents.elements.ListItem at 0x7cffa1af3fd0>,
 <unstructured.documents.elements.ListItem at 0x7cffa1af1dd0>,
 <unstructured.documents.elements.NarrativeText at 0x7cffa1af3ed0>,
 <unstructured.documents.elements.NarrativeText at 0x7cffa195a310>,
 <unstructured.documents.elements.NarrativeText at 0x7cffa195b7d0>,
 <unstructured.documents.elements.NarrativeText at 0x7cffa195b250>,
 <unstructured.documents.elements.NarrativeText 

In [27]:
from collections import Counter

display(Counter(type(element) for element in elements))
print("")

Counter({unstructured.documents.elements.Text: 10,
         unstructured.documents.elements.Title: 3,
         unstructured.documents.elements.NarrativeText: 17,
         unstructured.documents.elements.ListItem: 35})




In [33]:
display(*[(type(element), element.text) for element in elements])

(unstructured.documents.elements.Text, 'Subscribe Newsletters')

(unstructured.documents.elements.Text, 'Military & Defense')

(unstructured.documents.elements.Title,
 "In Navalny's last letters, the Russian dissident called Trump's agenda for a second term 'really scary'")

(unstructured.documents.elements.Text, 'Kelsey Vlamis')

(unstructured.documents.elements.Text, '2024-02-20T01:55:04Z')

(unstructured.documents.elements.Text,
 'Facebook Email X LinkedIn Copy Link Impact Link')

(unstructured.documents.elements.NarrativeText, 'Read in app')

(unstructured.documents.elements.NarrativeText,
 'This story is available exclusively to Business Insider subscribers. Become an Insider and start reading now. Have an account? .')

(unstructured.documents.elements.ListItem,
 "Alexey Navalny, Vladimir Putin's most prominent critic, commented on US politics months before his death.")

(unstructured.documents.elements.ListItem,
 'Navalny expressed concern in letters to a friend over a potential second term for Donald Trump.')

(unstructured.documents.elements.ListItem,
 "Trump briefly mentioned Navalny's death in a Truth Social post on Monday.")

(unstructured.documents.elements.NarrativeText,
 'Alexey Navalny, a dissident and the political nemesis of Russian President Vladimir Putin, spent the past few years of his life behind bars but still managed to stay connected to the outside world.')

(unstructured.documents.elements.NarrativeText,
 "Letters from the final months of his life, obtained by The New York Times, show that Navalny, who'd been imprisoned since January 2021, managed to stay on top of current events — including in the US.")

(unstructured.documents.elements.NarrativeText,
 'In a letter sent to a friend, a photographer named Evgeny Feldman, Navalny said former President Donald Trump\'s agenda for a second term was "really scary," according to the Times.')

(unstructured.documents.elements.NarrativeText,
 'He said if President Joe Biden were to have a health issue, "Trump will become president," adding: "Doesn\'t this obvious thing concern the Democrats?"')

(unstructured.documents.elements.NarrativeText,
 'In another letter to Feldman dated December 3, Navalny again expressed concern over Trump and asked his friend, "Please name one current politician you admire."')

(unstructured.documents.elements.NarrativeText,
 "Trump's office didn't immediately respond to a request for comment from Business Insider.")

(unstructured.documents.elements.NarrativeText,
 'On December 6, Navalny disappeared from the IK-6 penal colony about 120 miles east of Moscow. He turned up again on Christmas Day when his lawyers announced they had located him at the IK-3 penal colony, about 1,000 miles northeast of Moscow, above the Arctic Circle.')

(unstructured.documents.elements.NarrativeText,
 "The Times reported that Navalny's communication ability from his new prison was greatly diminished.")

(unstructured.documents.elements.NarrativeText,
 "The journalist Sergei Parkhomenko said he received a letter from Navalny on February 13, a few days before Navalny's death was announced. In the letter, which Parkhomenko shared on Facebook, Navalny spoke of books and said he only had access to classics at his new prison.")

(unstructured.documents.elements.NarrativeText,
 '"Who could\'ve told me that Chekhov is the most depressing Russian writer?" he wrote.')

(unstructured.documents.elements.NarrativeText,
 "Trump, for his part, didn't mention Navalny in the days after his death, despite condemnations from other leaders who directly blamed Putin.")

(unstructured.documents.elements.NarrativeText,
 'In a Truth Social post on Monday, Trump briefly mentioned Navalny before directing his ire at his own perceived political opponents: "The sudden death of Alexei Navalny has made me more and more aware of what is happening in our Country. It is a slow, steady progression, with CROOKED, Radical Left Politicians, Prosecutors, and Judges leading us down a path to destruction."')

(unstructured.documents.elements.NarrativeText,
 'He mentioned neither Russia nor Putin.')

(unstructured.documents.elements.Title, 'Read next')

(unstructured.documents.elements.ListItem, 'Donald Trump')

(unstructured.documents.elements.ListItem, 'Russia')

(unstructured.documents.elements.Title, 'Recommended video')

(unstructured.documents.elements.NarrativeText,
 'This story is available exclusively to Business Insider subscribers. Become an Insider and start reading now. Have an account? .')

(unstructured.documents.elements.Text, 'Legal & Privacy')

(unstructured.documents.elements.ListItem, 'Terms of Service')

(unstructured.documents.elements.ListItem, 'Terms of Sale')

(unstructured.documents.elements.ListItem, 'Privacy Policy')

(unstructured.documents.elements.ListItem, 'Accessibility')

(unstructured.documents.elements.ListItem, 'Code of Ethics Policy')

(unstructured.documents.elements.ListItem, 'Reprints & Permissions')

(unstructured.documents.elements.ListItem, 'Disclaimer')

(unstructured.documents.elements.ListItem, 'Advertising Policies')

(unstructured.documents.elements.ListItem, 'Conflict of Interest Policy')

(unstructured.documents.elements.ListItem, 'Commerce Policy')

(unstructured.documents.elements.ListItem, 'Coupons Privacy Policy')

(unstructured.documents.elements.ListItem, 'Coupons Terms')

(unstructured.documents.elements.ListItem, 'Your Privacy Choices')

(unstructured.documents.elements.Text, 'Company')

(unstructured.documents.elements.ListItem, 'About Us')

(unstructured.documents.elements.ListItem, 'Careers')

(unstructured.documents.elements.ListItem, 'Advertise With Us')

(unstructured.documents.elements.ListItem, 'Contact Us')

(unstructured.documents.elements.ListItem, 'Company News')

(unstructured.documents.elements.ListItem, 'Masthead')

(unstructured.documents.elements.Text, 'Other')

(unstructured.documents.elements.ListItem, 'Sitemap')

(unstructured.documents.elements.ListItem, 'Stock quotes by finanzen.net')

(unstructured.documents.elements.Text, 'International Editions')

(unstructured.documents.elements.ListItem, 'AT')

(unstructured.documents.elements.ListItem, 'DE')

(unstructured.documents.elements.ListItem, 'ES')

(unstructured.documents.elements.ListItem, 'JP')

(unstructured.documents.elements.ListItem, 'NL')

(unstructured.documents.elements.ListItem, 'PL')

(unstructured.documents.elements.NarrativeText,
 'Copyright © 2025 Insider Inc. All rights reserved. Registration on or use of this site constitutes acceptance of our Terms of Service and Privacy Policy.')

(unstructured.documents.elements.Text, 'Jump to')

(unstructured.documents.elements.ListItem, 'Main content')

(unstructured.documents.elements.ListItem, 'Search')

(unstructured.documents.elements.ListItem, 'Account')

In [38]:
for element in elements:
  print(str(type(element)))

<class 'unstructured.documents.elements.Text'>
<class 'unstructured.documents.elements.Text'>
<class 'unstructured.documents.elements.Title'>
<class 'unstructured.documents.elements.Text'>
<class 'unstructured.documents.elements.Text'>
<class 'unstructured.documents.elements.Text'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.ListItem'>
<class 'unstructured.documents.elements.ListItem'>
<class 'unstructured.documents.elements.ListItem'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<cla

In [39]:
from unstructured.documents.elements import Title

# Compare the exact class instead of its string representation
display(*[(type(element), element.text) for element in elements if type(element) is Title])

(unstructured.documents.elements.Title,
 "In Navalny's last letters, the Russian dissident called Trump's agenda for a second term 'really scary'")

(unstructured.documents.elements.Title, 'Read next')

(unstructured.documents.elements.Title, 'Recommended video')

In [41]:
from unstructured.documents.elements import ListItem

# Compare the exact class instead of its string representation
display(*[(type(element), element.text) for element in elements if type(element) is ListItem])

(unstructured.documents.elements.ListItem,
 "Alexey Navalny, Vladimir Putin's most prominent critic, commented on US politics months before his death.")

(unstructured.documents.elements.ListItem,
 'Navalny expressed concern in letters to a friend over a potential second term for Donald Trump.')

(unstructured.documents.elements.ListItem,
 "Trump briefly mentioned Navalny's death in a Truth Social post on Monday.")

(unstructured.documents.elements.ListItem, 'Donald Trump')

(unstructured.documents.elements.ListItem, 'Russia')

(unstructured.documents.elements.ListItem, 'Terms of Service')

(unstructured.documents.elements.ListItem, 'Terms of Sale')

(unstructured.documents.elements.ListItem, 'Privacy Policy')

(unstructured.documents.elements.ListItem, 'Accessibility')

(unstructured.documents.elements.ListItem, 'Code of Ethics Policy')

(unstructured.documents.elements.ListItem, 'Reprints & Permissions')

(unstructured.documents.elements.ListItem, 'Disclaimer')

(unstructured.documents.elements.ListItem, 'Advertising Policies')

(unstructured.documents.elements.ListItem, 'Conflict of Interest Policy')

(unstructured.documents.elements.ListItem, 'Commerce Policy')

(unstructured.documents.elements.ListItem, 'Coupons Privacy Policy')

(unstructured.documents.elements.ListItem, 'Coupons Terms')

(unstructured.documents.elements.ListItem, 'Your Privacy Choices')

(unstructured.documents.elements.ListItem, 'About Us')

(unstructured.documents.elements.ListItem, 'Careers')

(unstructured.documents.elements.ListItem, 'Advertise With Us')

(unstructured.documents.elements.ListItem, 'Contact Us')

(unstructured.documents.elements.ListItem, 'Company News')

(unstructured.documents.elements.ListItem, 'Masthead')

(unstructured.documents.elements.ListItem, 'Sitemap')

(unstructured.documents.elements.ListItem, 'Stock quotes by finanzen.net')

(unstructured.documents.elements.ListItem, 'AT')

(unstructured.documents.elements.ListItem, 'DE')

(unstructured.documents.elements.ListItem, 'ES')

(unstructured.documents.elements.ListItem, 'JP')

(unstructured.documents.elements.ListItem, 'NL')

(unstructured.documents.elements.ListItem, 'PL')

(unstructured.documents.elements.ListItem, 'Main content')

(unstructured.documents.elements.ListItem, 'Search')

(unstructured.documents.elements.ListItem, 'Account')

In [42]:
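from unstructured.documents.elements import Text

# Exact type match on purpose: Title, NarrativeText, and ListItem subclass Text,
# so isinstance(element, Text) would also return those elements.
display(*[(type(element), element.text) for element in elements if type(element) is Text])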

(unstructured.documents.elements.Text, 'Subscribe Newsletters')

(unstructured.documents.elements.Text, 'Military & Defense')

(unstructured.documents.elements.Text, 'Kelsey Vlamis')

(unstructured.documents.elements.Text, '2024-02-20T01:55:04Z')

(unstructured.documents.elements.Text,
 'Facebook Email X LinkedIn Copy Link Impact Link')

(unstructured.documents.elements.Text, 'Legal & Privacy')

(unstructured.documents.elements.Text, 'Company')

(unstructured.documents.elements.Text, 'Other')

(unstructured.documents.elements.Text, 'International Editions')

(unstructured.documents.elements.Text, 'Jump to')

In [43]:
from unstructured.documents.elements import NarrativeText

# Compare the exact class instead of its string representation
display(*[(type(element), element.text) for element in elements if type(element) is NarrativeText])

(unstructured.documents.elements.NarrativeText, 'Read in app')

(unstructured.documents.elements.NarrativeText,
 'This story is available exclusively to Business Insider subscribers. Become an Insider and start reading now. Have an account? .')

(unstructured.documents.elements.NarrativeText,
 'Alexey Navalny, a dissident and the political nemesis of Russian President Vladimir Putin, spent the past few years of his life behind bars but still managed to stay connected to the outside world.')

(unstructured.documents.elements.NarrativeText,
 "Letters from the final months of his life, obtained by The New York Times, show that Navalny, who'd been imprisoned since January 2021, managed to stay on top of current events — including in the US.")

(unstructured.documents.elements.NarrativeText,
 'In a letter sent to a friend, a photographer named Evgeny Feldman, Navalny said former President Donald Trump\'s agenda for a second term was "really scary," according to the Times.')

(unstructured.documents.elements.NarrativeText,
 'He said if President Joe Biden were to have a health issue, "Trump will become president," adding: "Doesn\'t this obvious thing concern the Democrats?"')

(unstructured.documents.elements.NarrativeText,
 'In another letter to Feldman dated December 3, Navalny again expressed concern over Trump and asked his friend, "Please name one current politician you admire."')

(unstructured.documents.elements.NarrativeText,
 "Trump's office didn't immediately respond to a request for comment from Business Insider.")

(unstructured.documents.elements.NarrativeText,
 'On December 6, Navalny disappeared from the IK-6 penal colony about 120 miles east of Moscow. He turned up again on Christmas Day when his lawyers announced they had located him at the IK-3 penal colony, about 1,000 miles northeast of Moscow, above the Arctic Circle.')

(unstructured.documents.elements.NarrativeText,
 "The Times reported that Navalny's communication ability from his new prison was greatly diminished.")

(unstructured.documents.elements.NarrativeText,
 "The journalist Sergei Parkhomenko said he received a letter from Navalny on February 13, a few days before Navalny's death was announced. In the letter, which Parkhomenko shared on Facebook, Navalny spoke of books and said he only had access to classics at his new prison.")

(unstructured.documents.elements.NarrativeText,
 '"Who could\'ve told me that Chekhov is the most depressing Russian writer?" he wrote.')

(unstructured.documents.elements.NarrativeText,
 "Trump, for his part, didn't mention Navalny in the days after his death, despite condemnations from other leaders who directly blamed Putin.")

(unstructured.documents.elements.NarrativeText,
 'In a Truth Social post on Monday, Trump briefly mentioned Navalny before directing his ire at his own perceived political opponents: "The sudden death of Alexei Navalny has made me more and more aware of what is happening in our Country. It is a slow, steady progression, with CROOKED, Radical Left Politicians, Prosecutors, and Judges leading us down a path to destruction."')

(unstructured.documents.elements.NarrativeText,
 'He mentioned neither Russia nor Putin.')

(unstructured.documents.elements.NarrativeText,
 'This story is available exclusively to Business Insider subscribers. Become an Insider and start reading now. Have an account? .')

(unstructured.documents.elements.NarrativeText,
 'Copyright © 2025 Insider Inc. All rights reserved. Registration on or use of this site constitutes acceptance of our Terms of Service and Privacy Policy.')
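
The listings above show that the article text sits between the headline `Title` and the `Read next` title, surrounded by navigation, paywall, and footer elements. For retrieval it usually helps to keep only that body. The following is a minimal sketch under stated assumptions: elements are represented as hypothetical `(category, text)` tuples copied from the output above rather than real Unstructured objects, and the `min_words` threshold is an arbitrary illustrative value.

```python
def article_body(elements, min_words=5):
    """Return NarrativeText strings between the first Title and the next Title."""
    body = []
    inside = False
    for category, text in elements:
        if category == "Title":
            if inside:
                break  # a second Title marks the end of the article section
            inside = True
            continue
        # Very short NarrativeText (e.g. "Read in app") is treated as page chrome
        if inside and category == "NarrativeText" and len(text.split()) >= min_words:
            body.append(text)
    return body


sample = [
    ("Text", "Subscribe Newsletters"),
    ("Title", "In Navalny's last letters, the Russian dissident called "
              "Trump's agenda for a second term 'really scary'"),
    ("Text", "Kelsey Vlamis"),
    ("NarrativeText", "Read in app"),
    ("ListItem", "Navalny expressed concern in letters to a friend over a "
                 "potential second term for Donald Trump."),
    ("NarrativeText", "He mentioned neither Russia nor Putin."),
    ("Title", "Read next"),
    ("ListItem", "Donald Trump"),
    ("NarrativeText", "Copyright © 2025 Insider Inc. All rights reserved."),
]

body = article_body(sample)
for paragraph in body:
    print(paragraph)
```

A word-count filter alone will not drop the longer paywall notice seen in the real output; a production filter would need an extra rule for that.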

## Embeddings

https://unstructured-io.github.io/unstructured/core/embedding.html

In [None]:
from google.colab import userdata
from unstructured.documents.elements import Text
# OpenAIEmbeddingConfig must be imported alongside the encoder;
# omitting it raises "NameError: name 'OpenAIEmbeddingConfig' is not defined".
from unstructured.embed.openai import OpenAIEmbeddingConfig, OpenAIEmbeddingEncoder

open_ai_api_key = userdata.get('open_ai_api_key')

# Initialize the encoder with OpenAI credentials
embedding_encoder = OpenAIEmbeddingEncoder(config=OpenAIEmbeddingConfig(api_key=open_ai_api_key))

# Embed a list of Elements
elements = embedding_encoder.embed_documents(
    elements=[Text("This is sentence 1"), Text("This is sentence 2")],
)

# Embed a single query string
query = "This is the query"
query_embedding = embedding_encoder.embed_query(query=query)

# Print embeddings
for e in elements:
    print(e.embeddings, e)
print(query_embedding, query)
print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())
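
The last line of the cell above asks the encoder whether its embeddings are unit vectors and how many dimensions they have. As a rough illustration of why that matters, the sketch below checks the norm of a vector and computes cosine similarity in plain Python. The three-dimensional vectors are made-up values, not real OpenAI embeddings, and `has_unit_norm` and `cosine_similarity` are hypothetical helpers, not part of Unstructured.

```python
import math


def l2_norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def has_unit_norm(vector, tol=1e-6):
    """True when the vector's length is 1 within a small tolerance."""
    return abs(l2_norm(vector) - 1.0) <= tol


def cosine_similarity(a, b):
    """Dot product divided by the product of the two lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))


# Made-up 3-dimensional vectors standing in for real embeddings
doc_embedding = [0.6, 0.8, 0.0]
query_embedding_demo = [0.0, 0.6, 0.8]

print(has_unit_norm(doc_embedding), len(doc_embedding))
similarity = cosine_similarity(doc_embedding, query_embedding_demo)
print(round(similarity, 4))
```

When every vector already has unit length, the denominator is 1 and cosine similarity reduces to a plain dot product, which is cheaper for a vector store to compute.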

## CNN HTML Links, Vector Databases, Automatic Summarization via LangChain + OpenAI

In the following section we:

1. Get the links for the latest articles from an HTML file (Option 1) or directly from the CNN website (Option 2).
2. Use the Unstructured document loader in LangChain to load the files.
3. Create embeddings for each file using OpenAIEmbeddings.
4. Store the embeddings in Chroma DB.
5. Query Chroma DB to return relevant articles.
6. Summarize the relevant articles using LangChain's OpenAI integration.

https://unstructured-io.github.io/unstructured/examples/chroma.html

## Option 1: Import from HTML file stored on GitHub

In [None]:
from unstructured.partition.html import partition_html
import requests
from google.colab import userdata
open_ai_api_key = userdata.get('open_ai_api_key')

url = "https://github.com/StatsAI/NLP/blob/main/Breaking%20News%2C%20Latest%20News%20and%20Videos%20_%20CNN.html"

try:
    response = requests.get(url, allow_redirects=True, timeout=30)
    response.raise_for_status()  # Raise an error if the download fails
except requests.exceptions.RequestException as e:
    print(f"Error downloading HTML: {e}")
    raise  # Stop this cell; calling exit() in a notebook can shut down the kernel

# Save the content to a file
with open("downloaded_html.html", "wb") as f:
    f.write(response.content)

print("HTML downloaded successfully to downloaded_html.html")

elements = partition_html(filename='downloaded_html.html')
elements = elements[3].links

links = []
cnn_lite_url = "https://lite.cnn.com/"

for element in elements:
  try:
    relative_link = element["url"][3:-2]
  except (KeyError, TypeError):
    # Skip entries that have no usable "url" field
    continue
  if relative_link:
    links.append(f"{cnn_lite_url}{relative_link}")

links

HTML downloaded successfully to downloaded_html.html


['https://lite.cnn.com/2024/02/22/middleeast/israel-ceasefire-negotiating-team-to-paris-intl/index.html',
 'https://lite.cnn.com/2024/02/22/europe/spain-valencia-apartment-fire-intl/index.html',
 'https://lite.cnn.com/2024/02/22/politics/leaked-documents-tech-firm-chinese-hacking/index.html',
 'https://lite.cnn.com/2024/02/22/us/university-georgia-woman-dead/index.html',
 'https://lite.cnn.com/2024/02/22/politics/nikki-haley-alabama-ivf-ruling/index.html',
 'https://lite.cnn.com/2024/02/22/politics/mar-a-lago-carlos-de-oliveira-classified-documents/index.html',
 'https://lite.cnn.com/2024/02/22/us/military-family-embryos-alabama-ruling/index.html',
 'https://lite.cnn.com/2024/02/22/politics/putin-trump-us-election-analysis/index.html',
 'https://lite.cnn.com/2024/02/22/politics/us-brothers-detained-gaza/index.html',
 'https://lite.cnn.com/2024/02/22/politics/trump-moves-dismiss-classified-documents-case/index.html',
 'https://lite.cnn.com/2024/02/22/us/rise-above-movement-ruling-on-hol

## Option 2: Import from URL

In [None]:
from unstructured.partition.html import partition_html
from google.colab import userdata
open_ai_api_key = userdata.get('open_ai_api_key')

cnn_lite_url = "https://lite.cnn.com/"

elements = partition_html(url=cnn_lite_url)
links = []

for element in elements:
  try:
    relative_link = element.links[0]["url"][1:]
  except (IndexError, KeyError):
    # Skip elements that have no links, or whose first link has no "url" field
    continue
  if relative_link:
    links.append(f"{cnn_lite_url}{relative_link}")

links

['https://lite.cnn.com/2024/02/22/us/man-arrested-in-the-killing-of-a-los-angeles-model-and-real-estate-agent/index.html',
 'https://lite.cnn.com/2024/02/22/travel/alaska-airlines-passenger-pens-assault-charge/index.html',
 'https://lite.cnn.com/2024/02/22/politics/trump-moves-dismiss-classified-documents-case/index.html',
 'https://lite.cnn.com/2024/02/22/europe/kherson-russia-advance-ukraine-intl/index.html',
 'https://lite.cnn.com/2024/02/22/style/jeff-koons-moon-phases-odysseus-landing/index.html',
 'https://lite.cnn.com/2024/02/21/politics/haley-alabama-supreme-court-ruling/index.html',
 'https://lite.cnn.com/2024/02/22/middleeast/israel-ceasefire-negotiating-team-to-paris-intl/index.html',
 'https://lite.cnn.com/2024/02/22/europe/spain-valencia-apartment-fire-intl/index.html',
 'https://lite.cnn.com/2024/02/22/politics/leaked-documents-tech-firm-chinese-hacking/index.html',
 'https://lite.cnn.com/2024/02/22/us/university-georgia-woman-dead/index.html',
 'https://lite.cnn.com/2024
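
Both options build absolute URLs by slicing the leading `/` off each relative link and prepending the base URL, and their outputs overlap. A slightly more robust alternative, sketched here with the standard library only, resolves each link with `urllib.parse.urljoin` and removes duplicates while preserving order. `absolute_links` is a hypothetical helper, and the sample paths are taken from the output above.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, dropping empty values and duplicates in order."""
    resolved = (urljoin(base_url, href) for href in hrefs if href)
    # dict.fromkeys keeps the first occurrence of each URL and preserves order
    return list(dict.fromkeys(resolved))


hrefs = [
    "/2024/02/22/europe/spain-valencia-apartment-fire-intl/index.html",
    "",
    "/2024/02/22/us/university-georgia-woman-dead/index.html",
    "/2024/02/22/europe/spain-valencia-apartment-fire-intl/index.html",
]
example_links = absolute_links("https://lite.cnn.com/", hrefs)
print(example_links)
```

Deduplicating before loading avoids fetching and embedding the same article twice.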

In [None]:
from langchain.document_loaders import UnstructuredURLLoader

# links = ['https://lite.cnn.com/2024/02/22/tech/nvidia-ceo-jensen-huang-20-richest-billionaire/index.html',
#          'https://lite.cnn.com/2024/02/22/us/darryl-george-crown-act-trial-texas-reaj/index.html']

loaders = UnstructuredURLLoader(urls=links, show_progress_bar=True)

docs = loaders.load()

100%|██████████| 100/100 [00:27<00:00,  3.65it/s]


In [None]:
from langchain.vectorstores.chroma import Chroma
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key=open_ai_api_key)
vectorstore = Chroma.from_documents(docs, embeddings)



In [None]:
query_docs = vectorstore.similarity_search("Russia", k=5)
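
Conceptually, a similarity search embeds the query, scores every stored vector against it, and returns the `k` best matches. The sketch below shows that idea with a toy in-memory store. It is not how Chroma is implemented: `embed` is a hypothetical word-count embedding standing in for a real model, and the sample texts are invented.

```python
import heapq
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_search(store, query, k):
    """Return the k stored texts whose vectors score highest against the query."""
    query_vec = embed(query)
    ranked = heapq.nlargest(k, store, key=lambda item: cosine(item[1], query_vec))
    return [text for text, _ in ranked]


texts = [
    "putin looms over the us election",
    "apartment fire in valencia spain",
    "russia and putin dominate us politics",
]
store = [(text, embed(text)) for text in texts]
top_matches = similarity_search(store, "putin russia", k=2)
print(top_matches)
```

A real vector store replaces the linear scan with an approximate nearest-neighbour index, but the inputs and outputs have the same shape.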

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k", openai_api_key=open_ai_api_key)
chain = load_summarize_chain(llm, chain_type="stuff")



In [None]:
for doc in query_docs:
  source = doc.metadata
  result = chain.invoke([doc])
  print(result['output_text'])
  print(source)
  print('')

The article discusses how Russian President Vladimir Putin has been able to exert influence and cause division within the United States. Despite facing criticism and sanctions, Putin has remained resilient and has successfully exploited American political divides. His actions have not only threatened US power but also strained relations between the US and its European NATO allies. The article highlights Putin's use of espionage and propaganda to provoke discord in US politics, and how he has capitalized on the tendency of American politicians to turn against each other.
{'source': 'https://lite.cnn.com/2024/02/22/politics/putin-trump-us-election-analysis/index.html'}

European leaders are increasingly concerned about the United States' commitment to their defense and are taking steps to become more self-sufficient. Former President Donald Trump's failure to condemn Russian President Vladimir Putin and his opposition to aid for Ukraine have raised fears among European leaders that the U

In [None]:
chain.invoke(query_docs)

{'input_documents': [Document(page_content='CNN\n\n2/23/2024\n\nPutin looms over a third successive US election\n\nAnalysis by Stephen Collinson, CNN\n\nUpdated: \n        10:53 PM EST, Thu February 22, 2024\n\nSource: CNN\n\n“Russia, Russia, Russia.”\n\nEx-President Donald Trump’s scathing catchphrase for a torrent of investigations during his administration also serves as an apt catch-all for the\xa0current meltdown over Moscow roiling US politics.\n\nThe United States might have beaten the Kremlin in the Cold War\xa0and ever since regarded Moscow as a mere irritant — albeit one with nuclear arms — and have been desperate to concentrate on the showdown with its new superpower rival, China.\n\nBut Russia and its leader, whom President Joe Biden described as a “crazy S.O.B.” at a Wednesday fundraiser, won’t go away.\n\nPresident Vladimir Putin has trained the malevolence of his intelligence agencies, his military power, global diplomacy and obstructive statecraft into a multi-front ass
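
The summarization chain above was created with `chain_type="stuff"`, which places the text of all input documents into a single prompt. A minimal sketch of that idea follows. The instruction wording and the `max_chars` budget are assumptions added for illustration (they are not LangChain's actual prompt template or behaviour), and `stuff_prompt` is a hypothetical helper.

```python
def stuff_prompt(documents, instruction="Write a concise summary of the following:",
                 max_chars=2000):
    """Concatenate document texts into one prompt, stopping at a character budget."""
    parts = []
    used = 0
    for text in documents:
        if used + len(text) > max_chars:
            break  # adding this document would exceed the budget
        parts.append(text)
        used += len(text)
    return instruction + "\n\n" + "\n\n".join(parts)


documents = ["a" * 10, "b" * 10, "d" * 10]
prompt = stuff_prompt(documents, max_chars=25)
print(prompt)
```

The budget makes the main limitation visible: everything has to fit in one model call, so this approach works for a handful of short articles but not for large collections.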