<a href="https://colab.research.google.com/github/StatsAI/NLP/blob/main/Unstructured_RAG_V3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Author: Hussain Abbas, MSc
# © 2025 Stats AI LLC
# All Rights Reserved

In [2]:
!pip install "unstructured[all-docs]"
#!pip install "unstructured[openai]"
!pip install langchain
!pip install -U langchain-community
!pip install langchain-huggingface
!pip install -U sentence-transformers
#!pip install tiktoken
!pip install chromadb
!pip install langchain-google-genai



# Unstructured API - Key Features:

1. Precise Document Extraction: Unstructured offers advanced capabilities in extracting elements and metadata from documents. This includes a variety of document element types and metadata.

2. Extensive File Support: The platform supports a wide array of file types, including PDF, images, HTML, and many more.

https://docs.unstructured.io/welcome

https://docs.unstructured.io/open-source/introduction/overview


# The Unstructured Core API consists of the following components:

1. Partitioning: Partitioning functions in Unstructured extract structured content from raw, unstructured documents. They break a document down into elements such as Title, NarrativeText, and ListItem, so users can decide which content to keep for their particular application. If you’re training a summarization model, for example, you may only be interested in NarrativeText.

2. Cleaning: Data preparation for NLP models often requires cleaning to ensure quality. The Unstructured library includes cleaning functions that assist in sanitizing output, removing unwanted content, and improving the performance of NLP models. This step is essential for maintaining the integrity of data before it is passed to downstream applications.

3. Extracting: This functionality allows for the extraction of specific entities within documents. It is designed to identify and isolate relevant pieces of information, making it easier for users to focus on the most pertinent data in their documents.

4. Staging: Staging functions help prepare your data for ingestion into downstream systems. Please note that this functionality is being deprecated in favor of Destination Connectors.

5. Chunking: The chunking process in Unstructured is distinct from conventional methods. Instead of relying solely on text-based features to form chunks, Unstructured uses a deep understanding of document formats to partition documents into semantic units (document elements).

6. Embedding: The embedding encoder classes in Unstructured leverage document elements detected through partitioning or grouped via chunking to obtain embeddings for each element. This is particularly useful for applications like Retrieval Augmented Generation (RAG), where precise and contextually relevant embeddings are crucial.
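
To make the partitioning and chunking ideas above more concrete, here is a minimal, library-free sketch. It does not call Unstructured: the `Element` dataclass and the `keep_categories` and `group_by_title` helpers are hypothetical stand-ins that only mimic typed elements and the idea of starting a new chunk at each Title.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for an Unstructured document element."""
    category: str  # e.g. "Title", "NarrativeText", "ListItem"
    text: str


def keep_categories(elements, wanted):
    """Keep only elements whose category is in `wanted`."""
    return [e for e in elements if e.category in wanted]


def group_by_title(elements):
    """Start a new chunk at every Title and join the texts of each chunk."""
    chunks, current = [], []
    for e in elements:
        if e.category == "Title" and current:
            chunks.append("\n".join(current))
            current = []
        current.append(e.text)
    if current:
        chunks.append("\n".join(current))
    return chunks


elements = [
    Element("Title", "Introduction"),
    Element("NarrativeText", "Documents come in many formats."),
    Element("ListItem", "PDF"),
    Element("Title", "Methods"),
    Element("NarrativeText", "We partition each file into elements."),
]

narrative_only = keep_categories(elements, {"NarrativeText"})
chunks = group_by_title(elements)
print([e.text for e in narrative_only])
print(chunks)
```

Filtering by category corresponds to the summarization example in item 1, and grouping at titles is one simple way to form chunks from elements rather than from raw character counts.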

## PDF - Extraction
https://docs.unstructured.io/open-source/core-functionality/partitioning


In [3]:
# from unstructured.partition.auto import partition

# #link = 'https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/layout-parser-paper-fast.pdf'

# url = 'https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/pdf/layout-parser-paper-fast.pdf'

# # Use if loading from a URL
# elements = partition(url=url)

# # Use if loading from a local file
# #elements = partition("/content/example-docs/layout-parser-paper-fast.pdf")

# elements

In [4]:
# from collections import Counter

# display(Counter(type(element) for element in elements))
# print("")

In [5]:
# display(*[(type(element), element.text) for element in elements])

## HTML - Extraction

In [6]:
# url = 'https://www.businessinsider.com/navalny-death-letters-trump-second-term-agenda-really-scary-2024-2'

# from unstructured.partition.auto import partition

# elements = partition(url=url, strategy='hi_res', html_assemble_articles=True,
#                      chunking_strategy="by_title", multipage_sections=True)
# elements

In [7]:
from unstructured.partition.html import partition_html

url = 'https://www.businessinsider.com/navalny-death-letters-trump-second-term-agenda-really-scary-2024-2'

elements = partition_html(url=url)

elements

[<unstructured.documents.elements.Text at 0x7a9ee6b81cd0>,
 <unstructured.documents.elements.Text at 0x7a9ee6b38550>,
 <unstructured.documents.elements.Title at 0x7a9ee6b28110>,
 <unstructured.documents.elements.Text at 0x7a9ee6ba94d0>,
 <unstructured.documents.elements.Text at 0x7a9ee6ba97d0>,
 <unstructured.documents.elements.Text at 0x7a9ee6b11b50>,
 <unstructured.documents.elements.NarrativeText at 0x7a9ee4d67750>,
 <unstructured.documents.elements.NarrativeText at 0x7a9ee4d6c890>,
 <unstructured.documents.elements.ListItem at 0x7a9ee4d6c650>,
 <unstructured.documents.elements.ListItem at 0x7a9ee4d6c9d0>,
 <unstructured.documents.elements.ListItem at 0x7a9ee4d6cad0>,
 <unstructured.documents.elements.NarrativeText at 0x7a9ee4d6f790>,
 <unstructured.documents.elements.NarrativeText at 0x7a9ee4b75b90>,
 <unstructured.documents.elements.NarrativeText at 0x7a9ee4b77790>,
 <unstructured.documents.elements.NarrativeText at 0x7a9ee4b7cd10>,
 <unstructured.documents.elements.NarrativeText 

In [8]:
from collections import Counter

display(Counter(type(element) for element in elements))
print("")

Counter({unstructured.documents.elements.Text: 10,
         unstructured.documents.elements.Title: 3,
         unstructured.documents.elements.NarrativeText: 17,
         unstructured.documents.elements.ListItem: 35})




In [9]:
display(*[(type(element), element.text) for element in elements])

(unstructured.documents.elements.Text, 'Subscribe Newsletters')

(unstructured.documents.elements.Text, 'Military & Defense')

(unstructured.documents.elements.Title,
 "In Navalny's last letters, the Russian dissident called Trump's agenda for a second term 'really scary'")

(unstructured.documents.elements.Text, 'Kelsey Vlamis')

(unstructured.documents.elements.Text, '2024-02-20T01:55:04Z')

(unstructured.documents.elements.Text,
 'Facebook Email X LinkedIn Copy Link Impact Link')

(unstructured.documents.elements.NarrativeText, 'Read in app')

(unstructured.documents.elements.NarrativeText,
 'This story is available exclusively to Business Insider subscribers. Become an Insider and start reading now. Have an account? .')

(unstructured.documents.elements.ListItem,
 "Alexey Navalny, Vladimir Putin's most prominent critic, commented on US politics months before his death.")

(unstructured.documents.elements.ListItem,
 'Navalny expressed concern in letters to a friend over a potential second term for Donald Trump.')

(unstructured.documents.elements.ListItem,
 "Trump briefly mentioned Navalny's death in a Truth Social post on Monday.")

(unstructured.documents.elements.NarrativeText,
 'Alexey Navalny, a dissident and the political nemesis of Russian President Vladimir Putin, spent the past few years of his life behind bars but still managed to stay connected to the outside world.')

(unstructured.documents.elements.NarrativeText,
 "Letters from the final months of his life, obtained by The New York Times, show that Navalny, who'd been imprisoned since January 2021, managed to stay on top of current events — including in the US.")

(unstructured.documents.elements.NarrativeText,
 'In a letter sent to a friend, a photographer named Evgeny Feldman, Navalny said former President Donald Trump\'s agenda for a second term was "really scary," according to the Times.')

(unstructured.documents.elements.NarrativeText,
 'He said if President Joe Biden were to have a health issue, "Trump will become president," adding: "Doesn\'t this obvious thing concern the Democrats?"')

(unstructured.documents.elements.NarrativeText,
 'In another letter to Feldman dated December 3, Navalny again expressed concern over Trump and asked his friend, "Please name one current politician you admire."')

(unstructured.documents.elements.NarrativeText,
 "Trump's office didn't immediately respond to a request for comment from Business Insider.")

(unstructured.documents.elements.NarrativeText,
 'On December 6, Navalny disappeared from the IK-6 penal colony about 120 miles east of Moscow. He turned up again on Christmas Day when his lawyers announced they had located him at the IK-3 penal colony, about 1,000 miles northeast of Moscow, above the Arctic Circle.')

(unstructured.documents.elements.NarrativeText,
 "The Times reported that Navalny's communication ability from his new prison was greatly diminished.")

(unstructured.documents.elements.NarrativeText,
 "The journalist Sergei Parkhomenko said he received a letter from Navalny on February 13, a few days before Navalny's death was announced. In the letter, which Parkhomenko shared on Facebook, Navalny spoke of books and said he only had access to classics at his new prison.")

(unstructured.documents.elements.NarrativeText,
 '"Who could\'ve told me that Chekhov is the most depressing Russian writer?" he wrote.')

(unstructured.documents.elements.NarrativeText,
 "Trump, for his part, didn't mention Navalny in the days after his death, despite condemnations from other leaders who directly blamed Putin.")

(unstructured.documents.elements.NarrativeText,
 'In a Truth Social post on Monday, Trump briefly mentioned Navalny before directing his ire at his own perceived political opponents: "The sudden death of Alexei Navalny has made me more and more aware of what is happening in our Country. It is a slow, steady progression, with CROOKED, Radical Left Politicians, Prosecutors, and Judges leading us down a path to destruction."')

(unstructured.documents.elements.NarrativeText,
 'He mentioned neither Russia nor Putin.')

(unstructured.documents.elements.Title, 'Read next')

(unstructured.documents.elements.ListItem, 'Donald Trump')

(unstructured.documents.elements.ListItem, 'Russia')

(unstructured.documents.elements.Title, 'Recommended video')

(unstructured.documents.elements.NarrativeText,
 'This story is available exclusively to Business Insider subscribers. Become an Insider and start reading now. Have an account? .')

(unstructured.documents.elements.Text, 'Legal & Privacy')

(unstructured.documents.elements.ListItem, 'Terms of Service')

(unstructured.documents.elements.ListItem, 'Terms of Sale')

(unstructured.documents.elements.ListItem, 'Privacy Policy')

(unstructured.documents.elements.ListItem, 'Accessibility')

(unstructured.documents.elements.ListItem, 'Code of Ethics Policy')

(unstructured.documents.elements.ListItem, 'Reprints & Permissions')

(unstructured.documents.elements.ListItem, 'Disclaimer')

(unstructured.documents.elements.ListItem, 'Advertising Policies')

(unstructured.documents.elements.ListItem, 'Conflict of Interest Policy')

(unstructured.documents.elements.ListItem, 'Commerce Policy')

(unstructured.documents.elements.ListItem, 'Coupons Privacy Policy')

(unstructured.documents.elements.ListItem, 'Coupons Terms')

(unstructured.documents.elements.ListItem, 'Your Privacy Choices')

(unstructured.documents.elements.Text, 'Company')

(unstructured.documents.elements.ListItem, 'About Us')

(unstructured.documents.elements.ListItem, 'Careers')

(unstructured.documents.elements.ListItem, 'Advertise With Us')

(unstructured.documents.elements.ListItem, 'Contact Us')

(unstructured.documents.elements.ListItem, 'Company News')

(unstructured.documents.elements.ListItem, 'Masthead')

(unstructured.documents.elements.Text, 'Other')

(unstructured.documents.elements.ListItem, 'Sitemap')

(unstructured.documents.elements.ListItem, 'Stock quotes by finanzen.net')

(unstructured.documents.elements.Text, 'International Editions')

(unstructured.documents.elements.ListItem, 'AT')

(unstructured.documents.elements.ListItem, 'DE')

(unstructured.documents.elements.ListItem, 'ES')

(unstructured.documents.elements.ListItem, 'JP')

(unstructured.documents.elements.ListItem, 'NL')

(unstructured.documents.elements.ListItem, 'PL')

(unstructured.documents.elements.NarrativeText,
 'Copyright © 2025 Insider Inc. All rights reserved. Registration on or use of this site constitutes acceptance of our Terms of Service and Privacy Policy.')

(unstructured.documents.elements.Text, 'Jump to')

(unstructured.documents.elements.ListItem, 'Main content')

(unstructured.documents.elements.ListItem, 'Search')

(unstructured.documents.elements.ListItem, 'Account')

In [10]:
for element in elements:
  print(str(type(element)))

<class 'unstructured.documents.elements.Text'>
<class 'unstructured.documents.elements.Text'>
<class 'unstructured.documents.elements.Title'>
<class 'unstructured.documents.elements.Text'>
<class 'unstructured.documents.elements.Text'>
<class 'unstructured.documents.elements.Text'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.ListItem'>
<class 'unstructured.documents.elements.ListItem'>
<class 'unstructured.documents.elements.ListItem'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<class 'unstructured.documents.elements.NarrativeText'>
<cla

In [11]:
from unstructured.documents.elements import Title
display(*[(type(element), element.text) for element in elements if type(element) is Title])

(unstructured.documents.elements.Title,
 "In Navalny's last letters, the Russian dissident called Trump's agenda for a second term 'really scary'")

(unstructured.documents.elements.Title, 'Read next')

(unstructured.documents.elements.Title, 'Recommended video')

In [12]:
from unstructured.documents.elements import ListItem
display(*[(type(element), element.text) for element in elements if type(element) is ListItem])

(unstructured.documents.elements.ListItem,
 "Alexey Navalny, Vladimir Putin's most prominent critic, commented on US politics months before his death.")

(unstructured.documents.elements.ListItem,
 'Navalny expressed concern in letters to a friend over a potential second term for Donald Trump.')

(unstructured.documents.elements.ListItem,
 "Trump briefly mentioned Navalny's death in a Truth Social post on Monday.")

(unstructured.documents.elements.ListItem, 'Donald Trump')

(unstructured.documents.elements.ListItem, 'Russia')

(unstructured.documents.elements.ListItem, 'Terms of Service')

(unstructured.documents.elements.ListItem, 'Terms of Sale')

(unstructured.documents.elements.ListItem, 'Privacy Policy')

(unstructured.documents.elements.ListItem, 'Accessibility')

(unstructured.documents.elements.ListItem, 'Code of Ethics Policy')

(unstructured.documents.elements.ListItem, 'Reprints & Permissions')

(unstructured.documents.elements.ListItem, 'Disclaimer')

(unstructured.documents.elements.ListItem, 'Advertising Policies')

(unstructured.documents.elements.ListItem, 'Conflict of Interest Policy')

(unstructured.documents.elements.ListItem, 'Commerce Policy')

(unstructured.documents.elements.ListItem, 'Coupons Privacy Policy')

(unstructured.documents.elements.ListItem, 'Coupons Terms')

(unstructured.documents.elements.ListItem, 'Your Privacy Choices')

(unstructured.documents.elements.ListItem, 'About Us')

(unstructured.documents.elements.ListItem, 'Careers')

(unstructured.documents.elements.ListItem, 'Advertise With Us')

(unstructured.documents.elements.ListItem, 'Contact Us')

(unstructured.documents.elements.ListItem, 'Company News')

(unstructured.documents.elements.ListItem, 'Masthead')

(unstructured.documents.elements.ListItem, 'Sitemap')

(unstructured.documents.elements.ListItem, 'Stock quotes by finanzen.net')

(unstructured.documents.elements.ListItem, 'AT')

(unstructured.documents.elements.ListItem, 'DE')

(unstructured.documents.elements.ListItem, 'ES')

(unstructured.documents.elements.ListItem, 'JP')

(unstructured.documents.elements.ListItem, 'NL')

(unstructured.documents.elements.ListItem, 'PL')

(unstructured.documents.elements.ListItem, 'Main content')

(unstructured.documents.elements.ListItem, 'Search')

(unstructured.documents.elements.ListItem, 'Account')

In [13]:
from unstructured.documents.elements import Text
display(*[(type(element), element.text) for element in elements if type(element) is Text])

(unstructured.documents.elements.Text, 'Subscribe Newsletters')

(unstructured.documents.elements.Text, 'Military & Defense')

(unstructured.documents.elements.Text, 'Kelsey Vlamis')

(unstructured.documents.elements.Text, '2024-02-20T01:55:04Z')

(unstructured.documents.elements.Text,
 'Facebook Email X LinkedIn Copy Link Impact Link')

(unstructured.documents.elements.Text, 'Legal & Privacy')

(unstructured.documents.elements.Text, 'Company')

(unstructured.documents.elements.Text, 'Other')

(unstructured.documents.elements.Text, 'International Editions')

(unstructured.documents.elements.Text, 'Jump to')

In [14]:
from unstructured.documents.elements import NarrativeText
display(*[(type(element), element.text) for element in elements if type(element) is NarrativeText])

(unstructured.documents.elements.NarrativeText, 'Read in app')

(unstructured.documents.elements.NarrativeText,
 'This story is available exclusively to Business Insider subscribers. Become an Insider and start reading now. Have an account? .')

(unstructured.documents.elements.NarrativeText,
 'Alexey Navalny, a dissident and the political nemesis of Russian President Vladimir Putin, spent the past few years of his life behind bars but still managed to stay connected to the outside world.')

(unstructured.documents.elements.NarrativeText,
 "Letters from the final months of his life, obtained by The New York Times, show that Navalny, who'd been imprisoned since January 2021, managed to stay on top of current events — including in the US.")

(unstructured.documents.elements.NarrativeText,
 'In a letter sent to a friend, a photographer named Evgeny Feldman, Navalny said former President Donald Trump\'s agenda for a second term was "really scary," according to the Times.')

(unstructured.documents.elements.NarrativeText,
 'He said if President Joe Biden were to have a health issue, "Trump will become president," adding: "Doesn\'t this obvious thing concern the Democrats?"')

(unstructured.documents.elements.NarrativeText,
 'In another letter to Feldman dated December 3, Navalny again expressed concern over Trump and asked his friend, "Please name one current politician you admire."')

(unstructured.documents.elements.NarrativeText,
 "Trump's office didn't immediately respond to a request for comment from Business Insider.")

(unstructured.documents.elements.NarrativeText,
 'On December 6, Navalny disappeared from the IK-6 penal colony about 120 miles east of Moscow. He turned up again on Christmas Day when his lawyers announced they had located him at the IK-3 penal colony, about 1,000 miles northeast of Moscow, above the Arctic Circle.')

(unstructured.documents.elements.NarrativeText,
 "The Times reported that Navalny's communication ability from his new prison was greatly diminished.")

(unstructured.documents.elements.NarrativeText,
 "The journalist Sergei Parkhomenko said he received a letter from Navalny on February 13, a few days before Navalny's death was announced. In the letter, which Parkhomenko shared on Facebook, Navalny spoke of books and said he only had access to classics at his new prison.")

(unstructured.documents.elements.NarrativeText,
 '"Who could\'ve told me that Chekhov is the most depressing Russian writer?" he wrote.')

(unstructured.documents.elements.NarrativeText,
 "Trump, for his part, didn't mention Navalny in the days after his death, despite condemnations from other leaders who directly blamed Putin.")

(unstructured.documents.elements.NarrativeText,
 'In a Truth Social post on Monday, Trump briefly mentioned Navalny before directing his ire at his own perceived political opponents: "The sudden death of Alexei Navalny has made me more and more aware of what is happening in our Country. It is a slow, steady progression, with CROOKED, Radical Left Politicians, Prosecutors, and Judges leading us down a path to destruction."')

(unstructured.documents.elements.NarrativeText,
 'He mentioned neither Russia nor Putin.')

(unstructured.documents.elements.NarrativeText,
 'This story is available exclusively to Business Insider subscribers. Become an Insider and start reading now. Have an account? .')

(unstructured.documents.elements.NarrativeText,
 'Copyright © 2025 Insider Inc. All rights reserved. Registration on or use of this site constitutes acceptance of our Terms of Service and Privacy Policy.')
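
The output above mixes page chrome with the article body: 'Read in app' is tagged as NarrativeText, and the paywall notice appears twice. One possible cleanup before embedding, sketched here in plain Python on (category, text) pairs rather than on real element objects, is to collapse whitespace, drop very short texts, and drop exact duplicates. The `clean_texts` helper and the `min_words` threshold are assumptions for illustration and are not part of Unstructured.

```python
import re


def clean_texts(pairs, min_words=5):
    """Normalize whitespace, then drop short and duplicate texts (order preserved)."""
    seen = set()
    kept = []
    for category, text in pairs:
        text = re.sub(r"\s+", " ", text).strip()
        if len(text.split()) < min_words:
            continue  # likely a button label or menu entry
        if text in seen:
            continue  # exact duplicate, e.g. a repeated paywall notice
        seen.add(text)
        kept.append((category, text))
    return kept


sample = [
    ("NarrativeText", "Read in app"),
    ("NarrativeText", "This story is available exclusively to Business Insider subscribers."),
    ("NarrativeText", "He mentioned   neither Russia nor Putin."),
    ("NarrativeText", "This story is available exclusively to Business Insider subscribers."),
]

cleaned = clean_texts(sample)
for category, text in cleaned:
    print(category, "|", text)
```

This keeps the first copy of the paywall notice; removing it entirely would need a content-specific rule.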

## Embeddings

https://docs.unstructured.io/open-source/core-functionality/embedding

In [15]:
# for element in elements:
#   print(element)

In [16]:
# #from unstructured.documents.elements import Text
# #from sentence_transformers import SentenceTransformer

# embeddings = []

# from langchain_huggingface.embeddings import HuggingFaceEmbeddings

# model = HuggingFaceEmbeddings(
#         model_name="sentence-transformers/all-MiniLM-L6-v2"
#     )

# # 1. Load a pretrained Sentence Transformer model
# #embeddings = SentenceTransformer("all-MiniLM-L6-v2")

# # Process each partitioned element.
# for element in elements:
#     # Get the element's text.
#     text = element.text
#     # Generate the embedding for that text.
#     query_result = model.embed_query(text)
#     # Store the embedding on the element's `embeddings` attribute.
#     element.embeddings = query_result

# # # Print embeddings
# # [print(e.embeddings, e) for e in elements]
# # print(query_embedding, query)
# # print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())

In [17]:
# from google.colab import userdata
# open_ai_api_key = userdata.get('open_ai_api_key')

# from unstructured.documents.elements import Text
# from unstructured.embed.openai import OpenAIEmbeddingConfig, OpenAIEmbeddingEncoder

# # Initialize the encoder with OpenAI credentials
# #embedding_encoder = OpenAIEmbeddingEncoder(api_key=open_ai_api_key)

# embedding_encoder = OpenAIEmbeddingEncoder(config=OpenAIEmbeddingConfig(api_key=open_ai_api_key))

# #embedding_encoder = OpenAIEmbeddingEncoder(config=[open_ai_api_key,"text-embedding-ada-002"] )

# # Embed a list of Elements
# elements = embedding_encoder.embed_documents(
#     elements=[Text("This is sentence 1"), Text("This is sentence 2")],
# )

# # Embed a single query string
# query = "This is the query"
# query_embedding = embedding_encoder.embed_query(query=query)

# # Print embeddings
# [print(e.embeddings, e) for e in elements]
# print(query_embedding, query)
# print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())
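
The commented-out cell above prints `is_unit_vector()` and `num_of_dimensions()`. As a rough illustration of why unit length matters, the sketch below uses made-up three-dimensional vectors instead of real model output. It normalizes them and shows that, for unit vectors, cosine similarity reduces to a dot product. The helper names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


doc_vec = [3.0, 4.0, 0.0]
query_vec = [6.0, 8.0, 0.0]   # same direction as doc_vec
other_vec = [0.0, 0.0, 2.0]   # orthogonal to doc_vec

unit_doc = normalize(doc_vec)
unit_query = normalize(query_vec)

print(len(unit_doc))  # number of dimensions
print(math.isclose(dot(unit_doc, unit_doc), 1.0))  # unit length check
print(math.isclose(cosine_similarity(doc_vec, query_vec), dot(unit_doc, unit_query)))
print(cosine_similarity(doc_vec, other_vec))
```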

## CNN HTML Links, Vector Databases, Automatic Summarization via LangChain + Gemini

In the following section we:

1. Get the links for the latest articles from an HTML file (Option 1) or directly from the CNN website (Option 2).
2. Use the Unstructured document loader in LangChain to load the files.
3. Create embeddings for each file using the all-MiniLM-L6-v2 sentence-transformers model (an OpenAIEmbeddings variant is left commented out).
4. Store the embeddings in Chroma DB.
5. Query Chroma DB to return relevant articles.
6. Summarize the relevant articles using the LangChain Gemini integration (a ChatOpenAI variant is left commented out).

https://unstructured-io.github.io/unstructured/examples/chroma.html

## Option 1: Import from HTML file stored on Github

In [18]:
# from unstructured.partition.html import partition_html
# import requests
# from google.colab import userdata
# open_ai_api_key = userdata.get('open_ai_api_key')

# url = "https://github.com/StatsAI/NLP/blob/main/Breaking%20News%2C%20Latest%20News%20and%20Videos%20_%20CNN.html"

# try:
#     response = requests.get(url, allow_redirects=True)
#     response.raise_for_status()  # Raise error if download fails
# except requests.exceptions.RequestException as e:
#     print(f"Error downloading HTML: {e}")
#     exit(1)

# # Save the content to a file
# with open("downloaded_html.html", "wb") as f:
#     f.write(response.content)

# print(f"HTML downloaded successfully to downloaded_html.html")

# elements = partition_html(filename='downloaded_html.html')
# elements = elements[3].links

# links = []
# cnn_lite_url = "https://lite.cnn.com/"

# for element in elements:
#   try:
#     if element["url"][3:-2]:
#       relative_link = element["url"][3:-2]
#       links.append(f"{cnn_lite_url}{relative_link}")
#   except IndexError:
#     # Handle the case where the "url" key doesn't exist or the index is out of range
#     continue

# links

## Option 2: Import from URL

In [19]:
# elements = partition_html(url=cnn_lite_url)

# for element in elements:
#   print(element.metadata.link_urls)
#   #print(dir(element.metadata))
#   #print(dir(element))
#   #print(element.links)

In [20]:
from unstructured.partition.html import partition_html
from google.colab import userdata
#open_ai_api_key = userdata.get('open_ai_api_key')

cnn_lite_url = "https://lite.cnn.com/"

elements = partition_html(url=cnn_lite_url)
links = []

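for element in elements:
  # link_urls is None (or empty) for elements that contain no hyperlinks
  if element.metadata.link_urls:
    # Keep only the first link of each element; it is a site-relative path
    relative_link = element.metadata.link_urls[0]
    # cnn_lite_url ends with "/" and the relative path starts with "/", hence the "//" in the output
    links.append(f"{cnn_lite_url}{relative_link}")

# Drop the first entry and the last two (presumably non-article links)
links = links[1:-2]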
links

['https://lite.cnn.com//2025/02/23/media/msnbc-cancels-joy-reid-primetime/index.html',
 'https://lite.cnn.com//2025/02/23/europe/pope-francis-hospital-kidney-intl-latam/index.html',
 'https://lite.cnn.com//2025/02/23/us/immigration-enforcement-operation-california/index.html',
 'https://lite.cnn.com//2025/02/23/us/pennsylvania-upmc-memorial-hospital-icu-hostage-hnk/index.html',
 'https://lite.cnn.com//2025/02/23/economy/surging-egg-prices-bakeries/index.html',
 'https://lite.cnn.com//2025/02/23/europe/ukraine-zelensky-resign-nato-intl/index.html',
 'https://lite.cnn.com//2025/02/21/health/listeria-supplemental-shakes/index.html',
 'https://lite.cnn.com//2025/02/23/politics/kathy-hochul-new-york-trump/index.html',
 'https://lite.cnn.com//2025/02/22/politics/elon-musk-employees-emails/index.html',
 'https://lite.cnn.com//2025/02/23/sport/juan-soto-new-york-mets-spring-training-spt-intl/index.html',
 'https://lite.cnn.com//2025/02/23/middleeast/nasrallah-funeral-lebanon-beirut-israel-intl
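
The links above contain a double slash after the host because `cnn_lite_url` ends in `/` and each relative link starts with `/`. The pages still loaded in this run, but if cleaner URLs are wanted, `urllib.parse.urljoin` from the standard library is one option. The relative paths below are inferred from the output above, and nothing is fetched.

```python
from urllib.parse import urljoin

cnn_lite_url = "https://lite.cnn.com/"
relative_links = [
    "/2025/02/23/media/msnbc-cancels-joy-reid-primetime/index.html",
    "/2025/02/23/economy/surging-egg-prices-bakeries/index.html",
]

# urljoin resolves each path against the base, so no "//" appears after the host
links = [urljoin(cnn_lite_url, rel) for rel in relative_links]
for link in links:
    print(link)
```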

In [21]:
from langchain_community.document_loaders import UnstructuredURLLoader

loaders = UnstructuredURLLoader(urls=links, show_progress_bar=True)

docs = loaders.load()

100%|██████████| 100/100 [00:43<00:00,  2.29it/s]


In [22]:
# import chromadb.utils.embedding_functions as embedding_functions
# huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
#     api_key="",
#     model_name="sentence-transformers/all-MiniLM-L6-v2"
# )

In [23]:
type(docs)

list

In [24]:
docs[0]

Document(metadata={'source': 'https://lite.cnn.com//2025/02/23/media/msnbc-cancels-joy-reid-primetime/index.html'}, page_content='CNN 2/23/2025\n\nBusiness / Media\n\nMSNBC cancels Joy Reid’s evening show as part of a major programming shakeup\n\nBy Auzinea Bacon, CNN\n\nUpdated: 1:33 PM EST, Sun February 23, 2025\n\nSource: CNN\n\nJoy Reid will host her final evening news show with MSNBC this week as part of a slate of programming changes by Rebecca Kutler, the network’s new president, sources familiar with the matter told CNN.\n\nAccording to sources, Kutler plans to replace “The ReidOut,” which has aired at 7 p.m. ET since 2020, with co-hosts from “The Weekend” — Symone Sanders-Townsend, Michael Steele and Alicia Menendez.\n\n“The Weekend,” a two-hour show airing on Saturday and Sunday mornings at 8 a.m. ET, improved total viewership during its time slot by 35%.\n\nBefore “TheReidOut,” Reid hosted the MSNBC weekend talk show “AM Joy” from 2016 to 2020.\n\nDespite the change in progr

In [25]:
#from langchain.vectorstores.chroma import Chroma
#from langchain.embeddings import OpenAIEmbeddings
#from google.colab import userdata
#open_ai_api_key = userdata.get('open_ai_api_key')

#embeddings = OpenAIEmbeddings(openai_api_key = open_ai_api_key)
#vectorstore = Chroma.from_documents(docs, embeddings)

In [26]:
import chromadb
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

# Local sentence-transformers model; no API key is required
embedding_function = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

chroma_client = chromadb.Client()
#chroma_client.delete_collection("langchain")
#chroma_client.delete_collection("your_collection_name")

# Pass the client explicitly so the collection is created on chroma_client
vectorstore = Chroma.from_documents(docs, embedding_function, collection_name="cnn_doc_embeddings", client=chroma_client)


In [27]:
chroma_client.list_collections()

['cnn_doc_embeddings']

In [28]:
query_docs = vectorstore.similarity_search("Trump", k=5)
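
Conceptually, `similarity_search("Trump", k=5)` embeds the query and returns the `k` stored documents whose vectors are closest to it. The toy below imitates that with word-count vectors and cosine similarity in plain Python. It is not how Chroma or the MiniLM model work internally, and `embed`, `top_k`, and the sample texts are made up for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a real model returns dense vectors)."""
    return Counter(text.lower().split())


def cosine(a, b):
    shared = set(a) & set(b)
    num = sum(a[w] * b[w] for w in shared)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0


def top_k(query, docs, k):
    """Return the k documents most similar to the query, best first."""
    q = embed(query)
    scored = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return scored[:k]


docs = [
    "egg prices surge at bakeries",
    "trump meets governor in new york",
    "mets spring training begins",
    "trump trump tariffs",
]

results = top_k("trump", docs, k=2)
print(results)
```

Asking for more results than there are documents simply returns all of them, ranked.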

In [32]:
from google.colab import userdata
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.chains.summarize import load_summarize_chain

# Read the Gemini API key from Colab secrets
gem_api_key = userdata.get('gemini_api_secret_name')

llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash",
                             temperature=0.7,
                             max_tokens=None,
                             timeout=None,
                             max_retries=2,
                             google_api_key=gem_api_key)

chain = load_summarize_chain(llm, chain_type="stuff")
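
With `chain_type="stuff"`, the chain places the full text of every input document into a single prompt and makes one model call, so the combined text has to fit in the model's context window. The sketch below shows only that prompt-building idea in plain Python. The template wording and the `build_stuff_prompt` helper are assumptions, not LangChain's actual prompt.

```python
def build_stuff_prompt(texts, separator="\n\n"):
    """Concatenate all document texts into one summarization prompt."""
    combined = separator.join(texts)
    return f'Write a concise summary of the following:\n\n"{combined}"\n\nSummary:'


texts = [
    "First article body.",
    "Second article body.",
]

prompt = build_stuff_prompt(texts)
print(prompt)
```

Passing one document at a time, as the summarization loop in this notebook does, keeps each prompt to a single article.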

In [None]:
# from langchain.chat_models import ChatOpenAI
# from langchain.chains.summarize import load_summarize_chain

# llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k", openai_api_key=open_ai_api_key)
# chain = load_summarize_chain(llm, chain_type="stuff")

In [33]:
for doc in query_docs:
  source = doc.metadata
  result = chain.invoke([doc])
  print(result['output_text'])
  print(source)
  print('')

Despite President Trump's renewed alliance with Big Tech, his MAGA base at CPAC remains skeptical, fueled by past censorship and deplatforming. While tech companies and executives have sought to repair relations through donations and policy changes, many conservatives still harbor resentment. Elon Musk is an exception, enjoying popularity for his anti-government spending stance. Some Republican lawmakers are also threatening to remove legal protections for tech companies. This skepticism creates opportunities for conservative-aligned tech alternatives, but Trump's embrace of established companies complicates their market position. Key figures like Steve Bannon warn against trusting Big Tech, viewing their alignment as opportunistic rather than genuine support for MAGA.
{'source': 'https://lite.cnn.com//2025/02/22/politics/cpac-tech-trump-musk/index.html'}

In the first month of his second term, President Trump is dismantling the global system built by the US over the last 80 years, cau

In [35]:
#chain.invoke(query_docs)