<a href="https://colab.research.google.com/github/StatsAI/NLP/blob/main/Unstructured_API_Test_2_19_24.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [24]:
!pip install "unstructured[all-docs]"
!pip install langchain
!pip install chromadb


# Unstructured API - Key Features:

1. Precise Document Extraction: Unstructured offers advanced capabilities in extracting elements and metadata from documents. This includes a variety of document element types and metadata.

2. Extensive File Support: The platform supports a wide range of file types, including PDF, images, HTML, and many more.

https://unstructured-io.github.io/unstructured/introduction.html


# The Unstructured Core API consists of the following components:

1. Partitioning: The partitioning functions in Unstructured extract structured content from raw, unstructured documents. These functions break a document down into elements such as Title, NarrativeText, and ListItem, enabling users to decide what content they'd like to keep for their particular application. If you're training a summarization model, for example, you may only be interested in NarrativeText.

2. Cleaning: Data preparation for NLP models often requires cleaning to ensure quality. The Unstructured library includes cleaning functions that assist in sanitizing output, removing unwanted content, and improving the performance of NLP models. This step is essential for maintaining the integrity of data before it is passed to downstream applications.

3. Extracting: This functionality allows for the extraction of specific entities within documents. It is designed to identify and isolate relevant pieces of information, making it easier for users to focus on the most pertinent data in their documents.

4. Staging: Staging functions help prepare your data for ingestion into downstream systems. Please note that this functionality is being deprecated in favor of Destination Connectors.

5. Chunking: The chunking process in Unstructured is distinct from conventional methods. Instead of relying solely on text-based features to form chunks, Unstructured uses a deep understanding of document formats to partition documents into semantic units (document elements).

6. Embedding: The embedding encoder classes in Unstructured leverage document elements detected through partitioning or grouped via chunking to obtain embeddings for each element. This is particularly useful for applications like Retrieval Augmented Generation (RAG), where precise and contextually relevant embeddings are crucial.
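
To make the partitioning and chunking ideas concrete, here is a toy sketch in plain Python. It does not call the Unstructured library: the `Element` dataclass and the `chunk_by_title_sketch` function are hypothetical stand-ins that only mimic the idea of typed elements being grouped into sections that begin at each Title.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    category: str  # e.g. "Title", "NarrativeText", "ListItem"
    text: str


def chunk_by_title_sketch(elements):
    """Group elements into chunks, starting a new chunk at every Title."""
    chunks = []
    current = []
    for element in elements:
        # A Title closes the chunk in progress (if any) and opens a new one.
        if element.category == "Title" and current:
            chunks.append(current)
            current = []
        current.append(element)
    if current:
        chunks.append(current)
    return chunks


doc = [
    Element("Title", "Introduction"),
    Element("NarrativeText", "Deep learning drives document analysis."),
    Element("Title", "Components"),
    Element("ListItem", "1. An off-the-shelf toolkit"),
    Element("ListItem", "2. A model zoo"),
]

chunks = chunk_by_title_sketch(doc)
chunk_texts = ["\n".join(e.text for e in chunk) for chunk in chunks]

# Keep only narrative text, as one might when building a summarization dataset.
narrative_only = [e.text for e in doc if e.category == "NarrativeText"]
```

Because each chunk follows the document's own section boundaries, a downstream embedding step would see one coherent section per chunk rather than an arbitrary window of characters.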

## PDF - Extraction
https://unstructured-io.github.io/unstructured/core/partition.html


In [2]:
from unstructured.partition.auto import partition

link = 'https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/layout-parser-paper-fast.pdf'

url = 'https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper-fast.pdf'

# Use if loading from a URL
elements = partition(url=url)

# Use if loading from a local file
#elements = partition("/content/example-docs/layout-parser-paper-fast.pdf")

elements

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[<unstructured.documents.elements.Text at 0x7b6f075d5db0>,
 <unstructured.documents.elements.Title at 0x7b6f075d5cc0>,
 <unstructured.documents.elements.Text at 0x7b6f075d6650>,
 <unstructured.documents.elements.Text at 0x7b6f075d6620>,
 <unstructured.documents.elements.Title at 0x7b6f075d5540>,
 <unstructured.documents.elements.Text at 0x7b6f075d6950>,
 <unstructured.documents.elements.Title at 0x7b6f075d6980>,
 <unstructured.documents.elements.Text at 0x7b6f075d6a40>,
 <unstructured.documents.elements.Text at 0x7b6f075d6b30>,
 <unstructured.documents.elements.NarrativeText at 0x7b6f075d6cb0>,
 <unstructured.documents.elements.Text at 0x7b6f075d6ad0>,
 <unstructured.documents.elements.Text at 0x7b6f075d6dd0>,
 <unstructured.documents.elements.Title at 0x7b6f075d7010>,
 <unstructured.documents.elements.NarrativeText at 0x7b6f075d6ef0>,
 <unstructured.documents.elements.Text at 0x7b6f075d4670>,
 <unstructured.documents.elements.NarrativeText at 0x7b6f075d4790>,
 <unstructured.documents.

In [3]:
from collections import Counter

display(Counter(type(element) for element in elements))
print("")

Counter({unstructured.documents.elements.Text: 9,
         unstructured.documents.elements.Title: 4,
         unstructured.documents.elements.NarrativeText: 8,
         unstructured.documents.elements.ListItem: 4})




In [4]:
display(*[(type(element), element.text) for element in elements])

(unstructured.documents.elements.Text, '1 2 0 2')

(unstructured.documents.elements.Title, 'n u J')

(unstructured.documents.elements.Text, '1 2')

(unstructured.documents.elements.Text, ']')

(unstructured.documents.elements.Title, 'V C . s c [')

(unstructured.documents.elements.Text, '2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a')

(unstructured.documents.elements.Title,
 'LayoutParser: A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis')

(unstructured.documents.elements.Text,
 'Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5')

(unstructured.documents.elements.Text,
 '1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca')

(unstructured.documents.elements.NarrativeText,
 'Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model conﬁgurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going eﬀorts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core La

(unstructured.documents.elements.Text,
 'Keywords: Document Image Analysis · Deep Learning · Layout Analysis · Character Recognition · Open Source library · Toolkit.')

(unstructured.documents.elements.Text, '1')

(unstructured.documents.elements.Title, 'Introduction')

(unstructured.documents.elements.NarrativeText,
 'Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks including document image classiﬁcation [11,')

(unstructured.documents.elements.Text, '2')

(unstructured.documents.elements.NarrativeText, 'Z. Shen et al.')

(unstructured.documents.elements.NarrativeText,
 '37], layout detection [38, 22], table detection [26], and scene text detection [4]. A generalized learning-based framework dramatically reduces the need for the manual speciﬁcation of complicated rules, which is the status quo with traditional methods. DL has the potential to transform DIA pipelines and beneﬁt a broad spectrum of large-scale document digitization projects.')

(unstructured.documents.elements.NarrativeText,
 'However, there are several practical diﬃculties for taking advantages of re- cent advances in DL-based methods: 1) DL models are notoriously convoluted for reuse and extension. Existing models are developed using distinct frame- works like TensorFlow [1] or PyTorch [24], and the high-level parameters can be obfuscated by implementation details [8]. It can be a time-consuming and frustrating experience to debug, reproduce, and adapt existing models for DIA, and many researchers who would beneﬁt the most from using these methods lack the technical background to implement them from scratch. 2) Document images contain diverse and disparate patterns across domains, and customized training is often required to achieve a desirable detection accuracy. Currently there is no full-ﬂedged infrastructure for easily curating the target document image datasets and ﬁne-tuning or re-training the models. 3) DIA usually requires a sequence of models and o

(unstructured.documents.elements.NarrativeText,
 'LayoutParser provides a uniﬁed toolkit to support DL-based document image analysis and processing. To address the aforementioned challenges, LayoutParser is built with the following components:')

(unstructured.documents.elements.ListItem,
 '1. An oﬀ-the-shelf toolkit for applying DL models for layout detection, character recognition, and other DIA tasks (Section 3)')

(unstructured.documents.elements.ListItem,
 '2. A rich repository of pre-trained neural network models (Model Zoo) that underlies the oﬀ-the-shelf usage')

(unstructured.documents.elements.ListItem,
 '3. Comprehensive tools for eﬃcient document image data annotation and model tuning to support diﬀerent levels of customization')

(unstructured.documents.elements.ListItem,
 '4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)')

(unstructured.documents.elements.NarrativeText,
 'The library implements simple and intuitive Python APIs without sacriﬁcing generalizability and versatility, and can be easily installed via pip. Its convenient functions for handling document image data can be seamlessly integrated with existing DIA pipelines. With detailed documentations and carefully curated tutorials, we hope this tool will beneﬁt a variety of end-users, and will lead to advances in applications in both industry and academic research.')

(unstructured.documents.elements.NarrativeText,
 'LayoutParser is well aligned with recent eﬀorts for improving DL model reusability in other disciplines like natural language processing [8, 34] and com- puter vision [35], but with a focus on unique challenges in DIA. We show LayoutParser can be applied in sophisticated and large-scale digitization projects')
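
The first few elements in the output above (`'1 2 0 2'`, `'n u J'`, `'V C . s c ['`, and so on) look like the arXiv stamp printed vertically in the page margin, extracted one character at a time in reverse order. The sketch below shows one possible way to spot and repair such fragments in plain Python. The regular expression and both helper names are assumptions made for illustration, not part of the Unstructured library, and the heuristic deliberately skips very short fragments such as `'1 2'`.

```python
import re

# Heuristic (an assumption, not library behaviour): a fragment made only of
# single characters separated by single spaces, with at least three characters.
SPACED_CHARS = re.compile(r"^\S(?: \S){2,}$")


def is_spaced_fragment(text):
    """Return True if the text looks like space-separated single characters."""
    return bool(SPACED_CHARS.match(text))


def unrotate(text):
    """Drop the separating spaces and reverse the character order."""
    return text.replace(" ", "")[::-1]


fragments = [
    "1 2 0 2",
    "n u J",
    "1 2",
    "]",
    "V C . s c [",
    "2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a",
    "Introduction",
]

repaired = [unrotate(t) for t in fragments if is_spaced_fragment(t)]
print(repaired)
```

In a real pipeline it may be simpler to drop these fragments altogether, since they carry page furniture rather than document content.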

## HTML - Extraction

In [5]:
url = 'https://www.businessinsider.com/navalny-death-letters-trump-second-term-agenda-really-scary-2024-2'

from unstructured.partition.auto import partition
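# Note: the documented chunking strategy name is "by_title". The "by-title"
# spelling used here does not match it, and the output below contains
# unchunked elements.
elements = partition(url=url, strategy='hi_res', html_assemble_articles=True,
                     chunking_strategy="by-title", multipage_sections=True)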
elements

[<unstructured.documents.html.HTMLTitle at 0x7b6f04f7d810>,
 <unstructured.documents.html.HTMLTitle at 0x7b6f04f7d8a0>,
 <unstructured.documents.html.HTMLTitle at 0x7b6f04f7da80>,
 <unstructured.documents.html.HTMLText at 0x7b6f04f7c040>,
 <unstructured.documents.html.HTMLTitle at 0x7b6f04f7d720>,
 <unstructured.documents.html.HTMLText at 0x7b6f04f7d330>,
 <unstructured.documents.html.HTMLNarrativeText at 0x7b6f04f7c5b0>,
 <unstructured.documents.html.HTMLNarrativeText at 0x7b6f04f7e1d0>,
 <unstructured.documents.html.HTMLText at 0x7b6f04f7e0e0>,
 <unstructured.documents.html.HTMLNarrativeText at 0x7b6f04f7dde0>,
 <unstructured.documents.html.HTMLTitle at 0x7b6f04f7d6c0>,
 <unstructured.documents.html.HTMLNarrativeText at 0x7b6f04f7db10>,
 <unstructured.documents.html.HTMLNarrativeText at 0x7b6f04f7d5d0>,
 <unstructured.documents.html.HTMLNarrativeText at 0x7b6f04f7c640>,
 <unstructured.documents.html.HTMLNarrativeText at 0x7b6f04f7dae0>,
 <unstructured.documents.html.HTMLListItem at 0

In [6]:
from collections import Counter

display(Counter(type(element) for element in elements))
print("")

Counter({unstructured.documents.html.HTMLTitle: 17,
         unstructured.documents.html.HTMLText: 3,
         unstructured.documents.html.HTMLNarrativeText: 28,
         unstructured.documents.html.HTMLListItem: 3})




In [7]:
display(*[(type(element), element.text) for element in elements])

(unstructured.documents.html.HTMLTitle, 'Military & Defense')

(unstructured.documents.html.HTMLTitle,
 "In Navalny's last letters, the Russian dissident called Trump's agenda for a second term 'really scary'")

(unstructured.documents.html.HTMLTitle, 'Kelsey Vlamis')

(unstructured.documents.html.HTMLText, '2024-02-20T01:55:04Z')

(unstructured.documents.html.HTMLTitle, 'Share')

(unstructured.documents.html.HTMLText,
 'Facebook Icon\n                                            The letter F.\n                                          \n                                            \n                                          \n                                            Facebook')

(unstructured.documents.html.HTMLNarrativeText,
 'Email icon\n                                            An envelope. It indicates the ability to send an email.\n                                          \n                                            \n                                          \n                                            Email')

(unstructured.documents.html.HTMLNarrativeText,
 'Twitter icon\n                                            A stylized bird with an open mouth, tweeting.\n                                          \n                                            \n                                          \n                                            Twitter')

(unstructured.documents.html.HTMLText,
 'LinkedIn icon\n                                          \n                                            \n                                          \n                                            LinkedIn')

(unstructured.documents.html.HTMLNarrativeText,
 'Link icon\n                                            An image of a chain link. It symobilizes a website link url.\n                                          \n                                            \n                                          \n                                            Copy Link')

(unstructured.documents.html.HTMLTitle, 'Save')

(unstructured.documents.html.HTMLNarrativeText, 'Read in app')

(unstructured.documents.html.HTMLNarrativeText,
 'Angle down icon\n                                      An icon in the shape of an angle pointing down.\n                                      \n                                              \n                              \n                              \n                                The Russian opposition leader Alexey Navalny attending an opposition march in memory of the murdered Kremlin critic Boris Nemtsov in central Moscow.\n                              \n                        \n                                \n                                  VASILY MAXIMOV/AFP/Getty Images')

(unstructured.documents.html.HTMLNarrativeText,
 'This story is available exclusively to Business Insider\n                      subscribers.\n                      Become an Insider\n                      and start reading now.')

(unstructured.documents.html.HTMLNarrativeText, 'Have an account? Log in.')

(unstructured.documents.html.HTMLListItem,
 "Alexey Navalny, Vladimir Putin's most prominent critic, commented on US politics months before his death.")

(unstructured.documents.html.HTMLListItem,
 'Navalny expressed concern\xa0in letters to a friend over a potential second term for Donald Trump.')

(unstructured.documents.html.HTMLListItem,
 "Trump briefly mentioned Navalny's death in a Truth Social post on Monday.")

(unstructured.documents.html.HTMLTitle, 'NEW LOOK')

(unstructured.documents.html.HTMLNarrativeText,
 '\n                                    Sign up to get the inside scoop on today’s biggest stories in markets, tech, and business — delivered daily. ')

(unstructured.documents.html.HTMLTitle, 'Read preview')

(unstructured.documents.html.HTMLNarrativeText, 'Thanks for signing up!')

(unstructured.documents.html.HTMLNarrativeText,
 "\n                              Access your favorite topics in a personalized feed while you're on the go.\n                              ")

(unstructured.documents.html.HTMLNarrativeText,
 '\n                                  By clicking “Sign Up”, you accept our ')

(unstructured.documents.html.HTMLTitle, 'Terms of Service and')

(unstructured.documents.html.HTMLNarrativeText,
 'Privacy Policy. You can opt-out at any time.')

(unstructured.documents.html.HTMLTitle, 'Advertisement')

(unstructured.documents.html.HTMLNarrativeText,
 'Alexey Navalny, a dissident and the political nemesis of Russian President Vladimir Putin, spent the past few years of his life behind bars but still managed to stay connected to the outside world.')

(unstructured.documents.html.HTMLNarrativeText,
 "Letters from the final months of his life, obtained by The New York Times, show that Navalny, who'd been imprisoned since January 2021, managed to stay on top of current events — including in the US.")

(unstructured.documents.html.HTMLNarrativeText,
 'This story is available exclusively to Business Insider\n                            subscribers.\n                            Become an Insider\n                            and start reading now.\n                          Have an account? Log in.')

(unstructured.documents.html.HTMLNarrativeText,
 'In a letter sent to a friend, a photographer named Evgeny Feldman, Navalny said former President Donald Trump\'s agenda for a second term was "really scary," according to the Times.')

(unstructured.documents.html.HTMLNarrativeText,
 'He said if President Joe Biden were to have a health issue, "Trump will become president," adding: "Doesn\'t this obvious thing concern the Democrats?"')

(unstructured.documents.html.HTMLTitle, 'Advertisement')

(unstructured.documents.html.HTMLNarrativeText,
 'In another letter to Feldman dated December 3, Navalny again expressed concern over Trump and asked his friend, "Please name one current politician you admire."')

(unstructured.documents.html.HTMLNarrativeText,
 "Trump's office didn't immediately respond to a request for comment from Business Insider.")

(unstructured.documents.html.HTMLNarrativeText,
 'On December 6, Navalny disappeared from the IK-6 penal colony about 120 miles east of Moscow. He turned up again on Christmas Day when his lawyers announced they had located him at the IK-3 penal colony, about 1,000 miles northeast of Moscow, above the Arctic Circle.')

(unstructured.documents.html.HTMLNarrativeText,
 "The Times reported that Navalny's communication ability from his new prison was greatly diminished.")

(unstructured.documents.html.HTMLTitle, 'Advertisement')

(unstructured.documents.html.HTMLNarrativeText,
 "The journalist Sergei Parkhomenko said he received a letter from Navalny on February 13, a few days before Navalny's death was announced. In the letter, which Parkhomenko shared on Facebook, Navalny spoke of books and said he only had access to classics at his new prison.")

(unstructured.documents.html.HTMLNarrativeText,
 '"Who could\'ve told me that Chekhov is the most depressing Russian writer?" he wrote.')

(unstructured.documents.html.HTMLNarrativeText,
 "Trump, for his part, didn't mention Navalny in the days after his death, despite condemnations from other leaders who directly blamed Putin.")

(unstructured.documents.html.HTMLNarrativeText,
 'In a Truth Social post on Monday, Trump briefly mentioned Navalny before directing his ire at his own perceived political opponents: "The sudden death of Alexei Navalny has made me more and more aware of what is happening in our Country. It is a slow, steady progression, with CROOKED, Radical Left Politicians, Prosecutors, and Judges leading us down a path to destruction."')

(unstructured.documents.html.HTMLTitle, 'Advertisement')

(unstructured.documents.html.HTMLNarrativeText,
 'He mentioned neither Russia nor Putin.')

(unstructured.documents.html.HTMLNarrativeText,
 'Sign up for notifications from Insider! Stay up to date with what you want to know.')

(unstructured.documents.html.HTMLNarrativeText,
 'Subscribe to push notifications')

(unstructured.documents.html.HTMLTitle, 'Read next')

(unstructured.documents.html.HTMLTitle,
 "Watch: Here's what to know about Russian opposition leader Alexei Navalny —Putin’s biggest critic")

(unstructured.documents.html.HTMLTitle, 'Donald Trump')

(unstructured.documents.html.HTMLTitle, 'Russia')

(unstructured.documents.html.HTMLTitle, 'Advertisement')
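
Many of the HTML elements above carry long runs of spaces, embedded newlines, and non-breaking spaces (`\xa0`) left over from the page markup. This is the kind of problem the Cleaning component targets; the library's `unstructured.cleaners.core` module offers helpers such as `clean_extra_whitespace` for it. The sketch below is a plain-Python approximation of that idea, with `normalize_whitespace` being a hypothetical helper name rather than a library function.

```python
import re


def normalize_whitespace(text):
    """Replace non-breaking spaces and collapse whitespace runs to one space."""
    text = text.replace("\xa0", " ")
    return re.sub(r"\s+", " ", text).strip()


raw = (
    "This story is available exclusively to Business Insider\n"
    "                      subscribers.\n"
    "                      Become an Insider\n"
    "                      and start reading now."
)
cleaned = normalize_whitespace(raw)
print(cleaned)

list_item = "Navalny expressed concern\xa0in letters to a friend"
cleaned_item = normalize_whitespace(list_item)
```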

In [8]:
display(*[(type(element), element.text) for element in elements if type(element).__name__ == "HTMLListItem"])

(unstructured.documents.html.HTMLListItem,
 "Alexey Navalny, Vladimir Putin's most prominent critic, commented on US politics months before his death.")

(unstructured.documents.html.HTMLListItem,
 'Navalny expressed concern\xa0in letters to a friend over a potential second term for Donald Trump.')

(unstructured.documents.html.HTMLListItem,
 "Trump briefly mentioned Navalny's death in a Truth Social post on Monday.")

In [9]:
display(*[(type(element), element.text) for element in elements if type(element).__name__ == "HTMLText"])

(unstructured.documents.html.HTMLText, '2024-02-20T01:55:04Z')

(unstructured.documents.html.HTMLText,
 'Facebook Icon\n                                            The letter F.\n                                          \n                                            \n                                          \n                                            Facebook')

(unstructured.documents.html.HTMLText,
 'LinkedIn icon\n                                          \n                                            \n                                          \n                                            LinkedIn')

In [10]:
display(*[(type(element), element.text) for element in elements if type(element).__name__ == "HTMLTitle"])

(unstructured.documents.html.HTMLTitle, 'Military & Defense')

(unstructured.documents.html.HTMLTitle,
 "In Navalny's last letters, the Russian dissident called Trump's agenda for a second term 'really scary'")

(unstructured.documents.html.HTMLTitle, 'Kelsey Vlamis')

(unstructured.documents.html.HTMLTitle, 'Share')

(unstructured.documents.html.HTMLTitle, 'Save')

(unstructured.documents.html.HTMLTitle, 'NEW LOOK')

(unstructured.documents.html.HTMLTitle, 'Read preview')

(unstructured.documents.html.HTMLTitle, 'Terms of Service and')

(unstructured.documents.html.HTMLTitle, 'Advertisement')

(unstructured.documents.html.HTMLTitle, 'Advertisement')

(unstructured.documents.html.HTMLTitle, 'Advertisement')

(unstructured.documents.html.HTMLTitle, 'Advertisement')

(unstructured.documents.html.HTMLTitle, 'Read next')

(unstructured.documents.html.HTMLTitle,
 "Watch: Here's what to know about Russian opposition leader Alexei Navalny —Putin’s biggest critic")

(unstructured.documents.html.HTMLTitle, 'Donald Trump')

(unstructured.documents.html.HTMLTitle, 'Russia')

(unstructured.documents.html.HTMLTitle, 'Advertisement')

In [11]:
display(*[(type(element), element.text) for element in elements if type(element).__name__ == "HTMLNarrativeText"])

(unstructured.documents.html.HTMLNarrativeText,
 'Email icon\n                                            An envelope. It indicates the ability to send an email.\n                                          \n                                            \n                                          \n                                            Email')

(unstructured.documents.html.HTMLNarrativeText,
 'Twitter icon\n                                            A stylized bird with an open mouth, tweeting.\n                                          \n                                            \n                                          \n                                            Twitter')

(unstructured.documents.html.HTMLNarrativeText,
 'Link icon\n                                            An image of a chain link. It symobilizes a website link url.\n                                          \n                                            \n                                          \n                                            Copy Link')

(unstructured.documents.html.HTMLNarrativeText, 'Read in app')

(unstructured.documents.html.HTMLNarrativeText,
 'Angle down icon\n                                      An icon in the shape of an angle pointing down.\n                                      \n                                              \n                              \n                              \n                                The Russian opposition leader Alexey Navalny attending an opposition march in memory of the murdered Kremlin critic Boris Nemtsov in central Moscow.\n                              \n                        \n                                \n                                  VASILY MAXIMOV/AFP/Getty Images')

(unstructured.documents.html.HTMLNarrativeText,
 'This story is available exclusively to Business Insider\n                      subscribers.\n                      Become an Insider\n                      and start reading now.')

(unstructured.documents.html.HTMLNarrativeText, 'Have an account? Log in.')

(unstructured.documents.html.HTMLNarrativeText,
 '\n                                    Sign up to get the inside scoop on today’s biggest stories in markets, tech, and business — delivered daily. ')

(unstructured.documents.html.HTMLNarrativeText, 'Thanks for signing up!')

(unstructured.documents.html.HTMLNarrativeText,
 "\n                              Access your favorite topics in a personalized feed while you're on the go.\n                              ")

(unstructured.documents.html.HTMLNarrativeText,
 '\n                                  By clicking “Sign Up”, you accept our ')

(unstructured.documents.html.HTMLNarrativeText,
 'Privacy Policy. You can opt-out at any time.')

(unstructured.documents.html.HTMLNarrativeText,
 'Alexey Navalny, a dissident and the political nemesis of Russian President Vladimir Putin, spent the past few years of his life behind bars but still managed to stay connected to the outside world.')

(unstructured.documents.html.HTMLNarrativeText,
 "Letters from the final months of his life, obtained by The New York Times, show that Navalny, who'd been imprisoned since January 2021, managed to stay on top of current events — including in the US.")

(unstructured.documents.html.HTMLNarrativeText,
 'This story is available exclusively to Business Insider\n                            subscribers.\n                            Become an Insider\n                            and start reading now.\n                          Have an account? Log in.')

(unstructured.documents.html.HTMLNarrativeText,
 'In a letter sent to a friend, a photographer named Evgeny Feldman, Navalny said former President Donald Trump\'s agenda for a second term was "really scary," according to the Times.')

(unstructured.documents.html.HTMLNarrativeText,
 'He said if President Joe Biden were to have a health issue, "Trump will become president," adding: "Doesn\'t this obvious thing concern the Democrats?"')

(unstructured.documents.html.HTMLNarrativeText,
 'In another letter to Feldman dated December 3, Navalny again expressed concern over Trump and asked his friend, "Please name one current politician you admire."')

(unstructured.documents.html.HTMLNarrativeText,
 "Trump's office didn't immediately respond to a request for comment from Business Insider.")

(unstructured.documents.html.HTMLNarrativeText,
 'On December 6, Navalny disappeared from the IK-6 penal colony about 120 miles east of Moscow. He turned up again on Christmas Day when his lawyers announced they had located him at the IK-3 penal colony, about 1,000 miles northeast of Moscow, above the Arctic Circle.')

(unstructured.documents.html.HTMLNarrativeText,
 "The Times reported that Navalny's communication ability from his new prison was greatly diminished.")

(unstructured.documents.html.HTMLNarrativeText,
 "The journalist Sergei Parkhomenko said he received a letter from Navalny on February 13, a few days before Navalny's death was announced. In the letter, which Parkhomenko shared on Facebook, Navalny spoke of books and said he only had access to classics at his new prison.")

(unstructured.documents.html.HTMLNarrativeText,
 '"Who could\'ve told me that Chekhov is the most depressing Russian writer?" he wrote.')

(unstructured.documents.html.HTMLNarrativeText,
 "Trump, for his part, didn't mention Navalny in the days after his death, despite condemnations from other leaders who directly blamed Putin.")

(unstructured.documents.html.HTMLNarrativeText,
 'In a Truth Social post on Monday, Trump briefly mentioned Navalny before directing his ire at his own perceived political opponents: "The sudden death of Alexei Navalny has made me more and more aware of what is happening in our Country. It is a slow, steady progression, with CROOKED, Radical Left Politicians, Prosecutors, and Judges leading us down a path to destruction."')

(unstructured.documents.html.HTMLNarrativeText,
 'He mentioned neither Russia nor Putin.')

(unstructured.documents.html.HTMLNarrativeText,
 'Sign up for notifications from Insider! Stay up to date with what you want to know.')

(unstructured.documents.html.HTMLNarrativeText,
 'Subscribe to push notifications')
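
The per-type filter cells above repeat the same pass over `elements` once per class. A single grouping pass can produce all of the buckets at once. The sketch below illustrates this with hypothetical stand-in classes (`FakeElement` and its subclasses) so that it runs without the library; with real elements only `group_texts_by_type` would be needed. It also shows one way to drop repeated titles such as 'Advertisement' while keeping first-seen order.

```python
from collections import defaultdict


class FakeElement:
    """Hypothetical stand-in exposing the same .text attribute as real elements."""

    def __init__(self, text):
        self.text = text


class HTMLTitle(FakeElement):
    pass


class HTMLListItem(FakeElement):
    pass


def group_texts_by_type(elements):
    """Map each element class name to the list of texts of that class."""
    grouped = defaultdict(list)
    for element in elements:
        grouped[type(element).__name__].append(element.text)
    return dict(grouped)


sample = [
    HTMLTitle("Military & Defense"),
    HTMLTitle("Advertisement"),
    HTMLListItem("Navalny commented on US politics."),
    HTMLTitle("Advertisement"),
]

grouped = group_texts_by_type(sample)

# dict.fromkeys keeps the first occurrence of each title and preserves order.
unique_titles = list(dict.fromkeys(grouped["HTMLTitle"]))
```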

In [12]:
url = 'https://www.nytimes.com/2024/02/19/world/europe/navalny-letters-russia.html'

from unstructured.partition.auto import partition
elements = partition(url=url, strategy='hi_res', html_assemble_articles=True)
elements

[<unstructured.documents.html.HTMLTitle at 0x7b6f04f7e200>]

In [13]:
display(*[(type(element), element.text) for element in elements])

(unstructured.documents.html.HTMLTitle,
 'Please enable JS and disable any ad blocker')
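
Here the site returned a bot-protection page instead of the article, so the single extracted element is the block notice rather than article content. A pipeline may want to detect this case before passing elements downstream. The check below is a rough heuristic in plain Python; the marker strings, the `min_elements` threshold, and the `looks_blocked` name are all assumptions chosen for illustration.

```python
# Phrases that suggest a block page (an assumed, non-exhaustive list).
BLOCK_MARKERS = ("enable js", "enable javascript", "disable any ad blocker")


def looks_blocked(texts, min_elements=3):
    """Return True when few elements came back and one contains a block marker."""
    if len(texts) >= min_elements:
        return False
    return any(marker in text.lower() for text in texts for marker in BLOCK_MARKERS)


blocked = looks_blocked(["Please enable JS and disable any ad blocker"])
normal = looks_blocked(["Title", "First paragraph", "Second paragraph", "Please enable JS"])
print(blocked, normal)
```

With real elements, the input list would be `[element.text for element in elements]`.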

## Embeddings

https://unstructured-io.github.io/unstructured/core/embedding.html

In [14]:
!pip install "unstructured[openai]"


In [27]:
from google.colab import userdata
open_ai_api_key = userdata.get('open_ai_api_key')

from unstructured.documents.elements import Text
from unstructured.embed.openai import OpenAIEmbeddingConfig, OpenAIEmbeddingEncoder

# Initialize the encoder with OpenAI credentials.
# OpenAIEmbeddingConfig must be imported alongside the encoder;
# omitting that import raises NameError.
embedding_encoder = OpenAIEmbeddingEncoder(
    config=OpenAIEmbeddingConfig(api_key=open_ai_api_key)
)

# Embed a list of Elements
elements = embedding_encoder.embed_documents(
    elements=[Text("This is sentence 1"), Text("This is sentence 2")],
)

# Embed a single query string
query = "This is the query"
query_embedding = embedding_encoder.embed_query(query=query)

# Print embeddings
for e in elements:
    print(e.embeddings, e)
print(query_embedding, query)
print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())

## HTML Links, Vector Databases, Automatic Summarization via LangChain + OpenAI

https://unstructured-io.github.io/unstructured/examples/chroma.html

In [116]:
from unstructured.partition.html import partition_html

cnn_lite_url = "https://lite.cnn.com/"
elements = partition_html(url=cnn_lite_url)
links = []

for element in elements:
  try:
    if element.links[0]["url"][1:]:
      relative_link = element.links[0]["url"][1:]
      links.append(f"{cnn_lite_url}{relative_link}")
  except IndexError:
    # Skip elements whose links list is empty (links[0] raises IndexError)
    continue

links

['https://lite.cnn.com/2024/02/22/us/how-to-help-civilians-in-ukraine-after-two-years-of-war/index.html',
 'https://lite.cnn.com/2024/02/22/opinions/what-trumps-vp-shortlist-reveals-zelizer/index.html',
 'https://lite.cnn.com/2024/02/22/tech/att-cell-service-outage/index.html',
 'https://lite.cnn.com/2024/02/22/politics/trump-engoron-civil-fraud-order/index.html',
 'https://lite.cnn.com/2024/02/22/politics/usda-discrimination-report-recommendations-reaj/index.html',
 'https://lite.cnn.com/2024/02/22/tech/ftc-avast-cybersecurity-company-fine/index.html',
 'https://lite.cnn.com/2024/02/22/entertainment/wendy-williams-aphasia-and-dementia/index.html',
 'https://lite.cnn.com/health/frontotemporal-dementia-definition-symptoms-wellness/index.html',
 'https://lite.cnn.com/2024/02/22/cnn10/ten-content-fri/index.html',
 'https://lite.cnn.com/2024/02/22/media/timothy-burke-indicted-fox-news-tucker-carlson-footage/index.html',
 'https://lite.cnn.com/2024/02/22/entertainment/rust-trial-alec-baldwi
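
The loop above builds absolute URLs by slicing off the leading `/` and concatenating. A slightly more defensive sketch uses `urllib.parse.urljoin` from the standard library, which resolves relative paths against the base URL. `absolute_links` is a hypothetical helper name, and the `hrefs` list stands in for the `element.links[0]["url"]` values collected from the page.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, skipping empty ones and the bare root '/'."""
    resolved = []
    for href in hrefs:
        if not href or href == "/":
            continue
        resolved.append(urljoin(base_url, href))
    return resolved


example = absolute_links(
    "https://lite.cnn.com/",
    ["/2024/02/22/tech/att-cell-service-outage/index.html", "/", ""],
)
print(example)
```

Unlike string concatenation, `urljoin` also leaves hrefs that are already absolute untouched.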

In [117]:
from langchain.document_loaders import UnstructuredURLLoader

# links = ['https://lite.cnn.com/2024/02/22/tech/nvidia-ceo-jensen-huang-20-richest-billionaire/index.html',
#          'https://lite.cnn.com/2024/02/22/us/darryl-george-crown-act-trial-texas-reaj/index.html']

loaders = UnstructuredURLLoader(urls=links, show_progress_bar=True)

docs = loaders.load()

100%|██████████| 104/104 [00:35<00:00,  2.97it/s]


In [118]:
from langchain.vectorstores.chroma import Chroma
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key=open_ai_api_key)
vectorstore = Chroma.from_documents(docs, embeddings)

In [131]:
query_docs = vectorstore.similarity_search("Russia", k=2)
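
Conceptually, `similarity_search` embeds the query and returns the `k` stored documents whose vectors are closest to it. The sketch below illustrates that ranking step with toy three-dimensional vectors and cosine similarity. The metric is an assumption made for illustration (the distance function Chroma actually uses depends on how the collection is configured), and `cosine_similarity`, `top_k`, and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k documents most similar to the query vector."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy embeddings (made up); real OpenAI embeddings have far more dimensions.
toy_docs = {
    "russia_article": [0.9, 0.1, 0.0],
    "tech_article": [0.1, 0.9, 0.1],
    "election_article": [0.7, 0.2, 0.3],
}
print(top_k([1.0, 0.0, 0.0], toy_docs, k=2))
```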

In [127]:
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k", openai_api_key=open_ai_api_key)
chain = load_summarize_chain(llm, chain_type="stuff")
chain

StuffDocumentsChain(llm_chain=LLMChain(prompt=PromptTemplate(input_variables=['text'], template='Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'), llm=ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x7b6efe8c7160>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x7b6efb234070>, model_name='gpt-3.5-turbo-16k', temperature=0.0, openai_api_key='<redacted>', openai_proxy='')), document_variable_name='text')

In [128]:
chain.invoke(query_docs)

{'input_documents': [Document(page_content='CNN\n\n2/22/2024\n\nPutin looms over a third successive US election\n\nAnalysis by Stephen Collinson, CNN\n\nUpdated: \n        9:01 AM EST, Thu February 22, 2024\n\nSource: CNN\n\n“Russia, Russia, Russia.”\n\nEx-President Donald Trump’s scathing catchphrase for a torrent of investigations during his administration also serves as an apt catch-all for the\xa0current meltdown over Moscow roiling US politics.\n\nThe United States might have beaten the Kremlin in the Cold War\xa0and ever since regarded Moscow as a mere irritant — albeit one with nuclear arms — and have been desperate to concentrate on the showdown with its new superpower rival, China.\n\nBut Russia and its leader, whom President Joe Biden described as a “crazy S.O.B.” at a Wednesday fundraiser, won’t go away.\n\nPresident Vladimir Putin has trained the malevolence of his intelligence agencies, his military power, global diplomacy and obstructive statecraft into a multi-front assa

In [132]:
query_docs

[Document(page_content='CNN\n\n2/22/2024\n\nPutin looms over a third successive US election\n\nAnalysis by Stephen Collinson, CNN\n\nUpdated: \n        9:01 AM EST, Thu February 22, 2024\n\nSource: CNN\n\n“Russia, Russia, Russia.”\n\nEx-President Donald Trump’s scathing catchphrase for a torrent of investigations during his administration also serves as an apt catch-all for the\xa0current meltdown over Moscow roiling US politics.\n\nThe United States might have beaten the Kremlin in the Cold War\xa0and ever since regarded Moscow as a mere irritant — albeit one with nuclear arms — and have been desperate to concentrate on the showdown with its new superpower rival, China.\n\nBut Russia and its leader, whom President Joe Biden described as a “crazy S.O.B.” at a Wednesday fundraiser, won’t go away.\n\nPresident Vladimir Putin has trained the malevolence of his intelligence agencies, his military power, global diplomacy and obstructive statecraft into a multi-front assault on American powe

In [143]:
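# Summarize each retrieved document on its own.
# The chain expects a list of Documents; passing a bare Document makes it
# iterate over the Document's fields as tuples, so wrap each one in a list.
for document in query_docs:
  result = chain.invoke({"input_documents": [document]})
  print(f"Summary of document: {result['output_text']}")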

In [145]:
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k", openai_api_key=open_ai_api_key)

# MapReduceDocumentsChain expects chain objects, not plain Python functions,
# so let load_summarize_chain build the map and reduce steps:
#   map:    summarize each document independently
#   reduce: combine the per-document summaries into one final summary
chain = load_summarize_chain(llm, chain_type="map_reduce")

result = chain.invoke({"input_documents": query_docs})
print(result["output_text"])
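
The map-reduce idea behind this cell can be sketched without an LLM: summarize each document independently (map), then merge the partial summaries (reduce). This is a simplified illustration of the pattern, not LangChain's implementation. `map_reduce_summarize`, `first_sentence`, and `join_parts` are hypothetical names, and `first_sentence` is a toy stand-in for a model call.

```python
def map_reduce_summarize(texts, summarize, combine):
    """Summarize each text independently (map), then merge the results (reduce)."""
    partial_summaries = [summarize(text) for text in texts]  # map step
    return combine(partial_summaries)                        # reduce step


def first_sentence(text):
    """Toy stand-in for an LLM call: keep only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."


def join_parts(parts):
    """Toy reduce step: concatenate the partial summaries."""
    return " ".join(parts)


sample_texts = [
    "Putin looms over the election. More text follows.",
    "Navalny died in prison. Leaders reacted.",
]
print(map_reduce_summarize(sample_texts, first_sentence, join_parts))
```

In a real pipeline both `summarize` and `combine` would typically call the model, which is what lets the approach handle document sets too large for a single prompt.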