# Unstructured File Loader
This notebook covers how to use Unstructured to load files of many types. Unstructured currently supports loading text files, PowerPoint presentations, HTML, PDFs, images, and more.

In [1]:
# # Install package
!pip install "unstructured[local-inference]"
!pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"
!pip install "layoutparser[layoutmodels,tesseract]"


In [2]:
# # Install other dependencies
# # https://github.com/Unstructured-IO/unstructured/blob/main/docs/source/installing.rst
# !brew install libmagic
# !brew install poppler
# !brew install tesseract
# # If parsing xml / html documents:
# !brew install libxml2
# !brew install libxslt

In [3]:
# import nltk
# nltk.download('punkt')

In [4]:
from langchain.document_loaders import UnstructuredFileLoader

In [5]:
loader = UnstructuredFileLoader("email.txt")

In [6]:
docs = loader.load()

[nltk_data] Downloading package punkt to /home/ubuntu/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ubuntu/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [7]:
docs[0].page_content[:400]

"Hello everyone, please see below for some interesting music law & industry articles I collected from the past two weeks.  If you would like a PDF of the full article, send me an email!\n\nNew Music Friday\n\nDaniel Caesar – Never Enough\n\nEllie Goulding – Higher Than Heaven\n\nDrake – Search & Rescue\n\nMelanie Martinez – PORTALS\n\nYaeji – With A Hammer\n\nChildish Gambino Beats 'This Is America' Copyright Su"

## Retain Elements

Under the hood, Unstructured creates different "elements" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode="elements"`.

In [14]:
loader = UnstructuredFileLoader("email.txt", mode="elements")

In [15]:
docs = loader.load()

In [16]:
docs[:5]

[Document(page_content='Hello everyone, please see below for some interesting music law & industry articles I collected from the past two weeks.  If you would like a PDF of the full article, send me an email!', metadata={'source': 'email.txt', 'filename': 'email.txt', 'category': 'NarrativeText'}),
 Document(page_content='New Music Friday', metadata={'source': 'email.txt', 'filename': 'email.txt', 'category': 'Title'}),
 Document(page_content='Daniel Caesar – Never Enough', metadata={'source': 'email.txt', 'filename': 'email.txt', 'category': 'Title'}),
 Document(page_content='Ellie Goulding – Higher Than Heaven', metadata={'source': 'email.txt', 'filename': 'email.txt', 'category': 'Title'}),
 Document(page_content='Drake – Search & Rescue', metadata={'source': 'email.txt', 'filename': 'email.txt', 'category': 'Title'})]
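
The relationship between the two modes can be sketched without the library. The snippet below is a minimal illustration, not LangChain's or Unstructured's implementation: `Element`, `combine_elements`, and `filter_by_category` are hypothetical names, and the `"\n\n"` separator is an assumption inferred from the default-mode output shown earlier, where element texts appear separated by blank lines.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for one partitioned chunk of text."""
    text: str
    category: str


def combine_elements(elements):
    """Join element texts into one string, as the default mode appears to do."""
    # Assumption: elements are separated by a blank line, matching the
    # "\n\n" seen in the default-mode page_content shown earlier.
    return "\n\n".join(el.text for el in elements)


def filter_by_category(elements, category):
    """Keep only the elements whose category matches, e.g. 'Title'."""
    return [el for el in elements if el.category == category]


# Sample texts and categories copied from the outputs in this notebook.
elements = [
    Element("If you would like a PDF of the full article, send me an email!", "NarrativeText"),
    Element("New Music Friday", "Title"),
    Element("Daniel Caesar – Never Enough", "Title"),
]

combined = combine_elements(elements)
titles = filter_by_category(elements, "Title")
print(combined)
print([el.text for el in titles])
```

Keeping elements separate makes this kind of per-category filtering possible, which is the practical reason to choose `mode="elements"` over the combined default.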

## Define a Partitioning Strategy

The Unstructured document loaders allow users to pass in a `strategy` parameter that tells `unstructured` how to partition the document. Currently supported strategies are `"hi_res"` (the default) and `"fast"`. The hi-res strategy is more accurate but takes longer to process. The fast strategy partitions the document more quickly at the cost of accuracy. Not all document types have separate hi-res and fast partitioning strategies; for those document types, the `strategy` kwarg is ignored. In some cases, the hi-res strategy will fall back to fast if a dependency is missing (e.g. a model for document partitioning). You can see how to apply a strategy to an `UnstructuredFileLoader` below.

In [None]:
from langchain.document_loaders import UnstructuredFileLoader

In [None]:
loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", strategy="fast", mode="elements")

In [None]:
docs = loader.load()

In [None]:
docs[:5]

[Document(page_content='1', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
 Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
 Document(page_content='0', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
 Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
 Document(page_content='n', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category':
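
The selection rules described in this section can be summarized in a small sketch. This is only a model of the prose, not the library's code: `resolve_strategy` and its parameters `has_separate_strategies` and `hi_res_deps_available` are hypothetical names introduced for illustration.

```python
SUPPORTED_STRATEGIES = ("hi_res", "fast")


def resolve_strategy(requested, has_separate_strategies, hi_res_deps_available):
    """Return the strategy that effectively applies, or None if it is ignored.

    Hypothetical helper: all parameters only model the rules stated in the text.
    """
    if requested not in SUPPORTED_STRATEGIES:
        raise ValueError(f"unsupported strategy: {requested!r}")
    if not has_separate_strategies:
        # Document types with a single partitioner ignore the strategy kwarg.
        return None
    if requested == "hi_res" and not hi_res_deps_available:
        # Fall back to the fast strategy when a required dependency is missing.
        return "fast"
    return requested


print(resolve_strategy("hi_res", True, True))
print(resolve_strategy("hi_res", True, False))
print(resolve_strategy("fast", False, True))
```

The three calls cover the cases named above: the requested strategy is honored, hi-res falls back to fast, and the kwarg is ignored.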

In [28]:
import os

import streamlit as st
from apikey import apikey  # local module that provides the OpenAI key
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain, SequentialChain
from langchain.chains import RetrievalQA
from langchain.memory import ConversationBufferMemory
from langchain.utilities import WikipediaAPIWrapper
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator

# Make the key available to the OpenAI client
os.environ['OPENAI_API_KEY'] = apikey

# Load both emails and build a vector store index over them
loader = TextLoader('email.txt', encoding='utf8')
loader2 = TextLoader('email2.txt', encoding='utf8')
index = VectorstoreIndexCreator().from_loaders([loader, loader2])

2023-04-25 01:14:54.853 INFO    chromadb.telemetry.posthog: Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
2023-04-25 01:14:54.854 INFO    chromadb: Running Chroma using direct local API.
2023-04-25 01:14:55.409 INFO    chromadb.db.duckdb: Exiting: Cleaning up .chroma directory


In [31]:
query = "What was Yuyen's excuse for why his company used the music without a license?"
response = index.query(query, verbose=True)
print(response)



[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m
 Yuyen's excuse was that Hotai unintentionally made the videos accessible on YouTube after the music licenses expired, i.e. September 27th, 2013.


In [41]:
from langchain.document_loaders import UnstructuredURLLoader
urls = [
    "https://cocatalog.loc.gov/cgi-bin/Pwebrecon.cgi?v1=1&ti=1,1&Search%5FArg=SR0000412664&Search%5FCode=REGS&CNT=25&PID=tJAQFG5-OaIjIT-Elu0DU5tVZOOna&SEQ=20230424212402&SID=1"]

loader = UnstructuredURLLoader(urls=urls)
data = loader.load()
index = VectorstoreIndexCreator().from_loaders([loader])
query = "What is the registration number?"
response = index.query(query)
print(response)

2023-04-25 01:28:35.748 INFO    unstructured: Reading document from string ...
2023-04-25 01:28:35.750 INFO    unstructured: Reading document ...
2023-04-25 01:28:36.772 INFO    unstructured: Reading document from string ...
2023-04-25 01:28:36.773 INFO    unstructured: Reading document ...
2023-04-25 01:28:36.776 INFO    chromadb.telemetry.posthog: Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
2023-04-25 01:28:36.777 INFO    chromadb: Running Chroma using direct local API.


NotEnoughElementsException: Number of requested results 4 cannot be greater than number of elements in index 1
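
The exception message says the query asked for 4 results while the index held only 1 element. One way to avoid this class of error is to cap the number of requested results at the index size. The sketch below illustrates the idea on a toy in-memory list; `ToyIndex`, `top_k`, and the word-overlap scoring are hypothetical and are not Chroma's or LangChain's API.

```python
class ToyIndex:
    """Hypothetical in-memory index used only to illustrate the size check."""

    def __init__(self, texts):
        self.texts = list(texts)

    def top_k(self, query, k=4):
        # Mirror the failure mode: refuse to return more results than exist.
        if k > len(self.texts):
            raise ValueError(
                f"Number of requested results {k} cannot be greater than "
                f"number of elements in index {len(self.texts)}"
            )
        query_words = set(query.lower().split())
        # Rank by naive word overlap with the query (illustrative scoring only).
        ranked = sorted(
            self.texts,
            key=lambda t: len(query_words & set(t.lower().split())),
            reverse=True,
        )
        return ranked[:k]


toy_index = ToyIndex(["only one chunk of text was indexed"])

# Cap k at the number of indexed elements before querying.
safe_k = min(4, len(toy_index.texts))
results = toy_index.top_k("What is the registration number?", k=safe_k)
print(results)
```

With a single indexed chunk, the capped query returns that one chunk instead of raising.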

## PDF Example

Processing PDF documents works exactly the same way. Unstructured detects the file type and extracts the same types of `elements`. 

In [None]:
!wget  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper.pdf -P "../../"

In [None]:
loader = UnstructuredFileLoader("../../layout-parser-paper.pdf", mode="elements")

In [None]:
docs = loader.load()

In [None]:
docs[:5]

[Document(page_content='LayoutParser : A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),
 Document(page_content='Zejiang Shen 1 ( (ea)\n ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and Weining Li 5', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),
 Document(page_content='Allen Institute for AI shannons@allenai.org', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),
 Document(page_content='Brown University ruochen zhang@brown.edu', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),
 Document(page_content='Harvard University { melissadell,jacob carlson } @fas.harvard.edu', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0)]
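
How Unstructured detects the file type is not shown in this notebook; the install cell above lists `libmagic`, which suggests detection based on file content. As a deliberately simplified stand-in, the sketch below guesses a type from the file extension alone using Python's standard `mimetypes` module. The helper name `guess_file_type` is hypothetical.

```python
import mimetypes


def guess_file_type(filename):
    """Guess a MIME type from the file extension using only the standard library."""
    # Simplification: this looks at the extension, not at the file's content.
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


for name in ("email.txt", "layout-parser-paper.pdf"):
    print(name, "->", guess_file_type(name))
```

Extension-based guessing fails for files that are misnamed or have no extension, which is one reason a loader might inspect content instead.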