# Document Loaders

In [2]:
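import os


def prepare_filesdir_path():
    """Return the directory that holds the sample (non-code) files."""
    return os.path.join("resources", "no-code-files")


def prepare_resource_path(filename):
    """Return the path of a sample file directly under the files directory."""
    return os.path.join(prepare_filesdir_path(), filename)


def prepare_pdf_path(filename):
    """Return the path of a sample PDF in the "pdfs" subdirectory."""
    return os.path.join(prepare_filesdir_path(), "pdfs", filename)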

### TextLoader

In [90]:
from langchain_community.document_loaders import TextLoader

In [14]:
loader = TextLoader("loader-functionality-to-focus-on.md")
document = loader.load()
print(document)

[Document(metadata={'source': 'loader-functionality-to-focus-on.md'}, page_content='# For each document loader in LangChain, you should focus on the following aspects:\n\n### Supported File Types:\nIdentify what file formats the loader can handle (e.g., PDFs, Word docs, CSVs).\n\n### Configuration Options:\nLearn about the customizable parameters or settings available for each loader (e.g., handling large files, extracting metadata, pagination).\n\n### Parsing and Extraction:\nUnderstand how the loader extracts text or data from the document and how it handles complex structures like tables, images, or embedded files.\n\n### Efficiency and Performance:\nExplore how the loader manages memory, speed of loading, and processing large documents or datasets.\n\n### Integration with Other Tools:\nCheck how the loader integrates with other tools or services, such as cloud storage, databases, or web APIs.\n\n### Error Handling:\nInvestigate how the loader deals with corrupted files, missing dat

In [23]:
len(document[0].page_content.split('\n\n'))

10

In [28]:
# Let's try loading a txt with it
loader = TextLoader('requirements.txt')
loader.load()

[Document(metadata={'source': 'requirements.txt'}, page_content='ipykernel')]

In [29]:
loader = TextLoader('file')
loader.load()

[Document(metadata={'source': 'file'}, page_content='how are you. This is a file with no extension.')]

In [31]:
loader = TextLoader(prepare_resource_path('11-quote.txt'))
loader.load()

[Document(metadata={'source': 'resources\\no-code-files\\11-quote.txt'}, page_content='Never Say Tomorrow.\nDo it today.')]

In [33]:
loader = TextLoader(prepare_resource_path('09-50-quotes.txt'))
docs = loader.lazy_load()

for chunk in docs:
    print(chunk)

page_content='1. "If you want to achieve greatness stop asking for permission." --Anonymous 2. "Things work out best for those who make the best of how things work out." --John Wooden 3. "To live a creative life, we must lose our fear of being wrong." --Anonymous 4. "If you are not willing to risk the usual you will have to settle for the ordinary." --Jim Rohn 5. "Trust because you are willing to accept the risk, not because it's safe or certain." --Anonymous 6. "Take up one idea. Make that one idea your life--think of it, dream of it, live on that idea. Let the 
brain, muscles, nerves, every part of your body, be full of that idea, and just leave every other idea 
alone. This is the way to success." --Swami Vivekananda 7. "All our dreams can come true if we have the courage to pursue them." --Walt Disney 8. "Good things come to people who wait, but better things come to those who go out and get them." --
Anonymous 9. "If you do what you always did, you will get what you always got." -

- This loader only loads plain-text files: `.txt`, `.md`, and files with no extension that contain text.
- It can't load PDF, DOCX, CSV, or PNG files.
- The metadata contains only `source`, which is the file path.
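To make the notes above concrete, here is a minimal standard-library sketch of what a plain-text loader does: read the whole file into one document and record only the path as `source`. It is not LangChain's implementation; `SimpleDocument`, `lazy_load_text`, and `load_text` are hypothetical names made up for this illustration.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def lazy_load_text(path: str, encoding: str = "utf-8") -> Iterator[SimpleDocument]:
    """Yield a single document holding the whole file; metadata is just the path."""
    with open(path, encoding=encoding) as f:
        text = f.read()
    yield SimpleDocument(page_content=text, metadata={"source": path})


def load_text(path: str, encoding: str = "utf-8") -> List[SimpleDocument]:
    """Eager variant: collect the lazy generator into a list."""
    return list(lazy_load_text(path, encoding))


# Usage: a file with no extension is read the same way as a .txt or .md file.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "file")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("Never Say Tomorrow.\nDo it today.")

demo_docs = load_text(demo_path)
print(demo_docs[0].metadata)
```

The lazy variant is a generator, so nothing is read until it is iterated, which mirrors the difference between `load()` and `lazy_load()` seen in the cells above.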

### CSV Loader

#### CSVLoader

In [107]:
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(prepare_resource_path("07-quote-dateset.csv"), source_column='author')
docs = loader.load()

In [92]:
# each row is treated as a separate document
len(docs)

75966

In [103]:
docs[0].page_content

"quote: Age is an issue of mind over matter. If you don't mind, it doesn't matter.\nauthor: Mark Twain"

In [99]:
type(docs[0].page_content)

str

In [105]:
docs[0].page_content.find('quote')

0

In [102]:
docs[0].metadata

{'source': 'Mark Twain', 'row': 0}

- This loader loads each row as a separate document.
- If `source_column` is set, the `source` metadata of each document (row) is taken from that row's entry in the given column (here, `author`).
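The row-per-document behaviour can be sketched with the standard `csv` module. This is only an approximation of what the outputs above show (a `key: value` line per column, plus `source` and `row` metadata), not CSVLoader's actual code. The function name `rows_to_documents`, the `default_source` fallback, and the sample data are all made up for illustration.

```python
import csv
import io
from typing import Dict, List, Optional


def rows_to_documents(csv_text: str,
                      source_column: Optional[str] = None,
                      default_source: str = "in-memory.csv") -> List[Dict]:
    """Turn every CSV row into one document-like dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_index, row in enumerate(reader):
        # One "column: value" line per column, joined with newlines.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        # Use the row's own entry as the source when a source column is given.
        source = row[source_column] if source_column else default_source
        documents.append({
            "page_content": content,
            "metadata": {"source": source, "row": row_index},
        })
    return documents


# Hypothetical sample data, not taken from the quotes dataset.
sample_csv = 'quote,author\n"First sample quote.",Author A\n"Second sample quote.",Author B\n'
csv_docs = rows_to_documents(sample_csv, source_column="author")
print(csv_docs[0])
```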

#### UnstructuredCSVLoader

In [108]:
from langchain_community.document_loaders.csv_loader import UnstructuredCSVLoader

# not studied right now.

### PDF Loader

#### PyPDFLoader

In [119]:
from langchain_community.document_loaders.pdf import PyPDFLoader

loader = PyPDFLoader(prepare_resource_path("01-motivational-quotes.pdf"))
docs = loader.load()

In [125]:
docs[0].metadata

{'source': 'resources\\no-code-files\\01-motivational-quotes.pdf', 'page': 0}

In [124]:
docs[0].page_content

'100 Motivational Quotes That Will Inspire You to Succeed  \nEveryone needs some inspiration, and these motivational quotes will give you the edge you \nneed to create your success. So read on and let them inspire you . \nBy Lolly Daskal  \nAs leaders,  managers, and bosses, we must realize that everything we think actually matters. If we are \nseeking success, we must think successful, inspiring, and motivating thoughts.  \nRead on to find the words of wisdom that will motivate you in building your business, leading your life , \ncreating success,  achieving your goals, and overcoming your fears.   Here are quotes —100 of them —that \nwill inspire you r success.   \n1. "If you want to achieve greatness  stop asking for permission." --Anonymous  \n2. "Things work out best for those who make the best of how things work out." --John Wooden  \n3. "To live a creative life, we must lose  our fear of being wrong." --Anonymous  \n4. "If you are not willing to risk the usual you will have to s

In [130]:
# trying with a pdf having images and columns
loader = PyPDFLoader(prepare_resource_path("02-quotes-with-mix-format.pdf"), extract_images=True)
docs = loader.load()

In [131]:
docs[0].page_content

'Slow and s teady wins the race.  \n \nNever give up.  \nIts about the decis ive moment.  \nPerfectionis\nnotattainable,\nbutifwechase\nperfection we can\ncatch excellence.\nVINCELOMBARDI\nBRIANTRACY'

In [132]:
docs[1].page_content

' \n \nPrepare for the real life.  \n \nTable -quote -1 Table -quote -2 \nTable -quote -3 Table -quote -4 \n \nIfyoucamchange\nyourmind,you\ncamchange\nyour lhios,\n=- %9[LL1] 3LAAB8Iifeiswvhat\nhappenstous\nwhileweare\ninalkingother\nplaus."It\'sthe possibility\nofhavinga dream\ncometruethat\nmakeslife\ninteresting.\nPAULO COEHLO\nTOBAT'

- This loader treats each page as a separate document.
- The metadata also includes the page number (`page`, zero-based).
- It sometimes adds extra spaces in the text, e.g. between the characters of a word.
- With `extract_images=True`, it uses RapidOCR (the `rapidocr-onnxruntime` package, a deep-learning-based OCR library) to extract text from images.
- It does extract text from images and it handles tables. The extraction is fairly good but not perfect, and the table handling is also imperfect.
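The extra spaces noted above can be partly cleaned up after loading. The sketch below is a hypothetical post-processing helper (`normalize_spaces` is not part of LangChain) that collapses runs of spaces and drops spaces around newlines. It cannot repair words that were split by a single space, such as "s teady", because those are indistinguishable from ordinary word gaps.

```python
import re


def normalize_spaces(text: str) -> str:
    """Collapse runs of spaces/tabs and remove spaces next to newlines."""
    text = re.sub(r"[ \t]+", " ", text)   # "a  b" -> "a b"
    text = re.sub(r" ?\n ?", "\n", text)  # "a \n b" -> "a\nb"
    return text.strip()


# Fragment in the style of the PyPDFLoader output above.
raw = 'achieve greatness  stop asking for permission." --Anonymous  \n2. "Things work out'
cleaned = normalize_spaces(raw)
print(cleaned)
```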

#### PDFMinerLoader

In [14]:
from langchain_community.document_loaders import PDFMinerLoader

loader = PDFMinerLoader(prepare_pdf_path("07-contract.pdf"))
pages = loader.load()

In [15]:
pages[0].page_content

'1\n\n2\n\n3\n\n4\n\n5\n\n6\n\n7\n\n8\n\n9\n\nEQUIPMENT TAGGING LEGEND\n\nFURNACE HPU\n\nRE-BCO201-MCC01\n\nCHECK VALVE\n\nFILTER\n\nCOMPONENT NUMBER\n(NUMERIC [2 DIGIT])\n\nCOMPONENT DESCRIPTION FOR SUB EQUIPMENT\n\nEQUIPMENT NUMBER DESIGNATION\n(NUMERIC [3 OR 4 DIGIT])\n\nEQUIPMENT NUMBER DESIGNATION\n\nPROCESS AREA\n\nFIXED DISPLACEMENT PUMP\n\nVARIABLE DISPLACEMENT,\nPRESSURE COMPENSATED PUMP\n\nBREATHER / FILTER\n\nHEAT EXCHANGER (COOLER)\n\nIMMERSION HEATER\n\nCYLINDER\n\nPRESSURE RELIEF VALVE\n\nDIFFERENTIAL AREA RELIEF VALVE\n\nFLOW CONTROL VALVE\n\n3-WAY BYPASS VALVE\n\nDIRECTIONAL VALVE, 4-WAY, 3-POSITION\nFLOAT CENTER\n\nDIRECTIONAL VALVE, 4-WAY, 3-POSITION\nTANDEM CENTER, OPEN CROSS-OVER PORTING\n\nDIRECTIONAL AIR VALVE, 4-WAY, SINGLE SOLENOID,\n2-POSITION, AIR RETURN, SPRING ASSISTED\n\nPILOT OPERATED\nDUAL CHECK VALVES\n\nA\n\nB\n\nC\n\nD\n\nE\n\nF\n\n         - PRELIMINARY -\nNOT FOR CONSTRUCTION\n\nA\n\nREV\n\nISSUED FOR DESIGN\n\nN. IMEL\nDESIGN BY\n\nT. HULL\nCHECKED 

In [18]:
pages[0].page_content[pages[0].page_content.find("ISSUED"):]

'ISSUED FOR DESIGN\n\nN. IMEL\nDESIGN BY\n\nT. HULL\nCHECKED BY\n\n07-05-23\nDATE\n\nTHIS DRAWING IS PROPERTY OF ENVIVA, INC. AND IS NOT TO BE\nREPRODUCED, COPIED OR USED FOR ANY PURPOSE OTHER\nTHAN CONSTRUCTION OF THIS PROJECT WITHOUT WRITTEN\nCONSENT OF ENVIVA, INC.\n\nENVIVA\n\nEPES WOOD PELLET FACILITY\n\nI\n\nWE IT\n\nK\n\nS\nI\nNC\n\nE\n\n4\n\n8 8\n\n1\n\n®\n\nCOMMON PLANT SYSTEMS\nP&ID LEGEND SHEET 7 OF 7\nP&ID\n\nENGINEER/DESIGN\nORIGINATOR\n\nT. HULL\n\nLEAD ENG\n\nENG MGR\n\nPROJ MGR\n\nC. HERETH\n\nE. SKIBBE\n\nR. MCNIFF\n\nDRAWING NUMBER\n\n00-01-D-009\n\n\x0c'

- Extracts even minute details (from images), e.g. the quote, author, and publisher from a blurry image, and keeps details such as newlines.
- Handles tables well, both textual tables and tables in images. It doesn't add any key or marker to indicate a table, though; it only uses `\n` etc.
- Handles multiple columns one by one.
- Extracts details from large files very well.
- Extracts details from contracts very well (tried on a drawing).

- Can't handle a PDF made up of images.
- Can't handle Urdu text.
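The output above ends with a form feed character (`\x0c`). Assuming that character marks a page boundary (the output suggests this, but the notebook doesn't confirm it), the text of a single returned document could be split back into per-page chunks. `split_on_form_feed` is a hypothetical helper, and the second page in the sample string is made up.

```python
from typing import List


def split_on_form_feed(text: str) -> List[str]:
    """Split text on form feeds (assumed page breaks) and drop empty chunks."""
    return [chunk.strip() for chunk in text.split("\x0c") if chunk.strip()]


# First chunk mimics the tail of the output above; the second is invented.
sample = "DRAWING NUMBER\n\n00-01-D-009\n\n\x0cSECOND PAGE TEXT\n\n\x0c"
pages_text = split_on_form_feed(sample)
print(len(pages_text))
```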
