# Load Documents Using LangChain for Different Sources


LangChain provides document loaders that read files and convert them into a common `Document` format, so content can be processed the same way regardless of its original format. This notebook covers loading data from text, PDF, Markdown, JSON, CSV, web pages, Word, and arXiv sources for use in LLM applications.
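
As a rough illustration of that common structure: each loader returns a list of documents, and each document carries its text in `page_content` and descriptive fields in `metadata`. The sketch below uses a small stand-in dataclass (`SimpleDocument` is a hypothetical name, not LangChain's class) so that it runs without installing anything; the file names are made up.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Stand-in that mirrors the two fields used throughout this notebook."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Documents from different sources share one shape, so downstream code
# does not need to know where the text came from.
docs = [
    SimpleDocument("Plain text body", {"source": "notes.txt"}),
    SimpleDocument("First page of a PDF", {"source": "report.pdf", "page": 0}),
]

for doc in docs:
    print(doc.metadata["source"], "->", len(doc.page_content), "characters")
```

Every example in the sections that follow inspects the loaded data through these same two attributes.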

LangChain provides loaders for many more document formats; see the [full list of document loaders](https://python.langchain.com/v0.2/docs/integrations/document_loaders/).


----


## Setup


In [2]:
%%capture
# @title Install required libraries
# The %%capture cell magic must be the first line of the cell.
# After executing this cell, RESTART the kernel (or restart the session).
!pip install --user "langchain-community==0.2.1"
!pip install --user "pypdf==4.2.0"
!pip install --user "PyMuPDF==1.24.5"
!pip install --user "unstructured==0.14.8"
!pip install --user "markdown==3.6"
!pip install --user "jq==1.7.0"
!pip install --user "pandas==2.2.2"
!pip install --user "docx2txt==0.8"
!pip install --user "requests==2.32.3"
!pip install --user "nltk==3.8.0"
!pip install arxiv


In [2]:
# @title Import required libraries

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

from pprint import pprint
import json
from pathlib import Path
import nltk
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_community.document_loaders import JSONLoader
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.document_loaders.csv_loader import UnstructuredCSVLoader
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.document_loaders import Docx2txtLoader
from langchain_community.document_loaders import UnstructuredFileLoader
from langchain_community.document_loaders import ArxivLoader

nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

----

## Load from TXT files


`TextLoader` is the simplest loader: it reads a file as plain text and places all of the content into a single document.
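
Conceptually, that behavior amounts to the following. This is a simplified stand-in written with the standard library, not `TextLoader`'s actual implementation; the function name `load_text_as_single_document` and the sample file are hypothetical.

```python
import tempfile
from pathlib import Path


def load_text_as_single_document(path, encoding="utf-8"):
    """Read the whole file and wrap it in a one-element list."""
    text = Path(path).read_text(encoding=encoding)
    return [{"page_content": text, "metadata": {"source": str(path)}}]


# Write a small sample file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("CHAPTER I\n\nNo answer.\n", encoding="utf-8")
    docs = load_text_as_single_document(sample)

print(len(docs))                    # one document for the whole file
print(docs[0]["page_content"][:9])  # slice page_content to preview it
```

Because the whole file lands in one document, previews are usually taken by slicing `page_content`, as the inspection cell in this section does.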


In [None]:
!wget "https://www.gutenberg.org/cache/epub/74/pg74.txt"

--2025-09-04 15:12:17--  https://www.gutenberg.org/cache/epub/74/pg74.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 434401 (424K) [text/plain]
Saving to: ‘pg74.txt’


2025-09-04 15:12:17 (3.42 MB/s) - ‘pg74.txt’ saved [434401/434401]



Next, we will use the `TextLoader` class to load the file.



In [None]:
loader = TextLoader("/content/pg74.txt")
data = loader.load()

In [None]:
# Inspect the loaded data

print(type(data))      # Should be a list
print(len(data))       # Usually 1 for a single text file

# Print a snippet.
pprint(data[0].page_content[7498:8116])


<class 'list'>
1
('CHAPTER I\n'
 '\n'
 '\n'
 '“Tom!”\n'
 '\n'
 'No answer.\n'
 '\n'
 '“TOM!”\n'
 '\n'
 'No answer.\n'
 '\n'
 '“What’s gone with that boy, I wonder? You TOM!”\n'
 '\n'
 'No answer.\n'
 '\n'
 'The old lady pulled her spectacles down and looked over them about the\n'
 'room; then she put them up and looked out under them. She seldom or\n'
 'never looked _through_ them for so small a thing as a boy; they were\n'
 'her state pair, the pride of her heart, and were built for “style,” not\n'
 'service—she could have seen through a pair of stove-lids just as well.\n'
 'She looked perplexed for a moment, and then said, not fiercely, but\n'
 'still loud enough for the furniture to hear:\n'
 '\n'
 '“Well, I lay if I get hold of you I’ll—”')


## Load from PDF files


LangChain provides several classes for loading PDFs. Here, we are using the `PyMuPDFLoader`.


According to the LangChain documentation, `PyMuPDFLoader` is the fastest of the PDF parsing options. It returns one document per page, each with detailed metadata about the PDF and that page.
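
The `page` value stored in each document's metadata is 0-based, which is easy to confuse with the page number printed in a PDF viewer. The sketch below shows one way to look a page up by its 1-based number. It uses plain dictionaries as stand-ins for loaded documents, and `get_page` is a hypothetical helper, not part of LangChain.

```python
def get_page(docs, page_number):
    """Return the document for a 1-based page number, or None if absent."""
    for doc in docs:
        if doc["metadata"]["page"] == page_number - 1:  # metadata is 0-based
            return doc
    return None


# Three stand-in pages shaped like the loader's per-page output.
pages = [
    {"page_content": f"text of page {i + 1}",
     "metadata": {"page": i, "total_pages": 3}}
    for i in range(3)
]

second = get_page(pages, 2)
print(second["page_content"])
print(get_page(pages, 4))  # no such page
```

Looking pages up through the metadata rather than by list position keeps the code correct even if the list has been filtered or reordered.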


In [None]:
pdf_url = "https://teses.usp.br/teses/disponiveis/55/55134/tde-19032018-172227/publico/ElsonFelixMendesFilho.pdf"
loader = PyMuPDFLoader(pdf_url)
data = loader.load()

In [None]:
print(data[0])

page_content='PROJETO EVOLUCIONÁRIO DE 
REDES NEURA1S ARTIFICIAIS 
PARA AVALIAÇÃO DE CRÉDITO 
FINANCEIRO 
Elson Feiix Mendes Filho 
Orientador: Prof. Dr. André Carlos Ponce de 
Leon Ferreira de Carvalho 
' metadata={'source': 'https://teses.usp.br/teses/disponiveis/55/55134/tde-19032018-172227/publico/ElsonFelixMendesFilho.pdf', 'file_path': 'https://teses.usp.br/teses/disponiveis/55/55134/tde-19032018-172227/publico/ElsonFelixMendesFilho.pdf', 'page': 0, 'total_pages': 96, 'format': 'PDF 1.6', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'PaperPort 14', 'producer': 'PaperPort 14', 'creationDate': "D:20170207100313-03'00'", 'modDate': "D:20170207124523-02'00'", 'trapped': '', 'encryption': 'Standard V2 R3 128-bit RC4'}


In [None]:
# Print a snippet.
page_number = 10  # 0-based index, i.e. the 11th page of the PDF
page_text = data[page_number].page_content
print(page_text[:1800])  # first 1,800 characters of that page

D(
ABSTRACT
MENDES, E.F.F. (1997). Evolutionary Design of Artificial Neural Networks þr Credit
Evaluation São Carlos, 1997. 85 p. Dissertação (Mestrado) - Instituto de Ciências
Matemáticas de São Carlos, Universidade de São paulo.
he risk of credit evaluation has been estimated empirically or through credit score systems.
However, with the growth of the massive credit market, this activity arnacted more aúention,
mainly due to the increase of indebt rates, which has occasioned largelosses to the donors of the
resources.
Artificial Neural Networks (AIIN) can be trained using avery large quantity of significant
examples. Using this technique, the credit evaluation can be modeled through the Ãamples
found in the historical data of the credit applicants.
Nevertheless, the topology and the learning parameters of ANNs must be adequately set for
an efficient performance to be achieved. Recently Genetic Algorithms (GA) have been proposed
to overcome these problems. These algorithms are based o

## Load from Markdown files


LangChain provides the `UnstructuredMarkdownLoader` to load content from Markdown files.


In [32]:
!wget 'https://raw.githubusercontent.com/ElsonFilho/Python_ML/main/README.md'

--2025-09-05 10:31:14--  https://raw.githubusercontent.com/ElsonFilho/Python_ML/main/README.md
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5610 (5.5K) [text/plain]
Saving to: ‘README.md’


2025-09-05 10:31:14 (34.3 MB/s) - ‘README.md’ saved [5610/5610]



In [None]:
loader = UnstructuredMarkdownLoader("/content/README.md")
data = loader.load()  # without this call, `data` would still hold the PDF pages

In [None]:
print(data)

[Document(metadata={'source': '/content/README.md'}, page_content="Python_ML\n\nThis repository contains a collection of Python notebooks illustrating fundamental concepts and practical applications in machine learning. It is structured into five modules, each focusing on a specific area within the field.\n\nTable of Contents\n\nModule 1: Linear and Logistic Regression\n\nModule 2: Building Supervised Learning Models\n\nModule 3: Building Unsupervised Learning Models\n\nModule 4: Evaluating and Validating Machine Learning Models\n\nModule 5: Complete Project\n\nNotebooks\n\nModule 1: Linear and Logistic Regression\n\nThis module introduces two classical statistical methods foundational to Machine Learning: Linear and Logistic Regression.\n\nLearn how linear regression, pioneered in the 1800s, models linear relationships while logistic regression serves as a classifier. Through implementing these models, understand their limitations and gain insight into why modern machine-learning mode

## Load from JSON files



`JSONLoader` uses a [jq](https://en.wikipedia.org/wiki/Jq_(programming_language)) expression, passed as `jq_schema`, to select the parts of a JSON file that become documents. It relies on the `jq` Python package installed in the setup step.
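
To make the jq expression less opaque: a schema of the form `.[] | {key1, key2}` iterates over the elements of the top-level array and builds one object per element containing only the named keys. In jq, a key that is missing from a record evaluates to `null`. The plain-Python sketch below mimics that behavior. The sample records and the helper `select_keys` are hypothetical and only illustrate the semantics; they are not how `JSONLoader` is implemented.

```python
import json

# Hypothetical sample: the second record lacks "sepal_length".
records = json.loads(
    '[{"sepal_length": 5.1, "species": "setosa"}, {"species": "virginica"}]'
)


def select_keys(items, keys):
    """Mimic `.[] | {k1, k2}`: one object per element, None for missing keys."""
    return [{key: item.get(key) for key in keys} for item in items]


selected = select_keys(records, ["sepal_length", "species"])
for row in selected:
    print(json.dumps(row))  # None is serialized as JSON null
```

A practical consequence is that a misspelled or differently cased key does not raise an error; it silently produces `null` values.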


In [None]:
!wget 'https://raw.githubusercontent.com/gitrows/data/master/iris.json'

--2025-09-04 16:41:04--  https://raw.githubusercontent.com/gitrows/data/master/iris.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15802 (15K) [text/plain]
Saving to: ‘iris.json.1’


2025-09-04 16:41:04 (9.12 MB/s) - ‘iris.json.1’ saved [15802/15802]



First, let's use `pprint` to take a look at the JSON file and its structure.


In [None]:
file_path = '/content/iris.json'
data = json.loads(Path(file_path).read_text())
pprint(data[:2])  # preview the first two records (assumes a top-level list)

In [None]:
jq_schema = '.[] | {sepal_length, sepal_width, petal_length, petal_width, species}'

loader = JSONLoader(file_path=file_path, jq_schema=jq_schema, text_content=False)
data = loader.load()

In [None]:
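# Print the first characters of the first document
print(str(data[0].page_content)[:200])

{"sepal_length": null, "sepal_width": null, "petal_length": null, "petal_width": null, "species": "setosa"}

Note that every numeric field is `null`. jq returns `null` for keys that do not exist in a record, so this output indicates that the numeric key names in `jq_schema` do not match the key names used in the file. Check the key names in the raw JSON and adjust the schema accordingly.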


## Load from CSV files


CSV files are a common format for storing tabular data. The `CSVLoader` provides a convenient way to read and process this data.


In [None]:
!wget 'https://raw.githubusercontent.com/ElsonFilho/Python_ML/refs/heads/main/data/Cust_Segmentation.csv'

--2025-09-04 16:45:53--  https://raw.githubusercontent.com/ElsonFilho/Python_ML/refs/heads/main/data/Cust_Segmentation.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35017 (34K) [text/plain]
Saving to: ‘Cust_Segmentation.csv’


2025-09-04 16:45:53 (3.24 MB/s) - ‘Cust_Segmentation.csv’ saved [35017/35017]



In [None]:
loader = CSVLoader(file_path='/content/Cust_Segmentation.csv')
data = loader.load()

In [None]:
for doc in data[:5]:
    print(doc.page_content)
    print("-----")

Customer Id: 1
Age: 41
Edu: 2
Years Employed: 6
Income: 19
Card Debt: 0.124
Other Debt: 1.073
Defaulted: 0.0
Address: NBA001
DebtIncomeRatio: 6.3
-----
Customer Id: 2
Age: 47
Edu: 1
Years Employed: 26
Income: 100
Card Debt: 4.582
Other Debt: 8.218
Defaulted: 0.0
Address: NBA021
DebtIncomeRatio: 12.8
-----
Customer Id: 3
Age: 33
Edu: 2
Years Employed: 10
Income: 57
Card Debt: 6.111
Other Debt: 5.802
Defaulted: 1.0
Address: NBA013
DebtIncomeRatio: 20.9
-----
Customer Id: 4
Age: 29
Edu: 2
Years Employed: 4
Income: 19
Card Debt: 0.681
Other Debt: 0.516
Defaulted: 0.0
Address: NBA009
DebtIncomeRatio: 6.3
-----
Customer Id: 5
Age: 47
Edu: 1
Years Employed: 31
Income: 253
Card Debt: 9.308
Other Debt: 8.908
Defaulted: 0.0
Address: NBA008
DebtIncomeRatio: 7.2
-----


When you load data from a CSV file, the loader typically creates a separate `Document` object for each row of data in the CSV.
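
The row-per-document behavior can be pictured with the standard library alone. The sketch below is a simplified stand-in, not `CSVLoader`'s implementation: `rows_to_documents` is a hypothetical helper, and the three-column sample reuses the first two rows of the output above.

```python
import csv
import io

# Small sample with the same header style as the file used above.
csv_text = "Customer Id,Age,Income\n1,41,19\n2,47,100\n"


def rows_to_documents(text, source="sample.csv"):
    """Turn each CSV row into one document of 'column: value' lines."""
    reader = csv.DictReader(io.StringIO(text))
    docs = []
    for i, row in enumerate(reader):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append(
            {"page_content": content, "metadata": {"source": source, "row": i}}
        )
    return docs


docs = rows_to_documents(csv_text)
print(len(docs))
print(docs[0]["page_content"])
```

Keeping the column name next to each value means every document is self-describing, which helps when individual rows are later retrieved out of context.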


## UnstructuredCSVLoader


In contrast to `CSVLoader`, which treats each row as an individual document with headers defining the data, `UnstructuredCSVLoader` considers the entire CSV file as a single unstructured table element. This approach is beneficial when you want to analyze the data as a complete table rather than as separate entries.


In [None]:
loader = UnstructuredCSVLoader(
    file_path="/content/Cust_Segmentation.csv", mode="elements"
)
data = loader.load()

In [None]:
data[0].page_content

'\n\n\nCustomer Id\nAge\nEdu\nYears Employed\nIncome\nCard Debt\nOther Debt\nDefaulted\nAddress\nDebtIncomeRatio\n\n\n1\n41\n2\n6\n19\n0.124\n1.073\n0.0\nNBA001\n6.3\n\n\n2\n47\n1\n26\n100\n4.582\n8.218\n0.0\nNBA021\n12.8\n\n\n3\n33\n2\n10\n57\n6.111\n5.802\n1.0\nNBA013\n20.9\n\n\n4\n29\n2\n4\n19\n0.681\n0.516\n0.0\nNBA009\n6.3\n\n\n5\n47\n1\n31\n253\n9.308\n8.908\n0.0\nNBA008\n7.2\n\n\n6\n40\n1\n23\n81\n0.998\n7.831\n\nNBA016\n10.9\n\n\n7\n38\n2\n4\n56\n0.442\n0.454\n0.0\nNBA013\n1.6\n\n\n8\n42\n3\n0\n64\n0.279\n3.945\n0.0\nNBA009\n6.6\n\n\n9\n26\n1\n5\n18\n0.575\n2.215\n\nNBA006\n15.5\n\n\n10\n47\n3\n23\n115\n0.653\n3.947\n0.0\nNBA011\n4.0\n\n\n11\n44\n3\n8\n88\n0.285\n5.083\n1.0\nNBA010\n6.1\n\n\n12\n34\n2\n9\n40\n0.374\n0.266\n\nNBA003\n1.6\n\n\n13\n24\n1\n7\n18\n0.526\n0.643\n0.0\nNBA000\n6.5\n\n\n14\n46\n1\n6\n30\n1.415\n3.865\n\nNBA019\n17.6\n\n\n15\n28\n3\n2\n20\n0.233\n1.647\n1.0\nNBA000\n9.4\n\n\n16\n24\n1\n1\n16\n0.185\n1.287\n\nNBA005\n9.2\n\n\n17\n29\n1\n1\n17\n0.132\n0.29

In [None]:
# Print the first characters as HTML

print(data[0].metadata["text_as_html"][:525])

<table border="1" class="dataframe">
  <tbody>
    <tr>
      <td>Customer Id</td>
      <td>Age</td>
      <td>Edu</td>
      <td>Years Employed</td>
      <td>Income</td>
      <td>Card Debt</td>
      <td>Other Debt</td>
      <td>Defaulted</td>
      <td>Address</td>
      <td>DebtIncomeRatio</td>
    </tr>
    <tr>
      <td>1</td>
      <td>41</td>
      <td>2</td>
      <td>6</td>
      <td>19</td>
      <td>0.124</td>
      <td>1.073</td>
      <td>0.0</td>
      <td>NBA001</td>
      <td>6.3</td>
    </tr>
    


## Load from a URL/website


LangChain's `WebBaseLoader` extracts the text from an HTML web page and converts it into a document suitable for further processing.


In [None]:
loader = WebBaseLoader("https://www.linkedin.com/pulse/introduction-retrieval-augmented-generation-rag-elson-mendes-filho-unzve/")

In [None]:
data = loader.load()

In [None]:
data

[Document(metadata={'source': 'https://www.linkedin.com/pulse/introduction-retrieval-augmented-generation-rag-elson-mendes-filho-unzve/', 'title': 'An Introduction to Retrieval-Augmented Generation (RAG)', 'description': 'Generative AI, with Large Language Models (LLMs) at its forefront, has significantly transformed the field of Artificial Intelligence. Despite their impressive capabilities in generating human-like text and answering diverse queries, LLMs inherently face limitations.', 'language': 'en'}, page_content='\n\n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAn Introduction to Retrieval-Augmented Generation (RAG)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n              Agree & Join LinkedIn\n            \n\n      By clicking Continue to join or sign in, you agree to LinkedIn’s User Agreement, Privacy Policy, and Cookie Policy.\n    \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n\n\n                  Sign in to view more content\n                \n\n 
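
The output above contains long runs of newlines and indentation left over from the HTML layout, and the text it captured is the site's sign-in prompt rather than the article body, since the loader can only see what the server returns. A small post-processing step can at least tidy the whitespace before further use. The helper below, `collapse_whitespace`, is a hypothetical example and not part of LangChain; the sample string imitates the start of the output above.

```python
import re


def collapse_whitespace(text):
    """Trim each line and squeeze runs of blank lines down to one."""
    lines = [line.strip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    return re.sub(r"\n{3,}", "\n\n", cleaned).strip()


raw = (
    "\n\n\n\n \n\nAn Introduction to RAG\n\n\n\n\n"
    "      Agree & Join LinkedIn\n            \n\n"
)
print(collapse_whitespace(raw))
```

In a notebook this could be applied to each document with `doc.page_content = collapse_whitespace(doc.page_content)`.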

## Load from multiple web pages


You can load multiple web pages in a single call by passing a list of URLs to the loader. It returns one document per URL, in the same order as the URLs provided.


In [14]:
loader = WebBaseLoader(["https://www.theverge.com/ai-artificial-intelligence/771356/deepseek-new-ai-model-2025",
                        "https://www.theverge.com/2024/4/18/24133808/meta-ai-assistant-llama-3-chatgpt-openai-rival",
                        "https://en.wikipedia.org/wiki/Artificial_intelligence"])

data = loader.load()

In [17]:
# Preview each document with metadata and first characters
for i, doc in enumerate(data):
    words = len(doc.page_content.split())
    print(f"Doc {i+1} ({words} words) | Metadata: {doc.metadata}")
    snippet = doc.page_content[:100]
    print(f"Snippet: {snippet}...\n")

Doc 1 (595 words) | Metadata: {'source': 'https://www.theverge.com/ai-artificial-intelligence/771356/deepseek-new-ai-model-2025', 'title': 'DeepSeek is planning to drop its next AI model by the end of 2025, beefing up agent features. | The Verge', 'description': 'Founder Liang Wenfeng is pushing developers to unveil the new system in the final quarter of the year, Bloomberg reports. The model, a successor to the industry-shaking R1, is expected to be able to do more complex, multistep tasks without constant monitoring from users.&nbsp;Right now, AI agents are still slow, glitchy, and far from ideal, but incremental improvements continue to drive the hype.\n[Link: China’s DeepSeek Preps AI Agent for End-2025 to Rival OpenAI | https://www.bloomberg.com/news/articles/2025-09-04/deepseek-targets-ai-agent-release-by-end-of-year-to-rival-openai | Bloomberg]', 'language': 'en-US'}
Snippet: DeepSeek is planning to drop its next AI model by the end of 2025, beefing up agent features. | The ...

D

## Load from Word files


`Docx2txtLoader` extracts the text from a Word (`.docx`) file and returns it as a document suitable for further processing.


In [5]:
!wget "https://calibre-ebook.com/downloads/demos/demo.docx"

--2025-09-05 10:02:07--  https://calibre-ebook.com/downloads/demos/demo.docx
Resolving calibre-ebook.com (calibre-ebook.com)... 166.78.105.155, 2001:4801:7817:72:be76:4eff:fe10:f43a
Connecting to calibre-ebook.com (calibre-ebook.com)|166.78.105.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1311881 (1.3M) [application/vnd.openxmlformats-officedocument.wordprocessingml.document]
Saving to: ‘demo.docx’


2025-09-05 10:02:08 (2.15 MB/s) - ‘demo.docx’ saved [1311881/1311881]



In [20]:
loader = Docx2txtLoader("/content/demo.docx")
data = loader.load()

In [26]:
text = data[0].page_content

# Get lengths
char_count = len(text)
word_count = len(text.split())

# Create a short snippet
snippet = text[:205]

print("Source:", data[0].metadata.get('source'))
print(f"Length: {char_count} characters | {word_count} words")
print("Snippet:", snippet, "...")

Source: /content/demo.docx
Length: 9271 characters | 1553 words
Snippet: Demonstration of DOCX support in calibre

This document demonstrates the ability of the calibre DOCX Input plugin to convert the various typographic features in a Microsoft Word (2007 and newer) document.  ...


## Load from Unstructured Files


Sometimes, we need to load content from various text sources and formats without writing a separate loader for each one. Additionally, when a new file format emerges, we want to save time by not having to write a new loader for it. `UnstructuredFileLoader` addresses this need by supporting the loading of multiple file types. Currently, `UnstructuredFileLoader` can handle text files, PowerPoints, HTML, PDFs, images, and more.
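
For contrast, this is roughly the kind of bookkeeping a general-purpose loader spares you: choosing a format-specific loader from the file extension. The mapping below only pairs suffixes with the loader class names used earlier in this notebook; `LOADER_BY_SUFFIX` and `pick_loader_name` are hypothetical names for illustration, and the sketch returns names as strings rather than instantiating anything.

```python
from pathlib import Path

# Loader class names as used earlier in this notebook, keyed by file suffix.
LOADER_BY_SUFFIX = {
    ".txt": "TextLoader",
    ".pdf": "PyMuPDFLoader",
    ".md": "UnstructuredMarkdownLoader",
    ".json": "JSONLoader",
    ".csv": "CSVLoader",
    ".docx": "Docx2txtLoader",
}


def pick_loader_name(path, default="UnstructuredFileLoader"):
    """Return the name of a format-specific loader, or the generic fallback."""
    return LOADER_BY_SUFFIX.get(Path(path).suffix.lower(), default)


for name in ["/content/README.md", "/content/1065-0.txt", "slides.pptx"]:
    print(name, "->", pick_loader_name(name))
```

A table like this has to be extended for every new format, which is the maintenance cost the paragraph above describes.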


For example, we can load a `.txt` file.


In [27]:
!wget "https://www.gutenberg.org/files/1065/1065-0.txt"

--2025-09-05 10:27:56--  https://www.gutenberg.org/files/1065/1065-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7582 (7.4K) [text/plain]
Saving to: ‘1065-0.txt’


2025-09-05 10:27:57 (1.36 GB/s) - ‘1065-0.txt’ saved [7582/7582]



In [33]:
loader = UnstructuredFileLoader("/content/1065-0.txt")
data = loader.load()

In [31]:
text = data[0].page_content

# Get lengths
char_count = len(text)
word_count = len(text.split())

# Create a short snippet
snippet = text[:201]

print("Source:", data[0].metadata.get('source'))
print(f"Length: {char_count} characters | {word_count} words")
print("Snippet:", snippet, "...")

Source: /content/1065-0.txt
Length: 6370 characters | 1093 words
Snippet: *** START OF THE PROJECT GUTENBERG EBOOK 1065 ***

The Raven

by

Edgar Allan Poe

Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore ...


We can also load a `.md` file.


In [34]:
loader = UnstructuredFileLoader("/content/README.md")
data = loader.load()

In [35]:
text = data[0].page_content

# Get lengths
char_count = len(text)
word_count = len(text.split())

# Create a short snippet
snippet = text[:201]

print("Source:", data[0].metadata.get('source'))
print(f"Length: {char_count} characters | {word_count} words")
print("Snippet:", snippet, "...")

Source: /content/README.md
Length: 5043 characters | 677 words
Snippet: Python_ML

This repository contains a collection of Python notebooks illustrating fundamental concepts and practical applications in machine learning. It is structured into five modules, each focusing  ...


#### Multiple files with different formats


We can even load a list of files with different formats.


In [39]:
files = ["/content/README.md", "/content/1065-0.txt"]

In [40]:
loader = UnstructuredFileLoader(files)
data = loader.load()

In [46]:
# Preview the document with metadata and first characters
# In its default mode, UnstructuredFileLoader merges all inputs into a single Document.
text = data[0].page_content
char_count = len(text)
word_count = len(text.split())
snippet = text[:150]

print("Source:", data[0].metadata.get('source'))
print(f"Length: {char_count} characters | {word_count} words")
print("Snippet:", snippet, "...")

Source: ['/content/README.md', '/content/1065-0.txt']
Length: 11415 characters | 1770 words
Snippet: Python_ML

This repository contains a collection of Python notebooks illustrating fundamental concepts and practical applications in machine learning. ...


## Load from arXiv


To load a paper from arXiv, use `ArxivLoader`. The `query` argument can be an arXiv identifier, as in the cell that follows, and `load_max_docs` caps the number of papers returned.


In [50]:
docs = ArxivLoader(query="2509.04139", load_max_docs=2).load()

In [56]:
for i, doc in enumerate(docs):
    text = doc.page_content
    char_count = len(text)
    word_count = len(text.split())
    summary = doc.metadata.get('Summary', '')[:500]

    print(f"Doc {i+1} | Title: {doc.metadata.get('Title')}")
    print(f"Authors: {doc.metadata.get('Authors')}")
    print(f"Published: {doc.metadata.get('Published')}")
    print(f"Length: {char_count} characters | {word_count} words")
    print(f"Summary (500 chars): {summary}...")

Doc 1 | Title: Enhancing Technical Documents Retrieval for RAG
Authors: Songjiang Lai, Tsun-Hin Cheung, Ka-Chun Fung, Kaiwen Xue, Kwan-Ho Lin, Yan-Ming Choi, Vincent Ng, Kin-Man Lam
Published: 2025-09-04
Length: 27404 characters | 3658 words
Summary (500 chars): In this paper, we introduce Technical-Embeddings, a novel framework designed
to optimize semantic retrieval in technical documentation, with applications in
both hardware and software development. Our approach addresses the challenges
of understanding and retrieving complex technical content by leveraging the
capabilities of Large Language Models (LLMs). First, we enhance user queries by
generating expanded representations that better capture user intent and improve
dataset diversity, thereby en...
