# 1 File-based Loaders

Load documents from local file formats.

Text Files → TextLoader, UnstructuredFileLoader

CSV Files → CSVLoader, UnstructuredCSVLoader

JSON Files → JSONLoader

PDF Files → PyPDFLoader, PDFPlumberLoader, PyMuPDFLoader

Word Documents → Docx2txtLoader, UnstructuredWordDocumentLoader

Excel Files → UnstructuredExcelLoader

HTML/XML → UnstructuredHTMLLoader, UnstructuredXMLLoader, BSHTMLLoader
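The mapping above can be read as a lookup from file extension to loader. The sketch below is only an illustration: the dictionary and the `loader_name_for` helper are hypothetical names, not part of LangChain, and picking the first loader listed for each format is an arbitrary choice.

```python
from pathlib import Path

# Extension -> loader class name, taken from the list above.
# Where several loaders are listed, the first one is used (arbitrary choice).
LOADER_BY_EXTENSION = {
    ".txt": "TextLoader",
    ".csv": "CSVLoader",
    ".json": "JSONLoader",
    ".pdf": "PyPDFLoader",
    ".docx": "Docx2txtLoader",
    ".xlsx": "UnstructuredExcelLoader",
    ".html": "UnstructuredHTMLLoader",
    ".xml": "UnstructuredXMLLoader",
}


def loader_name_for(path: str) -> str:
    """Return the loader class name suggested for a file, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return LOADER_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"No loader registered for {suffix!r}") from None


print(loader_name_for("files/sample.pdf"))
```

Returning the class name as a string keeps the sketch free of third-party imports; in a real project the values could be the loader classes themselves.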



# 1 Text Loader

TextLoader reads a plain-text (.txt) file and returns its contents as a LangChain Document, with the file path stored under the 'source' key of the metadata.


pip install langchain_openai 

pip install langchain_community

pip install ipykernel

In [1]:
from langchain_community.document_loaders import TextLoader

# Read the file as UTF-8 explicitly; the platform default encoding can garble characters such as dashes
loader = TextLoader("files/sample.txt", encoding="utf-8")
doc = loader.load()
doc

[Document(metadata={'source': 'files/sample.txt'}, page_content='Key Takeaways\nHierarchical structure awareness improves classification accuracy for parent-child relationships.\n\nHard negatives are crucial for robust training, especially in semantic relation tasks.\n\nMambaLLM offers speed advantages without major loss in representational quality.\n\nNext Steps\nHyperparameter tuning — Adjust learning rates, batch size, and negative sampling ratio.\n\nModel regularization — Introduce dropout or layer freezing to reduce overfitting.\n\nExtended evaluation — Test on unseen WordNet branches to assess generalization.\n\nReal-world integration — Apply model to knowledge graph validation in production scenarios.')]

In [2]:
doc[0].metadata

{'source': 'files/sample.txt'}

In [3]:
print(doc[0].page_content)

Key Takeaways
Hierarchical structure awareness improves classification accuracy for parent-child relationships.

Hard negatives are crucial for robust training, especially in semantic relation tasks.

MambaLLM offers speed advantages without major loss in representational quality.

Next Steps
Hyperparameter tuning — Adjust learning rates, batch size, and negative sampling ratio.

Model regularization — Introduce dropout or layer freezing to reduce overfitting.

Extended evaluation — Test on unseen WordNet branches to assess generalization.

Real-world integration — Apply model to knowledge graph validation in production scenarios.
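
Encoding matters when loading plain text. TextLoader accepts an `encoding` argument, and if a UTF-8 file is decoded with a different codec, non-ASCII characters come out garbled. The sketch below reproduces that effect with the standard library only; cp1252 is an assumption, chosen as a common platform default on Windows.

```python
# A line containing a non-ASCII character (an em dash).
original = "Hyperparameter tuning — Adjust learning rates"

utf8_bytes = original.encode("utf-8")    # how a UTF-8 file is stored on disk
garbled = utf8_bytes.decode("cp1252")    # what a cp1252 reader would see
restored = utf8_bytes.decode("utf-8")    # what a UTF-8 reader sees

print(garbled)
print(restored)
```

Passing `encoding="utf-8"` to TextLoader avoids depending on the platform default.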


# 2 PDF Loader

pip install pypdf


Purpose: Loads .pdf files and extracts the text of each page into its own Document.

When to Use: For text-based (digital) PDFs. Scanned, image-only pages yield little or no text unless OCR is applied first.

In [4]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('files/sample.pdf')

doc = loader.load()
print(doc[0].page_content)

Key Takeaways 
Hierarchical structure awareness improves classification accuracy for parent-child relationships. 
 
Hard negatives are crucial for robust training, especially in semantic relation tasks. 
 
MambaLLM offers speed advantages without major loss in representational quality. 
 
Next Steps 
Hyperparameter tuning — Adjust learning rates, batch size, and negative sampling ratio. 
 
Model regularization — Introduce dropout or layer freezing to reduce overfitting. 
 
Extended evaluation — Test on unseen WordNet branches to assess generalization. 
 
Real-world integration — Apply model to knowledge graph validation in production scenarios.
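
Because image-only pages produce little or no text, it can help to flag them after loading. The following is a minimal sketch, not part of LangChain: `Page` is a hypothetical stand-in exposing the same `page_content` and `metadata` attributes as a Document, and the `min_chars` threshold is an arbitrary assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for a LangChain Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_needing_ocr(pages, min_chars=20):
    """Return the page numbers whose extracted text is shorter than min_chars."""
    return [
        p.metadata.get("page")
        for p in pages
        if len(p.page_content.strip()) < min_chars
    ]


pages = [
    Page("Hierarchical structure awareness improves classification accuracy "
         "for parent-child relationships.", {"page": 0}),
    Page("   \n", {"page": 1}),  # e.g. a scanned page with no text layer
]

print(pages_needing_ocr(pages))  # [1]
```

The same function should work on the list returned by `loader.load()`, assuming each Document carries a `page` entry in its metadata.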


# 3 DOCX Loader (Word Documents)

pip install docx2txt



Purpose: Loads .docx Word files and extracts the text. 

When to Use: For Microsoft Word documents containing formatted or unformatted text.

In [5]:
from langchain_community.document_loaders import Docx2txtLoader

loader = Docx2txtLoader("files/sample.docx")
documents = loader.load()
print(documents[0].page_content)


Key Takeaways

Hierarchical structure awareness improves classification accuracy for parent-child relationships.



Hard negatives are crucial for robust training, especially in semantic relation tasks.



MambaLLM offers speed advantages without major loss in representational quality.



Next Steps

Hyperparameter tuning — Adjust learning rates, batch size, and negative sampling ratio.



Model regularization — Introduce dropout or layer freezing to reduce overfitting.



Extended evaluation — Test on unseen WordNet branches to assess generalization.



Real-world integration — Apply model to knowledge graph validation in production scenarios.


# 4 CSV Loader

pip install "unstructured[all-docs]" 

or 

pip install "unstructured[all]" 




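Purpose: Loads .csv files and converts each row into a Document.

When to Use: For structured data in comma-separated values format.

Note: CSVLoader reads the file with Python's built-in csv module, so the unstructured package is only required for the Unstructured* loaders (for example UnstructuredCSVLoader).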

In [6]:
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(file_path="files/test_data.csv")
doc = loader.load()
print(doc[0].page_content)

child: dog
parent: animal
random_negatives: table
hard_negatives: cat
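
The "column: value" layout above can be reproduced with the standard library. This is a rough sketch of the row-to-text conversion, not CSVLoader's actual implementation; `rows_to_text` is a hypothetical helper and the sample rows mirror the data shown in this document.

```python
import csv
import io

# Sample rows mirroring the data shown in this document.
csv_text = (
    "child,parent,random_negatives,hard_negatives\n"
    "dog,animal,table,cat\n"
    "rose,plant,chair,sunflower\n"
)


def rows_to_text(stream):
    """Turn each CSV row into one 'column: value' block, one line per column."""
    return [
        "\n".join(f"{column}: {value}" for column, value in row.items())
        for row in csv.DictReader(stream)
    ]


contents = rows_to_text(io.StringIO(csv_text))
print(contents[0])
```

Each string in `contents` corresponds to the `page_content` of one Document, which is why a two-row file yields two Documents.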


# 5 Excel Loader

pip install networkx

pip install msoffcrypto-tool

pip install openpyxl 

pip install pandas 

Purpose: Loads .xlsx files into LangChain as documents. With the settings used here, the cell values come back flattened into a single block of text.

When to Use: For spreadsheet data in Microsoft Excel format.

In [7]:
from langchain_community.document_loaders import UnstructuredExcelLoader

loader = UnstructuredExcelLoader("files/test_data.xlsx")
doc = loader.load()

print(doc[0].page_content)

child parent random_negatives hard_negatives dog animal table cat rose plant chair sunflower sparrow bird car pigeon laptop device pen tablet python programming banana java


# 6 JSON Loader

pip install jq

Purpose: Loads .json files and parses them into documents. The jq_schema expression selects which parts of the file become Documents.

When to Use: For structured or semi-structured JSON data.

In [None]:
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path="files/test_data.json",
    jq_schema=".[]",
    text_content=False,
)

docs = loader.load()


for doc in docs:
    print(doc.page_content)


{"child": "dog", "parent": "animal", "random_negatives": "table", "hard_negatives": "cat"}
{"child": "dog", "parent": "animal", "random_negatives": "table", "hard_negatives": "cat"}
{"child": "rose", "parent": "plant", "random_negatives": "chair", "hard_negatives": "sunflower"}
{"child": "sparrow", "parent": "bird", "random_negatives": "car", "hard_negatives": "pigeon"}
{"child": "laptop", "parent": "device", "random_negatives": "pen", "hard_negatives": "tablet"}
{"child": "python", "parent": "programming", "random_negatives": "banana", "hard_negatives": "java"}
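
With `jq_schema=".[]"`, every element of the top-level array becomes one Document, and `text_content=False` lets non-string elements (here, objects) be serialized back to JSON text. The standard-library sketch below approximates that behavior; it does not use jq, and the shortened two-field records are an assumption made for brevity.

```python
import json

# Shortened sample: a top-level array of objects, as in files/test_data.json.
raw = '[{"child": "dog", "parent": "animal"}, {"child": "rose", "parent": "plant"}]'

records = json.loads(raw)                     # ".[]" iterates over these elements
contents = [json.dumps(r) for r in records]   # each object serialized to a string

for text in contents:
    print(text)
```

To index only one field, a schema such as `.[].child` would select the string values directly, in which case the serialization step is unnecessary.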



# 7 HTML / XML Loading

In [None]:
 
from langchain_community.document_loaders import UnstructuredHTMLLoader, UnstructuredXMLLoader

# Unstructured HTML Loader
loader = UnstructuredHTMLLoader("files/test_data.html")
docs = loader.load()
print(docs[0].page_content)

this is title

we are here to watch this page
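
The output keeps only the visible text of the page. The sketch below shows the general idea with the standard library's `html.parser`; it is not how UnstructuredHTMLLoader is implemented, `TextExtractor` is a hypothetical class, and the sample markup is an assumed reconstruction of a page with that title and paragraph.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


# Assumed sample page; the real files/test_data.html is not shown in this document.
html = """<html><head><title>this is title</title><style>p {color: red}</style></head>
<body><p>we are here to watch this page</p></body></html>"""

parser = TextExtractor()
parser.feed(html)
parser.close()

page_text = "\n\n".join(parser.chunks)
print(page_text)
```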


In [23]:
# Unstructured XML Loader
loader = UnstructuredXMLLoader("files/test_data.xml")
docs = loader.load()
print(docs[0].page_content)


Introduction to AI

Noor Saeed


            Artificial Intelligence (AI) is the simulation of human intelligence in machines that are programmed to think and learn.
        

Machine Learning

John Smith


            Machine learning is a subset of AI that enables systems to learn from data without being explicitly programmed.
        

Deep Learning

Mary Johnson


            Deep learning uses neural networks with many layers to analyze various factors of data, especially in image and speech recognition.
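
The flattened output above loses the element names and carries the source file's indentation. When the fields are needed separately, the XML can be parsed directly. This is a minimal sketch with `xml.etree.ElementTree`; the tag names (`articles`, `article`, `title`, `author`, `content`) are assumptions, since the structure of files/test_data.xml is not shown.

```python
import xml.etree.ElementTree as ET

# Assumed structure; only the text values are taken from the output above.
xml_text = """<articles>
  <article>
    <title>Introduction to AI</title>
    <author>Noor Saeed</author>
    <content>
            Artificial Intelligence (AI) is the simulation of human intelligence in machines that are programmed to think and learn.
    </content>
  </article>
</articles>"""

root = ET.fromstring(xml_text)

# One dict per <article>, with surrounding whitespace stripped from each field.
records = [
    {child.tag: (child.text or "").strip() for child in article}
    for article in root.iter("article")
]

print(records[0]["title"])
print(records[0]["author"])
```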
        
