### File Loader (LangChain) for PDF, HTML, and CSV

This code loads PDF, HTML, and CSV files with a dedicated LangChain loader for each type: `PyMuPDFLoader`, `UnstructuredHTMLLoader`, and `CSVLoader`. For every document a loader returns (one per PDF page, one per HTML file, one per CSV row), it prints the first 500 characters to the console as a preview.
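The three display functions in this section differ only in the loader they call and the label they print, so the preview step can be sketched once and reused. The version below is a minimal illustration, not part of the original notebook: the `Document` dataclass is a hypothetical stand-in for the objects a loader returns (only `page_content` and `metadata` are assumed), and `preview` and `label_for` are names invented for this example. It runs without LangChain installed.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def preview(documents, label, limit=500):
    """Return one preview string per document, truncated to `limit` characters."""
    previews = []
    for i, doc in enumerate(documents, start=1):
        previews.append(f"--- {label} {i} ---\n\n{doc.page_content[:limit]}")
    return previews


def label_for(path):
    """Pick a preview label from the file extension (assumed mapping)."""
    labels = {".pdf": "Page", ".html": "HTML Document", ".csv": "CSV Row"}
    return labels[Path(path).suffix.lower()]


# Stand-in data; in the notebook this list would come from loader.load()
docs = [Document("a" * 600, {"source": "sample_pdf.pdf"}), Document("short text")]
for block in preview(docs, label_for("sample_pdf.pdf")):
    print(block)
    print("\n" + "-" * 40 + "\n")  # Same separator the notebook uses
```

Returning the preview strings instead of printing inside the helper keeps the truncation logic easy to check and lets the caller decide how to display it.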

In [18]:
# Install these packages if not done already
#!pip install pymupdf
#!pip install unstructured

In [21]:
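# Note: in newer LangChain releases these loaders are imported from
# langchain_community.document_loaders instead
from langchain.document_loaders import PyMuPDFLoader  # Import PDF loader
from langchain.document_loaders import UnstructuredHTMLLoader  # Import HTML loader
from langchain.document_loaders import CSVLoader  # Import CSV loader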

# Paths for sample csv, pdf, and html files
csv_path = "sample_csv.csv"
pdf_path = "sample_pdf.pdf"
html_path = "sample_html.html"

def load_and_display_pdf(pdf_path):
    # Initialize the PDF loader with the file path
    loader = PyMuPDFLoader(pdf_path)
    # Load the document into a list of pages
    documents = loader.load()
    # Iterate through each page and print the first 500 characters
    for i, doc in enumerate(documents):
        print(f"--- Page {i + 1} ---\n")
        print(doc.page_content[:500])  # Print extracted text snippet
        print("\n" + "-" * 40 + "\n")  # Separator for readability

def load_and_display_html(html_path):
    # Initialize the HTML loader with the file path
    loader = UnstructuredHTMLLoader(html_path)
    # Load the document
    documents = loader.load()
    # Iterate through each document and print the first 500 characters
    for i, doc in enumerate(documents):
        print(f"--- HTML Document {i + 1} ---\n")
        print(doc.page_content[:500])  # Print extracted text snippet
        print("\n" + "-" * 40 + "\n")  # Separator for readability

def load_and_display_csv(csv_path):
    # Initialize the CSV loader with the file path
    loader = CSVLoader(csv_path)
    # Load the document into rows
    documents = loader.load()
    # Iterate through each row and print the first 500 characters
    for i, doc in enumerate(documents):
        print(f"--- CSV Row {i + 1} ---\n")
        print(doc.page_content[:500])  # Print extracted text snippet
        print("\n" + "-" * 40 + "\n")  # Separator for readability

if __name__ == "__main__":
    # Define file paths
    pdf_path = "sample_pdf.pdf"
    html_path = "sample_html.html"
    csv_path = "sample_csv.csv"
    
    # Load and display content from each file type
    print("Loading pdf file......\n")
    load_and_display_pdf(pdf_path)
    print("Loading html file......\n")
    load_and_display_html(html_path)
    print("Loading csv file......\n")
    load_and_display_csv(csv_path)

Loading pdf file......

--- Page 1 ---

Lorem ipsum 
Lorem ipsum dolor sit amet, consectetur adipiscing 
elit. Nunc ac faucibus odio. 
Vestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut
varius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum
condimentum.  Vivamus  dapibus  sodales  ex,  vitae  malesuada  ipsum  cursus
convallis. Maecenas sed egestas nulla, ac condimentum orci.  Mauris diam felis,
vulputate ac suscipit et, iaculis non est. Curabitur semper arcu ac lig

----------------------------------------

--- Page 2 ---

In non mauris justo. Duis vehicula mi vel mi pretium, a viverra erat efficitur. Cras aliquam
est ac eros varius, id iaculis dui auctor. Duis pretium neque ligula, et pulvinar mi placerat
et. Nulla nec nunc sit amet nunc posuere vestibulum. Ut id neque eget tortor mattis
tristique. Donec ante est, blandit sit amet tristique vel, lacinia pulvinar arcu. Pellentesque
scelerisque fermentum erat, id pos