### Business Data Preparation for AI Applications

This program reads data from a PDF and a text file and combines the loaded documents into a single list for easier handling. It then splits the text into smaller chunks and cleans each chunk. The cleaned chunks are ready for an AI application to use, for example a question-answering system. In short, this program demonstrates how to prepare business data for AI use.

In [None]:
1. Document Loaders: PDF and text files are read using `PyPDFLoader` and `TextLoader`.
2. Document Integration: All loaded documents are combined into one list for easy management.
3. Document Transformers: Text is split into manageable chunks, and each chunk is cleaned, before it is passed to AI models.
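To make the third step concrete, the following is a minimal pure-Python sketch of what a chunk size and a chunk overlap mean. The function name `chunk_with_overlap` is hypothetical, and this is not how LangChain's `CharacterTextSplitter` works internally (that class splits on a separator first); it only illustrates the sliding-window idea, assuming sizes are counted in characters.

```python
def chunk_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


# Small illustrative input so the overlap is easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk begins with the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.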

In [None]:
# Install pypdf (required by PyPDFLoader) if it is not already installed.
#!pip install pypdf

In [2]:
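# Import the necessary libraries
import re  # Regular expressions for text cleaning

# Note: newer LangChain releases expose these loaders from
# `langchain_community.document_loaders`; adjust the imports if these fail.
from langchain.document_loaders import PyPDFLoader  # To load and read PDF files
from langchain.document_loaders import TextLoader  # To load and read text files
from langchain.text_splitter import CharacterTextSplitter  # To split text into smaller chunks

# Define a simple text-cleaning function
def clean_text(text):
    """Collapse whitespace and remove characters outside a small whitelist."""
    text = re.sub(r"\s+", " ", text)  # Collapse any run of whitespace (spaces, tabs, newlines) into one space
    # Keep only letters, digits, basic punctuation, and spaces.
    # This also drops apostrophes and hyphens ("can't" becomes "cant").
    text = re.sub(r"[^a-zA-Z0-9.,!? ]", "", text)
    return text.strip()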

# Step 1: Load documents
pdf_loader = PyPDFLoader("business_report.pdf")  # Replace with your business PDF file
pdf_documents = pdf_loader.load()  # Load and parse the PDF content

text_loader = TextLoader("meeting_notes.txt")  # Replace with your business text file
text_documents = text_loader.load()  # Load and parse the text file content

# Step 2: Integrate documents
all_documents = pdf_documents + text_documents

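# Step 3: Transform documents
# chunk_size and chunk_overlap are measured in characters. CharacterTextSplitter
# splits on a separator ("\n\n" by default), so a passage that does not contain
# the separator can come out longer than chunk_size.
splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
transformed_docs = []
for doc in all_documents:
    chunks = splitter.split_text(doc.page_content)  # The document text is stored in 'page_content'
    cleaned_chunks = [clean_text(chunk) for chunk in chunks]  # Clean each chunk
    transformed_docs.extend(cleaned_chunks)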

# Step 4: Preview the transformed data before passing it to an AI application (e.g. question answering)
print("Transformed and cleaned document chunks ready for AI processing:")
for doc in transformed_docs[:5]:  # Print a few examples
    print(doc)

Transformed and cleaned document chunks ready for AI processing:
Adobe Acrobat PDF Files Adobe Portable Document Format PDF is a universal file format that preserves all of the fonts, formatting, colours and graphics of any source document, regardless of the application and platform used to create it. Adobe PDF is an ideal format for electronic document distribution as it overcomes the problems commonly encountered with electronic file sharing.  Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat Reader. Recipients of other file formats sometimes cant open files because they dont have the applications used to create the documents.  PDF files always print correctly on any printing device.  PDF files always display exactly as created, regardless of fonts, software, and operating systems. Fonts, and graphics are not lost due to platform, software, and version incompatibilities.  The free Acrobat Reader is easy to download and can be freely distributed by anyone.  
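
The output above contains "cant" and "dont" because the character whitelist in `clean_text` removes apostrophes. If contractions matter for the downstream model, a gentler variant is one option. The sketch below uses a hypothetical function name, `clean_text_keep_apostrophes`, and assumes that keeping apostrophes and hyphens, and converting curly quotes to straight ones, is acceptable for your data.

```python
import re


def clean_text_keep_apostrophes(text: str) -> str:
    """Like clean_text, but preserves apostrophes and hyphens."""
    # Normalize typographic single quotes to a plain apostrophe first
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r"\s+", " ", text)
    # Keep letters, digits, basic punctuation, apostrophes, hyphens, and spaces
    text = re.sub(r"[^a-zA-Z0-9.,!?'\- ]", "", text)
    return text.strip()


# Illustrative input (not taken from the loaded documents)
sample = "Recipients  sometimes can\u2019t open files\nbecause they don't have the apps (yet)."
cleaned = clean_text_keep_apostrophes(sample)
print(cleaned)
# → Recipients sometimes can't open files because they don't have the apps yet.
```

Whether to keep or drop punctuation depends on the downstream task, so treat the whitelist as a setting to tune rather than a fixed rule.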