### Business Document Processing: Summarization, Keyword Extraction, and Sentiment Analysis

- Loads and processes diverse business documents (PDF, HTML, CSV) for analysis.
- Summarizes document content for quick business insights using GPT-3.5.
- Extracts key business-relevant keywords to aid decision-making.
- Performs sentiment analysis on customer feedback to gauge satisfaction.
- Provides efficient business document analysis through automated text processing.

In [6]:
# Read the OpenAI API key from your text file
with open('C:\\Users\\Shailendra Kadre\\Desktop\\OPEN_AI_KEY.txt') as f:
    api_key = f.read().strip()  # strip() removes any trailing newline or spaces

In [4]:
#api_key

In [8]:
# Import necessary libraries

import os  # Provides functions for interacting with the operating system

# Import document loaders for handling different file types
from langchain.document_loaders import PyPDFLoader, UnstructuredHTMLLoader, CSVLoader

# PyPDFLoader: Used to load PDF files
# UnstructuredHTMLLoader: Used to load and process HTML files
# CSVLoader: Used to load CSV files

# Import a text splitter to break documents into manageable chunks
from langchain.text_splitter import CharacterTextSplitter
# CharacterTextSplitter: Splits text into smaller segments while maintaining context

# Import ChatOpenAI for interacting with OpenAI's language models
from langchain_openai import ChatOpenAI

# Import prompt-related utilities to format queries for the AI model
from langchain.prompts import PromptTemplate  # Creates structured prompts for AI input

# Import LLMChain to execute AI workflows using prompts
from langchain.chains import LLMChain  

# Load and parse documents from different file formats
# Raw strings (r"...") stop the backslashes in Windows paths from being read as escape sequences

pdf_loader = PyPDFLoader(r"C:\AA SK 53\After IBA Aug 21 2024\Christ University\MDPs\DL & GenAI\sample_pdf.pdf")  # Create a loader for the PDF file
pdf_docs = pdf_loader.load()  # Read content from the PDF file

html_loader = UnstructuredHTMLLoader(r"C:\AA SK 53\After IBA Aug 21 2024\Christ University\MDPs\DL & GenAI\sample_html.html")  # Create a loader for the HTML file
html_docs = html_loader.load()  # Read content from the HTML file

csv_loader = CSVLoader(r"C:\AA SK 53\After IBA Aug 21 2024\Christ University\MDPs\DL & GenAI\sample_csv.csv")  # Create a loader for the CSV file
csv_docs = csv_loader.load()  # Read content from the CSV file (one document per row)

# Text Splitting for easier processing and model handling

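splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
# Targets chunks of up to about 500 characters with a 50-character overlap to preserve context.
# The splitter breaks on its default separator ("\n\n"), so text without that separator can yield longer chunks.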

pdf_chunks = splitter.split_documents(pdf_docs)  # Split PDF content into smaller chunks
html_chunks = splitter.split_documents(html_docs)  # Split HTML content into smaller chunks
csv_chunks = splitter.split_documents(csv_docs)  # Split CSV content into smaller chunks

# Initialize the OpenAI chat model
llm = ChatOpenAI(model="gpt-3.5-turbo", openai_api_key=api_key)  
# - Uses GPT-3.5-turbo model with the given API key

# Define a prompt template for summarization
summary_prompt = PromptTemplate(
    input_variables=["text"],  
    template="Summarize the following business document:\n{text}"  
    # Creates a template where {text} will be replaced with actual document content
)

# Create an LLM chain for summarization
summary_chain = LLMChain(llm=llm, prompt=summary_prompt)  
# Uses the ChatOpenAI model with the defined summarization prompt

# Generate a summary of the first chunk of each document type
pdf_summary = summary_chain.invoke({"text": pdf_chunks[0].page_content})  
html_summary = summary_chain.invoke({"text": html_chunks[0].page_content})  
csv_summary = summary_chain.invoke({"text": csv_chunks[0].page_content})  

# Print the generated summaries
print("\n--- PDF Summary ---\n", pdf_summary)  
print("\n--- HTML Summary ---\n", html_summary)  
print("\n--- CSV Summary ---\n", csv_summary)  

# Define a prompt template for keyword extraction
keyword_prompt = PromptTemplate(
    input_variables=["text"],  
    template="Extract the top 5 keywords from the following document:\n{text}"  
    # Creates a structured request for extracting keywords
)

# Create an LLM chain for keyword extraction
keyword_chain = LLMChain(llm=llm, prompt=keyword_prompt)  

# Extract keywords from the first chunk of each document type
pdf_keywords = keyword_chain.invoke({"text": pdf_chunks[0].page_content})  
html_keywords = keyword_chain.invoke({"text": html_chunks[0].page_content})  
csv_keywords = keyword_chain.invoke({"text": csv_chunks[0].page_content})  

# Print extracted keywords
print("\n--- PDF Keywords ---\n", pdf_keywords)  
print("\n--- HTML Keywords ---\n", html_keywords)  
print("\n--- CSV Keywords ---\n", csv_keywords)  

# Define a prompt template for sentiment analysis
sentiment_prompt = PromptTemplate(
    input_variables=["text"],  
    template="Analyze the sentiment of the following customer feedback and rate as Positive, Neutral, or Negative:\n{text}"  
    # Asks AI to classify sentiment as Positive, Neutral, or Negative
)

# Create an LLM chain for sentiment analysis
sentiment_chain = LLMChain(llm=llm, prompt=sentiment_prompt)  

# Perform sentiment analysis on the first chunk of the CSV document
csv_sentiment = sentiment_chain.invoke({"text": csv_chunks[0].page_content})  

# Print the sentiment analysis result
print("\n--- CSV Sentiment Analysis ---\n", csv_sentiment['text']) 




--- PDF Summary ---
 {'text': 'The document discusses various topics related to business, such as market analysis and customer behavior. It also includes a chart with rows and columns displaying numerical data.'}

--- HTML Summary ---
 {'text': 'The document discusses the comparison between medicine and government with wisdom being the ultimate goal. It questions why people do not prefer to talk about important matters when they have the opportunity to do so. It also raises the issue of whether prolonged suffering makes one more miserable, or if prolonged pleasure is preferred.'}

--- CSV Summary ---
 {'text': 'The document provides information related to accounting and finance industry. It may cover topics such as financial analysis, budgeting, auditing, taxation, and financial reporting. The document could provide insights into best practices, industry trends, regulations, and challenges within the accounting and finance sector.'}

--- PDF Keywords ---
 {'text': '1. Lorem\n2. Maecen

- The LLMChain in the code acts as a structured pipeline that takes an input, processes it using a language model, and returns an output.
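
The pipeline idea can be illustrated without calling any API. The sketch below is a hypothetical stand-in, not LangChain's implementation: `TinyChain` and the fake model are invented names, and the assumption is that the chain fills the prompt template, passes the result to the model, and returns the model's reply under a `text` key merged with the inputs. Under that assumption, an input that is also named `text` is overwritten by the output, which would match the single-key dictionaries printed above.

```python
from typing import Callable, Dict


class TinyChain:
    """Hypothetical stand-in for the prompt -> model -> output flow (not LangChain code)."""

    def __init__(self, llm: Callable[[str], str], template: str) -> None:
        self.llm = llm            # any callable that maps a prompt string to a reply string
        self.template = template  # prompt text with {placeholders}

    def invoke(self, inputs: Dict[str, str]) -> Dict[str, str]:
        prompt = self.template.format(**inputs)  # 1. fill the placeholders
        reply = self.llm(prompt)                 # 2. run the model on the finished prompt
        return {**inputs, "text": reply}         # 3. return inputs merged with the output


def fake_llm(prompt: str) -> str:
    """Fake model for illustration: reports how long the prompt was."""
    return f"(model saw {len(prompt)} characters)"


chain = TinyChain(fake_llm, "Summarize the following business document:\n{text}")
result = chain.invoke({"text": "Quarterly revenue rose."})
print(result["text"])
```

Swapping `fake_llm` for a real model call is the only change needed to turn the sketch into a working chain, which is the role `ChatOpenAI` plays in the notebook.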

- Because the code passes only pdf_chunks[0].page_content to the chain, it summarizes just the first chunk, not the entire document.
- To summarize the full document, you need to iterate through all chunks and combine their summaries.
- A final summary can then be generated by summarizing the combined outputs.
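
The steps above can be sketched as a small helper. This is a minimal sketch under stated assumptions rather than the notebook's own method: `summarize_document` is a hypothetical name, and the summarizer is passed in as a plain callable so the logic can be tried without an API key. The commented lines at the end show one possible way to wire it to `summary_chain` and `pdf_chunks` from the code above.

```python
from typing import Callable, Sequence


def summarize_document(
    chunk_texts: Sequence[str],
    summarize: Callable[[str], str],
) -> str:
    """Summarize every chunk, then summarize the combined partial summaries."""
    if not chunk_texts:
        return ""
    # Step 1: one partial summary per chunk.
    partial_summaries = [summarize(text) for text in chunk_texts]
    if len(partial_summaries) == 1:
        return partial_summaries[0]  # nothing to combine
    # Step 2: join the partial summaries and summarize them once more.
    combined = "\n".join(partial_summaries)
    return summarize(combined)


# Possible wiring to the notebook's objects (makes one API call per chunk plus one final call):
# full_pdf_summary = summarize_document(
#     [chunk.page_content for chunk in pdf_chunks],
#     lambda text: summary_chain.invoke({"text": text})["text"],
# )
# print(full_pdf_summary)
```

For long documents the combined partial summaries can themselves exceed the model's context window; in that case the second step would need to be repeated on batches of summaries, which this sketch does not do.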