# Knowledgraph Pipeline


## Contributors
Tom Hargrove
Carl Koster
Hantao Lin
Allen Wang

## Description

This notebook contains a Python script that automates the summarization of PDF files using natural language processing (NLP) techniques. At a high level, the code does the following:

Setup: The code sets up the environment by importing necessary libraries, loading environment variables, and defining constants for file paths and prompts.

PDF File Selection: It selects a random subset of PDF files from a specified directory (root_dir).

Text Extraction: For each selected PDF file, it extracts the text content using the PyPDF2 library.

Text Summarization: The extracted text is then summarized using a pre-trained NLP model (facebook/bart-large-cnn) and the transformers library. The summarization is done in chunks to handle large texts.

Logging and Output: The code logs the processing status (success or failure) and time for each file in a log file. It also saves the summarized text to a new file in a specified output directory (summaries_dir).

Main Function: The main function orchestrates the entire process, from selecting PDF files to summarizing and logging the results.

The code is designed to automate the summarization of PDF files, making it easier to extract key information and insights from large documents.
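
The file-selection step described above can be sketched in isolation. This is a minimal illustration rather than the notebook's own code: the helper name `select_pdfs` and its `seed` parameter are hypothetical, and it matches the `.pdf` extension case-insensitively, which is an assumption about the desired behavior.

```python
import random
from pathlib import Path

def select_pdfs(root_dir, k, seed=None):
    """Hypothetical helper: recursively list PDFs under root_dir and sample up to k of them."""
    pdfs = sorted(
        p for p in Path(root_dir).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    # Never ask for more files than exist, otherwise sample() raises ValueError
    return rng.sample(pdfs, min(k, len(pdfs)))
```

Passing a seed is optional; leaving it as `None` gives a different random subset on each run, which is closer to what the notebook does.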

## Load Packages & Set Variables & Test Connections


In [4]:
# %pip install --upgrade --quiet  langchain langchain-community langchain-openai langchain-experimental neo4j

import datetime
import glob
import os
import time
import torch
import random
import re
import logging
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
#from concurrent.futures import ThreadPoolExecutor
from PyPDF2 import PdfReader
from dotenv import load_dotenv

#from google.colab import drive

load_dotenv()


root_dir = "C:\\Users\\ckost\\My Drive\\Data\\EA-KNOWLEDGE-BOT\\LIBRARY"  # Folder searched recursively for PDFs
output_subfolder = "B:\\OneDrive\\Documents\\GitHub\\EA-Knowledge-Bot\\Final Deliverables\\Code"  # Where output folders are created
summaries_subfolder = "SUMMARIES"  # Subfolder for the summary text files
logs_subfolder = "LOGS"  # Subfolder for the log files
num_files_to_process = 10  # The number of random PDFs to summarize

PROMPT = """Summarize the content of this document, focusing on the main points and key entities such as people, organizations, and software products. Extract and highlight the relationships between these entities, including hierarchical, collaborative, and adversarial relationships. The purpose of this summary is to facilitate the extraction of entities and relationships using LangChain, which will be used to populate a Neo4J knowledge graph. Ensure the summary is concise and clear, with a focus on accuracy and precision in describing the entities and their relationships. Limit the summary to 3000 words or fewer."""



# Define Functions


**extract_text_from_pdf(file_path)** This function reads text from a PDF file located at file_path. It uses the PdfReader class from the PyPDF2 library to extract text from each page of the PDF and concatenates it into a single string. If an error occurs during the process, it prints an error message and returns None.

**clean_text(text)** This function cleans and preprocesses the input text by:
    * Removing special characters and unnecessary spaces
    * Replacing multiple spaces with a single space
    * Removing leading and trailing spaces

It returns the cleaned text.
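
As a small illustration of the cleaning steps above, the sketch below applies them to a short string. The function name `clean_text_demo` and the sample input are hypothetical. Punctuation is stripped before whitespace is collapsed, so that removing a standalone symbol cannot leave a double space behind.

```python
import re

def clean_text_demo(text: str) -> str:
    """Hypothetical stand-in that mirrors the cleaning steps described above."""
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation and other special characters
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace into one space
    return text.strip()                  # trim leading and trailing spaces

sample = "  Enterprise   Architecture (EA):\n a primer!  "
print(clean_text_demo(sample))  # Enterprise Architecture EA a primer
```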

**chunk_text(text, chunk_size=1000)** This function splits the input text into smaller chunks of size chunk_size (default is 1000 characters). It returns a list of chunks.
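
Fixed-width slicing is simple, but it can cut a word in half at a chunk boundary. The sketch below shows that behavior next to a word-aware variant. `chunk_on_words` is a hypothetical alternative, not part of the notebook, and it keeps a single word whole even when that word is longer than `chunk_size`.

```python
def chunk_text(text, chunk_size=1000):
    # Fixed-width slices; the last chunk may be shorter than chunk_size
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def chunk_on_words(text, chunk_size=1000):
    """Hypothetical variant: pack whole words into chunks of roughly chunk_size characters."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= chunk_size or not current:
            current = candidate      # the word still fits (or is the first word of a chunk)
        else:
            chunks.append(current)   # close the current chunk and start a new one
            current = word
    if current:
        chunks.append(current)
    return chunks

text = "the knowledge graph pipeline"
print(chunk_text(text, 15))      # ['the knowledge g', 'raph pipeline']
print(chunk_on_words(text, 15))  # ['the knowledge', 'graph pipeline']
```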

**summarize_text(text, prompt)** This function summarizes the input text using the `summarizer` pipeline (facebook/bart-large-cnn) initialized in the cell below. The `prompt` argument is accepted but is not currently passed to the model. The function:
    * Replaces special characters with spaces (except for apostrophes, which clean_text then removes)
    * Cleans the text using the clean_text function
    * Splits the text into chunks using the chunk_text function
    * Summarizes each chunk using the summarization model
    * Concatenates the summaries into a single string
    * Returns the summarized text, or None if an error occurs
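
The chunk-then-summarize flow above can be sketched independently of any model. In this minimal illustration, `summarize_in_chunks` and its `summarize_chunk` callable are hypothetical names. The callable stands in for whatever produces a summary of one chunk, so the flow can be tried without downloading a model.

```python
from typing import Callable, List

def summarize_in_chunks(text: str,
                        summarize_chunk: Callable[[str], str],
                        chunk_size: int = 1000) -> str:
    """Hypothetical sketch: split text, summarize each piece, join the results."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    parts: List[str] = []
    for chunk in chunks:
        if chunk.strip():  # skip chunks that contain only whitespace
            parts.append(summarize_chunk(chunk).strip())
    return " ".join(parts)

# Stand-in "summarizer" for illustration only: upper-cases each chunk
print(summarize_in_chunks("abcdef", str.upper, chunk_size=4))  # ABCD EF
```

With the notebook's pipeline, the callable could plausibly be `lambda c: summarizer(c, max_length=150, min_length=30, do_sample=False)[0]["summary_text"]`, which is the same call the notebook makes per chunk.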

**log_processed_file(file_path, status, log_file, processing_time)** This function appends one comma-separated line to the log file with the following fields:
    * Current timestamp
    * Processing time in seconds
    * File name (the base name of file_path)
    * The step label SUMMARIZE
    * Status ("SUCCESS" or "FAILED")

**has_been_processed(file_path, log_file)** This function checks whether a file has already been summarized successfully by scanning the log file for a line that mentions the file together with the status SUCCESS. It returns True if such a line is found; otherwise, it returns False.
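
To make the interplay between the two logging functions concrete, the sketch below writes entries in the same comma-separated layout and then checks them. The names `log_entry` and `already_summarized` are hypothetical. The check compares the logged file-name field exactly and treats a missing log file as "nothing processed yet", both of which are assumptions about the intended behavior.

```python
import datetime
import os
import tempfile

def log_entry(file_path, status, processing_time):
    # Same layout as the log: timestamp, seconds, file name, step, status
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    return f"{timestamp}, {processing_time:.2f}, {os.path.basename(file_path)}, SUMMARIZE, {status}\n"

def already_summarized(file_path, log_file):
    """Hypothetical check: True if the log has a SUCCESS line for this file name."""
    if not os.path.exists(log_file):
        return False  # no log yet, for example on a first run
    name = os.path.basename(file_path)
    with open(log_file, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            head = line.rstrip("\n").split(", ", 2)  # timestamp, seconds, remainder
            if len(head) != 3:
                continue
            tail = head[2].rsplit(", ", 2)           # file name, step, status
            if len(tail) == 3 and tail[0] == name and tail[2] == "SUCCESS":
                return True
    return False

with tempfile.TemporaryDirectory() as tmp:
    log_file = os.path.join(tmp, "processed_files_log.txt")
    print(already_summarized("library/report.pdf", log_file))  # False
    with open(log_file, "a", encoding="utf-8") as f:
        f.write(log_entry("library/report.pdf", "SUCCESS", 12.5))
        f.write(log_entry("library/notes.pdf", "FAILED", 0))
    print(already_summarized("library/report.pdf", log_file))  # True
    print(already_summarized("library/notes.pdf", log_file))   # False
```

Splitting from both ends keeps the check working even if a file name itself contains a comma.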

In [5]:


# Initialize the summarization pipeline
model_name = "facebook/bart-large-cnn"  # Replace with your model of choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, device=0 if torch.cuda.is_available() else -1)

def chunk_text(text, chunk_size=1000):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

def log_processed_file(file_path, status, log_file, processing_time):
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    file_name = os.path.basename(file_path)
    log_entry = f"{timestamp}, {processing_time:.2f}, {file_name}, SUMMARIZE, {status}\n"
    
    with open(log_file, 'a') as f:
        f.write(log_entry)


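def summarize_text(text, prompt):
    # Note: prompt is currently unused; it is not passed to the summarizer
    try:
        # Replace everything except letters, digits, whitespace, and apostrophes with a space
        text = re.sub(r"[^a-zA-Z0-9\s']", ' ', text)
        text = clean_text(text)
        chunks = chunk_text(text)
        summary = ""
        for chunk in chunks:
            # Summarize each chunk deterministically (no sampling) and append the result
            result = summarizer(chunk, max_length=150, min_length=30, do_sample=False)
            summary += result[0]['summary_text'] + " "
        return summary.strip()
    except Exception as e:
        print(f"Error summarizing text: {e}")
        return None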

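def has_been_processed(file_path, log_file):
    # The log stores only the file name (see log_processed_file), so compare on that
    file_name = os.path.basename(file_path)
    try:
        with open(log_file, 'r', encoding='utf-8', errors='ignore') as f:
            for line in f:
                if file_name in line and "SUCCESS" in line:
                    return True
    except FileNotFoundError:
        # No log file yet (e.g. on the first run): nothing has been processed
        pass
    except OSError as e:
        print(f"Error reading the log file: {e}")
    return False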


# Function to read text from a PDF file
def extract_text_from_pdf(file_path):
    try:
        reader = PdfReader(file_path)
        text = ""
        for page in reader.pages:
            text += page.extract_text() or ""  # Guard against pages with no extractable text
        return text
    except Exception as e:
        print(f"Error reading {file_path}: {e}")
        return None

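# Function to clean and preprocess text
def clean_text(text):
    # Remove punctuation and other special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Collapse runs of whitespace into a single space (done last so no double spaces remain)
    text = re.sub(r'\s+', ' ', text)
    # Remove leading and trailing spaces
    text = text.strip()
    return text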

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


# MAIN FUNCTION FOR READING & SUMMARIZING PDF FILES

In [6]:
def main(root_dir, output_subfolder, summaries_subfolder, logs_subfolder):
    

    # Get the current directory
    current_dir = os.getcwd()

    # Define the output and logs directories
    output_dir = os.path.join(current_dir, output_subfolder)
    summaries_dir = os.path.join(output_dir, summaries_subfolder)
    logs_dir = os.path.join(output_dir, logs_subfolder)

    # Ensure the output and logs directories exist
    os.makedirs(output_dir, exist_ok=True)
    os.makedirs(summaries_dir, exist_ok=True)
    os.makedirs(logs_dir, exist_ok=True)

    # Set up logging configuration once the logs directory exists
    logging.basicConfig(
        filename=os.path.join(logs_dir, 'processing.log'),
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s'
    )

    # Step 1: Enumerate PDF files
    file_paths = [y for x in os.walk(root_dir) for y in glob.glob(os.path.join(x[0], '*.pdf'))]
    selected_files = random.sample(file_paths, min(num_files_to_process, len(file_paths)))

    processed_files_count = 0  # Counter for processed files

    # Path to processed files log
    log_file = os.path.join(logs_dir, 'processed_files_log.txt')

    # Step 2: Extract text from selected PDFs and summarize
    for file_path in selected_files:
        if has_been_processed(file_path, log_file):
            logging.info(f"Skipping file: {file_path}")
            print(f"Skipping file: {file_path}")
            continue

        # Capture start time
        start_time = time.time()

        text = extract_text_from_pdf(file_path)
        if text:
            logging.info(f"Processing PDF: {os.path.basename(file_path)}")
            processed_files_count += 1  # Increment counter

            summary = summarize_text(text, PROMPT)
            if summary:
                # Save the summary to a text file in the specified summaries directory
                summary_filename = os.path.join(summaries_dir, os.path.splitext(os.path.basename(file_path))[0] + "_summary.txt")
                with open(summary_filename, 'w', encoding='utf-8') as f:
                    f.write(summary)

                # Capture end time
                end_time = time.time()
                processing_time = end_time - start_time

                # Log the processed file as SUCCESS
                log_processed_file(file_path, "SUCCESS", log_file, processing_time)
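            else:
                # Summarization failed: log the elapsed time with status FAILED
                log_processed_file(file_path, "FAILED", log_file, time.time() - start_time)
        else:
            # Text extraction failed or returned no text
            log_processed_file(file_path, "FAILED", log_file, time.time() - start_time)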

    # Print the number of processed files
    print(f"Total number of processed PDF files: {processed_files_count}")

if __name__ == "__main__":
    main(root_dir, output_subfolder, summaries_subfolder, logs_subfolder)


Your max_length is set to 150, but you input_length is only 126. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=63)

Total number of processed PDF files: 10
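
The `max_length` warnings printed during the run appear when a chunk contains fewer tokens than the fixed `max_length=150`. One possible way to avoid them is to derive the length limits from each chunk's token count, following the halving that the warning itself suggests. The helper `summary_lengths` below is a hypothetical sketch, not part of the notebook. In the real pipeline the token count could plausibly be obtained with `len(tokenizer(chunk)["input_ids"])`.

```python
def summary_lengths(input_tokens, max_cap=150, min_floor=30):
    """Hypothetical helper: pick (max_length, min_length) that fit a chunk's token count."""
    # Aim for about half the input length, but never exceed the cap
    max_length = max(2, min(max_cap, input_tokens // 2))
    # min_length must stay below max_length
    min_length = min(min_floor, max_length - 1)
    return max_length, min_length

for n in (126, 28, 400):
    print(n, summary_lengths(n))
# 126 (63, 30)
# 28 (14, 13)
# 400 (150, 30)
```

The returned values would then be passed as `max_length` and `min_length` in the per-chunk `summarizer(...)` call.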
