<a href="https://colab.research.google.com/github/Masoudrzpn/PDF_Summarization_Query_LLM/blob/main/PDF_Summarization_Query_LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [15]:
# langchain-community provides the langchain_community imports used below
!pip install langchain langchain-community
!pip install PyPDF2
!pip install chromadb
!pip install sentence_transformers
!pip install transformers



In [16]:
import PyPDF2
from PyPDF2 import PdfReader
from transformers import pipeline
from langchain.text_splitter import CharacterTextSplitter
from sentence_transformers import SentenceTransformer

# Chroma is imported once, from langchain_community
# (the duplicate import from langchain.vectorstores was redundant)
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from langchain_community.vectorstores import Chroma

In [17]:
# connect your Google Drive
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"

Mounted at /content/gdrive


In [18]:
# location of the pdf file/files.
reader = PdfReader('/content/gdrive/My Drive/data/fastfacts-what-is-climate-change.pdf')

In [19]:
# Read the text of every page and concatenate it into raw_text
raw_text = ''
for page in reader.pages:
    text = page.extract_text()
    if text:  # skip pages with no extractable text
        raw_text += text

In [20]:
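# Split on newlines into chunks of at most ~1000 characters,
# with 200 characters of overlap between neighbouring chunks
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
# split_text returns a list of strings
texts = text_splitter.split_text(raw_text)
# create_documents wraps each chunk in a Document object
texts = text_splitter.create_documents(texts)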
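
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified sliding-window splitter. This is only a sketch: the hypothetical helper `sliding_chunks` cuts at fixed character offsets, whereas `CharacterTextSplitter` splits on the separator first and then merges the pieces, so the real chunk boundaries will differ.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Hypothetical helper: fixed-width character windows with overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, which is the role `chunk_overlap` plays in keeping context that straddles a chunk boundary.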

In [21]:
texts

[Document(page_content='What Is Climate Change?\n1. Climate change  can be a natural process where temperature, rainfall, wind and \nother elements vary over decades or more. In millions of years, our world has been \nwarmer and colder than it is now. But today we are experiencing rapid warming from \nhuman activities, primarily due to burning fossil fuels that generate greenhouse gas \nemissions.\n2. Increasing greenhouse gas emissions  from human activity act like a blanket \nwrapped around the earth, trapping the sun’s heat and raising temperatures.\n3. Examples of greenhouse gas emissions that are causing climate change include \ncarbon dioxide and methane. These come from burning fossil fuels such as gasoline \nfor driving a car or coal for heating a building. Clearing land and forests can also \nrelease carbon dioxide. Landfills for garbage are another source. Energy, industry, \nagriculture and waste disposal are among the major emitters.'),
 Document(page_content='release carbo

# *Query part*

In [22]:
# Create the embedding function (a Sentence Transformers model)
model_name = "all-MiniLM-L6-v2"
embedding_function = SentenceTransformerEmbeddings(model_name=model_name)

# Embed the chunks and load them into a Chroma vector store
db = Chroma.from_documents(texts, embedding_function)
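
Conceptually, a similarity search embeds the query and ranks the stored chunks by how close their vectors are to it. The sketch below shows that ranking with cosine similarity on made-up 3-dimensional vectors. The helper names and toy vectors are assumptions for illustration, and Chroma's actual index and distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the names of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy "index": in the notebook these would be chunk embeddings
toy_index = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
}
query_vec = [0.8, 0.2, 0.0]
ranked = top_k(query_vec, toy_index, k=2)
print(ranked)
```

Here `chunk_b` ranks first because its direction is closest to the query vector, and `chunk_c` is dropped because only the top two are kept.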

In [23]:
query = "what is climate change?"
# Retrieve the 3 chunks whose embeddings are most similar to the query
_filtered_pages = db.similarity_search(query, k=3)
_filtered_pages

[Document(page_content='What Is Climate Change?\n1. Climate change  can be a natural process where temperature, rainfall, wind and \nother elements vary over decades or more. In millions of years, our world has been \nwarmer and colder than it is now. But today we are experiencing rapid warming from \nhuman activities, primarily due to burning fossil fuels that generate greenhouse gas \nemissions.\n2. Increasing greenhouse gas emissions  from human activity act like a blanket \nwrapped around the earth, trapping the sun’s heat and raising temperatures.\n3. Examples of greenhouse gas emissions that are causing climate change include \ncarbon dioxide and methane. These come from burning fossil fuels such as gasoline \nfor driving a car or coal for heating a building. Clearing land and forests can also \nrelease carbon dioxide. Landfills for garbage are another source. Energy, industry, \nagriculture and waste disposal are among the major emitters.'),
 Document(page_content='What Is Clima

# *Summarization part*

In [24]:
import spacy
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer


# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

# Load the summarization pipeline
summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", framework="tf")

# Download NLTK resources (if not done before)
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

# Text cleaning function
def clean_text(text):
    # Remove non-alphanumeric characters
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)

    # Convert to lowercase
    text = text.lower()

    # Remove extra whitespace
    text = ' '.join(text.split())

    # Remove stop words (It seems removing stop words is not suitable for summarization)
    # stop_words = set(stopwords.words('english'))
    # text = ' '.join(word for word in text.split() if word not in stop_words)

    # Lemmatization (It seems lemmatization is not suitable for summarization)
    # lemmatizer = WordNetLemmatizer()
    # text = ' '.join(lemmatizer.lemmatize(word) for word in text.split())

    # Stemming (It seems stemming is not suitable for summarization)
    # stemmer = PorterStemmer()
    # text = ' '.join(stemmer.stem(word) for word in text.split())

    return text

# 0-based index of the page to summarize.
# Assumption: the original never defined this and reused `page` from an earlier loop.
page_number = 0

# Open the PDF file in read-binary mode
with open('/content/gdrive/My Drive/data/fastfacts-what-is-climate-change.pdf', 'rb') as file:
    # Create a PDF reader object
    pdf = PyPDF2.PdfReader(file)

    # Get the number of pages in the PDF
    num_pages = len(pdf.pages)

    # Check that the requested page exists
    if 0 <= page_number < num_pages:
        # Extract the text of the selected page (empty string if nothing is extracted)
        text = pdf.pages[page_number].extract_text() or ''

        # Clean the text
        cleaned_text = clean_text(text)

        # Process the cleaned text using spaCy to get word vectors
        # doc = nlp(cleaned_text)
        # word_vectors = [token.vector for token in doc]

        # Combine cleaned text and word vectors (for example, concatenation)
        # text_with_vectors = cleaned_text + " ".join(map(str, word_vectors))

        # Summarize the extracted text using Hugging Face Transformers with word vectors
        # summary = summarizer(text_with_vectors, max_length=150, min_length=50, length_penalty=2.0, num_beams=4, early_stopping=True)

        # Summarize the extracted text using Hugging Face Transformers
        summary = summarizer(cleaned_text, max_length=150, min_length=50, length_penalty=2.0, num_beams=4, early_stopping=True)

        # Print the summarized text
        print("\nSummary:")
        print(summary[0]['summary_text'])
    else:
        print(f"Page {page_number} is out of range for this PDF (total pages: {num_pages}).")
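
To see what the cleaning step does before summarization, the sketch below applies the same three transformations as `clean_text` to a made-up sentence. The sample string is an assumption for illustration. Punctuation is stripped along with other symbols, so sentence boundaries and decimal points are lost (`1.1` becomes `11`).

```python
import re


def clean_text(text):
    # Remove non-alphanumeric characters (keeps letters, digits, whitespace)
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Collapse runs of whitespace (including newlines) into single spaces
    return ' '.join(text.split())


sample = "Climate change:  CO2 & methane\ntrap the Sun's heat (+1.1 C)."
cleaned = clean_text(sample)
print(cleaned)
```

Because the summarizer then receives one long lowercase string without punctuation, it may be worth comparing summaries produced with and without this step.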

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!



Summary:
