<a href="https://colab.research.google.com/github/LuckyBoy587/Notes-Summarizer/blob/master/Colab_Run.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notes Summarizer on Colab

This notebook runs the Notes Summarizer on Google Colab. It clones the latest code from GitHub, installs the dependencies, and processes uploaded PDFs: topics are extracted, optionally paraphrased, and saved to a text file.

In [1]:
# Setup: Clone repository, install dependencies, and download NLTK data
!git clone https://github.com/LuckyBoy587/Notes-Summarizer.git
%cd Notes-Summarizer
!pip install -r requirements.txt
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

Cloning into 'Notes-Summarizer'...
remote: Enumerating objects: 142, done.
remote: Counting objects: 100% (142/142), done.
remote: Compressing objects: 100% (102/102), done.
remote: Total 142 (delta 72), reused 101 (delta 35), pack-reused 0 (from 0)
Receiving objects: 100% (142/142), 53.00 KiB | 889.00 KiB/s, done.
Resolving deltas: 100% (72/72), done.
/content/Notes-Summarizer
Collecting PyMuPDF (from -r requirements.txt (line 4))
  Downloading pymupdf-1.26.4-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Collecting PyPDF2 (from -r requirements.txt (line 7))
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pymupdf-1.26.4-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [2]:
# Import modules
from config import get_model_tokenizer_device, get_device
from text_processing import split_into_topics
from paraphrasing import paraphrase_chunks
from pdf_extraction import extract_topics_from_pdf
from google.colab import files
import os
import torch
# Show device info so you know whether GPU fp16 is being used
print('torch.cuda.is_available():', torch.cuda.is_available())
print('device:', get_device())


torch.cuda.is_available(): True
device: cuda
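
The implementation of `get_device` lives in the repository's `config` module and is not shown in this notebook. As a rough sketch of the selection that the comment in the cell above hints at (half precision on GPU, full precision on CPU), the hypothetical helper below takes the CUDA flag as an argument. The function name and the returned dtype labels are assumptions, not the repository's API.

```python
def choose_device_and_dtype(cuda_available):
    """Hypothetical helper: pick a device name and a dtype label from the CUDA flag."""
    if cuda_available:
        # fp16 stores each weight in 2 bytes instead of 4
        return 'cuda', 'float16'
    # Fall back to full precision on CPU
    return 'cpu', 'float32'


print(choose_device_and_dtype(True))
print(choose_device_and_dtype(False))
```

In this notebook the flag would come from `torch.cuda.is_available()`.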


In [3]:
def summarize_pdf(pdf_filename, paraphrase=True, paraphrase_kwargs=None):
    """Extract topics from a PDF, optionally paraphrase them, then save and download the result."""
    # Default generation settings, chosen for speed
    if paraphrase_kwargs is None:
        paraphrase_kwargs = {'batch_size': 16, 'num_beams': 1, 'max_length': 64, 'do_sample': True}
    # fast=True estimates font-size thresholds from a small sample of pages, which speeds up large PDFs
    extracted_text = extract_topics_from_pdf(pdf_filename, fast=True, sample_pages=3)
    topics = split_into_topics(extracted_text)

    output_content = ""
    for topic, chunks in topics.items():
        if paraphrase:
            bullets = paraphrase_chunks(chunks, **paraphrase_kwargs)
        else:
            bullets = chunks
        output_content += f"\n## {topic}\n"
        output_content += "\n".join([f"• {b}" for b in bullets]) + "\n"

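    # Swap the extension; str.replace would leave a name such as "notes.PDF" unchanged and overwrite the input
    output_filename = os.path.splitext(pdf_filename)[0] + '_paraphrased.txt'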
    with open(output_filename, 'w', encoding='utf-8') as f:
        f.write(output_content)

    print(f"Output saved to {output_filename}")
    # Download the result
    files.download(output_filename)
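
Passing `paraphrase_kwargs` to `summarize_pdf` replaces the default dictionary as a whole, so a caller who sets only `max_length` drops the other three settings. A minimal sketch of merging partial overrides with the defaults follows; `merge_paraphrase_kwargs` and `DEFAULT_PARAPHRASE_KWARGS` are hypothetical names and not part of the repository.

```python
# Hypothetical helper (not part of the Notes-Summarizer repository).
# The default values are copied from summarize_pdf.
DEFAULT_PARAPHRASE_KWARGS = {
    'batch_size': 16,
    'num_beams': 1,
    'max_length': 64,
    'do_sample': True,
}


def merge_paraphrase_kwargs(overrides=None):
    """Return a new dict holding the defaults updated with any caller overrides."""
    merged = dict(DEFAULT_PARAPHRASE_KWARGS)  # copy so the defaults are never mutated
    if overrides:
        merged.update(overrides)
    return merged


print(merge_paraphrase_kwargs({'max_length': 128}))
```

The merged dictionary could then be passed along, for example `summarize_pdf(pdf_filename, paraphrase_kwargs=merge_paraphrase_kwargs({'max_length': 128}))`.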


In [4]:
# Upload one or more PDFs
uploaded = files.upload()

Saving 1. Introduction and definition.pdf to 1. Introduction and definition.pdf
Saving 2.Types and Models.pdf to 2.Types and Models.pdf


In [5]:
for pdf_filename in uploaded.keys():
    # Run with paraphrasing using faster generation defaults
    summarize_pdf(pdf_filename, paraphrase=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
`torch_dtype` is deprecated! Use `dtype` instead!


Output saved to 1. Introduction and definition_paraphrased.txt


Output saved to 2.Types and Models_paraphrased.txt



In [9]:
!pip -q install gradio

import gradio as gr
import os
import shutil

def handle_pdfs(file_paths):
    """Copy the uploaded PDFs into the working directory and summarize each one."""
    if not file_paths:
        return "No files uploaded."
    saved_paths = []
    for path in file_paths:
        # Gradio stores uploads in a temporary directory; copy them into the working directory
        saved_path = os.path.abspath(os.path.basename(path))
        shutil.copy(path, saved_path)
        saved_paths.append(saved_path)
    for pdf_filename in saved_paths:
        # Run with paraphrasing using faster generation defaults
        summarize_pdf(pdf_filename, paraphrase=True)
    return "\n".join(saved_paths)

with gr.Blocks() as demo:
    gr.Markdown("### 📄 Multiple PDF Uploader")
    with gr.Row():
        pdf_input = gr.File(
            file_types=[".pdf"],
            type="filepath",
            file_count="multiple",
            label="Upload or Drop PDFs"
        )
    output = gr.Textbox(label="Saved Paths", lines=5)
    pdf_input.change(fn=handle_pdfs, inputs=pdf_input, outputs=output)

demo.launch()


It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://4f57390b33fbca8881.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


