<a href="https://colab.research.google.com/github/MuchMarts/nlp_text_summarizer/blob/main/mini_project_summarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mini Project: Lecture Note Summarizer


---


**Authors**:  Mārtiņš Patjanko (*mp22042*); Dinh Phuoc Nguyen Tran (*dt22025*)


---

A lecture note summarizer implemented in two steps: extractive and abstractive.

The extractive step uses TextRank to select the most important sentences.

The abstractive step uses a T5 transformer to produce a summary in natural language.

# Data Preprocessing

All data is stored on GitHub. The cell below downloads the latest version of the repository, unzips it, and removes unnecessary files.

In [None]:
!rm -rf sample_data
!rm -rf your_folder
!rm -rf input_folder

!wget -qO temp.zip https://github.com/MuchMarts/nlp_text_summarizer/archive/refs/heads/main.zip && \
unzip -j temp.zip -d input_folder && rm temp.zip

!pip install python-docx
!pip install pymupdf

Archive:  temp.zip
8efd362f04d758df50a41c94fcee93c33ec9201e
  inflating: input_folder/README.md  
  inflating: input_folder/config.yaml  
  inflating: input_folder/genderdisparities_notes.txt  
  inflating: input_folder/genderdisparities_notes_summarized.txt  
  inflating: input_folder/lecture8.txt  
  inflating: input_folder/lecture8_summarized.txt  
  inflating: input_folder/lecture9.txt  
  inflating: input_folder/lecture9_summarized.txt  
  inflating: input_folder/lecturenotes_sample.txt  
  inflating: input_folder/lecturenotes_sample_summarized.txt  
  inflating: input_folder/notes_lec_1.txt  
  inflating: input_folder/notes_lec_10.txt  
  inflating: input_folder/notes_lec_11.txt  
  inflating: input_folder/notes_lec_12.txt  
  inflating: input_folder/notes_lec_2.txt  
  inflating: input_folder/notes_lec_3.txt  
  inflating: input_folder/notes_lec_4.txt  
  inflating: input_folder/notes_lec_5.txt  
  inflating: input_folder/notes_lec_6.txt  
  inflating: input_folder/notes_lec_7.t

In [None]:
import yaml
import os
from enum import Enum
# Read other document formats
import fitz
from docx import Document


class Output_Types(Enum):
  TEXTRANK = 1
  T5 = 2
  TEXTRANK_T5 = 3

Type_Mapping = {
    Output_Types.TEXTRANK : "textrank",
    Output_Types.T5 : "t5",
    Output_Types.TEXTRANK_T5 : "textrank_t5"
}

# Output_Types.TEXTRANK
# Output_Types.T5
# Output_Types.TEXTRANK_T5

INPUT_DIRECTORY = 'input_folder'
OUTPUT_DIRECTORY = 'output_folder'

with open(INPUT_DIRECTORY + "/config.yaml") as f:
  CONFIG = yaml.safe_load(f)

# yaml.safe_load returns None for an empty file; a missing file already fails in open()
if CONFIG is None:
  raise Exception("Config file is empty")

def store_file(data, output_type: Output_Types, name, output_dir=OUTPUT_DIRECTORY):
  # Write a generated summary to <output_dir>/<type>/<name>, creating the folder if needed
  target_dir = os.path.join(output_dir, Type_Mapping[output_type])
  os.makedirs(target_dir, exist_ok=True)

  with open(os.path.join(target_dir, name), 'w') as f:
    f.write(data)

def load_file(name, input_dir=INPUT_DIRECTORY):
  path = input_dir + "/" + name
  ext = os.path.splitext(path)[1].lower()

  if ext == '.txt':
    # The with-block closes the file, so no explicit close() is needed
    with open(path, 'r') as f:
      return f.read()

  if ext == '.docx':
    doc = Document(path)
    return "\n".join([para.text for para in doc.paragraphs])

  if ext == '.pdf':
    doc = fitz.open(path)
    text = ""
    for page in doc:
      text += page.get_text()
    return text

  raise Exception(f"Unsupported file type: {ext}")
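
The helpers above assume that `config.yaml` parses to a mapping with a `files` list, where each entry names an `original` document and its reference `summary` (the evaluation cells further down also look for an optional `type` key). The file itself is not reproduced in this notebook, so the sketch below only illustrates that inferred shape. The file names, the `type` value, and the `validate_config` helper are hypothetical.

```python
import os

# Hypothetical stand-in for the parsed config.yaml. The file names are
# placeholders; only the key names come from how CONFIG is used in the notebook.
EXAMPLE_CONFIG = {
    "files": [
        {"original": "example_lecture.txt",
         "summary": "example_lecture_summarized.txt",
         "type": "notes"},
        {"original": "example_slides.pdf",
         "summary": "example_slides_summarized.txt"},
    ]
}

# Extensions that load_file knows how to read
SUPPORTED_EXTENSIONS = {".txt", ".docx", ".pdf"}


def validate_config(config):
    """Return a list of problems found in a parsed config (empty if none)."""
    if not isinstance(config, dict) or "files" not in config:
        return ["config must be a mapping with a 'files' list"]

    problems = []
    for i, pair in enumerate(config["files"]):
        for key in ("original", "summary"):
            name = pair.get(key)
            if not name:
                problems.append(f"entry {i}: missing '{key}'")
                continue
            ext = os.path.splitext(name)[1].lower()
            if ext not in SUPPORTED_EXTENSIONS:
                problems.append(f"entry {i}: unsupported file type '{ext}' for '{key}'")
    return problems


print(validate_config(EXAMPLE_CONFIG))
```

Running a check like this right after loading the config would surface a missing key or an unsupported extension before any model is loaded.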

# Extractive Step

The `spacy` dependency provides the NLP pipeline for parsing and analyzing text.
The `pytextrank` dependency adds an implementation of the TextRank algorithm to that pipeline.
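
To give a feel for the idea behind TextRank, here is a minimal, dependency-free sketch of the sentence-ranking variant: sentences are graph nodes, edge weights come from word overlap, and a PageRank-style iteration scores each node. This is an illustration under simplifying assumptions (a naive regex sentence splitter, plain word overlap), not how `pytextrank` works internally. All function names here are hypothetical.

```python
import math
import re


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    # Shared words, normalised by the log of both sentence lengths
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa)) + math.log(len(wb))
    return len(wa & wb) / denom if denom > 0 else 0.0


def textrank_sentences(text, top_n=2, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return []

    # Weighted, undirected sentence graph stored as an adjacency matrix
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]

    # PageRank-style power iteration over the sentence graph
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            for i in range(n)
        ]

    # Pick the top-scoring sentences, then restore document order
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(top)]


example = (
    "TextRank builds a graph of sentences. "
    "Each sentence is a node in the graph. "
    "Edges in the graph connect sentences that share words. "
    "Bananas are yellow."
)
print(textrank_sentences(example, top_n=2))
```

The unrelated last sentence shares no words with the others, so it receives only the baseline score and is not selected.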

In [None]:
# Install dependencies
!pip install --quiet numpy nltk pytextrank spacy torch
!pip install --quiet "scipy>=1.14.0"
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 29.4 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import spacy
import pytextrank

# NLP
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

<pytextrank.base.BaseTextRankFactory at 0x78aebd760590>

In [None]:
def pytextrank_summarize(text, top_n=3):
    doc = nlp(text)
    summary = [sent.text for sent in doc._.textrank.summary(limit_phrases=15, limit_sentences=top_n)]
    return ' '.join(summary)

# Abstractive Step

Dependency `transformers` is used to get the T5 transformer and its tokenizer.

In [None]:
# Install dependencies
!pip install --quiet transformers

In [None]:
# Config values, transformer
MODEL_NAME = "t5-small"
#MODEL_NAME = "Vamsi/T5_Paraphrase_Paws"
#MODEL_NAME = "google/flan-t5-small"
MAX_TOKENS = 512 # T5 was trained on 512-token inputs; used here as the maximum input length
MIN_TOKENS = 100
INSTRUCTION = "paraphrase: "
#INSTRUCTION = "Paraphrase the following text to improve clarity and grammar, but keep all details and information unchanged: "

DEBUG = False

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration, logging

logging.set_verbosity_error()

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

Input text may be longer than 512 tokens, so it is split into overlapping chunks using a sliding window.
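
As a quick illustration of the windowing arithmetic, the sketch below applies the same idea to a plain list of integers standing in for token ids. The defaults of 450 and 50 mirror the `CHUNK_SIZE` and `OVERLAP` values configured in this notebook; the `sliding_window` function is a hypothetical stand-alone version, not the notebook's own helper.

```python
def sliding_window(tokens, chunk_size=450, overlap=50):
    """Split a token list into windows that share `overlap` tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break  # last window reached; avoid a trailing chunk of pure overlap
        start += step
    return chunks


tokens = list(range(1000))  # stand-in for 1000 token ids
chunks = sliding_window(tokens)
print([(c[0], c[-1], len(c)) for c in chunks])
# → [(0, 449, 450), (400, 849, 450), (800, 999, 200)]
```

Each window starts 400 tokens after the previous one, so neighbouring chunks share 50 tokens of context, and a chunk of 450 tokens still leaves room for the instruction prefix within the 512-token limit.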

In [None]:
# Config values, chunking
CHUNKER = "sliding_window"
#CHUNKER = "sentence_sliding_window"
CHUNK_SIZE = 450
OVERLAP = 50

Helper decorator that times each call to the function it wraps.

In [None]:
import time
from functools import wraps

def timer(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = fn(*args, **kwargs)
        end = time.time()
        print(f"DEBUG: {fn.__name__!r} took {end - start:.4f} sec")
        return result
    return wrapper
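
A short usage sketch for a timing decorator of this kind. The decorator is repeated so the snippet runs on its own, with one swap: it uses `time.perf_counter` instead of `time.time`, since a monotonic clock is generally better suited to measuring intervals. `slow_square` is a hypothetical function used only for the demonstration.

```python
import time
from functools import wraps


def timer(fn):
    @wraps(fn)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"DEBUG: {fn.__name__!r} took {elapsed:.4f} sec")
        return result
    return wrapper


@timer
def slow_square(x):
    time.sleep(0.01)  # simulate some work
    return x * x


value = slow_square(7)  # prints a DEBUG line with the elapsed time
print(value)
```

Because of `@wraps`, the DEBUG line reports the original function name rather than `wrapper`.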

In [None]:
def generate_sliding_window_chunks(text):
  # Tokenize once, then cut the token ids into overlapping windows
  tokens = tokenizer.encode(text)
  chunks = []
  start = 0
  while start < len(tokens):
    end = min(start + CHUNK_SIZE, len(tokens))
    chunks.append(tokens[start:end])
    if end == len(tokens):
      break # Last window reached; stepping again would only repeat overlap tokens
    start += CHUNK_SIZE - OVERLAP
  if DEBUG: print(f"DEBUG: Text Length: {len(text)}, Token Count: {len(tokens)}, Chunk Count: {len(chunks)}")

  return chunks

In [None]:
def generate_summary(text, chunker=CHUNKER, instruction=INSTRUCTION, max_tokens=MAX_TOKENS, min_length=MIN_TOKENS, do_sample=True, temperature=0.9):
  text = text.replace('\n', ' ')
  chunks = generate_sliding_window_chunks(text)
  summary_chunks = []

  for chunk in chunks:
    input_text = tokenizer.decode(chunk, skip_special_tokens=True)
    input_ids = tokenizer(
        instruction + input_text,
        return_tensors="pt",
        max_length=max_tokens,
        truncation=True)

    summary_ids = model.generate(
        input_ids.input_ids,
        num_beams=4,
        max_length=max_tokens,
        min_length=min_length,
        early_stopping=True,
        do_sample=do_sample,
        top_p=0.95,
        temperature=temperature
        )

    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    summary_chunks.append(summary)
  summaries = ' \n'.join(summary_chunks)
  if DEBUG: print(f"DEBUG: Text Length: {len(text)}, Summary Length: {len(summaries)}")
  return summaries

# Evaluation


## Helpers

In [None]:
!pip install --quiet rouge-score
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

In [None]:
def rouge_evaluation(prediction, reference, eval_print = True):
  scores = scorer.score(target=reference, prediction=prediction)
  if eval_print:
    print(f"############# EVALUATION #############")
    print(f"ROUGE-1: {scores['rouge1']}")
    print(f"ROUGE-2: {scores['rouge2']}")
    print(f"ROUGE-L: {scores['rougeL']}")
  return scores

In [None]:
#predictions = "The cat sat on the mat."
#references = "The cat is sitting on the mat."
#scores = rouge_evaluation(predictions, references)
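
For intuition about what ROUGE-1 measures, the sketch below computes unigram precision, recall, and F1 by hand for the same sentence pair as in the commented-out example. It is a simplified stand-in (lowercasing and a regex tokenizer, no stemming), so it will not always match `rouge_score` with `use_stemmer=True`. The `rouge1` function name is hypothetical.

```python
import re
from collections import Counter


def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F1 (no stemming)."""
    pred = Counter(re.findall(r"[a-z0-9]+", prediction.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))

    # Counter intersection clips each word to its minimum count in both texts
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = rouge1("The cat sat on the mat.", "The cat is sitting on the mat.")
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
# → P=0.833 R=0.714 F1=0.769
```

Five of the six predicted words appear in the seven-word reference, which gives precision 5/6 and recall 5/7. ROUGE-2 applies the same counting to word pairs, and ROUGE-L to the longest common subsequence.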

In [None]:
!pip install --quiet bert-score

In [None]:
from bert_score import score

def bert_score_evaluation(predictions, references, eval_print = True):
  P, R, F1 = score(predictions, references, lang="en", verbose=False)
  if eval_print:
    print(f"############# EVALUATION #############")
    print("BERTScore F1:", F1.mean().item())
    print("BERTScore P:", P.mean().item())
    print("BERTScore R:", R.mean().item())
  return {"F1" : F1, "R" : R, "P" : P}

In [None]:
#bert_score_evaluation([predictions], [references])

## Evaluation for each Step

### HELPERS

In [None]:
import math
from nltk import sent_tokenize
# Formats a string for stats, per entry
def eval_score_format_helper(key, scores):
  score = "[ " + key + " ] - "

  rouge = scores[key]["rouge"]
  bertscore = scores[key]["bertscore"]

  score += f"ROUGE 1: {rouge['rouge1'].fmeasure} "
  score += f"2: {rouge['rouge2'].fmeasure} "
  score += f"L: {rouge['rougeL'].fmeasure} "

  score += f"BERTScore F1: {bertscore['F1'].mean().item()} "
  score += f"P: {bertscore['P'].mean().item()} "
  score += f"R: {bertscore['R'].mean().item()} "

  return score

# Prints all evaluation scores for every entry and calculates the overall average
def print_eval_scores(scores):
  average = {
      'rouge' : {
          'r1' : 0,
          'r2' : 0,
          'rl' : 0
      },
      'bertscore' : {
          'f1' : 0,
          'p' : 0,
          'r' : 0
      }
  }
  print("Score for each dataset entry")
  for pair in CONFIG["files"]:
    print(eval_score_format_helper(pair["original"], scores=scores))

    average['rouge']['r1'] += scores[pair["original"]]["rouge"]["rouge1"].fmeasure
    average['rouge']['r2'] += scores[pair["original"]]["rouge"]["rouge2"].fmeasure
    average['rouge']['rl'] += scores[pair["original"]]["rouge"]["rougeL"].fmeasure

    average['bertscore']['f1'] += scores[pair["original"]]["bertscore"]["F1"].mean().item()
    average['bertscore']['p'] += scores[pair["original"]]["bertscore"]["P"].mean().item()
    average['bertscore']['r'] += scores[pair["original"]]["bertscore"]["R"].mean().item()

  print(f"Average Result: ")
  print(f"ROUGE 1: {average['rouge']['r1'] / len(CONFIG['files'])}")
  print(f"ROUGE 2: {average['rouge']['r2'] / len(CONFIG['files'])}")
  print(f"ROUGE L: {average['rouge']['rl'] / len(CONFIG['files'])}")
  print(f"BERTScore F1: {average['bertscore']['f1'] / len(CONFIG['files'])}")
  print(f"BERTScore P: {average['bertscore']['p'] / len(CONFIG['files'])}")
  print(f"BERTScore R: {average['bertscore']['r'] / len(CONFIG['files'])}")

def prec_to_sent(text, precentage):
  return math.ceil(len(sent_tokenize(text)) * precentage)

### EVALUATION PARAMETERS

In [None]:
#TEXTRANK_PRECENTAGES = [ 0.2, 0.5, 0.8 ]
TEXTRANK_PRECENTAGES = [ 0.3 ]
T5_INSTRUCTION = [ INSTRUCTION ]
#T5_MIN_LENGTH = [ None, 100 ]
T5_MIN_LENGTH = [ 100 ]
T5_CHUNKER = [ "sliding_window" ]
#T5_DO_SAMPLE = [ False, True]
T5_DO_SAMPLE = [ True ]
#T5_TEMPERATURE = [ None, 0.9 ]
T5_TEMPERATURE = [ 0.9 ]
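
Each list above holds the values to try for one parameter, and the pipelines later combine them with `itertools.product`. The sketch below counts how many configurations the wider, commented-out lists would produce. The lowercase variable names are local to this example.

```python
from itertools import product

# Wider lists, copied from the commented-out alternatives in the cell above
textrank_percentages = [0.2, 0.5, 0.8]
t5_instruction = ["paraphrase: "]
t5_min_length = [None, 100]
t5_chunker = ["sliding_window"]
t5_do_sample = [False, True]
t5_temperature = [None, 0.9]

# One tuple per T5 configuration
t5_grid = list(product(
    t5_instruction, t5_min_length, t5_chunker, t5_do_sample, t5_temperature
))

# The combined mode crosses every TextRank percentage with every T5 configuration
combined_grid = list(product(
    textrank_percentages,
    t5_instruction, t5_min_length, t5_chunker, t5_do_sample, t5_temperature
))

print(len(textrank_percentages), len(t5_grid), len(combined_grid))
# → 3 8 24
```

Every configuration is evaluated on every file, so the total runtime grows multiplicatively with each value added to a list.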

### Textrank

In [None]:
# Stores calculated scores
textrank_scores = {}

@timer
def evaluate_textrank():
    for pair in CONFIG["files"]:

        data = load_file(pair["original"])
        comparison = load_file(pair["summary"])

        if data is None or comparison is None:
            raise Exception(f"File not found: {pair['original']} or {pair['summary']}")

        # Generates summary with TextRank
        textrank_summary = pytextrank_summarize(data, top_n=3)

        # Evaluates summary with rouge and bert
        textrank_scores[pair["original"]] = {
          "rouge": rouge_evaluation(textrank_summary, comparison, eval_print=False),
          "bertscore": bert_score_evaluation([textrank_summary], [comparison], eval_print=False)
        }

        # Store generated summary file
        store_file(textrank_summary, Output_Types.TEXTRANK, "textrank_summary_" + pair["original"])

#evaluate_textrank() # Runs evaluation
#print_eval_scores(textrank_scores) # Outputs to console scores

### T5

In [None]:
t5_scores = {}

@timer
def evaluate_t5():
  for pair in CONFIG["files"]:
    data = load_file(pair["original"])
    comparison = load_file(pair["summary"])

    if data is None or comparison is None:
      raise Exception(f"File not found: {pair['original']} or {pair['summary']}")

    summary = generate_summary(data)

    t5_scores[pair["original"]] = {
      "rouge": rouge_evaluation(summary, comparison, eval_print=False),
      "bertscore": bert_score_evaluation([summary], [comparison], eval_print=False)
    }

    store_file(summary, Output_Types.T5, "t5_summary_" + pair["original"])

#evaluate_t5()
#print_eval_scores(t5_scores)

### Textrank + T5

In [None]:
textrank_t5 = {}

@timer
def evaluate_textrank_t5():
    for pair in CONFIG["files"]:
        data = load_file(pair["original"])
        comparison = load_file(pair["summary"])

        if data is None or comparison is None:
            raise Exception(f"File not found: {pair['original']} or {pair['summary']}")

        # Generate a summary with TextRank, then feed that summary to T5.
        # This combines both approaches, hopefully for a better outcome.
        textrank_summary = pytextrank_summarize(data, top_n=3)
        t5_summary = generate_summary(textrank_summary)

        textrank_t5[pair["original"]] = {
            "rouge": rouge_evaluation(t5_summary, comparison, eval_print=False),
            "bertscore": bert_score_evaluation([t5_summary], [comparison], eval_print=False)
        }

#evaluate_textrank_t5()
#print_eval_scores(textrank_t5)

### All evals

In [None]:
from collections import defaultdict

evaluation_results = defaultdict(list)

def evaluation_pipeline(mode,
                        textrank_prec=None,
                        t5_instruct=None, t5_min_len=None,
                        t5_chunker=None, t5_do_sample=None, t5_temp=None):

  counter = 0 # Used to artificially stop the eval pipeline early (the increment at the end of the loop is currently commented out)
  for pair in CONFIG["files"]:
    if counter == 4: break
    if "type" not in pair: continue # Only use pairs where a type is defined, just used to have finer control over eval data

    print(".", end="") #DEBUG SHOW STUFF HAPPENING
    data = load_file(pair["original"])
    comparison = load_file(pair["summary"])

    if data is None or comparison is None:
      print(f"File not found: {pair['original']} or {pair['summary']}; Skipping...")
      continue

    if mode == "textrank":
      summary = pytextrank_summarize(data, top_n=prec_to_sent(data, textrank_prec))
    elif mode == "t5":
      summary = generate_summary(data,
                                 instruction=t5_instruct,
                                 min_length=t5_min_len,
                                 chunker=t5_chunker,
                                 do_sample=t5_do_sample,
                                 temperature=t5_temp)
    elif mode == "combined":
      summary = pytextrank_summarize(data, top_n=prec_to_sent(data, textrank_prec))
      summary = generate_summary(summary,
                                 instruction=t5_instruct,
                                 min_length=t5_min_len,
                                 chunker=t5_chunker,
                                 do_sample=t5_do_sample,
                                 temperature=t5_temp)
    else:
      raise ValueError("Invalid mode")

    rouge = rouge_evaluation(summary, comparison, eval_print=False)
    bert = bert_score_evaluation([summary], [comparison], eval_print=False)

    evaluation_results[mode].append({
        "file": pair["original"],
        "rouge1_f1": rouge["rouge1"].fmeasure,
        "rouge2_f1": rouge["rouge2"].fmeasure,
        "rougeL_f1": rouge["rougeL"].fmeasure,
        "bertscore_f1": bert["F1"].mean().item(),
        "params": {
            "textrank_percent": textrank_prec,
            "t5_instruction": t5_instruct,
            "t5_min_length": t5_min_len,
            "t5_chunker": t5_chunker,
            "t5_do_sample": t5_do_sample,
            "t5_temp": t5_temp
        }
    })
    #counter += 1

In [None]:
from itertools import product

# Textrank
@timer
def textrank_eval_pipeline():
  for textrank_percentage in TEXTRANK_PRECENTAGES:
    evaluation_pipeline(
      mode="textrank",
      textrank_prec=textrank_percentage
    )
  print("------------------------------")

# T5
@timer
def t5_eval_pipeline():
  print("------------------------------")
  for t5_instruction, t5_min_length, t5_chunker, t5_do_sample, t5_temperature in product(
      T5_INSTRUCTION,
      T5_MIN_LENGTH,
      T5_CHUNKER,
      T5_DO_SAMPLE,
      T5_TEMPERATURE
  ):
    evaluation_pipeline(
      mode="t5",
      t5_instruct=t5_instruction,
      t5_min_len=t5_min_length,
      t5_chunker=t5_chunker,
      t5_do_sample=t5_do_sample,
      t5_temp=t5_temperature
    )
  print("------------------------------")

# Combined
@timer
def combined_eval_pipeline():
  print("------------------------------")
  for textrank_percentage, t5_instruction, t5_min_length, t5_chunker, t5_do_sample, t5_temperature in product(
      TEXTRANK_PRECENTAGES,
      T5_INSTRUCTION,
      T5_MIN_LENGTH,
      T5_CHUNKER,
      T5_DO_SAMPLE,
      T5_TEMPERATURE
  ):
    evaluation_pipeline(
      mode="combined",
      textrank_prec=textrank_percentage,
      t5_instruct=t5_instruction,
      t5_min_len=t5_min_length,
      t5_chunker=t5_chunker,
      t5_do_sample=t5_do_sample,
      t5_temp=t5_temperature
    )
  print("------------------------------")

evaluation_results = defaultdict(list)
textrank_eval_pipeline()
t5_eval_pipeline()
combined_eval_pipeline()

............------------------------------
DEBUG: 'textrank_eval_pipeline' took 156.1664 sec
------------------------------
............------------------------------
DEBUG: 't5_eval_pipeline' took 2635.4274 sec
------------------------------
............------------------------------
DEBUG: 'combined_eval_pipeline' took 975.5115 sec


### Analyze results

In [None]:
import pandas as pd

def build_eval_df(results_dict):
    rows = []
    for mode, entries in results_dict.items():
        for entry in entries:
            row = {
                "mode": mode,
                "file": entry["file"],
                "rouge1_f1": entry["rouge1_f1"],
                "rouge2_f1": entry["rouge2_f1"],
                "rougeL_f1": entry["rougeL_f1"],
                "bertscore_f1": entry["bertscore_f1"],
                # Flatten params dict here:
                **{f"param_{k}": v for k, v in entry.get("params", {}).items()}
            }
            rows.append(row)
    return pd.DataFrame(rows)

eval_df = build_eval_df(evaluation_results)
#print(eval_df)
eval_df["rouge_avg_f1"] = eval_df[["rouge1_f1", "rouge2_f1", "rougeL_f1"]].mean(axis=1)
eval_df["combined_score"] = (eval_df["rouge_avg_f1"] * 0.4 + eval_df["bertscore_f1"] * 0.6)
best = eval_df.sort_values("combined_score", ascending=False).head(20)
print(best[[
    "mode",
    "file",
    "rouge1_f1",
    "rouge2_f1",
    "rougeL_f1",
    "bertscore_f1",
    "combined_score"
] + [col for col in best.columns if col.startswith("param_t5_") or "param_textrank" in col]])

        mode                   file  rouge1_f1  rouge2_f1  rougeL_f1  \
19        t5   transcript_lec_8.txt   0.492634   0.142612   0.180884   
7   textrank   transcript_lec_8.txt   0.437175   0.141328   0.143076   
12        t5   transcript_lec_1.txt   0.474430   0.104809   0.174985   
17        t5   transcript_lec_6.txt   0.462238   0.111155   0.176354   
18        t5   transcript_lec_7.txt   0.463100   0.107055   0.158573   
14        t5   transcript_lec_3.txt   0.414145   0.114557   0.188115   
16        t5   transcript_lec_5.txt   0.419536   0.111111   0.182713   
15        t5   transcript_lec_4.txt   0.390678   0.092053   0.163140   
20        t5   transcript_lec_9.txt   0.434349   0.116716   0.172333   
6   textrank   transcript_lec_7.txt   0.438185   0.101775   0.111111   
0   textrank   transcript_lec_1.txt   0.437410   0.115174   0.120863   
22        t5  transcript_lec_11.txt   0.429228   0.103175   0.164509   
5   textrank   transcript_lec_6.txt   0.445214   0.101040   0.12

Average by mode

In [None]:
mode_avg_scores = eval_df.groupby("mode")[[
    "rouge1_f1", "rouge2_f1", "rougeL_f1", "bertscore_f1", "combined_score"
]].mean().reset_index()

mode_avg_scores = mode_avg_scores.round(4)

mode_avg_scores = mode_avg_scores.sort_values("combined_score", ascending=False)

print(mode_avg_scores)

       mode  rouge1_f1  rouge2_f1  rougeL_f1  bertscore_f1  combined_score
1        t5     0.4288     0.1047     0.1685        0.7867          0.5656
2  textrank     0.3958     0.1059     0.1181        0.7830          0.5525
0  combined     0.3447     0.0689     0.1070        0.7812          0.5381
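
As a sanity check on the weighting, the sketch below recomputes `combined_score` from the per-mode averages printed above, using the same formula as the analysis cell (0.4 times the mean of the three ROUGE F1 scores plus 0.6 times BERTScore F1). The dictionary layout and the `combined_score` function are specific to this example.

```python
# Per-mode averages copied from the table above
mode_averages = {
    "t5":       {"rouge1": 0.4288, "rouge2": 0.1047, "rougeL": 0.1685,
                 "bertscore": 0.7867, "reported": 0.5656},
    "textrank": {"rouge1": 0.3958, "rouge2": 0.1059, "rougeL": 0.1181,
                 "bertscore": 0.7830, "reported": 0.5525},
    "combined": {"rouge1": 0.3447, "rouge2": 0.0689, "rougeL": 0.1070,
                 "bertscore": 0.7812, "reported": 0.5381},
}

ROUGE_WEIGHT, BERT_WEIGHT = 0.4, 0.6


def combined_score(row):
    # Mean of the three ROUGE F1 scores, then the weighted sum with BERTScore F1
    rouge_avg = (row["rouge1"] + row["rouge2"] + row["rougeL"]) / 3
    return ROUGE_WEIGHT * rouge_avg + BERT_WEIGHT * row["bertscore"]


for mode, row in mode_averages.items():
    print(f"{mode:9s} recomputed={combined_score(row):.4f} reported={row['reported']:.4f}")
```

The recomputed values agree with the reported ones up to rounding of the inputs. BERTScore F1 varies by less than 0.006 across the three modes, so most of the gap in `combined_score` comes from the ROUGE term even though it carries the smaller weight.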


Simple eval, to console

In [None]:
evaluate_t5()
evaluate_textrank()
evaluate_textrank_t5()

print("T5 Summarizer")
print_eval_scores(t5_scores)
print("TextRank Summarizer")
print_eval_scores(textrank_scores)
print("Textrank + T5 Summarizer")
print_eval_scores(textrank_t5)

DEBUG: 'generate_summary' took 35.2240 sec
DEBUG: 'generate_summary' took 42.3224 sec
DEBUG: 'generate_summary' took 7.5866 sec
DEBUG: 'generate_summary' took 24.2273 sec
DEBUG: 'generate_summary' took 8.1072 sec
DEBUG: 'generate_summary' took 23.6568 sec
DEBUG: 'evaluate_t5' took 178.1137 sec
DEBUG: 'evaluate_textrank' took 37.1645 sec
DEBUG: 'generate_summary' took 5.6126 sec
DEBUG: 'generate_summary' took 6.3294 sec
DEBUG: 'generate_summary' took 7.0478 sec
DEBUG: 'generate_summary' took 7.1001 sec
DEBUG: 'generate_summary' took 7.1113 sec
DEBUG: 'generate_summary' took 13.8757 sec
DEBUG: 'evaluate_textrank_t5' took 76.5662 sec
T5 Summarizer
Score for each dataset entry
[ sample.txt ] - ROUGE 1: 0.308300395256917 2: 0.03187250996015936 L: 0.15810276679841895 BERTScore F1: 0.8202622532844543 P: 0.8207087516784668 R: 0.8198162317276001 
[ transcript_sample.txt ] - ROUGE 1: 0.553191489361702 2: 0.38004750593824227 L: 0.43026004728132383 BERTScore F1: 0.8734961152076721 P: 0.85340511798

# Interactive Tool

In [None]:
!pip install --quiet gradio

In [None]:
import gradio as gr

def summarize(text, method, control_type, fixed_count, percentage):
    if not text.strip():
        return "Please enter input text."

    total_sentences = len(sent_tokenize(text))
    if total_sentences == 0:
        return "No sentences found in input."

    # Determine number of sentences to use
    if control_type == "Fixed Count":
        num_sentences = int(fixed_count)  # slider values may arrive as floats
    elif control_type == "Percentage":
        num_sentences = max(1, int((percentage / 100) * total_sentences))
    else:
        return "Invalid selection for control type."

    if method == "TextRank":
        return pytextrank_summarize(text, top_n=num_sentences)
    elif method == "T5":
        return generate_summary(text)
    elif method == "Combination":
        textrank_output = pytextrank_summarize(text, top_n=num_sentences)
        return generate_summary(textrank_output)
    else:
        return "Invalid method"

with gr.Blocks() as demo:
    gr.Markdown("## 📝 Lecture Note Summarizer with T5 & TextRank")

    input_text = gr.Textbox(lines=10, label="Input Text")

    method = gr.Dropdown(choices=["TextRank", "T5", "Combination"], label="Summarization Method", value="T5")

    control_type = gr.Radio(choices=["Fixed Count", "Percentage"], label="Sentence Selection Method", value="Fixed Count")

    fixed_count_slider = gr.Slider(1, 10, step=1, value=3, label="Number of Sentences")
    percentage_slider = gr.Slider(10, 100, step=5, value=30, label="Summary Percentage (%)")

    output = gr.Textbox(lines=10, label="Summary Output")

    summarize_button = gr.Button("Summarize")

    # Logic to show/hide sliders based on choice
    control_type.change(
        lambda selection: (gr.update(visible=selection=="Fixed Count"), gr.update(visible=selection=="Percentage")),
        inputs=control_type,
        outputs=[fixed_count_slider, percentage_slider]
    )

    summarize_button.click(
        fn=summarize,
        inputs=[input_text, method, control_type, fixed_count_slider, percentage_slider],
        outputs=output
    )

demo.launch(share=True, debug=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://bb4b2a5273da71ca62.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


DEBUG: 'generate_summary' took 199.9725 sec
DEBUG: 'generate_summary' took 58.2150 sec
Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://bb4b2a5273da71ca62.gradio.live


