Installs all the required packages for the summarization system, including transformer models, data processing libraries, evaluation metrics, and the arxiv API for fetching papers.

In [None]:
#run
!pip install transformers datasets rouge-score sacrebleu pandas torch sentencepiece nltk arxiv sentence-transformers scikit-learn tqdm swifter

Collecting datasets
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
Collecting arxiv
  Downloading arxiv-2.1.3-py3-none-any.whl.metadata (6.1 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata 

Imports the Python libraries and modules used by the project: PyTorch for deep learning, pandas for data manipulation, transformers for NLP models, NLTK for text processing, and metrics libraries for evaluation. Also downloads the full NLTK data collection, although the code that follows appears to need only the sentence tokenizer data.

In [None]:
#run
import torch
import pandas as pd
import numpy as np
import time
from datasets import load_dataset, concatenate_datasets, Dataset
from transformers import (
    LEDTokenizer,
    LEDForConditionalGeneration,
    BertTokenizer,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments
)
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu
import nltk
from tqdm import tqdm
import arxiv

# Ensure NLTK dependencies are available
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_rus to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |  

True

Defines a function to load and preprocess three datasets (CompScholar, PubMed, and arXiv papers) for training and testing the summarization system. PubMed is streamed from the Hugging Face Hub and accumulated into DataFrames in batches, keeping its own train/validation/test splits; CompScholar and the arXiv papers are shuffled and split 90/10 into train and test sets. The function also tokenizes each text into sentences.
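
The batching pattern applied to the PubMed stream can be sketched in isolation. The helper below is a minimal illustration rather than the notebook's code: `iter_batches` and its `max_examples` cap are hypothetical names, and a plain `range` stands in for the streamed dataset split. Capping the number of examples read is one possible way to keep a streamed split from being pulled into memory in full.

```python
from itertools import islice

def iter_batches(stream, batch_size, max_examples=None):
    """Yield lists of up to batch_size items from any iterable.

    max_examples (hypothetical parameter) stops reading the stream early,
    so a very large split is never materialized in full.
    """
    iterator = iter(stream)
    if max_examples is not None:
        iterator = islice(iterator, max_examples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# A range stands in for a streamed dataset split here.
batches = list(iter_batches(range(10), batch_size=4))
capped = list(iter_batches(range(10), batch_size=4, max_examples=5))
print(batches)
print(capped)
```

Each yielded batch could be converted to a DataFrame and appended to a list, as the function below does for PubMed.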

In [None]:
#run
# -----------------------------------
# 1. Data Loading and Preprocessing
# -----------------------------------

import arxiv
import pandas as pd
import nltk
from datasets import load_dataset
import time

def load_datasets(max_arxiv_results=100, batch_size=1000):
    """Load and preprocess datasets with lower memory footprint."""
    print("Loading datasets...")
    start_time = time.time()

    # CompScholar dataset (read in full from a pinned GitHub revision; only two columns are kept)
    compscholar_url = "https://raw.githubusercontent.com/jayantapaul/BrainDead-2K25/1dafe7a5b42a33e0afd5dfa183780ca32c036dad/Brain%20Dead%20CompScholar%20Dataset.csv"
    compscholar_df = pd.read_csv(compscholar_url, usecols=['Document', 'Summary']).rename(
        columns={'Document': 'text', 'Summary': 'summary'})
    compscholar_df = compscholar_df.sample(frac=1, random_state=42)  # Shuffle reproducibly
    split_idx = int(len(compscholar_df) * 0.9)  # 90/10 train/test split
    # Copy the slices so adding the 'sentences' column later does not write to a view
    compscholar_train_df = compscholar_df.iloc[:split_idx].copy()
    compscholar_test_df = compscholar_df.iloc[split_idx:].copy()
    del compscholar_df  # Free memory

    # PubMed Dataset (load in batches)
    pubmed = load_dataset("ccdv/pubmed-summarization", streaming=True)  # Stream instead of loading all

    # Debug: Print available splits
    print("Available PubMed splits:", list(pubmed.keys()))

    pubmed_dfs = {'train': [], 'validation': [], 'test': []}  # Use 'validation' instead of 'valid'

    for split in ['train', 'validation', 'test']:  # Corrected split names
        print(f"Processing PubMed {split} split...")
        dataset_iter = iter(pubmed[split])
        batch = []
        for i, example in enumerate(dataset_iter):
            batch.append({'text': example['article'], 'summary': example['abstract']})
            if len(batch) >= batch_size:
                df = pd.DataFrame(batch)
                pubmed_dfs[split].append(df)
                batch = []
        if batch:  # Handle remaining items
            pubmed_dfs[split].append(pd.DataFrame(batch))
        pubmed_dfs[split] = pd.concat(pubmed_dfs[split], ignore_index=True)

    # arXiv Dataset (smaller fetch size and immediate processing)
    client = arxiv.Client()
    search = arxiv.Search(
        query="cat:cs.LG",
        max_results=max_arxiv_results,
        sort_by=arxiv.SortCriterion.SubmittedDate
    )

    arxiv_results = []
    print("Fetching arXiv papers...")
    for i, result in enumerate(client.results(search)):
        arxiv_results.append({
            'text': f"Title: {result.title}\nCategories: {' '.join(result.categories)}\nAbstract: {result.summary}",
            'summary': result.summary
        })
        if (i + 1) % 50 == 0:
            print(f"Fetched {i + 1} arXiv papers...")

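    arxiv_df = pd.DataFrame(arxiv_results)
    arxiv_df = arxiv_df.sample(frac=1, random_state=42)  # Shuffle reproducibly
    split_idx = int(len(arxiv_df) * 0.9)  # 90/10 train/test split
    # Copy the slices so adding the 'sentences' column later does not write to a view
    arxiv_train_df = arxiv_df.iloc[:split_idx].copy()
    arxiv_test_df = arxiv_df.iloc[split_idx:].copy()
    del arxiv_df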

    # Pre-process with sentence tokenization (in-place to save memory)
    datasets = {
        'pubmed': {
            'train': pubmed_dfs['train'],
            'val': pubmed_dfs['validation'],  # Keep 'val' as key for consistency
            'test': pubmed_dfs['test']
        },
        'arxiv': {
            'train': arxiv_train_df,
            'test': arxiv_test_df
        },
        'compscholar': {
            'train': compscholar_train_df,
            'test': compscholar_test_df
        }
    }

    print("Tokenizing sentences...")
    for dataset_name, splits in datasets.items():
        for split_name, df in splits.items():
            print(f"Tokenizing {dataset_name} {split_name}...")
            df['sentences'] = df['text'].apply(nltk.sent_tokenize)

    print(f"Completed in {time.time() - start_time:.2f} seconds")
    return datasets



# Example usage
datasets = load_datasets(max_arxiv_results=100, batch_size=1000)

Loading datasets...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Available PubMed splits: ['train', 'validation', 'test']
Processing PubMed train split...
Processing PubMed validation split...
Processing PubMed test split...
Fetching arXiv papers...
Fetched 50 arXiv papers...
Fetched 100 arXiv papers...
Tokenizing sentences...
Tokenizing pubmed train...
Tokenizing pubmed val...
Tokenizing pubmed test...
Tokenizing arxiv train...
Tokenizing arxiv test...
Tokenizing compscholar train...
Tokenizing compscholar test...
Completed in 318.02 seconds


Defines the extractive summarization component class, which uses BERT to identify and extract important sentences from the text. It includes methods for preparing training data (by computing sentence similarity to summaries), training the model, and selecting the most important sentences from new texts based on predicted importance scores.
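
The two rules at the heart of this component, labeling a sentence by its similarity to the summary and picking the top-scoring sentences while keeping document order, can be shown with plain Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, and toy 3-dimensional vectors stand in for the sentence-transformer embeddings and BERT scores the class actually uses.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def label_sentences(sentence_vecs, summary_vec, threshold=0.5):
    """Label a sentence 1 when its similarity to the summary exceeds the threshold."""
    return [1 if cosine(vec, summary_vec) > threshold else 0 for vec in sentence_vecs]

def select_top_k(scores, k):
    """Return the indices of the k highest scores, restored to document order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy vectors stand in for real sentence embeddings.
summary_vec = [1.0, 1.0, 0.0]
sentence_vecs = [[1.0, 0.9, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
labels = label_sentences(sentence_vecs, summary_vec)

# Toy importance scores stand in for the classifier's probabilities.
picked = select_top_k([0.2, 0.9, 0.1, 0.7], k=2)
print(labels, picked)
```

Sorting the selected indices before joining the sentences is what keeps the extracted text in its original reading order.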

In [None]:
#run
# -----------------------------------
# 2. Extractive Component
# -----------------------------------

class ExtractiveComponent:
    def __init__(self, model_name='bert-base-uncased'):
        self.sentence_model = SentenceTransformer('all-MiniLM-L6-v2')
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)

    def prepare_training_data(self, train_df, val_df, similarity_threshold=0.5):
        """Prepare data for training the extractive component."""
        print("Preparing extractive training data...")

        # Function to compute sentence importance labels based on similarity to summary
        def compute_sentence_labels(df):
            result_df = df.copy()

            # Compute embeddings for summaries
            print("Computing summary embeddings...")
            result_df['summary_embedding'] = [
                self.sentence_model.encode(summary) for summary in tqdm(result_df['summary'])
            ]

            # Compute embeddings for sentences and calculate similarity
            print("Computing sentence embeddings and similarities...")
            all_sentences = []
            all_labels = []

            for idx, row in tqdm(result_df.iterrows(), total=len(result_df)):
                sentences = row['sentences']
                summary_emb = row['summary_embedding']

                # Skip if no sentences
                if len(sentences) == 0:
                    continue

                # Compute sentence embeddings
                sentence_embs = self.sentence_model.encode(sentences)

                # Calculate similarities to summary
                similarities = cosine_similarity(
                    sentence_embs,
                    summary_emb.reshape(1, -1)
                ).flatten()

                # Create labels (1 for important, 0 for not important)
                labels = [1 if sim > similarity_threshold else 0 for sim in similarities]

                # Add to collection
                all_sentences.extend(sentences)
                all_labels.extend(labels)

            return all_sentences, all_labels

        # Create datasets
        train_sentences, train_labels = compute_sentence_labels(train_df)
        val_sentences, val_labels = compute_sentence_labels(val_df)

        # Convert to HF Datasets
        train_dataset = Dataset.from_dict({
            'sentence': train_sentences,
            'label': train_labels
        })
        val_dataset = Dataset.from_dict({
            'sentence': val_sentences,
            'label': val_labels
        })

        # Tokenize
        def tokenize_function(examples):
            return self.tokenizer(
                examples['sentence'],
                padding='max_length',
                truncation=True,
                max_length=128
            )

        train_dataset = train_dataset.map(tokenize_function, batched=True)
        val_dataset = val_dataset.map(tokenize_function, batched=True)

        # Format for training
        train_dataset = train_dataset.remove_columns(['sentence']).rename_column('label', 'labels')
        val_dataset = val_dataset.remove_columns(['sentence']).rename_column('label', 'labels')

        train_dataset.set_format('torch')
        val_dataset.set_format('torch')

        return train_dataset, val_dataset

    def train(self, train_dataset, val_dataset, output_dir="./extractive_model"):
        """Train the extractive component."""
        print("Training extractive component...")

        training_args = TrainingArguments(
            output_dir=output_dir,
            num_train_epochs=0.5,
            per_device_train_batch_size=16,
            per_device_eval_batch_size=16,
            warmup_steps=500,
            weight_decay=0.01,
            logging_dir='./logs',
            logging_steps=100,
            evaluation_strategy='steps',
            eval_steps=500,
            save_strategy='steps',
            save_steps=500,
            load_best_model_at_end=True,
        )

        def compute_metrics(pred):
            logits, labels = pred
            predictions = np.argmax(logits, axis=-1)
            accuracy = np.mean(predictions == labels)
            return {'accuracy': accuracy}

        trainer = Trainer(
            model=self.model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
            compute_metrics=compute_metrics
        )

        start_time = time.time()
        trainer.train()
        training_time = time.time() - start_time

        print(f"Extractive training completed in {training_time:.2f} seconds")

        # Save trained model
        trainer.save_model(output_dir)
        return training_time

    def select_important_sentences(self, text, top_k=None, threshold=0.5):
        """Select important sentences from text."""
        sentences = nltk.sent_tokenize(text)

        if not sentences:
            return ""

        # Select all if very few sentences
        if len(sentences) <= 3:
            return text

        # Default top_k to 30% of sentences if not specified
        if top_k is None:
            top_k = max(3, int(len(sentences) * 0.3))

        # Tokenize sentences
        inputs = self.tokenizer(
            sentences,
            padding=True,
            truncation=True,
            max_length=128,
            return_tensors='pt'
        ).to(self.device)

        # Get importance scores (eval mode disables dropout so scores are deterministic)
        self.model.eval()
        with torch.no_grad():
            outputs = self.model(**inputs)
            probs = torch.nn.functional.softmax(outputs.logits, dim=-1)[:, 1]

        # Get indices of top-k sentences as plain Python ints
        if top_k >= len(sentences):
            selected_indices = list(range(len(sentences)))
        else:
            selected_indices = torch.topk(probs, k=top_k).indices.tolist()

        # Sort indices to maintain original order
        selected_indices = sorted(selected_indices)

        # Return selected sentences
        return " ".join([sentences[i] for i in selected_indices])

Implements the abstractive summarization component using the Longformer Encoder-Decoder (LED) model, which can handle long input sequences (up to 16,384 tokens for this checkpoint; the code truncates to 4,096). The class includes methods for preparing data, training the model with mixed precision, gradient checkpointing, and gradient accumulation, and generating summaries from input text using beam search.
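
Two details of the training setup are easy to miss: padded label positions are set to -100 so that they do not contribute to the loss, and a per-device batch size of 1 combined with 4 gradient-accumulation steps behaves like a batch of 4. The sketch below illustrates both with plain Python lists. The helper names and the pad id of 1 are assumptions for illustration; in practice the pad id comes from `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_padding(label_ids, pad_token_id):
    """Replace padding positions in one label sequence with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]

def effective_batch_size(per_device_batch_size, accumulation_steps, num_devices=1):
    """Number of examples that contribute to each optimizer step."""
    return per_device_batch_size * accumulation_steps * num_devices

# Toy token ids; the pad id of 1 is an assumption for this example.
masked = mask_padding([0, 713, 16, 2, 1, 1], pad_token_id=1)
steps = effective_batch_size(per_device_batch_size=1, accumulation_steps=4)
print(masked, steps)
```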

In [None]:
#run
# -----------------------------------
# 3. Abstractive Component
# -----------------------------------

class AbstractiveComponent:
    def __init__(self, model_name="allenai/led-base-16384"):
        self.tokenizer = LEDTokenizer.from_pretrained(model_name)
        self.model = LEDForConditionalGeneration.from_pretrained(model_name)
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)

    def prepare_training_data(self, train_data, val_data):
        """Prepare LED training data."""
        print("Preparing abstractive training data...")

        def process_data(examples):
            inputs = self.tokenizer(
                examples["text"],
                padding="max_length",
                truncation=True,
                max_length=4096,  # LED can handle 16k but use less for efficiency
                return_tensors="pt"
            )

            outputs = self.tokenizer(
                examples["summary"],
                padding="max_length",
                truncation=True,
                max_length=512,
                return_tensors="pt"
            )

            batch = {
                "input_ids": inputs.input_ids,
                "attention_mask": inputs.attention_mask,
                "labels": outputs.input_ids
            }

            # Replace padding token id with -100 for loss calculation
            batch["labels"] = torch.where(
                batch["labels"] == self.tokenizer.pad_token_id,
                -100 * torch.ones_like(batch["labels"]),
                batch["labels"]
            )

            return batch

        # Convert to HF Dataset format
        train_dataset = Dataset.from_dict({
            "text": train_data["text"].tolist(),
            "summary": train_data["summary"].tolist()
        })

        val_dataset = Dataset.from_dict({
            "text": val_data["text"].tolist(),
            "summary": val_data["summary"].tolist()
        })

        # Process data
        train_dataset = train_dataset.map(
            process_data,
            batched=True,
            batch_size=4,
            remove_columns=["text", "summary"]
        )

        val_dataset = val_dataset.map(
            process_data,
            batched=True,
            batch_size=4,
            remove_columns=["text", "summary"]
        )

        return train_dataset, val_dataset

    def train(self, train_dataset, val_dataset, output_dir="./abstractive_model"):
        """Train the abstractive component."""
        print("Training abstractive component...")

        training_args = Seq2SeqTrainingArguments(
            output_dir=output_dir,
            evaluation_strategy="steps",
            eval_steps=500,
            save_strategy="steps",
            save_steps=500,
            learning_rate=3e-5,
            per_device_train_batch_size=1,
            per_device_eval_batch_size=1,
            gradient_accumulation_steps=4,
            weight_decay=0.01,
            num_train_epochs=0.5,
            predict_with_generate=True,
            fp16=torch.cuda.is_available(),  # Mixed precision requires a GPU
            gradient_checkpointing=True,
            logging_dir="./logs",
            report_to="tensorboard",
            push_to_hub=False
        )

        # Define metrics
        def compute_metrics(pred):
            labels_ids = pred.label_ids
            pred_ids = pred.predictions

            # Replace -100 with pad token id
            labels_ids[labels_ids == -100] = self.tokenizer.pad_token_id

            # Decode predictions and labels
            pred_str = self.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
            label_str = self.tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

            # Calculate ROUGE scores
            scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
            rouge_scores = []

            for p, l in zip(pred_str, label_str):
                rouge_scores.append(scorer.score(l, p))

            rouge1 = np.mean([s['rouge1'].fmeasure for s in rouge_scores])
            rouge2 = np.mean([s['rouge2'].fmeasure for s in rouge_scores])
            rougeL = np.mean([s['rougeL'].fmeasure for s in rouge_scores])

            # Calculate BLEU score
            bleu_scores = []
            for p, l in zip(pred_str, label_str):
                ref_tokens = l.split()
                pred_tokens = p.split()
                if len(ref_tokens) == 0 or len(pred_tokens) == 0:
                    bleu_scores.append(0.0)
                else:
                    bleu_scores.append(sentence_bleu([ref_tokens], pred_tokens))

            bleu = np.mean(bleu_scores)

            return {
                'rouge1': rouge1,
                'rouge2': rouge2,
                'rougeL': rougeL,
                'bleu': bleu
            }

        trainer = Seq2SeqTrainer(
            model=self.model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
            compute_metrics=compute_metrics
        )

        start_time = time.time()
        trainer.train()
        training_time = time.time() - start_time

        print(f"Abstractive training completed in {training_time:.2f} seconds")

        # Save trained model
        trainer.save_model(output_dir)
        return training_time

    def generate_summary(self, text, max_length=512, min_length=100):
        """Generate summary from text."""
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            max_length=4096,
            truncation=True
        ).to(self.device)

        self.model.eval()  # Disable dropout during generation
        summary_ids = self.model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=max_length,
            min_length=min_length,
            num_beams=4,
            length_penalty=2.0,
            early_stopping=True
        )

        summary = self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)
        return summary

The core class that combines both extractive and abstractive approaches into a hybrid summarization system. It first extracts important sentences, then generates an abstractive summary from those sentences. This class also includes methods for training both components and evaluating the complete system using ROUGE and BLEU metrics.
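
The extract-then-abstract flow and the idea behind the ROUGE-1 score can be sketched without any models. This is a simplified illustration, not the `rouge_score` implementation used below: `rouge1_f` works on lowercased whitespace tokens with no stemming, and `first_sentence` and `first_two_words` are hypothetical stand-ins for the trained extractive and abstractive components.

```python
from collections import Counter

def rouge1_f(reference, candidate):
    """Unigram-overlap F-measure on lowercased whitespace tokens (no stemming)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

def hybrid_summarize(text, extract, abstract):
    """Two-stage pipeline: condense with extract, then rewrite with abstract."""
    extracted = extract(text)
    return {"extracted_text": extracted, "summary": abstract(extracted)}

def first_sentence(text):
    """Stand-in extractor: keep only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."

def first_two_words(text):
    """Stand-in abstractor: keep the first two words, without punctuation."""
    return " ".join(text.rstrip(".").split()[:2])

result = hybrid_summarize("The cat sat. It was warm.", first_sentence, first_two_words)
score = rouge1_f("the cat sat", result["summary"])
print(result, round(score, 3))
```

Passing the two stages in as callables mirrors how the class delegates to its two components, which makes each stage replaceable when testing.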

In [None]:
#run
# -----------------------------------
# 4. Hybrid Summarization Framework
# -----------------------------------

class HybridSummarizer:
    def __init__(self):
        self.extractive = ExtractiveComponent()
        self.abstractive = AbstractiveComponent()

    def train(self, datasets):
        """Train both components of the hybrid model."""
        # Train extractive component
        ext_train_dataset, ext_val_dataset = self.extractive.prepare_training_data(
            datasets['pubmed']['train'].sample(1000),  # Sample for efficiency
            datasets['pubmed']['val'].sample(200)
        )
        ext_training_time = self.extractive.train(ext_train_dataset, ext_val_dataset)

        # Prepare extractive summaries for abstractive component
        print("Generating extractive summaries for abstractive training...")

        train_data = datasets['pubmed']['train'].sample(5000)  # Sample for efficiency
        val_data = datasets['pubmed']['val'].sample(500)

        train_data['text'] = train_data['text'].apply(
            lambda x: self.extractive.select_important_sentences(x)
        )

        val_data['text'] = val_data['text'].apply(
            lambda x: self.extractive.select_important_sentences(x)
        )

        # Train abstractive component
        abs_train_dataset, abs_val_dataset = self.abstractive.prepare_training_data(
            train_data, val_data
        )
        abs_training_time = self.abstractive.train(abs_train_dataset, abs_val_dataset)

        return {
            'extractive_training_time': ext_training_time,
            'abstractive_training_time': abs_training_time
        }

    def summarize(self, text, ext_ratio=0.3):
        """Generate a summary using the hybrid approach."""
        # Extract important sentences
        extracted_text = self.extractive.select_important_sentences(
            text,
            # Request at least one sentence, even when ext_ratio is small
            top_k=max(1, int(len(nltk.sent_tokenize(text)) * ext_ratio))
        )

        # Generate abstractive summary
        summary = self.abstractive.generate_summary(extracted_text)

        return {
            'extracted_text': extracted_text,
            'summary': summary
        }

    def evaluate(self, test_data):
        """Evaluate the model on test data."""
        print("Evaluating hybrid summarizer...")

        results = []

        for _, row in tqdm(test_data.iterrows(), total=len(test_data)):
            text = row['text']
            reference = row['summary']

            # Generate summary
            summary_output = self.summarize(text)
            generated_summary = summary_output['summary']

            # Calculate ROUGE scores
            scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
            rouge_scores = scorer.score(reference, generated_summary)

            # Calculate BLEU score
            ref_tokens = reference.split()
            pred_tokens = generated_summary.split()

            if len(ref_tokens) > 0 and len(pred_tokens) > 0:
                bleu_score = sentence_bleu([ref_tokens], pred_tokens)
            else:
                bleu_score = 0.0

            results.append({
                'reference': reference,
                'generated': generated_summary,
                'rouge1': rouge_scores['rouge1'].fmeasure,
                'rouge2': rouge_scores['rouge2'].fmeasure,
                'rougeL': rouge_scores['rougeL'].fmeasure,
                'bleu': bleu_score
            })

        # Compile results
        results_df = pd.DataFrame(results)
        avg_results = {
            'rouge1': results_df['rouge1'].mean(),
            'rouge2': results_df['rouge2'].mean(),
            'rougeL': results_df['rougeL'].mean(),
            'bleu': results_df['bleu'].mean()
        }

        return avg_results, results_df

Defines the main execution function that orchestrates the workflow: training the hybrid summarizer on the already-loaded datasets, evaluating it on each test set, and reporting results. Because of the `__name__ == "__main__"` guard, `main()` runs when the cell is executed in the notebook or when the file is run as a script.

In [None]:
# -----------------------------------
# 5. Main Execution
# -----------------------------------

def main():
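    # Reuse the `datasets` dict loaded in the data-loading cell above.
    # Uncomment to reload when running this as a standalone script:
    # datasets = load_datasets()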

    # Initialize and train hybrid summarizer
    summarizer = HybridSummarizer()
    training_times = summarizer.train(datasets)

    # Evaluate on test sets
    for dataset_name in ['pubmed', 'arxiv', 'compscholar']:
        print(f"\nEvaluating on {dataset_name} test set...")
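        test_df = datasets[dataset_name]['test']
        # Cap the sample at the split size (the arXiv test split has only 10 rows
        # with max_arxiv_results=100, so sample(100) would raise a ValueError)
        avg_results, detailed_results = summarizer.evaluate(
            test_df.sample(min(100, len(test_df)))  # Sample for efficiency
        )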

        print(f"Results for {dataset_name}:")
        print(f"ROUGE-1: {avg_results['rouge1']:.4f}")
        print(f"ROUGE-2: {avg_results['rouge2']:.4f}")
        print(f"ROUGE-L: {avg_results['rougeL']:.4f}")
        print(f"BLEU: {avg_results['bleu']:.4f}")

        # Save detailed results
        detailed_results.to_csv(f"{dataset_name}_results.csv", index=False)

    print("\nTraining Times:")
    print(f"Extractive Component: {training_times['extractive_training_time']:.2f} seconds")
    print(f"Abstractive Component: {training_times['abstractive_training_time']:.2f} seconds")

if __name__ == "__main__":
    main()

In [None]:
datasets

{'pubmed': {'train':                                                      text  \
  0       a recent systematic analysis showed that in 20...   
  1       it occurs in more than 50% of patients and may...   
  2       tardive dystonia ( td ) , a rarer side effect ...   
  3       lepidoptera include agricultural pests that , ...   
  4       syncope is caused by transient diffuse cerebra...   
  ...                                                   ...   
  119919  eukaryotic cells depend on vesicle - mediated ...   
  119920  fiber post systems are routinely used in resto...   
  119921  in most of the peer review publications in the...   
  119922   \n the reveal registry is a longitudinal regi...   
  119923  cerebral palsy is a nonprogressive central ner...   
  
                                                    summary  \
  0       background : the present study was carried out...   
  1       backgroundanemia in patients with cancer who a...   
  2       tardive dystonia ( td )

Initializes and trains the HybridSummarizer using the loaded datasets. This processes the training data through both the extractive and abstractive components and records the training times.

In [None]:
#run (make sure you have `extractive_model` folder)
summarizer = HybridSummarizer()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Initialize and train hybrid summarizer
summarizer = HybridSummarizer()
training_times = summarizer.train(datasets)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Preparing extractive training data...
Computing summary embeddings...

Computing sentence embeddings and similarities...


100%|██████████| 1000/1000 [01:15<00:00, 13.28it/s]


Computing summary embeddings...


100%|██████████| 200/200 [00:02<00:00, 94.77it/s] 


Computing sentence embeddings and similarities...


100%|██████████| 200/200 [00:15<00:00, 13.33it/s]


Map:   0%|          | 0/100542 [00:00<?, ? examples/s]

Map:   0%|          | 0/19990 [00:00<?, ? examples/s]

Training extractive component...


wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: mkaifqureshi (minionion) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Step,Training Loss,Validation Loss,Accuracy
500,0.4941,0.501849,0.755728
1000,0.4859,0.536949,0.718309
1500,0.4511,0.51221,0.761481
2000,0.4277,0.492322,0.757579
2500,0.4048,0.503268,0.758079
3000,0.4122,0.481948,0.765083
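
The validation loss in this log does not fall monotonically (it rises at steps 1000 and 2500), so it can help to pick the checkpoint explicitly rather than assume the last one is best. The sketch below reads the logged values and selects the step with the lowest validation loss and the step with the highest accuracy. The variable names are illustrative and not part of the notebook.

```python
import csv
import io

# Values copied from the training log above.
LOG = """Step,Training Loss,Validation Loss,Accuracy
500,0.4941,0.501849,0.755728
1000,0.4859,0.536949,0.718309
1500,0.4511,0.51221,0.761481
2000,0.4277,0.492322,0.757579
2500,0.4048,0.503268,0.758079
3000,0.4122,0.481948,0.765083
"""

# Parse every column as a float so rows can be compared numerically.
rows = [
    {key: float(value) for key, value in row.items()}
    for row in csv.DictReader(io.StringIO(LOG))
]

best_loss = min(rows, key=lambda r: r["Validation Loss"])
best_acc = max(rows, key=lambda r: r["Accuracy"])

print(int(best_loss["Step"]), best_loss["Validation Loss"])
print(int(best_acc["Step"]), best_acc["Accuracy"])
```

For this run both criteria agree on step 3000.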


Extractive training completed in 2528.87 seconds
Generating extractive summaries for abstractive training...


KeyboardInterrupt: 

In [None]:
#run
import swifter  # registers the .swifter accessor on pandas objects

# Replace each document's text with the sentences selected by the extractive component
select = summarizer.extractive.select_important_sentences
train_data['text'] = train_data['text'].swifter.apply(select)
val_data['text'] = val_data['text'].swifter.apply(select)


Pandas Apply:   0%|          | 0/5000 [00:00<?, ?it/s]

Prepares datasets for training the abstractive component and then trains it, recording the training time. The calls go through the summarizer instance, because the abstractive component is an instance attribute rather than a class attribute of HybridSummarizer.

In [None]:
#run
# Train abstractive component (use the instance, not the class)
abs_train_dataset, abs_val_dataset = summarizer.abstractive.prepare_training_data(
    train_data, val_data
)
abs_training_time = summarizer.abstractive.train(abs_train_dataset, abs_val_dataset)

In [None]:
#run
for dataset_name in ['pubmed', 'arxiv', 'compscholar']:
    print(f"\nEvaluating on {dataset_name} test set...")
    avg_results, detailed_results = summarizer.evaluate(
        datasets[dataset_name]['test'].sample(100)  # Sample for efficiency
    )

    print(f"Results for {dataset_name}:")
    print(f"ROUGE-1: {avg_results['rouge1']:.4f}")
    print(f"ROUGE-2: {avg_results['rouge2']:.4f}")
    print(f"ROUGE-L: {avg_results['rougeL']:.4f}")
    print(f"BLEU: {avg_results['bleu']:.4f}")

    # Save detailed results
    detailed_results.to_csv(f"{dataset_name}_results.csv", index=False)

print("\nTraining Times:")
print(f"Extractive Component: {training_times['extractive_training_time']:.2f} seconds")
print(f"Abstractive Component: {training_times['abstractive_training_time']:.2f} seconds")
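
The evaluation cell reports ROUGE and BLEU averages, but summarizer.evaluate is defined elsewhere. As a rough illustration of what a ROUGE-1 score measures, here is a simplified unigram-overlap F1 in plain Python. It is a sketch under stated assumptions (whitespace tokenization, lowercasing, no stemming), not the rouge-score implementation the notebook installs, so its numbers will not match that library exactly. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F1: clipped unigram overlap between two texts."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "the model summarizes long scientific papers"
candidate = "the model summarizes papers"
score = rouge1_f1(reference, candidate)
print(f"{score:.4f}")
```

Here all four candidate words appear in the six-word reference, so precision is 1.0, recall is 4/6, and the F1 is 0.8.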

Tests the hybrid summarizer on a sample text about AI/ML in medical devices, displaying the generated summary and comparing the lengths of the original text and summary.

## Testing the model

In [None]:
summarizer = HybridSummarizer()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
#sample 1 : https://www.semanticscholar.org/paper/d8eef89fb4b86b4026e9e12acd7e2ef3bd20df03
sample_text = """The rapid adoption of software as a medical device (SAMD) driven by artificial intelligence and machine learning has brought about a fundamental shift in the medical industry. This shift has the potential to greatly improve clinical outcomes and the quality of care provided to patients. This shift has been responsible for a number of key achievements made in recent times. When seen in this context, the proposed legal framework for revisions to the AI/ML-SAMD appears as an essential response to the malleability of these technologies. To successfully navigate the tough process of modifying AI/ML-SAMD with the assistance of this framework. It does this by taking into consideration the need for rapid regulatory scrutiny and making an attempt to combine the promotion of innovation with the simultaneous preservation of patient safety. In other words, it ensures that patient safety is protected while also encouraging innovation. This abstract provides a summary of the fundamental components of the framework, as well as a discussion of the significance of those components with regard to fostering the development of moral AI/ML-SAMD within the context of the healthcare ecosystem. The healthcare sector is undergoing a change as a direct result of artificial intelligence and machine learning, which are improving patient outcomes, diagnostic accuracy, and treatment options. The research emphasizes the significance of specific AI and ML applications as well as the sector’s embrace of this paradigm-shifting technology. In addition, the regulatory framework that has been presented is an important step towards guaranteeing the safe use of AI and ML in the medical field."""
output = summarizer.summarize(sample_text)
print(output['summary'])
print(len(sample_text))
print(len(output['summary']))

The rapid adoption of software as a medical device (SAMD) driven by artificial intelligence and machine learning has brought about a fundamental shift in the medical industry. To successfully navigate the tough process of modifying AI/ML-SAMD with the assistance of this framework. The healthcare sector is undergoing a change as a direct result of artificial intelligence and machine learning, which are improving patient outcomes, diagnostic accuracy, and treatment options.What is the future of AI/ML-SAMD in the medical industry?
1682
533


In [None]:
#sample 2 : https://www.semanticscholar.org/paper/IntelliGenes%3A-Interactive-and-user-friendly-AI-ML-Narayanan-DeGroat/d7aa89e530b4a76af40cbeb91b5c0cf0ca357c5d
sample_text = """Abstract Artificial intelligence (AI) and machine learning (ML) have advanced in several areas and fields of life; however, its progress in the field of multi-omics is not matching the levels others have attained. Challenges include but are not limited to the handling and analysis of high volumes of complex multi-omics data, and the expertise needed to implement and execute AI/ML approaches. In this article, we present IntelliGenes, an interactive, customizable, cross-platform, and user-friendly AI/ML application for multi-omics data exploration to discover novel biomarkers and predict rare, common, and complex diseases. The implemented methodology is based on a nexus of conventional statistical techniques and cutting-edge ML algorithms, which outperforms single algorithms and result in enhanced accuracy. The interactive and cross-platform graphical user interface of IntelliGenes is divided into three main sections: (i) Data Manager, (ii) AI/ML Analysis, and (iii) Visualization. Data Manager supports the user in loading and customizing the input data and list of existing biomarkers. AI/ML Analysis allows the user to apply default combinations of statistical and ML algorithms, as well as customize and create new AI/ML pipelines. Visualization provides options to interpret a diverse set of produced results, including performance metrics, disease predictions, and various charts. The performance of IntelliGenes has been successfully tested at variable in-house and peer-reviewed studies, and was able to correctly classify individuals as patients and predict disease with high accuracy. It stands apart primarily in its simplicity in use for nontechnical users and its emphasis on generating interpretable visualizations. We have designed and implemented IntelliGenes in a way that a user with or without computational background can apply AI/ML approaches to discover novel biomarkers and predict diseases."""
output = summarizer.summarize(sample_text)
print(output['summary'])
print(len(sample_text))
print(len(output['summary']))

Input ids are automatically padded from 102 to 1024 to be a multiple of `config.attention_window`: 1024
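
The message above reports that the 102 input tokens were padded to 1024 so that the sequence length is a multiple of the attention window. The arithmetic is a round-up to the next multiple, sketched below. `padded_length` is a hypothetical helper name, not a function from transformers.

```python
def padded_length(seq_len, attention_window):
    """Round seq_len up to the next multiple of attention_window."""
    remainder = seq_len % attention_window
    if remainder == 0:
        return seq_len
    return seq_len + attention_window - remainder


print(padded_length(102, 1024))   # 1024, as in the message above
print(padded_length(1025, 1024))  # 2048
```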


Abstract Artificial intelligence (AI) and machine learning (ML) have advanced in several areas and fields of life; however, its progress in the field of multi-omics is not matching the levels others have attained. Challenges include but are not limited to the handling and analysis of high volumes of complex multi-omics data, and the expertise needed to implement and execute AI/ML approaches. Visualization provides options to interpret a diverse set of produced results, including performance metrics, disease predictions, and various charts.
1927
545
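
Both test cells print the character counts of the input and of the summary. A quick way to compare them is the compression ratio. The helper below is an illustrative sketch using the numbers printed above; `compression_ratio` is not part of the notebook.

```python
def compression_ratio(original_chars, summary_chars):
    """Fraction of the original length that the summary retains."""
    return summary_chars / original_chars


# (original length, summary length) in characters, from the two cells above.
samples = {"sample 1": (1682, 533), "sample 2": (1927, 545)}

for name, (original_chars, summary_chars) in samples.items():
    ratio = compression_ratio(original_chars, summary_chars)
    print(f"{name}: {ratio:.1%} of the original length")
```

Both summaries come out at roughly 30% of the input length (about 31.7% and 28.3%).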
