### Problem 2

**Document Summarization**

**Scenario**: You're working on a news aggregator that displays summaries of various articles. You want to identify the most important keywords in each article to generate concise summaries.

**Tasks:**

1. **TF-IDF calculation**: Implement the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm. This involves calculating the frequency of each word in a document (TF) and how rare it is across all documents (IDF).
2. **Document Summarization**: Write a function that takes a list of pre-processed news articles (cleaned text) as input. It should perform the following:
    - Calculate TF-IDF for each word in each document.
    - Identify the top N words (keywords) with the highest TF-IDF scores for each document.
    - Generate a summary sentence using these keywords.
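
Before turning to a library implementation, the first task can be sketched from scratch. The block below is a minimal, unsmoothed variant (TF as relative frequency, IDF as `log(N / df)`); the function names `tf_idf` and `top_n` and the toy corpus are illustrative assumptions, not part of the original notebook.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: score} dict per document (classic, unsmoothed TF-IDF)."""
    n_docs = len(tokenized_docs)

    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_n(doc_scores, n):
    """Return the n highest-scoring terms, best first (ties broken alphabetically)."""
    ranked = sorted(doc_scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


# Toy corpus, invented for illustration only.
docs = [
    ["ufo", "sighting", "report", "ufo"],
    ["indigenous", "knowledge", "report"],
]
scores = tf_idf(docs)
print(top_n(scores[0], 2))  # ['ufo', 'sighting']
```

In this toy corpus "report" appears in both documents, so its IDF is `log(2 / 2) = 0` and it scores zero no matter how often it occurs.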

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd



In [2]:
file_path = 'datasets/neurologica_articles.csv'
df = pd.read_csv(file_path)
# Keep the first two articles; copy so later column assignments do not act on a slice.
df = df[:2].copy()

df.head()

| Unnamed: 0 | publication_date | title | author | categories | text | url |
|---|---|---|---|---|---|---|
| 0 | Apr 23 2024 | UFOs and SGU on John Oliver | Steven Novella | UFO's / Aliens | The most recent episode of John Oliver, Last W... | https://theness.com/neurologicablog/ufos-and-s... |
| 1 | Apr 22 2024 | Indigenous Knowledge | Steven Novella | Culture and Society | I recently received the following question to ... | https://theness.com/neurologicablog/indigenous... |


In [3]:
import string

import emoji
import nltk
from nltk.corpus import stopwords
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/haria/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
def normalize_text(comment_series):
    """Lowercases, removes punctuation (except '!' and '?'), and replaces emojis with descriptive text."""
    punctuation_remove = string.punctuation.replace('!', '').replace('?', '')
    
    # Vectorized operations with pandas: lowercasing, removing punctuation, and replacing emojis
    comment_series = comment_series.str.lower() 
    comment_series = comment_series.str.translate(str.maketrans('', '', punctuation_remove)) 
    comment_series = comment_series.apply(emoji.demojize)
    
    return comment_series

In [5]:
def tokenize_text(comment_series):
    """Splits each comment into individual tokens (words)."""
    return comment_series.str.split()

In [6]:
def remove_stop_words(token_series, custom_stop_words=None):
    """Removes stop words from tokenized comments. Accepts an optional custom stop-word list."""
    stop_words = set(stopwords.words('english'))
    if custom_stop_words:
        stop_words.update(custom_stop_words)
    
    return token_series.apply(lambda tokens: [token for token in tokens if token not in stop_words])

In [7]:
def create_text_pipeline(custom_stop_words=None):
    """Creates a text processing pipeline for normalization, tokenization, and stop-word removal."""
    return Pipeline([
        ('normalize', FunctionTransformer(normalize_text)),
        ('tokenize', FunctionTransformer(tokenize_text)),
        ('remove_stopwords', FunctionTransformer(
            remove_stop_words, kw_args={'custom_stop_words': custom_stop_words})),
    ])
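
To see what the three stages do to a single string, here is a dependency-free sketch of the same steps. The `clean` helper and the tiny `STOP_WORDS` set are hypothetical stand-ins (NLTK's English list is much larger), and emoji handling is left out.

```python
import string

# Hypothetical stand-in for NLTK's English stop-word list.
STOP_WORDS = {"the", "of", "is", "a", "and", "to"}

# Same rule as normalize_text: drop all punctuation except '!' and '?'.
PUNCT_TO_REMOVE = string.punctuation.replace("!", "").replace("?", "")


def clean(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    stripped = lowered.translate(str.maketrans("", "", PUNCT_TO_REMOVE))
    tokens = stripped.split()
    return [token for token in tokens if token not in stop_words]


print(clean("Is the UFO report real? Yes, it's real!"))
# ['ufo', 'report', 'real?', 'yes', 'its', 'real!']
```

Because '!' and '?' are kept and tokenization splits only on whitespace, "real?" and "real!" come out as two distinct tokens at this stage.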

In [8]:
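def process_comments_from_csv(df, custom_stop_words=None):
    """Adds a 'cleaned_text' column of stop-word-filtered tokens to df and returns it."""
    text_pipeline = create_text_pipeline(custom_stop_words)
    df["cleaned_text"] = text_pipeline.fit_transform(df["text"])
    return df

df = process_comments_from_csv(df)
df.head()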

| Unnamed: 0 | publication_date | title | author | categories | text | url | cleaned_text |
|---|---|---|---|---|---|---|---|
| 0 | Apr 23 2024 | UFOs and SGU on John Oliver | Steven Novella | UFO's / Aliens | The most recent episode of John Oliver, Last W... | https://theness.com/neurologicablog/ufos-and-s... | [recent, episode, john, oliver, last, week, to... |
| 1 | Apr 22 2024 | Indigenous Knowledge | Steven Novella | Culture and Society | I recently received the following question to ... | https://theness.com/neurologicablog/indigenous... | [recently, received, following, question, sgu,... |


### Load the model for summarization

In [9]:
def load_bart_model():
    """Loads the tokenizer and model. Despite the name, the active checkpoint is
    BigBird-Pegasus (PubMed); the BART alternative is commented out."""
    # tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
    # model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
    tokenizer = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-pubmed")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/bigbird-pegasus-large-pubmed")
    return tokenizer, model

### Generate a summary using model

In [10]:
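def summarize(text, tokenizer, model, max_length=130, min_length=30):
    """Generates an abstractive summary of text using beam search."""
    # Inputs longer than 1024 tokens are truncated before generation.
    inputs = tokenizer([text], max_length=1024, return_tensors='pt', truncation=True)
    # Pass the attention mask along with the input IDs.
    summary_ids = model.generate(**inputs, num_beams=4, max_length=max_length,
                                 min_length=min_length, length_penalty=2.0)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)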

### Function to calculate TF-IDF scores for each document

In [11]:
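def calculate_tfidf(corpus):
    """Fits a TfidfVectorizer on corpus (a list of strings) and returns the sparse
    document-term TF-IDF matrix together with the vocabulary terms."""
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(corpus)
    feature_names = vectorizer.get_feature_names_out()
    return tfidf_matrix, feature_names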
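
`TfidfVectorizer()` is called with its defaults here. As far as I understand scikit-learn's documented behaviour, that means raw term counts for TF, a smoothed IDF, and L2 normalization of each document row. With n documents and df(t) the number of documents containing term t, the weighting is roughly:

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(t, d) =
  \frac{\mathrm{tf}(t, d)\,\mathrm{idf}(t)}
       {\sqrt{\sum_{t'} \bigl(\mathrm{tf}(t', d)\,\mathrm{idf}(t')\bigr)^{2}}}
```

With only two documents (n = 2), a term found in one document gets idf = ln(3/2) + 1 ≈ 1.41 and a term found in both gets idf = 1, so the ranking is driven mostly by term frequency. The default token pattern also appears to keep only runs of two or more word characters, which would drop the trailing '!' and '?' that `normalize_text` preserves.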

### Function to extract top N keywords based on TF-IDF scores

In [12]:
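def get_top_keywords(tfidf_matrix, feature_names, top_n=5):
    """Returns, for each document, the top_n terms with the highest TF-IDF scores."""
    top_keywords_per_doc = []

    for doc_index in range(tfidf_matrix.shape[0]):
        # Convert the sparse row for this document into a flat dense array.
        tfidf_row = tfidf_matrix[doc_index].toarray().flatten()
        # argsort sorts ascending, so the last top_n indices are the highest-scoring
        # terms, ordered from weakest to strongest.
        top_indices = np.argsort(tfidf_row)[-top_n:]
        top_keywords = [feature_names[index] for index in top_indices]
        top_keywords_per_doc.append(top_keywords)

    return top_keywords_per_doc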
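
Two properties of `np.argsort(tfidf_row)[-top_n:]` are worth keeping in mind: the keywords come back weakest first, and if a document has fewer than `top_n` non-zero scores the remainder is filled with zero-score terms. The sketch below shows one possible alternative on a plain list, strongest first and skipping zeros; `top_keywords_desc` and the sample data are hypothetical and not part of the notebook.

```python
def top_keywords_desc(row, feature_names, top_n=5):
    """Return up to top_n feature names, highest score first, skipping zero scores."""
    # Rank the column indices by their score, largest first.
    ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    return [feature_names[i] for i in ranked[:top_n] if row[i] > 0]


# Invented scores for a four-term vocabulary.
row = [0.1, 0.0, 0.7, 0.4]
names = ["aliens", "knowledge", "ufo", "oliver"]

print(top_keywords_desc(row, names, top_n=3))  # ['ufo', 'oliver', 'aliens']
print(top_keywords_desc(row, names, top_n=4))  # ['ufo', 'oliver', 'aliens']
```

Asking for four keywords still returns three, because "knowledge" has a score of zero in this row.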

### Main Method

In [13]:
def process_articles(df, top_n_keywords=5):
    """Summarizes each article twice: from its top TF-IDF keywords and from its full text."""
    documents = df['text'].tolist()
    cleaned_documents = [' '.join(tokens) for tokens in df['cleaned_text']]

    tokenizer, model = load_bart_model()
    tfidf_matrix, feature_names = calculate_tfidf(cleaned_documents)
    top_keywords_per_doc = get_top_keywords(tfidf_matrix, feature_names, top_n=top_n_keywords)

    summaries = []
    for i, keywords in enumerate(top_keywords_per_doc):
        # Wrap the keywords in a short prompt so the model receives sentence-like input.
        keyword_sentence = " ".join(keywords)
        structured_input = f"The article discusses: {keyword_sentence}."

        key_words_summary = summarize(structured_input, tokenizer, model)
        normal_summary = summarize(documents[i], tokenizer, model)

        summaries.append({
            'title': df['title'].iloc[i],  # positional lookup, independent of index labels
            'key_words_summary': key_words_summary,
            'normal_summary': normal_summary
        })

    return summaries

In [14]:
summaries = process_articles(df, top_n_keywords=45)

# Output summaries
for summary in summaries:
    print(f"Title: {summary['title']}")
    print(f"Generated Summary with only key words: {summary['key_words_summary']}\n")
    print(f"Generated Summary with whole text: {summary['normal_summary']}\n")

Attention type 'block_sparse' is not possible if sequence_length: 56 <= num global tokens: 2 * config.block_size + min. num s

Title: UFOs and SGU on John Oliver
Generated Summary with only key words: the case of a young woman who was believed to have been shot and killed by an erythrocyte membranejoy group known as the erythrocyte extrapyramidal group is reported .<n> the patient , who was a follower of the group , has not been found .<n> the case has attracted considerable controversy , with some claiming there was an error in initial investigation and others questioning the authenticity of the findings . in this article , the author discusses the case with particular reference to the media , researchers and his former colleagues on both the pro and con sides .

Generated Summary with whole text: abstractthe mainstream media has a habit of giving space - based phenomena or , more generally , what has been termed space or , more specifically , what has been described as , what has been described as , what has been described as , what has been described as , what has been described as , what has been described
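
The `<n>` markers in the generated text look like the model's line-break tokens surviving decoding; that reading is an assumption on my part. If so, a small post-processing step could tidy the summaries before display. `tidy_summary` below is a hypothetical helper, not part of the notebook.

```python
import re


def tidy_summary(text):
    """Replace '<n>' markers with spaces and collapse runs of whitespace."""
    without_markers = text.replace("<n>", " ")
    return re.sub(r"\s+", " ", without_markers).strip()


print(tidy_summary("is reported .<n> the patient , who was a follower"))
# is reported . the patient , who was a follower
```

This only cleans up formatting; it does nothing about the repeated phrases or the off-topic content in the summaries themselves.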