This notebook outlines the step-by-step coding process for transforming the news articles.

In [None]:
!pip install huggingface_hub
!pip install cohere
!pip install transformers
!pip install sumy



## Loading data for demo

In [None]:
!pip install datasets



In [None]:
from datasets import load_dataset
import pandas as pd

In [None]:
dataset = load_dataset("abisee/cnn_dailymail", "3.0.0", split="train")

In [None]:
dataset

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 287113
})

In [None]:
dataset = pd.DataFrame(dataset)

In [None]:
dataset

Unnamed: 0,article,highlights,id
0,"LONDON, England (Reuters) -- Harry Potter star...",Harry Potter star Daniel Radcliffe gets £20M f...,42c027e4ff9730fbb3de84c1af0d2c506e41c3e4
1,Editor's note: In our Behind the Scenes series...,Mentally ill inmates in Miami are housed on th...,ee8871b15c50d0db17b0179a6d2beab35065f1e9
2,"MINNEAPOLIS, Minnesota (CNN) -- Drivers who we...","NEW: ""I thought I was going to die,"" driver sa...",06352019a19ae31e527f37f7571c6dd7f0c5da37
3,WASHINGTON (CNN) -- Doctors removed five small...,"Five small polyps found during procedure; ""non...",24521a2abb2e1f5e34e6824e0f9e56904a2b0e88
4,(CNN) -- The National Football League has ind...,"NEW: NFL chief, Atlanta Falcons owner critical...",7fe70cc8b12fab2d0a258fababf7d9c6b5e1262a
...,...,...,...
287108,"The nine-year-old daughter of a black, unarmed...","Rumain Brisbon, 34, was killed after Phoenix p...",279a12d3ee37b8109cc192a9e88115a5a631fb06
287109,Legalising assisted suicide is a slippery slop...,"Theo Boer, a European assisted suicide watchdo...",b5bc9d404a9a5d890c9fc26550b67e6d8d83241f
287110,A group calling itself 'The Women of the 99 Pe...,Ohio congressman criticised for 'condoning the...,500862586f925e406f8b662934e1a71bbee32463
287111,Most men enjoy a good pint of lager or real al...,The Black Country Ale Tairsters have been to 1...,32a1f9e5c37a938c0c0bca1a1559247b9c4334b2


## Cohere API

In [None]:
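import os
import cohere

# Read the API key from the environment rather than hardcoding it in the notebook.
# The variable name COHERE_API_KEY is an assumption; use whatever name you export.
co = cohere.Client(os.environ["COHERE_API_KEY"])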

def generate_embedding(content):
    if content:
        response = co.embed(texts=[content], model='embed-english-v2.0')
        return response.embeddings[0]
    return None

def cohere_prompt(prompt, model='command', max_tokens=300):
    response = co.generate(
        model=model,
        prompt=prompt,
        max_tokens=max_tokens
    )
    return response.generations[0].text.strip()

## Abstractive Summary Generation

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

tokenizer = T5Tokenizer.from_pretrained('google/flan-t5-small')
model = T5ForConditionalGeneration.from_pretrained('google/flan-t5-small')

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [None]:
def summarize_text(content):
    if content:
        input_text = f"summarize: {content}"  # T5 expects a 'summarize:' prefix
        inputs = tokenizer(input_text, return_tensors='pt', max_length=512, truncation=True)
        summary_ids = model.generate(inputs.input_ids, max_length=150, min_length=30, length_penalty=2.0, num_beams=4, early_stopping=True)
        return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return None

## Extractive Summary Generation Function

In [None]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

def extractive_summary(text, sentence_count=4):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LsaSummarizer()
    summary = summarizer(parser.document, sentence_count)

    return " ".join(str(sentence) for sentence in summary)

https://www.geeksforgeeks.org/mastering-text-summarization-with-sumy-a-python-library-overview/
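
To make the idea of extractive summarization concrete, here is a minimal standard-library sketch. It is not the LSA algorithm that sumy uses above: it swaps in a much simpler word-frequency score, and the function name `frequency_extractive_summary` and the regex-based sentence split are assumptions made for illustration only.

```python
import re
from collections import Counter


def frequency_extractive_summary(text, sentence_count=2):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    # Assumption: good enough for a demo; sumy relies on NLTK's punkt instead.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= sentence_count:
        return " ".join(sentences)

    # Count words across the whole text, ignoring words of 3 letters or fewer.
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        # A sentence scores the sum of the document-wide counts of its words.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens)

    # Rank sentence indices by score, keep the top ones, restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:sentence_count])
    return " ".join(sentences[i] for i in chosen)


demo = (
    "Asteroids pass Earth often. "
    "The asteroid passes Earth safely, and the asteroid poses no danger to Earth. "
    "Lunch was fine."
)
print(frequency_extractive_summary(demo, sentence_count=1))
```

Like the LSA summarizer, this copies sentences verbatim from the article, which is why extractive output can start mid-thought when a selected sentence depends on earlier context.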

## Generate Embedding

In [None]:
dataset = dataset.sample(n=990, random_state=42).reset_index(drop=True)

In [None]:
dataset

Unnamed: 0,article,highlights,id
0,Nasa has warned of an impending asteroid pass ...,2004 BL86 will pass about three times the dist...,6ccb7278e86893ad3609d30ecb5c9ea902fb9527
1,"BAGHDAD, Iraq (CNN) -- Iraq's most powerful Su...","Iraqi Islamic Party calls Quran incident ""blat...",d4f57e3c18c38696345fb7a3d76a151bb9c5123b
2,By . David Kent . Andy Carroll has taken an un...,Carroll takes to Instagram to post selfie ahea...,c9ae9fc314adcc92d3835b0437a1c44e9e233e1c
3,Los Angeles (CNN) -- Los Angeles has long been...,Pop stars from all over Europe are setting the...,5b5a383dc8f9487857787ced5426154394dd99db
4,London (CNN) -- Few shows can claim such an au...,NEW: Young athletes light the Olympic cauldron...,2813505a990ad24071496c0d0936e40847eb6194
...,...,...,...
985,Huge reserves of the oldest water on Earth are...,Geologists estimate ancient rocks contain arou...,709ba11a8e686be453ddf2656a31f82a30f06a04
986,"By . Gemma Hartley . IF it’s not hedge funds, ...",Former RBS boss has been in four-year feud wit...,22d04e0b2e5d0ff9c78b7e384ad9cb2c2f06aabe
987,(CNN) -- A 6.9-magnitude earthquake shook the ...,The Solomon Islands are struck by a 6.9-magnit...,6ac07d30aa4d96ae9364252b7a335bba70866a9d
988,"One doctor says the study ""very clearly shows ...",Removal of thimerosal from most vaccines hasn'...,4981d2d34eef6c189da1cf5c504b3f6a3bf36365


In [None]:
import time

def safe_generate_embedding(content):
    time.sleep(0.6)
    return generate_embedding(content)
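
A fixed pause reduces the chance of hitting a rate limit but does not recover if a request still fails. The sketch below shows one possible complement, a retry wrapper with exponential backoff. The name `call_with_retry`, its parameters, and the choice to retry on any exception are all assumptions for illustration, not part of the Cohere client.

```python
import time


def call_with_retry(func, *args, retries=3, base_delay=0.6, sleep=time.sleep):
    """Call func(*args); after a failure wait base_delay * 2**attempt and retry."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return func(*args)
        except Exception as error:  # assumption: every exception is retryable
            last_error = error
            # Do not sleep after the final attempt; just re-raise below.
            if attempt < retries - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

In this notebook it could wrap the embedding call, for example `call_with_retry(generate_embedding, article_text)`. The `sleep` parameter exists only so the waiting can be replaced in tests.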

In [None]:
dataset['Embedding'] = dataset['article'].apply(safe_generate_embedding)
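
The cell above sends one request per article. Because `co.embed` takes a list of texts, the same work could be done in fewer requests by batching. The following is a sketch under assumptions: `batched` and `embed_in_batches` are hypothetical helpers, `embed_fn` stands for any function that maps a list of strings to a list of vectors, and the default batch size of 96 is an assumed per-request limit that should be checked against the current Cohere documentation.

```python
def batched(items, batch_size=96):
    """Yield consecutive slices of items holding at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_fn, batch_size=96):
    """Embed texts batch by batch and return one vector per text, in order."""
    embeddings = []
    for batch in batched(texts, batch_size):
        # embed_fn is a hypothetical callable: list[str] -> list[list[float]].
        embeddings.extend(embed_fn(batch))
    return embeddings


# Stand-in embedder for demonstration: one number per text (its length).
fake_embed = lambda batch: [[float(len(text))] for text in batch]
print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
```

With the client defined earlier, `embed_fn` could be `lambda batch: co.embed(texts=batch, model='embed-english-v2.0').embeddings`, which mirrors the call already used in `generate_embedding`.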

In [None]:
dataset

Unnamed: 0,article,highlights,id,Embedding
0,Nasa has warned of an impending asteroid pass ...,2004 BL86 will pass about three times the dist...,6ccb7278e86893ad3609d30ecb5c9ea902fb9527,"[-1.4140625, 0.95214844, -0.23156738, 1.034179..."
1,"BAGHDAD, Iraq (CNN) -- Iraq's most powerful Su...","Iraqi Islamic Party calls Quran incident ""blat...",d4f57e3c18c38696345fb7a3d76a151bb9c5123b,"[1.2080078, -1.1123047, -1.6630859, 1.2548828,..."
2,By . David Kent . Andy Carroll has taken an un...,Carroll takes to Instagram to post selfie ahea...,c9ae9fc314adcc92d3835b0437a1c44e9e233e1c,"[-1.5576172, -0.3149414, 0.19250488, 0.7631836..."
3,Los Angeles (CNN) -- Los Angeles has long been...,Pop stars from all over Europe are setting the...,5b5a383dc8f9487857787ced5426154394dd99db,"[1.1689453, 0.51220703, 0.48339844, 0.7832031,..."
4,London (CNN) -- Few shows can claim such an au...,NEW: Young athletes light the Olympic cauldron...,2813505a990ad24071496c0d0936e40847eb6194,"[0.47436523, -1.1376953, -2.8867188, 0.7612304..."
...,...,...,...,...
985,Huge reserves of the oldest water on Earth are...,Geologists estimate ancient rocks contain arou...,709ba11a8e686be453ddf2656a31f82a30f06a04,"[1.7333984, 0.6772461, -2.7089844, 1.2421875, ..."
986,"By . Gemma Hartley . IF it’s not hedge funds, ...",Former RBS boss has been in four-year feud wit...,22d04e0b2e5d0ff9c78b7e384ad9cb2c2f06aabe,"[-0.6245117, 0.9345703, 0.21057129, 0.60595703..."
987,(CNN) -- A 6.9-magnitude earthquake shook the ...,The Solomon Islands are struck by a 6.9-magnit...,6ac07d30aa4d96ae9364252b7a335bba70866a9d,"[-0.4946289, 1.5751953, -1.7597656, 1.6923828,..."
988,"One doctor says the study ""very clearly shows ...",Removal of thimerosal from most vaccines hasn'...,4981d2d34eef6c189da1cf5c504b3f6a3bf36365,"[0.2927246, -1.0820312, -1.3164062, 1.1435547,..."


## Saving the dataset

In [None]:
from google.colab import drive

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
dataset.to_csv('/content/drive/My Drive/DATA608_Work/embedded_data_v2.csv', index=False)
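
One thing to keep in mind when saving: CSV stores each embedding list as its text representation, so a later read returns strings such as `"[0.5, -1.25, 2.0]"` rather than lists of floats. The sketch below shows one way to convert them back. The helper name `parse_embedding` is an assumption, and the round trip is simulated with the standard `csv` module so the example runs without pandas or Google Drive.

```python
import ast
import csv
import io


def parse_embedding(cell):
    """Turn the text form of a list, e.g. "[0.1, -0.2]", back into floats."""
    if cell is None or cell == "":
        return None  # keep missing embeddings as None
    return [float(value) for value in ast.literal_eval(cell)]


# Simulate writing and re-reading one row, as a CSV round trip would do.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "Embedding"])
writer.writerow(["a1", str([0.5, -1.25, 2.0])])
buffer.seek(0)

rows = list(csv.DictReader(buffer))
restored = parse_embedding(rows[0]["Embedding"])
print(restored)
```

When reloading with pandas, the same helper can be passed as `pd.read_csv(path, converters={'Embedding': parse_embedding})` so that the column is usable for similarity search again.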

## Similarity Search and Ranked Retrieval

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def get_top_k_similar_articles(query, df, k=3):
    query_embedding = np.array(generate_embedding(query)).reshape((1, -1))
    df_filtered = df[df['Embedding'].notnull()].copy()
    embeddings = np.stack(df_filtered['Embedding'].values)
    similarities = cosine_similarity(query_embedding, embeddings).flatten()
    df_filtered['Similarity'] = similarities
    top_k = df_filtered.sort_values('Similarity', ascending=False).head(k)
    return top_k
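
For readers who want to see what the ranking computes, here is a dependency-free sketch of cosine similarity and top-k selection on toy vectors. The names `cosine` and `top_k` are illustrative, and returning 0.0 for a zero vector is a choice made for this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # sketch choice: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, similarity) pairs for the k most similar documents."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for index, similarity in top_k([1.0, 0.0], docs, k=2):
    print(index, round(similarity, 4))
```

Cosine similarity depends only on the angle between vectors, so the magnitude of an embedding does not affect its rank. Note that the search always returns k articles, even when none of them is actually about the query topic.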

## Final Summary

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
import nltk
nltk.data.path.append('/usr/local/share/nltk_data')
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
nltk.data.find('tokenizers/punkt')

FileSystemPathPointer('/root/nltk_data/tokenizers/punkt')

In [None]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

Abstractive Summary

In [None]:
dataset['Abstractive Summary'] = dataset['article'].apply(lambda x: summarize_text(x))

LookupError: NLTK tokenizers are missing or the language is not supported.
Download them by following command: python -c "import nltk; nltk.download('punkt')"
Original error was:

**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/root/nltk_data'
    - '/root/nltk_data'
**********************************************************************


In [None]:
dataset

Unnamed: 0,article,highlights,id,Embedding,Abstractive Summary
0,Nasa has warned of an impending asteroid pass ...,2004 BL86 will pass about three times the dist...,6ccb7278e86893ad3609d30ecb5c9ea902fb9527,"[-1.4140625, 0.95214844, -0.23156738, 1.034179...","The asteroid, designated 2004 BL86, will safel..."
1,"BAGHDAD, Iraq (CNN) -- Iraq's most powerful Su...","Iraqi Islamic Party calls Quran incident ""blat...",d4f57e3c18c38696345fb7a3d76a151bb9c5123b,"[1.2080078, -1.1123047, -1.6630859, 1.2548828,...",U.S. soldier's desecration of the Quran requir...
2,By . David Kent . Andy Carroll has taken an un...,Carroll takes to Instagram to post selfie ahea...,c9ae9fc314adcc92d3835b0437a1c44e9e233e1c,"[-1.5576172, -0.3149414, 0.19250488, 0.7631836...",Andy Carroll posted a snap of himself just as ...
3,Los Angeles (CNN) -- Los Angeles has long been...,Pop stars from all over Europe are setting the...,5b5a383dc8f9487857787ced5426154394dd99db,"[1.1689453, 0.51220703, 0.48339844, 0.7832031,...",L.A.'s list of the best pop stars in the world...
4,London (CNN) -- Few shows can claim such an au...,NEW: Young athletes light the Olympic cauldron...,2813505a990ad24071496c0d0936e40847eb6194,"[0.47436523, -1.1376953, -2.8867188, 0.7612304...","""Isles of Wonder"" showcased tributes to the Br..."
...,...,...,...,...,...
985,Huge reserves of the oldest water on Earth are...,Geologists estimate ancient rocks contain arou...,709ba11a8e686be453ddf2656a31f82a30f06a04,"[1.7333984, 0.6772461, -2.7089844, 1.2421875, ...",Geologists have found water that is up to 2.7 ...
986,"By . Gemma Hartley . IF it’s not hedge funds, ...",Former RBS boss has been in four-year feud wit...,22d04e0b2e5d0ff9c78b7e384ad9cb2c2f06aabe,"[-0.6245117, 0.9345703, 0.21057129, 0.60595703...",Former Royal Bank of Scotland boss Fred Goodwi...
987,(CNN) -- A 6.9-magnitude earthquake shook the ...,The Solomon Islands are struck by a 6.9-magnit...,6ac07d30aa4d96ae9364252b7a335bba70866a9d,"[-0.4946289, 1.5751953, -1.7597656, 1.6923828,...",A 6.9-magnitude earthquake shook the Pacific O...
988,"One doctor says the study ""very clearly shows ...",Removal of thimerosal from most vaccines hasn'...,4981d2d34eef6c189da1cf5c504b3f6a3bf36365,"[0.2927246, -1.0820312, -1.3164062, 1.1435547,...",A new study found the prevalence of autism cas...


Saving the abstractive summary

In [None]:
dataset.to_csv('/content/drive/My Drive/DATA608_Work/embedded_abstractive_data.csv', index=False)

Extractive Summary

In [None]:
dataset['Extractive Summary'] = dataset['article'].apply(lambda x: extractive_summary(x))

In [None]:
dataset

Unnamed: 0,article,highlights,id,Embedding,Abstractive Summary,Extractive Summary
0,Nasa has warned of an impending asteroid pass ...,2004 BL86 will pass about three times the dist...,6ccb7278e86893ad3609d30ecb5c9ea902fb9527,"[-1.4140625, 0.95214844, -0.23156738, 1.034179...","The asteroid, designated 2004 BL86, will safel...",Nasa has warned of an impending asteroid pass ...
1,"BAGHDAD, Iraq (CNN) -- Iraq's most powerful Su...","Iraqi Islamic Party calls Quran incident ""blat...",d4f57e3c18c38696345fb7a3d76a151bb9c5123b,"[1.2080078, -1.1123047, -1.6630859, 1.2548828,...",U.S. soldier's desecration of the Quran requir...,Maj. Gen. Jeffery Hammond apologizes after a U...
2,By . David Kent . Andy Carroll has taken an un...,Carroll takes to Instagram to post selfie ahea...,c9ae9fc314adcc92d3835b0437a1c44e9e233e1c,"[-1.5576172, -0.3149414, 0.19250488, 0.7631836...",Andy Carroll posted a snap of himself just as ...,"And you can understand the look on his face, a..."
3,Los Angeles (CNN) -- Los Angeles has long been...,Pop stars from all over Europe are setting the...,5b5a383dc8f9487857787ced5426154394dd99db,"[1.1689453, 0.51220703, 0.48339844, 0.7832031,...",L.A.'s list of the best pop stars in the world...,"'s Lena Katina from Russia, Slovakia's TWiiNS ..."
4,London (CNN) -- Few shows can claim such an au...,NEW: Young athletes light the Olympic cauldron...,2813505a990ad24071496c0d0936e40847eb6194,"[0.47436523, -1.1376953, -2.8867188, 0.7612304...","""Isles of Wonder"" showcased tributes to the Br...",Some details of the £27 million show were rel...
...,...,...,...,...,...,...
985,Huge reserves of the oldest water on Earth are...,Geologists estimate ancient rocks contain arou...,709ba11a8e686be453ddf2656a31f82a30f06a04,"[1.7333984, 0.6772461, -2.7089844, 1.2421875, ...",Geologists have found water that is up to 2.7 ...,"'It provides a ""treasure map"" of just how many..."
986,"By . Gemma Hartley . IF it’s not hedge funds, ...",Former RBS boss has been in four-year feud wit...,22d04e0b2e5d0ff9c78b7e384ad9cb2c2f06aabe,"[-0.6245117, 0.9345703, 0.21057129, 0.60595703...",Former Royal Bank of Scotland boss Fred Goodwi...,Mr Goodwin has been feuding with Colinton neig...
987,(CNN) -- A 6.9-magnitude earthquake shook the ...,The Solomon Islands are struck by a 6.9-magnit...,6ac07d30aa4d96ae9364252b7a335bba70866a9d,"[-0.4946289, 1.5751953, -1.7597656, 1.6923828,...",A 6.9-magnitude earthquake shook the Pacific O...,"By 12:56 a.m. local time Wednesday, there had ..."
988,"One doctor says the study ""very clearly shows ...",Removal of thimerosal from most vaccines hasn'...,4981d2d34eef6c189da1cf5c504b3f6a3bf36365,"[0.2927246, -1.0820312, -1.3164062, 1.1435547,...",A new study found the prevalence of autism cas...,"One doctor says the study ""very clearly shows ..."


Saving extractive summary

In [None]:
dataset.to_csv('/content/drive/My Drive/DATA608_Work/embedded_abstractive_extractive_data.csv', index=False)

In [None]:
def generate_summaries(query, df, k=5):
    # Get top-k articles
    top_k_articles = get_top_k_similar_articles(query, df, k)

    # Extractive summaries
    extractive_summaries = "ARTICLE: " + "\n\nARTICLE: ".join(top_k_articles['Extractive Summary'].values)
    extractive_prompt = f"Summarize the following articles into a cohesive extractive summary in under 200 words with proper flow. It should have something from each article:\n{extractive_summaries}"

    # Abstractive summaries
    abstractive_summaries = "ARTICLE: " + "\n\nARTICLE: ".join(top_k_articles['Abstractive Summary'].values)
    abstractive_prompt = f"Summarize the following articles into a cohesive abstractive summary in under 200 words with proper flow. It should have something from each article:\n{abstractive_summaries}"

    # Use distinct local names so the extractive_summary() helper is not shadowed.
    print("EXTRACTIVE PROMPT:\n", extractive_prompt)
    final_extractive = cohere_prompt(extractive_prompt)

    print("\n\nABSTRACTIVE PROMPT:\n", abstractive_prompt)
    final_abstractive = cohere_prompt(abstractive_prompt)

    return final_extractive, final_abstractive
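
The two prompts above follow the same template, and joining the column values directly will fail if any summary is missing. Here is a sketch of a small helper that builds either prompt and skips unusable entries. The name `build_prompt` and its parameters are hypothetical. The instruction wording is copied from the function above.

```python
def build_prompt(summaries, style, word_limit=200):
    """Build the summarization prompt for one style ('extractive' or 'abstractive')."""
    # Keep only non-empty strings; None or NaN values from the DataFrame are dropped.
    cleaned = [s.strip() for s in summaries if isinstance(s, str) and s.strip()]
    if not cleaned:
        raise ValueError("no usable summaries to build a prompt from")

    # Prefix each retrieved summary with the ARTICLE: marker, as in the notebook.
    joined = "\n\n".join(f"ARTICLE: {s}" for s in cleaned)
    instruction = (
        f"Summarize the following articles into a cohesive {style} summary "
        f"in under {word_limit} words with proper flow. "
        "It should have something from each article:"
    )
    return f"{instruction}\n{joined}"


print(build_prompt(["First summary.", None, "  ", "Second summary."], "extractive"))
```

Dropping missing entries silently means the final summary may cover fewer than k articles, so logging how many were skipped could be worthwhile in practice.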

## Input and Outputs (1)

In [None]:
extractive, abstractive = generate_summaries('Canada federal election 2025', dataset, k=5)

EXTRACTIVE PROMPT:
 Summarize the following articles into a cohesive extractive summary in under 200 words with proper flow. It should have something from each article:
ARTICLE: The risk of a renewed global banking crisis triggered by sovereign defaults in Europe would escalate sharply. Confronted with this dire alternative, Obama will have to compromise by accepting a program of long-term cuts in social programs while Republicans accept phased tax increases -- both departing from their diametrically opposed electoral platforms. Perhaps fortified by weather events such as Hurricane Sandy, Obama's second term may bring renewed efforts to encourage control of carbon emissions through environmental regulation, which could yield some modest results. The positive effect on reduced gas prices in Europe and other regions of these new American technologies is already being felt.

ARTICLE: Washington (CNN) -- Five weeks before the November midterm elections, voters give Democrats an edge over R

## Input and Outputs (2)

In [None]:
extractive, abstractive = generate_summaries('Economic outlook for Canada', dataset, k=5)

EXTRACTIVE PROMPT:
 Summarize the following articles into a cohesive extractive summary in under 200 words with proper flow. It should have something from each article:
ARTICLE: "We cannot fight wars by polls," Panetta said in Ottawa, where he was attending a trilateral defense meeting with Canadian and Mexican officials. As well as embarking on a joint trilateral defense threat assessment, the U.S., Canada and Mexico pledged to do more to confront and combat drug cartels on the continent. Specifically, military officials said there would be more time spent on patrolling waters and inspecting things like shipping containers that cross borders. "This very ambitious goal of coordinating our efforts goes beyond any one specific threat," Canadian Defence Minister Peter MacKay said.

ARTICLE: The Royal Canadian Mounted Police announced Monday that it had arrested Suliman Mohamed, 21, and charged him with participating in the activity of a terrorist group and conspiracy to participate in a t

## Input and Outputs (3)

In [None]:
extractive, abstractive = generate_summaries("climate policies on Canada's oil and gas markets", dataset, k=5)

EXTRACTIVE PROMPT:
 Summarize the following articles into a cohesive extractive summary in under 200 words with proper flow. It should have something from each article:
ARTICLE: A proposed new route through Nebraska for the controversial Keystone XL pipeline will require six to nine months of review under the normal process conducted by state and federal officials, a Nebraska official said Thursday. TransCanada said then it would submit a new proposal for the route through Nebraska. "We will have a significant amount of time for public comment," Linder said. The 293-127 vote allows Boehner to begin negotiations with Senate Democrats over a longer-term funding measure for road, rail and bridge projects.

ARTICLE: The Met Office’s global warming predictions are flawed and could result in millions of pounds being squandered, it is claimed. Large sums of public and private sector money could be ‘malinvested’ in everything from wind farms to heat-proof road surfaces as a result, it claims. 