Article-Level Text Summarization with TinyLlama

In [1]:
# pip install llama-cpp-python

import pandas as pd
import os
import re
import torch
from llama_cpp import Llama
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM




In [None]:
# Download the quantized GGUF model locally (or to a synced folder such as Dropbox), e.g. in bash:
#   wget -O ~/Dropbox/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf "https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
# Load the model with llama-cpp-python
#MODEL_PATH = os.path.expanduser("~/Dropbox/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf")
#llm = Llama(model_path=MODEL_PATH, n_ctx=2048, n_threads=6)
# Ensure that the model path is correct
#MODEL_PATH = os.path.expanduser(MODEL_PATH)  # Expanding '~' to the full path

# Load tokenizer and model from the local path (TinyLlama model)
#tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, local_files_only=True)
#model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float16, local_files_only=True)
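
The commented-out cell above points at a GGUF file, which is the format llama-cpp-python loads; `from_pretrained` is normally given a model directory or Hub ID instead. Below is a minimal sketch of how summarization through a `Llama` object might look. The helper names `build_prompt` and `summarize_with_llm`, the `stop` marker, and the `fake_llm` stand-in are assumptions for illustration and are not part of the original notebook. The function accepts any callable with the llama-cpp-python completion interface, so it can be tried without the model file.

```python
def build_prompt(article, max_chars=1000):
    """Collapse whitespace, truncate the article, and wrap it in a summarization prompt."""
    text = " ".join(article.split())[:max_chars]
    return f"Summarize this article:\n{text}\n\n### Summary:"


def summarize_with_llm(llm, article, max_tokens=200):
    """Call a llama-cpp-python style completion callable and return only the new text."""
    # Llama.__call__ returns a dict; the completion text sits at choices[0]["text"].
    result = llm(build_prompt(article), max_tokens=max_tokens, stop=["###"])
    return result["choices"][0]["text"].strip()


# Hypothetical stand-in for a real model so the sketch runs without the GGUF file.
def fake_llm(prompt, max_tokens=200, stop=None):
    return {"choices": [{"text": " The referendum could cost 80m pounds. "}]}


print(summarize_with_llm(fake_llm, "It could cost 80m pounds   to run a UK referendum."))

# With the real model (as in the commented-out cell):
# llm = Llama(model_path=MODEL_PATH, n_ctx=2048, n_threads=6)
# summary = summarize_with_llm(llm, df["body"].iloc[0])
```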



In [3]:
# Fetch the data
df = pd.read_csv("../data/news_politics_topics.csv")
df.head()

Unnamed: 0,body,title,filename,category,topic
0,"The ""best person for the job"" should be appoin...",'Best person' for top legal job,bbc/politics/273.txt,politics,6
1,A cap on donations to political parties should...,'Debate needed' on donations cap,bbc/politics/059.txt,politics,-1
2,A cap on donations to political parties should...,'Debate needed' on donations cap,bbc/politics/298.txt,politics,-1
3,It could cost £80m to run a UK referendum on t...,'EU referendum could cost £80m',bbc/politics/391.txt,politics,0
4,The initial attempt to sell the Millennium Dom...,'Errors' doomed first Dome sale,bbc/politics/006.txt,politics,5


In [4]:
# TEXT CLEANING
# Filter out outliers (-1 topics)
df = df[df["topic"] != -1].reset_index(drop=True)
df.head()

Unnamed: 0,body,title,filename,category,topic
0,"The ""best person for the job"" should be appoin...",'Best person' for top legal job,bbc/politics/273.txt,politics,6
1,It could cost £80m to run a UK referendum on t...,'EU referendum could cost £80m',bbc/politics/391.txt,politics,0
2,The initial attempt to sell the Millennium Dom...,'Errors' doomed first Dome sale,bbc/politics/006.txt,politics,5
3,The risk of pets and children being given vote...,'Fido' to be taken off vote lists,bbc/politics/079.txt,politics,1
4,"People in England, Scotland and Wales must hav...",'Last chance' warning for voters,bbc/politics/412.txt,politics,1


In [None]:
# EXPERIMENT WITH BART
# Load BART model and tokenizer
model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Initialize text summarization pipeline
summarization_pipeline = pipeline(
    "summarization",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1  # Use GPU if available, otherwise CPU
)

# Clean text function
def clean_text(text):
    """Collapse runs of whitespace (spaces, tabs, line breaks) into single spaces."""
    # split() drops every whitespace run, including leading and trailing ones
    return " ".join(text.split())

# Summarization function
def summarize_article(text):
    """Summarize a single article using BART."""
    text = clean_text(text)
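    # BART accepts a limited number of input tokens; truncation=True cuts longer articles
    # at that limit (slicing text[:1024] would count characters, not tokens).

    # Generate the summary
    summary = summarization_pipeline(
        text,
        truncation=True,
        max_length=200,  # Upper bound on summary length, in tokens
        min_length=80,  # Lower bound on summary length, in tokens
        do_sample=True,  # True samples from the distribution; False gives deterministic output
        temperature=0.7,  # Values below 1 sharpen the distribution (less random)
        top_k=40,  # Sample only from the 40 most probable tokens
        top_p=0.75,  # Sample only from the smallest token set with cumulative probability 0.75
        repetition_penalty=1.2  # Penalize tokens that were already generated
    )[0]["summary_text"]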
    
    return summary.strip()

# Apply summarization to the first 5 articles
df_sample = df.head(5).copy()  # copy so the new column is not written to a slice of df
df_sample["summary"] = df_sample["body"].apply(summarize_article)

df_sample.head()


Device set to use cpu


Unnamed: 0,body,title,filename,category,topic,summary
0,"The ""best person for the job"" should be appoin...",'Best person' for top legal job,bbc/politics/273.txt,politics,6,Christopher Leslie says the post should not be...
1,It could cost £80m to run a UK referendum on t...,'EU referendum could cost £80m',bbc/politics/391.txt,politics,0,Constitution would make the European Union wor...
2,The initial attempt to sell the Millennium Dom...,'Errors' doomed first Dome sale,bbc/politics/006.txt,politics,5,Too many parties were involved in decision-mak...
3,The risk of pets and children being given vote...,'Fido' to be taken off vote lists,bbc/politics/079.txt,politics,1,Electoral Commission says having voters regist...
4,"People in England, Scotland and Wales must hav...",'Last chance' warning for voters,bbc/politics/412.txt,politics,1,"People in England, Scotland and Wales must hav..."
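
The sampling parameters in the BART cell (`temperature`, `top_k`, `top_p`) all reshape the next-token distribution before a token is drawn. The sketch below illustrates the idea on a toy distribution in plain Python. It is a simplified illustration, not the transformers implementation; the function name `filter_distribution` and the logit values are made up for the example.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return renormalized token probabilities after temperature, top-k and top-p."""
    # Temperature rescales the logits: values below 1 sharpen the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    # Softmax, shifted by the maximum for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)
    # Top-k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
        mass = sum(prob for _, prob in ranked)
        ranked = [(tok, prob / mass) for tok, prob in ranked]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


toy_logits = {"the": 2.0, "a": 1.5, "vote": 0.5, "dome": -1.0}  # made-up values
print(filter_distribution(toy_logits, temperature=0.7, top_k=40, top_p=0.75))
```

With these toy values, `top_k=40` has no effect because there are only four candidates, while `top_p=0.75` removes the low-probability tail.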


In [13]:
df_sample['summary'].iloc[4]

'People in England, Scotland and Wales must have registered by 1700 GMT. Those who filled in forms last autumn should already be on the register. But those who have moved house or were on holiday may have been left off. There will also be elections for local councils and mayors in parts of England on 5 May. The deadline for voters to register in Northern Ireland expired on Thursday. Last week Preston City Council reported that more than 14,000 voters were not registered.'

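Notes (BART summary of article 4):
- concise and informative
- logical flow
- key figures included (the 1700 GMT deadline, the 5 May elections, more than 14,000 unregistered voters in Preston)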
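
The BART cell mentions the model's input token limit. For articles longer than that limit, one possible approach is to split the text on sentence boundaries and summarize each chunk separately. The sketch below shows only the splitting step. It uses the whitespace word count as a rough proxy for tokens, which is an assumption; a real run would count with the model's tokenizer. The name `chunk_sentences` and the `max_words` budget are hypothetical.

```python
import re


def chunk_sentences(text, max_words=400):
    """Group sentences into chunks of at most max_words whitespace-separated words."""
    sentences = re.split(r"(?<=[.!?])\s+", " ".join(text.split()))
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        # A single sentence longer than the budget still gets a chunk of its own.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


article = "First sentence here. Second sentence follows! Is this the third? Yes it is."
for chunk in chunk_sentences(article, max_words=6):
    print(chunk)
```

Each chunk could then be passed to `summarize_article`, and the partial summaries joined or summarized once more.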

In [None]:
# EXPERIMENT WITH TINYLLAMA
# Load the model and tokenizer
model_name = "PY007/TinyLlama-1.1B-Chat-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Clean text function
def clean_text(text):
    """Collapse runs of whitespace (spaces, tabs, line breaks) into single spaces."""
    text = re.sub(r"\s+", " ", text)  # Replace each run of whitespace with a single space
    return text.strip()

# Pipeline Initialization for text generation
summarization_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",  # Use GPU if available, otherwise defaults to CPU
)

# Summarization function
def summarize_article(text):
    """Summarize a single article using TinyLlama."""
    text = clean_text(text)
    text = text[:1000]  # Limit text to 1000 characters for faster processing

    prompt = f"Summarize this article:\n{text}\n\n### Summary:"
    
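    # Generate the summary
    # Note: by default the text-generation pipeline returns the prompt followed by
    # the continuation; passing return_full_text=False would return only the new text.
    sequences = summarization_pipeline(
        prompt,
        do_sample=True,
        top_k=50,
        top_p=0.7,
        num_return_sequences=1,
        repetition_penalty=1.1,
        max_new_tokens=200,
    )
    # Return the generated text (prompt included, see the note above)
    return sequences[0]['generated_text'].strip()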

# Limit df to the first 5 articles (copy so the new column is not written to a slice of df)
df_sample = df.head(5).copy()

# Apply summarization only to these 5 articles
df_sample["summary"] = df_sample["body"].apply(summarize_article)
# Applying summarization to every article takes about 16 minutes
#df["summary"] = df["body"].apply(summarize_article)
df_sample.head()

Device set to use mps:0


Unnamed: 0,body,title,filename,category,topic,summary
0,"The ""best person for the job"" should be appoin...",'Best person' for top legal job,bbc/politics/273.txt,politics,6,"Summarize this article:\nThe ""best person for ..."
1,It could cost £80m to run a UK referendum on t...,'EU referendum could cost £80m',bbc/politics/391.txt,politics,0,Summarize this article:\nIt could cost £80m to...
2,The initial attempt to sell the Millennium Dom...,'Errors' doomed first Dome sale,bbc/politics/006.txt,politics,5,Summarize this article:\nThe initial attempt t...
3,The risk of pets and children being given vote...,'Fido' to be taken off vote lists,bbc/politics/079.txt,politics,1,Summarize this article:\nThe risk of pets and ...
4,"People in England, Scotland and Wales must hav...",'Last chance' warning for voters,bbc/politics/412.txt,politics,1,"Summarize this article:\nPeople in England, Sc..."


In [35]:
df_sample['summary'].iloc[1]

'Summarize this article:\nIt could cost £80m to run a UK referendum on the European constitution, ministers have revealed. In a written parliamentary answer, Constitutional Affairs Minister Chris Leslie said the poll was likely to cost the same as a general election. Mr Leslie said the cost could not be compared with the only previous British referendum, held 30 years ago. Ministers say the constitution would make the European Union work better but critics fear creating a "super state". Labour MP John Cryer, whose question revealed the price estimate, said the cost surprised him but was not a central factor as it was important people had their say. But he said it would have been better to have rejected the constitution so avoiding the need for a referendum. The 2001 election cost £80m. No date for the vote has been set but Foreign Secretary Jack Straw has suggested it is unlikely to be held until early 2006 - after the predicted date for the next election. Most voters said the UK shoul
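
The output above begins with the prompt because the `text-generation` pipeline returns the prompt together with the continuation by default. Besides asking the pipeline for the continuation only, the prompt can be removed in a post-processing step. The sketch below splits on the `### Summary:` marker used in the notebook's prompt; the helper name `extract_summary` and the sample string are made up for illustration.

```python
SUMMARY_MARKER = "### Summary:"  # the marker that ends the notebook's prompt


def extract_summary(generated_text, marker=SUMMARY_MARKER):
    """Return the text after the first marker, or the whole text if the marker is absent."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()


# Made-up generated text in the same shape as the pipeline output.
generated = "Summarize this article:\nSome article text.\n\n### Summary: A short made-up summary. "
print(extract_summary(generated))
```

Applied to the sample frame, this could look like `df_sample["summary"].apply(extract_summary)`.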

Notes (TinyLlama summary of article 1):
Strong points:
- concise summary
- covers the key points

Weaknesses:
- fabrication (statements not supported by the article)
- loss of context
- lacks key details
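
Two of the weaknesses listed above, fabrication and missing key details, can be partly checked automatically. The sketch below is a rough heuristic that is not part of the original notebook: it compares the numeric figures in a summary with those in the article. The names `figures` and `figure_report` and the sample texts are hypothetical, and the check says nothing about non-numeric claims.

```python
import re

NUMBER = re.compile(r"\d[\d,.]*\d|\d")


def figures(text):
    """Return the set of numeric strings found in the text, with commas removed."""
    return {match.replace(",", "") for match in NUMBER.findall(text)}


def figure_report(article, summary):
    """List figures the summary adds and figures from the article that it leaves out."""
    in_article, in_summary = figures(article), figures(summary)
    return {
        "unsupported": sorted(in_summary - in_article),
        "omitted": sorted(in_article - in_summary),
    }


# Made-up texts for illustration.
article = "The 2001 election cost 80m pounds. More than 14,000 voters were not registered."
summary = "The election cost 80m pounds and 15,000 voters were missing."
print(figure_report(article, summary))
```

A non-empty `unsupported` list flags a summary for manual review; an `omitted` list is only a hint, since a good summary does not need every figure.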