<a href="https://colab.research.google.com/github/Osakhra/ITAI2373-NewsBot-Final/blob/main/notebooks/05_Text_Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 05_Text_Summarization.ipynb

In this notebook, I will generate concise summaries of news articles using both extractive and abstractive summarization techniques from my NewsBot 2.0 pipeline.

**Specifically, I will:**
- Import my preprocessed news data
- Apply my Summarizer module (using transformers and/or TextRank)
- Compare summaries to the original content

---


In [1]:
!pip install langdetect spacy nltk scikit-learn pyldavis textblob transformers torch sumy sentence-transformers numpy matplotlib seaborn googletrans==4.0.0-rc1
import nltk
nltk.download('stopwords')
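# Clone only when the repo is not already present, so re-running this cell does not fail
!test -d ITAI2373-NewsBot-Final || git clone https://github.com/Osakhra/ITAI2373-NewsBot-Final.git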
import sys
sys.path.append('/content/ITAI2373-NewsBot-Final/src')


fatal: destination path 'ITAI2373-NewsBot-Final' already exists and is not an empty directory.


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
from google.colab import files
uploaded = files.upload()

Saving news_cleaned.csv to news_cleaned (1).csv


In [3]:
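import io
import pandas as pd
# Read the bytes returned by files.upload() directly, so a renamed re-upload
# (e.g. "news_cleaned (1).csv") cannot leave an older copy being loaded
filename = next(iter(uploaded))
df = pd.read_csv(io.BytesIO(uploaded[filename]))
df.head()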


| | ArticleId | content | category | clean_content |
|---|---|---|---|---|
| 0 | 1833 | worldcom ex-boss launches defence lawyers defe... | business | worldcom ex boss launch defence lawyer defend ... |
| 1 | 154 | german business confidence slides german busin... | business | german business confidence slide german busine... |
| 2 | 1101 | bbc poll indicates economic gloom citizens in ... | business | bbc poll indicate economic gloom citizen major... |
| 3 | 1976 | lifestyle governs mobile choice faster bett... | tech | lifestyle govern mobile choice fast well funky... |
| 4 | 917 | enron bosses in $168m payout eighteen former e... | business | enron boss payout eighteen former enron direct... |


In [4]:
from language_models.summarizer import Summarizer

summarizer = Summarizer()  # Uses a transformer model by default


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Device set to use cuda:0
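
The `Summarizer` class comes from the repo's `language_models/summarizer.py`, which is not shown in this notebook. As a rough sketch of what such a wrapper might do, assuming it delegates to a Hugging Face summarization pipeline, the hypothetical `SketchSummarizer` below takes its backend as an injected callable and guards against empty input. The names `SketchSummarizer`, `backend`, `max_chars`, and `first_sentence_backend` are illustrative assumptions, not the module's actual API.

```python
from typing import Callable, Dict, List


class SketchSummarizer:
    """Hypothetical stand-in for the repo's Summarizer; not the actual implementation."""

    def __init__(self, backend: Callable[..., List[Dict[str, str]]], max_chars: int = 3000):
        # `backend` is any callable shaped like a Hugging Face summarization
        # pipeline: backend(text, **kwargs) -> [{"summary_text": "..."}]
        self.backend = backend
        # Crude character cap (assumption); a real wrapper may truncate by tokens.
        self.max_chars = max_chars

    def summarize(self, text, max_length: int = 60, min_length: int = 15) -> str:
        # Return an empty summary for NaN or blank cells instead of calling the model.
        if not isinstance(text, str) or not text.strip():
            return ""
        result = self.backend(
            text[: self.max_chars], max_length=max_length, min_length=min_length
        )
        return result[0]["summary_text"].strip()


def first_sentence_backend(text: str, **kwargs) -> List[Dict[str, str]]:
    # Toy backend, used only to exercise the wrapper without downloading a model.
    return [{"summary_text": text.split(". ")[0]}]


demo = SketchSummarizer(first_sentence_backend)
print(demo.summarize("confidence fell in february. the outlook worsened."))
print(repr(demo.summarize("   ")))
```

In Colab, the toy backend could be swapped for `transformers.pipeline("summarization")`, which returns the same list-of-dicts shape. Whether the real `Summarizer` is built this way is an assumption.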


In [5]:
# Picking a few articles to summarize
for i in range(3):
    original = df['content'].iloc[i]
    summary = summarizer.summarize(original)
    print(f"Original Article #{i+1}:\n", original[:400], "\n")
    # The separator needs its own f prefix; without it the braces print literally
    print(f"Summary #{i+1}:\n", summary, f"\n{'-'*60}\n")


Original Article #1:

Summary #1:
{'-'*60}

Original Article #2:
 german business confidence slides german business confidence fell in february knocking hopes of a speedy recovery in europe s largest economy.  munich-based research institute ifo said that its confidence index fell to 95.5 in february from 97.5 in january  its first decline in three months. the study found that the outlook in both the manufacturing and retail sectors had worsened. observers had b 

Summary #2:
 German business confidence fell in february knocking hopes of a speedy recovery in europe s largest economy. munich-based research institute ifo said that its confidence index fell to 95.5 in February. 
{'-'*60}

Original Article #3:
 bbc poll indicates economic gloom citizens in a majority of nations surveyed in a bbc world service poll believe the world economy is worsening.  most respondents also said their national economy was getting worse. but when asked about their own family s financial outlook  a majorit

In [6]:
df_sample = df.sample(n=5, random_state=42)
df_sample['summary'] = df_sample['content'].apply(summarizer.summarize)
df_sample[['content', 'summary']]


| | content | summary |
|---|---|---|
| 941 | wal-mart is sued over rude lyrics the parents ... | Parents of a 13-year-old girl are suing us sup... |
| 297 | howard taunts blair over splits tony blair s f... | Tory leader michael howard asked how can they ... |
| 271 | fox attacks blair s tory lies tony blair lie... | tory co-chairman liam fox was speaking after ... |
| 774 | online commons to spark debate online communit... | Think-tank says the net has not yet been fully... |
| 420 | piero gives rugby perspective bbc sport unveil... | bbc sport unveils its new analysis tool piero... |
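
To compare summaries with their source articles beyond reading them side by side, two simple numbers can help: how much shorter the summary is, and how many of its words are copied from the original. The sketch below is one possible way to compute them. The helper names `tokenize` and `compare` are hypothetical, and word overlap is only a rough proxy for how extractive a summary is.

```python
import re


def tokenize(text: str) -> list:
    # Lowercase alphanumeric tokens; punctuation is dropped.
    return re.findall(r"[a-z0-9]+", text.lower())


def compare(original: str, summary: str) -> dict:
    orig_tokens = tokenize(original)
    summ_tokens = tokenize(summary)
    if not orig_tokens or not summ_tokens:
        return {"compression": 0.0, "overlap": 0.0}
    orig_vocab = set(orig_tokens)
    copied = sum(1 for tok in summ_tokens if tok in orig_vocab)
    return {
        # Summary length as a fraction of the original length (in words).
        "compression": len(summ_tokens) / len(orig_tokens),
        # Share of summary words that also occur in the original.
        "overlap": copied / len(summ_tokens),
    }


original = "german business confidence fell in february knocking hopes of a speedy recovery"
summary = "German business confidence fell in February"
print(compare(original, summary))  # {'compression': 0.5, 'overlap': 1.0}
```

Applied to the sample above, something like `pd.DataFrame([compare(c, s) for c, s in zip(df_sample['content'], df_sample['summary'])], index=df_sample.index)` would give one row of metrics per article.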


In [7]:
transformer_summary = summarizer.summarize(df['content'].iloc[0])

# TextRank fallback with sumy
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
import nltk

# Download the 'punkt' and 'punkt_tab' tokenizers if they are not already present
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab')


# Build a sumy document from the first article and keep the three top-ranked sentences
parser = PlaintextParser.from_string(df['content'].iloc[0], Tokenizer("english"))
text_rank_summarizer = TextRankSummarizer()
top_sentences = text_rank_summarizer(parser.document, sentences_count=3)
text_rank_summary = " ".join(str(sentence) for sentence in top_sentences)

print("Transformer summary:", transformer_summary)
print("TextRank summary:", text_rank_summary)
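
For intuition about what the TextRank step does, here is a simplified, dependency-free sketch of the idea: sentences become graph nodes, edges are weighted by word overlap, and a PageRank-style iteration scores each sentence. This is not sumy's implementation. The regex sentence splitter, the function names, and the fixed iteration count are assumptions made for illustration.

```python
import math
import re


def split_sentences(text: str) -> list:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a: list, b: list) -> float:
    # Shared words normalized by log sentence lengths, as in the TextRank paper.
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    return len(set(a) & set(b)) / denom if denom > 0 else 0.0


def textrank(text: str, sentences_count: int = 3, damping: float = 0.85,
             iterations: int = 50) -> list:
    sentences = split_sentences(text)
    tokens = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    n = len(sentences)
    # Symmetric weight matrix with no self-loops.
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence collects score from its neighbors, weighted by edge strength.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:sentences_count]
    return [sentences[i] for i in sorted(top)]  # keep original sentence order


text = ("the bank raised interest rates. interest rates affect the housing market. "
        "the cat slept all day. the housing market reacts to bank interest rates.")
print(textrank(text, sentences_count=3))
```

On this toy text the off-topic sentence about the cat shares almost no words with the others, so it receives the lowest score and is dropped from the three-sentence summary.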



In [8]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True


