### Text Summarization using TF-IDF
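1. Split the document into sentences (nltk.sent_tokenize)
2. Vectorize the sentences with TF-IDF, treating each sentence as its own document
3. Score each sentence as the average of its non-zero TF-IDF values
4. Sort the sentences by score (highest first), keeping track of each sentence's original index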
5. Print top N sentences
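
Steps 2 to 5 can be sketched as a single function. This is a minimal sketch, not the notebook's exact code: the name `summarize_sentences` and the three example sentences are made up for illustration, and the function assumes the text has already been split into sentences (step 1), for example with `nltk.sent_tokenize`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def summarize_sentences(sentences, n=3):
    """Return the n highest-scoring sentences, best first (hypothetical helper)."""
    # Step 2: one TF-IDF row per sentence (each sentence acts as a document).
    tfidf = TfidfVectorizer().fit_transform(sentences)

    # Step 3: average the non-zero weights in each row.
    scores = []
    for i in range(tfidf.shape[0]):
        row = tfidf[i].toarray().ravel()
        nonzero = row[row > 0]
        scores.append(nonzero.mean() if nonzero.size else 0.0)

    # Step 4: indices sorted by score, highest first.
    order = np.argsort(scores)[::-1]

    # Step 5: return the top n sentences.
    return [sentences[i] for i in order[:n]]


# Illustrative input only; not taken from the BBC dataset.
example = [
    "The cat sat on the mat.",
    "Dogs bark.",
    "The weather is mild today and the cat is asleep on the mat.",
]
top = summarize_sentences(example, n=2)
for sentence in top:
    print(sentence)
```

With the default settings of `TfidfVectorizer`, each row is L2-normalized, so a sentence with only a few distinct words tends to get a higher average than a long one; in this toy input the two-word sentence ranks first.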

In [1]:
!wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv

--2024-08-06 17:14:24--  https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv
Resolving lazyprogrammer.me (lazyprogrammer.me)... 172.67.213.166, 104.21.23.210
Connecting to lazyprogrammer.me (lazyprogrammer.me)|172.67.213.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5085081 (4,8M) [text/csv]
Saving to: 'bbc_text_cls.csv'


In [1]:
# import libraries
import pandas as pd
import numpy as np
import textwrap
import nltk
from nltk import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Dimitris\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
# load the dataset
df = pd.read_csv('bbc_text_cls.csv')

# helper function: replace newlines with spaces
def strip_nl(s):
    return s.replace("\n", " ")

# remove newlines from every article
df['text'] = df['text'].apply(strip_nl)

# keep the first 5 articles
articles = df['text'].iloc[0:5]

In [28]:
# split the first article into sentences
sentences = nltk.sent_tokenize(articles[0])

# compute TF-IDF, treating each sentence as a document
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sentences)


# score each sentence: mean of its non-zero TF-IDF values
tfidf_means = []
for i in range(tfidf_matrix.shape[0]):
    row = tfidf_matrix.getrow(i).toarray()[0]
    non_zero_elements = row[row > 0]
    if len(non_zero_elements) > 0:
        mean_score = np.mean(non_zero_elements)
    else:
        mean_score = 0
    tfidf_means.append(mean_score)


# rank sentence indices by score, highest first
sorted_indices = np.argsort(tfidf_means)[::-1]

# select the indices of the N highest-scoring sentences
N = 3
top_senteces = sorted_indices[:N]

# print the summary (sentences appear in score order)
summary = [sentences[i] for i in top_senteces]
for sentence in summary:
    print(sentence)


But its own internet business, AOL, had has mixed fortunes.
Time Warner's fourth quarter profits were slightly better than analysts' expectations.
TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn.
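
The per-row loop that averages the non-zero TF-IDF values can also be written with sparse-matrix operations. The following is a minimal sketch under stated assumptions: the helper name `nonzero_row_means` and the toy matrix are made up for illustration, and it relies on every stored entry being non-zero, which holds for the output of `TfidfVectorizer`.

```python
import numpy as np
from scipy.sparse import csr_matrix


def nonzero_row_means(matrix):
    """Mean of the stored (non-zero) entries in each row; 0.0 for empty rows."""
    # Row sums come back as a 2-D result; flatten to a 1-D float array.
    sums = np.asarray(matrix.sum(axis=1), dtype=float).ravel()
    # Number of stored entries per row.
    counts = matrix.getnnz(axis=1)
    # Divide only where a row has at least one entry, leaving 0.0 elsewhere.
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)


# Toy matrix with hypothetical weights: 3 "sentences", 3 "terms".
toy = csr_matrix(np.array([
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0],
    [0.2, 0.4, 0.6],
]))
means = nonzero_row_means(toy)
print(means)
```

In the notebook, `nonzero_row_means(tfidf_matrix)` would stand in for the `tfidf_means` list; the result can be passed to `np.argsort` in the same way.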


In [29]:
for i in top_senteces[:N]:
    print(i)

5
10
2
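
The indices above come out in score order (5, 10, 2). If the summary should instead read in the order the sentences appear in the article, the selected indices can be sorted before printing. This is a small sketch of that idea; the helper name `top_n_in_document_order` and the score values are hypothetical, chosen only so that indices 5, 10 and 2 rank highest.

```python
import numpy as np


def top_n_in_document_order(scores, n):
    """Pick the n best-scoring indices, then return them in ascending order."""
    ranked = np.argsort(scores)[::-1]      # highest score first
    return sorted(int(i) for i in ranked[:n])


# Made-up scores for 11 sentences; not the notebook's actual values.
scores = [0.1, 0.2, 0.7, 0.1, 0.3, 0.9, 0.2, 0.1, 0.3, 0.2, 0.8]
ordered = top_n_in_document_order(scores, n=3)
print(ordered)
```

The summary would then be built as `[sentences[i] for i in ordered]`, so the selected sentences keep their original reading order.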


In [27]:
top_senteces

array([ 5, 10], dtype=int64)