# Text Summarizer with TF-IDF

- Select an article from the BBC corpus
- Split the article into sentences using nltk.sent_tokenize
- Compute the TF-IDF matrix (sentences x tokens)
- Score each sentence by taking the mean of its non-zero TF-IDF values
- Summarize the article by printing only the top-scoring sentences
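
The scoring and selection steps can be sketched on a toy matrix before running them on a real article. This is a minimal illustration, not the notebook's exact code: the matrix values are made up, and the helper names `score_rows` and `top_k_in_order` are hypothetical. It works on a dense NumPy array, whereas the vectorizer used below returns a sparse matrix.

```python
import numpy as np

# Hypothetical 4-sentence x 5-term TF-IDF matrix (values made up for illustration).
tfidf = np.array([
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.25, 0.25, 0.25, 0.25],
])


def score_rows(matrix):
    """Mean of the non-zero entries in each row; 0.0 for an all-zero row."""
    scores = np.zeros(matrix.shape[0])
    for i, row in enumerate(matrix):
        nonzero = row[row != 0]
        if nonzero.size:  # guard against rows with no terms (mean would be NaN)
            scores[i] = nonzero.mean()
    return scores


def top_k_in_order(scores, k):
    """Indices of the k highest scores, returned in order of appearance."""
    ranked = np.argsort(-scores, kind="stable")  # highest score first
    return np.sort(ranked[:k])                   # restore document order


scores = score_rows(tfidf)
selected = top_k_in_order(scores, k=2)
print(scores, selected)
```

Sorting the selected indices at the end is what keeps the summary readable: the chosen sentences appear in their original order rather than in score order.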

In [82]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import sent_tokenize
import numpy as np
import textwrap

In [64]:
# Load an article and tokenize sentences
with open("../datasets/bbc/business/020.txt") as file:
    input_sentences = sent_tokenize(file.read())


In [65]:
input_sentences

["Call centre users 'lose patience'\n\nCustomers trying to get through to call centres are getting impatient and quicker to hang up, a survey suggests.",
 'Once past the welcome message, callers on average hang up after just 65 seconds of listening to canned music.',
 'The drop in patience comes as the number of calls to call centres is growing at a rate of 20% every year.',
 '"Customers are getting used to the idea of an \'always available\' society," says Cara Diemont of IT firm Dimension Data, which commissioned the survey.',
 'However, call centres also saw a sharp increase of customers simply abandoning calls, she says, from just over 5% in 2003 to a record 13.3% during last year.',
 'When automated phone message systems are taken out of the equation, where customers have to pick their way through multiple options and messages, the number of abandoned calls is even higher - a sixth of all callers give up rather than wait.',
 "One possible reason for the lack in patience, Ms Diemon

In [66]:
vectorizer = TfidfVectorizer(stop_words='english', norm='l1')

In [67]:
X = vectorizer.fit_transform(input_sentences)

In [68]:
# Score each row of X (one row per sentence) by the mean of its non-zero TF-IDF values
score = np.zeros(len(input_sentences))
for i in range(X.shape[0]):
    score[i] = X[i][X[i] != 0].mean()

index_score = np.argsort(-score)
print(index_score)

[26 22 11 12 28 27 23 21  7  9 19 18  2 20 14  1 17  8  6  0 15  4  3 25
 16 24 13  5 10]


In [73]:
# Print the top 10 sentences in order of appearance.
sorted_index_score = np.sort(index_score[:10])
for index in sorted_index_score:
    print("%.2f: %s" % (score[index], input_sentences[index]))

0.12: The surge in customers trying to get through to call centres is also a reflection of the centres' growing range of tasks.
0.12: Problems are occurring because increased responsibility is not going hand-in-hand with more training, the survey found.
0.17: This, Ms Diemont warns, is "scary" and not good for the bottom line either.
0.17: Poor training frustrates both call centre workers and customers.
0.12: Half of them argue that workers in other countries offer better skills for the money.
0.20: But not everybody believes that outsourcing and offshoring are the solution.
0.12: Nearly two-thirds of all firms polled for the survey have no plans to offshore their call centres.
0.50: What are your experiences with call centres?
0.14: Are you happy to listen to Vivaldi or Greensleeves, or do you want an immediate response?
0.17: And if you work in a call centre: did your training prepare you for your job?
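
One thing worth noting about these scores: with `norm='l1'`, each row of the matrix is rescaled so that its entries sum to 1, and TF-IDF values are non-negative. The mean of the non-zero entries in a row is therefore 1 divided by the number of distinct terms left in that sentence after stop-word removal, so short sentences tend to score highest. A score of 0.50 is consistent with a sentence that has two remaining terms. The sketch below shows the effect with made-up weights; `l1_normalize` and `mean_nonzero` are hypothetical helpers, not scikit-learn functions.

```python
import numpy as np


def l1_normalize(row):
    """Scale a non-negative row so its entries sum to 1 (all-zero rows are returned unchanged)."""
    total = row.sum()
    return row / total if total > 0 else row


def mean_nonzero(row):
    """Mean of the non-zero entries of a row."""
    return row[row != 0].mean()


# Hypothetical raw TF-IDF weights for two sentences (values made up).
short_sentence = np.array([2.3, 0.0, 1.1, 0.0, 0.0, 0.0])  # 2 distinct terms
long_sentence = np.array([2.3, 0.7, 1.1, 0.4, 1.9, 3.0])   # 6 distinct terms

short_score = mean_nonzero(l1_normalize(short_sentence))
long_score = mean_nonzero(l1_normalize(long_sentence))

# Under L1 normalization the score depends only on the number of non-zero terms.
print(short_score, long_score)
```

If favoring short sentences is not the intended behavior, a different normalization (for example `norm='l2'`, the scikit-learn default) or a different aggregate would make the score depend on the actual term weights.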


Let's wrap the whole pipeline in one function:

In [83]:
def wrap(x):
    return textwrap.fill(x, replace_whitespace=False, fix_sentence_endings=True)

In [96]:
def summarize(text):
    # Tokenize the text into sentences
    input_sentences = sent_tokenize(text)

    # Vectorize the sentences (uses the TfidfVectorizer defined above)
    X = vectorizer.fit_transform(input_sentences)

    # Score each row of X, i.e. each sentence, by the mean of its non-zero TF-IDF values
    score = np.zeros(len(input_sentences))
    for i in range(X.shape[0]):
        score[i] = X[i][X[i] != 0].mean()

    # Rank sentence indices from highest to lowest score
    index_score = np.argsort(-score)

    # Print the top 5 sentences in order of appearance
    sorted_index_score = np.sort(index_score[:5])
    for index in sorted_index_score:
        print(wrap("%.2f: %s" % (score[index], input_sentences[index])))


In [99]:
filepath = "../datasets/bbc/business/032.txt"

with open(filepath) as file:
    text = file.read()
    print('Title: ', text.split('\n\n')[0])

Title:  Japanese banking battle at an end


In [100]:
summarize(text=text)

0.11: The deal would create the world's biggest bank with assets of
about 189 trillion yen ($1.8 trillion).
0.12: Concerns were also raised about Sumitomo's ability to absorb UFJ
and the former has now admitted defeat.
0.50: However, this is expected to be a formality.
0.11: The two are set to merge their venture capital operations and
there has been speculation that this could lead to a full-blown
merger.
0.14: Japanese banks are increasingly seeking alliances to boost
profits.
