### Building a text summarizer in Python using NLTK and scikit-learn's TfidfVectorizer class

In simple terms, TF-IDF can be described as follows:

A term receives a high TF-IDF weight when it has a high term frequency in the given document and a low document frequency across the whole collection of documents.

    TF-IDF is the product of two quantities: term frequency and inverse document frequency.

**Term Frequency**

Term frequency (TF) is the number of times a word appears in a document, divided by the total number of words in that document.

    TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
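
The formula above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `term_frequency` and the toy sentence are assumptions made for this example, and the tokens are assumed to be already lower-cased and split on whitespace.

```python
from collections import Counter


def term_frequency(term, tokens):
    """Return the count of `term` in `tokens` divided by the number of tokens."""
    if not tokens:
        # An empty document has no terms, so report a frequency of zero.
        return 0.0
    return Counter(tokens)[term] / len(tokens)


# Hypothetical toy document, lower-cased and split on whitespace.
doc = "the cat sat on the mat".split()

tf_the = term_frequency("the", doc)   # "the" occurs 2 times out of 6 tokens
tf_dog = term_frequency("dog", doc)   # "dog" does not occur at all
print(tf_the, tf_dog)
```

A real pipeline would also need to decide how to handle case, punctuation and stop words before counting.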

**Inverse Document Frequency**

Term frequency measures how common a word is within a single document; inverse document frequency (IDF) measures how rare the word is across the whole collection.

    IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
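
As a minimal sketch of this definition, the snippet below computes the natural-log IDF over a tiny collection. The function name `inverse_document_frequency` and the four toy documents are assumptions made for illustration; each document is represented as a set of its distinct words.

```python
import math


def inverse_document_frequency(term, documents):
    """Return ln(N / df), where df is the number of documents containing `term`."""
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        # The plain formula is undefined for a term that never occurs.
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(len(documents) / df)


# Hypothetical collection: each document is the set of its distinct words.
docs = [
    {"the", "cat", "sat"},
    {"the", "dog", "barked"},
    {"the", "cat", "purred"},
    {"a", "bird", "sang"},
]

idf_the = inverse_document_frequency("the", docs)    # in 3 of 4 documents
idf_bird = inverse_document_frequency("bird", docs)  # in 1 of 4 documents
print(idf_the, idf_bird)
```

The rarer word ("bird") gets the larger IDF, which is the behaviour the definition describes.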

**Example**

Consider a document containing 100 words wherein the word iphone appears 5 times. 

    The term frequency (i.e., TF) for iphone is then (5 / 100) = 0.05.

Now, assume we have 10 million documents and the word iphone appears in one thousand of these. 

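    Then, the inverse document frequency (i.e., IDF), here computed with a base-10 logarithm to keep the numbers round, is log_10(10,000,000 / 1,000) = 4.
    With the natural logarithm used in the definition above, it would be ln(10,000) ≈ 9.21.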

    Thus, the TF-IDF weight is the product of these quantities: 0.05 * 4 = 0.20
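
The arithmetic in this example can be checked directly. The short sketch below recomputes the weight with a base-10 logarithm, which yields the value of 4 used above, and with the natural logarithm from the IDF definition for comparison. The variable names are chosen for this illustration only.

```python
import math

tf = 5 / 100                       # "iphone" appears 5 times in 100 words
ratio = 10_000_000 / 1_000         # total documents / documents with the term

idf_base10 = math.log10(ratio)     # base-10 logarithm, as in the example
idf_natural = math.log(ratio)      # natural logarithm, as in the definition

tfidf_base10 = tf * idf_base10
tfidf_natural = tf * idf_natural
print(round(tfidf_base10, 2), round(tfidf_natural, 2))
```

The choice of base only rescales every weight by the same constant factor, so it does not change how terms rank against each other.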

In [68]:
#Getting TF-IDF results for a given text
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

with open("ml.txt", "r") as f:  # close the file once it has been read
    text = f.read()
documents = nltk.sent_tokenize(text)  # one document per sentence
documents

['The modern definition of artificial intelligence (or AI) is "the study and design of intelligent agents" where an intelligent agent is a system that perceives its environment and takes actions which maximizes its chances of success.',
 'John McCarthy, who coined the term in 1956, defines it as "the science and engineering of making intelligent machines."',
 'Other names for the field have been proposed, such as computational intelligence, synthetic intelligence or computational rationality.',
 'The term artificial intelligence is also used to describe a property of machines or programs: the intelligence that the system demonstrates.',
 'AI research uses tools and insights from many fields, including computer science, psychology, philosophy, neuroscience, cognitive science, linguistics, operations research, economics, control theory, probability, optimization and logic.',
 'AI research also overlaps with tasks such as robotics, control systems, scheduling, data mining, logistics, spee

In [65]:
#Get tf-idf values from fit_transform() method.

# create object
tfidf = TfidfVectorizer()
  
# get tf-idf values
result = tfidf.fit_transform(documents)
result     #Sparse matrix

<16x201 sparse matrix of type '<class 'numpy.float64'>'
	with 301 stored elements in Compressed Sparse Row format>

In [66]:
#Display idf values of the words present in the corpus.

# get idf values
print('\nidf values:')
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(ele1, ':', ele2)


idf values:
1956 : 3.1400661634962708
act : 3.1400661634962708
actions : 3.1400661634962708
agent : 3.1400661634962708
agents : 3.1400661634962708
ai : 2.041453874828161
algorithms : 3.1400661634962708
also : 2.734601055388106
an : 3.1400661634962708
and : 1.4353180712578455
ant : 3.1400661634962708
applies : 3.1400661634962708
are : 3.1400661634962708
artificial : 2.734601055388106
as : 1.6359887667199966
associated : 3.1400661634962708
attempts : 3.1400661634962708
based : 3.1400661634962708
be : 3.1400661634962708
been : 2.734601055388106
better : 3.1400661634962708
biologically : 3.1400661634962708
boiling : 3.1400661634962708
both : 3.1400661634962708
brain : 3.1400661634962708
by : 3.1400661634962708
can : 3.1400661634962708
capabilities : 3.1400661634962708
capable : 3.1400661634962708
chances : 3.1400661634962708
check : 3.1400661634962708
clarion : 3.1400661634962708
cognitive : 3.1400661634962708
coined : 3.1400661634962708
cold : 3.1400661634962708
combine : 3.1400661634962
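
The idf values printed above do not follow the plain ln(N / df) formula given earlier. With its default setting `smooth_idf=True`, scikit-learn documents the idf as ln((1 + n) / (1 + df(t))) + 1. The sketch below reproduces two of the printed values, assuming the corpus contains 16 sentences (as the 16x201 matrix shape suggests) and that the default settings were used. The helper name `smoothed_idf` is an assumption made for this example.

```python
import math


def smoothed_idf(n_documents, document_frequency):
    """Smoothed idf as documented for scikit-learn's default smooth_idf=True."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1


n = 16  # assumed number of sentences, taken from the 16x201 matrix shape

idf_df1 = smoothed_idf(n, 1)  # a term found in a single sentence, e.g. "1956"
idf_df2 = smoothed_idf(n, 2)  # a term found in two sentences, e.g. "also"
print(idf_df1, idf_df2)
```

Note that `TfidfVectorizer` also L2-normalizes each row by default, so the entries of the resulting matrix are not simply the raw term count multiplied by these idf values.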

In [67]:
#Display tf-idf values along with indexing.

# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)
  
# display tf-idf values
print('\ntf-idf value:')
print(result)
  
# in matrix form
print('\ntf-idf values in matrix form:')
print(result.toarray())


Word indexes:
{'the': 177, 'modern': 110, 'definition': 48, 'of': 122, 'artificial': 13, 'intelligence': 87, 'or': 126, 'ai': 5, 'is': 91, 'study': 162, 'and': 9, 'design': 51, 'intelligent': 88, 'agents': 4, 'where': 195, 'an': 8, 'agent': 3, 'system': 170, 'that': 176, 'perceives': 132, 'its': 94, 'environment': 57, 'takes': 172, 'actions': 2, 'which': 196, 'maximizes': 105, 'chances': 29, 'success': 164, 'john': 95, 'mccarthy': 106, 'who': 197, 'coined': 33, 'term': 175, 'in': 78, '1956': 0, 'defines': 47, 'it': 92, 'as': 14, 'science': 153, 'engineering': 56, 'making': 103, 'machines': 100, 'other': 127, 'names': 114, 'for': 65, 'field': 62, 'have': 73, 'been': 19, 'proposed': 143, 'such': 165, 'computational': 37, 'synthetic': 169, 'rationality': 145, 'also': 7, 'used': 191, 'to': 183, 'describe': 50, 'property': 142, 'programs': 140, 'demonstrates': 49, 'research': 148, 'uses': 192, 'tools': 184, 'insights': 84, 'from': 67, 'many': 104, 'fields': 63, 'including': 80, 'computer':

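The next example repeats the computation after splitting the text into word tokens instead of sentences. Each token, including punctuation, is then passed to the vectorizer as a separate document: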

In [69]:
documents2 = nltk.word_tokenize(text)
documents2

['The',
 'modern',
 'definition',
 'of',
 'artificial',
 'intelligence',
 '(',
 'or',
 'AI',
 ')',
 'is',
 '``',
 'the',
 'study',
 'and',
 'design',
 'of',
 'intelligent',
 'agents',
 "''",
 'where',
 'an',
 'intelligent',
 'agent',
 'is',
 'a',
 'system',
 'that',
 'perceives',
 'its',
 'environment',
 'and',
 'takes',
 'actions',
 'which',
 'maximizes',
 'its',
 'chances',
 'of',
 'success',
 '.',
 'John',
 'McCarthy',
 ',',
 'who',
 'coined',
 'the',
 'term',
 'in',
 '1956',
 ',',
 'defines',
 'it',
 'as',
 '``',
 'the',
 'science',
 'and',
 'engineering',
 'of',
 'making',
 'intelligent',
 'machines',
 '.',
 "''",
 'Other',
 'names',
 'for',
 'the',
 'field',
 'have',
 'been',
 'proposed',
 ',',
 'such',
 'as',
 'computational',
 'intelligence',
 ',',
 'synthetic',
 'intelligence',
 'or',
 'computational',
 'rationality',
 '.',
 'The',
 'term',
 'artificial',
 'intelligence',
 'is',
 'also',
 'used',
 'to',
 'describe',
 'a',
 'property',
 'of',
 'machines',
 'or',
 'programs',
 '

In [70]:
# create object
tfidf = TfidfVectorizer()
  
# get tf-idf values
result = tfidf.fit_transform(documents2)
  
# get idf values
print('\nidf values:')
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(ele1, ':', ele2)
  
# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)
  
# display tf-idf values
print('\ntf-idf value:')
print(result)
  
# in matrix form
print('\ntf-idf values in matrix form:')
print(result.toarray())



idf values:
1956 : 6.320567975482857
act : 6.320567975482857
actions : 6.320567975482857
agent : 6.320567975482857
agents : 6.320567975482857
ai : 5.067805006987489
algorithms : 5.627420794922911
also : 5.915102867374692
an : 6.320567975482857
and : 4.241126433803021
ant : 6.320567975482857
applies : 6.320567975482857
are : 6.320567975482857
artificial : 5.915102867374692
as : 4.816490578706582
associated : 6.320567975482857
attempts : 6.320567975482857
based : 6.320567975482857
be : 6.320567975482857
been : 5.915102867374692
better : 6.320567975482857
biologically : 6.320567975482857
boiling : 6.320567975482857
both : 6.320567975482857
brain : 6.320567975482857
by : 6.320567975482857
can : 6.320567975482857
capabilities : 6.320567975482857
capable : 6.320567975482857
chances : 6.320567975482857
check : 6.320567975482857
clarion : 6.320567975482857
cognitive : 6.320567975482857
coined : 6.320567975482857
cold : 6.320567975482857
combine : 6.320567975482857
computation : 6.320567975482

----------