                                       Important Word Selection in a Document Using TF-IDF
                                       
                           
Extracting the most meaningful words from a given corpus is a challenging task. Intuitively, we might assume that the words appearing most frequently are the most important. However, every language has words that occur repeatedly in almost any document. Such words are called 'stop words'; in English, "in", "the", "with", and "will" are examples. These words carry very little information. A word's frequency and its distribution across documents can therefore be used to estimate how much information it carries. The TF-IDF technique is built on this idea. Variations of TF-IDF are often used by search engines to rank the relevance of documents to a user query.

TF-IDF stands for "Term Frequency - Inverse Document Frequency". It is built from two quantities:
1) Term Frequency - 
   It denotes how many times a word appears in a document. Documents vary in length, so a word may appear many more times in a long document than in a short one; here we use the raw count without normalizing by document length.
   
                   TF(w) = count(w)
                   
                   
2) Inverse Document Frequency - 
   It measures how rare a word is across the documents in the collection. Stop words have a high frequency in one document and are just as likely to occur frequently in every other document. Because they appear in most documents, we use this measure to weigh down their importance. Mathematically,
   
                  IDF(w) = log( (Total number of documents) / (Number of documents containing word w) )
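
As a rough illustration of the two definitions above, the sketch below computes raw term counts and IDF values on a tiny made-up corpus. The token lists and the helper names `term_frequency` and `inverse_document_frequency` are assumptions for this example only, and the natural logarithm (`math.log`) is assumed.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three already-tokenized, lowercased documents.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "sang"],
]

def term_frequency(tokens):
    """TF(w) = count(w): raw number of occurrences of each word."""
    return Counter(tokens)

def inverse_document_frequency(word, docs):
    """IDF(w) = log(N / number of documents containing w).

    Assumes the word occurs in at least one document; otherwise the
    division below would fail.
    """
    containing = sum(1 for tokens in docs if word in tokens)
    return math.log(len(docs) / containing)

tf_doc1 = term_frequency(docs[0])
print(tf_doc1["the"], tf_doc1["cat"])  # counts of "the" and "cat" in document 1

for word in ["the", "cat", "mat"]:
    print(word, round(inverse_document_frequency(word, docs), 3))
```

In this toy corpus "the" occurs in all three documents, so its IDF is log(3/3) = 0, while "mat" occurs in only one document and gets log(3/1), roughly 1.099.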
             
             
Combining the above two terms, we define TF-IDF as,
                  TF-IDF(w) = count(w) * log( (Total number of documents) / (Number of documents containing word w) )
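
To see how the product behaves, here is a minimal sketch that scores every word of one document in a small invented corpus and sorts the words by score. The corpus and the function name `tf_idf` are hypothetical; the formula is the one given above, with the natural logarithm assumed.

```python
import math
from collections import Counter

# Hypothetical toy corpus of tokenized documents (not the documents used later).
docs = [
    ["the", "sun", "is", "a", "star", "the", "sun", "is", "hot"],
    ["the", "moon", "is", "a", "satellite"],
    ["the", "earth", "orbits", "the", "sun"],
]

def tf_idf(doc_index, docs):
    """Return {word: count(w) * log(N / docs containing w)} for one document."""
    counts = Counter(docs[doc_index])
    n_docs = len(docs)
    scores = {}
    for word, count in counts.items():
        doc_freq = sum(1 for tokens in docs if word in tokens)
        scores[word] = count * math.log(n_docs / doc_freq)
    return scores

scores = tf_idf(0, docs)
# Sort by score, highest first; ties keep their order of first appearance.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for word, score in ranked:
    print(f"{word:5s} {score:.3f}")
```

Here "the" occurs twice in the first document but also appears in every document, so its score is exactly 0. "star" and "hot" occur only once, but in no other document, so they rank highest.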
                  

Consider the following three documents, from which we want to extract the important words:

Document 1 :

Artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals. In computer science AI research is defined as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving".


Document 2 :

In 1905, Albert Einstein determined that the laws of physics are the same for all non-accelerating observers, and that the speed of light in a vacuum was independent of the motion of all observers. This was the theory of special relativity. It introduced a new framework for all of physics and proposed new concepts of space and time. Einstein then spent 10 years trying to include acceleration in the theory and published his theory of general relativity in 1915. In it, he determined that massive objects cause a distortion in space-time, which is felt as gravity.


Document 3 :
India is a land of ancient civilization. India's social, economic, and cultural configurations are the products of a long process of regional expansion. Indian history begins with the birth of the Indus Valley Civilization and the coming of the Aryans. These two phases are usually described as the pre-Vedic and Vedic age. Hinduism arose in the Vedic period. The fifth century saw the unification of India under Ashoka, who had converted to Buddhism, and it is in his reign that Buddhism spread in many parts of Asia. In the eighth century Islam came to India for the first time and by the eleventh century had firmly established itself in India as a political force. It resulted into the formation of the Delhi Sultanate, which was finally succeeded by the Mughal Empire, under which India once again achieved a large measure of political unity. 


Now we calculate the TF-IDF score of each word in the documents above, sort the words of each document in decreasing order of score, and select the top 10 words. 

In [29]:
number_of_doc = 3
############     Input document   ########## 
d1 = "Artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals. In computer science AI research is defined as the study of 'intelligent agents': any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term 'artificial intelligence' is applied when a machine mimics 'cognitive' functions that humans associate with other human minds, such as 'learning' and 'problem solving'."
d2 = "In 1905, Albert Einstein determined that the laws of physics are the same for all non-accelerating observers, and that the speed of light in a vacuum was independent of the motion of all observers. This was the theory of special relativity. It introduced a new framework for all of physics and proposed new concepts of space and time.Einstein then spent 10 years trying to include acceleration in the theory and published his theory of general relativity in 1915. In it, he determined that massive objects cause a distortion in space-time, which is felt as gravity."
d3 = "India is a land of ancient civilization. India's social, economic, and cultural configurations are the products of a long process of regional expansion. Indian history begins with the birth of the Indus Valley Civilization and the coming of the Aryans. These two phases are usually described as the pre-Vedic and Vedic age. Hinduism arose in the Vedic period. The fifth century saw the unification of India under Ashoka, who had converted to Buddhism, and it is in his reign that Buddhism spread in many parts of Asia. In the eighth century Islam came to India for the first time and by the eleventh century had firmly established itself in India as a political force. It resulted into the formation of the Delhi Sultanate, which was finally succeeded by the Mughal Empire, under which India once again achieved a large measure of political unity."

In [30]:
from collections import Counter
from operator import itemgetter
import math
import re

from nltk.tokenize import word_tokenize

documents = [d1, d2, d3]

##############  Creating token list for each document ##############
# Keep only the leading run of letters/digits of each token, lowercased.
# Tokens that start with punctuation (for example "," or "'s") are dropped.
reg = '[A-Za-z0-9]*'

def clean_tokens(text):
    """Tokenize text and return its lowercased alphanumeric tokens."""
    tokens = []
    for word in word_tokenize(text):
        str1 = re.match(reg, word).group(0)
        if str1 != '':
            tokens.append(str1.lower())
    return tokens

filtered = [clean_tokens(doc) for doc in documents]

#############   Calculating word frequency  ##########
# Counter keeps words in order of first appearance.
wordFreqs = [Counter(tokens) for tokens in filtered]

###########   Calculating TF-IDF score for each word   ########
def tfidf_scores(freq, allFreqs):
    """Return {word: count * log(number_of_doc / documents containing word)}."""
    scores = {}
    for word, count in freq.items():
        countDoc = sum(1 for f in allFreqs if word in f)
        scores[word] = count * math.log(number_of_doc / countDoc)
    return scores

tfidfDicts = [tfidf_scores(freq, wordFreqs) for freq in wordFreqs]

###########   Printing the top words of each document   ########
topWords = 10
for i, tfidfDict in enumerate(tfidfDicts, start=1):
    if i > 1:
        print()
        print()
    print("*********   Top " + str(topWords) + " in Document " + str(i) + "   *********")
    print()
    # Sort ascending by score, then reverse to get the highest scores first
    ranked = list(reversed(sorted(tfidfDict.items(), key=itemgetter(1))))
    for k, v in ranked[:topWords]:
        print(k + " : " + str(v))

*********   Top 10 in Document 1   *********

intelligence : 5.493061443340549
its : 3.295836866004329
other : 2.1972245773362196
humans : 2.1972245773362196
machine : 2.1972245773362196
ai : 2.1972245773362196
solving : 1.0986122886681098
such : 1.0986122886681098
minds : 1.0986122886681098
human : 1.0986122886681098


*********   Top 10 in Document 2   *********

theory : 3.295836866004329
all : 3.295836866004329
space : 2.1972245773362196
new : 2.1972245773362196
relativity : 2.1972245773362196
observers : 2.1972245773362196
physics : 2.1972245773362196
determined : 2.1972245773362196
gravity : 1.0986122886681098
felt : 1.0986122886681098


*********   Top 10 in Document 3   *********

india : 6.591673732008658
century : 3.295836866004329
political : 2.1972245773362196
buddhism : 2.1972245773362196
had : 2.1972245773362196
under : 2.1972245773362196
vedic : 2.1972245773362196
civilization : 2.1972245773362196
unity : 1.0986122886681098
measure : 1.0986122886681098


These words also give context about each document.
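
Some words in the lists above, such as "its", "other", "all", and "had", are ordinary function words rather than topic words. With only three documents, a common word that happens to occur in just one of them still receives the maximum IDF of log(3). One possible remedy, sketched below, is to drop such words before ranking. The `STOP_WORDS` set and the `top_words` helper are illustrative assumptions, not part of the code above; the scores are copied from the Document 1 output.

```python
# Scores copied from the Document 1 output above (rounded to 4 decimals).
doc1_scores = {
    "intelligence": 5.4931, "its": 3.2958, "other": 2.1972, "humans": 2.1972,
    "machine": 2.1972, "ai": 2.1972, "solving": 1.0986, "such": 1.0986,
    "minds": 1.0986, "human": 1.0986,
}

# Assumption: a tiny hand-picked stop word set, for illustration only.
STOP_WORDS = {"its", "other", "such", "all", "had", "under"}

def top_words(scores, k, stop_words=frozenset()):
    """Return the k highest-scoring (word, score) pairs, skipping stop words."""
    kept = [(w, s) for w, s in scores.items() if w not in stop_words]
    # Stable sort: words with equal scores keep their original order.
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:k]

for word, score in top_words(doc1_scores, 5, STOP_WORDS):
    print(word, score)
```

A fuller list, for example the English list in NLTK's stopwords corpus, could replace the hand-written set.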