## TF-IDF computation
TF-IDF stands for term frequency-inverse document frequency. The TF-IDF weight is a statistical measure, widely used in information retrieval and text mining, of how important a word is to a document in a collection or corpus. The importance increases proportionally with the number of times the word appears in the document, but is offset by how frequently the word occurs across the corpus. Search engines often use variations of the TF-IDF weighting scheme as a central tool for scoring and ranking a document's relevance to a user query.

### To compute TF-IDF, we need to know the following:
**TF: Term Frequency**, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in a long document than in a short one. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:

>$TF(t) = \dfrac{\text{Number of times term } t \text{ appears in the document}}{\text{Total number of terms in the document}}$

**IDF: Inverse Document Frequency**, which measures how important a term is. When computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but carry little importance. Thus we need to weigh down the frequent terms and scale up the rare ones by computing the following:

>$IDF(t) = \log_e\left(\dfrac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)$
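
As a quick check of the two formulas, here is a minimal sketch on a toy corpus. It works on plain Python lists of tokens rather than on a tokenizer's output; the `docs` corpus and the helper names `term_frequency` and `inverse_document_frequency` are made up for illustration and are not part of the notebook's code.

```python
import math

# Hypothetical toy corpus: each document is already split into lowercase tokens.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "sang"],
]

def term_frequency(term, doc):
    # Occurrences of term in doc, divided by the document length.
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, docs):
    # Natural log of (total documents / documents containing term).
    # Assumes term occurs in at least one document; otherwise this divides by zero.
    n_with_term = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_with_term)

tf_the = term_frequency("the", docs[0])            # 2 / 6
idf_the = inverse_document_frequency("the", docs)  # log(3 / 3) = 0
tf_mat = term_frequency("mat", docs[0])            # 1 / 6
idf_mat = inverse_document_frequency("mat", docs)  # log(3 / 1)

print("the:", tf_the * idf_the)
print("mat:", tf_mat * idf_mat)
```

Although "the" is the most frequent term in the first document, it occurs in every document, so its IDF is zero and its TF-IDF vanishes. The rarer "mat" ends up with the higher weight.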

In [17]:
import pandas as pd
import math
from textblob import TextBlob

In [18]:
# tf(word, blob) computes "term frequency": the number of times
# a word appears in a document blob, normalized by dividing by
# the total number of words in blob.
# We use TextBlob to break the text into words and get the word counts.
def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

# n_containing(word, bloblist) returns the number of documents containing word.
# A generator expression is passed to the sum() function.
def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

# idf(word, bloblist) computes "inverse document frequency" which measures how common 
# a word is among all documents in bloblist. 
# The more common a word is, the lower its idf. 
# We take the ratio of the total number of documents 
# to the number of documents containing word, then take the log of that. 
# Add 1 to the divisor to prevent division by zero.
def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

# tfidf(word, blob, bloblist) computes the TF-IDF score. 
# It is simply the product of tf and idf.
def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)
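
One consequence of adding 1 to the divisor is worth seeing in numbers: a word that occurs in every document gets a slightly negative IDF instead of zero. The sketch below mirrors the `idf()` function above on plain token lists, so it runs without TextBlob; `docs` and `idf_plus_one` are hypothetical names used only for this illustration.

```python
import math

# Hypothetical corpus of pre-tokenized documents (stand-ins for TextBlob objects).
docs = [["python", "list"], ["python", "dict"], ["python", "tuple"]]

def idf_plus_one(word, docs):
    # Same formula as idf() above: 1 is added to the document count in the divisor.
    n = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / (1 + n))

idf_everywhere = idf_plus_one("python", docs)  # log(3 / 4), negative
idf_once = idf_plus_one("dict", docs)          # log(3 / 2), positive
idf_unseen = idf_plus_one("java", docs)        # log(3 / 1), no division by zero
print(idf_everywhere, idf_once, idf_unseen)
```

Words with a negative score simply sort to the bottom when ranking by TF-IDF, so this is usually harmless for picking out the top words of a document.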

In [3]:
questions = pd.read_csv('../input/Questions_Filtered.csv', encoding='latin1')
answers = pd.read_csv('../input/Answers_Filtered.csv', encoding='latin1')
tags = pd.read_csv('../input/Tags_Filtered.csv', encoding='latin1')

In [6]:
question_list = []
id_list = []
for index, row in questions.iterrows():
    # We append the title to the text body here.
    # Columns are accessed by position, so .iloc is used explicitly.
    question_list.append(TextBlob(str(row.iloc[6]) + " " + str(row.iloc[5])))
    id_list.append([row.iloc[0]])

In [None]:
tfidf_dict = {}  # word -> list of [question ID, TF-IDF score]
qID_dict = {}    # question ID -> list of its highest-scoring words
TOP_N = 5        # assumption: the original cutoff for "words of interest" is not shown
for index, question_text in enumerate(question_list):
    # Score every distinct word in this question: tf is computed within the
    # question itself, idf across all questions in question_list.
    qid = id_list[index][0]  # id_list stores each ID wrapped in a one-element list
    if index < 10:
        print('Words of interest in Question ID {}'.format(qid))
    scores = {word: tfidf(word, question_text, question_list) for word in set(question_text.words)}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:TOP_N]:
        if index < 10:
            print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))
        # word dict
        tfidf_dict.setdefault(word, []).append([qid, round(score, 5)])
        # qID dict
        qID_dict.setdefault(qid, []).append(word)
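
Once the word dictionary is filled, it can serve as a small inverted index. The following is a minimal sketch of such a lookup, assuming the word -> `[[question ID, score], ...]` shape built above; the sample data in `tfidf_dict_example` and the helper `top_questions` are hypothetical and only illustrate the idea.

```python
# Hypothetical contents, in the shape built above: word -> [[question ID, score], ...]
tfidf_dict_example = {
    "pandas": [[101, 0.21], [205, 0.34], [309, 0.05]],
    "regex": [[205, 0.12]],
}

def top_questions(word, index, limit=2):
    # Return up to `limit` (question ID, score) pairs for word, highest score first.
    entries = index.get(word, [])
    ranked = sorted(entries, key=lambda pair: pair[1], reverse=True)
    return [(qid, score) for qid, score in ranked[:limit]]

print(top_questions("pandas", tfidf_dict_example))
print(top_questions("missing", tfidf_dict_example))
```

A word that was never stored returns an empty list rather than raising a `KeyError`, which keeps the lookup safe for arbitrary query terms.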