## TF-IDF computation
TF-IDF stands for term frequency-inverse document frequency, a weight often used in information retrieval and text mining. It is a statistical measure of how important a word is to a document in a collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but is offset by how frequently the word appears across the corpus. Variations of the TF-IDF weighting scheme are often used by search engines as a central tool for scoring and ranking a document's relevance to a user query.

### To compute TF-IDF, we need to know the following:
**TF: Term Frequency**, which measures how frequently a term occurs in a document. Because documents differ in length, a term may appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (i.e., the total number of terms in the document) as a form of normalization:

>$TF(t) = \frac{\text{number of times term } t \text{ appears in the document}}{\text{total number of terms in the document}}$
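
As a minimal sketch of this normalization, the snippet below computes TF on a toy sentence. The function name `term_frequency` and the naive lowercase-and-split tokenization are assumptions made for illustration; the notebook itself relies on TextBlob for tokenization.

```python
from collections import Counter

def term_frequency(term, document):
    """Normalized term frequency, using naive whitespace tokenization (an assumption)."""
    tokens = document.lower().split()
    if not tokens:
        return 0.0  # an empty document contains no terms
    return Counter(tokens)[term] / len(tokens)

doc = "the cat sat on the mat"
print(term_frequency("the", doc))  # "the" is 2 of the 6 tokens
print(term_frequency("cat", doc))  # "cat" is 1 of the 6 tokens
```

Dividing by the token count keeps scores comparable between short and long documents.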

**IDF: Inverse Document Frequency**, which measures how important a term is. When computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times yet carry little importance. We therefore need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

>$IDF(t) = \log_e\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right)$
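
To make the effect concrete, here is a small sketch of the unsmoothed formula on a three-document toy corpus. The name `inverse_document_frequency`, the whitespace tokenization, and the choice to raise an error for unseen terms are all illustrative assumptions.

```python
import math

def inverse_document_frequency(term, documents):
    """IDF with the natural log, using naive whitespace tokenization (an assumption)."""
    tokenized = [set(doc.lower().split()) for doc in documents]
    n_with_term = sum(1 for tokens in tokenized if term in tokens)
    if n_with_term == 0:
        # The ratio is undefined when no document contains the term.
        raise ValueError("term does not occur in any document")
    return math.log(len(documents) / n_with_term)

corpus = ["the cat sat", "the dog barked", "the bird sang"]
print(inverse_document_frequency("the", corpus))  # appears in every document
print(inverse_document_frequency("cat", corpus))  # appears in one of three documents
```

A term found in every document gets an IDF of zero, so it contributes nothing to the TF-IDF product, while a rarer term such as "cat" gets a positive weight.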

In [1]:
import pandas as pd
import math
from textblob import TextBlob
import re # for removing HTML tags

In [2]:
word_count_dict = {}

In [3]:
# tf(word, blob) computes the "term frequency": the number of times
# a word appears in a document blob, normalized by dividing by
# the total number of words in blob.
# We use TextBlob to break the text into words and get the word counts.
def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

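# n_containing(word, bloblist) returns the number of documents containing word.
# A generator expression is passed to the sum() function.
# Results are cached in word_count_dict, which is keyed by word only,
# so the cache must be reset whenever bloblist changes.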
def n_containing(word, bloblist):
    if word in word_count_dict:
        return word_count_dict[word]
    word_count_dict[word] = sum(1 for blob in bloblist if word in blob.words)
    return word_count_dict[word]

# idf(word, bloblist) computes "inverse document frequency" which measures how common 
# a word is among all documents in bloblist. 
# The more common a word is, the lower its idf. 
# We take the ratio of the total number of documents 
# to the number of documents containing word, then take the log of that. 
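# Add 1 to the divisor to prevent division by zero. This differs slightly
# from the formula above: a word found in every document gets a negative idf.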
def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

# tfidf(word, blob, bloblist) computes the TF-IDF score. 
# It is simply the product of tf and idf.
def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)
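
The `1 +` in the divisor of `idf()` above changes the values slightly compared with the plain formula. The sketch below evaluates the same expression with the two counts passed in directly; the helper name `smoothed_idf` is hypothetical and exists only for this illustration.

```python
import math

def smoothed_idf(n_documents, n_containing):
    """Same expression as idf() above, with the two counts passed in directly."""
    return math.log(n_documents / (1 + n_containing))

n_documents = 4
for n_containing in (0, 1, 3, 4):
    # Print how the weight changes as the word appears in more documents.
    print(n_containing, round(smoothed_idf(n_documents, n_containing), 4))
```

With four documents, a word present in three of them scores exactly zero and a word present in all four scores below zero, so very common words can end up with negative TF-IDF values.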

In [4]:
questions = pd.read_csv('../input/Questions_Filtered.csv', encoding='latin1')
real_questions = pd.read_csv('../input/Questions.csv', encoding='latin1')
#answers = pd.read_csv('../input/Answers.csv', encoding='latin1')
tags = pd.read_csv('../input/Tags_Filtered.csv', encoding='latin1')

In [5]:
def tag_questions_generator(question_table, id_table, included_tags):
    tag_question_list = []
    tag_id_list = []
    # Look up the IDs of questions carrying any of the included tags.
    tagged_ids = id_table[id_table['Tag'].isin(included_tags)]['Id']
    tag_questions = question_table[question_table['Id'].isin(tagged_ids)]
    for index, row in tag_questions.iterrows():
        # Each document is the question title followed by its body.
        tag_question_list.append(TextBlob(str(row['Title']) + " " + str(row['Body'])))
        tag_id_list.append(row['Id'])
    return tag_question_list, tag_id_list

In [6]:
def random_questions_generator(question_table, id_table, excluded_tags, sample_size):
    random_question_list = []
    random_id_list = []
    # Look up the IDs of questions carrying any of the excluded tags.
    excluded_ids = id_table[id_table['Tag'].isin(excluded_tags)]['Id']
    # Sample sample_size questions from those that have none of the excluded tags.
    random_questions = question_table[~question_table['Id'].isin(excluded_ids)].sample(sample_size)
    for index, row in random_questions.iterrows():
        random_question_list.append(TextBlob(str(row['Title']) + " " + str(row['Body'])))
        random_id_list.append(row['Id'])
    return random_question_list, random_id_list

In [7]:
def all_tags(question_id): 
    return tags[tags['Id'].isin(questions[questions['Id'] == question_id]['Id'])]['Tag'].tolist()

In [8]:
dummy = ['angularjs']
dummy_questions, dummy_ids = tag_questions_generator(questions, tags, dummy)

In [9]:
for index in range(10):
    print(all_tags(dummy_ids[index]))

['javascript', 'angularjs']
['javascript', 'angularjs']
['angularjs']
['javascript', 'angularjs']
['javascript', 'angularjs']
['javascript', 'actionscript-3', 'angularjs']
['javascript', 'ruby-on-rails', 'angularjs']
['symfony2', 'angularjs']
['angularjs']
['angularjs', 'filter', 'module']


In [32]:
test = ['angularjs']
test_questions, test_ids = tag_questions_generator(questions,tags,test)
random_questions, random_ids = random_questions_generator(questions,tags,test, 100_000)
question_list = test_questions + random_questions
id_list = test_ids + random_ids
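
With a list this large, `n_containing` scans every document the first time it meets a word, which is one full pass over the corpus per distinct word. One possible alternative, sketched here under the assumption that each document is already available as a list of tokens, is to count all document frequencies in a single pass. The name `document_frequencies` and the toy token lists are hypothetical and not part of the notebook.

```python
from collections import Counter

def document_frequencies(tokenized_documents):
    """Count, for each word, how many documents contain it (one pass over the corpus)."""
    counts = Counter()
    for tokens in tokenized_documents:
        counts.update(set(tokens))  # set() so a word counts once per document
    return counts

docs = [
    ["angular", "form", "submit", "form"],
    ["angular", "resource"],
    ["form", "server"],
]
df = document_frequencies(docs)
print(df["form"], df["angular"], df["server"])
```

Looking a word up in the resulting counter could then take the place of the per-word scan. A word never seen returns 0, which matches the case that the `1 +` in `idf()` guards against.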

In [33]:
tfidf_dict={}
qID_dict={}
word_count_dict = {}
for index, question_text in enumerate(question_list):
    #if index > len(test_ids):
    #    break
    # Only the first 10 questions are scored here.
    if index == 10:
        break
    print('Words of interest in Question ID {}'.format(id_list[index]))
    # Score every word in this question: tf is computed from the question itself,
    # idf from the whole question_list.
    scores = {word: tfidf(word, question_text, question_list) for word in question_text.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    # Keep the five highest-scoring words of the question.
    for word, score in sorted_words[:5]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))
        # word dict: word -> list of [question ID, score] pairs
        tfidf_dict.setdefault(word, []).append([id_list[index], round(score, 5)])
        # qID dict: question ID -> list of its top-scoring words
        qID_dict.setdefault(id_list[index], []).append(word)

Words of interest in Question ID 6082520
	Word: form, TF-IDF: 0.14465
	Word: submit, TF-IDF: 0.14339
	Word: cookbook, TF-IDF: 0.12854
	Word: server, TF-IDF: 0.11163
	Word: webpac2, TF-IDF: 0.09032
Words of interest in Question ID 7354870
	Word: oneval, TF-IDF: 0.56507
	Word: serializing, TF-IDF: 0.18732
	Word: changes, TF-IDF: 0.17508
	Word: serialized, TF-IDF: 0.1673
	Word: detecting, TF-IDF: 0.16497
Words of interest in Question ID 9629000
	Word: cookiestore, TF-IDF: 0.73247
	Word: userid, TF-IDF: 0.47708
	Word: fblogin, TF-IDF: 0.30353
	Word: validateuser, TF-IDF: 0.27666
	Word: configurations, TF-IDF: 0.19558
Words of interest in Question ID 9755780
	Word: invalidwidgets, TF-IDF: 0.23697
	Word: questionctrl, TF-IDF: 0.16654
	Word: word, TF-IDF: 0.11843
	Word: ans, TF-IDF: 0.11266
	Word: column, TF-IDF: 0.10169
Words of interest in Question ID 9981090
	Word: orderitemservice, TF-IDF: 0.35545
	Word: resource, TF-IDF: 0.31332
	Word: orderservice, TF-IDF: 0.30046
	Word: theservice, TF-

In [22]:
pd.options.display.max_colwidth = 2000
# Regular expression matching HTML tags; strip_tags() removes every match.
TAG_RE = re.compile(r'<[^>]+>')
def strip_tags(text):
    return TAG_RE.sub('', text)
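
A quick check of what this pattern does and does not handle may be useful. The sketch repeats the definition so that it runs on its own; the sample strings are made up for illustration.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def strip_tags(text):
    return TAG_RE.sub('', text)

html = '<p>How do I <b>submit</b> a form?</p>'
print(strip_tags(html))

# Caveat: anything between a '<' and the next '>' is treated as a tag,
# so comparison operators in plain text are removed as well.
print(strip_tags('a < b and c > d'))
```

The pattern is a lightweight heuristic rather than an HTML parser: it also leaves entities such as `&amp;` untouched.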

In [26]:
sample_question = TextBlob(strip_tags(real_questions[real_questions['Id'] == 6082520]['Body'].to_string(index=False)))
sample_question

TextBlob("I am new to Angular and would like to know how to actually submit a form to the server once the data is filled in.\n\nThe cookbook form examples on angularjs.org only save the state on the client. How do I submit to the server?\n\nAlternatively, how do I use jQuery's form.submit() on the form in the ng:click="save()" function?\n\nEdit - Found 2 ways to do this ( I also removed the html markup i pasted before - just refer to the advanced form cookbook example for the source)\n\n\nhttp://webpac2.rot13.org:3000/conference/Work (by Dobrica Pavlinusic) to go the angular way using a resource to send  the data to the server in json format. I had issues with that on the server side - Angular was sending it fine but grails was mangling it (according to firebug and request content-length). I need to look into this more. How do i change the content-type in angular for a resource method like $save() ?\nPut a form in and use a submit button. Since I am not doing a single page web app, I u

In [111]:
qID_dict[6082520]

['cookbook', 'submit', 'form', 'server', 'webpac2']

In [31]:
print(sample_question.words.count('cookbook'))
print(sample_question.words.count('submit'))
print(sample_question.words.count('form'))
print(sample_question.words.count('server'))

2
3
5
5
