# TP5 Catégorisez automatiquement des questions

In this notebook we extract data from stackoverflow questions and perform an unsupervised text classification.

The data can be originally found at: https://data.stackexchange.com/stackoverflow/query/new



# Download data: 100k posts containining Title, tags and text and score higher that 20

We create an SQL program that samples 50K random posts without repetition. To achieve efficiency if we wish to sample 50k more posts, we shall sample independently from the previous 50k sample such that the merge of the samplings must be filtered for duplicates.  Here we use a dataset merge of 2 samplings with score greater than 20, as discussed in the data analisys notebook.

The SQL script can be found in the following link:
https://docs.google.com/document/d/1ywL1rGiYKzoftbgdNP-SPX-A250kaL1SxWRLeJsvk1A/edit?usp=sharing



In [None]:
# imports

import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter

In [None]:
# downlaod and unzip tags counts
from google_drive_downloader import GoogleDriveDownloader as gdd

files = ['1aGnT5OqAN7KmJoyEiWnFsnCELVun7s-s','1-aKDsJDXPkhpzU0sAZ_4SC3gXKzGYuJV', '1WC60yk8Gen_eQocVPhnYNd3A5NJ5qZUf']

for file in files:
  data_path = os.path.join('.','data')

  gdd.download_file_from_google_drive(file_id=file,
                                      dest_path=data_path,
                                      unzip=True)
  # remove .zip data
  os.system('rm -rf data')

Downloading 1aGnT5OqAN7KmJoyEiWnFsnCELVun7s-s into ./data... Done.
Unzipping...Done.
Downloading 1-aKDsJDXPkhpzU0sAZ_4SC3gXKzGYuJV into ./data... Done.
Unzipping...Done.
Downloading 1WC60yk8Gen_eQocVPhnYNd3A5NJ5qZUf into ./data... Done.
Unzipping...Done.


# 1. Data pre-processing:

First we need to analyse the data structure and decide which processing will be applied to the dataset.


In [None]:
df_part1 = pd.read_csv('QueryResults_part1.csv')
df_part2 = pd.read_csv('QueryResults_part2.csv')

Since the two dataframes were gathered at random separately we must join without repeating data.

In [None]:
data = pd.concat([df_part1,df_part2]).drop_duplicates().reset_index(drop=True)

As a sanity check, the joint dataset should have less than 100k questions.

In [None]:
len(data)

96387

Now we split our dataset into training and testing data. We can split training data further into another training set and an evaluation set.

To avoid any bias in this sampling we shufle the dataset...

In [None]:
from sklearn.model_selection import train_test_split

df, test = train_test_split(data, test_size=0.2, random_state=42, shuffle=True,) # fix random state to reproduce results on other platforms

In [None]:
len(df)

77109

In [None]:
len(test)

19278

## 1.1 Tag pre-processing

Tags are written between less-than and greather-than signs on a single string.To work with individual tags we first have to process them using regex and create a list of tags that can be manipulated.

In [None]:
# separate tags into a list of tags using a lambda function
get_tags = lambda x: re.findall("\<(.*?)\>", x)

df['Tags'] = df['Tags'].apply(get_tags)
test['Tags'] = test['Tags'].apply(get_tags)

df.head()


Unnamed: 0,Id,Body,Title,Tags,CreationDate,Score
67997,12019159,<p>I was going thru some single page website e...,Infinite rotation animation using CSS and Java...,"[javascript, jquery, css, jquery-animate, css-...",2012-08-18 13:59:25,12
76486,20580968,<p>Does anybody know why this error happens on...,Xcode error : Distill failed for unknown reasons,"[ios, iphone, xcode, ios7, xcode5]",2013-12-14 07:51:40,51
29308,1000310,<p>I'm using guice for dependency injection wi...,What's aopalliance all about? And why is guice...,"[aop, guice, aopalliance]",2009-06-16 08:58:59,29
83844,32325590,<p>I often get the following 500 server error ...,Azure Web App: HTTP Error 500 on favicon.ico,"[azure, iis, azure-web-app-service, azure-diag...",2015-09-01 07:10:23,12
82803,30682575,"<p>i got trouble, in a rails project（redmine2....",Unable to load the EventMachine C extension; T...,"[ruby-on-rails, ruby, ruby-on-rails-3, bundle]",2015-06-06 11:58:18,35


#### Get Stackoverflow full tag data of top 50k most frequent tags

As discussed in the exploratory analysis we can use stackoverflow top 50k tags of the whole dataset and obtain the mapping from token to tags as explained in the exploratory analysis.

In [None]:
df_tags = pd.read_csv('TopTags.csv')
df_tags.at[553, 'TagName'] = 'null'
df_tags.at[1819, 'TagName'] = 'nan'

In [None]:
# we will use the tag array to check for tag matches
tags_array = (df_tags.TagName).to_numpy()[:1000]

# 2. Title Pre-processing

Here we process the title into a bow representation. This allows for a more in depth analysis of our data such that we can process it into features that can be better dealt by machine learning algorithms.

In [None]:
# Transform titles to list
df.Title.to_list()[:10]

['Infinite rotation animation using CSS and Javascript',
 'Xcode error : Distill failed for unknown reasons',
 "What's aopalliance all about? And why is guice using it?",
 'Azure Web App: HTTP Error 500 on favicon.ico',
 'Unable to load the EventMachine C extension; To use the pure-ruby reactor',
 'Efficient implementation of faceted search in relational databases',
 'Does calling a destructor explicitly destroy an object completely?',
 'Calling Win32 API method from Java',
 'Get a list of Solution/Project Files for VS Add-in or DXCore Plugin',
 'Does Java Web Start require that the Java browser plug-in is enabled?']

In [None]:
# download and import required packages
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import nltk.stem as stemmer
from nltk.stem.porter import *
import nltk
from nltk import word_tokenize
from nltk.tokenize import MWETokenizer

nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
# instantiate stemmer that will be used along the processing pipeline
stemmer = PorterStemmer()

**Observation:** Since we use a out of the box tokenizer we can add our own rules to it. An example would be to add the token for C# (c sharp). Otherwise it would be removed by the tokenizer.

In [None]:
# add exceptions to tokenizer
tokenizer = nltk.tokenize.MWETokenizer()
tokenizer.add_mwe(('c', '#'))

Here we define the preprocessing we will apply to text.

In [None]:
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))


def preprocess(text):
    result = []
    # remove ponctuation but keep relevant data
    initial_preprocess = lambda text : "".join([char for char in text if char not in '!"$%&\'()*,./:;<=>?@[\\]^_`{|}~']).lower()
    tokens = tokenizer.tokenize(word_tokenize(initial_preprocess(text)))
    for token in tokens:
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            result.append(lemmatize_stemming(token))
    return result

And we apply it to the train and test dataset.

In [None]:
processed_titles = df.Title.map(preprocess)
test_processed_titles = test.Title.map(preprocess)

Below we observe that the result of the tokenization.

In [None]:
processed_titles

67997              [infinit, rotat, anim, css, javascript]
76486        [xcode, error, distil, fail, unknown, reason]
29308                                [what, aopalli, guic]
83844       [azur, web, app, http, error, 500, faviconico]
82803    [unabl, load, eventmachin, c, extens, use, pur...
                               ...                        
6265     [fastest, select, sqlcalcfoundrow, tabl, selec...
54886        [set, dropdownlist, selecteditem, programmat]
76820    [redirect, audio, creat, altern, sound, path, ...
860                                     [differ, foo, foo]
15795                                [cast, listt, effect]
Name: Title, Length: 77109, dtype: object

Now we generate a corpora dictionary using the tokenized sentences from the train dataset. We will print the first 10 words from the dictionary.

In [None]:
# Generate corpora dictionary
title_dictionary = gensim.corpora.Dictionary(processed_titles)

Print the first 10 entries

In [None]:
print('there are {} entries on the corpora dict.\nFirst 10 entries:'.format(len(title_dictionary)))
print(list(title_dictionary.values())[:10])

there are 26743 entries on the corpora dict.
First 10 entries:
['anim', 'css', 'infinit', 'javascript', 'rotat', 'distil', 'error', 'fail', 'reason', 'unknown']


To reduce the computational complexity we filter out words that appear less than a fixed number of times. The documentation of gives the following parameters:

* no_below (int, optional) – Keep tokens which are contained in at least no_below documents.

* no_above (float, optional) – Keep tokens which are contained in no more than no_above documents (fraction of total corpus size, not an absolute number).

* keep_n (int, optional) – Keep only the first keep_n most frequent tokens.

* keep_tokens (iterable of str) – Iterable of tokens that must stay in dictionary after filtering.

We observe a great reduction in the number of the dictionary entries. This will greatly accelerate computations.

In [None]:
title_dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=10000)
print('After filtering there are {} entries on the corpora dict.\nFirst 10 entries:'.format(len(title_dictionary)))
print(list(title_dictionary.values())[:10])

After filtering there are 2431 entries on the corpora dict.
First 10 entries:
['anim', 'css', 'infinit', 'javascript', 'rotat', 'error', 'fail', 'reason', 'unknown', 'xcode']


With the dictionary in hands we can now create a BoW representation of titles. We do this both for the train and test sets.

In [None]:
# create bow of title filtered corpus
title_bow_corpus = [title_dictionary.doc2bow(title) for title in processed_titles]
test_title_bow_corpus = [title_dictionary.doc2bow(title) for title in test_processed_titles]

We can now observe the result of the preprocessing on a string of text

In [None]:
# observe pre processing result on a sampling of a given dataset
def sample_nlp_pipeline(sample_idx, dataframe, bow_corpus):
  print('sample idx:', sample_idx)

  print('sample tags:', dataframe.Tags.to_list()[sample_idx])
  print('\nprocessing pipeline: \n')
  print('sample title:', dataframe.Title.to_list()[sample_idx])
  print('preprocessed title:', processed_titles[sample_idx])
  print('bow_corpus of title:', bow_corpus[sample_idx])
  print('\nbag of words equivalence: ')
  bow_doc_sample = bow_corpus[sample_idx]
  for i in range(len(bow_doc_sample)):
      print("Word {} (\"{}\") appears {} time.".format(bow_doc_sample[i][0],
                                                title_dictionary[bow_doc_sample[i][0]], bow_doc_sample[i][1]))


Let's observe the result on a random question

In [None]:
# test preprocessing on a random question
sample_nlp_pipeline(np.random.randint(len(df.Tags)), df, title_bow_corpus)

sample idx: 15032
sample tags: ['android', 'security', 'rsa', 'private-key']

processing pipeline: 

sample title: Java: How can I generate PrivateKey from a string?
preprocessed title: ['add', 'datamemb', 'collectiondatacontract', 'wcf']
bow_corpus of title: [(34, 1), (66, 1), (266, 1)]

bag of words equivalence: 
Word 34 ("java") appears 1 time.
Word 66 ("string") appears 1 time.
Word 266 ("gener") appears 1 time.


## 3.1 Token2tag

To perform classification and evaluate we will use a dict of tokens to tag, as explained in the exploratory analysis notebook.

In [None]:
df_tags.head()

Unnamed: 0,Id,TagName,Count
0,3,javascript,2387264
1,16,python,1968760
2,17,java,1850627
3,9,c#,1543059
4,5,php,1438539


In [None]:
df_tags['tokenized'] = df_tags.TagName.apply(preprocess)

# get array with tags
tags_array = (df_tags.TagName).to_numpy()

# get an array with tokenized tags
tokenized_tags = df_tags.tokenized.to_numpy()

# get an array with the tag count
full_tag_count = (df_tags.Count).to_numpy()

def select_first(token_list):
  return [token_list[0]]

# count number of total tags
total = sum(full_tag_count)
# initialize the
n_lost = 0

# we will capture index of divided or lost tags
eliminated_tags = []
divided_tags = []

for idx, tag in enumerate(tags_array):
  tokenized_tag = preprocess(tag)
  # check if tag was mapped to zero
  if len(tokenized_tag) == 0:
      eliminated_tags.append(idx)
      n_lost += full_tag_count[idx] / total * 100
          #print("The tag '{}' ({:.2f}% of tags) was eliminated by tokenization".format(full_tags_array[idx], full_tag_count[idx]/total*100))


  # check if tags were divided
  if len(tokenized_tag) > 1:
      n_lost += full_tag_count[idx] / total * 100
      divided_tags.append(idx)
          #print("The tag '{}' ({:.2f}% of tags) was divided by tokenization into '{}'".format(full_tags_array[idx], full_tag_count[idx]/total*100, tokenized_tag))

print('\nTotal ammount of lost tags for tokenization: {:.2f}%'.format(n_lost))


df_tags.loc[divided_tags, 'tokenized'] = df_tags.loc[divided_tags, 'tokenized'].apply(select_first)

token2tag_dict = {}
for tag, token in zip(df_tags.TagName.to_numpy(), df_tags.tokenized.to_numpy()):
  # token should be a list containing only one element
  if len(token) == 0:
    pass
    #print('passed {} because the token is null'.format(tag))
  elif token[0] in token2tag_dict.keys():
    pass
    #print('passed {} because the token has already been mapped'.format(tag))
  else:#returning-a-view-versus-a-copy https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
    token2tag_dict[token[0]] = tag


Total ammount of lost tags for tokenization: 0.29%


# 3. Unsupervised learning on titles

In this section we will train a LDA model with the pre-processed titles. The resulting model will cluster the questions according to a number of topics.

LDA stands for Latent Dirichlet Allocation. It is a probabilistic generative model allowing to explain sets of observations, by means of unobserved groups, them themselves defined by data similarities.

In our context, since we already verified in the exploratory analysis that tags are present in the titles and text of questions, we can make the hypthesis that they end up making part of the topics.
Furthermore, since the tags seem to appear on separate clusters of questions (ie. a python tag is not present in a C question), we also can expect that the topics encoding will also preserve this separation.




## 3.1 Training

### Question: How many topics should we use?

This is an hyperparameter of the model. Tweaking it would offer ways to improve the results if we so desired.

In [None]:
import os
# Get the number of available CPU cores
num_cpus = os.cpu_count()
print("Number of CPU cores available:", num_cpus)

Number of CPU cores available: 8


In [None]:
# Fit LDA model using preprocessed titles

title_lda_model = gensim.models.LdaMulticore(
    title_bow_corpus,
    num_topics=100,
    id2word=title_dictionary,
    passes=10,
    workers=2,
    eta='auto'
)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Save the model
model_file_path = "/content/drive/MyDrive/Colab Notebooks/Lda models/title_lda_model"
title_lda_model.save(model_file_path)

The topics are a weighted combination of the processed tokens. Our unsupervised approach is now able to use this feature without any supervision to generate the predictions. We expect that the tags are present among the topic composition with high confidence.

In [None]:
# Print all topics and their token composants
for idx, topic in title_lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.203*"process" + 0.154*"messag" + 0.093*"respons" + 0.083*"5" + 0.078*"rout" + 0.046*"queue" + 0.046*"6" + 0.042*"clojur" + 0.040*"db" + 0.030*"asynchron"
Topic: 1 
Words: 0.147*"posit" + 0.115*"end" + 0.102*"height" + 0.068*"t-sql" + 0.065*"ssh" + 0.053*"absolut" + 0.049*"100" + 0.040*"reduc" + 0.039*"exe" + 0.034*"fatal"
Topic: 2 
Words: 0.240*"spring" + 0.232*"debug" + 0.085*"sqlite" + 0.079*"machin" + 0.067*"reason" + 0.047*"boot" + 0.039*"bean" + 0.032*"driver" + 0.032*"transit" + 0.029*"busi"
Topic: 3 
Words: 0.268*"number" + 0.139*"call" + 0.135*"detect" + 0.093*"2" + 0.068*"close" + 0.068*"calcul" + 0.033*"menu" + 0.028*"10" + 0.026*"convent" + 0.025*"admin"
Topic: 4 
Words: 0.277*"version" + 0.124*"ignor" + 0.121*"plugin" + 0.108*"avoid" + 0.080*"understand" + 0.040*"job" + 0.036*"clone" + 0.026*"restart" + 0.025*"white" + 0.024*"eclips"
Topic: 5 
Words: 0.237*"store" + 0.162*"output" + 0.157*"log" + 0.074*"state" + 0.060*"procedur" + 0.060*"remot" + 0.048*"w

## 3.3 Inference

Now we test the hypothesis of tags appearing in the themes:

In [None]:
# Given the index of the question print the pre processing pipeline and the score
def infer_topic_score(sample_idx, bow_corpus):
  for index, score in sorted(title_lda_model[bow_corpus[sample_idx]], key=lambda tup: -1*tup[1]):
      print("\nScore: {}\t \nTopic: {}".format(score, title_lda_model.print_topic(index, 10)))

Indeed we observe that the tags are present in at least some of the topics!

In [None]:
idx = 1
print("###################Processing#####################")
sample_nlp_pipeline(idx, df, title_bow_corpus)

###################Processing#####################
sample idx: 1
sample tags: ['ios', 'iphone', 'xcode', 'ios7', 'xcode5']

processing pipeline: 

sample title: Xcode error : Distill failed for unknown reasons
preprocessed title: ['microsoft', 'offic', '2007', 'file', 'type', 'mime', 'type', 'identifi', 'charact']
bow_corpus of title: [(5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]

bag of words equivalence: 
Word 5 ("error") appears 1 time.
Word 6 ("fail") appears 1 time.
Word 7 ("reason") appears 1 time.
Word 8 ("unknown") appears 1 time.
Word 9 ("xcode") appears 1 time.


In [None]:
print("###################Inference######################")
infer_topic_score(idx, title_bow_corpus)

###################Inference######################

Score: 0.3033624589443207	 
Topic: 0.234*"load" + 0.190*"fail" + 0.099*"address" + 0.051*"definit" + 0.050*"error" + 0.046*"target" + 0.037*"fast" + 0.035*"sdk" + 0.030*"upgrad" + 0.028*"-"

Score: 0.18693606555461884	 
Topic: 0.262*"queri" + 0.162*"linq" + 0.159*"specif" + 0.083*"sql" + 0.069*"assign" + 0.034*"sum" + 0.032*"vbnet" + 0.027*"unknown" + 0.025*"help" + 0.022*"-"

Score: 0.18136711418628693	 
Topic: 0.252*"rubi" + 0.134*"xcode" + 0.118*"count" + 0.090*"datetim" + 0.060*"report" + 0.057*"constraint" + 0.046*"associ" + 0.045*"ident" + 0.043*"receiv" + 0.027*"desktop"

Score: 0.16833244264125824	 
Topic: 0.240*"spring" + 0.232*"debug" + 0.085*"sqlite" + 0.079*"machin" + 0.067*"reason" + 0.047*"boot" + 0.039*"bean" + 0.032*"driver" + 0.032*"transit" + 0.029*"busi"


In [None]:
idx = 11010
print("###################Processing#####################")
sample_nlp_pipeline(idx, df, title_bow_corpus)
print("\n###################Inference######################")
infer_topic_score(idx, title_bow_corpus)

###################Processing#####################
sample idx: 11010
sample tags: ['html', 'css', 'forms']

processing pipeline: 

sample title: Why use definition lists (DL,DD,DT) tags for HTML forms instead of tables?
preprocessed title: ['aspnet', 'mvc', '-', 'view', 'master', 'page', 'set', 'titl']
bow_corpus of title: [(20, 1), (38, 1), (108, 1), (161, 1), (163, 1), (284, 1), (293, 1), (790, 1)]

bag of words equivalence: 
Word 20 ("use") appears 1 time.
Word 38 ("list") appears 1 time.
Word 108 ("instead") appears 1 time.
Word 161 ("html") appears 1 time.
Word 163 ("tag") appears 1 time.
Word 284 ("tabl") appears 1 time.
Word 293 ("form") appears 1 time.
Word 790 ("definit") appears 1 time.

###################Inference######################

Score: 0.11222319304943085	 
Topic: 0.483*"use" + 0.098*"get" + 0.083*"valid" + 0.077*"current" + 0.046*"activ" + 0.035*"doubl" + 0.031*"float" + 0.030*"hibern" + 0.028*"global" + 0.026*"emb"

Score: 0.11222290247678757	 
Topic: 0.286*"list"

We can also test on unseen data.

In [None]:
# test on unseen data
unseen_title = 'How can I declare a struct in java'
bow_vector = title_dictionary.doc2bow(preprocess(unseen_title))
for index, score in sorted(title_lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print(index)
    print("Score: {}\t Topic: {}".format(score, title_lda_model.print_topic(index, 5)))

34
Score: 0.25250110030174255	 Topic: 0.302*"java" + 0.193*"object" + 0.185*"code" + 0.090*"paramet" + 0.090*"c_#"
33
Score: 0.2524997889995575	 Topic: 0.224*"implement" + 0.168*"size" + 0.141*"display" + 0.098*"angular" + 0.097*"declar"
73
Score: 0.25249752402305603	 Topic: 0.251*"properti" + 0.228*"user" + 0.135*"bind" + 0.076*"locat" + 0.046*"object"


And finally, on the test dataset.

In [None]:
idx = 10
print("###################Processing#####################")
sample_nlp_pipeline(idx, test, test_title_bow_corpus)
print("###################Inference######################")
infer_topic_score(idx, test_title_bow_corpus)

###################Processing#####################
sample idx: 10
sample tags: ['git', 'bitbucket', 'git-branch']

processing pipeline: 

sample title: Delete branches in Bitbucket
preprocessed title: ['aspnet', 'site', 'map']
bow_corpus of title: [(136, 1), (679, 1), (1979, 1)]

bag of words equivalence: 
Word 136 ("delet") appears 1 time.
Word 679 ("branch") appears 1 time.
Word 1979 ("bitbucket") appears 1 time.
###################Inference######################

Score: 0.25250568985939026	 
Topic: 0.738*"file" + 0.086*"delet" + 0.025*"statu" + 0.024*"csv" + 0.019*"overload" + 0.018*"ubuntu" + 0.012*"dot" + 0.011*"javafx" + 0.010*"atom" + 0.010*"master"

Score: 0.2525056302547455	 
Topic: 0.285*"git" + 0.108*"repositori" + 0.095*"svn" + 0.092*"merg" + 0.084*"branch" + 0.047*"termin" + 0.044*"unix" + 0.038*"exclud" + 0.034*"filenam" + 0.031*"spark"

Score: 0.25248122215270996	 
Topic: 0.147*"posit" + 0.115*"end" + 0.102*"height" + 0.068*"t-sql" + 0.065*"ssh" + 0.053*"absolut" + 0.049

### Initial conclusion of LDA results:
We can observe that our hypothesis that the resulting LDA model would encode the questions that are represented by tags into different topics seems to hold. Furthermore it seems to work on the test set as we observe ground truth tags in the topics.

## 3.4 Tags proposals from title topics

Now that we have a model capable of defining topics from titles, we can use the result of the inference to generate Tag predictions. Here we will use the topic confidence and each composant of the tokens of each topic. Since we are using an unsupervised approach we will extract tags using simply the confidence terms.

#### Proposed approach:
We will extract topic tags from a string of text and compare it with a set of available tokenized tags. If there is a match and it passes a confidence check, it is a tag suggestion and we obtain it using the tag2dict mapping. This allows us to not miss on predictions because the token is different than the tag.

First we need to preprcess the new sentence. Then, it is converted to the BOW representation used to train the LDA model. Inference is performed on this BOW set. We verfy that that using token2tag increases the probability of matches between suggestions and existing tags!

In [None]:
# extract tag proposals from topics
unseen_title = 'How can I declare a variable in python'
bow_vector = title_dictionary.doc2bow(preprocess(unseen_title))

scores = []
words = []
for index, score in sorted(title_lda_model[bow_vector], key=lambda tup: -1*tup[1]):

    scores.append(score)
    words.append(title_lda_model.get_topic_terms(index, 5))

In [None]:
title_lda_model[bow_vector]

[(33, 0.2524989), (55, 0.2525002), (67, 0.25250015)]

Check if there are matches with existing tags matches before applying token2tag

In [None]:
common_tags = []
# compare with dict of tags
for bow_id, score in words[0]:
  if title_dictionary[bow_id] in tags_array:
    common_tags.append(title_dictionary[bow_id])

print('common tags: {}'.format(common_tags))

common tags: ['python', 'html']


Check if there are matches with existing tags after applying token2tag

In [None]:
common_tags = []
# compare with dict of tags
for bow_id, score in words[0]:
  token = title_dictionary[bow_id]
  # try to convert token2tag, else leave tag as is
  tag = ''
  if token in token2tag_dict.keys():
    tag = token2tag_dict[token]
  else:
    tag = token
  if tag in tags_array:
    common_tags.append(title_dictionary[bow_id])

print('common tags: {}'.format(common_tags))

common tags: ['python', 'html', 'page']


Simply comparing for common tokens between the title and the set of tags would possibly work. However the LDA model approach allows for less computations and also provides a score that we can use to refine the prediction.

We have two scores per proposal. The overall confidence on the topic and then, for each topic, a score that represents token contribution to the topic.

Here we extract both scores of the topics and individual tokens to refine the tag proposal.

In [None]:
# Extract topic score and individual token scores
title_token_score = []
title_topic_proposal = []
for index, topic_score in sorted(title_lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    words = title_lda_model.get_topic_terms(index, 5)
    # compare with dict of tags
    for bow_id, score in words:
      if title_dictionary[bow_id] in tags_array:
        title_topic_proposal.append(title_dictionary[bow_id])
        title_token_score.append((topic_score, score))

Print the scores for each token of the title. We observe that in this case the individual token score is high, but the topic score is approximately the same for all found topics.

In [None]:
for tag, score in zip(title_topic_proposal, title_token_score):
  print('tag : {}, topic score : {}, individual token score : {}'.format(tag, score[0], score[1]))

tag : python, topic score : 0.2525002360343933, individual token score : 0.3098738193511963
tag : html, topic score : 0.2525002360343933, individual token score : 0.10073015838861465
tag : size, topic score : 0.25249892473220825, individual token score : 0.16774149239063263
tag : display, topic score : 0.25249892473220825, individual token score : 0.1414836049079895
tag : angular, topic score : 0.25249892473220825, individual token score : 0.09769216179847717


With the scores we now can threshold the result and obtain a refined tag suggestion:

In [None]:
# catch a tag given a threshold
title_tag_thresh = 0.1
for tag, score in zip(title_topic_proposal, title_token_score):
  if score[1] > title_tag_thresh:
    print('tag : {} ##### topic score : {} ##### individual score : {}'.format(tag, score[0], score[1]))

tag : python ##### topic score : 0.2525002360343933 ##### individual score : 0.3098738193511963
tag : html ##### topic score : 0.2525002360343933 ##### individual score : 0.10073015838861465
tag : size ##### topic score : 0.25249892473220825 ##### individual score : 0.16774149239063263
tag : display ##### topic score : 0.25249892473220825 ##### individual score : 0.1414836049079895


Since this is a suggestion system, we want to have high recall since giving a tag is better than no tag at all.

The final prediction list is the union between the confidence analysis and the tags present in the text.  If we use the intersection we will restrict the tags to the already existent tags. On the other hand using the individual token scores allows for the inclusion of new tags if they were already present before training!

In [None]:
# Tag suggestion:
def lda_tag_suggestion(input_string, lda_model, corpus_dictionary,
                       token2tag_dict, n_proposals = 5, topic_thresh = 0.0,
                       token_thresh = 0.1, n_tags = 1000, new_tags = False, token2tag = True, verbose = True):
  """
  :input_string: string of text representing the title
  :lda_model: LDA model trained on corpus_dictionary
  :corpus_dictionary: Token dictionary
  :token2tag_dict: Token to tag encoding dictionary
  :n_proposals: maximum number of proposals
  :token2tag: set to False to disable token2tag
  :token2tag: set to True to accept new tags
  :title_topic_thresh: trim proposals based on topic score (default = 0)
  :title_token_thresh: trim proposals based on individual token score (default = 0.1)
  :verbose: silences function (default = False)

  :return: List of tag suggestions
  """
  # extract tag proposals from topics
  bow_vector = corpus_dictionary.doc2bow(preprocess(input_string))

  # Extract topic score and individual token scores
  token_scores = []
  proposals = []
  for index, topic_score in sorted(title_lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    topic_composants = title_lda_model.get_topic_terms(index, 5)
    # check topic score
    if topic_score > topic_thresh:
      # enter topic composition
      for bow_id, token_score in topic_composants:
        # check token score
        if token_score > token_thresh:
          # compare with dict of tags
          token = corpus_dictionary[bow_id]
          tag = ''
          if token2tag:
            tag = token2tag_dict.get(token)

            if (tag == '' or tag == None) and new_tags:
              tag = token
          else:
            tag = token
          if tag != '' and tag != None:
            proposals.append(tag)
            token_scores.append(token_score)
            if verbose:
              print('tag : {} ##### topic score : {} ##### individual score : {}'.format(tag, topic_score, token_score))


  ordering = np.arange(len(proposals))
  proposals = np.array(proposals)
  token_scores = np.array(token_scores)
  # select top n_proposals by score:
  if len(proposals) > n_proposals:
    ordering = np.argsort(token_scores)
    proposals = proposals[ordering]
    token_scores = token_scores[ordering]
    return proposals[:n_proposals],token_scores[:n_proposals]
  else:
    return proposals, token_scores

Now we test the suggestion pipeline:

For sanity check we bserve that a string that is not working (possibly because tokenization did not work outputs no tag:

In [None]:
test_str = ''
proposed_tags = lda_tag_suggestion(input_string = test_str, lda_model = title_lda_model, corpus_dictionary = title_dictionary,
                                   token2tag_dict = token2tag_dict, verbose = True)
print('tag proposals = {}'.format(proposed_tags))

tag proposals = (array([], dtype=float64), array([], dtype=float64))


Now we chec the result with a small string

In [None]:
test_str = 'Hello world'
proposed_tags,ps = lda_tag_suggestion(input_string = test_str, lda_model = title_lda_model, corpus_dictionary = title_dictionary,
                                   token2tag_dict = token2tag_dict, verbose = True)
print('tag proposals = {}'.format(proposed_tags))

tag : uses ##### topic score : 0.5049919486045837 ##### individual score : 0.48298272490501404
tag proposals = ['uses']


Now with a possible title:


In [None]:
test_str = 'How can i display keys and values of string python dict?'
proposed_tags, ps = lda_tag_suggestion(input_string = test_str, lda_model = title_lda_model, corpus_dictionary = title_dictionary,
                                   token2tag_dict = token2tag_dict, verbose = True)
print('tag proposals = {}'.format(proposed_tags))

tag : string ##### topic score : 0.1683335304260254 ##### individual score : 0.4313419759273529
tag : types ##### topic score : 0.1683335304260254 ##### individual score : 0.30193570256233215
tag : attributes ##### topic score : 0.168333500623703 ##### individual score : 0.1348075121641159
tag : python ##### topic score : 0.16833336651325226 ##### individual score : 0.3098738193511963
tag : html ##### topic score : 0.16833336651325226 ##### individual score : 0.10073015838861465
tag : key ##### topic score : 0.16833306849002838 ##### individual score : 0.2191891372203827
tag : updates ##### topic score : 0.16833306849002838 ##### individual score : 0.16892598569393158
tag : implementation ##### topic score : 0.16833259165287018 ##### individual score : 0.22439436614513397
tag : size ##### topic score : 0.16833259165287018 ##### individual score : 0.16774149239063263
tag : display ##### topic score : 0.16833259165287018 ##### individual score : 0.1414836049079895
tag proposals = ['html'

Now we check what happens with a Title from the train dataset

In [None]:
# sample text
for _ in range(2):
  print('*******')
  idx = np.random.randint(len(df.Tags))
  title_text = df.Title.to_list()[idx]

  print('title: {}'.format(title_text))
  print('saple tags:', df.Tags.to_list()[idx])

  proposed_tags, _ = lda_tag_suggestion(input_string = title_text,  lda_model = title_lda_model, corpus_dictionary = title_dictionary,
                                   token2tag_dict = token2tag_dict, verbose = False)
  print('tag proposals = {}'.format(proposed_tags))


*******
title: Is it okay to sign two different applications with the same key?
saple tags: ['android']
tag proposals = ['eclipse' 'fixed' 'updates' 'project' 'arrays']
*******
title: Cross platform IPC
saple tags: ['cross-platform', 'ipc']
tag proposals = ['express' 'iphone' 'hide' 'default' 'equivalent']


And now from the test dataset

In [None]:
# sample text
for _ in range(2):
  print('*******')
  idx = np.random.randint(len(test.Tags))
  title_text = test.Title.to_list()[idx]

  print('title: {}'.format(title_text))
  print('saple tags:', test.Tags.to_list()[idx])

  proposed_tags, _ = lda_tag_suggestion(input_string = title_text,  lda_model = title_lda_model, corpus_dictionary = title_dictionary,
                                   token2tag_dict = token2tag_dict, verbose = False)
  print('tag proposals = {}'.format(proposed_tags))

*******
title: Are there alternate implementations of GNU getline interface?
saple tags: ['c', 'licensing', 'getline']
tag proposals = ['alternate' 'assembly' 'display' 'import' 'interface']
*******
title: Remove HTML5 notification permissions
saple tags: ['javascript', 'html', 'html5-notifications']
tag proposals = ['request' 'exception' 'animation' 'removable' 'wcf']


Print of processing pipeline + prediction + gt

## 3.5 Model evaluation

We consider for a tag $i$:

*   $TP_i$: The sum of the total number of correct predictions for the tag
*   $FP_i$: The sum of the total number of bad predictions for the tag
*   $TN_i$: All of the correctly non guessed tags are true negatives.
*   $FN_i$: We do not predict negatives such that this number is zero.

The first metric that would serve as an evaluation given the relevant parameters for the model the average F1-score for all tags.

However, since each tag is unequaly represented on the dataset (as seen on the exploratory analysis), we need to ponder this by the presence of each tag on the dataset.

The best suited evaluation metric for this problem is thus the micro-F1 score.

Micro F1-score (short for micro-averaged F1 score) is used to assess the quality of multi-label binary problems.
It measures the F1-score of the aggregated contributions of all classes.

It corresponds to pondering the average of each class prediction by it's appearence.


We define then

*   $TP = \sum_{\forall i} TP_i$
*   $FP = \sum_{\forall i} FN_i$
*   $N = $  Total number of tags

This quantities can be calculated without calculating the total ammount of

*   $Pr_{micro} = \frac{TP}{TP + FP}$
*   $Re_{micro}= \frac{TP}{N} = accuracy$

As a bonus, this formulation allows us to efficiently calculate the values without calculating $TP_i$ and $FP_i$ for each class!

And we use $F1_{micro} = 2\frac{Pr_{micro}*Re_{micro}}{Pr_{micro}+Re_{micro}}$ as our evaluation metric.



In [None]:
# returns number of elements present in two lists
def count_matches(str_list_1, str_list_2):
  count = 0
  for word in str_list_1:
    if word in str_list_2:
      count+= 1
  return count

Now we check what happens with a Title from the train dataset.

In [None]:
# calculate accuracy, recall, precision, TP, FP
total = 0
TP = 0
FP = 0
n_samples = len(df.Tags)
idxs = np.random.randint(len(df.Tags),size = n_samples)

for i, idx in enumerate(idxs):
  title_text = df.Title.to_list()[idx]
  gt_tags = df.Tags.to_list()[idx]
  proposed_tags, _ = lda_tag_suggestion(input_string = title_text,  lda_model = title_lda_model, corpus_dictionary = title_dictionary,
                                   token2tag_dict = token2tag_dict, verbose = False)
  gt_pos = len(gt_tags)
  pred_pos = len(proposed_tags)
  positives = count_matches(proposed_tags, gt_tags)
  total += gt_pos
  TP += positives
  # FP is given by the excedent of proposals.
  FP += max(pred_pos - TP, 0)
  if True:
    if i % int(n_samples/10) == 0:
      print('processed {} out of {} questions...'.format(i, n_samples))
acc = TP/total
precision = TP / (TP + FP)
recall = TP/(total)
f1 = precision*recall/(precision+recall)
print('Positives = {}'.format(total))
print('micro TP = {}'.format(TP))
print('micro FP = {}'.format(FP))
print('accuracy = {}'.format(acc))
print('micro precision = {}'.format(precision))
print('micro recall = {}'.format(recall))
print('micro f1 = {}'.format(f1))

processed 0 out of 77109 questions...
processed 7710 out of 77109 questions...
processed 15420 out of 77109 questions...
processed 23130 out of 77109 questions...
processed 30840 out of 77109 questions...
processed 38550 out of 77109 questions...
processed 46260 out of 77109 questions...
processed 53970 out of 77109 questions...
processed 61680 out of 77109 questions...
processed 69390 out of 77109 questions...
processed 77100 out of 77109 questions...
Positives = 233740
micro TP = 18999
micro FP = 51
accuracy = 0.08128262171643706
micro precision = 0.9973228346456693
micro recall = 0.08128262171643706
micro f1 = 0.07515724514419082


And now from the test dataset

In [None]:
# calculate accuracy, recall, precision, TP, FP
total = 0
TP = 0
FP = 0
n_samples = len(test.Tags)
idxs = np.random.randint(len(test.Tags),size = n_samples)
for i, idx in enumerate(idxs):
  title_text = test.Title.to_list()[idx]
  gt_tags = test.Tags.to_list()[idx]
  proposed_tags, _ = lda_tag_suggestion(input_string = title_text,  lda_model = title_lda_model, corpus_dictionary = title_dictionary,
                                   token2tag_dict = token2tag_dict, verbose = False)
  gt_pos = len(gt_tags)
  pred_pos = len(proposed_tags)
  positives = count_matches(proposed_tags, gt_tags)
  total += gt_pos
  TP += positives
  # FP is given by the excedent of proposals.
  FP += max(pred_pos - TP, 0)
  if True:
    if i % int(n_samples/10) == 0:
      print('processed {} out of {} questions...'.format(i, n_samples))
acc = TP/total
precision = TP / (TP + FP)
recall = TP/(total)
f1 = precision*recall/(precision+recall)
print('Positives = {}'.format(total))
print('micro TP = {}'.format(TP))
print('micro FP = {}'.format(FP))
print('accuracy = {}'.format(acc))
print('micro precision = {}'.format(precision))
print('micro recall = {}'.format(recall))
print('micro f1 = {}'.format(f1))

processed 0 out of 19278 questions...
processed 1927 out of 19278 questions...
processed 3854 out of 19278 questions...
processed 5781 out of 19278 questions...
processed 7708 out of 19278 questions...
processed 9635 out of 19278 questions...
processed 11562 out of 19278 questions...
processed 13489 out of 19278 questions...
processed 15416 out of 19278 questions...
processed 17343 out of 19278 questions...
processed 19270 out of 19278 questions...
Positives = 57894
micro TP = 4845
micro FP = 13
accuracy = 0.08368742874909317
micro precision = 0.9973240016467683
micro recall = 0.08368742874909317
micro f1 = 0.07720869454360021


# 4. Unsupervised learning on titles

## 4.1 Body Pre-processing

Here we process the body of questions into a bow representation. Applying the same pipeline that we used for the Title is not straightforward since the text contains different sentences.

In [None]:
df.Body.to_list()[:3]

['<p>I was going thru some single page website examples and found  this: <a href="http://alwayscreative.net/" rel="noreferrer">http://alwayscreative.net/</a>. I am totally amazed by the disc in the background that rotates infinitely. i have looked at some examples but none worked that way. Can anyone tell me how was that implemented.\nThanks.</p>\n',
 '<p>Does anybody know why this error happens on Xcode 5?</p>\n<p><img src="https://i.stack.imgur.com/uBCcr.png" alt="error" /></p>\n<p><strong>Answer</strong></p>\n<p>I had this problem when I accidentally renamed a .psd as a .png. Converting the image to an actual png instead of a Photoshop file fixed it for me.</p>\n',
 '<p>I\'m using guice for dependency injection with aop from <a href="http://aopalliance.sourceforge.net/" rel="noreferrer">aopalliance</a>. I can\'t quite figure out what\'s aopalliance all about and who implemented the version (dated from 2004) that\'s on their sourceforge page. Why is guice using this version instead o

In [None]:
processed_Bodies = df.Body.map(preprocess)

In [None]:
processed_Bodies[:10]

67997    [pi, go, singl, page, websit, exampl, hrefhttp...
76486    [pdoe, anybodi, know, error, happen, xcode, 5p...
29308    [pim, guic, depend, inject, aop, hrefhttpaopal...
83844    [pi, follow, 500, server, error, publish, azur...
82803    [pi, get, troubl, rail, project（redmine23, rai...
48827    [pi, tri, implement, hrefhttpenwikipediaorgwik...
30263    [pif, destructor, explicitli, myobjectobject, ...
52564    [pi, need, hrefhttpmsdnmicrosoftcomen-uslibrar...
39947    [pi, tri, write, add-in, visual, studio, thing...
83261    [pto, protect, user, maliciuo, applet, want, d...
Name: Body, dtype: object

In [None]:
body_dictionary = gensim.corpora.Dictionary(processed_Bodies)
count = 0
for k, v in body_dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 amaz
1 background
2 disc
3 exampl
4 go
5 hrefhttpalwayscreativenet
6 implement
7 infinit
8 look
9 page
10 pi


In [None]:
body_dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

In [None]:
body_bow_corpus = [body_dictionary.doc2bow(body) for body in processed_Bodies]

In [None]:
def body_sample_nlp_pipeline(sample_idx):
  print('sample idx:', sample_idx)

  print('saple tags:', df.Tags.to_list()[sample_idx])
  print('\nprocessing pipeline: \n')
  print('sample body:', df.Title.to_list()[sample_idx])
  print('preprocessed title:', processed_Bodies[sample_idx])
  print('bow_corpus of body:', body_bow_corpus[sample_idx])
  print('bag of words equivalence: \n')
  bow_doc_sample = body_bow_corpus[sample_idx]
  for i in range(len(bow_doc_sample)):
      print("Word {} (\"{}\") appears {} time.".format(bow_doc_sample[i][0],
                                                body_dictionary[bow_doc_sample[i][0]], bow_doc_sample[i][1]))


In [None]:
body_sample_nlp_pipeline(1)

sample idx: 1
saple tags: ['ios', 'iphone', 'xcode', 'ios7', 'xcode5']

processing pipeline: 

sample body: Xcode error : Distill failed for unknown reasons
preprocessed title: ['pwhere', 'list', 'mime', 'type', 'identifi', 'charact', 'strongmicrosoft', 'offic', '2007strong', 'filesp', 'pi', 'upload', 'form', 'restrict', 'upload', 'base', 'extens', 'identifi', 'charact', 'strongoffic', '2007', 'mimestrong', 'typesp', 'pcan', 'helpp']
bow_corpus of body: [(17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 2), (36, 1), (37, 1), (38, 1)]
bag of words equivalence: 

Word 17 ("5p") appears 1 time.
Word 18 ("accident") appears 1 time.
Word 19 ("actual") appears 1 time.
Word 20 ("alterror") appears 1 time.
Word 21 ("anybodi") appears 1 time.
Word 22 ("convert") appears 1 time.
Word 23 ("error") appears 1 time.
Word 24 ("file") appears 1 time.
Word 25 ("fix") appears 1 time.
Word

### Question: How many topics should we use?

In [None]:
body_lda_model = gensim.models.LdaMulticore(body_bow_corpus, num_topics=20, id2word=body_dictionary, passes=20, workers=2)

In [None]:
# Save the model
model_file_path = "/content/drive/MyDrive/Colab Notebooks/Lda models/body_lda_model"
body_lda_model.save(model_file_path)

In [None]:
for idx, topic in body_lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.024*"page" + 0.016*"want" + 0.015*"control" + 0.015*"like" + 0.013*"work" + 0.013*"form" + 0.013*"way" + 0.013*"user" + 0.011*"text" + 0.011*"p"
Topic: 1 
Words: 0.037*"user" + 0.026*"request" + 0.019*"servic" + 0.017*"url" + 0.016*"server" + 0.015*"send" + 0.014*"client" + 0.012*"http" + 0.012*"respons" + 0.012*"log"
Topic: 2 
Words: 0.040*"tabl" + 0.023*"queri" + 0.023*"column" + 0.022*"data" + 0.020*"id" + 0.019*"select" + 0.019*"databas" + 0.018*"row" + 0.018*"sql" + 0.012*"p"
Topic: 3 
Words: 0.104*"file" + 0.021*"project" + 0.019*"instal" + 0.017*"build" + 0.016*"directori" + 0.014*"path" + 0.013*"version" + 0.013*"packag" + 0.012*"folder" + 0.011*"creat"
Topic: 4 
Words: 0.672*"--" + 0.048*"-" + 0.029*"+" + 0.023*"info" + 0.015*"-+" + 0.010*"null" + 0.010*"1" + 0.009*"ltdependencygt" + 0.006*"ltoption" + 0.004*"ltplugingt"
Topic: 5 
Words: 0.022*"div" + 0.022*"git" + 0.021*"ltdivgt" + 0.021*"ltdiv" + 0.018*"width" + 0.017*"branch" + 0.015*"color" + 0.014*"comm

In [None]:
def infer_body_topic_score(sample_idx, ):
  for index, score in sorted(body_lda_model[body_bow_corpus[sample_idx]], key=lambda tup: -1*tup[1]):
      print("\nScore: {}\t \nTopic: {}".format(score, body_lda_model.print_topic(index, 10)))

In [None]:
idx = 11010
body_sample_nlp_pipeline(idx)
print("\n#####\nprediction:\n")
infer_body_topic_score(idx)

sample idx: 11010
saple tags: ['html', 'css', 'forms']

processing pipeline: 

sample body: Why use definition lists (DL,DD,DT) tags for HTML forms instead of tables?
preprocessed title: ['pwhat', 'prefer', 'way', 'set', 'html', 'titl', 'head', 'view', 'master', 'pagesp', 'pone', 'way', 'pagetitl', 'aspx', 'file', 'requir', 'master', 'page', 'mess', 'html', 'code', 'let', 'assum', 'server', 'control', 'pure', 'html', 'better', 'idea', 'p', 'pupdat', 'like', 'set', 'titl', 'view', 'control', 'modelp']
bow_corpus of body: [(3, 1), (311, 1), (318, 1), (388, 1), (406, 1), (431, 1), (478, 1), (668, 1), (677, 1), (732, 1), (776, 2), (1449, 1), (5361, 1), (5441, 1)]
bag of words equivalence: 

Word 3 ("exampl") appears 1 time.
Word 311 ("recent") appears 1 time.
Word 318 ("thing") appears 1 time.
Word 388 ("come") appears 1 time.
Word 406 ("form") appears 1 time.
Word 431 ("pfor") appears 1 time.
Word 478 ("likep") appears 1 time.
Word 668 ("advantag") appears 1 time.
Word 677 ("pive") appear

In [None]:
# test on unseen data
unseen_title = 'There seems to be an'
bow_vector = body_dictionary.doc2bow(preprocess(unseen_title))
for index, score in sorted(body_lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {:.3f}\t Topcic: {}".format(score, body_lda_model.print_topic(index, 5)))

Score: 0.050	 Topcic: 0.024*"page" + 0.016*"want" + 0.015*"control" + 0.015*"like" + 0.013*"work"
Score: 0.050	 Topcic: 0.037*"user" + 0.026*"request" + 0.019*"servic" + 0.017*"url" + 0.016*"server"
Score: 0.050	 Topcic: 0.040*"tabl" + 0.023*"queri" + 0.023*"column" + 0.022*"data" + 0.020*"id"
Score: 0.050	 Topcic: 0.104*"file" + 0.021*"project" + 0.019*"instal" + 0.017*"build" + 0.016*"directori"
Score: 0.050	 Topcic: 0.672*"--" + 0.048*"-" + 0.029*"+" + 0.023*"info" + 0.015*"-+"
Score: 0.050	 Topcic: 0.022*"div" + 0.022*"git" + 0.021*"ltdivgt" + 0.021*"ltdiv" + 0.018*"width"
Score: 0.050	 Topcic: 0.350*"#" + 0.041*"end" + 0.036*"def" + 0.028*"import" + 0.025*"precod"
Score: 0.050	 Topcic: 0.078*"ul" + 0.040*"ol" + 0.019*"lia" + 0.018*"li" + 0.017*"data"
Score: 0.050	 Topcic: 0.072*"function" + 0.050*"var" + 0.047*"return" + 0.028*"true" + 0.020*"fals"
Score: 0.050	 Topcic: 0.018*"use" + 0.016*"im" + 0.015*"like" + 0.013*"code" + 0.011*"know"
Score: 0.050	 Topcic: 0.060*"1" + 0.055*"0

## Tag from body

In [None]:
# extract tag proposals from topics
unseen_body = 'How can I declare a struct in java if there is a list of random integers that I can lorem ipsum'
bow_vector = body_dictionary.doc2bow(preprocess(unseen_body))

scores = []
words = []
for index, score in sorted(body_lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    scores.append(score)
    words.append(title_lda_model.get_topic_terms(index, 5))

In [None]:
words[0]

[(1, 0.40980998),
 (305, 0.17781377),
 (77, 0.09306017),
 (269, 0.05579522),
 (143, 0.053933434)]

In [None]:
# compare with dict of tags
for bow_id, score in words[0]:
  if body_dictionary[bow_id] in tags_array:
    print(body_dictionary[bow_id])

background
construct


In [None]:
body_tag_score = []
body_tag_proposal = []
for index, topic_score in sorted(body_lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    words = body_lda_model.get_topic_terms(index, 5)
    # compare with dict of tags
    for bow_id, score in words:
      if body_dictionary[bow_id] in tags_array:
        body_tag_proposal.append(body_dictionary[bow_id])
        body_tag_score.append((topic_score, score))

In [None]:
for tag, score in zip(body_tag_proposal, body_tag_score):
  print('tag : {}, topic score : {}, individual score : {}'.format(tag, score[0], score[1]))

tag : string, topic score : 0.4822022318840027, individual score : 0.02608604170382023
tag : list, topic score : 0.4822022318840027, individual score : 0.018192294985055923
tag : git, topic score : 0.22808238863945007, individual score : 0.021855566650629044
tag : width, topic score : 0.22808238863945007, individual score : 0.01755526103079319


In [None]:
# catch a tag given a threshold
body_tag_thresh = 0.1
for tag, score in zip(body_tag_proposal, body_tag_score):
  if score[1] > body_tag_thresh:
    print('tag : {} ##### topic score : {} ##### individual score : {}'.format(tag, score[0], score[1]))

# 5. Title + body to tags:



In [None]:
def infer_title_tags(title_text, thr = 0.1):
  # extract tag proposals from topics
  bow_vector = title_dictionary.doc2bow(preprocess(title_text))

  scores = []
  words = []
  for index, score in sorted(title_lda_model[bow_vector], key=lambda tup: -1*tup[1]):
      scores.append(score)
      words.append(title_lda_model.get_topic_terms(index, 5))

  title_tag_score = []
  title_tag_proposal = []
  for index, topic_score in sorted(title_lda_model[bow_vector], key=lambda tup: -1*tup[1]):
      words = title_lda_model.get_topic_terms(index, 5)
      # compare with dict of tags
      for bow_id, score in words:
        if title_dictionary[bow_id] in tags_array:
          title_tag_proposal.append(title_dictionary[bow_id])
          title_tag_score.append((topic_score, score))
  # catch a tag given a threshold
  proposals = {}
  for tag, score in zip(title_tag_proposal, title_tag_score):
    if score[1] > thr:
      #print('tag : {} ##### topic score : {} ##### individual score : {}'.format(tag, score[0], score[1]))
      proposals[tag] = (score[0], score[1])
  return proposals

In [None]:
def infer_body_tags(body_text, thr = 0.1):
  # extract tag proposals from topics
  bow_vector = body_dictionary.doc2bow(preprocess(body_text))

  scores = []
  words = []
  for index, score in sorted(body_lda_model[bow_vector], key=lambda tup: -1*tup[1]):
      scores.append(score)
      words.append(body_lda_model.get_topic_terms(index, 5))

  body_tag_score = []
  body_tag_proposal = []
  for index, topic_score in sorted(body_lda_model[bow_vector], key=lambda tup: -1*tup[1]):
      words = body_lda_model.get_topic_terms(index, 5)
      # compare with dict of tags
      for bow_id, score in words:
        if body_dictionary[bow_id] in tags_array:
          body_tag_proposal.append(body_dictionary[bow_id])
          body_tag_score.append((topic_score, score))
  # catch a tag given a threshold
  proposals = {}
  for tag, score in zip(body_tag_proposal, body_tag_score):
    if score[1] > thr:
      #print('tag : {} ##### topic score : {} ##### individual score : {}'.format(tag, score[0], score[1]))
      proposals[tag] = (score[0], score[1])
  return proposals

In [None]:
infer_title_tags('i want to know how to create a python dict given a list of words')

{'python': (0.12625113, 0.30987382),
 'html': (0.12625113, 0.10073016),
 'list': (0.12625109, 0.2860862),
 'api': (0.12625109, 0.11448753),
 'contain': (0.12625109, 0.11375814),
 'group': (0.12625036, 0.1365088),
 'block': (0.12625034, 0.17727534),
 'email': (0.12625034, 0.115352795),
 'default': (0.1262484, 0.15170397),
 'express': (0.1262484, 0.118174724),
 'android': (0.12624569, 0.60257727)}

In [None]:
infer_body_tags('i want to know how to create a python dict given a list of words')

{}

In [None]:
import numpy as np

def infer_tag_proposals(title, body, thr=0.1, n_max=5):
    def aggregate_tags(tag_dict, proposals, scores):
        for tag, scores_tuple in tag_dict.items():
            if tag in proposals:
                idx = proposals.index(tag)
                scores[idx] += sum(scores_tuple)
            else:
                proposals.append(tag)
                scores.append(sum(scores_tuple))

    # Get title and body tag proposals
    title_tags = infer_title_tags(title, thr)
    body_tags = infer_body_tags(body, thr)

    # Aggregate tags and scores from title and body
    proposals = []
    scores = []

    aggregate_tags(title_tags, proposals, scores)
    aggregate_tags(body_tags, proposals, scores)

    # Sort proposals based on scores
    sorted_indices = np.argsort([-score for score in scores])
    sorted_proposals = [proposals[idx] for idx in sorted_indices]

    # Return only up to n_max proposals
    return sorted_proposals[:n_max]

# Example usage:
title = 'i want to know how to create a python dict given a list of words'
body = title
infer_tag_proposals(title, body, thr=0.1, n_max=5)


['android', 'python', 'list', 'block', 'default']

In [None]:
test1 = "i want to know how to create a python dict given a list of words"
infer_tag_proposals(test1, test1)

['android', 'python', 'list', 'block', 'default']