This notebook demonstrates topic modelling.
Topic modelling works on the assumption that a text is built from words that share a topic, and it tries to group together the words that belong to the same topic. It can be used for recommendation algorithms; in my case I used it to look at individual categories and check whether the output can be interpreted to understand what each category is about.
For a closer look at how the algorithm works, see:
https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24
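
To make the assumption concrete, here is a toy sketch of the generative story behind topic models: every word in a document is produced by first picking a topic and then picking a word from that topic. It is a simplification (no Dirichlet priors), and the topic names, words and weights are invented for illustration; none of them come from the data used in this notebook.

```python
import random

# Hypothetical topics: each maps words to probabilities (made up for illustration)
TOPICS = {
    "vision": {"image": 0.5, "pixel": 0.3, "camera": 0.2},
    "graphs": {"node": 0.5, "edge": 0.3, "path": 0.2},
}

def generate_document(topic_mix, length, rng):
    """Draw `length` words: pick a topic per word, then a word from that topic."""
    topic_names = list(topic_mix)
    topic_weights = [topic_mix[name] for name in topic_names]
    words = []
    for _ in range(length):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        vocab = TOPICS[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return words

rng = random.Random(0)  # fixed seed so the sketch is repeatable
doc = generate_document({"vision": 0.8, "graphs": 0.2}, 10, rng)
print(doc)
```

A topic model works in the opposite direction: it only sees documents like `doc` and tries to recover the word groups that could have produced them.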

In [0]:
import pandas as pd
import re
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from stop_words import get_stop_words
from nltk.corpus import stopwords
from gensim import corpora
import gensim
frame=pd.read_csv('framemai17.csv')
frame=frame.drop(frame.columns[0], axis=1)

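This part is the pre-processing that you also saw in the trend detection, so there is nothing new here: keep the titles whose subjects include Computer Science, strip punctuation and line breaks, convert to lower case, POS-tag and lemmatise each word, and remove stop words (from the stop_words package, NLTK and the file atire_puurula.txt).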

In [0]:
liste=[]
for s in range(0,len(frame)):
    if re.findall("Computer Science",frame["Subjects"][s]):
        liste.append(frame["Title"][s])
# Strip punctuation characters from every title
for ch in '.,:!()"${}[]':
    liste = [w.replace(ch, '') for w in liste]
liste = [w.lower() for w in liste]
# Remove the line-break sequences left over in the titles
for seq in ('\r\r\r\r\r\r\n', '\r\r\r\r\n', '\r'):
    liste = [w.replace(seq, '') for w in liste]
# Drop empty titles (build a new list rather than removing while iterating,
# which would skip elements)
liste = [w for w in liste if w != '']
liste_tagged=[]
lemma=WordNetLemmatizer()
for element in liste:
    liste_tagged.append(pos_tag(word_tokenize(element)))
liste_lemma=[]
for element in liste_tagged:
    for word, tag in element:
        if tag.startswith("NN"):
            liste_lemma.append(lemma.lemmatize(word, pos='n'))
        elif tag.startswith('VB'):
            liste_lemma.append(lemma.lemmatize(word, pos='v'))
        elif tag.startswith('JJ'):
            liste_lemma.append(lemma.lemmatize(word, pos='a'))
        else:
            liste_lemma.append(word)
stop_words = get_stop_words('en')
stopWords = set(stopwords.words('english'))
# Load the additional stop-word list, one word per line
with open("atire_puurula.txt", "r") as fo:
    new_out = [w.replace('\n', '') for w in fo.readlines()]
liste_stopped=[]

for w in liste_lemma:
    if w not in stopWords and w not in stop_words and w not in new_out:
        liste_stopped.append(w)
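
The character-stripping step of the cell above can also be written with a single translation table. This is a minimal sketch for Python 3 (`str.maketrans` does not exist in this form in Python 2), and `clean_title` and `clean_titles` are hypothetical helper names, not part of the notebook. One difference to be aware of: the sketch drops every line feed, while the cell above only removes line feeds that appear in the specific `\r...\n` sequences.

```python
# Characters removed by the notebook's chain of replace() calls, plus line breaks
PUNCTUATION = '.,:!()"${}[]'
STRIP_TABLE = str.maketrans('', '', PUNCTUATION + '\r\n')

def clean_title(title):
    """Remove punctuation and line breaks, then lowercase."""
    return title.translate(STRIP_TABLE).lower()

def clean_titles(titles):
    """Clean every title and drop the ones that end up empty."""
    cleaned = [clean_title(t) for t in titles]
    return [t for t in cleaned if t]

print(clean_titles(['Deep Learning: A (Short) Survey!\r\n', '...']))
```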

This part is the actual computation of the LDA (Latent Dirichlet Allocation) model. Funnily enough, it is by far the shortest part of the task; most of the code in this notebook goes into pre-processing and cleaning.
The code builds a dictionary of the words, counts how often each word occurs, and then trains a model that groups these words together. The model has three parameters to play around with:
1. The number of topics (num_topics)
2. The number of words shown per topic (num_words)
3. The number of training passes over the corpus (passes)

In this example I chose 3 topics with 5 words each.
We are looking at the output for Computer Science here, and each topic roughly corresponds to a different subcategory of the discipline: Machine Learning, Neural Networks and Deep Learning.
For my thesis I tried different parameters and compared the outputs to see which ones give the best insight.
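
To illustrate what "dictionary" and "word counts" mean here, the following sketch rebuilds both ideas with the standard library only. It mimics the kind of data that `corpora.Dictionary` and `doc2bow` produce, but it is not gensim's implementation, and gensim may assign different ids. The helper names and the example documents are made up.

```python
from collections import Counter

def build_dictionary(documents):
    """Assign each distinct token an integer id, in order of first appearance."""
    token_to_id = {}
    for doc in documents:
        for token in doc:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id

def to_bag_of_words(doc, token_to_id):
    """Return sorted (token_id, count) pairs for one tokenised document."""
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())

# Hypothetical tokenised documents
docs = [["deep", "neural", "network"], ["network", "analysis", "network"]]
token_to_id = build_dictionary(docs)
corpus = [to_bag_of_words(d, token_to_id) for d in docs]
print(token_to_id)
print(corpus)
```

The model never sees the words themselves, only these id and count pairs, which is why the dictionary is passed along as id2word so the topics can be printed as words again.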

In [0]:
# Python 2: make sure every entry is a unicode string
liste_unicode = [unicode(w) for w in liste_stopped]
# Tokenise once and reuse the result for the dictionary and the corpus
texts = [element.split() for element in liste_unicode]
dictionary = corpora.Dictionary(texts)
doc_term_matrix = [dictionary.doc2bow(text) for text in texts]
Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=10)
print(ldamodel.print_topics(num_topics=3, num_words=5))

[(0, u'0.058*"network" + 0.026*"learning" + 0.017*"analysis" + 0.017*"data" + 0.009*"machine"'), (1, u'0.027*"neural" + 0.027*"model" + 0.015*"image" + 0.015*"base" + 0.012*"approach"'), (2, u'0.027*"deep" + 0.024*"learn" + 0.014*"algorithm" + 0.013*"graph" + 0.013*"detection"')]
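
When comparing several parameter settings, it can help to turn these topic strings into (word, weight) pairs. The sketch below parses the first topic from the output above with a regular expression; `parse_topic` is a hypothetical helper, and the pattern assumes the `weight*"word"` layout shown in the output.

```python
import re

# Matches one weight*"word" term in a topic string
TERM = re.compile(r'(\d+\.\d+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Return a list of (word, weight) pairs in the order they appear."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]

topic_0 = ('0.058*"network" + 0.026*"learning" + 0.017*"analysis" '
           '+ 0.017*"data" + 0.009*"machine"')
terms = parse_topic(topic_0)
print(terms)
```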
