### Topic modelling, Latent Dirichlet Allocation
Topic modelling means describing each individual item (here, a course) in terms of a fixed number of topics.  
Latent Dirichlet Allocation (LDA) is a method that learns the topics and assigns each item a probability of belonging to each topic.  
A topic consists of a set of words found in a subset of the items. In the optimal case, the words in this set fit together.  
LDA does not produce new words that are absent from the input data, and it does not assign one overall label to a topic (which is what I first assumed).  
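
To make the idea of "a probability for each item belonging to a topic" concrete, here is a minimal sketch with made-up numbers. The topic names, words and probabilities are all hypothetical and are not output of the model trained later in this notebook. It only illustrates the mixture view behind LDA: a document is a distribution over topics, a topic is a distribution over words, and the probability of a word in a document is the sum over topics of P(topic | document) * P(word | topic).

```python
# Hypothetical toy numbers, chosen only for illustration.
# Each topic is a probability distribution over words.
topic_word = {
    "finance": {"market": 0.5, "firm": 0.4, "design": 0.1},
    "design": {"market": 0.1, "firm": 0.1, "design": 0.8},
}
# Each document (course) is a probability distribution over topics.
doc_topic = {"finance": 0.7, "design": 0.3}


def word_probability(word, doc_topic, topic_word):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        p_topic * topic_word[topic].get(word, 0.0)
        for topic, p_topic in doc_topic.items()
    )


for w in ("market", "firm", "design"):
    print(w, round(word_probability(w, doc_topic, topic_word), 3))
```

Note that the topic names in this toy example are labels I added by hand; LDA itself only returns the word distributions.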

The goal of topic modelling in this project is to group courses and to use these groupings in the recommendation.

This notebook runs some simple tests. Note that most of it is inspired by others' work, and that it is only a first version. 


The main source of inspiration was https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24

In [1]:
#import the needed packages
import pandas as pd

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

from nltk.stem import WordNetLemmatizer, SnowballStemmer
import nltk
#uncomment on the first run to download the WordNet data used by the lemmatizer
#nltk.download('wordnet')
#from nltk.corpus import stopwords


In [2]:
#read the course info
df=pd.read_csv('../Data/filtered_courses.csv')

We use part of the course data as input.  
The cell below keeps 'name' and 'content' and also builds a 'combined' text field from several other columns; the 'combined' field is what gets preprocessed and fed to the model.  
Keys that might be used:
'name','content','learningOutcomes','additionalInformation','courseStatus','prerequisities','substitutes'

In [33]:
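df['content']=df['content'].fillna('')
#.copy() avoids the pandas SettingWithCopyWarning when columns are added below
data_text = df[['name','content']].copy()
#note: astype(str) turns the remaining missing values into the string 'nan',
#which is why 'nan' shows up as a token in the topics further down
df=df.astype(str)
text_columns=['additionalInformation', 'assesmentMethods','content','courseStatus','credits','gradingScale','learningOutcomes','level','literature','organizationId','prerequisities','teacherInCharge','type','workload']
#join the selected columns into one text per course
data_text['combined']=df[text_columns].apply(lambda x: ' '.join(x), axis=1)
data_text['index'] = data_text.index
documents = data_text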

In [34]:
documents.head()

| | name | content | combined | index |
|---|---|---|---|---|
| 0 | Capstone: Business Development Project | The course consists of an applied, real-life p... | Compulsory attendance in all class sessions an... | 0 |
| 1 | Introduction to business | This introductory course gives a basic underst... | The minimum number of participants is 20 Learn... | 1 |
| 2 | Human Resource Management | Throughout this course, we will be covering di... | Max. 100 students. Priority for management stu... | 2 |
| 3 | Current Issues in Leadership | The course is taught by a visiting lecturer an... | nan nan The course is taught by a visiting lec... | 3 |
| 4 | Business and Society | Must know: the concepts of "concept and contex... | nan 50% reflective learning diary50% final ess... | 4 |


NLTK and Gensim are both common libraries for working with text. 
I experimented a bit with both and chose to use a mix of them. For a final version, this choice should be tested further. 

In [35]:
def preprocess(doc):
    #tokenize, remove stopwords, lemmatize and stem

    #gensim stopwords; the NLTK list is a bit less strict (e.g. 'also' is not a stopword in NLTK)
    #stop=stopwords.words('english')
    stop=STOPWORDS
    wnl = WordNetLemmatizer()
    sbs=SnowballStemmer('english')

    #From the documentation: convert a document into a list of lowercase tokens, ignoring tokens that are too short or too long.
    #I also tested RegexpTokenizer from NLTK, but preferred the outcome of this one
    tokens=simple_preprocess(doc)

    #drop stopwords and tokens shorter than 3 characters, then lemmatize and stem
    lem=[wnl.lemmatize(t) for t in tokens if t not in stop and len(t)>=3]
    stem=[sbs.stem(l) for l in lem]
    return stem
            
processed_docs = documents['combined'].map(preprocess)

#print an example of an original and a preprocessed document
#column 2 of the selected row is 'combined'
doc_sample=documents[documents['index'] == 0].values[0][2]

print('Original document: ')
print(doc_sample.split(' '))
print('\n\n Preprocessed document: ')
print(preprocess(doc_sample))

Original document: 
['Compulsory', 'attendance', 'in', 'all', 'class', 'sessions', 'and', 'meetings.', 'Most', 'Master¿s', 'Programme', 'studies', 'have', 'to', 'be', 'completed', 'before', 'you', 'can', 'enroll', 'on', 'the', 'Capstone', 'course.', 'The', 'maximum', 'number', 'of', 'students', 'is', '50,', 'but', 'only', 'eligible', 'candidates', 'will', 'be', 'admitted', 'even', 'if', 'the', 'maximum', 'number', 'is', 'not', 'reached.', 'Credit', 'transfer', 'and', 'capstone', 'course', 'Students', 'can,', 'on', 'legitimate', 'grounds', '(Eg.', 'exchange', 'studies', 'abroad,', 'serious', 'illness;', 'however,', 'working', 'life', 'and', 'its', 'restraints', 'are', 'not', 'considered', 'legitimate', 'reasons', 'not', 'to', 'complete', 'the', 'capstone', 'course),', 'apply', 'for', 'a', 'credit', 'transfer', 'for', 'a', 'capstone', 'course.', 'However,', 'as', 'a', 'deviation', 'from', 'the', 'common', 'process', 'of', 'credit', 'transfer', 'at', 'the', 'School', 'of', 'Business,', 't

In [36]:
#build the set of all tokens in processed_docs and assign each an integer id (the dictionary key)
dict_docs=Dictionary(processed_docs)

print("Number of unique tokens:",len(dict_docs))

#print the first 11 entries (ids 0 to 10)
count = 0
for k, v in dict_docs.items():
    print(k, v)
    count += 1
    if count > 10:
        break

Number of unique tokens: 7814
0 aalto
1 abl
2 abroad
3 accord
4 admit
5 advanc
6 analys
7 analyz
8 appli
9 applic
10 assign


In [37]:
#the function below filters out extreme tokens; whether this is needed depends on the situation
#no_below: drop tokens that occur in fewer than this many documents (absolute count)
#no_above: drop tokens that occur in more than this fraction of the documents
#keep_n: keep only the n most frequent of the remaining tokens
#dict_docs.filter_extremes(no_above=0.1, keep_n=3500)
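
To get a feeling for what such a filter does before switching it on, here is a plain-Python sketch of the idea. It is not gensim's implementation: `filter_extremes_sketch` is a hypothetical helper, and the default values are my assumption of how `filter_extremes` behaves by default. It counts in how many documents each token occurs and keeps only tokens inside the allowed range.

```python
from collections import Counter


def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive document-frequency filtering.

    docs is a list of token lists. This is a simplified illustration,
    not gensim's own code.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    kept = [
        token
        for token, df in doc_freq.items()
        if df >= no_below and df <= no_above * n_docs
    ]
    # Keep only the keep_n tokens with the highest document frequency.
    kept.sort(key=lambda token: doc_freq[token], reverse=True)
    return set(kept[:keep_n])


toy_docs = [
    ["cours", "financ"],
    ["cours", "design"],
    ["cours", "financ", "game"],
    ["cours", "art"],
]
# 'cours' is in every document (too common); 'design', 'game', 'art' are in one (too rare)
print(filter_extremes_sketch(toy_docs, no_below=2, no_above=0.5))
```

On the course data, a token such as "cours" would be a natural candidate for removal by `no_above`, since it appears in most topics without saying much about any of them.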

In [38]:
#convert the documents to a bag of words
#for each course this gives a list of (token id, count) pairs: which words occur and how often
bow_docs=[dict_docs.doc2bow(d) for d in processed_docs]
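
As an illustration of the bag-of-words format, the following plain-Python sketch builds the same kind of (token id, count) pairs by hand. `build_vocab` and `to_bow` are hypothetical helpers, and the id assignment is simplified (order of first appearance), so the ids will not match the ones gensim produces.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens of doc that are in vocab."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


toy_docs = [["cours", "student", "cours"], ["student", "financ"]]
vocab = build_vocab(toy_docs)
print(vocab)                        # {'cours': 0, 'student': 1, 'financ': 2}
print(to_bow(toy_docs[0], vocab))   # [(0, 2), (1, 1)]
```

Word order is lost in this representation; only the counts per token remain.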

In [39]:
#print a sample of the bag of words of one course
bow_doc_sample = bow_docs[100]
for token_id, freq in bow_doc_sample:
    print("Word {} (\"{}\") appears {} time.".format(token_id, dict_docs[token_id], freq))

Word 0 ("aalto") appears 2 time.
Word 4 ("admit") appears 1 time.
Word 5 ("advanc") appears 1 time.
Word 8 ("appli") appears 1 time.
Word 10 ("assign") appears 1 time.
Word 13 ("base") appears 1 time.
Word 17 ("case") appears 1 time.
Word 19 ("class") appears 1 time.
Word 23 ("complet") appears 1 time.
Word 28 ("contact") appears 2 time.
Word 29 ("corpor") appears 3 time.
Word 30 ("cours") appears 8 time.
Word 43 ("exchang") appears 1 time.
Word 62 ("knowledg") appears 1 time.
Word 68 ("manag") appears 2 time.
Word 69 ("mandatori") appears 2 time.
Word 71 ("maximum") appears 2 time.
Word 99 ("skill") appears 1 time.
Word 102 ("student") appears 8 time.
Word 103 ("studi") appears 8 time.
Word 107 ("teach") appears 1 time.
Word 129 ("elect") appears 3 time.
Word 133 ("exam") appears 2 time.
Word 136 ("financ") appears 6 time.
Word 137 ("firm") appears 3 time.
Word 140 ("function") appears 1 time.
Word 152 ("lectur") appears 4 time.
Word 153 ("market") appears 2 time.
Word 159 ("particip"

In [40]:
#now we can input the bag of words to the LDA model and get our topics! 
#num_topics defines the number of topics it outputs
lda_model = LdaMulticore(bow_docs, num_topics=20, id2word=dict_docs, passes=2, workers=2)

In [41]:
#print the words belonging to each topic! 
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.027*"nan" + 0.027*"cours" + 0.020*"student" + 0.015*"game" + 0.010*"market" + 0.008*"studi" + 0.008*"design" + 0.008*"lectur" + 0.007*"work" + 0.007*"exercis"
Topic: 1 
Words: 0.050*"cours" + 0.026*"student" + 0.014*"work" + 0.012*"lectur" + 0.012*"nan" + 0.011*"project" + 0.011*"design" + 0.010*"master" + 0.010*"studi" + 0.008*"process"
Topic: 2 
Words: 0.030*"cours" + 0.015*"student" + 0.014*"busi" + 0.013*"understand" + 0.011*"sustain" + 0.010*"develop" + 0.009*"studi" + 0.009*"manag" + 0.009*"nan" + 0.009*"model"
Topic: 3 
Words: 0.026*"structur" + 0.015*"ice" + 0.014*"cours" + 0.011*"fuel" + 0.010*"cell" + 0.010*"materi" + 0.010*"work" + 0.009*"test" + 0.009*"understand" + 0.008*"student"
Topic: 4 
Words: 0.045*"cours" + 0.031*"student" + 0.022*"skill" + 0.018*"work" + 0.013*"studi" + 0.012*"communic" + 0.010*"present" + 0.009*"project" + 0.008*"nan" + 0.008*"write"
Topic: 5 
Words: 0.041*"cours" + 0.027*"student" + 0.025*"nan" + 0.015*"research" + 0.013*"studi"
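
Each printed topic is a string of the form weight*"word" + weight*"word" + ..., where the weight is the probability of that word within the topic. If the pairs are needed for further processing, a small sketch like the one below can split such a string; `parse_topic` is a hypothetical helper and the input is a shortened copy of topic 3 from the output above.

```python
import re


def parse_topic(topic_str):
    """Split a 'weight*"word" + ...' string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]


topic_3 = '0.026*"structur" + 0.015*"ice" + 0.014*"cours" + 0.011*"fuel"'
print(parse_topic(topic_3))
```

With a trained model at hand, `lda_model.show_topic(3)` should return such (word, probability) pairs directly, so parsing is mainly useful when only the printed output is available.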

In [42]:
#Try with tf-idf

In [43]:
from pprint import pprint
from gensim import models

#re-weight the bag-of-words counts with TF-IDF
tfidf = models.TfidfModel(bow_docs)
corpus_tfidf = tfidf[bow_docs]
#print the TF-IDF weights of the first document
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.02075140837182791),
 (1, 0.013250239071957146),
 (2, 0.06835884923831546),
 (3, 0.03863250427237543),
 (4, 0.04484150580231598),
 (5, 0.012526598419456726),
 (6, 0.03570979717795061),
 (7, 0.025229956948338864),
 (8, 0.04857666181366953),
 (9, 0.043095751780433206),
 (10, 0.010295476334941161),
 (11, 0.020803080966136176),
 (12, 0.04115335036878248),
 (13, 0.029581823518127576),
 (14, 0.1622046780190968),
 (15, 0.061748314519394214),
 (16, 0.6174831451939421),
 (17, 0.051781374347831),
 (18, 0.048658016705338314),
 (19, 0.01758993002837512),
 (20, 0.055218585451640864),
 (21, 0.027569820598196168),
 (22, 0.024396081369395747),
 (23, 0.10169012631113153),
 (24, 0.03788625109486548),
 (25, 0.019285940217996796),
 (26, 0.04287292939284812),
 (27, 0.03093099638741497),
 (28, 0.015864115276122415),
 (29, 0.04235906794649344),
 (31, 0.27067019146135035),
 (32, 0.056581293025185116),
 (33, 0.0777965418015885),
 (34, 0.014824242772669523),
 (35, 0.06532059085128834),
 (36, 0.10260317826
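
To show where such weights come from, here is a plain-Python sketch of TF-IDF. It assumes the weighting that, as far as I understand, `TfidfModel` uses by default: the raw count times log2(N / document frequency), followed by scaling each document to unit length. `tfidf_sketch` is a hypothetical helper, not gensim's code, and the toy corpus is made up.

```python
import math


def tfidf_sketch(bow_docs):
    """Weight each (token_id, count) by log2(N / df), then L2-normalise per document."""
    n_docs = len(bow_docs)
    # Document frequency of every token id
    doc_freq = {}
    for doc in bow_docs:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted_docs = []
    for doc in bow_docs:
        weights = [
            (token_id, count * math.log2(n_docs / doc_freq[token_id]))
            for token_id, count in doc
        ]
        # Tokens that occur in every document get weight 0 and are dropped
        weights = [(token_id, w) for token_id, w in weights if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        weighted_docs.append(
            [(token_id, w / norm) for token_id, w in weights] if norm else []
        )
    return weighted_docs


# Token 0 occurs in all four toy documents, so it disappears from the output
toy_bow = [[(0, 2), (1, 1)], [(0, 1), (2, 3)], [(0, 1), (1, 1), (2, 1)], [(0, 4), (3, 1)]]
for doc in tfidf_sketch(toy_bow):
    print([(token_id, round(w, 3)) for token_id, w in doc])
```

In the output above, token id 30 ("cours") is missing between ids 29 and 31. That would be consistent with a token that occurs in (nearly) every course and therefore gets a weight of zero, although I have not verified this on the data.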

In [44]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=5, id2word=dict_docs, passes=2, workers=4)
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.002*"art" + 0.002*"design" + 0.002*"project" + 0.002*"artist" + 0.002*"cultur" + 0.002*"learn" + 0.002*"urban" + 0.002*"plan" + 0.002*"system" + 0.001*"research"
Topic: 1 Word: 0.003*"project" + 0.003*"design" + 0.002*"servic" + 0.002*"softwar" + 0.002*"process" + 0.002*"skill" + 0.002*"market" + 0.002*"manag" + 0.002*"research" + 0.002*"medium"
Topic: 2 Word: 0.003*"busi" + 0.002*"design" + 0.002*"project" + 0.002*"model" + 0.002*"research" + 0.002*"develop" + 0.002*"thesi" + 0.002*"process" + 0.002*"method" + 0.002*"market"
Topic: 3 Word: 0.002*"design" + 0.002*"energi" + 0.002*"project" + 0.002*"doctor" + 0.001*"research" + 0.001*"mechan" + 0.001*"scienc" + 0.001*"exercis" + 0.001*"system" + 0.001*"method"
Topic: 4 Word: 0.008*"nan" + 0.003*"game" + 0.002*"data" + 0.002*"busi" + 0.002*"design" + 0.002*"research" + 0.002*"financ" + 0.002*"manag" + 0.002*"model" + 0.002*"project"
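
Since the goal is to group courses, one simple option is to assign every course to its most probable topic. The sketch below does this for per-course topic distributions given as lists of (topic id, probability) pairs, which is the format I would expect from `lda_model[bow]`. `group_by_dominant_topic` is a hypothetical helper and the numbers are made up.

```python
from collections import defaultdict


def group_by_dominant_topic(doc_topics):
    """Map each topic id to the indices of the documents for which it is most probable."""
    groups = defaultdict(list)
    for doc_index, topics in enumerate(doc_topics):
        if not topics:
            continue  # no topic was reported for this document
        best_topic, _ = max(topics, key=lambda pair: pair[1])
        groups[best_topic].append(doc_index)
    return dict(groups)


# Hypothetical topic distributions for three courses
toy_doc_topics = [
    [(0, 0.1), (2, 0.9)],
    [(1, 0.6), (2, 0.4)],
    [(2, 0.55), (4, 0.45)],
]
print(group_by_dominant_topic(toy_doc_topics))  # {2: [0, 2], 1: [1]}
```

Taking only the top topic throws away information: the third toy course is almost evenly split between two topics. For the recommendation, comparing the full distributions might work better, but that still needs to be tested.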


### Comments
Some structure is visible, but the topics also contain quite a few seemingly unrelated words.

### Questions
- Try different methods
    - Try with TF-IDF instead of BOW
    
- What can be the inputs for this method? Should the input say something about the subject of the course? I would say so.
    - Is this a difference compared to e.g. Word2Vec, which places each entry in a vector space and hence could in theory encode any information we find important (e.g. teacher)?
        - In that case it is arguably no longer topic modelling
- Experiment more with different input data
- Be aware of the differences between NLTK and Gensim
