### Topic modelling, Latent Dirichlet Allocation
Topic modelling assigns each individual item (here: each course) to one or more of a fixed number of topics. 
Latent Dirichlet Allocation (LDA) is a method that learns the topics and assigns each item a probability of belonging to each topic.
A topic consists of a set of words found in a subset of the items. Ideally, the words in a topic fit together.  
LDA doesn't produce new words that are not in the input data, and it also doesn't assign one overall label to a topic (that's what I thought at first).  
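
To make this concrete, here is a minimal sketch with made-up numbers (the words and probabilities are hypothetical, not results from the course data). It shows the two things LDA learns, a word distribution per topic and a topic distribution per item, and how the two combine into the probability of seeing a word in an item.

```python
# Hypothetical toy numbers, not learned from the course data.
# Each topic is a probability distribution over words from the input vocabulary.
topic_word = {
    0: {"invest": 0.5, "cash": 0.3, "design": 0.2},
    1: {"invest": 0.1, "cash": 0.1, "design": 0.8},
}

# Each item (here: one course) is a probability distribution over topics.
doc_topic = {0: 0.75, 1: 0.25}


def word_probability(word, doc_topic, topic_word):
    """P(word | item) = sum over topics of P(topic | item) * P(word | topic)."""
    return sum(
        p_topic * topic_word[topic].get(word, 0.0)
        for topic, p_topic in doc_topic.items()
    )


p_invest = word_probability("invest", doc_topic, topic_word)
print(round(p_invest, 3))
```

Note that the topics are only numbered: any human-readable label has to be added by whoever inspects the word lists.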

The goal of topic modelling here is to group courses and use these groupings in the recommendation.

This notebook runs some simple tests. Note, however, that most of it is inspired by others' work, and that it is just a first version. 


The main source of inspiration was https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24

In [1]:
#import the needed packages
import pandas as pd

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import nltk
#nltk.download('wordnet')
#from nltk.corpus import stopwords


In [2]:
#read the course info
df=pd.read_csv('../Data/filtered_courses.csv')

We use part of the course data as input data.  
Here we only use the 'name' and 'content' columns.  
The full set of keys that might be used:  
'name','content','learningOutcomes','additionalInformation','courseStatus','prerequisities','substitutes'

In [3]:
df['content']=df['content'].fillna('')
#take an explicit copy so that adding columns does not trigger pandas' SettingWithCopyWarning
data_text = df[['name','content']].copy()
data_text['combined']=data_text[['name', 'content']].apply(lambda x: ' '.join(x), axis=1)
data_text['index'] = data_text.index
documents = data_text


In [4]:
documents.head()

| | name | content | combined | index |
|---|---|---|---|---|
| 0 | Capstone: Business Development Project | The course consists of an applied, real-life p... | Capstone: Business Development Project The cou... | 0 |
| 1 | Introduction to business | This introductory course gives a basic underst... | Introduction to business This introductory cou... | 1 |
| 2 | Human Resource Management | Throughout this course, we will be covering di... | Human Resource Management Throughout this cour... | 2 |
| 3 | Current Issues in Leadership | The course is taught by a visiting lecturer an... | Current Issues in Leadership The course is tau... | 3 |
| 4 | Business and Society | Must know: the concepts of "concept and contex... | Business and Society Must know: the concepts o... | 4 |


NLTK and Gensim are both common libraries for working with text. 
I experimented a bit with both and chose to use a mix of them. For a final version, this choice needs more experimentation. 

In [5]:
def preprocess(doc):
    #tokenize, remove stopwords and short tokens, lemmatize and stem
    
    #the NLTK stopword list is a bit less strict than gensim's, e.g. 'also' is not a stopword in NLTK
    #stop=stopwords.words('english')
    #gensim stopwords
    stop=STOPWORDS
    wnl = WordNetLemmatizer()
    sbs=SnowballStemmer('english')
    
    #From documentation: Convert a document into a list of lowercase tokens, ignoring tokens that are too short or too long.
    #also tested RegexpTokenizer from nltk, but I like the outcome of this one better
    tokens=simple_preprocess(doc)
    
    lem=[wnl.lemmatize(t) for t in tokens if t not in stop and len(t)>=3]
    stem=[sbs.stem(l) for l in lem]
    return stem
            
processed_docs = documents['combined'].map(preprocess)

#print an example of an original and a preprocessed document
doc_sample=documents.loc[0, 'combined']

print('Original document: ')
print(doc_sample.split(' '))
print('\n\n Preprocessed document: ')
print(preprocess(doc_sample))

Original document: 
['Capstone:', 'Business', 'Development', 'Project', 'The', 'course', 'consists', 'of', 'an', 'applied,', 'real-life', 'problem-based', 'project/case', 'that', 'students', 'identify,', 'analyze', 'and', 'solve', 'in', 'multi-disciplinary', 'teams.', 'It', 'also', 'focuses', 'on', 'developing', 'the', 'students¿', 'self-awareness', 'of', 'the', 'key', 'learnings', 'during', 'their', 'studies', 'in', 'the', 'Master¿s', 'Program.']


 Preprocessed document: 
['capston', 'busi', 'develop', 'project', 'cours', 'consist', 'appli', 'real', 'life', 'problem', 'base', 'project', 'case', 'student', 'identifi', 'analyz', 'solv', 'multi', 'disciplinari', 'team', 'focus', 'develop', 'student', 'self', 'awar', 'key', 'learn', 'studi', 'master', 'program']
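
Much of the difference between the two printouts comes from the tokenising step: `simple_preprocess` lowercases the text and splits it on punctuation and whitespace, which is why 'real-life' ends up as two tokens. The sketch below approximates only the tokenising and filtering steps in plain Python. The regular expression and the tiny `TOY_STOPWORDS` set are assumptions for illustration, not gensim's implementation or its stopword list, and lemmatising and stemming are left out.

```python
import re

# Hypothetical mini stopword list; gensim's STOPWORDS is far larger.
TOY_STOPWORDS = frozenset({"the", "of", "an", "that", "and", "in", "it", "also", "on"})


def rough_tokens(text, stopwords=TOY_STOPWORDS, min_len=3):
    """Lowercase, split on non-letters, drop stopwords and short tokens."""
    # [^\W\d_]+ matches runs of letters only (no digits, no underscore).
    candidates = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in candidates if t not in stopwords and len(t) >= min_len]


sample = "The course consists of an applied, real-life problem-based project/case"
tokens = rough_tokens(sample)
print(tokens)
```

The remaining gap to the real output (e.g. 'course' versus 'cours') is the work of the lemmatizer and stemmer.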


In [6]:
#builds a dictionary that maps every unique token in processed_docs to an integer id
dict_docs=Dictionary(processed_docs)

print("Number of unique tokens:",len(dict_docs))

#print the first 11 entries (the loop stops once count exceeds 10)
count = 0
for k, v in dict_docs.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

Number of unique tokens: 6165
0 analyz
1 appli
2 awar
3 base
4 busi
5 capston
6 case
7 consist
8 cours
9 develop
10 disciplinari


In [7]:
#with the function below you can filter very rare and very common tokens out of the dictionary
#whether this is necessary depends on the situation
#no_below=drop tokens that appear in fewer than that number of documents, no_above=drop tokens that appear in more than that fraction of the documents, keep_n=keep only the n most frequent of the remaining tokens
#dict_docs.filter_extremes(no_above=0.1, keep_n=3500)
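
As a rough illustration of what such filtering does, the sketch below re-implements the idea in plain Python on a toy corpus. The function name `filter_by_document_frequency` and the toy documents are hypothetical; this follows the documented meaning of the parameters and is not gensim's own code. With the commented-out call above, `no_above=0.1` would drop every token that occurs in more than 10% of the courses.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below=2, no_above=0.5, keep_n=None):
    """Return the set of tokens kept after document-frequency filtering.

    docs: list of token lists.
    no_below: minimum number of documents a token must appear in (absolute count).
    no_above: maximum fraction of documents a token may appear in.
    keep_n: if given, keep only the keep_n tokens with the highest document frequency.
    """
    # Count in how many documents each token occurs (not how often overall).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Sort by document frequency (ties broken alphabetically) before truncating.
    kept.sort(key=lambda t: (-doc_freq[t], t))
    if keep_n is not None:
        kept = kept[:keep_n]
    return set(kept)


toy_docs = [
    ["cours", "invest", "cash"],
    ["cours", "design", "art"],
    ["cours", "invest", "design"],
    ["cours", "market"],
]
kept_tokens = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept_tokens))
```

In the toy corpus, "cours" is dropped because it occurs in every document, and the tokens that occur only once are dropped as too rare.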

In [8]:
#convert the documents to a bag of words
#this basically gives, for each course, a list of (token id, count) pairs for the words occurring in it
bow_docs=[dict_docs.doc2bow(d) for d in processed_docs]
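
The bag-of-words idea itself needs no library. The following plain-Python sketch builds the same kind of `(token_id, count)` pairs on a toy corpus. The helper names and the way ids are assigned (order of first appearance) are assumptions for illustration; gensim's `Dictionary` may number tokens differently.

```python
from collections import Counter


def build_vocabulary(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Turn one token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_docs = [["invest", "cash", "invest"], ["design", "invest"]]
token2id = build_vocabulary(toy_docs)
bow = [to_bow(doc, token2id) for doc in toy_docs]
print(bow)
```

Word order is lost in this representation; only which tokens occur and how often is kept.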

In [9]:
#print a sample of the bag of words of one course
bow_doc_sample = bow_docs[100]
for token_id, count in bow_doc_sample:
    print("Word {} (\"{}\") appears {} time.".format(token_id, dict_docs[token_id], count))

Word 47 ("manag") appears 1 time.
Word 115 ("corpor") appears 1 time.
Word 259 ("structur") appears 1 time.
Word 265 ("choos") appears 1 time.
Word 283 ("opportun") appears 1 time.
Word 502 ("cash") appears 2 time.
Word 508 ("financi") appears 1 time.
Word 509 ("flow") appears 1 time.
Word 511 ("invest") appears 4 time.
Word 514 ("measur") appears 2 time.
Word 521 ("valuat") appears 1 time.
Word 536 ("investor") appears 1 time.
Word 545 ("capit") appears 1 time.
Word 720 ("return") appears 1 time.
Word 1080 ("riski") appears 1 time.
Word 1115 ("optimis") appears 1 time.


In [10]:
#now we can input the bag of words to the LDA model and get our topics! 
#num_topics defines the number of topics it outputs
lda_model = LdaMulticore(bow_docs, num_topics=10, id2word=dict_docs, passes=2, workers=2)

In [11]:
#print the words belonging to each topic! 
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.020*"cours" + 0.008*"busi" + 0.008*"analysi" + 0.008*"student" + 0.007*"data" + 0.007*"concret" + 0.007*"materi" + 0.006*"model" + 0.006*"manag" + 0.006*"topic"
Topic: 1 
Words: 0.009*"work" + 0.009*"cours" + 0.009*"art" + 0.007*"design" + 0.007*"theori" + 0.007*"model" + 0.007*"project" + 0.006*"invest" + 0.006*"student" + 0.005*"build"
Topic: 2 
Words: 0.013*"cours" + 0.012*"model" + 0.009*"manag" + 0.009*"market" + 0.009*"busi" + 0.008*"method" + 0.007*"analysi" + 0.007*"develop" + 0.006*"strategi" + 0.006*"design"
Topic: 3 
Words: 0.020*"student" + 0.013*"design" + 0.012*"cours" + 0.010*"process" + 0.009*"work" + 0.009*"research" + 0.009*"topic" + 0.008*"project" + 0.007*"seminar" + 0.007*"manag"
Topic: 4 
Words: 0.027*"cours" + 0.025*"student" + 0.019*"design" + 0.014*"skill" + 0.013*"research" + 0.011*"work" + 0.007*"busi" + 0.007*"technolog" + 0.007*"present" + 0.007*"develop"
Topic: 5 
Words: 0.021*"structur" + 0.017*"method" + 0.013*"element" + 0.011*"finit"
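
For the goal of grouping courses, what matters is the topic distribution per course rather than the word lists. In gensim, `lda_model.get_document_topics(bow)` returns such a distribution as a list of `(topic_id, probability)` pairs. The sketch below uses hard-coded, hypothetical distributions (not model output) and groups courses by their most probable topic. Assigning each course to a single dominant topic is just one simple option, not something this notebook has settled on.

```python
from collections import defaultdict

# Hypothetical per-course topic distributions in (topic_id, probability) format.
# The course names come from the sample data; the numbers are made up.
course_topics = {
    "Introduction to business": [(0, 0.6), (2, 0.3), (4, 0.1)],
    "Human Resource Management": [(2, 0.7), (3, 0.3)],
    "Business and Society": [(0, 0.55), (3, 0.45)],
}


def group_by_dominant_topic(course_topics):
    """Map each topic id to the courses whose most probable topic it is."""
    groups = defaultdict(list)
    for course, distribution in course_topics.items():
        dominant_topic, _ = max(distribution, key=lambda pair: pair[1])
        groups[dominant_topic].append(course)
    return dict(groups)


groups = group_by_dominant_topic(course_topics)
print(groups)
```

A softer alternative would be to keep the full distributions and compare courses by how similar their topic mixtures are.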

### Comments
Some structure is visible, but each topic also contains words that seem quite random. Generic tokens such as "cours" and "student" appear near the top of several topics.

### Questions
- Try different methods
    - Try TF-IDF instead of BOW
- What can the inputs for this method be? Should every input say something about the topic? I would say so.
    - Is this a difference compared to e.g. Word2Vec, which places an entry in a vector space and hence could in theory encode any information we find important (e.g. the teacher)?
        - I guess that would no longer really be topic modelling
- Experiment more with different input data
- Be aware of the differences between NLTK and Gensim
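
On the TF-IDF question: the idea is to re-weight the bag-of-words counts so that tokens occurring in many courses count for less. The sketch below does this in plain Python on `(token_id, count)` pairs, using the basic `count * log(N / document_frequency)` weighting. The function name and toy corpus are hypothetical. gensim offers `gensim.models.TfidfModel` for the same purpose, and its exact weighting and normalisation may differ from this sketch.

```python
import math


def tfidf_weights(bow_docs):
    """Re-weight (token_id, count) pairs as count * log(N / document_frequency)."""
    n_docs = len(bow_docs)
    doc_freq = {}
    for doc in bow_docs:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    return [
        [(token_id, count * math.log(n_docs / doc_freq[token_id])) for token_id, count in doc]
        for doc in bow_docs
    ]


# Toy corpus: token 0 occurs in every document, tokens 1 and 2 in one each.
toy_bow = [[(0, 2), (1, 1)], [(0, 1), (2, 3)]]
weighted = tfidf_weights(toy_bow)
print(weighted)
```

With this weighting, a token that occurs in every document gets weight 0, so very generic tokens would stop dominating the input.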
