# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** *F*

**Names:**

* *Dessimoz Frank*
* *Micheli Vincent*
* *Lefebvre Hippolyte*

---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [171]:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
import pickle
from utils import load_json, load_pkl
import numpy as np

In [16]:
terms = load_pkl('terms.pkl')
words_prep=load_pkl('words_prep.pkl')
courses= load_json('data/courses.txt')

In [28]:
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD

In [5]:
Ids_courses = list({c['courseId']:c for c in courses})

## Exercise 4.8: Topics extraction

We load the term-frequency matrix:

In [10]:
with open('tf.pkl', 'rb') as matrix:
    tf = pickle.load(matrix)

In [172]:
mat=np.matrix(tf)

We collect the columns of the term-frequency matrix as arrays, so that each array holds the counts of every word in the corpus for one document.

In [179]:
# One column of the term-frequency matrix per course (document)
vect = [mat[:, i] for i in range(len(Ids_courses))]

Now that we have one array per document, we convert the data into the format expected by the LDA method:

In [181]:
rdd=sc.parallelize(vect)

In [182]:
parsedData = rdd.map(lambda line: Vectors.dense([float(x) for x in line]))

In [183]:
corpus= parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
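As a rough illustration of the shape `LDA.train` receives, the steps above can be mimicked without Spark on a tiny made-up matrix. This is only a sketch: `tf_toy`, `doc_vectors` and `toy_corpus` are hypothetical names, the counts are invented, and plain Python lists stand in for the RDD and for `Vectors.dense`.

```python
# Toy term-frequency matrix: rows are terms, columns are documents.
# The values are invented for illustration only.
tf_toy = [
    [2, 0, 1],  # counts of term 0 in documents 0, 1, 2
    [0, 3, 0],  # counts of term 1
    [1, 1, 4],  # counts of term 2
    [0, 0, 2],  # counts of term 3
]

# Transposing with zip(*...) yields one tuple per column, i.e. one
# vector of term counts per document, converted to floats.
doc_vectors = [[float(x) for x in column] for column in zip(*tf_toy)]

# Pair each vector with its position, document id first, mirroring
# zipWithIndex() followed by the swap done in the map() call.
toy_corpus = [[doc_id, vector] for doc_id, vector in enumerate(doc_vectors)]

for doc_id, vector in toy_corpus:
    print(doc_id, vector)
```

Each entry is a `[document id, term-count vector]` pair, and every vector has one component per vocabulary term.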

And now we can train the model:

In [184]:
ldaModel = LDA.train(corpus, k=10)

In [188]:
wordNumbers = 10  # number of words per topic
topicIndices = sc.parallelize(ldaModel.describeTopics(maxTermsPerTopic = wordNumbers))

In [190]:
print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize()) + " words):")
# topicsMatrix() is indexed as topics[term][topic]
topics = ldaModel.topicsMatrix()
for topic in range(10):
    print("Topic " + str(topic) + ":")
    # Weights of the first 10 vocabulary terms for this topic
    for word in range(0, 10):
        print(" " + str(topics[word][topic]))

Learned topics (as distributions over vocab of 11819 words):
Topic 0:
 0.0308961754174
 0.0188959436342
 21.7786494634
 0.0223688164579
 137.972770856
 0.467983226446
 0.00320211011815
 0.0394497374966
 0.0246049873522
 5.81887873418
Topic 1:
 0.0328086922443
 0.0196806902237
 0.946783511093
 0.727501463036
 63.0680442743
 4.35235894697
 0.00867648652684
 0.035969705229
 0.00659869555561
 11.2491353688
Topic 2:
 0.418444976835
 0.0246850881309
 17.3031151231
 0.0304643272094
 141.556224903
 0.104675284554
 0.00235571182403
 1.75115495308
 0.0289719803113
 9.41123887162
Topic 3:
 0.0343855635374
 0.0197289215763
 1.7019673608
 0.0388422643016
 63.1923154455
 1.29082309444
 0.0252250219016
 0.0360986521766
 0.00853787897972
 48.9773778965
Topic 4:
 0.0540493392075
 0.0235345472289
 4.52448248573
 0.0335918266551
 55.2483221461
 1.65340913713
 0.00351339948945
 0.110177017849
 0.0155464775089
 5.23084168431
Topic 5:
 0.0734286708979
 0.0306413425516
 4.69292901428
 0.0232280226144
 17.909
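The listing above shows the weights of the first ten vocabulary indices for each topic, not the ten heaviest terms. To read a topic as words, one option is to rank each column of the topic matrix by weight and look up the corresponding terms. The sketch below does this in plain Python on invented data: `top_terms`, `toy_vocab` and `toy_topics` are hypothetical, and it assumes a vocabulary list whose order matches the rows of the matrix (whether `terms` from `terms.pkl` is ordered that way is not confirmed here).

```python
def top_terms(topics_matrix, vocabulary, n_terms):
    """Return, for each topic, the n_terms highest-weighted (term, weight) pairs.

    topics_matrix is indexed as topics_matrix[term][topic], the same
    layout used when printing ldaModel.topicsMatrix().
    """
    n_topics = len(topics_matrix[0])
    result = []
    for topic in range(n_topics):
        # Column of weights for this topic, one entry per vocabulary term
        weights = [row[topic] for row in topics_matrix]
        # Term indices sorted by decreasing weight
        ranked = sorted(range(len(weights)), key=lambda t: weights[t], reverse=True)
        result.append([(vocabulary[t], weights[t]) for t in ranked[:n_terms]])
    return result


# Invented vocabulary and weights, for illustration only
toy_vocab = ["network", "protein", "matrix", "market"]
toy_topics = [
    [5.0, 0.1],
    [0.2, 7.5],
    [3.0, 0.3],
    [0.1, 2.0],
]

for topic_id, terms_for_topic in enumerate(top_terms(toy_topics, toy_vocab, 2)):
    print("Topic", topic_id, [term for term, _ in terms_for_topic])
```

With the real model, `ldaModel.describeTopics(maxTermsPerTopic=wordNumbers)` already returns term indices sorted by decreasing weight, so the indices it yields could be mapped to terms directly instead of sorting by hand.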

## Exercise 4.9: Dirichlet hyperparameters

## Exercise 4.10: EPFL's taught subjects

## Exercise 4.11: Wikipedia structure