# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** *F*

**Names:**

* *Dessimoz Frank*
* *Micheli Vincent*
* *Lefebvre Hippolyte*

---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [171]:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
import pickle
from utils import load_json, load_pkl
import numpy as np

In [16]:
terms = load_pkl('terms.pkl')
words_prep = load_pkl('words_prep.pkl')
courses = load_json('data/courses.txt')

In [28]:
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD

In [5]:
# Unique course IDs in order of first appearance (dict keys drop duplicates)
Ids_courses = list({c['courseId']:c for c in courses})

In [213]:
terms[11569]

'project'

## Exercise 4.8: Topics extraction

We load the term-frequency matrix:

In [10]:
with open('tf.pkl', 'rb') as matrix:
    tf = pickle.load(matrix)

In [172]:
mat=np.matrix(tf)

We collect the columns of the term-frequency matrix as arrays, so that each document is represented by the counts of every term in the corpus vocabulary.

In [179]:
# One column of the term-frequency matrix per course (document)
vect = [mat[:, i] for i in range(len(Ids_courses))]

Now that we have one array per document, we convert the data into the format expected by the LDA method:

In [181]:
rdd=sc.parallelize(vect)

In [182]:
parsedData = rdd.map(lambda line: Vectors.dense([float(x) for x in line]))

In [183]:
# LDA.train expects an RDD of [document id, term-count vector] pairs
corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
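The three preprocessing steps above (one column per document, conversion to float vectors, pairing each vector with an index) can be mimicked without Spark on a toy matrix. This is only a plain-Python sketch: `toy_tf` and `build_corpus` are made-up names, the counts are invented, and `enumerate` stands in for `zipWithIndex`.

```python
# Toy term-frequency matrix: rows are terms, columns are documents.
# These numbers are made up for illustration only.
toy_tf = [
    [2, 0, 1],  # term 0
    [0, 3, 0],  # term 1
    [1, 1, 4],  # term 2
]


def build_corpus(tf_rows):
    """Return [doc_id, term_counts] pairs, one per column of tf_rows."""
    columns = zip(*tf_rows)  # transpose: one tuple of counts per document
    return [[doc_id, [float(c) for c in col]]
            for doc_id, col in enumerate(columns)]


toy_corpus = build_corpus(toy_tf)
for doc_id, counts in toy_corpus:
    print(doc_id, counts)
```

Each printed pair has the same shape as one element of `corpus`: a document index followed by that document's term counts.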

And now we can train the model:

In [234]:
ldaModel = LDA.train(corpus, k=10)

In [188]:
wordNumbers = 10  # number of words per topic
topicIndices = sc.parallelize(ldaModel.describeTopics(maxTermsPerTopic = wordNumbers))

Let's print the topics and their top words obtained via LDA:

In [229]:
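# Describe the topics once, then map each term index back to its (stemmed) term
topics = ldaModel.describeTopics(maxTermsPerTopic=wordNumbers)
for i, (term_indices, _weights) in enumerate(topics):
    print('Topic ', i+1,':')
    for x in term_indices:
        print(terms[x])
    print()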

Topic  1 :
student
report
method
problem
model
optim
skill
learn
evalu
risk

Topic  2 :
project
student
plan
time
learn
assess
evalu
method
system
skill

Topic  3 :
method
optic
energi
model
equat
basic
physic
numer
analysi
learn

Topic  4 :
design
data
learn
model
circuit
digit
student
method
system
teach

Topic  5 :
system
learn
method
comput
student
program
process
cours
exercis
control

Topic  6 :
student
work
studi
learn
case
present
develop
method
architectur
content

Topic  7 :
model
method
learn
network
linear
control
analysi
dynam
content
system

Topic  8 :
chemic
chemistri
biolog
method
reaction
molecular
protein
cell
student
mass

Topic  9 :
materi
electron
mechan
properti
applic
devic
magnet
physic
method
structur

Topic  10 :
student
research
week
present
field
learn
scientif
discuss
structur
epfl
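
The listing above shows only the terms. `describeTopics` returns, for each topic, a pair of matching arrays (term indices and term weights), and the weights can help when judging how strongly a word characterises a topic. A minimal sketch of pairing the two, using invented stand-in data (`toy_terms`, `toy_topics`) and a hypothetical helper `format_topics` rather than the trained model:

```python
# Hypothetical stand-ins: a tiny vocabulary and output shaped like
# describeTopics(), i.e. one (term indices, term weights) pair per topic.
# All values are invented for illustration.
toy_terms = ['cell', 'circuit', 'design', 'protein', 'student']
toy_topics = [
    ([3, 0, 4], [0.40, 0.35, 0.05]),
    ([1, 2, 4], [0.30, 0.25, 0.10]),
]


def format_topics(topics, vocabulary):
    """Map each topic's term indices to (term, weight) pairs."""
    return [
        [(vocabulary[idx], weight) for idx, weight in zip(indices, weights)]
        for indices, weights in topics
    ]


for number, topic in enumerate(format_topics(toy_topics, toy_terms), start=1):
    line = ', '.join(f'{term} ({weight:.2f})' for term, weight in topic)
    print('Topic', number, ':', line)
```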



We can label the topics as follows:  

    1) Risk modelling and visualization  
    2) Problem solving skills  
    3) Optics and Energy Physics  
    4) Digital circuits design and learning  
    5) MAN at work  
    6) Architecture  
    7) Networks Analysis  
    8) Molecular and Cellular Biology  
    9) Electrical engineering  
    10) Student thesis presentations  

The word distribution of each topic is still too generic: stems such as *student*, *method* and *learn* appear in most topics. The model suffers from under-parametrization.
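
One way to make this observation concrete is to count in how many topics each stem appears among the top 10 words. The sketch below uses the lists printed above, copied by hand; `top_words` and `topic_counts` are names introduced here for illustration.

```python
from collections import Counter

# Top-10 stems per topic, copied from the output printed above.
top_words = [
    "student report method problem model optim skill learn evalu risk",
    "project student plan time learn assess evalu method system skill",
    "method optic energi model equat basic physic numer analysi learn",
    "design data learn model circuit digit student method system teach",
    "system learn method comput student program process cours exercis control",
    "student work studi learn case present develop method architectur content",
    "model method learn network linear control analysi dynam content system",
    "chemic chemistri biolog method reaction molecular protein cell student mass",
    "materi electron mechan properti applic devic magnet physic method structur",
    "student research week present field learn scientif discuss structur epfl",
]
top_words = [line.split() for line in top_words]

# Number of topics in which each stem appears among the top 10.
topic_counts = Counter(word for topic in top_words for word in set(topic))

for word, count in topic_counts.most_common(3):
    print(f"{word}: {count} of {len(top_words)} topics")
```

On these lists, `method` appears in 9, `learn` in 8 and `student` in 7 of the 10 topics, which is consistent with the topics being hard to tell apart.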

## Exercise 4.9: Dirichlet hyperparameters

Each of the two Dirichlet distributions in the LDA model has one hyperparameter:
- α for the distribution of topics in documents
- β for the distribution of words in topics

Their role is to encode any prior belief we have about the hidden topics for the corpus at hand.
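
As a reminder, the standard LDA generative model places the two hyperparameters as follows. The notation is introduced here and is not taken from the lab handout: $\theta_d$ is the topic distribution of document $d$, $\phi_k$ the word distribution of topic $k$, $z_{d,n}$ the topic assigned to the $n$-th word of document $d$, and $w_{d,n}$ the word itself. Symmetric priors are assumed, which matches passing a single scalar for each concentration.

$$
\begin{aligned}
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
$$

In the Spark API used here, α corresponds to `docConcentration` and β to `topicConcentration`.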

1) We fix k = 10 and β = 1.01, and vary α.

In [232]:
a=[0.01,0.15,0.5,1,1.2,50]

#### We then loop over these values of α using exactly the same method as before (for readability, we do not output all the results here). This time we pass the parameters as follows:

In [231]:
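# The default EM optimizer only accepts concentrations > 1, so values below 1 need the online optimizer
ldaModel2 = LDA.train(corpus, k=10, topicConcentration=1.01, docConcentration=a[0], optimizer='online')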
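
The loop over the α values could look roughly like the sketch below. To keep it runnable without a Spark context, training is injected through `train_fn`, a hypothetical callable that would wrap `LDA.train` with the corpus already bound. `sweep_doc_concentration` and `fake_train` are illustrative names, not part of the lab code.

```python
def sweep_doc_concentration(train_fn, alphas, k=10, topic_concentration=1.01):
    """Train one model per alpha value and return them keyed by alpha.

    train_fn is a hypothetical callable wrapping LDA.train with the corpus
    already bound, for example:
        lambda **kw: LDA.train(corpus, optimizer='online', **kw)
    """
    models = {}
    for alpha in alphas:
        models[alpha] = train_fn(
            k=k,
            docConcentration=alpha,
            topicConcentration=topic_concentration,
        )
    return models


# Stub used only to show the call pattern; it echoes its parameters.
def fake_train(**params):
    return params


results = sweep_doc_concentration(fake_train, [0.01, 0.15, 0.5, 1, 1.2, 50])
print(results[0.15])
```

Keeping the models in a dictionary keyed by α makes it easy to print the topics of any single run afterwards with the same loop as in Exercise 4.8.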

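Since β = 1.01, the prior over each topic's word distribution is close to flat, so no particular word distribution is favoured.  
- When α = 1, the prior over topic mixtures is uniform: every mixture of topics is equally likely a priori.
- When α ≫ 1, each document's topic distribution converges towards an equiprobable mix of all topics.
- When α ≪ 1, each document's topic distribution concentrates on a single dominant topic.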
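
The three regimes above can be illustrated by sampling topic mixtures directly from a symmetric Dirichlet prior. This is a rough standard-library simulation, independent of the Spark model. It uses the usual construction of a Dirichlet sample as normalised Gamma draws, and the function names, the sample size and the seed are arbitrary choices made here.

```python
import random


def sample_symmetric_dirichlet(concentration, dim, rng):
    """Draw one probability vector from a symmetric Dirichlet.

    Standard construction: normalise independent Gamma(concentration, 1) draws.
    """
    while True:
        draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
        total = sum(draws)
        if total > 0.0:  # redraw in the rare case every draw underflows to 0
            return [d / total for d in draws]


def mean_largest_share(concentration, dim=10, n_samples=2000, seed=0):
    """Average weight of the dominant topic over simulated documents."""
    rng = random.Random(seed)
    samples = (sample_symmetric_dirichlet(concentration, dim, rng)
               for _ in range(n_samples))
    return sum(max(s) for s in samples) / n_samples


for alpha in [0.01, 1, 50]:
    share = mean_largest_share(alpha)
    print(f"alpha={alpha}: mean largest topic share = {share:.2f}")
```

With 10 topics, a share close to 1 means a single topic dominates each document, while a share close to 0.1 means all topics are used about equally.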

2) We fix k = 10 and α = 6, and vary β.

In [None]:
b=[0.01,0.15,0.5,1,1.2,50]

In [None]:
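# The default EM optimizer only accepts concentrations > 1, so values below 1 need the online optimizer
ldaModel3 = LDA.train(corpus, k=10, docConcentration=6, topicConcentration=b[0], optimizer='online')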

#### Again the method is the same, so we only provide the interpretation here:

Since α = 6, each document's topic distribution is pushed towards an equiprobable mix of topics.
- When β = 1, the prior over word distributions is uniform: every distribution of words within a topic is equally likely a priori.
- When β ≫ 1, each topic's word distribution converges towards an equiprobable distribution over the vocabulary.
- When β ≪ 1, each topic's word distribution concentrates on a few dominant words.
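
For β, a convenient way to see the effect is to count how many of the most probable words are needed to cover 90% of a topic's probability mass. The sketch below draws one word distribution per β from a symmetric Dirichlet prior, using only the standard library. The vocabulary size of 1000, the 90% threshold and all names are assumptions made for the illustration, not values from the lab.

```python
import random


def sample_topic(beta, vocab_size, rng):
    """Draw one word distribution from a symmetric Dirichlet(beta) prior."""
    while True:
        draws = [rng.gammavariate(beta, 1.0) for _ in range(vocab_size)]
        total = sum(draws)
        if total > 0.0:  # redraw if every draw underflowed to 0
            return [d / total for d in draws]


def words_to_cover(probabilities, mass=0.9):
    """Smallest number of most-probable words whose total reaches `mass`."""
    covered = 0.0
    for count, p in enumerate(sorted(probabilities, reverse=True), start=1):
        covered += p
        if covered >= mass:
            return count
    return len(probabilities)


rng = random.Random(0)
for beta in [0.01, 1, 50]:
    topic = sample_topic(beta, vocab_size=1000, rng=rng)
    print(f"beta={beta}: {words_to_cover(topic)} words cover 90% of the mass")
```

A small count means the topic is described by a handful of words. A count close to the vocabulary size means the topic spreads its mass over almost every word.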

## Exercise 4.10: EPFL's taught subjects

Let's find the combination of k, α and β that gives the most interpretable topics, i.e. the ones that best describe the subjects covered by the courses taught at EPFL.

We chose these values using the following assumptions. A broad range of subjects is taught at EPFL, which motivates a large number of topics. Engineering courses should be rather specific about one particular science-related subject, so we should observe one or two dominant topics per course description. Setting α < 1 to favour such distributions therefore makes sense. Moreover, we should observe a large variety of topic-related words in each topic. Setting β > 1 to favour such distributions therefore makes sense.

In [None]:
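# docConcentration < 1 is not accepted by the default EM optimizer, hence optimizer='online'
ldaModel4 = LDA.train(corpus, k=30, docConcentration=0.15, topicConcentration=5, optimizer='online')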

Topics output:

In [None]:
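# Print the top 10 terms of each of the k = 30 topics of the tuned model
topics = ldaModel4.describeTopics(10)
for i, (term_indices, _weights) in enumerate(topics):
    print('Topic ', i+1,':')
    for x in term_indices:
        print(terms[x])
    print()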

Parameter values:
- k = 30
- α = 0.15
- β = 5

## Exercise 4.11: Wikipedia structure

Since the Wikipedia for Schools corpus is large and covers a wide range of subjects, the number of topics k should be quite large. As argued previously, each document should concentrate on a few topics, so setting α < 1 to favour distributions with a few dominant topics makes sense. Moreover, each topic should contain a large variety of topic-related words, so setting β > 1 to favour close-to-equiprobable word distributions makes sense.

Values of parameters chosen a priori:
- k = 100
- α = 0.01
- β = 10