# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** W

**Names:**

* Olivier Cloux
* Thibault Urien
* Saskia Reiss

---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
# Given imports
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors  # Spark's sparse vectors
from scipy.sparse import find

# Import pickle to open processed courses
import pickle as pk

from utils import load_pkl, listPrettyPrint
from lab04_helper import *

## Exercise 4.8: Topics extraction

In [2]:
# Let's open the pickle to load it.
processedCourses = load_pkl("cidWithBag")

# We also need the list of words
uniqueWords = load_pkl("indexToWord")

# Get the course indexes for the matrix
courseIndices = load_pkl("indexToCourse")

# Get the TF-IDF matrix
TF_course_matrix = load_sparse_csr("TFIDF.npz")

In [3]:
# Transform each course's TF-IDF column into a Spark sparse vector
# (uses the snippets "Get value with index from RDD" and "Spark's sparse vectors")

# Build a list that we will fill with our sparse vectors
v_dim = len(uniqueWords)
vector_list = []

# Create the sparse vectors using the matrix.
for i in range(len(courseIndices)):
    # find() returns (row indices, column indices, values) of the non-zero entries
    indices, _, values = find(TF_course_matrix.getcol(i))
    v = Vectors.sparse(v_dim, indices, values)
    vector_list.append((i, v))
    
# In the RDD we need a tuple with (index of document, corresponding SparseVector), then map(list)
#docs = courseIndices.zipWithIndex(vector_list)
#print(docs)

# Create the RDD (corpus) from the list
corpus = sc.parallelize(vector_list)
#print(corpus.take(2))
corpus = corpus.map(list)

# Cluster the documents into ten topics using LDA
ldaModel = LDA.train(corpus, k=10)
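
The column-to-sparse-vector step above can be checked without a Spark context. Below is a minimal sketch, assuming a made-up 4-term by 3-document matrix (`toy`) and a hypothetical helper `column_to_sparse_parts`; neither is part of the lab's data or modules. It uses slicing (`matrix[:, col]`) in place of `getcol` and sorts the indices explicitly, since `Vectors.sparse` expects them in increasing order.

```python
import numpy as np
from scipy.sparse import csr_matrix, find

# Toy 4-term x 3-document matrix (made-up values, not the lab's TF-IDF data)
toy = csr_matrix(np.array([
    [0.0, 1.5, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 0.5, 0.0],
    [1.0, 0.0, 3.0],
]))

def column_to_sparse_parts(matrix, col):
    """Return (term indices, values) of the non-zero entries of one column."""
    rows, _, values = find(matrix[:, col])
    order = np.argsort(rows)  # keep the indices in increasing order
    return rows[order].tolist(), values[order].tolist()

# One (indices, values) pair per document, ready to be passed to Vectors.sparse
parts = [column_to_sparse_parts(toy, c) for c in range(toy.shape[1])]
for doc_id, (indices, values) in enumerate(parts):
    print(doc_id, indices, values)
```

Each pair could then be passed as `Vectors.sparse(toy.shape[0], indices, values)`, mirroring the loop in the cell above.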

In [4]:
def printTopics(model, maxTerms):
    # For each topic, get the maxTerms highest-weighted terms as (term indices, term weights)
    topic_indices = model.describeTopics(maxTermsPerTopic=maxTerms)

    # Map the term indices (first element of each tuple) back to words and print them with their weights
    for topic, (term_ids, weights) in enumerate(topic_indices, start=1):
        print("Topic #", topic)
        listPrettyPrint(["%s - %.7f" % (uniqueWords[t], w) for t, w in zip(term_ids, weights)], 3)
        print('\n')

In [5]:
printTopics(ldaModel, 10)

Topic # 1
algebra - 0.0056484           computer - 0.0053268          algorithm - 0.0050481
theory - 0.0048821            network - 0.0047307           programming - 0.0043964
geometry - 0.0043694          probability - 0.0040490       exercise - 0.0037610
flow - 0.0037208	

Topic # 2
project - 0.0084657           semester - 0.0050236          urban - 0.0047000
technology - 0.0045309        science - 0.0044223           research - 0.0042855
reactor - 0.0039378           energy - 0.0037923            subject - 0.0035235
development - 0.0032324	

Topic # 3
design - 0.0050698            architecture - 0.0050535      device - 0.0045635
mechanical - 0.0040559        electronics - 0.0039531       programming - 0.0038833
material - 0.0036846          mem - 0.0035759               cryptography - 0.0034412
system - 0.0032525	

Topic # 4
communication - 0.0083996     propagation - 0.0050013       solar - 0.0047762
optical - 0.0044649           electromagnetic - 0.0041844   field - 0.0041525
ener

* **Topic 1** : Materials physics
* **Topic 2** : IC students mental health
* **Topic 3** : Physics
* **Topic 4** : (Risk) Management
* **Topic 5** : Chemistry
* **Topic 6** : Urban architecture
* **Topic 7** : Financial Engineering
* **Topic 8** : Circuits/Electricity (including low level as CMO)
* **Topic 9** : Biology
* **Topic 10** : Computing

TODO : Compare to LSI

## Exercise 4.9: Dirichlet hyperparameters

#### Varying the prior on topic distribution α

In [22]:
topic_indices_list = []
for i in [2, 52, 102]:
    model = (LDA.train(corpus, k=10, seed=2, docConcentration=float(i), topicConcentration=1.01))
    topic_indices_list.append(model.describeTopics(maxTermsPerTopic=10))

In [23]:
for topic in range(1, 11):  # k=10 topics, numbered 1 to 10
    print("\n\t\tTOPIC %d beta=1.01\n" %(topic))
    alpha = 2  # first alpha value used for training; the next ones are 52 and 102
    for i in range(0, 3):
        print("\n\t alpha=%d \n" %(alpha))
        j = topic_indices_list[i][topic-1]
        listPrettyPrint([str("%s - %.7f" % (uniqueWords[j[0][k]], j[1][k]))  for k in range(10)], 3)
        alpha += 50
        print('\n')


		TOPIC 1 beta=1.01


	 alpha=2 

risk - 0.0063122              financial - 0.0059356         innovation - 0.0057969
finance - 0.0047209           corporate - 0.0042858         market - 0.0038213
derivative - 0.0033596        management - 0.0031839        valuation - 0.0031354
model - 0.0030952	


	 alpha=52 

system - 0.0023974            project - 0.0020351           model - 0.0020007
design - 0.0019625            material - 0.0018807          analysis - 0.0017701
problem - 0.0017134           data - 0.0016051              energy - 0.0015961
process - 0.0015608	


	 alpha=102 

system - 0.0024717            project - 0.0020478           model - 0.0019989
material - 0.0019476          design - 0.0019155            analysis - 0.0017798
problem - 0.0017151           energy - 0.0017092            theory - 0.0016331
data - 0.0016189	


		TOPIC 2 beta=1.01


	 alpha=2 

optical - 0.0082619           electron - 0.0069237          material - 0.0065333
optic - 0.0053968             diffracti

TODO: Explain what is happening here.
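
As a starting point for that explanation: in LDA, each document's topic mixture is drawn from a Dirichlet distribution with concentration α. The sketch below samples symmetric Dirichlet vectors with the three α values used above and reports how much weight the dominant topic gets on average. It is only an illustration of the prior, not of the trained models; the helper names, the sample count, and the seed are assumptions made for this example.

```python
import random

def sample_symmetric_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha, ..., alpha) of dimension k."""
    # Normalised independent Gamma(alpha, 1) draws follow a Dirichlet distribution
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=10, n_samples=2000, seed=0):
    """Average weight of the dominant topic over many simulated documents."""
    rng = random.Random(seed)
    shares = (max(sample_symmetric_dirichlet(alpha, k, rng)) for _ in range(n_samples))
    return sum(shares) / n_samples

# Same alpha values as in the experiment above, with k=10 topics
for alpha in (2.0, 52.0, 102.0):
    print("alpha=%5.1f  mean share of dominant topic: %.3f"
          % (alpha, mean_largest_share(alpha)))
```

The dominant share shrinks toward 1/k as α grows, meaning every simulated document mixes all topics almost evenly. That would be consistent with the nearly identical, generic word lists printed for α=52 and α=102, though this should be confirmed on the full output.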

#### Varying the prior on word distribution per topic β

In [8]:
topic_indices_list = []
array = [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]

for i in array:
    model = (LDA.train(corpus, k=10, seed=2, docConcentration=6.0, topicConcentration=float(i)))
    topic_indices_list.append(model.describeTopics(maxTermsPerTopic=10))

In [9]:
for topic in range(1, 11):  # k=10 topics, numbered 1 to 10
    print("\n\t\tTOPIC %d alpha=6\n" %(topic))
    for i in range(len(array)):  # one trained model per beta value
        print("\n\t beta=%f \n" %(array[i]))
        j = topic_indices_list[i][topic-1]
        listPrettyPrint([str("%s - %.7f" % (uniqueWords[j[0][k]], j[1][k]))  for k in range(10)], 3)
        print('\n')


		TOPIC 1 alpha=6


	 beta=1.100000 

innovation - 0.0062312        financial - 0.0052694         case - 0.0038888
individual - 0.0036166        market - 0.0035763            management - 0.0034928
finance - 0.0034421           team - 0.0032250              option - 0.0031989
project - 0.0030557	


	 beta=1.200000 

innovation - 0.0058226        financial - 0.0044079         business - 0.0043743
case - 0.0037000              market - 0.0036754            management - 0.0034166
project - 0.0032938           team - 0.0031293              individual - 0.0031180
technology - 0.0029744	


	 beta=1.300000 

innovation - 0.0050797        financial - 0.0037151         business - 0.0036587
case - 0.0032494              market - 0.0031868            project - 0.0031003
management - 0.0029692        design - 0.0027453            company - 0.0026364
technology - 0.0026362	


	 beta=1.400000 

innovation - 0.0036974        financial - 0.0029489         project - 0.0027504
model - 0.0027011        

TODO: Explain what is happening here.
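
One trend is already visible in the printed weights: the weight of the top term of topic 1 ("innovation") drops as β increases. The sketch below checks this on the four β values visible in the output above; the weights are copied from that output, and the helper `is_strictly_decreasing` is a hypothetical name introduced for this example.

```python
# Weight of the top term ("innovation") of topic 1, copied from the output above
top_weight_by_beta = {
    1.1: 0.0062312,
    1.2: 0.0058226,
    1.3: 0.0050797,
    1.4: 0.0036974,
}

def is_strictly_decreasing(values):
    """True if every value is strictly smaller than the one before it."""
    return all(a > b for a, b in zip(values, values[1:]))

betas = sorted(top_weight_by_beta)
weights = [top_weight_by_beta[b] for b in betas]

for b, w in zip(betas, weights):
    print("beta=%.1f  top-term weight=%.7f" % (b, w))
print("strictly decreasing:", is_strictly_decreasing(weights))
```

A shrinking top-term weight would be consistent with a larger β smoothing each topic's word distribution, so that topics become flatter and less distinctive. This is a hedged reading based on one topic and four values only.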

## Exercise 4.10: EPFL's taught subjects

## Exercise 4.11: Wikipedia structure