# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** W

**Names:**

* Olivier Cloux
* Thibault Urien
* Saskia Reiss

---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [6]:
# Given imports
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors  # Spark's sparse vectors
from scipy.sparse import find

# Import pickle to open processed courses
import pickle as pk

from utils import load_pkl, listPrettyPrint
from lab04_helper import *

## Exercise 4.8: Topics extraction

In [7]:
# Let's open the pickle to load it.
processedCourses = load_pkl("cidWithBag")

# We also need the list of words
uniqueWords = load_pkl("indexToWord")

# Get the course indexes for the matrix
courseIndices = load_pkl("indexToCourse")

# Get the TF-IDF matrix
TF_course_matrix = load_sparse_csr("TFIDF.npz")

In [8]:
# Transform the list of words into Spark vectors using TF-IDF
# (uses the snippets "Get value with index from RDD" and "Spark's sparse vectors")

# Build one sparse vector per course before turning the list into an RDD
v_dim = len(uniqueWords)
vector_list = []

# Column i of the TF-IDF matrix holds the word weights of course i
for i in range(len(courseIndices)):
    # find() returns (row indices, column indices, values) of the nonzero entries
    indices, _, values = find(TF_course_matrix.getcol(i))
    v = Vectors.sparse(v_dim, indices, values)
    vector_list.append((i, v))

# LDA expects an RDD of [document index, SparseVector] pairs
# (sc is assumed to be the SparkContext provided by the notebook environment)
corpus = sc.parallelize(vector_list).map(list)

# Cluster the documents into ten topics using LDA
ldaModel = LDA.train(corpus, k=10)
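
The loop above extracts every course with `getcol(i)`, which creates a new sparse matrix for each column. A possible alternative, sketched below in plain Python, reads each column directly from the arrays of the compressed sparse column (CSC) layout. It assumes the matrix has first been converted with `TF_course_matrix.tocsc()` (followed by `sort_indices()` if the row indices are not already sorted), so that its `indptr`, `indices` and `data` attributes can be passed in. The helper name `csc_columns` and the toy arrays are hypothetical and not part of the lab material.

```python
def csc_columns(indptr, indices, data):
    """Yield (column, row_indices, values) for each column of a CSC matrix.

    In CSC layout, the nonzero entries of column j are stored in
    indices[indptr[j]:indptr[j + 1]] (row numbers) and
    data[indptr[j]:indptr[j + 1]] (values).
    """
    for col in range(len(indptr) - 1):
        start, end = indptr[col], indptr[col + 1]
        yield col, list(indices[start:end]), list(data[start:end])


# Toy 4-word x 3-course matrix (hypothetical values, not lab data):
# course 0 uses words 0 and 2, course 1 uses word 1, course 2 uses words 0 and 3
toy_indptr = [0, 2, 3, 5]
toy_indices = [0, 2, 1, 0, 3]
toy_data = [0.5, 1.5, 2.0, 0.25, 0.75]

columns = list(csc_columns(toy_indptr, toy_indices, toy_data))
for col, rows, values in columns:
    # With Spark available, Vectors.sparse(v_dim, rows, values) would be built here
    print(col, rows, values)
```

Each yielded triple has the same shape as the `(i, indices, values)` information gathered in the loop above, so the rest of the cell would not need to change.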

In [9]:
def printTopics(model, maxTerms):
    # For each topic, get the maxTerms highest-weighted terms as (term indices, weights)
    topic_indices = model.describeTopics(maxTermsPerTopic=maxTerms)

    # Map the term indices (first element of each tuple) back to words and print them
    for topic, (terms, weights) in enumerate(topic_indices, start=1):
        print("Topic #", topic)
        listPrettyPrint(["%s - %.7f" % (uniqueWords[t], w) for t, w in zip(terms, weights)], 3)
        print('\n')

In [10]:
printTopics(ldaModel, 10)

Topic # 1
optic - 0.0100972             circuit - 0.0076716           analog - 0.0056814
nois - 0.0053117              machin - 0.0051002            propag - 0.0049420
ic - 0.0049364                polici - 0.0048633            design - 0.0041062
day - 0.0040150	

Topic # 2
measur - 0.0058739            structur - 0.0049982          print - 0.0049566
kinet - 0.0045471             scale - 0.0042134             microwav - 0.0040918
experiment - 0.0040402        strain - 0.0040318            aim - 0.0040208
provid - 0.0038820	

Topic # 3
semiconductor - 0.0078735     quantum - 0.0069046           physic - 0.0065440
devic - 0.0054316             water - 0.0049186             descript - 0.0048775
electron - 0.0044809          atmospher - 0.0041100         pollut - 0.0040514
fundament - 0.0039824	

Topic # 4
signal - 0.0079693            control - 0.0059097           brain - 0.0059001
system - 0.0053316            data - 0.0051725              protein - 0.0049450
neurosci - 0.0045479        

* **Topic 1** : Materials physics
* **Topic 2** : IC students mental health
* **Topic 3** : Physics
* **Topic 4** : (Risk) Management
* **Topic 5** : Chemistry
* **Topic 6** : Urban architecture
* **Topic 7** : Financial Engineering
* **Topic 8** : Circuits/Electricity (including low-level topics such as CMOS)
* **Topic 9** : Biology
* **Topic 10** : Computing

TODO : Compare to LSI

## Exercise 4.9: Dirichlet hyperparameters
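
As a reminder of what the two hyperparameters control, the standard LDA generative model can be written as follows, with $K$ topics, $D$ documents, topic proportions $\theta_d$ for document $d$, and word distribution $\phi_k$ for topic $k$. The symbols $\theta$, $\phi$, $z$ and $w$ are notation introduced here for illustration.

$$
\begin{aligned}
\theta_d &\sim \mathrm{Dirichlet}(\alpha), \quad d = 1, \dots, D \\
\phi_k &\sim \mathrm{Dirichlet}(\beta), \quad k = 1, \dots, K \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
$$

In the cells below, $\alpha$ is passed as `docConcentration` and $\beta$ as `topicConcentration`. As far as we understand the Spark MLlib documentation, the default EM optimizer expects both values to be strictly greater than 1, which is consistent with the ranges used here.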

#### Varying the prior on topic distribution α

In [11]:
# Train one model per value of alpha (2 to 10), keeping beta and the seed fixed
topic_indices_list = []
for i in range(2, 11):
    model = LDA.train(corpus, k=10, seed=2, docConcentration=float(i), topicConcentration=1.01)
    topic_indices_list.append(model.describeTopics(maxTermsPerTopic=10))

In [12]:
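# For each of the ten topics, print its top terms under every value of alpha
for topic in range(1, 11):
    print("\n\t\tTOPIC %d beta=1.01\n" % topic)
    # The models were trained for alpha = 2, 3, ..., in this order
    for alpha, topics in enumerate(topic_indices_list, start=2):
        print("\n\t alpha=%d \n" % alpha)
        terms, weights = topics[topic - 1]
        listPrettyPrint(["%s - %.7f" % (uniqueWords[t], w) for t, w in zip(terms, weights)], 3)
        print('\n')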


		TOPIC 1 beta=1.01


	 alpha=2 

financi - 0.0084479           risk - 0.0080918              price - 0.0078663
financ - 0.0066596            market - 0.0055575            stochast - 0.0054865
model - 0.0050486             research - 0.0048772          microscopi - 0.0047907
asset - 0.0047680	


	 alpha=3 

risk - 0.0088280              financi - 0.0085006           price - 0.0082244
financ - 0.0067591            market - 0.0061367            stochast - 0.0059766
research - 0.0055960          model - 0.0055153             microscopi - 0.0052601
asset - 0.0048231	


	 alpha=4 

risk - 0.0092409              financi - 0.0085892           price - 0.0084167
financ - 0.0068402            market - 0.0066480            stochast - 0.0063881
research - 0.0061075          model - 0.0058196             microscopi - 0.0053847
asset - 0.0048779	


	 alpha=5 

risk - 0.0094215              financi - 0.0086261           price - 0.0084877
market - 0.0070217            financ - 0.0068725            st

TODO: Explain what is happening here.
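
One way to start on this explanation is to quantify how much the top terms of a topic change between values of α, instead of comparing the printed lists by eye. The sketch below computes the Jaccard similarity between the top-term set of a topic under the first setting and under each later setting. It assumes that each topic is given as a `(term indices, weights)` pair, as returned by `describeTopics`, and that a topic keeps the same position across models trained with the same seed, which the output above suggests but does not guarantee. The function names and the toy data are hypothetical.

```python
def top_terms(topic, vocabulary):
    """Return the set of words of one topic given as (term_indices, weights)."""
    term_indices, _weights = topic
    return {vocabulary[i] for i in term_indices}


def jaccard(a, b):
    """Jaccard similarity between two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def overlap_with_first(topics_per_setting, vocabulary):
    """Compare one topic under each setting against the same topic under the first setting."""
    reference = top_terms(topics_per_setting[0], vocabulary)
    return [jaccard(reference, top_terms(t, vocabulary)) for t in topics_per_setting]


# Toy data (hypothetical, not lab results): one topic under three settings
toy_vocabulary = ["risk", "price", "market", "asset", "optic"]
toy_settings = [
    ([0, 1, 2], [0.3, 0.2, 0.1]),
    ([1, 0, 3], [0.3, 0.2, 0.1]),
    ([4, 3, 2], [0.3, 0.2, 0.1]),
]
overlaps = overlap_with_first(toy_settings, toy_vocabulary)
print(overlaps)
```

In the notebook, topic 1 could be tracked with `overlap_with_first([topics[0] for topics in topic_indices_list], uniqueWords)`. Values close to 1 would indicate that α barely changes which words describe the topic.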

#### Varying the prior on word distribution per topic β

In [13]:
topic_indices_list = []
array = [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]

for i in array:
    model = (LDA.train(corpus, k=10, seed=2, docConcentration=6.0, topicConcentration=float(i)))
    topic_indices_list.append(model.describeTopics(maxTermsPerTopic=10))

In [14]:
# For each of the ten topics, print its top terms under every value of beta
for topic in range(1, 11):
    print("\n\t\tTOPIC %d alpha=6\n" % topic)
    # Pair each beta value with the topics of the model trained with it
    for beta, topics in zip(array, topic_indices_list):
        print("\n\t beta=%.1f \n" % beta)
        terms, weights = topics[topic - 1]
        listPrettyPrint(["%s - %.7f" % (uniqueWords[t], w) for t, w in zip(terms, weights)], 3)
        print('\n')


		TOPIC 1 alpha=6


	 beta=1 

risk - 0.0086089              financi - 0.0084894           price - 0.0083656
market - 0.0077078            electron - 0.0075672          microscopi - 0.0074453
stochast - 0.0069161          financ - 0.0065660            research - 0.0056172
tem - 0.0051305	


	 beta=1 

electron - 0.0087398          financi - 0.0076570           microscopi - 0.0076378
price - 0.0075849             risk - 0.0072593              market - 0.0063047
stochast - 0.0056360          financ - 0.0053330            research - 0.0051985
tem - 0.0048746	


	 beta=1 

electron - 0.0079999          microscopi - 0.0065407        financi - 0.0058969
price - 0.0058027             risk - 0.0055254              sem - 0.0047131
research - 0.0045374          market - 0.0045029            scan - 0.0043219
microscop - 0.0042025	


	 beta=1 

electron - 0.0060882          microscopi - 0.0049184        sem - 0.0046091
financi - 0.0042435           risk - 0.0039079              price - 0.0038364


TODO: Explain what is happening here.
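
In the output above, the weights of the leading terms of topic 1 appear to shrink over the later blocks (from about 0.0086 in the first block to about 0.0061 in the fourth). One hypothesis worth checking is that a larger β smooths the word distribution of each topic, so less probability mass sits on the top terms. The sketch below sums the weights of the listed top terms of a topic for each setting. It assumes each topic is a `(term indices, weights)` pair, as returned by `describeTopics`; the function name and the toy weights are hypothetical.

```python
def top_mass_per_setting(topics_per_setting):
    """Sum the weights of the listed top terms of one topic, for each setting."""
    return [sum(weights) for _term_indices, weights in topics_per_setting]


# Toy data (hypothetical, not lab results): one topic under two settings
toy_settings = [
    ([0, 1], [0.5, 0.25]),
    ([0, 1], [0.25, 0.125]),
]
masses = top_mass_per_setting(toy_settings)
print(masses)
```

In the notebook, `top_mass_per_setting([topics[0] for topics in topic_indices_list])` would give this sum for topic 1 under each value of β. A steadily decreasing sequence would support the smoothing hypothesis.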

## Exercise 4.10: EPFL's taught subjects

## Exercise 4.11: Wikipedia structure