# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** W

**Names:**

* Olivier Cloux
* Thibault Urien
* Saskia Reiss

---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [None]:
# Given imports
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
from scipy.sparse import find

# Import pickle to open processed courses
import pickle as pk

from utils import load_pkl, listPrettyPrint
from lab04_helper import *

## Exercise 4.8: Topics extraction

In [None]:
# Let's open the pickle to load it.
processedCourses = load_pkl("cidWithBag")

# We also need the list of words
uniqueWords = load_pkl("indexToWord")

# Get the course indexes for the matrix
courseIndices = load_pkl("indexToCourse")

# Get the TF-IDF matrix
TF_course_matrix = load_sparse_csr("TFIDF.npz")

In [None]:
# Transform each course's TF-IDF column into a Spark sparse vector
#### Use snippets: Get value with index from RDD, and Spark's sparse vectors

# Dimension of every vector: one entry per unique word
v_dim = len(uniqueWords)
vector_list = []

# Create the sparse vectors using the matrix (one column per course).
for i in range(len(courseIndices)):
    # find() returns (row indices, column indices, values) of the nonzero entries
    indices, _, values = find(TF_course_matrix.getcol(i))
    v = Vectors.sparse(v_dim, indices, values)
    vector_list.append((i, v))

# In the RDD we need a tuple with (index of document, corresponding SparseVector), then map(list)
# Create the RDD (corpus) from the list
corpus = sc.parallelize(vector_list)
#print(corpus.take(2))
corpus = corpus.map(list)

# Cluster the documents into ten topics using LDA
ldaModel = LDA.train(corpus, seed=2, k=10)
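
The loop above keeps only the nonzero TF-IDF weights of each course. As a minimal sketch of that idea in plain Python (no Spark or SciPy needed), the hypothetical helper `to_sparse_triplet` below turns a dense column into the `(size, indices, values)` layout that `Vectors.sparse` accepts. The weights are made up for illustration and do not come from our data.

```python
def to_sparse_triplet(column):
    """Return (size, indices, values) for the nonzero entries of a dense column."""
    indices = [i for i, weight in enumerate(column) if weight != 0.0]
    values = [column[i] for i in indices]
    return len(column), indices, values


# Hypothetical TF-IDF weights of one course over a 6-word vocabulary.
tfidf_column = [0.0, 1.2, 0.0, 0.0, 0.4, 0.0]
size, indices, values = to_sparse_triplet(tfidf_column)
print(size, indices, values)  # 6 [1, 4] [1.2, 0.4]
```

Only two of the six entries are stored, which is what makes the sparse representation worthwhile when most words never appear in a given course description.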

In [None]:
def printTopics(model, maxTerms):
    # For each topic, get the maxTerms highest-weighted words as a
    # (word indices, word weights) pair
    topic_indices = model.describeTopics(maxTermsPerTopic=maxTerms)

    # Print each topic's words with their weights, numbering topics from 1
    for topic, (word_ids, weights) in enumerate(topic_indices, start=1):
        print("Topic #", topic)
        listPrettyPrint(["%s - %.7f" % (uniqueWords[w], p) for w, p in zip(word_ids, weights)], 3)
        print('\n')

In [None]:
printTopics(ldaModel, 10)

* **Topic 1** : Financial Engineering
* **Topic 2** : Circuits & Systems
* **Topic 3** : Risk Management
* **Topic 4** : Physics (Acoustic Waves)
* **Topic 5** : Electrical Engineering
* **Topic 6** : Environmental Engineering
* **Topic 7** : Seminars & Talks
* **Topic 8** : Chemistry
* **Topic 9** : Astrophysics
* **Topic 10** : Bioengineering and Biology

As we can see, the topics found by LDA are more precise and easier to label than those found by LSI, even though some words do not seem to belong to their topic, probably because k is too large.
Note also that LDA is randomized: in our runs it extracted the same topics each time, but their order of appearance changed from run to run unless we fixed the seed beforehand (as we did, to be sure to get the same topic order each time).
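
Since the topic order can change between runs, comparing two runs requires matching topics by content rather than by position. The following is a minimal sketch of one possible way to do this, pairing topics greedily by the Jaccard overlap of their top words. The helper names and the word lists are hypothetical and are not output from our model.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align_topics(run_a, run_b):
    """Greedily pair each topic of run_a with the most similar unused topic of run_b.

    Assumes run_b has at least as many topics as run_a.
    """
    unused = set(range(len(run_b)))
    pairs = []
    for i, words_a in enumerate(run_a):
        best = max(unused, key=lambda j: jaccard(words_a, run_b[j]))
        unused.remove(best)
        pairs.append((i, best))
    return pairs


# Hypothetical top words of three topics in two runs with different seeds.
run_a = [["finance", "risk", "market"],
         ["circuit", "signal", "voltage"],
         ["cell", "protein", "gene"]]
run_b = [["gene", "cell", "biology"],
         ["market", "finance", "option"],
         ["voltage", "circuit", "analog"]]

print(align_topics(run_a, run_b))  # [(0, 1), (1, 2), (2, 0)]
```

A greedy pairing is enough for a quick check. It may give a poor matching when several topics share many words.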

## Exercise 4.9: Dirichlet hyperparameters

#### Varying the prior on topic distribution α

In [None]:
topic_indices_list = []
array_a = [2, 6, 15, 25, 52]

# Train one model per alpha value, keeping beta fixed at 1.01
for alpha in array_a:
    model = LDA.train(corpus, k=10, seed=2, docConcentration=float(alpha), topicConcentration=1.01)
    topic_indices_list.append(model.describeTopics(maxTermsPerTopic=10))

In [None]:
# Print every topic (1 to 10) under each alpha value
for topic in range(1, 11):
    print("\n\t\tTOPIC %d beta=1.01\n" % topic)
    for i, alpha in enumerate(array_a):
        print("\n\t alpha=%d \n" % alpha)
        # (word indices, word weights) of this topic for the i-th alpha
        word_ids, weights = topic_indices_list[i][topic-1]
        listPrettyPrint(["%s - %.7f" % (uniqueWords[word_ids[k]], weights[k]) for k in range(10)], 3)
        print('\n')

As we can see, a higher alpha pushes every document toward an even mixture of many topics, so the extracted topics become broad and general, while a lower alpha lets each document concentrate on a few dominant topics, which gives more specific topics.
Alpha is a prior: it encodes what we already assume about the data, so through it we state whether we expect documents to cover broad, general topics or a few specific ones.

Also note that between 52 and 102 (not shown here), we get the same topics. This means that we need to use lower values of alpha to see the topics change (as they do between 25 and 52).
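
To illustrate how the concentration parameter shapes a document's topic mixture, the sketch below draws topic proportions directly from a symmetric Dirichlet prior using only the standard library. It simulates the prior, not the fitted MLlib model, and the helper names, the number of simulated documents and the seed are assumptions chosen for illustration.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(alpha, k=10, n_docs=2000, seed=2):
    """Average weight of the dominant topic over n_docs simulated documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_docs)) / n_docs


for alpha in [2, 6, 15, 25, 52]:
    print("alpha=%d -> mean dominant topic share %.3f" % (alpha, mean_largest_share(alpha)))
```

With 10 topics, a perfectly even mixture gives each topic a share of 0.1. The printed share of the dominant topic should move toward that value as alpha grows, meaning that no single topic stands out in a document.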

#### Varying the prior on word distribution per topic β

In [None]:
topic_indices_list = []
array = [1.1, 2.0, 6.0, 20.0]

# Train one model per beta value, keeping alpha fixed at 6.0
for beta in array:
    model = LDA.train(corpus, k=10, seed=2, docConcentration=6.0, topicConcentration=float(beta))
    topic_indices_list.append(model.describeTopics(maxTermsPerTopic=10))

In [None]:
# Print every topic (1 to 10) under each beta value
for topic in range(1, 11):
    print("\n\t\tTOPIC %d alpha=6\n" % topic)
    for i, beta in enumerate(array):
        print("\n\t beta=%.1f \n" % beta)
        # (word indices, word weights) of this topic for the i-th beta
        word_ids, weights = topic_indices_list[i][topic-1]
        listPrettyPrint(["%s - %.7f" % (uniqueWords[word_ids[k]], weights[k]) for k in range(10)], 3)
        print('\n')

As we can see, a low beta value favors topics composed of only a few dominant words, while a higher beta value spreads each topic over far broader terms. So alpha and beta have similar effects, even if they do not lead to the same extracted words.

Also note that between 6 and 20, we get exactly the same topics, so we need to use lower values of beta to see the topics change (as they do between 6 and 2).
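
The flattening effect of beta can be seen with the usual Dirichlet smoothing formula, in which a word with count $n_w$ in a topic gets probability $(n_w + \beta) / (n + W\beta)$, with $n$ the total count and $W$ the vocabulary size. The sketch below applies it to made-up counts. It is an illustration of the prior's effect and not necessarily the exact update MLlib performs, and `smoothed_topic` is a hypothetical helper.

```python
def smoothed_topic(word_counts, beta):
    """Smoothed word probabilities: (n_w + beta) / (n + W * beta)."""
    total = sum(word_counts.values())
    vocab_size = len(word_counts)
    return {word: (count + beta) / (total + vocab_size * beta)
            for word, count in word_counts.items()}


# Hypothetical word counts assigned to one topic.
counts = {"option": 8, "risk": 6, "market": 1, "wave": 0, "cell": 0}

for beta in [1.1, 2.0, 6.0, 20.0]:
    probs = smoothed_topic(counts, beta)
    print("beta=%.1f -> share of the top word %.3f" % (beta, max(probs.values())))
```

As beta grows, the pseudo-counts outweigh the observed counts, so the share of the top word shrinks toward the uniform value 1/W.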

## Exercise 4.10: EPFL's taught subjects

To get the most out of EPFL's taught subjects, we are going to extract 7 topics, one for each faculty at EPFL (not counting individual sections, or we would have k = 24), so our k will be equal to 7.
We then choose the most commonly used values for α and β. According to the documentation, this gives α = 50/7 ≈ 7.14 and β = 1.01 or 200/W, with W the number of words in the vocabulary. We may have to use 1.01, as W can be very large and we need β to be greater than 1.

In [None]:
w = len(uniqueWords)
print(200/w)

As we can see, this value is too small, so we will use β = 1.01.

In [None]:
# Train with k = 7 (one topic per faculty), alpha = 50/7 and beta = 1.01
epfl_model = LDA.train(corpus, k=7, seed=2, docConcentration=7.14, topicConcentration=1.01)
printTopics(epfl_model, 7)

As we can see here, the labels are quite similar to what we would expect from the taught subjects: the 7 faculties appear in one form or another in these topics.

So we have:
* **Topic 1** : Administrative (most courses will have some form of administration discussion in their description)
* **Topic 2** : Cancer & Global Health + Blue Brain Project (Life Sciences)
* **Topic 3** : Life Sciences (Bioengineering)
* **Topic 4** : Financial Engineering (Management of Technology)
* **Topic 5** : Electrical and Mechanical Engineering (Engineering)
* **Topic 6** : Architecture, Civil and Environmental Engineering
* **Topic 7** : Basic Sciences (Chemistry, Mathematics, Physics)

Note that our own faculty (Computer and Communication Sciences) does not seem to appear, but this might be because in most Bachelor years we have courses in other domains, like physics, mathematics and electrical engineering, rather than pure computer science courses. As a result, technical terms specific to computer science are much less prominent in the course descriptions.

## Exercise 4.11: Wikipedia structure

Using the same logic as for EPFL's taught subjects, we are going to choose k = 16, as the Wikipedia for Schools website shows 16 sections.

As for α and β, we again use the most common values. Thus, we have α = 50/16 ≈ 3.13 and β = 1.01.
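
The rule used for both corpora can be written as a small helper. This is a minimal sketch, assuming the heuristics stated above (α = 50/k, and β = 200/W replaced by 1.01 when it is not greater than 1). The name `heuristic_priors` and the vocabulary size of 10000 are hypothetical.

```python
def heuristic_priors(k, vocab_size, min_beta=1.01):
    """Return (alpha, beta) from the 50/k and 200/W rules of thumb.

    beta falls back to min_beta when 200/W is not greater than 1.
    """
    alpha = 50.0 / k
    beta = 200.0 / vocab_size
    if beta <= 1.0:
        beta = min_beta
    return alpha, beta


# k = 7 for the EPFL courses, k = 16 for Wikipedia for Schools.
for k in (7, 16):
    alpha, beta = heuristic_priors(k, vocab_size=10000)
    print("k=%d -> alpha=%.3f, beta=%.2f" % (k, alpha, beta))
```

The returned values could then be passed as `docConcentration` and `topicConcentration` to `LDA.train`, as in the cells above.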

In [None]:
# Need to do TF-IDF of the wiki txt, then re-use code above.