# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** *P*

**Names:**

* *Matthias Leroy*
* *Pierre Fouche*


---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [None]:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
import json
import pickle
import numpy as np
from collections import defaultdict

## Exercise 4.8: Topics extraction

In [None]:
with open("ex3.pickle", "rb") as f:
    courses = pickle.load(f, encoding="utf-8")
#Load the courses
data = sc.parallelize(courses)
#Retrieve the list of words for each course
data = data.map(lambda x: x['listDescription'])
#Count the occurrences of each word and sort the words by decreasing frequency
termCounts = data.flatMap(lambda x:x).map(lambda x : (x,1)).reduceByKey(lambda a,b:a+b).map(lambda x:(x[1],x[0])).sortByKey(False)
#Create a dict with each word as key and its index as value
voc = termCounts.map(lambda x:x[1]).zipWithIndex().collectAsMap()
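
For readers without a Spark context at hand, the vocabulary construction above can be sketched in plain Python. This is only an illustration: `build_vocabulary` and the toy documents are hypothetical names and data, not part of the lab code, and ties are broken alphabetically here, which the Spark version does not guarantee.

```python
from collections import Counter

def build_vocabulary(token_lists):
    """Map each word to its rank by decreasing corpus frequency."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Ties are broken alphabetically so that the result is deterministic
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: index for index, word in enumerate(ranked)}

# Hypothetical toy corpus: "data" appears 3 times, "model" twice, "graph" once
toy_docs = [["data", "model", "data"], ["model", "data", "graph"]]
toy_voc = build_vocabulary(toy_docs)
print(toy_voc)  # {'data': 0, 'model': 1, 'graph': 2}
```

The most frequent word gets index 0, which mirrors what `sortByKey(False)` followed by `zipWithIndex()` produces.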


In [None]:
#Build a sparse vector whose size is the size of the vocabulary;
# the value at each index is the count of the word whose id is that index
def document_vector(document):
    doc_id = document[1]
    counts = defaultdict(int)
    for token in document[0]:
        if token in voc:
            counts[voc[token]] += 1
    #Sort by word id so that the indices are passed in increasing order
    counts = sorted(counts.items())
    keys = [x[0] for x in counts]
    values = [x[1] for x in counts]
    return (doc_id, Vectors.sparse(len(voc), keys, values))

documents = data.zipWithIndex().map(document_vector).map(list)
inv_voc = {value: key for (key, value) in voc.items()}
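
The sparse bag-of-words representation built by `document_vector` can be illustrated without Spark. The sketch below is a simplified stand-in, assuming a hypothetical three-word vocabulary; `sparse_counts` is not part of the lab code and returns a plain tuple instead of a Spark vector.

```python
from collections import Counter

def sparse_counts(tokens, vocabulary):
    """Return (size, indices, values) for a bag-of-words vector."""
    # Words that are missing from the vocabulary are ignored
    counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
    indices = sorted(counts)
    values = [counts[i] for i in indices]
    return len(vocabulary), indices, values

# Hypothetical vocabulary and document
toy_voc = {"data": 0, "model": 1, "graph": 2}
size, indices, values = sparse_counts(["graph", "data", "data", "unknown"], toy_voc)
print(size, indices, values)  # 3 [0, 2] [2, 1]
```

Only the words that actually occur are stored, which keeps each document small even when the vocabulary is large.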

In [None]:
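#Train an LDA model with the given parameters and print the top 10 words of each topic
#(-1.0 lets Spark choose the value of a concentration parameter automatically)
def trainLda(cluster=10, alpha=-1.0, beta=-1.0):
    lda_model = LDA.train(documents, k=cluster, docConcentration=alpha, topicConcentration=beta)
    #describeTopics returns, for each topic, the ids of its top words and their weights
    result = lda_model.describeTopics(10)
    for i in range(len(result)):
        print('topic', (i+1), '=================')
        for term_id in result[i][0]:
            print(inv_voc[term_id])
trainLda()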

Topic 1: Engineering, Topic 2: Physics/Chemistry, Topic 3: Presentation, Topic 4: Algorithm, Topic 5: Methodology, Topic 6: Machine Learning, Topic 7: Mathematics, Topic 8: Electronic System, Topic 9: Stochastic Model, Topic 10: Physics

In [None]:
trainLda(cluster=12,alpha=6.0,beta=2.0)

## Exercise 4.9: Dirichlet hyperparameters

In [None]:
#Try some alphas
for a in np.linspace(1.01,6,11):
    print(a," 1.01")
    print("=========================")
    trainLda(alpha=a,beta=1.01)

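In our runs, a high value of alpha led to more precise topics. Alpha (`docConcentration`) is the Dirichlet prior on the topic distribution of each document: the larger it is, the more evenly each document is spread across the topics.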
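
To build some intuition for the role of alpha, the following sketch samples topic mixtures directly from a symmetric Dirichlet prior and compares their average entropy. It looks only at the prior, not at a trained LDA model, and all names (`sample_dirichlet`, `mean_entropy`) as well as the sample sizes are assumptions made for illustration.

```python
import math
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) of dimension k."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def mean_entropy(alpha, k=10, n=2000, seed=0):
    """Average entropy of n topic mixtures drawn from the prior."""
    rng = random.Random(seed)
    return sum(entropy(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n

low = mean_entropy(1.01)   # smallest alpha tried in this exercise
high = mean_entropy(6.0)   # largest alpha tried in this exercise
print(low, high, math.log(10))  # log(10) is the entropy of a uniform mixture
```

A higher entropy means that a document drawn from the prior is spread more evenly over the 10 topics, so the larger alpha should come out closer to the uniform bound.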

In [None]:
#Try some betas
for b in np.linspace(1.01,6,11):
    print("6.0 ",b)
    print("=========================")
    trainLda(alpha =6.0, beta=b)

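A high value of beta leads to topics that are more similar in terms of the words they contain. Beta (`topicConcentration`) is the Dirichlet prior on the word distribution of each topic, so a larger value smooths every topic towards the same words.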
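
The effect of beta on topic similarity can be illustrated by sampling pairs of word distributions from a symmetric Dirichlet prior and measuring their cosine similarity. This is a rough sketch on the prior alone, with a hypothetical vocabulary of 50 words; it does not use the lab's LDA model, and the helper names are ours.

```python
import math
import random

def sample_dirichlet(beta, size, rng):
    """Draw one word distribution from a symmetric Dirichlet(beta)."""
    draws = [rng.gammavariate(beta, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def cosine(p, q):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def mean_topic_similarity(beta, vocab_size=50, pairs=500, seed=0):
    """Average similarity between two independently sampled topics."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        first = sample_dirichlet(beta, vocab_size, rng)
        second = sample_dirichlet(beta, vocab_size, rng)
        total += cosine(first, second)
    return total / pairs

sim_low = mean_topic_similarity(1.01)
sim_high = mean_topic_similarity(6.0)
print(sim_low, sim_high)
```

With the larger beta, two independently drawn topics should look more alike, which is consistent with the observation above.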

## Exercise 4.10: EPFL's taught subjects

In [None]:
#Find the best k, alpha and beta
for k in [12,15,18]:
    for a in np.linspace(1.01,6.01,3):
        for b in np.linspace(1.01,6.01,3):
            print(a,b,k)
            print("===========================================================================")
            trainLda(k,a,b)


In [None]:
trainLda(cluster=12,alpha=6.0,beta=2.0)

We chose k=12 because there are approximately 12 different sections at EPFL, alpha=6 because we want precise topics, and beta=2.

## Exercise 4.11: Wikipedia structure

In [None]:
wiki = sc.textFile("/ix/wikipedia-for-schools.txt").map(json.loads)
data = wiki.map(lambda x: x['tokens'])
termCounts = data.flatMap(lambda x:x).map(lambda x : (x,1)).reduceByKey(lambda a,b:a+b).map(lambda x:(x[1],x[0])).sortByKey(False)
voc = termCounts.map(lambda x:x[1]).zipWithIndex().collectAsMap()
documents = data.zipWithIndex().map(document_vector).map(list)
inv_voc = {value: key for (key, value) in voc.items()}

trainLda(12,8.0,2.0)

We chose k = 12 because Wikipedia has 12 portals.<br/>
Topic 1: General, Topic 2: Society, Topic 3: Animals, Topic 4: Natural Science, Topic 5: Religions, Topic 6: Technologies, Topic 7: Globalization, Topic 8: Mathematics, Topic 9: Arts, Topic 10: History, Topic 11: Culture, Topic 12: Physical sciences
