# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** *K*

**Names:**

* *Mathieu Sauser*
* *Luca Mouchel*
* *Heikel Jebali*
* *Jérémy Chaverot*

---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

#### Some imports to begin with

In [26]:
import json
import numpy as np
from utils import load_json, load_pkl
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors

## Exercise 4.8: Topics extraction

In [27]:
!hdfs dfs -put data/preprocessed_courses.txt
!hdfs dfs -ls

# Load the pre-processed courses data
courses_RDD = sc.textFile("preprocessed_courses.txt").flatMap(json.loads)

put: `preprocessed_courses.txt': File exists
Found 4 items
drwx------   - jchavero hadoop          0 2023-06-06 03:00 .Trash
drwxr-xr-x   - jchavero hadoop          0 2023-06-06 19:43 .sparkStaging
-rw-r--r--   3 jchavero hadoop    2147813 2023-03-01 09:39 election-day-tweets.txt
-rw-r--r--   3 jchavero hadoop    2861332 2023-06-06 02:35 preprocessed_courses.txt


In [28]:
# Retrieve the identifier for all courses
coursesIds_RDD = courses_RDD.map(lambda c: c['courseId']).distinct()

# The number of distinct courses
N = coursesIds_RDD.count() 

# Map each course to a unique index
docToIdx = dict(zip(coursesIds_RDD.collect(), range(N)))
docIdx = list(docToIdx.values())

In [29]:
# Retrieve all distinct terms from all documents
all_words_RDD = courses_RDD.flatMap(lambda c: c['description']).distinct()

# The number of distinct terms
M = all_words_RDD.count() 

# Map each term to a unique index
termToIdx = dict(zip(all_words_RDD.collect(), range(M)))
vectorized_termToIdx = np.vectorize(lambda x: termToIdx[x])
idxToTerm = {v: k for k, v in termToIdx.items()}

# Compute a reduced version of the courses RDD with only indexes
red_courses_RDD = courses_RDD.map(lambda c: (docToIdx[c["courseId"]], vectorized_termToIdx(c["description"])))
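
The indexing above, together with the count vectors built in the next cell, can be illustrated on a toy corpus without Spark. This is only a minimal sketch: the course identifiers, the terms and every `toy_*` name are hypothetical and do not come from the lab data.

```python
from collections import Counter

# Hypothetical toy corpus: (course id, list of pre-processed terms)
toy_courses = [
    ("C1", ["signal", "processing", "signal"]),
    ("C2", ["quantum", "signal"]),
]

# Map each distinct term to a column index
# (sorted here only to make the toy example reproducible)
toy_terms = sorted({t for _, terms in toy_courses for t in terms})
toy_term_to_idx = {t: i for i, t in enumerate(toy_terms)}

def toy_count_vector(terms, term_to_idx):
    """Return {term index: number of occurrences} for one document."""
    return dict(Counter(term_to_idx[t] for t in terms))

# One sparse count vector (as a dict) per course
toy_vectors = {cid: toy_count_vector(terms, toy_term_to_idx)
               for cid, terms in toy_courses}

print(toy_term_to_idx)
print(toy_vectors)
```

Each dictionary plays the role of the `vector` argument passed to `Vectors.sparse(M, vector)` in the notebook: keys are term indices and values are term counts.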

In [30]:
# Helper: for one (docIdx, term indices) pair, build a sparse term-count vector of size M (the number of distinct terms)
def create_sparse_vector_from_document(doc):
    vector = {}
    for termIdx in doc[1]:
        vector[termIdx] = vector.get(termIdx, 0) + 1
    return (doc[0], Vectors.sparse(M, vector))

# Build the term-document matrix
term_doc_matrix = red_courses_RDD.map(lambda x: create_sparse_vector_from_document(x)).map(list)

# Train the LDA model
lda = LDA.train(term_doc_matrix, k=10, seed=0)

# Print the top words of each topic, mapping term indices back to terms
def print_topics(model, idxToTerm=idxToTerm, labels=None, words_per_topic=7):
    topics = model.describeTopics(words_per_topic)
    for i, topic in enumerate(topics):
        # Each topic is a (term indices, term weights) pair
        words = [idxToTerm[idx] for idx in topic[0]]
        if labels is not None:
            print(f'Topic {i + 1 :>2}: \033[1m{labels[i]}\033[0m \n{words}\n')
        else:
            print(f'Topic {i + 1 :>2}: {words}')

# Inferred labels for each topic, written by hand after a first print_topics() call
labels = [
    "Physics",
    "Electromagnetism",
    "Mathematics",
    "Optical & Materials",
    "Project Planning",
    "Methodology",
    "Signal Processing",
    "Course content",
    "Systems Design",
    "Chemistry & Applications"
]

print_topics(lda, labels=labels)

Topic  1: Physics
['method', 'equation', 'model', 'learning', 'numerical', 'exercise', 'basic']

Topic  2: Electromagnetism
['material', 'method', 'learning', 'property', 'magnetic', 'concept', 'energy']

Topic  3: Mathematics
['data', 'learning', 'method', 'system', 'problem', 'analysis', 'algorithm']

Topic  4: Optical & Materials
['system', 'design', 'energy', 'material', 'learning', 'optical', 'method']

Topic  5: Project Planning
['project', 'report', 'research', 'skill', 'scientific', 'data', 'laboratory']

Topic  6: Methodology
['method', 'analysis', 'learning', 'content', 'note', 'exam', 'theory']

Topic  7: Signal Processing
['model', 'processing', 'learning', 'method', 'system', 'analysis', 'basic']

Topic  8: Course content
['management', 'work', 'case', 'method', 'design', 'learning', 'presentation']

Topic  9: Systems Design
['system', 'design', 'circuit', 'technology', 'content', 'method', '1

When comparing these 10 topics with the results of the previous lab on Latent Semantic Indexing (LSI), we can observe that there are only a few similarities. Specifically, the topics "Electromagnetism", "Project Planning" and "Mathematics" align with the LSI topics "Electromagnetism", "Projects" and "Algebra" respectively. 

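Part of the difference comes from the models themselves, which rely on different assumptions. In addition, LDA training starts from a random initialization (fixed here with `seed=0`), so the extracted topics can change from one run to the next.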

## Exercise 4.9 Dirichlet hyperparameters

When we call `LDA.train` in PySpark, $\alpha$ corresponds to the `docConcentration` parameter, and $\beta$ corresponds to the `topicConcentration` parameter.
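
For reference, the roles of the two hyperparameters can be written with the usual generative model of smoothed LDA. The notation below ($\theta_d$, $\phi_t$, $z_{d,n}$, $w_{d,n}$) is our own shorthand and is an assumption; it is not taken from the Spark documentation:

$$
\begin{aligned}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of document } d \\
\phi_t &\sim \mathrm{Dirichlet}(\beta) && \text{term distribution of topic } t,\ t = 1, \dots, k \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) && \text{topic of the } n\text{-th term of document } d \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}}) && \text{observed term}
\end{aligned}
$$

In this reading, $\alpha$ shapes how many topics a document mixes, while $\beta$ shapes how many terms a topic spreads its weight over.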

In [31]:
ALPHA_VALUES = np.array([1.01, 2, 5, 10, 20, 100], dtype=np.float64)
BETA_VALUES = np.array([1.01, 2, 3, 6, 10, 20], dtype=np.float64)

1. First, we fix $k=10$ and $\beta=1.01$, and vary $\alpha$.

In [32]:
def lda_vary_alpha(data, values=ALPHA_VALUES, beta=1.01, k=10, seed=0):
    models = []
    for alpha in values:
        models.append(
            LDA.train(
                data, k=k, docConcentration=float(alpha), topicConcentration=float(beta), seed=seed
            )
        )
    print("The training of the RDD on the LDA model was completed while adjusting the alpha parameter.")
    return models

lda_alpha = lda_vary_alpha(term_doc_matrix)

The training of the RDD on the LDA model was completed while adjusting the alpha parameter.


In [33]:
for i, alpha in enumerate(ALPHA_VALUES):
    print(f'\033[1malpha={alpha} & beta=1.01 : \033[0m')
    print_topics(lda_alpha[i])
    print('\n')

alpha=1.01 & beta=1.01 :
Topic  1: ['method', 'learning', 'model', 'numerical', 'exercise', 'equation', 'course']
Topic  2: ['method', 'material', 'learning', 'structure', 'property', 'project', 'content']
Topic  3: ['learning', 'data', 'method', 'model', 'analysis', 'system', 'algorithm']
Topic  4: ['system', 'energy', 'method', 'design', 'optical', 'learning', 'concept']
Topic  5: ['report', 'project', 'research', 'learning', 'data', 'skill', 'laboratory']
Topic  6: ['method', 'learning', 'analysis', 'content', 'signal', 'paper', 'exam']
Topic  7: ['method', 'learning', 'processing', 'project', 'system', 'model', 'lecture']
Topic  8: ['design', 'method', 'work', 'management', 'learning', 'case', 'project']
Topic  9: ['design', 'circuit', 'system', 'device', 'content', 'method', 'learning']
Topic 10: ['chemistry', 'system', 'project', 'method', 'content', 'engineering', 'learning']


alpha=2.0 & beta=1.01 :
Topic  1: ['method', 'learning', 'model', 'numerical', 'equa

#### Observations
When we increase the value of $\alpha$ in the LDA model, the topics become more similar to each other. Conversely, decreasing it leads to more distinct and diverse topics.
This is because, as seen in class and in the Spark documentation, $\alpha$ is the parameter of the Dirichlet prior over each document's distribution over topics. Larger values encourage smoother inferred distributions, so each document looks more like a uniform mixture of all topics and the topics end up sharing more of their vocabulary.
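
The smoothing effect of the concentration parameter can be checked on the prior alone, independently of Spark. The sketch below is only an illustration under assumptions: it samples document-topic mixtures from a symmetric Dirichlet with NumPy and does not fit any model; `mean_max_topic_weight` and its defaults are hypothetical names chosen for this example.

```python
import numpy as np

def mean_max_topic_weight(alpha, k=10, n_docs=2000, seed=0):
    """Average weight of the dominant topic over simulated documents
    whose topic mixtures are drawn from a symmetric Dirichlet(alpha)."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.full(k, alpha), size=n_docs)  # shape (n_docs, k)
    return float(theta.max(axis=1).mean())

# Same alpha grid as ALPHA_VALUES in the notebook
for alpha in [1.01, 2, 5, 10, 20, 100]:
    weight = mean_max_topic_weight(alpha)
    print(f"alpha={alpha:>6}: mean dominant-topic weight = {weight:.3f}")
```

As $\alpha$ grows, the weight of the dominant topic moves towards $1/k$, meaning the sampled mixtures get closer to uniform.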

2. Then, we fix $k=10$ and $\alpha=6$, and vary $\beta$.

In [34]:
def lda_vary_beta(data, alpha=6, values=BETA_VALUES, k=10, seed=0):
    models = []
    for beta in values:
        models.append(
            LDA.train(
                data, k=k, docConcentration=float(alpha), topicConcentration=float(beta), seed=seed
            )
        )
    print("The training of the RDD on the LDA model was completed while adjusting the beta parameter.")
    return models

lda_beta = lda_vary_beta(term_doc_matrix)

The training of the RDD on the LDA model was completed while adjusting the beta parameter.


In [35]:
for i, beta in enumerate(BETA_VALUES):
    print(f'\033[1malpha=6 & beta={beta} :\033[0m')
    print_topics(lda_beta[i])
    print('\n')

alpha=6 & beta=1.01 :
Topic  1: ['method', 'equation', 'learning', 'numerical', 'model', 'exercise', 'basic']
Topic  2: ['method', 'material', 'energy', 'learning', 'concept', 'heat', 'property']
Topic  3: ['data', 'learning', 'analysis', 'problem', 'method', 'report', 'algorithm']
Topic  4: ['system', 'design', 'optical', 'material', 'energy', 'learning', 'method']
Topic  5: ['project', 'research', 'learning', 'report', 'skill', 'scientific', 'work']
Topic  6: ['method', 'analysis', 'learning', 'content', 'exam', 'theory', 'note']
Topic  7: ['model', 'processing', 'method', 'learning', 'system', 'lecture', 'basic']
Topic  8: ['design', 'method', 'work', 'case', 'management', 'learning', 'assessment']
Topic  9: ['system', 'design', 'circuit', 'content', 'technology', 'learning', '1']
Topic 10: ['chemistry', 'system', 'molecular', 'application', 'method', 'content', 'organic']


alpha=6 & beta=2.0 :
Topic  1: ['method', 'learning', 'model', 'numerical', 'system', 'equati

#### Observations
In the same manner, when we increase the value of $\beta$ in the LDA model, the topics become more similar to each other. Conversely, decreasing it leads to more distinct and diverse topics.
This is because, as seen in class and in the Spark documentation, $\beta$ is the parameter of the Dirichlet prior over each topic's distribution over terms. Larger values encourage smoother inferred distributions, so each topic spreads its weight more uniformly over the vocabulary.
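
To make "more similar topics" measurable rather than judged by eye, one option is the average Jaccard overlap between the top-word sets of every pair of topics. This is a minimal sketch and not part of the lab: `mean_topic_overlap` is a hypothetical helper, and the two word lists are copied from the `alpha=6`, `beta=1.01` output above.

```python
from itertools import combinations

def mean_topic_overlap(topics):
    """Mean Jaccard similarity between the top-word sets of every pair
    of topics (expects at least two topics)."""
    sets = [set(words) for words in topics]
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Topics 1 and 10 from the alpha=6, beta=1.01 run
example_topics = [
    ['method', 'equation', 'learning', 'numerical', 'model', 'exercise', 'basic'],
    ['chemistry', 'system', 'molecular', 'application', 'method', 'content', 'organic'],
]

# The two topics share only 'method': 1 common word out of 13 distinct ones
print(mean_topic_overlap(example_topics))
```

Applied to the word lists of each trained model, a value that rises with $\alpha$ or $\beta$ would support the observation above.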

## Exercise 4.10: EPFL’s taught subjects

Based on the outcomes obtained in the previous exercise, we can infer that the most interpretable results for representing EPFL's taught subjects require the following combination of hyperparameters:

In [65]:
k = 15
alpha = 1.2
beta = 1.2

#### Dirichlet hyperparameter choice explanation

1. $k=15$

At EPFL, there are 13 sections, each focusing on a different area of study and research. We add two more topics to cover aspects that fall outside these sections, which gives $k=15$ and lets the model span both the sections and what lies beyond them.

2. $\alpha = 1.2$ and $\beta = 1.2$

We want the topics to be as different as possible from each other, which means giving $\alpha$ and $\beta$ the lowest possible values. However, since EPFL is an engineering school, some overlap is likely between the topics covered by the courses, and likewise between the terms used in the topics. That is why we set $\alpha=\beta=1.2$, which leaves some tolerance.

In [71]:
# Train the LDA model
epfl_lda = LDA.train(term_doc_matrix, k=k, docConcentration=float(alpha), topicConcentration=float(beta), seed=0)

# Inferred EPFL taught subjects, written by hand after a first print_topics() call
labels = [
    "Computer Science",
    "Mathematics?",
    "Biology/Life Sciences",
    "Digital Humanities",
    "Scientific Research",
    "Quantum Sciences",
    "Evaluation related stuff",
    "Nuclear Physics",
    "Electrical & Electronic Engineering",
    "Electromagnetism",
    "Chemistry",
    "Data Science",
    "Material Sciences",
    "Finance Engineering",
    "Microengineering"
]

# Print the words for each topic
print_topics(epfl_lda, labels=labels, words_per_topic=10)

Topic  1: Computer Science
['design', 'learning', 'system', 'project', 'method', 'programming', 'data', 'analysis', 'course', 'concept']

Topic  2: Mathematics?
['energy', 'process', 'system', 'flow', 'heat', 'method', 'learning', 'transfer', 'concept', 'equation']

Topic  3: Biology/Life Sciences
['cell', 'biology', 'note', 'structure', 'system', 'material', 'learning', 'protein', 'content', 'molecular']

Topic  4: Digital Humanities
['research', 'semester', 'policy', 'learning', 'content', 'innovation', 'social', 'technology', 'industry', 'method']

Topic  5: Scientific Research
['project', 'report', 'skill', 'week', 'data', 'scientific', 'written', 'problem', 'specific', 'research']

Topic  6: Quantum Sciences
['modeling', 'method', 'learning', 'model', 'quantum', 'exercise', 'system', 'design', 'theory', 'introduction']

Topic  7: Evaluation related stuff
['project', 'plan', 'learning', 'method', 'presentation', 'skill'

## Exercise 4.11: Wikipedia structure

In [13]:
# Load the wikipedia RDD
wikipedia_RDD = sc.textFile('/ix/wikipedia-for-schools.txt').map(json.loads)

In [14]:
# Retrieve the identifier for all wikipedia pages
wikiPageIds_RDD = wikipedia_RDD.map(lambda p: p['page_id']).distinct()

# The number of distinct pages
N = wikiPageIds_RDD.count()

# Map each page to a unique index 
pageToIdx = dict(zip(wikiPageIds_RDD.collect(), range(N)))
pageIdx = list(pageToIdx.values())

In [15]:
# Retrieve all distinct terms from all wiki pages
all_words_RDD = wikipedia_RDD.flatMap(lambda p: p["tokens"]).distinct()

# The number of distinct terms
M = all_words_RDD.count() 

# Map each term to a unique index
termToIdx = dict(zip(all_words_RDD.collect(), range(M)))
vectorized_termToIdx = np.vectorize(lambda x: termToIdx[x])
idxToTerm = {v: k for k, v in termToIdx.items()}

# Compute a reduced version of the wikipedia RDD with only indexes
red_wiki_RDD = wikipedia_RDD.map(lambda c: (pageToIdx[c["page_id"]], vectorized_termToIdx(c["tokens"])))

In [16]:
# Helper: for one (pageIdx, term indices) pair, build a sparse term-count vector of size M (the number of distinct terms)
def create_sparse_vector_from_document(doc):
    vector = {}
    for termIdx in doc[1]:
        vector[termIdx] = vector.get(termIdx, 0) + 1
    return (doc[0], Vectors.sparse(M, vector))

# Build the term-document matrix
term_doc_matrix = red_wiki_RDD.map(lambda x: create_sparse_vector_from_document(x)).map(list)

# Dirichlet hyperparameters
k = 25
alpha = 1.01
beta = 1.5

# Train the LDA model
lda = LDA.train(term_doc_matrix, k=k, docConcentration=float(alpha), topicConcentration=float(beta), seed=0)

# Print the top words of each topic, mapping term indices back to terms
# (redefined so that the default idxToTerm is the Wikipedia vocabulary)
def print_topics(model, idxToTerm=idxToTerm, labels=None, words_per_topic=10):
    topics = model.describeTopics(words_per_topic)
    for i, topic in enumerate(topics):
        # Each topic is a (term indices, term weights) pair
        words = [idxToTerm[idx] for idx in topic[0]]
        if labels is not None:
            print(f'Topic {i + 1 :>2}: \033[1m{labels[i]}\033[0m \n{words}\n')
        else:
            print(f'Topic {i + 1 :>2}: {words}')

# Inferred labels, written by hand after a first print_topics() call
labels = ['Olympic Games', 'Number Theory', 'Urban Development', 'Healthcare', 'Volcanic Eruptions',
          'Island Geography', 'Urban Planning', 'Optics', 'Music and Art', 'Calendar',
          'Computer Science', 'Environment', 'Entertainment', 'Government and Politics',
          'Rivers and Waterways', 'Ancient History and Mythology', 'War and Conflict', 'Space Exploration',
          'Information Systems', 'Business', 'Humanity and Culture', 'Tea Culture', 'Indian History',
          'French History', 'Biodiversity']

# Finally print the topics
print_topics(lda, labels=labels)

Topic  1: Olympic Games
['games', 'game', 'players', 'world', 'time', 'olympic', '–', 'cup', 'player', 'events']

Topic  2: Number Theory
['theory', 'number', '=', 'numbers', 'work', 'set', 'called', 'form', 'written', 'century']

Topic  3: Urban Development
['city', '·', 'centre', 'century', 'law', 'population', 'system', 'government', 'state', 'years']

Topic  4: Healthcare
['blood', 'people', 'health', 'cancer', 'medical', 'treatment', 'risk', 'patients', 'high', 'years']

Topic  5: Volcanic Eruptions
['eruption', 'years', 'comet', 'lava', 'volcanic', 'volcano', 'india', 'soil', 'large', 'time']

Topic  6: Island Geography
['island', 'islands', 'european', 'city', 'population', 'country', 'north', 'south', 'ireland', 'east']

Topic  7: Urban Planning
['south', 'lake', 'mi', 'area', 'river', 'city', 'oil', 'water', 'population', 'north']

Topic  8: Optics
['gas', 'lens', 'game', 'time', 'earth', 'lenses', 'water'

#### Dirichlet hyperparameter choice explanation

As the number of topics on Wikipedia is extensive, it is reasonable to expect that a larger value of $k$ would yield better results. However, due to memory limitations, we had to restrict ourselves to $k=25$. Regarding $\alpha$, since each Wikipedia page focuses on a highly specialized subject, a page is unlikely to mix many topics. This is why we set $\alpha=1.01$, the smallest value we tried (Spark's EM optimizer, the default in `LDA.train`, expects concentration values greater than 1). As for $\beta$, considering the vast number of words in a subset of Wikipedia, it is inevitable that words will be repeated across various topics. Therefore, by selecting $\beta=1.5$, we introduced some level of tolerance.

We observe that the topics are quite discernible, and thus, we are satisfied with the results achieved with $k=25$. However, it is important to acknowledge that there may be hundreds of additional topics beyond these.