# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** *K*

**Names:**

* *Mathieu Sauser*
* *Luca Mouchel*
* *Heikel Jebali*
* *Jérémy Chaverot*

---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

#### Some imports to begin with

In [31]:
import json
import numpy as np
from utils import load_json, load_pkl
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors

## Exercise 4.8: Topics extraction

In [110]:
!hdfs dfs -put data/preprocessed_courses.txt
!hdfs dfs -ls

courses_RDD = sc.textFile("preprocessed_courses.txt").flatMap(json.loads)
termToIdx = np.load("termToIdx.npy", allow_pickle=True)
docToIdx = np.load("docToIdx.npy", allow_pickle=True)

put: `preprocessed_courses.txt': File exists
Found 4 items
drwx------   - jchavero hadoop          0 2023-06-06 03:00 .Trash
drwxr-xr-x   - jchavero hadoop          0 2023-06-06 08:05 .sparkStaging
-rw-r--r--   3 jchavero hadoop    2147813 2023-03-01 09:39 election-day-tweets.txt
-rw-r--r--   3 jchavero hadoop    2861332 2023-06-06 02:35 preprocessed_courses.txt


In [113]:
courses_RDD.take(10)

[{'courseId': 'MSE-440',
  'name': 'Composites technology',
  'description': ['latest',
   'development',
   'processing',
   'generation',
   'organic',
   'composite',
   'discussed',
   'nanocomposites',
   'adaptive',
   'composite',
   'biocomposites',
   'presented',
   'product',
   'development',
   'cost',
   'analysis',
   'study',
   'market',
   'practiced',
   'team',
   'work',
   'content',
   'basic',
   'composite',
   'composite',
   'structure',
   'current',
   'composite',
   'force',
   'keywords',
   'composite',
   'application',
   'nanocomposites',
   'biocomposites',
   'adaptive',
   'composite',
   'design',
   'cost',
   'learning',
   'prerequisite',
   'required',
   'course',
   'notion',
   'polymer',
   'recommended',
   'course',
   'polymer',
   'composite',
   'learning',
   'outcome',
   'end',
   'propose',
   'suitable',
   'design',
   'production',
   'performance',
   'criterion',
   'production',
   'composite',
   'basic',
   'equation',
  

In [127]:
coursesIds = courses_RDD.map(lambda c: c['courseId']).collect()
docToIdx = dict(zip(coursesIds, range(len(coursesIds))))
docIdx = list(docToIdx.values())

In [159]:
all_words = set(courses_RDD.flatMap(lambda c: c['description']).collect())
#print(len(all_words))
termToIdx = dict(zip(all_words, range(len(all_words))))
courses_reduced_rdd = courses_RDD.map(lambda c: (docToIdx[c["courseId"]], c["description"]))
#courses_reduced_rdd.take(5)
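
The two cells above assign an integer index to every course and every term, which is what the term-count vectors in the next cell rely on. As a minimal sketch of that idea outside Spark, the snippet below builds the same kind of mapping on a toy corpus. The names `docs`, `vocabulary`, `term_to_idx` and `to_sparse_counts` are illustrative and not part of the notebook, and sorting the vocabulary is an assumption added here: iteration order over a Python `set` of strings can differ between interpreter sessions, so a sorted vocabulary keeps the indices reproducible.

```python
from collections import Counter

# Hypothetical toy corpus standing in for the (docIdx, description) pairs
docs = [
    (0, ["composite", "design", "composite", "cost"]),
    (1, ["design", "polymer"]),
]

# Sorting the vocabulary makes the term -> index mapping reproducible across runs
vocabulary = sorted({word for _, words in docs for word in words})
term_to_idx = {term: idx for idx, term in enumerate(vocabulary)}


def to_sparse_counts(words, term_to_idx):
    """Return (vector_size, {term_index: count}) for one document."""
    counts = Counter(term_to_idx[word] for word in words)
    return len(term_to_idx), dict(counts)


# One (doc_id, (vector_size, counts)) entry per document
sparse_docs = [
    (doc_id, to_sparse_counts(words, term_to_idx)) for doc_id, words in docs
]

# Because the vocabulary is a list, it doubles as the index -> term lookup
top_term_doc0 = vocabulary[max(sparse_docs[0][1][1], key=sparse_docs[0][1][1].get)]
```

Each `(vector_size, counts)` pair has the shape that `Vectors.sparse(size, dict)` accepts, so the same logic could feed the Spark pipeline if a reproducible vocabulary order is wanted.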

In [178]:
def create_vector_from_document(document):
    """Map a (docIdx, words) pair to (docIdx, sparse term-count vector)."""
    vector = {}
    for word in document[1]:
        idx = termToIdx[word]
        vector[idx] = vector.get(idx, 0) + 1
    return (document[0], Vectors.sparse(len(all_words), vector))


# LDA.train expects an RDD of [document id, term-count vector] lists
term_count_matrix = courses_reduced_rdd.map(create_vector_from_document).map(list)

# Number of topics
k = 10

# LDA algorithm
lda = LDA.train(term_count_matrix, k=k, seed=0)

# Manually chosen labels for the 10 topics extracted by the LDA model
labels = [
    "Materials & Devices",
    "Machine Learning",
    "Mathematical analysis",
    "Systems & Signal Processing",
    "Project Planning",
    "Energy & Process Design",
    "Scientific Research in Cell Biology",
    "System Design in Chemical Modeling",
    "Chemistry & documentation",
    "Optical & Quantum Applications"
]

# Print the top words of each topic, mapping term indices back to words
def print_topics(model, termToIdx, words_per_topic=7):
    # Build the index -> word lookup once instead of once per word
    idxToTerm = list(termToIdx.keys())
    topics = model.describeTopics(words_per_topic)
    for i, topic in enumerate(topics):
        # topic is a (term indices, term weights) pair
        words = [idxToTerm[idx] for idx in topic[0]]
        print(f'Topics {i + 1 :>2}: {labels[i]} \n{words}\n')

print_topics(lda, termToIdx)

Topics  1: Materials & Devices 
['material', 'method', 'property', 'protein', 'system', 'device', 'application']

Topics  2: Machine Learning 
['method', 'data', 'learning', 'algorithm', 'theory', 'problem', 'programming']

Topics  3: Mathematical analysis 
['learning', 'model', 'time', 'method', 'management', 'analysis', 'stochastic']

Topics  4: Systems & Signal Processing 
['system', 'method', 'processing', 'control', 'signal', 'flow', 'learning']

Topics  5: Project Planning 
['design', 'group', 'work', 'project', 'method', 'plan', 'week']

Topics  6: Energy & Process Design 
['energy', 'learning', 'design', 'process', 'power', 'content', 'method']

Topics  7: Scientific Research in Cell Biology 
['project', 'report', 'cell', 'scientific', 'research', 'skill', 'biology']

Topics  8: System Design in Chemical Modeling 
['system', 'method', 'design', 'model', 'learning', 'chemical', 'exercise']

Topics  9: Chemistry & documentation 
['presentation', 'paper', 'chemistry', 'method', 'c