# Assignment 3 – Topic Modeling and Clustering for Online Social Media Data

*Due: Friday January 12 at 14:00 CET*

In the third assignment of the course Applications of Machine Learning (INFOB3APML), you will learn to use topic modeling and clustering to identify topics in online social media data. The objectives of this assignment are:
- understand and process text data
- use a clustering algorithm to determine clusters in real-life data
- use the Latent Dirichlet Allocation algorithm to identify discussed topics in real-life text data
- use visualization tools to validate the results of unsupervised learning and interpret your findings
- reflect on the differences between the two types of unsupervised learning algorithms

In this assignment, you are going to discover the different ‘topics’ from a real social media text dataset. The project is divided into two parts (4 subtasks):

- The first part contains data processing (1.1) and feature extraction (1.2) from the raw text data.
- In the second part, you will implement two methods (2.1), a topic modeling method and a clustering method, to identify topics from the processed data. You will then evaluate the results using visualization tools (2.2).

Provided files:
- The dataset: data/raw_data.txt
- A tutorial notebook showcasing some packages you could use for this assignment (optional): Ass3_tutorial.ipynb
- Sample visualization code for interpreting the topic results: viz_example.ipynb

In [1]:
import spacy
from spacy.lang.nl.examples import sentences
import io

nlp = spacy.load("nl_core_news_sm")
doc = nlp(sentences[0])
print(doc.text)
print()
for token in doc:
    print(token.text)
    # print(token.pos_)
    # print(token.dep_)
    # print('')

Apple overweegt om voor 1 miljard een U.K. startup te kopen

Apple
overweegt
om
voor
1
miljard
een
U.K.
startup
te
kopen


 ## Dataset:
 The data used in this assignment is Dutch text data. We collected COVID-19 crisis-related messages from online social media (Twitter) from January to November 2021 and then randomly sampled a subset of the raw tweets. In total, our dataset includes the text of about 100K messages. **To protect data privacy, please only use this dataset within the course.**

 ## 0. Before you start the Project: 
 The provided messages in the raw dataset were collected based on 10 different themes that relate to the COVID-19 crisis. Here is a list of all themes:
 -	Lockdown
 -	Face mask
 -	Social distancing
 -	Loneliness
 -	Happiness
 -	Vaccine
 -	Testing
 -  Curfew
 -  Covid entry pass
 -  Work from home

Before starting your project, you need to first filter the messages (all messages are in Dutch) and use the messages belonging to only one theme for the topic identification. 
 
If you have already submitted your theme preference, you can skip the following paragraph.

*Please note that at most two teams will work on the same theme. In this way, we hope that each group will develop its own dataset and come up with interesting results.*

 ## 1.1 Data Processing
 In the first part of the assignment, please first filter the messages and use the messages belonging to your allocated theme for the identification of topics. For that you will need to:
 -	Design your query (e.g. a regular expression or a set of keywords) and filter the related messages for your allocated theme. 
 -	Clean your filtered messages and preprocess them into the right representation. Please refer to the text data pre-processing and representation methods discussed in the lecture. You may use some of the recommended packages for text data preprocessing and representation.
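
As a rough illustration of the two steps above, the sketch below filters messages with a keyword-based regular expression and then applies some light cleaning. The keyword list, the helper names `matches_theme` and `clean_message`, and the cleaning rules are all assumptions made for this example; they are not a prescribed solution.

```python
import re

# Hypothetical keywords for the "Loneliness" theme; pick your own for your theme.
THEME_KEYWORDS = ["eenzaam", "alleen"]

# Case-insensitive pattern; the trailing \w* also matches longer forms
# such as "eenzaamheid".
THEME_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, THEME_KEYWORDS)) + r")\w*",
    flags=re.IGNORECASE,
)


def matches_theme(message: str) -> bool:
    """Return True if the message contains at least one theme keyword."""
    return THEME_PATTERN.search(message) is not None


def clean_message(message: str) -> str:
    """Lowercase a tweet and strip URLs, @mentions, '#' signs and punctuation."""
    text = message.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop @mentions
    text = text.replace("#", " ")              # keep the hashtag word, drop '#'
    text = re.sub(r"[^\w\s]", " ", text)       # drop punctuation and emoji
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace


messages = [
    "@VanMetje Ik voel me zo EENZAAM... https://t.co/abc #lockdown",
    "Morgen weer naar kantoor!",
]
filtered_messages = [clean_message(m) for m in messages if matches_theme(m)]
print(filtered_messages)
```

Filtering on the raw text and cleaning afterwards keeps the query independent of the cleaning rules, so either can be changed without affecting the other.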

In [2]:
# TODO: filter the related messages
RANDOM_SEED = 42
topic_words = ['Eenzaamheid', 'Thuis', 'depressie','verdrietig']

def phase0_open_txt_stream(filename):
    return io.open(filename, "r", encoding="utf-8")

def get_data(max_messages=-1):
    # Parse up to max_messages lines with spaCy; -1 reads the whole file.
    data = []
    # The context manager closes the file even if parsing fails.
    with phase0_open_txt_stream("../others/data/raw_data.txt") as pipe:
        for count, line in enumerate(pipe):
            if count == max_messages:
                break
            data.append(nlp(line))
    return data


In [3]:

data = get_data(1000)

In [4]:
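def ignore_token(token):
    return token.is_stop or token.is_space or token.is_punct or token.is_digit

# Keep every message that contains at least one topic word.
# Note: the comparison with token.text is case-sensitive.
filtered = []
for sentence in data:
    for token in sentence:
        if not ignore_token(token):
            if token.text in topic_words:
                filtered.append(sentence)
                # Stop at the first match so a message is added only once.
                break
filtered, len(filtered)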

([@VanMetje Heel. Erg. Ik heb me zelden zo verdrietig en alleen gevoeld. Wilde keihard schreeuwen tijdens de dienst. Maar dat toch maar niet gedaan. Maar het was hemeltergend.,
  Eenzaam is misschien niet het goede woord, ben niet heel verdrietig en kijk juist best wel op tegen weer moeten afspreken met mensen. Voel me gewoon even helemaal gedistantieerd van jullie. Naar gevoel.,
  We hadden de hoop vandaag naar mijn moeder te kunnen ivm haar verjaardag. Helaas hebben wij nog steeds wat klachten en twijfelt manlief over zijn reuk. Dus toch maar thuisblijven. Hopelijk vals alarm en kunnen we volgend weekend alsnog. #blijfthuis #verdrietig,
  Bewijslast ligt al maanden bij wie versoepelingen wil. En nu is elke ontwikkeling wel een reden om het huidige regime te behouden of te verstrengen. De nadelen van een lockdown zijn moeilijker meetbaar, dus die negeren we.\n\nVermijd een depressie - kijk naar de rest van de wereld.],
 4)

In [5]:
# TODO: clean and preprocess the messages


In [6]:

# TODO: represent the messages in formats that can be used by clustering or LDA algorithms (you may need a different representation for each algorithm)



 ## 1.2 Exploratory Data Analysis
 After preprocessing the data, create at least 2 figures or tables that help you understand the data.

 While exploring the data, you may also think about questions such as:
 - Can you spot any differences between Twitter data and usual text data?
 - Does your exploration reveal some issues that would make it difficult to interpret the topics?
 - Can you improve the data by adding additional preprocessing steps?
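
One possible starting point for such tables is sketched below, using only the standard library. The `cleaned_messages` list is made-up example data, and whitespace tokenisation is an assumption; with the real data you would use the tokens produced by your own preprocessing.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical cleaned messages; replace with your own preprocessed data.
cleaned_messages = [
    "ik voel me zo eenzaam lockdown",
    "weer een dag alleen thuis lockdown",
    "eenzaam maar gezond",
]

# Simple whitespace tokenisation (an assumption for this sketch).
tokenized = [message.split() for message in cleaned_messages]

# Table 1: message length in tokens.
lengths = [len(tokens) for tokens in tokenized]
print(f"messages: {len(lengths)}")
print(f"mean length: {mean(lengths):.1f} tokens, median: {median(lengths)} tokens")

# Table 2: the most frequent tokens across the whole corpus.
token_counts = Counter(token for tokens in tokenized for token in tokens)
for token, count in token_counts.most_common(5):
    print(f"{token:<12}{count}")
```

Very short messages and a frequency table dominated by a handful of tokens are typical signs that extra preprocessing (for example a custom stop-word list) may be worthwhile.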

In [7]:
# TODO: plot figure(s)


## 2.1 Topic modelling and clustering
 In the second part of the assignment, you will first:
 -	Implement a Latent Dirichlet Allocation (LDA) algorithm to identify the discussed topics for your theme
 -	Implement a clustering method  to cluster messages into different groups, then represent the topic of each cluster using a bag of words

While implementing the algorithms, you may use code from the recommended packages. In the final report, please explain why you selected the algorithm/package you used.

In [35]:
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

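def get_term_document_matrix(random=True):
    # Placeholder data until the real document-term matrix is available.
    if not random:
        raise NotImplementedError("Build the matrix from the preprocessed messages.")
    np.random.seed(RANDOM_SEED)
    # 1000 documents with 35 terms each
    return np.random.rand(1000, 35)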

In [48]:
# TODO: topic modeling using the LDA algorithm

td_matrix = get_term_document_matrix()
lda = LatentDirichletAllocation(n_components=5, random_state=RANDOM_SEED)

# Print the same slice before and after fitting to check that the input matrix is left unchanged.
print(td_matrix[1,:5])
y = lda.fit_transform(td_matrix)
print(td_matrix[1,:5])

print("matrix shape:", td_matrix.shape)
print("transformed shape: ",y.shape)
print("example document 1:",y[1])

[0.80839735 0.30461377 0.09767211 0.68423303 0.44015249]
[0.80839735 0.30461377 0.09767211 0.68423303 0.44015249]
matrix shape: (1000, 35)
transformed shape:  (1000, 5)
example document 1: [0.11240754 0.85288605 0.01157149 0.01156845 0.01156646]



In [57]:
# TODO: cluster the messages using a clustering algorithm

from sklearn.cluster import KMeans

# Cluster the documents in LDA topic space.
# n_init is set explicitly to avoid the FutureWarning about its changing default.
kmeans = KMeans(random_state=RANDOM_SEED, n_clusters=8, n_init=10)
cluster_labels = kmeans.fit_predict(y)
cluster_labels


array([4, 0, 2, 5, 3, 5, 5, 4, 1, 1, 1, 0, 1, 0, 0, 2, 1, 5, 5, 0, 5, 0,
       1, 2, 5, 5, 0, 6, 5, 1, 2, 0, 5, 2, 0, 2, 4, 0, 1, 0, 6, 0, 2, 1,
       7, 4, 5, 4, 0, 4, 1, 4, 0, 0, 0, 0, 0, 4, 4, 0, 1, 2, 0, 2, 1, 2,
       2, 5, 4, 1, 5, 4, 4, 0, 2, 0, 2, 1, 0, 1, 2, 2, 2, 5, 1, 5, 2, 2,
       1, 1, 0, 7, 0, 4, 4, 2, 4, 4, 5, 6, 2, 0, 5, 2, 4, 1, 4, 0, 1, 3,
       0, 0, 4, 2, 2, 2, 4, 5, 0, 5, 1, 0, 1, 5, 2, 5, 4, 1, 2, 0, 1, 0,
       1, 5, 0, 4, 2, 4, 5, 2, 0, 4, 0, 0, 1, 1, 6, 2, 1, 4, 4, 4, 2, 0,
       4, 2, 5, 5, 2, 1, 1, 5, 1, 5, 5, 0, 5, 7, 0, 0, 0, 4, 5, 2, 5, 1,
       1, 0, 2, 5, 1, 0, 5, 1, 7, 2, 4, 3, 4, 4, 1, 5, 1, 5, 4, 1, 4, 1,
       4, 5, 4, 1, 5, 2, 2, 5, 2, 7, 1, 1, 0, 1, 2, 4, 4, 4, 2, 1, 2, 2,
       3, 1, 4, 0, 1, 5, 7, 2, 0, 1, 7, 5, 6, 0, 1, 1, 5, 1, 0, 7, 2, 1,
       1, 1, 2, 2, 2, 4, 0, 5, 1, 0, 1, 2, 4, 0, 0, 2, 4, 2, 5, 1, 3, 2,
       7, 7, 4, 1, 0, 0, 1, 0, 5, 1, 4, 5, 2, 6, 4, 0, 1, 2, 1, 5, 1, 0,
       1, 1, 0, 2, 0, 2, 2, 1, 0, 0, 4, 4, 4, 5, 0,

 ## 2.2 Results, evaluation and Interpretation 
 
Finally, you will describe, evaluate and interpret your findings from the two methods.

- In the report, describe and discuss the similarities and differences between the results of the two methods.
- Human judgment is very important when evaluating the results, so visualization techniques are helpful for evaluating the identified topics in an interpretable manner.
    
1. To evaluate the topic modelling algorithm, please first use the interactive tool **[pyLDAvis](https://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb#topic=0&lambda=1&term=)** to examine the inter-topic separation of your findings.

2. To interpret the identified topics / clusters of both algorithms, we provide example code for several visualization techniques. You can use several of them to evaluate your results or come up with visualisations of your own. The provided files contain examples of how to use the visualisation functions.


In [11]:
# TODO: evaluation 


# Bonus Tasks 

We would like to challenge you with the following bonus tasks. For each task that is successfully completed, you may obtain at most 1 extra point.

1. Implement another clustering algorithm or design your own clustering algorithm. Discuss your findings and explain why this is a better (or worse) clustering algorithm than the above one (the clustering algorithm, not LDA).

2. Can you think of evaluation methods other than the provided visualization techniques? If so, implement one and explain why it is a good evaluation for our task.
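
As one example of a non-visual evaluation, the sketch below computes a document co-occurrence coherence score for a topic's top words, in the style often called UMass coherence. The function name `umass_coherence`, the toy documents, and the decision to skip words that never occur are assumptions for this illustration; other coherence variants and smoothing choices exist.

```python
import math


def umass_coherence(top_words, tokenized_docs):
    """Sum log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs of top words.

    top_words must be ordered from most to least probable in the topic.
    D(...) counts the documents that contain all of the given words.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(word in doc for word in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denominator = doc_freq(top_words[l])
            if denominator == 0:
                continue  # assumption: skip words that never occur in the corpus
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / denominator)
    return score


# Hypothetical tokenised messages and two candidate topics.
docs = [
    ["eenzaam", "thuis", "lockdown"],
    ["eenzaam", "thuis"],
    ["vaccin", "prik"],
    ["vaccin", "prik", "afspraak"],
]
coherent_topic = ["eenzaam", "thuis"]
mixed_topic = ["eenzaam", "vaccin"]

print(umass_coherence(coherent_topic, docs))
print(umass_coherence(mixed_topic, docs))
```

Higher (less negative) scores indicate that a topic's top words tend to appear in the same messages, which gives a numeric complement to visual inspection.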