# Assignment 3 – Topic Modeling and Clustering for Online Social Media Data

*Due: Friday January 12 at 14:00 CET*

In the third assignment of the course Applications of Machine Learning (INFOB3APML), you will learn to use topic modeling and clustering to identify topics in online social media data. The objectives of this assignment are to:
- understand and process text data
- use a clustering algorithm to find clusters in real-life data
- use the Latent Dirichlet Allocation algorithm to identify the topics discussed in real-life text data
- use visualization tools to validate the results of unsupervised learning and interpret your findings
- reflect on the differences between the two types of unsupervised learning algorithms

In this assignment, you will discover the different ‘topics’ in a real social media text dataset. The project is divided into two parts (4 subtasks):

- The first part covers data processing (1.1) and feature extraction (1.2) from the raw text data.
- In the second part, you will implement two methods (2.1), a topic modeling method and a clustering method, to identify topics in the processed data. You will then evaluate the results using visualization tools (2.2).

Provided files:
- The dataset: data/raw_data.txt
- An optional tutorial notebook that showcases some packages you could use for this assignment: Ass3_tutorial.ipynb
- Sample visualization code for interpreting the topic results: viz_example.ipynb

In [106]:
import io

# TODO: import the packages
import spacy
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

import numpy as np

In [107]:
# load spacy's nl_core_news_sm language model
nlp = spacy.load("nl_core_news_sm")

In [113]:
# read the data

def phase0_open_txt_stream(filename):
    return io.open(filename, "r", encoding="utf-8")

def load_data(max_messages=-1):
    # max_messages=-1 reads the whole file (renamed to avoid shadowing the builtin max)
    pipe = phase0_open_txt_stream("../others/data/raw_data.txt")
    data = []
    cont = 0

    while cont != max_messages:
        try: 
            raw_text = next(pipe).replace("\\n", "")
        except StopIteration:
            # break if end of file is reached
            pipe.close()
            break

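        # remove ill-formatted text: URLs and HTML entities (e.g. &amp;)
        # each pattern stops at the end of the URL / entity, so the rest of the message is kept
        url_pattern = r"https?://\S+"
        html_pattern = r"&#?\w+;"
        raw_text = re.sub(url_pattern, " ", raw_text)
        raw_text = re.sub(html_pattern, " ", raw_text)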
        raw_text = raw_text.replace(".", " ")\
                           .replace("?", " ")\
                           .replace("(", " ")\
                           .replace(")", " ")\
                           .replace(",", " ")\
                           .replace("!", " ")
        
        sentence = nlp(raw_text)
        data.append(sentence)
        cont += 1
    pipe.close()
    return data

data = load_data(-1)

 ## 0. Before you start the project
 The messages in the raw dataset were collected for 10 different themes related to the COVID-19 crisis. Here is a list of all themes:
 - Lockdown
 - Face mask
 - Social distancing
 - Loneliness
 - Happiness
 - Vaccine
 - Testing
 - Curfew
 - Covid entry pass
 - Work from home

Before starting your project, you first need to filter the messages (all messages are in Dutch) and use only the messages belonging to one theme for topic identification.

If you have already submitted your theme preference, you can skip the following paragraph.

*Please note that at most two teams will work on the same theme. In this way, we hope that each group will develop its own dataset and come up with interesting results.*

 ## 1.1 Data Processing
 In the first part of the assignment, filter the messages and use those belonging to your allocated theme to identify topics. To do so, you will need to:
 - Design your query (e.g. a regular expression or a set of keywords) and filter the messages related to your allocated theme.
 - Clean your filtered messages and preprocess them into a suitable representation. Refer to the text preprocessing and representation methods discussed in the lecture. You may use some of the recommended packages for text preprocessing and representation.

In [114]:
# TODO: filter the related messages
# TOPIC : Loneliness
topic_words = [
    "alleen",         # alone
    "geïsoleerd",     # isolated
    "isolatie",       # isolation
    "kluizenaar",     # hermit
    "eenzaamheid",    # solitude
    "eenzaam",        # lonely
    "virtueel",       # virtual
    "knuffel",        # hug
    "thuis",          # at home
    # "avondklok",      # curfew
]

data_filtered = []
for sentence in data:
    for token in sentence:
        if token.lemma_.lower() in topic_words:
            data_filtered.append(sentence)
            break

print(f"filtered {len(data_filtered)} messages related to loneliness")

filtered 10384 messages related to loneliness


In [115]:
# TODO: clean and preprocess the messages
#  - remove URL, mentions, numbers, hashtags, ?emojis?
# raw strings avoid invalid escape sequence warnings for \d and \.
url_pattern = r"^http(s?)://.+"
mention_pattern = r"^@.+"
number_pattern = r"(\d+,\d+)|(\d+\.\d+)|(\d+%)|(^\d+.*)|(\+\d+)"
emoji_pattern = "[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF\U00002702-\U000027B0\U00002000-\U000020ff]"

allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']

def match_pattern(pattern, text):
    match = re.match(pattern, text)
    return bool(match)

def generate_hash(sentence):
    text = "".join([token.text for token in sentence])
    return hash(text)


data_processed = {}
sentence_id = 0
for sentence in data_filtered:
    sentence_processed = []

    for token in sentence:

        #  - remove URL, mentions, emojis, 
        if match_pattern(url_pattern, token.text) or \
           match_pattern(mention_pattern, token.text) or \
           match_pattern(emoji_pattern, token.text):
            continue
        
        # - remove numbers, punctuation 
        if token.is_punct or token.is_space or token.is_digit or \
           match_pattern(number_pattern, token.text) :
            continue
        
        #  - remove stop words
        if token.is_stop:
            continue

        #  - keep only content words (nouns, adjectives, verbs, adverbs)
        if token.pos_ not in allowed_postags:
            continue

        sentence_processed.append(token)
    
    # add preprocessed sentence in a way that prevents duplicate messages
    key = generate_hash(sentence_processed)
    if key not in data_processed:
        data_processed[key] = {
            "id" : sentence_id,
            "sentence" : sentence,
            "bow": sentence_processed,
            #  - lemmatization (lowercased lemmas, no stemming is applied)
            "lemmas" : [token.lemma_.lower() for token in sentence_processed]
        }
        sentence_id += 1

print(f"preprocessed {len(data_processed)} unique messages related to loneliness")
for sentence in data_processed.values():
    print(sentence)

preprocessed 8416 unique messages related to loneliness
{'id': 0, 'sentence': Gelukkig en gezond 2021 toegewenst  met perspectief op betere tijden  hier samen lockdown thuis   een virtuele knuffel voor mijn volgers  
, 'bow': [Gelukkig, gezond, toegewenst, perspectief, tijden, samen, lockdown, thuis, virtuele, knuffel, volgers], 'lemmas': ['gelukkig', 'gezond', 'toegewenst', 'perspectief', 'tijd', 'samen', 'lockdown', 'thuis', 'virtueel', 'knuffel', 'volger']}
{'id': 1, 'sentence': @thesarge671 @LodewijkA Ik heb mij  binnen de normen  ontfermd over negen alleenstaande cq eenzame bejaarden die nog op zichzelf wonen  Boodschappengedaan en voor hen gekookt  Allemaal lapwerk wat in een land  als Nederland  niet hoort  We hebben dus beide niets  gedaan aan een structurele oplossing 
, 'bow': [normen, ontfermd, alleenstaande, cq, eenzame, bejaarden, wonen, gekookt, Allemaal, lapwerk, land, hoort, gedaan, structurele, oplossing], 'lemmas': ['norm', 'ontfermen', 'alleenstaan', 'cq', 'eenzaam',

In [116]:
# TODO: represent the messages in formats that can be used by the clustering or LDA algorithms (you may need a different representation for each algorithm)
# generate document term frequency matrix that is going to be used by LDA
documents = [" ".join(sentence["lemmas"]) for sentence in data_processed.values()]

vectorizer = CountVectorizer()
dtf_matrix = vectorizer.fit_transform(documents)
terms = vectorizer.get_feature_names_out()
dtf_df = pd.DataFrame(dtf_matrix.toarray(), columns=terms)

dtf_df.to_csv('document_term_frequency.csv', index=False)

# generate term frequency-inverse document frequency matrix that is going to be used by kmeans
vectorizer = TfidfVectorizer()

tf_df_matrix = vectorizer.fit_transform(documents) 
terms = vectorizer.get_feature_names_out()
tf_df_df = pd.DataFrame(tf_df_matrix.toarray(), columns=terms)

tf_df_df.to_csv('tf_df_document_vectors.csv', index=False)

# spacy vector representation for tokens
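for key in data_processed:
    entry = data_processed[key]
    vectors = [token.vector for token in entry["bow"]]
    # a message whose tokens were all filtered out has no vectors to average;
    # use a zero vector of the same size instead of the NaN that np.mean returns for an empty list
    entry["doc_vector"] = np.mean(vectors, axis=0) if vectors else np.zeros_like(entry["sentence"].vector)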

word_embeddings_df = pd.DataFrame.from_dict(data_processed, orient='index')
word_embeddings_df = word_embeddings_df.drop(['bow', 'lemmas'], axis="columns")

word_embeddings_df.to_csv('word_embeddings_document_vectors.csv', index=False)



 ## 1.2 Exploratory Data Analysis
 After preprocessing the data, create at least 2 figures or tables that help you understand the data.

 While exploring the data, you may also think about questions such as:
 - Can you spot any differences between Twitter data and ordinary text data?
 - Does your exploration reveal any issues that would make it difficult to interpret the topics?
 - Can you improve the data by adding further preprocessing steps?
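
As a starting point for the exploration, the following minimal sketch builds two simple tables: the most frequent lemmas and a summary of message lengths. The `sample_lemmas` list is a hypothetical stand-in for the lemma lists produced during preprocessing; the choice of tables is only one possible option, not a required one.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the per-message lemma lists built during preprocessing.
sample_lemmas = [
    ["eenzaam", "thuis", "lockdown"],
    ["thuis", "werken", "virtueel", "knuffel"],
    ["eenzaam", "oudere", "thuis"],
]

# Table 1: the most frequent lemmas across all messages.
term_counts = Counter(lemma for lemmas in sample_lemmas for lemma in lemmas)
top_terms = pd.DataFrame(term_counts.most_common(5), columns=["lemma", "count"])

# Table 2: summary statistics of message length (tokens left after preprocessing).
lengths = pd.Series([len(lemmas) for lemmas in sample_lemmas], name="tokens_per_message")
length_summary = lengths.describe()

print(top_terms)
print(length_summary)
```

Very frequent lemmas that appear in almost every message (such as the query keywords themselves) are candidates for removal, since they are unlikely to help separate topics.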

In [14]:
# TODO: plot figure(s)


## 2.1 Topic modelling and clustering
 In the second part of the assignment, you will:
 -	Implement a Latent Dirichlet Allocation (LDA) algorithm to identify the topics discussed within your theme
 -	Implement a clustering method to cluster the messages into groups, then represent the topic of each cluster using a bag of words

While implementing the algorithms, you may use code from the recommended packages. In the final report, please explain why you selected the algorithm/package you used.

In [20]:
# TODO: topic modeling using the LDA algorithm


In [21]:
# TODO: cluster the messages using a clustering algorithm


 ## 2.2 Results, Evaluation and Interpretation
 
Finally, you will describe, evaluate and interpret the findings from the two methods.

- In the report, describe and discuss the similarities and differences between the results of the two methods.
- Human judgment is very important when evaluating the results, so visualization techniques help you assess the identified topics in an interpretable manner.
    
1. To evaluate the topic modelling algorithm, first use the interactive tool **[pyLDAvis](https://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb#topic=0&lambda=1&term=)** to examine the inter-topic separation of your findings.

2. To interpret the topics / clusters identified by both algorithms, we provide example code for several visualization techniques. You can use several of them to evaluate your results or come up with your own visualizations. The provided files contain examples of how to use the visualization functions.


In [1]:
# TODO: evaluation 


# Bonus Tasks 

We would like to challenge you with the following bonus tasks. For each task that is successfully completed, you can earn at most 1 extra point.

1. Implement another clustering algorithm or design your own clustering algorithm. Discuss your findings and explain why this is a better (or worse) clustering algorithm than the above one (the clustering algorithm, not LDA).

2. Can you think of other evaluation methods than the provided visualization techniques? If so, implement one and explain why it is a good evaluation for our task.
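
One quantitative option for the clustering results is the silhouette score, which compares how close each message is to its own cluster versus the nearest other cluster (values range from -1 to 1, higher is better). The sketch below is only an illustration under assumed toy data and candidate cluster counts; it is not the required evaluation method.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy documents forming three loose groups.
docs = [
    "eenzaam thuis lockdown",
    "eenzaam thuis oudere",
    "vaccin prik afspraak",
    "vaccin prik wachten",
    "mondkapje winkel verplicht",
    "mondkapje trein verplicht",
]

# Dense array for this tiny example; a real corpus would stay sparse.
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

# Compare several candidate numbers of clusters (an arbitrary range for this sketch).
scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(tfidf)
    scores[k] = silhouette_score(tfidf, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

A silhouette score only measures geometric separation, so it complements rather than replaces a manual inspection of the top words per cluster.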