# Assignment 3 – Topic Modeling and Clustering for Online Social Media Data

*Due: Friday January 12 at 14:00 CET*

In the third assignment of the course Applications of Machine Learning (INFOB3APML), you will learn to use topic modeling and clustering to identify topics in online social media data. The objectives of this assignment are:
- understand and process the text data
- use a clustering algorithm to determine clusters in real-life data
- use the Latent Dirichlet Allocation algorithm to identify discussed topics in real-life text data
- use visualization tools to validate the results of unsupervised learning and interpret your findings
- reflect on the differences between the two types of unsupervised learning algorithms

In this assignment, you are going to discover the different ‘topics’ from a real social media text dataset. The project is divided into two parts (4 subtasks):

- The first part contains data processing (1.1) and feature extraction (1.2) from the raw text data.
- In the second part, you will implement two methods (2.1), a topic modeling method and a clustering method, to identify topics from the processed data. Then, the evaluation will be done by using visualization tools (2.2). 

Provided files:
- The dataset: data/raw_data.txt
- A tutorial notebook that showcases some packages you could use for this assignment (optional): Ass3_tutorial.ipynb
- Sample visualization code for interpreting the topic results: viz_example.ipynb

In [250]:
import spacy
from spacy.lang.nl.examples import sentences
import io

nlp = spacy.load("nl_core_news_sm")
doc = nlp(sentences[0])
print(doc.text)
print()
for token in doc:
    print(token.text)
    # print(token.pos_)
    # print(token.dep_)
    # print('')

Apple overweegt om voor 1 miljard een U.K. startup te kopen

Apple
overweegt
om
voor
1
miljard
een
U.K.
startup
te
kopen


 ## Dataset:
 The data used in this assignment is Dutch text data. We collected COVID-19 crisis-related messages from online social media (Twitter) posted between January and November 2021. Then, a subset of the raw tweets was randomly sampled. In total, our dataset includes the text of about 100K messages. **To protect data privacy, please only use this dataset within the course.**

 ## 0. Before you start the Project: 
 The provided messages in the raw dataset were collected based on 10 different themes that relate to the COVID-19 crisis. Here is a list of all themes:
 - Lockdown
 - Face mask
 - Social distancing
 - Loneliness
 - Happiness
 - Vaccine
 - Testing
 - Curfew
 - Covid entry pass
 - Work from home

Before starting your project, you need to first filter the messages (all messages are in Dutch) and use the messages belonging to only one theme for the topic identification. 
 
If you have already submitted your theme preference, you can skip the following paragraph.

*Please note that at most two teams will work on the same theme. In this way, we hope that each group will develop their own dataset and come up with interesting results.*

 ## 1.1 Data Processing
 In the first part of the assignment, please first filter the messages and use the messages belonging to your allocated theme for the identification of topics. For that you will need to:
 -	Design your query (e.g. a regular expression or a set of keywords) and filter the related messages for your allocated theme. 
 -	Clean your filtered messages and preprocess them into the right representation. Please refer to the text data pre-processing and representation methods discussed in the lecture. You may use some of the recommended packages for text data preprocessing and representation.
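
The query step above can be sketched with a keyword-based regular expression. This is a minimal illustration, not a prescribed solution: the keyword list, the helper `filter_messages`, and the sample messages are assumptions made up for the "Loneliness" theme.

```python
import re

# Hypothetical keyword list for the "Loneliness" theme; adapt it to your own theme.
theme_keywords = ["eenzaam", "isolement"]

# \b anchors the keyword at a word start; \w* also accepts longer forms,
# so "eenzaam" matches "eenzaamheid" as well. Matching is case-insensitive.
theme_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in theme_keywords) + r")\w*",
    flags=re.IGNORECASE,
)

def filter_messages(messages, pattern=theme_pattern):
    """Return only the messages that contain at least one theme keyword."""
    return [m for m in messages if pattern.search(m)]

# Invented sample messages, not taken from the dataset.
sample = [
    "Ik voel me zo eenzaam tijdens deze lockdown",
    "Gelukkig nieuwjaar allemaal!",
    "Eenzaamheid onder ouderen neemt toe",
]
filtered = filter_messages(sample)
print(filtered)
```

A keyword list is easy to inspect and extend, but it will miss messages that discuss the theme without using any of the listed words, so it is worth checking a sample of rejected messages by hand.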

In [251]:
# TODO: filter the related messages
RANDOM_SEED = 42
topic_words = ['Eenzaamheid', 'Thuis', 'depressie', 'verdrietig']

def phase0_open_txt_stream(filename):
    return io.open(filename, "r", encoding="utf-8")

def get_data(max_docs=-1):
    """Parse up to max_docs lines with spaCy; max_docs=-1 reads the whole file."""
    data = []
    # The context manager closes the file even if parsing fails,
    # and iterating over the file stops cleanly at the end of the data.
    with phase0_open_txt_stream("../others/data/raw_data.txt") as pipe:
        for count, line in enumerate(pipe):
            if count == max_docs:
                break
            data.append(nlp(line))
    return data


In [252]:
data = get_data(1000)

In [253]:
# Each element of data is a spaCy Doc; keep only its raw text
raw_data = [doc.text for doc in data]
print(raw_data)


["Hahah, het verzet is begonnen. Het knalt hier op z'n best hoor. Voor mijn dieren vind ik het erg, maar f@ck die maffe regering hier. Gelukkig nieuwjaar!\n", 'RT @D66Vught: Het is 2021! https://t.co/LpPuFPuqR8\n', '@MijumewAndCo Happy new year mij!\n', 'Fantastisch dat ik zoveel vuurwerk hoor..... We worden echt wakker yes. En het boeit mij niet wat een ander er van denkt.\\nOp naar een Great Awakening en happy new year 😃🎉🙏❤️\n', 'Gelukkig nieuwjaar allemaal!!!! Binnen exact een maand ben ik jarig en ik hoop dat ik dan eindelijk birthdaySEX kan hebben xxx\n', 'Iedereen in de wereld krijgt aftel momenten bij bruggen, hoge torens of een drone show.\\nSchitterende beelden.\\nin Nederland krijgen we dat afschuwelijke stadion in duivedrecht\\ntezien, en dan vinden we het gek dat iedereen de straat op gaat..\\nAlsof die lockdown niet voldoende is 🙉\n', 'Gelukkig 2021 iedereen! Maar please hou het veilig, want een nieuw jaar betekend niet dat die’n Corona ineens foetsie is eh...\n', 'Ik wens

In [254]:
# TODO: clean and preprocess the messages


In [255]:

# TODO: represent the messages in formats that can be used by the clustering and LDA algorithms (you may need a different representation for each algorithm)



 ## 1.2 Exploratory Data Analysis
 After preprocessing the data, create at least 2 figures or tables that help you understand the data.

 While exploring the data, you may also think about questions such as:
 - Can you spot any differences between Twitter data and usual text data?
 - Does your exploration reveal some issues that would make it difficult to interpret the topics?
 - Can you improve the data by adding additional preprocessing steps?
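
One possible additional preprocessing step is to strip Twitter-specific noise before vectorizing. The sketch below is an assumption about what such a step could look like, not the required cleaning pipeline; the helper `clean_tweet` and its regular expressions are hypothetical. The example tweet is one of the messages printed in section 1.1.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+:?")      # @user, optionally followed by a colon
RETWEET_RE = re.compile(r"^RT\s+")      # retweet marker at the start of a message

def clean_tweet(text):
    """Remove retweet markers, URLs and mentions, then normalize whitespace and case."""
    text = text.replace("\\n", " ")     # literal backslash-n sequences found in the raw file
    text = RETWEET_RE.sub("", text)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = re.sub(r"\s+", " ", text)    # collapse runs of whitespace, including newlines
    return text.strip().lower()

example = "RT @D66Vught: Het is 2021! https://t.co/LpPuFPuqR8\n"
cleaned = clean_tweet(example)
print(cleaned)
```

Removing URLs early matters for this dataset because shortened links otherwise contribute tokens such as `https` to the vocabulary.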

In [256]:
# TODO: plot figure(s)


## 2.1 Topic modelling and clustering
 In the second part of the assignment, you will first:
 -	Implement a Latent Dirichlet Allocation (LDA) algorithm to identify the discussed topics for your theme
 -	Implement a clustering method  to cluster messages into different groups, then represent the topic of each cluster using a bag of words

While implementing the algorithms, you may use code from the recommended packages. In the final report, please explain why you selected the algorithm/package you used.

In [257]:
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

dutch_stopwords = [
    'aan', 'als', 'bij', 'dan', 'dat', 'die', 'dit', 'een', 'en', 'er', 'het',
    'hij', 'hoe', 'hun', 'ik', 'in', 'is', 'je', 'kan', 'maar', 'met', 'mij', 'niet', 'nog',
    'nu', 'of', 'ons', 'ook', 'te', 'tot', 'uit', 'van', 'voor', 'was', 'wat',
    'we', 'wel', 'wij', 'zal', 'ze', 'zei', 'zelf', 'zich', 'zo', 'zij'
]

def get_term_document_matrix(random=None):
    if not random:
        np.random.seed(RANDOM_SEED)
        # 1000 documents with 35 words each
        return np.random.rand(1000,35)
    if random == 'tmp':
        from sklearn.feature_extraction.text import CountVectorizer
        tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                        lowercase = True,
                                        stop_words = dutch_stopwords,
                                        token_pattern = r'\b[a-zA-Z]{3,}\b',
                                        max_df = 0.5, 
                                        min_df = 10)
        document_term_matrix = tf_vectorizer.fit_transform(raw_data)
        print(document_term_matrix.shape, len(tf_vectorizer.vocabulary_))
        return document_term_matrix, tf_vectorizer
    
    
def get_word_vector_matrix(random = None):
    if not random:
        np.random.seed(RANDOM_SEED)
        return np.random.rand(3466,4)
        

def get_document_vector_matrix(random = None):
    if not random:
        np.random.seed(RANDOM_SEED)
        return np.random.rand(1000,4)

In [258]:
# TODO: topic modeling using the LDA algorithm

dtm_tf, tf_vectorizer = get_term_document_matrix(random='tmp')
lda_tf = LatentDirichletAllocation(n_components=5, random_state=RANDOM_SEED)

print(dtm_tf[1,:5])
doc_topic = lda_tf.fit_transform(dtm_tf)
print(dtm_tf[1,:5])

print("matrix shape:", dtm_tf.shape)
print("transformed shape: ",doc_topic.shape)
print("example document 1:",doc_topic[1])

(1000, 302) 302


matrix shape: (1000, 302)
transformed shape:  (1000, 5)
example document 1: [0.10206998 0.59529447 0.10000018 0.10209207 0.10054329]
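
The document-topic vector above only shows how strongly a message loads on each topic. To see what a topic is about, one option is to list its highest-weighted words. The sketch below uses NumPy only; the helper `top_words_per_topic` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def top_words_per_topic(components, feature_names, n_top=10):
    """Return the n_top highest-weighted words for every topic (one row per topic)."""
    feature_names = np.asarray(feature_names)
    # argsort is ascending, so reverse each row and keep the first n_top columns
    top_idx = np.argsort(components, axis=1)[:, ::-1][:, :n_top]
    return [feature_names[row].tolist() for row in top_idx]

# Toy topic-word matrix (2 topics x 4 words), invented for illustration.
toy_components = np.array([[5.0, 0.1, 3.0, 0.2],
                           [0.3, 4.0, 0.1, 2.0]])
toy_vocab = ["lockdown", "vaccin", "avondklok", "prik"]
toy_topics = top_words_per_topic(toy_components, toy_vocab, n_top=2)
print(toy_topics)

# In this notebook the same helper could be called as:
# top_words_per_topic(lda_tf.components_, tf_vectorizer.get_feature_names_out())
```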


#### Gensim Method


In [259]:
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents (whitespace tokenization).
matrix_sentences = [el.split(" ") for el in raw_data]
dictionary = Dictionary(documents=matrix_sentences)

# Remove rare and common tokens: drop words that occur in fewer than 20 documents
# or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)
print(dictionary)

Dictionary<152 unique tokens: ['Gelukkig', 'Het', 'Voor', 'die', 'het']...>


In [260]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in matrix_sentences]

In [261]:
dictionary.id2token

{}

In [262]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 5
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every,
    random_state=RANDOM_SEED
)

In [263]:
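import numpy as np

def build_vis_data(model, corpus, dictionary):
    """Collect the gensim LDA outputs in the layout that pyLDAvis.prepare expects."""
    # inference() returns the unnormalized document-topic weights (gamma)
    gamma, _ = model.inference(corpus)
    # Count how often each dictionary term occurs in the whole corpus
    term_frequency = np.zeros(len(dictionary))
    for bow in corpus:
        for token_id, count in bow:
            term_frequency[token_id] += count
    return {'topic_term_dists': model.get_topics(),
            'doc_topic_dists': gamma / gamma.sum(axis=1, keepdims=True),
            'doc_lengths': [sum(count for _, count in bow) for bow in corpus],
            'vocab': [dictionary[i] for i in range(len(dictionary))],
            'term_frequency': term_frequency}

model_data = build_vis_data(model, corpus, dictionary)

print('Topic-Term shape: %s' % str(np.array(model_data['topic_term_dists']).shape))
print('Doc-Topic shape: %s' % str(np.array(model_data['doc_topic_dists']).shape))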

### Clustering

In [None]:
word_vector_matrx = get_word_vector_matrix()
word_vector_matrx.shape

(3466, 4)

In [None]:
document_vector_matrx = get_document_vector_matrix()
document_vector_matrx.shape

(1000, 4)

In [None]:
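# TODO: cluster the messages using a clustering algorithm

from sklearn.cluster import KMeans

# n_clusters=8 is the scikit-learn default; n_init is set explicitly to avoid the
# FutureWarning, and random_state makes the cluster assignment reproducible.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=RANDOM_SEED)

clustrer_labels = kmeans.fit_predict(document_vector_matrx)
clustrer_labels.shape, max(clustrer_labels)
kmeans.cluster_centers_[0].shape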


(4,)

In [None]:
from sklearn.neighbors import NearestNeighbors

def get_k_nearest_words(cluster_centers, word_vectors, k=5):
    """Return, for each cluster center, the row indices of its k nearest word vectors."""
    # Centers and word vectors must live in the same embedding space
    assert cluster_centers.shape[1] == word_vectors.shape[1]
    print("Cluster centers shape:", cluster_centers.shape)
    print("Words2vec shape: ", word_vectors.shape)

    # Fit the nearest-neighbors index on the word vectors
    nbrs = NearestNeighbors(n_neighbors=k, algorithm='auto').fit(word_vectors)

    # Find the k nearest word vectors for each cluster center
    _, indices = nbrs.kneighbors(cluster_centers)
    return indices

get_k_nearest_words(kmeans.cluster_centers_, word_vector_matrx)


Cluster centers shape: (8, 4)
Words2vec shape:  (3466, 4)


array([[2775, 3026, 1143, 3071, 2875],
       [ 359,  735, 1239, 1937,  993],
       [2221, 1570,   69, 1087, 2702],
       [1975,  905, 2054, 1240,  232],
       [ 517,   21, 1641, 3192, 1249],
       [3161, 3163,  487, 1305, 3313],
       [1508, 1003,  221,  840, 2350],
       [3447, 2128, 3201, 1555,  761]])

In [344]:
dtm_tf.toarray().nbytes / 1e6

2.416

In [347]:
dense = dtm_tf.toarray()  # dense copy of the sparse document-term matrix
dense

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [342]:

# dtm_tf is a SciPy sparse matrix in CSR format.
# Its memory is held in the 'data', 'indices', and 'indptr' arrays.
memory_usage_data = dtm_tf.data.nbytes
memory_usage_indices = dtm_tf.indices.nbytes
memory_usage_indptr = dtm_tf.indptr.nbytes

# Calculate the total memory usage in bytes
total_memory_usage_bytes = memory_usage_data + memory_usage_indices + memory_usage_indptr

# Convert to kilobytes (optional)
total_memory_usage_kb = total_memory_usage_bytes / 1024

# Convert to megabytes (optional)
total_memory_usage_mb = total_memory_usage_kb / 1024

# Display the result
print(f"Memory usage: {total_memory_usage_bytes} bytes, {total_memory_usage_kb:.2f} KB, {total_memory_usage_mb:.2f} MB")


Memory usage: 99212 bytes, 96.89 KB, 0.09 MB
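
The two memory figures reported in this section can be reconciled with a short calculation. The sketch assumes 8-byte (int64) counts and 4-byte (int32) CSR index arrays. The 2.416 MB dense figure is consistent with the first assumption; the second is a common SciPy default and is not confirmed by the notebook.

```python
n_docs, n_terms = 1000, 302            # shape of dtm_tf reported earlier
dense_bytes = n_docs * n_terms * 8     # one int64 per cell, zeros included

csr_total_bytes = 99212                # total printed by the cell above
indptr_bytes = (n_docs + 1) * 4        # one int32 row pointer per row, plus one
# Each stored (non-zero) value costs 8 bytes of data plus 4 bytes of column index.
nnz = (csr_total_bytes - indptr_bytes) // (8 + 4)
density = nnz / (n_docs * n_terms)

print(f"dense: {dense_bytes / 1e6:.3f} MB")
print(f"non-zero entries: {nnz} ({density:.2%} of all cells)")
print(f"dense / sparse size ratio: {dense_bytes / csr_total_bytes:.1f}x")
```

Under these assumptions only a small fraction of the cells are non-zero, which is why the sparse format is so much smaller and why calling `toarray()` on the full 100K-message dataset may be costly.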


In [299]:
import pandas as pd

# Intended design: each document is embedded as the average of its word vectors
# (document_vector_matrx is currently a random placeholder).



df = pd.DataFrame(dtm_tf.toarray())
df['cluster'] = clustrer_labels
df

,0,1,2,3,4,5,6,7,8,9,...,293,294,295,296,297,298,299,300,301,cluster
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,6
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,7
996,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,7
997,0,0,0,0,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,5
998,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,6


In [334]:
aggregated_clusters = df.groupby('cluster').sum()
aggregated_clusters

cluster,0,1,2,3,4,5,6,7,8,9,...,292,293,294,295,296,297,298,299,300,301
0,2,0,0,7,1,1,3,3,1,2,...,2,4,26,3,4,1,4,3,4,3
1,2,6,5,10,4,3,0,2,1,1,...,0,3,38,6,5,2,1,3,3,1
2,2,2,8,5,1,4,5,1,2,2,...,1,4,23,1,1,5,3,2,11,2
3,2,4,5,7,2,2,5,2,3,1,...,2,3,26,0,1,1,3,3,3,0
4,0,6,4,10,0,4,6,1,2,3,...,0,2,25,2,1,1,0,2,9,2
5,0,4,5,5,5,1,5,2,0,3,...,2,1,31,6,1,3,4,2,4,1
6,1,5,1,9,5,2,3,3,2,2,...,1,0,33,2,1,2,3,2,5,1
7,1,1,3,8,5,3,4,0,1,2,...,2,4,24,3,1,2,1,1,3,0


In [313]:
bag_dimension = 10

top_columns = aggregated_clusters.apply(lambda row: row.nlargest(bag_dimension).index.tolist(), axis=1)
top_columns

cluster
0    [100, 294, 137, 43, 151, 164, 266, 153, 37, 88]
1     [294, 100, 64, 151, 86, 149, 197, 43, 62, 137]
2     [100, 294, 88, 24, 43, 64, 164, 239, 151, 145]
3    [100, 294, 64, 164, 151, 61, 86, 153, 197, 281]
4      [100, 294, 151, 64, 137, 88, 24, 149, 46, 14]
5    [100, 294, 239, 88, 151, 164, 197, 64, 266, 86]
6    [100, 294, 239, 64, 24, 164, 153, 132, 46, 282]
7     [100, 294, 151, 64, 239, 43, 164, 86, 88, 109]
dtype: object

In [331]:
feature_names = tf_vectorizer.get_feature_names_out()
feature_names[top_columns[0]]

array(['https', 'zijn', 'lockdown', 'door', 'mensen', 'naar', 'weer',
       'mijn', 'deze', 'heeft'], dtype=object)

 ## 2.2 Results, Evaluation and Interpretation
 
Finally, you will describe, evaluate and interpret the findings from the two methods.

- In the report, describe and discuss the similarities and differences between the results of the two methods.
- Human judgment is very important when evaluating the results, so visualization techniques are helpful for assessing the identified topics in an interpretable way.
    
1. To evaluate the topic modelling algorithm, first use the interactive tool **[pyLDAvis](https://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb#topic=0&lambda=1&term=)** to examine the inter-topic separation of your findings.

2. To interpret the identified topics / clusters of both algorithms, we provide example code for several visualization techniques. You can use several of them to evaluate your results or come up with visualisations of your own. The files contain examples of how to use the visualisation functions.


In [264]:
import pyLDAvis
import pyLDAvis.lda_model
pyLDAvis.enable_notebook()

pyLDAvis.lda_model.prepare(lda_tf, dtm_tf, tf_vectorizer)

In [265]:
import sys
sys.path.insert(1, '../others')

from viz import show_sentences, show_top_k_topics, show_topic_distributions, show_topic_weights_and_counts, show_topic_wordclouds, show_wordcounts_and_topics



# Bonus Tasks 

We would like to challenge you with the following bonus tasks. For each task that is successfully completed, you may obtain at most 1 extra point.

1. Implement another clustering algorithm or design your own clustering algorithm. Discuss your findings and explain why this is a better (or worse) clustering algorithm than the above one (the clustering algorithm, not LDA).

2. Can you think of other evaluation methods than the provided visualization techniques? If so, implement one and explain why it is a good evaluation for our task.