# Topic Modeling with Gibbs Sampling Dirichlet Mixture Model (GSDMM)

**GSDMM is a short-text topic modeling algorithm. It is a modified version of LDA that makes the simplifying assumption that each text is assigned to exactly one topic.**

The idea is simple. Imagine a professor leading a film class. At the start of class, each student writes down a relatively short list of their favorite movies. The students represent documents and the movies on their lists represent words. Next, the students are randomly assigned to K tables. The goal is to group them so that students at the same table share similar movie interests. Finally, the professor repeatedly reads the class roster; each time a student’s name is called, the student must select a new table satisfying one or both of the following conditions:  
> 1 — Choose a table with more students than their current table.  
> 2 — Choose a table where students share similar movie interests. 
  
Condition 1 improves **completeness**: all students with similar movie interests end up at the same table rather than spread across several tables. Condition 2 improves **homogeneity**: only students who share similar interests sit at a given table. As the roster is read again and again and the seating approaches convergence, we expect some tables to empty out while others grow. The hope is that students eventually settle into a near-optimal table configuration. In short, this is what the GSDMM algorithm does!   
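
The roster loop described above can be sketched in a few lines of Python. This is a toy illustration only: `movie_group_pass` and `toy_score` are hypothetical names, and `toy_score` is a stand-in that multiplies table size by the number of shared movies. It is not the probability that GSDMM actually samples from.

```python
import random

def toy_score(doc, z, docs, tables, eps=0.01):
    """Stand-in score (an assumption, not the GSDMM formula):
    table size (condition 1) times movies shared with its members (condition 2)."""
    members = [d for d, t in zip(docs, tables) if t == z]
    shared = sum(len(set(doc) & set(m)) for m in members)
    return (len(members) + eps) * (shared + eps)

def movie_group_pass(docs, tables, K, rng):
    """Read the roster once: each student stands up and re-picks a table."""
    transfers = 0
    for i, doc in enumerate(docs):
        old = tables[i]
        tables[i] = None  # the student leaves their current table
        weights = [toy_score(doc, z, docs, tables) for z in range(K)]
        tables[i] = rng.choices(range(K), weights=weights)[0]
        transfers += tables[i] != old
    return transfers

# Four "students" with short movie lists (toy data)
docs = [["alien", "blade runner"], ["alien", "the thing"],
        ["amelie", "chocolat"], ["amelie", "alien"]]
K = 3
rng = random.Random(0)
tables = [rng.randrange(K) for _ in docs]  # random initial seating
for stage in range(5):
    moved = movie_group_pass(docs, tables, K, rng)
    print(f"stage {stage}: {moved} moved, {len(set(tables))} tables populated")
```

Because a table is drawn at random in proportion to its score, a student may stay put or move in any given pass; over several passes the number of moves tends to fall.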

  
The research paper can be found [here](https://dl.acm.org/doi/10.1145/2623330.2623715)  
The Medium article can be found [here](https://pub.towardsai.net/tweet-topic-modeling-part-3-using-short-text-topic-modeling-on-tweets-bc969a827fef)

## *Code Implementation*

Clone the GitHub repo and import the algorithm from the file.  
~~~
git clone https://github.com/rwalk/gsdmm.git 
~~~

In [1]:
# Import libraries 

import pandas as pd
import numpy as np
import pickle
import gensim
from gsdmm.gsdmm import MovieGroupProcess
from tqdm import tqdm

**Read Data**

In [2]:
df = pd.read_csv('Corrected_Final_All.csv')
print(df.shape)
df.head()

(22160, 36)


Unnamed: 0.1,Unnamed: 0,created_at,id_str,conversation_id_str,full_text,twitter_lang,favorited,retweeted,retweet_count,favorite_count,...,preprocessed_data,emoji_list,emoticons_list,filename,data_source,lang,score,langTb,lang_langdetect,preprocessed_data_without_hashtags
0,0,2021-03-27T04:09:42+00:00,1.38e+18,1.38e+18,@Diputado_Canelo Hagamos otro por el uno de ma...,es,False,False,0.0,1.0,...,"['hacer', 'mayo', 'cazar', 'fantasma', 'mayo']",[''],[':/'],Mayo_SPANISH_tweets_stweet.csv,Twitter,es,,,,"['hacer', 'mayo', 'cazar', 'fantasma']"
1,1,2021-03-22T21:12:09+00:00,1.37e+18,1.37e+18,Después de esperar con ancias el #28F ahora es...,es,False,False,1.0,4.0,...,"['despues', 'esperar', 'ancia', 'ahora', 'espe...",['💙🤍💙'],[],Mayo_SPANISH_tweets_stweet.csv,Twitter,es,,,,"['despues', 'esperar', 'ancia', 'ahora', 'espe..."
2,2,2021-03-22T12:30:53+00:00,1.37e+18,1.37e+18,Espero que ésto llegue hasta oídos de la nueva...,es,False,False,0.0,1.0,...,"['esperar', 'llegar', 'oido', 'nuevo', 'inicia...",[''],[],Mayo_SPANISH_tweets_stweet.csv,Twitter,es,,,,"['esperar', 'llegar', 'oido', 'nuevo', 'inicia..."
3,3,2021-04-04T12:56:55+00:00,1.38e+18,1.38e+18,A menos de un mes del #1Mayo Urkullu teme perd...,es,False,False,3.0,5.0,...,"['menos', 'mes', 'mayo', 'urkullu', 'temer', '...",[''],[],Mayo_SPANISH_tweets_stweet.csv,Twitter,es,,,,"['menos', 'mes', 'urkullu', 'temer', 'perder',..."
4,4,2021-04-03T20:14:57+00:00,1.38e+18,1.38e+18,La X Edición del Festival Internacional Un Pue...,es,False,False,1.0,3.0,...,"['edicion', 'festival', 'internacional', 'puen...",[''],"[':/', ':/']",Mayo_SPANISH_tweets_stweet.csv,Twitter,es,,,,"['edicion', 'festival', 'internacional', 'puen..."


In [3]:
df['preprocessed_str_without_hashtags'] = df['preprocessed_data_without_hashtags'].apply(eval).apply(' '.join)
df.head()

Unnamed: 0.1,Unnamed: 0,created_at,id_str,conversation_id_str,full_text,twitter_lang,favorited,retweeted,retweet_count,favorite_count,...,emoji_list,emoticons_list,filename,data_source,lang,score,langTb,lang_langdetect,preprocessed_data_without_hashtags,preprocessed_str_without_hashtags
0,0,2021-03-27T04:09:42+00:00,1.38e+18,1.38e+18,@Diputado_Canelo Hagamos otro por el uno de ma...,es,False,False,0.0,1.0,...,[''],[':/'],Mayo_SPANISH_tweets_stweet.csv,Twitter,es,,,,"['hacer', 'mayo', 'cazar', 'fantasma']",hacer mayo cazar fantasma
1,1,2021-03-22T21:12:09+00:00,1.37e+18,1.37e+18,Después de esperar con ancias el #28F ahora es...,es,False,False,1.0,4.0,...,['💙🤍💙'],[],Mayo_SPANISH_tweets_stweet.csv,Twitter,es,,,,"['despues', 'esperar', 'ancia', 'ahora', 'espe...",despues esperar ancia ahora esperar despues se...
2,2,2021-03-22T12:30:53+00:00,1.37e+18,1.37e+18,Espero que ésto llegue hasta oídos de la nueva...,es,False,False,0.0,1.0,...,[''],[],Mayo_SPANISH_tweets_stweet.csv,Twitter,es,,,,"['esperar', 'llegar', 'oido', 'nuevo', 'inicia...",esperar llegar oido nuevo iniciar laboral part...
3,3,2021-04-04T12:56:55+00:00,1.38e+18,1.38e+18,A menos de un mes del #1Mayo Urkullu teme perd...,es,False,False,3.0,5.0,...,[''],[],Mayo_SPANISH_tweets_stweet.csv,Twitter,es,,,,"['menos', 'mes', 'urkullu', 'temer', 'perder',...",menos mes urkullu temer perder control dar pas...
4,4,2021-04-03T20:14:57+00:00,1.38e+18,1.38e+18,La X Edición del Festival Internacional Un Pue...,es,False,False,1.0,3.0,...,[''],"[':/', ':/']",Mayo_SPANISH_tweets_stweet.csv,Twitter,es,,,,"['edicion', 'festival', 'internacional', 'puen...",edicion festival internacional puente hacia ce...
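
The `eval` call in cell 3 turns each stringified list back into a Python list. A possible alternative, sketched here on a single value copied from the first row, is `ast.literal_eval`, which only accepts Python literals and refuses arbitrary expressions. The variable names below are illustrative.

```python
import ast

# One cell value as it is read from the CSV: a list written out as a string
raw = "['hacer', 'mayo', 'cazar', 'fantasma']"

tokens = ast.literal_eval(raw)   # parse the literal into a real list
joined = " ".join(tokens)        # same join as in cell 3
print(joined)

# Unlike eval, literal_eval rejects anything that is not a plain literal
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print("expression rejected:", rejected)
```

On the dataframe this would read `.apply(ast.literal_eval)` in place of `.apply(eval)`.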


**Pre-process data for modeling** 

In [5]:
# The column holds each token list as a string, e.g. "['hacer', 'mayo']"
docs = df.preprocessed_data_without_hashtags.tolist()

# Re-tokenize each string: lowercases, drops punctuation, and strips accents (deacc=True)
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

docs = list(sent_to_words(docs))
docs

[['hacer', 'mayo', 'cazar', 'fantasma'],
 ['despues',
  'esperar',
  'ancia',
  'ahora',
  'esperar',
  'despues',
  'ser',
  'inmagino',
  'celebracion',
  'ano',
  'independencia',
  'patria',
  'ahora',
  'si',
  'jubil',
  'ser',
  'verdadero',
  'libertad'],
 ['esperar',
  'llegar',
  'oido',
  'nuevo',
  'iniciar',
  'laboral',
  'partir',
  'proximo',
  'dinero',
  'invertia',
  'privilegio',
  'tiempo',
  'atra',
  'invertir',
  'programa',
  'joven',
  'adulto',
  'ojalar',
  'alguien',
  'ver'],
 ['menos',
  'mes',
  'urkullu',
  'temer',
  'perder',
  'control',
  'dar',
  'paso',
  'atro',
  'urkullu',
  'decir',
  'ahora',
  'si',
  'cumplir',
  'todo',
  'medida',
  'impuesto',
  'superar',
  'situacion',
  'mas',
  'drastico'],
 ['edicion',
  'festival',
  'internacional',
  'puente',
  'hacia',
  'celebrar',
  'abril',
  'proximo',
  'manera',
  'online',
  'bajo',
  'slogan'],
 ['cgt', 'celebrar', 'mayo', 'hostigamiento', 'sufrido', 'empresa'],
 ['cgt', 'celebrar', 'ac

**Train topic model**

**Parameters**  

> * **K -** the maximum number of topics (tables) to be found.  
  
> * **Alpha α -** controls how likely a student is to choose a currently empty table. When alpha is 0, a table that becomes empty is never chosen again.  
  
> * **Beta β -** controls whether a table is chosen based on similarity or popularity. A low beta favors tables whose members share similar words (more homogeneous clusters); a high beta puts more emphasis on popular tables.  
  
> * **N_iters -** the number of iterations, that is, *the number of times the professor reads the roster and each student picks a table again*.
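
To make the roles of K, alpha and beta concrete, here is a minimal sketch of the unnormalized table-choice score as I read it from the GSDMM paper. The function name `table_scores`, its argument names and the toy counts are all assumptions for illustration. The `gsdmm` package has its own implementation, which this does not reproduce.

```python
from collections import Counter

def table_scores(doc, m_z, n_z, n_z_w, K, alpha, beta, V, D):
    """Unnormalized score of each table z for `doc`.

    m_z[z]   : number of documents at table z (current doc removed)
    n_z[z]   : number of word tokens at table z
    n_z_w[z] : dict of word -> count at table z
    V, D     : vocabulary size and total number of documents
    """
    word_counts = Counter(doc)
    scores = []
    for z in range(K):
        # popularity term: bigger tables score higher; alpha keeps empty tables alive
        p = (m_z[z] + alpha) / (D - 1 + K * alpha)
        # similarity term: words already common at the table raise the score
        for w, c in word_counts.items():
            for j in range(c):
                p *= n_z_w[z].get(w, 0) + beta + j
        for i in range(len(doc)):
            p /= n_z[z] + V * beta + i
        scores.append(p)
    return scores

# Toy state: table 0 holds two documents, table 1 is empty
m_z = [2, 0]
n_z = [4, 0]
n_z_w = [{"paro": 2, "paz": 2}, {}]
doc = ["paro", "gobierno"]

with_alpha = table_scores(doc, m_z, n_z, n_z_w, K=2, alpha=0.1, beta=0.5, V=4, D=3)
no_alpha = table_scores(doc, m_z, n_z, n_z_w, K=2, alpha=0.0, beta=0.5, V=4, D=3)
print("alpha=0.1:", with_alpha)
print("alpha=0.0:", no_alpha)
```

With `alpha=0.0` the empty table scores exactly zero, which is why an emptied table cannot be chosen again in that setting.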

In [9]:
# Instantiate model
mgp = MovieGroupProcess(K=12, alpha=0.1, beta=0.5, n_iters=30)

# create vocab
vocab = set(x for doc in docs for x in doc)

# length of vocabulary
n_terms = len(vocab)

# fit on data
y = mgp.fit(docs, n_terms)

In stage 0: transferred 19632 clusters with 12 clusters populated
In stage 1: transferred 12433 clusters with 12 clusters populated
In stage 2: transferred 4815 clusters with 12 clusters populated
In stage 3: transferred 2558 clusters with 12 clusters populated
In stage 4: transferred 1942 clusters with 12 clusters populated
In stage 5: transferred 1788 clusters with 12 clusters populated
In stage 6: transferred 1669 clusters with 12 clusters populated
In stage 7: transferred 1548 clusters with 12 clusters populated
In stage 8: transferred 1558 clusters with 12 clusters populated
In stage 9: transferred 1517 clusters with 12 clusters populated
In stage 10: transferred 1485 clusters with 11 clusters populated
In stage 11: transferred 1469 clusters with 11 clusters populated
In stage 12: transferred 1445 clusters with 11 clusters populated
In stage 13: transferred 1408 clusters with 11 clusters populated
In stage 14: transferred 1427 clusters with 11 clusters populated
In stage 15: trans

In [10]:
# helper functions

def top_words(cluster_word_distribution, top_cluster, values):
    '''prints the top words in each cluster'''
    for cluster in top_cluster:
        sort_dicts =sorted(cluster_word_distribution[cluster].items(), key=lambda k: k[1], reverse=True)[:values]
        print('Cluster %s : %s'%(cluster,sort_dicts))
        print(' — — — — — — — — —')

def cluster_importance(mgp):
    '''returns a word-topic matrix[phi] where each value represents
    the word importance for that particular cluster;
    phi[i][w] would be the importance of word w in topic i.
    '''
    n_z_w = mgp.cluster_word_distribution
    beta, V, K = mgp.beta, mgp.vocab_size, mgp.K
    phi = [{} for i in range(K)]
    for z in range(K):
        for w in n_z_w[z]:
            phi[z][w] = (n_z_w[z][w]+beta)/(sum(n_z_w[z].values())+V*beta)
    return phi

def topic_allocation(df, docs, mgp, topic_dict):
    '''allocates all topics to each document in original dataframe,
    adding two columns for cluster number and cluster description'''
    topic_allocations = []
    for doc in tqdm(docs):
        topic_label, score = mgp.choose_best_label(doc)
        topic_allocations.append(topic_label)

    df['cluster'] = topic_allocations

    df['topic_name'] = df.cluster.apply(lambda x: get_topic_name(x, topic_dict))
    print('Complete. Number of documents with topic allocated: {}'.format(len(df)))

def get_topic_name(doc, topic_dict):
    '''returns the topic name string value from a dictionary of topics'''
    topic_desc = topic_dict[doc]
    return topic_desc
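
The smoothing inside `cluster_importance` is easier to follow on small numbers. The sketch below applies the same `(count + beta) / (total + V * beta)` expression to a made-up cluster; `smoothed_importance` is a hypothetical helper, and the counts are toy values. It also falls back to a count of 0 for unseen words, whereas `phi[z][w]` as built above only stores words that occur in cluster `z` and raises a `KeyError` for anything else.

```python
def smoothed_importance(word, counts, beta, V):
    """(count + beta) / (total + V * beta); unseen words are treated as count 0."""
    return (counts.get(word, 0) + beta) / (sum(counts.values()) + V * beta)

# Toy cluster: 4 word tokens in total, vocabulary of 4 distinct words
counts = {"paz": 3, "paro": 1}
beta, V = 0.5, 4

seen = smoothed_importance("paz", counts, beta, V)       # (3 + 0.5) / (4 + 2)
unseen = smoothed_importance("huelga", counts, beta, V)  # (0 + 0.5) / (4 + 2)
print(seen, unseen)
```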

In [11]:
doc_count = np.array(mgp.cluster_doc_count)
print('Number of documents per topic :', doc_count)
print('*'*20)
# topics sorted by the number of documents they are allocated to
top_index = doc_count.argsort()[-10:][::-1]
print('Most important clusters (by number of docs inside):',   
       top_index)
print('*'*20)
# show the top 10 words by term frequency for each cluster
topic_indices = np.arange(start=0, stop=len(doc_count), step=1)
top_words(mgp.cluster_word_distribution, topic_indices, 10)

Number of documents per topic : [   18    65   105 16085  4376    16     0  1053    21    24    12   385]
********************
Most important clusters (by number of docs inside): [ 3  4  7 11  2  1  9  8  0  5]
********************
Cluster 0 : [('post', 10), ('sub', 10), ('spam', 10), ('tropical', 6), ('severe', 6), ('por', 5), ('de', 5), ('regla', 5), ('due', 5), ('violation', 5)]
 — — — — — — — — —
Cluster 1 : [('mas', 40), ('el', 20), ('colombiano', 20), ('ayudar', 20), ('sr', 20), ('sostener', 20), ('volver', 19), ('casa', 19), ('mayor', 19), ('auxilio', 19)]
 — — — — — — — — —
Cluster 2 : [('uribe', 63), ('duque', 57), ('martuchis', 57), ('vicky', 55), ('paraco', 55), ('velez', 43), ('roman', 43), ('bolivar', 40), ('escobar', 39), ('polo', 39)]
 — — — — — — — — —
Cluster 3 : [('mas', 2028), ('hacer', 1772), ('si', 1730), ('el', 1587), ('ir', 1350), ('poder', 1339), ('pais', 1274), ('colombio', 1204), ('ser', 1201), ('gobierno', 1151)]
 — — — — — — — — —
Cluster 4 : [('colombia', 8

In [12]:
phi = cluster_importance(mgp) # initialize phi matrix
phi[4]['government']

0.00887131970322624

In [14]:
phi[8]['policy']

0.0004564694165490912

In [18]:
# Save model (the with-block closes the file automatically)
with open('12cluster.model', 'wb') as f:
    pickle.dump(mgp, f)
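
Loading the model back is the mirror image of saving it. The round trip below uses a plain dictionary as a stand-in for the fitted model and a temporary directory, both assumptions made so the snippet runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model (hypothetical; the notebook pickles `mgp`)
model = {"K": 12, "alpha": 0.1, "beta": 0.5, "n_iters": 30}

path = os.path.join(tempfile.mkdtemp(), "12cluster.model")

with open(path, "wb") as f:   # save
    pickle.dump(model, f)

with open(path, "rb") as f:   # load
    restored = pickle.load(f)

print(restored == model)
```

When unpickling the real model, the `gsdmm` package must be importable in the loading session, because pickle stores a reference to the class rather than its code.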

In [15]:
def top_words_dict(cluster_word_distribution, top_cluster, n_words):
    '''returns a dictionary of the top words and their counts within each cluster;
    cluster numbers are the keys and a list of (word, word count) tuples are the values'''
    result = {}
    for cluster in top_cluster:
        # sort the cluster's words once, most frequent first
        ranked = sorted(cluster_word_distribution[cluster].items(),
                        key=lambda item: item[1], reverse=True)[:n_words]
        # the slice starts at index 1, so the single most frequent word is left out
        # and at most n_words - 1 words are kept per cluster
        result[cluster] = ranked[1:]

    return result

def get_word_counts_dict(top_words_nclusters):
    '''returns a dictionary that counts the number of times a word 
    appears only in the top n words list across all the clusters;
    words are the keys and a count of the word is the value'''
    word_count_dict = {}
    for key in top_words_nclusters:
        for word in top_words_nclusters[key]:
            # word is a (word, count) tuple; tally how many top-word lists contain it
            word_count_dict[word[0]] = word_count_dict.get(word[0], 0) + 1
    return word_count_dict

def get_cluster_importance_dict(top_words_nclusters, phi):
    '''returns a dictionary of all top words and their cluster
    importance value for each cluster;
    cluster numbers are the keys and a list of word 
    importance computed scores are the values'''
    cluster_importance_dict = {}
    for key in top_words_nclusters:
        words_score_list = []
        for word in top_words_nclusters[key]:
            importance_score = phi[key][word[0]]
            words_score_list.append(importance_score)
        cluster_importance_dict[key] = words_score_list
    return cluster_importance_dict

def get_doc_counts_dict(top_words_nclusters):
    '''returns a dictionary of only the doc counts of each top n word for each cluster;
    cluster numbers are the keys and a list of doc counts are the values'''
    doc_counts_dict = {}
    for key in top_words_nclusters:
        doc_counts_list = []
        for word in top_words_nclusters[key]:
            num_docs = word[1]
            doc_counts_list.append(num_docs)
        doc_counts_dict[key] = doc_counts_list
    return doc_counts_dict

def get_word_frequency_dict(top_words_nclusters, word_counts):
    '''returns a dictionary of only the number of occurrences across all 
    clusters for each word in a particular cluster's top n words;
    cluster numbers are the keys and a list of 
    word occurrence counts are the values'''
    word_frequency_dict = {}
    for key in top_words_nclusters:
        words_count_list = []
        for word in top_words_nclusters[key]:
            words_count_list.append(word_counts[word[0]])
        word_frequency_dict[key] = words_count_list

    return word_frequency_dict
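
The tally done by `get_word_counts_dict` can also be written with `collections.Counter`. The snippet below is a sketch on toy data shaped like the output of `top_words_dict` (cluster number mapped to a list of `(word, count)` tuples); the values are made up for illustration.

```python
from collections import Counter

# Toy input: cluster number -> list of (word, count) tuples
top_words_nclusters = {
    0: [("sub", 10), ("spam", 10)],
    5: [("usar", 14), ("sub", 14)],
}

# Count in how many top-word lists each word appears
word_counts = Counter(
    word for pairs in top_words_nclusters.values() for word, _ in pairs
)
print(word_counts)
```

A word that appears in many clusters' top lists (a high count here) says little about any single topic, which is what the `num_topic_occurrence` column is meant to expose.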

In [32]:

# declare any static variables needed 
nwords = 10
phi = cluster_importance(mgp)
modified_topic_indices = np.delete(topic_indices, 6) # topic 6 is an empty table
# define and generate dictionaries that hold each topic number and its values
# (stored as top_words_by_cluster so the top_words() helper is not overwritten)
top_words_by_cluster = top_words_dict(mgp.cluster_word_distribution, modified_topic_indices, nwords)
word_count = get_word_counts_dict(top_words_by_cluster)
word_frequency = get_word_frequency_dict(top_words_by_cluster, word_count)
cluster_importance_dict = get_cluster_importance_dict(top_words_by_cluster, phi)

# add all values for each topic to a list of lists
# note: range(0, 10) covers clusters 0-9 only, so clusters 10 and 11 are not tabulated
rows_list = []
for cluster in range(0, 10):
    if cluster == 6:
        continue  # skip the empty table
    words = [x[0] for x in top_words_by_cluster[cluster]]
    doc_counts = [x[1] for x in top_words_by_cluster[cluster]]

    # create a list of values which represents a 'row' in our data frame
    rows_list.append([int(cluster), words, doc_counts,
                      word_frequency[cluster], cluster_importance_dict[cluster]])
        
topic_words_df = pd.DataFrame(data=rows_list, 
                              columns=['cluster', 'top_words',
                                        'doc_count', 'num_topic_occurrence', 'word_importance'])

topic_words_df


Unnamed: 0,cluster,top_words,doc_count,num_topic_occurrence,word_importance
0,0,"[sub, spam, tropical, severe, por, de, regla, ...","[10, 10, 6, 6, 5, 5, 5, 5, 5]","[2, 1, 1, 1, 1, 2, 1, 1, 1]","[0.0008700696055684454, 0.0008700696055684454,..."
1,1,"[el, colombiano, ayudar, sr, sostener, volver,...","[20, 20, 20, 20, 20, 19, 19, 19, 19]","[3, 2, 1, 1, 1, 1, 1, 1, 1]","[0.0016414444711345985, 0.0016414444711345985,..."
2,2,"[duque, martuchis, vicky, paraco, velez, roman...","[57, 57, 55, 55, 43, 43, 40, 39, 39]","[1, 1, 1, 1, 1, 1, 1, 1, 1]","[0.0044281863688871775, 0.0044281863688871775,..."
3,3,"[hacer, si, el, ir, poder, pais, colombio, ser...","[1772, 1730, 1587, 1350, 1339, 1274, 1204, 120...","[2, 2, 3, 2, 2, 2, 2, 2, 1]","[0.009441147958368399, 0.009217436695038937, 0..."
4,4,"[people, police, government, help, colombian, ...","[739, 640, 528, 477, 468, 404, 365, 355, 311]","[2, 1, 1, 1, 1, 1, 1, 1, 1]","[0.012413133246046933, 0.010751334474770873, 0..."
5,5,"[usar, sub, to, lenguaje, controversial, langu...","[14, 14, 14, 14, 14, 14, 9, 8, 8]","[1, 2, 1, 1, 1, 1, 1, 1, 1]","[0.0011722855525911552, 0.0011722855525911552,..."
6,7,"[la, hoy, dialogo, salud, esperar, calles, col...","[79, 60, 58, 44, 43, 38, 37, 37, 37]","[1, 1, 1, 1, 1, 1, 2, 1, 2]","[0.004121733720447947, 0.0031366652841144753, ..."
7,8,"[account, post, reddit, or, twitter, temporari...","[6, 6, 6, 6, 5, 5, 5, 5, 5]","[1, 1, 1, 1, 1, 1, 1, 1, 1]","[0.0005394638559216532, 0.0005394638559216532,..."
8,9,"[empresa, ver, encontrar, oportunidad, situaci...","[14, 14, 14, 14, 14, 14, 14, 14, 14]","[1, 1, 1, 1, 1, 2, 1, 1, 1]","[0.0012011265738899933, 0.0012011265738899933,..."
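
The empty table (cluster 6) is hard-coded in the cell above. A small sketch of how the populated clusters could instead be derived from the per-topic document counts printed earlier; the list is copied from that output, and the variable names are illustrative.

```python
# Document counts per topic, copied from the output of cell 11
doc_count = [18, 65, 105, 16085, 4376, 16, 0, 1053, 21, 24, 12, 385]

# Keep only clusters that actually hold documents
populated = [i for i, count in enumerate(doc_count) if count > 0]

# Rank the populated clusters by size, largest first
by_size = sorted(populated, key=lambda i: doc_count[i], reverse=True)

print("populated clusters:", populated)
print("three largest:", by_size[:3])
```

Iterating over `populated` rather than a fixed range would also pick up clusters 10 and 11.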


In [33]:
# save results
topic_words_df.to_csv('results.csv', index=False)