## Introduction

Latent Dirichlet Allocation (LDA) gives great results on medium-sized and long texts (more than about 50 words), a range that covers most emails and news articles. It performs poorly, however, on short texts such as tweets, Reddit posts or StackOverflow question titles.

The Gibbs Sampling Dirichlet Mixture Model (GSDMM) is an "altered" LDA algorithm that shows great results on short text topic modeling (STTM) tasks. It starts from one assumption: 1 topic ↔ 1 document. All the words within a document are generated from a single topic, and not from a mixture of topics as in the original LDA.
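To make the "1 topic ↔ 1 document" assumption concrete, here is a minimal sketch of how a document would be generated under it. The function name `generate_document`, the two topics and their word probabilities are all made up for illustration; nothing here comes from the `gsdmm` package.

```python
import random

def generate_document(topic_weights, topic_word_probs, length, rng):
    """Generate one document from a single topic (hypothetical helper)."""
    topics = list(topic_weights)
    # Draw ONE topic for the whole document ...
    topic = rng.choices(topics, weights=[topic_weights[t] for t in topics])[0]
    words = list(topic_word_probs[topic])
    probs = [topic_word_probs[topic][w] for w in words]
    # ... then draw every word from that topic's word distribution.
    return topic, rng.choices(words, weights=probs, k=length)

# Toy, made-up distributions for illustration only.
topic_weights = {"space": 0.5, "mideast": 0.5}
topic_word_probs = {
    "space": {"orbit": 0.5, "launch": 0.3, "moon": 0.2},
    "mideast": {"peace": 0.4, "border": 0.4, "talks": 0.2},
}

rng = random.Random(0)
topic, words = generate_document(topic_weights, topic_word_probs, length=6, rng=rng)
print(topic, words)
```

In LDA, by contrast, a topic would be drawn again for every word, so a single document could mix words from "space" and "mideast".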

## What is GSDMM?

Imagine a group of students in a restaurant, seated randomly at K tables. Each of them is asked to write down their favorite movies on a piece of paper (the list must remain short). The objective is to cluster the students so that those within the same group share the same movie interests. To do so, the students, one after another, choose a new table according to the following two rules:

* Rule 1: Choose a table with more students. This rule improves completeness: all students sharing the same movie interests end up at the same table.
* Rule 2: Choose a table where students share similar movie interests. This rule increases homogeneity: we want only students sharing the same movie interests at a table.

After repeating this process, we expect some tables to empty out and others to grow, until the remaining tables hold clusters of students with matching movie interests. This is, in essence, what the GSDMM algorithm does!
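The table-choosing process can be sketched in a few lines of plain Python. This is a toy version, assuming the usual collapsed Gibbs update for a Dirichlet multinomial mixture; it is not the code of the `gsdmm` package. The function name `gsdmm_sketch`, the default hyperparameters and the toy movie lists are assumptions made for illustration. A real implementation would typically work in log space to avoid numerical underflow.

```python
import random
from collections import Counter

def gsdmm_sketch(docs, K=4, alpha=0.1, beta=0.1, n_iters=10, seed=0):
    """Toy collapsed Gibbs sampler for a mixture of short documents."""
    rng = random.Random(seed)
    V = len({w for doc in docs for w in doc})  # vocabulary size
    D = len(docs)
    m = [0] * K                                # students (documents) per table
    n = [0] * K                                # total words per table
    nw = [Counter() for _ in range(K)]         # per-word counts per table
    labels = []

    def move(doc, z, sign):
        """Add (sign=+1) or remove (sign=-1) a document from table z."""
        m[z] += sign
        n[z] += sign * len(doc)
        for w in doc:
            nw[z][w] += sign

    # Students first sit at random tables.
    for doc in docs:
        z = rng.randrange(K)
        labels.append(z)
        move(doc, z, +1)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            move(doc, labels[d], -1)           # the student stands up
            weights = []
            for z in range(K):
                # Rule 1: tables with more students are more attractive.
                score = (m[z] + alpha) / (D - 1 + K * alpha)
                # Rule 2: tables whose words match the document score higher.
                for w, count in Counter(doc).items():
                    for j in range(count):
                        score *= nw[z][w] + beta + j
                for i in range(len(doc)):
                    score /= n[z] + V * beta + i
                weights.append(score)
            labels[d] = rng.choices(range(K), weights=weights)[0]
            move(doc, labels[d], +1)           # and sits at the chosen table
    return labels, m

# Made-up "favorite movie" lists, one per student.
docs = [["star", "wars"], ["star", "trek"], ["wars", "star"],
        ["titanic", "romance"], ["romance", "titanic", "love"], ["love", "titanic"]]
labels, table_sizes = gsdmm_sketch(docs, K=4, n_iters=20)
print(labels, table_sizes)
```

`labels` gives the table chosen by each student and `table_sizes` the number of students per table; tables that end at 0 are the ones that "disappeared".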

## Code

In [2]:
from preprocessing import tokenize, export_to_csv # a separate module, to be created
from gsdmm import MovieGroupProcess
from topic_allocation import top_words, topic_attribution
from visualisation import plot_topic_notebook, save_topic_html
from sklearn.datasets import fetch_20newsgroups

import pickle
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import ast

In [3]:
cats = ['talk.politics.mideast', 'comp.windows.x', 'sci.space']

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=cats)
newsgroups_train_subject = fetch_20newsgroups(subset='train', categories=cats)

data = newsgroups_train.data
data_subject = newsgroups_train_subject.data

targets = newsgroups_train.target.tolist()
target_names = newsgroups_train.target_names

In [4]:
# Let's see if our topics are evenly distributed
df_targets = pd.DataFrame({'targets': targets})
order_list = df_targets.targets.value_counts()
order_list

1    593
0    593
2    564
Name: targets, dtype: int64

In [5]:
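def extract_first_sentence(data):
    # Keep only the first sentence of each post body
    list_first_sentence = []
    for text in data:
        # Replace line breaks with spaces so wrapped words are not glued together
        first_sentence = text.split(".")[0].replace("\n", " ").strip()
        list_first_sentence.append(first_sentence)
    return list_first_sentence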


def extract_subject(data_subject):
    s = "Subject:"
    list_subjects = []
    for new in data_subject:
        lines = new.split("\n")
        for line in lines:
            if s in line:
                subject = " ".join(line.split(":")[1:]).strip()
                # Drop leading reply markers only; replace('Re', '') would
                # also mangle words such as "Request"
                while subject.startswith("Re "):
                    subject = subject[3:].strip()
                list_subjects.append(subject)
                # Stop at the first "Subject:" line, there may be several
                break
    return list_subjects
   
    
def concatenate(list_first_sentence, list_subjects):
    list_docs = []
    for i in range(len(list_first_sentence)):
        list_docs.append(list_subjects[i] + " " + list_first_sentence[i])
    return list_docs


list_first_sentence = extract_first_sentence(data)
list_subjects = extract_subject(data_subject)
list_docs = concatenate(list_first_sentence, list_subjects)
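To see what these helpers produce, here is a self-contained sketch on a single hand-written post. The post is made up and does not come from the dataset, and `short_doc` is a hypothetical, simplified re-implementation of the three functions above rather than the notebook's own code.

```python
def short_doc(raw_post, body):
    """Return 'subject + first sentence' for one post (toy re-implementation)."""
    subject = ""
    for line in raw_post.split("\n"):
        if line.startswith("Subject:"):
            # Keep the text after the first "Subject:" header only
            subject = line[len("Subject:"):].strip()
            break
    first_sentence = body.split(".")[0].replace("\n", " ").strip()
    return (subject + " " + first_sentence).strip()

# Made-up post: the raw version keeps the headers, the body version does not.
raw_post = "From: someone@example.com\nSubject: Moon base\n\nA base needs water. It also needs power."
body = "A base needs water. It also needs power."

print(short_doc(raw_post, body))  # Moon base A base needs water
```

Each resulting document is only a handful of words long, which is the short-text setting GSDMM is meant for.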

In [6]:
df = pd.DataFrame(columns=['content', 'topic_id', 'topic_true_name'])
df['content'] = list_docs
df['topic_id'] = targets

def true_topic_name(x, target_names):
    return target_names[x].split('.')[-1]

df['topic_true_name'] = df['topic_id'].apply(lambda x: true_topic_name(x, target_names))
df.head()

| | content | topic_id | topic_true_name |
|---|---|---|---|
| 0 | Elevator to the top floor Reading from a Amoco... | 1 | space |
| 1 | Title for XTerm Yet again,the escape sequences... | 0 | x |
| 2 | From Israeli press. Madness. Before getting ex... | 2 | mideast |
| 3 | Accounts of Anti-Armenian Human Right Violatio... | 2 | mideast |
| 4 | How many israeli soldiers does it take to kill... | 2 | mideast |


In [7]:
tokenized_data = tokenize(df, form_reduction='stemming', predict=False)

NameError: name 'tokenize' is not defined
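The `tokenize` function comes from the `preprocessing` module, which is not shown here, so its behavior cannot be reproduced exactly. As a stand-in, here is a minimal sketch of what such a step could look like. The function `simple_tokenize`, its `min_length` parameter and the tiny stop-word list are assumptions made for illustration, and no stemming is applied.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "it", "for", "from", "on"}

def simple_tokenize(texts, min_length=2):
    """Lowercase, keep alphabetic tokens, drop stop words and very short tokens."""
    tokenized = []
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        tokenized.append([t for t in tokens if t not in STOP_WORDS and len(t) >= min_length])
    return tokenized

docs = simple_tokenize(["Elevator to the top floor", "Title for XTerm"])
vocab_size = len({token for doc in docs for token in doc})

print(docs)        # [['elevator', 'top', 'floor'], ['title', 'xterm']]
print(vocab_size)  # 5
```

Token lists like `docs`, together with the vocabulary size, are the kind of input a GSDMM implementation usually expects.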