# Topic Modeling

Topic modeling is the unsupervised classification of text documents: it groups documents by the "topics" they contain, without using any labels.

Since we don't know in advance which topics are present in the text, the topics are called hidden or "latent".

Key Foundations:

    1. Documents are mixtures of topics
    2. Topics are mixtures of words

A document is represented as a distribution over topics, and a topic is represented as a distribution over words (the document-topic and topic-word distributions).
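
As a small illustration of these two distributions, the sketch below uses made-up numbers (two topics and a four-word vocabulary, all hypothetical). It shows that each row is a probability distribution and that the word distribution of a document follows from mixing the topic rows by the document's topic weights.

```python
import numpy as np

# Hypothetical vocabulary and topics, for illustration only
vocab = ["goal", "team", "vote", "law"]

# Topic-word distributions: one row per topic, each row sums to 1
topic_word = np.array([
    [0.50, 0.40, 0.05, 0.05],   # a "sports"-like topic
    [0.05, 0.05, 0.50, 0.40],   # a "politics"-like topic
])

# Document-topic distribution for one document: 70% topic 0, 30% topic 1
doc_topic = np.array([0.7, 0.3])

# Word distribution implied for this document: a mixture of the topic rows
doc_word = doc_topic @ topic_word

for word, p in zip(vocab, doc_word):
    print(f"{word}: {p:.3f}")
```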

Topic Modeling Algorithms:

    1. Latent Dirichlet Allocation (LDA):
        A generative statistical model using Bayesian techniques.
        Views each document as a bag of words (word order and relations are ignored).
    2. Non-negative Matrix Factorization (NMF):
        Finds non-negative document-topic and term-topic matrices by minimizing a cost function.
    3. BERTopic:
        Uses transformer-based methods.
        Builds document embeddings, compresses and clusters them into topics, then extracts the topic words.
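
To make the NMF idea concrete, here is a minimal sketch that factorizes a toy document-term matrix with the multiplicative update rules of Lee and Seung, using the Frobenius reconstruction error as the cost function. The matrix values, the number of topics, and the iteration count are assumptions chosen for illustration; a library implementation would normally be used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy document-term count matrix V (4 documents x 5 terms), hypothetical data
V = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 0, 1, 0],
    [0, 0, 4, 3, 0],
    [0, 1, 3, 4, 0],
], dtype=float)

n_topics = 2
eps = 1e-9  # avoids division by zero in the updates

# Non-negative random initialization
W = rng.random((V.shape[0], n_topics))   # document-topic weights
H = rng.random((n_topics, V.shape[1]))   # topic-term weights

def frobenius_error(V, W, H):
    """Cost function: Frobenius norm of the reconstruction residual."""
    return np.linalg.norm(V - W @ H)

initial_error = frobenius_error(V, W, H)

# Multiplicative updates keep W and H non-negative at every step
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

final_error = frobenius_error(V, W, H)
print(f"error before: {initial_error:.3f}, after: {final_error:.3f}")
```

Each row of `W` can then be read as a document's topic weights and each row of `H` as a topic's term weights.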

## Latent Dirichlet Allocation (LDA)

Mechanics:

LDA is a generative method for fitting a topic model: it assumes that the words of each document are generated by a probabilistic model.

LDA treats each document as a mixture of topics, and each topic as a mixture of words, which allows documents to overlap in content.

For each document in a corpus:

    Choose N ~ Poisson(ξ)   (draw the document length from a Poisson distribution)

    Choose θ ~ Dir(α)       (draw a topic mixture from a Dirichlet distribution over a fixed set of topics)

    For each of the N words w_n:

        (a) Choose a topic z_n ~ Multinomial(θ)   (sample a topic for this word from the document's topic mixture)

        (b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
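
The generative steps above can be sketched with NumPy as follows. The vocabulary, the values of α and ξ, and the topic-word matrix β are hypothetical and chosen only to make the sampling visible; this simulates the generative story and does not fit a model.

```python
import numpy as np

rng = np.random.default_rng(42)

vocab = ["goal", "team", "vote", "law", "match", "court"]  # hypothetical vocabulary
n_topics = 2

alpha = np.full(n_topics, 0.5)   # Dirichlet prior over topics (assumed value)
xi = 8                           # Poisson mean for the document length (assumed value)

# beta: topic-word probabilities, one row per topic (hypothetical values)
beta = np.array([
    [0.35, 0.30, 0.02, 0.03, 0.28, 0.02],
    [0.02, 0.03, 0.35, 0.30, 0.02, 0.28],
])

def generate_document():
    n_words = rng.poisson(xi)                   # N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)                # theta ~ Dir(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=theta)       # (a) z_n ~ Multinomial(theta)
        w = rng.choice(len(vocab), p=beta[z])   # (b) w_n ~ p(w_n | z_n, beta)
        words.append(vocab[w])
    return theta, words

theta, words = generate_document()
print("topic mixture:", np.round(theta, 2))
print("document:", " ".join(words))
```

Fitting LDA runs this story in reverse: given only the observed words, it infers θ for each document and β for each topic.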
        

Evaluation Metrics:

    Perplexity: Measures how well the model describes (predicts) a set of documents. Lower is better.

    Topic Coherence: Measures how similar the top words within a single topic are to each other. Higher is better.
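
To show why lower perplexity is better, the sketch below computes perplexity as the exponential of the negative mean log-likelihood per word. The per-word probabilities are made up for illustration and do not come from a fitted model.

```python
import math

# Hypothetical probabilities that two models assign to the same 4 held-out words
probs_model_a = [0.20, 0.10, 0.25, 0.05]
probs_model_b = [0.02, 0.01, 0.05, 0.01]

def perplexity(word_probs):
    """exp of the negative mean log-likelihood per word."""
    mean_log_lik = sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(-mean_log_lik)

print("model A:", perplexity(probs_model_a))
print("model B:", perplexity(probs_model_b))
```

Model A assigns higher probability to the observed words, so its perplexity is lower. A model that spreads probability uniformly over k words has perplexity k.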

### Implement LDA

In [None]:
%matplotlib inline

In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/deepshah/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
!pip install pyLDAvis


In [6]:
!pip install gensim



In [1]:
# Clean up 20 Newsgroups data
from string import punctuation
from nltk import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.datasets import fetch_20newsgroups

# Obtain the dataset
newsgroups = fetch_20newsgroups()

# Define our stopwords
stop_words = set(stopwords.words('english'))

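# Use RegexpTokenizer to split a string on whitespace
# (gaps=True means the pattern matches the separators, not the tokens)
tokenizer = RegexpTokenizer(r'\s+', gaps=True)

# Stemmer that reduces each word to its stem
stemmer = PorterStemmer()
# Translation table that maps every punctuation character to a space
translate_tab = {ord(p): " " for p in punctuation}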

def text2tokens(raw_text):
    """Split the raw_text string into a list of stemmed tokens"""
    # str.translate replaces each character according to translate_tab (punctuation becomes a space)
    clean_text = raw_text.lower().translate(translate_tab)
    
    # Apply tokenizer
    tokens = [token.strip() for token in tokenizer.tokenize(clean_text)]
    tokens = [token for token in tokens if token not in stop_words]
    
    # Apply Stemmer
    stemmed = [stemmer.stem(token) for token in tokens]
    return [token for token in stemmed if len(token) > 2] # skip short ones

# Convert each document to a list of tokens
data = [text2tokens(text) for text in newsgroups['data']]

In [None]:
from gensim.corpora import Dictionary

# A Dictionary is a mapping between words and their integer ids
dictionary = Dictionary(documents=data, prune_at=None)

# Use the Dictionary to remove irrelevant tokens: drop tokens that appear in
# fewer than 5 documents or in more than 30% of the documents
dictionary.filter_extremes(no_below=5, no_above=0.3, keep_n=None)

# Assign new word IDs to all words, to make IDs more compact after filtering
dictionary.compactify()

# Convert the list of tokens to the bag of word representation
bow_dataset = [dictionary.doc2bow(doc) for doc in data]
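
To make the bag-of-words representation concrete, here is a minimal pure-Python sketch that produces the same kind of `(token_id, count)` pairs on a toy corpus. The documents and the `token2id` and `to_bow` names are hypothetical; this only mimics the shape of the output and is not how gensim implements `doc2bow`.

```python
from collections import Counter

# Hypothetical tokenized documents, for illustration only
docs = [["ball", "team", "ball"], ["vote", "law", "team"]]

# Assign an integer id to each distinct token (sorted for reproducibility)
token2id = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}

def to_bow(doc):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

print(token2id)
print([to_bow(d) for d in docs])
```

Word order is discarded: only which tokens occur in a document, and how often, is kept.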

#### Train (fit) the LDA Model
Key parameters and hyperparameters:

    1. eval_every: Log perplexity is estimated every `eval_every` updates. Enabling it slows down training by ~2x, so we set it to None.
    2. passes: Number of passes through the corpus during training.
    3. workers: Number of worker processes to be used for parallelization.

In [None]:
from gensim.models import LdaMulticore
# Define the number of topics - we'll cover how to determine this later
num_topics = 15

# LdaMulticore is a parallelized implementation of the online LDA algorithm.
# It spreads training over `workers` processes to speed it up; batch=True selects batch rather than online updates
lda1 = LdaMulticore(
        corpus=bow_dataset, num_topics=num_topics, id2word=dictionary,
        workers=4, eval_every=None, passes=10, batch=True
        )

In [None]:
lda2 = LdaMulticore(
        corpus=bow_dataset, num_topics=num_topics, id2word=dictionary,
        workers=4, eval_every=None, passes=20,
        # alpha represents an a priori belief about the document-topic distribution; default is 1.0/num_topics
        alpha=(5.0/num_topics),
        # eta represents an a priori belief about the topic-word distribution; default is 1.0/num_topics
        eta=0.01,
        batch=True,
)
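
To get a feel for what the `alpha` prior does, the sketch below draws document-topic mixtures from symmetric Dirichlet distributions and reports how dominant the largest topic is on average. The helper name `mean_max_weight`, the sample size, and the third alpha value (5.0) are assumptions added for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_topics = 15

def mean_max_weight(alpha, n_samples=2000):
    """Average weight of the largest topic in theta ~ Dir(alpha, ..., alpha)."""
    theta = rng.dirichlet(np.full(num_topics, alpha), size=n_samples)
    return theta.max(axis=1).mean()

for alpha in (1.0 / num_topics, 5.0 / num_topics, 5.0):
    print(f"alpha={alpha:.3f}: mean largest topic weight = {mean_max_weight(alpha):.2f}")
```

Smaller alpha values concentrate each document on a few topics, while larger values spread the weight more evenly across topics. The same reasoning applies to `eta` and the words within a topic.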

In [None]:
# Analyze the results of LDA
import pyLDAvis.gensim_models
import pyLDAvis
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Visualize the topics
pyLDAvis.enable_notebook()
LDAvis_prepared = pyLDAvis.gensim_models.prepare(lda1, bow_dataset, dictionary)
LDAvis_prepared

In [None]:
# Calculate the Coherence Score
from gensim.models import CoherenceModel

# Compute Coherence score using C_V
coherence_model = CoherenceModel(model=lda1, texts=data, dictionary=dictionary, coherence='c_v') 
# The c_v score is computed over a sliding window and combines normalized pointwise mutual information (NPMI) with cosine similarity
coherence = coherence_model.get_coherence()
print("\nCoherence Score: ", coherence) # score = 0.50
# Repeating this for the second model (lda2) gives a score of 0.5649
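
As a rough intuition for coherence, the sketch below scores a topic by the average NPMI of its top-word pairs, using document-level co-occurrence on a toy corpus. This is a deliberate simplification and not the c_v measure itself (it has no sliding window and no cosine-similarity step); the corpus and function names are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical corpus: each document reduced to its set of tokens
docs = [
    {"game", "team", "score"},
    {"game", "team", "coach"},
    {"vote", "law", "court"},
    {"vote", "law", "team"},
]

def npmi(w1, w2):
    """Normalized pointwise mutual information from document co-occurrence."""
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0:
        return -1.0   # words never co-occur (limiting value)
    if p12 == 1:
        return 1.0    # words always co-occur (limiting value)
    return math.log(p12 / (p1 * p2)) / -math.log(p12)

def topic_coherence(top_words):
    """Mean NPMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b) for a, b in pairs) / len(pairs)

print("coherent topic:", topic_coherence(["game", "team", "score"]))
print("mixed topic:   ", topic_coherence(["game", "law", "court"]))
```

Words that tend to appear in the same documents give a higher score, which matches the "higher is better" reading of coherence.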

In [None]:
# Calculate model perplexity
# The .log_perplexity method calculates and returns the per-word likelihood bound, using the given chunk of documents as the evaluation corpus.
# It does not return the perplexity itself but the "per-word bound", i.e. the negative of the log (base 2) of the perplexity

perplexity = lda1.log_perplexity(bow_dataset) # -8.09
# lda2 = -8.13357
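
To turn the per-word bound into a perplexity value, the bound can be exponentiated: gensim logs its perplexity estimate as 2 raised to the negative per-word bound. The sketch below applies that conversion to the two bounds reported above; the function name `bound_to_perplexity` is hypothetical.

```python
# Per-word bounds reported above (copied from the notebook comments)
bound_lda1 = -8.09
bound_lda2 = -8.13357

def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound to perplexity: 2 ** (-bound)."""
    return 2 ** (-per_word_bound)

print(f"lda1 perplexity ~ {bound_to_perplexity(bound_lda1):.1f}")
print(f"lda2 perplexity ~ {bound_to_perplexity(bound_lda2):.1f}")
```

A bound closer to zero corresponds to a lower perplexity, so on this measure lda1 is slightly better than lda2, even though lda2 has the higher coherence score.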