For this example, we will use the 20 Newsgroups dataset, a collection of approximately 20,000 newsgroup documents partitioned (nearly) evenly across 20 different newsgroups.

You can load the training subset of this dataset, with headers, footers, and quoted replies removed, using the following code:

In [6]:
import numpy as np
import pandas as pd
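from sklearn.datasets import fetch_20newsgroups

# Load the training split; strip headers, footers, and quoted replies so the
# model learns from the message bodies rather than from metadata.
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))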


Preprocess the Data

Next, you need to preprocess the data. Here are the steps you can follow:

1) Convert the text to lowercase.

2) Tokenize the text into words.

3) Remove stop words and punctuation.

4) Lemmatize the words.

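Here is the code to do this using spaCy. The en_core_web_sm model must be installed first, for example with python -m spacy download en_core_web_sm: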

In [7]:
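import spacy
from spacy.lang.en.stop_words import STOP_WORDS
nlp = spacy.load('en_core_web_sm')
def preprocess(text):
    """Return the lowercased lemmas of a text, without stop words or punctuation."""
    doc = nlp(text)
    tokens = [
        token.lemma_.lower()
        for token in doc
        # Skip stop words, punctuation, and anything that is not purely alphabetic.
        if not token.is_stop and not token.is_punct and token.lemma_.isalpha()
    ]
    return tokens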
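
To make the first three steps concrete without loading a spaCy model, here is a minimal pure-Python sketch. The function name toy_preprocess and the tiny stop-word list are hypothetical and exist only for illustration; lemmatization (step 4) is left out because it needs a linguistic model such as the one spaCy provides.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# spaCy's built-in list is much larger.
TOY_STOP_WORDS = {"the", "a", "an", "is", "are", "on", "and", "of", "to", "in"}


def toy_preprocess(text):
    """Lowercase, split on whitespace, and drop stop words and punctuation."""
    tokens = []
    for raw in text.lower().split():
        # Strip punctuation from both ends of each whitespace-separated chunk.
        word = raw.strip(string.punctuation)
        # Keep purely alphabetic words that are not in the toy stop list.
        if word.isalpha() and word not in TOY_STOP_WORDS:
            tokens.append(word)
    return tokens


example_tokens = toy_preprocess("The cats are sitting on the mat, and the dog barks!")
print(example_tokens)  # ['cats', 'sitting', 'mat', 'dog', 'barks']
```

With lemmatization, as in the spaCy version, words such as "cats" and "sitting" would typically be reduced further to their base forms.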


Prepare the Corpus and Dictionary

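Now that you have preprocessed the data, you can prepare the corpus and dictionary. The dictionary is a mapping between words and their integer ids. The corpus is a list of bag-of-words representations, one per document in the dataset, where each document is a list of (word id, count) pairs. The code also fits a TF-IDF model on the corpus and produces a TF-IDF-weighted version of it, corpus_tfidf, which the later steps use.

Here is the code to prepare the corpus and dictionary: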

In [8]:
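from gensim.corpora import Dictionary
from gensim.models import TfidfModel

# Preprocess every document (this is the slow step: spaCy runs on each text).
texts = [preprocess(text) for text in newsgroups_train.data]
# Map each unique token to an integer id.
dictionary = Dictionary(texts)
# Convert each document to a list of (token id, count) pairs.
corpus = [dictionary.doc2bow(text) for text in texts]
# Fit TF-IDF on the counts and wrap the corpus so it yields weighted vectors.
tfidf = TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]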
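
To show what the dictionary and the bag-of-words corpus look like, here is a small pure-Python sketch of the same idea on three toy documents. The names toy_texts, toy_token2id, and toy_doc2bow are hypothetical, and the ids are assigned in order of first appearance, which is an assumption of this sketch and may differ from the ids gensim assigns.

```python
from collections import Counter

# Three already-tokenized toy documents (illustrative data only).
toy_texts = [
    ["space", "launch", "orbit", "space"],
    ["hockey", "game", "team"],
    ["space", "team"],
]

# Build the word -> integer id mapping in order of first appearance.
toy_token2id = {}
for tokens in toy_texts:
    for token in tokens:
        if token not in toy_token2id:
            toy_token2id[token] = len(toy_token2id)


def toy_doc2bow(tokens):
    """Return a sorted list of (token id, count) pairs for one document."""
    counts = Counter(toy_token2id[token] for token in tokens)
    return sorted(counts.items())


toy_corpus = [toy_doc2bow(tokens) for tokens in toy_texts]
print(toy_token2id)
# {'space': 0, 'launch': 1, 'orbit': 2, 'hockey': 3, 'game': 4, 'team': 5}
print(toy_corpus)
# [[(0, 2), (1, 1), (2, 1)], [(3, 1), (4, 1), (5, 1)], [(0, 1), (5, 1)]]
```

Word order is discarded: each document keeps only which words occur and how often, which is the input format LDA expects.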


Train the LDA Model

With the corpus and dictionary ready, you can train the LDA model. Here is the code to do this:

In [9]:
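from gensim.models import LdaMulticore

# Number of topics to extract; the dataset has 20 newsgroups, so other values are worth trying.
num_topics = 10
# passes = full sweeps over the corpus; workers = number of worker processes.
lda_model = LdaMulticore(corpus_tfidf, num_topics=num_topics, id2word=dictionary, passes=10, workers=4)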


Visualize the Results

To visualize the results of the topic modelling, you can use the pyLDAvis library, which renders an interactive view of the topics inside the notebook. Here is the code to do this:

In [10]:
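import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

# Build the visualization data from the trained model, the corpus, and the dictionary.
vis = gensimvis.prepare(lda_model, corpus_tfidf, dictionary)
# Render the interactive view in the notebook output cell.
pyLDAvis.display(vis)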

