LDA (Latent Dirichlet Allocation) is a topic model that assigns documents to topics based on two probabilities: the probability of a term given a topic, and the probability of a topic given a document.
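
As a rough illustration of how these two probabilities combine, the sketch below computes the probability of a term in a document as a topic-weighted mixture. The topic names and all numbers are hypothetical toy values, not estimates from the newsgroups data.

```python
# Hypothetical toy values: P(topic | document) for one document
topic_given_doc = {"religion": 0.7, "space": 0.3}

# Hypothetical toy values: P(term | topic) for a three-term vocabulary
term_given_topic = {
    "religion": {"god": 0.5, "orbit": 0.0, "people": 0.5},
    "space": {"god": 0.0, "orbit": 0.6, "people": 0.4},
}

def term_probability(term):
    # P(term | doc) = sum over topics of P(topic | doc) * P(term | topic)
    return sum(
        topic_given_doc[topic] * term_given_topic[topic][term]
        for topic in topic_given_doc
    )

term_probs = {term: term_probability(term) for term in ("god", "orbit", "people")}
print(term_probs)
```

Fitting LDA works in the opposite direction: it starts from the observed term counts and estimates both sets of probabilities.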

Load and clean data

In [2]:
from sklearn.datasets import fetch_20newsgroups
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space'
]
groups = fetch_20newsgroups(subset='all', categories=categories)
labels = groups.target
label_names = groups.target_names
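from nltk.corpus import names
from nltk.stem import WordNetLemmatizer

def is_letter_only(word):
    # True when every character in the word is alphabetic
    return word.isalpha()

# Requires the NLTK 'names' and 'wordnet' corpora (see nltk.download)
all_names = set(names.words())
lemmatizer = WordNetLemmatizer()

data_cleaned = []
for doc in groups.data:
    doc = doc.lower()
    # Keep alphabetic tokens that are not in the names list, then lemmatize them.
    # Note: the names corpus is capitalized, so after lower() this filter
    # matches few if any tokens; compare against lowercased names to make it effective.
    doc_cleaned = ' '.join(lemmatizer.lemmatize(word) for word in doc.split() if is_letter_only(word) and word not in all_names)
    data_cleaned.append(doc_cleaned)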

Create a count vector (LDA models raw term counts, so use CountVectorizer rather than TF-IDF)

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer(stop_words="english", max_features=None, max_df=0.5, min_df=2)
data = count_vector.fit_transform(data_cleaned)
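
To make the idea of a count vector concrete, here is a minimal pure-Python sketch of a bag-of-words count matrix. The two toy documents are hypothetical, and the sketch skips the stop-word removal and max_df/min_df filtering that CountVectorizer applies above.

```python
from collections import Counter

# Hypothetical toy documents, not taken from the newsgroups data
toy_docs = [
    "the shuttle reached orbit",
    "orbit data from the shuttle shuttle",
]

# The vocabulary is the sorted set of all distinct tokens
vocabulary = sorted({word for doc in toy_docs for word in doc.split()})

# Each document becomes one row of term counts, one column per vocabulary term
count_rows = []
for doc in toy_docs:
    counts = Counter(doc.split())
    count_rows.append([counts[term] for term in vocabulary])

print(vocabulary)
print(count_rows)
```

Each row holds integer counts, which is the kind of input the LDA model is defined over.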

Create the LDA object

In [4]:
from sklearn.decomposition import LatentDirichletAllocation
t = 20
lda = LatentDirichletAllocation(n_components=t, learning_method='batch', random_state=42)

Fitting the model

In [5]:
lda.fit(data)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=20, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)
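
Once a model is fitted, the document-topic distribution can be obtained with transform (or fit_transform). The sketch below uses a hypothetical four-document toy corpus and three topics so that it runs on its own; with the notebook's objects the equivalent call would be lda.transform(data).

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for data_cleaned
toy_docs = [
    "god jesus christian believe",
    "spacecraft mission orbit launch",
    "image file jpeg color",
    "god believe atheist christian",
]
toy_counts = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=3, learning_method='batch', random_state=42)

# One row per document, one column per topic; each row sums to 1
doc_topic = toy_lda.fit_transform(toy_counts)
print(doc_topic.shape)
print(doc_topic.sum(axis=1))

# Index of the highest-probability topic for each document
dominant_topic = doc_topic.argmax(axis=1)
print(dominant_topic)
```

On a corpus this small the topics themselves are not meaningful; the point is only the shape and normalization of the output.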

The topic-term weights are stored in components_, an array with one row per topic and one column per term

In [6]:
lda.components_

array([[0.05     , 2.05     , 2.05     , ..., 0.05     , 0.05     ,
        0.05     ],
       [0.05     , 0.05     , 0.05     , ..., 0.05     , 0.05     ,
        0.05     ],
       [0.05     , 0.05     , 0.05     , ..., 4.0336285, 0.05     ,
        0.05     ],
       ...,
       [0.05     , 0.05     , 0.05     , ..., 0.05     , 0.05     ,
        0.05     ],
       [0.05     , 0.05     , 0.05     , ..., 0.05     , 0.05     ,
        0.05     ],
       [0.05     , 0.05     , 0.05     , ..., 0.05     , 0.05     ,
        3.05     ]])
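
The values in components_ are unnormalized weights rather than probabilities. The repeated 0.05 entries most likely reflect the default topic_word_prior of 1 / n_components for terms with no weight in that topic. A minimal sketch of turning the weights into per-topic term distributions, using a hypothetical 2-topic, 3-term array in place of lda.components_:

```python
import numpy as np

# Hypothetical small array standing in for lda.components_
toy_components = np.array([
    [0.05, 2.05, 2.05],
    [4.05, 0.05, 0.05],
])

# Divide each row by its sum so every topic's term weights sum to 1
topic_term_dist = toy_components / toy_components.sum(axis=1)[:, np.newaxis]
print(topic_term_dist)

# argsort is ascending; reverse it to list term indices from strongest to weakest
strongest_first = toy_components[1].argsort()[::-1]
print(strongest_first)
```

With the fitted model, the same normalization would be applied to lda.components_ directly.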

Print the top 10 terms for each topic

In [7]:
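# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, call get_feature_names() instead
terms = count_vector.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print("Topic: {}:".format(topic_idx))
    # argsort is ascending, so the highest-weighted term is printed last
    print(" ".join([terms[i] for i in topic.argsort()[-10:]]))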

Topic: 0:
atheist doe ha believe say jesus people christian wa god
Topic: 1:
moment just adobe want know ha wa hacker article radius
Topic: 2:
center point ha wa available research computer data graphic hst
Topic: 3:
objective argument just thing doe people wa think say article
Topic: 4:
time like brian ha good life want know just wa
Topic: 5:
computer graphic think know need university just article wa like
Topic: 6:
free program color doe use version gif jpeg file image
Topic: 7:
gamma ray did know university ha just like article wa
Topic: 8:
tool ha processing using data software color program bit image
Topic: 9:
apr men know ha think woman just university article wa
Topic: 10:
jpl propulsion mission april mar jet command data spacecraft wa
Topic: 11:
russian like ha university redesign point option article space station
Topic: 12:
ha van book star material physicist universe physical theory wa
Topic: 13:
bank doe book law wa article rushdie muslim islam islamic
Topic: 14:
think goph