NMF (non-negative matrix factorization) approximately factorizes a non-negative input matrix V (here, the bag-of-words document-term matrix) into two smaller non-negative matrices: a topic-term matrix (W) and a document-topic matrix (H).
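
As a rough illustration of the factorization described above, the sketch below runs scikit-learn's NMF on a made-up 4 x 5 count matrix. The matrix values, the names doc_topic and topic_term, and the choices init="nndsvda" and max_iter=500 are assumptions for this toy example only; they are not taken from the notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy document-term count matrix V: 4 documents (rows) x 5 terms (columns).
V = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 0, 0, 0],
    [0, 0, 4, 2, 1],
    [0, 1, 2, 3, 0],
], dtype=float)

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=42, max_iter=500)
doc_topic = toy_nmf.fit_transform(V)   # documents x topics
topic_term = toy_nmf.components_       # topics x terms

print(doc_topic.shape)    # (4, 2)
print(topic_term.shape)   # (2, 5)

# The product of the two factors approximates V, and every entry is non-negative.
V_approx = doc_topic @ topic_term
print(np.round(V_approx, 1))
```

The two factors are much smaller than V when the number of topics is small relative to the vocabulary, which is what makes the rows of the topic-term factor readable as topics.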

Load and clean data

In [2]:
from sklearn.datasets import fetch_20newsgroups
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space'
]
groups = fetch_20newsgroups(subset='all', categories=categories)
labels = groups.target
label_names = groups.target_names
def is_letter_only(word):
    for char in word:
        if not char.isalpha():
            return False
    return True

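# The NLTK 'names' and 'wordnet' data must be available, e.g. via nltk.download()
from nltk.corpus import names
all_names = set(names.words())
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
data_cleaned = []
for doc in groups.data:
    doc = doc.lower()
    # Keep letter-only tokens that are not in the names list, then lemmatize them.
    # Note: the names corpus is capitalized while doc is lowercased here,
    # so this lookup may filter fewer names than intended.
    doc_cleaned = ' '.join(lemmatizer.lemmatize(word) for word in doc.split() if is_letter_only(word) and word not in all_names)
    data_cleaned.append(doc_cleaned)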
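
To see what the letter-only filter in the cleaning loop keeps and drops, here is a small stand-alone sketch. The sample sentence is invented for illustration, and the lemmatizer and name filter are left out so that the example needs no NLTK data.

```python
def is_letter_only(word):
    # Same check as in the cell above: True only if every character is a letter.
    for char in word:
        if not char.isalpha():
            return False
    return True

sample = "The shuttle launched in 1993, and NASA's crew cheered!"
tokens = sample.lower().split()
kept = [tok for tok in tokens if is_letter_only(tok)]
dropped = [tok for tok in tokens if not is_letter_only(tok)]

print(kept)     # ['the', 'shuttle', 'launched', 'in', 'and', 'crew']
print(dropped)  # ['1993,', "nasa's", 'cheered!']
```

Because the text is split on whitespace only, any word with punctuation attached (such as 'cheered!') is discarded along with numbers, not just stripped of its punctuation.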

We create an NMF object with 20 topics

In [3]:
from sklearn.decomposition import NMF
t = 20
nmf = NMF(n_components=t, random_state=42)

We build the document-term count matrix with a CountVectorizer (a TfidfVectorizer would also work here)

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer(stop_words="english", max_features=None, max_df=0.5, min_df=2)
data = count_vector.fit_transform(data_cleaned)
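
The effect of max_df and min_df is easier to see on a tiny corpus. The sketch below uses four made-up documents and the hypothetical names toy_docs and toy_vectorizer; only the max_df=0.5 and min_df=2 settings mirror the cell above.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "space shuttle launch orbit",
    "space image file jpeg",
    "space god belief atheist",
    "image file format orbit",
]

# min_df=2 drops terms found in fewer than 2 documents;
# max_df=0.5 drops terms found in more than 50% of the documents.
toy_vectorizer = CountVectorizer(max_df=0.5, min_df=2)
toy_counts = toy_vectorizer.fit_transform(toy_docs)

print(sorted(toy_vectorizer.vocabulary_))  # ['file', 'image', 'orbit']
print(toy_counts.toarray())
# [[0 0 1]
#  [1 1 0]
#  [0 0 0]
#  [1 1 1]]
```

Here 'space' is removed for being too common (3 of 4 documents) and the one-off terms are removed for being too rare, so the third document ends up with an all-zero row.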

Fitting the data

In [5]:
nmf.fit(data)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=20, random_state=42, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)
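
Calling fit only learns the topic-term factor. A fitted model can also map documents to topic weights with transform, which gives a simple way to assign each document a dominant topic. The sketch below assumes a made-up count matrix with two obvious groups of terms; toy_counts, toy_nmf, and the init and max_iter settings are illustrative choices, not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Documents 0-1 use only the first two terms, documents 2-3 only the last two.
toy_counts = np.array([
    [4, 3, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 6, 2],
    [0, 0, 2, 6],
], dtype=float)

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=42, max_iter=500)
toy_nmf.fit(toy_counts)

# One row per document, one column per topic.
toy_doc_topic = toy_nmf.transform(toy_counts)

# The dominant topic of a document is the column with the largest weight.
dominant = toy_doc_topic.argmax(axis=1)
print(dominant)
```

The topic numbering itself is arbitrary; what matters is that documents sharing vocabulary end up with the same dominant topic.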

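We can get W, the topic-term matrix: each row holds the rank (weight) of every term for one topic. scikit-learn exposes it as components_ (its own documentation calls this matrix H)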

In [6]:
nmf.components_

array([[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 1.81952400e-04],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 7.35497518e-04, 3.65665719e-03],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        2.69725134e-02, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 4.26844886e-05],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])
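
A matrix like the one above is hard to read directly. A common way to summarize each row is to pick the indices of its largest entries with argsort. The sketch below uses a made-up 2 x 6 weight matrix and vocabulary (toy_components, toy_terms) to show the ascending order that argsort returns and how to reverse it.

```python
import numpy as np

toy_terms = np.array(["file", "god", "image", "jpeg", "orbit", "space"])
# One row per topic, one column per term (made-up weights).
toy_components = np.array([
    [0.9, 0.0, 1.4, 0.7, 0.0, 0.1],
    [0.0, 0.2, 0.1, 0.0, 1.1, 1.8],
])

top_n = 3
top_terms = {}
for topic_idx, weights in enumerate(toy_components):
    ascending = weights.argsort()[-top_n:]   # strongest term last
    descending = ascending[::-1]             # strongest term first
    top_terms[topic_idx] = toy_terms[descending].tolist()
    print(topic_idx, toy_terms[ascending].tolist(), top_terms[topic_idx])
# 0 ['jpeg', 'file', 'image'] ['image', 'file', 'jpeg']
# 1 ['god', 'orbit', 'space'] ['space', 'orbit', 'god']
```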

Display the 10 highest-ranked terms for each topic. Because argsort sorts in ascending order, the strongest term appears last on each line

In [7]:
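# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
terms = count_vector.get_feature_names_out()
for topic_idx, topic in enumerate(nmf.components_):
    print("Topic: {}:".format(topic_idx))
    # argsort is ascending, so the last 10 indices are the 10 largest weights
    print(" ".join([terms[i] for i in topic.argsort()[-10:]]))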

Topic: 0:
available quality program free color version gif file image jpeg
Topic: 1:
ha article make know doe say like just people think
Topic: 2:
include available analysis user software ha processing data tool image
Topic: 3:
atmosphere kilometer surface ha earth wa planet moon spacecraft solar
Topic: 4:
communication technology venture service market ha commercial space satellite launch
Topic: 5:
verse wa jesus father mormon shall unto mcconkie lord god
Topic: 6:
format message server object image mail file ray send graphic
Topic: 7:
christian people doe atheism believe religion belief religious god atheist
Topic: 8:
file graphic grass program ha package ftp available image data
Topic: 9:
speed material unified star larson book universe theory physicist physical
Topic: 10:
planetary station program group astronaut center mission shuttle nasa space
Topic: 11:
infrared high astronomical center acronym observatory satellite national telescope space
Topic: 12:
used occurs true form ha a