In [1]:
%matplotlib inline

# Algorithms for Data Mining Workshop Week 10 - Topic Models

The goal of this workshop is to extract topic models out of a corpus of 20 newsgroups. The corpus contains over 11000 documents. We will test Non-negative Matrix Factorization and Latent Dirichlet Allocation on this corpus with different pre-processing steps. 

In order to make the notebook work, you first have to install the textblob library. This can be done as follows:
- open a command line and change into the anaconda2/bin directory
- type: "./pip install -U textblob"
- type: "./python -m textblob.download_corpora"

or simply, from a notebook cell:
- !pip install textblob
- !python -m textblob.download_corpora


# Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation


The default parameters (n_samples / n_features / n_topics) should make
the example runnable in a couple of tens of seconds. You can try to
increase the dimensions of the problem, but be aware that the time
complexity is polynomial in NMF. In LDA, the time complexity is
proportional to (n_samples * iterations).


In [1]:
# To install textblob
#!pip install textblob
#!python -m textblob.download_corpora

In [None]:
from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from textblob import TextBlob
from sklearn.datasets import fetch_20newsgroups

n_samples = 2000
n_topics = 20
n_top_words = 20

In [None]:
# Load the 20 newsgroups dataset. The posts are stripped of headers,
# footers and quoted replies. Common English words and words that are very
# rare or very frequent are filtered out later, in the vectorization step.

print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:n_samples]
print("done in %0.3fs." % (time() - t0))

# type of data_samples
print(type(data_samples))
# Look at the first entry
print(data_samples[0])
# How many documents do we have?
len(data_samples)

## Exercise 1: Data Preprocessing

Our first step is data preprocessing. Write the function split_into_tokens that splits a document into its individual words. Subsequently, use the class CountVectorizer to transform the documents into the bag-of-words representation. How big is your resulting dictionary?
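
As a reminder of what a bag-of-words representation looks like, here is a minimal sketch in plain Python. The toy documents and the whitespace tokenization are assumptions for illustration; CountVectorizer produces the same kind of count matrix, but stores it as a sparse matrix.

```python
from collections import Counter

# Hypothetical toy corpus, tokenized by simple whitespace splitting
toy_docs = ["the disk drive", "the moon probe orbits the moon"]
tokenized = [doc.split() for doc in toy_docs]

# The vocabulary is the sorted set of all distinct tokens
vocabulary = sorted(set(tok for tokens in tokenized for tok in tokens))

# Each document becomes a vector of term counts, one entry per vocabulary term
bow = [[Counter(tokens)[term] for term in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in bow:
    print(row)
```

Each column corresponds to one vocabulary term, so the size of the dictionary equals the length of every document vector.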


In [9]:
def split_into_tokens(message):
    message = message.lower()  # convert to lower case
    return TextBlob(message).words

# What does the first element look like?
split_into_tokens(data_samples[0])

WordList([u'well', u'i', u"'m", u'not', u'sure', u'about', u'the', u'story', u'nad', u'it', u'did', u'seem', u'biased', u'what', u'i', u'disagree', u'with', u'is', u'your', u'statement', u'that', u'the', u'u.s', u'media', u'is', u'out', u'to', u'ruin', u'israels', u'reputation', u'that', u'is', u'rediculous', u'the', u'u.s', u'media', u'is', u'the', u'most', u'pro-israeli', u'media', u'in', u'the', u'world', u'having', u'lived', u'in', u'europe', u'i', u'realize', u'that', u'incidences', u'such', u'as', u'the', u'one', u'described', u'in', u'the', u'letter', u'have', u'occured', u'the', u'u.s', u'media', u'as', u'a', u'whole', u'seem', u'to', u'try', u'to', u'ignore', u'them', u'the', u'u.s', u'is', u'subsidizing', u'israels', u'existance', u'and', u'the', u'europeans', u'are', u'not', u'at', u'least', u'not', u'to', u'the', u'same', u'degree', u'so', u'i', u'think', u'that', u'might', u'be', u'a', u'reason', u'they', u'report', u'more', u'clearly', u'on', u'the', u'atrocities', u'what

In [10]:
bow_transformer = CountVectorizer(analyzer=split_into_tokens)
bow_transformer.fit(data_samples)
data_bow = bow_transformer.transform(data_samples)

print(len(bow_transformer.vocabulary_))

38884


Your dictionary is probably pretty huge! We can restrict the number of words by using the min_df and max_df parameters of the CountVectorizer class. When given as a float, min_df keeps only words that occur in at least that fraction of the documents, and max_df discards words that occur in more than that fraction of the documents, as these are uninformative. Use min_df = 0.001 and max_df = 0.3 and retrain the CountVectorizer. Is your vocabulary smaller now?
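
To make the effect of the two thresholds concrete, the following sketch computes document frequencies by hand. The function name filter_vocabulary and the toy documents are assumptions for illustration; this is not how CountVectorizer is implemented internally, only the same filtering rule for float thresholds.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, min_df, max_df):
    """Keep terms whose document-frequency fraction lies in [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term at most once per document
    return sorted(term for term, df in doc_freq.items()
                  if min_df <= df / n_docs <= max_df)

# Hypothetical toy corpus: "the" is in every document, several terms in only one
toy_docs = [
    ["the", "disk", "drive"],
    ["the", "moon", "probe"],
    ["the", "disk", "controller"],
    ["the", "moon", "orbit"],
]
vocab = filter_vocabulary(toy_docs, min_df=0.5, max_df=0.75)
print(vocab)  # ['disk', 'moon']
```

Here "the" is dropped by max_df because it appears in all documents, and the terms that appear in a single document are dropped by min_df.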

In [49]:
bow_transformer = CountVectorizer(analyzer=split_into_tokens, min_df=0.001, max_df=0.3)
bow_transformer.fit(data_samples)
data_bow = bow_transformer.transform(data_samples)

print(len(bow_transformer.vocabulary_))
#print(bow_transformer.vocabulary_)
#print(bow_transformer.get_feature_names())
#print(type(data_bow))

12337


## Training a non-negative matrix factorization (NMF) model

The NMF can be trained by:

In [12]:
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(data_bow)
nmf

NMF(alpha=0.1, beta=1, eta=0.1, init=None, l1_ratio=0.5, max_iter=200,
  n_components=20, nls_max_iter=2000, random_state=1, shuffle=False,
  solver='cd', sparseness=None, tol=0.0001, verbose=0)
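
For intuition about what the factorization computes, here is a minimal pure-Python sketch of NMF using the classic multiplicative update rules. This is not the coordinate-descent solver (solver='cd') reported in the output above, and it ignores the regularization parameters; the helper names (matmul, transpose, frobenius_error, nmf_multiplicative) and the toy matrix are assumptions made for illustration only.

```python
import random

def matmul(A, B):
    # Plain nested-list matrix product
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    # Squared Frobenius norm of V - W H
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))

def nmf_multiplicative(V, n_components, n_iter=200, seed=1, eps=1e-9):
    """Approximate V (docs x terms) by W (docs x topics) times H (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(V), len(V[0])
    W = [[rng.random() for _ in range(n_components)] for _ in range(n_docs)]
    H = [[rng.random() for _ in range(n_terms)] for _ in range(n_components)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H

# Hypothetical toy count matrix: two documents per "topic"
V = [[2, 1, 0, 0],
     [4, 2, 0, 0],
     [0, 0, 1, 3],
     [0, 0, 2, 6]]
W, H = nmf_multiplicative(V, n_components=2)
print(frobenius_error(V, W, H))
```

The rows of H play the role of model.components_ in scikit-learn: each row holds one topic's non-negative weight for every vocabulary term.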

Now we can look at the topics. We first define a function that prints the top 20 words of each topic:

In [14]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
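
The slice `[:-n_top_words - 1:-1]` in the function above can look cryptic. The sketch below reproduces it on a plain Python list, with `sorted(range(...))` standing in for NumPy's `argsort()`; the weights are made-up values for illustration.

```python
# Hypothetical topic weights, one per vocabulary index
weights = [0.1, 0.7, 0.0, 0.4, 0.9]
n_top = 3

# Indices ordered by ascending weight, like topic.argsort()
ascending = sorted(range(len(weights)), key=weights.__getitem__)

# Walk backwards from the end and stop after n_top entries:
# this yields the indices of the n_top largest weights, largest first
top_indices = ascending[:-n_top - 1:-1]
print(top_indices)  # [4, 1, 3]
```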

And then use it for our topic model...

In [15]:
print("\nTopics in NMF model:")
feature_names = bow_transformer.get_feature_names()
print_top_words(nmf, feature_names, n_top_words)


Topics in NMF model:
Topic #0:
m p t x s l h 8 b r z w f 0 9 n c u o v
Topic #1:
was were there we me he she what did my people all had them out one our said her who
Topic #2:
1 2 3 each 4 6 copies vs 5 w copy print annual new cover issue left 10 8 rider
Topic #3:
disk drives hard bios drive rom controller 2 feature card system supports floppy will up 16 interface systems must cylinders
Topic #4:
graphics send mail ray 3d also objects image by files there package stuff file etc format message images many available
Topic #5:
your my me what so does one would all about which just no 'm any only than other get how
Topic #6:
was first probe mars lunar surface probes moon space orbit mission venus missions mariner by earth its were into pioneer
Topic #7:
any section firearm weapon military license shall by dangerous person division application which means device use following under issued other
Topic #8:
hiv aids health will by care said children medical new patients disease other 1993 inf

As we can see, the quality of our topics is quite poor. We have to do some more preprocessing. 

## Exercise 2: Stemming and lower case words

Repeat the previous experiment for training an NMF, but this time use stemming as a preprocessing step. Also, convert all words to lower case. Did your vocabulary size decrease? Print the first element of data_samples in the bag-of-words representation. Can you interpret this vector?

In [16]:
def split_into_lemmas(message):
    # convert to lower case
    message = message.lower()
    words = TextBlob(message).words
    # for each word, take its "base form" = lemma 
    words = [word.lemma for word in words]
    return words


In [17]:
bow_transformer = CountVectorizer(analyzer=split_into_lemmas, min_df=0.001, max_df=0.3)
bow_transformer.fit(data_samples)
data_bow = bow_transformer.transform(data_samples)

print(len(bow_transformer.vocabulary_))

11141


In [18]:
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(data_bow)
nmf

print("\nTopics in NMF model:")
feature_names = bow_transformer.get_feature_names()
print_top_words(nmf, feature_names, n_top_words)


Topics in NMF model:
Topic #0:
m p t x s l h r 8 b z w 0 f 9 n c q v u
Topic #1:
wa were there we me he what she did my people all had them one out our know said her
Topic #2:
drive disk system hard bios controller rom support feature 2 card floppy will up 16 interface head must cylinder speed
Topic #3:
1 2 3 copy each 4 6 v 5 comic w cover print annual new issue left 10 8 price
Topic #4:
file image graphic send mail ray object 3d format also package by system server there stuff message etc site many
Topic #5:
car brake tire fluid oil q system may there some by will your braking ab when more driver manufacturer ha
Topic #6:
probe wa mission first mar lunar surface moon space orbit venus mariner by planet earth were into pioneer image soviet
Topic #7:
any section firearm weapon military license shall by person dangerous division application device which mean explosive state use following under
Topic #8:
hiv aid health will by disease child vaccine trial care said medical patient new st

## Exercise 3: Pruning numbers and short words
We still get low-quality topics, with numbers or only single letters. We want to prune this vocabulary. In your 
preprocessing step, remove all words that contain digits or that have fewer than 3 letters from the vocabulary.

Hint:
- you can check whether a string consists only of letters with .isalpha(). It returns True only if every character is alphabetic, so words containing a digit (or punctuation) are rejected.
- the following pattern might help you: 
        words = [word for word in words if *put your code here *]
        
How large is your vocabulary? Again train the topic model and print the topics. 
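
A quick way to see what .isalpha() accepts is to try it on a few tokens. The tokens below are illustrative examples, some of them taken from the outputs earlier in this notebook.

```python
# Illustrative tokens: plain word, digit-containing, hyphenated, dotted, contraction, empty
tokens = ["drive", "3d", "pro-israeli", "u.s", "'m", ""]

# Map each token to the result of str.isalpha()
checks = {token: token.isalpha() for token in tokens}
for token, ok in checks.items():
    print(repr(token), ok)
```

Note that hyphenated or dotted tokens are rejected as well, not only those containing digits, and that the empty string also returns False.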
    

In [19]:
def split_into_lemmas(message):
    # convert to lower case
    message = message.lower()
    words = TextBlob(message).words
    # for each word, take its "base form" = lemma 
    words = [word.lemma for word in words]
    words = [word for word in words if  word.isalpha() and len(word) >= 3]
    return words


In [20]:
bow_transformer = CountVectorizer(analyzer=split_into_lemmas, min_df=0.001, max_df=0.3)
bow_transformer.fit(data_samples)
data_bow = bow_transformer.transform(data_samples)

print(len(bow_transformer.vocabulary_))

9438


In [21]:
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(data_bow)
nmf

print("\nTopics in NMF model:")
feature_names = bow_transformer.get_feature_names()
print_top_words(nmf, feature_names, n_top_words)


Topics in NMF model:
Topic #0:
were there what she did all people had them one out our know said her who something just mamma say
Topic #1:
drive disk system hard bios controller rom support feature card floppy will interface head must cylinder speed formatting board your
Topic #2:
image file graphic send mail ray object format also package system server there stuff etc message site many amiga available
Topic #3:
any section firearm weapon military license shall person dangerous division application device which mean explosive state use following under issued
Topic #4:
probe mission first mar lunar surface moon space orbit venus mariner planet earth were into pioneer image soviet planetary through
Topic #5:
were there people out had armenian them our when one crowd who went baku home karabagh his parent then said
Topic #6:
hiv aid health will disease child vaccine trial said care medical patient new study number other service information research were
Topic #7:
version machine contact

## Exercise 4: TFIDF representation
Repeat the training with the TFIDF representation instead of the Bag-of-words representation. Can you see a difference in the topics?
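
To see how TFIDF weights differ from raw counts, here is a small pure-Python sketch. It assumes the smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document, which is what scikit-learn documents as the TfidfVectorizer defaults; the function name tfidf and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # document frequency of each term
    # Smoothed idf: terms found in fewer documents get larger values
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized_docs:
        raw = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0  # guard empty docs
        rows.append({t: v / norm for t, v in raw.items()})
    return rows

# Hypothetical toy corpus: "problem" occurs in every document
toy_docs = [["disk", "drive", "problem"],
            ["moon", "probe", "problem"],
            ["god", "faith", "problem"]]
weights = tfidf(toy_docs)
print(weights[0])
```

In the first document, "disk" and "drive" receive a higher weight than "problem", even though all three occur once, because "problem" appears in every document.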

In [22]:
tfidf_transformer = TfidfVectorizer(analyzer=split_into_lemmas, min_df=0.001, max_df=0.3)
tfidf_transformer.fit(data_samples)
data_tfidf = tfidf_transformer.transform(data_samples)

print(data_tfidf.shape)

(2000, 9438)


In [23]:
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(data_tfidf)
nmf

print("\nTopics in NMF model:")
feature_names = tfidf_transformer.get_feature_names()
print_top_words(nmf, feature_names, n_top_words)


Topics in NMF model:
Topic #0:
what all will about one people just who more some think their them like were out when get which time
Topic #1:
window program application using running font use manager microsoft mode workspace code problem run help memory compiler screen looking software
Topic #2:
god christian bible jesus his faith christ doe heaven believe him religion belief will son who sin say life lord
Topic #3:
key chip clipper encryption session government phone secret bit rsa use message enforcement communication secure encrypted public deposit scheme algorithm
Topic #4:
thanks please anyone any know doe info advance email interested anybody mail reply information looking help send appreciated hello could
Topic #5:
drive floppy disk hard controller problem cable mac ide scsi meg pin upgrade software external access kit digital cartridge boot
Topic #6:
geb chastity skepticism shameful intellect bank gordon surrender soon too blood fit nerve drop evangelist pressure pituitary dis

## Exercise 5: Removing stop words
Our dictionary still contains many uninformative words. Download the stopwords corpus from nltk and use the English stop words. In your data-preprocessing step, delete all words that are contained in the stop word list. Hints:
- download the corpus: 

In [24]:
import nltk
nltk.download('stopwords') 

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Mingjun\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

- get the stopwords:
    

In [25]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

- Again use the function split_into_lemmas for removing the stop words. Do not use the built-in stop_words option of the TfidfVectorizer: it is ignored when a custom analyzer function is passed.
- You can check whether a word is contained in the set of stop words with

In [26]:
'have' in stop_words

True

Repeat the experiment and train the topic models. Do you get better topics?

In [27]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Mingjun\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [28]:
def split_into_lemmas(message):
    # convert to lower case
    message = message.lower()
    words = TextBlob(message).words
    # for each word, take its "base form" = lemma 
    words = [word.lemma for word in words]
    words = [word for word in words if  word.isalpha() and len(word) >= 3]
    words = [word for word in words if word not in stop_words]
    return words


In [29]:
tfidf_transformer = TfidfVectorizer(analyzer=split_into_lemmas, min_df=0.001, max_df=0.3)
tfidf_transformer.fit(data_samples)
data_tfidf = tfidf_transformer.transform(data_samples)

print(data_tfidf.shape)

(2000, 9355)


In [30]:
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(data_tfidf)
nmf

print("\nTopics in NMF model:")
feature_names = tfidf_transformer.get_feature_names()
print_top_words(nmf, feature_names, n_top_words)


Topics in NMF model:
Topic #0:
one people think like get time know thing year make good could well say right even way see would want
Topic #1:
drive disk hard controller problem ide scsi meg mac access cartridge software pin digital external boot western internal cable tape
Topic #2:
god christian bible jesus faith believe religion christ heaven belief son sin say life law atheism scripture satan lord love
Topic #3:
key chip clipper encryption session government phone secret bit rsa use message secure enforcement communication encrypted public scheme deposit algorithm
Topic #4:
thanks please anyone email advance know reply interested info send mail looking hello list information could someone address anybody help
Topic #5:
file bmp format swap ftp help drv read exe directory problem site copy available gif library postscript program midi disk
Topic #6:
geb chastity skepticism shameful bank intellect gordon surrender soon blood fit nerve drop evangelist pituitary disease pressure medic