# Topic modelling



<hr> 

Topic modelling is an unsupervised machine learning technique for discovering the topics that are present in a corpus. It is called 'unsupervised' because it does not require training data that has previously been annotated and classified by humans. Topic modelling involves counting words and grouping similar word patterns to infer topics within unstructured data.

We will learn how to do topic modelling in Python using <b>Latent Dirichlet Allocation (LDA)</b> and <b>Non-negative Matrix Factorization (NMF)</b> as implemented in the scikit-learn library. While LDA and NMF are based on different mathematical concepts, both algorithms can return the words that belong to each topic and the documents in the corpus that belong to each topic.

Run all code cells in the given order. 

### Preprocessing

First we import all Python libraries/modules we need.

In [2]:
import os

from sklearn.feature_extraction.text import CountVectorizer # prepares tokens for use in topic model
from sklearn.decomposition import LatentDirichletAllocation

from sklearn.feature_extraction.text import TfidfVectorizer # prepares tokens for use in topic model
from sklearn.decomposition import NMF 


Next, we will open the folder 'BBC' that contains a collection of news texts and get a list of all the files (i.e. texts) that are stored in that folder. 

In [3]:
file_names = sorted([os.path.join("BBC", fn) for fn in os.listdir("BBC") if fn.endswith(".txt")])
print(len(file_names)) # count files in corpus
print(file_names[:10]) # print names of the first ten files in corpus

260
['BBC\\001.txt', 'BBC\\002.txt', 'BBC\\003.txt', 'BBC\\004.txt', 'BBC\\005.txt', 'BBC\\006.txt', 'BBC\\007.txt', 'BBC\\008.txt', 'BBC\\009.txt', 'BBC\\010.txt']


### Latent Dirichlet Allocation

In this section we will see how to do topic modelling using LDA as implemented in the scikit-learn Python library. To prepare the texts for LDA topic modelling, we use a <i>CountVectorizer</i>, because LDA works with raw term frequencies. <i>CountVectorizer</i> tokenizes the text, converts all characters to lower case, removes stop words and counts the frequency of each token (word). It turns a collection of text documents into numerical feature vectors: the corpus is represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus. Documents are described by word occurrences alone; the relative position of the words in the document is ignored completely.

https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

In [5]:
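# LDA is a probabilistic graphical model and works on raw term counts.
# max_df=0.95 drops terms found in more than 95% of the documents;
# min_df=2 drops terms found in fewer than 2 documents.
lda_vectorizer = CountVectorizer(input='filename', analyzer='word', max_df=0.95, min_df=2, stop_words='english')
lda_features = lda_vectorizer.fit_transform(file_names)
# get_feature_names_out() requires scikit-learn >= 1.0; .tolist() gives a plain list of strings
lda_feature_names = lda_vectorizer.get_feature_names_out().tolist()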

# show total number of features
print("Number of features:", len(lda_feature_names))

# show some examples
print("Feature examples:", lda_feature_names[1700:1710])

Number of features: 4221
Feature examples: ['gough', 'governance', 'governing', 'government', 'governor', 'graham', 'grand', 'granted', 'gray', 'great']


We now define an LDA topic model with 10 topics (<i>n_components</i>) and fit it to the features (word counts).

In [6]:
# run LDA
lda_model = LatentDirichletAllocation(n_components=10, max_iter=5, learning_method='online', learning_offset=50., random_state=0)
lda_model.fit(lda_features)


LatentDirichletAllocation(learning_method='online', learning_offset=50.0,
                          max_iter=5, random_state=0)
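A fitted LDA model exposes two matrices: `components_` (topics by terms) and the result of `transform` (documents by topics). The sketch below shows their shapes on a small made-up count matrix. The names `toy_counts`, `toy_lda`, `toy_doc_topic` and `toy_topic_word` are illustrative assumptions. The values in `components_` are unnormalised weights, so the sketch divides each row by its sum to read it as a distribution over terms.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical document-term count matrix: 4 documents, 5 terms
toy_counts = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 1, 0, 0],
    [0, 0, 3, 2, 1],
    [0, 1, 2, 3, 0],
])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topic = toy_lda.fit_transform(toy_counts)   # documents x topics

# Normalise each topic row so that it sums to 1 (a distribution over terms)
toy_topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)

print(toy_topic_word.shape)        # (number of topics, number of terms)
print(toy_doc_topic.shape)         # (number of documents, number of topics)
print(toy_doc_topic.sum(axis=1))   # each document's topic weights sum to 1
```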

Next, we define a function that shows the n highest-weighted words in each topic.

In [9]:
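def print_topics(model, vectorizer, n_top_words):
    """Print the n_top_words highest-weighted words of every topic in model."""
    words = vectorizer.get_feature_names_out()  # requires scikit-learn >= 1.0
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        # argsort() sorts ascending, so the reversed slice yields the largest weights first
        print(" ".join([words[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))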
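The slice `[:-n_top_words - 1:-1]` in the function above is compact but easy to misread. Here is a small sketch of what it does, using a made-up four-word vocabulary and made-up weights (`toy_words` and `toy_topic` are illustrative only):

```python
import numpy as np

toy_words = ["oil", "rugby", "growth", "tennis"]   # hypothetical vocabulary
toy_topic = np.array([0.1, 0.5, 0.2, 0.9])         # hypothetical weights for one topic
n_top_words = 2

order = toy_topic.argsort()          # indices from smallest to largest weight
top = order[:-n_top_words - 1:-1]    # walk backwards from the end, keep n_top_words indices
print([toy_words[i] for i in top])   # ['tennis', 'rugby']
```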

Now we will print the first 10 words in each topic.

In [10]:
# print the top 10 words of each topic
print_topics(lda_model, lda_vectorizer, 10)


Topic #0:
sales said year profits 2004 euros india rose spending growth

Topic #1:
company mr euros charges marsh bankruptcy parmalat business group firm

Topic #2:
fuel said fiat airlines drugs engines tax year deutsche gm

Topic #3:
seed rusedski number forced second rib said left half victory

Topic #4:
said year growth new economy oil market government years economic

Topic #5:
said win mirza set final roddick world game open year

Topic #6:
car mini factory cars 000 gm saab bmw new production

Topic #7:
mci airlines verizon fiat qwest passengers new offer compensation beer

Topic #8:
fiat gm prices oil said crude opec firm cut barrel

Topic #9:
dollar ireland south gara reserves said korea horgan penalty minute


We will also check which topic has the highest probability for each document.

In [11]:
lda_doc_topic = lda_model.transform(lda_features)

for n in range(lda_doc_topic.shape[0]):
    topic_most_pr_lda = lda_doc_topic[n].argmax()
    print("doc {}: topic: {}\n".format(n+1,topic_most_pr_lda))

doc 1: topic: 0

doc 2: topic: 4

doc 3: topic: 4

doc 4: topic: 0

doc 5: topic: 4

doc 6: topic: 4

doc 7: topic: 4

doc 8: topic: 4

doc 9: topic: 4

doc 10: topic: 4

doc 11: topic: 0

doc 12: topic: 4

doc 13: topic: 4

doc 14: topic: 4

doc 15: topic: 7

doc 16: topic: 4

doc 17: topic: 1

doc 18: topic: 0

doc 19: topic: 3

doc 20: topic: 3

doc 21: topic: 4

doc 22: topic: 4

doc 23: topic: 4

doc 24: topic: 4

doc 25: topic: 4

doc 26: topic: 4

doc 27: topic: 4

doc 28: topic: 4

doc 29: topic: 4

doc 30: topic: 4

doc 31: topic: 0

doc 32: topic: 4

doc 33: topic: 1

doc 34: topic: 4

doc 35: topic: 4

doc 36: topic: 5

doc 37: topic: 1

doc 38: topic: 0

doc 39: topic: 0

doc 40: topic: 0

doc 41: topic: 4

doc 42: topic: 2

doc 43: topic: 4

doc 44: topic: 4

doc 45: topic: 4

doc 46: topic: 7

doc 47: topic: 0

doc 48: topic: 4

doc 49: topic: 4

doc 50: topic: 2

doc 51: topic: 4

doc 52: topic: 4

doc 53: topic: 4

doc 54: topic: 4

doc 55: topic: 4

doc 56: topic: 0

...
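A per-document listing like the one above gets long quickly. One way to summarise it is to count how many documents have each topic as their dominant topic. The sketch below does this on a made-up document-topic matrix (`toy_doc_topic` is illustrative). The same two lines could be applied to `lda_doc_topic`.

```python
import numpy as np

# Hypothetical document-topic matrix: 5 documents, 3 topics
toy_doc_topic = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
])

dominant = toy_doc_topic.argmax(axis=1)   # dominant topic of each document
# minlength makes sure topics with no documents still get a count of 0
counts = np.bincount(dominant, minlength=toy_doc_topic.shape[1])

for topic_idx, n_docs in enumerate(counts):
    print("topic {}: {} documents".format(topic_idx, n_docs))
```

A very uneven count, with most documents falling under a single topic, can be a hint that the model settings deserve another look.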

### Non-negative Matrix Factorization


Now we will see how to do topic modelling using Non-negative Matrix Factorization as implemented in the scikit-learn Python library. First we need to put our texts in a bag-of-words matrix format in which each row represents a text and each column represents a word. To prepare such a matrix for the NMF algorithm, we use a <i>TfidfVectorizer</i>. <i>TfidfVectorizer</i> tokenizes the texts, counts the frequency of each word, calculates the <i>tf-idf</i> (Term Frequency-Inverse Document Frequency) value of each word in each text and returns the result in the format needed as input for NMF.

https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

In [13]:
# define the vectorizer
nmf_vectorizer = TfidfVectorizer(input='filename', analyzer='word', min_df=1, strip_accents=None, stop_words='english', preprocessor=None, encoding='utf-8')

# obtain all features from the texts
features = nmf_vectorizer.fit_transform(file_names)
# get_feature_names_out() requires scikit-learn >= 1.0; .tolist() gives a plain list of strings
feature_names = nmf_vectorizer.get_feature_names_out().tolist()

# show total number of features
print("Number of features:", len(feature_names))

# show some examples
print("Feature examples:", feature_names[1700:1710])




Number of features: 8592
Feature examples: ['claude', 'clauses', 'clawed', 'clay', 'clean', 'clear', 'clearance', 'cleared', 'clearer', 'clearest']


Next, we define the topic model and apply it to the features (words). The topic model uses a mathematical technique called Non-negative Matrix Factorization (NMF) to determine the topics. In this example we distinguish 10 topics, but you can change this number if you want. https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
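As a rough illustration of what the factorization produces, the sketch below applies NMF to a small made-up matrix. It approximates the input `toy_X` (documents by terms) as the product of two non-negative matrices: `W` (documents by topics) and `H` (topics by terms). All variable names and values here are illustrative assumptions, not part of the BBC workflow.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical document-term matrix: 4 documents, 4 terms
toy_X = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 2.0],
    [0.0, 1.0, 2.0, 3.0],
])

toy_nmf = NMF(n_components=2, init='nndsvd', random_state=42, max_iter=2000)
W = toy_nmf.fit_transform(toy_X)   # document-topic weights
H = toy_nmf.components_            # topic-term weights

print(W.shape, H.shape)
print(np.round(W @ H, 1))          # approximate reconstruction of toy_X
```

Unlike the LDA output, the rows of `W` are plain non-negative weights and do not have to sum to 1.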

In [14]:
# define a topic model with 10 topics (=n_components)
nmf_model = NMF(n_components=10, random_state=42, init='nndsvd', max_iter=2000) 

# apply/fit the model to the features
nmf_model.fit(features)


NMF(init='nndsvd', max_iter=2000, n_components=10, random_state=42)

Now we will print the first 10 words in each topic.

In [15]:
print_topics(nmf_model, nmf_vectorizer, 10)


Topic #0:
england robinson rugby wilkinson cup nations captain dawson squad injury

Topic #1:
economy growth economic rate spending bank dollar said rates consumer

Topic #2:
seed federer set roddick final open henman moya win beat

Topic #3:
mr ebbers worldcom sullivan fraud mci accounting trial charges guilty

Topic #4:
ireland gara penalty try connor driscoll scotland minutes horgan easterby

Topic #5:
yukos rosneft yugansk oil russian russia court khodorkovsky tax bankruptcy

Topic #6:
sales euros fiat profits car gm company said firm market

Topic #7:
wales williams ruddock thomas henson jones italy cardiff france welsh

Topic #8:
women davenport capriati wimbledon open champion australian prize money equal

Topic #9:
lions rugby zealand umaga tour woodward new match players hemisphere


We will also check which topic has the highest weight for each document.

In [16]:
doc_topic = nmf_model.transform(features)

for n in range(doc_topic.shape[0]):
    topic_most_pr = doc_topic[n].argmax()
    print("doc {}: topic: {}\n".format(n+1,topic_most_pr))

doc 1: topic: 6

doc 2: topic: 1

doc 3: topic: 5

doc 4: topic: 6

doc 5: topic: 6

doc 6: topic: 1

doc 7: topic: 1

doc 8: topic: 1

doc 9: topic: 6

doc 10: topic: 5

doc 11: topic: 6

doc 12: topic: 1

doc 13: topic: 6

doc 14: topic: 6

doc 15: topic: 6

doc 16: topic: 1

doc 17: topic: 6

doc 18: topic: 1

doc 19: topic: 6

doc 20: topic: 6

doc 21: topic: 1

doc 22: topic: 1

doc 23: topic: 1

doc 24: topic: 5

doc 25: topic: 5

doc 26: topic: 1

doc 27: topic: 6

doc 28: topic: 6

doc 29: topic: 6

doc 30: topic: 6

doc 31: topic: 6

doc 32: topic: 6

doc 33: topic: 6

doc 34: topic: 6

doc 35: topic: 6

doc 36: topic: 3

doc 37: topic: 3

doc 38: topic: 1

doc 39: topic: 6

doc 40: topic: 6

doc 41: topic: 6

doc 42: topic: 6

doc 43: topic: 1

doc 44: topic: 1

doc 45: topic: 6

doc 46: topic: 6

doc 47: topic: 1

doc 48: topic: 1

doc 49: topic: 6

doc 50: topic: 1

doc 51: topic: 1

doc 52: topic: 5

doc 53: topic: 6

doc 54: topic: 1

doc 55: topic: 1

doc 56: topic: 6

...
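The listing above goes from documents to topics. To go the other way and find the documents that belong most strongly to a given topic, one option is to sort a column of the document-topic matrix. The helper `top_documents` and the matrix `toy_doc_topic` below are hypothetical names introduced for this sketch. The same function could be called with `doc_topic` or `lda_doc_topic`.

```python
import numpy as np

# Hypothetical document-topic weights: 5 documents, 2 topics
toy_doc_topic = np.array([
    [0.9, 0.0],
    [0.1, 0.4],
    [0.5, 0.2],
    [0.0, 0.7],
    [0.3, 0.1],
])

def top_documents(doc_topic, topic_idx, n_top_docs):
    """Return the indices of the n_top_docs documents with the largest weight for topic_idx."""
    return doc_topic[:, topic_idx].argsort()[::-1][:n_top_docs]

print(top_documents(toy_doc_topic, 0, 3))   # documents ranked for topic 0
print(top_documents(toy_doc_topic, 1, 2))   # documents ranked for topic 1
```

The returned indices are zero-based, so index 0 corresponds to "doc 1" in the listings above, and `file_names[i]` gives the matching file.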