In [1]:
# Topic Modelling #

# Topic Modelling Overview 

# Topic modelling lets us efficiently analyze large volumes of text by clustering 
# the documents into topics.

# A large amount of text data is unlabeled, which means we cannot apply supervised 
# learning approaches to build machine learning models for it, because supervised 
# machine learning depends on historical labelled data.

# It is up to us to try to discover text labels through topic modelling.

# If we have unlabeled data, then we can attempt to "discover" the labels. In the case of 
# text data, this means attempting to discover clusters of similar documents, grouped 
# together by topic.

# A very important idea to keep in mind here is that we do not know the "correct topic" or
# the "right answer". All we know is that the documents which are clustered together share
# similar topic ideas. It is up to the user to determine what these topics represent.

In [2]:
# Latent Dirichlet Allocation (LDA) for Topic Modelling

# * Johann Peter Gustav Lejeune Dirichlet was a German mathematician in the 1800s who 
# contributed widely to the field of modern mathematics.

# The Dirichlet distribution, a probability distribution, is named after him. 
# Latent Dirichlet Allocation (LDA) is based on this probability distribution.

# LDA was first published in 2003, as a graphical model for topic discovery, in the 
# Journal of Machine Learning Research.

# Assumptions of LDA for Topic Modelling

# 1-) Documents with similar topics use similar groups of words
# 2-) Latent topics can be found by searching for groups of words which frequently occur together 
# in documents across the corpus.

# - Documents are probability distributions over latent topics.
# - Topics themselves are probability distributions over words.

# LDA represents documents as mixtures of topics which generate words with certain probabilities.


# LDA assumes that the documents are produced in the following fashion: 

# - Decide on the number of words N the document will have.
# - Choose a topic mixture for the document (according to the Dirichlet distribution over a fixed set of K topics)
   # * e.g. 55% business, 25% politics, 10% economics, 10% trade
    
# - Generate each word in the document by: 
  # * First picking a topic according to the multinomial distribution sampled in the previous step 
  #   (55% business, 25% politics, 10% economics, and 10% trade).
  # * Then using that topic to generate the word itself (according to the topic's multinomial distribution). 
  #   For instance, if we choose the topic named 'economics', we might generate the word 'stocks' with 50% 
  #   probability, 'investment' with 35% probability, and so forth.

# Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents 
# to find a set of topics that are likely to have generated the collection.
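
# The generative story above can be sketched in a few lines of NumPy. This is only an illustrative
# sketch under assumptions: the six-word vocabulary, the Dirichlet parameter alpha=0.5, the random
# topic-word distributions and the helper name generate_document are all made up for the example
# and have nothing to do with the NPR data used later.

```python
import numpy as np

# Hypothetical toy setup: 4 topics and a tiny invented vocabulary.
rng = np.random.default_rng(0)
topic_names = ["business", "politics", "economics", "trade"]
vocab = ["stocks", "investment", "vote", "senate", "tariff", "export"]

# One word distribution per topic (each row sums to 1); the values are random, not learned.
topic_word = rng.dirichlet(np.full(len(vocab), 0.5), size=len(topic_names))

def generate_document(n_words, alpha=0.5):
    """Generate one document by following the LDA generative story."""
    # Choose the document's topic mixture from a Dirichlet distribution over the K topics.
    theta = rng.dirichlet(np.full(len(topic_names), alpha))
    words = []
    for _ in range(n_words):
        # Pick a topic for this word according to the document's topic mixture.
        z = rng.choice(len(topic_names), p=theta)
        # Pick the word itself according to that topic's word distribution.
        w = rng.choice(len(vocab), p=topic_word[z])
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(n_words=8)
print(theta.round(2))  # the document's topic mixture
print(words)           # the generated "document"
```

# LDA inference runs this story in reverse: it only sees the words and tries to recover theta and topic_word.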


# We have chosen a fixed number of topics K, and want to use LDA to learn the topic representation of each 
# document and the words associated with each topic.

# We go through each document and randomly assign each word in the document to one of the K topics.

# This random assignment already gives us (initially poor) topic representations of all the documents and 
# word distributions of all the topics.

# We then iterate over every word in every document to improve these assignments.

# For every word w in every document d, and for each topic t, we calculate the following: 
# p(topic t | document d) = the proportion of words in the document d that are currently assigned to topic t.
# p(word w | topic t) = the proportion of assignments to topic t, over all documents, that come from word w.

# We then reassign w to a new topic, choosing topic t with probability proportional to: 

# p(topic t | document d) * p(word w | topic t)
# Under the generative model, this is essentially the probability that the topic t generated word w.

# After repeating the previous step many times, we eventually reach an approximately steady state where 
# the assignments are acceptable.
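
# The reassignment loop described above can be sketched as a toy Gibbs sampler. This is a minimal
# sketch under assumptions, not a reference implementation: the function name gibbs_lda, the
# smoothing constants alpha and beta (added so that no probability is exactly zero) and the toy
# documents are all made up. Note that scikit-learn's LatentDirichletAllocation, used later in this
# notebook, fits the model with variational Bayes rather than with this sampling procedure.

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Toy Gibbs sampler for LDA; docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))    # words in document d assigned to topic t
    topic_word = np.zeros((n_topics, vocab_size))  # times word w is assigned to topic t
    # Randomly assign every word in every document to one of the K topics.
    assignments = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assignments[d]):
            doc_topic[d, t] += 1
            topic_word[t, w] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t_old = assignments[d][i]
                # Take the current word out of the counts before resampling its topic.
                doc_topic[d, t_old] -= 1
                topic_word[t_old, w] -= 1
                # Proportional to p(topic t | document d), smoothed by alpha.
                topic_given_doc = doc_topic[d] + alpha
                # p(word w | topic t), smoothed by beta.
                word_given_topic = (topic_word[:, w] + beta) / (
                    topic_word.sum(axis=1) + beta * vocab_size)
                weights = topic_given_doc * word_given_topic
                t_new = rng.choice(n_topics, p=weights / weights.sum())
                assignments[d][i] = t_new
                doc_topic[d, t_new] += 1
                topic_word[t_new, w] += 1
    return doc_topic, topic_word

# Word ids 0-2 and 3-5 form two made-up "themes".
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 3, 4, 5]]
doc_topic, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=6)
print(doc_topic)   # per-document topic counts
print(topic_word)  # per-topic word counts
```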

# Ultimately, each document ends up with a mixture of topics, and we can assign it to its most probable 
# topic. We can also look up the words which have the highest probability under each topic.

# We end up with an output such as: 

# Document assigned to the Topic #4
# Most common words (highest probability) for Topic #4:
# ['cat', 'dog', 'vet', 'birds', 'food', 'home', ...]

# Two important notes: 
# 1-) The user must decide on the number of topics present in the corpus.
# 2-) The user must interpret what the topics are.

In [39]:
# Latent Dirichlet Allocation
import pandas as pd
npr = pd.read_csv('npr.csv')
print(npr)
print()
print(type(npr))

                                                 Article
0      In the Washington of 2016, even when the polic...
1        Donald Trump has used Twitter  —   his prefe...
2        Donald Trump is unabashedly praising Russian...
3      Updated at 2:50 p. m. ET, Russian President Vl...
4      From photography, illustration and video, to d...
...                                                  ...
11987  The number of law enforcement officers shot an...
11988    Trump is busy these days with victory tours,...
11989  It’s always interesting for the Goats and Soda...
11990  The election of Donald Trump was a surprise to...
11991  Voters in the English city of Sunderland did s...

[11992 rows x 1 columns]

<class 'pandas.core.frame.DataFrame'>


In [40]:
print(npr['Article'])

0        In the Washington of 2016, even when the polic...
1          Donald Trump has used Twitter  —   his prefe...
2          Donald Trump is unabashedly praising Russian...
3        Updated at 2:50 p. m. ET, Russian President Vl...
4        From photography, illustration and video, to d...
                               ...                        
11987    The number of law enforcement officers shot an...
11988      Trump is busy these days with victory tours,...
11989    It’s always interesting for the Goats and Soda...
11990    The election of Donald Trump was a surprise to...
11991    Voters in the English city of Sunderland did s...
Name: Article, Length: 11992, dtype: object


In [41]:
print("There are "+str(len(npr['Article']))+" articles in this dataset.")

There are 11992 articles in this dataset.


In [42]:
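# data preprocessing
from sklearn.feature_extraction.text import CountVectorizer

# min_df=0.15 drops words that appear in fewer than 15% of the articles, max_df=0.95 drops words
# that appear in more than 95% of them, and stop_words='english' removes common English stop words.
cv = CountVectorizer(min_df=0.15, max_df=0.95, stop_words = 'english')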
print(cv)
print(type(cv))

CountVectorizer(max_df=0.95, min_df=0.15, stop_words='english')
<class 'sklearn.feature_extraction.text.CountVectorizer'>
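
# To make the effect of fractional min_df and max_df values concrete, here is a small pure-Python
# sketch of the document-frequency filter they describe. It only mimics the idea; it is not
# scikit-learn's implementation, it does no stop-word removal, and the helper filter_vocabulary,
# the whitespace tokenization and the toy sentences are assumptions made for the example.

```python
from collections import Counter

def filter_vocabulary(docs, min_df, max_df):
    """Keep terms whose share of documents containing them lies in [min_df, max_df]."""
    n_docs = len(docs)
    # Count in how many documents each term appears (not how often it appears overall).
    doc_freq = Counter(term for doc in docs for term in set(doc.lower().split()))
    return sorted(term for term, df in doc_freq.items()
                  if min_df <= df / n_docs <= max_df)

toy_docs = [
    "the senate passed the bill",
    "the vet treated the dog",
    "the senate debated the budget",
    "the dog chased the cat",
]
# 'the' appears in every document (above max_df); most other words appear in only one (below min_df).
print(filter_vocabulary(toy_docs, min_df=0.5, max_df=0.95))  # ['dog', 'senate']
```

# A high min_df such as 0.15 is a strong filter, which is why the vocabulary built below is small.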


In [43]:
document_term_matrix = cv.fit_transform(npr['Article'])
print(document_term_matrix)

  (0, 169)	2
  (0, 181)	3
  (0, 83)	3
  (0, 119)	6
  (0, 72)	3
  (0, 88)	1
  (0, 14)	6
  (0, 65)	3
  (0, 44)	1
  (0, 78)	2
  (0, 7)	1
  (0, 179)	4
  (0, 182)	2
  (0, 69)	2
  (0, 180)	1
  (0, 110)	2
  (0, 131)	4
  (0, 155)	1
  (0, 106)	1
  (0, 159)	19
  (0, 124)	1
  (0, 95)	2
  (0, 174)	1
  (0, 160)	1
  (0, 167)	1
  :	:
  (11991, 133)	1
  (11991, 143)	1
  (11991, 134)	8
  (11991, 158)	1
  (11991, 113)	7
  (11991, 162)	2
  (11991, 156)	1
  (11991, 84)	1
  (11991, 13)	1
  (11991, 103)	1
  (11991, 114)	2
  (11991, 58)	1
  (11991, 0)	1
  (11991, 175)	1
  (11991, 59)	1
  (11991, 87)	4
  (11991, 178)	2
  (11991, 32)	1
  (11991, 166)	1
  (11991, 61)	1
  (11991, 23)	1
  (11991, 75)	1
  (11991, 116)	1
  (11991, 4)	1
  (11991, 77)	2


In [44]:
document_term_matrix

<11992x185 sparse matrix of type '<class 'numpy.int64'>'
	with 535794 stored elements in Compressed Sparse Row format>
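
# The summary above reports 535794 stored elements in an 11992x185 matrix. As a quick worked
# calculation (using only the numbers printed above, and assuming the stored elements are the
# non-zero counts), the share of filled cells can be computed as follows:

```python
# Numbers copied from the sparse matrix summary above.
n_docs, n_terms, stored = 11992, 185, 535794

density = stored / (n_docs * n_terms)
print(f"{density:.1%} of the matrix entries are stored")
```

# Most cells are empty, which is why a sparse format is used for the document-term matrix.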

In [45]:
# Apply Latent Dirichlet Allocation
from sklearn.decomposition import LatentDirichletAllocation

# create an LDA instance with random_state=42 that looks for 20 topics
lda = LatentDirichletAllocation(n_components=20, random_state=42)


print(lda)
print(type(lda))

LatentDirichletAllocation(n_components=20, random_state=42)
<class 'sklearn.decomposition._lda.LatentDirichletAllocation'>


In [46]:
# fitting lda to the document_term_matrix
lda.fit(document_term_matrix)

LatentDirichletAllocation(n_components=20, random_state=42)

In [47]:
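# grab the vocabulary of words 
# (get_feature_names() was removed in scikit-learn 1.2; newer versions provide 
# get_feature_names_out(), which returns a NumPy array instead of a list)

words_vocab = cv.get_feature_names()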
print(len(words_vocab))
print()
print(type(words_vocab))
print()
for i in range(0, 100):
    print(words_vocab[i])

185

<class 'list'>

000
10
20
able
according
actually
ago
america
american
asked
away
best
better
big
called
came
campaign
care
case
center
change
children
city
clear
come
comes
country
course
day
days
department
did
didn
different
director
does
doesn
doing
don
donald
earlier
early
end
especially
example
fact
family
far
federal
feel
getting
given
going
good
got
government
great
group
hard
having
health
help
high
history
home
house
human
idea
important
including
instead
isn
just
kind
know
known
later
law
left
let
life
like
likely
little
live
lives
ll
local
long
look
looking
lot
make
makes
making
man
means
media
million
money




In [62]:
print(type(words_vocab))

import random 

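# randint(0, 185) could return 185, which is not a valid index into the 185-word vocabulary;
# randrange excludes the upper bound
random_word_id = random.randrange(len(words_vocab))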
print(random_word_id)

<class 'list'>
43


In [64]:
# grab the topics: one row per topic, one column per vocabulary word (unnormalized word weights)
topics = lda.components_
print(topics)
print()
print(len(topics))
print("The type of the topics is: "+str(type(topics))+"")
print("The type of the lda.components_ is: "+str(type(lda.components_))+"")

topics_shape = topics.shape
print(topics_shape)

row_num = topics.shape[0]
col_num = topics.shape[1]

print("Total number of rows: "+str(row_num)+"")
print("Total number of columns: "+str(col_num)+"")
print()
print("There are "+str(row_num)+" topics sought and "+str(col_num)+" words in the topics data.")

[[3.99512296e+02 3.35814296e+02 3.02808998e+02 ... 1.89997033e+03
  1.14663701e+01 2.15987452e+02]
 [7.78239990e+02 2.43059236e+02 1.69066582e+02 ... 1.89701635e+03
  3.85539978e-01 3.00232383e+02]
 [1.19333943e+01 7.78627904e+01 4.98411992e+01 ... 1.65145067e+02
  3.10770850e+02 9.96143266e-01]
 ...
 [2.23705317e+02 1.51123297e+02 1.04184226e+02 ... 1.43265139e+02
  3.45703618e-01 2.44484804e+01]
 [4.66926713e+02 3.54011773e+02 2.14956168e+02 ... 4.13937845e+02
  2.27145484e+01 2.26862642e+01]
 [6.99069565e-01 1.84012321e+02 6.49793619e+01 ... 7.11155996e+02
  1.85826049e+02 1.59077952e+03]]

20
The type of the topics is: <class 'numpy.ndarray'>
The type of the lda.components_ is: <class 'numpy.ndarray'>
(20, 185)
Total number of rows: 20
Total number of columns: 185

There are 20 topics sought and 185 words in the topics data.
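
# The values in lda.components_ are not probabilities: many of them are far larger than 1. The
# scikit-learn documentation describes them as pseudo-counts that become a distribution over words
# once each row is divided by its sum. A minimal sketch of that normalization, using a made-up
# 2x4 array (toy_components) in place of the real 20x185 one:

```python
import numpy as np

# Made-up pseudo-counts for 2 topics over a 4-word vocabulary.
toy_components = np.array([[8.0, 1.0, 1.0, 0.0],
                           [1.0, 1.0, 3.0, 5.0]])

# Divide every row by its own sum so that each topic becomes a probability distribution over words.
topic_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)
print(topic_word_probs)
print(topic_word_probs.sum(axis=1))  # every row now sums to 1
```

# Normalizing does not change the order of the words within a topic, so argsort() on the raw
# weights and on the normalized probabilities gives the same ranking.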


In [53]:
single_topic = lda.components_[0]
print("The first topic: "+str(single_topic)+"")

The first topic: [3.99512296e+02 3.35814296e+02 3.02808998e+02 7.04191064e+02
 3.01254709e+02 7.51823477e+02 6.37757337e+02 7.76051731e+01
 1.51458920e+02 1.86963650e+02 5.33382813e+02 2.91193779e+02
 5.77648500e+02 1.09238707e+03 1.02807211e+03 3.59263778e+02
 4.56788099e-01 4.40796524e+01 2.66476766e+02 7.25702671e+02
 7.82644839e+02 4.58630114e+00 6.06367771e+01 1.76532724e+02
 9.44794115e+02 4.41747362e+02 2.60833734e+02 1.59833371e+02
 7.70140350e+02 3.84992626e+02 1.67830946e+02 3.68548962e+02
 7.07178526e+02 7.83455790e+02 5.50848415e+02 3.96230989e+02
 1.01290421e+03 5.84302110e+02 2.26019814e+03 5.00000004e-02
 1.08521849e+02 2.25913317e+02 3.88261567e+02 4.17570384e+02
 3.68828258e+02 2.21549661e+02 2.52491960e+01 5.30983754e+02
 1.05974198e+01 3.86747440e+02 8.81888878e+02 1.21791566e+02
 1.66789637e+03 9.73784454e+02 8.77950313e+02 7.60357116e+01
 1.68332597e+02 5.84092560e+02 4.87872077e+02 3.83861401e+02
 1.06916948e+01 1.04206992e+03 3.69755660e+02 8.72066961e+01
 1.9941

In [56]:
single_topic.argsort()

array([159,  39, 175, 120, 142,  16, 174, 128,  77,  21,  48,  60, 118,
       183, 127, 131, 180, 161, 119,  65,  46, 143, 135,  17,  22, 141,
       110, 106,  95,  55,   7,  63, 107, 108,  40,  97, 133, 145,  51,
       157,  85,   8,  27,  30,  56,  79,  23, 103,  78,   9,  66, 168,
        64,  76, 137, 148, 176, 184, 123,  80,  45, 169,  41, 100, 172,
        84, 149,  98, 173,  26,  18, 144, 114, 117, 155,  11, 126,  99,
         4,   2, 156,  70,  82,  75, 109,   1, 158, 170,  15, 115,  31,
        44, 153,  62, 122, 136, 111,  69, 101,  59,  29,  49,  42,  35,
       124,   0, 112,  96,  43, 139,  87,  25, 147, 179,  94, 116,  67,
       129,  58,  68,  93,  47,  10, 102,  34, 146, 150,  12,  57,  37,
        90,  71,   6, 178,  86,  89,  73,   3,  32,  19,  88, 140,   5,
       165, 121,  28, 138,  20,  33, 151, 160,  83, 181,  54,  50, 163,
        24, 104,  53,  36,  14, 164, 130,  61, 166,  13, 177, 167,  74,
       171, 154, 152, 162, 132,  91,  92,  52, 105, 125, 182,  3

In [66]:
# argsort() returns the vocabulary indices for the first topic (the topic at index 0), ordered from 
# the lowest weight to the highest, so the most probable words are at the end of the list.
print(single_topic.argsort())

[159  39 175 120 142  16 174 128  77  21  48  60 118 183 127 131 180 161
 119  65  46 143 135  17  22 141 110 106  95  55   7  63 107 108  40  97
 133 145  51 157  85   8  27  30  56  79  23 103  78   9  66 168  64  76
 137 148 176 184 123  80  45 169  41 100 172  84 149  98 173  26  18 144
 114 117 155  11 126  99   4   2 156  70  82  75 109   1 158 170  15 115
  31  44 153  62 122 136 111  69 101  59  29  49  42  35 124   0 112  96
  43 139  87  25 147 179  94 116  67 129  58  68  93  47  10 102  34 146
 150  12  57  37  90  71   6 178  86  89  73   3  32  19  88 140   5 165
 121  28 138  20  33 151 160  83 181  54  50 163  24 104  53  36  14 164
 130  61 166  13 177 167  74 171 154 152 162 132  91  92  52 105 125 182
  38  72  81 113 134]


In [72]:
print(single_topic.argsort()[-10:]) # indices of the 10 most probable words, least probable of them first

[ 92  52 105 125 182  38  72  81 113 134]


In [80]:
# Display the 20 most probable words for the topic 'single_topic', least probable of them first

top_twenty_words = single_topic.argsort()[-20:]
feature_names = cv.get_feature_names()
for index in top_twenty_words:
    print(feature_names[index])
    

big
work
want
know
way
time
think
university
say
lot
make
going
new
really
years
don
just
like
people
says


In [81]:
import numpy as np
arr = np.array([20, 300, 11])
print(arr)

[ 20 300  11]


In [82]:
print(arr)

# argsort() returns the index positions that would sort the array in ascending order.
# 11 is the smallest value in 'arr', so its index (2) comes first. 20 is the next smallest 
# value, so its index (0) comes second. 300 is the largest value, so its index (1) comes last.
print(arr.argsort()) 

[ 20 300  11]
[2 0 1]
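
# Because argsort() sorts in ascending order, slicing with [-k:] returns the top k indices with the
# largest one last. Reversing first gives them in descending order instead. A small sketch with
# made-up weights and a hypothetical four-word vocabulary:

```python
import numpy as np

weights = np.array([20, 300, 11, 75])
vocab = ["dog", "cat", "vet", "food"]  # hypothetical vocabulary, one word per weight

top2_ascending = weights.argsort()[-2:]         # largest value last
top2_descending = weights.argsort()[::-1][:2]   # largest value first

print(top2_ascending)                        # [3 1]
print(top2_descending)                       # [1 3]
print([vocab[i] for i in top2_descending])   # ['cat', 'food']
```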


In [109]:
# grab the highest probability words per topic
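feature_names = cv.get_feature_names()  # look the vocabulary up once instead of once per word
for topic_index, topic in enumerate(topics):
    print("The top 15 words for the topic "+str(topic_index)+" are: ")
    print()
    top15_words = [feature_names[word_index] for word_index in topic.argsort()[-15:]]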
    print(top15_words)
    print()
    print("------------------------------------------------------------------------")

The top 15 words for the topic 0 are: 

['time', 'think', 'university', 'say', 'lot', 'make', 'going', 'new', 'really', 'years', 'don', 'just', 'like', 'people', 'says']

------------------------------------------------------------------------
The top 15 words for the topic 1 are: 

['country', 'people', 'just', 'ago', '000', 'work', 'help', 'time', 'day', 'children', 'life', 'years', 'home', 'family', 'says']

------------------------------------------------------------------------
The top 15 words for the topic 2 are: 

['news', 'night', 'going', 'week', 'just', 'support', 'won', 'new', 'states', 'political', 'president', 'said', 'donald', 'campaign', 'trump']

------------------------------------------------------------------------
The top 15 words for the topic 3 are: 

['says', 'likely', 'new', 'nearly', 'day', '10', 'years', 'world', 'according', 'number', 'million', '000', 'year', 'people', 'percent']

------------------------------------------------------------------------
The 



In [87]:
document_term_matrix

<11992x185 sparse matrix of type '<class 'numpy.int64'>'
	with 535794 stored elements in Compressed Sparse Row format>

In [88]:
npr

       Article
0      In the Washington of 2016, even when the polic...
1      Donald Trump has used Twitter — his prefe...
2      Donald Trump is unabashedly praising Russian...
3      Updated at 2:50 p. m. ET, Russian President Vl...
4      From photography, illustration and video, to d...
...    ...
11987  The number of law enforcement officers shot an...
11988  Trump is busy these days with victory tours,...
11989  It’s always interesting for the Goats and Soda...
11990  The election of Donald Trump was a surprise to...


In [99]:
topic_results = lda.transform(document_term_matrix)
print(topic_results)
print(type(topic_results))
print("\n")
print("\n")
print(topic_results.shape[0], topic_results.shape[1])
print("There are "+str(topic_results.shape[0])+" articles presented and "+str(topic_results.shape[1])+" topics sought in the topic_results array.")

[[3.75939856e-04 3.75939857e-04 2.58978596e-01 ... 3.75939854e-04
  3.75939863e-04 8.41715411e-02]
 [6.02409649e-04 6.02409652e-04 6.43703216e-01 ... 6.02409645e-04
  1.20906457e-01 6.02409656e-04]
 [4.31034489e-04 4.31034489e-04 5.34020442e-01 ... 4.31034486e-04
  2.93180840e-02 4.31034491e-04]
 ...
 [6.84931519e-04 7.69371022e-02 6.84931516e-04 ... 7.94475603e-02
  6.84931520e-04 3.20664221e-01]
 [7.93650810e-04 7.93650808e-04 1.92791430e-01 ... 7.93650806e-04
  7.93650805e-04 7.97601603e-02]
 [2.27106258e-01 8.95326899e-02 1.69197313e-02 ... 4.76190483e-04
  4.76190488e-04 4.76190485e-04]]
<class 'numpy.ndarray'>




11992 20
There are 11992 articles presented and 20 topics sought in the topic_results array.


In [97]:
# the first article's topic mixture: its probability of belonging to each of the 20 topics
print(topic_results[0])

[0.00037594 0.00037594 0.2589786  0.00037594 0.18400657 0.23144787
 0.00037594 0.00037594 0.01492258 0.00037594 0.00037594 0.20850575
 0.01307987 0.00037594 0.00037594 0.00037594 0.00037594 0.00037594
 0.00037594 0.08417154]


In [98]:
# Round every probability in the first row of topic_results to 4 decimal places
topic_results[0].round(4)

array([0.0004, 0.0004, 0.259 , 0.0004, 0.184 , 0.2314, 0.0004, 0.0004,
       0.0149, 0.0004, 0.0004, 0.2085, 0.0131, 0.0004, 0.0004, 0.0004,
       0.0004, 0.0004, 0.0004, 0.0842])

In [101]:
topic_results[0].round(3)

array([0.   , 0.   , 0.259, 0.   , 0.184, 0.231, 0.   , 0.   , 0.015,
       0.   , 0.   , 0.209, 0.013, 0.   , 0.   , 0.   , 0.   , 0.   ,
       0.   , 0.084])

In [102]:
topic_results[0].argmax() # It returns the index position of the highest probability topic for the first article.

2
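
# argmax() keeps only the winning topic and discards how strongly it won. A hedged sketch of keeping
# that probability alongside the label, using a made-up 2x3 array (toy_topic_results) rather than the
# real 11992x20 one:

```python
import numpy as np

# Made-up topic mixtures for two documents over three topics; each row sums to 1.
toy_topic_results = np.array([[0.05, 0.70, 0.25],
                              [0.40, 0.35, 0.25]])

dominant_topic = toy_topic_results.argmax(axis=1)  # index of the most probable topic per document
dominant_prob = toy_topic_results.max(axis=1)      # how probable that topic is

for doc_id, (topic, prob) in enumerate(zip(dominant_topic, dominant_prob)):
    print(f"document {doc_id}: topic {topic} (p={prob:.2f})")
```

# The second toy document is assigned to topic 0 with a probability of only 0.40, so its label is
# much less certain than the first document's.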

In [105]:
print(topic_results)
print()
print()
print()
print()
npr['Topic'] = topic_results.argmax(axis=1)
print(npr)

[[3.75939856e-04 3.75939857e-04 2.58978596e-01 ... 3.75939854e-04
  3.75939863e-04 8.41715411e-02]
 [6.02409649e-04 6.02409652e-04 6.43703216e-01 ... 6.02409645e-04
  1.20906457e-01 6.02409656e-04]
 [4.31034489e-04 4.31034489e-04 5.34020442e-01 ... 4.31034486e-04
  2.93180840e-02 4.31034491e-04]
 ...
 [6.84931519e-04 7.69371022e-02 6.84931516e-04 ... 7.94475603e-02
  6.84931520e-04 3.20664221e-01]
 [7.93650810e-04 7.93650808e-04 1.92791430e-01 ... 7.93650806e-04
  7.93650805e-04 7.97601603e-02]
 [2.27106258e-01 8.95326899e-02 1.69197313e-02 ... 4.76190483e-04
  4.76190488e-04 4.76190485e-04]]




                                                 Article  Topic
0      In the Washington of 2016, even when the polic...      2
1        Donald Trump has used Twitter  —   his prefe...      2
2        Donald Trump is unabashedly praising Russian...      2
3      Updated at 2:50 p. m. ET, Russian President Vl...     11
4      From photography, illustration and video, to d...      3
...         

In [104]:
print(npr['Article'][0])

In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing o

In [108]:
print(topic_results)
print()
print()
print(topic_results.argmax(axis=0)) # for each topic (column), the index of the article with the highest probability for that topic
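print(topic_results.argmax(axis=1)) # for each article (row), the index of its most probable topic
print(len(topic_results.argmax(axis=0))) # one entry per topic
print(len(topic_results.argmax(axis=1))) # one entry per article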


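# axis = 0 ======> columnwise: the maximum is searched down each column, giving one result per topic
# axis = 1 ======> rowwise: the maximum is searched along each row, giving one result per article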

[[3.75939856e-04 3.75939857e-04 2.58978596e-01 ... 3.75939854e-04
  3.75939863e-04 8.41715411e-02]
 [6.02409649e-04 6.02409652e-04 6.43703216e-01 ... 6.02409645e-04
  1.20906457e-01 6.02409656e-04]
 [4.31034489e-04 4.31034489e-04 5.34020442e-01 ... 4.31034486e-04
  2.93180840e-02 4.31034491e-04]
 ...
 [6.84931519e-04 7.69371022e-02 6.84931516e-04 ... 7.94475603e-02
  6.84931520e-04 3.20664221e-01]
 [7.93650810e-04 7.93650808e-04 1.92791430e-01 ... 7.93650806e-04
  7.93650805e-04 7.97601603e-02]
 [2.27106258e-01 8.95326899e-02 1.69197313e-02 ... 4.76190483e-04
  4.76190488e-04 4.76190485e-04]]


[10451   542  4617  3396  5991 11740  6774 10273  4741   855  8589   465
  6896  1039  4945  1655  4666  2954 10615  6730]
[ 2  2  2 ... 19 14  0]
20
11992
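
# The difference between the two axis values is easier to see on a small array. A minimal sketch with
# made-up values, where rows stand for documents and columns for topics:

```python
import numpy as np

# Rows are documents, columns are topics (made-up values).
toy = np.array([[0.1, 0.9],
                [0.8, 0.2],
                [0.3, 0.7]])

print(toy.argmax(axis=0))  # [1 0]   -> for each topic, the document where it is strongest
print(toy.argmax(axis=1))  # [1 0 1] -> for each document, its most probable topic
```

# The axis=0 result has one entry per column and the axis=1 result has one entry per row.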
