# Topic Modelling by `Mr. Harshit Dawar!`
### Algorithm: LDA (Latent Dirichlet Allocation)

***Steps***
1. The user chooses the number of topics to which the words in the documents will be assigned.
2. Each word in each document is initially assigned to one of these topics at random.
3. This gives an initial mix of topics for each document and an initial set of words for each topic, although this initial assignment will not make any sense.
4. Each word is then reassigned to a topic using the probabilities given below, and this reassignment is repeated until the assignments stabilize.

For each topic: ***probability( topic "t" | document "d")***, i.e. the probability of topic "t" existing in document "d".

For each word: ***probability( word "w" | topic "t")***, i.e. the probability of word "w" belonging to topic "t".

The probability that topic "t" generated word "w" in document "d" is proportional to: ***probability( topic "t" | document "d") * probability( word "w" | topic "t")***
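
As a small worked example of the formula above, the sketch below uses made-up probabilities for one document and one word with three topics. The numbers and variable names are illustrative assumptions, not values from this dataset. It multiplies the two terms for each topic and normalizes the result so that the scores sum to 1.

```python
import numpy as np

# Hypothetical values for one document "d" and one word "w" (3 topics).
p_topic_given_doc = np.array([0.5, 0.3, 0.2])      # probability(topic t | document d)
p_word_given_topic = np.array([0.01, 0.10, 0.02])  # probability(word w | topic t)

# Unnormalized score of each topic for this word in this document.
score = p_topic_given_doc * p_word_given_topic

# Normalize so the scores form a probability distribution over topics.
topic_posterior = score / score.sum()

# Index (0-based) of the topic that best explains the word.
best_topic = int(topic_posterior.argmax())
print(topic_posterior.round(3), best_topic)
```

Even though the first topic is the most common one in the document, the second topic wins here because the word is far more likely under it.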


**Important Pointers**

* The user has to decide the number of topics to extract from the documents.
* The user has to interpret the topics themselves; LDA does not name them.

**A Few Assumptions of LDA**

* Documents are probability distributions over topics, and topics are probability distributions over words.
* Documents with similar topics use similar groups of words.
* Topics can be found by searching for groups of words that frequently occur together in documents across the corpus.
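
The first assumption above can be illustrated by generating a tiny document the way LDA imagines documents are written. This is only a sketch: the vocabulary, the number of topics, the document length and the Dirichlet parameter of 0.5 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["vote", "party", "doctor", "drug", "song", "band"]  # toy vocabulary (assumption)
n_topics, doc_length = 3, 8

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(np.ones(len(vocab)) * 0.5, size=n_topics)

# The document is a probability distribution over the topics.
doc_topic = rng.dirichlet(np.ones(n_topics) * 0.5)

# For every word position: pick a topic from the document, then a word from that topic.
words = []
for _ in range(doc_length):
    t = rng.choice(n_topics, p=doc_topic)
    w = rng.choice(len(vocab), p=topic_word[t])
    words.append(vocab[w])

print(words)
```

Fitting LDA is the reverse direction: only the words are observed, and the two sets of distributions are estimated from them.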

In [13]:
# Importing the required Libraries
import pandas as pd
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
# Loading the Dataset
data = pd.read_csv("data.csv")

In [4]:
data.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [5]:
data.shape

(11992, 1)

## Vectorizing the words

In [6]:
"""
* min_df is the minimum number of documents in which a word must occur. With min_df = 2, a word that
  occurs in fewer than 2 documents is ignored.
* max_df is the maximum share of documents in which a word may occur. With max_df = 0.95, a word that
  occurs in more than 95% of the documents is ignored.

* English stop words are removed.
"""
vectorizer = CountVectorizer(min_df = 2, max_df = 0.95, stop_words="english")
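
To see what these settings do, here is a minimal sketch on four made-up sentences (the toy documents and the `toy_` names are assumptions for illustration). "trump" occurs in every document and is dropped by `max_df`, words that occur in only one document are dropped by `min_df`, and "the" is removed as a stop word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (assumption): "trump" is in all 4 documents,
# "election" and "news" are in 2 documents, every other word is in 1.
toy_docs = [
    "trump wins the election",
    "trump election campaign news",
    "trump health care plan",
    "trump music festival news",
]

toy_vectorizer = CountVectorizer(min_df=2, max_df=0.95, stop_words="english")
toy_vectorizer.fit(toy_docs)

# vocabulary_ maps each kept word to its column index.
kept = sorted(toy_vectorizer.vocabulary_)
print(kept)  # ['election', 'news']
```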

In [8]:
document_word_matrix = vectorizer.fit_transform(data.Article)

In [15]:
document_word_matrix

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [16]:
# Distinct word counts in the first document (only that row is converted to a dense array)
np.unique(document_word_matrix[0].toarray())

array([ 0,  1,  2,  3,  4,  5,  6,  7,  9, 10, 15, 19])

## Applying LDA

In [17]:
"""
n_components: number of different topics to divide the documents into
"""
LDA = LatentDirichletAllocation(n_components = 9, random_state = 5)

LDA.fit(document_word_matrix)

LatentDirichletAllocation(n_components=9, random_state=5)

In [20]:
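### Getting the Vocabulary from the Documents
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there.

len(vectorizer.get_feature_names())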

54777

In [23]:
vectorizer.get_feature_names()[5000], vectorizer.get_feature_names()[5500], vectorizer.get_feature_names()[33000]

('bask', 'benzodiazepines', 'nas')

In [24]:
### Getting the Different Topics
LDA.components_

array([[2.04108203e+01, 4.93163268e+02, 1.11111111e-01, ...,
        6.11103023e+00, 1.11675663e-01, 1.11111111e-01],
       [2.92231210e+00, 7.68997627e+02, 1.11111111e-01, ...,
        1.11111111e-01, 1.11119937e-01, 1.11113664e-01],
       [1.11189038e-01, 1.75726089e+02, 1.11113075e-01, ...,
        1.11119789e-01, 1.11036644e+00, 1.11111111e-01],
       ...,
       [7.29581971e+00, 1.67203972e+03, 1.11111111e-01, ...,
        1.11138772e-01, 1.11288688e-01, 1.11111111e-01],
       [7.94521735e+00, 4.38966163e+01, 1.11111111e-01, ...,
        1.11111111e-01, 1.11111111e-01, 1.11111648e-01],
       [4.56026646e+00, 3.66501790e+02, 1.11116300e-01, ...,
        1.11111669e-01, 1.11111111e-01, 1.11111966e-01]])

In [25]:
LDA.components_.shape

(9, 54777)

### Getting Word to Topic Assignment

In [26]:
Topic_1 = LDA.components_[0]
Topic_1

array([2.04108203e+01, 4.93163268e+02, 1.11111111e-01, ...,
       6.11103023e+00, 1.11675663e-01, 1.11111111e-01])

In [27]:
# Getting the indices of the top 15 words of this topic (argsort is ascending, so the strongest word comes last)
top_15_words_of_topic_1 = Topic_1.argsort()[-15 : ]

In [30]:
"""
Each index is a word index. The actual words are stored in the vectorizer's vocabulary,
not in Topic_1, which only holds the (unnormalized) weight of each word for this topic.
""" 
for index in top_15_words_of_topic_1:
    print(vectorizer.get_feature_names()[index])

work
way
home
day
new
life
world
women
family
time
years
just
people
like
says
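
The relationship between the weight array and the vocabulary can be sketched on a toy example. The five words and their weights below are hypothetical stand-ins for the vocabulary and for one row of `LDA.components_`. Dividing a row by its sum turns the weights into a distribution over words, and reversing the `argsort` slice lists the strongest word first.

```python
import numpy as np

# Hypothetical vocabulary and one hypothetical row of components_.
toy_vocab = np.array(["care", "health", "music", "trump", "vote"])
toy_topic = np.array([4.0, 10.0, 0.5, 0.5, 5.0])

# Normalize the weights so they sum to 1 (a distribution over words).
word_probs = toy_topic / toy_topic.sum()

# argsort is ascending: take the last 3 indices and reverse them.
top_3 = toy_topic.argsort()[-3:][::-1]
top_words = toy_vocab[top_3].tolist()

print(top_words)                   # ['health', 'vote', 'care']
print(word_probs[top_3].tolist())  # [0.5, 0.25, 0.2]
```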


In [38]:
for index, topic in enumerate(LDA.components_):
    print("Top 15 words for Topic:", index + 1)
    print(list(vectorizer.get_feature_names()[word_index] for word_index in topic.argsort()[-15 : ]))
    print("*" * 50)
    print()

Top 15 words for Topic: 1
['work', 'way', 'home', 'day', 'new', 'life', 'world', 'women', 'family', 'time', 'years', 'just', 'people', 'like', 'says']
**************************************************

Top 15 words for Topic: 2
['just', 'going', 'state', 'political', 'obama', 'donald', 'country', 'says', 'new', 'campaign', 'people', 'clinton', 'president', 'said', 'trump']
**************************************************

Top 15 words for Topic: 3
['democrats', 'won', 'new', 'party', 'states', 'democratic', 'win', 'race', 'said', 'vote', 'percent', 'clinton', 'voters', 'sanders', 'state']
**************************************************

Top 15 words for Topic: 4
['years', 'life', 've', 'don', 'says', 'music', 'way', 'really', 'new', 'know', 'time', 'think', 'people', 'just', 'like']
**************************************************

Top 15 words for Topic: 5
['hospital', 'don', 'disease', 'just', 'women', 'drug', 'like', 'police', 'medical', 'care', 'patients', 'said', 'health',

### Getting Topic to Document Assignment

In [39]:
topic_attachments_to_documents = LDA.transform(document_word_matrix)

In [40]:
# Contains probabilities for each document to belong to a particular topic!
topic_attachments_to_documents

array([[1.74816323e-04, 1.40156284e-01, 1.74819014e-04, ...,
        1.74830177e-04, 8.11023247e-01, 7.23389538e-03],
       [3.42081374e-04, 2.76590295e-01, 3.42049311e-04, ...,
        3.42013870e-04, 5.63406622e-01, 3.42058426e-04],
       [2.54964364e-04, 1.13054360e-01, 2.55019594e-04, ...,
        2.54921261e-04, 8.70353776e-01, 2.54962856e-04],
       ...,
       [4.46646671e-01, 3.08857580e-04, 3.25828615e-02, ...,
        2.40554503e-02, 3.08840917e-04, 3.08926005e-04],
       [3.35943737e-04, 3.88437787e-01, 3.27103262e-01, ...,
        3.35844133e-04, 2.29261164e-01, 3.35892474e-04],
       [1.61390923e-01, 2.56396867e-01, 6.35280896e-02, ...,
        5.62936234e-02, 1.97113047e-04, 1.97120171e-04]])

In [41]:
topic_attachments_to_documents.shape

(11992, 9)
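
The shape of this matrix is (number of documents, number of topics), and each row is that document's distribution over the topics. A minimal sketch on a made-up four-document corpus (the documents and the `toy_` names are assumptions) shows the same structure at a readable size.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (assumption): two politics-like and two health-like documents.
toy_docs = [
    "election vote campaign vote",
    "campaign election president",
    "health hospital patients care",
    "patients care health drug",
]

toy_counts = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)

# One row per document, one column per topic.
toy_doc_topics = toy_lda.fit_transform(toy_counts)

print(toy_doc_topics.shape)        # (4, 2)
print(toy_doc_topics.sum(axis=1))  # every row sums to 1

# The dominant topic of each document is the column with the highest value.
dominant = toy_doc_topics.argmax(axis=1)
print(dominant)
```

Which topic index ends up meaning "politics" or "health" is arbitrary and depends on the random seed, which is why the topics have to be labelled by hand.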

### Assigning Labels to the Topics

In [42]:
# Creating a new column for the topics for each document in the dataset!
data["Topic"] = topic_attachments_to_documents.argmax(axis = 1)

In [44]:
"""
Increasing the topic number by 1, as I am numbering the topics from 1, not from 0.
This makes it easier to compare with the top words of each topic displayed above!
"""
data["Topic"] = data["Topic"] + 1

In [58]:
data.head()

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",8
1,Donald Trump has used Twitter — his prefe...,8
2,Donald Trump is unabashedly praising Russian...,8
3,"Updated at 2:50 p. m. ET, Russian President Vl...",8
4,"From photography, illustration and video, to d...",7


In [60]:
## Giving Topic Labels
topic_labels = {1 : "Family",
                2 : "politics",
                3 : "State Politics",
                4 : "Music",
                5 : "Pharmacy/Drugs",
                6 : "War",
                7 : "Education",
                8 : "Presidential Management",
                9 : "Justice & Court"}

data["Topic Labels"] = data.Topic.map(topic_labels)

In [63]:
data.head(25)

Unnamed: 0,Article,Topic,Topic Labels
0,"In the Washington of 2016, even when the polic...",8,Presidential Management
1,Donald Trump has used Twitter — his prefe...,8,Presidential Management
2,Donald Trump is unabashedly praising Russian...,8,Presidential Management
3,"Updated at 2:50 p. m. ET, Russian President Vl...",8,Presidential Management
4,"From photography, illustration and video, to d...",7,Education
5,I did not want to join yoga class. I hated tho...,5,Pharmacy/Drugs
6,With a who has publicly supported the debunk...,5,Pharmacy/Drugs
7,"I was standing by the airport exit, debating w...",1,Family
8,"If movies were trying to be more realistic, pe...",7,Education
9,"Eighteen years ago, on New Year’s Eve, David F...",1,Family


## Using the Non-Negative Matrix Factorization (NMF) Algorithm to Perform Topic Modelling

In [64]:
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

In [65]:
# Generating TF-IDF values for the data, as NMF needs a non-negative matrix and is commonly applied to TF-IDF features.
tfidf_vec = TfidfVectorizer(max_df = 0.95, min_df = 2, stop_words = "english")
tfidf_data = tfidf_vec.fit_transform(data.Article)
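
As a rough sketch of what NMF does with such a matrix, the example below factorizes the TF-IDF matrix of a made-up four-document corpus into a document-topic matrix `W` and a topic-word matrix `H`, both non-negative. The toy documents, the `toy_` names and the explicit `init` and `max_iter` settings are assumptions for illustration; the notebook itself uses the defaults.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumption): two politics-like and two health-like documents.
toy_docs = [
    "election vote campaign vote",
    "campaign election president",
    "health hospital patients care",
    "patients care health drug",
]

toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = toy_nmf.fit_transform(toy_tfidf)  # documents x topics
H = toy_nmf.components_               # topics x words

print(W.shape, H.shape)
print(bool((W >= 0).all()), bool((H >= 0).all()))

# W @ H approximates the original TF-IDF matrix.
print(np.abs(toy_tfidf.toarray() - W @ H).max())
```

Unlike the LDA output, the rows of `W` are not probabilities and do not sum to 1, but taking the `argmax` of each row still gives the strongest topic per document.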

In [66]:
tfidf_data

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [68]:
tfidf_data.toarray()[0]

array([0., 0., 0., ..., 0., 0., 0.])

In [69]:
# Creating the NMF Model
nmf = NMF(n_components = 7, random_state = 5)

In [70]:
nmf.fit(tfidf_data)



NMF(n_components=7, random_state=5)

In [72]:
tfidf_vec.get_feature_names()[5000], tfidf_vec.get_feature_names()[5500]

('bask', 'benzodiazepines')

In [73]:
for index, topic in enumerate(nmf.components_):
    print("Top 15 words for Topic:", index + 1)
    print(list(tfidf_vec.get_feature_names()[word_index] for word_index in topic.argsort()[-15 : ]))
    print("*" * 50)
    print()

Top 15 words for Topic: 1
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']
**************************************************

Top 15 words for Topic: 2
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']
**************************************************

Top 15 words for Topic: 3
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']
**************************************************

Top 15 words for Topic: 4
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']
**************************************************

Top 15 words for Topic: 5
['primary', 'cruz', 'election', 'democrat

In [74]:
topic_attachments_tfidf = nmf.transform(tfidf_data)

In [78]:
data["TFIDF TOPIC"] = topic_attachments_tfidf.argmax(axis = 1)

In [83]:
data["TFIDF TOPIC"] = data["TFIDF TOPIC"].apply(lambda x: x + 1)

In [84]:
data

Unnamed: 0,Article,Topic,Topic Labels,TFIDF TOPIC,Topic Labels TFIDF
0,"In the Washington of 2016, even when the polic...",8,Presidential Management,2,Pharmacy/Drugs
1,Donald Trump has used Twitter — his prefe...,8,Presidential Management,2,Pharmacy/Drugs
2,Donald Trump is unabashedly praising Russian...,8,Presidential Management,2,Pharmacy/Drugs
3,"Updated at 2:50 p. m. ET, Russian President Vl...",8,Presidential Management,4,Law & Insurance
4,"From photography, illustration and video, to d...",7,Education,7,Music
...,...,...,...,...,...
11987,The number of law enforcement officers shot an...,5,Pharmacy/Drugs,4,Law & Insurance
11988,"Trump is busy these days with victory tours,...",2,politics,2,Pharmacy/Drugs
11989,It’s always interesting for the Goats and Soda...,1,Family,1,
11990,The election of Donald Trump was a surprise to...,2,politics,5,Jurisdiction


In [85]:
## Giving Topic Labels
topic_labels_tfidf = {1 : "Pharmacy/Drugs",
                2 : "Politics",
                3 : "Law & Insurance",
                4 : "Jurisdiction",
                5 : "Democracy",
                6 : "Music",
                7 : "Education"}

data["Topic Labels TFIDF"] = data["TFIDF TOPIC"].map(topic_labels_tfidf)

In [86]:
data.head(25)

Unnamed: 0,Article,Topic,Topic Labels,TFIDF TOPIC,Topic Labels TFIDF
0,"In the Washington of 2016, even when the polic...",8,Presidential Management,2,Politics
1,Donald Trump has used Twitter — his prefe...,8,Presidential Management,2,Politics
2,Donald Trump is unabashedly praising Russian...,8,Presidential Management,2,Politics
3,"Updated at 2:50 p. m. ET, Russian President Vl...",8,Presidential Management,4,Jurisdiction
4,"From photography, illustration and video, to d...",7,Education,7,Education
5,I did not want to join yoga class. I hated tho...,5,Pharmacy/Drugs,6,Music
6,With a who has publicly supported the debunk...,5,Pharmacy/Drugs,1,Pharmacy/Drugs
7,"I was standing by the airport exit, debating w...",1,Family,1,Pharmacy/Drugs
8,"If movies were trying to be more realistic, pe...",7,Education,1,Pharmacy/Drugs
9,"Eighteen years ago, on New Year’s Eve, David F...",1,Family,6,Music
