# Latent Dirichlet Allocation

It is relatively easy for humans to learn a language. Through years of largely subconscious practice, we pick up nuances and grow more sophisticated with the help of localised cultural cues. We are able to derive deep meaning from very few words.

For machines, which operate on binary logic, understanding human language is a far harder task.

One way to approach it is to predetermine the groups to which certain words belong, separate the useful words from stop words, and assign a score to the relationship between two words in a sentence.

Document: a probability distribution over latent topics.
Topic: a probability distribution over words.
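
The two definitions above can be made concrete with a small numeric sketch. The vocabulary and all probabilities below are hypothetical values chosen only for illustration; they are not learned from any data.

```python
import numpy as np

# Hypothetical four-word vocabulary, for illustration only.
vocab = ["horse", "saddle", "paint", "colour"]

# Each topic is a probability distribution over the vocabulary (each row sums to 1).
topic_word = np.array([
    [0.60, 0.30, 0.05, 0.05],  # topic 0: weighted towards "horse" and "saddle"
    [0.05, 0.05, 0.40, 0.50],  # topic 1: weighted towards "paint" and "colour"
])

# Each document is a probability distribution over the topics.
doc_topic = np.array([0.8, 0.2])

# Mixing the topics by the document's topic weights gives the
# word distribution implied for that document.
doc_word = doc_topic @ topic_word

for word, prob in zip(vocab, doc_word):
    print(f"{word}: {prob:.2f}")
```

Because both inputs are proper probability distributions, the resulting word distribution also sums to 1.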


Here, the word ‘topic’ refers to associating a word with a category. For instance, when the machine reads the sentence "horse is black", it tokenizes the sentence and concludes that there are two topics: horse, which is an animal, and black, which is a colour.

https://analyticsindiamag.com/beginners-guide-to-latent-dirichlet-allocation/

LDA is one of the most popular topic modeling methods. Each document is made up of various words, and each topic also has various words belonging to it. The aim of LDA is to find the topics a document belongs to, based on the words in it.

https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2



## Data

We will be using articles from NPR (National Public Radio), obtained from their website, [www.npr.org](http://www.npr.org).

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd

In [3]:
npr = pd.read_csv('npr.csv')

In [4]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


Notice that the articles come without topic labels! Let's use LDA to try to discover clusters of articles.

## Preprocessing

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0`<br>
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

**`min_df`**` : float in range [0.0, 1.0] or int, default=1`<br>
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

In [6]:
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [7]:
dtm = cv.fit_transform(npr['Article'])

In [8]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

## LDA

In [9]:
from sklearn.decomposition import LatentDirichletAllocation

In [10]:
LDA = LatentDirichletAllocation(n_components=7, random_state=42)

In [None]:
# This can take a while; we're dealing with a large number of documents!
LDA.fit(dtm)

## Showing Stored Words

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
len(cv.get_feature_names_out())

In [None]:
import random

In [None]:
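# Print ten randomly chosen words from the vocabulary
feature_names = cv.get_feature_names_out()
for i in range(10):
    random_word_id = random.randint(0, len(feature_names) - 1)
    print(feature_names[random_word_id])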

In [None]:
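# Run the sampling again to see a different set of words
feature_names = cv.get_feature_names_out()
for i in range(10):
    random_word_id = random.randint(0, len(feature_names) - 1)
    print(feature_names[random_word_id])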

### Showing Top Words Per Topic

In [None]:
len(LDA.components_)

In [None]:
LDA.components_

In [None]:
len(LDA.components_[0])

In [None]:
single_topic = LDA.components_[0]

In [None]:
# Returns the indices that would sort this array.
single_topic.argsort()

In [None]:
# Weight of the word least representative of this topic (first index in the argsort output)
single_topic[18302]

In [None]:
# Weight of the word most representative of this topic (last index in the argsort output)
single_topic[42993]

In [None]:
# Top 10 words for this topic:
single_topic.argsort()[-10:]
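
The slicing idiom above relies on `argsort()` returning indices in ascending order of value. A minimal sketch with hypothetical weights for a five-word vocabulary shows what the slice selects:

```python
import numpy as np

# Hypothetical topic weights, for illustration only.
weights = np.array([0.2, 3.1, 0.7, 5.4, 1.0])

# Indices ordered from the smallest weight to the largest.
order = weights.argsort()

# The last three indices point at the three largest weights (still ascending).
top_three = order[-3:]

# Reversing first lists the same three indices with the largest weight first.
top_three_desc = order[::-1][:3]

print(order, top_three, top_three_desc)
```

So `argsort()[-10:]` yields the ten highest-weighted word indices, with the single most representative word last.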

In [None]:
top_word_indices = single_topic.argsort()[-10:]

In [None]:
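# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = cv.get_feature_names_out()
for index in top_word_indices:
    print(feature_names[index])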

These look like business articles, perhaps. Let's confirm by using `.transform()` on our vectorized articles to attach a label number. But first, let's view all 7 topics found.

In [None]:
# Fetch the vocabulary once instead of on every lookup
feature_names = cv.get_feature_names_out()
for index, topic in enumerate(LDA.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')

### Attaching Discovered Topic Labels to Original Articles

In [None]:
dtm

In [None]:
dtm.shape

In [None]:
len(npr)

In [None]:
topic_results = LDA.transform(dtm)

In [None]:
topic_results.shape

In [None]:
topic_results[0]

In [None]:
topic_results[0].round(2)

In [None]:
topic_results[0].argmax()

`argmax()` returns the index of the highest probability in the row, so this means that our model thinks the first article belongs to topic #1.
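
As a rough illustration of what `.transform()` returns, the sketch below fits a two-topic model on a hypothetical four-sentence corpus. The corpus and the variable names are made up for this example, and the topic assignments on such a tiny corpus are not meaningful; the point is only the shape and normalisation of the output.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only.
toy_docs = [
    "stocks market trade bank",
    "bank market money stocks",
    "team game score player",
    "player team season game",
]

toy_dtm = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)

# One row per document, one column per topic.
toy_topics = toy_lda.fit_transform(toy_dtm)
print(toy_topics.shape)

# Each row is a probability distribution over topics, so it sums to 1.
print(toy_topics.sum(axis=1))

# The most probable topic index for each document.
toy_labels = toy_topics.argmax(axis=1)
print(toy_labels)
```

The topic numbers themselves are arbitrary identifiers; only the grouping they induce carries meaning.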

### Combining with Original Data

In [None]:
npr.head()

In [None]:
topic_results.argmax(axis=1)

In [None]:
npr['Topic'] = topic_results.argmax(axis=1)

In [None]:
npr.head(10)
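
Once a `Topic` column exists, one quick sanity check is to count how many articles landed in each topic. The sketch below uses a small hypothetical DataFrame in place of `npr`; the same `value_counts()` call should work on the real column.

```python
import pandas as pd

# Hypothetical stand-in for the labelled articles, for illustration only.
toy = pd.DataFrame({
    "Article": ["first text", "second text", "third text", "fourth text"],
    "Topic": [1, 0, 1, 1],
})

# Number of articles assigned to each topic, ordered by topic index.
topic_counts = toy["Topic"].value_counts().sort_index()
print(topic_counts)
```

A heavily skewed count can be a hint that the number of topics is worth revisiting.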