## LDA (Latent Dirichlet Allocation)

In this notebook, I'll walk through a practical example of topic modelling using LDA.

For this I'll be using the ABC news headlines dataset from Kaggle: https://www.kaggle.com/therohk/million-headlines

In [2]:
# Let's first read the dataset
import pandas as pd

df = pd.read_csv("abcnews-date-text.csv")


In [3]:
# Let's check the head of the dataframe

df.head()

| | publish_date | headline_text |
|---|---|---|
| 0 | 20030219 | aba decides against community broadcasting lic... |
| 1 | 20030219 | act fire witnesses must be aware of defamation |
| 2 | 20030219 | a g calls for infrastructure protection summit |
| 3 | 20030219 | air nz staff in aust strike for pay rise |
| 4 | 20030219 | air nz strike to affect australian travellers |


#### Our main focus here is the headline_text column, because the topics will be extracted from these headlines.

In [32]:
df1 = df[:50000].drop("publish_date", axis = 1)

#### Here I am taking only the first 50,000 records.

In [33]:
df1.head()

| | headline_text |
|---|---|
| 0 | aba decides against community broadcasting lic... |
| 1 | act fire witnesses must be aware of defamation |
| 2 | a g calls for infrastructure protection summit |
| 3 | air nz staff in aust strike for pay rise |
| 4 | air nz strike to affect australian travellers |


In [34]:
# Length of the data

len(df1)

50000

### Preprocessing

In [35]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_df = 0.95, min_df = 3, stop_words = 'english')
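To make the effect of `max_df` and `min_df` concrete, here is a minimal sketch on a made-up four-headline corpus. The names `toy_docs`, `toy_cv` and `toy_dtm` are hypothetical and not part of this notebook, and `min_df` is lowered to 2 so that the tiny corpus keeps some words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the ABC dataset
toy_docs = [
    "police probe fire",
    "police probe crash",
    "police report fire",
    "police report storm",
]

# max_df=0.95 drops terms that appear in more than 95% of the documents;
# min_df=2 drops terms that appear in fewer than 2 documents.
toy_cv = CountVectorizer(max_df=0.95, min_df=2)
toy_dtm = toy_cv.fit_transform(toy_docs)

# vocabulary_ maps each kept term to its column index
kept_terms = sorted(toy_cv.vocabulary_)
print(kept_terms)
print(toy_dtm.shape)
```

"police" appears in every headline, so `max_df` removes it, while "crash" and "storm" appear only once, so `min_df` removes them.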

In [36]:
# Create a document-term matrix (df1 already holds only the first 50,000 rows)

dtm = cv.fit_transform(df1['headline_text'])

In [37]:
dtm

<50000x9672 sparse matrix of type '<class 'numpy.int64'>'
	with 234987 stored elements in Compressed Sparse Row format>

### Let's perform LDA

***Here I'll be assuming that there are 20 topics present in this collection of headlines***
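The number of topics is an assumption, not something LDA learns. One common way to sanity-check it is to compare perplexity on held-out documents for a few candidate values (lower is better). The sketch below uses a small random count matrix purely to show the mechanics; `toy_counts`, `train`, `held_out` and the candidate values are all hypothetical, and the numbers it prints say nothing about the headlines dataset.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical random counts: 40 documents x 12 terms
rng = np.random.default_rng(0)
toy_counts = rng.poisson(1.0, size=(40, 12))

# Fit on the first 30 documents, evaluate on the remaining 10
train, held_out = toy_counts[:30], toy_counts[30:]

perplexities = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, random_state=0)
    model.fit(train)
    perplexities[k] = model.perplexity(held_out)

for k, value in perplexities.items():
    print(f"n_components={k}: held-out perplexity={value:.2f}")
```

On real data, perplexity is only a rough guide, so it is worth reading the top words of each topic as well before settling on a value.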

In [38]:
from sklearn.decomposition import LatentDirichletAllocation

In [39]:
lda = LatentDirichletAllocation(n_components = 20, random_state = 79)

In [40]:
# This will take some time to execute

lda.fit(dtm)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=20, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=79, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [41]:
topics = lda.transform(dtm)
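The array returned by `lda.transform` has one row per document and one column per topic, and in recent scikit-learn versions each row is a probability distribution over topics. A minimal sketch on a hypothetical 6 x 4 count matrix (`toy_counts` and `toy_lda` are made-up names) shows the shape and the row sums:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical count matrix: 6 documents x 4 terms
toy_counts = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 1],
    [4, 1, 0, 0],
    [0, 0, 3, 2],
    [0, 1, 2, 3],
    [0, 0, 4, 1],
])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_topics = toy_lda.fit_transform(toy_counts)

print(toy_topics.shape)        # (documents, topics)
print(toy_topics.sum(axis=1))  # each row should sum to roughly 1
```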

### Let's print the 15 highest-weighted words for each of the 20 topics

In [42]:
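# Look up the vocabulary once instead of once per printed word
# (scikit-learn 1.0 and later name this method get_feature_names_out())
feature_names = cv.get_feature_names()

for index, topic in enumerate(lda.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    # argsort is ascending, so the last 15 indices carry the largest weights
    print([feature_names[i] for i in topic.argsort()[-15:]])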
    print('\n')
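Note that `argsort()[-15:]` returns the words in ascending order of weight, so the strongest word of each topic is printed last. A small sketch with made-up weights (`weights` and `words` are hypothetical) shows the ordering, plus one way to get it in descending order:

```python
import numpy as np

# Hypothetical topic weights for five words
weights = np.array([0.5, 3.0, 1.2, 0.1, 2.4])
words = np.array(["rain", "council", "police", "tax", "court"])

# Same pattern as above: the last 3 indices of an ascending sort
top3_ascending = words[weights.argsort()[-3:]]

# Reverse the sort first to list the strongest word first
top3_descending = words[weights.argsort()[::-1][:3]]

print(list(top3_ascending))
print(list(top3_descending))
```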

THE TOP 15 WORDS FOR TOPIC #0
['row', 'sale', 'telstra', 'indigenous', 'bid', 'campaign', 'budget', 'tax', 'airport', 'bomb', 'community', 'blast', 'funding', 'boost', 'security']


THE TOP 15 WORDS FOR TOPIC #1
['says', 'saddam', 'dump', 'qaeda', 'broken', 'gm', 'city', 'waste', 'israel', 'gets', 'industry', 'al', 'warns', 'hill', 'future']


THE TOP 15 WORDS FOR TOPIC #2
['debate', 'merger', 'real', 'local', 'centre', 'stop', 'woes', 'seeks', 'force', 'new', 'air', 'plan', 'chief', 'work', 'council']


THE TOP 15 WORDS FOR TOPIC #3
['airs', 'opposition', 'staff', 'nsw', 'support', 'east', 'rate', 'teachers', 'pay', 'gold', 'west', 'strike', 'coast', 'south', 'concerns']


THE TOP 15 WORDS FOR TOPIC #4
['soldiers', 'british', 'bali', 'forces', 'victims', 'iraqi', 'israeli', 'case', 'search', 'attack', 'appeal', 'missing', 'iraq', 'killed', 'baghdad']


THE TOP 15 WORDS FOR TOPIC #5
['aims', 'plant', 'children', 'downer', 'nuclear', 'begin', 'says', 'sign', 'gas', 'deal', 'urges', 'nor

### Let's combine these topics with our original headlines

In [44]:
df1['Headline Topic'] = topics.argmax(axis = 1)
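`argmax(axis = 1)` picks, for each headline, the topic with the largest weight in its row. Here is a minimal sketch on a hand-written 4 x 3 matrix; `toy_topics`, `toy_df` and the extra `Topic Weight` column are hypothetical additions for illustration. Keeping the winning weight alongside the label is one way to see how confident each assignment is.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights: 4 headlines x 3 topics
toy_topics = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.6, 0.1],
    [0.6, 0.3, 0.1],
])
toy_df = pd.DataFrame({"headline_text": ["h0", "h1", "h2", "h3"]})

# Index of the largest weight in each row = dominant topic
toy_df["Headline Topic"] = toy_topics.argmax(axis=1)
# The weight of that dominant topic
toy_df["Topic Weight"] = toy_topics.max(axis=1)

# How many headlines landed in each topic
topic_counts = toy_df["Headline Topic"].value_counts().sort_index()

print(toy_df)
print(topic_counts)
```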

In [45]:
df1.head()

| | headline_text | Headline Topic |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 5 |
| 1 | act fire witnesses must be aware of defamation | 6 |
| 2 | a g calls for infrastructure protection summit | 7 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 14 |
