# Topic Modelling
Given a collection of documents, we try to discover the topics they cover, where each topic is described by the words that characterise it best.

The general idea is:
1. We create a bag-of-words representation of the documents
2. We fit a topic model (LDA or NMF) to that representation to generate topics
3. For each topic, we rank the words by how strongly they are associated with it and return the top ones

In [105]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import pandas as pd

## Dataframe

In [106]:
df = pd.read_csv("./data/Articles.csv", sep=",", encoding= 'unicode_escape')
df

| Unnamed: 0 | Article | Date | Heading | NewsType |
|---|---|---|---|---|
| 0 | KARACHI: The Sindh government has decided to b... | 1/1/2015 | sindh govt decides to cut public transport far... | business |
| 1 | HONG KONG: Asian markets started 2015 on an up... | 1/2/2015 | asia stocks up in new year trad | business |
| 2 | HONG KONG: Hong Kong shares opened 0.66 perce... | 1/5/2015 | hong kong stocks open 0.66 percent lower | business |
| 3 | HONG KONG: Asian markets tumbled Tuesday follo... | 1/6/2015 | asian stocks sink euro near nine year | business |
| 4 | NEW YORK: US oil prices Monday slipped below $... | 1/6/2015 | us oil prices slip below 50 a barr | business |
| ... | ... | ... | ... | ... |
| 2687 | strong>DUBAI: Dubai International Airport and ... | 3/25/2017 | Laptop ban hits Dubai for 11m weekend traveller | business |
| 2688 | strong>BEIJING: Former Prime Minister, Shaukat... | 3/26/2017 | Pak China relations not against any third coun... | business |
| 2689 | strong>WASHINGTON: Uber has grounded its fleet... | 3/26/2017 | Uber grounds self driving cars after accid | business |
| 2690 | strong>BEIJING: The New Development Bank plans... | 3/27/2017 | New Development Bank plans joint investments i... | business |


## Vectorizing
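For LDA, we only need raw word counts (the bag of words), so we use CountVectorizer.

For NMF, we use TfidfVectorizer: NMF factorises a non-negative, real-valued matrix, and Tf-Idf weights are a common choice because they down-weight words that appear in most documents.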
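To make the difference between the two vectorizers concrete, here is a small sketch on a made-up two-document corpus. `sample_docs` is a hypothetical example, and both vectorizers are used with their default settings rather than the `stop_words`, `min_df`, and `max_df` values used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical two-document corpus for illustration only.
sample_docs = ["oil oil prices", "cricket prices"]

count_matrix = CountVectorizer().fit_transform(sample_docs)
tfidf_matrix = TfidfVectorizer().fit_transform(sample_docs)

# Columns are the sorted vocabulary: cricket, oil, prices.
# Raw integer counts: "oil" appears twice in the first document.
print(count_matrix.toarray())
# → [[0 2 1]
#    [1 0 1]]

# Tf-idf weights are real-valued and each row has unit L2 norm by default.
# "prices" occurs in both documents, so it gets a lower weight than a word
# with the same count that occurs in only one document.
print(tfidf_matrix.toarray().round(3))
```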

In [107]:
tiv = TfidfVectorizer(stop_words="english", min_df=1, max_df=0.95)
df_vectorized = tiv.fit_transform(df['Article'])

In [108]:
df_vectorized

<2692x29388 sparse matrix of type '<class 'numpy.float64'>'
	with 350156 stored elements in Compressed Sparse Row format>

In [109]:
cv = CountVectorizer(stop_words="english", min_df=1, max_df=0.95)
cv_vectorized = cv.fit_transform(df['Article'])

In [110]:
tiv.get_feature_names_out()

array(['00', '000', '0000', ..., '½zaman', '½zarai', '½ï'], dtype=object)

## LDA

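LDA models each document as a mixture of topics and each topic as a distribution over words, so the probability of a word in a document combines P(topic | document) * P(word | topic). Conceptually, fitting starts from a random assignment of words to topics and then repeatedly refines it according to how often each word shows up alongside the other words in that topic. (scikit-learn's implementation uses variational Bayes rather than explicit sampling, but the intuition is the same.)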
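As a rough sketch of how the two factors in that product map onto scikit-learn's outputs: `fit_transform` returns an estimate of P(topic | document), and normalising each row of `components_` gives an estimate of P(word | topic). The corpus, the names `lda_demo`, `p_topic_given_doc`, and `p_word_given_topic`, and the choice of 2 topics are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus for illustration only.
docs = [
    "oil crude prices opec",
    "oil prices barrel crude",
    "cricket test match runs",
    "cricket wickets runs match",
]
counts = CountVectorizer().fit_transform(docs)

lda_demo = LatentDirichletAllocation(n_components=2, random_state=0)
p_topic_given_doc = lda_demo.fit_transform(counts)  # shape (4, 2)

# components_ holds unnormalised topic-word weights; dividing each row by its
# sum gives an estimate of P(word | topic).
topic_word = lda_demo.components_
p_word_given_topic = topic_word / topic_word.sum(axis=1, keepdims=True)

# Summing P(topic | document) * P(word | topic) over topics gives each
# document's distribution over the vocabulary.
p_word_given_doc = p_topic_given_doc @ p_word_given_topic
print(p_word_given_doc.sum(axis=1))  # each row sums to 1 (up to rounding)
```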

In [111]:
from sklearn.decomposition import LatentDirichletAllocation, NMF

In [123]:
LDA = LatentDirichletAllocation(n_components=10)
final_topics = LDA.fit_transform(cv_vectorized)

Print each topic (cluster) together with its 10 highest-weighted words. The words are listed in ascending order of weight, so the last word in each list is the strongest.

In [124]:
for topic_id, topic in enumerate(LDA.components_):
    print(f'Topic #{topic_id}')
    # argsort is ascending, so the last 10 indices are the highest-weighted words
    print(list(cv.get_feature_names_out()[topic.argsort()[-10:]]))

Topic #0
['output', 'barrel', 'production', 'opec', 'million', 'market', 'crude', 'prices', 'said', 'oil']
Topic #1
['board', 'players', 'india', 'world', 'pcb', '½s', 'team', 'cricket', 'pakistan', 'said']
Topic #2
['team', 'game', 'minutes', 'win', 'time', 'half', 'goal', 'final', 'second', '½s']
Topic #3
['australia', 'india', 'new', 'players', 'day', 'test', 'world', '½s', 'said', 'cricket']
Topic #4
['second', 'match', 'pakistan', 'sri', 'day', '½ï', 'series', 'england', '½s', 'test']
Topic #5
['won', 'set', 'old', 'open', 'time', 'strong', 'world', 'year', 'said', '½s']
Topic #6
['bank', 'market', 'week', 'yen', 'markets', 'said', 'dollar', 'year', '½s', 'percent']
Topic #7
['½s', 'west', 'overs', 'balls', 'pakistan', 'wickets', 'captain', 'match', 'india', 'runs']
Topic #8
['worth', 'waived', 'countries', 'loans', 'china', 'power', 'billion', 'said', '½s', 'million']
Topic #9
['tax', 'percent', 'million', 'minister', 'government', '½s', 'year', 'billion', 'pakistan', 'said']
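The loop above can be wrapped in a small helper so that the same code serves any fitted model that exposes a `components_` matrix. `top_words_per_topic` is a hypothetical name, not a scikit-learn function, and unlike the loop above it returns the strongest word first. The demo matrix is made up.

```python
import numpy as np

def top_words_per_topic(components, feature_names, n_top=10):
    """Return, for each topic row, the n_top highest-weighted words (best first)."""
    feature_names = np.asarray(feature_names)
    topics = []
    for weights in components:
        top_idx = np.argsort(weights)[::-1][:n_top]
        topics.append(feature_names[top_idx].tolist())
    return topics

# Hypothetical 2-topic, 4-word weight matrix standing in for a components_ array.
demo_components = np.array([[0.1, 5.0, 2.0, 0.3],
                            [4.0, 0.2, 0.1, 3.0]])
demo_vocab = ["cricket", "oil", "prices", "team"]
print(top_words_per_topic(demo_components, demo_vocab, n_top=2))
# → [['oil', 'prices'], ['cricket', 'team']]
```

In this notebook it would be called with a fitted model's `components_` and the matching vectorizer's `get_feature_names_out()`.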


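The topic distribution of the first document: one weight per topic, summing to 1. The cell numbers suggest this output comes from an earlier run of the model; since LDA is randomly initialised, the topic indices here may not match the topics printed above.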

In [114]:
final_topics[0]

array([0.00144952, 0.00144946, 0.00144963, 0.00144948, 0.26814675,
       0.00144947, 0.00144964, 0.72025676, 0.00144984, 0.00144946])
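A quick check on the row printed above, using the values copied from that output: the weights behave like a probability distribution, and `argmax` picks the dominant topic. The variable name `doc0` is introduced here only for the example.

```python
import numpy as np

# Values copied from the final_topics[0] output above.
doc0 = np.array([0.00144952, 0.00144946, 0.00144963, 0.00144948, 0.26814675,
                 0.00144947, 0.00144964, 0.72025676, 0.00144984, 0.00144946])

print(round(doc0.sum(), 6))        # → 1.0
print(int(doc0.argmax()))          # → 7
# The two most probable topics for this document, strongest first.
print(np.argsort(doc0)[::-1][:2])  # → [7 4]
```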

In [115]:
df['topic'] = final_topics.argmax(axis = 1)

In [116]:
df['topic'].value_counts()

9    513
1    359
4    318
0    317
7    315
8    315
2    237
6    172
3     86
5     60
Name: topic, dtype: int64

## NMF
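NMF takes a non-negative weighted matrix V (here the Tf-Idf document-term matrix) and factorises it as V ~ W * H, where W holds the weight of each topic in each document and H holds the weight of each word in each topic. In scikit-learn, fit_transform returns W and components_ is H.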
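The shapes involved in the factorisation can be seen on a small made-up matrix. This is a sketch under assumptions: `V` below is a hypothetical 4-document, 5-term matrix, and the `init`, `random_state`, and `max_iter` arguments are chosen only to make the toy run repeatable.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative document-term matrix V (4 documents, 5 terms).
V = np.array([[2, 1, 0, 0, 0],
              [3, 2, 0, 0, 1],
              [0, 0, 1, 2, 0],
              [0, 1, 2, 3, 0]], dtype=float)

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = toy_nmf.fit_transform(V)  # document-topic weights, shape (4, 2)
H = toy_nmf.components_       # topic-term weights, shape (2, 5)

print(W.shape, H.shape)
# Frobenius norm of the residual; small when W @ H reproduces V well.
print(np.linalg.norm(V - W @ H))
```

Both factors are non-negative, which is what lets each row of H be read as a list of word weights for one topic.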

In [127]:
nmf = NMF(n_components=10)
final_topics_nmf = nmf.fit_transform(df_vectorized)

In [128]:
for topic_id, topic in enumerate(nmf.components_):
    print(f'Topic #{topic_id}')
    # argsort is ascending, so the last 10 indices are the highest-weighted words
    print(list(tiv.get_feature_names_out()[topic.argsort()[-10:]]))

Topic #0
['week', 'fed', 'rates', 'bank', 'markets', 'yen', 'rate', 'gold', 'dollar', 'percent']
Topic #1
['india', 'indies', 'balls', 'australia', 'wicket', 'overs', 'lanka', 'wickets', 'sri', 'runs']
Topic #2
['million', 'barrels', 'cents', 'output', 'production', 'barrel', 'opec', 'prices', 'crude', 'oil']
Topic #3
['½s', 'finance', 'imf', 'economic', 'tax', 'government', 'minister', 'pakistan', 'said', 'billion']
Topic #4
['dubai', 'pakistan', 'captain', 'quetta', 'ahmed', 'kings', 'gladiators', 'peshawar', 'zalmi', 'mohammad']
Topic #5
['ball', 'lordï', 'root', 'amir', 'series', 'innings', 'cook', '½s', 'england', 'test']
Topic #6
['security', 'icc', 'players', 'board', 'said', 'pcb', 'team', 'india', 'pakistan', 'cricket']
Topic #7
['year', 'title', 'win', 'messi', 'said', 'open', 'time', 'world', 'final', '½s']
Topic #8
['oil', 'regulatory', 'prices', 'petroleum', 'ogra', 'price', 'petrol', 'diesel', 'rs', 'litre']
Topic #9
['benchmark', 'topix', '225', 'shares', 'stocks', 'yen'

In [119]:
df['topic_nmf'] = final_topics_nmf.argmax(axis = 1)

In [120]:
df['topic_nmf'].value_counts()

3    595
7    492
6    329
1    315
2    256
5    190
0    185
4    120
9    107
8    103
Name: topic_nmf, dtype: int64
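One possible sanity check on these assignments is to cross-tabulate the dominant topic against the NewsType column that ships with the data. The frame below is a hypothetical stand-in with made-up rows and label values; only the column names are taken from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for df; rows and label values are invented.
toy_df = pd.DataFrame({
    "NewsType":  ["business", "business", "sports", "sports", "sports"],
    "topic_nmf": [2, 2, 1, 1, 6],
})

# Rows: known label; columns: dominant topic; cells: number of articles.
table = pd.crosstab(toy_df["NewsType"], toy_df["topic_nmf"])
print(table)
```

A topic whose articles fall almost entirely under one label is easier to interpret than one that is spread evenly across labels.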