
**Objective**:
1. Determine roughly how many topics these articles can be divided into.

In [None]:
import pandas as pd
# Read the articles, decoding the file as Latin-1
articles = pd.read_csv("Articles.csv", encoding='latin1')
articles.head()

Unnamed: 0,Article,Date,Heading,NewsType
0,KARACHI: The Sindh government has decided to b...,1/1/2015,sindh govt decides to cut public transport far...,business
1,HONG KONG: Asian markets started 2015 on an up...,1/2/2015,asia stocks up in new year trad,business
2,HONG KONG: Hong Kong shares opened 0.66 perce...,1/5/2015,hong kong stocks open 0.66 percent lower,business
3,HONG KONG: Asian markets tumbled Tuesday follo...,1/6/2015,asian stocks sink euro near nine year,business
4,NEW YORK: US oil prices Monday slipped below $...,1/6/2015,us oil prices slip below 50 a barr,business


In [None]:
articles.isnull().sum()

Article     0
Date        0
Heading     0
NewsType    0
dtype: int64

In [None]:
# Keep only the text columns; the topics are inferred without using the NewsType labels
articles.drop(columns=["Date", "NewsType"], inplace=True)

In [None]:
articles.shape

(2692, 2)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [None]:
dtm = tfidf.fit_transform(articles['Article'])
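
To make the effect of `max_df=0.95` and `min_df=2` concrete, here is a minimal sketch on a toy corpus. The four sentences and the `toy_` names are made up for illustration and are not taken from `Articles.csv`. With `min_df=2`, a term must appear in at least two documents to be kept; with `max_df=0.95`, terms that appear in more than 95% of the documents are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (not from Articles.csv)
toy_docs = [
    "oil prices fell as crude output rose",
    "crude oil prices slipped on weak demand",
    "pakistan won the cricket test match",
    "england lost the cricket match by ten runs",
]

# Same settings as the vectorizer used in this notebook
toy_tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words="english")
toy_dtm = toy_tfidf.fit_transform(toy_docs)

# vocabulary_ maps each surviving term to its column index
kept_terms = sorted(toy_tfidf.vocabulary_)
print(kept_terms)     # terms that appear in at least two documents
print(toy_dtm.shape)  # (number of documents, number of kept terms)
```

Terms that occur in a single document, such as "demand" or "england", are filtered out, which is why the vocabulary ends up much smaller than the raw word count.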

In [None]:
from sklearn.decomposition import NMF

In [None]:
nmf_model = NMF(n_components=4)

In [None]:
nmf_model.fit(dtm)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=4, random_state=None, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)
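
The notebook fixes `n_components=4` by hand. One rough way to approach the objective of estimating the number of topics is to fit NMF for several candidate values and compare the reconstruction error, looking for the point where adding topics stops helping much. The sketch below assumes this "elbow" heuristic is acceptable; it is not the method used in this notebook, and the toy corpus and the `toy_`/`errors` names are hypothetical.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with three obvious themes (not from Articles.csv)
toy_docs = [
    "oil prices crude barrel opec",
    "crude oil output barrel demand",
    "cricket match wickets innings runs",
    "test cricket innings captain runs",
    "stocks shares index yen dollar",
    "dollar yen markets shares investors",
]
toy_dtm = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

# Fit one model per candidate topic count and record its reconstruction error
errors = {}
for k in range(1, 4):
    model = NMF(n_components=k, init="nndsvd", random_state=0, max_iter=500)
    model.fit(toy_dtm)
    # Frobenius norm of (data - W @ H) for the fitted model
    errors[k] = model.reconstruction_err_
    print(k, round(errors[k], 3))
```

The error generally shrinks as `k` grows, so the absolute minimum is not informative on its own; the useful signal is where the decrease flattens out. On the real data the same loop would run over `dtm` with a wider range of `k`.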

**Display topics**

In [None]:
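# get_feature_names() was removed in scikit-learn 1.2; on releases before 1.0, call get_feature_names() instead
len(tfidf.get_feature_names_out())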

16067

In [None]:
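import random
# Fetch the vocabulary once (scikit-learn >= 1.0; older releases use get_feature_names())
feature_names = tfidf.get_feature_names_out()
for _ in range(20):
    # Pick a random term index without hard-coding the vocabulary size
    random_word_id = random.randrange(len(feature_names))
    print(feature_names[random_word_id])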

warmup
understood
skidded
kyrgios
wearing
silly
strong1
feeding
strongtwo
suggestion
modric
posing
penetration
ronaldo
phangiso
hua
nz
angles
strongjust
275


In [None]:
len(nmf_model.components_)

4

In [None]:
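# Fetch the vocabulary once (scikit-learn >= 1.0; older releases use get_feature_names())
feature_names = tfidf.get_feature_names_out()
for index, topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    # argsort is ascending, so the last 15 indices are the highest-weighted terms
    print([str(feature_names[i]) for i in topic.argsort()[-15:]])
    print('\n')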

THE TOP 15 WORDS FOR TOPIC #0
['india', 'economic', 'finance', 'power', 'cricket', 'year', 'rs', 'tax', 'islamabad', 'country', 'government', 'minister', 'billion', 'said', 'pakistan']


THE TOP 15 WORDS FOR TOPIC #1
['ball', 'india', 'innings', 'pakistan', 'lanka', 'series', 'wicket', 'captain', 'sri', 'cricket', 'wickets', 'match', 'runs', 'test', 'england']


THE TOP 15 WORDS FOR TOPIC #2
['brent', 'iran', 'demand', 'market', 'said', 'million', 'barrels', 'cents', 'output', 'production', 'barrel', 'opec', 'prices', 'crude', 'oil']


THE TOP 15 WORDS FOR TOPIC #3
['investors', 'stocks', 'week', 'rose', 'rate', 'shares', 'markets', 'tokyo', 'points', 'bank', 'gold', 'index', 'dollar', 'yen', 'percent']
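
Because `argsort()` sorts in ascending order, each list above ends with its strongest term (for example, 'pakistan' for topic 0). The small sketch below, using made-up terms and weights, shows the difference between slicing the last entries and reversing the order first.

```python
import numpy as np

# Hypothetical vocabulary and one hypothetical topic's term weights
toy_terms = np.array(["bank", "cricket", "oil", "runs", "yen"])
toy_topic = np.array([0.1, 0.9, 0.0, 0.7, 0.2])

# Last three indices of an ascending sort: weakest of the top 3 comes first
top3_ascending = toy_terms[toy_topic.argsort()[-3:]]

# Reverse the sort order to list the strongest term first
top3_descending = toy_terms[toy_topic.argsort()[::-1][:3]]

print(top3_ascending.tolist())   # ['yen', 'runs', 'cricket']
print(top3_descending.tolist())  # ['cricket', 'runs', 'yen']
```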




In [None]:
topic_results = nmf_model.transform(dtm)

In [None]:
articles['Topic'] = topic_results.argmax(axis=1)
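
Taking `argmax(axis=1)` assigns each article to the single topic with the largest weight in its row, even when a second topic is nearly as strong. The sketch below uses a made-up weight matrix (not the real `topic_results`) to show the assignment alongside the share of each row's weight held by the winning topic, which is one possible way to flag ambiguous articles.

```python
import numpy as np

# Hypothetical document-topic weights: rows are documents, columns are topics
toy_weights = np.array([
    [0.10, 0.00, 0.02, 0.01],
    [0.00, 0.00, 0.03, 0.12],
    [0.04, 0.05, 0.00, 0.00],
])

# Index of the largest weight in each row, as done for articles['Topic']
dominant_topic = toy_weights.argmax(axis=1)

# Fraction of each row's total weight that belongs to the dominant topic
# (assumes no row is all zeros, which would divide by zero)
dominant_share = toy_weights.max(axis=1) / toy_weights.sum(axis=1)

print(dominant_topic.tolist())  # [0, 3, 1]
print(dominant_share.round(2).tolist())
```

The third row is split almost evenly between topics 0 and 1, so its dominant share is noticeably lower than that of the first two rows.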

In [None]:
# Labels chosen by hand after reading each topic's top words above
mapper = {0: "Economy/Finance", 1: "Cricket", 2: "Oil", 3: "Stock market"}
articles["Potential topics"] = articles["Topic"].map(mapper)

In [None]:
pd.reset_option('^display.', silent=True)
articles.head()

Unnamed: 0,Article,Heading,Topic,Potential topics
0,KARACHI: The Sindh government has decided to b...,sindh govt decides to cut public transport far...,0,Economy/Finance
1,HONG KONG: Asian markets started 2015 on an up...,asia stocks up in new year trad,3,Stock market
2,HONG KONG: Hong Kong shares opened 0.66 perce...,hong kong stocks open 0.66 percent lower,3,Stock market
3,HONG KONG: Asian markets tumbled Tuesday follo...,asian stocks sink euro near nine year,3,Stock market
4,NEW YORK: US oil prices Monday slipped below $...,us oil prices slip below 50 a barr,2,Oil


In [None]:
articles["Article"][0]

'KARACHI: The Sindh government has decided to bring down public transport fares by 7 per cent due to massive reduction in petroleum product prices by the federal government, Geo News reported.Sources said reduction in fares will be applicable on public transport, rickshaw, taxi and other means of traveling.Meanwhile, Karachi Transport Ittehad (KTI) has refused to abide by the government decision.KTI President Irshad Bukhari said the commuters are charged the lowest fares in Karachi as compare to other parts of the country, adding that 80pc vehicles run on Compressed Natural Gas (CNG). Bukhari said Karachi transporters will cut fares when decrease in CNG prices will be made.                        \r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n'

In [None]:
articles["Article"][1]

'HONG KONG: Asian markets started 2015 on an upswing in limited trading on Friday, with mainland Chinese stocks surging in Hong Kong on speculation Beijing may ease monetary policy to boost slowing growth.Hong Kong rose 1.07 percent, closing 252.78 points higher at 23857.82.Seoul closed up 0.57 percent, rising 10.85 points to 1,926.44, while Sydney gained 0.46 percent, or 24.89 points, to close at 5,435.9.Singapore edged up 0.19 percent, gaining 6.39 points to 3,371.54.Markets in mainland China, Japan, Taiwan, New Zealand, the Philippines, and Thailand remained closed for holidays.With mainland bourses shut until January 5, shares in Chinese developers and financial companies surged in Hong Kong, stoked by hopes that Beijing could ease monetary policy to support lagging growth in the world´s second-largest economy.China Vanke, the country´s biggest developer by sales, leapt 10.8 percent and the People´s Insurance Company (Group) of China Ltd. was up 5.51 percent in afternoon trading.Tr