<a href="https://colab.research.google.com/github/Ashwin1999/Topic-Modelling-using-LDA/blob/master/Final_LDA_on_BBC_News.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Importing the BBC News Dataset

In [None]:
import numpy as np
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/Ashwin1999/NLP-project---LDA/main/Datasets/BBC%20News.csv')
print(f'Number of na values {df.isna().sum().sum()}')
df.head()

Number of na values 0


Unnamed: 0,News
0,Wal-Mart fights back at accusers Two big US na...
1,Hodgson relishes European clashes Former Black...
2,UK set to cut back on embassies Nine overseas ...
3,Holmes is hit by hamstring injury Kelly Holmes...
4,O'Gara revels in Ireland victory Ireland fly-h...


# 2. Using NLTK's WordNetLemmatizer to Lemmatize the Documents

In [None]:
import nltk

nltk.download('wordnet')
nltk.download('punkt')

from nltk.stem import WordNetLemmatizer

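lemmatizer = WordNetLemmatizer()  # create once and reuse across calls

def SimpleLemmatizeSentences(sentence):
    # Tokenize, lemmatize each token (treated as a noun by default), and rejoin with spaces
    word_list = nltk.word_tokenize(sentence)
    return ' '.join(lemmatizer.lemmatize(w) for w in word_list)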

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
import time

start_time = time.time()
df.News = df.News.apply(SimpleLemmatizeSentences)
print("SimpleLemmatizeSentences() executed after ", round((time.time() - start_time), 2), " seconds.")

df.head()

SimpleLemmatizeSentences() executed after  13.79  seconds.


Unnamed: 0,News
0,Wal-Mart fight back at accuser Two big US name...
1,Hodgson relish European clash Former Blackburn...
2,UK set to cut back on embassy Nine overseas em...
3,Holmes is hit by hamstring injury Kelly Holmes...
4,O'Gara revel in Ireland victory Ireland fly-ha...


# 3. Get the Document Term Matrix

This can be done using either the Tf-idf vectorizer or the Count vectorizer.
The former returns a float matrix, whereas the latter returns an integer matrix.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_df=0.95, min_df=6, stop_words='english')
dtm = cv.fit_transform(df['News'])

print(f'Shape of document term matrix: {dtm.shape}')

Shape of document term matrix: (2225, 7325)
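
To make the integer-versus-float distinction concrete, here is a minimal sketch on a made-up three-sentence corpus (the `toy_docs` list is purely illustrative and not part of the BBC data). It only compares the dtypes of the two matrices; LDA is a model of word counts, which is why this notebook works with the count matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, for illustration only
toy_docs = [
    "the market rallied as bank shares rose",
    "the team won the match in the second half",
    "the bank cut its growth forecast for the market",
]

count_dtm = CountVectorizer(stop_words='english').fit_transform(toy_docs)
tfidf_dtm = TfidfVectorizer(stop_words='english').fit_transform(toy_docs)

print(count_dtm.shape == tfidf_dtm.shape)  # both use the same tokenizer defaults
print(count_dtm.dtype)  # integer counts
print(tfidf_dtm.dtype)  # floating-point weights
```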


# 4. Fit LDA to our Document Term Matrix

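Here, I want to find 5 topics in the corpus. I've set the alpha (`doc_topic_prior`) and eta (`topic_word_prior`) parameters to 0.1 and 0.3 respectively. I'm using the batch learning method, in which every iteration updates the model from the whole corpus, with a maximum of 50 iterations. Note that `batch_size`, `learning_decay` and `learning_offset` only take effect with `learning_method='online'`, so they have no influence on this fit.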

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

nTopics = 5

LDA = LatentDirichletAllocation(
    n_components=nTopics,
    doc_topic_prior=0.1,
    topic_word_prior=0.3,
    learning_method='batch',
    learning_decay=0.8,
    learning_offset=12.0,
    max_iter=50,
    batch_size=32,
)

In [None]:
start_time = time.time()
LDA.fit(dtm)
print("LDA has fit on document term matrix after ", round((time.time() - start_time), 2), " seconds.")
print(f'Shape of LDA components: {LDA.components_.shape}')

LDA has fit on document term matrix after  51.25  seconds.
Shape of LDA components: (5, 7325)
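
In scikit-learn, `batch_size`, `learning_decay` and `learning_offset` are documented as applying to online learning only. The sketch below is one way to check this: it fits two batch-mode models on a random toy count matrix (`toy_dtm`, made up for illustration) that differ only in those three parameters, then compares the learned topic-word matrices. It fixes `random_state`, which the notebook's own model does not, so that both fits start from the same initialization.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy document-term matrix: 30 documents, 12 terms
rng = np.random.default_rng(0)
toy_dtm = rng.integers(0, 5, size=(30, 12))

# Settings shared by both models
common = dict(n_components=3, learning_method='batch', max_iter=10, random_state=0)

lda_a = LatentDirichletAllocation(
    batch_size=32, learning_decay=0.8, learning_offset=12.0, **common
).fit(toy_dtm)
lda_b = LatentDirichletAllocation(
    batch_size=8, learning_decay=0.6, learning_offset=50.0, **common
).fit(toy_dtm)

# If the online-only parameters are ignored in batch mode, the topics should match
same_topics = np.allclose(lda_a.components_, lda_b.components_)
print(same_topics)
```

If the comparison prints `True`, the three parameters can be dropped from a batch-mode model without changing the result.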


# 5. Finding the Latent Topics from the trained LDA Model

In [None]:
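# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0 or later
feature_names = cv.get_feature_names_out()

def findTopic(i, topic):
    # Map vocabulary indices to their words
    topic_name = [feature_names[t] for t in topic]
    return f"Topic-{i}:\t{topic_name}\n\n"

for i, topic in enumerate(LDA.components_, start=1):
    topic = topic.argsort()[-20:]  # indices of the 20 highest-weighted words, in ascending order of weight
    print(findTopic(i, topic))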

Topic-1:	['uk', 'time', 'song', 'director', 'band', 'actor', 'number', 'british', 'world', 'won', 'new', 'star', 'music', 'award', 'said', 'ha', 'best', 'year', 'film', 'wa']


Topic-2:	['injury', 'just', 'half', 'ireland', 'wales', 'second', 'cup', 'play', 'team', 'match', 'club', 'win', 'time', 'year', 'england', 'player', 'game', 'said', 'ha', 'wa']


Topic-3:	['tax', 'law', 'public', 'year', 'plan', 'brown', 'new', 'told', 'blair', 'minister', 'say', 'election', 'people', 'party', 'labour', 'government', 'ha', 'wa', 'mr', 'said']


Topic-4:	['make', 'firm', 'net', 'digital', 'use', 'year', 'music', 'computer', 'user', 'mr', 'new', 'service', 'phone', 'technology', 'mobile', 'game', 'wa', 'ha', 'people', 'said']


Topic-5:	['business', '2004', 'rate', 'country', 'share', 'month', 'new', 'mr', 'growth', 'economy', 'price', 'sale', 'bank', 'firm', 'market', 'company', 'wa', 'year', 'ha', 'said']
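
Because `argsort()` sorts in ascending order, each list above ends with its highest-weighted word. A minimal sketch with made-up weights and vocabulary (`toy_weights` and `toy_vocab` are illustrative, not taken from the model) shows how reversing the order lists the strongest words first:

```python
import numpy as np

toy_vocab = np.array(['bank', 'film', 'game', 'market', 'party'])
toy_weights = np.array([0.1, 2.5, 0.3, 1.7, 0.9])  # stands in for one row of components_

ascending = toy_weights.argsort()[-3:]        # top 3 indices, weakest of the three first
descending = toy_weights.argsort()[::-1][:3]  # top 3 indices, strongest first

print(toy_vocab[ascending].tolist())   # ['party', 'market', 'film']
print(toy_vocab[descending].tolist())  # ['film', 'market', 'party']
```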




In [None]:
# My inference from the above list of topics...

topics_pred = ['topics unknown'] * nTopics

topics_pred[4] = "Business"
topics_pred[2] = "Politics"
topics_pred[3] = "Tech"
topics_pred[1] = "Sport"
topics_pred[0] = "Entertainment"

# 6. Making Predictions based on my Inference

To make predictions, all I have to do is transform the document-term matrix that the LDA was trained on.

In [None]:
# making the predictions...
topic_results = LDA.transform(dtm)
print(f'Shape of matrix: {topic_results.shape}')

Shape of matrix: (2225, 5)
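
The same `transform` call should also work for articles the model has not seen, provided they are encoded with the already-fitted vectorizer (and, in this notebook, lemmatized first). The sketch below illustrates the idea on a tiny invented corpus; `toy_docs`, `toy_cv`, `toy_lda` and `new_doc` are hypothetical names that stand in for the real data and models.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the lemmatized BBC articles
toy_docs = [
    "bank market share price growth economy",
    "match team player game club win",
    "film award actor music band star",
    "economy bank rate market company sale",
    "player injury team cup match england",
    "music film star award song director",
]

toy_cv = CountVectorizer()
toy_lda = LatentDirichletAllocation(n_components=3, random_state=0)
toy_lda.fit(toy_cv.fit_transform(toy_docs))

# An unseen document goes through transform() only, never fit_transform(),
# so that it is encoded with the vocabulary the model was trained on
new_doc = ["the bank raised its growth forecast for the economy"]
new_probs = toy_lda.transform(toy_cv.transform(new_doc))

print(new_probs.shape)  # (1, 3): one document, three topics
print(new_probs.sum())  # the topic probabilities of a document sum to 1
```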


The result is a matrix with one row per document in the dataset and 5 columns (one per topic). Each row holds the document's topic distribution, i.e. the probability of it belonging to each of the topics we obtained. We'll assign each document the topic with the highest probability.
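
As a small illustration of that row-wise reading, the sketch below uses a hand-written 3 x 5 matrix in place of `topic_results` (the numbers and the `toy_labels` list are invented for the example). It checks that every row sums to 1 and picks the most probable topic per row.

```python
import numpy as np

toy_labels = ['Entertainment', 'Sport', 'Politics', 'Tech', 'Business']

# Hypothetical document-topic probabilities: one row per document, one column per topic
toy_results = np.array([
    [0.02, 0.03, 0.05, 0.10, 0.80],
    [0.05, 0.85, 0.04, 0.03, 0.03],
    [0.10, 0.05, 0.60, 0.20, 0.05],
])

row_sums = toy_results.sum(axis=1)       # each row is a probability distribution
best_topic = toy_results.argmax(axis=1)  # column index of the largest value per row
best_label = [toy_labels[i] for i in best_topic]

print(np.allclose(row_sums, 1.0))  # True
print(best_label)                  # ['Business', 'Sport', 'Politics']
```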

In [None]:
ind = 0
print("News Article:")
print(df.iloc[ind,0])
print('\n\n')
print(f'Probability of news article belonging to each of the {nTopics} topics: {topic_results[ind]}')

News Article:
Wal-Mart fight back at accuser Two big US name have launched advertising campaign to `` set the record straight '' about their product and corporate behaviour . The world 's biggest retailer Wal-Mart took out more than 100 full page advert in national newspaper . The group is trying to see off criticism over it pay deal , benefit package and promotion strategy . Meanwhile , drug group Eli Lilly is planning a campaign against `` false '' claim about it product Prozac . Wal-Mart kicked off the battle with advert in newspaper like the Wall Street Journal , using an open letter from company president Lee Scott saying it wa time for the public to hear the `` unfiltered truth '' . `` There are lot of 'urban legend ' going around these day about Wal-Mart , but fact are fact . Wal-Mart is good for consumer , good for community and good for the US economy , '' Mr Scott said in a separate statement . Its advert - and a new website - outlined the group 's plan to create more than 10

The value at the last index is the highest, so the LDA model assigns this article to **Business**.

In order to get the index of the maximum value in each row of the NumPy array, we can use the `argmax` method with `axis=1`. Once we have that index, we can look up which topic the document belongs to.

In [None]:
topic_index = topic_results.argmax(axis=1)
pred = np.array([topics_pred[ind] for ind in topic_index])
df["Predicted Topics"] = pred
df.head()

Unnamed: 0,News,Predicted Topics
0,Wal-Mart fight back at accuser Two big US name...,Business
1,Hodgson relish European clash Former Blackburn...,Sport
2,UK set to cut back on embassy Nine overseas em...,Politics
3,Holmes is hit by hamstring injury Kelly Holmes...,Sport
4,O'Gara revel in Ireland victory Ireland fly-ha...,Sport


# 7. Inspecting our LDA Model's Prediction

In [None]:
def get_news_and_topic(ind):
  print(f'News:\n{df.iloc[ind, 0]}')
  print(f'\n\nTopic Predicted: {df.iloc[ind, 1]}')

In [None]:
get_news_and_topic(0)

News:
Wal-Mart fight back at accuser Two big US name have launched advertising campaign to `` set the record straight '' about their product and corporate behaviour . The world 's biggest retailer Wal-Mart took out more than 100 full page advert in national newspaper . The group is trying to see off criticism over it pay deal , benefit package and promotion strategy . Meanwhile , drug group Eli Lilly is planning a campaign against `` false '' claim about it product Prozac . Wal-Mart kicked off the battle with advert in newspaper like the Wall Street Journal , using an open letter from company president Lee Scott saying it wa time for the public to hear the `` unfiltered truth '' . `` There are lot of 'urban legend ' going around these day about Wal-Mart , but fact are fact . Wal-Mart is good for consumer , good for community and good for the US economy , '' Mr Scott said in a separate statement . Its advert - and a new website - outlined the group 's plan to create more than 10,000 US 

In [None]:
get_news_and_topic(1)

News:
Hodgson relish European clash Former Blackburn bos Roy Hodgson say the Premiership should follow the rest of Europe and have a winter break - but insists that a gruelling domestic schedule will not damage the English elite 's bid for Champions League glory . Hodgson - now in charge at Viking Stavanger - wa at Liverpool 's clash with Bayer Leverkusen at Anfield on Tuesday a a member of Uefa 's technical committee . Hodgson is a fierce advocate of the winter break employed throughout Europe , although not in England - where the Champions League contender have ploughed through a heavy fixture list . But Hodgson told BBC Sport that while he belief the Premiership should embrace the idea , he doe not expect it to cost the English representative in the last 16 of the Champions League . `` I just feel it is very difficult to say with certainty that team who have had the break will have a definite edge . `` I am a fervent supporter of the break . It give player the chance to recharge the

In [None]:
get_news_and_topic(2)

News:
UK set to cut back on embassy Nine overseas embassy and high commission will close in an effort to save money , UK Foreign Secretary Jack Straw ha announced . The Bahamas , East Timor , Madagascar and Swaziland are among the area affected by the biggest shake-up for the diplomatic service for year . Other diplomatic post are being turned over to local staff . Mr Straw said the move would save £6m a year to free up cash for priority such a fighting terrorism . Honorary consul will be appointed in some of the area affected by the embassy closure . Nine consulate or consulate general will also be closed , mostly in Europe and America . They include Dallas in the US , Bordeaux in France and Oporto in Portugal , with local staff replacing UK representation in another 11 . The change are due to be put in place before the end of 2006 , with most saving made from cutting staff and running cost . Some of the money will have to be used to fund redundancy payment . In a written statement ,

In [None]:
get_news_and_topic(98)

News:
Cardinal criticises Iraq war cost Billions of pound spent on conflict in Iraq and in the Middle East should have been used to reduce poverty , Cardinal Cormac Murphy-O'Connor ha said . The head of the Catholic Church in England and Wales made the comment on BBC Radio 4 and will re-iterate his stance in his Christmas Midnight Mass . The cardinal used a Christmas message to denounce the war in Iraq a a `` terrible '' waste of money . He and the Archbishop of Canterbury have both spoken out about the war . Speaking on BBC Radio 4 's Thought for the Day slot , he criticised the fact that `` billion '' have been spent on war , instead of being used to bring people `` out of dire poverty and malnourishment and disease '' . The cardinal said 2005 should be the year for campaigning to `` make history poverty '' . He added : `` If the government of the rich country were a ready to devote to peace the resource they are willing to commit to war , that would be to see with new eye and speak 

In [None]:
get_news_and_topic(42)

News:
Adventure tale top award Young book fan have voted Fergus Crane , a story about a boy who is taken on an adventure by a flying horse , the winner of two Smarties Book Prizes . Paul Stewart and Chris Riddell 's book came top in the category for six- to eight-year-olds and won the award chosen by after-school club member . Sally Grindley 's Spilled Water , about a Chinese girl sold a a servant , wa top in vote of reader aged nine to 11 . Biscuit Bear by Mini Grey took the top award in the under-five category . Winners were voted for by about 6,000 child from a shortlist picked by an adult panel . The prize , which is celebrating it 20th year , is billed a `` the UK 's biggest child 's book award '' . Fergus Crane includes text by Stewart and illustration by Riddell , who also created The Edge Chronicles together . As well a the six to eight prize , it won the 4Children Special Award voted for by after-school club member . Julia Eccleshare , chair of the adult judging panel , said c