# <b><u> Project Title : Extraction/identification of major topics & themes discussed in news articles. </u></b>

## <b> Problem Description </b>

### In this project, your task is to identify the major themes/topics across a collection of BBC news articles. You can use topic-modeling techniques such as Latent Dirichlet Allocation (LDA) or Latent Semantic Analysis (LSA).

## <b> Data Description </b>

### The dataset contains a set of news articles for each major segment consisting of business, entertainment, politics, sports and technology. You need to create an aggregate dataset of all the news articles and perform topic modeling on this dataset. Verify whether these topics correspond to the different tags available.

In [2]:
# Importing the libraries 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import os
%matplotlib inline

In [3]:
# Configuration for matplotlib graphs

matplotlib.rcParams['font.size'] = 12
matplotlib.rcParams['figure.figsize'] = (13, 7)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
sns.set_style('darkgrid');

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# Loading the dataset in pandas 
path = '/content/drive/MyDrive/DATA_FILES/bbc'

In [6]:
News = []
Type = []

# One sub-folder per category; each file holds a single article
folders = ["business", "entertainment", "politics", "sport", "tech"]
for folder in folders:
    for text_file in os.listdir(os.path.join(path, folder)):
        file_path = os.path.join(path, folder, text_file)
        # Files are read in binary mode, so each article is stored as a bytes object
        with open(file_path, 'rb') as f:
            News.append(f.read())
        Type.append(folder)

df = pd.DataFrame({'news': News, 'type': Type})

In [7]:
df.head()

Unnamed: 0,news,type
0,b'Sluggish economy hits German jobs\n\nThe num...,business
1,b'Dollar gains on Greenspan speech\n\nThe doll...,business
2,b'China keeps tight rein on credit\n\nChina\'s...,business
3,b'Call centre users \'lose patience\'\n\nCusto...,business
4,b'High fuel prices hit BA\'s profits\n\nBritis...,business


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   news    2225 non-null   object
 1   type    2225 non-null   object
dtypes: object(2)
memory usage: 34.9+ KB


In [9]:
#removing duplicates
df=df.drop_duplicates()

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2127 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   news    2127 non-null   object
 1   type    2127 non-null   object
dtypes: object(2)
memory usage: 49.9+ KB


In [11]:
df.type.value_counts(normalize=True)*100

sport            23.742360
business         23.648331
politics         18.946874
entertainment    17.348378
tech             16.314057
Name: type, dtype: float64

### **Text Preprocessing**

In [12]:
# Converting the column into a string
df['news'] = df['news'].astype('str') 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [13]:
df.head(1)

Unnamed: 0,news,type
0,b'Sluggish economy hits German jobs\n\nThe num...,business


In [14]:
# Inspect one raw article (the name avoids shadowing the built-in chr)
sample_text = df.news[1452]
sample_text

'b\'SFA awaits report over Mikoliunas\\n\\nThe Scottish Football Association is awaiting referee Hugh Dallas\\\'s report before acting against Hearts winger Saulius Mikoliunas.\\n\\nMikoliunas, 20, barged linesman Andy Davis, who had advised Dallas to award Rangers an injury-time penalty in Hearts\\\'s 2-1 defeat at Tynecastle. "He was sent off for violent conduct in the 90th minute but we don\\\'t know if he did something else after the whistle. "We don\\\'t know how many red cards he was shown," said an SFA statement. Hearts could also face action after three fans were arrested for throwing coins on the pitch. Rangers\\\' striker Dad Prso was also sent off during the same incident when he received a second yellow card for wrestling the ball away from Craig Gordon and leaving the Hearts keeper on the ground. The SFA said: "Once the referee\\\'s report comes in then we\\\'ll immediately look at things.\\n\\n"We don\\\'t normally get the reports until a couple of days after the game but

In [15]:
# Removing HTML Tag
import re
def remove_html_tags(text):
	html_pattern = r'<.*?>'
	without_html = re.sub(pattern=html_pattern, repl=' ', string=text)
	return without_html

In [16]:
df['news']=df['news'].apply(remove_html_tags)

In [17]:
#removing URLs
def remove_urls(text):
  url_pattern = r'https?://\S+|www\.\S+' 
  without_urls = re.sub(pattern = url_pattern, repl = ' ', string = text )
  return without_urls 

In [18]:
df['news']=df['news'].apply(remove_urls)

In [19]:
# Removing numbers
def remove_numbers(text):
	number_pattern = r'\d+'
	without_number = re.sub(pattern=number_pattern, repl=" ", string=text)
	return without_number

In [20]:
df['news']=df['news'].apply(remove_numbers)

In [21]:
# Removing single characters
def remove_single_char(text):
	single_char_pattern = r'\s+[a-zA-Z]\s+'
	without_sc = re.sub(pattern=single_char_pattern, repl=" ", string=text)
	return without_sc

In [22]:
df['news']=df['news'].apply(remove_single_char)
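
The four regex helpers above run one after another on every article. As a minimal sketch of what they do together, the block below repeats the same patterns on a made-up string (the sample text is an assumption, not taken from the dataset).

```python
import re

def remove_html_tags(text):
    return re.sub(r'<.*?>', ' ', text)

def remove_urls(text):
    return re.sub(r'https?://\S+|www\.\S+', ' ', text)

def remove_numbers(text):
    return re.sub(r'\d+', ' ', text)

def remove_single_char(text):
    return re.sub(r'\s+[a-zA-Z]\s+', ' ', text)

# Hypothetical sample, not taken from the dataset
sample = "<p>Hearts lost 2-1</p> www.bbc.co.uk a b c end"

cleaned = sample
for step in (remove_html_tags, remove_urls, remove_numbers, remove_single_char):
    cleaned = step(cleaned)

# Compare tokens rather than raw strings, since each step leaves extra spaces
tokens = cleaned.split()
print(tokens)
```

On this sample the tags, the URL and the digits are gone, but the single letter `b` survives: the pattern needs whitespace on both sides of a letter, and the space after `a` has already been consumed by the previous match. Isolated letters can therefore still appear after this step.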

In [23]:
import nltk

from nltk.corpus import stopwords
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [24]:
stop_list=list(stopwords.words('english'))
stop_list.append('would')
stop_list.append('could')
stop_list.append('say')
# stop_list

In [25]:
import string
# Removing punctuation and stop words
def Change_text(msg):
  punc=[char for char in msg if str(char) not in string.punctuation]
  no_punc=''.join(punc)
  main_text=[word for word in no_punc.split() if word.lower() not in stop_list]
  return ' '.join(main_text)

In [26]:
# Doing Lemmatization
def lemmatization(texts):
  # Keep only content words and replace each one by its spaCy lemma
  allowed_postags=['NOUN','ADJ','VERB','ADV']
  doc=nlp(texts)
  return ' '.join(token.lemma_ for token in doc if token.pos_ in allowed_postags)

In [27]:
import spacy
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [28]:

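df['news']=df['news'].apply(Change_text)
# spaCy 3 removed the 'en' shortcut, so the small English model is loaded by its full name
# (install it first with: python -m spacy download en_core_web_sm)
nlp=spacy.load('en_core_web_sm',disable=['parser', 'ner'])
df['news']=df['news'].apply(lemmatization)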

In [29]:
df

Unnamed: 0,news,type
0,bsluggish economy hit german number people wor...,business
1,gain dollar hit high level almost month head s...,business
2,keep tight rein creditnnchina effort stop econ...,business
3,user lose patiencenncustomer try call centre g...,business
4,bhigh fuel price hit ba profitsnnbritish blame...,business
...,...,...
2217,phone act date toolsnntechnologie email net ch...,tech
2219,bnet fingerprint combat attacksnneighty large ...,tech
2220,virus spot writer trading interest distribute ...,tech
2223,woman sue ink woman sue say printer ink cartri...,tech


In [30]:
df.news[0]

'bsluggish economy hit german number people work europe large economy rise tenth straight month growth remain stubbornly slownngerman unemployment rise people workforce seasonally adjust rise show small rise expect government measure encourage job creation begin take effect official say stagnant growth still stifle job market clear sign revival domestic demand say head statement growth third quarter still insufficient deliver positive momentum labour market high oil price soar damage competitiveness exporter also negative effect say brunt unemployment still feel eastern part rate unemployment stick year government put job creation top agenda controversial package measure shake incentive get back work pay cut cherished benefit spark anger german worker strike number industry notably countrys iconic carmaker demonstrate displeasure fear job loss outsource take hold new initiative socalle job top unemployment benefit scheme formal launch hiring position already take place affect unemploym
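
The `bsluggish` prefix and the `nn` fused into words such as `slownngerman` in the output above are side effects of the loading step: the files were read in binary mode, `astype('str')` then stored the `repr` of each bytes object, and the punctuation filter stripped the quote and backslash characters. The sketch below reproduces this on a short made-up byte string; the UTF-8 encoding passed to `decode` is an assumption about the files, not something the notebook states.

```python
import string

# Hypothetical article start, stored as bytes the way open(..., 'rb') returns it
raw = b"Sluggish economy hits German jobs\n\nThe number of people out of work rose."

# str() on bytes returns the repr: the b'' wrapper stays and each newline
# becomes the two characters backslash + n
as_repr = str(raw)

# Stripping punctuation (as Change_text does) removes ' and \ but keeps b and n
stripped = "".join(ch for ch in as_repr if ch not in string.punctuation)

# Decoding gives real text instead; the encoding is an assumption, and
# errors="replace" avoids a failure on unexpected bytes
as_text = raw.decode("utf-8", errors="replace")

print(stripped)
print(as_text.splitlines()[0])
```

Decoding each file before building the DataFrame would avoid both artifacts, at the cost of changing every downstream output shown in this notebook.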

## Data Preprocessing

In [31]:
'''
Loading Gensim and nltk libraries
'''
# !pip install gensim
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(400)

In [32]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [33]:
 
print(WordNetLemmatizer().lemmatize('brunt', pos = 'v')) # lemmatize as a verb; 'brunt' is not a verb form, so it is returned unchanged


brunt


In [34]:
stemmer = SnowballStemmer("english")

original_words = ['competitiveness']

singles = [stemmer.stem(plural) for plural in original_words]

pd.DataFrame(data={'original word':original_words, 'stemmed':singles })


Unnamed: 0,original word,stemmed
0,competitiveness,competit


In [35]:
'''
Functions to perform the preprocessing steps on the entire dataset
'''
def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then reduce it to its Snowball stem
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize, filter and normalise a document
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        # Drop gensim stop words and tokens of three characters or fewer
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result


In [37]:
preprocess(df.news[1])

['gain',
 'dollar',
 'high',
 'level',
 'month',
 'head',
 'trade',
 'deficit',
 'highlight',
 'govern',
 'willing',
 'curb',
 'spend',
 'rise',
 'household',
 'save',
 'factor',
 'help',
 'reduc',
 'late',
 'trade',
 'dollar',
 'reach',
 'concern',
 'deficit',
 'greenback',
 'recent',
 'month',
 'chairman',
 'speech',
 'ahead',
 'meet',
 'financ',
 'minist',
 'send',
 'dollar',
 'higher',
 'earlier',
 'tumbl',
 'datum',
 'think',
 'sanguin',
 'view',
 'current',
 'account',
 'deficit',
 'time',
 'head',
 'currenc',
 'strategi',
 'longerterm',
 'view',
 'condit',
 'current',
 'account',
 'deficit',
 'improv',
 'year',
 'nextnnworri',
 'deficit',
 'concern',
 'remain',
 'currenc',
 'remain',
 'dollar',
 'currenc',
 'sharp',
 'fall',
 'recent',
 'month',
 'chines',
 'export',
 'price',
 'high',
 'competit',
 'shift',
 'beij',
 'polici',
 'fall',
 'deaf',
 'recent',
 'comment',
 'major',
 'chines',
 'newspap',
 'time',
 'ripe',
 'meet',
 'think',
 'unlik',
 'produc',
 'meaning',
 'movemen
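
Short words such as `hit`, `say` and `set` from the cleaned text do not appear in the list above, because `preprocess` keeps only tokens longer than three characters that are not stop words. The block below is a rough, dependency-free sketch of that filter. The regex tokenizer only approximates `simple_preprocess`, the one-word stop list is a stand-in for gensim's much larger `STOPWORDS`, and stemming is left out.

```python
import re

# Hypothetical, tiny stop-word set for illustration only
TOY_STOPWORDS = {"almost"}

def toy_preprocess(text, stopwords=TOY_STOPWORDS, min_len=4):
    # Lowercase and keep alphabetic runs (an approximation of simple_preprocess)
    tokens = re.findall(r"[a-z]+", text.lower())
    # Same rule as preprocess(): not a stop word and longer than three characters
    return [t for t in tokens if t not in stopwords and len(t) >= min_len]

# First words of document 1 after the text-cleaning stage
sample = "gain dollar hit high level almost month head say trade deficit set"

kept = toy_preprocess(sample)
dropped = [t for t in sample.split() if t not in kept]

print(kept)
print(dropped)
```

The kept tokens match the start of the real output, while `hit`, `say` and `set` are lost to the length rule and `almost` to the stop list.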

In [38]:
df.iloc[0]

news    bsluggish economy hit german number people wor...
type                                             business
Name: 0, dtype: object

In [39]:
document_num = 0
df.iloc[0].values[0][0]


'b'

In [40]:
'''
Preview a document after preprocessing
'''
document_num = 1
doc_sample = df.news[document_num]

print("Original document: ")
# Split on single spaces to show the tokens before gensim preprocessing
words = doc_sample.split(' ')
print(words)
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))

Original document: 
['gain', 'dollar', 'hit', 'high', 'level', 'almost', 'month', 'head', 'say', 'trade', 'deficit', 'set', 'highlight', 'government', 'willingness', 'curb', 'spend', 'rise', 'household', 'saving', 'factor', 'may', 'help', 'reduce', 'late', 'trading', 'dollar', 'reach', 'concern', 'deficit', 'hit', 'greenback', 'recent', 'month', 'chairman', 'speech', 'ahead', 'meet', 'finance', 'minister', 'send', 'dollar', 'higher', 'earlier', 'tumble', 'back', 'worsethanexpecte', 'job', 'datum', 'think', 'take', 'much', 'sanguine', 'view', 'current', 'account', 'deficit', 's', 'take', 'time', 'say', 'head', 'currency', 'strategy', 's', 'take', 'longerterm', 'view', 'lay', 'set', 'condition', 'current', 'account', 'deficit', 'improve', 'year', 'nextnnworrie', 'deficit', 'concern', 'however', 'remain', 'currency', 'remain', 'peg', 'dollar', 'currencys', 'sharp', 'fall', 'recent', 'month', 'therefore', 'make', 'chinese', 'export', 'price', 'highly', 'competitive', 'call', 'shift', 'beij

In [41]:
df_process = df.copy()

In [42]:
# Preprocess every news article; the resulting token lists are stored in df_process.news

df_process.news = df.news.map(preprocess)

In [43]:
index = 6

print(len(df_process.news[index]))

print(len(df.news[index]))

92
780


> **Creating the dictionary**

In [44]:
dictionary = gensim.corpora.Dictionary(df_process.news) 

In [45]:
print(dictionary)

Dictionary(14648 unique tokens: ['adjust', 'affect', 'agenda', 'anger', 'begin']...)


In [46]:
'''
Checking dictionary created
'''
# Print the first 11 (token id, token) pairs
for count, (token_id, token) in enumerate(dictionary.iteritems()):
    print(token_id, token)
    if count >= 10:
        break

0 adjust
1 affect
2 agenda
3 anger
4 begin
5 benefit
6 brunt
7 bsluggish
8 carmak
9 cherish
10 clear


In [47]:
'''
OPTIONAL STEP
Remove very rare and very common words:

- tokens appearing in fewer than 5 documents (no_below=5)
- tokens appearing in more than 50% of all documents (no_above=0.5)
- after that, keep only the 100000 most frequent remaining tokens (keep_n=100000)
'''
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)
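
Both thresholds in `filter_extremes` are document frequencies, not total word counts: a token counts once per document no matter how often it is repeated there. A minimal sketch of that rule on a toy corpus, assuming the inclusive bounds described in the gensim documentation; the function name and the data are hypothetical.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    # Document frequency: number of documents containing the token at least once
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # Convert the fraction into an absolute number of documents
    max_docs = int(no_above * len(docs))
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Hypothetical toy corpus of already-preprocessed documents
docs = [
    ["market", "rise", "peopl"],
    ["market", "fall", "peopl"],
    ["film", "award", "peopl"],
    ["film", "market", "peopl", "market"],
]

kept = filter_by_doc_freq(docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here `peopl` is dropped for occurring in all four documents, the tokens seen in a single document are dropped as too rare, and `market` is counted three times even though it occurs four times in total.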


In [48]:
bow_corpus = [dictionary.doc2bow(doc) for doc in df_process.news]
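
Each entry of `bow_corpus` is a sorted list of `(token_id, count)` pairs. The sketch below shows the idea in plain Python with hypothetical helper names; it illustrates the representation and is not gensim's implementation.

```python
from collections import Counter

def build_token2id(docs):
    # Assign ids in order of first appearance, sorting tokens within a document
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    # Count known tokens only; tokens missing from the mapping are ignored
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Hypothetical preprocessed documents
docs = [["economi", "rise", "economi", "job"], ["film", "award", "rise"]]
token2id = build_token2id(docs)

first_bow = to_bow(docs[0], token2id)
new_bow = to_bow(["film", "film", "unknown"], token2id)
print(first_bow)
print(new_bow)
```

Word order is discarded and only counts remain, which is the input that LDA works with.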
 

In [49]:
bow_corpus[0]

[(0, 2),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 2),
 (6, 1),
 (7, 1),
 (8, 1),
 (9, 1),
 (10, 1),
 (11, 1),
 (12, 3),
 (13, 1),
 (14, 1),
 (15, 1),
 (16, 1),
 (17, 1),
 (18, 1),
 (19, 1),
 (20, 2),
 (21, 1),
 (22, 2),
 (23, 1),
 (24, 1),
 (25, 1),
 (26, 1),
 (27, 1),
 (28, 1),
 (29, 1),
 (30, 2),
 (31, 2),
 (32, 3),
 (33, 1),
 (34, 1),
 (35, 1),
 (36, 1),
 (37, 1),
 (38, 1),
 (39, 1),
 (40, 1),
 (41, 1),
 (42, 2),
 (43, 1),
 (44, 1),
 (45, 1),
 (46, 3),
 (47, 3),
 (48, 1),
 (49, 1),
 (50, 1),
 (51, 1),
 (52, 2),
 (53, 1),
 (54, 1),
 (55, 1),
 (56, 2),
 (57, 1),
 (58, 2),
 (59, 1),
 (60, 1),
 (61, 1),
 (62, 1),
 (63, 1),
 (64, 1),
 (65, 1),
 (66, 4),
 (67, 1),
 (68, 2),
 (69, 1),
 (70, 1),
 (71, 1),
 (72, 1),
 (73, 1),
 (74, 1),
 (75, 1),
 (76, 1),
 (77, 1),
 (78, 1),
 (79, 1),
 (80, 1),
 (81, 1),
 (82, 1),
 (83, 5),
 (84, 2),
 (85, 1),
 (86, 1)]

In [50]:
'''
Preview BOW for our sample preprocessed document
'''
# The sample is the first document in the corpus (index 0)
bow_doc_0 = bow_corpus[0]

for token_id, count in bow_doc_0:
    print("Word {} (\"{}\") appears {} time.".format(token_id, dictionary[token_id], count))

Word 0 ("adjust") appears 2 time.
Word 1 ("affect") appears 1 time.
Word 2 ("agenda") appears 1 time.
Word 3 ("anger") appears 1 time.
Word 4 ("begin") appears 1 time.
Word 5 ("benefit") appears 2 time.
Word 6 ("carmak") appears 1 time.
Word 7 ("clear") appears 1 time.
Word 8 ("come") appears 1 time.
Word 9 ("competit") appears 1 time.
Word 10 ("controversi") appears 1 time.
Word 11 ("countri") appears 1 time.
Word 12 ("creation") appears 3 time.
Word 13 ("damag") appears 1 time.
Word 14 ("deliv") appears 1 time.
Word 15 ("demand") appears 1 time.
Word 16 ("demonstr") appears 1 time.
Word 17 ("deterior") appears 1 time.
Word 18 ("domest") appears 1 time.
Word 19 ("eastern") appears 1 time.
Word 20 ("economi") appears 2 time.
Word 21 ("economist") appears 1 time.
Word 22 ("effect") appears 2 time.
Word 23 ("encourag") appears 1 time.
Word 24 ("europ") appears 1 time.
Word 25 ("expect") appears 1 time.
Word 26 ("export") appears 1 time.
Word 27 ("fear") appears 1 time.
Word 28 ("feel") a

> **TF-IDF on our document set**

In [51]:
'''
Create a tf-idf model object using models.TfidfModel on 'bow_corpus'
and save it to 'tfidf'
'''
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)
print(tfidf)


TfidfModel(num_docs=2127, num_nnz=171039)
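
As a rough sketch of what the model computes, the block below assumes gensim's default weighting as I understand it: raw term frequency times `log2(N / df)`, with zero weights dropped and each document vector scaled to unit length. The function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(bow_corpus):
    n_docs = len(bow_corpus)
    # Document frequency of every token id
    doc_freq = Counter(tid for doc in bow_corpus for tid, _ in doc)
    vectors = []
    for doc in bow_corpus:
        weights = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in doc]
        # A token present in every document gets weight 0 and is dropped
        weights = [(tid, w) for tid, w in weights if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        vectors.append([(tid, w / norm) for tid, w in weights] if norm else [])
    return vectors

# Hypothetical bag-of-words corpus: token 0 occurs in every document
toy = [
    [(0, 2), (1, 1), (2, 1)],
    [(0, 1), (2, 3)],
    [(0, 1), (1, 1)],
    [(0, 1), (3, 2)],
]

vectors = tfidf_vectors(toy)
print(vectors[0])
```

Token 0 vanishes from every vector despite being the most frequent token in the first document, which is the down-weighting of common terms that TF-IDF is designed to produce.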


In [53]:
'''
Apply transformation to the entire corpus and call it 'corpus_tfidf'
'''
# TODO
corpus_tfidf = tfidf[bow_corpus]
print(corpus_tfidf[1])

[(9, 0.049500713956409655), (26, 0.06600774063521814), (31, 0.05885386959244517), (33, 0.09320524060749633), (34, 0.06669011469471978), (49, 0.1100658667400016), (60, 0.04502785869702502), (62, 0.05383485071377029), (63, 0.10105774977577525), (64, 0.11859121899875888), (66, 0.037533935512503336), (87, 0.15788333732625925), (88, 0.046220804120602074), (89, 0.04064428974804473), (90, 0.14581011990258086), (91, 0.06384513406057522), (92, 0.07099140997933505), (93, 0.09074202482366624), (94, 0.0521137831632876), (95, 0.12769026812115045), (96, 0.05275836421908786), (97, 0.24203316172561776), (98, 0.03939542438483376), (99, 0.10105774977577525), (100, 0.08780388002144586), (101, 0.06872062324404181), (102, 0.09931003047989514), (103, 0.24203316172561776), (104, 0.10632122733103197), (105, 0.05796505244045353), (106, 0.04035996237737882), (107, 0.5349188668844769), (108, 0.11767684321121033), (109, 0.3094664695748456), (110, 0.06652173658068362), (111, 0.04972467990142878), (112, 0.071522115

In [54]:
'''
Preview TF-IDF scores for our first document --> (token_id, tfidf score)
'''
from pprint import pprint

for doc in corpus_tfidf:
    
    pprint(doc)
    
    break

[(0, 0.19890680310598233),
 (1, 0.06637704068618801),
 (2, 0.09509910930183535),
 (3, 0.104820697457052),
 (4, 0.04994783476077181),
 (5, 0.1301496542114395),
 (6, 0.104820697457052),
 (7, 0.0480510686934501),
 (8, 0.02349690947050931),
 (9, 0.05718077562866587),
 (10, 0.06816670590739149),
 (11, 0.038982830553052475),
 (12, 0.2937275491534637),
 (13, 0.06489463400961329),
 (14, 0.06676296121957359),
 (15, 0.05372198731863849),
 (16, 0.08778563027972604),
 (17, 0.11471803367416988),
 (18, 0.07624887613421685),
 (19, 0.11631554263909769),
 (20, 0.1125947661723699),
 (21, 0.07972069466180391),
 (22, 0.1130927214216692),
 (23, 0.07247114949345469),
 (24, 0.11181991914960802),
 (25, 0.033020870855193354),
 (26, 0.07624887613421685),
 (27, 0.062835082011607),
 (28, 0.049760346319448395),
 (29, 0.08980035226180696),
 (30, 0.154903976186011),
 (31, 0.0679850782557977),
 (32, 0.16116596195591548),
 (33, 0.05383304122932718),
 (34, 0.03851856044305582),
 (35, 0.11049643345373233),
 (36, 0.04006

> **Running LDA using Bag of Words**

In [56]:
# LDA mono-core -- fallback code in case LdaMulticore 
# throws an error on your machine
# lda_model = gensim.models.LdaModel(bow_corpus, 
#                                    num_topics = 10, 
#                                    id2word = dictionary,                                    
#                                    passes = 50)

# LDA multicore 
'''
Train your lda model using gensim.models.LdaMulticore and 
save it to 'lda_model'
'''
lda_model = gensim.models.LdaMulticore(bow_corpus, 
                                       num_topics=10, 
                                       id2word = dictionary, 
                                       passes = 2, 
                                       workers=2)

In [57]:
'''
For each topic, we will explore the words occurring in that topic
 and their relative weights
'''
for idx, topic in lda_model.print_topics(-1):
    
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

Topic: 0 
Words: 0.012*"game" + 0.011*"mobil" + 0.010*"phone" + 0.009*"time" + 0.007*"technolog" + 0.006*"second" + 0.006*"play" + 0.005*"come" + 0.005*"work" + 0.005*"record"


Topic: 1 
Words: 0.010*"peopl" + 0.009*"govern" + 0.009*"game" + 0.006*"month" + 0.006*"tell" + 0.006*"increas" + 0.005*"spend" + 0.005*"play" + 0.005*"rise" + 0.005*"come"


Topic: 2 
Words: 0.009*"music" + 0.008*"peopl" + 0.006*"report" + 0.006*"tell" + 0.006*"govern" + 0.006*"market" + 0.006*"film" + 0.005*"come" + 0.005*"think" + 0.004*"compani"


Topic: 3 
Words: 0.013*"peopl" + 0.010*"servic" + 0.008*"firm" + 0.008*"compani" + 0.006*"user" + 0.006*"phone" + 0.005*"work" + 0.005*"site" + 0.005*"mobil" + 0.005*"network"


Topic: 4 
Words: 0.010*"price" + 0.008*"club" + 0.008*"week" + 0.007*"rise" + 0.007*"number" + 0.006*"market" + 0.006*"sale" + 0.006*"fall" + 0.006*"expect" + 0.006*"govern"


Topic: 5 
Words: 0.010*"elect" + 0.010*"peopl" + 0.009*"govern" + 0.007*"parti" + 0.006*"plan" + 0.006*"work" + 0.

**Classification of the topics**

Using the words in each topic and their corresponding weights, what categories were you able to infer?

0:

1:

2:

3:

4:

5:

6:

7:

8:

9:

> **Running LDA using TF-IDF**

In [63]:
'''
Define lda model using corpus_tfidf, 
again using gensim.models.LdaMulticore()
'''
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, 
                                             num_topics=10, 
                                             id2word = dictionary, 
                                             passes = 2, 
                                             workers=4)

In [65]:
'''
For each topic, we will explore the words occurring 
in that topic and their relative weights
'''
for idx, topic in lda_model_tfidf.print_topics(-1):
    
    print("Topic: {} Word: {}".format(idx, topic))
    print("\n")

Topic: 0 Word: 0.003*"mobil" + 0.003*"phone" + 0.003*"game" + 0.003*"broadband" + 0.003*"technolog" + 0.002*"number" + 0.002*"servic" + 0.002*"match" + 0.002*"play" + 0.002*"peopl"


Topic: 1 Word: 0.005*"film" + 0.003*"good" + 0.003*"game" + 0.002*"award" + 0.002*"peopl" + 0.002*"site" + 0.002*"firm" + 0.002*"actor" + 0.002*"elect" + 0.002*"star"


Topic: 2 Word: 0.004*"film" + 0.003*"game" + 0.003*"compani" + 0.002*"music" + 0.002*"sale" + 0.002*"court" + 0.002*"play" + 0.002*"firm" + 0.002*"peopl" + 0.002*"market"


Topic: 3 Word: 0.002*"elect" + 0.002*"bank" + 0.002*"music" + 0.002*"rise" + 0.002*"club" + 0.002*"good" + 0.002*"parti" + 0.002*"market" + 0.002*"share" + 0.002*"rate"


Topic: 4 Word: 0.003*"card" + 0.002*"game" + 0.002*"season" + 0.002*"athlet" + 0.002*"firm" + 0.002*"report" + 0.002*"mobil" + 0.002*"world" + 0.002*"rate" + 0.002*"offici"


Topic: 5 Word: 0.003*"game" + 0.003*"compani" + 0.002*"music" + 0.002*"sale" + 0.002*"firm" + 0.002*"film" + 0.002*"servic" + 0.0

**Classification of the topics**

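As we can see, when using tf-idf, terms that occur in many documents are down-weighted, so the topic-word weights are much flatter (roughly 0.002 to 0.005 above, against up to 0.013 with bag of words) and a single topic mixes terms from several categories: Topic 3, for example, contains "elect", "bank", "music" and "club". That makes it harder to figure out the categories. Note also that LDA is formulated as a generative model of word counts, so tf-idf weights are not the input it assumes. This goes to show that the representation and model we apply depend on the type of corpus we are dealing with.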

Using the words in each topic and their corresponding weights, what categories could you find?

0:

1:

2:

3:

4:

5:

6:

7:

8:

9:

> **Performance evaluation by classifying sample document using LDA Bag of Words model**

In [68]:
'''
Check which topic our test document belongs to using the LDA Bag of Words model.
'''
document_num = 0
# Our test document is document number 0, as set by document_num above

for index, score in sorted(lda_model[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):

    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))
 


Score: 0.5009258985519409	 
Topic: 0.010*"peopl" + 0.009*"govern" + 0.009*"game" + 0.006*"month" + 0.006*"tell" + 0.006*"increas" + 0.005*"spend" + 0.005*"play" + 0.005*"rise" + 0.005*"come"

Score: 0.4189510643482208	 
Topic: 0.011*"compani" + 0.008*"firm" + 0.005*"come" + 0.005*"share" + 0.005*"product" + 0.005*"countri" + 0.005*"market" + 0.005*"govern" + 0.005*"profit" + 0.005*"rise"

Score: 0.07403475046157837	 
Topic: 0.010*"price" + 0.008*"club" + 0.008*"week" + 0.007*"rise" + 0.007*"number" + 0.006*"market" + 0.006*"sale" + 0.006*"fall" + 0.006*"expect" + 0.006*"govern"
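
The data description asks whether the topics correspond to the `type` tags. One hedged way to check this is to take the highest-scoring topic of every document and count it per tag. The sketch below uses hypothetical helper names and made-up scores; it assumes each document is given as a list of `(topic_id, score)` pairs, the form that `lda_model[bow]` produced above.

```python
from collections import Counter, defaultdict

def dominant_topic(doc_topics):
    """Return the topic id with the highest score from (topic_id, score) pairs."""
    return max(doc_topics, key=lambda pair: pair[1])[0]

def topic_by_tag(all_doc_topics, tags):
    # table[tag][topic_id] = number of documents with that tag and dominant topic
    table = defaultdict(Counter)
    for doc_topics, tag in zip(all_doc_topics, tags):
        table[tag][dominant_topic(doc_topics)] += 1
    return table

# Hypothetical topic distributions and tags, for illustration only
toy_topics = [
    [(1, 0.50), (6, 0.42), (4, 0.07)],
    [(5, 0.90)],
    [(5, 0.60), (2, 0.30)],
    [(1, 0.80), (3, 0.10)],
]
toy_tags = ["business", "politics", "politics", "business"]

table = topic_by_tag(toy_topics, toy_tags)
for tag, counts in table.items():
    print(tag, dict(counts))
```

In the notebook this could be called as `topic_by_tag((lda_model[bow] for bow in bow_corpus), df_process.type)`. A tag whose documents concentrate on one or two topics suggests that those topics line up with the category.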


> **Performance evaluation by classifying sample document using LDA TF-IDF model**

In [66]:
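'''
 Check which topic our test document belongs
 to using the LDA TF-IDF model.
'''
# Our test document is document number 0 (document_num = 0)
# Note: the query below is the raw bag-of-words vector; corpus_tfidf[document_num]
# would match the representation this model was trained on, but would give
# scores different from the ones shown below.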
for index, score in sorted(lda_model_tfidf[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):

    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))
 


Score: 0.6848410367965698	 
Topic: 0.004*"game" + 0.003*"play" + 0.003*"elect" + 0.002*"injuri" + 0.002*"player" + 0.002*"club" + 0.002*"good" + 0.002*"peopl" + 0.002*"govern" + 0.002*"second"

Score: 0.19719085097312927	 
Topic: 0.003*"game" + 0.003*"compani" + 0.002*"music" + 0.002*"sale" + 0.002*"firm" + 0.002*"film" + 0.002*"servic" + 0.002*"work" + 0.002*"peopl" + 0.002*"number"

Score: 0.07465879619121552	 
Topic: 0.004*"film" + 0.003*"game" + 0.003*"compani" + 0.002*"music" + 0.002*"sale" + 0.002*"court" + 0.002*"play" + 0.002*"firm" + 0.002*"peopl" + 0.002*"market"

Score: 0.038090839982032776	 
Topic: 0.003*"film" + 0.003*"dollar" + 0.002*"countri" + 0.002*"profit" + 0.002*"play" + 0.002*"price" + 0.002*"hunt" + 0.002*"meet" + 0.002*"airlin" + 0.002*"market"


> **Testing model on unseen document**

In [73]:
unseen_document = """Following Congress's poor performance in the recently concluded Uttar Pradesh assembly polls, BJP MP Sushil Kumar Modi on Sunday trained his guns at Congress General Secretary Priyanka Gandhi Vadra. Taking a swipe at her women-focused election campaign in the politically vital state, he accused the grand old party of corruption and practising dynastic politics.
"Priyanka Gandhi Vadra said she was the face of Congress in UP and tried to mislead women with her 'I am a woman, I can fight' slogan but the people gave her just 2 seats. Despite this, no Congress leader asked for her resignation from the post of General Secretary," he said in Hindi on Twitter. 

"A party steeped in dynasty and corruption is only pretending to introspect for many years," he added. 
 
Mr Modi, BJP's Rajya Sabha MP from Bihar and the state's former Deputy Chief Minister, has bitterly criticised the Congress party in a series of tweets since yesterday. 

In a jibe at Congress leader Sonia Gandhi and son Rahul Gandhi, who he said was the "de-facto supremo" of the party since there has been no full-time formal chief of the party in many years, he said that Congress has even lost the will to save the party and protect its internal democracy. 

Mr Modi said that Congress leaders don't have the courage to free the party of the Gandhi family even after a "shameful" routing in five states. "The party will be wiped out even from Rajasthan and Chattisgarh in the next elections," he said. 

Congress interim president Sonia Gandhi along with her family members Rahul Gandhi, and Priyanka Gandhi Vadra had proposed to resign during the Congress Working Committee meeting on Sunday but the CWC unanimously rejected this, a senior party leader told news agency ANI yesterday.

The CWC called the recent assembly election results of five states a "cause of serious concern" for the party.

The Gandhi family has come under severe criticism after the party's massive rout in the recent state election, with demands of a leadership change in the party growing louder every day. Leaders of the "G-23" - the group of 23 "dissenters" who had written to Sonia Gandhi two years ago - held a meeting on Friday at the house of senior leader Ghulam Nabi Azad.

Many internally predicted yet another futile exercise in the name of "introspection" with the Congress Working Committee - in which Gandhi "loyalists" easily outnumber "dissenters" - widely expected to steer clear of hard decisions. There are only three members from the "G-23" in the CWC - Anand Sharma, Ghulam Nabi Azad and Mukul Wasnik."""

In [71]:
unseen_document = "Kyiv’s lead negotiator Mikhailo Podolyak said in a video statement posted to Twitter. The country’s Chernobyl nuclear power plant, the site of the world’s worst nuclear disaster in 1986, once again lost its electricity supply, Ukraine’s energy operator Ukrenergo said.Meanwhile, Ukrainian President Volodymyr Zelensky warned NATO that its member states would soon be attacked by Russian forces after an airstrike hit a Ukrainian military base close to the Polish border. The President urged for a no-fly zone or “Russian rockets will fall on NATO soil”.Russia has escalated attacks in western Ukraine with a deadly airstrike on a military base where its troops had trained with NATO forces, bringing the conflict closer to Poland and other members of the bloc Police in the capital of Kyiv confirmed that a U.S. video journalist died Sunday and another American journalist was injured when they were attacked by Russian forces.Meanwhile, in an interview with The Kyiv Independent on Monday, Ukrainian Health Minister Viktor Lyashko revealed that Russian forces have completely destroyed seven hospitals and damaged an additional 104 health facilities, killing six medical personnel and severely injuring 12 more since the start of the all-out war.On the other hand, Russia has expressed that the country is undeterred by the sanctions imposed on it by the United States. “Russia will not ask the US and EU member nations to lift the sanctions, as pressure from the West and around the world will not change Moscow’s course,” Russian media Sputnik reports quoted Russian Deputy Foreign Minister Sergey Vershinin.The United States has once again condemned Russia over its territorial aggression in Ukraine. “We condemn the Russian Federation’s missile attack on the International Center for Peacekeeping and Security in Yavoriv, close to Ukraine’s border with Poland. 
The brutality must stop,” US Secretary of State Antony Blinken said.President Joe Biden spoke with French President Emmanuel Macron to discuss Russia’s war against Ukraine. They reviewed recent diplomatic engagements and underscored their commitment to hold Russia accountable for its actions and support the govt and people of Ukraine, the White House revealed on Monday."

In [74]:
# unseen_document = "My favorite sports activities are running and swimming."

# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):

    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))
 

Score: 0.7241026163101196	 Topic: 0.010*"elect" + 0.010*"peopl" + 0.009*"govern" + 0.007*"parti" + 0.006*"plan"
Score: 0.235848531126976	 Topic: 0.009*"music" + 0.008*"peopl" + 0.006*"report" + 0.006*"tell" + 0.006*"govern"
Score: 0.035047467797994614	 Topic: 0.007*"countri" + 0.006*"come" + 0.006*"play" + 0.006*"time" + 0.005*"want"
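
When scoring an unseen document, `dictionary.doc2bow` ignores tokens that are not in the dictionary, so names and terms absent from the BBC corpus contribute nothing to the scores above. A small sketch for measuring how much of a new document the vocabulary covers; the function name and the toy data are hypothetical.

```python
def vocabulary_coverage(tokens, vocabulary):
    """Fraction of tokens that are present in the vocabulary."""
    if not tokens:
        return 0.0
    known = [t for t in tokens if t in vocabulary]
    return len(known) / len(tokens)

# Hypothetical vocabulary and preprocessed tokens, for illustration only
toy_vocabulary = {"elect", "parti", "govern", "leader"}
toy_tokens = ["congress", "elect", "parti", "gandhi", "leader"]

coverage = vocabulary_coverage(toy_tokens, toy_vocabulary)
print(coverage)
```

In the notebook the call would be `vocabulary_coverage(preprocess(unseen_document), dictionary.token2id)`. A low value is a hint that the topic scores rest on only a few generic words.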


In [70]:
unseen_document

'My favorite sports activities are running and swimming.'

# **EDA**

In [52]:
df_process

Unnamed: 0,news,type
0,"[bsluggish, economi, german, number, peopl, wo...",business
1,"[gain, dollar, high, level, month, head, trade...",business
2,"[tight, rein, creditnnchina, effort, stop, eco...",business
3,"[user, lose, centr, impati, quicker, survey, w...",business
4,"[bhigh, fuel, price, blame, high, fuel, price,...",business
...,...,...
2217,"[phone, date, email, chatroom, instant, messag...",tech
2219,"[bnet, fingerprint, combat, attacksnneighti, l...",tech
2220,"[virus, spot, writer, trade, distribut, malici...",tech
2223,"[woman, woman, printer, cartridg, secret, prog...",tech
