# TOPIC MODELLING

<left><img src="images/topic.gif" width="500" height="100" /></left>

# Latent Dirichlet Allocation #
 

LDA is used to assign the text in a document to one or more topics. It builds a topics-per-document model and a words-per-topic model, both with Dirichlet priors.

* Each document is modeled as a multinomial distribution over topics, and each topic is modeled as a multinomial distribution over words.
* LDA assumes that every chunk of text we feed into it contains words that are somehow related. Choosing the right corpus of data is therefore crucial.
* It also assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distributions.
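
The generative story in the bullets above can be sketched in a few lines of plain Python. This is only a toy illustration, not how gensim trains a model: the two topics, the vocabulary, and the helper names `sample_dirichlet` and `generate_document` are made up for the example.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocabulary, alpha, length, rng):
    """Generate one toy document following LDA's generative assumptions."""
    num_topics = len(topic_word_probs)
    # 1. Draw this document's mixture over topics.
    theta = sample_dirichlet([alpha] * num_topics, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to the mixture.
        topic = rng.choices(range(num_topics), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        word = rng.choices(vocabulary, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words

rng = random.Random(0)
vocabulary = ["rocket", "moon", "vote", "election"]
# Two hand-written topics, "space" and "politics" (hypothetical values).
topic_word_probs = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
theta, words = generate_document(topic_word_probs, vocabulary,
                                 alpha=0.5, length=8, rng=rng)
print("Topic mixture:", theta)
print("Generated words:", words)
```

Training LDA runs this story in reverse: given only the words, it infers the topic mixtures and the topic-word distributions that most plausibly produced them.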

## Load the dataset

The dataset we'll use is a collection of news articles stored in the `output-merged.xlsx` file, with the article text in the `Content` column. We'll start by loading it with pandas.

In [42]:

# Load the dataset from the Excel file and save the text to 'data_text'

import pandas as pd
data = pd.read_excel('output-merged.xlsx')
# We only need the article text from the data; copy so we do not modify a slice of 'data'
data_text = data[:300000][['Content']].copy()
data_text['index'] = data_text.index
documents = data_text

Let's glance at the dataset:

In [43]:

# Get the total number of documents (1250 in our case)

print(len(documents))

1250


In [44]:
documents.head()

Unnamed: 0,Content,index
0,"For decades, blue and white paper coupons defi...",0
1,A federal appeals court largely sided with App...,1
2,Anheuser-Busch has placed two executives who m...,2
3,US manufacturers have now mostly worked their ...,3
4,A copyright infringement case against British ...,4


## Data Preprocessing ##

We will perform the following steps:

* **Tokenization**: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
* Words that have 3 or fewer characters are removed.
* All **stopwords** are removed.
* Words are **lemmatized** - words in third person are changed to first person and verbs in past and future tenses are changed into present.
* Words are **stemmed** - words are reduced to their root form.

In [45]:

#Loading Gensim and nltk libraries


import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(400)

In [46]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/keerthanaakannan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [47]:

# Write a function to perform the preprocessing steps on the entire dataset

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then stem the result
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

# Tokenize, drop stopwords and short tokens, then lemmatize and stem
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [48]:

# Preview a document after preprocessing

document_num = 1248
doc_sample = documents[documents['index'] == document_num].values[0][0]

print("Original document: ")
words = doc_sample.split(' ')
print(words)
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))

Original document: 


Tokenized and lemmatized document: 
['news', 'flash', 'headlin', 'check', 'click', 'foxnew', 'peopl', 'world', 'lose', 'confid', 'import', 'routin', 'childhood', 'vaccin', 'killer', 'diseas', 'like', 'measl', 'polio', 'covid', 'pandem', 'accord', 'report', 'unicef', 'countri', 'survey', 'public', 'percept', 'vaccin', 'children', 'declin', 'agenc', 'say', 'data', 'worri', 'warn', 'signal', 'rise', 'vaccin', 'hesit', 'amid', 'misinform', 'dwindl', 'trust', 'govern', 'polit', 'polaris', 'unicef', 'unit', 'nation', 'children', 'fund', 'say', 'near', 'million', 'african', 'children', 'miss', 'vaccin', 'accord', 'unicef', 'report', 'allow', 'confid', 'routin', 'immun', 'victim', 'pandem', 'catherin', 'russel', 'unicef', 'execut', 'director', 'say', 'statement', 'wave', 'death', 'children', 'measl', 'diphtheria', 'prevent', 'diseas', 'chang', 'percept', 'particular', 'worri', 'agenc', 'say', 'come', 'largest', 'sustain', 'backslid', 'childhood', 'immun', 'generat', 'covi

<left><img src="images/sidelook.gif" width="500" height="100" /></left>



Let's now preprocess all the news articles we have. To do that, let's use the [map](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) function from pandas to apply `preprocess()` to the `Content` column.


In [50]:
# Preprocess all the documents, saving the list of results as 'processed_docs'
processed_docs = documents['Content'].map(preprocess)

In [51]:

#Preview 'processed_docs'

processed_docs.head()

0    [decad, blue, white, paper, coupon, defin, bat...
1    [feder, appeal, court, larg, side, appl, monda...
2    [anheus, busch, place, execut, manag, light, s...
3    [manufactur, work, backlog, order, mean, cut, ...
4    [copyright, infring, case, british, artist, sh...
Name: Content, dtype: object

## Bag of words on the dataset

Now let's create a dictionary from 'processed_docs' that maps each unique word in the training set to an integer id and records how many documents it appears in. To do that, let's pass `processed_docs` to [`gensim.corpora.Dictionary()`](https://radimrehurek.com/gensim/corpora/dictionary.html) and call it '`dictionary`'.

In [52]:
'''
Create a dictionary from 'processed_docs' that maps each unique word in the
training set to an integer id, using gensim.corpora.Dictionary, and call it 'dictionary'
'''
dictionary = gensim.corpora.Dictionary(processed_docs)

In [53]:

# Check the dictionary created: print the first few (id, word) pairs

count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 10:
        break

0 accord
1 admit
2 analyst
3 attach
4 away
5 backfir
6 backlash
7 bankruptci
8 bath
9 begin
10 better


**Gensim filter_extremes**

[`filter_extremes(no_below=5, no_above=0.5, keep_n=100000)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes)

Filter out tokens that appear in

* fewer than `no_below` documents (absolute number) or
* more than `no_above` documents (fraction of total corpus size, not absolute number).
* After these two filters, keep only the `keep_n` most frequent tokens (or keep all if `None`).
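
The three filtering rules can be illustrated with a small pure-Python sketch. It is a simplified stand-in written for this explanation, not gensim's implementation; the function name `filter_vocabulary` and the toy documents are assumptions.

```python
from collections import Counter

def filter_vocabulary(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the tokens that survive the three filtering rules."""
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    # Convert the no_above fraction into an absolute document count.
    max_docs = no_above * len(docs)
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep the keep_n tokens with the highest document frequency.
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    if keep_n is not None:
        kept = kept[:keep_n]
    return kept

docs = [
    ["moon", "rocket", "nasa"],
    ["moon", "vote"],
    ["moon", "rocket"],
    ["moon", "vote", "poll"],
]
# "moon" is in every document (too common); "nasa" and "poll" are in one (too rare).
surviving = filter_vocabulary(docs, no_below=2, no_above=0.75, keep_n=10)
print(surviving)
```

Note that both thresholds count documents, not total occurrences: a word repeated ten times in a single document still has a document frequency of 1.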

In [54]:
'''
Remove very rare and very common words:

- words appearing in fewer than 15 documents
- words appearing in more than 10% of all documents
'''
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=100000)

**Gensim doc2bow**

[`doc2bow(document)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow)

* Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.
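
To make the bag-of-words format concrete, here is a minimal pure-Python sketch that produces the same kind of `(token_id, token_count)` pairs. The helpers `build_token_ids` and `to_bag_of_words` are hypothetical, and the ids are assigned in order of first appearance, which is an assumption of this sketch and may differ from the ids gensim assigns.

```python
from collections import Counter

def build_token_ids(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bag_of_words(doc, token2id):
    """Convert a token list into sorted (token_id, token_count) pairs."""
    # Tokens missing from the vocabulary are silently ignored.
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

docs = [["moon", "rocket", "moon"], ["vote", "poll"]]
token2id = build_token_ids(docs)
bow = to_bag_of_words(["moon", "moon", "vote", "unknown"], token2id)
print(token2id)
print(bow)
```

Word order is discarded in this representation; only which words occur, and how often, is kept.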

In [55]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [56]:
'''
Checking Bag of Words corpus for our sample document --> (token_id, token_count)
'''
bow_corpus[document_num]

[(11, 1),
 (84, 1),
 (186, 1),
 (268, 1),
 (270, 1),
 (284, 1),
 (303, 4),
 (313, 1),
 (319, 2),
 (320, 1),
 (378, 1),
 (463, 1),
 (487, 2),
 (504, 1),
 (510, 1),
 (532, 1),
 (626, 1),
 (688, 1),
 (740, 2),
 (812, 1),
 (860, 1),
 (931, 2),
 (933, 1),
 (1009, 1),
 (1124, 1),
 (1139, 1),
 (1165, 2),
 (1214, 1),
 (1266, 1),
 (1277, 2),
 (1395, 1),
 (1685, 1),
 (1744, 13),
 (1745, 2),
 (1752, 1),
 (1765, 2),
 (1802, 1),
 (1812, 7),
 (1859, 1),
 (1961, 1),
 (2089, 4),
 (2198, 1),
 (2202, 2),
 (2376, 1),
 (2381, 1),
 (2401, 2),
 (2443, 1),
 (2598, 1)]

In [57]:
'''
Preview BOW for our sample preprocessed document
'''
# document_num is 1248, the sample document we previewed during preprocessing
bow_doc = bow_corpus[document_num]

for token_id, token_count in bow_doc:
    print("Word {} (\"{}\") appears {} time.".format(token_id,
                                                     dictionary[token_id],
                                                     token_count))

Word 11 ("catch") appears 1 time.
Word 84 ("agreement") appears 1 time.
Word 186 ("broad") appears 1 time.
Word 268 ("disrupt") appears 1 time.
Word 270 ("dwindl") appears 1 time.
Word 284 ("hesit") appears 1 time.
Word 303 ("pandem") appears 4 time.
Word 313 ("shift") appears 1 time.
Word 319 ("survey") appears 2 time.
Word 320 ("sustain") appears 1 time.
Word 378 ("wave") appears 1 time.
Word 463 ("indic") appears 1 time.
Word 487 ("percept") appears 2 time.
Word 504 ("signal") appears 1 time.
Word 510 ("stress") appears 1 time.
Word 532 ("annual") appears 1 time.
Word 626 ("recommend") appears 1 time.
Word 688 ("victim") appears 1 time.
Word 740 ("miss") appears 2 time.
Word 812 ("trust") appears 1 time.
Word 860 ("india") appears 1 time.
Word 931 ("african") appears 2 time.
Word 933 ("amid") appears 1 time.
Word 1009 ("japan") appears 1 time.
Word 1124 ("mexico") appears 1 time.
Word 1139 ("medicin") appears 1 time.
Word 1165 ("covid") appears 2 time.
Word 1214 ("korea") appears 1 

## TF-IDF on our document set ##

While performing TF-IDF on the corpus is not necessary for an LDA implementation using the gensim model, it is recommended. TF-IDF expects a bag-of-words (integer values) training corpus during initialization. During transformation, it will take a vector and return another vector of the same dimensionality.

**TF-IDF stands for "Term Frequency, Inverse Document Frequency".**

* It is a way to score the importance of words (or "terms") in a document based on how frequently they appear across multiple documents.
* If a word appears frequently in a document, it's important. Give the word a high score. But if a word appears in many documents, it's not a unique identifier. Give the word a low score.
* Therefore, common words like "the" and "for", which appear in many documents, will be scaled down. Words that appear frequently in a single document will be scaled up.

In other words:

* TF(w) = `(Number of times term w appears in a document) / (Total number of terms in the document)`.
* IDF(w) = `log(Total number of documents / Number of documents with term w in it)`. The base of the logarithm only rescales the scores; the example here uses base 10.

For example:

* Consider a document containing `100` words wherein the word 'tiger' appears 3 times. 
* The term frequency (i.e., tf) for 'tiger' is then: 
    - `TF = (3 / 100) = 0.03`. 

* Now, assume we have `10 million` documents and the word 'tiger' appears in `1000` of these. Then, the inverse document frequency (i.e., idf) is calculated as:
    - `IDF = log10(10,000,000 / 1,000) = 4`. 

* Thus, the TF-IDF weight is the product of these quantities: 
    - `TF-IDF = 0.03 * 4 = 0.12`.
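
The 'tiger' arithmetic can be checked with a few lines of Python. The helper names `term_frequency` and `inverse_document_frequency` are made up for this sketch. It uses a base-10 logarithm, which is what yields an IDF of 4; a natural logarithm would give a larger value on a different scale.

```python
import math

def term_frequency(term_count, total_terms):
    """Share of the document's terms that are the given term."""
    return term_count / total_terms

def inverse_document_frequency(total_docs, docs_with_term):
    """Base-10 log of the inverse fraction of documents containing the term."""
    return math.log10(total_docs / docs_with_term)

tf_tiger = term_frequency(3, 100)
idf_tiger = inverse_document_frequency(10_000_000, 1_000)
tfidf_tiger = tf_tiger * idf_tiger
print("TF:", tf_tiger)
print("IDF (base 10):", idf_tiger)
print("TF-IDF:", round(tfidf_tiger, 4))

# For comparison, the same IDF with a natural logarithm.
idf_natural = math.log(10_000_000 / 1_000)
print("IDF (natural log):", round(idf_natural, 4))
```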

In [58]:
'''
Create tf-idf model object using models.TfidfModel on 'bow_corpus' and save it to 'tfidf'
'''
from gensim import corpora, models


tfidf = models.TfidfModel(bow_corpus)

In [59]:
'''
Apply transformation to the entire corpus and call it 'corpus_tfidf'
'''
corpus_tfidf = tfidf[bow_corpus]

In [60]:
'''
Preview TF-IDF scores for our first document --> (token_id, tfidf score)
'''
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.04073058553936874),
 (1, 0.039035783503742094),
 (2, 0.10788788894098837),
 (3, 0.05037511517775265),
 (4, 0.12378170952523458),
 (5, 0.7221927172465836),
 (6, 0.1605715535817654),
 (7, 0.03438501860300902),
 (8, 0.04360248134796318),
 (9, 0.13080744404388953),
 (10, 0.04720633076746646),
 (11, 0.06655582742261955),
 (12, 0.048186074678061866),
 (13, 0.03705107248905493),
 (14, 0.04155404253355692),
 (15, 0.3529296939616342),
 (16, 0.03281227677595765),
 (17, 0.05499632615688211),
 (18, 0.04628984154589048),
 (19, 0.06021825860204716),
 (20, 0.03629953976630424),
 (21, 0.03629953976630424),
 (22, 0.049986031819735835),
 (23, 0.04785192622106832),
 (24, 0.048186074678061866),
 (25, 0.05249856362454665),
 (26, 0.037844611555530865),
 (27, 0.06021825860204716),
 (28, 0.04628984154589048),
 (29, 0.06189085476261729),
 (30, 0.03644669812159597),
 (31, 0.04335972921099753),
 (32, 0.04360248134796318),
 (33, 0.03339674833118179),
 (34, 0.03572569788578483),
 (35, 0.05344562477225768),


## Running LDA using Bag of Words ##

We are going for 5 topics in the document corpus.


Some of the parameters we will be tweaking are:

* **num_topics** is the number of requested latent topics to be extracted from the training corpus.
* **id2word** is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
* **workers** is the number of extra processes to use for parallelization. By default, `LdaMulticore` uses all available cores minus one.
* **alpha** and **eta** are hyperparameters that affect the sparsity of the document-topic (theta) and topic-word (lambda) distributions. We will leave these at their default values for now (the default is a symmetric prior of `1/num_topics`).
    - Alpha is the prior on the per-document topic distribution.
        * High alpha: Every document has a mixture of all topics (documents appear similar to each other).
        * Low alpha: Every document has a mixture of very few topics.

    - Eta is the prior on the per-topic word distribution.
        * High eta: Each topic has a mixture of most words (topics appear similar to each other).
        * Low eta: Each topic has a mixture of few words.
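
The effect of a high versus low alpha can be seen by sampling topic mixtures from a symmetric Dirichlet prior. This is a small standard-library sketch, separate from the gensim model; the helper names and the alpha values 0.1 and 10.0 are chosen only for illustration.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def average_dominant_share(alpha, num_topics=5, num_samples=2000, seed=0):
    """Average weight of the largest topic across sampled document mixtures."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += max(sample_dirichlet([alpha] * num_topics, rng))
    return total / num_samples

low_alpha_share = average_dominant_share(alpha=0.1)
high_alpha_share = average_dominant_share(alpha=10.0)
print("Dominant topic share with low alpha: ", round(low_alpha_share, 3))
print("Dominant topic share with high alpha:", round(high_alpha_share, 3))
```

With a low alpha, one topic tends to take most of each document's weight; with a high alpha, the weight is spread more evenly across all topics. The same reasoning applies to eta and the words within a topic.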



In [61]:
# LDA mono-core -- fallback code in case LdaMulticore throws an error on your machine
# lda_model = gensim.models.LdaModel(bow_corpus, 
#                                    num_topics = 10, 
#                                    id2word = dictionary,                                    
#                                    passes = 50)

# LDA multicore 

#Train your lda model using gensim.models.LdaMulticore and save it to 'lda_model'

lda_model = gensim.models.LdaMulticore(bow_corpus, 
                                       num_topics=5, 
                                       id2word = dictionary, 
                                       passes = 2, 
                                       workers=2)



In [62]:

# For each topic, we will explore the words occurring in that topic and their relative weights

for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 0 
Words: 0.006*"trump" + 0.004*"florida" + 0.004*"russian" + 0.004*"medic" + 0.003*"goal" + 0.003*"nasa" + 0.003*"station" + 0.003*"bird" + 0.003*"murder" + 0.003*"russia"


Topic: 1 
Words: 0.007*"trump" + 0.006*"republican" + 0.005*"sudan" + 0.003*"credit" + 0.003*"vote" + 0.003*"round" + 0.003*"presidenti" + 0.003*"rocket" + 0.002*"poll" + 0.002*"evacu"


Topic: 2 
Words: 0.004*"militari" + 0.004*"russia" + 0.004*"ukrain" + 0.004*"fuel" + 0.004*"pollut" + 0.004*"energi" + 0.003*"plastic" + 0.003*"chines" + 0.003*"taiwan" + 0.003*"electr"


Topic: 3 
Words: 0.008*"moon" + 0.005*"speci" + 0.004*"anim" + 0.004*"race" + 0.004*"plant" + 0.003*"abort" + 0.003*"solar" + 0.003*"egg" + 0.003*"bird" + 0.003*"australia"


Topic: 4 
Words: 0.007*"moon" + 0.006*"nasa" + 0.005*"flight" + 0.005*"spacex" + 0.005*"rocket" + 0.004*"starship" + 0.004*"astronaut" + 0.003*"musk" + 0.003*"eclips" + 0.003*"isra"




In [63]:
import pyLDAvis
import pyLDAvis.gensim_models
import gensim.corpora as corpora
import gensim.models.ldamodel as lda

In [64]:
pyLDAvis.enable_notebook()
id2word=dictionary
vis = pyLDAvis.gensim_models.prepare(lda_model, bow_corpus, id2word)
vis


## Running LDA using TF-IDF ##

In [65]:
'''
Define lda model using corpus_tfidf, again using gensim.models.LdaMulticore()
'''

lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, 
                                       num_topics=5, 
                                       id2word = dictionary, 
                                       passes = 2, 
                                       workers=2)



In [66]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model_tfidf.print_topics(-1):
    print("Topic: {} Word: {}".format(idx, topic))
    print("\n")

Topic: 0 Word: 0.004*"sudan" + 0.003*"evacu" + 0.002*"militari" + 0.002*"russia" + 0.002*"beach" + 0.002*"station" + 0.002*"employe" + 0.002*"parent" + 0.002*"incom" + 0.002*"bank"


Topic: 1 Word: 0.005*"trump" + 0.003*"race" + 0.002*"leagu" + 0.002*"desanti" + 0.002*"champion" + 0.002*"tournament" + 0.002*"player" + 0.002*"round" + 0.002*"hospit" + 0.002*"republican"


Topic: 2 Word: 0.003*"jet" + 0.003*"chines" + 0.002*"rocket" + 0.002*"beij" + 0.002*"ukrain" + 0.002*"flight" + 0.002*"starship" + 0.002*"spacex" + 0.002*"republican" + 0.002*"minist"


Topic: 3 Word: 0.003*"plastic" + 0.002*"pollut" + 0.002*"race" + 0.002*"disney" + 0.002*"music" + 0.002*"trump" + 0.002*"vote" + 0.002*"asteroid" + 0.002*"fish" + 0.002*"republican"


Topic: 4 Word: 0.004*"moon" + 0.003*"anim" + 0.003*"nasa" + 0.003*"telescop" + 0.003*"solar" + 0.003*"speci" + 0.002*"orbit" + 0.002*"bird" + 0.002*"galaxi" + 0.002*"australian"




In [67]:
vis2 = pyLDAvis.gensim_models.prepare(lda_model_tfidf, corpus_tfidf, dictionary)
vis2

## Performance evaluation by classifying sample document using LDA Bag of Words model

We will check to see where our test document would be classified.
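
A trained gensim LDA model returns a document's topics as a list of `(topic_id, probability)` pairs, and by default it omits topics whose probability falls below a small threshold (the `minimum_probability` setting). The sketch below shows one way to read such a list; the scores are hypothetical values, not output from the model above, and the helper names are made up.

```python
def top_topic(topic_scores):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_scores, key=lambda pair: pair[1])

def full_topic_distribution(topic_scores, num_topics):
    """Expand sparse (topic_id, probability) pairs to one value per topic."""
    probs = [0.0] * num_topics
    for topic_id, prob in topic_scores:
        probs[topic_id] = prob
    return probs

# Hypothetical scores for one document: only two of five topics are reported.
example_scores = [(3, 0.25), (1, 0.70)]
best = top_topic(example_scores)
dense = full_topic_distribution(example_scores, num_topics=5)
print("Most likely topic:", best)
print("Per-topic probabilities:", dense)
```

Because low-probability topics are dropped, the reported scores may sum to slightly less than 1.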

In [68]:
# Text of sample document 1248

processed_docs[1248]

['news',
 'flash',
 'headlin',
 'check',
 'click',
 'foxnew',
 'peopl',
 'world',
 'lose',
 'confid',
 'import',
 'routin',
 'childhood',
 'vaccin',
 'killer',
 'diseas',
 'like',
 'measl',
 'polio',
 'covid',
 'pandem',
 'accord',
 'report',
 'unicef',
 'countri',
 'survey',
 'public',
 'percept',
 'vaccin',
 'children',
 'declin',
 'agenc',
 'say',
 'data',
 'worri',
 'warn',
 'signal',
 'rise',
 'vaccin',
 'hesit',
 'amid',
 'misinform',
 'dwindl',
 'trust',
 'govern',
 'polit',
 'polaris',
 'unicef',
 'unit',
 'nation',
 'children',
 'fund',
 'say',
 'near',
 'million',
 'african',
 'children',
 'miss',
 'vaccin',
 'accord',
 'unicef',
 'report',
 'allow',
 'confid',
 'routin',
 'immun',
 'victim',
 'pandem',
 'catherin',
 'russel',
 'unicef',
 'execut',
 'director',
 'say',
 'statement',
 'wave',
 'death',
 'children',
 'measl',
 'diphtheria',
 'prevent',
 'diseas',
 'chang',
 'percept',
 'particular',
 'worri',
 'agenc',
 'say',
 'come',
 'largest',
 'sustain',
 'backslid',
 'chi

In [69]:
#Check which topic our test document belongs to using the LDA Bag of Words model.

document_num = 1248

# Our test document is document number 1248
for index, score in sorted(lda_model[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.5814734101295471	 
Topic: 0.006*"trump" + 0.004*"florida" + 0.004*"russian" + 0.004*"medic" + 0.003*"goal" + 0.003*"nasa" + 0.003*"station" + 0.003*"bird" + 0.003*"murder" + 0.003*"russia"

Score: 0.41111451387405396	 
Topic: 0.007*"trump" + 0.006*"republican" + 0.005*"sudan" + 0.003*"credit" + 0.003*"vote" + 0.003*"round" + 0.003*"presidenti" + 0.003*"rocket" + 0.002*"poll" + 0.002*"evacu"


## Performance evaluation by classifying sample document using LDA TF-IDF model

In [70]:
#Check which topic our test document belongs to using the LDA TF-IDF model.

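# Our test document is document number 1248
# Note: this passes raw bag-of-words counts; tfidf[bow_corpus[document_num]] would match the TF-IDF weighting used in training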
for index, score in sorted(lda_model_tfidf[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.8317645192146301	 
Topic: 0.005*"trump" + 0.003*"race" + 0.002*"leagu" + 0.002*"desanti" + 0.002*"champion" + 0.002*"tournament" + 0.002*"player" + 0.002*"round" + 0.002*"hospit" + 0.002*"republican"

Score: 0.16078321635723114	 
Topic: 0.003*"jet" + 0.003*"chines" + 0.002*"rocket" + 0.002*"beij" + 0.002*"ukrain" + 0.002*"flight" + 0.002*"starship" + 0.002*"spacex" + 0.002*"republican" + 0.002*"minist"


The test document has the highest probability (about 83%) of belonging to topic 1 of the TF-IDF model.

## Testing model on unseen document ##

In [71]:
unseen_document = "My favorite sports activities are running and swimming."

# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.7279661297798157	 Topic: 0.004*"militari" + 0.004*"russia" + 0.004*"ukrain" + 0.004*"fuel" + 0.004*"pollut"
Score: 0.0689203068614006	 Topic: 0.007*"trump" + 0.006*"republican" + 0.005*"sudan" + 0.003*"credit" + 0.003*"vote"
Score: 0.06814376264810562	 Topic: 0.008*"moon" + 0.005*"speci" + 0.004*"anim" + 0.004*"race" + 0.004*"plant"
Score: 0.06764495372772217	 Topic: 0.007*"moon" + 0.006*"nasa" + 0.005*"flight" + 0.005*"spacex" + 0.005*"rocket"
Score: 0.06732484698295593	 Topic: 0.006*"trump" + 0.004*"florida" + 0.004*"russian" + 0.004*"medic" + 0.003*"goal"


<left><img src="images/amazing.gif" width="200" height="100" /></left>

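The model assigns the unseen document to topic 2 with about 73% probability. That topic's top words ("militari", "russia", "ukrain", "fuel", "pollut") are unrelated to sports, so this assignment should be read with caution.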

<left><img src="images/dualipa.gif" width="500" height="100" /></left>

 <b>Concluding in dua lipa style! </b>