<a href="https://colab.research.google.com/github/CRYSTAL813/dissertation-/blob/main/%E2%80%9CCombined_TM_on_Wikipedia_Data_(Preproc%2BSaving%2BViz)_(stable_v2_3_0)%E2%80%9Dcopy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: Combined Topic Modeling

(last updated 10-07-2022)

In this tutorial, we are going to use our **Combined Topic Model** to get the topics out of a collection of articles.

## Topic Models 

Topic models allow you to discover latent topics in your documents in a completely unsupervised way. Just use your documents and get topics out.

## Contextualized Topic Models

![](https://raw.githubusercontent.com/MilaNLProc/contextualized-topic-models/master/img/logo.png)

What are Contextualized Topic Models? **CTMs** are a family of topic models that combine the expressive power of BERT embeddings with the unsupervised capabilities of topic models to get topics out of documents. 

## Python Package

You can find our package [here](https://github.com/MilaNLProc/contextualized-topic-models).

![https://github.com/MilaNLProc/contextualized-topic-models/actions](https://github.com/MilaNLProc/contextualized-topic-models/workflows/Python%20package/badge.svg) ![https://pypi.python.org/pypi/contextualized_topic_models](https://img.shields.io/pypi/v/contextualized_topic_models.svg) ![https://pepy.tech/badge/contextualized-topic-models](https://pepy.tech/badge/contextualized-topic-models)

# **Before you start...**

If you have additional questions about these topics, follow the links:

- You need to work with a language other than English: [click here!](https://contextualized-topic-models.readthedocs.io/en/latest/language.html#language-specific)
- You can't get good results with topic models: [click here!](https://contextualized-topic-models.readthedocs.io/en/latest/faq.html#i-am-getting-very-poor-results-what-can-i-do)
- You want to load your own embeddings: [click here!](https://contextualized-topic-models.readthedocs.io/en/latest/faq.html#can-i-load-my-own-embeddings)


# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit → Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)
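
After switching the runtime, it can be worth confirming that a GPU is actually visible. The following is a minimal sketch rather than part of the original notebook: the helper name `gpu_available` is made up for illustration, and it assumes PyTorch (which CTM is built on) is the library that will use the GPU. It returns `False` instead of failing when PyTorch is not installed yet.

```python
import importlib.util


def gpu_available() -> bool:
    """Return True if PyTorch is installed and reports a usable CUDA device."""
    # Hypothetical helper: skip the import entirely when torch is missing,
    # so the check also works before the packages below are installed.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


print("GPU available:", gpu_available())
```

If this prints `False` after the packages are installed, training still works on the CPU, only more slowly.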

# Installing Contextualized Topic Models

First, we install the contextualized topic model library, pinned to version 2.3.0, together with pyLDAvis for the visualization later on.

In [None]:
%%capture
!pip install contextualized-topic-models==2.3.0

In [None]:
%%capture
!pip install pyldavis

## Restart the Notebook

For the changes to take effect, we now need to restart the notebook.

From the Menu:

Runtime → Restart Runtime
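
After the restart, a quick check that the pinned version was picked up can save some confusion. This is a small sketch using only the standard library; `installed_version` is an illustrative helper name, not something provided by the package.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# The two packages installed in the cells above.
for name in ("contextualized-topic-models", "pyLDAvis"):
    print(name, "->", installed_version(name))
```

For this notebook, the first line should report `2.3.0`; `None` means the install cell needs to be run again.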

# Data

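We are going to need some data. The tutorial expects a file with one document per line, with no preprocessing applied beforehand.

The cell below opens an upload dialog. In this run we upload `greencredit.csv`, a Scopus export with the columns Authors, Title, Year, Link and Abstract. Because it is a CSV file, every line holds a whole record (authors, title, link and abstract) rather than a clean document, and the first line is the header row.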

In [None]:
from google.colab import files
files.upload()

Saving greencredit.csv to greencredit.csv




In [None]:
!head -n 2 greencredit.csv

﻿Authors,Title,Year,Link,Abstract
"Sarkar S., Ghosh A., Mondal A.","Design, Installation and Performance Analysis of an On-Grid Rooftop Solar PV Power Plant for Partial Fulfillment of Common Load",2023,"https://www.scopus.com/inward/record.uri?eid=2-s2.0-85135094536&doi=10.1007%2f978-981-19-1906-0_21&partnerID=40&md5=61667b01995c82391354171c997c9697","With shortage of fossil fuels like coal, petroleum the energy generation is depicted toward renewable energy sources like solar, wind, biomass, etc. The renewable energy technologies present an emission free energy generation technique toward a sustainable tomorrow. Rooftop solar power plant (RTPV) is one of the good solar power generation technique. In this paper, a brief description on design, commissioning and techno economic analysis of a 50Kwp rooftop solar power plant design in Uluberia super specialty hospital Howrah, India have been described. The electricity generation in both input DC and output AC end of each inverter is record

In [None]:
text_file = "greencredit.csv" # EDIT THIS WITH THE FILE YOU UPLOAD
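
Since the uploaded file is a CSV export, reading it line by line feeds the header row and every column into the model. One possible alternative, sketched here under assumptions, is to keep only the `Abstract` column with the standard `csv` module. The helper name `load_column` is hypothetical, and the sample rows are invented so that the example runs without the real file.

```python
import csv
import io
from typing import List, TextIO


def load_column(handle: TextIO, column: str = "Abstract") -> List[str]:
    """Return the non-empty values of one CSV column, one string per record."""
    values = []
    for row in csv.DictReader(handle):
        # Short or malformed rows yield None for missing fields; treat as empty.
        value = (row.get(column) or "").strip()
        if value:
            values.append(value)
    return values


# Tiny made-up sample with the same header as greencredit.csv.
sample = io.StringIO(
    'Authors,Title,Year,Link,Abstract\n'
    '"Doe J.","A title, with a comma",2023,"https://example.org","First abstract."\n'
    '"Roe R.","Another title",2022,"https://example.org","Second abstract."\n'
)
abstracts = load_column(sample)
print(abstracts)  # ['First abstract.', 'Second abstract.']
```

With the real file, the same helper could be called inside `with open(text_file, encoding="utf-8-sig", newline="") as f:`; the `utf-8-sig` codec also strips the byte-order mark that precedes `Authors` in the file. The rest of this notebook keeps the line-by-line reading, so its outputs correspond to that approach.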

# Importing what we need

In [None]:
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessingStopwords
import nltk

## Preprocessing

Why do we use the **preprocessed text** here? We need text without punctuation to build the bag of words (BoW). We also want to keep only the most frequent words in the BoW, since a very large vocabulary rarely helps.
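
To make the idea concrete, here is a rough, standard-library-only sketch of this kind of preprocessing: lowercase the text, strip punctuation, drop stopwords, and keep only the most frequent words. It is not the implementation of `WhiteSpacePreprocessingStopwords`; the function name `simple_preprocess` and its `vocabulary_size` parameter are chosen here for illustration.

```python
import string
from collections import Counter
from typing import Iterable, List, Tuple


def simple_preprocess(
    docs: Iterable[str], stopwords: Iterable[str], vocabulary_size: int = 2000
) -> Tuple[List[str], List[str]]:
    """Return (cleaned documents, sorted vocabulary) for a list of raw texts."""
    stop = set(stopwords)
    # Map every punctuation character to a space before splitting on whitespace.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokenized = [
        [w for w in doc.lower().translate(table).split() if w not in stop]
        for doc in docs
    ]
    # Keep only the `vocabulary_size` most frequent words across the corpus.
    counts = Counter(w for tokens in tokenized for w in tokens)
    vocab = {w for w, _ in counts.most_common(vocabulary_size)}
    cleaned = [" ".join(w for w in tokens if w in vocab) for tokens in tokenized]
    return cleaned, sorted(vocab)


docs = ["Solar power, solar panels!", "The power grid and the solar plant."]
cleaned, vocab = simple_preprocess(docs, stopwords=["the", "and"], vocabulary_size=2)
print(cleaned)  # ['solar power solar', 'power solar']
print(vocab)    # ['power', 'solar']
```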

In [None]:
from nltk.corpus import stopwords as stop_words

nltk.download('stopwords')

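# Read the first 2000 lines of the file; each line is treated as one document.
# For a CSV such as greencredit.csv this includes the header row and all columns.
with open(text_file, encoding="utf-8") as f:
    documents = [line.strip() for line in f.readlines()[0:2000]]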

stopwords = list(stop_words.words("english"))
stopwords.extend(['https','www','com','uri','eid','doi','fj','taylor','ch','md','scopus','cc','inc','elsevier','jbusres','record','partnerid','inward','ff',
          'dc','fs','fb','le','ag','iaffe','ci','ieee','ie','de','ee','bc','ed','aa','ltd','pv'])

sp = WhiteSpacePreprocessingStopwords(documents, stopwords_list=stopwords)
preprocessed_documents, unpreprocessed_corpus, vocab, retained_indices = sp.preprocess()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
preprocessed_documents[:2]

['year link abstract',
 'design installation performance analysis grid solar power plant common load fossil fuels like coal petroleum energy generation toward renewable energy sources like solar wind biomass etc renewable energy technologies present emission free energy generation technique toward sustainable solar power plant one good solar power generation technique paper design economic analysis solar power plant design india described electricity generation input output ac end analyzed paper input end power solar panels open short current maximum power respectively time day west output power connected main transmission grid grid rule solar power plant main grid west state electricity distribution company limited generated current power solar panels power plant taken using data cloud analyzed output data paper total load building plant generating average units per saving rs per year utilization factor solar power plant attempt made calculate payback period even years approximately y

We don't discard the non-preprocessed texts, because we are going to use them as input for obtaining the contextualized document representations. 

Let's pass our preprocessed and unpreprocessed documents to our `TopicModelDataPreparation` object. This object takes care of creating the bag of words for you and of obtaining the contextualized BERT representations of the documents. Together, these two inputs make up our training dataset.

Note: Here we use the contextualized model "all-mpnet-base-v2", which produces 768-dimensional embeddings (hence `contextual_size=768` when we create the model).


In [None]:
tp = TopicModelDataPreparation("all-mpnet-base-v2")

training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)



Batches:   0%|          | 0/10 [00:00<?, ?it/s]



Let's check the first ten words of the vocabulary 

In [None]:
tp.vocab[:10]

['ab',
 'ability',
 'able',
 'abstract',
 'ac',
 'academic',
 'access',
 'according',
 'account',
 'accounting']
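
For intuition, a bag of words is simply a count matrix with one row per document and one column per vocabulary word, which is why the model is later created with `bow_size=len(tp.vocab)`. The sketch below builds such a matrix with plain Python; `bag_of_words` is an illustrative helper and not the code the library runs internally.

```python
from typing import List


def bag_of_words(docs: List[str], vocab: List[str]) -> List[List[int]]:
    """Count how often each vocabulary word occurs in each document."""
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in doc.split():
            if word in index:  # words outside the vocabulary are ignored
                row[index[word]] += 1
        matrix.append(row)
    return matrix


vocab = ["grid", "power", "solar"]  # sorted, like tp.vocab
docs = ["solar power solar", "power grid"]
bow = bag_of_words(docs, vocab)
print(bow)  # [[0, 1, 2], [1, 1, 0]]
```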

## Training our Combined TM

Finally, we can fit our new topic model. We will ask the model to find 20 topics in our collection (`n_components=20`), training for 10 epochs.

In [None]:
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=20, num_epochs=10)
ctm.fit(training_dataset) # run the model

Epoch: [10/10]	 Seen Samples: [18360/18360]	Train Loss: 832.8062597869009	Time: 0:00:01.507029: : 10it [00:15,  1.52s/it]
Sampling: [20/20]: : 20it [00:16,  1.24it/s]


# Topics

After training, it is time to look at our topics. We can use the `get_topic_lists` method to get them; it accepts a parameter that sets how many words to show for each topic.

Most of the topics are coherent and reflect the uploaded corpus: green credit and finance in China, LEED building certification, agriculture, and renewable power. Several of them overlap heavily (the LEED/building topics in particular), and a few are hard to interpret. Notice that the topics are in English, because we trained the model on English documents.

In [None]:
ctm.get_topic_lists(5)

[['design', 'leadership', 'certification', 'council', 'tool'],
 ['green', 'china', 'credit', 'financial', 'financing'],
 ['building', 'rating', 'leed', 'buildings', 'certification'],
 ['agricultural', 'land', 'agriculture', 'farmers', 'production'],
 ['corporate', 'enterprises', 'evidence', 'find', 'polluting'],
 ['gas', 'generation', 'power', 'production', 'per'],
 ['carbon', 'optimal', 'trading', 'chain', 'supply'],
 ['leed', 'building', 'buildings', 'rating', 'environmental'],
 ['leed', 'design', 'leadership', 'certification', 'buildings'],
 ['formation', 'versions', 'fundamental', 'taiwan', 'architects'],
 ['development', 'management', 'research', 'analysis', 'results'],
 ['tax', 'theory', 'political', 'companies', 'play'],
 ['design', 'leadership', 'building', 'systems', 'leed'],
 ['renewable', 'sources', 'power', 'fuel', 'electricity'],
 ['china', 'financial', 'effect', 'green', 'credit'],
 ['development', 'economic', 'financial', 'panel', 'investment'],
 ['design', 'systems', 'l
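
Several of the topics printed above share most of their top words. One simple way to quantify this, sketched here with the standard library, is the Jaccard overlap between top-word lists. The `jaccard` helper is illustrative; the three word lists are copied from the output above (indices 2, 5 and 7 of the returned list).

```python
from itertools import combinations
from typing import Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


topics = {
    2: ['building', 'rating', 'leed', 'buildings', 'certification'],
    5: ['gas', 'generation', 'power', 'production', 'per'],
    7: ['leed', 'building', 'buildings', 'rating', 'environmental'],
}

for (i, words_i), (j, words_j) in combinations(topics.items(), 2):
    print(f"topics {i} and {j}: {jaccard(words_i, words_j):.2f}")
```

Topics 2 and 7 share four of six distinct words (overlap 0.67), while topic 5 shares none with either. Many high-overlap pairs can be a hint that a smaller `n_components` would be enough for this corpus.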

# Let's Draw!

We can use PyLDAvis to plot our topics in a nice and friendly manner :)

In [None]:
lda_vis_data = ctm.get_ldavis_data_format(tp.vocab, training_dataset, n_samples=10)

Sampling: [10/10]: : 10it [00:08,  1.24it/s]


In [None]:
import pyLDAvis as vis

lda_vis_data = ctm.get_ldavis_data_format(tp.vocab, training_dataset, n_samples=10)

ctm_pd = vis.prepare(**lda_vis_data)
vis.display(ctm_pd)

Sampling: [10/10]: : 10it [00:08,  1.22it/s]


# Topic Predictions

Now we can take a document and see which topic has been assigned to it. Results will change depending on the documents you are using. For example, let's predict the topic of the first preprocessed document, which in this run is just the CSV header row (`year link abstract`).

In [None]:
topics_predictions = ctm.get_thetas(training_dataset, n_samples=5) # get all the topic predictions

Sampling: [5/5]: : 5it [00:02,  1.72it/s]


In [None]:
preprocessed_documents[0] # see the text of our preprocessed document

'year link abstract'

In [None]:
import numpy as np
topic_number = np.argmax(topics_predictions[0]) # get the topic id of the first document

In [None]:
topic_number

19

In [None]:
ctm.get_topic_lists(5)[15]

['trial', 'outcomes', 'methods', 'children', 'clinical']

In [None]:
ctm.get_topic_lists(5)[topic_number] # top words of the predicted topic; document 0 is only the CSV header, so don't expect a meaningful match

['classified', 'departments', 'analyse', 'reliable', 'conflict']
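
The same argmax step can be applied to every document at once to see how the collection is spread over the topics. The sketch below uses made-up numbers and plain Python so that it runs on its own; `dominant_topics` is an illustrative helper. It assumes the predictions come as one row of topic weights per document.

```python
from collections import Counter
from typing import List, Sequence


def dominant_topics(thetas: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each document, the index of its highest-weighted topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in thetas]


# Invented document-topic weights: 3 documents, 3 topics.
toy_thetas = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.6, 0.3, 0.1],
]
assigned = dominant_topics(toy_thetas)
print(assigned)           # [0, 2, 0]
print(Counter(assigned))  # Counter({0: 2, 2: 1})
```

On the real predictions, `np.argmax(topics_predictions, axis=1)` computes the same per-document assignment.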

# Save Our Model for Later Use

In [None]:
ctm.save(models_dir="./")



In [None]:
# let's remove the trained model
del ctm

In [None]:
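# Rebuild the model with the same settings used for training (20 topics, 10 epochs).
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, num_epochs=10, n_components=20)

# The directory name encodes the hyperparameters, so it must match the model that was saved.
# The nc_50 path used originally does not exist for this 20-topic run and raised FileNotFoundError.
# Check the actual directory name and the last epoch number with `!ls` before running.
ctm.load("/content/contextualized_topic_model_nc_20_tpm_0.0_tpv_0.95_hs_prodLDA_ac_(100, 100)_do_softplus_lr_0.2_mo_0.002_rp_0.99",
         epoch=9)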

In [None]:
ctm.get_topic_lists(5)
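
Because the saved directory name encodes the hyperparameters, typing it by hand is error-prone. A possible safeguard, sketched below, is to list the candidate directories before calling `load`. This relies on two assumptions: that saved models live in folders whose names start with `contextualized_topic_model_` (as in the path above), and that the checkpoints inside are named `epoch_<n>.pth`. The helper `find_saved_models` is hypothetical, and the demo runs on a throwaway folder with a fake, empty checkpoint.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def find_saved_models(models_dir: str) -> Dict[str, List[str]]:
    """Map each saved-model folder to the checkpoint files it contains."""
    found = {}
    for folder in sorted(Path(models_dir).glob("contextualized_topic_model_*")):
        if folder.is_dir():
            # Assumed checkpoint naming scheme: epoch_<n>.pth
            found[str(folder)] = sorted(p.name for p in folder.glob("epoch_*.pth"))
    return found


# Demo on a temporary directory; no real model is involved.
with tempfile.TemporaryDirectory() as tmp:
    fake = Path(tmp) / "contextualized_topic_model_nc_20_example"
    fake.mkdir()
    (fake / "epoch_9.pth").touch()
    result = find_saved_models(tmp)
    print(result)
```

In the notebook, `find_saved_models("./")` would show which folder and epoch number to pass to `ctm.load`.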