# Topic Modeling with BERTopic

BERTopic is a topic modeling technique that uses transformers 🤗 and a custom class-based TF-IDF to create dense clusters, allowing for easily interpretable topics while keeping important words in topic descriptions.

<br>

<img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="10%">

<img src="https://github.com/MaartenGr/BERTopic/blob/master/docs/img/algorithm.png?raw=true" width="50%">

## Installing BERTopic

We start by installing BERTopic from PyPI:

In [None]:
# %%capture
# !pip install bertopic

## Import dependencies

In [1]:
import pandas as pd
from bertopic import BERTopic

## Importing dataset

For this project, we will use a preprocessed dataset of job vacancies in the field of bioinformatics.

In [56]:
docs = pd.read_csv('job_bioinfo_csv/bioinf_ads_preprocessed.csv')
docs.head(5)

Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,diffbotUri,humanLanguage,id,lastCrawlTime,name,pageUrl,...,resolvedPageUrl,summary,tasks,text,title,type,word_count,urlCleaned,text_result,title_result
0,0,0,0,0,http://diffbot.com/entity/JOB966384586,en,JOB966384586,1591962958,"Bioinformatics Specialist, GIS",https://www.nature.com/naturecareers/job/bioin...,...,,The Genomics Institute of Singapore () has an ...,,The Genomics Institute of Singapore () has an ...,"Bioinformatics Specialist, GIS",Job,130,nature,genomics institute singapore exciting opportun...,bioinformatics specialist gis
1,1,1,1,1,http://diffbot.com/entity/JOB1028283361,en,JOB1028283361,1549734701,Postdoctoral Fellowship in Bioinformatics and ...,http://www.nature.com/naturejobs/science/jobs/...,...,https://www.nature.com/naturecareers/job?id=67...,"The laboratories of Drs. Jeffrey Pessin, Fajun...",,"The laboratories of Drs. Jeffrey Pessin, Fajun...",Postdoctoral Fellowship in Bioinformatics and ...,Job,231,nature,laboratories drs jeffrey pessin fajun yang dey...,postdoctoral fellowship bioinformatics molecul...
2,2,2,2,2,http://diffbot.com/entity/JOB1570298833,en,JOB1570298833,1486437231,"Bioinformatics Analyst : Bar Harbor, ME, Unite...",http://www.nature.com/naturejobs/science/jobs/...,...,,\nTweet\nFacebook\nLinkedIn\nThe MDI Biologica...,,\nTweet\nFacebook\nLinkedIn\nThe MDI Biologica...,"Bioinformatics Analyst : Bar Harbor, ME, Unite...",Job,179,nature,tweetfacebooklinkedinthe mdi biological labora...,bioinformatics analyst bar harbor united states
3,3,3,3,3,http://diffbot.com/entity/JOB2313411533,en,JOB2313411533,1629620706,Postdoctoral Fellow in Bioinformatics and/or G...,https://www.nature.com/naturecareers/job/postd...,...,,We are looking for enthusiastic postdoctoral f...,,We are looking for enthusiastic postdoctoral f...,Postdoctoral Fellow in Bioinformatics and/or G...,Job,351,nature,looking enthusiastic postdoctoral fellows join...,postdoctoral fellow bioinformatics genomics
4,4,4,4,4,http://diffbot.com/entity/JOB2513853425,en,JOB2513853425,1563630178,"Postdoctoral Position in Bioinformatics, Micro...",https://www.nature.com/naturecareers/job/postd...,...,,Work group:\nInstitute of Virology\nArea of re...,,Work group:\nInstitute of Virology\nArea of re...,"Postdoctoral Position in Bioinformatics, Micro...",Job,525,nature,work group institute virologyarea research sci...,postdoctoral position bioinformatics microbial...


## Training

We start by instantiating BERTopic. We set the language to `"english"` because our documents are in English. If you want to use a multilingual model, use `language="multilingual"` instead.

We will also calculate the topic probabilities. However, this can make BERTopic significantly slower on large amounts of data (more than 100,000 documents). Consider disabling it if you want to speed up the model.

In [57]:
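# calculate_probabilities=True makes fit_transform return a document-by-topic
# probability matrix in `probs`. Optionally pass nr_topics=20 to limit the
# number of topics already during fitting.
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(docs.text_result)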

In [58]:
topic_model = topic_model.reduce_topics(docs.text_result, nr_topics=11)

## Extracting Topics
After fitting our model and reducing the number of topics, we can start looking at the results. Typically, we look at the most frequent topics first, as they best represent the document collection.

In [29]:
len(topic_model.get_topic_info())

10

In [30]:
freq = topic_model.get_topic_info()
freq.head(len(freq))  # show every row of the topic overview

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,501,-1_data_research_bioinformatics_experience,"[data, research, bioinformatics, experience, c...",[principal scientist computational biology hum...
1,0,939,0_data_research_computational_experience,"[data, research, computational, experience, bi...",[descriptioncelgene global biopharmaceutical c...
2,1,224,1_software_developer_experience_python,"[software, developer, experience, python, deve...",[lead software developer bioinformaticsa new o...
3,2,149,2_research_university_students_teaching,"[research, university, students, teaching, bio...",[position description department biological st...
4,3,68,3_module_sequence_course_bioinformatics,"[module, sequence, course, bioinformatics, lea...",[biomedical sciencescourse co ordinatordr b c ...
5,4,47,4_job_jobs_bioinformatics_cassava,"[job, jobs, bioinformatics, cassava, institute...",[scientific programme pdra bioinformatics depe...
6,5,34,5_board_society_job_worldwide,"[board, society, job, worldwide, oncology, pla...",[found phd computational biology biostatistics...
7,6,22,6_protein_research_crag_plant,"[protein, research, crag, plant, computational...",[crag seeksthe centre research agricultural ge...
8,7,16,7_clinical_autoimmune_research_lupus,"[clinical, autoimmune, research, lupus, diseas...",[data analyst bioinformatics specialist transl...
9,8,16,8_position_standard_create_first,"[position, standard, create, first, system, ap...",[biostatistics bioinformatics apply position n...
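
Topic -1 in the table above collects the documents that were not assigned to any topic. As a quick sanity check, the sketch below takes the values from the `Count` column and works out how large that outlier group is, then ranks the remaining topics by size. The variable names are illustrative only; with a fitted model the same numbers would come from `topic_model.get_topic_info()`.

```python
# Document counts per topic, copied from the Count column of the table above
topic_counts = {-1: 501, 0: 939, 1: 224, 2: 149, 3: 68,
                4: 47, 5: 34, 6: 22, 7: 16, 8: 16}

total_docs = sum(topic_counts.values())
outlier_share = topic_counts[-1] / total_docs

# Keep only the real topics, sorted from most to least frequent
real_topics = sorted(
    ((topic, count) for topic, count in topic_counts.items() if topic != -1),
    key=lambda item: item[1],
    reverse=True,
)

print(f"{total_docs} documents, {outlier_share:.1%} in the outlier topic")
print("Three largest topics:", real_topics[:3])
```

In this run roughly a quarter of the 2,016 documents ended up in the outlier topic, which is worth keeping in mind when reading the topic sizes.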


-1 refers to all outliers and should typically be ignored. Next, let's take a look at a frequent topic that was generated:

**NOTE**: BERTopic is stochastic, which means topics may differ between runs. This is mainly due to the stochastic nature of UMAP.
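
If reproducible topics matter, one common approach is to pass BERTopic a UMAP model with a fixed `random_state`. The sketch below is a minimal example under assumptions: the helper name `build_reproducible_model` and the seed value are made up for illustration, and the UMAP settings are meant to mirror BERTopic's documented defaults rather than anything tuned for this dataset. Fixing the seed can also make UMAP slower, since it limits parallelism.

```python
def build_reproducible_model(seed=42):
    """Return a BERTopic model whose UMAP step uses a fixed random seed.

    Hypothetical helper for illustration. The imports are deferred so the
    function can be defined before umap-learn and bertopic are installed.
    """
    from umap import UMAP
    from bertopic import BERTopic

    # Same kind of dimensionality reduction BERTopic uses by default,
    # but with random_state set so repeated runs give the same embedding
    umap_model = UMAP(
        n_neighbors=15,
        n_components=5,
        min_dist=0.0,
        metric="cosine",
        random_state=seed,
    )
    return BERTopic(
        language="english",
        umap_model=umap_model,
        calculate_probabilities=True,
        verbose=True,
    )
```

Calling `build_reproducible_model().fit_transform(docs.text_result)` would then replace the training cell above. Results may still vary across library versions or hardware.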

In [38]:
topic_model.get_topic(0)  

[('data', 0.02872391625971289),
 ('experience', 0.02635777915484802),
 ('research', 0.025372789422789685),
 ('bioinformatics', 0.023685734788982706),
 ('computational', 0.021978285574913898),
 ('analysis', 0.020579058937215795),
 ('biology', 0.020324744568035333),
 ('development', 0.01773894828066943),
 ('cancer', 0.017645875350720198),
 ('software', 0.017297665134682175)]

In [39]:
topic_model.get_topic(1)  

[('university', 0.03985218700048508),
 ('research', 0.039678755533861396),
 ('students', 0.03184728002507955),
 ('teaching', 0.030999950794597494),
 ('faculty', 0.029077632150984422),
 ('program', 0.028668584317966032),
 ('biology', 0.028250686196313397),
 ('department', 0.027404806413874334),
 ('applicants', 0.023980417348768168),
 ('computational', 0.02227458081175145)]

In [40]:
topic_model.get_topic(2)  

[('bioinformatics', 0.05030927116731804),
 ('module', 0.047715852053891734),
 ('sequence', 0.040425593520767106),
 ('course', 0.03885533294790458),
 ('learning', 0.031133280502293612),
 ('modules', 0.02973637449006597),
 ('students', 0.029245609925581357),
 ('data', 0.027851401014928306),
 ('analysis', 0.024373940714381873),
 ('biological', 0.022923689673234934)]

In [41]:
topic_model.get_topic(3)  

[('sequencing', 0.051090897001368867),
 ('detection', 0.04574143581265577),
 ('analysis', 0.03968337788159084),
 ('applications', 0.03817930195744738),
 ('data', 0.03630610190543475),
 ('variants', 0.03526979045696186),
 ('expression', 0.031557946779457065),
 ('genotyping', 0.030985470892824024),
 ('gene', 0.03053899166957824),
 ('next', 0.029980145339290172)]

In [42]:
topic_model.get_topic(4)  

[('board', 0.2610430396038062),
 ('society', 0.24197964152042017),
 ('oncology', 0.23530524511789486),
 ('worldwide', 0.23182897955820073),
 ('sunny', 0.20716443175332533),
 ('platform', 0.19997736172469913),
 ('roots', 0.1979256353677251),
 ('backup', 0.1964309136130045),
 ('jobs', 0.1834906357576063),
 ('vacation', 0.1816206808230115)]

In [43]:
topic_model.get_topic(5)  

[('clinical', 0.05810178636198764),
 ('autoimmune', 0.053604174788055646),
 ('research', 0.048725180303971195),
 ('lupus', 0.04399856438253687),
 ('disease', 0.040543192680669235),
 ('investigators', 0.039753267381363463),
 ('basic', 0.03830187171308582),
 ('arthritis', 0.03585880879250139),
 ('data', 0.03385236976950784),
 ('translational', 0.03278942523216417)]

In [44]:
topic_model.get_topic(6)  

[('position', 0.5800958814614342),
 ('standard', 0.480080574983487),
 ('create', 0.4765335552480117),
 ('first', 0.4639000367778338),
 ('system', 0.41916326181069613),
 ('application', 0.33833719995338046),
 ('apply', 0.3255685883385101),
 ('biostatistics', 0.3046649112834317),
 ('new', 0.30028672032611187),
 ('bioinformatics', 0.15036651171620305)]

In [45]:
topic_model.get_topic(7)  

[('job', 0.5499867693591898),
 ('board', 0.46200474469562525),
 ('society', 0.4282655560242357),
 ('worldwide', 0.41030049556729176),
 ('found', 0.40590448662756723),
 ('platform', 0.3539281878143484),
 ('oncology', 0.3407342185798088),
 ('biostatistics', 0.3191727642016904),
 ('phd', 0.2740139481154898),
 ('clinical', 0.23679307574962652)]

In [46]:
topic_model.get_topic(8)  

[('jobs', 0.41709147699074495),
 ('job', 0.30576881762207664),
 ('still', 0.2798972338187333),
 ('go', 0.2316554088095096),
 ('many', 0.21023372879757518),
 ('live', 0.209496640442525),
 ('page', 0.20624398390650325),
 ('institute', 0.1989093240003256),
 ('search', 0.18556533672845466),
 ('similar', 0.16461095868478728)]

In [47]:
topic_model.get_topic(9)  

[('industry', 0.10062148377643924),
 ('partners', 0.09926351945809722),
 ('commercial', 0.09750672769859353),
 ('working', 0.0784106397240603),
 ('life', 0.061115509372103236),
 ('create', 0.0606873413340449),
 ('negotiation', 0.060148951700745606),
 ('agreements', 0.060148951700745606),
 ('works', 0.05617172355795957),
 ('partnership', 0.054093699343213966)]

## View terms

We can visualize selected terms for some topics by creating bar charts from the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Additionally, you can easily compare topic representations with each other.

In [55]:
topic_model.visualize_barchart(top_n_topics=51, n_words=10, height=300)
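
To give a feel for where the c-TF-IDF scores in these charts come from, here is a rough sketch in plain Python. It follows the formula given in the BERTopic documentation, W(t, c) = tf(t, c) * log(1 + A / f(t)), where tf(t, c) is the normalized frequency of term t in topic c, A is the average number of words per topic, and f(t) is the frequency of t across all topics. The toy corpus and the function name `c_tf_idf` are invented for illustration, and BERTopic's own implementation works on sparse matrices and may differ in details.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Compute simplified c-TF-IDF scores.

    docs_per_topic maps a topic id to a list of tokenized documents.
    Returns a dict: topic id -> {term: score}.
    """
    # Treat all documents of a topic as one long document and count terms
    term_counts = {
        topic: Counter(token for doc in docs for token in doc)
        for topic, docs in docs_per_topic.items()
    }
    # A: average number of words per topic
    avg_words = sum(sum(c.values()) for c in term_counts.values()) / len(term_counts)
    # f(t): frequency of each term across all topics
    overall = Counter()
    for counts in term_counts.values():
        overall.update(counts)

    scores = {}
    for topic, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            term: (freq / n_words) * math.log(1 + avg_words / overall[term])
            for term, freq in counts.items()
        }
    return scores


# Toy corpus (made up): "research" appears in both topics
toy_corpus = {
    0: [["software", "developer", "python", "research"],
        ["software", "python", "experience"]],
    1: [["university", "students", "teaching", "research"],
        ["students", "research", "faculty"]],
}

toy_scores = c_tf_idf(toy_corpus)
for topic, term_scores in toy_scores.items():
    ranked = sorted(term_scores.items(), key=lambda item: item[1], reverse=True)
    print(topic, [(term, round(score, 3)) for term, score in ranked[:3]])
```

In this toy example, "research" occurs in both topics, so within topic 0 it scores lower than "developer" even though both appear there once. That is the effect that keeps generic words from dominating every topic description.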