<h1 align="center">BERTopic</h1>

Data Scientist: Dr. Eddy Giusepe Chirinos Isidro

[BERTopic](https://maartengr.github.io/BERTopic/index.html) is a topic modeling technique that uses 🤗 Transformers and `c-TF-IDF` to create dense clusters, yielding easily interpretable topics while keeping the important words in the topic descriptions.

BERTopic supports guided, supervised, semi-supervised, manual, long-document, hierarchical, class-based, dynamic, and online topic modeling. It also supports visualizations similar to LDAvis!
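To make the `c-TF-IDF` step more concrete, here is a minimal pure-Python sketch of the class-based weighting as the BERTopic documentation describes it: a term's frequency within a cluster, multiplied by `log(1 + A / f_t)`, where `A` is the average number of words per cluster and `f_t` is the term's frequency across all clusters. The function name `c_tf_idf` and the toy token lists are illustrative assumptions; this is not BERTopic's own implementation.

```python
import math
from collections import Counter


def c_tf_idf(class_tokens):
    """Score terms per class; class_tokens maps a class id to all tokens of that class."""
    # Term frequency inside each class (all documents of a class are treated as one).
    tf = {c: Counter(tokens) for c, tokens in class_tokens.items()}

    # f_t: frequency of each term across all classes.
    total = Counter()
    for counts in tf.values():
        total.update(counts)

    # A: average number of words per class.
    avg_words = sum(total.values()) / len(class_tokens)

    weights = {}
    for c, counts in tf.items():
        n_words = sum(counts.values())
        weights[c] = {
            term: (count / n_words) * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
    return weights


# Hypothetical toy clusters, already tokenized.
toy = {
    0: "game team game hockey the".split(),
    1: "key chip encryption key the".split(),
}
scores = c_tf_idf(toy)
top_term_cluster_0 = max(scores[0], key=scores[0].get)
print(top_term_cluster_0)
```

In this toy example, a word that appears in both clusters (`the`) scores lower than a word of the same frequency that appears in only one (`team`), which is why the resulting topic descriptions favor cluster-specific words.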

# Installation

BERTopic, together with `sentence-transformers`, can be installed from [PyPI](https://pypi.org/project/bertopic/):

```
pip install bertopic
```

Depending on the Transformers and language backends you plan to use, you may want to install additional extras. The available options are:

```
pip install bertopic[flair]
pip install bertopic[gensim]
pip install bertopic[spacy]
pip install bertopic[use]
```
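After installing, it can help to check which optional backends are importable in the current environment. The sketch below uses only the standard library. The mapping from each extra to a top-level module name (in particular `tensorflow_hub` for `use`) is an assumption for illustration and is not taken from BERTopic itself.

```python
from importlib.util import find_spec

# Extra name -> top-level module it is assumed to provide (hypothetical mapping).
EXTRA_MODULES = {
    "flair": "flair",
    "gensim": "gensim",
    "spacy": "spacy",
    "use": "tensorflow_hub",
}


def available_backends(extra_modules=EXTRA_MODULES):
    """Return {extra: True/False} depending on whether its module can be found."""
    return {
        extra: find_spec(module) is not None
        for extra, module in extra_modules.items()
    }


status = available_backends()
for extra, ok in status.items():
    print(f"bertopic[{extra}]: {'available' if ok else 'not installed'}")
```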

# Usage example

We start by extracting topics from the well-known 20 Newsgroups dataset, which contains documents in English:

In [1]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']





In [2]:
import pandas as pd

docs_df = pd.DataFrame(docs)
docs_df.head(8)

| | 0 |
|---|---|
| 0 | `\n\nI am sure some bashers of Pens fans are pr...` |
| 1 | `My brother is in the market for a high-perform...` |
| 2 | `\n\n\n\n\tFinally you said what you dream abou...` |
| 3 | `\nThink!\n\nIt's the SCSI card doing the DMA t...` |
| 4 | `1) I have an old Jasmine drive which I cann...` |
| 5 | `\n\nBack in high school I worked as a lab assi...` |
| 6 | `\n\nAE is in Dallas...try 214/241-6060 or 214/...` |
| 7 | `\n[stuff deleted]\n\nOk, here's the solution t...` |


In [3]:
docs_df.shape

(18846, 1)

In [4]:
docs_df[0][0]

"\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n"

In [None]:
# Instantiate the model object:

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

After generating the `topics` and their `probabilities`, we can access the most frequent topics that were generated:

In [6]:
topic_model.get_topic_info()

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 6503 | -1_to_is_the_and |
| 1 | 0 | 1834 | 0_game_team_games_he |
| 2 | 1 | 571 | 1_key_clipper_chip_encryption |
| 3 | 2 | 526 | 2_ites_cheek_yep_huh |
| 4 | 3 | 473 | 3_israel_israeli_jews_arab |
| ... | ... | ... | ... |
| 207 | 206 | 10 | 206_nec_toshiba_multisession_drives |
| 208 | 207 | 10 | 207_media_nw_washington_dc |
| 209 | 208 | 10 | 208_mcwilliams_18084tmibmclmsuedu_3369591_5173... |
| 210 | 209 | 10 | 209_god_beliefs_religion_faith |
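The `Count` column is the number of documents assigned to each topic id. As a minimal sketch, the same counts can be derived from the `topics` list returned by `fit_transform`, which holds one topic id per document. The short list below is a hypothetical stand-in, not the real model output.

```python
from collections import Counter

# Toy stand-in for `topics` (one topic id per document); -1 is the id in the first row above.
topics = [0, 0, -1, 1, 0, -1, 2]

counts = Counter(topics)

# Documents assigned to topic -1.
n_minus_one = counts[-1]

# Remaining topics ranked from most to least frequent.
ranked = [(topic, n) for topic, n in counts.most_common() if topic != -1]

print(n_minus_one)
print(ranked)
```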


`-1` refers to all `outliers` and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic `0`:

In [7]:
topic_model.get_topic(0)

[('game', 0.010473214900954434),
 ('team', 0.009121887336298864),
 ('games', 0.007263400941411855),
 ('he', 0.007169826032153097),
 ('players', 0.006386609162441127),
 ('season', 0.006304416558176395),
 ('hockey', 0.006203975667871301),
 ('play', 0.005839280757134111),
 ('25', 0.005711453570973233),
 ('year', 0.005689238569633178)]
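Each entry is a `(word, score)` pair, ordered by score. Comparing this output with the `Name` column of `get_topic_info()`, the name appears to be the topic id followed by the four highest-scoring words. The sketch below rebuilds that label from the first five pairs shown above; the pattern is inferred from the outputs in this notebook and is an assumption, not a documented guarantee.

```python
# First five (word, score) pairs copied from the get_topic(0) output above.
topic_id = 0
topic_words = [
    ("game", 0.010473214900954434),
    ("team", 0.009121887336298864),
    ("games", 0.007263400941411855),
    ("he", 0.007169826032153097),
    ("players", 0.006386609162441127),
]

# Sort by score (already sorted here, but this makes the intent explicit).
top_words = [
    word for word, _score in sorted(topic_words, key=lambda pair: pair[1], reverse=True)
]

# Assumed naming pattern: "<topic id>_<word1>_<word2>_<word3>_<word4>".
label = f"{topic_id}_" + "_".join(top_words[:4])
print(label)  # 0_game_team_games_he
```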

With `.get_document_info`, we can also extract document-level information, such as each document's topic, its probability, and whether it is a representative document of that topic:

In [8]:
topic_model.get_document_info(docs)


| | Document | Topic | Name | Top_n_words | Probability | Representative_document |
|---|---|---|---|---|---|---|
| 0 | `\n\nI am sure some bashers of Pens fans are pr...` | 0 | 0_game_team_games_he | game - team - games - he - players - season - ... | 1.000000 | False |
| 1 | `My brother is in the market for a high-perform...` | 11 | 11_card_drivers_diamond_ati | card - drivers - diamond - ati - driver - vide... | 0.877945 | False |
| 2 | `\n\n\n\n\tFinally you said what you dream abou...` | -1 | -1_to_is_the_and | to - is - the - and - of - for - you - it - in... | 0.000000 | False |
| 3 | `\nThink!\n\nIt's the SCSI card doing the DMA t...` | 49 | 49_scsi_scsi2_scsi1_ide | scsi - scsi2 - scsi1 - ide - asynchronous - dr... | 0.608730 | False |
| 4 | `1) I have an old Jasmine drive which I cann...` | 75 | 75_tape_backup_tapes_drive | tape - backup - tapes - drive - munroe - wangd... | 0.761425 | False |
| ... | ... | ... | ... | ... | ... | ... |
| 18841 | `DN> From: nyeda@cnsvax.uwec.edu (David Nye)\nD...` | 35 | 35_cancer_medical_patients_medicine | cancer - medical - patients - medicine - disea... | 0.824217 | False |
| 18842 | `\nNot in isolated ground recepticles (usually ...` | 181 | 181_ground_grounding_conductor_neutral | ground - grounding - conductor - neutral - wir... | 0.604598 | False |
| 18843 | `I just installed a DX2-66 CPU in a clone mothe...` | 82 | 82_fan_cpu_heat_sink | fan - cpu - heat - sink - fans - cooling - chi... | 1.000000 | False |
| 18844 | `\nWouldn't this require a hyper-sphere. In 3-...` | 128 | 128_den_sphere_radius_points | den - sphere - radius - points - plane - ellip... | 1.000000 | False |
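One common use of this table is to keep only documents that were assigned to a real topic with reasonable confidence. The sketch below applies that filter to the first five rows shown above, using plain tuples so that it runs without the fitted model. The `0.8` threshold is an arbitrary assumption chosen for illustration.

```python
# (document index, Topic, Probability) copied from the first five rows above.
rows = [
    (0, 0, 1.000000),
    (1, 11, 0.877945),
    (2, -1, 0.000000),
    (3, 49, 0.608730),
    (4, 75, 0.761425),
]

MIN_PROBABILITY = 0.8  # hypothetical cut-off, tune for your own data

# Keep documents that are not outliers (-1) and meet the probability cut-off.
confident = [
    doc_id
    for doc_id, topic, probability in rows
    if topic != -1 and probability >= MIN_PROBABILITY
]
print(confident)  # [0, 1]

# Equivalent filter on the real DataFrame (column names as shown above):
# info = topic_model.get_document_info(docs)
# info[(info["Topic"] != -1) & (info["Probability"] >= MIN_PROBABILITY)]
```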


In [9]:
# I had to install an extra package for this to render --> https://stackoverflow.com/questions/63533424/mime-type-rendering-requires-nbformat-4-2-0
topic_model.visualize_topics()

In [None]:
topic_model.visualize_documents(docs)
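If the interactive plots do not render in the notebook, one option is to write them to standalone HTML files and open those in a browser. The sketch below assumes that both `visualize_*` methods return Plotly figures, which provide `write_html`. The helper name `export_figures` and the output file names are hypothetical.

```python
from pathlib import Path


def export_figures(topic_model, docs, out_dir="figures"):
    """Write both interactive plots to HTML files and return their paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {
        "topics": out / "topics.html",
        "documents": out / "documents.html",
    }
    # Each call is assumed to return a Plotly figure exposing write_html().
    topic_model.visualize_topics().write_html(str(paths["topics"]))
    topic_model.visualize_documents(docs).write_html(str(paths["documents"]))
    return paths


# Usage with the fitted model and documents from the cells above:
# export_figures(topic_model, docs)
```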