### Clustering
To cluster the papers we use BERTopic, a state-of-the-art topic modelling approach.
Paper: https://arxiv.org/pdf/2203.05794.pdf \
We generate embeddings with a pretrained sentence transformer, `paraphrase-MiniLM-L12-v2` (see https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models), and set the minimum topic size to 50 papers.

In [1]:
from bertopic import BERTopic

topic_model = BERTopic(verbose=True, embedding_model="paraphrase-MiniLM-L12-v2", min_topic_size=50)

In [2]:
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True) 

#### Data Processing

In [3]:
import pandas as pd
data = pd.read_csv('../datasets/004_of_V13.csv')

In [4]:
# drop papers whose language is tagged as Chinese
data.drop(data[data['lang'] == 'zh'].index, inplace=True)
# drop rows with a missing abstract, language, fields of study, or keywords
data = data.dropna(subset=['abstract', 'lang', 'fos', 'keywords'])
# drop papers whose abstract is effectively empty (55 characters or fewer)
data.drop(data[data['abstract'].map(len) <= 55].index, inplace=True)
# drop German abstracts that are not tagged as German in the 'lang' column
data.drop(data[data['abstract'].map(lambda x: ' der ' in x)].index, inplace=True)
# drop papers whose keyword list is empty
data.drop(data[data['keywords'] == '[]'].index, inplace=True)
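
The same filters can be expressed as one boolean mask, which makes each rule easier to test in isolation. The sketch below is a hedged illustration on toy data, not the code that produced the results here: `filter_papers`, `min_abstract_len`, and the toy rows are hypothetical, and unlike the cell above it resets the index with `drop=True`.

```python
import pandas as pd

def filter_papers(df: pd.DataFrame, min_abstract_len: int = 56) -> pd.DataFrame:
    """Return a filtered copy using the same rules as the cell above."""
    # rows with missing values in any of these columns are unusable
    df = df.dropna(subset=["abstract", "lang", "fos", "keywords"])
    keep = (
        (df["lang"] != "zh")                                    # not tagged Chinese
        & (df["abstract"].str.len() >= min_abstract_len)        # abstract long enough
        & ~df["abstract"].str.contains(" der ", regex=False)    # crude German check
        & (df["keywords"] != "[]")                              # non-empty keyword list
    )
    return df[keep].reset_index(drop=True)

# Hypothetical toy rows, one per filter rule
long_text = "word " * 20
toy = pd.DataFrame({
    "title": ["keep", "chinese", "missing", "short", "german", "no_keywords"],
    "abstract": [long_text, long_text, None, "Preface.", "Analyse der Daten " * 5, long_text],
    "lang": ["en", "zh", "en", "en", "en", "en"],
    "fos": ["['x']"] * 6,
    "keywords": ["['a']", "['a']", "['a']", "['a']", "['a']", "[]"],
})

filtered = filter_papers(toy)
print(filtered["title"].tolist())
```

Only the row titled `keep` survives all five rules in this toy frame.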


In [5]:
data.reset_index(inplace=True)
data.to_csv('data/004_of_V13_filtered.csv')

In [6]:
abstracts = data['abstract'].to_list()
len(abstracts)

109042

#### Perform Clustering

In [7]:
topics, _ = topic_model.fit_transform(abstracts)
len(topic_model.get_topic_info())

Batches:   0%|          | 0/3408 [00:00<?, ?it/s]

2022-10-18 15:20:24,322 - BERTopic - Transformed documents to Embeddings
2022-10-18 15:21:09,173 - BERTopic - Reduced dimensionality


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

2022-10-18 15:21:12,951 - BERTopic - Clustered reduced embeddings


180
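
The `huggingface/tokenizers` warning in the output above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming the cell is placed before the imports that load the tokenizer (here, before `bertopic` is imported); choosing `"false"` rather than `"true"` is an assumption, not something this notebook did.

```python
import os

# Set explicitly so the tokenizers library does not have to guess after a fork.
# "false" disables tokenizer-level parallelism; it must be set before the
# tokenizer is first used to have any effect.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```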

##### `get_topic_info()` has 180 rows: 179 topics plus the outlier topic -1. Let's take a look at the most frequent topics and their most representative words.
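
When counting topics it helps to keep the outlier topic (-1) separate from the real ones. The sketch below uses a hypothetical list of per-document topic ids, in the same form as the `topics` list returned by `fit_transform`; the values are made up for illustration.

```python
from collections import Counter

# Hypothetical per-document topic ids; -1 marks documents left unassigned
toy_topics = [0, 0, 1, -1, 2, 0, -1, 1]

sizes = Counter(toy_topics)
n_outliers = sizes.pop(-1, 0)    # documents in the outlier topic
n_topics = len(sizes)            # real topics only
largest = sizes.most_common(2)   # the two biggest topics with their sizes

print(n_topics, n_outliers, largest)
```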

In [8]:
topic_model.visualize_barchart(top_n_topics=9, height=700)

In [9]:
topic_model.visualize_topics(top_n_topics=50)

In [10]:
topic_model.visualize_hierarchy(top_n_topics=50, width=800)

In [11]:
topic_model.visualize_heatmap(n_clusters=20, top_n_topics=100)

#### Save the model and clustering results

In [12]:
topic_model.save('models/topic_modelling')

In [13]:
topics_for_each_doc = [topic_model.topic_labels_[i] for i in topic_model.topics_]
data['topic'] = topics_for_each_doc
data.head()

Unnamed: 0,index,_id,title,authors,venue,year,keywords,n_citation,page_start,page_end,...,issue,doi,pdf,url,abstract,issn,isbn,fos,references,topic
0,174,53e99792b7602d9701f58152,Preface.,"[{'name': 'Takagi Mutsumi', 'org': 'Hokkaido U...","{'_id': '53a7325720f7420be8d8fa9b', 'raw': 'An...",2005,"['remote sensing', 'airborne electromagnetics'...",1.0,S1,S3,...,8 Suppl,10.1016/j.bpobgyn.2014.10.013,//static.aminer.org/pdf/PDF/000/868/217/prefac...,['http://dx.doi.org/10.1016/j.bpobgyn.2014.10....,On behalf of the IRPS 2010 Management Committe...,0275-004X,978-1-4244-5430-3,"['Computer science', 'Artificial intelligence']","['53e9b9adb7602d970459fe66', '53e9ba28b7602d97...",-1_the_of_and_to
1,188,53e99792b7602d9701f5b3e5,A Method of Multiple-Marker Register and Appli...,"[{'_id': '5603dcf245cedb3396276ecc', 'name': '...","{'sid': 'FRONTIERS IN COMPUTER EDUCATION', 'is...",2011,"['Augmented Reality', 'Virtual Education', 'Co...",0.0,431,+,...,,10.1007/978-3-642-27552-4_60,,['http://dx.doi.org/10.1007/978-3-642-27552-4_...,Augmented Reality is the technology that overl...,1867-5662,,"['Virtual image', 'Educational technology', 'C...","['53e9a8c5b7602d9703216bc6', '53e9ab6fb7602d97...",32_virtual_reality_vr_ar
2,189,53e99796b7602d9701f5b9d1,Optimal weighting of posteriors for audio-visu...,"[{'_id': '560c28fb45cedb33974b3f3e', 'name': '...","{'_id': '555037547cea80f9541805e0', 'raw': 'IC...",2001,"['conditional independence', 'audio signal pro...",25.0,161,164,...,,10.1109/ICASSP.2001.940792,,['http://doi.ieeecomputersociety.org/10.1109/I...,We investigate the fusion of audio and video ...,,,"['Weighting', 'Speech coding', 'Pattern recogn...","['53e9a863b7602d97031b24fe', '558aabb7e4b0b32f...",11_speech_speaker_recognition_music
3,190,53e99796b7602d9701f5c1a2,The clinical bioinformatics ontology: a curate...,"[{'_id': '56017d9745cedb3395e63f00', 'name': '...","{'_id': '53a7254520f7420be8b4a58f', 'name_d': ...",2005,"['controlled vocabulary', 'diagnostic test', '...",33.0,139,150,...,,,//static.aminer.org/pdf/PDF/000/554/244/the_cl...,['http://psb.stanford.edu/psb-online/proceedin...,Existing medical vocabularies lack rich terms ...,2335-6936,,"['Ontology', 'RefSeq', 'Molecular diagnostics'...","['53e9a281b7602d9702b88181', '55a3f4c2612ca648...",26_ontology_ontologies_semantic_web
4,248,53e9979bb7602d9701f650bf,Digital evidence,"[{'_id': '53f438d3dabfaedce554609c', 'name': '...","{'_id': '555036f57cea80f954169e28', 'type': 0,...",2002,['digital evidence'],5.0,128,128,...,4,10.1145/505248.505280,//static.aminer.org/pdf/PDF/000/776/363/digita...,['http://doi.acm.org/10.1145/505248.505280'],The evolution of an information society is acc...,,,"['Trusted Network Connect', 'Internet privacy'...",,-1_the_of_and_to


In [14]:
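# drop outliers: topic -1 holds the documents that were not assigned to any cluster
# (look the label up by id so this does not depend on the topic's representative words)
data.drop(data[data['topic'] == topic_model.topic_labels_[-1]].index, inplace=True)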
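
Because every label starts with the numeric topic id, outlier rows can also be identified by id instead of by the full label string. A small sketch on toy data: the three labels are taken from the `data.head()` output above, while the titles and the `topic_id` column are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["paper A", "paper B", "paper C"],  # hypothetical titles
    "topic": [
        "-1_the_of_and_to",
        "32_virtual_reality_vr_ar",
        "11_speech_speaker_recognition_music",
    ],
})

# Labels look like "<topic id>_<word>_<word>..."; split off the leading id
toy["topic_id"] = toy["topic"].str.split("_", n=1).str[0].astype(int)

# Keep only documents that belong to a real topic
clustered = toy[toy["topic_id"] != -1].reset_index(drop=True)
print(clustered[["title", "topic_id"]])
```

Keeping a numeric `topic_id` column also makes later grouping and sorting by topic simpler than working with the label strings.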

In [15]:
data.reset_index(inplace=True)
data.to_csv('results/004_of_V13_topics.csv')