In [1]:
# import potential libraries for data analysis
import pandas as pd

In this lab, we apply concepts from the text mining lecture to analyze arXiv paper abstracts, using the arXiv-10 dataset: https://paperswithcode.com/dataset/arxiv-10

arXiv is a free distribution service and an open-access archive for nearly 2.4 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Many recent AI and ML papers are posted to arXiv as preprints of conference or journal papers, which makes it a good resource for analyzing trending topics in computer science and other domains.

This lab contains three steps. In Step 1, you will perform basic data exploration and extraction. In Step 2, you will apply BERTopic to explore the topics discussed on arXiv for a specific domain that you are interested in analyzing. In Step 3, you will apply BERTopic to explore how topics evolve over time in the domain you selected in Step 2.

Step 1: Data Type Exploration

First, download the dataset, and answer the following three questions using code and results.
1) How many data entries/objects? 100,000
2) What are the attribute types? There are 3 attributes (title, abstract, label), all of pandas dtype object, i.e., text.

In [2]:
# put your code for implementing Step 1 in this code block

df = pd.read_csv('arxiv100.csv')

# Display the first few rows of the DataFrame
print(df.head())

# 1. Number of data entries/objects
num_entries = df.shape[0]
print(f'Number of data entries: {num_entries}')

# 2. Attribute types
attribute_types = df.dtypes
print('\nAttribute types:')
print(attribute_types)


                                               title  \
0  The Pre-He White Dwarfs in Eclipsing Binaries....   
1  A Possible Origin of kHZ QPOs in Low-Mass X-ra...   
2  The effects of driving time scales on heating ...   
3  A new hard X-ray selected sample of extreme hi...   
4  The baryon cycle of Seven Dwarfs with superbub...   

                                            abstract     label  
0    We report the first $BV$ light curves and hi...  astro-ph  
1    A possible origin of kHz QPOs in low-mass X-...  astro-ph  
2    Context. The relative importance of AC and D...  astro-ph  
3    Extreme high-energy peaked BL Lac objects (E...  astro-ph  
4    We present results from a high-resolution, c...  astro-ph  
Number of data entries: 100000

Attribute types:
title       object
abstract    object
label       object
dtype: object


Step 2: Exploring what these papers discuss in a specific domain

1) First, pick the technical domain that you are most interested in exploring, and extract only the papers related to that domain from the arXiv-10 dataset.
2) Second, explore the topics discussed in the data selected in 1) using BERTopic. Note that you need to explain how you determined the target number of topics, and apply text preprocessing if you think it would be helpful for your analysis.
3) Report your findings.

In [3]:
# code for step 1)
# Display the column names
print(df.columns)

Index(['title', 'abstract', 'label'], dtype='object')


In [4]:
# Get the unique labels
unique_labels = df['label'].unique()

# Display the unique labels
print('Unique Labels:')
for label in unique_labels:
    print(label)

Unique Labels:
astro-ph
cond-mat
cs
eess
hep-ph
hep-th
math
physics
quant-ph
stat
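
Before picking a domain, it can help to check how many papers each label has. The sketch below tallies labels with the standard library only; `label_counts` and the toy label list are hypothetical stand-ins for `df['label']`, not part of the lab code.

```python
from collections import Counter

def label_counts(labels):
    """Return (label, count) pairs sorted by descending count, then by label."""
    counts = Counter(labels)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical toy labels standing in for df['label']
toy_labels = ["stat", "cs", "stat", "math", "cs", "stat"]
for label, n in label_counts(toy_labels):
    print(f"{label}: {n}")
```

With pandas, `df['label'].value_counts()` gives the same tally directly.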


In [5]:
!pip install bertopic

Collecting bertopic
  Downloading bertopic-0.16.2-py2.py3-none-any.whl (158 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.37-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap_learn-0.5.6-py3-none-any.whl (85 kB)
Collecting sentence-transformers>=0.4.1 (from bertopic)


In [6]:
# Filter papers in the Statistics domain (label 'stat')
stat_papers = df[df['label'] == 'stat']

# Select relevant columns; copy to avoid pandas' SettingWithCopyWarning
selected_data = stat_papers[['title', 'abstract']].copy()
from bertopic import BERTopic
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # downloaded here, but not applied in preprocess() below

# Example preprocessing function
def preprocess(text):
    # Remove non-alphabetical characters. Flags must be passed by keyword:
    # the fourth positional argument of re.sub is `count`, not `flags`.
    text = re.sub(r'[^a-zA-Z\s]', '', text, flags=re.I | re.A)
    text = text.lower()  # Convert to lowercase
    return text


# Apply preprocessing to abstracts
selected_data['processed_abstract'] = selected_data['abstract'].apply(preprocess)

# Initialize BERTopic
topic_model = BERTopic()




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [7]:
print(selected_data['processed_abstract'].head())

90000      a representative model in integrative analys...
90001      graph convolutional networks gcns are a wide...
90002      generalized additive models gams are flexibl...
90003      we introduce a new approach called isolatede...
90004      increasing spatial and temporal resolution o...
Name: processed_abstract, dtype: object


In [8]:
from hdbscan import HDBSCAN

# Define the HDBSCAN clustering model. HDBSCAN does not take a target number of
# clusters; the number of topics follows from these density parameters.
hdbscan_model = HDBSCAN(min_cluster_size=10, min_samples=1, cluster_selection_epsilon=0.5)
topic_model = BERTopic(hdbscan_model=hdbscan_model)

# Fit the model and assign each abstract to a topic
topics, probabilities = topic_model.fit_transform(selected_data['processed_abstract'])

# Get the topics
topic_info = topic_model.get_topic_info()
print(topic_info)


    Topic  Count                                         Name  \
0      -1     17                         -1_of_for_the_models   
1       0   9387                              0_the_of_and_to   
2       1    219                           1_the_regret_we_of   
3       2     95                  2_the_change_of_changepoint   
4       3     94                              3_the_of_to_and   
5       4     63                       4_fairness_the_to_fair   
6       5     57                 5_quantile_the_regression_of   
7       6     22                         6_forensic_of_the_to   
8       7     19              7_continual_learning_task_tasks   
9       8     17  8_persistence_diagrams_topological_homology   
10      9     10                  9_linking_linkage_record_of   

                                       Representation  \
0   [of, for, the, models, and, data, to, we, are,...   
1       [the, of, and, to, in, we, is, for, that, on]   
2   [the, regret, we, of, and, to, bandit, in, t

In [9]:
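# The model was already fitted in the previous cell, so refitting is not needed
#docs = selected_data['processed_abstract'].tolist()
#topics, _ = topic_model.fit_transform(docs)

# Visualize the topics; only a cell's last expression is rendered automatically,
# so assign the figure and call .show() to display it here
fig = topic_model.visualize_topics()
fig.show()

# Summarize topic sizes and keywords (this table does not report topic coherence)
topic_model.get_topic_info()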

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,17,-1_of_for_the_models,"[of, for, the, models, and, data, to, we, are,...",[ sequence analysis is an increasingly popula...
1,0,9387,0_the_of_and_to,"[the, of, and, to, in, we, is, for, that, on]",[ bayesian computation of high dimensional li...
2,1,219,1_the_regret_we_of,"[the, regret, we, of, and, to, bandit, in, tha...",[ we consider a sequential assortment selecti...
3,2,95,2_the_change_of_changepoint,"[the, change, of, changepoint, in, and, to, de...",[ highdimensional changepoint analysis is a g...
4,3,94,3_the_of_to_and,"[the, of, to, and, in, teams, players, we, tea...",[ many popular sports involve matches between...
5,4,63,4_fairness_the_to_fair,"[fairness, the, to, fair, of, we, and, that, i...",[ we tackle the problem of algorithmic fairne...
6,5,57,5_quantile_the_regression_of,"[quantile, the, regression, of, and, is, quant...",[ in regression applications the presence of ...
7,6,22,6_forensic_of_the_to,"[forensic, of, the, to, dna, and, in, is, evid...",[ forensic scientists are often criticised fo...
8,7,19,7_continual_learning_task_tasks,"[continual, learning, task, tasks, forgetting,...",[ catastrophic forgetting is the notorious vu...
9,8,17,8_persistence_diagrams_topological_homology,"[persistence, diagrams, topological, homology,...",[ actin cytoskeleton networks generate local ...


In [10]:
# Display the most frequent topics
print("Most Frequent Topics:")
print(topic_model.get_topic_freq().head())

# Display the top 10 words for each topic
for topic in topic_info['Topic'].unique():
    if topic == -1:  # Ignore the outlier topic
        continue
    print(f"\nTopic {topic}:")
    print(topic_model.get_topic(topic))


Most Frequent Topics:
   Topic  Count
0      0   9387
2      1    219
1      2     95
4      3     94
3      4     63

Topic 0:
[('the', 0.05676559212013752), ('of', 0.046882078540008344), ('and', 0.04038884584975076), ('to', 0.0384252747060982), ('in', 0.035544141749786026), ('we', 0.030587575165182526), ('is', 0.028980436987870305), ('for', 0.028726583475436534), ('that', 0.024250012270227078), ('on', 0.021713185060646952)]

Topic 1:
[('the', 0.05766458930611462), ('regret', 0.04332918370032364), ('we', 0.04218281513338516), ('of', 0.041558805331838686), ('and', 0.037913197816338284), ('to', 0.03689503487636221), ('bandit', 0.035931389818618), ('in', 0.03462760465221799), ('that', 0.03322394353819082), ('is', 0.031765273051599506)]

Topic 2:
[('the', 0.059811726338903144), ('change', 0.05661793823389901), ('of', 0.0503180102432511), ('changepoint', 0.04554935577362689), ('in', 0.042291714947847336), ('and', 0.03995481133672558), ('to', 0.03696790515731444), ('detection', 0.0364488149

# report your findings here


#### 1. Data Selection
We focused on the **stat** (statistics) domain from the arXiv-10 dataset.

#### 2. Topic Modeling with BERTopic
We applied BERTopic to the preprocessed abstracts of the selected stat papers. The process involved:
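- **Text Preprocessing**: Removing non-alphabetical characters from the abstracts and converting them to lowercase.
- **BERTopic Initialization and Fitting**: Fitting BERTopic with a custom HDBSCAN model (`min_cluster_size=10`, `min_samples=1`, `cluster_selection_epsilon=0.5`) to discover topics from the preprocessed abstracts.
- **Analysis**: Inspecting topic sizes and top terms, and visualizing the discovered topics.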
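
The topic representations in this notebook are dominated by words such as "the", "of", and "and", which suggests the preprocessing step could also drop stopwords. The following is a minimal sketch of that idea using only the standard library. The function name `preprocess_no_stopwords` and the short `STOPWORDS` set are assumptions for illustration; NLTK's `stopwords.words('english')` would supply a fuller list.

```python
import re

# Small illustrative stopword set (an assumption, not NLTK's full list)
STOPWORDS = {"the", "of", "and", "to", "in", "we", "is", "for", "that", "on", "a", "are"}

def preprocess_no_stopwords(text):
    """Lowercase, keep only letters and whitespace, then drop stopwords."""
    text = re.sub(r"[^a-z\s]", "", text.lower())
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

print(preprocess_no_stopwords("We tackle the problem of algorithmic fairness in ML."))
# prints: tackle problem algorithmic fairness ml
```

BERTopic also accepts a `vectorizer_model` argument, for example scikit-learn's `CountVectorizer(stop_words="english")`. That option filters stopwords from the topic representations only and leaves the text used for embeddings unchanged, so it may be preferable to editing the abstracts themselves.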

#### 3. Results
The analysis identified several topics, each represented by key terms and corresponding documents. Here are the detailed findings:

#### Top Topics:
1. **Topic 0**:
   - **Representation**: [('the', 0.053), ('of', 0.045), ('and', 0.039), ('to', 0.037), ('in', 0.034), ('we', 0.029), ('is', 0.028), ('for', 0.028), ('that', 0.023), ('on', 0.021)]
   - **Representative Docs**: Mixture models, general scientific papers.

2. **Topic 1**:
   - **Representation**: [('the', 0.054), ('regret', 0.043), ('we', 0.040), ('of', 0.039), ('and', 0.036), ('bandit', 0.035), ('to', 0.035), ('in', 0.033), ('that', 0.032), ('is', 0.031)]
   - **Representative Docs**: Sequential decision-making, bandit problems.

3. **Topic 2**:
   - **Representation**: [('the', 0.063), ('of', 0.048), ('to', 0.042), ('and', 0.038), ('in', 0.038), ('teams', 0.034), ('players', 0.032), ('we', 0.027), ('team', 0.027), ('is', 0.025)]
   - **Representative Docs**: Sports analysis, team dynamics.

4. **Topic 3**:
   - **Representation**: [('the', 0.056), ('change', 0.056), ('of', 0.048), ('changepoint', 0.044), ('in', 0.041), ('and', 0.038), ('detection', 0.036), ('to', 0.036), ('changepoints', 0.033), ('we', 0.031)]
   - **Representative Docs**: High-dimensional changepoint analysis.

5. **Topic 4**:
   - **Representation**: [('fairness', 0.119), ('fair', 0.047), ('the', 0.047), ('to', 0.045), ('of', 0.043), ('we', 0.041), ('that', 0.036), ('and', 0.036), ('in', 0.035), ('on', 0.028)]
   - **Representative Docs**: Fair machine learning.

6. **Topic 5**:
   - **Representation**: [('quantile', 0.120), ('the', 0.061), ('regression', 0.056), ('of', 0.045), ('and', 0.040), ('is', 0.039), ('quantiles', 0.037), ('for', 0.033), ('to', 0.031), ('we', 0.030)]
   - **Representative Docs**: Regression applications.

#### Most Frequent Topics:
1. **Topic 0**: Count 9379 - General scientific terms.
2. **Topic 1**: Count 221 - Regret, bandit problems.
3. **Topic 2**: Count 98 - Sports, teams, players.
4. **Topic 3**: Count 94 - Changepoint analysis.
5. **Topic 4**: Count 63 - Fairness in machine learning.
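
To put the topic sizes in perspective, the sketch below computes each topic's share of all documents. The counts are copied from the `get_topic_info()` table printed earlier in this notebook, which may differ slightly from the counts quoted in this report; the helper name `topic_shares` is hypothetical.

```python
# Topic counts copied from the printed get_topic_info() table (-1 is the outlier topic)
topic_counts = {-1: 17, 0: 9387, 1: 219, 2: 95, 3: 94, 4: 63,
                5: 57, 6: 22, 7: 19, 8: 17, 9: 10}

def topic_shares(counts):
    """Return each topic's share of all documents, as a percentage."""
    total = sum(counts.values())
    return {topic: 100.0 * n / total for topic, n in counts.items()}

shares = topic_shares(topic_counts)
print(f"Total documents: {sum(topic_counts.values())}")  # prints: Total documents: 10000
print(f"Topic 0 share: {shares[0]:.2f}%")                # prints: Topic 0 share: 93.87%
```

With roughly 94% of the abstracts in Topic 0, whose top terms are all stopwords, the model has effectively found one dominant cluster. One possible cause, offered as a hypothesis and not verified here, is that `cluster_selection_epsilon=0.5` merges many nearby clusters.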

#### Analysis of Topics Over Time:
We attempted to analyze the topics over time using BERTopic's `topics_over_time` method. However, since timestamps were not available in the dataset, we used a sequence of indices as placeholders for demonstration purposes.



Step 3: Topics may evolve over time. Dynamic topic modeling (DTM) is a collection of techniques for analyzing the evolution of topics over time. These methods show how a topic is represented at different times. For example, people in 1995 may have talked about environmental awareness differently than people in 2015. Although the topic itself, environmental awareness, remains the same, its exact representation might differ.

BERTopic supports DTM; your task in this step is to apply it to the data you selected in Step 2 and summarize your findings.
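
To make the idea concrete, here is a small sketch of the frequency side of DTM: counting how many documents fall into each topic within each time bin. It is not BERTopic's implementation, which also recomputes each topic's keyword representation per time step. The function `topic_counts_per_bin` and the toy data are hypothetical.

```python
from collections import Counter

def topic_counts_per_bin(topics, timestamps, nr_bins):
    """Count documents per (bin, topic) using equal-width bins over the timestamps."""
    lo, hi = min(timestamps), max(timestamps)
    width = (hi - lo) / nr_bins or 1  # fall back to 1 when all timestamps are equal
    counts = Counter()
    for topic, ts in zip(topics, timestamps):
        # The maximum timestamp is placed in the last bin rather than a new one.
        bin_index = min(int((ts - lo) / width), nr_bins - 1)
        counts[(bin_index, topic)] += 1
    return counts

# Hypothetical toy data: six documents, two topics, timestamps 0..5
toy_topics = [0, 0, 1, 0, 1, 1]
toy_timestamps = [0, 1, 2, 3, 4, 5]
counts = topic_counts_per_bin(toy_topics, toy_timestamps, nr_bins=2)
for (bin_index, topic), n in sorted(counts.items()):
    print(f"bin {bin_index}, topic {topic}: {n}")
```

In this toy data, topic 0 goes from two documents in the first bin to one in the second while topic 1 does the opposite, which is the kind of shift a topics-over-time plot is meant to reveal.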

In [None]:
# code for analyzing evolution of topics
# Reuses the topic_model fitted in Step 2 on selected_data['processed_abstract'].
docs = selected_data['processed_abstract'].tolist()

# The dataset has no publication dates, so row indices serve as placeholder timestamps.
timestamps = list(range(len(docs)))

# nr_bins groups the placeholder timestamps into a manageable number of time steps
# (20 is an arbitrary choice); without it, every document would be its own time step.
topics_over_time = topic_model.topics_over_time(docs, timestamps, nr_bins=20)
topic_model.visualize_topics_over_time(topics_over_time)




# report your findings here



1. **Stable interest within the stat domain**: A horizontal line for a topic means its frequency stays roughly constant across the timeline, with no noticeable increase or decrease in the number of papers assigned to it (as with machine learning, for example).

2. **Consistent relevance**: A topic with a flat line maintains its relevance throughout, as is typical of foundational or well-established areas that continuously attract attention.

3. **Lack of emerging trends**: Unlike topics with upward or downward trends, a flat line indicates neither emerging nor declining interest. It suggests a mature and stable area of study.

In the BERTopic topics-over-time plot, such a topic shows no peaks or valleys along the timeline. Because the timestamps here are placeholder row indices and not publication dates, these patterns reflect the order of rows in the dataset and should not be read as actual changes over time.
