# __Step 4.2: Topic model outlier__

BERTopic references:
- [Step-by-step](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6)
- [Deal with situation where most docs are in the -1 topic](https://github.com/MaartenGr/BERTopic/issues/485)

Goals here:
- Proceed with SciBERT, because it leaves the fewest docs unassigned to any topic in the initial model (4_1):
  - distilbert: 323846
  - scibert: 268848
  - biobert: 293790
- Rerun BERTopic with SciBERT using parameter settings different from the initial run (4_1).
  - size of outlier cluster (-1): 241567
- Assess the probability distributions and establish a probability threshold for assigning unassigned docs (outliers) to topics. With a probability threshold of:
  - 0.0067 (~75th percentile): topic(-1)=34622
  - 0.0155 (~95th percentile): topic(-1)=49228 <-- go with this...
    - At this threshold, 11.7% of the documents are not assigned to a topic.
  - 0.0434 (~99th percentile): topic(-1)=124648
- Apply the threshold and generate an updated model. 

## ___Set up___

### Module import

In [1]:
import pickle
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer

### Key variables

In [2]:
# Reproducibility
seed = 20220609

# Setting working directory
proj_dir   = Path.home() / "projects/plant_sci_hist"
work_dir   = proj_dir / "4_topic_model/4_2_outlier_assign"
work_dir.mkdir(parents=True, exist_ok=True)

# plant science corpus
dir25       = proj_dir / "2_text_classify/2_5_predict_pubmed"
corpus_file = dir25 / "corpus_plant_421658.tsv.gz"

# processed docs
dir41            = proj_dir / "4_topic_model/4_1_compare_models"
docs_clean_file  = dir41 / "corpus_plant_421658_proc_txt.pkl"

# embedding model
emb_model_name = "allenai/scibert_scivocab_uncased" 


## ___Load data___

### Load cleaned data and topic model

In [3]:
with open(docs_clean_file, "rb") as f:
  docs_clean = pickle.load(f)

In [4]:
len(docs_clean), docs_clean[0]

(421658,
 'identification 120 mus phase decay delayed fluorescence spinach chloroplasts subchloroplast particles intrinsic back reaction . dependence level phase thylakoids internal ph . 500 mus laser flash 120 mus phase decay delayed fluorescence visible variety circumstances spinach chloroplasts subchloroplast particles enriched photosystem ii prepared means digitonin . level phase high case inhibition oxygen evolution donor side photosystem ii . comparison results babcock sauer ( 1975 ) biochim . bio-phys . acta 376 , 329-344 , indicates epr signal iif suppose due z+ , oxidized first secondary donor photosystem ii , well correlated large amplitude 120 mus phase . explain 120 mus phase intrinsic back reaction excited reaction center presence z+ , predicted van gorkom donze ( 1973 ) photochem . photobiol . 17 , 333-342. redox state z+ dependent internal ph thylakoids . results effect ph mus region compared obtained ms region .')

### Get doc embeddings 

In [None]:
# Generate embeddings
emb_model  = SentenceTransformer(emb_model_name)
embeddings = emb_model.encode(docs_clean, show_progress_bar=True)
# Output embeddings
with open(work_dir / "embeddings_scibert.pickle", "wb") as f:
  pickle.dump(embeddings, f)

In [None]:
# Load embeddings
with open(work_dir / "embeddings_scibert.pickle", "rb") as f:
  embeddings = pickle.load(f)

In [None]:
type(embeddings), embeddings.shape
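
The two cells above (compute and save, then load) can be folded into one cache-aware helper so the expensive encoding step runs only when the pickle is missing. This is a minimal sketch, not part of the original notebook: `load_or_compute` and `fake_encode` are hypothetical names, and the toy encoder stands in for the real `emb_model.encode(...)` call.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_file, compute_fn):
    """Return the pickled object at cache_file; compute and save it first if absent."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(cache_file, "wb") as f:
        pickle.dump(result, f)
    return result


# Toy stand-in for the expensive embedding step (hypothetical)
calls = []


def fake_encode():
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "embeddings_demo.pickle"
    first = load_or_compute(cache, fake_encode)   # computes and writes the pickle
    second = load_or_compute(cache, fake_encode)  # reads the pickle from disk
```

In this notebook, `compute_fn` could be something like `lambda: emb_model.encode(docs_clean, show_progress_bar=True)` with `work_dir / "embeddings_scibert.pickle"` as the cache file.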

## ___Run BERTopic___

### Set parameters

In [None]:
# HDBSCAN clustering setting
min_cluster_size         = 500 
metric                   = 'euclidean' 
cluster_selection_method ='eom' 
prediction_data          = True 
min_samples              = 5

# BERTopic setting
calculate_probabilities = True
n_neighbors             = 10   # defined here but not passed to BERTopic in the cells below
nr_topics               = 500
n_gram_range            = (1,2)

### Initialize HDBSCAN

To reduce outliers, the settings follow [these instructions](https://maartengr.github.io/BERTopic/faq.html#how-do-i-reduce-topic-outliers):
- Also see [HDBSCAN doc](https://hdbscan.readthedocs.io/en/latest/basic_hdbscan.html#what-about-different-metrics)
- Comparison of [distance metrics](https://www.kdnuggets.com/2019/01/comparison-text-distance-metrics.html)

In [None]:
hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size, 
                        metric=metric, 
                        cluster_selection_method=cluster_selection_method, 
                        prediction_data=prediction_data, 
                        min_samples=min_samples)

### Initialize and train topic model

In [None]:
topic_model = BERTopic(hdbscan_model=hdbscan_model,
                       calculate_probabilities=calculate_probabilities,
                       n_gram_range=n_gram_range,
                       nr_topics=nr_topics,
                       verbose=True)

In [None]:
topics, probs = topic_model.fit_transform(docs_clean,
                                          embeddings)

### Save model, topics, and probability

In [None]:
# The embeddings were already saved above, so they are not saved again
topic_model.save(work_dir / 'topic_model')

with open(work_dir / 'probs.pickle', "wb") as f:
  pickle.dump(probs, f)

In [5]:
# Load topic model
topic_model = BERTopic.load(work_dir / 'topic_model')

In [6]:
# load prob
with open(work_dir / 'probs.pickle', "rb") as f:
  probs = pickle.load(f)

In [7]:
topic_info = topic_model.get_topic_info()
topic_info

Unnamed: 0,Topic,Count,Name
0,-1,241567,-1_plant_plants_species_growth
1,0,919,0_allergen_allergens_pollen_ige
2,1,3976,1_medium_callus_regeneration_culture
3,2,1111,2_dots_fluorescence_detection_carbon dots
4,3,859,3_glyphosate_herbicide_resistance_herbicides
...,...,...,...
86,85,825,85_soil_yield_nitrogen_fertilizer
87,86,567,86_inbreeding_depression_inbreeding depression...
88,87,2828,87_pollen_pollination_flowers_floral
89,88,1849,88_populations_genetic_diversity_genetic diver...


## ___Assign outliers to topics___

### Determine probability distributions

In [None]:
probs.shape, probs[:,0].shape

In [None]:
# Cluster 0
pd.DataFrame(probs[:,0]).describe()

In [None]:
sns.histplot(probs[:,0], log_scale=True)
plt.xlim(1e-5, 1e-1)
plt.savefig(work_dir / "fig4_2_prob_cluster0.pdf", bbox_inches='tight')

In [None]:
# Cluster 1
pd.DataFrame(probs[:,1]).describe()

In [None]:
sns.histplot(probs[:,1], log_scale=True)
plt.xlim(1e-5, 1e-1)
plt.savefig(work_dir / "fig4_2_prob_cluster1.pdf", bbox_inches='tight')
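
Rather than inspecting one cluster at a time, the same per-topic summary can be computed for every column of the probability matrix in one call. The sketch below assumes a documents-by-topics matrix like `probs`; it uses a small synthetic matrix (`demo_probs`, a hypothetical name) rather than the real data, so the printed numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(20220609)
# Synthetic stand-in for probs: 1000 docs x 5 topics
demo_probs = rng.dirichlet(alpha=np.full(5, 0.5), size=1000)

# Percentiles to report for each topic column
pcts = [50, 75, 95, 99]
# Shape (len(pcts), n_topics): one row per percentile, one column per topic
summary = np.percentile(demo_probs, pcts, axis=0)

for topic_id in range(demo_probs.shape[1]):
    values = ", ".join(
        f"p{p}={summary[i, topic_id]:.4f}" for i, p in enumerate(pcts)
    )
    print(f"topic {topic_id}: {values}")
```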

In [8]:
topic_freq = topic_model.get_topic_freq()
topic_freq

Unnamed: 0,Topic,Count
0,-1,241567
1,61,11209
2,12,8942
3,69,7685
4,35,6913
...,...,...
86,30,522
87,25,508
88,78,506
89,38,503


### Assignments

In [9]:
# Get the overall probability values at three different percentiles
np.percentile(probs, 75), np.percentile(probs, 95), np.percentile(probs, 99)

(0.006735027420849654, 0.015512210159378426, 0.04337546078552455)

In [10]:
probability_threshold = np.percentile(probs, 95)
# Assign each doc to its most probable topic, or to -1 if that probability is below the threshold
new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1
              for prob in probs]
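
To see how the choice of percentile affects the size of the outlier group, the assignment rule can be wrapped in a function and swept over several thresholds. This is a sketch under stated assumptions: `assign_with_threshold` is a hypothetical helper that vectorizes the same rule as the list comprehension above, and `demo_probs` is synthetic, so the counts will not match the real corpus. As in the cell above, the percentile is taken over all entries of the matrix, not over each document's maximum.

```python
import numpy as np


def assign_with_threshold(probs, threshold):
    """Assign each row to its most probable topic, or -1 if that probability is below threshold."""
    probs = np.asarray(probs)
    best_topic = probs.argmax(axis=1)
    best_prob = probs.max(axis=1)
    return np.where(best_prob >= threshold, best_topic, -1)


rng = np.random.default_rng(20220609)
# Synthetic stand-in for probs: 2000 docs x 20 topics
demo_probs = rng.dirichlet(alpha=np.full(20, 0.3), size=2000)

outlier_counts = {}
for pct in (75, 95, 99):
    threshold = np.percentile(demo_probs, pct)  # percentile over all matrix entries
    topics = assign_with_threshold(demo_probs, threshold)
    outlier_counts[pct] = int((topics == -1).sum())
    print(f"{pct}th percentile, threshold={threshold:.4f}: {outlier_counts[pct]} outliers")
```

A higher threshold can only leave the same number of docs or more in topic -1, which matches the trend in the counts listed in the goals at the top of this notebook.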

In [11]:
pd.Series(new_topics).value_counts()

-1     49228
 61    16183
 35    13599
 81    10403
 79    10390
       ...  
 57      828
 21      789
 3       751
 66      673
 5       620
Length: 91, dtype: int64

In [12]:
49228/len(new_topics)

0.11674864463617433

### Update topics

See [this post](https://github.com/MaartenGr/BERTopic/issues/529):
- After calling `update_topics`, the topics reported by topic_model remain unchanged.
- It turns out that the topic sizes need to be updated as well. This seems weird...
- It took me 2 hours to find this...
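
The underlying point is that topic counts are derived from the list of assignments, so they go stale once docs are reassigned. The toy example below illustrates this with plain Python. It is not BERTopic's internal logic, and `topic_sizes`, `old_topics`, and `new_topics_demo` are hypothetical names.

```python
from collections import Counter


def topic_sizes(assignments):
    """Count documents per topic id, ordered from largest to smallest topic."""
    return dict(Counter(assignments).most_common())


# Toy assignments before and after moving two outliers into topics 0 and 1
old_topics = [-1, -1, -1, 0, 1, 1]
new_topics_demo = [-1, 0, 1, 0, 1, 1]

old_sizes = topic_sizes(old_topics)
new_sizes = topic_sizes(new_topics_demo)
print(old_sizes)
print(new_sizes)
```

Any table built from the old counts would still report three outliers after reassignment, which is the symptom described in the notes above.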


In [13]:
# ~8 min
topic_model.update_topics(docs_clean, new_topics)

In [15]:
# Update topic frequencies
documents = pd.DataFrame({"Document": docs_clean, "Topic": new_topics})
topic_model._update_topic_size(documents)

In [16]:
topic_info_changed = topic_model.get_topic_info()
topic_info_changed

Unnamed: 0,Topic,Count,Name
0,-1,49228,-1_plant_plants_genes_cell
1,0,895,0_allergen_allergens_pollen_ige
2,1,2917,1_medium_callus_regeneration_mgl
3,2,1098,2_dots_fluorescence_detection_carbon dots
4,3,751,3_glyphosate_resistance_herbicide_herbicides
...,...,...,...
86,85,5989,85_soil_yield_nitrogen_water
87,86,4315,86_populations_genetic_selection_inbreeding
88,87,3737,87_pollen_pollination_flowers_floral
89,88,3807,88_populations_genetic_population_species


In [17]:
topic_model.save(work_dir / 'topic_model_updated')

### Get updated topic info

In [18]:
# Load topic model
topic_model_updated = BERTopic.load(work_dir / 'topic_model_updated')

In [19]:
topic_info_updated = topic_model_updated.get_topic_info()
topic_info_updated

Unnamed: 0,Topic,Count,Name
0,-1,49228,-1_plant_plants_genes_cell
1,0,895,0_allergen_allergens_pollen_ige
2,1,2917,1_medium_callus_regeneration_mgl
3,2,1098,2_dots_fluorescence_detection_carbon dots
4,3,751,3_glyphosate_resistance_herbicide_herbicides
...,...,...,...
86,85,5989,85_soil_yield_nitrogen_water
87,86,4315,86_populations_genetic_selection_inbreeding
88,87,3737,87_pollen_pollination_flowers_floral
89,88,3807,88_populations_genetic_population_species


The analysis continues in 4.3 (model analysis).