# __Step 4.1: Topic model__

The k-means clustering results do not make it particularly clear what is going on, so go straight to topic modeling.

Goals:
- Get topic models
- Get topics

The modeling run is done on HPC through `script_4_1_topic_model_v2.py`.
- RuntimeError: CUDA error: no kernel image is available for execution on the device
  - Fixed by creating a new bertopic environment on HPC with new install of pytorch.
- /var/lib/slurmd/job58225495/slurm_script: line 19: 28351 Killed
  - Unsure why. Error did not occur again.
- max_topic = unique_topics[-1], IndexError: list index out of range
  - Fixed by initializing BERTopic without specifying `min_topic_size` or `nr_topics`.

Topic model comparison
- All three (distilbert, scibert, biobert) have similar top topics with similar wordings. Thus even a general-purpose BERT model, after retraining with the corpus, can pick out the topics.
- Among them, scibert has the fewest docs in the -1 (outlier) topic, so it is chosen as the embedding model for further analysis (4.2).
- There are still 268848 docs in the -1 topic of the scibert model. Use [the tips from BERTopic FAQ](https://maartengr.github.io/BERTopic/faq.html#how-do-i-reduce-topic-outliers) to reduce that number.
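
One common way to shrink the -1 topic is to fit with `calculate_probabilities=True` and move an outlier document to its most probable topic when that probability clears a threshold. The following is a minimal sketch of that reassignment step only, assuming `topics` and `probs` are shaped like the pair returned by `fit_transform`. The function name `reassign_outliers`, the toy data, and the 0.3 threshold are illustrative assumptions, not values from this run.

```python
def reassign_outliers(topics, probs, threshold):
    """Move outlier docs (-1) to their most probable topic if it is likely enough.

    topics: list of topic ids, one per doc (-1 marks an outlier).
    probs:  per-doc sequences of topic probabilities; position i is topic i.
    """
    new_topics = []
    for topic, row in zip(topics, probs):
        if topic == -1:
            # Index of the highest probability in this row
            best = max(range(len(row)), key=lambda i: row[i])
            if row[best] >= threshold:
                topic = best
        new_topics.append(topic)
    return new_topics


# Toy data: 4 docs, 3 topics (hypothetical values).
toy_topics = [0, -1, -1, 2]
toy_probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.60, 0.05],  # outlier, but topic 1 is fairly likely
    [0.05, 0.10, 0.08],  # outlier with no convincing topic
    [0.02, 0.03, 0.80],
]
print(reassign_outliers(toy_topics, toy_probs, threshold=0.3))  # [0, 1, -1, 2]
```

A higher threshold keeps more documents in -1; a lower one trades outliers for noisier topics.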

## ___Set up___

### Module import

In [1]:
import re, pickle, os
import pandas as pd
from pathlib import Path
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from tqdm import tqdm
from bertopic import BERTopic

### Key variables

In [2]:
# Reproducibility
seed = 20220609

# Setting working directory
proj_dir   = Path.home() / "projects/plant_sci_hist"
work_dir   = proj_dir / "4_topic_model/4_1_compare_models"
work_dir.mkdir(parents=True, exist_ok=True)

os.chdir(work_dir)

# plant science corpus
dir25       = proj_dir / "2_text_classify/2_5_predict_pubmed"
corpus_file = dir25 / "corpus_plant_421658.tsv.gz"

# Output files

## The list obj containing cleaned docs
docs_clean_file  = work_dir / "docs_clean.pickle"

## model names 
model1 = "distilbert-base-uncased"
model2 = "allenai/scibert_scivocab_uncased" # scibert
model3 = "dmis-lab/biobert-base-cased-v1.2" # biobert

model2_mod = "-".join(model2.split("/"))
model3_mod = "-".join(model3.split("/"))

## topic models
model1_file = work_dir / f'topic_model_{model1}'
model2_file = work_dir / f'topic_model_{model2_mod}'
model3_file = work_dir / f'topic_model_{model3_mod}'

## topics
topic1_file = work_dir / f'topics_{model1}.pickle'
topic2_file = work_dir / f'topics_{model2_mod}.pickle'
topic3_file = work_dir / f'topics_{model3_mod}.pickle'
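
The `model2_mod` and `model3_mod` names exist because the Hugging Face identifiers contain a slash, which `pathlib` would treat as a directory separator. A small sketch of the difference, using a placeholder directory name (`work_dir` here is just a string, not the real project path):

```python
from pathlib import PurePosixPath

model_name = "allenai/scibert_scivocab_uncased"
work = PurePosixPath("work_dir")  # placeholder directory for illustration

# Using the raw identifier introduces an unintended sub-directory.
raw_path = work / f"topic_model_{model_name}"

# Replacing the slash gives the same result as "-".join(model_name.split("/")).
safe_name = model_name.replace("/", "-")
safe_path = work / f"topic_model_{safe_name}"

print(raw_path.parts)
print(safe_path.parts)
```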

### Preprocess corpus

In [None]:
corpus_df = pd.read_csv(corpus_file, sep='\t', compression='gzip')
corpus_df.head(2)

In [None]:
def clean_text(x):
    x = str(x)
    x = x.lower()
    # Replace each '#' plus any alphanumeric characters directly following it
    # (hashtag-like tokens such as '#abc123') with a space. Other
    # non-alphanumeric characters are left untouched.
    x = re.sub(r'#[A-Za-z0-9]*', ' ', x)
    # Tokenize and drop any token that is a stop word
    # (stop_words_dict is defined in the next cell)
    tokens = word_tokenize(x)
    x = ' '.join([w for w in tokens if w not in stop_words_dict])
    return x
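
To see what the regular expression in `clean_text` actually matches, here is a small stand-alone check on a made-up sentence. It leaves out the NLTK tokenization and stop-word steps so that it runs without any downloads; the sample string is hypothetical.

```python
import re

pattern = r'#[A-Za-z0-9]*'
sample = "gene #abc123 expression (p < 0.05) # test"

# Only '#' and the alphanumerics attached to it are replaced.
cleaned = re.sub(pattern, ' ', sample)
# Collapse the runs of spaces left behind so the result is easier to read.
cleaned = ' '.join(cleaned.split())
print(cleaned)  # gene expression (p < 0.05) test
```

Punctuation such as the parentheses and `<` survives, so the pattern removes hashtag-like tokens rather than all non-alphanumeric characters.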

In [None]:
docs       = corpus_df['txt']
stop_words = stopwords.words('english')
# Dict keys give constant-time membership tests in clean_text
stop_words_dict = {w: 1 for w in stop_words}

In [None]:
# Clean every document, with a progress bar
docs_clean = [clean_text(doc) for doc in tqdm(docs)]
len(docs_clean)

In [None]:
with open(docs_clean_file, "wb") as f:
  pickle.dump(docs_clean, f)

## ___Run BERTopic___

### Initialize

- language: str = 'english'
- top_n_words: int = 10
  - The number of words per topic to extract. __Setting this too high can negatively impact topic embeddings__ as topics are typically best represented by at most __10 words__.
- n_gram_range: Tuple[int, int] = (1, 1)
  - The n-gram range for the CountVectorizer. Keep it between 1 and 3; larger ranges cause memory issues.
- min_topic_size: int = 10
  - The minimum size of the topic.
- nr_topics: Union[int, str] = None
  - Specifying the number of topics will reduce the initial number of topics to the value specified.
  - Use __"auto"__ to automatically reduce topics using HDBSCAN
- calculate_probabilities: bool = False
  - Whether to calculate the probabilities of all topics per document instead of the probability of the assigned topic per document.
  - Will significantly increase computing time if True.
- diversity: float = None
  - Whether to use MMR to diversify the resulting topic representations.
  - Value between 0 (no diversity) and 1 (very diverse).
    - __Q: What does diversity mean here?__
- seed_topic_list: List[List[str]] = None
  - A list of seed words per topic to converge around.
- embedding_model=None
  - SentenceTransformers, Flair, Spacy, Gensim, USE (TF-Hub), or [these](https://www.sbert.net/docs/pretrained_models.html).
  - Try to use `allenai-specter`.
- umap_model: umap.umap_.UMAP = None
- hdbscan_model: hdbscan.hdbscan_.HDBSCAN = None
- vectorizer_model: sklearn.feature_extraction.text.CountVectorizer = None
- verbose: bool = False
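
On the question of what diversity means: maximal marginal relevance (MMR) picks topic words one at a time, rewarding relevance to the topic and penalizing similarity to words already chosen. The sketch below is a plain-Python illustration of that trade-off, not BERTopic's own code. The function name `mmr_select`, the three candidate words, and all similarity values are made up for the example.

```python
def mmr_select(relevance, similarity, top_n, diversity):
    """Pick top_n candidate indices by maximal marginal relevance.

    relevance[i]:     similarity of candidate i to the topic.
    similarity[i][j]: similarity between candidates i and j.
    diversity:        0 ranks by relevance only; 1 only avoids redundancy.
    """
    if not relevance or top_n <= 0:
        return []
    # Start from the most relevant candidate.
    selected = [max(range(len(relevance)), key=lambda i: relevance[i])]
    candidates = [i for i in range(len(relevance)) if i != selected[0]]
    while candidates and len(selected) < top_n:
        def score(i):
            redundancy = max(similarity[i][j] for j in selected)
            return (1 - diversity) * relevance[i] - diversity * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


words = ["drought", "droughts", "aba"]       # hypothetical candidates
relevance = [0.90, 0.85, 0.60]               # hypothetical word-topic similarity
similarity = [[1.00, 0.95, 0.30],
              [0.95, 1.00, 0.30],
              [0.30, 0.30, 1.00]]

print([words[i] for i in mmr_select(relevance, similarity, 2, diversity=0.0)])
print([words[i] for i in mmr_select(relevance, similarity, 2, diversity=0.7)])
```

With `diversity=0.0` the near-duplicate "droughts" is kept; with `diversity=0.7` it is dropped in favor of "aba".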

In [None]:
# The commented-out call below failed with:
#   max_topic = unique_topics[-1]
#   IndexError: list index out of range
# Dropping the min_topic_size and nr_topics settings allowed the run to
# complete.
#topic_model = BERTopic(calculate_probabilities=False,
#                       n_gram_range=(1,2),
#                       min_topic_size=1000, 
#                       nr_topics='auto',
#                       embedding_model=model1,
#                       verbose=True)
topic_model1 = BERTopic(calculate_probabilities=False,
                        n_gram_range=(1,2),
                        embedding_model=model1,
                        verbose=True)

### Fit_transform

Long run time, so the fit was switched to HPC.

In [None]:
# fit_transform returns a (topics, probabilities) tuple, so topics1 holds both
topics1 = topic_model1.fit_transform(docs_clean)

In [None]:
# Save the fitted model and the fit_transform output
topic_model1.save(model1_file)
with open(topic1_file, "wb") as f:
    pickle.dump(topics1, f)
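
Because `fit_transform` returns a pair (topic assignments, probabilities), the pickled object is a tuple and has to be unpacked before counting anything. A minimal sketch with a toy stand-in for that tuple; the values are hypothetical, and the probabilities are set to `None` only to keep the example small:

```python
from collections import Counter

# Stand-in for the object returned by fit_transform: (topics, probabilities).
toy_result = ([-1, 0, 0, 1, -1, -1], None)

topics, probs = toy_result
counts = Counter(topics)

# Share of docs that ended up in the outlier topic (-1)
outlier_fraction = counts[-1] / len(topics)
print(counts.most_common())
print(outlier_fraction)
```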

## ___Analysis of fitted model___

### distilbert

In [5]:
topic_model1_loaded = BERTopic.load(model1_file)
topic_info1 = topic_model1_loaded.get_topic_info()
type(topic_info1), topic_info1.shape

(pandas.core.frame.DataFrame, (1443, 3))

In [10]:
topic_info1.head(20)

|   | Topic | Count | Name |
|---|-------|-------|------|
| 0 | -1 | 323846 | -1_plant_plants_activity_growth |
| 1 | 0 | 4799 | 0_spectroscopic_nmr_elucidated_compounds |
| 2 | 1 | 2567 | 1_cd_pb_zn_soil |
| 3 | 2 | 2024 | 2_qtls_qtl_traits_trait |
| 4 | 3 | 2011 | 3_μm_conidia_first report_et al |
| 5 | 4 | 1960 | 4_resistance_rust_qtl_rust resistance |
| 6 | 5 | 1859 | 5_medium_regeneration_callus_ms medium |
| 7 | 6 | 1456 | 6_drought_tolerance_stress_aba |
| 8 | 7 | 1455 | 7_transformation_agrobacterium_tumefaciens_med... |
| 9 | 8 | 1424 | 8_genomics_technologies_breeding_biology |


In [14]:
topic_info1.to_csv(str(model1_file) + "_topics.tsv", sep='\t')

### scibert

In [3]:
topic_model2_loaded = BERTopic.load(model2_file)
topic_info2 = topic_model2_loaded.get_topic_info()
topic_info2.shape

(1697, 3)

In [11]:
topic_info2.head(20)

|   | Topic | Count | Name |
|---|-------|-------|------|
| 0 | -1 | 268848 | -1_plant_species_plants_growth |
| 1 | 0 | 6592 | 0_spectroscopic_nmr_elucidated_compounds |
| 2 | 1 | 6073 | 1_transcriptome_degs_unigenes_differentially e... |
| 3 | 2 | 4369 | 2_cd_pb_cu_zn |
| 4 | 3 | 2618 | 3_qtls_qtl_traits_quantitative trait |
| 5 | 4 | 2522 | 4_et al_first report_symptoms_et |
| 6 | 5 | 2292 | 5_regeneration_medium_callus_ms medium |
| 7 | 6 | 2168 | 6_virus_viruses_nucleotide sequence_coat protein |
| 8 | 7 | 2056 | 7_tolerance_drought_salt_stress |
| 9 | 8 | 1859 | 8_gene family_family_genome wide_wide identifi... |


In [15]:
topic_info2.to_csv(str(model2_file) + "_topics.tsv", sep='\t')

### biobert

In [8]:
topic_model3_loaded = BERTopic.load(model3_file)
topic_info3 = topic_model3_loaded.get_topic_info()
topic_info3.shape

(1452, 3)

In [12]:
topic_info3.head(20)

|   | Topic | Count | Name |
|---|-------|-------|------|
| 0 | -1 | 293790 | -1_plant_growth_activity_plants |
| 1 | 0 | 3403 | 0_qtl_qtls_mapping_rust |
| 2 | 1 | 2891 | 1_communities_community_rhizosphere_microbial |
| 3 | 2 | 2671 | 2_degs_unigenes_transcriptome_differentially e... |
| 4 | 3 | 2669 | 3_transformation_agrobacterium_tumefaciens_med... |
| 5 | 4 | 2510 | 4_aba_tolerance_overexpression_stress |
| 6 | 5 | 2358 | 5_et al_first report_symptoms_conidia |
| 7 | 6 | 2265 | 6_gene family_family_genome wide_genes |
| 8 | 7 | 2174 | 7_virus_viruses_nucleotide sequence_mosaic |
| 9 | 8 | 2145 | 8_cd_pb_zn_contaminated |


In [16]:
topic_info3.to_csv(str(model3_file) + "_topics.tsv", sep='\t')
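
To put the three -1 rows side by side, the outlier counts reported above can be turned into fractions of the corpus. This is a rough sketch: the total of 421658 docs is an assumption taken from the corpus file name (`corpus_plant_421658.tsv.gz`), not a number reported by the models.

```python
# Assumption: total number of docs is read off the corpus file name
# (corpus_plant_421658.tsv.gz); it is not reported directly in this notebook.
total_docs = 421658

# Counts of the -1 topic from the three topic_info tables
outlier_counts = {"distilbert": 323846, "scibert": 268848, "biobert": 293790}

outlier_fraction = {m: n / total_docs for m, n in outlier_counts.items()}
for model, frac in sorted(outlier_fraction.items(), key=lambda kv: kv[1]):
    print(f"{model}: {frac:.1%}")

# Model with the smallest share of outlier docs
best = min(outlier_fraction, key=outlier_fraction.get)
print(best)
```

Under that assumption, well over half of the corpus is unassigned in every model, which is why reducing the outliers matters for the follow-up analysis.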