## <font color='red'> BERT for Arabic Topic Modeling: An Experimental Study on BERTopic Technique

In [2]:
# we start with installing bertopic from pypi before preparing the data

!pip install bertopic[all]
!pip install flair


  Attempting uninstall: hyperopt
    Found existing installation: hyperopt 0.1.2
    Uninstalling hyperopt-0.1.2:
      Successfully uninstalled hyperopt-0.1.2
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pynndescent 0.5.8 requires importlib-metadata>=4.8.1; python_version < "3.8", but you have importlib-metadata 3.10.1 which is incompatible.
markdown 3.4.1 requires importlib-metadata>=4.4; python_version < "3.10", but you have importlib-metadata 3.10.1 which is incompatible.
gym 0.25.2 requires importlib-metadata>=4.8.0; python_version < "3.10", but you have importlib-metadata 3.10.1 which is incompatible.[0m
Successfully installed bpemb-0.3.4 conllu-4.5.2 deprecated-1.2.13 flair-0.11.3 ftfy-6.1.1 hyperopt-0.2.7 importlib-metadata-3.10.1 janome-0.4.2 konoha-4.6.5 langdetect-1.0.9 mpld3-0.3 overrides-3.1.0 pptree-3.1 py4j-0.10.9.7 requests-2.28.1 segto

In [None]:
import re
import pandas as pd
from bertopic import BERTopic
from flair.embeddings import TransformerDocumentEmbeddings
from gensim.models.coherencemodel import CoherenceModel
import gensim.corpora as corpora
from gensim.models import LdaMulticore
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

In [None]:
# add your data path 

data=  pd.read_excel("data.xlsx")
data.head()

In [None]:
data.isnull().mean()*100

In [7]:
data.shape

(4879, 42)

In [None]:
documents = data['Text'].values
documents

## Embedding model
BERTopic has two default embedding models: "distilbert-base-nli-stsb-mean-tokens'' for the English language and "xlm-r-bert-base-nli-stsb-meantokens" for any language other than English, where XLM-R models support 50+ languages.

Also, you can select any model from [Hugging Face](https://huggingface.co/models)  and use it instead of the preselected models by simply passing the model through
BERTopic with embedding_model.

For more deatelis check out BERTopic decomntion [here](https://maartengr.github.io/BERTopic/tutorial/embeddings/embeddings.html).

In [92]:
!pip install Arabic-Stopwords

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Arabic-Stopwords
  Downloading Arabic_Stopwords-0.3-py3-none-any.whl (353 kB)
[K     |████████████████████████████████| 353 kB 7.1 MB/s 
[?25hCollecting pyarabic>=0.6.2
  Downloading PyArabic-0.6.15-py3-none-any.whl (126 kB)
[K     |████████████████████████████████| 126 kB 21.3 MB/s 
Installing collected packages: pyarabic, Arabic-Stopwords
Successfully installed Arabic-Stopwords-0.3 pyarabic-0.6.15


In [94]:
## Let's clean the texts.
import nltk

In [97]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [182]:
words_stp=['من','انا','على','السيارة','ولا','عن','السلام','هل','اي','ما','عليكم','الله','مع']

In [None]:
arb_stopwords = list(nltk.corpus.stopwords.words("arabic"))
arb_stopwords

In [184]:
for w in words_stp:
  if w not in arb_stopwords:
    arb_stopwords.append(w)

In [None]:
#show stopwords.
arb_stopwords

In [None]:
for t in documents:
  print(t)
  print("**"*10)

In [195]:
def remove_stpwords(t):
  for w in t.split():
    if w.strip()  in (arb_stopwords):
      t=t.replace(w,'')
      #print(w)
  return t.strip()


  
#function to remove emojis.
def remove_emojis(data):
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # chinese char
        u"\U00002702-\U000027B0"
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642" 
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # dingbats
        u"\u3030"
                      "]+", re.UNICODE)
    return re.sub(emoj, '', data)



def clean_text(row):
    text=str(row)
    #remove stopwords.
    text=remove_stpwords(text)
    #deal wiith tashkeel
    p_tashkeel = re.compile(r'[\u0617-\u061A\u064B-\u0652,\U0001fae3\u200f\u2066\u2069\u200e£\u2067\u202b\u202c\u2069]')
    text = re.sub(p_tashkeel,"", text)
    text=re.sub(r'[\u200f\u2066\u202a\u202c\u2069\\\\u2066]','',text)
    #clean english words
    english=re.compile(r'[a-zA-Z/d]')
    text=re.sub(english,'',text)
    #ealing with /n \r ( ) []
    text=re.sub(r"[!\n\s\-%()@#+,;&^%$#@!“’‘.>؟:?'�ℂ__…'٪؜‼è،]",' ',text).strip()
    #clean digits
    text=re.sub(r'[\d]','',text)
    text=re.sub(r'\[]','',text)
    text=re.sub(r'''[*”؛•"መልስ]''',"",text)
    text=re.sub(r'\[[]\s.]+','',text)
    text=re.sub(r"[\s.]+",' ',text)
    #return the cleaned description
    text=remove_emojis(text)

    
    row=text.split()
    if len(row)<=2:
      return "Null"
    else:
      return text.strip()
    


In [None]:
#Now we cleaned the text.
data['clean']= data['Text'].apply(clean_text)
#data['clean']= data['clean'].apply(remove_emojis)

data['clean'].head()

In [197]:
## let's check nulls.
print("Data has:")
print(f"{data[data['clean']=='Null'].shape[0]} Nulls")

Data has:
2868 Nulls


In [198]:
#let's drop nulls
idx=data[data['clean']=='Null'].index

In [199]:
data.drop(idx,axis=0,inplace=True)

In [200]:
data.reset_index(inplace=True,drop=True)

In [None]:
for row in data['clean']:
  print(row)
  print("**"*10)

In [202]:
data['clean'].isnull().mean()*100

0.0

In [20]:
#to experiment with other BERT models simply change the model name below

arabert = TransformerDocumentEmbeddings('aubmindlab/bert-base-arabertv02')

Downloading:   0%|          | 0.00/381 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/384 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/825k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.64M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/543M [00:00<?, ?B/s]

# **Create Topics**


For BERTopic you do not need to define the number of topics in advance, however, if you want to do so simply pass the number of topics to BERTopic with nr_topics paramete.

In [203]:
documents=data['clean'].values

In [204]:
topic_model = BERTopic(language="arabic", low_memory=True ,calculate_probabilities=False,
                     embedding_model=arabert)

NOTE: Calculating probabilities can slow down BERTopic significantly at large amounts of data (>100_000 documents). It is advised to turn this off if you want to speed up the model.

In [205]:
topics, probs = topic_model.fit_transform(documents)

In [206]:
#extract most frequent topics

topic_model.get_topic_freq().head(5)

Unnamed: 0,Topic,Count
0,0,1786
1,1,42
2,2,35
3,-1,31
4,3,20


-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated:

In [None]:
#show the top 10 words in topic 1

topic_model.get_topic(0)

# Evaluation
To evaluate the model topics coherence we use [Gensim](https://radimrehurek.com/gensim/models/coherencemodel.html) implementation of the Normalized
Pointwise Mutual Information (NPMI).

In [212]:
texts = [[word for word in str(document).split()] for document in documents]
id2word = corpora.Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]

In [213]:
topics=[]
for i in topic_model.get_topics():
  row=[]
  topic= topic_model.get_topic(i)
  for word in topic:
     row.append(word[0])
  topics.append(row)

In [214]:
# compute Coherence Score

cm = CoherenceModel(topics=topics, texts=texts, corpus=corpus, dictionary=id2word, coherence='c_npmi')
coherence = cm.get_coherence() 
print('\nCoherence Score: ', coherence)


Coherence Score:  -0.12998357811408848


# **Visualize Topics**
After having trained our `BERTopic` model, we can iteratively go through perhaps a hundred topic to get a good 
understanding of the topics that were extract. However, that takes quite some time and lacks a global representation. 
Instead, we can visualize the topics that were generated in a way very similar to 
[LDAvis](https://github.com/cpsievert/LDAvis):

In [219]:
topic_model.visualize_topics()

In [215]:
topic_model.visualize_documents(documents)

In [217]:
topic_model.visualize_heatmap()

# **Model serialization**
The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved. However, UMAP and HDBSCAN will be saved. 

In [None]:
# Save model
topic_model.save("my_model")

In [None]:
# Load model
my_model = BERTopic.load("my_model")

## <font color='red'>Thank U <3 

End !