# Topic Modeling for Intent Recommendation

## 1. Configuration and setup

### 1.1 Installation
Because Watson Studio creates a new environment for each run, BERTopic and spaCy have to be installed every time.


In [None]:
!pip install -U BERTopic

In [None]:
!pip install -U spacy

### 1.2 Import libraries

In [None]:
import pandas as pd
from tempfile import TemporaryFile
import io
from io import StringIO
import joblib
import numpy as np
import ibm_boto3
from botocore.client import Config
from ibm_botocore.exceptions import ClientError #caught in the COS helper functions below
import json
from sentence_transformers import SentenceTransformer
import os
import spacy
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary
import matplotlib.pyplot as plt
import seaborn as sns
import re

In [None]:
#If an error appears, just RERUN THE CELL
#TODO: find out how to avoid the error, so that the notebook can run as a Job
from bertopic import BERTopic

### 1.3 Credentials & Watson Assistant configuration

This notebook uses the Watson Assistant v1 API to access the skill definition. To access message logs, the notebook uses both the v1 and v2 APIs. You authenticate to the API by using IBM Cloud Identity and Access Management (IAM).

You can access the values you need for this configuration from the Watson Assistant user interface. Go to the Skills page and select View API Details from the menu of a skill title.

- The string to set in the call to `IAMAuthenticator` is your API key under Service Credentials.
- The string to set for version is a date in the format version=YYYY-MM-DD. The version date string determines which version of the Watson Assistant V1 API will be called. For more information about version, see [Versioning](https://cloud.ibm.com/apidocs/assistant/assistant-v1#versioning).
- The string to pass into `assistant.set_service_url` is the base URL of Watson Assistant. For example, for us-south, the endpoint is `https://api.us-south.assistant.watson.cloud.ibm.com`. This value will be different depending on the location of your service instance. For more information, see [Service Endpoint](https://cloud.ibm.com/apidocs/assistant/assistant-v1?code=python#service-endpoint).
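
The three settings above could be wired together as in the following sketch. `AssistantV1`, `IAMAuthenticator` and `set_service_url` come from the `ibm-watson` Python SDK; the helper names `validate_version` and `make_assistant`, as well as the placeholder values in the commented call, are assumptions made for illustration.

```python
import re
from datetime import datetime


def validate_version(version: str) -> str:
    """Check that the version string is a real date written as YYYY-MM-DD."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", version):
        raise ValueError(f"version must look like YYYY-MM-DD, got {version!r}")
    datetime.strptime(version, "%Y-%m-%d")  # raises ValueError for impossible dates
    return version


def make_assistant(api_key: str, version: str, service_url: str):
    """Build a Watson Assistant v1 client (hypothetical helper)."""
    # Imported lazily so that validate_version works without the SDK installed.
    from ibm_watson import AssistantV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    authenticator = IAMAuthenticator(api_key)
    assistant = AssistantV1(version=validate_version(version),
                            authenticator=authenticator)
    assistant.set_service_url(service_url)
    return assistant


# Example call with placeholder values (replace them with your own):
# assistant = make_assistant("<your-api-key>", "2021-06-14",
#                            "https://api.us-south.assistant.watson.cloud.ibm.com")
```

Validating the version string before building the client gives an early, readable error instead of a failed API call.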

### 1.4 Cloud Object Storage functions
Cloud Object Storage provides the resources used to fetch and save objects. They are used by the following functions:

In [None]:
# Create resource
#For "storage" bucket
cos = ibm_boto3.resource("s3",
    ibm_api_key_id=COS_API_KEY_ID,
    ibm_service_instance_id=COS_RESOURCE_CRN,
    ibm_auth_endpoint=COS_AUTH_ENDPOINT,
    config=Config(signature_version="oauth"),
    endpoint_url=COS_ENDPOINT
)

#This is for "do_not_delete" Bucket 
cos2 = ibm_boto3.resource("s3",
    ibm_api_key_id=COS_API_KEY_ID,
    ibm_service_instance_id=COS_RESOURCE_CRN,
    ibm_auth_endpoint=COS_AUTH_ENDPOINT,
    config=Config(signature_version="oauth"),
    endpoint_url=COS_ENDPOINT2 #you have to change it depending on your bucket location 
)

#Function to get file from "do_not_delete" Bucket
def get_item_cos_donotdelete(bucket_name, item_name):
    print("Retrieving item from bucket: {0}, key: {1}".format(bucket_name, item_name))
    try:
        file = cos2.Object(bucket_name, item_name).get()
        #print("File Contents: {0}".format(file["Body"].read()))
        #print(pd.read_json(file["Body"]))
        return(file["Body"].read())
    except ClientError as be:
        print("CLIENT ERROR: {0}\n".format(be))
    except Exception as e:
        print("Unable to retrieve file contents: {0}".format(e))
        
#Function to get file from "storage" Bucket
def get_item_cos(bucket_name, item_name):
    print("Retrieving item from bucket: {0}, key: {1}".format(bucket_name, item_name))
    try:
        file = cos.Object(bucket_name, item_name).get()
        #print("File Contents: {0}".format(file["Body"].read()))
        #print(pd.read_json(file["Body"]))
        return(file["Body"].read())
    except ClientError as be:
        print("CLIENT ERROR: {0}\n".format(be))
    except Exception as e:
        print("Unable to retrieve file contents: {0}".format(e))

In [None]:
#Function to list the bucket contents, filtered by prefix

def get_bucket_contents_prefix(bucket_name, prefix):
    print("Retrieving bucket contents from: {0}".format(bucket_name))
    keys = []
    try:
        files = cos.Bucket(bucket_name).objects.filter(Prefix=prefix)
        for file in files:
            print("Item: {0} ({1} bytes).".format(file.key, file.size))
            keys.append(file.key)
    except ClientError as be:
        print("CLIENT ERROR: {0}\n".format(be))
    except Exception as e:
        print("Unable to retrieve bucket contents: {0}".format(e))
    #Return the matching object keys (empty list if an error occurred)
    return keys

## 2. Load data from Cloud Object Storage 
### 2.1 Fetch and load data from a file (to delete)
For the moment, the input is "Le grand débat", from a French public political forum.

In [None]:
file = get_item_cos(BUCKET,'grand_debat_bert.xlsx')

In [None]:



#Loading data from CSV
#`body` must hold the streaming body of the COS object before this cell is run

import types

#pandas expects a file-like object that defines __iter__
def __iter__(self): return 0

if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df = pd.read_csv(body, sep=',', index_col=[0])
df.head()

In [None]:
#df = pd.read_excel(file)
data = df["text"].to_list()

In [None]:
len(data)

In [None]:
#We are just keeping a sample

data = data[:10000] #used for Bertopic()

### 2.2 Fetch and load data from Analysis output


In [None]:
#data = df['request_input'].tolist()

In [None]:
#data

## 3. Topic Modeling with BERTopic and sentence transformers
### Embedding
Based on the pretrained model performance reported at https://www.sbert.net/docs/pretrained_models.html, we choose the following model:

In [None]:
#sentence_transformers_models=['distiluse-base-multilingual-cased-v1','paraphrase-multilingual-mpnet-base-v2','paraphrase-multilingual-MiniLM-L12-v2','paraphrase-MiniLM-L3-v2','multi-qa-mpnet-base-dot-v1','multi-qa-distilbert-cos-v1','multi-qa-MiniLM-L6-cos-v1','distiluse-base-multilingual-cased-v2']
sentence_model = SentenceTransformer("all-mpnet-base-v2")
#sentence_model2 = SentenceTransformer('distiluse-base-multilingual-cased-v1')

In [None]:
#Set explicitly to avoid tokenizer parallelism issues during BERTopic() execution
os.environ["TOKENIZERS_PARALLELISM"] = "true"

### BERTopic parameters
Reference notebook: https://colab.research.google.com/drive/1ClTYut039t-LDtlcd-oQAdXWgcsSGTw9?usp=sharing#scrollTo=xG_slPMurnmz

- `n_gram_range`: the n-gram range for the CountVectorizer. It is advised to keep the upper bound between 1 and 3; higher values would likely lead to memory issues. NOTE: this parameter is not used if you pass in your own CountVectorizer. Default value: (1, 1).

- `top_n_words`: the number of words per topic that you want extracted. Keep it between 10 and 20, and no more than 30.

- `min_topic_size`: an important parameter! It is advised to experiment with this value depending on the size of your dataset. Default value: 10.

#### For topic reduction:

- `nr_topics`: specifying the number of topics reduces the initial number of topics to the value specified. This reduction can take a while, as each reduction in topics (-1) triggers a c-TF-IDF calculation. If this is set to None (the default value), no reduction is applied. Use "auto" to automatically reduce topics using HDBSCAN.
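
To make `n_gram_range` more concrete, the sketch below enumerates the word n-grams that a range such as (1, 3) yields for one short document. The helper `word_ngrams` and the sample tokens are hypothetical; this is a plain-Python illustration of the idea, not the CountVectorizer or BERTopic implementation.

```python
def word_ngrams(tokens, n_gram_range=(1, 3)):
    """Return every contiguous word n-gram whose length is within the range."""
    low, high = n_gram_range
    result = []
    for n in range(low, high + 1):
        # Slide a window of n tokens over the document.
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result


tokens = ["reset", "my", "password"]
print(word_ngrams(tokens, (1, 1)))
# → ['reset', 'my', 'password']
print(word_ngrams(tokens, (1, 3)))
# → ['reset', 'my', 'password', 'reset my', 'my password', 'reset my password']
```

Even on three tokens the vocabulary doubles when moving from (1, 1) to (1, 3), which is why wide ranges become expensive in memory on a large corpus.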

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
#Stop words to remove
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
from spacy.lang.en.stop_words import STOP_WORDS as en_stop

stopwords = en_stop.union(fr_stop)
#We add a vectorizer to handle stop words
vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words=list(stopwords))

#Then we create the topic model for the embedding model:

#all-mpnet-base-v2
topic_model = BERTopic(embedding_model=sentence_model,
                      verbose=True,
                      n_gram_range=(1,3), #not used here because vectorizer_model is passed
                      nr_topics="auto", #"auto" reduces topics using HDBSCAN
                      vectorizer_model=vectorizer_model, #better results with our vectorizer
                      top_n_words = 15,
                      min_topic_size = 100, #should depend on the dataset size
                      low_memory=False, #True = lower memory usage, but slower computation
                      calculate_probabilities=True #computes the probability of each topic; computationally expensive
                      )

In [None]:
#seed_topic_list: None (default value)
# A list of seed words per topic to converge around


#all-mpnet-base-v2
topics, probs = topic_model.fit_transform(data)

In [None]:
df = pd.DataFrame({'topic': topics, 'document': data})

### Saving the model inside the COS

In [None]:
#Does not work on Watson Studio
#topic_model.save("model")

We use in-memory buffers as temporary files, so that the data can be passed to the `put_object()` function.

In [None]:
#Temporary buffers
file_buffer = io.BytesIO()
csv_buffer = StringIO()


#Serializing topic_model and storing it inside file_buffer
#The embedding model is saved with the model
joblib.dump(topic_model, file_buffer)

#Same for the df, written as CSV
df.to_csv(csv_buffer,header=True, index=False)
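
The buffer pattern can be tried without Cloud Object Storage. The sketch below is a minimal stand-in under stated assumptions: the standard-library `pickle` replaces `joblib`, a plain dictionary replaces the fitted model, and `fake_bucket` is a hypothetical in-memory substitute for the bucket.

```python
import csv
import io
import pickle

# Stand-in for the fitted topic model (assumption: any picklable object).
model_stub = {"name": "topic_model", "topics": [0, 1, 2]}

# Binary buffer: serialize the object into memory instead of a file on disk.
file_buffer = io.BytesIO()
pickle.dump(model_stub, file_buffer)
payload = file_buffer.getvalue()  # bytes, the kind of value passed as Body=

# Text buffer: write CSV rows into memory.
csv_buffer = io.StringIO()
writer = csv.writer(csv_buffer)
writer.writerow(["topic", "document"])
writer.writerow([0, "how do I reset my password"])

# Hypothetical in-memory "bucket" keyed by object name.
fake_bucket = {"model.pkl": payload, "df_model.csv": csv_buffer.getvalue()}

# Round trip: read the bytes back and rebuild the object.
restored = pickle.load(io.BytesIO(fake_bucket["model.pkl"]))
print(restored == model_stub)
```

`getvalue()` returns the whole buffer regardless of the current read position, so no `seek(0)` is needed before uploading.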

Saving the model:

In [None]:
#saving the file_buffer content inside the COS
cos.Bucket(BUCKET).put_object(Key="model_without_vectorizer_10000data.joblib", Body= file_buffer.getvalue())

Saving the Dataframe as csv:

In [None]:
#saving the df content inside the COS
cos.Bucket(BUCKET).put_object(Key='df_model.csv', Body=csv_buffer.getvalue())

### Loading the model from the COS

Here, we load the model through a `TemporaryFile`:

In [None]:
with TemporaryFile() as temp_file:
    #download the model into temp file
    cos.Object(BUCKET, "model_without_vectorizer_10000data.joblib").download_fileobj(temp_file)
    temp_file.seek(0)
    #load into joblib
    model=joblib.load(temp_file)
topic_model=model

In [None]:
model.get_topic_info()

Then we load the DataFrame from the CSV file:

In [None]:
csv_file = io.StringIO(get_item_cos(BUCKET,'df_model.csv').decode("utf-8"))
df=pd.read_csv(csv_file)

In [None]:
df.head()

### Data Visualisation

In [None]:
freq = topic_model.get_topic_info()
freq.head()

In [None]:
nb_topic = freq.shape[0]-1
print(f"There are {nb_topic} topics")

In [None]:
df[df['topic']==1].head(5)

In [None]:
topic_model.get_topic(1) # Inspect topic 1 (topics are sorted by size: 0 is the most frequent, -1 holds the outliers)

In [None]:
representative_document=topic_model.get_representative_docs()
representative_document

In [None]:
topic_model.generate_topic_labels()

### Topic model evaluation: coherence & diversity

In [None]:
def get_topic_word(topics, topic_model):
    list_word_topic= list()
    list_topic = set(topics)
    #Iterate to get the top n words of each topic
    for topic in list_topic:
        tmp=list() #the topic label followed by the words of the topic
        tmp.append(str(topic))
        #tmp2 is the list of (word, c-TF-IDF score) pairs, but we only want the words
        tmp2 = topic_model.get_topic(topic)
        for el in tmp2 :
            if len(el[0])>3 : #keep only the words longer than 3 characters
                tmp.append(el[0])
        #list_word_topic is a list of lists
        list_word_topic.append(tmp)
    return list_word_topic             


In [None]:
#all-mpnet-base-v2
list_word_topic = get_topic_word(topics, topic_model)


In [None]:
from nltk.util import ngrams

The coherence measure is relative to the dataset. Each model fitted to this dataset gets its own coherence score:

In [None]:
#This is for topic coherence calculation
#Tokenize: cast to string, lowercase (matching is case-sensitive), then extract the word tokens
tokenizer = lambda s: re.findall(r'\w+', str(s).lower())
data_tokenised = [ tokenizer(t) for t in data ]

from gensim.models import Phrases
#Phrases expects tokenized documents, not raw strings
bigram = Phrases(data_tokenised, min_count=10)

for idx in range(len(data_tokenised)):
    for token in bigram[data_tokenised[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            data_tokenised[idx].append(token)
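
To see what a regular-expression tokenizer of this kind produces, here is a standalone sketch. The function name `tokenize` and the sample inputs are illustrative only.

```python
import re


def tokenize(text):
    """Cast the input to str, lowercase it, and return its word tokens."""
    return re.findall(r"\w+", str(text).lower())


print(tokenize("Le Grand Débat, c'est quoi ?"))
# → ['le', 'grand', 'débat', 'c', 'est', 'quoi']

# Non-string values (for example missing values read from a CSV) are
# stringified instead of raising an error.
print(tokenize(float("nan")))
# → ['nan']
```

Punctuation and apostrophes act as separators, and accented letters are kept because `\w` matches Unicode word characters in Python 3.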

In [None]:
dictionary = Dictionary(data_tokenised)
dictionary.filter_extremes(no_below=10, no_above=0.2)
corpus = [dictionary.doc2bow(doc) for doc in data_tokenised]

In [None]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))
print(corpus[:1])

In [None]:
_ = dictionary[0] #id2token is filled lazily; accessing an item loads it
id2word = dictionary.id2token

In [None]:
# Creating a dictionary with the vocabulary
word2id = Dictionary(data_tokenised)

def get_coherence(list_word_topic):
    # Coherence model
    cm = CoherenceModel(topics=list_word_topic, 
                        texts=data_tokenised,
                        coherence='c_v',  
                        dictionary=word2id)

    #1st : coherence for each topic
    coherence_per_topic = cm.get_coherence_per_topic()
    
    #2nd :global coherence 
    coherence = cm.get_coherence()
    print("The coherence per topic is ",coherence_per_topic )
    print("The topic model coherence is ",coherence )
    return coherence, coherence_per_topic


In [None]:
#all-mpnet-base-v2
coherence, coherence_per_topic = get_coherence(list_word_topic)

In [None]:
#Print the result
topics_str = [ '\n '.join(t[:4]) for t in list_word_topic ] #we only print the topic number and its first 3 words
data_topic_score = pd.DataFrame( data=zip(topics_str, coherence_per_topic), columns=['Topic', 'Coherence'] )
data_topic_score = data_topic_score.set_index('Topic')


fig, ax = plt.subplots( figsize=(nb_topic/3,nb_topic) )
ax.set_title("Topics coherence\n $C_v$")
sns.heatmap(data=data_topic_score, annot=True, square=True,
            cmap='Reds', fmt='.2f',
            linecolor='black', ax=ax )
plt.yticks( rotation=0 )
ax.set_xlabel('')
ax.set_ylabel('')
fig.show()

In [None]:
topics_str
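
The section title mentions diversity, but the cells above only compute coherence. One common definition of topic diversity is the fraction of unique words among the top words of all topics. The sketch below assumes the `list_word_topic` layout used above, where the first element of each inner list is the topic label; the function name `topic_diversity` and the sample data are hypothetical.

```python
def topic_diversity(list_word_topic, top_k=10):
    """Fraction of unique words among the top_k words of every topic."""
    words = []
    for topic in list_word_topic:
        # Skip the first element, which is the topic label.
        words.extend(topic[1:1 + top_k])
    if not words:
        return 0.0
    return len(set(words)) / len(words)


sample = [
    ["0", "tax", "income", "budget"],
    ["1", "tax", "school", "teacher"],
]
# 6 words in total, 5 of them unique ("tax" appears twice).
print(round(topic_diversity(sample, top_k=3), 3))
# → 0.833
```

A value close to 1 means the topics share few words, while a value close to 0 means the topics are largely redundant.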

### Topic Reduction after training

(BERTopic documentation: https://maartengr.github.io/BERTopic/getting_started/topicreduction/topicreduction.html#visualize-probablities)

Because there are too many predicted intents, we will reduce them:
 - First, let's view them in two dimensions; overlap between topics is a good way to find potential merges.

In [None]:
topic_model.visualize_topics()

Then let's see how the BERTopic `reduce_topics` function can reduce them:

In [None]:
# Further reduce topics
#Pass the same documents that were used to fit the model
new_topics, new_probs = topic_model.reduce_topics(data, topics, nr_topics="auto")

In [None]:
topic_model.visualize_topics()

As we can see above, the merging is not really optimal: many topics still overlap and could be merged. Could a data analyst do that job by hand? Can we do something more to merge them, maybe by using the similarity matrix?

In [None]:
#Based on the cosine similarity matrix between topic embeddings,
#a heatmap is created showing the similarity between topics.
topic_model.visualize_heatmap()

#to save it:
#fig = topic_model.visualize_heatmap()
#fig.write_html("path/to/file.html")
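
To follow up on the similarity-matrix idea, one option is to list the topic pairs whose vectors exceed a cosine-similarity threshold and review them as merge candidates. The sketch below is plain Python and is not part of the BERTopic API: the toy vectors, the 0.9 threshold and the helper names are assumptions for illustration. With a fitted model, the vectors would be the topic embeddings.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_candidates(topic_vectors, threshold=0.9):
    """Return (topic_a, topic_b, similarity) for pairs above the threshold."""
    pairs = []
    for (ta, va), (tb, vb) in combinations(sorted(topic_vectors.items()), 2):
        similarity = cosine(va, vb)
        if similarity >= threshold:
            pairs.append((ta, tb, similarity))
    # Most similar pairs first.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


toy_vectors = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0]}
for topic_a, topic_b, similarity in merge_candidates(toy_vectors):
    print(topic_a, topic_b, round(similarity, 3))
# → 0 1 0.994
```

The threshold controls how aggressive the suggestions are; the resulting list is meant for review by an analyst rather than for automatic merging.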
    

Working on the output: we want a list of accurate documents that describe our topics. The goal is 15 examples per topic.
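
One way to reach that goal is to rank the documents of each topic by a confidence score and keep the best ones. The sketch below is plain Python and assumes that each document comes with its assigned topic and a score (for example the assignment probability); the function name, the record layout and the sample data are hypothetical.

```python
from collections import defaultdict


def top_docs_per_topic(records, n_examples=15):
    """Keep the n_examples highest-scoring documents of each topic.

    records: iterable of (topic, document, score) tuples.
    """
    by_topic = defaultdict(list)
    for topic, document, score in records:
        if topic == -1:  # -1 is the outlier topic in BERTopic
            continue
        by_topic[topic].append((score, document))
    return {
        topic: [doc for _, doc in
                sorted(items, key=lambda item: item[0], reverse=True)[:n_examples]]
        for topic, items in by_topic.items()
    }


records = [
    (0, "reset my password", 0.95),
    (0, "forgot password", 0.80),
    (0, "password help please", 0.60),
    (1, "open an account", 0.90),
    (-1, "hello", 0.10),
]
print(top_docs_per_topic(records, n_examples=2))
# → {0: ['reset my password', 'forgot password'], 1: ['open an account']}
```

Topics with fewer than `n_examples` documents simply return everything they have.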

In [None]:
def get_representative_docs(df) : 
    list_topic = set(topics)
    for topic in list_topic : 
        topic_doc = df[df.topic == topic]
        