- This notebook builds the corpus from the text produced by data_preparation.ipynb.
- The purpose is to find similar parts in the large body of content and group them into the same category. Several text clustering techniques were tried, including K-means clustering (KM), Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA); LDA was selected because it produced coherent topics.
- Finally, the categories are plotted on a graph and exported to an HTML file.
- Model performance is evaluated by the coherence score of the topics.


**Set up the Google Colab environment**

In [None]:
from google.colab import drive
# This will prompt for authorization; never store the authorization code in the notebook.
drive.mount('/content/drive')

!pip install PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

!pip install gensim

!pip install pyldavis

**Import libraries**

In [None]:
from gensim.utils import simple_preprocess
import gensim.corpora as corpora
from gensim.models import CoherenceModel
from gensim.corpora.dictionary import Dictionary
from gensim import models


from joblib import dump, load

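# Plotting tools
import pyLDAvis  # needed for pyLDAvis.enable_notebook() and pyLDAvis.save_html()
from pyLDAvis.gensim import prepare  # newer pyLDAvis releases name this module pyLDAvis.gensim_models
import matplotlib.pyplot as plt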

import warnings
warnings.filterwarnings("ignore")

**Google Drive access paths**

In [None]:
csv_path = '/content/drive/My Drive/Colab Notebooks/s_user_csv/'
metadata_path = '/content/drive/My Drive/Colab Notebooks/output/'

- **Create a dictionary from the content**
- **Build the bag-of-words corpus from the dictionary**
- **Transform the corpus into a document-term matrix**
- **Save the dictionary, corpus, document-term matrix and LDA model to the drive**

In [None]:
def make_corpus(list_text):
    # Map each unique token to an integer id
    dictionary = corpora.Dictionary(list_text)
    # Convert each document to a bag of words: a list of (token_id, count) pairs
    corpus = [dictionary.doc2bow(text) for text in list_text]
    return dictionary, corpus

def make_lda_model(dictionary, corpus, n_topics):
    lda_model = models.LdaModel(corpus, id2word = dictionary, num_topics = n_topics, alpha = 'auto', eval_every = 5)
    # print_topics returns (topic_id, "weight*word + ...") pairs
    topics = lda_model.print_topics(n_topics)
    return lda_model, topics

# Assumption: 50 topics, matching the 'lda_model_50.html' export name
n_topics = 50

# load list_content_clean from drive
list_content_clean = load(metadata_path + 'list_content_clean.joblib')
dictionary, corpus = make_corpus(list_content_clean)
lda_model, topics = make_lda_model(dictionary, corpus, n_topics)

# save
dump(dictionary, metadata_path + "dictionary.joblib")
dump(corpus, metadata_path + "corpus.joblib")
dump(lda_model, metadata_path + "lda_model.joblib")


for idx, topic in topics:
    print("topic: {}\n {}".format(idx, topic))
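
To make the dictionary and bag-of-words steps concrete, here is a minimal pure-Python sketch of the idea. It is only an illustration under simplifying assumptions, not gensim's implementation: the helper names `build_vocabulary` and `to_bag_of_words` and the toy documents are hypothetical, and gensim may assign token ids in a different order.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bag_of_words(doc, token2id):
    """Return sorted (token_id, count) pairs for the tokens found in the vocabulary."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

# Toy, already-tokenized documents (made up for illustration)
toy_docs = [["topic", "model", "topic"], ["model", "corpus"]]
toy_vocab = build_vocabulary(toy_docs)
toy_corpus = [to_bag_of_words(doc, toy_vocab) for doc in toy_docs]

print(toy_vocab)
print(toy_corpus)
```

Each document becomes a sparse list of (token id, count) pairs, which is the same shape of data that the LDA model is trained on.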

**Visualize the topics with their keywords**

In [None]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = prepare(lda_model, corpus, dictionary)
# export the visualization to HTML
pyLDAvis.save_html(vis, metadata_path + 'lda_model_50.html')
vis

- Compute the perplexity and the coherence score to judge how good the model is.
- Perplexity measures how well the model predicts the documents: the lower, the better.
- The coherence score measures how semantically consistent the top words of each topic are: the higher, the better.

In [None]:
# Compute Coherence Score
coherence_lda = CoherenceModel(model = lda_model
                                     , corpus = corpus
                                     , texts = list_content_clean
                                     , dictionary = dictionary 
                                     , coherence='c_v')

coherence = coherence_lda.get_coherence()

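# Print the score itself, not the CoherenceModel object
print('\nCoherence Score: ', coherence)

# log_perplexity returns a per-word likelihood bound, not the perplexity itself;
# a higher bound (closer to zero) corresponds to a lower perplexity
print('\nLog perplexity (per-word bound): ', lda_model.log_perplexity(corpus))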
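
According to the gensim documentation, `log_perplexity` returns a per-word likelihood bound, and the perplexity is `2^(-bound)`. The sketch below shows that conversion; the helper name `perplexity_from_bound` and the example values are made up for illustration and are not results from this notebook.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity value."""
    return 2 ** (-per_word_bound)

# Hypothetical bound values, chosen only to show the direction of the relationship
example_bounds = [-8.0, -7.0]
example_perplexities = [perplexity_from_bound(b) for b in example_bounds]

for bound, perplexity in zip(example_bounds, example_perplexities):
    print("bound = {}, perplexity = {}".format(bound, perplexity))
```

A bound closer to zero gives a smaller perplexity, so when comparing models, the one with the higher bound is the better one by this measure.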


**Visualize the topic distribution**

In [None]:
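# print_topics returns strings, so take the numeric probabilities from the model instead.
# Assumption: plot the first document in the corpus; change doc_id to plot another one.
doc_id = 0
doc_topics = lda_model.get_document_topics(corpus[doc_id], minimum_probability = 0)
# Assumption: hypothetical output file name; output_path was not defined in the notebook
output_path = metadata_path + 'topic_distribution.png'

y_axis = []
x_axis = []
for topic_id, dist in doc_topics:
    x_axis.append(topic_id + 1)
    y_axis.append(dist)
width = 1
plt.bar(x_axis, y_axis, width, align='center', color='r')
plt.xlabel('Topics')
plt.ylabel('Probability')
plt.title('Topic Distribution for doc')
plt.xticks(list(range(2, len(x_axis), 2)), rotation='vertical', fontsize=7)
plt.subplots_adjust(bottom=0.2)
plt.ylim([0, max(y_axis) + .01])
plt.xlim([0, len(x_axis) + 1])
plt.savefig(output_path)
plt.close()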
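
gensim reports a document's topic distribution as a sparse list of (topic id, probability) pairs, and topics below the probability threshold can be left out. For plotting, a dense vector with one entry per topic is often easier to work with. The following is a minimal sketch of that conversion; `to_dense` and the toy values are hypothetical.

```python
def to_dense(sparse_topics, num_topics):
    """Expand (topic_id, probability) pairs into a list with one slot per topic."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_topics:
        dense[topic_id] = prob
    return dense

# Toy sparse distribution over 5 topics (made up for illustration)
toy_sparse = [(0, 0.7), (3, 0.3)]
toy_dense = to_dense(toy_sparse, num_topics=5)

print(toy_dense)
```

Topics that were omitted from the sparse list show up as zero-height bars instead of shifting the positions of the other topics.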

**Find the documents most similar to each category**

In [None]:
# Assign topics to the documents in the corpus
lda_corpus = lda_model[corpus]
# Find the dominant topic of each document
lda_corpus = [max(prob, key = lambda y : y[1]) for prob in lda_corpus]
list_content_by_topic = [[] for i in range(lda_model.num_topics)]
# Group the documents by their dominant topic
# Assumption: list_content was undefined here, so the cleaned documents are used instead
for i, x in enumerate(lda_corpus):
    list_content_by_topic[x[0]].append(list_content_clean[i])
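
The grouping step above can be sketched on toy data without a trained model. This is an illustrative version under stated assumptions: `group_by_dominant_topic`, the toy distributions and the document labels are all hypothetical, and documents with an empty topic list are skipped rather than raising an error.

```python
def group_by_dominant_topic(doc_topics, documents, num_topics):
    """Put each document into the bucket of its highest-probability topic."""
    groups = [[] for _ in range(num_topics)]
    for topics_for_doc, document in zip(doc_topics, documents):
        if not topics_for_doc:
            # No topic passed the probability threshold; leave the document ungrouped
            continue
        dominant_id, _ = max(topics_for_doc, key=lambda pair: pair[1])
        groups[dominant_id].append(document)
    return groups

# Toy per-document topic distributions as (topic_id, probability) pairs
toy_doc_topics = [
    [(0, 0.9), (1, 0.1)],
    [(1, 0.6), (2, 0.4)],
    [],
    [(0, 0.55), (2, 0.45)],
]
toy_documents = ["doc A", "doc B", "doc C", "doc D"]
toy_groups = group_by_dominant_topic(toy_doc_topics, toy_documents, num_topics=3)

print(toy_groups)
```

Assigning each document only to its dominant topic is a simplification: a document that is split almost evenly between two topics, like "doc D" here, still ends up in a single category.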