### LDA

LDA's approach to topic modeling treats each document as a mixture of topics in certain proportions, and each topic as a collection of keywords, again in certain proportions.
Once you give the algorithm the number of topics, all it does is rearrange the topic distribution within the documents and the keyword distribution within the topics until it reaches a good composition of the topic-keyword distributions.
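
The two kinds of proportions can be pictured with the generative story behind LDA. The sketch below is only an illustration: the topic names, keywords, and weights are made up, and `generate_document` is a hypothetical helper, not part of Gensim. Training works in the opposite direction, inferring these proportions from the observed words.

```python
import random

# Hypothetical toy model: each topic is a distribution over keywords.
topics = {
    "mood": {"feeling": 0.4, "low": 0.3, "depressed": 0.3},
    "social": {"partner": 0.5, "friends": 0.3, "weekend": 0.2},
}

# Each document is a mixture of topics (proportions sum to 1).
doc_topic_mix = {"mood": 0.7, "social": 0.3}


def generate_document(mix, topics, n_words, rng):
    """Sample words: pick a topic for each word, then a keyword from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(mix), weights=list(mix.values()))[0]
        keywords = topics[topic]
        word = rng.choices(list(keywords), weights=list(keywords.values()))[0]
        words.append(word)
    return words


rng = random.Random(100)
document = generate_document(doc_topic_mix, topics, n_words=8, rng=rng)
print(document)
```

With a 70/30 mix, most sampled words should come from the "mood" topic, which is the sense in which a document "contains" topics in a certain proportion.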

***When we say topic, what actually is it and how is it represented?***

A topic is nothing but a collection of dominant keywords that typically represent it. Just by looking at the keywords, you can identify what the topic is about.

The following are key factors in obtaining a good segregation of topics:

- The quality of text processing.
- The variety of topics the text talks about.
- The choice of topic modeling algorithm.
- The number of topics fed to the algorithm.
- The algorithm's tuning parameters.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install --upgrade gensim



In [None]:
!pip install pyLDAvis==2.1.2

Collecting pyLDAvis==2.1.2
  Downloading pyLDAvis-2.1.2.tar.gz (1.6 MB)

In [None]:
import re
import string
import logging
import warnings
from collections import defaultdict
from pprint import pprint
from zipfile import ZipFile

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import nltk
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# Gensim
import gensim
from gensim import corpora, models
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel, TfidfModel

# spaCy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # module name in pyLDAvis 2.1.2 (renamed gensim_models in 3.x)
pyLDAvis.enable_notebook()

# Enable logging for gensim - optional
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

In [None]:
# Retrieve the mental illness posts stored in a zip archive
def zip2text(path):
    """Return {file name: list of sentences, each a list of tokens}."""
    out = {}
    # Use the `path` argument (not the global `mypath`) so any archive can be read
    with ZipFile(path, "r") as arch_in:
        for file in arch_in.namelist():
            text = arch_in.read(file).decode('utf-8')
            # One sentence per line; tokens are separated by " * "
            out[file] = [sent.split(" * ") for sent in text.split("\n")]
    return out

In [None]:
#open the zip file (experimental group)
mypath = "/content/drive/MyDrive/Computational_Linguistics_Project /myarchive_illness.zip" 
resss = zip2text(mypath)

In [None]:
#open the zip file (control group)
my_second_path = '/Users/FrancescaPadovani/Desktop/MIO_PRO/myarchive_control.zip'
resss2 = zip2text(my_second_path)

In [None]:
i = 0 
for doc in resss:
    if i < 5:
        print(doc)
        i += 1

0_feeling.txt
1_going.txt
2_you.txt
3_i.txt
4_partners.txt


In [None]:
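# Just for visualization: print the first 10 sentences. Skip it if you don't need it.
i = 0
for doc in resss:
    for sent in resss[doc]:
        if i < 10:
            print(' '.join(sent))
            i += 1  # count printed sentences so the limit is actually reached
        else:
            break
    if i >= 10:
        break  # also stop iterating over the remaining documents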

In [None]:
#you calculate the frequency of each token 
frequency = defaultdict(int)
for doc in resss:
    for sent in resss[doc]:
        for token in sent:
            frequency[token] += 1
            

In [None]:
#you keep tokens that have a frequency higher than 5
texts = []
for doc in resss:
    for sent in resss[doc]:
        keep = []
        for token in sent:
            if frequency[token] > 5:
                keep.append(token)
        texts.append(keep)


#### The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus. Let’s create them.

In [None]:
# Create Dictionary

id2word = corpora.Dictionary(texts)

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

In [None]:
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]]


In [None]:
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

[[('big', 1),
  ('bit', 1),
  ('depressedi', 1),
  ('feeling', 1),
  ('low', 1),
  ('weekend', 1)]]
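
To make the (id, count) pairs above less opaque, here is a simplified pure-Python stand-in for the dictionary and bag-of-words steps. It is not Gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, and the toy sentences are invented.

```python
from collections import Counter


def build_vocab(texts):
    """Assign an integer id to every distinct token (hypothetical helper)."""
    token2id = {}
    for text in texts:
        for token in sorted(set(text)):
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(text, token2id):
    """Turn a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())


toy_texts = [["feeling", "low", "weekend"], ["low", "low", "feeling", "bit"]]
token2id = build_vocab(toy_texts)
toy_corpus = [to_bow(text, token2id) for text in toy_texts]

print(token2id)
print(toy_corpus)
```

Each inner list describes one text: the first element of a pair identifies the token, the second says how many times it occurs in that text. Word order is discarded, which is why the representation is called a bag of words.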

### Building the Topic Model

We have everything required to train the LDA model. In addition to the corpus and dictionary, you need to provide the number of topics as well.

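Apart from that, alpha and eta are hyperparameters that affect the sparsity of the topics. According to the Gensim docs, both default to a symmetric 1.0/num_topics prior. Note that the model below sets alpha='auto' instead, which lets Gensim learn an asymmetric prior from the corpus.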

chunksize is the number of documents used in each training chunk, update_every determines how often the model parameters are updated, and passes is the total number of training passes.
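
As a rough worked example of how these settings interact, the arithmetic below uses the parameter values from this notebook together with a hypothetical corpus size. It assumes one model update per `update_every` chunks, which is a simplification and not a description of Gensim's internals.

```python
import math

# Values used for the LDA model in this notebook.
num_topics = 5
chunksize = 100
update_every = 1
passes = 10

# Hypothetical corpus size, chosen only for illustration.
n_docs = 1050

# Symmetric prior: every topic gets the same weight, 1.0 / num_topics.
symmetric_prior = [1.0 / num_topics] * num_topics

# Chunks seen in one pass over the corpus (the last chunk may be smaller).
chunks_per_pass = math.ceil(n_docs / chunksize)

# Assumption: one model update per `update_every` chunks.
updates_per_pass = math.ceil(chunks_per_pass / update_every)
total_updates = updates_per_pass * passes

print(symmetric_prior)
print(chunks_per_pass, total_updates)
```

Under these assumptions, a larger chunksize means fewer but better-informed updates per pass, while more passes means the whole corpus is revisited more often.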

In [None]:
# Build LDA model
model_LDA = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=5, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

The above **LDA model** is built with 5 different topics, where each topic is a combination of keywords and each keyword contributes a certain weight to the topic.

You can see the keywords for each topic and the weight (importance) of each keyword using model_LDA.print_topics() as shown next.

In [None]:
# Print the top 10 keywords of each topic
pprint(model_LDA.print_topics())
doc_lda = model_LDA[corpus]

#doc_lda[0]
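
print_topics() returns (topic id, string) pairs in which each keyword is written as weight*"word". If you need the weights as numbers, one option is to parse that string, as in the sketch below. The sample entry is invented for illustration and `parse_topic` is a hypothetical helper.

```python
import re

# Hypothetical entry in the shape returned by print_topics();
# the words and weights are invented.
sample_topic = (0, '0.031*"feeling" + 0.024*"low" + 0.018*"weekend"')


def parse_topic(topic_string):
    """Split a topic string into (keyword, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


topic_id, topic_string = sample_topic
keywords = parse_topic(topic_string)
print(topic_id, keywords)
```

Gensim also exposes show_topic(topic_id), which returns (word, probability) pairs directly and avoids the string parsing.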

In [None]:
coherence_model_lda = CoherenceModel(model=model_LDA, texts=texts, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

In [None]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(model_LDA, corpus, id2word)
vis

So how do you interpret pyLDAvis’s output?

Each bubble on the left-hand side plot represents a topic. The larger the bubble, the more prevalent that topic is.
A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of being clustered in one quadrant.

A model with too many topics will typically have many overlapping, small bubbles clustered in one region of the chart.
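
To give "prevalence" a concrete meaning, the sketch below computes each topic's share of all tokens from per-document topic proportions. The numbers are hypothetical, and treating prevalence as a length-weighted share is an assumption made for illustration, not a statement of how pyLDAvis sizes its bubbles.

```python
# Hypothetical per-document topic proportions (each row sums to 1).
doc_topic = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
]

# Hypothetical document lengths in tokens.
doc_lengths = [10, 20, 10]

n_topics = len(doc_topic[0])

# Expected number of tokens each topic accounts for across the corpus.
token_share = [
    sum(row[k] * length for row, length in zip(doc_topic, doc_lengths))
    for k in range(n_topics)
]

# Normalize so the prevalences sum to 1.
total = sum(token_share)
prevalence = [share / total for share in token_share]

print([round(p, 3) for p in prevalence])
```

In this toy corpus the third topic accounts for only a tenth of the tokens, so it would be the one drawn with the smallest bubble.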

Alright, if you move the cursor over one of the bubbles, the words and bars on the right-hand side will update. These words are the salient keywords that form the selected topic.

We have successfully built a good-looking topic model.

Given our prior knowledge of the number of natural topics in the document, finding the best model was fairly straightforward.

Up next, we will improve upon this model by using Mallet’s version of the LDA algorithm, and then we will focus on how to arrive at the optimal number of topics for any large corpus of text.