## 4. Topic Modeling
Topic modeling is a text mining tool to discover semantic structures. Usually a document is described by several topics.

### 1. Latent Dirichlet Allocation
Is an statistical unsupervised model that allows sets to be explained by unobserved groups, it explains why some parts of the data are related.

#### 1. LDA in a BOW corpus
First we will create the model using a BOW corpus model created in previous steps.


In [3]:
import pickle

#LOAD LYRICS from disk
lyrics = list()
with open ('../dataset/lemma_lyrics', 'rb') as fp:
    lyrics = pickle.load(fp)

In [4]:
import gensim.corpora as corpora
from gensim.models import TfidfModel    

id2word = corpora.Dictionary(lyrics)
id2word.save("../dataset/lemma_lyrics_dict")

bow_corpus = list()

for lyric in lyrics:
    bow_corpus.append(id2word.doc2bow(lyric))

tfidf = TfidfModel(bow_corpus)
tfidf_corpus = tfidf[bow_corpus]
print(bow_corpus[0])
print(tfidf_corpus[0])

[(0, 2), (1, 1), (2, 1), (3, 13), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 2), (11, 1), (12, 1), (13, 1), (14, 3), (15, 3), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 45), (23, 1), (24, 3), (25, 3), (26, 1), (27, 5), (28, 1), (29, 6), (30, 1), (31, 1), (32, 1), (33, 4), (34, 3), (35, 1), (36, 1), (37, 1), (38, 4), (39, 1), (40, 2), (41, 1), (42, 2), (43, 2), (44, 6), (45, 1), (46, 1), (47, 2), (48, 7), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 3), (56, 3), (57, 1), (58, 5), (59, 2), (60, 2), (61, 3), (62, 1), (63, 1), (64, 1), (65, 3), (66, 3), (67, 2), (68, 1), (69, 1), (70, 1), (71, 2), (72, 3), (73, 2), (74, 3), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1)]
[(0, 0.05693230226707338), (1, 0.027704914478609205), (2, 0.04549673720943144), (3, 0.0289448229988323), (4, 0.015806998693867935), (5, 0.024025594355196165), (6, 0.023098009819700793), (7, 0.025811526273479265), (8, 0.010569261638354028), (9, 0.006612118130594775), (10, 0.0244344778227

In [3]:
from gensim.models.ldamulticore import LdaMulticore

lda_model = LdaMulticore(workers=4,
                   corpus=bow_corpus,
                   id2word=id2word,
                   num_topics=6, 
                   #random_state=100,
                   #update_every=1,
                   #chunksize=100,
                   #passes=10,
                   per_word_topics=False)
lda_model.save("../dataset/bow_lda/lda")

#### 2. Measure the model

Coherence measure is used to distiguish bad topics from good topics. In general a bigger coherence score means a better LDA model.

It's a 4 step process:

* Segmentation: 
* Probability estimation
* Confirmation measure
* Aggregation

In [4]:
from gensim.models.coherencemodel import CoherenceModel
coherence_model_lda = CoherenceModel(model=lda_model, texts=lyrics, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.35834682548783175


#### 3. Plot the model

In [5]:
import pyLDAvis
import pyLDAvis.gensim  

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, bow_corpus, id2word)
vis

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  return pd.concat([default_term_info] + list(topic_dfs))


#### 4. LDA in Tf-idf corpus
Comparing different number of corpus we get that 26 has the biggest coherence number.

In [11]:
from gensim.models.ldamulticore import LdaMulticore
from gensim.models.coherencemodel import CoherenceModel
import warnings

warnings.filterwarnings("ignore")

best_coherence = 0.0
best_number_topics = 0
for i in range(2,30,4):
    lda_model = LdaMulticore(workers=4,
                       corpus=tfidf_corpus,
                       id2word=id2word,
                       num_topics=5, 
                       #random_state=100,
                       #update_every=1,
                       #chunksize=100,
                       #passes=10,
                       per_word_topics=False)

    coherence_model_lda = CoherenceModel(model=lda_model, texts=lyrics, dictionary=id2word, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    print('Coherence Score: '+ str(coherence_lda) + ' Topics: '+str(i))
    
    if(coherence_lda > best_coherence):
        best_coherence = coherence_lda
        best_number_topics = i
        lda_model.save("../dataset/tfidf_lda/lda")
        
print("Biggest coherence score: "+str(best_coherence)+" Number of topics: "+str(best_number_topics))

Coherence Score: 0.37556559168123016 Topics: 2
Coherence Score: 0.3812067575010482 Topics: 6
Coherence Score: 0.37322642032476006 Topics: 10
Coherence Score: 0.36373050980598454 Topics: 14
Coherence Score: 0.38201783049802585 Topics: 18
Coherence Score: 0.38521544368248006 Topics: 22
Coherence Score: 0.36888376074325324 Topics: 26
Biggest coherence score: 0.38521544368248006 Number of topics: 22


In [7]:
from gensim.models.ldamulticore import LdaMulticore
from gensim.models.coherencemodel import CoherenceModel
import warnings

warnings.filterwarnings("ignore")

lda_model = LdaMulticore(workers=4,
                       corpus=tfidf_corpus,
                       id2word=id2word,
                       num_topics=6, 
                       #random_state=100,
                       #update_every=1,
                       #chunksize=100,
                       #passes=10,
                       per_word_topics=False)
lda_model.save("../dataset/tfidf_lda/lda")
        

In [8]:
from gensim.models.ldamulticore import LdaMulticore
import pyLDAvis
import pyLDAvis.gensim  

lda_model = LdaMulticore.load("../dataset/tfidf_lda/lda")

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, tfidf_corpus, id2word)
vis

### 2. Hierarchical Dirichlet Process
HDP models topics as a mixture of words, like LDA, but the number of topics is generated by a dirichlet process. 
#### 1. Create the model

In [7]:
from gensim.models import HdpModel

hdp = HdpModel(tfidf_corpus, id2word)
hdp.save("../dataset/hdp/hdp")

In [8]:
from gensim.models import HdpModel
hdp = HdpModel.load("../dataset/hdp/hdp")
#lda_model = hdp.suggested_lda_model()
#lda_model.save("../dataset/hdp_lda/lda")

Since we didn't define a max number of topics, the model used the default 150. Here we get 6 topics and its 20 (default) most probable words. 

In [9]:
topics = []
num_topics = 6
for topic_id, topic in hdp.show_topics(num_topics=num_topics, formatted=False):
    topic = [word for word, _ in topic]
    topics.append(topic)
print(topics)

[['love', 'be', 'get', 'not', 'do', 'go', 'know', 'want', 'baby', 'have', 'let', 'time', 'never', 'would', 'can', 'ill', 'say', 'feel', 'come', 's'], ['love', 'get', 'be', 'not', 'do', 'go', 'baby', 'know', 'want', 'have', 'let', 'time', 'come', 'never', 'say', 'would', 'ill', 'can', 's', 'feel'], ['love', 'be', 'get', 'not', 'do', 'go', 'baby', 'know', 'want', 'have', 'time', 'let', 'come', 'would', 'never', 'ill', 'say', 'can', 'feel', 's'], ['love', 'be', 'get', 'not', 'do', 'go', 'baby', 'know', 'want', 'have', 'let', 'time', 'come', 'never', 'would', 'say', 'ill', 'feel', 'can', 'make'], ['love', 'be', 'get', 'not', 'do', 'go', 'baby', 'know', 'want', 'have', 'let', 'time', 'come', 'never', 'would', 'ill', 'say', 'can', 'feel', 's'], ['love', 'get', 'be', 'not', 'do', 'go', 'baby', 'know', 'want', 'have', 'time', 'let', 'come', 'never', 'ill', 'would', 'say', 'can', 's', 'feel']]


And finally the coherence of different combination of topics.

In [13]:
from gensim.models.coherencemodel import CoherenceModel

for i in range(5,100,5):
    topics = []
    num_topics = i
    for topic_id, topic in hdp.show_topics(num_topics=num_topics, formatted=False):
        topic = [word for word, _ in topic]
        topics.append(topic)
    cm = CoherenceModel(texts=lyrics, topics=topics, dictionary=id2word, coherence='c_v')
    print("Topics: ", i, " Coherence: ", cm.get_coherence())

Topics:  5  Coherence:  0.36640896081739316
Topics:  10  Coherence:  0.36636619693585215
Topics:  15  Coherence:  0.3663519423086718
Topics:  20  Coherence:  0.3663448149950816
Topics:  25  Coherence:  0.3663405386069276
Topics:  30  Coherence:  0.36633768768149166
Topics:  35  Coherence:  0.3663478695580489
Topics:  40  Coherence:  0.3663448149950816
Topics:  45  Coherence:  0.366342439223885
Topics:  50  Coherence:  0.3666439688776489
Topics:  55  Coherence:  0.3666354046033162
Topics:  60  Coherence:  0.36662826770803875
Topics:  65  Coherence:  0.36662881100427497
Topics:  70  Coherence:  0.3666782312638522
Topics:  75  Coherence:  0.36666966663692785
Topics:  80  Coherence:  0.36666217258836903
Topics:  85  Coherence:  0.366655560192582
Topics:  90  Coherence:  0.36665443632405564
Topics:  95  Coherence:  0.3666895025511239


The coherence value is the same for every value of topics, if we check the topics all of them are the same. It means the model returned only one topic.