# TLJ Topic Modeling - Scikit-learn LDA - Visualization & Topics

In [1]:
import pandas as pd
import numpy as np
from pprint import pprint

import pyLDAvis
import pyLDAvis.sklearn

import time

import pickle

# Scikit-learn LDA Default Model (20 Topics)
- Let's look at the results of the default model, with the number of topics set somewhat arbitrarily to 20.

In [2]:
# Load the pickled model, document-term matrix, and vectorizer
with open('../data/pickles/lda/lda_skl_default_model.pkl', 'rb') as f:
    skl_best_lda_model = pickle.load(f)
with open('../data/pickles/lda/lda_skl_default_data_vectorized.pkl', 'rb') as f:
    data_vectorized = pickle.load(f)
with open('../data/pickles/lda/lda_skl_default_vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)

## View Topic Keywords

In [3]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

no_top_words = 20
# Note: scikit-learn 1.0 added get_feature_names_out(), and 1.2 removed get_feature_names()
display_topics(skl_best_lda_model, vectorizer.get_feature_names(), no_top_words)

Topic 0:
characters, like, johnson, rian, jedi, disney, one, good, awakens, force, franchise, story, original, bad, even, fan, trilogy, made, see, new
Topic 1:
like, disney, good, one, even, see, bad, force, fans, movies, story, could, way, get, time, characters, people, character, episode, fan
Topic 2:
one, like, disney, force, people, episode, jedi, bad, reviews, films, many, awakens, fans, story, even, good, made, know, going, first
Topic 3:
filler, like, one, hate, see, disney, bad, good, people, plot, watch, many, say, thought, worst, characters, fans, way, get, even
Topic 4:
like, character, characters, story, one, luke, good, force, plot, jedi, even, bad, new, space, johnson, first, made, time, movies, awakens
Topic 5:
jedi, bad, one, new, story, good, plot, movies, characters, way, long, like, let, two, character, better, many, could, least, even
Topic 6:
even, one, characters, like, good, see, character, plot, scenes, time, get, could, felt, going, great, many, force, way, sto

- Again, some topics are somewhat intelligible, but they aren't as clear as we'd like. Next we'll use GridSearch to look for better hyperparameter values based on log-likelihood, the metric that Scikit-learn's LDA reports through its `score` method.
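
As a rough illustration of that kind of search, here is a minimal sketch rather than the code from the GridSearch notebooks. The toy corpus, the `toy_` names, and the `param_grid` values are all assumptions made for the example. With no `scoring` argument, `GridSearchCV` falls back on the estimator's own `score` method, which for `LatentDirichletAllocation` is the approximate log-likelihood.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Hypothetical toy corpus standing in for the review texts
toy_docs = [
    "luke rey force jedi kylo",
    "rey kylo snoke leia finn",
    "disney franchise money bad plot",
    "disney worst franchise plot holes",
    "luke force jedi leia space",
    "bad plot worst money disney",
    "finn rose ship space order",
    "franchise fans story bad disney",
]

# Build the document-term matrix once, then search over LDA settings only
toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Assumed search space, chosen only to keep the example small
param_grid = {"n_components": [2, 3, 4]}

search = GridSearchCV(
    LatentDirichletAllocation(max_iter=10, random_state=0),
    param_grid,
    cv=2,  # each candidate is scored on held-out folds via LDA.score()
)
search.fit(toy_matrix)

print("Best parameters:", search.best_params_)
print("Best mean log-likelihood:", search.best_score_)
```

Higher (less negative) scores rank better, so the search returns whichever candidate has the highest mean held-out log-likelihood, whether or not its topics are interpretable.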

## Visualize Results

In [5]:
# Visualize with pyLDAvis
pyLDAvis.enable_notebook()
panel_1 = pyLDAvis.sklearn.prepare(skl_best_lda_model, data_vectorized, vectorizer)
panel_1

- This is perhaps somewhat better than the gensim results, but there is still significant overlap between topics.

# Scikit-learn LDA GridSearch 1

In [6]:
# Load the pickled model, document-term matrix, and vectorizer
with open('../data/pickles/lda/lda_skl_grid_best_model_1.pkl', 'rb') as f:
    skl_best_lda_model = pickle.load(f)
with open('../data/pickles/lda/lda_skl_grid_data_vectorized_1.pkl', 'rb') as f:
    data_vectorized = pickle.load(f)
with open('../data/pickles/lda/lda_skl_vectorizer_1.pkl', 'rb') as f:
    vectorizer = pickle.load(f)

## View Topic Keywords

In [7]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

no_top_words = 20
display_topics(skl_best_lda_model, vectorizer.get_feature_names(), no_top_words)

Topic 0:
luke, like, force, jedi, new, one, characters, story, even, character, good, plot, see, rey, trilogy, time, awakens, johnson, disney, first
Topic 1:
disney, made, worst, ever, dead, rian, johnson, old, correct, politically, type, armada, call, intellect, filled, young, dont, original, time, one
Topic 2:
disney, like, bad, one, story, characters, good, movies, even, made, see, plot, people, worst, ever, franchise, time, watch, could, money
Topic 3:
luke, like, rey, force, kylo, one, character, jedi, first, finn, snoke, leia, even, order, space, rose, story, good, could, ship
Topic 4:
jedi, plot, movies, character, mess, new, least, series, bad, holes, let, fans, care, cinema, characters, cinematic, franchise, better, line, worse
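
The `display_topics` helper relies on the slice `topic.argsort()[:-no_top_words - 1:-1]`, which is easy to misread. `argsort()` returns indices in ascending order of weight, and the slice walks backward from the end, so it yields the indices of the `no_top_words` largest weights, largest first. A plain-Python equivalent makes the ordering explicit. The function name `top_keywords` and the toy values are hypothetical, and tie-breaking may differ from NumPy's.

```python
def top_keywords(weights, feature_names, n):
    """Return the n feature names with the largest weights, largest first."""
    # Rank feature indices by weight in descending order
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [feature_names[i] for i in ranked[:n]]


# Hypothetical weights standing in for one row of model.components_
toy_features = ["force", "jedi", "luke", "plot", "rey"]
toy_weights = [0.4, 2.5, 3.1, 0.9, 1.7]

print(", ".join(top_keywords(toy_weights, toy_features, 3)))
# prints: luke, jedi, rey
```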


## Visualize Results

In [8]:
# Visualize with pyLDAvis
pyLDAvis.enable_notebook()
panel_2 = pyLDAvis.sklearn.prepare(skl_best_lda_model, data_vectorized, vectorizer)
panel_2

- While 5 topics apparently yields the best log-likelihood, topics 1, 2 and 3 overlap significantly, and the topics themselves are not very useful: they are far too general and contain too many generic keywords.
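
One rough way to put a number on "too general" is to compare the printed keyword lists directly. The sketch below is an assumption-laden proxy, not what pyLDAvis computes: it takes the five keyword lists printed above, computes the Jaccard similarity for each pair of topics, and lists keywords that appear in at least three topics. The threshold of three is arbitrary, and pyLDAvis may number topics differently from the printed indices.

```python
from collections import Counter
from itertools import combinations

# Top-20 keyword lists copied from the printed GridSearch 1 output
topic_keywords = {
    0: "luke, like, force, jedi, new, one, characters, story, even, character, good, plot, see, rey, trilogy, time, awakens, johnson, disney, first".split(", "),
    1: "disney, made, worst, ever, dead, rian, johnson, old, correct, politically, type, armada, call, intellect, filled, young, dont, original, time, one".split(", "),
    2: "disney, like, bad, one, story, characters, good, movies, even, made, see, plot, people, worst, ever, franchise, time, watch, could, money".split(", "),
    3: "luke, like, rey, force, kylo, one, character, jedi, first, finn, snoke, leia, even, order, space, rose, story, good, could, ship".split(", "),
    4: "jedi, plot, movies, character, mess, new, least, series, bad, holes, let, fans, care, cinema, characters, cinematic, franchise, better, line, worse".split(", "),
}


def jaccard(a, b):
    """Share of distinct keywords that two lists have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Pairwise keyword overlap between topics
for i, j in combinations(sorted(topic_keywords), 2):
    overlap = jaccard(topic_keywords[i], topic_keywords[j])
    print(f"Topics {i} and {j}: {overlap:.2f}")

# Keywords that show up in three or more topics (arbitrary cutoff)
keyword_counts = Counter(
    word for words in topic_keywords.values() for word in set(words)
)
shared = sorted(word for word, count in keyword_counts.items() if count >= 3)
print("Keywords in 3+ topics:", ", ".join(shared))
```

High pairwise values and a long shared-keyword list would both point to the same problem described above: the topics are distinguished by only a handful of words.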

# Scikit-learn LDA GridSearch 2

In [9]:
# Load the pickled model, document-term matrix, and vectorizer
with open('../data/pickles/lda/lda_skl_grid_best_model_2.pkl', 'rb') as f:
    skl_best_lda_model = pickle.load(f)
with open('../data/pickles/lda/lda_skl_grid_data_vectorized_2.pkl', 'rb') as f:
    data_vectorized = pickle.load(f)
with open('../data/pickles/lda/lda_skl_vectorizer_2.pkl', 'rb') as f:
    vectorizer = pickle.load(f)

## View Topic Keywords

In [10]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

no_top_words = 20
display_topics(skl_best_lda_model, vectorizer.get_feature_names(), no_top_words)

Topic 0:
story, disney, new, good, bad, plot, see, franchise, fans, johnson, force, people, trilogy, original, fan, episode, way, rian, awakens, films
Topic 1:
luke, rey, force, kylo, first, snoke, leia, story, finn, new, good, ren, order, way, space, see, plot, rose, bad, end


## Visualize Results

In [11]:
# Visualize with pyLDAvis
pyLDAvis.enable_notebook()
panel_3 = pyLDAvis.sklearn.prepare(skl_best_lda_model, data_vectorized, vectorizer)
panel_3

# Summary
- Log-likelihood for the sklearn LDA model just __doesn't seem to be yielding productive results__. For some reason, __the smallest numbers of topics produce the best log-likelihoods__, but __the resulting topics are too general to be useful__.
- We'll move on from here to __try Non-Negative Matrix Factorization (NMF)__ to see if it produces better results.