# TLJ Topic Modeling - Non-Negative Matrix Factorization - Topics & Visualization

In [1]:
import pandas as pd
import numpy as np
import re 
from pprint import pprint

import pickle

import pyLDAvis
import pyLDAvis.sklearn

# NMF - 5 Topics

In [3]:
# Import topic results from 5 topic NMF model
nmf_topics_5 = pickle.load(open('../data/pickles/nmf/nmf_topics_5.pkl', 'rb'))
pprint(nmf_topics_5)

[('Topic: 0',
  'imdb_sentiments, google_sentiment, tokens, scores, reviews, '
  'nltk_sentiments, nltk_scores, google_score, google_magnitude'),
 ('Topic: 1',
  'scores, nltk_sentiments, tokens, reviews, nltk_scores, imdb_sentiments, '
  'google_sentiment, google_score, google_magnitude'),
 ('Topic: 2',
  'google_score, tokens, scores, reviews, nltk_sentiments, nltk_scores, '
  'imdb_sentiments, google_sentiment, google_magnitude'),
 ('Topic: 3',
  'nltk_scores, nltk_sentiments, reviews, tokens, scores, imdb_sentiments, '
  'google_sentiment, google_score, google_magnitude'),
 ('Topic: 4',
  'google_sentiment, tokens, nltk_scores, scores, reviews, nltk_sentiments, '
  'imdb_sentiments, google_score, google_magnitude')]


## NMF is promising
We don't have quantitative metrics to guide something like a grid search or to narrow the range of topic counts, so we'll have to step through different numbers of topics manually and evaluate the results. Ultimately this is what we have to do anyway: whether we use metrics or not, the true test of the model is how coherent the topics are, and that is a subjective judgment only the analyst can make at this point.
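
The manual review described above comes down to listing the highest-weighted terms for each topic at each candidate topic count. The following is a minimal sketch of that step, not the code that produced the pickled results: `format_topics`, the toy vocabulary, and the toy weights are all hypothetical, and the topic-term matrix is assumed to be a plain nested list.

```python
from pprint import pprint


def format_topics(weights, vocabulary, n_terms=3):
    """Return [('Topic: i', 'term, term, ...')], highest-weighted terms first."""
    topics = []
    for index, row in enumerate(weights):
        # Rank column indices by descending weight; ties keep vocabulary order.
        ranked = sorted(range(len(row)), key=lambda col: row[col], reverse=True)
        terms = ", ".join(vocabulary[col] for col in ranked[:n_terms])
        topics.append((f"Topic: {index}", terms))
    return topics


# Hypothetical toy data: 2 topics over a 5-term vocabulary.
toy_vocabulary = ["luke", "rey", "script", "jokes", "disney"]
toy_weights = [
    [0.9, 0.7, 0.0, 0.1, 0.2],
    [0.0, 0.1, 0.8, 0.6, 0.3],
]

toy_topics = format_topics(toy_weights, toy_vocabulary)
pprint(toy_topics)
```

Running the same formatting for each candidate topic count gives lists that can be read side by side, which is the comparison made in the sections that follow.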

With this in mind, NMF is quite promising right off the bat. Starting somewhat arbitrarily with 5 topics, the topics are generally more intelligible and fairly easy to interpret, especially compared with LDA.
- Topic 0 is just a collection of character names, groups and objects, which is understandable. Most likely these reviews got into specific details of the film, but there really isn't a "topic" to discern.
- Topic 1 seems to be criticising the director Rian Johnson and Disney and comparing this film to the original trilogy and Lucas. We do have to do some inference here and assume that this is in the context of a negative review to get to these interpretations though.
- Topic 2 is a general amalgamation of negative sentiment and terms and the only proper noun there is Disney, so we can assume that they are associating this negativity with Disney's stewardship of the franchise. 
- Topic 3 revolves around the story and plot and seems to compare it negatively to the original trilogy. 
- Topic 4 is about the quality of the writing, specifically the humor in the film. This is definitely in line with some of the themes in criticism of the film you see when reading reviews.

It's also worth noting that __NMF executes much faster than LDA__, 10-20x faster.
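
Speed-ups like the one quoted above can be checked with a simple wall-clock comparison. This is a minimal sketch, not the timing code used for this notebook: `time_call` and `speedup` are hypothetical helpers, and the 30-second and 2-second figures are made-up placeholders rather than measured results.

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Return (best_seconds, result) over `repeats` runs of func(*args, **kwargs)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result


def speedup(slow_seconds, fast_seconds):
    """How many times faster the fast run is than the slow run."""
    return slow_seconds / fast_seconds


# Stand-in workload; in practice this would be a model's fit call.
elapsed, total = time_call(sum, range(1_000_000))
print(f"stand-in workload took {elapsed:.4f}s")

# Placeholder timings (NOT measurements from this project).
lda_seconds, nmf_seconds = 30.0, 2.0
print(f"NMF speed-up over LDA: {speedup(lda_seconds, nmf_seconds):.0f}x")
```

Taking the best of several runs reduces noise from other processes, which matters when the two timings differ by a small factor.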

# NMF - 10 Topics

In [4]:
# Import topic results from 10 topic NMF model
nmf_topics_10 = pickle.load(open('../data/pickles/nmf/nmf_topics_10.pkl', 'rb'))
pprint(nmf_topics_10)

[('Topic: 0',
  'reviews, tokens, scores, nltk_sentiments, nltk_scores, imdb_sentiments, '
  'google_sentiment, google_score, google_magnitude'),
 ('Topic: 1',
  'nltk_sentiments, tokens, scores, reviews, nltk_scores, imdb_sentiments, '
  'google_sentiment, google_score, google_magnitude'),
 ('Topic: 2',
  'google_sentiment, tokens, scores, reviews, nltk_sentiments, nltk_scores, '
  'imdb_sentiments, google_score, google_magnitude'),
 ('Topic: 3',
  'tokens, scores, reviews, nltk_sentiments, nltk_scores, imdb_sentiments, '
  'google_sentiment, google_score, google_magnitude'),
 ('Topic: 4',
  'google_score, tokens, scores, reviews, nltk_sentiments, nltk_scores, '
  'imdb_sentiments, google_sentiment, google_magnitude'),
 ('Topic: 5',
  'nltk_scores, tokens, scores, reviews, nltk_sentiments, imdb_sentiments, '
  'google_sentiment, google_score, google_magnitude'),
 ('Topic: 6',
  'google_magnitude, tokens, scores, reviews, nltk_sentiments, nltk_scores, '
  'imdb_sentiments, google_senti

## 10 Topics Notes
Much like the 5-topic model, the 10-topic model provides very intelligible results. Most of the topics from the 5-topic model are here as well, some of them more focused, and a few more specific topics have been teased out. We'll proceed with 15 and 20 topics to see how well they perform.

# NMF - 15 Topics

In [5]:
nmf_topics_15 = pickle.load(open('../data/pickles/nmf/nmf_topics_15.pkl', 'rb'))
pprint(nmf_topics_15)

[('Topic: 0',
  'luke, rey, kylo, snoke, ren, skywalker, leia, force, yoda, lightsaber, '
  'dark, training, ben, side, kill, killed, finn, parents, vader, han'),
 ('Topic: 1',
  'see, people, reviews, go, going, fan, want, review, know, imdb, '
  'disappointed, said, everything, saw, fans, rating, loved, wanted, give, '
  'way'),
 ('Topic: 2',
  'worst, ever, seen, far, awful, series, doubt, piece, trash, believe, '
  'transformers, disaster, others, dont, phantom, without, menace, waste, '
  'dumpster, crap'),
 ('Topic: 3',
  'good, scenes, great, action, lot, felt, moments, effects, acting, feel, '
  'little, things, overall, visuals, pretty, script, boring, bit, interesting, '
  'however'),
 ('Topic: 4',
  'bad, joke, script, good, guys, acting, jokes, writing, worse, thought, '
  'sooo, guy, dont, directing, stupid, sorry, believe, disappointment, world, '
  'parody'),
 ('Topic: 5',
  'johnson, rian, abrams, director, kennedy, mark, hamill, kathleen, '
  'skywalker, fans, directio

## 15 Topics Notes
As with 10 topics, we see the previous model's topics as well as some new ideas being revealed.

# NMF - 20 Topics

In [6]:
nmf_topics_20 = pickle.load(open('../data/pickles/nmf/nmf_topics_20.pkl', 'rb'))
pprint(nmf_topics_20)

[('Topic: 0',
  'rey, kylo, snoke, ren, finn, rose, poe, luke, leia, resistance, parents, '
  'lightsaber, tfa, phasma, force, order, side, dark, interesting, killed'),
 ('Topic: 1',
  'see, people, going, want, go, know, fans, things, way, wanted, saw, love, '
  'hate, said, actually, went, feel, thing, thought, lot'),
 ('Topic: 2',
  'worst, ever, seen, far, doubt, series, awful, piece, believe, trash, '
  'transformers, others, disaster, without, dumpster, phantom, best, menace, '
  'slap, stupid'),
 ('Topic: 3',
  'story, line, telling, development, stupid, main, political, arc, stories, '
  'awful, way, visually, agenda, lines, unbelievable, tried, whole, follow, '
  'weak, simply'),
 ('Topic: 4',
  'bad, joke, script, jokes, guys, acting, writing, good, thought, sooo, '
  'worse, guy, directing, dont, stupid, sorry, parody, believe, world, thing'),
 ('Topic: 5',
  'johnson, rian, abrams, director, kennedy, kathleen, mark, hamill, fans, '
  'trilogy, direction, completely, skywalk

## 20 Topics Notes
Lots of very intelligible ideas here, though we are starting to see some overlap. For the most part, though, the topics are at least somewhat interpretable, and some new ideas are being introduced, such as Topic 16, which seems to claim that critics were paid for positive reviews and that the user reviews on IMDb are fake. Very interesting.
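
The overlap mentioned above can be given a rough number by comparing the top-term lists of two topics. The following is a minimal sketch using Jaccard similarity (shared terms divided by total distinct terms). `jaccard` is a hypothetical helper, and the two term lists are copied from Topics 2 and 4 in the output above.

```python
def jaccard(terms_a, terms_b):
    """Jaccard similarity of two term collections: |A & B| / |A | B|."""
    set_a, set_b = set(terms_a), set(terms_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Top terms copied from the 20-topic output (Topic 2 and Topic 4).
topic_2 = (
    "worst, ever, seen, far, doubt, series, awful, piece, believe, trash, "
    "transformers, others, disaster, without, dumpster, phantom, best, menace, "
    "slap, stupid"
).split(", ")
topic_4 = (
    "bad, joke, script, jokes, guys, acting, writing, good, thought, sooo, "
    "worse, guy, directing, dont, stupid, sorry, parody, believe, world, thing"
).split(", ")

shared = set(topic_2) & set(topic_4)
print(sorted(shared))
print(f"Jaccard similarity: {jaccard(topic_2, topic_4):.3f}")
```

A score near 0 means the two topics share almost none of their top terms, and a score near 1 means they are close to duplicates. Computing it for every pair of topics would point to the pairs worth a closer manual look.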

We could go further and maybe discover some additional topics, but I think this is good in terms of the number of topics. Lots to work with here. 

One additional area to explore though is bigrams and trigrams with NMF.
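
For readers unfamiliar with the terms, bigrams and trigrams are just runs of two and three consecutive tokens that are added to the vocabulary alongside single words. A minimal sketch of the idea follows. `ngrams` is a hypothetical helper written for illustration, not the feature extraction used to build the pickled models.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of length n_min..n_max as space-joined strings."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the token list.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = ["worst", "ever", "seen"]
print(ngrams(tokens, 1, 2))  # unigrams + bigrams
print(ngrams(tokens, 1, 3))  # unigrams + bigrams + trigrams
```

scikit-learn's `TfidfVectorizer` exposes the same idea through its `ngram_range` parameter, for example `ngram_range=(1, 2)` for unigrams plus bigrams. Whether the models loaded here were built that way is an assumption.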

# NMF - 20 Topics - Bigrams

In [7]:
nmf_topics_20_bigrams = pickle.load(open('../data/pickles/nmf/nmf_topics_20_bigrams.pkl', 'rb'))
pprint(nmf_topics_20_bigrams)

[('Topic: 0',
  'rey, kylo, luke, snoke, ren, kylo ren, force, finn, leia, rose, lightsaber, '
  'poe, dark, parents, side, training, yoda, tfa, powerful, killed'),
 ('Topic: 1',
  'people, see, reviews, go, want, going, fan, fans, know, review, imdb, '
  'everything, way, hate, said, things, love, loved, feel, wanted'),
 ('Topic: 2',
  'worst ever, worst, ever, ever seen, seen, ever worst, series worst, doubt '
  'worst, without doubt, far worst, seen worst, doubt, ever others, ever '
  'believe, others disney, seen seen, crap worst, far, lose piece, ever cry'),
 ('Topic: 3',
  'johnson, rian, rian johnson, abrams, director, kennedy, kathleen, kathleen '
  'kennedy, fans, director rian, hate rian, ruining hate, disney rian, '
  'direction, franchise, hate, completely, saga, every, direct'),
 ('Topic: 4',
  'bad, bad bad, good, joke, script, bad script, guys, sooo bad, bad jokes, '
  'sooo, bad story, bad directing, jokes, acting, script bad, bad plot, bad '
  'writing, guy, bad guys, 

## 20 Topics Bigrams Notes
Bigram-generated topics aren't that different from those of the non-bigram model. There is a bit more clarity on some topics, and there is one genuinely new topic, seen in Topic 12, about the film being an "assembly line" creation.

Overall __I like the bigrams output slightly better__, but it's worth noting that __the execution time is much longer, almost 30x longer__. If performance is a significant consideration I would go with the non-bigrams model. 

As a final test we'll take a look at the trigrams version of the model. 

# NMF - 20 Topics - Trigrams

In [8]:
nmf_topics_20_trigrams = pickle.load(open('../data/pickles/nmf/nmf_topics_20_trigrams.pkl', 'rb'))
pprint(nmf_topics_20_trigrams)

[('Topic: 0',
  'luke, rey, kylo, force, ren, snoke, kylo ren, skywalker, luke skywalker, '
  'leia, dark, yoda, lightsaber, ben, side, training, kill, vader, dark side, '
  'finn'),
 ('Topic: 1',
  'good, people, see, great, lot, felt, scenes, feel, things, new, way, '
  'action, go, going, fan, end, little, first, films, actually'),
 ('Topic: 2',
  'worst ever, worst, ever, ever seen, worst ever seen, seen, ever worst, '
  'worst ever worst, ever worst ever, series worst, series worst ever, doubt '
  'worst ever, without doubt worst, doubt worst, without doubt, far worst, '
  'seen worst, ever seen worst, doubt, worst ever others'),
 ('Topic: 3',
  'bad, bad bad, good, script, joke, terrible, bad script, guys, sooo bad, '
  'acting, sooo, bad directing, bad jokes, bad story, terrible bad end, jokes, '
  'bad guys, bad plot, script bad, bad acting'),
 ('Topic: 4',
  'corporate, assembly line, crass, assembly, rip offs, hollywood, cinematic, '
  'rip, offs, adult, reference, cinema, st

## 20 Topics Trigrams Notes
The topic results for trigrams are __similar to the bigrams results__ in several ways.
- __Improved Results__ - Trigrams are a bit of an improvement over the 1-grams model, though not dramatically so.
- __New Topics__ - Trigrams revealed a few subjects that weren't present in either the default model or the bigrams model. It still had the interesting "corporate assembly line production/stinky fish" topic and the "jar jar" comparisons topic, but it added these new topics as well:
    - Topic 15: Diversity as a specific problem
    - Topic 17: Political correctness
    - Topic 19: Childish "self-spoofing"
- __Longer Execution Time__ - Trigrams also took much longer to run. It took 3x as long as bigrams and 75x as long as the default model. Again, something to bear in mind.

__If execution time were not an issue, NMF trigrams would be the easy choice__. However, if execution time is a concern, and it often is, then you'll have to choose what level of model effectiveness you want versus how long you're willing to wait for it. If this sits in the middle of a live pipeline then __it could become a significant bottleneck__.

However, __there may be ways to keep the model quality while cutting execution time through additional optimization__. This is definitely an area to explore.

# pyLDAvis Visualization 
- Just to see if there is a __difference in the graphs between LDA and NMF__ we'll visualize the NMF bigrams and trigrams with 20 topics.

## 20 Topics Bigrams Visualization

In [9]:
# Import data
nmf_20_bigrams_model = pickle.load(open('../data/pickles/nmf/nmf_model_20_bigrams.pkl', 'rb'))
tlj_tfidf_data_20_bigrams = pickle.load(open('../data/pickles/nmf/nmf_tfidf_data_20_bigrams.pkl', 'rb'))
tlj_tfidf_model_20_bigrams = pickle.load(open('../data/pickles/nmf/nmf_tfidf_model_20_bigrams.pkl', 'rb'))


In [10]:
pyLDAvis.enable_notebook()
panel_bigrams = pyLDAvis.sklearn.prepare(nmf_20_bigrams_model, tlj_tfidf_data_20_bigrams, tlj_tfidf_model_20_bigrams, mds='tsne')

  kernel = (topic_given_term * np.log((topic_given_term.T / topic_proportion).T))
  log_lift = np.log(topic_term_dists / term_proportion)
  log_ttd = np.log(topic_term_dists)


In [11]:
panel_bigrams

## 20 Topics Trigrams Visualization

In [12]:
# Import data
nmf_20_trigrams_model = pickle.load(open('../data/pickles/nmf/nmf_model_20_trigrams.pkl', 'rb'))
tlj_tfidf_data_20_trigrams = pickle.load(open('../data/pickles/nmf/nmf_tfidf_data_20_trigrams.pkl', 'rb'))
tlj_tfidf_model_20_trigrams = pickle.load(open('../data/pickles/nmf/nmf_tfidf_model_20_trigrams.pkl', 'rb'))


In [13]:
pyLDAvis.enable_notebook()
panel_trigrams = pyLDAvis.sklearn.prepare(nmf_20_trigrams_model, tlj_tfidf_data_20_trigrams, tlj_tfidf_model_20_trigrams, mds='tsne')

  kernel = (topic_given_term * np.log((topic_given_term.T / topic_proportion).T))
  log_lift = np.log(topic_term_dists / term_proportion)
  log_ttd = np.log(topic_term_dists)


In [14]:
panel_trigrams

## NMF pyLDAvis Visualization Notes
The difference between the LDA visualization and NMF visualization is striking. We have really __well defined and separated topics__ with __virtually no overlap__. It's good to see that __this is in line with the much improved topic results in terms of human intelligibility__.

# NMF Summary
It was clear from the topics, but the visualization really confirms it. __NMF is the clear winner in terms of topic model performance and execution time__. Although the bigrams and trigrams models do take longer than LDA, __even the unigram results far outperform LDA in terms of topics__ and are __at least 10x faster than LDA.__