# Topic Modeling using LDA

In [2]:
# libraries

import pandas as pd # to work with dataframes
from wordcloud import WordCloud # to analyze frequency of different words in the corpus
import re # for using regular expressions

# used for pre-processing the text data and unsupervised topic modeling
import gensim
from gensim import models
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.corpora import Dictionary

# used for natural language processing {NLTK: Natural Language Tool-Kit} 
from nltk.stem import WordNetLemmatizer, SnowballStemmer, LancasterStemmer
from nltk.stem.porter import PorterStemmer # explicit import instead of a wildcard
import nltk
nltk.download('wordnet')

import matplotlib.pyplot as plt

# Set a seed to reproduce the results later. Seed used here is '2019'
import numpy as np
np.random.seed(2019)

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/siddharth/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
pd.set_option('display.max_colwidth', None) # None shows the full column text; -1 is deprecated since pandas 1.0

news = pd.read_csv("abcnews-date-text.csv")
news = news.drop(columns = "publish_date") # since this data isn't really required for the topic modeling
news['index'] = news.index # we would want to give each article an index to reference back to it later
print("total number of articles: " + str(len(news)))
news[500:520]

total number of articles: 1103663


Unnamed: 0,headline_text,index
500,committal continues into goulburn jail riot,500
501,costello unhappy he wasnt consulted by stone,501
502,council approves poultry farm,502
503,council awaits more rain,503
504,council considers indigenous caravan park plan,504
505,council elections planned for may,505
506,council rejects combined field days stand idea,506
507,council to change tree protection by law,507
508,council to fund groundwater study,508
509,counsel begin summing up at warnes doping hearing,509


Just from eyeballing this slice of the data (rows 500 to 519), we can see articles covering politics, law, and sports, to name a few.

## Data pre-processing

##### Pre-processing includes the following steps:

1. <font color="blue"> Stemming-lemmatization </font>: This step extracts the root word from each word in a document. For instance, the root word for "happiness" is "happy".
2. <font color="blue"> Removing stopwords </font>: Stopwords are the most commonly used words in natural language. Examples: "The", "a", "is", "at", "which" etc.

#### 1. Stemming-Lemmatization

There are many stemmers in the NLTK library.

Some of the frequently used stemmers are:
1. SnowballStemmer
2. LancasterStemmer
3. PorterStemmer

The lemmatizer in the NLTK library is: WordNetLemmatizer

Let's look at how these would work on different words.

In [4]:
# sample words to stem/lemmatize
sample_words = ['happiness', 'flies', 'workers', 'dogs', 'agreed', 'owned', 'humbled', 'meeting', 'helper', 
                'drinks', 'watching', 'traditional', 'politics', 'player', 'curator', 'better', 'best', 
                'cooker', 'cooking']

# initializing the stemmer and lemmatizer
stemmer = LancasterStemmer() # using the LancasterStemmer
lemmatizer = WordNetLemmatizer()

# stemming process
stems = [stemmer.stem(plural) for plural in sample_words]

# lemmatization process
lemmas = [lemmatizer.lemmatize(plural) for plural in sample_words]

# saves the results in a dictionary and creates a dataframe from it
pd.DataFrame(data = {'original word': sample_words, 'stemmed': stems, 'lemma': lemmas})

Unnamed: 0,original word,stemmed,lemma
0,happiness,happy,happiness
1,flies,fli,fly
2,workers,work,worker
3,dogs,dog,dog
4,agreed,agree,agreed
5,owned,own,owned
6,humbled,humbl,humbled
7,meeting,meet,meeting
8,helper,help,helper
9,drinks,drink,drink


From the dataframe above, the following observations can be made:
1. For some words, lemmatization and stemming give exactly the same result (e.g. "dogs", "drinks")
2. Lemmatization leaves certain words unchanged (e.g. "agreed", "player", "watching")
3. Stemming reduces certain words to something that no longer looks like a word (e.g. "better", "curator", "politics", "traditional")

##### <font color = "black"> Q. There is an argument in the WordNetLemmatizer.lemmatize() called "pos" which stands for Part-of-speech. Do you think that could make a difference for the lemmatization process for the words given above? </font>

Note: by default, the "pos" argument is "n", which stands for noun.

In [5]:
# lemmatization process (pos = "verb")
lemma_v = [lemmatizer.lemmatize(plural, pos = "v") for plural in sample_words]

# lemmatization process (pos = "a", i.e. adjective)
lemma_a = [lemmatizer.lemmatize(plural, pos = "a") for plural in sample_words]

# saves the results in a dictionary and creates a dataframe from it
pd.DataFrame(data = {'original word': sample_words, 'stemmed': stems, 'lemma-noun': lemmas, 'lemma-verb': lemma_v,
                    'lemma-adjective': lemma_a})

Unnamed: 0,original word,stemmed,lemma-noun,lemma-verb,lemma-adjective
0,happiness,happy,happiness,happiness,happiness
1,flies,fli,fly,fly,flies
2,workers,work,worker,workers,workers
3,dogs,dog,dog,dog,dogs
4,agreed,agree,agreed,agree,agreed
5,owned,own,owned,own,owned
6,humbled,humbl,humbled,humble,humbled
7,meeting,meet,meeting,meet,meeting
8,helper,help,helper,helper,helper
9,drinks,drink,drink,drink,drinks


So, which part-of-speech argument should be used for this problem?

For this case, since we are modeling topics for a corpus of news articles, using verbs for the part-of-speech argument does seem to be intuitive. This is because we want to associate the articles to different topics based on the words we observe in a cluster of documents.

For instance, if we observe the words "cook", "chef", "vegetables", "restaurant", "soup", "wine" as the most commonly occurring words, we would want to choose a topic such as "CULINARY NEWS" in the context of news articles.

So, in order to get the right root, we might want to extract the root based on the part of speech being set to "verb".

Another option is to stem the lemmatized word. Many people combine these methods to extract roots, but for this analysis I will use the lemmas only. The table below compares plain stemming, lemmatization with pos = "v", and lemmatization followed by stemming.

In [6]:
# combined lemmatization and stemming process (in that order)
stem_lemma = [stemmer.stem(lemmatizer.lemmatize(plural, pos = "v")) for plural in sample_words]

pd.DataFrame(data = {'original word': sample_words, 'stemmed': stems, 'lemma-verb': lemma_v,
                    'stem-lemma': stem_lemma})

Unnamed: 0,original word,stemmed,lemma-verb,stem-lemma
0,happiness,happy,happiness,happy
1,flies,fli,fly,fly
2,workers,work,workers,work
3,dogs,dog,dog,dog
4,agreed,agree,agree,agr
5,owned,own,own,own
6,humbled,humbl,humble,humbl
7,meeting,meet,meet,meet
8,helper,help,helper,help
9,drinks,drink,drink,drink


Based on the dataframe above, I prefer the lemma-verb column over the others.

Let's now define a function that lemmatizes a given word:

In [7]:
# function to lemmatize a given word

def lemmatize(text):
    return lemmatizer.lemmatize(text, pos = "v")

#### 2. Removing stopwords

Stopwords can be removed by comparing each word in a sentence against a predefined list. The STOPWORDS constant in gensim.parsing.preprocessing provides such a list for the English language.

In [8]:
# stopwords in the list
gensim.parsing.preprocessing.STOPWORDS

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

Let's look at how it works for a few sentences:

In [9]:
sentence1 = "We are trying to learn how to implement topic modeling using LDA"
sentence2 = "This is a sample sentence to check how to remove stopwords from a sentence"

First, we tokenize the sentences above and then compare each token against the stopword list. To tokenize, we use the gensim.utils.simple_preprocess() function, which also lowercases the text and, by default, drops tokens shorter than two characters. Let's try that:

In [10]:
print(gensim.utils.simple_preprocess(sentence1))
print(gensim.utils.simple_preprocess(sentence2))

['we', 'are', 'trying', 'to', 'learn', 'how', 'to', 'implement', 'topic', 'modeling', 'using', 'lda']
['this', 'is', 'sample', 'sentence', 'to', 'check', 'how', 'to', 'remove', 'stopwords', 'from', 'sentence']


The tokenization works, so next let's try to remove all the stopwords from each of the sentences.

In [11]:
# initializing two empty lists for each of the sentences
res1 = []
res2 = []

for word in gensim.utils.simple_preprocess(sentence1):
    if word not in gensim.parsing.preprocessing.STOPWORDS and len(word) > 3:
        res1.append(word)
        
for word in gensim.utils.simple_preprocess(sentence2):
    if word not in gensim.parsing.preprocessing.STOPWORDS and len(word) > 3:
        res2.append(word)

print(res1)
print(res2)

# Note that we also checked for the length of the words along with removing stopwords

['trying', 'learn', 'implement', 'topic', 'modeling']
['sample', 'sentence', 'check', 'remove', 'stopwords', 'sentence']


Now that we have a way to remove stopwords, we can combine lemmatization and stopword removal in a single function:

In [12]:
# function to preprocess the text

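def preprocess(text):
    result = []
    for word in gensim.utils.simple_preprocess(text):
        # skip stopwords and short words (3 characters or fewer)
        if word not in gensim.parsing.preprocessing.STOPWORDS and len(word) > 3:
            lemma = lemmatize(word)
            # lemmatization can shorten a word, so check the length again
            if len(lemma) > 3:
                result.append(lemma)
    return result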

Let's see how this works for the news article at index = 500 from the data frame above that had the headline "<font color = "red">committal continues into goulburn jail riot</font>"

In [13]:
preprocess(news.iloc[500,0])

['committal', 'continue', 'goulburn', 'jail', 'riot']

It works!!! Now we can implement this preprocessing function over the entire data we have.

In [14]:
processed_news = news['headline_text'].map(preprocess)

In [15]:
processed_news[500:520]

500    [committal, continue, goulburn, jail, riot]             
501    [costello, unhappy, wasnt, consult, stone]              
502    [council, approve, poultry, farm]                       
503    [council, await, rain]                                  
504    [council, consider, indigenous, caravan, park, plan]    
505    [council, elections, plan]                              
506    [council, reject, combine, field, days, stand, idea]    
507    [council, change, tree, protection]                     
508    [council, fund, groundwater, study]                     
509    [counsel, begin, warn, dope, hear]                      
510    [criminal, charge, pending, south, korea, subway, probe]
511    [crocs, prove, good, bullets]                           
512    [date, bushfires, coronial, inquiry]                    
513    [dean, receive, lifetime, parliamentary, pension]       
514    [death, spell, record, marriage]                        
515    [demons, thump, tigers]          

## Modeling

We will consider two modeling approaches:
1. LDA using bag of words
2. LDA using TF-IDF
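
The two approaches differ only in how each word in a document is weighted: raw counts versus counts scaled down for words that appear in many documents. The snippet below is a rough sketch of the second weighting, assuming the common formulation tf × log2(N / df) without normalization (libraries differ in the exact variant they use). The function name `tfidf_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {token: weight} dict per document using tf * log2(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    weighted = []
    for doc in docs:
        term_freq = Counter(doc)
        weighted.append({tok: tf * math.log2(n_docs / doc_freq[tok])
                         for tok, tf in term_freq.items()})
    return weighted

# Hypothetical, already pre-processed headlines
toy_docs = [
    ['council', 'approve', 'farm'],
    ['council', 'await', 'rain'],
    ['council', 'fund', 'study'],
    ['police', 'charge', 'rain'],
]
weights = tfidf_weights(toy_docs)
print(weights[1])
```

In this toy corpus "council" appears in three of the four documents, so it receives a much lower weight than "await", which appears in only one.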

#### Bag of words

This method uses the frequency of occurrence of words in the corpus as a feature.
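
As a minimal, library-free sketch of the idea (not gensim's implementation), the snippet below builds a token-to-id mapping and converts each tokenized document into sorted `(token_id, count)` pairs, which is the same shape of output that `doc2bow` produces later in this notebook. The helper names `build_vocab` and `to_bow` and the toy documents are hypothetical.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each unique token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for tokens present in the vocabulary."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Hypothetical, already pre-processed headlines
toy_docs = [
    ['council', 'approve', 'poultry', 'farm'],
    ['council', 'await', 'rain'],
    ['council', 'fund', 'council', 'study'],
]
vocab = build_vocab(toy_docs)
bows = [to_bow(doc, vocab) for doc in toy_docs]
print(bows[2])
```

Word order is discarded: each document is reduced to which words it contains and how often.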

In [16]:
# creates a dictionary of unique words from the corpus
dictionary = Dictionary(processed_news)

In [17]:
count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 10:
        break

0 broadcast
1 community
2 decide
3 licence
4 aware
5 defamation
6 witness
7 call
8 infrastructure
9 protection
10 summit


In [18]:
# these are the number of unique words identified from the pre-processed news articles
len(dictionary)

77409

Given that there are about 1.1 million articles in the dataset, the number of unique words that would be used for modeling still seems high. To filter the dictionary further, we will try to remove the following types of words:
1. Words that appear in fewer than 'x' articles
2. Words that appear in more than 'y%' of the articles

In order to select the values of x and y, we will inspect our dictionary.
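
The effect of these two thresholds can be sketched on a toy corpus without gensim. The function below is an illustrative stand-in for the idea behind `filter_extremes(no_below=x, no_above=y)`, not its actual implementation; `filter_by_doc_freq` and the toy data are hypothetical.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens that occur in at least `no_below` documents and in
    no more than `no_above` (a fraction) of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Hypothetical, already pre-processed headlines
toy_docs = [
    ['police', 'charge', 'court'],
    ['police', 'plan', 'court'],
    ['police', 'plan', 'rain'],
    ['police', 'hallahan'],
]
kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here "police" is dropped because it appears in every document, and "charge", "rain" and "hallahan" are dropped because each appears in only one.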

In [19]:
# document frequencies

words_in_articles = pd.DataFrame.from_dict(dictionary.dfs, orient = 'index', columns = ["number of documents"])
words_in_articles.sort_values("number of documents", ascending = False, inplace = True)
words_in_articles

Unnamed: 0,number of documents
238,36127
47,22403
323,19267
165,16922
213,16727
...,...
50384,1
50383,1
50382,1
50380,1


In [20]:
# map each token id (the dataframe index) back to its word
words_in_articles['word'] = [dictionary[token_id] for token_id in words_in_articles.index]

In [21]:
words_in_articles

Unnamed: 0,number of documents,word
238,36127,police
47,22403,plan
323,19267,charge
165,16922,govt
213,16727,court
...,...,...
50384,1,hallahan
50383,1,dazy
50382,1,blakefield
50380,1,blechynden


In [22]:
#dictionary.filter_extremes(no_below = 100, no_above = 0.5)

In [23]:
len(dictionary)

77409

In [None]:
text = " ".join(headline for headline in news["headline_text"])

wordcloud = WordCloud(background_color="white").generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
text = " ".join(word for doc in processed_news for word in doc)

wordcloud = WordCloud(background_color="white").generate(text)
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis("off")
plt.show()

In [24]:
bag_of_words = [dictionary.doc2bow(doc) for doc in processed_news] # 'doc' avoids reusing the name of the 'news' dataframe

In [25]:
lda_model_bow = models.LdaMulticore(bag_of_words, num_topics=10, id2word=dictionary, passes=2, workers=2)

In [26]:
for idx, topic in lda_model_bow.print_topics(-1):
    print('Topic: {} \nWords: {}\n'.format(idx, topic))

Topic: 0 
Words: 0.023*"attack" + 0.023*"world" + 0.022*"kill" + 0.020*"coast" + 0.017*"interview" + 0.014*"south" + 0.013*"tasmania" + 0.013*"gold" + 0.012*"women" + 0.009*"australian"

Topic: 1 
Words: 0.019*"rural" + 0.015*"minister" + 0.014*"call" + 0.014*"lose" + 0.013*"need" + 0.013*"bank" + 0.012*"force" + 0.011*"royal" + 0.011*"government" + 0.010*"hobart"

Topic: 2 
Words: 0.031*"charge" + 0.029*"court" + 0.021*"murder" + 0.019*"face" + 0.019*"perth" + 0.016*"jail" + 0.014*"accuse" + 0.014*"home" + 0.014*"test" + 0.013*"australian"

Topic: 3 
Words: 0.019*"open" + 0.019*"country" + 0.017*"years" + 0.015*"government" + 0.015*"power" + 0.014*"hour" + 0.013*"league" + 0.013*"sydney" + 0.012*"hospital" + 0.011*"state"

Topic: 4 
Words: 0.019*"school" + 0.016*"health" + 0.016*"fund" + 0.014*"help" + 0.014*"indigenous" + 0.012*"council" + 0.012*"turnbull" + 0.012*"concern" + 0.011*"trial" + 0.010*"plan"

Topic: 5 
Words: 0.054*"australia" + 0.024*"crash" + 0.019*"house" + 0.018*"don

In [73]:
# 505, 506, 507, 508, 509, 510
t = bag_of_words[506]
show_headline(506)

council rejects combined field days stand idea
['council', 'reject', 'combine', 'field', 'days', 'stand', 'idea']


In [74]:
for index, score in sorted(lda_model_bow[t], key = lambda x: -1*x[1]):
    print("\nScore: {}\t \nTopic: {}\n".format(score, lda_model_bow.print_topic(index, 10)))


Score: 0.5126144289970398	 
Topic: 0.019*"rural" + 0.015*"minister" + 0.014*"call" + 0.014*"lose" + 0.013*"need" + 0.013*"bank" + 0.012*"force" + 0.011*"royal" + 0.011*"government" + 0.010*"hobart"


Score: 0.2623797059059143	 
Topic: 0.024*"adelaide" + 0.019*"live" + 0.016*"tasmanian" + 0.014*"family" + 0.013*"change" + 0.012*"abuse" + 0.012*"guilty" + 0.012*"victoria" + 0.011*"release" + 0.010*"find"


Score: 0.13747239112854004	 
Topic: 0.019*"open" + 0.019*"country" + 0.017*"years" + 0.015*"government" + 0.015*"power" + 0.014*"hour" + 0.013*"league" + 0.013*"sydney" + 0.012*"hospital" + 0.011*"state"


Score: 0.012507922016084194	 
Topic: 0.019*"school" + 0.016*"health" + 0.016*"fund" + 0.014*"help" + 0.014*"indigenous" + 0.012*"council" + 0.012*"turnbull" + 0.012*"concern" + 0.011*"trial" + 0.010*"plan"


Score: 0.01250425260514021	 
Topic: 0.023*"attack" + 0.023*"world" + 0.022*"kill" + 0.020*"coast" + 0.017*"interview" + 0.014*"south" + 0.013*"tasmania" + 0.013*"gold" + 0.012*"

In [60]:
def show_headline(n):
    print(news.headline_text[n])
    print(processed_news[n])

In [78]:
unseen_news = "Australia bushfires: New South Wales battles catastrophic conditions"
bow = dictionary.doc2bow(preprocess(unseen_news))

for index, score in sorted(lda_model_bow[bow], key = lambda x: -1*x[1]):
    print("\nScore: {}\t \nTopic: {}\n".format(score, lda_model_bow.print_topic(index, 10)))


Score: 0.38773059844970703	 
Topic: 0.054*"australia" + 0.024*"crash" + 0.019*"house" + 0.018*"donald" + 0.017*"rise" + 0.015*"price" + 0.012*"children" + 0.011*"game" + 0.011*"victorian" + 0.010*"farmers"


Score: 0.262622207403183	 
Topic: 0.027*"north" + 0.019*"china" + 0.017*"west" + 0.017*"time" + 0.013*"talk" + 0.013*"hold" + 0.012*"island" + 0.012*"john" + 0.011*"south" + 0.011*"korea"


Score: 0.13746239244937897	 
Topic: 0.031*"charge" + 0.029*"court" + 0.021*"murder" + 0.019*"face" + 0.019*"perth" + 0.016*"jail" + 0.014*"accuse" + 0.014*"home" + 0.014*"test" + 0.013*"australian"


Score: 0.13713382184505463	 
Topic: 0.062*"police" + 0.030*"queensland" + 0.028*"election" + 0.027*"melbourne" + 0.022*"canberra" + 0.018*"miss" + 0.017*"shoot" + 0.013*"flood" + 0.012*"news" + 0.012*"search"


Score: 0.012511197477579117	 
Topic: 0.023*"attack" + 0.023*"world" + 0.022*"kill" + 0.020*"coast" + 0.017*"interview" + 0.014*"south" + 0.013*"tasmania" + 0.013*"gold" + 0.012*"women" + 0.0

In [79]:
unseen_news = "Scientists develop a new method for identifying potentially habitable planets that could host \
                ALIEN LIFE outside of our solar system"
bow = dictionary.doc2bow(preprocess(unseen_news))

for index, score in sorted(lda_model_bow[bow], key = lambda x: -1*x[1]):
    print("\nScore: {}\t \nTopic: {}\n".format(score, lda_model_bow.print_topic(index, 10)))


Score: 0.31005433201789856	 
Topic: 0.041*"trump" + 0.022*"market" + 0.016*"record" + 0.015*"fight" + 0.015*"share" + 0.015*"break" + 0.014*"life" + 0.014*"fall" + 0.010*"campaign" + 0.010*"australias"


Score: 0.30996790528297424	 
Topic: 0.031*"charge" + 0.029*"court" + 0.021*"murder" + 0.019*"face" + 0.019*"perth" + 0.016*"jail" + 0.014*"accuse" + 0.014*"home" + 0.014*"test" + 0.013*"australian"


Score: 0.11001504212617874	 
Topic: 0.019*"school" + 0.016*"health" + 0.016*"fund" + 0.014*"help" + 0.014*"indigenous" + 0.012*"council" + 0.012*"turnbull" + 0.012*"concern" + 0.011*"trial" + 0.010*"plan"


Score: 0.10999252647161484	 
Topic: 0.023*"attack" + 0.023*"world" + 0.022*"kill" + 0.020*"coast" + 0.017*"interview" + 0.014*"south" + 0.013*"tasmania" + 0.013*"gold" + 0.012*"women" + 0.009*"australian"


Score: 0.1099497377872467	 
Topic: 0.019*"open" + 0.019*"country" + 0.017*"years" + 0.015*"government" + 0.015*"power" + 0.014*"hour" + 0.013*"league" + 0.013*"sydney" + 0.012*"hosp

In [101]:
wines = pd.read_csv("wines.csv")

In [102]:
wines = pd.DataFrame(wines['description'])

In [104]:
wines['index'] = wines.index
wines

Unnamed: 0,description,index
0,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.",0
1,"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's already drinkable, although it will certainly be better from 2016.",1
2,"Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.",2
3,"Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opulent, with notes of honey-drizzled guava and mango giving way to a slightly astringent, semidry finish.",3
4,"Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rustic, earthy, herbal characteristics. Nonetheless, if you think of it as a pleasantly unfussy country wine, it's a good companion to a hearty winter stew.",4
...,...,...
118835,"Notes of honeysuckle and cantaloupe sweeten this deliciously feather-light spätlese. It's intensely juicy, quenching the palate with streams of tart tangerine and grapefruit acidity, yet wraps up with a kiss of honey and peach.",118835
118836,"Citation is given as much as a decade of bottle age prior to release, which means it is pre-cellared and drinking at its peak. Baked cherry, cocoa and coconut flavors combine gracefully, with soft, secondary fruit compote highlights.",118836
118837,"Well-drained gravel soil gives this wine its crisp and dry character. It is ripe and fruity, although the spice is subdued in favor of a more serious structure. This is a wine to age for a couple of years, so drink from 2017.",118837
118838,"A dry style of Pinot Gris, this is crisp with some acidity. It also has weight and a solid, powerful core of spice and baked apple flavors. With its structure still developing, the wine needs to age. Drink from 2015.",118838


In [105]:
processed_wines = wines['description'].map(preprocess)

In [106]:
processed_wines[500:520]

500    [aromas, watermelon, dust, natural, vanilla, mark, bouquet, palate, fleshy, crisp, focus, nectarine, apple, strawberry, flavor, finish, last, sweetness]                                                                    
501    [verdelho, taste, like, marcona, almonds, fruity, aromatic, good, swirl, mark, tangy, sweetness, finish]                                                                                                                    
502    [deliciously, perfume, light, feather, bake, apple, lemon, flavor, crisp, honey, sweet, great, apéritif]                                                                                                                    
503    [sauvignon, blanc, consistent, performer, hanna, usually, yield, rich, ripe, wine, brisk, acidity, best, efforts, tart, clean, savory, pineapple, tangerine, meyer, lemon, flavor, touch, slight, sweetness]                
504    [simple, attractive, côtes, rhône, impressively, pure, cherry, berry, fruit, medi

In [107]:
wine_dict = Dictionary(processed_wines)

In [109]:

words_in_reviews = pd.DataFrame.from_dict(wine_dict.dfs, orient = 'index', columns = ["number of documents"])
words_in_reviews.sort_values("number of documents", ascending = False, inplace = True)

# map each token id (the dataframe index) back to its word
words_in_reviews['word'] = [wine_dict[token_id] for token_id in words_in_reviews.index]

words_in_reviews[0:15]

Unnamed: 0,number of documents,word
35,59857,flavor
31,59119,wine
9,50818,fruit
50,36599,finish
3,35607,aromas
14,34178,palate
0,30909,acidity
125,30675,drink
30,27464,tannins
156,25120,cherry


In [110]:
len(wine_dict)

24351

In [111]:
bow_wines = [wine_dict.doc2bow(review) for review in processed_wines]

In [112]:
lda_wine = models.LdaMulticore(bow_wines, num_topics=10, id2word=wine_dict, passes=2, workers=2)

In [113]:
for idx, topic in lda_wine.print_topics(-1):
    print('Topic: {} \nWords: {}\n'.format(idx, topic))

Topic: 0 
Words: 0.047*"flavor" + 0.031*"wine" + 0.021*"sweet" + 0.019*"pinot" + 0.018*"acidity" + 0.015*"like" + 0.013*"fruit" + 0.013*"drink" + 0.011*"good" + 0.011*"vanilla"

Topic: 1 
Words: 0.083*"wine" + 0.048*"fruit" + 0.037*"drink" + 0.036*"acidity" + 0.028*"ripe" + 0.022*"flavor" + 0.017*"tannins" + 0.017*"rich" + 0.016*"structure" + 0.015*"character"

Topic: 2 
Words: 0.029*"fruit" + 0.029*"flavor" + 0.023*"wine" + 0.022*"aromas" + 0.021*"black" + 0.019*"palate" + 0.018*"cherry" + 0.016*"spice" + 0.015*"nose" + 0.013*"finish"

Topic: 3 
Words: 0.035*"cherry" + 0.031*"palate" + 0.031*"aromas" + 0.030*"tannins" + 0.027*"black" + 0.019*"spice" + 0.018*"berry" + 0.017*"offer" + 0.015*"note" + 0.015*"drink"

Topic: 4 
Words: 0.043*"cabernet" + 0.033*"blend" + 0.025*"flavor" + 0.024*"sauvignon" + 0.024*"merlot" + 0.023*"blackberry" + 0.019*"wine" + 0.017*"chocolate" + 0.016*"cherry" + 0.016*"black"

Topic: 5 
Words: 0.026*"wine" + 0.018*"vineyard" + 0.009*"show" + 0.009*"fruit" + 0

In [122]:
def show_review(n):
    print(wines.description[n])
    print("\n")
    print(processed_wines[n])

In [126]:
t2 = bow_wines[686]
show_review(686)

This opens with aromas of underbrush, leather, berry and a balsamic note. The forward palate offers dried cherry, white pepper, tobacco and the warmth of alcohol alongside firm tannins. Drink this sooner rather than later.


['open', 'aromas', 'underbrush', 'leather', 'berry', 'balsamic', 'note', 'forward', 'palate', 'offer', 'cherry', 'white', 'pepper', 'tobacco', 'warmth', 'alcohol', 'alongside', 'firm', 'tannins', 'drink', 'sooner', 'later']


In [127]:
for index, score in sorted(lda_wine[t2], key = lambda x: -1*x[1]):
    print("\nScore: {}\t \nTopic: {}\n".format(score, lda_wine.print_topic(index, 10)))


Score: 0.9608637094497681	 
Topic: 0.035*"cherry" + 0.031*"palate" + 0.031*"aromas" + 0.030*"tannins" + 0.027*"black" + 0.019*"spice" + 0.018*"berry" + 0.017*"offer" + 0.015*"note" + 0.015*"drink"
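
The loops above print every topic for a document in descending order of score. When only the best match is needed, a small helper can pick it out. This is a minimal sketch, assuming the input is a list of `(topic_id, score)` pairs like those iterated over above; `dominant_topic` and the example scores are hypothetical.

```python
def dominant_topic(topic_scores):
    """Return the (topic_id, score) pair with the highest score, or None if empty."""
    if not topic_scores:
        return None
    return max(topic_scores, key=lambda pair: pair[1])

# Hypothetical scores, shaped like the (topic_id, score) pairs used above
example_scores = [(1, 0.51), (7, 0.26), (3, 0.14), (4, 0.01)]
print(dominant_topic(example_scores))
```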

