<a id='section1'></a>

### Latent Dirichlet Allocation of Yoga Reviews

Analysis flow in this notebook:

* [Set analysis options](#section2)
* [Analyze NYC reviews](#section3)
* [Analyze LA reviews](#section4)
* [Unusual words found in the corpora](#section5)
    
The analysis procedure itself consists of the following steps, separately applied to the NYC and LA corpora:
 1. Concatenate the reviews by yoga business, making sure there are no duplicate reviews for a given business.
 2. Convert to lower case, remove accents, tokenize, and retain only tokens with alphabetical characters.
 3. Remove stop words and proper nouns.
 4. Stem.
 5. Create a corpus dictionary: (integer word ID, word, word frequency in corpus).
 6. Remove tokens that appear in too few or too many documents.
 7. Convert each concatenated studio review into bag-of-words format: a list of (token ID, token count) 2-tuples.
 8. Apply tf-idf transformation to corpus.
 9. Apply Latent Dirichlet Allocation to corpus.
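
The preprocessing steps above can be sketched on a toy corpus in plain Python. This is only an illustration, not the code used in the cells below: the two sample reviews, the tiny `STOPWORDS` and `PROPER_NOUNS` sets, and the helper names `deaccent`, `tokenize`, `preprocess`, and `doc2bow` are all made up for the example, and stemming (step 4) and frequency filtering (step 6) are left out for brevity.

```python
import re
import unicodedata
from collections import Counter

# Toy stand-ins for the NLTK stop word list and the proper noun list (assumptions).
STOPWORDS = {"the", "a", "is", "and", "was", "s"}
PROPER_NOUNS = {"anna"}

def deaccent(text):
    """Strip accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def tokenize(text):
    """Lower-case, remove accents, and keep only runs of alphabetic characters."""
    return re.findall(r"[^\W\d_]+", deaccent(text.lower()))

def preprocess(text):
    """Tokenize, then drop stop words and proper nouns."""
    return [tok for tok in tokenize(text)
            if tok not in STOPWORDS and tok not in PROPER_NOUNS]

docs = [u"Anna's caf\u00e9 class was GREAT, great value!",
        u"The studio is small and the class is hot."]
tokenized = [preprocess(doc) for doc in docs]

# Step 5: assign an integer ID to every distinct token in the corpus.
token2id = {}
for doc in tokenized:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def doc2bow(doc):
    """Step 7: represent a document as a sorted list of (token ID, count) pairs."""
    counts = Counter(token2id[tok] for tok in doc)
    return sorted(counts.items())

corpus_bow = [doc2bow(doc) for doc in tokenized]
print(tokenized)
print(corpus_bow)  # [[(0, 1), (1, 1), (2, 2), (3, 1)], [(1, 1), (4, 1), (5, 1), (6, 1)]]
```

Note that "class" receives the same ID in both documents, which is what lets the later tf-idf and LDA steps compare documents term by term.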


In [1]:
'''
First get the packages we'll need.
'''
from   pymongo import MongoClient
import logging
import nltk
from   gensim import corpora, models, similarities, matutils, utils
from   collections import defaultdict
from   pprint import pprint
import re
import matplotlib.pyplot as plt
import numpy as np
import math
%matplotlib inline

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

<a id='section2'></a>
[Back to top](#section1)

## Set Analysis Options

In [2]:
'''
Set common options for the LDA transformation that will be applied separately
to the NYC and LA corpora.
'''

# LDA training is stochastic: gensim draws its initial topic state from NumPy's random number
# generator, so the learned topics (and their numbering) can change from run to run.
# To make runs repeatable, we set the seed here.
np.random.seed(918273646)

# The following are options for tokenizing, stemming, filtering, etc.
convert_lowercase = True
remove_accents    = True
remove_stopwords  = True
remove_proper     = True # If you enable this, make sure convert_lowercase = True.
stem_tokens       = True
filter_tails      = True
filter_low        = 6   # No dictionary entry for tokens that appear in 5 documents or fewer.
filter_up         = 0.1 # No dictionary entry for tokens that appear in more than 10% of the corpus documents.
n_LDA_topics      = 10

# Make list of stopwords.
stoplist = nltk.corpus.stopwords.words('english')
stoplist.append(u'\u0027s')   # "'s" as in "he's"
stoplist.append(u'n\u0027t')  # "n't" as in "he hasn't"
stoplist.append(u'\u0027m')   # "'m" as in "I'm"
stoplist.append(u'ya')        # as in "you"
stoplist.append(u'\u0027ve')  # "'ve" as in "I've"
stoplist.append(u'also')
stoplist.append(u've')
stoplist.append(u'm')
print('List of stopwords: %s' %[str(word) for word in stoplist])

# Make list of proper nouns.
ppn = ["aaron", "aarona", "abigail", "adam", "adelaide", "alan", "alexandra", "alexis", "alice", "alicia", 
       "alison", "alli", "allie", "ally", "alyssa", "amalia", "amanda", "amber", "ami", "amy", "ana", "andrea", 
       "angela", "angie", "anna", "anne", "annie", "anthony", "anya", "ariel", "ash", "ashley", "audra", 
       "becker", "becky", "belle", "beth", "beverly", "bijorn", "bjorn", "bonni", "brad", "brandi", "brandon", 
       "brian", "brittani", "brittany", "brook", 
       "caprice", "cara", "carey", "carla", "carlos", "carmen", "carolyn", "cassie", "cathy", "charlotte", 
       "chris", "chrissy", "christian", "christina", "christine", "claire", "claudia", "connie", "corey", 
       "courtney", "dad", "dalton", "dana", "daniel", "daniela", "danny", "davey", "david", "dawn", "deborah", 
       "deena", "diane", "dina", "dr", 
       "eddie", "edwin", "elaine", "elena", "elizabeth", "ella", "ellen", "emily", "eric", "erica", "erik", 
       "erika", "erin", "evans", "ezmy", "fergus", "fern", "francisco", "frank", "fred", 
       "gabriel", "gabriella", "gavin", "geralyn", "ghylian", "gigi", "gina", "glenda", 
       "hannah", "harper", "heather", "heidy", "henry", "hermann", "hermosa", "howard", "hsiao", "hunt", 
       "ikaika", "ingrid", "ivette", "jackie", "jacqui", "jahaira", "jake", "james", "jami", "jamie", "jane", 
       "janet", "jai", "jason", "jc", "jeff", "jen", "jeni", "jenni", "jennie", 
       "jennifer", "jenny", "jerry", "jess", "jesse", "jessica", "jill", "jillian", "jim", "joe", "joetta", 
       "john", "johnson", "joi", 
       "jose", "joseph", "josh", "joy", "joyce", "jq", "judy", "julia", "juliana", "julie", "justin",
       "kalie", "kallie", "karen", "kari", "kathleen", "kate", "kathi", "katie", "kaurwar", "kelli", "kelly", 
       "ken", "kerri", "kerry", "kim", "kimberli", "kimmy", "kris", "kristen",
       "lalita", "lance", "lani", "lara", "laura", "lauren", "laurie", "les", "lil", "liliana", "lilly", 
       "linda", "lindsay", "lindsey", "lisa", "liz", "lori", "luisa", "lulu", "lynn", "ma", 
       "madalina", "madison", "maggie", "mai", "malaika", "mandy", "marco", "margaret", "marja", "mark", 
       "marnie", "martha", "mary", "masako", "matt", "matthew", "mayuri", "meagan", "megan", "melissa", 
       "melody", "meredith", "meriany", "merilynn", "mia", "michael", "michelle", "mike", "mimi", "mollie", 
       "molly", "mommy", "monica", "monika", "morgan", "namgyal", "nancy", "naomi", "narisara", "natalie", 
       "nathaniel", "ness", "nick", "nicola", "nicole", "nikki", "novak", "patrick", "paula", "pauline", 
       "peter", "phebe", "phoenix", "politeia", "rachel", "rafael", "ramit", "rebeca", "rebecca", "renee", 
       "richard", "rob", "roger", "rose", "rosie", "rudy", "ruthie", "ryan", "sam", "samantha", 
       "sandhya", "santoshi", "sara", "sarah", "scott", "sean", "shana", "shawn", "shelly", "sherica", 
       "sherry", "sheryl", "sonia", "sonja", "spencer", "stacey", "stacy", "stefani", "stephan", "stephen",
       "stephanie", "stephaine", "steve", "sue", "susan", "suzanne", "suzi", "suzie", "tarzana", "theresa", 
       "tiffani", "travis", "tzaki", "tsewang", "ty", "victoria", "vincent", "wayne", "wendi", "wendy", 
       "wesley", "zander", "yvonne", "zoe", 
       "westchester", "miami", "harlem", "hoboken", "soho", "washington", "tribeca", "montclair", "jersey",
       "ues", "downtown", "blvd", "flatiron", "silverlake", "verdes", "segundo", "palisades", "venice",
       "noho", "malibu"]
print('')
print('Number of proper nouns in removal list = %i' % len(ppn))

# Select the stemmer.
stemmer = nltk.stem.porter.PorterStemmer()

List of stopwords: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', '
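
The effect of `filter_low` and `filter_up` can be illustrated with a small pure-Python sketch. It is an approximation of what, to our understanding, gensim's `Dictionary.filter_extremes` does with `no_below` and `no_above`; the function name `filter_by_document_frequency` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, no_below, no_above):
    """Return the tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))          # count each token once per document
    max_docs = int(no_above * n_docs)      # fractional threshold -> absolute count
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [["yoga", "hot", "class"],
            ["yoga", "hot", "teacher"],
            ["yoga", "class", "zumba"],
            ["yoga", "teacher", "class"]]

kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))  # ['class', 'hot', 'teacher']
```

Here "yoga" is dropped because it occurs in every document and "zumba" because it occurs in only one. Under the same rule, the settings above (`filter_low = 6`, `filter_up = 0.1`) applied to the 560 reviewed NYC businesses reported later in this notebook would keep tokens that occur in roughly 6 to 56 documents.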

<a id='section3'></a>
[Back to top](#section1)

## Analyze NYC Reviews

In [3]:
'''
* Make a list of the NYC reviews we'll be analyzing, concatenating by business.
* Tokenize, and if enabled, remove stopwords and proper nouns, and stem.
* Create a corpus dictionary: (integer word ID, word, word frequency in corpus),
  removing words that appear too infrequently or too frequently.
* Convert the tokenized reviews of the corpus to bags of words.
* Apply tf-idf transformation to corpus.
* Apply Latent Dirichlet Allocation to the corpus.
'''

# Initialize output lists:
NYC_names    = []
NYC_reviews  = []
NYC_ratings  = []
NYC_websites = []

client = MongoClient()
yoga   = client.dsbc.yyrnyc
print('Opening NYC database...')
print('Total number of Yoga businesses = %i' %yoga.count())

cursor = yoga.find()
for record in cursor:
    reviews = []
    for review in record["usr_text"]:
        if review:
            reviews.append(review)

    # Eliminate duplicate reviews for a given studio
    # (different studios may still "share" a review):
    ureviews   = list(set(reviews))
    n_ureviews = len(ureviews)
    if n_ureviews > 300:
        print('Business %s has %i reviews' %(record["biz_name"],n_ureviews))
    
    # Concatenate the unique reviews by business.
    con_review = ""
    for review in ureviews:
        con_review += " " + review

    if con_review:
        con_review += " " + record["biz_name"] # Add business name string only if there are reviews
        studio = record["biz_name"]+" [at] "+record["biz_address"]
        NYC_names.append(studio)
        NYC_reviews.append(con_review)
        NYC_ratings.append(record["biz_rating"])
        NYC_websites.append(record["biz_website"])

print('Number of reviewed NYC Yoga businesses = %i' % len(NYC_reviews))

# Tokenize, removing punctuation and numbers, and if enabled, convert to lower case and remove accents.
NYC_reviews_1 = [list(utils.tokenize(studio_review,lowercase=convert_lowercase,
                 deacc=remove_accents)) for studio_review in NYC_reviews]

if remove_stopwords:
    NYC_reviews_2 = [[word for word in studio_review if word not in stoplist] 
                     for studio_review in NYC_reviews_1]
    NYC_reviews_1 = NYC_reviews_2[:]

if remove_proper:
    NYC_reviews_2 = [[word for word in studio_review if word not in ppn] 
                     for studio_review in NYC_reviews_1]
    NYC_reviews_1 = NYC_reviews_2[:]

if stem_tokens:
    NYC_reviews_2 = [[stemmer.stem(word) for word in studio_review] for studio_review in NYC_reviews_1]
    # Map each stem back to one representative word for display. Several words can share a stem,
    # so only the last word seen is kept; this affects the readability of the output, not the model.
    NYC_stem_to_word = defaultdict(str)
    for review in NYC_reviews_1:
        for word in review:
            word_stem = stemmer.stem(word)
            NYC_stem_to_word[word_stem] = word    
    NYC_reviews_1 = NYC_reviews_2[:]
else:
    NYC_stem_to_word = defaultdict(str)
    for review in NYC_reviews_1:
        for word in review:
            NYC_stem_to_word[word] = word        
        
NYC_dictionary = corpora.Dictionary( NYC_reviews_1 )
if filter_tails:
    NYC_dictionary.filter_extremes( no_below=filter_low, no_above=filter_up, keep_n=None )
print(NYC_dictionary)
n_terms  = len(NYC_dictionary)
n_docs   = len(NYC_reviews_1)
print('In NYC corpus: Number of terms = %i, documents = %i, latent topics = %i' % (n_terms,n_docs,n_LDA_topics))

NYC_corpus_bow = [NYC_dictionary.doc2bow(studio_review) for studio_review in NYC_reviews_1]

NYC_tfidf = models.TfidfModel(NYC_corpus_bow)
NYC_corpus_tfidf = NYC_tfidf[NYC_corpus_bow]

NYC_lda_file = 'results/yoga_studios_nyc_lda'+str(n_LDA_topics)
%time NYC_lda = models.LdaModel( NYC_corpus_tfidf, id2word=NYC_dictionary, num_topics=n_LDA_topics, \
                                 update_every=5, passes=100 )
NYC_lda.save(NYC_lda_file)
    
NYC_corpus_lda = NYC_lda[NYC_corpus_tfidf]

Opening NYC database...
Total number of Yoga businesses = 796
Business Yoga to the People has 331 reviews
Number of reviewed NYC Yoga businesses = 560
Dictionary(3193 unique tokens: [u'dissatisfi', u'profici', u'foul', u'sleek', u'secondli']...)
In NYC corpus: Number of terms = 3193, documents = 560, latent topics = 10
CPU times: user 4min 49s, sys: 11 s, total: 5min
Wall time: 4min 59s
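
The tf-idf step in the cell above can be sketched in plain Python. The sketch assumes the common weighting tf × log2(N/df) followed by scaling each document to unit length, which we believe corresponds to the defaults of gensim's `TfidfModel`; the function name `tfidf_transform` and the toy corpus are hypothetical.

```python
import math

def tfidf_transform(corpus_bow):
    """Re-weight (token ID, count) pairs by inverse document frequency and
    scale each document vector to unit Euclidean length."""
    n_docs = len(corpus_bow)
    doc_freq = {}
    for doc in corpus_bow:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    # idf is zero for a token that appears in every document.
    idf = {tid: math.log(float(n_docs) / df, 2) for tid, df in doc_freq.items()}

    weighted_corpus = []
    for doc in corpus_bow:
        weights = [(tid, count * idf[tid]) for tid, count in doc]
        norm = math.sqrt(sum(w * w for _, w in weights))
        if norm > 0:
            weights = [(tid, w / norm) for tid, w in weights]
        # Drop tokens whose weight vanished.
        weighted_corpus.append([(tid, w) for tid, w in weights if w > 0])
    return weighted_corpus

toy_bow = [[(0, 1), (1, 2)],
           [(0, 1), (2, 1)],
           [(0, 1), (1, 1), (2, 1)]]
weighted = tfidf_transform(toy_bow)
for doc in weighted:
    print(doc)
```

Token 0 occurs in all three toy documents, so its weight is zero and it disappears from every document, while the rarer tokens 1 and 2 share the remaining weight.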


In [4]:
'''
Print out the topics, with their term distributions.
'''
num_topics   = n_LDA_topics
NYC_lda_file = 'results/yoga_studios_nyc_lda'+str(num_topics)
NYC_lda = models.LdaModel.load(NYC_lda_file)
for itopic in range(num_topics):
    out_string = "{:d}".format(itopic)+": "
    pshown     = 0.0
    for ind,(prob,word_stem) in enumerate(NYC_lda.show_topic(itopic, topn=20)):
        word    = NYC_stem_to_word[word_stem]
        pshown += prob
        if word == "":
            word = "["+word_stem+"]"
        if ind > 0:
            out_string += " + "+"{:.4f}".format(prob)+"*"+word
        else:
            out_string += " "+"{:.4f}".format(prob)+"*"+word
    print(out_string)    
    ptotal = sum([prob for (prob,word_stem) in NYC_lda.show_topic(itopic,topn=n_terms)])
    print("\nProbability displayed/total = %f/%f\n" % (pshown,ptotal))

0:  0.0065*blissful + 0.0041*insight + 0.0029*therapy + 0.0027*mature + 0.0026*temple + 0.0024*heartbeat + 0.0022*midnight + 0.0021*motion + 0.0020*stretchy + 0.0019*shown + 0.0019*profusely + 0.0019*dogma + 0.0018*retreat + 0.0017*arranging + 0.0017*split + 0.0017*depth + 0.0016*laughter + 0.0016*counseling + 0.0016*outdoor + 0.0016*chosen

Probability displayed/total = 0.046583/1.000000

1:  0.0037*tree + 0.0030*nidra + 0.0030*beneficial + 0.0026*farther + 0.0025*industrial + 0.0025*remaining + 0.0024*cosy + 0.0023*kirtan + 0.0022*accomodating + 0.0020*demeanor + 0.0020*healer + 0.0019*dig + 0.0019*swear + 0.0019*bottom + 0.0018*belly + 0.0018*friendships + 0.0017*jackson + 0.0017*modify + 0.0016*sued + 0.0016*meetup

Probability displayed/total = 0.043855/1.000000

2:  0.0049*zumba + 0.0038*barre + 0.0032*donations + 0.0026*pregnancy + 0.0026*yogaworks + 0.0025*height + 0.0024*children + 0.0023*hidden + 0.0022*ashtanga + 0.0020*treatment + 0.0020*lululemon + 0.0020*integrity + 0.001
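
The formatting logic of the cell above can be tried out without a trained model. The sketch below is a hedged illustration: `format_topic`, `toy_topic`, and `toy_map` are hypothetical, and the (probability, stem) pair order simply follows the loop in the cell above.

```python
def format_topic(topic_terms, stem_to_word, topn=3):
    """Format the `topn` most probable terms of a topic.

    topic_terms  : list of (probability, stem) pairs, most probable first
    stem_to_word : dict mapping a stem to a representative full word
    Returns the formatted string, the probability shown, and the total probability.
    """
    shown = topic_terms[:topn]
    parts = []
    for prob, stem in shown:
        word = stem_to_word.get(stem, "")
        if not word:
            word = "[" + stem + "]"   # no full word known: show the bare stem
        parts.append("{:.4f}*{}".format(prob, word))
    p_shown = sum(prob for prob, _ in shown)
    p_total = sum(prob for prob, _ in topic_terms)
    return " + ".join(parts), p_shown, p_total

toy_topic = [(0.5, "bliss"), (0.25, "insight"), (0.125, "therapi"), (0.125, "matur")]
toy_map = {"bliss": "blissful", "therapi": "therapy"}

line, p_shown, p_total = format_topic(toy_topic, toy_map)
print(line)              # 0.5000*blissful + 0.2500*[insight] + 0.1250*therapy
print(p_shown, p_total)  # 0.875 1.0
```

The displayed/total ratio indicates how concentrated a topic is. In the output above, the 20 terms shown for NYC topics 0 and 1 carry under 5% of each topic's probability, so most of the mass is spread over the remaining vocabulary.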

<a id='section4'></a>
[Back to top](#section1)

## Analyze LA Reviews

In [5]:
'''
* Make a list of the LA reviews we'll be analyzing, concatenating by business.
* Tokenize, and if enabled, remove stopwords and proper nouns, and stem.
* Create a corpus dictionary: (integer word ID, word, word frequency in corpus),
  removing words that appear too infrequently or too frequently.
* Convert the tokenized reviews of the corpus to bags of words.
* Apply tf-idf transformation to corpus.
* Apply Latent Dirichlet Allocation to the corpus.
'''

# Initialize output lists:
LA_names    = []
LA_reviews  = []
LA_ratings  = []
LA_websites = []

client = MongoClient()
yoga   = client.dsbc.yyrla
print('Opening LA database...')
print('Total number of Yoga businesses = %i' %yoga.count())

cursor = yoga.find()
for record in cursor:
    reviews = []
    for review in record["usr_text"]:
        if review:
            reviews.append(review)

    # Eliminate duplicate reviews for a given studio
    # (different studios may still "share" a review):
    ureviews   = list(set(reviews))
    n_ureviews = len(ureviews)
    if n_ureviews > 300:
        print('Business %s has %i reviews' %(record["biz_name"],n_ureviews))
    
    # Concatenate the unique reviews by business.
    con_review = ""
    for review in ureviews:
        con_review += " " + review

    if con_review:
        con_review += " " + record["biz_name"] # Add business name string only if there are reviews
        studio = record["biz_name"]+" [at] "+record["biz_address"]
        LA_names.append(studio)
        LA_reviews.append(con_review)
        LA_ratings.append(record["biz_rating"])
        LA_websites.append(record["biz_website"])

print('Number of reviewed LA Yoga businesses = %i' % len(LA_reviews))

# Tokenize, removing punctuation and numbers, and if enabled, convert to lower case and remove accents.
LA_reviews_1 = [list(utils.tokenize(studio_review,lowercase=convert_lowercase,
                deacc=remove_accents)) for studio_review in LA_reviews]

if remove_stopwords:
    LA_reviews_2 = [[word for word in studio_review if word not in stoplist] 
                    for studio_review in LA_reviews_1]
    LA_reviews_1 = LA_reviews_2[:]

if remove_proper:
    LA_reviews_2 = [[word for word in studio_review if word not in ppn] 
                    for studio_review in LA_reviews_1]
    LA_reviews_1 = LA_reviews_2[:]

if stem_tokens:
    LA_reviews_2 = [[stemmer.stem(word) for word in studio_review] for studio_review in LA_reviews_1]
    # Map each stem back to one representative word for display. Several words can share a stem,
    # so only the last word seen is kept; this affects the readability of the output, not the model.
    LA_stem_to_word = defaultdict(str)
    for review in LA_reviews_1:
        for word in review:
            word_stem = stemmer.stem(word)
            LA_stem_to_word[word_stem] = word    
    LA_reviews_1 = LA_reviews_2[:]
else:
    LA_stem_to_word = defaultdict(str)
    for review in LA_reviews_1:
        for word in review:
            LA_stem_to_word[word] = word

LA_dictionary = corpora.Dictionary( LA_reviews_1 )
if filter_tails:
    LA_dictionary.filter_extremes( no_below=filter_low, no_above=filter_up, keep_n=None )
print(LA_dictionary)
n_terms = len(LA_dictionary)
n_docs  = len(LA_reviews_1)
print('In LA corpus: Number of terms = %i, documents = %i, latent topics = %i' % (n_terms,n_docs,n_LDA_topics))

LA_corpus_bow = [LA_dictionary.doc2bow(studio_review) for studio_review in LA_reviews_1]

LA_tfidf = models.TfidfModel(LA_corpus_bow)
LA_corpus_tfidf = LA_tfidf[LA_corpus_bow]

LA_lda_file = 'results/yoga_studios_la_lda'+str(n_LDA_topics)
%time LA_lda = models.LdaModel( LA_corpus_tfidf, id2word=LA_dictionary, num_topics=n_LDA_topics, \
                                update_every=5, passes=100 )
LA_lda.save(LA_lda_file)
    
LA_corpus_lda = LA_lda[LA_corpus_tfidf]

Opening LA database...
Total number of Yoga businesses = 749
Business Runyon Canyon Park has 1285 reviews
Business Gold's Gym Downtown Los Angeles has 380 reviews
Business 24 Hour Fitness has 426 reviews
Business 24 Hour Fitness has 304 reviews
Business LA Fitness has 317 reviews
Number of reviewed LA Yoga businesses = 749
Dictionary(5447 unique tokens: [u'haggl', u'ayurved', u'profici', u'pardon', u'ridden']...)
In LA corpus: Number of terms = 5447, documents = 749, latent topics = 10
CPU times: user 8min 42s, sys: 20.1 s, total: 9min 2s
Wall time: 10min 51s
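
The NYC and LA cells repeat the same de-duplication and concatenation logic. One possible way to factor it out is sketched below; it is not the code that produced the results above. The helper name `concatenate_unique_reviews` and the toy records are hypothetical; the field names `usr_text` and `biz_name` are the ones used in the cells above.

```python
def concatenate_unique_reviews(record):
    """Join the distinct, non-empty reviews of one business into a single string.

    Returns (concatenated text, number of unique reviews). The business name is
    appended only if there is at least one review, as in the cells above.
    """
    seen = set()
    unique = []
    for text in record["usr_text"]:
        if text and text not in seen:   # skip empty entries and repeats
            seen.add(text)
            unique.append(text)
    if not unique:
        return "", 0
    return " ".join(unique + [record["biz_name"]]), len(unique)

toy_record = {"biz_name": "Toy Yoga",
              "usr_text": ["Great class", "", "Great class", "Too hot"]}
print(concatenate_unique_reviews(toy_record))  # ('Great class Too hot Toy Yoga', 2)

empty_record = {"biz_name": "No Reviews Yoga", "usr_text": ["", None]}
print(concatenate_unique_reviews(empty_record))  # ('', 0)
```

Unlike `set(reviews)`, this version keeps the reviews in their original order. That makes no difference to the bag-of-words representation, but it makes the concatenated text reproducible across runs.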


In [6]:
'''
Print out the topics, with their term distributions.
'''
num_topics  = n_LDA_topics
LA_lda_file = 'results/yoga_studios_la_lda'+str(num_topics)
LA_lda      = models.LdaModel.load(LA_lda_file)
for itopic in range(num_topics):
    out_string = "{:d}".format(itopic)+": "
    pshown     = 0.0
    for ind,(prob,word_stem) in enumerate(LA_lda.show_topic(itopic, topn=20)):
        word    = LA_stem_to_word[word_stem]
        pshown += prob
        if word == "":
            word = "["+word_stem+"]"
        if ind > 0:
            out_string += " + "+"{:.4f}".format(prob)+"*"+word
        else:
            out_string += " "+"{:.4f}".format(prob)+"*"+word
    print(out_string)    
    ptotal = sum([prob for (prob,word_stem) in LA_lda.show_topic(itopic,topn=n_terms)])
    print("\nProbability displayed/total = %f/%f\n" % (pshown,ptotal))

0:  0.0026*awakened + 0.0023*chakras + 0.0022*devoted + 0.0021*whale + 0.0019*finest + 0.0019*elongated + 0.0016*unrelenting + 0.0015*hotel + 0.0015*remotely + 0.0014*breathwork + 0.0014*reiki + 0.0012*philosophical + 0.0012*embarked + 0.0012*poise + 0.0012*reccommend + 0.0012*hint + 0.0012*regimented + 0.0012*lessened + 0.0011*couch + 0.0011*gymnastics

Probability displayed/total = 0.030822/1.000000

1:  0.0111*thai + 0.0080*acupuncture + 0.0064*prenatal + 0.0055*masseuse + 0.0048*knots + 0.0043*birth + 0.0040*tissues + 0.0036*lake + 0.0036*reiki + 0.0036*ashtanga + 0.0035*corepower + 0.0031*healers + 0.0031*insurance + 0.0028*silver + 0.0028*iyengar + 0.0027*anatomy + 0.0026*cpy + 0.0026*swedish + 0.0025*brick + 0.0023*rehab

Probability displayed/total = 0.082893/1.000000

2:  0.0044*kundalini + 0.0037*pole + 0.0035*chiropractor + 0.0029*aerial + 0.0027*chi + 0.0027*dahn + 0.0025*lululemon + 0.0024*fusion + 0.0023*chakras + 0.0022*tai + 0.0020*silk + 0.0020*hatha + 0.0019*holistic 

<a id='section5'></a>
[Back to top](#section1)

### Some unusual words found in the NYC and LA corpora:

In New York City corpus:

* alma = Nueva Alma studio
* bonda = Bonda Yoga Studio
* daya = Daya Yoga Studio
* elahi = Elahi Yoga in the UES
* hys = Harlem Yoga Studio
* ikm = International Krav Maga
* joschi = Joschi Body Bodega
* krav maga = self-defense system developed for the military in Israel
* mrg = MRG fitness studio in Staten Island
* tenafly = borough in Bergen County, New Jersey
* vdy = Brooklyn Vindhya Yoga
* yith = Yoga in the (Jersey City) Heights

In Los Angeles corpus:

* cpy = Core Power Yoga
* bar = could be juice bar, or simply bar, or barely, barefoot,...
* yas = Yoga And Spinning, a fitness center that provides both yoga classes and indoor cycling