# BA 476 Lab 2: Topic Modeling - LDA

## The Problem

The NY Department of Health regularly conducts restaurant inspections. Following each inspection, restaurants are assigned a score. The higher the score, the more violations were found. If the restaurant’s total score is
below 14 points, the restaurant gets an ‘A’ grade card; between 14 and 27 points, the restaurant gets a ‘B’; and over 27 points, the restaurant gets a ‘C’. Grade cards are posted outside restaurants.

To calculate a score for each restaurant, health inspectors look for a specific list of violations. Each violation carries its own specific score. The violation scores are tallied to come up with a total score, which in turn
determines the restaurant’s health grade (A, B, or C).
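The grade thresholds described above can be written as a small helper. This is only a sketch: `score_to_grade` is a hypothetical name and is not part of the lab's data or code.

```python
def score_to_grade(score):
    """Map a total inspection score to a letter grade using the thresholds above."""
    if score < 14:
        return "A"  # below 14 points
    if score <= 27:
        return "B"  # between 14 and 27 points
    return "C"      # over 27 points

# A few example scores
print([score_to_grade(s) for s in (2, 12, 17, 40)])
```

A helper like this is handy later if you want to describe the data by grade rather than by raw score.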

In this lab, we will help the DOH better use its limited resources by targeting inspection towards restaurants that are most likely to receive high scores. To do this, we will build a predictive model of health scores. Our
model will incorporate restaurant data such as whether the restaurant is a chain or not, its zipcode, and its cuisine.

In addition to these variables, we will examine whether crowdsourced Yelp reviews have any predictive power. To do so, we will need to turn the text of restaurant reviews into something we can feed to our machine learning algorithm. For this, we will use LDA.

## Setup

Again, we make sure the visualization tool is installed.


In [None]:
!pip install pyldavis # pyldavis is not pre-installed in Colab so we install it

Now we import the necessary libraries:

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# LDA libraries
from sklearn.decomposition import LatentDirichletAllocation
import pyLDAvis
# note: pyLDAvis 3.4 renamed this module to pyLDAvis.lda_model
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()
# preprocessing
from sklearn.feature_extraction.text import CountVectorizer # text preprocessing
from sklearn.preprocessing import OneHotEncoder # for categorical features
# estimators
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor



And mount our Google Drive as usual: 

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## The Data

The lab comes with three data files. The first two list restaurant inspections in 2014 and have already been split into a training and a testing set. Notice that every inspection has an associated health score.

In [4]:
df_inspections_train = pd.read_csv('/content/drive/My Drive/ba476-test/data/doh-train.tsv', sep='\t')
df_inspections_test = pd.read_csv('/content/drive/My Drive/ba476-test/data/doh-test.tsv', sep='\t')

df_inspections_train.head()

Unnamed: 0,camis,dba,inspectionid,zipcode,cuisine,venue,chain_restaurant,inspdate,score,yelp_business_id,review_count,stars
0,50017275,SWEETGREEN,1270182,10012,Salads,Restaurant (no bar),1,2015-09-10,12,sweetgreen-new-york-4,128,4.0
1,50017275,SWEETGREEN,1276837,10012,Salads,Restaurant (no bar),1,2016-09-06,20,sweetgreen-new-york-4,128,4.0
2,41671229,PANERA BREAD,1139571,10028,American,Restaurant (no bar),1,2014-08-13,17,panera-bread-new-york-6,126,3.0
3,41671229,PANERA BREAD,1144197,10028,American,Restaurant (no bar),1,2014-09-12,40,panera-bread-new-york-6,126,3.0
4,41671229,PANERA BREAD,1153664,10028,American,Restaurant (no bar),1,2014-09-16,2,panera-bread-new-york-6,126,3.0


The third file contains the text and star ratings of the restaurant reviews. The text has been cleaned for you.

In [5]:
df_review = pd.read_csv('/content/drive/My Drive/ba476-test/data/doh-reviews.tsv', sep='\t')
df_review.head(), df_review.shape

(        yelp_business_id  stars        date  \
 0  sweetgreen-new-york-4      5  2014-12-11   
 1  sweetgreen-new-york-4      4  2014-12-16   
 2  sweetgreen-new-york-4      5  2014-12-29   
 3  sweetgreen-new-york-4      4  2014-12-21   
 4  sweetgreen-new-york-4      5  2014-12-29   
 
                                                 text  
 0  hallelujah ok may littl excit new sweetgreen h...  
 1  made new sweetgreen fantast manag get first da...  
 2  veri happi sweet green thi neighborhood salad ...  
 3  nice thi branch open work neighborhood season ...  
 4  love thi new sweetgreen great locat great stan...  , (36367, 4))

We have about 36,000 reviews. 
Your eventual goal is to predict health scores for the inspections, since accurately predicting whether or not a restaurant has health code violations will help the inspectors focus their efforts where needed. 

To do this, we will extract topics from the reviews and use the topic weights as predictors. For example, if a topic is about food-poisoning-related illness, we may want to inspect any restaurant whose reviews place significant weight on that topic.

Step 1 is to extract the topics. 

## LDA

The first thing we are going to do is estimate an LDA model. Recall that LDA maps documents (reviews in this case) to topics. Each document is assigned a weight for each topic. Topics are collections of words; each word is assigned a weight in the topic. Let’s get started.

We need the data in a format appropriate for fitting LDA. We've seen that the count vectorizer creates a matrix containing the count of every word in every document.

In [6]:
tf_vectorizer = CountVectorizer(
    max_df=0.95, # ignore words that appear in more than 95% of the reviews
    min_df=2, # ignore words that appear in fewer than two reviews
    max_features=1000, # keep only the 1000 most frequent remaining words
    stop_words='english', # remove English stopwords
    lowercase=True) # lowercase the words

tf = tf_vectorizer.fit_transform(df_review.text)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tf_feature_names = tf_vectorizer.get_feature_names_out()
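To make the word-count matrix concrete, here is a minimal sketch on three made-up reviews (the `toy_` names and the review texts are assumptions for illustration, not lab data). Each row is a review, each column is a word, and each entry is how often that word appears in that review.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up reviews, purely for illustration
toy_reviews = [
    "great pizza great service",
    "slow service cold pizza",
    "great brunch spot",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_reviews)  # sparse matrix: reviews x words

# vocabulary_ maps each word to its column index; sort the words by that index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocab)
print(toy_counts.toarray())
```

The word "great" appears twice in the first review, so its entry in the first row is 2.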



You can tweak the following parameters:

In [7]:
num_topics = 5

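# Pick values for alpha/eta between 0 and 1.
# The smaller the values, the sparser the distributions.
# Small alpha means most reviews are made up of only a few topics.
# Large alpha means most reviews mix most topics.
# Similarly, small eta means most topics contain only a few words.
# Large eta means most topics contain most words.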
alpha = 0.01
eta = 0.01
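In LDA, alpha and eta are the concentration parameters of Dirichlet priors, so their effect on sparsity can be seen by sampling from a Dirichlet distribution directly. The sketch below uses illustrative values (0.1 and 10, not the lab's settings) and hypothetical variable names to compare how concentrated the simulated topic weights are.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics = 5

# Illustrative values only (not the lab's settings): a small and a large alpha
sparse_docs = rng.dirichlet([0.1] * n_topics, size=1000)
dense_docs = rng.dirichlet([10.0] * n_topics, size=1000)

# Average weight of the single largest topic in each simulated document
sparse_peak = sparse_docs.max(axis=1).mean()
dense_peak = dense_docs.max(axis=1).mean()
print(round(sparse_peak, 2), round(dense_peak, 2))
```

With the small value, most of each simulated document's weight lands on one topic; with the large value, the weight is spread much more evenly.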

More iterations will yield better results, but with diminishing returns. We set `max_iter` to something small while exploring to save time, but eventually you may want to train with 100-1000 iterations.


In [8]:
lda = LatentDirichletAllocation(
    n_components=num_topics, # the number of topics to generate
    max_iter=10,
    learning_method='online',
    learning_offset=50.,
    doc_topic_prior=alpha,
    topic_word_prior=eta,
    random_state=8)

# Training the model
lda.fit(tf)

LatentDirichletAllocation(doc_topic_prior=0.01, learning_method='online',
                          learning_offset=50.0, n_components=5, random_state=8,
                          topic_word_prior=0.01)

Now that we have fit the LDA model, let’s look at the top words for each topic.

In [9]:
def display_topics(model, feature_names, no_top_words):
  for topic_idx, topic in enumerate(model.components_):
    print("Topic %d:" % (topic_idx))
    print(" ".join([feature_names[i]
                    for i in topic.argsort()[:-no_top_words - 1:-1]]))

In [10]:
no_top_words = 20
display_topics(lda, tf_feature_names, no_top_words)

Topic 0:
food place thi great good servic love veri restaur time delici best amaz price realli alway definit lunch come tri
Topic 1:
thi like good order place burger tast sandwich fri tri chicken chees bagel realli sushi salad got eat flavor becaus
Topic 2:
thi order food time tabl servic wait ask place becaus came restaur onli like want said got minut come waiter
Topic 3:
veri good dish thi order delici flavor great sauc restaur food like chicken realli tri pizza tast love nice meal
Topic 4:
place great good drink bar thi veri nice like realli food brunch beer love friendli spot staff space friend area
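The slice `[:-no_top_words - 1:-1]` used in `display_topics` can look cryptic. A minimal sketch with made-up weights (the arrays and names here are hypothetical) shows that it picks the indices of the largest values, in descending order.

```python
import numpy as np

# Made-up weights for four words in one topic
weights = np.array([0.2, 5.0, 1.5, 3.0])
words = ["tabl", "food", "wait", "servic"]
n_top = 2

# argsort() orders indices from smallest to largest weight;
# the slice walks backwards from the end and keeps n_top indices
top_idx = weights.argsort()[:-n_top - 1:-1]
top_words = [words[i] for i in top_idx]
print(top_words)
```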


Let's get the topic weights for every review in the dataset:

In [11]:
doc_topics = lda.transform(tf)
print(doc_topics, doc_topics.shape)

[[2.48618708e-01 6.70373992e-01 2.12539851e-04 2.12539851e-04
  8.05822204e-02]
 [9.73652698e-02 4.35784707e-01 1.67859933e-01 1.96303867e-01
  1.02686223e-01]
 [5.78338563e-01 2.51031793e-01 4.33839479e-04 1.69761965e-01
  4.33839479e-04]
 ...
 [2.56081946e-04 1.20263498e-01 2.97656587e-01 2.56081946e-04
  5.81567752e-01]
 [4.53514739e-04 4.53514739e-04 4.53514739e-04 2.28892366e-01
  7.69747089e-01]
 [4.53514739e-04 9.49851338e-02 4.53514739e-04 3.39964192e-01
  5.64143645e-01]] (36367, 5)


Let's also get the topic-word matrix, where each row is a topic and each column is a word.

In [12]:
topic_words = lda.components_
print(topic_words, topic_words.shape)

[[7.98374611e+01 6.07142036e+01 6.87320551e+02 ... 1.00000009e-02
  1.11373362e+02 3.17733883e+02]
 [1.24300807e+02 1.00027044e-02 5.98335909e+01 ... 1.00000340e-02
  9.47244446e+01 2.04357330e+02]
 [5.03420815e+02 1.24775461e+02 3.27363461e+02 ... 1.00000000e-02
  1.00000000e-02 1.00000000e-02]
 [3.04769433e+02 7.87324808e+01 5.63158665e+02 ... 2.91146368e+02
  2.77544529e+02 4.15380247e+02]
 [1.00009554e-02 4.36737700e+01 8.86382700e+01 ... 1.00003175e-02
  1.00000000e-02 1.02525761e+02]] (5, 1000)
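The entries of `lda.components_` are pseudo-counts rather than probabilities. Dividing each row by its sum gives the word distribution of each topic. The sketch below uses a small made-up array in place of the real matrix, so the numbers are illustrative only.

```python
import numpy as np

# Made-up pseudo-counts for 2 topics over 3 words (a stand-in for lda.components_)
toy_components = np.array([[80.0, 60.0, 660.0],
                           [0.01, 125.0, 0.01]])

# Normalize each row so that it sums to 1
topic_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)
print(topic_word_probs.round(3))
```

The same one-line normalization should work on the fitted model by replacing `toy_components` with `lda.components_`.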


## Visualizing LDA

Run the following code:

In [13]:
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)



Now that you’ve looked at the top words in every topic and visualized them, do they make sense? Can you vaguely say what every topic is
about?

## Making Predictions

### Describe The Data

Which kinds of restaurants are the biggest offenders? Use pandas and matplotlib to describe the data, as you did in the beginning of the semester when we did descriptive analytics.

Try to use the LDA topics in your analysis. For instance, what is the topic distribution for Italian restaurants?
What is the topic distribution for restaurants that receive B’s?

## Sanity check: do our topics make sense?

Start by putting the topics into a `DataFrame` and then adding columns for the business ID and star rating of each review.

In [14]:
doc_topics = pd.DataFrame(doc_topics)

In [15]:
doc_topics["yelp_business_id"] = df_review.yelp_business_id
doc_topics["stars"] = df_review.stars
#doc_topics.head()

As an aside: is there a relationship between topics and stars?

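Note that we set `fit_intercept` to `False`, so the model has no intercept. Because the topic weights of each review sum to 1, each coefficient can then be read as the expected star rating of a review written entirely about that topic.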

In [16]:
lr_topics = LinearRegression(fit_intercept=False)
lr_topics.fit(doc_topics.loc[:, list(range(num_topics))], doc_topics.stars)

LinearRegression(fit_intercept=False)

In [17]:
lr_topics.coef_

array([4.69134371, 2.97364619, 1.8099557 , 4.43651141, 4.37311008])

Look at the mean ratings for each topic above. Which topic is associated with the lowest ratings? What is the low-rated topic about? Does this make sense? Did LDA “work”? My topics above probably don’t make too much sense (on purpose); you should be able to train much smarter LDA models by tuning the parameters.
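To see why the coefficients of a no-intercept regression can be read as per-topic mean ratings, consider a made-up extreme case in which every review belongs entirely to one of two topics. The data and names below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up topic weights: the first two reviews are 100% topic 0, the last two 100% topic 1
toy_weights = np.array([[1.0, 0.0],
                        [1.0, 0.0],
                        [0.0, 1.0],
                        [0.0, 1.0]])
toy_stars = np.array([5.0, 4.0, 1.0, 2.0])

toy_lr = LinearRegression(fit_intercept=False).fit(toy_weights, toy_stars)
toy_coef = toy_lr.coef_
print(toy_coef)
```

Here the fitted coefficients are simply the average star rating within each topic (4.5 and 1.5). With real reviews that mix topics, the interpretation is approximate but similar.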


## Preparing the data for predictive modelling
In the rest of this lab we predict health inspection scores after including the per-restaurant topic distributions as predictors. Many of the methods may be new to you, but they are standard ways of preparing data for supervised learning. Be sure to revisit this exercise once we've started supervised learning.

For each business we have multiple reviews. Because we want one topic vector per business, we will aggregate by taking the mean topic weight for each business.

In [18]:
biz_topics = doc_topics.groupby('yelp_business_id').mean()
colnames = list(biz_topics.columns)
new_colnames = [ 't' + str(t) for t in colnames[:-1]] + [colnames[-1] ]
biz_topics.columns = new_colnames
biz_topics.head()

yelp_business_id,t0,t1,t2,t3,t4,stars
1-bite-mediterranean-new-york,0.171901,0.000575,0.698369,0.128579,0.000575,1.5
1-oak-new-york,0.07218,0.01899,0.60163,0.003272,0.303928,2.325
16-handles-new-york-6,0.346125,0.172471,0.126189,0.112014,0.2432,4.25
33-gourmet-deli-new-york,0.371333,0.495523,0.098337,0.022511,0.012297,3.2
38th-street-diner-new-york,0.066759,0.394436,0.526814,0.011712,0.00028,1.0
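The aggregation step can be checked by hand on a tiny example. The sketch below uses two made-up businesses (`cafe-a` and `diner-b` are hypothetical IDs) and two topics.

```python
import pandas as pd

# Made-up per-review topic weights; column names 0 and 1 mimic the integer topic columns
toy_doc_topics = pd.DataFrame({
    "yelp_business_id": ["cafe-a", "cafe-a", "diner-b"],
    0: [0.2, 0.6, 0.9],
    1: [0.8, 0.4, 0.1],
})

# One row per business: the mean weight of each topic across its reviews
toy_biz_topics = toy_doc_topics.groupby("yelp_business_id").mean()
toy_biz_topics.columns = ["t" + str(c) for c in toy_biz_topics.columns]
print(toy_biz_topics)
```

The two reviews of `cafe-a` are averaged into a single row, while `diner-b` keeps the weights of its only review.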


In [19]:
biz_topics.drop('stars', axis=1, inplace=True)

Now, let’s merge the topic weights with our inspection data:

In [20]:
df_inspections_train_topics = df_inspections_train.merge(right=biz_topics, on='yelp_business_id')
df_inspections_test_topics = df_inspections_test.merge(right=biz_topics, on='yelp_business_id')
df_inspections_train_topics.head()

Unnamed: 0,camis,dba,inspectionid,zipcode,cuisine,venue,chain_restaurant,inspdate,score,yelp_business_id,review_count,stars,t0,t1,t2,t3,t4
0,50017275,SWEETGREEN,1270182,10012,Salads,Restaurant (no bar),1,2015-09-10,12,sweetgreen-new-york-4,128,4.0,0.242166,0.38173,0.033855,0.073409,0.268839
1,50017275,SWEETGREEN,1276837,10012,Salads,Restaurant (no bar),1,2016-09-06,20,sweetgreen-new-york-4,128,4.0,0.242166,0.38173,0.033855,0.073409,0.268839
2,41671229,PANERA BREAD,1139571,10028,American,Restaurant (no bar),1,2014-08-13,17,panera-bread-new-york-6,126,3.0,0.136304,0.450557,0.306742,0.020602,0.085795
3,41671229,PANERA BREAD,1144197,10028,American,Restaurant (no bar),1,2014-09-12,40,panera-bread-new-york-6,126,3.0,0.136304,0.450557,0.306742,0.020602,0.085795
4,41671229,PANERA BREAD,1153664,10028,American,Restaurant (no bar),1,2014-09-16,2,panera-bread-new-york-6,126,3.0,0.136304,0.450557,0.306742,0.020602,0.085795
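One thing to keep in mind is that `merge` performs an inner join by default, so inspections of businesses without any topic weights are silently dropped. A minimal sketch with made-up frames (all names and values here are hypothetical) shows the difference from a left join.

```python
import pandas as pd

toy_inspections = pd.DataFrame({
    "yelp_business_id": ["cafe-a", "cafe-a", "diner-b"],
    "score": [12, 20, 40],
})
# Only cafe-a has topic weights in this made-up example
toy_topics = pd.DataFrame({
    "yelp_business_id": ["cafe-a"],
    "t0": [0.4],
    "t1": [0.6],
})

inner = toy_inspections.merge(toy_topics, on="yelp_business_id")              # default how="inner"
left = toy_inspections.merge(toy_topics, on="yelp_business_id", how="left")  # keeps every inspection
print(len(inner), len(left))
```

Comparing the number of rows before and after the merge is a quick way to check whether any inspections were lost.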


In [21]:
topic_predictors = ['t'+str(i) for i in range(num_topics)]
other_predictors = ['stars', 'chain_restaurant']
cat_predictors = ['cuisine']
predictors = topic_predictors + other_predictors + cat_predictors

Now we have topics for both test and train data and we are ready to start making predictions.

First, we have to convert the categorical predictor via one-hot encoding as we've done before. 

In [22]:
from sklearn.compose import ColumnTransformer

# list of transformations we want to do, and on which features
# (handle_unknown='ignore' encodes cuisines unseen during fitting as all zeros)
t = [('cat', OneHotEncoder(handle_unknown='ignore'), cat_predictors)]

# instantiate ColumnTransformer with our transformations t
col_transform = ColumnTransformer(transformers=t, remainder='passthrough')

In [23]:
# fit the encoder on the training data only, then reuse it on the test data
X_topics_train = col_transform.fit_transform(df_inspections_train_topics[predictors])
X_topics_test = col_transform.transform(df_inspections_test_topics[predictors])
y_train = df_inspections_train_topics.score
y_test = df_inspections_test_topics.score
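A point worth illustrating is that an encoder should be fitted on the training data only and then reused on the test data, so that both matrices have the same columns. The sketch below uses made-up cuisines and applies `OneHotEncoder` directly; the `toy_` names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_train = pd.DataFrame({"cuisine": ["Pizza", "Salads", "Pizza"]})
toy_test = pd.DataFrame({"cuisine": ["Salads", "Thai"]})  # "Thai" never appears in training

# handle_unknown="ignore" encodes unseen categories as a row of zeros instead of raising
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(toy_train).toarray()  # learn the categories here
test_encoded = encoder.transform(toy_test).toarray()        # reuse them here
print(list(encoder.categories_[0]))
print(test_encoded)
```

Calling `fit_transform` on the test data instead would learn a different set of columns, and the two matrices would no longer line up.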

In [24]:
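# normalize=True was removed in scikit-learn 1.2; for unregularized least squares,
# rescaling the features should not change the predictions, so we simply drop it
lr = LinearRegression()

lr.fit(X_topics_train, y_train)
yhat_lr = lr.predict(X_topics_test)

from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, yhat_lr) # signature is (y_true, y_pred)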

112.33759328309135

In [25]:
rf = RandomForestRegressor(max_depth=4)
rf.fit(X_topics_train, y_train)
yhat_rf = rf.predict(X_topics_test)
mean_squared_error(y_test, yhat_rf) # signature is (y_true, y_pred)

111.06637085281044
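An MSE of about 112 corresponds to a root mean squared error of roughly 10.6 points. To judge whether that is good, it helps to compare against a baseline that always predicts the mean score. The sketch below uses only the five scores shown in the training preview above, so its numbers are illustrative; in the lab you would predict `y_train.mean()` for every row of the test set.

```python
import numpy as np

# The five scores from the training preview (illustration only)
preview_scores = np.array([12, 20, 17, 40, 2], dtype=float)

# Baseline: predict the mean score for every inspection
baseline_pred = np.full_like(preview_scores, preview_scores.mean())
baseline_mse = np.mean((preview_scores - baseline_pred) ** 2)
baseline_rmse = np.sqrt(baseline_mse)  # same units as the score itself
print(baseline_mse, baseline_rmse)
```

If a model's test MSE is not clearly below that of such a baseline, the predictors are adding little information.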