## Deep Inverse Regression with Yelp reviews

In this note we'll use [gensim](http://radimrehurek.com/gensim/) to turn the Word2Vec machinery into a document classifier, as in [Document Classification by Inversion of Distributed Language Representations](http://arxiv.org/pdf/1504.07295v3) from ACL 2015.

### Data and prep

First, download to the same directory as this note the data from the [Yelp recruiting contest](https://www.kaggle.com/c/yelp-recruiting) on [kaggle](https://www.kaggle.com/):
* https://www.kaggle.com/c/yelp-recruiting/download/yelp_training_set.zip
* https://www.kaggle.com/c/yelp-recruiting/download/yelp_test_set.zip

You'll need to sign-up for kaggle.



You can then unpack the data and grab the information we need.  

We'll use an incredibly simple parser

In [197]:
import re
contractions = re.compile(r"'|-|\"")
# all non alphanumeric
symbols = re.compile(r'(\W+)', re.U)
# single character removal
singles = re.compile(r'(\s\S\s)', re.I|re.U)
# separators (any whitespace)
seps = re.compile(r'\s+')

# cleaner (order matters)
def clean(text): 
    text = text.lower()
    text = contractions.sub('', text)
    text = symbols.sub(r' \1 ', text)
    text = singles.sub(' ', text)
    text = seps.sub(' ', text)
    return text

# sentence splitter
alteos = re.compile(r'([!\?])')
def sentences(l):
    l = alteos.sub(r' \1 .', l).rstrip("(\.)*\n")
    return l.split(".")


And put everything together in a review generator that provides tokenized sentences and the number of stars for every review.

In [198]:
from zipfile import ZipFile
import json

def YelpReviews(label):
    with ZipFile("yelp_%s_set.zip"%label, 'r') as zf:
        with zf.open("yelp_%s_set/yelp_%s_set_review.json"%(label,label)) as f:
            for line in f:
                rev = json.loads(line)
                yield {'y':rev['stars'],\
                       'x':[clean(s) for s in sentences(rev['text'])]}


For example:

In [199]:
YelpReviews("test").next()

{'x': [u'nice place big patio',
  u' now offering live sketch comedy ',
  u' wednesday november 17th see local troupe th sic sense in their 2nd annual holiday show ',
  u' lighter snappier take on the holiday times',
  u' not for the easily offended',
  u' sketches include the scariest holloween costume the first thanksgiving and who shot santa claus ',
  u' as well as the infectious song mama christmas'],
 'y': 5}

Now, since the files are small we'll just read everything into in-memory lists.  It takes a minute ...

In [298]:
revtrain = list(YelpReviews("training"))
print len(revtrain), "training reviews"

## and shuffle just in case they are ordered
import numpy as np
np.random.shuffle(revtrain)

229907 training reviews and 22956 test reviews


Finally, write a function to generate sentences from reviews that have certain star ratings

In [222]:
def StarSentences(reviews, stars=[1,2,3,4,5]):
    for r in reviews:
        if r['y'] in stars:
            for s in r['x']:
                yield s

### Word2Vec modeling

We fit out-of-the-box Word2Vec

In [223]:
from gensim.models import Word2Vec
import multiprocessing

## create a w2v learner 
basemodel = Word2Vec(
    workers=multiprocessing.cpu_count(), # use your cores
    iter=10) # sweeps of SGD through the data; more is better
print basemodel

Word2Vec(vocab=0, size=100, alpha=0.025)


Build vocab from all sentences (you could also pre-train the base model from a neutral or un-labeled vocabulary)

In [225]:
basemodel.build_vocab(StarSentences(revtrain))  

Now, we will _deep_ copy each base model and do star-specific training. This is where the big computations happen...

In [230]:
from copy import deepcopy
starmodels = [deepcopy(basemodel) for i in range(5)]
for i in range(5):
    slist = list(StarSentences(revtrain, [i+1]))
    print i+1, "stars (", len(slist), ")"
    starmodels[i].train(  slist, total_examples=len(slist) )
    

 1 ( 246207 )
2 ( 295371 )
3 ( 437718 )
4 ( 883235 )
5 ( 799704 )


### Inversion of the distributed representations

At this point, we have 5 different word2vec language representations.  Each 'model' has been trained conditional (i.e., limited to) text from a specific star rating.  We will apply Bayes rule to go from _p(text|stars)_ to _p(stars|text)_.

In [None]:
# read in the test set
revtest = list(YelpReviews("test"))

We'll go through this step by step for a single review, then give a function at the end to do it all at once.

#### step by step example

First, for any new sentence we can obtain its _likelihood_ (lhd; actually, the composite likelihood approximation; see the paper) using the [score](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.score) function in the `word2vec` class.  Here, we get the likelihood for each sentence in the first test review, then convert to a probability over star ratings.

In [337]:
# first review content
r = revtest[0]['x']
r

[u'nice place big patio',
 u' now offering live sketch comedy ',
 u' wednesday november 17th see local troupe th sic sense in their 2nd annual holiday show ',
 u' lighter snappier take on the holiday times',
 u' not for the easily offended',
 u' sketches include the scariest holloween costume the first thanksgiving and who shot santa claus ',
 u' as well as the infectious song mama christmas']

In [338]:
# the log likelihood of each sentence in this review under each w2v representation
# notice that score() takes a list [s] of sentences
llhd = np.array( [ mod.score(r) for mod in starmodels ] )
llhd # the 5 x nsentence array of likelihoods

array([[-1392.73327637, -2621.62890625, -7394.97021484, -3487.50146484,
        -2183.92285156, -8392.18261719, -3634.80883789],
       [-1330.90319824, -2545.45556641, -7279.16015625, -3442.45458984,
        -2109.57788086, -8300.24121094, -3551.42626953],
       [-1284.77978516, -2378.11181641, -6906.45703125, -3263.25366211,
        -2037.55859375, -7862.79492188, -3409.79345703],
       [-1298.27075195, -2346.79321289, -6846.25976562, -3209.80517578,
        -1981.15698242, -7657.74462891, -3317.09960938],
       [-1309.71130371, -2343.75366211, -6840.29150391, -3190.06494141,
        -1954.61560059, -7583.45361328, -3319.27368164]], dtype=float32)

In [339]:
# now exponentiate to get likelihoods, 
lhd = np.exp(llhd - llhd.max(axis=0)) # subtract row max to avoid numeric overload
lhd.round(2)

array([[ 0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ],
       [ 0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ],
       [ 1.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ],
       [ 0.  ,  0.05,  0.  ,  0.  ,  0.  ,  0.  ,  1.  ],
       [ 0.  ,  1.  ,  1.  ,  1.  ,  1.  ,  1.  ,  0.11]], dtype=float32)

In [340]:
# divide to get sentence-star probabilities
sprob = lhd/lhd.sum(axis=0)
sprob.round(2) # mostly 5-star, some 3-4 star sentences

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.05      ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.89999998],
       [ 0.        ,  0.94999999,  1.        ,  1.        ,  1.        ,
         1.        ,  0.1       ]], dtype=float32)

In [341]:
# and finally average the sentence probabilities to get the review probability
rprob = sprob.mean(axis=1)
print "star probs:\n ",
for p in rprob:
    print np.round(p,2), 
print "\ntrue class is", revtest[0]['y']

star probs:
  0.0 0.0 0.14 0.14 0.72 
true class is 5


#### A document probability function

Finally, we'll put this all together in a wrapper function.


In [393]:
"""
docprob takes two lists
* docs: a list of documents, each of which is a list of sentences
* models: the candidate word2vec models (each potential class)

it returns the array of class probabilities.  Everything is done in-memory.
"""

from itertools import groupby

#def docprob(docs, mods):
docs = [r['x'] for r in revtest[:10]]
mods = starmodels

llhd = np.array( [ m.score([s for d in docs for s in d]) for m in mods ] )
lhd = np.exp(llhd - llhd.max(axis=0)) # subtract row max to avoid numeric overload
prob = lhd/lhd.sum(axis=0)
prob

array([[  0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   7.90332334e-43,   0.00000000e+00,
          6.32460043e-02,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   5.41475740e-41,
          0.00000000e+00,   2.00000003e-01,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   2.00000003e-01,
          2.00000003e-01,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
        

In [384]:
docs
a = gensent()

In [385]:
a.next()

u'nice place big patio'

#### Test set example

Finally, we'll apply the inversion on the full test set.  

In [347]:
revprobs = [docprob(r['x'], starmodels) for r in revtest]

KeyboardInterrupt: 

In [344]:
revprobs

[array([  0.00000000e+00,   1.32971202e-21,   1.42856941e-01,
          1.35160163e-01,   7.21982956e-01], dtype=float32),
 array([  2.10820008e-02,   7.46492967e-02,   3.08000308e-04,
          4.71095055e-01,   4.32865709e-01], dtype=float32),
 array([  0.00000000e+00,   0.00000000e+00,   9.31741056e-26,
          2.55488089e-06,   9.99997497e-01], dtype=float32),
 array([ 0.02608696,  0.02608696,  0.02608714,  0.33043218,  0.59130675], dtype=float32),
 array([  0.00000000e+00,   0.00000000e+00,   2.83932750e-28,
          1.73301086e-01,   8.26698959e-01], dtype=float32),
 array([ 0.06153846,  0.06153846,  0.06153846,  0.17405614,  0.64132845], dtype=float32),
 array([  0.00000000e+00,   0.00000000e+00,   3.83223857e-36,
          9.92174685e-01,   7.82528240e-03], dtype=float32),
 array([  0.00000000e+00,   0.00000000e+00,   4.10701932e-24,
          3.33333343e-01,   6.66666687e-01], dtype=float32),
 array([  0.00000000e+00,   0.00000000e+00,   6.60448352e-25,
          7.77361274

In [321]:
docs = [[s.split() for s in r['x']] for r in revtest[:10]]
docs 


[[[u'nice', u'place', u'big', u'patio'],
  [u'now', u'offering', u'live', u'sketch', u'comedy'],
  [u'wednesday',
   u'november',
   u'17th',
   u'see',
   u'local',
   u'troupe',
   u'th',
   u'sic',
   u'sense',
   u'in',
   u'their',
   u'2nd',
   u'annual',
   u'holiday',
   u'show'],
  [u'lighter', u'snappier', u'take', u'on', u'the', u'holiday', u'times'],
  [u'not', u'for', u'the', u'easily', u'offended'],
  [u'sketches',
   u'include',
   u'the',
   u'scariest',
   u'holloween',
   u'costume',
   u'the',
   u'first',
   u'thanksgiving',
   u'and',
   u'who',
   u'shot',
   u'santa',
   u'claus'],
  [u'as',
   u'well',
   u'as',
   u'the',
   u'infectious',
   u'song',
   u'mama',
   u'christmas']],
 [[u'friendly', u'staff'],
  [u'make',
   u'sure',
   u'you',
   u'order',
   u'the',
   u'gyro',
   u'plate',
   u'and',
   u'souvlaki',
   u'plate'],
  [u'yum']],
 [[u'love', u'love', u'love', u'this', u'place', u'for', u'breakfast'],
  [u'they',
   u'are',
   u'always',
   u'busy'