In [1]:
import pandas as pd
import re

In [2]:
reviews_df = pd.read_csv('reviews_shortened.csv')
businesses_df = pd.read_csv('businesses_shortened.csv')

In [3]:
len(reviews_df)

3456983

In [4]:
# Count reviews containing the phrase; na=False treats missing text as a non-match
reviews_df['text'].str.contains("by myself", na=False).sum()

4403

Now we need to classify reviews: using the text, we want to identify the ones in which the reviewer was on a date.
This is a somewhat tricky classification problem, because:

1. All the data is unlabelled. We have no training data.
2. If someone is on a date, it's not clear how apparent that will be in a review. Some reviewers may not allude to that at all, and if they do, it may just be conveyed by a very small set of words.
3. We're not trying to rank anything like document similarity, but rather are trying to identify whether a review contains some particular topic. Consequently, using [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) or any of the other similarity indicators 

Unfortunately, there's no canonical model for such a classifier. It'd be great to be able to import some library that maps words to probabilities of being associated with topics, but no such resource is publicly available. So we have to create our own. There are several ways to proceed:

1. Create a list of words that you identify *a priori* as strongly associated with dates, and select reviews where the frequency of these terms passes some threshold. For this topic, there are some clear-cut terms: 'boyfriend', 'girlfriend', 'anniversary', etc. This is not a good approach: it relies on the personal linguistic biases of the statistician, the threshold is arbitrary, and it may assign undue weight to particular words. For example, whether to include the word 'date' is debatable, because we don't know how often 'date' in a Yelp review is used in a romantic sense rather than to refer to, e.g., a calendar date.

2. Identify some phrase that is almost always associated with being on a date, e.g. "on a date", and then use just the reviews containing that phrase. This is almost defensible, except that it risks sacrificing too much of the sample. In fact, out of 3,456,983 reviews, only 2,618 contain this phrase.

3. Using the identifying phrase "on a date", we can identify all reviews containing it: these reviews then form a *document* of date-associated reviews, a subset of the *corpus* of reviews. We can then use 

4. Use [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) on the corpus to abstractly cluster words into topics, and then to retrieve the reviews identified as containing those topics. 
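
To make the first approach concrete, here is a minimal sketch of keyword counting with a threshold. Everything in it is an assumption for illustration: the term set, the `min_hits` parameter, and the helper names `count_date_terms` and `looks_like_date` are hypothetical and not part of the original analysis. It also shows why the threshold feels arbitrary, since the same review changes label as `min_hits` moves.

```python
import re

# Hypothetical a priori term list (single words only, to keep tokenizing simple).
DATE_TERMS = {"boyfriend", "girlfriend", "anniversary", "honeymoon", "fiance"}

def count_date_terms(text, terms=DATE_TERMS):
    """Count the tokens in `text` that belong to `terms` (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for tok in tokens if tok in terms)

def looks_like_date(text, min_hits=1):
    """Flag a review once it contains at least `min_hits` date terms."""
    return count_date_terms(text) >= min_hits

reviews = [
    "Took my girlfriend here for our anniversary and it was lovely",
    "Please update the menu, the date on it is from last year",
    "My boyfriend loved the tacos",
]

# The third review is flagged at min_hits=1 but dropped at min_hits=2.
flags_loose = [looks_like_date(r, min_hits=1) for r in reviews]
flags_strict = [looks_like_date(r, min_hits=2) for r in reviews]
```

Leaving 'date' out of the term set keeps the second review unflagged, but it would equally miss a review that says only "great place for a date", which is the trade-off described in the first approach.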

In [59]:
date_words = ['date', 'dating', 'boyfriend', 'girlfriend', 'bf', 'gf', 'partner', 
              'boy friend', 'girl friend', 'fianc', 'fiance', 'fiancee',
              'marry', 'marriage', 'married', 'wedding', 'honeymoon', 'honey moon', 'anniversary']
# omitting low-signal words like SO, proposed, etc.

In [56]:
def includes_date_word(review):
    # `review` is one row of the DataFrame; this is a substring match on its text
    return any(date_word in review['text'] for date_word in date_words)

# axis=1 applies the function to each row; the default (axis=0) passes whole columns
date_us_reviews = us_reviews[us_reviews.apply(includes_date_word, axis=1)]

In [58]:
us_reviews[:2]

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,votes,location
0,5UmKMjUEUNdYWqANhGckJw,2012-08-01,Ya85v4eqdd6k9Od8HbQjyA,4,Mr Hoagie is an institution Walking in it does...,review,PUFPaY9KxDAcGqfsorJp3Q,"{u'funny': 0, u'useful': 0, u'cool': 0}",USA
1,5UmKMjUEUNdYWqANhGckJw,2014-02-13,KPvLNJ21_4wbYNctrOwWdQ,5,Excellent food Superb customer service I miss ...,review,Iu6AxdBYGR4A0wspR9BYHA,"{u'funny': 0, u'useful': 0, u'cool': 0}",USA
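
One caveat with the `in` test inside `includes_date_word`: it matches substrings, so 'date' also fires on 'update', and 'bf' or 'gf' on any word that happens to contain those letters. A possible refinement, sketched here under the assumption that whole-word matching is what we want, is to build a single regular expression with word boundaries. The names `DATE_PATTERN` and `has_date_word` are hypothetical, and the word list is an abbreviated copy of `date_words`.

```python
import re

date_words = ['date', 'dating', 'boyfriend', 'girlfriend', 'bf', 'gf',
              'boy friend', 'girl friend', 'anniversary']

# Longest terms first, so longer phrases are tried before shorter ones;
# re.escape guards against regex metacharacters in the terms.
_alternatives = '|'.join(
    re.escape(word) for word in sorted(date_words, key=len, reverse=True)
)
# Non-capturing group wrapped in word boundaries.
DATE_PATTERN = r'\b(?:' + _alternatives + r')\b'
_date_re = re.compile(DATE_PATTERN, flags=re.IGNORECASE)

def has_date_word(text):
    """True if `text` contains any date word as a whole word or phrase."""
    # Missing reviews arrive as NaN (a float), so guard against non-strings.
    if not isinstance(text, str):
        return False
    return _date_re.search(text) is not None
```

The same pattern string could be applied to the whole column with `us_reviews['text'].str.contains(DATE_PATTERN, case=False, na=False)`, which avoids a Python-level loop over every row. Whole-word matching removes the accidental hits, but it does not settle whether a given 'date' is romantic or a calendar date.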
