In [1]:
import pandas as pd
import re

In [2]:
reviews_df = pd.read_csv('reviews_shortened.csv')
businesses_df = pd.read_csv('businesses_shortened.csv')

Now we need to classify reviews: using the text, we want to identify the ones in which the reviewer was on a date. 
This is actually a very difficult classification problem, because:

1. All the data is unlabelled. We have no training data.

2. If someone is on a date, it's not clear how apparent that will be in a review. Some reviewers may not allude to it at all. Though words like "boyfriend" or "girlfriend" would make it clear, we have to assume that a nontrivial portion of the Yelp reviews were written by people on dates who did not mention that fact. Conversely, if someone is *not* on a date, it seems quite unlikely that the reviewer would mention that in any explicit way that we can encode in a model.

While there are other difficulties in binary topic-classification of textual data (e.g. setting thresholds), the two points above confound any sweeping analysis. Though it's possible to use unsupervised/semi-supervised methods to construct some set of words associated with an abstract date-topic, the fundamental issue is that some very significant portion of the reviews may not carry a feature that lets us classify them correctly. In other words, the approach given by "if we detect date-words, classify it as a date, otherwise, classify it as a non-date" may lead to a high number of false negatives.
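To make the false-negative concern concrete, here is a minimal sketch on a made-up toy frame. The three reviews and the `was_date` labels are hypothetical and exist only for illustration; the real data has no such labels.

```python
import pandas as pd

# Hypothetical toy reviews with made-up ground-truth labels (illustration only).
toy = pd.DataFrame({
    "text": [
        "took my girlfriend here for our anniversary",
        "we shared a bottle of wine and the tasting menu",
        "grabbed lunch between meetings",
    ],
    "was_date": [True, True, False],
})

# Naive rule: a review counts as a date only if it contains a date-word.
predicted = toy["text"].str.contains("girlfriend|boyfriend", na=False)

# Dates the rule misses because the reviewer never mentions a partner.
false_negatives = toy[toy["was_date"] & ~predicted]
print(false_negatives["text"].tolist())
```

The second review is a date by construction, but nothing in its text lets a keyword rule recover that.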

So we go for a more explicit approach: rather than trying to use the entire dataset, on which any attempt at accurate binary classification will almost certainly fail, we will isolate the reviews that are *certainly* dates and *certainly* non-dates. This will let us make a more explicit comparison, though we may also then use these statistics as reference points when comparing with the corpus of all reviews.

In particular, we identify certain dates as reviews that include the phrase "on a date", which is generally not used in any other context (i.e. the probability of this particular phrase referring to a datetime seems exceedingly low), and the phrase "by myself" to indicate when the reviewer is not on a date. We'll be careful to look out for negation of either of those two phrases.

In [21]:
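# Significant-other terms used as evidence that the reviewer was on a date.
date_words = ["boyfriend", "boy friend", "girlfriend", "girl friend", "wife", "husband"]
date_query = "|".join(date_words)

def on_a_date(review_text):
    # True where any date-word occurs; na=False treats missing text as no match.
    return review_text.str.contains(date_query, na=False)

def by_myself(review_text):
    # True for " by myself" unless it is negated as "not by myself".
    solo = review_text.str.contains(" by myself", regex=False, na=False)
    negated = review_text.str.contains("not by myself", regex=False, na=False)
    return solo & ~negated

# Compute each mask once, then keep reviews that match exactly one of them.
date_mask = on_a_date(reviews_df['text'])
solo_mask = by_myself(reviews_df['text'])

non_dates = reviews_df[solo_mask & ~date_mask]

dates = reviews_df[date_mask & ~solo_mask]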
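One caveat with plain substring matching is that "wife" also matches inside words such as "midwife". A possible refinement, sketched here under the assumption that whole-word matching is what we want, anchors each term at word boundaries. The names `date_pattern` and `mentions_partner` are hypothetical and not part of the analysis above.

```python
import re
import pandas as pd

date_words = ["boyfriend", "boy friend", "girlfriend", "girl friend", "wife", "husband"]

# \b anchors the alternatives at word boundaries; re.escape guards against
# regex metacharacters if the word list ever grows.
date_pattern = r"\b(?:" + "|".join(re.escape(w) for w in date_words) + r")\b"

def mentions_partner(review_text):
    """Boolean Series: True where a date-word appears as a whole word."""
    return review_text.str.contains(date_pattern, case=False, na=False)

sample = pd.Series([
    "my wife loved the pasta",
    "the midwife recommended this cafe",
    None,
])
print(mentions_partner(sample).tolist())
```

The trade-off is that the trailing boundary also rejects forms like "husbands", which can appear when apostrophes have been stripped from the review text, so the counts would shift in both directions.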

Note: it's actually really hard to identify whether someone was on a date. The phrase "on a date" turns out to be useless for this purpose. We should use an SO (significant other) term instead, e.g. boyfriend, girlfriend, wife, husband, etc.

But then we introduce some bias: is a person more likely to mention their SO in a review under certain conditions?

This gets pretty tricky. It needs a more sophisticated linguistic model than I anticipated.
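One rough way to probe that bias is to check whether the SO-mention rate varies with the star rating. The sketch below runs on a hypothetical toy frame; it assumes only that the reviews carry `stars` and `text` columns, and the six rows are made up.

```python
import pandas as pd

# Hypothetical stand-in for reviews_df; only 'stars' and 'text' are used.
toy_reviews = pd.DataFrame({
    "stars": [5, 5, 1, 1, 3, 3],
    "text": [
        "my husband loved it",
        "great tacos",
        "my girlfriend got sick",
        "rude staff",
        "okay coffee",
        "decent pizza",
    ],
})

has_so_word = toy_reviews["text"].str.contains(
    "husband|wife|girlfriend|boyfriend", na=False
)

# Share of reviews at each star level that mention a significant other.
mention_rate = has_so_word.groupby(toy_reviews["stars"]).mean()
print(mention_rate)
```

If the rate on the real data differs sharply across star levels, any comparison of ratings between the date and non-date groups would partly reflect who chooses to mention a partner rather than who was actually on a date.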

hubby: 305303

len(dates): 323,153 // 305,303 no hubby
len(non_dates): 4,364
len(intersection): 569

After removing the intersection:

len(dates): 304764
len(non_dates): 3825

Data processing: need to filter to restaurants and cafes only. No chiropractors.
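A minimal sketch of that filtering step follows. The columns of `businesses_df` are not shown in this notebook, so `business_id` and `categories` are assumptions, as are the category labels; the toy frames are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical stand-ins; 'business_id' and 'categories' are assumed columns.
toy_businesses = pd.DataFrame({
    "business_id": ["a", "b", "c"],
    "categories": [
        "['Restaurants', 'Italian']",
        "['Chiropractors', 'Health & Medical']",
        "['Cafes', 'Coffee & Tea']",
    ],
})
toy_reviews = pd.DataFrame({
    "business_id": ["a", "b", "c", "a"],
    "text": ["great pasta", "fixed my back", "nice latte", "slow service"],
})

# Keep businesses whose category string names a restaurant or a cafe.
is_food = toy_businesses["categories"].str.contains("Restaurants|Cafes", na=False)
food_ids = toy_businesses.loc[is_food, "business_id"]

# Keep only the reviews that belong to those businesses.
food_reviews = toy_reviews[toy_reviews["business_id"].isin(food_ids)]
print(food_reviews)
```

Filtering the reviews before building `dates` and `non_dates` would keep the two groups comparable, since both would then cover the same kind of business.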
  

In [20]:
list(non_dates['text'][:100])

['for the past two sundays ive come here to hang out for a couple of hours while my son is at a church function  i enjoy watching the sunday night football game here and enjoying a drink or two  last night i decided to grab a bite to eati was dining by myself and wanted some finger food  after browsing the menu for a few minutes i decided to go with the nachos 1050  to wash it down i had an iced teai watched the giantsbears game while i waited for my grubit didnt take too long before it camethe waitress delivered the nachos and they were a friggin mountain of nachos  see pics seriouslyim not opposed to large portions but damn this thing could easily feed 4 people  i was a little pissed because i didnt want to eat that much and the waitress didnt say anythingalso the menu didnt mention the gargantuan size of the nachos either  i reacted by saying damn thats a friggin mountain of foodshe replied i knowwell if you knew why didnt you say something when i ordered itthis is where i am torndo

In [58]:
us_reviews[:2]

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,votes,location
0,5UmKMjUEUNdYWqANhGckJw,2012-08-01,Ya85v4eqdd6k9Od8HbQjyA,4,Mr Hoagie is an institution Walking in it does...,review,PUFPaY9KxDAcGqfsorJp3Q,"{u'funny': 0, u'useful': 0, u'cool': 0}",USA
1,5UmKMjUEUNdYWqANhGckJw,2014-02-13,KPvLNJ21_4wbYNctrOwWdQ,5,Excellent food Superb customer service I miss ...,review,Iu6AxdBYGR4A0wspR9BYHA,"{u'funny': 0, u'useful': 0, u'cool': 0}",USA
