<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 4

## Help Yelp

---

In this project you will be investigating a small version of the [Yelp challenge dataset](https://www.yelp.com/dataset_challenge). You'll practice using classification algorithms, cross-validation, gridsearching – all that good stuff.



---

### The data

There are 5 individual .csv files that have the information, zipped into .7z format like with the SF data last project. The dataset is located in your datasets folder:

    DSI-SF-2/datasets/yelp_arizona_data.7z

The columns in each are:

    businesses_small_parsed.csv
        business_id: unique business identifier
        name: name of the business
        review_count: number of reviews per business
        city: city business resides in
        stars: average rating
        categories: categories the business falls into (can be one or multiple)
        latitude
        longitude
        neighborhoods: neighborhoods business belongs to
        variable: "property" of the business (a tag)
        value: True/False for the property
        
    reviews_small_nlp_parsed.csv
        user_id: unique user identifier
        review_id: unique review identifier
        votes.cool: how many thought the review was "cool"
        business_id: unique business id the review is for
        votes.funny: how many thought the review was funny
        stars: rating given
        date: date of review
        votes.useful: how many thought the review was useful
        ... 100 columns of counts of most common 2 word phrases that appear in reviews in this review
        
    users_small_parsed.csv
        yelping_since: signup date
        compliments.plain: # of compliments "plain"
        review_count: # of reviews:
        compliments.cute: total # of compliments "cute"
        compliments.writer: # of compliments "writer"
        compliments.note: # of compliments "note" (not sure what this is)
        compliments.hot: # of compliments "hot" (?)
        compliments.cool: # of compliments "cool"
        compliments.profile: # of compliments "profile"
        average_stars: average rating
        compliments.more: # of compliments "more"
        elite: years considered "elite"
        name: user's name
        user_id: unique user id
        votes.cool: # of votes "cool"
        compliments.list: # of compliments "list"
        votes.funny: # of compliments "funny"
        compliments.photos: # of compliments "photos"
        compliments.funny: # of compliments "funny"
        votes.useful: # of votes "useful"
       
    checkins_small_parsed.csv
        business_id: unique business identifier
        variable: day-time identifier of checkins (0-0 is Sunday 0:00 - 1:00am,  for example)
        value: # of checkins at that time
    
    tips_small_nlp_parsed.csv
        user_id: unique user identifier
        business_id: unique business identifier
        likes: likes that the tip has
        date: date of tip
        ... 100 columns of counts of most common 2 word phrases that appear in tips in this tip

The reviews and tips datasets in particular have parsed "NLP" columns with counts of 2-word phrases in that review or tip (a "tip", it seems, is some kind of smaller review).

The user dataset has a lot of columns of counts of different compliments and votes. I'm not sure whether the compliments or votes are _by_ the user or _for_ the user.

---

If you look at the website, or the full data, you'll see I have removed pieces of the data and cut it down quite a bit. This is to simplify it for this project. Specifically, business are limited to be in these cities:

    Phoenix
    Surprise
    Las Vegas
    Waterloo

Apparently there is a city called "Surprise" in Arizona. 

Businesses are also restricted to at least be in one of the following categories, because I thought the mix of them was funny:

    Airports
    Breakfast & Brunch
    Bubble Tea
    Burgers
    Bars
    Bakeries
    Breweries
    Cafes
    Candy Stores
    Comedy Clubs
    Courthouses
    Dance Clubs
    Fast Food
    Museums
    Tattoo
    Vape Shops
    Yoga
    
---

### Project requirements

**You will be performing 4 different sections of analysis, like in the last project.**

Remember that classification targets are categorical and regression targets are continuous variables.

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 1. Constructing a "profile" for Las Vegas

---

Yelp is interested in building out what they are calling "profiles" for cities. They want you to start with just Las Vegas to see what a prototype of this would look like. Essentially, they want to know what makes Las Vegas distinct from the other four.

Use the data you have to predict Las Vegas from the other variables you have. You should not be predicting the city from any kind of location data or other data perfectly associated with that city (or another city).

You may use any classification algorithm you deem appropriate, or even multiple models. You should:

1. Build at least one model predicting Las Vegas vs. the other cities.
- Validate your model(s).
- Interpret and visualize, in some way, the results.
- Write up a "profile" for Las Vegas. This should be a writeup converting your findings from the model(s) into a human-readable description of the city.

In [40]:
import numpy as np
import pandas as pd
import patsy

from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.cross_validation import cross_val_score, StratifiedKFold, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

%config InlineBackend.figure_format = 'retina'
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

from sklearn.neighbors import KNeighborsClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report

In [41]:
# read in the business field
pathb = '/Users/paulmartin/Desktop/DSI-SF-2-GitPaulM/datasets/yelp_arizona_data/businesses_small_parsedb.csv'
biz = pd.read_csv(pathb)


In [42]:
# Create a Las Vegas Binary Field (either or)

biz_lv = biz
biz_lv['lv_flag'] = biz_lv['city'].map(lambda x: 1 if x == "Las Vegas" else 0)


#clean column names  (Note to self:  stringing replaces does not seem  work as planned with multiple replacements))
# probably a better way to do this in place, but this works

new_col = [x.replace('attributes.','').replace('hours.','').replace('Ambience.','') for x in biz_lv.columns.values]
new_col = [x.replace(' ','_').replace('-','_') for x in new_col]
new_col = [x.replace('.','_').capitalize() for x in new_col]
biz_lv.columns = new_col

# Eliminate columns with zeroes

biz_lv = biz_lv.loc[:, (biz_lv != 0).any(axis=0)] #http://stackoverflow.com/questions/21164910/delete-column-in-pandas-based-on-condition

# Columns (since we are here): the patsy formula

form_lv=[]
[form_lv.append(x) for x in biz_lv.columns if x not in ['Business_id','City','Name','Lv_flag']]
form =  ' + '.join(form_lv)
formula_lv = 'Lv_flag ~ ' + form + ' - 1'
formula_lv

'Lv_flag ~ Review_count + Stars + Divey + Dietary_restrictions_vegan + Happy_hour + Order_at_counter + Byob + Good_for_latenight + Outdoor_seating + Classy + By_appointment_only + Parking_lot + Touristy + Corkage + Good_for_brunch + Waiter_service + Parking_street + Hipster + Music_live + Music_background_music + Good_for_breakfast + Parking_garage + Music_karaoke + Good_for_dancing + Accepts_credit_cards + Good_for_lunch + Parking_valet + Take_out + Good_for_dessert + Music_video + Takes_reservations + Trendy + Delivery + Open + Wheelchair_accessible + Dietary_restrictions_gluten_free + Caters + Intimate + Good_for_dinner + Coat_check + Good_for_kids + Parking_validated + Music_dj + Has_tv + Casual + Dogs_allowed + Drive_thru + Dietary_restrictions_vegetarian + Good_for_groups + Open_24_hours + Romantic + Music_jukebox + Upscale - 1'

#### pathr = '/Users/paulmartin/Desktop/DSI-SF-2-GitPaulM/datasets/yelp_arizona_data/reviews_small_nlp_parsed.csv'
reviews = pd.read_csv(pathr)
reviews.shape

In [43]:
#shuffle/stratify
biz_lv = biz_lv.sample(frac=1).reset_index(drop=True)  #http://stackoverflow.com/questions/29576430/shuffle-dataframe-rows

# slice 
biz_lv_small = biz_lv[0:35000]

In [44]:
#patsy: (Remember:  ravel the y, and use iloc with train/test)
y,X = patsy.dmatrices(formula_lv, data=biz_lv_small, return_type='dataframe')

In [45]:
# confirm
y = np.ravel(y)
print type(y), y.shape
print type(X), X.shape

<type 'numpy.ndarray'> (35000,)
<class 'pandas.core.frame.DataFrame'> (35000, 53)


In [33]:
#Set up the indices, get the model and cross validate   Use log reg as classification model.
cv_indices = StratifiedKFold(y, n_folds=5)
logreg = LogisticRegression()
lr_scores = []

# cross validate i.e. train / test
for traini, testi in cv_indices:
    Xtr, ytr  = X.iloc[traini, :], y[traini]
    Xte, yte =  X.iloc[testi, :], y[testi]
  
logreg.fit(Xtr, ytr)
lr_scores.append(logreg.score(Xte, yte))

print cv_indices    

print 'Logistic Regression:'
print lr_scores
print np.mean(lr_scores)

print 'Baseline accuracy:', np.mean(y)


sklearn.cross_validation.StratifiedKFold(labels=[ 1.  1.  1. ...,  0.  0.  0.], n_folds=5, shuffle=False, random_state=None)
Logistic Regression:
[0.61823117588226895]
0.618231175882
Baseline accuracy: 0.616942857143


In [46]:
#Classification report / Confusion matrix

y_pred = logreg.predict(X)
print type(y_pred)
from sklearn.metrics import classification_report
cls_rep = classification_report(y, y_pred)
print cls_rep

confusion = pd.crosstab(y, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
print confusion

<type 'numpy.ndarray'>
             precision    recall  f1-score   support

        0.0       0.57      0.01      0.03     13537
        1.0       0.61      0.99      0.76     21463

avg / total       0.60      0.61      0.48     35000

Predicted  0.0    1.0    All
Actual                      
0.0        178  13359  13537
1.0        135  21328  21463
All        313  34687  35000


In [47]:
# Standardize for regularization
ss = StandardScaler()
Xn = ss.fit_transform(X)


In [48]:
# Let's use Grid search and regularization to see if there is a difference in the model.  More academic really.

lr_params = {
    'penalty':['l1','l2'],
    'solver':['liblinear'],
    'C':np.linspace(0.0001, 1000, 50)
}

lr_gs = GridSearchCV(LogisticRegression(), lr_params, cv=5, verbose=1)

lr_gs.fit(Xn, y)
print lr_gs.best_params_
best_lr = lr_gs.best_estimator_

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:   15.2s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:  1.1min
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:  2.1min


{'penalty': 'l1', 'C': 183.67355102040815, 'solver': 'liblinear'}


[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:  2.3min finished


In [55]:
# Calculate the predictions for the confusion matrix, based on new data.  Largee data sets, not
# expecting any surprises


y_pred = lr_gs.predict(Xn)
print type(y_pred)
from sklearn.metrics import classification_report
cls_rep = classification_report(y, y_pred)
print cls_rep


<type 'numpy.ndarray'>
             precision    recall  f1-score   support

        0.0       0.56      0.03      0.06     13537
        1.0       0.62      0.99      0.76     21463

avg / total       0.60      0.62      0.49     35000



In [56]:
# Cross validate the best model

cv_indices = StratifiedKFold(y, n_folds=5)

lr_scores = []

for train_inds, test_inds in cv_indices:
    
    Xtr, ytr = Xn[train_inds, :], y[train_inds]
    Xte, yte = Xn[test_inds, :], y[test_inds]
   
    best_lr.fit(Xtr, ytr)
    lr_scores.append(best_lr.score(Xte, yte))
    

print 'Logistic Regression:'
print lr_scores
print np.mean(lr_scores)

print 'Baseline accuracy:', np.mean(y)

Logistic Regression:
[0.61591201256963291, 0.61219825739180123, 0.61642857142857144, 0.61680240034290612, 0.61480211458779821]
0.615228671264
Baseline accuracy: 0.613228571429


In [57]:
# So the mean scores are consistent  Unlikely suprising given the data set.  
# Let's examine the confusion matrix
print y.shape
print Xn.shape

(35000,)
(35000, 53)


In [58]:
# Calculate the predictions for the confusion matrix and the classification report , based on new data


y_pred = best_lr.predict(Xn)
print type(y_pred)

print "Classification Report"
from sklearn.metrics import classification_report
cls_rep = classification_report(y, y_pred)
print cls_rep

print "Confusion Matrix"
confusion = pd.crosstab(y, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
print confusion


<type 'numpy.ndarray'>
Classification Report
             precision    recall  f1-score   support

        0.0       0.56      0.03      0.06     13537
        1.0       0.62      0.99      0.76     21463

avg / total       0.60      0.62      0.49     35000

Confusion Matrix
Predicted  0.0    1.0    All
Actual                      
0.0        398  13139  13537
1.0        310  21153  21463
All        708  34292  35000


In [59]:
ls = X.columns.values.tolist()
coefs = pd.DataFrame({'variable':ls,'coef':best_lr.coef_[0], 'abs_coef':np.abs(best_lr.coef_[0])})
coefs.sort_values('abs_coef', ascending=False, inplace=True)
print coefs.head(10)

    abs_coef      coef                          variable
0   0.341452  0.341452                      Review_count
13  0.124498  0.124498                           Corkage
49  0.114595  0.114595                     Open_24_hours
52  0.102254  0.102254                           Upscale
21  0.087549  0.087549                    Parking_garage
50  0.072007  0.072007                          Romantic
8   0.071275 -0.071275                   Outdoor_seating
47  0.059184 -0.059184   Dietary_restrictions_vegetarian
35  0.058542 -0.058542  Dietary_restrictions_gluten_free
26  0.052823  0.052823                     Parking_valet


In [60]:
print ls

['Review_count', 'Stars', 'Divey', 'Dietary_restrictions_vegan', 'Happy_hour', 'Order_at_counter', 'Byob', 'Good_for_latenight', 'Outdoor_seating', 'Classy', 'By_appointment_only', 'Parking_lot', 'Touristy', 'Corkage', 'Good_for_brunch', 'Waiter_service', 'Parking_street', 'Hipster', 'Music_live', 'Music_background_music', 'Good_for_breakfast', 'Parking_garage', 'Music_karaoke', 'Good_for_dancing', 'Accepts_credit_cards', 'Good_for_lunch', 'Parking_valet', 'Take_out', 'Good_for_dessert', 'Music_video', 'Takes_reservations', 'Trendy', 'Delivery', 'Open', 'Wheelchair_accessible', 'Dietary_restrictions_gluten_free', 'Caters', 'Intimate', 'Good_for_dinner', 'Coat_check', 'Good_for_kids', 'Parking_validated', 'Music_dj', 'Has_tv', 'Casual', 'Dogs_allowed', 'Drive_thru', 'Dietary_restrictions_vegetarian', 'Good_for_groups', 'Open_24_hours', 'Romantic', 'Music_jukebox', 'Upscale']


In [67]:
import matplotlib.pyplot as plt
top_ten = coefs.columns.values.tolist()
top_ten = top_ten
ax = coefs.head(10)[['abs_coef']].plot(kind='bar', title ="Top Coefficients",figsize=(15,10),legend=True, fontsize=12)
ax.set_xlabel(top_ten,fontsize=12)
ax.set_ylabel("Absolute Coeff Value",fontsize=12)
plt.show()

# import matplotlib.pyplot as plt
# ax = df[['V1','V2']].plot(kind='bar', title ="V comp",figsize=(15,10),legend=True, fontsize=12)
# ax.set_xlabel("Hour",fontsize=12)
# ax.set_ylabel("V",fontsize=12)
# plt.show()


SyntaxError: invalid syntax (<ipython-input-67-41983edc9255>, line 4)

Based on the top 10 variables: The Las Vegas profile is one of convenience, fun, and entertainment.

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 2. Different categories of ratings

---

Yelp is finally ready to admit that their rating system sucks. No one cares about the ratings, they just use the site to find out what's nearby. The ratings are simply too unreliable for people. 

Yelp hypothesizes that this is, in fact, because different people tend to give their ratings based on different things. They believe that perhaps some people always base their ratings on quality of food, others on service, and perhaps other categories as well. 

1. Do some users tend to talk about service more than others in reviews/tips? Divide up the tips/reviews into more "service-focused" ones and those less concerned with service.
2. Create two new ratings for businesses: ratings from just the service-focused reviews and ratings from the non-service reviews.
3. Construct a regression model for each of the two ratings. They should use the same predictor variables (of your choice). 
4. Validate the performance of the models.
5. Do the models coefficients differ at all? What does this tell you about the hypothesis that there are in fact two different kinds of ratings?

In [17]:
pathr = '/Users/paulmartin/Desktop/DSI-SF-2-GitPaulM/datasets/yelp_arizona_data/reviews_small_nlp_parsed.csv'
reviews = pd.read_csv(pathr)
print reviews.shape
reviews.head()

(322398, 108)


Unnamed: 0,user_id,review_id,votes.cool,business_id,votes.funny,stars,date,votes.useful,10 minutes,15 minutes,...,service great,staff friendly,super friendly,sweet potato,tasted like,time vegas,try place,ve seen,ve tried,wait staff
0,o_LCYay4uo5N4eq3U5pbrQ,biEOCicjWlibF26pNLvhcw,0,EmzaQR5hQlF0WIl24NxAZA,0,3,2007-09-14,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,sEWeeq41k4ohBz4jS_iGRw,tOhOHUAS7XJch7a_HW5Csw,3,EmzaQR5hQlF0WIl24NxAZA,12,2,2008-04-21,3,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1AqEqmmVHgYCuzcMrF4h2g,2aGafu-x7onydGoDgDfeQQ,0,EmzaQR5hQlF0WIl24NxAZA,2,2,2009-11-16,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,pv82zTlB5Txsu2Pusu__FA,CY4SWiYcUZTWS_T_cGaGPA,4,EmzaQR5hQlF0WIl24NxAZA,9,2,2010-08-16,6,0,0,...,0,0,0,0,0,0,0,0,0,0
4,jlr3OBS1_Y3Lqa-H3-FR1g,VCKytaG-_YkxmQosH4E0jw,0,EmzaQR5hQlF0WIl24NxAZA,1,4,2010-12-04,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
#clean up the data a bit -- nulls / columns names
new_col = reviews.columns.values
new_col = [x.replace(' ','_') for x in new_col]
new_col = [x.replace('.','_') for x in new_col]
new_col = [x.replace('10','ten') for x in new_col]
new_col = [x.replace('15','fifteen') for x in new_col]
new_col = [x.replace('20','twenty') for x in new_col]
new_col = [x.replace('30','thirty') for x in new_col]

reviews.columns = new_col
reviews.columns.values
reviews.fillna(0,inplace=True)

# Eliminate columns with zeroes

reviews = reviews.loc[:, (reviews != 0).any(axis=0)] #http://stackoverflow.com/questions/21164910/delete-column-in-pandas-based-on-condition

# Columns (since we are here): the patsy formula

In [19]:
# Going to create two uber rating systems, based on solely the great service and the great food categories only. 
#The intent is to see if the other predictors are able to predict a great experience and if so by how much
#We will not throw out the initial review system.  Rather we will wait that one half of the original weighting
#plus  


reviews['uber_food'] = reviews['food_amazing'] + reviews['food_great']
reviews['uber_service'] = reviews['service_excellent'] + reviews['service_great']

reviews['uber_food'] = reviews['uber_food'].map(lambda x: 1 if  x >= 1 else 0) 
reviews['uber_service'] = reviews['uber_service'].map(lambda x: 1 if x >= 1 else 0)


In [20]:
reviews.uber_food.value_counts()

0    310848
1     11550
Name: uber_food, dtype: int64

In [21]:
np.std(reviews.uber_service)


0.18421919319490798

In [22]:
form_rv=[]
[form_rv.append(x) for x in reviews.columns if x not in ['stars','user_id','review_id','business_id','date','uber_food','uber_service','food_amazing','food_great','service_excellent','service_great']]
form =  ' + '.join(form_rv)
formula_rv_food = 'uber_food ~ ' + form + ' - 1'
formula_rv_service = 'uber_service ~ ' + form + ' - 1'


In [23]:
reviews.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
votes_cool,322398.0,0.660293,1.900567,0.0,0.0,0.0,1.0,137.0
votes_funny,322398.0,0.555918,1.801933,0.0,0.0,0.0,0.0,129.0
stars,322398.0,3.747371,1.303634,1.0,3.0,4.0,5.0,5.0
votes_useful,322398.0,1.067631,2.328657,0.0,0.0,0.0,1.0,141.0
ten_minutes,322398.0,0.014063,0.130140,0.0,0.0,0.0,0.0,6.0
fifteen_minutes,322398.0,0.012910,0.124089,0.0,0.0,0.0,0.0,7.0
twenty_minutes,322398.0,0.012280,0.118960,0.0,0.0,0.0,0.0,4.0
thirty_minutes,322398.0,0.009926,0.105938,0.0,0.0,0.0,0.0,4.0
bar_food,322398.0,0.007972,0.095778,0.0,0.0,0.0,0.0,5.0
beer_selection,322398.0,0.009764,0.101375,0.0,0.0,0.0,0.0,4.0


In [24]:
print formula_rv_food


uber_food ~ votes_cool + votes_funny + votes_useful + ten_minutes + fifteen_minutes + twenty_minutes + thirty_minutes + bar_food + beer_selection + best_ve + bloody_mary + bottle_service + chicken_waffles + customer_service + dance_floor + decided_try + definitely_come + definitely_recommend + didn_want + don_know + don_like + don_think + don_want + eggs_benedict + fast_food + feel_like + felt_like + fish_chips + food_came + food_delicious + food_good + food_just + food_service + french_fries + french_toast + friday_night + fried_chicken + friendly_staff + good_food + good_place + good_service + good_thing + good_time + great_atmosphere + great_experience + great_food + great_place + great_service + great_time + happy_hour + hash_browns + highly_recommend + hip_hop + ice_cream + just_like + just_ok + just_right + las_vegas + late_night + like_place + little_bit + long_time + looked_like + looks_like + love_place + mac_cheese + make_sure + mashed_potatoes + medium_rare + minutes_later +

In [25]:
#shuffle/stratify, # get a decent sample size
reviews = reviews.sample(frac=1).reset_index(drop=True)
reviews.reset_index()
#http://stackoverflow.com/questions/29576430/shuffle-dataframe-rows
reviews_sample = reviews[0:40000]


In [26]:
#patsy, ravel
y,X = patsy.dmatrices(formula_rv_food, data=reviews_sample, return_type='dataframe')
y = np.ravel(y)
print y.shape, X.shape

(40000,) (40000, 99)


In [27]:
#Train / test

cv_indices = StratifiedKFold(y, n_folds=10)

logreg = LogisticRegression()
lr_scores = []

#Train / Test loop
for traini, testi in cv_indices:
    Xtr, ytr  = X.iloc[traini, :], y[traini]
    Xte, yte =  X.iloc[testi, :], y[testi]
    logreg.fit(Xtr, ytr)
    lr_scores.append(logreg.score(Xte, yte))   

print 'Logistic Regression Food:'
print "Scores: ", lr_scores
print "Scores mean: ", np.mean(lr_scores)

print 'Baseline accuracy:', np.mean(y)


Logistic Regression Food:
Scores:  [0.96400899775056237, 0.96375906023494129, 0.96500874781304669, 0.96575856035991003, 0.96600849787553111, 0.96374093523380844, 0.96524131032758187, 0.96249062265566387, 0.96274068517129285, 0.96574143535883972]
Scores mean:  0.964449885278
Baseline accuracy: 0.035375


In [28]:

y_pred = logreg.predict(X)

print "Classification Report"
cls_rep = classification_report(y, y_pred)
print cls_rep
confusion = pd.crosstab(y, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
print confusion

Classification Report
             precision    recall  f1-score   support

        0.0       0.97      1.00      0.98     38585
        1.0       0.55      0.07      0.12      1415

avg / total       0.95      0.97      0.95     40000

Predicted    0.0  1.0    All
Actual                      
0.0        38509   76  38585
1.0         1323   92   1415
All        39832  168  40000


In [29]:
ls = X.columns.values.tolist()
coefs_food = pd.DataFrame({'variable':ls,'coef':logreg.coef_[0], 'abs_coef':np.abs(logreg.coef_[0])})
coefs_food.sort_values('abs_coef', ascending=False, inplace=True)
coefs_food.rename(columns = {'coef': 'food_coef', 'abs_coef':'food_abs_coef' },inplace=True)
print coefs_food.columns.values

['food_abs_coef' 'food_coef' 'variable']


In [30]:
coefs_food.head()

Unnamed: 0,food_abs_coef,food_coef,variable
47,2.001932,2.001932,great_service
45,1.724599,1.724599,great_food
43,1.444005,1.444005,great_atmosphere
11,1.126614,-1.126614,bottle_service
14,1.110019,-1.110019,dance_floor


In [31]:
# patsy, ravel and then cross validate for the first rating: uber_service

y,X = patsy.dmatrices(formula_rv_service, data=reviews_sample, return_type='dataframe')
y = np.ravel(y)

cv_indices = StratifiedKFold(y, n_folds=10)

logreg = LogisticRegression()
lr_scores = []

#Train / Test loop
for traini, testi in cv_indices:
    Xtr, ytr  = X.iloc[traini, :], y[traini]
    Xte, yte =  X.iloc[testi, :], y[testi]
    logreg.fit(Xtr, ytr)
    lr_scores.append(logreg.score(Xte, yte))   

print 'Logistic Regression Service:'
print "Scores: ", lr_scores
print "Scores mean: ", np.mean(lr_scores)
print 'Baseline accuracy:', np.mean(y)


Logistic Regression Service:
Scores:  [0.96575, 0.96599999999999997, 0.96575, 0.96550000000000002, 0.96499999999999997, 0.96599999999999997, 0.96625000000000005, 0.96550000000000002, 0.96575, 0.96525000000000005]
Scores mean:  0.965675
Baseline accuracy: 0.034


In [32]:
y_pred = logreg.predict(X)

print "Classification Report"
cls_rep = classification_report(y, y_pred)
print cls_rep
confusion = pd.crosstab(y, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
print confusion

Classification Report
             precision    recall  f1-score   support

        0.0       0.97      1.00      0.98     38640
        1.0       0.40      0.01      0.03      1360

avg / total       0.95      0.97      0.95     40000

Predicted    0.0  1.0    All
Actual                      
0.0        38613   27  38640
1.0         1342   18   1360
All        39955   45  40000


In [33]:
coefs_service = pd.DataFrame({'variable':ls,'coef':logreg.coef_[0], 'abs_coef':np.abs(logreg.coef_[0])})
coefs_service.sort_values('abs_coef', ascending=False, inplace=True)
coefs_service.rename(columns={'coef': "service_coef", 'abs_coef':'service_abs_coef' },inplace=True)



In [34]:
#Combine the coefficients, check out the difference

result = pd.merge(coefs_food, coefs_service, on='variable', how='inner')
result['coefficient_difference'] = result['food_coef'] - result['service_coef']
result

Unnamed: 0,food_abs_coef,food_coef,variable,service_abs_coef,service_coef,coefficient_difference
0,2.001932,2.001932,great_service,1.522668,1.522668,0.479264
1,1.724599,1.724599,great_food,1.484495,1.484495,0.240104
2,1.444005,1.444005,great_atmosphere,0.808145,0.808145,0.635859
3,1.126614,-1.126614,bottle_service,0.263263,0.263263,-1.389877
4,1.110019,-1.110019,dance_floor,0.379202,-0.379202,-0.730817
5,1.097004,1.097004,good_food,0.342609,-0.342609,1.439613
6,0.983395,-0.983395,pretty_good,0.066187,-0.066187,-0.917209
7,0.925114,-0.925114,like_place,0.142190,-0.142190,-0.782923
8,0.917924,0.917924,quality_food,0.208229,0.208229,0.709695
9,0.802525,-0.802525,hip_hop,0.753467,-0.753467,-0.049058


In [35]:
# difference in coefficients:  Models appear to be the same.  Likely a bad design of creating the ratings

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 3. Identifying "elite" users

---

Yelp, though having their own formula for determining whether a user is elite or not, is interested in delving deeper into what differentiates an elite user from a normal user at a broader level.

Use a classification model to predict whether a user is elite or not. Note that users can be elite in some years and not in others.

1. What things predict well whether a user is elite or not?
- Validate the model.
- If you were to remove the "counts" metrics for users (reviews, votes, compliments), what distinguishes an elite user, if anything? Validate the model and compare it to the one with the count variables.
- Think of a way to visually represent your results in a compelling way.
- Give a brief write-up of your findings.


In [36]:
# Approach here will be to set up a new field for Elite and use 
#KNN classification algorithm to find a fit

pathux = '/Users/paulmartin/Desktop/DSI-SF-2-GitPaulM/datasets/yelp_arizona_data/users_small_parsed.csv'
ux = pd.read_csv(pathux)
print ux.shape
ux.loc[0,'elite']
ux.info()

(144206, 21)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144206 entries, 0 to 144205
Data columns (total 21 columns):
yelping_since          144206 non-null object
compliments.plain      47034 non-null float64
review_count           144206 non-null int64
compliments.cute       13133 non-null float64
compliments.writer     33222 non-null float64
fans                   144206 non-null int64
compliments.note       39872 non-null float64
compliments.hot        31748 non-null float64
compliments.cool       41069 non-null float64
compliments.profile    12368 non-null float64
average_stars          144206 non-null float64
compliments.more       25066 non-null float64
elite                  144206 non-null object
name                   144206 non-null object
user_id                144206 non-null object
votes.cool             144206 non-null int64
compliments.list       7180 non-null float64
votes.funny            144206 non-null int64
compliments.photos     18759 non-null float64
compli

In [37]:
ux.loc[0,'elite']

'[2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015]'

In [38]:
def intinlist(x):
#    print x
    z=[]
    try:
        z = (x.replace("]","").replace("[","").split(","))
#        print z
        z = [int(i) for i in z]
#        print z
        return z
    except:
        z = []
        return z
    
ystr = '[2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015]'
ystr = '[]'

#y = intinlist(ystr)
#y


In [39]:
#just in case
ux['elite'] = ux['elite'].map(intinlist)

#results = [int(i) for i in results]

In [40]:
ux.head()

Unnamed: 0,yelping_since,compliments.plain,review_count,compliments.cute,compliments.writer,fans,compliments.note,compliments.hot,compliments.cool,compliments.profile,...,compliments.more,elite,name,user_id,votes.cool,compliments.list,votes.funny,compliments.photos,compliments.funny,votes.useful
0,2004-10,959.0,1274,206.0,327.0,1179,611.0,1094.0,1642.0,116.0,...,134.0,"[2005, 2006, 2007, 2008, 2009, 2010, 2011, 201...",Jeremy,rpOyqD_893cqmDAtJLbdog,11093,38.0,7681,330.0,580.0,14199
1,2004-10,89.0,442,23.0,24.0,100,83.0,101.0,145.0,9.0,...,19.0,"[2005, 2006, 2007, 2008, 2009, 2010, 2011, 201...",Michael,4U9kSBLuBDU391x6bxU-YA,732,4.0,908,24.0,120.0,1483
2,2004-10,2.0,66,2.0,2.0,4,1.0,1.0,1.0,,...,1.0,[2005],Katherine,SIBCL7HBkrP4llolm4SC2A,13,,11,,,34
3,2004-10,5.0,101,1.0,3.0,7,3.0,5.0,4.0,1.0,...,2.0,[],Nader,UTS9XcT14H2ZscRIf0MYHQ,49,,53,1.0,8.0,243
4,2004-10,104.0,983,82.0,17.0,78,85.0,265.0,212.0,9.0,...,16.0,"[2005, 2006, 2007, 2008, 2010, 2011, 2012]",Helen,ZWOj6LmzwGvMDh-A85EOtA,1928,3.0,1109,57.0,70.0,2404


In [41]:
ux.loc[3,'elite']


[]

In [42]:
elite_level = 1
ux['elite_flag'] = ux['elite'].map(lambda x: 1 if len(x) >= elite_level else 0)
ux['elite_years'] = ux['elite'].map(lambda x: len(x))
ux.rename(columns=lambda x: x.replace(' ', '_').replace('.', '_'),inplace=True)


In [43]:
'[2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015]'

#zero fill the nulls
ux.fillna(0, inplace=True)
ux.describe().T


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
compliments_plain,144206.0,11.16758,145.287346,0.0,0.0,0.0,1.0,13129.0
review_count,144206.0,54.968649,138.452373,0.0,5.0,13.0,42.0,8529.0
compliments_cute,144206.0,0.790577,12.888196,0.0,0.0,0.0,0.0,1701.0
compliments_writer,144206.0,4.260211,49.650805,0.0,0.0,0.0,0.0,5178.0
fans,144206.0,2.958275,17.657132,0.0,0.0,0.0,1.0,1363.0
compliments_note,144206.0,4.646818,48.503054,0.0,0.0,0.0,1.0,5121.0
compliments_hot,144206.0,8.402972,103.066822,0.0,0.0,0.0,0.0,10658.0
compliments_cool,144206.0,11.412639,128.795241,0.0,0.0,0.0,1.0,12148.0
compliments_profile,144206.0,0.770079,23.244386,0.0,0.0,0.0,0.0,5178.0
average_stars,144206.0,3.768235,0.824896,0.0,3.4,3.84,4.26,5.0


In [44]:
#patsy prep

form_ux = []
[form_ux.append(x) for x in ux.columns if x not in ['elite','name','user_id','yelping_since']]
form =  ' + '.join(form_ux) + ' - 1'
formula = 'elite_flag ~ ' + form


#y,X = patsy.dmatrices(formula, data=ux, return_type='dataframe')
##patsy was too slow so plan b

y = ux['elite_flag']
X = ux
X.drop(['elite', 'name','user_id','yelping_since','elite_flag'], axis=1, inplace=True)
print y.shape
print X.shape

(144206,)
(144206, 18)


In [45]:
x = reviews.columns.values

In [46]:
x

array(['user_id', 'review_id', 'votes_cool', 'business_id', 'votes_funny',
       'stars', 'date', 'votes_useful', 'ten_minutes', 'fifteen_minutes',
       'twenty_minutes', 'thirty_minutes', 'bar_food', 'beer_selection',
       'best_ve', 'bloody_mary', 'bottle_service', 'chicken_waffles',
       'customer_service', 'dance_floor', 'decided_try', 'definitely_come',
       'definitely_recommend', 'didn_want', 'don_know', 'don_like',
       'don_think', 'don_want', 'eggs_benedict', 'fast_food', 'feel_like',
       'felt_like', 'fish_chips', 'food_amazing', 'food_came',
       'food_delicious', 'food_good', 'food_great', 'food_just',
       'food_service', 'french_fries', 'french_toast', 'friday_night',
       'fried_chicken', 'friendly_staff', 'good_food', 'good_place',
       'good_service', 'good_thing', 'good_time', 'great_atmosphere',
       'great_experience', 'great_food', 'great_place', 'great_service',
       'great_time', 'happy_hour', 'hash_browns', 'highly_recommend',
       '

In [47]:
#OK tried to run the below and ran into some performance issues.  So I"m going to use the test /train split
#to get a small sample.

# X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,t)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=.75,stratify = y)

In [48]:

# Let's do Knn with gridsearch and see if we can identify a good model.
# Set up the grid parameters

params = {
    'n_neighbors':range(1,25),
    'weights':['uniform','distance']
    }

# fetch the model, assign the parameters and fit the model

knn = KNeighborsClassifier()
knn_gs = GridSearchCV(knn, params, cv=5, verbose=1)
knn_gs.fit(Xtrain, ytrain)

# print out the best model
print knn_gs.best_params_
best_knn = knn_gs.best_estimator_
print best_knn

# Calculate the predictions for the confusion matrix / residuals

y_pred = knn_gs.predict(Xtrain)



Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:   23.3s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:  1.9min
[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:  2.3min finished


{'n_neighbors': 24, 'weights': 'distance'}
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=24, p=2,
           weights='distance')


In [53]:
print "Classification Report /n"
from sklearn.metrics import classification_report
cls_rep = classification_report(ytrain, y_pred)
print cls_rep

print "Confusion Matrix"
confusion = pd.crosstab(ytrain, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
print confusion

Classification Report /n
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     31895
          1       1.00      1.00      1.00      4156

avg / total       1.00      1.00      1.00     36051

Confusion Matrix
Predicted      0     1    All
Actual                       
0          31895     0  31895
1              0  4156   4156
All        31895  4156  36051


In [50]:
# OK let's do some logistic regression as an alternative  Results are too good.
# Let's start with our X and Y "full" and see if we can process this data.  X will be normalized
# for regularization


X_small, X_ignore, y_small, y_ignore = train_test_split(X, y, test_size=.70,stratify = y)

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

#Regularize means standardize

Xn = ss.fit_transform(X_small)

#Xn and y small is our data (reduced to help with processing speed)

lr_params = {
    'penalty':['l1','l2'],
    'solver':['liblinear'],
    'C':np.linspace(0.0001, 1000, 50)
}

lr_gs = GridSearchCV(LogisticRegression(), lr_params, cv=5, verbose=1)
lr_gs.fit(Xn, y_small)
print lr_gs.best_params_
best_lr = lr_gs.best_estimator_


Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    8.8s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:  2.2min
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:  3.3min
[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:  3.5min finished


{'penalty': 'l1', 'C': 20.408261224489795, 'solver': 'liblinear'}


In [54]:
#Move forward with the best model from the grid search

cv_indices = StratifiedKFold(y_small, n_folds=5)

lr_scores = []

for train_inds, test_inds in cv_indices:
    
    Xtr, ytr = Xn[train_inds, :], y[train_inds]
    Xte, yte = Xn[test_inds, :], y[test_inds]
    
    best_lr.fit(Xtr, ytr)
    lr_scores.append(best_lr.score(Xte, yte))


print 'Logistic Regression:'
print lr_scores
print np.mean(lr_scores)

print 'Baseline accuracy:', np.mean(y)


Logistic Regression:
[0.85473246272968917, 0.8741476944412343, 0.8637309292649098, 0.86985668053629217, 0.8725002889839325]
0.866993611191
Baseline accuracy: 0.115272596147


In [59]:
ytrain.shape


(36051,)

In [60]:
print "Classification Report /n"
from sklearn.metrics import classification_report
cls_rep = classification_report(ytrain, y_pred)
print cls_rep


# confusion matrix.  Note there are a couple  of methods.  Cross tab seems more flexible, readable
pd.crosstab(ytrain, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)



Classification Report /n
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     31895
          1       1.00      1.00      1.00      4156

avg / total       1.00      1.00      1.00     36051



Predicted,0,1,All
True,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,31895,0,31895
1,0,4156,4156
All,31895,4156,36051


In [61]:
z = pd.DataFrame(Xn)
ls = X.columns.values.tolist()
coefs = pd.DataFrame({'variable':ls,'coef':best_lr.coef_[0], 'abs_coef':np.abs(best_lr.coef_[0])})
coefs.sort_values('abs_coef', ascending=False, inplace=True)
print coefs

    abs_coef      coef             variable
11  0.231719  0.231719           votes_cool
16  0.194693 -0.194693         votes_useful
14  0.120289 -0.120289   compliments_photos
5   0.101162  0.101162     compliments_note
8   0.092708  0.092708  compliments_profile
12  0.071595 -0.071595     compliments_list
13  0.065177 -0.065177          votes_funny
3   0.050000 -0.050000   compliments_writer
1   0.043365  0.043365         review_count
7   0.038341 -0.038341     compliments_cool
10  0.032436 -0.032436     compliments_more
6   0.025411 -0.025411      compliments_hot
15  0.017244 -0.017244    compliments_funny
2   0.016552  0.016552     compliments_cute
4   0.010049 -0.010049                 fans
17  0.008527  0.008527          elite_years
0   0.007979 -0.007979    compliments_plain
9   0.001830  0.001830        average_stars


In [62]:
# Let's generate a new design Matrix
X_new = X[['elite_years','fans','average_stars','review_count']]

In [63]:
#ignoring compliments and votes let's redo the logistic regression
# Let's start with our X and Y "full" and see if we can process this data.  X will be normalized
# for regularization


X_small, X_ignore, y_small, y_ignore = train_test_split(X_new, y, test_size=.30,stratify = y)

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

Xn = ss.fit_transform(X_small)

#Xn and y small is our data (reduced to help with processing speed)

lr_params = {
    'penalty':['l1','l2'],
    'solver':['liblinear'],
    'C':np.linspace(0.0001, 1000, 50)
}

lr_gs = GridSearchCV(LogisticRegression(), lr_params, cv=5, verbose=1)
lr_gs.fit(Xn, y_small)
print lr_gs.best_params_
best_lr = lr_gs.best_estimator_


Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    8.0s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:   30.3s
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:  1.1min


{'penalty': 'l1', 'C': 20.408261224489795, 'solver': 'liblinear'}


[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:  1.2min finished


In [64]:
cv_indices = StratifiedKFold(y_small, n_folds=5)

lr_scores = []

for train_inds, test_inds in cv_indices:
    
    Xtr, ytr = Xn[train_inds, :], y[train_inds]
    Xte, yte = Xn[test_inds, :], y[test_inds]
    
    best_lr.fit(Xtr, ytr)
    lr_scores.append(best_lr.score(Xte, yte))


print 'Logistic Regression:'
print lr_scores
print np.mean(lr_scores)

print 'Baseline accuracy:', np.mean(y)


Logistic Regression:
[0.86389301634472515, 0.86928525434642623, 0.88340185249393233, 0.88161283931048151, 0.8947394491777293]
0.878586482335
Baseline accuracy: 0.115272596147


In [66]:
y_pred = best_lr.predict(Xn)
print type(y_pred)

print "Classification Report"
from sklearn.metrics import classification_report
cls_rep = classification_report(y_small, y_pred)
print cls_rep

print "Confusion Matrix"
confusion = pd.crosstab(y_small, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
print confusion


<type 'numpy.ndarray'>
Classification Report
             precision    recall  f1-score   support

          0       0.88      1.00      0.94     89308
          1       0.00      0.00      0.00     11636

avg / total       0.78      0.88      0.83    100944

Confusion Matrix
Predicted       0     All
Actual                   
0           89308   89308
1           11636   11636
All        100944  100944


  'precision', 'predicted', average, warn_for)


In [67]:
#more reasonable, less overfit
ls = X_new.columns.values.tolist()
coefs = pd.DataFrame({'variable':ls,'coef':best_lr.coef_[0], 'abs_coef':np.abs(best_lr.coef_[0])})
coefs.sort_values('abs_coef', ascending=False, inplace=True)
print coefs

   abs_coef      coef       variable
0  0.017549  0.017549    elite_years
3  0.008161 -0.008161   review_count
1  0.006184  0.006184           fans
2  0.004196  0.004196  average_stars


In [None]:

newX = X[[ls]]
corrs = newX.corr()

# Set the default matplotlib figure size:
fig, ax = plt.subplots(figsize=(8,8))

# Generate a mask for the upper triangle (taken from seaborn example gallery)
mask = np.zeros_like(corrs, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Plot the heatmap with seaborn.
# Assign the matplotlib axis the function returns. This will let us resize the labels.
ax = sns.heatmap(corrs, mask=mask)

# Resize the labels.
ax.set_xticklabels(ax.xaxis.get_ticklabels(), fontsize=14, rotation=30)
ax.set_yticklabels(ax.yaxis.get_ticklabels(), fontsize=14, rotation=0)

# If you put plt.show() at the bottom, it prevents those useless printouts from matplotlib.
plt.show()

<img src="http://imgur.com/GCAf1UX.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 4. Find something interesting on your own

---

You want to impress your superiors at Yelp by doing some investigation into the data on your own. You want to do classification, but you're not sure on what.

1. Create a hypothesis or hypotheses about the data based on whatever you are interested in, as long as it is predicting a category of some kind (classification).
2. Explore the data visually (ideally related to this hypothesis).
3. Build one or more classification models to predict your target variable. **Your modeling should include gridsearching to find optimal model parameters.**
4. Evaluate the performance of your model. Explain why your model may have chosen those specific parameters during the gridsearch process.
5. Write up what the model tells you. Does it validate or invalidate your hypothesis? Write this up as if for a non-technical audience.

<img src="http://imgur.com/GCAf1UX.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 5. ROC and Precision-recall

---

Some categories have fewer overall businesses than others. Choose two categories of businesses to predict, one that makes your proportion of target classes as even as possible, and another that has very few businesses and thus makes the target varible imbalanced.

1. Create two classification models predicting these categories. Optimize the models and choose variables as you see fit.
- Make confusion matrices for your models. Describe the confusion matrices and explain what they tell you about your models' performance.
- Make ROC curves for both models. What do the ROC curves describe and what do they tell you about your model?
- Make Precision-Recall curves for the models. What do they describe? How do they compare to the ROC curves?
- Explain when Precision-Recall may be preferable to ROC. Is that the case in either of your models?