# Data Import, Cleaning, and Preparation

This module is used to query the postgreSQL database in order to retrieve the Yelp and Violations dataset. There will be several steps to do this:
1. Import flattened violations dataset. We will use a SQL query to perform aggregation of violation data by restaurant and inspeciton date. 
2. Import Yelp business data and join to inspection data    
4. Join the Yelp business and Inspection Data with the Yelp Review Data
    + Reviews for a given establishment will be aggregated so that reviews *after* the previous inspection (or the earliest review date) and *before* the date of a given inspection are in one batch. 
    +  Aggregate any review "count" features using this same logic
    + Combine the review documents for a restaurant into a CLOB using the same logic
    
### TO DO:
1. Determine how we want to do Levenshtein matching to combine the datasets (see the *Join Yelp Review Data with Inspection Dataset* below) 
2. Create additional engineered features
3. n-gram extraction
4. vectorization of n-grams
5. model selection
    + training
    + validaiton
    + evaluation
    +repeat with additional or removed features and data segmentation (if aggregating the review text does not yield successful results)

## Import and Clean Data

In [34]:
import psycopg2 as psy
import pandas as pd
import re
import numpy as np

In [35]:
#set up connection to our DB
conn = psy.connect(database="sterndsyelp", 
                        user="mvsternds", 
                        password="nyustern123!", 
                        host="sterndsyelp.cawzspvmqd5q.us-east-1.rds.amazonaws.com", 
                        port="5432"
                       )
#open cursor and check our tables in the DB
cur = conn.cursor()

In [36]:
#cur.execute("SELECT * FROM public.restaurants LIMIT 50 ")
#biz = pd.DataFrame(cur.fetchall())

cur.execute("SELECT * FROM public.toronto_checkins")
checkins = pd.DataFrame(cur.fetchall())

cur.execute("SELECT * FROM public.toronto_reviews")
reviews = pd.DataFrame(cur.fetchall())

**NOTE: ONLY LIMITING to 50 rows during build phase to limit processing time. **

In [37]:
reviews.columns = ['bizID','reviewID','userID','type','stars','text','useful','funny','cool','date']
#get total reviews per biz
rev = reviews['bizID'].value_counts()
rev_counts = pd.DataFrame(rev).reset_index()
rev_counts.columns = ['bizID','all_review_count']

In [38]:
#not using this - can delete

checkins.columns = ['bizID','type','datetime']
#get total checkins per biz
chks  = checkins['bizID'].value_counts()
chk_counts = pd.DataFrame(chks).reset_index()
chk_counts.columns = ['bizID','checkin_counts']

### Join Yelp Review Data with Inspection Dataset

We have a few options here. While it is optimal to have as much done in Python as possible, the matching process in python is impractically slow. We can 1) use a manual implementation of Levenshtein Distance (LD), 2) use a package with optimized LD code, or 3) do the joining of the inspeciton data and review data in our database. OPtions 1& 2 are shown below, and code to retrieve the results of option 3 are at the bottom of this section.

If we decide to go with option 3, the last steps are to combine all records where the business ID, last inspection date, and inspection date are equal in order to get to one observation per restaurant-inspection combination. We can then add in any other engineered features.

**note: next cell should return matches once we include more than the 50 rows (fingers crossed)**

#### Levenshtein Option #3 (in-database) 
This option joins the yelp restaurant informaiton to each inspection record where:
 * The Levenshetein distance of the restaurant name from the two datasets is <3
 * The distance of the address from each dataset is <4
 * The date of the review is greater than the prior inspection date
 * The date of the review is less than or equal to inspeciton date on the record
 
Whitespace at the beginning and end of the name and address in each dataset is trimmed, and the strings are converted to uppercase before matching. The mathcing thresholds can be adjusted to increase potential for matching, or decrease false matches.

In [39]:
# The materialized view of the restaurant, inspection, and review data is "toronto_all"
cur.execute("SELECT * FROM public.toronto_all where review_date is not null" )
obs = pd.DataFrame(cur.fetchall())
obs.head()
obs.columns=['bizID','name','address','postal_code','neighborhood','lat','long','categories','attributes','is_open','review_cnt','hours','stars','setablishment_id','establishment_name','establishment_address','inspection_date','last_inspection','count_minor','count_sig','count_crucial','count_na','count_crucial_signficant','review_id','user_id','review_stars','review_text','useful','funny','cool','review_dt']
obs.head()

Unnamed: 0,bizID,name,address,postal_code,neighborhood,lat,long,categories,attributes,is_open,review_cnt,hours,stars,setablishment_id,establishment_name,establishment_address,inspection_date,last_inspection,count_minor,count_sig,count_crucial,count_na,count_crucial_signficant,review_id,user_id,review_stars,review_text,useful,funny,cool,review_dt
0,01l8MH9tBK6GPvJQdMU1gw,Kaiseki Yu-zen Hashimoto,6 Garamond Court,M3C 1Z5,,43.7264546,-79.3349744,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,14,,4.5,10355507,KAISEKI YU-ZEN HASHIMOTO,6 GARAMOND CRT,2016-03-22,2015-11-03,0,0,0,0,0,OriPsMQx1cRyE9hgNnhQFg,m7z-tX6XDZ27xGhGjnI21w,5,"""I am confident when I say this place is hands...",4,0,4,2016-02-15
1,01l8MH9tBK6GPvJQdMU1gw,Kaiseki Yu-zen Hashimoto,6 Garamond Court,M3C 1Z5,,43.7264546,-79.3349744,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,14,,4.5,10355507,KAISEKI YU-ZEN HASHIMOTO,6 GARAMOND CRT,2016-03-22,2015-11-03,0,0,0,0,0,1WON5dUarKdWJ6-yq7Ne6g,UrfdzamoBt0WW9Ifqy7RIw,5,"The short version of my review is this: ""Hashi...",3,0,1,2015-11-18
2,01l8MH9tBK6GPvJQdMU1gw,Kaiseki Yu-zen Hashimoto,6 Garamond Court,M3C 1Z5,,43.7264546,-79.3349744,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,14,,4.5,10355507,KAISEKI YU-ZEN HASHIMOTO,6 GARAMOND CRT,2016-03-22,2015-11-03,0,0,0,0,0,TNpm4Hs6x2yJvYbUZSeA4w,5KbUkX5DHGtDSmqdG5LLhw,5,"""It's a great cultural experience with great f...",0,0,0,2016-03-14
3,01l8MH9tBK6GPvJQdMU1gw,Kaiseki Yu-zen Hashimoto,6 Garamond Court,M3C 1Z5,,43.7264546,-79.3349744,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,14,,4.5,10355507,KAISEKI YU-ZEN HASHIMOTO,6 GARAMOND CRT,2016-03-22,2015-11-03,0,0,0,0,0,0pS7e898Z5AywcZQ9_DX1Q,xspGyCnzmZgsP-VUnQ3K4A,5,"""This was definitely where quality of food and...",1,0,1,2016-03-15
4,01l8MH9tBK6GPvJQdMU1gw,Kaiseki Yu-zen Hashimoto,6 Garamond Court,M3C 1Z5,,43.7264546,-79.3349744,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,14,,4.5,10355507,KAISEKI YU-ZEN HASHIMOTO,6 GARAMOND CRT,2016-11-17,2016-03-22,0,0,0,0,0,ePOLGcC5yJbD_DzOLfyKbA,_dRcIdWjks0phgfiY27AFQ,5,My sister and I enjoyed the lunchtime Kaiseki ...,0,0,0,2016-05-01


In [40]:
obs['bizID-dt'] = obs['bizID'] + "-" + obs['inspection_date'].map(str)

In [41]:
in_scope_rev = obs['bizID-dt'].value_counts()
in_scope_reviews = pd.DataFrame(in_scope_rev).reset_index()
in_scope_reviews.columns = ['bizID-dt','count_reviews_in_scope']
in_scope_reviews.head()

Unnamed: 0,bizID-dt,count_reviews_in_scope
0,fGurvC5BdOfd5MIuLUQYVA-2016-12-19,107
1,73_UT7fZ7mzXcguX8-oSuQ-2016-10-26,87
2,-J6FVdY9pSgAdFmmalO-pQ-2016-11-24,70
3,kOFDVcnj-8fd3doIpCQ06A-2016-04-26,64
4,s7Pj1mNYqRTGNOXLOiBafw-2016-12-01,62


In [42]:
#get dummies for star rating column
obs = pd.concat([obs, pd.get_dummies(obs['review_stars'], prefix='stars')], axis=1)
obs.head()

Unnamed: 0,bizID,name,address,postal_code,neighborhood,lat,long,categories,attributes,is_open,review_cnt,hours,stars,setablishment_id,establishment_name,establishment_address,inspection_date,last_inspection,count_minor,count_sig,count_crucial,count_na,count_crucial_signficant,review_id,user_id,review_stars,review_text,useful,funny,cool,review_dt,bizID-dt,stars_1,stars_2,stars_3,stars_4,stars_5
0,01l8MH9tBK6GPvJQdMU1gw,Kaiseki Yu-zen Hashimoto,6 Garamond Court,M3C 1Z5,,43.7264546,-79.3349744,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,14,,4.5,10355507,KAISEKI YU-ZEN HASHIMOTO,6 GARAMOND CRT,2016-03-22,2015-11-03,0,0,0,0,0,OriPsMQx1cRyE9hgNnhQFg,m7z-tX6XDZ27xGhGjnI21w,5,"""I am confident when I say this place is hands...",4,0,4,2016-02-15,01l8MH9tBK6GPvJQdMU1gw-2016-03-22,0,0,0,0,1
1,01l8MH9tBK6GPvJQdMU1gw,Kaiseki Yu-zen Hashimoto,6 Garamond Court,M3C 1Z5,,43.7264546,-79.3349744,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,14,,4.5,10355507,KAISEKI YU-ZEN HASHIMOTO,6 GARAMOND CRT,2016-03-22,2015-11-03,0,0,0,0,0,1WON5dUarKdWJ6-yq7Ne6g,UrfdzamoBt0WW9Ifqy7RIw,5,"The short version of my review is this: ""Hashi...",3,0,1,2015-11-18,01l8MH9tBK6GPvJQdMU1gw-2016-03-22,0,0,0,0,1
2,01l8MH9tBK6GPvJQdMU1gw,Kaiseki Yu-zen Hashimoto,6 Garamond Court,M3C 1Z5,,43.7264546,-79.3349744,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,14,,4.5,10355507,KAISEKI YU-ZEN HASHIMOTO,6 GARAMOND CRT,2016-03-22,2015-11-03,0,0,0,0,0,TNpm4Hs6x2yJvYbUZSeA4w,5KbUkX5DHGtDSmqdG5LLhw,5,"""It's a great cultural experience with great f...",0,0,0,2016-03-14,01l8MH9tBK6GPvJQdMU1gw-2016-03-22,0,0,0,0,1
3,01l8MH9tBK6GPvJQdMU1gw,Kaiseki Yu-zen Hashimoto,6 Garamond Court,M3C 1Z5,,43.7264546,-79.3349744,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,14,,4.5,10355507,KAISEKI YU-ZEN HASHIMOTO,6 GARAMOND CRT,2016-03-22,2015-11-03,0,0,0,0,0,0pS7e898Z5AywcZQ9_DX1Q,xspGyCnzmZgsP-VUnQ3K4A,5,"""This was definitely where quality of food and...",1,0,1,2016-03-15,01l8MH9tBK6GPvJQdMU1gw-2016-03-22,0,0,0,0,1
4,01l8MH9tBK6GPvJQdMU1gw,Kaiseki Yu-zen Hashimoto,6 Garamond Court,M3C 1Z5,,43.7264546,-79.3349744,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,14,,4.5,10355507,KAISEKI YU-ZEN HASHIMOTO,6 GARAMOND CRT,2016-11-17,2016-03-22,0,0,0,0,0,ePOLGcC5yJbD_DzOLfyKbA,_dRcIdWjks0phgfiY27AFQ,5,My sister and I enjoyed the lunchtime Kaiseki ...,0,0,0,2016-05-01,01l8MH9tBK6GPvJQdMU1gw-2016-11-17,0,0,0,0,1


In [43]:
stars = obs.groupby('bizID-dt')[['stars_1', 'stars_2','stars_3','stars_4','stars_5']].sum().reset_index()
stars.head()

Unnamed: 0,bizID-dt,stars_1,stars_2,stars_3,stars_4,stars_5
0,-2TBP3ZGu7M-FmfoNJvbrQ-2016-09-07,0,1,0,2,0
1,-2TBP3ZGu7M-FmfoNJvbrQ-2017-01-18,1,0,1,3,1
2,-6mzdR0YjOToJ8E04Y9O0Q-2015-11-27,0,0,1,0,0
3,-7BCZH437U5FjmNJ26llkg-2016-01-14,0,0,1,2,3
4,-7BCZH437U5FjmNJ26llkg-2016-08-10,0,0,0,3,5


In [44]:
combined_revs = obs.groupby('bizID-dt')['review_text'].apply(' '.join).reset_index()
combined_revs.head()

Unnamed: 0,bizID-dt,review_text
0,-2TBP3ZGu7M-FmfoNJvbrQ-2016-09-07,"""I loveee bacon, and I gotta say, Rashers did ..."
1,-2TBP3ZGu7M-FmfoNJvbrQ-2017-01-18,Perfectly cooked bacon sandwiches with a nice ...
2,-6mzdR0YjOToJ8E04Y9O0Q-2015-11-27,"""Place was okay, came here because we couldn't..."
3,-7BCZH437U5FjmNJ26llkg-2016-01-14,I loved their Lahmacun!!! It was spot on and I...
4,-7BCZH437U5FjmNJ26llkg-2016-08-10,"""Didn't know what to expect from Turkish (pizz..."


In [45]:
users = obs.groupby('bizID-dt')['user_id'].count().reset_index()
users.columns = ['bizID-dt','count_unique_users']
users.head()

Unnamed: 0,bizID-dt,count_unique_users
0,-2TBP3ZGu7M-FmfoNJvbrQ-2016-09-07,3
1,-2TBP3ZGu7M-FmfoNJvbrQ-2017-01-18,6
2,-6mzdR0YjOToJ8E04Y9O0Q-2015-11-27,1
3,-7BCZH437U5FjmNJ26llkg-2016-01-14,6
4,-7BCZH437U5FjmNJ26llkg-2016-08-10,8


In [46]:
sub = obs[['bizID-dt','bizID','name','postal_code','lat','long','categories','attributes','is_open','count_crucial_signficant']]
sub = sub.drop_duplicates()
sub.head()

Unnamed: 0,bizID-dt,bizID,name,postal_code,lat,long,categories,attributes,is_open,count_crucial_signficant
0,01l8MH9tBK6GPvJQdMU1gw-2016-03-22,01l8MH9tBK6GPvJQdMU1gw,Kaiseki Yu-zen Hashimoto,M3C 1Z5,43.7264546,-79.3349744,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,0
4,01l8MH9tBK6GPvJQdMU1gw-2016-11-17,01l8MH9tBK6GPvJQdMU1gw,Kaiseki Yu-zen Hashimoto,M3C 1Z5,43.7264546,-79.3349744,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,0
5,01nRNgH_ukm8E2td9TTZDA-2016-05-09,01nRNgH_ukm8E2td9TTZDA,Baroli Cafe,M5B 2H1,43.6536106,-79.3800603,"['Coffee & Tea', 'Food']","['BikeParking: False', 'BusinessAcceptsCreditC...",1,3
6,01nRNgH_ukm8E2td9TTZDA-2016-10-21,01nRNgH_ukm8E2td9TTZDA,Baroli Cafe,M5B 2H1,43.6536106,-79.3800603,"['Coffee & Tea', 'Food']","['BikeParking: False', 'BusinessAcceptsCreditC...",1,0
8,02BXFKzu1rgaYulNGYvi6g-2015-10-08,02BXFKzu1rgaYulNGYvi6g,Matsuda Japanese Cuisine,M1W 2B4,43.831795518,-79.2663108185,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,0


In [55]:
#merge in all data into one df
df1 = pd.merge(sub,stars,on='bizID-dt', how='left')
df2 = pd.merge(df1,combined_revs,on='bizID-dt', how='left')
df3 = pd.merge(df2,rev_counts,on='bizID', how='left')
df4 = pd.merge(df3,in_scope_reviews,on='bizID-dt', how='left')
df5 = pd.merge(df4,users,on='bizID-dt', how='left')

#make sure each bizID-dt is only appearing once in the data
print('Max of number of unique bizID-dt in df (should be 1):',max(df5['bizID-dt'].value_counts()))
df5.head()

Max of number of unique bizID-dt in df (should be 1): 2


Unnamed: 0,bizID-dt,bizID,name,postal_code,lat,long,categories,attributes,is_open,count_crucial_signficant,stars_1,stars_2,stars_3,stars_4,stars_5,review_text,all_review_count,count_reviews_in_scope,count_unique_users
0,01l8MH9tBK6GPvJQdMU1gw-2016-03-22,01l8MH9tBK6GPvJQdMU1gw,Kaiseki Yu-zen Hashimoto,M3C 1Z5,43.7264546,-79.3349744,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,0,0,0,0,0,4,"""I am confident when I say this place is hands...",14,4,4
1,01l8MH9tBK6GPvJQdMU1gw-2016-11-17,01l8MH9tBK6GPvJQdMU1gw,Kaiseki Yu-zen Hashimoto,M3C 1Z5,43.7264546,-79.3349744,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,0,0,0,0,0,1,My sister and I enjoyed the lunchtime Kaiseki ...,14,1,1
2,01nRNgH_ukm8E2td9TTZDA-2016-05-09,01nRNgH_ukm8E2td9TTZDA,Baroli Cafe,M5B 2H1,43.6536106,-79.3800603,"['Coffee & Tea', 'Food']","['BikeParking: False', 'BusinessAcceptsCreditC...",1,3,1,0,0,0,0,"""The worst treatment in my entire life.\n\nThe...",4,1,1
3,01nRNgH_ukm8E2td9TTZDA-2016-10-21,01nRNgH_ukm8E2td9TTZDA,Baroli Cafe,M5B 2H1,43.6536106,-79.3800603,"['Coffee & Tea', 'Food']","['BikeParking: False', 'BusinessAcceptsCreditC...",1,0,0,1,0,0,1,"""Stopped by while shopping in the eaton centre...",4,2,2
4,02BXFKzu1rgaYulNGYvi6g-2015-10-08,02BXFKzu1rgaYulNGYvi6g,Matsuda Japanese Cuisine,M1W 2B4,43.831795518,-79.2663108185,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,0,2,2,7,5,4,"""I live 5 minutes away from this restaurant an...",178,20,20


In [48]:
import ast

In [56]:
t = []
for i in range(len(df5['categories'])):
    x = ast.literal_eval(df5['categories'][i])
    t.append(x)
    
cats = pd.DataFrame(t)
cats_df = pd.get_dummies(cats, prefix='Category')
cats_df = cats_df.groupby(cats_df.columns, axis=1).sum()

In [57]:
df = pd.concat([df5, cats_df], axis=1)
df

Unnamed: 0,bizID-dt,bizID,name,postal_code,lat,long,categories,attributes,is_open,count_crucial_signficant,stars_1,stars_2,stars_3,stars_4,stars_5,review_text,all_review_count,count_reviews_in_scope,count_unique_users,Category_Afghan,Category_African,Category_American (New),Category_American (Traditional),Category_Antiques,Category_Arabian,Category_Argentine,Category_Arts & Entertainment,Category_Asian Fusion,Category_Automotive,Category_Bagels,Category_Bakeries,Category_Bangladeshi,Category_Barbeque,Category_Bars,Category_Beer,Category_Beer Bar,Category_Bistros,Category_Brasseries,Category_Brazilian,Category_Breakfast & Brunch,Category_Breweries,Category_British,Category_Buffets,Category_Burgers,Category_Butcher,Category_Cafes,Category_Cajun/Creole,Category_Cambodian,Category_Canadian (New),Category_Caribbean,Category_Caterers,Category_Cheese Shops,Category_Chicken Shop,Category_Chicken Wings,Category_Chinese,Category_Chocolatiers & Shops,Category_Cocktail Bars,Category_Coffee & Tea,Category_Coffee Roasteries,Category_Comfort Food,Category_Convenience Stores,Category_Cooking Schools,Category_Creperies,Category_Cuban,Category_Delicatessen,Category_Delis,Category_Department Stores,Category_Desserts,Category_Dim Sum,Category_Diners,Category_Dive Bars,Category_Do-It-Yourself Food,Category_Donairs,Category_Donuts,Category_Education,Category_Egyptian,Category_Ethiopian,Category_Ethnic Food,Category_Event Planning & Services,Category_Falafel,Category_Fashion,Category_Fast Food,Category_Filipino,Category_Fish & Chips,Category_Florists,Category_Flowers & Gifts,Category_Food,Category_Food Court,Category_Food Delivery Services,Category_French,Category_Fruits & Veggies,Category_Gastropubs,Category_German,Category_Gift Shops,Category_Gluten-Free,Category_Greek,Category_Grocery,Category_Halal,Category_Health & Medical,Category_Health Markets,Category_Herbs & Spices,Category_Himalayan/Nepalese,Category_Hot Dogs,Category_Hot Pot,Category_Ice Cream & Frozen Yogurt,Category_Imported Food,Category_Indian,Category_International,Category_International Grocery,Category_Internet Cafes,Category_Irish,Category_Irish Pub,Category_Italian,Category_Japanese,Category_Jazz & Blues,Category_Juice Bars & Smoothies,Category_Korean,Category_Kosher,Category_Latin American,Category_Lebanese,Category_Live/Raw Food,Category_Local Flavor,Category_Local Services,Category_Lounges,Category_Macarons,Category_Malaysian,Category_Meat Shops,Category_Mediterranean,Category_Mexican,Category_Middle Eastern,Category_Modern European,Category_Music Venues,Category_Nightlife,Category_Noodles,Category_Nutritionists,Category_Organic Stores,Category_Pakistani,Category_Party & Event Planning,Category_Patisserie/Cake Shop,Category_Persian/Iranian,Category_Personal Chefs,Category_Peruvian,Category_Pizza,Category_Polish,Category_Portuguese,Category_Poutineries,Category_Pubs,Category_Ramen,Category_Restaurants,Category_Russian,Category_Salad,Category_Salvadoran,Category_Sandwiches,Category_Scandinavian,Category_Seafood,Category_Seafood Markets,Category_Shopping,Category_Slovakian,Category_Smokehouse,Category_Soul Food,Category_Soup,Category_South African,Category_Southern,Category_Spanish,Category_Specialty Food,Category_Specialty Schools,Category_Sports Bars,Category_Steakhouses,Category_Sushi Bars,Category_Taiwanese,Category_Tapas Bars,Category_Tapas/Small Plates,Category_Tea Rooms,Category_Tex-Mex,Category_Thai,Category_Tires,Category_Turkish,Category_Ukrainian,Category_Vegan,Category_Vegetarian,Category_Venezuelan,Category_Venues & Event Spaces,Category_Vietnamese,Category_Wholesale Stores,Category_Wigs,Category_Wine & Spirits,Category_Wine Bars,Category_Wineries
0,01l8MH9tBK6GPvJQdMU1gw-2016-03-22,01l8MH9tBK6GPvJQdMU1gw,Kaiseki Yu-zen Hashimoto,M3C 1Z5,43.7264546,-79.3349744,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,0,0,0,0,0,4,"""I am confident when I say this place is hands...",14,4,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,01l8MH9tBK6GPvJQdMU1gw-2016-11-17,01l8MH9tBK6GPvJQdMU1gw,Kaiseki Yu-zen Hashimoto,M3C 1Z5,43.7264546,-79.3349744,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,0,0,0,0,0,1,My sister and I enjoyed the lunchtime Kaiseki ...,14,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,01nRNgH_ukm8E2td9TTZDA-2016-05-09,01nRNgH_ukm8E2td9TTZDA,Baroli Cafe,M5B 2H1,43.6536106,-79.3800603,"['Coffee & Tea', 'Food']","['BikeParking: False', 'BusinessAcceptsCreditC...",1,3,1,0,0,0,0,"""The worst treatment in my entire life.\n\nThe...",4,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,01nRNgH_ukm8E2td9TTZDA-2016-10-21,01nRNgH_ukm8E2td9TTZDA,Baroli Cafe,M5B 2H1,43.6536106,-79.3800603,"['Coffee & Tea', 'Food']","['BikeParking: False', 'BusinessAcceptsCreditC...",1,0,0,1,0,0,1,"""Stopped by while shopping in the eaton centre...",4,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,02BXFKzu1rgaYulNGYvi6g-2015-10-08,02BXFKzu1rgaYulNGYvi6g,Matsuda Japanese Cuisine,M1W 2B4,43.831795518,-79.2663108185,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,0,2,2,7,5,4,"""I live 5 minutes away from this restaurant an...",178,20,20,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,02BXFKzu1rgaYulNGYvi6g-2016-01-19,02BXFKzu1rgaYulNGYvi6g,Matsuda Japanese Cuisine,M1W 2B4,43.831795518,-79.2663108185,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,0,1,2,7,4,1,Subpar AYCE Lunch Sushi Experience.\n\nFood:\n...,178,15,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,02BXFKzu1rgaYulNGYvi6g-2016-12-28,02BXFKzu1rgaYulNGYvi6g,Matsuda Japanese Cuisine,M1W 2B4,43.831795518,-79.2663108185,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,3,0,0,4,4,0,We came in with a big group of people for lunc...,178,8,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,02BXFKzu1rgaYulNGYvi6g-2016-04-01,02BXFKzu1rgaYulNGYvi6g,Matsuda Japanese Cuisine,M1W 2B4,43.831795518,-79.2663108185,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,3,4,2,6,5,0,"The food was low quality sushi, sashimi, maki,...",178,17,17,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,02BXFKzu1rgaYulNGYvi6g-2017-03-20,02BXFKzu1rgaYulNGYvi6g,Matsuda Japanese Cuisine,M1W 2B4,43.831795518,-79.2663108185,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,2,1,1,1,1,0,"""Yeah.. never again. Honestly, for food, sushi...",178,4,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,02BXFKzu1rgaYulNGYvi6g-2016-08-23,02BXFKzu1rgaYulNGYvi6g,Matsuda Japanese Cuisine,M1W 2B4,43.831795518,-79.2663108185,"['Restaurants', 'Japanese']","['Alcohol: beer_and_wine', ""Ambience: {'romant...",1,0,7,2,4,6,2,"""It used to be better. Some food didn't taste ...",178,21,21,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [58]:
#useful code to view all columns of df

pd.set_option('display.max_columns', None)

In [60]:
df = pd.concat([df5, cats_df], axis=1)
del df['categories']
del df['name']
del df['bizID']
del df['bizID-dt']
df['count_crucial_signficant']= df['count_crucial_signficant']>0


In [65]:
#turn cell from markdown to code

from nltk.corpus import stopwords
import nltk
stop = stopwords.words('english')
df['review_text']=df['review_text'].str.lower()
df['review_text'] = df['review_text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
ps = nltk.stem.WordNetLemmatizer()
df['review_text']=df["review_text"].apply(lambda x:[ps.lemmatize(y,pos='v') for y in x.split()])
df['review_text']=df['review_text'].apply(lambda x: ', '.join(x))
df['review_text'] = df['review_text'].str.replace('[^\w\s]','')

#del df['attributes']
list(df)
df

Unnamed: 0,postal_code,lat,long,is_open,count_crucial_signficant,stars_1,stars_2,stars_3,stars_4,stars_5,review_text,all_review_count,count_reviews_in_scope,count_unique_users,Category_Afghan,Category_African,Category_American (New),Category_American (Traditional),Category_Antiques,Category_Arabian,Category_Argentine,Category_Arts & Entertainment,Category_Asian Fusion,Category_Automotive,Category_Bagels,Category_Bakeries,Category_Bangladeshi,Category_Barbeque,Category_Bars,Category_Beer,Category_Beer Bar,Category_Bistros,Category_Brasseries,Category_Brazilian,Category_Breakfast & Brunch,Category_Breweries,Category_British,Category_Buffets,Category_Burgers,Category_Butcher,Category_Cafes,Category_Cajun/Creole,Category_Cambodian,Category_Canadian (New),Category_Caribbean,Category_Caterers,Category_Cheese Shops,Category_Chicken Shop,Category_Chicken Wings,Category_Chinese,Category_Chocolatiers & Shops,Category_Cocktail Bars,Category_Coffee & Tea,Category_Coffee Roasteries,Category_Comfort Food,Category_Convenience Stores,Category_Cooking Schools,Category_Creperies,Category_Cuban,Category_Delicatessen,Category_Delis,Category_Department Stores,Category_Desserts,Category_Dim Sum,Category_Diners,Category_Dive Bars,Category_Do-It-Yourself Food,Category_Donairs,Category_Donuts,Category_Education,Category_Egyptian,Category_Ethiopian,Category_Ethnic Food,Category_Event Planning & Services,Category_Falafel,Category_Fashion,Category_Fast Food,Category_Filipino,Category_Fish & Chips,Category_Florists,Category_Flowers & Gifts,Category_Food,Category_Food Court,Category_Food Delivery Services,Category_French,Category_Fruits & Veggies,Category_Gastropubs,Category_German,Category_Gift Shops,Category_Gluten-Free,Category_Greek,Category_Grocery,Category_Halal,Category_Health & Medical,Category_Health Markets,Category_Herbs & Spices,Category_Himalayan/Nepalese,Category_Hot Dogs,Category_Hot Pot,Category_Ice Cream & Frozen Yogurt,Category_Imported Food,Category_Indian,Category_International,Category_International Grocery,Category_Internet Cafes,Category_Irish,Category_Irish Pub,Category_Italian,Category_Japanese,Category_Jazz & Blues,Category_Juice Bars & Smoothies,Category_Korean,Category_Kosher,Category_Latin American,Category_Lebanese,Category_Live/Raw Food,Category_Local Flavor,Category_Local Services,Category_Lounges,Category_Macarons,Category_Malaysian,Category_Meat Shops,Category_Mediterranean,Category_Mexican,Category_Middle Eastern,Category_Modern European,Category_Music Venues,Category_Nightlife,Category_Noodles,Category_Nutritionists,Category_Organic Stores,Category_Pakistani,Category_Party & Event Planning,Category_Patisserie/Cake Shop,Category_Persian/Iranian,Category_Personal Chefs,Category_Peruvian,Category_Pizza,Category_Polish,Category_Portuguese,Category_Poutineries,Category_Pubs,Category_Ramen,Category_Restaurants,Category_Russian,Category_Salad,Category_Salvadoran,Category_Sandwiches,Category_Scandinavian,Category_Seafood,Category_Seafood Markets,Category_Shopping,Category_Slovakian,Category_Smokehouse,Category_Soul Food,Category_Soup,Category_South African,Category_Southern,Category_Spanish,Category_Specialty Food,Category_Specialty Schools,Category_Sports Bars,Category_Steakhouses,Category_Sushi Bars,Category_Taiwanese,Category_Tapas Bars,Category_Tapas/Small Plates,Category_Tea Rooms,Category_Tex-Mex,Category_Thai,Category_Tires,Category_Turkish,Category_Ukrainian,Category_Vegan,Category_Vegetarian,Category_Venezuelan,Category_Venues & Event Spaces,Category_Vietnamese,Category_Wholesale Stores,Category_Wigs,Category_Wine & Spirits,Category_Wine Bars,Category_Wineries
0,M3C 1Z5,43.7264546,-79.3349744,1,False,0,0,0,0,4,confident say place hand authentic kaiseki cui...,14,4,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,M3C 1Z5,43.7264546,-79.3349744,1,False,0,0,0,0,1,sister enjoy lunchtime kaiseki meal last weeke...,14,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,M5B 2H1,43.6536106,-79.3800603,1,True,1,0,0,0,0,worst treatment entire lifennthe girl work cou...,4,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,M5B 2H1,43.6536106,-79.3800603,1,False,0,1,0,0,1,stop shop eaton centre sandwich display catch ...,4,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,M1W 2B4,43.831795518,-79.2663108185,1,False,2,2,7,5,4,live 5 minutes away restaurant ive come years ...,178,20,20,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,M1W 2B4,43.831795518,-79.2663108185,1,False,1,2,7,4,1,subpar ayce lunch sushi experiencennfoodnthere...,178,15,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,M1W 2B4,43.831795518,-79.2663108185,1,True,0,0,4,4,0,come big group people lunch 9 honest service p...,178,8,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,M1W 2B4,43.831795518,-79.2663108185,1,True,4,2,6,5,0,food low quality sushi sashimi maki etc charge...,178,17,17,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,M1W 2B4,43.831795518,-79.2663108185,1,True,1,1,1,1,0,yeah never honestly food sushi taste good esp ...,178,4,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,M1W 2B4,43.831795518,-79.2663108185,1,False,7,2,4,6,2,use better food didnt taste anymore wait time ...,178,21,21,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [77]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

X = df.drop('count_crucial_signficant', axis=1)
Y = df['count_crucial_signficant']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=.75)

binary_vectorizer = CountVectorizer(binary=True)
binary_vectorizer.fit(X_train)
X_train_binary = binary_vectorizer.transform(X_train['review_text'])
X_test_binary = binary_vectorizer.transform(X_test['review_text'])
X_train.reshape()

<2033x225 sparse matrix of type '<class 'numpy.int64'>'
	with 5583 stored elements in Compressed Sparse Row format>

### Model Time!

In [78]:
model = LogisticRegression()
model.fit(X_train_binary, Y_train)
print ("Area under the ROC curve on test data = %.3f" % metrics.roc_auc_score(model.predict(X_test_binary), Y_test))
fpr, tpr, thresholds = metrics.roc_curve(Y_test, model.predict_proba(X_test_binary)[:,1])

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(X_train)

X_train_tfidf = tfidf_vectorizer.transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

model_v2 = LogisticRegression()
model_v2.fit(X_train_tfidf, Y_train)
print ("Area under the ROC curve on test data = %.3f" % metrics.roc_auc_score(model_v2.predict(X_test_counts), Y_test))


Area under the ROC curve on test data = 0.525


ValueError: Found input variables with inconsistent numbers of samples: [182, 2033]