In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib notebook
from sklearn.model_selection import train_test_split
# Two text-vectorization tools: bag of words (CountVectorizer) and TF-IDF (TfidfVectorizer)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

#### Read yelp_labelled data and split it using \n and \t

In [7]:
rows = []
with open('/Users/benstan/Desktop/GA-DS/SF-DAT-20-MASTER/Data/yelp_labelled.txt') as f:
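    for line in f:
        # Each line is "<review>\t<label>\n"; strip the newline, then split on the tab
        row = line.rstrip('\n').split('\t')
        # Rows with an empty label become NaN so they can be dropped later
        row[1] = int(row[1]) if row[1] else np.nan
        rows.append(row)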
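
The parsing step above can be tried on a few in-memory lines, which makes the handling of unlabelled rows easy to check. This is a minimal sketch rather than the notebook's loader: `parse_labelled_line` and the sample lines are hypothetical, and it uses `None` instead of `np.nan` for a missing label so that it needs only the standard library.

```python
def parse_labelled_line(line):
    """Split one 'review<TAB>label' line into (review, label or None)."""
    text, _, label = line.rstrip('\n').partition('\t')
    # An empty label field means the row is unlabelled.
    return text, (int(label) if label else None)

# Hypothetical sample lines in the assumed 'review<TAB>label' layout.
sample_lines = [
    "Wow... Loved this place.\t1\n",
    "But they don't clean the chiles?\t\n",
    "Crust is not good.\t0\n",
]
parsed = [parse_labelled_line(line) for line in sample_lines]
# Keep only the rows that carry a label (the equivalent of dropna).
labelled = [row for row in parsed if row[1] is not None]
print(labelled)
```

Unlike indexing `row[1]` directly, `partition` also tolerates a line with no tab at all, which it treats as unlabelled.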

#### Put your Yelp data into a DataFrame and drop the rows with NA values.

In [8]:
yelp_data = pd.DataFrame(rows,columns=['reviews','sentiment'])
yelp_data.head()

                                             reviews  sentiment
0                           Wow... Loved this place.        1.0
1  I learned that if an electric slicer is used t...        NaN
2                   But they don't clean the chiles?        NaN
3                                 Crust is not good.        0.0
4          Not tasty and the texture was just nasty.        0.0


In [9]:
yelp_data.dropna(inplace=True)

#### Using Pipeline, RandomForestClassifier, and GridSearchCV, experiment with min_df and max_df on your Yelp data. Split your data into training and test sets. You can use either CountVectorizer or TfidfVectorizer.

In [10]:
count_vect = CountVectorizer(stop_words='english')
bag_words = count_vect.fit_transform(yelp_data['reviews'])
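
The `min_df` and `max_df` settings tuned later in this notebook act on document frequency, the number of reviews a term appears in. As a rough illustration (plain Python, not scikit-learn's implementation, with a hypothetical `filter_by_document_frequency` helper and a whitespace tokenizer standing in for the real one), integer thresholds keep a term only when its document frequency lies between `min_df` and `max_df` inclusive:

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df=1, max_df=None):
    """Return the sorted terms whose document frequency is in [min_df, max_df]."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(doc.lower().split()))
    upper = len(docs) if max_df is None else max_df
    return sorted(t for t, n in doc_freq.items() if min_df <= n <= upper)

# Made-up toy corpus, not the Yelp data.
docs = [
    "great food great service",
    "food was cold",
    "service was slow",
    "great place",
    "great value",
]
vocab = filter_by_document_frequency(docs, min_df=2, max_df=2)
print(vocab)  # ['food', 'service', 'was']
```

Here "great" is dropped because it appears in three documents (above `max_df`), and "cold", "slow", "place" and "value" are dropped because they appear in only one (below `min_df`).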

In [21]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X = bag_words
y = yelp_data['sentiment']

X_train, X_test, y_train, y_test = train_test_split(yelp_data['reviews'],yelp_data['sentiment'],test_size=0.25)
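
As a rough picture of what a 25% hold-out does (a sketch only: `simple_holdout_split` is a hypothetical helper, not scikit-learn's implementation, and the fixed seed is an assumption added for repeatability), the row indices are shuffled once and a quarter of them is set aside for testing:

```python
import math
import random

def simple_holdout_split(items, labels, test_size=0.25, seed=0):
    """Shuffle row indices once, then hold out a `test_size` share of them."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # seed is an assumption, for repeatability
    n_test = int(math.ceil(test_size * len(items)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([items[i] for i in train_idx], [items[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

# Made-up rows; each review keeps its own label through the split.
toy_reviews = ["review %d" % i for i in range(8)]
toy_labels = [i % 2 for i in range(8)]
tr_X, te_X, tr_y, te_y = simple_holdout_split(toy_reviews, toy_labels)
print(len(tr_X), len(te_X))  # 6 2
```

The notebook's own call sets no `random_state`, so its split, and therefore the scores that follow, will vary from run to run.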

In [22]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('clf', RandomForestClassifier())])

In [23]:
params = {'vect__min_df':[1,2,3],
         'vect__max_df':[5,10,100,200,500,1000],
         'clf__n_estimators':[100,500,1000,5000]}

gs_clf = GridSearchCV(text_clf,params)
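
The keys in `params` follow the `<step name>__<parameter name>` convention that pipelines use to route a setting to one step: `vect__min_df` goes to the `'vect'` step and `clf__n_estimators` to the `'clf'` step. The sketch below only illustrates that naming; the `group_by_step` helper is hypothetical and is not part of scikit-learn.

```python
def group_by_step(param_grid):
    """Group 'step__param' keys by the pipeline step they address."""
    grouped = {}
    for key, values in param_grid.items():
        # Split on the first double underscore only.
        step, _, name = key.partition('__')
        grouped.setdefault(step, {})[name] = values
    return grouped

params = {'vect__min_df': [1, 2, 3],
          'vect__max_df': [5, 10, 100, 200, 500, 1000],
          'clf__n_estimators': [100, 500, 1000, 5000]}
grouped = group_by_step(params)
print(sorted(grouped))   # ['clf', 'vect']
print(grouped['clf'])    # {'n_estimators': [100, 500, 1000, 5000]}
```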

In [24]:
fit_grid = gs_clf.fit(X_train,y_train)
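
The fit above tries every combination in `params`, so its cost grows multiplicatively with each added value. A quick count with the standard library is sketched below. The number of cross-validation folds depends on the installed scikit-learn version (older releases defaulted to 3, newer ones to 5), so `n_folds` is an assumed placeholder.

```python
from itertools import product

params = {'vect__min_df': [1, 2, 3],
          'vect__max_df': [5, 10, 100, 200, 500, 1000],
          'clf__n_estimators': [100, 500, 1000, 5000]}

# One candidate per element of the Cartesian product: 3 * 6 * 4.
candidates = list(product(*params.values()))
n_candidates = len(candidates)

n_folds = 3  # assumption; use the value your GridSearchCV actually runs with
n_fits = n_candidates * n_folds  # excludes the final refit on the full training set
print(n_candidates, n_fits)  # 72 216
```

With forests of up to 5000 trees among the candidates, this many fits can take a long time, which is worth keeping in mind before widening the grid.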

In [25]:
fit_grid.best_params_

{'clf__n_estimators': 1000, 'vect__max_df': 100, 'vect__min_df': 1}

#### How much test error do you get with the best parameters found above?

In [26]:
# score() returns mean accuracy on the test set; test error = 1 - accuracy
fit_grid.score(X_test, y_test)

0.78400000000000003
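
For a classifier, `score` reports mean accuracy, so the test error asked about is one minus that value. A small sketch with made-up labels (the `toy_true` and `toy_pred` lists are illustrative, not the notebook's data), followed by the same arithmetic applied to the accuracy printed above:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / float(len(y_true))

toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 1]
toy_acc = accuracy(toy_true, toy_pred)  # 3 of 5 correct
toy_error = 1 - toy_acc

# Applied to the accuracy reported above (rounded to hide float noise):
test_error = round(1 - 0.784, 3)
print(test_error)  # 0.216
```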

#### Look over a few X_test instances and compare the predicted category with the actual review sentence.

In [31]:
predictions = fit_grid.predict(X_test)  # predict once rather than on every iteration
for i in range(4):
    print('{} {}'.format(X_test.values[i], predictions[i]))

the potatoes were great and so was the biscuit. 1.0
This place is amazing! 1.0
Definitely worth venturing off the strip for the pork belly, will return next time I'm in Vegas. 1.0
My breakfast was perpared great, with a beautiful presentation of 3 giant slices of Toast, lightly dusted with powdered sugar. 1.0


## Bonus Question: Can you find the test instances that are correctly classified and those that are misclassified?

In [33]:
#Misclassified instances
X_test[fit_grid.predict(X_test) != y_test]


1916    High-quality chicken on the chicken Caesar salad.
2396    I don't have very many words to say about this...
2354    Their frozen margaritas are WAY too sugary for...
3513                                The food wasn't good.
2362    So in a nutshell: 1) The restaraunt smells lik...
181     The scallop dish is quite appalling for value ...
505                 I dressed up to be treated so rudely!
659     Much better than the other AYCE sushi place I ...
3627    It really is impressive that the place hasn't ...
1384    The one down note is the ventilation could use...
2003    Very convenient, since we were staying at the ...
2859    Now the burgers aren't as good, the pizza whic...
624     If it were possible to give them zero stars, t...
2080    Prices are very reasonable, flavors are spot o...
436           Restaurant is always full but never a wait.
2294    I got to enjoy the seafood salad, with a fabul...
2916    If you stay in Vegas you must get breakfast he...
120     He cam

In [34]:
#Correctly Classified instances
X_test[fit_grid.predict(X_test) == y_test]



2731      the potatoes were great and so was the biscuit.
788                                This place is amazing!
330     Definitely worth venturing off the strip for t...
1776    My breakfast was perpared great, with a beauti...
3                                      Crust is not good.
1743    I swung in to give them a try but was deeply d...
3721    Then, as if I hadn't wasted enough of my life ...
2405    The staff is super nice and very quick even wi...
1632    I had the opportunity today to sample your ama...
2451                      And service was super friendly.
1970    Great place to relax and have an awesome burge...
562                            The WORST EXPERIENCE EVER.
1644                Just spicy enough.. Perfect actually.
3621    Shrimp- When I unwrapped it (I live only 1/2 a...
2907    My boyfriend tried the Mediterranean Chicken S...
3301    She was quite disappointed although some blame...
372     The menu is always changing, food quality is g...
2216    One ni
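
The two selections above use complementary conditions: `predicted != actual` and `predicted == actual`. The same idea in plain Python, additionally separating the two kinds of mistake, is sketched here on made-up reviews and labels (none of these values come from the notebook's data):

```python
# Made-up examples: 1 = positive, 0 = negative.
toy_reviews = ["great tacos", "cold fries", "friendly staff", "never again"]
toy_true = [1, 0, 1, 0]
toy_pred = [1, 1, 0, 0]

rows = list(zip(toy_reviews, toy_true, toy_pred))
correct = [r for r, t, p in rows if t == p]
misclassified = [r for r, t, p in rows if t != p]
# Negative reviews predicted positive, and positive reviews predicted negative.
false_positives = [r for r, t, p in rows if t == 0 and p == 1]
false_negatives = [r for r, t, p in rows if t == 1 and p == 0]

print(correct)          # ['great tacos', 'never again']
print(false_positives)  # ['cold fries']
print(false_negatives)  # ['friendly staff']
```

Splitting the errors this way shows whether the model tends to mistake negative reviews for positive ones or the reverse.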