# Classification of texts with Naive Bayes

In this Notebook we apply text mining techniques to predict [whether a hotel review is genuine or fake](https://www.kaggle.com/rtatman/deceptive-opinion-spam-corpus). The previous Notebook built a document-feature matrix from the review text. Here we use that matrix as input to the Naive Bayes algorithm, which predicts whether a review is genuine.
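
To make the idea behind Naive Bayes concrete before working with the real corpus, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, written with the standard library only. It is not scikit-learn's implementation, and the four training reviews, the helper names `fit` and `predict`, and the whitespace tokenisation are all assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy reviews, invented for illustration (not taken from the corpus).
train = [
    ("the room was clean and the staff was friendly", "truthful"),
    ("we stayed two nights and the bed was comfortable", "truthful"),
    ("my husband and i had an amazing luxury experience", "deceptive"),
    ("amazing hotel amazing luxury i will definitely return", "deceptive"),
]

def fit(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary

def predict(text, class_counts, word_counts, vocabulary):
    """Return the class with the highest log posterior (up to a constant)."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior: share of training documents that belong to this class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocabulary:
                continue  # Skip words never seen during training.
            # Add-one smoothing keeps unseen word/class pairs away from log(0).
            p = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

model = fit(train)
print(predict("amazing luxury hotel", *model))  # deceptive
print(predict("the bed was clean", *model))     # truthful
```

The "naive" part is the assumption that words occur independently given the class, which is why the score is a simple sum of per-word log probabilities.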

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer #The CountVectorizer object

First, let's repeat the steps from the previous Notebook to rebuild the document-feature matrix.

In [4]:
reviews = pd.read_csv('deceptive-opinion.csv')
text = reviews['text'].values.astype('U')  # Take the review text from the DataFrame and convert it to Unicode
vect = CountVectorizer(stop_words='english')  # Create the CountVectorizer, dropping English stop words
vect = vect.fit(text)  # Learn the vocabulary from the review text
docu_feat = vect.transform(text)  # Build the sparse document-feature matrix
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() returns an array, so convert it to a list
feature_names = list(vect.get_feature_names_out())
rev_words = pd.concat([reviews, pd.DataFrame(docu_feat.toarray())], axis=1)
# Prefix the original columns with 'v_' so they do not clash with vocabulary columns such as the word 'hotel'
rev_words.columns = ['v_deceptive', 'v_hotel', 'v_polarity', 'v_source', 'v_text'] + feature_names
rev_words.head()


| | v_deceptive | v_hotel | v_polarity | v_source | v_text | 00 | 000 | 00a | 00am | 00pm | ... | yum | yummo | yummy | yunan | yup | zagat | zest | zipped | zone | zoo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | truthful | conrad | positive | TripAdvisor | We stayed for a one night getaway with family ... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | truthful | hyatt | positive | TripAdvisor | Triple A rate with upgrade to view room was le... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | truthful | hyatt | positive | TripAdvisor | This comes a little late as I'm finally catchi... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | truthful | omni | positive | TripAdvisor | The Omni Chicago really delivers on all fronts... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | truthful | hyatt | positive | TripAdvisor | I asked for a high floor away from the elevato... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
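
The output above has one row per review and one column per vocabulary word, with each cell holding how often that word occurs in the review. As a rough sketch of how such a document-feature matrix comes about, the snippet below builds one by hand for three made-up sentences using only the standard library. The documents and the `tokenize` helper are hypothetical. The tokenisation (lowercasing, words of two or more characters) approximates `CountVectorizer`'s defaults, and stop words are not removed here.

```python
import re
from collections import Counter

# Hypothetical toy documents, invented for illustration.
docs = [
    "Great hotel, great staff",
    "The hotel room was small",
    "Staff was rude",
]

def tokenize(text):
    # Lowercase the text and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# One Counter of word frequencies per document.
counts = [Counter(tokenize(doc)) for doc in docs]

# The vocabulary is the sorted set of all words; each word becomes a column.
vocabulary = sorted(set(word for c in counts for word in c))

# Each row holds the counts of every vocabulary word in one document.
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
# ['great', 'hotel', 'room', 'rude', 'small', 'staff', 'the', 'was']
# [2, 1, 0, 0, 0, 1, 0, 0]
# [0, 1, 1, 0, 1, 0, 1, 1]
# [0, 0, 0, 1, 0, 1, 0, 1]
```

Most cells are zero even in this tiny example, which is why `CountVectorizer` returns a sparse matrix and why the first rows of the real table show only zeros for the displayed words.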
