This is based on a <a href="https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words">Kaggle tutorial</a>.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import re

In [2]:
original = pd.read_csv("review_final_academic.csv", index_col = 0)
original.head(2)

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,votes
0,qHmamQPCAKkia9X0uryA8g,2006-09-23,M8G9Rs21i4euIo3T5gyGOg,4,Are you drunk? Is it around 3am? Are you in do...,review,Xsp1amevfceAqAMjKhZkgA,"{u'funny': 0, u'useful': 1, u'cool': 0}"
1,qHmamQPCAKkia9X0uryA8g,2011-04-06,o9HYfNDSACBPRykq8t21PQ,3,OH NO HE DIDN'T.\n\nTHE GUY MAKING THE SAUSAGE...,review,mRQzFZMGurB-3bP4CGTNpQ,"{u'funny': 3, u'useful': 1, u'cool': 0}"


In [3]:
#All reviews, 5 star reviews, and 1&2 star reviews
len (original), len (original[original.stars >4]), len (original[original.stars <3])

(122657, 30913, 19792)

In [4]:
extremerevnum = len(original[original.stars >4]) + len (original[original.stars <3])

print "The total # of extreme reviews is %d, equal to %d%% of total." % (extremerevnum,(extremerevnum*100.0)/len(original))

The total # of extreme reviews is 50705, equal to 41% of total.


Now, make a new dataframe 'extreme' comprised of those more extreme reviews.
Confirm that it has the number of reviews you'd expect.

In [5]:
extreme = pd.DataFrame(original[original.stars.isin([1,2,5])])
print "This extreme file has %d rows before dropping ones with NA values." % len(extreme)
extreme.dropna(inplace = True) #Removes any row with a NA value
print "This extreme file has %d rows after dropping ones with NA values." % len(extreme)
extreme.tail(1)

This extreme file has 50705 rows before dropping ones with NA values.
This extreme file has 50700 rows after dropping ones with NA values.


Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,votes
122655,Fh2vlptKk8jyedzCDPbXbw,2011-07-04,B0DlRa0upq4RHGFUVWY6xQ,5,My daughter and I got the 2 meals for $30 whic...,review,7TmlplDISrIVy1QVuRcJyA,"{u'funny': 1, u'useful': 1, u'cool': 1}"


The first review's index is 2, and you can see above that the last index is much higher than 50,699.  Let's reset the index so we can iterate over the review text by position.

In [6]:
extreme.reset_index(inplace = True, drop = True)
extreme.tail(1)

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,votes
50699,Fh2vlptKk8jyedzCDPbXbw,2011-07-04,B0DlRa0upq4RHGFUVWY6xQ,5,My daughter and I got the 2 meals for $30 whic...,review,7TmlplDISrIVy1QVuRcJyA,"{u'funny': 1, u'useful': 1, u'cool': 1}"


In [7]:
#The function to call to clean text: letters only, lowercase, single spaces.
def cleantext(text):
    #text = BeautifulSoup(text).get_text()  #This removes html stuff, but not actual links.
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()  #This replaces everything that isn't a letter with a space
    return re.sub(r"\s+", " ", text).strip()  #This collapses runs of whitespace into one space

In [8]:
#Test clean text
print cleantext("Goats     llama7s    how     778::.45  many.times")

goats llama s how many times


We're almost ready to clean all the extreme reviews and put them in a list.
First I want to show you that a bunch of them contain web links, and that the cleaning function breaks those into fragments rather than removing them.

In [46]:
#Make a list of indices for reviews with web links in them.
cats = []
for i in xrange(len(extreme)):
    if ("http" in extreme.text[i]):
        cats.append(i)

#A lot of the reviews have web-links in them.
print len(cats), cats[:4]

306 [16, 34, 62, 281]


In [56]:
hockey = extreme.text[62]
print "Original: \n%r \n\n Cleaned: \n%r" %(hockey, cleantext(hockey))

Original: 
"Anytime the great saxophonist Philip Greenlief ( http://evandermusic.com/artist_detail.asp?artist_id=117 ) and I play gig, share a bill or even just attend one or the other's concert all the evening's events just become part of the journey to this Top Dog. \nHe goes for the tops, I like the Brats and a lemonade.\nThe coda is a donut from King Pin across the street. \nAwesome." 

 Cleaned: 
'anytime the great saxophonist philip greenlief http evandermusic com artist detail asp artist id and i play gig share a bill or even just attend one or the other s concert all the evening s events just become part of the journey to this top dog he goes for the tops i like the brats and a lemonade the coda is a donut from king pin across the street awesome'
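One way to avoid leftovers like "http evandermusic com artist detail asp" is to drop URLs before the letters-only step. This is a minimal sketch, not part of the original notebook: the name `cleantext_no_links` and the pattern `URL_PATTERN` are hypothetical, and the pattern assumes links start with http:// or https:// and end at the next whitespace.

```python
import re

# Assumption: a link starts with http:// or https:// and runs to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")

def cleantext_no_links(text):
    """Drop URLs, then keep letters only, lowercase, and collapse whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    return re.sub(r"\s+", " ", text).strip()

sample = "Philip Greenlief ( http://evandermusic.com/artist_detail.asp?artist_id=117 ) and I play"
print(cleantext_no_links(sample))  # philip greenlief and i play
```

Links written without a scheme (for example "www.example.com") would still slip through under this assumption.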


In [9]:
numreviews = len(extreme.text)
print "Total Reviews = ", numreviews

clean_reviews = []

for i in xrange(numreviews):
    if ((i+1)%10000 == 0):
        print "%d (%d percent) completed!" % ((i+1),(((i+1)*100.0)/numreviews))
    clean_reviews.append(cleantext(extreme.text[i]))

Total Reviews =  50700
10000 (19 percent) completed!
20000 (39 percent) completed!
30000 (59 percent) completed!
40000 (78 percent) completed!
50000 (98 percent) completed!


BeautifulSoup complains about one or two of the rows (when that line is enabled in cleantext), but that doesn't seem like the end of the world.

In [10]:
#Just to double check it's the same length as the extreme dataframe
print len(clean_reviews)

50700


In [11]:
#Add these to the dataframe
extreme['clean'] = clean_reviews
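The loop, the list, and the column assignment above can also be collapsed into a single step with pandas' `Series.apply`. A small sketch on a stand-in dataframe (the `reviews` frame and its two rows are made up for illustration):

```python
import re
import pandas as pd

def cleantext(text):
    """Keep letters only, lowercase, and collapse whitespace."""
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    return re.sub(r"\s+", " ", text).strip()

# Stand-in for the 'extreme' dataframe; only the 'text' column matters here.
reviews = pd.DataFrame({"text": ["Top Dog is AWESOME!!", "Goats   llama7s  778::.45"]})

# apply() calls cleantext once per row and returns a Series aligned to the index.
reviews["clean"] = reviews["text"].apply(cleantext)
print(reviews["clean"].tolist())  # ['top dog is awesome', 'goats llama s']
```

This loses the progress printout, but it does not depend on the index running from 0 to n-1.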

In [12]:
#See!  Clean
print "Original: ", extreme.text[0]
print "="*80, "\nCleaned: ", extreme.clean[0]

Original:  i don't care what other people say... top dog is awesome. if you hate it, then you're obviously not awesome.

ps: don't forget to pour on the russian mustard ;)

pps: don't order "hot dog, please" you'll get booted. it's "top dog". get it straight.
Cleaned:  i don t care what other people say top dog is awesome if you hate it then you re obviously not awesome ps don t forget to pour on the russian mustard pps don t order hot dog please you ll get booted it s top dog get it straight


In [13]:
#Train test split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(extreme.clean, extreme.stars, random_state=4)
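A note for anyone rerunning this on a newer scikit-learn: `train_test_split` moved to `sklearn.model_selection` in version 0.18, and `sklearn.cross_validation` was later removed. The sketch below tries the new location first and uses made-up toy data; with the default test size of 25%, eight items split into six for training and two for testing.

```python
try:
    from sklearn.model_selection import train_test_split  # scikit-learn 0.18 and later
except ImportError:
    from sklearn.cross_validation import train_test_split  # older releases

# Toy stand-ins for extreme.clean and extreme.stars.
texts = ["a", "b", "c", "d", "e", "f", "g", "h"]
labels = [5, 1, 5, 2, 5, 1, 5, 5]

# The default test_size is 0.25; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(texts, labels, random_state=4)
print("%d train, %d test" % (len(X_train), len(X_test)))  # 6 train, 2 test
```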

The code below initializes the "CountVectorizer" object (scikit-learn's bag-of-words tool).
The "analyzer" argument chooses between word and character features; n-grams (e.g. bi-grams) are set with "ngram_range".
I set stop_words to "english" rather than the default of None.
The max_features of 5000 means we're limiting the bag to the 5,000 most frequent words across the training reviews.

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer = "word", tokenizer = None, preprocessor = None,\
                             stop_words = "english", max_features = 5000)

#Vectorize our extreme reviews (for training)!
X_train_vectorized = vectorizer.fit_transform(X_train).toarray()
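To make the idea of a capped bag of words concrete, here is a hand-rolled sketch using only the standard library. It is not scikit-learn's implementation: the function `bag_of_words`, its whitespace tokenizer, and its alphabetical tie-breaking are all assumptions made for illustration.

```python
from collections import Counter

def bag_of_words(docs, max_features, stop_words=()):
    """Return (vocabulary, count rows) for whitespace-tokenized documents."""
    tokenized = [[w for w in doc.split() if w not in stop_words] for doc in docs]
    totals = Counter(w for words in tokenized for w in words)
    # Keep the max_features most frequent words; ties are broken alphabetically.
    top = sorted(totals, key=lambda w: (-totals[w], w))[:max_features]
    vocab = sorted(top)
    rows = []
    for words in tokenized:
        counts = Counter(words)
        rows.append([counts[w] for w in vocab])  # one column per vocabulary word
    return vocab, rows

docs = ["top dog is awesome", "the dog was awful", "awesome awesome brats"]
vocab, rows = bag_of_words(docs, max_features=3, stop_words={"is", "the", "was"})
print(vocab)  # ['awesome', 'awful', 'dog']
print(rows)   # [[1, 0, 1], [0, 1, 1], [2, 0, 0]]
```

Each row is one review and each column one word, which is the same layout as the 38,025 x 5,000 matrix in the notebook, just much smaller.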

In [15]:
X_train_vectorized.shape

(38025L, 5000L)

Ok, so it's 38,025 rows (75% of the reviews are used for training) and 5,000 columns (words).
Let's do the regression!  (This can take a few minutes on a lil engine.)

In [16]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [17]:
logreg.fit(X_train_vectorized,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)

In [18]:
y_pred = logreg.predict(vectorizer.transform(X_test).toarray())

from sklearn import metrics
print metrics.accuracy_score(y_test, y_pred)

0.825562130178


OK! So 82.6% isn't bad, right?  Maybe TF-IDF weighting can improve on it.
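A quick way to judge whether an accuracy figure is "not bad" is to compare it with always guessing the most common class. The sketch below uses the counts printed earlier in the notebook (30,913 five-star reviews and 19,792 one- and two-star reviews, before the five NA rows were dropped), so the figure is approximate for the cleaned data.

```python
# Counts taken from the notebook output above (before dropna removed 5 rows).
n_five_star = 30913
n_low_star = 19792
n_extreme = n_five_star + n_low_star

# Accuracy of a "model" that always predicts 5 stars.
baseline = n_five_star / float(n_extreme)
print("Majority-class baseline: %.1f%%" % (100 * baseline))  # Majority-class baseline: 61.0%
```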

In [19]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
X_train_tfidf = tfidf.fit_transform(X_train_vectorized)

In [20]:
logreg.fit(X_train_tfidf,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)

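OK, weird, but the regression seems to run a lot faster this way.  (Likely because TfidfTransformer returns a sparse matrix, whereas the count matrix was converted to a dense array.)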

In [21]:
y_pred = logreg.predict(vectorizer.transform(X_test).toarray())

from sklearn import metrics
print metrics.accuracy_score(y_test, y_pred)

0.828639053254


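OK, so 82.9% with TF-IDF isn't much better than 82.6% was.  One caveat: the test features passed to predict in the cell above are raw counts, not TF-IDF values, so this isn't a fair test of TF-IDF.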
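A minimal sketch of scoring held-out text with the same TF-IDF weighting the model was trained on. The toy documents and labels are made up; the point is only the order of calls: `fit_transform` on the training data, then plain `transform` (for both the vectorizer and the TF-IDF step) on the test data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the cleaned reviews and their labels.
train_docs = ["great food awesome service", "awful food rude service",
              "awesome brats great lemonade", "rude staff awful wait"]
train_labels = [1, 0, 1, 0]
test_docs = ["great awesome lemonade", "awful rude wait"]

vectorizer = CountVectorizer()
tfidf = TfidfTransformer()

# Learn the vocabulary and the IDF weights from the training data only.
train_counts = vectorizer.fit_transform(train_docs)
train_tfidf = tfidf.fit_transform(train_counts)

# Reuse the fitted vocabulary and IDF weights on the test data.
test_counts = vectorizer.transform(test_docs)
test_tfidf = tfidf.transform(test_counts)

model = LogisticRegression()
model.fit(train_tfidf, train_labels)
predictions = model.predict(test_tfidf)
print(predictions)
```

scikit-learn's `TfidfVectorizer` combines the two feature steps into one object, which makes it harder to skip the second transform by accident.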

Hey, OMG wait.  I've been asking it to predict 1,2 or 5 stars, right?  So maybe I should pool the 1 and 2 stars as 0 and the 5 stars as 1.

In [22]:
#Make a new 'score' column where 1 means 5 stars and 0 means 1 or 2 stars.
extreme['score'] = map(int,(extreme.stars == 5))
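The `map(int, ...)` call above relies on Python 2, where `map` returns a list; in Python 3 it returns an iterator, which may not behave as intended when assigned to a column. A version-independent sketch using pandas' `astype`, on a made-up `stars` series:

```python
import pandas as pd

# Stand-in for extreme.stars.
stars = pd.Series([5, 1, 2, 5])

# The comparison yields booleans; astype(int) turns True/False into 1/0.
score = (stars == 5).astype(int)
print(score.tolist())  # [1, 0, 0, 1]
```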

In [23]:
#Redo Train test split ("random state = 4" means in same way, so we can compare!) based on score.
X_train, X_test, y_train, y_test = train_test_split(extreme.clean, extreme.score, random_state=4)

In [24]:
#Probably don't need to redo this, but...
X_train = vectorizer.fit_transform(X_train).toarray()

In [25]:
logreg.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)

Ok, that took about 10 minutes.
Now, let's see how good it looks, and then see if TFIDF alters it at all.

In [26]:
y_pred = logreg.predict(vectorizer.transform(X_test).toarray())

print metrics.accuracy_score(y_test, y_pred)

0.927968441815


Ooh, 92.8% is quite a bit better than 82.6%.  How about TF-IDF data?

In [27]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
X_train_tfidf = tfidf.fit_transform(X_train)

In [28]:
logreg.fit(X_train_tfidf,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)

In [29]:
y_pred = logreg.predict(vectorizer.transform(X_test).toarray())

print metrics.accuracy_score(y_test, y_pred)

0.922840236686


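So, TF-IDF doesn't appear to help: it's 92.3% instead of 92.8%.  Note, though, that the test features passed to predict are raw counts, not TF-IDF values, so the comparison is inconclusive.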