# Text Mining

In [1]:
# Import numpy for numeric helpers and pandas to read in data
import numpy as np
import pandas as pd

## Text classification
We are going to look at some Amazon reviews and classify them as positive or negative.

### Data
The file `data/books.csv` contains 2,000 Amazon book reviews. The data set has two columns: the first (enclosed in double quotes) is the review text, and the second, `positive`, is a binary label indicating whether the review is positive (1) or negative (0).

Let's take a quick look at the file.

In [2]:
!head -3 data/books.csv

review_text,positive
"THis book was horrible.  If it was possible to rate it lower than one star i would have.  I am an avid reader and picked this book up after my mom had gotten it from a friend.  I read half of it, suffering from a headache the entire time, and then got to the part about the relationship the 13 year old boy had with a 33 year old man and i lit this book on fire.  One less copy in the world...don't waste your money.I wish i had the time spent reading this book back so i could use it for better purposes.  THis book wasted my life",0
"I like to use the Amazon reviews when purchasing books, especially alert for dissenting perceptions about higly rated items, which usually disuades me from a selection.  So I offer this review that seriously questions the popularity of this work - I found it smug, self-serving and self-indulgent, written by a person with little or no empathy, especially for the people he castigates. For example, his portrayal of the family therapist see

Let's read the data into a pandas data frame. You'll notice two new arguments to `pd.read_csv()` that we've never seen before. The first, `quotechar`, tells pandas which character is used to "encapsulate" the text fields. Since our review text is surrounded by double quotes, we let pandas know. In the code we write the quote as `"\""`, with a `\` in front of it, because the double quote is also used to delimit the Python string itself. This backslash is known as an escape character. The second argument, `escapechar`, lets pandas know that the backslash is also the escape character inside the file.

In [3]:
data = pd.read_csv("data/books.csv", quotechar="\"", escapechar="\\")
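
To see what the quote and escape characters do, here is a minimal sketch that parses one made-up line in the same layout as `data/books.csv`. It uses Python's built-in `csv` module rather than pandas (an assumption made to keep the example dependency-free); the `quotechar` and `escapechar` options play the same role there. The sample review text is hypothetical.

```python
import csv
import io

# Hypothetical sample in the same layout as data/books.csv. The review text
# contains a comma and two double quotes, each escaped with a backslash.
raw = 'review_text,positive\n"He said \\"great read\\", and I agree",1\n'

# quotechar marks where a text field starts and ends;
# escapechar marks a quote inside the field as literal text.
reader = csv.reader(io.StringIO(raw), quotechar='"', escapechar='\\')
header = next(reader)
row = next(reader)

print(header)
print(row)
```

The comma inside the review does not split the field, and the escaped quotes end up as ordinary characters in the parsed text.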

In [4]:
data.head()

Unnamed: 0,review_text,positive
0,THis book was horrible. If it was possible to...,0
1,I like to use the Amazon reviews when purchasi...,0
2,THis book was horrible. If it was possible to...,0
3,"I'm not sure who's writing these reviews, but ...",0
4,I picked up the first book in this series (The...,0


### Task 1: Preprocessing the text

Change the text to lower case and remove stop words, then transform the raw text collection into a matrix of token counts.

Hint: sklearn's function CountVectorizer has built-in options for these operations. Refer to http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html for more information.

In [5]:
# Import vectorizers to turn text into numeric features
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


X_text = data['review_text']
Y = data['positive']

# Create a vectorizer that counts occurrences of each token
# (despite the variable name, it stores counts, not binary indicators)
binary_vectorizer = CountVectorizer(lowercase=True, stop_words='english')

# Let the vectorizer learn the vocabulary, i.e. which tokens exist in the text data
binary_vectorizer.fit(X_text)

# Turn each document into a row of token counts (a sparse numeric matrix)
X_counts = binary_vectorizer.transform(X_text)
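
As a rough illustration of what a token count matrix looks like, the sketch below builds one by hand for two made-up documents. It is a simplified stand-in, not scikit-learn's implementation: the stop word list is a tiny hypothetical one, and the tokenizer keeps words of two or more characters.

```python
import re
from collections import Counter

# Two made-up documents and a tiny, hypothetical stop word list.
docs = ["This book was horrible", "I loved this book, loved it"]
STOP_WORDS = {"this", "was", "i", "it"}

def tokenize(text):
    """Lower-case the text, keep words of 2+ characters, drop stop words."""
    words = re.findall(r"\b\w\w+\b", text.lower())
    return [w for w in words if w not in STOP_WORDS]

# The vocabulary is the sorted set of all tokens; each one becomes a column.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document, one count per vocabulary term.
matrix = [[Counter(tokenize(doc))[term] for term in vocab] for doc in docs]

print(vocab)
print(matrix)
```

Each row is a document and each column a token, so the second row records that "loved" appears twice in the second document.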

### Task 2: Build a logistic regression model using token counts
Build a logistic regression model using the token count matrix we obtained in Task 1. Perform a 5-fold cross-validation, and compute the mean AUC (area under the ROC curve).
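
For intuition about what "5-fold" means, here is a minimal sketch of how row indices can be split into folds so that every row is held out exactly once. The helper `k_fold_indices` is hypothetical and uses plain contiguous folds; scikit-learn's own splitting may differ (for example, by keeping the class balance similar in each fold).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop

# Split 10 rows into 5 folds: each fold holds out 2 rows for testing.
folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print(test, train)
```

A model is trained on each `train` part and scored on the matching `test` part, and the five scores are then averaged.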

In [6]:
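# Import models and evaluation functions
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# sklearn.cross_validation was removed in scikit-learn 0.20; its functions now
# live in sklearn.model_selection. Importing it under the old name keeps the
# calls below unchanged.
from sklearn import model_selection as cross_validation
from sklearn.model_selection import cross_val_score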

# Create a model
log_reg = LogisticRegression()

# Use this model and our data to get 5-fold cross validation AUCs
aucs = cross_validation.cross_val_score(log_reg, X_counts, Y, scoring="roc_auc", cv=5)

# Print out the average AUC rounded to three decimal places
print("Area under the ROC curve for our classifier is " + str(round(np.mean(aucs), 3)))

Area under the ROC curve for our classifier is 0.84
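
To make the AUC figure above more concrete: it can be read as the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. The sketch below computes it that way for four made-up scores. The helper `roc_auc` is hypothetical and compares every positive/negative pair directly, which is fine for illustration but slow on large data.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, otherwise 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, scores)
print(auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

An AUC of 0.5 corresponds to random ordering and 1.0 to a perfect ranking.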


### Task 3: Build a logistic regression model using TFIDF

Transform the data into a TFIDF matrix, and use it to build a new logistic regression model. Again, perform a 5-fold cross-validation, and compute the mean AUC.

Hint: Similar to CountVectorizer, sklearn's TfidfVectorizer can do all the transformation work for you. Don't forget to use the stop_words option.
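
As a sketch of what the TFIDF transformation computes, the example below weights token counts by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit length. This is intended to mirror TfidfVectorizer's default settings, but it is a hand-written illustration on two made-up, already tokenized documents, not the library's code.

```python
import math
from collections import Counter

# Two made-up documents, already lower-cased, tokenized and stop-word filtered.
docs = [["book", "horrible"], ["book", "loved", "loved"]]
n_docs = len(docs)
vocab = sorted({tok for doc in docs for tok in doc})

# Document frequency: in how many documents does each token occur?
df = {tok: sum(tok in doc for doc in docs) for tok in vocab}

# Smoothed IDF: tokens that occur in fewer documents get a larger weight.
idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

def tfidf_row(tokens):
    """Weight the token counts by IDF, then scale the row to unit L2 norm."""
    counts = Counter(tokens)
    raw = [counts[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print([round(v, 3) for v in row])
```

"book" occurs in both documents, so it gets the lowest weight, while "horrible" and "loved" each occur in only one document and are weighted up.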

In [7]:
# Create a vectorizer that will compute TFIDF of each token
tfidf_vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')

# Let the vectorizer learn the vocabulary and the inverse document frequency (IDF) of each token
tfidf_vectorizer.fit(X_text)

# Turn these tokens and their TFIDF values into a numeric matrix
X_tfidf = tfidf_vectorizer.transform(X_text)

# Create a model
log_reg = LogisticRegression()

# Use this model and our data to get 5-fold cross validation AUCs
aucs = cross_validation.cross_val_score(log_reg, X_tfidf, Y, scoring="roc_auc", cv=5)

# Print out the average AUC rounded to three decimal places
print("Area under the ROC curve for our classifier is " + str(round(np.mean(aucs), 3)))

Area under the ROC curve for our classifier is 0.862


### Task 4: Build a logistic regression model using TFIDF over n-grams

We still want to use a TFIDF matrix, but instead of computing TFIDF over single tokens only, this time we want to go further and use the TFIDF values of both 1-gram and 2-gram tokens. We will use the new TFIDF matrix to build another logistic regression model. Again, perform a 5-fold cross-validation, and compute the mean AUC.

Hint: You can configure the n-gram range using the `ngram_range` option of TfidfVectorizer.
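
To illustrate what 1-gram and 2-gram tokens are, the sketch below expands a made-up token list into all n-grams within a given range. The helper `ngrams` is hypothetical and only shows the idea; it is not how scikit-learn implements it internally.

```python
def ngrams(tokens, n_min, n_max):
    """Return all n-grams of length n_min..n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the token list.
        out.extend(" ".join(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))
    return out

# Made-up tokens, already lower-cased and stop-word filtered.
tokens = ["waste", "money", "fast"]
features = ngrams(tokens, 1, 2)
print(features)
```

With a range of (1, 2), each single token and each pair of adjacent tokens becomes its own column in the TFIDF matrix, so the vocabulary grows considerably.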

In [8]:
# Create a vectorizer that will compute TFIDF of each 1-gram and 2-gram
tfidf_vectorizer = TfidfVectorizer(lowercase=True, stop_words='english', ngram_range=(1,2))

# Let the vectorizer learn the vocabulary of 1-grams and 2-grams and the inverse document frequency (IDF) of each
tfidf_vectorizer.fit(X_text)

# Turn these tokens and their TFIDF values into a numeric matrix
X_tfidf = tfidf_vectorizer.transform(X_text)

# Create a model
log_reg = LogisticRegression()

# Use this model and our data to get 5-fold cross validation AUCs
aucs = cross_validation.cross_val_score(log_reg, X_tfidf, Y, scoring="roc_auc", cv=5)

# Print out the average AUC rounded to three decimal places
print("Area under the ROC curve for our classifier is " + str(round(np.mean(aucs), 3)))

Area under the ROC curve for our classifier is 0.863
