# Homework 3 - Text Classification

In [1]:
# Import pandas to read in data
import numpy as np
import pandas as pd

## Text classification
We are going to look at some Amazon reviews and classify them as positive or negative.

### Data
The file `data/books.csv` contains 2,000 Amazon book reviews. The data set has two columns: the first, enclosed in double quotes, is the review text; the second is a binary label indicating whether the review is positive (1) or negative (0).

Let's take a quick look at the file.

In [2]:
!head -3 data/books.csv

review_text,positive
"THis book was horrible.  If it was possible to rate it lower than one star i would have.  I am an avid reader and picked this book up after my mom had gotten it from a friend.  I read half of it, suffering from a headache the entire time, and then got to the part about the relationship the 13 year old boy had with a 33 year old man and i lit this book on fire.  One less copy in the world...don't waste your money.I wish i had the time spent reading this book back so i could use it for better purposes.  THis book wasted my life",0
"I like to use the Amazon reviews when purchasing books, especially alert for dissenting perceptions about higly rated items, which usually disuades me from a selection.  So I offer this review that seriously questions the popularity of this work - I found it smug, self-serving and self-indulgent, written by a person with little or no empathy, especially for the people he castigates. For example, his portrayal of the family therapist see

Let's read the data into a pandas data frame. You'll notice two arguments to `pd.read_csv()` that we haven't seen before. The first, `quotechar`, tells pandas which character encloses the text fields; since our review text is surrounded by double quotes, we pass `"`. Inside the Python string literal the quote is written as `\"` so that it is not mistaken for the end of the string. The second, `escapechar`, tells pandas which character escapes a quote that appears inside a field; here it is the backslash, written `\\` in Python.
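To see what the two options do, here is a minimal sketch that uses Python's standard-library `csv` module, which accepts options with the same names. The two sample rows are made up for illustration, and the sketch is not pandas itself; it only shows how an escaped quote inside a quoted field is read back.

```python
import csv
import io

# Hypothetical two-row file: the first review contains escaped double quotes.
raw = (
    'review_text,positive\n'
    '"She called it \\"a masterpiece\\", and I agree.",1\n'
    '"Dull, slow, and far too long.",0\n'
)

# quotechar marks where a text field starts and ends;
# escapechar marks a quote that belongs to the text itself.
reader = csv.reader(io.StringIO(raw), quotechar='"', escapechar='\\')
header, *rows = list(reader)

for text, label in rows:
    print(label, text)
```

The commas and quotes inside each review stay part of the text instead of splitting the row into extra columns.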

In [3]:
data = pd.read_csv("data/books.csv", quotechar="\"", escapechar="\\")

In [4]:
data.head()

Unnamed: 0,review_text,positive
0,THis book was horrible. If it was possible to...,0
1,I like to use the Amazon reviews when purchasi...,0
2,THis book was horrible. If it was possible to...,0
3,"I'm not sure who's writing these reviews, but ...",0
4,I picked up the first book in this series (The...,0


## Data set

In [5]:
# Entire data set
review_data = data['review_text']
# Entire label set
positive_data =  data['positive']

### Task 1: Preprocessing the text (25 points)

Change the text to lower case and remove stop words, then transform the raw text collection into a matrix of token counts.

Hint: sklearn's CountVectorizer class has built-in options for these operations. Refer to http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html for more information.
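As a rough picture of what a matrix of token counts looks like, here is a small pure-Python sketch. The stop-word list and the two documents are hypothetical, and the tokenizer is far simpler than the one in CountVectorizer; it is only meant to show the three steps (lowercase, drop stop words, count).

```python
from collections import Counter

# Hypothetical stop-word list and documents, for illustration only.
STOP_WORDS = {"the", "a", "was", "and", "it", "this"}
docs = ["The book was great", "This book was dull and the plot was thin"]

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

tokens_per_doc = [tokenize(d) for d in docs]

# One column per distinct token, in alphabetical order
vocab = sorted({w for toks in tokens_per_doc for w in toks})

# One row per document, one count per vocabulary entry
counts = [Counter(toks) for toks in tokens_per_doc]
matrix = [[c[w] for w in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

Most entries are zero even in this tiny example, which is why the real matrix is stored in sparse form.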

In [6]:
from nltk.corpus import stopwords
import string

In [10]:
# Build the stop-word set once instead of reloading the list for every token
STOP_WORDS = set(stopwords.words('english'))

def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords (matched case-insensitively)
    3. Returns a list of the remaining words, in their original case
    """
    # Keep only the characters that are not punctuation
    nopunc = [char for char in mess if char not in string.punctuation]

    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)

    # Now just remove any stopwords
    return [word for word in nopunc.split() if word.lower() not in STOP_WORDS]

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

In [12]:
bow_transformer = CountVectorizer(analyzer=text_process).fit(review_data)
# Print total number of vocab words
print (len(bow_transformer.vocabulary_))

30903


In [13]:
# To Bag-of-Words (bow)
review_bow = bow_transformer.transform(review_data)
print ('Shape of Sparse Matrix: ', review_bow.shape)
print ('Number of Non-Zero entries: ', review_bow.nnz)
# Share of non-zero cells (the matrix density), as a percentage
print ('density: %.2f%%' % (100.0 * review_bow.nnz / (review_bow.shape[0] * review_bow.shape[1])))

Shape of Sparse Matrix:  (2000, 30903)
Number of Non-Zero entries:  147810
density: 0.24%


In [14]:
# To TF-IDF scores
# Not used in the later tasks: Task 3 builds its own matrix with TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer().fit(review_bow)
review_tfidf = tfidf_transformer.transform(review_bow)
print (review_tfidf.shape)

(2000, 30903)


### Task 2: Build a logistic regression model using token counts (25 points)

Build a logistic regression model using the token counts from task 1. Perform a 5-fold cross-validation (train-test ratio 80-20), and compute the mean AUC (Area under Curve).
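To make the two ingredients of this task concrete, here is a small pure-Python sketch of a 5-fold split and of AUC. Both functions are illustrative helpers, not sklearn's implementation: the split uses plain contiguous folds (sklearn stratifies by class for classifiers), and the AUC is computed from its pairwise definition, the share of positive/negative pairs in which the positive example gets the higher score. The labels and scores are made up.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) for k contiguous, near-equal folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # A tie between a positive and a negative score counts as half a win
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# With 2,000 samples and 5 folds, each split is 1,600 train / 400 test (80-20)
splits = list(k_fold_indices(2000, k=5))
print([(len(train), len(test)) for train, test in splits])

# Hypothetical labels and predicted scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 3 of the 4 pairs are ranked correctly -> 0.75
```

The mean AUC asked for here is the average of the per-fold AUC values, each computed on the 20% of the data held out in that fold.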

In [17]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression

In [45]:
#logistic regression model
logreg = LogisticRegression(solver='lbfgs')

In [46]:
# 5-fold cross-validation and compute mean AUC
mean_auc = cross_val_score(logreg, review_bow, positive_data, cv=5, scoring='roc_auc').mean()
print (mean_auc)

0.84611


### Task 3: Build a logistic regression model using TFIDF (25 points)

Transform the training data into a TFIDF matrix, and use it to build a new logistic regression model. Again, perform a 5-fold cross-validation, and compute the mean AUC.

Hint: Similar to CountVectorizer, sklearn's TfidfVectorizer class can do all the transformation work for you. Don't forget to use the stop_words option.
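For intuition about what the TFIDF weights are, here is a pure-Python sketch of the weighting as we understand sklearn's defaults: raw counts as term frequency, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and rows scaled to unit length. The small count matrix is hypothetical, and the function is an illustration, not the library code.

```python
import math

def tfidf(count_matrix):
    """Turn a dense matrix of token counts into L2-normalized TF-IDF rows."""
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    # Document frequency: in how many documents each term appears
    df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency (assumed to match sklearn's default)
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    result = []
    for row in count_matrix:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

# Hypothetical counts for two documents over the terms
# book, dull, great, plot, thin
counts = [[1, 0, 1, 0, 0],
          [1, 1, 0, 1, 1]]
weights = tfidf(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

In the first row, "great" receives a larger weight than "book" because "book" appears in both documents and is therefore less informative.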

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [25]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english',analyzer='word',lowercase=True)

In [27]:
# TFIDF matrix; labels are not needed because fitting the vectorizer is unsupervised
tv_fitted = tfidf_vectorizer.fit(review_data)
data_word_features = tv_fitted.transform(review_data)

In [30]:
#logistic regression model
logreg_tfidf = LogisticRegression(solver='lbfgs')

In [47]:
# 5-fold cross-validation and compute mean AUC
mean_auc_tfidf = cross_val_score(logreg_tfidf, data_word_features, positive_data, cv=5, scoring='roc_auc').mean()
print (mean_auc_tfidf)

0.861895
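One possible refinement, which is not part of the solution above: because the vectorizer is fitted on all reviews before cross-validation, the IDF weights also see the held-out fold. A sketch of an alternative is to wrap the vectorizer and the model in a pipeline so that both are refit on the training part of every fold. The toy reviews and labels below are hypothetical stand-ins for `review_data` and `positive_data`, and the variable names are ours.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for review_data and positive_data
toy_reviews = ["loved this wonderful book"] * 10 + ["boring and terrible book"] * 10
toy_labels = [1] * 10 + [0] * 10

# The vectorizer and the classifier are fitted together inside each fold
tfidf_logreg = make_pipeline(
    TfidfVectorizer(stop_words='english', lowercase=True),
    LogisticRegression(solver='lbfgs'),
)

fold_aucs = cross_val_score(tfidf_logreg, toy_reviews, toy_labels, cv=5, scoring='roc_auc')
print(fold_aucs.mean())
```

On the real data the difference from the score above may well be small; the pipeline mainly guarantees that nothing from a test fold influences the features.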


### Task 4: Build a logistic regression model using TFIDF over n-grams (25 points)

We still want to use the TFIDF matrix, but instead of using TFIDF over single tokens, this time we want to go further and use TFIDF values of both 1-gram and 2-gram tokens. Then use this new TFIDF matrix to build another logistic regression model. Again, perform a 5-fold cross-validation, and compute the mean AUC.

Hint: You can configure the n-gram range using the `ngram_range` option of TfidfVectorizer.
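To see which features an n-gram range of (1, 2) produces, here is a small pure-Python sketch. The helper `word_ngrams` and the sample tokens are hypothetical; the function only illustrates the idea of adding every pair of adjacent tokens next to the single tokens.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams of the token list for every n in the given range."""
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # Slide a window of n tokens over the list
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = ["great", "plot", "twist"]
print(word_ngrams(tokens))
# ['great', 'plot', 'twist', 'great plot', 'plot twist']
```

Each 2-gram becomes an extra column in the TFIDF matrix, so the vocabulary grows considerably compared with single tokens.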

In [48]:
# Set ngram_range to (1, 2) so that both single tokens and token pairs become features
tfidf_vectorizer_ngram = TfidfVectorizer(stop_words='english',analyzer='word',ngram_range=(1, 2),lowercase=True)
# Labels are not needed because fitting the vectorizer is unsupervised
tv_fitted_ngram = tfidf_vectorizer_ngram.fit(review_data)
data_word_features_ngram = tv_fitted_ngram.transform(review_data)

In [42]:
#logistic regression model
logreg_tfidf_ngram = LogisticRegression(solver='lbfgs')

In [43]:
mean_auc_tfidf_ngram = cross_val_score(logreg_tfidf_ngram, data_word_features_ngram, positive_data, cv=5, 
                                     scoring='roc_auc').mean()
print (mean_auc_tfidf_ngram)

0.8631350000000001
