# NB_TopicClassifier_AmazonReviews

Author: Frank Fichtenmueller <br>
Date: 06/09/2017<br>

This is an example implementation of Amazon Review Data

Programmatic Structure:
- Generate a csv readout of the Customer Reviews for the Product
- load the reviews into a dict for easy access


Given a list of review texts, we create a dictionary for each review {field_source: review_text}

In [147]:
# Basic Imports
import os
import csv

In [148]:
# Change into data folder
os.chdir('D:\\AAA_ProgrammingFiles\\AAA_Learning\\AAA_Moocs\\Coursera_NLP-Introduction\\SentimentClassification\\data')

In [149]:
# Display list of all files
os.listdir()

['biologicalsciences_2010.csv', 'literature_2010.csv', 'sociology_2010.csv']

### Clean the Data

In [150]:
import re
from html import entities

# Helper Functions
def tokenize(text):
    text = [sent for sent in nltk.sent_tokenize(text)]
    text = [[word for word in nltk.word_tokenize(sent)] for sent in text]
    return text

def clean_punctuation(text):
    import re
    import string
    
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    tokenized_docs = []
    
    for review in text:
        new_review = []
        for token in review:
            new_token = regex.sub(u'', token)
            if not new_token == u'':
                new_review.append(new_token)
                
        tokenized_docs.append(new_review)
    
    return tokenized_docs


def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            try:
                if text[:3] == "&#x":
                    return chr(int(text[3:-1], 16))
                else:
                    return chr(int(text[2:-1]))
            except ValueError:
                pass
        
        else:
            try:
                text = chr(entities.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text
    return re.sub("&#?\w+;", fixup, text)
    
# Function to lemmatize the resulting word_sentence structure
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    return [[lemmatizer.lemmatize(word) for word in sentence] for sentence in text]

In [155]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

amazon_reviews = []
target_labels = []

# Extract the text from the reviews
for infile in os.listdir(os.path.join(os.getcwd())):
    if infile.endswith('csv'):
        label = infile.split('.')[0]
        target_labels.append(label)
        
        with open(infile, 'r', encoding='utf8') as csvfile:
            amazon_reader = csv.DictReader(csvfile, delimiter=',')
            amazon_reviews = [{label: lemmatize(clean_punctuation(tokenize(unescape(row['review_text']))))} for row in amazon_reader]
        
        for doc in infile_rows:
            amazon_reviews.append(doc)

In [156]:
print('There are ' + str(len(amazon_reviews)) + ' total reviews.')
print('The labels are '+ ', '.join(target_labels) + '.')

There are 1122 total reviews.
The labels are biologicalsciences_2010, literature_2010, sociology_2010.


### Build the classifier

In [157]:
# Shuffle the training data
from random import shuffle
x = [amazon_reviews[i] for i in range(len(amazon_reviews))]
shuffle(x)

In [158]:
x

[{'biologicalsciences_2010': "Our intuition has been honed through a million years of awareness - from mastodons to murderers. As `sophisticated human beings', we have almost been trained to ignore those signals about danger - the hairs on the back of our neck, the tingling in our fingers, that shortness of breath. This sense of fear is our first signal that all is not right with our world. But we need to acknowledge and pay attention to these physical indications and bring them into our conscious awareness. The author has significant experience in security work, particularly in predicting violent behavior. If we can recognize this `code of violence, then we might be able to prevent it from happening to us. The books is a great `How To\\ in a unique field of violence prediction and avoidance. deBecker's clients include federal agencies"},
 {'sociology_2010': [['A', 'classic', 'that', 'everyone', 'should', 'have'],
   ['This', 'is', 'a', 'wonderful', 'book'],
   ['A',
    'different',
 

In [192]:
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.cross_validation import train_test_split
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from operator import itemgetter
from itertools import chain

# Generate a train test split
data = np.array([' '.join(list(itertools.chain.from_iterable(*el.values()))) for el in amazon_reviews])
targets = np.array([' '.join(list(itertools.chain.from_iterable(*el.keys()))) for el in amazon_reviews])

In [193]:
# Here we are using the itertools library to flatten the file into a wordlist for preparation
import itertools
' '.join(list(itertools.chain.from_iterable(*amazon_reviews[0].values())))

'This book wa given to me by my best friend after finishing college I will always treasure this thoughtful and special gift p Girlfriends is a collection of story that explore and celebrate female friendship through the eye ear and heart of everyday woman Some of the woman were friend for a lifetime others for a short time However all understood andor demonstrated the meaning of true friendship For example the story included everything from the thankful musing of a onceill woman about the extraordinaty kindness of her girlfriend to a giggly account of how two eerilysimiar best friend met a assigned roomates their first day of college The latter tale struck very close to home in a wonderfully spooky way p While many of the story tugged at the heartstrings I never felt manipulated by the author Note Part of the reason why I do nt like the Chicken Soup for the Soul series is that I feel that the author are just dying to make the reader clutch for the box of tissue Rather I appreciated the

In [225]:
# Now we create the test train split
X_train, X_test, y_train, y_test = train_test_split(data, targets, test_size=0.2, random_state=42)

# And vectorize the training data to numerical representation
# vectorizer = TfidfVectorizer(min_df=1, 
#                              ngram_range=(1, 2),
#                              stop_words='english',
#                              strip_accents='unicode',
#                              norm='l2')
# X_train = vectorizer.fit_transform(X_train)
# X_test = vectorizer.transform(X_test)

In [226]:
print(X_train[:4])

[ "S t r a i g h t f o r w a r d   d e s c r i p t i o n s ,   w i t h   v e r y   n i c e   i l l u s t r a t i o n s .   E a s y   i d e n t i f i c a t i o n   o f   w i n t e r ,   m i g r a t i o n   a n d   b r e e d i n g   h a b i t a t .   I ' d   s a y   t h i s   i s   a s   c l o s e   t o   a n   i d e a l   s i z e   f o r   a   f i e l d   g u i d e   a s   I   c a n   i m a g i n e ."
 'I   h a v e   r e a d   t h i s   b o o k   3   t i m e s   n o w   a n d   r e c e n t l y   b o u g h t   s e v e r a l   c o p i e s   t o   g i v e   t o   f e m a l e   f r i e n d s   a s   g i f t s .   T h e   a u t h o r   i s   a n   i n c r e d i b l e   m a n ,   w i d e l y   r e s p e c t e d   i n   h i s   f i e l d   a n d   w i t h   a n   i n t e r e s t i n g   b a c k g r o u n d .   T h i s   b o o k   h a s   h e l p e d   m e   m a k e   b e t t e r   d e c i s i o n s   r e g a r d i n g   m y   s a f e t y .   T h i s   b o o k   h e l p e d   m e   t o   b e   

In [223]:
# Train the classifier
clf_nb = MultinomialNB().fit(X_train, y_train)
y_predicted = clf_nb.predict(X_test)

In [224]:
all(y_predicted== y_test)

True

### Evaluate Classifier Performance

In [219]:
# Evaluate Performance
# print('Precision:: ', str(metrics.precision_score(y_test, y_predicted)))
print('Recal:: ', str(metrics.recall_score(y_test, y_predicted)))
# print('F1 Score:: ', str(metrics.f1_score(y_test, y_predicted)))

# print('\n\nConfusion Matrix:::: ')
# print(metrics.confusion_matrix(y_test, y_predicted))

  if pos_label not in present_labels:


ValueError: pos_label=1 is not a valid label: array(['b i o l o g i c a l s c i e n c e s _ 2 0 1 0',
       's o c i o l o g y _ 2 0 1 0'], 
      dtype='<U45')

In [None]:
# Evaluate the most important features per class
N = 10
vocabulary = np.array([t for t,i in sorted(vectorizer.vocabulary_.iteritems(), key=itemgetter(1))])