# Sentiment analysis of book reviews

## Import Packages

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

## Load a 'ready-to-fit' Data Set

In [2]:
filename = os.path.join(os.getcwd(), "bookReviews.csv")
df = pd.read_csv(filename, header=0)

In [3]:
df.head()

| Unnamed: 0 | Review | Positive Review |
|---|---|---|
| 0 | This was perhaps the best of Johannes Steinhof... | True |
| 1 | This very fascinating book is a story written ... | True |
| 2 | The four tales in this collection are beautifu... | True |
| 3 | The book contained more profanity than I expec... | False |
| 4 | We have now entered a second time of deep conc... | True |


In [4]:
df.shape[:1]

(1973,)

### Positive reviews

In [5]:
pos_reviews = df[df['Positive Review'] == True]['Review']

print('Positive Review:')
print(pos_reviews.iloc[0])

print('{0} Positive reviews in total.'.format(pos_reviews.count()))


Positive Review:

980 Positive reviews in total.


### Negative reviews

In [6]:
neg_reviews = df[df['Positive Review'] == False]['Review']

print('Negative Review:')
print(neg_reviews.iloc[0])

print('{0} Negative reviews in total.'.format(neg_reviews.count()))

Negative Review:
The book contained more profanity than I expected to read in a book by Rita Rudner.  I had expected more humor from a comedienne.  Too bad, because I really like her humor

993 Negative reviews in total.
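
With 980 positive and 993 negative reviews, the two classes are nearly balanced. Both counts can also be read off in a single call. The following is a minimal sketch, assuming a DataFrame with the same `Positive Review` column; the three-row `toy_df` is hypothetical stand-in data, not the contents of `bookReviews.csv`:

```python
import pandas as pd

# Hypothetical stand-in for df; only the column names match the notebook
toy_df = pd.DataFrame({
    "Review": ["Loved it", "Dull and slow", "A wonderful read"],
    "Positive Review": [True, False, True],
})

# Count the examples in each class, then express the counts as proportions
class_counts = toy_df["Positive Review"].value_counts()
class_fractions = toy_df["Positive Review"].value_counts(normalize=True)

print(class_counts)
print(class_fractions)
```

Running the same two `value_counts()` calls on `df` should reproduce the totals printed above.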


## Create Training and Test Data Sets

### Create Labeled Examples

* Get the `Positive Review` column from DataFrame `df` and assign it to the variable `y`. This will be the label.
* Get the column `Review` from DataFrame `df` and assign it to the variable `X`. This will be the feature.

In [7]:
y = df['Positive Review']
X = df['Review']

X.shape

(1973,)

In [8]:
X.head()

0    This was perhaps the best of Johannes Steinhof...
1    This very fascinating book is a story written ...
2    The four tales in this collection are beautifu...
3    The book contained more profanity than I expec...
4    We have now entered a second time of deep conc...
Name: Review, dtype: object

## Split Labeled Examples into Training and Test Sets

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.75, random_state=1234)

X_train.head()

500     There is a reason this book has sold over 180,...
1047    There is one thing that every cookbook author ...
1667    Being an engineer in the aerospace industry I ...
1646    I have no idea how this book has received the ...
284     It is almost like dream comes true when I saw ...
Name: Review, dtype: object
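
The split above is purely random. Because the classes are nearly balanced, that is probably fine here, but `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both partitions. A minimal sketch on hypothetical toy data (`toy_X` and `toy_y` are illustrative names, not part of the notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 reviews, half of them positive
toy_X = pd.Series(["review {}".format(i) for i in range(8)])
toy_y = pd.Series([True, False] * 4)

# stratify=toy_y keeps the share of positive labels equal in both partitions
Xtr, Xte, ytr, yte = train_test_split(
    toy_X, toy_y, train_size=0.75, random_state=1234, stratify=toy_y
)

print(len(Xtr), len(Xte))      # sizes of the two partitions
print(ytr.mean(), yte.mean())  # fraction of positive labels in each
```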

## Implement TF-IDF Vectorizer to Transform Text

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [11]:
# 1. Create a TfidfVectorizer object
tfidf_vectorizer = TfidfVectorizer()

# 2. Fit the vectorizer to X_train
tfidf_vectorizer.fit(X_train)

# 3. Print the first 50 items in the vocabulary
print("Vocabulary size {0}: ".format(len(tfidf_vectorizer.vocabulary_)))
print(str(list(tfidf_vectorizer.vocabulary_.items())[0:50])+'\n')

# 4. Transform both the training and test data using the fitted vectorizer and its 'transform' method
X_train_tfidf = tfidf_vectorizer.transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# 5. Print the training matrix
print(X_train_tfidf.todense())

Vocabulary size 18558: 
[('there', 16673), ('is', 9043), ('reason', 13533), ('this', 16714), ('book', 2189), ('has', 7803), ('sold', 15423), ('over', 11793), ('180', 73), ('000', 1), ('copies', 3867), ('it', 9076), ('gets', 7240), ('right', 14207), ('to', 16835), ('the', 16627), ('point', 12568), ('accompanies', 444), ('each', 5372), ('strategy', 15943), ('with', 18277), ('visual', 17844), ('aid', 750), ('so', 15386), ('you', 18497), ('can', 2604), ('get', 7239), ('mental', 10534), ('picture', 12402), ('in', 8491), ('your', 18501), ('head', 7844), ('further', 7051), ('its', 9088), ('section', 14743), ('on', 11601), ('analyzing', 974), ('stocks', 15886), ('and', 984), ('commentary', 3384), ('state', 15782), ('of', 11543), ('financial', 6568), ('statements', 15786), ('market', 10286), ('are', 1220), ('money', 10863), ('if', 8336), ('just', 9282), ('starting', 15774)]

[[0.         0.16185315 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.       
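
The vectorizer is fitted on the training text only, so words that appear only in the test set have no column and are dropped at transform time. The miniature example below illustrates this; `toy_train` and `toy_test` are hypothetical corpora standing in for `X_train` and `X_test`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpora standing in for X_train and X_test
toy_train = ["good book", "bad book", "good plot"]
toy_test = ["good unseen words"]

vec = TfidfVectorizer()
vec.fit(toy_train)  # vocabulary and IDF weights come from the training text only

train_matrix = vec.transform(toy_train)
test_matrix = vec.transform(toy_test)

print(sorted(vec.vocabulary_))  # tokens learned from toy_train
print(train_matrix.shape)       # (number of documents, vocabulary size)
print(test_matrix.nnz)          # non-zero entries in the single test row
```

The result of `transform` is a SciPy sparse matrix. For a vocabulary as large as the one above, printing `.shape` or `.nnz` is usually cheaper than calling `.todense()`, which materializes every zero.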

## Fit a Logistic Regression Model to the Transformed Training Data and Evaluate the Model

In [19]:
# Create a LogisticRegression model object, and fit a Logistic Regression model to the transformed training data
model = LogisticRegression(max_iter=200)
model.fit(X_train_tfidf, y_train)

# Make predictions on the transformed test data using the predict_proba() method and
# save the values of the second column (the probability that the label is 'True')
probability_predictions = model.predict_proba(X_test_tfidf)[:,1]

print('Probability that the first 5 reviews of the test data are positive: ', probability_predictions[:5])

Probability that the first 5 reviews of the test data are positive:  [0.60395215 0.59635347 0.47190056 0.46265613 0.65376739]
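
Taking column 1 relies on `True` being the second class. The columns of `predict_proba()` follow the order of the fitted model's `classes_` attribute, which can be checked directly. A minimal sketch on hypothetical one-feature data (`toy_features`, `toy_labels`, and `clf` are illustrative names):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data with boolean labels
toy_features = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_labels = np.array([False, False, True, True])

clf = LogisticRegression().fit(toy_features, toy_labels)

# Columns of predict_proba follow the order of clf.classes_
print(clf.classes_)
proba = clf.predict_proba(toy_features)

# Look up the column for the positive class instead of hard-coding it
positive_col = list(clf.classes_).index(True)
print(proba[:, positive_col])
```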


In [34]:
# Make predictions on the transformed test data using the predict() method
class_label_predictions = model.predict(X_test_tfidf)

# Compute the Area Under the ROC curve (AUC) for the test data
auc = roc_auc_score(y_test, probability_predictions)
print('AUC on the test data: {:.4f}'.format(auc))

# Print the size of the resulting feature space using the 'vocabulary_' attribute of the vectorizer
len_feature_space = len(tfidf_vectorizer.vocabulary_)
print('The size of the feature space: {0}'.format(len_feature_space))

# Get a glimpse of the features (the first five word-to-index entries)
first_five = list(tfidf_vectorizer.vocabulary_.items())[0:5]
print('Glimpse of first 5 entries of the mapping of a word to its column/feature index:\n{}'.format(first_five))


AUC on the test data: 0.9146
The size of the feature space: 18558
Glimpse of first 5 entries of the mapping of a word to its column/feature index:
[('there', 16673), ('is', 9043), ('reason', 13533), ('this', 16714), ('book', 2189)]
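
AUC is computed from the predicted probabilities and measures how well they rank positive reviews above negative ones, so no decision threshold is involved. The hard labels from `predict()` would instead feed a metric such as accuracy. The sketch below contrasts the two on hypothetical values (`toy_true` and `toy_scores` are made up for illustration and are not results from this model):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels and predicted probabilities of the positive class
toy_true = np.array([True, True, False, False])
toy_scores = np.array([0.9, 0.4, 0.6, 0.1])

# AUC depends only on how the scores rank positives against negatives
toy_auc = roc_auc_score(toy_true, toy_scores)

# Accuracy needs hard labels; here they come from an assumed 0.5 cutoff
toy_pred = toy_scores >= 0.5
toy_acc = accuracy_score(toy_true, toy_pred)

print(toy_auc, toy_acc)
```

In the toy data, three of the four positive/negative pairs are ranked correctly, while only two of the four thresholded labels match, so the two metrics can disagree noticeably.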


## Check Two Book Reviews Against the Model's Predictions

In [27]:
print('Review #1:\n')
print(X_test.to_numpy()[124])

print('\nPrediction: Is this a good review? {}\n'.format(class_label_predictions[124]))

print('Actual: Is this a good review? {}\n'.format(y_test.to_numpy()[124]))

Review #1:

I've been a fan of Carol Dweck's scholarly work for years. Her work on self-esteem, self-concept, and the incremental vs. entity theories of intelligence provides some of the most powerfully useful tools I've encountered for educators and parents in their work with children, as well as in their own self-awareness and lives. I'm delighted to see this information written here in such a user-friendly conversational tone, rich with stories that illustrate the nuances and complexities of Dweck's research and ideas. I'm recommending this book to all of my graduate students (teachers and principals working with gifted learners), as well as to parents of high-ability children.

Dona Matthews, Ph.D., Director of the Hunter College Center for Gifted Studies and Education, City University of New York


Prediction: Is this a good review? True

Actual: Is this a good review? True



In [29]:
print('Review #2:\n')
print(X_test.to_numpy()[90])

print('\nPrediction: Is this a good review? {}\n'.format(class_label_predictions[90]))

print('Actual: Is this a good review? {}\n'.format(y_test.to_numpy()[90]))

Review #2:

I bought this recording with high hopes.  What better complement to reading Shakespeare than hearing him, right?  Well, not with this recording.  The cast is made up of &quot;distinguished actors,&quot; the insert proclaims, but it's obvious that these actors haven't done Shakespeare since they were in junior high school.  Nor have they improved since then: none of the actors has any feel for the Shakespearean line.  The speaking is stiff and mechanical, and half the time it sounds like a Monty Python farce!  When there are no visual effects to distract us, low-quality acting really sticks out.  For audio recordings, you need the best voices.  Too bad Arkangel didn't realize this.  My advice?  Grind up these CDs and use them to fertilize your nasturtiums


Prediction: Is this a good review? False

Actual: Is this a good review? False
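
Spot-checking two reviews says little about the rest of the test set. One way to find every disagreement is to compare the prediction array with the label array element by element. A minimal sketch, in which `toy_predictions` and `toy_actual` are hypothetical stand-ins for `class_label_predictions` and `y_test.to_numpy()`:

```python
import numpy as np

# Hypothetical stand-ins for class_label_predictions and y_test.to_numpy()
toy_predictions = np.array([True, False, True, True, False])
toy_actual = np.array([True, False, False, True, True])

# Positions where the prediction and the true label disagree
wrong_positions = np.flatnonzero(toy_predictions != toy_actual)

print(wrong_positions)
print("{} of {} misclassified".format(len(wrong_positions), len(toy_actual)))
```

The resulting positions could then be used with `X_test.to_numpy()` in the same way as the indices 124 and 90 above.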

