## Connecting to Data

In [2]:
from pymongo import MongoClient

# Connect the notebook to the MongoDB database - "amazonreviews"
client = MongoClient()
db = client.amazonreviews

In [3]:
# Show one of the documents in the "Books" collection
list(db.Books.find().limit(1))

[{'_id': ObjectId('5f391bb6bd04e741588262bc'),
  'marketplace': 'US',
  'customer_id': 32715830,
  'review_id': 'R2GANXKDIFZ6OI',
  'product_id': '014241543X',
  'product_parent': 712432151,
  'product_title': 'If I Stay',
  'product_category': 'Books',
  'star_rating': 5,
  'helpful_votes': 0,
  'total_votes': 0,
  'vine': 'N',
  'verified_purchase': 'N',
  'review_headline': 'Five Stars',
  'review_body': 'So beautiful',
  'review_date': '2015-08-31'}]

In [4]:
import pandas as pd

# Create a DataFrame of the Harry Potter reviews by filtering on 'product_parent': 667539744
HarryPotter_cursor = db.Books.find({'product_parent': 667539744 })
HarryPotter_df_raw = pd.DataFrame(list(HarryPotter_cursor))  

## Exploratory Data Analysis

In [5]:
# Show the first five rows of the DataFrame "HarryPotter_df_raw"
HarryPotter_df_raw.head(5)

| | _id | marketplace | customer_id | review_id | product_id | product_parent | product_title | product_category | star_rating | helpful_votes | total_votes | vine | verified_purchase | review_headline | review_body | review_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5f391bb6bd04e74158826435 | US | 42198815 | R1L0NE9TE6EAYU | 7020033458 | 667539744 | Harry Potter and the Prisoner of Azkaban (Simp... | Books | 5 | 0 | 0 | N | Y | Five Stars | GREAT THANKS. | 2015-08-31 |
| 1 | 5f391bb6bd04e74158827679 | US | 5328185 | RD5V8C95DUZZ7 | 059035342X | 667539744 | Harry Potter and the Sorcerer's Stone | Books | 5 | 0 | 0 | N | N | This book is absolutely amazing! It is a favor... | This book is absolutely amazing! It is a favor... | 2015-08-31 |
| 2 | 5f391bb6bd04e741588280ad | US | 42237878 | R3LW2TZQ5FLYGF | 545162076 | 667539744 | Harry Potter Paperback Box Set (Books 1-7) | Books | 5 | 0 | 1 | N | Y | Five Stars | What's not to love about Harry Potter? Books w... | 2015-08-31 |
| 3 | 5f391bb6bd04e741588280fc | US | 12175857 | R26KVAWWVTNZHF | 439136369 | 667539744 | Harry Potter and the Prisoner of Azkaban | Books | 4 | 0 | 0 | N | N | Rowling escalates her game and ups the ante | Prisoner_of_Azkaban_coverDo I need to put a su... | 2015-08-31 |
| 4 | 5f391bb7bd04e741588290c6 | US | 16802733 | RWIEHV6WZYGD7 | 545010225 | 667539744 | Harry Potter and the Deathly Hallows (Book 7) | Books | 5 | 0 | 0 | N | Y | Harry Potter... enough said. | Harry Potter... enough said. | 2015-08-31 |


In [6]:
# Get info on "HarryPotter_df_raw"
HarryPotter_df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28440 entries, 0 to 28439
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   _id                28440 non-null  object
 1   marketplace        28440 non-null  object
 2   customer_id        28440 non-null  int64 
 3   review_id          28440 non-null  object
 4   product_id         28440 non-null  object
 5   product_parent     28440 non-null  int64 
 6   product_title      28440 non-null  object
 7   product_category   28440 non-null  object
 8   star_rating        28440 non-null  int64 
 9   helpful_votes      28440 non-null  int64 
 10  total_votes        28440 non-null  int64 
 11  vine               28440 non-null  object
 12  verified_purchase  28440 non-null  object
 13  review_headline    28440 non-null  object
 14  review_body        28440 non-null  object
 15  review_date        28440 non-null  object
dtypes: int64(5), object(11)
memory usage: 3.

In [7]:
# Check for NaN values
HarryPotter_df_raw.isnull().any()

_id                  False
marketplace          False
customer_id          False
review_id            False
product_id           False
product_parent       False
product_title        False
product_category     False
star_rating          False
helpful_votes        False
total_votes          False
vine                 False
verified_purchase    False
review_headline      False
review_body          False
review_date          False
dtype: bool

In [8]:
# Keep only the columns needed for analysis, then delete "HarryPotter_df_raw" to free memory
HarryPotter_df = HarryPotter_df_raw.filter(['marketplace','customer_id','review_id','product_id','product_title','star_rating','helpful_votes','total_votes','vine','verified_purchase','review_headline','review_body','review_date'])
del HarryPotter_df_raw

In [9]:
# Remove duplicate reviews (same review_id)
HarryPotter_df = HarryPotter_df.drop_duplicates(subset=['review_id'])

In [10]:
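# Change "Y" and "N" to integers 1 and 0 in the two flag columns only,
# so that text fields whose entire value is "Y" or "N" are left untouched
for col in ['vine', 'verified_purchase']:
    HarryPotter_df[col] = HarryPotter_df[col].map({'Y': 1, 'N': 0})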

In [12]:
import bs4

# Remove HTML markup from the review text
HarryPotter_df['review_body'] = HarryPotter_df['review_body'].apply(lambda x: bs4.BeautifulSoup(x, 'lxml').get_text())

### Feature Engineering

In [25]:
# Create sentiment parameter based on star rating
def get_sentiment(value):
    if value > 3:
        return 1
    elif value < 3:
        return -1
    else:
        return 0

HarryPotter_df['star_sentiment'] = HarryPotter_df.star_rating.apply(get_sentiment)

In [26]:
# Find the number of reviews for each sentiment
print(HarryPotter_df['star_sentiment'].value_counts())

 1    25750
 0     1415
-1     1275
Name: star_sentiment, dtype: int64
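
The counts above are heavily skewed toward positive reviews. As a quick sketch, using only the numbers printed above, the class shares can be worked out like this:

```python
# Counts copied from the value_counts() output above
counts = {1: 25750, 0: 1415, -1: 1275}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"sentiment {label:>2}: {share:.1%}")
```

Roughly nine in ten reviews are positive, which is worth keeping in mind when comparing the positive and negative summaries later on.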


In [23]:
# Find number of words in review_body
HarryPotter_df["num_words"] = HarryPotter_df["review_body"].apply(lambda x: len(str(x).split()))

In [14]:
# Find number of unique words in review_body
HarryPotter_df["num_unique_words"] = HarryPotter_df["review_body"].apply(lambda x: len(set(str(x).split())))

In [15]:
# Find number of characters in review_body
HarryPotter_df["num_chars"] = HarryPotter_df["review_body"].apply(lambda x: len(str(x)))

In [17]:
import string

# Find number of punctuation marks in review_body
HarryPotter_df["num_punctuations"] = HarryPotter_df['review_body'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )

In [19]:
import numpy as np

# Find average length of the words in review_body
HarryPotter_df["mean_word_len"] = HarryPotter_df["review_body"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
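
To make the five text features concrete, here is a minimal, dependency-free sketch that computes them for a single string. The helper name `text_features` is hypothetical, and the empty-text guard on the mean is an addition that the cells above do not have. The sample is one of the review bodies shown earlier.

```python
import string

def text_features(text):
    """Return the five text features computed above, for one string."""
    text = str(text)
    words = text.split()
    return {
        "num_words": len(words),
        "num_unique_words": len(set(words)),
        "num_chars": len(text),
        "num_punctuations": sum(c in string.punctuation for c in text),
        # Guard against empty input, where a mean is undefined
        "mean_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
    }

sample = "Harry Potter... enough said."
features = text_features(sample)
print(features)
```

Note that words are split on whitespace only, so trailing punctuation counts toward word length ("Potter..." has nine characters).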

In [20]:
# Summary statistics for the numeric columns, including the engineered text features
HarryPotter_df.describe()

| | customer_id | star_rating | helpful_votes | total_votes | vine | verified_purchase | num_words | num_unique_words | num_chars | num_punctuations | mean_word_len |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 28440.0 | 28440.0 | 28440.0 | 28440.0 | 28440.0 | 28440.0 | 28440.0 | 28440.0 | 28440.0 | 28440.0 | 28440.0 |
| mean | 37716650.0 | 4.621624 | 1.789768 | 3.215084 | 0.0 | 0.3077 | 130.238432 | 84.6282 | 721.107419 | 22.695464 | 4.490633 |
| std | 14750480.0 | 0.860609 | 16.472034 | 20.485293 | 0.0 | 0.46155 | 164.755524 | 77.379865 | 943.888726 | 32.694961 | 7.702693 |
| min | 15584.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 25% | 26500010.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 38.0 | 33.0 | 202.0 | 6.0 | 4.153846 |
| 50% | 43317180.0 | 5.0 | 0.0 | 1.0 | 0.0 | 0.0 | 82.0 | 63.0 | 439.0 | 13.0 | 4.411494 |
| 75% | 50660110.0 | 5.0 | 1.0 | 2.0 | 0.0 | 1.0 | 160.0 | 111.0 | 879.0 | 27.0 | 4.666667 |
| max | 53096190.0 | 5.0 | 1550.0 | 1646.0 | 0.0 | 1.0 | 6556.0 | 1587.0 | 38667.0 | 1508.0 | 1300.0 |


In [27]:
# Keep only reviews with more than 20 words
HarryPotter_df = HarryPotter_df[HarryPotter_df.num_words > 20]

## Summarization (TextRank Algorithm)

In [80]:
# Build a mask to select negative reviews of one product by word count and punctuation count, then keep the 50 with the most votes
mask = ((HarryPotter_df.num_words >= 50) & (HarryPotter_df.num_punctuations >= 5) & (HarryPotter_df.product_title=="Harry Potter And The Sorcerer's Stone") & (HarryPotter_df.star_sentiment==-1))
HarryPotter_Neg_df = HarryPotter_df[mask]
HarryPotter_TopNeg_df = HarryPotter_Neg_df.nlargest(50,'total_votes')

In [81]:
# Build a mask to select positive reviews of one product by word count and punctuation count, then keep the 50 with the most votes
mask = ((HarryPotter_df.num_words >= 50) & (HarryPotter_df.num_punctuations >= 5) & (HarryPotter_df.product_title=="Harry Potter And The Sorcerer's Stone") & (HarryPotter_df.star_sentiment==1))
HarryPotter_Pos_df = HarryPotter_df[mask]
HarryPotter_TopPos_df = HarryPotter_Pos_df.nlargest(50,'total_votes')

In [83]:
import nltk

# Download pre-trained 'punkt' tokenizer
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ubuntu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [84]:
from nltk.tokenize import sent_tokenize

# sent_tokenize splits a text into a list of sentences using the pre-trained Punkt model,
# an unsupervised algorithm that learns abbreviations, collocations, and sentence-starting words

# Negative reviews
negative_sentences_list = []
for s in HarryPotter_TopNeg_df['review_body']:
  negative_sentences_list.append(sent_tokenize(s))
# Flatten sentence list
negative_sentences_list = [y for x in negative_sentences_list for y in x]

# Positive reviews
positive_sentences_list = []
for s in HarryPotter_TopPos_df['review_body']:
  positive_sentences_list.append(sent_tokenize(s))
# Flatten sentence list
positive_sentences_list = [y for x in positive_sentences_list for y in x]

In [85]:
positive_sentences_list[0]

'An adult friend (age 49)loaned me three Harry Potter books for the summer.'

In [86]:
# Load the pre-trained 100-dimensional GloVe word vectors into a dict (word -> vector)
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs
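
Each line of a GloVe text file is a token followed by its space-separated vector components. The sketch below parses that layout from an in-memory string so it runs without the real file. The helper `load_embeddings` is hypothetical, the three-dimensional vectors are made up for illustration, and plain lists stand in for the NumPy arrays used above.

```python
import io

def load_embeddings(lines):
    """Parse GloVe-style lines: a token followed by space-separated floats."""
    embeddings = {}
    for line in lines:
        values = line.split()
        if not values:
            continue  # skip blank lines
        embeddings[values[0]] = [float(v) for v in values[1:]]
    return embeddings

# Made-up 3-dimensional vectors, for illustration only
toy_file = io.StringIO("magic 0.1 -0.2 0.3\nwizard 0.4 0.5 -0.6\n")
toy_embeddings = load_embeddings(toy_file)
print(toy_embeddings["magic"])
```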

In [87]:
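# Replace punctuation, digits, and special characters with spaces
# (regex=True is required in recent pandas versions, where str.replace defaults to literal matching)
positive_clean_sentences = pd.Series(positive_sentences_list).str.replace("[^a-zA-Z]", " ", regex=True)
negative_clean_sentences = pd.Series(negative_sentences_list).str.replace("[^a-zA-Z]", " ", regex=True)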

In [88]:
# Make characters lowercase
positive_clean_sentences = [s.lower() for s in positive_clean_sentences]
negative_clean_sentences = [s.lower() for s in negative_clean_sentences]

In [89]:
# Download stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/ubuntu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [90]:
# Create stopwords function
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

def remove_stopwords(sentence):
    new_sentence = " ".join([i for i in sentence if i not in stop_words])
    return new_sentence

In [91]:
# Remove stopwords
positive_clean_sentences = [remove_stopwords(r.split()) for r in positive_clean_sentences]
negative_clean_sentences = [remove_stopwords(r.split()) for r in negative_clean_sentences]
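
The three cleaning steps (strip non-letters, lowercase, drop stop words) can be traced on the sample sentence printed earlier. This is a sketch using only the standard library; `clean_sentence` is a hypothetical helper, and `toy_stop_words` is a small hand-picked set standing in for NLTK's English list, which is much longer.

```python
import re

# Hypothetical, hand-picked stop-word set; NLTK's English list is much longer
toy_stop_words = {"an", "me", "for", "the"}

def clean_sentence(sentence, stop_words):
    # Replace every non-letter character with a space
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    # Lowercase, split on whitespace, and drop stop words
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)

raw = "An adult friend (age 49)loaned me three Harry Potter books for the summer."
cleaned = clean_sentence(raw, toy_stop_words)
print(cleaned)
```

Replacing non-letters with a space rather than deleting them keeps "(age 49)loaned" from collapsing into a single token.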

In [92]:
# Function to create vector representations of sentences
def sentence_vectors(clean_sentences):
    vectors = []
    for sentence in clean_sentences:
        words = sentence.split()
        if words:
            # Average the word vectors; words missing from GloVe contribute zeros
            vector = sum(word_embeddings.get(w, np.zeros((100,))) for w in words) / (len(words) + 0.001)
        else:
            vector = np.zeros((100,))
        vectors.append(vector)
    return vectors

In [93]:
# Create vector representations of the positive and negative sentences
positive_sentence_vectors = sentence_vectors(positive_clean_sentences)
negative_sentence_vectors = sentence_vectors(negative_clean_sentences)

In [94]:
from sklearn.metrics.pairwise import cosine_similarity

# Function to create similarity matrix
def similarity_matrix(sentences_list, sentence_vectors):
    n = len(sentences_list)
    sim_matrix = np.zeros([n, n])
    for i in range(n):
        for j in range(n):
            if i != j:
                sim_matrix[i][j] = cosine_similarity(
                    sentence_vectors[i].reshape(1, 100),
                    sentence_vectors[j].reshape(1, 100),
                )[0, 0]
    return sim_matrix
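
For intuition about what each entry of the similarity matrix holds, here is a dependency-free sketch of cosine similarity on toy two-dimensional vectors. The helpers `cosine` and `toy_similarity_matrix` are hypothetical, and returning 0 for an all-zero vector is an assumption of this sketch, not a statement about scikit-learn's behavior.

```python
import math

def cosine(u, v):
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat all-zero vectors as dissimilar to everything
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

def toy_similarity_matrix(vectors):
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # symmetric, so compute each pair once
            matrix[i][j] = matrix[j][i] = cosine(vectors[i], vectors[j])
    return matrix

toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
toy_matrix = toy_similarity_matrix(toy_vectors)
for row in toy_matrix:
    print(row)
```

Because cosine similarity is symmetric, filling only the upper triangle and mirroring it halves the number of pairwise computations; the diagonal stays at zero, as in the function above.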

In [95]:
# Create similarity matrices
postive_simiarity_matrix = similarity_matrix(positive_sentences_list, positive_sentence_vectors)
negative_simiarity_matrix = similarity_matrix(negative_sentences_list, negative_sentence_vectors)

In [96]:
import networkx as nx

# Build a graph from each similarity matrix and apply PageRank
nx_graph = nx.from_numpy_array(postive_simiarity_matrix)
positive_scores = nx.pagerank(nx_graph)
nx_graph = nx.from_numpy_array(negative_simiarity_matrix)
negitive_scores = nx.pagerank(nx_graph)
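
To show roughly what the PageRank step computes, the sketch below runs a damped power iteration on a small weighted matrix. It is not networkx's implementation: `pagerank_sketch` is a hypothetical helper following the standard formulation, the toy weights are made up, and the damping factor of 0.85 matches the default `alpha` of `nx.pagerank`.

```python
def pagerank_sketch(weights, damping=0.85, tol=1e-10, max_iter=200):
    """Damped power iteration on a weighted adjacency matrix (list of lists)."""
    n = len(weights)
    row_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # Score held by nodes with no outgoing weight is spread uniformly
        dangling = sum(scores[i] for i in range(n) if row_sums[i] == 0)
        new_scores = []
        for j in range(n):
            incoming = sum(
                scores[i] * weights[i][j] / row_sums[i]
                for i in range(n)
                if row_sums[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * (incoming + dangling / n))
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores

# Made-up symmetric similarities: sentence 0 is close to both of the others
toy_weights = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
toy_scores = pagerank_sketch(toy_weights)
print(toy_scores)
```

The sentence most similar to the others ends up with the highest score, which is why the top-ranked sentences serve as a summary.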

In [97]:
# Find Top 10 positive sentences for summary generation
positive_ranked_sentences = sorted(((positive_scores[i],s) for i,s in enumerate(positive_sentences_list)), reverse=True)
for i in range(10):
  print(positive_ranked_sentences[i][1])

I've read an interesting theory (obviously not true), that a much different writer than Rowling would have ended Book 7 with Harry having imagined all this fantasy world, where he was so prominent and famous, to help escape the neglect and abuse from the Dursleys.He gets a letter (actually, hundreds) saying he is in fact a wizard.
My mom had given it to me, and actually, i didn't want to read it.How wrong I was.Can I even start to explain the world and even life style that Harry has led me into.
I'm  not sure why I like this book so much, except for the reason it's so fun to  read.Maybe J.K.Rowling went to Hogwarts herself and put a spell over her  book so that the people who read it will have their noses stuck in the book  and would have to do everything with one hand(Like what Ron said in book  2,when Harry found the diary)!
When I read a good book, I get drawn into a different reality for a few days, and always hate to return to my own.
I originally read this book because I needed a

In [99]:
# Find Top 10 negative sentences for summary generation
negative_ranked_sentences = sorted(((negitive_scores[i],s) for i,s in enumerate(negative_sentences_list)), reverse=True)
for i in range(10):
  print(negative_ranked_sentences[i][1])

The half-twist ending did surprise me, but only because the rest of the story convinced me that the author wasn't truly capable of doing so--or perhaps didn't bother trying otherwise, considering her audience.Why cheat (rather than challenge) a child's sense of the wonderful if one is willing to (uncharacteristically) trust her sense of the horrible?But, in the end, the book's core tale of schooltime friendship and adventure manages to shine through, and when considered as a work of English boarding-school fiction rather than fantasy, it enjoys more than mild success.If you're looking for a great work of children's fantasy you may want to search elsewhere; if you're looking to kill a few hours with an enjoyable read, then "Sorcerer's Stone" could work for you.It's good fun at the very least, and certainly most kids will find it so.
Well, I read it, and for the most part, and could see how kids would really enjoy the fantasy and the plot line (such as it is), but I can't understand why 