# Amazon Kindle Store Book Reviews Dataset

## **Context**
This dataset is a small subset of book reviews from the Amazon Kindle Store category, spanning from **May 1996 to July 2014**. It contains detailed review data, allowing for a variety of analytical approaches in sentiment analysis, user behavior, and product evaluations.

---

## **Content**
The source dataset consists of **982,619 entries** (the `all_kindle_review.csv` file loaded in this notebook contains 12,000 rows), with the following characteristics:
- Each reviewer has written at least **5 reviews**.
- Each product has received at least **5 reviews**.

### **Columns Overview**
1. **asin**: Unique identifier for the product (e.g., `B000FA64PK`).
2. **helpful**: Helpfulness votes for the review, stored as a pair (e.g., `[2, 3]` means 2 out of 3 users found it helpful).
3. **overall**: Product rating given by the reviewer (e.g., `4.0`). In the CSV loaded in this notebook, this column is named `rating` and holds integers from 1 to 5.
4. **reviewText**: Full text of the review.
5. **reviewTime**: Date of the review as a raw string (e.g., `09 2, 2010`).
6. **reviewerID**: Unique identifier for the reviewer (e.g., `A3SPTOKDG7WBLN`).
7. **reviewerName**: Name of the reviewer.
8. **summary**: Short headline of the review (e.g., `Entertaining But Average`).
9. **unixReviewTime**: The timestamp of the review in Unix format.
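
As a small illustration of the column formats, the sketch below parses one `helpful` value and one `unixReviewTime` value using only the standard library. The sample values come from the first row of the data preview in this notebook. It assumes `helpful` arrives as a string such as `"[8, 10]"` after reading the CSV, and the helper names `helpful_ratio` and `review_date` are hypothetical.

```python
import ast
from datetime import datetime, timezone


def helpful_ratio(raw):
    """Return helpful votes / total votes, or None when nobody voted."""
    helpful_votes, total_votes = ast.literal_eval(raw)  # "[8, 10]" -> [8, 10]
    if total_votes == 0:
        return None
    return helpful_votes / total_votes


def review_date(unix_seconds):
    """Convert a Unix timestamp in seconds to a UTC calendar date."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).date()


print(helpful_ratio("[8, 10]"))   # 0.8
print(helpful_ratio("[0, 0]"))    # None
print(review_date(1283385600))    # 2010-09-02
```

Returning `None` for reviews without votes keeps them distinguishable from reviews that were voted unhelpful.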

---

## **Acknowledgements**
This dataset was curated by **Julian McAuley** and is available on the [UCSD website](http://jmcauley.ucsd.edu/data/amazon/). The licensing of the data belongs to the original authors.

---

## **Inspiration for Analysis**
Here are some ideas for exploring and analyzing the dataset:

### 1. **Sentiment Analysis**
- Identify the sentiment (positive, negative, neutral) from the `reviewText` and `summary`.
- Explore trends in sentiment over time.

### 2. **Helpfulness of Reviews**
- Analyze factors influencing the helpfulness score (`helpful`).
- Determine whether review length, sentiment, or product rating affects the perceived helpfulness.

### 3. **Fake Reviews and Outliers**
- Detect anomalies or potentially fake reviews using patterns in `reviewText` or review frequency.
- Identify reviews with extremely high or low helpfulness ratings.

### 4. **Product Insights**
- Find the **best-rated products** based on `overall` ratings.
- Analyze patterns in review content to group similar products (e.g., common keywords or sentiments).

### 5. **Temporal Analysis**
- Study the relationship between review time (`reviewTime`, `unixReviewTime`) and product popularity.
- Examine seasonal trends in reviews and ratings.

### 6. **Reviewer Behavior**
- Analyze reviewer habits (e.g., average ratings given, consistency in reviews).
- Identify top contributors or influencers in the dataset.

---

## **Potential Applications**
- Building recommendation systems based on review sentiment and ratings.
- Enhancing customer experience by identifying helpful reviews.
- Developing models for fake review detection.
- Market research on reader preferences and book trends.

---

Feel free to dive into the dataset for exciting and actionable insights!


In [1]:
import pandas as pd

data = pd.read_csv("./Dataset/all_kindle_review.csv")
data.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,asin,helpful,rating,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,11539,B0033UV8HI,"[8, 10]",3,"Jace Rankin may be short, but he's nothing to ...","09 2, 2010",A3HHXRELK8BHQG,Ridley,Entertaining But Average,1283385600
1,1,5957,B002HJV4DE,"[1, 1]",5,Great short read. I didn't want to put it dow...,"10 8, 2013",A2RGNZ0TRF578I,Holly Butler,Terrific menage scenes!,1381190400
2,2,9146,B002ZG96I4,"[0, 0]",3,I'll start by saying this is the first of four...,"04 11, 2014",A3S0H2HV6U1I7F,Merissa,Snapdragon Alley,1397174400
3,3,7038,B002QHWOEU,"[1, 3]",3,Aggie is Angela Lansbury who carries pocketboo...,"07 5, 2014",AC4OQW3GZ919J,Cleargrace,very light murder cozy,1404518400
4,4,1776,B001A06VJ8,"[0, 1]",4,I did not expect this type of book to be in li...,"12 31, 2012",A3C9V987IQHOQD,Rjostler,Book,1356912000


In [2]:
df = data[["reviewText", "rating"]]
df.head()

Unnamed: 0,reviewText,rating
0,"Jace Rankin may be short, but he's nothing to ...",3
1,Great short read. I didn't want to put it dow...,5
2,I'll start by saying this is the first of four...,3
3,Aggie is Angela Lansbury who carries pocketboo...,3
4,I did not expect this type of book to be in li...,4


In [3]:
df.shape

(12000, 2)

In [4]:
df.isnull().sum()

reviewText    0
rating        0
dtype: int64

In [5]:
df["rating"].value_counts()

rating
5    3000
4    3000
3    2000
2    2000
1    2000
Name: count, dtype: int64

In [6]:
df['rating'].unique()

array([3, 5, 4, 2, 1], dtype=int64)

In [7]:
#preprocessing

In [8]:
# Binarize the rating: 3 and above becomes 1 (positive), below 3 becomes 0 (negative)

df = df.copy()
df['rating']=df['rating'].apply(lambda x:0 if x<3 else 1)

In [9]:
df.head()

Unnamed: 0,reviewText,rating
0,"Jace Rankin may be short, but he's nothing to ...",1
1,Great short read. I didn't want to put it dow...,1
2,I'll start by saying this is the first of four...,1
3,Aggie is Angela Lansbury who carries pocketboo...,1
4,I did not expect this type of book to be in li...,1


In [10]:
df["rating"].value_counts()

rating
1    8000
0    4000
Name: count, dtype: int64
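
The label counts above follow directly from the threshold used in the binarization cell. A minimal sketch of that arithmetic, with the per-rating counts copied from the earlier `value_counts()` output (the helper name `to_label` is hypothetical):

```python
# Reviews per star rating, copied from the value_counts() output before binarization
rating_counts = {5: 3000, 4: 3000, 3: 2000, 2: 2000, 1: 2000}


def to_label(rating):
    """Same rule as the notebook: below 3 is negative (0), otherwise positive (1)."""
    return 0 if rating < 3 else 1


label_counts = {0: 0, 1: 0}
for rating, n_reviews in rating_counts.items():
    label_counts[to_label(rating)] += n_reviews

print(label_counts)  # {0: 4000, 1: 8000}
```

Note that the 2,000 three-star reviews land in the positive class under this rule, which is one reason the classes end up at a 2:1 ratio.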

In [11]:
## 1. Lowercase all text
df['reviewText']=df['reviewText'].str.lower()

In [12]:
df.head()

Unnamed: 0,reviewText,rating
0,"jace rankin may be short, but he's nothing to ...",1
1,great short read. i didn't want to put it dow...,1
2,i'll start by saying this is the first of four...,1
3,aggie is angela lansbury who carries pocketboo...,1
4,i did not expect this type of book to be in li...,1


In [13]:
#removing special characters

In [14]:
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\chandula\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [15]:
from bs4 import BeautifulSoup

In [16]:
## Build the stopword set once; calling stopwords.words('english') for every token is very slow
stop_words = set(stopwords.words('english'))
## Remove URLs first, while '://' and '.' are still present in the text
df['reviewText']=df['reviewText'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , str(x)))
## Remove HTML tags before '<' and '>' are stripped
df['reviewText'] = df['reviewText'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())
## Remove special characters (keep letters, digits, spaces and hyphens)
df['reviewText']=df['reviewText'].apply(lambda x:re.sub('[^a-zA-Z0-9 -]+', '',x))
## Remove the stopwords
df['reviewText']=df['reviewText'].apply(lambda x:" ".join([y for y in x.split() if y not in stop_words]))
## Remove any additional spaces
df['reviewText']=df['reviewText'].apply(lambda x: " ".join(x.split()))
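
The same cleaning steps can be tried on a single string without any third-party packages. This is only a sketch under stated assumptions: `STOPWORDS` is a deliberately tiny hypothetical set standing in for NLTK's English list, a simple tag regex stands in for BeautifulSoup, and `clean_review` is an illustrative name. URLs and tags are removed before punctuation, since those patterns rely on characters such as `://` and `<`.

```python
import re

# Hypothetical, tiny stopword set; the notebook uses NLTK's English list instead.
STOPWORDS = {"a", "an", "and", "for", "is", "it", "its", "of", "the", "to"}

URL_RE = re.compile(r"(?:https?|ftp|ssh)://\S+")
TAG_RE = re.compile(r"<[^>]+>")             # crude stand-in for BeautifulSoup
NON_ALNUM_RE = re.compile(r"[^a-z0-9 -]+")  # applied after lowercasing


def clean_review(text):
    text = text.lower()
    text = URL_RE.sub(" ", text)       # drop URLs while '://' is still intact
    text = TAG_RE.sub(" ", text)       # drop HTML tags while '<' and '>' are intact
    text = NON_ALNUM_RE.sub("", text)  # drop punctuation and other symbols
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)            # split/join also collapses extra spaces


print(clean_review("Great <b>read</b>! See http://example.com/x for more, it's the BEST."))
# great read see more best
```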


In [17]:
df.head()

Unnamed: 0,reviewText,rating
0,jace rankin may short hes nothing mess man hau...,1
1,great short read didnt want put read one sitti...,1
2,ill start saying first four books wasnt expect...,1
3,aggie angela lansbury carries pocketbooks inst...,1
4,expect type book library pleased find price right,1


In [18]:
## Lemmatizer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\chandula\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [19]:
lemmatizer=WordNetLemmatizer()

In [20]:
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

In [21]:
df['reviewText']=df['reviewText'].apply(lemmatize_words)

In [22]:
df.head()

Unnamed: 0,reviewText,rating
0,jace rankin may short he nothing mess man haul...,1
1,great short read didnt want put read one sitti...,1
2,ill start saying first four book wasnt expecti...,1
3,aggie angela lansbury carry pocketbook instead...,1
4,expect type book library pleased find price right,1


In [23]:
## Train Test Split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(df['reviewText'],df['rating'],test_size=0.20)
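
The split above is unseeded and unstratified, so the class ratio in the test set varies from run to run; in the run recorded in this notebook, the confusion matrices show 827 negative and 1,573 positive test reviews. `train_test_split` accepts `random_state` and `stratify` arguments for this purpose. The standard-library sketch below only illustrates what a seeded, stratified split does; `stratified_split` is a hypothetical helper, not part of scikit-learn.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with about the same label ratio in both parts."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [1] * 8000 + [0] * 4000  # same class sizes as the binarized ratings
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))     # 9600 2400
print(sum(labels[i] for i in test_idx))  # 1600 positives in the test part
```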

In [24]:
from sklearn.feature_extraction.text import CountVectorizer
bow=CountVectorizer()
X_train_bow=bow.fit_transform(X_train).toarray()
X_test_bow=bow.transform(X_test).toarray()

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer()
X_train_tfidf=tfidf.fit_transform(X_train).toarray()
X_test_tfidf=tfidf.transform(X_test).toarray()
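
To make the two vectorizers concrete, here is a hand-rolled sketch on a two-document toy corpus (not taken from the dataset). It assumes scikit-learn's default TF-IDF settings, namely the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row; the variable and function names are illustrative.

```python
import math
from collections import Counter

docs = ["great book", "bad book"]  # toy corpus, not from the dataset
tokenized = [doc.split() for doc in docs]
vocab = sorted({term for doc in tokenized for term in doc})  # ['bad', 'book', 'great']

# Bag of words: raw term counts per document, one column per vocabulary term.
bow = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
n_docs = len(docs)
doc_freq = [sum(1 for doc in tokenized if term in doc) for term in vocab]
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]


def tfidf_row(counts):
    """Weight counts by IDF, then scale the row to unit L2 norm."""
    weights = [count * weight for count, weight in zip(counts, idf)]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights


tfidf = [tfidf_row(row) for row in bow]
print(bow)  # [[0, 1, 1], [1, 1, 0]]
print([round(w, 4) for w in tfidf[0]])
```

In the first document, "great" gets a larger weight than "book" because "book" occurs in every document and so carries less information.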

In [26]:
X_train_bow

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [27]:
from sklearn.naive_bayes import GaussianNB
nb_model_bow=GaussianNB().fit(X_train_bow,y_train)
nb_model_tfidf=GaussianNB().fit(X_train_tfidf,y_train)

In [28]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report

y_pred_bow=nb_model_bow.predict(X_test_bow)
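# TF-IDF features must go to the model trained on them. The TF-IDF figures recorded
# in the outputs of this notebook came from a run that used nb_model_bow here, so rerun to refresh them.
y_pred_tfidf=nb_model_tfidf.predict(X_test_tfidf)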

In [29]:
confusion_matrix(y_test,y_pred_bow)

array([[526, 301],
       [685, 888]], dtype=int64)

In [30]:
print("BOW accuracy: ",accuracy_score(y_test,y_pred_bow))

BOW accuracy:  0.5891666666666666


In [31]:
confusion_matrix(y_test,y_pred_tfidf)

array([[519, 308],
       [683, 890]], dtype=int64)

In [32]:
print("TFIDF accuracy: ",accuracy_score(y_test,y_pred_tfidf))

TFIDF accuracy:  0.5870833333333333


In [43]:
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
import numpy as np

In [36]:
# Tokenize text data
X_train_tokens = [simple_preprocess(text) for text in X_train]
X_test_tokens = [simple_preprocess(text) for text in X_test]

In [40]:
X_train_tokens

[['poetry', 'prose', 'short', 'story', 'comprise', 'book', 'author',
  'struggle', 'life', 'relationship', 'childhood', 'pain', 'delved',
  'including', 'dealing', 'religion', 'author', 'deal', 'stuck', 'really',
  'belonging', 'left', 'church', 'still', 'part', 'world', 'wholei',
  'admit', 'subtitle', 'memoir', 'exmormon', 'drew', 'download', 'free',
  'book', 'become', 'fascinated', 'lds', 'church', 'really', 'thought',
  'book', 'would', 'perspective', 'lds', 'doctrine', 'author', 'defected',
  'sadly', 'disappointed', 'little', 'discussed', 'church', 'fact',
  'little', 'anything', 'discussedi', 'found', 'book', 'terribly',
  'confusing', 'felt', 'like', 'real', 'cohesiveness', 'skipped', 'around',
  'lot', 'still', 'little', 'confused', 'whether', 'actual', 'memoir',
  'book', 'story', 'poetry', 'incredibly', 'misleading', 'title', 'seemed',
  ...

In [93]:
#train a word2vec model

word2vec_model = Word2Vec(sentences=X_train_tokens, vector_size=100, window=5, min_count=1, workers=4, sg=0)

In [94]:
def get_sentence_embedding(tokens, model):
    valid_vectors = [model.wv[word] for word in tokens if word in model.wv]
    if len(valid_vectors) > 0:
        return np.mean(valid_vectors, axis=0)
    else:
        return np.zeros(model.vector_size)

X_train_embeddings = np.array([get_sentence_embedding(tokens, word2vec_model) for tokens in X_train_tokens])
X_test_embeddings = np.array([get_sentence_embedding(tokens, word2vec_model) for tokens in X_test_tokens])
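
The averaging step can be checked by hand on a toy example. The sketch below uses a plain dictionary of hypothetical 3-dimensional vectors in place of `model.wv`, and the name `mean_embedding` is illustrative. Tokens without a vector are skipped, and a review with no known tokens falls back to a zero vector, mirroring `get_sentence_embedding`.

```python
# Hypothetical 3-dimensional word vectors standing in for model.wv
toy_vectors = {
    "great": [1.0, 0.0, 2.0],
    "book": [3.0, 2.0, 0.0],
}
VECTOR_SIZE = 3


def mean_embedding(tokens, vectors, size):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return [0.0] * size
    return [sum(dimension) / len(known) for dimension in zip(*known)]


print(mean_embedding(["great", "book", "unseenword"], toy_vectors, VECTOR_SIZE))  # [2.0, 1.0, 1.0]
print(mean_embedding(["unseenword"], toy_vectors, VECTOR_SIZE))                   # [0.0, 0.0, 0.0]
```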


In [95]:
nb_model_word2vec = GaussianNB().fit(X_train_embeddings, y_train)

In [96]:
y_pred_word2vec = nb_model_word2vec.predict(X_test_embeddings)

In [97]:
confusion_matrix(y_test,y_pred_word2vec)

array([[632, 195],
       [595, 978]], dtype=int64)

In [99]:
print("word2vec accuracy: ",accuracy_score(y_test,y_pred_word2vec))

word2vec accuracy:  0.6708333333333333
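
Accuracy alone is hard to interpret on imbalanced labels, so it helps to derive per-class recall and a majority-class baseline from the confusion matrices. The sketch below copies the three matrices from the outputs in this notebook and relies on the scikit-learn convention that rows are true labels and columns are predicted labels; `summarize` is a hypothetical helper.

```python
# Confusion matrices copied from the notebook outputs:
# rows = true label (0, 1), columns = predicted label (0, 1)
matrices = {
    "bow": [[526, 301], [685, 888]],
    "tfidf": [[519, 308], [683, 890]],
    "word2vec": [[632, 195], [595, 978]],
}


def summarize(cm):
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall_negative": tn / (tn + fp),
        "recall_positive": tp / (fn + tp),
        # Accuracy of always predicting the more frequent true class
        "majority_baseline": max(tn + fp, fn + tp) / total,
    }


for name, cm in matrices.items():
    stats = summarize(cm)
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

On this test split the majority baseline is about 0.655, so the recorded BOW and TF-IDF accuracies fall below it, while the word2vec accuracy is slightly above it.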
