## About Dataset
Context
This is a small subset of a dataset of book reviews from the Amazon Kindle Store category.

Content
A 5-core dataset of product reviews from the Amazon Kindle Store category, covering May 1996 to July 2014. It contains a total of 982619 entries. "5-core" means that each reviewer has at least 5 reviews and each product has at least 5 reviews in this dataset.
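
The 5-core property can be checked or re-created with a short filter. The following is a minimal sketch, not the procedure used to build the original data: it assumes reviews are given as `(reviewerID, asin)` pairs, the helper name `k_core` is hypothetical, and the IDs in the toy data are made up.

```python
from collections import Counter

def k_core(pairs, k=5):
    """Repeatedly drop reviews whose reviewer or product has fewer than k reviews."""
    pairs = list(pairs)
    while True:
        reviewer_counts = Counter(reviewer for reviewer, _ in pairs)
        product_counts = Counter(product for _, product in pairs)
        kept = [(reviewer, product) for reviewer, product in pairs
                if reviewer_counts[reviewer] >= k and product_counts[product] >= k]
        # Removing one review can push another reviewer/product below k,
        # so iterate until nothing changes.
        if len(kept) == len(pairs):
            return kept
        pairs = kept

# Toy data with made-up IDs: a dense 2x2 block plus one stray review.
reviews = [("u1", "p1"), ("u1", "p2"), ("u2", "p1"), ("u2", "p2"), ("u3", "p1")]
core = k_core(reviews, k=2)
print(core)
```

With `k=2` the stray review by `u3` is dropped and the remaining four reviews form the 2-core. The loop matters because a single pass is not always enough.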
Columns

- asin - ID of the product, like B000FA64PK
- helpful - helpfulness rating of the review - example: 2/3.
- overall - rating of the product (the code below reads this column as `rating`).
- reviewText - full text of the review (body).
- reviewTime - time of the review (raw).
- reviewerID - ID of the reviewer, like A3SPTOKDG7WBLN
- reviewerName - name of the reviewer.
- summary - short summary of the review (heading).
- unixReviewTime - unix timestamp.
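
Two of these columns usually need decoding before analysis. The sketch below assumes `helpful` is stored as a string such as `2/3` (as in the example above; the raw files may encode it differently) and that `unixReviewTime` is in seconds. The helper names `helpful_ratio` and `review_date` are hypothetical.

```python
from datetime import datetime, timezone

def helpful_ratio(helpful):
    """Turn a 'helpful_votes/total_votes' string such as '2/3' into a fraction."""
    up, total = (int(part) for part in helpful.split("/"))
    # Reviews with no votes have no meaningful ratio.
    return up / total if total else None

def review_date(unix_review_time):
    """Convert a unix timestamp (seconds) into a UTC calendar date."""
    return datetime.fromtimestamp(unix_review_time, tz=timezone.utc).date()

ratio = helpful_ratio("2/3")
no_votes = helpful_ratio("0/0")
day = review_date(1399248000)
print(ratio, no_votes, day)
```

Returning `None` for reviews without votes keeps them distinguishable from reviews that were voted unhelpful.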

Acknowledgements
This dataset is taken from the Amazon product data published by Julian McAuley (UCSD): http://jmcauley.ucsd.edu/data/amazon/

The license to the data files belongs to them.

Inspiration
- Sentiment analysis on reviews.
- Understanding how people rate the usefulness of a review, and which factors influence a review's helpfulness.
- Detecting fake reviews and outliers.
- Best-rated product IDs, or similarity between products based on reviews alone (admittedly not the best idea).
- Any other interesting analysis.

In [None]:
import pandas as pd
data=pd.read_csv('Kindle Reviews/all_kindle_review.csv')
data.head()

In [None]:
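# .copy() gives an independent frame, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning
df = data[['reviewText', 'rating']].copy()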
df.head()

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
df['rating'].unique()

In [None]:
df['rating'].value_counts()

Preprocessing and Cleaning
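The star ratings are collapsed into three sentiment labels: ratings above 3 become 1 (positive), a rating of exactly 3 becomes 0 (neutral), and everything below becomes -1 (negative). As a small illustration of that rule on its own, with the hypothetical helper name `to_sentiment`:

```python
def to_sentiment(rating):
    """Map a 1-5 star rating to -1 (negative), 0 (neutral) or 1 (positive)."""
    if rating > 3:
        return 1
    if rating == 3:
        return 0
    return -1

labels = [to_sentiment(r) for r in [1, 2, 3, 4, 5]]
print(labels)
```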

In [None]:
df['rating'] = df['rating'].apply(lambda x: 1 if x > 3 else (0 if x == 3 else -1))

In [None]:
df.head()

In [None]:
df['rating'].value_counts()

Review text preprocessing
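The order of the cleaning steps matters: URLs and HTML tags have to be removed while their punctuation is still present, because stripping special characters first leaves nothing for those patterns to match. The following is a self-contained sketch of that ordering using only the standard library. The simple tag regex stands in for BeautifulSoup, `STOP_WORDS` is a made-up mini list rather than the NLTK stopword list, and `clean_text` is a hypothetical helper name.

```python
import re
from html import unescape

URL_RE = re.compile(r"(?:https?|ftp|ssh)://\S+")
TAG_RE = re.compile(r"<[^>]+>")
STOP_WORDS = {"the", "a", "is", "this", "at", "it", "and"}  # hypothetical mini list

def clean_text(text):
    text = text.lower()
    text = URL_RE.sub(" ", text)             # 1. URLs, while '://' is still intact
    text = TAG_RE.sub(" ", text)             # 2. HTML tags, while '<' and '>' are intact
    text = unescape(text)                    #    decode entities such as &amp;
    text = re.sub(r"[^a-z0-9]", " ", text)   # 3. remaining special characters
    tokens = [w for w in text.split() if w not in STOP_WORDS]  # 4. stopwords
    return " ".join(tokens)                  # 5. joining also collapses extra spaces

cleaned = clean_text(
    "This book is <b>GREAT</b>!   See http://example.com/review?id=1 &amp; enjoy it."
)
print(cleaned)
```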

In [None]:
df['reviewText'] = df['reviewText'].str.lower()

In [None]:
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

from bs4 import BeautifulSoup

In [None]:
# Removing URLs (must run before special characters are stripped, otherwise the pattern can no longer match)
df['reviewText'] = df['reviewText'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', str(x)))

# Removing HTML tags (also needs the angle brackets to still be present)
df['reviewText'] = df['reviewText'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())

In [None]:
# Removing special characters
df['reviewText'] = df['reviewText'].apply(lambda x: re.sub('[^a-zA-Z0-9]', ' ', x))

# Removing stopwords (build the set once instead of reloading the list for every word)
stop_words = set(stopwords.words('english'))
df['reviewText'] = df['reviewText'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

In [None]:
df.head()

In [None]:
## Remove any additional spaces
df['reviewText']=df['reviewText'].apply(lambda x: " ".join(x.split()))

In [None]:
df.head()

In [None]:
## Lemmatizer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # WordNetLemmatizer needs the WordNet corpus

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
def lemmatize_words(doc):
    return ' '.join([lemmatizer.lemmatize(word) for word in doc.split()])

In [None]:
df['reviewText'] = df['reviewText'].apply(lemmatize_words)

In [None]:
df.head()

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['reviewText'], df['rating'], test_size=0.2, random_state=42)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

bow = CountVectorizer()
X_train_bow = bow.fit_transform(X_train).toarray()
X_test_bow = bow.transform(X_test).toarray()

tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train).toarray()
X_test_tfidf = tfidf.transform(X_test).toarray()

ML Model
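The cells below report accuracy alongside a per-class classification report. The two can disagree when the classes are imbalanced, which is likely here because there are two positive and two negative star values but only one neutral one. The sketch below uses made-up labels and a hypothetical helper `per_class_metrics` to show how a model that always predicts the positive class can still score a reasonable accuracy.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall)} computed from raw label lists."""
    true_count = Counter(y_true)
    pred_count = Counter(y_pred)
    true_positive = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        precision = true_positive[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_positive[label] / true_count[label] if true_count[label] else 0.0
        report[label] = (precision, recall)
    return report

# Made-up labels: mostly positive reviews, and a model that always predicts 1.
y_true = [1, 1, 1, 1, 0, -1]
y_pred = [1, 1, 1, 1, 1, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report = per_class_metrics(y_true, y_pred)
print(accuracy, report)
```

Accuracy is 4 out of 6 here, yet recall for the neutral and negative classes is zero, which is why the per-class report is worth reading next to the headline number.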

In [None]:
from sklearn.naive_bayes import GaussianNB

gnb_bow = GaussianNB().fit(X_train_bow, y_train)
gnb_tfidf = GaussianNB().fit(X_train_tfidf, y_train)

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print('Bow accuracy:', accuracy_score(y_test, gnb_bow.predict(X_test_bow)))

In [None]:
print('Bow Classification Report:\n', classification_report(y_test, gnb_bow.predict(X_test_bow)))

In [None]:
print('TF-IDF accuracy:', accuracy_score(y_test, gnb_tfidf.predict(X_test_tfidf)))

In [None]:
print('TF-IDF Classification Report:\n', classification_report(y_test, gnb_tfidf.predict(X_test_tfidf)))

In [None]:
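from nltk import sent_tokenize
from gensim.utils import simple_preprocess
nltk.download('punkt')  # sent_tokenize needs the Punkt models (recent NLTK releases may ask for 'punkt_tab' instead)

In [None]:
words = []
for review in df['reviewText']:
    # Collect every sentence's tokens into one list per review, so that
    # words[i] stays aligned with row i of df (and with df['rating'])
    tokens = []
    for sent in sent_tokenize(review):
        tokens.extend(simple_preprocess(sent))
    words.append(tokens)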

In [None]:
len(words[1])

In [None]:
import gensim

In [None]:
model = gensim.models.Word2Vec(words, vector_size=100, min_count=1)

In [None]:
model.wv.index_to_key

In [None]:
model.corpus_count

In [None]:
model.epochs

In [None]:
model.wv.most_similar('great')

In [None]:
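import numpy as np

def avg_word2vec(doc):
    # key_to_index is a dict, so membership tests are O(1);
    # index_to_key is a list and would be scanned for every word
    valid_vectors = [model.wv[word] for word in doc if word in model.wv.key_to_index]
    if valid_vectors:
        return np.mean(valid_vectors, axis=0)
    else:
        # Return zero vector instead of NaN for empty documents
        return np.zeros(model.wv.vector_size)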

In [None]:
from tqdm import tqdm
import numpy as np
X = []
for i in tqdm(range(len(words))):
    X.append(avg_word2vec(words[i]))

In [None]:
X[0]

In [None]:
# Stack all document vectors at once; concatenating row by row in a loop
# copies the whole frame on every iteration
df_new = pd.DataFrame(np.vstack(X))

In [None]:
df_new['rating'] = df['rating']

In [None]:
df_new.isnull().sum()

In [None]:
X = df_new.drop('rating', axis=1)
y = df_new['rating']

In [None]:
X.head()

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X_train.head()

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(random_state=42)

In [None]:
classifier.fit(X_train, y_train)

In [None]:
y_pred = classifier.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score,classification_report
print("Accuracy:", accuracy_score(y_test, y_pred))

In [None]:
print(f"classification report:\n {classification_report(y_test, y_pred)}")