## About Dataset
Context This is a small subset of dataset of Book reviews from Amazon Kindle Store category.

Content 5-core dataset of product reviews from Amazon Kindle Store category from May 1996 - July 2014. Contains total of 982619 entries. Each reviewer has at least 5 reviews and each product has at least 5 reviews in this dataset. Columns

- asin - ID of the product, like B000FA64PK
- helpful - helpfulness rating of the review - example: 2/3.
- overall - rating of the product.
- reviewText - text of the review (heading).
- reviewTime - time of the review (raw).
- reviewerID - ID of the reviewer, like A3SPTOKDG7WBLN
- reviewerName - name of the reviewer.
- summary - summary of the review (description).
- unixReviewTime - unix timestamp.
- Acknowledgements This dataset is taken from Amazon product data, Julian - McAuley, UCSD website. http://jmcauley.ucsd.edu/data/amazon/

## License to the data files belong to them.

Inspiration

Sentiment analysis on reviews.
Understanding how people rate usefulness of a review/ What factors influence helpfulness of a review.
Fake reviews/ outliers.
Best rated product IDs, or similarity between products based on reviews alone (not the best idea ikr).
Any other interesting analysis

## Best Practises
- Preprocessing And Cleaning
- Train Test Split
- BOW,TFIDF,Word2vec
- Train ML algorithms

In [1]:
import pandas as pd
df = pd.read_csv('../data/all_kindle_review.csv')
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,asin,helpful,rating,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,11539,B0033UV8HI,"[8, 10]",3,"Jace Rankin may be short, but he's nothing to ...","09 2, 2010",A3HHXRELK8BHQG,Ridley,Entertaining But Average,1283385600
1,1,5957,B002HJV4DE,"[1, 1]",5,Great short read. I didn't want to put it dow...,"10 8, 2013",A2RGNZ0TRF578I,Holly Butler,Terrific menage scenes!,1381190400
2,2,9146,B002ZG96I4,"[0, 0]",3,I'll start by saying this is the first of four...,"04 11, 2014",A3S0H2HV6U1I7F,Merissa,Snapdragon Alley,1397174400
3,3,7038,B002QHWOEU,"[1, 3]",3,Aggie is Angela Lansbury who carries pocketboo...,"07 5, 2014",AC4OQW3GZ919J,Cleargrace,very light murder cozy,1404518400
4,4,1776,B001A06VJ8,"[0, 1]",4,I did not expect this type of book to be in li...,"12 31, 2012",A3C9V987IQHOQD,Rjostler,Book,1356912000


In [2]:
df = df[['reviewText', 'rating']]
df.head()

Unnamed: 0,reviewText,rating
0,"Jace Rankin may be short, but he's nothing to ...",3
1,Great short read. I didn't want to put it dow...,5
2,I'll start by saying this is the first of four...,3
3,Aggie is Angela Lansbury who carries pocketboo...,3
4,I did not expect this type of book to be in li...,4


In [13]:
## postive review is 1 and negative review is 0
df['rating']=df['rating'].apply(lambda x:0 if x<3 else 1)

In [3]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [4]:
import re
from bs4 import BeautifulSoup
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [5]:
corpus = []
for i in range(len(df)):
    review = df['reviewText'][i]
    review = review.lower()
    review = re.sub('[^a-zA-Z]', ' ', review)
    review = re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , review)
    review = BeautifulSoup(review).get_text()
    review = review.split()
    review = [lemmatizer.lemmatize(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)
    corpus.append(review)


In [6]:
corpus

['jace rankin may short nothing mess man hauled saloon undertaker know famous bounty hunter oregon shot man saloon finished year long quest avenge sister murder trying figure next snotty nosed farm boy rescued gang bully offer money kill man forced ranch reluctantly agrees bring man justice kill outright first need tell sister widower news kyla kyle springer bailey riding trail sleeping ground past month trying find jace want revenge man killed husband took ranch amongst crime keen detour jace want take realizes option hide behind boy persona best try keep pace confrontation along way get shot jace discovers kyle kyla come clean whole reason need scoundrel dead hope still help book share touching moment slow blooming romance kyla find good reason fear men hide behind boy persona watching jace slowly pull shell help conquer fear endearing pain real deeply rooted disappear face sexiness neither understandable aversion marriage magically disappear round nookie would man drifted town town 

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(corpus, df['rating'], test_size = 0.20, random_state = 0)


In [8]:
!pip install gensim



In [15]:
print(f"Jumlah data latih: {len(X_train)}")
print(f"Jumlah data uji: {len(X_test)}")

from gensim.models import Word2Vec

w2v_model = Word2Vec(
    sentences=X_train,
    vector_size=100,
    window=5,
    min_count=1,
    workers=4
)



Jumlah data latih: 9600
Jumlah data uji: 2400


In [16]:
def create_review_vector(tokens, model):
    vector = np.zeros(model.vector_size)
    count = 0
    for word in tokens:
        # Cek apakah kata ada di dalam vocabulary model Word2Vec
        if word in model.wv:
            vector += model.wv[word]
            count += 1
    if count != 0:
        vector /= count
    return vector

In [17]:
import numpy as np
X_train_vec = np.array([create_review_vector(tokens, w2v_model) for tokens in X_train])
X_test_vec = np.array([create_review_vector(tokens, w2v_model) for tokens in X_test])


In [20]:
!pip install xgboost



In [24]:
## mengimport model yang akan digunakan
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

# mengimport untuk perfomance model
from sklearn.metrics import accuracy_score, classification_report,ConfusionMatrixDisplay, precision_score, recall_score, f1_score, roc_auc_score,roc_curve

In [None]:
## menyiapkan model dalam bentuk dictionary
models={
    "Logisitic Regression":LogisticRegression(),
    "Decision Tree":DecisionTreeClassifier(),
    "Random Forest":RandomForestClassifier(),
    "Gradient Boost":GradientBoostingClassifier(),
    "Adaboost":AdaBoostClassifier(),
    "Xgboost":XGBClassifier() 
}

# melakukan fit, pred, dan perfomance pada setiap model di dalam dictionary
for i in range(len(list(models))):
    model = list(models.values())[i]
    model.fit(X_train_vec, y_train) # Train model

    # Make predictions
    y_train_pred = model.predict(X_train_vec)
    y_test_pred = model.predict(X_test_vec)

    # Training set performance
    model_train_accuracy = accuracy_score(y_train, y_train_pred) # Calculate Accuracy
    model_train_f1 = f1_score(y_train, y_train_pred, average='weighted') # Calculate F1-score
    model_train_precision = precision_score(y_train, y_train_pred) # Calculate Precision
    model_train_recall = recall_score(y_train, y_train_pred) # Calculate Recall
    model_train_rocauc_score = roc_auc_score(y_train, y_train_pred)


    # Test set performance
    model_test_accuracy = accuracy_score(y_test, y_test_pred) # Calculate Accuracy
    model_test_f1 = f1_score(y_test, y_test_pred, average='weighted') # Calculate F1-score
    model_test_precision = precision_score(y_test, y_test_pred) # Calculate Precision
    model_test_recall = recall_score(y_test, y_test_pred) # Calculate Recall
    model_test_rocauc_score = roc_auc_score(y_test, y_test_pred) #Calculate Roc


    print(list(models.keys())[i])

    print('Model performance for Training set')
    print("- Accuracy: {:.4f}".format(model_train_accuracy))
    print('- F1 score: {:.4f}'.format(model_train_f1))

    print('- Precision: {:.4f}'.format(model_train_precision))
    print('- Recall: {:.4f}'.format(model_train_recall))
    print('- Roc Auc Score: {:.4f}'.format(model_train_rocauc_score))



    print('----------------------------------')

    print('Model performance for Test set')
    print('- Accuracy: {:.4f}'.format(model_test_accuracy))
    print('- F1 score: {:.4f}'.format(model_test_f1))
    print('- Precision: {:.4f}'.format(model_test_precision))
    print('- Recall: {:.4f}'.format(model_test_recall))
    print('- Roc Auc Score: {:.4f}'.format(model_test_rocauc_score))


    print('='*35)
    print('\n')

Logisitic Regression
Model performance for Training set
- Accuracy: 0.6658
- F1 score: 0.5462
- Precision: 0.6681
- Recall: 0.9887
- Roc Auc Score: 0.5068
----------------------------------
Model performance for Test set
- Accuracy: 0.6721
- F1 score: 0.5551
- Precision: 0.6757
- Recall: 0.9864
- Roc Auc Score: 0.5053


Decision Tree
Model performance for Training set
- Accuracy: 1.0000
- F1 score: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- Roc Auc Score: 1.0000
----------------------------------
Model performance for Test set
- Accuracy: 0.5679
- F1 score: 0.5719
- Precision: 0.6862
- Recall: 0.6603
- Roc Auc Score: 0.5189


Random Forest
Model performance for Training set
- Accuracy: 1.0000
- F1 score: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- Roc Auc Score: 1.0000
----------------------------------
Model performance for Test set
- Accuracy: 0.6650
- F1 score: 0.5944
- Precision: 0.6857
- Recall: 0.9276
- Roc Auc Score: 0.5257


Gradient Boost
Model performance for Training se

In [None]:
from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning for Gradient Boosting
gb_params = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.05],
    'max_depth': [3, 4, 5]
}
gb_grid = GridSearchCV(GradientBoostingClassifier(), gb_params, cv=3, scoring='accuracy')
gb_grid.fit(X_train_vec, y_train)

print("Best parameters for Gradient Boosting:", gb_grid.best_params_)
print("Best score for Gradient Boosting:", gb_grid.best_score_)

# Hyperparameter tuning for XGBoost
xgb_params = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.05],
    'max_depth': [3, 4, 5]
}
xgb_grid = GridSearchCV(XGBClassifier(), xgb_params, cv=3, scoring='accuracy')
xgb_grid.fit(X_train_vec, y_train)

print("Best parameters for XGBoost:", xgb_grid.best_params_)
print("Best score for XGBoost:", xgb_grid.best_score_)

In [29]:
best_gb_model = GradientBoostingClassifier()
best_gb_model

In [33]:
# Fit the model before making predictions
best_gb_model.fit(X_train_vec, y_train)

In [34]:
def predict_sentiment(text, model, w2v_model, lemmatizer, stopwords_set):
    # Preprocess the text
    review = text.lower()
    review = re.sub('[^a-zA-Z]', ' ', review)
    review = re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , review)
    review = BeautifulSoup(review, "html.parser").get_text()
    review = review.split()
    review = [lemmatizer.lemmatize(word) for word in review if word not in stopwords_set]
    review = ' '.join(review)

    # Convert the preprocessed text to a vector
    review_vector = create_review_vector(review.split(), w2v_model)
    review_vector = review_vector.reshape(1, -1) # Reshape for prediction

    # Predict sentiment
    prediction = model.predict(review_vector)

    return "Positive" if prediction[0] == 1 else "Negative"

In [41]:
# Example usage:
new_review = "This book was amazing! I loved it."
stopwords_set = set(stopwords.words('english')) # Define stopwords_set here


predicted_sentiment = predict_sentiment(new_review, best_gb_model, w2v_model, lemmatizer, stopwords_set)
print(f"The sentiment of the review is: {predicted_sentiment}")

new_review_2 = "story old fashion feel sometimes story get bit slow liked main character strength determination find happened husband admirable book mix bit mystery crime investigation political cover magic together"
predicted_sentiment_2 = predict_sentiment(new_review_2, best_gb_model, w2v_model, lemmatizer, stopwords_set)
print(f"The sentiment of the review is: {predicted_sentiment_2}")

The sentiment of the review is: Positive
The sentiment of the review is: Positive
