<a href="https://colab.research.google.com/github/TheAmirHK/NLP_Projects/blob/master/Worth_a_Read_(NLP)_based_on_Goodreads_reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

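This notebook uses Selenium to scrape reviews from a Goodreads book page and returns them as a list. It then summarizes the reviews, labels them with keyword rules, trains a small sentiment classifier on those labels, and reports whether the book looks worth reading.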


In [1]:
!pip install selenium
!pip install google-colab-selenium
!pip install sumy



In [2]:
import time

import google_colab_selenium as gs
from selenium.webdriver.common.by import By

In [4]:
def scrape_goodreads_reviews_selenium(book_url, num_reviews=10):
    """Return [page_title, review_1, ..., review_n] scraped from a Goodreads page."""
    reviews = []
    driver = None
    try:
        driver = gs.Chrome()

        driver.get(book_url)
        # The page title goes first; the book title is parsed from it later.
        reviews.append(driver.title)

        time.sleep(2)  # give the page time to render

        scroll_pause_time = 1
        screen_height = driver.execute_script("return window.screen.height;")
        i = 1

        # Scroll one screen at a time so content loaded on scroll can render.
        while True:
            driver.execute_script(f"window.scrollTo(0, {screen_height * i});")
            i += 1
            time.sleep(scroll_pause_time)
            scroll_height = driver.execute_script("return document.body.scrollHeight;")
            if screen_height * i > scroll_height:
                break

        review_elements = driver.find_elements(By.CLASS_NAME, 'ReviewText')
        for review in review_elements[:num_reviews]:
            reviews.append(review.text)
    except Exception as e:
        print(f"Error while scraping: {e}")
    finally:
        # Close the browser even if scraping failed part-way.
        if driver is not None:
            driver.quit()
    return reviews

book_url = input("Enter the Goodreads URL: ")
reviews = scrape_goodreads_reviews_selenium(book_url, num_reviews=50)
# reviews[0] is the page title, e.g. "<book> book reviews | Goodreads".
actualTitle = reviews[0].split(" book reviews ")[0]

# Print the scraped entries (the first one is the page title)
print(f"Scraped {len(reviews)} reviews:")
for i, review in enumerate(reviews, 1):
    print(f"{i}: {review}")


Enter the Goodreads URL: https://www.goodreads.com/book/show/60194162/reviews?reviewFilters=eyJhZnRlciI6Ik1UTTRNVFVzTVRZMk56WTRNak15TURreE5nIn0%3D


Scraped 51 reviews:
1: Demon Copperhead by Barbara Kingsolver book reviews | Goodreads
2: Now Winner of the Women's Prize for Fiction 2023 *aaargh*
Joint Winner of the Pulitzer Prize 2023 with the equally simplistic Trust
A twist on David Copperfield, focusing on an Appalachian boy whose life is overshadowed by the opioid epidemic? That sounds like a fantastic idea. And Kingsolver does a great job crafting Demon Copperhead's voice, making the resourceful boy (and later young man) sound witty, empathetic, and engaging, infusing his whole vibe with some and . Of course we're also dealing with an important topic, and Kingsolver is here to give the overlooked and left-behind a voice - but this is where the problem starts.

I am not at all saying that a successful, famous writer is generally unable to convey what it means to be a person like destitute Demon in the rural South, it just doesn't work very well in this particular text, and the reason can be found in the plot itself: While Kings
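The scroll loop in the scraping cell stops only once the scroll position passes the page height, so a page that keeps growing would keep it running. The following is a minimal sketch of a capped variant, not the notebook's own code: `scroll_to_bottom`, `max_scrolls`, and `FakeDriver` are hypothetical names introduced here, and the fake driver exists only so the loop can be exercised without a browser.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_scrolls=50):
    """Scroll one screen at a time until the bottom is reached or the cap is hit.

    `driver` only needs an `execute_script(js)` method, as Selenium drivers have.
    Returns the number of scroll steps performed.
    """
    screen_height = driver.execute_script("return window.screen.height;")
    steps = 0
    while steps < max_scrolls:
        steps += 1
        driver.execute_script(f"window.scrollTo(0, {screen_height * steps});")
        time.sleep(pause)
        # Re-read the height each time: the page may grow as content loads.
        scroll_height = driver.execute_script("return document.body.scrollHeight;")
        if screen_height * steps >= scroll_height:
            break
    return steps


class FakeDriver:
    """Stand-in for a browser: a 1000 px screen and a page that grows once."""

    def __init__(self):
        self.page_height = 2500
        self.grown = False

    def execute_script(self, js):
        if "screen.height" in js:
            return 1000
        if "scrollHeight" in js:
            return self.page_height
        if "scrollTo" in js and not self.grown:
            self.page_height += 1000  # simulate content loaded on the first scroll
            self.grown = True
        return None


steps_taken = scroll_to_bottom(FakeDriver(), pause=0)
print(steps_taken)
```

With a real driver, the same function could be called in place of the inline loop; the cap trades completeness for a guaranteed stop.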

In [5]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.tokenizers import Tokenizer
import nltk

nltk.download('punkt_tab')

def summarize_reviews(reviews, num_sentences=5):
    # Join all reviews into one string
    reviews_text = " ".join(reviews)

    try:
        parser = PlaintextParser.from_string(reviews_text, Tokenizer("english"))
        summarizer = LsaSummarizer()
        summary_sentences = summarizer(parser.document, num_sentences)
        summary = " ".join(str(sentence) for sentence in summary_sentences)
        return summary
    except Exception as e:
        return f"Error summarizing text: {str(e)}"

summary = summarize_reviews(reviews)
print("Summary of Reviews:")
print(summary)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Summary of Reviews:
Kingsolver has created a very particular world, inhabited it with people you can’t forget, and coerced from me an emotional investment that I’m not often willing to make in fiction. You can talk about substance abuse, poverty, domestic violence, assault, the foster care system, and ineffective social services without turning an already misrepresented population into sideshow freaks. Narrated entirely by Demon – his strength of character, enduring, resilience spirit, and ability to find humour in any situation that was thrown at him, made this an unforgettable read. Show more Tragic and heartbreaking The story takes place mostly in southern Appalachian Lee County, Virginia where Demon (Damon) was born to a teen mother with addiction. Show more Update: This book did win Kingsolver her long-awaited Pulitzer (along with Hernan Diaz’s Trust) so, for the first time, I made a good guess :-)
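The summary above still contains "Show more" fragments, and the list passed to the summarizer starts with the page title rather than a review. Below is a small sketch of one way to tidy the scraped list before summarizing. It assumes that "Show more" is expand-button text captured with the review body; `split_title_and_reviews` and `SHOW_MORE` are hypothetical names, not part of the notebook.

```python
import re

# Assumption: "Show more" is expand-button text captured with the review body.
SHOW_MORE = re.compile(r"\s*\bShow more\b\s*")


def split_title_and_reviews(scraped):
    """Split [page_title, review, ...] into (book_title, cleaned_reviews)."""
    if not scraped:
        return "", []
    book_title = scraped[0].split(" book reviews ")[0]
    cleaned = []
    for text in scraped[1:]:
        text = SHOW_MORE.sub(" ", text).strip()
        if text:  # skip entries that were nothing but button text
            cleaned.append(text)
    return book_title, cleaned


sample = [
    "Demon Copperhead by Barbara Kingsolver book reviews | Goodreads",
    "An unforgettable read. Show more",
    "Show more Tragic and heartbreaking",
]
title, cleaned_reviews = split_title_and_reviews(sample)
print(title)
print(cleaned_reviews)
```

A review that genuinely uses the phrase "Show more" would also be altered, so this is a heuristic rather than a safe transformation.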


In [6]:
import re

def preprocess_text(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = text.lower()
    text = ' '.join(text.split())
    return text

preprocessed_reviews = [preprocess_text(review) for review in reviews]
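To make the effect of `preprocess_text` concrete, here is a short illustration on a made-up sentence. The function body repeats the cell above so the snippet runs on its own; the example strings are invented for demonstration.

```python
import re


def preprocess_text(text):
    # Same steps as the cell above: keep ASCII letters and whitespace,
    # lowercase, then collapse runs of whitespace.
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = text.lower()
    return ' '.join(text.split())


example = "A Well-Written, page-turner!! 10/10"
cleaned_example = preprocess_text(example)
print(cleaned_example)  # a wellwritten pageturner
```

Because hyphens and apostrophes are deleted rather than replaced by spaces, "well-written" becomes "wellwritten" and "can’t" becomes "cant", which matters for any later step that looks for hyphenated keywords.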

In [7]:
def label_reviews(reviews):
    labels = []
    positive_keywords = [
        "exciting", "gripping", "captivating", "engaging", "amazing", "brilliant", "wonderful",
        "fantastic", "unforgettable", "well-written", "masterpiece", "incredible", "thrilling",
        "impressive", "heartwarming", "thought-provoking", "enjoyable", "powerful", "compelling",
        "beautiful", "breathtaking", "page-turner", "intriguing", "spellbinding", "inspirational",
        "charming", "unpredictable", "vivid", "well-paced", "moving", "creative", "emotional",
        "excels", "flawless", "richly detailed", "uplifting", "poignant", "immersive", "thoughtful",
        "enlightening"
    ]
    negative_keywords = [
        "boring", "predictable", "disappointing", "poorly written", "unimpressive", "slow",
        "terrible", "awful", "weak", "unoriginal", "confusing", "lackluster", "dragging",
        "forgettable", "frustrating", "painful", "underwhelming", "monotonous", "dull",
        "disconnected", "unrealistic", "shallow", "lack of depth", "unengaging",
        "unconvincing", "clunky", "uninteresting", "overrated", "unpleasant", "bland", "empty",
        "tedious", "badly paced", "unstructured", "uncompelling", "incoherent", "unsatisfying",
        "frivolous", "mediocre", "disjointed", "unmemorable", "unfocused", "poor"
    ]

    for review in reviews:
        if any(word in review.lower() for word in positive_keywords):
            labels.append("Positive")
        elif any(word in review.lower() for word in negative_keywords):
            labels.append("Negative")
        else:
            labels.append(None)
    return labels

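# Label the raw reviews: preprocess_text strips hyphens, so keywords such as
# "well-written" would no longer match the preprocessed text.
labels = label_reviews(reviews)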

# Filter out neutral reviews
labeled_data = [(reviews[i], labels[i]) for i in range(len(labels)) if labels[i] is not None]
texts, sentiments = zip(*labeled_data)
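The labelling rule uses substring tests, so a keyword can fire inside a longer word ("moving" inside "removing", "weak" inside "tweak"). The sketch below shows a whole-word alternative built on regular-expression word boundaries. It is an illustration under assumptions, not the notebook's method: the keyword lists are shortened samples, and `build_matcher` and `label_review` are hypothetical helpers.

```python
import re

# Shortened sample lists; the notebook's full keyword lists could be passed instead.
POSITIVE = ["moving", "well-written", "gripping"]
NEGATIVE = ["slow", "weak", "poorly written"]


def build_matcher(keywords):
    """Compile one regex that matches any keyword as a whole word or phrase."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


POSITIVE_RE = build_matcher(POSITIVE)
NEGATIVE_RE = build_matcher(NEGATIVE)


def label_review(text):
    # Positive keywords take precedence, mirroring the if/elif order above.
    if POSITIVE_RE.search(text):
        return "Positive"
    if NEGATIVE_RE.search(text):
        return "Negative"
    return None


print(label_review("I kept removing bookmarks to tweak my notes."))  # None
print(label_review("A deeply moving, well-written story."))          # Positive
print(label_review("The middle section is slow."))                   # Negative
```

Whole-word matching also misses inflected forms such as "slowly", so whether it labels better than substring matching depends on the keyword list.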

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Convert text into numerical features
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(texts)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, sentiments, test_size=0.2, random_state=42)

# Train a Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)

Accuracy: 0.8333333333333334
Classification Report:
               precision    recall  f1-score   support

    Negative       0.00      0.00      0.00         1
    Positive       0.83      1.00      0.91         5

    accuracy                           0.83         6
   macro avg       0.42      0.50      0.45         6
weighted avg       0.69      0.83      0.76         6
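The numbers in this report can be reproduced by hand, which helps in reading them. A Positive recall of 1.00 on 5 samples together with a Positive precision of 0.83 (5/6) implies that all six test reviews were predicted "Positive", including the single Negative one. The sketch below recomputes the metrics from those implied counts in plain Python; `precision_recall_f1` is a helper written for this illustration.

```python
# Counts implied by the report: 5 Positive and 1 Negative test samples,
# all six predicted as "Positive".
y_true = ["Positive"] * 5 + ["Negative"]
y_pred = ["Positive"] * 6


def precision_recall_f1(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    # Report 0.0 when a denominator is zero (the ill-defined case).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


pos = precision_recall_f1(y_true, y_pred, "Positive")
neg = precision_recall_f1(y_true, y_pred, "Negative")
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = (pos[2] + neg[2]) / 2
weighted_f1 = (5 * pos[2] + 1 * neg[2]) / 6

print(f"Positive: precision={pos[0]:.2f} recall={pos[1]:.2f} f1={pos[2]:.2f}")
print(f"Negative: precision={neg[0]:.2f} recall={neg[1]:.2f} f1={neg[2]:.2f}")
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f} weighted_f1={weighted_f1:.2f}")
```

An accuracy of 0.83 is therefore exactly what always predicting the majority class would score on this six-sample split, so the figure says little about how well the model separates the two classes.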





In [9]:
def analyze_book_reviews(reviews, model, vectorizer):
    """Return the percentage of reviews the model predicts as 'Positive'."""
    # Note: the vectorizer was fitted on the raw texts, whereas prediction here
    # runs on preprocess_text output, so the two token streams can differ.
    preprocessed_reviews = [preprocess_text(review) for review in reviews]
    features = vectorizer.transform(preprocessed_reviews)
    predictions = model.predict(features)
    positive_count = predictions.tolist().count('Positive')

    return positive_count / len(predictions) * 100  # percentage of positive predictions

# Rate the book by its share of positive predictions
positive_percentage = analyze_book_reviews(reviews, model, vectorizer)
# print(f"Percentage of positive reviews: {positive_percentage:.2f}%")
if positive_percentage > 70:
    print(f"The book {actualTitle} is likely WORTH {positive_percentage:.0f}% reading !")
else:
    print(f"The book {actualTitle} is likely NOT worth reading.")

The book Demon Copperhead by Barbara Kingsolver is likely WORTH 100% reading !
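The verdict is a threshold rule: the book counts as worth reading when more than 70% of the predictions are "Positive". A minimal sketch of that rule as a standalone function follows, with a guard for an empty prediction list. `verdict`, its `threshold` parameter, and the reworded messages are assumptions made for this example, and the sample labels are invented.

```python
def verdict(predictions, title, threshold=70.0):
    """Return a worth-reading message from a list of predicted labels."""
    if not predictions:
        # Avoid dividing by zero when nothing was scraped or predicted.
        return f"No reviews available to rate {title}."
    positive_percentage = predictions.count("Positive") / len(predictions) * 100
    if positive_percentage > threshold:
        return f"The book {title} is likely worth reading ({positive_percentage:.0f}% positive)."
    return f"The book {title} is likely NOT worth reading ({positive_percentage:.0f}% positive)."


print(verdict(["Positive", "Positive", "Negative", "Positive"], "Example Book"))
print(verdict(["Positive", "Negative"], "Example Book"))
print(verdict([], "Example Book"))
```

Passing `model.predict(features).tolist()` as `predictions` would connect this to the cell above; keeping the threshold as a parameter makes the cut-off easy to vary.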


