This notebook addresses two subquestions: (1) Are readers' reviews of science fiction books on Goodreads generally more positive or more negative? (2) Are the variables of a book review, such as its rating, its length, and the support it receives, correlated with one another?

## 1. Clean data

The review text must be cleaned before analysis because it contains noise, such as URLs, emojis, and spelling errors, that would affect the results of the sentiment model.

Note: make sure you have installed the necessary libraries, such as langdetect and hunspell, before running the code. You can use pip, the Python package installer, to install these libraries.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993227 sha256=c7f989647d1b6477dcba6f608f4c8a18f2b20a54b04e55496b2d6821b703910b
  Stored in directory: /root/.cache/pip/wheels/95/03/7d/59ea870c70ce4e5a370638b5462a7711ab78fba2f655d05106
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


In [None]:
!apt-get update
!apt-get install -y hunspell libhunspell-dev
!pip install hunspell

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Hit:4 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:5 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease [24.3 kB]

In [None]:
# import required libraries
import os
import csv
import re
import nltk
import hunspell
import pandas as pd
from langdetect import detect, LangDetectException
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

### 1.1 Delete non-English reviews

Goodreads is an international reading platform whose readers come from all over the world, so reviews are written in many languages. This study keeps only the reviews written in English.

In [None]:
# Define a function named detect_language that takes a parameter 'text'.
def detect_language(text):
    try: # Begin a try block to handle potential exceptions.
        # Check if the input 'text' is a string.
        if isinstance(text, str):
            return detect(text)  # If 'text' is a string, detect its language using the detect function and return the result.
        else: # If 'text' is not a string:
            return "unknown" # Return "unknown" as the language cannot be detected.
    except LangDetectException:  # If a LangDetectException is raised during detection:
        return "unknown"  # Return "unknown" as the language detection failed.

# Set the input folder path where the CSV files with reviews are stored.
input_folder_path = "/content/drive/MyDrive/page_5_reviews"
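# Set the output folder path where the filtered English-only reviews will be saved.
output_folder_path = "/content/drive/MyDrive/page_5_reviews_english_only"
# Create the output folder if it does not exist yet, so that to_csv does not fail.
os.makedirs(output_folder_path, exist_ok=True)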

# Loop through each file in the input folder.
for filename in os.listdir(input_folder_path):
    if filename.endswith(".csv"):   # Check if the file is a CSV file.
        # Construct the full file path of the input CSV file.
        file_path = os.path.join(input_folder_path, filename)

        # Read the CSV file into a DataFrame, considering the first row as the header.
        data = pd.read_csv(file_path, header=0)

        # Apply the detect_language function to the 'Content' column and create a new 'Language' column with the results.
        data['Language'] = data['Content'].apply(detect_language)

        # Filter the DataFrame to include only rows where the 'Language' column is 'en' (English).
        data = data[data['Language'] == 'en']

        # Construct the full file path for the output CSV file.
        output_file_path = os.path.join(output_folder_path, filename)
        # Save the filtered DataFrame to the output path, without including the index.
        data.to_csv(output_file_path, index=False)
# Print a message indicating the processing is completed and the location of the saved English-only reviews.
print("Processing completed. English-only reviews have been saved in:", output_folder_path)


Processing completed. English-only reviews have been saved in: /content/drive/MyDrive/page_5_reviews_english_only


### 1.2 Removing rows containing null values

In the dataset, there are missing values in the columns “Review Count”, “Followers”, “Rating”, and “Likes”. Since our research question is based on the correlation between these variables, it is crucial to ensure data integrity. Therefore, we first use the “isnull” method to detect missing values in these four columns, and then employ the “any” method to filter out rows where any of these columns contain missing values.

In [None]:
# Set the folder path where the filtered English-only review CSV files are stored.
folder_path = "/content/drive/MyDrive/page_5_reviews_english_only"

# Create a list of file paths for all CSV files in the folder.
csv_files = [os.path.join(folder_path, file) for file in os.listdir(folder_path) if file.endswith('.csv')]

# Specify the columns that need to be checked for null values.
columns_to_check = ["Review Count", "Followers", "Rating", "Likes"]

# Loop through each CSV file in the list of file paths.
for file_path in csv_files:
    # Read the CSV file into a DataFrame, considering the first row as the header.
    data = pd.read_csv(file_path, header=0)

    # Check if the 'Tags' column exists in the DataFrame.
    if 'Tags' in data.columns:
        data.drop(columns=['Tags'], inplace=True)  # If the 'Tags' column exists, drop it from the DataFrame.

    # Identify rows that have any null values in the specified columns.
    rows_with_nulls = data[data[columns_to_check].isnull().any(axis=1)]

    # If there are rows with null values in the specified columns:
    if not rows_with_nulls.empty:
        data = data.dropna(subset=columns_to_check)  # Drop rows with null values in the specified columns from the DataFrame.
    else:
        pass  # If there are no rows with null values, do nothing.

    # Save the filtered DataFrame back to the same CSV file path, without including the index.
    data.to_csv(file_path, index=False)

# Print a message indicating the data filtering process is completed for all files.
print("Data has been successfully filtered for all files!")


Data has been successfully filtered for all files!


### 1.3 Cleaning up the book reviews text

The review text still contains noise that would affect the sentiment model, such as URLs, emojis, and spelling errors. The cell below removes URLs and non-alphabetical characters, corrects spelling with Hunspell, lemmatizes and lowercases the words, and removes English stop words.
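
As a minimal sketch of the first two cleaning steps, the snippet below applies the same two regular expressions as the cell that follows to a single made-up review. The helper name `basic_clean` and the sample text are hypothetical. For brevity, the sketch also collapses whitespace and lowercases the text in one pass, which the notebook does in later steps.

```python
import re

# Same patterns as the notebook's cleaning cell.
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
NON_LETTER_PATTERN = re.compile(r'[^a-zA-Z\s]')

def basic_clean(text):
    """Remove URLs, replace non-letters with spaces, collapse whitespace, lowercase."""
    text = URL_PATTERN.sub('', text)          # drop URLs entirely
    text = NON_LETTER_PATTERN.sub(' ', text)  # digits, punctuation, emojis -> space
    return ' '.join(text.split()).lower()     # collapse runs of whitespace

# Hypothetical review text, for illustration only.
sample = "Loved it!!! 10/10 😍 see https://example.com/review for more"
print(basic_clean(sample))  # loved it see for more
```

Because every non-letter becomes a space, contractions such as "don't" are split into "don" and "t", which is worth keeping in mind when reading the cleaned text.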

In [None]:
# Set the folder path where the filtered English-only review CSV files are stored.
folder_path = "/content/drive/MyDrive/page_5_reviews_english_only/"

# Create a list of filenames for all CSV files in the folder.
csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

# Loop through each CSV file in the list of filenames.
for file in csv_files:
    # Construct the full file path of the input CSV file.
    file_path = os.path.join(folder_path, file)

    # Read the CSV file into a DataFrame, considering the first row as the header.
    data = pd.read_csv(file_path, header=0)

    # Define a function to remove URLs from a given text.
    def remove_urls(text):
        # Compile a regex pattern to match URLs starting with http, https, or www.
        url_pattern = re.compile(r'https?://\S+|www\.\S+')
        # Substitute matched URLs with an empty string to remove them from the text.
        return url_pattern.sub(r'', text)

    # Apply the remove_urls function to the 'Content' column and create a new column 'remove_urls_reviews'.
    data['remove_urls_reviews'] = data['Content'].apply(remove_urls)

    # Define a function to remove non-alphabetical characters from a given text.
    def remove_non_text(text):
        # Regex pattern matching any character that is not a letter or whitespace.
        non_text_pattern = r'[^a-zA-Z\s]'
        # Replace each matched character with a space.
        processed_text = re.sub(non_text_pattern, ' ', text)
        return processed_text

    # Apply the remove_non_text function to the 'remove_urls_reviews' column and create a new column 'processed_reviews'.
    data['processed_reviews'] = data['remove_urls_reviews'].apply(remove_non_text)

    # Initialize the HunSpell spell checker with the English dictionary and affix files.
    spell_checker = hunspell.HunSpell('/usr/share/hunspell/en_US.dic', '/usr/share/hunspell/en_US.aff')

    # Define a function to correct spelling errors in a given text.
    def correct_spelling(text):
        # Split the text into individual words.
        words = text.split()
        # Initialize an empty list to store the corrected words.
        corrected_words = []
        # Loop through each word in the list of words.
        for word in words:
            # Check if the word is spelled correctly using the spell checker.
            if not spell_checker.spell(word):
                # If the word is misspelled, get suggestions for the correct spelling.
                suggestions = spell_checker.suggest(word)
                # If there are suggestions, take the first suggestion as the corrected word.
                if suggestions:
                    corrected_word = suggestions[0]
                    corrected_words.append(corrected_word)  # Append the corrected word to the list of corrected words.
                else:
                    # If no suggestions are available, keep the original word.
                    corrected_words.append(word)
            else:
                corrected_words.append(word) # If the word is spelled correctly, keep it as is.
        # Join the list of corrected words back into a single string and return it.
        return ' '.join(corrected_words)

    # Apply the correct_spelling function to the 'processed_reviews' column and create a new column 'corrected_reviews'.
    data['corrected_reviews'] = data['processed_reviews'].apply(correct_spelling)

    # Initialize the WordNet lemmatizer for reducing words to their base form.
    lemmatizer = WordNetLemmatizer()

    # Define a function to lemmatize each word in the given text.
    def lemmatize_text(text):
        # Split the text into words, lemmatize each word, and join them back into a single string.
        lemmatized_text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])
        return lemmatized_text

    # Apply the lemmatize_text function to the 'corrected_reviews' column and create a new column 'lemmatized_reviews'.
    data['lemmatized_reviews'] = data['corrected_reviews'].apply(lemmatize_text)

    # Convert all text in the 'lemmatized_reviews' column to lowercase.
    data['lemmatized_reviews'] = data['lemmatized_reviews'].str.lower()

    # Import the set of English stop words from the NLTK library.
    stop_words = set(stopwords.words('english'))

    # Define a function to remove stop words from the given text.
    def remove_stop_words(text):
        # Split the text into words, filter out the stop words, and join the remaining words back into a single string.
        filtered_text = ' '.join([word for word in text.split() if word.lower() not in stop_words])
        return filtered_text

    # Apply the remove_stop_words function to the 'lemmatized_reviews' column and store the result as 'cleaned_reviews'.
    data['cleaned_reviews'] = data['lemmatized_reviews'].apply(remove_stop_words)

    # Drop intermediate columns that are no longer needed from the DataFrame.
    data = data.drop(columns=['remove_urls_reviews', 'processed_reviews', 'corrected_reviews', 'lemmatized_reviews'])

    # Save the cleaned DataFrame back to the same CSV file path, without including the index.
    data.to_csv(file_path, index=False)

# Print a message indicating the data has been successfully written to the file.
print("Data has been successfully written to the file!")

Data has been successfully written to the file!


## 2. Calculate variables

In this study, we specifically analyze the correlation between the following variables:

1. the review length;
2. the number of reviews that the reviewer has written;
3. the number of followers of the reviewer;
4. the number of days since the review was posted;
5. the sentiment score of the review;
6. the rating of the book;
7. the support that the review received.

All variables must be calculated except (2) the number of reviews that the reviewer has written and (3) the number of followers of the reviewer, which are available directly in the dataset.

### 2.1 Calculate the number of days since a given date

In [None]:
# import required libraries
import os
import csv
import pandas as pd
from datetime import datetime

In [None]:
# Set the folder path where the filtered English-only review CSV files are stored.
folder_path = "/content/drive/MyDrive/page_5_reviews_english_only"

# Define a function to convert a date string into a datetime object.
def convert_to_date(date_str):
    # List of possible date formats to try.
    formats = ["%B %d, %Y", "%d-%b-%y"]
    # Loop through each date format.
    for fmt in formats:
        try:
            # Attempt to parse the date string using the current format.
            return datetime.strptime(date_str, fmt)
        except ValueError:
            # If parsing fails, continue to the next format.
            continue
    # Raise an error if the date string doesn't match any of the formats.
    raise ValueError(f"Date format does not match: {date_str}")

# Define a function to calculate the number of days since a given date.
def calculate_days_since(date_str, reference_date):
    # Convert the date string to a datetime object.
    date = convert_to_date(date_str)
    # Calculate the difference in days between the reference date and the given date.
    return (reference_date - date).days

# Set the reference date for the calculation.
reference_date = datetime(2024, 5, 28)

# Loop through each file in the folder.
for filename in os.listdir(folder_path):
    if filename.endswith(".csv"): # Check if the file is a CSV file.
        # Construct the full file path of the input CSV file.
        file_path = os.path.join(folder_path, filename)

        # Read the CSV file into a DataFrame, considering the first row as the header.
        data = pd.read_csv(file_path, header=0)

        # Apply the calculate_days_since function to the 'Date' column and create a new column 'Number_of_Days'.
        data['Number_of_Days'] = data['Date'].apply(lambda x: calculate_days_since(x, reference_date))

        # Save the updated DataFrame back to the same CSV file path, without including the index.
        data.to_csv(file_path, index=False)

# Print a message indicating the data conversion process is completed for all files.
print("Data has been successfully converted!")


Data has been successfully converted!
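
As a small worked example of the calculation above, the sketch below parses one date in each of the two formats the notebook accepts and counts the days up to the same reference date (28 May 2024). The function name `days_since` and the sample dates are hypothetical, and the sketch assumes an English locale for month names.

```python
from datetime import datetime

# The two date formats tried by the notebook, in the same order.
DATE_FORMATS = ["%B %d, %Y", "%d-%b-%y"]
# Same reference date as in the notebook.
REFERENCE_DATE = datetime(2024, 5, 28)

def days_since(date_str, reference_date=REFERENCE_DATE):
    """Return the number of days between date_str and reference_date."""
    for fmt in DATE_FORMATS:
        try:
            return (reference_date - datetime.strptime(date_str, fmt)).days
        except ValueError:
            continue  # try the next format
    raise ValueError(f"Date format does not match: {date_str}")

# Hypothetical dates, for illustration only.
print(days_since("May 1, 2024"))  # 27
print(days_since("01-May-24"))    # 27
```

Note that `%y` reads a two-digit year: Python maps 69 to 99 onto 1969 to 1999 and 00 to 68 onto 2000 to 2068, so reviews dated in this second format are interpreted within that window.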


### 2.2 Calculate the number of words

In [None]:
# import required library
import os
import pandas as pd

In [None]:
# Set the folder path where the filtered English-only review CSV files are stored.
folder_path = "/content/drive/MyDrive/page_5_reviews_english_only"

# Loop through each file in the folder.
for filename in os.listdir(folder_path):
    if filename.endswith(".csv"): # Check if the file is a CSV file.
        # Construct the full file path of the input CSV file.
        file_path = os.path.join(folder_path, filename)

        # Read the CSV file into a DataFrame, considering the first row as the header.
        data = pd.read_csv(file_path, header=0)

        # Create a new column 'word_count' by applying a lambda function to each entry in the 'cleaned_reviews' column.
        # Calculate the number of words if the entry is a string; otherwise, set to 0.
        data['word_count'] = data['cleaned_reviews'].apply(lambda x: len(x.split()) if isinstance(x, str) else 0)

        # Save the updated DataFrame back to the same CSV file path, without including the index.
        data.to_csv(file_path, index=False)

# Print a message indicating the data processing is completed for all files.
print("Data has been successfully filtered!")


Data has been successfully filtered!


### 2.3 Sentiment analysis

We use a pretrained model provided by Pianzola (`fpianz/roberta-english-book-reviews-sentiment`) that was designed specifically to classify the sentiment of book reviews. For each review, the output contains the probabilities of the positive, negative, and neutral classes, the confidence of the classification, and the predicted sentiment label. For convenience, the positive-class probability is used as the sentiment score of a review in the subsequent correlation analysis.
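
To illustrate how the sentiment score is read from the classifier output, the sketch below works on a list of label/score dictionaries, the shape that the notebook iterates over for one review. The scores are made up for illustration and are not output from the real model, and the helper name `positive_probability` is hypothetical.

```python
def positive_probability(scores):
    """Return the score of the 'positive' label, or None if it is absent."""
    for entry in scores:
        if entry["label"] == "positive":
            return entry["score"]
    return None

# Made-up scores for one review; NOT produced by the real model.
example_scores = [
    {"label": "negative", "score": 0.10},
    {"label": "neutral", "score": 0.25},
    {"label": "positive", "score": 0.65},
]

# Sentiment score used for the correlation analysis.
print(positive_probability(example_scores))  # 0.65

# The predicted label is the class with the highest score.
predicted = max(example_scores, key=lambda entry: entry["score"])
print(predicted["label"])  # positive
```

Using the positive-class probability keeps the sentiment variable continuous, which suits a correlation analysis better than the three-way label.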

In [None]:
pip install --upgrade transformers

Collecting transformers
  Using cached transformers-4.41.2-py3-none-any.whl (9.1 MB)
Collecting tokenizers<0.20,>=0.19 (from transformers)
  Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.13.3
    Uninstalling tokenizers-0.13.3:
      Successfully uninstalled tokenizers-0.13.3
  Attempting uninstall: transformers
    Found existing installation: transformers 4.26.0
    Uninstalling transformers-4.26.0:
      Successfully uninstalled transformers-4.26.0


In [None]:
# import required libraries
from transformers import pipeline
import pandas as pd
import os

In [None]:
# Define the path to the folder containing the input CSV files
folder_path = "/content/drive/MyDrive/page_5_reviews_english_only/"
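# Define the path to the folder where the output CSV files will be saved
output_folder_path = "/content/drive/MyDrive/page_5_reviews_positive_prob/"
# Create the output folder if it does not exist yet, so that to_csv does not fail
os.makedirs(output_folder_path, exist_ok=True)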

# Create a list of all CSV file names in the input folder
# This filters the files in the folder to include only those that end with '.csv'
csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

# Load a sentiment analysis model using the 'text-classification' pipeline
# The model 'fpianz/roberta-english-book-reviews-sentiment' is used for classifying text into sentiments
# The argument 'return_all_scores=True' ensures that the classifier returns scores for all sentiment classes
classifier = pipeline("text-classification", model="fpianz/roberta-english-book-reviews-sentiment", return_all_scores=True)

# Iterate over each CSV file in the list of CSV files
for csv_file in csv_files:
    # Construct the full file path for the current CSV file
    file_path = os.path.join(folder_path, csv_file)

    # Read the CSV file into a DataFrame
    # This allows us to manipulate and analyze the data in the file
    df = pd.read_csv(file_path)

    # Add a new column 'positive_prob' to the DataFrame and initialize it with 0.0
    # This column will store the probability of the review being positive
    df['positive_prob'] = 0.0

    # Iterate over each row in the DataFrame
    # This allows us to analyze each review individually
    for index, row in df.iterrows():
        # Extract the review text from the current row
        review = row['cleaned_reviews']

        # Get the sentiment analysis results for the review
        # The classifier returns a list of dictionaries with labels and scores
        results = classifier(review)

        # Initialize a variable to store the positive probability
        positive_prob = None
        # Iterate over the results for the review
        # Find the score corresponding to the 'positive' label
        for result in results[0]:
            if result['label'] == 'positive':
                positive_prob = result['score']
                break

        # Update the 'positive_prob' column for the current row with the obtained positive probability
        df.at[index, 'positive_prob'] = positive_prob

    # Construct the full file path for the output CSV file
    # This path is in the output folder and has the same name as the input CSV file
    output_file_path = os.path.join(output_folder_path, csv_file)
    df.to_csv(output_file_path, index=False)


## 3. Sentiment category statistics for book reviews

This section answers the first subquestion: are readers' reviews of science fiction books on Goodreads generally more positive or more negative?

In [None]:
# import required libraries
import os
import pandas as pd
import matplotlib.pyplot as plt

# List of folder paths containing the filtered English-only review CSV files.
folder_paths = [
    "/content/drive/MyDrive/page_1_reviews_english_only",
    "/content/drive/MyDrive/page_2_reviews_english_only",
    "/content/drive/MyDrive/page_3_reviews_english_only",
    "/content/drive/MyDrive/page_4_reviews_english_only",
    "/content/drive/MyDrive/page_5_reviews_english_only"
]

# Collect one DataFrame per CSV file, then concatenate once at the end.
frames = []

# Loop through each folder path in the list.
for folder_path in folder_paths:
    # Create a list of filenames for all CSV files in the current folder.
    csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

    # Loop through each CSV file in the list of filenames.
    for csv_file in csv_files:
        # Construct the full file path of the current CSV file.
        file_path = os.path.join(folder_path, csv_file)

        # Read the CSV file into a DataFrame, considering the first row as the header.
        frames.append(pd.read_csv(file_path, header=0))

# Concatenate all DataFrames in one call, ignoring the original indexes.
combined_data = pd.concat(frames, ignore_index=True)

# Count the occurrences of each unique value in the 'label' column of the combined DataFrame.
label_counts = combined_data['label'].value_counts()

# Calculate the percentage representation of each label.
label_percentages = label_counts / label_counts.sum() * 100

# Print the counts of each label.
print(label_counts)
# Print the percentage representation of each label.
print(label_percentages)

label
positive    19561
negative    10127
neutral      8078
Name: count, dtype: int64
label
positive    51.795266
negative    26.815125
neutral     21.389610
Name: count, dtype: float64
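
The percentages above follow directly from the counts. As a quick check, the sketch below recomputes them in plain Python from the three counts printed by the cell; the variable names are chosen for illustration.

```python
# Label counts copied from the output above.
label_counts = {"positive": 19561, "negative": 10127, "neutral": 8078}

# Total number of reviews across all labels.
total = sum(label_counts.values())

# Share of each label as a percentage of all reviews.
label_percentages = {
    label: count / total * 100 for label, count in label_counts.items()
}

print(total)
for label, pct in label_percentages.items():
    print(f"{label}: {pct:.2f}%")
```

Positive reviews make up slightly more than half of the 37,766 reviews, and roughly twice as many reviews are labelled positive as negative.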


In [None]:
# import required libraries
import os
import pandas as pd

# List of folder paths containing the filtered English-only review CSV files.
folder_paths = [
    "/content/drive/MyDrive/page_1_reviews_english_only",
    "/content/drive/MyDrive/page_2_reviews_english_only",
    "/content/drive/MyDrive/page_3_reviews_english_only",
    "/content/drive/MyDrive/page_4_reviews_english_only",
    "/content/drive/MyDrive/page_5_reviews_english_only"
]

# Collect one DataFrame per CSV file, then concatenate once at the end.
frames = []

# Loop through each folder path in the list.
for folder_path in folder_paths:
    # Create a list of filenames for all CSV files in the current folder.
    csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

    # Loop through each CSV file in the list of filenames.
    for csv_file in csv_files:
        # Construct the full file path of the current CSV file.
        file_path = os.path.join(folder_path, csv_file)

        # Read the CSV file into a DataFrame, considering the first row as the header.
        frames.append(pd.read_csv(file_path, header=0))

# Concatenate all DataFrames in one call, ignoring the original indexes.
combined_data = pd.concat(frames, ignore_index=True)

# Create a pivot table from the combined DataFrame.
# - `index='Rating'`: Rows will represent unique values in the 'Rating' column.
# - `columns='label'`: Columns will represent unique values in the 'label' column.
# - `aggfunc='size'`: Aggregation function to count the number of occurrences.
# - `fill_value=0`: Replace any missing values with 0 in the pivot table.
summary_table = combined_data.pivot_table(index='Rating', columns='label', aggfunc='size', fill_value=0)

# Print the summary table to display the count of each label for each rating.
print(summary_table)


label   negative  neutral  positive
Rating                             
1.0         1253      207        68
2.0         2290      401       271
3.0         2886     1550      2054
4.0         2096     3260      7679
5.0         1602     2660      9489
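
Raw counts are hard to compare across ratings because the rows differ greatly in size. One possible way to read the table, sketched below in plain Python, is to convert each row into shares that sum to 1. The counts are copied from the table above; the helper name `row_shares` is hypothetical.

```python
# Counts copied from the pivot table above (rating -> label -> count).
counts_by_rating = {
    1.0: {"negative": 1253, "neutral": 207, "positive": 68},
    2.0: {"negative": 2290, "neutral": 401, "positive": 271},
    3.0: {"negative": 2886, "neutral": 1550, "positive": 2054},
    4.0: {"negative": 2096, "neutral": 3260, "positive": 7679},
    5.0: {"negative": 1602, "neutral": 2660, "positive": 9489},
}

def row_shares(counts):
    """Convert the label counts of one rating into shares that sum to 1."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Print the share of each sentiment label within every rating.
for rating in sorted(counts_by_rating):
    shares = row_shares(counts_by_rating[rating])
    formatted = ", ".join(f"{label}={share:.1%}" for label, share in shares.items())
    print(f"{rating}: {formatted}")
```

With pandas, the same normalization could be obtained with `summary_table.div(summary_table.sum(axis=1), axis=0)`. In these counts, the share of reviews labelled positive increases with every step up in rating.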
