# Goodreads Book Reviews Analysis - Numerical Data Exploration

## Project Overview
This project aims to analyze **Goodreads book reviews**, focusing on **1-star ratings** to understand patterns in harsh reviews. The analysis is divided into two parts:
1. **Numerical Data Analysis** (Current Stage) - Examining numerical factors such as star ratings, review counts, and genre distributions.
2. **Natural Language Processing (NLP) Analysis** (Next Stage) - Exploring book descriptions and text reviews to identify sentiment patterns.

## Phase 1: Numerical Data Cleaning and Transformation

### 1. Data Import and Initial Inspection
- The dataset was imported using Pandas and inspected for missing values, incorrect formats, and inconsistencies.
- Key columns:
  - **Star Ratings** (`star_rating`)
  - **Number of Ratings** (`num_ratings`)
  - **Number of Reviews** (`num_reviews`)
  - **Genres** (`genres`)
  - **Community Reviews** (extracted into separate rating percentages)
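
A minimal sketch of that inspection step, run on a small made-up frame rather than the real dataset. The column names follow the list above; the values and the helper name `inspect_frame` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def inspect_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing-value count, and distinct-value count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(dropna=True),
    })

# Toy data standing in for the real Goodreads export (values are made up).
toy = pd.DataFrame({
    "star_rating": [4.1, 3.8, None],
    "num_ratings": ["1,204", "87", "15"],  # counts stored as text need conversion
    "num_reviews": [120, 9, 2],
    "genres": ["['Fantasy', 'Fiction']", "['Romance']", None],
})

summary = inspect_frame(toy)
print(summary)

# Convert comma-formatted counts to integers.
toy["num_ratings"] = toy["num_ratings"].str.replace(",", "", regex=False).astype(int)
```

A summary like this makes incorrect formats visible early, for example a numeric column that was read in as text.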


In [None]:
%pip install jupyter pandas numpy matplotlib seaborn scikit-learn nltk langdetect

In [None]:
%pip install tqdm

In [None]:
df.head()

In [None]:
df.sample(20)

In [None]:
df_exploded.head(30)

In [None]:
df_exploded.info()

In [None]:
df_exploded.sample(30)

## Phase 2: Numerical Data Exploration and Visualization

### 1. **Top 20 Most Common Genres**
**Objective**: Identify the most frequent book genres.
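
One way this count could be computed, sketched on toy data. It assumes `genres` holds stringified lists (one string per book), which is an assumption about the raw format; the toy values are made up.

```python
import ast
import pandas as pd

# Toy stand-in for the real data; `genres` is assumed to hold stringified lists.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["['Fantasy', 'Fiction']", "['Fiction', 'Romance']", "['Fiction']"],
})

# Parse each string into a real list, then give every genre its own row.
toy["genres"] = toy["genres"].apply(ast.literal_eval)
toy_exploded = toy.explode("genres")

# Count genre frequency and keep the 20 most common.
top_genres = toy_exploded["genres"].value_counts().head(20)
print(top_genres)
```

Because each book contributes one row per genre after `explode`, the counts describe genre tags, not distinct books.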

### 2. **Genres with the Highest Percentage of 1-Star Reviews**
**Objective**: Identify which genres tend to receive the most negative ratings.
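
A possible sketch of this ranking on toy data. The column name `one_star_pct` (the share of a book's ratings that are 1-star) and the `min_books` threshold are hypothetical; the real column extracted from the community reviews may be named differently.

```python
import pandas as pd

# Toy exploded frame: one row per (book, genre). `one_star_pct` is a hypothetical column name.
toy_exploded = pd.DataFrame({
    "genres": ["Fiction", "Fiction", "Romance", "Romance", "Horror"],
    "one_star_pct": [2.0, 4.0, 6.0, 10.0, 30.0],
})

min_books = 2  # ignore genres with too few books to give a stable average

per_genre = (
    toy_exploded.groupby("genres")["one_star_pct"]
    .agg(mean_one_star_pct="mean", n_books="count")
)
ranked = (
    per_genre[per_genre["n_books"] >= min_books]
    .sort_values("mean_one_star_pct", ascending=False)
)
print(ranked)
```

This averages per-book percentages without weighting by the number of ratings, so a genre dominated by rarely rated books can rank high; the minimum-count filter only partly guards against that.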

## Adding dataset with text reviews

In [None]:
import pandas as pd
import json
import gzip

chunk_size = 10000
chunks = []
header_written = False  # write the CSV header only once, with the first chunk

# NOTE: mode="a" appends, so delete any existing "goodreads_reviews" file before re-running
with gzip.open("./Data/goodreads_reviews_dedup.json.gz", "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):  # read line by line
        chunks.append(json.loads(line))  # convert each JSON line to a dict

        # every chunk_size lines, write the buffered records to the CSV
        if (i + 1) % chunk_size == 0:
            df_chunk = pd.DataFrame(chunks)
            df_chunk.to_csv("goodreads_reviews", mode="a", index=False, header=not header_written)
            header_written = True
            chunks = []

# write any remaining records
if chunks:
    df_chunk = pd.DataFrame(chunks)
    df_chunk.to_csv("goodreads_reviews", mode="a", index=False, header=not header_written)



In [None]:
df_reviews = pd.read_csv("goodreads_reviews")

In [None]:
df_reviews.head()

In [None]:
df_reviews.info()

In [None]:
df_reviews['book_id'].duplicated().any()

In [None]:
import pandas as pd
import json
import gzip

chunk_size = 10000
chunks = []
header_written = False  # write the CSV header only once, with the first chunk

# NOTE: mode="a" appends, so delete any existing "goodreads_books" file before re-running
with gzip.open("./Data/goodreads_books.json.gz", "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):  # read line by line
        chunks.append(json.loads(line))  # convert each JSON line to a dict

        # every chunk_size lines, write the buffered records to the CSV
        if (i + 1) % chunk_size == 0:
            df_chunk = pd.DataFrame(chunks)
            df_chunk.to_csv("goodreads_books", mode="a", index=False, header=not header_written)
            header_written = True
            chunks = []

# write any remaining records
if chunks:
    df_chunk = pd.DataFrame(chunks)
    df_chunk.to_csv("goodreads_books", mode="a", index=False, header=not header_written)
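
The reviews and books conversions share the same logic, so they could be folded into one helper. The sketch below is one possible shape for it; the function name `jsonl_gz_to_csv` is hypothetical, and the demo runs on a small temporary file with made-up records instead of the real data.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

def jsonl_gz_to_csv(src_path: str, dst_path: str, chunk_size: int = 10000) -> int:
    """Stream a gzipped JSON-lines file into a CSV; return the number of records written."""
    buffer = []
    n_written = 0

    def flush() -> None:
        nonlocal n_written, buffer
        first = n_written == 0
        # Overwrite on the first chunk (so re-runs do not duplicate rows), append afterwards.
        pd.DataFrame(buffer).to_csv(
            dst_path, mode="w" if first else "a", index=False, header=first
        )
        n_written += len(buffer)
        buffer = []

    with gzip.open(src_path, "rt", encoding="utf-8") as f:
        for line in f:
            buffer.append(json.loads(line))
            if len(buffer) == chunk_size:
                flush()
    if buffer:
        flush()
    return n_written

# Small self-contained demo on a temporary file (made-up records).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.json.gz")
    dst = os.path.join(tmp, "demo.csv")
    with gzip.open(src, "wt", encoding="utf-8") as f:
        for i in range(5):
            f.write(json.dumps({"book_id": i, "rating": i % 5 + 1}) + "\n")
    n = jsonl_gz_to_csv(src, dst, chunk_size=2)
    demo = pd.read_csv(dst)
    print(n, demo.shape)
```

Like the cells above, this assumes every record carries the same keys in the same order; if that does not hold, columns written by later chunks would not line up with the header.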

In [None]:
df_books = pd.read_csv("goodreads_books")

In [None]:
df_books.head(10)

In [None]:
df_books.info()

In [None]:
print(df_books.columns)

In [None]:
df_merged = df_reviews.merge(df_books, on="book_id", how="inner")
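
Before relying on the merged frame, it can help to check that the join behaves as expected. The sketch below uses toy frames with made-up values to show two `merge` options, `validate` and `indicator`, plus a key dtype alignment step; whether the real `book_id` columns need that alignment is an assumption to verify.

```python
import pandas as pd

# Toy frames (made-up values); book_id is deliberately a string in one and an int in the other.
toy_reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "book_id": ["10", "10", "99"],
    "rating": [1, 5, 3],
})
toy_books = pd.DataFrame({"book_id": [10, 20], "title": ["Book A", "Book B"]})

# Align key dtypes first so that the string and integer keys can match.
toy_reviews["book_id"] = toy_reviews["book_id"].astype("int64")

merged = toy_reviews.merge(
    toy_books,
    on="book_id",
    how="left",
    validate="many_to_one",  # raise if a book_id appears more than once in toy_books
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

unmatched = (merged["_merge"] == "left_only").sum()
print(f"reviews without a matching book: {unmatched}")

# Dropping the unmatched rows gives the same result as how="inner".
merged = merged[merged["_merge"] == "both"].drop(columns="_merge")
```

Counting the unmatched rows before discarding them shows how many reviews an inner join silently removes.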

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
df_merged.head(10)

In [None]:
print(df_merged.columns)

In [None]:
df_merged = df_merged.drop(columns=[
    'user_id', 'date_added', 'read_at', 'started_at', 'date_updated',
    'kindle_asin', 'work_id', 'n_comments', 'asin', 'similar_books', 'series',
    'publication_month', 'publication_day', 'edition_information', 'is_ebook',
])


In [None]:
df_merged.info()

In [None]:
df_merged=df_merged.drop(columns=['format', 'num_pages', 'isbn13', 'link', 'title_without_series'])

In [None]:
df_merged['review_id'].duplicated().any()

In [None]:
(df_merged['text_reviews_count']== 0).any()

In [None]:
df_merged[df_merged['text_reviews_count'] == 0]
# text_reviews_count is 0 even though these rows have review text; the count may be outdated

In [None]:
df_merged[df_merged['rating'] == 0]
# a rating of 0 appears to mean a review was written without a star rating; these rows are left out of the analysis

In [None]:
df_merged= df_merged[df_merged['rating'].notna() & (df_merged['rating'] !=0)]

In [None]:
# This analysis focuses on English reviews only; language filtering is handled in a later step.
# Here, drop rows with no text in review_text or description. The dataset is large enough that the loss should not matter.
df_merged = df_merged.dropna(subset=['review_text', 'description'])
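
A hedged sketch of how English rows might be selected from an existing language column. The column name `language_code` and the code values shown are assumptions about the dataset, and the toy rows are made up.

```python
import pandas as pd

# Toy rows; `language_code` is assumed to be the existing language column (values are made up).
toy = pd.DataFrame({
    "review_text": ["Awful.", "Terrible pacing", "Muy malo", "No code here"],
    "language_code": ["eng", "en-US", "spa", None],
})

codes = toy["language_code"].fillna("").str.strip().str.lower()

is_english = codes.str.startswith("en")  # matches "eng", "en-us", "en-gb", ...
is_unknown = codes == ""                 # missing code: language still undetermined

english_rows = toy[is_english]
needs_detection = toy[is_unknown]
print(len(english_rows), len(needs_detection))
```

Note that the code describes the book edition, not necessarily the language the review was written in, so rows with a missing or unexpected code may still need text-based detection.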

In [None]:
df_merged.head()

In [None]:
#cleaning popular shelves column
print(df_merged['popular_shelves'].iloc[0])

In [None]:
#seeing which shelves have the highest counts
import ast
from collections import Counter

# extract shelf names from the stringified list of shelf dictionaries
def shelf_names(shelves_str):
    shelves_list = ast.literal_eval(shelves_str) #convert the string to a list of dicts
    if isinstance(shelves_list, list):
        return [shelf['name'] for shelf in shelves_list if 'name' in shelf] #extract 'name' value from each dict if it exists
    return []

shelf_counter = Counter()

In [None]:
#very large operation (takes about 100 minutes to run)
for row in df_merged['popular_shelves'].dropna():
    shelf_counter.update(shelf_names(row))

print(shelf_counter.most_common(30))
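
Because the merged frame repeats a book's shelves string once per review, most of the run time likely goes into parsing identical strings again and again. One possible speed-up, sketched on toy data, parses each distinct string once and weights it by how many rows share it. The `count` key in the toy dictionaries is an assumption about the raw format.

```python
import ast
from collections import Counter

import pandas as pd

# Toy column (made-up values): the same shelves string repeats once per review of a book.
shelves_a = "[{'count': '5', 'name': 'to-read'}, {'count': '2', 'name': 'fantasy'}]"
shelves_b = "[{'count': '1', 'name': 'to-read'}]"
toy_shelves = pd.Series([shelves_a, shelves_a, shelves_a, shelves_b, None])

fast_counter = Counter()
# value_counts() drops NaN and reports how many rows share each distinct string.
for shelves_str, n_rows in toy_shelves.value_counts().items():
    parsed = ast.literal_eval(shelves_str)  # parse each distinct string only once
    for shelf in parsed:
        if "name" in shelf:
            fast_counter[shelf["name"]] += int(n_rows)

print(fast_counter.most_common(5))
```

The totals match the row-by-row loop, since each shelf name still gains one count per review row; only the number of `ast.literal_eval` calls changes.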

In [None]:
import random

unique_shelves = list(shelf_counter.keys())
print(f"unique names: {len(unique_shelves)}")

In [None]:
print(shelf_counter.most_common(1000))

In [None]:
def normalize_shelf(name):
    return name.strip().lower().replace(" ", "-")

In [None]:
#Filtering shelf names
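
A sketch of how the filtering step might proceed: merge counts for shelf names that normalize to the same string, then drop shelves that describe reading status instead of content. The counts and the `non_genre_shelves` list are made up for illustration and are not exhaustive.

```python
from collections import Counter

def normalize_shelf(name: str) -> str:
    return name.strip().lower().replace(" ", "-")

# Made-up counts standing in for the real shelf_counter.
toy_counter = Counter({"Sci Fi": 3, "sci-fi": 10, " to-read ": 1, "to-read": 50, "owned": 7})

# Shelves that describe reading status rather than content (hypothetical, non-exhaustive list).
non_genre_shelves = {"to-read", "currently-reading", "owned", "favorites"}

# Merge counts for names that normalize to the same string.
normalized_counter = Counter()
for name, count in toy_counter.items():
    normalized_counter[normalize_shelf(name)] += count

genre_counter = Counter({
    name: count for name, count in normalized_counter.items()
    if name not in non_genre_shelves
})
print(genre_counter.most_common())
```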

In [None]:
#cleaning the author column
print(df_merged['authors'].iloc[0])

In [None]:
# there is already a language code column, but it is not thorough. Try langdetect to fill in the missing values
from langdetect import detect
df_merged['dec']
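
A minimal sketch of filling missing language codes from the review text. The helper `fill_missing_language`, the column name `language_code`, and the stand-in `fake_detect` function are all hypothetical; the stand-in exists only so the example runs without extra packages.

```python
from typing import Callable

import pandas as pd

def fill_missing_language(
    df: pd.DataFrame,
    detect_fn: Callable[[str], str],
    text_col: str = "review_text",
    lang_col: str = "language_code",
) -> pd.DataFrame:
    """Return a copy of df where empty language codes are filled from detect_fn(text)."""
    out = df.copy()
    out[lang_col] = out[lang_col].astype("object")
    missing = out[lang_col].isna() | (out[lang_col].astype(str).str.strip() == "")

    def safe_detect(text):
        try:
            return detect_fn(text)
        except Exception:  # detectors typically raise on empty or symbol-only text
            return None

    out.loc[missing, lang_col] = out.loc[missing, text_col].map(safe_detect)
    return out

# Stand-in detector so the sketch runs without extra packages (not a real language detector).
def fake_detect(text: str) -> str:
    if not text.strip():
        raise ValueError("no features in text")
    return "es" if "libro" in text else "en"

toy = pd.DataFrame({
    "review_text": ["Hated it", "Un libro terrible", "   ", "Fine"],
    "language_code": ["eng", None, "", None],
})
filled = fill_missing_language(toy, fake_detect)
print(filled)
```

With `langdetect` installed, its `detect` function could be passed as `detect_fn`. Its results on very short reviews can be unreliable and vary between runs unless `DetectorFactory.seed` is set, and it returns two-letter codes such as `en`, which may differ from the codes already in the column.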

In [None]:
# check for final cleaning steps to slim down the dataset further before splitting, then saving to CSV

In [None]:
# split df into manageable chunks for further analysis

In [None]:
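# ratings of 0 were filtered out above, so 0star_reviews.csv contains only the header row
for star in range(0, 6):
    df_star = df_merged[df_merged['rating'] == star]
    # write to ./Data so the paths match the read_csv cells below; skip the index column
    df_star.to_csv(f"./Data/{star}star_reviews.csv", index=False)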

In [None]:
import zipfile
import os

csv_files = ["./Data/1star_reviews.csv"]

zip_path = "./Data/1star_reviews.zip"

with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
    for file in csv_files:
        arcname = os.path.basename(file)
        zipf.write(file,arcname=arcname)

zip_path

In [None]:
#assigning them to variables then checking size

df_5star = pd.read_csv("./Data/5star_reviews.csv")
df_5star.info()

In [None]:
df_4star = pd.read_csv("./Data/4star_reviews.csv")
df_4star.info()

In [None]:
df_3star = pd.read_csv("./Data/3star_reviews.csv")
df_3star.info()

In [None]:
df_2star = pd.read_csv("./Data/2star_reviews.csv")
df_2star.info()

In [None]:
df_1star = pd.read_csv("./Data/1star_reviews.csv")
df_1star.info()

In [None]:
df_0star = pd.read_csv("./Data/0star_reviews.csv")
df_0star.info()

In [None]:
# taking a sample of the smallest rating dataset to test for cleaning


In [None]:
sample_1star= df_1star.sample(10000, random_state=42)

In [None]:
pd.set_option('display.max_colwidth', None)
sample_1star[['review_text','description']].sample(5, random_state=1)


In [None]:
import numpy as np
import html
import re

In [None]:
#attempting to clean the html first
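
One way the HTML cleanup could look, using only `html` and `re` from the standard library. The function name `clean_html` and the example string are illustrative; regex-based tag stripping is crude and only suited to simple markup like `<i>` or `<br/>`.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # crude tag matcher, adequate for simple markup only
WHITESPACE_RE = re.compile(r"\s+")

def clean_html(text):
    """Strip tags, decode entities such as &amp;, and collapse whitespace."""
    if not isinstance(text, str):
        return text  # leave NaN / None untouched
    text = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)  # line breaks become spaces
    text = TAG_RE.sub("", text)
    text = html.unescape(text)
    return WHITESPACE_RE.sub(" ", text).strip()

example = "Loved the <i>idea</i>,<br/>hated the execution &amp; the ending."
print(clean_html(example))
```

Tags are removed before entities are decoded so that text a reviewer typed as `&lt;b&gt;` survives as literal characters. The function could then be applied with something like `sample_1star['review_text'].map(clean_html)`.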
