# Goodreads Books with Genre
## Data Processing

In [2]:
import kagglehub
import pandas as pd
import statistics

We start with data processing: load the data into a DataFrame and check that every column has the correct type.

In [4]:
# Download the latest version; dataset_download returns the local directory that holds the files
import os
path = kagglehub.dataset_download("middlelight/goodreadsbookswithgenres")

# Read the CSV from the downloaded directory instead of a hard-coded relative path
df = pd.read_csv(os.path.join(path, 'Goodreads_books_with_genres.csv'))
df.head()

In [None]:
print("(rows, cols):", df.shape)
df.dtypes

In [None]:
df.count()

The `count()` method shows that `genres` is the only column with null values. Since the majority of our questions deal with genre, we remove the books that have no genre. We also clean up the `genres` column by splitting each semicolon-delimited string into a list.

In [None]:
# Drop all the rows without a genre
df = df.dropna(subset=['genres'])

# Turn the genres column from a semicolon-delimited string into a list
df['genres'] = df['genres'].str.split(';')
df.head()
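
To make the split concrete, here is a minimal sketch on a made-up two-row frame. The titles and genre strings are illustrative only and are not taken from the dataset. It uses `Series.str.split`, which leaves missing values as NaN rather than raising.

```python
import pandas as pd

# Toy data: the titles and genre strings are invented for illustration
toy = pd.DataFrame({
    "Title": ["Book A", "Book B"],
    "genres": ["Fantasy;Fiction;Fantasy,Epic", "Classics"],
})

# Split each semicolon-delimited string into a list of genre labels.
# Labels that contain a comma (subgenres) are left untouched here.
toy["genres"] = toy["genres"].str.split(";")

print(toy["genres"].tolist())
```

Each cell now holds a Python list, which is what `explode()` expects later on.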


Next, since we will be using the publication date, we want to convert the string into a datetime object for easier computation.

In [None]:
# Convert to datetime; values that cannot be parsed become NaT
df['publication_date'] = pd.to_datetime(df['publication_date'], errors='coerce')

# Look at rows that are NaT (not a time)
nat_rows = df[df['publication_date'].isna()]
nat_rows

Only two rows lack a valid date, so we remove them; dropping two rows will not have a meaningful impact on the analysis.

In [None]:
# Remove the rows with NaT
df = df.dropna(subset=['publication_date'])
df.head()
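
The coercion step above can be checked on a small example. The sketch below uses made-up date strings, one of which names a day that does not exist, and shows that `errors='coerce'` turns it into `NaT` instead of raising. The month/day/year layout passed as `format` is an assumption made for this toy data.

```python
import pandas as pd

# Made-up date strings; 11/31/2000 is not a real calendar date
dates = pd.Series(["9/16/2006", "11/31/2000", "4/1/1999"])

# An explicit format removes any ambiguity between month-first and day-first.
# Assumption: the strings follow a month/day/year layout.
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

# True marks the entries that could not be parsed
print(parsed.isna().tolist())
```

Rows flagged this way can then be dropped with `dropna(subset=[...])`, as in the cell above.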

Now let's do some basic exploration of the data, looking at the ranges of publication date, number of pages, average rating, and number of ratings, along with the lists of publishers and genres.

In [None]:
# Range of publication dates
min_date = df['publication_date'].min()
max_date = df['publication_date'].max()
print("Publication date ranges from", min_date, "to", max_date)

# Range of number of pages
min_pages = df['num_pages'].min()
max_pages = df['num_pages'].max()
print("Number of pages ranges from", min_pages, "to", max_pages)

# Range of average rating
min_average_rating = df['average_rating'].min()
max_average_rating = df['average_rating'].max()
print("Average rating ranges from", min_average_rating, "to", max_average_rating)

# Range of number of ratings
min_ratings = df['ratings_count'].min()
max_ratings = df['ratings_count'].max()
print("Number of ratings ranges from", min_ratings, "to", max_ratings)

# List of publishers
unique_publishers = df['publisher'].unique()
print("List of publishers:", unique_publishers)

# List of genres
all_genres = df['genres'].explode()
unique_genres = all_genres.unique()
print("List of genres:", unique_genres)

def filter_genres(genres):
    return [genre for genre in genres if ',' not in genre]

main_genres = df['genres'].apply(filter_genres).explode()
unique_main_genres = main_genres.unique()
print("List of MAIN genres:", unique_main_genres)
print("Number of genres", len(unique_genres))
print("Number of MAIN genres", len(unique_main_genres))


- **<font color='red'>I think we should remove the subgenres like Fantasy,Epic. And just have the broader genre</font>**
    - or shrink to a smaller category of genres
- **<font color='red'>I think we should remove the rows that have 0 as the page number</font>**
- **<font color='red'>Same with average rating and number of ratings?</font>**
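
One way the three notes above could be carried out is sketched below on a toy frame. The column names `genres`, `num_pages`, `average_rating` and `ratings_count` follow the notebook; the rows are invented, and `to_broad_genres` is a hypothetical helper, not part of the notebook. It maps a subgenre such as `Fantasy,Epic` to its broad genre `Fantasy` by keeping the text before the first comma, then drops rows where any of the three numeric columns is zero.

```python
import pandas as pd

# Invented rows, for illustration only
toy = pd.DataFrame({
    "genres": [["Fantasy", "Fantasy,Epic"], ["Classics"], ["Humor,Comedy"]],
    "num_pages": [320, 0, 150],
    "average_rating": [4.1, 3.9, 0.0],
    "ratings_count": [1200, 45, 0],
})

def to_broad_genres(genres):
    """Hypothetical helper: keep the part before the first comma,
    removing duplicates while preserving order."""
    broad = [genre.split(",")[0].strip() for genre in genres]
    return list(dict.fromkeys(broad))

toy["genres"] = toy["genres"].apply(to_broad_genres)

# Keep only rows with a positive page count, rating, and number of ratings
cleaned = toy[
    (toy["num_pages"] > 0)
    & (toy["average_rating"] > 0)
    & (toy["ratings_count"] > 0)
]

print(cleaned)
```

Whether zero values are genuine or placeholders for missing data is a judgment call, so it may be worth counting the affected rows before dropping them.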



In [None]:
# Genres with the most books
top_genres = main_genres.value_counts().head(25)
top_genres

**<font color='red'>What genres should we do?</font>**
- classics
- fantasy
- history
- mystery
- romance
- young adult
- science fiction
- childrens
- humor
- thriller
- biography
- short stories
- horror


# Test 1: Does a higher number of ratings lead to a lower average rating?
### We'll compare the number of ratings to the average rating for *each book*.


*   H0: The total number of ratings for an individual book is not linearly correlated with its average rating.
*   HA: The total number of ratings for an individual book is linearly correlated with its average rating.


In [None]:
# dataframe displaying info for just the top 25 genres
top_genres_list = top_genres.index.tolist()

df_top_genres = df[df['genres'].apply(lambda x: any(genre in top_genres_list for genre in x))]
df_top_genres

In [None]:
# scatter plot
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.scatter(df_top_genres['ratings_count'], df_top_genres['average_rating'])
plt.xlabel("Number of Ratings")
plt.ylabel("Average Rating (0-5)")
plt.title("Average Rating vs. Number of Ratings for Top Genres")
plt.xscale('log')

plt.grid(True)
plt.show()

In [None]:
# apply Pearson's correlation test on the dataframe with top genres
from scipy import stats
result = stats.pearsonr(df_top_genres['ratings_count'], df_top_genres['average_rating'])
result.pvalue

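The p-value of approximately 1.192 × 10^-5 is well below the significance level of 0.05, so we reject the null hypothesis that the number of ratings and the average rating are uncorrelated. This establishes a statistically significant linear association, not a causal effect. The p-value alone also does not tell us whether the relationship is positive or negative; the sign of the correlation coefficient (`result.statistic`) is needed to answer the question posed in the title.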
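
As a sketch of how the direction and strength of the association could be inspected, the example below unpacks both values returned by `pearsonr`. Because the scatter plot uses a log-scaled x axis, it also correlates against `log10` of the count, which requires positive counts. The numbers are invented toy values, not results from the dataset.

```python
import numpy as np
from scipy import stats

# Invented toy values, for illustration only
ratings_count = np.array([10, 100, 1_000, 10_000, 100_000])
average_rating = np.array([4.4, 4.3, 4.1, 4.0, 3.8])

# pearsonr returns the correlation coefficient r and the two-sided p-value
r_raw, p_raw = stats.pearsonr(ratings_count, average_rating)

# Counts must be strictly positive before taking the logarithm
mask = ratings_count > 0
r_log, p_log = stats.pearsonr(np.log10(ratings_count[mask]), average_rating[mask])

print("raw counts:   r =", r_raw, " p =", p_raw)
print("log10 counts: r =", r_log, " p =", p_log)
```

A negative `r` would indicate that books with more ratings tend to have lower average ratings; a positive `r` would indicate the opposite.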

In [None]:
# note: there are no average ratings 'x' where 0.00 < x < 2.00
df.sort_values('average_rating').head(30)

In [None]:
# note: there are also fewer than 70 books with an average rating less than 3.00
df.sort_values('average_rating').head(70)