# Introduction

In this notebook, we will explore and analyze the Goodreads Top 500 Novels dataset. This dataset contains information about 500 novels, including their title, author, publication year, original language, genre, and Goodreads rating data (average rating, number of ratings, and number of reviews).

Our main focus will be to analyze the representation of genres in the dataset and the correlation between genre and book popularity. Additionally, we will look at how book ratings and genre distributions have evolved over time.

The analysis will provide insights into the most popular genres, how they are rated, and trends in book popularity across years.


### Imports

In [107]:
import pandas as pd
import altair as alt
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)


### Dataset

In [108]:
# Load the dataset from GitHub (pandas is already imported above)
df = pd.read_csv("https://raw.githubusercontent.com/melaniewalsh/responsible-datasets-in-context/main/datasets/top-500-novels/library_top_500.csv", low_memory=False)
df

Unnamed: 0,top_500_rank,title,author,pub_year,orig_lang,genre,author_birth,author_death,author_gender,author_primary_lang,...,gr_num_ratings,gr_num_reviews,gr_avg_rating_rank,gr_num_ratings_rank,oclc_owi,author_viaf,gr_url,wiki_url,pg_eng_url,pg_orig_url
0,1,Don Quixote,Miguel de Cervantes,1605,Spanish,action,1547,1616,male,spa,...,269435,12053,318,211,1.810748e+09,17220427,https://www.goodreads.com/book/show/3836.Don_Q...,https://en.wikipedia.org/wiki/Don_Quixote,https://www.gutenberg.org/cache/epub/996/pg996...,https://www.gutenberg.org/cache/epub/2000/pg20...
1,2,Alice's Adventures in Wonderland,Lewis Carroll,1865,English,fantasy,1832,1898,male,eng,...,561016,15380,172,133,1.156132e+10,66462036,https://www.goodreads.com/book/show/24213.Alic...,https://en.wikipedia.org/wiki/Alice%27s_Advent...,https://www.gutenberg.org/cache/epub/11/pg11.txt,
2,3,The Adventures of Huckleberry Finn,Mark Twain,1884,English,action,1835,1910,male,eng,...,1262480,19440,373,68,3.373178e+09,50566653,https://www.goodreads.com/book/show/2956.The_A...,https://en.wikipedia.org/wiki/Adventures_of_Hu...,https://www.gutenberg.org/cache/epub/76/pg76.txt,
3,4,The Adventures of Tom Sawyer,Mark Twain,1876,English,action,1835,1910,male,eng,...,931898,13603,301,88,3.373178e+09,50566653,https://www.goodreads.com/book/show/24583.The_...,https://en.wikipedia.org/wiki/The_Adventures_o...,https://www.gutenberg.org/cache/epub/74/pg74.txt,
4,5,Treasure Island,Robert Louis Stevenson,1883,English,action,1850,1894,male,eng,...,486155,16307,368,145,3.434000e+03,95207986,https://www.goodreads.com/book/show/295.Treasu...,https://en.wikipedia.org/wiki/Treasure_Island,https://www.gutenberg.org/cache/epub/120/pg120...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,496,Stranger in a Strange Land,Robert A. Heinlein,1961,English,scifi,1907,1988,male,eng,...,311859,9961,310,190,7.894120e+05,12309757,,https://en.wikipedia.org/wiki/Stranger_in_a_St...,NA_not-pub-domain,
496,497,Vision in White,Nora Roberts,2009,English,romance,1965,ALIVE,female,eng,...,138445,4652,128,277,1.559638e+08,66448023,,https://en.wikipedia.org/wiki/Vision_in_White,NA_not-pub-domain,
497,498,The Whipping Boy,Sid Fleischman,1986,English,action,1920,2010,male,eng,...,27444,1623,476,445,4.415520e+08,66438084,,https://en.wikipedia.org/wiki/The_Whipping_Boy,NA_not-pub-domain,
498,499,Room,Emma Donoghue,2010,English,na,1969,ALIVE,female,eng,...,801989,50594,171,101,4.859780e+08,39539889,,https://en.wikipedia.org/wiki/Room_(novel),NA_not-pub-domain,


# Data Overview

The dataset consists of 500 rows and 29 columns, containing both numerical and categorical data. The key columns for analysis include the publication year (pub_year) and average rating (gr_avg_rating), both of which are numeric. The dataset also includes categorical information about the book's title, author, and genre.

There are missing values in several columns, with the largest gaps in author_field_of_activity (171 missing), author_occupation (42 missing), and pg_orig_url (436 missing). None of these columns is used in the analysis below, so we leave them as they are. Note that some columns use placeholder strings rather than empty cells (for example "na" in genre and "NA_not-pub-domain" in pg_eng_url), so a count of empty cells alone understates how much information is unavailable.
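
As a rough illustration of how such a check might look, the sketch below counts true missing values and placeholder strings separately. The `sample` frame is a made-up stand-in for `df` (its rows are not taken from the dataset); only the column names and the "na" placeholder come from the preview above.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the real dataset
sample = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "genre": ["fantasy", "na", "history", "na"],
    "pg_orig_url": [None, None, "https://example.org/b.txt", None],
})

# Count true missing values (NaN/None) per column
missing_counts = sample.isna().sum()

# The string "na" is not treated as missing by isna(), so count it separately
placeholder_count = int((sample["genre"] == "na").sum())

print(missing_counts)
print("genre rows marked 'na':", placeholder_count)
```

Running the same two steps on `df` would show which columns have real gaps and how many books carry the "na" genre label.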


# Exploratory Analysis

In this section, we will conduct an exploratory analysis of the dataset. Specifically, we will look at:
- The distribution of book genres and their representation in the dataset
- The relationship between genres and their average ratings
- The evolution of ratings over time (by publication year)
- A closer look at the top-rated books


### a) Genre Representation

In [109]:
# Filter the top 10 genres by count
top_10_genres = df['genre'].value_counts().nlargest(10).reset_index()
top_10_genres.columns = ['genre', 'Count']

# Create the bar chart, sorted by count, with a tooltip showing exact values
chart1 = alt.Chart(top_10_genres).mark_bar(color='lightblue').encode(
    x=alt.X('genre', sort='-y', title='Genre'),
    y=alt.Y('Count', title='Count'),
    tooltip=['genre', 'Count']
).properties(
    title='Top 10 Genres'
)
chart1


### b) Rating Distribution by Genre

In [110]:
# Calculate the average rating for each genre
avg_rating_by_genre = df.groupby('genre')['gr_avg_rating'].mean().reset_index()

# Create the horizontal bar chart
chart2 = alt.Chart(avg_rating_by_genre).mark_bar(color='salmon').encode(
    x=alt.X('gr_avg_rating', title='Average Rating'),
    y=alt.Y('genre', sort='-x', title='Genre')
).properties(
    title='Average Rating by Genre'
)
chart2


### c) Popularity Over Time

In [111]:
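# Calculate the average rating for each publication year
avg_rating_by_year = df.groupby('pub_year')['gr_avg_rating'].mean().reset_index()

# Generate a list of decades (every 10 years) to use as x-axis tick labels
first_decade = int(avg_rating_by_year['pub_year'].min()) // 10 * 10
last_year = int(avg_rating_by_year['pub_year'].max())
decades_to_show = list(range(first_decade, last_year + 10, 10))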

chart3 = alt.Chart(avg_rating_by_year).mark_line(color='salmon').encode(
    x=alt.X('pub_year:O', title='Publication Year', axis=alt.Axis(values=decades_to_show, labelAngle=-45)),
    y=alt.Y('gr_avg_rating:Q', title='Average Rating')
).properties(
    title='Average Rating by Year of Publication',
    width=800,
    height=400
)

chart3


**Here's a Quick View of All the Charts Side by Side**

To make the charts easier to compare without scrolling, the cell below displays all three side by side, each with its own y-axis scale. Look for patterns that show up in more than one chart.

In [112]:
(chart1 | chart2 | chart3).resolve_scale(y='independent')

### d) Top Books by Rating

In [113]:
# Sort the dataset by rating to get the top 10 highest-rated books
top_books = df.sort_values(by='gr_avg_rating', ascending=False).head(10)

# Display top 10 books
top_books[['title', 'author', 'gr_avg_rating']]


| | title | author | gr_avg_rating |
|---|---|---|---|
| 178 | Harry Potter and the Deathly Hallows | J.K. Rowling | 4.62 |
| 102 | Harry Potter and the Prisoner of Azkaban | J.K. Rowling | 4.58 |
| 446 | Harry Potter and the Half-Blood Prince | J.K. Rowling | 4.58 |
| 105 | Harry Potter and the Goblet of Fire | J.K. Rowling | 4.57 |
| 23 | The Return of the King | J.R.R. Tolkien | 4.56 |
| 47 | The Complete Sherlock Holmes | Arthur Conan Doyle | 4.5 |
| 124 | Harry Potter and the Order of the Phoenix | J.K. Rowling | 4.5 |
| 109 | The Two Towers | J.R.R. Tolkien | 4.48 |
| 44 | Harry Potter and the Sorcerer's Stone | J.K. Rowling | 4.47 |
| 205 | The Help | Kathryn Stockett | 4.47 |


# Findings & Interpretations

**Top 10 Genres Distribution:**

The first bar chart shows the number of books in each of the ten most frequent genres. The "na" category, which appears to mark books with no assigned genre, has a much higher count than any named genre, so a large share of the dataset is effectively unclassified. Among the named genres, history and fantasy are the most common, followed by romance and bildungsroman.
Fantasy ranks third overall. These counts describe which books made it into the dataset, so they measure how well a genre is represented in this list rather than how popular the genre is with readers in general.
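
One way to keep the unclassified books from dominating the picture is to drop the "na" label before comparing genres. The sketch below does this on a small made-up frame (`sample` is hypothetical, not real rows); treating "na" as "no genre assigned" is an assumption based on the preview of the data.

```python
import pandas as pd

# Hypothetical genre column: three unclassified books, three classified ones
sample = pd.DataFrame(
    {"genre": ["na", "na", "na", "history", "history", "fantasy"]}
)

# Count every label, then drop the placeholder label if it is present
counts = sample["genre"].value_counts()
classified = counts.drop("na", errors="ignore")

# Share of each genre among classified books only
shares = classified / classified.sum()

print(classified)
print(shares.round(2))
```

Applied to `df`, the resulting `classified` series could be passed to the same bar chart code in place of the raw counts.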

**Average Rating by Genre:**

The second chart shows that average ratings are fairly consistent across genres, hovering around 4.0. This is likely a selection effect: the dataset contains only widely known novels, and readers who choose to rate such books tend to rate them positively.
Fantasy and history have similar average ratings, so genre by itself does little to separate well-received books from less well-received ones in this dataset.
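
A mean by itself hides how many books stand behind it, and a genre with one or two books can look unusually high or low by chance. The sketch below reports the mean alongside the book count and spread, using a made-up frame (`sample` and its values are hypothetical; the column names match those used in the notebook).

```python
import pandas as pd

# Hypothetical ratings for a handful of books
sample = pd.DataFrame({
    "genre": ["fantasy", "fantasy", "history", "history", "history", "romance"],
    "gr_avg_rating": [4.5, 4.1, 3.9, 4.0, 4.1, 3.8],
})

# Mean rating, number of books, and standard deviation per genre
genre_summary = (
    sample.groupby("genre")["gr_avg_rating"]
    .agg(mean_rating="mean", n_books="count", std_rating="std")
    .sort_values("mean_rating", ascending=False)
)

print(genre_summary)
```

Genres with a single book have an undefined standard deviation (NaN), which is itself a useful signal that the mean should not be over-interpreted.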

**Average Rating by Publication Year:**

This line chart shows the average rating for each publication year. Ratings remain relatively stable over time, which suggests that readers rate classic and contemporary works similarly. 2014 shows a slight increase in average rating, but with 500 books spread over several centuries, most years contain only a handful of titles, so a single-year bump may reflect one or two highly rated books rather than a trend toward higher ratings for newer works.
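
Grouping by decade instead of by year is one way to smooth out this noise. The sketch below shows the idea on a made-up frame (`sample` and its values are hypothetical); the `decade` column is a helper introduced here and is not part of the dataset.

```python
import pandas as pd

# Hypothetical publication years and ratings
sample = pd.DataFrame({
    "pub_year": [1865, 1869, 1884, 1961, 1965, 2014],
    "gr_avg_rating": [4.0, 3.8, 3.8, 3.9, 4.1, 4.4],
})

# Floor each year to the start of its decade, e.g. 1869 -> 1860
sample["decade"] = sample["pub_year"] // 10 * 10

# Average rating and number of books per decade
by_decade = (
    sample.groupby("decade")["gr_avg_rating"]
    .agg(mean_rating="mean", n_books="count")
    .reset_index()
)

print(by_decade)
```

Keeping `n_books` next to the mean makes it easy to see which decades rest on a single title.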

**Top-Rated Books:**

The table of top-rated books is dominated by two series: six Harry Potter novels by J.K. Rowling and two volumes of The Lord of the Rings by J.R.R. Tolkien. Ratings in the top ten range from 4.47 to 4.62. The concentration of these series highlights their strong reader approval and makes fantasy the standout genre at the top of the list.
The Help by Kathryn Stockett and The Complete Sherlock Holmes by Arthur Conan Doyle are the only two entries from outside these series, which adds some diversity to the top-rated works.

# Conclusion

In this analysis, we looked at the distribution of genres and examined how genre might influence popularity in the Goodreads Top 500 Novels dataset. Our findings revealed several interesting patterns and insights:

**Genre Popularity:** The predominance of "na" in the genre counts indicates a need to address missing or ambiguous data, possibly by refining the classification. Among classified genres, fantasy is both well represented and well rated, and it dominates the list of top-rated books.

**Consistency in Ratings:** Average ratings are high and stable across genres and publication years. This is what we would expect from a dataset limited to widely read, well-regarded books, so it says little about how readers rate fiction in general.

**Enduring Appeal of Fantasy and Classics:** The top-rated books list underscores the timeless appeal of classic fantasy literature and established authors. This trend can be valuable for informing which genres or authors to focus on in future projects or recommendations for readers interested in high-quality literature.

Moving forward, we could:
- Explore the correlation between the number of pages and rating to understand if longer books are more highly rated.
- Perform a sentiment analysis on book descriptions (if available) to understand how the content of the books correlates with their ratings.
- Investigate the impact of language and author on book popularity.
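
As a starting point for the first idea, a correlation between page count and rating might be sketched as follows. The `num_pages` column and all values are hypothetical: page counts would need to be present in, or joined onto, the dataset under whatever name applies.

```python
import pandas as pd

# Hypothetical page counts and ratings for four books
sample = pd.DataFrame({
    "num_pages": [100, 200, 300, 400],        # assumed column name
    "gr_avg_rating": [3.8, 4.0, 4.1, 4.4],
})

# Pearson correlation (the default method of Series.corr)
pearson_r = sample["num_pages"].corr(sample["gr_avg_rating"])

# Rank-based alternative: Pearson correlation of the ranks,
# which is less sensitive to a few very long books
rank_r = sample["num_pages"].rank().corr(sample["gr_avg_rating"].rank())

print(round(pearson_r, 3), round(rank_r, 3))
```

A correlation on real data would describe association only; it would not show that length causes higher ratings.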

Overall, this analysis provides a foundation for understanding trends in book popularity, and further exploration can provide deeper insights into specific genres or years.