# Exploring and Visualizing Authors' Nationalities in Top 500 Novels

## Importing the Data


In [1]:
import pandas as pd
import altair as alt

df = pd.read_csv("https://raw.githubusercontent.com/melaniewalsh/responsible-datasets-in-context/main/datasets/top-500-novels/library_top_500.csv")
df.head()

Unnamed: 0,top_500_rank,title,author,pub_year,orig_lang,genre,author_birth,author_death,author_gender,author_primary_lang,...,gr_num_ratings,gr_num_reviews,gr_avg_rating_rank,gr_num_ratings_rank,oclc_owi,author_viaf,gr_url,wiki_url,pg_eng_url,pg_orig_url
0,1,Don Quixote,Miguel de Cervantes,1605,Spanish,action,1547,1616,male,spa,...,269435,12053,318,211,1810748000.0,17220427,https://www.goodreads.com/book/show/3836.Don_Q...,https://en.wikipedia.org/wiki/Don_Quixote,https://www.gutenberg.org/cache/epub/996/pg996...,https://www.gutenberg.org/cache/epub/2000/pg20...
1,2,Alice's Adventures in Wonderland,Lewis Carroll,1865,English,fantasy,1832,1898,male,eng,...,561016,15380,172,133,11561320000.0,66462036,https://www.goodreads.com/book/show/24213.Alic...,https://en.wikipedia.org/wiki/Alice%27s_Advent...,https://www.gutenberg.org/cache/epub/11/pg11.txt,
2,3,The Adventures of Huckleberry Finn,Mark Twain,1884,English,action,1835,1910,male,eng,...,1262480,19440,373,68,3373178000.0,50566653,https://www.goodreads.com/book/show/2956.The_A...,https://en.wikipedia.org/wiki/Adventures_of_Hu...,https://www.gutenberg.org/cache/epub/76/pg76.txt,
3,4,The Adventures of Tom Sawyer,Mark Twain,1876,English,action,1835,1910,male,eng,...,931898,13603,301,88,3373178000.0,50566653,https://www.goodreads.com/book/show/24583.The_...,https://en.wikipedia.org/wiki/The_Adventures_o...,https://www.gutenberg.org/cache/epub/74/pg74.txt,
4,5,Treasure Island,Robert Louis Stevenson,1883,English,action,1850,1894,male,eng,...,486155,16307,368,145,3434.0,95207986,https://www.goodreads.com/book/show/295.Treasu...,https://en.wikipedia.org/wiki/Treasure_Island,https://www.gutenberg.org/cache/epub/120/pg120...,


## Distribution of Countries Where Top 500 Novel Authors are From

In [2]:
# replace missing values with 'Unknown'
# (assign the result back instead of calling fillna with inplace=True on a column selection)
df['author_nationality'] = df['author_nationality'].fillna('Unknown')

# frequency of each nationality
nationality_counts = df['author_nationality'].value_counts().reset_index()
nationality_counts.columns = ['author_nationality', 'count']

# bar graph to visualize top 20 nationalities
top_nationalities_chart = alt.Chart(nationality_counts.head(20)).mark_bar().encode(
    x=alt.X('count:Q', title='Books'),
    y=alt.Y('author_nationality:N', sort='-x', title='Author Nationality'),
    tooltip=['author_nationality', 'count']
).properties(
    title="20 Most Frequent Author Nationalities in Top 500 Novels",
    width=600,
    height=400
)

top_nationalities_chart


To little surprise, a dataset about the nationalities of top 500 novel authors that draws on an English-language site is dominated by two English-speaking countries, the United States and Great Britain. Because the ratings come from Goodreads, a website that is only in English and clearly made for an English-speaking audience, it makes sense that its users would mainly read and rate books from their own countries. There could be a book in a completely different language that these users would enjoy, but they would never have the opportunity to read and rate it, solely because they don't speak that language. That's why I am concerned that books are being left out only because of a language barrier. International authors are locked out of massive platforms like Goodreads and can't establish themselves with these bigger audiences without translating their books into English.
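
To put a number on "dominated," one option is to look at proportions rather than raw counts. The sketch below runs on a small made-up table (the nationality codes and counts are hypothetical, not taken from the dataset) and only shows the pattern; swapping in `df['author_nationality']` should give the real shares.

```python
import pandas as pd

# Toy stand-in for df['author_nationality']; codes and counts are made up.
toy = pd.DataFrame({
    "author_nationality": ["US", "US", "US", "GB", "GB", "FR", "RU", None],
})

# Fill missing values by assigning the result back to the column.
toy["author_nationality"] = toy["author_nationality"].fillna("Unknown")

# normalize=True returns proportions instead of raw counts.
shares = toy["author_nationality"].value_counts(normalize=True)

# Combined share of the two most frequent nationalities.
top2_share = shares.iloc[:2].sum()
print(shares.round(3))
print(f"Top 2 share: {top2_share:.1%}")
```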

It would be nice to see how many other countries are represented in the top 500 novels, so clearing out the top 2 bars would "unskew" the chart and let us take a closer look at the runners-up. Let's filter out the top two English-speaking countries (we'll leave in the Aussies, though) to get a better look at the other nationalities of these authors.

In [3]:
# Exclude the top 2 most frequent nationalities
nationality_counts_filtered = nationality_counts.iloc[2:]

# Bar graph to visualize the next top 20 nationalities (after excluding the top 2)
top_nationalities_chart_filtered = alt.Chart(nationality_counts_filtered.head(20)).mark_bar().encode(
    x=alt.X('count:Q', title='Books'),
    y=alt.Y('author_nationality:N', sort='-x', title='Author Nationality'),
    tooltip=['author_nationality', 'count']
).properties(
    title="Next 20 Most Frequent Author Nationalities in Top 500 Novels (Excluding Top 2)",
    width=600,
    height=400
)

top_nationalities_chart_filtered
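
Dropping rows by position with `iloc[2:]` relies on the two English-speaking countries being the first two rows. If a label such as `'Unknown'` ever lands in the top two, the wrong rows would be removed. A possible alternative, sketched here on a made-up counts table, is to exclude nationalities by name. The labels `"US"` and `"GB"` are assumptions; the real values in `df['author_nationality']` should be checked first.

```python
import pandas as pd

# Toy counts table shaped like nationality_counts; values are made up.
toy_counts = pd.DataFrame({
    "author_nationality": ["US", "GB", "FR", "RU", "AU"],
    "count": [5, 4, 2, 2, 1],
})

# Hypothetical labels; confirm the actual codes used in the dataset.
excluded = ["US", "GB"]

# ~isin(...) keeps every row whose nationality is not in the excluded list.
remaining = toy_counts[~toy_counts["author_nationality"].isin(excluded)]
print(remaining)
```

The filtered table can then be passed to `alt.Chart(...)` in the same way as `nationality_counts_filtered`.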

## Average Ratings by Author's Nationality

In [None]:
# find the average rating for each nationality
average_ratings_by_nationality = df.groupby('author_nationality')['gr_avg_rating'].mean().reset_index()

# filter out countries with fewer than 5 books
book_counts = df['author_nationality'].value_counts()
filtered_nationalities = book_counts[book_counts >= 5].index
average_ratings_by_nationality = average_ratings_by_nationality[average_ratings_by_nationality['author_nationality'].isin(filtered_nationalities)]

# bar visualization of avg rating by nationality
average_ratings_chart = alt.Chart(average_ratings_by_nationality).mark_bar().encode(
    x=alt.X('gr_avg_rating:Q', title='Average Rating'),
    y=alt.Y('author_nationality:N', sort='-x', title='Author Nationality'),
    tooltip=['author_nationality', 'gr_avg_rating']
).properties(
    title="Average Goodreads Rating by Author Nationality",
    width=600,
    height=400
)


average_ratings_chart


Russia and Canada have slightly higher average ratings than the others, with averages above 4.1, while Ireland (IE) has the lowest average rating, slightly below 4.0. Although this difference is minor, it may reflect slight reader preferences or simply the particular selection of books by authors of certain nationalities in this dataset.

The average ratings for each nationality are quite close, hovering around the 4.0 mark. Overall, I'd say there isn't a meaningful difference in how Goodreads users rate books by authors of different nationalities. This suggests a consistent appreciation for these books among these readers, regardless of where the author is from.

Since this dataset is dominated by English-speaking countries such as the US, UK, and Canada, it’s worth noting that the relatively high average ratings might reflect the familiarity or popularity of these books among a predominantly English-speaking audience on Goodreads.
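
One way to judge whether gaps of about 0.1 between nationalities matter is to put each group's spread and book count next to its mean. The sketch below uses made-up ratings (the nationalities and numbers are hypothetical) and a lower threshold than the notebook's 5 books so that the toy data passes the filter.

```python
import pandas as pd

# Toy ratings; nationalities and numbers are made up for illustration.
toy = pd.DataFrame({
    "author_nationality": ["US", "US", "US", "RU", "RU", "IE", "IE"],
    "gr_avg_rating": [4.0, 3.8, 4.2, 4.1, 4.3, 3.9, 4.0],
})

# Named aggregation: mean, spread, and number of books per nationality.
summary = (
    toy.groupby("author_nationality")["gr_avg_rating"]
    .agg(mean_rating="mean", std_rating="std", books="count")
    .reset_index()
)

# Keep groups with at least min_books (2 for the toy data; the notebook uses 5).
min_books = 2
summary = summary[summary["books"] >= min_books]
print(summary.round(2))
```

If the standard deviation within a group is as large as the gap between group means, the ranking of nationalities probably says more about which books were selected than about reader preferences.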