# Book Analysis

This notebook explores a dataset of book details scraped from [Tor](https://publishing.tor.com/).

## Imports

In [1]:
from os import environ

from dotenv import load_dotenv
import pandas as pd
import altair as alt
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords  # Lists of boring words
from pandas.core.common import flatten

## Setup

In [2]:
load_dotenv()

pd.set_option("max_colwidth", 150)  # Display more text

## Data sourcing

In [3]:
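def string_to_list(string: str) -> list[str]:
    """Returns a list from a list stored as a string, e.g. "['a', 'b']"."""
    items = string[2:-2]  # Strip the leading "['" and trailing "']"
    return items.split("', '") if items else []  # "[]" becomes an empty list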
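
Splitting on a fixed separator relies on every item being wrapped in single quotes, so a stored list that mixes quote styles (for example, a name containing an apostrophe) would not split cleanly. As a hedged alternative, the standard library's `ast.literal_eval` can parse the stored text directly. This is a sketch that assumes the CSV columns hold Python list literals; `parse_list_literal` and the sample names are hypothetical and not part of the notebook.

```python
import ast


def parse_list_literal(text: str) -> list[str]:
    """Parse a list stored as text, e.g. "['Hardcover', 'e-Book']"."""
    value = ast.literal_eval(text)  # Only evaluates literals, never arbitrary code
    if not isinstance(value, list):
        raise ValueError(f"Expected a list literal, got {type(value).__name__}")
    return [str(item) for item in value]


print(parse_list_literal("['Hardcover', 'e-Book']"))
print(parse_list_literal("[]"))
# Invented example with mixed quote styles
print(parse_list_literal("""['Ann Author', "Pat O'Example"]"""))
```

A function like this could be passed to `converters` in `pd.read_csv` in place of `string_to_list`.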

In [4]:
books = pd.read_csv(environ["FINAL_BOOK_FILEPATH"], parse_dates=["publication_date"],
                    converters={"formats": string_to_list, "contributors": string_to_list})

In [5]:
books.sample(3)

Unnamed: 0,title,description,series,series_number,pages,publication_date,formats,contributors
289,The Truth of the Aleke,"Moses Ose Utomi returns to his Forever Desert series with The Truth of the Aleke , continuing his epic fable about truth, falsehood, and the shac...",The Forever Desert,2.0,112.0,2024-03-05,"[Hardcover, e-Book]",[Moses Ose Utomi]
33,The Book of Ile-Rien,,,,990.0,2024-02-27,"[e-Book, Trade Paperback]",[Martha Wells]
48,Cold Counsel,"In Chris Sharp's new epic fantasy Cold Counsel , Slud of the Blood Claw Clan, Bringer of Troubles, was born at the heart of the worst storm the m...",,,302.0,2017-02-21,[e-Book],[Chris Sharp]


## Data exploration

**Which books are missing descriptions or page counts? Can you work out why?**

In [6]:
books[(books["description"].isna()) | (books["pages"].isna())]

Unnamed: 0,title,description,series,series_number,pages,publication_date,formats,contributors
33,The Book of Ile-Rien,,,,990.0,2024-02-27,"[e-Book, Trade Paperback]",[Martha Wells]
41,The Butcher of the Forest,,,,160.0,2024-02-27,"[e-Book, Trade Paperback]",[]
174,The Murderbot Diaries,“We are all a little bit Murderbot.” – NPR on Martha Wells's The Murderbot Diaries...,The Murderbot Diaries,0.0,,2020-10-27,[Hardcover],[Martha Wells]
199,"The Practice, the Horizon, and the Chain",,,,112.0,2024-03-26,"[e-Book, Trade Paperback]",[]
225,"Seanan McGuire's Wayward Children, Volumes 1-3",Winner: 2022 Hugo Award for Best Series For the first time experience the first three hardcover volumes of Seanan McGuire's Hugo and Nebula Award...,Wayward Children,0.0,,2020-11-17,[Hardcover],[Seanan McGuire]


**How many books are not part of a series?**

In [7]:
books["series"].isna().sum()

174

**What is the name of the longest series?**

In [8]:
# "Longest" is ambiguous; start with the longest series *name* (by character count)
series = books.loc[books["series"].str.len().idxmax(), "series"]
series

'The Locked Tomb Trilogy The Locked Tomb Series'

In [9]:
# Longest by highest series number
books.groupby("series")["series_number"].max().sort_values(ascending=False).head(1)

series
Laundry Files    13.0
Name: series_number, dtype: float64

In [10]:
# Longest by total page count (top five)
books.groupby("series")["pages"].sum().sort_values(ascending=False).head(5)

series
A Sin du Jour Affair     2966.0
Wayward Children         2144.0
Laundry Files            1600.0
Witches of Lychford      1464.0
The Murderbot Diaries    1456.0
Name: pages, dtype: float64

**How many books were published in each month of the year (bar chart)?**

In [11]:
# Get books per month of year
books["month"] = books["publication_date"].dt.month

month_counts = books.groupby("month")["title"].count().reset_index()

# Chart it, treating month as an ordered category (":O") rather than a continuous number
alt.Chart(month_counts).mark_bar().encode(x="month:O", y="title", color="month:O")

**What's the average number of pages?**

In [12]:
books["pages"].mean().round(2)

260.69

**What proportion of books have more than one author (pie chart)?**

In [13]:
# Find multiple contributors

books["multiple_contributors"] = books["contributors"].apply(lambda x: len(x) > 1)

# Find the counts

contributor_counts = books["multiple_contributors"].value_counts().reset_index()

# Make the chart

alt.Chart(contributor_counts).mark_arc().encode(theta="count", color="multiple_contributors")

**How many books were published each year (line chart)?**

In [14]:
# Count titles per calendar year; pandas 2.2+ deprecates the "Y" alias in favor of "YE"
year_counts = books.set_index("publication_date").resample("Y")["title"].count().reset_index()
year_counts["year"] = year_counts["publication_date"].dt.year

chart = alt.Chart(year_counts).mark_line(color="purple").encode(x="year", y="title")

In [15]:
chart.save("purple_line.png")

## Data cleaning

In [16]:
books = books.dropna(subset=["description"])

# Raw strings keep \w intact; literal parentheses and full stops are escaped
unwanted_phrases = [r"([Tt]he )?[Nn]ew [Yy]ork [Tt]imes",
                    r"([Tt]he )?\w+ [Aa]wards?",
                    r"\w+ [Nn]ominee",
                    r"At the Publisher's request, this title is being sold without Digital Rights Management Software \(DRM\) applied\."]

for p in unwanted_phrases:
    books["description"] = books["description"].str.replace(p, " ", regex=True)
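
When a phrase is meant to match literally under `regex=True`, characters such as `(`, `)` and `.` need escaping. The following is a minimal sketch of that point using only the standard library's `re` module; the sample description is invented for illustration.

```python
import re

DRM_NOTICE = ("At the Publisher's request, this title is being sold "
              "without Digital Rights Management Software (DRM) applied.")

sample = "A thrilling debut. " + DRM_NOTICE  # Invented description text

# Unescaped, "(DRM)" is a group that matches "DRM" without the parentheses,
# so the pattern fails to match the literal notice and nothing is replaced.
unescaped = re.sub(DRM_NOTICE, " ", sample)

# re.escape() makes every special character match itself.
escaped = re.sub(re.escape(DRM_NOTICE), " ", sample)

print(unescaped == sample)  # The notice is still present
print("DRM" in escaped)     # The notice has been removed
```

`re.escape` is convenient for long literal phrases; short patterns that mix literals with `\w` or character classes are usually easier to escape by hand.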

## Text processing

In [17]:
# Create a processed_description column

books["processed_description"] = books["description"].str.lower().apply(word_tokenize)

In [18]:
# Stopwords carry grammatical meaning but little semantic meaning

# Remove all stopwords, punctuation, numbers, and very short tokens

stops = set(stopwords.words("english"))  # A set gives fast membership checks
stops.update(["since", "also"])  # Words that should be in stops
stops.update(["publishing", "sampler", "award", "editorial", "spotlight",
              "debut", "publishers", "weekly", "anthology", "story", "author",
              "series", "finalist", "edition", "best", "seller", "selling", "fiction",
              "novel", "novella", "novelette", "title", "book", "winner", "winning"])  # Domain-specific stopwords


def filter_tokens(tokens: list[str]) -> list[str]:
    """Returns the alphabetic tokens longer than two characters that are not stopwords."""
    return [t for t in tokens
            if t not in stops
            and t.isalpha()
            and len(t) > 2]

books["processed_description"] = books["processed_description"].apply(filter_tokens)
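
To see what each of the three filter conditions removes, here is a standalone sketch that does not need NLTK. `SAMPLE_STOPS` is a tiny hypothetical stand-in for the full stopword list, `keep_tokens` mirrors the logic of `filter_tokens`, and the token list is invented.

```python
SAMPLE_STOPS = {"the", "and", "novel"}  # Hypothetical stand-in for the real stopwords


def keep_tokens(tokens: list[str], stops: set[str]) -> list[str]:
    """Keep alphabetic tokens longer than two characters that are not stopwords."""
    return [t for t in tokens
            if t not in stops
            and t.isalpha()
            and len(t) > 2]


tokens = ["the", "sin", "du", "jour", "crew", ",", "novel", "2018", "e-book"]

# "the" and "novel" are stopwords; "," "2018" and "e-book" are not purely
# alphabetic; "du" is too short.
print(keep_tokens(tokens, SAMPLE_STOPS))
```

Note that `str.isalpha` rejects hyphenated tokens such as "e-book" as well as punctuation and numbers.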

In [19]:
books[["description", "processed_description"]].sample(4)

Unnamed: 0,description,processed_description
88,"Finding Baba Yaga is a mythic yet timely novel-in-verse by the beloved and prolific bestselling author and poet Jane Yolen, “the Hans Christian ...","[finding, baba, yaga, mythic, yet, timely, beloved, prolific, bestselling, poet, jane, yolen, hans, christian, andersen, america, newsweek, young,..."
256,"One of the best books of 2018, according to Kirkus Reviews, the Chicago Review of Books , and BookRiot. finalist Malka Older's State Tectonics c...","[one, books, according, kirkus, reviews, chicago, review, books, bookriot, malka, older, state, tectonics, concludes, centenal, cycle, cyberpunk, ..."
29,"Rising science fiction and fantasy star P. Djèlí Clark brings an alternate New Orleans of orisha, airships, and adventure to life in his immersiv...","[rising, science, fantasy, star, djèlí, clark, brings, alternate, new, orleans, orisha, airships, adventure, life, immersive, black, god, drums, a..."
110,"The Sin du Jour crew caters to the Shadow Government in Greedy Pigs , Matt Wallace's fifth Sin du Jour Affair “I never did give them hell. I just...","[sin, jour, crew, caters, shadow, government, greedy, pigs, matt, wallace, fifth, sin, jour, affair, never, give, hell, told, truth, thought, poli..."


## Most popular words

In [20]:
# Select a subset of the data

data = books[books["publication_date"].dt.year == 2023]

# Gather all of the tokens into one big list

all_tokens = list(flatten(data["processed_description"]))

In [21]:
token_counts = pd.Series(all_tokens).value_counts().reset_index()
token_counts.head(2)

Unnamed: 0,index,count
0,world,26
1,city,25
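
`pandas.core.common.flatten` lives in a private pandas module, so it may change between releases. As a hedged sketch, the same flatten-and-count step can be done with the standard library alone; the token lists below are invented for illustration.

```python
from collections import Counter
from itertools import chain

# Invented stand-in for the lists in the processed_description column
token_lists = [["world", "city", "magic"], ["city", "world"], ["world"]]

# Flatten the list of lists into one long list of tokens
flat = list(chain.from_iterable(token_lists))

# Count occurrences, most frequent first (similar in spirit to value_counts)
counts = Counter(flat)
print(counts.most_common(2))
```

`chain.from_iterable` also accepts a pandas Series of lists, so it could stand in for `flatten` in the cell that builds `all_tokens`.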


In [22]:
chart = alt.Chart(token_counts.head(20)).mark_bar(color="green").encode(x=alt.X("index").sort("-y"), y="count")
chart

In [23]:
chart.save("./output/top_twenty_words.png")