Now comes the hard part: analyzing the genre column. Unfortunately, pandas dataframes are not designed to hold lists, so we're going to need to be clever about this.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('7_final_books.csv').drop('Unnamed: 0', axis=1)
df = df[df['genre'] != '[]']

Some research on how to deal with this led me to this excellent article by Max Hilsdorf: 

https://towardsdatascience.com/dealing-with-list-values-in-pandas-dataframes-a177e534f173

Let's try out some of his ideas.

First, I need to see the formatting of my lists.

In [None]:
print(type(df.iloc[0]['genre']))
print(df.iloc[0]['genre'])

As expected, pandas has stored these lists as strings. But there is one relief--each element is surrounded by quotation marks, so each string is already a valid Python list literal. That should make things a bit easier!

In [None]:
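import ast  # literal_eval parses Python literals only, so it is safer than eval on scraped text
df['genre'] = df['genre'].apply(ast.literal_eval)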

In [None]:
print(type(df.iloc[0]['genre']))
print(df.iloc[0]['genre'])
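
To make the string-to-list step concrete, here is a minimal sketch on a made-up value shaped like one cell of the genre column (the value is illustrative, not taken from the dataset). It uses `ast.literal_eval` from the standard library, which only parses Python literals and never executes code.

```python
import ast

# Toy value shaped like one cell of the 'genre' column after read_csv (illustrative only)
raw = "['Fiction', 'Mystery', 'Young Adult']"

# literal_eval turns the string into a real Python list
parsed = ast.literal_eval(raw)

print(type(raw), type(parsed))
print(parsed)
```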

Wonderful! Each book's genre feature is now treated as a list! Now we just need to find the counts--ideally without looping through all 22,000 data points. Hilsdorf recommends we consider the column as a 2D array, and then convert it to a 1D array. 

In [None]:
def to_1D(series):
    return pd.Series([x for _list in series for x in _list])

In [None]:
all_genres = to_1D(df['genre'])

In [None]:
all_genres.head(10)
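
As an aside, pandas also has a built-in way to flatten a column of lists: `Series.explode`. The sketch below runs on a small made-up series (the data is hypothetical) and should give the same counts as the `to_1D` helper above.

```python
import pandas as pd

# Toy column of list values (hypothetical data, for illustration only)
toy = pd.Series([['Fiction', 'Mystery'], ['Fantasy'], ['Fiction', 'Fantasy', 'Magic']])

# explode() gives every list element its own row; reset_index drops the repeated book labels
flat = toy.explode().reset_index(drop=True)

print(flat.value_counts())
```

One difference to keep in mind: an empty list becomes a NaN row under `explode`, whereas `to_1D` simply skips it. That doesn't matter here, since the books with empty genre lists were filtered out when loading the data.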

In [None]:
print(f"Our {len(df)} books have a total of {len(all_genres)} genres.\
 An average of {round(len(all_genres)/len(df), 1)} genres per book.")

Now that the genre column is correctly interpreted as a list, we can get more specific by adding a column for the number of genres each book has. 

In [None]:
df['num genres'] = df['genre'].apply(len)

In [None]:
df.head()

Next, let's see which genres are most common. I know from inspection that many of the elements classed as genres are not categories that would generally be considered genres--elements like "Australia" or "Star Trek." I'd expect to find these near the bottom when ranked by number of books. They likely exist because Goodreads genres are generated from how users shelve their own books.

In [None]:
len(all_genres.value_counts())

All in all, we have nearly 700 supposed "genres!" Definitely too many to try to classify from only 22,000 books. Time to start finding ones to eliminate! 

In [None]:
all_genres.value_counts().head()

As it turns out, one of the top 5 "genres" is, in fact, not a genre--"Audiobook" is a format. It looks like not all the "genres" we need to get rid of will be close to the bottom! 

Still, it might be worth checking the singletons.

In [None]:
np.sum(all_genres.value_counts() == 1)

It seems that over a hundred of the "genres" appear only once in the entire dataframe! Let's expand our net even more. 

In [None]:
np.sum(all_genres.value_counts() <= 10)

Wow! Just about half of the genres appear in 10 or fewer books.
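
Since I keep asking the same question with different cutoffs, a small helper makes the thresholds easy to vary. This is only a sketch on made-up tags; `share_at_most` is a name I'm introducing here, and it assumes each book lists a given genre at most once, so a tag's count equals its number of books.

```python
import pandas as pd

# Toy flattened genre tags (hypothetical data)
toy_genres = pd.Series(['Fiction', 'Fiction', 'Fiction', 'Mystery', 'Mystery',
                        'Australia', 'Star Trek'])
counts = toy_genres.value_counts()

def share_at_most(counts, threshold):
    """Fraction of distinct genres that appear in `threshold` or fewer books."""
    return (counts <= threshold).mean()

for threshold in (1, 2):
    print(threshold, share_at_most(counts, threshold))
```

On the real data, the equivalent call would be `share_at_most(all_genres.value_counts(), 10)`.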

Out of curiosity, let's see what some of the singletons look like.

In [None]:
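# Name the column 'count' explicitly; pandas versions before 2.0 would otherwise label it 0
genres_df = all_genres.value_counts().rename('count').to_frame()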

In [None]:
genres_df[genres_df['count'] == 1].head(10)

Here we can see locations, time periods, overly specific combination genres, topics, and even two characters.

In [None]:
genres_df[genres_df['count'] == 1].tail(10)

These patterns continue at the bottom of the list. None of these are good candidates for classification, and they can easily be removed.

Let's check genres with 10 books.

In [None]:
genres_df[genres_df['count'] == 10].head(10)

These do indeed follow similar patterns! Mostly topics, genres, formats, and "read for school," which is simply the way people shelve their books. 

Let's try approaching this from the opposite direction. How many genres have at least 1000 books?

In [None]:
print(f"{len(genres_df[genres_df['count'] >= 1000])} genres have at least 1000 books.")
genres_df[genres_df['count'] >= 1000]

The most common, by far, is Fiction, which covers more than half of our dataset. It is debatable whether this represents a "genre" at all, and it may or may not be a useful classification.

"Audiobook" is certainly not a genre, and "Mystery Thriller" appears to be a combination of two larger genres, "Mystery" and "Thriller." 

In [None]:
#Number of books with the genre "Mystery Thriller" (as seen above)
len(df[df['genre'].apply(lambda x: "Mystery Thriller" in x)])

In [None]:
#Number of books with both the genre "Mystery Thriller" and the genre "Mystery"
len(df[df['genre'].apply(lambda x: ("Mystery Thriller" in x) and ("Mystery" in x))])

In [None]:
#Number of books with both the genre "Mystery Thriller" and the genre "Thriller"
len(df[df['genre'].apply(lambda x: ("Mystery Thriller" in x) and ("Thriller" in x))])

In [None]:
#Number of books with all three genres, "Mystery Thriller", "Mystery", and "Thriller"
len(df[df['genre'].apply(lambda x: ("Mystery Thriller" in x) and ("Thriller" in x) and ("Mystery" in x))])

In [None]:
3220/3934

Sure enough, there's more than an 80% overlap between "Mystery Thriller" and the other two genres combined. So "Mystery Thriller" is mostly redundant and can be removed--or at least replaced with both Mystery and Thriller.
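
Rather than typing the counts in by hand, the same ratio can be computed directly. Here is a minimal sketch on made-up genre lists; `overlap_share` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import pandas as pd

def overlap_share(genre_lists, base, *others):
    """Fraction of books tagged `base` that are also tagged with every genre in `others`."""
    has_base = genre_lists.apply(lambda g: base in g)
    has_all = genre_lists.apply(lambda g: base in g and all(o in g for o in others))
    return has_all.sum() / has_base.sum()

# Toy genre lists (hypothetical data)
toy = pd.Series([
    ['Mystery Thriller', 'Mystery', 'Thriller'],
    ['Mystery Thriller', 'Mystery'],
    ['Thriller'],
    ['Mystery Thriller', 'Mystery', 'Thriller', 'Fiction'],
])

print(overlap_share(toy, 'Mystery Thriller', 'Mystery', 'Thriller'))
```

On the real data this would be called as `overlap_share(df['genre'], "Mystery Thriller", "Mystery", "Thriller")`.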

The rest require some domain knowledge. Is "Magic" a genre? Are "Novels" a genre separate from Fiction? Are most of the books in other genres actually novels that simply aren't tagged as such?

It might be useful to check the correlations between these. Surely there's a way this can be done! 

In [None]:
top_genres = genres_df[genres_df['count']>=1000].reset_index()

In [None]:
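# Size the matrix from the data instead of hard-coding it; the diagonal stays at 1
corr = np.ones(shape=(len(top_genres), len(top_genres)))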

In [None]:
for i in range(1, len(top_genres)):
    # Build the membership mask for genre i once per outer iteration
    genre_i = top_genres.iloc[i]['index']
    has_i = df['genre'].apply(lambda x: genre_i in x)
    for j in range(i):
        genre_j = top_genres.iloc[j]['index']
        has_j = df['genre'].apply(lambda x: genre_j in x)
        count_both = (has_i & has_j).sum()
        # Row = genre being described, column = genre it overlaps with
        corr[i][j] = count_both / has_i.sum()
        corr[j][i] = count_both / has_j.sum()

In [None]:
top_genres.iloc[0]['index']

In [None]:
corr = pd.DataFrame(corr)

In [None]:
corr.columns = top_genres['index']

In [None]:
corr.index = top_genres['index']

In [None]:
corr.style.background_gradient()

To interpret this dataframe, each cell gives the share of books in the row genre that are also tagged with the column genre.

For example, the left column represents the percentage of each genre that also has the Fiction tag. The top row represents the percentage of Fiction-tagged books that also have each other genre. 

So, looking just at the 4 boxes in the upper-left corner, 100% of Fiction is Fiction and 100% of Mystery is Mystery (of course), but also, 87% of Mystery is Fiction, and 45% of Fiction is Mystery. (This surprisingly large number is likely a consequence of the preponderance of Hardy Boys and Nancy Drew novels in the dataset.)

The darker colors near the left and on the top make sense, since the genres are sorted in order of frequency.
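
For a quick sanity check of how to read such a table, here is a sketch that builds the same kind of matrix on a tiny made-up dataset. It uses a one-hot table and a matrix product instead of nested loops, which is a different technique from the loop above; all names and data in it are hypothetical.

```python
import pandas as pd
import numpy as np

# Toy genre lists (hypothetical data)
toy = pd.Series([
    ['Fiction', 'Mystery'],
    ['Fiction', 'Fantasy'],
    ['Fiction', 'Mystery', 'Fantasy'],
    ['Nonfiction'],
])

# 0/1 indicator table: one row per book, one column per genre
genres = sorted({g for lst in toy for g in lst})
onehot = pd.DataFrame([[int(g in lst) for g in genres] for lst in toy], columns=genres)

# both.loc[a, b] = number of books tagged with both a and b; the diagonal is each genre's total
both = onehot.T @ onehot

# Divide each row by that genre's total: cell (row, col) = share of row-genre books also tagged col
share = both.div(np.diag(both.to_numpy()), axis=0)

print(share.round(2))
```

Reading the result the same way as above: every toy Mystery book is also Fiction (`share.loc['Mystery', 'Fiction']` is 1.0), while only two of the three Fiction books are Mystery.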

### Analysis 

Nearly everything is highly correlated with Fiction, except Nonfiction, History, and Biography, none of which ever appears alongside Fiction in the same book. This is promising for the accuracy of our data! But it also means that a classification for Fiction is likely redundant.

98% of Magic books fall under Fantasy, so Magic will likely be dropped.

75% of Science Fiction Fantasy is Fantasy, and 70% of Science Fiction Fantasy is Science Fiction. So we can likely break up Science Fiction Fantasy into its component parts. 

Now that we have some insight into which genres to drop, let's try actually dropping some. Let's say that we end up keeping exactly the genres with at least 1000 books.

In [None]:
def filter_genres(genres, keep):
    filtered = [genre for genre in genres if genre in keep]
    return filtered

In [None]:
# Create a set of the top genres (set membership checks are fast)
keep_genres = set(top_genres['index'])

In [None]:
# Create a new column on our dataframe with only the top genres
df['filtered genres'] = df['genre'].apply(lambda x: filter_genres(x, keep_genres))

In [None]:
df.head()

In [None]:
# Create a new column with the number of genres from the allowed list in each book
df['num filtered genres'] = df['filtered genres'].apply(len)

In [None]:
df.head()

For reference, let's look at the description of the original genres list.

In [None]:
df['num genres'].describe()

Some books had only 1 genre, but 75% of the books had at least 6, and at least half had the maximum number I scraped from Goodreads--7.

Now let's see what happens after I remove every genre except the top ones.

In [None]:
df['num filtered genres'].describe()

Now our mean has gone from around 6 to around 4, and the median has decreased all the way from 7 to 4. However, most books still have genres left.

Let's see which books have lost all of their genres. I'll use a sample of the first and last 5 of those books.

In [None]:
lost_genres = df[df['num filtered genres'] == 0]
print(f"{len(lost_genres)} books no longer have any genres left. Here are some of them:")
pd.concat([lost_genres.head(),lost_genres.tail()])

In [None]:
lost_genres['num genres'].describe()

Most of the books that have lost all their genres had only 1 genre tag originally, and it did not happen to be a popular one. But at least some unlucky books had 7 genres, and none of them were winners.

In [None]:
lost_genres[lost_genres['num genres'] == 7]

In [None]:
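# Select the row by condition rather than a hard-coded position, which can shift after filtering
lost_genres[lost_genres['num genres'] == 7].iloc[0]['genre']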

Looks like there was just one book this unlucky--a Spider-Man comic. 

Next, let's try splitting up the combination genres--Science Fiction Fantasy into Science Fiction and Fantasy, and Mystery Thriller into Mystery and Thriller.

In [None]:
def split_genres(genres):
    
    # So we don't change the original list
    split = genres.copy()
    
    if "Science Fiction Fantasy" in genres: 
        split.remove("Science Fiction Fantasy")
        split.append("Science Fiction")
        split.append("Fantasy")
        
        # Remove duplicates
        split = list(set(split))
        
    if "Mystery Thriller" in genres:
        split.remove("Mystery Thriller")
        split.append("Mystery")
        split.append("Thriller")
        
        # Remove duplicates
        split = list(set(split))
        
    return split
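
One side effect of `list(set(...))` is that the genres can come back in a different order than they went in. If order matters, a variant along these lines keeps the first-seen order. This is a sketch only; `COMBINATIONS` and `split_genres_ordered` are names I'm making up for illustration.

```python
# Hypothetical mapping from each combination tag to its component genres
COMBINATIONS = {
    "Science Fiction Fantasy": ["Science Fiction", "Fantasy"],
    "Mystery Thriller": ["Mystery", "Thriller"],
}

def split_genres_ordered(genres, combinations=COMBINATIONS):
    """Replace combination tags with their components, keeping first-seen order."""
    expanded = []
    for genre in genres:
        # Tags that are not combinations pass through unchanged
        expanded.extend(combinations.get(genre, [genre]))
    # dict.fromkeys drops duplicates while preserving insertion order (Python 3.7+)
    return list(dict.fromkeys(expanded))

example = ["Fiction", "Mystery Thriller", "Mystery", "Fantasy"]
print(split_genres_ordered(example))
```

Because it builds a new list, this version also leaves the input list untouched, just like the `.copy()` in the function above.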

In [None]:
df['split genres'] = df['filtered genres'].apply(split_genres)

In [None]:
df.head()

In [None]:
df['num split genres'] = df['split genres'].apply(len)

In [None]:
df.head()

In [None]:
print(df.iloc[4]['genre'])
print(df.iloc[4]['filtered genres'])
print(df.iloc[4]['split genres'])

In [None]:
df['num split genres'].describe()

In [None]:
to_1D(df['split genres']).value_counts()