Now comes the hard part: analyzing the genres column. Unfortunately, pandas DataFrames are not designed to hold lists, so we're going to need to be clever about this.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('7_final_books.csv').drop('Unnamed: 0', axis=1)
df = df[df['genre'] != '[]']

Some research on how to deal with this led me to this excellent article by Max Hilsdorf: 

https://towardsdatascience.com/dealing-with-list-values-in-pandas-dataframes-a177e534f173

Let's try out some of his ideas.

First, I need to see the formatting of my lists.

In [3]:
print(type(df.iloc[0]['genre']))
print(df.iloc[0]['genre'])

<class 'str'>
['Science Fiction', 'Fiction', 'Fantasy', 'Queer', 'LGBT', 'Adult', 'Time Travel']


As expected, pandas has stored each list as a string. But there is one relief--each element is surrounded by quotation marks, so the string is a valid Python list literal. That should make things a bit easier!

In [4]:
import ast
df['genre'] = df['genre'].apply(ast.literal_eval)  # safer than eval: only parses Python literals

In [5]:
print(type(df.iloc[0]['genre']))
print(df.iloc[0]['genre'])

<class 'list'>
['Science Fiction', 'Fiction', 'Fantasy', 'Queer', 'LGBT', 'Adult', 'Time Travel']


Wonderful! Each book's genre feature is now treated as a list! Now we just need to find the counts--ideally without looping through all 22,000 data points. Hilsdorf recommends we consider the column as a 2D array, and then convert it to a 1D array. 

In [6]:
def to_1D(series):
    return pd.Series([x for _list in series for x in _list])

In [7]:
all_genres = to_1D(df['genre'])

In [8]:
all_genres.head(10)

0    Science Fiction
1            Fiction
2            Fantasy
3              Queer
4               LGBT
5              Adult
6        Time Travel
7    Science Fiction
8            Mystery
9            Fiction
dtype: object
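For comparison, pandas also has a built-in `Series.explode` that does the same flattening. The sketch below uses a small made-up series (`toy` is illustrative and not part of the dataset) to check that it matches `to_1D`. Note that, unlike `to_1D`, `explode` keeps each book's original index unless it is reset, and it turns an empty list into a NaN row.

```python
import pandas as pd

# Made-up stand-in for df['genre']; not real data from the notebook
toy = pd.Series([
    ["Fiction", "Fantasy"],
    ["Mystery"],
    ["Fiction", "Mystery", "Crime"],
])

def to_1D(series):
    # Same helper as in the notebook: flatten a series of lists
    return pd.Series([x for _list in series for x in _list])

flat_loop = to_1D(toy)

# explode() gives one row per list element; reset the index to match to_1D
flat_explode = toy.explode().reset_index(drop=True)

print(flat_loop.tolist() == flat_explode.tolist())
```

Either version can be followed by `.value_counts()` in exactly the same way.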

In [9]:
print(f"Our {len(df)} books have a total of {len(all_genres)} genres.\
 An average of {round(len(all_genres)/len(df), 1)} genres per book.")

Our 22022 books have a total of 131598 genres. An average of 6.0 genres per book.


Now that the genre column is correctly interpreted as a list, we can get more specific by adding a column for the number of genres each book has. 

In [10]:
df['num genres'] = df['genre'].apply(len)

In [11]:
df.head()

Unnamed: 0,title,author,total words,vividness,passive voice,all adverbs,ly-adverbs,non-ly-adverbs,genre,year,num genres
0,The Vanished Birds,Simon Jimenez,124205.0,55.18,6.37,1.95,0.36,1.58,"[Science Fiction, Fiction, Fantasy, Queer, LGB...",2020.0,7
1,The Price of Honor,Jonathan P. Brazee,77253.0,35.35,8.71,2.63,0.71,1.92,[Science Fiction],2017.0,1
2,The Case of the Baker Street Irregulars,Anthony Boucher,80557.0,32.33,8.41,3.72,1.64,2.08,"[Mystery, Fiction, Crime, Humor, Classics, 20t...",1940.0,7
3,Wildoak,C. C. Harrington,55602.0,74.34,6.92,3.04,1.16,1.87,"[Middle Grade, Historical Fiction, Fiction, An...",2022.0,7
4,The Holiday,T. M. Logan,101767.0,50.3,8.02,3.06,1.12,1.93,"[Thriller, Mystery, Fiction, Mystery Thriller,...",2019.0,7


Next, let's see which genres are most common. I know from inspection that many of the elements classed as genres are actually not categories that would generally be considered genres--elements like "Australia," or "Star Trek." I'd anticipate finding these near the bottom for number of books. This is likely because Goodreads genres are generated by how users shelve their own books. 

In [12]:
len(all_genres.value_counts())

688

All in all, we have nearly 700 supposed "genres!" Definitely too many to try to classify from only 22,000 books. Time to start finding ones to eliminate! 

In [13]:
all_genres.value_counts().head()

Fiction      15167
Mystery       7840
Thriller      6067
Fantasy       5670
Audiobook     5477
Name: count, dtype: int64

As it turns out, one of the top 5 "genres" is, in fact, not a genre--"Audiobook" is a format. It looks like not all the "genres" we need to get rid of will be close to the bottom! 

Still, it might be worth checking the singletons.

In [14]:
np.sum(all_genres.value_counts() == 1)

117

It seems that over a hundred of the "genres" appear only once in the entire dataframe! Let's expand our net even more. 

In [15]:
np.sum(all_genres.value_counts() <= 10)

333

Wow! Just about half of the genres appear in 10 or fewer books.
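The two checks above can be folded into one pass over several cutoffs. This is a minimal sketch on made-up counts (`toy_counts` and `thresholds` are illustrative names and values, not taken from the dataset); in the notebook, `all_genres.value_counts()` would take the place of `toy_counts`.

```python
import pandas as pd

# Made-up genre counts standing in for all_genres.value_counts()
toy_counts = pd.Series(
    {"Fiction": 50, "Mystery": 12, "Birds": 10, "Banks": 1, "Bigfoot": 1}
)

# Cutoffs to inspect; each entry counts genres with at most that many books
thresholds = [1, 10, 1000]
summary = {t: int((toy_counts <= t).sum()) for t in thresholds}

for t, n in summary.items():
    print(f"{n} genres appear in {t} or fewer books")
```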

Out of curiosity, let's see what some of the singletons look like.

In [16]:
genres_df = pd.DataFrame(all_genres.value_counts())

In [17]:
genres_df[genres_df['count'] == 1].head(10)

Unnamed: 0,count
Mauritius,1
Paranormal Urban Fantasy,1
Victorian Romance,1
Transport,1
Wonder Woman,1
Romanian Literature,1
Golden Age Mystery,1
Romanticism,1
Spider Man,1
Georgian,1


Here we can see locations, time periods, overly-specific combination genres, topics, and even two characters. 

In [18]:
genres_df[genres_df['count'] == 1].tail(10)

Unnamed: 0,count
M M Sports Romance,1
Mysticism,1
Green,1
M M Mystery,1
American Fiction,1
Banks,1
Cthulhu Mythos,1
Prayer,1
Bigfoot,1
Romania,1


These patterns continue at the bottom of the list. None of these are good candidates for classification, and all of them can easily be removed.

Let's check genres with 10 books.

In [19]:
genres_df[genres_df['count'] == 10].head(10)

Unnamed: 0,count
Terrorism,10
Academic,10
Ecology,10
Ukraine,10
Birds,10
Social Issues,10
Category Romance,10
Chapter Books,10
Fitness,10
Read For School,10


These do indeed follow similar patterns! Mostly topics, genres, formats, and "read for school," which is simply the way people shelve their books. 

Let's try approaching this from the opposite direction. How many genres have at least 1000 books?

In [20]:
print(f"{len(genres_df[genres_df['count'] >= 1000])} genres have at least 1000 books.")
genres_df[genres_df['count'] >= 1000]

27 genres have at least 1000 books.


Unnamed: 0,count
Fiction,15167
Mystery,7840
Thriller,6067
Fantasy,5670
Audiobook,5477
Mystery Thriller,3934
Crime,3765
Science Fiction,3483
Contemporary,3328
Romance,3234


The most common, by far, is Fiction, which appears on more than two-thirds of our dataset (15,167 of 22,022 books). It could be argued whether this represents a "genre" at all, and it may or may not be a useful classification.

"Audiobook" is certainly not a genre, and "Mystery Thriller" appears to be a combination of two larger genres, "Mystery" and "Thriller." 

In [21]:
#Number of books with the genre "Mystery Thriller" (as seen above)
len(df[df['genre'].apply(lambda x: "Mystery Thriller" in x)])

3934

In [22]:
# Number of books with both the genre "Mystery Thriller" and the genre "Mystery"
len(df[df['genre'].apply(lambda x: ("Mystery Thriller" in x) and ("Mystery" in x))])

3774

In [23]:
# Number of books with both the genre "Mystery Thriller" and the genre "Thriller"
len(df[df['genre'].apply(lambda x: ("Mystery Thriller" in x) and ("Thriller" in x))])

3342

In [24]:
# Number of books with all three genres, "Mystery Thriller", "Mystery", and "Thriller"
len(df[df['genre'].apply(lambda x: ("Mystery Thriller" in x) and ("Thriller" in x) and ("Mystery" in x))])

3220

In [25]:
3220/3934

0.8185053380782918

Sure enough, about 82% of the "Mystery Thriller" books also carry both the "Mystery" and "Thriller" tags. So "Mystery Thriller" is mostly redundant and can be removed--or at least replaced with both Mystery and Thriller.
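The four counts above all ask the same question: how many books carry every tag in some required set? Here is a minimal sketch of that idea as one helper, run on made-up data (`toy` and `count_with_all` are illustrative names, not part of the notebook).

```python
import pandas as pd

# Made-up stand-in for df; each row is one book's genre list
toy = pd.DataFrame({"genre": [
    ["Mystery Thriller", "Mystery", "Thriller"],
    ["Mystery Thriller", "Mystery"],
    ["Mystery", "Fiction"],
    ["Mystery Thriller", "Thriller", "Mystery", "Crime"],
]})

def count_with_all(frame, required):
    """Count the books whose genre list contains every tag in `required`."""
    required = set(required)
    # set.issubset accepts any iterable, so it works directly on each list
    return int(frame["genre"].apply(required.issubset).sum())

combo = count_with_all(toy, ["Mystery Thriller"])
all_three = count_with_all(toy, ["Mystery Thriller", "Mystery", "Thriller"])

print(f"{all_three} of {combo} combo-tagged books also have both parent tags")
print(f"Share: {all_three / combo:.2f}")
```

On the real data, the same two calls against `df` would correspond to the 3934 and 3220 figures computed above.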

The rest require some domain knowledge. Is "Magic" a genre? Are "Novels" a genre separate from Fiction? Are most of the books in other genres actually novels without mentioning it?

It might be useful to check how strongly these genres overlap with one another. Surely there's a way this can be done!

In [26]:
top_genres = genres_df[genres_df['count']>=1000].reset_index()

In [27]:
corr = np.ones(shape=(len(top_genres), len(top_genres)))

In [28]:
for i in range(1, len(top_genres)):
    genre_i = top_genres.iloc[i]['index']
    count_i = len(df[df['genre'].apply(lambda x: genre_i in x)])
    for j in range(i):
        genre_j = top_genres.iloc[j]['index']
        count_j = len(df[df['genre'].apply(lambda x: genre_j in x)])
        count_both = len(df[df['genre'].apply(lambda x: (genre_i in x) and (genre_j in x))])
        corr[i][j] = count_both/count_i  # share of genre i's books that also have genre j
        corr[j][i] = count_both/count_j  # share of genre j's books that also have genre i

In [29]:
top_genres.iloc[0]['index']

'Fiction'

In [30]:
corr = pd.DataFrame(corr)

In [31]:
corr.columns = top_genres['index']

In [32]:
corr.index = top_genres['index']

In [33]:
corr.style.background_gradient()

index,Fiction,Mystery,Thriller,Fantasy,Audiobook,Mystery Thriller,Crime,Science Fiction,Contemporary,Romance,Nonfiction,Suspense,Historical Fiction,Adult,Young Adult,Historical,Horror,Adventure,Paranormal,History,Literary Fiction,Science Fiction Fantasy,Magic,Novels,Biography,Classics,LGBT
Fiction,1.0,0.451045,0.358278,0.280345,0.286807,0.245797,0.217182,0.186919,0.197732,0.150722,0.0,0.180919,0.174458,0.156458,0.137206,0.112547,0.089273,0.075427,0.042065,0.0,0.078658,0.073976,0.049581,0.07213,0.0,0.061779,0.048856
Mystery,0.872577,1.0,0.628316,0.097577,0.290689,0.481378,0.433036,0.055867,0.139796,0.082908,0.004082,0.34477,0.135969,0.102041,0.081122,0.091709,0.08699,0.046939,0.048214,0.002551,0.020536,0.010332,0.014413,0.023087,0.001148,0.03648,0.019005
Thriller,0.895665,0.811933,1.0,0.061645,0.320092,0.550849,0.453766,0.08093,0.1284,0.040547,0.002143,0.434811,0.064282,0.107796,0.04714,0.029998,0.116038,0.054557,0.027031,0.001648,0.013186,0.00989,0.001319,0.024065,0.000659,0.008241,0.017307
Fantasy,0.749912,0.134921,0.065961,1.0,0.171781,0.012169,0.011111,0.337566,0.047443,0.231217,0.0,0.010229,0.117989,0.14903,0.289947,0.07284,0.13351,0.118871,0.189065,0.0,0.018695,0.156614,0.195414,0.027337,0.0,0.04709,0.085714
Audiobook,0.79423,0.416104,0.354574,0.177835,1.0,0.269673,0.219281,0.137667,0.204674,0.130729,0.171444,0.187877,0.114661,0.188424,0.067555,0.08417,0.051488,0.046376,0.024283,0.054044,0.072668,0.055505,0.035969,0.029578,0.080701,0.015337,0.025561
Mystery Thriller,0.947636,0.959329,0.849517,0.017539,0.375445,1.0,0.524911,0.027453,0.136756,0.038638,0.000254,0.516014,0.063803,0.121505,0.051601,0.033045,0.064311,0.019319,0.014489,0.000254,0.012201,0.001017,0.000508,0.014743,0.0,0.023386,0.014743
Crime,0.8749,0.901726,0.731208,0.016733,0.318991,0.548473,1.0,0.016733,0.074635,0.028685,0.026029,0.367596,0.087649,0.052058,0.018592,0.060823,0.040372,0.020983,0.006906,0.015936,0.008234,0.003187,0.002125,0.020186,0.011155,0.030013,0.012483
Science Fiction,0.813953,0.125754,0.14097,0.549526,0.21648,0.031008,0.018088,1.0,0.03675,0.078668,0.0,0.032156,0.048808,0.111111,0.167959,0.018375,0.13006,0.109101,0.031582,0.0,0.016078,0.241459,0.034166,0.058857,0.0,0.059432,0.068045
Contemporary,0.901142,0.329327,0.234075,0.080829,0.336839,0.161659,0.084435,0.038462,1.0,0.389724,0.009014,0.127704,0.09345,0.279748,0.134916,0.032452,0.029447,0.007512,0.029447,0.000601,0.245793,0.001803,0.005108,0.145433,0.003906,0.015024,0.07512
Romance,0.706865,0.200989,0.076067,0.40538,0.221398,0.047001,0.033395,0.084725,0.401051,1.0,0.007112,0.09462,0.176252,0.22047,0.270563,0.142239,0.010513,0.025356,0.180581,0.000309,0.027211,0.011132,0.128942,0.022573,0.00402,0.018553,0.080396


To interpret this dataframe, each cell gives the fraction of books in the row genre that also carry the column genre.

For example, the left column gives the fraction of each genre's books that also have the Fiction tag. The top row gives the fraction of Fiction-tagged books that also have each other genre.

So, looking just at the 4 boxes in the upper-left corner, 100% of Fiction is Fiction and 100% of Mystery is Mystery (of course), but also, 87% of Mystery is Fiction, and 45% of Fiction is Mystery. (This surprisingly large number is likely a consequence of the preponderance of Hardy Boys and Nancy Drew novels in the dataset.)

The darker colors near the left and on the top make sense, since the genres are sorted in order of frequency.
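To make the row/column reading concrete, here is a small worked sketch that builds the same kind of table with a 0/1 indicator matrix instead of a nested loop. This is a different technique from the loop used above, shown on made-up data; `toy`, `genres`, `indicator`, `co`, and `overlap` are all illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up genre lists for four books
toy = pd.Series([
    ["Fiction", "Mystery"],
    ["Fiction", "Mystery", "Thriller"],
    ["Fiction"],
    ["Mystery"],
])
genres = ["Fiction", "Mystery", "Thriller"]

# One row per book, one 0/1 column per genre
indicator = pd.DataFrame(
    [[int(g in tags) for g in genres] for tags in toy],
    columns=genres,
)

# co.loc[a, b] = number of books tagged with both a and b;
# the diagonal holds each genre's own book count
co = indicator.T @ indicator

# Divide each row by that row genre's own count, so that
# overlap.loc[a, b] = fraction of a's books that also have b
own_counts = pd.Series(np.diag(co.to_numpy()), index=co.index)
overlap = co.div(own_counts, axis=0)

print(overlap.round(2))
```

In this toy table, the Mystery row and Fiction column hold 2/3 (two of the three Mystery books are also Fiction), while the Thriller row and Mystery column hold 1.0 (the only Thriller is also a Mystery). As in the table above, the matrix is not symmetric.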

### Analysis 

Nearly everything overlaps heavily with Fiction, except Nonfiction, History, and Biography, none of which ever shares a book with the Fiction tag. This is promising for the accuracy of our data! But it also means that a classification for Fiction is likely redundant.

98% of Magic books fall under Fantasy, so Magic will likely be dropped.

75% of Science Fiction Fantasy is Fantasy, and 70% of Science Fiction Fantasy is Science Fiction. So we can likely break up Science Fiction Fantasy into its component parts. 

Now that we have some insight into which genres to drop, let's try actually dropping some. Let's say that we end up keeping exactly the genres with at least 1000 books.

In [34]:
def filter_genres(genres, keep):
    filtered = [genre for genre in genres if genre in keep]
    return filtered

In [35]:
# Create a set of the top genres (sets give fast membership checks)
keep_genres = set(top_genres['index'])

In [36]:
# Create a new column on our dataframe with only the top genres
df['filtered genres'] = df['genre'].apply(lambda x: filter_genres(x, keep_genres))

In [37]:
df.head()

Unnamed: 0,title,author,total words,vividness,passive voice,all adverbs,ly-adverbs,non-ly-adverbs,genre,year,num genres,filtered genres
0,The Vanished Birds,Simon Jimenez,124205.0,55.18,6.37,1.95,0.36,1.58,"[Science Fiction, Fiction, Fantasy, Queer, LGB...",2020.0,7,"[Science Fiction, Fiction, Fantasy, LGBT, Adult]"
1,The Price of Honor,Jonathan P. Brazee,77253.0,35.35,8.71,2.63,0.71,1.92,[Science Fiction],2017.0,1,[Science Fiction]
2,The Case of the Baker Street Irregulars,Anthony Boucher,80557.0,32.33,8.41,3.72,1.64,2.08,"[Mystery, Fiction, Crime, Humor, Classics, 20t...",1940.0,7,"[Mystery, Fiction, Crime, Classics]"
3,Wildoak,C. C. Harrington,55602.0,74.34,6.92,3.04,1.16,1.87,"[Middle Grade, Historical Fiction, Fiction, An...",2022.0,7,"[Historical Fiction, Fiction, Young Adult]"
4,The Holiday,T. M. Logan,101767.0,50.3,8.02,3.06,1.12,1.93,"[Thriller, Mystery, Fiction, Mystery Thriller,...",2019.0,7,"[Thriller, Mystery, Fiction, Mystery Thriller,..."


In [38]:
# Create a new column with the number of genres from the allowed list in each book
df['num filtered genres'] = df['filtered genres'].apply(len)

In [39]:
df.head()

Unnamed: 0,title,author,total words,vividness,passive voice,all adverbs,ly-adverbs,non-ly-adverbs,genre,year,num genres,filtered genres,num filtered genres
0,The Vanished Birds,Simon Jimenez,124205.0,55.18,6.37,1.95,0.36,1.58,"[Science Fiction, Fiction, Fantasy, Queer, LGB...",2020.0,7,"[Science Fiction, Fiction, Fantasy, LGBT, Adult]",5
1,The Price of Honor,Jonathan P. Brazee,77253.0,35.35,8.71,2.63,0.71,1.92,[Science Fiction],2017.0,1,[Science Fiction],1
2,The Case of the Baker Street Irregulars,Anthony Boucher,80557.0,32.33,8.41,3.72,1.64,2.08,"[Mystery, Fiction, Crime, Humor, Classics, 20t...",1940.0,7,"[Mystery, Fiction, Crime, Classics]",4
3,Wildoak,C. C. Harrington,55602.0,74.34,6.92,3.04,1.16,1.87,"[Middle Grade, Historical Fiction, Fiction, An...",2022.0,7,"[Historical Fiction, Fiction, Young Adult]",3
4,The Holiday,T. M. Logan,101767.0,50.3,8.02,3.06,1.12,1.93,"[Thriller, Mystery, Fiction, Mystery Thriller,...",2019.0,7,"[Thriller, Mystery, Fiction, Mystery Thriller,...",7


For reference, let's look at the description of the original genres list.

In [40]:
df['num genres'].describe()

count    22022.000000
mean         5.975752
std          1.884849
min          1.000000
25%          6.000000
50%          7.000000
75%          7.000000
max          7.000000
Name: num genres, dtype: float64

Some books had only 1 genre, but 75% of the books had at least 6, and at least half had the maximum number I scraped from Goodreads--7.

Now let's see what happens after I remove every genre except the top ones.

In [41]:
df['num filtered genres'].describe()

count    22022.000000
mean         3.967033
std          1.792948
min          0.000000
25%          3.000000
50%          4.000000
75%          5.000000
max          7.000000
Name: num filtered genres, dtype: float64

Now our mean has gone from around 6 to around 4, and the median has decreased all the way from 7 to 4. However, most books still have genres left.

Let's see which books have lost all of their genres. I'll use a sample of the first and last 5 of those books.

In [42]:
lost_genres = df[df['num filtered genres'] == 0]
print(f"{len(lost_genres)} books no longer have any genres left. Here are some of them:")
pd.concat([lost_genres.head(),lost_genres.tail()])

218 books no longer have any genres left. Here are some of them:


Unnamed: 0,title,author,total words,vividness,passive voice,all adverbs,ly-adverbs,non-ly-adverbs,genre,year,num genres,filtered genres,num filtered genres
353,Startup Mixology,Frank Gruber,61170.0,19.85,7.47,2.81,1.0,1.81,"[Business, Entrepreneurship, Technology]",2014.0,3,[],0
492,Turbulent Wake,Paul E. Hardisty,84687.0,64.52,6.67,2.44,0.52,1.93,[Canada],2019.0,1,[],0
773,From Shy Guy To Ladies Man,Chris Bale,23616.0,28.81,7.85,4.16,1.89,2.27,[Self Help],2016.0,1,[],0
1111,One Step Ahead,Audrey Walker,19968.0,28.98,10.51,3.49,0.93,2.56,[Psychological Thriller],2021.0,1,[],0
1124,Dawn of Chaos,Daniel Willcocks & Michael Anderle,74595.0,51.85,7.2,2.93,1.02,1.92,[Zombies],2018.0,1,[],0
21729,The Outer Dark,Zachary Rawlins,203197.0,48.58,6.94,3.56,1.72,1.84,[Urban Fantasy],2017.0,1,[],0
21772,Endurance,Brunoo Miller,42029.0,43.12,8.19,2.82,1.0,1.82,[Dystopia],2019.0,1,[],0
21825,Emperor of the Earth,Czeslaw Milosz,87103.0,21.72,6.22,2.94,1.13,1.81,"[Poetry, Poland, Criticism, Essays]",1977.0,4,[],0
21833,Blue Moon,Jessica Saren,24065.0,50.2,6.61,3.44,1.73,1.72,[Reverse Harem],2018.0,1,[],0
22129,How To Have Kick-Ass Ideas,Chris Baréz-Brown,28957.0,26.86,8.99,4.2,1.14,3.06,[Business],2008.0,1,[],0


In [43]:
lost_genres['num genres'].describe()

count    218.000000
mean       1.458716
std        0.916134
min        1.000000
25%        1.000000
50%        1.000000
75%        2.000000
max        7.000000
Name: num genres, dtype: float64

Most of the books that have lost all their genres had only 1 genre tag originally, and it did not happen to be a popular one. But at least some unlucky books had 7 genres, and none of them were winners.

In [44]:
lost_genres[lost_genres['num genres'] == 7]

Unnamed: 0,title,author,total words,vividness,passive voice,all adverbs,ly-adverbs,non-ly-adverbs,genre,year,num genres,filtered genres,num filtered genres
20233,Coming Home,Laurèn Lee,60114.0,57.65,6.2,2.87,0.98,1.89,"[Comics, Graphic Novels, Marvel, Spider Man, C...",2001.0,7,[],0


In [45]:
# 20233 is an index label (rows were dropped earlier), so use .loc rather than positional .iloc
df.loc[20233, 'genre']

['Comics',
 'Graphic Novels',
 'Marvel',
 'Spider Man',
 'Comic Book',
 'Superheroes',
 'Graphic Novels Comics']

Looks like there was just one book this unlucky--a Spider-Man comic. 

Next, let's try splitting up the combination genres--Science Fiction Fantasy into Science Fiction and Fantasy, and Mystery Thriller into Mystery and Thriller.

In [46]:
def split_genres(genres):
    
    # So we don't change the original list
    split = genres.copy()
    
    if "Science Fiction Fantasy" in genres: 
        split.remove("Science Fiction Fantasy")
        split.append("Science Fiction")
        split.append("Fantasy")
        
        # Remove duplicates
        split = list(set(split))
        
    if "Mystery Thriller" in genres:
        split.remove("Mystery Thriller")
        split.append("Mystery")
        split.append("Thriller")
        
        # Remove duplicates
        split = list(set(split))
        
    return split

In [47]:
df['split genres'] = df['filtered genres'].apply(split_genres)

In [48]:
df.head()

Unnamed: 0,title,author,total words,vividness,passive voice,all adverbs,ly-adverbs,non-ly-adverbs,genre,year,num genres,filtered genres,num filtered genres,split genres
0,The Vanished Birds,Simon Jimenez,124205.0,55.18,6.37,1.95,0.36,1.58,"[Science Fiction, Fiction, Fantasy, Queer, LGB...",2020.0,7,"[Science Fiction, Fiction, Fantasy, LGBT, Adult]",5,"[Science Fiction, Fiction, Fantasy, LGBT, Adult]"
1,The Price of Honor,Jonathan P. Brazee,77253.0,35.35,8.71,2.63,0.71,1.92,[Science Fiction],2017.0,1,[Science Fiction],1,[Science Fiction]
2,The Case of the Baker Street Irregulars,Anthony Boucher,80557.0,32.33,8.41,3.72,1.64,2.08,"[Mystery, Fiction, Crime, Humor, Classics, 20t...",1940.0,7,"[Mystery, Fiction, Crime, Classics]",4,"[Mystery, Fiction, Crime, Classics]"
3,Wildoak,C. C. Harrington,55602.0,74.34,6.92,3.04,1.16,1.87,"[Middle Grade, Historical Fiction, Fiction, An...",2022.0,7,"[Historical Fiction, Fiction, Young Adult]",3,"[Historical Fiction, Fiction, Young Adult]"
4,The Holiday,T. M. Logan,101767.0,50.3,8.02,3.06,1.12,1.93,"[Thriller, Mystery, Fiction, Mystery Thriller,...",2019.0,7,"[Thriller, Mystery, Fiction, Mystery Thriller,...",7,"[Suspense, Fiction, Mystery, Audiobook, Crime,..."


In [49]:
df['num split genres'] = df['split genres'].apply(len)

In [50]:
df.head()

Unnamed: 0,title,author,total words,vividness,passive voice,all adverbs,ly-adverbs,non-ly-adverbs,genre,year,num genres,filtered genres,num filtered genres,split genres,num split genres
0,The Vanished Birds,Simon Jimenez,124205.0,55.18,6.37,1.95,0.36,1.58,"[Science Fiction, Fiction, Fantasy, Queer, LGB...",2020.0,7,"[Science Fiction, Fiction, Fantasy, LGBT, Adult]",5,"[Science Fiction, Fiction, Fantasy, LGBT, Adult]",5
1,The Price of Honor,Jonathan P. Brazee,77253.0,35.35,8.71,2.63,0.71,1.92,[Science Fiction],2017.0,1,[Science Fiction],1,[Science Fiction],1
2,The Case of the Baker Street Irregulars,Anthony Boucher,80557.0,32.33,8.41,3.72,1.64,2.08,"[Mystery, Fiction, Crime, Humor, Classics, 20t...",1940.0,7,"[Mystery, Fiction, Crime, Classics]",4,"[Mystery, Fiction, Crime, Classics]",4
3,Wildoak,C. C. Harrington,55602.0,74.34,6.92,3.04,1.16,1.87,"[Middle Grade, Historical Fiction, Fiction, An...",2022.0,7,"[Historical Fiction, Fiction, Young Adult]",3,"[Historical Fiction, Fiction, Young Adult]",3
4,The Holiday,T. M. Logan,101767.0,50.3,8.02,3.06,1.12,1.93,"[Thriller, Mystery, Fiction, Mystery Thriller,...",2019.0,7,"[Thriller, Mystery, Fiction, Mystery Thriller,...",7,"[Suspense, Fiction, Mystery, Audiobook, Crime,...",6


In [51]:
print(df.iloc[4]['genre'])
print(df.iloc[4]['filtered genres'])
print(df.iloc[4]['split genres'])

['Thriller', 'Mystery', 'Fiction', 'Mystery Thriller', 'Crime', 'Audiobook', 'Suspense']
['Thriller', 'Mystery', 'Fiction', 'Mystery Thriller', 'Crime', 'Audiobook', 'Suspense']
['Suspense', 'Fiction', 'Mystery', 'Audiobook', 'Crime', 'Thriller']
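The third line of output shows that the genres come back in a different order, because `set` does not preserve order. If the original ordering matters, one option is to expand each combined tag in place and drop duplicates with `dict.fromkeys`, which keeps first-seen order. This is a sketch of an alternative, not the function used above; `COMBINED` and `split_genres_ordered` are hypothetical names.

```python
# Mapping from each combined tag to the tags it should be replaced with
COMBINED = {
    "Science Fiction Fantasy": ["Science Fiction", "Fantasy"],
    "Mystery Thriller": ["Mystery", "Thriller"],
}

def split_genres_ordered(genres, combined=COMBINED):
    """Replace combined tags with their parts, keeping first-seen order."""
    expanded = []
    for genre in genres:
        # Tags that are not combined tags pass through unchanged
        expanded.extend(combined.get(genre, [genre]))
    # dict.fromkeys removes duplicates while preserving insertion order
    return list(dict.fromkeys(expanded))

book = ['Thriller', 'Mystery', 'Fiction', 'Mystery Thriller',
        'Crime', 'Audiobook', 'Suspense']
print(split_genres_ordered(book))
```

The result has the same six genres as `split_genres`, but in the order they first appeared, and the input list is left untouched.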


In [52]:
df['num split genres'].describe()

count    22022.000000
mean         3.797975
std          1.635543
min          0.000000
25%          3.000000
50%          4.000000
75%          5.000000
max          8.000000
Name: num split genres, dtype: float64

In [53]:
to_1D(df['split genres']).value_counts()

Fiction               15167
Mystery                8000
Thriller               6659
Fantasy                5970
Audiobook              5477
Science Fiction        3830
Crime                  3765
Contemporary           3328
Romance                3234
Nonfiction             3106
Suspense               2937
Historical Fiction     2917
Adult                  2679
Young Adult            2543
Historical             1997
Horror                 1595
Adventure              1343
Paranormal             1301
History                1212
Literary Fiction       1200
Magic                  1131
Novels                 1104
Biography              1077
Classics               1058
LGBT                   1009
Name: count, dtype: int64