# Intro
Some websites let writers and publishers publish their books with little effort. Readers have preferences for particular literary genres when choosing what to read next, so tagging a book with the right genres at publication helps it reach the right audience, which in turn can increase sales or publicity.
This project aims to create a method for recommending literary genre tags to writers and publishers. When they fill out a book's information on a platform, the platform suggests the genre tags that best fit the description of the work.

# Data
We will use a dataset of books published on the Google Books platform. It contains fields such as the author's name, the title, the genre, the publication date, and the description of the book. We will analyze it and try to build a classification method, based on this information, that suggests genres for an unpublished work.
All information about the dataset can be found [here](https://www.kaggle.com/bilalyussef/google-books-dataset).

In [1]:
#Load packages
import pandas as pd
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

The dataset has an unnamed numeric column that resembles an ID. Since we already have a column with a unique identifier (ISBN), we will use it as the index and ignore the unnamed column. For more information about ISBN, see [ISBN on Wikipedia](https://en.wikipedia.org/wiki/International_Standard_Book_Number).

In [2]:
books_df = pd.read_csv("google_books.csv", usecols=["title", "author", "rating", "voters", "price", "currency", "description", "publisher", "page_count", "generes", "ISBN", "language", "published_date"], index_col="ISBN")
books_df.head()

ISBN,title,author,rating,voters,price,currency,description,publisher,page_count,generes,language,published_date
9781612626864,Attack on Titan: Volume 13,Hajime Isayama,4.6,428,43.28,SAR,NO SAFE PLACE LEFT At great cost to the Garris...,Kodansha Comics,192,none,English,"Jul 31, 2014"
9780758272799,Antiques Roadkill: A Trash 'n' Treasures Mystery,Barbara Allan,3.3,23,26.15,SAR,Determined to make a new start in her quaint h...,Kensington Publishing Corp.,288,"Fiction , Mystery &amp, Detective , Cozy , Gen...",English,"Jul 1, 2007"
9781506713816,The Art of Super Mario Odyssey,Nintendo,3.9,9,133.85,SAR,Take a globetrotting journey all over the worl...,Dark Horse Comics,368,"Games &amp, Activities , Video &amp, Electronic",English,"Nov 5, 2019"
9781617734076,Getting Away Is Deadly: An Ellie Avery Mystery,Sara Rosett,4.0,10,26.15,SAR,"With swollen feet and swelling belly, pregnant...",Kensington Publishing Corp.,320,none,English,"Mar 1, 2009"
9780007287758,"The Painted Man (The Demon Cycle, Book 1)",Peter V. Brett,4.5,577,28.54,SAR,The stunning debut fantasy novel from author P...,HarperCollins UK,544,"Fiction , Fantasy , Dark Fantasy",English,"Jan 8, 2009"


The dataset contains only English-language books, so the NLP techniques only need to handle a single language.

In [3]:
print(books_df.language.unique())

['English']


The dataset has only 1299 books, which is a small amount for a machine learning problem. A larger dataset could be used in future work. In the meantime, techniques such as K-fold cross-validation help us make better use of the limited data.

In [4]:
print(books_df.shape)

(1299, 12)


We have 183 different authors, 82 publishers and 242 distinct values in the genre column. However, each book can have several genres in that column, so 242 counts distinct genre combinations rather than individual genres. We will transform the genre column so that it holds a list of genres instead of a single string.

In [5]:
print("Number of authors: {}".format(books_df.author.nunique()))
print("Number of publishers: {}".format(books_df.publisher.nunique()))
print("Number of genres: {}".format(books_df.generes.nunique()))

Number of authors: 183
Number of publishers: 82
Number of genres: 242


We will remove the columns that do not seem to be related to the literary genre of a book. This reduces memory usage and leaves a cleaner dataset for modeling.

In [6]:
books_df.drop(columns=["rating", "voters", "price", "currency", "published_date", "language", "page_count", "title", "author", "publisher"], inplace=True)
books_df.head(10)

ISBN,description,generes
9781612626864,NO SAFE PLACE LEFT At great cost to the Garris...,none
9780758272799,Determined to make a new start in her quaint h...,"Fiction , Mystery &amp, Detective , Cozy , Gen..."
9781506713816,Take a globetrotting journey all over the worl...,"Games &amp, Activities , Video &amp, Electronic"
9781617734076,"With swollen feet and swelling belly, pregnant...",none
9780007287758,The stunning debut fantasy novel from author P...,"Fiction , Fantasy , Dark Fantasy"
9780007369218,HBO’s hit series A GAME OF THRONES is based on...,none
9781789090154,The novelization of the highly anticipated God...,"Fiction , Media Tie-In"
9781250166609,From #1 New York Times bestselling author Bran...,"Fiction , Fantasy , Epic"
9780062651242,NATIONAL BESTSELLERDeveloping video games—hero...,"Games &amp, Activities , Video &amp, Electronic"
9781529018592,A short gift book of festive hospital diaries ...,"Biography &amp, Autobiography , Medical (incl...."


Some genres are "none", which means that the genre of that book is not known. We will remove the rows that have this value.

In [7]:
# Treat the "none" placeholder as a missing value, then drop those rows
books_df["generes"] = books_df["generes"].replace("none", np.nan)
books_df.dropna(subset=["generes"], inplace=True)
print(books_df.shape)

(772, 2)


## Text preprocessing

In this phase, we will clean the texts by removing symbols, punctuation and digits. Next, we will apply some Natural Language Processing techniques to the text, such as tokenization, stop word removal and text normalization.

In [8]:
books_df["generes"] = books_df["generes"].apply(lambda x: re.sub(r"&amp", "", str(x).lower()))
books_df["generes"] = books_df["generes"].str.split(" , ")
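
As a small illustration of the two steps above, the snippet below applies the same cleaning to raw genre strings copied from the dataset preview. The helper name `split_genres` is hypothetical and exists only for this sketch.

```python
import re

def split_genres(raw):
    """Lowercase a raw genre string, drop the '&amp' residue, and split it into a list."""
    cleaned = re.sub(r"&amp", "", str(raw).lower())
    return cleaned.split(" , ")

# Raw value copied from the dataset preview
print(split_genres("Games &amp, Activities , Video &amp, Electronic"))
# ['games', 'activities', 'video', 'electronic']
```

Because removing `&amp` leaves the separator ` , ` in its place, a label such as `Games &amp, Activities` ends up as two separate tags.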

In [9]:
books_df["description"] = books_df["description"].apply(lambda x: re.sub(r"[\,\.\\\/\_\:\;\>\<\}\{\´\`\(\)\+\-\%\$\#\@\!\?\[\]]|\d", "", str(x).lower().strip()))
books_df.head()

ISBN,description,generes
9780758272799,determined to make a new start in her quaint h...,"[fiction, mystery, detective, cozy, general]"
9781506713816,take a globetrotting journey all over the worl...,"[games, activities, video, electronic]"
9780007287758,the stunning debut fantasy novel from author p...,"[fiction, fantasy, dark fantasy]"
9781789090154,the novelization of the highly anticipated god...,"[fiction, media tie-in]"
9781250166609,from new york times bestselling author brando...,"[fiction, fantasy, epic]"
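
One side effect of deleting punctuation outright is that words joined by a symbol get fused together. The sketch below shows this on a made-up sentence and contrasts it with one possible alternative, replacing each match with a space and then collapsing whitespace. The character class is meant to match the same characters as the pattern above, with redundant escapes removed. The function names and the sample text are assumptions for illustration, not part of the notebook.

```python
import re

# Same set of characters as the notebook's pattern, written with fewer escapes
PUNCT_OR_DIGIT = re.compile(r"[,.\\/_:;><}{´`()+\-%$#@!?\[\]]|\d")

def clean_fused(text):
    """Delete matches outright, as the notebook does."""
    return PUNCT_OR_DIGIT.sub("", text.lower().strip())

def clean_spaced(text):
    """Hypothetical variant: replace matches with a space, then collapse whitespace."""
    spaced = PUNCT_OR_DIGIT.sub(" ", text.lower())
    return re.sub(r"\s+", " ", spaced).strip()

sample = "All over the world-and beyond-with this #1 guide!"  # made-up sentence
print(clean_fused(sample))   # contains the fused tokens 'worldand' and 'beyondwith'
print(clean_spaced(sample))  # all over the world and beyond with this guide
```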


In [10]:
def text_preprocessing(text):
    #Tokenization
    tokenized = word_tokenize(text)
    #Text normalization
    #stemming
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(token) for token in tokenized]
    #lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(token) for token in stemmed]
    #Remove stopwords
    stop_words = stopwords.words("english")
    text_no_stop = [token for token in lemmatized if token not in stop_words]
    
    processed_text = " ".join(text_no_stop)

    return processed_text

In [11]:
books_df["description"] = books_df["description"].apply(text_preprocessing)
books_df["description"].head()

ISBN
9780758272799    determin make new start quaint hometown bank m...
9781506713816    take globetrot journey worldand beyondwith thi...
9780007287758    stun debut fantasi novel author peter v brett ...
9781789090154    novel highli anticip god war game hi vengeanc ...
9781250166609    new york time bestsel author brandon sanderson...
Name: description, dtype: object
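
Stems such as `thi` and `hi` in the output above come from the stop words "this" and "his": they are stemmed first, and the stemmed form no longer matches the stop word list. The sketch below shows the effect of the ordering on a toy token list. It is only an illustration of one possible reordering, not the notebook's method. `STOP_WORDS` is a tiny hand-written subset used here so the example does not depend on downloading NLTK corpora.

```python
from nltk.stem import PorterStemmer

# Tiny hypothetical subset of English stop words, for illustration only
STOP_WORDS = {"this", "his", "the", "of", "is"}

stemmer = PorterStemmer()
tokens = ["this", "is", "the", "story", "of", "his", "vengeance"]

# Order used in the notebook: stem first, filter afterwards
stem_then_filter = [s for s in (stemmer.stem(t) for t in tokens) if s not in STOP_WORDS]

# Alternative order: filter on the original tokens, then stem what is left
filter_then_stem = [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(stem_then_filter)  # still contains 'thi' and 'hi'
print(filter_then_stem)  # only the stems of 'story' and 'vengeance' remain
```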

We will create a list containing all the literary genres present in the dataset 

In [12]:
genres = []

# Collect each genre once, in order of first appearance
for genre_list in books_df["generes"]:
    for genre in genre_list:
        if genre not in genres:
            genres.append(genre)
print(genres)
print(len(genres))

['fiction', 'mystery', 'detective', 'cozy', 'general', 'games', 'activities', 'video', 'electronic', 'fantasy', 'dark fantasy', 'media tie-in', 'epic', 'biography', 'autobiography', 'medical (incl. patients)', 'dragons', 'mythical creatures', 'comics', 'graphic novels', 'superheroes', 'comics & graphic novels', 'sports', 'military', 'science fiction', 'women', 'juvenile fiction', 'humorous stories', 'classics', 'business', 'economics', 'motivational', 'social science', 'action', 'adventure', 'women sleuths', 'noir', 'leadership', 'literary criticism', 'accounting', 'financial', 'cooking', 'methods', 'baking', 'literary collections', 'letters', 'literary', 'amateur sleuth', 'thrillers', 'suspense', 'industries', 'computers', 'information technology', 'self-help', 'personal growth', 'marketing', 'family', 'relationships', 'private investigators', 'self-esteem', 'crime', 'management', 'corporate finance', 'psychology', 'interpersonal relations', 'personal success', 'hard-boiled', 'communi

Next, we turn each genre into a column, so that the dataset is in wide form. When a book has a genre in its list of genres, the column for that genre gets the value 1; otherwise it gets 0. For example, if a book is only "adventure", then its "adventure" column will have a value of 1 and the rest of the genre columns will have a value of 0.

In [13]:
for genre in genres:
    books_df[genre] = books_df["generes"].apply(lambda x: 1 if genre in x else 0)
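
The same 0/1 encoding can also be produced with scikit-learn's `MultiLabelBinarizer`. The sketch below runs it on three made-up genre lists; it is an alternative shown for comparison, not the approach used in this notebook, and the toy data is an assumption.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy genre lists, one per book (made up for illustration)
genre_lists = [
    ["fiction", "fantasy", "epic"],
    ["adventure"],
    ["fiction", "adventure"],
]

mlb = MultiLabelBinarizer()
indicator = mlb.fit_transform(genre_lists)

print(list(mlb.classes_))  # ['adventure', 'epic', 'fantasy', 'fiction']
print(indicator.tolist())  # [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 1]]
```

Each row is a book and each column is a genre, in the sorted order given by `mlb.classes_`.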

## Modeling and evaluation
First, we will split the data into two sets: a training set, used to fit the Machine Learning (ML) algorithm, and a test set, used to check how well the model does on data it has not yet seen. Next, we will use a pipeline so that the data is vectorized with tf-idf (term frequency-inverse document frequency) and then passed to the ML algorithm through the One-vs-Rest (OvR) technique, which fits one binary classifier per class. This allows us to do multilabel classification, that is, a book can be assigned more than one of the available classes. For more information about multilabel classification, see [here](https://towardsdatascience.com/multi-label-text-classification-with-scikit-learn-30714b7819c5), and for more information about OvR, see [here](https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/).

In [14]:
train, test = train_test_split(books_df, test_size=0.2, shuffle=True, random_state=42)
x_train = train["description"]
x_test = test["description"]

We will use unigrams and bigrams as features, so that the model can also pick up on two-word phrases.
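
To make the n-gram setting concrete, the sketch below lists the unigrams and bigrams of a short token sequence in plain Python. The `ngrams` helper is hypothetical and only mimics the kind of terms that `ngram_range=(1, 2)` asks the vectorizer to extract; it does not reproduce the vectorizer's own tokenization or feature ordering.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams of the token list for n in [n_min, n_max]."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

print(ngrams("dark fantasy novel".split()))
# ['dark', 'fantasy', 'novel', 'dark fantasy', 'fantasy novel']
```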

In [15]:
pipe = Pipeline([
    ("vectorizer", TfidfVectorizer(ngram_range=(1, 2))),
    ("model", OneVsRestClassifier(MultinomialNB(), n_jobs=-1))
])
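
This notebook fits the pipeline once per genre on a single 0/1 column. As a point of comparison, `OneVsRestClassifier` can also be fitted once on a whole indicator matrix with one column per genre. The sketch below shows that variant on a made-up corpus; the documents, labels and the name `toy_pipe` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up descriptions and genre lists
docs = [
    "dragon wizard quest kingdom",
    "detective murder clue city",
    "wizard detective magic murder",
    "kingdom quest sword dragon",
]
labels = [["fantasy"], ["mystery"], ["fantasy", "mystery"], ["fantasy"]]

# One 0/1 column per genre
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

toy_pipe = Pipeline([
    ("vectorizer", TfidfVectorizer(ngram_range=(1, 2))),
    ("model", OneVsRestClassifier(MultinomialNB())),
])
toy_pipe.fit(docs, Y)  # a single fit trains one binary classifier per column

pred = toy_pipe.predict(["dragon quest"])
print(pred.shape)                   # one row per document, one column per genre
print(mlb.inverse_transform(pred))  # predicted genre tuples
```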

In [16]:
accuracies = []

for genre in genres:
    pipe.fit(x_train, train[genre])
    
    accuracy = pipe.score(x_test, test[genre])

    #If you want to check every accuracy for every single genre, uncomment the two lines below
    #print("Accuracy for genre {genre} is {score} on training set".format(genre=genre, score=pipe.score(x_train, train[genre])))
    #print("Accuracy for genre {genre} is {score} on test set".format(genre=genre, score=accuracy))
    
    accuracies.append(accuracy)

genres_and_accuracies = {k: v for (k, v) in zip(accuracies, genres)}
print("Overall accuracies mean: {acc_mean}".format(acc_mean=np.mean(accuracies)))

acc_min = np.amin(accuracies)
print("The min accuracy was on genre {genre} with the value {acc_min}".format(genre=genres_and_accuracies.get(acc_min), acc_min=acc_min))

acc_max = np.amax(accuracies)
print("The max accuracy was on genre {genre} with the value {acc_max}".format(genre=genres_and_accuracies.get(acc_max), acc_max=acc_max))

Overall accuracies mean: 0.9909581484590478
The min accuracy was on genre general with the value 0.8193548387096774
The max accuracy was on genre health care delivery with the value 1.0
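
The lookup above uses accuracy values as dictionary keys, so when several genres share the same accuracy only the last one is kept and reported. A sketch of an index-based lookup that also lists ties is shown below; the genre names and accuracy values are made up for illustration.

```python
# Made-up per-genre accuracies, in the same order as the genre list
genres = ["general", "fiction", "epic"]
accuracies = [0.82, 1.0, 1.0]

# Look up the worst genre by position instead of by accuracy value
worst_idx = min(range(len(genres)), key=accuracies.__getitem__)
print(genres[worst_idx], accuracies[worst_idx])  # general 0.82

# Report every genre that reaches the best accuracy, not just one of them
best_acc = max(accuracies)
tied_best = [g for g, a in zip(genres, accuracies) if a == best_acc]
print(tied_best)  # ['fiction', 'epic']
```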


Because the dataset is small, we can obtain a more reliable accuracy estimate with the K-fold cross-validation technique. We will divide the dataset into 5 parts of roughly equal size; in each round, 1 part is used as the validation set and the other 4 train the model. This yields 5 accuracy values, and we take their average as the result.
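
The splitting scheme can be sketched in plain Python as follows. The `k_fold_indices` helper is hypothetical and uses simple contiguous folds. When `cross_val_score` is given an integer `cv` and a classifier, scikit-learn uses stratified folds instead, so its actual splits differ from this sketch.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

# 772 is the number of books left after dropping rows without a genre
folds = list(k_fold_indices(772, k=5))
print([len(validation) for _, validation in folds])  # [155, 155, 154, 154, 154]
```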

In [17]:
accuracies = []

for genre in genres:
    accuracy = cross_val_score(pipe, books_df["description"], books_df[genre], cv=5, n_jobs=-1).mean()

    #If you want to check every accuracy for every single genre, uncomment the line below
    #print("Accuracy for genre {genre} is {score} using cross validation".format(genre=genre, score=accuracy))

    accuracies.append(accuracy)

genres_and_accuracies = {k: v for (k, v) in zip(accuracies, genres)}
print("Overall accuracies mean with cross validation: {acc_mean}".format(acc_mean=np.mean(accuracies)))

acc_min = np.amin(accuracies)
print("The min accuracy was on genre {genre} with value {acc_min} using cross validation".format(genre=genres_and_accuracies.get(acc_min), acc_min=acc_min))

acc_max = np.amax(accuracies)
print("The max accuracy was on genre {genre} with value {acc_max} using cross validation".format(genre=genres_and_accuracies.get(acc_max), acc_max=acc_max))

Overall accuracies mean with cross validation: 0.9911545032494788
The min accuracy was on genre general with value 0.7823963133640552 using cross validation
The max accuracy was on genre health care delivery with value 0.9987096774193549 using cross validation
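
When reading per-genre accuracies, it can help to compare them with a trivial baseline. For a genre that applies to very few books, a classifier that never predicts the genre already scores a high accuracy while finding none of the books that have it. The sketch below illustrates this with made-up labels; the count of 2 positive books is an assumption, and 155 is used only because it corresponds to a 20% split of 772 books.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted as positive."""
    positives = sum(y_true)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / positives if positives else 0.0

# Hypothetical rare genre: 2 of 155 test books carry it
y_true = [1] * 2 + [0] * 153
y_pred = [0] * 155  # baseline that never predicts the genre

print(round(accuracy(y_true, y_pred), 3))  # 0.987
print(recall(y_true, y_pred))              # 0.0
```

Metrics such as recall, precision or F1 per genre would complement accuracy in this setting.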


# Results and discussion
After pre-processing the book descriptions with Natural Language Processing (NLP) techniques, vectorizing them with term frequency-inverse document frequency (tf-idf), and training a Machine Learning model, we obtained a mean per-genre accuracy of about 99% in identifying the literary genres of a book from a brief description of it. Since this figure averages one binary classifier per genre, and genres that apply to few books are easy to score well on, it should be read as an average over genres rather than as the share of books whose tags are all correct. This system can help authors and publishers publish their books more effectively: a book publishing platform could use this algorithm to recommend genre tags at the time of publication, or to categorize already published books that do not have a specified genre.
Although the results are very good, the project can be extended by using more book data, trying other Machine Learning algorithms, and tuning hyperparameters to obtain even more accurate results.