# Intro
Some websites let writers and publishers publish their books in a simple way. Readers tend to choose their next book by literary genre, so choosing genres well at publication time helps a book reach the right audience, which in turn can increase sales or visibility.
The focus of this project is a method for recommending literary genre tags to writers and publishers. While they fill out a book's information on a platform, the method suggests the genre tags that best fit the description of the work.

# Data
We will use a dataset of books published on the Goodreads platform. It contains information such as the authors' names, title, genres, description of the book, and so on. We will analyze it and build a classification method that suggests genres for an unpublished work based on its description.
All information about the dataset can be found [here](https://www.kaggle.com/bahramjannesarr/goodreads-book-datasets-10m).

In [1]:
#Load packages
import pandas as pd
import re
import numpy as np
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from sklearn.metrics import hamming_loss

# Download nltk data. Uncomment if you need to download NLTK data
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('stopwords')

In [3]:
books_df = pd.read_csv("books_data.csv", usecols=["Name", "Authors", "Description", "Genres"])
books_df.head()

| | Name | Authors | Description | Genres |
|---|---|---|---|---|
| 0 | Haroun and the Sea of Stories | Salman Rushdie | The author of The Satanic Verses returns with ... | ['Fiction'] |
| 1 | The Desire and Pursuit of the Whole: A Romance... | Frederick Rolfe | `<i>The Desire and Pursuit of the Whole</i> sta...` | ['Fiction'] |
| 2 | Green Arrow, Vol. 2: Sounds of Violence | Kevin Smith | The reinvention of a classic comics character ... | ['Comic books, strips, etc'] |
| 3 | Trojan Odyssey (Dirk Pitt, #17) | Clive Cussler | Long hailed as the grand master of adventure f... | ['Fiction'] |
| 4 | Strontium Dog: Search/Destroy Agency Files, Vo... | John Wagner | Earth, the late 22nd century. Following the at... | ['Bounty hunters'] |


In [4]:
print(books_df.shape)

(111436, 4)


In [5]:
books_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111436 entries, 0 to 111435
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Name         111436 non-null  object
 1   Authors      111436 non-null  object
 2   Description  111436 non-null  object
 3   Genres       111436 non-null  object
dtypes: object(4)
memory usage: 3.4+ MB


The dataset has 111,436 samples, which is a comfortable amount of data for a machine learning problem.

We will focus on text analysis of the "Description" column, so we drop the "Name" and "Authors" columns. This reduces memory usage and leaves a cleaner dataset for modeling.

In [6]:
books_description_and_genres = books_df.drop(columns=["Name", "Authors"])

# Text preprocessing

In this phase, we clean the texts by removing symbols, punctuation, digits, and so on. Next, we apply some Natural Language Processing techniques to the text: tokenization, text normalization (stemming), and stop word removal.

In [6]:
books_description_and_genres["Genres"] = books_description_and_genres["Genres"].apply(lambda x: re.sub(r"[\[\]\']|&amp", "", str(x).lower()))
books_description_and_genres["Description"] = books_description_and_genres["Description"].apply(lambda x: re.sub(r"[\,\.\\\/\_\:\;\>\<\}\{\´\`\(\)\+\-\%\$\#\@\!\?\[\]\&]|\d", "", str(x).lower().strip()))
books_description_and_genres.head()

| | Description | Genres |
|---|---|---|
| 0 | the author of the satanic verses returns with ... | fiction |
| 1 | ithe desire and pursuit of the wholei stands a... | fiction |
| 2 | the reinvention of a classic comics character ... | comic books, strips, etc |
| 3 | long hailed as the grand master of adventure f... | fiction |
| 4 | earth the late nd century following the atomic... | bounty hunters |
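
The cleaned output above still contains residue such as "ithe ... wholei": the character class removes `<` and `>` but leaves the tag name `i` glued to the neighbouring words. A minimal sketch of one way to avoid this, assuming tags and entities should be handled before punctuation is stripped. `strip_markup` is a hypothetical helper and is not part of the original notebook.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_markup(text):
    """Remove HTML tags and decode entities before punctuation is stripped."""
    without_tags = TAG_PATTERN.sub(" ", text)
    decoded = html.unescape(without_tags)
    # Collapse the whitespace left behind by the removed tags
    return re.sub(r"\s+", " ", decoded).strip()

sample = "<i>The Desire and Pursuit of the Whole</i> stands &amp; endures"
print(strip_markup(sample))
```

If adopted, it would run on the raw "Description" text before the `re.sub` cleaning step; the recorded outputs in this notebook were produced without it.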


In [7]:
def text_preprocessing(text):
    # Tokenization
    tokenized = word_tokenize(text)
    # Text normalization
    # Stemming
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(token) for token in tokenized]
    # Lemmatization
    # lemmatizer = WordNetLemmatizer()
    # lemmatized = [lemmatizer.lemmatize(token) for token in stemmed]
    # Remove stopwords
    stop_words = stopwords.words("english")
    text_no_stop = [token for token in stemmed if token not in stop_words]
    
    processed_text = " ".join(text_no_stop)

    return processed_text

In [8]:
books_description_and_genres["Description"] = books_description_and_genres["Description"].apply(lambda x: text_preprocessing(x))
books_description_and_genres["Description"].head()

0    author satan vers return hi humor access novel...
1    ith desir pursuit wholei stand uniqu scurril s...
2    reinvent classic comic charact continuesoliv q...
3    long hail grand master adventur fiction clive ...
4    earth late nd centuri follow atom war britain ...
Name: Description, dtype: object
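
The first row of the output above keeps the token "hi", which is most likely the stem of "his": `text_preprocessing` stems first, and stemmed tokens may no longer match the stop word list. The sketch below shows the effect of filtering before stemming. To stay runnable without NLTK data it uses a toy stemmer and a three-word stop list, both stand-ins (assumptions) for `PorterStemmer().stem` and `stopwords.words("english")`.

```python
def remove_stop_words_then_stem(tokens, stop_words, stem):
    """Filter stop words on the raw tokens, then stem what is left."""
    stop_set = set(stop_words)  # set membership is faster than scanning a list
    return [stem(token) for token in tokens if token not in stop_set]

def toy_stem(token):
    # Stand-in for PorterStemmer().stem: only drops a trailing "s"
    return token[:-1] if len(token) > 2 and token.endswith("s") else token

toy_stop_words = ["with", "his", "the"]  # stand-in for the NLTK English list
tokens = ["returns", "with", "his", "novels"]

# Order used in the notebook: stem, then filter
stem_first = [t for t in map(toy_stem, tokens) if t not in toy_stop_words]
# Alternative order: filter, then stem
filter_first = remove_stop_words_then_stem(tokens, toy_stop_words, toy_stem)

print(stem_first)    # "hi" slips past the stop word list
print(filter_first)
```

Swapping the order in `text_preprocessing` would change the vocabulary, so the results reported later would need to be recomputed.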

We will count how many distinct literary genres the dataset contains and list the 100 most frequent ones.

In [9]:
print("Number of genres {num_genres}".format(num_genres=books_description_and_genres["Genres"].nunique()))
# Name both columns explicitly so the result does not depend on the pandas version
top_100_genres = (
    books_description_and_genres["Genres"]
    .value_counts()
    .head(100)
    .rename_axis("Genres")
    .reset_index(name="Number of books")
)

top_100_genres

Number of genres 6282


| | Genres | Number of books |
|---|---|---|
| 0 | fiction | 28767 |
| 1 | juvenile fiction | 10559 |
| 2 | history | 7477 |
| 3 | religion | 3334 |
| 4 | biography & autobiography | 3167 |
| ... | ... | ... |
| 95 | christian life | 58 |
| 96 | africa | 58 |
| 97 | europe | 58 |
| 98 | detective and mystery stories, english | 58 |


We have 6282 genres, and many of them cover only a small number of books. We therefore keep only the 100 most frequent genres, which reduces memory usage. One thing to notice is that "fiction" is by far the most common genre, with almost 3 times as many books as the second most frequent genre, "juvenile fiction".

In [10]:
# value_counts() already returns unique genres, so the column converts directly to a list
genres = top_100_genres["Genres"].tolist()

Next, we turn each genre into a column, so the dataset is in wide form. When a sample's genre matches the genre of a column, that column gets a value of 1; otherwise it gets 0. For example, if a book is only "adventure", its "adventure" column has a value of 1 and the rest of the genre columns have a value of 0. The comparison is an exact match on the cleaned genre string, so a book whose genre is not among the 100 kept genres has 0 in every column.

In [11]:
for genre in genres:
    # 1 when the cleaned genre string equals this genre, 0 otherwise
    books_description_and_genres[genre] = (books_description_and_genres["Genres"] == genre).astype(int)

books_description_and_genres.head()

| | Description | Genres | fiction | juvenile fiction | history | religion | biography & autobiography | comics & graphic novels | juvenile nonfiction | business & economics | ... | fantasy fiction, american | authors, american | australian fiction | london (england) | children | christian life | africa | europe | detective and mystery stories, english | artists |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | author satan vers return hi humor access novel... | fiction | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ith desir pursuit wholei stand uniqu scurril s... | fiction | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | reinvent classic comic charact continuesoliv q... | comic books, strips, etc | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | long hail grand master adventur fiction clive ... | fiction | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | earth late nd centuri follow atom war britain ... | bounty hunters | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
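
Because the genre columns are filled by comparing whole genre strings, a book whose genre was not kept has no positive label at all. The sketch below reproduces that behaviour on toy data in plain Python and counts such rows. `indicator_rows` and the two-genre list are hypothetical stand-ins for the notebook's 100 columns.

```python
def indicator_rows(genre_per_book, kept_genres):
    """Build one 0/1 row per book, one column per kept genre (exact match)."""
    return [[1 if genre == kept else 0 for kept in kept_genres]
            for genre in genre_per_book]

kept_genres = ["fiction", "history"]  # toy stand-in for the top 100 genres
genre_per_book = ["fiction", "bounty hunters", "history", "fiction"]

rows = indicator_rows(genre_per_book, kept_genres)
# A row that sums to 0 belongs to a book with no kept genre
unlabeled = sum(1 for row in rows if sum(row) == 0)

print(rows)
print("books without any kept genre:", unlabeled)
```

Counting these rows in the real dataset would show how many training examples carry no positive label, which is worth knowing before interpreting the metrics.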


# Modeling
First, we split the data into two sets: a training set, used to fit the Machine Learning (ML) algorithm, and a test set, used to check how well the model does on data it has not yet seen. Next, we build two pipelines that differ only in how the text is vectorized: one uses tf-idf (term frequency-inverse document frequency) and the other uses bag-of-words (BoW) counts. Both feed a Naive Bayes ML algorithm wrapped in the One-vs-Rest (OvR) technique, which trains one binary classifier per genre. This setup supports multilabel classification, that is, a sample can be assigned more than one of the available classes. For more about multilabel classification, see [here](https://towardsdatascience.com/multi-label-text-classification-with-scikit-learn-30714b7819c5), and for more about OvR, see [here](https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/).

In [12]:
x = books_description_and_genres["Description"]
y = books_description_and_genres.drop(columns=["Description", "Genres"])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, shuffle=True, random_state=42)

We will use unigrams (single words) and bigrams (pairs of consecutive words) as features, so the model can pick up short phrases as well as individual terms.
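
To make the idea concrete, here is a small plain-Python sketch of what a unigram plus bigram range produces for one preprocessed description. `word_ngrams` is a hypothetical helper that splits on whitespace, which is a simplification: the scikit-learn vectorizers apply their own tokenization.

```python
def word_ngrams(text, min_n=1, max_n=2):
    """Return all word n-grams of length min_n..max_n, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the text
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

print(word_ngrams("grand master adventur fiction"))
```

Each of these strings is a candidate feature; `max_features=10000` then keeps only the most frequent ones across the training set.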

In [13]:
pipe_tf_idf = Pipeline([
    ("vectorizer", TfidfVectorizer(ngram_range=(1, 2), max_features=10000)),
    ("model", OneVsRestClassifier(MultinomialNB(), n_jobs=-1))
])

pipe_tf_idf.fit(x_train, y_train)

Pipeline(steps=[('vectorizer',
                 TfidfVectorizer(max_features=10000, ngram_range=(1, 2))),
                ('model',
                 OneVsRestClassifier(estimator=MultinomialNB(), n_jobs=-1))])

In [14]:
pipe_bag_of_words = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 2), max_features=10000)),
    ("model", OneVsRestClassifier(MultinomialNB(), n_jobs=-1))
])

pipe_bag_of_words.fit(x_train, y_train)

Pipeline(steps=[('vectorizer',
                 CountVectorizer(max_features=10000, ngram_range=(1, 2))),
                ('model',
                 OneVsRestClassifier(estimator=MultinomialNB(), n_jobs=-1))])

# Evaluation
Hamming loss is one of the most widely used evaluation metrics for multilabel classification. Accuracy, in the multilabel setting, counts a sample as correct only when every genre of that sample is predicted correctly. For example, if the algorithm predicts that a hypothetical book is action and adventure, but the book is actually adventure and romance, accuracy treats the prediction as an error even though one of the genres is right. Hamming loss instead checks each prediction label by label and reports the fraction of labels that are wrong, so the lower the Hamming loss, the better the result. More information [here](https://www.geeksforgeeks.org/an-introduction-to-multilabel-classification/).
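
The action/adventure/romance example can be worked through by hand. The sketch below implements both metrics in plain Python for illustration only; the notebook itself relies on `Pipeline.score` and `sklearn.metrics.hamming_loss`, and the two function names here are hypothetical.

```python
def hamming_loss_manual(y_true, y_pred):
    """Fraction of individual label slots that are predicted wrongly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for true_row, pred_row in zip(y_true, y_pred)
                for t, p in zip(true_row, pred_row))
    return wrong / total

def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label row is predicted exactly."""
    exact = sum(true_row == pred_row
                for true_row, pred_row in zip(y_true, y_pred))
    return exact / len(y_true)

# Columns: action, adventure, romance
y_true = [[0, 1, 1]]  # the book is adventure + romance
y_pred = [[1, 1, 0]]  # the model says action + adventure

print(subset_accuracy(y_true, y_pred))      # no row matches exactly
print(hamming_loss_manual(y_true, y_pred))  # 2 of 3 label slots are wrong
```

Accuracy gives no credit for the correct "adventure" label, while Hamming loss registers that one of the three label decisions was right.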

In [15]:
print("TF-IDF results:")
print("Accuracy: {accuracy}".format(accuracy=pipe_tf_idf.score(x_test, y_test)))

predicted_tf_idf_labels = pipe_tf_idf.predict(x_test)

hamming_loss_result = hamming_loss(y_test, predicted_tf_idf_labels)
print("Hamming loss: {hamming_loss}".format(hamming_loss=hamming_loss_result))

TF-IDF results:
Accuracy: 0.38258255563531945
Hamming loss: 0.007022164393395549


In [16]:
print("Bag-of-Words results:")
print("Accuracy: {accuracy}".format(accuracy=pipe_bag_of_words.score(x_test, y_test)))

predicted_bow_labels = pipe_bag_of_words.predict(x_test)

hamming_loss_result = hamming_loss(y_test, predicted_bow_labels)
print("Hamming loss: {hamming_loss}".format(hamming_loss=hamming_loss_result))

Bag-of-Words results:
Accuracy: 0.13419777458722182
Hamming loss: 0.03543386575735822


As expected, the accuracy is low, because it penalizes a prediction even when only 1 of the labels is wrong. A metric designed for multilabel problems, such as Hamming loss, shows a very small error rate, meaning that the vast majority of individual label decisions are correct. Keep in mind that with 100 label columns that are almost all zero, the score is dominated by correctly predicted zeros, so Hamming loss is most useful for comparing models with each other. Again, the lower the Hamming loss, the better.
TF-IDF outperforms the plain bag-of-words model on both metrics: accuracy of about 0.38 versus 0.13, and Hamming loss of about 0.007 versus 0.035.
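
One way to put a Hamming loss value in context is to compare it with a trivial model that never predicts any genre. Its Hamming loss equals the share of label slots that are positive. Since each book here has at most one positive label among 100 columns, that baseline cannot exceed 0.01. The sketch below computes it on toy data; `all_zero_hamming_loss` is a hypothetical helper.

```python
def all_zero_hamming_loss(y_true):
    """Hamming loss of a model that never predicts any label."""
    total_slots = sum(len(row) for row in y_true)
    positives = sum(sum(row) for row in y_true)
    # Every positive slot is an error; every zero slot is correct
    return positives / total_slots

# Toy data: 4 books, 5 genre columns, at most one positive per book
y_true = [
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],  # genre outside the kept list
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
]

print(all_zero_hamming_loss(y_true))  # 3 positives out of 20 slots
```

On the notebook's data, the same number should be obtainable as `y_test.to_numpy().mean()`. That value was not computed in the original run, so no figure is reported here.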

# Results and discussion
After preprocessing the book descriptions with Natural Language Processing (NLP) techniques and training a Naive Bayes ML algorithm, we obtained a good result in identifying the literary genre of a book from a brief description of it. A book publishing platform could use this algorithm to recommend genre tags at the time of publication, helping authors and publishers position their books, or to categorize already published books that have no genre specified.
Although the results are good, the project can be extended by trying other Machine Learning algorithms, tuning hyperparameters to obtain more accurate results, and using Part-of-Speech (POS) tags so that lemmatization can be used in text preprocessing.

In [26]:
# A Game of Thrones
# A Song of Ice and Fire
book_description = "Winter is coming. Such is the stern motto of House Stark, the northernmost of the fiefdoms that owe allegiance to King Robert Baratheon in far-off King’s Landing. There Eddard Stark of Winterfell rules in Robert’s name. There his family dwells in peace and comfort: his proud wife, Catelyn; his sons Robb, Brandon, and Rickon; his daughters Sansa and Arya; and his bastard son, Jon Snow. Far to the north, behind the towering Wall, lie savage Wildings and worse—unnatural things relegated to myth during the centuries-long summer, but proving all too real and all too deadly in the turning of the season."

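# Clean the text the same way as the training descriptions before tokenizing
cleaned_book_description = re.sub(r"[\,\.\\\/\_\:\;\>\<\}\{\´\`\(\)\+\-\%\$\#\@\!\?\[\]\&]|\d", "", book_description.lower().strip())
processed_book_description = text_preprocessing(cleaned_book_description)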

y_pred = pipe_tf_idf.predict([processed_book_description])

# If you want to see the entire array, uncomment the line below
# print(y_pred)

# Keep the column names whose predicted indicator is 1
genres_pred = [genre for genre, flag in zip(y.columns, y_pred[0]) if flag == 1]

print("Predicted genres: {genres_pred}".format(genres_pred=genres_pred))

Predicted genres: ['fiction']
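
`predict` returns only the genres whose binary classifier fires, which here yields a single tag. For a tag recommender, one option is to rank genres by predicted probability and suggest the top few. `OneVsRestClassifier` exposes `predict_proba` when its base estimator does, as `MultinomialNB` does. The sketch below shows only the ranking step, on made-up scores; `top_k_labels` and the probability values are hypothetical and are not outputs of the trained model.

```python
def top_k_labels(probabilities, labels, k=3):
    """Return the k labels with the highest scores, best first."""
    ranked = sorted(zip(labels, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in ranked[:k]]

labels = ["fiction", "juvenile fiction", "history", "religion"]
probabilities = [0.62, 0.21, 0.09, 0.30]  # made-up scores for illustration

print(top_k_labels(probabilities, labels, k=2))
```

In the notebook, the scores for one description would come from something like `pipe_tf_idf.predict_proba([processed_book_description])[0]`, paired with `y.columns` as the labels.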
