In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


## Predicting the Genre of Books from Summaries

In this portfolio, I will try to build a predictive model that determines the genre of a book from the text of its summary.

I'll use a set of book summaries from the [CMU Book Summaries Corpus](http://www.cs.cmu.edu/~dbamman/booksummaries.html) in this experiment.  This contains a large number of summaries (16,559) and includes meta-data about the genre of the books taken from Freebase.  Each book can have more than one genre and there are 227 genres listed in total.  To simplify the problem of genre prediction I will select a small number of target genres that occur frequently in the collection and select the books with these genre labels.  This will give me one genre label per book. 

## Data Preparation

First, I read the data. It is made available in tab-separated format but has no column headings, so I use `read_csv` with the separator set to `\t` (tab) and supply the column names myself. The names come from the [ReadMe](data/booksummaries/README.txt) file.

In [3]:
names = ['wid', 'fid', 'title', 'author', 'date', 'genres', 'summary']

books = pd.read_csv("data/booksummaries/booksummaries.txt", sep="\t", header=None, names=names, keep_default_na=False)
books.head()

Unnamed: 0,wid,fid,title,author,date,genres,summary
0,620,/m/0hhy,Animal Farm,George Orwell,1945-08-17,"{""/m/016lj8"": ""Roman \u00e0 clef"", ""/m/06nbt"":...","Old Major, the old boar on the Manor Farm, ca..."
1,843,/m/0k36,A Clockwork Orange,Anthony Burgess,1962,"{""/m/06n90"": ""Science Fiction"", ""/m/0l67h"": ""N...","Alex, a teenager living in near-future Englan..."
2,986,/m/0ldx,The Plague,Albert Camus,1947,"{""/m/02m4t"": ""Existentialism"", ""/m/02xlf"": ""Fi...",The text of The Plague is divided into five p...
3,1756,/m/0sww,An Enquiry Concerning Human Understanding,David Hume,,,The argument of the Enquiry proceeds by a ser...
4,2080,/m/0wkt,A Fire Upon the Deep,Vernor Vinge,,"{""/m/03lrw"": ""Hard science fiction"", ""/m/06n90...",The novel posits that space around the Milky ...


Next I will filter the data so that only books with one of my target genre labels are included, and assign each text to just one of those labels. A text could carry two of these labels (e.g. Science Fiction and Fantasy), but I will assign only one of them here.

In [10]:
target_genres = ["Children's literature",
                 'Science Fiction',
                 'Novel',
                 'Fantasy',
                 'Mystery']

# create a Series of empty strings the same length as the list of books
genre = pd.Series(np.repeat("", books.shape[0]))
# look for each target genre and set the matching entries in the genre series to that label
# note: str.contains matches substrings, so 'Novel' also matches 'Novella';
# a book matching several targets keeps the last match in target_genres order
for g in target_genres:
    genre[books['genres'].str.contains(g)] = g

# add this to the book dataframe, keep only the rows that have a genre label,
# and drop the columns that are not needed for modelling
books['genre'] = genre
genre_books = books[genre!=''].drop(['genres', 'fid', 'wid'], axis=1)

genre_books.shape

(8954, 5)
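
To see how the labelling loop behaves when a book matches more than one target genre, here is a minimal sketch on a toy frame. The Freebase-style ids and the four rows are invented for illustration; only the loop itself mirrors the cell above (with `regex=False` added so the match is a plain substring test).

```python
import pandas as pd

# Toy stand-in for books['genres']; ids and rows are made up for illustration.
toy = pd.DataFrame({"genres": [
    '{"/m/a": "Science Fiction", "/m/b": "Fantasy"}',  # matches two targets
    '{"/m/c": "Novella"}',                             # contains the substring "Novel"
    '{"/m/d": "Poetry"}',                              # matches no target
    "",                                                # no genre information at all
]})
target_genres = ["Science Fiction", "Novel", "Fantasy"]

# Same idea as the notebook cell: later genres overwrite earlier ones.
label = pd.Series("", index=toy.index)
for g in target_genres:
    label[toy["genres"].str.contains(g, regex=False)] = g

print(label.tolist())
# → ['Fantasy', 'Novel', '', '']
```

The first row ends up as Fantasy only because Fantasy comes after Science Fiction in the list, and the second row is labelled Novel through a substring match on "Novella".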

In [5]:
# check how many books we have in each genre category
genre_books.groupby('genre').count()

genre,title,author,date,summary
Children's literature,1092,1092,1092,1092
Fantasy,2311,2311,2311,2311
Mystery,1396,1396,1396,1396
Novel,2258,2258,2258,2258
Science Fiction,1897,1897,1897,1897


## Modelling

Now that the data is ready, I will build a model that predicts the genre of a book from its summary, trained on this dataset.

In [11]:
genre_books['summary']

0         Old Major, the old boar on the Manor Farm, ca...
1         Alex, a teenager living in near-future Englan...
2         The text of The Plague is divided into five p...
4         The novel posits that space around the Milky ...
6         Ged is a young boy on Gont, one of the larger...
                               ...                        
16525     Beautiful Creatures is set in fictional Gatli...
16526     After returning home, more strange things are...
16531                                           ==Receptio
16532     The novel is split into seven parts, the firs...
16549     The story starts with former government agent...
Name: summary, Length: 8954, dtype: object

In [94]:
genre_books.head()

Unnamed: 0,title,author,date,summary,genre
0,Animal Farm,George Orwell,1945-08-17,"Old Major, the old boar on the Manor Farm, ca...",Children's literature
1,A Clockwork Orange,Anthony Burgess,1962,"Alex, a teenager living in near-future Englan...",Novel
2,The Plague,Albert Camus,1947,The text of The Plague is divided into five p...,Novel
4,A Fire Upon the Deep,Vernor Vinge,,The novel posits that space around the Milky ...,Fantasy
6,A Wizard of Earthsea,Ursula K. Le Guin,1968,"Ged is a young boy on Gont, one of the larger...",Fantasy


CountVectorizer handles text preprocessing and tokenizing (and, optionally, stop-word filtering through its `stop_words` parameter). It builds a dictionary of features and transforms documents into feature vectors. The result is a sparse matrix with one row per summary and one column per distinct token, which gives me an occurrence count for every word.

In [105]:
from sklearn.feature_extraction.text import CountVectorizer
 
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(genre_books.summary)
X_train_counts.shape

(8954, 88356)

The `vocabulary_` attribute maps each word to its column index in the count matrix, so the index can be used to look up how often that word occurs in each summary.

In [96]:
count_vect.vocabulary_.get(u'algorithm')

3059
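
As a rough illustration of what the vectorizer does, the sketch below rebuilds a bag-of-words matrix by hand in plain Python. It assumes the default settings visible in the pipeline output further down (`lowercase=True` and the token pattern `(?u)\b\w\w+\b`) and an alphabetically sorted vocabulary; the two sentences are invented, and this is not scikit-learn's actual implementation.

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: runs of two or more word characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN.findall(text.lower())

docs = ["The dragon guards the gold.", "A ship leaves the harbour."]

# Map every distinct token to a column index, in alphabetical order.
all_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: idx for idx, tok in enumerate(all_tokens)}

# One row per document, one column per vocabulary entry.
rows = []
for doc in docs:
    token_counts = Counter(tokenize(doc))
    rows.append([token_counts.get(tok, 0) for tok in vocabulary])

print(vocabulary["the"])  # column index of "the"
print(rows)
```

Here `vocabulary["the"]` is only a column position; the counts themselves live in that column of `rows`. The single-character token "A" is dropped by the pattern.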

Raw counts favour longer summaries, which have higher counts than shorter ones even when they cover the same topic. To compensate, the number of occurrences of each word in a summary is divided by the total number of words in that summary. These scaled counts are called tf, for Term Frequencies.

Another refinement on top of tf is to downscale weights for words that occur in many summaries and are therefore less informative than those that occur only in a smaller portion.

This downscaling is called tf–idf for “Term Frequency times Inverse Document Frequency”.

In the next step, I calculate the tf–idf weights using TfidfTransformer:

In [98]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(8954, 88356)
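
The weighting can be worked through by hand on a tiny count matrix. The sketch below assumes the default settings shown in the pipeline output (`norm='l2'`, `smooth_idf=True`, `sublinear_tf=False`), under which the idf is computed as `ln((1 + n) / (1 + df)) + 1` and each row is then scaled to unit Euclidean length rather than divided by its word total. The three-word vocabulary and the counts are invented.

```python
import math

vocab = ["dragon", "ship", "the"]
# Rows are documents, columns follow vocab; counts are made up for illustration.
counts = [
    [2, 0, 3],
    [0, 1, 4],
    [0, 2, 2],
]
n_docs = len(counts)

# Document frequency: in how many documents does each term appear?
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]

# Smoothed inverse document frequency.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

# Weight the counts by idf, then scale each row to unit L2 norm.
tfidf = []
for row in counts:
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 3) for v in row])
```

In the first document "the" occurs three times and "dragon" only twice, yet "dragon" receives the larger weight because it appears in a single document while "the" appears in all of them.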

#### Training a classifier

I will now train a multinomial Naive Bayes classifier as my predictive model.

In [99]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(X_train_tfidf, genre_books.genre)

#### The prediction

In [100]:
docs_new = ['This portfolio took me a long time to figure out', 'Once upon a time in a land far far away']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, category))

'This portfolio took me a long time to figure out' => Novel
'Once upon a time in a land far far away' => Fantasy


#### Building a pipeline classifier:

In [101]:
from sklearn.pipeline import Pipeline

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
    ])

text_clf.fit(genre_books.summary, genre_books.genre)


Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

#### Checking the accuracy of the model:

In [107]:
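# note: these are the same summaries the pipeline was fitted on,
# so the value below is training accuracy
docs_test = genre_books.summary
predicted = text_clf.predict(docs_test)
np.mean(predicted == genre_books.genre).round(3)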

0.667
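
The figure above is computed on the summaries the pipeline was trained on. A common alternative is to hold out part of the data and score the model only on summaries it has not seen. The sketch below shows one way to do that with `train_test_split`; the eight short texts and their labels are invented stand-ins for `genre_books.summary` and `genre_books.genre`, and the split size and `random_state` are arbitrary choices.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy data standing in for the real summaries and genre labels.
summaries = [
    "A wizard trains a young dragon rider",
    "The elf queen defends her enchanted forest",
    "A sorcerer hunts for a cursed sword",
    "Dwarves and a dragon fight over a magic ring",
    "A detective investigates a murder at the manor",
    "The inspector follows clues to find the killer",
    "A stolen necklace leads police to a suspect",
    "The detective questions every witness to the crime",
]
genres = ["Fantasy"] * 4 + ["Mystery"] * 4

# Keep 25% of the texts aside; stratify keeps both genres in each part.
X_train, X_test, y_train, y_test = train_test_split(
    summaries, genres, test_size=0.25, stratify=genres, random_state=0
)

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(X_train, y_train)

# score() reports mean accuracy of the final estimator.
train_acc = text_clf.score(X_train, y_train)
test_acc = text_clf.score(X_test, y_test)
print("train accuracy:", train_acc)
print("held-out accuracy:", test_acc)
```

On the real dataset the same pattern would be applied to `genre_books.summary` and `genre_books.genre`; the held-out score is the one that says something about unseen summaries.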

## Summary

The model assigns the correct genre label to about 67% of the summaries. This was measured on the same summaries the model was trained on, so accuracy on unseen text is likely to be lower. It is not a bad score considering that each book was given only one genre label even though it may belong to several. A prediction can therefore be counted as wrong even when it names one of the book's real genres, because the single label kept for each book was an arbitrary choice among its matching genres.
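
One way to quantify the effect of single-label scoring is to compare strict accuracy against a more lenient score that accepts any of a book's genres. The sketch below is a hypothetical illustration in plain Python: the function names, predictions and genre sets are all invented, and it was not run on the notebook's data.

```python
def strict_accuracy(predicted, single_labels):
    """Fraction of predictions equal to the one label kept per book."""
    hits = sum(1 for p, label in zip(predicted, single_labels) if p == label)
    return hits / len(predicted)

def lenient_accuracy(predicted, genre_sets):
    """Fraction of predictions that appear anywhere in the book's genre set."""
    hits = sum(1 for p, genres in zip(predicted, genre_sets) if p in genres)
    return hits / len(predicted)

# Invented example: the second book is both Science Fiction and Fantasy,
# but only "Fantasy" was kept as its single label.
predicted = ["Fantasy", "Science Fiction", "Novel", "Mystery"]
single_labels = ["Fantasy", "Fantasy", "Novel", "Novel"]
genre_sets = [
    {"Fantasy"},
    {"Science Fiction", "Fantasy"},
    {"Novel"},
    {"Novel"},
]

print(strict_accuracy(predicted, single_labels))  # → 0.5
print(lenient_accuracy(predicted, genre_sets))    # → 0.75
```

The gap between the two numbers shows how many "errors" are really predictions of another valid genre.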