In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline



## Predicting the Genre of Books from Summaries

This experiment uses a set of book summaries from the [CMU Book Summaries Corpus](http://www.cs.cmu.edu/~dbamman/booksummaries.html). The corpus contains 16,559 summaries, together with metadata about each book's genres taken from Freebase. Each book can have more than one genre, and 227 genres are listed in total. To simplify the problem of genre prediction, I selected a small number of target genres that occur frequently in the collection and kept only the books carrying one of those labels, assigning exactly one genre label per book.

My goal in this portfolio is to take this data and build a predictive model that classifies each book into one of the five target genres. At least one model is required; a second could be built and compared with the first if time allows.

Each stage of the experiment is reported as I work with the data.


## Data Preparation

My first task is to read the data. It is provided in tab-separated format with no column headings. I used `read_csv` to read it, setting the separator to `\t` (tab) and supplying the column names, which come from the [ReadMe](data/booksummaries/README.txt) file.

In [3]:
names = ['wid', 'fid', 'title', 'author', 'date', 'genres', 'summary']

books = pd.read_csv("data/booksummaries.txt", sep="\t", header=None, names=names, keep_default_na=False)
books.head()

Unnamed: 0,wid,fid,title,author,date,genres,summary
0,620,/m/0hhy,Animal Farm,George Orwell,1945-08-17,"{""/m/016lj8"": ""Roman \u00e0 clef"", ""/m/06nbt"":...","Old Major, the old boar on the Manor Farm, ca..."
1,843,/m/0k36,A Clockwork Orange,Anthony Burgess,1962,"{""/m/06n90"": ""Science Fiction"", ""/m/0l67h"": ""N...","Alex, a teenager living in near-future Englan..."
2,986,/m/0ldx,The Plague,Albert Camus,1947,"{""/m/02m4t"": ""Existentialism"", ""/m/02xlf"": ""Fi...",The text of The Plague is divided into five p...
3,1756,/m/0sww,An Enquiry Concerning Human Understanding,David Hume,,,The argument of the Enquiry proceeds by a ser...
4,2080,/m/0wkt,A Fire Upon the Deep,Vernor Vinge,,"{""/m/03lrw"": ""Hard science fiction"", ""/m/06n90...",The novel posits that space around the Milky ...


Next, I filtered the data so that only the target genre labels are included, and assigned each text to just one of them. A text may carry two of these labels (e.g. Science Fiction and Fantasy); in that case only one is kept. Because the loop overwrites earlier matches, the genre that appears last in `target_genres` wins.

In [5]:
target_genres = ["Children's literature",
                 'Science Fiction',
                 'Novel',
                 'Fantasy',
                 'Mystery']

# create a Series of empty strings the same length as the list of books
genre = pd.Series(np.repeat("", books.shape[0]))
# look for each target genre and set the corresponding entries in the genre series to the genre label
for g in target_genres:
    genre[books['genres'].str.contains(g)] = g

# add this to the book dataframe and then select only those rows that have a genre label
# drop some useless columns
books['genre'] = genre
genre_books = books[genre!=''].drop(['genres', 'fid', 'wid'], axis=1)

genre_books.shape

(8954, 5)
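
The labelling loop keeps the last matching genre for books that carry several target labels. The sketch below reproduces that behaviour in plain Python on two made-up `genres` strings (the ids `id1`, `id2`, `id3` and the helper name `assign_genre` are hypothetical). It uses a simple substring test, whereas `str.contains` treats the pattern as a regular expression by default; the two agree for these five genre names.

```python
# The five target genres, in the same order as the notebook's list.
target_genres = ["Children's literature", "Science Fiction", "Novel", "Fantasy", "Mystery"]

def assign_genre(genres_field, targets=target_genres):
    """Return the last target genre found in the raw genres string, or ''."""
    label = ""
    for g in targets:
        if g in genres_field:  # plain substring test
            label = g          # a later match overwrites an earlier one
    return label

# Hypothetical genres strings in the corpus's JSON-like format.
both = '{"id1": "Science Fiction", "id2": "Fantasy"}'
neither = '{"id3": "Existentialism"}'

print(assign_genre(both))           # Fantasy comes later in target_genres
print(repr(assign_genre(neither)))  # no target genre, so the book is dropped
```

Reordering `target_genres` would change which label such books receive, so the order of the list is effectively a priority ranking.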

In [6]:
# check how many books we have in each genre category
genre_books.groupby('genre').count()

genre,title,author,date,summary
Children's literature,1092,1092,1092,1092
Fantasy,2311,2311,2311,2311
Mystery,1396,1396,1396,1396
Novel,2258,2258,2258,2258
Science Fiction,1897,1897,1897,1897
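
The counts above give a useful reference point for any later accuracy figure. As a rough sketch, the snippet below computes the accuracy that a trivial classifier would reach by always predicting the most frequent genre; the counts are copied from the table, and the variable names are my own.

```python
# Genre counts copied from the groupby output above.
counts = {
    "Children's literature": 1092,
    "Fantasy": 2311,
    "Mystery": 1396,
    "Novel": 2258,
    "Science Fiction": 1897,
}

total = sum(counts.values())
majority = max(counts, key=counts.get)

# Accuracy of always predicting the most frequent genre.
baseline = counts[majority] / total
print(total, majority, round(baseline, 3))
```

Any model worth keeping should beat this majority-class baseline by a clear margin.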


In [7]:
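# Test_dataset is another name for the same DataFrame as genre_books (not a copy),
# so columns added to one are visible through the other
Test_dataset = genre_books
Test_dataset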

Unnamed: 0,title,author,date,summary,genre
0,Animal Farm,George Orwell,1945-08-17,"Old Major, the old boar on the Manor Farm, ca...",Children's literature
1,A Clockwork Orange,Anthony Burgess,1962,"Alex, a teenager living in near-future Englan...",Novel
2,The Plague,Albert Camus,1947,The text of The Plague is divided into five p...,Novel
4,A Fire Upon the Deep,Vernor Vinge,,The novel posits that space around the Milky ...,Fantasy
6,A Wizard of Earthsea,Ursula K. Le Guin,1968,"Ged is a young boy on Gont, one of the larger...",Fantasy
...,...,...,...,...,...
16525,Beautiful Creatures,Margaret Stohl,2009-12-01,Beautiful Creatures is set in fictional Gatli...,Fantasy
16526,Beautiful Chaos,Gary Russell,,"After returning home, more strange things are...",Fantasy
16531,Guardians of Ga'Hoole Book 4: The Siege,Helen Dunmore,2004-05-01,==Receptio,Fantasy
16532,The Casual Vacancy,J. K. Rowling,2012-09-27,"The novel is split into seven parts, the firs...",Fantasy


In [9]:
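# fill missing values
# (the file was read with keep_default_na=False, so missing fields are already
# empty strings rather than NaN and these calls leave the data unchanged)
genre_books['author'] = genre_books['author'].fillna('No Book')
genre_books['title'] = genre_books['title'].fillna('No Book')
genre_books['date'] = genre_books['date'].fillna('No Book')
genre_books['summary'] = genre_books['summary'].fillna('No Book')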

Remove punctuation marks from words

In [42]:
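# remove punctuation marks from summaries to clean the dataset
# (raw string for the pattern; regex=True is required because recent pandas
# versions treat the pattern as a literal string by default)
genre_books['summary'] = genre_books['summary'].str.replace(r'[^\w\s]', '', regex=True)
genre_books.head()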

Unnamed: 0,title,author,date,summary,genre,Title_Author
0,Animal Farm,George Orwell,1945-08-17,Old Major the old boar on the Manor Farm call...,0,Animal Farm George Orwell
1,A Clockwork Orange,Anthony Burgess,1962,Alex a teenager living in nearfuture England ...,3,A Clockwork Orange Anthony Burgess
2,The Plague,Albert Camus,1947,The text of The Plague is divided into five p...,3,The Plague Albert Camus
4,A Fire Upon the Deep,Vernor Vinge,,The novel posits that space around the Milky ...,1,A Fire Upon the Deep Vernor Vinge
6,A Wizard of Earthsea,Ursula K. Le Guin,1968,Ged is a young boy on Gont one of the larger ...,1,A Wizard of Earthsea Ursula K. Le Guin
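
To make the effect of the pattern concrete, here is a small standalone sketch using the standard `re` module on one of the summary openings shown earlier. The helper name `strip_punctuation` is my own; the pattern is the one used in the cell above.

```python
import re

def strip_punctuation(text):
    """Delete every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)

sample = "Alex, a teenager living in near-future England"
print(strip_punctuation(sample))
# Alex a teenager living in nearfuture England
```

Because punctuation is deleted rather than replaced by a space, hyphenated words such as "near-future" are fused into a single token.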


In [10]:
genre_books['genre'].unique()


array(["Children's literature", 'Novel', 'Fantasy', 'Science Fiction',
       'Mystery'], dtype=object)

Convert labels to machine readable format

In [11]:
# encode the genre names as integers so they can be used as class labels
from sklearn.preprocessing import LabelEncoder

# keep the fitted encoder in Le so predictions can be mapped back to genre names
Le = LabelEncoder()
genre_books['genre'] = Le.fit_transform(genre_books['genre'])

In [12]:
# numerical replacements for genres
genre_books['genre'].unique()


array([0, 3, 1, 4, 2])
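
The integers are easier to interpret alongside the names they replace. The sketch below fits a fresh `LabelEncoder` on the five genre names and builds a name-to-code dictionary from its `classes_` attribute, which holds the labels in sorted order. The variable names `encoder` and `mapping` are my own.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Children's literature", "Novel", "Fantasy", "Science Fiction", "Mystery"]
encoder = LabelEncoder().fit(labels)

# classes_ is sorted, and each label's code is its position in that array.
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)
```

This agrees with the outputs shown: the first book, Animal Farm, is labelled Children's literature and receives code 0, while Novel receives code 3.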

In [13]:
%pip install --user -U nltk

Requirement already up-to-date: nltk in /Users/brigidechikwu/.local/lib/python3.7/site-packages (3.5)
Note: you may need to restart the kernel to use updated packages.


In [14]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/brigidechikwu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [15]:
from nltk.corpus import stopwords
stop = list(stopwords.words('english'))
stop[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

NLTK provides a list of common English stop words, which I loaded into `stop`.

In [18]:
# extract individual columns and save them to new variables
Titles = Test_dataset[['title']]
Authors = Test_dataset[['author']]
Genres = Test_dataset[['genre']]


In [27]:
# combine title and author into a single text field
Test_dataset['Title_Author'] = Test_dataset['title'] + ' ' + Test_dataset['author']
print(Test_dataset['Title_Author'].head(5))

0                 Animal Farm George Orwell
1        A Clockwork Orange Anthony Burgess
2                   The Plague Albert Camus
4         A Fire Upon the Deep Vernor Vinge
6    A Wizard of Earthsea Ursula K. Le Guin
Name: Title_Author, dtype: object


In [28]:
def change(t):
    """Drop stop words from a string (case-sensitive match against the lowercase list)."""
    words = t.split()
    return ' '.join(w for w in words if w not in stop)


In [29]:
# preview the stop-word filter; the result is not assigned, so the column is unchanged
Test_dataset['Title_Author'].apply(change)


0                                Animal Farm George Orwell
1                       A Clockwork Orange Anthony Burgess
2                                  The Plague Albert Camus
4                            A Fire Upon Deep Vernor Vinge
6                      A Wizard Earthsea Ursula K. Le Guin
                               ...                        
16525                   Beautiful Creatures Margaret Stohl
16526                         Beautiful Chaos Gary Russell
16531    Guardians Ga'Hoole Book 4: The Siege Helen Dun...
16532                     The Casual Vacancy J. K. Rowling
16549                          The Third Lynx Timothy Zahn
Name: Title_Author, Length: 8954, dtype: object
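
The output above still contains capitalised words such as "A" and "The". That is because the stop-word list is lowercase and the membership test is case-sensitive. The sketch below illustrates the difference with a tiny stand-in stop list (three words chosen for illustration, not the full NLTK list) and a hypothetical `lowercase_match` switch.

```python
# A tiny stand-in for the NLTK stop-word list (illustrative only).
stop_words = {"a", "the", "of"}

def remove_stop_words(text, lowercase_match=False):
    """Drop stop words; optionally lowercase each word before the lookup."""
    kept = []
    for word in text.split():
        key = word.lower() if lowercase_match else word
        if key not in stop_words:
            kept.append(word)
    return " ".join(kept)

title = "A Wizard of Earthsea"
print(remove_stop_words(title))                        # A Wizard Earthsea
print(remove_stop_words(title, lowercase_match=True))  # Wizard Earthsea
```

Using a set rather than a list for the stop words also makes each lookup faster, which matters when the filter is applied to long summaries.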

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer
# TF-IDF features built from the combined title and author text
vectorizer = TfidfVectorizer(min_df=2, max_features=70000, strip_accents='unicode', lowercase=True,
                             analyzer='word', token_pattern=r'\w+', use_idf=True,
                             smooth_idf=True, sublinear_tf=True, stop_words='english')
vectors = vectorizer.fit_transform(Test_dataset['Title_Author'])
vectors.shape

(8954, 4134)
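
The `sublinear_tf=True` and `smooth_idf=True` options change how each term is weighted. As a rough sketch of the formulas scikit-learn documents for these options, the snippet below computes the two factors by hand; the helper names are my own, and the final L2 normalisation of each row (scikit-learn's default) is left out for brevity.

```python
import math

def sublinear_tf(count):
    """Term frequency damped as 1 + ln(count); zero counts stay zero."""
    return 1.0 + math.log(count) if count > 0 else 0.0

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

n_docs = 8954  # number of books in the filtered dataset
for doc_freq in (2, 100, 5000):
    print(doc_freq, round(smooth_idf(n_docs, doc_freq), 3))
```

Rare terms receive a larger weight than common ones, and `min_df=2` discards terms that appear in only one document before any weighting is applied.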

## Modelling

I built a model using a Multinomial Naive Bayes classifier, trained on the TF-IDF vectors of the `Title_Author` text.

In [36]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split


In [38]:
# hold out 2% of the books for testing; no random_state is set, so the split differs between runs
X_train, X_test, y_train, y_test = train_test_split(vectors, genre_books['genre'], test_size=0.02)


In [39]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(8774, 4134)
(8774,)
(180, 4134)
(180,)
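
The split above is random and unstratified. One possible refinement, shown here only as a sketch on synthetic arrays rather than on the book data, is to pass `stratify` and `random_state` to `train_test_split` so that the genre proportions are preserved and the split is repeatable. The array sizes and the 20% test fraction are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 100 rows with 2 features, 5 balanced classes.
X = np.arange(200).reshape(100, 2)
y = np.repeat(np.arange(5), 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.2,    # hold out 20% of the rows
    stratify=y,       # keep class proportions equal in both parts
    random_state=42,  # fixed seed so the split is reproducible
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))  # test examples per class
```

With stratification each class contributes the same share to the training and test sets, which matters more when the test set is small.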


In [40]:
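# Multinomial Naive Bayes with additive smoothing parameter alpha=0.45
clf = MultinomialNB(alpha=.45)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
# macro-averaged F1 (every genre weighted equally), then overall accuracy
print(metrics.f1_score(y_test, pred, average='macro'))
print(metrics.accuracy_score(y_test, pred))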

0.5827046282724414
0.6111111111111112
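
With only 180 test books, the accuracy figure carries noticeable sampling noise. As a back-of-the-envelope sketch, the snippet below estimates its standard error with the usual binomial approximation, assuming the test examples are independent; the helper name is my own.

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Binomial approximation: sqrt(p * (1 - p) / n)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

# Accuracy and test-set size reported above.
se = accuracy_standard_error(0.6111, 180)
print(round(se, 3))
```

A larger held-out set, or cross-validation with the already imported `cross_val_score`, would give a tighter estimate.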


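## Testing the Model with a Summary

The vectorizer was fitted on the `Title_Author` column, so only the words of a summary that also occur in that vocabulary contribute to the prediction.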

In [47]:
# make a prediction for a single piece of text
text = ['Old Major the old boar on the Manor Farm call']
text[0] = text[0].lower()
# use transform (not fit_transform) so the fitted vocabulary is reused
s = vectorizer.transform(text)
print(s.shape)
d = clf.predict(s)

(1, 4134)


In [46]:
Le.inverse_transform(d)[0]


"Children's literature"