# Machine Learning for Page Sage

This notebook should serve as a tutorial and explanatory process for the Page Sage recommendation algorithm.

First, let's create some preprocessors for the data since we know what format to expect them.

----

In [197]:
import re
import numpy as np

def text_preprocessor(text):
    text = re.sub('<[^>]*>', ' ', text)
    text_split = text.split('\\')
    return ''.join(text_split)

The text preprocessor removes the html code and backslashes from the inputted text.  The description tends to have html information.

In [198]:
def category_preprocessor(categories):
    if type(categories) != type(list()):
        return categories
    cats = []
    for category in categories:
        cat_split = category.split('/')
        for cat in cat_split:
            if cat.strip(' ') not in cats:
                cats.append(cat.strip(' '))

    return cats

The category preprocessor removes the '/' character to altogether remove the sub category listings.

-----

Now that the preprocessors are done, we can write the method to do our `GET` requests to the Google Books Search API.

In [199]:
import requests
import os

def pull_books(book_list):
    book_data = []
    search_key = str(os.environ.get('SEARCH_KEY'))
    baseURL = 'https://www.googleapis.com/books/v1/volumes/'
    endURL = '?key=' + search_key
    
    headers = {'Accept': 'application/json'}
    
    for volume in book_list:
        url = baseURL + volume['id'] + endURL
        book_info = requests.get(url, params=headers).json()
        new_book = {}
        
        new_book['rating'] = volume['rating']
        
        try:
            page_count = int(book_info['volumeInfo']['pageCount'])
        except (KeyError):
            page_count = 100 

        try:
            categories = category_preprocessor(book_info['volumeInfo']['categories'])
        except (KeyError):
            categories = category_preprocessor(['Fiction'])

        try:
            average_rating = float(book_info['volumeInfo']['averageRating'])
        except (KeyError):
            average_rating = float(3.5)

        try:
            ratings_count = book_info['volumeInfo']['ratingsCount']
        except (KeyError):
            ratings_count = float(0)

        try:
            maturity_rating = book_info['volumeInfo']['maturityRating']
        except (KeyError):
            maturity_rating = 'NOT_MATURE'

        try:
            description = text_preprocessor(book_info['volumeInfo']['description'])
        except (KeyError):
            cats = ""
            for cat in categories:
                cats += cat + " " 
            description = text_preprocessor(book_info['volumeInfo']['title'] + " " + cats)

        book_data.append({
            'rating'         : volume['rating'],
            'page_count'     : page_count,
            'categories'     : categories,
            'average_rating' : average_rating,
            'ratings_count'  : ratings_count,
            'maturity_rating': maturity_rating,
            'description'    : description
        })  
    
    return book_data

The above method has a bunch of try/except statements.  This is because the Google Books Search API sometimes does not return information in its `volume` requests (even though the information exists in the general `list` requests).  Some defaults were picked to fill in some of the data, with average (or most common) values being chosen for each default.

-----

Now, we read in a list of books from the following file:

`emily_books_2_tier.txt`

In [200]:
def read_book_list(filename):
    volumes = []
    with open(filename, 'r') as input_file:
        for line in input_file:
            volume = line.split(',')
            volumes.append({'id': volume[0], 'rating': int(volume[1].strip())})
    return volumes

book_list = read_book_list('emily_books_2_tier.txt')

print('%i books were read in' % len(book_list))

61 books were read in


------

Next we make the API call.

In [201]:
book_data = pull_books(book_list)

And let's take a look at what the data looks like now by viewing the first entry.

In [341]:
for book_info in book_data[0]:
    print('%s\t:\t%s' % (book_info, book_data[0][book_info]))

rating	:	1
page_count	:	352
categories	:	['Fiction', 'Thrillers', 'Suspense', 'Fantasy', 'Contemporary', 'Psychological']
average_rating	:	4.0
ratings_count	:	34
maturity_rating	:	NOT_MATURE
description	:	 Acclaimed author Graham Joyce's mesmerizing new novel centers around the disappearance of a young girl from a small town in the heart of England. Her sudden return twenty years later, and the mind-bending tale of where she's been, will challenge our very perception of truth.    For twenty years after Tara Martin disappeared from her small English town, her parents and her brother, Peter, have lived in denial of the grim fact that she was gone for good. And then suddenly, on Christmas Day, the doorbell rings at her parents' home and there, disheveled and slightly peculiar looking, Tara stands. It's a miracle, but alarm bells are ringing for Peter. Tara's story just does not add up. And, incredibly, she barely looks a day older than when she vanished.    Award-winning author Graham Joy

----

Let's also get a read on the review balance.

In [342]:
positive = 0
negative = 0

for book_info in book_data:
    if book_info['rating'] == 1:
        positive += 1
    else:
        negative += 1

review_share = lambda x: 100*x/len(book_data)
    
print('There are %i positive reviews in this dataset' % (positive))
print('There are %i negative reviews in this dataset' % (negative))
print('(or)')
print('There is a %i/%i balance between positive and negative reviews' % (review_share(positive), \
                                                                          review_share(negative)))

There are 32 positive reviews in this dataset
There are 29 negative reviews in this dataset
(or)
There is a 52/47 balance between positive and negative reviews


---
That's a nice, pretty even balance.   Let's move on to training.

To train this dataset, we will have to do a fair amount one-hot encoding, as well as creating a bag of words for the descriptions.

Since our data is formatted in a somewhat strange way, we'll have to do some one-hot encoding by hand.

Let's first make the data a bit more  processable, though.

In [343]:
def make_processable(book_data):
    books = []
    for book_info in book_data:
        book = []
        for key in book_info:
            book.append(book_info[key])
        books.append(book)
    return books

books = make_processable(book_data)

print('The new book list is %i elements long.\n' % (len(books)))

for book_info in books[0]:
    print(book_info)

The new book list is 61 elements long.

1
352
['Fiction', 'Thrillers', 'Suspense', 'Fantasy', 'Contemporary', 'Psychological']
4.0
34
NOT_MATURE
 Acclaimed author Graham Joyce's mesmerizing new novel centers around the disappearance of a young girl from a small town in the heart of England. Her sudden return twenty years later, and the mind-bending tale of where she's been, will challenge our very perception of truth.    For twenty years after Tara Martin disappeared from her small English town, her parents and her brother, Peter, have lived in denial of the grim fact that she was gone for good. And then suddenly, on Christmas Day, the doorbell rings at her parents' home and there, disheveled and slightly peculiar looking, Tara stands. It's a miracle, but alarm bells are ringing for Peter. Tara's story just does not add up. And, incredibly, she barely looks a day older than when she vanished.    Award-winning author Graham Joyce is a master of exploring new realms of understanding that

That makes it a little bit easier to see what needs to be preprocessed.

The first problem is processing the 3 item (categories).  Let's work on one-hot encoding that.

First, we'll gather all the different possibilities, then we'll transform them into one-hots.

In [344]:
def gather_categories(books):
    categories = []
    for book in books:
        for category in book[2]:
            if category.lower() not in categories:
                categories.append(category.lower())
    return categories

user_categories = gather_categories(books)

print('There are %i categories for this user.\n' % (len(user_categories)))
print('The categories are:\n', user_categories)

There are 79 categories for this user.

The categories are:
 ['fiction', 'thrillers', 'suspense', 'fantasy', 'contemporary', 'psychological', 'young adult fiction', 'science fiction', 'space opera', 'romance', 'general', 'occult & supernatural', 'historical', 'paranormal', 'fairy tales, folk tales, legends & mythology', 'action & adventure', 'juvenile fiction', 'wizards & witches', 'school & education', 'boarding school & prep school', 'fantasy & magic', 'history', 'united states', 'nature', 'animals', 'horses', 'horror', 'superheroes', 'gaslamp', 'literary', 'family life', 'coming of age', 'epic', 'dark fantasy', 'classics', 'media tie-in', 'mystery & detective', 'humorous', 'absurdist', 'dragons & mythical creatures', 'romantic comedy', 'women', 'frankenstein (fictitious character)', "frankenstein's monster (fictitious character)", 'drama', 'frankenstein, victor (fictitious character)', 'horror plays', 'monsters', 'scientists', 'american', 'european', 'english, irish, scottish, welsh

----

Since there are 81 categories, we need an 81 bit binary string.

In [345]:
def generate_category_one_hot(categories):
    offset = 2
    cat_dict = dict()
    for category in categories:
        cat_dict[category] = offset
        offset += 1
    return cat_dict
    
categories = generate_category_one_hot(user_categories)

print('Positions for each category:')
for category in categories:
    print('%s  :  %s' %  (category, categories[category]))

Positions for each category:
fiction  :  2
thrillers  :  3
suspense  :  4
fantasy  :  5
contemporary  :  6
psychological  :  7
young adult fiction  :  8
science fiction  :  9
space opera  :  10
romance  :  11
general  :  12
occult & supernatural  :  13
historical  :  14
paranormal  :  15
fairy tales, folk tales, legends & mythology  :  16
action & adventure  :  17
juvenile fiction  :  18
wizards & witches  :  19
school & education  :  20
boarding school & prep school  :  21
fantasy & magic  :  22
history  :  23
united states  :  24
nature  :  25
animals  :  26
horses  :  27
horror  :  28
superheroes  :  29
gaslamp  :  30
literary  :  31
family life  :  32
coming of age  :  33
epic  :  34
dark fantasy  :  35
classics  :  36
media tie-in  :  37
mystery & detective  :  38
humorous  :  39
absurdist  :  40
dragons & mythical creatures  :  41
romantic comedy  :  42
women  :  43
frankenstein (fictitious character)  :  44
frankenstein's monster (fictitious character)  :  45
drama  :  46
franke

----

Each category has been given a numerical value that acts as the index to where it should be a 1.

Now, to reformat the data.

In [346]:
def reformat_categories(books_data):
    reformatted_books_data = []
    for book in books_data:
        new_book = []
        for info in book:
            if type(info) == type(list()):
                for i in range(len(categories)):
                    new_book.append(0)
            else:
                new_book.append(info)
        reformatted_books_data.append(new_book)
    return reformatted_books_data
        
new_books  = reformat_categories(books)

print('The length of the reformatted books is %i' % (len(new_books)))
print('The length of a new book in the categories is %i' % (len(new_books[0])))

The length of the reformatted books is 61
The length of a new book in the categories is 85


-----

Now let's one hot encode categories.

In [347]:
from copy import deepcopy

def one_hot_categories(old_books, new_books, categories):
    new_books = deepcopy(new_books)
    for index, old_book in enumerate(old_books):
        old_categories = old_book[2]
        
        for category in old_categories:
            new_books[index][categories[category.lower()]] = 1
    return new_books

new_books = one_hot_categories(books, new_books, categories)

print('New Book format:\n', new_books[0])

New Book format:
 [1, 352, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4.0, 34, 'NOT_MATURE', " Acclaimed author Graham Joyce's mesmerizing new novel centers around the disappearance of a young girl from a small town in the heart of England. Her sudden return twenty years later, and the mind-bending tale of where she's been, will challenge our very perception of truth.    For twenty years after Tara Martin disappeared from her small English town, her parents and her brother, Peter, have lived in denial of the grim fact that she was gone for good. And then suddenly, on Christmas Day, the doorbell rings at her parents' home and there, disheveled and slightly peculiar looking, Tara stands. It's a miracle, but alarm bells are ringing for Peter. Tara's story just does not add up. And, incredibly, she 

----

That adds a lot of dimensions, but it sure looks a lot better for processing!

Let's now do the same process for the maturity ratings.

In [348]:
def gather_maturities(books):
    maturities = []
    for book in books:
        if book[83].lower() not in maturities:
            maturities.append(book[83].lower())
    return maturities

maturity_list = gather_maturities(new_books)
print(maturity_list)

['not_mature']


----

Luckily, we only have one maturity level, but let's still encode it for the classifier

In [349]:
def generate_maturity_one_hot(maturities):
    offset = 83
    mat_dict = dict()
    for maturity in maturities:
        mat_dict[maturity] = offset
        offset += 1
    return mat_dict

maturities = generate_maturity_one_hot(maturity_list)

print('Positions for each maturity:\n', maturities)

Positions for each maturity:
 {'not_mature': 83}


In [350]:
def reformat_maturities(books_data):
    reformatted_books_data = []
    for book in books_data:
        new_book = []
        for index, info in enumerate(book):
            if index == 83:
                for i in range(len(maturities)):
                    new_book.append(0)
            else:
                new_book.append(info)
        reformatted_books_data.append(new_book)
    return reformatted_books_data

new_books = reformat_maturities(new_books)
print('The length of the reformatted books is %i' % (len(new_books)))
print('The length of a new book with the reformat is %i' % (len(new_books[0])))

The length of the reformatted books is 61
The length of a new book with the reformat is 85


----

Now let's add the 1s in the appropriate spots.

In [351]:
def one_hot_maturities(old_books, new_books, maturities):
    new_books = deepcopy(new_books)
    for index, old_book in enumerate(old_books):
        new_books[index][maturities[old_book[5].lower()]] = 1
    return new_books

new_books = one_hot_maturities(books, new_books, maturities)

print('The book data format is:\n', new_books[0])

The book data format is:
 [1, 352, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4.0, 34, 1, " Acclaimed author Graham Joyce's mesmerizing new novel centers around the disappearance of a young girl from a small town in the heart of England. Her sudden return twenty years later, and the mind-bending tale of where she's been, will challenge our very perception of truth.    For twenty years after Tara Martin disappeared from her small English town, her parents and her brother, Peter, have lived in denial of the grim fact that she was gone for good. And then suddenly, on Christmas Day, the doorbell rings at her parents' home and there, disheveled and slightly peculiar looking, Tara stands. It's a miracle, but alarm bells are ringing for Peter. Tara's story just does not add up. And, incredibly, she bar

---

Perfect!  Now we need to preprocess the last string.

For this we will do a bag of words for all of the data in the dataset.

In [352]:
import nltk
from nltk.stem.porter import PorterStemmer

success = nltk.download('stopwords')
success = nltk.download('names')

from nltk.corpus import stopwords
from nltk.corpus import names

stop = stopwords.words('english')
name = names.words()
name = [n.lower() for n in name]

def collect_and_trim_punctuation(books):
    sentences = ""
    for book in books:
        text = book[len(book)-1]
        text = re.sub(',', ' ', text)
        text = re.sub('\\xa0', ' ', text)
        text = re.sub('★', ' ', text)
        text = re.sub('-', ' ', text)
        text = re.sub('[0-9]', ' ', text)
        text = re.sub('"', ' ', text)
        text = re.sub("'", ' ', text)
        sentences += text.lower() + " "
    return sentences

def porter_tokenizer(text):
    porter = [PorterStemmer().stem(word) for word in text.split()]
    new_porter = set()
    for p in porter:
        if '(' in p or ')' in p:
            continue
        if '-' in p or '#' in p:
            continue
        if '[' in p or ']' in p:
            continue
        if '—' in p or '$' in p:
            continue
        if '"' in p or "'" in p:
            continue
        if '…' in p or '...' in p:
            continue
        if '•' in p or '!' in p:
            continue
        if '&' in p or ';' in p:
            continue
        if '–' in p or '“a' in p:
            continue
        if '&' in p or '“' in p or '”' in p:
            continue
        if ':' in p or '’' in p or '?' in p:
            continue
        if '/' in p:
            p = p.split('/')
            for word in p:
                new_porter.add(word)
            continue
        if '.' in p:
            p = p.strip('.')
            p = ''.join(p.split('.'))
        new_porter.add(p)
    return list(new_porter)


full_text = collect_and_trim_punctuation(new_books)

tokenized_text = [PorterStemmer().stem(word) for word in full_text.split()]
print('There are %i words in the Porter Stemmer tokenized text w/o other preprocessing' % (len(tokenized_text)))

tokenized_text = porter_tokenizer(full_text)
print('There are %i words in the Porter Stemmer tokenized text w/ other preprocessing' % (len(tokenized_text)))

tokenized_text = [w for w in porter_tokenizer(full_text) if w not in stop]
print('There are %i words in the Porter Stemmer tokenized text after removing stop words' % len(tokenized_text))

tokenized_text = [w for w in porter_tokenizer(full_text) if (w not in stop) and (w not in name)]
del tokenized_text[0] # Removes empty string from list
print('There are %i words in the Porter Stemmer tokenized text after removing names' % len(tokenized_text))

[nltk_data] Downloading package stopwords to /Users/peter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package names to /Users/peter/nltk_data...
[nltk_data]   Package names is already up-to-date!


There are 10011 words in the Porter Stemmer tokenized text w/o other preprocessing
There are 2466 words in the Porter Stemmer tokenized text w/ other preprocessing
There are 2357 words in the Porter Stemmer tokenized text after removing stop words
There are 2183 words in the Porter Stemmer tokenized text after removing names


----

We can see, then, that removing some of the text greatly reduce the number of word vectors we will need to create.


Let's now try one-hot encoding these.

In [353]:
def generate_description_one_hot(descriptions):
    offset = 84
    desc_dict = dict()
    for description in descriptions:
        desc_dict[description] = offset
        offset += 1
    return desc_dict

def reformat_descriptions(books_data):
    reformatted_books_data = []
    for book in books_data:
        new_book = []
        for index, info in enumerate(book):
            if index == 84:
                for i in range(len(descriptions)):
                    new_book.append(0)
            else:
                new_book.append(info)
        reformatted_books_data.append(new_book)
    return reformatted_books_data

def one_hot_descriptions(old_books, new_books, descriptions):
    new_books = deepcopy(new_books)
    for index, old_book in enumerate(old_books):
        for description in descriptions:
            if description in old_book[6]:
                new_books[index][descriptions[description]] = 1
    return new_books

descriptions = generate_description_one_hot(tokenized_text)
temp_books = reformat_descriptions(new_books)
final_books = one_hot_descriptions(books, temp_books, descriptions)

print('Length of a copy of the newly formatted books :', len(final_books[0]))

Length of a copy of the newly formatted books : 2267


----

Perfect!  The above with our data should be 2267.  Now let's start training.

In [405]:
import numpy as np

def get_labels(books):
    books = deepcopy(books)
    new_data = []
    labels = []
    for book in books:
        item = book[0]
        labels.append(float(item))
        book.pop(0)
        new_data.append(book)
    return (np.asarray(new_data), np.asarray(labels))

X_train, y_train = get_labels(final_books)

print('Number of labels :', len(y_train))
print('Number of samples :', len(X_train))

Number of labels : 61
Number of samples : 61


---

Perfect again!  Let's try training!

In [414]:
import tensorflow as tf
import tensorflow.keras as keras

np.random.seed(123)
tf.set_random_seed(123)


model = keras.models.Sequential()

model.add(keras.layers.Dense(units=100, input_dim=X_train.shape[1],
                             kernel_initializer='glorot_uniform',
                             bias_initializer='zeros',
                             activation='selu'))

model.add(keras.layers.Dense(units=100, input_dim=100,
                             kernel_initializer='glorot_uniform',
                             bias_initializer='zeros',
                             activation='selu'))

model.add(keras.layers.Dense(units=1, input_dim=100, 
                             kernel_initializer='glorot_uniform',
                             bias_initializer='zeros',
                             activation='softmax'))

# Using RMS Prop for better base performance and efficient activation; 
# need to play with decay rate
sgd_optimizer = keras.optimizers.SGD(lr=0.0001, decay=1e-7, momentum=.9)
adadelta_optimizer = keras.optimizers.Adadelta()
rms_prop_optimizer = keras.optimizers.RMSprop()
nadam_optimizer = keras.optimizers.Nadam()

model.compile(optimizer=sgd_optimizer, loss='binary_crossentropy')

# Train with fit method
history = model.fit(X_train, y_train,
                    batch_size=3, epochs=70, verbose=1,
                    validation_split=0.1)


# Predict class labels (return class labels as integers)
y_train_pred = model.predict_classes(X_train, verbose=0)
correct_preds = np.sum(y_train == y_train_pred, axis=1)

print(y_train_pred)
print(train_acc)
print()
print('First 3 predictions: ', y_train_pred[:3])
print()
print('Training accuracy: %.2f%%' % (train_acc * 100))
print()

Train on 54 samples, validate on 7 samples
Epoch 1/70
Epoch 2/70
Epoch 3/70
Epoch 4/70
Epoch 5/70
Epoch 6/70
Epoch 7/70
Epoch 8/70
Epoch 9/70
Epoch 10/70
Epoch 11/70
Epoch 12/70
Epoch 13/70
Epoch 14/70
Epoch 15/70
Epoch 16/70
Epoch 17/70
Epoch 18/70
Epoch 19/70
Epoch 20/70
Epoch 21/70
Epoch 22/70
Epoch 23/70
Epoch 24/70
Epoch 25/70
Epoch 26/70
Epoch 27/70
Epoch 28/70
Epoch 29/70
Epoch 30/70
Epoch 31/70
Epoch 32/70
Epoch 33/70
Epoch 34/70
Epoch 35/70
Epoch 36/70
Epoch 37/70
Epoch 38/70
Epoch 39/70
Epoch 40/70
Epoch 41/70
Epoch 42/70
Epoch 43/70
Epoch 44/70
Epoch 45/70
Epoch 46/70
Epoch 47/70
Epoch 48/70
Epoch 49/70
Epoch 50/70
Epoch 51/70
Epoch 52/70
Epoch 53/70
Epoch 54/70
Epoch 55/70
Epoch 56/70
Epoch 57/70
Epoch 58/70
Epoch 59/70
Epoch 60/70
Epoch 61/70
Epoch 62/70
Epoch 63/70
Epoch 64/70
Epoch 65/70
Epoch 66/70
Epoch 67/70
Epoch 68/70
Epoch 69/70
Epoch 70/70
[[1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 

TypeError: only size-1 arrays can be converted to Python scalars