# Content-Based Model
This notebook documents the process for serializing the components used by the recommender class: spaCy's `nlp` pipeline, the document-term matrix, and the scikit-learn TF-IDF model that vectorizes text within the class.

In [0]:
import os
import re
import pickle

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
import spacy

!python -m spacy download en_core_web_sm

#### Data Source
Our database contains a table `gb_data` with over 100,000 descriptions and text snippets to compare. The TF-IDF model in this notebook is trained only on rows where `description` is not null.

In [0]:
query = '''
SELECT *
FROM gb_data
'''

In [0]:
df = pd.read_sql(sql=query, con=os.environ["DATABASE_URL"])

In [0]:
df = df.dropna(axis='index', subset=['description'])

In [0]:
def cleaner(text):
    text = text.replace('\n', ' ')
    text = re.sub(r'\W+\s', ' ', text)
    # Keep only letters, digits and spaces (a second ^ inside the class would be a literal caret)
    text = re.sub(r'[^a-zA-Z0-9 ]', '', text)
    return text
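
As a quick check of what this cleaning does, here is a minimal standalone sketch. It re-declares the same three steps under the hypothetical name `clean_text`, assumes the final pattern is meant to keep only letters, digits, and spaces, and uses a made-up sample string.

```python
import re

def clean_text(text):
    # Hypothetical stand-in for the notebook's cleaner, same three steps
    text = text.replace('\n', ' ')               # newlines become spaces
    text = re.sub(r'\W+\s', ' ', text)           # punctuation runs before whitespace collapse to one space
    text = re.sub(r'[^a-zA-Z0-9 ]', '', text)    # drop anything that is not a letter, digit or space
    return text

sample = "A tale\nof two-cities: love & war!"
cleaned = clean_text(sample)
print(cleaned)  # A tale of twocities love war
```

Note that the hyphen is removed without inserting a space, so hyphenated words are fused into a single token rather than split.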

In [0]:
df['description'] = df['description'].apply(cleaner)

In [0]:
nlp = spacy.load('en_core_web_sm')

In [0]:
STOP_WORDS = ["new", "book", "author", "story", "life", "work", "best", 
              "edition", "readers", "include", "provide", "information"]
STOP_WORDS = nlp.Defaults.stop_words.union(STOP_WORDS)

In [0]:
def tokenize(text):
    '''
    Input: String
    Output: list of tokens
    '''
    doc = nlp(text)

    tokens = []
    
    for token in doc:
        if (token.text.lower() not in STOP_WORDS
                and not token.is_punct
                and token.pos_ != 'PRON'
                and token.is_alpha):
            tokens.append(token.text.lower())
            
    return tokens

In [0]:
tfidf = TfidfVectorizer(
    analyzer='word',
    ngram_range=(1, 2),
    min_df=15,
    max_df=0.85,
    tokenizer=tokenize)
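
To illustrate what `min_df` and `max_df` do, here is a simplified pure-Python sketch of document-frequency filtering. In scikit-learn an integer `min_df` is an absolute document count and a float `max_df` is a proportion of documents. The function name `filter_vocabulary` and the toy documents are hypothetical, and this is not scikit-learn's actual implementation.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, min_df=2, max_df=0.85):
    """Keep terms that appear in at least min_df documents and
    in no more than max_df (a proportion) of all documents."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    max_count = max_df * n_docs
    return sorted(t for t, c in doc_freq.items() if min_df <= c <= max_count)

docs = [
    ["space", "mars", "novel"],
    ["space", "robot", "novel"],
    ["space", "mars", "novel"],
    ["space", "garden"],
]
kept = filter_vocabulary(docs, min_df=2, max_df=0.85)
print(kept)  # ['mars', 'novel']
```

Here "space" is dropped because it appears in every document, while "robot" and "garden" are dropped because they appear only once. With `min_df=15` the same idea removes rare terms from the real vocabulary.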

In [0]:
dtm = tfidf.fit_transform(df['description'])

The document-term matrix consists of 116k+ documents (rows) and over 90k unique unigrams and bigrams (columns).

In [0]:
dtm.shape

(116626, 96208)
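
Each column of this matrix corresponds to one unigram or bigram. As a rough illustration of what `ngram_range=(1, 2)` yields for a single tokenized document, here is a simplified sketch. The helper `word_ngrams` is hypothetical and only mimics the idea, not scikit-learn's internals.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all contiguous n-grams for each n in the inclusive range."""
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = ["virtual", "world", "puzzles"]
grams = word_ngrams(tokens)
print(grams)
# ['virtual', 'world', 'puzzles', 'virtual world', 'world puzzles']
```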

In [0]:
nn = NearestNeighbors(metric='cosine')
nn.fit(dtm)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [0]:
ready_player_one = ['Immersing himself in a mid-21st-century '
                    'technological virtual utopia to escape '
                    'an ugly real world of famine, poverty '
                    'and disease, Wade Watts joins an '
                    'increasingly violent effort to solve a '
                    "series of puzzles by the virtual world's "
                    'super-wealthy creator, who has promised '
                    'that the winner will be his heir. (This '
                    'book was previously listed in Forecast.)']

the_martian = ["Nominated as one of America’s best-loved novels by PBS’s "
               "The Great American Read Six days ago, astronaut Mark Watney "
               "became one of the first people to walk on Mars. Now, he's sure "
               "he'll be the first person to die there. After a dust storm "
               "nearly kills him and forces his crew to evacuate while thinking "
               "him dead, Mark finds himself stranded and completely alone with "
               "no way to even signal Earth that he’s alive—and even if he "
               "could get word out, his supplies would be gone long before a "
               "rescue could arrive. Chances are, though, he won't have time to "
               "starve to death. The damaged machinery, unforgiving environment, "
               "or plain-old human error are much more likely to kill him first. "
               "But Mark isn't ready to give up yet. Drawing on his ingenuity, "
               "his engineering skills—and a relentless, dogged refusal to "
               "quit—he steadfastly confronts one seemingly insurmountable "
               "obstacle after the next. Will his resourcefulness be enough to "
               "overcome the impossible odds against him?"]

In [0]:
rp_one = tfidf.transform(ready_player_one)
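# Query with the sparse row directly; newer scikit-learn releases reject the np.matrix that .todense() returns
neighbors = nn.kneighbors(rp_one, n_neighbors=10)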
neighbors

(array([[0.79823089, 0.84525823, 0.85666574, 0.85904816, 0.86928128,
         0.87437104, 0.87959326, 0.87968499, 0.88407935, 0.88553481]]),
 array([[ 9327, 96848, 57228, 47717, 20798, 14723, 40165, 95049, 72260,
         81469]]))
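
The first array holds cosine distances and the second holds the matching row positions in the document-term matrix. With `metric='cosine'`, the distance is one minus the cosine similarity, so smaller values mean closer matches. A minimal sketch of that calculation on toy vectors (the helper `cosine_distance` is hypothetical, written here only for illustration):

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # approximately 0: same direction
no_overlap = cosine_distance([1.0, 0.0], [0.0, 1.0])      # 1.0: no shared terms
print(round(same_direction, 6), no_overlap)
```

Because TF-IDF vectors are non-negative, distances between documents fall between 0 and 1; values near 0.8 to 0.9, as above, indicate only modest term overlap.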

In [0]:
def get_recs_wspacy(df, description):
    """
    Prints the 20 nearest neighbors of a description, sorted by average rating

    df: the pandas DataFrame the document-term matrix was fit on
        (needs 'title' and 'averagerating' columns)
    description: textual summary of book content
    """
    description = [description]
    book = tfidf.transform(description)
    distances, neighbors = nn.kneighbors(book, n_neighbors=20)
    neighbors = neighbors.tolist()[0]

    sorted_ratings = df['averagerating'].iloc[neighbors].sort_values(ascending=False).index

    for position, index in enumerate(sorted_ratings):
        print("{0}. {1}".format(position+1, df['title'].loc[index]))
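
The function retrieves candidates by similarity and then reorders them by rating, so the printed order no longer reflects how close each book is to the query. A small pure-Python sketch of that two-step pattern, using hypothetical titles, ratings, and neighbor positions:

```python
# Hypothetical catalogue rows: (title, average rating)
books = [
    ("Book A", 3.9),
    ("Book B", 4.6),
    ("Book C", 4.1),
    ("Book D", 3.2),
]

# Row positions as a nearest-neighbor search might return them, closest first
neighbor_positions = [2, 0, 3]

def rank_by_rating(books, positions):
    """Select rows by position, then sort them by rating, highest first."""
    candidates = [books[i] for i in positions]
    return sorted(candidates, key=lambda row: row[1], reverse=True)

ranked = rank_by_rating(books, neighbor_positions)
for rank, (title, rating) in enumerate(ranked, start=1):
    print("{0}. {1} ({2})".format(rank, title, rating))
```

"Book B" has the highest rating but never appears, because only the retrieved neighbors are ranked.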

In [0]:
automate = "This is the second edition of the best selling Python book in "\
           "the world. Python Crash Course, 2nd Edition is a straightforward "\
           "introduction to the core of Python programming. Author Eric Matthes "\
           "dispenses with the sort of tedious, unnecessary information that "\
           "can get in the way of learning how to program, choosing instead to "\
           "provide a foundation in general programming concepts, Python "\
           "fundamentals, and problem solving. Three real world projects in the "\
           "second part of the book allow readers to apply their knowledge in "\
           "useful ways. Readers will learn how to create a simple video game, "\
           "use data visualization techniques to make graphs and charts, and "\
           "build and deploy an interactive web application. Python Crash "\
           "Course, 2nd Edition teaches beginners the essentials of Python "\
           "quickly so that they can build practical programs and develop "\
           "powerful programming techniques."

get_recs_wspacy(df, automate)

1. OpenGL Game Programming
2. JavaScript and JQuery
3. The Psychology of Computer Programming
4. Compilers
5. Earthsong
6. Monty Python's Flying Circus
7. Fundamentals of Python: First Programs
8. Elinor
9. The Fairly Incomplete & Rather Badly Illustrated Monty Python Song Book
10. Monty Python's Big Red Book
11. Never Trust a Calm Dog
12. Beginning Programming with C++ For Dummies
13. Koi's Python
14. The Basic Book
15. Animals That Show and Tell
16. The Brand New Monty Python Papperbok
17. Beginning iOS Programming For Dummies
18. An Introduction to Programming with Modula-2
19. The Best British Stand-Up and Comedy Routines
20. QBasic for Beginners


Serialize the components for use within the deployed application.

In [0]:
with open('nlp.pkl', 'wb') as f:
    pickle.dump(nlp, f)
with open('tfidf_model.pkl', 'wb') as f:
    pickle.dump(tfidf, f)
with open('dtm.pkl', 'wb') as f:
    pickle.dump(dtm, f)
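
On the application side, loading follows the same pattern in reverse. The sketch below round-trips a hypothetical stand-in object through a temporary file; the real components would be loaded from the `.pkl` files written above. One caveat: pickle stores functions by reference, so whatever loads `tfidf_model.pkl` will likely need a `tokenize` function importable under the same module path it had when the vectorizer was pickled.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted component; any picklable object works the same way
component = {"vocabulary": {"mars": 0, "novel": 1}, "ngram_range": (1, 2)}

path = os.path.join(tempfile.mkdtemp(), "component.pkl")

# Write, closing the file handle as soon as the dump finishes
with open(path, "wb") as f:
    pickle.dump(component, f)

# Read it back, as the deployed application would
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == component)  # True
```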