# Capstone Project: Tell Me a Story...#

Elyse Renouf   **|**   July 31, 2020   

Tell Me A Story (TMAS) is a content-based recommender for young children's books. It applies NLP to a Kaggle dataset of Goodreads.com book reviews. It was created so that parents can enter one of their child's favourite books and get recommendations for similar books, based entirely on key words in the book descriptions. 

**Please Note:** This is notebook 3 of the 3 used to build the final product. It covers the vectorization steps I took and the final recommender model that is TMAS. 

<hr>

### Data Vectorization & Building the Recommender Model ###

Because this is a content-based recommender, the key preparation here is ensuring all columns used in the recommender are in numerical format. So the emphasis was on text preparation. 

I built my tokenizer function to remove punctuation, lower-case the text, split each sentence into individual words, remove English stop words, drop words of two characters or fewer, and stem the remaining words to their root form. The length cut-off is an attempt to remove lingering HTML tags from the original web scraping (once the punctuation is stripped, `<br>` becomes `br`) without removing words related to animal sounds, which are key features in infant books (hee haw, baa baa, etc.). I then tested it on a simple sentence to ensure it worked as required before applying it within my model on the entire dataset. 

I used a helper function that looks up a book's TF-IDF row by the book's name, then instantiated and fit the vectorizer and transformed the data. Once all the book description text was transformed, I used a function that calculates cosine similarities to determine just HOW similar one book description is to the others in the dataset. 

Once all of these steps were complete, I was able to build a Streamlit app to test version one and also to build out version two in this notebook, which allows the user to filter books returned to them by the age group column I feature engineered in earlier steps. 

In [1]:
import numpy as np
import pandas as pd

# To calculate cosine similarity later
from sklearn.metrics.pairwise import cosine_similarity

from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
#read in the cleaned kidsbooks dataset
books = pd.read_csv('data/books.csv', index_col=0)

In [3]:
df_descriptions = books[['isbn', 'name', 'description', 'is_preschooler']]
df_descriptions.head(10)

Unnamed: 0,isbn,name,description,is_preschooler
0,688175740,russell's secret,"have you ever heard the words<br />""""you can s...",1
1,1582460892,"oye, hormiguita","the spanish translation of bestseller hey, lit...",1
2,1904442978,dear bunny,the cutest couple of star-crossed rabbits craf...,1
3,689840047,a revolutionary field trip: poems of colonial ...,the class is embarking on the field trip of a ...,1
4,1905236816,"home for a tiger, home for a bear",you'll learn about the habitats of these and m...,1
5,439598397,dog breath! the horrible trouble with hally to...,"hally tosis is a very good dog, but she has a ...",1
6,312367511,action jackson,one late spring morning the american artist ja...,1
7,590898264,cinderella bigfoot,head for the hills! the author and illustrator...,1
8,618064877,nursery crimes,jambo and marva emigrated from france to iowa ...,1
9,887764746,the legend of the panda,<i>a timeless tale about a beloved animal</i><...,0


In [4]:
#creating separate dfs for all infant books
infbooks = df_descriptions[df_descriptions['is_preschooler']==0]

#and exporting as CSV for safekeeping
infbooks.to_csv('infbooks.csv', header=True)
infbooks.shape

(2961, 4)

In [5]:
#creating separate dfs for all preschooler books
preschoolbooks = df_descriptions[df_descriptions['is_preschooler']==1]

#and exporting as CSV for safekeeping
preschoolbooks.to_csv('preschoolbooks.csv', header=True)
preschoolbooks.shape

(7429, 4)

In [6]:
#create tokenizer function to remove punctuation, split sentence, stem words, and remove stopwords
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords 

#remove english stop words like the, and, if, a, etc. 
ENGLISH_STOP_WORDS = stopwords.words('english')
stemmer = nltk.stem.PorterStemmer()

def my_tokenizer(sentence):
    for punctuation_mark in string.punctuation:
        # Remove punctuation and re-set to lower case
        sentence = sentence.replace(punctuation_mark,'').lower() #includes !"#$%&'()*+, -./:;<=>?@[\]^_`{|}~             
    # split sentence into words
    listofwords = sentence.split(' ')
    listofstemmed_words = []     
    # Remove stopwords, empty strings, and any tokens of two characters or fewer
    for word in listofwords:
        if (not word in ENGLISH_STOP_WORDS) and (word!='') and (len(word)>2):
            # Stem words
            stemmed_word = stemmer.stem(word)
            listofstemmed_words.append(stemmed_word)
    return listofstemmed_words

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/elyserenouf/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [7]:
# A sentence used for checking our process
sentence = "I was walking down the road and I saw a donkey <br> Hee Haw <br>!!"

In [8]:
# checking tokenizer on simple sentence from above
my_tokenizer(sentence)

['walk', 'road', 'saw', 'donkey', 'hee', 'haw']

In [9]:
# doing some word frequency discovery: using my tokenizer and just simply countvectorizing words
# because this isn't to determine/predict positive/negative reviews but simply to get word counts, I will do this on the entire dataset
from sklearn.feature_extraction.text import CountVectorizer

#instantiate the model
bagofwords = CountVectorizer(stop_words="english", min_df=5, tokenizer=my_tokenizer)

#fit the model
bagofwords.fit(df_descriptions['description'].fillna(''))

#transform the data
allwords = bagofwords.transform(df_descriptions['description'].fillna(''))
allwords



<10390x6451 sparse matrix of type '<class 'numpy.int64'>'
	with 285358 stored elements in Compressed Sparse Row format>

In [10]:
# Let's put the allwords sparse matrix information into a data frame
word_counts = np.array(np.sum(allwords, axis=0)).reshape((-1,))
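# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn 1.0+
words = np.array(bagofwords.get_feature_names_out())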
words_df = pd.DataFrame({"word":words, 
                         "count":word_counts})

#finding 25 most frequently used words in these book descriptions
words_df.sort_values(by="count", ascending=False).head(25)

Unnamed: 0,word,count
676,book,4755
5480,stori,2683
1021,children,2676
2868,illustr,2649
3379,littl,2534
3496,make,1964
3859,new,1876
6416,young,1859
1448,day,1852
4578,reader,1783


If I were to use CountVectorizer for my recommender function, matches between book descriptions would be driven by raw word counts, so the most frequently used words would dominate. This doesn't make sense for a recommender, especially because the words used most often in children's book descriptions are pretty generic (see the table above). 

For future iterations and better performance, I would investigate more closely, potentially break the words down by part of speech (noun, verb, adjective), and create a list of frequently appearing words to drop because they don't describe the story itself (book, reader, story, like, illustration, know, use, tale, etc.).
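The idea of dropping generic words could be sketched roughly as follows. This is not part of the current model: `GENERIC_WORDS` and `drop_generic` are hypothetical names, and the word list is simply the one from the paragraph above. The words are stemmed first so that they match the stemmed tokens the tokenizer produces.

```python
import nltk

stemmer = nltk.stem.PorterStemmer()

# Hypothetical list of generic words, copied from the examples in the text
GENERIC_WORDS = ["book", "reader", "story", "like", "illustration", "know", "use", "tale"]

# Stem the list so it can be compared against already-stemmed tokens
GENERIC_STEMS = {stemmer.stem(word) for word in GENERIC_WORDS}

def drop_generic(tokens):
    """Remove stemmed tokens that appear in the generic-word list."""
    return [token for token in tokens if token not in GENERIC_STEMS]

# Tokens as they might come out of a stemming tokenizer
print(drop_generic(["stori", "giraff", "fli", "moon", "book"]))
```

A filter like this could be applied to the tokenizer's return value, which would keep the stop-word handling in one place.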

As it stands, I have decided to use the TF-IDF vectorizer, which puts more weight on words that appear in fewer descriptions and are therefore likely more indicative of what a book is actually about. Many book descriptions include the same words (see above), but the key nouns, adjectives, and verbs specific to one book are the ones that really tell us what the story is about, and they are the ones that allow us to find similar stories:

e.g. "This story is about a boy and a giraffe who fly to the moon": once the stop words are removed, TF-IDF will likely put more emphasis on giraffe, moon, and fly, and less emphasis on story or boy, because those words are likely to appear in many descriptions. 
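To make the giraffe example concrete, here is a minimal sketch on a made-up three-sentence corpus. The sentences are invented for illustration and are not from the dataset, and the sketch uses scikit-learn's default tokenizer rather than `my_tokenizer` so that it runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, invented for illustration only
toy_corpus = [
    "this story is about a boy and a giraffe who fly to the moon",
    "this story is about a boy and his dog",
    "a story about a girl and her cat",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_scores = toy_vectorizer.fit_transform(toy_corpus)

# TF-IDF weights for the first sentence, keyed by word
first_row = toy_scores[0].toarray().ravel()
weights = {
    word: first_row[col]
    for word, col in toy_vectorizer.vocabulary_.items()
    if first_row[col] > 0
}

# Print the words from the highest weight to the lowest
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:8s} {weight:.3f}")
```

In this toy corpus "giraffe", "fly", and "moon" each appear in one sentence, "boy" in two, and "story" in all three, so the weights in the first sentence fall in that order.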

#### And Now To Vectorize, Apply My Tokenizer, and Build The Recommender: ####

In [11]:
#Helper function to look up a book's TF-IDF row by its name.
def get_book_by_name(name, tfidf_scores, keys):
    row_id = keys[name]
    row = tfidf_scores[row_id,:]
    return row

In [12]:
#importing TFIDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

#Instantiating the TFIDF Vectorizer
vectorizer = TfidfVectorizer(stop_words = "english", min_df=5, tokenizer=my_tokenizer)

index = 0
keys = {}

df_descriptions = books[['name','description', 'is_preschooler']]

#build a lookup from each book name to its row position in the TF-IDF matrix
for book in df_descriptions.itertuples() :
    key = book[1]
    keys[key] = index
    index += 1

#Fit the vectorizer to the data
vectorizer.fit(df_descriptions['description'].fillna(''))

#Transform the data
tfidf_scores = vectorizer.transform(df_descriptions['description'].fillna(''))

print(tfidf_scores.shape)
print(df_descriptions.shape)



(10390, 6451)
(10390, 3)


In [13]:
#creating a function to calculate how similar one book description is to the other book description

#import cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

#create content recommender function that also takes into account age group
def content_recommender(name, is_preschooler, tfidf_scores, bookdf=df_descriptions) :
    
    #if statement to filter books by is_preschooler column
    if is_preschooler==True:
        bookdf = bookdf[bookdf['is_preschooler']== 1]
    else: 
        bookdf = bookdf[bookdf['is_preschooler']== 0]
        
    #Store the results in this DF
    similar_books = pd.DataFrame(columns = ["name","similarity"] )
    
    #The book we are finding books similar to
    book_1 = get_book_by_name(name, tfidf_scores, keys)
    
    #Go through ALL the books
    for book in bookdf['name']:
                
        #Find the similarity of the two books
        book_2 = get_book_by_name(book,tfidf_scores,keys)
        similarity = cosine_similarity(book_1,book_2)
        similar_books.loc[len(similar_books)] = [book, similarity[0][0]]

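    #drop the query book by name rather than by position: if it is not in the
    #selected age group, slicing with [1:] would discard the best real match
    similar_books = similar_books[similar_books['name'] != name]
    return similar_books.sort_values(by=['similarity'],ascending=False)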
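The loop above calls `cosine_similarity` once per book. As a sketch of an alternative (not what this notebook or the app uses), `cosine_similarity` also accepts a whole matrix, so a single call can score every book at once. `recommend_fast`, its `top_n` parameter, and the toy data below are hypothetical, and the sketch assumes that row i of the dataframe corresponds to row i of the TF-IDF matrix.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend_fast(name, is_preschooler, tfidf_scores, bookdf, top_n=10):
    """Hypothetical vectorized variant: score every book in one call."""
    bookdf = bookdf.reset_index(drop=True)
    # Position of the query book (raises IndexError if the title is missing)
    query_pos = np.flatnonzero((bookdf["name"] == name).to_numpy())[0]
    # One row of similarities between the query and all books
    sims = cosine_similarity(tfidf_scores[query_pos], tfidf_scores).ravel()
    result = bookdf.assign(similarity=sims)
    # Keep only the requested age group and drop the query title itself
    mask = (result["is_preschooler"] == int(is_preschooler)) & (result["name"] != name)
    return (result.loc[mask, ["name", "similarity"]]
                  .sort_values("similarity", ascending=False)
                  .head(top_n))

# Toy data invented for illustration only
toy = pd.DataFrame({
    "name": ["moon trip", "space giraffe", "farm day", "city abc"],
    "description": ["a giraffe flies to the moon",
                    "the giraffe and the moon rocket",
                    "cows and pigs on the farm",
                    "letters of the alphabet in the city"],
    "is_preschooler": [1, 1, 1, 0],
})
toy_scores = TfidfVectorizer().fit_transform(toy["description"])
print(recommend_fast("moon trip", True, toy_scores, toy))
```

On the toy data, "space giraffe" ranks first because it shares the most words with the query, and "city abc" is excluded by the age filter.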

In [14]:
#the moment of truth
#type in a book (it must be in the dataset, otherwise the lookup raises a KeyError), select True for books for kids 3+ years old (30 pages+), otherwise False
similar_books = content_recommender("the lion inside", True, tfidf_scores)
#return a list of 10 similar books in descending similarity order
similar_books.head(10)

KeyError: 'the lion inside'
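The `KeyError` above means the title is not a key in the `keys` lookup for the data loaded in this run. A minimal sketch of a more forgiving lookup follows. `find_title` is a hypothetical helper, not part of the notebook. It assumes the stored titles are lower-case, as they appear in the tables above, and uses the standard library's `difflib` to suggest close matches.

```python
import difflib

def find_title(name, keys):
    """Return the matching key for `name`, or raise with close suggestions.

    Hypothetical helper: normalises case and whitespace, then falls back to
    difflib for near matches so that a typo produces a readable message.
    """
    cleaned = name.strip().lower()
    if cleaned in keys:
        return cleaned
    suggestions = difflib.get_close_matches(cleaned, list(keys), n=3, cutoff=0.6)
    raise KeyError(f"{name!r} not found; close matches: {suggestions}")

# Toy lookup invented for illustration (titles taken from the table above)
toy_keys = {"dear bunny": 2, "action jackson": 6, "nursery crimes": 8}

print(find_title("  Dear Bunny ", toy_keys))

try:
    find_title("dear buny", toy_keys)
except KeyError as err:
    print(err)
```

The returned key could then be passed to `content_recommender` in place of the raw user input.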

### So...what? Is this a good score? ###

In order to determine whether the scores my recommender function has given are actually "good", I decided to do a test based on my knowledge of some of the books that exist in the dataframe.

I was able to see that when two very different books were put in the recommender (very different in description, subject matter, and key words), the similarity score was ~4%. And when I entered in two very similar books (two books from the Clifford the Big Red Dog series), the similarity score was ~24%. 

So, for the purpose of this analysis, using those similarity scores as a benchmark, it looks like my recommender is working very well. 
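For reference, the scores quoted above are cosine similarities: the dot product of two vectors divided by the product of their lengths. Because TF-IDF weights are never negative, the score falls between 0 (no shared words) and 1 (same direction). The small vectors below are made up purely to show those cases, and `cosine` is a hypothetical helper rather than the scikit-learn function used in the notebook.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Same direction, different length: similarity is (approximately) 1
same_direction = cosine(np.array([1.0, 2.0, 0.0]), np.array([2.0, 4.0, 0.0]))

# No shared non-zero entries: similarity is 0
no_overlap = cosine(np.array([1.0, 0.0, 0.0]), np.array([0.0, 3.0, 0.0]))

# One shared entry out of two in each vector: similarity is about 0.5
partial = cosine(np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0]))

print(same_direction, no_overlap, partial)
```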

In [None]:
#looking up the TF-IDF row for a single book
get_book_by_name("the velveteen rabbit", tfidf_scores, keys)

In [None]:
#testing out similarities between two very different kids books 
book_1 = get_book_by_name('the velveteen rabbit', tfidf_scores, keys)
book_2 = get_book_by_name('a big city abc', tfidf_scores, keys)

print("Similarity:", cosine_similarity(book_1, book_2))

In [None]:
#testing out similarities between two very similar kids books 
book_1 = get_book_by_name('a charlie brown christmas', tfidf_scores, keys)
book_2 = get_book_by_name('a charlie brown valentine', tfidf_scores, keys)

print("Similarity:", cosine_similarity(book_1, book_2))

Once I had a working recommender, I built a Streamlit app to make it more interactive for presentation and demonstration purposes. 

The cells below are just some snippets of code that helped me form my final Streamlit app code, which is included in my project folder as TMAS-app.py. 

My initial app prototype (as it works now) is a simple recommender, not filtered by age. It includes a dropdown selection box where users can select what book they are reading with their kids now and then it will recommend a few titles they could try out based on similarity of some of the words in their descriptions. 

There is also an option to click on a selectbox at the bottom of the app page and see the entire book list dataframe for further exploration. Unfortunately the app is not live, so I have included screenshots of it in my presentation slide deck and demo video. 

Future iterations (a prototype can be found in the file TMAS-app-filtered.py) would add a checkbox that lets the user filter that initial dropdown list by the age of their child. For example, a user who selects 0-2 would only see books categorized as is_preschooler=0 in the selection list, and would only be recommended similar books from the full infant/toddler book list. 
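The age filter described above could be sketched along these lines. This is an assumption about how such a filter might look, not the contents of TMAS-app-filtered.py: `titles_for_age_group` is a hypothetical helper, the toy dataframe is invented, and the Streamlit calls are shown only as comments so that the snippet runs without Streamlit installed.

```python
import pandas as pd

def titles_for_age_group(books, is_preschooler):
    """Sorted, de-duplicated titles for one age group (hypothetical helper)."""
    flag = 1 if is_preschooler else 0
    subset = books[books["is_preschooler"] == flag]
    return sorted(subset["name"].dropna().unique())

# Toy frame invented for illustration; the real app would load the books data
toy_books = pd.DataFrame({
    "name": ["dear bunny", "the legend of the panda", "action jackson"],
    "is_preschooler": [1, 0, 1],
})

print(titles_for_age_group(toy_books, is_preschooler=False))

# In a Streamlit script the helper could feed the widgets, for example:
#   import streamlit as st
#   older = st.checkbox("My child is 3 or older")
#   choice = st.selectbox("Pick a book", titles_for_age_group(books, older))
```

Keeping the filtering in a plain function means it can be checked in a notebook before it is wired into the app.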

For additional complexity and more reliable/realistic results, it would be better to get access to the GoodReads.com API and source books with an existing age and book type category which would allow the user to filter and be matched with books even more relevant/similar to the books they are currently reading with their children. 

#### Testing a Few Things For The Streamlit App: ####

In [15]:
#for streamlit app for sorting the name list
df_descriptions['name'].sort_values()

6492               "a" is for airplane / "a" es para avion
8778                     "a" is for animals: an abc pop-up
5655                                   "bee my valentine!"
3933                          "let's get a pup!" said kate
7828                          "let's get a pup!" said kate
                               ...                        
5283                                   zuzu's wishing cake
15624                                 ¡me gusta mi sombra!
1344                                 ¡no me gusta mi mono!
1147     ¡sí, se puede! / yes, we can!: janitor strike ...
17568                                        ¿quién salta?
Name: name, Length: 10390, dtype: object
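The sorted list above shows at least one title twice (`"let's get a pup!" said kate`). Since the `keys` lookup maps each name to a single row, only one of the duplicate rows can be reached by name. A minimal sketch of how duplicates could be inspected and dropped before building a dropdown follows. The toy series is invented, and whether to keep the first or last copy is an open choice.

```python
import pandas as pd

# Toy series invented for illustration; one title appears twice
toy_names = pd.Series([
    '"let\'s get a pup!" said kate',
    "dear bunny",
    '"let\'s get a pup!" said kate',
])

# Every row whose title occurs more than once
dupes = toy_names[toy_names.duplicated(keep=False)]
print(dupes.value_counts())

# Keep the first occurrence of each title
deduped = toy_names.drop_duplicates(keep="first")
print(len(toy_names), "->", len(deduped))
```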

In [16]:
#for streamlit app for identifying the book name filter
books[books['is_preschooler']== 1]

#note: the html tag in the 0th row is present BEFORE applying the tokenizer 

Unnamed: 0,isbn,name,description,is_preschooler
0,0688175740,russell's secret,"have you ever heard the words<br />""""you can s...",1
1,1582460892,"oye, hormiguita","the spanish translation of bestseller hey, lit...",1
2,1904442978,dear bunny,the cutest couple of star-crossed rabbits craf...,1
3,0689840047,a revolutionary field trip: poems of colonial ...,the class is embarking on the field trip of a ...,1
4,1905236816,"home for a tiger, home for a bear",you'll learn about the habitats of these and m...,1
...,...,...,...,...
17957,0679885390,lady bug's ball,join lady bug and her gauzy-winged guests as t...,1
17959,1581058101,"uno, dos, tres: dime quien eres! (one, two, th...","children play ""one, two, three. who can it be ...",1
17965,0736424954,actual fairy size (disney fairies),the never fairies stand only five inches tall....,1
17974,0439545641,las tres preguntas,nikolai is a boy who believes that if he can f...,1


So this is TMAS as it stands today. 

Please see the final report and complete zip file for all project files and documentation, and don't hesitate to contact me if you have any questions about this project. Thank you so much for your time and attention.

**Happy reading!!**