# Content-Based Recommendation

Similar content is recommended using attributes of the content. Because the method uses attributes or tags of the content, such as book title, author, and rating, new books can be recommended immediately, before any user has rated them.

### Content-based filtering

Using a user's ratings of the books they have read, we can look through the metadata of their favourite books (e.g. title, genre, author, description, keywords) and find similar titles. The underlying assumption is that a user who enjoys one book is likely to enjoy a similar book as well.

Pros: Quick, easy to understand (= transparent to users), no need for other users' ratings (works even with a low number of users), and more reliable early on, before many ratings have been collected

Cons: Because it relies only on metadata, the model tends to keep recommending the same genres and topics, so recommendations lack diversity and novelty and are only weakly personalized

### How does it work?

1. Select the features used to measure the similarity between books

      1.1 Ideally, check whether there is research on which features are the best predictors -> NEEDS TO BE EXPLORED


2. Combine all the words in one column


3. Convert them to matrix format, with books as rows and words as columns (each book becomes a vector of word weights)

      3.1 Popularity question: decide whether it makes sense to downweight words that occur often (TfidfVectorizer vs. CountVectorizer)

      3.2 Each word is assigned a term frequency (TF, the number of times it appears in a book's text) and an inverse document frequency (IDF, how rare the word is across all books)
      
     
4. Calculate the similarity between the book vectors

      4.1 There are different ways to calculate similarity: it is possible to experiment with Pearson, Euclidean, Jaccard, and cosine

      4.2 Cosine similarity is used in this notebook: similarity is calculated as the cosine of the angle between the vectors of books A and B. The closer the vectors, the smaller the angle and the larger the cosine. It is often preferred when data is sparse
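Step 4.2 can be checked by hand on a toy example. The sketch below is illustrative only: the three count vectors are made up, and `cosine` is a hypothetical helper written for this example, not a function from the notebook.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors: a.b / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up word counts for three books over a four-word vocabulary.
book_a = np.array([2, 1, 0, 0])
book_b = np.array([4, 2, 0, 0])   # same direction as book_a, twice the length
book_c = np.array([0, 0, 3, 1])   # shares no words with book_a

sim_ab = cosine(book_a, book_b)   # vectors point the same way
sim_ac = cosine(book_a, book_c)   # vectors are orthogonal
print(sim_ab, sim_ac)
```

Because only the angle matters, a long and a short text that use the same words in the same proportions score as highly similar, while books with no words in common score zero.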

In [None]:
import pandas as pd
import numpy as np

In [44]:
import re
import string
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [45]:
vamsidhar_books_data = pd.read_csv('/Users/vamsidharreddy/CMPE-255-Final-Project/data/books_data.csv')

In [46]:
vamsidhar_books_data

Unnamed: 0,id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9.780439e+12,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9.780440e+12,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9.780316e+12,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9.780061e+12,Harper Lee,1960.0,To Kill a Mockingbird,...,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9.780743e+12,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,7130616,7130616,7392860,19,441019455,9.780441e+12,Ilona Andrews,2010.0,Bayou Moon,...,17204,18856,1180,105,575,3538,7860,6778,https://images.gr-assets.com/books/1307445460m...,https://images.gr-assets.com/books/1307445460s...
9996,9997,208324,208324,1084709,19,067973371X,9.780680e+12,Robert A. Caro,1990.0,Means of Ascent,...,12582,12952,395,303,551,1737,3389,6972,https://s.gr-assets.com/assets/nophoto/book/11...,https://s.gr-assets.com/assets/nophoto/book/50...
9997,9998,77431,77431,2393986,60,039330762X,9.780393e+12,Patrick O'Brian,1977.0,The Mauritius Command,...,9421,10733,374,11,111,1191,4240,5180,https://images.gr-assets.com/books/1455373531m...,https://images.gr-assets.com/books/1455373531s...
9998,9999,8565083,8565083,13433613,7,61711527,9.780062e+12,Peggy Orenstein,2011.0,Cinderella Ate My Daughter: Dispatches from th...,...,11279,11994,1988,275,1002,3765,4577,2375,https://images.gr-assets.com/books/1279214118m...,https://images.gr-assets.com/books/1279214118s...


In [47]:
content_data = vamsidhar_books_data[['original_title','authors','average_rating']]
content_data = content_data.astype(str)

In [48]:
content_data['content'] = content_data['original_title'] + ' ' + content_data['authors'] + ' ' + content_data['average_rating']

In [49]:
content_data = content_data.reset_index()
indices = pd.Series(content_data.index, index=content_data['original_title'])
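One caveat with a title-to-position lookup like `indices`: a label lookup returns a single position only when the title is unique. The sketch below uses made-up titles to show the behaviour and one possible way to keep the first occurrence; whether dropping duplicates is appropriate for this dataset is an assumption to verify.

```python
import pandas as pd

# Made-up titles; "Emma" appears twice on purpose.
titles = pd.Series(["The Hobbit", "Emma", "Emma"])
indices = pd.Series(titles.index, index=titles)

# A duplicated label returns a Series of positions, not a single integer.
ambiguous = indices["Emma"]

# Keep only the first position for each title so lookups return one value.
unique = indices[~indices.index.duplicated(keep="first")]
print(len(ambiguous), unique["Emma"])
```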

# Content-Based Recommendation by Author

In [50]:
# Remove English stop words while vectorizing
tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix by fitting and transforming the authors column
tfidf_matrix = tfidf.fit_transform(content_data['authors'])

# Output the shape of tfidf_matrix: (number of books, number of distinct terms)
tfidf_matrix.shape

(10000, 6175)

With TF-IDF encoding, a term (here, a token from the `authors` column) is weighted by its importance within the document: the more often the term appears, the larger its weight. The weight is also scaled inversely to the term's frequency across the whole dataset, which emphasizes terms that are rare overall but important to the specific content at hand. Words such as 'is', 'are', 'by' or 'a', which are likely to appear in every book's content but are not useful for recommendations, are weighted less heavily than words specific to the content we are recommending; with `stop_words='english'`, as above, scikit-learn's built-in English stop words are removed altogether.
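To make the inverse-document-frequency part concrete, here is a minimal sketch on a made-up three-document corpus. It assumes scikit-learn's default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`; the documents and the `idf` dictionary are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "magic" occurs in every document, the other words in one each.
docs = [
    "magic school wizard",
    "magic ring quest",
    "magic detective london",
]

tfidf = TfidfVectorizer()
tfidf.fit(docs)

# Map each term to its learned IDF weight.
idf = {term: float(tfidf.idf_[i]) for term, i in tfidf.vocabulary_.items()}
print(idf["magic"], idf["wizard"])
```

The term shared by all documents gets the smallest IDF, so it contributes least to the similarity between two books.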

# Compute the cosine similarity matrix

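We use cosine similarity. Because `TfidfVectorizer` L2-normalizes each row by default, the dot product computed by `linear_kernel` equals the cosine similarity and is cheaper to compute.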

In [51]:
cosine_sim_author = linear_kernel(tfidf_matrix, tfidf_matrix)
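The cell above calls `linear_kernel` rather than `cosine_similarity`. A small check on made-up author strings, assuming the default `norm='l2'` of `TfidfVectorizer`, suggests the two give the same matrix:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up author strings in the style of the `authors` column.
authors = [
    "J.R.R. Tolkien",
    "J.K. Rowling",
    "J.R.R. Tolkien, Christopher Tolkien",
]

# Rows are L2-normalized by default, so a plain dot product is already a cosine.
matrix = TfidfVectorizer().fit_transform(authors)
same = np.allclose(linear_kernel(matrix, matrix), cosine_similarity(matrix, matrix))
print(same)
```

If the vectorizer were created with `norm=None`, or if a `CountVectorizer` matrix were used instead, the two functions would generally differ and `cosine_similarity` would be the one to use.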

# Author-Wise Recommendation

In [52]:
def get_books_recommendations(title, cosine_sim=cosine_sim_author):
    idx = indices[title]

    # Get the pairwise similarity scores of all books with that book
    sim_score = list(enumerate(cosine_sim[idx]))

    # Sort the books based on the similarity scores
    sim_score = sorted(sim_score, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar books
    sim_score = sim_score[1:11]

    # Get the book indices
    book_indices = [i[0] for i in sim_score]

    # Return the top 10 most similar books
    return list(content_data['original_title'].iloc[book_indices])

In [53]:
def author_bookshows(books):
    for book in books:
        print(book)

In [54]:
vamsi_books1 = get_books_recommendations('The Hobbit', cosine_sim_author)
author_bookshows(vamsi_books1)

The Hobbit or There and Back Again
 The Fellowship of the Ring
The Two Towers
The Return of the King
The Lord of the Rings
The Hobbit and The Lord of the Rings
Unfinished Tales of Númenor and Middle-Earth
Nikola Tesla: Imagination and the Man That Invented the 20th Century
Entwined
The Children of Húrin


In [55]:
vamsi_books2 = get_books_recommendations('Shadow Kiss', cosine_sim_author)
author_bookshows(vamsi_books2)

Frostbite
Shadow Kiss
Spirit Bound
Blood Promise
Last Sacrifice 
Bloodlines
The Golden Lily
The Indigo Spell
The Fiery Heart
nan


In [56]:
vamsi_books3 = get_books_recommendations('Harry Potter and the Goblet of Fire', cosine_sim_author)
author_bookshows(vamsi_books3)

Harry Potter and the Order of the Phoenix
Harry Potter and the Chamber of Secrets
Harry Potter and the Goblet of Fire
Harry Potter and the Deathly Hallows
Harry Potter and the Half-Blood Prince
Harry Potter Boxed Set Books 1-4
nan
Harry Potter and the Prisoner of Azkaban
The Casual Vacancy
The Tales of Beedle the Bard


# Content-Based Filtering on Multiple Features

In [70]:
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(content_data['content'])

cosine_sim_content = cosine_similarity(count_matrix, count_matrix)

In [71]:
def get_book_recom(title, cosine_sim=cosine_sim_content):
    idx = indices[title]

    # Get the pairwise similarity scores of all books with that book
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the books based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar books
    sim_scores = sim_scores[1:11]

    # Get the book indices
    book_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar books
    return list(content_data['original_title'].iloc[book_indices])

In [72]:
def bookshow(books):
    for book in books:
        print(book)

In [73]:
vamsi_books4 = get_book_recom('The Hobbit', cosine_sim_content)
bookshow(vamsi_books4)

The Hobbit or There and Back Again
The Hobbit and The Lord of the Rings
No, David!
The History of the Hobbit, Part One: Mr. Baggins
David Gets in Trouble
nan
The Silmarillion
The Children of Húrin
Unfinished Tales of Númenor and Middle-Earth
The Two Towers


In [74]:
vamsi_books5 = get_book_recom('Shadow Kiss', cosine_sim_content)
bookshow(vamsi_books5)

Spirit Bound
Silver Shadows
Frostbite
nan
Last Sacrifice 
Bloodlines
nan
Storm Born
Succubus On Top
Blood Promise


In [75]:
vamsi_books6 = get_book_recom('The Two Towers', cosine_sim_content)
bookshow(vamsi_books6)

Towers of Midnight
The Silmarillion
The Children of Húrin
Unfinished Tales of Númenor and Middle-Earth
The Hobbit or There and Back Again
Reckless
 The Fellowship of the Ring
The Return of the King
The Lord of the Rings
Last Sacrifice 


In [76]:
vamsi_books7 = get_book_recom('Harry Potter and the Goblet of Fire', cosine_sim_content)
bookshow(vamsi_books7)

Harry Potter and the Prisoner of Azkaban
Harry Potter and the Philosopher's Stone
Harry Potter and the Order of the Phoenix
Harry Potter and the Chamber of Secrets
Harry Potter and the Deathly Hallows
Harry Potter and the Half-Blood Prince
Harry Potter Boxed Set Books 1-4
Harry Potter Collection (Harry Potter, #1-6)
nan
Complete Harry Potter Boxed Set


# Collaborative filtering model

Using model-based collaborative filtering: Singular Value Decomposition (SVD)

#### Things to do:

1. Transform the data into a pivot table -> format required for the collaborative filtering model
2. Create a user_index column to count the number of users -> change the naming convention of users by using a counter
3. Apply the SVD method to a large sparse matrix -> to predict ratings for all books that weren't rated by a user
4. Predict ratings for all books not rated by a user using SVD
5. Wrap it all into a function
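The cells that follow compute item-item correlations rather than SVD, so steps 3 and 4 are not implemented in this section. As a rough illustration of what they could look like, here is a minimal sketch on a made-up 4x4 rating matrix using `numpy.linalg.svd`; the data, the mean-centring choice, and `k = 1` are all assumptions made for the example.

```python
import numpy as np

# Made-up ratings: rows are users, columns are books, 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Centre each user's row on the mean of the ratings they actually gave.
rated = ratings > 0
user_mean = ratings.sum(axis=1) / rated.sum(axis=1)
centred = np.where(rated, ratings - user_mean[:, None], 0.0)

# Keep only the k largest singular values (k is a tuning choice).
k = 1
U, s, Vt = np.linalg.svd(centred, full_matrices=False)
predicted = (U[:, :k] * s[:k]) @ Vt[:k, :] + user_mean[:, None]

# Predicted rating of user 0 for book 2, which they have not rated.
print(round(float(predicted[0, 2]), 2))
```

For a matrix the size of the real ratings table, a sparse truncated solver would probably be a better fit than a dense decomposition.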

In [94]:
vamsidhar_books_data['original_title'].isnull().sum()

585

In [96]:
vamsidhar_books_data['book_id'].isnull().sum()

0

In [97]:
vamsidhar_ratings_data = pd.read_csv('/Users/vamsidharreddy/CMPE-255-Final-Project/data/books_ratings_data.csv')

In [98]:
vamsidhar_ratings_data

Unnamed: 0,book_id,user_id,rating
0,1,314,5
1,1,439,3
2,1,588,5
3,1,1169,4
4,1,1185,4
...,...,...,...
981751,10000,48386,5
981752,10000,49007,4
981753,10000,49383,5
981754,10000,50124,5


Check whether there are any NaN values

In [101]:
vamsidhar_ratings_data.isnull().sum()

book_id    0
user_id    0
rating     0
dtype: int64

In [102]:
vamsidhar_books_data.isnull().sum()

id                              0
book_id                         0
best_book_id                    0
work_id                         0
books_count                     0
isbn                          700
isbn13                        585
authors                         0
original_publication_year      21
original_title                585
title                           0
language_code                1084
average_rating                  0
ratings_count                   0
work_ratings_count              0
work_text_reviews_count         0
ratings_1                       0
ratings_2                       0
ratings_3                       0
ratings_4                       0
ratings_5                       0
image_url                       0
small_image_url                 0
dtype: int64

In [103]:
vamsidhar_books_data['title']

0                 The Hunger Games (The Hunger Games, #1)
1       Harry Potter and the Sorcerer's Stone (Harry P...
2                                 Twilight (Twilight, #1)
3                                   To Kill a Mockingbird
4                                        The Great Gatsby
                              ...                        
9995                            Bayou Moon (The Edge, #2)
9996    Means of Ascent (The Years of Lyndon Johnson, #2)
9997                                The Mauritius Command
9998    Cinderella Ate My Daughter: Dispatches from th...
9999                                  The First World War
Name: title, Length: 10000, dtype: object

In [105]:
vamsidhar_books_data = pd.DataFrame(vamsidhar_books_data, columns=['book_id', 'authors', 'title', 'average_rating'])

In [107]:
vamsidhar_books_data = vamsidhar_books_data.sort_values('book_id')

In [108]:
vamsidhar_books_data['book_id']

26             1
20             2
1              3
17             5
23             6
          ...   
7522    31538647
4593    31845516
9568    32075671
9579    32848471
8891    33288638
Name: book_id, Length: 10000, dtype: int64

Merge the two data files on `book_id`

In [110]:
vamsidhar_books_data = pd.merge(vamsidhar_books_data, vamsidhar_ratings_data, on='book_id')

In [111]:
vamsidhar_books_data

Unnamed: 0,book_id,authors,title,average_rating,user_id,rating
0,1,"J.K. Rowling, Mary GrandPré",Harry Potter and the Half-Blood Prince (Harry ...,4.54,314,5
1,1,"J.K. Rowling, Mary GrandPré",Harry Potter and the Half-Blood Prince (Harry ...,4.54,439,3
2,1,"J.K. Rowling, Mary GrandPré",Harry Potter and the Half-Blood Prince (Harry ...,4.54,588,5
3,1,"J.K. Rowling, Mary GrandPré",Harry Potter and the Half-Blood Prince (Harry ...,4.54,1169,4
4,1,"J.K. Rowling, Mary GrandPré",Harry Potter and the Half-Blood Prince (Harry ...,4.54,1185,4
...,...,...,...,...,...,...
79696,9998,"Kōbō Abe, E. Dale Saunders",The Woman in the Dunes,3.91,51295,5
79697,9998,"Kōbō Abe, E. Dale Saunders",The Woman in the Dunes,3.91,51559,5
79698,9998,"Kōbō Abe, E. Dale Saunders",The Woman in the Dunes,3.91,52087,4
79699,9998,"Kōbō Abe, E. Dale Saunders",The Woman in the Dunes,3.91,52330,4
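The merged table above shows row indices only up to 79,699, while the ratings table shows indices beyond 981,000, so the inner join on `book_id` appears to have dropped most ratings. One way to inspect this is the `indicator=True` option of `pd.merge`, sketched here on made-up frames:

```python
import pandas as pd

# Made-up frames: ratings reference book_ids 7 and 8, which books does not contain.
books = pd.DataFrame({"book_id": [1, 2, 3], "title": ["A", "B", "C"]})
ratings = pd.DataFrame({
    "book_id": [1, 1, 2, 7, 8],
    "user_id": [10, 11, 10, 12, 13],
    "rating": [5, 3, 4, 2, 5],
})

# A left join with indicator=True marks ratings that found no matching book.
check = pd.merge(ratings, books, on="book_id", how="left", indicator=True)
unmatched = int((check["_merge"] == "left_only").sum())
coverage = 1 - unmatched / len(ratings)
print(unmatched, coverage)
```

If many ratings turn out to be unmatched, it may be worth checking whether the ratings file's `book_id` refers to a different identifier column (for example the books table's `id`); that is an assumption to verify, not something the output confirms.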


Get the rating of every user for every book (unrated books are filled with 0)

In [113]:
each_book_rating = pd.pivot_table(vamsidhar_books_data, index='user_id', values='rating', columns='title', fill_value=0)

In [114]:
each_book_rating

title,'Salem's Lot,"'Tis (Frank McCourt, #2)",1421: The Year China Discovered America,1776,1984,A Bend in the River,A Bend in the Road,A Brief History of Time,A Briefer History of Time,A Case of Need,...,"Women in Love (Brangwen Family, #2)",World War Z: An Oral History of the Zombie War,"World Without End (The Kingsbridge Series, #2)",Wuthering Heights,"Xenocide (Ender's Saga, #3)",Year of Wonders,You Shall Know Our Velocity!,Zen and the Art of Motorcycle Maintenance: An Inquiry Into Values,Zodiac,number9dream
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53419,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.0,0
53420,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.0,0
53422,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.0,0
53423,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.0,0


Get the correlation score between each pair of books. The matrix is transposed so that correlations are computed between books (columns) rather than between users (rows).

In [115]:
book_corr = np.corrcoef(each_book_rating.T)

In [116]:
book_corr.shape

(812, 812)
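The effect of the transpose can be seen on a tiny made-up example. `np.corrcoef` treats each row as one variable, so passing the transposed user-by-book table yields one row and one column per book; the ratings below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ratings in long format.
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "title":   ["A", "B", "A", "B", "A", "C"],
    "rating":  [5, 4, 4, 5, 1, 5],
})

# Users as rows, books as columns, unrated cells filled with 0.
user_by_book = pd.pivot_table(ratings, index="user_id", columns="title",
                              values="rating", fill_value=0)

# Transpose so each row handed to corrcoef is one book.
corr = np.corrcoef(user_by_book.T.to_numpy())
print(corr.shape)
```

Books A and B, liked by the same users, come out positively correlated, while A and C come out negatively correlated. Note that filling unrated cells with 0 treats "not rated" as a very low rating, which influences these scores.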

In [117]:
# Column labels of the pivot table are the book titles, in matrix order
book_titles = list(each_book_rating.columns)

The list of book titles, in the same order as the rows of the correlation matrix, is used for indexing

In [118]:
book_titles

["'Salem's Lot",
 "'Tis (Frank McCourt, #2)",
 '1421: The Year China Discovered America',
 '1776',
 '1984',
 'A Bend in the River',
 'A Bend in the Road',
 'A Brief History of Time',
 'A Briefer History of Time',
 'A Case of Need',
 'A Christmas Carol',
 'A Christmas Carol and Other Christmas Writings',
 'A Fine Balance',
 'A Great and Terrible Beauty (Gemma Doyle, #1)',
 'A Heartbreaking Work of Staggering Genius',
 'A History of God: The 4,000-Year Quest of Judaism, Christianity, and Islam',
 'A History of the World in 6 Glasses',
 'A Home at the End of the World',
 'A House for Mr Biswas',
 'A Lesson Before Dying',
 'A Little Princess',
 'A Living Nightmare (Cirque Du Freak, #1)',
 'A Man Without a Country',
 'A Map of the World',
 "A Midsummer Night's Dream",
 'A Million Little Pieces',
 'A Modest Proposal and Other Satirical Works',
 'A Moveable Feast',
 'A Painted House',
 "A People's History of the United States",
 'A Portrait of the Artist as a Young Man',
 'A Prayer for Owen M

Test with a sample input

In [119]:
book = 'The Alchemist'
book_index = book_titles.index(book)



In [120]:
corr_score = book_corr[book_index]

In [121]:
print(sorted(corr_score, reverse=True))

[1.0, 0.5804857413440452, 0.2196663264149613, 0.19298997079510882, 0.15251384752470595, 0.1463570428410713, 0.13870095308272185, 0.13769475974400108, 0.11383304425232792, 0.10472138309135845, 0.09346690953823901, 0.09341886680804283, 0.08949997374514919, 0.08721691242309942, 0.08574558329026544, 0.08356999042970055, 0.08306001993133742, 0.08302559913562292, 0.07958892942066186, 0.07868002375742361, 0.07845979769522061, 0.07796321169385508, 0.07700934513808477, 0.07695451809523714, 0.07672706130004493, 0.07578082102959577, 0.07552093749021321, 0.07478009829320781, 0.07269039421129048, 0.07216761879376286, 0.07120964922007475, 0.07029240996873322, 0.06902505390312294, 0.06833743803330107, 0.06602562615376437, 0.06417775892613453, 0.06270416051662087, 0.06268077467642716, 0.06086572431563057, 0.06015892568932874, 0.0590343932421973, 0.05853025363980898, 0.05717098706446823, 0.05639337116488889, 0.05585048158903767, 0.055746867014601104, 0.055678980892647216, 0.053642659691906736, 0.053536

In [122]:
condition = (corr_score >= 0.1)

In [123]:
# Books whose correlation with 'The Alchemist' is at least 0.1 (includes the book itself)
np.extract(condition, book_titles)

array(['Getting Things Done: The Art of Stress-Free Productivity',
       'Notes from a Small Island', 'Perfume: The Story of a Murderer',
       'Sex, Drugs, and Cocoa Puffs: A Low Culture Manifesto',
       'The Alchemist', 'The New York Trilogy',
       'The Plot Against America',
       'The Virtue of Selfishness: A New Concept of Egoism', 'The Zahir',
       'Treasure Island'], dtype='<U118')
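The threshold above returns titles in alphabetical order and includes the query book itself. A ranked alternative, sketched with made-up correlations and a hypothetical `top_n` helper, sorts by score and skips the query:

```python
import numpy as np

titles = ["A", "B", "C", "D"]
corr_row = np.array([0.05, 1.0, 0.58, 0.22])   # made-up correlations with book "B"

def top_n(corr_row, titles, query_index, n=2):
    """Return the n most correlated titles with their scores, excluding the query."""
    order = np.argsort(corr_row)[::-1]          # indices from highest to lowest score
    ranked = [(titles[i], float(corr_row[i])) for i in order if i != query_index]
    return ranked[:n]

result = top_n(corr_row, titles, query_index=1)
print(result)
```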

Function to get recommendations for a list of books

In [124]:
def get_recommendation(books_list):
    # Sum the correlation rows of every book in the input list
    book_similarities = np.zeros(book_corr.shape[0])
    for book in books_list:
        book_index = book_titles.index(book)
        book_similarities += book_corr[book_index]

    # Pair each title with its accumulated score and sort from highest to lowest
    book_preferences = list(zip(book_titles, book_similarities))
    return sorted(book_preferences, key=lambda x: x[1], reverse=True)

In [125]:
my_fav_books = ['The Alchemist','The Adventures of Sherlock Holmes','The Great Gatsby','To Kill a Mockingbird','The Da Vinci Code (Robert Langdon, #2)','The Fellowship of the Ring (The Lord of the Rings, #1)']

In [126]:
book_recommendations = get_recommendation(my_fav_books)

In [127]:
print('The books you should like')
print('-'*25)
i=0
cnt=0
while cnt < 9:
    book_to_read = book_recommendations[i][0]
    i += 1
    if book_to_read in my_fav_books:
        continue
    else:
        print(book_to_read)
        cnt += 1

The books you should like
-------------------------
The Plot Against America
The New York Trilogy
Harry Potter and the Sorcerer's Stone (Harry Potter, #1)
The Lord of the Rings (The Lord of the Rings, #1-3)
J.R.R. Tolkien 4-Book Boxed Set: The Hobbit and The Lord of the Rings
The Ultimate Hitchhiker's Guide to the Galaxy
The Body Farm (Kay Scarpetta, #5)
Perfume: The Story of a Murderer
Hatchet (Brian's Saga, #1)


In [130]:
# Module-level settings; a `global` statement is unnecessary at the top level
k = 10
metric = 'cosine'