With the cleaned datasets ready, we can move on to recommendation strategies. Given the available data, I decided to try two different methods of generating recommendations for a user.

The methods are as follows:

1) Top-N recommendations

2) Content based recommendations

For Top-N recommendation, I simply find the top 10 books by average user rating. This is obviously not very effective, because such a system returns the same list of books to every user, irrespective of their interests and reading history.

As we have seen, many users have rated a large number of books. I am setting a threshold of 100, meaning that I will consider only users who have rated 100 or more books. Users who have read more can be expected to rate a book more reliably.
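
The threshold is described here in words only, so the following is a minimal sketch of how such a filter could look in pandas. The toy frame and the names `filter_active_users` and `min_ratings` are illustrative assumptions, not part of the notebook. On the real data one would presumably pass `df_ratings` and keep the default of 100.

```python
import pandas as pd

def filter_active_users(ratings, min_ratings=100):
    """Keep only the rows of users who rated at least `min_ratings` books."""
    # Count ratings per user and broadcast that count back onto every row.
    counts = ratings.groupby('user_id')['book_id'].transform('count')
    return ratings[counts >= min_ratings]

# Toy data (hypothetical): user 1 rated three books, user 2 rated one.
toy_ratings = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'book_id': [10, 11, 12, 10],
    'rating':  [5, 4, 3, 2],
})

# A threshold of 2 keeps user 1 and drops user 2.
active = filter_active_users(toy_ratings, min_ratings=2)
print(active.user_id.unique())
```

Using `transform('count')` keeps the result aligned with the original rows, so the filter is a single boolean mask rather than a separate merge.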

### 1) Top-N Recommendation 

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

In [3]:
df_tags = pd.read_csv('Data/tags.csv')

In [4]:
df_book_tags = pd.read_csv('Data/book_tags.csv')

In [5]:
df_book = pd.read_csv('Data/books.csv')

In [6]:
df_ratings = pd.read_csv('Data/ratings.csv')

In [7]:
df_genres = pd.read_csv('Data/genres.csv')

In [8]:
top = df_ratings.groupby(['book_id'],as_index = False).agg({'rating':'mean'})
top.sort_values('rating',ascending = False)

Unnamed: 0,book_id,rating
3627,3628,4.829876
7946,7947,4.818182
9565,9566,4.768707
6919,6920,4.766355
8977,8978,4.761364
...,...,...
1821,1822,2.492537
4990,4991,2.462687
7635,7636,2.283333
4044,4045,2.254717


In [9]:
# We need to merge the "top" dataframe with the "df_book" dataframe to get the complete list of books
# with their titles and respective authors.

In [10]:
top.shape

(10000, 2)

In [12]:
top_list = pd.merge(df_book,top,on = 'book_id')

In [13]:
top_list.shape

(10000, 24)

In [14]:
top_list = top_list[['book_id','authors','title','rating']]
top_list.head()

Unnamed: 0,book_id,authors,title,rating
0,1,Suzanne Collins,"The Hunger Games (The Hunger Games, #1)",4.279707
1,2,"J.K. Rowling, Mary GrandPré",Harry Potter and the Sorcerer's Stone (Harry P...,4.35135
2,3,Stephenie Meyer,"Twilight (Twilight, #1)",3.214341
3,4,Harper Lee,To Kill a Mockingbird,4.329369
4,5,F. Scott Fitzgerald,The Great Gatsby,3.772224


In [15]:
top_list.sort_values('rating',ascending = False).head(10)

Unnamed: 0,book_id,authors,title,rating
3627,3628,Bill Watterson,The Complete Calvin and Hobbes,4.829876
7946,7947,"Anonymous, Lane T. Dennis, Wayne A. Grudem",ESV Study Bible,4.818182
9565,9566,Bill Watterson,Attack of the Deranged Mutant Killer Monster S...,4.768707
6919,6920,Bill Watterson,The Indispensable Calvin and Hobbes,4.766355
8977,8978,Bill Watterson,The Revenge of the Baby-Sat,4.761364
6360,6361,Bill Watterson,There's Treasure Everywhere: A Calvin and Hobb...,4.760456
6589,6590,Bill Watterson,The Authoritative Calvin and Hobbes: A Calvin ...,4.757202
4482,4483,Bill Watterson,It's a Magical World: A Calvin and Hobbes Coll...,4.747396
3274,3275,"J.K. Rowling, Mary GrandPré","Harry Potter Boxed Set, Books 1-5 (Harry Potte...",4.736842
1787,1788,Bill Watterson,The Calvin and Hobbes Tenth Anniversary Book,4.728528


Clearly, the result above is not a useful recommendation. It is just a list of the 10 best-rated books, identical for every user.
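
The Top-N steps above can be condensed into one helper, sketched here on toy data. The function name `top_n_books` and the toy frames are assumptions made for illustration. The column names follow the ones used in the cells above.

```python
import pandas as pd

def top_n_books(ratings, books, n=10):
    """Return the n books with the highest mean rating, with author and title."""
    # Average rating per book.
    mean_rating = ratings.groupby('book_id', as_index=False).agg({'rating': 'mean'})
    # Attach author and title information.
    ranked = pd.merge(books, mean_rating, on='book_id')
    ranked = ranked[['book_id', 'authors', 'title', 'rating']]
    return ranked.sort_values('rating', ascending=False).head(n)

# Toy data (hypothetical): three books rated by two users.
toy_books = pd.DataFrame({
    'book_id': [1, 2, 3],
    'authors': ['Author A', 'Author B', 'Author C'],
    'title':   ['First', 'Second', 'Third'],
})
toy_ratings = pd.DataFrame({
    'user_id': [1, 2, 1, 2, 1],
    'book_id': [1, 1, 2, 2, 3],
    'rating':  [5, 4, 2, 3, 5],
})

print(top_n_books(toy_ratings, toy_books, n=2))
```

The helper takes no user argument, which is exactly why every user receives the same list.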

### 2) Content based recommendations

In [17]:
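# Keep goodreads_book_id as well: it is the key used to join the tag data below.
df_book = df_book[['book_id','goodreads_book_id','authors','original_title']]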

In [18]:
gen_list = df_genres.tag_name.tolist()

In [19]:
df_tags = df_tags[df_tags.tag_name.isin(gen_list)]

In [20]:
df_merged_tags = pd.merge(df_book_tags, df_tags, on = 'tag_id') 

In [21]:
df_merged_tags

Unnamed: 0,goodreads_book_id,tag_id,count,tag_name
0,1,11305,37174,fantasy
1,2,11305,3441,fantasy
2,3,11305,47478,fantasy
3,5,11305,39330,fantasy
4,6,11305,38378,fantasy
...,...,...,...,...
215567,16124019,13983,9,harlequin-romance
215568,17853024,283,4,10th-century
215569,22608582,4269,6,benghazi
215570,24612624,12927,24,gender-identity


In [None]:
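# The tag data identifies books by goodreads_book_id, not book_id, so join on that column.
df_full_info = pd.merge(df_book,df_merged_tags,on='goodreads_book_id')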

In [None]:
# Concatenate all tags of a book into one space-separated string.
df_full_combined_tags = df_full_info.groupby('goodreads_book_id')['tag_name'].apply(' '.join).reset_index()

In [None]:
df_full_combined_tags = pd.merge(df_book,df_full_combined_tags,on='goodreads_book_id')

In [None]:
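# Collapse each author name into a single lowercase token by removing the spaces inside it.
df_full_combined_tags['authors'] = df_full_combined_tags['authors'].astype('str').apply(lambda x: x.replace(" ", "").lower())

In [None]:
# Separate co-authors with a space so that each author becomes a token of their own.
df_full_combined_tags['authors'] = df_full_combined_tags['authors'].astype('str').apply(lambda x: x.replace(",", " "))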

In [None]:
df_full_combined_tags['Final_string'] = df_full_combined_tags['authors'] + ' ' + df_full_combined_tags['tag_name']

In [None]:
df_full_combined_tags.Final_string[0]

In [None]:
# min_df = 1 keeps every term; recent scikit-learn versions reject an integer min_df of 0.
cnt_vectr = CountVectorizer(analyzer = 'word', ngram_range = (1, 2), min_df = 1, stop_words = 'english')

In [None]:
token_Matrix =  cnt_vectr.fit_transform(df_full_combined_tags.Final_string)

In [None]:
cos_sim = cosine_similarity(token_Matrix,token_Matrix)
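
To make the vectorizing and similarity cells above more concrete, here is a simplified, standard-library-only sketch of what they compute: each string becomes a vector of word counts, and two books are compared by the cosine of the angle between their vectors. It is a stand-in, not scikit-learn's implementation. It counts single words only (no bigrams, no stop-word removal), and the example strings and the name `cosine` are made up for illustration.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two strings."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    # Dot product over the words of a; words missing from b count as 0.
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical "author + tags" strings in the style of Final_string.
book_1 = 'authorone fantasy adventure'
book_2 = 'authorone fantasy romance'
book_3 = 'authortwo cooking'

print(cosine(book_1, book_2))  # shares two of three words with book_1
print(cosine(book_1, book_3))  # shares no words with book_1
```

Books that share an author token or many tags end up with a score close to 1, while books with no tokens in common score 0.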

In [None]:
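# Use a clean 0..n-1 index so that row positions match the rows of cos_sim.
df_full_combined_tags = df_full_combined_tags.reset_index(drop = True)
bookTitles = df_full_combined_tags['original_title']
# Reverse lookup: title -> row position in the similarity matrix.
indices = pd.Series(df_full_combined_tags.index, index = bookTitles)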

In [None]:
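def MoreLikeThis(title):
    """Return the 10 titles most similar to `title` by cosine similarity."""
    index = indices[title]
    # Duplicate titles make this lookup return a Series; use the first match.
    if isinstance(index, pd.Series):
        index = index.iloc[0]
    # Rank every book by its similarity to the query, highest first.
    similarityScore = sorted(enumerate(cos_sim[index]), key = lambda x: x[1], reverse = True)
    # Position 0 is normally the query book itself, so keep positions 1 to 10.
    bookIndex = [i[0] for i in similarityScore[1:11]]
    return bookTitles.iloc[bookIndex]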

In [None]:
MoreLikeThis('The Hunger Games')

In [None]:
u2 = df_ratings[df_ratings.user_id==2]

In [None]:
u2_blist = u2.book_id.tolist()

In [None]:
u2_blist


In [None]:
# ratings.csv identifies books by book_id, so look the rated books up in that column.
(df_book.book_id == 127).value_counts()

In [None]:
(df_book.book_id == u2_blist[2]).value_counts()

In [None]:
df_book[df_book.book_id == u2_blist[2]].original_title

In [None]:
result = MoreLikeThis('Harry Potter and the Order of the Phoenix')

In [None]:
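# result.index holds row positions, not book IDs; map them to book_id before comparing with u2_blist.
res_list = df_full_combined_tags.loc[result.index, 'book_id'].tolist()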

In [None]:
u2_blist = set(u2_blist)
res_list = set(res_list)

In [None]:
overlap = u2_blist & res_list
print(overlap)

In [None]:
match = float(len(overlap)) / len(res_list) * 100
print(match)
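
The overlap calculation above can be wrapped in a small reusable helper, sketched below. The name `overlap_percent` is an assumption made for illustration. Both inputs must use the same identifier (for example `book_id`), otherwise the intersection is meaningless.

```python
def overlap_percent(recommended_ids, rated_ids):
    """Percentage of recommended book IDs that the user has already rated."""
    recommended, rated = set(recommended_ids), set(rated_ids)
    # Guard against an empty recommendation list to avoid dividing by zero.
    if not recommended:
        return 0.0
    return len(recommended & rated) / len(recommended) * 100

# Two of the four recommended IDs (2 and 4) were rated by the user.
print(overlap_percent([1, 2, 3, 4], [2, 4, 9]))  # 50.0
```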

In [None]:
df_book.book_id.max()

In [None]:
df_ratings.book_id.max()

In [None]:
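def accuracy(u_id):
    """Print and return the share (%) of recommendations the user has already rated."""
    u_blist = df_ratings[df_ratings.user_id == u_id].book_id.tolist()

    # Look up the titles of the rated books; ratings and books share book_id.
    rated = df_book[df_book.book_id.isin(u_blist)]
    titles = [t for t in rated.original_title.dropna() if t in indices.index]
    if not titles:
        print('No rated book with a known title for user', u_id)
        return None

    # Seed the recommender with one of the user's rated books (here, the last one found).
    result = MoreLikeThis(titles[-1])

    # Convert the recommended rows to book_id so both sets use the same identifier.
    res_list = set(df_full_combined_tags.loc[result.index, 'book_id'])
    overlap = set(u_blist) & res_list

    match = len(overlap) / len(res_list) * 100
    print(match)
    return match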