## Processing the data and obtaining the cosine similarity matrix

Let's have a look at the data.

In [3]:
import pandas as pd

books=pd.read_csv('books.csv')
print('Current shape of the data set {}'.format(books.shape))
books.head()

Current shape of the data set (10000, 23)


Unnamed: 0,id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,...,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


In [4]:
books.dropna(inplace=True)

In [5]:
ratings=pd.read_csv('ratings.csv')

This is what a typical row in the data looks like. It holds the book id, the author's name, the title, and so on. From these fields we select the ones that are relevant for establishing similarity between books.

In [6]:
books.iloc[0]

id                                                                           1
book_id                                                                2767052
best_book_id                                                           2767052
work_id                                                                2792775
books_count                                                                272
isbn                                                                 439023483
isbn13                                                         9780439023480.0
authors                                                        Suzanne Collins
original_publication_year                                               2008.0
original_title                                                The Hunger Games
title                                  The Hunger Games (The Hunger Games, #1)
language_code                                                              eng
average_rating                                      

In [7]:
pred_df=books[['id', 'books_count', 'authors', 'original_publication_year', 
       'language_code', 'average_rating', 'ratings_count',
       'work_ratings_count', 'work_text_reviews_count', 'ratings_1',
       'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5']]

This is another CSV file included in the dataset. As you can see, it pairs each user id with the ids of books that the user has not read yet.

In [8]:
pd.read_csv('to_read.csv')

Unnamed: 0,user_id,book_id
0,1,112
1,1,235
2,1,533
3,1,1198
4,1,1874
...,...,...
912700,53424,4716
912701,53424,4844
912702,53424,5907
912703,53424,7569


### Converting the values in the data into numbers

In the first step we convert the columns that are stored as strings into numbers by assigning a unique integer to each distinct value.

In [9]:
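import numpy as np

def numerical_dict(val):
    # Map each unique value in column `val` to an integer code (0, 1, 2, ...)
    entry1=list(pred_df[val].unique())
    entry2=list(np.arange(0,len(entry1)))
    replacement=dict(zip(entry1,entry2))
    return replacement

# Assign the result back instead of using inplace=True on a column selection
pred_df['authors']=pred_df['authors'].replace(numerical_dict('authors'))
pred_df['language_code']=pred_df['language_code'].replace(numerical_dict('language_code'))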

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return self._update_inplace(result)
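
The warning above appears because `pred_df` was selected from `books` and then modified. As a minimal sketch, assuming a toy frame in place of the real data (`toy_books` and `toy_pred` are hypothetical names), one way to avoid it is to take an explicit copy and assign integer codes with `pd.factorize`, which numbers values in order of first appearance much like `numerical_dict` does:

```python
import pandas as pd

# Toy stand-in for the books table (hypothetical values).
toy_books = pd.DataFrame({
    "authors": ["Agatha Christie", "George Orwell", "Agatha Christie"],
    "language_code": ["eng", "en-GB", "eng"],
    "average_rating": [3.9, 4.1, 4.0],
})

# .copy() gives an independent frame, so later assignments neither
# touch nor warn about the original DataFrame.
toy_pred = toy_books[["authors", "language_code"]].copy()

for col in ["authors", "language_code"]:
    # factorize returns (integer codes, unique values); codes follow
    # the order in which each value first appears.
    codes, uniques = pd.factorize(toy_pred[col])
    toy_pred[col] = codes

print(toy_pred)
```

Note that `pd.factorize` encodes missing values as -1, so this sketch relies on the rows with missing values having been dropped beforehand, as the notebook does.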


Next we convert the year of publication into one-hot encodings and append them to the data to get a broader set of features.

In [10]:
one_hot_encoded=pd.get_dummies(pred_df['original_publication_year'])

In [11]:
pred_df=pd.concat([pred_df,one_hot_encoded],axis=1)
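
To make the effect of these two cells concrete, here is a small sketch on a toy frame (the values and the names `toy`, `year_dummies`, and `toy_wide` are assumptions for illustration). Each distinct year becomes its own indicator column, and the indicators are then concatenated next to the existing features:

```python
import pandas as pd

toy = pd.DataFrame({
    "average_rating": [4.3, 3.9, 4.1],
    "original_publication_year": [2008.0, 1936.0, 1936.0],
})

# One column per distinct year; each row has a 1 only in its own year's column.
year_dummies = pd.get_dummies(toy["original_publication_year"], dtype=int)

# Column-wise concatenation keeps the original features beside the indicators.
toy_wide = pd.concat([toy, year_dummies], axis=1)
print(toy_wide)
```

Two books published in the same year share an indicator column, which is one reason books from the same year tend to score as similar later on.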

In [8]:
pred_df

Unnamed: 0,id,books_count,authors,original_publication_year,language_code,average_rating,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5
0,1,272,0,2008.0,0,4.34,4780653,4942365,155254,66715,127936,560092,1481305,2706317
1,2,491,1,1997.0,0,4.44,4602479,4800065,75867,75504,101676,455024,1156318,3011543
2,3,226,2,2005.0,1,3.57,3866839,3916824,95009,456191,436802,793319,875073,1355439
3,4,487,3,1960.0,0,4.25,3198671,3340896,72586,60427,117415,446835,1001952,1714267
4,5,1356,4,1925.0,0,3.89,2683664,2773745,51992,86236,197621,606158,936012,947718
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,9995,199,1739,1924.0,0,3.09,10866,12110,681,1478,2225,3805,2985,1617
9995,9996,19,1090,2010.0,0,4.09,17204,18856,1180,105,575,3538,7860,6778
9996,9997,19,2898,1990.0,0,4.25,12582,12952,395,303,551,1737,3389,6972
9997,9998,60,1493,1977.0,0,4.35,9421,10733,374,11,111,1191,4240,5180


### Normalizing the data

In [12]:
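def mean_normalize(val):
    # Centre on the mean and divide by the range, then shift so the minimum is 0;
    # the result is equivalent to min-max scaling into [0, 1]
    norm=((val-val.mean())/(val.max()-val.min()))
    norm=norm-norm.min()
    return norm

mean_normalize(pred_df['average_rating'])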

0       0.795745
1       0.838298
2       0.468085
3       0.757447
4       0.604255
          ...   
9994    0.263830
9995    0.689362
9996    0.757447
9997    0.800000
9998    0.502128
Name: average_rating, Length: 7860, dtype: float64
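
Although the function is called `mean_normalize`, subtracting the minimum at the end cancels the mean, so the result should match plain min-max scaling. The sketch below checks this on a few sample ratings (`min_max` and `toy_ratings` are names introduced here for illustration):

```python
import numpy as np
import pandas as pd

def mean_normalize(val):
    # Same two steps as the notebook's function.
    norm = (val - val.mean()) / (val.max() - val.min())
    return norm - norm.min()

def min_max(val):
    # Standard min-max scaling into the range [0, 1].
    return (val - val.min()) / (val.max() - val.min())

toy_ratings = pd.Series([4.34, 4.44, 3.57, 4.25, 3.89])
scaled = mean_normalize(toy_ratings)

# Both formulas agree up to floating-point error.
print(np.allclose(scaled, min_max(toy_ratings)))
```

Algebraically, (x - mean)/range minus (min - mean)/range simplifies to (x - min)/range, which is why every normalized column ends up between 0 and 1.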

In [13]:
norm_list=['books_count', 'authors',  
       'language_code', 'average_rating', 'ratings_count',
       'work_ratings_count', 'work_text_reviews_count', 'ratings_1',
       'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5']

for name in norm_list:
    pred_df[name]=mean_normalize(pred_df[name])

In [14]:
pred_df.drop(['id'],axis=1,inplace=True)

In [15]:
pred_df.reset_index(drop=True,inplace=True)

In [16]:
pred_df.drop(['original_publication_year'],axis=1,inplace=True)

## Collaborative Filtering

In this step we use cosine similarity to measure how close each book's feature vector is to the vector of every other book in the dataset. As the similarity matrix shows, every diagonal entry is 1 because each book is identical to itself. Within a row, the columns whose values are close to 1 correspond to the books most similar to that row's book.

In [17]:
from sklearn.metrics.pairwise import cosine_similarity

k=cosine_similarity(pred_df.values, dense_output=True)

In [21]:
output=pd.DataFrame(k,columns=np.arange(0,k.shape[1]))

In [22]:
output.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7850,7851,7852,7853,7854,7855,7856,7857,7858,7859
0,1.0,0.821479,0.702303,0.783091,0.712639,0.749069,0.664184,0.659365,0.641182,0.589605,...,0.105207,0.108832,0.136029,0.370533,0.179524,0.075785,0.172424,0.157056,0.181674,0.105922
1,0.821479,1.0,0.673356,0.771578,0.697349,0.693877,0.672732,0.642645,0.636889,0.602594,...,0.121175,0.125273,0.156568,0.147345,0.20683,0.087419,0.19733,0.181107,0.209766,0.120128
2,0.702303,0.673356,1.0,0.66762,0.693934,0.610861,0.560667,0.670071,0.629638,0.502358,...,0.066163,0.068282,0.084583,0.080151,0.111337,0.050742,0.107796,0.097737,0.112414,0.066912
3,0.783091,0.771578,0.66762,1.0,0.696978,0.685331,0.656694,0.645354,0.637118,0.59433,...,0.134162,0.138703,0.173204,0.162995,0.228414,0.097063,0.217775,0.200088,0.231865,0.133012
4,0.712639,0.697349,0.693934,0.696978,1.0,0.621298,0.627017,0.651871,0.640303,0.619514,...,0.113511,0.117076,0.145262,0.136973,0.190564,0.089726,0.182706,0.167545,0.195188,0.111687
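
As a rough illustration of what the matrix contains, cosine similarity between two vectors is their dot product divided by the product of their lengths. The sketch below computes it by hand with NumPy on a toy feature matrix (`features` is a made-up example, not the book data) and shows why the diagonal is always 1:

```python
import numpy as np

# Three toy "books" with three features each (hypothetical values).
features = np.array([
    [1.0, 0.0, 1.0],
    [0.9, 0.1, 1.0],
    [0.0, 1.0, 0.0],
])

# Scale every row to unit length, then take all pairwise dot products.
norms = np.linalg.norm(features, axis=1, keepdims=True)
unit = features / norms
similarity = unit @ unit.T

print(np.round(similarity, 3))
```

Rows 0 and 1 point in nearly the same direction, so their score is close to 1, while rows 0 and 2 share no non-zero feature and score 0.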


## This code will create the csv file that will be used in making the backend of the website

Let's check out the recommender system and how it works.

In [None]:
output.to_csv('output2.csv')
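
When the backend reads this file back, two details are worth keeping in mind: `to_csv` writes the row index as an unnamed first column, and the integer column labels come back as strings. A minimal sketch of the round trip, using an in-memory buffer and a toy matrix instead of the real file (both are assumptions for illustration):

```python
import io

import numpy as np
import pandas as pd

toy_output = pd.DataFrame(np.array([[1.0, 0.5], [0.5, 1.0]]),
                          columns=np.arange(2))

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
toy_output.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns the unnamed first column back into the row index.
restored = pd.read_csv(buffer, index_col=0)

# Header labels are read as strings ('0', '1'); convert them back to ints.
restored.columns = restored.columns.astype(int)
print(restored)
```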

#### We use the following code to obtain the recommended book_ids

In [24]:
# Similarity scores between the book at row 4764 and every book
pred=pd.Series(k[4764])

# Row positions 0..n-1 indexed by score, sorted from most to least similar
predictions=pd.DataFrame(np.arange(0,k.shape[0]),pred).sort_index(ascending=False)
predictions

Unnamed: 0,0
1.000000,4764
0.999900,4906
0.999366,3678
0.999267,2005
0.999225,7620
...,...
0.080465,3667
0.071789,7189
0.069662,6485
0.044404,3082
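
The same ranking can be sketched more directly with `np.argsort`. The helper below is a hypothetical alternative, not the notebook's method: `top_n_similar` and `toy_sim` are names made up for this example, and unlike the output above it leaves the query book out of its own recommendations.

```python
import numpy as np

def top_n_similar(sim, book_idx, n=5):
    """Return row positions of the n most similar books, excluding the book itself."""
    scores = sim[book_idx]
    order = np.argsort(scores)[::-1]      # positions sorted by descending score
    order = order[order != book_idx]      # drop the query book
    return order[:n]

# Toy 4x4 similarity matrix (hypothetical values).
toy_sim = np.array([
    [1.0, 0.8, 0.1, 0.5],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
])

print(top_n_similar(toy_sim, 0, n=2))
```

The returned positions can be passed to `iloc` on a frame that shares the same row order as the similarity matrix.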


In [25]:
new_b=books.dropna()

In [26]:
new_b.reset_index(inplace=True)
new_b.drop(['index'],axis=1,inplace=True)
new_b.drop(['id'],axis=1,inplace=True)

In [27]:
# Top six rows: the queried book itself followed by its five nearest neighbours
reccomended=predictions.head(6)[0]

In [28]:
new_b.iloc[reccomended]

Unnamed: 0,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
4764,16360,16360,1355172,196,7113803,9780007000000.0,Agatha Christie,1936.0,Murder in Mesopotamia,"Murder in Mesopotamia (Hercule Poirot, #14)",...,21207,24492,892,108,957,7300,9946,6181,https://images.gr-assets.com/books/1308808558m...,https://images.gr-assets.com/books/1308808558s...
4906,16297,16297,894955,171,425205959,9780425000000.0,Agatha Christie,1936.0,Cards on the Table,"Cards on the Table (Hercule Poirot, #15)",...,19577,23004,1031,112,857,6468,9372,6195,https://s.gr-assets.com/assets/nophoto/book/11...,https://s.gr-assets.com/assets/nophoto/book/50...
3678,668,668,1711534,63,451187849,9780451000000.0,"Ayn Rand, Leonard Peikoff",1936.0,We the Living,We the Living,...,20994,22571,1178,647,1531,5121,7689,7583,https://images.gr-assets.com/books/1306188481m...,https://images.gr-assets.com/books/1306188481s...
2005,16322,16322,626006,240,1579126243,9781579000000.0,Agatha Christie,1936.0,The ABC Murders,"The A.B.C. Murders (Hercule Poirot, #13)",...,49513,57370,2159,250,1906,14323,23443,17448,https://s.gr-assets.com/assets/nophoto/book/11...,https://s.gr-assets.com/assets/nophoto/book/50...
7620,9648,9648,3226250,99,141183721,9780141000000.0,George Orwell,1936.0,Keep the Aspidistra Flying,Keep the Aspidistra Flying,...,9599,11261,746,121,615,2926,4556,3043,https://images.gr-assets.com/books/1331244097m...,https://images.gr-assets.com/books/1331244097s...
2941,373755,373755,1595511,107,679732187,9780680000000.0,William Faulkner,1936.0,"Absalom, Absalom!","Absalom, Absalom!",...,30283,32324,1589,1222,2115,6149,10046,12792,https://s.gr-assets.com/assets/nophoto/book/11...,https://s.gr-assets.com/assets/nophoto/book/50...


### Obtain the books dataset that will be used in the website

In [29]:
new_b.to_csv('new_books.csv')