<a href="https://colab.research.google.com/github/Davilirio/Python_data_analysis/blob/master/book_recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Using the book rating dataset: https://www.kaggle.com/philippsp/book-recommender-collaborative-filtering-shiny

We'll look to find the best books, group books in parameter space, analyse the books dataset and make a book recomendation system for a specific user.

In [None]:
import fastai
import seaborn as sns
from fastai.collab import *
from google.colab import drive, files

drive.mount('/content/gdrive/')
path = Path('/content/gdrive/My Drive/data_science_stuff/datasets/books/')

Opening both dataframes, we'll start by taking a look at how our data is distributed and some information about it

In [None]:
ratings = pd.read_csv(path/'ratings.csv')
books = pd.read_csv(path/'books.csv')

In [None]:
ratings = ratings.merge(books[['book_id', 'title']])
ratings.head()

In [None]:
len(ratings)

In [6]:
mean_rating = ratings.groupby('title').rating.mean()

In [None]:
print('Top 10 books with highest mean rating \n'),
mean_rating.sort_values(ascending = False)[:10]

In [None]:
print('Top 10 books with lowest mean rating \n'),
mean_rating.sort_values()[:10]

Even though we show this kind of data, note that the mean rating is **not** as informative as it seems once it can be biased by the type of user that read/rates that book in the platform

In [None]:
most_read = ratings.groupby('title').rating.count().sort_values(ascending = False)
print('Ratings seem to be limited at 100 \n'),
most_read[:10]

In [None]:
min = ratings['rating'].min()
max = ratings['rating'].max()
print(f"The lowest rating is: {min} \nThe highest rating is: {max}")

Defining a basic Fast.ai learner for collab filtering

In [10]:
train = ratings[:-30]
test = ratings[-30:]

In [None]:
data = CollabDataBunch.from_df(train, seed=42, valid_pct=0.1,
                               item_name='title', user_name='user_id', test=test)
y_range = [0.5, 5.5]

In [None]:
learn = collab_learner(data,y_range=y_range, wd=1e-1, n_factors=50)

In [None]:
  learn.lr_find()
learn.recorder.plot()

In [None]:
def fit_1c(ep, lr):
  learn.fit_one_cycle(ep, lr)
  learn.recorder.plot_losses()

In [None]:
fit_1c(25, 6e-2)

In [16]:
learn.save('rcmd_bk_v1')

In [17]:
learn=None
gc.collect()

10608

In [48]:
learn.load('rcmd_bk_v1');

In [23]:
learn.model.cuda() # model into the gpu

EmbeddingDotBias(
  (u_weight): Embedding(27560, 50)
  (i_weight): Embedding(813, 50)
  (u_bias): Embedding(27560, 1)
  (i_bias): Embedding(813, 1)
)

Analysing book bias to extract the unbiased opinion about the book within the whole user base

In [None]:
book_bias = learn.bias(most_read.index[:500])
book_bias.shape

In [25]:
book_info = [(b, i, mean_rating.loc[i]) for i,b in zip(most_read.index,book_bias)]

In [26]:
for t, b in zip(most_read.index[:5], book_bias[:5]):
  print(f'Book: {t}\n -> Bias: {b} \n -> Mean Rating: {mean_rating.loc[t]} \n')

Book: Pearls of Lutra (Redwall, #9)
 -> Bias: -0.09689515829086304 
 -> Mean Rating: 2.8 

Book: Blue Ocean Strategy: How To Create Uncontested Market Space And Make The Competition Irrelevant
 -> Bias: 0.5335620641708374 
 -> Mean Rating: 4.02 

Book: Narcissus and Goldmund
 -> Bias: 0.39145544171333313 
 -> Mean Rating: 3.8 

Book: Blue Like Jazz: Nonreligious Thoughts on Christian Spirituality
 -> Bias: -0.04157985374331474 
 -> Mean Rating: 3.02 

Book: Naked
 -> Bias: 0.43617478013038635 
 -> Mean Rating: 3.81 



In [27]:
print('Best books by the members'),
sorted(book_info,key=lambda book: book[0], reverse=True)[:5]

Best books by the members


[(tensor(0.9083), 'Girl with a Pearl Earring', 4.53),
 (tensor(0.8616), 'The Taste of Home Cookbook', 4.55),
 (tensor(0.8441), 'Franny and Zooey', 4.39),
 (tensor(0.8353), 'The Lost Boy (Dave Pelzer #2)', 4.4),
 (tensor(0.7916),
  'Longitude: The True Story of a Lone Genius Who Solved the Greatest Scientific Problem of His Time',
  4.34)]

In [28]:
print('Worst books by the members'),
sorted(book_info,key=lambda book: book[0])[:5]

Worst books by the members


[(tensor(-0.2300), 'Nine Stories', 2.53),
 (tensor(-0.2101),
  "Harry Potter and the Sorcerer's Stone (Harry Potter, #1)",
  3.09),
 (tensor(-0.1060), 'The Chamber', 2.88),
 (tensor(-0.0969), 'Pearls of Lutra (Redwall, #9)', 2.8),
 (tensor(-0.0669), 'The Woman in White', 2.88)]

Analysing book weights to find groups in the embedding space

In [29]:
book_w = learn.weight(most_read.index[:500])
book_w.shape # 50 factors, so a 500x50 matrix

torch.Size([500, 50])

In [30]:
# transforming in an numpy array:
array_bw = np.asarray(book_w)

As we have 50 factors, it may be almost impossible to interpretate each one. Here we cover the maximum amount of space in the matrix of weights space using only 3 components instead of 50 to try to extract understandable information

In [31]:
principal_comps = book_w.pca(3) # creating 3 components
comp1, comp2, comp3 = principal_comps.t() # dividing them into 3 separate tensors

In [32]:
principal_comps.t()

tensor([[-0.2183,  1.2633, -0.2229,  ...,  0.0641, -0.0174, -0.2483],
        [-0.4163, -0.3362, -0.4003,  ..., -0.2484,  0.0841,  0.7073],
        [ 0.0891, -0.2761,  0.3456,  ..., -0.8884, -0.7034,  0.3243]])

In [33]:
comp_1 = [(w, n) for w, n in zip(comp1, most_read.index)]
comp_2 = [(f, i) for f, i in zip(comp2, most_read.index)]
comp_3 = [(f, i) for f, i in zip(comp3, most_read.index)]

Looking for information in the first component:

In [34]:
sorted(comp_1, key=lambda w: w[0], reverse=True)[:10]

[(tensor(1.3407), 'Postmortem (Kay Scarpetta, #1)'),
 (tensor(1.2633),
  'Blue Ocean Strategy: How To Create Uncontested Market Space And Make The Competition Irrelevant'),
 (tensor(1.2536), 'Point of Origin (Kay Scarpetta, #9)'),
 (tensor(1.2363), 'The Adventures of Huckleberry Finn'),
 (tensor(1.1118), 'Job: A Comedy of Justice'),
 (tensor(1.0955), 'Congo'),
 (tensor(1.0813), 'In the Skin of a Lion'),
 (tensor(1.0732), 'Night (The Night Trilogy #1)'),
 (tensor(1.0132), 'Goldfinger (James Bond, #7)'),
 (tensor(0.9552), 'Endymion (Hyperion Cantos, #3)')]

In [35]:
sorted(comp_1, key=lambda w: w[0])[:10]

[(tensor(-1.2110), 'The Confusion (The Baroque Cycle, #2)'),
 (tensor(-1.0973), 'The Doors of Perception & Heaven and Hell'),
 (tensor(-1.0772), "The River (Brian's Saga, #2)"),
 (tensor(-1.0761),
  'Harry Potter and the Order of the Phoenix (Harry Potter, #5)'),
 (tensor(-1.0496), 'Memories of My Melancholy Whores'),
 (tensor(-0.9951), 'Harry Potter Collection (Harry Potter, #1-6)'),
 (tensor(-0.8931), 'Blindness'),
 (tensor(-0.8585), 'Shalimar the Clown'),
 (tensor(-0.8449), 'The Sirens of Titan'),
 (tensor(-0.7868), "Plum Lovin' (Stephanie Plum, #12.5)")]

Looking for information in the second component:

In [36]:
sorted(comp_2, key=lambda w: w[0], reverse=True)[:10]

[(tensor(1.0656), 'Martin Chuzzlewit'),
 (tensor(1.0290), 'Nine Stories'),
 (tensor(0.9984), 'Life of Pi'),
 (tensor(0.9446), 'The Virgin Blue'),
 (tensor(0.9408), 'Harry Potter Boxed Set, Books 1-5 (Harry Potter, #1-5)'),
 (tensor(0.9402), 'The Terminal Man'),
 (tensor(0.9314), 'My Life in France'),
 (tensor(0.9119), 'I Hope They Serve Beer in Hell (Tucker Max, #1)'),
 (tensor(0.9026), 'Atlas Shrugged'),
 (tensor(0.8809),
  'The Path Between the Seas: The Creation of the Panama Canal, 1870-1914')]

In [37]:
sorted(comp_2, key=lambda w: w[0])[:10]

[(tensor(-1.2952), 'Runaways, Vol. 1: Pride and Joy (Runaways, #1)'),
 (tensor(-1.1784), 'The Testament'),
 (tensor(-0.9996), 'Wizard and Glass (The Dark Tower, #4)'),
 (tensor(-0.9137), 'Kafka on the Shore'),
 (tensor(-0.9007), 'War and Peace'),
 (tensor(-0.8932), 'Song of Susannah (The Dark Tower, #6)'),
 (tensor(-0.8721), 'From the Mixed-Up Files of Mrs. Basil E. Frankweiler'),
 (tensor(-0.8720), 'Skinny Legs and All'),
 (tensor(-0.8681), 'Point of Origin (Kay Scarpetta, #9)'),
 (tensor(-0.8672), 'In Our Time')]

Looking for information in the third component


In [38]:
sorted(comp_3, key=lambda w: w[0], reverse=True)[:10]

[(tensor(1.3240), 'The Secret Garden'),
 (tensor(1.2172), 'Three Men in a Boat (Three Men, #1)'),
 (tensor(1.1144), 'A Christmas Carol'),
 (tensor(1.1086), 'Salamandastron (Redwall, #5)'),
 (tensor(1.1061), 'Birdsong'),
 (tensor(1.0615), 'How We Are Hungry'),
 (tensor(1.0414), 'A Bend in the Road'),
 (tensor(1.0322), 'Waiting for the Barbarians'),
 (tensor(1.0137),
  'Longitude: The True Story of a Lone Genius Who Solved the Greatest Scientific Problem of His Time'),
 (tensor(0.9972), 'My Life in France')]

In [39]:
sorted(comp_3, key=lambda w: w[0])[:10]

[(tensor(-1.2127), 'Never Let Me Go'),
 (tensor(-1.0171), 'Cradle and All'),
 (tensor(-0.9945), 'Rainbow Six (Jack Ryan Universe, #10)'),
 (tensor(-0.9339), 'Hard Eight (Stephanie Plum, #8)'),
 (tensor(-0.9271), 'I Like You: Hospitality Under the Influence'),
 (tensor(-0.8884), 'The Dark Tower (The Dark Tower, #7)'),
 (tensor(-0.8884), 'Hey Nostradamus!'),
 (tensor(-0.8878), 'Farmer Boy (Little House, #3)'),
 (tensor(-0.8514), 'Timbuktu'),
 (tensor(-0.7995), 'Fantastic Mr. Fox')]

Predictions

In [41]:
learn.model.cuda()
predictions = learn.get_preds(ds_type=DatasetType.Train)
preds = predictions[0].numpy()

In [42]:
for t, p in zip(test['rating'].values[:6], preds[:6]):
  print(f'True value: {t} \nPredicted value: {p}\n\n')

True value: 3 
Predicted value: 3.2062723636627197


True value: 5 
Predicted value: 3.0001964569091797


True value: 4 
Predicted value: 3.6449759006500244


True value: 4 
Predicted value: 3.9866089820861816


True value: 3 
Predicted value: 4.871235370635986


True value: 5 
Predicted value: 3.237518548965454


