<a href="https://colab.research.google.com/github/Davilirio/Python_data_analysis/blob/master/book_recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Using the book rating dataset: https://www.kaggle.com/philippsp/book-recommender-collaborative-filtering-shiny

We'll look to find the best books, group books in parameter space, analyse the books dataset and make a book recomendation system for a specific user.

In [None]:
import fastai
import seaborn as sns
from fastai.collab import *
from google.colab import drive, files

drive.mount('/content/gdrive/')
path = Path('/content/gdrive/My Drive/data_science_stuff/datasets/books/')

Opening both dataframes, we'll start by taking a look at how our data is distributed and some information about it

In [None]:
ratings = pd.read_csv(path/'ratings.csv')
books = pd.read_csv(path/'books.csv')

In [None]:
ratings = ratings.merge(books[['book_id', 'title']])
ratings.head()

In [None]:
len(ratings)

In [5]:
mean_rating = ratings.groupby('title').rating.mean()

In [None]:
print('Top 10 books with highest mean rating \n'),
mean_rating.sort_values(ascending = False)[:10]

In [None]:
print('Top 10 books with lowest mean rating \n'),
mean_rating.sort_values()[:10]

Even though we show this kind of data, note that the mean rating is **not** as informative as it seems once it can be biased by the type of user that read/rates that book in the platform

In [None]:
most_read = ratings.groupby('title').rating.count().sort_values(ascending = False)
print('Ratings seem to be limited at 100 \n'),
most_read[:10]

In [None]:
min = ratings['rating'].min()
max = ratings['rating'].max()
print(f"The lowest rating is: {min} \nThe highest rating is: {max}")

Defining a basic Fast.ai learner for collab filtering

In [10]:
train = ratings[:-30]
test = ratings[-30:]

In [None]:
data = CollabDataBunch.from_df(train, seed=42, valid_pct=0.1,
                               item_name='title', user_name='user_id', test=test)
y_range = [0.5, 5.5]

In [None]:
learn = collab_learner(data,y_range=y_range, wd=1e-1, n_factors=50)

In [None]:
  learn.lr_find()
learn.recorder.plot()

In [None]:
def fit_1c(ep, lr):
  learn.fit_one_cycle(ep, lr)
  learn.recorder.plot_losses()

In [None]:
fit_1c(25, 6e-2)

In [19]:
learn.model.cuda() # model into the gpu

EmbeddingDotBias(
  (u_weight): Embedding(27560, 50)
  (i_weight): Embedding(813, 50)
  (u_bias): Embedding(27560, 1)
  (i_bias): Embedding(813, 1)
)

Analysing book bias to extract the unbiased opinion about the book within the whole user base

In [None]:
book_bias = learn.bias(most_read.index[:500])
book_bias.shape

In [21]:
book_info = [(b, i, mean_rating.loc[i]) for i,b in zip(most_read.index,book_bias)]

In [22]:
for t, b in zip(most_read.index[:5], book_bias[:5]):
  print(f'Book: {t}\n -> Bias: {b} \n -> Mean Rating: {mean_rating.loc[t]} \n')

Book: Pearls of Lutra (Redwall, #9)
 -> Bias: -0.11252568662166595 
 -> Mean Rating: 2.8 

Book: Blue Ocean Strategy: How To Create Uncontested Market Space And Make The Competition Irrelevant
 -> Bias: 0.5358605980873108 
 -> Mean Rating: 4.02 

Book: Narcissus and Goldmund
 -> Bias: 0.4363575875759125 
 -> Mean Rating: 3.8 

Book: Blue Like Jazz: Nonreligious Thoughts on Christian Spirituality
 -> Bias: -0.05481935665011406 
 -> Mean Rating: 3.02 

Book: Naked
 -> Bias: 0.4303600788116455 
 -> Mean Rating: 3.81 



In [23]:
print('Best books by the members'),
sorted(book_info,key=lambda book: book[0], reverse=True)[:5]

Best books by the members


[(tensor(0.9039), 'Girl with a Pearl Earring', 4.53),
 (tensor(0.8696), 'The Taste of Home Cookbook', 4.55),
 (tensor(0.8390), 'Franny and Zooey', 4.39),
 (tensor(0.8122), 'The Lost Boy (Dave Pelzer #2)', 4.4),
 (tensor(0.8107), 'The Universe in a Nutshell', 4.38)]

In [24]:
print('Worst books by the members'),
sorted(book_info,key=lambda book: book[0])[:5]

Worst books by the members


[(tensor(-0.2711), 'Nine Stories', 2.53),
 (tensor(-0.2348),
  "Harry Potter and the Sorcerer's Stone (Harry Potter, #1)",
  3.09),
 (tensor(-0.1152), 'The Chamber', 2.88),
 (tensor(-0.1125), 'Pearls of Lutra (Redwall, #9)', 2.8),
 (tensor(-0.0618), 'The Woman in White', 2.88)]

Analysing book weights to find groups in the embedding space

In [25]:
book_w = learn.weight(most_read.index[:500])
book_w.shape # 50 factors, so a 500x50 matrix

torch.Size([500, 50])

In [26]:
# transforming in an numpy array:
array_bw = np.asarray(book_w)

As we have 50 factors, it may be almost impossible to interpretate each one. Here we cover the maximum amount of space in the matrix of weights space using only 3 components instead of 50 to try to extract understandable information

In [27]:
principal_comps = book_w.pca(3) # creating 3 components
comp1, comp2, comp3 = principal_comps.t() # dividing them into 3 separate tensors

In [28]:
principal_comps.t()

tensor([[ 0.7490, -0.4674,  0.3095,  ...,  0.8699,  0.0835, -0.1962],
        [-0.5093,  0.6513, -0.1807,  ...,  0.3220, -0.3976, -0.5434],
        [ 0.7307,  0.3598, -0.2873,  ..., -0.1587, -0.6213,  0.6585]])

In [29]:
comp_1 = [(w, n) for w, n in zip(comp1, most_read.index)]
comp_2 = [(f, i) for f, i in zip(comp2, most_read.index)]
comp_3 = [(f, i) for f, i in zip(comp3, most_read.index)]

Looking for information in the first component:

In [30]:
sorted(comp_1, key=lambda w: w[0], reverse=True)[:10]

[(tensor(1.0661), 'We Were the Mulvaneys'),
 (tensor(1.0228), 'Antigone (The Theban Plays, #3)'),
 (tensor(1.0118), 'Mornings on Horseback'),
 (tensor(1.0016), 'My Name is Red'),
 (tensor(0.9917), 'Unfinished Tales of Númenor and Middle-Earth'),
 (tensor(0.9665), 'The Road'),
 (tensor(0.9471), 'A Great and Terrible Beauty (Gemma Doyle, #1)'),
 (tensor(0.9168), 'The Snows of Kilimanjaro and Other Stories'),
 (tensor(0.9088), "The River (Brian's Saga, #2)"),
 (tensor(0.8872), 'Last Chance to See')]

In [31]:
sorted(comp_1, key=lambda w: w[0])[:10]

[(tensor(-1.4511), 'The Adventures of Huckleberry Finn'),
 (tensor(-1.4233), 'Ten Apples Up On Top!'),
 (tensor(-1.2946), 'Endymion (Hyperion Cantos, #3)'),
 (tensor(-1.2487), 'Goldfinger (James Bond, #7)'),
 (tensor(-1.2152), 'Postmortem (Kay Scarpetta, #1)'),
 (tensor(-1.0525), 'East of Eden'),
 (tensor(-0.9626), 'Here on Earth'),
 (tensor(-0.9625), 'His Excellency: George Washington'),
 (tensor(-0.9338), 'The Wind in the Willows'),
 (tensor(-0.9323), 'Sentimental Education')]

Looking for information in the second component:

In [32]:
sorted(comp_2, key=lambda w: w[0], reverse=True)[:10]

[(tensor(1.2250), 'Regeneration (Regeneration, #1)'),
 (tensor(1.1154), 'What Looks Like Crazy on an Ordinary Day (Idlewild, #1)'),
 (tensor(1.0140), 'Inés of My Soul'),
 (tensor(1.0084), 'Songs in Ordinary Time'),
 (tensor(0.9180), 'All the Names'),
 (tensor(0.8908), 'The Odyssey'),
 (tensor(0.8899), 'Nickel and Dimed: On (Not) Getting By in America'),
 (tensor(0.8869), 'East of Eden'),
 (tensor(0.8777), 'Rainbow Six (Jack Ryan Universe, #10)'),
 (tensor(0.8415), 'What Do You Care What Other People Think?')]

In [33]:
sorted(comp_2, key=lambda w: w[0])[:10]

[(tensor(-1.1753), 'Persuasion'),
 (tensor(-1.0113), 'Generation X: Tales for an Accelerated Culture'),
 (tensor(-0.9765),
  'Sherlock Holmes: The Complete Novels and Stories, Volume I'),
 (tensor(-0.8995),
  'Blue Like Jazz: Nonreligious Thoughts on Christian Spirituality'),
 (tensor(-0.8846),
  'Freakonomics: A Rogue Economist Explores the Hidden Side of Everything (Freakonomics, #1)'),
 (tensor(-0.8732), 'Assassination Vacation'),
 (tensor(-0.8231), 'Me Talk Pretty One Day'),
 (tensor(-0.8226), 'The Time Machine'),
 (tensor(-0.8051), 'An Ideal Husband'),
 (tensor(-0.7931), 'How to Win Friends and Influence People')]

Looking for information in the third component


In [34]:
sorted(comp_3, key=lambda w: w[0], reverse=True)[:10]

[(tensor(1.1003), 'Where the Heart Is'),
 (tensor(1.0510), 'The Terror'),
 (tensor(1.0022), 'The Snows of Kilimanjaro and Other Stories'),
 (tensor(0.9365), 'A Christmas Carol'),
 (tensor(0.9208), 'Naked'),
 (tensor(0.8979), 'Life of Pi'),
 (tensor(0.8727), 'Live and Let Die (James Bond, #2)'),
 (tensor(0.8634), 'Oedipus Rex  (The Theban Plays, #1)'),
 (tensor(0.8553), 'Brave New World Revisited '),
 (tensor(0.8370), 'Night (The Night Trilogy #1)')]

In [35]:
sorted(comp_3, key=lambda w: w[0])[:10]

[(tensor(-1.0766), 'Men Are from Mars, Women Are from Venus'),
 (tensor(-1.0539), 'Trace (Kay Scarpetta, #13)'),
 (tensor(-1.0428), 'Pompeii'),
 (tensor(-0.9296), 'Hamlet'),
 (tensor(-0.9250), 'Heretics of Dune (Dune Chronicles #5)'),
 (tensor(-0.9152), 'The Street Lawyer'),
 (tensor(-0.8976), 'A Bend in the River'),
 (tensor(-0.8850), 'The Road'),
 (tensor(-0.8512), 'Breaking the Spell: Religion as a Natural Phenomenon'),
 (tensor(-0.8455), 'Bridge to Terabithia')]

Predictions

In [None]:
predictions = learn.get_preds(ds_type=DatasetType.Train)
preds = predictions[0].numpy()

In [41]:
for t, p in zip(test['rating'].values[:6], preds[:6]):
  print(f'True value: {t} \nPredicted value: {p}\n\n')

True value: 3 
Predicted value: 4.898460865020752


True value: 5 
Predicted value: 4.0081987380981445


True value: 4 
Predicted value: 2.4927268028259277


True value: 4 
Predicted value: 3.9799094200134277


True value: 3 
Predicted value: 3.9950125217437744


True value: 5 
Predicted value: 2.6713061332702637


