# Memory-Based Collaborative Filtering

### Using the provided dataset, we will build a memory-based collaborative filtering recommender in Python. The merged dataset has the columns book_id, user_id, rating, id, and title.

### Using user_id, book_id, and rating, we will find users similar to a randomly chosen target user and recommend items to that user. The recommended items are the ones that similar users have rated highly in the past.

In [1]:
from collections import defaultdict
import random
import scipy.optimize
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

In [2]:
books_data = pd.read_csv('../../data_preprocessing/books_data.csv')
ratings_data = pd.read_csv('../../data/books_ratings_data.csv')
book_tags_data = pd.read_csv('../../data/book_tags_data.csv')
tags_data = pd.read_csv('../../data/tags_data.csv')

In [3]:
ratings_data = ratings_data.drop_duplicates()

In [4]:
dataset = pd.merge(ratings_data, books_data[['id','title']], how='inner', left_on='book_id', right_on='id')

In [5]:
dataset.head(10)

Unnamed: 0,book_id,user_id,rating,id,title
0,1,314,5,1,the hunger games the hunger games
1,1,439,3,1,the hunger games the hunger games
2,1,588,5,1,the hunger games the hunger games
3,1,1169,4,1,the hunger games the hunger games
4,1,1185,4,1,the hunger games the hunger games
5,1,2077,4,1,the hunger games the hunger games
6,1,2487,4,1,the hunger games the hunger games
7,1,2900,5,1,the hunger games the hunger games
8,1,3662,4,1,the hunger games the hunger games
9,1,3922,5,1,the hunger games the hunger games


### Before we start, let's find the number of unique users and unique books in the dataset.

In [6]:
unique_users = dataset['user_id'].unique()
print(len(unique_users))

52895


In [7]:
unique_bookId = dataset['book_id'].unique()
print(len(unique_bookId))

9397


### So we have about 53,000 unique users and about 9,400 unique books.
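### The two counts also tell us how sparse the user-book matrix will be. The sketch below computes the density (the share of user-book cells that actually hold a rating) on a small made-up frame; `toy` and its values are hypothetical and only mirror the column layout of `dataset`.

```python
import pandas as pd

# Hypothetical toy ratings in the same column layout as `dataset`.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "book_id": [10, 20, 10, 20, 30, 40],
    "rating":  [5, 3, 4, 2, 5, 4],
})

n_users = toy["user_id"].nunique()  # same count as len(toy["user_id"].unique())
n_books = toy["book_id"].nunique()

# Share of user-book cells that hold a rating (assumes one row per pair).
density = len(toy) / (n_users * n_books)

print(n_users, n_books, density)
```

### Running the same three lines on `dataset` should give the density of the real matrix, provided each (user_id, book_id) pair appears only once.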

### Now we will create a new dataframe with every userId in its first column.

In [26]:
new_df = pd.DataFrame(unique_users)
new_df.columns=['userId']

### Next we will create a column for each book title in the dataset, to hold the rating each user gave that book. The result resembles a sparse matrix: most cells stay empty.

In [11]:
unique_title = dataset['title'].unique()  # one column per distinct book title
for title in unique_title:
    new_df[title] = None
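### With about 53,000 users and about 9,400 books, a dense frame has roughly 500 million cells, most of them empty. As an alternative, here is a minimal sketch that stores only the ratings that exist, using `scipy.sparse.csr_matrix`. The frame `toy` and the name `ratings_csr` are hypothetical, and the sketch assumes one rating per (user, book) pair, because `csr_matrix` sums duplicate entries.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy ratings in the same column layout as `dataset`.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "book_id": [10, 20, 10, 20, 30, 40],
    "rating":  [5, 3, 4, 2, 5, 4],
})

# factorize maps each distinct id to a 0-based position, in order of appearance.
user_pos, user_ids = pd.factorize(toy["user_id"])
book_pos, book_ids = pd.factorize(toy["book_id"])

# Rows are users, columns are books; only the given ratings are stored.
ratings_csr = csr_matrix(
    (toy["rating"].to_numpy(dtype=float), (user_pos, book_pos)),
    shape=(len(user_ids), len(book_ids)),
)

print(ratings_csr.shape, ratings_csr.nnz)
```

### `user_ids` and `book_ids` keep the mapping from matrix positions back to the original ids.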

### Finally we will fill in each user's ratings in the corresponding cells. This produces a very large matrix of approximately 53,000 x 9,400.

In [15]:
for j, user in enumerate(new_df['userId'].values):
    user_rows = dataset[dataset['user_id'] == user]  # filter once per user
    for title, rating in zip(user_rows['title'], user_rows['rating']):
        new_df.at[j, title] = rating  # .at writes one cell without chained assignment
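### The nested loop filters the dataset once per user, which is slow at this scale. A possible alternative, shown on made-up data, is to let pandas build the user-by-title matrix in one call with `pivot_table`. The frame `toy` is hypothetical, and `aggfunc="mean"` is an assumption about how to resolve a user rating the same title more than once.

```python
import pandas as pd

# Hypothetical toy data with the same columns the loop reads from `dataset`.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "title":   ["book a", "book b", "book a", "book b", "book c"],
    "rating":  [5, 3, 4, 2, 5],
})

# One row per user, one column per title; unrated cells become NaN.
matrix = toy.pivot_table(
    index="user_id", columns="title", values="rating", aggfunc="mean"
)

print(matrix)
```

### Unlike `new_df`, this result uses the user id as the index rather than as a `userId` column, and marks missing ratings with NaN instead of None.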

### Below is the new dataframe consisting of all the ratings.

In [16]:
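new_df.head  # without parentheses this shows the bound method's repr, which prints the frame; new_df.head() would return the first five rows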

<bound method NDFrame.head of        userId the hunger games the hunger games  \
0         314                                 5   
1         439                                 3   
2         588                                 5   
3        1169                                 4   
4        1185                                 4   
...       ...                               ...   
52890   52363                              None   
52891   27590                              None   
52892   51339                              None   
52893    9310                              None   
52894    6532                              None   

      harry potter and the sorcerers stone harry potter twilight twilight  \
0                                                  None                 3   
1                                                  None              None   
2                                                  None                 1   
3                                                

### From the matrix prepared above, we'll pick a user at random as the target and find users similar to them. As we can see, our selected user has rated 65 books. We will use this data to recommend books to the target user, based on users who have rated books similarly.

In [30]:
user = pd.DataFrame(new_df.iloc[2121])
user = user.drop(['userId'])
user[user.notnull().values]

Unnamed: 0,2121


### Next we select the users who rated the first book our target user rated, print the head, and note the userId to use for the similarity computation.

In [18]:
subset = new_df[new_df[user[user.notnull().values].index[0]].notnull()]
subset.head()

Unnamed: 0,userId,the hunger games the hunger games,harry potter and the sorcerers stone harry potter,twilight twilight,to kill a mockingbird,the great gatsby,the fault in our stars,the hobbit,the catcher in the rye,angels demons robert langdon,...,the famished road,the red knight the traitor son cycle,the time travellers guide to medieval england a handbook for visitors to the fourteenth century,the everafter war the sisters grimm,billy budd sailor,bayou moon the edge,means of ascent the years of lyndon johnson,the mauritius command,cinderella ate my daughter dispatches from the frontlines of the new girlie girl culture,the first world war
32,17984,5,5.0,,5.0,,4.0,,,4.0,...,,,,,,,,,,
38,20848,5,,4.0,,,,5.0,5.0,,...,,,,,,,,,,
44,23612,4,5.0,3.0,,,,,,3.0,...,,,,,,,,,,
56,28158,4,,,,4.0,3.0,,3.0,,...,,,,,,,,,,
86,46977,4,3.0,2.0,,,,,,2.0,...,,,,,,,,,,


### After that, we replace all the null values with 0 so that we can perform calculations on numerical values.

In [19]:
subset = subset.fillna(0)  # covers both None and NaN cells
subset = subset.set_index('userId')

In [20]:
subset.head()

userId,the hunger games the hunger games,harry potter and the sorcerers stone harry potter,twilight twilight,to kill a mockingbird,the great gatsby,the fault in our stars,the hobbit,the catcher in the rye,angels demons robert langdon,pride and prejudice,...,the famished road,the red knight the traitor son cycle,the time travellers guide to medieval england a handbook for visitors to the fourteenth century,the everafter war the sisters grimm,billy budd sailor,bayou moon the edge,means of ascent the years of lyndon johnson,the mauritius command,cinderella ate my daughter dispatches from the frontlines of the new girlie girl culture,the first world war
17984,5,5,0,5,0,4,0,0,4,5,...,0,0,0,0,0,0,0,0,0,0
20848,5,0,4,0,0,0,5,5,0,0,...,0,0,0,0,0,0,0,0,0,0
23612,4,5,3,0,0,0,0,0,3,4,...,0,0,0,0,0,0,0,0,0,0
28158,4,0,0,0,4,3,0,3,0,0,...,0,0,0,0,0,0,0,0,0,0
46977,4,3,2,0,0,0,0,0,2,0,...,0,0,0,0,0,0,0,0,0,0


In [32]:
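target_id = 28158  # userId that every other user is compared against
rating_cols = subset.columns  # rating columns only, captured before 'similarity' is added
target_vec = subset.loc[target_id, rating_cols].values.astype(float).reshape(1, -1)
all_vecs = subset[rating_cols].values.astype(float)
# cosine_similarity returns an (n_users, 1) array; ravel() turns it into one column
subset['similarity'] = cosine_similarity(all_vecs, target_vec).ravel()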
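### For reference, cosine similarity is the dot product of two rating vectors divided by the product of their lengths. The helper below is a small illustration in plain NumPy, not the scikit-learn implementation; the function name `cosine` and the four vectors are made up. Returning 0.0 for an all-zero vector is a convention chosen here to avoid dividing by zero.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two 1-D vectors; 0.0 if either vector is all zeros."""
    norm_u, norm_v = np.linalg.norm(u), np.linalg.norm(v)
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return float(np.dot(u, v) / (norm_u * norm_v))

# Hypothetical zero-filled rating vectors over the same four books.
a = np.array([5, 0, 4, 0], dtype=float)
b = np.array([5, 0, 4, 0], dtype=float)  # identical to a
c = np.array([0, 3, 0, 0], dtype=float)  # no book in common with a
d = np.array([4, 0, 0, 3], dtype=float)  # one book in common with a

print(cosine(a, b), cosine(a, c), cosine(a, d))
```

### Identical vectors score 1 and vectors with no book in common score 0. Because unrated books are stored as 0, two users who share only a few books end up with a low score even when they agree on those books.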

### Finally, we look at each user's similarity to our selected user.

In [33]:
subset['similarity']

userId
17984    0.181441
20848    0.070158
23612    0.168881
28158    1.000000
46977    0.096887
           ...   
40772    0.022457
41732    0.008473
42322    0.023021
45989    0.056228
51123    0.013671
Name: similarity, Length: 100, dtype: float64
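### The notebook stops at the similarity scores. One common way to turn them into recommendations is sketched below under stated assumptions: take the k most similar users, then score each book the target has not rated by the similarity-weighted mean of those neighbours' ratings. The function `recommend`, its parameters `k` and `n`, and all data here are hypothetical, and this scoring rule is only one of several reasonable choices.

```python
import pandas as pd

# Hypothetical zero-filled ratings (rows: users, columns: books), shaped like `subset`.
ratings = pd.DataFrame(
    {"book a": [5, 4, 5, 0], "book b": [0, 5, 4, 2], "book c": [0, 0, 5, 4]},
    index=pd.Index([101, 102, 103, 104], name="userId"),
)
# Hypothetical similarity of each user to target user 101.
similarity = pd.Series([1.0, 0.8, 0.6, 0.1], index=ratings.index)

def recommend(ratings, similarity, target_id, k=2, n=5):
    """Rank books the target has not rated by the similarity-weighted
    mean rating of the k most similar users who rated them."""
    neighbours = similarity.drop(target_id).nlargest(k)
    target_row = ratings.loc[target_id]
    unseen = target_row[target_row == 0].index           # books the target has not rated
    nb_ratings = ratings.loc[neighbours.index, unseen].astype(float)
    rated = (nb_ratings > 0).astype(float)                # 1.0 where a neighbour rated the book
    weighted_sum = nb_ratings.mul(neighbours, axis=0).sum()
    weight_total = rated.mul(neighbours, axis=0).sum()
    scores = (weighted_sum / weight_total).dropna()       # drop books no neighbour rated
    return scores.sort_values(ascending=False).head(n)

top = recommend(ratings, similarity, target_id=101)
print(top)
```

### In this toy example the neighbours are users 102 and 103, so "book c" ranks above "book b". With the real data, `ratings` would be the rating columns of `subset` and `similarity` would be `subset['similarity']`.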

In [25]:
!jupyter nbconvert Memory*.ipynb --to python

[NbConvertApp] Converting notebook Memory_Based_Chetan_Nain.ipynb to python
[NbConvertApp] Writing 2169 bytes to Memory_Based_Chetan_Nain.py
