## Book Recommendation System for Users

We implement a **recommender system** using a book rating data set ([link](https://lists.w3.org/Archives/Public/public-rww/2013Dec/0002.html)): a data mining model that recommends books to users.

In [30]:
import pandas as pd
import numpy as np
# import the functions for cosine distance and Euclidean distance
from scipy.spatial.distance import cosine, euclidean
from sklearn.model_selection import train_test_split

# settings
%matplotlib inline



In [31]:
df_data1 = pd.read_table("DBbook_train_ratings.tsv")  # path to the data file

print(df_data1.head(10))

   userID  itemID  rate
0    6873    3201     4
1    6873    3098     4
2    6873    4198     4
3    6873    5950     4
4    6873     204     4
5    6873    8010     5
6    6873    5232     4
7    6873    7538     4
8    6873    5231     3
9    6873    5438     3


#### Printing the number of unique users and the number of unique books in this data set.

In [32]:
# df_data1["userID"].unique() returns the unique values in the "userID" column
# the len function returns the length of a list-like object
print("Ratings: %d" % len(df_data1))
print("Unique users: %d" % len(df_data1["userID"].unique()))
print("Unique items: %d" % len(df_data1["itemID"].unique()))

Ratings: 75558
Unique users: 6181
Unique items: 6166


Next we create the utility matrix, in which each row represents a user and each column represents an item. We print the first 10 rows of the matrix, look at the number and percentage of cells that are not populated, and then fill these empty cells with 0s.

### Creating the utility matrix

In [33]:
dense_matrix = df_data1.pivot_table(values="rate", index=["userID"], columns=["itemID"])
print("Shape of the user-item matrix: %d x %d" % dense_matrix.shape)
print(dense_matrix.head(10))

Shape of the user-item matrix: 6181 x 6166
itemID  1     2     3     5     7     8     9     11    12    13    ...  8157  \
userID                                                              ...         
1        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   
2        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   
3        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   
4        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   
5        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   
6        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   
7        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   
8        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   
9        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   
10       NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
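
The number and percentage of unpopulated cells can be computed directly from the NaN entries of the pivoted matrix. The following is a minimal sketch on a small made-up ratings table (`toy_ratings` and `toy_matrix` are hypothetical names, and the values are invented for illustration); the same three lines should apply to `dense_matrix` before the NaNs are filled.

```python
import pandas as pd

# Toy ratings table (hypothetical values) with the same columns as the data set
toy_ratings = pd.DataFrame({
    "userID": [1, 1, 2, 3, 3, 3],
    "itemID": [10, 20, 10, 20, 30, 40],
    "rate":   [4, 5, 3, 2, 4, 5],
})

# Build the user-item utility matrix; cells without a rating become NaN
toy_matrix = toy_ratings.pivot_table(values="rate", index="userID", columns="itemID")

# Count the cells that hold no rating and express the count as a percentage
n_cells = toy_matrix.size
n_missing = int(toy_matrix.isnull().sum().sum())
pct_missing = 100.0 * n_missing / n_cells

print("Empty cells: %d of %d (%.2f%%)" % (n_missing, n_cells, pct_missing))
```

For the real data, 75558 ratings spread over 6181 x 6166 cells means that at most about 0.2% of the matrix can be populated, so the utility matrix is very sparse.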

In [34]:
# return all ratings for Item 3
# access a column using the column name
dense_matrix[3] # equivalent to dense_matrix.loc[:,3]

userID
1      NaN
2      NaN
3      NaN
4      NaN
5      NaN
        ..
7251   NaN
7252   NaN
7253   NaN
7254   NaN
7255   NaN
Name: 3, Length: 6181, dtype: float64

In [35]:
# return all ratings by User 2
# access a row using its index label
# Note: .loc is not to be confused with DataFrame.iloc[].
# .loc locates a row using its index label;
# .iloc locates a row using its integer position.
# A row with an index label of 1 may be, for example, the 4th row of a data frame.
dense_matrix.loc[2] # .loc[row_label, column_label]

itemID
1      NaN
2      NaN
3      NaN
5      NaN
7      NaN
        ..
8164   NaN
8166   NaN
8167   NaN
8168   NaN
8169   NaN
Name: 2, Length: 6166, dtype: float64
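
To make the difference between `.loc` and `.iloc` concrete, here is a small sketch on a made-up frame whose index labels are deliberately out of order (`frame` and its values are hypothetical, not taken from the data set).

```python
import pandas as pd

# Hypothetical frame whose index labels do not match the row positions
frame = pd.DataFrame({"rate": [4, 5, 3]}, index=pd.Index([2, 3, 1], name="userID"))

# .loc looks up the row whose index label is 1 (the third row here)
by_label = frame.loc[1, "rate"]
# .iloc looks up the row at integer position 1 (the second row here, labelled 3)
by_position = frame.iloc[1]["rate"]

print(by_label, by_position)
```

In `dense_matrix` the userID labels start at 1 while positions start at 0, so `dense_matrix.loc[2]` and `dense_matrix.iloc[2]` refer to different users.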

In [36]:
# retrieve the rating of User 2 on Item 1
# the first index is the column index (item)
# the second index is the row index (user)
dense_matrix[1][2]

nan

In [37]:
# the mean rating of Item 2
dense_matrix[2].mean()

4.05

In [38]:
# iloc selects a row by integer position; position 1 is the second row, which is userID 2 here
user2_ratings = dense_matrix.iloc[1]
# isnull() returns a boolean series of the same length, so this prints the total number of items
print(len(user2_ratings.isnull()))
# the following line selects the values in user2_ratings that are not null
user2_not_null = user2_ratings[user2_ratings.notnull()]
# the number of items (non-missing ratings) rated by this user
len(user2_not_null)

6166


8

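### Replacing missing ratings with 0s

Missing values (NaN) usually cannot be used in numerical calculations such as distance functions, so we replace them with 0s. Note that from this point on an unrated book is treated as if it had a rating of 0.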

In [39]:
dense_matrix = dense_matrix.fillna(0)
dense_matrix.head()

itemID,1,2,3,5,7,8,9,11,12,13,...,8157,8160,8161,8162,8163,8164,8166,8167,8168,8169
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
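
Filling with 0s changes any statistic computed afterwards, because pandas skips NaN by default but counts zeros like any other value. The sketch below uses a made-up series (`item_ratings` is a hypothetical name) to show the effect; it is why the mean rating of Item 2 above was computed before the fill.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings of one item by four users; NaN means "not rated"
item_ratings = pd.Series([4.0, np.nan, 5.0, np.nan], index=[1, 2, 3, 4])

# mean() skips NaN by default, so only the two real ratings are averaged
mean_skipping_nan = item_ratings.mean()
# after fillna(0) all four cells are averaged, zeros included
mean_after_fill = item_ratings.fillna(0).mean()

print(mean_skipping_nan, mean_after_fill)
```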


#### Printing the top 5 similar users to userID 3 based on Euclidean distance.

### Finding the K most similar items/user to a given item/user

#### Using Euclidean distance

In [40]:
# define a function that takes an item number (integer) and returns the top K most similar items (as a series of distances)
def top_k_items(item_number, k):
    # transpose the dense matrix so each row represents an item
    df_sim = dense_matrix.transpose()
    # remove the active item; .copy() avoids a SettingWithCopyWarning when the distance column is added
    df_sim = df_sim.loc[df_sim.index != item_number].copy()
    # calculate the distance between the given item and each row (axis=1 applies the function to each row)
    df_sim["distance"] = df_sim.apply(lambda x: euclidean(dense_matrix[item_number], x), axis=1)
    # return the k smallest distances
    return df_sim.sort_values(by="distance").head(k)["distance"]

In [41]:
# retrieve top five similar items to ItemID 2
top_k_items(2, 5)

itemID
1       12.727922
1591    17.888544
6447    17.888544
6576    17.916473
7312    18.138357
Name: distance, dtype: float64

In [42]:
def top_k_users(user_number, k):
    # no need to transpose the matrix this time because the rows already represent users
    # remove the active user; .copy() avoids a SettingWithCopyWarning when the distance column is added
    df_sim = dense_matrix.loc[dense_matrix.index != user_number].copy()
    # calculate the distance between the given user and each row
    df_sim["distance"] = df_sim.apply(lambda x: euclidean(dense_matrix.loc[user_number], x), axis=1)
    # return the k smallest distances
    return df_sim.sort_values(by="distance").head(k)["distance"]

In [43]:
# retrieve top five similar users to User 3
top_k_users(3, 5)

userID
5066    16.613248
590     16.852300
6861    16.911535
6875    16.970563
4568    17.175564
Name: distance, dtype: float64
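
Applying `euclidean` row by row can be slow on a matrix of this size. As an alternative, the distances to every other user can be computed in one step with NumPy broadcasting. This is only a sketch on a made-up matrix: `toy` and `top_k_users_vectorized` are hypothetical names, and the function takes the matrix as an argument instead of reading the global `dense_matrix`.

```python
import numpy as np
import pandas as pd

# Toy utility matrix (hypothetical values); rows are users, columns are items
toy = pd.DataFrame(
    [[5, 0, 3], [4, 0, 3], [0, 2, 0], [5, 2, 3]],
    index=pd.Index([1, 2, 3, 4], name="userID"),
    columns=pd.Index([10, 20, 30], name="itemID"),
    dtype=float,
)

def top_k_users_vectorized(matrix, user_number, k):
    """Return the k nearest users to user_number by Euclidean distance."""
    # remove the active user
    others = matrix.drop(index=user_number)
    # subtracting a 1-D array broadcasts it across every row of the 2-D array
    diffs = others.to_numpy() - matrix.loc[user_number].to_numpy()
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    result = pd.Series(distances, index=others.index, name="distance")
    return result.sort_values().head(k)

nearest = top_k_users_vectorized(toy, 1, 2)
print(nearest)
```

The result has the same form as the output of `top_k_users`: a series of distances indexed by userID, smallest first.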

#### K Nearest Neighbors Algorithm

In [44]:
from sklearn.neighbors import NearestNeighbors
# create an instance of the learner and specify it to use Euclidean distance
nbrs = NearestNeighbors(n_neighbors=5, metric="euclidean")
# fit the learner using all user ratings except those of User 3
other_users = dense_matrix.loc[dense_matrix.index != 3]
nbrs.fit(other_users)
# kneighbors expects a 2D array, so reshape User 3's ratings into a single row
user3 = dense_matrix.loc[3].values.reshape(1, -1)
print(user3)
# the learner returns the distances and positions of the 5 nearest neighbors of User 3
distances, locs = nbrs.kneighbors(user3, 5)
print(distances, locs)
# map the positions back to user IDs in the user-item matrix
# ravel() flattens the 2D array with one row into a 1D array
sim_users = other_users.iloc[locs.ravel()].index
# print the user IDs and the distances
for sim_user, dist in zip(sim_users, distances.ravel()):
    print(sim_user, dist)

[[0. 0. 0. ... 0. 0. 0.]]
[[16.61324773 16.85229955 16.91153453 16.97056275 17.17556404]] [[4318  497 5847 5860 3903]]
5066 16.61324772583615
590 16.852299546352718
6861 16.911534525287763
6875 16.97056274847714
4568 17.175564037317667


#### Printing the Euclidean distance between itemID 18 and itemID 1, and between itemID 36 and itemID 1, to see which of itemID 18 and itemID 36 is more similar to itemID 1.

In [45]:
print("Euclidean distance between itemID 18 and itemID 1:",euclidean(dense_matrix[18], dense_matrix[1]))
print("Euclidean distance between itemID 36 and itemID 1:",euclidean(dense_matrix[36], dense_matrix[1]))

Euclidean distance between itemID 18 and itemID 1: 29.189039038652847
Euclidean distance between itemID 36 and itemID 1: 40.124805295477756

itemID 18 is more similar to itemID 1 than itemID 36 is, because its Euclidean distance to itemID 1 is smaller (about 29.19 versus 40.12).


#### Printing the top 5 similar items to itemID 8010.

In [46]:
top_k_items(8010, 5)

itemID
3711    127.921851
4559    127.964839
330     129.715072
1311    129.722781
7328    129.761319
Name: distance, dtype: float64

#### Keeping only the books and users with more than 20 ratings in the utility matrix, and printing the shape of the dataset.

In [47]:
# count the ratings per item and per user
df_item_freq = df_data1.groupby("itemID").count()
df_user_freq = df_data1.groupby("userID").count()
# keep the items rated by more than 20 users
selected_items = df_item_freq[df_item_freq["userID"] > 20].index
dense_matrix = dense_matrix[selected_items]
# keep the users who rated more than 20 items
selected_users = df_user_freq[df_user_freq["itemID"] > 20].index
dense_matrix = dense_matrix.loc[selected_users]
print("Shape of the dataset:", dense_matrix.shape)

Shape of the dataset: (766, 776)


#### Removing users that haven’t rated itemID 8010 and printing the shape of the dataset. The counts of the different rating scores of this item are displayed in the following cell (In [49]).

In [48]:
# missing ratings were filled with 0, so a rating above 0 means the user rated itemID 8010
dense_matrix = dense_matrix[dense_matrix[8010] > 0]
print("Shape of the dataset:", dense_matrix.shape)

Shape of the dataset: (174, 776)


#### Partitioning the data set obtained in the previous step to validate rating predictions for itemID 8010. We randomly select 25% of the users as the testing set and use the others as the training set, then print the dimensions of both sets and the mean rating of itemID 8010 in each.

In [49]:
dense_matrix[8010].value_counts()

4.0    68
5.0    58
3.0    27
2.0    13
1.0     8
Name: 8010, dtype: int64

In [50]:
from sklearn.model_selection import train_test_split

# create a data frame for the predictors
df_x = dense_matrix[[col for col in dense_matrix.columns if col != 8010]]
print(df_x.shape)

# create a single-column data frame for the outcome
df_y = dense_matrix[[8010]]
print(df_y.shape)

train_x, test_x, train_y, test_y = train_test_split(df_x, df_y, test_size=0.25, random_state=0)

print("Train set shape:", train_x.shape)
print("Test set shape:", test_x.shape)

print("Mean rating of itemID 8010 in training set:", train_y.mean().iloc[0])
print("Mean rating of itemID 8010 in test set:", test_y.mean().iloc[0])

(174, 775)
(174, 1)
Train set shape: (130, 775)
Test set shape: (44, 775)
Mean rating of itemID 8010 in training set: 3.9153846153846152
Mean rating of itemID 8010 in test set: 3.8181818181818183


Using the training and test sets obtained in the previous step, we 1) print the userID of the user in the 5th row of the test set (not userID 5), and 2) predict this user’s rating of itemID 8010 based on the top 5 similar users in the training set, then print the predicted rating alongside the actual rating of the book.

In [51]:
# specify the number of similar users to retrieve
k = 5

def user_based_predict(user_number):
    # retrieve the top k similar users
    # copy all the training predictors
    df_sim = train_x.copy()
    # for each training user, calculate the distance between this user and the active user
    df_sim["distance"] = df_sim.apply(lambda x: euclidean(test_x.loc[user_number], x), axis=1)
    # create a new data frame to store the top k similar users
    df_sim_users = df_sim.sort_values(by="distance").head(k).copy()
    # multiply each similar user's rating of itemID 8010 by that user's distance
    df_sim_users["weighted_d"] = df_sim_users["distance"] * train_y.loc[df_sim_users.index, 8010]
    # distance-weighted average of the neighbours' ratings
    predicted = df_sim_users["weighted_d"].sum() / df_sim_users["distance"].sum()
    return predicted

In [52]:
# print 5th row userID
print("5th row userID in test set:", test_x.index[4])

uid = test_x.index[4]
print("Predicted rating of this user on Item 8010:", user_based_predict(uid))
print("True rating of this user on Item 8010:", test_y.loc[uid][8010])

5th row userID in test set: 5614
Predicted rating of this user on Item 8010: 4.013489148846391
True rating of this user on Item 8010: 4.0
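
Note that `user_based_predict` multiplies each neighbour's rating by its distance, so the most distant of the five neighbours carries the most weight. A common alternative is to weight by the inverse of the distance so that closer users count more. The sketch below is one possible way to do that and is not the method used above; `inverse_distance_predict`, the `eps` parameter and the toy values are all hypothetical.

```python
import numpy as np

def inverse_distance_predict(distances, ratings, eps=1e-9):
    """Weighted mean of neighbour ratings with weights 1 / (distance + eps)."""
    distances = np.asarray(distances, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    # eps guards against division by zero when a neighbour is identical
    weights = 1.0 / (distances + eps)
    return float((weights * ratings).sum() / weights.sum())

# Hypothetical neighbours: the closest rated the book 5, the farthest rated it 1
toy_distances = [1.0, 2.0, 4.0]
toy_ratings = [5.0, 4.0, 1.0]

# closer neighbours dominate the prediction
closer_counts_more = inverse_distance_predict(toy_distances, toy_ratings)
# for comparison: weighting by the raw distance, as user_based_predict does
farther_counts_more = float(np.dot(toy_distances, toy_ratings) / np.sum(toy_distances))

print(closer_counts_more, farther_counts_more)
```

On these toy values the inverse-distance prediction stays near the closest neighbour's rating, while the raw-distance weighting is pulled toward the farthest neighbour's rating.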


#### Conclusion:
For the test user examined, the prediction (about 4.01) is very close to the actual rating (4.0). This is a single example, so evaluating the predictions over the whole test set would give a fuller picture of how well the model performs.
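
One way to summarise accuracy over many users is to compute an error metric such as the mean absolute error (MAE) or the root mean squared error (RMSE) between predicted and actual ratings. The following is a minimal sketch with made-up numbers; the helper names and the toy values are hypothetical, and no results for the real data are implied.

```python
import numpy as np

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted ratings."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.abs(actual - predicted).mean())

def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference; penalises large misses more."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(((actual - predicted) ** 2).mean()))

# Hypothetical actual ratings and predictions for four test users
toy_actual = [4.0, 5.0, 3.0, 2.0]
toy_predicted = [4.0, 4.0, 4.0, 4.0]

mae = mean_absolute_error(toy_actual, toy_predicted)
rmse = root_mean_squared_error(toy_actual, toy_predicted)
print(mae, rmse)
```

In this notebook the predictions could be collected with something like `[user_based_predict(u) for u in test_x.index]` and compared with `test_y[8010]`. Comparing the result against a simple baseline, such as always predicting the training-set mean rating, would show whether the neighbours add anything.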
