## Amazon Recommendation System

In this project, I analyzed Amazon review data. The dataset comes from https://jmcauley.ucsd.edu/data/amazon/ and is a subset of a larger dataset, containing only reviews of musical instruments. During data cleaning, only UserID, ItemID, Rating, and Timestamp were kept, because the analysis focuses on the Item ID and Rating of each product.

The project's aim was to understand recommendation algorithms and use them to suggest products to new and existing users based on their preferences and behavior. To achieve this, I framed the analysis as a set of questions that business units could ask, so that the results stay aligned with how they would be used in practice. The focus was on working through recommendation algorithms that improve the user experience through personalized product recommendations.


In [1]:
##load some packages
import random
import pandas
import matplotlib.pyplot as plt
import numpy as np

## sparse matrices and nearest-neighbour search keep the large user-item matrix tractable
from scipy.sparse import csr_matrix as sparse_matrix
from sklearn.neighbors import NearestNeighbors
import os


In [2]:
## read in the data set
amazon = pandas.read_csv("ratings_Musical_Instruments.csv",names=("user","item","rating","timestamp"))

### Part 1: Descriptive Questions:


* How big is the dataset, in rows and columns?
* How many unique users are there?
* How many unique items are there?

In [3]:
# How big is the dataset, in rows and columns?
print("The dataset has", amazon.shape[0], "rows and", amazon.shape[1], "columns.")

# How many unique users are there?
unique_users = amazon['user'].nunique()
print("There are", unique_users, "unique users in the dataset.")

# How many unique items are there?
unique_items = amazon['item'].nunique()
print("There are", unique_items, "unique items in the dataset.")



The dataset has 500176 rows and 4 columns.
There are 339231 unique users in the dataset.
There are 83046 unique items in the dataset.



* The user who rated the most instruments has rated how many items?
* The item with the MOST ratings is what? How many ratings does it have?  Hint: Check out the Amazon website, by going to "www.amazon.com/dp/item_code", where you put the item into item_code.
* The item with the highest mean average rating is what?  What is the rating?

In [4]:
# The user who rated the most instruments has rated how many items?
user_ratings = amazon['user'].value_counts()
most_ratings_user = user_ratings.index[0]
most_ratings_count = user_ratings.iloc[0]
print("The user with the most ratings is", most_ratings_user, "and they have rated", most_ratings_count, "items.")

# The item with the MOST ratings is what? How many ratings does it have?
item_ratings = amazon['item'].value_counts()
most_ratings_item = item_ratings.index[0]
most_ratings_item_count = item_ratings.iloc[0]
print("The item with the most ratings is", most_ratings_item, "and it has", most_ratings_item_count, "ratings.")

# The item with the highest mean average rating is what? What is the rating?
item_mean_ratings = amazon.groupby('item')['rating'].mean()
highest_mean_item = item_mean_ratings.idxmax()
highest_mean_rating = item_mean_ratings.max()
print("The item with the highest mean average rating is", highest_mean_item, "and its rating is", highest_mean_rating)



The user with the most ratings is A2PAD826IH1HFE and they have rated 483 items.
The item with the most ratings is B000ULAP4U and it has 3523 ratings.
The item with the highest mean average rating is 0014072149 and its rating is 5.0


* What is the item with the lowest mean average?  What is the rating?
* If we built a matrix with all of the users and items, how large would it be? (dimensions, and how many total entries)
* Looking at the size of the dataset, what is the percentage of non-zero entries in the matrix?

In [5]:
# What is the item with the lowest mean average? What is the rating?
item_mean_ratings = amazon.groupby('item')['rating'].mean()
lowest_mean_item = item_mean_ratings.idxmin()
lowest_mean_rating = item_mean_ratings.min()
print("The item with the lowest mean average rating is", lowest_mean_item, "and its rating is", lowest_mean_rating)

# If we built a matrix with all of the users and items, how large would it be?
users = amazon["user"].unique()
items = amazon["item"].unique()
num_users = len(users)
num_items = len(items)

print("The user-item matrix would be", num_users, "rows by", num_items, "columns.")
print("It would have a total of", num_users * num_items, "entries.")

# Looking at the size of the dataset, what is the percentage of non-zero entries in the matrix?
matrix_rows, matrix_columns = len(users), len(items)
nonzero_entries = amazon.shape[0]
total_entries = matrix_rows * matrix_columns
percent_nonzero_entries = nonzero_entries / total_entries * 100
# Four decimal places: the matrix is so sparse that two decimals would round to 0.0
print("The percentage of non-zero entries in the user-item matrix is", round(percent_nonzero_entries, 4), "%.")

The item with the lowest mean average rating is 0201891859 and its rating is 1.0
The user-item matrix would be 339231 rows by 83046 columns.
It would have a total of 28171777626 entries.
The percentage of non-zero entries in the user-item matrix is 0.0018 %.


In [6]:
## Store the matrix dimensions for later use:

n = 83046 # the number of unique items
d = 339231 # the number of unique users


In [7]:
## Build a sparse ratings matrix X with items as rows (n) and users as columns (d)

def create_X(ratings,n,d,user_key="user",item_key="item"):
    user_mapper = dict(zip(np.unique(ratings[user_key]), list(range(d))))
    item_mapper = dict(zip(np.unique(ratings[item_key]), list(range(n))))

    user_inverse_mapper = dict(zip(list(range(d)), np.unique(ratings[user_key])))
    item_inverse_mapper = dict(zip(list(range(n)), np.unique(ratings[item_key])))

    user_ind = [user_mapper[i] for i in ratings[user_key]]
    item_ind = [item_mapper[i] for i in ratings[item_key]]

    X = sparse_matrix((ratings["rating"], (item_ind, user_ind)), shape=(n,d))
    
    return X, user_mapper, item_mapper, user_inverse_mapper, item_inverse_mapper, user_ind, item_ind



In [8]:
## define X in a new window

X, user_mapper, item_mapper, user_inverse_mapper, item_inverse_mapper, user_ind, item_ind = create_X(amazon, n=n, d=d)
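
To make the layout of `X` concrete, here is a minimal sketch of the same construction on a tiny, hypothetical ratings table (the IDs and ratings in `toy` are made up for illustration). It follows the pattern used in `create_X`: sorted unique IDs become matrix indices, items go on the rows, and users go on the columns.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy ratings table (hypothetical data, same column names as `amazon`)
toy = pd.DataFrame({
    "user":   ["u1", "u1", "u2", "u3"],
    "item":   ["i1", "i2", "i2", "i3"],
    "rating": [5.0, 3.0, 4.0, 1.0],
})

# np.unique returns sorted IDs, so each ID's position is its matrix index
user_ids = np.unique(toy["user"])
item_ids = np.unique(toy["item"])
user_mapper = {u: k for k, u in enumerate(user_ids)}
item_mapper = {i: k for k, i in enumerate(item_ids)}

rows = toy["item"].map(item_mapper).to_numpy()   # item index -> row
cols = toy["user"].map(user_mapper).to_numpy()   # user index -> column
X_toy = csr_matrix((toy["rating"].to_numpy(), (rows, cols)),
                   shape=(len(item_ids), len(user_ids)))

print(X_toy.shape)      # (number of items, number of users)
print(X_toy.toarray())  # dense view; only sensible for a toy matrix
```

Only the stored ratings take up memory; every other cell is an implicit zero, which is what makes a matrix of this size workable for the full dataset.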

### Part 2: Try some Machine Learning

Here, I have built a basic recommender for similar items.


I used KNN (k-nearest neighbors) in two iterations to find the 5 items closest to a specific item. I also adjusted the code so that the algorithm does not return the original item in the list.

In [9]:
### To return the item ID for a result, I'm going to use a helper function.

def amazon_name(index):
    # Returns the 'item' value of the row in `amazon` whose DataFrame label is `index`
    name = amazon.loc[index, 'item']
    return name

def find_nearestneighbour(model, X, query_ind):
    nbs = model.kneighbors(X[query_ind], return_distance = False)
    return(nbs)


In [10]:
##fit the k neighbours model
model = NearestNeighbors(n_neighbors=6) # set n_neighbors to 6 to exclude the item itself
model.fit(X)

##find the neighbours for item 6428320
query_index = 0 # index of the item you are trying to find similar items for
distances, indices = model.kneighbors(X[query_index], n_neighbors=6) # add one more neighbor to exclude the item itself

##return item names
print("by Euclidean distance, the 5 most similar items to item 6428320 are:")
for i in range(1, 6): # start the loop from 1 to exclude the first item (which is itself)
    index = indices[0][i]
    print(amazon_name(index))

by Euclidean distance, the 5 most similar items to item 6428320 are:
B00000DDLN
B0002E4Z8M
B0002F4WDE
B0002F4WDE
B0002CZVW8
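
Because `create_X` assigns row indices from the sorted unique item IDs, a neighbour's row index maps back to an item ID through `item_inverse_mapper`, not through the row position in `amazon`. The following is a minimal sketch of that lookup on hypothetical toy data; the helper name `similar_items` and the toy ratings are assumptions made for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical item-by-user ratings (rows = items, columns = users)
item_ids = ["i1", "i2", "i3", "i4"]
X_toy = csr_matrix(np.array([
    [5.0, 4.0, 0.0],
    [5.0, 4.0, 1.0],
    [0.0, 0.0, 5.0],
    [4.0, 5.0, 0.0],
]))
item_inverse_mapper = dict(enumerate(item_ids))  # row index -> item ID

def similar_items(model, X, query_row, k):
    """Return the IDs of the k rows nearest to query_row, excluding itself."""
    _, idx = model.kneighbors(X[query_row], n_neighbors=k + 1)
    neighbours = [int(i) for i in idx[0] if i != query_row][:k]
    return [item_inverse_mapper[i] for i in neighbours]

model = NearestNeighbors().fit(X_toy)  # Euclidean distance by default
result = similar_items(model, X_toy, query_row=0, k=2)
print(result)
```

Filtering on the row index, rather than always dropping the first result, also covers the case where another item has an identical rating vector and ties with the query at distance zero.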


* Explain why you decided to use this model and create recommendation system.

KNN, or k-nearest neighbors, is a valuable technique in recommendation systems, particularly in user-based collaborative filtering. This approach leverages the ratings of similar users to predict the rating of an item for a specific user. The underlying principle is that individuals who have shown preferences for similar items in the past are likely to exhibit similar preferences in the future. KNN serves as an effective method for making recommendations by identifying and considering the preferences of users with similar tastes.

Using the work above, I built a recommendation system based on KNN, tracing back through the sparse matrix to find a user's 5 nearest neighbors (users with similar taste).
Once that was done, the program saved the list of the 5 most similar users and looked up their ratings in the dataframe.
The program then computed the average rating of each item, took the 5 items with the highest averages, and displayed the item IDs along with their ratings.

* Assumption:
 Users who chose and rated similar items are likely to have correlated preferences.
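
The approach described above searches for similar users, which assumes a matrix whose rows are users. `create_X` returns `X` with items as rows and users as columns, so one way to get user-to-user neighbors is to fit the model on the transpose. This is a minimal sketch on hypothetical toy data; the names `X_users` and `user_model` and all IDs and ratings are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical item-by-user matrix, laid out like the output of create_X
user_ids = ["u1", "u2", "u3", "u4"]
X_items_by_users = csr_matrix(np.array([
    [5.0, 5.0, 0.0, 1.0],   # item i1
    [4.0, 5.0, 0.0, 0.0],   # item i2
    [0.0, 0.0, 5.0, 4.0],   # item i3
]))

# Transpose so that each row is one user's rating vector
X_users = X_items_by_users.T.tocsr()

user_mapper = {u: k for k, u in enumerate(user_ids)}
user_inverse_mapper = {k: u for u, k in user_mapper.items()}

user_model = NearestNeighbors().fit(X_users)  # Euclidean distance by default
query = user_mapper["u1"]
_, idx = user_model.kneighbors(X_users[query], n_neighbors=3)

# Drop the query user, then translate row indices back to user IDs
similar_users = [user_inverse_mapper[int(i)] for i in idx[0] if i != query][:2]
print(similar_users)
```

With the users on the rows, the query index comes from `user_mapper` and the results are translated back with `user_inverse_mapper`, so every step refers to the same index space.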

In [11]:
def amazon_name (index):
    name =amazon.loc[index,]['user']
    return(name)

def find_nearestneighbour(model, X, query_ind):
    nbs = model.kneighbors(X[query_ind], return_distance = False)
    return(nbs)


In [12]:
##fit the k neighbours model
model = NearestNeighbors(n_neighbors=6) # set n_neighbors to 6 to exclude the item itself
model.fit(X)

##find the neighbours for user A1YS9MDZP93857
query_index = 0 
distances, indices = model.kneighbors(X[query_index], n_neighbors=6) # add one more neighbor to exclude the item itself

##return user IDs
print("by Euclidean distance, the 5 most similar users to userID A1YS9MDZP93857 are:")
# list of similar users (named to avoid shadowing the built-in `list`)
similar_users = []
for i in range(1, 6): # start the loop from 1 to exclude the first item (which is itself)
    index = indices[0][i]
    similar_users.append(amazon_name(index))
print(similar_users)

by Euclidean distance, the 5 most similar users to userID A1YS9MDZP93857 are:
['A2ZEHHKT2ZLJVR', 'ABW2RYQ718C00', 'A26ODZQ7JCFHDO', 'A3KQE8IT5TPG3W', 'A39TYRIZLTCK9P']


In [13]:
# Find the ratings for the selected users
user_ratings = amazon[amazon['user'].isin(similar_users)]

# Compute the average rating for each item
item_avg_ratings = user_ratings.groupby('item')['rating'].mean()

# Find the top 5 items with the highest average rating
top_items = item_avg_ratings.sort_values(ascending=False).head(5)

print("For the user A1YS9MDZP93857, the algorithm recommends the following top 5 items with the highest average rating based on similar users:\n", top_items)

For the user A1YS9MDZP93857, the algorithm recommends the following top 5 items with the highest average rating based on similar users:
 item
B00000DDLN    5.0
B000068NVI    5.0
B0002E4Z8M    5.0
B0002F4WDE    5.0
B000EEJ8IM    5.0
Name: rating, dtype: float64


* What are the three top products you would recommend to a new user with no rating or purchasing history?


For a new user without any rating or purchasing history, I would recommend the three products with the highest average ratings. This strategy aims to improve customer satisfaction by suggesting items that have consistently received positive feedback: products that earlier customers rated well are more likely to suit a new user whose preferences are still unknown.

Therefore, recommending the items with the highest average rating is the best available option.

If we could give users a quick survey about their preferences when they create an account, we might be able to recommend better products even without any rating or purchasing history. Alternatively, we could let users create a profile with their Google email so that their search history can be integrated to provide better recommendations.

In [14]:
# Calculate the average rating for each item
item_ratings = amazon.groupby('item')['rating'].mean()

# Sort the items based on their average rating
top_items = item_ratings.sort_values(ascending=False).head(3).index.values

# Print the resulting top items
print(top_items)

['B001QKQXWW' 'B002GNOMK8' 'B002GOSEME']
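
Ranking purely by mean rating tends to favor items with very few ratings, since a single 5-star review yields a perfect 5.0 average. One possible refinement, sketched here on hypothetical data, is to require a minimum number of ratings before an item qualifies. The threshold `min_ratings` is an assumption and would need tuning on the real dataset.

```python
import pandas as pd

# Hypothetical ratings; "i3" has a single 5-star rating
toy = pd.DataFrame({
    "item":   ["i1", "i1", "i1", "i2", "i2", "i2", "i3"],
    "rating": [5.0, 4.0, 5.0, 4.0, 4.0, 3.0, 5.0],
})

min_ratings = 3  # assumed threshold, not taken from the original analysis

# Mean rating and number of ratings per item
stats = toy.groupby("item")["rating"].agg(["mean", "count"])

# Keep only items with enough ratings, then rank by mean (ties broken by count)
popular = stats[stats["count"] >= min_ratings]
top_items = popular.sort_values(["mean", "count"], ascending=False).head(2)
print(top_items)
```

In this toy table, "i3" has the highest mean but is excluded because it rests on a single rating, so the ranking reflects items that many customers have rated well.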


* What rating do you think a new user would give to item "B009CIIWQA"?

In [15]:
item_id = "B009CIIWQA"
mean_rating = amazon.loc[amazon["item"] == item_id, "rating"].mean()
print("The new user would give item", item_id, mean_rating, "out of 5")

The new user would give item B009CIIWQA 4.008849557522124 out of 5



* What are the top three products you would recommend to user "A27L1LDJZVRLJD"?
* What rating do you think user "A27L1LDJZVRLJD" would give to the LED music lamp, item "B009CIIWQA"?


In [16]:
from scipy.sparse import csr_matrix

# Note: create_X builds `X` with item IDs as rows and user IDs as columns.
# `X` stores ratings only; the IDs themselves are kept in `user_mapper` and `item_mapper`.

# Find the index of user "A27L1LDJZVRLJD"
user_index = np.where(X[:, X.getcol(0).toarray().ravel() == 'A27L1LDJZVRLJD'].toarray().ravel() > 0)[0][0]

print("Index of user 'A27L1LDJZVRLJD':", user_index)

Index of user 'A27L1LDJZVRLJD': 9066
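
Since `create_X` already returns `user_mapper`, a user's matrix index can also be read directly from that dictionary. The sketch below rebuilds the same kind of mapper on hypothetical user IDs to show the lookup; the IDs are made up for illustration.

```python
import numpy as np

# Hypothetical user IDs in the order they appear in a ratings table
raw_users = np.array(["u3", "u1", "u2", "u1"])

# Same construction as in create_X: sorted unique IDs -> 0..d-1
unique_users = np.unique(raw_users)
user_mapper = dict(zip(unique_users, range(len(unique_users))))

user_index = user_mapper["u2"]
print("Index of user 'u2':", user_index)
```

The index returned this way is the position of the ID in sorted order, which is the same position `create_X` uses for that user in the matrix.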



In [17]:
### To return the user ID for a result, I'm going to need some helper functions

def amazon_name (index):
    name =amazon.loc[index,]['user']
    return(name)


def find_nearestneighbour(model, X, query_ind):
    nbs = model.kneighbors(X[query_ind], return_distance = False)
    return(nbs)

In [18]:
##fit the k neighbours model
model = NearestNeighbors(n_neighbors=6) # set n_neighbors to 6 to exclude the item itself
model.fit(X)

##find the neighbours for user A27L1LDJZVRLJD
query_index = 9066 # index of the item you are trying to find similar user for
distances, indices = model.kneighbors(X[query_index], n_neighbors=6) # add one more neighbor to exclude the item itself

##return user IDs
print("by Euclidean distance, the 5 most similar users to userID A27L1LDJZVRLJD are:")
# list of similar users (named to avoid shadowing the built-in `list`)
similar_users = []
for i in range(1, 6): # start the loop from 1 to exclude the first item (which is itself)
    index = indices[0][i]
    similar_users.append(amazon_name(index))
print(similar_users)

by Euclidean distance, the 5 most similar users to userID A27L1LDJZVRLJD are:
['AYJ06K64P1316', 'A2ZMBMOTP195Z1', 'A1TCG8ZFPCPZVW', 'A3N8WO094E7QFH', 'APDW069J09OEL']


In [20]:
# Find the index of the item "B009CIIWQA"
item_index = item_mapper['B009CIIWQA']

# Get ratings of similar users for the item "B009CIIWQA"
similar_users_ratings = X[indices[:, 1:], item_index].toarray().ravel()

# Calculate the predicted rating as the average of similar users' ratings
predicted_rating = np.mean(similar_users_ratings)

# Print the predicted rating
print("Predicted rating for user 'A27L1LDJZVRLJD' on item 'B009CIIWQA':", predicted_rating)

Predicted rating for user 'A27L1LDJZVRLJD' on item 'B009CIIWQA': 0.0


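None of the 5 users most similar to user ID A27L1LDJZVRLJD has rated the item "B009CIIWQA", so every value averaged above is an empty (zero) entry and the predicted rating comes out as 0.0. One reading is that these users were not interested in the product, which would suggest that user 'A27L1LDJZVRLJD' might also rate it low after purchasing it. A missing rating is not the same as a low rating, however, so this result is better treated as inconclusive than as evidence of a low score.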
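
One way to keep empty entries from dragging the estimate to zero is to average only over the neighbors who actually rated the item, and to fall back to the item's overall mean when none of them did. This is a minimal sketch under that assumption, on a hypothetical user-by-item matrix; the function `predict_rating` is illustrative and not part of the original analysis.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical user-by-item matrix (rows = users, columns = items); 0 = no rating
X_users = csr_matrix(np.array([
    [5.0, 0.0, 3.0],
    [4.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
]))

def predict_rating(X_users, neighbour_rows, item_col):
    """Mean rating of the neighbours who rated the item, else the item mean."""
    col = X_users[:, item_col].toarray().ravel()
    neighbour_ratings = col[neighbour_rows]
    rated = neighbour_ratings[neighbour_ratings > 0]
    if rated.size:
        return float(rated.mean())
    all_rated = col[col > 0]
    # NaN signals that nobody has rated the item at all
    return float(all_rated.mean()) if all_rated.size else float("nan")

print(predict_rating(X_users, [0, 1], 0))  # both neighbours rated item 0
print(predict_rating(X_users, [1], 2))     # neighbour did not rate item 2
print(predict_rating(X_users, [0, 1], 1))  # item 1 has no ratings
```

The fallback mirrors the new-user case earlier in the notebook, where the item's mean rating serves as the estimate when no personal signal is available.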