## Recommendation Algorithm
This project analyzes Amazon product ratings. The data comes from https://jmcauley.ucsd.edu/data/amazon/ and is a subset of a much larger dataset, restricted to ratings of musical instruments.

In [105]:
##load some packages

import random
import pandas
import matplotlib.pyplot as plt
import numpy as np

## sparse matrices store only the non-zero ratings, which keeps memory use manageable
from scipy.sparse import csr_matrix as sparse_matrix

from sklearn.neighbors import NearestNeighbors


import os

## You can add more here, if you need it!


In [106]:
## read in the data set
amazon = pandas.read_csv("ratings_Musical_Instruments.csv",names=("user","item","rating","timestamp"))

### Part 1: Descriptive Questions:

1. How big is the dataset, in rows and columns?
2. How many unique users are there?
3. How many unique items are there?
4. The user who rated the most instruments has rated how many items?
5. The item with the MOST ratings is what? How many ratings does it have?  Hint: Check out the Amazon website, by going to "www.amazon.com/dp/item_code", where you put the item into item_code.
6. The item with the highest mean average rating is what?  What is the rating?
7. What is the item with the lowest mean average?  What is the rating?
8. If we built a matrix with all of the users and items, how large would it be? (dimensions, and how many total entries)
9. Looking at the size of the dataset, what is the percentage of non-zero entries in the matrix?

In [107]:
# How big is the dataset, in rows and columns?
print("Dataset size: ", amazon.shape)

d = amazon['user'].nunique()
n = amazon['item'].nunique()

# How many unique users are there?
print("Unique users: ", d)

# How many unique items are there?
print("Unique items: ", n)

# The user who rated the most instruments has rated how many items?
user_ratings = amazon.groupby('user')['rating'].count()
print("User with the most ratings: ", user_ratings.idxmax())
print("Number of ratings: ", user_ratings.max())

# The item with the MOST ratings is what? How many ratings does it have?
item_ratings = amazon.groupby('item')['rating'].count()
print("Item with the most ratings: ", item_ratings.idxmax())
print("Number of ratings: ", item_ratings.max())

# The item with the highest mean average rating is what? What is the rating?
item_mean_ratings = amazon.groupby('item')['rating'].mean()
print("Item with the highest mean rating: ", item_mean_ratings.idxmax())
print("Mean rating: ", item_mean_ratings.max())

# What is the item with the lowest mean average? What is the rating?
print("Item with the lowest mean rating: ", item_mean_ratings.idxmin())
print("Mean rating: ", item_mean_ratings.min())

# If we built a matrix with all of the users and items, how large would it be?
print("Size of full X matrix (GB):", (n*d)*8/1e9)

# Looking at the size of the dataset, what is the percentage of non-zero entries in the matrix?
# (this is the density of the matrix; it assumes each user rated each item at most once)
density = (len(amazon) / (n * d))*100
print("Percentage of non-zero entries:", density)

Dataset size:  (500176, 4)
Unique users:  339231
Unique items:  83046
User with the most ratings:  A2PAD826IH1HFE
Number of ratings:  483
Item with the most ratings:  B000ULAP4U
Number of ratings:  3523
Item with the highest mean rating:  0014072149
Mean rating:  5.0
Item with the lowest mean rating:  0201891859
Mean rating:  1.0
Size of full X matrix (GB): 225.374221008
Percentage of non-zero entries: 0.0017754506181334572
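
The question about the full matrix asks for dimensions and total entries, which the cell above reports only indirectly as a memory estimate. A minimal sketch of the arithmetic on a toy ratings table follows. The `toy` frame and its values are made up for illustration, and the 8 bytes per entry assumes float64 storage.

```python
import pandas as pd

# Toy ratings table (hypothetical values, same column names as the real data)
toy = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u3"],
    "item": ["i1", "i2", "i1", "i3"],
    "rating": [5.0, 3.0, 4.0, 1.0],
})

n_items = toy["item"].nunique()
n_users = toy["user"].nunique()
total_entries = n_items * n_users

# Count distinct (user, item) pairs so repeated ratings are not double counted
filled = len(toy.drop_duplicates(["user", "item"]))
density_pct = 100 * filled / total_entries

print("Dimensions (items x users):", (n_items, n_users))
print("Total entries:", total_entries)
print("Dense float64 size (bytes):", total_entries * 8)
print(f"Percentage of non-zero entries: {density_pct:.2f}%")
```

With the counts printed above (83046 items and 339231 users), the same arithmetic gives 28,171,777,626 total entries, which at 8 bytes each is the 225.37 GB shown.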


In [110]:
## Here, redefine the following, for later on:

n = amazon['item'].nunique() # the number of unique items (not 10!)
d = amazon['user'].nunique() # the number of unique users (more than 6!)

## These will need to be correct for you to finish

In [111]:
## Here we are making the same sparse matrix that we made in class
## Look it over, but don't worry overly

def create_X(ratings,n,d,user_key="user",item_key="item"):
    user_mapper = dict(zip(np.unique(ratings[user_key]), list(range(d))))
    item_mapper = dict(zip(np.unique(ratings[item_key]), list(range(n))))

    user_inverse_mapper = dict(zip(list(range(d)), np.unique(ratings[user_key])))
    item_inverse_mapper = dict(zip(list(range(n)), np.unique(ratings[item_key])))

    user_ind = [user_mapper[i] for i in ratings[user_key]]
    item_ind = [item_mapper[i] for i in ratings[item_key]]

    X = sparse_matrix((ratings["rating"], (item_ind, user_ind)), shape=(n,d))
    
    return X, user_mapper, item_mapper, user_inverse_mapper, item_inverse_mapper, user_ind, item_ind
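
To make the mapping concrete, here is a sketch of the same idea on a toy table. It uses `pandas.factorize` in place of the dictionaries above, which is a swapped-in shortcut rather than the notebook's method, and the `toy` data is made up.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy ratings table (hypothetical values)
toy = pd.DataFrame({
    "user": ["u2", "u1", "u2", "u3"],
    "item": ["b", "a", "a", "c"],
    "rating": [4.0, 5.0, 3.0, 1.0],
})

# factorize(sort=True) assigns integer codes in sorted label order,
# the same ordering np.unique produces in create_X
user_codes, user_labels = pd.factorize(toy["user"], sort=True)
item_codes, item_labels = pd.factorize(toy["item"], sort=True)

# Rows are items and columns are users, matching create_X
X_toy = csr_matrix(
    (toy["rating"].to_numpy(), (item_codes, user_codes)),
    shape=(len(item_labels), len(user_labels)),
)

print(list(item_labels), list(user_labels))
print(X_toy.toarray())
```

Each row of the result is one item's rating vector across all users, which is what the nearest-neighbor models later compare.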



In [112]:
## define X in a new window
## For this to work, you need to define n and d properly above!
## The error you see is BECAUSE n and d are incorrect

X, user_mapper, item_mapper, user_inverse_mapper, item_inverse_mapper, user_ind, item_ind = create_X(amazon, n=n, d=d)

## Part 2: Try some Machine Learning

Thinking of the code from our movie exploration in class, build a BASIC recommender for similar items.  You can do ANY of the following, but you need to choose at least 1:
* Use KNN with at least two different iterations to find 5 items that are close to a specific item.  In class, my algorithm always returned the original movie in the list - can you modify the code so this doesn't happen?
* Use *any type of clustering* to cluster items OR users into groups that are similar.
* Can you decompose the matrix via PCA?  How can this help us recommend?


In [113]:
# initialize the KNN model with Euclidean distance metric
model_euc = NearestNeighbors(n_neighbors=6, metric='euclidean')

# fit the model using the X matrix
model_euc.fit(X)

# choose a random item as the query item
query_item = np.random.choice(list(item_mapper.keys()))

# find the k nearest neighbors of the query item using Euclidean distance metric
query_item_index = item_mapper[query_item]
distances_euc, indices_euc = model_euc.kneighbors(X[query_item_index].reshape(1, -1))

# drop the first neighbor, assuming it is the query item
# (not guaranteed: items that tie at distance 0 can be returned ahead of it)
similar_items_euc = [item_inverse_mapper[indices_euc.flatten()[i]] 
                     for i in range(1, len(indices_euc.flatten()))]
print("Items similar to", query_item, "using Euclidean distance metric:", similar_items_euc)


# initialize the KNN model with cosine distance metric
model_cos = NearestNeighbors(n_neighbors=6, metric='cosine')

# fit the model using the X matrix
model_cos.fit(X)

# find the k nearest neighbors of the query item using cosine distance metric
distances_cos, indices_cos = model_cos.kneighbors(X[query_item_index].reshape(1, -1))

# drop the first neighbor, assuming it is the query item
# (not guaranteed: items that tie at distance 0 can be returned ahead of it)
similar_items_cos = [item_inverse_mapper[indices_cos.flatten()[i]] 
                     for i in range(1, len(indices_cos.flatten()))]
print("Items similar to", query_item, "using cosine distance metric:", similar_items_cos)

Items similar to B001J77DQC using Euclidean distance metric: ['B003SPJ4MO', 'B001J77DQC', 'B003LUH61M', 'B002KPINOS', 'B000KGGLHU']
Items similar to B001J77DQC using cosine distance metric: ['B004I65D8G', 'B001J77DQC', 'B003LUH61M', 'B000KGGLHU', 'B001EYX0DK']
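
The query item B001J77DQC still appears in both lists, so dropping the first neighbor by position did not remove it. A sketch of a more robust approach is to ask for one extra neighbor and filter by index. The toy matrix and the helper name `similar_items` are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy item x user matrix (made up). Items 0 and 1 have identical rating
# vectors, so they tie at distance 0 and either may be returned first.
X_toy = csr_matrix(np.array([
    [5.0, 0.0, 3.0, 0.0],
    [5.0, 0.0, 3.0, 0.0],
    [4.0, 0.0, 0.0, 1.0],
    [0.0, 2.0, 0.0, 5.0],
]))

def similar_items(X, query_index, k, metric="cosine"):
    """Return up to k neighbor indices, never including query_index (hypothetical helper)."""
    n_neighbors = min(k + 1, X.shape[0])
    model = NearestNeighbors(n_neighbors=n_neighbors, metric=metric, algorithm="brute")
    model.fit(X)
    _, indices = model.kneighbors(X[query_index])
    # Filter by index rather than by position in the result
    return [int(i) for i in indices.flatten() if i != query_index][:k]

neighbors = similar_items(X_toy, query_index=1, k=2)
print(neighbors)
```

Filtering by index works whether or not the query item comes back first, at the cost of requesting one extra neighbor.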


Using your work above, do the following:

1. Create *some sort* of recommendation system, and 
* EXPLAIN what you have done and why
* What are the three top products you would recommend to a new user with no rating or purchasing history (the "cold start" problem).
* What rating do you think a new user would give to item "B009CIIWQA" (a rechargeable Music Stand LED lamp, at https://www.amazon.com/dp/B009CIIWQA)?
* What are the top three products you would recommend to user "A27L1LDJZVRLJD"?
* What rating do you think user "A27L1LDJZVRLJD" would give to the LED music lamp, item "B009CIIWQA"?

## Explain what I have done and why
This code is a basic item-to-item recommender. For a randomly chosen query item, it uses k-nearest neighbors (KNN) to find similar items under two distance metrics, Euclidean and cosine.

Each KNN model is initialized with k = 6, so it returns six neighbors: ideally the query item itself plus five similar items. The model is fit on the sparse matrix X, in which each row is an item and each column holds one user's ratings. A random query item is then chosen, and each model returns that item's nearest neighbors under its own metric.

The code drops the first neighbor in each result on the assumption that it is the query item, and prints the rest. That assumption does not always hold. In the output above, B001J77DQC still appears in both lists, most likely because other items have identical (or, for cosine, proportional) rating vectors and tie with it at distance zero, so the query item is not guaranteed to come back first. Filtering the results by item index instead of by position would avoid this.
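
The two metrics can disagree because Euclidean distance is sensitive to the magnitude of the ratings, while cosine distance compares only the direction of the rating vectors. A small sketch with made-up vectors (the helper functions are written out here for illustration, not taken from the notebook):

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two rating vectors."""
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    """One minus the cosine of the angle between two rating vectors."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up rating vectors over the same four users (0 = not rated)
item_a = np.array([5.0, 4.0, 0.0, 0.0])
item_b = np.array([2.5, 2.0, 0.0, 0.0])   # same pattern as item_a at half the scale
item_c = np.array([5.0, 0.0, 0.0, 4.0])   # same scale, partly different users

print("a vs b:", euclidean_distance(item_a, item_b), cosine_distance(item_a, item_b))
print("a vs c:", euclidean_distance(item_a, item_c), cosine_distance(item_a, item_c))
```

Here `item_b` is at cosine distance zero from `item_a` even though the Euclidean distance between them is well above zero, which is why the two models can return different neighbor lists for the same query.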


## What are the three top products you would recommend to a new user with no rating or purchasing history (the "cold start" problem).

In [114]:
item_mean_ratings = amazon.groupby('item')['rating'].mean()
top_items = item_mean_ratings.sort_values(ascending=False)[:3]
print("Top 3 items to recommend to a new user:")
for item, rating in top_items.items():  # Series.iteritems() was removed in pandas 2.0
    print(f"- {item}: {rating:.2f}")

Top 3 items to recommend to a new user:
- B001QKQXWW: 5.00
- B002GNOMK8: 5.00
- B002GOSEME: 5.00
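
Ranking by raw mean favors items with very few ratings, since a single 5-star rating is enough to reach 5.00. One common variation, sketched here as an assumption rather than as the notebook's method, is to require a minimum number of ratings before an item can be recommended. The `toy` data and the `min_ratings` threshold are hypothetical.

```python
import pandas as pd

# Toy ratings (made up): item "a" has a perfect mean from a single rating
toy = pd.DataFrame({
    "item":   ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"],
    "rating": [5.0, 5.0, 4.0, 5.0, 4.0, 4.0, 3.0, 2.0, 5.0, 5.0],
})

min_ratings = 3  # hypothetical cutoff; tune to the dataset

stats = toy.groupby("item")["rating"].agg(["mean", "count"])
eligible = stats[stats["count"] >= min_ratings]

# Sort by mean rating, breaking ties with the number of ratings
top_items = eligible.sort_values(["mean", "count"], ascending=False).head(3)

print("Top 3 items to recommend to a new user:")
for item, row in top_items.iterrows():
    print(f"- {item}: {row['mean']:.2f} ({int(row['count'])} ratings)")
```

The threshold trades coverage for reliability: a higher cutoff excludes more niche items but makes the mean less sensitive to one enthusiastic reviewer.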


## What rating do you think a new user would give to item "B009CIIWQA" (a rechargeable Music Stand LED lamp, at https://www.amazon.com/dp/B009CIIWQA)?

In [115]:
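query_item = "B009CIIWQA"
query_item_index = item_mapper[query_item]

# find the lamp's nearest neighbors under cosine distance
distances, indices = model_cos.kneighbors(X[query_item_index].reshape(1, -1))

# mean observed rating of each neighbor (sum of ratings / number of ratings),
# skipping the first neighbor, which is assumed to be the query item itself
neighbor_ratings = [X[neighbor_index, :].sum() / (X[neighbor_index, :] != 0).sum() for neighbor_index in indices.flatten()[1:]]

# average the neighbors' mean ratings to estimate what a new user would give
estimated_rating = sum(neighbor_ratings) / len(neighbor_ratings)
print(f"A new user would likely give item {query_item} a rating of {estimated_rating:.2f}")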

A new user would likely give item B009CIIWQA a rating of 4.00
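
For a user with no history, a simpler baseline is the item's own mean rating, optionally pulled toward the global mean when the item has few ratings. The sketch below shows such a damped mean on made-up data. The helper name `damped_mean` and the damping constant `m` are assumptions, and this is an alternative baseline rather than the neighbor-based estimate used above.

```python
import pandas as pd

# Toy ratings (made up)
toy = pd.DataFrame({
    "item":   ["lamp", "lamp", "stand", "stand", "stand", "stand"],
    "rating": [5.0, 4.0, 3.0, 4.0, 2.0, 3.0],
})

def damped_mean(ratings, item, m=5.0):
    """Item mean shrunk toward the global mean by m pseudo-ratings (hypothetical helper)."""
    global_mean = ratings["rating"].mean()
    item_ratings = ratings.loc[ratings["item"] == item, "rating"]
    if item_ratings.empty:
        return float(global_mean)  # unseen item: fall back to the global mean
    return float((item_ratings.sum() + m * global_mean) / (len(item_ratings) + m))

estimate = damped_mean(toy, "lamp")
print(f"Estimated rating for a new user: {estimate:.2f}")
```

With only two ratings, the lamp's estimate sits between its raw mean and the global mean; as an item collects more ratings, the estimate moves toward the item's own mean.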


## What are the top three products you would recommend to user "A27L1LDJZVRLJD"?

In [116]:
# get the user ID
user_id = "A27L1LDJZVRLJD"

# get the index of the user in the X matrix
user_index = user_mapper[user_id]

# find the items that the user has already rated
rated_items = set(amazon.loc[amazon['user']==user_id, 'item'])

# find the k nearest neighbors of the items the user has already rated
similar_items = set()
for item in rated_items:
    item_index = item_mapper[item]
    distances, indices = model_cos.kneighbors(X[item_index].reshape(1, -1))
    similar_items.update([item_inverse_mapper[i] for i in indices.flatten()[1:]])

# select up to three candidates that the user has not rated
# (set iteration order is arbitrary, so these three are not ranked by similarity)
recommendations = []
for item in similar_items:
    if item not in rated_items:
        recommendations.append(item)
    if len(recommendations) == 3:
        break

print("Top three products recommended to user", user_id, ":", recommendations)


Top three products recommended to user A27L1LDJZVRLJD : ['B00HTXIP4E', 'B00I9ZITRY', 'B00BUME2XS']
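
Because `similar_items` is a set, the loop above returns whichever three unrated items happen to come first. A sketch of one way to rank candidates instead is to keep, for each candidate, the smallest cosine distance to any item the user has rated. The toy matrix and the helper name `recommend_for_user` are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy item x user matrix (rows are items, columns are users; 0 = not rated)
X_toy = csr_matrix(np.array([
    [5.0, 4.0, 0.0, 0.0],   # item 0
    [4.0, 5.0, 0.0, 1.0],   # item 1
    [0.0, 1.0, 5.0, 4.0],   # item 2
    [5.0, 0.0, 1.0, 0.0],   # item 3
    [0.0, 0.0, 4.0, 5.0],   # item 4
]))

def recommend_for_user(X, user_index, top_n=3, k=4):
    """Rank unrated items by their closest distance to any rated item (hypothetical helper)."""
    model = NearestNeighbors(n_neighbors=min(k + 1, X.shape[0]),
                             metric="cosine", algorithm="brute").fit(X)
    rated = set(X[:, user_index].nonzero()[0].tolist())
    best = {}  # candidate item index -> smallest distance to any rated item
    for item_index in rated:
        distances, indices = model.kneighbors(X[item_index])
        for dist, idx in zip(distances.flatten(), indices.flatten()):
            idx = int(idx)
            if idx not in rated:
                best[idx] = min(best.get(idx, np.inf), float(dist))
    # Closest candidates first
    return sorted(best, key=best.get)[:top_n]

recommendations = recommend_for_user(X_toy, user_index=0)
print(recommendations)
```

Sorting by distance makes the result deterministic and puts the items most similar to the user's history first.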


## What rating do you think user "A27L1LDJZVRLJD" would give to the LED music lamp, item "B009CIIWQA"?

In [117]:
# Find the index of the user "A27L1LDJZVRLJD" in the X matrix
user_index = user_mapper["A27L1LDJZVRLJD"]

# Find the index of the item "B009CIIWQA" in the X matrix
item_index = item_mapper["B009CIIWQA"]

# Find the indices of the k nearest neighbors of the item "B009CIIWQA" using the KNN model
distances, indices = model_cos.kneighbors(X[item_index].reshape(1, -1))

# Get this user's ratings of the neighboring items from the X matrix
# (a 0 means the user has not rated that item)
neighbor_ratings = X[indices.flatten(), user_index].toarray().flatten()

# Compute the weighted average of those ratings, weighting closer neighbors more heavily;
# the small constant avoids division by zero when a distance is exactly 0
weights = 1 / (distances.flatten() + 1e-9)
predicted_rating = np.dot(neighbor_ratings, weights) / np.sum(weights)

# Return the predicted rating for user "A27L1LDJZVRLJD"
print("Predicted rating for user A27L1LDJZVRLJD on item B009CIIWQA:", predicted_rating)


Predicted rating for user A27L1LDJZVRLJD on item B009CIIWQA: 0.0
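
A prediction of 0.0 is not a rating on the 1 to 5 scale. It indicates that the user has not rated any of the neighboring items, so every term in the weighted average is a missing value stored as zero. A sketch of one way to handle this is to average only over neighbors the user actually rated, weight by similarity, and otherwise fall back to the item's mean rating. The toy matrix, the helper name `predict_rating`, and the fallback choice are assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy item x user matrix (rows are items, columns are users; 0 = not rated)
X_toy = csr_matrix(np.array([
    [5.0, 4.0, 0.0, 0.0],   # item 0
    [4.0, 5.0, 0.0, 1.0],   # item 1
    [0.0, 1.0, 5.0, 4.0],   # item 2
    [5.0, 0.0, 1.0, 0.0],   # item 3
    [0.0, 0.0, 4.0, 5.0],   # item 4
]))

def predict_rating(X, user_index, item_index, k=3):
    """Similarity-weighted average over rated neighbors, with a fallback (hypothetical helper)."""
    model = NearestNeighbors(n_neighbors=min(k + 1, X.shape[0]),
                             metric="cosine", algorithm="brute").fit(X)
    distances, indices = model.kneighbors(X[item_index])
    distances, indices = distances.flatten(), indices.flatten()

    keep = indices != item_index                  # drop the target item itself
    distances, indices = distances[keep], indices[keep]

    user_ratings = X[indices, user_index].toarray().flatten()
    rated = user_ratings > 0                      # only neighbors the user rated
    if not rated.any():
        # Fallback: mean of the item's existing ratings
        item_row = X[item_index]
        return float(item_row.sum() / item_row.nnz)

    weights = 1.0 - distances[rated]              # cosine similarity as weight
    if weights.sum() <= 0:
        return float(user_ratings[rated].mean())
    return float(np.dot(user_ratings[rated], weights) / weights.sum())

weighted = predict_rating(X_toy, user_index=0, item_index=2)
fallback = predict_rating(X_toy, user_index=2, item_index=1, k=1)
print(weighted, fallback)
```

The first call averages over two neighbors the user rated; the second finds no rated neighbor and returns the item's mean instead of 0.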
