# Project: Recommendation Systems - Amazon Product Reviews

**Marks: 60**


Dear Learner,

Welcome to the project on Recommendation Systems. We will work with the Amazon product reviews dataset for this project.

Please read the problem statement and the guidelines below before you begin.

----
### Context: 
-------

E-commerce websites like Amazon and Flipkart use different recommendation models to provide personalized suggestions to different users. Amazon currently uses item-to-item collaborative filtering, which scales to massive data sets and produces high-quality recommendations in real time.
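
To make the idea of item-to-item collaborative filtering concrete, here is a minimal sketch on a made-up rating matrix. It is not Amazon's production algorithm: the toy data and the helper name `item_similarity` are assumptions for illustration only, and cosine similarity is just one common choice of similarity measure.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 means "not rated".
# All values are invented for illustration.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

def item_similarity(matrix):
    """Cosine similarity between item columns (hypothetical helper)."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items nobody rated
    normalized = matrix / norms
    return normalized.T @ normalized

item_sim = item_similarity(ratings)

# Items ranked by similarity to item 0, excluding item 0 itself
order = np.argsort(-item_sim[0])
most_similar_to_item0 = [int(i) for i in order if i != 0]
print(most_similar_to_item0)
```

In this toy matrix, item 1 is rated highly by the same users as item 0, so it ranks first among the items similar to item 0.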

---------
### Objective:
------------
Build a recommendation system to recommend products to customers based on their previous ratings for other products.

--------
### Dataset Attributes
------------
- userId : Every user identified with a unique id
- productId : Every product identified with a unique id
- Rating : Rating of the corresponding product by the corresponding user
- timestamp : Time of the rating (ignore this column for this exercise)

---------------------------
### Guidelines
-----------------------------------------
- Download the dataset from the drive link provided to you.
- The exercise consists of partially written code blocks. Fill in the blanks (________________) as per the instructions to achieve the required results.
- To complete the assessment in the expected time, do not change the variable names. The code may throw errors if the names are changed.
- The marks for each requirement are mentioned in the question.
- You can raise your issues on the discussion forum on Olympus.
- Uncomment the code snippets and work on them.
--------------------------------------------
Wishing you all the best!

### Data Source:
Amazon Reviews data: http://jmcauley.ucsd.edu/data/amazon/

Electronics (ratings only) dataset: This dataset includes no metadata or reviews, only (user, product, rating, timestamp) tuples.

### Import Required Libraries

In [1]:
import numpy as np
import pandas as pd
import math
import json
import time
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
import joblib  # sklearn.externals.joblib has been removed from recent scikit-learn releases
import scipy.sparse
from scipy.sparse import csr_matrix
import warnings; warnings.simplefilter('ignore')
%matplotlib inline



### Read the dataset 

In [2]:
# Import the data set (the file has no header row)
df = pd.read_csv('ratings_Electronics.csv', header=None)
df.columns = ['user_id', 'prod_id', 'rating', 'timestamp']
df = df.drop('timestamp', axis=1)  # the timestamp is not used in this exercise
df_copy = df.copy(deep=True)

In [6]:
# See the first few rows of the imported dataset
df.head()

|   | user_id | prod_id | rating |
|---|---|---|---|
| 0 | AKM1MP6P0OYPR | 132793040 | 5.0 |
| 1 | A2CX7LUOHB2NDG | 321732944 | 5.0 |
| 2 | A2NWSAGRHCP8N5 | 439886341 | 1.0 |
| 3 | A2WNBOD3WNDNKT | 439886341 | 3.0 |
| 4 | A1GI0U4ZRJA8WN | 439886341 | 1.0 |


### Exploratory data analysis (5 marks)

In [1]:
# Check the number of rows and columns
rows, columns = _________
print("No of rows: ", rows) 
print("No of columns: ", columns) 

In [2]:
#Check Data types
____________

In [3]:
# Check for missing values present
print('Number of missing values across columns-\n', ______________)

**Observations:** __________________________________________

In [4]:
# Summary statistics of 'rating' variable
______________________________________

**Observations:** __________________________________________________________

**Let's check the ratings distribution and visualize it.**

In [5]:
# Check the distribution of the ratings 
with sns.axes_style('white'):
    g = sns.catplot(x="rating", data=___, aspect=2.0, kind='count')  # factorplot was renamed to catplot in seaborn
    g.set_ylabels("Total number of ratings") 

**Let's now check the number of unique users and items in the dataset.**

In [6]:
# Number of unique user id and product id in the data
print('Number of unique USERS in Raw data = ', df['user_id']______)
print('Number of unique ITEMS in Raw data = ', df['prod_id']_______)

In [7]:
# Top 10 users based on the number of ratings given
most_rated = df.groupby('user_id').size().sort_values(ascending=False)[:10]
most_rated

### Data preparation as per requirement on number of minimum ratings (1 mark)

**Let's take a subset of the dataset (keeping only the users who have given 50 or more ratings) to make it less sparse and easier to work with.**

In [13]:
counts = df['user_id'].value_counts()
df_final = df[df['user_id'].isin(counts[counts >= 50].index)]

In [8]:
print('The number of observations in the final data =', len(df_final))
print('Number of unique USERS in the final data = ', df_final['user_id']______)
print('Number of unique ITEMS in the final data = ', df_final['prod_id']______)

**Observations:** df_final has users who have rated 50 or more items, and **we will be using df_final to build the recommendation systems**

**Let's calculate the density of the rating matrix**

In [9]:
final_ratings_matrix = df_final.pivot(index = 'user_id', columns ='prod_id', values = 'rating').fillna(0)
print('Shape of final_ratings_matrix: ', final_ratings_matrix.shape)

given_num_of_ratings = np.count_nonzero(final_ratings_matrix)
print('given_num_of_ratings = ', given_num_of_ratings)
possible_num_of_ratings = final_ratings_matrix.shape[0] * final_ratings_matrix.shape[1]
print('possible_num_of_ratings = ', possible_num_of_ratings)
density = (given_num_of_ratings/possible_num_of_ratings)
density *= 100
print ('density: {:4.2f}%'.format(density))
final_ratings_matrix.head()

### Build Popularity based Recommendation System by Averaging (10 marks)

In [11]:
# make a dataframe where you have the average rating and ratings count of each product in descending order of ratings count
#
#

In [12]:
#defining a function to get the top n products based on highest average rating and some minimum interactions (minimum number of ratings) of that product
def top_n_products(final_rating, n, min_interaction):
    #
    #
    #

#### Recommending top 5 products based on popularity and with minimum interactions of 50

In [13]:
list(top_n_products(_______________))

#### Recommending top 5 products based on popularity and with minimum interactions of 100

In [14]:
list(top_n_products(______________))

We have recommended the top 5 products using the popularity-based recommendation system.
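
As a rough illustration of popularity-based ranking with a minimum-interactions filter, here is a small sketch on invented data. The toy DataFrame, the column names `avg_rating` and `rating_count`, and the helper `top_n_by_popularity` are all assumptions for this example and are not the names expected by the exercise.

```python
import pandas as pd

# Toy ratings in the same long format as the project data (values are invented)
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1", "u2", "u1", "u3", "u2"],
    "prod_id": ["A", "A", "A", "B", "B", "C", "C", "D"],
    "rating":  [5.0, 4.0, 3.0, 5.0, 5.0, 2.0, 3.0, 5.0],
})

# Average rating and number of ratings per product
stats = toy.groupby("prod_id")["rating"].agg(avg_rating="mean", rating_count="count")

def top_n_by_popularity(stats, n, min_count):
    """Hypothetical helper: top-n products by average rating,
    restricted to products with at least min_count ratings."""
    eligible = stats[stats["rating_count"] >= min_count]
    return eligible.sort_values("avg_rating", ascending=False).index[:n]

top_two = list(top_n_by_popularity(stats, n=2, min_count=2))
print(top_two)
```

Product `D` has a perfect average but only one rating, so the minimum-interactions filter keeps it out of the ranking. This is the reason for combining the average with a count threshold.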

### Build Collaborative Filtering based Recommendation System (15 marks)

**Let's first compute the user-item interactions matrix by making the userid as index**

In [15]:
interactions_matrix = ________________________________________________
interactions_matrix

**Let's fill all the missing values with zero, since cosine similarity cannot be computed on missing values**

In [16]:
interactions_matrix__________________
interactions_matrix.head()

**Here the user id (index) is of the object data type. We will replace the user ids with integers from 0 to 1539 (one per user) so that the index is of integer type and each number represents one user**

In [17]:
interactions_matrix['user_index'] = np.arange(0, interactions_matrix.shape[0], 1)
interactions_matrix.set_index(['user_index'], inplace=True)

# Actual ratings given by users
interactions_matrix.head()

**Let's first define a function to get similar users to a particular user**

In [18]:
# defining a function to get similar users and return the most similar users and their similarity scores
def similar_users(user_index, interactions_matrix):
#
#
#

#### Finding the top 10 users most similar to user index 3 and their similarity scores

In [19]:
similar = similar_users(_________)[0][0:10]
similar

In [20]:
# printing the original user ids associated with the above user index
for i in similar:
    print(final_ratings_matrix.index[i])

In [21]:
#Print the similarity score
similar_users(___________)[1][0:10]

#### Finding the top 10 users most similar to user index 1521 and their similarity scores

In [22]:
similar = similar_users(_______________)[0][0:10]
similar

In [23]:
# printing the original user ids associated with the above user index
for i in similar:
    print(final_ratings_matrix.index[i])

In [24]:
#Print the similarity score
similar_users(_____________)[1][0:10]

We have found 10 similar users using similarity-based collaborative filtering.

**To build a more robust similar user identification, we can also put a threshold to the similarity score.**
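
The similar-user lookup and the threshold idea can be sketched as follows. This is only an illustration on an invented matrix: the helper name `similar_users_sketch` and its `min_similarity` parameter are hypothetical, and the similarity is computed with plain NumPy rather than with the function you are asked to write.

```python
import numpy as np

# Toy interactions matrix: rows are user indices, columns are products, 0 = not rated
toy_matrix = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
    [5.0, 5.0, 0.0, 0.0],
])

def similar_users_sketch(user_index, matrix, min_similarity=0.0):
    """Hypothetical helper: rank the other users by cosine similarity to
    user_index and keep only those at or above min_similarity."""
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0  # guard against users with no ratings
    sims = (matrix @ matrix[user_index]) / (norms * norms[user_index])
    ranked = [(int(i), float(sims[i])) for i in np.argsort(-sims)
              if i != user_index and sims[i] >= min_similarity]
    users = [i for i, _ in ranked]
    scores = [s for _, s in ranked]
    return users, scores

users, scores = similar_users_sketch(0, toy_matrix, min_similarity=0.5)
print(users, scores)
```

With the threshold at 0.5, user 2 (who shares no rated products with user 0) is dropped, while users 3 and 1 remain, in that order.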

In [25]:
# defining the recommendations function to get recommendations by using the similar users preferences and return the recommendations of products
#
#
#

#### Recommend 5 products to user index 3 based on similarity based collaborative filtering

In [26]:
recommendations(______________)

#### Recommend 5 products to user index 1521 based on similarity based collaborative filtering

In [27]:
recommendations(______________)
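
One possible way to turn similar users into product recommendations is sketched below on invented data. The helper `recommend_from_neighbours` is hypothetical and uses its own inline cosine similarity. It walks through the most similar users first and collects products they rated that the target user has not, which is only one of several reasonable designs.

```python
import numpy as np

# Toy interactions matrix: rows are user indices, columns are product indices, 0 = not rated
toy_matrix = np.array([
    [5.0, 4.0, 0.0, 0.0, 0.0],   # user 0 (target)
    [4.0, 5.0, 3.0, 0.0, 0.0],   # user 1
    [5.0, 0.0, 0.0, 4.0, 0.0],   # user 2
    [0.0, 0.0, 0.0, 0.0, 5.0],   # user 3
])

def recommend_from_neighbours(user_index, matrix, num_of_products):
    """Hypothetical helper: recommend products rated by the most similar users
    that the target user has not rated yet."""
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0  # guard against users with no ratings
    sims = (matrix @ matrix[user_index]) / (norms * norms[user_index])
    neighbours = [int(i) for i in np.argsort(-sims) if i != user_index]

    already_rated = set(np.flatnonzero(matrix[user_index]).tolist())
    recommended = []
    for other in neighbours:
        for item in np.flatnonzero(matrix[other]).tolist():
            if item not in already_rated and item not in recommended:
                recommended.append(item)
        if len(recommended) >= num_of_products:
            break
    return recommended[:num_of_products]

picks = recommend_from_neighbours(0, toy_matrix, num_of_products=2)
print(picks)
```

Here product 2 comes from the most similar user (user 1) and product 3 from the next one (user 2). Product 4 is never reached because the requested number of products is already filled.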

### Build Model-based Collaborative Filtering: Singular Value Decomposition (15 marks)

**SVD is well suited to large sparse matrices. For sparse matrices, you can use the `scipy.sparse.linalg.svds()` function to compute a truncated decomposition with only the top k singular values**

In [28]:
from scipy.sparse.linalg import svds # for sparse matrices
# Singular Value Decomposition
U, sigma, Vt = svds(_______________, k = ___) # here the number of latent features is 50
# Construct diagonal array in SVD
sigma = np.diag(sigma)
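
To see what the truncated decomposition returns, here is a small sketch on an invented 4 x 5 matrix with `k = 2` latent features. The matrix values and the variable names `toy` and `approx` are assumptions for illustration. The project data uses a much larger matrix and a larger `k`.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy rating matrix (4 users x 5 products), 0 = not rated; values are invented
toy = np.array([
    [5.0, 4.0, 0.0, 1.0, 0.0],
    [4.0, 5.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 5.0, 4.0, 0.0],
    [1.0, 0.0, 4.0, 5.0, 0.0],
])

k = 2  # number of latent features kept; must be smaller than min(toy.shape)
U, sigma, Vt = svds(csr_matrix(toy), k=k)
print(U.shape, sigma.shape, Vt.shape)  # (4, 2) (2,) (2, 5)

# Rank-k approximation of the original matrix: U * diag(sigma) * Vt
approx = U @ np.diag(sigma) @ Vt
print(approx.shape)  # (4, 5)
```

Note that `svds` returns the singular values as a 1-D array, which is why the notebook converts `sigma` to a diagonal matrix before multiplying the three factors back together.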

In [29]:
#checking the shape of the U matrix
#

In [30]:
#checking the shape of the sigma matrix
#

In [31]:
#checking the shape of the Vt matrix
#

In [32]:
all_user_predicted_ratings = np.dot(np.dot(__, ____), __) 

# Create a dataframe of predicted ratings (all_user_predicted_ratings) by using the columns from interactions_matrix
preds_df = ________________________
preds_df.head()


In [43]:
# Define a function to recommend items with the highest predicted ratings

def recommend_items(user_index, interactions_matrix, preds_df, num_recommendations):
      
    user_idx = user_index  # interactions_matrix is already indexed from 0, so no offset is needed
    
    # Get and sort the user's ratings
    ___________________________________  #sorted_user_ratings
    
    ___________________________________   #sorted_user_predictions
    

    #Concatenate the sorted user ratings and sorted user predictions with the axis 1
    temp = pd.concat(__________________________________)
    temp.index.name = 'Recommended Items'
    temp.columns = ['user_ratings', 'user_predictions']
    
    temp = temp.loc[temp.user_ratings == 0]   
    temp = temp.sort_values('user_predictions', ascending=False)
    print('\nBelow are the recommended items for user(user_index = {}):\n'.format(user_index))
    print(temp.head(num_recommendations))

**Recommending 5 products to user index 121 based on this model**

In [33]:
#Enter 'user_index' and 'num_recommendations' for the user #
user_index = ___
num_recommendations = __
recommend_items(___________________)

**Recommending 5 products to user index 465 based on this model**

In [34]:
#Enter 'user_index' and 'num_recommendations' for the user #
user_index = ____
num_recommendations = ___
recommend_items(________________________)
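
The selection step inside a function like `recommend_items` can be illustrated on a tiny example: keep only the products the user has not rated, then sort by predicted rating. The two DataFrames and the helper `top_unrated` below are invented for this sketch. In particular, the predicted values are made up and do not come from an actual decomposition.

```python
import pandas as pd

products = ["A", "B", "C", "D"]

# Actual ratings (0 = not rated) for two users; values are invented
actual = pd.DataFrame([[5.0, 0.0, 0.0, 3.0],
                       [0.0, 4.0, 0.0, 0.0]], columns=products)

# Hypothetical predicted ratings, e.g. from a low-rank reconstruction
predicted = pd.DataFrame([[4.8, 2.1, 3.5, 2.9],
                          [1.2, 3.9, 0.4, 2.2]], columns=products)

def top_unrated(user_idx, actual, predicted, n):
    """Hypothetical helper: the n unrated products with the highest predicted rating."""
    user_ratings = actual.iloc[user_idx]
    user_predictions = predicted.iloc[user_idx]
    candidates = user_predictions[user_ratings == 0]
    return candidates.sort_values(ascending=False).head(n)

result = top_unrated(0, actual, predicted, n=2)
print(result)
```

User 0 has already rated products `A` and `D`, so only `B` and `C` are candidates, and `C` comes first because its predicted rating is higher.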

### Evaluate the model. (10 marks)

#### Evaluation of Model-based Collaborative Filtering (SVD)

In [36]:
# Actual final ratings given by the users
final_ratings_matrix.head()

In [38]:
# Average ACTUAL rating for each item
__________________________

In [39]:
# Predicted ratings 
preds_df.head()

In [40]:
# Average PREDICTED rating for each item
______________________

In [41]:
# creating a dataframe containing average actual ratings and average predicted ratings for each item
rmse_df = pd.concat(_________________________)
rmse_df.columns = ['Avg_actual_ratings', 'Avg_predicted_ratings']
print(rmse_df.shape)
rmse_df['item_index'] = np.arange(0, rmse_df.shape[0], 1)
rmse_df.head()

In [42]:
# Calculating RMSE
RMSE = ________________________________________
print('\nRMSE SVD Model = {} \n'.format(RMSE))
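
For reference, RMSE is the square root of the mean squared difference between two sets of values. The sketch below applies it to two short invented arrays. The names `avg_actual` and `avg_predicted` are placeholders for this example, not the columns of `rmse_df`.

```python
import numpy as np

# Invented per-item averages, standing in for actual and predicted ratings
avg_actual = np.array([4.0, 3.5, 2.0, 5.0])
avg_predicted = np.array([3.5, 3.5, 3.0, 4.5])

# Root mean squared error: sqrt(mean((actual - predicted)^2))
rmse = np.sqrt(np.mean((avg_actual - avg_predicted) ** 2))
print(round(float(rmse), 4))
```

The differences here are 0.5, 0, -1 and 0.5, so the mean squared error is 0.375 and the RMSE is its square root. A lower value means the predicted averages are closer to the actual ones.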

### Let's recommend 10 products to the user index 200. (2 marks)

In [43]:
# Enter 'user_index' and 'num_recommendations' for the user #
user_index = ___
num_recommendations = __
recommend_items(_______________________________)

### Summarise your insights. (2 marks)

Insights here