<a href="https://colab.research.google.com/github/JinHuiXu1991/Jin_DATA606/blob/main/ipynb/DATA606_Part3_HybridRecommender.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Amazon Product Recommender Systems
## Author: Jin Hui Xu

##Hybrid Recommender

The two types of filtering have their own drawbacks such as the novelty problem of Content-based Filtering and the cold start problem of Collaborative Filtering, so in reality, more robust recommender systems like hybrid recommenders are often used. This notebook will build a hybrid recommender that combines Content-based Filtering and Collaborative Filtering to overcome the drawbacks and improve overall performance.


In [1]:
!wget https://github.com/JinHuiXu1991/Jin_DATA606/blob/main/cleaned_data/cleaned_amazon_product.zip?raw=true

!wget https://github.com/JinHuiXu1991/Jin_DATA606/blob/main/cleaned_data/cleaned_amazon_review.zip?raw=true

!wget https://github.com/JinHuiXu1991/Jin_DATA606/blob/main/models/Content_based_LDA_output.zip?raw=true

!wget https://github.com/JinHuiXu1991/Jin_DATA606/blob/main/models/final_cr_sentiment_data.zip?raw=true 

!wget https://github.com/JinHuiXu1991/Jin_DATA606/blob/main/models/final_cr_sentiment_model.pickle?raw=true



--2022-04-24 03:53:10--  https://github.com/JinHuiXu1991/Jin_DATA606/blob/main/cleaned_data/cleaned_amazon_product.zip?raw=true
Resolving github.com (github.com)... 52.192.72.89
Connecting to github.com (github.com)|52.192.72.89|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/JinHuiXu1991/Jin_DATA606/raw/main/cleaned_data/cleaned_amazon_product.zip [following]
--2022-04-24 03:53:10--  https://github.com/JinHuiXu1991/Jin_DATA606/raw/main/cleaned_data/cleaned_amazon_product.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/JinHuiXu1991/Jin_DATA606/main/cleaned_data/cleaned_amazon_product.zip [following]
--2022-04-24 03:53:11--  https://raw.githubusercontent.com/JinHuiXu1991/Jin_DATA606/main/cleaned_data/cleaned_amazon_product.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133,

In [2]:
!pip install surprise

Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
[K     |████████████████████████████████| 11.8 MB 3.6 MB/s 
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... [?25l[?25hdone
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1630167 sha256=00ced8edcf05415e990f6affb32ff61bf55a785c319a262ae8a81e41fce2b06a
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.1 surprise-0.1


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from surprise import Dataset
from surprise import Reader
from surprise import dump
import os
from surprise import NormalPredictor
from surprise import SVD
from surprise.accuracy import rmse
import gzip
from collections import defaultdict

In [4]:
def load_model(model_filename):
    file_name = os.path.expanduser(model_filename)
    _, loaded_model = dump.load(file_name)
    return loaded_model

In [5]:
def get_rec_user(uid, input_df):
    input_id = uid
    data1 = [input_id]
    data2 = input_df['asin'].unique().tolist()

    df = pd.DataFrame(data1)
    df.columns =['reviewerID']

    df1 = pd.DataFrame(data2)
    df1.columns =['asin']
    # filter out reviewed products
    reviewed_product = input_df[input_df['reviewerID'] == input_id].asin.unique().tolist()
    df1 = df1[~df1['asin'].isin(reviewed_product)]

    # Now to perform cross join, we will create
    # a key column in both the DataFrames to
    # merge on that key.
    df['key'] = 1
    df1['key'] = 1

    # to obtain the cross join we will merge
    # on the key and drop it.
    result = pd.merge(df, df1, on ='key')

    result['overall']=0.0
    del result['key']
    return result

In [6]:
def collaborative_SVD_recommender(predictions, product_df, top_num=10, LDA_result=None):
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, _, est, _ in predictions:
      if LDA_result is not None:
        if iid in LDA_result:
          top_n[uid].append((iid, est))
      else:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:top_num]

    for uid, user_ratings in top_n.items():
      result = [iid for (iid, _) in user_ratings]

    return result, product_df[product_df['asin'].isin(result)]['ori_title'].tolist()

In [7]:
def topic_modeling_recommender(id, df, top_n=10):
    id = id.lower()

    # get the input product topic number
    topic_num = df[df['asin'].str.lower() == id]['topic_num'].item()

    # remove the input product from the recommendation data
    exclude_input_df = df.copy()
    exclude_input_df = exclude_input_df[exclude_input_df['asin'].str.lower() != id]

    # get the top 10 Probability product for the matching topic number
    output_df = exclude_input_df[exclude_input_df['topic_num'] == topic_num].sort_values('probability', ascending=False).head(top_n)

    # get the product indices
    product_indices = output_df.index.tolist()

    # return the top 10 most similar product
    return df['asin'].iloc[product_indices].tolist(), df['ori_title'].iloc[product_indices].tolist()

Load our Content-Based filtering model and generate top 100 recommendation products for an input product ID

In [8]:
product_df = pd.read_csv('./Content_based_LDA_output.zip?raw=true', compression='zip')

In [9]:
# get input id and title for the recommendation
input_id = product_df[product_df['title'].str.contains('refrigerator', case=False)].iloc[0]['asin']
input_title = product_df[product_df['title'].str.contains('refrigerator', case=False)].iloc[0]['title']

asin, title = topic_modeling_recommender(input_id, product_df, 100)

In [10]:
print('Topic Modeling Recommender Result for {}, {}: '.format(input_id, input_title))
for i in range(0, len(asin)):
  print('{}. {}, {}'.format(i+1, asin[i], title[i]))

Topic Modeling Recommender Result for B0001YH10C, coldmate mr 128 mini cooler warmer deluxe mini refrigerator: 
1. B013PRRB4W, Power Pair Special-LG Turbo Series Ultra-Capacity Laundry System with Steam*PURE WHITE COLOR*(WM4270HWA_DLEX4270W)
2. B013PSOBNA, Power Pair Special-LG Turbo Series Ultra-Capacity Laundry System with Steam and Matching Storage Pedestals *GRAPHITE STEEL*(WM4270HVA_DLEX4270V_WDP4V X 2)
3. B013PT1PFQ, Power Pair Special-LG Turbo Series Ultra-Capacity Laundry System with Steam*GRAPHITE STEEL*(WM4270HVA_DLEX4270V)
4. B00HX3ZJKS, PAIR SPECIAL- LG Turbo Series Ultra Capacity Laundry System With Steam Technology (WM3470HVA,DLEX3470V,WDP4V x2)
5. B0049OSUD2, Maytag MFI2665XEM Ice2O 25.5 Cu. Ft. Stainless Steel French Door Refrigerator - Energy Star
6. B00MG225MQ, Power Pair Special- LG Turbo Series Ultra Capacity Laundry System with Steam Technology(WM3570HWA_DLEX3570W)*PURE WHITE IN COLOR*
7. B00MG17WBQ, POWER PAIR SPECIAL-LG TURBO SERIES ULTRA CAPACITY LAUNDRY SYSTEM 

Load our Collaborative filtering model 



In [11]:
# test the final pretrained model
model_filename = "./final_cr_sentiment_model.pickle?raw=true"
loaded_model = load_model(model_filename)

result_list = []
title_df = pd.read_csv('./cleaned_amazon_product.zip?raw=true', compression='zip')
final_merged_df = pd.read_csv('./final_cr_sentiment_data.zip?raw=true', compression='zip')

input_id = 'A1CY6CQC5HPQGL'
result = get_rec_user(input_id, final_merged_df)

reader = Reader()
valid_Dataset = Dataset.load_from_df(result, reader)

testset = valid_Dataset.df.values.tolist()
predictions = loaded_model.test(testset)

top_num=20
cr_asin, cr_title = collaborative_SVD_recommender(predictions, title_df, top_num, asin)

print('Collaborative Recommender Result for customer {}: '.format(input_id))

for i in range(0, len(cr_asin)):
  print('{}. {}, {}'.format(i+1, cr_asin[i], cr_title[i]))

Collaborative Recommender Result for customer A1CY6CQC5HPQGL: 
1. B00JV8FUTI, Whirlpool WED8500SR 27" Electric Dryer, 6.7 cuft., 9 Cycles, AccuDry, : White
2. B013PRRB4W, LG LRBP1031T10.0 Cu. Ft. Titanium Counter Depth Bottom Freezer Refrigerator
3. B0050OZRLS, Frigidaire : FTF2140FS 27 Front-Load Washer - White
4. B01B3Q6U6W, Speed Queen : AWN311 : 7 Cycle Topload Rear Control Washer 3.3 C.Ft. WHITE
5. B00EE89JLU, Electrolux EWMED70JIWWave-Touch 8.0 Cu. Ft. White Stackable With Steam Cycle Electric Dryer
6. B00SZAH9Y2, Maytag MVWX600XW Bravos X 3.6 Cu. Ft. White Top Load Washer - Energy Star
7. B00HX3RS3O, Kitchenaid KUDE50CXSS Superba Series EQ Dishwasher
8. B000UVWS0Y, Maytag MFI2569YEW
9. B003S6HC9A, Electrolux EI15IM55GSIQ-Touch 15&quot; Stainless Steel Undercounter Built-In Ice Maker
10. B007RJZLX8, Electrolux EW23BC85KS Wave-Touch 22.6 Cu. Ft. Stainless Steel Counter Depth French Door Refrigerator - Energy Star
11. B0015YT5X8, Maytag MDC4809PAB JetClean Plus 24&quot; Black Porta

The goal of our Hybrid Model is to generate recommendations based on both content-based filtering and collaborative filtering.

It will take both reviewer ID and product ID as input, and first get 100 recommendation results from the content-based filtering model, then input the reviewer ID and the recommended product IDs from the Content-based filtering model to the Collaborative filtering model. This model will generate recommendations that meet product similarities and customer personality as much as possible.

Of course, if either ID is missing from the input, our Hybrid Model can handle it by calling its "Child Models" to generate recommendations respectively. If no input IDs are entered, then it will use our base model for the recommendation.

In [12]:
def hybrid_recommender(reviewerID="", productID=""):
    # load the final pretrained SVD model for collaborative filtering
    model_filename = "./final_cr_sentiment_model.pickle?raw=true"
    loaded_model = load_model(model_filename)
    reader = Reader()

    final_merged_df = pd.read_csv('./final_cr_sentiment_data.zip?raw=true', compression='zip')
    product_df = pd.read_csv('./Content_based_LDA_output.zip?raw=true', compression='zip')

    # both product ID and reviewer ID entered, do hybrid
    if productID != "" and reviewerID != "":
        lda_asin, _ = topic_modeling_recommender(productID, product_df, 100)
        result = get_rec_user(reviewerID, final_merged_df)
        valid_Dataset = Dataset.load_from_df(result, reader)
        testset = valid_Dataset.df.values.tolist()
        predictions = loaded_model.test(testset)
        asin, title = collaborative_SVD_recommender(predictions, product_df, 10, lda_asin)

    # product ID entered and no reviewer ID entered, do content based
    elif productID != "" and reviewerID == "":
        asin, title = topic_modeling_recommender(productID, product_df, 10)

    # no product ID entered and reviewer ID entered, do collaborative 
    elif productID == "" and reviewerID != "":
        result = get_rec_user(reviewerID, final_merged_df)
        valid_Dataset = Dataset.load_from_df(result, reader)
        testset = valid_Dataset.df.values.tolist()
        predictions = loaded_model.test(testset)
        asin, title = collaborative_SVD_recommender(predictions, product_df, 10)

    # no product ID entered and no reviewer ID entered, do rating rank 
    else:
        review_df = pd.read_csv('./cleaned_amazon_review.zip?raw=true', compression='zip')
        merged_df = review_df.merge(product_df, on='asin', how='left')
        product_grouped = merged_df.groupby('asin').size().reset_index(name='counts')
        product_grouped_rating = merged_df.groupby('asin')['overall'].sum().reset_index(name='overall_sum')
        product_grouped['rating_mean'] = product_grouped_rating['overall_sum'] / product_grouped['counts']

        rating_df = product_df.merge(product_grouped, on='asin', how='left')
        output_df = rating_df.sort_values(['rating_mean', 'counts'], ascending=[False, False]).head(10)
        
        # get the product indices
        product_indices = output_df.index.tolist()

        # return the top 10 most similar product
        asin, title = product_df['asin'].iloc[product_indices].tolist(), product_df['ori_title'].iloc[product_indices].tolist()


    return asin, title

In [14]:
hy_input_reviewerID = 'A1CY6CQC5HPQGL'
hy_input_productID = 'B0001YH10C'
hy_asin, hy_title = hybrid_recommender(reviewerID=hy_input_reviewerID, productID=hy_input_productID)
#hy_asin, hy_title = hybrid_recommender(reviewerID=hy_input_reviewerID, productID='')
#hy_asin, hy_title = hybrid_recommender(reviewerID='', productID=hy_input_productID)
#hy_asin, hy_title = hybrid_recommender(reviewerID='', productID='')
print('Hybrid Recommender Result for customer {} and product {}: '.format(hy_input_reviewerID, hy_input_productID))

for i in range(0, len(hy_asin)):
  print('{}. {}, {}'.format(i+1, hy_asin[i], hy_title[i]))

Hybrid Recommender Result for customer A1CY6CQC5HPQGL and product B0001YH10C: 
1. B00JV8FUTI, LG LRBP1031T10.0 Cu. Ft. Titanium Counter Depth Bottom Freezer Refrigerator
2. B013PRRB4W, Electrolux EWMED70JIWWave-Touch 8.0 Cu. Ft. White Stackable With Steam Cycle Electric Dryer
3. B0050OZRLS, Maytag MFI2569YEW
4. B01B3Q6U6W, Maytag MDC4809PAB JetClean Plus 24&quot; Black Portable Full Console Dishwasher - Energy Star
5. B00EE89JLU, LG WM3050CW4.0 Cu. Ft. White Stackable Front Load Washer - Energy Star
6. B00SZAH9Y2, LG PAIR SPECIAL- Turbo Series With Steam Technology WM3470HWA+DLEX3470W
7. B00HX3RS3O, LG DLEX3570W 7.4 Cu. Ft. Electric SteamDryer with NFC Tag On - White
8. B000UVWS0Y, Speed Queen ADEE8RGS 27&quot; ADA Compliant Button Control Front Load Electric Dryer with 7.0 Cu. Ft. Capacity Reversible Door 6 Preset Cycles Moisture Sensor Interior Light Time Remaining Display in
9. B003S6HC9A, Power Pair Special-LG Turbo Series Ultra-Capacity Laundry System with Steam*PURE WHITE COLOR*(

As you can see, the recommendation system is now suggesting more products that are similar to the product ID B0001YH10C for customer ID A1CY6CQC5HPQGL