<a href="https://colab.research.google.com/github/HappyBonny52/DATA301-Big-Data-Computing-and-Systems-/blob/main/DATA301_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **DATA301 Project**

*   Name : Bonghyun Kwon
*   Student ID : 91816426

\

**Hypothesis :**

Market Basket Analysis (MBA) with the Apriori algorithm and cosine similarity are both expected to be effective at generating personalized product recommendations on Amazon, although their effectiveness and efficiency may vary across situations. MBA with Apriori is expected to excel at capturing frequent itemsets and association rules, giving accurate recommendations based on customer purchase history. Cosine similarity, on the other hand, is expected to be effective at identifying similar customers and recommending products based on their preferences. The performance of both algorithms is expected to depend on factors such as the sparsity of the data, the diversity of the product catalog, and the availability of customer purchase history. By implementing and evaluating both algorithms, this project aims to validate or refute the hypothesis and to gain insight into how the two algorithms differ when recommending products.

\
**The research question :**

### "What personalized items can be recommended to customers on AMAZON using Market Basket Analysis and Cosine similarity? And what difference do they make when it comes to recommending personalized products?"

# **Setup for project**

Install pyspark

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!pip install -q pyspark



Start the Spark Context

In [None]:
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import *
spark = SparkSession.builder.master("local[*]").appName('SparkExample').config(
    "spark.executor.memory", "1g").config(
        "spark.executorEnv.PYTHONHASHSEED","0").config("spark.ui.port", "4050"
        ).getOrCreate()
sc = spark.sparkContext

# **Load Dataset**

We use the Amazon Review Dataset (2018) for MBA,

available from: https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/

Specifically, we use the metadata for the
"*Grocery and Gourmet Food*" category, which contains 287,209 products.

The URL for this metadata is: https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/metaFiles2/meta_Grocery_and_Gourmet_Food.json.gz

In [None]:
import urllib.request
url = "https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/metaFiles2/meta_Grocery_and_Gourmet_Food.json.gz"
filename = 'meta_Grocery_Food.json.gz'
urllib.request.urlretrieve(url, filename)

('meta_Grocery_Food.json.gz', <http.client.HTTPMessage at 0x7fe8736b7a30>)

Display the number of records (baskets) in the dataset

In [None]:
item_purchased = spark.read.json('meta_Grocery_Food.json.gz').rdd

num_basket = item_purchased.count()
print(f"The number of datasets(baskets) : {num_basket}")

The number of datasets(baskets) : 287051


## Appendix A. Displaying the contents of the imported dataset

In [None]:
first_5_item_purchased = item_purchased.take(5)

print("First five datasets\n")
for i in range(5):
  print(f"Row {i+1} (")
  print(f"\talso_buy = {first_5_item_purchased[i][0]},")
  print(f"\talso_view = {first_5_item_purchased[i][1]},")
  print(f"\tasin = '{first_5_item_purchased[i][2]}',")
  print(f"\tbrand = '{first_5_item_purchased[i][3]}',")
  print(f"\tcategory = {first_5_item_purchased[i][4]},")
  print(f"\tdate = '{first_5_item_purchased[i][5]}',")
  print(f"\tdescription = {first_5_item_purchased[i][6]},")
  print(f"\tdetails = {first_5_item_purchased[i][7]},")
  print(f"\tfeature = {first_5_item_purchased[i][8]},")
  print(f"\tfit = '{first_5_item_purchased[i][9]}',")
  print(f"\timageURL = {first_5_item_purchased[i][10]},")
  print(f"\timageURLHighRes = {first_5_item_purchased[i][11]},")
  print(f"\tmain_cat = '{first_5_item_purchased[i][12]}',")
  print(f"\tprice = {first_5_item_purchased[i][13]},")
  print(f"\trank = {first_5_item_purchased[i][14]},")
  print(f"\ttitle = '{first_5_item_purchased[i][18]}',")
  print(")\n")

First five datasets

Row 1 (
	also_buy = [],
	also_view = ['B0000D9MYM', 'B0000D9MYL', 'B00ADHIGBA', 'B00H9OX598', 'B001LM42GY', 'B001LM5TDY'],
	asin = '0681727810',
	brand = 'Ariola Imports',
	category = ['Grocery & Gourmet Food', 'Dairy, Cheese & Eggs', 'Cheese', 'Gouda'],
	date = '',
	description = ['BEEMSTER GOUDA CHEESE AGED 18/24 MONTHS', 'Statements regarding dietary supplements have not been evaluated by the FDA and are not intended to diagnose, treat, cure, or prevent any disease or health condition.'],
	details = None,
	feature = [],
	fit = '',
	imageURL = [],
	imageURLHighRes = [],
	main_cat = 'Grocery',
	price = $41.91,
	rank = 165,181 in Grocery & Gourmet Food (,
	title = 'Beemster Gouda - Aged 18/24 Months - App. 1.5 Lbs',
)

Row 2 (
	also_buy = ['B01898YHXK', 'B01BCM6LAC', 'B00Q4OL47O', 'B00Q4OL5QE', '0804189986', 'B00Q4OL2F8', '1101902663', 'B019PDEUU8', 'B01AC97ONO', 'B01LZIS4VX', 'B019QN2DN8', 'B019PDYP7Q', '1101902639', 'B00Q4OL0S2', 'B019PHEIVA', 'B019QNGPSW', 'B004

# **Extract and create the useful datasets for Market Basket Analysis**
Each Amazon product has an identifier called an 'ASIN'.
ASIN stands for Amazon Standard Identification Number, a ten-character alphanumeric code.

---

The datasets we need are:

\

*   **1 - The number of customers who bought products on Amazon**

    *(This gives the number of baskets, which is used to calculate interest and confidence.)*
*   **2 - ASIN and the name of its product**

    *(An ASIN is only an ID, so it does not reveal what the product is; the descriptive product name is kept alongside it.)*
*  **3 - Baskets containing one or more items per customer they purchased**

    *(This shows which products are in each basket; every basket holds one or more products.)*

\
> Note that because product names are long, we use the ASIN to identify each product in the datasets and show the product name only when recommending the product to a customer.






**1 - The number of customers who bought products on Amazon**

Since each record in the dataset corresponds to one customer's basket,
the number of customers equals the number of baskets obtained in the Load Dataset section above.

In [None]:
print("The number of customers : ", num_basket)

The number of customers :  287051


**2 - ASIN and the name of its product**

In [None]:
asin = item_purchased.map(lambda x : x[2])
title = item_purchased.map(lambda x : x[18])

index_asin = asin.zipWithIndex().map(lambda x : (x[1], x[0]))
index_title = title.zipWithIndex().map(lambda x : (x[1], x[0]))

join_asin_title = index_asin.join(index_title).map(lambda x : x[1])

print(join_asin_title.take(5))

[('0681727810', 'Beemster Gouda - Aged 18/24 Months - App. 1.5 Lbs'), ('1888861118', 'Letter C - Swarovski Crystal Monogram Wedding Cake Topper Letter'), ('1888861614', 'Letter S - Swarovski Crystal Monogram Wedding Cake Topper Letter'), ('1888861339', '1 X Fully Covered in Crystal Monogram Wedding Cake Topper Letter - Letter O'), ('188886155X', 'Letter L - Swarovski Crystal Monogram Wedding Cake Topper Letter')]


**3 - Baskets containing one or more items per customer they purchased**


In [None]:
also_buy = item_purchased.map(lambda x : x[0])
index_also_buy = also_buy.zipWithIndex().map(lambda x : (x[1], x[0]))

all_items_in_basket_with_indx = index_asin.join(index_also_buy)
list_of_baskets_with_indx = all_items_in_basket_with_indx.map(lambda x : (x[0], [x[1][0]]+(x[1][1])))
list_of_baskets = list_of_baskets_with_indx.sortBy(lambda x : x[0])
num_products = list_of_baskets.flatMap(lambda x : x[1]).distinct().count()

print(f"The number of products : ", num_products)
print(list_of_baskets.take(5))

The number of products :  502705
[(0, ['0681727810']), (1, ['0853347867', 'B01898YHXK', 'B01BCM6LAC', 'B00Q4OL47O', 'B00Q4OL5QE', '0804189986', 'B00Q4OL2F8', '1101902663', 'B019PDEUU8', 'B01AC97ONO', 'B01LZIS4VX', 'B019QN2DN8', 'B019PDYP7Q', '1101902639', 'B00Q4OL0S2', 'B019PHEIVA', 'B019QNGPSW', 'B0041RGD0E', '0998089508', '0988775115', 'B0006Z7NNG', 'B00MU73UKS', 'B01KIJ7JGA', 'B005F9W9JQ']), (2, ['1888861118']), (3, ['1888861517']), (4, ['1888861614'])]


# **Apply A-priori algorithm**

### Step 1 : A-priori algorithm

In [None]:
from operator import *
#Most frequent items
def a_priori_step1(items_per_basket):
  items = items_per_basket.flatMap(lambda x: x[1])
  itemsCount = items.map(lambda x : (x, 1)).reduceByKey(add)
  return itemsCount

print("Top 5 purchased items ",a_priori_step1(list_of_baskets).takeOrdered(5, lambda kv: -kv[1]))

Top 5 purchased items  [('B07CX6LN8T', 3411), ('B008GVJ9S4', 2070), ('B0799CH1ZZ', 1740), ('B00HFC2E82', 1378), ('B003ZXEBOK', 1376)]


### Step 2 : A-priori algorithm

In [None]:
def a_priori(rdd, support=100):
  supportItems = a_priori_step1(rdd).filter(lambda x : x[1] > support)
  freqItems = sc.broadcast(supportItems.collectAsMap())

  def filter_uncommon(text):
    return [item  for item in text if item in freqItems.value]

  def items_pair_tup(line):
    """Function for pairing up all items to tuple within the same line"""
    return [tuple(sorted((line[j], line[i]))) for i in range(len(line)) for j in range(i)]

  freqItemPerBasket = rdd.map(lambda x : x[1]).map(filter_uncommon)
  freqItemPair = freqItemPerBasket.filter(lambda x : len(x)>=2).flatMap(items_pair_tup)
  freqItemPairCount = freqItemPair.map(lambda x : (x, 1)).reduceByKey(add)

  return freqItemPairCount

print("Top 5 frequently co-purchased item pair ", a_priori(list_of_baskets).takeOrdered(5, lambda kv: -kv[1]))

Top 5 frequently co-purchased item pair  [(('B003ZXEBOK', 'B008GVJ9S4'), 1003), (('B008GVJ9S4', 'B00HFC2E82'), 988), (('B0079OYIFS', 'B00H2AAXMQ'), 935), (('B0079OYIFS', 'B01G4I8WCE'), 917), (('B003ZXEBOK', 'B00HFC2E82'), 847)]
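
The two passes above can be traced on a toy example without Spark. The following is a minimal sketch in plain Python, assuming a hypothetical `toy_baskets` list and a hypothetical helper `frequent_pairs`. It mirrors the `> support` filter used in `a_priori`, but it de-duplicates items within a basket, which the notebook code does not do.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, support):
    """Two-pass A-priori sketch: count items, then count pairs of frequent items."""
    # Pass 1: count every item once per basket (assumption: duplicates are ignored).
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent = {item for item, count in item_counts.items() if count > support}

    # Pass 2: count only pairs whose members are both frequent.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent)
        pair_counts.update(combinations(kept, 2))
    return item_counts, pair_counts

# Hypothetical baskets, for illustration only.
toy_baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["bread", "butter"],
    ["milk", "eggs"],
]
item_counts, pair_counts = frequent_pairs(toy_baskets, support=1)
print(pair_counts.most_common())
```

Because "butter" appears in only one basket, it is dropped after the first pass and never takes part in pair counting, which is the saving A-priori is designed to provide.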


### Datasets for recommending items by MBA using the A-priori algorithm

In [None]:
top_item_counts = a_priori_step1(list_of_baskets).filter(lambda kv: kv[1]>=100)
top_pair_counts = a_priori(list_of_baskets, support=100)

### Association rules for evaluating itemsets use these formulas:

\

 **Confidence: conf(I → j) = support(I ∪ {j}) / support(I)**



**Interest: interest(I → j) = conf(I → j) − Pr[j]**


\
These formulas are taken from the lab3_part1 Word document; the background comes from Mining of Massive Datasets (MMDS), 3rd edition, Chapter 6.
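
As a worked illustration of the two formulas, here is a small sketch in plain Python. The counts are hypothetical and chosen only to make the arithmetic easy to follow; the helper names `rule_confidence` and `rule_interest` are not part of the notebook.

```python
def rule_confidence(pair_count, antecedent_count):
    """conf(I -> j) = support(I u {j}) / support(I), computed from raw counts."""
    return pair_count / antecedent_count

def rule_interest(pair_count, antecedent_count, consequent_count, n_baskets):
    """interest(I -> j) = conf(I -> j) - Pr[j]."""
    return rule_confidence(pair_count, antecedent_count) - consequent_count / n_baskets

# Hypothetical counts, for illustration only.
n_baskets = 1000   # total number of baskets
count_i = 200      # baskets containing item i
count_j = 100      # baskets containing item j
count_ij = 80      # baskets containing both i and j

conf_ij = rule_confidence(count_ij, count_i)
int_ij = rule_interest(count_ij, count_i, count_j, n_baskets)
print(f"conf(i -> j) = {conf_ij:.2f}, interest(i -> j) = {int_ij:.2f}")
```

Here 40% of the baskets with `i` also contain `j`, while only 10% of all baskets contain `j`, so the interest is 0.30: buying `i` makes `j` noticeably more likely than its baseline.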


### Calculate Confidence

In [None]:
def confidence(item_counts, pair_counts, n):
  item_support = item_counts.map(lambda x : (x[0], x[1]/n))
  pair_support = pair_counts.map(lambda x : (x[0], x[1]/n))
  X_Ypair_map = pair_support.map(lambda x : (x[0][0],(x[0][1], x[1])))
  Y_Xpair_map = pair_support.map(lambda x : (x[0][1],(x[0][0], x[1])))
  X_Yjoin = item_support.join(X_Ypair_map)
  Y_Xjoin = item_support.join(Y_Xpair_map)
  confCombined = X_Yjoin.union(Y_Xjoin)
  confidence = confCombined.map(lambda x : ((x[0],x[1][1][0]), x[1][1][1]/x[1][0]))
  return confidence

top_rule_confidences = confidence(top_item_counts, top_pair_counts, num_basket)
print(top_rule_confidences.takeOrdered(5, lambda kv: -kv[1]))

[(('B006K4O3VM', 'B0063AA7DG'), 1.0), (('B006K4O3VM', 'B006AQF62A'), 0.9919354838709676), (('B006K4O3VM', 'B005DJ8S02'), 0.9919354838709676), (('B00D62U00E', 'B000IDC8S6'), 0.990909090909091), (('B00D62U00E', 'B0063AA7DG'), 0.990909090909091)]


### Calculate interest

In [None]:
def interest(item_counts, rule_confidences, n):
  # Map each item to its probability: (Y, Pr[Y])
  ProbY = item_counts.map(lambda x : (x[0], x[1]/n))

  #Swap the pair((X, Y), _) form to (Y, (X, Conf(X->Y))) to make Y as key
  X_YconfPair = rule_confidences.map(lambda x : (x[0][1],(x[0][0], x[1])))

  #Join to get the form of (Y,(Prob(Y), (X, Conf(X->Y)))) to calculate interest next
  X_Yjoin = ProbY.join(X_YconfPair)

  #Calculate Interest by subtracting Conf(X->Y) - Prob(Y)
  #and mapping them in form ((X, Y), Interest(X->Y))
  interest = X_Yjoin.map(lambda x : ((x[1][1][0], x[0]), x[1][1][1]-x[1][0]))
  return interest

top_interest = interest(top_item_counts, top_rule_confidences, num_basket)
print(top_interest.takeOrdered(5, lambda kv: -kv[1]))

[(('B006K4O3VM', 'B0063AA7DG'), 0.9994217055505816), (('B006K4O3VM', 'B005DJ8S02'), 0.9913606731230518), (('B006K4O3VM', 'B006AQF62A'), 0.9913328035110316), (('B00D62U00E', 'B000IDC8S6'), 0.9903447312656827), (('B00D62U00E', 'B005DJ8S02'), 0.9903342801611751)]


# **Recommend Items based on Market Basket Analysis**

### A small sample of customers for recommending personalized items

In [None]:
sample_small = list_of_baskets.filter(lambda x : len(x[1])>=3).sample(False, 0.00005, 81)
num_of_customer = sample_small.count()
sample_customers = (sample_small.map(lambda x : x[0]).collect())
print(f"There are {num_of_customer} number of customers in this sample")
print(f"List of customer's ID {sample_customers}")
print(f"Each customer' ID and their baskets : \n{sample_small.collect()}")

There are 5 number of customers in this sample
List of customer's ID [16321, 90887, 93660, 128345, 270219]
Each customer' ID and their baskets : 
[(16321, ['B000I346XQ', 'B00716PFZQ', 'B007O58268', 'B00OGLF4NW', 'B006YVD9G6', 'B002HQT61E', 'B008F2JI8K', 'B00019FVUE', 'B0000ETAH7', 'B001M1DTYU', 'B001LO50SQ', 'B00DLNQHS2', 'B06ZZGNRHT', 'B001EQ5BQW', 'B01N5LU4I4', 'B0000CFTNR', 'B00C7BGB5I', 'B002MN424S', 'B007XNW1ZO', 'B001ACDOA0', 'B000WR4IRW', 'B00M75N9KY', 'B001HTRADI', 'B0078DPHQE', 'B0025VQE5W', 'B011PK0CFQ', 'B000XK0FFW', 'B001L048WO', 'B00716PFYM', 'B001K3L48I']), (90887, ['B005HPOVCQ', 'B01IHDB91A', 'B0014CWPQA', 'B00FVYF334', 'B00O6HRAO2', 'B00LWZYTZO', 'B01IRIONK4', 'B000R9EEH4', 'B000PYF8VM', 'B000Z4J53Y', 'B0748J34WZ', 'B000R2Z682', 'B0014E2OLY', 'B000R91EHW', 'B000RYFRVG', 'B07CX6LN8T', 'B01N0SS46Z']), (93660, ['B005P9WCVG', 'B005P9WFXG', 'B005P9W92S', 'B005P9WGUI', 'B0041HURTM', 'B005P9WIXS', 'B005P9WBVW', 'B000NY6PP2', 'B005P7YQ0I', 'B005P7YT5A', 'B00DL0AM40', 'B005P7YTO

### Recommend personalized items to each customer based on their basket
Using the interest and confidence values calculated above

In [None]:
def recommend_items(sample_data_rdd, co_purchased_pairs_rdd, target_user):
  # Create a lookup dictionary for user items
  user_items_dict = dict(sample_data_rdd.collect())

  # Get the items already purchased by the target user
  purchased_items = user_items_dict.get(target_user, [])

  # Filter the co-purchased pairs with interest higher than 0.5
  filtered_pairs_rdd = co_purchased_pairs_rdd.filter(lambda x: x[1] > 0.5)

  # Find the co-purchased items from the co-purchased pair list
  recommendations = filtered_pairs_rdd.filter(lambda x: any(item in purchased_items for item in x[0]))

  # Extract the recommended items (both members of each matching pair)
  recommended_items = recommendations.flatMap(lambda x: x[0]).distinct().collect()
  return target_user, recommended_items

for target_user in sample_customers:
  target_customer, recommendations = (recommend_items(sample_small, top_interest, target_user))
  print((f"Recommended Items for user {target_customer}", recommendations))
  print(f"The number of recommended items for user {target_customer} : {len(recommendations)}\n")

('Recommended Items for user 16321', ['B001CDTO6U', 'B01MCZWIRJ', 'B001ID2TPC', 'B005EG78OQ', 'B0037X4QDO', 'B00IRY8CJ2', 'B00PO9GEKC', 'B001CDVCCO', 'B004Y4Z9CC', 'B008L0T0TS', 'B001ACDOA0', 'B001L048WO', 'B009QDQVK0', 'B01MA28YIS', 'B00F8FHRF8', 'B008L0ZS44', 'B009YLDEMW', 'B01AP7LMS6', 'B008GQ1ITM', 'B011PK0CFQ', 'B06XPQ5WG6', 'B00AQB146W', 'B077YDC48D', '160774970X'])
The number of recommended items for user 16321 : 24

('Recommended Items for user 90887', ['B07CX6LN8T', 'B002RZ1SRU', 'B01C2JF1WS', 'B0014CXSFM', 'B01N0SS3EH', 'B013SGABHY', 'B00BNQLR9I', 'B01N183ZGT', 'B00DX53PWE', 'B000WHSD5A', 'B000UENHBU', 'B00FNNFCMK', 'B01IVTAW9A', 'B07B8N49KT', 'B003Y8EASI', 'B01GFPSAPA', 'B0014E2UP4', 'B000RUP0JE', 'B000WH6WIA', 'B00MZF1YHQ', 'B07G39L3Q6', 'B01AZ15E22', 'B00061ETX2', 'B000RULSYK', 'B07G38C4JG', 'B000PYF8VM', 'B00G4U6UY0', 'B000T9WLTK', 'B00G4U704Y', 'B00W0C9F5M', 'B000R9COOY', 'B000T9YBT8', 'B01N0SS46Z', 'B000RPVQC4', 'B0015GQBHE', 'B00091S3K4', 'B0014D1JWA', 'B000QJAWYE', 'B
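
The lists above can contain ASINs that are already in the customer's basket (for example, 'B001ACDOA0' for customer 16321), because both members of every matching pair are collected. One possible refinement is sketched below in plain Python. The helper `recommend_new_items` and the toy data are hypothetical, and the sketch assumes each scored pair `((X, Y), score)` is read as the rule X → Y; it is not the notebook's method.

```python
def recommend_new_items(purchased, scored_pairs, threshold=0.5):
    """Return consequents of high-scoring rules whose antecedent is owned,
    leaving out items the customer already has."""
    owned = set(purchased)
    recommended = set()
    for (x, y), score in scored_pairs:
        if score <= threshold:
            continue  # rule is not interesting enough
        if x in owned and y not in owned:
            recommended.add(y)
    return sorted(recommended)

# Hypothetical rules in the same ((X, Y), interest) shape as top_interest.
scored_pairs = [
    (("A", "B"), 0.9),
    (("B", "A"), 0.7),
    (("A", "C"), 0.4),
    (("D", "E"), 0.95),
    (("A", "D"), 0.8),
]
new_items = recommend_new_items(["A", "D"], scored_pairs)
print(new_items)
```

In this toy run, "D" is not recommended even though the rule A → D scores highly, because the customer already owns it.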

### Adding item names so customers can see what the recommended items are

In [None]:
def recommend_items_with_name(item_names_dict, sample_data_rdd, co_purchased_pairs_rdd, target_user):
  # Create a lookup dictionary for user items
  user_items_dict = dict(sample_data_rdd.collect())

  # Get the items already purchased by the target user
  purchased_items = user_items_dict.get(target_user, [])

  # Filter the co-purchased pairs with interest higher than 0.6
  filtered_pairs_rdd = co_purchased_pairs_rdd.filter(lambda x: x[1] > 0.6)

  # Find the co-purchased items from the co-purchased pair list
  recommendations = filtered_pairs_rdd.filter(lambda x: any(item in purchased_items for item in x[0]))

  # Extract the recommended items
  recommended_items = recommendations.flatMap(lambda x: x[0]).distinct().collect()

  # Get the names of the recommended items
  recommended_item_names = [item_names_dict.get(item, "") for item in recommended_items if item_names_dict.get(item, "")]
  return (f"Recommended Items for customer {target_user}", recommended_item_names)

# Create a lookup dictionary for item names
item_names_dict = dict(join_asin_title.collect())
for target_user in sample_customers:
  # Print the recommended items with their names
  print(recommend_items_with_name(item_names_dict, sample_small, top_interest, target_user))
  print()

('Recommended Items for customer 16321', ["Peychaud's Aromatic Cocktail Bitters - 10 Ounce Bottle", 'Fee Brothers Cherry Bitters 5oz', 'Fee Brothers Black Walnut Bitters 5oz', 'BOURBON BARREL FOODS WOODFORD RESERVE BOURBON CHERRIES WRCC', "Regan's Orange Bitters No. 6, 10 Ounces", 'Fee Brothers Aztec Chocolate Cocktail Bitters 5oz', 'Angostura Orange Bitters, 4-Ounce', 'Teavana Perfectea Rock Sugar (1 lb)', "Peychaud's Bitters - 5 ounce", 'Fee  Brothers Black Walnut Cocktail Bitters - 4 Ounce', 'Luxardo, Gourmet Cocktail Maraschino Cherries 400G Jar', 'Luxardo Maraschino Cherries, 418 mL'])

('Recommended Items for customer 90887', ["Kellogg's Breakfast Cereal, Frosted Mini-Wheats, Original, Low Fat, Excellent Source of Fiber, Family Size, 24 oz Box", 'Smartfood Popcorn, White Cheddar, 12 ct Bags', 'Capri Sun Juice Drink, Fruit Punch, 10-Count, 6-Oz', 'Tostitos Salsa Con Queso - Medium, 15 Ounce', "REESE'S Snack Size Peanut Butter Cups, 10.5 Ounce", "Cap'N Crunch Cereal, Crunchberries,

## Appendix B. Viewing the name of each product in a more readable way

Note that some product names are unknown because they are not included in this Amazon Review Dataset,

so for unknown products, looking up the ASIN on Google is recommended

In [None]:
# Create a lookup dictionary for item names
item_names_dict = dict(join_asin_title.collect())

# Sample data RDD
sample_data_rdd = sample_small

# Co-purchased pair list RDD
co_purchased_pairs_rdd = top_interest

# Extract the items purchased by each user
user_items_rdd = sample_data_rdd.map(lambda x: (x[0], x[1]))

# Create a lookup dictionary for user items
user_items_dict = dict(user_items_rdd.collect())

# Get the user for whom you want to recommend additional items
target_user = 16321

# Get the items already purchased by the target user
purchased_items = user_items_dict.get(target_user, [])

# Filter the co-purchased pairs with interest higher than 0.6
filtered_pairs_rdd = co_purchased_pairs_rdd.filter(lambda x: x[1] > 0.6)

# Find the co-purchased items from the co-purchased pair list
recommendations = filtered_pairs_rdd.filter(lambda x: any(item in purchased_items for item in x[0]))

# Extract the recommended items
recommended_items = recommendations.flatMap(lambda x: x[0]).distinct().collect()

# Get the names of the recommended items
recommended_item_names = [item_names_dict.get(item, "Unknown") for item in recommended_items]

# Print the recommended items with their names
print(f"Recommended Items for user {target_user}:")
for item, name in zip(recommended_items, recommended_item_names):
    print("Item:", item)
    print("Name:", name)
    print()

Recommended Items for user 16321:
Item: B001CDTO6U
Name: Peychaud's Aromatic Cocktail Bitters - 10 Ounce Bottle

Item: B01MCZWIRJ
Name: Unknown

Item: B001ID2TPC
Name: Fee Brothers Cherry Bitters 5oz

Item: B005EG78OQ
Name: Fee Brothers Black Walnut Bitters 5oz

Item: B0037X4QDO
Name: Unknown

Item: B00PO9GEKC
Name: BOURBON BARREL FOODS WOODFORD RESERVE BOURBON CHERRIES WRCC

Item: B001CDVCCO
Name: Regan's Orange Bitters No. 6, 10 Ounces

Item: B00IRY8CJ2
Name: Unknown

Item: B008L0T0TS
Name: Fee Brothers Aztec Chocolate Cocktail Bitters 5oz

Item: B001ACDOA0
Name: Angostura Orange Bitters, 4-Ounce

Item: B009QDQVK0
Name: Unknown

Item: B01MA28YIS
Name: Unknown

Item: B001L048WO
Name: Teavana Perfectea Rock Sugar (1 lb)

Item: B00F8FHRF8
Name: Peychaud's Bitters - 5 ounce

Item: B008L0ZS44
Name: Fee  Brothers Black Walnut Cocktail Bitters - 4 Ounce

Item: B01AP7LMS6
Name: Unknown

Item: B009YLDEMW
Name: Unknown

Item: B011PK0CFQ
Name: Luxardo, Gourmet Cocktail Maraschino Cherries 400G 

# **Setup for Cosine_similarity Recommendations**

### **Generate Sample for cosine_similarity recommendations**

Note that sample_small cannot be used for cosine_similarity,
since the sample is too small to find customers with similar shopping tendencies. Therefore, we generate larger samples.

Samples to be generated:



*   Sample_medium ; medium size of sample
*   Sample_large ; large size of sample

\
Samples larger than "sample_large" approach the limit of what is computationally feasible here, so only two sample sizes are used for cosine_similarity recommendations.

Generate sample_medium

In [None]:
sample_medium = list_of_baskets.filter(lambda x : len(x[1])>=3).sample(False, 0.001, 81)

Generate sample_large

In [None]:
sample_large = list_of_baskets.filter(lambda x : len(x[1])>=3).sample(False, 0.01, 51)

Details of each sample

In [None]:
print("Details of sample_medium")
print("____________________________________________________________________________________________")
print(f"The number of unique items : {sample_medium.flatMap(lambda x: x[1]).distinct().count()}")
print(f"The total number of items (including duplicates): ", sample_medium.flatMap(lambda x: x[1]).count())
print(f"The number of baskets : ", sample_medium.count())
print(f"The contents of sample_medium; list of (customer_id, customer's basket) : \n", sample_medium.take(10))
print()
print("Details of sample_large")
print("____________________________________________________________________________________________")
print(f"The number of unique items : {sample_large.flatMap(lambda x: x[1]).distinct().count()}")
print(f"The total number of items (including duplicates): ", sample_large.flatMap(lambda x: x[1]).count())
print(f"The number of baskets : ",  sample_large.count())
print(f"The contents of sample_large; list of (customer_id, customer's basket) : \n", sample_large.take(10))

Details of sample_medium
____________________________________________________________________________________________
The number of unique items : 2859
The total number of items (including duplicates):  3015
The number of baskets :  74
The contents of sample_medium; list of (customer_id, customer's basket) : 
 [(818, ['B0000DJ7XG', 'B001710KLA', 'B073QVVMCB', 'B001G5J7XO', 'B01E5SMSMA', 'B0000DJ7T0', 'B0000DJ7SR', 'B01I5FXHJW', 'B001712914', 'B01NCSDWE7', 'B01E5SL420', 'B0033Y16CQ', 'B00ZRDTKJW', 'B001TQKFE0', 'B01EYSCA88', 'B0000DJ7U0', 'B00ZVIWWB6', 'B00RZ6GXJE', 'B00CMS145S', 'B007H8Q6P6', 'B001EQ5AVI', 'B01K28ZMV6', 'B01A774T8O', 'B000V1JVCG', 'B01M10T49T', 'B074KN7S2D', 'B01H7DFM32', 'B000T7HG06', 'B00TE2EPCO', 'B01E5SL65K', 'B079DX8S3Z', 'B00ZWPVAKC', 'B01ITMO0G0']), (6009, ['B0001GV4O4', 'B0001GV4OE', 'B0001GV4NU', 'B0001GV4OY', 'B0001GV4NK', 'B0001GV4NA']), (8930, ['B00099XM0M', 'B000RPYZ3G', 'B000RPUD9Q', 'B000WHE7S2', 'B00FLJY8MQ', 'B07CX6LN8T', 'B000PXZZQG', 'B000UEUAGU', 'B

# **Data Setup for using Cosine_similarity**
### Generate sparse vectors for each basket in the sample

In [None]:
from pyspark.ml.linalg import SparseVector
from math import sqrt
# Generate a sparse vector for each sampled customer from the items they purchased
def generate_SV(customer_items_rdd):

  """Call create_sparse_vector on every basket to build a
  (customer_id, (indices, SparseVector)) pair for each customer."""

  # customer_items_rdd: an RDD of (customer_id, [item_id1, item_id2, ...]) tuples
  # Create a dictionary of item indices for efficient lookup
  item_indices = customer_items_rdd.flatMap(lambda x: x[1]).distinct().zipWithIndex().collectAsMap()

  # Broadcast the item_indices dictionary to all worker nodes
  item_indices_broadcast = sc.broadcast(item_indices)

  # Create sparse vectors for each customer's item vector
  def create_sparse_vector(customer_items, item_indices_broadcast):
      # NOTE: 0 doubles as the "missing item" sentinel here, so the item that
      # zipWithIndex mapped to position 0 is also dropped by the filter below.
      indices = [item_indices_broadcast.value.get(item, 0) for item in customer_items]
      indices = [index for index in indices if index != 0]
      indices.sort()
      values = [1] * len(indices)
      vector_size = len(item_indices_broadcast.value)
      return (indices, SparseVector(vector_size, indices, values))

  customer_vectors = customer_items_rdd.map(lambda x : (x[0], create_sparse_vector(x[1], item_indices_broadcast)))

  return customer_vectors
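
In `create_sparse_vector`, `get(item, 0)` followed by the `index != 0` filter also removes the item that was assigned position 0. This appears to be why, in the output shown for `sample_medium`, customer 818's 33-item basket yields a vector with 32 non-zero entries. A sentinel-free lookup could look like the sketch below; the helper name `basket_to_indices` and the toy mapping are hypothetical, and the sketch also de-duplicates positions, which is an added assumption.

```python
def basket_to_indices(basket, item_index):
    """Map a basket to sorted, de-duplicated vector positions, skipping unknown items."""
    # Membership testing distinguishes "missing" from a valid position 0,
    # which a get(item, 0) default cannot do.
    return sorted({item_index[item] for item in basket if item in item_index})

# Hypothetical item-to-position mapping; "apple" legitimately sits at position 0.
item_index = {"apple": 0, "bread": 1, "cheese": 2}
indices = basket_to_indices(["cheese", "apple", "unknown"], item_index)
print(indices)  # [0, 2]
```

The resulting list can be passed to `SparseVector(len(item_index), indices, [1.0] * len(indices))` in the same way as in the notebook code.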


### Extract the data needed for recommending items through cosine similarity

In [None]:
def SV_information(SV_rdd):
  customer_id_list = []
  only_SV_list = []
  for customer_id, vector in SV_rdd.collect():
    customer_id_list.append(customer_id)
    only_SV_list.append(vector[1])
  return customer_id_list, only_SV_list

### Display the contents of generated Sparse vector for customer baskets in detail

In [None]:
# Print the resulting customer vectors
print("Display outputs of generated Sparse Vector")
for customer_id, vector in generate_SV(sample_medium).collect():
    print("Customer ID:", customer_id)
    print("Sparse Vector:", vector[1])
    print("Indices : ", vector[0])
    print()

SV_information(generate_SV(sample_medium))

Display outputs of generated Sparse Vector
Customer ID: 818
Sparse Vector: (2859,[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,1443,1444,1445,1446,1447,1448,1449,1450,1451,1452,1453,1454,1455,1456,1457],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
Indices :  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1443, 1444, 1445, 1446, 1447, 1448, 1449, 1450, 1451, 1452, 1453, 1454, 1455, 1456, 1457]

Customer ID: 6009
Sparse Vector: (2859,[18,19,20,21,1458,1459],[1.0,1.0,1.0,1.0,1.0,1.0])
Indices :  [18, 19, 20, 21, 1458, 1459]

Customer ID: 8930
Sparse Vector: (2859,[22,23,24,25,26,27,28,29,30,31,32,33,34,1460,1461,1462,1463,1464,1465,1466,1467,1468,1469,1470],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
Indices :  [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 1460, 1461, 1462, 1463, 1464, 1465, 1466, 1467, 1468, 1469, 1470]

Customer ID

([818,
  6009,
  8930,
  10300,
  16321,
  18240,
  25239,
  29578,
  31564,
  43584,
  63224,
  69945,
  74501,
  79133,
  79395,
  83734,
  83982,
  90887,
  91443,
  93099,
  93660,
  95209,
  99502,
  104198,
  105820,
  107435,
  114381,
  122745,
  126051,
  128345,
  133680,
  134310,
  146434,
  147447,
  147703,
  150614,
  151435,
  154591,
  158116,
  162950,
  168004,
  175620,
  179901,
  189981,
  192388,
  194377,
  194643,
  194732,
  197639,
  199642,
  200950,
  204947,
  215481,
  218370,
  222580,
  231071,
  231983,
  233647,
  237741,
  237810,
  244365,
  250802,
  254648,
  260333,
  262848,
  265885,
  270219,
  271666,
  280003,
  281410,
  281908,
  282953,
  284592,
  286213],
 [SparseVector(2859, {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0, 6: 1.0, 7: 1.0, 8: 1.0, 9: 1.0, 10: 1.0, 11: 1.0, 12: 1.0, 13: 1.0, 14: 1.0, 15: 1.0, 16: 1.0, 17: 1.0, 1443: 1.0, 1444: 1.0, 1445: 1.0, 1446: 1.0, 1447: 1.0, 1448: 1.0, 1449: 1.0, 1450: 1.0, 1451: 1.0, 1452: 1.0, 1453: 1.0,

### Function for calculating cosine similarity

Specifically, this calculates cosine similarity when both arguments are given as Sparse Vectors

In [None]:
def cosine_similarity_sv(x, y):
  """
  Computes cosine similarity on spark ML SparseVectors x and y
  """
  #multiplication of two 1's is a 1, anything else is a 0, so bitset dot product
  # is equivalent to bitwise and followed by a count of all nonzero elements
  dot_product = x.dot(y)

  #magnitude is sqrt(sum of squares), square of 0 is 0 and square of 1 is 1,
  #so sum of squares for us is equivalent to count of all nonzero elements
  mag_x = sqrt(x.numNonzeros())
  mag_y = sqrt(y.numNonzeros())

  return dot_product / (mag_x*mag_y)
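
For binary purchase vectors, the same quantity can be written directly on sets of item IDs: the dot product is the size of the intersection, and each magnitude is the square root of the basket size. The sketch below shows this equivalence in plain Python. The function name `cosine_similarity_sets` is hypothetical, and returning 0.0 for an empty basket is an assumption made here to avoid a division by zero.

```python
from math import sqrt

def cosine_similarity_sets(a, b):
    """Cosine similarity of two binary baskets represented as sets of item ids."""
    if not a or not b:
        return 0.0  # assumption: an empty basket is treated as similarity 0
    # |a & b| is the dot product; sqrt(|a|) and sqrt(|b|) are the magnitudes.
    return len(a & b) / sqrt(len(a) * len(b))

# Hypothetical baskets, for illustration only.
basket_x = {"A", "B", "C", "D"}
basket_y = {"C", "D", "E"}
sim = cosine_similarity_sets(basket_x, basket_y)
print(round(sim, 4))
```

With two shared items out of baskets of size 4 and 3, the similarity is 2 / sqrt(12), roughly 0.577.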

### Extract the cosine similarity with every other customer in the sample
(by calculating cosine_similarity on the Sparse Vectors provided)

*Note that the target customer (the one who receives personalized recommendations) is set to the first customer in each sample.*

*Hence, the first element of the cosine similarity list is 1 (or very close to 1), since it compares the target customer with themselves.*

In [None]:
def customer_similarity(sample, baskets):
  customer_id, only_indexes = baskets
  baskets_list = sc.broadcast(only_indexes)
  # Use the first customer in the sample as the customer who needs recommendations
  test_user_sv = only_indexes[0]
  similarity_cal = sample.map(lambda x: [cosine_similarity_sv(test_user_sv, s) for s in baskets_list.value])
  similarity_list = similarity_cal.take(1)[0]
  return similarity_list

print("Cosine_similarity list between target customers and the other customers (in each sample)")
print("\nFor target customer in Sample_Medium")
print(customer_similarity(sample_medium, SV_information(generate_SV(sample_medium))))
print("\nFor target customer in Sample_large")
print(customer_similarity(sample_large, SV_information(generate_SV(sample_large))))

Cosine_similarity list between target customers and the other customers (in each sample)

For target customer in Sample_Medium
[0.9999999999999998, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.018430244519362142, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.01928791874526149, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

For target customer in Sample_large
[1.0000000000000002, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.033113308926626096, 0.0, 0.0, 0.0, 0.0, 0.3857583749052298, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,

### Find similar customers

Note that find_similar_customers returns the indexes of the similar customers within the sample. These indexes are used later to extract the details of each similar customer, including their customer ID and the items they purchased.

In [None]:
def find_similar_customers(sample, similarity_list):
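    similarity_rdd = sc.parallelize(similarity_list)
    # Pair each similarity with its position in the sample: (index, similarity)
    similarity_rdd_with_indx = similarity_rdd.zipWithIndex().map(lambda x : (x[1],x[0]))
    # Drop the target customer itself (similarity close to 1) and customers with no shared items
    filter_test_customer = similarity_rdd_with_indx.filter(lambda x : x[1] < 0.9 and x[1]!=0.0)
    # Sort by similarity, most similar first
    similar_customers = filter_test_customer.sortBy(lambda x : -x[1])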

    return similar_customers

similar_basket_medium = find_similar_customers(sample_medium, customer_similarity(sample_medium, SV_information(generate_SV(sample_medium))))
similar_basket_large = find_similar_customers(sample_large, customer_similarity(sample_large, SV_information(generate_SV(sample_large))))
print(f"For medium samples's test customer, \n list of (similar user index, similarity) \n", similar_basket_medium.collect())
print(f"\nFor large samples's test customer, \n list of (similar user index, similarity) \n", similar_basket_large.collect())

For medium samples's test customer, 
 list of (similar user index, similarity) 
 [(36, 0.01928791874526149), (9, 0.018430244519362142)]

For large samples's test customer, 
 list of (similar user index, similarity) 
 [(12, 0.3857583749052298), (453, 0.05555555555555555), (200, 0.0545544725589981), (170, 0.05360562674188975), (502, 0.05360562674188975), (236, 0.05103103630798287), (419, 0.03450327796711771), (7, 0.033113308926626096), (389, 0.031686212526223896)]
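
The filtering and ranking step can be sketched without Spark as follows. This is only an illustration under the same assumptions as the cell above (similarities are non-negative, and anything at or above 0.9 is taken to be the target customer). The name `rank_similar_customers` and its `self_threshold` parameter are hypothetical.

```python
def rank_similar_customers(similarity_list, self_threshold=0.9):
    """Return (index, similarity) pairs, most similar first.

    Drops zero similarities (no shared items) and anything at or above
    self_threshold, mirroring how the notebook excludes the target
    customer's similarity with itself.
    """
    kept = [(i, s) for i, s in enumerate(similarity_list) if 0.0 < s < self_threshold]
    return sorted(kept, key=lambda pair: -pair[1])

# Shortened list: the two non-zero values are taken from the medium-sample output,
# but their positions here are compressed for readability.
similarities = [0.9999999999999998, 0.0, 0.018430244519362142, 0.0, 0.01928791874526149]
ranked = rank_similar_customers(similarities)
print(ranked)
```

One consequence of thresholding on the value rather than on the index is that another customer with a near-identical basket (similarity of 0.9 or more) would also be excluded. Filtering out index 0 explicitly would avoid that.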


### Obtain similar customers' baskets and their IDs

Note that the previous code gave us the index of each similar customer in the sample.

In this step, we use these indexes to extract the details of the similar customers, including each customer's ID and basket.

In [None]:
def generate_similar_customer_with_basket(similar_customers_rdd, sample):
  similar_customers_list = similar_customers_rdd.map(lambda x : x[0]).collect()
  sample_indx = sample.zipWithIndex().map(lambda x : (x[1],x[0]))
  similar_user_with_basket_indx = sample_indx.filter(lambda x : x[0] in similar_customers_list)

  #Remove index for getting similar customers basket
  similar_user_with_basket = similar_user_with_basket_indx.map(lambda x : x[1])
  return similar_user_with_basket

similar_basket_medium_ASINS = generate_similar_customer_with_basket(similar_basket_medium, sample_medium)
similar_basket_large_ASINS = generate_similar_customer_with_basket(similar_basket_large, sample_large)
print(similar_basket_medium_ASINS.collect())
print(similar_basket_large_ASINS.collect())

[(43584, ['B001GHYO4E', 'B00014FNU2', 'B00EIS9NW6', 'B00ENP215K', 'B010BUE6CG', 'B006VRNEFO', 'B002OOLUTK', 'B001EQ4SHK', 'B00HJCXX24', 'B01FE7PM1U', 'B005ERUCVQ', 'B0054M0U18', 'B074CHSXN4', 'B00CYMU3TA', 'B007SNJ98G', 'B01H7DFM32', 'B00UHMOG60', 'B01FG5W2KE', 'B015FFIOO8', 'B002SKVZIQ', 'B005EOTMA6', 'B07939DYBP', 'B0005ZXPY8', 'B00VVMTLV0', 'B0147CMIRE', 'B001B49UOG', 'B00VF03MUY', 'B01NB9PVP6', 'B001GHYO44', 'B00GB8PS9I', 'B0755JN13H', 'B004XJXIM6', 'B01M173US6', 'B001L97SQS', 'B018Y0WYHY', 'B001B5JT8C', 'B00LIAVX98', 'B000F0AUGO', 'B003T0668I', 'B00IXUKFVI', 'B01E9GXNCM', 'B00TXQQDW6', 'B0774M73SF', 'B00060N5OW', 'B01M4IWGHE', 'B06Y1DCSRQ', 'B0793KYR9W', 'B003XKWPD4', 'B000R4FGK8', 'B071W21LLT', 'B001PQMJIY', 'B001H8R00M', 'B079VP6DH6', 'B01FXMD3KS', 'B01HSU0UC2', 'B00GGBLPVU', 'B005HUWCFY', 'B00STSZ77G', 'B00DLKWXRO', 'B004NG9FWQ', 'B0053KN8QA', 'B00ASBOP9S', 'B007P0FWSI', 'B00P2XTTS4', 'B004YPEFH6', 'B00HNSJSX2', 'B0741GFQHJ', 'B016UMTPRK', 'B00COCUM96', 'B00BFUFYZK', 'B00125V10
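
The lookup from index to (customer ID, basket) can be sketched in plain Python as below. It assumes the sample is an ordered list of (customer_id, items) tuples whose order matches the order used to build the similarity list. The function name and the data are hypothetical.

```python
def baskets_for_indexes(sample, similar_indexes):
    """Return the (customer_id, items) entries found at the given positions."""
    wanted = set(similar_indexes)  # set membership is constant time, unlike a list
    return [entry for position, entry in enumerate(sample) if position in wanted]

# Hypothetical sample, for illustration only
sample = [
    (818, ["A1", "A2"]),
    (43584, ["A2", "B7", "C3"]),
    (9001, ["D4"]),
]
print(baskets_for_indexes(sample, [2, 0]))
```

The result follows the order of the sample, not the similarity ranking, which is also how the filter on the indexed RDD behaves in the cell above.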

# **Recommend Items to target customer based on Cosine Similarity**

Because the number of products is very large compared with the number of customers, it is unlikely that any other customer will have a high similarity to the target customer. Therefore, we recommend products in two ways.



1.   For a customer who wants many recommendations
2.   For a customer who wants fewer but more relevant recommendations



## **For the first case,**

1.   For a customer who wants many recommendations

### Recommendation Approach:
Recommend all items that **"all similar customers"** purchased but the target customer did not purchase

\
Because every item in a similar customer's basket is recommended unless the target customer has already purchased it, this approach generates quite a lot of recommendations.
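
As a rough illustration of this approach, here is a plain-Python sketch that assumes each similar customer is a (customer_id, items) tuple. Unlike the notebook cell that follows, this sketch also removes repeated items; that is an added assumption, not the notebook's behaviour. All names and data are hypothetical.

```python
def recommend_from_all(target_items, similar_baskets):
    """Items bought by any similar customer that the target customer has not bought."""
    owned = set(target_items)
    seen = set()
    recommendations = []
    for _customer_id, items in similar_baskets:
        for item in items:
            # Skip items the target already owns and items already recommended
            if item not in owned and item not in seen:
                seen.add(item)
                recommendations.append(item)
    return recommendations

# Hypothetical data, for illustration only
target_items = ["A1", "A2"]
similar_baskets = [(43584, ["A2", "B7", "C3"]), (9001, ["C3", "D4", "A1"])]
print(recommend_from_all(target_items, similar_baskets))
```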



In [None]:
def recommended_items(sample_rdd, similar_customers_basket_rdd):
  only_items_for_similar_customers = similar_customers_basket_rdd.flatMap(lambda x : x[1])
  recommend_needed_customer_items = sample_rdd.take(1)[0][1]
  new_items = only_items_for_similar_customers.filter(lambda x : x not in recommend_needed_customer_items)
  return new_items

more_recommendations_medium = recommended_items(sample_medium, similar_basket_medium_ASINS)
more_recommendations_large =  recommended_items(sample_large, similar_basket_large_ASINS)
print(f"Recommended Items for customer {sample_medium.take(1)[0][0]} :\n{more_recommendations_medium.collect()}")
print(f"The number of recommended items(medium_sample): {more_recommendations_medium.count()}")
print(f"Recommended Items for customer {sample_large.take(1)[0][0]} :\n{more_recommendations_large.collect()}")
print(f"The number of recommended items(large_sample): {more_recommendations_large.count()}")


Recommended Items for customer 818 :
['B001GHYO4E', 'B00014FNU2', 'B00EIS9NW6', 'B00ENP215K', 'B010BUE6CG', 'B006VRNEFO', 'B002OOLUTK', 'B001EQ4SHK', 'B00HJCXX24', 'B01FE7PM1U', 'B005ERUCVQ', 'B0054M0U18', 'B074CHSXN4', 'B00CYMU3TA', 'B007SNJ98G', 'B00UHMOG60', 'B01FG5W2KE', 'B015FFIOO8', 'B002SKVZIQ', 'B005EOTMA6', 'B07939DYBP', 'B0005ZXPY8', 'B00VVMTLV0', 'B0147CMIRE', 'B001B49UOG', 'B00VF03MUY', 'B01NB9PVP6', 'B001GHYO44', 'B00GB8PS9I', 'B0755JN13H', 'B004XJXIM6', 'B01M173US6', 'B001L97SQS', 'B018Y0WYHY', 'B001B5JT8C', 'B00LIAVX98', 'B000F0AUGO', 'B003T0668I', 'B00IXUKFVI', 'B01E9GXNCM', 'B00TXQQDW6', 'B0774M73SF', 'B00060N5OW', 'B01M4IWGHE', 'B06Y1DCSRQ', 'B0793KYR9W', 'B003XKWPD4', 'B000R4FGK8', 'B071W21LLT', 'B001PQMJIY', 'B001H8R00M', 'B079VP6DH6', 'B01FXMD3KS', 'B01HSU0UC2', 'B00GGBLPVU', 'B005HUWCFY', 'B00STSZ77G', 'B00DLKWXRO', 'B004NG9FWQ', 'B0053KN8QA', 'B00ASBOP9S', 'B007P0FWSI', 'B00P2XTTS4', 'B004YPEFH6', 'B00HNSJSX2', 'B0741GFQHJ', 'B016UMTPRK', 'B00COCUM96', 'B00BFUFYZ

Reduced-size recommendation:
Recommend items that the most similar customer purchased but the target customer did not purchase.

## **For the second case,**

2. For a customer who wants fewer but more relevant recommendations

### Recommendation Approach:
Recommend all items that **"only the MOST similar customer"** purchased but the target customer did not purchase

\
Unlike the first case, which recommends products from all similar customers' baskets, here we recommend only products from the MOST similar customer's basket. This generates fewer recommendations for the target customer, and they are likely to be more personalized because the most similar customer's shopping taste overlaps most with the target customer's.
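
A minimal plain-Python sketch of this second approach follows. It assumes the sample is an ordered list of (customer_id, items) tuples and that the similar customers are given as (index, similarity) pairs. The function name, the data, and the similarity values are hypothetical.

```python
def recommend_from_most_similar(target_items, sample, ranked_similar):
    """Items bought by the single most similar customer but not by the target."""
    if not ranked_similar:
        return []  # no similar customer was found, so nothing to recommend
    # Pick the pair with the highest similarity, regardless of input order
    best_index, _similarity = max(ranked_similar, key=lambda pair: pair[1])
    _customer_id, items = sample[best_index]
    owned = set(target_items)
    return [item for item in items if item not in owned]

# Hypothetical data, for illustration only; index 0 is the target customer
sample = [
    (101, ["tea", "milk"]),
    (202, ["tea", "jam", "rice"]),
    (303, ["milk", "bread"]),
]
ranked_similar = [(1, 0.41), (2, 0.5)]
print(recommend_from_most_similar(sample[0][1], sample, ranked_similar))
```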


In [None]:
def reduced_size_recommendations(sample_rdd, similar_customers_basket_rdd, similarity_rdd):

  recommend_needed_customer_items = sample_rdd.take(1)[0][1]

  most_similar_customer_index = similarity_rdd.sortBy(lambda x : -x[1]).take(1)[0][0]

  sample_indx = sample_rdd.zipWithIndex().map(lambda x : (x[1],x[0]))
  similar_user_with_basket_indx = sample_indx.filter(lambda x : x[0] == most_similar_customer_index)

  #Remove index for getting similar customers basket
  most_similar_user_with_basket = similar_user_with_basket_indx.map(lambda x : x[1])
  most_similar_user_only_items = most_similar_user_with_basket.flatMap(lambda x : x[1])

  new_items = most_similar_user_only_items.filter(lambda x : x not in recommend_needed_customer_items)
  return new_items

less_recommendations_medium = reduced_size_recommendations(sample_medium, similar_basket_medium_ASINS, similar_basket_medium)
less_recommendations_large =  reduced_size_recommendations(sample_large, similar_basket_large_ASINS, similar_basket_large)
print(f"Recommended Items for customer {sample_medium.take(1)[0][0]} :\n{less_recommendations_medium.collect()}")
print(f"The number of recommended items(medium_sample): {less_recommendations_medium.count()}")
print(f"Recommended Items for customer {sample_large.take(1)[0][0]} :\n{less_recommendations_large.collect()}")
print(f"The number of recommended items(large_sample): {less_recommendations_large.count()}")

Recommended Items for customer 818 :
['B00DE6Q23Q', 'B00DE6KSTK', 'B00E0Y8PTA', 'B00BOAGEZA', 'B006GHVOX8', 'B007FA1EBW', 'B007F1LYWK', 'B001EQ4H1M', 'B0019N2EQU', 'B01IAYHZVO', 'B004M6ZDKA', 'B009F28PY2', 'B004K6771U', 'B018Y0WYHY', 'B00KPFHOPO', 'B005EOTMA6', 'B0044MNYIK', 'B01FG5W2KE', 'B004SIAOBK', 'B000FLXBN2', 'B00D1RPES2', 'B005VBD2UI', 'B00IV6ZM58', 'B00T77IBNU', 'B004CPAGNA', 'B000VDZ2N6', 'B00060N5OW', 'B00CF5DFGE', 'B0014E4DS6', 'B01GG68Q4I', 'B01M099264', 'B0032GQM5G', 'B018JZFUXY', 'B00IO2RL8K', 'B004YVOFB6', 'B00T70RNQ8', 'B01F7N4KM8', 'B0001CXUHW', 'B004G7PFRQ', 'B00KYW1K26', 'B0012XV382', 'B01H07E286', 'B00U9WIN62', 'B0058PP61U', 'B01A5MHEBU', 'B00F21YX8W', 'B00Q9EX6IY', 'B00AN91QCY', 'B00EQMJFAE', 'B00H46SFU0', 'B00CQG5QUA', 'B00FQGP20Q', 'B00M9C14UW', 'B077QLSLRQ', 'B0046EJ570', 'B01N49HPHK', 'B079VP6DH6', 'B01J3TGH1E', 'B003FULBQ4', 'B00HS6FWN4', 'B00HCNCQ2S', 'B00VF03MUY', 'B01FAOX2GO', 'B074DRLRPJ', 'B0013GMLRA', 'B01FXMDA2O', 'B072MZLXYZ', 'B00KKC6C2I', 'B0000CFN0

# **Test Scalability**

Write pyspark_recommendations.py

In [None]:
%%writefile pyspark_recommendations.py

import sys
import time
from math import sqrt
from operator import add

from pyspark.ml.linalg import SparseVector
from pyspark.sql import SparkSession

# No .master(...) is set here: a master hard-coded in the script takes precedence
# over the one supplied at submission time, which would force local mode on the cluster.
spark = SparkSession.builder.appName('SparkExample').config(
    "spark.executor.memory", "1g").config(
        "spark.executorEnv.PYTHONHASHSEED","0").config("spark.ui.port", "4050"
        ).getOrCreate()
sc = spark.sparkContext

support_threshold = 1000
if len(sys.argv) == 3:
  support_threshold = int(sys.argv[2])

time_start = time.time()

file_path = sys.argv[1]

item_purchased = spark.read.json(file_path).rdd

num_basket = item_purchased.count()

asin = item_purchased.map(lambda x : x[2])
title = item_purchased.map(lambda x : x[18])

index_asin = asin.zipWithIndex().map(lambda x : (x[1], x[0]))
index_title = title.zipWithIndex().map(lambda x : (x[1], x[0]))

join_asin_title = index_asin.join(index_title).map(lambda x : x[1])

also_buy = item_purchased.map(lambda x : x[0])
index_also_buy = also_buy.zipWithIndex().map(lambda x : (x[1], x[0]))

all_items_in_basket_with_indx = index_asin.join(index_also_buy)
list_of_baskets_with_indx = all_items_in_basket_with_indx.map(lambda x : (x[0], [x[1][0]]+(x[1][1])))
list_of_baskets = list_of_baskets_with_indx.sortBy(lambda x : x[0])
num_products = list_of_baskets.flatMap(lambda x : x[1]).distinct().count()

#Most frequent items
def a_priori_step1(items_per_basket):
  items = items_per_basket.flatMap(lambda x: x[1])
  itemsCount = items.map(lambda x : (x, 1)).reduceByKey(add)
  return itemsCount

def a_priori(rdd, support=100):
  supportItems = a_priori_step1(rdd).filter(lambda x : x[1] > support)
  freqItems = sc.broadcast(supportItems.collectAsMap())

  def filter_uncommon(text):
    return [item  for item in text if item in freqItems.value]

  def items_pair_tup(line):
    """Function for pairing up all items to tuple within the same line"""
    return [tuple(sorted((line[j], line[i]))) for i in range(len(line)) for j in range(i)]

  freqItemPerBasket = rdd.map(lambda x : x[1]).map(filter_uncommon)
  freqItemPair = freqItemPerBasket.filter(lambda x : len(x)>=2).flatMap(items_pair_tup)
  freqItemPairCount = freqItemPair.map(lambda x : (x, 1)).reduceByKey(add)

  return freqItemPairCount

top_item_counts = a_priori_step1(list_of_baskets).filter(lambda kv: kv[1]>=100)
top_pair_counts = a_priori(list_of_baskets, support=100)

def confidence(item_counts, pair_counts, n):
  item_support = item_counts.map(lambda x : (x[0], x[1]/n))
  pair_support = pair_counts.map(lambda x : (x[0], x[1]/n))
  X_Ypair_map = pair_support.map(lambda x : (x[0][0],(x[0][1], x[1])))
  Y_Xpair_map = pair_support.map(lambda x : (x[0][1],(x[0][0], x[1])))
  X_Yjoin = item_support.join(X_Ypair_map)
  Y_Xjoin = item_support.join(Y_Xpair_map)
  confCombined = X_Yjoin.union(Y_Xjoin)
  confidence = confCombined.map(lambda x : ((x[0],x[1][1][0]), x[1][1][1]/x[1][0]))
  return confidence

top_rule_confidences = confidence(top_item_counts, top_pair_counts, num_basket)

def interest(item_counts, rule_confidences, n):
  #Map the items with the corresponding its probability (Y, Probability of Y)
  ProbY = item_counts.map(lambda x : (x[0], x[1]/n))

  #Swap the pair((X, Y), _) form to (Y, (X, Conf(X->Y))) to make Y as key
  X_YconfPair = rule_confidences.map(lambda x : (x[0][1],(x[0][0], x[1])))

  #Join to get the form of (Y,(Prob(Y), (X, Conf(X->Y)))) to calculate interest next
  X_Yjoin = ProbY.join(X_YconfPair)

  #Calculate Interest by subtracting Conf(X->Y) - Prob(Y)
  #and mapping them in form ((X, Y), Interest(X->Y))
  interest = X_Yjoin.map(lambda x : ((x[1][1][0], x[0]), x[1][1][1]-x[1][0]))
  return interest

top_interest = interest(top_item_counts, top_rule_confidences, num_basket)
sample_small = list_of_baskets.filter(lambda x : len(x[1])>=3).sample(False, 0.00005, 81)
num_of_customer = sample_small.count()
sample_customers = (sample_small.map(lambda x : x[0]).collect())

def recommend_items(sample_data_rdd, co_purchased_pairs_rdd, target_user):
  # Create a lookup dictionary for user items
  user_items_dict = dict(sample_data_rdd.collect())

  # Get the items already purchased by the target user
  purchased_items = user_items_dict.get(target_user, [])

  # Filter the co-purchased pairs with interest higher than 0.5
  filtered_pairs_rdd = co_purchased_pairs_rdd.filter(lambda x: x[1] > 0.5)

  # Find the co-purchased items from the co-purchased pair list
  recommendations = filtered_pairs_rdd.filter(lambda x: any(item in purchased_items for item in x[0]))

  # Extract the recommended items
  recommended_items = recommendations.flatMap(lambda x: x[0]).distinct().collect()

  return target_user, recommended_items

for target_user in sample_customers:
  target_customer, recommendations = (recommend_items(sample_small, top_interest, target_user))
  print((f"Recommended Items for user {target_customer}", recommendations))
  print(f"The number of recommended items for user {target_customer} : {len(recommendations)}\n")

sample_medium = list_of_baskets.filter(lambda x : len(x[1])>=3).sample(False, 0.001, 81)

sample_large = list_of_baskets.filter(lambda x : len(x[1])>=3).sample(False, 0.01, 51)

#Generate Sparse vector for each sampled customer of what they purchased
def generate_SV(customer_items_rdd):

  """This function calls the create_sparse_vector function
  to generate a Sparse Vector for each basket, keyed by customer ID"""

  # customer_items_rdd: a list of (customer_id, [item_id1, item_id2, ...]) tuples called 'customer_items'
  # Create a dictionary of item indices for efficient lookup
  item_indices = customer_items_rdd.flatMap(lambda x: x[1]).distinct().zipWithIndex().collectAsMap()

  #Broadcast the item_indices dictionary to all worker nodes
  item_indices_broadcast = sc.broadcast(item_indices)

  # Create sparse vectors for each customer's item vector
  def create_sparse_vector(customer_items, item_indices_broadcast):
      item_lookup = item_indices_broadcast.value
      # Look up each item's index; index 0 is a valid item, so test membership
      # instead of using 0 as a "missing" marker. The set removes repeated items.
      indices = sorted({item_lookup[item] for item in customer_items if item in item_lookup})
      values = [1] * len(indices)
      vector_size = len(item_lookup)
      return (indices, SparseVector(vector_size, indices, values))

  customer_vectors = customer_items_rdd.map(lambda x : (x[0], create_sparse_vector(x[1], item_indices_broadcast)))

  return customer_vectors

def SV_information(SV_rdd):
  customer_id_list = []
  only_SV_list = []
  for customer_id, vector in SV_rdd.collect():
    customer_id_list.append(customer_id)
    only_SV_list.append(vector[1])
  return customer_id_list, only_SV_list



def cosine_similarity_sv(x, y):
  """
  Computes cosine similarity on spark ML SparseVectors x and y
  """
  #multiplication of two 1's is a 1, anything else is a 0, so bitset dot product
  # is equivalent to bitwise and followed by a count of all nonzero elements
  dot_product = x.dot(y)

  #magnitude is sqrt(sum of squares), square of 0 is 0 and square of 1 is 1,
  #so sum of squares for us is equivalent to count of all nonzero elements
  mag_x = sqrt(x.numNonzeros())
  mag_y = sqrt(y.numNonzeros())

  return dot_product / (mag_x*mag_y)

def customer_similarity(sample, baskets):
  customer_id, only_indexes = baskets
  baskets_list = sc.broadcast(only_indexes)
  # Use the first customer in the sample as the target customer who needs recommendations
  test_user_sv = only_indexes[0]
  similarity_cal = sample.map(lambda x: [cosine_similarity_sv(test_user_sv, s) for s in baskets_list.value])
  similarity_list = similarity_cal.take(1)[0]
  return similarity_list

def find_similar_customers(sample, similarity_list):
    similarity_rdd = sc.parallelize(similarity_list)
    similarity_rdd_with_indx = similarity_rdd.zipWithIndex().map(lambda x : (x[1],x[0]))
    filter_test_customer = similarity_rdd_with_indx.filter(lambda x : x[1] < 0.9 and x[1]!=0.0)
    similar_customers = filter_test_customer.sortBy(lambda x : -x[1])

    return similar_customers

similar_basket_medium = find_similar_customers(sample_medium, customer_similarity(sample_medium, SV_information(generate_SV(sample_medium))))
similar_basket_large = find_similar_customers(sample_large, customer_similarity(sample_large, SV_information(generate_SV(sample_large))))


def generate_similar_customer_with_basket(similar_customers_rdd, sample):
  similar_customers_list = similar_customers_rdd.map(lambda x : x[0]).collect()
  sample_indx = sample.zipWithIndex().map(lambda x : (x[1],x[0]))
  similar_user_with_basket_indx = sample_indx.filter(lambda x : x[0] in similar_customers_list)

  #Remove index for getting similar customers basket
  similar_user_with_basket = similar_user_with_basket_indx.map(lambda x : x[1])
  return similar_user_with_basket

similar_basket_medium_ASINS = generate_similar_customer_with_basket(similar_basket_medium, sample_medium)
similar_basket_large_ASINS = generate_similar_customer_with_basket(similar_basket_large, sample_large)


def recommended_items(sample_rdd, similar_customers_basket_rdd):
  only_items_for_similar_customers = similar_customers_basket_rdd.flatMap(lambda x : x[1])
  recommend_needed_customer_items = sample_rdd.take(1)[0][1]
  new_items = only_items_for_similar_customers.filter(lambda x : x not in recommend_needed_customer_items)
  return new_items

more_recommendations_medium = recommended_items(sample_medium, similar_basket_medium_ASINS)
more_recommendations_large =  recommended_items(sample_large, similar_basket_large_ASINS)
print(f"Recommended Items for customer {sample_medium.take(1)[0][0]} :\n{more_recommendations_medium.collect()}")
print(f"The number of recommended items(medium_sample): {more_recommendations_medium.count()}")
print(f"Recommended Items for customer {sample_large.take(1)[0][0]} :\n{more_recommendations_large.collect()}")
print(f"The number of recommended items(large_sample): {more_recommendations_large.count()}")


time_end = time.time()
print(f"elapsed time is {time_end-time_start}")

Writing pyspark_recommendations.py
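
For reference, the confidence and interest measures computed in the script can be worked through on a tiny example. The sketch below uses plain Python and made-up counts. It follows the definitions in the script's comments: Conf(X->Y) is the support of the pair divided by the support of X, and Interest(X->Y) is Conf(X->Y) minus Prob(Y). The function names and numbers are hypothetical.

```python
def rule_confidence(pair_count, x_count):
    """Conf(X->Y) = support(X, Y) / support(X); the basket total cancels out."""
    return pair_count / x_count

def rule_interest(pair_count, x_count, y_count, n_baskets):
    """Interest(X->Y) = Conf(X->Y) - Prob(Y)."""
    return rule_confidence(pair_count, x_count) - y_count / n_baskets

# Made-up counts, for illustration only
n_baskets = 1000   # total number of baskets
x_count = 200      # baskets containing X
y_count = 250      # baskets containing Y
pair_count = 175   # baskets containing both X and Y

conf = rule_confidence(pair_count, x_count)
inter = rule_interest(pair_count, x_count, y_count, n_baskets)
print(conf, inter)
```

In this example Y appears in 25% of all baskets but in 87.5% of the baskets that contain X, so the rule X->Y would pass the script's interest filter of 0.5.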


In [None]:
USERNAME="bonghyun"
%env REGION=australia-southeast1
%env ZONE=australia-southeast1-a
%env PROJECT=data301-2023-$USERNAME
%env CLUSTER=data301-2023-$USERNAME-project-cluster
%env BUCKET=data301-2023-$USERNAME-project-bucket

env: REGION=australia-southeast1
env: ZONE=australia-southeast1-a
env: PROJECT=data301-2023-bonghyun
env: CLUSTER=data301-2023-bonghyun-project-cluster
env: BUCKET=data301-2023-bonghyun-project-bucket


Run code to set up the Google Cloud project and storage bucket.

In [None]:
!python3 -m pip install google-cloud-dataproc[libcst]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting google-cloud-dataproc[libcst]
  Downloading google_cloud_dataproc-5.4.1-py2.py3-none-any.whl (307 kB)
Collecting grpc-google-iam-v1<1.0.0dev,>=0.12.4 (from google-cloud-dataproc[libcst])
  Downloading grpc_google_iam_v1-0.12.6-py2.py3-none-any.whl (26 kB)
Installing collected packages: grpc-google-iam-v1, google-cloud-dataproc
Successfully installed google-cloud-dataproc-5.4.1 grpc-google-iam-v1-0.12.6


In [None]:
!gcloud auth login

You are now logged in as [bonghyun990@gmail.com].
Your current project is [None].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID


In [None]:
!gcloud config set project $PROJECT

Updated property [core/project].


In [None]:
!gcloud services enable dataproc.googleapis.com cloudresourcemanager.googleapis.com

Operation "operations/acat.p2-224962939363-e3399dee-a7c4-4a5c-84c3-0558b02fd3ea" finished successfully.


In [None]:
!gsutil mb -c regional -l $REGION -p $PROJECT gs://$BUCKET

Creating gs://data301-2023-bonghyun-project-bucket/...
ServiceException: 409 A Cloud Storage bucket named 'data301-2023-bonghyun-project-bucket' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.


Run and modify the cluster create/execute/delete code for each test.

NOTE: it may take 5-10 minutes

In [None]:
!gcloud storage cp ./meta_Grocery_Food.json.gz gs://$BUCKET

Copying file://./meta_Grocery_Food.json.gz to gs://data301-2023-bonghyun-project-bucket/meta_Grocery_Food.json.gz

Average throughput: 19.7MiB/s


Run the program with different numbers of processors (P)


(Creation of Clusters)


P = 1

In [None]:
!gcloud dataproc clusters create $CLUSTER --region=$REGION --bucket=$BUCKET --zone=$ZONE \
--master-machine-type=custom-1-6144 \
--image-version=1.5 --max-age=30m --single-node

Waiting on operation [projects/data301-2023-bonghyun/regions/australia-southeast1/operations/74cc27e0-4d4e-3c14-b8cd-dc6624aa376f].

Created [https://dataproc.googleapis.com/v1/projects/data301-2023-bonghyun/regions/australia-southeast1/clusters/data301-2023-bonghyun-project-cluster] Cluster placed in zone [australia-southeast1-a].


P = 4

In [None]:
!gcloud dataproc clusters create $CLUSTER --region=$REGION --bucket=$BUCKET --zone=$ZONE \
--master-machine-type=n1-standard-2 --worker-machine-type=n1-standard-2 \
--image-version=1.5 --max-age=30m --num-masters=1 --num-workers=2

Waiting on operation [projects/data301-2023-bonghyun/regions/australia-southeast1/operations/3fa3c999-f27a-37fc-abff-62fb17b95478].

Created [https://dataproc.googleapis.com/v1/projects/data301-2023-bonghyun/regions/australia-southeast1/clusters/data301-2023-bonghyun-project-cluster] Cluster placed in zone [australia-southeast1-a].


P = 8

In [None]:
!gcloud dataproc clusters create $CLUSTER --region=$REGION --bucket=$BUCKET --zone=$ZONE \
--master-machine-type=n1-standard-4 --worker-machine-type=n1-standard-4 \
--image-version=1.5 --max-age=30m --num-masters=1 --num-workers=2

Waiting on operation [projects/data301-2023-bonghyun/regions/australia-southeast1/operations/3be3e7b0-00a2-35d8-9ef9-16c5ec29446e].

Created [https://dataproc.googleapis.com/v1/projects/data301-2023-bonghyun/regions/australia-southeast1/clusters/data301-2023-bonghyun-project-cluster] Cluster placed in zone [australia-southeast1-a].


P = 16

In [None]:
!gcloud dataproc clusters create $CLUSTER --region=$REGION --bucket=$BUCKET --zone=$ZONE \
--master-machine-type=n1-standard-8 --worker-machine-type=n1-standard-8 \
--image-version=1.5 --max-age=30m --num-masters=1 --num-workers=2

Waiting on operation [projects/data301-2023-bonghyun/regions/australia-southeast1/operations/2079977d-460a-31dc-8210-0db243311358].

Created [https://dataproc.googleapis.com/v1/projects/data301-2023-bonghyun/regions/australia-southeast1/clusters/data301-2023-bonghyun-project-cluster] Cluster placed in zone [australia-southeast1-a].


Submit Jobs

In [None]:
!gcloud dataproc jobs submit pyspark --cluster=$CLUSTER --region=$REGION pyspark_recommendations.py -- gs://$BUCKET/meta_Grocery_Food.json.gz

Job [5f5ccc008da7466ea80e499dc8ac0212] submitted.
Waiting for job output...
23/06/04 08:46:53 INFO org.apache.spark.SparkEnv: Registering MapOutputTracker
23/06/04 08:46:53 INFO org.apache.spark.SparkEnv: Registering BlockManagerMaster
23/06/04 08:46:53 INFO org.apache.spark.SparkEnv: Registering OutputCommitCoordinator
23/06/04 08:46:53 INFO org.spark_project.jetty.util.log: Logging initialized @3650ms to org.spark_project.jetty.util.log.Slf4jLog
23/06/04 08:46:54 INFO org.spark_project.jetty.server.Server: jetty-9.4.z-SNAPSHOT; built: unknown; git: unknown; jvm 1.8.0_362-b09
23/06/04 08:46:54 INFO org.spark_project.jetty.server.Server: Started @3743ms
23/06/04 08:46:54 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@1aa397fd{HTTP/1.1, (http/1.1)}{0.0.0.0:4050}
23/06/04 08:47:01 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
23/06/04 08:47:01 INFO org.apache.hadoop.io.compress.CodecPool: Got bra

Delete Cluster

In [None]:
!gcloud dataproc clusters delete $CLUSTER --region=$REGION --quiet

Waiting on operation [projects/data301-2023-bonghyun/regions/australia-southeast1/operations/9ad58f8b-7d58-3089-9f7d-29c310570911].
Deleted [https://dataproc.googleapis.com/v1/projects/data301-2023-bonghyun/regions/australia-southeast1/clusters/data301-2023-bonghyun-project-cluster].
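
Once the job has been run on each cluster size, the "elapsed time" values printed by the script can be compared as speedup and parallel efficiency. The sketch below is one possible way to do that. The timings in it are PLACEHOLDERS chosen for easy arithmetic, not measurements from this project, and the function name is hypothetical.

```python
def speedup_table(elapsed_by_p):
    """Return (P, speedup, efficiency) rows relative to the P = 1 run."""
    base = elapsed_by_p[1]  # the single-processor run is the baseline
    rows = []
    for p in sorted(elapsed_by_p):
        speedup = base / elapsed_by_p[p]
        rows.append((p, speedup, speedup / p))  # efficiency = speedup / P
    return rows

# PLACEHOLDER timings in seconds; replace with the elapsed times from each run
placeholder_times = {1: 400.0, 4: 200.0, 8: 100.0, 16: 80.0}

for p, speedup, efficiency in speedup_table(placeholder_times):
    print(f"P = {p}: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

A speedup well below P (efficiency far under 1) usually points to work that does not parallelize, such as steps that collect data to the driver.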
