<a href="https://colab.research.google.com/github/Ref4al/Week5/blob/main/RecSystems.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Embeddings for Recommendation Systems

As we’ve mentioned, embeddings are useful well beyond text. In industry, for example, they are widely used in recommendation systems.

We’ll use the word2vec algorithm to embed songs using human-made music playlists. The idea is to treat each song as we would a word or token, and each playlist as a sentence. The resulting embeddings can then be used to recommend similar songs that often appear together in playlists.
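To make the "song as word, playlist as sentence" idea concrete, here is a minimal sketch of how skip-gram style training pairs could be read off a playlist. The helper `skipgram_pairs` and the toy song IDs are hypothetical and exist only for illustration. Real word2vec implementations add further details (negative sampling, subsampling, and so on) that this sketch leaves out.

```python
from typing import Iterator

def skipgram_pairs(playlist: list[str], window: int = 2) -> Iterator[tuple[str, str]]:
    """Yield (center, context) pairs for items within `window` positions of each other."""
    for i, center in enumerate(playlist):
        lo = max(0, i - window)
        hi = min(len(playlist), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, playlist[j]

# Hypothetical toy playlists; the song IDs are made up for illustration.
playlists = [
    ["song_a", "song_b", "song_c"],
    ["song_b", "song_c", "song_d"],
]

pairs = [pair for playlist in playlists for pair in skipgram_pairs(playlist, window=1)]
print(pairs)
```

Songs that share many context songs end up with similar training signals, which is what pushes their embeddings close together.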

The dataset we’ll use was collected by Shuo Chen from Cornell University. It contains playlists from hundreds of radio stations around the US. Figure 2-17 illustrates this dataset.

![Three playlists containing watched video IDs](../assets/videos_playlists.png)

Figure 2-17. For video embeddings that capture video similarity we’ll use a dataset made up of a collection of playlists, each containing a list of videos.


Let’s demonstrate the end product before we look at how it’s built. So let’s give it a few songs and see what it recommends in response.



### Training a Song Embedding Model

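The cells below apply the same idea to grocery shopping instead of music: each Instacart order (a basket) plays the role of a playlist, and each product plays the role of a song. We’ll start by downloading the Instacart Market Basket Analysis dataset from Kaggle, then load the order contents as well as each product’s metadata, such as its name: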



In [5]:
!pip -q uninstall -y kaggle kagglesdk
!pip -q install kaggle==1.6.17
!kaggle -v


Kaggle API 1.6.17


In [6]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [7]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [8]:
!kaggle datasets list -s instacart | head

ref                                                       title                                              size  lastUpdated          downloadCount  voteCount  usabilityRating  
--------------------------------------------------------  ------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
yasserh/instacart-online-grocery-basket-analysis-dataset  InstaCart Online Grocery Basket Analysis Dataset  197MB  2022-01-25 16:43:37          16872        101  1.0              
psparks/instacart-market-basket-analysis                  Instacart Market Basket Analysis                  197MB  2017-11-20 03:08:26          32117        201  0.4117647        
viswajithkn/instacart-predict-shopping-time               Instacart Predict Shopping Time                    12MB  2018-10-29 17:24:48            667         12  0.64705884       
mohdshahnawazaadil/supermarket-superstore-dataset-bundle  Supermarket / Superstore Dataset Bundle   

In [9]:
!kaggle datasets download -d psparks/instacart-market-basket-analysis --unzip

Dataset URL: https://www.kaggle.com/datasets/psparks/instacart-market-basket-analysis
License(s): CC0-1.0
Downloading instacart-market-basket-analysis.zip to /content
 98% 193M/197M [00:01<00:00, 167MB/s]
100% 197M/197M [00:01<00:00, 155MB/s]


In [5]:
!ls -lh

total 681M
-rw-r--r-- 1 root root 2.6K Jan 14 15:55 aisles.csv
-rw-r--r-- 1 root root  270 Jan 14 15:55 departments.csv
-rw-r--r-- 1 root root   62 Jan 14 15:50 kaggle.json
-rw-r--r-- 1 root root 551M Jan 14 15:55 order_products__prior.csv
-rw-r--r-- 1 root root  24M Jan 14 15:55 order_products__train.csv
-rw-r--r-- 1 root root 104M Jan 14 15:55 orders.csv
-rw-r--r-- 1 root root 2.1M Jan 14 15:55 products.csv
drwxr-xr-x 1 root root 4.0K Dec 11 14:34 sample_data


In [10]:
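import pandas as pd

# Each row of order_products__prior.csv is one product within one order.
op = pd.read_csv("order_products__prior.csv")
products = pd.read_csv("products.csv")

# Keep products in the order they were added to the cart.
op = op.sort_values(["order_id", "add_to_cart_order"])

# Build one basket (list of product IDs) per order and drop single-item
# baskets, which carry no co-occurrence information.
baskets = op.groupby("order_id")["product_id"].apply(list)
baskets = baskets[baskets.apply(len) > 1]

# Represent each product ID as a string token, so a basket reads like a "sentence".
sentences = baskets.apply(lambda xs: [str(x) for x in xs]).tolist()
print("Num baskets:", len(sentences))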

Num baskets: 3058126


In [11]:
!pip -q install gensim

In [13]:
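from gensim.models import Word2Vec

model = Word2Vec(
    sentences=sentences,
    vector_size=32,  # dimensionality of each product embedding
    window=5,        # maximum distance between a product and its context products
    min_count=2,     # ignore products that appear fewer than 2 times
    negative=10,     # number of negative samples per positive example
    workers=4,       # training threads
    sg=1,            # 1 = skip-gram (0 would be CBOW)
)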

In [14]:
print("Model trained. Vocab size:", len(model.wv))


Model trained. Vocab size: 49546
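The vocabulary size depends on `min_count`: product IDs that occur fewer than `min_count` times across all baskets are dropped before training, so they get no vector and cannot be looked up later. The sketch below mimics that filter on toy data. The helper `build_vocab` and the toy baskets are hypothetical and are not part of gensim.

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    """Keep only tokens that appear at least `min_count` times overall."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}

# Hypothetical toy baskets of product-ID strings.
toy_baskets = [["1", "2", "3"], ["2", "3"], ["3", "4"]]

vocab = build_vocab(toy_baskets, min_count=2)
print(sorted(vocab))  # ['2', '3']
```

Products "1" and "4" each appear only once, so they fall out of the vocabulary. With the real model, asking for neighbors of such a product raises a `KeyError`.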


In [16]:
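# Map product_id -> product_name for readable output.
prod_name = products.set_index("product_id")["product_name"].to_dict()

def recommend(product_id, topn=10):
    """Return (product_id, name, similarity) for the topn nearest products."""
    # Tokens are product-ID strings; a KeyError here means the product
    # is not in the model vocabulary (for example, filtered by min_count).
    sims = model.wv.most_similar(str(product_id), topn=topn)
    return [(int(pid), prod_name.get(int(pid), "UNKNOWN"), float(score)) for pid, score in sims]

def recommend_for_basket(product_ids, topn=10):
    """Return the topn products closest to the basket as a whole."""
    basket = {int(pid) for pid in product_ids}
    positives = [str(pid) for pid in product_ids]
    # Request extra candidates so topn remain after dropping basket items.
    sims = model.wv.most_similar(positive=positives, topn=topn + len(product_ids))
    out = []
    for pid, score in sims:
        pid_int = int(pid)
        if pid_int not in basket:
            out.append((pid_int, prod_name.get(pid_int, "UNKNOWN"), float(score)))
        if len(out) == topn:
            break
    return out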

print("Seed:", prod_name.get(24852))
recommend(24852, topn=10)

Seed: Banana


[(28204, 'Organic Fuji Apple', 0.9256421327590942),
 (27730, 'Almond Breeze Original Almond Milk', 0.8688809275627136),
 (47144, 'Unsweetened Original Almond Breeze Almond Milk', 0.8677605390548706),
 (432, 'Vanilla Almond Breeze Almond Milk', 0.8547711968421936),
 (45066, 'Honeycrisp Apple', 0.8494503498077393),
 (4942, 'Vanilla Almond Breeze', 0.8437309265136719),
 (20842, 'Total 0% Greek Yogurt', 0.8386955261230469),
 (19348, 'Fat Free Milk', 0.8220975995063782),
 (47766, 'Organic Avocado', 0.82010817527771),
 (49610, '100% Lactose Free Fat Free Milk', 0.818316638469696)]

In [18]:
recommend(24852, topn=5)


[(28204, 'Organic Fuji Apple', 0.9256421327590942),
 (27730, 'Almond Breeze Original Almond Milk', 0.8688809275627136),
 (47144, 'Unsweetened Original Almond Breeze Almond Milk', 0.8677605390548706),
 (432, 'Vanilla Almond Breeze Almond Milk', 0.8547711968421936),
 (45066, 'Honeycrisp Apple', 0.8494503498077393)]

In [21]:
import pandas as pd

def recommend_df(product_id, topn=10):
    recs = recommend(product_id, topn=topn)
    return pd.DataFrame(recs, columns=["product_id", "product_name", "similarity"])

recommend_df(24852, topn=10)

|   | product_id | product_name | similarity |
|---|---|---|---|
| 0 | 28204 | Organic Fuji Apple | 0.925642 |
| 1 | 27730 | Almond Breeze Original Almond Milk | 0.868881 |
| 2 | 47144 | Unsweetened Original Almond Breeze Almond Milk | 0.867761 |
| 3 | 432 | Vanilla Almond Breeze Almond Milk | 0.854771 |
| 4 | 45066 | Honeycrisp Apple | 0.84945 |
| 5 | 4942 | Vanilla Almond Breeze | 0.843731 |
| 6 | 20842 | Total 0% Greek Yogurt | 0.838696 |
| 7 | 19348 | Fat Free Milk | 0.822098 |
| 8 | 47766 | Organic Avocado | 0.820108 |
| 9 | 49610 | 100% Lactose Free Fat Free Milk | 0.818317 |


In [22]:
recommend_for_basket([24852, 47766, 27845], topn=10)

[(45066, 'Honeycrisp Apple', 0.8934086561203003),
 (23434, 'Premium 7 Sources Oil Blend', 0.885626494884491),
 (24024, '1% Lowfat Milk', 0.8771352767944336),
 (13422, 'Unflavored Whey Protein', 0.8625248670578003),
 (37646, 'Organic Gala Apples', 0.8600385189056396),
 (36529,
  'Organic Whole Grain Vegan Crispy Sunflower Twigs',
  0.8545241951942444),
 (17872, 'Total 2% Lowfat Plain Greek Yogurt', 0.8543748259544373),
 (28204, 'Organic Fuji Apple', 0.8485047817230225),
 (647, 'Melanie Medleys Vegetable Cream Cheese', 0.845668613910675),
 (30377, 'Fat Free Chocolate Milk', 0.8448039889335632)]
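To our understanding, passing several `positive` keys to `most_similar` ranks candidates by cosine similarity to the average of the basket's unit-normalized vectors. The pure-Python sketch below illustrates that idea on made-up 2-D vectors. The function names and the toy vectors are hypothetical, and this is not gensim's actual implementation.

```python
import math

def unit(v):
    """Scale a vector to length 1 (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rank_by_basket(vectors, basket, topn=2):
    """Rank items outside `basket` by cosine similarity to the basket's mean direction."""
    units = {key: unit(vec) for key, vec in vectors.items()}
    dim = len(next(iter(units.values())))
    # Average the unit vectors of the basket items, then re-normalize.
    query = unit([sum(units[key][d] for key in basket) / len(basket) for d in range(dim)])
    scores = {
        key: sum(a * b for a, b in zip(query, vec))
        for key, vec in units.items()
        if key not in basket
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

# Hypothetical toy embeddings, chosen only to make the geometry obvious.
toy_vectors = {
    "banana": [1.0, 0.1],
    "avocado": [0.1, 1.0],
    "apple": [0.7, 0.7],
    "soap": [-1.0, 0.2],
}

ranked = rank_by_basket(toy_vectors, ["banana", "avocado"])
print(ranked)
```

Here "apple" points in the same direction as the banana-avocado average, so it ranks first, while "soap" points away from it and ranks last.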

In [24]:
banana_recs = recommend_df(24852, topn=20)
banana_recs.to_csv("banana_recommendations.csv", index=False)

In [25]:
model.save("instacart_word2vec.model")