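## Recommender Systems and NLP
- Recommender systems that take NLP as their main starting point, for example by deriving product features from the product title and product description.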

---

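1. Heuristic: unsupervised. For example, use tf-idf to measure title similarity and recommend on that basis.
2. ML: supervised, with labels. Feed the information above together with other product metadata into a machine-learning classification/regression model, and rank by the predicted probability/rating.
3. DL: use neural networks for the embeddings or for the model itself. This is viable when the dataset is large enough; embeddings in particular can improve substantially with NLP and CV techniques.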

In [5]:
#### Packages ####

## Data processing ##
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer


In [2]:
# Load the data
import os
predir = '../collaborative-filtering/datasets'

df = pd.read_excel(io=os.path.join(predir,'Online Retail.xlsx'), sheet_name='Online Retail')
print(df.shape)

(541909, 8)


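### Heuristic
1. Vectorize the product titles or descriptions with NLP techniques; BoW or tf-idf are simple options.
2. Pick a seed product and compute how similar every other product is to it in title and description. (This can usually be pre-computed.)
3. Sort by similarity and take the Top-K as recommendations.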
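Before applying these steps to the full dataset, the three of them can be sketched on a toy example. The four titles below are a hypothetical mini catalogue invented for illustration, and `seed_idx`, `top_k`, and `recommendations` are names chosen for this sketch only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini catalogue, invented for illustration only.
titles = [
    "white hanging heart t-light holder",
    "red hanging heart t-light holder",
    "white metal lantern",
    "knitted union flag hot water bottle",
]

# Step 1: vectorize the titles with tf-idf.
vectors = TfidfVectorizer().fit_transform(titles)

# Step 2: similarity of every title to the seed product (row 0).
seed_idx = 0
scores = cosine_similarity(vectors[seed_idx], vectors).ravel()

# Step 3: rank by similarity, drop the seed itself, keep the Top-K.
top_k = 2
ranked = [i for i in scores.argsort()[::-1] if i != seed_idx][:top_k]
recommendations = [(titles[i], float(scores[i])) for i in ranked]
print(recommendations)
```

The title that shares the most words with the seed ranks first, and a title with no words in common scores 0 and never makes the list.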

In [83]:
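## 1. Vectorize the product titles or descriptions with NLP techniques; BoW or tf-idf are simple options.

## Data processing
columns = [
    'StockCode', 
    'Description', 
    #'CustomerID'
]
df = df[columns].copy()  # copy so the assignments below do not write into a view of the original frame
df['StockCode'] = df['StockCode'].apply(lambda x: str(x).lower())
df['Description'] = df['Description'].apply(lambda x: str(x).lower())  # str() also turns missing (float NaN) descriptions into 'nan'
df = df.drop_duplicates()  # substitute-product recommendation has no use for duplicate rows


#X_train, X_test, y_train, y_test = train_test_split(df['Description'], df['StockCode'], test_size=0.25)

# Descriptions are already strings here (the float values were converted above)
corpus = df['Description'].tolist()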

tfidf = TfidfVectorizer()
tfidf.fit(corpus)

TfidfVectorizer()

In [84]:
df

,StockCode,Description
0,85123a,white hanging heart t-light holder
1,71053,white metal lantern
2,84406b,cream cupid hearts coat hanger
3,84029g,knitted union flag hot water bottle
4,84029e,red woolly hottie white heart.
...,...,...
536908,23090,missing
537621,85123a,cream hanging heart t-light holder
538554,85175,
538919,23169,smashed


In [85]:
tfidf.transform(corpus).shape

(5635, 2277)

In [86]:
from sklearn.metrics.pairwise import cosine_similarity

X_tfidf = tfidf.transform(corpus)
cs = cosine_similarity(X_tfidf)
cs

array([[1.        , 0.21809563, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.21809563, 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])
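With 5635 products the dense similarity matrix is still manageable, but it grows quadratically with the catalogue size. Since `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), the plain dot product already equals the cosine similarity, so the result can stay sparse. The following is a minimal sketch of that equivalence on a hypothetical three-document corpus; `docs`, `X`, and `sparse_sim` are illustrative names.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical corpus, invented for illustration only.
docs = [
    "white hanging heart t-light holder",
    "cream hanging heart t-light holder",
    "white metal lantern",
]

# Rows are L2-normalized by default (norm='l2').
X = TfidfVectorizer().fit_transform(docs)

dense_cs = cosine_similarity(X)  # normalizes again, then takes dot products
dense_lk = linear_kernel(X)      # dot products only
same = np.allclose(dense_cs, dense_lk)

# Sparse alternative: zero similarities are never stored explicitly.
sparse_sim = X @ X.T
print(same, sparse_sim.shape)
```

Whether the sparse form pays off depends on how many product pairs share at least one word, so it is worth measuring on the real data before switching.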

In [87]:
df_cs = pd.DataFrame(cs, index=df['Description'], columns=df['Description'])
df_cs

Description,white hanging heart t-light holder,white metal lantern,cream cupid hearts coat hanger,knitted union flag hot water bottle,red woolly hottie white heart.,set 7 babushka nesting boxes,glass star frosted t-light holder,hand warmer union jack,hand warmer red polka dot,assorted colour bird ornament,...,check,nan,check,lost,check,missing,cream hanging heart t-light holder,nan,smashed,"paper craft , little birdie"
white hanging heart t-light holder,1.000000,0.218096,0.000000,0.0,0.252711,0.0,0.349923,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.772519,0.0,0.0,0.0
white metal lantern,0.218096,1.000000,0.000000,0.0,0.158434,0.0,0.000000,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
cream cupid hearts coat hanger,0.000000,0.000000,1.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.186842,0.0,0.0,0.0
knitted union flag hot water bottle,0.000000,0.000000,0.000000,1.0,0.000000,0.0,0.000000,0.188033,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
red woolly hottie white heart.,0.252711,0.158434,0.000000,0.0,1.000000,0.0,0.000000,0.000000,0.082215,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.108257,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
missing,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.000000,0.0,0.0,0.0
cream hanging heart t-light holder,0.772519,0.000000,0.186842,0.0,0.108257,0.0,0.334560,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.000000,0.0,0.0,0.0
,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.000000,1.0,0.0,0.0
smashed,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,1.0,0.0


In [88]:
## 2. Pick a seed product and compute how similar every other product is to it in title and description. (This can usually be pre-computed.)

seed_product_name = 'white hanging heart t-light holder'
df_cs.loc[seed_product_name]

Description
white hanging heart t-light holder     1.000000
white metal lantern                    0.218096
cream cupid hearts coat hanger         0.000000
knitted union flag hot water bottle    0.000000
red woolly hottie white heart.         0.252711
                                         ...   
missing                                0.000000
cream hanging heart t-light holder     0.772519
nan                                    0.000000
smashed                                0.000000
paper craft , little birdie            0.000000
Name: white hanging heart t-light holder, Length: 5635, dtype: float64
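The note about pre-computing can be made concrete: instead of keeping the full similarity matrix, one option is to store only each product's nearest neighbours. The sketch below assumes scikit-learn's `NearestNeighbors` with the cosine metric on a hypothetical mini catalogue; `precomputed` is an illustrative name, not something defined elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical mini catalogue with unique titles, invented for illustration only.
titles = [
    "white hanging heart t-light holder",
    "red hanging heart t-light holder",
    "white metal lantern",
    "knitted union flag hot water bottle",
]
X = TfidfVectorizer().fit_transform(titles)

top_k = 2
# Ask for top_k + 1 neighbours because each item is normally its own nearest neighbour.
nn = NearestNeighbors(n_neighbors=top_k + 1, metric="cosine", algorithm="brute").fit(X)
distances, indices = nn.kneighbors(X)

# Cosine distance is 1 - cosine similarity, so convert back before storing.
precomputed = {}
for row, (dist_row, idx_row) in enumerate(zip(distances, indices)):
    pairs = [(titles[j], 1.0 - float(d)) for j, d in zip(idx_row, dist_row) if j != row]
    precomputed[titles[row]] = pairs[:top_k]

print(precomputed["white hanging heart t-light holder"])
```

At serving time a recommendation is then a dictionary lookup, at the cost of recomputing the table whenever the catalogue changes.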

In [89]:
## Inspect the seed product

df[df['Description'] == seed_product_name]

,StockCode,Description
0,85123a,white hanging heart t-light holder


In [92]:
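## 3. Sort by similarity and take the Top-K as recommendations.

top_k = 10

seed_product_name = 'white hanging heart t-light holder'
# Position 0 after sorting is the seed itself (similarity 1.0), so skip it explicitly by position.
df_cs.loc[seed_product_name].sort_values(ascending=False).iloc[1:1+top_k]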

Description
pink hanging heart t-light holder     0.833079
red hanging heart t-light holder      0.818065
cream hanging heart t-light holder    0.772519
hanging heart zinc t-light holder     0.770505
heart t-light holder                  0.764170
heart t-light holder                  0.764170
hanging heart jar t-light holder      0.760117
silver hanging t-light holder         0.694943
glass heart t-light holder            0.667908
hanging  butterfly t-light holder     0.663458
Name: white hanging heart t-light holder, dtype: float64

> Every result is a t-light holder, so this yields plenty of substitute products: different colors and styles, but all holders!
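Two details in the Top-K output are worth handling: `heart t-light holder` appears twice because the same description exists under more than one stock code, and skipping position 0 only works if the seed is the single top-ranked entry. One possible refinement, sketched here on hypothetical scores, is to drop the seed by label and keep only the best score for each description. The helper `top_k_by_label` and the example values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical similarity scores against one seed, invented for illustration only.
scores = pd.Series(
    [1.0, 0.83, 0.76, 0.76, 1.0, 0.0],
    index=["seed item", "pink item", "heart holder", "heart holder", "seed item", "nan"],
)

def top_k_by_label(scores: pd.Series, seed_name: str, top_k: int) -> pd.Series:
    """Return the top_k most similar items, excluding the seed and repeated labels."""
    # Remove every row that carries the seed's own description.
    candidates = scores[scores.index != seed_name]
    # Sort first so that the highest score per description is the one kept.
    candidates = candidates.sort_values(ascending=False)
    candidates = candidates[~candidates.index.duplicated(keep="first")]
    return candidates.head(top_k)

result = top_k_by_label(scores, "seed item", 2)
print(result)
```

Filtering by label rather than by position keeps the result correct even when several rows tie with the seed at a similarity of 1.0.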

### Modularizing the heuristic method