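## Project - Content-based recommender system (notes)
- Goal
    - Recommend related notes based on each note's title/content; this may surface supporting resources worth reading.
    - Note: start with the simplest version and get a baseline working first.
- Steps
    - Clean the data and impute missing values

    - Vectorize the note titles/content (word segmentation, plus a custom dictionary because the text mixes Chinese and English)

    - Compute distances (similarity)

- References
    - [Upside](https://engineering.upside.com/from-content-based-recommendations-to-personalization-a-tutorial-773c9903b521)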

In [86]:
import numpy as np
import pandas as pd
import sqlite3
import datetime
import nltk

## Load the data
- [Reading SQLite data with pandas](https://ka666wang.medium.com/%E7%94%A8pandas%E4%BE%86%E8%AE%80%E5%8F%96sqlite%E8%B3%87%E6%96%99%E6%AA%94%E6%A1%88-d700288e60ae)
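
As a hedged illustration of the read step, the sketch below builds a throwaway in-memory database with a `log` table shaped like the one used here (the two sample rows are made up) and loads it with `pd.read_sql`. It wraps the connection in `contextlib.closing`, because `with sqlite3.connect(...)` on its own only commits or rolls back the transaction and does not close the connection.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Hypothetical sample rows; the real notes live in data.sqlite.
rows = [
    (1, "2021/1/10 12:42", "backend", "first note", "built an app with flask", ""),
    (2, "2021/1/12 01:47", "backend", "second note", "edited a record through the form", ""),
]

# closing() guarantees con.close(); sqlite3's own context manager only commits or rolls back.
with closing(sqlite3.connect(":memory:")) as con:
    con.execute(
        "CREATE TABLE log (id_ INTEGER, time TEXT, domain TEXT, title TEXT, content TEXT, url TEXT)"
    )
    con.executemany("INSERT INTO log VALUES (?, ?, ?, ?, ?, ?)", rows)
    con.commit()
    demo_df = pd.read_sql("SELECT * FROM log", con=con)

print(demo_df.shape)
print(demo_df.dtypes)
```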

In [87]:
# Load the log table
with sqlite3.connect('data.sqlite') as con:
    df = pd.read_sql("select * from log", con=con)

print(df.shape)
print(df.dtypes)
print(df.head())

(52, 6)
id_         int64
time       object
domain     object
title      object
content    object
url        object
dtype: object
   id_             time domain                   title  \
0    2  2021/1/10 12:42     後端            Daily I/O 建立   
1    3  2021/1/10 13:00    經濟學  經濟學家在 Uber 研究如何道歉效果比較好   
2    7  2021/1/10 10:00     創業       從零到一 - 創業家無可取代的特質   
3    8  2021/1/10 22:38     後端           實現CRUD的"Read"   
4    9  2021/1/12 01:47     後端                    修改資料   

                                             content  \
0  透過flask建立應用，終於開始啦!!!\r\n其中也用到資料庫技術sqlite！\r\n\...   
1  John List因為被Uber雷了一次之後，投訴Uber，後面得到offer去幫忙研究，發...   
2  其實很簡單，講述得就是創業通常都有其特別、矛盾之處，可能不同的情境之下，會顯得矛盾，但那只是...   
3  簡單透過sqlalchemy套件的查詢函數，隸屬於該table class。\r\n搭配ht...   
4  透過request.form得到物件，修改原先檔案的值，透過restful api去單一呈現...   

                                                 url  
0                                                     
1  https://buzzorange.com/techorange/2020/11/17/t...  
2                  

### Data cleaning and missing-value imputation
- Impute missing values
- Feature transforms, e.g. converting time features and data types
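
The next cell finds no missing values, so it only converts the time column. For a table that does have gaps, a minimal sketch could look like the following; the sample rows and the choice to fill missing text with an empty string are assumptions, not something this dataset required.

```python
import pandas as pd

# Hypothetical rows with gaps; the real log table had no missing values.
demo = pd.DataFrame({
    "time": ["2021/1/10 12:42", "2021/1/12 01:47", None],
    "title": ["Daily I/O", None, "CRUD read"],
    "content": ["built with flask", "edited a record", None],
})

# Text columns: an empty string keeps later .str operations and tokenizers from failing on NaN.
text_cols = ["title", "content"]
demo[text_cols] = demo[text_cols].fillna("")

# Time column: parse with an explicit format; errors="coerce" turns missing or unparseable values into NaT.
demo["datetime"] = pd.to_datetime(demo["time"], format="%Y/%m/%d %H:%M", errors="coerce")

print(demo.isna().sum())
```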

In [88]:
# No missing values, so go straight to feature transforms

# Convert the time column to datetime
df['datetime'] = pd.to_datetime(df['time'], format='%Y/%m/%d %H:%M')

In [89]:
df.head()

Unnamed: 0,id_,time,domain,title,content,url,datetime
0,2,2021/1/10 12:42,後端,Daily I/O 建立,透過flask建立應用，終於開始啦!!!\r\n其中也用到資料庫技術sqlite！\r\n\...,,2021-01-10 12:42:00
1,3,2021/1/10 13:00,經濟學,經濟學家在 Uber 研究如何道歉效果比較好,John List因為被Uber雷了一次之後，投訴Uber，後面得到offer去幫忙研究，發...,https://buzzorange.com/techorange/2020/11/17/t...,2021-01-10 13:00:00
2,7,2021/1/10 10:00,創業,從零到一 - 創業家無可取代的特質,其實很簡單，講述得就是創業通常都有其特別、矛盾之處，可能不同的情境之下，會顯得矛盾，但那只是...,,2021-01-10 10:00:00
3,8,2021/1/10 22:38,後端,"實現CRUD的""Read""",簡單透過sqlalchemy套件的查詢函數，隸屬於該table class。\r\n搭配ht...,,2021-01-10 22:38:00
4,9,2021/1/12 01:47,後端,修改資料,透過request.form得到物件，修改原先檔案的值，透過restful api去單一呈現...,,2021-01-12 01:47:00


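#### Text preprocessing
- Convert to lowercase
- Build a custom dictionary (add the English terms)
- Chinese word segmentation (CKIP or jieba)
- English tokenization with nltk (probably not needed much here)
- Remove stopwords (Chinese and English)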
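
To make the order of these steps concrete, here is a small standard-library sketch. It does not use jieba or CKIP: `tokenize` is a hypothetical regex stand-in that keeps ASCII terms such as `gpt-3` or `i/o` whole and splits Chinese text into single characters, and both stopword sets are tiny made-up samples. Real Chinese word segmentation needs a proper segmenter.

```python
import re

# Tiny illustrative stopword sets (assumptions); real lists are much longer.
STOPWORDS_EN = {"the", "a", "to", "of"}
STOPWORDS_ZH = {"的", "了", "也"}
STOPWORDS = STOPWORDS_EN | STOPWORDS_ZH

# An ASCII term (optionally joined by -, / or .) or a single CJK character.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[-/.][a-z0-9]+)*|[\u4e00-\u9fff]")

def tokenize(text: str) -> list[str]:
    """Lowercase, split into tokens, then drop stopwords; punctuation and whitespace never match."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(tokenize("透過Flask建立應用，也用到 the SQLite 資料庫!"))
print(tokenize("GPT-3 的 Daily I/O"))
```

Lowercasing comes first so that the stopword lookup and the dictionary terms only need one casing.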

In [90]:
# Lowercase
df['title'] = df['title'].str.lower()
df['content'] = df['content'].str.lower()
print('\nLowercasing done.\n')


# Build the custom dictionary (one term per line)
new_words = 'flask\nsqlalchemy\nrequest\nform\nrestful\napi\nuber\ngoogle\ncrud\ndaily\ni/o\nwalmart\nmedium\ngpt-3\nai\nfacebook\naiot\nbing\nfb\njava\njs\npython\ndive into deep learning\nnlp\ncv\n'
with open('new_dict.txt', 'w', encoding='utf-8') as f:
    f.write(new_words)
print('\nDictionary built.\n')

    
# Chinese word segmentation
import jieba
jieba.load_userdict('new_dict.txt')
# Segment the content column; TF-IDF is a better fit for content
word_vectors = [list(jieba.cut(text)) for text in df['content']]

print(len(word_vectors))   # should equal the number of samples
print(word_vectors[:5])
print('\nChinese word segmentation done.\n')


# Remove stopwords
from nltk.corpus import stopwords   # needs a one-time nltk.download('stopwords')
stopwords_eng = set(stopwords.words('english'))
with open(r'C:\Users\aband\OneDrive\桌面\NLP_marathon\NLP_practice\1-st_NLP\hw\datasets\停用詞-繁體中文.txt', 'r', encoding='utf8') as f:
    stopwords_chinese = set(f.read().split('\n'))
stopwords_all = stopwords_chinese.union(stopwords_eng)

# Keep only the tokens that are not stopwords
word_vectors = [[w for w in words if w not in stopwords_all] for words in word_vectors]
print(word_vectors[:5])
print('\nStopword removal done.\n')


Lowercasing done.


Dictionary built.

52
[['透過', 'flask', '建立', '應用', '，', '終於開始', '啦', '!', '!', '!', '\r\n', '其中', '也', '用到', '資料', '庫技術', 'sqlite', '！', '\r\n', '\r\n', '\r\n', '修改', '了', '!', '!', '!', '2021', '/', '1', '/', '12', ' ', '01', ':', '46'], ['john', ' ', 'list', '因為', '被', 'uber', '雷', '了', '一次', '之', '後', '，', '投訴', 'uber', '，', '後', '面', '得到', 'offer', '去', '幫忙', '研究', '，', '發現', '財務上', '的', '打擊', '以及', '道歉', '會', '得到', '品牌', '忠誠度', '的', '提高', '，', '但', '一昧', '的', '道歉', '只會', '降低', '。'], ['其實', '很', '簡單', '，', '講述', '得', '就是', '創業', '通常', '都', '有', '其', '特別', '、', '矛盾', '之處', '，', '可能', '不同', '的', '情境', '之下', '，', '會', '顯得', '矛盾', '，', '但', '那', '只是', '不同', '的', '情境', '有', '不同', '的', '極端', '表現', '罷了', '，', '特別', '是', '一件', '好', '事情', '，', '如', '賈伯斯', '，', '特立', '獨行', '，', '最', '後', '也', '被', 'apple', '找回', '來', '才', '讓', 'apple', '起飛', '！'], ['簡單', '透過', 'sqlalchemy', '套件', '的', '查詢', '函數', '，', '隸屬', '於', '該', 'table', ' ', 'class', '。', '\r\n', '搭配', 'html', '做', '呈現', '，', '雖然醜',

#### Vectorization
- TF-IDF (used here)
    - [Concise usage](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
- word2vec
- BERT
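
As a rough illustration of what the TF-IDF option computes, the sketch below reimplements the weighting in plain Python on a made-up, pre-tokenized corpus. It follows the defaults documented for scikit-learn's `TfidfVectorizer`: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document. The function name `tfidf` and the sample documents are assumptions for the example.

```python
import math
from collections import Counter

def tfidf(docs: list[list[str]]) -> tuple[list[str], list[list[float]]]:
    """Return (vocabulary, rows) of smoothed-idf TF-IDF weights, L2-normalized per document."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0  # avoid dividing by zero for empty docs
        rows.append([w / norm for w in row])
    return vocab, rows

# Made-up tokenized notes.
docs = [
    ["flask", "sqlite", "flask"],
    ["flask", "form", "button"],
    ["uber", "apology", "research"],
]
vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[1]):
    if weight:
        print(f"{term}: {weight:.3f}")
```

In the second document every term occurs once, so the rarer terms (`form`, `button`) end up with a higher weight than `flask`, which also appears in the first document.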

In [91]:
corpus = [' '.join(words) for words in word_vectors]
corpus

['透過 flask 建立 應用 終於開始 \r\n 用到 資料 庫技術 sqlite \r\n \r\n \r\n 修改 2021 12   01 46',
 'john   list uber 雷 一次 後 投訴 uber 後 面 得到 offer 幫忙 研究 發現 財務上 打擊 道歉 得到 品牌 忠誠度 提高 一昧 道歉 只會 降低',
 '簡單 講述 創業 通常 特別 矛盾 之處 情境 之下 顯得 矛盾 情境 表現 特別 一件 好 事情 賈伯斯 特立 獨行 最 後 apple 找回 apple 起飛',
 '簡單 透過 sqlalchemy 套件 查詢 函數 隸屬 table   class \r\n 搭配 html 做 呈現 雖然醜 開心',
 '透過 request form 得到 物件 修改 原先 檔案 值 透過 restful   api 單一呈現 記錄   進而去 修改',
 '使用 flask form   button 值 判斷 刪除 資料 太棒 拉',
 'collins 追求 完美 主義 研究 做 深且 全面 具 參考 價值 brill 追求 簡單 哲學 透過 快速 有效 方法 達到 目的',
 '資料 難處 完 覺得 做 更好 r   code 則是 好 簡潔 檢定 做起 統計 分析 框架 確定 r 特化 做 統計 程式 語言',
 '講述 產品 內部 外部 開發 比較 特別 後 處理 講得 籠統 參考 價值 不高 crnn 用以 做 文本 識別 模型 之前 沒聽過 覺得 這很實 \r\n 提到 圖卷 積神經 網路 說 好 陌生 覺得 圖 領域 厲害 推薦 系統 相關 應用 目前 社會 重要 結構 我遲 早會 走入 領域',
 '簡單 說 一種 動態 程式 風格 專注 做 物件 舉例 說 一個 函式 接入 一個 參數 假定 參數 一個 做 predict 物件 物件 不在乎 說 直接 具備 多型 能力 靜態 語言 實現 一個 加法 函式 需要 給予 參數型 但動態 語言 不用 中間 差距',
 'py 檔案 轉換 成二 進位 pyc pyc 版本 無法 使用 說 3.6 py 轉成 pyc 無法 3.5 使用 \r\n 跨平台 jvm 效果 外 防止 source   code 洩漏 原因 逆 轉換 感覺 防 君子',
 'bloom  

In [92]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
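X = vectorizer.fit_transform(corpus)  # X is a sparse matrix at this point
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())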
print(X.shape)

['01', '10', '11', '12', '13', '14', '15', '19', '2021', '31', '3c', '46', '__', 'action', 'activation', 'ad', 'ai', 'aiot', 'alexnet', 'api', 'app', 'apple', 'attention', 'average', 'basis', 'beat', 'bert', 'bias', 'bing', 'bits', 'bloom', 'bn', 'brad', 'brill', 'build', 'button', 'call', 'causality', 'chebyshev', 'cherry', 'class', 'clt', 'cnn', 'cobra', 'code', 'collins', 'computer', 'conquer', 'console', 'conv2d', 'correlation', 'covid19', 'cpu', 'crnn', 'cross', 'cs', 'danger', 'data', 'default', 'dense', 'device', 'different', 'divide', 'dredging', 'effect', 'epoch', 'error', 'errorhandler', 'explorer', 'f12', 'fallacy', 'false', 'fb', 'filter', 'flash', 'flashed', 'flask', 'flatten', 'form', 'function', 'gambler', 'gerrymandering', 'get', 'google', 'gpt', 'gpu', 'gridsearch', 'gu', 'hash', 'hawthorne', 'html', 'init', 'input', 'internet', 'java', 'john', 'js', 'jvm', 'keras', 'kernel', 'layer', 'layers', 'level', 'list', 'logic', 'make', 'managing', 'mapping', 'math', 'mcnamara'
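
One thing worth checking in the vocabulary above: with its default `token_pattern` (`(?u)\b\w\w+\b`), `TfidfVectorizer` only keeps tokens of two or more word characters, so single-character Chinese words from the segmented corpus are dropped. If those words matter, a pattern that keeps every whitespace-separated token is one option. The two-document corpus below is made up for the demonstration, and whether single characters actually improve the recommendations is untested here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus, already segmented and joined with spaces.
demo_corpus = ["uber 雷 一次 道歉", "flask form 值 刪除"]

default_vec = TfidfVectorizer().fit(demo_corpus)
keep_all_vec = TfidfVectorizer(token_pattern=r"(?u)\S+").fit(demo_corpus)

print(sorted(default_vec.vocabulary_))   # single-character tokens are missing
print(sorted(keep_all_vec.vocabulary_))  # every space-separated token is kept
```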

In [93]:
X.toarray()  # convert to a dense ndarray

array([[0.30176871, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [94]:
df_new = pd.concat([df['id_'], pd.DataFrame(X.toarray())], axis=1)   # remember: concat defaults to axis=0
df_new.head()

Unnamed: 0,id_,0,1,2,3,4,5,6,7,8,...,1311,1312,1313,1314,1315,1316,1317,1318,1319,1320
0,2,0.301769,0.0,0.0,0.273162,0.0,0.0,0.0,0.0,0.301769,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Compute distances
- Cosine similarity (used here)
- Euclidean distance
- Others
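
For reference, both measures are short enough to write out. The sketch below uses plain Python on two made-up vectors: cosine similarity is the dot product divided by the product of the norms (higher means more similar), and Euclidean distance is the norm of the difference (lower means more similar). For unit-length vectors, such as L2-normalized TF-IDF rows, the two produce the same ranking because the squared distance equals `2 - 2 * cosine`.

```python
import math

def cosine_similarity_1d(x: list[float], y: list[float]) -> float:
    """Cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_x * norm_y)

def euclidean_distance_1d(x: list[float], y: list[float]) -> float:
    """Straight-line distance between x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def unit(x: list[float]) -> list[float]:
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]

# Made-up vectors, scaled to unit length like TF-IDF rows.
u = unit([3.0, 4.0, 0.0])
v = unit([4.0, 3.0, 0.0])
cos_uv = cosine_similarity_1d(u, v)
dist_uv = euclidean_distance_1d(u, v)
print(cos_uv, dist_uv)
```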

In [95]:
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import euclidean_distances
from sklearn.preprocessing import StandardScaler

# Single pair (note: scipy's cosine returns a distance, i.e. 1 - similarity)
# print(cosine(df_new.iloc[0, 1:], df_new.iloc[1, 1:]))

def get_notes_recommendation(df: pd.DataFrame, id_: int) -> pd.DataFrame:
    # Features used in the computation
    # features = [...]
    
    # Move the row for id_ to the top
    df_sorted = pd.concat([df[df['id_'] == id_], df[df['id_'] != id_]])
    
    # Standardization, currently unused
    # df_features = df_sorted[features].copy()
    # df_features = normalize_features(df_features)
    df_features = df_sorted.iloc[:, 1:]   # the lines above are unused; named this way to ease later extension
    
    # Compute similarity against the query note
    X = df_features.values
    Y = df_features.values[0].reshape(1, -1)    # first row, i.e. the note currently being viewed
    similarities = cosine_similarity(X, Y).ravel()   # the metric can be swapped
    
    # Despite the column name, this is cosine similarity: higher means more similar
    df_sorted['similarity_distance'] = similarities
    return df_sorted.sort_values('similarity_distance', ascending=False).reset_index(drop=True)
    
def normalize_features(df: pd.DataFrame) -> pd.DataFrame:
    df_norm = df.copy()
    for col in df_norm.columns:
        # fill any NaN's with the mean
        df_norm[col] = df_norm[col].fillna(df_norm[col].mean())
        df_norm[col] = StandardScaler().fit_transform(df_norm[col].values.reshape(-1, 1)).ravel()
    return df_norm

df_cosine = get_notes_recommendation(df_new, 2)

In [96]:
print(df_new.shape)
print(df_cosine.shape)
df_cosine

(52, 1322)
(52, 1323)


Unnamed: 0,id_,0,1,2,3,4,5,6,7,8,...,1312,1313,1314,1315,1316,1317,1318,1319,1320,similarity_distance
0,2,0.301769,0.0,0.0,0.273162,0.0,0.0,0.0,0.0,0.301769,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.171482
2,10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125259
3,32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.112911
4,35,0.0,0.032604,0.032604,0.029514,0.032604,0.032604,0.065209,0.0,0.0,...,0.032604,0.032604,0.065209,0.032604,0.0,0.065209,0.032604,0.0,0.065209,0.055184
5,47,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.054768
6,49,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0545
7,41,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.051565
8,50,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.151474,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.048732
9,27,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.048356
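
The table above is hard to read because it shows ids and raw vector columns rather than titles. One possible way to present the result, sketched on made-up data, is to keep only `id_` and the similarity score, drop the query note itself, and merge the titles back in. The helper name `top_k_titles` and both sample frames are assumptions for the example.

```python
import pandas as pd

def top_k_titles(scores: pd.DataFrame, notes: pd.DataFrame, query_id: int, k: int = 3) -> pd.DataFrame:
    """Return the k notes most similar to query_id, with their titles attached."""
    ranked = (
        scores.loc[scores["id_"] != query_id, ["id_", "similarity_distance"]]
        .sort_values("similarity_distance", ascending=False)
        .head(k)
    )
    # A left merge keeps the ranking order of `ranked`.
    return ranked.merge(notes[["id_", "title"]], on="id_", how="left").reset_index(drop=True)

# Made-up stand-ins for df_cosine (scores) and df (notes).
scores = pd.DataFrame({"id_": [2, 9, 10, 32], "similarity_distance": [1.0, 0.17, 0.13, 0.11]})
notes = pd.DataFrame({"id_": [2, 9, 10, 32], "title": ["note a", "note b", "note c", "note d"]})

top = top_k_titles(scores, notes, query_id=2, k=2)
print(top)
```

With the real frames, the call would presumably be `top_k_titles(df_cosine, df, query_id=2)`.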
