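# Content-based recommendation

An experiment in recommending funds from their attribute data.

## Concept

* Collect six attributes for each fund
    1. Investment region (投資區域)
    2. Fund type (基金類型)
    3. Dividend frequency (配息頻率)
    4. Current fund size bracket (基金目前規模區間)
    5. High-yield bond flag (高收益債)
    6. Risk level (風險屬性)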
        
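* Derive attribute weights from the funds a user has already bought

    1. Suppose the user bought J12 (新興市場 emerging markets, 組合型 fund of funds, 10-100億元, 高風險 high risk) and J23 (新興市場, 平衡型 balanced, 100億元以上, 高風險). Together they carry eight attribute values, which give the profile: 新興市場 (2/8), 高風險 (2/8), 組合型 (1/8), 平衡型 (1/8), 10-100億元 (1/8), 100億元以上 (1/8)

    2. Score each candidate fund by summing the weights of the profile values it matches, drawn from 新興市場 (2/8) + 高風險 (2/8) + 組合型 (1/8) + 平衡型 (1/8) + 10-100億元 (1/8) + 100億元以上 (1/8). For example, a candidate that matches only 新興市場 and 高風險 scores 2/8 + 2/8 = 4/8

    3. Rank the candidates by score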
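
The weighting step can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: the helper name `build_profile` is hypothetical, and the input is just the J12 and J23 example with only the attributes each fund actually has.

```python
from collections import Counter
from fractions import Fraction


def build_profile(purchased_funds):
    """Return {attribute value: share of all attribute values} for the purchased funds."""
    counts = Counter(
        value
        for attributes in purchased_funds.values()
        for value in attributes
        if value  # skip missing attributes (None or empty string)
    )
    total = sum(counts.values())
    # Fractions keep the 2/8 and 1/8 weights exact (they print in lowest terms).
    return {value: Fraction(n, total) for value, n in counts.items()}


# The two example purchases from the list above.
purchased = {
    "J12": ["新興市場", "組合型", "10-100億元", "高風險"],
    "J23": ["新興市場", "平衡型", "100億元以上", "高風險"],
}

profile = build_profile(purchased)
for value, weight in sorted(profile.items(), key=lambda kv: kv[1], reverse=True):
    print(value, weight)
```

Because every attribute value contributes one count, the weights always sum to 1, so a fund with more filled-in attributes spreads the same total weight over more values.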

In [1]:
import pandas as pd
import pypyodbc 
from tqdm import tqdm 
import numpy as np 
import pickle

In [2]:
conn = pypyodbc.connect("DRIVER={SQL Server};SERVER=dbm_public;UID=sa;PWD=01060728;DATABASE=test")

## Preparing fund attributes

In [3]:
df_item_features = pd.read_sql("""
    select [基金代碼],
        [投資區域],
        [基金類型],
        [配息頻率],
        [基金目前規模區間],
        [高收益債],
        [風險屬性]
    from ihong_基金推薦demo_基金推薦2
""",conn)
df_item_features.head(3)

Unnamed: 0,基金代碼,投資區域,基金類型,配息頻率,基金目前規模區間,高收益債,風險屬性
0,100,台灣,股票型,,d.10~100億元台幣,,
1,101,台灣,股票型,,c.5~10億元台幣,,
2,103,台灣,股票型,,c.5~10億元台幣,,


In [66]:
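# Map each fund code to its item index in the interaction matrix;
# codes without an index become NaN.
# `itemid_to_idx` comes from the dataset pickle loaded further down.
df_item_features['iidx'] = df_item_features['基金代碼'].apply(
    lambda code: itemid_to_idx.get(code, np.nan))
df_item_features.head(3)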

Unnamed: 0,基金代碼,投資區域,基金類型,配息頻率,基金目前規模區間,高收益債,風險屬性,iidx
0,100,台灣,股票型,,d.10~100億元台幣,,,871.0
1,101,台灣,股票型,,c.5~10億元台幣,,,1355.0
2,103,台灣,股票型,,c.5~10億元台幣,,,198.0


## User purchase data

In [15]:
with open('./funds/sp_funds_datasets.pickle','rb') as f:
    data = pickle.load(f)

In [122]:
test = data['test']
train = data['train']
user_idxs = data['user_idxs']
idx_to_userid = data['idx_to_userid']
userid_to_idx = data['userid_to_idx']
idx_to_itemid = data['idx_to_itemid']
itemid_to_idx = data['itemid_to_idx']

fundid_names_df = pd.read_csv('./funds/fundid_to_name.csv',encoding='cp950')
fundid_to_names = {}

for d in fundid_names_df.to_dict('records'):
    fundid_to_names[d['基金代碼']] = d['基金中文名稱']

### Inspecting the fund features of a single user's purchases

In [99]:
user_pur_item_indx = train[3,].indices
test_item_df = df_item_features[df_item_features['iidx'].isin(user_pur_item_indx)]
test_item_df

Unnamed: 0,基金代碼,投資區域,基金類型,配息頻率,基金目前規模區間,高收益債,風險屬性,iidx
422,63Z,中國,股票型,,h.1~5兆元台幣,,高風險,32.0
558,71S,全球,股票型,,i.>10兆元台幣,,,674.0
631,776,印度,股票型,,h.1~5兆元台幣,,高風險,5.0
1189,IQ4,全球,股票型,年配,h.1~5兆元台幣,,,102.0
1335,J99,全球,平衡型,月配,i.>10兆元台幣,高收益債,,11.0
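
The `train[3,].indices` lookup suggests that `train` is a SciPy CSR matrix of users by items, although the notebook does not say so explicitly. Under that assumption, `.indices` on a single row gives the column positions of that row's non-zero entries, that is, the item indices the user bought. The sketch below mimics the CSR layout with plain lists; the arrays and the helper `row_indices` are made up for illustration.

```python
# CSR keeps the column indices of all non-zero entries in one flat list,
# and a pointer list saying where each row's slice starts and ends.
indptr = [0, 2, 2, 5]        # row r occupies indices[indptr[r]:indptr[r + 1]]
indices = [0, 3, 1, 2, 4]    # column (item) indices, row after row


def row_indices(indptr, indices, row):
    """Return the item indices stored for one row of a CSR-style layout."""
    return indices[indptr[row]:indptr[row + 1]]


for user in range(len(indptr) - 1):
    print(user, row_indices(indptr, indices, user))
```

A user with no purchases simply gets an empty slice, which is why the later profile loop can run over every row without a special case.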


### Aggregating into a `dict`

In [116]:
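from collections import defaultdict

# Count how often each attribute value occurs across the user's purchased funds.
tmp = defaultdict(int)
for idx, row in test_item_df.iterrows():
    # row[1:-1] skips 基金代碼 and iidx, leaving the six attribute columns.
    for e in row[1:-1].tolist():
        if e:  # missing attributes are None and are skipped
            tmp[e] += 1
# Normalise the counts so the weights sum to 1.
total = sum(tmp.values())
tmp2 = {k: v / total for k, v in tmp.items()}
tmp2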

{'h.1~5兆元台幣': 0.15,
 'i.>10兆元台幣': 0.1,
 '中國': 0.05,
 '全球': 0.15,
 '印度': 0.05,
 '平衡型': 0.05,
 '年配': 0.05,
 '月配': 0.05,
 '股票型': 0.2,
 '高收益債': 0.05,
 '高風險': 0.1}
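
The `if e:` test works here because missing attributes come back from the database as `None`. If the same table were loaded from a CSV file instead, pandas would typically represent missing text as `NaN`, which is truthy and would be counted as a feature. A small guard covers both cases; the helper `is_missing` and the sample values below are hypothetical.

```python
import math
from collections import Counter


def is_missing(value):
    """True for None, empty strings, and float NaN."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


values = ["全球", "股票型", None, float("nan"), "", "高風險"]

naive = Counter(v for v in values if v)                     # NaN slips through
guarded = Counter(v for v in values if not is_missing(v))   # NaN is dropped

print(len(naive), len(guarded))
```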

### Purchase features for all users

In [125]:
dict(defaultdict(int))

{}

In [133]:
user_numbers = train.shape[0]
users_pur_profile = {}
for uidx in tqdm(range(user_numbers)):
    user_pur_item_idxs = train[uidx,].indices
    user_pur_items_df = df_item_features[df_item_features['iidx'].isin(user_pur_item_idxs)]
    
    user_profile = defaultdict(int)
    for idx, row in user_pur_items_df.iterrows():
        for e in row[1:-1].tolist():
            if e:
                user_profile[e] +=1
    total = sum(user_profile.values())
    user_profile = {k:v/ total for k,v in user_profile.items()}
    
    users_pur_profile[uidx] = user_profile

100%|███████████████████████████████████████████████████| 26324/26324 [00:58<00:00, 451.50it/s]


In [145]:
test_uidx = 100
for w in sorted(users_pur_profile[test_uidx],key=users_pur_profile[test_uidx].get, reverse=True):
    print(w,users_pur_profile[test_uidx][w])

股票型 0.25
h.1~5兆元台幣 0.17857142857142858
全球 0.10714285714285714
歐洲 0.07142857142857142
高風險 0.07142857142857142
i.5~10兆元台幣 0.07142857142857142
月配 0.03571428571428571
美國 0.03571428571428571
新興市場 0.03571428571428571
i.>10兆元台幣 0.03571428571428571
中國 0.03571428571428571
平衡型 0.03571428571428571
高收益債 0.03571428571428571


## Content-based recommendation from each user's purchase profile

In [148]:
temp = next(df_item_features.iterrows())[1]
temp

基金代碼                 100
投資區域                  台灣
基金類型                 股票型
配息頻率                None
基金目前規模區間    d.10~100億元台幣
高收益債                None
風險屬性                None
iidx                 871
Name: 0, dtype: object

In [157]:
def get_cb_scores(uidx, df_item_features, user_pur_profile):
    """Content-based scores for one user, keyed by DataFrame row label (not by `iidx`).

    `uidx` is unused; it is kept so existing calls still work.
    """
    scores = {}
    for row_label, row in df_item_features.iterrows():
        # row[1:-1] skips 基金代碼 and iidx, leaving the six attribute values;
        # each value found in the profile adds its weight to the score.
        scores[row_label] = sum(user_pur_profile.get(value, 0) for value in row[1:-1])
    return scores
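
As a cross-check on the scoring rule, here is a minimal pure-Python sketch that keys scores by fund code rather than by DataFrame position. The function name `score_candidates`, the toy profile, and the three candidate funds are all hypothetical and chosen only for illustration.

```python
def score_candidates(profile, candidates):
    """Sum, for each candidate, the profile weights of the attribute values it carries."""
    return {
        fund_code: sum(profile.get(value, 0) for value in attributes if value)
        for fund_code, attributes in candidates.items()
    }


# Hypothetical profile weights (they sum to 1, like the profiles built earlier).
profile = {"股票型": 0.5, "全球": 0.25, "高風險": 0.25}

# Hypothetical candidates; None marks a missing attribute.
candidates = {
    "F1": ["全球", "股票型", None, "高風險"],
    "F2": ["台灣", "股票型", None, None],
    "F3": ["台灣", "貨幣型", None, None],
}

scores = score_candidates(profile, candidates)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Keying by fund code means the ranking can be read directly, without a second lookup to translate a position back into a fund.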
        

In [159]:
test_uidx = 100
uidx100_cb_scores = get_cb_scores(test_uidx,df_item_features,users_pur_profile[test_uidx])
uidx100_cb_scores

{0: 0.25,
 1: 0.25,
 2: 0.25,
 3: 0,
 4: 0,
 5: 0.3571428571428571,
 6: 0.35714285714285715,
 7: 0.10714285714285714,
 8: 0.10714285714285714,
 9: 0.3571428571428571,
 10: 0.2857142857142857,
 11: 0.35714285714285715,
 12: 0.25,
 13: 0.35714285714285715,
 14: 0.10714285714285714,
 15: 0.35714285714285715,
 16: 0.3214285714285714,
 17: 0.3571428571428571,
 18: 0.3571428571428571,
 19: 0.3214285714285714,
 20: 0.14285714285714285,
 21: 0.17857142857142855,
 22: 0.3571428571428571,
 23: 0.10714285714285714,
 24: 0.25,
 25: 0.25,
 26: 0.3214285714285714,
 27: 0.10714285714285714,
 28: 0.10714285714285714,
 29: 0.3571428571428571,
 30: 0.10714285714285714,
 31: 0.25,
 32: 0.07142857142857142,
 33: 0.10714285714285714,
 34: 0.3571428571428571,
 35: 0.07142857142857142,
 36: 0.07142857142857142,
 37: 0.07142857142857142,
 38: 0.10714285714285714,
 39: 0.10714285714285714,
 40: 0.3214285714285714,
 41: 0.5357142857142857,
 42: 0.07142857142857142,
 43: 0.3214285714285714,
 44: 0.03571428571428

In [200]:
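# Scores are keyed by DataFrame row label, so look funds up with .loc;
# filtering on the `iidx` column pairs scores with the wrong funds.
# The output recorded below was produced by the `iidx` filter, hence the empty dicts.
for row_label in sorted(uidx100_cb_scores, key=uidx100_cb_scores.get, reverse=True)[:20]:
    print(uidx100_cb_scores[row_label])
    rec_fund = df_item_features.loc[[row_label]].to_dict()
    print(rec_fund['基金代碼'])
    print(rec_fund['投資區域'])
    print(rec_fund['基金類型'])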

0.6071428571428571
{1781: 'UD5'}
{1781: '台灣'}
{1781: '股票型'}
0.6071428571428571
{}
{}
{}
0.6071428571428571
{1616: 'N14'}
{1616: '美國'}
{1616: '股票型'}
0.5714285714285714
{}
{}
{}
0.5714285714285714
{1668: 'Q23'}
{1668: '全球'}
{1668: '股票型'}
0.5714285714285714
{1102: 'GH3'}
{1102: '台灣'}
{1102: '股票型'}
0.5714285714285714
{533: '70G'}
{533: '印尼'}
{533: '股票型'}
0.5714285714285714
{}
{}
{}
0.5714285714285714
{1661: 'Q10'}
{1661: '台灣'}
{1661: '股票型'}
0.5714285714285714
{1402: 'L0A'}
{1402: '全球'}
{1402: '債券型'}
0.5714285714285714
{54: '218'}
{54: '中國'}
{54: '債券型'}
0.5714285714285714
{34: '185'}
{34: '新興市場'}
{34: '股票型'}
0.5714285714285714
{}
{}
{}
0.5714285714285714
{1510: 'M38'}
{1510: '歐洲'}
{1510: '股票型'}
0.5714285714285714
{166: '344'}
{166: '中國'}
{166: '股票型'}
0.5714285714285714
{338: '57R'}
{338: '全球'}
{338: '股票型'}
0.5714285714285714
{}
{}
{}
0.5714285714285714
{1695: 'S65'}
{1695: '拉丁美洲'}
{1695: '股票型'}
0.5714285714285714
{819: 'A29'}
{819: '新興市場'}
{819: '債券型'}
0.5714285714285714
{1552: 'ML3'}
{1552

In [169]:
df_item_features.head()

Unnamed: 0,基金代碼,投資區域,基金類型,配息頻率,基金目前規模區間,高收益債,風險屬性,iidx
0,100,台灣,股票型,,d.10~100億元台幣,,,871.0
1,101,台灣,股票型,,c.5~10億元台幣,,,1355.0
2,103,台灣,股票型,,c.5~10億元台幣,,,198.0
3,111,台灣,貨幣型,,e.100~500億元台幣,,,1137.0
4,120,台灣,貨幣型,,e.100~500億元台幣,,,287.0
