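## Data Processing and Feature Engineering: Constructing Features  
1. Construct lfm_reco  
2. Construct user_pop and item_pop, user_rate and item_rate  
3. Generate the final training file  
4. Generate the final test file  

Raw features alone are not enough. To extract more useful information from the data, we construct 5 additional features:  
user activity user_pop, song popularity item_pop, user subscription rate user_rate, song subscription rate item_rate, and LFM recommendation score lfm_reco. These are written into the original training and test files to produce the final files train_final.csv and test_final.csv.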

In [1]:
from sklearn.tree import DecisionTreeClassifier
from lightgbm.sklearn import LGBMClassifier
import pandas as pd
import numpy as np
import pickle

In [2]:
dpath = "./data/"

### 1. Construct lfm_reco

Train the LFM (latent factor model) to obtain the user latent vectors user_vec and the song latent vectors item_vec.

In [3]:
def lfm_train(train_data, F, alpha, beta, step):
    """
    Train the LFM model and save the latent factors user_vec and item_vec.
    Args:
        train_data: training file name (relative to dpath)
        F: length of each user vector and item vector
        alpha: regularization factor
        beta: initial learning rate
        step: number of passes over the training file
    Saves:
        user_vec.pkl: dict, key userid, value np.ndarray
        item_vec.pkl: dict, key itemid, value np.ndarray
    """
    user_vec = {}
    item_vec = {}
    count = 0
    for epoch in range(step):
        with open(dpath+train_data,"r") as fin:
            next(fin, None)  # skip the header row
            # one row per update, i.e. stochastic gradient descent
            for line in fin:
                cols = line.strip().split(",")
                userid,itemid,target = cols[0],cols[1],cols[-1]
                if userid not in user_vec:
                    user_vec[userid] = np.random.randn(F)
                if itemid not in item_vec:
                    item_vec[itemid] = np.random.randn(F)
                u,v = user_vec[userid],item_vec[itemid]
                # target is read as str and must be converted to int
                delta = int(target)-lfm_score(u,v)
                # in-place updates of the stored vectors; the item update
                # sees the already-updated user vector
                u += beta*(delta*v-alpha*u)
                v += beta*(delta*u-alpha*v)
                count += 1
                # the learning rate is not decayed during the first pass
                if epoch == 0:
                    continue
                # decay the learning rate once every 200000 samples
                if count%200000==0:
                    beta *= 0.95
                if count%1000000==0:
                    print("step %d,count %d,learning rate %g:"%(epoch, count, beta))
    with open(dpath+"user_vec.pkl","wb") as f:
        pickle.dump(user_vec,f)
    with open(dpath+"item_vec.pkl","wb") as f:
        pickle.dump(item_vec,f)
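
Written out for a single training row with user $u$, song $i$ and label $y$, the update performed in the inner loop of lfm_train is shown below. The symbols $p_u$ (for user_vec[userid]), $q_i$ (for item_vec[itemid]) and $s$ (for lfm_score) are notation introduced here for illustration; $\alpha$ and $\beta$ are the function's alpha and beta.

$$
\begin{aligned}
\delta &= y - s(p_u, q_i) \\
p_u &\leftarrow p_u + \beta\,(\delta\, q_i - \alpha\, p_u) \\
q_i &\leftarrow q_i + \beta\,(\delta\, p_u - \alpha\, q_i)
\end{aligned}
$$

The update of $q_i$ uses the $p_u$ that has just been updated, since the code updates the two vectors in sequence. The rule has the form of the usual regularized SGD step for a dot-product score; because $s$ here is the cosine similarity, it is probably best read as a heuristic rather than an exact gradient step.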

Compute the LFM-based recommendation score lfm_reco from user_vec and item_vec.

In [3]:
def lfm_score(user_vector,item_vector):
    """
    Cosine similarity between a user vector and an item vector
    Args:
        user_vector: user vector produced by the LFM model
        item_vector: item vector produced by the LFM model
    Return:
        LFM recommendation score in [-1, 1]
    """
    score = np.dot(user_vector, item_vector)/\
                (np.linalg.norm(user_vector)*np.linalg.norm(item_vector))
    return score
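
As a quick check of what lfm_score computes, the sketch below re-implements the same expression on small hand-made vectors (the vectors and the name cosine_score are made up for illustration). Because the dot product is divided by both norms, the result is a cosine similarity: it lies in [-1, 1] and does not depend on the length of either vector.

```python
import numpy as np

def cosine_score(user_vector, item_vector):
    # Same expression as lfm_score: dot product divided by the product of norms
    return np.dot(user_vector, item_vector) / (
        np.linalg.norm(user_vector) * np.linalg.norm(item_vector))

u = np.array([1.0, 2.0, 3.0])  # illustrative vector, not taken from the data
same_direction = cosine_score(u, 2 * u)   # scaling does not change the score
opposite = cosine_score(u, -u)
orthogonal = cosine_score(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(same_direction, opposite, orthogonal)
```

A score near 1 means the two latent vectors point the same way, a score near -1 means they point in opposite directions.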

In [5]:
%%time
lfm_train("train_merge.csv", 60, 0.01, 0.1, 5)  

step 1,count 8000000,learning rate 0.0814506:
step 1,count 9000000,learning rate 0.0630249:
step 1,count 10000000,learning rate 0.0487675:
step 1,count 11000000,learning rate 0.0377354:
step 1,count 12000000,learning rate 0.0291989:
step 1,count 13000000,learning rate 0.0225936:
step 1,count 14000000,learning rate 0.0174825:
step 2,count 15000000,learning rate 0.0135276:
step 2,count 16000000,learning rate 0.0104674:
step 2,count 17000000,learning rate 0.00809947:
step 2,count 18000000,learning rate 0.00626722:
step 2,count 19000000,learning rate 0.00484945:
step 2,count 20000000,learning rate 0.00375241:
step 2,count 21000000,learning rate 0.00290355:
step 2,count 22000000,learning rate 0.00224671:
step 3,count 23000000,learning rate 0.00173846:
step 3,count 24000000,learning rate 0.00134519:
step 3,count 25000000,learning rate 0.00104088:
step 3,count 26000000,learning rate 0.000805413:
step 3,count 27000000,learning rate 0.000623214:
step 3,count 28000000,learning rate 0.000482231:
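
The logged learning rates follow from the decay rule in lfm_train: starting from beta = 0.1, the rate is multiplied by 0.95 once every 200000 samples after the first pass. A minimal sketch (the helper name decayed_rate is hypothetical) reproduces the first two logged values, on the reading that the value at count 8000000 corresponds to 4 decays and that each further 1000000 samples adds 5 more:

```python
def decayed_rate(beta0, n_decays, factor=0.95):
    # Learning rate after n_decays multiplicative decays
    return beta0 * factor ** n_decays

rate_8m = decayed_rate(0.1, 4)      # compare with the value logged at count 8000000
rate_9m = decayed_rate(0.1, 4 + 5)  # 1000000 more samples = 5 more decays
print("%g %g" % (rate_8m, rate_9m))
```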


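### 2. Construct user_pop and item_pop, user_rate and item_rate  

Build user_record and item_record from the users' historical behavior data, where value1 is the number of records and value2 is the sum of target over those records:  
user_record = {userid: [value1,value2],...}  
item_record = {itemid: [value1,value2],...}  
User activity user_pop=value2 (standardized when written to the final files), user subscription rate user_rate=value2/value1  
Song popularity item_pop=value2 (standardized when written to the final files), song subscription rate item_rate=value2/value1  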
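
The bookkeeping described above can be sketched in memory on a few made-up (userid, itemid, target) rows standing in for train_merge.csv. The row values are illustrative only, and target is assumed to be 0 or 1.

```python
from collections import defaultdict

# Made-up rows: (userid, itemid, target)
rows = [("u1", "s1", 1), ("u1", "s2", 0), ("u1", "s3", 1),
        ("u2", "s1", 1), ("u2", "s2", 0)]

user_record = defaultdict(lambda: [0, 0])  # userid -> [value1, value2]
item_record = defaultdict(lambda: [0, 0])  # itemid -> [value1, value2]
for userid, itemid, target in rows:
    user_record[userid][0] += 1        # value1: number of records
    user_record[userid][1] += target   # value2: sum of target
    item_record[itemid][0] += 1
    item_record[itemid][1] += target

# Rates are value2 / value1
user_rate = {u: v2 / v1 for u, (v1, v2) in user_record.items()}
item_rate = {i: v2 / v1 for i, (v1, v2) in item_record.items()}
print(dict(user_record))
print(user_rate, item_rate)
```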

In [6]:
def get_record_file(input_file):
    """
    Build and save the user_record dict and item_record dict
    user_record = {userid: [value1,value2],...}
    item_record = {itemid: [value1,value2],...}
    args:
        input_file: user item record file - train_merge.csv
    """
    user_record = {}
    item_record = {}
    with open(dpath+input_file,"r") as fin:
        next(fin, None)  # skip the header row
        for line in fin:
            cols = line.strip().split(",")
            userid,itemid,target = cols[0],cols[1],cols[-1]
            if userid not in user_record:
                user_record[userid] = [0,0]
            user_record[userid][0] += 1
            # target is read as str; without int() this raises
            # TypeError: unsupported operand type(s) for +=: 'int' and 'str'
            user_record[userid][1] += int(target)
            if itemid not in item_record:
                item_record[itemid] = [0,0]
            item_record[itemid][0] += 1
            item_record[itemid][1] += int(target)
    with open(dpath+"user_record.pkl","wb") as f:
        pickle.dump(user_record,f)
    with open(dpath+"item_record.pkl","wb") as f:
        pickle.dump(item_record,f)

In [7]:
%%time
get_record_file("train_merge.csv")

Wall time: 21.8 s


In [4]:
def get_mean_std(user_record,item_record):
    """
    get user_pop mean and std, get item_pop mean and std
    args:
        user_record: user_record dict
        item_record: item_record dict
    return:
        user_pop_mean,user_pop_std,item_pop_mean,item_pop_std
    """
    user_pop_mean = np.mean(list(map(lambda x:x[1],user_record.values())))
    user_pop_std = np.std(list(map(lambda x:x[1],user_record.values())))
    
    item_pop_mean = np.mean(list(map(lambda x:x[1],item_record.values())))
    item_pop_std = np.std(list(map(lambda x:x[1],item_record.values())))
    
    return user_pop_mean,user_pop_std,item_pop_mean,item_pop_std
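
The means and standard deviations returned by get_mean_std are used later to standardize user_pop and item_pop as z-scores. A minimal sketch on a made-up user_record follows; note that np.std computes the population standard deviation by default.

```python
import numpy as np

# Made-up records: userid -> [value1, value2]
user_record = {"u1": [10, 2], "u2": [5, 4], "u3": [8, 6]}

pops = np.array([v[1] for v in user_record.values()], dtype=float)
pop_mean, pop_std = pops.mean(), pops.std()  # population std (ddof=0)

# z-score of each user's value2, rounded to 3 decimals as in the final files
user_pop = {u: round((v[1] - pop_mean) / pop_std, 3)
            for u, v in user_record.items()}
print(pop_mean, pop_std, user_pop)
```

A standardized value of 0 therefore means "exactly average", which is also the fallback used for unseen users and songs in the test file.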

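### 3. Generate the final training file

Write user_pop, item_pop, user_rate, item_rate and lfm_reco to a file to obtain train_final.csv.

Why were the values of user_rate and item_rate identical?  
In outcols, item_rate had been mistakenly written as user_rate.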

In [9]:
def generate_train_final(input_file,output_file):
    """
    generate final train file
    args:
        input_file: input file name (relative to dpath)
        output_file: output file name (relative to dpath)
    """
    with open(dpath+"user_vec.pkl","rb") as f:
        user_vec = pickle.load(f)
    with open(dpath+"item_vec.pkl","rb") as f:
        item_vec = pickle.load(f)
    with open(dpath+"user_record.pkl","rb") as f:
        user_record = pickle.load(f)
    with open(dpath+"item_record.pkl","rb") as f:
        item_record = pickle.load(f)
    user_pop_mean,user_pop_std,item_pop_mean,item_pop_std\
                    = get_mean_std(user_record,item_record)
    new_cols = ["user_pop","item_pop","user_rate","item_rate","lfm_reco"]
    with open(dpath+input_file,"r") as fin, open(dpath+output_file,"w") as fout:
        # write the column names, keeping target as the last column
        header = next(fin).strip().split(",")
        fout.write(",".join(header[:-1]+new_cols+[header[-1]])+"\n")
        for line in fin:
            cols = line.strip().split(",")
            userid,itemid = cols[0],cols[1]
            # skip rows that have no trained latent vector
            if userid not in user_vec or itemid not in item_vec:
                continue
            # compute user_pop and item_pop (z-scores)
            user_pop = round((user_record[userid][1]-user_pop_mean)/user_pop_std,3)
            item_pop = round((item_record[itemid][1]-item_pop_mean)/item_pop_std,3)
            # compute user_rate and item_rate
            user_rate = round(user_record[userid][1]/user_record[userid][0],3)
            item_rate = round(item_record[itemid][1]/item_record[itemid][0],3)
            # compute lfm_reco
            lfm_reco = np.around(lfm_score(user_vec[userid],item_vec[itemid]),decimals=5)
            # write the row
            outcols = cols[:-1]+[str(user_pop),str(item_pop),
                      str(user_rate),str(item_rate),str(lfm_reco)]+[cols[-1]]
            fout.write(",".join(outcols)+"\n")

In [10]:
%%time
generate_train_final("train_merge.csv","train_final.csv")

Wall time: 4min 48s


### 4. Generate the final test file

Write user_pop, item_pop, user_rate, item_rate and lfm_reco to a file to obtain test_final.csv.

In [7]:
def generate_test_final(input_file,output_file):
    """
    generate final test file
    args:
        input_file: input file name (relative to dpath)
        output_file: output file name (relative to dpath)
    """
    with open(dpath+"user_vec.pkl","rb") as f:
        user_vec = pickle.load(f)
    with open(dpath+"item_vec.pkl","rb") as f:
        item_vec = pickle.load(f)
    with open(dpath+"user_record.pkl","rb") as f:
        user_record = pickle.load(f)
    with open(dpath+"item_record.pkl","rb") as f:
        item_record = pickle.load(f)
    user_pop_mean,user_pop_std,item_pop_mean,item_pop_std\
                    = get_mean_std(user_record,item_record)
    new_cols = ["user_pop","item_pop","user_rate","item_rate","lfm_reco"]
    with open(dpath+input_file,"r") as fin, open(dpath+output_file,"w") as fout:
        # write the column names
        header = next(fin).strip().split(",")
        fout.write(",".join(header+new_cols)+"\n")
        for line in fin:
            cols = line.strip().split(",")
            # in the test file, column 0 is the row id
            userid,itemid = cols[1],cols[2]
            # compute user_pop and user_rate; unseen users get 0
            if userid in user_record:
                user_pop = round((user_record[userid][1]-user_pop_mean)/user_pop_std,3)
                user_rate = round(user_record[userid][1]/user_record[userid][0],3)
            else:
                user_pop = 0.
                user_rate = 0.
            # compute item_pop and item_rate; unseen songs get 0
            if itemid in item_record:
                item_pop = round((item_record[itemid][1]-item_pop_mean)/item_pop_std,3)
                item_rate = round(item_record[itemid][1]/item_record[itemid][0],3)
            else:
                item_pop = 0.
                item_rate = 0.
            # compute lfm_reco; fall back to 0 when either latent vector is
            # missing, so a stale value from an earlier row is never reused
            if userid in user_vec and itemid in item_vec:
                lfm_reco = np.around(lfm_score(user_vec[userid],item_vec[itemid]),decimals=5)
            else:
                lfm_reco = 0.
            # write the row
            outcols = cols+[str(user_pop),str(item_pop),
                      str(user_rate),str(item_rate),str(lfm_reco)]
            fout.write(",".join(outcols)+"\n")
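
Test rows can contain users or songs that never appear in the training data, which is why each feature needs a fallback value. The sketch below shows the lookup-with-default pattern for the rate features; the dictionary contents are made up and the helper name rate_or_default is hypothetical.

```python
def rate_or_default(record, key, default=0.0):
    # Return value2 / value1 for a known key, otherwise the fallback value
    if key not in record:
        return default
    value1, value2 = record[key]
    return round(value2 / value1, 3)

user_record = {"u1": [3, 2]}  # made-up training statistics
seen = rate_or_default(user_record, "u1")       # user present in training data
unseen = rate_or_default(user_record, "u_new")  # user absent from training data
print(seen, unseen)
```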

In [8]:
%%time
generate_test_final("test_merge.csv","test_final.csv")

Wall time: 1min 27s


In [9]:
test_final = pd.read_csv(dpath+"test_final.csv")

In [10]:
test_final.head()

Unnamed: 0,id,msno,song_id,source_system_tab,source_screen_name,source_type,city,bd,gender,registered_via,...,expiration_date,song_length,genre_ids,language,mult_genre,user_pop,item_pop,user_rate,item_rate,lfm_reco
0,0,V8ruy7SGk7tDm3zA51DPpn6qutt+vmKMBKa21dp54uM=,WmHKgKMlp1lQMecNdNvDMkvIycZYHnFwDT72I5sIssc=,3,7,2,1,0,2,7,...,13,-0.14208,24,4,0,-0.401,2.944,0.366,0.504,0.38083
1,1,V8ruy7SGk7tDm3zA51DPpn6qutt+vmKMBKa21dp54uM=,y/rsZ9DC7FwK5F2PK2D5mj+aOBUJAjuu3dZ14NgE0vM=,3,7,2,1,0,2,7,...,13,0.45662,25,4,0,-0.401,32.878,0.366,0.625,0.53185
2,2,/uQAlrAkaczV+nWCd2sPF2ekvXPRipV7q0l+gbLuxjw=,8eZLFOdGVdXBSqoAv5nsLigeH2BvKXzTQYtUM53I0k4=,0,16,9,1,0,2,4,...,12,0.42822,12,2,0,-0.594,-0.072,0.144,0.4,-0.88885
3,3,1a6oo/iXKatxQx4eS9zTVD+KlSVaAFbTIqVvwLC1Y0k=,ztCf8thYsS4YN3GcIL/bvoxLm/T5mYBVKOO4C9NiVfQ=,7,11,7,3,5,1,9,...,13,0.2375,25,8,0,-0.034,-0.029,0.296,0.226,-0.18739
4,4,1a6oo/iXKatxQx4eS9zTVD+KlSVaAFbTIqVvwLC1Y0k=,MKVMpslKcQhMaFEgcEQhEfi5+RZhMYlU3eRDpySrH8Y=,7,11,7,3,5,1,9,...,13,-0.30702,32,0,0,-0.034,-0.072,0.296,0.4,0.98938


In [11]:
test_final.shape

(2556790, 21)

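### Pitfalls encountered

write() argument must be str, not bytes  
Fix: open the file in "wb" mode.

An error occurred when computing delta and user_record:  
TypeError: unsupported operand type(s) for +=: 'int' and 'str'  
Fix: target is a str and must be converted to int.

Why were the values of user_rate and item_rate identical?  
In outcols, item_rate had been mistakenly written as user_rate.

Why were the first 5 rows of train_final empty?  
outcols did not satisfy the condition, so [] was written directly. Fix: correct the conditional statement.

'gbk' codec can't decode byte 0x80 in position 0: illegal multibyte sequence  
Fix: open the file in "rb" mode for pickle.load.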
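
Both pickle-related errors above come from opening the file in text mode. A minimal round trip in binary mode is sketched here, with a temporary directory and made-up data standing in for the real files:

```python
import os
import pickle
import tempfile

user_vec = {"u1": [0.1, 0.2]}  # made-up data
path = os.path.join(tempfile.mkdtemp(), "user_vec.pkl")

# pickle produces bytes, so the file is opened in binary mode for writing
with open(path, "wb") as f:
    pickle.dump(user_vec, f)

# binary mode for reading skips text decoding (e.g. gbk) entirely
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == user_vec)
```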

TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'  
A bare "lfm_reco" cannot be used; change it to ["lfm_reco"]. append cannot be used here because it returns None.

TypeError: can only concatenate list (not "str") to list  
Change cols[-1] to [cols[-1]].
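
The two list-concatenation errors can be reproduced on a made-up header row, which also shows why wrapping each extra item in its own list fixes them:

```python
cols = ["msno", "song_id", "target"]  # made-up header

# list + list works; each extra item is wrapped in its own list
outcols = cols[:-1] + ["lfm_reco"] + [cols[-1]]

# list.append mutates in place and returns None, so its result
# cannot be used in a concatenation
result_of_append = cols[:-1].append("lfm_reco")

# list + str raises TypeError
try:
    cols[:-1] + cols[-1]
    concat_failed = False
except TypeError:
    concat_failed = True

print(outcols, result_of_append, concat_failed)
```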

TypeError: 'int' object is not callable  
The variable lfm_reco and the function had the same name; the function that computes the recommendation score was renamed to lfm_score.
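
A minimal reproduction of this name clash, with a stand-in function body in place of the real scoring code:

```python
def lfm_reco(a, b):
    # Stand-in scoring function that shares its name with a variable below
    return a * b

lfm_reco = 0  # rebinding the name: it now refers to an int, not the function

try:
    lfm_reco(2, 3)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Giving the function its own name (lfm_score) keeps lfm_reco free to hold the per-row value.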