# Data analysis code

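1. Test set analysis:
  + The test set contains 2761799 samples from 32615 users, so there are far more test samples than users (about 85 samples per user on average).
  + The test set covers 790304 items.
2. Analysis of final_track2_train.txt: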

| uid | user_city | item_id | author_id | item_city | channel | finish | like | music_id | did | creat_time | video_duration |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
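| User ID | User's city | Item (video) ID | Video author ID | Item's city | Item source channel | Whether the video was watched to the end | Whether the video was liked | Music ID | Device ID | Item publish time | Item duration |
  > + The sparse features in this table are all single-valued (one-hot style) categorical fields, so they are encoded into integer indices with sklearn's LabelEncoder.
  > + The dense features in this table are single-valued as well, so they are first normalized to the range (0, 1) and then used as input.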
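
The two preprocessing steps above can be illustrated with a small dependency-free sketch. It mimics what sklearn's `LabelEncoder` (sorted distinct values mapped to indices) and `MinMaxScaler` (linear rescaling between the column minimum and maximum) do; the helper names and the toy column values are made up for illustration and are not taken from the dataset.

```python
def label_encode(values):
    """Map each distinct category to an integer index (sorted order)."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


def min_max_scale(values):
    """Linearly rescale numbers so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy values standing in for one sparse column and one dense column.
user_city = [109, 80, 109, 7]
video_duration = [10, 40, 20, 30]

encoded, classes = label_encode(user_city)
scaled = min_max_scale(video_duration)
print(encoded)  # [2, 1, 2, 0]
print(scaled)
```

Note that min-max scaling produces the closed interval [0, 1]: the smallest and largest values of a column map exactly to 0 and 1.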

3. Analysis of track2_title.txt:
  > + Each line is a dictionary string with two keys, item_id and title_features; the words in title_features need to be counted.
```
{"item_id": 4036886, "title_features": {"a": b}}
```
  > + The output is a tuple: tuple\[0\] is the list of word indices and tuple\[1\] is the list of the corresponding word counts.
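
As a rough sketch of the parsing step described above, the snippet below turns one line into the `(indices, counts)` tuple. The function name `parse_title_line`, the sample line, and the assumption that the keys of `title_features` are word indices stored as strings are all illustrative guesses, not confirmed by the dataset description.

```python
import json


def parse_title_line(line):
    """Return (item_id, (word_indices, word_counts)) for one JSON line."""
    record = json.loads(line)
    features = record["title_features"]
    # JSON object keys are always strings; converting them to int is an assumption.
    pairs = sorted((int(word), int(count)) for word, count in features.items())
    indices = [word for word, _ in pairs]
    counts = [count for _, count in pairs]
    return record["item_id"], (indices, counts)


# Made-up sample line in the same shape as the example above.
sample = '{"item_id": 4036886, "title_features": {"12": 1, "7": 2}}'
item_id, title = parse_title_line(sample)
print(item_id, title)  # 4036886 ([7, 12], [2, 1])
```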

4. Analysis of track2_video_features.txt:
  > + Each line is a dictionary string with two keys, item_id and video_feature_dim_128; the latter is a 128-dimensional representation of the video.
```
{"item_id": 11274473, "video_feature_dim_128": [0, 128]}
```
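  > + This video information can be fed to the model in two ways. One is to pass it through a fully connected layer that maps it to the field embedding dimension. The other is to treat each dimension of the vector as a separate dense feature and use those dense features as input.
  > + For now, the plan is to feed it in as a single vector.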
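
The first option (one vector passed through a fully connected layer) can be sketched in plain Python as below. This only illustrates the shapes involved: the weights are random and untrained, and `EMBED_DIM`, `make_projection`, `project`, and the sample record are hypothetical names and values chosen for the example. In a real model the layer would be part of the network and trained with it.

```python
import json
import random


def make_projection(in_dim, out_dim, seed=0):
    """Random (untrained) weights and zero bias for a linear layer."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def project(vector, weights, bias):
    """Compute W @ v + b with plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


EMBED_DIM = 8  # hypothetical field embedding size
# Made-up line: a real record carries 128 floats.
line = json.dumps({"item_id": 11274473, "video_feature_dim_128": [0.5] * 128})
record = json.loads(line)
video_vec = record["video_feature_dim_128"]

weights, bias = make_projection(len(video_vec), EMBED_DIM)
field_embedding = project(video_vec, weights, bias)
print(len(video_vec), "->", len(field_embedding))  # 128 -> 8
```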

5. Analysis of track2_face_attrs.txt:
  > Each line is a dictionary string.
```
{"item_id": 6603879, "face_attrs": [{"gender": 0, "beauty": 0.53,"relative_position":[0.4306, 0.3203, 0.3333, 0.2969]}]}
```
  > + Sparse features are one-hot encoded directly; dense features are normalized.
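
A minimal sketch of how one such line could be flattened into per-face feature vectors follows. It assumes `gender` takes values in {0, 1} (only 0 appears in the example) and treats `beauty` and `relative_position` as the dense fields; `face_features` and `num_genders` are hypothetical names. Dataset-wide normalization of the dense values is not shown here.

```python
import json


def face_features(line, num_genders=2):
    """Return (item_id, list of per-face feature vectors) for one JSON line."""
    record = json.loads(line)
    rows = []
    for face in record["face_attrs"]:
        # Sparse field: one-hot over gender, assuming values in {0, ..., num_genders - 1}.
        gender = [0.0] * num_genders
        gender[face["gender"]] = 1.0
        # Dense fields: beauty score and the four relative_position numbers.
        dense = [face["beauty"]] + list(face["relative_position"])
        rows.append(gender + dense)
    return record["item_id"], rows


sample = ('{"item_id": 6603879, "face_attrs": [{"gender": 0, "beauty": 0.53,'
          '"relative_position":[0.4306, 0.3203, 0.3333, 0.2969]}]}')
item_id, rows = face_features(sample)
print(item_id, rows)
```

An item can list several faces, so the function returns one vector per face; how to pool them into a single item-level input is left open.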

In [3]:
import codecs
import pandas as pd

In [1]:
# Raw strings keep the Windows backslashes literal; otherwise "\t" and "\f" are read as tab and form feed.
final_track2_train_path = r"D:\Competition\内容理解与推荐\data\train_set\final_track2_train.txt"
track2_title_path = r"D:\Competition\内容理解与推荐\data\train_set\track2_title.txt"
track2_face_attrs = r"D:\Competition\内容理解与推荐\data\train_set\track2_face_attrs.txt"
track2_video_features_path = r"D:\Competition\内容理解与推荐\data\train_set\track2_video_features.txt"
test_data_path = "./data/result.csv"

In [6]:
def ShowFinalTrackInfo(final_track_path):
    with codecs.open(final_track_path, "r", encoding="utf-8") as fp:
        content = fp.read()
    # One sample per line; a trailing newline must not be counted as an extra sample.
    num = len(content.rstrip("\n").split("\n")) if content.strip() else 0

    print("==> The number of final_track samples : %d" % num)

def ShowTestDataInfo(test_data_path):
    df = pd.read_csv(test_data_path, encoding="utf-8")

    print("==> The number of test samples : %d" % len(df))
    print("==> The number of users : %d" % len(set(df["uid"])))
    print("==> The number of items : %d" % len(set(df["item_id"])))

In [7]:
ShowTestDataInfo(test_data_path)

==> The number of test samples : 2761799
==> The number of users : 32615
==> The number of items : 790304


## Generating FFM training files

In [1]:
import os
process_path = "/disk/private-data/ICME2019/process"
split_num = 5
target = ["finish", "like"]

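def transform(line):
    # Input: one comma-separated sample whose first column is kept as-is (presumably the label).
    # Output: that first column followed by "field:value" pairs, where the field index
    # is the column position; only the first 11 feature columns are kept.
    line = line.split(",")
    line = [line[0]] + ["%d:%s"%(i,j) for i,j in zip(range(11), line[1:])]
    return ",".join(line)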
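
To make the conversion concrete, here is the same logic applied to two made-up input lines. The function is repeated so the snippet runs on its own, and the sample values are invented for illustration.

```python
def transform(line):
    # Same logic as the cell above, repeated so this snippet runs on its own.
    fields = line.split(",")
    pairs = ["%d:%s" % (i, v) for i, v in zip(range(11), fields[1:])]
    return ",".join([fields[0]] + pairs)


print(transform("1,12,34,5"))  # 1,0:12,1:34,2:5
# zip(range(11), ...) silently drops anything past the 11th feature column.
print(transform("0," + ",".join(str(n) for n in range(13))).count(":"))  # 11
```

An empty input line is returned unchanged, so a trailing newline in the source file is preserved in the output file.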

In [2]:
for tar in target:
    print("==> Process %s" % tar)
    for i in range(split_num):
        print("==> Process %d train data" % (i+1))
        train_path = os.path.join(process_path, "FM_%s_train%d.txt"%(tar, i+1))
        FFM_path = os.path.join(process_path, "FFM_%s_train%d.txt"%(tar, i+1))
        with open(train_path, "r") as fp:
            content = fp.read().split("\n")
        content = list(map(transform, content))
        with open(FFM_path, "w") as fp:
            fp.write("\n".join(content))
        

==> Process finish
==> Process 1 train data
==> Process 2 train data
==> Process 3 train data
==> Process 4 train data
==> Process 5 train data
==> Process like
==> Process 1 train data
==> Process 2 train data
==> Process 3 train data
==> Process 4 train data
==> Process 5 train data


In [2]:
tar = "finish"
print("==> Process %s" % tar)
# Apply the same conversion to the validation split of the chosen target.
valid_path = os.path.join(process_path, "FM_%s_valid.txt"%tar)
FFM_path = os.path.join(process_path, "FFM_%s_valid.txt"%tar)
with open(valid_path, "r") as fp:
    content = fp.read().split("\n")
content = list(map(transform, content))
# The converted lines are written to FFM_path in the next cell.

==> Process finish


In [3]:
with open(FFM_path, "w") as fp:
    fp.write("\n".join(content))