<div 
     style="padding: 20px; 
            color: black;
            margin: 0;
            font-size: 250%;
            text-align: center;
            display: fill;
            border-radius: 5px;
            background-color: #0daae3;
            overflow: hidden;
            font-weight: 700;
            border: 5px solid black;"
     >
            This version covers dataset conversion and solution design
</div>




## Imports

In [1]:
import os
import json

import numpy as np 
import pandas as pd
from tqdm import tqdm
from glob import glob

In [2]:
path_project = r'/workspace/datasets/otto'

# path dir
path_row_data = os.path.join(path_project, 'row_data')
path_new_data = os.path.join(path_project, 'new_data')
path_results  = os.path.join(path_project, 'results')

# path row_data
path_train = os.path.join(path_row_data, 'train.jsonl')
path_test  = os.path.join(path_row_data, 'test.jsonl')
path_sample_submission = os.path.join(path_row_data, 'sample_submission.csv')

# directories for the Parquet files
path_parquet = os.path.join(path_new_data, 'parquet')
path_parquet_train = os.path.join(path_parquet, 'train')
path_parquet_test = os.path.join(path_parquet, 'test')

## Dataset conversion

### Checking the data size

In [3]:
for path_tmp in [path_train, path_test]:
    count = 0
    with open(path_tmp, 'r') as f:
        for line in f:
            count += 1

    print("{} has {} lines".format(os.path.basename(path_tmp), count))

train.jsonl has 12899779 lines
test.jsonl has 1671803 lines


**The training set is too large to load at once, so we process it in chunks.**

In [4]:
chunksize = 100_000

num_chunks = int(np.ceil(12899779 / chunksize))
print('Number of training chunks: {}'.format(num_chunks))

Number of training chunks: 129
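
The chunk count above hard-codes the line count printed earlier. A minimal sketch that derives it from the file instead is shown below; the helper names `count_lines` and `n_chunks` are hypothetical and not part of the original notebook.

```python
import math


def count_lines(path):
    """Count the lines of a file without loading it into memory."""
    with open(path, 'rb') as f:
        return sum(1 for _ in f)


def n_chunks(n_rows, chunksize):
    """Number of chunks needed to cover n_rows rows (the last one may be partial)."""
    return math.ceil(n_rows / chunksize)
```

With the paths defined earlier, `n_chunks(count_lines(path_train), chunksize)` would replace the literal `12899779`.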


### Reading part of the data

In [5]:
n = 2

df_train = pd.DataFrame()
chunks = pd.read_json(path_train, lines=True, chunksize=chunksize)

for i, chunk in enumerate(chunks):
    if i < n:
        df_train = pd.concat([df_train, chunk])
    else:
        break

df_train = df_train.set_index("session", drop=True).sort_index()

In [6]:
df_train.head()

| session | events |
|---|---|
| 0 | [{'aid': 1517085, 'ts': 1659304800025, 'type':... |
| 1 | [{'aid': 424964, 'ts': 1659304800025, 'type': ... |
| 2 | [{'aid': 763743, 'ts': 1659304800038, 'type': ... |
| 3 | [{'aid': 1425967, 'ts': 1659304800095, 'type':... |
| 4 | [{'aid': 613619, 'ts': 1659304800119, 'type': ... |


In [7]:
df_train.iloc[5,0]

[{'aid': 1098089, 'ts': 1659304800133, 'type': 'clicks'},
 {'aid': 1354785, 'ts': 1659304827838, 'type': 'clicks'},
 {'aid': 342507, 'ts': 1659304856326, 'type': 'clicks'},
 {'aid': 1120175, 'ts': 1659304862271, 'type': 'clicks'},
 {'aid': 1808870, 'ts': 1659304863925, 'type': 'clicks'},
 {'aid': 1402845, 'ts': 1659304865459, 'type': 'clicks'},
 {'aid': 829383, 'ts': 1659304867287, 'type': 'clicks'},
 {'aid': 743867, 'ts': 1659305242050, 'type': 'clicks'},
 {'aid': 747242, 'ts': 1659365531565, 'type': 'clicks'},
 {'aid': 63299, 'ts': 1660347615375, 'type': 'clicks'},
 {'aid': 1813405, 'ts': 1660347698897, 'type': 'clicks'},
 {'aid': 1813405, 'ts': 1660347708319, 'type': 'carts'},
 {'aid': 140361, 'ts': 1660347738186, 'type': 'clicks'},
 {'aid': 1813405, 'ts': 1660348737117, 'type': 'clicks'},
 {'aid': 1813405, 'ts': 1660348787598, 'type': 'clicks'}]

In [8]:
n = 2

df_test = pd.DataFrame()
chunks = pd.read_json(path_test, lines=True, chunksize=chunksize)

for i, chunk in enumerate(chunks):
    if i < n:
        df_test = pd.concat([df_test, chunk])
    else:
        break

df_test = df_test.set_index("session", drop=True).sort_index()

In [9]:
df_test.head()

| session | events |
|---|---|
| 12899779 | [{'aid': 59625, 'ts': 1661724000278, 'type': '... |
| 12899780 | [{'aid': 1142000, 'ts': 1661724000378, 'type':... |
| 12899781 | [{'aid': 141736, 'ts': 1661724000559, 'type': ... |
| 12899782 | [{'aid': 1669402, 'ts': 1661724000568, 'type':... |
| 12899783 | [{'aid': 255297, 'ts': 1661724000572, 'type': ... |


In [10]:
df_test.iloc[9,0]

[{'aid': 245131, 'ts': 1661724001619, 'type': 'clicks'},
 {'aid': 39846, 'ts': 1661724018620, 'type': 'clicks'},
 {'aid': 1259911, 'ts': 1661724034826, 'type': 'clicks'},
 {'aid': 1663048, 'ts': 1661724078816, 'type': 'clicks'}]
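
The train and test cells above repeat the same loop and grow the frame with one `pd.concat` per chunk. A possible refactoring, sketched under the assumption that only the first few chunks are needed, is a small helper; `read_first_chunks` is a hypothetical name introduced here.

```python
from itertools import islice

import pandas as pd


def read_first_chunks(path, n_chunks=2, chunksize=100_000):
    """Read the first n_chunks chunks of a JSON Lines file into one DataFrame."""
    reader = pd.read_json(path, lines=True, chunksize=chunksize)
    try:
        # islice stops after n_chunks chunks; concat runs once over all of them
        df = pd.concat(list(islice(reader, n_chunks)))
    finally:
        reader.close()  # release the file handle even though we stopped early
    return df.set_index('session', drop=True).sort_index()
```

Usage would be `df_train = read_first_chunks(path_train, n_chunks=2)` and likewise for `path_test`.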

### Saving the data in Parquet format

In [11]:
def jsonl2parquet(path_inp, path_out):
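    """
    Convert a dataset from the original JSON Lines format to Parquet.
    
    Args:
        path_inp: str, path to the original dataset (a file)
        path_out: str, path where the new dataset is written (a directory)

    Returns:
        None. One Parquet file is written per chunk.
    """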
    chunksize = 100_000
    
    reader = pd.read_json(path_inp, lines=True, chunksize=chunksize)
    os.makedirs(path_out, exist_ok=True)

    for i, chunk in enumerate(reader):
        event_dict = {
            'session': [],
            'aid': [],
            'ts': [],
            'type': [],
        }
        
        for session, events in zip(chunk['session'].values, chunk['events'].values):
            for event in events:
                event_dict['session'].append(session)
                event_dict['aid'].append(event['aid'])
                event_dict['ts'].append(event['ts'])
                event_dict['type'].append(event['type'])
        
        start = str(i*chunksize).zfill(9)
        end = str(i*chunksize + chunksize).zfill(9)
        df_event = pd.DataFrame(event_dict)
        df_event.to_parquet(os.path.join(path_out, '{}_{}.parquet'.format(start, end)))
    print('Dataset conversion finished!')


In [12]:
jsonl2parquet(path_train, path_parquet_train)
jsonl2parquet(path_test, path_parquet_test)

Dataset conversion finished!
Dataset conversion finished!
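
The nested Python loop inside `jsonl2parquet` can also be written with pandas operations. The sketch below is one way to flatten a single chunk, assuming every session has at least one event and every event dict carries the keys `aid`, `ts` and `type`; `flatten_events` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def flatten_events(chunk):
    """Turn one row per session into one row per event."""
    # One row per (session, event dict); session values are repeated as needed
    exploded = chunk.explode('events', ignore_index=True)
    # Expand the event dicts into the columns aid, ts, type
    events = pd.DataFrame(exploded['events'].tolist())
    events.insert(0, 'session', exploded['session'].to_numpy())
    return events
```

Inside the chunk loop, `df_event = flatten_events(chunk)` would take the place of the `event_dict` bookkeeping.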


## Reading the Parquet files

In [13]:
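# glob needs a wildcard pattern; the bare directory path would match only the directory itself
files = sorted(glob(os.path.join(path_parquet_train, '*.parquet')))[:5]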

df_list = []
for path in files:
    df_list.append(pd.read_parquet(path))

df_data = pd.concat(df_list).reset_index(drop=True)

In [14]:
df_data.head()

|  | session | aid | ts | type |
|---|---|---|---|---|
| 0 | 0 | 1517085 | 1659304800025 | clicks |
| 1 | 0 | 1563459 | 1659304904511 | clicks |
| 2 | 0 | 1309446 | 1659367439426 | clicks |
| 3 | 0 | 16246 | 1659367719997 | clicks |
| 4 | 0 | 1781822 | 1659367871344 | clicks |


## Solution design

### Overall design
Ideas:
+ [Recommendation Systems for Large Datasets](https://www.kaggle.com/competitions/otto-recommender-system/discussion/364721): proposes the two-stage approach
+ [Co-visitation Matrix](https://www.kaggle.com/code/vslaykovsky/co-visitation-matrix): introduces the co-visitation matrix and a submission based on the most recently visited items
+ [Item type vs multiple clicks vs latest items](https://www.kaggle.com/code/ingvarasgalinskas/item-type-vs-multiple-clicks-vs-latest-items)
+ [co-visitation matrix - simplified, imprvd logic 🔥](https://www.kaggle.com/code/radek1/co-visitation-matrix-simplified-imprvd-logic)

### Candidate ReRank Model using Handcrafted Rules
+ Source: [Candidate ReRank Model - [LB 0.573]](https://www.kaggle.com/code/cdeotte/candidate-rerank-model-lb-0-573/notebook)

This approach is implemented with handcrafted rules.

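**Step 1: Generate candidates**

For each test user (each session), we generate candidates from the following sources:
1. Items the user has clicked, carted, or ordered in the past
2. The 20 most popular clicked/carted/ordered items during the evaluation period
3. A type-weighted co-visitation matrix from clicks/carts/orders to carts/orders
4. A co-visitation matrix from carts/orders to carts/orders
5. A time-weighted co-visitation matrix from clicks/carts/orders to clicks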
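
The way these sources combine can be sketched as follows. This is a simplified illustration, not the referenced notebook's code: the three co-visitation matrices are collapsed into a single precomputed dict `covisit` (aid to a list of related aids), and the names `generate_candidates`, `covisit`, `top_popular` and `n_neighbors` are all assumptions made for this example.

```python
from itertools import chain


def generate_candidates(session_aids, covisit, top_popular, n_neighbors=20):
    """Merge the candidate sources for one session into an ordered, duplicate-free list.

    session_aids: aids of the session in chronological order
    covisit:      hypothetical dict, aid -> list of related aids (best first)
    top_popular:  list of globally popular aids (best first)
    """
    # Source 1: the session's own history, most recent first, duplicates removed
    history = list(dict.fromkeys(reversed(session_aids)))
    # Sources 3-5 (collapsed here): neighbors of every item in the history
    neighbors = chain.from_iterable(
        covisit.get(aid, [])[:n_neighbors] for aid in history
    )
    # Source 2: popular items act as a fallback at the end
    return list(dict.fromkeys(chain(history, neighbors, top_popular)))
```

For example, `generate_candidates([1, 2, 1], {1: [3, 2], 2: [4]}, [9, 1])` yields the history first, then unseen neighbors, then unseen popular items.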

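**Step 2: Rerank and select 20 items**

Given the candidate set, we must pick 20 items as our prediction. Here we handcraft features, as in a machine-learning pipeline, and use an XGBoost model to make the prediction. The feature engineering follows this logic:
1. Items visited most recently
2. Items visited multiple times before
3. Items previously carted or ordered
4. The carts/orders to carts/orders co-visitation matrix
5. Currently popular items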
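
Before any model is trained, these priorities can be approximated with a weighted score per candidate. The sketch below is only an illustration: the weights (`1.0`, `2.0`, `0.5`) and the names `rerank` and `covisit_buy` are hypothetical, and a learned ranker such as XGBoost would derive the relative importance from features instead.

```python
from collections import Counter


def rerank(events, covisit_buy, top_popular, k=20):
    """Score candidates with handcrafted weights and return the top k aids.

    events:      list of (aid, type) tuples in chronological order
    covisit_buy: hypothetical dict, aid -> aids co-visited in carts/orders
    top_popular: list of popular aids used to fill remaining slots
    """
    scores = Counter()
    n = len(events)
    for pos, (aid, etype) in enumerate(events):
        # Every visit adds to the score (rule 2); later visits add more (rule 1)
        scores[aid] += 1.0 + pos / n
        if etype in ('carts', 'orders'):
            scores[aid] += 2.0  # rule 3: carted or ordered items
            for other in covisit_buy.get(aid, []):
                scores[other] += 0.5  # rule 4: carts/orders co-visitation
    ranked = [aid for aid, _ in scores.most_common(k)]
    # Rule 5: fill the remaining slots with popular items
    for aid in top_popular:
        if len(ranked) >= k:
            break
        if aid not in ranked:
            ranked.append(aid)
    return ranked
```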

![](./img/c_r_model.png)

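#### Step 1: Co-visitation matrices

We build three co-visitation matrices:
+ One uses the user's previous clicks/carts/orders to score how popular items are as carts/orders; this matrix is built with type weighting.
+ One uses the user's previous carts/orders to score how popular items are as carts/orders.
+ One uses the user's previous clicks/carts/orders to score how popular items are as clicks; this matrix is built with time weighting.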
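
A co-visitation matrix of this kind can be sketched with a self-merge on `session`. The code below is a minimal in-memory illustration, assuming the flattened `session, aid, ts, type` layout produced earlier with `ts` in milliseconds. The function name `covisitation`, the values in `TYPE_WEIGHT`, the 24-hour window and the `1 + 3 * ...` time weight are assumptions chosen for this example; on the full dataset the merge would have to run chunk by chunk.

```python
import pandas as pd

# Illustrative weights per event type (assumed values, tune as needed)
TYPE_WEIGHT = {'clicks': 1, 'carts': 6, 'orders': 3}


def covisitation(df, weighting='type', window_ms=24 * 60 * 60 * 1000, top_n=20):
    """Return a dict aid -> list of the top_n most co-visited aids."""
    # All pairs of events that share a session
    pairs = df.merge(df, on='session', suffixes=('_x', '_y'))
    # Keep pairs of different items that occur within the time window
    close = (pairs['ts_x'] - pairs['ts_y']).abs() < window_ms
    pairs = pairs[(pairs['aid_x'] != pairs['aid_y']) & close]
    # Count each item pair at most once per session
    pairs = pairs.drop_duplicates(['session', 'aid_x', 'aid_y'])
    if weighting == 'type':
        # Type weighting: weight by the type of the second item
        pairs = pairs.assign(wgt=pairs['type_y'].map(TYPE_WEIGHT))
    else:
        # Time weighting: later events in the dataset weigh more (1 to 4)
        t_min = df['ts'].min()
        span = max(df['ts'].max() - t_min, 1)
        pairs = pairs.assign(wgt=1 + 3 * (pairs['ts_x'] - t_min) / span)
    scores = pairs.groupby(['aid_x', 'aid_y'])['wgt'].sum().reset_index()
    scores = scores.sort_values(['aid_x', 'wgt'], ascending=[True, False])
    top = scores.groupby('aid_x')['aid_y'].apply(lambda s: s.head(top_n).tolist())
    return top.to_dict()
```

For the carts/orders to carts/orders matrix, one option under the same assumptions is to filter `df` to rows whose `type` is `carts` or `orders` before calling the function.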