# PaddlePaddle Regular Competition: Paper Citation Node Classification, 4th-Place Solution (March)

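## Competition Task

A graph neural network (GNN) is a neural network designed for graph-structured data. GNNs are now widely used in recommender systems, financial risk control, and computational biology. The three classic GNN tasks are node classification, link prediction, and graph classification.

The dataset is a paper citation network: the graph consists of a large number of academic papers, the edges between nodes are citation relationships, and each node carries a feature vector built from a simple combination of word vectors. The goal is to infer the category of each paper.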





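### References:
Regular competition: paper citation network node classification, write-up from the top of the December leaderboard: https://aistudio.baidu.com/aistudio/projectdetail/1462003?channel=0&channelType=0&shared=1

Regular competition: paper citation network node classification, a modified DeeperGCN based on PGL: https://aistudio.baidu.com/aistudio/projectdetail/1274656?channelType=0&channel=0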

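### The solution consists entirely of changes to the official baseline
The best single model is DeeperGCN; simple voting across models then improved the score by about half a percentage point.
Tuning focused mainly on the number of layers and the dropout parameter.
The single DeeperGCN model with 10 layers, dropout 0.2, and 750 epochs reached an accuracy of 0.732.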

## How to Run
This baseline is built on PaddlePaddle 1.8.4. To run it locally you may need to install additional modules such as pgl, easydict, and pandas.
The pgl version is 1.2.0.

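## Running Locally
Download all .py files from the folder on the left (including build_model.py and model.py) together with the work directory. Then choose "File" -> "Export Notebook to py" in the upper right corner (this ensures the code is the latest version) and run the exported .py file. When it finishes, download submission.csv and submit it.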



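## Running on AI Studio (Notebook)
Run the cells below in order; when they finish, download submission.csv and submit it. If you modified a cell during the run, it is recommended to restart the kernel from the upper right corner and then run the cells in order again, to avoid errors caused by memory that was not cleared. Tip: if you modified data in the folder on the left, you also need to restart the kernel before the new files are loaded.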

## Running from the Terminal
Run python train.py in the terminal. When it finishes, submission.csv is generated and can be submitted directly.

## Program Notes

**A single-layer model can be run on CPU; multi-layer models need a GPU.**

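## Overall Code Logic

1. Read the provided dataset, which includes building the graph and reading the node features (users can change how the edges are constructed).

2. Build the model from a configuration; users can also implement their own graph neural network by following the tutorial.

3. Train the model.

4. Run prediction and produce the result file.

5. Run ensemble learning locally.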


## Environment Setup

This project depends on paddlepaddle==1.8.4 and pgl==1.2.0. Install these exact versions and it should run.

In [None]:
!pip install pgl==1.2.0 easydict 



In [None]:
# Add the following code so that every time the environment (kernel)
# starts, you only need to run it:
#import sys 
#import os
#os.system('pip install paddlepaddle==1.8.4 pgl==1.2.0 easydict')

In [None]:
import pgl
import paddle.fluid as fluid
import numpy as np
import time
import pandas as pd

## Graph Network Configuration

model.py already contains several graph network models:
GAT
GCN
DeeperGCN
ResGAT
ResGATII
ResGCN
APPNP

For example, a GAT configuration:
```
config = {
    "model_name": "GAT",
    "num_layers":  1,
    "dropout": 0.5,
    "learning_rate": 0.01,
    "weight_decay": 0.0005,
    "edge_dropout": 0.00,
}
```
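
Since tuning in this solution concentrated on the number of layers and on dropout, a sweep over those two keys could be sketched as follows. This is only an illustration: the helper `make_configs` and the candidate values are assumptions, not the grid the author actually searched.

```python
from itertools import product

# Keys shared by every candidate; values copied from the DeeperGCN config in this notebook.
base_config = {
    "model_name": "DeeperGCN",
    "learning_rate": 0.001,
    "weight_decay": 0.0005,
    "edge_dropout": 0.00,
}

def make_configs(base, layer_choices, dropout_choices):
    """Return one config dict per (num_layers, dropout) combination (hypothetical helper)."""
    configs = []
    for num_layers, dropout in product(layer_choices, dropout_choices):
        cfg = dict(base)              # copy so the base config is left untouched
        cfg["num_layers"] = num_layers
        cfg["dropout"] = dropout
        configs.append(cfg)
    return configs

# Illustrative candidate values only.
candidates = make_configs(base_config,
                          layer_choices=[3, 5, 10],
                          dropout_choices=[0.1, 0.2, 0.5])
for cfg in candidates:
    print(cfg["num_layers"], cfg["dropout"])
```

Each resulting dict can be wrapped with `edict(...)` and trained in turn, keeping the validation accuracy of each run for comparison.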

In [None]:
from easydict import EasyDict as edict


# config = {
#     "model_name": "ResGAT",
#     "num_layers": 5,
#     "dropout": 0.1,
#     "learning_rate": 0.001,
#     "weight_decay": 0.0005,
#     "edge_dropout": 0.00,
# }


config = {
    "model_name": "DeeperGCN",
    "num_layers": 10,
    "learning_rate": 0.001,
    "weight_decay": 0.0005,
    "feat_drop":0.2,
    "attn_drop":0.2,
    "dropout":0.2,
    "edge_dropout": 0.00,
}

print(config)
# Save the configuration of this run; mode 'w' already truncates the file
with open('config.txt', 'w') as dicfile:
    for key, value in config.items():
        print('{}:{}'.format(key, value), file=dicfile)


config = edict(config)

{'model_name': 'DeeperGCN', 'num_layers': 3, 'learning_rate': 0.001, 'weight_decay': 0.0005, 'feat_drop': 0.2, 'attn_drop': 0.2, 'dropout': 0.2, 'edge_dropout': 0.0}


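## Data Loading Module

This part reads the dataset: it loads the graph data, builds the graph, and splits the labeled nodes into training and validation sets.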
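
To make the graph construction concrete, here is a minimal NumPy sketch of the two preprocessing steps the loader performs: adding inverse edges and self-loops, then computing the per-node factor indegree^(-1/2). The toy edge list is made up, and `np.bincount` stands in for `graph.indegree()` as an assumption about how in-degree is counted (by destination node).

```python
import numpy as np

def augment_edges(edges, num_nodes, self_loop=True, add_inverse_edge=True):
    """Add reversed edges and self-loops to an (E, 2) array of (src, dst) pairs."""
    edges = np.asarray(edges, dtype="int64")
    if add_inverse_edge:
        edges = np.vstack([edges, edges[:, ::-1]])   # append (dst, src) for every edge
    if self_loop:
        ids = np.arange(num_nodes)
        edges = np.vstack([edges, np.stack([ids, ids], axis=1)])
    return edges

def degree_norm(edges, num_nodes):
    """Per-node normalization indegree^(-1/2), with degree clipped to at least 1."""
    indegree = np.bincount(edges[:, 1], minlength=num_nodes)
    return np.power(np.maximum(indegree.astype("float32"), 1), -0.5)

# Toy graph with 3 nodes and 2 citation edges: 0 -> 1 and 1 -> 2.
toy_edges = np.array([[0, 1], [1, 2]])
full_edges = augment_edges(toy_edges, num_nodes=3)
norm = degree_norm(full_edges, num_nodes=3)
print(full_edges.shape)   # 2 original + 2 inverse + 3 self-loop edges
print(norm)
```

Clipping the degree to at least 1 avoids division by zero for isolated nodes, although with self-loops enabled every node already has in-degree of at least 1.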

In [None]:
from collections import namedtuple

Dataset = namedtuple("Dataset", 
               ["graph", "num_classes", "train_index",
                "train_label", "valid_index", "valid_label", "test_index"])

def load_edges(num_nodes, self_loop=True, add_inverse_edge=True):
    # Read the edges from the data file
    edges = pd.read_csv("work/edges.csv", header=None, names=["src", "dst"]).values

    if add_inverse_edge:
        edges = np.vstack([edges, edges[:, ::-1]])

    if self_loop:
        src = np.arange(0, num_nodes)
        dst = np.arange(0, num_nodes)
        self_loop = np.vstack([src, dst]).T
        edges = np.vstack([edges, self_loop])
    
    return edges

def load():
    # Read node features and edges from the data, and split the labeled nodes
    node_feat = np.load("work/feat.npy")
    num_nodes = node_feat.shape[0]
    edges = load_edges(num_nodes=num_nodes, self_loop=True, add_inverse_edge=True)
    graph = pgl.graph.Graph(num_nodes=num_nodes, edges=edges, node_feat={"feat": node_feat})
    
    indegree = graph.indegree()
    norm = np.maximum(indegree.astype("float32"), 1)
    norm = np.power(norm, -0.5)
    graph.node_feat["norm"] = np.expand_dims(norm, -1)
    
    df = pd.read_csv("work/train.csv")
    node_index = df["nid"].values
    node_label = df["label"].values
    train_part = int(len(node_index) * 0.8)
    train_index = node_index[:train_part]
    train_label = node_label[:train_part]
    valid_index = node_index[train_part:]
    valid_label = node_label[train_part:]
    test_index = pd.read_csv("work/test.csv")["nid"].values
    dataset = Dataset(graph=graph, 
                    train_label=train_label,
                    train_index=train_index,
                    valid_index=valid_index,
                    valid_label=valid_label,
                    test_index=test_index, num_classes=35)
    return dataset

In [None]:
dataset = load()

train_index = dataset.train_index
train_label = np.reshape(dataset.train_label, [-1 , 1])
train_index = np.expand_dims(train_index, -1)

val_index = dataset.valid_index
val_label = np.reshape(dataset.valid_label, [-1, 1])
val_index = np.expand_dims(val_index, -1)

test_index = dataset.test_index
test_index = np.expand_dims(test_index, -1)
test_label = np.zeros((len(test_index), 1), dtype="int64")
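
Note that `load()` above takes the first 80% of the rows of train.csv, in file order, as the training set. If the rows are ordered in some systematic way, the validation set may not be representative. A seeded shuffle before splitting is one possible alternative; the sketch below is not what this notebook does, and `shuffled_split` with its `seed` parameter is a hypothetical helper.

```python
import numpy as np

def shuffled_split(node_index, node_label, train_ratio=0.8, seed=0):
    """Shuffle labeled nodes with a fixed seed, then split into train and validation parts."""
    node_index = np.asarray(node_index)
    node_label = np.asarray(node_label)
    rng = np.random.RandomState(seed)            # fixed seed keeps the split reproducible
    perm = rng.permutation(len(node_index))
    train_part = int(len(node_index) * train_ratio)
    train_ids, valid_ids = perm[:train_part], perm[train_part:]
    return (node_index[train_ids], node_label[train_ids],
            node_index[valid_ids], node_label[valid_ids])

# Toy data: ten node ids, each with a made-up label.
toy_index = np.arange(100, 110)
toy_label = np.arange(10) % 3
tr_idx, tr_lab, va_idx, va_lab = shuffled_split(toy_index, toy_label)
print(len(tr_idx), len(va_idx))
```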


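## Network Building Module

This is the network building module. Models such as **GCN**, **GAT**, and **APPNP** are already provided. With simple configuration you can set the number of layers, hidden_size, and so on in model.py.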

In [None]:
import pgl
import model
import paddle.fluid as fluid
import numpy as np
import time
from build_model import build_model

#place = fluid.CPUPlace()
place = fluid.CUDAPlace(0)
train_program = fluid.default_main_program()
startup_program = fluid.default_startup_program()
with fluid.program_guard(train_program, startup_program):
    with fluid.unique_name.guard():
        gw, loss, acc, pred = build_model(dataset,
                            config=config,
                            phase="train",
                            main_prog=train_program)

test_program = fluid.Program()
with fluid.program_guard(test_program, startup_program):
    with fluid.unique_name.guard():
        _gw, v_loss, v_acc, v_pred = build_model(dataset,
            config=config,
            phase="test",
            main_prog=test_program)


test_program = test_program.clone(for_test=True)

exe = fluid.Executor(place)

## Training

The graph neural network is trained in full-batch mode: every training step runs over all training samples of the whole graph.
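
The idea of a full-batch step can be sketched in plain NumPy: representations are computed for every node of the graph at once, and only afterwards are the rows of the labeled nodes gathered to compute the loss, which mirrors how `node_index` and `node_label` are fed in the cells of this notebook. Since build_model.py is not shown here, the layer below is a generic GCN layer with symmetric normalization, used as an assumption rather than the model actually trained.

```python
import numpy as np

def gcn_layer(adj, norm, feat, weight):
    """One GCN propagation step: D^(-1/2) A D^(-1/2) X W, over all nodes at once."""
    h = feat @ weight
    h = h * norm[:, None]      # scale by source-side degree factor
    h = adj @ h                # sum messages from neighbours (and self-loops)
    return h * norm[:, None]   # scale by destination-side degree factor

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels, computed with a stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()

rng = np.random.RandomState(0)
num_nodes, feat_dim, num_classes = 4, 5, 3
# Toy symmetric adjacency matrix with self-loops on the diagonal.
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]], dtype="float32")
norm = np.power(adj.sum(axis=1), -0.5)
feat = rng.randn(num_nodes, feat_dim).astype("float32")
weight = rng.randn(feat_dim, num_classes).astype("float32")

logits_all = gcn_layer(adj, norm, feat, weight)   # logits for every node in the graph
train_index = np.array([0, 2])                    # labeled nodes used for the loss
train_label = np.array([1, 0])
loss = softmax_cross_entropy(logits_all[train_index], train_label)
print(logits_all.shape, float(loss))
```

Validation and test prediction reuse the same full-graph forward pass and differ only in which node ids are gathered.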



In [8]:
import os
epoch = 900


exe.run(startup_program)
#if os.path.isdir('./best_model/'):
#    fluid.load(train_program, os.path.join('./best_model/', 'model'), exe)

# Convert the graph data into a feed_dict to pass to the Paddle Executor
best_acc = 0.0
feed_dict = gw.to_feed(dataset.graph)

log_accuracy=np.array([0,0,0])


for epoch in range(epoch):
    # Full-batch training
    # Select which nodes of the graph to fetch
    # node_index: nids of the training nodes
    # node_label: labels of the training nodes
    feed_dict["node_index"] = np.array(train_index, dtype="int64")
    feed_dict["node_label"] = np.array(train_label, dtype="int64")
    
    train_loss, train_acc = exe.run(train_program,
                                feed=feed_dict,
                                fetch_list=[loss, acc],
                                return_numpy=True)

    # Full-batch validation
    # Select which nodes of the graph to fetch
    # node_index: nids of the validation nodes
    # node_label: labels of the validation nodes
    feed_dict["node_index"] = np.array(val_index, dtype="int64")
    feed_dict["node_label"] = np.array(val_label, dtype="int64")
    val_loss, val_acc = exe.run(test_program,
                            feed=feed_dict,
                            fetch_list=[v_loss, v_acc],
                            return_numpy=True)
    print("Epoch", epoch, "Train Acc", train_acc[0], "Train Loss", train_loss[0],"Valid Acc", val_acc[0],"Valid Loss", val_loss[0])

    log1=[epoch, train_acc[0],val_acc[0]]
    log_accuracy=np.vstack([log_accuracy,log1])

    if  val_acc[0]> best_acc:
        best_acc = val_acc[0]
        ckpt_dir =  os.path.join('./', 'best_model')
        print("Save model checkpoint to {}".format(ckpt_dir))
        if not os.path.isdir(ckpt_dir):
            os.makedirs(ckpt_dir)
        fluid.save(test_program, os.path.join(ckpt_dir, 'model'))

data_log = pd.DataFrame(log_accuracy)
data_log.to_csv('log_accuracy.csv')
    

Epoch 0 Train Acc 0.044137537 Train Loss 3.6809561 Valid Acc 0.02847583 Valid Loss 3.3700404
Save model checkpoint to ./best_model
Epoch 1 Train Acc 0.05300064 Train Loss 3.463572 Valid Acc 0.09254645 Valid Loss 3.216527
Save model checkpoint to ./best_model
Epoch 2 Train Acc 0.09147861 Train Loss 3.3158355 Valid Acc 0.15092191 Valid Loss 3.1268187
Save model checkpoint to ./best_model
Epoch 3 Train Acc 0.13374741 Train Loss 3.2282777 Valid Acc 0.15156262 Valid Loss 3.0670798
Save model checkpoint to ./best_model
Epoch 4 Train Acc 0.15474835 Train Loss 3.165578 Valid Acc 0.15341353 Valid Loss 3.0131493
Save model checkpoint to ./best_model
Epoch 5 Train Acc 0.16624546 Train Loss 3.1149883 Valid Acc 0.18352672 Valid Loss 2.9629219
Save model checkpoint to ./best_model
Epoch 6 Train Acc 0.18541326 Train Loss 3.063855 Valid Acc 0.24560404 Valid Loss 2.9145317
Save model checkpoint to ./best_model
Epoch 7 Train Acc 0.1988325 Train Loss 3.0178084 Valid Acc 0.32070905 Valid Loss 2.8673716
Sa

KeyboardInterrupt: 

In [None]:
#print(log_accuracy)

In [None]:

import matplotlib.pyplot as plt

def draw_result(lst_iter, val_acc, train_acc, title):
    plt.plot(lst_iter, val_acc, '-b', label='val_acc')
    plt.plot(lst_iter, train_acc, '-r', label='train_acc')

    plt.xlabel("n iteration")
    plt.legend(loc='upper left')
    plt.title(title)
    plt.savefig(title+".png")  # must be called before plt.show()

    plt.show()

draw_result(log_accuracy[:,0],log_accuracy[:,2],log_accuracy[:,1],'accuracy curve')



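## Predicting on the Test Set

After training, we predict on the test set. Because the test labels are unknown, we feed arbitrary placeholder labels. The run then yields the predictions for the test data.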


In [None]:
feed_dict["node_index"] = np.array(test_index, dtype="int64")
feed_dict["node_label"] = np.array(test_label, dtype="int64")  # dummy labels
if os.path.isdir('./best_model/'):
    fluid.load(test_program, os.path.join('./best_model/', 'model'), exe)
test_prediction = exe.run(test_program,
                            feed=feed_dict,
                            fetch_list=[v_pred],
                            return_numpy=True)[0]

## Generating the Submission File

As the last step, the submission file can be generated easily with pandas; then download submission.csv and submit it.

In [None]:
submission = pd.DataFrame(data={
                            "nid": test_index.reshape(-1),
                            "label": test_prediction.reshape(-1)
                        })
submission.to_csv("submission.csv", index=False)

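## Ensemble Learning: Simple Voting (Local)

The class predicted by the majority of the models is taken as the final prediction.

The script to run locally is vote.py.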
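
vote.py itself is not included in this notebook, so the following is only a minimal sketch of majority voting over the label predictions of several models. The function `majority_vote`, the toy predictions, and the tie-breaking rule (the smallest class id wins) are assumptions for illustration.

```python
import numpy as np

def majority_vote(predictions, num_classes):
    """Return, for each node, the class predicted by the most models.

    predictions: list of 1-D integer arrays, one per model, all of the same length.
    Ties are broken in favour of the smallest class id (an assumption of this sketch).
    """
    preds = np.stack(predictions, axis=0)                 # shape (num_models, num_nodes)
    num_nodes = preds.shape[1]
    counts = np.zeros((num_nodes, num_classes), dtype="int64")
    for model_pred in preds:
        counts[np.arange(num_nodes), model_pred] += 1     # one vote per model per node
    return counts.argmax(axis=1)

# Toy predictions of three models for four test nodes.
pred_a = np.array([0, 1, 2, 2])
pred_b = np.array([0, 1, 1, 2])
pred_c = np.array([1, 1, 2, 0])
voted = majority_vote([pred_a, pred_b, pred_c], num_classes=3)
print(voted)
```

In practice each prediction array would come from the `label` column of one model's submission file, for example read with `pd.read_csv(...)["label"].values`, and `num_classes` would be 35 as in the dataset loader.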