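# Lab Overview

This lab is built on the `naturalcc` toolkit, the first open-source toolkit in the industry for training deep models dedicated to code intelligence. It includes performance evaluations of existing code models and downstream tasks. The accompanying paper was published at ICSE, a top software engineering conference: [NaturalCC: An Open-Source Toolkit for Code Intelligence](https://xcodemind.github.io/papers/icse22_naturalcc_camera_submitted.pdf). The lab asks you to use `naturalcc` to reproduce `Typilus`, a classic method for the type inference task. Typilus was published in 2020 at PLDI, a top programming languages conference; see [Typilus: neural type hints](https://arxiv.org/pdf/2004.10657.pdf).

The lab has five parts: environment setup, data acquisition, model and hyperparameter configuration, model training and evaluation, and extended research.

Fully reproducing Typilus requires a compute node with at least `128GB` of RAM and at least `32GB` of GPU memory. If no such node is available, we recommend trimming the dataset and using [Google Colab](http://colab.research.google.com), which can provide 12GB of RAM and one Tesla T4 GPU with 16GB of memory.

To avoid package conflicts, we recommend managing the Python environment with [Anaconda](http://anaconda.org).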

# Environment Setup

## NaturalCC Environment

First, download the `naturalcc` toolkit from GitHub by running the command below.

In [1]:
!git clone https://github.com/CGCL-codes/naturalcc.git

Cloning into 'naturalcc'...
Updating files: 100% (1667/1667), done.


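Next, install the naturalcc environment locally. PyTorch must be installed beforehand; this command only needs to be run once.
We recommend using an Anaconda virtual environment so that the local Python installation is not affected. Anaconda is available [here](https://www.anaconda.com/).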

In [3]:
%conda install pytorch torchvision torchaudio cpuonly -c pytorch

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: c:\Users\zengw\Anaconda3\envs\typilus

  added / updated specs:
    - cpuonly
    - pytorch
    - torchaudio
    - torchvision


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    cpuonly-2.0                |                0           2 KB  pytorch
    freetype-2.12.1            |       ha860e81_0         490 KB
    libtiff-4.4.0              |       h8a3f274_1         832 KB
    libuv-1.40.0               |       he774522_0         255 KB
    mkl-service-2.4.0          |   py38h2bbff1b_0          51 KB
    mkl_fft-1.3.1              |   py38h277e83a_0         139 KB
    mkl_random-1.2.2           |   py38hf11a4ad_0         225 KB
    numpy-1.23.3               |   py38h3b20f71_0          11 KB
    numpy-base-1.23.3          |   py38h4

In [4]:
%conda install -c dglteam dgl

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: c:\Users\zengw\Anaconda3\envs\typilus

  added / updated specs:
    - dgl


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    dgl-0.9.1                  |           py38_0         2.5 MB  dglteam
    networkx-2.8.4             |   py38haa95532_0         2.2 MB
    scipy-1.9.3                |   py38he11b74f_0        18.0 MB
    tqdm-4.64.1                |   py38haa95532_0         143 KB
    ------------------------------------------------------------
                                           Total:        22.9 MB

The following NEW packages will be INSTALLED:

  dgl                dglteam/win-64::dgl-0.9.1-py38_0 None
  fftw               pkgs/main/win-64::fftw-3.3.9-h2bbff1b_1 None
  icc_rt             pkgs/main/win-64::icc_rt-2022.

Python 3.8 or later is required. Once the installation finishes, run the commands below.

In [1]:
%cd naturalcc
%pip install -r requirements.txt
%pip install --editable .
%cd ..

d:\workspace\GNN\type_prediction_lab\naturalcc
Collecting numba
  Using cached numba-0.56.4-cp38-cp38-win_amd64.whl (2.5 MB)
Collecting boto3
  Using cached boto3-1.26.5-py3-none-any.whl (132 kB)
Collecting filelock
  Using cached filelock-3.8.0-py3-none-any.whl (10 kB)
Collecting ruamel.yaml
  Using cached ruamel.yaml-0.17.21-py3-none-any.whl (109 kB)
Collecting pathos
  Using cached pathos-0.3.0-py3-none-any.whl (79 kB)
Collecting tree-sitter==0.19.0
  Using cached tree_sitter-0.19.0-cp38-cp38-win_amd64.whl
Collecting jsonlines
  Using cached jsonlines-3.1.0-py3-none-any.whl (8.6 kB)
Collecting dpu_utils
  Using cached dpu_utils-0.6.1-py2.py3-none-any.whl (73 kB)
Collecting rouge
  Using cached rouge-1.0.1-py3-none-any.whl (13 kB)
Collecting nltk
  Using cached nltk-3.7-py3-none-any.whl (1.5 MB)
Collecting jsbeautifier
  Using cached jsbeautifier-1.14.7-py3-none-any.whl
Collecting loguru
  Using cached loguru-0.6.0-py3-none-any.whl (58 kB)
Collecting gpustat
  Using cached gpustat-1.

ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'c:\\Users\\zengw\\Anaconda3\\envs\\typilus\\Lib\\site-packages\\win32\\_win32sysloader.pyd'
Consider using the `--user` option or check the permissions.



Obtaining file:///D:/workspace/GNN/type_prediction_lab/naturalcc
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Installing collected packages: ncc
  Attempting uninstall: ncc
    Found existing installation: ncc 0.6
    Uninstalling ncc-0.6:
      Successfully uninstalled ncc-0.6
  Running setup.py develop for ncc
Successfully installed ncc-0.6
Note: you may need to restart the kernel to use updated packages.
d:\workspace\GNN\type_prediction_lab
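
After the install cell finishes, a quick sanity check can confirm that the interpreter meets the 3.8 minimum and that the `ncc` distribution (the name shown in the pip output above) is visible to Python. The helper name `check_environment` is hypothetical and not part of naturalcc; this is only a minimal sketch using the standard library.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 8)  # minimum Python version stated in this lab


def check_environment(dist_name="ncc"):
    """Return (python_ok, installed_version_or_None) for a quick sanity check."""
    python_ok = sys.version_info >= MIN_PYTHON
    try:
        # Looks up the installed distribution by the name pip reported.
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    return python_ok, version


python_ok, ncc_version = check_environment()
print("Python >= 3.8:", python_ok)
print("ncc version:", ncc_version or "not installed")
```

If the second line prints "not installed", the editable install probably did not complete in the environment the kernel is using.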


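### [GPU Training] Points to Note

Training depends on the [PyTorch library](https://pytorch.org) and the [DGL library](https://www.dgl.ai). If you train on a GPU with CUDA, make sure the installed library builds match the local CUDA version; for example, CUDA 11.0+ requires `torch+cu110` and `dgl-cu110`. If you train on the CPU only, the CPU builds of `torch` and `dgl` are enough.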
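
The naming pattern in the example above (`cu110` for CUDA 11.0) can be sketched as a small helper. The function `cuda_build_tag` is hypothetical and only illustrates how the suffix is formed; which builds actually exist for a given CUDA version must still be checked on the PyTorch and DGL installation pages.

```python
def cuda_build_tag(cuda_version):
    """Map a CUDA version string such as '11.0' to a build tag such as 'cu110'.

    Returns 'cpu' when no CUDA version is given (CPU-only training).
    """
    if not cuda_version:
        return "cpu"
    parts = cuda_version.split(".")
    major = parts[0]
    minor = parts[1] if len(parts) > 1 else "0"  # treat '11' as '11.0'
    return f"cu{major}{minor}"


print(cuda_build_tag("11.0"))  # cu110
print(cuda_build_tag(None))    # cpu
```

If PyTorch is already installed, `torch.version.cuda` reports the CUDA version that build was compiled against (`None` for CPU builds), so its value can be passed to the helper and compared with the tag of the installed `dgl` package.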

Before training on a GPU, use `gpustat` or `nvidia-smi` to check that the GPU is installed correctly, as shown below.

In [2]:
!gpustat

'gpustat' is not recognized as an internal or external command,
operable program or batch file.
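
The message above means the `gpustat` command is not on the `PATH` of this machine. A minimal sketch for finding out which of the two tools is available before calling it is shown here; the helper name `find_gpu_tool` is hypothetical.

```python
import shutil


def find_gpu_tool(candidates=("gpustat", "nvidia-smi")):
    """Return (name, path) of the first GPU monitoring tool found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)  # None when the executable is not on PATH
        if path:
            return name, path
    return None


found = find_gpu_tool()
if found:
    print(f"Use {found[0]} at {found[1]}")
else:
    print("Neither gpustat nor nvidia-smi is on PATH")
```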


## Typilus Environment

Building the Typilus dataset requires `docker`. After installing `docker` on your machine, fetch the `typilus` source code.

In [3]:
!git clone https://github.com/typilus/typilus.git

Cloning into 'typilus'...


Then build the Docker image in the `typilus/src/data_preparation` directory.

In [4]:
%cd typilus/src/data_preparation/
# Docker must be installed first
!docker build -t typilus-env .

d:\workspace\GNN\type_prediction_lab\typilus\src\data_preparation


#1 [internal] load build definition from Dockerfile
#1 sha256:cb3a441730ca699bf1a565166d8f2f920d3d482d0cbe74aff70be34db3da2d74
#1 transferring dockerfile: 1.09kB done
#1 DONE 0.1s

#2 [internal] load .dockerignore
#2 sha256:fc567aa8afcf3099568ab1236701b9515816739f7ce42f8431f7b573d1157e65
#2 transferring context: 2B done
#2 DONE 0.1s

#3 [internal] load metadata for docker.io/library/ubuntu:18.04
#3 sha256:ae46bbb1b755529d0da663ca0256a22acd7c9fe21844946c149800baa67c4e4b
#3 ...

#4 [auth] library/ubuntu:pull token for registry-1.docker.io
#4 sha256:88a605f84635f6476420836a78765b23b0ac84786289e608a7825877cc6b9767
#4 DONE 0.0s

#3 [internal] load metadata for docker.io/library/ubuntu:18.04
#3 sha256:ae46bbb1b755529d0da663ca0256a22acd7c9fe21844946c149800baa67c4e4b
#3 DONE 5.5s

#13 [internal] load build context
#13 sha256:4fb7345baccb02faf63b4cb4f428093a570730c48a92f0c0dd73b3eb297c95f3
#13 transferring context: 228.11kB done
#13 DONE 0.1s

#5 [ 1/16] FROM docker.io/library/ubuntu:18.04@sha2

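# Data Acquisition

## Building the Typilus Graph

Before training can start, we need to obtain the training data. We first use the data processing scripts bundled with `Typilus` to convert Python data into Typilus Graphs, and then import the graphs into `naturalcc` for training and prediction.

Run the Docker image built earlier and set the location where the Typilus training data will be saved.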

In [7]:
# On Windows, wrap the volume argument in double quotes
!docker run --rm -it -v "$(pwd)/data:/usr/data" typilus-env:latest bash

the input device is not a TTY.  If you are using mintty, try prefixing the command with 'winpty'


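In the Docker shell, enter the following command to build the dataset:

`bash scripts/prepare_data.sh metadata/typedRepos.txt`

This step runs into problems on Windows.

*Note: this command may run for several days. For long execution times and infinite loops, see [here](https://github.com/typilus/typilus/issues/1).*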
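
The "not a TTY" message above appears because a notebook cell has no terminal to attach to, which the `-it` flags require. One possible workaround, sketched below, is to drop `-it` and pass the preparation script directly as the container command. This assumes the image's default working directory is the same one the interactive shell starts in, so that `scripts/prepare_data.sh` resolves; the helper name `build_prepare_command` is hypothetical.

```python
import shlex
from pathlib import Path


def build_prepare_command(data_dir, image="typilus-env:latest"):
    """Build a non-interactive `docker run` command that runs the data preparation script."""
    host_dir = Path(data_dir).resolve()  # Docker volume mounts need an absolute host path
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:/usr/data",
        image,
        "bash", "scripts/prepare_data.sh", "metadata/typedRepos.txt",
    ]


cmd = build_prepare_command("data")
print(shlex.join(cmd))
# To launch it from Python: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting differences between Windows and Linux.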

The resulting data is saved in `xxx/yyy/zzz/graph-dataset-split`; inspect it with the `tree` command.

In [8]:
!tree /mnt/gold/bizq/Typilus_data/graph-dataset-split

/mnt/gold/bizq/Typilus_data/graph-dataset-split
├── test
│   ├── graph-000.jsonl.gz
│   ├── graph-001.jsonl.gz
│   ├── graph-002.jsonl.gz
│   ├── graph-003.jsonl.gz
│   ├── graph-004.jsonl.gz
│   ├── graph-005.jsonl.gz
│   ├── graph-006.jsonl.gz
│   ├── graph-007.jsonl.gz
│   ├── graph-008.jsonl.gz
│   ├── graph-009.jsonl.gz
│   ├── graph-010.jsonl.gz
│   ├── graph-011.jsonl.gz
│   ├── graph-012.jsonl.gz
│   ├── graph-013.jsonl.gz
│   ├── graph-014.jsonl.gz
│   ├── graph-015.jsonl.gz
│   ├── graph-016.jsonl.gz
│   ├── graph-017.jsonl.gz
│   ├── graph-018.jsonl.gz
│   ├── graph-019.jsonl.gz
│   ├── graph-020.jsonl.gz
│   ├── graph-021.jsonl.gz
│   ├── graph-022.jsonl.gz
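
For machines below the recommended memory, one simple way to trim the dataset is to keep only a few shards per split. The sketch below assumes the layout shown in the listing (one subdirectory per split, each holding `*.jsonl.gz` shards); the function name `trim_dataset` and the destination path in the usage comment are hypothetical.

```python
import shutil
from pathlib import Path


def trim_dataset(src_root, dst_root, shards_per_split=2):
    """Copy the first few shards of every split into a smaller dataset directory."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    copied = {}
    for split_dir in sorted(p for p in src_root.iterdir() if p.is_dir()):
        # Shards are sorted by name, so the same subset is chosen on every run.
        shards = sorted(split_dir.glob("*.jsonl.gz"))[:shards_per_split]
        target = dst_root / split_dir.name
        target.mkdir(parents=True, exist_ok=True)
        for shard in shards:
            shutil.copy2(shard, target / shard.name)
        copied[split_dir.name] = [shard.name for shard in shards]
    return copied


# Example call (paths are placeholders):
# trim_dataset("graph-dataset-split", "graph-dataset-split-small", shards_per_split=2)
```

Taking the first shards keeps the sketch deterministic; sampling shards at random would give a more representative subset at the cost of reproducibility.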

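## NaturalCC Data Processing

Next, import the generated Typilus Graphs into NaturalCC.

Data processing has two stages: the raw data is first converted into naturalcc's unified format, and then binarized so that it is suitable for model training.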

In [8]:
import ncc_dataset
ncc_dataset.prepare_dataset('typilus', typilus_path="/mnt/gold/bizq/Typilus_data")

ModuleNotFoundError: No module named 'zenodo_client'

In [5]:
ncc_dataset.binarize_dataset('typilus')

NaturalCC dataset and cache path: '/mnt/gold/bizq/ncc_data'
Using backend: pytorch
[2022-10-16 09:30:49]    INFO >> Namespace(yaml_file='typilus') (preprocess.py:418, cli_main())
[2022-10-16 09:30:49]    INFO >> Load arguments in /home/dell/Code/jupyter/naturalcc/ncc_dataset/typilus/preprocess/typilus.yml (preprocess.py:420, cli_main())
[2022-10-16 09:30:49]    INFO >> {'preprocess': {'task': 'typilus', 'langs': ['nodes', 'edges', 'supernodes.annotation'], 'trainpref': '/mnt/gold/bizq/ncc_data/typilus/attributes/train', 'validpref': '/mnt/gold/bizq/ncc_data/typilus/attributes/valid', 'testpref': '/mnt/gold/bizq/ncc_data/typilus/attributes/test', 'dataset_impl': 'mmap', 'destdir': '/mnt/gold/bizq/ncc_data/typilus/type_inference/data-mmap', 'only_train': 1, 'edge_backward': 1, 'thresholds': [5, 5, 5], 'dicts': [None, None, None], 'nwords': [9999, 99, 99], 'padding_factor': 1, 'workers': 40}} (preprocess.py:422, cli_main())
[2022-10-16 09:30:49]    INFO >> 
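
The logged configuration lists `thresholds: [5, 5, 5]` and `nwords: [9999, 99, 99]` for the three fields. A plausible reading, which is an assumption here and not confirmed against the naturalcc source, is that each dictionary keeps tokens seen at least `threshold` times, capped at `nwords` entries, and that binarization then maps tokens to integer ids. The sketch below illustrates that idea with hypothetical helpers; it is not naturalcc's implementation.

```python
from collections import Counter


def build_vocab(tokens, threshold=5, nwords=9999, specials=("<pad>", "<unk>")):
    """Keep tokens with count >= threshold, most frequent first, at most nwords of them."""
    counts = Counter(tokens)
    kept = [tok for tok, count in counts.most_common() if count >= threshold][:nwords]
    # Special tokens come first and are not counted toward nwords in this sketch.
    return {tok: idx for idx, tok in enumerate(list(specials) + kept)}


def binarize(tokens, vocab, unk="<unk>"):
    """Replace each token by its id, falling back to the unknown-token id."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]


corpus = ["int"] * 6 + ["str"] * 5 + ["MyRareType"] * 2
vocab = build_vocab(corpus, threshold=5, nwords=99)
print(vocab)  # {'<pad>': 0, '<unk>': 1, 'int': 2, 'str': 3}
print(binarize(["str", "MyRareType", "int"], vocab))  # [3, 1, 2]
```

Under this reading, rare annotations such as `MyRareType` collapse into the unknown id, which is worth keeping in mind when interpreting accuracy on infrequent types.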

# Model Training

The training code is already provided, so training the model only requires running it.

In [2]:
!python naturalcc/run/type_prediction/typilus/train.py

Using backend: pytorch
[2022-10-17 15:35:01]    INFO >> Load arguments in naturalcc/run/type_prediction/typilus/config/typilus.yml (train.py:295, cli_main())
[2022-10-17 15:35:01]    INFO >> {'criterion': 'typilus', 'optimizer': 'torch_adam', 'lr_scheduler': 'fixed', 'tokenizer': None, 'bpe': None, 'common': {'no_progress_bar': 0, 'log_interval': 50, 'log_format': 'simple', 'tensorboard_logdir': '', 'memory_efficient_fp16': 1, 'fp16_no_flatten_grads': 1, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'empty_cache_freq': 0, 'task': 'typilus', 'seed': 1, 'cpu': 0, 'fp16': 0, 'fp16_opt_level': '01', 'server_ip': '', 'server_port': '', 'bf16': 0}, 'dataset': {'num_workers': 0, 'skip_invalid_size_inputs_valid_test': 1, 'max_tokens': None, 'max_sentences': 32, 'required_batch_size_multiple': 8, 'dataset_impl': 'mmap', 'train_subset': 'train', 'valid_subset': 'valid', 'validate_interval': 1,

In [15]:
!gpustat

dell-gpu             Sun Oct 16 14:47:50 2022  460.56
[0] GeForce RTX 3090 | 34'C,   0 % |     0 / 24268 MB |
[1] GeForce RTX 3090 | 39'C,   0 % |     0 / 24268 MB |


# Model Evaluation

**Challenge 1**. We have inserted model evaluation code into the training code. Try using it to evaluate the model's accuracy!
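
As a reference point for what "accuracy" can mean here, the sketch below computes exact-match accuracy between predicted and annotated type strings. It is an illustration only, not the evaluation code shipped with the training script; the function name and sample data are made up.

```python
def exact_match_accuracy(predicted, gold):
    """Fraction of positions where the predicted type string equals the annotated one."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


preds = ["int", "str", "List[int]", "bool"]
golds = ["int", "str", "List[str]", "bool"]
print(exact_match_accuracy(preds, golds))  # 0.75
```

Exact match is strict: `List[int]` versus `List[str]` counts as a miss even though the outer type agrees, so looser metrics may tell a different story.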

# Extended Research

**Challenge 2**. Tune the hyperparameters of the Typilus model (in `config/typilus.yml`) and try to improve the model's training accuracy.
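
When trying several hyperparameter settings, it helps to derive each trial from an unchanged base configuration. The sketch below works on a plain nested dictionary shaped like the configuration printed in the training log (`dataset.max_sentences` is 32 there); loading and saving the YAML file itself would need a YAML library and is left out. The helper name `override` is hypothetical.

```python
import copy


def override(config, dotted_key, value):
    """Return a copy of config with the nested key (e.g. 'dataset.max_sentences') replaced."""
    updated = copy.deepcopy(config)  # leave the base configuration untouched
    node = updated
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node[key]
    if leaf not in node:
        # Guard against typos silently adding a new, unused key.
        raise KeyError(f"unknown config key: {dotted_key}")
    node[leaf] = value
    return updated


base = {"optimizer": "torch_adam", "dataset": {"max_sentences": 32, "num_workers": 0}}
trial = override(base, "dataset.max_sentences", 16)
print(trial["dataset"]["max_sentences"], base["dataset"]["max_sentences"])  # 16 32
```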

**Challenge 3**. Using the other code in `naturalcc`, compare the predictive ability of the `lstm`, `transformer`, and `typilus` models on the same dataset.

**Challenge 4**. Modify the naturalcc code to support models such as [LAMBDANET](https://arxiv.org/pdf/2005.02161.pdf), [Type4Py](https://arxiv.org/pdf/2101.04470.pdf), and [Plato](https://arxiv.org/pdf/2107.00157.pdf).