```bash
chmod -R 777 DACroDomainFSSA
```
Basic
- pytorch==1.7.1
- transformers==3.0.2

For graph training
- stanza
- torch_scatter==2.0.5
- torch_sparse==0.6.8
- torch_cluster==1.5.8
- torch_geometric==1.6.3
- scipy==2.3.4
- gensim==3.8.1

For post-bert training
- fairseq (the Post-Bert extension described below builds on it)
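A minimal environment setup might look like the sketch below. The wheel index URL for the `torch_scatter`/`torch_sparse`/`torch_cluster` packages is an assumption; pick the one matching your own PyTorch/CUDA build.

```bash
# Sketch: install the requirements listed above.
pip install torch==1.7.1 transformers==3.0.2
# Note: the scipy pin above (2.3.4) does not match any released scipy;
# install unpinned and adjust if needed.
pip install stanza scipy gensim==3.8.1
# Assumption: wheel index for PyTorch-1.7/CUDA-10.1 builds; change the URL
# to match your own PyTorch/CUDA combination.
pip install torch_scatter==2.0.5 torch_sparse==0.6.8 torch_cluster==1.5.8 \
    -f https://pytorch-geometric.com/whl/torch-1.7.0+cu101.html
pip install torch_geometric==1.6.3
```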
- Download raw data
  Download the raw data from here, then put it in `data/domain_data/init_data`.
- Transfer raw data to JSON format
  Run `data/data_process_utils/data_process.sh`. You will get 8 JSON files in `data/domain_data/processed_data`.
  Note: this script includes the concept extractor, so it will take some time to run.
- Link to ConceptNet
  Run the Python file `data/data_process_utils/data_linkConceptNet`. This adds the concept triplets for each review under a new key, `conceptNetTriples`, in the JSON files produced by the previous step.
  Note: there are many concepts to link and ConceptNet rate-limits requests, so this will take about five hours. (The sketch after this list shows the full pipeline and the kind of API call involved.)
- Format data for graph training
  Run the Python file `data/data_process_utils/data_genrate_fomat_triple`. You will get a `conceptnet_english_ours.txt`, which is needed for domain graph construction.
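Putting the steps together, the pre-processing phase might be run as below. This is a minimal sketch: the script paths come from the steps above, the `.py` extensions are assumed, and the `curl` call only illustrates the kind of ConceptNet query the linking step performs (the exact endpoint used by `data_linkConceptNet` is an assumption).

```bash
# Sketch of the full pre-processing pipeline (paths from the steps above;
# .py extensions are assumed).
bash data/data_process_utils/data_process.sh                 # raw data -> 8 JSON files
python data/data_process_utils/data_linkConceptNet.py        # adds conceptNetTriples (~5 hours)
python data/data_process_utils/data_genrate_fomat_triple.py  # writes conceptnet_english_ours.txt

# For illustration only: the kind of query the linking step issues against
# the public ConceptNet API, which is rate-limited (hence the long runtime).
curl -s "http://api.conceptnet.io/query?start=/c/en/book&limit=10"
```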
There are two encoders to train: a graph encoder and a BERT encoder.
- This is based on the GitHub project Kingdom.
- Simply run `extension/Graph-Embedding/preprocess_graph.py` to aggregate a domain graph for each domain. For example, aggregate a domain graph for the books domain by running `extension/Graph-Embedding/preprocess_graph.py` with the parameter `--domain books`.
- Run `extension/Graph-Embedding/train_graph_model.py` with the parameter `--domain books` (or another domain name in `{"books", "dvd", "electronics", "kitchen"}`). The trained GCN model weights for the domain are placed in `extension/Graph-Embedding/weight`.
- Add graph feature data under the key `graphFeature` to each review's JSON (generated in the Data Pre-Processing phase) by running `extension/Graph-Embedding/add_grap_feature.py` with `--domain books` (or another domain in `{"books", "dvd", "electronics", "kitchen"}`). The sketch after this list loops all three steps over every domain.
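For convenience, the three steps above can be looped over all four domains. This is a sketch under the assumption that each script needs only the `--domain` flag shown above:

```bash
# Sketch: build, train, and apply the domain graph for every domain.
# Assumes each script takes only the --domain flag described above.
for d in books dvd electronics kitchen; do
    python extension/Graph-Embedding/preprocess_graph.py  --domain "$d"  # aggregate the domain graph
    python extension/Graph-Embedding/train_graph_model.py --domain "$d"  # GCN weights -> extension/Graph-Embedding/weight
    python extension/Graph-Embedding/add_grap_feature.py  --domain "$d"  # adds the graphFeature key to review JSON
done
```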
- This extension is based on fairseq. You can follow the fairseq example to pre-train BERT. Simply run `extension/Post-Bert/data/data_process_utils/data_process.sh` to generate the fairseq data format, then run `extension/Post-Bert/data/data_process_utils/precess2databin.sh`.
- Run `extension/Post-Bert/data/data_process_utils/MASK_training.sh` to get a post-BERT checkpoint, which will be located in `extension/Post-Bert/checkpoints`. Remember that you need to switch domain pairs in `{book2dvd, book2kitchen, book2electronis, dvd2electronis, dvd2kitchen, electronis2kitchen}` (see the sketch after this list). The checkpoint is loaded when you run DAProNetModel.
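Producing a checkpoint for every domain pair might look like the sketch below. The loop assumes `MASK_training.sh` accepts the pair name as its first argument; if the script hardcodes the pair instead, edit it between runs.

```bash
# Sketch: one post-BERT checkpoint per domain pair.
bash extension/Post-Bert/data/data_process_utils/data_process.sh     # -> fairseq data format
bash extension/Post-Bert/data/data_process_utils/precess2databin.sh  # -> binarized data
# Assumption: the pair name is taken as the first argument; if MASK_training.sh
# hardcodes the pair, edit the script between runs instead.
for pair in book2dvd book2kitchen book2electronis dvd2electronis dvd2kitchen electronis2kitchen; do
    bash extension/Post-Bert/data/data_process_utils/MASK_training.sh "$pair"
done
# Checkpoints land in extension/Post-Bert/checkpoints and are loaded by DAProNetModel.
```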
You can run `train_demo.py` to get started. As our paper describes, the model encodes a review into a latent code, which is the basis for calculating the prototype (see the formulation below).
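For reference, in the standard prototypical-network formulation (a textbook sketch; the paper defines the exact variant used here), the prototype of class $k$ is the mean of the encoded support reviews, and a query review $q$ is classified by its distance to each prototype:

$$c_k = \frac{1}{|S_k|} \sum_{x \in S_k} f_\theta(x), \qquad p(y = k \mid q) = \frac{\exp\big(-d(f_\theta(q), c_k)\big)}{\sum_{k'} \exp\big(-d(f_\theta(q), c_{k'})\big)}$$

where $S_k$ is the support set of class $k$ (of size `K`, per the `--K` flag), $f_\theta$ is the selected encoder, and $d$ is a distance function.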
- `--encoder` selects the sentence encoder, one of `{cnn, bert, roberta, bert_newGraph, roberta_newGraph, graph}`.
- `--pretrain_ckpt` loads a checkpoint for the BERT base model. The post-BERT checkpoints trained in the Prepare Encoder phase can be passed in here; if it is not given, the basic model is used, meaning the post-training strategy is dropped.
- `--is_old_graph_feature 1` means the graph feature is the same as in Kingdom. Before using it, you need to download `graph_features` from Kingdom and put it in `data/`. Note that it only works when `--encoder` is in `{bert_newGraph, roberta_newGraph, graph}`.
- `--train_iter`, `--val_iter`, and `--test_iter` set how many iterations to run. A template invocation follows this list; concrete examples are below.
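A generic invocation combining the flags above might look like this (all values are placeholders; see the concrete examples below):

```bash
# Template only: flag values are placeholders drawn from the examples below.
# --encoder: one of {cnn, bert, roberta, bert_newGraph, roberta_newGraph, graph}
# --pretrain_ckpt: optional post-BERT checkpoint from the Prepare Encoder phase
CUDA_VISIBLE_DEVICES=0 python train_demo.py \
    --K 5 --Q 5 --encoder bert --hidden_size 768 \
    --train book_reviews --val dvd_reviews --test dvd_reviews \
    --train_iter 10000 --val_iter 1000 --test_iter 1000 \
    --pretrain_ckpt ${checkpoint}
```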
- Proto(CNN)
  ```bash
  CUDA_VISIBLE_DEVICES=5 python train_demo.py --K 5 --Q 5 --encoder cnn --train book_reviews --val dvd_reviews --test dvd_reviews
  ```
- Proto(BERT)
  ```bash
  CUDA_VISIBLE_DEVICES=0,1,2 python train_demo.py --K 5 --Q 2 --encoder bert --hidden_size 768 --train book_reviews --val dvd_reviews --test dvd_reviews --batch_size 2
  ```
- Proto(Post-BERT)
  ```bash
  CUDA_VISIBLE_DEVICES=3,1,2 python train_demo.py --K 5 --Q 2 --encoder roberta --hidden_size 768 --train book_reviews --val dvd_reviews --test dvd_reviews --batch_size 2 --pretrain_ckpt ${checkpoint}
  ```
- Proto(Post-BERT, New-Graph)
  ```bash
  CUDA_VISIBLE_DEVICES=4 python train_demo.py --K 5 --Q 2 --encoder roberta_newGraph --hidden_size 768 --train book_reviews --val dvd_reviews --test dvd_reviews --batch_size 1 --pretrain_ckpt /home/cike/project/DACroDomainFSSA/extension/Post-Bert/checkpoints/MASK_book2dvd/checkpoint200_d_b_92.6.pt --train_iter 10000 --grad_iter 16
  ```
- Proto(Post-BERT, Old-Graph)
  ```bash
  CUDA_VISIBLE_DEVICES=4 python train_demo.py --K 5 --Q 2 --encoder roberta_newGraph --hidden_size 768 --train book_reviews --val dvd_reviews --test dvd_reviews --batch_size 1 --pretrain_ckpt /home/cike/project/DACroDomainFSSA/extension/Post-Bert/checkpoints/MASK_book2dvd/checkpoint200_d_b_92.6.pt --train_iter 10000 --grad_iter 16 --is_old_graph_feature 1
  ```
- Proto(Graph)
  ```bash
  CUDA_VISIBLE_DEVICES=0,1,2 python train_demo.py --K 5 --Q 2 --encoder graph --hidden_size 768 --train book_reviews --val dvd_reviews --test dvd_reviews --batch_size 2
  ```
- Split experiment design