# Improved GTS model and application on Amazon Review Data

Information extraction from text is extremely useful for decision-making over large datasets such as product reviews. Aspect-oriented Fine-grained Opinion Extraction (AFOE) aims to automatically extract opinion pairs (aspect term, opinion term) or opinion triplets (aspect term, opinion term, sentiment) from review text. While pipeline approaches suffer from error propagation and are inconvenient in real-world scenarios, combining a pre-trained encoder with a Grid Tagging decoder unifies the task into a single, generalized model. The GTS model was first proposed in [Grid Tagging Scheme for Aspect-oriented Fine-grained Opinion Extraction](https://arxiv.org/pdf/2010.04640.pdf) (Wu et al., Findings of EMNLP 2020) [1]. To further reduce the GTS model's error on triplet extraction, we performed error analysis and experimented with different encoders and data augmentation techniques, which improved the F1 score by 6 points.


## Data

### Lap 14

The target-opinion-sentiment triplet dataset comes entirely from the alignments released with the TOWE paper [2]. It contains English sentences extracted from laptop customer reviews. Human annotators tagged the aspect terms (SB1) and their polarities (SB2); 899 sentences with 1,264 triplets were used for training and 332 sentences for testing (evaluation).

### Amazon Data

The Amazon review dataset is an unlabeled dataset of product reviews and metadata from Amazon, comprising 233.1 million reviews spanning May 1996 to October 2018. It includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also-viewed/also-bought graphs). We selected 5,000 sentences from the "Electronics" subcategory for data augmentation and further application.
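
As a rough illustration of how sentences can be pulled from this corpus, here is a minimal sketch that streams the gzipped JSON-lines review file; the filename and the `reviewText` field name are assumptions based on the public release of this dataset, not code from this repo:

```python
import gzip
import json

PATH = "Electronics_5.json.gz"  # hypothetical filename for the Electronics subset

sentences = []
with gzip.open(PATH, "rt", encoding="utf-8") as f:
    for line in f:
        review = json.loads(line)
        text = review.get("reviewText", "")  # field name assumed
        # Roughly split each review into sentences and keep non-trivial ones.
        for sent in text.split(". "):
            if len(sent.split()) > 3:
                sentences.append(sent.strip())
        if len(sentences) >= 5000:
            break

print(f"collected {len(sentences)} sentences")
```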

### Data Augmentation Files

There are three Python files in the `data/Data Augmentation` folder, one for each of our three data augmentation methods.

For **Self-training**, we used the Amazon dataset, generated pseudo labels for it, and wrote the output in JSON format.
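
A minimal sketch of the self-training idea, assuming a trained GTS model wrapped in a hypothetical `predict_triplets(sentence)` helper (the actual entry point lives in this repo's inference code):

```python
import json

def pseudo_label(sentences, model, out_path="amazon_pseudo.json"):
    """Label unlabeled Amazon sentences with a trained model and dump
    the predictions as pseudo-labeled training data (self-training)."""
    records = []
    for sent in sentences:
        triplets = model.predict_triplets(sent)  # hypothetical helper
        if triplets:  # keep only sentences where the model finds triplets
            records.append({"sentence": sent, "triples": triplets})
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
```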

For **Synonym Replacement**, we used the `nlpaug` package to randomly replace some words in each sentence with their synonyms.
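
For example, with `nlpaug`'s WordNet-based word augmenter (this is standard `nlpaug` usage, not necessarily the exact configuration in our script):

```python
import nlpaug.augmenter.word as naw

# WordNet-based synonym replacement; requires NLTK's wordnet corpus.
aug = naw.SynonymAug(aug_src="wordnet", aug_p=0.3)  # replace ~30% of words

text = "The battery life of this laptop is amazing."
print(aug.augment(text))  # recent nlpaug versions return a list of strings
```

Note that words inside labeled aspect or opinion spans must stay intact for the labels to remain valid; `nlpaug`'s `stopwords` argument can be used to protect them from replacement.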

For **Sentence Concatenation**, we concatenated pairs of sentences with opposite sentiments (their labels are concatenated as well).
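
A minimal sketch of this concatenation, assuming each example is a dict with `sentence` and `triples` fields whose spans are word-index based (these field names are assumptions, not the repo's actual schema):

```python
def concat_examples(pos_ex, neg_ex):
    """Join a positive and a negative example into one sentence; the
    second example's span indices are shifted past the first sentence."""
    offset = len(pos_ex["sentence"].split())
    shifted = [
        {**t,
         "aspect_span": [i + offset for i in t["aspect_span"]],
         "opinion_span": [i + offset for i in t["opinion_span"]]}
        for t in neg_ex["triples"]
    ]
    return {
        "sentence": pos_ex["sentence"] + " " + neg_ex["sentence"],
        "triples": pos_ex["triples"] + shifted,
    }
```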

The Self-training script is designed for the format of the Amazon dataset; the Synonym Replacement and Sentence Concatenation scripts are designed for the lap14 dataset.

## Requirements

See `requirement.txt` or the `Pipfile` for details:

- pytorch==1.7.1
- transformers==3.4.0
- python==3.6

## Usage

### Training

The training process is included in the notebook `code/BertModel/Train Project Model.ipynb`. Equivalently, from the command line:

```
python main.py --task triplet --mode train --dataset lap14_concat_syn --bert_model_path bert-base-uncased --bert_tokenizer_path bert-base-uncased --batch_size 8
```

#### Arguments

- `dataset`
  - `lap14`: laptop reviews
  - `res14`: restaurant reviews
  - `res15`: restaurant reviews
  - `res16`: restaurant reviews
  - `neg_pos_concat`: lap14 data with positive and negative sentences concatenated
  - `lap14_concat_syn`: lap14 data with positive and negative sentences concatenated, plus synonym replacement
  - `amazon_lap14_full_synonym`: lap14 data with positive and negative sentences concatenated and synonym replacement, plus pseudo labels generated from Amazon data
- `bert_model_path` and `bert_tokenizer_path`: the encoder and its corresponding tokenizer
  - `roberta-base`: RoBERTa
  - `bert-base-uncased`: BERT
  - `vinai/bertweet-base`: BERTweet

The best model will be saved in the folder `savemodel/`.
### Error Analysis

The error analysis code is in `code/BertModel/Error_Analysis.py`.

## Results

| Model | Dataset | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| GTS-BERT (baseline) | lap14 | 57.52 | 51.92 | 54.58 |
| GTS-BERTweet | lap14 | 57.66 | 57.98 | 57.82 |
| GTS-BERT + Augmented Data | lap14 | 62.12 | 53.95 | 57.75 |
| GTS-RoBERTa | lap14 | 59.51 | 62.57 | 61.00 |
| GTS-RoBERTa + Augmented Data | lap14 | 61.06 | 59.27 | 61.15 |

## References

[1] Zhen Wu, Chengcan Ying, Fei Zhao, Zhifang Fan, Xinyu Dai, Rui Xia. [Grid Tagging Scheme for Aspect-oriented Fine-grained Opinion Extraction](https://arxiv.org/pdf/2010.04640.pdf). In Findings of EMNLP, 2020.

[2] Zhifang Fan, Zhen Wu, Xin-Yu Dai, Shujian Huang, Jiajun Chen. [Target-oriented Opinion Words Extraction with Target-fused Neural Sequence Labeling](https://www.aclweb.org/anthology/N19-1259.pdf). In Proceedings of NAACL, 2019.
