# Improved GTS model and application on Amazon Review Data

Information extraction from text is extremely useful for decision-making over large datasets such as product reviews. Aspect-oriented Fine-grained Opinion Extraction (AFOE) aims to automatically extract opinion pairs (aspect term, opinion term) or opinion triplets (aspect term, opinion term, sentiment) from review text. While pipeline approaches suffer from error propagation and are inconvenient in real-world scenarios, combining a pre-trained encoder with a Grid Tagging decoder unifies the task into a single, generalized model. The GTS model was first proposed in [Grid Tagging Scheme for Aspect-oriented Fine-grained Opinion Extraction](https://arxiv.org/pdf/2010.04640.pdf) (Wu et al., Findings of EMNLP 2020) [1]. To further reduce the GTS model's error on triplet extraction, we performed error analysis and experimented with different encoders and data augmentation techniques, which improved the F1 score by 6 points.


## Data

### Lap 14

The target-opinion-sentiment triplet dataset comes entirely from the alignments released with the TOWE paper [2]. It contains English sentences extracted from laptop customer reviews. Human annotators tagged the aspect terms (SB1) and their polarities (SB2); 899 sentences with 1,264 triplets were used for training and 332 sentences for testing (evaluation).

### Amazon Data

The Amazon review dataset is an unlabeled dataset of product reviews and metadata from Amazon, comprising 233.1 million reviews spanning May 1996 to October 2018. It includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also-viewed/also-bought graphs). We selected 5,000 sentences from the "Electronics" subcategory for data augmentation and further application.
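
As a rough illustration of how sentences can be pulled from this corpus, here is a minimal sketch that streams the gzipped JSON-lines review file; the filename and the `reviewText` field name are assumptions based on the public release of this dataset, not code from this repo:

```python
import gzip
import json

PATH = "Electronics_5.json.gz"  # hypothetical filename for the Electronics subset

sentences = []
with gzip.open(PATH, "rt", encoding="utf-8") as f:
    for line in f:
        review = json.loads(line)
        text = review.get("reviewText", "")  # field name assumed
        # Roughly split each review into sentences and keep non-trivial ones.
        for sent in text.split(". "):
            if len(sent.split()) > 3:
                sentences.append(sent.strip())
        if len(sentences) >= 5000:
            break

print(f"collected {len(sentences)} sentences")
```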

### Data Augmentation Files

There are three Python files in the `data/Data Augmentation` folder, one for each of our three data augmentation methods.

For **Self-training**, we used the Amazon dataset, generated pseudo labels for it, and wrote the output in JSON format.
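
A minimal sketch of the self-training idea, assuming a trained GTS model wrapped in a hypothetical `predict_triplets(sentence)` helper (the actual entry point lives in this repo's inference code):

```python
import json

def pseudo_label(sentences, model, out_path="amazon_pseudo.json"):
    """Label unlabeled Amazon sentences with a trained model and dump
    the predictions as pseudo-labeled training data (self-training)."""
    records = []
    for sent in sentences:
        triplets = model.predict_triplets(sent)  # hypothetical helper
        if triplets:  # keep only sentences where the model finds triplets
            records.append({"sentence": sent, "triples": triplets})
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
```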

For **Synonym Replacement**, we used the `nlpaug` package to randomly replace some words in each sentence with their synonyms.
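
For example, with `nlpaug`'s WordNet-based word augmenter (this is standard `nlpaug` usage, not necessarily the exact configuration in our script):

```python
import nlpaug.augmenter.word as naw

# WordNet-based synonym replacement; requires NLTK's wordnet corpus.
aug = naw.SynonymAug(aug_src="wordnet", aug_p=0.3)  # replace ~30% of words

text = "The battery life of this laptop is amazing."
print(aug.augment(text))  # recent nlpaug versions return a list of strings
```

Note that words inside labeled aspect or opinion spans must stay intact for the labels to remain valid; `nlpaug`'s `stopwords` argument can be used to protect them from replacement.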

For **Sentence Concatenation**, we concatenated pairs of sentences with opposite sentiments (their labels are concatenated as well).
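
A minimal sketch of this concatenation, assuming each example is a dict with `sentence` and `triples` fields whose spans are word-index based (these field names are assumptions, not the repo's actual schema):

```python
def concat_examples(pos_ex, neg_ex):
    """Join a positive and a negative example into one sentence; the
    second example's span indices are shifted past the first sentence."""
    offset = len(pos_ex["sentence"].split())
    shifted = [
        {**t,
         "aspect_span": [i + offset for i in t["aspect_span"]],
         "opinion_span": [i + offset for i in t["opinion_span"]]}
        for t in neg_ex["triples"]
    ]
    return {
        "sentence": pos_ex["sentence"] + " " + neg_ex["sentence"],
        "triples": pos_ex["triples"] + shifted,
    }
```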

The Self-training script is designed for the format of the Amazon dataset; the Synonym Replacement and Sentence Concatenation scripts are designed for the lap14 dataset.

## Requirements

See `requirement.txt` or the `Pipfile` for details:

- pytorch==1.7.1
- transformers==3.4.0
- python==3.6

## Usage

### Training

The training process is included in the notebook `code/BertModel/Train Project Model.ipynb`. Equivalently, from the command line:

```
python main.py --task triplet --mode train --dataset lap14_concat_syn --bert_model_path bert-base-uncased --bert_tokenizer_path bert-base-uncased --batch_size 8
```

#### Arguments

- `dataset`
  - `lap14`: laptop reviews
  - `res14`: restaurant reviews
  - `res15`: restaurant reviews
  - `res16`: restaurant reviews
  - `neg_pos_concat`: lap14 data with positive and negative sentences concatenated
  - `lap14_concat_syn`: lap14 data with positive and negative sentences concatenated, plus synonym replacement
  - `amazon_lap14_full_synonym`: lap14 data with positive and negative sentences concatenated and synonym replacement, plus pseudo labels generated from Amazon data
- `bert_model_path` and `bert_tokenizer_path`: the encoder and its corresponding tokenizer
  - `roberta-base`: RoBERTa
  - `bert-base-uncased`: BERT
  - `vinai/bertweet-base`: BERTweet

The best model will be saved in the folder `savemodel/`.
### Error Analysis

The error analysis code is in `code/BertModel/Error_Analysis.py`.

## Results

| Model | Dataset | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| GTS-BERT (baseline) | lap14 | 57.52 | 51.92 | 54.58 |
| GTS-BERTweet | lap14 | 57.66 | 57.98 | 57.82 |
| GTS-BERT + Augmented Data | lap14 | 62.12 | 53.95 | 57.75 |
| GTS-RoBERTa | lap14 | 59.51 | 62.57 | 61.00 |
| GTS-RoBERTa + Augmented Data | lap14 | 61.06 | 59.27 | 61.15 |

## References

[1] Zhen Wu, Chengcan Ying, Fei Zhao, Zhifang Fan, Xinyu Dai, Rui Xia. [Grid Tagging Scheme for Aspect-oriented Fine-grained Opinion Extraction](https://arxiv.org/pdf/2010.04640.pdf). In Findings of EMNLP, 2020.

[2] Zhifang Fan, Zhen Wu, Xin-Yu Dai, Shujian Huang, Jiajun Chen. [Target-oriented Opinion Words Extraction with Target-fused Neural Sequence Labeling](https://www.aclweb.org/anthology/N19-1259.pdf). In Proceedings of NAACL, 2019.
