This coursework is contributed by Zhiquan Li, Bohua Peng and Yifan Zhou. Two files are provided in this repository: coursework.py and bert.py. The first Python file implements our improved baseline architecture; the second implements the BERT architecture described in our report.
Our implementation aims to develop a regression model for predicting the quality of machine translation. In this repository, we conduct several experiments on pre-processing, sentence embedding and regression models, based on the training data: language pairs (English - Chinese) and translation scores. We also try to use BERT to train a regression model for better prediction. The best result comes from the simplified SIF model along with an SVM regressor with the 'rbf' kernel.
We report the Pearson correlation coefficient of each model on the test and validation sets (a slash indicates the model was not evaluated on that set).
| Model | Test Set | Validation Set |
|---|---|---|
| SIF + SVM (rbf) + POS | 0.29 | 0.26 |
| SIF + SVM (rbf) | / | 0.27 |
| SIF (modified) + SVM (rbf) | 0.30 | 0.29 |
| PCA + SVM (logistic) | 0.25 | 0.21 |
| BERT | / | 0.06 |
From the table above, we see that the simplified SIF model combined with an SVM regressor using the 'rbf' kernel gives the best results.
The file coursework.py contains five modules that reflect our contributions to this improved baseline model.
- PCA model
  - get_embeddings_pca(). Input: file path for English sentences. Output: sentence embeddings using the PCA model.
  - get_sentence_embeddings_zh_pca(). Input: file path for Chinese sentences. Output: sentence embeddings using the PCA model.
An example call that produces sentence embeddings is provided below; the functions listed in the other modules are used in the same way.
```python
zh_train_mt = get_sentence_embeddings_zh_pca("./train.enzh.mt")       # Chinese
zh_train_src = get_embeddings_pca("./train.enzh.src", glove, nlp_en)  # English
```
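For reference, the general recipe we assume behind these functions is: average the word vectors of each sentence, then reduce the dimensionality of the resulting matrix with scikit-learn's PCA. The sketch below is illustrative only; the function name and the `n_components` value are our own.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_sentence_embeddings(sent_matrix, n_components=50):
    """Reduce a (n_sentences, dim) matrix of averaged word vectors
    to n_components dimensions with PCA."""
    pca = PCA(n_components=n_components)
    return pca.fit_transform(sent_matrix)
```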
- SIF model
  - get_embeddings_sif(). Input: file path for English sentences. Output: sentence embeddings using the SIF model.
  - get_sentence_embeddings_zh_sif(). Input: file path for Chinese sentences. Output: sentence embeddings using the SIF model.
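SIF (Arora et al., 2017) weights each word vector by a/(a + p(w)), averages the weighted vectors per sentence, and then removes the projection onto the first principal component. A minimal sketch of that recipe follows, assuming `word_vectors` maps tokens to NumPy arrays and `word_freq` maps tokens to unigram probabilities (both names are ours, not from coursework.py):

```python
import numpy as np

def sif_embeddings(sentences, word_vectors, word_freq, a=1e-3):
    """sentences: list of token lists. Returns a (n_sentences, dim) matrix."""
    dim = len(next(iter(word_vectors.values())))
    emb = np.zeros((len(sentences), dim))
    for i, sent in enumerate(sentences):
        vecs = [a / (a + word_freq.get(w, 0.0)) * word_vectors[w]
                for w in sent if w in word_vectors]
        if vecs:
            emb[i] = np.mean(vecs, axis=0)
    # common-component removal: subtract the projection onto the
    # first principal component of the embedding matrix
    u, _, _ = np.linalg.svd(emb.T @ emb)
    pc = u[:, 0]
    return emb - emb @ np.outer(pc, pc)
```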
- Simplified SIF model
  - get_embeddings_sif_sip(). Input: file path for English sentences. Output: sentence embeddings using the simplified SIF model.
  - get_sentence_embeddings_zh_sif_sip(). Input: file path for Chinese sentences. Output: sentence embeddings using the simplified SIF model.
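Analogous to the PCA example above, these functions can be called as follows (we assume their argument lists mirror the PCA variants):

```python
zh_train_mt = get_sentence_embeddings_zh_sif_sip("./train.enzh.mt")       # Chinese
zh_train_src = get_embeddings_sif_sip("./train.enzh.src", glove, nlp_en)  # English
```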
- POS extension
  - get_postagger(). Input: word sequence. Output: POS tag sequence.
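One plausible implementation of such a tagger uses NLTK's perceptron tagger; the actual coursework.py may use the spaCy pipeline (nlp_en) instead, so the sketch below is ours:

```python
import nltk
nltk.download("averaged_perceptron_tagger", quiet=True)  # one-time download

def get_postagger(words):
    """Map a sequence of tokens to a sequence of POS tags."""
    return [tag for _, tag in nltk.pos_tag(words)]

# Example: get_postagger(["the", "cat", "sat"]) -> ['DT', 'NN', 'VBD']
```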
As for bert.py, the key module is the model module.
- BERT end-to-end regression model
  - model(transformer_model=bert). Input: indices of a sentence. Output: score tensor.
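As a rough illustration of what such an end-to-end regression model typically looks like, the sketch below uses Hugging Face's `transformers`; it is not the exact code in bert.py, and the checkpoint name is our assumption:

```python
import torch.nn as nn
from transformers import BertModel

class BertRegressor(nn.Module):
    """BERT encoder followed by a linear head that outputs a single score."""
    def __init__(self, checkpoint="bert-base-multilingual-cased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(checkpoint)
        self.head = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.pooler_output).squeeze(-1)  # score tensor
```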
Below we provide sample test cases to help you test our code.
For the improved baseline model:
```python
import numpy as np
from sklearn.svm import SVR

zh_train_mt = get_sentence_embeddings_zh("./train.enzh.mt")       # sentence embeddings of the Chinese (MT) sentences
zh_train_src = get_embeddings("./train.enzh.src", glove, nlp_en)  # sentence embeddings of the source (English) sentences

with open("./train.enzh.scores", "r") as f_train_scores:          # ground-truth file
    zh_train_scores = f_train_scores.readlines()                  # translation scores (ground truth)
y_train_zh = np.array([float(s) for s in zh_train_scores])

# Concatenate the source and MT embeddings to form the input features
X_train_zh = np.concatenate([np.array(zh_train_src), np.array(zh_train_mt)], axis=1)

# SVR is used to perform the regression task
clf_zh = SVR(kernel='rbf')
clf_zh.fit(X_train_zh, y_train_zh)
```
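To reproduce the Pearson scores in the table above, predict on the validation set and correlate the predictions with the ground truth. A minimal sketch, assuming the dev files follow the same naming scheme as the training files (`dev.enzh.mt` is our assumption; `dev.enzh.src` and `dev.enzh.scores` appear in the BERT example below):

```python
from scipy.stats import pearsonr

zh_val_mt = get_sentence_embeddings_zh("./dev.enzh.mt")
zh_val_src = get_embeddings("./dev.enzh.src", glove, nlp_en)
X_val_zh = np.concatenate([np.array(zh_val_src), np.array(zh_val_mt)], axis=1)
y_val_zh = np.loadtxt("./dev.enzh.scores")

predictions = clf_zh.predict(X_val_zh)
print("Pearson r:", pearsonr(y_val_zh, predictions)[0])
```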
For the BERT model:
```python
test_bert(
    test_en_file_path="dev.enzh.src",   # path of the English test file
    test_zh_file_path="dev.enzh.mt",    # path of the Chinese test file
    score_file_path="dev.enzh.scores",  # path of the score file
    test_batch_size=1000)               # batch size of the test DataLoader
```