<a href="https://colab.research.google.com/github/MinhongW/text_generation/blob/main/fine_tune_t5_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Summarization - Table-to-Text Generation

## Background
- Documentation based on MDT is a widely adopted process across different value streams
- Documentation is a time-consuming process
- The current version of AutoDoc is rule-based and does not achieve automatic text generation

## Objectives
- Automatically generate descriptions for tables
- Build a task-specific and domain-specific language model for text summarization
- The model will run on-prem without relying on OpenAI's API (so no data will be passed to OpenAI)



In [24]:
import pandas as pd
import json
import math
import platform
import sys
import tensorflow as tf
import nltk
nltk.download('punkt')
import string

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
! pip install datasets transformers rouge-score nltk sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting sentencepiece
  Downloading sentencepiece-0.1.98-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py310-none-any.whl (134 kB)

In [3]:
print(f"Python Platform: {platform.platform()}")
print()
print(f"Python {sys.version}")
print(f"Pandas {pd.__version__}")
gpu = len(tf.config.list_physical_devices('GPU'))>0
print("GPU is", "available" if gpu else "NOT AVAILABLE")

Python Platform: Linux-5.10.147+-x86_64-with-glibc2.31

Python 3.10.11 (main, Apr  5 2023, 14:15:10) [GCC 9.4.0]
Pandas 1.5.3
GPU is available


In [4]:
!git clone https://github.com/MinhongW/text_generation.git

Cloning into 'text_generation'...
remote: Enumerating objects: 83, done.
remote: Counting objects: 100% (83/83), done.
remote: Compressing objects: 100% (75/75), done.
remote: Total 83 (delta 31), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (83/83), 2.42 MiB | 5.24 MiB/s, done.


In [5]:
data_folder = 'text_generation/data'

def load_json(name):
    """Load one JSON file from the data folder, closing the file afterwards."""
    with open(f'{data_folder}/{name}.json') as f:
        return json.load(f)

In [6]:
tables_train = load_json('table_train')
descs_train = load_json('table_desc_train')
papers_train = load_json('paper_train')

tables_val = load_json('table_val')
descs_val = load_json('table_desc_val')
papers_val = load_json('paper_val')

tables_test = load_json('table_test')
descs_test = load_json('table_desc_test')
papers_test = load_json('paper_test')

In [7]:
# Double-check that the table file and the description file are in the same order

for i in range(len(tables_train)):
    table_id1 = tables_train[i]['table_id_paper']
    table_id2 = descs_train[i]['table_id_paper']
    if table_id1 != table_id2:
        print(f'Order mismatch at index {i}: {table_id1} != {table_id2}')

In [8]:
tables_test[106]

{'table_id_paper': 'P19-1064table_2',
 'caption': 'The overall mention detection results on the test set of OntoNotes. The F1 improvement is statistically significant under t-test with p < 0.05.',
 'row_header_level': 2,
 'row_headers': [['Model', 'Our full model'], ['Model', 'Lee et al. (2018)']],
 'column_header_level': 1,
 'column_headers': [['Prec.'], ['Rec.'], ['F1']],
 'contents': [['89.6', '82.2', '85.7'], ['86.2', '83.7', '84.9']],
 'metrics_loc': 'column',
 'metrics_type': ['Prec.', 'Rec.', 'F1'],
 'target_entity': None,
 'table_html_clean': "<table border='1' class='dataframe'>  <thead>    <tr style='text-align: right;'>      <th></th>      <th>Prec.</th>      <th>Rec.</th>      <th>F1</th>    </tr>  </thead>  <tbody>    <tr>      <td>Model || Our full model</td>      <td>89.6</td>      <td>82.2</td>      <td>85.7</td>    </tr>    <tr>      <td>Model || Lee et al. (2018)</td>      <td>86.2</td>      <td>83.7</td>      <td>84.9</td>    </tr>  </tbody></table>",
 'table_name': 

# Data

Publicly available data from the paper **[Towards Table-to-Text Generation with Numerical Reasoning](https://aclanthology.org/2021.acl-long.115.pdf)**.

The dataset **numericNLG** contains 1.3K tables with their corresponding descriptions. Tables were extracted from papers on the ACL Anthology website (NLP/AI domain). Data cleansing and annotation were done manually.


| Dataset | Size |
| ----------- | ----------- |
| Train | 1084 |
| Val | 136 |
| Test | 135 |


![picture](https://raw.githubusercontent.com/MinhongW/text_generation/main/imgs/fig_table_naive_representation.png)



# Naive representation

Simply flatten each table T into a sequence, ignoring its table structure, by concatenating the caption, headers, metrics and targeted cell values.

<img src="https://raw.githubusercontent.com/MinhongW/text_generation/main/imgs/fig_naive_representation.png" width="50%" height="50%">

In [10]:
def naive_representation(tables, descs):
    """
    input_text is generated by the naive representation of the tables:
    each table is simply flattened into a sequence, ignoring its table structure,
    by concatenating the caption, headers, metrics and targeted cell values.
    target_text is the description of the corresponding table.
    Returns a DataFrame containing input_text and target_text.
    
    """
    
    data = {'input_text':[],
           'target_text':[]}
    
    for i in range(len(tables)):
        table = tables[i]
        caption = 'summarize: ' + 'caption: ' + table['table_id'] + ' ' + table['caption']
        row_names = 'row name: ' + ' '.join(' '.join(x) for x in table['row_headers']) + '.'
        col_names = 'colume name: ' + ' '.join(' '.join(x) for x in table['column_headers']) + '.'
        metrics = 'metric: ' + ' '.join(table['metrics_type']) + '.'
        values = 'value: ' + ' '.join(' '.join(x) for x in table['contents']) + '.'        
        tmp = [caption, row_names, col_names, metrics, values]
        text = ' '.join(tmp)
        
        desc = descs[i]['description']        
        
        data['input_text'].append(text)
        data['target_text'].append(desc)
    
    df = pd.DataFrame(data)
    
    return df

In [11]:
df_train = naive_representation(tables_train, descs_train)
df_val = naive_representation(tables_val, descs_val)
df_test = naive_representation(tables_test, descs_test)

In [12]:
df_train.head()

|  | input_text | target_text |
| ----------- | ----------- | ----------- |
| 0 | summarize: caption: table_2 Comparison of diff... | Table 2 summarizes the performances of propose... |
| 1 | summarize: caption: table_3 Pearson correlatio... | Table 3 presents the correlation results for t... |
| 2 | summarize: caption: table_4 Comparison between... | Results. Table 4 presents the results of our r... |
| 3 | summarize: caption: table_2 Spearman’s rank co... | Table 2 shows the results of our contextdepend... |
| 4 | summarize: caption: table_4 Examples of attent... | From Table 4, we can find that in the first ho... |


In [13]:
# example

df_test['input_text'][106]

'summarize: caption: table_2 The overall mention detection results on the test set of OntoNotes. The F1 improvement is statistically significant under t-test with p < 0.05. row name: Model Our full model Model Lee et al. (2018). colume name: Prec. Rec. F1. metric: Prec. Rec. F1. value: 89.6 82.2 85.7 86.2 83.7 84.9.'

In [14]:
from datasets import Dataset
ds_train = Dataset.from_pandas(df_train)
ds_val = Dataset.from_pandas(df_val)
ds_test = Dataset.from_pandas(df_test)

# T5 MODEL

## **T**ext-**t**o-**T**ext **T**ransfer **T**ransformer

The T5 model was presented in [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (2020)](https://arxiv.org/pdf/1910.10683.pdf).

T5 is based on the vanilla encoder-decoder Transformer architecture, which is widely used for seq2seq modelling in NLP tasks.

<img src="https://raw.githubusercontent.com/MinhongW/text_generation/main/imgs/fig_transformers01.png" width="60%" height="60%">

From [Attention is all you need (2017)](https://arxiv.org/pdf/1706.03762.pdf)


<img src="https://raw.githubusercontent.com/MinhongW/text_generation/main/imgs/fig_transformers02.png" width="80%" height="80%">

From http://jalammar.github.io/illustrated-transformer/

## More details about T5

- Pre-trained on the C4 (Colossal Clean Crawled Corpus) dataset, built by applying cleaning filters to web-crawled text. It contains about 750GB of reasonably clean natural English text.

- Model size:

| Model | No. of Parameters |
| ----------- | ----------- |
| T5-small | 60M |
| T5-base | 220M |
| T5-large | 770M |
| T5-3b | 3B |
| T5-11b | 11B |


- BERT-style masking objective with one difference: instead of masking only single tokens, contiguous spans of tokens are masked, and each span is replaced by a single sentinel token.


- A task prefix (e.g. `summarize: `) specifies which task the model should perform. In our tests, the prefix did not materially affect performance.
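
The span-masking objective described above can be illustrated with a toy sketch. This is a simplified illustration, not T5's actual preprocessing code: the function name `corrupt_spans` and the hand-picked spans are assumptions made for the example (T5 samples spans randomly), while the `<extra_id_N>` sentinel naming follows the T5 tokenizer vocabulary.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with one sentinel token.

    tokens: list of word strings
    spans:  sorted, non-overlapping (start, end) index pairs, end exclusive
    Returns (input_tokens, target_tokens).
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])   # keep the text before the span
        inputs.append(sentinel)               # one sentinel stands in for the whole span
        targets.append(sentinel)              # the target repeats the sentinel...
        targets.extend(tokens[start:end])     # ...followed by the dropped tokens
        cursor = end
    inputs.extend(tokens[cursor:])            # keep the remaining tail
    targets.append(f"<extra_id_{len(spans)}>")  # a final sentinel closes the target
    return inputs, targets


tokens = "Thank you for inviting me to your party last week".split()
inputs, targets = corrupt_spans(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
print(" ".join(targets))
```

The encoder sees the corrupted input, and the decoder is trained to produce only the dropped spans, each introduced by its sentinel.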


In [15]:
import torch
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    print("Running on the GPU")
else:
    device = torch.device("cpu")
    print("Running on the CPU")

Running on the GPU


## Training with the Trainer API

In [16]:
from transformers import T5Tokenizer
from transformers import T5ForConditionalGeneration, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
from datasets import load_dataset, load_metric

In [17]:
model_checkpoint = "t5-small"

tokenizer = T5Tokenizer.from_pretrained(model_checkpoint)

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

In [18]:
# tokenizer.model_max_length

In [19]:
def tokenize_function(examples):
    model_inputs = tokenizer(examples["input_text"], padding="max_length", truncation=True)
    # Tokenize the targets; text_target replaces the deprecated as_target_tokenizer()
    labels = tokenizer(text_target=examples["target_text"], padding="max_length", truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

ds_train_tokenized = ds_train.map(tokenize_function, batched=True)
ds_val_tokenized = ds_val.map(tokenize_function, batched=True)
ds_test_tokenized = ds_test.map(tokenize_function, batched=True)

Map:   0%|          | 0/1084 [00:00<?, ? examples/s]



Map:   0%|          | 0/136 [00:00<?, ? examples/s]

Map:   0%|          | 0/135 [00:00<?, ? examples/s]
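
Because the targets are padded to `max_length` in `tokenize_function`, the `labels` contain pad token ids. A common practice (an addition here, not part of the run reported in this notebook) is to replace those ids with -100, the index that PyTorch's cross-entropy loss ignores by default, so that padding does not contribute to the loss. The sketch below works on plain lists; the helper name `mask_label_padding` is hypothetical, and the default pad id of 0 is assumed to match `tokenizer.pad_token_id` for T5.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_label_padding(label_ids, pad_token_id=0):
    """Replace pad ids in each label sequence with IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


# Toy batch: two label sequences padded with 0 to length 5
labels = [[37, 953, 1, 0, 0], [4398, 1, 0, 0, 0]]
masked = mask_label_padding(labels)
print(masked)
```

Inside `tokenize_function` this could be applied as `model_inputs["labels"] = mask_label_padding(labels["input_ids"], tokenizer.pad_token_id)`. `compute_metrics` already maps -100 back to the pad id before decoding, so it would keep working unchanged.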

In [20]:
data_collator = DataCollatorForSeq2Seq(tokenizer)

In [21]:
batch_size = 4 # batch_size needs to be tuned; set to 4 here due to memory limits

model_name = "t5-small-v2.0"
# model_folder = "text_generation/models"
model_dir = f"model/{model_name}"

In [22]:
# hyper-parameter tuning is needed
args = Seq2SeqTrainingArguments(
    output_dir=model_dir,
    evaluation_strategy="steps",
    eval_steps=200,
    logging_strategy="steps",
    logging_steps=200,
    save_strategy="steps",
    save_steps=2000,
    learning_rate=4e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=10,
    predict_with_generate=True,
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model="rouge1"
    #report_to="tensorboard",
)

In [25]:
import numpy as np

metric = load_metric("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    
    # Replace -100 in the labels as we can't decode it (-100 marks label positions that the loss ignores, e.g. padding)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip()))
                      for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) 
                      for label in decoded_labels]
    
    # Compute ROUGE scores
    # Return rouge1, rouge2, rougeL, rougeLsum
    result = metric.compute(predictions=decoded_preds, references=decoded_labels,
                            use_stemmer=True)

    # Extract ROUGE f1 scores
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length to metrics
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id)
                      for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}


Downloading builder script:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

In [26]:
model = T5ForConditionalGeneration.from_pretrained(model_checkpoint)
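# Optional: freeze the first 2 encoder blocks and the last 2 decoder blocks.
# t5-small has 6 blocks per stack (indices 0-5), so the last two decoder blocks are 4 and 5.
# for name, param in model.named_parameters():
#     if 'encoder.block.0.' in name or 'encoder.block.1.' in name or 'decoder.block.4.' in name or 'decoder.block.5.' in name:
#         param.requires_grad = False


# Use Seq2SeqTrainer (a subclass of Trainer)
# because we need to predict with generate() and evaluate with ROUGE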
trainer = Seq2SeqTrainer(
    #model_init=model_init, # model_init has to be callable
    model=model,
    args=args,
    train_dataset=ds_train_tokenized,
    eval_dataset=ds_val_tokenized,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
# compute_loss could be used to customise loss function
# CrossEntropyLoss is the default loss function used in seq2seq task using T5 model

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/242M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [27]:
# seq2seqtrainer will automatically use gpu

trainer.train()



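| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum | Gen Len |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 200 | 2.6849 | 1.490971 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 400 | 1.541 | 1.420154 | 11.0327 | 4.7913 | 9.2548 | 10.3784 | 12.0147 |
| 600 | 1.4012 | 1.395404 | 17.5224 | 7.3824 | 14.2681 | 16.4574 | 18.8603 |
| 800 | 1.4405 | 1.377603 | 17.7613 | 7.6526 | 14.4248 | 16.775 | 19.0 |
| 1000 | 1.3873 | 1.365566 | 17.9532 | 7.6513 | 14.5319 | 16.9283 | 19.0 |
| 1200 | 1.3914 | 1.356912 | 17.7797 | 7.5094 | 14.3804 | 16.8014 | 19.0 |
| 1400 | 1.3628 | 1.349644 | 18.0314 | 7.5879 | 14.5419 | 17.0459 | 19.0 |
| 1600 | 1.3623 | 1.344878 | 17.8209 | 7.6277 | 14.4069 | 16.79 | 19.0 |
| 1800 | 1.3498 | 1.340438 | 17.8309 | 7.5497 | 14.459 | 16.8988 | 19.0 |
| 2000 | 1.3322 | 1.337383 | 17.7716 | 7.4442 | 14.3344 | 16.7745 | 19.0 |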


TrainOutput(global_step=2710, training_loss=1.4784060207240255, metrics={'train_runtime': 1013.6293, 'train_samples_per_second': 10.694, 'train_steps_per_second': 2.674, 'total_flos': 1467105127956480.0, 'train_loss': 1.4784060207240255, 'epoch': 10.0})
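
The `global_step=2710` reported above can be reproduced from the dataset size and the training arguments. The sketch below is plain arithmetic and assumes a single device and no gradient accumulation (the `Trainer` defaults); the variable names are chosen for this example only.

```python
import math

train_size = 1084   # training examples (see the Data table)
batch_size = 4      # per_device_train_batch_size
num_epochs = 10     # num_train_epochs
save_steps = 2000   # save_steps

# Assumption: one device, gradient_accumulation_steps=1, last partial batch kept
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * num_epochs

# Periodic checkpoints are written every save_steps optimizer steps
num_checkpoints = total_steps // save_steps

print(steps_per_epoch, total_steps, num_checkpoints)
```

Under these assumptions only one periodic checkpoint (step 2000) falls inside the run, which is worth keeping in mind when combining a large `save_steps` with `load_best_model_at_end=True`.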

In [28]:
trainer.save_model()

In [29]:
trainer.save_model(f'trained_model/{model_name}')

## Evaluate the model on the test set

In [38]:
from torch.utils.data import DataLoader

model.to(device)

# define the collate function
def collate_fn(inputs):
    # tokenize the inputs and return a dictionary with input_ids and their lengths
    input_ids = tokenizer.batch_encode_plus(inputs, padding=True, truncation=True, return_tensors='pt')
    input_ids = input_ids['input_ids'].to(device)
    input_lengths = torch.sum(input_ids != tokenizer.pad_token_id, dim=1)
    return {'input_ids': input_ids, 'input_lengths': input_lengths}

# create the DataLoader
test_dataset = DataLoader(ds_test['input_text'], batch_size=4, collate_fn=collate_fn)

generated_outputs = []

model.eval()
with torch.no_grad():
    for batch in test_dataset:
        # move the batch to the GPU
        batch = {k: v.to(device) for k, v in batch.items()}
        # pass the inputs through the model to generate output
        generated_ids = model.generate(
            input_ids=batch['input_ids'],
            attention_mask=batch['input_ids'].ne(tokenizer.pad_token_id),
            max_length=512,
            num_beams=4,   # double check here
            early_stopping=True
        )
        # convert generated ids to text
        generated_texts = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
        # append generated texts to the output list
        generated_outputs.extend(generated_texts)

In [39]:
target_texts = df_test['target_text'].to_list() # a list of target strings

# Rouge expects a newline after each sentence
generated_outputs = ["\n".join(nltk.sent_tokenize(output.strip()))
                    for output in generated_outputs]
target_texts = ["\n".join(nltk.sent_tokenize(label.strip())) 
                for label in target_texts]


scores = metric.compute(predictions=generated_outputs, references=target_texts,
                    use_stemmer=True)
scores = {key: value.mid.fmeasure * 100 for key, value in scores.items()}
#scores = {key: round(value.mid.fmeasure, 4) * 100 for key, value in scores.items()}

In [40]:
scores

{'rouge1': 27.13564289136349,
 'rouge2': 9.392546582664652,
 'rougeL': 20.478029957936155,
 'rougeLsum': 24.188834920481685}

# Results

## Naive representation + T5-small

Without prefix: RougeL 20.64

With prefix: RougeL 20.48

## Other approaches

Freeze the first 2 layers of the encoder and the last 2 layers of the decoder: RougeL 19.87
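
A sketch of how the layers to freeze could be selected by parameter name. The helper `should_freeze` is hypothetical; the `encoder.block.N` / `decoder.block.N` naming is assumed to follow the parameter names of the Hugging Face T5 implementation, and the default of 6 blocks per stack corresponds to t5-small.

```python
import re

BLOCK_RE = re.compile(r"^(encoder|decoder)\.block\.(\d+)\.")


def should_freeze(name, num_layers=6, n_first_encoder=2, n_last_decoder=2):
    """True if the parameter is in the first encoder or last decoder blocks."""
    match = BLOCK_RE.match(name)
    if match is None:
        return False  # parameters outside the blocks (e.g. embeddings) stay trainable
    stack, index = match.group(1), int(match.group(2))
    if stack == "encoder":
        return index < n_first_encoder
    return index >= num_layers - n_last_decoder


# Toy parameter names in the style of model.named_parameters()
names = [
    "shared.weight",
    "encoder.block.0.layer.0.SelfAttention.q.weight",
    "encoder.block.2.layer.0.SelfAttention.q.weight",
    "decoder.block.3.layer.0.SelfAttention.q.weight",
    "decoder.block.5.layer.0.SelfAttention.q.weight",
]
frozen = [n for n in names if should_freeze(n)]
print(frozen)
```

With a loaded model this could be applied as `for name, param in model.named_parameters(): param.requires_grad = not should_freeze(name)`. Parsing the block index as an integer avoids substring pitfalls in deeper models, where `'encoder.block.1'` would also match `'encoder.block.10'`.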

Pull out the embedding layer from the T5 model and build an LSTM model on top of it.

## ROUGE score
- ROUGE calculates the similarity between a candidate document and a collection of reference documents
- ROUGE-N measures the number of matching n-grams between the model-generated text and a human-produced reference.
- ROUGE-L is based on the longest common subsequence (LCS) between our model output and reference, i.e. the longest sequence of words (not necessarily consecutive, but still in order) that is shared between both.
- Higher scores indicate better performance
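
To make the definitions above concrete, here is a simplified sketch of ROUGE-N and ROUGE-L F1 scores for one prediction and one reference. It is not the implementation used in this notebook (which calls the `rouge` metric with stemming): the sketch assumes whitespace tokenization, no stemming, and a single reference, so its numbers will differ from the library's. All function names are chosen for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, pred_total, ref_total):
    """Harmonic mean of precision and recall, 0.0 when there is no overlap."""
    if overlap == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision, recall = overlap / pred_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n=1):
    pred, ref = ngrams(prediction.split(), n), ngrams(reference.split(), n)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    return f1(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming, row by row)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    pred, ref = prediction.split(), reference.split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))


prediction = "the model achieves the best f1 score"
reference = "our model achieves the highest f1 score"
r1 = rouge_n(prediction, reference, 1)
r2 = rouge_n(prediction, reference, 2)
rl = rouge_l(prediction, reference)
print(round(r1, 4), round(r2, 4), round(rl, 4))
```

In this toy pair, 5 of 7 unigrams and 3 of 6 bigrams match, and the longest common subsequence is "model achieves the f1 score" (5 tokens).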


# Challenges

- The dataset is small (only ~1k training examples)

- No environment with GPU resources is currently available in HSBC
  - Experiments can only be run on publicly available data
  - Language models cannot be fine-tuned in HSBC environments
  - Only T5-small could be tested, on Colab
  - Hyper-parameter tuning is hard with limited resources

# Error analysis

In [48]:
from textwrap import wrap

In [41]:
tokenizer = T5Tokenizer.from_pretrained(model_checkpoint)
model = T5ForConditionalGeneration.from_pretrained(model_dir)

max_input_length = 512

In [52]:
line_length = 150
text = df_test['input_text'][0]
lines = wrap(text, line_length)
print('input:')
print('\n'.join(lines))
print('\n')

text = df_test['target_text'][0]
lines = wrap(text, line_length)
print('original description:')
print('\n'.join(lines))
print('\n')

input_text = df_test['input_text'][0]
input_ids = tokenizer.encode(input_text, truncation=True, return_tensors='pt')
output = model.generate(input_ids=input_ids, max_length=512)
text = tokenizer.decode(output[0])
lines = wrap(text, line_length)
print('generated description: ')
print('\n'.join(lines))


input:
summarize: caption: table_5 Link prediction results on the test-I, test-II, and test-all sets of FB122 and WN18 (filtered setting). row name: FB122
TransE FB122 TransH FB122 TransR FB122 KALE-Trip FB122 KALE-Pre FB122 KALE-Joint WN18 TransE WN18 TransH WN18 TransR WN18 KALE-Trip WN18 KALE-Pre WN18
KALE-Joint. colume name: Test-I MRR Test-I MED Test-I HITS@3 (%) Test-I HITS@5 (%) Test-I HITS@10 (%) Test-II MRR Test-II MED Test-II HITS@3 (%) Test-
II HITS@5 (%) Test-II HITS@10 (%) Test-ALL MRR Test-ALL MED Test-ALL HITS@3 (%) Test-ALL HITS@5 (%) Test-ALL HITS@10 (%). metric: MRR MED HITS@3 (%)
HITS@5 (%) HITS@10 (%) MRR MED HITS@3 (%) HITS@5 (%) HITS@10 (%) MRR MED HITS@3 (%) HITS@5 (%) HITS@10 (%). value: 0.296 13.0 36.0 41.5 48.1 0.630 2.0
77.5 82.8 88.4 0.480 2.0 58.9 64.2 70.2 0.280 15.0 33.6 39.1 46.4 0.606 2.0 70.1 75.4 82.0 0.460 3.0 53.7 59.1 66.0 0.283 16.0 33.4 39.2 46.0 0.499
2.0 57.0 63.2 70.1 0.401 5.0 46.4 52.4 59.3 0.299 10.0 36.6 42.9 50.2 0.650 2.0 79.0 83.4 88.7 

In [54]:
example = 7

line_length = 150
text = df_test['input_text'][example]
lines = wrap(text, line_length)
print('input:')
print('\n'.join(lines))
print('\n')

text = df_test['target_text'][example]
lines = wrap(text, line_length)
print('original description:')
print('\n'.join(lines))
print('\n')

input_text = df_test['input_text'][example]
input_ids = tokenizer.encode(input_text, truncation=True, return_tensors='pt')
output = model.generate(input_ids=input_ids, max_length=512)
text = tokenizer.decode(output[0])
lines = wrap(text, line_length)
print('generated description: ')
print('\n'.join(lines))


input:
summarize: caption: table_1 METEOR results for different configuration of our model on STARdev, STARtest and CARTOON datasets. row name: Model FULL -
Model FULL -SEM Model FULL -SYN. colume name: STARdev STARtest CARTOON. metric: METEOR METEOR METEOR. value: 31.82 29.16 32.08 28.72 25.55 27.55 31.92
29.14 32.04.


original description:
Table 1 reports METEOR, we notice that removing the semantic coherence scores in -SEM hurts the performance compared to FULL, this confirms our claim
that semantic compatibility is crucial for building coherent stories. On the other hand, -SYN performs similarly to FULL. Closer inspection of the
-SYN system’s output reveals a greater diversity in thematic elements as a result of the relaxed syntactic compatibility constraints. Hence it is more
likely to have greater overlap with any of the reference rewrites, resulting in higher METEOR scores.


generated description: 
<pad> Table 1 shows the results for different configurations of our model on ST