# Sample Constructions

Explore using RoBERTa to distinguish correct from incorrect sentences on two sample construction types (Caused-Motion and Way), using 72 sentence pairs.

In [1]:
import sys
sys.path.append('../')

import pickle
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict

import torch
import transformers
from transformers import pipeline

%matplotlib inline
%load_ext autoreload
%autoreload 2
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

## Load data

Keep only sentence pairs whose ratings differ by at least 2.

In [2]:
data = pd.read_csv("../data/paired_sentences.csv")

In [3]:
data = data[(data.Rating1 - data.Rating2).abs() >= 2]

In [4]:
data.head()

Unnamed: 0,Type,Acc,Sentence1,Sentence2,Rating1,Rating2
0,CM,1,The audience laughed Bob off the stage.,The performers laughed Bob off the theatre.,5,3
1,CM,1,The audience laughed Bob off the stage.,The singers laughed Bob off the movie.,5,0
3,CM,1,John chopped carrots into the salad.,John chopped oranges into the bread.,5,1
5,CM,1,The professor invited us into his office.,The teacher invited us into his desk.,5,0
7,CM,0,The key allowed John into the house.,The door allowed John into the condo.,4,0


## Run RoBERTa

The function below is copied from the layerwise anomaly project and has two serious limitations:

1. It assumes both sentences tokenize to the same number of tokens.
2. It assumes exactly one token differs between the two sentences (that token is the one masked).

As a result, most sentence pairs are currently excluded.
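The two restrictions above can be illustrated without loading a model. The sketch below is a hypothetical helper (`find_single_diff` is not part of the original project) that works on plain lists of token IDs; the IDs and the mask value are made up for illustration.

```python
from typing import List, Optional, Tuple


def find_single_diff(
    toks1: List[int], toks2: List[int], mask_id: int
) -> Optional[Tuple[List[int], int, int]]:
    """Return (masked_tokens, tok1, tok2) if exactly one position differs, else None."""
    # Limitation 1: the sequences must have the same length.
    if len(toks1) != len(toks2):
        return None

    # Limitation 2: exactly one aligned position may differ.
    diff_positions = [i for i, (a, b) in enumerate(zip(toks1, toks2)) if a != b]
    if len(diff_positions) != 1:
        return None

    ix = diff_positions[0]
    masked = list(toks1)
    masked[ix] = mask_id  # replace the differing token with the mask
    return masked, toks1[ix], toks2[ix]


# Toy token IDs, chosen arbitrarily (not real RoBERTa vocabulary IDs).
MASK = -1
print(find_single_diff([10, 11, 12, 13], [10, 11, 99, 13], MASK))  # usable pair
print(find_single_diff([10, 11, 12], [10, 11, 12, 13], MASK))      # length mismatch
print(find_single_diff([10, 11, 12, 13], [10, 77, 99, 13], MASK))  # two differences
```

Any pair where the differing word splits into a different number of subword tokens, or where more than one word was changed, falls into one of the two `None` branches, which is why so few pairs survive.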

In [5]:
nlp = pipeline("fill-mask", model='roberta-base')

In [6]:
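def fill_one(sent1, sent2):
  """Compare two sentences that differ in exactly one token.

  Returns True if RoBERTa prefers the token from sent1 at the differing
  position, False if it prefers the token from sent2, and None if the
  pair cannot be compared (different lengths, or not exactly one
  differing token).
  """
  toks1 = nlp.tokenizer(sent1, add_special_tokens=False)['input_ids']
  toks2 = nlp.tokenizer(sent2, add_special_tokens=False)['input_ids']

  # Limitation 1: both sentences must tokenize to the same length.
  if len(toks1) != len(toks2):
    return None

  # Copy the shared tokens and mask every position where the sentences differ.
  masked_toks = []
  dtok1 = None
  dtok2 = None
  num_masks = 0
  for ix in range(len(toks1)):
    if toks1[ix] != toks2[ix]:
      masked_toks.append(nlp.tokenizer.mask_token_id)
      num_masks += 1
      dtok1 = toks1[ix]
      dtok2 = toks2[ix]
    else:
      masked_toks.append(toks1[ix])

  # Limitation 2: exactly one token may differ.
  if num_masks != 1:
    return None

  # Score only the two candidate tokens; the pipeline returns them sorted
  # by score, so res[0] is the candidate the model prefers.
  res = nlp(nlp.tokenizer.decode(masked_toks), targets=[nlp.tokenizer.decode(dtok1), nlp.tokenizer.decode(dtok2)])
  return res[0]['token'] == dtok1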

In [7]:
for _, row in data.iterrows():
  result = fill_one(row['Sentence1'], row['Sentence2'])
  if result is None:
    continue
  print(row['Sentence1'])
  print(row['Sentence2'])
  print(f'Sentence {1 if result else 2} is better')
  print()

Frank dug his way out of prison.
Frank dug his way out of cell.
Sentence 1 is better

Frank dug his way out of prison.
Frank dug his way out of courthouse.
Sentence 1 is better

Bob worked his way to the top of his profession.
Bob worked his way to the top of his degree.
Sentence 1 is better

Sam joked his way into the meeting.
Sam joked his way into the date.
Sentence 1 is better
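
Since pairs that fail the length or single-difference checks are skipped silently, it may help to tally how many pairs were actually scored. The following is a minimal sketch, assuming the results are collected as a list of `True` / `False` / `None` values in the form `fill_one` returns; `summarize_results` is a hypothetical helper and the example input is made up, not taken from the run above.

```python
from collections import Counter
from typing import Dict, List, Optional, Union


def summarize_results(results: List[Optional[bool]]) -> Dict[str, Union[int, float]]:
    """Count skipped pairs and which sentence was preferred among scored pairs."""
    counts = Counter(
        "skipped" if r is None else ("sentence1" if r else "sentence2")
        for r in results
    )
    total = len(results)
    scored = total - counts["skipped"]
    return {
        "total": total,
        "skipped": counts["skipped"],
        "scored": scored,
        "sentence1_preferred": counts["sentence1"],
        "sentence2_preferred": counts["sentence2"],
        # Fraction of pairs that passed both checks and were scored.
        "coverage": scored / total if total else 0.0,
    }


# Made-up example: two scored pairs, two skipped pairs.
print(summarize_results([True, None, False, None]))
```

In the notebook, the input list could be built by calling `fill_one(row['Sentence1'], row['Sentence2'])` for each row of `data` and appending the result instead of skipping `None`.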

