# Productivity / Compositionality Features
**Project goal**  
Compile a dataset that targets a specific effect in language, so that (1) people know that CxG can be useful for describing this effect, and (2) people can assess how well neural LMs handle this effect.  

**Motivation**  
The standard argument for compositionality is productivity: from a finite collection of lexical items whose meanings we already know, we can assemble the meanings of a potentially infinite number of combinations. The fact that natural language is learnable also supports compositionality. So there are some interesting relations between compositionality and productivity.  

There are some "noncompositional phrases", where the semantics of the individual words do not add up to that of the whole, e.g., "rock and roll" and "kick the bucket". These noncompositional phrases have low compositionality and high productivity.  

How about the dative alternation constructions? The alternating pair (e.g., the double object construction "Bill gave Jack the box" vs. the oblique dative "Bill gave the box to Jack") should be equally compositional. As for productivity, I'm not sure; I will defer that to Google N-grams.  

**Computing these metrics from LM**  
- Compositionality
    - For contextualized embeddings: Either (a) embed word sense explanations from the Oxford API to get $v_{part}=[v_1, v_2, ...]$, or (b) take the contextualized LM's last layer attentions as $v_{part}$. Embed the sentence meaning as $v_{whole}$. Measure how far the composition (e.g., the average? unsure) of $v_{part}$ is from $v_{whole}$. If they are too far apart, then it is hard to "assemble the meanings" from the parts.  
    - For NN based on static embeddings (e.g., GloVe + LSTM): Similar approaches apply. $v_{part}=[v_1,v_2,...]$ are taken from GloVe. $v_{whole}$ is taken from LSTM.  
    - [(Andreas 2019)](https://openreview.net/forum?id=HJz05o0qK7) proposed a tree reconstruction metric. This is a simplistic one. They assumed the tree-structured `derivation` (i.e., semantics) to be known, and measured how far the model's constructions (`f(x)`) are from the derivations.  
- Productivity: how often this construction is used. Related: [here](https://psychology.wikia.org/wiki/Productivity_(linguistics)).  
    - Google N-gram gives the frequency.  
    - Should productivity be related to compositionality? Unsure; let's see.
- Aside: Systematicity. According to [(Goodwin et al., 2020)](https://research.fb.com/wp-content/uploads/2020/07/Probing-Linguistic-Systematicity.pdf), it means that words mean the same thing in different contexts.  
    - Test novel words in familiar combinations, e.g., Goodwin et al. (2020).  
    - Test familiar words in novel combinations, e.g., Lake and Baroni (2018) SCAN; Dasgupta et al. (2018, 2019).  
    - "Self similarity" of [Ethayarajh 2019](https://aclanthology.org/D19-1006/): measure how different a word's vector is across different sentences. The vectors turn out to be really different: most of the variance of $v_{word}$ is attributable to the context. So measuring the systematicity of an LM is kind of moot.
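
A minimal sketch of the compositionality measurement described in the list above, assuming the composition is a plain average of the part vectors and that "how far" means cosine distance. Both choices are assumptions (the notes leave the composition function open), and the names `cosine_distance` and `compositionality_gap` are hypothetical. The toy vectors stand in for real embeddings.

```python
import numpy as np

def cosine_distance(u, v) -> float:
    """1 - cosine similarity between two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - float(np.dot(u, v) / denom)

def compositionality_gap(v_part, v_whole) -> float:
    """Distance between the averaged part vectors and the whole-sentence vector.

    v_part:  array of shape (num_parts, dim), one row per word / word sense.
    v_whole: array of shape (dim,), the sentence embedding.
    Averaging is an assumed composition function, not a settled choice.
    """
    composed = np.mean(np.asarray(v_part, dtype=float), axis=0)
    return cosine_distance(composed, v_whole)

# Toy vectors (not real embeddings): two orthogonal "word" vectors.
parts = np.array([[1.0, 0.0], [0.0, 1.0]])
whole_close = np.array([1.0, 1.0])   # points the same way as the average of the parts
whole_far = np.array([1.0, -1.0])    # orthogonal to the average of the parts

gap_close = compositionality_gap(parts, whole_close)
gap_far = compositionality_gap(parts, whole_far)
print(gap_close, gap_far)
```

A small gap would suggest the whole is easy to "assemble" from its parts under this composition function; a large gap would suggest a less compositional phrase. With real models, `v_part` and `v_whole` need to live in the same vector space for the distance to be meaningful.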

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import requests
from typing import List, Tuple, Any
import torch
from transformers import AutoModel, AutoTokenizer

verb_cxn = pd.read_csv("../notebooks/verb_cxn_examples.csv")
verb_cxn

Unnamed: 0,sentence,verb,construction
0,Bill kicked the box.,kicked,transitive
1,Frank threw the window.,threw,transitive
2,Jack got the ball.,got,transitive
3,Steve hit the fridge.,hit,transitive
4,Adam kicked George the ball.,kicked,ditransitive
5,Eric threw Paul the book.,threw,ditransitive
6,Harry got Thomas the door.,got,ditransitive
7,John hit Henry the door.,hit,ditransitive
8,Tom kicked the door off the bridge.,kicked,caused-motion
9,Sam threw the laptop onto the elevator.,threw,caused-motion


## Productivity, and the Google Ngram 
Let me try the [API](https://jameshfisher.com/2018/11/25/google-ngram-api/)  

In [2]:
def query_frequency(ngram: str, lang: str="en") -> pd.DataFrame:
    """
    Query the Google Books Ngram JSON endpoint and return, for each ngram,
    its frequency in the last queried year.
    To query several ngrams, separate them with commas (",").
    To query all inflections of a verb, append _INF to the verb.
    For more usage, see https://books.google.com/ngrams/info
    """
    ngramstr = "+".join(ngram.split())
    startyear = 2000
    endyear = 2019
    # Corpus ids per language. The fr and ch corpora don't contain many books before 2000.
    corpus = {"en": 26, "fr": 30, "ch": 34}[lang]
    resp = requests.get(f"https://books.google.com/ngrams/json?content={ngramstr}&year_start={startyear}&year_end={endyear}&corpus={corpus}&smoothing=3")
    resp.raise_for_status()  # fail on HTTP errors instead of trying to parse an error page
    L = resp.json()
    
    result = {"ngram": [], "frequency": [], "type": []}
    for item in L:
        result["ngram"].append(item['ngram'])
        result["frequency"].append(item['timeseries'][-1])
        result["type"].append(item['type'])
    return pd.DataFrame(result)

result = query_frequency("laisse_INF tomber", lang="fr")
result

Unnamed: 0,ngram,frequency,type
0,laisse_INF tomber,2.30024e-05,ALTERNATE_FORM
1,laissa tomber,6.971325e-06,EXPANSION
2,laisser tomber,4.347074e-06,EXPANSION
3,laisse tomber,4.024658e-06,EXPANSION
4,laissé tomber,2.772637e-06,EXPANSION
5,laissant tomber,1.439988e-06,EXPANSION
6,laissée tomber,6.137007e-07,EXPANSION
7,laissait tomber,5.313193e-07,EXPANSION
8,laissai tomber,6.533759e-07,EXPANSION
9,laissent tomber,2.090762e-07,EXPANSION


In [3]:
query_frequency("试试,逝世", lang="ch")

Unnamed: 0,ngram,frequency,type
0,试试,3e-06,NGRAM
1,逝世,9e-06,NGRAM


In [4]:
query_frequency("kick_INF the bucket")

Unnamed: 0,ngram,frequency,type
0,kick_INF the bucket,9.396156e-08,ALTERNATE_FORM
1,kick the bucket,4.404718e-08,EXPANSION
2,kicked the bucket,3.503867e-08,EXPANSION
3,kicking the bucket,7.868428e-09,EXPANSION
4,kicks the bucket,7.007286e-09,EXPANSION


In [5]:
query_frequency("kick_INF the box, kick_INF the bucket, kick_INF the wall")

Unnamed: 0,ngram,frequency,type
0,kick_INF the box,9.646479e-09,ALTERNATE_FORM
1,kicked the box,5.563756e-09,EXPANSION
2,kick the box,2.115356e-09,EXPANSION
3,kicking the box,1.451791e-09,EXPANSION
4,kicks the box,5.155758e-10,EXPANSION
5,kick_INF the bucket,9.396156e-08,ALTERNATE_FORM
6,kick the bucket,4.404718e-08,EXPANSION
7,kicked the bucket,3.503867e-08,EXPANSION
8,kicking the bucket,7.868428e-09,EXPANSION
9,kicks the bucket,7.007286e-09,EXPANSION


In [6]:
def compute_productivity(sentence: str, ngram: int=3) -> float:
    """Mean Google Ngram frequency over all word-level ngrams of the sentence."""
    wordlist = sentence.split()
    # Sliding window of `ngram` words over the sentence
    ngram_list = [" ".join(wordlist[i:i+ngram]) for i in range(len(wordlist)-ngram+1)]
    ngram_str = ", ".join(ngram_list)
    df = query_frequency(ngram_str)
    print(df)
    return df.frequency.mean()

compute_productivity("Paul kicked the bucket", ngram=3)

               ngram     frequency   type
0    Paul kicked the  3.876041e-10  NGRAM
1  kicked the bucket  3.503867e-08  NGRAM


1.7713136492292225e-08
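
The arithmetic mean above is dominated by the most frequent n-gram ("kicked the bucket" is about two orders of magnitude more frequent than "Paul kicked the"). As a sketch only, the aggregation can be separated from the network call so that other averages are easy to try. The geometric mean below is an assumed alternative, not something these notes settled on, and `sentence_ngrams`, `aggregate_frequency`, and `toy_table` are hypothetical names. The lookup table reuses the two frequencies printed above rather than querying the API.

```python
import math
from typing import Callable, List

def sentence_ngrams(sentence: str, n: int = 3) -> List[str]:
    """All contiguous word-level n-grams of a whitespace-tokenized sentence."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def aggregate_frequency(ngrams: List[str],
                        lookup: Callable[[str], float],
                        how: str = "arithmetic") -> float:
    """Combine per-ngram frequencies into one score.

    `lookup` maps an n-gram to its frequency; here it is a dict lookup,
    but it could wrap a call to the Ngram endpoint.
    """
    freqs = [lookup(g) for g in ngrams]
    if not freqs:
        raise ValueError("sentence is shorter than the n-gram size")
    if how == "arithmetic":
        return sum(freqs) / len(freqs)
    if how == "geometric":
        # A single unseen n-gram (frequency 0) drives the geometric mean to 0.
        if any(f <= 0 for f in freqs):
            return 0.0
        return math.exp(sum(math.log(f) for f in freqs) / len(freqs))
    raise ValueError(f"unknown aggregation: {how}")

# Frequencies copied from the output above (not fetched here).
toy_table = {
    "Paul kicked the": 3.876041e-10,
    "kicked the bucket": 3.503867e-08,
}
grams = sentence_ngrams("Paul kicked the bucket", n=3)
arith = aggregate_frequency(grams, toy_table.__getitem__, how="arithmetic")
geo = aggregate_frequency(grams, toy_table.__getitem__, how="geometric")
print(grams, arith, geo)
```

The arithmetic variant reproduces the score computed above, while the geometric variant is pulled down by the rare n-gram containing the proper name. Which behavior is preferable for a productivity score is an open question here.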