# LingMess: Linguistically Informed Multi Expert Scorers for Coreference Resolution
This notebook is the official demo for the paper ["LingMess: Linguistically Informed Multi Expert Scorers for Coreference Resolution"](https://arxiv.org/abs/2205.12644).


# Clone the code repository

In [None]:
!git clone https://github.com/shon-otmazgin/lingmess-coref.git

Cloning into 'lingmess-coref'...
remote: Enumerating objects: 396, done.
remote: Counting objects: 100% (158/158), done.
remote: Compressing objects: 100% (111/111), done.
remote: Total 396 (delta 98), reused 102 (delta 47), pack-reused 238
Receiving objects: 100% (396/396), 97.83 KiB | 7.52 MiB/s, done.
Resolving deltas: 100% (251/251), done.


# Install requirements

In [None]:
!python --version # Make sure you are using Python 3.7; install it if not.

Python 3.7.13


In [None]:
cd lingmess-coref/

/content/lingmess-coref


In [None]:
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm@ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)
Collecting scipy==1.7.3
  Downloading scipy-1.7.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (38.1 MB)
Collecting spacy==3.0.6
  Downloading spacy-3.0.6-cp37-cp37m-manylinux2014_x86_64.whl (12.8 MB)
Collecting torch==1.10.0
  Downloading torch-1.10.0-cp37-cp37m-manylinux1_x86_64.whl (881.9 MB)
tcmalloc: large alloc 1147494400 bytes

The install output above may report some dependency incompatibilities, but they concern packages we do not use, so it is safe to continue.

# Prepare your custom data for Inference

In [None]:
import json
import uuid

examples = [
    {'doc_key': uuid.uuid4().hex, 'text': 'The man tried to put the boot on his foot but it was too small.'},
    {'doc_key': uuid.uuid4().hex, 'text': 'Some apologizing was needed in the relationship after the argument because it is soothing.'},
]
with open('toy_data_raw.jsonlines', 'w') as f:
    for doc in examples:
        f.write(json.dumps(doc) + "\n")
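
Before running inference, it can help to read the file back and confirm that every line parses as JSON and carries the two fields used above. The following is a minimal sketch, assuming that `doc_key` and `text` are the only fields the raw input needs (that is all the toy examples contain). `load_jsonlines` and `check_docs` are hypothetical helper names and are not part of the repository.

```python
import json

# Assumption: these are the only fields needed, as in the toy examples.
REQUIRED_FIELDS = ('doc_key', 'text')


def load_jsonlines(path):
    """Read a .jsonlines file into a list of dicts, skipping blank lines."""
    docs = []
    with open(path, 'r') as f:
        for line in f:
            line = line.strip()
            if line:
                docs.append(json.loads(line))
    return docs


def check_docs(docs):
    """Return a list of problem descriptions; an empty list means no problems found."""
    problems = []
    seen_keys = set()
    for i, doc in enumerate(docs):
        # Every required field must be a non-empty string.
        for field in REQUIRED_FIELDS:
            value = doc.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"line {i + 1}: missing or empty '{field}'")
        # Flag repeated document keys.
        key = doc.get('doc_key')
        if isinstance(key, str):
            if key in seen_keys:
                problems.append(f"line {i + 1}: duplicate doc_key '{key}'")
            seen_keys.add(key)
    return problems
```

For the file written in the cell above, `check_docs(load_jsonlines('toy_data_raw.jsonlines'))` should return an empty list.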

# Inference

In [None]:
!python run.py \
        --output_file=predictions.jsonlines \
        --model_name_or_path=biu-nlp/lingmess-coref \
        --test_file=toy_data_raw.jsonlines \
        --eval_split=test \
        --max_tokens_in_batch=15000 \
        --device=cuda:0

06/15/2022 16:13:16 - INFO - __main__ -   missing_keys: []
06/15/2022 16:13:16 - INFO - __main__ -   unexpected_keys: []
06/15/2022 16:13:16 - INFO - __main__ -   mismatched_keys: []
06/15/2022 16:13:16 - INFO - __main__ -   error_msgs: []
06/15/2022 16:13:16 - INFO - __main__ -   Parameters: 590.0M, Transformer: 434.6M, Head: 155.4M
06/15/2022 16:13:16 - INFO - coref_dataset -   Creating dataset for {'train': None, 'dev': None, 'test': 'toy_data_raw.jsonlines'}
06/15/2022 16:13:18 - INFO - util -   Tokenize documents using Spacy...
100% 2/2 [00:00<00:00, 81.54it/s]
06/15/2022 16:13:18 - INFO - coref_dataset -   Tokenize documents...
100% 2/2 [00:00<00:00, 125.83ex/s]
06/15/2022 16:13:18 - INFO - eval -   ***** Running Inference on test split  *****
06/15/2022 16:13:18 - INFO - eval -     Examples number: 2
Inference: 100% 2/2 [00:00<00:00,  4.71it/s]
06/15/2022 16:13:19 - INFO - util -   Predicted clusters at: predictions.jsonlines


# Resolve the clusters

In [None]:
import json

predicted_docs = []
with open('predictions.jsonlines', 'r') as f:
    for line in f:
        predicted_docs.append(json.loads(line))

for doc in predicted_docs:
    tokens = doc['tokens']
    predicted_clusters = doc['clusters']
    doc_key = doc['doc_key']
    print(f"##### doc_key: {doc_key}")

    for cluster in predicted_clusters:
        # Each mention is a (start, end) token span; the end index is inclusive.
        text_cluster = [tokens[start:end + 1] for start, end in cluster]
        print(text_cluster)
    print("###########################")
    print()

##### doc_key: 300b22c2baa447aea33e34b2a0bcbfe3
[['The', 'man'], ['his']]
[['the', 'boot'], ['it']]
###########################

##### doc_key: 022810dea7c04365aff96fed73f3dd23
[['Some', 'apologizing'], ['it']]
###########################
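
The clusters printed above are lists of token lists. For downstream use it is often handier to have each mention as a single string. The following is a minimal sketch, assuming the same `tokens` and `clusters` fields read above and inclusive end indices, as in `tokens[start:end+1]`. `clusters_to_strings` is a hypothetical helper, and joining tokens with spaces only approximates the original text. The token list and spans in the usage lines are illustrative values chosen to match the first output above; they are not copied from `predictions.jsonlines`.

```python
def clusters_to_strings(tokens, clusters):
    """Map each cluster of (start, end) token spans to a list of mention strings.

    Assumes end indices are inclusive. Joining with spaces is an approximation
    of the original surface text.
    """
    return [
        [' '.join(tokens[start:end + 1]) for start, end in cluster]
        for cluster in clusters
    ]


# Illustrative input: a hypothetical tokenization of the first toy sentence.
toy_tokens = ['The', 'man', 'tried', 'to', 'put', 'the', 'boot', 'on',
              'his', 'foot', 'but', 'it', 'was', 'too', 'small', '.']
toy_clusters = [[[0, 1], [8, 8]], [[5, 6], [11, 11]]]

print(clusters_to_strings(toy_tokens, toy_clusters))
# [['The man', 'his'], ['the boot', 'it']]
```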

