In [None]:
!pip install spacy==2.1.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!python -m spacy download en_core_web_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en_core_web_sm==2.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz (11.1 MB)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [None]:
!pip install neuralcoref

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import spacy
import neuralcoref

In [None]:
nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)

<spacy.lang.en.English at 0x7fda3d32c6d0>

In [None]:
doc = nlp(u'''Alice was beginning to get very tired of sitting by her sister on the bank, 
and of having nothing to do: once or twice she had peeped into the book her 
sister was reading, but it had no pictures or conversations in it, ‘and what 
is the use of a book,’ thought Alice ‘without pictures or conversations?''')

In [None]:
# Print each coreference chain found by NeuralCoref
for i, cluster in enumerate(doc._.coref_clusters, start=1):
    print("Chain", i, ":")
    print(str(cluster).replace('\n', ""))

Chain 1 :
Alice: [Alice, her, she, her, Alice]
Chain 2 :
her sister: [her sister, it, it]


In [None]:
# Checking that the entities from spacy are labeled correctly
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])

for entity in doc.ents:
    print(entity.text, entity.label_)

Noun phrases: ['Alice', 'her sister', 'the bank', 'nothing', 'she', 'the book', 'her \nsister', 'it', 'no pictures', 'conversations', 'it', 'what', 'the use', 'a book', 'pictures', 'conversations']


# Conclusion
The coreference resolver (NeuralCoref, running in the spaCy pipeline) found two coreference chains in the paragraph: one for Alice and one for her sister.
The first chain is correct: every mention in it refers to Alice.
The second chain, however, mixes her sister with the book. Its mentions are "her sister", "it", and "it", but both occurrences of "it" refer to the book. It should be split into one chain for her sister and one for the book. The expected result is:


*   Chain 1 : Alice: [Alice, her, she, her, Alice]
*   Chain 2 : her sister: [her sister, her sister]
*   Chain 3 : the book: [the book, it, it]
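
The gap between the predicted and the expected chains can be made explicit by counting the mention pairs they share. The sketch below is plain Python and not part of NeuralCoref. The chains are hand-copied from the printed output and from the expected list above, the helper `mention_pairs` is hypothetical, and the numeric suffixes (`her_1`, `it_2`, ...) are labels added here to tell repeated words apart. Which occurrence of "her sister" the predicted chain contains is not visible in the output, so the second one is assumed. This is a simple pairwise link count, not one of the standard coreference metrics.

```python
from itertools import combinations


def mention_pairs(chains):
    """Return the set of unordered mention pairs that share a chain."""
    pairs = set()
    for mentions in chains.values():
        pairs.update(frozenset(pair) for pair in combinations(mentions, 2))
    return pairs


# Chains printed by the resolver (occurrence suffixes are assumptions).
predicted = {
    "Alice": ["Alice_1", "her_1", "she_1", "her_2", "Alice_2"],
    "her sister": ["her sister_2", "it_1", "it_2"],
}

# Chains a human reader would expect.
expected = {
    "Alice": ["Alice_1", "her_1", "she_1", "her_2", "Alice_2"],
    "her sister": ["her sister_1", "her sister_2"],
    "the book": ["the book_1", "it_1", "it_2"],
}

pred_pairs = mention_pairs(predicted)
gold_pairs = mention_pairs(expected)
correct = pred_pairs & gold_pairs

print("Links only in the prediction:", sorted(sorted(p) for p in pred_pairs - gold_pairs))
print("Links missing from the prediction:", sorted(sorted(p) for p in gold_pairs - pred_pairs))
print(f"Shared links: {len(correct)} of {len(pred_pairs)} predicted, {len(gold_pairs)} expected")
```

Under these assumptions, the two links that appear only in the prediction are the ones tying "her sister" to each "it", which is exactly the error described above.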

NeuralCoref uses the spaCy parser together with the neural mention-ranking model described in Clark and Manning (2016). The model assigns a score to each pair of mentions that reflects how likely they are to be coreferent, and it is trained with reinforcement learning.
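
To illustrate the idea of mention ranking, here is a minimal sketch in plain Python. It is not NeuralCoref's implementation: the function `toy_score` is a hypothetical hand-written stand-in for the learned neural scorer, and the `gender` field and the threshold `NEW_CHAIN_SCORE` are assumptions made for the example. Each mention is compared with every earlier mention, and it either joins the chain of the best-scoring antecedent or starts a new chain.

```python
NEW_CHAIN_SCORE = 0.0  # assumed score for "this mention has no antecedent"


def toy_score(antecedent, mention, distance):
    """Hypothetical pair score; a real model learns this from data."""
    score = 1.0 if antecedent["gender"] == mention["gender"] else -1.0
    score -= 0.1 * distance  # slightly prefer nearby antecedents
    return score


def rank_mentions(mentions):
    """Greedily attach each mention to its best-scoring antecedent."""
    chains = []    # list of chains, each a list of mention texts
    chain_of = {}  # mention position -> index into chains
    for i, mention in enumerate(mentions):
        best_score, best_j = NEW_CHAIN_SCORE, None
        for j in range(i):
            score = toy_score(mentions[j], mention, distance=i - j)
            if score > best_score:
                best_score, best_j = score, j
        if best_j is None:
            chain_of[i] = len(chains)
            chains.append([mention["text"]])
        else:
            chain_of[i] = chain_of[best_j]
            chains[chain_of[i]].append(mention["text"])
    return chains


mentions = [
    {"text": "Alice", "gender": "fem"},
    {"text": "she", "gender": "fem"},
    {"text": "the book", "gender": "neut"},
    {"text": "it", "gender": "neut"},
]
print(rank_mentions(mentions))
```

With this toy scorer, "she" attaches to "Alice" and "it" attaches to "the book". A learned scorer replaces `toy_score` with a neural network over many more features, which is where errors such as the one discussed here can creep in.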

To locate the error, the noun phrases and entities produced by the spaCy parser are checked to see whether they are defined correctly.
In the output above, "the book" is a noun phrase of its own, and the "her sister" that follows it is a separate noun phrase. The two could instead be treated as a single phrase, "the book her sister was reading", since the sister is not important in this context.

Below is an alternative sentence in which the phrase ends with "book" rather than "reading" ("her sister's book" instead of "the book her sister was reading"), to make clear that "her sister's book" is a single entity. With this wording, the coreference resolver produces a better result.

This indicates that the coreference resolver is sensitive to the order of entities in a sentence. A simple cue that the book, and not the sister, is the antecedent is the pronoun "it", which does not normally refer to a person. This is a cue one would expect the machine learning model to take into account.
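
The pronoun cue can be turned into a rough sanity check on the printed chains. The sketch below is plain Python and makes several assumptions: the chains are hand-copied from the two runs in this notebook, the word lists `NEUTER_PRONOUNS` and `PERSON_HEADS` are small hand-made examples, and the last word of the chain's main mention is used as a crude approximation of its head noun. The function name `suspicious_chains` is hypothetical.

```python
NEUTER_PRONOUNS = {"it", "its", "itself"}
PERSON_HEADS = {"alice", "sister"}  # hand-made list, for this text only


def suspicious_chains(chains):
    """Return main mentions whose chain pairs a person with a neuter pronoun."""
    flagged = []
    for main, mentions in chains.items():
        has_neuter = any(m.lower() in NEUTER_PRONOUNS for m in mentions)
        head = main.lower().split()[-1]  # crude head: last word of the phrase
        if has_neuter and head in PERSON_HEADS:
            flagged.append(main)
    return flagged


# Chains from the original sentence.
first_run = {
    "Alice": ["Alice", "her", "she", "her", "Alice"],
    "her sister": ["her sister", "it", "it"],
}

# Chains from the alternative sentence.
second_run = {
    "Alice": ["Alice", "her", "she", "her", "Alice"],
    "her sister's book": ["her sister's book", "it", "it"],
}

print(suspicious_chains(first_run))
print(suspicious_chains(second_run))
```

The check flags only the "her sister" chain of the original sentence. It is a heuristic for spotting likely errors after the fact and does not change how the resolver itself scores mentions.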

# Alternative sentence

In [None]:
doc = nlp(u'''Alice was beginning to get very tired of sitting by her sister on the bank, 
and of having nothing to do: once or twice she had peeped into her sister's book, but it had no pictures or conversations in it, ‘and what 
is the use of a book,’ thought Alice ‘without pictures or conversations?''')

In [None]:
# Checking that the entities from spaCy are labeled correctly
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])

for entity in doc.ents:
    print(entity.text, entity.label_)

Noun phrases: ['Alice', 'her sister', 'the bank', 'nothing', 'she', "her sister's book", 'it', 'no pictures', 'conversations', 'it', 'what', 'the use', 'a book', 'pictures', 'conversations']


In [None]:
# Print each coreference chain for the alternative sentence
for i, cluster in enumerate(doc._.coref_clusters, start=1):
    print("Chain", i, ":")
    print(str(cluster).replace('\n', ""))

Chain 1 :
Alice: [Alice, her, she, her, Alice]
Chain 2 :
her sister's book: [her sister's book, it, it]
