### Exploring and testing NeuralCoref

[20 June 2023]

I tested the library in a dedicated conda environment, although it no longer seems to be actively developed.

"NeuralCoref is a pipeline extension for spaCy 2.1+ which annotates and resolves coreference clusters using a neural network. NeuralCoref is production-ready, integrated in spaCy's NLP pipeline and extensible to new training datasets."

* https://spacy.io/universe/project/neuralcoref 
* [NeuralCoref 4.0: Coreference Resolution in spaCy with Neural Networks](https://github.com/huggingface/neuralcoref)
* https://medium.com/huggingface/state-of-the-art-neural-coreference-resolution-for-chatbots-3302365dcf30
* Online test service: https://huggingface.co/coref/

I also tested changing the parameters.
* `greedyness=0.55` (default: 0.5): the higher value produces wrong results
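
The parameter is passed as a keyword argument when the component is added to the pipeline. A minimal sketch, assuming the same `en_core_web_lg` model that is loaded later in this notebook. `build_coref_pipeline` is a hypothetical helper name, and the imports sit inside the function so that the definition loads even where spaCy and NeuralCoref are not installed:

```python
def build_coref_pipeline(model_name="en_core_web_lg", greedyness=0.5):
    """Load a spaCy model and attach NeuralCoref with the given greedyness.

    Hypothetical helper for this notebook, not part of NeuralCoref itself.
    """
    import spacy
    import neuralcoref

    nlp = spacy.load(model_name)
    # Per the NeuralCoref README, a greedier model makes more coreference
    # links; the documented default is 0.5.
    neuralcoref.add_to_pipe(nlp, greedyness=greedyness)
    return nlp


# Usage (requires the Python 3.7 environment described below):
# nlp_default = build_coref_pipeline()
# nlp_greedy = build_coref_pipeline(greedyness=0.55)
```

Building two pipelines this way makes it possible to run the same text through both and compare `doc._.coref_clusters` side by side.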

Results:
* It is an older library that does not appear to be maintained.
* Some coreferences are left unresolved or are resolved wrongly.
* I decided to use _coreferee_ instead.





Installation instructions

* https://anaconda.org/conda-forge/neuralcoref
* The latest version appears to be NeuralCoref 4.0
* The conda package is built for Python 3.7 only. The solver message was: `Package neuralcoref-4.0-py37hc9558a2_0 requires python >=3.7,<3.8.0a0, but none of the providers can be installed`
* Do not install spaCy before neuralcoref; let conda choose the suitable dependencies

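```
# Create a dedicated Python 3.7 environment and let the solver pick a matching spaCy
mamba create -n py3_7_spacy_neuralcoref python=3.7 pandas ipykernel neuralcoref psycopg2 matplotlib itables

conda activate py3_7_spacy_neuralcoref
# List the existing Jupyter kernels, then register this environment as a new one
jupyter kernelspec list
python -m ipykernel install --user --name py3_7_spacy_neuralcoref

# Download the English model used below
python -m spacy download en_core_web_lg

# Cleanup: remove the environment once it is no longer needed
conda remove --name py3_7_spacy_neuralcoref --all
```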
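
Because the package is pinned to `python >=3.7,<3.8.0a0`, the notebook can check the interpreter before importing the library. A small sketch. `is_supported_python` is a hypothetical helper name, and the bounds are taken from the package constraint quoted above:

```python
import sys


def is_supported_python(version_info=None):
    """Return True if the (major, minor) version satisfies >=3.7,<3.8."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return (3, 7) <= major_minor < (3, 8)


if not is_supported_python():
    # Only a warning: the import of neuralcoref will most likely fail anyway.
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} detected; "
        "the neuralcoref 4.0 conda package expects Python 3.7."
    )
```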

In [39]:
import spacy
import pandas as pd

In [2]:
import neuralcoref

In [3]:
from importlib import reload
from matplotlib import pyplot as plt

import psycopg2
import settings as stt

from itables import init_notebook_mode, show
import re

from time import strftime, gmtime

In [4]:
import postgresql_functions as pgf

In [5]:
# reload(pgf)

## Get sample texts from the database

In [6]:
### connect to the local database
conn = psycopg2.connect(host="localhost", port = 5432, database="espace_intellectuel", 
                        user="postgres", password=stt.dbw)
#conn

In [7]:
q1 = """
select pk_mathshistory, "name", url, dates, length(biography) as eff, biography 
from astronomers.mathshistory m 
where pk_mathshistory in (103, 117, 133, 159, 186);
"""

In [8]:
result = pgf.sql_explore(q1, conn)
# print(f'Lines count: {len(result[0])}, errors count: {len(result[1])}, \nFirst lines: {result[0][:5]}')

In [9]:
textes = pd.DataFrame(result[0])
textes.columns = ['id', 'name', 'url', 'dates', 'length_bio', 'texte']

In [10]:
textes.head()

| | id | name | url | dates | length_bio | texte |
|---|---|---|---|---|---|---|
| 0 | 103 | Christopher Clavius | https://mathshistory.st-andrews.ac.uk/Biograph... | 1538-1612 | 10616 | Christopher Clavius was born in a German regio... |
| 1 | 117 | Michael Mästlin | https://mathshistory.st-andrews.ac.uk/Biograph... | 1550-1631 | 10122 | Michael Mästlin was born in Göppingen which wa... |
| 2 | 133 | Giuseppe Biancani | https://mathshistory.st-andrews.ac.uk/Biograph... | 1566-1624 | 10035 | Giuseppe Biancani's name also appears in its L... |
| 3 | 159 | Wilhelm Schickard | https://mathshistory.st-andrews.ac.uk/Biograph... | 1592-1635 | 10272 | Wilhelm Schickard's name is sometimes written ... |
| 4 | 186 | Johannes Hevelius | https://mathshistory.st-andrews.ac.uk/Biograph... | 1611-1687 | 10856 | The first problem that we have to address is t... |


In [11]:
### Choose one document
tx = textes.iloc[1].texte 
print(tx)

Michael Mästlin was born in Göppingen which was a village about 50 km east of Tübingen. His father, Jakob Mästlin, and his mother, Dorothea Simon, were both devout Lutherans and Michael was brought up in that faith and remained strongly committed to it throughout his life. He was the middle child of the family, having an older sister and a younger brother. He attended the monastic school in Königsbronn then, after his studies there, entered Tübingen University in 1568. [3]:-
As was the case with many young scholars including Kepler, his most famous student, [Mästlin] did his undergraduate studies at a preparatory school and came to the university to take his final exams and pick up his baccalaureate degree.At Tübingen University he studied mathematics and astronomy for a Master's degree under Philipp Apian who was Peter Apian's son. In 1570, while a student, he purchased a copy of Copernicus's De revolutionibus from the widow of Victorin Strigel, who had been professor of theology at L

## NLP pipeline with NeuralCoref

In [37]:
nlp = spacy.load('en_core_web_lg')

In [38]:
neuralcoref.add_to_pipe(nlp, max_dist=300, max_dist_match=4000)

<spacy.lang.en.English at 0x7f58f6d3a950>

In [40]:
doc = nlp(tx)

In [41]:
doc._.has_coref

True

In [42]:
new_text = doc._.coref_resolved
new_text[:600]

'Michael Mästlin was born in Göppingen which was a village about 50 km east of Tübingen. Michael Mästlin father, Jakob Mästlin, and Michael Mästlin mother, Dorothea Simon, were both devout Lutherans and Michael was brought up in that faith and remained strongly committed to it throughout Michael Mästlin life. Michael Mästlin was the middle child of the family, having an older sister and a younger brother. Michael Mästlin attended the monastic school in Königsbronn then, after Michael Mästlin studies there, entered Tübingen University in 1568. [3]:-\nAs was the case with many young scholars inclu'

In [43]:
doc._.coref_clusters[:13]

[Michael Mästlin: [Michael Mästlin, His, his, his, He, He, his, his, his, his, his, he, Mästlin, he, Mästlin, he, He, he, Mästlin, Mästlin, his, he, Mästlin, Mästlin, his, He, he, Mästlin, Mästlin, his, he, his, He, he, He, he, his, he, Mästlin, him, he, Mästlin, Mästlin, Mästlin, Mästlin, He, he, He, he, Mästlin, he, his, Mästlin, he, he, him, he, his, Mästlin, his],
 Göppingen: [Göppingen, Göppingen],
 Tübingen: [Tübingen, Tübingen, Tübingen, Tübingen, Tübingen, Tübingen, Tübingen, Tübingen],
 Tübingen University: [Tübingen University, the university, Tübingen University, Tübingen University, Tübingen University's],
 Copernicus: [Copernicus, Copernicus],
 Backnang: [Backnang, Backnang],
 three sons and three daughters: [three sons and three daughters, their],
 Margarete Burkhardt, daughter of a professor at Tübingen: [Margarete Burkhardt, daughter of a professor at Tübingen, they],
 He: [He, he, his, he, his, his, his, he, his, he, his, He, his, his, his, he],
 Heidelberg: [Heidelber
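
One cluster in this output has the bare pronoun `He` as its main mention, which is a sign that the cluster was not attached to a named entity. A small sketch for flagging such clusters, using an abbreviated plain-text copy of the output above. `PRONOUNS` and `pronoun_headed` are hypothetical names introduced here, and the pronoun list is my own, not something NeuralCoref provides:

```python
# Hypothetical pronoun list for this check (not part of NeuralCoref).
PRONOUNS = {
    "he", "him", "his", "she", "her", "hers",
    "it", "its", "they", "them", "their", "theirs",
}


def pronoun_headed(clusters):
    """Return the main mentions that are bare pronouns."""
    return [main for main, _mentions in clusters if main.lower() in PRONOUNS]


# Abbreviated copy of the clusters printed above: (main mention, mentions).
# With a real document, this list could be built as
# [(c.main.text, [m.text for m in c.mentions]) for c in doc._.coref_clusters]
clusters = [
    ("Michael Mästlin", ["Michael Mästlin", "His", "his", "his", "He"]),
    ("Göppingen", ["Göppingen", "Göppingen"]),
    ("Tübingen University", ["Tübingen University", "the university"]),
    ("He", ["He", "he", "his", "he"]),
]

print(pronoun_headed(clusters))  # ['He']
```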

In [44]:
new_text[2250:]

'; Margarete Burkhardt, daughter of a professor at Tübingen had eight children.\n\nMichael Mästlin is famous for Michael Mästlin excellent, very accurate, observations of the comet of 1577, observed while Michael Mästlin was in Backnang, and published in Tübingen in 1578 as Observatio et demonstratio cometae aetherae qui anno 1577 et 1578 constitutus in sphaera Veneris apparuit cum admirandius eius passionibus varietate scilicet motus loco orbe distantia a terro centro etc. adhibitis demonstrationibus geometricis et calculo arithmetico cuius modi de alio quoquam cometa nunquam visa est (Observations and demonstrations of the ethereal comets of the years 1577 and 1578). We discuss this achievement in more detail below. He remained in Backnang for four years, then was appointed as professor of mathematics at the University of Heidelberg in 1580. There He published the first edition of He famous astronomy textbook Epitome astronomiae (1582) - He published six further editions of this popu
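
The odd phrases in this output ("There He published", "He famous astronomy textbook") are what one gets when every mention of a cluster is replaced by the cluster's main mention and that main mention is itself a pronoun. A simplified character-offset sketch of this substitution. It is not NeuralCoref's implementation (which works on tokens), `resolve` is a hypothetical name, and the example sentence is a short one modelled on the output above:

```python
import re


def resolve(text, clusters):
    """Replace every mention by its cluster's main mention text.

    clusters: list of (main_text, [(start, end), ...]) with character offsets.
    """
    replacements = []
    for main_text, spans in clusters:
        for start, end in spans:
            if text[start:end] != main_text:
                replacements.append((start, end, main_text))
    # Apply from right to left so that earlier offsets stay valid.
    for start, end, main_text in sorted(replacements, reverse=True):
        text = text[:start] + main_text + text[end:]
    return text


# Example sentence modelled on the resolved text above (not a database row).
text = "There he published the first edition of his famous astronomy textbook."
spans = [m.span() for m in re.finditer(r"\b(?:he|his)\b", text)]

# Cluster headed by a pronoun: the substitution makes the text worse.
print(resolve(text, [("He", spans)]))
# Cluster headed by the name: readable, but the possessive form is lost.
print(resolve(text, [("Michael Mästlin", spans)]))
```

The second call also shows why the resolved text reads "Michael Mästlin father" rather than "Michael Mästlin's father": the possessive pronoun is replaced by the plain main mention.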

In [25]:
cor_sco = [[i,v] for i,v in doc._.coref_scores.items()]
cor_sco[:5]

[[Michael Mästlin, {Michael Mästlin: -0.00941610336303711}],
 [Göppingen,
  {Göppingen: 1.6938384771347046, Michael Mästlin: -1.7341481447219849}],
 [a village about 50 km east of Tübingen,
  {a village about 50 km east of Tübingen: 1.791876196861267,
   Michael Mästlin: -1.529648780822754,
   Göppingen: -1.5075124502182007}],
 [50 km,
  {50 km: 1.8876539468765259,
   Michael Mästlin: -1.6026262044906616,
   Göppingen: -1.5367844104766846,
   a village about 50 km east of Tübingen: -1.5255197286605835}],
 [Tübingen,
  {Tübingen: 1.6877461671829224,
   Michael Mästlin: -1.7620389461517334,
   Göppingen: -1.5420584678649902,
   a village about 50 km east of Tübingen: -1.5022647380828857,
   50 km: -1.520662546157837}]]
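
To read these scores, I assume that the score a mention has with itself stands for the "no antecedent" option, so a mention is linked only when another candidate scores higher. A sketch under that assumption, keyed by plain text instead of spaCy spans. `best_antecedent` is a hypothetical helper, the first three entries are rounded from the output above, and the `His` entry uses made-up values to show a linked case:

```python
def best_antecedent(scores):
    """Map each mention to its highest-scoring candidate.

    A mention whose best candidate is itself is reported as None,
    i.e. it is assumed to start a new entity.
    """
    result = {}
    for mention, candidates in scores.items():
        best = max(candidates, key=candidates.get)
        result[mention] = None if best == mention else best
    return result


scores = {
    # Rounded from the notebook output above.
    "Michael Mästlin": {"Michael Mästlin": -0.009},
    "Göppingen": {"Göppingen": 1.694, "Michael Mästlin": -1.734},
    "50 km": {
        "50 km": 1.888,
        "Michael Mästlin": -1.603,
        "Göppingen": -1.537,
        "a village about 50 km east of Tübingen": -1.526,
    },
    # Hypothetical values, NOT from the notebook output.
    "His": {"His": -0.8, "Michael Mästlin": 2.1},
}

for mention, antecedent in best_antecedent(scores).items():
    print(f"{mention!r} -> {antecedent!r}")
```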

In [30]:
### Produce a list of the sentences in the document and add an index to each sentence 
sents = [[i, s, list(s.noun_chunks)] for i, s in enumerate(doc.sents)]
len(sents)

82

In [31]:
for sent in sents[:5]:
    print('----\n',sent)

----
 [0, Michael Mästlin was born in Göppingen which was a village about 50 km east of Tübingen., [Michael Mästlin, Göppingen, a village, Tübingen]]
----
 [1, His father, Jakob Mästlin, and his mother, Dorothea Simon, were both devout Lutherans and Michael was brought up in that faith and remained strongly committed to it throughout his life., [His father, Jakob Mästlin, his mother, Dorothea Simon, devout Lutherans, Michael, that faith, it, his life]]
----
 [2, He was the middle child of the family, having an older sister and a younger brother., [He, the middle child, the family, an older sister, a younger brother]]
----
 [3, He attended the monastic school in Königsbronn then, after his studies there, entered Tübingen University in 1568., [He, the monastic school, Königsbronn, his studies, Tübingen University]]
----
 [4, [3]:-
As was the case with many young scholars including Kepler, his most famous student, [Mästlin] did his undergraduate studies at a preparatory school and came to