## Creating the conda-pip environment

See the [instructions](./instructions.md).


In [1]:
import spacy
import psycopg2
import pandas as pd
from spacy.tokens import Span
from spacy import displacy
from time import strftime, gmtime

In [2]:
from itables import init_notebook_mode, show
import re
from importlib import reload

In [3]:
import coreferee

In [4]:
import warnings
warnings.filterwarnings('ignore')

In [5]:
import postgresql_functions as pgf
import settings as stt

In [6]:
# reload(pgf)

In [7]:
### connect to the local database
conn = psycopg2.connect(host="localhost", port = 5432, database="espace_intellectuel", 
                        user="postgres", password=stt.dbw)
#conn

In [8]:
q1 = """
select pk_mathshistory, biography, length(biography) AS size
from mathshistory.mathshistory m
order by pk_mathshistory;
"""

In [9]:
result = pgf.sql_explore(q1, conn)
# print(f'Lines count: {len(result[0])}, errors count: {len(result[1])}, \nFirst lines: {result[0][:5]}')
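
The `postgresql_functions` module is not shown in this notebook. Judging from how `result[0]` and `result[1]` are used, `pgf.sql_explore` appears to return the fetched rows together with a list of errors. A minimal sketch of such a helper, under that assumption (the name `sql_explore_sketch` and its body are hypothetical, not the author's code):

```python
def sql_explore_sketch(query, conn):
    """Hypothetical stand-in for pgf.sql_explore: run a query on a DB-API
    connection and return (rows, errors)."""
    rows, errors = [], []
    cur = conn.cursor()
    try:
        cur.execute(query)
        rows = cur.fetchall()
    except Exception as e:
        # Keep the exception for inspection and reset the transaction
        errors.append(e)
        conn.rollback()
    finally:
        cur.close()
    return rows, errors
```

Because it relies only on the DB-API cursor interface, the sketch should work with a `psycopg2` connection such as `conn` as well as with other DB-API drivers.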

In [10]:
textes = pd.DataFrame(result[0])
textes.columns = ['id', 'texte', 'size']

In [11]:
textes.head()

| | id | texte | size |
|---|---|---|---|
| 0 | 1 | Nothing was known about al-Nasawi in Europe un... | 4282 |
| 1 | 2 | Jia Xian is also known as Chia Hsien. Almost n... | 5492 |
| 2 | 3 | Hermann of Reichenau is also called Hermann th... | 12391 |
| 3 | 4 | Sripati's father was Nagadeva (sometimes writt... | 3575 |
| 4 | 5 | We note that there are several versions of al-... | 12675 |


In [12]:
len(textes), len(textes[textes['size'] > 1500])

(3010, 2927)

In [13]:
### Choose one document
txt = textes.iloc[1].texte 
print(txt[:300])

Jia Xian is also known as Chia Hsien. Almost nothing is known about his life. It is recorded that he was a pupil of Chu Yan who was a famous calendarist, astronomer and mathematician. We know that Chu Yan was productive over the years 1022 to 1054 so he must have tutored Jia Xian at some time betwee


In [13]:
#txt = "We have quoted above from Biancani concerning his high regard for Galileo. However, he did not always agree with Galileo's views. The first disagreement came in 1611 and concerned the mountains on the moon. Galileo had observed the surface of the moon through a telescope in 1609 and had used certain mathematical techniques to prove that there were lunar mountains. His claim appeared in Sidereus Nuncius published in May 1610. In May 1611 a group of scientists, mostly Jesuits, was brought together by cardinal Ferdinando Gonzaga in Mantua to discuss Galileo's claims. One of the major points discussed was Galileo's proof that there were mountains on the moon, and the report from the group came down firmly in favour of the traditional belief that the moon was perfectly smooth. Galileo suspected that Biancani was the author of the report and letters were exchanged in which Biancani dissociated himself from any insult towards Galileo saying that he was sorry if he had been offended but, nevertheless, pointing out that he did believe that the moon was perfectly smooth. He also disagreed with Galileo in 1613 when a dispute broke out between Galileo and Christoph Scheiner over sunspots. Galileo unfairly accused Scheiner of plagiarism but, although Scheiner's discovery of sunspots was certainly independent of any work by Galileo, his explanation was quite wrong. Biancani, however, defended his fellow Jesuit Scheiner."

In [14]:
#print(txt)

## Coreferee

In [19]:
nlp = spacy.load('en_core_web_lg')

In [20]:
### https://spacy.io/universe/project/coreferee
nlp.add_pipe('coreferee')

<coreferee.manager.CorefereeBroker at 0x7f807ffb9480>

In [21]:
spacy.info()

{'spacy_version': '3.5.3',
 'location': '/home/francesco/miniconda3/envs/py310_pip_spacy/lib/python3.10/site-packages/spacy',
 'platform': 'Linux-5.14.0-1059-oem-x86_64-with-glibc2.31',
 'python_version': '3.10.11',
 'pipelines': {'en_core_web_lg': '3.1.0', 'en_core_web_trf': '3.5.0'}}

In [37]:
doc = nlp(txt)

In [38]:
doc._.coref_chains.print()

0: Xian(1), his(14), he(21), Xian(57), Xian(74), Xian(92)
1: Yan(27), Yan(42), he(52), Yan(71), his(80)
2: years(47), years(63)
3: Emperor(110), his(125), women(137), men(142), Emperor(162)
4: men(118), their(152), they(156)
5: Both(215), its(231), it(240)
6: Hui(256), Hui(297), Hui(320)
7: Suanfa(260), he(303), his(306)
8: work(288), work(311)
9: Xian(325), Xian(342), Xian(371), Xian(419), He(473), Xian(520)
10: Pascal(336), Pascal(366), Pascal(391)
11: coefficients(378), coefficients(397)
12: contribution(451), it(466)
13: algorithm(454), algorithm(510)
14: method(476), method(500)
15: method(528), method(534)
16: method(546), method(557)
17: x3=146363183x^{3(561), x^{3(570)
18: lt(573), lt(581), lt(585)
19: Qiujian(843), she(849)
20: Xian(853), Xian(908), Xian(961), Xian(1011)
21: examination(866), It(897)
22: Samawal(881), Samawal(914), Samawal(965), Samawal(994)
23: approximation(925), approximation(945)


In [None]:
### Produce resolved text
# https://stackoverflow.com/questions/75204212/spacy-coreferee-how-to-cleanly-extract-coreferenced-text

resolved_text = ""

for token in doc[:]:
  
    repres = doc._.coref_chains.resolve(token)
    # print(repres)
    if repres:
        c = " and ".join([t.text for t in repres])
        # resolved_text += " " + c
        resolved_text += c + " "
        # print(c)
    else:
        #resolved_text += " " + token.text
        resolved_text += token.text_with_ws

print(resolved_text)

In [68]:
def resolve_text(doc):
    
    resolved_text = ""
    
    for token in doc:
  
        repres = doc._.coref_chains.resolve(token)
        # print(repres)
        if repres:
            c = " and ".join([t.text for t in repres])
            # resolved_text += " " + c
            resolved_text += c + " "
            # print(c)
        else:
            #resolved_text += " " + token.text
            resolved_text += token.text_with_ws
        
    return resolved_text    
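
`resolve_text` always appends a single space after a replaced mention, which can leave a gap before punctuation or before a possessive `'s` in the result. A possible variant, sketched here as an assumption rather than as the version used for the database import, reuses the trailing whitespace of the original token (spaCy's `Token.whitespace_`). The name `resolve_text_keep_ws` is hypothetical.

```python
def resolve_text_keep_ws(doc):
    """Variant of resolve_text that keeps each token's own trailing whitespace."""
    parts = []
    for token in doc:
        # resolve() returns the tokens the mention refers to, or None
        repres = doc._.coref_chains.resolve(token)
        if repres:
            replacement = " and ".join(t.text for t in repres)
            # Reuse the original spacing instead of forcing a space
            parts.append(replacement + token.whitespace_)
        else:
            parts.append(token.text_with_ws)
    return "".join(parts)
```

This variant changes only the spacing. Possessive pronouns are still replaced by the bare noun (for example `his life` becomes `Xian life`), exactly as in `resolve_text`.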

In [70]:
resolve_text(doc)[:1000]

"Jia Xian is also known as Chia Hsien. Almost nothing is known about Xian life. It is recorded that Xian was a pupil of Chu Yan who was a famous calendarist, astronomer and mathematician. We know that Chu Yan was productive over the years 1022 to 1054 so Yan must have tutored Jia Xian at some time between these years . Other evidence would suggest that Chu Yan taught Jia Xian fairly near the beginning of Yan career.\n\nAccording to Qian [3], Jia Xian was a Palace Eunuch of the Left Duty Group. This requires a little explanation. The Emperor of China would employ eunuchs, castrated men, as guards and servants in Emperor Palace. Although the original role was that of guarding the Emperor 's quarters, these Emperor achieved real power and influence. In addition to men role as guards men became confidential advisers to the Emperor , and sometimes government ministers.\n\nJia Xian is known to have written two mathematics books: Huangdi Jiuzhang Suanjing Xicao (The Yellow Emperor's detailed 

## Import into database

In [None]:
### Only texts longer than 1500 characters were kept!
# the others will have to be processed later

ll = textes[textes['size'] > 1500].values.tolist()
len(ll), ll[:2]

UPDATE astronomers.mathshistory SET coreferenced_txt = null;

In [75]:

### Guard flag (added here) to avoid disruption:
### set it to True to actually run the update
RUN_UPDATE = False

if RUN_UPDATE:
    for t in ll:

        error = []

        doc = nlp(t[1])

        rt = resolve_text(doc)
        # print(type(rt), len(rt))

        with conn.cursor() as curs:
            try:
                ### parameterized query: psycopg2 takes care of quoting the text
                qs = """
                UPDATE astronomers.mathshistory SET coreferenced_txt = %s
                WHERE pk_mathshistory = %s;
                """

                curs.execute(qs, (rt, t[0]))
                conn.commit()
            except Exception as e:
                error.append([t[0], e])
                # print(error)
                with open('spacy/logs_errors_coreferenced.txt', 'a') as f:
                    f.write(f'{str(error)} — Error — {strftime("%a, %d %b %Y %H:%M:%S +0000", gmtime())}\n\n')
                conn.rollback()

### Errors

Three texts could not be processed: 2527, 2827, 2876 (see the log file logs_errors_coreferenced.txt).

[22 June 2023] I decided not to process them for the moment.