# Discover spaCy plugins and their functions

Run this notebook in the `pip_spacy` Conda environment. It covers the following plugins:

* CrossLingual
* Coreferee
* ClauCy
* ConcepCy

In [1]:
import spacy
import psycopg2
import pandas as pd
from spacy.tokens import Span
from spacy import displacy


In [2]:
from itables import init_notebook_mode, show
import re
from importlib import reload

In [3]:
import coreferee

In [4]:
import warnings
warnings.filterwarnings('ignore')

In [5]:
import postgresql_functions as pgf
import settings as stt

In [6]:
# reload(pgf)

In [7]:
### Connect to the local database
conn = psycopg2.connect(host="localhost", port=5432, database="espace_intellectuel",
                        user="postgres", password=stt.dbw)
# conn

In [10]:
q1 = """
select pk_mathshistory, "name", url, dates, length(biography) as eff, biography 
from mathshistory.mathshistory m 
where pk_mathshistory in (103, 117, 133, 159, 186);
"""

In [11]:
result = pgf.sql_explore(q1, conn)
# print(f'Lines count: {len(result[0])}, errors count: {len(result[1])}, \nFirst lines: {result[0][:5]}')

In [12]:
textes = pd.DataFrame(result[0])
textes.columns = ['id', 'name', 'url', 'dates', 'length_bio', 'texte']

In [13]:
textes.head()

| | id | name | url | dates | length_bio | texte |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 103 | Christopher Clavius | https://mathshistory.st-andrews.ac.uk/Biograph... | 1538-1612 | 10616 | Christopher Clavius was born in a German regio... |
| 1 | 117 | Michael Mästlin | https://mathshistory.st-andrews.ac.uk/Biograph... | 1550-1631 | 10122 | Michael Mästlin was born in Göppingen which wa... |
| 2 | 133 | Giuseppe Biancani | https://mathshistory.st-andrews.ac.uk/Biograph... | 1566-1624 | 10035 | Giuseppe Biancani's name also appears in its L... |
| 3 | 159 | Wilhelm Schickard | https://mathshistory.st-andrews.ac.uk/Biograph... | 1592-1635 | 10272 | Wilhelm Schickard's name is sometimes written ... |
| 4 | 186 | Johannes Hevelius | https://mathshistory.st-andrews.ac.uk/Biograph... | 1611-1687 | 10856 | The first problem that we have to address is t... |


In [14]:
### Choose one document
txt = textes.iloc[1].texte 
print(txt)

Michael Mästlin was born in Göppingen which was a village about 50 km east of Tübingen. His father, Jakob Mästlin, and his mother, Dorothea Simon, were both devout Lutherans and Michael was brought up in that faith and remained strongly committed to it throughout his life. He was the middle child of the family, having an older sister and a younger brother. He attended the monastic school in Königsbronn then, after his studies there, entered Tübingen University in 1568. [3]:-
As was the case with many young scholars including Kepler, his most famous student, [Mästlin] did his undergraduate studies at a preparatory school and came to the university to take his final exams and pick up his baccalaureate degree.At Tübingen University he studied mathematics and astronomy for a Master's degree under Philipp Apian who was Peter Apian's son. In 1570, while a student, he purchased a copy of Copernicus's De revolutionibus from the widow of Victorin Strigel, who had been professor of theology at L

In [15]:
#txt = "We have quoted above from Biancani concerning his high regard for Galileo. However, he did not always agree with Galileo's views. The first disagreement came in 1611 and concerned the mountains on the moon. Galileo had observed the surface of the moon through a telescope in 1609 and had used certain mathematical techniques to prove that there were lunar mountains. His claim appeared in Sidereus Nuncius published in May 1610. In May 1611 a group of scientists, mostly Jesuits, was brought together by cardinal Ferdinando Gonzaga in Mantua to discuss Galileo's claims. One of the major points discussed was Galileo's proof that there were mountains on the moon, and the report from the group came down firmly in favour of the traditional belief that the moon was perfectly smooth. Galileo suspected that Biancani was the author of the report and letters were exchanged in which Biancani dissociated himself from any insult towards Galileo saying that he was sorry if he had been offended but, nevertheless, pointing out that he did believe that the moon was perfectly smooth. He also disagreed with Galileo in 1613 when a dispute broke out between Galileo and Christoph Scheiner over sunspots. Galileo unfairly accused Scheiner of plagiarism but, although Scheiner's discovery of sunspots was certainly independent of any work by Galileo, his explanation was quite wrong. Biancani, however, defended his fellow Jesuit Scheiner."

In [16]:
#print(txt)

## CrossLingual

* https://spacy.io/universe/project/crosslingualcoreference
* https://github.com/davidberenstein1957/crosslingual-coreference

In [17]:
nlp = spacy.load('en_core_web_lg')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/francesco/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [18]:
nlp.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": -1, "model_name": "minilm"}
)

Some weights of the model checkpoint at nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large were not used when initializing XLMRobertaModel: ['lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaModel were not initialized from the model checkpoint at nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-st

<crosslingual_coreference.CrossLingualPredictorSpacy.CrossLingualPredictorSpacy at 0x7f990fa08a30>

In [19]:
doc = nlp(txt)

In [20]:
print(doc._.coref_clusters[:3])

[[[0, 15], [88, 91], [119, 122], [178, 185], [264, 267], [274, 276], [359, 361], [418, 421], [578, 581], [663, 666], [691, 694], [739, 741], [871, 873], [1044, 1051], [1139, 1141], [1370, 1377], [1601, 1603], [1707, 1709], [1780, 1782], [1934, 1941], [2039, 2046], [2061, 2064], [2141, 2143], [2320, 2324]], [[0, 15], [88, 91], [119, 122], [178, 185], [264, 267], [274, 276], [359, 361], [418, 421], [578, 581], [663, 666], [691, 694], [739, 741], [871, 873], [1044, 1051], [1139, 1141], [1370, 1377], [1601, 1603], [1707, 1709], [1780, 1782], [1934, 1941], [2039, 2046], [2061, 2064], [2141, 2143], [2320, 2324], [2579, 2584]], [[78, 86], [1416, 1424], [1696, 1705], [2003, 2011], [2178, 2186]]]
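
In the output above, the first cluster is entirely contained in the second one (which adds a single mention, `[2579, 2584]`). A minimal sketch of one way to drop such redundant clusters before further processing follows. The helper name `drop_subset_clusters` is hypothetical, and the sample data is a shortened stand-in for `doc._.coref_clusters[:3]`, not the full output.

```python
def drop_subset_clusters(clusters):
    """Keep only clusters whose mentions are not all contained in another cluster."""
    as_sets = [frozenset(map(tuple, c)) for c in clusters]
    kept = []
    for i, current in enumerate(as_sets):
        # A cluster is redundant if another cluster strictly contains it,
        # or if an identical cluster already appeared earlier in the list.
        redundant = any(
            (current < other) or (current == other and j < i)
            for j, other in enumerate(as_sets)
            if j != i
        )
        if not redundant:
            kept.append(clusters[i])
    return kept


# Shortened stand-in for doc._.coref_clusters[:3] (first mentions only)
clusters = [
    [[0, 15], [88, 91], [119, 122]],
    [[0, 15], [88, 91], [119, 122], [2579, 2584]],
    [[78, 86], [1416, 1424]],
]
deduped = drop_subset_clusters(clusters)
print(len(deduped))
```

Whether the larger or the smaller cluster is the better one to keep is a judgment call; this sketch simply keeps the superset.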


In [21]:
txt_res = doc._.resolved_text
print(txt_res[2300:3500])

ästlin comets married Margarete Burkhardt, daughter of a professor at Tübingen; they had eight children.

Michael Mästlin comets famous for Michael Mästlin's excellent, very accurate, observations Backnang the comet of 1577, observed while Michael Mästlin children in Backnang, and published in Tübingen in 1578 as Observatio et demonstratio cometae aetherae qui anno 1577 et 1578 constitutus in sphaera Veneris apparuit cum admirandius Michael Mästlin passionibus varietate scilicet motus loco orbe distantia a terro centro etc. adhibitis demonstrationibus geometricis et calculo arithmetico cuius modi children alio Backnang cometa nunquam visa est (Observations and demonstrations of the ethereal comets of the Michael Mästlin 1577 and 1578). We discuss this achievement in more detail below. He remained in Backnang for four comets, then was appointed as comets of mathematics at the University of Heidelberg in 1580comets There he published the first then was appointed as professor of mathemati

In [22]:
print(doc._.coref_chains)

None


In [23]:
len(doc), len(txt_res)

(1893, 11686)

In [24]:
doc[0], doc[len(doc)-2]

(Michael, himself)

In [25]:
info = []
for token in doc:
    info.append((token.idx, token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
                token.shape_, token.is_alpha, token.is_stop))
tx_df = pd.DataFrame(info, columns = ["idx","text", "lemma_", "pos_", "tag_", "dep_","shape_", "is_alpha", "is_stop"])

In [26]:
### https://github.com/mwouts/itables/blob/main/docs/advanced_parameters.md
show(tx_df[tx_df.idx >= 0], classes="display", scrollY="400px", scrollCollapse=True, paging=False, column_filters="footer", dom="lrtip")



In [28]:
cc = doc._.coref_clusters[0]
cc[:7]

[[0, 15], [88, 91], [119, 122], [178, 185], [264, 267], [274, 276], [359, 361]]

In [29]:
len(doc)

1893

In [33]:
doc[:300]

Michael Mästlin was born in Göppingen which was a village about 50 km east of Tübingen. His father, Jakob Mästlin, and his mother, Dorothea Simon, were both devout Lutherans and Michael was brought up in that faith and remained strongly committed to it throughout his life. He was the middle child of the family, having an older sister and a younger brother. He attended the monastic school in Königsbronn then, after his studies there, entered Tübingen University in 1568. [3]:-
As was the case with many young scholars including Kepler, his most famous student, [Mästlin] did his undergraduate studies at a preparatory school and came to the university to take his final exams and pick up his baccalaureate degree.At Tübingen University he studied mathematics and astronomy for a Master's degree under Philipp Apian who was Peter Apian's son. In 1570, while a student, he purchased a copy of Copernicus's De revolutionibus from the widow of Victorin Strigel, who had been professor of theology at L

In [None]:
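### coref_clusters appears to hold character offsets [start, end), not token indices,
### so Span(doc, start, end) fails; doc.char_span() maps the offsets to tokens
for s in cc:
    print(s[0], s[1], doc.char_span(s[0], s[1], alignment_mode="expand"))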

In [None]:
### The cluster entries appear to be character offsets, so build the spans with doc.char_span()
spans = []
for idx, cluster in enumerate(doc._.coref_clusters):
    for start, end in cluster:
        span = doc.char_span(start, end, label=str(idx), alignment_mode="expand")
        if span is not None:
            spans.append(span)

doc.spans["custom"] = spans
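
The cluster entries look like character offsets rather than token indices: `[0, 15]` has exactly the length of "Michael Mästlin", and the document has only 1893 tokens while the offsets go well beyond that. `Span(doc, start, end)` expects token indices, which would explain the failure. The mapping from character offsets to a token range can be sketched in plain Python as below. The helper `char_to_token_range` is hypothetical, and whitespace splitting stands in for spaCy's tokenizer.

```python
from bisect import bisect_left, bisect_right


def char_to_token_range(token_starts, token_ends, start_char, end_char):
    """Return (first_token, last_token_exclusive) covering [start_char, end_char)."""
    # First token whose end lies beyond start_char
    first = bisect_right(token_ends, start_char)
    # First token that starts at or after end_char
    last = bisect_left(token_starts, end_char)
    return first, last


text = "Michael Mästlin was born in Göppingen."
words = text.split(" ")

# Whitespace tokenisation as a stand-in for spaCy's tokenizer (assumption)
token_starts, token_ends = [], []
pos = 0
for word in words:
    token_starts.append(pos)
    token_ends.append(pos + len(word))
    pos += len(word) + 1

first, last = char_to_token_range(token_starts, token_ends, 0, 15)
print(first, last, words[first:last])
```

Within spaCy itself, `doc.char_span(start, end, alignment_mode="expand")` performs this kind of conversion and returns a `Span` directly.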

## Coreferee

In [69]:
nlp = spacy.load('en_core_web_lg')

In [70]:
### https://spacy.io/universe/project/coreferee
nlp.add_pipe('coreferee')

<coreferee.manager.CorefereeBroker at 0x7f97c0523fd0>

In [38]:
doc = nlp(txt)

In [39]:
doc._.coref_chains.print()

0: Mästlin(1), His(17), his(24), his(49), He(52), He(70), his(80), his(104), Mästlin(110), his(113), his(127), his(133), he(140), he(167), Mästlin(199), he(216), Mästlin(258), he(302), He(325), he(340), Mästlin(369), Mästlin(388), his(392), he(407), He(489), he(513), his(519), he(529), his(539), his(543), his(556), he(585), his(588), he(593), his(598), Mästlin(606), He(617), his(623), his(626), Mästlin(629), his(635), He(641), he(647), Mästlin(667), Mästlin(682), his(693), he(706), his(719), He(725), he(740), He(748), he(780), his(784), he(801), Mästlin(809), him(812), he(817), Mästlin(842), Mästlin(855)
1: father(18), it(47)
2: University(86), University(139)
3: Master(147), Master(203)
4: Tables(225), Tables(234)
5: 1571(230), them(245)
6: Backnang(314), Backnang(346)
7: 1551(330), they(348), their(365), they(382)
8: Tübingen(380), Tübingen(415)
9: Tübingen(591), Tübingen(604)
10: man(645), it(660)
11: Kepler(680), Kepler(700)
12: comet(736), comet(753)
13: Copernicus(743), Copernicu

In [40]:
doc[1487]

he

In [41]:
print(doc._.coref_chains.resolve(doc[1487]))

[Mästlin]


In [42]:
doc._.coref_chains.deserialize_obj()

[0: [1], [17], [24], [49], [52], [70], [80], [104], [110], [113], [127], [133], [140], [167], [199], [216], [258], [302], [325], [340], [369], [388], [392], [407], [489], [513], [519], [529], [539], [543], [556], [585], [588], [593], [598], [606], [617], [623], [626], [629], [635], [641], [647], [667], [682], [693], [706], [719], [725], [740], [748], [780], [784], [801], [809], [812], [817], [842], [855], 1: [18], [47], 2: [86], [139], 3: [147], [203], 4: [225], [234], 5: [230], [245], 6: [314], [346], 7: [330], [348], [365], [382], 8: [380], [415], 9: [591], [604], 10: [645], [660], 11: [680], [700], 12: [736], [753], 13: [743], [778], 14: [833], [850], [924], 15: [876], [901], [915], [955], [961], [991], [995], [1022], [1032], [1046], [1051], [1059], 16: [881], [958], 17: [908], [953], 18: [1043], [1056], 19: [1088], [1109], [1117], [1161], [1199], [1217], [1225], [1241], [1304], [1319], [1329], [1341], [1353], [1366], [1378], [1394], [1408], 20: [1149], [1158], 21: [1249], [1260], 2

In [43]:
tks = [[i, tk] for i, tk in enumerate(doc)]
len(tks), tks[0]

(1893, [0, Michael])

In [44]:
tks[1487]

[1487, he]

In [47]:
### Produce resolved text
# https://stackoverflow.com/questions/75204212/spacy-coreferee-how-to-cleanly-extract-coreferenced-text

resolved_text = ""

for token in doc[:]:
  
    repres = doc._.coref_chains.resolve(token)
    # print(repres)
    if repres:
        c = " and ".join([t.text for t in repres])
        # resolved_text += " " + c
        resolved_text += c + " "
        # print(c)
    else:
        # resolved_text += " " + token.text
        resolved_text += token.text_with_ws

print(resolved_text[:2000])

Michael Mästlin was born in Göppingen which was a village about 50 km east of Tübingen. Mästlin father, Jakob Mästlin, and Mästlin mother, Dorothea Simon, were both devout Lutherans and Michael was brought up in that faith and remained strongly committed to father throughout Mästlin life. Mästlin was the middle child of the family, having an older sister and a younger brother. Mästlin attended the monastic school in Königsbronn then, after Mästlin studies there, entered Tübingen University in 1568. [3]:-
As was the case with many young scholars including Kepler, Mästlin most famous student, [Mästlin ] did Mästlin undergraduate studies at a preparatory school and came to the university to take Mästlin final exams and pick up Mästlin baccalaureate degree.At Tübingen University Mästlin studied mathematics and astronomy for a Master's degree under Philipp Apian who was Peter Apian's son. In 1570, while a student, Mästlin purchased a copy of Copernicus's De revolutionibus from the widow of 
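
In the resolved text above, possessive pronouns become bare names ("Mästlin father") and the fixed trailing space produces "[Mästlin ]". One possible refinement is sketched here on plain tuples rather than spaCy tokens, so that it runs without a model. The names `tokens`, `antecedents` and `resolve_tokens` are hypothetical stand-ins: each tuple mimics `(token.text, token.tag_, token.whitespace_)`, and the dictionary mimics what `doc._.coref_chains.resolve(token)` returns.

```python
def resolve_tokens(tokens, antecedents):
    """Replace resolved tokens by their antecedents, keeping the original whitespace."""
    parts = []
    for i, (text, tag, whitespace) in enumerate(tokens):
        names = antecedents.get(i)
        if names:
            replacement = " and ".join(names)
            # Possessive pronouns (Penn Treebank tag PRP$) get a genitive marker
            if tag == "PRP$":
                replacement += "'s"
            parts.append(replacement + whitespace)
        else:
            parts.append(text + whitespace)
    return "".join(parts)


# (text, tag, trailing whitespace) for "His father raised him."
tokens = [
    ("His", "PRP$", " "),
    ("father", "NN", " "),
    ("raised", "VBD", " "),
    ("him", "PRP", ""),
    (".", ".", ""),
]
# token index -> antecedent texts (assumed resolution result)
antecedents = {0: ["Mästlin"], 3: ["Mästlin"]}

result = resolve_tokens(tokens, antecedents)
print(result)
```

Reusing the token's own trailing whitespace instead of appending a space avoids the stray blank before closing brackets and punctuation.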

In [48]:
### Inspect the third sentence: start offset, noun chunks and available attributes
s = list(doc.sents)[2]
print(s.start_char)
print(list(s.noun_chunks))
print(dir(s))

274
[He, the middle child, the family, an older sister, a younger brother]
['_', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_fix_dep_copy', '_vector', '_vector_norm', 'as_doc', 'char_span', 'conjuncts', 'doc', 'end', 'end_char', 'ent_id', 'ent_id_', 'ents', 'get_extension', 'get_lca_matrix', 'has_extension', 'has_vector', 'id', 'id_', 'kb_id', 'kb_id_', 'label', 'label_', 'lefts', 'lemma_', 'n_lefts', 'n_rights', 'noun_chunks', 'orth_', 'remove_extension', 'rights', 'root', 'sent', 'sentiment', 'sents', 'set_extension', 'similarity', 'start', 'start_char', 'subtree', 'tensor', 'text', 'text_with_ws', 'to_array', 'vector', 'vector_norm', 'vocab']


In [49]:
### Produce a list of the sentences in the document and add an index to each sentence 
sents = [[i, s, [nc for nc in s.noun_chunks]]  for i, s in enumerate(doc.sents)]
len(sents)

74

In [50]:
for sent in sents[:5]:
    print('----\n',sent)

----
 [0, Michael Mästlin was born in Göppingen which was a village about 50 km east of Tübingen., [Michael Mästlin, Göppingen, a village, Tübingen]]
----
 [1, His father, Jakob Mästlin, and his mother, Dorothea Simon, were both devout Lutherans and Michael was brought up in that faith and remained strongly committed to it throughout his life., [His father, Jakob Mästlin, his mother, Dorothea Simon, both devout Lutherans, Michael, that faith, it, his life]]
----
 [2, He was the middle child of the family, having an older sister and a younger brother., [He, the middle child, the family, an older sister, a younger brother]]
----
 [3, He attended the monastic school in Königsbronn then, after his studies there, entered Tübingen University in 1568., [He, the monastic school, Königsbronn, his studies, Tübingen University]]
----
 [4, [3]:-
As was the case with many young scholars including Kepler, his most famous student, [Mästlin] did his undergraduate studies at a preparatory school and ca

In [51]:
info = []
for token in doc:
    info.append((token.text_with_ws,  token.lemma_, token.pos_, token.tag_, token.dep_,
                token.shape_, token.is_alpha, token.is_stop))
pd.DataFrame(info, columns = ["text", "lemma_", "pos_", "tag_", "dep_","shape_", "is_alpha", "is_stop"])

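| | text | lemma_ | pos_ | tag_ | dep_ | shape_ | is_alpha | is_stop |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Michael | Michael | PROPN | NNP | compound | Xxxxx | True | False |
| 1 | Mästlin | Mästlin | PROPN | NNP | nsubjpass | Xxxxx | True | False |
| 2 | was | be | AUX | VBD | auxpass | xxx | True | True |
| 3 | born | bear | VERB | VBN | ROOT | xxxx | True | False |
| 4 | in | in | ADP | IN | prep | xx | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1888 | this | this | DET | DT | dobj | xxxx | True | True |
| 1889 | all | all | DET | DT | dobj | xxx | True | True |
| 1890 | by | by | ADP | IN | prep | xx | True | True |
| 1891 | himself | himself | PRON | PRP | pobj | xxxx | True | True |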


In [53]:
doc._.coref_chains.chains[:20]

[0: [1], [17], [24], [49], [52], [70], [80], [104], [110], [113], [127], [133], [140], [167], [199], [216], [258], [302], [325], [340], [369], [388], [392], [407], [489], [513], [519], [529], [539], [543], [556], [585], [588], [593], [598], [606], [617], [623], [626], [629], [635], [641], [647], [667], [682], [693], [706], [719], [725], [740], [748], [780], [784], [801], [809], [812], [817], [842], [855],
 1: [18], [47],
 2: [86], [139],
 3: [147], [203],
 4: [225], [234],
 5: [230], [245],
 6: [314], [346],
 7: [330], [348], [365], [382],
 8: [380], [415],
 9: [591], [604],
 10: [645], [660],
 11: [680], [700],
 12: [736], [753],
 13: [743], [778],
 14: [833], [850], [924],
 15: [876], [901], [915], [955], [961], [991], [995], [1022], [1032], [1046], [1051], [1059],
 16: [881], [958],
 17: [908], [953],
 18: [1043], [1056],
 19: [1088], [1109], [1117], [1161], [1199], [1217], [1225], [1241], [1304], [1319], [1329], [1341], [1353], [1366], [1378], [1394], [1408]]
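
The printed chains list, for each chain number, the token indices of its mentions. To ask which chain a given token belongs to, an inverted lookup can be built. The sketch below works on plain nested lists copied from three of the chains shown above; `chains` and `index_mentions` are hypothetical names, not part of the Coreferee API.

```python
def index_mentions(chains):
    """Map each token index to the number of the chain that mentions it."""
    lookup = {}
    for chain_id, mentions in chains.items():
        for mention in mentions:
            # A mention is a list of token indices (usually a single one)
            for token_index in mention:
                lookup[token_index] = chain_id
    return lookup


# Chains 1, 2 and 7 as printed above, written as plain lists
chains = {
    1: [[18], [47]],
    2: [[86], [139]],
    7: [[330], [348], [365], [382]],
}
lookup = index_mentions(chains)
print(lookup.get(348), lookup.get(999))
```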

In [54]:
for t in doc[:10]:
    print(t)

Michael
Mästlin
was
born
in
Göppingen
which
was
a
village


## ClauCy

https://github.com/mmxgn/spacy-clausie


In [68]:
import claucy

In [71]:
claucy.add_to_pipe(nlp)    

In [58]:
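doc = nlp(txt)

### resolved_text is filled by the xx_coref component, which is not in this pipeline, so it stays None
txt_res = doc._.resolved_text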


In [67]:
print(txt_res)

None


In [76]:
### The resolved_text variable is defined above
doc = nlp(resolved_text)

In [77]:
sents = [s for s in doc.sents]

In [78]:
info = []
for token in sents[3]:
    info.append((token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
                token.shape_, token.is_alpha, token.is_stop))
tx_df  = pd.DataFrame(info, columns = ["text", "lemma_", "pos_", "tag_", "dep_","shape_", "is_alpha", "is_stop"])

In [79]:
### https://github.com/mwouts/itables/blob/main/docs/advanced_parameters.md
show(tx_df, classes="display", scrollY="200px", scrollCollapse=True, paging=False, column_filters="footer", dom="lrtip")



In [83]:
props = []
for d in sents[3]._.clauses:
    props.append(d.to_propositions(inflect=None))


In [84]:
len(props)

2

In [82]:
props[:10]

[[(Mästlin, attended, the monastic school, then),
  (Mästlin, attended, the monastic school)],
 [(Mästlin, entered, Tübingen University),
  (Mästlin, entered, Tübingen University, after Mästlin studies there),
  (Mästlin, entered, Tübingen University, in 1568)]]
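
The propositions come back as one list per clause, each holding tuples of varying length. For tabular inspection they can be flattened into uniform rows. This is a minimal sketch using plain strings copied from the output above; the helper `to_rows` and its column names are assumptions, and with real ClauCy output the tuple members would need the `str()` conversion shown here.

```python
def to_rows(props):
    """Flatten per-clause proposition tuples into dictionaries with fixed keys."""
    rows = []
    for clause_no, clause_props in enumerate(props):
        for prop in clause_props:
            # First two members are subject and verb, the rest are complements/adverbials
            subject, verb, *rest = (str(part) for part in prop)
            rows.append({
                "clause": clause_no,
                "subject": subject,
                "verb": verb,
                "complements": " ".join(rest),
            })
    return rows


# Plain-string copy of the propositions printed above
props = [
    [("Mästlin", "attended", "the monastic school", "then"),
     ("Mästlin", "attended", "the monastic school")],
    [("Mästlin", "entered", "Tübingen University"),
     ("Mästlin", "entered", "Tübingen University", "after Mästlin studies there"),
     ("Mästlin", "entered", "Tübingen University", "in 1568")],
]
rows = to_rows(props)
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(rows)`.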

## ConcepCy

A spaCy wrapper for ConceptNet, a freely-available semantic network designed to help computers understand the meaning of words.

* https://spacy.io/universe/project/concepcy

In [86]:
import concepcy

In [91]:
nlp = spacy.load('en_core_web_lg')
# Using default concepCy configuration
nlp.add_pipe('concepcy')

doc = nlp('WHO is a lovely company')

# Access all the 'RelatedTo' relations from the Doc
for word, relations in doc._.relatedto.items():
    print(f'Word: {word} {relations}')

# Access the 'RelatedTo' relations word by word
for token in doc:
    print(f'Word: {token}  {token._.relatedto}')

Word: company [{'start': {'id': '/c/en/company', 'label': 'company', 'language': 'en', 'term': '/c/en/company', '@type': 'Node'}, 'end': {'id': '/c/en/business', 'label': 'business', 'language': 'en', 'term': '/c/en/business', '@type': 'Node'}, 'relation': 'RelatedTo', 'text': '[[company]] is related to [[business]]', 'weight': 6.424017434596516}, {'start': {'id': '/c/en/company', 'label': 'company', 'language': 'en', 'term': '/c/en/company', '@type': 'Node'}, 'end': {'id': '/c/en/corporation', 'label': 'corporation', 'language': 'en', 'term': '/c/en/corporation', '@type': 'Node'}, 'relation': 'RelatedTo', 'text': '[[company]] is related to [[corporation]]', 'weight': 4.432155231938521}, {'start': {'id': '/c/en/company', 'label': 'company', 'language': 'en', 'term': '/c/en/company', '@type': 'Node'}, 'end': {'id': '/c/en/organization', 'label': 'organization', 'language': 'en', 'term': '/c/en/organization', '@type': 'Node'}, 'relation': 'RelatedTo', 'text': '[[company]] is related to [
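
Each relation in the output above is a nested dictionary, which is hard to read at a glance. A small sketch of how the related labels and their weights could be pulled out and ranked follows. `top_related` and its `min_weight` parameter are hypothetical, and the sample data keeps only the fields used here from the first two relations printed above.

```python
def top_related(relations, min_weight=0.0):
    """Return (end label, weight) pairs, heaviest first, filtered by a weight threshold."""
    pairs = [
        (rel["end"]["label"], rel["weight"])
        for rel in relations
        if rel["weight"] >= min_weight
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Reduced copy of the first two 'RelatedTo' entries for "company"
relations = [
    {"start": {"label": "company"}, "end": {"label": "corporation"},
     "relation": "RelatedTo", "weight": 4.432155231938521},
    {"start": {"label": "company"}, "end": {"label": "business"},
     "relation": "RelatedTo", "weight": 6.424017434596516},
]
ranked = top_related(relations)
print(ranked)
```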