# Exploring the spaCy library


In this notebook I explore and test different functions and methods in spaCy, applying them to my corpus.



In [2]:
import psycopg2
import spacy
import pandas as pd
from spacy import displacy
### To activate GPU
# https://spacy.io/api/top-level
# activated = spacy.prefer_gpu()

In [3]:
from importlib import reload
import textacy
import seaborn as sns
from matplotlib import pyplot as plt
import settings as stt
import postgresql_functions as pgf
from itables import init_notebook_mode, show
import re

from time import strftime, gmtime

In [4]:
## Reload the module if it has changed
reload(pgf)

<module 'postgresql_functions' from '/home/francesco/shared_files/python_notebooks/Early-Modern-Astronomy/mathshistory/postgresql_functions.py'>

In [5]:
help(pgf.sql_explore)

Help on function sql_explore in module postgresql_functions:

sql_explore(q, cnn)
    q : the SQL query as text
    cnn : the connection parameter



## Prepare some specimens

In [6]:
### connect to the local database
conn = psycopg2.connect(host="localhost", port = 5432, database="espace_intellectuel", 
                        user="postgres", password=stt.dbw)
#conn

In [7]:
### Get five specimen texts
q1 = """
select pk_mathshistory, "name", url, dates, length(biography) as eff, biography 
from astronomers.mathshistory m 
where pk_mathshistory in (103, 117, 133, 159, 186);
"""

In [8]:
### Execute query and get result
result = pgf.sql_explore(q1, conn)
# print(f'Lines count: {len(result[0])}, errors count: {len(result[1])}, \nFirst lines: {result[0][:5]}')
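The source of `pgf.sql_explore` is not shown in this notebook. Judging only from how `result[0]` and `result[1]` are used above, it seems to return the fetched rows together with a list of errors. The sketch below is a hypothetical stand-in with the same `(q, cnn)` signature, not the actual implementation; the name `sql_explore_sketch` and the demo table are assumptions, and SQLite from the standard library replaces PostgreSQL so that the cell runs without a database server.

```python
import sqlite3

def sql_explore_sketch(q, cnn):
    """Hypothetical stand-in for pgf.sql_explore: run q on cnn, return [rows, errors]."""
    rows, errors = [], []
    cur = cnn.cursor()
    try:
        cur.execute(q)
        rows = cur.fetchall()
    except Exception as e:
        # Collect the error message instead of raising, so the caller can inspect it
        errors.append(str(e))
    finally:
        cur.close()
    return [rows, errors]

# In-memory SQLite database, used only so that the sketch is runnable here
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE mathshistory (pk_mathshistory INTEGER, name TEXT)")
demo_conn.executemany(
    "INSERT INTO mathshistory VALUES (?, ?)",
    [(103, "Christopher Clavius"), (117, "Michael Mästlin")],
)

demo_result = sql_explore_sketch(
    "SELECT pk_mathshistory, name FROM mathshistory ORDER BY pk_mathshistory",
    demo_conn,
)
print(f"Lines count: {len(demo_result[0])}, errors count: {len(demo_result[1])}")
```

Returning the errors rather than raising them keeps an exploratory notebook running when one query fails, at the cost of having to check the second element explicitly.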

In [9]:
### Create a DataFrame from the result
textes = pd.DataFrame(result[0])
textes.columns = ['id', 'name', 'url', 'dates', 'length_bio', 'texte']

In [10]:
textes.head()

| | id | name | url | dates | length_bio | texte |
|---|---|---|---|---|---|---|
| 0 | 103 | Christopher Clavius | https://mathshistory.st-andrews.ac.uk/Biograph... | 1538-1612 | 10616 | Christopher Clavius was born in a German regio... |
| 1 | 117 | Michael Mästlin | https://mathshistory.st-andrews.ac.uk/Biograph... | 1550-1631 | 10122 | Michael Mästlin was born in Göppingen which wa... |
| 2 | 133 | Giuseppe Biancani | https://mathshistory.st-andrews.ac.uk/Biograph... | 1566-1624 | 10035 | Giuseppe Biancani's name also appears in its L... |
| 3 | 159 | Wilhelm Schickard | https://mathshistory.st-andrews.ac.uk/Biograph... | 1592-1635 | 10272 | Wilhelm Schickard's name is sometimes written ... |
| 4 | 186 | Johannes Hevelius | https://mathshistory.st-andrews.ac.uk/Biograph... | 1611-1687 | 10856 | The first problem that we have to address is t... |


In [11]:
### Choose one document
tx = textes.iloc[1].texte 
print(tx)

Michael Mästlin was born in Göppingen which was a village about 50 km east of Tübingen. His father, Jakob Mästlin, and his mother, Dorothea Simon, were both devout Lutherans and Michael was brought up in that faith and remained strongly committed to it throughout his life. He was the middle child of the family, having an older sister and a younger brother. He attended the monastic school in Königsbronn then, after his studies there, entered Tübingen University in 1568. [3]:-
As was the case with many young scholars including Kepler, his most famous student, [Mästlin] did his undergraduate studies at a preparatory school and came to the university to take his final exams and pick up his baccalaureate degree.At Tübingen University he studied mathematics and astronomy for a Master's degree under Philipp Apian who was Peter Apian's son. In 1570, while a student, he purchased a copy of Copernicus's De revolutionibus from the widow of Victorin Strigel, who had been professor of theology at L

## Apply standard spaCy pipeline

In [12]:
### Load spaCy language model:
# You first have to download the models into the active
# conda environment

# python -m spacy download en_core_web_sm
# python -m spacy download en_core_web_trf
# python -m spacy download en_core_web_lg

nlp = spacy.load("en_core_web_lg")

In [13]:
### Execute standard pipeline on one document
doc = nlp(tx)

In [14]:
i = 0
for s in doc.sents:
    if i == 0:
        print(s)
        print([nc for nc in s.noun_chunks])
        print(s.start_char)
        print(dir(s))
    i += 1    

Michael Mästlin was born in Göppingen which was a village about 50 km east of Tübingen.
[Michael Mästlin, Göppingen, which, a village, Tübingen]
0
['_', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_fix_dep_copy', '_vector', '_vector_norm', 'as_doc', 'char_span', 'conjuncts', 'doc', 'end', 'end_char', 'ent_id', 'ent_id_', 'ents', 'get_extension', 'get_lca_matrix', 'has_extension', 'has_vector', 'id', 'id_', 'kb_id', 'kb_id_', 'label', 'label_', 'lefts', 'lemma_', 'n_lefts', 'n_rights', 'noun_chunks', 'orth_', 'remove_extension', 'rights', 'root', 'sent', 'sentiment', 'sents', 'set_extension', 'similarity', 'start', 'start_char', 'subtree', 'tensor', 'text', 'text_with_ws', 'to

In [15]:
### Produce a list of the sentences in the document and add an index to each sentence 
sents = [[i, s, [nc for nc in s.noun_chunks]]  for i, s in enumerate(doc.sents)]
len(sents)

69

In [16]:
for sent in sents[3:6]:
    print('----\n',sent)

----
 [3, He attended the monastic school in Königsbronn then, after his studies there, entered Tübingen University in 1568., [He, the monastic school, Königsbronn, his studies, Tübingen University]]
----
 [4, [3]:-
As was the case with many young scholars including Kepler, his most famous student,, [the case, many young scholars, Kepler, his most famous student]]
----
 [5, [Mästlin] did his undergraduate studies at a preparatory school and came to the university to take his final exams and pick up his baccalaureate degree., [his undergraduate studies, a preparatory school, the university, his final exams, his baccalaureate degree]]


In [34]:
### https://spacy.io/api/vectors/
sents[0][1].vector.shape

(300,)

### Explore spaCy pipeline results regarding tokens

In [17]:
for token in doc[52:70]:
    print(token.text, token.lemma_, token.pos_,  token.dep_,
        token.head,  token.is_stop) # token.tag_, token.vector,

He he PRON nsubj was True
was be AUX ROOT was True
the the DET det child True
middle middle ADJ amod child False
child child NOUN attr was False
of of ADP prep child True
the the DET det family True
family family NOUN pobj of False
, , PUNCT punct was False
having have VERB advcl was False
an an DET det sister True
older old ADJ amod sister False
sister sister NOUN dobj having False
and and CCONJ cc sister True
a a DET det brother True
younger young ADJ amod brother False
brother brother NOUN conj sister False
. . PUNCT punct was False


In [18]:
info = []
for token in doc:
    info.append((token.idx, token.ent_id_, token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
                token.shape_, token.is_alpha, token.is_stop, token.vector[:1]))
tx_df  = pd.DataFrame(info, columns = ["idx", "ent_id_", "text", "lemma_", "pos_", "tag_", "dep_", "shape_", "is_alpha", "is_stop", "vector"])

In [29]:
### https://github.com/mwouts/itables/blob/main/docs/advanced_parameters.md
show(tx_df[tx_df.idx >= 0], classes="display", scrollY="400px", scrollCollapse=True, paging=False, column_filters="footer", dom="lrtip")



In [20]:
s = sents[0][1]

In [21]:
for token in s:
    print(token.text, token.dep_, token.head.text, token.head.pos_, token.tag_,
            [child for child in token.children])

Michael compound Mästlin PROPN NNP []
Mästlin nsubjpass born VERB NNP [Michael, was]
was auxpass born VERB VBD []
born ROOT born VERB VBN [Mästlin, was, in, .]
in prep born VERB IN [Göppingen]
Göppingen pobj in ADP NNP []
which nsubj was AUX WDT []
was relcl Mästlin PROPN VBD [which, village]
a det village NOUN DT []
village attr was AUX NN [a, east]
about advmod 50 NUM RB []
50 nummod km NOUN CD [about]
km npadvmod east ADV NNS [50]
east advmod village NOUN RB [km, of]
of prep east ADV IN [Tübingen]
Tübingen pobj of ADP NNP []
. punct born VERB . []


In [22]:
for chunk in s.noun_chunks:
    print(chunk.text, chunk.start_char, chunk.end_char,
          chunk.root.text, chunk.root.dep_,
            chunk.root.head.text, [c for c in chunk.root.ancestors], chunk.vector[:3])

Michael Mästlin 0 15 Mästlin nsubjpass born [born] [-2.77865 -1.65305 -0.15149]
Göppingen 28 37 Göppingen pobj in [in, born] [-2.2553  -1.9444   0.10191]
which 38 43 which nsubj was [was, Mästlin, born] [-3.2023  -3.031    0.08376]
a village 48 57 village attr was [was, Mästlin, born] [-5.96865    4.7177052 -4.02923  ]
Tübingen 78 86 Tübingen pobj of [of, east, village, was, Mästlin, born] [-1.9794  -2.3969   0.50692]


## Test vector addition and cosine similarity

In [23]:
from numpy import dot
from numpy.linalg import norm

In [24]:
s.vector_norm, s.vector[:5]

(32.236862,
 array([-5.0095096 , -0.77976847, -2.3805912 , -1.222302  ,  4.1728654 ],
       dtype=float32))

In [25]:
s[0].text, s[1].text

('Michael', 'Mästlin')

In [26]:
vn = s[0].vector + s[1].vector
type(vn)

numpy.ndarray

In [28]:
len(vn), vn.shape

(300, (300,))

In [45]:
nc = [nc for nc in s.noun_chunks]
nc[0]

Michael Mästlin

In [49]:
ncv = nc[0].vector

In [50]:
nc[1], nc[4]

(Göppingen, Tübingen)

In [51]:
ncv1 = nc[1].vector
ncv4 = nc[4].vector

In [43]:
### calculate Cosine Similarity
# https://www.statology.org/cosine-similarity-python/
cos_sim = dot(ncv, vn)/(norm(ncv)*norm(vn))

cos_sim

1.0
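A similarity of 1.0 is what one would expect here: according to the spaCy documentation, a span's vector defaults to the average of its token vectors, and the average of two vectors is just their sum divided by two. Scaling a vector by a positive factor does not change its direction, so the cosine between sum and mean is 1. The sketch below illustrates this with plain Python lists; the three-dimensional values are made up and merely stand in for the 300-dimensional token vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors given as lists of floats."""
    dot_uv = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_uv / (norm_u * norm_v)

# Toy stand-ins for the two token vectors (made-up values)
tok_a = [1.0, -2.0, 0.5]
tok_b = [3.0, 0.0, -1.5]

vec_sum = [a + b for a, b in zip(tok_a, tok_b)]         # plays the role of vn
vec_mean = [(a + b) / 2 for a, b in zip(tok_a, tok_b)]  # plays the role of the span vector

print(round(cosine(vec_sum, vec_mean), 6))
```

The comparison therefore confirms how the span vector is built rather than measuring any semantic closeness.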

In [52]:
### Compare two places
cos_sim = dot(ncv1, ncv4)/(norm(ncv1)*norm(ncv4))

cos_sim

0.83159846

In [55]:
### Compare a person and a place
cos_sim = dot(ncv, ncv4)/(norm(ncv)*norm(ncv4))

cos_sim

0.10954975

## Displacy

In [23]:
s = sents[3][1]

In [24]:
for chunk in s.noun_chunks:
    print(chunk.text, chunk.start_char, chunk.end_char,
          chunk.root.text, chunk.root.dep_,
            chunk.root.head.text, [c for c in chunk.root.ancestors], chunk.vector[:3])

He 359 361 He nsubj attended [attended] [ 2.5453  2.2064 -0.913 ]
the monastic school 371 390 school dobj attended [attended] [-2.710333   0.7176333  1.0765667]
Königsbronn 394 405 Königsbronn pobj in [in, school, attended] [0. 0. 0.]
his studies 418 429 studies pobj after [after, entered, attended] [-1.617781  3.6808   -1.3355  ]
Tübingen University 445 464 University dobj entered [entered, attended] [-2.35755 -2.19745  2.18836]
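The first three components printed for `Königsbronn` are all zero, which suggests that the token has no entry in the model's vector table (spaCy exposes a `has_vector` attribute to check this). With an all-zero vector the cosine formula used earlier divides by zero. A minimal sketch of a guarded version, written with plain lists; the function name `safe_cosine` is an assumption, and the non-zero values are the three components printed above for `Tübingen University`.

```python
import math

def safe_cosine(u, v):
    """Cosine similarity, or None when either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # An all-zero vector has no direction, so similarity is undefined
        return None
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

oov = [0.0, 0.0, 0.0]                   # like the Königsbronn chunk
known = [-2.35755, -2.19745, 2.18836]   # first components of the Tübingen University chunk

print(safe_cosine(oov, known))
print(safe_cosine(known, known))
```

Returning `None` makes the missing vector explicit, so such pairs can be filtered out before comparing places or persons.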


In [25]:
options= {"collapse_phrases":True}
displacy.render(s, style="dep", options= options )

## Named entities

* The results are not perfect but interesting
* Can they be improved by training?



In [26]:
for e in s.ents:
    print(e.label_,e.text, list(e.noun_chunks), e.ent_id_)

GPE Königsbronn [Königsbronn] 
ORG Tübingen University [Tübingen University] 
DATE 1568 [] 


In [27]:
options= {"collapse_phrases":True}
displacy.render(s, style="ent", options= options )

In [28]:
ents = doc.ents
df_ents = pd.DataFrame([[e.label_,e.text, list(e.noun_chunks), e.vector[:2]] for e in ents])
df_ents.columns= ['type', 'label','chunks', "vector"]

In [29]:
len(df_ents), df_ents.head()

(170,
        type            label                chunks                    vector
 0    PERSON  Michael Mästlin  [(Michael, Mästlin)]      [-2.77865, -1.65305]
 1       GPE        Göppingen         [(Göppingen)]        [-2.2553, -1.9444]
 2  QUANTITY      about 50 km                    []  [-4.6376333, -0.7857833]
 3       GPE         Tübingen          [(Tübingen)]        [-1.9794, -2.3969]
 4    PERSON    Jakob Mästlin    [(Jakob, Mästlin)]       [-2.0127, -1.22875])

In [30]:
ldf = pd.DataFrame(df_ents.groupby(by='label').size().sort_values(ascending=False))
ldf= ldf.reset_index()
ldf.columns=['label','effectif']
ldf.values.tolist()

[['Mästlin', 17],
 ['Copernicus', 10],
 ['Tübingen', 8],
 ['Kepler', 5],
 ['Backnang', 4],
 ['Methuen', 4],
 ['first', 4],
 ['3]:-', 3],
 ['Ptolemy', 3],
 ['1577', 3],
 ['De', 3],
 ['two', 3],
 ['Copernican', 3],
 ['Tübingen University', 2],
 ['Heidelberg', 2],
 ['Leipzig', 2],
 ['Aristotle', 2],
 ['Göppingen', 2],
 ['four years', 2],
 ['Lutheran', 2],
 ['three', 2],
 ['Philipp Apian', 2],
 ['1580', 2],
 ['1578', 2],
 ['Michael Mästlin', 2],
 ['1571', 2],
 ['1570', 2],
 ['Observatio', 1],
 ['Victorin Strigel', 1],
 ['Veneris apparuit', 1],
 ["Tübingen University's", 1],
 ["Peter Apian's", 1],
 ['Peter] Apian', 1],
 ['Pleiades', 1],
 ['Rheticus', 1],
 ['about 50 km', 1],
 ['about 30 km', 1],
 ['the years 1577 and', 1],
 ['the following years', 1],
 ['the age of 78', 1],
 ['the University of Tübingen', 1],
 ['the University of Heidelberg', 1],
 ['the New Star', 1],
 ['the Lutheran Church', 1],
 ['the Astronomical Revolution', 1],
 ['sixth', 1],
 ['six', 1],
 ['Osiander', 1],
 ['geometric
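Note that the grouping above counts occurrences of each entity *text* (stored in the column named `label`), not of each entity type. Counting both can help spot mis-tagged items such as `3]:-`. A minimal sketch with `collections.Counter`; the `(type, text)` pairs are a small hand-written sample standing in for `[(e.label_, e.text) for e in doc.ents]`, not the real output.

```python
from collections import Counter

# Hand-written sample standing in for [(e.label_, e.text) for e in doc.ents]
sample_ents = [
    ("PERSON", "Mästlin"),
    ("GPE", "Tübingen"),
    ("PERSON", "Mästlin"),
    ("PERSON", "Kepler"),
    ("GPE", "Tübingen"),
    ("PERSON", "Mästlin"),
]

# Frequency of each entity text, then of each entity type
text_counts = Counter(text for _, text in sample_ents)
type_counts = Counter(label for label, _ in sample_ents)

print(text_counts.most_common())
print(type_counts.most_common())
```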

## Explore some aspects of the pipeline with a coreferenced text

In [31]:
tx1 = """Michael Mästlin was born in Göppingen which was a village about 50 km east of Tübingen. Mästlin father, Jakob Mästlin, and Mästlin mother, Dorothea Simon, were both devout Lutherans and Michael was brought up in that faith and remained strongly committed to father throughout Mästlin life. Mästlin was the middle child of the family, having an older sister and a younger brother. Mästlin attended the monastic school in Königsbronn then, after Mästlin studies there, entered Tübingen University in 1568. [3]:-
As was the case with many young scholars including Kepler, Mästlin most famous student, [Mästlin ] did Mästlin undergraduate studies at a preparatory school and came to the university to take Mästlin final exams and pick up Mästlin baccalaureate degree.At Tübingen University Mästlin studied mathematics and astronomy for a Master's degree under Philipp Apian who was Peter Apian's son. In 1570, while a student, Mästlin purchased a copy of Copernicus's De revolutionibus from the widow of Victorin Strigel, who had been professor of theology at Leipzig and the author of an astronomy text. Mästlin was awarded a Master 's degree with distinction in 1571. While studying for this degree Mästlin had edited a new edition of the Prutenis Tables which was published in 1571. The Prutenis Tables had been originally compiled by Erasmus Reinhold who had based 1571 on Copernicus's version of the solar system. After graduating, Mästlin spent the following years teaching in Tübingen, and at the same time took [3]:-
... the theological programme, because Tübingen University's primary mission was to prepare young men for the Lutheran ministry.Then, in 1576, Mästlin was sent to be a deacon at the Lutheran Church in Backnang, a town about 30 km northwest of Göppingen. Mästlin married Margarete Gruuninger (1551-1588) in April 1577 not long after Mästlin took up the position in Backnang ; 1551 had three sons and three daughters but Margarete died due to complications with the birth of 1551 sixth child. Mästlin then married Margarete Burkhardt, daughter of a professor at Tübingen; 1551 had eight children.

Mästlin is famous for Mästlin excellent, very accurate, observations of the comet of 1577, observed while Mästlin was in Backnang, and published in Tübingen in 1578 as Observatio et demonstratio cometae aetherae qui anno 1577 et 1578 constitutus in sphaera Veneris apparuit cum admirandius eius passionibus varietate scilicet motus loco orbe distantia a terro centro etc. adhibitis demonstrationibus geometricis et calculo arithmetico cuius modi de alio quoquam cometa nunquam visa est (Observations and demonstrations of the ethereal comets of the years 1577 and 1578). We discuss this achievement in more detail below. Mästlin remained in Backnang for four years, then was appointed as professor of mathematics at the University of Heidelberg in 1580. There Mästlin published the first edition of Mästlin famous astronomy textbook Epitome astronomiae (1582) - Mästlin published six further editions of this popular work during Mästlin lifetime. Despite Mästlin commitment to the views of Copernicus (which we state below in Mästlin own words) this teaching textbook was written purely as a description of astronomy based on the geocentric model of Ptolemy.

After four years in Heidelberg Mästlin returned to Mästlin position in Tübingen where Mästlin spent the rest of Mästlin career [2]:-
At Tübingen , Mästlin was elected dean of the arts faculty several times. Mästlin was well liked by both Mästlin colleagues and Mästlin students. Mästlin was very generous both to Mästlin family and to others. Mästlin was a religious man; Mästlin followed the Lutheran line in opposing the Gregorian calendar reform partly because man was initiated by the Pope. Mästlin had several students who became noted mathematicians, the most famous being Kepler. Mästlin also maintained interests in Biblical chronology and geography.Perhaps Mästlin greatest achievement (other than being Kepler 's teacher) is that Mästlin was the first to compute the orbit of a comet, although Mästlin method was not sound. Mästlin found, however, a sun centred orbit for the comet of 1577 which Mästlin claimed supported Copernicus's heliocentric system. Mästlin did show that the comet was further away than the moon, which contradicted the accepted teachings of Aristotle. Although clearly believing in the system as proposed by Copernicus , Mästlin taught astronomy using Mästlin own textbook which was based on Ptolemy's system. However for the more advanced lectures Mästlin adopted the heliocentric approach - Kepler credited Mästlin with introducing Mästlin to Copernican ideas while Mästlin was a student at Tübingen (1589-94). In [6] Methuen looks at these two different world systems in Mästlin 's teaching. A G Molland reviews Methuen 's paper:-
Michael Mästlin is regularly ascribed a firm if minor place in accounts of the Astronomical Revolution, as the teacher of Johannes Kepler and as an early believer in the physical reality of the Copernican system. What has been less clear is how this related to Kepler regular teaching at the University of Tübingen, with some having argued that Kepler only taught Copernicanism cautiously and in private. Methuen examines this question in the light of Maestlin's published writings, including five sets of theses over whose disputations (two at Heidelberg, one at Tübingen ) Kepler presided. believer shows that Kepler had no compunction about firmly asserting the supralunar position of the New Star of 1572 and of the comets of 1577 - 1578 and 1580. In general, Kepler writings, especially Kepler textbook 'Epitome astronomiae' of 1582, show no firm Copernican commitment, although there are signs of a leaning in that direction, and Kepler was concerned with emphasizing the interconvertibility of hypotheses. Kepler elementary teaching was certainly based on traditional astronomy, but Methuen concludes that Kepler taught newer material to Kepler more advanced students. Methuen claims that Kepler made a clear distinction between "spheres" and "orbs", with the former possessing physical reality and the latter being merely useful mathematical constructions ...Mästlin was both a great expert on spherical trigonometry and also a fine observer producing accurate data - the quality of Mästlin eyesight is seen from the fact that Mästlin saw, and sketched the positions of, 11 stars in the Pleiades cluster. Of course there were not any bright city lights around then, but we challenge any reader to equal this achievement however dark a site reader find. Mästlin seems to have been the first to claim that the dark part of the moon shone through sunlight reflected from the earth but Leonardo da Vinci has also been credited with this idea. 
Another first for Mästlin is an accurate calculation of the golden ratio as "approximately 0.6180340" stated in a letter Mästlin wrote to Kepler in 1597.

Mästlin was an innovative thinker who was quite prepared to challenge conventional views. For example Mästlin attempted to measure the parallax of a supernova and, having failed to find any, deduced that supernova was as far away as the "fixed stars". This of course contradicted the view, held since Aristotle, that all changes in the heavens occurred closer to earth than the realm of the stars which was unchanging.

Mästlin lived to see the invention of the telescope for astronomical observations by Galileo. Mästlin had two, rather poor, telescopes with which Mästlin was able to observe sunspots and the moons of Jupiter. Mästlin was still making accurate observations at the age of 78 when Mästlin observed the lunar eclipse of 1628.

As we mentioned above Mästlin acquired a copy of Copernicus's De revolutionibus in 1570 and Mästlin wrote extensive notes near the beginning of the book. These give much insight into Mästlin views on Copernicus and we quote the notes using Gingerich's translations. Mästlin wrote that [3]:-
... the arrangement presented in this book is the sort of structure in which all the sidereal motions and phenomena are explained very exactly. Therefore this hypothesis recommends hypothesis to the intellect.Mästlin continued:-
The heavenly motions were at the point of collapse, and so [Copernicus] concluded that appropriate hypotheses were needed to explain these motions . When Mästlin noticed that the common hypotheses were insufficient, Mästlin eventually accepted the idea of the Earth's mobility, since indeed, idea not only satisfied the phenomena very well but idea didn't lead to anything absurd.

In fact, if anyone would straighten out the common hypotheses so that hypotheses would agree with the phenomena and allow no inconsistencies, then I would gratefully trust Mästlin ; clearly Mästlin would bring very many to Mästlin views. But I see that some, even very outstanding mathematicians, have laboured on this yet, in the end, without results. Therefore, I think that unless the common hypotheses are reformed (a task that I am not up to because of my inadequate abilities), I will accept the hypotheses and opinion of Copernicus - after Ptolemy, the prince of all Astronomers.There are further annotations written by Mästlin [3] which are very interesting. On the back of the title page of De revolutionibus is the infamous notice which states that "... these hypotheses need not be true nor even probable; it is sufficient if the calculations agree with the observations." Mästlin adds a note to Mästlin copy stating:- 
This preface was added by someone, whoever preface author may be, (for indeed, preface weakness of style and choice of words reveal that preface is not by Copernicus).He later added a further note:-
I found the following words written somewhere among the books of Philipp Apian (which I bought from Apian widow); although no author was given I could recognise Apian 's hand:
On account of this letter Georg Joachim Rheticus, the Leipzig professor and disciple of Copernicus, became involved in a very bitter wrangle with the printer, who asserted that wrangle had been turned over to Rheticus with the rest of the work. Rheticus , however, suspected that Osiander [the proof-reader of Copernicus 's book] had prefaced Copernicus to the work . If professor knew this for certain, professor declared, professor would handle that fellow so that in future professor would mind professor own business and not slander astronomers any more. Nevertheless, [Peter] Apian told me that Osiander had openly admitted to professor that professor had added this all by professor ."""

In [32]:
### Execute standard pipeline on one document
doc1 = nlp(tx1)

In [33]:
### Produce a list of the sentences in the document and add an index to each sentence 
sents1 = [[i, s, [nc for nc in s.noun_chunks]]  for i, s in enumerate(doc1.sents)]
len(sents1)

70

In [35]:
for sent in sents1[3:6]:
    print('----\n',sent)

----
 [3, Mästlin attended the monastic school in Königsbronn then, after Mästlin studies there, entered Tübingen University in 1568., [Mästlin, the monastic school, Königsbronn, Mästlin studies, Tübingen University]]
----
 [4, [3]:-
As was the case with many young scholars including Kepler, Mästlin most famous student, [Mästlin ] did Mästlin undergraduate studies at a preparatory school and came to the university to take Mästlin final exams and pick up Mästlin baccalaureate degree., [the case, many young scholars, Kepler, Mästlin most famous student, Mästlin, Mästlin undergraduate studies, a preparatory school, the university, Mästlin final exams, Mästlin baccalaureate degree]]
----
 [5, At Tübingen University Mästlin studied mathematics and astronomy for a Master's degree under Philipp Apian who was Peter Apian's son., [Tübingen, University Mästlin, mathematics, astronomy, a Master's degree, Philipp Apian, who, Peter Apian's son]]


### Explore spaCy pipeline results regarding tokens

In [38]:
info = []
for token in doc1:
    info.append((token.idx, token.ent_id_, token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
                 token.head,
                token.shape_, token.is_alpha, token.is_stop, token.vector[:1]))
tx_df  = pd.DataFrame(info, columns = ["idx", "ent_id_", "text", "lemma_", "pos_", "tag_", "dep_", "head", "shape_", "is_alpha", "is_stop", "vector"])

In [39]:
### https://github.com/mwouts/itables/blob/main/docs/advanced_parameters.md
show(tx_df[tx_df.idx >= 0], classes="display", scrollY="400px", scrollCollapse=True, paging=False, column_filters="footer", dom="lrtip")



### Named entities

In [40]:
s = sents1[3][1]

In [41]:
for chunk in s.noun_chunks:
    print(chunk.text, chunk.start_char, chunk.end_char,
          chunk.root.text, chunk.root.dep_,
            chunk.root.head.text, [c for c in chunk.root.ancestors], chunk.vector[:3])

Mästlin 380 387 Mästlin nsubj attended [attended] [0. 0. 0.]
the monastic school 397 416 school dobj attended [attended] [-2.710333   0.7176333  1.0765667]
Königsbronn 420 431 Königsbronn pobj in [in, school, attended] [0. 0. 0.]
Mästlin studies 444 459 studies pobj after [after, entered, attended] [-0.010131  1.03305   0.57765 ]
Tübingen University 475 494 University dobj entered [entered, attended] [-2.35755 -2.19745  2.18836]


In [42]:
options= {"collapse_phrases":True}
displacy.render(s, style="dep", options= options )

In [43]:
for e in s.ents:
    print(e.label_,e.text, list(e.noun_chunks), e.ent_id_)

PERSON Mästlin [Mästlin] 
GPE Königsbronn [Königsbronn] 
ORG Mästlin [] 
ORG Tübingen University [Tübingen University] 
DATE 1568 [] 


In [44]:
options= {"collapse_phrases":True}
displacy.render(s, style="ent", options= options )

In [48]:
ents1 = doc1.ents
df_ents1 = pd.DataFrame([[e.label_,e.text, list(e.noun_chunks), e.vector[:2]] for e in ents1])
df_ents1.columns= ['type', 'label','chunks', "vector"]

In [49]:
len(df_ents1), df_ents1.head()

(255,
        type            label                chunks                    vector
 0    PERSON  Michael Mästlin  [(Michael, Mästlin)]      [-2.77865, -1.65305]
 1       GPE        Göppingen         [(Göppingen)]        [-2.2553, -1.9444]
 2  QUANTITY      about 50 km                    []  [-4.6376333, -0.7857833]
 3       GPE         Tübingen          [(Tübingen)]        [-1.9794, -2.3969]
 4    PERSON          Mästlin                    []                [0.0, 0.0])

In [51]:
ldf1 = pd.DataFrame(df_ents1.groupby(by='label').size().sort_values(ascending=False))
ldf1= ldf1.reset_index()
ldf1.columns=['label','effectif']
ldf1.values.tolist()

[['Mästlin', 81],
 ['Kepler', 16],
 ['Copernicus', 11],
 ['Tübingen', 8],
 ['Methuen', 5],
 ['first', 4],
 ['Backnang', 4],
 ['3]:-', 3],
 ['1571', 3],
 ['Ptolemy', 3],
 ['1577', 3],
 ['Copernican', 3],
 ['1551', 3],
 ['two', 3],
 ['De', 3],
 ['Tübingen University', 2],
 ['four years', 2],
 ['Aristotle', 2],
 ['Leipzig', 2],
 ['Apian', 2],
 ['Lutheran', 2],
 ['1570', 2],
 ['Heidelberg', 2],
 ['1580', 2],
 ['1578', 2],
 ['Philipp Apian', 2],
 ['Göppingen', 2],
 ['Michael Mästlin', 2],
 ['three', 2],
 ['Rheticus', 2],
 ['Veneris apparuit', 1],
 ['about 50 km', 1],
 ['Victorin Strigel', 1],
 ["Tübingen University's", 1],
 ['Pleiades', 1],
 ['Peter] Apian', 1],
 ["Peter Apian's", 1],
 ['about 30 km', 1],
 ['varietate scilicet motus', 1],
 ['the years 1577 and', 1],
 ['the following years', 1],
 ['the age of 78', 1],
 ['the University of Tübingen', 1],
 ['1568', 1],
 ['the New Star', 1],
 ['the Lutheran Church', 1],
 ['the Astronomical Revolution', 1],
 ['sixth', 1],
 ['eight', 1],
 ['orbe 
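To see what the coreference substitution changed, the two frequency lists can be compared directly. The sketch below copies a few counts by hand from the outputs above (original text versus coreferenced text) and prints the difference per entity; in the notebook itself the same dictionaries could presumably be built from `ldf` and `ldf1`.

```python
# Counts copied by hand from the two frequency lists above
before = {"Mästlin": 17, "Copernicus": 10, "Tübingen": 8, "Kepler": 5, "Methuen": 4}
after = {"Mästlin": 81, "Kepler": 16, "Copernicus": 11, "Tübingen": 8, "Methuen": 5}

# Difference per entity text; names missing on one side count as 0
delta = {
    name: after.get(name, 0) - before.get(name, 0)
    for name in sorted(set(before) | set(after))
}

# Keep only entities whose count changed, largest change first
changed = sorted(((d, n) for n, d in delta.items() if d != 0), reverse=True)
for d, n in changed:
    print(f"{n}: {d:+d}")
```

Place names such as `Tübingen` stay constant, while person names grow because pronouns were replaced by the names they were resolved to.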

## Dependencies

Excellent intro:
* https://towardsdatascience.com/nlp-with-python-knowledge-graph-12b93146a458 



In [88]:
stcs = [] 
for sent in doc.sents:
    stcs.append(sent)

In [89]:
stcs[10]

The Prutenis Tables had been originally compiled by Erasmus Reinhold who had based them on Copernicus's version of the solar system.

In [None]:
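# Iterate over the whole sentence (a Span); stcs[10][2] would be a single Token, which is not iterable
sent = stcs[10]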
for token in sent:
    print(token.text, "-->", "pos: "+token.pos_, "|", "dep: "+token.dep_, "")

In [None]:
displacy.render(stcs[10], style="dep", options={"distance":100})

In [None]:
for s in stcs[:11]:
    displacy.render(s, style="ent")

In [None]:
## extract entities and relations
dic = {"id":[], "text":[], "entity":[], "relation":[], "object":[]}

for n,sentence in enumerate(stcs):
    lst_generators = list(textacy.extract.subject_verb_object_triples(sentence))  
    for sent in lst_generators:
        subj = "_".join(map(str, sent.subject))
        obj  = "_".join(map(str, sent.object))
        relation = "_".join(map(str, sent.verb))
        dic["id"].append(n)
        dic["text"].append(sentence.text)
        dic["entity"].append(subj)
        dic["object"].append(obj)
        dic["relation"].append(relation)


## create dataframe
dtf = pd.DataFrame(dic)

dtf.head()

In [None]:

dtf.groupby('id').size().sort_values(ascending=False)

In [None]:
dtf[dtf['id'] == 84]

In [None]:
print(list(dtf[dtf['id'] == 84]['text'])[0])