## Distant reading study week 3 (VT-23)

### Learning Material 3c: some remarks about SpaCy docs and using other language models

Matti La Mela


In [1]:
# Let's read one example article from our material.

with open("./dhq_corpus_complete_2007_2020/dhq-2007-000001-Drucker-Philosophy.txt", mode="r", encoding="utf-8") as file:
    text = file.read()
    

In [2]:
# We import SpaCy and load our English language model

import spacy

nlp = spacy.load("en_core_web_sm")


In [3]:
# Running a text through the spaCy nlp pipeline returns a spaCy Doc object, which is a sequence of spaCy tokens:

text_doc = nlp(text)

print("The type of text_doc is: " + str(type(text_doc)))

The type of text_doc is: <class 'spacy.tokens.doc.Doc'>


In [4]:
# When we print it, spaCy shows the doc as if it were plain text, a string. This is only a convenient display for exploring the Doc object.
# Here we print the first 100 tokens.

print(text_doc[0:100])


Philosophy and Digital Humanities: A review of Willard McCarty, Humanities
      Computing (London and NY: Palgrave, 2005)
Johanna Drucker
3 April 2007

Humanities computing is still a fledgling discipline, in spite of its claim to a lineage now
decades old that descends from the estimable Father Busa and his assiduous labors
in the stony fields of concordance and corpus linguistics. The cultural authority of computing
has an older history yet, linked as it is to traditions of analytic thought and rational
calculus in the
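
A slice of a Doc is not a string either: it is a spaCy Span, and its `.text` attribute gives the underlying string. The sketch below illustrates this under one assumption: it uses a blank English pipeline (`spacy.blank("en")`, tokenizer only) and a short made-up sentence so that it runs without the corpus file or a downloaded model. The name `nlp_blank` is chosen here only to avoid overwriting the `nlp` object used in this notebook.

```python
import spacy
from spacy.tokens import Span

# Assumption: a blank English pipeline (tokenizer only), so no model download is needed.
nlp_blank = spacy.blank("en")
doc = nlp_blank("Humanities computing is still a fledgling discipline.")

# Slicing a Doc returns a Span, not a string.
first_three = doc[0:3]
print(type(first_three))

# The .text attribute of the Span is an ordinary Python string.
print(first_three.text)
print(type(first_three.text))
```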


In [5]:
# The Doc object consists of spaCy tokens. Again, they look like strings when printed, but they are Token objects.

print(text_doc[0])
print(text_doc[1])
print(text_doc[2])

print("")

print(type(text_doc[0]))

Philosophy
and
Digital

<class 'spacy.tokens.token.Token'>


In [6]:
# The tokens have many kinds of attributes that spacy generates when we process our text, e.g.

print(text_doc[0].lemma_)

print(text_doc[0].pos_)

print(text_doc[0].is_alpha)

print(text_doc[0].sent)  # Sentence where the token appears



philosophy
NOUN
True
Philosophy and Digital Humanities: A review of Willard McCarty, Humanities
      Computing (London and NY: Palgrave, 2005)
Johanna Drucker
3 April 2007




In [7]:
# However, only attributes such as .text, .lemma_ and .pos_ are strings; the token itself is not!

print("The type of text_doc[0] is: " + str(type(text_doc[0])))
print("The type of text_doc[0].text is: " + str(type(text_doc[0].text)))
print("The type of text_doc[0].lemma_ is: " + str(type(text_doc[0].lemma_)))
print("The type of text_doc[0].pos_ is: " + str(type(text_doc[0].pos_)))

The type of text_doc[0] is: <class 'spacy.tokens.token.Token'>
The type of text_doc[0].text is: <class 'str'>
The type of text_doc[0].lemma_ is: <class 'str'>
The type of text_doc[0].pos_ is: <class 'str'>


In [8]:
# Thus, we need to decide when we want to store the spaCy token itself for later use. When we keep the token, we
# also keep its attributes, e.g. .pos_ or .lemma_

# However, if we need the token as text, as a string, we can use for example token.text (or token.lemma_).
# We need strings to calculate frequencies, for example. If we count the spaCy tokens themselves, every token is a
# different object, so identical words are not summed together.
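
The difference between counting tokens and counting strings can be sketched with `collections.Counter`. This is a minimal illustration, assuming a blank English pipeline (`spacy.blank("en")`) and a made-up sentence instead of the article text; `nlp_blank` is a name introduced here only for the example.

```python
from collections import Counter

import spacy

# Assumption: a blank English pipeline (tokenizer only), so no model download is needed.
nlp_blank = spacy.blank("en")
doc = nlp_blank("the cat saw the other cat")

# Counting Token objects: each token is a distinct object, so nothing is summed.
token_counts = Counter(doc)

# Counting strings: identical words end up under the same key.
text_counts = Counter(token.text for token in doc)

print(len(token_counts))   # number of distinct keys when counting tokens
print(len(text_counts))    # number of distinct keys when counting strings
print(text_counts["the"], text_counts["cat"])
```

With a full model loaded, the same pattern works with `token.lemma_` in place of `token.text` if frequencies should be calculated per lemma.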


### Other language models in spacy

If you want, you can also use spaCy language models other than the English one: https://spacy.io/models. Normally, they are installed from a terminal: the terminal in Jupyter Lab, the Powershell prompt (for the whole Anaconda installation), or, for Mac users, the Mac terminal.

Let's try the French language model!

I installed it in the terminal with: python -m spacy download fr_core_news_sm


In [9]:
# We import spacy and load the French language model

import spacy

nlp = spacy.load("fr_core_news_sm")

OSError: [E050] Can't find model 'fr_core_news_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
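
The `[E050]` error above means that spaCy could not find the model package in the Python environment the notebook kernel is running in. One common cause (an assumption here, not confirmed for this notebook) is that the download command was run in a different environment than the one the kernel uses. A minimal sketch of checking for the model before loading it could look like this; `load_or_explain` is a hypothetical helper written for this example, while `spacy.util.is_package` is the spaCy function that reports whether a package with the given name is installed.

```python
import spacy
from spacy.util import is_package


def load_or_explain(model_name):
    """Load a spaCy model if it is installed; otherwise print a hint and return None.

    Hypothetical helper for illustration, not part of spaCy.
    """
    if not is_package(model_name):
        print("Model '" + model_name + "' is not installed in this environment.")
        print("Install it in the terminal with: python -m spacy download " + model_name)
        return None
    return spacy.load(model_name)


nlp_fr = load_or_explain("fr_core_news_sm")
```

If `nlp_fr` is `None`, the model needs to be installed in the same environment as the kernel, and the kernel restarted, before the cells below can run.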

In [None]:
# The first sentence from Le Petit Prince by Antoine de Saint-Exupéry.

sentence = "Lorsque j'avais six ans j'ai vu, une fois, une magnifique image, dans un livre sur la Forêt Vierge qui s'appelait 'Histoires Vécues'."

sentence_doc = nlp(sentence)


In [None]:
# Let's see what the tokens look like:

for token in sentence_doc:
    print(token.text)

In [None]:
# How about the lemmas?

for token in sentence_doc:
    # An f-string keeps the line readable and also works if spacy.explain() returns None for an unknown tag.
    print(f"The lemma for '{token.text}' is '{token.lemma_}', and is a {token.pos_} = {spacy.explain(token.pos_)}")

# Looks very good, but some things should be explored further: the proper nouns seem not to have been tagged perfectly,
# e.g. 'Histoires Vécues' is the name of the book, but it is understood as two things (noun + proper noun).
#
# We could try the other kind of model spaCy offers (_trf) to see if it does better (according to the spaCy documentation, this model is more accurate but slower).