## First steps with spaCy

First, an example from the spaCy website:

In [2]:
import spacy
import json
from collections import Counter

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

Apple apple PROPN NNP nsubj Xxxxx True False
is be VERB VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. u.k. PROPN NNP compound X.X. False False
startup startup NOUN NN dobj xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False
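
The `shape_` column is the least self-explanatory one in this output. Below is a rough pure-Python approximation of how those values appear to be built: letters become `x` or `X`, digits become `d`, other characters are kept, and runs of the same symbol are cut off after four. The name `approx_shape` is made up for this sketch, and the code is not spaCy's own implementation.

```python
def approx_shape(text):
    """Approximate the word shape shown in the shape_ column (illustrative only)."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # punctuation and symbols are kept as-is
        if symbol == last:
            run += 1
        else:
            run, last = 0, symbol
        if run < 4:  # keep at most four identical symbols in a row
            shape.append(symbol)
    return "".join(shape)


for word in ["Apple", "looking", "U.K.", "$", "1"]:
    print(word, approx_shape(word))
```

For the words in the sentence above this reproduces the printed values, for example `Xxxxx` for `Apple` and `xxxx` for `looking`.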


## NER - Named entity recognition

In [11]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
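
The two numbers printed per entity are character offsets into the original string, with the end offset exclusive. A small check in plain Python, using the offsets copied from the output above, shows that ordinary slicing recovers each entity text:

```python
text = 'Apple is looking at buying U.K. startup for $1 billion'

# (start_char, end_char, label) copied from the output above
spans = [(0, 5, 'ORG'), (27, 31, 'GPE'), (44, 54, 'MONEY')]

for start, end, label in spans:
    # slicing with the offsets gives back the entity text
    print(text[start:end], label)
```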


In [10]:
import spacy
from spacy import displacy

doc = nlp(u'Rats are various medium-sized, long-tailed rodents.')
displacy.render(doc, style='dep', jupyter=True, options={'distance':100})

## Opening a JSON file from the scraped data


First we print the keywords, then the sentences.

In [131]:
with open('0b0de1b0-2f40-413e-922d-4a54df601c35.json', 'r') as json_file:
    data = json.load(json_file)
    
print(data['keywords'])


['Dynamic', 'Indications']
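
The layout of the scraped files is not described here. Judging only from the keys this notebook reads (`keywords`, and `documents[0]['text']` in the next cell), a small loader could look like the sketch below. The function name `load_scraped` and the tolerant handling of missing keys are assumptions made for illustration.

```python
import json


def load_scraped(path):
    """Return (keywords, texts) from one scraped JSON file.

    Assumes the layout implied by this notebook: a 'keywords' list and a
    'documents' list whose items each carry a 'text' field.
    """
    with open(path, 'r', encoding='utf-8') as json_file:
        data = json.load(json_file)
    keywords = data.get('keywords', [])
    # skip documents without a 'text' field instead of raising KeyError
    texts = [d['text'] for d in data.get('documents', []) if 'text' in d]
    return keywords, texts
```

It would be called as `keywords, texts = load_scraped('0b0de1b0-2f40-413e-922d-4a54df601c35.json')`, after which `texts[0]` corresponds to `data['documents'][0]['text']`.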


We let spaCy identify the sentences and print them.

In [130]:
text = data['documents'][0]['text']

doc = nlp(text)
for sent in doc.sents:
    print(sent)

Dynamic indications
Dynamic indications are indications of the ”loudness”, or ”sound volume”, of the music to be performed.

The basic dynamic indications are usually printed in bold ”italics”, and are given in the example below:
These Italian terms are explained here:
· pp, pianissimo: very soft
· p, piano: soft
· mp, mezzo piano: medium soft
· mf, mezzo forte: medium loud
· f, forte: loud
· ff, fortissimo: very loud

In contemporary composed music, the fourfold indications pppp and ffff are quite common, and even up to a sixfold pppppp can be found with a composer like Morton Feldman.
Nevertheless all these indications are quite relative and one should not be mistaken about the apparent precision they seem to imply.
In practice, many more shades and differences than e.g. MIDI allows for (128 steps) are possible, but it makes no sense trying to notate such fine nuances.
When for example a whole passage is indicated mf it will be only a computer to play all these notes at the same volu

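Next, some part-of-speech (POS) tagging. Tokens of a single character are skipped, which removes most punctuation but also drops one-letter tokens such as the marking p.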

In [129]:
for token in doc:
    if len(token.text) > 1:
        print(token.text, token.pos_)

Dynamic ADJ
indications NOUN
Dynamic ADJ
indications NOUN
are VERB
indications NOUN
of ADP
the DET
loudness NOUN
or CCONJ
sound ADJ
volume NOUN
of ADP
the DET
music NOUN
to PART
be VERB
performed VERB
The DET
basic ADJ
dynamic ADJ
indications NOUN
are VERB
usually ADV
printed VERB
in ADP
bold ADJ
italics NOUN
and CCONJ
are VERB
given VERB
in ADP
the DET
example NOUN
below ADV
These DET
Italian ADJ
terms NOUN
are VERB
explained VERB
here ADV
pp PROPN
pianissimo NOUN
very ADV
soft ADJ
piano NOUN
soft ADJ
mp NOUN
mezzo ADJ
piano NOUN
medium ADJ
soft ADJ
mf INTJ
mezzo ADJ
forte NOUN
medium ADJ
loud ADJ
forte NOUN
loud ADJ
ff NOUN
fortissimo NOUN
very ADV
loud ADJ
In ADP
contemporary ADJ
composed VERB
music NOUN
the DET
fourfold ADJ
indications NOUN
pppp NOUN
and CCONJ
ffff ADJ
are VERB
quite ADV
common ADJ
and CCONJ
even ADV
up ADP
to ADP
sixfold ADJ
pppppp NOUN
can VERB
be VERB
found VERB
with ADP
composer NOUN
like ADP
Morton PROPN
Feldman PROPN
Nevertheless ADV
all ADJ
these DET
indicat


Show only nouns and adjectives

In [72]:
for token in doc:
    if token.pos_ in ('NOUN', 'ADJ'):
        print(token.text, token.pos_)

Dynamic ADJ
indications NOUN
Dynamic ADJ
indications NOUN
indications NOUN
loudness NOUN
sound ADJ
volume NOUN
music NOUN
basic ADJ
dynamic ADJ
indications NOUN
bold ADJ
italics NOUN
example NOUN
Italian ADJ
terms NOUN
pianissimo NOUN
soft ADJ
p NOUN
piano NOUN
soft ADJ
mp NOUN
mezzo ADJ
piano NOUN
medium ADJ
soft ADJ
mezzo ADJ
forte NOUN
medium ADJ
loud ADJ
forte NOUN
loud ADJ
ff NOUN
fortissimo NOUN
loud ADJ
contemporary ADJ
music NOUN
fourfold ADJ
indications NOUN
pppp NOUN
ffff ADJ
common ADJ
sixfold ADJ
pppppp NOUN
composer NOUN
all ADJ
indications NOUN
relative ADJ
mistaken ADJ
apparent ADJ
precision NOUN
practice NOUN
many ADJ
more ADJ
shades NOUN
differences NOUN
e.g. ADJ
steps NOUN
possible ADJ
sense NOUN
such ADJ
fine ADJ
nuances NOUN
example NOUN
whole ADJ
passage NOUN
computer NOUN
all ADJ
notes NOUN
same ADJ
volume NOUN
performers NOUN
loudness NOUN
individual ADJ
notes NOUN
practice NOUN
normal ADJ
organic ADJ
element NOUN
human ADJ
music NOUN
styles NOUN
ages NOUN
Other 

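### Only nouns and adjective-noun pairs

An adjective directly followed by a noun is counted as one pair; every other noun is counted on its own. Everything is lowercased, then counted and sorted.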

In [135]:
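def nps(doc):
    """Return lowercase nouns and adjective-noun pairs found in doc."""
    tokenlist = []
    tokenfirst = None  # most recent adjective, if the previous token was one
    for token in doc:
        if token.pos_ == 'ADJ':
            tokenfirst = token
            continue
        if token.pos_ == 'NOUN' and tokenfirst is not None:
            # adjective directly followed by a noun: store them as a pair
            tokenlist.append(tokenfirst.lower_ + ' ' + token.lower_)
            tokenfirst = None
            continue
        if token.pos_ == 'NOUN':
            tokenlist.append(token.lower_)
        tokenfirst = None  # any other token breaks a pending pair
    return tokenlist

wf = Counter(nps(doc))
wf.most_common(8)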

[('example', 5),
 ('dynamic indications', 3),
 ('indications', 3),
 ('loudness', 2),
 ('sound volume', 2),
 ('music', 2),
 ('ff', 2),
 ('practice', 2)]
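
To see what the pairing logic does without loading a model, the sketch below runs the same `nps` function on a few hand-tagged stand-in tokens. The `Tok` namedtuple and the tags assigned to each word are made up for illustration; only the two attributes that `nps` reads (`lower_` and `pos_`) are provided.

```python
from collections import Counter, namedtuple

# Stand-in for a spaCy token with only the attributes nps() reads (hypothetical).
Tok = namedtuple('Tok', ['lower_', 'pos_'])


def nps(doc):
    """Same logic as the nps function above, repeated so this block runs alone."""
    tokenlist = []
    tokenfirst = None
    for token in doc:
        if token.pos_ == 'ADJ':
            tokenfirst = token
            continue
        if token.pos_ == 'NOUN' and tokenfirst is not None:
            tokenlist.append(tokenfirst.lower_ + ' ' + token.lower_)
            tokenfirst = None
            continue
        if token.pos_ == 'NOUN':
            tokenlist.append(token.lower_)
        tokenfirst = None
    return tokenlist


# Hand-tagged tokens, not model output.
tokens = [
    Tok('the', 'DET'), Tok('basic', 'ADJ'), Tok('dynamic', 'ADJ'),
    Tok('indications', 'NOUN'), Tok('are', 'VERB'), Tok('loud', 'ADJ'),
    Tok('.', 'PUNCT'), Tok('music', 'NOUN'),
]
result = nps(tokens)
print(result)
print(Counter(result).most_common(2))
```

Two things are worth noticing: with two adjectives in a row only the last one ends up in the pair, and an adjective that is not directly followed by a noun is dropped.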

Counting nouns only.

This example also shows how to filter out stop words and punctuation, which is not very helpful here.

In [54]:

# all tokens that are neither stop words nor punctuation
words = [token.text for token in doc if not token.is_stop and not token.is_punct]

# noun tokens that aren't stop words or punctuation
nouns = [token.text for token in doc
         if not token.is_stop and not token.is_punct and token.pos_ == "NOUN"]

# five most common tokens
word_freq = Counter(words)
common_words = word_freq.most_common(5)

# five most common noun tokens
noun_freq = Counter(nouns)
common_nouns = noun_freq.most_common(5)
common_nouns

[('indications', 7),
 ('example', 5),
 ('loudness', 3),
 ('volume', 3),
 ('music', 3)]

Entities in the text, skipping single-character entities and those labelled CARDINAL. The result is not really meaningful in this case.

In [36]:
for ent in doc.ents:
    if len(ent.text) > 1 and ent.label_ != 'CARDINAL':
        print(ent.text, ent.start_char, ent.end_char, ent.label_)


Dynamic 19 27 EVENT
Italian 235 242 NORP
Morton Feldman 576 590 PERSON
MIDI 782 786 ORG
all ages 1143 1151 DATE
Italian 1907 1914 NORP
dB11 2535 2539 GPE


## A larger text in Dutch

In [138]:
nlp = spacy.load('nl_core_news_sm')

with open('222ff520-cbb2-4854-9f06-7d8333511fd5.json', 'r') as json_file:
    data = json.load(json_file)
    
print(data['keywords'])

['Onderwijsontwerp', 'Excursie', 'Veldwerk', 'Technologie']


In [140]:
text = data['documents'][0]['text']
doc = nlp(text)
wf = Counter(nps(doc))
wf.most_common(10)

[('studenten', 180),
 ('excursies', 88),
 ('excursie', 57),
 ('docent', 36),
 ('gebruik', 35),
 ('autonomie', 33),
 ('docenten', 31),
 ('digitaal lesmateriaal', 31),
 ('veldwerk', 29),
 ('feedback', 27)]
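
In this list singular and plural forms are counted separately (`excursies` and `excursie`, `docent` and `docenten`). One possible variation, sketched below, is to count nouns by `lemma_` instead of by surface form. The helper name `noun_lemma_counts` is hypothetical, and the stand-in tokens carry hand-assigned lemmas for illustration; they are not output from `nl_core_news_sm`, whose lemmas may differ.

```python
from collections import Counter
from types import SimpleNamespace


def noun_lemma_counts(tokens):
    """Count nouns by lowercase lemma so inflected forms share one entry."""
    return Counter(t.lemma_.lower() for t in tokens if t.pos_ == 'NOUN')


# Stand-in tokens with hand-assigned lemmas (illustrative, not model output).
tokens = [
    SimpleNamespace(text='excursies', lemma_='excursie', pos_='NOUN'),
    SimpleNamespace(text='excursie', lemma_='excursie', pos_='NOUN'),
    SimpleNamespace(text='docenten', lemma_='docent', pos_='NOUN'),
    SimpleNamespace(text='gebruiken', lemma_='gebruiken', pos_='VERB'),
]
counts = noun_lemma_counts(tokens)
print(counts.most_common())
```

With a real document the call would be `noun_lemma_counts(doc)`; how well the forms merge then depends on the quality of the model's lemmatizer.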