# NLP Pipeline with spaCy

[spaCy](https://spacy.io/) is a widely used Python library with a comprehensive feature set for fast text processing in multiple languages.

Using the tokenization and annotation engines requires installing language models. The features we use in this chapter only require the small models; the larger models also include word vectors, which we will cover in chapter 15.

![spaCy](assets/spacy.jpg)

## Setup

### Imports

In [1]:
using Pkg

In [2]:
#Pkg.add("Glob")
#Pkg.add("TextAnalysis")
#Pkg.add("Languages")
#Pkg.add("WordNet")
#Pkg.add("WordTokenizers")
#Pkg.add("StringEncodings")

In [3]:
using PyCall
using Conda

In [4]:
using Glob
using TextAnalysis
using Languages
using DataFrames
using WordNet
using WordTokenizers
using StringEncodings

### spaCy Language Model Installation

In addition to the `spaCy` library, we need [language models](https://spacy.io/usage/models).

#### English

Only need to run once.

In [5]:
#Conda.pip_interop(true)
#Conda.pip("install", "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl")
#Conda.pip("install", "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0.tar.gz")

#### Spanish

The [Spanish language models](https://spacy.io/models/es#es_core_news_sm) are trained on the [AnCora Corpus](http://clic.ub.edu/corpus/) and [WikiNER](http://schwa.org/projects/resources/wiki/Wikiner).

Only need to run once.

In [6]:
#Conda.pip("install", "https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.3.0/es_core_news_sm-3.3.0-py3-none-any.whl")
#Conda.pip("install", "https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.3.0/es_core_news_sm-3.3.0.tar.gz")

## Get Data

- [BBC Articles](http://mlg.ucd.ie/datasets/bbc.html), use raw text files ([download](http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip))
    - Data already included in [data](../data) directory, just unzip before first-time use.
- [TED2013](http://opus.nlpl.eu/TED2013.php), a parallel corpus of TED talk subtitles in 15 languages (sample provided in the `results/TED` subfolder of this directory).

## spaCy Pipeline & Architecture

### The Processing Pipeline

When you call a spaCy model on a text, spaCy

1) tokenizes the text to produce a `Doc` object, and

2) passes the `Doc` object through the processing pipeline, which may be customized and, for the default models, includes
- a tagger,
- a parser, and
- an entity recognizer.

Each pipeline component returns the processed Doc, which is then passed on to the next component.
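
The hand-off between components can be sketched in plain Julia. This is a toy illustration under stated assumptions, not spaCy's implementation: `ToyDoc`, `toy_tokenize`, `length_tagger`, `capital_flagger`, and `run_pipeline` are hypothetical names invented for this example, and the "annotations" are deliberately trivial.

```julia
# Toy stand-in for a Doc: the tokens plus a dictionary of annotations.
struct ToyDoc
    tokens::Vector{String}
    annotations::Dict{Symbol, Any}
end

# Step 1: the tokenizer turns raw text into a document (whitespace split only).
toy_tokenize(text::AbstractString) = ToyDoc(String.(split(text)), Dict{Symbol, Any}())

# Step 2: each component annotates the document in place and returns it.
function length_tagger(doc::ToyDoc)
    doc.annotations[:lengths] = length.(doc.tokens)
    return doc
end

function capital_flagger(doc::ToyDoc)
    doc.annotations[:capitalized] = [isuppercase(first(t)) for t in doc.tokens]
    return doc
end

# Tokenize first, then pass the result through every component in order.
function run_pipeline(text::AbstractString, components)
    doc = toy_tokenize(text)
    for component in components
        doc = component(doc)
    end
    return doc
end

toy_doc = run_pipeline("Apple is looking at buying a startup",
                       [length_tagger, capital_flagger])
println(toy_doc.tokens)
println(toy_doc.annotations[:lengths])
println(toy_doc.annotations[:capitalized])
```

Because every component accepts and returns the same document object, components can be added, removed, or reordered without changing the calling code.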

![Architecture](assets/pipeline.svg)

### Key Data Structures

The central data structures in spaCy are the **Doc** and the **Vocab**. Text annotations are also designed to allow a single source of truth:

- The **`Doc`** object owns the sequence of tokens and all their annotations. `Span` and `Token` are views that point into it. It is constructed by the `Tokenizer`, and then modified in place by the components of the pipeline. 
- The **`Vocab`** object owns a set of look-up tables that make common information available across documents. 
- The **`Language`** object coordinates these components. It takes raw text and sends it through the pipeline, returning an annotated document. It also orchestrates training and serialization.
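
To make the "single source of truth" idea concrete, here is a minimal sketch in plain Julia. It is an analogy rather than spaCy's actual data layout: `ToyVocab`, `ToyDocument`, `ToyToken`, `intern!`, `token_text`, and `token_pos` are hypothetical names used only for illustration.

```julia
# Toy Vocab: a string table shared across documents.
struct ToyVocab
    ids::Dict{String, Int}
    strings::Vector{String}
end
ToyVocab() = ToyVocab(Dict{String, Int}(), String[])

# Return the ID of a string, adding it to the table on first sight.
function intern!(vocab::ToyVocab, s::AbstractString)
    key = String(s)
    return get!(vocab.ids, key) do
        push!(vocab.strings, key)
        length(vocab.strings)   # new ID = position in the string table
    end
end

# Toy Doc: owns the token IDs and their annotations.
struct ToyDocument
    vocab::ToyVocab
    word_ids::Vector{Int}
    pos::Vector{String}
end

# Toy Token: a view that stores only a reference to the document and an index.
struct ToyToken
    doc::ToyDocument
    i::Int
end

token_text(t::ToyToken) = t.doc.vocab.strings[t.doc.word_ids[t.i]]
token_pos(t::ToyToken) = t.doc.pos[t.i]

toy_vocab = ToyVocab()
toy_words = ["Apple", "is", "looking"]
toy_document = ToyDocument(toy_vocab,
                           [intern!(toy_vocab, w) for w in toy_words],
                           ["PROPN", "AUX", "VERB"])
toy_token = ToyToken(toy_document, 1)
println(token_text(toy_token), " / ", token_pos(toy_token))

# A component that modifies the document in place is visible through the view.
toy_document.pos[1] = "NOUN"
println(token_pos(toy_token))
```

The token holds no data of its own, so there is never a second copy of an annotation that could fall out of sync with the document.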

![Architecture](assets/spaCy-architecture.svg)

## spaCy in Action

### Create & Explore the Language Object

Once the model is installed, we can load it and call the resulting language object on a document. spaCy returns a `Doc` object that holds the tokenized text after it has passed through the configurable pipeline components, which by default include a tagger, a parser, and a named-entity recognizer.

In [7]:
#Conda.add("spacy")
# `pyimport` is the function form that PyCall recommends over the `@pyimport` macro
spacy = pyimport("spacy")

In [8]:
nlp = spacy.load("en_core_web_sm")

PyObject <spacy.lang.en.English object at 0x000000007C401BB0>

In [9]:
nlp.lang

"en"

In [10]:
spacy.info("en_core_web_sm")

Dict{Any, Any} with 18 entries:
  "requirements"      => Any[]
  "pipeline"          => ["tok2vec", "tagger", "parser", "attribute_ruler", "le…
  "name"              => "core_web_sm"
  "author"            => "Explosion"
  "version"           => "3.3.0"
  "description"       => "English pipeline optimized for CPU. Components: tok2v…
  "email"             => "contact@explosion.ai"
  "spacy_git_version" => "849bef2de"
  "components"        => ["tok2vec", "tagger", "parser", "senter", "attribute_r…
  "url"               => "https://explosion.ai"
  "vectors"           => Dict{Any, Any}("name"=>nothing, "keys"=>0, "width"=>0,…
  "spacy_version"     => ">=3.3.0.dev0,<3.4.0"
  "source"            => "C:\\Users\\Amirreza\\.julia\\conda\\3\\lib\\site-pack…
  "labels"            => Dict{Any, Any}("tagger"=>["\$", "''", ",", "-LRB-", "-…
  "disabled"          => ["senter"]
  "lang"              => "en"
  "license"           => "MIT"
  "sources"           => Dict{Any, Any}[Dict("name"=>"OntoNotes 5

### Explore the Pipeline

Let’s illustrate the pipeline using a simple sentence:

In [11]:
sample_text = "Apple is looking at buying U.K. startup for \$1 billion"
doc = nlp(sample_text)

PyObject Apple is looking at buying U.K. startup for $1 billion

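In [12]:
# `Doc.is_parsed`, `is_sentenced`, and `is_tagged` are deprecated in spaCy v3;
# `Doc.has_annotation` is the replacement
doc.has_annotation("DEP")

true

In [13]:
doc.has_annotation("SENT_START")

true

In [14]:
doc.has_annotation("TAG")

true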

In [15]:
doc.text

"Apple is looking at buying U.K. startup for \$1 billion"

In [16]:
doc.vocab.length

781

#### Explore `Token` annotations

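The parsed document is iterable, and each token has numerous attributes produced by the processing pipeline. The samples below illustrate how to access the token text, lemma, coarse- and fine-grained part-of-speech tags, dependency label, word shape, and the `is_alpha` and `is_stop` flags: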

In [17]:
DataFrame(token = [token.text for token ∈ doc])

| Row | token |
|-----|--------|
|     | String |
| 1 | Apple |
| 2 | is |
| 3 | looking |
| 4 | at |
| 5 | buying |
| 6 | U.K. |
| 7 | startup |
| 8 | for |
| 9 | \$ |
| 10 | 1 |


In [18]:
sample_info = [[t.text, t.lemma_, t.pos_, t.tag_, t.dep_, t.shape_, t.is_alpha, t.is_stop] for t ∈ doc]

11-element Vector{Vector{Any}}:
 ["Apple", "Apple", "PROPN", "NNP", "nsubj", "Xxxxx", true, false]
 ["is", "be", "AUX", "VBZ", "aux", "xx", true, true]
 ["looking", "look", "VERB", "VBG", "ROOT", "xxxx", true, false]
 ["at", "at", "ADP", "IN", "prep", "xx", true, true]
 ["buying", "buy", "VERB", "VBG", "pcomp", "xxxx", true, false]
 ["U.K.", "U.K.", "PROPN", "NNP", "compound", "X.X.", false, false]
 ["startup", "startup", "NOUN", "NN", "dobj", "xxxx", true, false]
 ["for", "for", "ADP", "IN", "prep", "xxx", true, true]
 ["\$", "\$", "SYM", "\$", "quantmod", "\$", false, false]
 ["1", "1", "NUM", "CD", "compound", "d", false, false]
 ["billion", "billion", "NUM", "CD", "pobj", "xxxx", true, false]

In [19]:
sample_df = DataFrame(text = getindex.(sample_info, 1),
    lemma = getindex.(sample_info, 2),
    pos = getindex.(sample_info, 3),
    tag = getindex.(sample_info, 4),
    dep = getindex.(sample_info, 5),
    shape = getindex.(sample_info, 6),
    is_alpha = getindex.(sample_info, 7),
    is_stop = getindex.(sample_info, 8))

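| Row | text | lemma | pos | tag | dep | shape | is_alpha | is_stop |
|-----|------|-------|-----|-----|-----|-------|----------|---------|
|     | String | String | String | String | String | String | Bool | Bool |
| 1 | Apple | Apple | PROPN | NNP | nsubj | Xxxxx | 1 | 0 |
| 2 | is | be | AUX | VBZ | aux | xx | 1 | 1 |
| 3 | looking | look | VERB | VBG | ROOT | xxxx | 1 | 0 |
| 4 | at | at | ADP | IN | prep | xx | 1 | 1 |
| 5 | buying | buy | VERB | VBG | pcomp | xxxx | 1 | 0 |
| 6 | U.K. | U.K. | PROPN | NNP | compound | X.X. | 0 | 0 |
| 7 | startup | startup | NOUN | NN | dobj | xxxx | 1 | 0 |
| 8 | for | for | ADP | IN | prep | xxx | 1 | 1 |
| 9 | \$ | \$ | SYM | \$ | quantmod | \$ | 0 | 0 |
| 10 | 1 | 1 | NUM | CD | compound | d | 0 | 0 |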
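
The `shape` values in the output above can be read as a coarse orthographic signature: uppercase letters map to `X`, lowercase letters to `x`, digits to `d`, and other characters are kept as they are. The sketch below is an approximate re-implementation in plain Julia that reproduces the shapes listed above. `word_shape` is a hypothetical helper, and the rule that runs of the same class are cut off after four characters is an assumption about spaCy's behavior, not its actual code.

```julia
# Approximate word-shape function (assumption: runs longer than four are truncated).
function word_shape(text::AbstractString)
    shape = Char[]
    prev = nothing      # shape character of the previous position
    runlen = 0          # how many times `prev` has repeated so far
    for c in text
        s = isletter(c) ? (isuppercase(c) ? 'X' : 'x') : (isdigit(c) ? 'd' : c)
        if s == prev
            runlen += 1
        else
            runlen = 0
            prev = s
        end
        # Keep at most four consecutive characters of the same class.
        runlen < 4 && push!(shape, s)
    end
    return String(shape)
end

for word in ["Apple", "looking", "U.K.", "1", "billion"]
    println(rpad(word, 10), word_shape(word))
end
```

Collapsing long runs means that words of different lengths, such as "looking" and "billion", share one shape, which keeps the number of distinct shapes small.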


#### Visualize POS Dependencies

We can use displaCy, spaCy's built-in visualizer, to render the syntactic dependencies in a browser or notebook:

In [20]:
options = Dict("compact" => true, "bg" => "white", "color" => "black", "font" => "Source Sans Pro", "notebook" => true)

Dict{String, Any} with 5 entries:
  "bg"       => "white"
  "notebook" => true
  "compact"  => true
  "color"    => "black"
  "font"     => "Source Sans Pro"

In [21]:
display(HTML(spacy.displacy.render(doc, style="dep", options=options)))

#### Visualize Named Entities

In [22]:
display(HTML(spacy.displacy.render(doc, style="ent")))

### Read BBC Data

We will now read a larger set of 2,225 BBC News articles (see GitHub for data source details) that belong to five categories and are stored in individual text files. We
- call `Glob.glob` to collect the paths of all article files,
- iterate over the resulting list of paths,
- read all lines of each news article, excluding the heading in the first line, and
- append the cleaned result to a list.

In [23]:
files = Glob.glob("../data/bbc/bbc/**/*.txt")

2225-element Vector{String}:
 "..\\data\\bbc\\bbc\\business\\001.txt"
 "..\\data\\bbc\\bbc\\business\\002.txt"
 "..\\data\\bbc\\bbc\\business\\003.txt"
 "..\\data\\bbc\\bbc\\business\\004.txt"
 "..\\data\\bbc\\bbc\\business\\005.txt"
 "..\\data\\bbc\\bbc\\business\\006.txt"
 "..\\data\\bbc\\bbc\\business\\007.txt"
 "..\\data\\bbc\\bbc\\business\\008.txt"
 "..\\data\\bbc\\bbc\\business\\009.txt"
 "..\\data\\bbc\\bbc\\business\\010.txt"
 "..\\data\\bbc\\bbc\\business\\011.txt"
 "..\\data\\bbc\\bbc\\business\\012.txt"
 "..\\data\\bbc\\bbc\\business\\013.txt"
 ⋮
 "..\\data\\bbc\\bbc\\tech\\390.txt"
 "..\\data\\bbc\\bbc\\tech\\391.txt"
 "..\\data\\bbc\\bbc\\tech\\392.txt"
 "..\\data\\bbc\\bbc\\tech\\393.txt"
 "..\\data\\bbc\\bbc\\tech\\394.txt"
 "..\\data\\bbc\\bbc\\tech\\395.txt"
 "..\\data\\bbc\\bbc\\tech\\396.txt"
 "..\\data\\bbc\\bbc\\tech\\397.txt"
 "..\\data\\bbc\\bbc\\tech\\398.txt"
 "..\\data\\bbc\\bbc\\tech\\399.txt"
 "..\\data\\bbc\\bbc\\tech\\400.txt"
 "..\\data\\bbc\\bbc\\tech\\

In [24]:
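bbc_articles = String[]
for file in files
    # The do-block closes the file automatically, even if reading fails
    open(file, "r") do f
        # Decode the file as Latin-1 and convert it to UTF-8
        lines = readlines(StringDecoder(f, "LATIN1", "UTF-8"))
        # Drop the heading (first line); the remaining lines are joined without a separator
        body = strip(join(strip.(lines[2:end])))
        push!(bbc_articles, body)
    end
end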

In [25]:
length(bbc_articles)

2225

In [26]:
bbc_articles[1]

"Quarterly profits at US media giant TimeWarner jumped 76% to \$1.13bn (Â£600m) for the three months to December, from \$639m year-earlier.The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher adve" ⋯ 1976 bytes ⋯ "00m. It intends to adjust the way it accounts for a deal with German music publisher Bertelsmann's purchase of a stake in AOL Europe, which it had reported as advertising revenue. It will now book the sale of its stake in AOL Europe as a loss on the value of that stake."

### Parse first article through Pipeline

In [27]:
nlp.pipe_names

6-element Vector{String}:
 "tok2vec"
 "tagger"
 "parser"
 "attribute_ruler"
 "lemmatizer"
 "ner"

In [28]:
doc = nlp(bbc_articles[1])
# `Doc.is_parsed` is deprecated in spaCy v3; check for dependency annotations instead
doc.has_annotation("DEP")

true

### Detect sentence boundary

spaCy computes sentence boundaries from the syntactic parse tree, so punctuation and capitalization play an important but not decisive role. As a result, boundaries will usually at least coincide with clause boundaries, even for poorly punctuated text.

We can access the parsed sentences using the `.sents` attribute:

In [29]:
sentences = [s for s ∈ doc.sents]
sentences[1:3]

3-element Vector{PyObject}:
 PyObject Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (Â£600m) for the three months to December, from $639m year-earlier.
 PyObject The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales.
 PyObject TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn.

In [30]:
first_sent_info = [[t.text, t.pos_, spacy.explain(t.pos_)] for t in sentences[1]]
first_sent_df = DataFrame(Token = getindex.(first_sent_info, 1),
    POS_Tag = getindex.(first_sent_info, 2),
    Meaning = getindex.(first_sent_info, 3))
first(first_sent_df, 15)

| Row | Token | POS_Tag | Meaning |
|-----|-------|---------|---------|
|     | String | String | String |
| 1 | Quarterly | ADJ | adjective |
| 2 | profits | NOUN | noun |
| 3 | at | ADP | adposition |
| 4 | US | PROPN | proper noun |
| 5 | media | NOUN | noun |
| 6 | giant | NOUN | noun |
| 7 | TimeWarner | PROPN | proper noun |
| 8 | jumped | VERB | verb |
| 9 | 76 | NUM | numeral |
| 10 | % | NOUN | noun |


In [31]:
options = Dict("compact" => true, "bg" => "#09a3d5",
           "color" => "white", "font" => "Source Sans Pro")
display(HTML(spacy.displacy.render(sentences[1].as_doc(), style="dep", options=options)))

In [32]:
for t ∈ sentences[1]
    if !(isnothing(spacy.explain(t.ent_type_)))
        println("$(t.text) | $(t.ent_type_) | $(spacy.explain(t.ent_type_))")
    end
end

Quarterly | DATE | Absolute or relative dates or periods
US | GPE | Countries, cities, states
TimeWarner | ORG | Companies, agencies, institutions, etc.
76 | PERCENT | Percentage, including "%"
% | PERCENT | Percentage, including "%"
1.13bn | MONEY | Monetary values, including unit
the | DATE | Absolute or relative dates or periods
three | DATE | Absolute or relative dates or periods
months | DATE | Absolute or relative dates or periods
to | DATE | Absolute or relative dates or periods
December | DATE | Absolute or relative dates or periods
639 | MONEY | Monetary values, including unit
year | DATE | Absolute or relative dates or periods
- | DATE | Absolute or relative dates or periods
earlier | DATE | Absolute or relative dates or periods




In [33]:
display(HTML(spacy.displacy.render(sentences[1].as_doc(), style="ent")))

### Named-Entity Recognition with textacy

In [34]:
#Conda.add("textacy")
# `pyimport` is the function form that PyCall recommends over the `@pyimport` macro
textacy = pyimport("textacy")

spaCy exposes the entity label of each token through the `.ent_type_` attribute, as used above.

textacy makes it easy to access the named entities that appear in the first article:

In [35]:
entities = [e.text for e in textacy.extract.entities(doc)]

64-element Vector{String}:
 "Quarterly"
 "US"
 "TimeWarner"
 "76%"
 "1.13bn"
 "three months to December"
 "639"
 "year-earlier"
 "Google"
 "TimeWarner"
 "fourth quarter"
 "2%"
 "11.1bn"
 ⋮
 "TimeWarner"
 "around 5%"
 "TimeWarner"
 "AOL"
 "US"
 "300"
 "SEC"
 "500"
 "German"
 "Bertelsmann"
 "AOL Europe"
 "AOL Europe"

In [36]:
entity_count_dict = Dict(entity => count(isequal(entity), entities) for entity ∈ unique(entities))
count_list = [entity_count_dict[entity] for entity in unique(entities)]

entity_count_df = sort(DataFrame(entity = unique(entities), count = count_list), :count, rev=true)
DataFrames.show(entity_count_df, allcols=true)

44×2 DataFrame
 Row │ entity           count
     │ String           Int64
─────┼────────────────────────
   1 │ TimeWarner           7
   2 │ AOL                  4
   3 │ fourth quarter       3
   4 │ US                   2
   5 │ year-earlier         2
   6 │ Google               2
   7 │ 8%                   2
   8 │ 2003                 2
   9 │ SEC                  2
  10 │ 27%                  2
  11 │ full-year            2
  ⋮  │        ⋮           ⋮
  35 │ 3.36bn               1
  36 │ 6.4%                 1
  37 │ 42.09bn              1
  38 │ Richard Parsons      1
  39 │ 2005                 1
  40 │ around 5%            1
  41 │ 300                  1
  42 │ 500                  1
  43 │ German               1
  44 │ Bertelsmann          1
               23 rows omitted
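
The counting step above scans the full entity list once per unique entity. For longer documents, a single pass over the list is enough. The following is a minimal sketch in plain Julia: `count_items` is a hypothetical helper, and `sample_entities` is a small made-up list rather than the article's actual entities.

```julia
# Count occurrences in one pass by incrementing a dictionary entry per item.
function count_items(items)
    counts = Dict{String, Int}()
    for item in items
        counts[item] = get(counts, item, 0) + 1
    end
    return counts
end

# Made-up sample data for illustration only.
sample_entities = ["TimeWarner", "US", "TimeWarner", "AOL", "US", "TimeWarner"]
sample_counts = count_items(sample_entities)

# Sort the (entity => count) pairs by count, largest first.
ranked_entities = sort(collect(sample_counts), by = last, rev = true)
for (entity, n) in ranked_entities
    println(rpad(entity, 12), n)
end
```

The resulting pairs can be turned into a `DataFrame` in the same way as above. If StatsBase.jl is available, its `countmap` function performs the same tally in one call.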

### N-Grams with textacy

N-grams combine N consecutive tokens. This can be useful for the bag-of-words model because, depending on the textual context, treating, e.g., ‘data scientist’ as a single token may be more meaningful than the two distinct tokens ‘data’ and ‘scientist’.
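
As a rough illustration of the idea, the sketch below builds n-grams with a sliding window in plain Julia and keeps those that occur at least a minimum number of times. `toy_ngrams` and `frequent_ngrams` are hypothetical helpers, and the token list is made up. The sketch also ignores the additional filtering (for example, of stop words and punctuation) that a library may apply.

```julia
# Slide a window of length n over the tokens and join each window with a space.
function toy_ngrams(tokens::Vector{String}, n::Int)
    return [join(tokens[i:i+n-1], " ") for i in 1:(length(tokens) - n + 1)]
end

# Keep every occurrence of the n-grams that appear at least `min_freq` times.
function frequent_ngrams(tokens::Vector{String}, n::Int; min_freq::Int = 2)
    grams = toy_ngrams(tokens, n)
    counts = Dict{String, Int}()
    for g in grams
        counts[g] = get(counts, g, 0) + 1
    end
    return [g for g in grams if counts[g] >= min_freq]
end

# Made-up tokens for illustration only.
sample_tokens = ["fourth", "quarter", "profits", "rose", "while",
                 "fourth", "quarter", "sales", "fell"]
println(toy_ngrams(sample_tokens, 2))
println(frequent_ngrams(sample_tokens, 2; min_freq = 2))
```

Returning each occurrence rather than each distinct n-gram keeps positional information, at the cost of needing a separate counting step to summarize the result.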

textacy makes it easy to view the n-grams of a given length `n` that occur at least `min_freq` times:

In [37]:
ngrams = [n.text for n ∈ textacy.extract.ngrams(doc, n=2, min_freq=2)]

9-element Vector{String}:
 "fourth quarter"
 "fourth quarter"
 "quarter profits"
 "company said"
 "fourth quarter"
 "quarter profits"
 "company said"
 "AOL Europe"
 "AOL Europe"

In [38]:
ngram_count_dict = Dict(ngram => count(isequal(ngram), ngrams) for ngram ∈ unique(ngrams))
count_list = [ngram_count_dict[ngram] for ngram in unique(ngrams)]

ngram_count_df = sort(DataFrame(ngram = unique(ngrams), count = count_list), :count, rev=true)
DataFrames.show(ngram_count_df, allcols=true)

4×2 DataFrame
 Row │ ngram            count
     │ String           Int64
─────┼────────────────────────
   1 │ fourth quarter       3
   2 │ quarter profits      2
   3 │ company said         2
   4 │ AOL Europe           2

### Multi-language Features

spaCy includes trained language models for English, German, Spanish, Portuguese, French, Italian, and Dutch, among other languages, as well as a multi-language model for named-entity recognition. Cross-language usage is straightforward because the API does not change.

We will illustrate the Spanish language model using a parallel corpus of TED talk subtitles. For this purpose, we instantiate both language models.

#### Create a Spanish Language Object

In [39]:
model = Dict()
for language_model ∈ ["en_core_web_sm", "es_core_news_sm"]
    model[language_model[1:2]] = spacy.load(language_model)
end

model

Dict{Any, Any} with 2 entries:
  "en" => PyObject <spacy.lang.en.English object at 0x000000007F4DFAF0>
  "es" => PyObject <spacy.lang.es.Spanish object at 0x000000008AADF280>

#### Read bilingual TED2013 samples

In [40]:
text = Dict()
for language ∈ ["en", "es"]
    file_name = "data/TED/TED2013_sample." * language
    # Passing the path lets `readlines` open and close the file itself;
    # only the first line of each sample is kept
    text[language] = readlines(file_name)[1]
end

text

Dict{Any, Any} with 2 entries:
  "en" => "There's a tight and surprising link between the ocean's health and o…
  "es" => "Existe una estrecha y sorprendente relación entre nuestra salud y la…

#### Sentence Boundaries English vs Spanish

In [41]:
parsed = Dict()
sentences = Dict()
for language ∈ ["en", "es"]
    nlp = model[language]
    parsed[language] = nlp(text[language])
    sentences[language] = [sent for sent ∈ parsed[language].sents]
    println("Sentences: $(language), $(length(sentences[language]))")
end

Sentences: en, 19
Sentences: es, 22


In [42]:
for (i,(en, es)) ∈ enumerate(zip(sentences["en"], sentences["es"]))
    println("\n", i)
    # `.text` returns the sentence string directly, without the "PyObject " prefix
    println("English:\t $(en.text)")
    println("Spanish:\t $(es.text)")
    if i > 5 
        break
    end
end


1
English:	 There's a tight and surprising link between the ocean's health and ours, says marine biologist Stephen Palumbi.
Spanish:	 Existe una estrecha y sorprendente relación entre nuestra salud y la salud del océano, dice el biologo marino Stephen Palumbi.

2
English:	 He shows how toxins at the bottom of the ocean food chain find their way into our bodies, with a shocking story of toxic contamination from a Japanese fish market.
Spanish:	 Nos muestra, através de una impactante historia acerca de la contaminación tóxica en el mercado pesquero japonés, como las toxinas de la cadena alimenticia del fondo oceánico llegan a nuestro cuerpo.

3
English:	 His work points a way forward for saving the oceans' health -- and humanity's.
Spanish:	 fish,health,mission blue,oceans,science 899 Stephen Palumbi: Siguiendo el camino del mercurio.

4
English:	 fish,health,mission blue,oceans,science 899 Stephen Palumbi: Following the mercury trail It can be a very complicated thing, the ocean.
Spani

#### POS Tagging English vs Spanish

In [43]:
pos = Dict()
for language ∈ ["en", "es"]
    lang_sample_info = [[t.text, t.pos_, spacy.explain(t.pos_)] for t ∈ sentences[language][1]]
    lang_sample_df = DataFrame(Token = getindex.(lang_sample_info, 1),
                            POS_Tag = getindex.(lang_sample_info, 2),
                            Meaning = getindex.(lang_sample_info, 3))
    pos[language] = lang_sample_df
end

pos

Dict{Any, Any} with 2 entries:
  "en" => 21×3 DataFrame…
  "es" => 22×3 DataFrame…

In [44]:
bilingual_parsed = hcat(pos["en"][1:15, :], pos["es"][1:15, :], makeunique=true)
bilingual_parsed

| Row | Token | POS_Tag | Meaning | Token_1 | POS_Tag_1 | Meaning_1 |
|-----|-------|---------|---------|---------|-----------|-----------|
|     | String | String | String | String | String | String |
| 1 | There | PRON | pronoun | Existe | VERB | verb |
| 2 | 's | VERB | verb | una | DET | determiner |
| 3 | a | DET | determiner | estrecha | ADJ | adjective |
| 4 | tight | ADJ | adjective | y | CCONJ | coordinating conjunction |
| 5 | and | CCONJ | coordinating conjunction | sorprendente | ADJ | adjective |
| 6 | surprising | ADJ | adjective | relación | NOUN | noun |
| 7 | link | NOUN | noun | entre | ADP | adposition |
| 8 | between | ADP | adposition | nuestra | DET | determiner |
| 9 | the | DET | determiner | salud | NOUN | noun |
| 10 | ocean | NOUN | noun | y | CCONJ | coordinating conjunction |


In [45]:
display(HTML(spacy.displacy.render(sentences["es"][1].as_doc(), style="dep", options=options)))