# Tokenization

[Tokenization](https://spacy.io/usage/linguistic-features#tokenization) splits an input text into meaningful segments such as words and punctuation marks. These segments are called *tokens*.

In [1]:
import pandas as pd
import spacy

In [2]:
# Load NLP models (both English and Japanese)
enlp = spacy.load('en_core_web_trf')
jnlp = spacy.load('ja_core_news_lg')

In [3]:
# Create a string that includes opening and closing quotation marks
estring1 = '"We\'re moving to L.A.!"'
print(estring1)

"We're moving to L.A.!"


In [4]:
jstring1 = '「２０２０年の東京オリンピックが成功になるように！」'
print(jstring1)

「２０２０年の東京オリンピックが成功になるように！」


In [5]:
edoc1 = enlp(estring1)
jdoc1 = jnlp(jstring1)

In [6]:
for token in edoc1:
    print(token.text)

"
We
're
moving
to
L.A.
!
"


In [7]:
for token in jdoc1:
    print(token.text)

「
２０２０
年
の
東京
オリンピック
が
成功
に
なる
よう
に
！
」


In [8]:
def tokensInfo(doc: spacy.tokens.doc.Doc):
    # Extract and save information of each token as a child array
    # of the parent tokens_info[] array
    tokens_info = []
    for token in doc:
        tokens_info.append([token.text, token.lemma_, token.pos_, token.tag_,
                            token.dep_, token.shape_, token.is_alpha, token.is_stop])

    # Table header
    headers = ["Text", "Lemma", "POS", "Tag", "Dep", "Shape", "Is Alpha", "Is Stop"]
    
    # Create and return a Pandas DataFrame containing information of all tokens
    table = pd.DataFrame(columns=headers, data=tokens_info)
    return table

In [9]:
edoc2 = enlp(u'New York, often called New York City to distinguish it from New York State, \
or NYC for short, is the most populous city in the United States. With a 2020 population of \
8,804,190 distributed over 300.46 square miles (778.2 km2), New York City is also the most \
densely populated major city in the United States.')
print(edoc2)

New York, often called New York City to distinguish it from New York State, or NYC for short, is the most populous city in the United States. With a 2020 population of 8,804,190 distributed over 300.46 square miles (778.2 km2), New York City is also the most densely populated major city in the United States.


In [10]:
tokensInfo(edoc2).head(20)

Unnamed: 0,Text,Lemma,POS,Tag,Dep,Shape,Is Alpha,Is Stop
0,New,New,PROPN,NNP,compound,Xxx,True,False
1,York,York,PROPN,NNP,nsubj,Xxxx,True,False
2,",",",",PUNCT,",",punct,",",False,False
3,often,often,ADV,RB,advmod,xxxx,True,True
4,called,call,VERB,VBN,acl,xxxx,True,False
5,New,New,PROPN,NNP,compound,Xxx,True,False
6,York,York,PROPN,NNP,compound,Xxxx,True,False
7,City,City,PROPN,NNP,oprd,Xxxx,True,False
8,to,to,PART,TO,aux,xx,True,True
9,distinguish,distinguish,VERB,VB,advcl,xxxx,True,False


In [11]:
jdoc2 = jnlp(u'松野官房長官は、30日午後の記者会見で、アフリカ南部のナミビアから入国した30代の男性が、\
新型コロナの新たな変異ウイルス「オミクロン株」に感染していたことが確認されたことを明らかにしました。\
日本国内で、オミクロン株の感染者が確認されたのは初めてです。\
この中で松野官房長官は「ナミビアからの入国者について、国立感染症研究所で陽性検体のゲノム解析を行ったところ、\
オミクロン株であると確認されたとの1報が、厚生労働省からあった」と述べ、アフリカ南部のナミビアから入国した\
30代の男性が、新型コロナの新たな変異ウイルス「オミクロン株」に感染していたことが確認されたことを明らかにしました。')
print(jdoc2)

松野官房長官は、30日午後の記者会見で、アフリカ南部のナミビアから入国した30代の男性が、新型コロナの新たな変異ウイルス「オミクロン株」に感染していたことが確認されたことを明らかにしました。日本国内で、オミクロン株の感染者が確認されたのは初めてです。この中で松野官房長官は「ナミビアからの入国者について、国立感染症研究所で陽性検体のゲノム解析を行ったところ、オミクロン株であると確認されたとの1報が、厚生労働省からあった」と述べ、アフリカ南部のナミビアから入国した30代の男性が、新型コロナの新たな変異ウイルス「オミクロン株」に感染していたことが確認されたことを明らかにしました。


In [12]:
tokensInfo(jdoc2).head(50)

Unnamed: 0,Text,Lemma,POS,Tag,Dep,Shape,Is Alpha,Is Stop
0,松野,松野,PROPN,名詞-固有名詞-人名-姓,compound,xx,True,False
1,官房,官房,NOUN,名詞-普通名詞-一般,compound,xx,True,False
2,長官,長官,NOUN,名詞-普通名詞-一般,nsubj,xx,True,False
3,は,は,ADP,助詞-係助詞,case,x,True,True
4,、,、,PUNCT,補助記号-読点,punct,、,False,False
5,30,30,NUM,名詞-数詞,nummod,dd,False,False
6,日,日,NOUN,名詞-普通名詞-助数詞可能,compound,x,True,False
7,午後,午後,NOUN,名詞-普通名詞-副詞可能,nmod,xx,True,False
8,の,の,ADP,助詞-格助詞,case,x,True,True
9,記者,記者,NOUN,名詞-普通名詞-一般,compound,xx,True,False


### Prefixes, Suffixes, Infixes and Exceptions

- **Prefix**:	Character(s) at the beginning &#9656; `$ ( “ ¿`
- **Suffix**:	Character(s) at the end &#9656; `km ) , . ! ”`
- **Infix**:	Character(s) in between &#9656; `- -- / ...`
- **Exception**: Special-case rule that either splits a string into several tokens or prevents a token from being split when punctuation rules are applied &#9656; `St. U.S.`. For example, “don’t” contains no whitespace but should be split into two tokens, “do” and “n’t”, while “U.K.” should always remain one token.
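
To make the interplay of these four rule types concrete, here is a minimal pure-Python sketch of a rule-based tokenizer. It is not spaCy's implementation: the names `PREFIXES`, `SUFFIXES`, `INFIX_RE`, `EXCEPTIONS`, `tokenize_chunk` and `tokenize` are hypothetical, and the rule sets are tiny hand-picked assumptions. It only illustrates the general idea of checking exceptions first, peeling off prefixes and suffixes, and splitting the remainder on infixes.

```python
import re

# Hypothetical, heavily simplified rule sets; spaCy's real ones are far larger.
PREFIXES = ("$", "(", '"', "“", "¿")
SUFFIXES = ("km", ")", ",", ".", "!", '"', "”")
INFIX_RE = re.compile(r"(--|\.\.\.|[-/])")
EXCEPTIONS = {
    "don't": ["do", "n't"],   # split into two tokens
    "We're": ["We", "'re"],   # split into two tokens
    "U.K.": ["U.K."],         # keep as a single token
    "L.A.": ["L.A."],         # keep as a single token
}


def tokenize_chunk(chunk):
    """Split one whitespace-delimited chunk into a list of token strings."""
    leading, trailing = [], []
    while chunk:
        # Exceptions win over the punctuation rules.
        if chunk in EXCEPTIONS:
            return leading + EXCEPTIONS[chunk] + trailing[::-1]
        prefix = next((p for p in PREFIXES if chunk.startswith(p)), None)
        if prefix:
            leading.append(prefix)
            chunk = chunk[len(prefix):]
            continue
        suffix = next((s for s in SUFFIXES if chunk.endswith(s)), None)
        if suffix:
            trailing.append(suffix)
            chunk = chunk[:-len(suffix)]
            continue
        break
    # Whatever is left is split on infixes; the infixes become tokens too.
    middle = [piece for piece in INFIX_RE.split(chunk) if piece]
    return leading + middle + trailing[::-1]


def tokenize(text):
    """Split on whitespace first, then apply the rules to each chunk."""
    tokens = []
    for chunk in text.split():
        tokens.extend(tokenize_chunk(chunk))
    return tokens


print(tokenize('"We\'re moving to L.A.!"'))
print(tokenize("(tax-free)."))
```

With these toy rules, the first call yields the same eight tokens that `enlp` produced for `estring1` above, and the second shows the hyphen being split off as an infix.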

## Counting Tokens

In [13]:
len(edoc1)

8

In [14]:
len(jdoc1)

14

In [15]:
len(jdoc2)

180

## Counting Vocab Entries

In [16]:
len(edoc1.vocab)

800

In [17]:
len(jdoc1.vocab)

79

In [18]:
len(jdoc2.vocab)

79

In [19]:
enlp.vocab.length

800

In [20]:
jnlp.vocab.length

79

## Named Entities

[Named entities](https://spacy.io/usage/linguistic-features#named-entities) are available through the `ents` property of a `Doc` and represent *real-world* objects, e.g., a person, a country, a product or a book title.

In [21]:
def namedEntitiesInfo(doc: spacy.tokens.doc.Doc):
    entities_info = []
    
    for ent in doc.ents:
        entities_info.append([ent, ent.text, ent.label_, spacy.explain(ent.label_)])
    
    headers = ['Entity', '.text', '.label_', 'Explanation']
    entities_df = pd.DataFrame(data=entities_info, columns=headers)
    return entities_df

In [22]:
namedEntitiesInfo(edoc2)

Unnamed: 0,Entity,.text,.label_,Explanation
0,"(New, York)",New York,GPE,"Countries, cities, states"
1,"(New, York, City)",New York City,GPE,"Countries, cities, states"
2,"(New, York, State)",New York State,GPE,"Countries, cities, states"
3,(NYC),NYC,GPE,"Countries, cities, states"
4,"(the, United, States)",the United States,GPE,"Countries, cities, states"
5,(2020),2020,DATE,Absolute or relative dates or periods
6,"(8,804,190)","8,804,190",CARDINAL,Numerals that do not fall under another type
7,"(300.46, square, miles)",300.46 square miles,QUANTITY,"Measurements, as of weight or distance"
8,"(778.2, km2)",778.2 km2,QUANTITY,"Measurements, as of weight or distance"
9,"(New, York, City)",New York City,GPE,"Countries, cities, states"


<font color=green>Note that `spaCy` recognizes `New York City` as a `GPE` entity (countries, cities, states) and `300.46 square miles` as a `QUANTITY` entity.</font>

In [23]:
namedEntitiesInfo(jdoc2)

Unnamed: 0,Entity,.text,.label_,Explanation
0,(松野),松野,PERSON,"People, including fictional"
1,"(官房, 長官)",官房長官,TITLE_AFFIX,
2,"(30, 日, 午後)",30日午後,DATE,Absolute or relative dates or periods
3,(アフリカ),アフリカ,GPE,"Countries, cities, states"
4,(ナミビア),ナミビア,GPE,"Countries, cities, states"
5,"(30, 代)",30代,QUANTITY,"Measurements, as of weight or distance"
6,"(オミクロン, 株)",オミクロン株,PRODUCT,"Objects, vehicles, foods, etc. (not services)"
7,(日本),日本,GPE,"Countries, cities, states"
8,"(オミクロン, 株)",オミクロン株,PRODUCT,"Objects, vehicles, foods, etc. (not services)"
9,(松野),松野,PERSON,"People, including fictional"


<font color=green>Note how the named entity `30日午後` is formed from the three tokens `(30, 日, 午後)`, and how the two tokens `(30, 代)` are combined into the single entity `30代`.</font>
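
One way to picture how several tokens end up in a single entity is the IOB scheme: each token is tagged as Beginning an entity, Inside one, or Outside any entity. spaCy exposes these per-token tags as `token.ent_iob_` and `token.ent_type_`. The sketch below is plain Python, not spaCy code: `group_entities` is a hypothetical helper, and the tag sequence is written by hand for illustration rather than taken from the model's output.

```python
def group_entities(tagged_tokens, joiner=""):
    """Merge (text, iob, label) triples into (entity_text, label) pairs.

    joiner is "" here because Japanese tokens are not separated by spaces;
    for English text, " " would be the natural choice.
    """
    entities = []
    current, current_label = [], None
    for text, iob, label in tagged_tokens:
        # A stray "I" with no open entity is treated like a "B".
        starts_entity = iob == "B" or (iob == "I" and not current)
        if starts_entity or iob == "O":
            # Close the entity that was being built, if any.
            if current:
                entities.append((joiner.join(current), current_label))
            current, current_label = ([text], label) if starts_entity else ([], None)
        else:
            # iob == "I": the token continues the current entity.
            current.append(text)
    if current:
        entities.append((joiner.join(current), current_label))
    return entities


# Hand-written tags for illustration only (not actual model output).
tagged = [
    ("30", "B", "DATE"), ("日", "I", "DATE"), ("午後", "I", "DATE"),
    ("の", "O", ""),
    ("30", "B", "QUANTITY"), ("代", "I", "QUANTITY"),
]
print(group_entities(tagged))
```

The three `DATE` tokens collapse into `30日午後` and the two `QUANTITY` tokens into `30代`, mirroring the rows in the table above.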

## Noun Chunks

[Noun chunks](https://spacy.io/usage/linguistic-features#noun-chunks) are "base noun phrases" that can be seen as *a noun plus the words describing the noun*, e.g., "red carpet" or "delicious cake". Similar to getting entities with `Doc.ents`, we can iterate over the noun chunks of a `Doc` by referring to `Doc.noun_chunks`.

In [24]:
def nounChunksInfo(doc: spacy.tokens.doc.Doc):
    noun_chunks_info = []
    
    for chunk in doc.noun_chunks:
        noun_chunks_info.append([
            chunk, chunk.text,
            chunk.label_,
            chunk.root.text,
            chunk.root.dep_,
            chunk.root.head.text,
            spacy.explain(chunk.root.dep_)
        ])
    
    headers = [
        'Noun Chunk',
        '.text',
        '.label_',
        '.root.text',
        '.root.dep_',
        '.root.head.text',
        'Explanation (.root.dep_)'
    ]
    noun_chunks_info = pd.DataFrame(data=noun_chunks_info, columns=headers)
    return noun_chunks_info

<font color=magenta>

In the above `nounChunksInfo` function:

- **Text** (`chunk.text`): The original noun chunk text.
- **Root text** (`chunk.root.text`): The original text of the word connecting the noun chunk to the rest of the parse.
- **Root dep** (`chunk.root.dep_`): Dependency relation connecting the root to its head.
- **Root head text** (`chunk.root.head.text`): The text of the root token’s head.
</font>

In [25]:
nounChunksInfo(edoc2)

Unnamed: 0,Noun Chunk,.text,.label_,.root.text,.root.dep_,.root.head.text,Explanation (.root.dep_)
0,"(New, York)",New York,NP,York,nsubj,is,nominal subject
1,"(New, York, City)",New York City,NP,City,oprd,called,object predicate
2,(it),it,NP,it,dobj,distinguish,direct object
3,"(New, York, State)",New York State,NP,State,pobj,from,object of preposition
4,"(the, most, populous, city)",the most populous city,NP,city,attr,is,attribute
5,"(the, United, States)",the United States,NP,States,pobj,in,object of preposition
6,"(a, 2020, population)",a 2020 population,NP,population,pobj,With,object of preposition
7,"(300.46, square, miles)",300.46 square miles,NP,miles,pobj,over,object of preposition
8,"(New, York, City)",New York City,NP,City,nsubj,is,nominal subject
9,"(the, most, densely, populated, major, city)",the most densely populated major city,NP,city,attr,is,attribute


In [26]:
nounChunksInfo(jdoc2)

Unnamed: 0,Noun Chunk,.text,.label_,.root.text,.root.dep_,.root.head.text,Explanation (.root.dep_)
0,"(松野, 官房, 長官)",松野官房長官,NP,長官,nsubj,し,nominal subject
1,"(30, 日, 午後)",30日午後,NP,午後,nmod,会見,modifier of nominal
2,"(アフリカ, 南部)",アフリカ南部,NP,南部,nmod,ナミビア,modifier of nominal
3,(ナミビア),ナミビア,NP,ナミビア,obl,入国,oblique nominal
4,"(30, 代)",30代,NP,代,nmod,男性,modifier of nominal
5,(男性),男性,NP,男性,nsubj,感染,nominal subject
6,"(新型, コロナ)",新型コロナ,NP,コロナ,nmod,株,modifier of nominal
7,"(新た, な, 変異, ウイルス, 「, オミクロン, 株)",新たな変異ウイルス「オミクロン株,NP,株,obl,感染,oblique nominal
8,(こと),こと,NP,こと,nsubj,確認,nominal subject
9,"(日本, 国内)",日本国内,NP,国内,obl,初めて,oblique nominal


## Visualizers

`spaCy`'s built-in [visualizers](https://spacy.io/usage/visualizers) [displaCy](https://explosion.ai/demos/displacy) and [displaCy ENT](https://explosion.ai/demos/displacy-ent) provide functions for visualizing dependencies and entities in web browsers or in a notebook. There are two main visualization styles:

- dependency parse: `displacy.render(style='dep', ...)`
- named entities: `displacy.render(style='ent', ...)`

In [27]:
from spacy import displacy

In [28]:
displacy.render(edoc1, style='dep', jupyter=True, options={'distance': 110})

In [29]:
displacy.render(edoc1, style='ent', jupyter=True)  # 'distance' is a 'dep' option, so it is omitted here

In [30]:
displacy.render(jdoc1, style='dep', jupyter=True, options={'distance': 110})

In [31]:
displacy.render(jdoc1, style='ent', jupyter=True)

In [32]:
displacy.render(edoc2, style='ent', jupyter=True)

In [33]:
displacy.render(jdoc2, style='ent', jupyter=True)

### Visualizing Long Text

Long texts can become difficult to read when displayed in one row, so it’s often better to visualize them sentence-by-sentence instead.

In [34]:
# Split a long text into sentences, then visualize those sentences
e_sentences_2 = list(edoc2.sents)
displacy.render(e_sentences_2, style='ent')

### Visualization Options

In [35]:
options = {"compact": True, "bg": "#09a3d5",
           "color": "white", "font": "Source Sans Pro"}
displacy.render(edoc1, style='dep', options=options)

In [36]:
colors = {
    "DATE": "linear-gradient(90deg, #aa9cfc, #fc9ce7)",
    "EVENT": "linear-gradient(90deg, green, yellow)",
}
options = {"colors": colors}
jdoc1.user_data['title'] = '東京オリンピックテキストの視覚化'
displacy.render(jdoc1, style='ent', options=options)

In [37]:
options = {
    "compact": True,
    "bg": "linear-gradient(90deg, #aa9cfc, #fc9ce7)",  # background (CSS value)
    "font": "Arial",
    # The 'dep' style takes a single CSS color string for text and arcs;
    # a per-label dict of colors is only supported by the 'ent' style.
    "color": "#000000"
}
displacy.render(jdoc1, style='dep', options=options)

### Visualization as a Webpage

In [38]:
displacy.serve(edoc2, style='ent')




Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


<font color='magenta'>
    After running the above cell, open a web browser and go to <a href='http://localhost:5000/'>http://localhost:5000/</a>.

To stop serving the page, return to this notebook, select the cell above, and
- either press `Esc`, then press `I` twice,
- or click the square 'Interrupt the kernel' button in the top toolbar.
</font>