 *Artificial Intelligence for Vision & NLP* &nbsp; | &nbsp;  *ATU Donegal - MSc in Big Data Analytics & Artificial Intelligence*

# Tokenisation
The first step in creating a `Doc` object is to break down the incoming text into component pieces or *tokens*.

Let's look at the first example shown in the lecture.

In [None]:
# Import spaCy and load the English language library
import spacy
# This will take a while to load initially
nlp = spacy.load("en_core_web_sm")

In [None]:
# The text itself is wrapped in double quotes that we want to display, so the
# Python string is delimited with single quotes. The \ character escapes the
# apostrophes inside the text so that they do not end the string early

# spaCy works with Doc objects, which we create later from this plain string
# called "sentence"
sentence = '"Mr. O\'Neill thinks that the boys\' stories about Chile\'s capital aren\'t amusing."'
print(sentence)

Now we'll examine each of the tokens for our sentence. Refer to the slides on Blackboard for further information.

In [None]:
nlp_sentence = nlp(sentence)
for token in nlp_sentence:
    print(token.text)

Note that the sentence is split on punctuation: **boys'**, **Chile's** and **aren't** are each broken into separate tokens. The opening and closing quotation marks (a *prefix* and a *suffix*) are also given their own tokens. The **.** in **Mr.** stays attached to the word, because spaCy recognises it as part of the title rather than the end of a sentence.

In [None]:
nlp_sentence = nlp(sentence)

# Show the tokens of the sentence and use
# the "|" between each token for additional clarification
for token in nlp_sentence:
    print(token.text, token.pos_, end = " | ")
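To see which tokeniser rule produced each token, the tokeniser has an `explain` method. The cell below is a minimal sketch: it assumes a blank English pipeline (the hypothetical name `blank_nlp`) is enough, since tokenisation does not need the trained model, and it leaves the `nlp` object loaded earlier untouched.

```python
import spacy

# Assumption: a blank English pipeline still carries the English
# tokeniser rules, so no trained model is needed for this check
blank_nlp = spacy.blank("en")

text = '"Mr. O\'Neill thinks that the boys\' stories about Chile\'s capital aren\'t amusing."'

# explain() returns (rule_name, token_text) pairs, one pair per token
explained = blank_nlp.tokenizer.explain(text)
for rule_name, token_text in explained:
    print(f"{token_text:10} {rule_name}")
```

The rule names in the second column indicate whether a token came from a prefix rule, a suffix rule, a special case or the remaining text.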

Let's look at a more difficult sentence that includes **.** within the time elements and within the web address.

In [None]:
sentence = "It is best to access our website from 9 a.m. to 1 p.m. every weekend. The address is www.mywebsite.ie."

In [None]:
doc_object = nlp(sentence)

In [None]:
for token in doc_object:
    print(token)

spaCy tokenises the words as expected and keeps the `.` characters inside `a.m.`, `p.m.` and the web address as part of their tokens. The `.` that ends a sentence is split off as a *suffix*, unlike the `.` characters in the middle of the web address.

spaCy also separates numbers from the units and currency symbols attached to them, such as a distance (`20km`) or a cost (`£50`). Here's an example.

In [None]:
sentence = "I live about 20km from here. A taxi will cost around £50."
doc_object = nlp(sentence)

for token in doc_object:
    print(token)
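Once the number and its unit are separate tokens, we can ask each token what kind of thing it is. The following sketch uses the lexical attributes `like_num` and `is_currency`; the names `units_nlp` and `units_doc` are placeholders chosen for this cell, and a blank English pipeline is assumed to be sufficient because these attributes do not depend on a trained model.

```python
import spacy

# Assumption: blank English pipeline, tokeniser and lexical attributes only
units_nlp = spacy.blank("en")
units_doc = units_nlp("I live about 20km from here. A taxi will cost around £50.")

# Show each token with two of its lexical attributes
for token in units_doc:
    print(f"{token.text:8} like_num={token.like_num!s:6} is_currency={token.is_currency}")
```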

## Counting Tokens
`Doc` objects have a set number of tokens:

In [None]:
# Number of tokens in our sentence
len(doc_object)

Each language library has a `Vocab` object, which stores entries called *lexemes*. The number of lexemes varies from library to library, and it grows as new lexemes are added to the vocabulary.

In [None]:
# Count the number of vocab objects in the currently loaded language library
# This is from the en_core_web_sm library
# Use en_core_web_lg for larger library
len(doc_object.vocab)
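As a rough illustration of the vocabulary growing, the sketch below looks up a made-up word that should not already be stored. It assumes a blank English pipeline (`vocab_nlp`, a name used only here), and the word `qwertyflux` is invented for the demo.

```python
import spacy

# Assumption: a blank pipeline starts with a small vocabulary
vocab_nlp = spacy.blank("en")
size_before = len(vocab_nlp.vocab)

# Looking up an unseen string creates a new lexeme for it
lexeme = vocab_nlp.vocab["qwertyflux"]
size_after = len(vocab_nlp.vocab)

print(lexeme.text, lexeme.is_alpha)
print(size_before, size_after)
```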

## Retrieve token by index position and slice
`Doc` objects can be thought of as lists of `token` objects. As such, individual tokens can be retrieved by index position, and spans of tokens can be retrieved through slicing, just as shown in the previous notebook.

Let's enter the text into a `doc` object and then show the contents of the sentence.

In [None]:
doc = nlp(u"I really like working with words!")

# Print each token
for token in doc:
    print(token)

Now we can extract some tokens from the sentence. Note that indexing starts at 0 and that every token, including punctuation such as the final **!**, occupies its own index position.

In [None]:
# Retrieve the first token
doc[0]

In [None]:
# Retrieve the tokens at index positions 3, 4 and 5 (the 4th to 6th tokens)
doc[3:6]

In [None]:
# Retrieve the last 2 tokens
doc[-2:]
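Indexing and slicing return different kinds of object: a single index gives a `Token`, while a slice gives a `Span`. Here is a small sketch of that difference, using a blank English pipeline and the placeholder names `slice_nlp` and `slice_doc` so that the `doc` object above is not overwritten.

```python
import spacy
from spacy.tokens import Span, Token

# Assumption: a blank English pipeline is enough for indexing and slicing
slice_nlp = spacy.blank("en")
slice_doc = slice_nlp("I really like working with words!")

first = slice_doc[0]     # a single Token
middle = slice_doc[3:6]  # a Span covering index positions 3, 4 and 5

print(type(first).__name__, "->", first.text)
print(type(middle).__name__, "->", middle.text)
# A Span remembers where it starts and ends within the Doc
print(middle.start, middle.end)
```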

We cannot reassign individual tokens to new values. spaCy has already computed a lot of information from the original text, and replacing a token would leave that information inconsistent, so item assignment raises a `TypeError`.

In [None]:
# Expected to raise a TypeError: Doc objects do not support item assignment
doc[2] = "Do not"
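If we do need different text, one option is to build a new string and process it again. The sketch below catches the error from item assignment and then rebuilds the sentence; the replacement word `dislike` and the names `immutable_nlp` and `immutable_doc` are only for illustration.

```python
import spacy

# Assumption: a blank English pipeline is enough to show the behaviour
immutable_nlp = spacy.blank("en")
immutable_doc = immutable_nlp("I really like working with words!")

try:
    immutable_doc[2] = "Do not"
    assignment_failed = False
except TypeError as error:
    assignment_failed = True
    print(f"TypeError: {error}")

# Rebuild the text with one word swapped, keeping the original spacing,
# then create a fresh Doc from the new string
pieces = [
    ("dislike" if token.i == 2 else token.text) + token.whitespace_
    for token in immutable_doc
]
new_doc = immutable_nlp("".join(pieces))
print(new_doc.text)
```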

# Named Entity Recognition (NER)
Going a step beyond tokens, **named entities** add another layer of context. A named entity is a **real-world object** that’s assigned a name – for example, a person, a country, a product, a date, money, a book title etc.

spaCy can recognise various types of named entities in a document, by asking the language model for a prediction.

Named entities are accessible through the `ents` property of a `Doc` object.

In [None]:
doc_object = nlp(u"Samsung in Ireland are pleased with their new folding screen that they released after a large $9 million investment.")

for token in doc_object:
    # show the token followed by a separator
    print (token, end = " | ")

We can view the named entities in the doc object with the following code:

In [None]:
for entity in doc_object.ents:
    print (entity)

spaCy has recognised these spans as named entities, so they carry more context than ordinary tokens. Named entities are similar to proper nouns, and a single entity can cover several tokens.

We can also view the label that spaCy has assigned to each named entity.

In [None]:
for entity in doc_object.ents:
    # Show the entity and its general label
    print (entity, entity.label_)

We can show more detail on each entity label. All this is generated automatically through spaCy.

In [None]:
for entity in doc_object.ents:
    # Show the entity and its general label
    # and show a full description on each named entity
    # using the spacy.explain command
    print (entity, entity.label_, spacy.explain(entity.label_))
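Each entity is a `Span`, so it also knows where it sits in the original text. The sketch below assigns two entities by hand, purely to show what `Doc.ents` holds. The labels here are my own assumptions for the demo and are not model predictions, and `ents_nlp` and `ents_doc` are names used only in this cell.

```python
import spacy
from spacy.tokens import Span

# Assumption: blank pipeline with no NER component, entities set manually
ents_nlp = spacy.blank("en")
ents_doc = ents_nlp("Samsung in Ireland are pleased with their new folding screen.")

# Hand-assigned labels for illustration only (token 0 and token 2)
ents_doc.ents = [
    Span(ents_doc, 0, 1, label="ORG"),
    Span(ents_doc, 2, 3, label="GPE"),
]

for entity in ents_doc.ents:
    # start_char and end_char are character offsets into ents_doc.text
    print(entity.text, entity.label_, entity.start_char, entity.end_char,
          spacy.explain(entity.label_))
```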

If the text contains no named entities, `Doc.ents` is empty and the loop below prints nothing. For example, the word **car** is not automatically recognised as a named entity.

In [None]:
doc_object = nlp(u"I like my car")
for entity in doc_object.ents:
    # Show the entity and its general label
    print (entity, entity.label_)

Next we create a new function called `show_entity_info` that will accept a `doc_object` and display relevant entity information. I'll also check whether entity information exists or not.

In [None]:
# Create a function to display entity information from a doc_object
def show_entity_info(doc_object):
    # doc_object.ents is empty when no entities were found
    if doc_object.ents:
        for entity in doc_object.ents:
            # Columns: entity text, entity label, description of the label
            print(f"{entity.text:{20}} {entity.label_:{20}} {spacy.explain(entity.label_)}")
    else:
        print("No entities found in text.")

In [None]:
doc_object = nlp(u"Samsung in Ireland are pleased with their new folding screen that they released after a large $9 million investment.")
show_entity_info(doc_object)

# Noun Chunks
Like `Doc.ents`, `Doc.noun_chunks` is another property of the `Doc` object. *Noun chunks* are **base noun phrases**: flat phrases that have a noun as their head. You can think of a noun chunk as a noun plus the words describing it. For example, *"the lavish green grass"* would be one noun chunk.

`spaCy` uses the terms **head** and **child** to describe the words connected by a single arc in the **dependency tree**. The term **dep** is used for the arc label, which describes the type of syntactic relation that connects the child to the head. 

Let's have a look at an example. I'll then show how a visualiser can display this information.

In [None]:
doc_object = nlp("Autonomous cars shift insurance liability toward manufacturers")

# Create header text for table output
column1 = "Text"
column2 = "Root text"
column3 = "Root dependency"
column4 = "Root head text"
# Show the header for the table output
print (f"{column1:25} {column2:20} {column3:25} {column4:20}")
# Show relevant detail for each noun chunk in the text
for chunk in doc_object.noun_chunks:
    print(f"{chunk.text:{25}} {chunk.root.text:{20}} {spacy.explain(chunk.root.dep_):{25}} {chunk.root.head.text:{20}}")

The noun *cars* is described by the word *autonomous*, and together the two words form a *noun chunk*. Similarly, the noun *liability* is described by the word *insurance*, so these two words form another *noun chunk*. *Manufacturers* is a noun and forms a *noun chunk* on its own, even though it does not have any descriptive text associated with it.

In the table above, the *Text* column represents the original noun chunk text. The *Root text* column is the original text of the word connecting the noun chunk to the rest of the parse.

The *Root dependency* column is the dependency relation connecting the root to its head. The *Root head text* is the text of the root token’s head.

For more info on *noun_chunks*, see https://spacy.io/usage/linguistic-features#noun-chunks
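The **head**, **child** and **dep** terms can also be explored on individual tokens. The sketch below builds a `Doc` by hand with the spaCy v3 `Doc` constructor, so the head positions and dependency labels are written in manually for illustration and are not parser output. The names `dep_nlp` and `dep_doc` are used only in this cell.

```python
import spacy
from spacy.tokens import Doc

# Assumption: spaCy v3, where Doc accepts heads (absolute positions) and deps
dep_nlp = spacy.blank("en")
words = ["Autonomous", "cars", "shift", "insurance", "liability", "toward", "manufacturers"]
heads = [1, 2, 2, 4, 2, 2, 5]  # index of each word's head
deps = ["amod", "nsubj", "ROOT", "compound", "dobj", "prep", "pobj"]
dep_doc = Doc(dep_nlp.vocab, words=words, heads=heads, deps=deps)

# For each token show its arc label, its head and its children
for token in dep_doc:
    children = [child.text for child in token.children]
    print(f"{token.text:15} {token.dep_:10} {token.head.text:15} {children}")
```

Each arc connects one child to one head, so a token has exactly one head but can have several children.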

## displaCy Built-in Visualiser

spaCy includes a built-in visualisation tool called `displaCy`. `displaCy` is able to detect whether you're working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.

Let's examine the dependencies of our sentence visually.

Note that we can show the sentence dependencies using the `style="dep"` option and the sentence entities using the `style="ent"` option.

In [None]:
from spacy import displacy

In [None]:
doc_object = nlp(u"Autonomous cars shift insurance liability toward manufacturers")

In [None]:
# Command to display the sentence. Be careful of the case with the word "True"
# Style set to "dep" means display dependencies
displacy.render(doc_object, style="dep", jupyter=True, options={"distance":100} )

The `options` argument of the `displacy.render` function lets us modify various aspects of the diagram produced by the visualisation tool. Here's a list of the settings available to us. See also this link for more information on the displaCy visualiser: https://spacy.io/api/top-level#displacy_options

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True` |
| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
| `compact` | bool | “Compact mode” with square arrows that takes up less space. | `False` |
| `color` | unicode | Text color (HEX, RGB or color names). | `'#000000'` |
| `bg` | unicode | Background color (HEX, RGB or color names). | `'#ffffff'` |
| `font` | unicode | Font name or font family for all text. | `'Arial'` |
| `offset_x` | int | Spacing on left side of the SVG in px. | `50` |
| `arrow_stroke` | int | Width of arrow path in px. | `2` |
| `arrow_width` | int | Width of arrow head in px. | `10` / `8` (compact) |
| `arrow_spacing` | int | Spacing between arrows in px to avoid overlaps. | `20` / `12` (compact) |
| `word_spacing` | int | Vertical spacing between words and arcs in px. | `45` |
| `distance` | int | Distance between words in px. | `175` / `150` (compact) |

In the example above, we set the `distance` option to **100**, which places the words **100** pixels apart on the diagram.

Here's another example showing various options from the table above. Note that the `compact` option switches the diagram from curved arcs to square arrows.

In [None]:
displacy.render(doc_object, style="dep", jupyter=True, options={"distance":130, "color":"Blue", "arrow_stroke":4, "arrow_spacing":20, "word_spacing":50, "compact":True} )
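Outside a notebook, `displacy.render` can return the markup as a string, which we can then save to a file. This is a minimal sketch under a few assumptions: the dependency labels are written in by hand (spaCy v3 `Doc` constructor) instead of coming from the parser, and the file name `dependencies.svg` and the temporary folder are arbitrary choices.

```python
import tempfile
from pathlib import Path

import spacy
from spacy import displacy
from spacy.tokens import Doc

# Assumption: hand-written heads and labels, not parser output
svg_nlp = spacy.blank("en")
svg_doc = Doc(
    svg_nlp.vocab,
    words=["Autonomous", "cars", "shift", "insurance", "liability"],
    heads=[1, 2, 2, 4, 2],
    deps=["amod", "nsubj", "ROOT", "compound", "dobj"],
)

# jupyter=False makes render() return the SVG markup as a string
svg = displacy.render(svg_doc, style="dep", jupyter=False,
                      options={"compact": True, "distance": 100})

# Save the markup so it can be opened in a browser
output_path = Path(tempfile.gettempdir()) / "dependencies.svg"
output_path.write_text(svg, encoding="utf-8")
print(f"Saved {len(svg)} characters to {output_path}")
```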

Earlier we examined the dependencies in the sentence. `displacy` can also display the entities of the sentence by using the `style="ent"` option. Before we do that, let's check whether there are any entities in the sentence we've been using so far.

In [None]:
# Display any named entities in the string
for entity in doc_object.ents:
    print (entity)

There are no named entities, so we'll use another text example for demo purposes. 

The text in this example comes from the official `displacy` webpage. For more info, see https://explosion.ai/demos/displacy-ent

First I'll look at the entities in this sentence.

In [None]:
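from google.colab import drive
drive.mount('/content/gdrive')

# Read the example text from a file on Google Drive
with open("/content/gdrive/My Drive/NLP/noun-chunks.txt") as text_file:
    sentence = text_file.read()
# Alternative if the file is not available: uncomment the inline text below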
# sentence = (u"When Sebastian Thrun started working on self-driving cars at \
#              Google in 2007, few people outside of the company took him \
#              seriously. “I can tell you very senior CEOs of major American \
#              car companies would shake my hand and turn away because I \
#              wasn’t worth talking to,” said Thrun, now the co-founder \
#              and CEO of online higher education startup Udacity, in an \
#              interview with Recode earlier this week. A little less than \
#              a decade later, dozens of self-driving startups have cropped \
#              up while automakers around the world clamor, wallet in hand, \
#              to secure their place in the fast-moving world of fully \
#              automated transportation.")
doc_object = nlp(sentence)

# Display any named entities in the string
for entity in doc_object.ents:
    print (entity, entity.label_)

A nice feature of the `displacy.render` function is that it displays the sentence text and highlights each entity with its associated entity label. This is all done automatically.

In [None]:
displacy.render(doc_object, style="ent", jupyter=True)
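The entity visualiser can also be fed entities that we supply ourselves, using `manual=True`. In the sketch below the text, the character offsets and the labels are hand-written assumptions for the demo, so nothing here is predicted by a model.

```python
from spacy import displacy

# Hand-written example: character offsets and labels are assumptions
manual_example = {
    "text": "Samsung in Ireland announced a new folding screen.",
    "ents": [
        {"start": 0, "end": 7, "label": "ORG"},
        {"start": 11, "end": 18, "label": "GPE"},
    ],
    "title": None,
}

# manual=True skips the Doc object; jupyter=False returns HTML as a string
html = displacy.render(manual_example, style="ent", manual=True, jupyter=False)
print(html[:200])
```

In a notebook, setting `jupyter=True` instead would show the highlighted text directly in the cell output.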

## Tweet Tokenisation

NLTK has a tweet tokeniser module called `nltk.tokenize.TweetTokenizer`. Find out how to use this and use it to create a function that returns a tokenised tweet with any Twitter handles removed. Try also using a regex query to remove the handles.