### Text Processing

In this notebook we go through some text processing techniques that we can use to clean text for natural language processing (NLP) tasks. We are going to use the `spacy` library (and `nltk` for stemming) to perform the following text processing steps.

1. Tokenization
2. Part-of-Speech Tagging (POS)
3. Case Folding
4. Stop Words Removal
5. Stemming
6. Lemmatization
7. Named Entity Recognition (NER)
8. Parsing


First, we need to install the latest `3.x` release of `spacy` by running the following command:

In [2]:
!pip install -U spacy==3.* -q

We can check the information about this spacy library by running the following command.

In [3]:
!python -m spacy info


spaCy version    3.6.1                         
Location         /usr/local/lib/python3.10/dist-packages/spacy
Platform         Linux-5.15.109+-x86_64-with-glibc2.35
Python version   3.10.12                       
Pipelines        en_core_web_sm (3.6.0)        



Since we have installed `spacy` version `3.*`, we must also upgrade the language model `en_core_web_sm` to a matching version. **`en_core_web_sm`** is a statistical model that we are going to use to process English sentences. Such models can be found at [spacy.io](https://spacy.io/models/en#en_core_web_sm) and they help us with tokenization, part-of-speech tagging, named entity recognition, etc. First, we download the model as follows:

In [6]:
!python -m spacy download en_core_web_sm -q

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [4]:
import spacy
spacy.__version__

'3.6.1'

We are going to load the language model `en_core_web_sm`, which is the smallest statistical model for English, so it is quick to download and load.

In [7]:
nlp = spacy.load('en_core_web_sm')

After loading the model, the `nlp` variable now references a **Language** class instance which contains language-specific rules for various tasks (e.g. tokenization) and a processing pipeline. You can find out more about the `Language` class [here](https://spacy.io/api/language).



In [8]:
type(nlp)

spacy.lang.en.English

### Tokenization

Tokenization is the process of converting a sentence into a sequence (list) of tokens such as words and punctuation marks. Let's take a look at the following sentence.

```shell
This is a boy.
```
Tokenizing this sentence can be done in plain Python by splitting the string on spaces, which results in a list of tokens that looks as follows:

```shell
["This", "is", "a", "boy."]
```
But notice we have a problem here: `boy.` should be two tokens, the word `boy` and the punctuation mark `.`. spaCy solves this problem. Let's have a look at how tokenization can be done efficiently using `spacy`.
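The whitespace-splitting problem described above is easy to reproduce in plain Python. The sketch below contrasts `str.split` with a small regular expression that separates punctuation from words. `simple_tokenize` is a hypothetical helper for illustration only and is far cruder than spaCy's tokenizer (for instance, it would break `didn't` into `didn`, `'` and `t`).

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "This is a boy."
whitespace_tokens = sentence.split()      # punctuation stays glued to "boy."
regex_tokens = simple_tokenize(sentence)  # punctuation becomes its own token

print(whitespace_tokens)
print(regex_tokens)
```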

When we call the `nlp` object and pass in a string, it returns a `Doc` container object. You can read more about the `Doc` container object [here](https://spacy.io/api/doc).

In [9]:
sent = "He didn't want to pay $20 for this book."
doc = nlp(sent)
type(doc)

spacy.tokens.doc.Doc

We can tokenize the above sentence by iterating over the `doc` object as follows:


In [11]:
print([token.text for token in doc])

['He', 'did', "n't", 'want', 'to', 'pay', '$', '20', 'for', 'this', 'book', '.']


The `Doc` object can be indexed to get individual tokens.

In [14]:
doc[0]

He

Slicing the `Doc` object returns a `Span` object. You can learn more about the `Span` object [here](https://spacy.io/api/span).

In [16]:
print(doc[:3])
type(doc[:3])

He didn't


spacy.tokens.span.Span

We can access the index of each token in the doc object as follows:

In [17]:
print([(token.i, token.text) for token in doc])

[(0, 'He'), (1, 'did'), (2, "n't"), (3, 'want'), (4, 'to'), (5, 'pay'), (6, '$'), (7, '20'), (8, 'for'), (9, 'this'), (10, 'book'), (11, '.')]


spaCy's tokenization is **non-destructive**, which means the original input can be reconstructed from the tokens.
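One way to picture what non-destructive means: if every token remembers the whitespace that followed it, joining the pieces gives back the input exactly. The plain-Python sketch below illustrates the idea with a hypothetical helper, `tokenize_with_whitespace`. It assumes the text has no leading whitespace, and it is not spaCy's implementation (spaCy exposes a token's trailing whitespace through the `whitespace_` attribute).

```python
import re

def tokenize_with_whitespace(text):
    """Return (token, trailing_whitespace) pairs that together cover the text."""
    return re.findall(r"(\w+|[^\w\s])(\s*)", text)

original = "He didn't want to pay $20 for this book."
pairs = tokenize_with_whitespace(original)

# Gluing each token back to its trailing whitespace restores the input.
reconstructed = "".join(token + space for token, space in pairs)
print(reconstructed == original)
```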

In [20]:
doc.text


"He didn't want to pay $20 for this book."

The good thing is `spacy` allows us to tokenize multiple sentences. Let's have a look at the following example:

In [24]:
paragraph = """Either the well was very deep, or she fell very slowly, for she
had plenty of time as she went down to look about her and to wonder what
was going to happen next. First, she tried to look down and make out what
she was coming to, but it was too dark to see anything; then she looked at
the sides of the well, and noticed that they were filled with cupboards and
book-shelves; here and there she saw maps and pictures hung upon pegs."""
doc = nlp(paragraph)

We can access the individual sentences using `doc.sents` as follows:

> `doc.sents` yields the sentences of our string as `Span` objects; the list comprehension below collects them into a list.

In [25]:
[sent for sent in doc.sents]

[Either the well was very deep, or she fell very slowly, for she 
 had plenty of time as she went down to look about her and to wonder what 
 was going to happen next.,
 First, she tried to look down and make out what 
 she was coming to, but it was too dark to see anything; then she looked at 
 the sides of the well, and noticed that they were filled with cupboards and 
 book-shelves; here and there she saw maps and pictures hung upon pegs.]

spaCy tokens come with a number of attributes, listed [here](https://spacy.io/api/token#attributes), such as `is_currency`, which tells us whether a token is a currency symbol, `is_bracket`, `is_space`, etc. In the following sentence, let's filter out the currency amount. In other words, we want to return `"$20"`.

In [29]:
text = "He didn't want to pay $20 for this book."
doc = nlp(text)

"".join([t.text for t in doc if t.is_currency or t.is_digit])

'$20'
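To make the filter above more concrete, here is a rough standard-library approximation of the same check. It assumes that a currency token is a single character in the Unicode `Sc` (currency symbol) category; `is_currency_symbol` is a hypothetical helper, not spaCy's own logic, and the token list is copied from the tokenization output shown earlier.

```python
import unicodedata

# Tokens copied from the tokenization output shown earlier in this notebook.
tokens = ['He', 'did', "n't", 'want', 'to', 'pay', '$', '20', 'for', 'this', 'book', '.']

def is_currency_symbol(token):
    """True for a single character in the Unicode 'Sc' (currency symbol) category."""
    return len(token) == 1 and unicodedata.category(token) == "Sc"

# Keep only currency symbols and all-digit tokens, then glue them together.
amount = "".join(t for t in tokens if is_currency_symbol(t) or t.isdigit())
print(amount)
```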

### Case Folding

`spaCy` performs all these preprocessing steps (except stemming) behind the scenes for us. In line with its non-destructive policy, the tokens aren't modified directly. Rather, each `Token` object has a number of attributes that give us views of the document with these pre-processing steps applied. The attributes of a `Token` can be found [here](https://spacy.io/api/token#attributes).

> Let's convert the tokens of our sentence to lower case using the `lower_` attribute.

In [30]:
sent = "He told Dr. Lovato that he was done with the tests and would post the results shortly."
doc = nlp(sent)

print([token.lower_ for token in doc])

['he', 'told', 'dr.', 'lovato', 'that', 'he', 'was', 'done', 'with', 'the', 'tests', 'and', 'would', 'post', 'the', 'results', 'shortly', '.']


We can also case-fold tokens conditionally. For example, let's convert every token to lower case except the ones at the beginning of a sentence, which we leave unchanged.

In [31]:
print([token.text if token.is_sent_start else token.lower_ for token in doc])

['He', 'told', 'dr.', 'lovato', 'that', 'he', 'was', 'done', 'with', 'the', 'tests', 'and', 'would', 'post', 'the', 'results', 'shortly', '.']
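spaCy's `lower_` attribute gives the lower-cased token text. In plain Python, lower-casing and case folding are close but not identical: `str.casefold` is more aggressive and is intended for caseless comparison. The short sketch below shows the difference on a few sample strings; the German word is only an illustrative assumption and does not come from the sentences used in this notebook.

```python
words = ["Dr.", "Lovato", "Straße"]

lowered = [w.lower() for w in words]    # simple lower-casing
folded = [w.casefold() for w in words]  # case folding for caseless matching

print(lowered)
print(folded)

# Case folding makes these two spellings compare equal; lower-casing does not.
same_when_folded = "Straße".casefold() == "STRASSE".casefold()
same_when_lowered = "Straße".lower() == "STRASSE".lower()
print(same_when_folded, same_when_lowered)
```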


### Stopwords

`spaCy` comes with some default stopwords. We can check our stopwords list as follows.

In [34]:
stop_words = [i for i in nlp.Defaults.stop_words]
stop_words[:5]

['thereupon', 'how', 'perhaps', 'there', 'did']

We can check the total number of `stopwords` as follows:

In [37]:
"Default stop words: {}".format(len(nlp.Defaults.stop_words))

'Default stop words: 326'
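Once we have a stop word list, removing stop words is just a membership test per token. The sketch below shows the idea in plain Python with a tiny hand-picked set, `SAMPLE_STOP_WORDS`, which is an assumption made for illustration and not spaCy's default list. In spaCy itself, each token exposes an `is_stop` flag that can be used for the same purpose.

```python
# A tiny hand-picked stop word set (an assumption, not spaCy's default list).
SAMPLE_STOP_WORDS = {"he", "that", "was", "with", "the", "and", "would"}

# Tokens of the example sentence used in the case folding section.
tokens = ['He', 'told', 'Dr.', 'Lovato', 'that', 'he', 'was', 'done', 'with',
          'the', 'tests', 'and', 'would', 'post', 'the', 'results', 'shortly', '.']

# Compare in lower case so that "He" and "he" are treated alike.
content_tokens = [t for t in tokens if t.lower() not in SAMPLE_STOP_WORDS]
print(content_tokens)
```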

### Lemmatization

> Lemmatization is a text pre-processing technique used in natural language processing (NLP) to reduce a word to its dictionary form (its lemma), so that different inflected forms of the same word can be treated as one.

In `spaCy` we can access the lemma of a word using the `lemma_` attribute.




In [38]:
[(t.text, t.lemma_) for t in doc]

[('He', 'he'),
 ('told', 'tell'),
 ('Dr.', 'Dr.'),
 ('Lovato', 'Lovato'),
 ('that', 'that'),
 ('he', 'he'),
 ('was', 'be'),
 ('done', 'do'),
 ('with', 'with'),
 ('the', 'the'),
 ('tests', 'test'),
 ('and', 'and'),
 ('would', 'would'),
 ('post', 'post'),
 ('the', 'the'),
 ('results', 'result'),
 ('shortly', 'shortly'),
 ('.', '.')]

We can see that words like `told` are converted to `tell` and words like `results` are converted to `result`.


### Stemming

> Stemming is the process of reducing a word to its stem by stripping affixes (suffixes and prefixes) with simple rules. Unlike a lemma, a stem does not have to be a valid dictionary word.

We can use the `nltk` library to do basic stemming. Let's try to stem our sentence.

In [39]:
from nltk.stem.snowball import SnowballStemmer

Next we are going to initialize the `SnowballStemmer` instance with the `language` as `english`.


In [40]:
stemmer = SnowballStemmer(language='english')

[(t.text, stemmer.stem(t.text)) for t in doc]

[('He', 'he'),
 ('told', 'told'),
 ('Dr.', 'dr.'),
 ('Lovato', 'lovato'),
 ('that', 'that'),
 ('he', 'he'),
 ('was', 'was'),
 ('done', 'done'),
 ('with', 'with'),
 ('the', 'the'),
 ('tests', 'test'),
 ('and', 'and'),
 ('would', 'would'),
 ('post', 'post'),
 ('the', 'the'),
 ('results', 'result'),
 ('shortly', 'short'),
 ('.', '.')]

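There are two major limitations with stemming:

* over-stemming: too much of the word is removed, so unrelated words end up with the same stem
* under-stemming: too little is removed, so related words end up with different stems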
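Both limitations are easy to see with a deliberately naive suffix stripper. The `naive_stem` function below is a toy written for illustration; it is not the Snowball algorithm, and its suffix list is an arbitrary assumption.

```python
# Toy suffix list (an assumption for illustration, not the Snowball rules).
SUFFIXES = ("ing", "ed", "ly", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

# Over-stemming: two unrelated words collapse to the same stem.
print(naive_stem("news"), naive_stem("new"))

# Under-stemming: forms of the same verb end up with different stems.
print(naive_stem("run"), naive_stem("running"), naive_stem("ran"))
```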


### Part-of-Speech Tagging (POS)


spaCy performs Part-of-Speech `(POS)` tagging, Named Entity Recognition `(NER)`, and parsing as part of its default pipeline in the `nlp` object.

In [41]:
sent = "John watched an old movie at the cinema."
doc = nlp(sent)

`POS` tags can be accessed through the `pos_` attribute

In [42]:
[(t.text, t.pos_) for t in doc]

[('John', 'PROPN'),
 ('watched', 'VERB'),
 ('an', 'DET'),
 ('old', 'ADJ'),
 ('movie', 'NOUN'),
 ('at', 'ADP'),
 ('the', 'DET'),
 ('cinema', 'NOUN'),
 ('.', 'PUNCT')]

To get a description for a `POS` tag, we can use `spacy.explain`.

In [43]:
spacy.explain('PROPN')

'proper noun'

The `POS` tags above are called **coarse-grained** tags. We can also access **fine-grained** tags through the `tag_` attribute, which provides more detailed information about a token, such as its tense and, if a word is a pronoun, what specific type of pronoun it is.

In [44]:
[(t.text, t.tag_) for t in doc]

[('John', 'NNP'),
 ('watched', 'VBD'),
 ('an', 'DT'),
 ('old', 'JJ'),
 ('movie', 'NN'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('cinema', 'NN'),
 ('.', '.')]

So **NNP** refers specifically to a singular proper noun, and **VBD** is a verb in the **past tense**.

In [45]:
print(spacy.explain('NNP'))
print(spacy.explain('VBD'))

noun, proper singular
verb, past tense
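Once tokens carry POS tags, a common cleaning step is to count the tags or keep only certain word classes. The sketch below works on the coarse-grained `(text, pos_)` pairs printed earlier in this section, copied in by hand; the choice of which tags count as content words is an assumption made for illustration.

```python
from collections import Counter

# Coarse-grained (text, pos_) pairs copied from the output above.
tagged = [('John', 'PROPN'), ('watched', 'VERB'), ('an', 'DET'), ('old', 'ADJ'),
          ('movie', 'NOUN'), ('at', 'ADP'), ('the', 'DET'), ('cinema', 'NOUN'),
          ('.', 'PUNCT')]

# How often does each tag occur in the sentence?
pos_counts = Counter(pos for _, pos in tagged)
print(pos_counts)

# Keep only tokens whose tag is in a chosen set of "content" classes.
CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ"}
content_words = [text for text, pos in tagged if pos in CONTENT_TAGS]
print(content_words)
```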


### Named Entity Recognition (NER)

There are multiple ways to access named entities. One way is through the `ent_type_` attribute.


In [46]:
sent = "Volkswagen is developing an electric sedan which could potentially come to America next fall."
doc = nlp(sent)

In [47]:
[(t.text, t.ent_type_) for t in doc]

[('Volkswagen', 'ORG'),
 ('is', ''),
 ('developing', ''),
 ('an', ''),
 ('electric', ''),
 ('sedan', ''),
 ('which', ''),
 ('could', ''),
 ('potentially', ''),
 ('come', ''),
 ('to', ''),
 ('America', 'GPE'),
 ('next', 'DATE'),
 ('fall', 'DATE'),
 ('.', '')]

We can use the `spacy.explain` function to get more information about `spaCy` named entity labels, or we can look them up [here](https://spacy.io/api/annotation#named-entities).

In [48]:
spacy.explain('GPE')

'Countries, cities, states'

You can also check if a token is an entity before printing it by checking whether the `ent_type` (note the lack of trailing underscore) attribute is non-zero.

In [49]:
print([(t.text, t.ent_type_) for t in doc if t.ent_type != 0])

[('Volkswagen', 'ORG'), ('America', 'GPE'), ('next', 'DATE'), ('fall', 'DATE')]



Another way is through the `ents` property of the **Doc** object. Here, we iterate through `ents` and print the entity itself and its label.

In [50]:
print([(ent.text, ent.label_) for ent in doc.ents])

[('Volkswagen', 'ORG'), ('America', 'GPE'), ('next fall', 'DATE')]



Note how `"next fall"` is outputted above as a single span when you use `ents`.
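The relationship between the token-level labels and the entity spans can be sketched in plain Python by merging consecutive tokens that share the same non-empty label. This is only an approximation for illustration: it would wrongly merge two adjacent entities that happen to have the same label, which is why spaCy also tracks entity boundaries per token (through `ent_iob_`). The pairs below are copied from the `ent_type_` output earlier in this section.

```python
from itertools import groupby

# (text, ent_type_) pairs copied from the token-level output above.
token_labels = [
    ('Volkswagen', 'ORG'), ('is', ''), ('developing', ''), ('an', ''),
    ('electric', ''), ('sedan', ''), ('which', ''), ('could', ''),
    ('potentially', ''), ('come', ''), ('to', ''), ('America', 'GPE'),
    ('next', 'DATE'), ('fall', 'DATE'), ('.', ''),
]

# Group runs of tokens with the same label and drop the unlabelled runs.
entity_spans = [
    (" ".join(text for text, _ in group), label)
    for label, group in groupby(token_labels, key=lambda pair: pair[1])
    if label
]
print(entity_spans)
```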

You can also access the positions of entities as follows:


In [51]:
print([(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents])

[('Volkswagen', 'ORG', 0, 10), ('America', 'GPE', 75, 82), ('next fall', 'DATE', 83, 92)]
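The `start_char` and `end_char` values are character offsets into the original string, with the end offset exclusive, so they can be used directly as a Python slice. The check below uses the sentence and the offsets exactly as printed above.

```python
sent = "Volkswagen is developing an electric sedan which could potentially come to America next fall."

# (text, label, start_char, end_char) tuples copied from the output above.
entities = [('Volkswagen', 'ORG', 0, 10), ('America', 'GPE', 75, 82),
            ('next fall', 'DATE', 83, 92)]

for text, label, start, end in entities:
    # Slicing the raw string with the offsets recovers the entity text.
    print(label, repr(sent[start:end]), sent[start:end] == text)
```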


`spaCy` is bundled with visualizers for both parsing and named entities, which are documented [here](https://spacy.io/usage/visualizers).

Here, we visualize the entities in our sample sentence.

In [52]:
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

For domain-specific corpora, an `NER` tagger may need to be further `fine-tuned`. Here, we may want `The Martian` tagged as a `"FILM"` (assuming that's our goal).

In [53]:
sent = "Ridley Scott directed The Martian."
doc = nlp(sent)
displacy.render(doc, style='ent', jupyter=True)

### Parsing

Let's first visualize a parse to make it easier to follow.

In [54]:
sent = "She enrolled in the course at the university."
doc = nlp(sent)
displacy.render(doc, style='dep', jupyter=True)

The visualization above is for a dependency parse (spaCy doesn't come with a constituency parser). For each dependency, spaCy visualizes the child (pointed to), the head (pointed from), and their relationship (the arc label). You can view the dependency annotations [here](https://spacy.io/api/annotation#dependency-parsing).


We can also use `spacy.explain` to get information on a particular annotation.

In [55]:
spacy.explain('nsubj')

'nominal subject'

The dependency labels themselves can be accessed through the `dep_` attribute.

In [56]:
[(t.text, t.dep_) for t in doc]

[('She', 'nsubj'),
 ('enrolled', 'ROOT'),
 ('in', 'prep'),
 ('the', 'det'),
 ('course', 'pobj'),
 ('at', 'prep'),
 ('the', 'det'),
 ('university', 'pobj'),
 ('.', 'punct')]

Note how the word 'enrolled' is the `ROOT`.


But the labels above don't show how the words are related to each other (the arcs). To get a better idea, we can print the head of each dependency.

In [57]:
[(t.text, t.dep_, t.head.text) for t in doc]

[('She', 'nsubj', 'enrolled'),
 ('enrolled', 'ROOT', 'enrolled'),
 ('in', 'prep', 'enrolled'),
 ('the', 'det', 'course'),
 ('course', 'pobj', 'in'),
 ('at', 'prep', 'course'),
 ('the', 'det', 'university'),
 ('university', 'pobj', 'at'),
 ('.', 'punct', 'enrolled')]
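Because every token has exactly one head and the `ROOT` is its own head, the head links form a tree that we can walk. The plain-Python sketch below does this on head positions transcribed by hand from the output above; positions are used instead of the token text because `the` occurs twice. In spaCy, the position of a token's head is available as `token.head.i`.

```python
# Tokens and the position of each token's head, transcribed from the output above.
tokens = ["She", "enrolled", "in", "the", "course", "at", "the", "university", "."]
heads = [1, 1, 1, 4, 2, 4, 7, 5, 1]  # the ROOT ("enrolled") points to itself

def path_to_root(index):
    """Follow head links from a token up to the ROOT and return the texts visited."""
    path = [tokens[index]]
    while heads[index] != index:
        index = heads[index]
        path.append(tokens[index])
    return path

# Invert the head links to get the direct children of each head position.
children = {}
for child, head in enumerate(heads):
    if child != head:
        children.setdefault(head, []).append(child)

print(path_to_root(7))
print([tokens[i] for i in children[1]])
```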

#### Refs

1. https://spacy.io/usage/spacy-101