
## Learning Objectives:

At the end of the experiment, you will be able to:

* understand the spaCy library
* perform simple natural language processing tasks using the spaCy library

## Introduction

**spaCy** is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

It is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

spaCy's features and capabilities include:

- ***Tokenization***: Segmenting text into words, punctuation marks, etc.
- ***Part-of-speech (POS) Tagging***: Assigning word types to tokens, like verb or noun.
- ***Dependency Parsing***: Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
- ***Lemmatization***: Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”.
- ***Sentence Boundary Detection (SBD)***: Finding and segmenting individual sentences.
- ***Named Entity Recognition (NER)***: Labelling named “real-world” objects, like persons, companies or locations.
- ***Entity Linking (EL)***: Disambiguating textual entities to unique identifiers in a knowledge base.
- ***Similarity***: Comparing words, text spans and documents and how similar they are to each other.
- ***Text Classification***: Assigning categories or labels to a whole document, or parts of a document.
- ***Rule-based Matching***: Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.
- ***Training***: Updating and improving a statistical model's predictions.
- ***Serialization***: Saving objects to files or byte strings.


### Statistical models

While some of spaCy's features work independently, others require ***trained pipelines*** to be loaded, which enable spaCy to predict linguistic annotations - for example, whether a word is a verb or a noun.

A trained pipeline can consist of multiple components that use a statistical model trained on labeled data.

spaCy currently offers trained pipelines for a variety of languages, which can be installed as individual Python modules. Pipeline packages can differ in size, speed, memory usage, accuracy and the data they include.

For English, the available trained pipelines include:
- `en_core_web_sm`
- `en_core_web_md`
- `en_core_web_lg`
- `en_core_web_trf` - English transformer pipeline

To learn more about the trained pipelines for English, see the [spaCy English models page](https://spacy.io/models/en).

Let's perform basic NLP tasks with spaCy using an English trained pipeline.


### Install packages

In [1]:
!pip -q install spacy==3.7.4

In [2]:
!python -m spacy info

spaCy version    3.7.4                         
Location         /usr/local/lib/python3.10/dist-packages/spacy
Platform         Linux-6.1.58+-x86_64-with-glibc2.35
Python version   3.10.12                       
Pipelines        en_core_web_sm (3.7.1)        



The output above shows that the small English pipeline, `en_core_web_sm`, is already installed in this environment. It is not bundled with spaCy itself; a fresh installation of spaCy includes no trained pipelines.

The medium, large, and transformer pipelines must be installed first with the `!python -m spacy download` command.

For example: `!python -m spacy download en_core_web_trf`

In [None]:
# Install English transformer pipeline
# NOTE that Runtime needs to restart after this step

!python -m spacy download en_core_web_trf

Collecting en-core-web-trf==3.7.3
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3-py3-none-any.whl (457.4 MB)
Collecting spacy-curated-transformers<0.3.0,>=0.2.0 (from en-core-web-trf==3.7.3)
  Downloading spacy_curated_transformers-0.2.2-py2.py3-none-any.whl (236 kB)
Collecting curated-transformers<0.2.0,>=0.1.0 (from spacy-curated-transformers<0.3.0,>=0.2.0->en-core-web-trf==3.7.3)
  Downloading curated_transformers-0.1.1-py2.py3-none-any.whl (25 kB)
Collecting curated-tokenizers<0.1.0,>=0.0.9 (from spacy-curated-transformers<0.3.0,>=0.2.0->en-core-web-trf==3.7.3)
  Downloading curated_tokenizers-0.0.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (731 kB)

**Restart the Runtime/Session**

In [None]:
!python -m spacy info
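
The same check can be done from Python rather than by reading the `spacy info` output. The sketch below is a minimal illustration: `pipeline_available` is a hypothetical helper name introduced here, wrapping `spacy.util.is_package`, which reports whether a package with the given name is installed.

```python
from spacy.util import is_package


def pipeline_available(name: str) -> bool:
    """Return True if a pipeline package with this name is installed.

    Hypothetical helper for illustration; it only wraps spacy.util.is_package.
    """
    return is_package(name)


# Report the status of the pipelines mentioned in this notebook
for name in ["en_core_web_sm", "en_core_web_trf"]:
    status = "installed" if pipeline_available(name) else "missing"
    print(f"{name:<20}{status}")
```

If `en_core_web_trf` is still reported as missing after the download step, the runtime has probably not been restarted yet.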

### Import required packages

In [None]:
import spacy
from spacy import displacy

### Load the trained pipeline

Once you've downloaded and installed a trained pipeline, you can load it via `spacy.load()`. This will return a *Language object* containing all components and data needed to process text. We usually call it `nlp`.


In [None]:
# Load transformer pipeline for English
nlp = spacy.load("en_core_web_trf")

# This gives us a Language object
nlp
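
To see which components a *Language* object contains, you can inspect its `pipe_names` attribute. The sketch below uses a blank English pipeline (the name `blank_nlp` is chosen here so that it does not overwrite `nlp`), which needs no download; on the loaded transformer pipeline, `nlp.pipe_names` would list its trained components instead.

```python
import spacy

# A blank English pipeline has a tokenizer but no other components
blank_nlp = spacy.blank("en")
print(blank_nlp.pipe_names)

# Components are added by name; "sentencizer" is a rule-based sentence splitter
blank_nlp.add_pipe("sentencizer")
print(blank_nlp.pipe_names)

# With the sentencizer in place, the Doc exposes sentence boundaries
sample = blank_nlp("This is one sentence. This is another.")
sentences = [sent.text for sent in sample.sents]
print(sentences)
```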

Essentially, spaCy's *Language* object is a pipeline that uses the language model to perform a number of natural language processing tasks such as *tokenization*, *part-of-speech tagging*, *syntactic parsing*, *named entity recognition*, etc.

<br>
<img src='https://cdn.iisc.talentsprint.com/AIandMLOps/Images/spacy_pipeline.png' width=800px>

<br>

## Performing basic NLP tasks using spaCy

Calling the Language object, `nlp`, on a string of text will return a processed *Doc*.

In [None]:
# An example sentence
text = "Apple is looking at buying U.K. startup for $1 billion."
text

In [None]:
# Feed the string object under 'text' to the Language object under 'nlp'
# Store the result under the variable 'doc'
doc = nlp(text)

In [None]:
type(doc)

Passing the variable `text` to the _Language_ object `nlp` returns a spaCy *Doc* object, short for document.

This object contains both the input text stored under `text` and the results of natural language processing using spaCy.

In [None]:
# Call the variable to examine the object
doc

Calling the variable `doc` returns the contents of the object.

Although the output resembles that of a Python string, the *Doc* object contains a wealth of information about its linguistic structure, which spaCy generated by passing the text through the NLP pipeline.

Let's examine the tasks that were performed under the hood after the input sentence was provided to the language model.

### Tokenization

*Tokenization* breaks the text down into words, punctuation and so on.

The diagram below outlines the tasks that spaCy can perform after a text has been tokenised, such as *part-of-speech tagging*, *syntactic parsing* and *named entity recognition*.

<img src='https://cdn.iisc.talentsprint.com/AIandMLOps/Images/spacy_pipeline.png' width=800px>

Each *Doc* consists of individual tokens, and we can iterate over them.

Let's print out each *Token* object stored in the _Doc_ object `doc`.

In [None]:
# Tokens present inside the document

print("Token\n"+'='*20)

for token in doc:
    print(token.text)
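
Each *Token* carries more than its text. As a rough sketch, the block below tokenizes the same sentence with a blank English pipeline (tokenization is rule-based, so no trained pipeline is needed; `blank_nlp` and `sample` are names chosen here to avoid overwriting `nlp` and `doc`) and prints a few lexical attributes.

```python
import spacy

# Tokenization alone does not require a trained pipeline
blank_nlp = spacy.blank("en")
sample = blank_nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Collect the token index, text, and a few boolean lexical flags
rows = [(tok.i, tok.text, tok.is_alpha, tok.is_punct, tok.like_num) for tok in sample]

print(f"{'Index':<7}{'Token':<12}{'is_alpha':<10}{'is_punct':<10}like_num")
for i, text, is_alpha, is_punct, like_num in rows:
    print(f"{i:<7}{text:<12}{is_alpha!s:<10}{is_punct!s:<10}{like_num}")
```

Note that "U.K." stays a single token, while "$1" is split into "$" and "1" and the final period becomes its own token.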

### Part-of-speech tagging

Part-of-speech (POS) tagging is the task of determining the word class of a token. This is crucial for *disambiguation*, because different parts of speech may have similar forms.

>Consider the example: *The sailor dogs the hatch*.<br>
>The third-person singular present form of the verb *dog* (to fasten something securely) is identical to the plural form of the noun *dog*: *dogs*.

To identify the correct word class, we must examine the context in which the word appears.

spaCy provides two types of part-of-speech tags, coarse and fine-grained, which are stored under the attributes `pos_` and `tag_`, respectively.

To access the results of POS tagging, let's loop over the *Doc* object `doc` and print each *Token* and its part-of-speech tags.

In [None]:
# Print the token and the POS tags

print(f"{' ':<30}POS tag\n{' ':<20}{'-'*25}")
print(f"{'Token':<20}{'Coarse':<13}Fine-grained\n{'='*45}")

for token in doc:
    coarse = token.pos_         # coarse pos tag
    fine = token.tag_           # fine-grained pos tag

    print(f"{token.text:<20}{coarse:<13}{fine}")
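
If a tag in the output is unfamiliar, `spacy.explain` returns a short description for it. The sketch below looks up a handful of coarse and fine-grained tags, including the two that distinguish the readings of *dogs* discussed above (`VBZ` for the verb, `NNS` for the plural noun). It needs no trained pipeline, since the descriptions come from spaCy's built-in glossary.

```python
import spacy

# Coarse tags (pos_) followed by fine-grained tags (tag_)
tags = ["NOUN", "VERB", "PROPN", "NNS", "VBZ", "NNP"]

explanations = {tag: spacy.explain(tag) for tag in tags}

for tag, description in explanations.items():
    print(f"{tag:<8}{description}")
```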

### Lemmatization

A **lemma** is the base form of a word.

Unless explicitly instructed, a computer cannot tell that the singular and plural forms of a word belong together; it treats them as distinct tokens because their forms differ.

For instance, if we want to count the occurrences of words, a process known as _lemmatization_ is needed to group together the different forms of the same word.

Lemmas are available for each _Token_ under the attribute `lemma_`.

In [None]:
# Print the token and its base form

print(f"{'Token':<20} Lemma\n{'='*30}")

for token in doc:
    lemma = token.lemma_
    print(f"{token.text:<20} {lemma}")
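
To make the counting argument concrete, here is a toy illustration in plain Python. The word list and the `toy_lemmas` lookup table are made up for this example and merely stand in for spaCy's lemmatizer; they are not how spaCy computes lemmas.

```python
from collections import Counter

words = ["rat", "rats", "was", "is", "be", "rat"]

# Hypothetical hand-written lookup table standing in for a real lemmatizer
toy_lemmas = {"rats": "rat", "was": "be", "is": "be"}

# Counting surface forms keeps "rat" and "rats" apart
surface_counts = Counter(words)

# Counting lemmas groups the different forms of the same word
lemma_counts = Counter(toy_lemmas.get(word, word) for word in words)

print(surface_counts)
print(lemma_counts)
```

With a processed *Doc*, the same idea can be written as `Counter(token.lemma_ for token in doc)`.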

### Named entity recognition (NER)

Named entity recognition (NER) is the task of recognising and classifying entities named in a text.

The trained English pipelines can recognise named entities such as persons, geographic locations, and products, because these entity types were annotated in the corpus the pipelines were trained on (OntoNotes 5).

We can use the *Doc* object's `.ents` attribute to get the named entities.

In [None]:
# Entities
doc.ents

This returns a tuple with the named entities.

Each item in the tuple is a spaCy *Span* object. *Span* objects can consist of multiple *Token* objects, as many named entities span multiple *Tokens*.

In [None]:
# Check the type of the object used to store named entities
type(doc.ents[0])

The named entities and their types are stored under the attributes `.text` and `.label_` of each *Span* object.

Let's loop over the *Span* objects in the tuple and print out both attributes.

In [None]:
# Loop over the named entities in the Doc object, and print the named entity and its label

print(f"{'Text':<20} {'Entity_label':<16} Explanation\n{'='*80}")

for ent in doc.ents:
    ent_text = ent.text           # named entity
    ent_label = ent.label_        # entity label
    ent_label_val = spacy.explain(ent_label)       # entity label explanation

    print(f"{ent_text:<20} {ent_label:<16} {ent_label_val}")

As you can see, named entities like '$1 billion' identified in the *Doc* consist of multiple *Tokens*, which is why they are represented as *Span* objects.

spaCy [*Span*](https://spacy.io/api/span) objects have several useful attributes.

Most importantly, the attributes `start` and `end` return the indices of _Tokens_, which determine where the _Span_ starts and ends in the *Doc* object. As with Python slices, `end` is exclusive: it is the index of the first _Token_ after the _Span_.

In [None]:
# Print the named entity and indices of its start and end Tokens
print(doc.ents[2], doc.ents[2].start, doc.ents[2].end)

The named entity starts at index 8 and ends just before index 11 in the *Doc* object, so it covers the three *Tokens* at indices 8, 9, and 10.
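
The relationship between `start`, `end`, and slicing can be checked by hand. The sketch below builds the *Span* manually on a blank English pipeline, so it does not depend on what a trained pipeline predicts; the token indices 8 and 11 and the label `MONEY` are taken from the output discussed above, and `blank_nlp`, `sample`, and `money` are names introduced here.

```python
import spacy
from spacy.tokens import Span

blank_nlp = spacy.blank("en")
sample = blank_nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Create a Span over tokens 8, 9 and 10; the end index (11) is exclusive
money = Span(sample, 8, 11, label="MONEY")

print(money.text, money.label_, money.start, money.end, len(money))

# Token indices covered by the Span
token_indices = [tok.i for tok in money]
print(token_indices)

# Slicing the Doc with start and end gives back the same tokens
same_tokens = sample[money.start:money.end]
print(same_tokens.text)

# start_char and end_char are character offsets into the original string
char_slice = sample.text[money.start_char:money.end_char]
print(char_slice)
```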

#### Visualize Named Entities

We can also render the named entities using *displacy*, spaCy's built-in visualisation module, which we imported earlier.

Note that we must pass the string `ent` to the `style` argument to indicate that we wish to visualise named entities.

In [None]:
# Visualize named entity
spacy.displacy.render(doc, style='ent')
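
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. The sketch below assumes a blank English pipeline with two entities set by hand (the spans and labels mirror the output above rather than coming from a trained model) and passes `jupyter=False` to get the HTML back.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

blank_nlp = spacy.blank("en")
sample = blank_nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Set entities manually for illustration; a trained pipeline would predict them
sample.ents = [
    Span(sample, 0, 1, label="ORG"),
    Span(sample, 8, 11, label="MONEY"),
]

# jupyter=False returns the markup as a string; page=True wraps it in a full HTML page
html = displacy.render(sample, style="ent", jupyter=False, page=True)
print(html[:200])
```

The returned string can then be written to an `.html` file and opened in a browser.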

**Example 2:**

In [None]:
# Visualize another sample text
text2 = "On 3rd Feb, Ram was in Delhi.\nLater he traveled to Mumbai via Air India flight reading a Time magazine to meet Raj.\nAfter 10 days, he went again back to Delhi wearing a Timex watch."
doc2 = nlp(text2)
spacy.displacy.render(doc2, style='ent')

If a particular tag used for a named entity is unfamiliar, you can check its explanation.

In [None]:
spacy.explain('DATE')

In [None]:
spacy.explain('PERSON')

**Example 3:**

In [None]:
# Visualize another sample text
text3 = "Holmes solves his another case while sitting at his home in Baker Street, without moving a single inch."
doc3 = nlp(text3)
spacy.displacy.render(doc3, style='ent')

In [None]:
spacy.explain('FAC')

References:
* https://spacy.io/usage/spacy-101