## Tokens in spaCy
In spaCy, a "token" refers to an individual piece of a text, like a word or a punctuation mark. When spaCy processes a text, it splits it into tokens, which is a process known as tokenization. This is a fundamental step in Natural Language Processing (NLP), as it breaks down text into manageable pieces for further analysis.

Each token is represented in spaCy as an instance of the Token class, which provides attributes and methods for accessing linguistic features and metadata about the token. For each token, you can get its text, lemma (base form), part-of-speech tag, dependency relation to other tokens, and whether it is a stop word, among other properties.
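As a rough illustration of these attributes, the sketch below prints a few of them for a short sentence. It assumes the en_core_web_sm model is installed; the fallback to a blank English pipeline and the helper name describe_tokens are additions for illustration only. In the blank pipeline the lemma, part-of-speech, and dependency fields are expected to be empty, because those values come from trained components.

```python
import spacy

# Assumption: en_core_web_sm is installed. If it is not, fall back to a
# blank English pipeline, which still tokenizes and flags stop words but
# leaves lemma, part-of-speech, and dependency labels empty.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")

doc = nlp("Gus Proto is a Python developer.")


def describe_tokens(doc):
    """Hypothetical helper: collect a few Token attributes per token."""
    return [
        (token.text, token.lemma_, token.pos_, token.dep_, token.is_stop)
        for token in doc
    ]


rows = describe_tokens(doc)
for text, lemma, pos, dep, is_stop in rows:
    print(f"{text:10} lemma={lemma!r} pos={pos!r} dep={dep!r} stop={is_stop}")
```

The attributes ending in an underscore (lemma_, pos_, dep_) return readable strings; the variants without the underscore return integer IDs.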

Tokenization is the basis for the later, more complex steps in spaCy's pipeline, such as part-of-speech tagging, dependency parsing, and named entity recognition, all of which operate on the tokens of a Doc rather than on the raw string.

In [1]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [5]:
about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)
about_doc = nlp(about_text)
print(about_doc)

Gus Proto is a Python developer currently working for a London-based Fintech company. He is interested in learning Natural Language Processing.


In [6]:
# Now, let's print the tokens and their indexes in the string.
# This is done by iterating over the tokens in the Doc.
# The token.idx attribute returns the token's character offset in the original text.

for token in about_doc:
    print(token, token.idx)

Gus 0
Proto 4
is 10
a 13
Python 15
developer 22
currently 32
working 42
for 50
a 54
London 56
- 62
based 63
Fintech 69
company 77
. 84
He 86
is 89
interested 92
in 103
learning 106
Natural 115
Language 123
Processing 132
. 142
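One way to read these offsets: token.idx is where the token starts in the original string, so slicing from token.idx to token.idx + len(token) should give back the token's text. The sketch below checks this on a shorter sentence. It uses spacy.blank("en") as an assumption so that no trained model is needed; the variable names are illustrative.

```python
import spacy

# Assumption: a blank English pipeline is enough here, since only the
# tokenizer is needed to compute character offsets.
nlp = spacy.blank("en")

text = "He works for a London-based Fintech company."
doc = nlp(text)

# token.idx is the start offset; len(token) is the token's length in
# characters, so their sum is the end offset.
spans = [(token.text, token.idx, token.idx + len(token)) for token in doc]

for token_text, start, end in spans:
    # Slicing the original string with these offsets recovers the token text.
    print(f"{token_text!r:12} text[{start}:{end}] -> {text[start:end]!r}")
```

This is useful when token positions need to be mapped back onto the source text, for example to highlight a word in the original string.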
