In [4]:
import spacy
nlp = spacy.load('en_core_web_sm')

You can use spaCy to create a processed Doc object, which is a container for accessing linguistic annotations, for a given input string:

In [3]:
introduction_text = ('This tutorial is about Natural'
                     ' Language Processing in Spacy.')
introduction_doc = nlp(introduction_text)
# Extract tokens for the given doc
print([token.text for token in introduction_doc])

['This', 'tutorial', 'is', 'about', 'Natural', 'Language', 'Processing', 'in', 'Spacy', '.']


In the above example, notice how the text is converted into a Doc object that spaCy can work with. You can use this method to convert any text into a processed Doc object and read its attributes, which will be covered in the coming sections.
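
To make the container idea a little more concrete, here is a minimal sketch of how a Doc can be measured, indexed, and sliced like a sequence. It assumes spaCy v3 and uses `spacy.blank('en')`, a tokenizer-only pipeline chosen here so that no trained model is needed; the variable names are illustrative and not part of the notebook.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough to show
# how a Doc behaves as a container; no trained model is loaded here.
blank_nlp = spacy.blank('en')
sample_doc = blank_nlp('This tutorial is about Natural Language Processing in Spacy.')

# A Doc has a length (its number of tokens) and supports indexing
token_count = len(sample_doc)
first_token = sample_doc[0]

# Slicing a Doc returns a Span, a view over a run of tokens
topic_span = sample_doc[4:7]

print(token_count)
print(first_token.text)
print(topic_span.text)
```

Slicing does not copy the text; the Span keeps pointing at the tokens of the parent Doc.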

**READING FROM FILE**

In [5]:
file_name = 'sreelal.txt'
# Read the whole file; the context manager closes it afterwards
with open(file_name) as f:
    introduction_file_text = f.read()
introduction_file_doc = nlp(introduction_file_text)
# Extract tokens for the given doc
print([token.text for token in introduction_file_doc])

['Hello', ',', 'My', 'name', 'is', 'Sreelal', 'C.', 'I', "'m", 'a', 'software', 'Engineer', 'working', 'at', 'Tata', 'Consultancy', 'Services', '.', 'I', "'m", 'passionate', 'about', 'technology', 'and', 'loves', 'working', 'in', 'Artificial', 'Intelligence', 'or', 'Machine', 'Learning', '.']


**SENTENCE DETECTION**

Sentence detection is the process of locating the start and end of sentences in a given text. This allows you to divide a text into linguistically meaningful units. You’ll use these units when you’re processing your text to perform tasks such as part-of-speech tagging and entity extraction.

In spaCy, the sents property is used to extract sentences. Here’s how you would extract the total number of sentences and the sentences for a given input text:

In [9]:
sentences = list(introduction_file_doc.sents)
print(len(sentences))
for sentence in sentences:
    print(sentence)

3
Hello, My name is Sreelal C.
I'm a software Engineer working at Tata Consultancy Services.
I'm passionate about technology and loves working in Artificial Intelligence or Machine Learning.


In the above example, spaCy correctly identifies the sentences in the English text, each of which ends with a full stop (.). You can also customize sentence detection to split on custom delimiters.

Here’s an example where an ellipsis (...) is used as the delimiter:

In [13]:
from spacy.language import Language

# Register the function as a named pipeline component (spaCy v3 API)
@Language.component('set_custom_boundaries')
def set_custom_boundaries(doc):
    # Adds support to use `...` as the delimiter for sentence detection
    for token in doc[:-1]:
        if token.text == '...':
            doc[token.i + 1].is_sent_start = True
    return doc

ellipsis_text = ('Gus, can you, ... never mind, I forgot'
                 ' what I was saying. So, do you think'
                 ' we should ...')
# Load a new model instance
custom_nlp = spacy.load('en_core_web_sm')
# In spaCy v3, components are added by their registered name
custom_nlp.add_pipe('set_custom_boundaries', before='parser')
custom_ellipsis_doc = custom_nlp(ellipsis_text)
custom_ellipsis_sentences = list(custom_ellipsis_doc.sents)
for sentence in custom_ellipsis_sentences:
    print(sentence)

# Sentence detection with no customization
ellipsis_doc = nlp(ellipsis_text)
ellipsis_sentences = list(ellipsis_doc.sents)
for sentence in ellipsis_sentences:
    print(sentence)
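
If punctuation-based splitting is all you need, spaCy v3 also ships a rule-based `sentencizer` component. The sketch below is an alternative to the model-based detection used above and is not part of the original notebook; it assumes spaCy v3 and a blank English pipeline, and the sample text is made up for illustration.

```python
import spacy

# Assumption: a blank pipeline plus the built-in rule-based sentencizer,
# which marks sentence starts after punctuation such as '.', '!' and '?'
rule_nlp = spacy.blank('en')
rule_nlp.add_pipe('sentencizer')

rule_doc = rule_nlp('This is a sentence. This is another one.')
rule_sentences = [sent.text for sent in rule_doc.sents]
print(rule_sentences)
```

Because it needs no trained model, this route is lighter, but it only looks at punctuation and will not handle harder cases the way a parser can.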

**Tokenization in spaCy**

Tokenization is the next topic after sentence detection (inside spaCy's own pipeline, the tokenizer runs before any other component). It allows you to identify the basic units in your text, which are called tokens. Tokenization is useful because it breaks a text into meaningful units that are used for further analysis, like part-of-speech tagging.

In spaCy, you can print tokens by iterating over the Doc object. The example below also prints `token.idx`, the character offset at which each token starts:

In [14]:
for token in introduction_file_doc:
    print(token, token.idx)

Hello 0
, 5
My 7
name 10
is 15
Sreelal 18
C. 26
I 29
'm 30
a 33
software 35
Engineer 44
working 53
at 61
Tata 64
Consultancy 69
Services 81
. 89
I 91
'm 92
passionate 95
about 106
technology 112
and 123
loves 127
working 133
in 141
Artificial 144
Intelligence 155
or 168
Machine 171
Learning 179
. 187
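
As a quick check of what these offsets mean, the following sketch slices the original string at each `token.idx` and compares the result with the token text. It assumes spaCy v3, a blank English pipeline, and a shortened version of the sample sentence; the variable names are illustrative.

```python
import spacy

offset_nlp = spacy.blank('en')
offset_text = 'Hello, My name is Sreelal'
offset_doc = offset_nlp(offset_text)

# Slicing the original string at token.idx recovers the token text
offsets_match = all(
    offset_text[token.idx:token.idx + len(token.text)] == token.text
    for token in offset_doc
)
token_offsets = [(token.text, token.idx) for token in offset_doc]

print(offsets_match)
print(token_offsets)
```

This is what makes it possible to map tokens back to positions in the source text, for example when highlighting matches.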


Each Token also exposes several other attributes:

- **text_with_ws** gives the token text with its trailing space (if present).
- **is_alpha** detects if the token consists of alphabetic characters or not.
- **is_punct** detects if the token is a punctuation symbol or not.
- **is_space** detects if the token is a space or not.
- **shape_** gives the shape of the word.
- **is_stop** detects if the token is a stop word or not.

In [17]:
for token in introduction_file_doc:
    print(token, token.idx, token.text_with_ws, token.is_alpha, token.is_punct,
          token.is_space, token.shape_, token.is_stop)

Hello 0 Hello True False False Xxxxx False
, 5 ,  False True False , False
My 7 My  True False False Xx True
name 10 name  True False False xxxx True
is 15 is  True False False xx True
Sreelal 18 Sreelal  True False False Xxxxx False
C. 26 C.  False False False X. False
I 29 I True False False X True
'm 30 'm  False False False 'x True
a 33 a  True False False x True
software 35 software  True False False xxxx False
Engineer 44 Engineer  True False False Xxxxx False
working 53 working  True False False xxxx False
at 61 at  True False False xx True
Tata 64 Tata  True False False Xxxx False
Consultancy 69 Consultancy  True False False Xxxxx False
Services 81 Services True False False Xxxxx False
. 89 .  False True False . False
I 91 I True False False X True
'm 92 'm  False False False 'x True
passionate 95 passionate  True False False xxxx False
about 106 about  True False False xxxx True
technology 112 technology  True False False xxxx False
and 123 and  True False False xxx True
loves

**REMOVING STOP WORDS**

In [19]:
for token in introduction_file_doc:
    if not token.is_stop:
        print (token)

Hello
,
Sreelal
C.
software
Engineer
working
Tata
Consultancy
Services
.
passionate
technology
loves
working
Artificial
Intelligence
Machine
Learning
.


In [20]:
about_no_stopword_doc = [token for token in introduction_file_doc if not token.is_stop]
print(about_no_stopword_doc)

[Hello, ,, Sreelal, C., software, Engineer, working, Tata, Consultancy, Services, ., passionate, technology, loves, working, Artificial, Intelligence, Machine, Learning, .]
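
The filtered list above still contains punctuation such as `,` and `.`. A small sketch of filtering on both `is_stop` and `is_punct`, and of checking membership in the English stop word set, might look like this. It assumes spaCy v3, a blank English pipeline, and an illustrative sentence that is not from the notebook.

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

filter_nlp = spacy.blank('en')
filter_doc = filter_nlp('This is a tutorial about language processing.')

# Keep only tokens that are neither stop words nor punctuation
content_words = [
    token.text for token in filter_doc
    if not token.is_stop and not token.is_punct
]

print(content_words)
# STOP_WORDS is a plain Python set of lowercase strings
print('about' in STOP_WORDS)
```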


**LEMMATIZATION**

In [22]:
for token in introduction_file_doc:
    print(token, token.lemma_)

Hello hello
, ,
My my
name name
is be
Sreelal Sreelal
C. C.
I I
'm be
a a
software software
Engineer Engineer
working work
at at
Tata Tata
Consultancy Consultancy
Services Services
. .
I I
'm be
passionate passionate
about about
technology technology
and and
loves love
working work
in in
Artificial Artificial
Intelligence Intelligence
or or
Machine Machine
Learning Learning
. .


In [7]:
from spacy import displacy
about_interest_text = ('He is interested in learning'
    ' Natural Language Processing.')
about_interest_doc = nlp(about_interest_text)
displacy.render(about_interest_doc, style='ent', jupyter=True)
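
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. The sketch below assumes spaCy v3; because a blank pipeline has no entity recognizer, it attaches a hand-made `ORG` span purely for illustration (the sentence and the label are assumptions, not model output).

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

viz_nlp = spacy.blank('en')
viz_doc = viz_nlp('He works at Tata Consultancy Services.')

# Hypothetical entity set by hand: tokens 3 to 5 ("Tata Consultancy Services")
viz_doc.ents = [Span(viz_doc, 3, 6, label='ORG')]

# jupyter=False makes render() return the HTML instead of displaying it
html = displacy.render(viz_doc, style='ent', jupyter=False)
print(html[:80])
```

The returned string can then be written to an `.html` file and opened in a browser.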