Introduction to NLP and SpaCy	

	Overview of NLP and its applications
	Introduction to SpaCy and its features
	Installing and setting up SpaCy environment
	Basic text processing with SpaCy

	
 Tokenization and Lemmatization	

	Understanding tokenization and its importance in NLP
	Tokenization with SpaCy
	Lemmatization and word normalization using SpaCy
	
Part-of-Speech (POS) Tagging	

	Introduction to POS tagging and its role in NLP
	POS tagging with SpaCy
	Handling POS ambiguity and improving accuracy
	
	
 Named Entity Recognition (NER)	

	Understanding named entities and their significance
	NER basics and challenges
	Named Entity Recognition
	Visualizing Named Entities
	
Dependency Parsing	

	Introduction to dependency parsing and its applications
	Dependency parsing with SpaCy: NOUN Chunks
	Navigating the parse tree
	Extracting syntactic relationships from sentences
	
	
 Rule-based Matching and Phrase Matching	

	Introduction to rule-based matching in SpaCy
	Token Matcher
	
	
	
Visualization	

	Visualizing the dependency parse
	Visualizing long texts
	Visualizing the entity recognizer
	Visualizing spans
	

## Installation

In [1]:
# pip
# python -m venv .env
# source .env/bin/activate
# pip install -U pip setuptools wheel
# pip install -U spacy
# python -m spacy download en_core_web_sm



In [2]:
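# Note: the model is downloaded before spaCy is upgraded here, which is why pip
# reports a version conflict in the output below. Upgrading spaCy first and
# downloading the model afterwards avoids the mismatch.
!python -m spacy download en_core_web_sm

!pip install -U spacy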

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Collecting spacy
  Downloading spacy-3.6.0-cp39-cp39-macosx_10_9_x86_64.whl (6.9 MB)
Installing collected packages: spacy
  Attempting uninstall: spacy
    Found existing installation: spacy 3.5.3
    Uninstalling spacy-3.5.3:
      Successfully uninstalled spacy-3.5.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
en-core-web-sm 3.5.0 requires spacy<3.6.0,>=3.5.0, but you have spacy 3.6.0 which is incompatible.
Successfully installed spacy-3.6.0


## Basic Processing

In [3]:
import spacy
# Load the English language model
nlp = spacy.load('en_core_web_sm')



In [4]:
# Sample text for processing
text = "SpaCy is an open-source library for advanced Natural Language Processing."

# Process the text
doc = nlp(text)

# Tokenization
tokens = [token.text for token in doc]
print("Tokens:", tokens)

Tokens: ['SpaCy', 'is', 'an', 'open', '-', 'source', 'library', 'for', 'advanced', 'Natural', 'Language', 'Processing', '.']


## Tokenization

In [5]:
doc = nlp("I love GFG, what's your choice?")
for token in doc:
    print(token.text)

I
love
GFG
,
what
's
your
choice
?


## Adding special case tokenization rules

Most domains have at least some idiosyncrasies that require custom tokenization rules, such as fixed expressions or abbreviations used only in that field. Here’s how to add a special case rule to an existing Tokenizer instance:

In [8]:
from spacy.symbols import ORTH

In [15]:
doc = nlp("gimme that")  # phrase to tokenize
print([w.text for w in doc])  # ['gimme', 'that']



['gimme', 'that']


In [17]:
# Add special case rule
special_case = [{ORTH: "gim"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)

In [18]:
# Check new tokenization
print([w.text for w in nlp("gimme that")])  # ['gim', 'me', 'that']

['gim', 'me', 'that']
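A special case can also carry a normalized form for each piece. The sketch below is a minimal variation on the rule above. It assumes a blank English pipeline (`spacy.blank("en")`) so that it runs without a downloaded model, and the `NORM` value `"give"` is an illustrative choice, not something the original rule sets.

```python
import spacy
from spacy.symbols import ORTH, NORM

# A blank English pipeline has the tokenizer but no trained components,
# so this sketch does not depend on en_core_web_sm.
nlp = spacy.blank("en")

# Split "gimme" into two tokens and attach a normalized form to the first.
# The ORTH values must concatenate back to the original string "gimme".
special_case = [{ORTH: "gim", NORM: "give"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)

doc = nlp("gimme that")
token_texts = [t.text for t in doc]
token_norms = [t.norm_ for t in doc]
print(token_texts)
print(token_norms)
```

The token text is left untouched, so the original string can still be reconstructed, while `norm_` gives downstream code a cleaner form to work with.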


## Debugging the tokenizer


For debugging, spaCy provides nlp.tokenizer.explain(text). It returns a list of tuples showing which tokenizer rule or pattern was matched for each token. The tokens produced are identical to those from nlp.tokenizer(), except for whitespace tokens.

In [20]:
text = '''"Let's go!"'''
doc = nlp(text)



In [21]:
tok_exp = nlp.tokenizer.explain(text)
assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp]
for t in tok_exp:
    print(t[1], "\t", t[0])

" 	 PREFIX
Let 	 SPECIAL-1
's 	 SPECIAL-2
go 	 TOKEN
! 	 SUFFIX
" 	 SUFFIX
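When a text produces many tokens, it can help to summarize which rules fired rather than read the list line by line. The following is a small sketch of that idea. It assumes a blank English pipeline, whose tokenizer rules are the standard English ones, and the `rule_counts` name is only illustrative.

```python
from collections import Counter

import spacy

nlp = spacy.blank("en")  # tokenizer only; no trained model required
text = '''"Let's go!"'''

# Each item is a (rule_name, token_text) tuple.
explained = nlp.tokenizer.explain(text)

# Count how often each kind of rule produced a token.
rule_counts = Counter(rule for rule, _ in explained)
for rule, count in rule_counts.items():
    print(rule, count)
```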


## Lemmatizer

In [22]:
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode)  # 'rule'

rule


In [23]:
doc = nlp("The quick brown foxes jumped over the lazy dogs.")
print([token.lemma_ for token in doc])
# ['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '.']

['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '.']


In [24]:
# Sample text for processing
text = "The quick brown foxes jumped over the lazy dogs."

# Process the text
doc = nlp(text)

# Lemmatization
lemmas = [token.lemma_ for token in doc]
print("Lemmas:", lemmas)

Lemmas: ['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '.']


In [25]:
# Lemmatization with POS filtering
lemmas_filtered = [token.lemma_ for token in doc if token.pos_ in ['NOUN', 'VERB']]
print("Filtered Lemmas:", lemmas_filtered)




Filtered Lemmas: ['fox', 'jump', 'dog']


In [26]:
# Lemmatization with custom stop words
custom_stop_words = ['quick', 'jumped', 'lazy']


In [27]:
lemmas_custom_stop = [token.lemma_ for token in doc if token.text.lower() not in custom_stop_words]


In [28]:
print("Lemmas with Custom Stop Words:", lemmas_custom_stop)

Lemmas with Custom Stop Words: ['the', 'brown', 'fox', 'over', 'the', 'dog', '.']


# Part-of-Speech Tagging

After tokenization, spaCy can parse and tag a given Doc. This is where the trained pipeline and its statistical models come in, which enable spaCy to make predictions of which tag or label most likely applies in this context. A trained component includes binary data that is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following “the” in English is most likely a noun.

Linguistic annotations are available as Token attributes. Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute, we need to add an underscore _ to its name: for example, token.pos is an integer ID, while token.pos_ is the string.
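The relationship between the hashed and the readable form of an attribute can be checked directly. This is a minimal sketch using the `orth` attribute and the vocabulary's string store. It assumes a blank English pipeline, since no trained component is needed for this.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer and vocabulary only
doc = nlp("I love coffee")
token = doc[2]

# Without the underscore, the attribute is an integer hash ...
print(type(token.orth), token.orth)
# ... and with the underscore, it is the readable string.
print(token.orth_)

# The string store maps in both directions: string -> hash and hash -> string.
coffee_hash = nlp.vocab.strings["coffee"]
coffee_text = nlp.vocab.strings[coffee_hash]
print(coffee_hash, coffee_text)
```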

In [14]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [15]:
doc = nlp("Rudra taken data science course. worth $60")
doc

Rudra taken data science course. worth $60

In [16]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Rudra Rudra PROPN NNP nsubj Xxxxx True False
taken take VERB VBD ROOT xxxx True False
data data NOUN NN compound xxxx True False
science science NOUN NN compound xxxx True False
course course NOUN NN dobj xxxx True False
. . PUNCT . punct . False False
worth worth ADJ JJ ROOT xxxx True False
$ $ SYM $ nmod $ False False
60 60 NUM CD npadvmod dd False False


In [17]:
for token in doc:
    print("{}:".format(token),token.lemma_)

Rudra: Rudra
taken: take
data: data
science: science
course: course
.: .
worth: worth
$: $
60: 60


In [18]:
for token in doc:
    print("{}:".format(token),token.pos_)

Rudra: PROPN
taken: VERB
data: NOUN
science: NOUN
course: NOUN
.: PUNCT
worth: ADJ
$: SYM
60: NUM


In [19]:
for token in doc:
    print("{}:".format(token),token.tag_)

Rudra: NNP
taken: VBD
data: NN
science: NN
course: NN
.: .
worth: JJ
$: $
60: CD


In [20]:
for token in doc:
    print("{}:".format(token),token.dep_)

Rudra: nsubj
taken: ROOT
data: compound
science: compound
course: dobj
.: punct
worth: ROOT
$: nmod
60: npadvmod


In [21]:
for token in doc:
    print("{}:".format(token),token.shape_)

Rudra: Xxxxx
taken: xxxx
data: xxxx
science: xxxx
course: xxxx
.: .
worth: xxxx
$: $
60: dd


In [22]:
for token in doc:
    print("{}:".format(token),token.is_alpha)

Rudra: True
taken: True
data: True
science: True
course: True
.: False
worth: True
$: False
60: False


In [23]:
for token in doc:
    print("{}:".format(token),token.is_stop)

Rudra: False
taken: False
data: False
science: False
course: False
.: False
worth: False
$: False
60: False


# Parts Of Speech

## Handling POS ambiguity and improving accuracy

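A statistical tagger such as spaCy's already uses context to choose between candidate tags, but it can still mislabel unusual or ungrammatical input. One way to correct known problem cases is to layer hand-written contextual rules on top of the predicted tags. Here's an example using SpaCy: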

In [24]:
import spacy

# Load the pre-trained English model
nlp = spacy.load('en_core_web_sm')

In [25]:
# Text to be POS tagged
text = "Rudra taken data science course. worth $60."

In [26]:
# Process the text with the POS tagger
doc = nlp(text)



In [27]:
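# Function to override POS tags using hand-written contextual rules
def disambiguate_pos(token):
    # "taken" is the ROOT of its sentence, so its head is itself;
    # the subject "Rudra" is one of its children, not its head.
    if token.text == "taken" and any(c.dep_ == "nsubj" for c in token.children):
        return "VERB"  # "taken" with a subject acts as a verb
    elif (token.text == "worth" and token.i + 1 < len(token.doc)
          and token.nbor().is_currency):
        return "ADJ"  # "worth" followed by a currency symbol, as in "worth $60"
    else:
        return token.pos_  # Return the default POS tag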

In [28]:
# Iterate over the tokens in the document
for token in doc:
    pos = disambiguate_pos(token)
    print(token.text, pos)

Rudra PROPN
taken VERB
data NOUN
science NOUN
course NOUN
. PUNCT
worth ADJ
$ SYM
60 NUM
. PUNCT


In [29]:
# Output the tagged text
print([(token.text, disambiguate_pos(token)) for token in doc])

[('Rudra', 'PROPN'), ('taken', 'VERB'), ('data', 'NOUN'), ('science', 'NOUN'), ('course', 'NOUN'), ('.', 'PUNCT'), ('worth', 'ADJ'), ('$', 'SYM'), ('60', 'NUM'), ('.', 'PUNCT')]


# Named Entity Recognition

NER, or Named Entity Recognition, is a fundamental task in natural language processing (NLP) that involves identifying and categorizing named entities in text. Named entities refer to specific entities that have names, such as persons, organizations, locations, dates, and more. NER plays a crucial role in various NLP applications, including information extraction, question answering, and text summarization.

## Basics of NER:

<h3>Definition:</h3> 

NER is the process of identifying and classifying named entities in text into predefined categories. The commonly recognized entity types include person names, organizations, locations, date expressions, time expressions, monetary values, percentages, and more.

<h3>Entity Types:</h3> 
    
<h4> Named entities can be categorized into various types, depending on the domain and application. The common entity types include: </h4>

Person: Names of individual people, real or fictional.

Organization: Company names, institutions, or other organized groups.

Location: Names of cities, countries, regions, or specific places.

Date: Expressions representing dates or periods.

Time: Expressions representing specific times or durations.

Money: Monetary values, including currencies.

Percentage: Numeric expressions representing percentages.

Miscellaneous: Other named entities that do not fit into the above categories.
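The generic categories above do not map one-to-one onto the labels a given model emits. For example, the outputs in this notebook show `GPE` rather than "Location". As a quick way to look up what a label means, spaCy's glossary can be queried with `spacy.explain`. The label list below is an illustrative selection, not the full scheme of any particular pipeline.

```python
import spacy

# A handful of entity labels; the selection is illustrative only.
labels = ["PERSON", "ORG", "GPE", "DATE", "TIME", "MONEY", "PERCENT"]

# spacy.explain looks a label up in spaCy's built-in glossary
# and returns None for labels it does not know.
descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(f"{label:<8} {description}")
```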

<h3>Annotation:</h3> 

NER typically involves annotating a dataset with labeled named entities. Human annotators manually label the named entities in the text and assign appropriate entity types. This annotated data is then used to train NER models.
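Annotations are often recorded as character offsets plus a label, and then converted to per-token tags for training. The sketch below shows that conversion with spaCy's `offsets_to_biluo_tags` helper. The offsets are hand-written for illustration, and a blank English pipeline is assumed so that only the tokenizer is involved.

```python
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# (start_char, end_char, label) triples, as an annotator might record them.
annotations = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

# BILUO scheme: B(egin), I(n), L(ast), U(nit, single-token entity), O(utside).
tags = offsets_to_biluo_tags(doc, annotations)
for token, tag in zip(doc, tags):
    print(token.text, tag)
```

If an offset does not line up with token boundaries, the affected tokens are tagged `-` instead, which is a useful signal that the annotation and the tokenization disagree.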

<h3> Applications:</h3> 

NER is used in a wide range of applications, including:

<h4>*Information Extraction:</h4> Extracting structured information from unstructured text, such as extracting person names and organizations from news articles.

<h4>*Question Answering:</h4> Identifying and extracting relevant entities to answer specific questions, such as extracting the location mentioned in a question like "Where was the conference held?"

<h4>*Text Summarization:</h4> Recognizing important named entities to generate informative summaries.

<h4>*Document Classification:</h4> Augmenting document classification models with named entity information for improved performance.

In [30]:
# Text to be processed
text = "Apple Inc. is headquartered in Cupertino, California."

# Process the text with NER
doc = nlp(text)

# Iterate over the entities in the document
for entity in doc.ents:
    print(entity.text, entity.label_)



Apple Inc. ORG
Cupertino GPE
California GPE


In [31]:
# Output the tagged text with entity annotations
print([(entity.text, entity.label_) for entity in doc.ents])

[('Apple Inc.', 'ORG'), ('Cupertino', 'GPE'), ('California', 'GPE')]


As you can see, the entities "Apple Inc." (an organization), "Cupertino" (a geopolitical entity), and "California" (a geopolitical entity) are correctly recognized and labeled by the NER model.

spaCy features a fast statistical entity recognition system that assigns labels to contiguous spans of tokens. The default trained pipelines can identify a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.

## Named Entity Recognition
A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.


Named entities are available as the ents property of a Doc:

In [32]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

In [33]:
ner=[]

In [36]:
l=[]
for ent in doc.ents:
    l.append(ent.text)
    print("{}".format(ent),ent.text)
    
ner.append(l)

Apple Apple
U.K. U.K.
$1 billion $1 billion


In [37]:
m=[]
for ent in doc.ents:
    m.append(ent.start_char)
    print("{}".format(ent),ent.start_char)
ner.append(m)

Apple 0
U.K. 27
$1 billion 44


In [38]:
n=[]
for ent in doc.ents:
    n.append(ent.end_char)
    print("{}".format(ent),ent.end_char)
ner.append(n)

Apple 5
U.K. 31
$1 billion 54


In [39]:
o=[]
for ent in doc.ents:
    o.append(ent.label_)
    print("{}".format(ent),ent.label_)
ner.append(o)

Apple ORG
U.K. GPE
$1 billion MONEY


In [40]:
for i in range(len(ner)):
    print(ner[i][0])

Apple
0
5
ORG


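The ner variable now holds four parallel lists: the entity texts, their start character offsets, their end character offsets, and their entity labels (these are entity types, not POS tags). Reading the same position from each list gives one entity's record. For instance, position 0 yields "Apple", 0, 5 and "ORG", which says that characters 0 to 5 contain "Apple" and that it is an organization entity.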
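To turn the collected lists into one tuple per entity, the lists can be transposed with `zip`. This is a plain-Python sketch; the `ner` values are copied from the cell outputs above, and the `entities` name is used only for illustration.

```python
# The four parallel lists collected in the cells above:
# texts, start offsets, end offsets, and labels.
ner = [
    ["Apple", "U.K.", "$1 billion"],
    [0, 27, 44],
    [5, 31, 54],
    ["ORG", "GPE", "MONEY"],
]

# zip(*ner) transposes the lists into one (text, start, end, label) tuple per entity.
entities = list(zip(*ner))
for entity in entities:
    print(entity)
```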

In [42]:
print(ner)

[['Apple', 'U.K.', '$1 billion'], [0, 27, 44], [5, 31, 54], ['ORG', 'GPE', 'MONEY']]


In [43]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY


In [45]:
from IPython.display import Image
  
# get the image
Image(url="img1.png", width=500, height=500)

# NER Visualization

## Visualizing the entity recognizer
The entity visualizer, ent, highlights named entities and their labels in a text.

In [47]:
import spacy
from spacy import displacy

text = "Twitter is a popular social media platform that enables users to share and discover information through short messages called tweets."
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.serve(doc, style="ent")




Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
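`displacy.serve` starts a web server and blocks the notebook until it is stopped. An alternative is `displacy.render`, which returns the markup so it can be displayed inline or saved. The sketch below assumes a blank pipeline with one hand-labelled entity, purely so that it runs without a trained model; with `en_core_web_sm` the entities would come from the model's predictions.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Twitter is a popular social media platform.")

# Hand-labelled entity for illustration; a trained pipeline would predict it.
doc.ents = [Span(doc, 0, 1, label="ORG")]

# jupyter=False makes render() return the HTML string instead of displaying it.
html = displacy.render(doc, style="ent", page=True, jupyter=False)
print(len(html), "characters of HTML")
```

The returned string can be written to an `.html` file and opened in a browser. Inside a notebook, calling `displacy.render(doc, style="ent")` without `jupyter=False` displays the result in the cell output.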


## Adding titles to documents

Rendering several large documents on one page can easily become confusing. To add a headline to each visualization, you can add a title to its user_data. User data is never touched or modified by spaCy.

In [49]:
doc = nlp("When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously.")
doc.user_data["title"] = "Self-Driving Car"
displacy.serve(doc, style="ent")


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
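Titles matter most when several documents share one page. The following sketch renders two documents together, each with its own headline. The texts, titles, and hand-labelled entities are illustrative assumptions, and a blank pipeline is used so that no trained model is required.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")

# Hypothetical titles and texts, for illustration only.
texts = {
    "Self-driving cars": "Sebastian Thrun worked on self-driving cars at Google.",
    "Acquisitions": "Apple is looking at buying a U.K. startup.",
}

docs = []
for title, text in texts.items():
    doc = nlp(text)
    doc.user_data["title"] = title  # displaCy reads the headline from here
    docs.append(doc)

# Hand-labelled entities; a trained pipeline would predict these.
docs[0].ents = [Span(docs[0], 0, 2, label="PERSON")]
docs[1].ents = [Span(docs[1], 0, 1, label="ORG")]

# Passing a list renders all documents on a single page.
html = displacy.render(docs, style="ent", page=True, jupyter=False)
```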


# Dependency parsing

Dependency parsing is a natural language processing (NLP) technique used to analyze the grammatical structure of a sentence by identifying the syntactic relationships between words. It involves determining the dependency relationships between words in a sentence and representing them as a hierarchical structure called a dependency tree.


<h4>Here's a basic code example using SpaCy to perform dependency parsing:</h4>

In [51]:
# Text to be parsed
text = "I love GEEKSFORGEEKS."

# Process the text with dependency parsing
doc = nlp(text)

In [52]:
# Iterate over the tokens in the document
for token in doc:
    print(token.text, token.dep_, token.head.text)




I nsubj love
love ROOT love
GEEKSFORGEEKS dobj love
. punct love


In [53]:
# Output the dependency tree
for token in doc:
    print(token.text, token.dep_, token.head.text, [child.text for child in token.children])

I nsubj love []
love ROOT love ['I', 'GEEKSFORGEEKS', '.']
GEEKSFORGEEKS dobj love []
. punct love []


## Noun chunks
Noun chunks are “base noun phrases” – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, “the lavish green grass” or “the world’s largest tech fund”. To get the noun chunks in a document, simply iterate over Doc.noun_chunks.

In [54]:
doc = nlp("Twitter is a popular social media platform that enables users to share and discover information through short messages called tweets.")

In [55]:
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)

Twitter Twitter nsubj is
a popular social media platform platform attr is
that that nsubj enables
users users dobj enables
information information dobj discover
short messages messages pobj through
tweets tweets oprd called


### Navigating the parse tree
spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree. The term dep is used for the arc label, which describes the type of syntactic relation that connects the child to the head. As with other attributes, the value of .dep is a hash value. You can get the string value with .dep_.

In [57]:
doc = nlp("Twitter is a popular social media platform that enables users to share and discover information through short messages called tweets.")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])

Twitter nsubj is AUX []
is ROOT is AUX [Twitter, platform, .]
a det platform NOUN []
popular amod platform NOUN []
social amod media NOUN []
media compound platform NOUN [social]
platform attr is AUX [a, popular, media, enables]
that nsubj enables VERB []
enables relcl platform NOUN [that, users, share]
users dobj enables VERB []
to aux share VERB []
share xcomp enables VERB [to, and, discover]
and cc share VERB []
discover conj share VERB [information, through]
information dobj discover VERB []
through prep discover VERB [messages]
short amod messages NOUN []
messages pobj through ADP [short, called]
called acl messages NOUN [tweets]
tweets oprd called VERB []
. punct is AUX []
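Besides `head` and `children`, tokens expose a few more navigation helpers such as `subtree`, `ancestors`, `lefts` and `rights`. The sketch below demonstrates them on a small hand-specified parse. Building the `Doc` with explicit `heads` and `deps` is an assumption made here so that the example does not depend on a trained parser; the attributes work the same way on a parsed document.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# A hand-specified parse of "I love playing soccer."
words = ["I", "love", "playing", "soccer", "."]
heads = [1, 1, 1, 2, 1]  # absolute index of each token's head
deps = ["nsubj", "ROOT", "xcomp", "dobj", "punct"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

playing = doc[2]
subtree = [t.text for t in playing.subtree]      # the token plus all its descendants
ancestors = [t.text for t in playing.ancestors]  # heads on the path up to the root

root = [t for t in doc if t.head == t][0]        # the root is its own head
lefts = [t.text for t in root.lefts]             # children to the left of the root
rights = [t.text for t in root.rights]           # children to the right of the root

print(subtree, ancestors, root.text, lefts, rights)
```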


# Extracting syntactic relationships

In [58]:

# Text to be parsed
text = "I love playing soccer."

# Process the text with dependency parsing
doc = nlp(text)

# Extract syntactic relationships
relationships = []
for token in doc:
    relationships.append((token.text, token.dep_, token.head.text))

# Output the extracted relationships
for relationship in relationships:
    print(relationship)



('I', 'nsubj', 'love')
('love', 'ROOT', 'love')
('playing', 'xcomp', 'love')
('soccer', 'dobj', 'playing')
('.', 'punct', 'love')


In this example, we reuse the nlp pipeline that was loaded earlier with spacy.load('en_core_web_sm'). We define the input text, text, as "I love playing soccer."

We process the text using nlp(text) to create a Doc object. Then, we iterate over the tokens in doc. For each token, we extract its text, dependency label (dep_), and the head word's text (head.text).

We store the extracted relationships in a list called relationships. Finally, we iterate over the relationships list and print each extracted syntactic relationship.
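The same per-token information can be combined into higher-level relations. As one possible sketch, the function below collects subject-verb-object triples by looking at the `nsubj` and `dobj` children of each root. The function name `subject_verb_object` and the hand-specified parse are illustrative assumptions; real sentences need more cases (passives, conjunctions, and so on).

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-specified parse, so the sketch does not depend on a trained parser.
doc = Doc(
    nlp.vocab,
    words=["Rudra", "took", "a", "course", "."],
    heads=[1, 1, 3, 1, 1],
    deps=["nsubj", "ROOT", "det", "dobj", "punct"],
)


def subject_verb_object(doc):
    """Return (subject, verb, object) triples found at each ROOT token."""
    triples = []
    for token in doc:
        if token.dep_ != "ROOT":
            continue
        subjects = [c.text for c in token.children if c.dep_ == "nsubj"]
        objects = [c.text for c in token.children if c.dep_ == "dobj"]
        triples.extend((s, token.text, o) for s in subjects for o in objects)
    return triples


triples = subject_verb_object(doc)
print(triples)
```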

#  Rule-based Matching and Phrase Matching

## Rule Based Matcher
Rule-based matching in SpaCy finds patterns or entities based on predefined rules. It allows you to define patterns using linguistic attributes such as token text, part-of-speech tags, dependency labels, and more. The rule-based matcher can be used to find sequences of tokens that match these patterns.

The token-based example in the next section shows the basic workflow.

## Token-based matching

Token-based matching in SpaCy allows you to find and match specific tokens based on their attributes or properties. It provides flexibility in defining patterns and conditions for matching tokens within a text. Let's look at an example:

In [61]:
from spacy.matcher import Matcher

In [62]:
# Text to be processed
text = "hello I have a cat and A dog."

# Initialize the matcher
matcher = Matcher(nlp.vocab)
matcher

<spacy.matcher.matcher.Matcher at 0x7fb224429670>

### Adding patterns

The pattern describes two consecutive tokens:<br>
1. A token whose lowercase form matches “a”, e.g. the first token of “A cat” or “a dog”.<br>
2. A token whose part-of-speech tag is NOUN.

In [64]:
# Define a pattern
pattern = [{"LOWER": "a"}, {'POS': 'NOUN'}]
pattern

[{'LOWER': 'a'}, {'POS': 'NOUN'}]

In [65]:
# Add the pattern to the matcher
matcher.add('NounPattern', [pattern])



In [66]:
# Process the text with the matcher
doc = nlp(text)
matches = matcher(doc)
matcher

<spacy.matcher.matcher.Matcher at 0x7fb224429670>

In [67]:
# Iterate over the matched spans
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

a cat
A dog


We initialize the matcher using Matcher(nlp.vocab). Then, we define the pattern we want to match using a list of dictionaries. Each dictionary describes one token through its attributes and their required values. In this case, the pattern matches a token whose lowercase form is "a" (so both "a" and "A"), followed by a noun.

We add the pattern to the matcher using matcher.add("NounPattern", [pattern]). The first argument is a unique identifier for the rule, and the second argument is a list of patterns, each of which is itself a list of token descriptions. An optional callback function can be passed with the on_match keyword argument.

Next, we process the text with the matcher using matches = matcher(doc). This returns a list of tuples, where each tuple contains the match ID, start index, and end index of the matched span.

Finally, we iterate over the matched spans and print their text using span.text.
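The optional callback can be used to react to each match as it is found. The sketch below registers one with `on_match`. Because it assumes a blank pipeline without a tagger, the second token is matched with `IS_ALPHA` instead of `POS`; the rule name `ArticlePlusWord` and the `collect` function are illustrative.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # no tagger here, so the pattern avoids POS
matcher = Matcher(nlp.vocab)

found = []


def collect(matcher, doc, i, matches):
    """Callback invoked once per match; i indexes into the matches list."""
    match_id, start, end = matches[i]
    found.append(doc[start:end].text)


# "a" in any casing, followed by one alphabetic token.
pattern = [{"LOWER": "a"}, {"IS_ALPHA": True}]
matcher.add("ArticlePlusWord", [pattern], on_match=collect)

doc = nlp("hello I have a cat and A dog.")
matches = matcher(doc)

# The match ID is a hash; the string store turns it back into the rule name.
rule_names = {nlp.vocab.strings[match_id] for match_id, _, _ in matches}
print(found, rule_names)
```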

# Visualizers 

In [68]:
from spacy import displacy

In [69]:
doc = nlp("I love Gfg")
displacy.serve(doc, style="dep")


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


## Visualizing long texts

Long texts can become difficult to read when displayed in one row, so it’s often better to visualize them sentence-by-sentence instead. As of v2.0.12, displacy supports rendering both Doc and Span objects, as well as lists of Docs or Spans. Instead of passing the full Doc to displacy.serve, you can also pass in a list of its sentences, list(doc.sents). This will create one visualization for each sentence.

In [71]:
text = """Machine learning is a subfield of artificial intelligence that focuses on developing algorithms and models that enable computers to learn from data and make predictions or decisions without being explicitly programmed. It involves statistical techniques and algorithms that allow machines to automatically learn patterns, extract insights, and improve performance over time. Machine learning has a wide range of applications across various domains, including image and speech recognition, natural language processing, recommendation systems, fraud detection, autonomous vehicles, and medical diagnostics."""
doc = nlp(text)
sentence_spans = list(doc.sents)
displacy.serve(sentence_spans, style="dep")


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
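To see what the list handed to displaCy contains, the sentence spans can be inspected directly. This sketch assumes a blank pipeline with the rule-based `sentencizer` component in place of the trained parser, and a shortened text. With `en_core_web_sm`, `doc.sents` comes from the parser and no extra component is needed.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence splitter, no model needed

text = (
    "Machine learning is a subfield of artificial intelligence. "
    "It involves statistical techniques. "
    "It has a wide range of applications."
)
doc = nlp(text)

# Each item is a Span covering one sentence of the original Doc.
sentence_spans = list(doc.sents)
for span in sentence_spans:
    print(span.start, span.end, span.text)
```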


## Visualizing the entity recognizer

The entity visualizer, ent, highlights named entities and their labels in a text.

In [72]:
text = "When Rudransh started working on self-driving cars at Geeksforgeeks in 2023, few people outside of the company took him seriously."
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.serve(doc, style="ent")


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


## Visualizing spans

In [73]:
from spacy.tokens import Span

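text = "When Rudransh started working on self-driving cars at Geeksforgeeks in 2023, few people outside of the company took him seriously."

nlp = spacy.blank("en")
doc = nlp(text)

# Token indices: "Rudransh" is token 1 and "Geeksforgeeks" is token 10,
# because the tokenizer splits "self-driving" into "self", "-", "driving"
doc.spans["sc"] = [
    Span(doc, 1, 2, "PERSON"),
    Span(doc, 10, 11, "ORG"),
]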

displacy.serve(doc, style="span")


Using the 'span' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
