**Important:**
When running a cell, run the section in sequential order to ensure all prerequisites are met for that cell.

**Setup**

Install spaCy

In [2]:
# Download spaCy
!pip install spacy

In [4]:
# Otherwise, make sure spacy is up to date
!pip install --upgrade spacy

In [5]:
import spacy

__Install and load a spaCy model__

spaCy models are pretrained statistical models that allow you to perform necessary NLP functions.

In this case, we will be using the 'en_core_web_md' as it has more features than the default ('en_core_web_sm'). 
(see https://spacy.io/models/en/)
Each model contains a preset pipeline to process text through. (You will learn about pipelines further down)

Sample info:
'en' means english
'sm' means small
'md' means medium
'lg' means large
size of model reflects its accuracy in its performance of NLP tasks.

In [7]:
# Download model
!python3 -m spacy download en_core_web_md

In [8]:
# Load model as a natural language processer
nlp = spacy.load('en_core_web_md')

**Visualization**

Some sections will provide an example visualization.

For more information on visualization including more visualizer functions, parameters, available arguments and their output, consult this documentation:
https://spacy.io/api/top-level#displacy_options
Under 'displaCy'

**SpaCy Data Structures**
1. Doc
2. Token
3. Vocabulary
4. Lexeme
5. Span
6. Matcher

**1. The 'Doc'**

A 'Doc' represents a piece of text. It is a sequence of Token objects that contain information about the text it represents. Docs are the primary data structure for text processing in spaCy.
    
All doc methods and attributes:
https://spacy.io/api/doc

In [9]:
# Prerequisites
import spacy
nlp = spacy.load('en_core_web_md')

In [10]:
# Create a 'Doc'

text = "I will become a doc!"

doc = nlp(text)
# a doc is a list of token objects

print(doc)

I will become a doc!


In [11]:
# Accessing tokens inside of Doc

# Iterate through doc tokens
index = 0
while index < len(doc):
    token_at_index = doc[index]
    print(f"index {index}: {token_at_index}")
    index += 1

index 0: I
index 1: will
index 2: become
index 3: a
index 4: doc
index 5: !


In [12]:
# Example for Visualizing a Doc

from spacy import displacy
displacy.render(doc, style="dep", jupyter=True)
# this visualization represents each word's dependencies (covered later)
# for more information, go up to the 'Visualization' cell

**2. The 'Token'**

A Token can be a single word, punctuation mark, or other linguistic unit that is stored within a 'Doc'. Each token has properties of various linguistic annotations including part-of-speech tags, named entity labels, and syntactic dependencies.
    
<u>pos tags:<u>
* Abbreviated representations of a part of speech based on the word's definition and context.

<u>named entity:<u> 
* A real-world object that is assigned a name. (NER is covered later)

<u>syntactic dependencies:<u> 
    
* The relations between individual words of a sentence.

All token methods and attributes:
https://spacy.io/api/token#vector

In [13]:
# Prerequisites
# Create Doc
import spacy 
nlp = spacy.load("en_core_web_md")
text = "These will become tokens"
doc = nlp(text)

In [14]:
# Accessing and interacting with tokens

# Indexing
first_token = doc[0]
print(f"first token: {first_token}")
# Iterating
for token in doc:
    print(token.text)

first token: These
These
will
become
tokens


In [15]:
# Accessing and interacting with tokens

# Token Attributes
"""
Many of a token's attributes are hashes
but can be retrieved as their string from
by placing an underscore at the end
"""

print(
f"""
Original text: {first_token.text}\n
Base form: {first_token.lemma_}\n
Part of speech tag: {first_token.pos_}\n
Detailed pos tag: {first_token.tag_}\n
Dependency label: {first_token.dep_}\n
Word shape: {first_token.shape_}\n
"""
)
    
# Token Methods 

# alpha characters are chars from the alphabet (non-numerical)
# stop words are words with little meaning (covered later)

print(
f"""
Is alpha character: {first_token.is_alpha}\n
Is stop word: {first_token.is_stop}
"""
)


Original text: These

Base form: these

Part of speech tag: PRON

Detailed pos tag: DT

Dependency label: nsubj

Word shape: Xxxxx



Is alpha character: True

Is stop word: True



**3. Vocabulary**

A storage class; a collection of all unique words or tokens used by spaCy in a particular language model. Vocabulary includes word vectors (covered later) and integer IDs associated with each word. Great for text processing and memory.
    
Vocabularies customization will not be covered

For all Vocab methods and attributes:
https://spacy.io/api/vocab

In [16]:
# Prerequisites 
import spacy 
nlp = spacy.load("en_core_web_md")

In [17]:
# Accessing the 'Vocabulary'

vocab = nlp.vocab


# Accessing Vocabulary Attributes

word_count = len(vocab)
# Returns the total count of unique words in the vocabulary
print(f"word count: {word_count}\n")


word = "apple"

# Check if the word is in the vocabulary
is_in_vocab = word in vocab
print(f"{word} is in vocab: {is_in_vocab}\n")


# Find an ID
word_id = vocab[word].orth
# words have IDs inside the vocabulary
# the ID can be used to find the word in the vocab
print(f"word ID: {word_id}\n")


# Check if word has a vector
has_vec = vocab.has_vector(word_id)
# Returns True or False reflecting if the provided word has a vector in the vocab
# NOTE: word and word_id are interchangeable when finding its attributes
print(f"{word} has vector: {has_vec}\n")


# Retrieve word vector
word_vector = vocab[word_id].vector
# Returns a pre-trained word vector (list of numbers) that represent the provided word's meaning
print(f"word vector for '{word}': {word_vector[0:8]}\n")

word count: 764

apple is in vocab: False

word ID: 8566208034543834098

apple has vector: True

word vector for 'apple': [-1.0084  -2.0308  -0.64185  2.6928   0.31771 -2.6662  -3.7372   5.4714 ]



In [18]:
# Iterating through a vocab

print("Vocab section:")
Vocab_section = list(nlp.vocab)[:10]
# Returns a list of lexemes in the vocab up to index 10

for lexeme in Vocab_section:
    word = lexeme.text
    # Returns the string version of the lexeme (covered later)
    print(word)

Vocab section:
nuthin
ü.
p.m
Kan
Mar
When's
 
Sept.
c.
Mont.


**4. The 'Lexeme'**

A basic unit of a word in spaCy's vocabulary. It includes information about the word's string representation, orthographic features, and linguistic attributes. Can be used to efficiently access word information from the vocabulary.
    
For all Lexeme methods and attributes:
https://spacy.io/api/lexeme

In [19]:
# Prerequisites
import spacy
nlp = spacy.load("en_core_web_md")

# Setup
text = "My word is Paper"
doc = nlp(text)
vocab = nlp.vocab

In [20]:
# Accessing a lexeme

# Way #1: use a doc
# Will be more accurate

print("Way #1:\n")

my_word = doc[len(doc)-1]
print(f"my word: {my_word}")

my_lexeme = my_word.lex
print(f"my lexeme: {my_lexeme}\n")


# Way #2: use a vocabulary

print("Way #2:")

word = "Paper"
lexeme = vocab[word]

print(f"""
my word: {word}
my lexeme: {lexeme}
"""
)

Way #1:

my word: Paper
my lexeme: <spacy.lexeme.Lexeme object at 0x7f0c7955f100>

Way #2:

my word: Paper
my lexeme: <spacy.lexeme.Lexeme object at 0x7f0b47f77e40>



In [21]:
# Accessing lexeme attributes

# Text

lexeme_text = my_lexeme.text
# Returns the text representation of the lexeme
print(f"lexeme word: {lexeme_text}")


# Orthographic features

is_lower = my_lexeme.is_lower
# Lowercase
is_upper = my_lexeme.is_upper
# Uppercase
is_title = my_lexeme.is_title
# Titlecase
is_stop = my_lexeme.is_stop
# Stop word 
# (little contextual meaning)
is_currency = my_lexeme.is_currency
# Currency of some kind
lang = my_lexeme.lang_
# Language of parent vocab
shape = my_lexeme.shape_
# Shape of word (case, length, etc.)
sentiment = my_lexeme.sentiment
# point value that represents emotional sentiment of word
# (covered later)


print(f"""
The lexeme is:
lower: {is_lower}
upper: {is_upper}
title: {is_title}
stop word: {is_stop}
currency: {is_currency}

Other Properties:
from language: {lang}
shape: {shape}
sentiment: {sentiment}
""")

lexeme word: Paper

The lexeme is:
lower: False
upper: False
title: True
stop word: False
currency: False

Other Properties:
from language: en
shape: Xxxxx
sentiment: 0.0



**5. the 'Span'**

A fragment of text which belongs to some category. It is stored as a contiguous sequence of tokens within a Doc; allowing you to work with a subset of tokens in a document.
    
For all span methods and attributes:
https://spacy.io/api/span

In [22]:
# Prerequisites
import spacy
nlp = spacy.load("en_core_web_md")

# Setup
text = "I am a doc, but here is a span."
doc = nlp(text)

for i, token in enumerate(doc):
    print(f"token #{i}: {token}")

token #0: I
token #1: am
token #2: a
token #3: doc
token #4: ,
token #5: but
token #6: here
token #7: is
token #8: a
token #9: span
token #10: .


In [23]:
# Create a span

# Select tokens in a doc
span = doc[6:10]
# Creates a span that includes tokens 5, 6, 7, 8, and 9
# You can access the attributes of each token

for i, token in enumerate(span, start=6):
    print(f"token #{i}: {token}")

token #6: here
token #7: is
token #8: a
token #9: span


In [24]:
# Create a span of a sentence

# Setup
text = "I don't want this sentence. This sentence is a span. Not this one."
doc = nlp(text)

sents = []
# iterate through the sentences in a doc
for sentence in doc.sents:
    # .sents returns an object
    sents.append(sentence)
    # add each sentence to a list

span = sents[1]
# select which sentence to become a span
print(span)

This sentence is a span.


<u>Span Use Cases:<u>

- Custom NER (covered later):
    - Can use a span to define your own named entities and use matching (covered later) to analyze text
- Text Processing (covered later)
    - Allows you to process and analyze a specific span of text
- Text Classification (covered later)
    - Can use spans to represent points of interest throughout a document 

In [28]:
# Span Method example

text = "I am a doc, but here is a span."
doc = nlp(text)
span = doc[6:10]


# Span index visualization

Span_string = ""
for token in span:
    Span_string += token.text + " "
print(f"Span:\n{Span_string}\n")

for i, char in enumerate(Span_string):
      print(f"Index {i}: {char}")
        

# Character span method

start_index = 8
end_index = 14
character_span = span.char_span(start_index, end_index)
# Returns a span of tokens from the specified slice from the original span

print(f"\nCharacter Span from {start_index}-{end_index}:\n{character_span}")

Span:
here is a span 

Index 0: h
Index 1: e
Index 2: r
Index 3: e
Index 4:  
Index 5: i
Index 6: s
Index 7:  
Index 8: a
Index 9:  
Index 10: s
Index 11: p
Index 12: a
Index 13: n
Index 14:  

Character Span from 8-14:
a span


**6. the 'Matcher'**

A data structure for rule-based pattern matching in text. Matchers allow you to define patterns of tokens or text entities to extract specific information from text data.
    
For all matcher methods and attributes:
https://spacy.io/api/matcher

In [29]:
# Prerequisites
import spacy
nlp = spacy.load("en_core_web_md")

# Matcher import
from spacy.matcher import Matcher

In [30]:
# Creating and Configuring a Matcher

# Create matcher object
matcher = Matcher(nlp.vocab)

# Define pattern
pattern = [{"LOWER":"example"}, {"POS":"NOUN"}]
# matches the word 'example' followed by a noun

# Add pattern to Matcher with a name
pattern_name = "pattern_V1"
matcher.add(pattern_name, [pattern])

In [33]:
# Using a matcher

text = "This is an example match"
doc = nlp(text)

# Match doc
matches = matcher(doc)
print(
f"""
Match Format:
(Match_ID, Start_index, End_index)

Matches:
{matches}
"""
)

# Extract info from matches
for match_id, start, end in matches:
    match = doc[start:end].text
    print(f"Matched text: '{match}'")


Match Format:
(Match_ID, Start_index, End_index)

Matches:
[(16905033022423864804, 3, 5)]

Matched text: 'example match'


In [34]:
# More in depth patterns

matcher = Matcher(nlp.vocab)

# Pattern to match positive descriptions in second person
pattern = [
    {"LOWER": {"IN": ["you", "your", "you're"]}, "POS": "PRON"},
    # Pattern must start with "You" and the part-of-speech must be a pronoun
    {"POS": {"IN": ["NOUN", "VERB", "ADJ", "AUX"]}, "OP": "*"},
    # The middle of the pattern
    # POS matches tokens with a part-of-speech of either VERB or ADJ
    # IN means the attribute (POS) value must be a member of the list provided
    # OP being * is an operator that allows 0 or more tokens to match this part
    {"TEXT": {"IN": ["good", "awesome", "great", "fantastic", "amazing"]}, "OP": "+"}
    # + OP requires 1 or more matches
]

matcher.add("positive_description", [pattern])

text = "You are awesome and your programs are fantastic!"
doc = nlp(text)
matches = matcher(doc)

for match_id, start_index, end_index in matches:
    match = doc[start_index:end_index]
    print(f"Match: '{match}'")


Match: 'You are awesome'
Match: 'your programs are fantastic'


<u>More on Patterns:<u>

Matchers can also use patterns from Regex such as:
[{"TEXT": {"REGEX": r"^\d{3}"}}]


Refer to the design document link provided above for more in-depth description including operators.


**SpaCy Key Concepts and Functions**
1. Tokenization
2. Part-of-Speech Tagging (POS)
3. Named Entity Recognition (NER)
4. Dependency Parsing
5. Lemmatization
6. Stop Words
7. Text Vectors
8. Rule-Based Matching
9. Custom Pipelines
10. Text Classification

**Setup:**

Please refer to the top of this notebook.

**1. Tokenization**

Tokenization is the process of splitting text into individual words, phrases, or symbols (tokens). Tokenization can be handled differently based on its approach to contractions and punctuation.

<u>Usage:<u> 
* The first step in NLP as it breaks down text into usable pieces for analysis.
    
<u>Refer to:<u>
* Token data structure

In [35]:
# Prerequisites
import spacy
nlp = spacy.load("en_core_web_md")
# natural language processor


# Process text
text = "These are all tokens. These are too."
doc = nlp(text)
# send text through a NLP pipeline to tokenize the text
print(f"Original Text: {text}\n")


# The tokens are now stored as a list of Token objects inside of the doc
for index, token in enumerate(doc):
    print(f"Index {index}: {token.text}")

Original Text: These are all tokens. These are too.

Index 0: These
Index 1: are
Index 2: all
Index 3: tokens
Index 4: .
Index 5: These
Index 6: are
Index 7: too
Index 8: .


**2. Part-of-Speech Tagging (POS)**

POS tagging assigns grammatical parts of speech to tokens. These POS tags represent the syntactic role of a token within its sentence.

<u>Usage:<u> 
* Used for understanding sentence structure and information retrieval.

<u>Tags:<u> 
    
For more tags including detailed tags, see: https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk

Tags are an abbreviated representation of a part of speech. When you tag a token, you are determining what part of speech that word is.

Here are some examples to help grasp the idea:

* NN: noun, common, singular or mass

    * Ex words:
        * common-carrier cabbage knuckle-duster Casino afghan shed thermostat
        investment slide humour falloff slick wind hyena override subhumanity
        machinist


* NNP: noun, proper, singular

    * Ex words:
        * Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
        Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
        Shannon A.K.C. Meltex Liverpool


* NNS: noun, common, plural

    * Ex words:
        * undergraduates scotches bric-a-brac products bodyguards facets coasts
        divestitures storehouses designs clubs fragrances averages
        subjectivists apprehensions muses factory-jobs

* VB: verb, base form

    * Ex words:
        * ask assemble assess assign assume atone attention avoid bake balkanize
        bank begin behold believe bend benefit bevel beware bless boil bomb
        boost brace break bring broil brush build


* VBD: verb, past tense

    * Ex words:
        * dipped pleaded swiped regummed soaked tidied convened halted registered
        cushioned exacted snubbed strode aimed adopted belied figgered
        speculated wore appreciated contemplated

In [36]:
# Prerequisites
import spacy
nlp = spacy.load("en_core_web_md")

# Process text
text = "These are tokens with tags."
doc = nlp(text)
print(f"Original Text: {text}\n")


# Accessing Part-of-Speech Tags
for token in doc:
    print(f"""
    Token: {token.text}
    POS tag: {token.pos_}
    Detailed POS tag: {token.tag_}""")
    # .pos_ returns the word form of .pos, which is a number
    # tag_ returns a more specific tag for the token in its context

Original Text: These are tokens with tags.


    Token: These
    POS tag: PRON
    Detailed POS tag: DT

    Token: are
    POS tag: AUX
    Detailed POS tag: VBP

    Token: tokens
    POS tag: NOUN
    Detailed POS tag: NNS

    Token: with
    POS tag: ADP
    Detailed POS tag: IN

    Token: tags
    POS tag: NOUN
    Detailed POS tag: NNS

    Token: .
    POS tag: PUNCT
    Detailed POS tag: .


**3. Named Entity Recognition (NER)**

NER identifies and categorizes named entities in text. These include names of people, organizations, dates, locations, and even money.

<u>Usage:<u>
* NER is used for information extraction, entity linking, and data mining.
    
<u>Entity Labels:<u>
* Each named entity is assigned an entity label that describes the type of entity it represents (e.g. PERSON, ORGANIZATION, DATE, GEO-POLITICAL ENTITY, etc.).

In [37]:
# Prerequisites
import spacy
nlp = spacy.load("en_core_web_md")

# Process text
text = "SpaCy version 1.0 was released on October 19, 2016 by its main developers; Mathew Honnibal and Ines Montani under their MIT License."
doc = nlp(text)
print(f"Original Text: {text}\n")


# Accessing Named Entities
for entity in doc.ents:
    # .ents returns a tuple of every token that is an entitiy
    print(f"Entity: {entity.text} \nLabel: {entity.label_}\n")
    # .label_ returns the word form of its entity label

    
# Visualization
from spacy.displacy import render

print("\nVisualization:")
render(doc, style="ent")
# Highlights each entity within a doc

Original Text: SpaCy version 1.0 was released on October 19, 2016 by its main developers; Mathew Honnibal and Ines Montani under their MIT License.

Entity: 1.0 
Label: CARDINAL

Entity: October 19, 2016 
Label: DATE

Entity: Mathew Honnibal 
Label: PERSON

Entity: Ines Montani 
Label: PERSON

Entity: MIT License 
Label: ORG


Visualization:


**4. Dependency Parsing**

Dependency parsing analyzes the grammatical structure of a sentence by identifying the relationships between words. It uses an arc-tree structure to represent dependencies. (which words refer to which words in a sentence)

<u>Usage:<u>
* Essential for understandning the relationship between words of a sentence. Helps identify subkects, predicates, and objects in a sentence and is used for translation and question-answering systems.

In [39]:
# Prerequisites
import spacy
nlp = spacy.load("en_core_web_md")

# Process text
text = "Brandon has a lot of time-consuming homework."
doc = nlp(text)
print(f"Original Text: {text}\n")


# Accessing dependency information
for token in doc:
    print(f"""
    Token: {token.text}
    Dependency Label: {token.dep_}
    Head Token: {token.head.text}""")
    # A head token is a word that modifies the meaning of another word

Original Text: Brandon has a lot of time-consuming homework.


    Token: Brandon
    Dependency Label: nsubj
    Head Token: has

    Token: has
    Dependency Label: ROOT
    Head Token: has

    Token: a
    Dependency Label: det
    Head Token: lot

    Token: lot
    Dependency Label: dobj
    Head Token: has

    Token: of
    Dependency Label: prep
    Head Token: lot

    Token: time
    Dependency Label: npadvmod
    Head Token: consuming

    Token: -
    Dependency Label: punct
    Head Token: consuming

    Token: consuming
    Dependency Label: amod
    Head Token: homework

    Token: homework
    Dependency Label: pobj
    Head Token: of

    Token: .
    Dependency Label: punct
    Head Token: has


In [40]:
# Extract noun phrases
# noun phrases are chunks of text that include:
    # a noun and the words describing that noun
print("\nNoun Phrases:")
for noun_phrase in doc.noun_chunks:
    # .noun_chunks returns an object
    print(noun_phrase.text)


Noun Phrases:
Brandon
a lot
time-consuming homework


In [41]:
# Visualization
from spacy.displacy import render

print("\nVisualization:")
render(doc, style="dep")
# prints an arc-tree representing each word's dependency


Visualization:


**5. Lemmatization**

Lemmatization reduces words to their base or dictionary form, helping standardize words.

<u>Usage:<u>
* Normalizing text makes it easier to compare and analyze. This helps in information retrieval and text classification.

In [42]:
# Prerequisites
import spacy
nlp = spacy.load("en_core_web_md")

# Process text
text = "Programming is a diverse field"
doc = nlp(text)
print(f"Original Text: {text}\n")


# find lemmas (Lemmatize)
print("Lemmatization:")
for token in doc:
    print(f"""
    Token: {token.text}
    Lemma: {token.lemma_}""")

Original Text: Programming is a diverse field

Lemmatization:

    Token: Programming
    Lemma: programming

    Token: is
    Lemma: be

    Token: a
    Lemma: a

    Token: diverse
    Lemma: diverse

    Token: field
    Lemma: field


**6. Stop Words**

Stop words are common words with little contextual meaning. They are considered noise words.

<u>Usage:<u>
* Removing stop words improves efficiency and accuracy when classifying text or analyzing sentiment.

In [43]:
# Prerequisites
import spacy
nlp = spacy.load("en_core_web_md")

# Process text
text = "The bunny ran over the turtle in spite."
doc = nlp(text)
print(f"Original Text: {text}\n")


# Find stop words
for token in doc:
    if token.is_stop:
        # .is_stop method returns True if the token is a stop word
        print(f"Stop word: {token.text}")
        
        
# Recreate text without stop words
filtered_text = " ".join(token.text for token in doc if not token.is_stop)
# Concatenates every token's text within a doc that is not a stop word with a space in between
print(f"\nFlitered text: {filtered_text}")

Original Text: The bunny ran over the turtle in spite.

Stop word: The
Stop word: over
Stop word: the
Stop word: in

Flitered text: bunny ran turtle spite .


**7. Text Vectors**

Text vectors represent words, phrases, or documents as numerical vectors in multi-dimensional space. These vectors capture semantic information, allowing for comparisons between text elements.

<u>Usage:<u>
* Finsing document similarity, clustering, and recommendation systems. Vectors enable computation of similarities between texts based on their content.

In [44]:
# Prerequisites
import spacy
nlp = spacy.load("en_core_web_md")

# Process text
text = "I ate an apple."
doc = nlp(text)
print(f"Original Text: {text}\n")


# Find word vectors
print("Word Vectors:")
for token in doc:
    print(f"""
    Token: {token.text}
    Vector shape: {token.vector.shape[0]} value long
    First 4 vector values: 
    {token.vector[:4]}""")
# .vector returns a long list of vector values
# .vector.shape returns a tuple of the dimensions of the vector


# Find vector of entire doc
print("\nDocument info:")
print(f"""
    Document Vector Shape: {doc.vector.shape[0]} values long
    First 4 vector values:
    {token.vector[:4]}
    """)


Original Text: I ate an apple.

Word Vectors:

    Token: I
    Vector shape: 300 value long
    First 4 vector values: 
    [-1.8607   0.15804 -4.1425  -8.6359 ]

    Token: ate
    Vector shape: 300 value long
    First 4 vector values: 
    [ 3.3417  -8.3958   0.35492  1.8888 ]

    Token: an
    Vector shape: 300 value long
    First 4 vector values: 
    [11.492   2.9806 15.917  -1.1007]

    Token: apple
    Vector shape: 300 value long
    First 4 vector values: 
    [-1.0084  -2.0308  -0.64185  2.6928 ]

    Token: .
    Vector shape: 300 value long
    First 4 vector values: 
    [-0.076454 -4.6896   -4.0431   -3.4333  ]

Document info:

    Document Vector Shape: 300 values long
    First 4 vector values:
    [-0.076454 -4.6896   -4.0431   -3.4333  ]
    


In [48]:
# Find similarity between tokens or documents

# Create doc to compare with
comparison_text = "I ate a banana today"
comparison_doc = nlp(comparison_text)

# Find similarity
similarity = doc.similarity(comparison_doc)
# Returns a 
print(f"Similarity between the two documents: \n{similarity}")

if similarity == 1:
    phrase = "(Exact Same)"
elif similarity >= .75:
    phrase = "(Very Similar)"
elif similarity >= .5:
    phrase = "(Similar)"
elif similarity >= .25:
    phrase = "(Somewhat Similar)"
else:
    phrase = "(Note Similar)"

print(phrase)

Similarity between the two documents: 
0.7007695012580235
(Similar)


**8. Rule-Based Matching**

Rule-based matching allows for defined patterns of tokens using specified rules and extract information based on these patterns. SpaCy has an implemented rule-based matcher.

<u>Usage:<u>
* Used for extracting specific information or entities from text, even when they don't fit standard NER categories. Allows for the creation of custom entities.

In [49]:
# Prerequisites
import spacy
nlp = spacy.load("en_core_web_md")

In [50]:
# Define a matcher
from spacy.matcher import Matcher

# Matcher object
matcher = Matcher(nlp.vocab)

# Deine a pattern (explained in more detail later)
pattern = [{"LOWER": "example"}]

# Add pattern to matcher
matcher.add("ExamplePattern", [pattern])

In [51]:
# Process text and apply matcher

text = "Here is example 1 and example 2."
doc = nlp(text)

matches = matcher(doc)

In [54]:
# Access matches

print(f"""
Matches are tuples in this format:
(Match_ID, Start_index, End index)

For Example:
{matches[0]}
""")

print("Matches:")

for match_id, start, end in matches:
    match_text = doc[start:end].text
    print(f"""
    Matched ID: {match_id}
    Start and End Index: ({start}, {end})
    Match Text: {match_text}
    """)


Matches are tuples in this format:
(Match_ID, Start_index, End index)

For Example:
(583384548970338471, 2, 3)

Matches:

    Matched ID: 583384548970338471
    Start and End Index: (2, 3)
    Match Text: example
    

    Matched ID: 583384548970338471
    Start and End Index: (5, 6)
    Match Text: example
    


In [55]:
# Removing patterns

matcher.remove("ExamplePattern")

<u>Defining Patterns<u>
* Patterns are defined using dictionaries that specify the criteria for matching.

<u>Common Keys<u>
* "LOWER": Matches the lowercase form of the token's text.
* "TEXT": Matches the exact text of the token.
* "LEMMA": Matches the lemma of the token.
* "POS": Matches the part-of-speech tag of the token.
* "TAG": Matches the fine-grained part-of-speech tag of the token.
* "ORTH": Matches the exact text, including case.
* "ENT_TYPE": Matches the entity type of the token (for named entity recognition).
* "IS_ALPHA": Matches if the token is an alphabetic character.
* "IS_DIGIT": Matches if the token is a digit.
    
<u>Modifer Keys<u>
* "IN": 
    * Matches a token's text against a list of possible values
    * Means the attribute (key) before it must be a member of the list provided
* "OP"
    * Specifies the repitition of a token in the pattern
    * Allows you to define the recurrance of a matching token using these operators:
        * "*" (Zero or more)
        * "+" (One or more)
        * "?" (Zero or one)
        * "{n}" (Exactly n times)
        * "{n,}" (n or more times)
        * "{n,m}" (Between n and m times)
    
<u>Using Regex<u>
* Regex can be used with the "REGEX" key to match the "TEXT" key. For example:
    * [{"TEXT": {"REGEX": r"^\d{3}"}}]
    * Will match any text that matches that regex (3 digits)

In [56]:
# Here is an example from earlier
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

# Pattern to match positive descriptions in second person
pattern = [
    {"LOWER": {"IN": ["you", "your", "you're"]}, "POS": "PRON"},
    # Pattern must start with "You" and the part-of-speech must be a pronoun
    {"POS": {"IN": ["NOUN", "VERB", "ADJ", "AUX"]}, "OP": "*"},
    # The middle of the pattern
    # POS matches tokens with a part-of-speech of either VERB or ADJ
    # IN means the attribute (POS) value must be a member of the list provided
    # OP being * is an operator that allows 0 or more tokens to match this part
    {"TEXT": {"IN": ["good", "awesome", "great", "fantastic", "amazing"]}, "OP": "+"}
    # + OP requires 1 or more matches
]

matcher.add("positive_description", [pattern])

text = "You are awesome and your programs are fantastic!"
doc = nlp(text)
matches = matcher(doc)

for match_id, start_index, end_index in matches:
    match = doc[start_index:end_index]
    print(f"Match: '{match}'")

Match: 'You are awesome'
Match: 'your programs are fantastic'


**9. Custom Pipelines**

Custom pipelines allow you to create processing steps tailored to your specific NLP tasks. You can add custom components or functions to process text.

<u>Usage:<u>
* Handy when you need a unique preprocessing or analysis requirement that aren't covered in the default spaCy pipeline.

In [57]:
# Creates a blank english model
import spacy

nlp = spacy.blank("en")
# a blank model has no pre-defined components

In [58]:
# Define custom pipeline components

# Add spacy decorator to define a custom component with a specified name

@spacy.Language.component("reverse_doc")
def custom_component1(doc):
    # Define custom process (reverse token order)
    reversed_tokens = list(reversed(doc))
    reversed_text = ' '.join(token.text for token in reversed_tokens)
    
    # Return a doc
    reversed_doc = nlp.make_doc(reversed_text)
    
    return reversed_doc


# Defining custom attributes inside a component 

@spacy.Language.component("token_counter")
def custom_component2(doc):
    # Define custom process (count reversed tokens)
    # Define custom attribute
    
    if (len(doc) % 2) == 0:
        doc._.reversed_tokens = len(doc)
    else:
        doc._.reversed_tokens = len(doc) - 1
    # custom doc attributes start with 'doc._.'
    
    return doc

In [59]:
# Add custom components to the pipeline

nlp.add_pipe("reverse_doc", last=True)
# Adds first component to the end of the default pipeline
nlp.add_pipe("token_counter", last=True)
# component1 was added first, so it will perform before component2

# Register custom attribute
from spacy.tokens import Doc

Doc.set_extension("reversed_tokens", default=None, force=True)
# Doc object is given a new attribute named 'reversed_tokens'
# default value = None
# force = True allows the extension to be overwritten


# Adding duplicate components will result in an error

In [60]:
# Process text with the custom pipeline

text = "This is an example text."
doc = nlp(text)
# The doc object will represent the text after being processed
# by the custom components

print(f"Original text: {text}\n")

print("Post-processing: ")
for i, token in enumerate(doc):
    print(f"Index {i}: {token}")
    
print(f"\nTokens reverse: {doc._.reversed_tokens}")

Original text: This is an example text.

Post-processing: 
Index 0: .
Index 1: text
Index 2: example
Index 3: an
Index 4: is
Index 5: This

Tokens reverse: 6


In [61]:
# Removing custom components
nlp.remove_pipe("reverse_doc")
nlp.remove_pipe("token_counter")

('token_counter', <function __main__.custom_component2(doc)>)

**10. Text Classification**

Text classification assigns predefined categories or labels to text documents.

<u>Usage:<u>
* Valuable for automating tasks such as sorting documents or emails into folders, analyzing review, or categorizing articles.
    
<u>Training Info:<u>
* The Example object is used: https://spacy.io/api/example

In [66]:
# Prerequisites
import spacy
nlp = spacy.blank("en")
# You can also start with a pretrained model as well (e.g. "en_core_web_md")

In [67]:
# Defining 

# Define text classification model

textcat = nlp.add_pipe("textcat")
# Adding a copy results in error (remove copy and try again)
# text categorizer requires the name 'textcat'
# 'textcat' is a predefined spacy factory (class) component 


# Define labels
# These are categorizers that the model can predict
textcat.add_label("English")
textcat.add_label("Code")


# Define training data

training_data = [
    ("I've had a great day today.", {"cats": {"English": 1, "Code": 0}}),
    ("for data in data_dict:", {"cats": {"English": 0, "Code": 1}}),
    ("var = (x+1)", {"cats": {"English": 0, "Code": 1}}),
    ("nlp = spacy.load('en_core_web_md')", {"cats": {"English": 0, "Code": 1}}),
    ("Watch out! There's a tumbling duck!", {"cats": {"English": 1, "Code": 0}}),
    ("Please read the document.", {"cats": {"English": 0, "Code": 1}})
]
# All data samples should be a tuple with a string and a dictionary 
# Dictionaries represent the labels/categories by having the "cats" key
# Categories are predefined based off of the training string

In [68]:
# Train text classification mode
from spacy.training.example import Example
import random


n_training = 10
# number of training iterations

# Define optimizer used for training
optimizer = nlp.initialize()
# An optimizer is a component that can adjust the parameters of a model
# (e.g. textcat) during training to minimize loss in accuracy



for i in range(n_training):
    # Shuffle training data
    random.shuffle(training_data)
    examples = []
    
    # Make a collection of examples
    for text, annotations in training_data:
        doc = nlp.make_doc(text)
        # Another way to make a doc
        example = Example.from_dict(doc, annotations)
        # Example object encases the doc and its annotation in a format suitable 
        # for training models
        examples.append(example)
        # Creates example from dictionary using Example object's method 'from_dict'
        
    # Update model with training data 
    losses = textcat.update(examples, sgd=optimizer, drop=0.5)
    # 'update' method takes a list of Example objects (or 1) and updates the model
    # optimizer reduces loss
    # drop is a proportion of neurons in a neural network being unused to increase generalization

In [72]:
# Classify text

test_text = "doc = nlp(text_rep)"
doc = nlp(test_text)

prediction = doc.cats
# gets the predicted labels (categories) and their probabilities

print(f"Label Predictions: \n{prediction}")

Code = prediction['Code']
English = prediction['English']

if Code > English:
    phrase = "(More code than english)"
elif Code == Enlish:
    phrase = "(Equal parts code to english)"
else:
    phrase = "(More english than code)"
    
print(phrase)

Label Predictions: 
{'English': 0.4103195369243622, 'Code': 0.5896804928779602}
(More code than english)


In [73]:
# Removing a text classifier
nlp.remove_pipe("textcat")

('textcat', <spacy.pipeline.textcat.TextCategorizer at 0x7f0b5af20340>)