<a href="https://colab.research.google.com/github/AbdulRauf96/NLP/blob/main/Custom_Spacy_Tokenizers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color = 'pickle'>**Advanced spaCy**</font>
    
In this notebook, we will learn some advanced features of spaCy:

1. Custom Tokenizer
2. Rule-Based Matching
3. Custom Extensions



# <font color = 'pickle'>**Set Path for Data**

In [1]:
# Use for normal projects
from pathlib import Path
if 'google.colab' in str(get_ipython()):
  from google.colab import drive
  drive.mount('/content/drive') 
  %pip install swifter -qq
  %pip install -U spacy -qq
  base_folder = Path('/content/drive/MyDrive/colab_notebooks/')
  subject = 'nlp'
  data = base_folder/subject/'data/'
  archive = base_folder/subject/'archive/'
  output = base_folder/subject/'output'
else:
  base_folder = Path('C:/Users/Abdul Rauf Maroof/OneDrive/Documents/MSBA')
  subject = 'nlp'
  data = base_folder/subject/'data/'
  archive = base_folder/subject/'archive/'
  output = base_folder/subject/'output'

Mounted at /content/drive
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
en-core-web-sm 3.4.1 requires spacy<3.5.0,>=3.4.0, but you have spacy 3.5.0 which is incompatible.

# <font color = 'pickle'>**Install/Import Libraries**

In [2]:
# Importing required libraries
# re is used for regular expressions; spacy provides the NLP pipeline
import re
import spacy



In [3]:
spacy.__version__

'3.5.0'

# <font color = 'pickle'>**Load Spacy Model**

In [9]:
!python -m spacy download en_core_web_sm -qq

✔ Download and installation successful
You can now load the packa

In [5]:
# Loading the 'en_core_web_sm' language model from the spaCy library
nlp = spacy.load('en_core_web_sm')

# Selecting pipes to disable for the loaded language model
# Here, the pipes for token-to-vector, part-of-speech tagging, dependency parsing, attribute ruler, 
# lemmatization and named entity recognition are being disabled.
disabled = nlp.select_pipes(disable= ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'])

In [6]:
# Check the disabled pipes
disabled

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [7]:
sample_text = " #Reg #Ex @abc@xyz.com! prefixes  stop-words wow!"

In [8]:
# Creating a spaCy document from the sample text
doc = nlp(sample_text)

# Printing the text of each token in the document
print([token.text for token in doc])

[' ', '#', 'Reg', '#', 'Ex', '@abc@xyz.com', '!', 'prefixes', ' ', 'stop', '-', 'words', 'wow', '!']


# <font color = 'pickle'>**Custom Tokenizer in spaCy**

## <font color = 'pickle'>**Modify Prefixes**

In [20]:
# Modify the prefix characters used by the spaCy tokenizer.
# Suppose we want to keep a hash (#) attached to the word that follows it.
# spaCy treats '#' as a prefix, so it splits it off into its own token.

# Copy the default prefixes of the loaded language so the class-level defaults are not mutated
prefixes = list(nlp.Defaults.prefixes)
prefixes[20:30]

['\\{', '\\}', '<', '>', '_', '\\*', '&', '。', '？', '！']

### <font color = 'pickle'>**Remove Prefixes**

In [None]:
# Removing the '#' symbol from the prefixes list
prefixes.remove(r'#')

# Compiling a regular expression pattern for the remaining prefixes
prefix_regex = spacy.util.compile_prefix_regex(prefixes)

# Assigning the compiled prefix regular expression to the spaCy tokenizer's prefix search method
nlp.tokenizer.prefix_search = prefix_regex.search

Code Explanation:

This code modifies the prefixes used by the spaCy tokenizer. 
- The first line removes the `#` symbol from the prefixes list. 
- Then, a regular expression pattern is compiled from the updated prefixes list using the `spacy.util.compile_prefix_regex` function. 
- Finally, the `search` method of the compiled regular expression is assigned to the tokenizer's `prefix_search` attribute, which the tokenizer calls to identify prefixes during tokenization. 

By updating `prefix_search`, this code changes the behavior of the spaCy tokenizer so that it no longer treats the `#` symbol as a prefix.
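
To see why removing `#` from the list changes the result, it helps to picture what a compiled prefix regex does. The sketch below is a plain `re` approximation, not spaCy's implementation: it assumes that each prefix entry is anchored at the start of the string and that the entries are joined with `|`. The helper name `compile_prefix_sketch` and the shortened prefix list are made up for illustration.

```python
import re

def compile_prefix_sketch(entries):
    """Join prefix patterns into one regex, each anchored at the start of a string."""
    return re.compile("|".join("^" + piece for piece in entries if piece.strip()))

# Hypothetical, shortened prefix list (spaCy's real list is much longer)
toy_prefixes = [r"\(", r"\[", "#", r"\$"]

with_hash = compile_prefix_sketch(toy_prefixes)
without_hash = compile_prefix_sketch([p for p in toy_prefixes if p != "#"])

# With '#' in the list, the search finds a prefix that a tokenizer could split off
print(with_hash.search("#Reg"))
# Without '#', nothing is found at the start, so '#Reg' would stay in one piece
print(without_hash.search("#Reg"))
```

The same idea applies to suffixes, except that the patterns are anchored at the end of the string.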

### <font color = 'pickle'>**Add Prefixes**

In [22]:
# create doc
doc = nlp(sample_text)
# print tokens
print([token.text for token in doc])

[' ', '#Reg', '#Ex', '@abc@xyz.com', '!', 'prefixes', ' ', 'stop', '-', 'words', 'wow', '!']


In [23]:
# Add prefix character to split from text
prefixes.append(r'@')

# Compiling a regular expression pattern for the modified prefixes
prefix_regex = spacy.util.compile_prefix_regex(prefixes)

# Assigning the compiled prefix regular expression to the spaCy tokenizer's prefix search method
nlp.tokenizer.prefix_search = prefix_regex.search

doc = nlp(sample_text)
print([token.text for token in doc])

[' ', '#Reg', '#Ex', '@', 'abc@xyz.com', '!', 'prefixes', ' ', 'stop', '-', 'words', 'wow', '!']


## <font color = 'pickle'>**Modify Suffixes**

In [26]:
# Copy the default suffixes so the class-level defaults are not mutated
suffixes = list(nlp.Defaults.suffixes)
suffixes[20:30]

['&', '。', '？', '！', '，', '、', '；', '：', '～', '·']

In [None]:
# Remove '!' from the suffixes so that it is no longer split from the preceding text
suffixes.remove(r'\!')

# Compiling a regular expression pattern for the modified suffixes
suffix_regex = spacy.util.compile_suffix_regex(suffixes)

# Assigning the compiled suffix regular expression to the spaCy tokenizer's suffix search attribute
nlp.tokenizer.suffix_search = suffix_regex.search

doc = nlp(sample_text)
print([token.text for token in doc])

## <font color = 'pickle'>**Modify infixes**

In [28]:
# Copy the default infixes from nlp.Defaults into a new list
infixes = list(nlp.Defaults.infixes)
infixes[0:3]

['\\.\\.+',
 '…',
 '[\\u00A6\\u00A9\\u00AE\\u00B0\\u0482\\u058D\\u058E\\u060E\\u060F\\u06DE\\u06E9\\u06FD\\u06FE\\u07F6\\u09FA\\u0B70\\u0BF3-\\u0BF8\\u0BFA\\u0C7F\\u0D4F\\u0D79\\u0F01-\\u0F03\\u0F13\\u0F15-\\u0F17\\u0F1A-\\u0F1F\\u0F34\\u0F36\\u0F38\\u0FBE-\\u0FC5\\u0FC7-\\u0FCC\\u0FCE\\u0FCF\\u0FD5-\\u0FD8\\u109E\\u109F\\u1390-\\u1399\\u1940\\u19DE-\\u19FF\\u1B61-\\u1B6A\\u1B74-\\u1B7C\\u2100\\u2101\\u2103-\\u2106\\u2108\\u2109\\u2114\\u2116\\u2117\\u211E-\\u2123\\u2125\\u2127\\u2129\\u212E\\u213A\\u213B\\u214A\\u214C\\u214D\\u214F\\u218A\\u218B\\u2195-\\u2199\\u219C-\\u219F\\u21A1\\u21A2\\u21A4\\u21A5\\u21A7-\\u21AD\\u21AF-\\u21CD\\u21D0\\u21D1\\u21D3\\u21D5-\\u21F3\\u2300-\\u2307\\u230C-\\u231F\\u2322-\\u2328\\u232B-\\u237B\\u237D-\\u239A\\u23B4-\\u23DB\\u23E2-\\u2426\\u2440-\\u244A\\u249C-\\u24E9\\u2500-\\u25B6\\u25B8-\\u25C0\\u25C2-\\u25F7\\u2600-\\u266E\\u2670-\\u2767\\u2794-\\u27BF\\u2800-\\u28FF\\u2B00-\\u2B2F\\u2B45\\u2B46\\u2B4D-\\u2B73\\u2B76-\\u2B95\\u2B98-\\u2BC8\\u2BCA-\\

In [29]:
# Create a new list of infixes without elements containing the string '-'
infixes = [x for x in infixes if r'-' not in x]

# Compiling a regular expression pattern for the modified infixes
infix_regex = spacy.util.compile_infix_regex(infixes)

# Assigning the compiled infix regular expression's finditer to the tokenizer's infix_finditer attribute
nlp.tokenizer.infix_finditer = infix_regex.finditer

doc = nlp(sample_text)
print([token.text for token in doc])

[' ', '#Reg', '#Ex', '@', 'abc@xyz.com!', 'prefixes', ' ', 'stop-words', 'wow!']


You can't use the `.remove()` method here because no element of the infixes list is exactly the string `'-'`. The hyphen only appears inside several longer regex patterns, so `infixes.remove('-')` would raise a `ValueError`.

The list comprehension `[x for x in infixes if r'-' not in x]` instead builds a new list that keeps only the patterns that do not contain `'-'` anywhere. Note that this is a broad filter: it also drops patterns that use `-` only to express a character range, such as the long symbol class shown in the output above.
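
The difference between the two approaches can be checked on a small stand-in list. This is only a sketch: `toy_infixes` is a hypothetical, shortened list written in the style of infix patterns, not spaCy's actual defaults.

```python
# Hypothetical, shortened stand-in for the infix list: no element equals '-'
toy_infixes = [
    r"\.\.+",
    r"(?<=[0-9])[+\-\*^](?=[0-9-])",
    r"(?<=[a-z])(?:-|--)(?=[a-z])",
]

try:
    toy_infixes.remove("-")
    removed = True
except ValueError:
    # list.remove needs an element equal to '-', and there is none
    removed = False

# Substring filter: keeps only the patterns with no '-' anywhere in them
kept = [x for x in toy_infixes if "-" not in x]

print(removed)
print(kept)
```

Only the first pattern survives the filter, which shows both why it works and how broad it is.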

## <font color = 'pickle'>**Adding special case tokenization rules**

In [31]:
doc = nlp("gimme that")  
print([w.text for w in doc])  

['gimme', 'that']


In [33]:
# Importing the ORTH symbol from the spaCy symbols module
from spacy.symbols import ORTH

`ORTH` is a constant from the `spacy.symbols` module. It identifies the token attribute that holds the verbatim text of a token (the value you see in `token.text`), without trailing whitespace.

In [34]:
# Define a special case for the text "gimme"
special_case = [{ORTH: "gim"}, {ORTH: "me"}]

# Add the special case to the tokenizer of the NLP object
nlp.tokenizer.add_special_case("gimme", special_case)

# Tokenize the text "gimme that" and print the resulting tokens
print([w.text for w in nlp("gimme that")]) 

['gim', 'me', 'that']


`nlp.tokenizer.add_special_case` is a method of the tokenizer attached to the `nlp` object. It lets you add a special-case tokenization rule for a specific string. The method takes two arguments:

- The first argument is the text you want to add a special case for.
- The second argument is the special case tokenization you want to apply to the text. This is defined as a list of dictionaries, where each dictionary represents a token and maps the ORTH key to the text of the token.
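
One constraint to keep in mind: to our understanding, a special case may split the text but may not change it, so the `ORTH` values have to concatenate back to the original string. The helper below is a hypothetical pure-Python check of that condition. It is not part of spaCy, and it uses the plain string key `"ORTH"` in place of `spacy.symbols.ORTH`.

```python
def preserves_text(text, pieces):
    """Return True if the pieces' ORTH values join back into the original text."""
    return "".join(piece["ORTH"] for piece in pieces) == text

# Splitting 'gimme' into 'gim' + 'me' keeps every character
print(preserves_text("gimme", [{"ORTH": "gim"}, {"ORTH": "me"}]))

# Rewriting it as 'give' + 'me' changes the text, so the check fails
print(preserves_text("gimme", [{"ORTH": "give"}, {"ORTH": "me"}]))
```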

# <font color = 'pickle'>**Rule-based matching using spaCy**


In [35]:
# Import the Matcher class from spaCy's 'spacy.matcher' module
from spacy.matcher import Matcher


Matcher objects let us match sequences of tokens based on pattern rules. This is used as an alternative to regex pattern matching.

- Compared to regular expressions, the matcher works with Doc and Token objects instead of only strings.

- When we use Matcher object on Tokens, we can use word level features of spaCy such as LOWER, LENGTH, LEMMA, SHAPE and flags such as IS_PUNCT, IS_DIGIT, LIKE_URL, etc. 

- We can also use part of speech tags and named entities in patterns e.g., find the word `"cloud"` only if it's a `verb`, not a `noun`.
 


In [36]:
text = """New version of operation system is iOS 11. It is better than iOS 9 and iOS 9. 
The new version of iPhone X seems cool. The video of iphone x released. I liked iOS 9 but I like iOS 11 more.
You may not like my like. Contact us : xyz@gmail.com., abc@utdallas.edu"""

## <font color = 'pickle'>**1. Matching Exact Tokens**

In [37]:
# Example 1: Matching Exact Text

# When initiating Matcher we need to specify vocab
# Instantiate Matcher object using nlp.vocab

matcher = Matcher(nlp.vocab)
doc = nlp(text)

# Match Exact Tokens : match TEXT iOS
pattern1 = [{"TEXT":"iOS"}]

# Match sequence of texts : iPhone followed by X
pattern2 = [{"TEXT": "iPhone"},{"TEXT": "X"}]

# matcher.add() method to add patterns to matcher
matcher.add("TextOnly",[pattern1, pattern2])

# When we call the matcher on a doc, it returns a list of tuples.
# Each tuple consists of three values: the match ID, the start index and the end index of the matched span.
matches = matcher(doc)
matches

[(9385982399280393077, 6, 7),
 (9385982399280393077, 13, 14),
 (9385982399280393077, 16, 17),
 (9385982399280393077, 24, 26),
 (9385982399280393077, 38, 39),
 (9385982399280393077, 43, 44)]

In [38]:
# We can access a span from doc using slicing (similar to arrays in numpy)
print([doc[start:end].text for match_id, start, end in matches])

['iOS', 'iOS', 'iOS', 'iPhone X', 'iOS', 'iOS']


## <font color = 'pickle'>**2. Matching Attribute (LOWER, IS_DIGIT)**

In [39]:
# Match Exact tokens and attributes
# List of available attributes that can be used with matcher : https://spacy.io/usage/rule-based-matching

matcher = Matcher(nlp.vocab)
doc = nlp(text)

# pattern 1 : text iOS followed by digit
pattern1 = [{"TEXT":"iOS"}, {"IS_DIGIT":True}]

# pattern 2 : iphone followed by x (irrespective of case (lower/upper) for both iphone and x)
pattern2 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# matcher.add() method to add patterns to matcher
matcher.add("TextAndLower",[pattern1, pattern2])

# Get the matches from the Matcher for the Doc object
matches = matcher(doc)
print([doc[start:end].text for match_id, start, end in matches])

['iOS 11', 'iOS 9', 'iOS 9', 'iPhone X', 'iphone x', 'iOS 9', 'iOS 11']


<font color = 'dodgerblue'> Note: In pattern 2 above, `{"LOWER": "iphone"}, {"LOWER": "x"}` matches two consecutive tokens whose lowercase forms are `iphone` and `x`. It therefore also matches `Iphone X` or even `IPHONE X`.

## <font color = 'pickle'>**3. Matching Attribute (IS_LOWER)**

In [40]:
# Match Exact tokens and attributes

matcher = Matcher(nlp.vocab)
doc = nlp(text)

# pattern 1 : text iOS followed by digit
pattern1 = [{"TEXT":"iOS"}, {"IS_DIGIT":True}]

# pattern 2 : "iphone" written in lowercase, followed by x in any case
pattern2 = [{"TEXT": "iphone", "IS_LOWER": True}, {"LOWER": "x"}]
             
# matcher.add() method to add patterns to matcher
matcher.add("TextAndIsLower",[pattern1, pattern2])

# Get the matches from the Matcher for the Doc object
matches = matcher(doc)
print([doc[start:end].text for match_id, start, end in matches])

['iOS 11', 'iOS 9', 'iOS 9', 'iphone x', 'iOS 9', 'iOS 11']


<font color = 'dodgerblue'>In pattern 2 above, we match iphone only when it is written entirely in lowercase.

## <font color = 'pickle'>**4. Matching Attribute (LEMMA)**

In [45]:
# Matching other attributes
disabled.restore()

print(f'pipe names: {nlp.pipe_names}')
print()

matcher = Matcher(nlp.vocab)
doc = nlp(text)

# write a pattern to match word whose lemma is 'like'
pattern = [{"LEMMA": "like"}]

# matcher.add() method to add patterns to matcher
matcher.add("Lemma",[pattern],greedy='LONGEST')

# Get the matches from the Matcher for the Doc object
matches = matcher(doc)
print(matches)
print([doc[start:end].text for match_id, start, end in matches])

pipe names: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

[(12849222793144466734, 37, 38), (12849222793144466734, 42, 43), (12849222793144466734, 51, 52), (12849222793144466734, 53, 54)]
['liked', 'like', 'like', 'like']


In [46]:
[token.pos_ for token in doc if token.lemma_ =='like']

['VERB', 'VERB', 'VERB', 'INTJ']

## <font color = 'pickle'>**5. Matching Attribute (LENGTH)**

In [47]:
# Create a spaCy Doc object
doc = nlp("I see you are doing a good job.")

# Create a Matcher object using spaCy's vocab
matcher = Matcher(nlp.vocab)

# Define a pattern to match tokens with a length of 3
pattern = [{"LENGTH": 3}]

# Add the pattern to the Matcher
matcher.add("Length", [pattern])

# Get the matches from the Matcher for the Doc object
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Print the matching text from the Doc
    print(doc[start:end].text) 

see
you
are
job


## <font color = 'pickle'>**6. Using POS tags in matcher**

### <font color = 'pickle'>**Example 1**

In [48]:
disabled.restore()
matcher = Matcher(nlp.vocab)
doc = nlp(text)

# write a pattern to match word whose lemma is like and pos tag is VERB
pattern = [{"LEMMA": "like", "POS": "VERB"}]

# matcher.add() method to add patterns to matcher
matcher.add("Pos",[pattern])

# Get the matches from the Matcher for the Doc object
matches = matcher(doc)

print(matches)
print([doc[start:end].text for match_id, start, end in matches])

[(12506337956553590349, 37, 38), (12506337956553590349, 42, 43), (12506337956553590349, 51, 52)]
['liked', 'like', 'like']


### <font color = 'pickle'>**Example 2**

In [49]:
matcher = Matcher(nlp.vocab)
doc = nlp(text)

# pattern to match word whose lemma is like and pos tag is VERB. 
# This word should be followed by a word whose pos tag is Noun
pattern = [{"LEMMA": "like", "POS": "VERB"}, {"POS": "NOUN"}]

# matcher.add() method to add patterns to matcher
matcher.add("LemmaPos",[pattern])

# Get the matches from the Matcher for the Doc object
matches = matcher(doc)

print([doc[start:end].text for match_id, start, end in matches])

['like iOS']


## <font color = 'pickle'>**7. Use Quantifiers**

In [50]:
# Create a spaCy Doc object
doc = nlp("I am reading a new book on NLP. I read an excellent Deep Learning book last week")

# Create a Matcher object using spaCy's vocab
matcher = Matcher(nlp.vocab)

# Define a pattern to match the lemma "read" followed by an optional determiner and an adjective
pattern = [
    {"LEMMA": "read"},
    {"POS": "DET", "OP": "?"},  # optional: match 0 or 1 times
    {"POS": "ADJ"}
]
# Add the pattern to the Matcher
matcher.add("Quantifier", [pattern])

# Iterate over the matches for the Doc object
for match_id, start, end in matcher(doc):
    # Print the matching text from the Doc
    print(doc[start:end].text)

reading a new
read an excellent


## <font color = 'pickle'>**8. Using Regular Expressions in Matcher**

### <font color = 'pickle'>**Example 1**

In [51]:
# Extracting Email Addresses using Regular Expressions and spaCy's Matcher

# The text we want to extract email addresses from
text = 'You can contact me at @twitter, xyz@utdallas.edu, abx@gmail.com'

# Create a spaCy Doc object from the text
doc = nlp(text)

# Create a Matcher object using spaCy's vocab
matcher = Matcher(nlp.vocab)

# Define a pattern to match email addresses using a regular expression
pattern = [{"TEXT": {"REGEX": r"\w+@\w+"}}]

# Add the pattern to the Matcher
matcher.add("Email", [pattern])

# Get the matches from the Matcher for the Doc object
matches = matcher(doc)

# Print the matched text from the Doc
print([doc[start:end].text for match_id, start, end in matches])


['xyz@utdallas.edu', 'abx@gmail.com']


<font color = 'dodgerblue'> **Note**: Matcher object gives the complete token where the pattern occurs. Let us compare the result with re.findall.

In [None]:
re.findall(r"\w+@\w+", text)

['xyz@utdallas', 'abx@gmail']
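
The difference comes from what each approach returns. The sketch below imitates it in plain Python, under the assumption that the Matcher's `REGEX` operator searches inside each token's text and reports the whole token on success. The token list is hypothetical and written by hand for illustration; it is not produced by spaCy here.

```python
import re

text = "You can contact me at @twitter, xyz@utdallas.edu, abx@gmail.com"
pattern = re.compile(r"\w+@\w+")

# Hypothetical tokenization, used only for illustration
tokens = ["You", "can", "contact", "me", "at", "@twitter", ",",
          "xyz@utdallas.edu", ",", "abx@gmail.com"]

# Matcher-style: test each token and keep the whole token when the regex is found in it
token_hits = [tok for tok in tokens if pattern.search(tok)]

# re.findall: return only the characters the regex consumed
string_hits = pattern.findall(text)

print(token_hits)
print(string_hits)
```

`@twitter` is skipped in both cases because the pattern requires at least one word character before the `@`.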

### <font color = 'pickle'>**Example 2**

In [52]:
# Create a Matcher object using spaCy's vocab
matcher = Matcher(nlp.vocab)

# Create two spaCy Doc objects
doc1 = nlp("I travelled by bus.")
doc2 = nlp("She traveled by bike.")

# Define a pattern to match using a combination of POS and a regex pattern
pattern = [{"POS": "PRON"}, {"TEXT": {"REGEX": "[Tt]ravell?ed"}}]

# Add the pattern to the Matcher
matcher.add("PosRegex", [pattern])

# Iterate over the matches in the first Doc object
for matchid, start, end in matcher(doc1):
    # Print the matching span from the first Doc
    print(doc1[start:end])

# Iterate over the matches in the second Doc object
for mid, start, end in matcher(doc2):
    # Print the matching span from the second Doc
    print(doc2[start:end])

I travelled
She traveled


### <font color = 'pickle'>**Example 3**

In [53]:
text = "Let us try different frequency of radio stations - FM 12.9, AM 104.9, FM 104.1,  AM 123.8. 1234"
radio_stations = re.findall(r'[FA]M\s\d{2,3}\.\d', text)
radio_stations

['FM 12.9', 'AM 104.9', 'FM 104.1', 'AM 123.8']

In [None]:
# Extracting Radio Stations using Regular Expressions and spaCy's Matcher

# Create a Matcher object using spaCy's vocab
matcher = Matcher(nlp.vocab)

# The text we want to extract radio stations from
text = "Let us try different frequency of radio stations - FM 12.9, AM 104.9, FM 104.1, AM 123.8. 1234"

# Create a spaCy Doc object from the text
doc = nlp(text)

# Define a pattern to match radio stations using a regular expression
pattern = [{"TEXT": {"REGEX": "[FA]M\\s\\d{2,3}\\.\\d"}}]

# Add the pattern to the Matcher
matcher.add("RadioStation", [pattern])

# Get the matches from the Matcher for the Doc object
matches = matcher(doc)

# Print the matched text from the Doc
for match_id, start, end in matches:
    print(doc[start:end].text)

<font color = 'dodgerblue'>**Using the same pattern as used in `re.findall()` does not match anything. WHY?**

 - The match was applied to a single token 
 - No single token matched the pattern
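
The same effect can be reproduced without spaCy. This is a minimal sketch on a hand-written, hypothetical token list: a regex that spans a space cannot succeed when it is tested against one token at a time, while one regex per token position, checked on consecutive tokens, can.

```python
import re

# Hypothetical tokens for part of the sentence (illustration only)
tokens = ["FM", "12.9", ",", "AM", "104.9", ",", "FM", "104.1"]

whole = re.compile(r"[FA]M\s\d{2,3}\.\d")
band = re.compile(r"[FA]M")
freq = re.compile(r"\d{2,3}\.\d")

# One regex applied to one token at a time: no token contains both parts
single_token_hits = [t for t in tokens if whole.search(t)]

# One regex per token position, checked on consecutive tokens
pair_hits = [
    f"{a} {b}"
    for a, b in zip(tokens, tokens[1:])
    if band.search(a) and freq.search(b)
]

print(single_token_hits)
print(pair_hits)
```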

 Let us modify the pattern now.





In [None]:
# Create a Matcher object using spaCy's vocab
matcher = Matcher(nlp.vocab)

# Define the input text
text = "Let us try different frequency of radio stations - FM 12.9, AM 104.9, FM 104.1,  AM 123.8. 1234"

# Apply the spaCy's NLP model on the input text
doc = nlp(text)

# Define a pattern to match "FM" or "AM" followed by a number with 2 to 3 digits and a dot followed by another number
pattern = [
    {"TEXT": {'REGEX': r'[FA]M'}},
    {"TEXT": {'REGEX': r'\d{2,3}\.\d'}}
]

# Add the pattern to the Matcher
matcher.add("RegexMulti2", [pattern]) 

# Find matches in the text using the Matcher
matches = matcher(doc) 

# Print the matched text
for match_id, start, end in matches: 
    print(doc[start:end].text) 


FM 12.9
AM 104.9
FM 104.1
AM 123.8


## <font color = 'pickle'>**9. Matching Attribute (SHAPE)**

### <font color = 'pickle'>**Understanding Shape Attribute**

In [54]:
text = "Let us try different radio stations - FM 12.9, AM 104.9, FM 104.1,  AM 123.8 and A234Hj.,-9"
doc = nlp(text)

# Get the text and shape of each token in the doc
[(token.text,token.shape_) for token in doc]

[('Let', 'Xxx'),
 ('us', 'xx'),
 ('try', 'xxx'),
 ('different', 'xxxx'),
 ('radio', 'xxxx'),
 ('stations', 'xxxx'),
 ('-', '-'),
 ('FM', 'XX'),
 ('12.9', 'dd.d'),
 (',', ','),
 ('AM', 'XX'),
 ('104.9', 'ddd.d'),
 (',', ','),
 ('FM', 'XX'),
 ('104.1', 'ddd.d'),
 (',', ','),
 (' ', ' '),
 ('AM', 'XX'),
 ('123.8', 'ddd.d'),
 ('and', 'xxx'),
 ('A234Hj.,-9', 'XdddXx.,-d')]

Note: `shape_` represents the token's shape as a string in which each character stands for a character type in the token: `'x'` for a lowercase letter, `'X'` for an uppercase letter and `'d'` for a digit, while other characters such as punctuation are kept as they are. Runs of the same shape character are capped at four, which is why `'different'` has the shape `'xxxx'`.
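
The mapping can be approximated in a few lines of plain Python. The function below is a sketch written to reproduce the shapes in the output above; `shape_sketch` and its `max_run` parameter are hypothetical names, and spaCy's own implementation may differ in details.

```python
def shape_sketch(text, max_run=4):
    """Approximate a token shape: X/x for letters, d for digits, other chars kept."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            mapped = "X" if char.isupper() else "x"
        elif char.isdigit():
            mapped = "d"
        else:
            mapped = char
        # Track how long the current run of identical shape characters is
        run = run + 1 if mapped == last else 1
        last = mapped
        # Only the first max_run characters of a run are written out
        if run <= max_run:
            shape.append(mapped)
    return "".join(shape)

for word in ["Let", "different", "104.9", "A234Hj.,-9"]:
    print(word, shape_sketch(word))
```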



### <font color = 'pickle'>**Use SHAPE in Matcher**

In [55]:
# Create a doc object using spaCy
doc = nlp("Let us try different frequency of radio stations - FM 12.9, AM 104.9, FM 104.1,  AM 123.8.")

# Create a Matcher object using spaCy's vocab
matcher = Matcher(nlp.vocab)

# Define a pattern to match the text with shape ddd.d
pattern = [{"SHAPE": 'ddd.d'}]

# Add the pattern to the Matcher
matcher.add("Shape", [pattern])

# Apply the matcher to the doc
matches = matcher(doc)

# Loop through the matches and print matched text
for match_id, start, end in matches:
    print(doc[start:end].text)

104.9
104.1
123.8


### <font color = 'pickle'>**Use SHAPE with Regex in Matcher**

In [58]:
doc = nlp("Let us try different frequency of radio stations - FM 12.9, AM 104.9, FM 104.1,  AM 123.8.") 

# Creating a Matcher object
matcher = Matcher(nlp.vocab) 

# Define the pattern to match
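# Here we are using the SHAPE attribute with a regular expression.
# The dot is escaped and the pattern is anchored so that only the shapes dd.d and ddd.d match.
pattern = [{"SHAPE": {'REGEX': r'^d?dd\.d$'}}]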

# Add the pattern to the matcher
matcher.add("ShapeRegex", [pattern]) 

# Use the matcher to find matches in the text
matches = matcher(doc) 

# Iterate over the matches and print the matched text
for match_id, start, end in matches: 
    print(doc[start:end].text)   

12.9
104.9
104.1
123.8


## <font color = 'pickle'>**10. Extract X relationship Y using Dependency Labels**

Here we will extract a pair of entities (X, Y) if there is a relationship such as "X acquired (bought) Y" or "Y was acquired (bought) by X".


In [59]:
text1 =  "In their largest acquisition to date, Google has acquired YouTube for $1.65 billion"
text2 = " YouTube was acquired by Google for $1.65 billion"
text3 = " Google bought YouTube for $1.65 billion"
text4 = " Work was done"
doc1 = nlp(text1)
doc2 = nlp(text2)
doc3 = nlp(text3)
doc4 = nlp(text4)

### <font color = 'pickle'>**Understanding Dependency Labels**

In [60]:
print(f'{"Text":<12}: {"Lemma":<10}: {"POS":<10}: DEP\n')
for token in doc1:
  print(f'{token.text:<12}: {token.lemma_:<10}: {token.pos_:<10}: {token.dep_}')  

Text        : Lemma     : POS       : DEP

In          : in        : ADP       : prep
their       : their     : PRON      : poss
largest     : large     : ADJ       : amod
acquisition : acquisition: NOUN      : pobj
to          : to        : ADP       : prep
date        : date      : NOUN      : pobj
,           : ,         : PUNCT     : punct
Google      : Google    : PROPN     : nsubj
has         : have      : AUX       : aux
acquired    : acquire   : VERB      : ROOT
YouTube     : YouTube   : PROPN     : dobj
for         : for       : ADP       : prep
$           : $         : SYM       : quantmod
1.65        : 1.65      : NUM       : compound
billion     : billion   : NUM       : pobj


In [61]:
print(f'{"Text":<12}: {"Lemma":<10}: {"POS":<10}: DEP\n')
for token in doc2:
  print(f'{token.text:<12}: {token.lemma_:<10}: {token.pos_:<10}: {token.dep_}')    

Text        : Lemma     : POS       : DEP

            :           : SPACE     : dep
YouTube     : YouTube   : PROPN     : nsubjpass
was         : be        : AUX       : auxpass
acquired    : acquire   : VERB      : ROOT
by          : by        : ADP       : agent
Google      : Google    : PROPN     : pobj
for         : for       : ADP       : prep
$           : $         : SYM       : quantmod
1.65        : 1.65      : NUM       : compound
billion     : billion   : NUM       : pobj


### <font color = 'pickle'>**Label description with spacy.explain()**

In [62]:
print(spacy.explain('nsubjpass'))
print(spacy.explain('nsubj'))
print(spacy.explain('pobj'))
print(spacy.explain('dobj'))

nominal subject (passive)
nominal subject
object of preposition
direct object


In [63]:
print(f'{"Text":<12}: {"Lemma":<10}: {"POS":<10}: DEP\n')
for token in doc3:
  print(f'{token.text:<12}: {token.lemma_:<10}: {token.pos_:<10}: {token.dep_}')  

Text        : Lemma     : POS       : DEP

            :           : SPACE     : dep
Google      : Google    : PROPN     : nsubj
bought      : buy       : VERB      : ROOT
YouTube     : YouTube   : PROPN     : dobj
for         : for       : ADP       : prep
$           : $         : SYM       : quantmod
1.65        : 1.65      : NUM       : compound
billion     : billion   : NUM       : pobj


In [64]:
print(f'{"Text":<12}: {"Lemma":<10}: {"POS":<10}: DEP\n')
for token in doc4:
  print(f'{token.text:<12}: {token.lemma_:<10}: {token.pos_:<10}: {token.dep_}')  

Text        : Lemma     : POS       : DEP

            :           : SPACE     : dep
Work        : Work      : PROPN     : nsubjpass
was         : be        : AUX       : auxpass
done        : do        : VERB      : ROOT


### <font color = 'pickle'>**Step1: Check lemma of ROOT word**

In [65]:
def root_acquire(doc)-> bool:
    """
    A function that checks if a root of a given spaCy doc is 'acquire' or 'buy'

    Parameters:
    doc (spacy.tokens.doc.Doc): spaCy doc to be analyzed

    Returns:
    bool: True if the root of the given doc is either 'acquire' or 'buy' else False

    """
    return any(token.dep_ == 'ROOT' and token.lemma_ in ('acquire', 'buy') for token in doc)


In [66]:
print(root_acquire(doc1))
print(root_acquire(doc2))
print(root_acquire(doc3))
print(root_acquire(doc4))

True
True
True
False


### <font color = 'pickle'> **Step2: Check active/passive voice**

In [67]:
def is_passive(doc) -> bool:
  """
  Check if a document contains passive voice sentences.
  This function takes in a spaCy doc object and returns True if 
  it contains one or more tokens with a dependency label of "nsubjpass".

  Args:
  doc (spaCy doc object): The document to be analyzed.

  Returns:
  bool: True if the document contains passive voice sentences, False otherwise.
  """
  return any(token.dep_ == 'nsubjpass' for token in doc)

In [68]:
print(is_passive(doc1))
print(is_passive(doc2))
print(is_passive(doc3))
print(is_passive(doc4))

False
True
False
True


### <font color = 'pickle'> **Step3: Extract Relationship**

In [69]:
from typing import Optional, Tuple

def get_x_acquire_y_pairs(doc) -> Optional[Tuple[str, str]]:
  """
  Extract the X acquire Y pair from a spaCy doc object.
  
  Args:
    doc (spaCy doc object): The document to be analyzed.
    
  Returns:
    Optional[Tuple[str, str]]: (X, Y) if X acquires Y; None (after printing
    a message) if no such pair is present in the document.
  """
  if root_acquire(doc):
    if is_passive(doc):
      # Passive voice: X is the first token whose dependency ends in 'obj' (here the object of "by")
      x = [token.text for token in doc if token.dep_.endswith('obj')]
      # Passive voice: Y is the passive subject
      y = [token.text for token in doc if token.dep_ == 'nsubjpass']
    else:
      # Active voice: X is the subject
      x = [token.text for token in doc if token.dep_.endswith('subj')]
      # Active voice: Y is the direct object
      y = [token.text for token in doc if token.dep_ == 'dobj']
    if x and y:
      return (x[0], y[0])
  print('X acquire Y pair is not present in document')
  return None



In [70]:
print(get_x_acquire_y_pairs(doc1))
print(get_x_acquire_y_pairs(doc2))
print(get_x_acquire_y_pairs(doc3))
get_x_acquire_y_pairs(doc4)

('Google', 'YouTube')
('Google', 'YouTube')
('Google', 'YouTube')
X acquire Y pair is not present in document


## <font color = 'pickle'>**11. Phrase Matcher**

If you need to match a large terminology list, the PhraseMatcher is a far more efficient option than token patterns: it takes Doc objects as patterns instead of lists of token descriptions. For example, it is difficult to define token patterns that will match all country names. However, we can easily enumerate the country names in a list, create a Doc object for each name, and use these Doc objects as the basis of our information extraction script.

In [71]:
from spacy.matcher import PhraseMatcher
import json

### <font color = 'pickle'>**Example1 - Countries**

#### <font color = 'pickle'>**Download list of countries**

In [82]:
base_folder = Path(base_folder/'nlp')
data_folder = base_folder/'data'

In [83]:
file = data_folder/'countries.json'
URL = 'https://raw.githubusercontent.com/explosion/spacy-course/master/exercises/en/countries.json'
if not file.exists():
  !wget {URL} -O {file}

In [84]:
with open(file, 'r', encoding='utf-8') as f:
  COUNTRIES = json.load(f)

In [85]:
COUNTRIES[0:10]

['Afghanistan',
 'Åland Islands',
 'Albania',
 'Algeria',
 'American Samoa',
 'Andorra',
 'Angola',
 'Anguilla',
 'Antarctica',
 'Antigua and Barbuda']

#### <font color = 'pickle'>**Create patterns**
We will now create a list of Doc objects to use as patterns.

In [86]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [87]:
disable = nlp.select_pipes(disable=['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'])

In [101]:
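# Alternative: patterns = [nlp.make_doc(country) for country in COUNTRIES]  # runs the tokenizer only
# nlp.pipe processes the texts as a batch; with all pipes disabled above, it also only tokenizes
patterns = list(nlp.pipe(COUNTRIES))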

In [102]:
patterns[0:10]

[Afghanistan,
 Åland Islands,
 Albania,
 Algeria,
 American Samoa,
 Andorra,
 Angola,
 Anguilla,
 Antarctica,
 Antigua and Barbuda]

In [103]:
# Patterns are doc objects not text
type(patterns[0])

spacy.tokens.doc.Doc

#### <font color = 'pickle'>**Add patterns to Phrase Matcher**

In [104]:
# Input text
text = 'New Zealand defeated Germany in rugby'

# Apply spaCy NLP pipeline to the text
doc = nlp(text)

# Create a PhraseMatcher object to match phrases in the document
matcher = PhraseMatcher(nlp.vocab)

# Add the patterns to be matched to the matcher
matcher.add('phrase-country', patterns)

# Get the matches for the patterns in the document
matches = matcher(doc) 

# Loop over the matches and print the matching text
for match_id, start, end in matches: 
    # Matching text
    print(doc[start:end].text)

New Zealand
Germany


Let us try a variation with lowercase and uppercase.

In [105]:
text = 'new zealand defeated GERMANY in rugby.'
doc = nlp(text)
matcher = PhraseMatcher(nlp.vocab)
matcher.add('phrase-country', patterns)
matches = matcher(doc) 
for match_id, start, end in matches: 
    print(doc[start:end].text) 

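We do not get any results because, by default, the PhraseMatcher compares the exact token text (`ORTH`), so matching is case-sensitive. The patterns are in title case (each word starts with a capital letter), so neither 'new zealand' nor 'GERMANY' matches them.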

#### <font color = 'pickle'>**Use attributes in Phrase Matcher**

We can overcome this issue by passing `attr = 'LOWER'` to the PhraseMatcher, which compares the lowercase form of each token in both the patterns and the document.

In [106]:
text = 'new zealand defeated GERMANY in rugby. Some other Variations iNDIA, united STATES OF America'
doc = nlp(text)
matcher = PhraseMatcher(nlp.vocab, attr = 'LOWER')
matcher.add('phrase-country', patterns)
matches = matcher(doc) 
for match_id, start, end in matches: 
    print(doc[start:end].text) 

new zealand
GERMANY
iNDIA
united STATES OF America
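
The effect of the `attr` argument can be sketched on a toy example. The snippet below is a minimal illustration rather than the notebook's pipeline: it assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) and a hypothetical two-item country list. It also shows how the integer `match_id` maps back to the label through `nlp.vocab.strings`.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank English pipeline (tokenizer only) is enough for phrase matching
nlp_blank = spacy.blank("en")

# Hypothetical short list standing in for the full COUNTRIES list
sample_countries = ["New Zealand", "Germany"]
sample_patterns = [nlp_blank.make_doc(name) for name in sample_countries]

sample_doc = nlp_blank("new zealand defeated GERMANY in rugby.")


def find_phrases(attr):
    """Return (label, text) pairs found when tokens are compared on `attr`."""
    phrase_matcher = PhraseMatcher(nlp_blank.vocab, attr=attr)
    phrase_matcher.add("phrase-country", sample_patterns)
    # Sort by start position so the result order follows the text
    matches = sorted(phrase_matcher(sample_doc), key=lambda m: m[1])
    return [
        (nlp_blank.vocab.strings[match_id], sample_doc[start:end].text)
        for match_id, start, end in matches
    ]


exact_matches = find_phrases("ORTH")   # default attribute: verbatim token text
lower_matches = find_phrases("LOWER")  # lowercase form of each token

print(exact_matches)
print(lower_matches)
```

Building the pattern Docs once and reusing them in both matchers keeps the comparison fair: only the attribute used for comparison changes.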


### <font color = 'pickle'>**Example2 - IP Addresses**

In [107]:
# Compare token shapes instead of exact text, so any address with the same digit layout matches
matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")
ip_addresses = ["197.1.1.1", "197.197.1.1"]
patterns = list(nlp.pipe(ip_addresses))
matcher.add("IpAddresses", patterns)

doc = nlp("The static IP addresses for this facility are 127.3.4.1, 127.123.2.2")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)

127.3.4.1
127.123.2.2
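
Why two example addresses are enough becomes clearer when looking at the `shape_` attribute itself: spaCy maps every digit to `d`, so addresses with the same number of digits in each octet share a shape. The sketch below assumes a blank English pipeline and uses hypothetical addresses. It also illustrates a limitation: an address with a different digit layout is not matched unless a pattern with that layout is added.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank English pipeline (tokenizer only); SHAPE is a lexical attribute
nlp_blank = spacy.blank("en")

# Each example address contributes one shape pattern
example_ips = ["197.1.1.1", "197.197.1.1"]
ip_patterns = [nlp_blank.make_doc(ip) for ip in example_ips]
pattern_shapes = [pattern[0].shape_ for pattern in ip_patterns]
print(pattern_shapes)

shape_matcher = PhraseMatcher(nlp_blank.vocab, attr="SHAPE")
shape_matcher.add("IpAddresses", ip_patterns)

# Hypothetical text: the first address shares a shape with a pattern, the second does not
ip_doc = nlp_blank("Servers: 127.3.4.1 and 10.0.0.1")
matched_ips = [ip_doc[start:end].text for _, start, end in shape_matcher(ip_doc)]
print(matched_ips)
```

Shape matching checks only the layout of characters, so it would also accept strings that are not valid addresses (for example, octets above 255). A validation step after matching may be needed in practice.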


## <font color = 'pickle'>**12. Use entities in Matcher**

In [108]:
nlp.pipe_names

[]

In [109]:
disabled.restore()

In [110]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [112]:
text = "I work at Apple. My favorite fruit is apple."
doc = nlp(text)
[(entity.label_, entity.text) for entity in doc.ents ]

[('ORG', 'Apple')]

In [113]:
matcher = Matcher(nlp.vocab)

# Match the token "apple" (any casing) only when NER labelled it as an organization
pattern = [{"ENT_TYPE": "ORG", "LOWER": "apple"}]
matcher.add("entity", [pattern])
matches = matcher(doc) 
for match_id, start, end in matches: 
    print(doc[start:end].text)

Apple
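
To see what the `ENT_TYPE` constraint contributes, it may help to compare the pattern with and without it. The sketch below is an illustration under stated assumptions: it uses a blank English pipeline, which has no NER component, so the `ORG` label is set by hand on the first "Apple" to stand in for the model's prediction shown above.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# Assumption: a blank pipeline has no NER, so the entity is annotated manually
nlp_blank = spacy.blank("en")
ent_doc = nlp_blank("I work at Apple. My favorite fruit is apple.")

# Token 3 is "Apple"; label it ORG to mimic the NER output from the cell above
ent_doc.ents = [Span(ent_doc, 3, 4, label="ORG")]


def matched_texts(pattern):
    """Run a single token pattern over ent_doc and return the matched texts."""
    token_matcher = Matcher(nlp_blank.vocab)
    token_matcher.add("apple", [pattern])
    matches = sorted(token_matcher(ent_doc), key=lambda m: m[1])
    return [ent_doc[start:end].text for _, start, end in matches]


text_only = matched_texts([{"LOWER": "apple"}])
with_entity = matched_texts([{"ENT_TYPE": "ORG", "LOWER": "apple"}])

print(text_only)
print(with_entity)
```

Both keys sit in the same dictionary, so they must hold for the same token: the text condition alone matches the company and the fruit, while the added entity condition keeps only the company.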


# <font color = 'pickle'>**Custom extensions for tokens**

In [114]:
# Import the Token class to register a custom extension attribute
from spacy.tokens import Token
doc = nlp("My email is harpreet@utdallas.edu and my url is https://j.u.edu/faculty/hs.")

In [115]:
# Register a token-level extension attribute named "clean" with a default value of False.
# force=True overwrites any earlier registration, so the cell can be re-run safely.
Token.set_extension('clean', default=False, force=True)

In [116]:
# Print each token in the doc along with the value stored in its extension attribute.
# Every value defaults to False.
print(f'{"token.text":<27} : {"token._.clean"}')
for token in doc:
  print(f'{token.text:<27} : {token._.clean}')

token.text                  : token._.clean
My                          : False
email                       : False
is                          : False
harpreet@utdallas.edu       : False
and                         : False
my                          : False
url                         : False
is                          : False
https://j.u.edu/faculty/hs  : False
.                           : False


In [117]:
# Set the custom extension (clean) to True for tokens that are not punctuation, URLs, or emails
for token in doc:
  if not (token.is_punct or token.like_url or token.like_email):
    token._.set('clean', True)

In [118]:
# Printing the tokens again to see the modified values.
print(f'{"token.text":<27} : {"token._.clean"}')
for token in doc:
  print(f'{token.text:<27} : {token._.clean}')

token.text                  : token._.clean
My                          : True
email                       : True
is                          : True
harpreet@utdallas.edu       : False
and                         : True
my                          : True
url                         : True
is                          : True
https://j.u.edu/faculty/hs  : False
.                           : False


In [119]:
[token.text for token in doc if token._.clean]

['My', 'email', 'is', 'and', 'my', 'url', 'is']
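
The loop above stores a value on every token of one particular doc. An alternative, sketched below, is a getter-based extension, which computes the value on access for any doc. This is a minimal sketch under stated assumptions: it uses a blank English pipeline, and the extension name `is_clean` and the helper `is_clean_token` are hypothetical names chosen so they do not clash with the `clean` attribute registered earlier.

```python
import spacy
from spacy.tokens import Token

# Assumption: a blank English pipeline; is_punct, like_url and like_email are lexical flags
nlp_blank = spacy.blank("en")


def is_clean_token(token):
    """Return True when the token is not punctuation, a URL, or an email address."""
    return not (token.is_punct or token.like_url or token.like_email)


# "is_clean" is a hypothetical name; the getter runs each time the attribute is read
Token.set_extension("is_clean", getter=is_clean_token, force=True)

ext_doc = nlp_blank(
    "My email is harpreet@utdallas.edu and my url is https://j.u.edu/faculty/hs."
)
clean_tokens = [token.text for token in ext_doc if token._.is_clean]
print(clean_tokens)
```

With a getter there is no per-doc loop to remember, but the value is read-only unless a setter is also provided.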