# <font color = 'indianred'>**Advanced Spacy**</font>
    
In this notebook, we will learn some advanced features of spaCy:

1. Custom Tokenizer
2. Custom Extensions

These features let us modify the default behavior of the spaCy tokenizer. We will use them to create a class that we can reuse in future lectures.



# <font color = 'indianred'>**Install/Import Libraries**</font>

In [None]:
# install spacy
if 'google.colab' in str(get_ipython()):
    !pip install -U spacy -qq


In [None]:
# Importing required libraries
# Path from the 'pathlib' library is used for working with files and directories in a portable manner across different operating systems
from pathlib import Path
import re
import spacy


In [None]:
spacy.__version__


'3.6.1'

# <font color = 'indianred'>**Set Path for Data**</font>

In [None]:
# Check if the code is running in a Colab environment
if 'google.colab' in str(get_ipython()):  # If the code is running in Colab
    # mount google drive
    from google.colab import drive
    drive.mount('/content/drive')

    # specify the base path
    basepath = '/content/drive/MyDrive/data'
else:
    basepath = '/home/harpreet/Insync/google_drive_shaannoor'


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# <font color = 'indianred'>**Load Spacy Model**</font>

In [None]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
# Loading the 'en_core_web_sm' language model from the spaCy library
nlp = spacy.load('en_core_web_sm')

# Selecting pipes to disable for the loaded language model
# Here, the pipes for token-to-vector, part-of-speech tagging, dependency parsing, attribute ruler,
# lemmatization and named entity recognition are being disabled.
disabled = nlp.select_pipes(
    disable=['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'])


In [None]:
# check the disabled pipelines
disabled


['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
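
The object returned by `nlp.select_pipes` remembers which components were switched off, so they can be re-enabled later (for example with `disabled.restore()`). It can also be used as a context manager. The sketch below is a minimal illustration, assuming a blank English pipeline with a single `sentencizer` component standing in for the full model; `nlp_demo` is a name introduced only for this example.

```python
import spacy

# Assumption: a blank pipeline plus one rule-based component, used only for illustration
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")

# Disable the component temporarily; it is restored when the block exits
with nlp_demo.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp_demo.pipe_names)

names_after = list(nlp_demo.pipe_names)
print(names_inside, names_after)
```

Inside the `with` block the pipeline has no active components; afterwards the `sentencizer` is active again.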

In [None]:
sample_text = " #Reg #Ex @abc@xyz.com! prefixes  stop-words wow!"


In [None]:
# Creating a spaCy document from the sample text
doc = nlp(sample_text)

# Printing the text of each token in the document
print([token.text for token in doc])


[' ', '#', 'Reg', '#', 'Ex', '@abc@xyz.com', '!', 'prefixes', ' ', 'stop', '-', 'words', 'wow', '!']
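
To see why a token was split the way it was, the tokenizer's `explain` method can help. The sketch below assumes a blank English pipeline (`nlp_demo`, a name used only here) with the default tokenizer rules, so it does not depend on the downloaded model.

```python
import spacy

# Assumption: default English tokenizer rules, no trained model required
nlp_demo = spacy.blank("en")

# explain() returns (rule, text) pairs showing which tokenizer rule produced each token
explained = nlp_demo.tokenizer.explain("#Reg wow!")
for rule, text in explained:
    print(f"{rule:<8} {text}")
```

With the default rules, `#` should be reported as a prefix and `!` as a suffix, which are the two behaviors the following sections change.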


# <font color = 'indianred'>**Custom Tokenizer in spaCy**</font>

## <font color = 'indianred'>**Modify Prefixes**</font>

In [None]:
# Modify the prefix characters used by the spaCy tokenizer
# Suppose we want to keep a hash (#) attached to the text that follows it
# spaCy treats '#' as a prefix and therefore splits it off when creating tokens

# Accessing the default prefixes for the loaded language model (copied into a new list)
prefixes = list(nlp.Defaults.prefixes)
prefixes[20:30]


### <font color = 'indianred'>**Remove Prefixes**</font>

In [None]:
# Removing the '#' symbol from the prefixes list
prefixes.remove('#')

# Compiling a regular expression pattern for the remaining prefixes
prefix_regex = spacy.util.compile_prefix_regex(prefixes)

# Assigning the compiled prefix regular expression to the spaCy tokenizer's prefix search method
nlp.tokenizer.prefix_search = prefix_regex.search


Code Explanation:

This code modifies the prefixes used by the spaCy tokenizer.
- The first line removes the `#` symbol from the prefixes list.
- Then, a regular expression pattern is compiled from the updated prefixes list using the `spacy.util.compile_prefix_regex` function.
- Finally, the compiled regular expression is assigned to the `prefix_search` method of the spaCy tokenizer, which is used to identify prefixes in text during tokenization.

By updating the prefix_search method, this code changes the behavior of the spaCy tokenizer to no longer treat the `#` symbol as a prefix.
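
To make the mechanism concrete, here is a simplified stand-in written with the standard `re` module only. It assumes that a prefix pattern is roughly the individual pieces anchored at the start of the string and joined with `|`; `build_prefix_search` and `toy_prefixes` are hypothetical names for this sketch, not part of spaCy.

```python
import re

def build_prefix_search(pieces):
    """Simplified stand-in: anchor every pattern at the start and join with '|'."""
    pattern = "|".join("^" + piece for piece in pieces if piece.strip())
    return re.compile(pattern).search

# Hypothetical list, much shorter than spaCy's real prefix list
toy_prefixes = ["#", r"\(", r"\$"]

with_hash = build_prefix_search(toy_prefixes)
without_hash = build_prefix_search([p for p in toy_prefixes if p != "#"])

print(with_hash("#Reg"))     # a match object covering '#'
print(without_hash("#Reg"))  # None: '#' is no longer recognised as a prefix
```

Once `#` is gone from the list, the search function finds nothing to split off at the start of `#Reg`, so the tokenizer keeps it as a single token.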

### <font color = 'indianred'>**Add Prefixes**</font>

In [None]:
# create doc
doc = nlp(sample_text)
# print tokens
print([token.text for token in doc])


[' ', '#Reg', '#Ex', '@abc@xyz.com', '!', 'prefixes', ' ', 'stop', '-', 'words', 'wow', '!']


In [None]:
# Add '@' as a prefix so that it is split off the start of a token
prefixes.append(r'@')

# Compiling a regular expression pattern for the modified prefixes
prefix_regex = spacy.util.compile_prefix_regex(prefixes)

# Assigning the compiled prefix regular expression to the spaCy tokenizer's prefix search method
nlp.tokenizer.prefix_search = prefix_regex.search

doc = nlp(sample_text)
print([token.text for token in doc])


[' ', '#Reg', '#Ex', '@', 'abc@xyz.com', '!', 'prefixes', ' ', 'stop', '-', 'words', 'wow', '!']


## <font color = 'indianred'>**Modify Suffixes**</font>

In [None]:
# check default suffixes in spacy
# copy into a new list so the shared defaults are not modified in place
suffixes = list(nlp.Defaults.suffixes)
suffixes[20:30]


['\\*', '&', '。', '？', '！', '，', '、', '；', '：', '～']

In [None]:
# Remove '!' from the suffixes so that it stays attached to the preceding text
suffixes.remove(r'\!')

# Compiling a regular expression pattern for the modified suffixes
suffix_regex = spacy.util.compile_suffix_regex(suffixes)

# Assigning the compiled suffix regular expression to the spaCy tokenizer's suffix search method
nlp.tokenizer.suffix_search = suffix_regex.search

doc = nlp(sample_text)
print([token.text for token in doc])


[' ', '#Reg', '#Ex', '@', 'abc@xyz.com!', 'prefixes', ' ', 'stop', '-', 'words', 'wow!']


## <font color = 'indianred'>**Modify infixes**</font>

In [None]:
# Create a list of the default infixes from the language defaults (nlp.Defaults)
infixes = list(nlp.Defaults.infixes)
infixes[0:3]


['\\.\\.+',
 '…',
 '[\\u00A6\\u00A9\\u00AE\\u00B0\\u0482\\u058D\\u058E\\u060E\\u060F\\u06DE\\u06E9\\u06FD\\u06FE\\u07F6\\u09FA\\u0B70\\u0BF3-\\u0BF8\\u0BFA\\u0C7F\\u0D4F\\u0D79\\u0F01-\\u0F03\\u0F13\\u0F15-\\u0F17\\u0F1A-\\u0F1F\\u0F34\\u0F36\\u0F38\\u0FBE-\\u0FC5\\u0FC7-\\u0FCC\\u0FCE\\u0FCF\\u0FD5-\\u0FD8\\u109E\\u109F\\u1390-\\u1399\\u1940\\u19DE-\\u19FF\\u1B61-\\u1B6A\\u1B74-\\u1B7C\\u2100\\u2101\\u2103-\\u2106\\u2108\\u2109\\u2114\\u2116\\u2117\\u211E-\\u2123\\u2125\\u2127\\u2129\\u212E\\u213A\\u213B\\u214A\\u214C\\u214D\\u214F\\u218A\\u218B\\u2195-\\u2199\\u219C-\\u219F\\u21A1\\u21A2\\u21A4\\u21A5\\u21A7-\\u21AD\\u21AF-\\u21CD\\u21D0\\u21D1\\u21D3\\u21D5-\\u21F3\\u2300-\\u2307\\u230C-\\u231F\\u2322-\\u2328\\u232B-\\u237B\\u237D-\\u239A\\u23B4-\\u23DB\\u23E2-\\u2426\\u2440-\\u244A\\u249C-\\u24E9\\u2500-\\u25B6\\u25B8-\\u25C0\\u25C2-\\u25F7\\u2600-\\u266E\\u2670-\\u2767\\u2794-\\u27BF\\u2800-\\u28FF\\u2B00-\\u2B2F\\u2B45\\u2B46\\u2B4D-\\u2B73\\u2B76-\\u2B95\\u2B98-\\u2BC8\\u2BCA-\\

In [None]:
# Create a new list of infixes without elements containing the string '-'
infixes = [x for x in infixes if r'-' not in x]

# Compiling a regular expression pattern for the modified infixes
infix_regex = spacy.util.compile_infix_regex(infixes)

# Assigning the compiled infix regular expression to the spaCy tokenizer's infix finditer method
nlp.tokenizer.infix_finditer = infix_regex.finditer

doc = nlp(sample_text)
print([token.text for token in doc])


[' ', '#Reg', '#Ex', '@', 'abc@xyz.com!', 'prefixes', ' ', 'stop-words', 'wow!']


You can't use the `.remove()` method here because `'-'` is not a standalone element of the infixes list. The hyphen appears inside several longer regular-expression patterns, and `.remove()` only deletes an element that is exactly equal to its argument.

The list comprehension `[x for x in infixes if r'-' not in x]` creates a new list that only contains elements from the original infixes list if the string `'-'` is not present in the element. This ensures that all elements in the list that contain the string `'-'` are removed.
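
The difference between the two approaches can be seen on a toy list. The patterns below are hypothetical and heavily abbreviated; they only imitate the style of spaCy's infix patterns.

```python
# Toy list in the style of spaCy's infix patterns (hypothetical, heavily abbreviated)
toy_infixes = [
    r"\.\.+",                         # runs of two or more dots
    r"(?<=[0-9])[+\-\*^](?=[0-9-])",  # arithmetic operators between digits
    r"(?<=[a-z])(?:-|~)(?=[a-z])",    # hyphen or tilde between letters
]

# '-' is not an element on its own, so list.remove('-') raises ValueError
try:
    toy_infixes.copy().remove("-")
    removed = True
except ValueError:
    removed = False

# The comprehension drops every pattern whose text contains '-'
kept = [x for x in toy_infixes if "-" not in x]
print(removed, kept)
```

Note that the filter is broad: it also discards patterns that only use `-` inside a character range such as `[0-9]`, so other infix rules are removed along with the hyphen rule.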

## <font color = 'indianred'>**Adding special case tokenization rules**</font>

In [None]:
doc = nlp("gimme that")
print([w.text for w in doc])


['gimme', 'that']


In [None]:
# Importing the ORTH symbol from the spaCy symbols module
from spacy.symbols import ORTH


`ORTH` is a constant from the `spacy.symbols` module in spaCy. It identifies the verbatim text of a token (the same content as `token.text`) and serves as the dictionary key that gives the text of each piece in a special-case rule.

In [None]:
# Define a special case for the text "gimme"
special_case = [{ORTH: "gim"}, {ORTH: "me"}]

# Add the special case to the tokenizer of the NLP object
nlp.tokenizer.add_special_case("gimme", special_case)

# Tokenize the text "gimme that" and print the resulting tokens
print([w.text for w in nlp("gimme that")])


['gim', 'me', 'that']


`nlp.tokenizer.add_special_case` is a method of spaCy's tokenizer (`nlp.tokenizer`) that adds a special-case tokenization rule for a specific string. This method takes two arguments:

- The first argument is the text you want to add a special case for.
- The second argument is the special case tokenization you want to apply to the text. This is defined as a list of dictionaries, where each dictionary represents a token and maps the ORTH key to the text of the token.
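
The sketch below repeats the rule on a blank English pipeline (`nlp_demo`, a name used only here, assuming the default tokenizer rules) to illustrate two points: the special case is still applied when punctuation is attached to the word, and the pieces are expected to spell out the original string, so the document text is unchanged.

```python
import spacy
from spacy.symbols import ORTH

# Assumption: default English tokenizer rules, no trained model required
nlp_demo = spacy.blank("en")
nlp_demo.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

# The rule still applies after the trailing punctuation has been split off
tokens = [t.text for t in nlp_demo("gimme! that")]

# "gim" + "me" spells "gimme", so the text of the Doc is the same as the input
rebuilt = nlp_demo("gimme that").text
print(tokens, rebuilt)
```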

# <font color = 'indianred'>**Custom extensions for tokens**</font>

In [None]:
# we need to import Token class to set custom extension
from spacy.tokens import Token
doc = nlp(
    "My email is harpreet@utdallas.edu and my url is https://j.u.edu/faculty/hs.")


In [None]:
# Define the extension attribute on the token level with name as "clean" and default value as False
Token.set_extension('clean', default=False, force=True)


In [None]:
# Printing each token on the doc object and the stored value by the extension attribute.
# All the values default to 'False'
print(f'{"token.text":<27} : {"token._.clean"}')
for token in doc:
    print(f'{token.text:<27} : {token._.clean}')


token.text                  : token._.clean
My                          : False
email                       : False
is                          : False
harpreet@utdallas.edu       : False
and                         : False
my                          : False
url                         : False
is                          : False
https://j.u.edu/faculty/hs  : False
.                           : False


In [None]:
# Change the value of custom extension (clean) to True if it is not a punctuation, url or email
for token in doc:
    if not (token.is_punct or token.like_url or token.like_email):
        token._.set('clean', True)


In [None]:
# Printing the tokens again to see the modified values.
print(f'{"token.text":<27} : {"token._.clean"}')
for token in doc:
    print(f'{token.text:<27} : {token._.clean}')


token.text                  : token._.clean
My                          : True
email                       : True
is                          : True
harpreet@utdallas.edu       : False
and                         : True
my                          : True
url                         : True
is                          : True
https://j.u.edu/faculty/hs  : False
.                           : False


In [None]:
[token.text for token in doc if token._.clean]


['My', 'email', 'is', 'and', 'my', 'url', 'is']
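
As an alternative to setting the value in a loop, an extension can be registered with a `getter` function so that the value is computed whenever it is read. The sketch below is one possible variant, not the approach used above; `is_clean`, `nlp_demo`, and `doc_demo` are hypothetical names, and a blank English pipeline with the default tokenizer is assumed.

```python
import spacy
from spacy.tokens import Token

def is_clean(token):
    """True for tokens that are not punctuation, URLs, or email addresses."""
    return not (token.is_punct or token.like_url or token.like_email)

# 'is_clean' is a hypothetical attribute name chosen for this sketch
Token.set_extension("is_clean", getter=is_clean, force=True)

# Assumption: default English tokenizer rules, no trained model required
nlp_demo = spacy.blank("en")
doc_demo = nlp_demo(
    "My email is harpreet@utdallas.edu and my url is https://j.u.edu/faculty/hs.")

clean_tokens = [token.text for token in doc_demo if token._.is_clean]
print(clean_tokens)
```

With a getter there is no separate update step, so the attribute is available on every new `Doc` without rerunning a loop.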