Library installs


In [None]:
!pip install textacy

Note: The following “factoids” may be biased. That is why we refer to them as “factoids.”

Also, quite a bit of this code relies on packaged functions from third-party libraries. The author, being a certified crook, does not really understand what the big deal is with writing your own code.

NLP factoids
1. NLTK 
  * NLTK is a string-processing library. Its tools take strings as input and return strings or lists of strings as output.
  * NLTK is a good choice if you want to explore different NLP techniques with a corpus of fewer than a million words.
  * NLTK is a bad choice if you want to go into production with your NLP application.

2. Regex
  * Regex is pervasive throughout our text-preprocessing code. Regex is a fast string processor.

3. spaCy
  * spaCy is a moderate choice if you want to research different NLP models with a corpus of more than a million words.
  * If you use a selection from spaCy, Hugging Face, fast.ai, and GPT-3, then you are performing SOTA (state-of-the-art) research of different NLP models (my opinion at the time of writing this blog).
  * spaCy is a good choice if you want to go into production with your NLP application.
  * spaCy is an NLP library implemented in both Python and Cython. Because of the Cython, parts of spaCy are faster than if implemented in pure Python.
  * spaCy is the fastest package we know of for NLP operations.
  * spaCy is available for MS Windows, macOS, and Ubuntu.
  * spaCy can run on Nvidia GPUs.
  * explosion/spaCy has 16,900 stars on GitHub (7/22/2020).
  * spaCy has 138 public repository implementations on GitHub.
  * spaCy comes with pre-trained statistical models and word vectors.
  * spaCy transforms text into document objects, vocabulary objects, word-token objects, and other useful objects resulting from parsing the text.
  * The Doc class has several useful attributes and methods. Significantly, you can create new operations on these objects as well as extend a class with new attributes (adding to the spaCy pipeline).
  * spaCy features tokenization for 50+ languages.


In [9]:
from typing import Pattern
import re

For practice and timing purposes, we are going to create a nonsensical text string of absolutely no use.

In [2]:
MULTIPLIER = int(3.8e3)
text_l = 300

In [3]:
%%time
long_s = ':( 😻 😈   #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 '
long_s += '  888 eihtg DoD Fee https://medium.com/ #hash ##   Document Title</title> '
long_s += ':( cat- \n nip'
long_s += ' immed- \n natedly <html><h2>2nd levelheading</h2></html>  . , '
long_s +=  '# bhc@gmail.com  f@z.yx  can\'t Be  a ckunk. $4 $123,456 won\'t seven '
long_s +=' $Shine $$beighty?$ '

long_s *= MULTIPLIER
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

size: 1.159e+06 :( 😻 😈   #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508   888 eihtg DoD Fee https://medium.com/ #hash ##   Document Title</title> :( cat- 
 nip immed- 
 natedly <html><h2>2nd levelheading</h2></html>  . , # bhc@gmail.com  f@z.yx  can't Be  a ckunk. $4 $123,456 won't seven  $Shine $$beigh
CPU times: user 922 µs, sys: 2.03 ms, total: 2.95 ms
Wall time: 2.84 ms
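
The `%time` and `%%time` magics only exist inside IPython/Jupyter. If you want a comparable wall-clock measurement in a plain Python script, a minimal sketch could look like the one below. The helper name `timed` and the short stand-in string are assumptions made for illustration; unlike `%time`, this reports wall time only, not CPU time.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds) using a wall clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# A much shorter stand-in for long_s, built by repetition in the same way.
sample = ":( #google +1 608-444-0000 " * 1000

# Time any callable; str.upper is only a placeholder operation here.
upper, seconds = timed(str.upper, sample)
print("size: {:g} time: {:.6f} s".format(len(upper), seconds))
```

A single run is noisy at the millisecond scale, so repeat the call a few times before comparing two approaches.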


1. NLP text preprocessing: Replace Twitter Hash Tags

In [6]:
from textacy.preprocessing.replace import replace_hashtags
%time text = replace_hashtags(long_s,replace_with='_HASH_')
print('size: {:g} {}'.format(len(text),text[:text_l]))

CPU times: user 91 ms, sys: 5.79 ms, total: 96.8 ms
Wall time: 98.4 ms
size: 1.159e+06 :( 😻 😈   _HASH_ +1 608-444-0000 08-444-0004 608-444-00003 ext. 508   888 eihtg DoD Fee https://medium.com/ _HASH_ ##   Document Title</title> :( cat- 
 nip immed- 
 natedly <html><h2>2nd levelheading</h2></html>  . , # bhc@gmail.com  f@z.yx  can't Be  a ckunk. $4 $123,456 won't seven  $Shine $$beigh
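
To see roughly what a hashtag replacement does, here is a minimal sketch using only `re`. The pattern `SIMPLE_HASHTAG` is an assumption written for this illustration and is not the pattern textacy uses internally; it only treats a `#` followed by word characters as a hashtag, which happens to agree with the output above for the samples in our string.

```python
import re

# Hypothetical pattern (not textacy's): a '#' that does not follow a word
# character and is followed by at least one word character.
SIMPLE_HASHTAG = re.compile(r"(?<!\w)#\w+")

sample = "#google +1 #hash ## # bhc@gmail.com"
replaced = SIMPLE_HASHTAG.sub("_HASH_", sample)
print(replaced)
```

As in the textacy output, the bare `##` and the lone `#` are left alone because no word characters follow them.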


3. NLP text preprocessing: Replace Phone Numbers

Personal information such as phone numbers is protected by privacy laws, so we replace it with a placeholder rather than show it directly.

In [7]:
from textacy.preprocessing.replace import replace_phone_numbers
%time text = replace_phone_numbers(long_s,replace_with='_PHONE_')
print('size: {:g} {}'.format(len(text),text[:text_l]))

CPU times: user 175 ms, sys: 641 µs, total: 176 ms
Wall time: 175 ms
size: 1.1248e+06 :( 😻 😈   #google _PHONE_ 08-_PHONE_ 608-444-00003 ext. 508   888 eihtg DoD Fee https://medium.com/ #hash ##   Document Title</title> :( cat- 
 nip immed- 
 natedly <html><h2>2nd levelheading</h2></html>  . , # bhc@gmail.com  f@z.yx  can't Be  a ckunk. $4 $123,456 won't seven  $Shine $$beighty?$ :( 😻


Notice that 08-444-0004 was only partially replaced (it became 08-_PHONE_) and that 608-444-00003 ext. 508 was not transformed at all.

4. NLP text preprocessing: Replace Phone Numbers - better


In [21]:
RE_PHONE_NUMBER = re.compile(
    # core components of a phone number
    r"(?:^|(?<=[^\w)]))(\+?1[ .-]?)?(\(?\d{2,3}\)?[ .-]?)?(\d{2,3}[ .-]?\d{2,5})"
    # extensions, etc.
    r"(\s?(?:ext\.?|[#x-])\s?\d{2,6})?(?:$|(?=\W))",
    flags=re.UNICODE | re.IGNORECASE)
%time text = RE_PHONE_NUMBER.sub("_PHONE_",long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

CPU times: user 155 ms, sys: 3.79 ms, total: 159 ms
Wall time: 159 ms
size: 1.0564e+06 :( 😻 😈   #google _PHONE_ _PHONE_ _PHONE_   888 eihtg DoD Fee https://medium.com/ #hash ##   Document Title</title> :( cat- 
 nip immed- 
 natedly <html><h2>2nd levelheading</h2></html>  . , # bhc@gmail.com  f@z.yx  can't Be  a ckunk. $4 $123,456 won't seven  $Shine $$beighty?$ :( 😻 😈   #google _PHON
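
With a million-character string it is hard to see which numbers matched. A small sketch that runs the same pattern against each phone-like sample on its own makes this easier to check. The pattern is copied from the cell above so the snippet runs standalone; the names `samples` and `replaced` are introduced here for illustration only.

```python
import re

# Same pattern as in the cell above, repeated so this sketch is self-contained.
RE_PHONE_NUMBER = re.compile(
    # core components of a phone number
    r"(?:^|(?<=[^\w)]))(\+?1[ .-]?)?(\(?\d{2,3}\)?[ .-]?)?(\d{2,3}[ .-]?\d{2,5})"
    # extensions, etc.
    r"(\s?(?:ext\.?|[#x-])\s?\d{2,6})?(?:$|(?=\W))",
    flags=re.UNICODE | re.IGNORECASE)

# The three phone-like samples embedded in long_s.
samples = ["+1 608-444-0000", "08-444-0004", "608-444-00003 ext. 508"]

# Substitute each sample separately and show the before/after pair.
replaced = {s: RE_PHONE_NUMBER.sub("_PHONE_", s) for s in samples}
for original, result in replaced.items():
    print(f"{original!r} -> {result!r}")
```

Each sample should collapse to a single `_PHONE_`, matching the three placeholders in the output above.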


5. NLP text preprocessing: Remove Phone Numbers

In [23]:
%time text = RE_PHONE_NUMBER.sub("",long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

CPU times: user 159 ms, sys: 0 ns, total: 159 ms
Wall time: 160 ms
size: 976600 :( 😻 😈   #google      888 eihtg DoD Fee https://medium.com/ #hash ##   Document Title</title> :( cat- 
 nip immed- 
 natedly <html><h2>2nd levelheading</h2></html>  . , # bhc@gmail.com  f@z.yx  can't Be  a ckunk. $4 $123,456 won't seven  $Shine $$beighty?$ :( 😻 😈   #google      888 eihtg DoD Fee htt
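
Removing the phone numbers leaves runs of spaces behind (see the gap after #google in the output above). If that matters for your downstream steps, one possible follow-up is to collapse the runs. This is a sketch under assumptions: the names `RE_MULTI_SPACE` and `collapse_spaces` are hypothetical and not part of textacy, and newlines are left untouched on purpose.

```python
import re

# Hypothetical helper pattern: two or more consecutive spaces or tabs.
RE_MULTI_SPACE = re.compile(r"[ \t]{2,}")

def collapse_spaces(text: str) -> str:
    """Collapse runs of spaces/tabs to one space; newlines are preserved."""
    return RE_MULTI_SPACE.sub(" ", text).strip()

cleaned = collapse_spaces("#google      888 eihtg")
print(cleaned)
```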
