In [1]:
# https://docs.python.org/3/howto/regex.html

# "Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming 
# language embedded inside Python and made available through the re module. Using this little language, you specify the 
# rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, 
# or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or 
# “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in 
# various ways."

In [2]:
# Note: 
#    - alphanumeric here implies 0-9, a-z, A-Z, or _
#    - a word is defined as a sequence of alphanumeric characters

# metacharacters
#    [ ]      matches any one character from the character class specified within the square brackets
#             - and ^ have special meaning within a character class
#             $ does not have special meaning within a character class
#     -      when used inside a character class, implies a range of characters
#    ^      when used as the first character inside a character class, matches the complement of that class
#    \       is used either to strip a metacharacter of its special meaning, or to signal a special sequence
#    .        matches any character except a newline
#    *       previous item is matched 0 or more times
#    +      previous item is matched 1 or more times
#    ?       previous item is matched 0 or 1 times
#    { }     {m,n} means there must be at least m repetitions, and at most n
#           {0,} is the same as *, {1,} is equivalent to +, and {0,1} is the same as ?
#    ^     when used outside a character class, matches at the beginning of a line
#    \A    matches only at the start of a string (equivalent to ^ in non-MULTILINE mode)
#    $     matches at the end of a line
#    \Z    matches only at the end of a string (like $ in non-MULTILINE mode, except that $ also matches just before a trailing newline)
#    \b    matches only at the beginning or end of a word (that is, at a word boundary)
#    \B    matches only when not at the beginning or end of a word (that is, not at a word boundary)
#    |      matches the expression on either side of the | operator
#    ( )    used to group together the expressions contained inside;
#           you can then repeat the contents of a group with a repeating qualifier, such as *, +, ?, or {m,n}

# special sequences (all of these can also be used inside a character class)
#    \d    matches any digit character; equivalent to [0-9]
#    \D    matches any non-digit character; equivalent to [^0-9]
#    \s     matches any whitespace character; equivalent to [ \t\n\r\f\v] => space, tab, newline, carriage return, form feed, vertical tab
#    \S     matches any non-whitespace character; equivalent to [^ \t\n\r\f\v]
#    \w    matches any alphanumeric character; equivalent to [0-9a-zA-Z_]
#    \W    matches any non-alphanumeric character; equivalent to [^0-9a-zA-Z_]
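
A minimal sketch of a few of the metacharacters and special sequences listed above. The sample strings and variable names are made up for illustration and are not part of the original notes.

```python
import re

sample = "a ab abbb"

# *, +, ? and {m,n} control how many times the preceding item may repeat
star = re.findall(r"ab*", sample)         # 'a' followed by zero or more 'b'
plus = re.findall(r"ab+", sample)         # 'a' followed by one or more 'b'
optional = re.findall(r"ab?", sample)     # 'a' followed by at most one 'b'
bounded = re.findall(r"ab{2,3}", sample)  # 'a' followed by two or three 'b'
print(star, plus, optional, bounded)

# | is alternation; ( ) groups items so that a quantifier applies to the whole group
either = re.findall(r"cat|dog", "cat dog cow")
repeated_group = re.fullmatch(r"(ha)+", "hahaha") is not None
print(either, repeated_group)

# special sequences work on their own or inside a character class
digits = re.findall(r"\d+", "room 101, floor 3")
words = re.findall(r"\w+", "snake_case, 2 words")
version = re.findall(r"[\d.]+", "v1.25 beta")
print(digits, words, version)
```

Note that `\w` treats the underscore as a word character, which is why `snake_case` comes back as a single match.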

# Raw Strings
# Regular expressions use the backslash character ('\') to indicate special forms or to allow 
#    special characters to be used without invoking their special meaning. 
# This conflicts with Python’s usage of the same character for the same purpose in string literals.
# The solution is to use Python’s raw string notation for regular expressions.
# This is done by prefixing the string literal that holds the pattern with r, as in r"..."
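
To see why raw strings matter, here is a small sketch comparing the same pattern written as a normal string literal, as a raw string, and with doubled backslashes. The sample text is an assumption chosen for illustration.

```python
import re

text = "the cat sat"

# In a normal string literal, "\b" is the backspace character (ASCII 8),
# so this pattern looks for backspace + "cat" + backspace and finds nothing.
plain = re.findall("\bcat\b", text)

# In a raw string the backslash reaches the regex engine untouched,
# where \b means "word boundary".
raw = re.findall(r"\bcat\b", text)

# Without a raw string, every backslash has to be doubled.
doubled = re.findall("\\bcat\\b", text)

print(plain, raw, doubled)
print(len("\b"), len(r"\b"))  # one character vs. two characters
```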

In [3]:
# Regular Expressions are compiled into pattern objects:
#    import re
#    regex = re.compile(pattern, options)
#        - pattern: created using metacharacters and special sequences
#        - options: can be re.IGNORECASE, re.VERBOSE, etc

# Once a pattern object is created, you can use one of several methods on it to look for matches
# match(): returns a match object if the pattern matches at the beginning of the string, otherwise None
# search(): returns a match object for the first location in the string where the pattern matches, otherwise None
# findall(): finds all substrings where the pattern matches, and returns them as a list of strings
# finditer(): finds all substrings where the pattern matches, and returns an iterator of match objects

# Once a match object is created, you can query it for information about the matching string
# group(): returns the string matched by the pattern
# start(): returns the starting position of the match
# end(): returns the ending position of the match
# span(): returns a tuple containing the (start, end) positions of the match
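
The search methods and the match-object queries above can be sketched as follows. The pattern and the sample string are assumptions chosen for illustration.

```python
import re

regex = re.compile(r"\d+")
s = "lunch at 11am, dinner at 7pm"

at_start = regex.match(s)    # None here: the string does not start with a digit
first = regex.search(s)      # match object for the first run of digits
all_hits = regex.findall(s)  # plain list of the matched strings
spans = [m.span() for m in regex.finditer(s)]  # finditer yields match objects

print(at_start)
print(first.group(), first.start(), first.end(), first.span())
print(all_hits)
print(spans)
```

Because `match()` and `search()` return `None` when nothing matches, it is usually safest to test the result before calling `group()` on it.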

# Once a pattern object is created, you can also use the following methods to modify strings
# split(string[, maxsplit=0]): 
#               split the string into a list, splitting wherever the pattern matches 
#               if maxsplit is non-zero, at most maxsplit splits are performed (otherwise all splits are done)
# sub(replacement, string[, count=0]): 
#               find all substrings where the pattern matches, and replace them with a different string
#               if count is non-zero, at most count replacements are performed (otherwise all replacements are done)
# subn(): same as sub, but returns new string and number of replacements
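
A short sketch of the string-modifying methods described above. The separator pattern and the sample string are assumptions chosen for illustration.

```python
import re

# a comma or semicolon, followed by optional whitespace
regex = re.compile(r"[,;]\s*")
s = "a, b;c,  d"

parts = regex.split(s)                   # split at every match
two_splits = regex.split(s, maxsplit=2)  # stop after two splits
replaced = regex.sub(" | ", s)           # replace every match
replaced_once = regex.sub(" | ", s, count=1)  # replace only the first match
replaced_n = regex.subn(" | ", s)        # (new string, number of replacements)

print(parts)
print(two_splits)
print(replaced)
print(replaced_once)
print(replaced_n)
```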

In [4]:
import re
import string

In [5]:
oldstr = "Hey! Are we still on for lunch today at 11am?"

# first create an RE pattern object that matches any of the characters you'd like to replace;
# then replace all matched characters with '#'
regex = re.compile(r"[a!1]")
newstr = regex.sub('#', oldstr)
print (oldstr)
print (newstr)

Hey! Are we still on for lunch today at 11am?
Hey# Are we still on for lunch tod#y #t ###m?


In [6]:
oldstr = "Hey! Are we still on for lunch today at 11am?"

# create an RE pattern object that matches any character from a-z;
# then replace all matched characters with '#'
regex = re.compile(r"[a-z]")
newstr = regex.sub('#', oldstr)
print (oldstr)
print (newstr)

Hey! Are we still on for lunch today at 11am?
H##! A## ## ##### ## ### ##### ##### ## 11##?


In [7]:
oldstr = "Hey! Are we still on for lunch today at 11am?"

# create an RE pattern object that matches any character from a-z or A-Z;
# then replace all matched characters with '#'
regex = re.compile(r"[a-zA-Z]")
newstr = regex.sub('#', oldstr)
print (oldstr)
print (newstr)

Hey! Are we still on for lunch today at 11am?
###! ### ## ##### ## ### ##### ##### ## 11##?


In [8]:
oldstr = "Hey! Are we still on for lunch today at 11am?"

# create an RE pattern object that matches any character from a-z, and make the match case-insensitive (re.IGNORECASE);
# then replace all matched characters with '#'
regex = re.compile(r"[a-z]", re.IGNORECASE)
newstr = regex.sub('#', oldstr)
print (oldstr)
print (newstr)

Hey! Are we still on for lunch today at 11am?
###! ### ## ##### ## ### ##### ##### ## 11##?


In [9]:
oldstr = "Hey! Are we still on for lunch today at 11am?"

# create an RE pattern object that matches any digit 0-9;
# then replace all matched characters with '#'
regex = re.compile(r"[\d]")
#regex = re.compile(r"[0-9]")
newstr = regex.sub('#', oldstr)
print (oldstr)
print (newstr)

Hey! Are we still on for lunch today at 11am?
Hey! Are we still on for lunch today at ##am?


In [10]:
oldstr = "Hey! Are we still on for lunch today at 11am?"

# create an RE pattern object that matches the complement of a-zA-Z (that is, any character that is not a letter);
# then replace all matched characters with '#'
regex = re.compile(r"[^a-zA-Z]")
newstr = regex.sub('#', oldstr)
print (oldstr)
print (newstr)

Hey! Are we still on for lunch today at 11am?
Hey##Are#we#still#on#for#lunch#today#at###am#


In [11]:
oldstr = "Hey! Are we still on for lunch today at 11am?"

# create an RE pattern object that matches the complement of 0-9 (that is, any character that is not a digit);
# then replace all matched characters with '#'
regex = re.compile(r"[^\d]")
#regex = re.compile(r"[^0-9]")
newstr = regex.sub('#', oldstr)
print (oldstr)
print (newstr)

Hey! Are we still on for lunch today at 11am?
########################################11###


In [12]:
oldstr = "Hey! Are we still on for lunch today at 11am?"

# create an RE pattern object that matches any whitespace character;
# then replace all matched characters with '#'
regex = re.compile(r"[\s]")
newstr = regex.sub('#', oldstr)
print (oldstr)
print (newstr)

Hey! Are we still on for lunch today at 11am?
Hey!#Are#we#still#on#for#lunch#today#at#11am?


In [13]:
oldstr = "Why Lisa, why, WHY"
print (oldstr)

# create an RE pattern object that matches (case-insensitive) "why" anywhere in the string;
# then replace all matches with '#'
regex1 = re.compile(r"why", re.IGNORECASE)
newstr1 = regex1.sub('#', oldstr)
print (newstr1)

# create an RE pattern object that matches (case-insensitive) "why" at the beginning of the string;
# then replace all matches with '#'
regex2 = re.compile(r"^why", re.IGNORECASE)
newstr2 = regex2.sub('#', oldstr)
print (newstr2)

# create an RE pattern object that matches (case-insensitive) "why" at the end of the string;
# then replace all matches with '#'
regex3 = re.compile(r"why$", re.IGNORECASE)
newstr3 = regex3.sub('#', oldstr)
print (newstr3)

Why Lisa, why, WHY
# Lisa, #, #
# Lisa, why, WHY
Why Lisa, why, #
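
The anchors behave differently once a string contains newlines. The sketch below, using a made-up two-line string, contrasts `^` with `\A` under `re.MULTILINE`, and `$` with `\Z` when the string ends in a newline.

```python
import re

text = "why now\nwhy not\n"

# default mode: ^ matches only at the very start of the string
default_hits = re.findall(r"^why", text)

# MULTILINE mode: ^ also matches right after each newline
multiline_hits = re.findall(r"^why", text, re.MULTILINE)

# \A is unaffected by MULTILINE and still matches only at the start of the string
start_only = re.findall(r"\Awhy", text, re.MULTILINE)

# $ matches at the end of the string or just before a trailing newline;
# \Z matches only at the very end of the string
dollar_matches = re.search(r"not$", text) is not None
z_matches = re.search(r"not\Z", text) is not None

print(default_hits, multiline_hits, start_only)
print(dollar_matches, z_matches)
```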


In [14]:
oldstr = "the cat will catch-up with you in muscat"
print (oldstr)

# first create an RE pattern object that matches "cat" anywhere;
# then replace all matches with '#'
regex1 = re.compile(r"cat")
newstr1 = regex1.sub('#', oldstr)
print (newstr1)

# first create an RE pattern object that matches "cat" with a word boundary at both the beginning and the end;
# then replace all matches with '#'
regex2 = re.compile(r"\bcat\b")
newstr2 = regex2.sub('#', oldstr)
print (newstr2)

# first create an RE pattern object that matches "cat" with a word boundary at the beginning;
# then replace all matches with '#'
regex3 = re.compile(r"\bcat")
newstr3 = regex3.sub('#', oldstr)
print (newstr3)

# first create an RE pattern object that matches "cat" with a word boundary at the end;
# then replace all matches with '#'
regex4 = re.compile(r"cat\b")
newstr4 = regex4.sub('#', oldstr)
print (newstr4)

the cat will catch-up with you in muscat
the # will #ch-up with you in mus#
the # will catch-up with you in muscat
the # will #ch-up with you in muscat
the # will catch-up with you in mus#


In [15]:
# Exercise 1: 
#    - find file names of the form base.extension  
#    - and print the file names

fnamestr = "The two files are foo1.bar and foo2.bar. There are no other files."
print (fnamestr)

regex = re.compile(r"\b\w+[.]\w+\b")
fnames = regex.findall(fnamestr)
print (fnames)

The two files are foo1.bar and foo2.bar. There are no other files.
['foo1.bar', 'foo2.bar']
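
As a possible extension of Exercise 1, the grouping metacharacters `( )` can separate the base name from the extension. This is a sketch only; the variable names `pairs` and `described` are introduced here for illustration.

```python
import re

fnamestr = "The two files are foo1.bar and foo2.bar. There are no other files."

# group 1 captures the base name, group 2 captures the extension
regex = re.compile(r"\b(\w+)[.](\w+)\b")

# when the pattern contains groups, findall returns tuples of the group contents
pairs = regex.findall(fnamestr)
print(pairs)

# finditer gives match objects: group(0) is the whole match, group(1) and group(2) the parts
described = [(m.group(0), m.group(1), m.group(2)) for m in regex.finditer(fnamestr)]
print(described)
```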


In [16]:
# Exercise 2:
#     - find punctuation characters and digits
#     - replace them with the empty string

oldstr = "Hey! Are we still on for lunch today at 11am?"
print (oldstr)

regex1 = re.compile(r"[%s]" % string.punctuation)
newstr1 = regex1.sub('',oldstr)
print (newstr1)

regex1 = re.compile(r"[%s%s]" % (string.punctuation,string.digits))
newstr1 = regex1.sub('',oldstr)
print (newstr1)

Hey! Are we still on for lunch today at 11am?
Hey Are we still on for lunch today at 11am
Hey Are we still on for lunch today at am
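
Interpolating `string.punctuation` straight into a character class works in Exercise 2 because of how the characters happen to be ordered: the backslash sits directly before `]`, and the `-` forms a valid range between `,` and `.`. A more defensive variant, sketched below with illustrative variable names, passes the characters through `re.escape` first.

```python
import re
import string

oldstr = "Hey! Are we still on for lunch today at 11am?"

# re.escape backslash-escapes characters that are special in a pattern,
# so the punctuation can be dropped into a character class safely
punct_class = "[" + re.escape(string.punctuation) + "]"
no_punct = re.sub(punct_class, "", oldstr)
print(no_punct)

# the same class extended with \d to strip digits as well
punct_digit_class = "[" + re.escape(string.punctuation) + r"\d]"
no_punct_digits = re.sub(punct_digit_class, "", oldstr)
print(no_punct_digits)
```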


# spaCy

- https://spacy.io/
- https://spacy.io/usage
- https://spacy.io/models/en
- https://spacy.io/api/doc
- https://spacy.io/api/token
- https://spacy.io/usage/processing-pipelines
- https://spacy.io/usage/spacy-101



- **spaCy** is a free, open-source library for advanced industrial-strength Natural Language Processing (NLP) in Python.

- When you call spaCy on a text, spaCy first tokenizes the text (i.e. segments it into words, punctuation and so on) to produce a Doc object.
  spaCy uses rules specific to each language for tokenization.

- The Doc object is then processed in several different steps (also referred to as the processing pipeline).
  The pipeline used by the default models consists of a part-of-speech (pos) tagger, a (dependency) parser and a (named) entity recognizer (ner).
  spaCy uses statistical models to predict pos tags, syntactic dependencies, and named entities.
  Each pipeline component returns the processed Doc, which is then passed on to the next component.
  You can pick and choose the stages you want spaCy to load.

- Here is a list of features and capabilities of spaCy: 
  https://spacy.io/usage/spacy-101#features

### installation

#### https://spacy.io/usage
- `pip install spacy`

#### https://spacy.io/models/en
You can download these general-purpose pretrained models to predict
pos tags (tagger), named entities (ner), and syntactic dependencies (parser).

Note: en_core_web_sm does not include **word vectors**, but en_core_web_md and en_core_web_lg do.
> `python -m spacy download en_core_web_sm` <br>
> `python -m spacy download en_core_web_md` <br>
> `python -m spacy download en_core_web_lg`



In [4]:
import spacy

In [6]:
# once you’ve downloaded and installed a model, you can load it via spacy.load(),
# or by importing the model package and calling its load() function, as done here.
# either way you get back a Language object containing all components and data needed to process text.

# the Language object is typically called nlp. 
import en_core_web_sm
nlp = en_core_web_sm.load()

In [8]:
# calling the nlp object on a string of text will return a processed Doc object. the Doc object is typically called doc.
# even though a Doc object is processed (for instance, split into individual words and annotated),
# it still holds all information of the original text.
# once the doc object has been created, we can use it to access the various spaCy features.
doc = nlp("Hi Emma Watson! How are you?")

In [9]:
# for instance, you can iterate over individual sentences in the document.
for s in doc.sents:
    print (s.text)

Hi Emma Watson!
How are you?


In [12]:
# you can iterate over the named entities in the document (from ner)
# a named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title.
for e in doc.ents:
    print (e.text)
    print (e.label_)
    print (spacy.explain(e.label_))

Watson
PERSON
People, including fictional


In [13]:
# you can visualize the named entities 
spacy.displacy.render(doc, style='ent',jupyter=True)

In [14]:
# you can also visualize the dependencies (from parser)
spacy.displacy.render(doc, style="dep", jupyter= True)

In [15]:
# you can iterate over the base noun chunks in the document.
# noun chunks are “base noun phrases”: a noun plus the words describing the noun.
# for instance, “the lavish green grass” or “the world’s largest tech fund”.
for c in doc.noun_chunks:
    print (c.text)

Hi Emma Watson
you


In [16]:
# you can iterate over the linguistic annotations associated with tokens in the document (from tagger)
# https://spacy.io/api/annotation
# https://spacy.io/api/token#attributes
doc = nlp("Hi Emma Watson! How are you?")
for token in doc:
    print (token.i,                         # index of the token within the parent document
           token,
           token.text,                      # verbatim text
           token.ent_type_,                 # named entity type
           spacy.explain(token.ent_type_),
           token.lemma_,                    # base form of the token, with no inflectional suffixes
           token.pos_,                      # coarse-grained part-of-speech
           spacy.explain(token.pos_),
           token.tag_,                      # fine-grained part-of-speech
           spacy.explain(token.tag_),
           token.dep_,                      # syntactic dependency relation
           token.like_url,                  # does the token resemble a URL
           token.like_num,                  # does the token represent a number? e.g. “10.9”, “10”, “ten”, etc
           token.like_email,                # does the token resemble an email address
           token.is_stop,                   # is the token part of a “stop list”
           token.is_alpha,
           token.is_ascii,
           token.is_digit,
           token.is_lower,
           token.is_upper,
           token.is_title,
           token.is_punct,
           token.is_space,
           token.is_currency
           )

0 Hi Hi  None hi INTJ interjection UH interjection compound False False False False True True False False False True False False False
1 Emma Emma  None Emma PROPN proper noun NNP noun, proper singular compound False False False False True True False False False True False False False
2 Watson Watson PERSON People, including fictional Watson PROPN proper noun NNP noun, proper singular ROOT False False False False True True False False False True False False False
3 ! !  None ! PUNCT punctuation . punctuation mark, sentence closer punct False False False False False True False False False False True False False
4 How How  None how ADV adverb WRB wh-adverb advmod False False False True True True False False False True False False False
5 are are  None be AUX auxiliary VBP verb, non-3rd person singular present ROOT False False False True True True False True False False False False False
6 you you  None -PRON- PRON pronoun PRP pronoun, personal nsubj False False False True True True False

In [17]:
# you can make semantic similarity estimates based on word vectors.
# the default estimate is cosine similarity, using an average of word vectors for the document.
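# it returns a scalar similarity score (higher is more similar).
# note: en_core_web_sm does not include word vectors, so scores from that model may be less meaningful;
#       en_core_web_md and en_core_web_lg do include them.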
doc1 = nlp("I like oranges that are sweet.")
# print (doc1.vector) # doc vector is average of token vectors
doc2 = nlp("I like apples that are sour.")
# print (doc2.vector) # doc vector is average of token vectors
doc1.similarity(doc2)


0.9100590601885606

In [19]:
# processing large corpora with nlp.pipe

# let's say you had a very large corpus of text
# illustrated with a very small corpus below :)
data = ["Amy is going to class now.",
        "Matt is having lunch."]

# first, you'll only want to apply the pipeline components you need:
# getting predictions from the model that you don’t actually need adds up and becomes very inefficient at scale. 
# to prevent this, use the disable keyword argument to disable components you don’t need.
nlp = en_core_web_sm.load(disable=['parser', 'ner'])
# nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

# and second, you'll want to work on batches of texts.
# this can be done with spaCy’s nlp.pipe method which takes an iterable of texts and yields processed Doc objects. 
# the batching is done internally.
corpus = nlp.pipe(data)

# now we can clean the corpus efficiently
def custom_tokenizer(doc):
    tokens = [token.lemma_.lower()
              for token in doc
              if not token.is_stop and not token.is_punct]
    return ' '.join(tokens)

clean_corpus = [custom_tokenizer(doc) for doc in corpus]
clean_corpus

['amy go class', 'matt have lunch']

---

# Exercise

#### Question 1
Lowercase the text.
- `text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."`

In [20]:
text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."
text.lower()

'yes, that is a duplicate catalog category. the catalog number is c1357-a.'

#### Question 2
Substitute the pattern `"cat"` with replacement `"#"` in the text.
- make the substitution case sensitive,
- and match the pattern wherever it occurs in the text
  - `text = "Yes, that is a duplicate Catalog category. The Catalog number is C1357-A."`

In [21]:
import re

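# note: `text` still holds the all-lowercase string from Question 1, so every "cat" is replaced below;
# with the capitalized "Catalog" from the prompt, this case-sensitive pattern would leave "Catalog" untouched
regex = re.compile(r"cat")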
newtext = regex.sub('#', text)
print(newtext)

Yes, that is a dupli#e #alog #egory. The #alog number is C1357-A.


#### Question 3
Substitute the pattern "cat" with replacement "#" in the text
- make the substitution case insensitive (re.IGNORECASE)
- and match the pattern wherever it occurs in the text
  - `text = "Yes, that is a duplicate Catalog category. The Catalog number is C1357-A."`

In [22]:
regex = re.compile(r"cat", re.IGNORECASE)
newtext = regex.sub('#', text)
print(newtext)

Yes, that is a dupli#e #alog #egory. The #alog number is C1357-A.


#### Question 4
Substitute the pattern "cat" with replacement "#" in the text
- make the substitution case insensitive
- only match if the pattern is at the beginning of a word (use the word boundary \b)
  - `text = "Yes, that is a duplicate Catalog category. The Catalog number is C1357-A."`

In [23]:
regex = re.compile(r"\bcat", re.IGNORECASE)
newtext = regex.sub('#', text)
print(newtext)

Yes, that is a duplicate #alog #egory. The #alog number is C1357-A.


#### Question 5
Substitute the characters 'c', 'a', 't' with replacement "#" in the text
- make the substitution case insensitive
- hint: use character class r"[cat]"
  - `text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."`

In [24]:
regex = re.compile(r"[cat]", re.IGNORECASE)
newtext = regex.sub('#', text)
print(newtext)

Yes, #h## is # dupli###e ####log ###egory. #he ####log number is #1357-#.


#### Question 6
Substitute all letters with replacement "#" in the text
- make the substitution case insensitive
- hint: use character class with range [a-z]
  - `text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."`

In [25]:
regex = re.compile(r"[a-z]", re.IGNORECASE)
newtext = regex.sub('#', text)
print(newtext)

###, #### ## # ######### ####### ########. ### ####### ###### ## #1357-#.


#### Question 7
Substitute all digits with replacement "#" in the text
- hint: use character class with range [0-9], or special sequence \d = [0-9]
  - `text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."`

In [26]:
regex = re.compile(r"[0-9]")
newtext = regex.sub('#', text)
print(newtext)

Yes, that is a duplicate catalog category. The catalog number is C####-A.


#### Question 8
Substitute each run of one or more whitespace characters with replacement " " in `spacetext`
- hint: use special sequence \s for whitespace characters, and metacharacter + for one-or-more-times
  - `spacetext = "Yes, that is a duplicate catalog \t category. The catalog number is C1357-A.\n"`

In [31]:
spacetext = "Yes, that is a duplicate catalog \t category. The catalog number is C1357-A.\n"

regex = re.compile(r"\s+")
newtext = regex.sub(' ', spacetext)
# print(spacetext)
print(newtext)

Yes, that is a duplicate catalog category. The catalog number is C1357-A. 


#### Question 9
Substitute words that are two or more alphanumeric characters long
with replacement "#"
- use special sequence \w for alphanumeric characters [0-9a-zA-Z_],
- special sequence \b for word boundaries,
- and metacharacter + for one-or-more-times, or {2,} for two or more times
  - `text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."`

In [32]:
text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."

regex = re.compile(r"[\w]{2,}")
newtext = regex.sub('#', text)
print(newtext)

#, # # a # # #. # # # # #-A.


#### Question 10
Find all words that are two or more characters long
- hint: use regex.findall(text)
  - `text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."`

In [34]:
regex = re.compile(r"\b\w{2,}\b")
newtext = regex.findall(text)
print(newtext)

['Yes', 'that', 'is', 'duplicate', 'catalog', 'category', 'The', 'catalog', 'number', 'is', 'C1357']


#### Question 11 (@20:00)
Find all the urls in urltext
- take care of http vs https
- hint: use metacharacter ? for zero-or-one-times, metacharacter + for one-or-more-times, and \S for non-whitespace characters
  - `urltext = """The url for sklearn documentation …"""`