# Regular Expression

- https://docs.python.org/3/howto/regex.html

"Regular expressions (called *REs*, or *regexes*, or *regex* patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or 
“Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways."

## Note: 
   - **alphanumeric** here implies 0-9, a-z, A-Z, or _
   - a word is defined as **a sequence of alphanumeric characters**

### Metacharacters
-   `[ ]`      matches any single character from the character class specified within the square brackets
  - `-` and `^` have special meaning within a character class
  - `$` does not have special meaning within a character class
-   `-`      when used inside a character class, denotes a range of characters (*e.g. `[a-z]`, `[A-Z]`, `[0-9]`*)
-   `^`      when used as the first character inside a character class, matches the complement of that class (any character **not** listed)
-   `\`       is used either to escape a metacharacter, removing its special meaning, or to introduce a special sequence
-   `.`        matches any character except a newline (unless `re.DOTALL` is set)
-   `*`       the preceding character or group is matched **0 or more times**
-   `+`      the preceding character or group is matched **1 or more times**
-   `?`       the preceding character or group is matched **0 or 1 times**
-   `{ }`     `{m,n}` means there must be at least m repetitions, and at most n
  - `{0,}` is the same as `*`,
  - `{1,}` is equivalent to `+`, and 
  - `{0,1}` is the same as `?`
-   `^`     when used **outside** a character class, matches at the beginning of the string, and in MULTILINE mode also at the beginning of each line (*e.g. `^why` matches "why" only at the start; by contrast, `[^a-z]` matches any one character that is not in a-z*)
-   `\A`    matches only at the start of a string (equivalent to `^` in non-MULTILINE mode)
-   `$`     matches at the end of the string or just before a newline at the end of the string, and in MULTILINE mode also at the end of each line
-   `\Z`    matches only at the end of a string (unlike `$`, it does not match before a trailing newline)
-   `\b`    matches only at the beginning or end of a word (that is, **at a word boundary**) *(e.g. `\bcat\b` finds no match in "catch")*
-   `\B`    matches only when not at the beginning or end of a word (that is, **not at a word boundary**)
-   `|`      matches **either/or**: the expression on either side of the `|` operator
-   `( )`    used to **group** together the expressions contained inside; <br> you can then repeat the contents of a group with a repeating qualifier, such as `*`, `+`, `?`, or `{m,n}`

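A minimal sketch of several metacharacters from the list above, using only the standard `re` module. The sample strings and variable names are made up for illustration.

```python
import re

sample = "a ab abbb"

# quantifiers apply to the element just before them
star = re.findall(r"ab*", sample)         # b repeated 0 or more times
plus = re.findall(r"ab+", sample)         # b repeated 1 or more times
optional = re.findall(r"ab?", sample)     # b repeated 0 or 1 times
bounded = re.findall(r"ab{2,3}", sample)  # b repeated 2 to 3 times
print(star, plus, optional, bounded)
# ['a', 'ab', 'abbb'] ['ab', 'abbb'] ['a', 'ab', 'ab'] ['abbb']

# ^ inside a character class complements it; outside it anchors the match
not_lower = re.findall(r"[^a-z]", "why 42?")
at_start = re.findall(r"^why", "why, oh why")
print(not_lower, at_start)
# [' ', '4', '2', '?'] ['why']

# | is alternation; ( ) groups an expression so a quantifier covers all of it
either = re.findall(r"cat|dog", "cat, dog, cow")
repeated_group = re.fullmatch(r"(ab)+", "ababab") is not None
print(either, repeated_group)
# ['cat', 'dog'] True
```
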
### Special Sequences (all sequences can be included in a character set)
-   `\d`    matches any **digit** character; equivalent to `[0-9]`
-   `\D`    matches any **non-digit** character; equivalent to `[^0-9]`
-   `\s`     matches any **whitespace character**; equivalent to `[ \t\n\r\f\v]` => space, tab, newline, carriage return, form feed, vertical tab
-   `\S`     matches any **non-whitespace** character; equivalent to `[^ \t\n\r\f\v]`
-   `\w`    matches any **alphanumeric** character; equivalent to `[0-9a-zA-Z_]`
-   `\W`    matches any **non-alphanumeric** character; equivalent to `[^0-9a-zA-Z_]`

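The special sequences can be tried out on a short string. This is an illustrative sketch; the sample text is made up, and the last pattern shows a special sequence used inside a character class.

```python
import re

sample = "Room 42,\tFloor_3\n"

digits = re.findall(r"\d", sample)             # single digit characters
spaces = re.findall(r"\s", sample)             # single whitespace characters
word_runs = re.findall(r"\w+", sample)         # runs of [0-9a-zA-Z_]
non_space_runs = re.findall(r"\S+", sample)    # runs of non-whitespace
digits_or_space = re.findall(r"[\d\s]+", sample)  # sequences inside a class

print(digits)           # ['4', '2', '3']
print(spaces)           # [' ', '\t', '\n']
print(word_runs)        # ['Room', '42', 'Floor_3']
print(non_space_runs)   # ['Room', '42,', 'Floor_3']
print(digits_or_space)  # [' 42', '\t', '3\n']
```
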
### Raw Strings
- Regular expressions use the backslash character (`'\'`) to indicate special forms or to allow special characters to be used without invoking their special meaning. <br>
- This conflicts with Python’s usage of the same character for the same purpose in string literals.<br>
- The solution is to use Python’s raw string notation for regular expressions.<br>
- This is done by prefixing the string literal that holds the pattern with `r`, e.g. `r"\bcat\b"` (raw string notation)

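The difference is easy to see with `\b`, which Python itself reads as the backspace character in an ordinary string literal. A small sketch (variable names are illustrative):

```python
import re

# ordinary literal: Python turns "\b" into one backspace character
plain = "\bcat\b"
# raw literal: the backslash survives, so the regex engine sees \b (word boundary)
raw = r"\bcat\b"

print(len(plain), len(raw))      # 5 7

found_plain = re.findall(plain, "the cat sat")
found_raw = re.findall(raw, "the cat sat")
print(found_plain)               # []
print(found_raw)                 # ['cat']

# without a raw string, every backslash has to be doubled
same_pattern = ("\\bcat\\b" == raw)
print(same_pattern)              # True
```
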
In [1]:
# Regular Expressions are compiled into pattern objects:
#    import re
#    regex = re.compile(pattern, flags)
#        - pattern: created using metacharacters and special sequences
#        - flags: can be re.IGNORECASE, re.VERBOSE, etc. (combine several with |)

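As a sketch of compiling with options, the pattern below combines two flags with `|`. The pattern itself (an hour followed by am/pm) and the names `time_re` and `match` are made up for this example.

```python
import re

# re.VERBOSE allows whitespace and comments inside the pattern;
# re.IGNORECASE makes "am"/"pm" match in any case
time_re = re.compile(
    r"""
    (\d{1,2})     # hour: one or two digits
    (am|pm)       # meridiem
    """,
    re.VERBOSE | re.IGNORECASE,
)

match = time_re.search("Lunch today at 11AM?")
print(match.group())    # 11AM
print(match.groups())   # ('11', 'AM')
```
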
In [2]:
# Once a pattern object is created, you can use one of several methods on it to look for matches

# match(): determines if the pattern matches at the beginning of the string; returns a match object or None
# search(): determines if the pattern matches at any location in the string; returns a match object or None
# findall(): finds all substrings where the pattern matches, and returns them as a list
# finditer(): finds all substrings where the pattern matches, and returns an iterator of match objects

# Once a match object is created,  you can query the match object for information about the matching string
# group(): returns string matched by the pattern
# start(): return starting position of the match
# end(): return ending position of the match
# span(): return a tuple containing (start, end) position of the match

# Once a pattern object is created, you can also use the following methods to modify strings

# split(string[, maxsplit=0]): 
#               split the string into a list, splitting wherever the pattern matches 
#               if maxsplit is non-zero, at most maxsplit splits are performed (otherwise all splits are done)

# sub(replacement, string[, count=0]):  ### <- most often, take unstructured data and clean up
#               find all substrings where the pattern matches, and replace them with a different string
#               if count is non-zero, at most count replacements are performed (otherwise all replacements are done)

# subn(): same as sub, but returns new string and number of replacements

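The methods listed above can be exercised on one small pattern. This is a minimal sketch with a made-up sample string; the variable names are illustrative.

```python
import re

regex = re.compile(r"\d+")
text = "call 555 or 1234 now"

# searching
first = regex.search(text)             # first match anywhere in the string
at_start = regex.match(text)           # None: the string does not start with a digit
all_matches = regex.findall(text)
all_spans = [m.span() for m in regex.finditer(text)]
print(first.group(), first.start(), first.end(), first.span())  # 555 5 8 (5, 8)
print(at_start, all_matches, all_spans)
# None ['555', '1234'] [(5, 8), (12, 16)]

# modifying
pieces = regex.split(text)
one_split = regex.split(text, maxsplit=1)
replaced = regex.sub("#", text)
replaced_once = regex.sub("#", text, count=1)
replaced_counted = regex.subn("#", text)
print(pieces)            # ['call ', ' or ', ' now']
print(one_split)         # ['call ', ' or 1234 now']
print(replaced)          # call # or # now
print(replaced_once)     # call # or 1234 now
print(replaced_counted)  # ('call # or # now', 2)
```
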
In [3]:
# import
import re

#### Example 1 
replace Character ['a', '!', '1'] with hash '#'

In [4]:
# string
oldstr = "Hey! Are we still on for lunch today at 11am?"

# replace Character ['a', '!', '1'] with hash '#'
regex = re.compile(r"[a!1]")

# replace with '#'
newstr = regex.sub('#',oldstr)
print(newstr)

Hey# Are we still on for lunch tod#y #t ###m?


#### Example 2
convert [a-z] to hash '#'

In [5]:
oldstr = "Hey! Are we still on for lunch today at 11am?"

# convert [a-z] to hash '#'

regex = re.compile(r"[a-z]")

newstr = regex.sub('#',oldstr)

print(newstr)

H##! A## ## ##### ## ### ##### ##### ## 11##?


#### Example 3
convert all a-zA-Z characters to hash '#'

In [6]:
oldstr = "Hey! Are we still on for lunch today at 11am?"

# convert [a-zA-Z] to hash '#'

regex = re.compile(r"[a-zA-Z]")

newstr = regex.sub('#',oldstr)

print(newstr)

###! ### ## ##### ## ### ##### ##### ## 11##?


or,

In [7]:
oldstr = "Hey! Are we still on for lunch today at 11am?"

# convert [a-zA-Z] to hash '#', this time using [a-z] with re.IGNORECASE

regex = re.compile(r"[a-z]", re.IGNORECASE)

newstr = regex.sub('#',oldstr)

print(newstr)

###! ### ## ##### ## ### ##### ##### ## 11##?


#### Example 4
change all digits to hash '#'

In [8]:
# method 1
oldstr = "Hey! Are we still on for lunch today at 11am?"

regex = re.compile(r"[0-9]")

newstr = regex.sub('#',oldstr)

print(newstr)

Hey! Are we still on for lunch today at ##am?


In [9]:
# method 2
oldstr = "Hey! Are we still on for lunch today at 11am?"

# special sequences
regex = re.compile(r"[\d]")

newstr = regex.sub('#',oldstr)

print(newstr)

Hey! Are we still on for lunch today at ##am?


In [10]:
# variation: replace everything that is NOT a letter (digits, spaces, and punctuation marks)

oldstr = "Hey! Are we still on for lunch today at 11am?"

# complemented character class
regex = re.compile(r"[^a-zA-Z]")

newstr = regex.sub('#',oldstr)

print(newstr)

Hey##Are#we#still#on#for#lunch#today#at###am#


In [11]:
oldstr = "Hey! Are we still on for lunch today at 11am?"

# complement of a special sequence: everything except digits
regex = re.compile(r"[^\d]")

newstr = regex.sub('#',oldstr)

print(newstr)

########################################11###


In [12]:
oldstr = "Hey! Are we still on for lunch today at 11am?"

# special sequence \s: whitespace characters
regex = re.compile(r"[\s]")

newstr = regex.sub('#',oldstr)

print(newstr)

Hey!#Are#we#still#on#for#lunch#today#at#11am?


In [13]:
oldstr = "Why Lisa, why, WHY"

# replace every occurrence of "why" (case insensitive) with '#'
regex = re.compile(r"why", re.IGNORECASE)

newstr = regex.sub('#', oldstr)

print(newstr)

# Lisa, #, #


In [14]:
# beginning of the line
oldstr = "Why Lisa, why, WHY"

regex = re.compile(r"^why", re.IGNORECASE)

newstr = regex.sub('#', oldstr)

print(newstr)

# Lisa, why, WHY


In [15]:
# end of the line
oldstr = "Why Lisa, why, WHY"

regex = re.compile(r"why$", re.IGNORECASE)

newstr = regex.sub('#', oldstr)

print(newstr)

Why Lisa, why, #

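The `^` and `$` examples above work on a single-line string. The sketch below, with a made-up two-line string, illustrates how `re.MULTILINE` changes the anchors while `\A` and `\Z` keep referring to the whole string.

```python
import re

lines = "why Lisa\nwhy, WHY"

# default: ^ anchors only at the start of the whole string
default_start = re.findall(r"^why", lines)
# re.MULTILINE: ^ also anchors at the start of every line
multiline_start = re.findall(r"^why", lines, re.MULTILINE)
# \A and \Z ignore re.MULTILINE
string_start = re.findall(r"\Awhy", lines, re.MULTILINE)
string_end = re.findall(r"why\Z", lines, re.MULTILINE | re.IGNORECASE)
print(default_start, multiline_start, string_start, string_end)
# ['why'] ['why', 'why'] ['why'] ['WHY']

# $ also matches just before a trailing newline; \Z does not
dollar_before_newline = re.search(r"WHY$", lines + "\n") is not None
z_before_newline = re.search(r"WHY\Z", lines + "\n") is not None
print(dollar_before_newline, z_before_newline)   # True False
```
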

In [16]:
# word boundary

oldstr = "the cat will catch-up with you in muscat"

regex = re.compile(r"cat", re.IGNORECASE)
newstr = regex.sub('#', oldstr)
print(newstr)

the # will #ch-up with you in mus#


In [17]:
# word boundary - full word
oldstr = "the cat will catch-up with you in muscat"

regex = re.compile(r"\bcat\b", re.IGNORECASE)
newstr = regex.sub('#', oldstr)
print(newstr)

the # will catch-up with you in muscat


In [18]:
# word boundary - begins with cat
oldstr = "the cat will catch-up with you in muscat"

regex = re.compile(r"\bcat", re.IGNORECASE)
newstr = regex.sub('#', oldstr)
print(newstr)

the # will #ch-up with you in muscat


In [19]:
# word boundary - ends with cat
oldstr = "the cat will catch-up with you in muscat"

regex = re.compile(r"cat\b", re.IGNORECASE)
newstr = regex.sub('#', oldstr)
print(newstr)

the # will catch-up with you in mus#

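The complementary sequence `\B` is listed among the metacharacters but not shown in the cells above. A short sketch on the same sentence:

```python
import re

oldstr = "the cat will catch-up with you in muscat"

# \Bcat: "cat" that does NOT start a word (only the one inside "muscat")
not_word_start = re.sub(r"\Bcat", "#", oldstr)
# cat\B: "cat" that does NOT end a word (only the one inside "catch")
not_word_end = re.sub(r"cat\B", "#", oldstr)

print(not_word_start)  # the cat will catch-up with you in mus#
print(not_word_end)    # the cat will #ch-up with you in muscat
```
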

## Exercise

#### Exercise 1: 
 - find file names of the form base.extension  
 - and print the file names

In [20]:
fnamestr = "The two files are foo1.bar and foo2.bar. There are no other files."
print (fnamestr)

The two files are foo1.bar and foo2.bar. There are no other files.


In [21]:
regex = re.compile(r"\b\w+[.]\w+\b", re.IGNORECASE)

fnames = regex.findall(fnamestr)
print(fnames)

['foo1.bar', 'foo2.bar']


#### Exercise 2:
 - find punctuation characters and digits
 - replace them with an empty string

In [22]:
oldstr = "Hey! Are we still on for lunch today at 11am?"
print (oldstr)

Hey! Are we still on for lunch today at 11am?


In [23]:
import string

# re.escape keeps characters such as \ ] ^ - literal inside the character class
regex = re.compile(r"[%s%s]" % (string.digits, re.escape(string.punctuation))) ### <-

newstr = regex.sub('',oldstr)

print(newstr)

Hey Are we still on for lunch today at am

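One caveat when building a character class from `string.punctuation`: the list contains `\` followed by `]`, which the regex engine reads as an escaped `]`, so the backslash itself is not part of the class. The sketch below compares the plain class with one built through `re.escape`; the sample string is made up.

```python
import re
import string

naive = re.compile(r"[%s]" % string.punctuation)
escaped = re.compile(r"[%s]" % re.escape(string.punctuation))

sample = r"path\to\file!"

naive_result = naive.sub("", sample)      # "!" is removed, backslashes survive
escaped_result = escaped.sub("", sample)  # backslashes are removed as well

print(naive_result)    # path\to\file
print(escaped_result)  # pathtofile
```

For text without backslashes, such as the sentence used in this exercise, both patterns give the same result.
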

# spaCy

- https://spacy.io/
- https://spacy.io/usage
- https://spacy.io/models/en
- https://spacy.io/api/doc
- https://spacy.io/api/token
- https://spacy.io/usage/processing-pipelines
- https://spacy.io/usage/spacy-101



- **spaCy** is a free, open-source library for advanced, industrial-strength Natural Language Processing (NLP) in Python.

- When you call spaCy on a text, spaCy first tokenizes the text (i.e. segments it into words, punctuation and so on) to produce a Doc object.
  spaCy uses rules specific to each language for tokenization.

- The Doc object is then processed in several different steps (also referred to as the processing pipeline).
  The pipeline used by the default models consists of a part-of-speech (POS) tagger, a dependency parser, and a named entity recognizer (NER).
  spaCy uses statistical models to predict POS tags, syntactic dependencies, and named entities.
  Each pipeline component returns the processed Doc, which is then passed on to the next component.
  You can pick and choose the stages you want spaCy to load.

- Here is a list of features and capabilities of spaCy: 
  https://spacy.io/usage/spacy-101#features

### installation

#### https://spacy.io/usage
- `pip install spacy`

#### https://spacy.io/models/en
You can download these general-purpose pretrained models to predict
POS tags (tagger), named entities (ner), and syntactic dependencies (parser).

Note: `en_core_web_sm` does not include **word vectors**, but `en_core_web_md` and `en_core_web_lg` do.
> `python -m spacy download en_core_web_sm` <br>
> `python -m spacy download en_core_web_md` <br>
> `python -m spacy download en_core_web_lg`



In [1]:
import spacy

In [2]:
# once you’ve downloaded and installed a model, you can load it via spacy.load(). 
# spacy.load() returns a Language object containing all components and data needed to process text.

# the Language object is typically called nlp. 
import en_core_web_sm
nlp = en_core_web_sm.load()

In [3]:
# calling the nlp object on a string of text will return a processed Doc object. the Doc object is typically called doc.
# even though a Doc object is processed (for instance, split into individual words and annotated),
# it still holds all information of the original text.
# once the doc object has been created, we can use it to access the various spaCy features.
doc = nlp("Hi Emma Watson! How are you?")

In [4]:
# for instance, you can iterate over individual sentences in the document.
for s in doc.sents:
    print (s.text)

Hi Emma Watson!
How are you?


In [5]:
# you can iterate over the named entities in the document (from ner)
# a named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title.
for e in doc.ents:
    print (e.text)
    print (e.label_)
    print (spacy.explain(e.label_))

Watson
PERSON
People, including fictional


In [6]:
# you can visualize the named entities 
spacy.displacy.render(doc, style='ent',jupyter=True)

In [7]:
# you can also visualize the dependencies (from parser)
spacy.displacy.render(doc, style="dep", jupyter= True)

In [8]:
# you can iterate over the base noun chunks in the document.
# noun chunks are “base noun phrases”  - a noun plus the words describing the noun.
# for instance, “the lavish green grass” or “the world’s largest tech fund”.
for c in doc.noun_chunks:
    print (c.text)

Hi Emma Watson
you


In [9]:
# you can iterate over the linguistic annotations associated with tokens in the document (from tagger)
# https://spacy.io/api/annotation
# https://spacy.io/api/token#attributes
doc = nlp("Hi Emma Watson! How are you?")
for token in doc:
    print(token.i,                  # index of the token within the parent document
          token,
          token.text,               # verbatim text
          token.ent_type_,          # named entity type
          spacy.explain(token.ent_type_),
          token.lemma_,             # base form of the token, with no inflectional suffixes
          token.pos_,               # coarse-grained part-of-speech
          spacy.explain(token.pos_),
          token.tag_,               # fine-grained part-of-speech
          spacy.explain(token.tag_),
          token.dep_,               # syntactic dependency relation
          token.like_url,           # does the token resemble a URL
          token.like_num,           # does the token represent a number? e.g. “10.9”, “10”, “ten”, etc
          token.like_email,         # does the token resemble an email address
          token.is_stop,            # is the token part of a “stop list”
          token.is_alpha,
          token.is_ascii,
          token.is_digit,
          token.is_lower,
          token.is_upper,
          token.is_title,
          token.is_punct,
          token.is_space,
          token.is_currency
          )

0 Hi Hi  None hi INTJ interjection UH interjection compound False False False False True True False False False True False False False
1 Emma Emma  None Emma PROPN proper noun NNP noun, proper singular compound False False False False True True False False False True False False False
2 Watson Watson PERSON People, including fictional Watson PROPN proper noun NNP noun, proper singular ROOT False False False False True True False False False True False False False
3 ! !  None ! PUNCT punctuation . punctuation mark, sentence closer punct False False False False False True False False False False True False False
4 How How  None how ADV adverb WRB wh-adverb advmod False False False True True True False False False True False False False
5 are are  None be AUX auxiliary VBP verb, non-3rd person singular present ROOT False False False True True True False True False False False False False
6 you you  None -PRON- PRON pronoun PRP pronoun, personal nsubj False False False True True True False

In [10]:
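# you can make semantic similarity estimates based on word vectors.
# the default estimate is cosine similarity, using an average of word vectors for the document.
# it returns a scalar similarity score (higher is more similar).
# note: en_core_web_sm does not include word vectors; for more reliable similarity scores,
# load en_core_web_md or en_core_web_lg instead.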
doc1 = nlp("I like oranges that are sweet.")
# print (doc1.vector) # doc vector is average of token vectors
doc2 = nlp("I like apples that are sour.")
# print (doc2.vector) # doc vector is average of token vectors
doc1.similarity(doc2)

0.9100590601885606

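To make "cosine similarity of averaged word vectors" concrete, here is a plain-Python sketch. It does not call spaCy; the 3-dimensional vectors in `toy_vectors` are invented for illustration, and the helper names are hypothetical.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hypothetical "word vectors", made up for this example
toy_vectors = {
    "sweet":   [1.0, 0.0, 1.0],
    "oranges": [1.0, 1.0, 0.0],
    "sour":    [0.0, 1.0, 1.0],
    "apples":  [1.0, 1.0, 0.0],
}

# a document vector is the average of its token vectors
doc_a = mean_vector([toy_vectors["sweet"], toy_vectors["oranges"]])
doc_b = mean_vector([toy_vectors["sour"], toy_vectors["apples"]])

score = cosine_similarity(doc_a, doc_b)
print(round(score, 4))   # 0.8333
```
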
In [11]:
# processing large corpora with nlp.pipe

# let's say you had a very large corpus of text
# illustrated with a very small corpus below :)
data = ["Amy is going to class now.",
          "Matt is having lunch."]

# first, you'll only want to apply the pipeline components you need:
# getting predictions from the model that you don’t actually need adds up and becomes very inefficient at scale. 
# to prevent this, use the disable keyword argument to disable components you don’t need.
nlp = en_core_web_sm.load(disable=['parser', 'ner'])
# nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

# and second, you'll want to work on batches of texts.
# this can be done with spaCy’s nlp.pipe method which takes an iterable of texts and yields processed Doc objects. 
# the batching is done internally.
corpus = nlp.pipe(data)

# now we can clean the corpus efficiently
def custom_tokenizer(doc):
    tokens = [token.lemma_.lower() 
                      for token in doc 
                          if not token.is_stop and not token.is_punct]
    return ' '.join(tokens)

clean_corpus = [custom_tokenizer(doc) for doc in corpus]
clean_corpus

['amy go class', 'matt have lunch']

---

# Exercise

#### Question 1
Lowercase the text.
- `text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."`

In [12]:
text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."
text.lower()

'yes, that is a duplicate catalog category. the catalog number is c1357-a.'

#### Question 2
Substitute the pattern `"cat"` with replacement `"#"` in the text.
- make the substitution case sensitive,
- and match the pattern wherever it occurs in the text
  - `text = "Yes, that is a duplicate Catalog category. The Catalog number is C1357-A."`

In [13]:
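import re

# note: `text` still holds the all-lowercase string from Question 1 (In [12]),
# so the output below contains no capitalized "Catalog"
regex = re.compile(r"cat")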
newtext = regex.sub('#', text)
print(newtext)

Yes, that is a dupli#e #alog #egory. The #alog number is C1357-A.


#### Question 3
Substitute the pattern "cat" with replacement "#" in the text
- make the substitution case insensitive (re.IGNORECASE)
- and match the pattern wherever it occurs in the text
  - `text = "Yes, that is a duplicate Catalog category. The Catalog number is C1357-A."`

In [14]:
regex = re.compile(r"cat", re.IGNORECASE)
newtext = regex.sub('#', text)
print(newtext)

Yes, that is a dupli#e #alog #egory. The #alog number is C1357-A.


#### Question 4
Substitute the pattern "cat" with replacement "#" in the text
- make the substitution case insensitive
- only match if the pattern is at the beginning of a word boundary (\b)
  - `text = "Yes, that is a duplicate Catalog category. The Catalog number is C1357-A."`

In [16]:
regex = re.compile(r"\bcat", re.IGNORECASE)
newtext = regex.sub('#', text)
print(newtext)

Yes, that is a duplicate #alog #egory. The #alog number is C1357-A.


#### Question 5
Substitute the characters 'c', 'a', 't' with replacement "#" in the text
- make the substitution case insensitive
- hint: use character class `r"[cat]"`
  - `text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."`

In [17]:
regex = re.compile(r"[cat]", re.IGNORECASE)
newtext = regex.sub('#', text)
print(newtext)

Yes, #h## is # dupli###e ####log ###egory. #he ####log number is #1357-#.


#### Question 6
Substitute all letters with replacement "#" in the text
- make the substitution case insensitive
- hint: use character class with range [a-z]
  - `text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."`

In [18]:
regex = re.compile(r"[a-z]", re.IGNORECASE)
newtext = regex.sub('#', text)
print(newtext)

###, #### ## # ######### ####### ########. ### ####### ###### ## #1357-#.


#### Question 7
Substitute all digits with replacement "#" in the text
- hint: use character class with range [0-9], or special sequence \d = [0-9]
  - `text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."`

In [19]:
regex = re.compile(r"[0-9]")
newtext = regex.sub('#', text)
print(newtext)

Yes, that is a duplicate catalog category. The catalog number is C####-A.


#### Question 8
Substitute one or more white-space characters with replacement " " in the spacetext
- hint: use special sequence \s for whitespace characters, and metacharacter + for one-or-more-times
  - `spacetext = "Yes, that is a duplicate catalog \t category. The catalog number is C1357-A.\n"`

In [20]:
spacetext = "Yes, that is a duplicate catalog \t category. The catalog number is C1357-A.\n"

regex = re.compile(r"\s+")
newtext = regex.sub(' ', spacetext)
# print(spacetext)
print(newtext)

Yes, that is a duplicate catalog category. The catalog number is C1357-A. 


#### Question 9
Substitute words that are two or more alphanumeric characters long
with replacement "#"
- use special sequence \w for alphanumeric characters [0-9a-zA-Z_],
- special sequence \b for word boundaries,
- and metacharacter + for one-or-more-times, or {2,} for two or more times
  - `text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."`

In [21]:
text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."

regex = re.compile(r"[\w]{2,}")
newtext = regex.sub('#', text)
print(newtext)

#, # # a # # #. # # # # #-A.


#### Question 10
Find all words that are two or more characters long
- hint: use regex.findall(text)
  - `text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."`

In [22]:
regex = re.compile(r"\b\w{2,}\b")
newtext = regex.findall(text)
print(newtext)

['Yes', 'that', 'is', 'duplicate', 'catalog', 'category', 'The', 'catalog', 'number', 'is', 'C1357']


#### Question 11 (@20:00)
Find all the urls in urltext
- take care of http vs https
- hint: use metacharacter ? for zero-or-one-times, metacharacter + for one-or-more-times, and \S for non-whitespace characters
  - `urltext = 
  """The url for sklearn documentation is https://scikit-learn.org/stable/. 
You can learn more about pipelines by following these links: https://scikit-learn.org/stable/modules/compose.html
and https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline"""`

In [33]:
urltext = """The url for sklearn documentation is https://scikit-learn.org/stable/. 
You can learn more about pipelines by following these links: https://scikit-learn.org/stable/modules/compose.html
and https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline"""

regex = re.compile(r"\bhttps?://[\S]+\b")
urls = regex.findall(urltext)
urls

['https://scikit-learn.org/stable',
 'https://scikit-learn.org/stable/modules/compose.html',
 'https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline']

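Note that the trailing `\b` in the pattern above drops the final `/` of the first URL, because a word boundary needs an alphanumeric character on one side. One possible alternative, sketched below, is to match the whole non-whitespace run and then strip sentence punctuation. The sample text and the `example.com` URL are made up for illustration.

```python
import re

sample = "Docs: https://scikit-learn.org/stable/. See also http://example.com/a_b, thanks."

# trailing \b: the match ends at the last alphanumeric character
with_boundary = re.findall(r"\bhttps?://\S+\b", sample)

# alternative: grab the non-whitespace run, then strip trailing punctuation
stripped = [u.rstrip(".,;:!?") for u in re.findall(r"https?://\S+", sample)]

print(with_boundary)
# ['https://scikit-learn.org/stable', 'http://example.com/a_b']
print(stripped)
# ['https://scikit-learn.org/stable/', 'http://example.com/a_b']
```
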
### Exercise Set 2
String formatting is useful when we're trying to create patterns on the fly:
> `print (string.digits)` <br>
> `strd = r"[%s]" % string.digits` <br>
> `print (strd)` <br>
> `print ()`


> `print (string.punctuation)` <br>
> `strp = r"[%s]" % string.punctuation` <br>
> `print (strp)` <br>
> `print ()`


> `strpd = r"%s%s" % (string.punctuation, string.digits)` <br>
> `print (strpd)`

In [39]:
import string
print (string.digits)
strd = r"[%s]" % string.digits
print (strd)
print ()

print (string.punctuation)
strp = r"[%s]" % string.punctuation
print (strp)
print ()

strpd = r"%s%s" % (string.punctuation, string.digits)
print (strpd)

0123456789
[0123456789]

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
[!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~0123456789


#### Question 12

Substitute all punctuations with replacement "#" in the text
- hint: use string formatting and string.punctuation to create the required character class
  - `text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."`


In [42]:
text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."

regex = re.compile(r"[%s]" % re.escape(string.punctuation))  # re.escape keeps \ ] ^ - literal inside the class
newtext = regex.sub("#", text)
print(newtext)

Yes# that is a duplicate catalog category# The catalog number is C1357#A#


In [47]:
# Notice that split and join work in opposite ways
st = "ASU Sun Devils"
print (st)
# “split” splits the string (on runs of whitespace by default) and puts the words in a list
stsplit = st.split() # splits on whitespace by default
print (stsplit)
# “join” joins the element of the list using the separator specified 
stjoin = ' '.join(stsplit) # joining using ' '
print (stjoin)

ASU Sun Devils
['ASU', 'Sun', 'Devils']
ASU Sun Devils

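When the separators are not plain whitespace, the pattern method `split()` described earlier can take the place of `str.split()`. A small sketch with a made-up string:

```python
import re

st = "ASU,Sun;  Devils"

# str.split() only knows about whitespace, so "," and ";" stay attached
by_whitespace = st.split()
# re.split() splits wherever the pattern matches
parts = re.split(r"[,;\s]+", st)
rejoined = "-".join(parts)

print(by_whitespace)  # ['ASU,Sun;', 'Devils']
print(parts)          # ['ASU', 'Sun', 'Devils']
print(rejoined)       # ASU-Sun-Devils
```
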

### StopWords

- Stop words are very common words in a language (e.g. "the", "is", "at") that carry little meaning on their own and are often removed before text analysis

In [53]:
# import ssl
# ssl._create_default_https_context = ssl._create_unverified_context

from nltk.corpus import stopwords

# nltk.download('stopwords')

print (stopwords.words('english')) 

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

#### Question 13
Remove stopwords from the text
- hint: use lower, split, list comprehension, and join
  - `text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."`
  - `sw = stopwords.words('english')`

In [65]:
text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."
sw = stopwords.words('english')

' '.join([i.lower() for i in text.split() if i.lower() not in sw])

'yes, duplicate catalog category. catalog number c1357-a.'

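In the output above, `'yes,'` survives because splitting on whitespace leaves the comma attached, so the token never equals a stop word. The sketch below contrasts the two approaches. It uses a tiny hand-written stop list (`toy_stopwords`, an assumption for illustration, not the nltk list).

```python
import re

# tiny hand-written stop list for illustration only (NOT the nltk list)
toy_stopwords = {"yes", "that", "is", "a", "the"}

text = "Yes, that is a duplicate catalog category. The catalog number is C1357-A."

# whitespace split keeps punctuation attached: "yes," is not in the stop list
by_split = [w for w in text.lower().split() if w not in toy_stopwords]

# extracting runs of word characters first lets "yes" match the stop list
by_regex = [w for w in re.findall(r"\w+", text.lower()) if w not in toy_stopwords]

print(by_split)
# ['yes,', 'duplicate', 'catalog', 'category.', 'catalog', 'number', 'c1357-a.']
print(by_regex)
# ['duplicate', 'catalog', 'category', 'catalog', 'number', 'c1357']
```

The regex version also splits `c1357-a` at the hyphen, and the leftover `a` is then dropped as a stop word.
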
#### Question 14
Lemmatization attempts to get the word root through vocabulary and morphological analysis 

In [75]:
from nltk.stem import WordNetLemmatizer

import nltk
# nltk.download('wordnet')

text = 'wolves'
print(WordNetLemmatizer().lemmatize(text))

# or instantiate once and reuse: wnl = WordNetLemmatizer()

wolf


In [72]:
from nltk.stem import PorterStemmer
# stemming strips suffixes by rule, so the result need not be a dictionary word
text = 'wolves'
PorterStemmer().stem(text)

'wolv'

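The difference between the two results (`wolf` vs. `wolv`) comes from the approach: a lemmatizer consults a vocabulary, while a stemmer applies suffix rules. The toy functions below only illustrate that contrast. They are hypothetical and do not reproduce the WordNet lemmatizer or the Porter algorithm.

```python
# toy illustration only: neither function reproduces nltk's algorithms

def toy_stem(word):
    """Strip a few common suffixes by rule, without consulting a vocabulary."""
    for suffix in ("es", "s", "ing"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# hypothetical mini-vocabulary mapping inflected forms to dictionary forms
toy_lemmas = {"wolves": "wolf", "geese": "goose"}

def toy_lemmatize(word):
    """Look the word up in the vocabulary; fall back to the word itself."""
    return toy_lemmas.get(word, word)

for w in ("wolves", "geese"):
    print(w, toy_stem(w), toy_lemmatize(w))
# wolves wolv wolf
# geese geese goose
```
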
### Preprocessing a text corpus using regex and nltk


Write a function called preprocess(txt) that pre-processes the txt that's been passed in:
- lower case,
- remove digits,
- remove punctuation,
- remove extra white-spaces,
- remove stop words,
- remove words < 2 characters long (i.e., retain words >= 2 characters long),
- lemmatize

use the function to preprocess the doc
- `doc = preprocess(doc)` <br>
  `doc`


use the function to preprocess docs in corpus2
- `clean_corpus2 = [preprocess(text) for text in corpus2]` <br>
  `clean_corpus2`

In [88]:
doc = """Python is an interpreted, high-level, general-purpose programming 
language. Created by Guido van Rossum and first released in 1991, Python has a 
design philosophy that emphasizes code readability, notably using significant 
whitespace. It provides constructs that enable clear programming on both small 
and large scales.[26] Van Rossum led the language community until stepping 
down as leader in July 2018.[27][28] Python features a dynamic type system 
and automatic memory management. It supports multiple programming paradigms, 
including object-oriented, imperative, functional and procedural, and has a 
large and comprehensive standard library.[29] Python interpreters are 
available for many operating systems. CPython, the reference implementation of 
Python, is open source software[30] and has a community-based development 
model, as do nearly all of Python's other implementations. Python and CPython 
are managed by the non-profit Python Software Foundation."""

preprocess(doc)

['python',
 'interpreted',
 'highlevel',
 'generalpurpose',
 'programming',
 'language',
 'created',
 'guido',
 'van',
 'rossum',
 'first',
 'released',
 'python',
 'design',
 'philosophy',
 'emphasizes',
 'code',
 'readability',
 'notably',
 'using',
 'significant',
 'whitespace',
 'provides',
 'construct',
 'enable',
 'clear',
 'programming',
 'small',
 'large',
 'scale',
 'van',
 'rossum',
 'led',
 'language',
 'community',
 'stepping',
 'leader',
 'july',
 'python',
 'feature',
 'dynamic',
 'type',
 'system',
 'automatic',
 'memory',
 'management',
 'support',
 'multiple',
 'programming',
 'paradigm',
 'including',
 'objectoriented',
 'imperative',
 'functional',
 'procedural',
 'large',
 'comprehensive',
 'standard',
 'library',
 'python',
 'interpreter',
 'available',
 'many',
 'operating',
 'system',
 'cpython',
 'reference',
 'implementation',
 'python',
 'open',
 'source',
 'software',
 'communitybased',
 'development',
 'model',
 'nearly',
 'python',
 'implementation',
 'pyth

In [94]:
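def preprocess(txt):
    # lower case
    txt = txt.lower()

    # replace digits and punctuation with a space
    # (re.escape keeps characters such as \ ] ^ - literal inside the class)
    txt = re.compile(r"\d").sub(' ', txt)
    txt = re.compile(r"[%s]" % re.escape(string.punctuation)).sub(' ', txt)

    # collapse runs of whitespace into a single space
    txt = re.compile(r"\s+").sub(' ', txt)

    # remove stop words
    sw = set(stopwords.words('english'))
    txt = ' '.join([w for w in txt.split() if w not in sw])

    # keep only words that are at least 2 characters long
    regex = re.compile(r"\b\w{2,}\b")
    txt = ' '.join(regex.findall(txt))

    # lemmatize each remaining word
    wnl = WordNetLemmatizer()
    txt = ' '.join([wnl.lemmatize(w) for w in txt.split()])

    return txt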


In [95]:
preprocess(doc)

'python interpreted high level general purpose programming language created guido van rossum first released python design philosophy emphasizes code readability notably using significant whitespace provides construct enable clear programming small large scale van rossum led language community stepping leader july python feature dynamic type system automatic memory management support multiple programming paradigm including object oriented imperative functional procedural large comprehensive standard library python interpreter available many operating system cpython reference implementation python open source software community based development model nearly python implementation python cpython managed non profit python software foundation'

In [98]:
corpus1 = ["This is a brown house. This house is big.",
          "This is a small house. This house has 1 bedroom.",
          "This dog is brown. This dog likes to play",
          "The dog is in the bedroom."]

document1 = """In Greek mythology, Python (Greek: Πύθων, gen.: Πύθωνος) was the earth-dragon of 
Delphi, always represented in Greek sculpture and vase-paintings as a serpent. He presided at the 
Delphic oracle, which existed in the cult center for his mother, Gaia, "Earth," Pytho being the 
place name that was substituted for the earlier Krisa.[1] Hellenes considered the site to be the 
center of the earth, represented by a stone, the omphalos or navel, which Python guarded."""

document2 = """Monty Python (sometimes known as The Pythons)[2][3] were a British surreal comedy 
group who created the sketch comedy show Monty Python's Flying Circus, that first aired on the BBC on 
October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from 
the television series into something larger in scope and impact, spawning touring stage shows, films, 
numerous albums, several books, and a stage musical. The group's influence on comedy has been compared 
to The Beatles' influence on music."""

document3 = """Python is a widely used general-purpose, high-level programming language.[19][20] 
Its design philosophy emphasizes code readability, and its syntax allows programmers to express 
concepts in fewer lines of code than would be possible in languages such as C++ or Java.[21][22] 
The language provides constructs intended to enable clear programs on both a small and large scale."""

corpus2 = [document1, document2, document3]

In [99]:
clean_corpus2 = [preprocess(txt) for txt in corpus2]
clean_corpus2

['greek mythology python greek πύθων gen πύθωνος earth dragon delphi always represented greek sculpture vase painting serpent presided delphic oracle existed cult center mother gaia earth pytho place name substituted earlier krisa hellene considered site center earth represented stone omphalos navel python guarded',
 'monty python sometimes known python british surreal comedy group created sketch comedy show monty python flying circus first aired bbc october forty five episode made four series python phenomenon developed television series something larger scope impact spawning touring stage show film numerous album several book stage musical group influence comedy compared beatles influence music',
 'python widely used general purpose high level programming language design philosophy emphasizes code readability syntax allows programmer express concept fewer line code would possible language java language provides construct intended enable clear program small large scale']

### Using spaCy

In [105]:
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load(disable=['parser', 'ner'])

In [113]:
def custom_tokenizer(doc):
    tokens = [token.lemma_.lower()
              for token in doc
              if (len(token) >= 2 and      # len(token) is the number of characters in the token
                  not token.is_punct and
                  not token.is_space and
                  not token.is_stop and
                  not token.is_digit)]
    return ' '.join(tokens)

In [125]:
corpus = ["This is a brown house. This house is big.",
          "This is a small house. This house has 1 bedroom.",
          "This dog is brown. This dog likes to play",
          "The dog is in the bedroom."]

In [126]:
nlp_corpus = nlp.pipe(corpus)
clean_corpus = [custom_tokenizer(doc) for doc in nlp_corpus]
clean_corpus

In [127]:
corpus = list(map(preprocess, corpus))
corpus

['brown house house big',
 'small house house bedroom',
 'dog brown dog like play',
 'dog bedroom']

In [128]:
corpus = [w.split() for w in corpus]
corpus

[['brown', 'house', 'house', 'big'],
 ['small', 'house', 'house', 'bedroom'],
 ['dog', 'brown', 'dog', 'like', 'play'],
 ['dog', 'bedroom']]

In [130]:
import pandas as pd

corpus = [pd.Series(i) for i in corpus]
corpus

[0    brown
 1    house
 2    house
 3      big
 dtype: object,
 0      small
 1      house
 2      house
 3    bedroom
 dtype: object,
 0      dog
 1    brown
 2      dog
 3     like
 4     play
 dtype: object,
 0        dog
 1    bedroom
 dtype: object]

In [133]:
corpus = [i.value_counts() for i in corpus]
corpus

[house    2
 big      1
 brown    1
 dtype: int64,
 house      2
 small      1
 bedroom    1
 dtype: int64,
 dog      2
 like     1
 play     1
 brown    1
 dtype: int64,
 dog        1
 bedroom    1
 dtype: int64]

In [139]:
df = pd.DataFrame(data=corpus)
df

   house  big  brown  small  bedroom  dog  like  play
0    2.0  1.0    1.0    NaN      NaN  NaN   NaN   NaN
1    2.0  NaN    NaN    1.0      1.0  NaN   NaN   NaN
2    NaN  NaN    1.0    NaN      NaN  2.0   1.0   1.0
3    NaN  NaN    NaN    NaN      1.0  1.0   NaN   NaN


In [141]:
df.fillna(0, inplace = True)

In [142]:
df

   house  big  brown  small  bedroom  dog  like  play
0    2.0  1.0    1.0    0.0      0.0  0.0   0.0   0.0
1    2.0  0.0    0.0    1.0      1.0  0.0   0.0   0.0
2    0.0  0.0    1.0    0.0      0.0  2.0   1.0   1.0
3    0.0  0.0    0.0    0.0      1.0  1.0   0.0   0.0
