## Using `re` package for regular expressions

In [2]:
import re
from nltk.tokenize import sent_tokenize
scene_one = """
SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!  [clop clop clop] \nSOLDIER #1: Halt!  Who goes there?\nARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England!\nSOLDIER #1: Pull the other one!\nARTHUR: I am, ...  and this is my trusty servant Patsy.  We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.  I must speak with your lord and master.\nSOLDIER #1: What?  Ridden on a horse?\nARTHUR: Yes!\nSOLDIER #1: You're using coconuts!\nARTHUR: What?\nSOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.\nARTHUR: So?  We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--\nSOLDIER #1: Where'd you get the coconuts?\nARTHUR: We found them.\nSOLDIER #1: Found them?  In Mercea?  The coconut's tropical!\nARTHUR: What do you mean?\nSOLDIER #1: Well, this is a temperate zone.\nARTHUR: The swallow may fly south with the sun or the house martin or the plover may seek warmer climes in winter, yet these are not strangers to our land?\nSOLDIER #1: Are you suggesting coconuts migrate?\nARTHUR: Not at all.  They could be carried.\nSOLDIER #1: What?  A swallow carrying a coconut?\nARTHUR: It could grip it by the husk!\nSOLDIER #1: It's not a question of where he grips it!  It's a simple question of weight ratios!  A five ounce bird could not carry a one pound coconut.\nARTHUR: Well, it doesn't matter.  Will you go and tell your master that Arthur from the Court of Camelot is here.\nSOLDIER #1: Listen.  In order to maintain air-speed velocity, a swallow needs to beat its wings forty-three times every second, right?\nARTHUR: Please!\nSOLDIER #1: Am I right?\nARTHUR: I'm not interested!\nSOLDIER #2: It could be carried by an African swallow!\nSOLDIER #1: Oh, yeah, an African swallow maybe, but not a European swallow.  
That's my point.\nSOLDIER #2: Oh, yeah, I agree with that.\nARTHUR: Will you ask your master if he wants to join my court at Camelot?!\nSOLDIER #1: But then of course a-- African swallows are non-migratory.\nSOLDIER #2: Oh, yeah...\nSOLDIER #1: So they couldn't bring a coconut back anyway...  [clop clop clop] \nSOLDIER #2: Wait a minute!  Supposing two swallows carried it together?\nSOLDIER #1: No, they'd have to have it on a line.\nSOLDIER #2: Well, simple!  They'd just use a strand of creeper!\nSOLDIER #1: What, held under the dorsal guiding feathers?\nSOLDIER #2: Well, why not?\n
"""
sentences = sent_tokenize(scene_one)

The basic matching call is `re.match(pattern, string)`, which tests whether `pattern` matches at the start of `string`.

### List of commonly-used regex patterns:

pattern | matches | example
--------|---------|----------
`\w+` | word (one or more word characters) | 'Magic'
`\d` | digit | 9
`\s` | whitespace character | ' '
`.*` | wildcard: `.` matches any single character except a newline, and `*` repeats it 0 or more times | 'username74'
&#124; | boolean logic "OR" |
`+` or `*` | greedy repetition of the preceding pattern; `+` requires at least 1 occurrence and `*` allows 0 or more | 'aaaaaaaaaa'
`\S` (a capitalized class is the negation) | *not* whitespace | 'no_spaces'
`[a-z]` (square brackets define a character class) | one lowercase letter | 'abcdefg'
`[A-Za-z]+` | string of upper and lower case English letters | 'ABCDEFghijk'
`[0-9]` | digits from 0 to 9 | 9
`[A-Za-z\-\.]+` | upper and lower case English letters, `-` and `.` | 'My-Website.com'
`(a-Z)` | the literal sequence a, - and Z (a group, not a range) | 'a-Z'
(\s+&#124;,) | spaces or a comma | ', '

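`[]` defines a character class: it matches any single character from the listed characters or ranges, and a literal `-` inside the brackets should be escaped with `\`. `()` defines a group: the pattern inside the parentheses is matched as written, so `(a-Z)` matches the literal text `a-Z` rather than a range.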
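To make the difference between a character class and a group concrete, here is a minimal sketch using only the standard `re` module. The sample strings are made up for illustration and are not part of the original notebook.

```python
import re

text = "a-Z and My-Website.com"

# Character class: any run of letters, '-' or '.'
class_matches = re.findall(r"[A-Za-z\-\.]+", text)
print(class_matches)  # ['a-Z', 'and', 'My-Website.com']

# Group: the literal sequence 'a-Z', not a range
group_matches = re.findall(r"(a-Z)", text)
print(group_matches)  # ['a-Z']

# Alternation inside a group: a run of whitespace OR a single comma
separators = re.findall(r"(\s+|,)", "one, two")
print(separators)  # [',', ' ']
```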

**For instance, adding `+` after a character class such as `\w` matches whole runs of characters instead of one character at a time**:

In [3]:
my_string = "Let's write RegEx!"
PATTERN = r'\w'
print(f"Using {PATTERN} returns {re.findall(PATTERN, my_string)}")
PATTERN = r'\w+'
print(f"Using {PATTERN} returns {re.findall(PATTERN, my_string)}")

Using \w returns ['L', 'e', 't', 's', 'w', 'r', 'i', 't', 'e', 'R', 'e', 'g', 'E', 'x']
Using \w+ returns ['Let', 's', 'write', 'RegEx']



----------------------

On a side note, it's important to put an `r` prefix before a string if you don't want Python to interpret backslashes as the start of escape sequences. For instance, `"This is a code. \n"` is interpreted as *"This is a code."* followed by a line break, whereas `r"This is a code. \n"` is kept as is: *"This is a code. \n"*.

**Hence, when creating a regex pattern, it is recommended to write it as a raw string with the `r` prefix.**
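A short sketch of why the prefix matters for regex patterns. The `\b` case is an illustrative example that does not appear elsewhere in this notebook: in a normal string `"\b"` is a backspace character, while in a raw string `r"\b"` reaches the regex engine as a word boundary.

```python
import re

# A normal string turns "\n" into one newline character;
# a raw string keeps the backslash and the 'n' as two characters.
normal_len = len("\n")
raw_len = len(r"\n")
print(normal_len, raw_len)  # 1 2

sample = "cat concat cat"

# Without the r prefix, "\b" is a backspace character, so nothing matches
without_raw = re.findall("\bcat\b", sample)
print(without_raw)  # []

# With the r prefix, \b is a word boundary, so only whole words match
with_raw = re.findall(r"\bcat\b", sample)
print(with_raw)  # ['cat', 'cat']
```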

-----------------------


### Using `re.split()` and `re.findall()`, with some common regex:

In [4]:
my_string = """Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?"""

# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.!?]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

["Let's write RegEx", "  Won't that be fun", '  I sure think so', '  Can you find 4 sentences', '  Or perhaps, all 19 words', '']


In [5]:
# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

['Let', 'RegEx', 'Won', 'Can', 'Or']


In [6]:
# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

["Let's", 'write', 'RegEx!', "Won't", 'that', 'be', 'fun?', 'I', 'sure', 'think', 'so.', 'Can', 'you', 'find', '4', 'sentences?', 'Or', 'perhaps,', 'all', '19', 'words?']


In [7]:
# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

['4', '19']


### Differences between `re.search()` and `re.match()`:
`re.search()` finds the pattern anywhere in the string, while `re.match()` only matches at the beginning of the string. Matching is therefore stricter than searching.

In [8]:
# Both return the same result if the pattern matches the beginning of the string
print(re.match('abc','abcde'))
print(re.search('abc','abcde'))

# Only search returns a result when the pattern matches somewhere other than the start
print(re.match('bc','abcde'))
print(re.search('bc','abcde'))

<_sre.SRE_Match object; span=(0, 3), match='abc'>
<_sre.SRE_Match object; span=(0, 3), match='abc'>
None
<_sre.SRE_Match object; span=(1, 3), match='bc'>


The match object returned by `re.match()` or `re.search()` has `.start()` and `.end()` methods that return the start index and the (exclusive) end index of the matched pattern within the target string.
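As a small sketch of how these indexes can be used, the snippet below slices the target string with them. The `line` string is a single line copied from the scene for illustration. Because both functions return `None` when nothing matches, the result is checked before use.

```python
import re

line = "SOLDIER #1: Where'd you get the coconuts?"
m = re.search(r"coconuts", line)

if m is not None:
    # .end() is exclusive, so this slice recovers exactly the matched text
    matched_text = line[m.start():m.end()]
    print(matched_text)  # coconuts
    # .span() returns both indexes at once, .group() the matched text
    print(m.span() == (m.start(), m.end()))  # True
    print(m.group() == matched_text)  # True
```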

**Below are some further examples of regex matching:**

In [9]:
# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search('coconuts', scene_one)

# Print the start and end indexes of match
print(f"The first 'coconuts' starts at [{match.start()}] and ends at [{match.end()}]")

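# Regex for anything in square brackets: pattern1
# (.* is greedy, so the match runs from the first '[' to the last ']' on a line)
pattern1 = r"\[.*\]"

# Use re.search to find the first bracketed text (the result is not printed here)
re.search(pattern1, scene_one)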

# Find the script notation at the beginning of the fourth sentence and print it
pattern2 = r"[\w\s]+:"
print(re.match(pattern2, sentences[3]))


The first 'coconuts' starts at [581] and ends at [589]
<_sre.SRE_Match object; span=(0, 7), match='ARTHUR:'>
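One detail worth checking in the bracket pattern above: `.*` is greedy, so on a line with several bracketed notes it matches from the first `[` to the last `]`. A minimal sketch of the difference, using the first line of the scene and the lazy quantifier `.*?` (the lazy variant is an addition for comparison, not part of the original exercise):

```python
import re

line = "SCENE 1: [wind] [clop clop clop] "

# Greedy: spans from the first '[' to the last ']' on the line
greedy = re.search(r"\[.*\]", line).group()
print(greedy)  # [wind] [clop clop clop]

# Lazy: stops at the first ']' it can reach
lazy = re.search(r"\[.*?\]", line).group()
print(lazy)  # [wind]

# With the lazy pattern, findall returns each bracketed note separately
notes = re.findall(r"\[.*?\]", line)
print(notes)  # ['[wind]', '[clop clop clop]']
```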


## Using the `nltk` library to tokenize strings

Tokenization can help to:

* Map parts of speech
* Match common words
* Remove unwanted tokens

The `nltk` library has four commonly used tokenizers:

tokenizer | function
-----------|----------
`sent_tokenize` | tokenize a document into a list of sentences
`word_tokenize` | tokenize a string into a list of words
`regexp_tokenize` | tokenize a string or document based on a regex
`TweetTokenizer` | special class just for tweet tokenization that can separate hashtags, mentions, and other special cases for tweets

In [10]:
# Import tokenizers from nltk.tokenize
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize


-------------------

It is common practice to build a "dictionary" of unique values (word elements) from a corpus. Calling `set()` on a tokenized list of words produces an unordered collection of **unique** words from the corpus, which is handy for further processing.

**Note:** a `set` behaves like a dictionary that stores only keys, with no associated values.
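A minimal sketch of the idea, assuming a simple `\w+` regex as a stand-in tokenizer so the snippet runs without `nltk`. The sample text is made up for illustration.

```python
import re

text = "the swallow and the coconut and the swallow"

# Stand-in tokenizer: every run of word characters is one token
tokens = re.findall(r"\w+", text)

# set() drops duplicates; the result has no guaranteed order
unique_tokens = set(tokens)
print(len(tokens), len(unique_tokens))  # 8 4

# Sort when a stable order is needed for display or comparison
vocabulary = sorted(unique_tokens)
print(vocabulary)  # ['and', 'coconut', 'swallow', 'the']
```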

-------------------

### Using `sent_tokenize` and `word_tokenize` to tokenize corpora

`sent_tokenize` and `word_tokenize` generate a list of sentences and a list of words, respectively.

Some examples:

In [11]:
# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)
print(f"The resulting list has: {len(sentences)} sentences.")

The resulting list has: 54 sentences.


In [12]:
# Use word_tokenize to tokenize the last sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[-1])
print(f"The target sentence has {len(tokenized_sent)} words.")

The target sentence has 9 words.


In [13]:
# Create a set of unique words from the corpus
unique_tokens = set(word_tokenize(scene_one))
print(f"The corpus has a total of {len(unique_tokens)} unique words.")

The corpus has a total of 226 unique words.


### Using `regexp_tokenize` to tokenize words that match a regex

Note that `regexp_tokenize()` takes the text as its first argument and the regex pattern second. This is the reverse of the `re` functions, which take the pattern first.

In [14]:
tweets = [
 'This is the best #nlp exercise ive found online! #python',
 '#NLP is super fun! <3 #learning',
 'Thanks @datacamp :) #nlp #python'   
]

# Import the necessary modules
from nltk.tokenize import regexp_tokenize

# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"

# Use the pattern on every tweet in the tweets list
[regexp_tokenize(tweet, pattern1) for tweet in tweets]

[['#nlp', '#python'], ['#NLP', '#learning'], ['#nlp', '#python']]

In [15]:
# Write a pattern that matches both mentions and hashtags
pattern2 = r"([#@]\w+)"

# Use the pattern on every tweet in the tweets list
[regexp_tokenize(tweet, pattern2) for tweet in tweets]

[['#nlp', '#python'], ['#NLP', '#learning'], ['@datacamp', '#nlp', '#python']]
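For comparison, here is a sketch of the same extraction with the standard `re` module, where the pattern comes first and the text second. It also shows a point to watch when adding parentheses to a pattern: with `re.findall`, capturing groups change what is returned. The two-group pattern is an illustrative variation, not part of the original exercise.

```python
import re

tweet = "Thanks @datacamp :) #nlp #python"

# No capturing group: findall returns each full match
tags_and_mentions = re.findall(r"[#@]\w+", tweet)
print(tags_and_mentions)  # ['@datacamp', '#nlp', '#python']

# Two capturing groups: findall returns one tuple of groups per match
split_parts = re.findall(r"([#@])(\w+)", tweet)
print(split_parts)  # [('@', 'datacamp'), ('#', 'nlp'), ('#', 'python')]
```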

### Using `TweetTokenizer` objects to easily parse tweets
As with any other class, the tokenizer must be instantiated before it can be applied to target objects, as shown below:

In [16]:
from nltk.tokenize import TweetTokenizer
# instantiate a TweetTokenizer object
tknz = TweetTokenizer()
# apply the .tokenize() method on target strings
print([tknz.tokenize(tweet) for tweet in tweets])

[['This', 'is', 'the', 'best', '#nlp', 'exercise', 'ive', 'found', 'online', '!', '#python'], ['#NLP', 'is', 'super', 'fun', '!', '<3', '#learning'], ['Thanks', '@datacamp', ':)', '#nlp', '#python']]
