### Regex

In [1]:
import re
# Regular expression / Regex lib

In [2]:
re.match('abc', 'abcdef')

<_sre.SRE_Match object; span=(0, 3), match='abc'>

In [4]:
word_regex = r'\w+'
# Matching the first word
re.match(word_regex, 'hi there!')

<_sre.SRE_Match object; span=(0, 2), match='hi'>

#### Common Regex Patterns

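|Pattern | Matches          | Example|
|-|-|-|
|\w+     |  word (one or more word characters) | 'Magic'|
|\d      |  digit           |  9|
|\s      |  whitespace character | ' '|
|.*      |  wildcard (any characters, zero or more) | 'username74'|
|+ or *  |  greedy match (one or more / zero or more repetitions) | 'aaaaa'|
|\S      |  non-whitespace character |  'no_space'|
|[a-z]   |  lowercase group | 'abc'|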
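
As a quick check of the patterns in the table, the sketch below runs each one through `re.findall` or `re.match`. The sample strings and variable names are made up for illustration and do not come from the original exercises.

```python
import re

# Sample strings are illustrative assumptions.
words = re.findall(r"\w+", "Magic words")        # runs of word characters
digits = re.findall(r"\d", "room 9b, floor 2")   # single digits
spaces = re.findall(r"\s", "a b\tc")             # whitespace characters
wildcard = re.match(r".*", "username74").group() # everything up to a newline
greedy = re.match(r"a+", "aaaaab").group()       # as many 'a' as possible
non_space = re.findall(r"\S+", "no_space here")  # runs of non-whitespace
lowercase = re.findall(r"[a-z]", "abC")          # one lowercase letter at a time

print(words)      # ['Magic', 'words']
print(digits)     # ['9', '2']
print(spaces)     # [' ', '\t']
print(wildcard)   # username74
print(greedy)     # aaaaa
print(non_space)  # ['no_space', 'here']
print(lowercase)  # ['a', 'b']
```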

In [6]:
re.split(r'\s+', 'Split on spaces.')

['Split', 'on', 'spaces.']

In [14]:
my_string = "Let's write RegEx!"

re.findall(r'\w+', my_string)

['Let', 's', 'write', 'RegEx']

In [None]:
# Import the regex module
import re

# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

### Tokenization

In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\John.Yin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [3]:
from nltk.tokenize import word_tokenize

word_tokenize("Hi there!")

['Hi', 'there', '!']

**sent_tokenize:** tokenize a document into sentences

**regexp_tokenize:** tokenize a string or document based on a regular expression pattern

**TweetTokenizer:** special class just for tweet tokenization, allowing you to separate hashtags, mentions and lots of exclamation points
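
A minimal sketch of two of these tokenizers, assuming NLTK is installed. The example tweet is invented for illustration. `sent_tokenize` is left out here because it also needs the downloaded `punkt` data.

```python
from nltk.tokenize import regexp_tokenize, TweetTokenizer

# Illustrative input; not taken from the original notebook.
tweet = "Loving this #nlp course, thanks @datacamp!!!"

# regexp_tokenize keeps only the substrings that match the pattern:
# here, a '#' followed by one or more word characters.
hashtags = regexp_tokenize(tweet, r"#\w+")
print(hashtags)  # ['#nlp']

# TweetTokenizer keeps hashtags and mentions together as single tokens.
tweet_tokens = TweetTokenizer().tokenize(tweet)
print(tweet_tokens)
```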

**Difference between** re.search() **and** re.match():

re.match() only tries to match at the beginning of the string.

re.search() scans through the entire string and returns the first match it finds.

In [5]:
import re

re.match('abc', 'abcde')

<_sre.SRE_Match object; span=(0, 3), match='abc'>

In [6]:
re.search('abc', 'abcde')

<_sre.SRE_Match object; span=(0, 3), match='abc'>

In [7]:
re.match('cd', 'abcde')

In [8]:
re.search('cd', 'abcde')

<_sre.SRE_Match object; span=(2, 4), match='cd'>

In [None]:
# Import necessary modules
import re
import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

# scene_one is assumed to be a string that is already loaded in the session
# print(scene_one)

# Split scene_one into sentences: sentences
# sentence_endings = r"[.?!]"
# sentences = re.split(sentence_endings,scene_one)
sentences=sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)


In [None]:
# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())

# Write a regular expression to search for anything in square brackets: pattern1
pattern1 = r"\[.*\]"

# Use re.search to find the first text in square brackets
print(re.search(pattern1, scene_one))

# Find the script notation at the beginning of the fourth sentence and print it
pattern2 = r"[\w\s]+:"
print(re.match(pattern2, sentences[3]))
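
Since `scene_one` is not defined in these notes, here is a sketch of the same three searches on short stand-in strings. Both strings are hypothetical and only serve to make the calls runnable.

```python
import re

# Hypothetical stand-in for scene_one; the real text is not shown in these notes.
sample = "SOLDIER #1: Where did you get the coconuts? [clop clop clop]"

# start() and end() give the slice of the string covered by the match.
match = re.search("coconuts", sample)
found = sample[match.start():match.end()]
print(match.start(), match.end(), found)

# Anything between an opening and a closing square bracket.
bracketed = re.search(r"\[.*\]", sample).group()
print(bracketed)  # [clop clop clop]

# Script notation: word characters and spaces followed by a colon,
# anchored at the start of the string because re.match is used.
speaker = re.match(r"[\w\s]+:", "ARTHUR: It is I, Arthur.")
print(speaker.group())  # ARTHUR:
```

Note that `[\w\s]+:` would not match a line starting with `SOLDIER #1:` because `#` is neither a word character nor whitespace.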

### Advanced tokenization with NLTK and regex

Regex groups and alternation using OR ("|")

* OR is represented using |
* Define a group using ()
* Define explicit character ranges using []


In [9]:
match_digits_and_words = r'(\d+|\w+)'

re.findall(match_digits_and_words, 'He has 11 cats.')

['He', 'has', '11', 'cats']

|pattern|matches|example|
|-|-|-|
|[A-Za-z]|upper and lowercase English alphabet|'ABCDEFghijk'|
|[0-9]|numbers from 0 to 9| 9 |
|[A-Za-z\-\.]+|upper and lowercase English alphabet, - and .|'My-Website.com'|
|(a-z)|the literal sequence a, - and z (a group, not a range)| 'a-z'|
|(\s+\|,)| spaces or a comma | ','|
In [12]:
my_str = 'match lowercase spaces nums like 12, but no commas'

re.match('[a-z0-9 ]+', my_str)

<_sre.SRE_Match object; span=(0, 35), match='match lowercase spaces nums like 12'>

In [19]:
my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"

pattern2 = r"(\w+|#\d|\?|!)"

re.match(pattern2, my_string)

<_sre.SRE_Match object; span=(0, 7), match='SOLDIER'>
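
`re.match` returns only the first token here because it reports a single match anchored at the start of the string. As a sketch of how the same pattern yields every token, the example below swaps in `re.findall`; this is a plain `re` stand-in for illustration, not NLTK's `regexp_tokenize`.

```python
import re

my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"
pattern2 = r"(\w+|#\d|\?|!)"

# findall scans the whole string and returns every non-overlapping match.
tokens = re.findall(pattern2, my_string)
print(tokens)
# ['SOLDIER', '#1', 'Found', 'them', '?', 'In', 'Mercea', '?',
#  'The', 'coconut', 's', 'tropical', '!']
```

The colon and the apostrophe are dropped because no alternative in the pattern matches them.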