### Intro to Regex

In [1]:
import re
re.match('abc', 'abcdef')

<_sre.SRE_Match object; span=(0, 3), match='abc'>

In [2]:
word_regex = r'\w+'
re.match(word_regex, 'hi there!')

<_sre.SRE_Match object; span=(0, 2), match='hi'>

<table>
  <thead>
    <tr>
      <th>pattern</th>
      <th>matches</th> 
      <th>example</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>\w+</td>
      <td>word</td> 
      <td>'Magic'</td>
    </tr>
    <tr>
      <td>\d</td>
      <td>digit</td> 
      <td>9</td>
    </tr>
    <tr>
      <td>\s</td>
      <td>whitespace</td>
      <td>' '</td>
    </tr>
    <tr>
      <td>.*</td>
      <td>wildcard</td> 
      <td>username74</td>
    </tr>
    <tr>
      <td>+ or *</td>
      <td>greedy match</td> 
      <td>'aaaaa'</td>
    </tr>
    <tr>
      <td>\S</td>
      <td><b>not</b> whitespace</td>
      <td>'no_spaces'</td>
    </tr>
    <tr>
      <td>[a-z]</td>
      <td>lowercase character class</td>
      <td>'abcdefg'</td>
    </tr>        
  </tbody>
</table>
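
The patterns in the table can be tried out with `re.findall` and `re.match`. This is a minimal sketch; the `sample` string is made up for illustration and is not part of the original notes.

```python
import re

# Hypothetical sample string, chosen to exercise the patterns in the table
sample = "Magic 9 username74"

words = re.findall(r"\w+", sample)          # runs of word characters
digits = re.findall(r"\d", sample)          # single digits
spaces = re.findall(r"\s", sample)          # single whitespace characters
non_space = re.findall(r"\S+", sample)      # runs of non-whitespace characters
lowercase = re.findall(r"[a-z]+", sample)   # runs of lowercase letters
everything = re.match(r".*", sample).group()  # greedy wildcard takes the whole line

print(words)       # ['Magic', '9', 'username74']
print(digits)      # ['9', '7', '4']
print(spaces)      # [' ', ' ']
print(non_space)   # ['Magic', '9', 'username74']
print(lowercase)   # ['agic', 'username']
print(everything)  # Magic 9 username74
```

Note that `[a-z]` on its own matches a single character; the `+` is what turns it into a run of lowercase letters.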

### Python's re module
- `re` module
- `split`: split a string on a regex
- `findall`: find all occurrences of a pattern in a string
- `search`: search for the first occurrence of a pattern anywhere in a string
- `match`: match a pattern at the beginning of a string
<br>
- Pass the pattern first, then the string
- Depending on the function, may return a list, an iterator, a string, or a match object (`None` if nothing matches)
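
A quick way to see what each function returns is to call them all on one string. This is a sketch only; the `text` string is a made-up example, and `re.finditer` is included as the standard `re` function that returns an iterator.

```python
import re

# Hypothetical example string (not from the original notes)
text = "Split on spaces. Find 2 numbers, or 10."

parts = re.split(r"\s+", text)       # list of strings
numbers = re.findall(r"\d+", text)   # list of strings
first = re.search(r"\d+", text)      # match object for the first hit, or None
at_start = re.match(r"Split", text)  # match object, because the string starts with "Split"
no_match = re.match(r"Find", text)   # None: "Find" is not at the start of the string
lazy_hits = re.finditer(r"\d+", text)  # iterator of match objects

print(parts)             # ['Split', 'on', 'spaces.', 'Find', '2', 'numbers,', 'or', '10.']
print(numbers)           # ['2', '10']
print(first.group())     # 2
print(at_start.group())  # Split
print(no_match)          # None
print([m.group() for m in lazy_hits])  # ['2', '10']
```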

In [3]:
re.split(r'\s+', 'Split on spaces.')

['Split', 'on', 'spaces.']

In [4]:
my_string = "Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?"

In [5]:
# Import the regex module
import re

# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

["Let's write RegEx", "  Won't that be fun", '  I sure think so', '  Can you find 4 sentences', '  Or perhaps, all 19 words', '']
['Let', 'RegEx', 'Won', 'Can', 'Or']
["Let's", 'write', 'RegEx!', "Won't", 'that', 'be', 'fun?', 'I', 'sure', 'think', 'so.', 'Can', 'you', 'find', '4', 'sentences?', 'Or', 'perhaps,', 'all', '19', 'words?']
['4', '19']


### Introduction to tokenization

### What is tokenization?
- Turning a string or document into **tokens** (smaller chunks)
- One step in preparing a text for NLP
- Many different theories and rules
- Rules can be created using regular expressions 
- Some examples:
 - Breaking out words or sentences
 - Separating punctuation
 - Separating all hashtags in a tweet
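
The example rules above can be sketched with plain regular expressions. The `tweet` string and both patterns are assumptions made for illustration; real tokenizers handle many more edge cases.

```python
import re

# Hypothetical tweet used only for illustration
tweet = "Loving #nlp and #python today!"

# Rule: keep hashtags whole, break out words, and separate punctuation
token_pattern = r"#\w+|\w+|[^\w\s]"
tokens = re.findall(token_pattern, tweet)

# Rule: pull out only the hashtags
hashtags = re.findall(r"#\w+", tweet)

print(tokens)    # ['Loving', '#nlp', 'and', '#python', 'today', '!']
print(hashtags)  # ['#nlp', '#python']
```

The order of the alternatives matters: `#\w+` comes first so that a hashtag is matched as one token before the punctuation rule can split off the `#`.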

### nltk library
- `nltk`: Natural Language Toolkit

In [6]:
from nltk.tokenize import word_tokenize

In [7]:
word_tokenize('Hi there!')

['Hi', 'there', '!']

### Why tokenize?
- Easier to map parts of speech
- Matching common words
- Removing unwanted tokens
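
As a rough illustration of the last two points, once a text is tokenized it is easy to drop unwanted tokens and count the words that remain. The `text` string, the regex tokenizer, and the `unwanted` set are all hypothetical stand-ins, not part of the original notes.

```python
import re
from collections import Counter

# Hypothetical text; a simple regex stands in for a real tokenizer
text = "The cat saw the dog, and the dog saw the cat!"
tokens = re.findall(r"\w+|[^\w\s]", text.lower())

# Remove unwanted tokens: punctuation first, then a small made-up stop list
words = [t for t in tokens if t.isalpha()]
unwanted = {"the", "and"}
kept = [w for w in words if w not in unwanted]

# Count how often each remaining word appears
counts = Counter(kept)

print(kept)          # ['cat', 'saw', 'dog', 'dog', 'saw', 'cat']
print(dict(counts))  # {'cat': 2, 'saw': 2, 'dog': 2}
```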

### Other nltk tokenizers
- `sent_tokenize`: tokenize a document into sentences
- `regexp_tokenize`: tokenize a string or document based on a regular expression pattern
- `TweetTokenizer`: special class just for tweet tokenization, allowing you to separate hashtags, mentions and lots of exclamation points!!!
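
A short sketch of two of these tokenizers, assuming `nltk` is installed. The `tweet` string is a made-up example. `regexp_tokenize` takes the text first and the pattern second, the reverse of the `re` functions.

```python
from nltk.tokenize import regexp_tokenize, TweetTokenizer

# Hypothetical tweet used only for illustration
tweet = "This is the best #nlp exercise ive found online! #python"

# Keep only the tokens that match the pattern: here, hashtags
hashtags = regexp_tokenize(tweet, r"#\w+")
print(hashtags)  # ['#nlp', '#python']

# TweetTokenizer is a class: create an instance, then call .tokenize()
tknzr = TweetTokenizer()
tweet_tokens = tknzr.tokenize(tweet)
print(tweet_tokens)
```

Unlike `word_tokenize`, `TweetTokenizer` is meant to keep hashtags such as `#nlp` as single tokens.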

### More regex practice
- Difference between `re.search()` and `re.match()`
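
The difference is easiest to see on one string: `re.match()` only tries the pattern at the start of the string, while `re.search()` scans the whole string. The `line` below is a made-up example in the style of the script.

```python
import re

# Hypothetical line, not taken from grail.txt
line = "KING ARTHUR: Bring me a coconut."

# re.match succeeds only when the pattern matches at position 0
match_start = re.match(r"KING", line)
match_middle = re.match(r"coconut", line)

# re.search finds the first occurrence anywhere in the string
search_middle = re.search(r"coconut", line)

print(match_start.group())    # KING
print(match_middle)           # None
print(search_middle.group())  # coconut
print(search_middle.start() == line.index("coconut"))  # True
```

Because both functions return `None` when nothing matches, it is worth checking the result before calling `.group()` or `.start()` on it.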

In [10]:
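# Read the script; the context manager closes the file when done
with open('grail.txt', 'r') as f:
    content = f.read()

# Split on the 'SCENE' marker and restore it at the start of the first scene
SCENES = content.split('SCENE')
scene_one = 'SCENE' + SCENES[1]
#scene_one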

In [13]:
# Import necessary modules
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)

{'one', 'why', 'two', 'every', 'Listen', 'by', 'tell', 'all', 'grips', ':', 'What', 'times', 'suggesting', 'Saxons', 'and', 'African', 'dorsal', 'son', 'martin', 'climes', 'speak', 'yeah', 'them', 'Yes', 'bring', 'Please', 'course', 'the', 'castle', 'So', 'since', 'Arthur', 'its', 'you', 'bangin', 'carrying', 'other', 'if', 'winter', 'wings', 'is', "n't", 'Ridden', 'are', 'horse', 'kingdom', 'he', 'second', 'KING', 'back', 'this', 'plover', 'Oh', 'I', 'temperate', 'use', "'re", '#', 'not', 'but', 'air-speed', 'to', 'coconut', 'they', '...', 'Supposing', 'feathers', 'Wait', 'go', 'European', 'yet', 'anyway', 'house', 'carry', 'forty-three', 'wants', 'mean', 'got', 'Found', 'maybe', 'creeper', 'trusty', '--', 'master', 'strangers', '1', 'SOLDIER', 'here', 'or', 'snows', 'You', 'carried', 'The', 'weight', 'Britons', 'get', 'where', 'They', "'s", 'SCENE', 'from', '!', 'defeator', 'ridden', 'Mercea', '.', 'that', 'ask', 'have', 'who', 'Not', 'A', 'warmer', 'do', 'be', 'empty', 'Are', 'order

In [14]:
# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())

# Write a regular expression to search for anything in square brackets: pattern1
pattern1 = r"\[.*\]"

# Use re.search to find the first text in square brackets
print(re.search(pattern1, scene_one))

# Find the script notation at the beginning of the fourth sentence and print it
# (one or more word or whitespace characters followed by a colon, e.g. "ARTHUR:")
pattern2 = r"[\w\s]+:"
print(re.match(pattern2, sentences[3]))

580 588
<_sre.SRE_Match object; span=(9, 32), match='[wind] [clop clop clop]'>
<_sre.SRE_Match object; span=(0, 7), match='ARTHUR:'>
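
The bracket search above returns `'[wind] [clop clop clop]'` rather than just `'[wind]'` because `.*` is greedy: it runs to the last `]` on the line. A minimal sketch of the greedy and non-greedy versions, using a made-up `line` modeled on that output (the real contents of `grail.txt` are not shown here):

```python
import re

# Hypothetical line modeled on the match printed above
line = "SCENE 1: [wind] [clop clop clop]"

# Greedy: .* extends to the LAST closing bracket on the line
greedy = re.search(r"\[.*\]", line).group()

# Non-greedy: .*? stops at the FIRST closing bracket
lazy = re.search(r"\[.*?\]", line).group()

# findall with the non-greedy pattern returns each bracketed note separately
all_notes = re.findall(r"\[.*?\]", line)

print(greedy)     # [wind] [clop clop clop]
print(lazy)       # [wind]
print(all_notes)  # ['[wind]', '[clop clop clop]']
```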
