## I. Regular Expressions (Regex)

- Strings with a special syntax 
- Allow us to match patterns in other strings
- Applications:
    - find weblinks in a document
    - parse email addresses
    - remove/replace unwanted characters

In [1]:
import re

### 1. Reference

|pattern | matches | example |
| :- | :- | :- |
| \w+| word | 'Magic' |
| \d | digit | 9 |
| \s | whitespace | ' ' |
| .* | wildcard | 'username74' |
| + or * | greedy match | 'aaaaaa' |
| \S| **not** space | 'no_spaces' |
| [a-z]| lowercase group | 'abcdefg' |
| [A-Z]| uppercase group | 'ABCDEFG' |
| [.?!]| symbol group | '.' or '?'|
|[A-Za-z] | upper and lowercase English alphabet | "ABCDEFghijk" |
| [0-9] | numbers from 0 to 9 | 9 |
| [A-Za-z\-\.] | upper and lowercase English alphabet, - and . | 'My-Website.com' |
| (a-z) | the literal sequence a, - and z (a group, not a range) | 'a-z' |
| (\s+\|,) | spaces or a comma | ',' |

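> **note**: `.` is a special character in regex (it matches any character), and `-` denotes a range inside `[]`. To match either one literally, put the escape character `\` directly before it. Inside `[]`, `.` is already literal, so escaping it there is optional but harmless.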
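
A minimal sketch of the escaping rule in the note above. The sample string `text` is made up for illustration; only `re.findall` from the standard library is used.

```python
import re

# Hypothetical sample text, not from the notes above
text = "My-Website.com or MyXWebsiteYcom"

# Unescaped "." matches ANY character, so both spellings are found
loose = re.findall(r"Website.com", text)

# Escaped "\." matches only a literal dot
strict = re.findall(r"Website\.com", text)

# Inside [], "\-" and "\." are treated as the literal characters - and .
with_class = re.findall(r"[A-Za-z\-\.]+", text)

print(loose)       # ['Website.com', 'WebsiteYcom']
print(strict)      # ['Website.com']
print(with_class)  # ['My-Website.com', 'or', 'MyXWebsiteYcom']
```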

**example** find anything in square brackets:
```python
pattern1 = r"\[.*\]"
```
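
To see what `pattern1` returns in practice, here is a small sketch. The string `line` and the lazy variant `.*?` are assumptions added for illustration; they show how the greedy `.*` from the reference table behaves when a line has more than one pair of brackets.

```python
import re

pattern1 = r"\[.*\]"

# Hypothetical sample line with two bracketed parts
line = "[intro] Hello there [outro]"

# Greedy: .* runs to the LAST closing bracket on the line
greedy = re.findall(pattern1, line)

# Lazy (assumed variant): .*? stops at the FIRST closing bracket
lazy = re.findall(r"\[.*?\]", line)

print(greedy)  # ['[intro] Hello there [outro]']
print(lazy)    # ['[intro]', '[outro]']
```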
#### Regex using or "|"
- OR is represented using `|`
- You can define a group using `()`
> only what is defined explicitly is matched
- You can define explicit character ranges using `[]`

**example** find any words or digits:
```python
match_digits_and_words = r"(\d+|\w+)"
```
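
A short sketch of the OR pattern in use, plus the difference between a group `()` and a character range `[]`. The sample strings are assumptions chosen for illustration.

```python
import re

match_digits_and_words = r"(\d+|\w+)"

# Hypothetical sample sentence; punctuation is skipped because
# neither \d+ nor \w+ matches it
tokens = re.findall(match_digits_and_words, "He has 11 cats.")

# A group matches its contents as a literal sequence...
as_group = re.findall(r"(a-z)", "a-z")

# ...while a character class matches any ONE character in the range
as_range = re.findall(r"[a-z]", "a-z")

print(tokens)    # ['He', 'has', '11', 'cats']
print(as_group)  # ['a-z']
print(as_range)  # ['a', 'z']
```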
### 2. re methods
- <span style= "color:indianred">Pattern first, string second </span>
- Depending on the method, may return a list, an iterator, a string, or a match object

#### match()
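- Matches a pattern against the beginning of a string. Takes the pattern as the first argument and the string as the second, and returns a match object (or `None` if there is no match).

- note: the uppercase form of a shorthand class negates it (e.g. `\S` matches anything that is **not** whitespace, `\D` anything that is not a digit)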
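
The two points above can be sketched as follows. The string `sample` is made up for illustration.

```python
import re

# Hypothetical sample string
sample = "Room 42, floor 7"

# re.match() anchors at the start of the string and returns a match object
m = re.match(r"\w+", sample)
first_word = m.group()

# Lowercase shorthand class vs. its uppercase negation
digits = re.findall(r"\d+", sample)      # runs of digits
non_digits = re.findall(r"\D+", sample)  # runs of anything but digits
non_space = re.findall(r"\S+", sample)   # runs of anything but whitespace

print(first_word)  # Room
print(digits)      # ['42', '7']
print(non_digits)  # ['Room ', ', floor ']
print(non_space)   # ['Room', '42,', 'floor', '7']
```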

#### search()
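- Scans the string for a pattern and returns a match object for the first place it matches (or `None` if it matches nowhere)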

> <span style= "color:royalblue"> **NOTE** on search vs. match: `search` scans through the ENTIRE string and returns the first match it finds, while `match` only succeeds if the pattern matches starting at the beginning of the string.</span> <br>
<span style= "color:indianred"> If you need to find a pattern that might not be at the beginning of the string, you should use search. If you want to be specific about the composition of the entire string, or at least the initial pattern, then you should use match.</span>
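
A minimal sketch of the difference, using a made-up string. `re.fullmatch` is not mentioned in these notes; it is included here as one assumed way to check the composition of the entire string.

```python
import re

# Hypothetical sample string
text = "abcdef"

# Pattern at the start: both functions find it
start_match = re.match(r"abc", text)
start_search = re.search(r"abc", text)

# Pattern in the middle: only search() finds it
middle_match = re.match(r"cd", text)
middle_search = re.search(r"cd", text)

# fullmatch() succeeds only if the WHOLE string fits the pattern
whole = re.fullmatch(r"[a-f]+", text)
partial = re.fullmatch(r"abc", text)

print(start_match.span(), start_search.span())  # (0, 3) (0, 3)
print(middle_match, middle_search.span())       # None (2, 4)
print(whole is not None, partial)               # True None
```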

#### split()
- split a string on a regex

**e.g.:**
```python
re.split(r'\s+', 'Split on spaces.')
# would return:
['Split', 'on', 'spaces.']
```
> *This can be used for tokenization, so you can preprocess text using regex*

#### findall()
- find all patterns in a string

In [2]:
### CODE:
my_string = "Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?"

In [4]:
# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

["Let's write RegEx", "  Won't that be fun", '  I sure think so', '  Can you find 4 sentences', '  Or perhaps, all 19 words', '']
['Let', 'RegEx', 'Won', 'Can', 'Or']
["Let's", 'write', 'RegEx!', "Won't", 'that', 'be', 'fun?', 'I', 'sure', 'think', 'so.', 'Can', 'you', 'find', '4', 'sentences?', 'Or', 'perhaps,', 'all', '19', 'words?']
['4', '19']


## II. Tokenization
- <span style= "color:indianred">For `re` functions: pattern first, string second. For nltk's `regexp_tokenize`: string first, pattern second </span>

#### overview
- Transforming a string or document into tokens (smaller chunks)
- One step in preparing a text for NLP
- Many different theories and rules
- You can create your own rules using regular expressions
- Examples:
    - breaking out words or sentences
    - separating punctuation
    - separating parts, such as all hashtags in a tweet
  
#### value of tokenization
- Easier to map parts of speech
- Matching common words
- Removing unwanted tokens
> e.g.: "I don't like Sam's shoes." <br>
reveals **negation** in `"n't"` and **possession** in `"'s"` in the following output: <br>
```python
["I", "do","n't","like","Sam","'s","shoes","."]
```
  
### nltk library
`nltk`: natural language toolkit

```python
from nltk.tokenize import word_tokenize
word_tokenize("Hi there!")
```
would output: 
```python
['Hi', 'there', '!']
```

#### other nltk tokenizers

`sent_tokenize`: tokenize a document into sentences <br>
`regexp_tokenize`: tokenize a string or document based on a regular expression pattern <br>
`TweetTokenizer`: special class for tweet tokenization, allows separation of hashtags, mentions and lots of exclamation points!!!

In [5]:
my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"

In [7]:
from nltk.tokenize import regexp_tokenize

In [8]:
regexp_tokenize(my_string, r"(\w+|#\d|\?|!)")

['SOLDIER',
 '#1',
 'Found',
 'them',
 '?',
 'In',
 'Mercea',
 '?',
 'The',
 'coconut',
 's',
 'tropical',
 '!']

#### how to find mentions and hashtags:
```python
r"([@#]\w+)"
```
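
A quick sketch of this pattern with the standard `re` module. The tweet text is a made-up example, and the separate hashtag and mention patterns are assumed variations for illustration.

```python
import re

# Hypothetical tweet, not taken from any dataset
tweet = "Thanks @sam for the #nlp tips! #python"

# Mentions and hashtags together
tags = re.findall(r"([@#]\w+)", tweet)

# Assumed variations: only hashtags, or only mentions
hashtags = re.findall(r"#\w+", tweet)
mentions = re.findall(r"@\w+", tweet)

print(tags)      # ['@sam', '#nlp', '#python']
print(hashtags)  # ['#nlp', '#python']
print(mentions)  # ['@sam']
```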

#### How to tokenize tweets
```python
# Import the necessary module
from nltk.tokenize import TweetTokenizer

# Hypothetical sample data; replace with your own list of tweet strings
tweets = ["Thanks @sam for the #nlp tips!", "#python is fun :)"]

# Use the TweetTokenizer to tokenize each tweet (one token list per tweet)
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)
```

## III. Bag-of-Words

