## I. Regular Expressions (Regex)

- Strings with a special syntax 
- Allow us to match patterns in other strings
- Applications:
    - find weblinks in a document
    - parse email addresses
    - remove/replace unwanted characters

In [1]:
import re

### 1. Reference

|pattern | matches | example |
| :- | :- | :- |
| \w+| word | 'Magic' |
| \d | digit | 9 |
| \s | whitespace | ' ' |
| .* | wildcard | 'username74' |
| + or * | greedy match | 'aaaaaa' |
| \S| **not** space | 'no_spaces' |
| [a-z]| lowercase group | 'abcdefg' |
| [A-Z]| uppercase group | 'ABCDEFG' |
| [.?!]| symbol group | '.' or '?'|
| [A-Za-z] | upper and lowercase English alphabet | "ABCDEFghijk" |
| [0-9] | numbers from 0 to 9 | 9 |
| [A-Za-z\-\.] | upper and lowercase English alphabet, - and . | 'My-Website.com' |
| (a-z) | the literal sequence a, - and z (parentheses group, they do not define a range) | 'a-z' |
| (\s+\|,) | spaces or a comma | ',' |

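> **note**: outside a character class, `.` is a special character in regex, so to look for a literal dot an escape character `\` is needed directly before it. Inside `[]`, `.` is already literal and `-` is special only between two characters (where it defines a range), so the escapes in `[A-Za-z\-\.]` are optional but harmless.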

**example** find anything in square brackets:
```python
pattern1 = r"\[.*\]"
```
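
A quick sketch of how this pattern behaves, since `.*` is greedy: with several bracketed parts on one line it grabs everything from the first `[` to the last `]`. The sample text and the lazy variant `.*?` are illustrative additions, not part of the original notes.

```python
import re

text = "Tags: [alpha] and [beta]"  # made-up sample string

# Greedy: .* runs to the LAST closing bracket on the line
greedy = re.findall(r"\[.*\]", text)

# Lazy: .*? stops at the FIRST closing bracket it can reach
lazy = re.findall(r"\[.*?\]", text)

print(greedy)  # ['[alpha] and [beta]']
print(lazy)    # ['[alpha]', '[beta]']
```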
#### Regex using OR ("|")
- OR is represented using `|`
- You can define a group using `()`
> only what is defined explicitly is matched
- You can define explicit character ranges using `[]`

**example** find any words or digits:
```python
match_digits_and_words = r"(\d+|\w+)"
```
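
A small sketch of this pattern with `re.findall`, using made-up input strings. Note that `\w` also matches digits, so the order of the alternatives only matters when a token starts with a digit.

```python
import re

match_digits_and_words = r"(\d+|\w+)"

# Punctuation is skipped because neither alternative matches it
tokens = re.findall(match_digits_and_words, "He has 11 cats and 2 dogs!")
print(tokens)  # ['He', 'has', '11', 'cats', 'and', '2', 'dogs']

# 'room42' starts with a letter, so \w+ takes the whole token;
# '42room' starts with digits, so \d+ wins first and splits it in two
mixed = re.findall(match_digits_and_words, "room42 42room")
print(mixed)   # ['room42', '42', 'room']
```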
### 2. re methods
- <span style= "color:indianred">Pattern first, string second </span>
- Depending on the function, may return a list, an iterator, a string, or a match object

#### match()
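- matches a pattern against the beginning of a string, taking the pattern as the first argument and the string as the second, and returns a match object (or `None` if there is no match).

- note: writing a shorthand class in uppercase negates it, e.g. `\S` (not whitespace), `\D` (not a digit), `\W` (not a word character)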
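
To illustrate the uppercase/lowercase relationship, here is a minimal sketch on a made-up string; each uppercase class matches exactly what its lowercase counterpart does not.

```python
import re

sample = "Room 42, floor 7"  # illustrative input

digits = re.findall(r"\d+", sample)      # runs of digits
non_digits = re.findall(r"\D+", sample)  # runs of anything that is NOT a digit
non_spaces = re.findall(r"\S+", sample)  # runs of anything that is NOT whitespace

print(digits)      # ['42', '7']
print(non_digits)  # ['Room ', ', floor ']
print(non_spaces)  # ['Room', '42,', 'floor', '7']
```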

#### search()
- search for a pattern
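
A minimal sketch contrasting `re.search` with `re.match` on a made-up string: `match` only succeeds when the pattern fits at position 0, while `search` scans forward until it finds a fit.

```python
import re

text = "abcde"

at_start = re.match(r"cd", text)   # None: "cd" is not at the beginning
found = re.search(r"cd", text)     # match object: "cd" occurs later on
prefix = re.match(r"ab", text)     # match object: "ab" is at the beginning

print(at_start)        # None
print(found.span())    # (2, 4)
print(prefix.group())  # ab
```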

> <span style= "color:royalblue"> **NOTE** on search vs. match: `search` will go through the ENTIRE string to look for match options, while `match` tries to match from the beginning of a string until it can no longer match.</span> <br>
<span style= "color:indianred"> If you need to find a pattern that might not be at the beginning of the string, you should use search. If you want to be specific about the composition of the entire string, or at least the initial pattern, then you should use match.</span>

#### split()
- split a string on a regex

**e.g.:**
```python
re.split(r'\s+', 'Split on spaces.')
# would return:
['Split', 'on', 'spaces.']
```
> *This can be used for tokenization, so you can preprocess text using regex*

#### findall()
- find all non-overlapping matches of a pattern in a string and return them as a list

In [2]:
### CODE:
my_string = "Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?"

In [4]:
# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

["Let's write RegEx", "  Won't that be fun", '  I sure think so', '  Can you find 4 sentences', '  Or perhaps, all 19 words', '']
['Let', 'RegEx', 'Won', 'Can', 'Or']
["Let's", 'write', 'RegEx!', "Won't", 'that', 'be', 'fun?', 'I', 'sure', 'think', 'so.', 'Can', 'you', 'find', '4', 'sentences?', 'Or', 'perhaps,', 'all', '19', 'words?']
['4', '19']
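
The sentence split above leaves leading spaces and a trailing empty string in the result. One possible cleanup, sketched here on a shortened copy of `my_string`, is to strip each piece and drop the empty ones.

```python
import re

# Shortened copy of my_string, for illustration
my_string = "Let's write RegEx!  Won't that be fun?  I sure think so."

parts = re.split(r"[.?!]", my_string)

# Strip surrounding whitespace and discard pieces that end up empty
sentences = [p.strip() for p in parts if p.strip()]

print(sentences)  # ["Let's write RegEx", "Won't that be fun", 'I sure think so']
```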


## II. Tokenization
- <span style= "color:indianred">For `re` functions: pattern first, string second. `regexp_tokenize` from `nltk` takes the string first, then the pattern </span>

#### overview
- Transforming a string or document into tokens (smaller chunks)
- One step in preparing a text for NLP
- Many different theories and rules
- You can create your own rules using regular expressions
- Examples:
    - breaking out words or sentences
    - separating punctuation
    - separating parts, such as all hashtags in a tweet
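
As a rough sketch of "creating your own rules", the function below separates words from punctuation using only `re`. The name `simple_tokenize` is hypothetical, and the rule is much cruder than what a library tokenizer does (for instance, it breaks contractions at the apostrophe).

```python
import re

def simple_tokenize(text):
    """Return word tokens and single punctuation marks, in order."""
    # \w+ matches a run of word characters; [^\w\s] matches one character
    # that is neither a word character nor whitespace (i.e. punctuation)
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Hi there! How's it going?")
print(tokens)  # ['Hi', 'there', '!', 'How', "'", 's', 'it', 'going', '?']
```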
  
#### value of tokenization
- Easier to match parts of speech
- Matching common words
- Removing unwanted tokens
> e.g.: "I don't like Sam's shoes." <br>
reveals **negation** in `"n't"` and **possession** in `"'s"` in the following output: <br>
```python
["I", "do","n't","like","Sam","'s","shoes","."]
```
  
### nltk library
`nltk`: natural language toolkit

```python
from nltk.tokenize import word_tokenize
word_tokenize("Hi there!")
```
would output: 
```python
['Hi', 'there', '!']
```

#### other nltk tokenizers

`sent_tokenize`: tokenize a document into sentences <br>
`regexp_tokenize`: tokenize a string or document based on a regular expression pattern <br>
`TweetTokenizer`: special class for tweet tokenization, allows separation of hashtags, mentions and lots of exclamation points!!!

In [5]:
my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"

In [7]:
from nltk.tokenize import regexp_tokenize

In [8]:
regexp_tokenize(my_string, r"(\w+|#\d|\?|!)")

['SOLDIER',
 '#1',
 'Found',
 'them',
 '?',
 'In',
 'Mercea',
 '?',
 'The',
 'coconut',
 's',
 'tropical',
 '!']

#### how to find mentions and hashtags:
```python
r"([@#]\w+)"
```
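
A short sketch applying this pattern with `re.findall`. The tweet text is invented for illustration; note that the pattern would also pick up the `@domain` part of an email address, so it is only a starting point.

```python
import re

tweet = "Loving the new release from @gensim_dev! #nlp #python"  # made-up tweet

# Mentions and hashtags together, in order of appearance
both = re.findall(r"([@#]\w+)", tweet)

# Or each kind on its own
hashtags = re.findall(r"#\w+", tweet)
mentions = re.findall(r"@\w+", tweet)

print(both)      # ['@gensim_dev', '#nlp', '#python']
print(hashtags)  # ['#nlp', '#python']
print(mentions)  # ['@gensim_dev']
```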

#### How to tokenize tweets
```python
# Import the necessary modules
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer
# Use the TweetTokenizer to tokenize each tweet: the result is a list of token lists
# (`tweets` is assumed to be a list of tweet strings defined earlier)
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)
```

## III. Bag-of-Words

- Basic method for finding topics in a text
- First create tokens using tokenization
- ... then count tokens
- Can be a good way to identify word significance

In [10]:
from nltk.tokenize import word_tokenize
from collections import Counter
counted_words = Counter(word_tokenize("""The cat is in the box. The cat box."""))
counted_words

Counter({'The': 2, 'cat': 2, 'is': 1, 'in': 1, 'the': 1, 'box': 2, '.': 2})

In [15]:
counted_words.most_common(2)

[('The', 2), ('cat', 2)]
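
In the counts above, 'The' and 'the' are treated as different tokens. A minimal sketch of the same count after lowercasing, using a plain regex in place of `word_tokenize` so that it runs without nltk (that substitution is an assumption of this sketch, and it also drops the punctuation tokens):

```python
import re
from collections import Counter

text = "The cat is in the box. The cat box."

# Lowercase first so 'The' and 'the' collapse into one token
tokens = re.findall(r"\w+", text.lower())
counts = Counter(tokens)

print(counts.most_common(3))  # [('the', 3), ('cat', 2), ('box', 2)]
```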

## IV. Simple Text Preprocessing

In [21]:
# Import WordNetLemmatizer and stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
text = """The cat is in the box. The cat likes the box.
                The box is over the cat"""

In [22]:
# Tokenize the lowercased text with a list comprehension
# str.isalpha() keeps only tokens made up entirely of alphabetic characters
lower_tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()] 

In [24]:
# Retain alphabetic words: alpha_only (redundant here, since lower_tokens is already filtered)
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in stopwords.words('english')]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))

[('cat', 3), ('box', 3), ('like', 1)]
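
To see what lemmatization contributes, here is a stdlib-only sketch of the same pipeline without the lemmatizer. `STOP_WORDS` is a tiny hand-written stand-in for `stopwords.words('english')`, not the real list, and the regex replaces `word_tokenize`; both are assumptions made so the example runs without nltk.

```python
import re
from collections import Counter

# Hypothetical, minimal stop-word set (the nltk list is much longer).
# A set gives fast membership tests and is built only once.
STOP_WORDS = {"the", "is", "in", "over"}

text = """The cat is in the box. The cat likes the box.
The box is over the cat"""

tokens = re.findall(r"[a-z]+", text.lower())       # alphabetic tokens only
no_stops = [t for t in tokens if t not in STOP_WORDS]
bow = Counter(no_stops)

print(bow.most_common(10))  # [('cat', 3), ('box', 3), ('likes', 1)]
```

Without the lemmatizer, 'likes' stays as it is instead of being reduced to 'like'.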


## V. Gensim intro

- Popular open-source NLP library
- Uses top academic models to perform complex tasks
    - building document or word vectors
    - performing topic identification and document comparison
- `gensim` models can be easily saved, updated, and reused
- `gensim` dictionaries can also be updated
- The example below builds a more advanced, feature-rich bag-of-words.


#### LDA visualization
- LDA = Latent Dirichlet Allocation, a topic-modeling technique
- can be used as part of preprocessing

#### BOW in Gensim

In [25]:
from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize

In [26]:
my_docs = [
    'The movie was about a spaceship and aliens',
    'I really liked the movie!',
    'Awesome action scenes, but boring characters.',
    'The movie was awful! I hate alien films.',
    'Space is cool! I liked the movie.',
    'More space films, please!'
]

In [27]:
tokenized_docs = [word_tokenize(doc.lower()) for doc in my_docs]
dictionary = Dictionary(tokenized_docs) # adding the tokens to a dict
dictionary.token2id # how to access the gensim dict of tokens and their ids

{'a': 0,
 'about': 1,
 'aliens': 2,
 'and': 3,
 'movie': 4,
 'spaceship': 5,
 'the': 6,
 'was': 7,
 '!': 8,
 'i': 9,
 'liked': 10,
 'really': 11,
 ',': 12,
 '.': 13,
 'action': 14,
 'awesome': 15,
 'boring': 16,
 'but': 17,
 'characters': 18,
 'scenes': 19,
 'alien': 20,
 'awful': 21,
 'films': 22,
 'hate': 23,
 'cool': 24,
 'is': 25,
 'space': 26,
 'more': 27,
 'please': 28}

In [29]:
### creating a gensim corpus

corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
corpus 
    # first tuple item = token id from dict
    # second tuple item = token frequency in doc

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(4, 1), (6, 1), (8, 1), (9, 1), (10, 1), (11, 1)],
 [(12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)],
 [(4, 1),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (13, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1)],
 [(4, 1), (6, 1), (8, 1), (9, 1), (10, 1), (13, 1), (24, 1), (25, 1), (26, 1)],
 [(8, 1), (12, 1), (22, 1), (26, 1), (27, 1), (28, 1)]]
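
To make the id/count pairs less opaque, here is a pure-Python sketch that mimics what the dictionary and `doc2bow` produce for the first two documents. The helper names are hypothetical and this is not gensim's implementation; the "sorted within each document" id assignment is inferred from the `token2id` output above.

```python
from collections import Counter

def build_token2id(tokenized_docs):
    """Assign ids document by document, new tokens in sorted order."""
    token2id = {}
    for doc in tokenized_docs:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [
    ["the", "movie", "was", "about", "a", "spaceship", "and", "aliens"],
    ["i", "really", "liked", "the", "movie", "!"],
]
token2id = build_token2id(docs)
bow_1 = to_bow(docs[1], token2id)
print(bow_1)  # [(4, 1), (6, 1), (8, 1), (9, 1), (10, 1), (11, 1)]
```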

### Tf-idf with gensim

- Term frequency - inverse document frequency
- Allows you to determine the most important words in each document
- Each corpus may have shared words beyond stop words
- These words can be down-weighted in importance
- Example from astronomy: "Sky"
- Ensures that the most common words across the corpus don't show up as keywords
- Keeps words that are frequent in a specific document weighted high, while words that are frequent across the whole corpus are weighted low

#### Tf-idf formula

$$ w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_{i}}\right) $$

$ w_{i,j} = $ tf-idf weight for token $i$ in document $j$
<br>
<br>
$ tf_{i,j} = $ number of occurrences of token $i$ in document $j$
<br>
<br>
$ df_{i} = $ number of documents that contain token $i$
<br>
<br>
$N = $ total number of documents

In [31]:
from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus)

In [32]:
tfidf[corpus[1]]

[(4, 0.1746298276735174),
 (6, 0.1746298276735174),
 (8, 0.1746298276735174),
 (9, 0.29853166221463673),
 (10, 0.47316148988815415),
 (11, 0.7716931521027908)]
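
The weights above can be reproduced by hand as a sanity check on the formula. This sketch assumes a base-2 logarithm and a final scaling of each document vector to unit length; the printed numbers are consistent with that, but treat both settings as assumptions about the model's defaults. The document frequencies are counted from the corpus output above.

```python
import math

N = 6  # total number of documents

# df for each token id in corpus[1], counted from the corpus listing:
# 4 'movie', 6 'the', 8 '!', 9 'i', 10 'liked', 11 'really'
df = {4: 4, 6: 4, 8: 4, 9: 3, 10: 2, 11: 1}
tf = {token_id: 1 for token_id in df}  # every token occurs once in this doc

# Raw weights: tf * log2(N / df)
raw = {i: tf[i] * math.log2(N / df[i]) for i in df}

# Scale the document vector to unit (L2) length
norm = math.sqrt(sum(w * w for w in raw.values()))
weights = [(i, w / norm) for i, w in sorted(raw.items())]

for token_id, weight in weights:
    print(token_id, round(weight, 4))
```

The rarest token in the corpus ('really', id 11) gets the highest weight, while tokens that appear in four of the six documents get the lowest.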