# Natural Language Parsing with Regex

```compile()``` from ```re```  
Takes a **regex** as an argument and compiles the pattern into a **regex object**.

In [3]:
import re

# define characters
character_1 = "Dorothy"
character_2 = "Henry"

# compile regular expression, any upper- or lower-case letter x 7
regular_expression = re.compile("[A-Za-z]{7}")

```match()``` from ```re```  
Takes a **string** as an argument and looks for a **single match** to the regex that starts at the **beginning of the string**.  
If matched, returns a **match object**; otherwise returns ```None```.


In [4]:
# check for a match to character_1
result_1 = regular_expression.match(character_1)
print(result_1)

# check for a match to character_2
result_2 = re.match("[A-Za-z]{7}", character_2)
print(result_2)

<re.Match object; span=(0, 7), match='Dorothy'>
None
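`Henry` returns ```None``` because it has only five letters, while the pattern requires exactly seven. Note also that ```match()``` anchors at the start of the string but does not require the whole string to match. A minimal sketch of both points follows; the string `"Dorothy Gale"` is an illustrative assumption, not part of the cells above, and ```fullmatch()``` is the standard ```re``` method for whole-string matching.

```python
import re

pattern = re.compile("[A-Za-z]{7}")

# "Henry" has only 5 letters, so 7 consecutive letters cannot match
print(pattern.match("Henry"))  # None

# match() anchors at the start but ignores anything after the match
prefix_result = pattern.match("Dorothy Gale")
print(prefix_result.group())  # Dorothy

# fullmatch() requires the entire string to fit the pattern
print(pattern.fullmatch("Dorothy Gale"))  # None
print(pattern.fullmatch("Dorothy").group())  # Dorothy
```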


Access the matched text with ```group()```.

In [5]:
# store and print the matched text
match_1 = result_1.group()
print(match_1)

Dorothy
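Because a failed match returns ```None```, calling ```group()``` on the result directly can raise an `AttributeError`. The sketch below guards against that; `matched_text` is a hypothetical helper introduced here for illustration only.

```python
import re

pattern = re.compile("[A-Za-z]{7}")

def matched_text(name):
    """Return the matched text, or None when the pattern does not match."""
    result = pattern.match(name)
    if result is None:  # .group() on None would raise AttributeError
        return None
    return result.group()

print(matched_text("Dorothy"))  # Dorothy
print(matched_text("Henry"))    # None

# a match object also records where the match sits in the string
result = pattern.match("Dorothy")
print(result.span(), result.start(), result.end())  # (0, 7) 0 7
```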


```search()```  
Scans the string left to right and returns a **match object** for the **first match** found anywhere in the string (```None``` if there is none).

In [16]:
name = "Michael"
result = re.search(r"\w{3}", name)  # raw string keeps the backslash in \w literal
print(result)

<re.Match object; span=(0, 3), match='Mic'>
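With `"Michael"` the first match happens to sit at position 0, so ```search()``` and ```match()``` would agree. A small sketch of where they differ, using an illustrative string (`"Dr. Michael"`) that is not from the cells above:

```python
import re

title_name = "Dr. Michael"  # illustrative string, assumed for this example

# match() only tries position 0: "Dr." is not three word characters
match_result = re.match(r"\w{3}", title_name)
print(match_result)  # None

# search() keeps scanning and finds the first run of three word characters
search_result = re.search(r"\w{3}", title_name)
print(search_result.group(), search_result.span())  # Mic (4, 7)
```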


```findall()```  
Returns a list of **all non-overlapping matches**.

In [18]:
result_a = re.findall(r'\w{3}', name)
print(result_a)

['Mic', 'hae']
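The final `l` of `"Michael"` is missing from the list because scanning resumes after each match and only one character remains. The sketch below shows the match positions with ```finditer()``` and, as one possible workaround, a lookahead pattern that collects overlapping matches; the variable names are illustrative.

```python
import re

name = "Michael"

# findall() resumes scanning after each match, so matches never overlap;
# the trailing "l" is dropped because fewer than 3 characters remain
print(re.findall(r"\w{3}", name))  # ['Mic', 'hae']

# finditer() yields match objects, exposing each match's position
spans = [(m.group(), m.span()) for m in re.finditer(r"\w{3}", name)]
print(spans)  # [('Mic', (0, 3)), ('hae', (3, 6))]

# a capturing group inside a lookahead collects overlapping matches
overlapping = re.findall(r"(?=(\w{3}))", name)
print(overlapping)  # ['Mic', 'ich', 'cha', 'hae', 'ael']
```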


<a name="POS"> </a>
## Part-of-Speech (POS) Tagging

```pos_tag()```  
Takes a **list of words** as an argument, returns a **list of tuples** (word, tag).

In [30]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

text = '   Natural Language Processing is awesome!'

# lower case letters
text_lower = text.lower()

# remove runs of 2-9 whitespace characters (here, the leading spaces)
text_nowhite = re.sub(r'\s{2,9}', '', text_lower)
                      
# remove punctuation (only the exclamation mark appears in this text)
text_nopunc = re.sub(r'!', '', text_nowhite)

# split sentence into individual words
text_tokenized = word_tokenize(text_nopunc)

# remove stopwords
stop_words = set(stopwords.words('english'))
text_clean = [token for token in text_tokenized if token not in stop_words]

print(text_clean)

['natural', 'language', 'processing', 'awesome']


In [33]:
from nltk import pos_tag

pos_tagged = pos_tag(text_clean)
print(pos_tagged)

[('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('awesome', 'NN')]


<a name="Chunking"> </a>
### Chunking
Grouping words by their POS tag.

**Chunk grammar**: the regex for finding chunks (non-overlapping matches).

In [34]:
from nltk import RegexpParser

# AN = ADJECTIVE followed by NOUN
chunk_grammar = "AN: {<JJ><NN>}"

# instantiate RegexpParser with the defined chunk grammar
chunk_parser = RegexpParser(chunk_grammar)

# parse the tagged text
chunked_text = chunk_parser.parse(pos_tagged)
print(chunked_text)

(S (AN natural/JJ language/NN) processing/NN awesome/NN)


Visualize chunking results with ```Tree```.

In [36]:
from nltk import Tree

Tree.fromstring(str(chunked_text)).pretty_print()

                  S                                
       ___________|__________________               
      |           |                  AN            
      |           |           _______|_______       
processing/NN awesome/NN natural/JJ     language/NN



Certain types of chunking are **linguistically helpful** for determining **meaning** and **bias**.

### Chunking Noun Phrases (NP-chunking)
1. Begins with an optional **determiner** ```DT```, which specifies the noun being referenced,
2. followed by any number of **adjectives** ```JJ```, which describe the noun,
3. ends with a **noun** ```NN```.

In [37]:
# ? = 0 or 1, * = 0 or more
np_grammar = "NP: {<DT>?<JJ>*<NN>}"
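The grammar above can be tried out with ```RegexpParser```. This is a minimal sketch, assuming `nltk` is installed; the tagged sentence is written by hand for illustration so that no tagger model is required, and the names `tagged_sentence`, `np_parser`, `np_chunked`, and `noun_phrases` are hypothetical.

```python
from nltk import RegexpParser

# hand-tagged tokens (assumed for illustration) so no tagger model is needed
tagged_sentence = [
    ("the", "DT"), ("little", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("a", "DT"), ("cat", "NN"),
]

np_grammar = "NP: {<DT>?<JJ>*<NN>}"
np_parser = RegexpParser(np_grammar)
np_chunked = np_parser.parse(tagged_sentence)
print(np_chunked)

# collect the words of every NP subtree as a plain string
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in np_chunked.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```

Both `the little dog` and `a cat` are chunked here: the adjective is optional because ```<JJ>*``` also allows zero adjectives.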

### Chunking Verb Phrases (VP-chunking)
A phrase that contains a **verb** and its **complements**, **objects**, or **modifiers**.

Finding all verb phrases can give insight into what kinds of **actions** different characters take or how those actions are **described** by the author.

In [38]:
# VB.* matches any verb tag (VB, VBD, VBG, ...), RB.? matches any adverb tag (RB, RBR, RBS)
# grammar a: verb followed by a noun phrase and an optional adverb
vp_grammar_a = "VP: {<VB.*><DT>?<JJ>*<NN><RB.?>?}"
# grammar b: noun phrase followed by a verb and an optional adverb
vp_grammar_b = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"
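The two grammars differ only in word order, so each one picks up a different phrase from the same sentence. The following is a sketch under stated assumptions: `nltk` is installed, the sentence is hand-tagged for illustration, `chunk_phrases` is a hypothetical helper, and the chunk label (`VP` here) is arbitrary and only needs to match the label used when filtering subtrees.

```python
from nltk import RegexpParser

# hand-tagged tokens (assumed for illustration)
tagged_sentence = [
    ("the", "DT"), ("wizard", "NN"), ("spoke", "VBD"), ("loudly", "RB"),
    ("and", "CC"), ("she", "PRP"), ("watched", "VBD"),
    ("the", "DT"), ("old", "JJ"), ("film", "NN"),
]

vp_grammar_a = "VP: {<VB.*><DT>?<JJ>*<NN><RB.?>?}"  # verb, then noun phrase
vp_grammar_b = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"  # noun phrase, then verb

def chunk_phrases(grammar, tagged, label="VP"):
    """Parse tagged tokens and return the text of each chunk with this label."""
    tree = RegexpParser(grammar).parse(tagged)
    return [
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == label)
    ]

phrases_a = chunk_phrases(vp_grammar_a, tagged_sentence)
phrases_b = chunk_phrases(vp_grammar_b, tagged_sentence)
print(phrases_a)
print(phrases_b)
```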

### Chunk Filtering
Chunk filtering (called *chinking* in NLTK) lets you specify which POS tags you **do not want** in a chunk and removes them.

1. Chunk an entire sentence together
2. Indicate which POS tags to filter out

It is an alternative way to search through a text.

In [None]:
# matches every POS in the sentence
chunk_filtering_grammar = """NP: {<.*>+}
                        }<VB.?|IN>{""" # filter any verbs or prepositions
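Since the cell above was not executed, here is a minimal sketch of what the grammar does when parsed. It assumes `nltk` is installed; the sentence is hand-tagged for illustration, and `filter_parser`, `filtered`, and `kept_chunks` are hypothetical names.

```python
from nltk import RegexpParser

# hand-tagged tokens (assumed for illustration)
tagged_sentence = [
    ("the", "DT"), ("little", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("a", "DT"), ("cat", "NN"),
]

# step 1: {<.*>+} chunks the whole sentence
# step 2: }<VB.?|IN>{ removes verbs and prepositions from that chunk
chunk_filtering_grammar = """NP: {<.*>+}
                        }<VB.?|IN>{"""

filter_parser = RegexpParser(chunk_filtering_grammar)
filtered = filter_parser.parse(tagged_sentence)
print(filtered)

# the removed tokens split the original chunk into the pieces that remain
kept_chunks = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in filtered.subtrees(filter=lambda t: t.label() == "NP")
]
print(kept_chunks)
```

On this sentence the result matches what the NP grammar would find, but the approach is reversed: instead of describing what a chunk looks like, the grammar describes what must be excluded.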