# Natural Language Processing Fundamentals in Python

 
NLP focuses on making sense of human language using computers and statistics. Common uses include topic identification and text classification.

NLP Applications: 

* Sentiment analysis
* Chatbots
* Automatic translation

## Regular expressions & word tokenization

A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You
can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations
such as *.txt to find all text files in a file manager. The regex equivalent is «.*\.txt».[1](https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf)
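
As a small illustration of the wildcard analogy, the sketch below filters a list of file names with the regex equivalent of *.txt. The file names are invented for this example, and re.fullmatch is used so that the whole name has to fit the pattern.

```python
import re

# Hypothetical file names, invented for this example
filenames = ["notes.txt", "report.pdf", "todo.txt", "txt_backup.zip"]

# Regex equivalent of the wildcard *.txt
pattern = r".*\.txt"

# fullmatch only succeeds when the entire name fits the pattern
text_files = [name for name in filenames if re.fullmatch(pattern, name)]
print(text_files)  # ['notes.txt', 'todo.txt']
```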

### Common patterns

| Pattern        | Matches                                       | Example          |
|:--------------:|:---------------------------------------------:|:----------------:|
| \w+            | word                                          | 'Magic'          |
| \d             | digit                                         | 9                |
| \s             | whitespace                                    | ' '              |
| .*             | wildcard                                      | 'username74'     |
| + or *         | greedy match                                  | 'aaaaaa'         |
| \S             | not space                                     | 'no_spaces'      |
| [a-z]          | lowercase                                     | 'abcdefg'        |
| [A-Za-z]+      | upper and lowercase English alphabet          | 'ABCDEFghijk'    |
| [0-9]          | numbers from 0 to 9                           | 9                |
| [A-Za-z\-\.]+  | upper and lowercase English alphabet, - and . | 'My-Website.com' |
| (a-z)          | a, - and z                                    | 'a-z'            |
| (\s+\|,)       | spaces or a comma                             | ', '             |
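
To make a few of these patterns concrete, here is a minimal sketch that runs them through re.findall. The sample strings are made up for illustration, and the "spaces or a comma" pattern is written without the parentheses so that findall returns the whole match.

```python
import re

# Sample text invented for this example
text = "My-Website.com has 9 pages"

words = re.findall(r"\w+", text)                # runs of word characters
digits = re.findall(r"\d", text)                # single digits
non_space = re.findall(r"\S+", text)            # runs of non-whitespace
domain_like = re.findall(r"[A-Za-z\-\.]+", text)  # letters, - and .
separators = re.findall(r"\s+|,", "a, b")       # spaces or a comma

print(words)        # ['My', 'Website', 'com', 'has', '9', 'pages']
print(digits)       # ['9']
print(non_space)    # ['My-Website.com', 'has', '9', 'pages']
print(domain_like)  # ['My-Website.com', 'has', 'pages']
print(separators)   # [',', ' ']
```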

### How to do it with Python? 

* re module
* split: split a string on regex
* findall: find all patterns in a string
* search: search for a pattern
* match: match a pattern at the beginning of a string
* Define pattern first, and the string second
> Depending on the function used, it may return a list, an iterator, a string, or a match object (None if nothing matches)
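
The return types mentioned above can be checked with a short sketch. re.finditer is not in the list above; it is included here only as an assumption about which function the "iterator" case refers to.

```python
import re

text = "Let's write RegEx!"

pieces = re.split(r"\s+", text)        # list of strings
words = re.findall(r"\w+", text)       # list of strings
word_iter = re.finditer(r"\w+", text)  # iterator of match objects
found = re.search(r"write", text)      # match object
missing = re.search(r"Python", text)   # None, because nothing matches

print(pieces)        # ["Let's", 'write', 'RegEx!']
print(words)         # ['Let', 's', 'write', 'RegEx']
spans = [m.span() for m in word_iter]
print(spans)         # [(0, 3), (4, 5), (6, 11), (12, 17)]
print(found.span())  # (6, 11)
print(missing)       # None
```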

### Which pattern?

In [1]:
import re
my_string = "Let's write RegEx!"
re.findall(r"\w+", my_string)

['Let', 's', 'write', 'RegEx']

### Practicing regular expressions: re.split() and re.findall()

In [2]:
my_string = "Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?"

In [3]:
# Import the regex module
import re

# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

["Let's write RegEx", "  Won't that be fun", '  I sure think so', '  Can you find 4 sentences', '  Or perhaps, all 19 words', '']
['Let', 'RegEx', 'Won', 'Can', 'Or']
["Let's", 'write', 'RegEx!', "Won't", 'that', 'be', 'fun?', 'I', 'sure', 'think', 'so.', 'Can', 'you', 'find', '4', 'sentences?', 'Or', 'perhaps,', 'all', '19', 'words?']
['4', '19']
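
The split on sentence endings above leaves leading spaces and a trailing empty string in the result. One possible cleanup, shown here on a shortened version of the string, is to strip each piece and drop the empty ones. This is a sketch, not part of the original exercise.

```python
import re

# Shortened version of my_string, used only for this illustration
text = "Let's write RegEx!  Won't that be fun?  I sure think so."

# Splitting on sentence endings leaves padding and an empty last piece
raw_pieces = re.split(r"[.?!]", text)

# Strip whitespace and discard pieces that end up empty
sentences = [piece.strip() for piece in raw_pieces if piece.strip()]

print(raw_pieces)  # ["Let's write RegEx", "  Won't that be fun", '  I sure think so', '']
print(sentences)   # ["Let's write RegEx", "Won't that be fun", 'I sure think so']
```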


## Preprocess NLP: Tokenization

Tokenization turns a string or document into tokens (smaller chunks). There are many different approaches and rules, and you can define your own rules with regular expressions, for example:

* Breaking out words or sentences
* Separating punctuation
* Separating all hashtags in a tweet
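
The three rules listed above could be sketched with the re module alone. The patterns below are illustrative choices, not a standard tokenizer, and the sample strings are made up.

```python
import re

text = "Halt! Who goes there? It is I, Arthur."

# Break out sentences: split on whitespace that follows ., ? or !
sentences = re.split(r"(?<=[.?!])\s+", text)

# Separate punctuation: a token is a run of word characters
# or a single character that is neither a word character nor whitespace
tokens = re.findall(r"\w+|[^\w\s]", sentences[2])

# Separate all hashtags in a tweet
tweet = "This is the best #nlp exercise! #python"
hashtags = re.findall(r"#\w+", tweet)

print(sentences)  # ['Halt!', 'Who goes there?', 'It is I, Arthur.']
print(tokens)     # ['It', 'is', 'I', ',', 'Arthur', '.']
print(hashtags)   # ['#nlp', '#python']
```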

### Why tokenize? 

* Easier to map parts of speech and to spot words like 'awesome' or 'awful' for sentiment analysis
* Matching common words
* Removing unwanted tokens such as articles and conjunctions: 'the', 'and', etc.
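
Once a text is tokenized, matching common words and removing unwanted tokens become simple list operations. The sketch below assumes a tiny hand-written stop word set and a made-up sentence; a real project would likely use a fuller list.

```python
import re
from collections import Counter

# Sentence invented for this example
text = "The cat and the dog chased the ball"

# Lowercase first so that 'The' and 'the' count as the same token
tokens = re.findall(r"\w+", text.lower())

# Hypothetical, deliberately tiny stop word set
stop_words = {"the", "and", "a", "an"}

# Most common token before filtering
most_common = Counter(tokens).most_common(1)

# Tokens left after removing the unwanted ones
filtered = [token for token in tokens if token not in stop_words]

print(most_common)  # [('the', 3)]
print(filtered)     # ['cat', 'dog', 'chased', 'ball']
```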

### How? 

* nltk library
* sent_tokenize: tokenize a document into sentences
* regexp_tokenize: tokenize a string or document based on a regular expression pattern
* TweetTokenizer: special class just for tweet tokenization, allowing you to separate hashtags, mentions and lots of exclamation points!!!
* **Mind the difference between re.search() and re.match()**: match() only tries the pattern at the beginning of the string, while search() scans the entire string for the first place where the pattern matches.
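
The difference between re.match() and re.search() is easy to see on a short string. This sketch uses only the standard library re module.

```python
import re

text = "SOLDIER #1: Found them? In Mercea?"

# match() only looks at the start of the string, so "Found" is not matched
at_start = re.match(r"Found", text)

# search() scans the whole string and finds the first occurrence
anywhere = re.search(r"Found", text)

# match() succeeds when the pattern fits at position 0
prefix = re.match(r"[A-Z]+", text)

print(at_start)         # None
print(anywhere.span())  # (12, 17)
print(prefix.group())   # SOLDIER
```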

### Word tokenization with NLTK

In [4]:
!pip install nltk



In [5]:
import nltk

In [6]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [7]:
# Import necessary modules
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize


In [8]:
scene_one = "SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!  [clop clop clop] \nSOLDIER #1: Halt!  Who goes there?\nARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England!\nSOLDIER #1: Pull the other one!\nARTHUR: I am, ...  and this is my trusty servant Patsy.  We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.  I must speak with your lord and master.\nSOLDIER #1: What?  Ridden on a horse?\nARTHUR: Yes!\nSOLDIER #1: You're using coconuts!\nARTHUR: What?\nSOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.\nARTHUR: So?  We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--\nSOLDIER #1: Where'd you get the coconuts?\nARTHUR: We found them.\nSOLDIER #1: Found them?  In Mercea?  The coconut's tropical!\nARTHUR: What do you mean?\nSOLDIER #1: Well, this is a temperate zone.\nARTHUR: The swallow may fly south with the sun or the house martin or the plover may seek warmer climes in winter, yet these are not strangers to our land?\nSOLDIER #1: Are you suggesting coconuts migrate?\nARTHUR: Not at all.  They could be carried.\nSOLDIER #1: What?  A swallow carrying a coconut?\nARTHUR: It could grip it by the husk!\nSOLDIER #1: It's not a question of where he grips it!  It's a simple question of weight ratios!  A five ounce bird could not carry a one pound coconut.\nARTHUR: Well, it doesn't matter.  Will you go and tell your master that Arthur from the Court of Camelot is here.\nSOLDIER #1: Listen.  In order to maintain air-speed velocity, a swallow needs to beat its wings forty-three times every second, right?\nARTHUR: Please!\nSOLDIER #1: Am I right?\nARTHUR: I'm not interested!\nSOLDIER #2: It could be carried by an African swallow!\nSOLDIER #1: Oh, yeah, an African swallow maybe, but not a European swallow.  That's my point.\nSOLDIER #2: Oh, yeah, I agree with that.\nARTHUR: Will you ask your master if he wants to join my court at Camelot?!\nSOLDIER #1: But then of course a-- African swallows are non-migratory.\nSOLDIER #2: Oh, yeah...\nSOLDIER #1: So they couldn't bring a coconut back anyway...  [clop clop clop] \nSOLDIER #2: Wait a minute!  Supposing two swallows carried it together?\nSOLDIER #1: No, they'd have to have it on a line.\nSOLDIER #2: Well, simple!  They'd just use a strand of creeper!\nSOLDIER #1: What, held under the dorsal guiding feathers?\nSOLDIER #2: Well, why not?\n"
scene_one

"SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!  [clop clop clop] \nSOLDIER #1: Halt!  Who goes there?\nARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England!\nSOLDIER #1: Pull the other one!\nARTHUR: I am, ...  and this is my trusty servant Patsy.  We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.  I must speak with your lord and master.\nSOLDIER #1: What?  Ridden on a horse?\nARTHUR: Yes!\nSOLDIER #1: You're using coconuts!\nARTHUR: What?\nSOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.\nARTHUR: So?  We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--\nSOLDIER #1: Where'd you get the coconuts?\nARTHUR: We found them.\nSOLDIER #1: Found them?  In Mercea?  The coconut's tropical!\nARTHUR: What do you mean?\nSOLDIER #1: Well, this is a t

In [15]:
# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

sentences

['SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!',
 '[clop clop clop] \nSOLDIER #1: Halt!',
 'Who goes there?',
 'ARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.',
 'King of the Britons, defeator of the Saxons, sovereign of all England!',
 'SOLDIER #1: Pull the other one!',
 'ARTHUR: I am, ...  and this is my trusty servant Patsy.',
 'We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.',
 'I must speak with your lord and master.',
 'SOLDIER #1: What?',
 'Ridden on a horse?',
 'ARTHUR: Yes!',
 "SOLDIER #1: You're using coconuts!",
 'ARTHUR: What?',
 "SOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.",
 'ARTHUR: So?',
 "We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--\nSOLDIER #1: Where'd you get the coconuts?",
 'ARTHUR: We found them.',
 'SOLDIER #1: Found them?',
 'In Mercea?',
 "The coconut's tropic

In [14]:
# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

tokenized_sent

['ARTHUR',
 ':',
 'It',
 'is',
 'I',
 ',',
 'Arthur',
 ',',
 'son',
 'of',
 'Uther',
 'Pendragon',
 ',',
 'from',
 'the',
 'castle',
 'of',
 'Camelot',
 '.']

In [17]:
# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

unique_tokens

{'!',
 '#',
 "'",
 "'d",
 "'em",
 "'m",
 "'re",
 "'s",
 "'ve",
 ',',
 '--',
 '.',
 '...',
 '1',
 '2',
 ':',
 '?',
 'A',
 'ARTHUR',
 'African',
 'Am',
 'Are',
 'Arthur',
 'Britons',
 'But',
 'Camelot',
 'Court',
 'England',
 'European',
 'Found',
 'Halt',
 'I',
 'In',
 'It',
 'KING',
 'King',
 'Listen',
 'Mercea',
 'No',
 'Not',
 'Oh',
 'Patsy',
 'Pendragon',
 'Please',
 'Pull',
 'Ridden',
 'SCENE',
 'SOLDIER',
 'Saxons',
 'So',
 'Supposing',
 'That',
 'The',
 'They',
 'Uther',
 'Wait',
 'We',
 'Well',
 'What',
 'Where',
 'Who',
 'Whoa',
 'Will',
 'Yes',
 'You',
 '[',
 ']',
 'a',
 'agree',
 'air-speed',
 'all',
 'am',
 'an',
 'and',
 'anyway',
 'are',
 'ask',
 'at',
 'back',
 'bangin',
 'be',
 'beat',
 'bird',
 'breadth',
 'bring',
 'but',
 'by',
 'carried',
 'carry',
 'carrying',
 'castle',
 'climes',
 'clop',
 'coconut',
 'coconuts',
 'could',
 'course',
 'court',
 'covered',
 'creeper',
 'defeator',
 'do',
 'does',
 'dorsal',
 'empty',
 'every',
 'feathers',
 'five',
 'fly',
 'forty-

### More regex with re.search()


In [18]:
# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())

580 588
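
The start and end indexes of a match object can be used to slice the original string. The sketch below uses a single made-up line rather than the full scene, so the numbers differ from the output above.

```python
import re

# One line taken on its own, so the indexes are relative to this string
line = "You're using coconuts!"

match = re.search("coconuts", line)

# start() is the index of the first matched character,
# end() is one past the last matched character
start, end = match.start(), match.end()
matched_text = line[start:end]

print(start, end)    # 13 21
print(matched_text)  # coconuts
```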


In [19]:
# Write a regular expression to search for anything in square brackets: pattern1
pattern1 = r"\[.*\]"

# Use re.search to find the first text in square brackets
# (.* is greedy, so the match runs up to the last "]" on the line)
print(re.search(pattern1, scene_one))

# Find the script notation at the beginning of the fourth sentence and print it
pattern2 = r"[A-Z]\w+"
print(re.match(pattern2, sentences[3]))


<_sre.SRE_Match object; span=(9, 32), match='[wind] [clop clop clop]'>
<_sre.SRE_Match object; span=(0, 6), match='ARTHUR'>
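
The first match above spans both bracketed notes because .* is greedy. As a sketch of the alternative, the non-greedy form .*? stops at the first closing bracket. The line below is the first line of the scene, copied here so the example is self-contained.

```python
import re

line = "SCENE 1: [wind] [clop clop clop] "

# Greedy: runs from the first "[" to the last "]" on the line
greedy = re.search(r"\[.*\]", line)

# Non-greedy: stops at the first "]" it can
lazy = re.search(r"\[.*?\]", line)

# With the non-greedy form, findall returns each note separately
notes = re.findall(r"\[.*?\]", line)

print(greedy.group())  # [wind] [clop clop clop]
print(lazy.group())    # [wind]
print(notes)           # ['[wind]', '[clop clop clop]']
```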


### Choosing a tokenizer

In [20]:
my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"

In [21]:
from nltk.tokenize import regexp_tokenize

In [41]:
patterns = [r"\w+(\?!)", r"(\w+|#\d|\?|!)", r"(#\d\w+\?!)", r"\s+"]

In [43]:
for pattern in patterns:
    print(regexp_tokenize(my_string, pattern))

[]
['SOLDIER', '#1', 'Found', 'them', '?', 'In', 'Mercea', '?', 'The', 'coconut', 's', 'tropical', '!']
[]
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
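
To see why only the second pattern yields useful tokens, the same patterns can be tried with plain re.findall. This sketch assumes that regexp_tokenize behaves like re.findall for patterns without capturing groups, so the parentheses are dropped here.

```python
import re

my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"

# Same patterns as above, written without capturing groups
patterns = [r"\w+\?!", r"\w+|#\d|\?|!", r"#\d\w+\?!", r"\s+"]

results = [re.findall(pattern, my_string) for pattern in patterns]
for pattern, tokens in zip(patterns, results):
    print(pattern, "->", tokens)
```

The first and third patterns require a literal "?!" sequence, which never occurs in the string, and the last one matches only the whitespace between tokens.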


### Regex with NLTK tokenization

Twitter is a frequently used source for NLP text and tasks. In this exercise, you'll build a more complex tokenizer for tweets with hashtags and mentions using nltk and regex. The nltk.tokenize.TweetTokenizer class gives you some extra methods and attributes for parsing tweets.

In [44]:
tweets = ['This is the best #nlp exercise ive found online! #python', '#NLP is super fun! <3 #learning', 'Thanks @datacamp :) #nlp #python']

In [45]:
# Import the necessary modules
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer

# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"

# Use the pattern on the first tweet in the tweets list
regexp_tokenize(tweets[0], pattern1)

# Write a pattern that matches both mentions and hashtags
# (inside [...] a "|" is a literal character, so it is left out)
pattern2 = r"[#@]\w+"

# Use the pattern on the last tweet in the tweets list
regexp_tokenize(tweets[-1], pattern2)

# Use the TweetTokenizer to tokenize all tweets into one list
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)

[['This', 'is', 'the', 'best', '#nlp', 'exercise', 'ive', 'found', 'online', '!', '#python'], ['#NLP', 'is', 'super', 'fun', '!', '<3', '#learning'], ['Thanks', '@datacamp', ':)', '#nlp', '#python']]
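
A detail worth checking when matching mentions and hashtags: inside a character class, "|" does not mean "or", it is matched literally. The sketch below shows the effect with re.findall. The second sample string is made up to trigger the difference.

```python
import re

tweet = "Thanks @datacamp :) #nlp #python"

# A character class already means "any one of these characters"
tags_and_mentions = re.findall(r"[#@]\w+", tweet)

# Hypothetical string containing a literal pipe character
sample = "a|b #tag"
with_pipe = re.findall(r"[#|@]\w+", sample)    # also matches '|b'
without_pipe = re.findall(r"[#@]\w+", sample)  # only the hashtag

print(tags_and_mentions)  # ['@datacamp', '#nlp', '#python']
print(with_pipe)          # ['|b', '#tag']
print(without_pipe)       # ['#tag']
```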


### Non-ASCII tokenization

In this exercise, you'll practice advanced tokenization by tokenizing some non-ASCII text. You'll be using German with emoji!

In [None]:
german_text = "Wann gehen wir zum Pizza? 🍕 Und fährst du mit Über? 🚕"
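
As a rough sketch of what this involves, the re module alone can already handle the German text: in Python 3, \w matches Unicode letters such as ä and Ü. The emoji ranges below are an assumption chosen to cover the two emoji in this string, not a complete emoji pattern.

```python
import re

german_text = "Wann gehen wir zum Pizza? 🍕 Und fährst du mit Über? 🚕"

# \w is Unicode-aware for str patterns, so umlauts stay inside words
words = re.findall(r"\w+", german_text)

# Capitalized words, with Ü added to the class of capital letters
capitalized = re.findall(r"[A-ZÜ]\w+", german_text)

# A few Unicode blocks that contain common emoji (illustrative, not exhaustive)
emoji_pattern = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF]"
emoji = re.findall(emoji_pattern, german_text)

print(words)
print(capitalized)
print(emoji)
```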