# Introduction to Data Science – Text Munging Exercises
*COMP 5360 / MATH 4100, University of Utah, http://datasciencecourse.net/* 

## NLP

### Exercise 1.1: Frequent Words
Find the most frequently used words in Moby Dick which are not stopwords and not punctuation. Hint: [`str.isalpha()`](https://docs.python.org/3/library/stdtypes.html#str.isalpha) could be useful here.

In [1]:
import nltk

# English stopwords as a set, for fast membership tests
stopwords = set(nltk.corpus.stopwords.words('english'))
# Loads text1 ... text9 and FreqDist into the namespace
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [2]:
# Count every token in Moby Dick (text1); counting is case-sensitive,
# so 'whale' and 'Whale' are tallied separately.
frequency_dist = FreqDist(text1)
print(frequency_dist)

# Take the 500 most frequent tokens, then drop stopwords and non-alphabetic tokens.
most_common = frequency_dist.most_common(500)
filtered_words = [
    (word, count) for word, count in most_common
    if word.lower() not in stopwords and word.isalpha()
]
filtered_words[:50]

<FreqDist with 19317 samples and 260819 outcomes>


[('whale', 906),
 ('one', 889),
 ('like', 624),
 ('upon', 538),
 ('man', 508),
 ('ship', 507),
 ('Ahab', 501),
 ('ye', 460),
 ('old', 436),
 ('sea', 433),
 ('would', 421),
 ('head', 335),
 ('though', 335),
 ('boat', 330),
 ('time', 324),
 ('long', 318),
 ('said', 302),
 ('yet', 300),
 ('still', 299),
 ('great', 293),
 ('two', 285),
 ('seemed', 283),
 ('must', 282),
 ('Whale', 282),
 ('last', 277),
 ('way', 269),
 ('Stubb', 255),
 ('see', 253),
 ('Queequeg', 252),
 ('little', 247),
 ('round', 242),
 ('whales', 237),
 ('say', 237),
 ('three', 237),
 ('men', 236),
 ('thou', 232),
 ('may', 230),
 ('us', 228),
 ('every', 222),
 ('much', 218),
 ('could', 215),
 ('Captain', 215),
 ('first', 210),
 ('side', 208),
 ('hand', 205),
 ('ever', 203),
 ('Starbuck', 196),
 ('never', 195),
 ('good', 192),
 ('white', 191)]
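
Because the tokens are counted case-sensitively, 'whale' and 'Whale' show up as separate entries in the list above. The following is a minimal sketch of one way to merge them by lowercasing before counting. It uses `collections.Counter` instead of NLTK's `FreqDist` so that it runs on its own; `sample_tokens`, `sample_stopwords`, and `frequent_words` are hypothetical names introduced for illustration and are not part of the exercise.

```python
from collections import Counter

# Hypothetical stand-ins for the NLTK text and stopword list,
# so that the sketch runs without downloading any corpus.
sample_tokens = ["The", "Whale", ",", "the", "whale", "and", "Ahab", "saw", "the", "whale", "."]
sample_stopwords = {"the", "and"}


def frequent_words(tokens, stop_words, n=10):
    """Count alphabetic, non-stopword tokens case-insensitively."""
    counts = Counter(
        token.lower()
        for token in tokens
        if token.isalpha() and token.lower() not in stop_words
    )
    return counts.most_common(n)


print(frequent_words(sample_tokens, sample_stopwords))
```

On the real data the call would presumably look like `frequent_words(text1, stopwords, 50)`. The counts would then differ from the list above, since capitalized variants are folded into their lowercase forms and filtering happens before the top words are selected.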

## Regular Expressions

### Exercise 2.1: Obfuscated E-mails

You're an evil spammer who has noticed that many people obfuscate their e-mail addresses using this notation: "`alex at utah dot edu`". Below are three examples of such text. Try to extract "alex at utah dot edu", etc. Start with the first string, then extend your regular expression so that it works on all three. Note that the second and third are slightly harder!

In [3]:
import re
html_smart = "You can reach me: alex at utah dot edu"
html_smart2 = "You can reach me: alex dot lex at utah dot edu"
html_smart3 = "You can reach me: alex dot lex at sci dot utah dot edu"

In [4]:
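def testRegex(regex):
    # Print the first match of `regex` in each sample string, or a note if there is none.
    for html in (html_smart, html_smart2, html_smart3):
        match = re.search(regex, html)
        print(match.group() if match else "no match")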

In [5]:
# TODO write your regex here
mail_regex = r"\w+\sat\s\w+\sdot\s\w+"

In [6]:
testRegex(mail_regex)

alex at utah dot edu
lex at utah dot edu
lex at sci dot utah


In [7]:
better_regex = r"((\w+\s)+(\sdot)*)+at\s\w+\sdot\s\w+"
testRegex(better_regex)

alex at utah dot edu
alex dot lex at utah dot edu
alex dot lex at sci dot utah


In [8]:
best_regex = r"((\w+\s)+(\sdot)*)+at(\s\w+\sdot)+\s\w+"
testRegex(best_regex)

alex at utah dot edu
alex dot lex at utah dot edu
alex dot lex at sci dot utah dot edu
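
Once the obfuscated text has been extracted, turning it back into a usable address is a small extra step. The sketch below is one possible way to do this, not the course's reference solution. It uses an alternative pattern that spells out the " dot " separators explicitly, and then substitutes "@" and "." back in. `OBFUSCATED_MAIL` and `deobfuscate` are hypothetical names, and the pattern assumes words separated by single whitespace characters.

```python
import re

# Assumed shape: word (" dot " word)* " at " word (" dot " word)+
OBFUSCATED_MAIL = re.compile(r"\w+(?:\sdot\s\w+)*\sat\s\w+(?:\sdot\s\w+)+")


def deobfuscate(text):
    """Return the first obfuscated address in `text` as user@host, or None."""
    match = OBFUSCATED_MAIL.search(text)
    if match is None:
        return None
    address = re.sub(r"\sat\s", "@", match.group())
    return re.sub(r"\sdot\s", ".", address)


for sample in (
    "You can reach me: alex at utah dot edu",
    "You can reach me: alex dot lex at utah dot edu",
    "You can reach me: alex dot lex at sci dot utah dot edu",
):
    print(deobfuscate(sample))
```

Using non-capturing groups `(?:...)` keeps `match.group()` as the only result of interest. A name part that is literally "at" or "dot" would confuse this simple substitution.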


### Exercise 2.2: Find Adverbs

Write a regular expression that finds all adverbs in a sentence. For this exercise, treat any word ending in "ly" as an adverb.

In [10]:
text = "He was carefully disguised but captured quickly by police."

In [11]:
re.findall(r"\b\w+ly\b", text)

['carefully', 'quickly']
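
A pattern without word-boundary anchors can also match the start of a longer word, because `\w+ly` is satisfied as soon as "ly" appears anywhere after the first letter. The sketch below compares the two variants on a hypothetical sentence (`sample` is made up for illustration). It also shows that the "ends in ly" rule is only a heuristic, since it picks up nouns such as "butterfly".

```python
import re

# Hypothetical sentence chosen to expose both pitfalls.
sample = "A butterfly was flying slowly."

# Without anchors: also matches the "fly" inside "flying".
loose = re.findall(r"\w+ly", sample)

# With \b on both sides: only whole words that end in "ly".
strict = re.findall(r"\b\w+ly\b", sample)

print(loose)
print(strict)
```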

### Exercise 2.3: Phone Numbers

Extract the phone numbers that follow a (xxx) xxx-xxxx pattern from the text:

In [13]:
phone_numbers = "(857) 131-2235, (801) 134-2215, but this one (12) 13044441 shouldn't match. Also, this is common in twelve (12) countries and one (1) state"

In [14]:
re.findall(r"\([0-9]{3}\)\s[0-9]{3}-[0-9]{4}", phone_numbers)

['(857) 131-2235', '(801) 134-2215']
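
If the individual parts of each number are needed, capture groups can pull them out in the same pass. This is a minimal sketch under the same (xxx) xxx-xxxx assumption. `PHONE`, `sample`, `parts`, and `digits` are hypothetical names, and the sample string is a shortened version of the one above.

```python
import re

# One capture group each for area code, exchange, and line number.
PHONE = re.compile(r"\((\d{3})\)\s(\d{3})-(\d{4})")

sample = "(857) 131-2235, (801) 134-2215, but (12) 13044441 should not match."

# With several groups, findall returns one tuple of groups per match.
parts = PHONE.findall(sample)

# Join the groups to get a digits-only form of each number.
digits = ["".join(groups) for groups in parts]

print(parts)
print(digits)
```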

### Exercise 2.4: HTML Content

Extract the content between the `<b>` and `<i>` tags but not the other tags:

In [16]:
html_tags = "This is <b>important</b> and <u>very</u><i>timely</i>"

In [17]:
re.findall(r"<[bi]>(.*?)</[bi]>", html_tags)

['important', 'timely']
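
A character class in both the opening and closing tag also accepts mismatched pairs such as `<b>...</i>`. One way to require the same tag on both ends is a backreference, sketched below. The sample string with the mismatched pair is hypothetical, as are the names `loose`, `strict`, and `sample`. For arbitrary real-world HTML, a proper parser is generally more robust than any regular expression.

```python
import re

# Hypothetical input: the last pair opens with <b> but closes with </i>.
sample = "<b>important</b> and <u>very</u><i>timely</i> and <b>mismatched</i>"

# Character class on both sides: accepts the mismatched pair too.
loose = re.findall(r"<[bi]>(.*?)</[bi]>", sample)

# \1 must repeat whatever group 1 matched, so the closing tag equals the opening tag.
# Group 2 holds the content between the tags.
strict = [m.group(2) for m in re.finditer(r"<([bi])>(.*?)</\1>", sample)]

print(loose)
print(strict)
```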