### Text Files

 Find a .txt file on Project Gutenberg to download using request.urlopen().

In [7]:
import nltk

In [42]:
from urllib import request
url = "http://www.gutenberg.org/files/43/43-0.txt"
response = request.urlopen(url)
print(response)
raw = response.read().decode('utf-8-sig')
print(type(raw))
raw[:75]

<http.client.HTTPResponse object at 0x1a13a62860>
<class 'str'>


'\r\nThe Project Gutenberg EBook of The Strange Case Of Dr. Jekyll And Mr.\r\nHy'
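The cell above decodes with 'utf-8-sig' rather than plain 'utf-8'. A minimal sketch of the difference, assuming the downloaded file begins with a UTF-8 byte-order mark (the `payload` bytes below are made up for illustration and are not the real response):

```python
import codecs

# Simulated response body: a UTF-8 byte-order mark followed by the text.
payload = codecs.BOM_UTF8 + "The Project Gutenberg EBook".encode("utf-8")

plain = payload.decode("utf-8")         # the BOM survives as '\ufeff'
stripped = payload.decode("utf-8-sig")  # the BOM is removed if present

print(repr(plain[:5]))
print(repr(stripped[:5]))
```

If the file has no byte-order mark, 'utf-8-sig' behaves like 'utf-8', so it is a safe default for this kind of download.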

Read in the text file and use word_tokenize to create a list of the word tokens.

In [43]:
from nltk import word_tokenize 
tokens = word_tokenize(raw)
print(type(tokens))
tokens[:15]

<class 'list'>


['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'The',
 'Strange',
 'Case',
 'Of',
 'Dr.',
 'Jekyll',
 'And',
 'Mr.',
 'Hyde',
 ',']

 Find the points in the book that contain the Project Gutenberg information and remove that text.

In [64]:
# Remove every occurrence of 'Project Gutenberg' (case-insensitive) from the text
print('Text before deleting:',raw[:50])
word = 'project gutenberg'
word_len = len(word)
print('Pattern to be removed:',word)
count = 0
raw_clean = raw

# Check the result of find() before slicing: -1 means "not found", and slicing
# with it would corrupt the text and count one deletion too many.
start_idx = raw_clean.lower().find(word)
while start_idx != -1:
    raw_clean = raw_clean[:start_idx] + raw_clean[start_idx+word_len:]
    count += 1
    start_idx = raw_clean.lower().find(word)
print('Patterns deleted:',count)
print('Text after deleting:',raw_clean[:50])

Text before deleting: 
The Project Gutenberg EBook of The Strange Case 
Pattern to be removed: project gutenberg
Patterns deleted: 70
Text after deleting: 
The  EBook of The Strange Case Of Dr. Jekyll And
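Deleting the phrase leaves the surrounding header and license text in place. Another way to read the task is to cut everything outside the book body. The sketch below assumes the file marks the body with "*** START OF THIS/THE PROJECT GUTENBERG EBOOK ... ***" and "*** END OF ..." lines; the exact wording varies between files, and `strip_gutenberg_boilerplate` and the sample text are hypothetical names made up for illustration.

```python
import re

# Assumed marker wording; check the actual file before relying on it.
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK",
    re.IGNORECASE,
)

def strip_gutenberg_boilerplate(text):
    """Return the text between the start and end markers, if they exist."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    # Ignore an end marker that appears before the start marker.
    finish = end.start() if end and end.start() >= begin else len(text)
    return text[begin:finish].strip()

sample = (
    "The Project Gutenberg EBook of Example\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter 1. It was a dark night.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text follows.\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

When neither marker is found the function returns the whole text, so it is worth printing the first and last few hundred characters to confirm the cut.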


### HTML Files

Go to your favorite news organization and find an article you want to use as a corpus.

In [27]:
url = "https://www.cnn.com/2019/02/10/us/denver-teacher-strike-multiple-jobs/index.html"
html = request.urlopen(url).read().decode('utf8')

Import that file using bs4 package and HTML format clean up.

In [105]:
from bs4 import BeautifulSoup
# Name the parser explicitly; without it bs4 warns that no parser was specified
raw = BeautifulSoup(html, "lxml").get_text()


In [29]:
tokens = word_tokenize(raw)
tokens[:10]

['Living',
 'with',
 'roommates',
 'and',
 'working',
 'multiple',
 'jobs',
 ',',
 'here',
 "'s"]

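Text extracted from a news page can include the contents of script and style elements, which then show up as tokens. The sketch below illustrates the idea of keeping only visible text. It uses the standard library's `html.parser` as a stand-in, not bs4's own implementation, and `VisibleTextExtractor`, `visible_text`, and the sample markup are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside script/style.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

sample = (
    "<html><head><style>p {color: red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Title</h1><p>Living with roommates.</p></body></html>"
)
print(visible_text(sample))
```

Comparing this output with the result of get_text() on the same page is a quick way to see how much non-article text ends up in the token list.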
### Regular Expressions

• Using your book or your html file, print a list of all wh word types that occur (wh-words in English are used in questions, relative clauses and exclamations: who, which, what, and so on).
<br>
– Use the set function to get only a unique list of these words.
<br>
– You do not have to use .islower to clean it up; look at the raw list to see how much variability there is in the text.
<br>
– Are any words duplicated in this list, because of the presence of case distinctions or punctuation?

In [71]:
import re

text = raw
words = nltk.word_tokenize(text)
# '^wh' is case-sensitive, so capitalized forms such as 'What' are not matched
wh_list = sorted(set(w for w in words if re.search(r'^wh', w)))
for word in wh_list:
    print(word)
print('Unique words starting with wh:',len(wh_list))

what
whatever
whatsoever
whatsoever.You
wheel—if
when
whenever
where
whereabouts
whereas
wherein
wherewithal
whether
whetted
which
while
whilst
whimpering
whipped
whipping
whisper
whispered
whispering
whisper—
white
whither
who
whole
wholesale
wholly
whom
whose
why
Unique words starting with wh: 33


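Yes. "whisper" and "whisper—" would be the same entry if punctuation were removed, and "whatsoever.You" is "whatsoever" fused with the start of the next sentence.
The pattern ^wh is case-sensitive, so capitalized forms such as "What" never reach the list. Only "whatsoever.You" contains an uppercase letter, so adding .islower would drop that single entry.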
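To see how much of the variability comes from case and attached punctuation, the tokens can be normalized before building the set. This is a minimal sketch on a hand-picked sample that mixes entries from the list above with two capitalized forms; `leading_word` is a hypothetical helper, and keeping only the leading run of letters is one possible normalization among several.

```python
import re

wh_tokens = ["whisper", "whisper—", "whatsoever", "whatsoever.You",
             "What", "what", "wheel—if"]

def leading_word(token):
    """Lowercase the token and keep only its leading run of letters."""
    match = re.match(r"[a-z]+", token.lower())
    return match.group() if match else ""

# IGNORECASE lets capitalized wh-words through before normalizing.
normalized = sorted({
    leading_word(t) for t in wh_tokens
    if re.search(r"^wh", t, flags=re.IGNORECASE)
})
print(normalized)
```

The seven sample tokens collapse to four types, which shows that the extra entries came from punctuation and case rather than from distinct words.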

Question: Pig Latin is a simple transformation of English text. Each word of the text is converted as follows: move any consonant (or consonant cluster) that appears at the start of the word to the end, then append ay, e.g. string → ingstray, idle → idleay. http://en.wikipedia.org/wiki/Pig_Latin
<br>
– Write a function to convert a word to Pig Latin.
<br>
– Use regular expressions to find the first vowel in a word, index that, and then rearrange the word
to be vowel to end + beginning + ay.
<br>
– Test your function on the following words: cheese, elephant, moose, thing

In [87]:
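import re

def get_pig_latin(word):
    """Returns the pig latin version of the word"""
    # Find the first vowel; a word without one (e.g. 'rhythm') has no match
    match = re.search('[aeiou]', word, flags=re.IGNORECASE)
    idx_vowel = match.start() if match else len(word)
    pig_latin = word[idx_vowel:] + word[:idx_vowel] + 'ay'
    return pig_latin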

print(get_pig_latin('cheese'))
print(get_pig_latin('elephant'))
print(get_pig_latin('moose'))
print(get_pig_latin('thing'))

eesechay
elephantay
oosemay
ingthay
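The same rule can be applied to running text by letting re.sub call the converter on each word. This is a sketch under simple assumptions: words are runs of ASCII letters, punctuation is left where it is, and capitalization is not restored. `pig_latin_word` and `pig_latin_sentence` are names made up for illustration.

```python
import re

def pig_latin_word(word):
    """Move the leading consonant cluster to the end and append 'ay'."""
    match = re.search(r"[aeiou]", word, flags=re.IGNORECASE)
    if match is None:
        return word + "ay"  # no vowel: nothing to move
    i = match.start()
    return word[i:] + word[:i] + "ay"

def pig_latin_sentence(sentence):
    """Convert every run of letters, leaving spaces and punctuation alone."""
    return re.sub(r"[A-Za-z]+", lambda m: pig_latin_word(m.group()), sentence)

converted = pig_latin_sentence("string and idle cheese, thing!")
print(converted)
```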


### Stemmers

>Use the Porter Stemmer to normalize your tokenized book or html document, calling the stemmer on each word.
>Do the same thing with the Lancaster Stemmer and see if you observe any differences.

In [88]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

In [90]:
[porter.stem(t) for t in tokens][:20]

['the',
 'project',
 'gutenberg',
 'ebook',
 'of',
 'the',
 'strang',
 'case',
 'Of',
 'dr.',
 'jekyl',
 'and',
 'mr.',
 'hyde',
 ',',
 'by',
 'robert',
 'loui',
 'stevenson',
 'thi']

In [91]:
[lancaster.stem(t) for t in tokens][:20]

['the',
 'project',
 'gutenberg',
 'ebook',
 'of',
 'the',
 'strange',
 'cas',
 'of',
 'dr.',
 'jekyl',
 'and',
 'mr.',
 'hyd',
 ',',
 'by',
 'robert',
 'lou',
 'stevenson',
 'thi']

Yes, the two stemmers differ. Porter reduces "Strange" to "strang", while Lancaster keeps "strange". Lancaster is more aggressive elsewhere: it trims "Case" to "cas", "Hyde" to "hyd", and "Louis" to "lou", where Porter gives "case", "hyde", and "loui". Porter also leaves the second "Of" capitalized, while Lancaster lowercases it to "of".
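Spotting differences by eye gets harder as the lists grow. The sketch below pairs up the two outputs position by position and prints only the mismatches. The two lists are copied from the cell outputs above, so the code runs without NLTK; the variable names are made up for illustration.

```python
# Stems copied from the Porter and Lancaster outputs shown above.
porter_stems = ['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'strang',
                'case', 'Of', 'dr.', 'jekyl', 'and', 'mr.', 'hyde', ',', 'by',
                'robert', 'loui', 'stevenson', 'thi']
lancaster_stems = ['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'strange',
                   'cas', 'of', 'dr.', 'jekyl', 'and', 'mr.', 'hyd', ',', 'by',
                   'robert', 'lou', 'stevenson', 'thi']

# Keep (position, porter, lancaster) wherever the two stemmers disagree.
differences = [
    (i, p, l)
    for i, (p, l) in enumerate(zip(porter_stems, lancaster_stems))
    if p != l
]

for i, p, l in differences:
    print(f"{i:2d}  porter={p!r:10} lancaster={l!r}")
```

On the full token list the same comprehension works with the live stemmer outputs in place of the copied lists.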

### Tokenization 
<br>Using your book or html file, tokenize the document into both words (word_tokenize) and sentences (sent_tokenize).
<br>Calculate the average length of the words avg_w and the average length of the sentences avg_s
<br>Calculate a readability index by using the formula: 4.71*avg_w + .5*avg_s - 21.43.

In [93]:
# word tokens
tokens = word_tokenize(raw)

# Sentence tokens (from the same string as the word tokens)
sents = nltk.sent_tokenize(raw)

In [104]:
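# Average word length in characters (punctuation tokens are included)
word_len_list = [len(w) for w in tokens]
avg_w = sum(word_len_list)/len(word_len_list)

# len(s) counts characters, so avg_s is characters per sentence.
# The Automated Readability Index, which this formula matches, is usually
# defined with words per sentence, so the index printed here is inflated.
sents_len_list = [len(s) for s in sents]
avg_s = sum(sents_len_list)/len(sents_len_list)

print('Average word length:',avg_w)
print('Average sentences length:',avg_s)

readability_idx = 4.71*avg_w + 0.5*avg_s - 21.43
print('Readability Index:',readability_idx)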


Average word length: 3.781900372794787
Average sentences length: 148.55317164179104
Readability Index: 70.65933657675896
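An index of about 70 is far above the usual range for this formula because the sentence length above is measured in characters. A minimal sketch of the words-per-sentence variant follows. It assumes avg_s is meant as words per sentence, and it uses crude regular expressions as stand-ins for sent_tokenize and word_tokenize so that it runs without NLTK; `average_lengths` and `automated_readability_index` are hypothetical names.

```python
import re

def automated_readability_index(avg_w, avg_s):
    """Formula from the exercise: 4.71*avg_w + 0.5*avg_s - 21.43."""
    return 4.71 * avg_w + 0.5 * avg_s - 21.43

def average_lengths(text):
    """Return (letters per word, words per sentence) for a text."""
    # Crude sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Crude word split: runs of letters and apostrophes, no punctuation.
    words = [w for s in sentences for w in re.findall(r"[A-Za-z']+", s)]
    avg_w = sum(len(w) for w in words) / len(words)
    avg_s = len(words) / len(sentences)
    return avg_w, avg_s

sample = "The cat sat. The dog ran off."
avg_w, avg_s = average_lengths(sample)
ari = automated_readability_index(avg_w, avg_s)
print(avg_w, avg_s, round(ari, 2))
```

With NLTK available, the word count per sentence can come from len(word_tokenize(s)) for each s in sents, which keeps the tokenization consistent with the rest of the notebook.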
