# NLTK Chapter 3

## Processing Raw Text

*The html version of this chapter in the NLTK book is available [here](https://www.nltk.org/book/ch03.html#exercises "Ch03 Exercises").*

### 8   Exercises

##### 1.

☼ Define a string `s = 'colorless'`. Write a Python statement that changes this to "colourless" using only the slice and concatenation operations.

In [4]:
s = 'colorless'
s = s[:4] + 'u' + s[4:]
s

'colourless'

##### 2.

☼ We can use the slice notation to remove morphological endings on words. For example, `'dogs'[:-1]` removes the last character of `dogs`, leaving `dog`. Use slice notation to remove the affixes from these words (we've inserted a hyphen to indicate the affix boundary, but omit this from your strings): `dish-es`, `run-ning`, `nation-ality`, `un-do`, `pre-heat`.

In [15]:
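suffixed = [('dishes', 2), 
            ('running', 4),
            ('nationality', 5)]
prefixed = [('undo', 2),
            ('preheat', 3)]

# un- and pre- are prefixes, so they are sliced off from the left
print([s[:-a] for s, a in suffixed] + [s[a:] for s, a in prefixed])

['dish', 'run', 'nation', 'do', 'heat']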


##### 3.

☼ We saw how we can generate an `IndexError` by indexing beyond the end of a string. Is it possible to construct an index that goes too far to the left, before the start of the string?

*Yes: any negative index below `-len(s)` raises the same `IndexError`.  I'm not going to run the code in my notebook, because then the cells below this one wouldn't run.*

```
>>> trial = "trial"
>>> for i in range(1, len(trial) + 2):
...     print(trial[-i])

l
a
i
r
t
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-21-98077138b076> in <module>
      1 trial = "trial"
      2 for i in range(1, len(trial) + 2):
----> 3     print(trial[-i])

IndexError: string index out of range
```
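
*A minimal sketch that can run without halting the notebook: wrapping the lookup in `try`/`except` shows that the valid indices of a string `s` run from `-len(s)` to `len(s) - 1`.  The helper name `safe_index` is made up for this illustration.*

```python
def safe_index(s, i):
    """Return s[i], or None if i falls outside the string on either side."""
    try:
        return s[i]
    except IndexError:
        return None

trial = "trial"
print(safe_index(trial, -len(trial)))      # leftmost valid index
print(safe_index(trial, -len(trial) - 1))  # one step too far to the left
```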

##### 4. 

☼ We can specify a "step" size for the slice. The following returns every second character within the slice: `monty[6:11:2]`. It also works in the reverse direction: `monty[10:5:-2]` Try these for yourself, then experiment with different step values.

In [22]:
# A Czech tongue twister:
tt = "Třistatřiatřicet stříbrných křepelek přeletělo přes třistatřiatřicet stříbrných střech."

# Every other letter
tt[::2]

'Tittitie tírýhkeee řltl řstittitie tírýhsřc.'

In [23]:
# Every other letter from the end
tt[::-2]

'.cřshýrít eitittitsř ltlř eeekhýrít eitittiT'

In [24]:
# Every third letter
tt[::3]

'Tstaittbý el etoř iaiřesínhtc'

*You get the point...*

##### 5. 

☼ What happens if you ask the interpreter to evaluate `monty[::-1]`? Explain why this is a reasonable result.

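*It returns the string reversed.  With both the start and the end omitted and a step of -1, the slice begins at the last character and moves one position to the left each time until it runs out of string, so reading the characters in reverse order is exactly what the notation asks for:*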

In [26]:
"redrum"[::-1]

'murder'
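
*As a rough check of that reading, the sketch below compares the slice with a reversal built by hand from explicit indices.  The variable names are only for illustration.*

```python
monty = "Monty Python"

# With a negative step, the omitted start defaults to the last character
# and the omitted stop to "just before the first character".
backwards = monty[::-1]

# The same result built explicitly by walking the indices from right to left.
explicit = "".join(monty[i] for i in range(len(monty) - 1, -1, -1))

print(backwards)
print(backwards == explicit)
```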

##### 6.

☼ Describe the class of strings matched by the following regular expressions.

a. `[a-zA-Z]+`

b. `[A-Z][a-z]*`

c. `p[aeiou]{,2}t`

d. `\d+(\.\d+)?`

e. `([^aeiou][aeiou][^aeiou])*`

f. `\w+|[^\w\s]+`

Test your answers using `nltk.re_show()`.

*__a.__ `[a-zA-Z]+` will match any sequence of one or more ASCII letters, upper- or lowercase.  Digits, spaces, and punctuation break the match:*

In [3]:
import nltk, re

nltk.re_show(r'[a-zA-Z]+', "cAMELCASE 6186258313 hybr1d")

{cAMELCASE} 6186258313 {hybr}1{d}


<i>__b.__ `[A-Z][a-z]*` will match one uppercase letter followed by zero or more lowercase letters: capitalized words, as well as each uppercase letter found in other positions:</i>

In [42]:
test = 'I think words beginning with Uppercase Letters will be matched, ' \
       'or any uppercase letters found in oTHER positions.'

nltk.re_show(r'[A-Z][a-z]*', test)

{I} think words beginning with {Uppercase} {Letters} will be matched, or any uppercase letters found in o{T}{H}{E}{R} positions.


*__c.__ `p[aeiou]{,2}t` will match any word containing a __p__, followed by zero to two vowels, followed by a __t__.  This is a lot of words: In the wordlist we've been using in this chapter, this RegExp will return nearly 7,000 words, since any word with __pt__ will be a match.*

In [51]:
wordlist = [w.lower() for w in nltk.corpus.words.words('en')]
len([w for w in wordlist if re.search(r'p[aeiou]{,2}t', w)])

6978

In [52]:
print([w for w in wordlist if re.search(r'p[aeiou]{,2}t', w)][:20])

['abaptiston', 'abepithymia', 'ableptical', 'ableptically', 'abrupt', 'abruptedly', 'abruption', 'abruptly', 'abruptness', 'absorpt', 'absorptance', 'absorptiometer', 'absorptiometric', 'absorption', 'absorptive', 'absorptively', 'absorptiveness', 'absorptivity', 'absumption', 'acalypterae']


*If we add the `^` and <code>\$</code> anchors, the whole word has to match.  That leaves __pt__ itself (which is not in the wordlist), all 3-letter words beginning with __p__ and ending with __t__ with one vowel in the middle, and all 4-letter words beginning with __p__ and ending with __t__ with two vowels in the middle:*

In [49]:
print([w for w in wordlist if re.search(r'^p[aeiou]{,2}t$', w)])

['pat', 'pat', 'paut', 'peat', 'pet', 'piet', 'piet', 'pit', 'poet', 'poot', 'pot', 'pout', 'put']


*__d.__ `\d+(\.\d+)?` will match an integer or a decimal number: one or more digits, optionally followed by a decimal point and one or more further digits.  It will not match dashes, commas, dollar signs, or any other symbol associated with numbers.*

In [62]:
test = ['1234', '12.34', 'example 123.4 in a string', '1-234', '12,4', '$12.34']
for t in test:
    nltk.re_show(r'\d+(\.\d+)?', t) 

{1234}
{12.34}
example {123.4} in a string
{1}-{234}
{12},{4}
${12.34}


*If a number contains two decimal points, the match stops before the second one:*

In [67]:
nltk.re_show(r'\d+(\.\d+)?', '1.23.4')

{1.23}.{4}


*We can alter that by changing the `?` to a `+`:*

In [68]:
nltk.re_show(r'\d+(\.\d+)+', '1.23.4')

{1.23.4}


<i>__e.__ `([^aeiou][aeiou][^aeiou])*` will match any non-vowel/vowel/non-vowel combination, no matter how many times it's repeated.  White spaces are considered non-vowels, so a string such as `to ` would match.  `nltk.re_show()` behaves quite strangely with this RegExp - a string like `"baab"` would return `{}b{}a{}a{}b{}`.  The reason is that `*` also allows zero repetitions, so the pattern matches the empty string at every position where no such combination starts.  I have also evaluated this RegExp with online evaluators (such as [this one](https://regexr.com/ "regexr.com")), and there the response is as expected:</i>

In [5]:
string = "babbabbab" \
         "babapapa"
nltk.re_show(r'([^aeiou][aeiou][^aeiou])*', string)

{babbabbabbab}{}a{pap}{}a{}


In [6]:
string = "baab"
nltk.re_show(r'([^aeiou][aeiou][^aeiou])*', string)

{}b{}a{}a{}b{}
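
<i>The empty braces can be inspected directly with `re.finditer`.  The sketch below lists each match as `(start, end, text)`; a match whose start equals its end is empty.  The second pattern is a variant, not part of the exercise: it swaps `*` for `+` (and uses a non-capturing group), so at least one combination is required and the empty matches disappear.</i>

```python
import re

pattern = r'([^aeiou][aeiou][^aeiou])*'

# Each match is recorded as (start, end, text); start == end marks an empty match.
spans = [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, "baab")]
print(spans)

# Variant with + instead of *: one or more consonant-vowel-consonant groups.
plus_matches = [m.group() for m in
                re.finditer(r'(?:[^aeiou][aeiou][^aeiou])+', "babbab baab")]
print(plus_matches)
```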


*__f.__ `\w+|[^\w\s]+` will match either any alphanumeric string of any length, or a string of any length that does not contain alphanumeric characters or whitespace - i.e., all punctuation and any other non-whitespace/non-alphanumeric characters:*

In [7]:
string = "This RegExp needs a fairly long string to show what it can %#$^%&* do."
nltk.re_show(r'\w+|[^\w\s]+', string)

{This} {RegExp} {needs} {a} {fairly} {long} {string} {to} {show} {what} {it} {can} {%#$^%&*} {do}{.}


##### 7.

*☼ Write regular expressions to match the following classes of strings:*

 + *__a.__ A single determiner (assume that __a__, __an__, and __the__ are the only determiners).*
 + <i>__b.__ An arithmetic expression using integers, addition, and multiplication, such as `2*3+8`.</i>
 
*__a.__*

In [21]:
string = "I think a relevant string like the one here is an example of what we need."
nltk.re_show(r'\b[Aa]n?\b|\b[Tt]he\b', string)

I think {a} relevant string like {the} one here is {an} example of what we need.


*__b.__*

In [25]:
string = "2 * 3 + 8"
nltk.re_show(r'(\d|[+*= ])+', string)

{2 * 3 + 8}


In [26]:
string = "11 + 4 * 2"
nltk.re_show(r'(\d|[+*= ])+', string)

{11 + 4 * 2}
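
<i>The pattern above is permissive: any mix of digits, operators, and spaces qualifies, so strings such as `+++` would match too.  A stricter sketch, assuming an expression is an integer followed by any number of "operator, integer" pairs with optional spaces (a lone integer therefore also passes), could look like this.  The name `arith` is made up here.</i>

```python
import re

# an integer, then any number of "+ integer" or "* integer" continuations
arith = re.compile(r'\d+(?:\s*[+*]\s*\d+)*')

candidates = ["2*3+8", "11 + 4 * 2", "2 + * 3", "+++"]
results = {c: bool(arith.fullmatch(c)) for c in candidates}

for candidate, ok in results.items():
    print(candidate, ok)
```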


##### 8.


☼ Write a utility function that takes a URL as its argument, and returns the contents of the URL, with all HTML markup removed. Use `from urllib import request`  and then `request.urlopen('http://nltk.org/').read().decode('utf8')` to access the contents of the URL.

*This chapter of the NLTK book dealt with removing HTML tags, but didn't really touch on removing the `<script>` and `<style>` elements that are present in most pages today. [This Stack Overflow thread](https://stackoverflow.com/questions/30565404/remove-all-style-scripts-and-html-tags-from-an-html-page#answers "Removing Style, Scripts, and HTML tags") has a good discussion of how to use BeautifulSoup to do that, specifically with the `extract` method and the `stripped_strings` attribute.  The code below borrows heavily from [this answer in the above thread](https://stackoverflow.com/a/30565597 "Removing Style, Scripts, and HTML tags - answer"):*

In [2]:
from urllib import request
from bs4 import BeautifulSoup
from unicodedata import normalize

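def return_URL_contents(url):
    """Return the visible text of the page at `url`, without markup."""
    html = request.urlopen(url).read().decode('utf8')
    raw = BeautifulSoup(html, 'html.parser')
    for r in raw(['script', 'style']):
        r.extract() # remove <script> and <style> elements, contents included
    
    text = ' '.join(raw.stripped_strings) # join the remaining text nodes
    
    # NFKD normalization, e.g. non-breaking spaces become plain spaces
    return normalize('NFKD', text)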




In [101]:
url = "https://www.nytimes.com/2017/10/29/business/virtual-reality-driverless-cars.html?module=inline"

return_URL_contents(url)[:2000]

'What Virtual Reality Can Teach a Driverless Car - The New York Times Sections SEARCH Skip to content Skip to site index Business Log In Log In Today’s Paper Business | What Virtual Reality Can Teach a Driverless Car Subscribe Log In Advertisement Supported by What Virtual Reality Can Teach a Driverless Car By Cade Metz Oct. 29, 2017 SAN FRANCISCO — As the computers that operate driverless cars digest the rules of the road, some engineers think it might be nice if they can learn from mistakes made in virtual reality rather than on real streets. Companies like Toyota, Uber and Waymo have discussed at length how they are testing autonomous vehicles on the streets of Mountain View, Calif., Phoenix and other cities. What is not as well known is that they are also testing vehicles inside computer simulations of these same cities. Virtual cars, equipped with the same software as the real thing, spend thousands of hours driving their digital worlds. Think of it as a way of identifying flaws i

In [102]:
url = "https://en.wikipedia.org/wiki/Guido_van_Rossum"

return_URL_contents(url)[:2000]

'Guido van Rossum - Wikipedia Guido van Rossum From Wikipedia, the free encyclopedia Jump to navigation Jump to search Dutch programmer and creator of Python In this Dutch name , the family name is van Rossum , not Rossum . Guido van Rossum Guido van Rossum at the Dropbox headquarters in 2014 Born ( 1956-01-31 ) 31 January 1956 (age 63) [1] Haarlem , Netherlands [2] [3] Residence Belmont, California , U.S. Nationality Dutch Alma mater University of Amsterdam Occupation Computer programmer, author Employer Dropbox [4] Known for Creating the Python programming language Spouse(s) Kim Knapp ( m. 2000) Children 1 [5] Awards Award for the Advancement of Free Software (2001) Website gvanrossum .github .io Guido van Rossum ( Dutch: [ˈɣido vɑn ˈrɔsʏm, -səm] ; born 31 January 1956) is a Dutch programmer best known as the author of the Python programming language , for which he was the " Benevolent dictator for life " (BDFL) until he stepped down from the position in July 2018. [6] [7] He is curr
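
*If BeautifulSoup is not available, a similar effect can be sketched with the standard library's `html.parser`.  This is an alternative under simple assumptions, not the approach used above: the class name `TextExtractor` and the function `strip_markup` are made up for the illustration, and the input is an inline HTML string rather than a live URL.*

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a script/style element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # keep only non-blank text that is outside script/style
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def strip_markup(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)

sample = ("<html><head><style>p {color: red}</style></head>"
          "<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>")
print(strip_markup(sample))
```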

##### 9. 

☼ Save some text into a file `corpus.txt`. Define a function `load(f)` that reads from the file named in its sole argument, and returns a string containing the text of the file.

 + a. Use `nltk.regexp_tokenize()` to create a tokenizer that tokenizes the various kinds of punctuation in this text. Use one multi-line regular expression, with inline comments, using the verbose flag `(?x)`.
 
 + b. Use `nltk.regexp_tokenize()` to create a tokenizer that tokenizes the following kinds of expression: monetary amounts; dates; names of people and organizations.

In [109]:
import os

path = "C:\\Users\\mjcor\\Desktop\\ProgrammingStuff\\nltk\\chapter03"

os.chdir(path)

In [173]:
url = 'https://www.nytimes.com/2017/10/22/technology/artificial-intelligence-experts-salaries.html?action=click&module=RelatedCoverage&pgtype=Article&region=Footer'

text = return_URL_contents(url)

with open('corpus.txt', 'w', encoding = "utf-8") as f:
    f.write(text)


In [176]:
def load(f):
    # the with-block closes the file once it has been read
    with open(f, encoding = "utf-8") as text:
        raw = text.read()
    
    return raw

In [177]:
nyt = load('corpus.txt')

*__a.__*

In [178]:
pattern = r'''(?x)
    [][.,;"'?!():_-`] # finds punctuation
'''

print(nltk.regexp_tokenize(nyt, pattern))

['.', '.', '.', '.', '.', '.', ',', '.', '.', ',', ':', '.', '.', '.', ',', '.', ',', '.', '.', '.', ',', '.', '.', ',', ',', ',', ',', '.', '.', '.', '.', '.', ',', '.', '.', '.', '.', ',', ',', ',', ',', '.', ',', ',', '.', ',', '.', '.', '.', '.', ',', ',', '.', '.', '.', '.', '.', '.', ',', '.', ',', ',', '.', '.', '.', '.', ',', ',', ',', ',', '.', ',', ',', ',', ',', '.', '.', '.', '.', ',', ',', ',', '.', ',', ',', '.', ',', '.', ',', ',', ',', '.', '.', '.', ',', ',', '.', ',', '.', ',', ',', '.', ',', '.', ',', ',', ',', '.', '.', '.', ',', '.', ',', '.', '.', '.', '.', ',', '.', '.', '.', ',', '.', ',', ',', '.', '.', '(', ',', ',', ')', '.', ',', '.', ',', ',', '.', '.', ',', '.', '.', '.', ',', '.', '.', '.', ',', ',', '.', ',', '.', '.', ',', '.', ',', '.', '.', ',', '.', ',', ',', ',', ',', '.', '.', '.', ',', ',', '.', '.', ',', '.', ',', '.', ',', '.', '.', ',', ',', '.', '.', ',', '.', '.', '.', ',', ',', ':', '.', '.', '.', '.', '.', '.', '.', '.', "'", "'", ':', "'",
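
*One detail of the character class above is worth checking: because the hyphen sits between the underscore and the backtick, Python reads those three characters as a range from underscore to backtick, and the hyphen itself is not matched.  The sketch below compares the class as written with a variant that moves the hyphen to the end, where it is literal.  The names `original` and `variant` and the sample string are made up for the comparison.*

```python
import re

# The class as written above: underscore, hyphen, backtick form a range.
original = re.compile(r'''(?x)
    [][.,;"'?!():_-`]   # punctuation
''')

# Variant: hyphen moved to the end of the class, so it is a literal hyphen.
variant = re.compile(r'''(?x)
    [][.,;"'?!():_`-]   # punctuation, hyphen included
''')

sample = "a-b_c`d"
print(original.findall(sample))
print(variant.findall(sample))
```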

*__b.__ Using regular expressions to extract information such as proper names - which can take numerous forms - is fraught with problems, and the regular expressions below are far from perfect.  It could very well be that the point of this exercise was to demonstrate just how difficult this approach is.*

*Proper names are almost always capitalized; but so are many other words, and this approach alone is practically guaranteed to generate false positives/negatives.  I tried to limit the number of false positives by only examining sequences of two or more strings that began with uppercase letters; but that would eliminate any companies with a one-word name, as well as any references to a person using only his/her first/last name.  Another issue is that this returns strings where multiple words are in uppercase, such as titles.*

*Monetary values are more straightforward, but I noticed that large amounts are often written out, so I created a special use case for this.  I didn't bother with all the different currency symbols for commonly-used currencies and instead made do with using the dollar sign.*

*Finally, dates have a variety of formats around the world, but for the purpose of this exercise I only focussed on those formats in common use in the U.S., i.e. numerical dates (__10/16/19__ or __10/16/2019__) or literal dates (__Oct. 16, 2019__ or __October 16, 2019__.)*

*While researching this question, I came across several methods that looked interesting, but which I did not pursue because of time limitations.  One was finding first names in a text by using the `names` corpus in NLTK.  It's easy to see how this would work; but I didn't explore this, because it seems the focus of this unit is regular expressions.  An additional issue would be the lack of corpora for last and company names.  Another method that looked interesting was using Python's `datetime` module to deal with dates.  But, as was the case above, time limitations prevented me from trying this.*
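
*For reference, the `datetime` idea might be sketched as follows.  This is an untested-on-the-corpus illustration, not something used in the answer below: the function name `parse_us_date` and the list of formats are assumptions, and the month names rely on an English locale.*

```python
from datetime import datetime

# Formats for the date styles mentioned above; an assumption, not an exhaustive list.
FORMATS = ["%m/%d/%y", "%m/%d/%Y", "%b. %d, %Y", "%B %d, %Y"]

def parse_us_date(text):
    """Return a date for the first format that fits, or None if none does."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue   # this format did not fit; try the next one
    return None

for candidate in ["10/16/19", "10/16/2019", "Oct. 16, 2019", "October 16, 2019", "next Tuesday"]:
    print(candidate, parse_us_date(candidate))
```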



In [280]:
pattern = r'''(?x)
          
          (?:[A-Z])(?:[a-z]+|\.)(?:\s+[A-Z](?:[a-z]+|\.))*(?:\s+[A-Z])(?:[a-z]+|\.)
                                         # proper names
          | \$\d+\s\b[tr|b|m]illion\b    # literal monetary amounts
          | \$?\d+(?:[,\.]\d+)?          # numerical monetary amounts
          | \d{2}\[\\]\d{2}\-\\]\d{4}    # numerical dates (U.S. format)
          | [A-Z][a-z.]*\s\d{2}\,\s\d{2, 4} # literal dates (U.S. format)

          
        '''

print(nltk.regexp_tokenize(nyt, pattern))

['Tech Giants Are Paying Huge Salaries', 'Scarce A.', 'I. Talent', 'The New York Times Technology', 'Tech Giants Are Paying Huge Salaries', 'Scarce A.', 'I. Talent Subscribe Log In Image Credit Credit Christina Chung Sections Skip', 'Tech Giants Are Paying Huge Salaries', 'Scarce A.', 'I. Talent Nearly', 'Credit Credit Christina Chung Supported', 'By Cade Metz Oct', '22', '2017', 'Silicon Valley', 'I. Tech', 'Typical A.', '$300,000', '$500,000', 'Anthony Levandowski', '2007', '$120 million', 'Image Luke Zettlemoyer', 'Allen Institute', 'Artificial Intelligence', 'Credit Kyle Johnson', 'The New York Times Salaries', 'National Football League', 'Christopher Fernandez', 'Silicon Valley', '10,000', 'Andrew Moore', 'Carnegie Mellon University', '$650 million', '2014', '50', '400', '$138 million', '$345,000', 'Jessica Cataneo', '1950', '2013', 'Amazon Echo', '40', 'Carnegie Mellon', '2015', 'Stanford University', '20', 'Oren Etzioni', 'Allen Institute', 'Artificial Intelligence', 'Luke Zettl
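
*The output above hints at a few problems in the pattern: `Oct` is swallowed by a proper name while `22` and `2017` come out as separate tokens, so neither date branch ever fires.  Three likely causes: `[tr|b|m]` is a character class that matches a single character rather than the alternatives tr, b, and m; the date branches contain stray brackets and a `{2, 4}` with a space, which Python's `re` does not read as a repetition count; and the numerical-amount branch comes first and consumes any leading digits.  The sketch below is one possible revision, checked only against a made-up sentence and run with `re.findall` instead of `nltk.regexp_tokenize`.  It also requires the dollar sign for amounts, which is a design choice rather than a correction.*

```python
import re

pattern = re.compile(r'''(?x)
      \$\d+(?:\.\d+)?\s(?:tr|b|m)illion\b       # written-out amounts: $120 million
    | \$\d+(?:[,.]\d+)*                         # numerical amounts: $300,000
    | \d{1,2}[/-]\d{1,2}[/-](?:\d{4}|\d{2})     # numerical dates (U.S. format)
    | [A-Z][a-z]+\.?\s\d{1,2},\s\d{4}           # literal dates (U.S. format)
    | [A-Z][a-z]+(?:\s[A-Z][a-z]+)+             # two or more capitalized words
''')

# Made-up sentence that exercises every branch
sample = ("Cade Metz wrote on Oct. 22, 2017 that Google paid $650 million, "
          "with salaries of $300,000 by 10/16/2019.")

found = pattern.findall(sample)
print(found)
```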

##### 10.

☼ Rewrite the following loop as a list comprehension:

In [282]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
result = []
for word in sent:
    word_len = (word, len(word))
    result.append(word_len)
print(result)

[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]


In [285]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
result = [(word, len(word)) for word in sent]
print(result)

[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]


##### 11.

☼ Define a string `raw` containing a sentence of your own choosing. Now, split `raw` on some character other than space, such as '`s`'.

In [287]:
raw = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
raw.split('wood')

['How much ', ' would a ', 'chuck chuck if a ', 'chuck could chuck ', '?']

##### 12.

☼ Write a `for` loop to print out the characters of a string, one per line.

In [121]:
string = "Compared to some of the previous exercises, this seems comically easy."

for s in string[:20]:
    print(s)

C
o
m
p
a
r
e
d
 
t
o
 
s
o
m
e
 
o
f
 


##### 13.

☼ What is the difference between calling `split` on a string with no argument or with `' '` as the argument, e.g. `sent.split()` versus `sent.split(' ')`? What happens when the string being split contains tab characters, consecutive space characters, or a sequence of tabs and spaces? 

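*`sent.split()` splits on any run of whitespace (spaces, tabs, newlines), so consecutive whitespace characters count as a single separator and no empty strings are returned.*

*`sent.split(' ')` splits on the space character only.  I.e., tabs stay inside the resulting strings as `\t`, and every additional consecutive space produces an empty string.*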

In [294]:
s1 = "This string is a pretty simple string."
s2 = "This\tstrings\thas\ttabs."
s3 = "This        string          has      lots    of     space."
s4 = "This\tstring         has\ttabs       and\tspaces."

Ss = [s1, s2, s3, s4]

for s in Ss:
    print("\nWith `sent.split()`:")
    print(s.split())
    print("\nWith `sent.split(' ')`:")
    print(s.split(' '))


With `sent.split()`:
['This', 'string', 'is', 'a', 'pretty', 'simple', 'string.']

With `sent.split(' ')`:
['This', 'string', 'is', 'a', 'pretty', 'simple', 'string.']

With `sent.split()`:
['This', 'strings', 'has', 'tabs.']

With `sent.split(' ')`:
['This\tstrings\thas\ttabs.']

With `sent.split()`:
['This', 'string', 'has', 'lots', 'of', 'space.']

With `sent.split(' ')`:
['This', '', '', '', '', '', '', '', 'string', '', '', '', '', '', '', '', '', '', 'has', '', '', '', '', '', 'lots', '', '', '', 'of', '', '', '', '', 'space.']

With `sent.split()`:
['This', 'string', 'has', 'tabs', 'and', 'spaces.']

With `sent.split(' ')`:
['This\tstring', '', '', '', '', '', '', '', '', 'has\ttabs', '', '', '', '', '', '', 'and\tspaces.']


##### 14. 

☼ Create a variable `words` containing a list of words. Experiment with `words.sort()` and `sorted(words)`. What is the difference?

*`words.sort()` doesn't return a value, but it alters the ordering of the list, so that whenever I call the list again, the returned list will be the ordered one, and not the one I originally stored.*

In [298]:
words = ["slova", "ord", "Wörter", "λόγια", "words", "palabras", "sanat", 
         "mots", "focail", "szavak", "parole", "words", "woorden", "ord", 
         "słowa", "palavras", "từ ngữ", "ווערטער"]

words.sort()
print(words)

['Wörter', 'focail', 'mots', 'ord', 'ord', 'palabras', 'palavras', 'parole', 'sanat', 'slova', 'szavak', 'słowa', 'từ ngữ', 'woorden', 'words', 'words', 'λόγια', 'ווערטער']


In [299]:
print(words)

['Wörter', 'focail', 'mots', 'ord', 'ord', 'palabras', 'palavras', 'parole', 'sanat', 'slova', 'szavak', 'słowa', 'từ ngữ', 'woorden', 'words', 'words', 'λόγια', 'ווערטער']


*`sorted(words)` will return a result, but the ordering of the original list will not be changed:*

In [301]:
words = ["slova", "ord", "Wörter", "λόγια", "words", "palabras", "sanat", 
         "mots", "focail", "szavak", "parole", "words", "woorden", "ord", 
         "słowa", "palavras", "từ ngữ", "ווערטער"]

print(sorted(words))

['Wörter', 'focail', 'mots', 'ord', 'ord', 'palabras', 'palavras', 'parole', 'sanat', 'slova', 'szavak', 'słowa', 'từ ngữ', 'woorden', 'words', 'words', 'λόγια', 'ווערטער']


In [303]:
print(words)

['slova', 'ord', 'Wörter', 'λόγια', 'words', 'palabras', 'sanat', 'mots', 'focail', 'szavak', 'parole', 'words', 'woorden', 'ord', 'słowa', 'palavras', 'từ ngữ', 'ווערטער']
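
*A shorter sketch of the same contrast, on a made-up subset of the list.  It also shows why 'Wörter' ends up first in the output above: uppercase letters sort before lowercase ones unless a key such as `str.casefold` is supplied.*

```python
words = ["slova", "Wörter", "words", "mots"]

in_place = words.sort()   # sorts the list itself; the call returns None
print(in_place, words)

caseless = sorted(words, key=str.casefold)  # new list, case-insensitive order
print(caseless)
```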


##### 15. 

☼ Explore the difference between strings and integers by typing the following at a Python prompt: `"3" * 7` and `3 * 7`. Try converting between strings and integers using `int("3")` and `str(3)`.

*Multiplying a string $x$ by an integer $y$ returns a new string in which $x$ is repeated $y$ times:*

In [304]:
"3" * 7

'3333333'

<i>`3 * 7` will just give us the product:</i>

In [305]:
3 * 7

21

*We can convert strings to integers with `int()` and vice-versa with `str()`:*

In [306]:
int("3") * 7

21

In [307]:
str(3) * 7

'3333333'

##### 16. 

☼ Use a text editor to create a file called `prog.py` containing the single line `monty = 'Monty Python'`. Next, start up a new session with the Python interpreter, and enter the expression `monty` at the prompt. You will get an error from the interpreter. Now, try the following (note that you have to leave off the `.py` part of the filename):

*Errors in my notebook prevent the remaining cells from running, so this is being saved as markdown:*

```
monty

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-308-d4cc90107335> in <module>
----> 1 monty

NameError: name 'monty' is not defined
```

In [310]:
from prog import monty
monty

'Monty Python'

*This time, Python should return with a value. You can also try `import prog`, in which case Python should be able to evaluate the expression `prog.monty` at the prompt.*

In [311]:
import prog

prog.monty

'Monty Python'

##### 17. 

☼ What happens when the formatting strings `%6s` and `%-6s` are used to display strings that are longer than six characters?

*This looks to be a legacy question from an older version of the book, since this is the older method of formatting in Python.  The `6` sets a minimum field width, so `%6s` won't do anything to a longer string:*

In [323]:
test = "another test"

"%6s" % (test)

'another test'

*`%-6s` likewise leaves a longer string unchanged; it only adds padding on the right of a string shorter than six characters:*

In [327]:
"%-6s" % ("hey")

'hey   '

*I suspect the authors may have left out a period.  `%.6s` sets a precision, which truncates the string to six characters:*

In [328]:
"%.6s" % (test)

'anothe'
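
*For comparison, the three specifications can be written with the newer format mini-language, which `format()` and f-strings share.  This is a side-by-side sketch, with variable names chosen only for the illustration.*

```python
test = "another test"

padded_right = "%6s" % test     # minimum width 6, right-aligned
padded_left = "%-6s" % "hey"    # minimum width 6, left-aligned
truncated = "%.6s" % test       # precision 6: at most six characters

# Each pair holds the %-style result and its format() counterpart
print([padded_right, format(test, ">6")])
print([padded_left, format("hey", "<6")])
print([truncated, format(test, ".6")])
```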

##### 18. 

◑ Read in some text from a corpus, tokenize it, and print the list of all *wh*-word types that occur. (*wh*-words in English are used in questions, relative clauses and exclamations: *who*, *which*, *what*, and so on.) Print them in order. Are any words duplicated in this list, because of the presence of case distinctions or punctuation?

*This question is a little difficult to follow.  Most of the corpora we're using have texts that have already been tokenized, so the first part of this question seems a bit redundant.  However, just to play along,  I'll use the raw text version of one of the Project Gutenberg texts.*

In [241]:
from nltk import word_tokenize
from nltk.corpus import gutenberg

raw = gutenberg.raw('bryant-stories.txt')

tokens = word_tokenize(raw)

tokens = sorted(set(tokens))

*The next part of the question also seems a little hard to follow: The __wh-__words are a closed set (__what__, __when__, __where__, __which__, __who__, __whose__, and __why__), and if I explicitly define them, I won't find any exceptions.  The only possible way around this is to search for all words that begin with __wh__.  But if I do so, most of the hits will be false positives, because most words that start with __wh-__ are outside of this set.*

In [356]:
print([w for w in tokens if re.search('^[Ww]h', w)])

['Whale', 'What', 'When', 'Whenever', 'Where', 'Whether', 'Whiff', 'While', 'Whirling', 'White', 'Who', 'Whose', 'Why', 'what', 'whatever', 'wheat', 'wheelbarrow', 'wheeled', 'when', 'whence', 'whenever', 'where', 'wherein', 'wherever', 'whether', 'which', 'while', 'whimpering', 'whin', 'whinny', 'whipped', 'whirlpool', 'whiruled', 'whisk', 'whisked', 'whisper', 'whisper_', 'whispered', 'whispering', 'whispers', 'whistle', 'whistled', 'white', 'white-haired', 'white-robed', 'whither', 'who', 'whole', 'wholly', 'whom', 'whose', 'why']


*As expected, most of the results are false positives.  There are also versions of the __wh-__ words starting with both upper- and lowercase letters, as well as 'whom', the accusative form of 'who'.  But there are also a number of words that could be considered __wh-__ words that I wouldn't have thought of searching for, such as words ending with '-ever' (e.g., 'whatever', 'whenever', etc.); variant forms of 'where' (i.e., 'whence' and 'whither'); as well as 'whether' and 'wherein'.*
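
*To answer the duplication question more directly, the tokens can be grouped by their lowercase form.  The sketch below uses a handful of tokens from the output above and an explicit set of __wh-__ words; both the set `WH_WORDS` (the closed set plus 'whom') and the shortened token list are assumptions made for the illustration.*

```python
from collections import defaultdict

WH_WORDS = {"what", "when", "where", "which", "who", "whom", "whose", "why"}

# A few tokens taken from the list printed above
tokens = ['What', 'Whale', 'When', 'what', 'wheat', 'when', 'which', 'who', 'whole', 'whom']

# Map each wh-word to the set of surface forms it appears in
variants = defaultdict(set)
for token in tokens:
    if token.lower() in WH_WORDS:
        variants[token.lower()].add(token)

# Keep only the wh-words that occur in more than one form
duplicated = {w: sorted(forms) for w, forms in variants.items() if len(forms) > 1}
print(duplicated)
```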

##### 19. 

◑ Create a file consisting of words and (made up) frequencies, where each line consists of a word, the space character, and a positive integer, e.g. `fuzzy 53`. Read the file into a Python list using `open(filename).readlines()`. Next, break each line into its two fields using `split()`, and convert the number into an integer using `int()`. The result should be a list of the form: `[['fuzzy', 53], ...]`.

In [358]:
fuzzy = open('fuzzy.txt', encoding = "utf-8").readlines()

In [386]:
[[word, int(value)] for word, value in (f.split() for f in fuzzy)]

[['fuzzy', 53],
 ['wuzzy', 92],
 ['was', 128],
 ['a', 4897],
 ['bear', 23],
 ['had', 47],
 ['no', 93],
 ['hair', 23],
 ["wasn't", 78],
 ['he', 77]]
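
*The comprehension above assumes every line has exactly two fields.  A more defensive sketch, written as an explicit loop, closes the file with a `with` block and skips blank or malformed lines.  The temporary file and its contents are made up so that the example is self-contained.*

```python
import os
import tempfile

# Write a small stand-in for fuzzy.txt (contents are made up)
path = os.path.join(tempfile.mkdtemp(), 'fuzzy_demo.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write("fuzzy 53\nwuzzy 92\n\nwas 128\n")

pairs = []
with open(path, encoding='utf-8') as f:   # the file is closed when the block ends
    for line in f:
        fields = line.split()
        if len(fields) != 2:               # skip blank or malformed lines
            continue
        word, value = fields
        pairs.append([word, int(value)])

print(pairs)
```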

##### 20. 

◑ Write code to access a favorite webpage and extract some text from it. For example, access a weather site and extract the forecast top temperature for your town or city today.

*This is not so straightforward.  Most websites today do not have a static link to weather values, and it's almost certainly easier to use an API.  [OpenWeather](https://openweathermap.org/api "OpenWeather") offers such an API, though it requires registration, which takes a few hours to process.  The results will be in JSON, which needs to be parsed, and the temperature will be in Kelvin, which will need to be converted:*


In [117]:
import requests
import json

# placeholder: substitute your own OpenWeather API key, and keep real keys out of published notebooks
key = 'YOUR_API_KEY'

def k_to_c(temp):
    """Converts Kelvin to Celsius."""
    return temp - 273.15

def k_to_f(temp):
    """Converts Kelvin to Fahrenheit."""
    return (temp - 273.15) * 1.8 + 32

def get_temp(city):
    """
    Gets current, high, and low temperatures for a given city.
    """
    r = requests.get('http://api.openweathermap.org/data/2.5/weather',
                     params={'q': city, 'APPID': key})
    data = r.json()  # parse the response body once
    
    if data['cod'] == '404':
        print("Sorry, we don't know where that city is.")
    else:
        current_k = data['main']['temp']
        min_k = data['main']['temp_min']
        max_k = data['main']['temp_max']
        print("The current temperature is {:.1f} C°/{:.1f} F°.".format(k_to_c(current_k), k_to_f(current_k)))
        print("Today's high temperature is {:.1f} C°/{:.1f} F°.".format(k_to_c(max_k), k_to_f(max_k)))
        print("Today's low temperature is {:.1f} C°/{:.1f} F°.".format(k_to_c(min_k), k_to_f(min_k)))

In [111]:
get_temp('Hiroshima')

The current temperature is 17.8 C°/64.0 F°.
Today's high temperature is 20.6 C°/69.0 F°.
Today's low temperature is 14.0 C°/57.2 F°.


In [118]:
get_temp('Kure')

The current temperature is 18.7 C°/65.6 F°.
Today's high temperature is 20.6 C°/69.0 F°.
Today's low temperature is 16.0 C°/60.8 F°.


In [115]:
get_temp('aaaa')

Sorry, we don't know where that city is.


##### 21.

◑ Write a function `unknown()` that takes a URL as its argument, and returns a list of unknown words that occur on that webpage. In order to do this, extract all substrings consisting of lowercase letters (using `re.findall()`) and remove any items from this set that occur in the Words Corpus (`nltk.corpus.words`). Try to categorize these words manually and discuss your findings.

*If we literally do what the instructions tell us, we'll get quite a large list, as inflected word forms (-ing, -ed, -s, etc...) are mostly absent from the Words Corpus.  So I've added parameters to `unknown` that will let us exclude these common words.  This may lead to a small number of false negatives; but I feel this is an acceptable cost, given the very high number of false positives.  It might be advisable to run the function twice, once without the endings excluded and once with, so that the user can inspect for him-/herself which words are being excluded.*

*Additionally, many irregular verb forms (past tenses and past participles) are absent from the wordlist, so I added these manually.*

*After adjusting which words are excluded, we'll see that most of the words are fairly new words ('podcast', 'app', ...) which haven't had a chance to be added to the corpus.  We also see compound words ('counterproductive', 'rollout', ...) whose constituent parts are part of the corpus.  We will also see the occasional lower-case last name: it's common for authors to post their Twitter handles next to their byline, and these handles are usually lowercase.  Upper-case names were pruned in the function.*



In [65]:
wordlist = [w.lower() for w in nltk.corpus.words.words('en')]

# irregular verbs
verbs = ['ate', 'beat', 'beaten', 'became', 'become', 'began', 'begun', 'bent', 
         'bet', 'bid', 'bit', 'bitten', 'blew', 'blown', 'bought', 'broke', 'broken', 
         'brought', 'built', 'burnt', 'came', 'caught', 'chose', 'chosen', 'come', 
         'cost', 'cut', 'did', 'dived', 'done', 'dove', 'drank', 'drawn', 'dreamt', 
         'drew', 'driven', 'drove', 'drunk', 'dug', 'eaten', 'fallen', 'fell', 'felt', 
         'flew', 'flown', 'forgave', 'forgiven', 'forgot', 'forgotten', 'fought', 'found', 
         'froze', 'frozen', 'gave', 'given', 'gone', 'got', 'gotten', 'grew', 'grown', 
         'had', 'heard', 'held', 'hid', 'hidden', 'hit', 'hung', 'hurt', 'kept', 'knew', 
         'known', 'laid', 'lain', 'lay', 'led', 'left', 'lent', 'let', 'lost', 'made', 
         'meant', 'met', 'paid', 'put', 'ran', 'rang', 'read', 'ridden', 'risen', 'rode', 
         'rose', 'run', 'rung', 'said', 'sang', 'sat', 'saw', 'seen', 'sent', 'showed', 
         'shown', 'shut', 'slept', 'sold', 'spent', 'spoke', 'spoken', 'stood', 'sung', 
         'swam', 'swum', 'taken', 'taught', 'thought', 'threw', 'thrown', 'told', 'took', 
         'tore', 'torn', 'understood', 'went', 'woke', 'woken', 'won', 'wore', 'worn', 
         'written', 'wrote']

wordlist = set(wordlist + verbs)  # a set makes the membership tests in unknown() much faster



def unknown(url, es = False, s = False, ed = False, ing = False, n = False, er = False):
    
    # get text
    raw = return_URL_contents(url)
    
    # get lower-case words
    raw_lower = re.findall(r'\b[a-z]+\b', raw)
    
    # find unknown words and eliminate duplicates
    unknown = sorted(set([w for w in raw_lower if w not in wordlist]))
    
    # find common words that are not in wordlist because of 
    # morphological changes
    exclude = []
    
    # words with -es plurals
    if es:
        es = [i for i in unknown if i[-2:] == 'es' and i[:-2] in wordlist] 
        exclude += es
        # -y becomes -ies
        ies = [i for i in unknown if i[-3:] == 'ies' and i[:-3] + 'y' in wordlist]
        exclude += ies
        
    # regular plurals
    if s:
        s = [i for i in unknown if i[-1] == 's' and i[:-1] in wordlist]
        exclude += s
        
    # regular past tense forms
    if ed:
        # verbs with final -e
        d = [i for i in unknown if i[-1:] == 'd' and i[:-1] in wordlist]
        exclude += d
        # regular verbs
        ed = [i for i in unknown if i[-2:] == 'ed' and i[:-2] in wordlist]
        exclude += ed
        # verbs that double final consonant
        dd = [i for i in unknown if i[-2:] == 'ed' and i[:-3] in wordlist]
        exclude += dd
        
    # regular gerunds
    if ing:
        # verbs with final -e
        ng = [i for i in unknown if i[-3:] == 'ing' and i[:-3] + 'e' in wordlist]
        exclude += ng
        # regular verbs
        ing = [i for i in unknown if i[-3:] == 'ing' and i[:-3] in wordlist]
        exclude += ing
        # verbs that double final consonant
        nng = [i for i in unknown if i[-3:] == 'ing' and i[:-4] in wordlist]
        exclude += nng
        
    if n:
        # negative contractions without final -'t
        n = [i for i in unknown if i[-1:] == 'n' and i[:-1] in wordlist]
        exclude += n
        
    if er:
        # comparative forms
        er = [i for i in unknown if i[-2:] == 'er' and i[:-2] in wordlist]
        exclude += er
        # comparative forms with final -y
        ier = [i for i in unknown if i[-3:] == 'ier' and i[:-3] + 'y' in wordlist]
        exclude += ier
        # comparative forms with final -e
        r = [i for i in unknown if i[-2:] == 'er' and i[:-1] in wordlist]
        exclude += r
        # superlative forms
        est = [i for i in unknown if i[-3:] == 'est' and i[:-3] in wordlist]
        exclude += est
        # superlative forms with final -e
        st = [i for i in unknown if i[-3:] == 'est' and i[:-2] in wordlist]
        exclude += st
        # superlative forms with final -y
        iest = [i for i in unknown if i[-4:] == 'iest' and i[:-4] + 'y' in wordlist]
        exclude += iest
        
    # return only those unknown words that have not been excluded
    # by the above list comprehensions
    return [i for i in unknown if i not in exclude]


In [25]:
url = 'https://www.vox.com/recode/2019/10/16/20916712/cnn-democratic-presidential-debate-big-tech-silicon-valley-warren-harris'

print(unknown(url))

['actions', 'agencies', 'answers', 'app', 'attacking', 'attempts', 'banned', 'biohacker', 'breaks', 'bringing', 'businesses', 'called', 'candidates', 'changing', 'companies', 'compares', 'comparing', 'competitors', 'couldn', 'criticized', 'debated', 'details', 'diagnosing', 'didn', 'discussed', 'doesn', 'doors', 'employees', 'enemies', 'enjoyed', 'executives', 'explained', 'explodes', 'explores', 'frontrunner', 'fundraising', 'giants', 'harshest', 'has', 'helped', 'hosted', 'ignoring', 'including', 'indicated', 'infused', 'joined', 'laws', 'lists', 'lives', 'mailing', 'mainstream', 'members', 'millennials', 'minutes', 'missed', 'monopolies', 'numbers', 'offered', 'okay', 'options', 'playing', 'podcast', 'politicians', 'practices', 'pressed', 'proposals', 'pushed', 'represents', 'required', 'rights', 'rules', 'sectors', 'sees', 'sharing', 'shifted', 'showcased', 'signing', 'simmering', 'specifics', 'stories', 'streamers', 'techlash', 'teddyschleifer', 'themes', 'things', 'topics', 'toug

In [50]:
url = 'https://www.vox.com/recode/2019/10/16/20916712/cnn-democratic-presidential-debate-big-tech-silicon-valley-warren-harris'

print(unknown(url, es = True, s = True, ed = True, ing = True, n = True, er = True))

['app', 'biohacker', 'frontrunner', 'fundraising', 'mainstream', 'okay', 'podcast', 'techlash', 'teddyschleifer']


In [66]:
url = "https://www.nytimes.com/2019/10/16/world/middleeast/trump-erdogan-turkey-syria-kurds.html?action=click&module=Top%20Stories&pgtype=Homepage"

print(unknown(url))

['abandoning', 'acquiescing', 'adding', 'advocates', 'aligned', 'angels', 'announced', 'appearances', 'applications', 'areas', 'arts', 'asserted', 'attacking', 'attempted', 'automobiles', 'berated', 'books', 'briefed', 'captured', 'citing', 'closest', 'closing', 'columnists', 'comments', 'communists', 'compared', 'complaining', 'concerns', 'condemns', 'continues', 'contributed', 'controlled', 'corrections', 'counterproductive', 'countries', 'created', 'criticized', 'dated', 'decades', 'defended', 'defends', 'denied', 'denounced', 'deployed', 'described', 'detained', 'deterring', 'dismissed', 'dismissing', 'earlier', 'edited', 'editorials', 'elected', 'email', 'emerged', 'emphasized', 'events', 'expressing', 'fears', 'feels', 'fighters', 'focusing', 'followed', 'forces', 'friends', 'gained', 'hands', 'has', 'having', 'heated', 'horses', 'hours', 'implying', 'including', 'insisted', 'interests', 'interjected', 'interpreted', 'invading', 'isn', 'issues', 'jeopardizing', 'jobs', 'journeys'

In [67]:
print(unknown(url, es = True, s = True, ed = True, ing = True, n = True, er = True))

['counterproductive', 'email', 'lebanon', 'meltdown', 'multimedia', 'near', 'ol', 'op', 'peterbakernyt', 'reemerge', 'rollout']


In [68]:
url = "https://www.cyclingnews.com/news/taylor-phinney-set-to-retire/"
print(unknown(url, es = True, s = True, ed = True, ing = True, n = True, er = True))

['allenskratch', 'cyclo', 'll', 'longtime', 'maglia', 'plc', 'taylorphinney', 'triallist', 'unsubscribe', 'vibe']
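
*The suffix checks in `unknown()` above all follow the same pattern: strip some characters, optionally append something, and look the result up. As a sketch only, the same idea can be written as a table of rules. The names `SUFFIX_RULES` and `is_inflected_form` and the toy vocabulary are hypothetical and stand in for the NLTK word list; using a set instead of a list also makes each lookup much faster.*

```python
# Hypothetical rule table: (suffix, characters to strip, string to append).
SUFFIX_RULES = [
    ("ies", 3, "y"),   # agencies   -> agency
    ("es", 2, ""),     # businesses -> business
    ("s", 1, ""),      # answers    -> answer
    ("ed", 2, ""),     # called     -> call
    ("d", 1, ""),      # required   -> require
    ("ed", 3, ""),     # banned     -> ban
    ("ing", 3, ""),    # playing    -> play
    ("ing", 3, "e"),   # sharing    -> share
    ("ing", 4, ""),    # running    -> run
]

def is_inflected_form(word, vocab):
    """True if stripping a known suffix from word yields an entry in vocab."""
    for suffix, strip, append in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > strip:
            if word[:-strip] + append in vocab:
                return True
    return False

# Toy vocabulary standing in for the NLTK word list; a set gives fast lookups.
vocab = {"agency", "business", "answer", "call", "require",
         "ban", "play", "share", "run"}
candidates = ["agencies", "businesses", "answers", "called", "required",
              "banned", "playing", "sharing", "running", "podcast", "techlash"]

leftover = [w for w in candidates
            if w not in vocab and not is_inflected_form(w, vocab)]
print(leftover)  # ['podcast', 'techlash']
```

*Adding a new morphological pattern then means adding one tuple to the table rather than another keyword argument and list comprehension.*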


##### 22. 

◑ Examine the results of processing the URL `http://news.bbc.co.uk/` using the regular expressions suggested above. You will see that there is still a fair amount of non-textual data there, particularly Javascript commands. You may also find that sentence breaks have not been properly preserved. Define further regular expressions that improve the extraction of text from this web page.

*I find this question really poorly written.  We're supposed to use the 'regular expressions suggested above', but which ones?  The chapter is full of them!*

*I've already made several functions that do fairly good jobs of extracting text and removing Javascript commands.  I don't really feel like breaking one of these functions just for the sake of practice...*
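
*For readers who do want a regex-only attempt, here is a minimal sketch of the kind of clean-up the exercise seems to be asking for. It is independent of `return_URL_contents()`; the function name `strip_markup` and the sample HTML are made up for illustration, and a real page is usually better handled by an HTML parser.*

```python
import re

def strip_markup(html):
    """Rough regex-based text extraction from an HTML string."""
    # drop <script> and <style> elements together with their contents
    html = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", " ", html)
    # turn the end of block-level elements into line breaks,
    # so that sentence boundaries are not lost
    html = re.sub(r"(?i)</(p|div|h[1-6]|li)\s*>|<br\s*/?>", "\n", html)
    # remove all remaining tags
    html = re.sub(r"<[^>]+>", " ", html)
    # collapse runs of spaces and tabs, then tidy whitespace around newlines
    html = re.sub(r"[ \t]+", " ", html)
    html = re.sub(r"\s*\n\s*", "\n", html)
    return html.strip()

sample = ("<html><head><script>var x = 1;</script>"
          "<style>p {color: red}</style></head>"
          "<body><h1>Headline</h1><p>First sentence.</p>"
          "<p>Second <b>bold</b> one.</p></body></html>")

print(strip_markup(sample))
```

*On the sample this prints the headline and the two sentences on separate lines, with the Javascript and CSS removed.*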

In [124]:
url = "https://www.bbc.com/news"
return_URL_contents(url)[:2000]

"Home - BBC News Homepage Accessibility links Skip to content Accessibility Help BBC Account Notifications Home News Sport Weather iPlayer Sounds CBBC CBeebies Food Bitesize Arts Taster Local TV Radio Three Menu Search Search the BBC Search the BBC BBC News News Navigation Sections Home Home selected Video World Asia UK Business Tech Science Stories Entertainment & Arts Health World News TV In Pictures Reality Check Newsbeat Special Reports Explainers Long Reads Have Your Say More More sections Home Home selected Video World World Home Africa Australia Europe Latin America Middle East US & Canada Asia Asia Home China India UK UK Home England N. Ireland Scotland Wales Politics Local News Business Business Home Market Data Global Trade Companies Entrepreneurship Technology of Business Business of Sport Global Education Economy Global Car Industry Tech Science Stories Entertainment & Arts Health World News TV In Pictures Reality Check Newsbeat Special Reports Explainers Long Reads Have Yo

##### 23. 
◑ Are you able to write a regular expression to tokenize text in such a way that the word *don't* is tokenized into *do* and *n't*? Explain why this regular expression won't work: `«n't|\w+»`.

*The short answer is that I'm not really sure.  I understand that this book is (was) designed so that it could be used in the classroom, and therefore the answers were not included; but I believe a large percentage - perhaps even the majority - of users are students engaging in self-study, and for these people (myself included), explanations would be greatly appreciated.*

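<i>My understanding is that the pattern fails because the regular expression engine scans from left to right and, at each position, takes the first alternative that matches. At the <code>d</code> of <code>don't</code>, <code>n't</code> cannot match, so <code>\w+</code> is tried and greedily consumes <code>don</code>, including the <code>n</code> that <code>n't</code> would have needed; the apostrophe is then skipped and <code>t</code> is matched on its own. One workaround is to put a non-greedy group <code>(.*?)</code> in front of <code>n't</code>, and to wrap <code>n't</code> in parentheses so that both parts are returned as capturing groups.</i>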

In [161]:
re.findall(r"(.*?)(n't)|\w+", "don't")

[('do', "n't")]
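
*The pattern above returns tuples and only handles a single word. A possible alternative for whole sentences, offered as a sketch rather than as the book's intended answer, is to add a lookahead alternative that stops a word just before `n't`:*

```python
import re

# Alternatives are tried in order at each position:
#   \w+(?=n't)  a word stem that is immediately followed by n't
#   n't         the contraction itself
#   \w+         any other word
pattern = r"\w+(?=n't)|n't|\w+"

tokens = re.findall(pattern, "I don't think she can't")
print(tokens)  # ['I', 'do', "n't", 'think', 'she', 'ca', "n't"]
```

*Because the pattern has no capturing groups, `re.findall()` returns a flat list of tokens.*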

##### 24. 

◑ Try to write code to convert text into *hAck3r*, using regular expressions and substitution, where `e` → `3`, `i` → `1`, `o` → `0`, `l` → `|`, `s` → `5`, `.` → `5w33t!`, `ate` → `8`. Normalize the text to lowercase before converting it. Add more substitutions of your own. Now try to map `s` to two different values: `$` for word-initial `s`, and `5` for word-internal `s`.

*This seems like a perfect place to use `re.sub()`, which, typically, is not introduced until later in the exercise set...*

In [171]:
test = "Hello suckers.  I ate your lunch.  It was delish."

test = test.lower()

org = ['ate', 'e', 'i', 'o', 'l', 's', r'\.']
sub = ['8', '3', '1', '0', '|', '5', '5w33t!']

for i in range(len(org)):
    test = re.sub(org[i], sub[i], test)

test

'h3||0 5uck3r55w33t!  1 8 y0ur |unch5w33t!  1t wa5 d3|15h5w33t!'

*Adding my own substitutions:*

In [187]:
test = "Peter Piper picked a peck of pickled peppers."

test = test.lower()

org = ['e', 'i', 'o', 'l', 's', 't', 'p', r'\.']
sub = ['3', '1', '0', '|', '5', '+', '%', '5w33t!']

for i in range(len(org)):
    test = re.sub(org[i], sub[i], test)

test

'%3+3r %1%3r %1ck3d a %3ck 0f %1ck|3d %3%%3r55w33t!'

*Differentiating between initial $s$ and medial $s$:*

In [185]:
test = "Susie lives in Mississippi."

test = test.lower()

org = ['ate', 'e', 'i', 'o', 'l', 't', r'\bs', 's', r'\.']
sub = ['8', '3', '1', '0', '|', '+', '$', '5', '5w33t!']

for i in range(len(org)):
    test = re.sub(org[i], sub[i], test)

test

'$u513 |1v35 1n m1551551pp15w33t!'
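
*The loops above apply one substitution after another, so the order of the list matters, and a later rule could rewrite the output of an earlier one. As a sketch of an alternative, all substitutions can be done in a single pass with one compiled pattern and a replacement function. The names `SUBS`, `PATTERN` and `hacker` are my own, not from the book.*

```python
import re

SUBS = {"ate": "8", "e": "3", "i": "1", "o": "0",
        "l": "|", "s": "5", ".": "5w33t!"}

# Longer keys come first so that 'ate' is tried before 'e'.
keys = sorted(SUBS, key=len, reverse=True)

# The named group catches a word-initial 's' before the plain 's' rule.
PATTERN = re.compile(r"(?P<initial>\bs)|" + "|".join(re.escape(k) for k in keys))

def hacker(text):
    """Convert text to hAck3r in one pass over the lowercased input."""
    def replace(match):
        if match.group("initial"):
            return "$"
        return SUBS[match.group(0)]
    return PATTERN.sub(replace, text.lower())

print(hacker("Susie lives in Mississippi."))
# $u513 |1v35 1n m1551551pp15w33t!
```

*Since each character of the input is matched at most once, replacement text such as `5w33t!` is never fed back through the other rules.*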

##### 25. 

◑ Pig Latin is a simple transformation of English text. Each word of the text is converted as follows: move any consonant (or consonant cluster) that appears at the start of the word to the end, then append *ay*, e.g. *string* → *ingstray*, *idle* → *idleay*. http://en.wikipedia.org/wiki/Pig_Latin

 * a. Write a function to convert a word to Pig Latin.
 * b. Write code that converts text, instead of individual words.
 * c. Extend it further to preserve capitalization, to keep qu together (i.e. so that `quiet` becomes `ietquay`), and to detect when `y` is used as a consonant (e.g. `yellow`) vs a vowel (e.g. `style`).
 
*Instead of going through the instructions in order, I think it might be easier to handle all of the exceptions from the beginning.  I'm also not going to include all the intermediate code, so what follows is my final answer to all parts of this question:*

In [301]:
def pig_latin(word):
    """
    Returns pig latin version of word.
    """
    
    # replace curly ('smart') apostrophes with straight ones
    word = re.sub("’", "'", word)
        
    # won't work on non-alphabetic strings
    if not word.isalpha():
        if "'" not in word:
            return word
    
    # remember whether the original word was capitalized
    caps = False
    if word[0].isupper():
        caps = True
    word = word.lower()
    
    # word starts with vowel
    if word[0] in 'AEIOUaeiou':
        pl = word + 'ay'
        
    # some tokenizers will produce non-words    
    elif len(word) == 1:
        return word
    
    # word begins with 'y' - treated as a consonant
    # otherwise 'y' is a vowel, or first vowel is not 'y'
    elif word[0] == 'y':
        pl = word[1:] + 'yay'
    
    # word begins with 'qu'
    elif word[:2] == "qu":
        pl = word[2:] + "quay"
    
    # all other cases
    else:
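        # split into the leading consonant cluster and the remainder;
        # re.match() also copes with vowel-less tokens such as 'hmm'
        start, end = re.match(r'([^aeiouy]*)(.*)', word).groups()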
        pl = end + start + 'ay'
    
    # re-capitalize the result if the original word was capitalized
    if caps:
        pl = pl[0].upper() + pl[1:]
    return pl

*From Chapter 2, exercise 3.  It will make the final output look nicer by joining punctuation to the preceding string.*

In [None]:
def join_punctuation(text, characters = ["'", '’', ')', ',', '.', ':', ';', '?', '!', ']', "''"]): 
    """
    Takes a list of strings and attaches punctuation to
    the preceding string in the list.
    """
    
    text = iter(text)
    current = next(text)

    for nxt in text:
        if nxt in characters:
            current += nxt
        else:
            yield current
            current = nxt
            

    yield current

In [299]:
# from https://funnystories.tumblr.com/post/140309670613/funny-story

# For the sake of convenience, I've removed some punctuation.


story = """One time in sixth grade we were at recess and while I was running to 
my friends, I just so happened to kick a HUGE rock and without thinking I 
shouted at the top of my lungs MOTHERFUCKER And with my god-awful luck, my math
teacher was sitting at the bench right BESIDE ME. He then took me inside to 
what I thought was yell at me but he just couldn’t stop laughing and sent 
me back outside with a literal candy bar. He is still my favorite teacher 
I've ever had."""



In [304]:
def pig_latin_text(text):
    """
    Translates text into pig latin.
    """
    pl = []
    for t in re.findall(r'\b[\S]+\b|[.,!?]', text):
        pl.append(pig_latin(t))
        
    return " ".join(join_punctuation(pl))
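
*An alternative that keeps punctuation and spacing exactly where they were, sketched here under my own simplified rules rather than as a replacement for the functions above, is to let `re.sub()` find each word and hand it to a conversion function. `simple_pig_latin` and `pig_latin_sub` are hypothetical names; hyphenated words are converted part by part.*

```python
import re

def simple_pig_latin(word):
    """Pig Latin for one word: keeps 'qu' together, treats initial 'y' as a consonant."""
    lower = word.lower()
    # group 1: optional initial 'y', then any run of 'qu' or non-vowel characters
    # group 2: everything from the first vowel (or non-initial 'y') onwards
    head, rest = re.match(r"(y?(?:qu|[^aeiouy])*)(.*)", lower).groups()
    result = rest + head + "ay"
    # restore the capital letter if the original word had one
    if word[0].isupper():
        result = result[0].upper() + result[1:]
    return result

def pig_latin_sub(text):
    """Convert every word in text, leaving all other characters untouched."""
    word_pattern = r"[A-Za-z]+(?:'[A-Za-z]+)?"
    return re.sub(word_pattern, lambda m: simple_pig_latin(m.group(0)), text)

print(pig_latin_sub("Quiet, yellow style!"))  # Ietquay, ellowyay ylestay!
```

*With this approach no separate step is needed to re-attach punctuation, because the characters between words are never removed in the first place.*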

In [303]:
pig_latin_text(story)

"Oneay imetay inay ixthsay adegray eway ereway atay ecessray anday ilewhay Iay asway unningray otay ymay iendsfray, Iay ustjay osay appenedhay otay ickkay aay Ugehay ockray anday ithoutway inkingthay Iay outedshay atay ethay optay ofay ymay ungslay Otherfuckermay Anday ithway ymay god-awful ucklay, ymay athmay eachertay asway ittingsay atay ethay enchbay ightray Esidebay Emay. Ehay enthay ooktay emay insideay otay atwhay Iay oughtthay asway ellyay atay emay utbay ehay ustjay ouldn'tcay opstay aughinglay anday entsay emay ackbay outsideay ithway aay iterallay andycay arbay. Ehay isay illstay ymay avoritefay eachertay I'veay everay adhay."

In [328]:
rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')

In [329]:
cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]

In [331]:
cvs[:10]

['ka', 'ka', 'ka', 'ka', 'ka', 'ro', 'ka', 'ka', 'vi', 'ko']

##### 26.

◑ Download some text from a language that has vowel harmony (e.g. Hungarian), extract the vowel sequences of words, and create a vowel bigram table.

In [317]:
url = "https://hu.wikipedia.org/wiki/Google_keres%C5%91"
H = return_URL_contents(url)

In [322]:
rawH = re.findall(r'\b[\S]+\b', H)
rawH = sorted(set(rawH))

In [334]:
rawH

['0,85',
 '0-19-279735-2',
 '1',
 '1,3',
 '1-d',
 '1.1',
 '1.2',
 '1.3',
 '1.4',
 '10',
 '10,4',
 '100',
 '1000',
 '11',
 '11:48',
 '12',
 '12-i',
 '13',
 '14',
 '15',
 '16',
 '18',
 '1913',
 '1981',
 '1998-ban',
 '2',
 '20',
 '200',
 '2001',
 '2003',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2016',
 '2018',
 '2019',
 '22',
 '24',
 '25',
 '26',
 '27',
 '27-én',
 '28',
 '3',
 '3.0',
 '32',
 '3D',
 '4',
 '4726597-8',
 '5',
 '6',
 '7',
 '8',
 '8,05',
 '9',
 '9.1',
 '9.2',
 'A',
 'Aardvark',
 'Accelerator',
 'Account',
 'Ad',
 'AdMob',
 'AdSense',
 'AdWords',
 'AdWords-hirdetések',
 'AdWords-höz',
 'AdWords-öt',
 'Adatvédelmi',
 'Adományok',
 'Ads',
 'Adscape',
 'Adsense',
 'Advanced',
 'Advertising',
 'Afrikaans',
 'Ahogy',
 'Albums',
 'Alemannisch',
 'Amennyiben',
 'Analysis',
 'Analytics',
 'Android',
 'Angolul',
 'Annak',
 'Answers',
 'App',
 'Apps',
 'Art',
 'Asturianu',
 'Az',
 'Azon',
 'BNF',
 'Bahasa',
 'Base',
 'Bejelentkezés',
 'BigTable',
 'Birth',
 'Blog',
 'Blogger

In [339]:
test = ['Közreműködések', 'Keresőkulcsszavak', 'Kezdőlap']

In [342]:
for rh in test:
    print(re.findall(r'\S+([ö])\S+([ö])\S+', rh, re.UNICODE))

[]
[]
[]
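
*The first test word contains two `ö` characters, so an empty result for it is surprising. One possible cause, and this is only an assumption that nothing in the notebook confirms, is that the text and the pattern use different Unicode normalization forms: a precomposed `ö` (U+00F6) in the pattern will not match an `o` followed by a combining diaeresis in the text. The sketch below shows the effect and how `unicodedata.normalize()` removes it.*

```python
import re
import unicodedata

# Precomposed ö (U+00F6), written as an escape so its form is unambiguous.
O_UMLAUT = "\u00f6"
pattern = re.compile(r"\S+(" + O_UMLAUT + r")\S+(" + O_UMLAUT + r")\S+")

word = "K\u00f6zrem\u0171k\u00f6d\u00e9sek"        # 'Közreműködések', precomposed
decomposed = unicodedata.normalize("NFD", word)   # base letters + combining marks
recomposed = unicodedata.normalize("NFC", decomposed)

print(pattern.findall(decomposed))   # [] because there is no precomposed ö
print(pattern.findall(recomposed))   # one match containing both ö characters
```

*If this is the cause, normalizing the downloaded text to NFC before applying any of the vowel patterns should be enough.*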


In [323]:
# iterate over the words in rawH; re.findall() expects a string, not a list
vh = [v for w in rawH for v in re.findall(r'[eéiíöőüűaáoóuú][eéiíöőüűaáoóuú]', w.lower())]

In [315]:
vh

In [314]:
cfd = nltk.ConditionalFreqDist(vh)
cfd.tabulate()
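
*The exercise asks for the vowel sequence of each word, which is not quite the same as pairs of vowels that sit next to each other in the spelling. A minimal sketch of that reading follows: keep only the vowels of a word, then pair each vowel with the next one. The helper `vowel_bigrams` is a hypothetical name, the three words are the test words from earlier in this exercise rather than real downloaded text, and `collections.Counter` is used so that the example runs without NLTK.*

```python
import re
import unicodedata
from collections import Counter

# Hungarian vowels, normalized so that accented letters are single characters.
VOWELS = unicodedata.normalize("NFC", "aáeéiíoóöőuúüű")

def vowel_bigrams(word):
    """Return pairs of consecutive vowels from the vowel sequence of word."""
    word = unicodedata.normalize("NFC", word.lower())
    vowels = re.findall("[" + VOWELS + "]", word)
    return list(zip(vowels, vowels[1:]))

words = ["Közreműködések", "Keresőkulcsszavak", "Kezdőlap"]
counts = Counter(bg for w in words for bg in vowel_bigrams(w))

# Print a small table: rows are first vowels, columns are second vowels.
firsts = sorted({a for a, _ in counts})
seconds = sorted({b for _, b in counts})
print("  " + " ".join(seconds))
for a in firsts:
    print(a + " " + " ".join(str(counts[(a, b)]) for b in seconds))
```

*The same list of pairs could also be passed to `nltk.ConditionalFreqDist`, which accepts (condition, sample) tuples, to get the tabulated output used elsewhere in the chapter.*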