# [3. Processing Raw Text](https://www.nltk.org/book/ch03.html) - Notes

* [NLTK-Book-Resource Repository](https://github.com/BetoBob/NLTK-Book-Resource)
* [NLTK-Book-Resource Table of Contents](https://github.com/BetoBob/NLTK-Book-Resource#table-of-contents)

Run the cell below before running the examples.

In [35]:
import nltk, re, pprint
from nltk import word_tokenize

## 3.1 - Accessing Text from the Web and from Disk

### Electronic Books

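* **Note:** Decoding with `utf-8-sig` instead of `utf-8` is the more appropriate codec for this web document: it strips the byte order mark (`\ufeff`) that would otherwise show up as an extra character at the start of the string.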
* [Forum Post on codecs](https://stackoverflow.com/questions/17912307/u-ufeff-in-python-string)
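
The difference between the two codecs can be seen without downloading anything. The sketch below uses a made-up byte string (not the real Gutenberg file) that starts with a UTF-8 byte order mark, and decodes it both ways.

```python
import codecs

# Hypothetical stand-in for the first bytes of a downloaded file
# that begins with a UTF-8 byte order mark (BOM).
data = codecs.BOM_UTF8 + "The Project Gutenberg EBook".encode("utf-8")

plain = data.decode("utf-8")         # keeps the BOM as the character '\ufeff'
stripped = data.decode("utf-8-sig")  # drops the BOM while decoding

print(repr(plain[:4]))
print(repr(stripped[:4]))
print(len(plain) - len(stripped))
```

Because the BOM counts as one character, every index into the `utf-8` version is shifted by one relative to the `utf-8-sig` version.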

In [None]:
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf-8-sig')
type(raw)

In [None]:
len(raw)

In [None]:
raw[:75]

In [None]:
tokens = word_tokenize(raw)
type(tokens)

In [None]:
len(tokens)

In [None]:
tokens[:10]

In [None]:
text = nltk.Text(tokens)

In [None]:
type(text)

In [None]:
# updated to match 'utf-8-sig' encoding
text[1019:1059]

In [None]:
text.collocation_list()

In [None]:
raw.find("PART I")

In [None]:
raw.rfind("End of Project Gutenberg’s Crime")

In [None]:
raw = raw[5335:1157811]

In [None]:
raw.find("PART I")
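
The slice offsets above are hard-coded, so they only fit one particular version of the downloaded file. As a sketch, the same trimming can be driven by the results of `find()` and `rfind()` directly. The `sample` string below is a made-up stand-in for the real text, and raising `ValueError` on a missing marker is an assumption about how you might want to handle that case.

```python
# Hypothetical miniature of the downloaded text: header, body, footer.
sample = (
    "Produced by a volunteer\n"
    "PART I\n"
    "CHAPTER I\n"
    "On an exceptionally hot evening early in July...\n"
    "End of Project Gutenberg's Crime and Punishment\n"
    "License text follows\n"
)

start = sample.find("PART I")                           # first occurrence
end = sample.rfind("End of Project Gutenberg's Crime")  # last occurrence

# find() and rfind() return -1 when the marker is absent.
if start == -1 or end == -1:
    raise ValueError("start or end marker not found")

body = sample[start:end]
print(body.find("PART I"))
```

After trimming, the start marker sits at index 0 and the footer is gone.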

### Dealing with HTML

* **Note:** BeautifulSoup (`bs4`) is a Python library that comes bundled with Anaconda and Google Colab.

In [None]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')
html[:60]

In [None]:
print(html)

In [None]:
from bs4 import BeautifulSoup
raw = BeautifulSoup(html, 'html.parser').get_text()
tokens = word_tokenize(raw)

In [None]:
tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance('gene')

### Processing Search Engine Results

**Note:** Search engine scraping is a moving target. In recent years Google has deliberately made it difficult to scrape search results, both to keep bot searches from affecting its search algorithm and to protect the value of its [analytics services](https://developers.google.com/custom-search/v1/overview). Scraping data from Google can also breach its [terms of service](https://policies.google.com/terms). Use the official [Custom Search JSON API](https://developers.google.com/custom-search/v1/overview) instead, which provides 100 free search queries per day.

* [Search engine scraping](https://en.wikipedia.org/wiki/Search_engine_scraping)
* [Google Terms of Service](https://policies.google.com/terms)
* [Google Search API](https://developers.google.com/custom-search/v1/overview)
    * 100 free searches a day
* [Bing Search API](https://azure.microsoft.com/en-us/services/cognitive-services/bing-web-search-api/)
    * 3 searches per second, 1,000 searches per month
* [DuckDuckGo Instant Answer API](https://api.duckduckgo.com/api?t=h_)
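
As a rough sketch of what a Custom Search JSON API request looks like, the code below only builds the request URL and sends nothing. The endpoint and the `key`, `cx`, and `q` parameter names are written from memory of the API documentation, so check them against the current docs. `API_KEY`, `ENGINE_ID`, and `build_search_url` are placeholder names invented for this example.

```python
from urllib.parse import urlencode

# Placeholder credentials (assumptions): substitute your own values.
API_KEY = "YOUR_API_KEY"
ENGINE_ID = "YOUR_SEARCH_ENGINE_ID"

def build_search_url(query):
    """Return a request URL for the given query string (nothing is sent)."""
    params = {"key": API_KEY, "cx": ENGINE_ID, "q": query}
    return "https://www.googleapis.com/customsearch/v1?" + urlencode(params)

url = build_search_url("the of")
print(url)
```

With real credentials, the URL could be passed to `request.urlopen()` and the JSON response parsed with the standard `json` module.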

### Processing RSS Feeds

Run the `pip install` command below to download the feedparser library.

In [None]:
!pip install feedparser

* **Note:** This RSS feed changes over time, so the "He's my BF" entry used in the book will no longer be available. The code below still works for whichever entries are currently in the feed.

In [None]:
import feedparser
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
llog['feed']['title']

In [None]:
len(llog.entries)

In [None]:
post = llog.entries[4]
post.title

In [None]:
content = post.content[0].value

In [None]:
content[:70]

In [None]:
raw = BeautifulSoup(content, 'html.parser').get_text()

In [None]:
word_tokenize(raw)

In [None]:
[entry.title for entry in llog.entries]

### Reading Local Files

**Note:**
* Close the file with `.close()` once you have finished reading it.
* In Google Colab, click the folder icon, right-click in the Files pane, and select `New File` to create a text file.
    * You can create a folder the same way: make a folder named `data` and move the `example.txt` file you created into it so that the code below runs.
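
An alternative to calling `.close()` by hand is a `with` block, which closes the file automatically when the block ends, even if an error occurs while reading. The sketch below writes a throwaway file to a temporary directory first so that it runs anywhere. The temporary path and the sample text are assumptions for illustration, not part of the original notes.

```python
import os
import tempfile

# Create a throwaway example file so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'example.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('Time flies like an arrow.\nFruit flies like a banana.\n')

# Read it back; the file is closed as soon as the block exits.
with open(path, encoding='utf-8') as f:
    raw = f.read()

print(f.closed)
print(raw)
```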

In [None]:
f = open('data/example.txt')
raw = f.read()
f.close()

raw

**Your Turn:** Create a file called `document.txt` using a text editor, type in a few lines of text, and save it as plain text. Next, in the Python interpreter, open the file using `f = open('document.txt')`, then inspect its contents using `print(f.read())`.

In [None]:
f = open('document.txt')
print(f.read())
f.close()

In [None]:
import os
os.listdir('.')

In [None]:
f = open('data/example.txt', 'r')
for line in f:
    print(line.strip())
f.close()

In [None]:
path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')

file = open(path, 'r')
raw = file.read()
file.close()

raw

### Capturing User Input

In [None]:
s = input("Enter some text: ")

print("You typed", len(word_tokenize(s)), "words.")

### The NLP Pipeline

In [None]:
file = open('data/example.txt')

raw = file.read()

file.close()

type(raw)

In [None]:
tokens = word_tokenize(raw)
type(tokens)

In [None]:
words = [w.lower() for w in tokens]
type(words)

In [None]:
vocab = sorted(set(words))
type(vocab)

In [None]:
vocab.append('blog')  # works: vocab is a list
raw.append('blog')    # raises AttributeError: strings have no append() method

In [None]:
query = 'Who knows?'
beatles = ['john', 'paul', 'george', 'ringo']
query + beatles  # raises TypeError: a string and a list cannot be concatenated

## 3.2 - Strings: Text Processing at the Lowest Level

### Basic Operations with Strings

In [1]:
monty = 'Monty Python'
monty

'Monty Python'

In [2]:
circus = "Monty Python's Flying Circus"
circus

"Monty Python's Flying Circus"

In [3]:
circus = 'Monty Python's Flying Circus'

SyntaxError: invalid syntax (<ipython-input-3-35ad50ae6ca1>, line 1)

In [4]:
couplet = "Shall I compare thee to a Summer's day?"\
          "Thou are more lovely and more temperate:"

In [5]:
print(couplet)

Shall I compare thee to a Summer's day?Thou are more lovely and more temperate:


In [7]:
couplet = ("Rough winds do shake the darling buds of May,"
           "And Summer's lease hath all too short a date:")

In [8]:
print(couplet)

Rough winds do shake the darling buds of May,And Summer's lease hath all too short a date:


In [11]:
couplet = """Shall I compare thee to a Summer's day?
Thou are more lovely and more temperate:"""

In [12]:
print(couplet)

Shall I compare thee to a Summer's day?
Thou are more lovely and more temperate:


In [13]:
couplet = '''Rough winds do shake the darling buds of May,
And Summer's lease hath all too short a date:'''

In [14]:
print(couplet)

Rough winds do shake the darling buds of May,
And Summer's lease hath all too short a date:


In [15]:
'very' + 'very' + 'very'

'veryveryvery'

In [16]:
'very' * 3

'veryveryvery'

**Your Turn:** Try running the following code, then try to use your understanding of the string `+` and `*` operations to figure out how it works. Be careful to distinguish between the string `' '`, which is a single whitespace character, and `''`, which is the empty string.

In [17]:
a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
b = [' ' * 2 * (7 - i) + 'very' * i for i in a]

for line in b:
    print(line)

            very
          veryvery
        veryveryvery
      veryveryveryvery
    veryveryveryveryvery
  veryveryveryveryveryvery
veryveryveryveryveryveryvery
  veryveryveryveryveryvery
    veryveryveryveryvery
      veryveryveryvery
        veryveryvery
          veryvery
            very
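
To see how one row of the pattern is built, a single value of `i` can be unpacked by hand. This is only a sketch of one way to reason about it, and the names `padding`, `body`, and `row` are introduced here for illustration.

```python
i = 3

# ' ' * 2 * (7 - i) is evaluated left to right: (' ' * 2) * 4 gives 8 spaces.
padding = ' ' * 2 * (7 - i)
body = 'very' * i            # the word repeated i times
row = padding + body

print(len(padding))
print(repr(row))

# At the widest row, i == 7, so the padding is ' ' * 0, the empty string ''.
print(repr(' ' * 2 * (7 - 7)))
```

Each step up in `i` removes two spaces of padding and adds four letters, which is what makes the diamond widen on both sides.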


In [18]:
'very' - 'y'

TypeError: unsupported operand type(s) for -: 'str' and 'str'

In [19]:
'very' / 2

TypeError: unsupported operand type(s) for /: 'str' and 'int'

### Printing Strings

In [20]:
monty = 'Monty Python'
print(monty)

Monty Python


In [21]:
grail = 'Holy Grail'
print(monty + grail)

Monty PythonHoly Grail


In [22]:
print(monty, grail)

Monty Python Holy Grail


In [23]:
print(monty, "and the", grail)

Monty Python and the Holy Grail


### Accessing Individual Characters

In [27]:
monty = 'Monty Python'

In [24]:
monty[0]

'M'

In [25]:
monty[3]

't'

In [28]:
monty[20]

IndexError: string index out of range

In [29]:
monty[-1]

'n'

In [30]:
monty[5]

' '

In [31]:
monty[-7]

' '

In [33]:
sent = 'colorless green ideas sleep furiously'

for char in sent:
    print(char, end=' ')

c o l o r l e s s   g r e e n   i d e a s   s l e e p   f u r i o u s l y 

In [36]:
from nltk.corpus import gutenberg
raw = gutenberg.raw('melville-moby_dick.txt')
fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())

fdist.most_common(5)

[('e', 117092), ('t', 87996), ('a', 77916), ('o', 69326), ('n', 65617)]

In [37]:
[char for (char, count) in fdist.most_common()]

['e',
 't',
 'a',
 'o',
 'n',
 'i',
 's',
 'h',
 'r',
 'l',
 'd',
 'u',
 'm',
 'c',
 'w',
 'f',
 'g',
 'p',
 'b',
 'y',
 'v',
 'k',
 'q',
 'j',
 'x',
 'z']
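
In NLTK 3, `FreqDist` is built on Python's `collections.Counter`, so the same letter count can be sketched with the standard library alone. The example below uses a short sentence instead of the Moby Dick text, so its counts are not comparable to the output above.

```python
from collections import Counter

sent = 'colorless green ideas sleep furiously'

# Count lowercase letters only, skipping spaces, as in the FreqDist cell.
counts = Counter(ch.lower() for ch in sent if ch.isalpha())

print(counts.most_common(2))
print(counts['s'])
```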

### Accessing Substrings

In [38]:
monty = 'Monty Python'

In [39]:
monty[6:10]

'Pyth'

In [40]:
monty[-12:-7]

'Monty'

In [41]:
monty[:5]

'Monty'

In [42]:
monty[6:]

'Python'

In [43]:
phrase = 'And now for something completely different'

In [44]:
if 'thing' in phrase:
    print('found "thing"')

found "thing"


In [45]:
monty.find('Python')

6

**Your Turn:** Make up a sentence and assign it to a variable, e.g. `sent = 'my sentence...'`. Now write slice expressions to pull out individual words. (This is obviously not a convenient way to process the words of a text!)
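
One possible way to approach this exercise, reusing the sentence from earlier in the section, is sketched below. The hand-counted offsets are exactly what makes the method inconvenient, and `find()` can at least supply the starting index.

```python
sent = 'colorless green ideas sleep furiously'

first = sent[:9]      # characters 0 to 8
second = sent[10:15]  # skip the space at index 9
last = sent[-9:]      # the final nine characters

# Locate a word with find() instead of counting by hand.
start = sent.find('ideas')
third = sent[start:start + len('ideas')]

print(first, second, third, last)
```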

### The Difference between Lists and Strings

In [46]:
query = 'Who knows?'
beatles = ['John', 'Paul', 'George', 'Ringo']

In [47]:
query[2]

'o'

In [48]:
beatles[2]

'George'

In [49]:
query[:2]

'Wh'

In [50]:
beatles[:2]

['John', 'Paul']

In [51]:
query + " I don't"

"Who knows? I don't"

In [52]:
beatles + 'Brian'

TypeError: can only concatenate list (not "str") to list

In [53]:
beatles + ['Brian']

['John', 'Paul', 'George', 'Ringo', 'Brian']

In [54]:
beatles[0] = "John Lennon"

In [55]:
del beatles[-1]

In [56]:
beatles

['John Lennon', 'Paul', 'George']

In [57]:
query[0] = 'F'

TypeError: 'str' object does not support item assignment
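
Since strings cannot be changed in place, the usual workaround is to build a new string from slices of the old one. A minimal sketch, where the name `new_query` is introduced for illustration:

```python
query = 'Who knows?'

# Build a new string; the original is left untouched.
new_query = 'F' + query[1:]

print(new_query)
print(query)
```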

## Your Turn Solutions