# [3. Processing Raw Text](https://www.nltk.org/book/ch03.html)

Run the cell below before running any other code.

In [None]:
import nltk, re, pprint
from nltk import word_tokenize

## 3.1 - Accessing Text from the Web and from Disk

### Electronic Books

* **Note:** decoding with `utf-8-sig` rather than `utf-8` is more correct for this web document: it strips the leading byte order mark (`\ufeff`) that would otherwise appear at the start of `raw`
* [Forum Post on codecs](https://stackoverflow.com/questions/17912307/u-ufeff-in-python-string)
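The difference between the two codecs can be seen on a short byte string. This is a minimal sketch; the `data` bytes are an invented sample, not the actual Gutenberg file.

```python
# Invented sample: UTF-8 bytes that begin with a byte order mark (BOM).
data = b'\xef\xbb\xbfThe Project Gutenberg EBook'

plain = data.decode('utf-8')      # the BOM survives as the character '\ufeff'
clean = data.decode('utf-8-sig')  # the BOM is stripped during decoding

print(repr(plain[:4]))
print(repr(clean[:4]))
print(len(plain) - len(clean))    # the two strings differ by one character
```

If a file has no BOM, `utf-8-sig` decodes it exactly as `utf-8` would, so it is a safe default for reading.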

In [None]:
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf-8-sig')
type(raw)

In [None]:
len(raw)

In [None]:
raw[:75]

In [None]:
tokens = word_tokenize(raw)
type(tokens)

In [None]:
len(tokens)

In [None]:
tokens[:10]

In [None]:
text = nltk.Text(tokens)

In [None]:
type(text)

In [None]:
# updated to match 'utf-8-sig' encoding
text[1019:1059]

In [None]:
text.collocation_list()

In [None]:
raw.find("PART I")

In [None]:
raw.rfind("End of Project Gutenberg’s Crime")

In [None]:
raw = raw[5335:1157811]

In [None]:
raw.find("PART I")
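The hard-coded offsets above only hold for one exact copy of the file. A sketch of the same trimming driven by the markers themselves might look like the following; `trim_between` and the `sample` string are hypothetical names introduced here for illustration.

```python
def trim_between(text, start_marker, end_marker):
    """Return the slice of text from the first start_marker up to the last end_marker."""
    start = text.find(start_marker)   # first occurrence, -1 if absent
    end = text.rfind(end_marker)      # last occurrence, -1 if absent
    if start == -1 or end == -1 or end < start:
        raise ValueError("marker not found, or markers out of order")
    return text[start:end]

# Invented stand-in for the downloaded book text.
sample = (
    "Header and license text\n"
    "PART I\n"
    "On an exceptionally hot evening...\n"
    "End of Project Gutenberg's Crime and Punishment\n"
    "footer"
)

body = trim_between(sample, "PART I", "End of Project Gutenberg")
print(body.find("PART I"))  # the trimmed text now starts at the first marker
```

Because the slice is computed from `find` and `rfind`, it keeps working if the header or footer length changes, provided both markers are still present.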

### Dealing with HTML

* **Note:** BeautifulSoup (`bs4`) is a Python library that is bundled with Anaconda and preinstalled in Google Colab

In [None]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')
html[:60]

In [None]:
print(html)

In [None]:
from bs4 import BeautifulSoup
raw = BeautifulSoup(html, 'html.parser').get_text()
tokens = word_tokenize(raw)

In [None]:
tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance('gene')

### Processing Search Engine Results

**Note:** Search engine scraping is a moving target. In recent years Google has deliberately made its result pages hard to scrape, both to keep bot traffic from affecting its search algorithm and to protect the value of its [analytics services](https://developers.google.com/custom-search/v1/overview). Scraping Google results can also breach its [terms of service](https://policies.google.com/terms). Use the official [Custom Search JSON API](https://developers.google.com/custom-search/v1/overview) instead, which provides 100 free search queries per day.

* [Search engine scraping](https://en.wikipedia.org/wiki/Search_engine_scraping)
* [Google terms of service](https://policies.google.com/terms)
* [Google Search API](https://developers.google.com/custom-search/v1/overview)
    * 100 free searches a day
* [Bing Search API](https://azure.microsoft.com/en-us/services/cognitive-services/bing-web-search-api/)
    * 3 searches per second, 1,000 searches per month
* [DuckDuckGo Instant Answer API](https://api.duckduckgo.com/api?t=h_)
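As a rough illustration of the API route, the sketch below only builds a request URL for the Custom Search JSON API and does not send it. The `build_search_url` helper is a hypothetical name, and `YOUR_API_KEY` and `YOUR_ENGINE_ID` are placeholders that you would replace with credentials from your own Google account.

```python
from urllib.parse import urlencode

# Endpoint of the Custom Search JSON API.
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def build_search_url(query, api_key, engine_id):
    """Build a request URL; key, cx and q are the API's query parameters."""
    params = {"key": api_key, "cx": engine_id, "q": query}
    return API_ENDPOINT + "?" + urlencode(params)

# Placeholder credentials: these values will not authenticate.
url_example = build_search_url('"the of"', "YOUR_API_KEY", "YOUR_ENGINE_ID")
print(url_example)
```

`urlencode` takes care of escaping the quotation marks and the space in the query, so the phrase search arrives at the server intact.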

### Processing RSS Feeds

Run the `pip install` command below to install the feedparser library.

In [None]:
!pip install feedparser

* **Note:** This RSS feed changes over time, so the "He's my BF" entry from the book is no longer available. The code below still works for any entry in the feed.

In [None]:
import feedparser
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
llog['feed']['title']

In [None]:
len(llog.entries)

In [None]:
post = llog.entries[4]
post.title

In [None]:
content = post.content[0].value

In [None]:
content[:70]

In [None]:
raw = BeautifulSoup(content, 'html.parser').get_text()

In [None]:
word_tokenize(raw)

In [None]:
[l.title for l in llog.entries]

### Reading Local Files

**Note:** 
* It's important to close your document with the `.close()` command after opening a file.
* If you are in Google Colab, click the folder icon, right-click in the Files menu, and select `New File` to create a txt file
    * You can also create a folder by right-clicking in the Files menu. To run the code below, create a folder named `data` and move your `example.txt` file into it
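One way to make sure a file is always closed is a `with` block, which closes it automatically even if an error occurs while reading. The sketch below writes its own scratch file so it does not depend on `data/example.txt`; the `demo_path` and `raw_example` names are introduced here for illustration only.

```python
import os
import tempfile

# Scratch file in a temporary directory, created just for this example.
demo_path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(demo_path, 'w', encoding='utf-8') as out:
    out.write('Time flies like an arrow.\nFruit flies like a banana.\n')

# The file is closed as soon as the with block ends; no f.close() is needed.
with open(demo_path, encoding='utf-8') as f:
    raw_example = f.read()

print(f.closed)
print(raw_example)
```

The explicit `open` / `close` pairs used in the cells below do the same job, but they leave the file open if an exception is raised between the two calls.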

In [None]:
f = open('data/example.txt')
raw = f.read()
f.close()

raw

In [None]:
f.close()

**Your Turn:** Create a file called document.txt using a text editor, type in a few lines of text, and save it as plain text. Next, in the Python interpreter, open the file using `f = open('document.txt')`, then inspect its contents using `print(f.read())`.

In [None]:
f = open('document.txt')
print(f.read())
f.close()

In [None]:
import os
os.listdir('.')

In [None]:
f = open('data/example.txt', 'r')
for line in f:
    print(line.strip())
f.close()

In [None]:
path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')

file = open(path, 'r')
raw = file.read()
file.close()

raw

### Capturing User Input

In [None]:
s = input("Enter some text: ")

print("You typed", len(word_tokenize(s)), "words.")

### The NLP Pipeline

In [None]:
file = open('data/example.txt')

raw = file.read()

file.close()

type(raw)

In [None]:
tokens = word_tokenize(raw)
type(tokens)

In [None]:
words = [w.lower() for w in tokens]
type(words)

In [None]:
vocab = sorted(set(words))
type(vocab)

In [None]:
vocab.append('blog')  # works: vocab is a list
raw.append('blog')    # raises AttributeError on purpose: strings have no append method

In [None]:
query = 'Who knows?'
beatles = ['john', 'paul', 'george', 'ringo']
query + beatles  # raises TypeError on purpose: a string and a list cannot be concatenated
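The two failing cells above show that strings and lists support different operations. A small sketch of catching those errors, and of converting explicitly so the types match, could look like this; the `as_list` and `as_string` names are introduced here for illustration.

```python
query = 'Who knows?'
beatles = ['john', 'paul', 'george', 'ringo']

# Mixing a string and a list with + fails.
try:
    query + beatles
except TypeError as err:
    print('TypeError:', err)

# Strings are immutable and have no append method.
try:
    query.append('blog')
except AttributeError as err:
    print('AttributeError:', err)

# Convert explicitly so both operands have the same type.
as_list = query.split() + beatles                 # list + list
as_string = query + ' ' + ' '.join(beatles)       # str + str

print(as_list)
print(as_string)
```

Which conversion is right depends on the pipeline stage: tokens and vocabularies are lists, while raw text is a string.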

## 3.2 - Strings: Text Processing at the Lowest Level

## Your Turn Solutions

### 3 - Processing Search Engine Results

* Use [Google Search Parameters](https://moz.com/blog/the-ultimate-guide-to-the-google-search-parameters) to build a URL for the search query
* The [Custom Search API](https://developers.google.com/custom-search/v1/overview) can also be used for Google searches; it is limited to 100 free queries per day

In [None]:
url = "https://duckduckgo.com/?q=%22the+of%22&t=h_&ia=web"
html = request.urlopen(url).read().decode('utf8')

In [None]:
soup = BeautifulSoup(html, 'html.parser')

In [None]:
soup