
#Notebook 3. Low-level Text Processing
---
##Learning Outcomes
- Word tokenization
- Constructing NLTK text object
- Clean text from HTML
- Searching a string
- Noisy text from the web
- Using NLTK to process extracted text
- Note on search results
- Process rss feeds
- File local operations
- Simple string operations
- Processing PDFs
- Character distribution
- Finding position & context of a word
- Text encodings: unicode, code points, ordinal, hexadecimal, escape sequence



## Libraries Required

In [None]:
import nltk
from nltk import word_tokenize
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> book
    Downloading collection 'book'
       | 
       | Downloading package abc to /root/nltk_data...
       |   Unzipping corpora/abc.zip.
       | Downloading package brown to /root/nltk_data...
       |   Unzipping corpora/brown.zip.
       | Downloading package chat80 to /root/nltk_data...
       |   Unzipping corpora/chat80.zip.
       | Downloading package cmudict to /root/nltk_data...
       |   Unzipping corpora/cmudict.zip.
       | Downloading package conll2000 to /root/nltk_data...
       |   Unzipping corpora/conll2000.zip.
       | Downloading package conll2002 to /root/nltk_data...
       |   Unzipping corpora/conll2002.zip.
       | Downloading package dependency_

True

In [None]:
# If you prefer Conda to pip, you can install either the full Anaconda distribution or the lightweight Miniconda.
# #installing full Anaconda
# !wget https://repo.anaconda.com/archive/Anaconda3-5.2.0-Linux-x86_64.sh && bash Anaconda3-5.2.0-Linux-x86_64.sh -bfp /usr/local
# # To make Python find the modules run:
# import sys
# sys.path.append('/usr/local/lib/python3.6/site-packages')

# # Installing Miniconda
# !wget https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh && bash Miniconda3-4.5.4-Linux-x86_64.sh -bfp /usr/local
# import sys
# sys.path.append('/usr/local/lib/python3.6/site-packages')

## Text from the Web

In [None]:
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode("utf-8")
print(type(raw))

<class 'str'>


In [None]:
num_of_chars = len(raw)
print("number of chars in the retrieved text:", num_of_chars)
print(raw[:75])

number of chars in the retrieved text: 1176967
﻿The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky
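The excerpt printed above appears to begin with an invisible byte-order mark (U+FEFF), which is what plain `"utf-8"` decoding leaves behind when a file starts with a BOM. A minimal sketch of the difference between the `utf-8` and `utf-8-sig` codecs, using a short hypothetical byte string in place of the downloaded file:

```python
# Hypothetical stand-in for the first bytes of a UTF-8 file that starts with a BOM
data = b"\xef\xbb\xbfThe Project Gutenberg EBook"

plain = data.decode("utf-8")         # keeps the BOM as the first character
stripped = data.decode("utf-8-sig")  # drops a leading BOM if one is present

print(repr(plain[0]))               # '\ufeff'
print(stripped[:3])                 # The
print(len(plain) - len(stripped))   # 1
```

If the extra character matters for later processing, decoding the response with `"utf-8-sig"` is one way to avoid it.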


### Tokenization

In [None]:
tokens = word_tokenize(raw)
print("Type of tokens is: "+str(type(tokens)))
num_of_tokens = len(tokens)
print("number of words in the retrieved text is", num_of_tokens)
average_token_length = num_of_chars/num_of_tokens
print("average token length in the text is", average_token_length)
print(tokens[1:10])

Type of tokens is: <class 'list'>
number of words in the retrieved text is 257727
average token length in the text is 4.566719823689408
['Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
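The average above divides the character count of the whole text (whitespace included) by the number of tokens (punctuation included). A sketch of an alternative that averages only over tokens containing a letter or digit, shown on a small hypothetical token list rather than the full novel:

```python
# Hypothetical sample; in the notebook this would be the list returned by word_tokenize
tokens_sample = ["The", "Project", "Gutenberg", "EBook", "of", "Crime", ",", "by"]

# Keep tokens that contain at least one alphanumeric character (drops ',' here)
word_tokens = [t for t in tokens_sample if any(ch.isalnum() for ch in t)]

avg = sum(len(t) for t in word_tokens) / len(word_tokens)
print(len(word_tokens), round(avg, 2))  # 7 4.71
```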


###Constructing an NLTK Text Object

In [None]:
text = nltk.Text(tokens)
print(type(text))
print(text[1024:1062])
print(text.collocations()) # collocations() prints word pairs that frequently occur together (e.g., 'Barack Obama') and returns None, hence the trailing None in the output

<class 'nltk.text.Text'>
['an', 'exceptionally', 'hot', 'evening', 'early', 'in', 'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in', 'which', 'he', 'lodged', 'in', 'S.', 'Place', 'and', 'walked', 'slowly', ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K.', 'bridge', '.', 'He', 'had', 'successfully']
Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; young man; Nikodim Fomitch; Ilya Petrovitch; Project
Gutenberg; Andrey Semyonovitch; Hay Market; Dmitri Prokofitch; Good
heavens
None
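To give a feel for what a collocation finder starts from, here is a rough sketch that simply counts adjacent word pairs. This is a deliberately simplified stand-in, not NLTK's method: NLTK's own ranking is more sophisticated than raw counts. The function name `top_bigrams` and the sample sentence are hypothetical.

```python
from collections import Counter

def top_bigrams(tokens, n=3):
    """Return the n most frequent adjacent word pairs (raw counts only)."""
    # Lowercase and drop non-alphabetic tokens; note that this pairs words
    # across removed punctuation, which a real finder would avoid.
    words = [t.lower() for t in tokens if t.isalpha()]
    return Counter(zip(words, words[1:])).most_common(n)

sample = "the old woman saw the old woman and the young man".split()
print(top_bigrams(sample, 2))
```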


###Search Strings

In [None]:
raw.find("Dmitri Prokofitch") # searches from the beginning

461252

In [None]:
raw.rfind("Dmitri Prokofitch") # searches from the end

1087973
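`find` and `rfind` return only the first and last positions. A small sketch of collecting every position by calling `find` repeatedly with a moving start offset; the helper name `find_all` and the sample string are hypothetical:

```python
def find_all(text, pattern):
    """Return the start index of every non-overlapping occurrence of pattern."""
    if not pattern:          # an empty pattern would never advance
        return []
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume the search just past the match that was found
        start = text.find(pattern, start + len(pattern))
    return positions

sample = "Dmitri spoke. Then Dmitri left. No one saw Dmitri again."
print(find_all(sample, "Dmitri"))  # [0, 19, 43]
```

The first and last entries of the result correspond to what `find` and `rfind` return.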

###Noisy Text from the Web
The text you retrieve from the web isn't always as clean as above. 

In [None]:
url = "https://www.cbc.ca/news/canada/british-columbia/6-new-covid-19-infections-in-b-c-as-virus-spreads-inside-care-home-1.5489921"
html = request.urlopen(url).read().decode('utf8')
html[:200]

'<!DOCTYPE html>\n    <html lang="en">\n        <head>\n            <title data-react-helmet="true">6 new COVID-19 infections in B.C. as virus spreads inside care home | CBC News</title>\n            <meta'

### Get Clean Text from HTML

In [None]:
from bs4 import BeautifulSoup
raw = BeautifulSoup(html, 'html.parser').get_text()
tokens = word_tokenize(raw)
tokens[:20]

['6',
 'new',
 'COVID-19',
 'infections',
 'in',
 'B.C',
 '.',
 'as',
 'virus',
 'spreads',
 'inside',
 'care',
 'home',
 '|',
 'CBC',
 'News',
 '!',
 '(',
 'function',
 '(']
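The token list above ends with fragments such as `'function'` and `'('`, which look like inline JavaScript that survived text extraction. One way to avoid this is to skip the contents of `script` and `style` elements. The sketch below uses only the standard library's `html.parser` rather than BeautifulSoup; the class name `VisibleTextParser`, the helper `visible_text`, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside skipped elements
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

page = ("<html><head><title>CBC News</title>"
        "<script>!(function(){})();</script></head>"
        "<body><p>The virus spread.</p></body></html>")
print(visible_text(page))  # CBC News The virus spread.
```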

In [None]:
text = nltk.Text(tokens)
text.concordance("virus")

Displaying 25 of 28 matches:
                                     virus spreads inside care home | CBC News 
 '' ImageObject '' , '' name '' : '' Virus Outbreak California '' , '' descript
 new COVID-19 infections in B.C . as virus spreads inside care home '' , '' pub
 new COVID-19 infections in B.C . as virus spreads inside care home '' , '' art
 new COVID-19 infections in B.C . as virus spreads inside care home '' , '' sha
 new COVID-19 infections in B.C . as virus spreads inside care home '' , '' ori
enry said & nbsp ; on Saturday . The virus spread after a worker contracted the
spread after a worker contracted the virus earlier this week.\u003C\u002Fp\u003
 workers at the centre to ensure the virus has n't spread to other parts of the
nd steps it 's taking to prevent the virus from spreading.\u003C\u002Fp\u003E \
 woman in their 60s — contracted the virus while \u003Ca href=\ '' https : \u00
n B.C . have tested positive for the virus . At least four have recovered. & nb
ame infecte

###A note on Search Engines

NOTE: Unlike text objects, search results are not static: they change over time and also differ by region. No single extraction pattern can be expected to work across the board.

###RSS Feeds

In [None]:
!pip install feedparser # package to parse RSS Feeds



In [None]:
import feedparser
NewsFeed = feedparser.parse("https://www.cbc.ca/cmlink/rss-topstories")
print('Number of RSS posts :', len(NewsFeed.entries))
entry = NewsFeed.entries[1]

entry.keys()

Number of RSS posts : 20


dict_keys(['title', 'title_detail', 'links', 'link', 'id', 'guidislink', 'published', 'published_parsed', 'authors', 'author', 'tags', 'summary', 'summary_detail'])

In [None]:
print('Post Title :', entry.title)

Post Title : TRC commissioners call on Ottawa to end delays in implementing Calls to Action


In [None]:
print(entry.published) # publication time
print("******")
print("------News Link--------")
print(entry.link)

Fri, 28 May 2021 20:07:30 EDT
******
------News Link--------
https://www.cbc.ca/news/politics/trc-commissioners-calls-to-action-redouble-efforts-1.6051580?cmp=rss


In [None]:
#summary = request.urlopen(entry.link).read().decode('utf8')
summary = entry.summary
raw = BeautifulSoup(summary, 'html.parser').get_text()
raw[:50]

' '

In [None]:
tokens = word_tokenize(raw)
tokens[:20]

[]
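The summary of this entry is effectively empty, so tokenizing it yields an empty list. One possible fallback is to use the first field that actually contains text. The sketch below uses a plain dictionary as a stand-in for the feed entry; the helper name `first_nonempty` and the choice of fallback fields are assumptions.

```python
def first_nonempty(entry, fields=("summary", "title")):
    """Return the first field of entry that holds non-whitespace text, else ''."""
    for field in fields:
        value = (entry.get(field) or "").strip()
        if value:
            return value
    return ""

# Stand-in for the entry shown above: a blank summary and a usable title
entry_like = {
    "summary": " ",
    "title": "TRC commissioners call on Ottawa to end delays in implementing Calls to Action",
}
print(first_nonempty(entry_like))
```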

###Local File Operations

In [None]:
f = open("document.txt") # raises FileNotFoundError because the file does not exist yet
raw = f.read()
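The cell above fails on purpose. When a missing file should not stop the notebook, the error can be caught instead. A minimal sketch, with the helper name `read_text_or_none` being hypothetical and a temporary directory used so that nothing is left behind:

```python
import os
import tempfile

def read_text_or_none(path, encoding="utf-8"):
    """Return the file's contents, or None if the file does not exist."""
    try:
        with open(path, encoding=encoding) as f:
            return f.read()
    except FileNotFoundError:
        return None

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "document.txt")
    before = read_text_or_none(path)          # file not created yet
    with open(path, "w", encoding="utf-8") as f:
        f.write("Time flies like an arrow.\n")
    after = read_text_or_none(path)           # file now exists

print(before)
print(after)
```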

In [None]:
import os
os.listdir(".")

In [None]:
f = open("document.txt", "w")
f.write('Time flies like an arrow.\nFruit flies like a banana.\n')
f.close()

In [None]:
with open("document.txt", "r") as f:
  for line in f:
    print(line.strip())


Time flies like an arrow.
Fruit flies like a banana.


In [None]:
path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
raw = open(path, 'r').read()
len(raw)

1220066

In [None]:
type(raw)

str

In [None]:
raw[:100]

'[Moby Dick by Herman Melville 1851]\n\n\nETYMOLOGY.\n\n(Supplied by a Late Consumptive Usher to a Grammar'

###Processing PDF

ASCII text and HTML are human-readable formats, whereas PDF is a binary format, so its text has to be extracted with a dedicated library.

In [None]:
!pip install msgpack # dependency for slate3k & pypdf2
!pip install slate3k # package to extract text from PDF

Collecting slate3k
  Downloading https://files.pythonhosted.org/packages/cb/e3/f27cac1dd24617894cf7ddb5da13beca27c9236736466bebaf5dd2a902c1/slate3k-0.5.3-py2.py3-none-any.whl
Collecting pdfminer3k
  Downloading https://files.pythonhosted.org/packages/36/15/5ac4faa314c38b335cf4db37fc02dc02c14bf67f7641bea2fa5e5b7d4ff4/pdfminer3k-1.3.4-py3-none-any.whl (100kB)
Collecting ply
  Downloading https://files.pythonhosted.org/packages/a3/58/35da89ee790598a0700ea49b2a66594140f44dec458c07e8e3d4979137fc/ply-3.11-py2.py3-none-any.whl (49kB)
Installing collected packages: ply, pdfminer3k, slate3k
Successfully installed pdfminer3k-1.3.4 ply-3.11 slate3k-0.5.3


####Access a PDF from the Web & Convert it to Text

In [None]:
from urllib import request # request.urlopen is used below
import slate3k as slate
from io import StringIO, BytesIO

url = "https://curve.carleton.ca/system/files/etd/4476fc9c-bfa6-49a0-bf73-8045bf299af8/etd_pdf/4ca4228e73e35eef6d1ee4ad8f308152/amjadian-representationlearningforinformationextraction.pdf"

response = request.urlopen(url)
rawPDF = response.read()
memoryFile = BytesIO(rawPDF) # to create a file in memory to read from

# extract text with slate
document1 = slate.PDF(memoryFile)
print("SLATE:\n", document1[1], "\n")



SLATE:
 Abstract

Distributed representations, predominantly acquired via neural networks, have

been applied to natural

language processing tasks including speech recognition and

machine translation with a success comparable to sophisticated state-of-the-art algo-

rithms. The present thesis oﬀers an investigation of the application of such represen-

tations

to information extraction.

Speciﬁcally,

I explore the suitability of applying

shallow distributed representations to the automatic terminology extraction task, as

well as the bridging reference resolution task.

I created a dataset as a gold standard

for automatic term extraction in the mathematical education domain.

I carefully as-

sessed the performance of the existing terminology extraction methods on this dataset.

Then, I introduce a novel method for automatic terminology extraction for one word

terms, and I evaluate the performance of the novel algorithm in various terminological

domains. The introduced algorith
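The extracted text above shows typical PDF artifacts: ligature characters (as in "oﬀers" and "Speciﬁcally"), words hyphenated across line breaks, and stray blank lines. A rough clean-up sketch using only the standard library; the function name `clean_pdf_text` is hypothetical, and the sample string is assembled from fragments of the output above:

```python
import re
import unicodedata

def clean_pdf_text(text):
    # NFKC folds compatibility characters, including the ligatures
    # U+FB00 (ff) and U+FB01 (fi), into plain letters
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words split by a hyphen at a line end. This also merges genuine
    # hyphenated compounds that happen to break at that point.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse remaining runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

sample = ("state-of-the-art algo-\n\nrithms. The present thesis o\ufb00ers "
          "an investigation. Speci\ufb01cally,\n\nI explore")
print(clean_pdf_text(sample))
# state-of-the-art algorithms. The present thesis offers an investigation. Specifically, I explore
```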

#####For additional ways to process a PDF file see: http://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/

###Character Distribution
Character distribution is a significant feature for language detection.

In [None]:
from nltk.corpus import gutenberg
doc1_text = document1.text()
fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha()) # frequency distribution of the alphabetic chars in raw (still the Moby Dick text loaded above, not doc1_text)
most_common = fdist.most_common(5) # 5 most frequent chars
print(most_common)

[('e', 117092), ('t', 87996), ('a', 77916), ('o', 69326), ('n', 65617)]
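Raw counts like those above depend on the length of the text, so comparing two documents usually calls for relative frequencies. A small sketch with `collections.Counter`; the helper name `char_profile` and the sample sentence are illustrative only:

```python
from collections import Counter

def char_profile(text):
    """Map each alphabetic character (lowercased) to its share of all letters."""
    counts = Counter(ch.lower() for ch in text if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}

profile = char_profile("Time flies like an arrow.")
# Show the three characters with the largest share
print(sorted(profile.items(), key=lambda kv: -kv[1])[:3])
```

Because the shares sum to 1, profiles built from texts of very different lengths can be compared directly.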


In [None]:
doc1_text.find("NLP")

3740

In [None]:
doc1_text[3700:3800]

'piring discussions and conversations in NLP, iv information extraction, word embeddings, and high di'

In [None]:
tokenized_text = word_tokenize(doc1_text)
text = nltk.Text(tokenized_text)
text.concordance("NLP")

Displaying 12 of 12 matches:
ring discussions and conversations in NLP , iv information extraction , word em
tion to all the other tasks in modern NLP . A more concrete consequence of this
 to improve and simplify a variety of NLP tasks ( Collobert and Weston , 2008 ;
ﬁeld of Natural Language Processing ( NLP ) , Automatic Terminology Extrac- tio
ction has many direct applications in NLP , such as information 3.1 . INTRODUCT
 , the rise of distributed methods in NLP , especially the recent word embeddin
owards domain-independent distributed NLP . A new dataset representing four new
ed Bridging Reference Resolution Many NLP systems treat deﬁnite noun phrases as
elf . It has been shown previously in NLP that words can be mapped across langu
egral part of a variety of downstream NLP tasks . We plan to employ our methods
olution : To What Extent Does It Help NLP Applications ? , pages 16–27 . Spring
 2010 Workshop on New Chal lenges for NLP Frameworks , pages 45–50 , Valletta ,
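A concordance view like the one above can be approximated on a plain token list: find each position of the target word and show a few tokens on either side. This is an illustrative sketch, not NLTK's implementation; the function name `kwic`, the `width` parameter, and the sample sentence are hypothetical.

```python
def kwic(tokens, target, width=3):
    """Return (index, left context, token, right context) for each match."""
    target = target.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:   # case-insensitive match
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((i, left, tok, right))
    return lines

sample = "the rise of distributed methods in NLP , especially the recent word embeddings".split()
for index, left, tok, right in kwic(sample, "nlp"):
    print(index, "|", left, "[", tok, "]", right)
```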


##Unicode

In [None]:
# Unicode View
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
f = open(path, encoding='latin2')
for line in f:
  line = line.strip()
  print(line)

Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki
Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych
archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.


In [None]:
# Viewing the Code Points
f = open(path, encoding='latin2')
for line in f:
  line = line.strip()
  print(line.encode('unicode_escape'))

b'Pruska Biblioteka Pa\\u0144stwowa. Jej dawne zbiory znane pod nazw\\u0105'
b'"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez'
b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y'
b'odnalezione po 1945 r. na terytorium Polski. Trafi\\u0142y do Biblioteki'
b'Jagiello\\u0144skiej w Krakowie, obejmuj\\u0105 ponad 500 tys. zabytkowych'
b'archiwali\\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.'


In [None]:
#Ordinal of a char
ord('ń')

324

In [None]:
hex(324)

'0x144'

In [None]:
nacute = '\u0144' # escape sequence to define a string
print(type(nacute), nacute)

<class 'str'> ń


In [None]:
type(nacute.encode("utf-8") )

bytes

In [None]:
nacute

'ń'

In [None]:
# UTF-8 byte sequence, followed by their code point integer using the standard Unicode convention (i.e., prefixing the hex digits with U+), followed by their Unicode name
import unicodedata
c = 'ń'
print("Char: ", c, "\nUTF-8 byte seq:", c.encode("utf8"), "\nCode point integer:", ord(c), "\nUnicode name:", unicodedata.name(c))

Char:  ń 
UTF-8 byte seq: b'\xc5\x84' 
Code point integer: 324 
Unicode name: LATIN SMALL LETTER N WITH ACUTE
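The same character can map to different byte sequences depending on the encoding, and decoding bytes with the wrong codec silently yields a different character. A short sketch contrasting UTF-8 with Latin-2 (the encoding used to read the Polish sample above):

```python
c = "\u0144"  # LATIN SMALL LETTER N WITH ACUTE

utf8_bytes = c.encode("utf-8")     # two bytes in UTF-8
latin2_bytes = c.encode("latin2")  # a single byte in ISO-8859-2

right = latin2_bytes.decode("latin2")  # correct codec restores the character
wrong = latin2_bytes.decode("latin1")  # wrong codec gives a different letter

print(utf8_bytes, latin2_bytes)  # b'\xc5\x84' b'\xf1'
print(right, wrong)
print(f"U+{ord(c):04X}")         # U+0144
```

This is why the `encoding` argument passed to `open` has to match how the file was actually written.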


**Note: the NLTK tokenizer accepts Unicode strings as input and returns Unicode strings as output.**

# Exercise 3
## a. What does document1[1][-1] retrieve?
## b. What does document1[1][-20:-4] retrieve?
## c. What is the type of document1?
## d. Tokenize document1.
## f. Find 2 web pages, each belonging to a different language. Use their char distribution to detect the language they belong to. Hint: you can build a table representing the relative frequency of each char to help with the detection.
## g. Find 2 web pages, each belonging to a different language. Use their encodings to detect their language.
## h. Would it help if you use a combination of the methods suggested in "f" and "g"?