<center> <h1> Natural Language Processing with Python </h1> </center>
<center> <h2> Processing Raw Text  </h2> </center> 
<center> <img height="300" src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2017/09/Natural-Language-Processing-with-Python.jpg"> </center>

## Goals

The goal of this chapter is to answer the questions: 


- How can we write programs to access texts?

- How can we process those texts?

- How can we write programs to produce formatted output and save it?
 

_______

## Some Basics in Python


In [None]:
example = "This is an example string!"

"""
We can access individual characters of a string with [index].
Remember: in Python, negative index values are also possible!
"""

print(example[0])
print(example[1])

"""
Two strings can be concatenated with "+".
"""

example2 = "This is another example of a string!"
print(example + example2)

"""
We can also access substrings by using square brackets.
The first value determines the start of the substring, the second value the end (exclusive), and the last value the step length: [start:end:step_length]
"""

print(example[:10])
print(example[10:])
print(example[::2])
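
The negative index values mentioned in the cell above count from the end of the string. A minimal sketch, reusing the same `example` string (the names `last_char`, `last_word` and `reversed_text` are only illustrative):

```python
example = "This is an example string!"

# Index -1 is the last character, -2 the one before it, and so on.
last_char = example[-1]

# A negative start counts from the end: here the final seven characters.
last_word = example[-7:]

# A negative step walks through the string backwards.
reversed_text = example[::-1]

print(last_char)
print(last_word)
print(reversed_text)
```

The same `[start:end:step]` notation works for lists, which becomes useful once a text has been split into tokens.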

<h3 style ="color: red" > Tasks: </h3>

- Define a string s = 'The Godfther'. Write a statement that changes this to "The Godfather". You can only use concatenation and slicing.  

- What happens if we try to access the 13th element of the string s? Why?


In [None]:
# your code here: 

________

## Accessing Text

In this section we discuss three ways of accessing text:

- Reading local text files (.txt)
- Reading binary formats such as PDF
- Fetching pages from the Web

#### Accessing Text from the Local File System

In [None]:
def get_text_from_txt(path):
    # The with-statement closes the file even if reading fails.
    with open(path) as f:
        raw = f.read()
    return raw

example1 = get_text_from_txt("./example.txt")
print(example1[:100])

#### Accessing Text from Binary Formats

For more information, see the <a href="https://github.com/jsvine/pdfplumber">PDFplumber</a> documentation.

In [None]:
import pdfplumber

def get_text_from_pdf(path):
    with pdfplumber.open(path) as pdf:
        first_page = pdf.pages[0]
        print(len(pdf.pages))
        print(first_page.extract_text()[:100])

get_text_from_pdf("./example.pdf")

#### Accessing Text from the Web


In [None]:
import ssl
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context
    
from urllib.request import urlopen

def get_text_from_url(url):
    raw = urlopen(url).read() 
    return raw

url = "https://en.wikipedia.org/wiki/The_Godfather"
raw = get_text_from_url(url)

print(type(raw))
print(len(raw))
print("Content of the Website: \n" ,raw[:100])
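
`urlopen(url).read()` returns a `bytes` object rather than a `str`, which is why the cell above reports bytes as the type of `raw`. The following sketch shows the decoding step on a hard-coded byte string, so it runs without network access. `sample_bytes` is a made-up stand-in for a real response, and UTF-8 is an assumption about the page encoding.

```python
# Hypothetical stand-in for the bytes returned by urlopen(url).read().
sample_bytes = b"<title>The Godfather \xe2\x80\x93 Wikipedia</title>"

# Decoding turns the bytes into text; UTF-8 is an assumption here.
sample_text = sample_bytes.decode("utf-8")

# The dash takes three bytes but is a single character,
# so the two lengths differ.
print(type(sample_bytes), len(sample_bytes))
print(type(sample_text), len(sample_text))
print(sample_text)
```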

When we access text from the Web, we receive the raw HTML of the page, including all of its tags and markup.

<a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">Beautiful Soup</a> provides helper functions for pulling the text out of the tags.

In [None]:
from bs4 import BeautifulSoup

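# Name the parser explicitly; otherwise Beautiful Soup guesses one and emits a warning.
soup = BeautifulSoup(raw, "html.parser")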

soup.get_text()[:100]
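
To make concrete what pulling the text out of the tags means, here is a rough sketch that uses only the standard-library `html.parser` module instead of Beautiful Soup. It is a swapped-in illustration, not a description of how Beautiful Soup works internally. `TextExtractor`, `strip_tags` and `sample_html` are hypothetical names, and the sketch makes no attempt to skip `<script>` or `<style>` content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for every piece of text outside a tag.
        self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Made-up HTML snippet, used instead of a downloaded page.
sample_html = "<h1 id='firstHeading'>The Godfather</h1>\n<p>An <b>example</b> paragraph.</p>"
plain = strip_tags(sample_html)
print(plain)
```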

In [None]:
text = soup.find(id="firstHeading")

In [None]:
text = soup.find_all("p") 
print(len(text))
print(text[1].get_text())

<h3 style ="color: red" > Tasks: </h3>

- Write code that gets the current temperature in Berlin.

(Hint: use this link: https://www.bbc.com/weather/2950159. The relevant element can be identified by its class="wr-value--temperature--c".)

 

In [None]:
# your code here 

_____________________

## Processing Text

In this section we cover:

- how we can deal with different languages

- the use of regular expressions for stemming 



### Text Processing with Unicode

Unicode supports over a million characters.

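Each character is assigned a number called a code point. In Python, a code point is written in the form \uXXXX, where XXXX is the number as four hexadecimal digits.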
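
A code point can be inspected directly in Python. The sketch below takes the single character `ø` and shows its code point, its escaped form, its UTF-8 byte sequence and its official Unicode name. The variable names are illustrative only.

```python
import unicodedata

# "\u00f8" is the escape for the character ø (code point 0xF8).
ch = "\u00f8"

code_point = ord(ch)                     # integer value of the code point
escaped = ch.encode("unicode_escape")    # ASCII-only escaped representation
utf8_bytes = ch.encode("utf-8")          # how the character is stored in a UTF-8 file
name = unicodedata.name(ch)              # official Unicode character name

print(ch, code_point, hex(code_point))
print(escaped)
print(utf8_bytes)
print(name)
```

Decoding the bytes with the same encoding, `utf8_bytes.decode("utf-8")`, gives back the original character.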

In [None]:
import codecs 
import unicodedata

line = codecs.open("./example2.txt", encoding="utf-8").readlines()[0]

print(line.encode("unicode_escape") )

for c in line:
    if(ord(c) > 127):
        print(c, c.encode("unicode_escape") , ord(c) ,unicodedata.name(c))
 

In [None]:
line = line.replace('ø' , "o|")
line = line.replace("å" , "a_") 
print(line.encode("GB2312"))


<img src="https://www.nltk.org/images/unicode.png" width="90%" height="400"/>

__________________

### Regular Expressions

In NLP, many tasks involve pattern matching. Regular expressions give us a powerful and flexible method for describing the patterns we are looking for.


|Operator |Behavior     |
|-------------|-------------|
| .     | Wildcard, matches any character|
| ^abc      | Matches some pattern abc at the start of a string    | 
| abc$ | Matches some pattern abc at the end of a string     | 
| \[abc\]      | Matches one of a set of characters|
| \[A-Z\]      | Matches one of a range of characters  | 
| ed\|es | Matches one of the specified strings (disjunction)  | 
| *      | Zero or more of previous item, e.g. a\*, \[a-z\]\* (also known as Kleene Closure)|
| +      | One or more of previous item, e.g. a+, \[a-z\]+    | 
| ? | Zero or one of the previous item (i.e. optional), e.g. a?, \[a-z\]?   | 
| {n}      | Exactly n repeats where n is a non-negative integer|
| {n,}      | At least n repeats | 
| {,n} | No more than n repeats   | 
| {m,n}      | At least m and no more than n repeats|
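
A few of the operators from the table, applied to a small hand-made word list. This is only a sketch: the list `sample_words` and the result names are invented for illustration.

```python
import re

sample_words = ["walked", "boxes", "hotel", "ab", "aab", "b",
                "color", "colour", "19", "2024"]


def matching(pattern):
    # Keep the words in which the pattern is found somewhere.
    return [w for w in sample_words if re.search(pattern, w)]


ends_ed_or_es = matching(r"(ed|es)$")   # disjunction anchored at the end
starts_ho = matching(r"^ho")            # anchored at the start
second_is_o = matching(r"^.o")          # wildcard followed by "o"
one_or_more_a = matching(r"^a+b$")      # "+" needs at least one "a"
zero_or_more_a = matching(r"^a*b$")     # "*" also accepts no "a" at all
optional_u = matching(r"^colou?r$")     # "?" makes the "u" optional
two_digits = matching(r"^[0-9]{2}$")    # exactly two characters from a range

print(ends_ed_or_es, starts_ho, second_is_o)
print(one_or_more_a, zero_or_more_a)
print(optional_u, two_digits)
```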

In [None]:
import re

res = re.search(r"ed","abaiedsse")

print(res)
print(res.start())
print(res.end())
print(res.string) 

In [None]:
import nltk

wordlist_en = [w.lower() for w in nltk.corpus.words.words("en")]

list_ =  [w for w in wordlist_en if re.search(r"^ho" , w)]
print(len(list_))
list_[:10]

<h3 style ="color: red" > Tasks: </h3>

- Extract all numbers with a length of 4 from the WSJ sample.

- Assuming these numbers are valid years, what is the ratio of years from the 1980s?

In [None]:
wordlist_wsj = nltk.corpus.treebank.words()
# your code here

### More Useful Functions

In [None]:
word = 'supercalifragilisticexpialidocious'
a = re.findall(r'[aeiou]', word)
a[:5]

In [None]:
word = 'Today is the 25th of July. It is a beautiful day. My mom said: "dont run so fast!". \'Why?\' did I ask.'
a = re.sub(r'[0-9.!?"*#\']', "", word)
a

### Stemming

In [None]:
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

In [None]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

In [None]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

In [None]:
def stem(word):
    # The trailing "?" makes the suffix optional, so words without a listed suffix still match.
    stem, suff = re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$' , word)[0]
    return stem, suff
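
On 'processing' the greedy `(.*)` and the non-greedy `(.*?)` patterns give the same split, because only one suffix fits. The difference shows up on a word where several suffixes compete. This sketch reuses the suffix list from the cells above; the names `greedy` and `lazy` are illustrative.

```python
import re

suffixes = r"(ing|ly|ed|ious|ies|ive|es|s|ment)"

# Greedy: the first group grabs as much as it can and leaves only "s".
greedy = re.findall(r"^(.*)" + suffixes + r"$", "processes")

# Non-greedy: the first group stops as early as possible, so "es" is split off.
lazy = re.findall(r"^(.*?)" + suffixes + r"$", "processes")

print(greedy)
print(lazy)
```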

# Text Normalization 

Text normalization is the process of transforming text from one form into another while the relevant context is preserved.
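
As a very small illustration of the idea, the sketch below maps several surface variants of the same phrase onto one form by lower-casing and collapsing whitespace. This is just one possible normalization step, chosen as an assumption for the example. `normalize` and `variants` are hypothetical names.

```python
import re


def normalize(text):
    # Lower-case the text and collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text.lower()).strip()


variants = ["The  Godfather", "the godfather", " THE\tGODFATHER "]
normalized = {normalize(v) for v in variants}
print(normalized)
```

Stemming and lemmatization, shown next, go one step further and also reduce inflected word forms.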

## Text Stemming



In [None]:
import ssl
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

import nltk

raw_text = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = nltk.word_tokenize(raw_text)
print(tokens)

### Porter Stemmer

In [None]:
porter = nltk.PorterStemmer()
p_tokens = [porter.stem(token) for token in tokens]
print(p_tokens)

### Lancaster Stemmer

In [None]:
lancaster = nltk.LancasterStemmer()
l_tokens = [lancaster.stem(token) for token in tokens]
print(l_tokens)

## Text Lemmatization

In [None]:
lem = nltk.WordNetLemmatizer()
lem_tokens = [lem.lemmatize(token) for token in tokens]
print(lem_tokens)

Task:
- Given the text:
"Joe waited for the train. The train was late. Mary and Samantha took the bus. I looked for Mary and Samantha at the bus station."

1. Stem the words using both the Porter and the Lancaster stemmer.
2. Compare the two result sets by listing the stems that appear in only one of them.

In [None]:
# your code here

# Text Tokenization

In [None]:
text = """'Hi there. I'M a Phillip,' introduced the teacher himself to children, (not in a very hopeful tone
though), 'We won't have any boring stuff on my classes AT ALL. I hope you would all do it very
well without--Maybe it's always better to take notes of my life-changing classes,'..."""

Goal: split the text into words.

In [None]:
import re
words = re.split(r' ', text)
print(words)

In [None]:
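# Split on runs of spaces, tabs and newlines ...
words = re.split(r'[ \t\n]+', text)
# ... or, equivalently for this text, on any run of whitespace.
words = re.split(r'\s+', text)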
print(words)

In [None]:
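# Split on every single non-word character (this produces empty strings) ...
words = re.split(r'[^a-zA-Z0-9_]', text)
# ... or on runs of non-word characters.
words = re.split(r'\W+', text)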
print(words)

In [None]:
words = re.findall(r'[a-zA-Z0-9_]+', text) 
words = re.findall(r'\w+', text)
print(words)


In [None]:
words = re.findall(r'\w+|\S\w*', text)
print(words)

In [None]:
words = re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", text)
print(words)
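
The cells above refine the tokenizer step by step. On a short made-up sentence the effect of each step is easier to see. The patterns are the ones used above; `sentence` and the result names are illustrative.

```python
import re

sentence = "I won't go--maybe."

# Splitting on non-word characters breaks "won't" apart and leaves
# an empty string at the end, because the text ends with a separator.
by_split = re.split(r"\W+", sentence)

# Collecting word characters avoids the empty string but still loses
# the apostrophe and all punctuation.
by_findall = re.findall(r"\w+", sentence)

# The final pattern keeps word-internal apostrophes and hyphens together
# and returns punctuation as separate tokens.
by_pattern = re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", sentence)

print(by_split)
print(by_findall)
print(by_pattern)
```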

## Additional Regular Expressions

|Operator |Behavior     |
|-------------|-------------|
| \b | Word boundary (zero width) |
| \d | Any decimal digit (equivalent to \[0-9\])    | 
| \D | Any non-digit character (equivalent to \[^0-9\])     | 
| \s | Any whitespace character (equivalent to \[ \t\n\r\f\v\]) |
| \S | Any non-whitespace character (equivalent to \[^ \t\n\r\f\v\])  | 
| \w | Any alphanumeric character (equivalent to \[a-zA-Z0-9_\])  | 
| \W | Any non-alphanumeric character (equivalent to \[^a-zA-Z0-9_\])|
| \t | The tab character    | 
| \n | The newline character   | 
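
A brief sketch of the character classes from the table, applied to a made-up string that contains a tab. The string `s` and the result names are assumptions for illustration.

```python
import re

s = "Room 12 costs $40.50\tper night"

digit_runs = re.findall(r"\d+", s)        # runs of decimal digits
whole_words = re.findall(r"\b\w+\b", s)   # word characters between word boundaries
fields = re.split(r"\s+", s)              # split on spaces and the tab alike
digits_only = re.sub(r"\D", "", s)        # delete every non-digit character

print(digit_runs)
print(whole_words)
print(fields)
print(digits_only)
```

Note that `\w` does not match `$` or `.`, so the price is broken up by `\b\w+\b` but kept whole by the whitespace split.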

In [None]:
text ='That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)
    (?:[A-Z]\.)+  # abbreviations
    |\w+(?:-\w+)* # words with optional internal dash
    |\$?\d+(?:\.\d+)?%? # currency and percentages
    | \.\.\. # ellipsis
    |[][.,;"'?():_`-] # separate tokens; the hyphen is last so it stays literal'''
words = nltk.regexp_tokenize(text, pattern)
print(words)

In [None]:
expected_result = ["poster-print", "costs", "U.S.A"]
difference = set(words).difference(expected_result)
print(difference)

Task:
- Given text: "Sie Yu Chuah smiles when asked how his parents would react to a low test score. “My parents are not that strict but they have high expectations of me,” he says. The cheerful, slightly built 13-year-old is a pupil at Admiralty, it's a government secondary school in the northern suburbs of Singapore that opened in 2002."

1. Tokenize the text using one of the regexes introduced
2. Retrieve the lexicon of the text

Note: the following basic check can help decide whether a token is a valid word:
```
def isSpecialCharacter(word):
    return len(word) == 1 and not word.isalpha()
```


In [None]:
# your code here

# Text Segmentation

## Sentence Segmentation


In [None]:
nltk.download('brown')
avg = len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())
print(avg)

In [None]:
nltk.download('gutenberg')
text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
sentences = nltk.sent_tokenize(text)
print(sentences[79:89])

## Word Segmentation


In [None]:
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"

In [None]:
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"

In [None]:
def segment(text, segmentation):
    words = []
    last = 0
    for i in range(len(segmentation)):
        if segmentation[i] == '1':
            words.append(text[last:i+1])
            last = i+1
    words.append(text[last:])
    return words

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"
print(segment(text, seg1))
print(segment(text, seg2))
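
The segmentation string has one position per gap between characters, so it is one character shorter than the text, and a `'1'` at position `i` means that a word ends after character `i`. A tiny worked example with the same `segment` function, repeated here so the block runs on its own; `tiny_text` and `tiny_segs` are made up.

```python
def segment(text, segmentation):
    words = []
    last = 0
    for i in range(len(segmentation)):
        if segmentation[i] == '1':
            # A '1' at position i closes the word that ends at character i.
            words.append(text[last:i+1])
            last = i+1
    # Whatever follows the last boundary is the final word.
    words.append(text[last:])
    return words


tiny_text = "seethedog"
tiny_segs = "00100100"   # boundaries after "see" and after "the"

print(segment(tiny_text, tiny_segs))
print(segment(tiny_text, "00000000"))
```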

<img src="segmentation.png" />

In [None]:
def evaluate(text, segmentation):
    words = segment(text, segmentation)
    text_size = len(words)
    lexicon_size = sum(len(word) + 1 for word in set(words))
    return text_size + lexicon_size

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"
seg3 = "0000100100000011001000000110000100010000001100010000001"
print(evaluate(text, seg1))
print(evaluate(text, seg2))
print(evaluate(text, seg3))
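
The score returned by `evaluate` adds the number of words in the segmented text to the size of the lexicon, where each distinct word costs its length plus one. `anneal` keeps the guess with the smaller score, so lower is better. The arithmetic can be checked by hand on a short made-up text; the functions are repeated so the block is self-contained, and `tiny_text` is an assumption.

```python
def segment(text, segmentation):
    words = []
    last = 0
    for i in range(len(segmentation)):
        if segmentation[i] == '1':
            words.append(text[last:i+1])
            last = i+1
    words.append(text[last:])
    return words


def evaluate(text, segmentation):
    words = segment(text, segmentation)
    text_size = len(words)
    lexicon_size = sum(len(word) + 1 for word in set(words))
    return text_size + lexicon_size


tiny_text = "thedogthedog"

# the|dog|the|dog: 4 words + lexicon {the, dog} = 4 + (4 + 4)
score_words = evaluate(tiny_text, "00100100100")
# thedog|thedog: 2 words + lexicon {thedog} = 2 + 7
score_pairs = evaluate(tiny_text, "00000100000")
# no boundary: 1 word + lexicon {thedogthedog} = 1 + 13
score_none = evaluate(tiny_text, "00000000000")

print(score_words, score_pairs, score_none)
```

On such a short text the objective prefers the repeated chunk `thedog` over the individual words, which illustrates that the measure rewards reuse of lexicon entries rather than linguistic correctness.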

In [None]:
from random import randint

def flip(segs, pos):
    return segs[:pos] + str(1-int(segs[pos])) + segs[pos+1:]

def flip_n(segs, n):
    for i in range(n):
        segs = flip(segs, randint(0, len(segs)-1))
    return segs

def anneal(text, segs, iterations, cooling_rate):
    temperature = float(len(segs))
    while temperature > 0.5:
        best_segs, best = segs, evaluate(text, segs)
        for i in range(iterations):
            guess = flip_n(segs, round(temperature))
            score = evaluate(text, guess)
            if score < best:
                best, best_segs = score, guess
        score, segs = best, best_segs
        temperature = temperature / cooling_rate
        print(evaluate(text, segs), segment(text, segs))
    print()
    return segs

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
anneal(text, seg1, 5000, 1.2)
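
The search in `anneal` is driven by the two small helpers `flip` and `flip_n`: `flip` toggles one boundary bit, and `flip_n` applies `n` such toggles at random positions. The sketch below looks at them in isolation. It uses `random.randint` through the module rather than the direct import above, and the seed is set only to make the run repeatable.

```python
import random


def flip(segs, pos):
    # Toggle the bit at position pos: '0' becomes '1' and vice versa.
    return segs[:pos] + str(1 - int(segs[pos])) + segs[pos+1:]


def flip_n(segs, n):
    # Toggle n randomly chosen positions (the same position may be hit twice).
    for _ in range(n):
        segs = flip(segs, random.randint(0, len(segs) - 1))
    return segs


once = flip("0010", 1)    # insert a boundary at position 1
twice = flip(once, 1)     # flipping again removes it

random.seed(0)
noisy = flip_n("0000000000", 3)

print(once, twice)
print(noisy)
```

Because the number of flips per guess is `round(temperature)`, the guesses differ a lot from the current segmentation at the start and only slightly once the temperature has dropped.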

# Formatting: From Lists to Strings

In [None]:
silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']
print(' '.join(silly))
print(';'.join(silly))

In [None]:
fdist = nltk.FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])
for word in sorted(fdist):
    print(word, '->', fdist[word], end='; ')

In [None]:
text = '{}->{};'.format('cat', 3)
print(text)

In [None]:
text = 'from {1} to {0}'.format('A', 'B')
print(text)

In [None]:
print('{:<6}'.format(41))
print('{:{width}}'.format('Monty Python', width=195))
print('{:.4f}'.format(3.148381201293189))
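
The part after the colon in a replacement field is a format specification. A short sketch of the most common pieces, with a `|` appended so that the padding becomes visible; the values are arbitrary.

```python
left = '{:<6}|'.format(41)                    # left-aligned in a field of width 6
right = '{:>6}|'.format('dog')                # right-aligned
centered = '{:^7}|'.format('cat')             # centered
nested = '{:{width}}|'.format('Monty', width=8)   # width passed as an argument
percent = '{:.1%}'.format(0.25)               # fraction shown as a percentage

for line in (left, right, centered, nested, percent):
    print(line)
```

Strings are left-aligned by default, which is why the `nested` example pads on the right.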

In [None]:
words = ['one', 'two', 'three']
# The with-statement flushes and closes the file when the block ends.
with open('output.txt', 'w') as output_file:
    for word in words:
        print(word, file=output_file)
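
To check what ended up in the file, the lines can be read back in. The sketch below writes to a temporary directory instead of the working directory, which is an assumption made so the example leaves no stray `output.txt` behind. The explicit UTF-8 encoding is likewise a choice made for this example.

```python
import os
import tempfile

words = ['one', 'two', 'three']
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

# Write one word per line; the file is closed when the block ends.
with open(path, 'w', encoding='utf-8') as output_file:
    for word in words:
        print(word, file=output_file)

# Read the lines back and strip the trailing newline from each.
with open(path, encoding='utf-8') as input_file:
    lines = [line.rstrip('\n') for line in input_file]

print(lines)
```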

Thank you for your attention!

<img src="https://15f76u3xxy662wdat72j3l53-wpengine.netdna-ssl.com/wp-content/uploads/2016/12/Christmas-card_not-raw-FINAL-1.gif" width="90%"/>