<center> <h1> Natural Language Processing with Python </h1> </center>
<center> <h2> Processing Raw Text  </h2> </center> 
<center> <img height="300" src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2017/09/Natural-Language-Processing-with-Python.jpg"> </center>

## Goals

The goal of this chapter is to answer the questions: 


- How can we write programs to access texts?

- How can we process those texts?

- How can we write programs to produce formatted output and save it?
 

_______

## Some Basics in Python


In [None]:
example = "This is an example string!"

"""
We can access a single character of a string with [index].
Remember: in Python, negative index values are also possible and count from the end!
"""

print(example[0])
print(example[1])

"""
Two strings can be concatenated with "+".
"""

example2 = "This is another example of a string!"
print(example + example2)

"""
We can also access substrings (slices) by using square brackets.
The first value determines the start of the substring, the second value the end (exclusive), and the last value the step length: [start:end:step_length]
"""

print(example[:10])
print(example[10:])
print(example[::2])
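
Negative values work for indices, slice bounds and the step. The following is a small sketch on the same example string; the variable names are chosen only for illustration.

```python
example = "This is an example string!"

# Negative indices count from the end of the string
last_char = example[-1]
last_word = example[-7:]

# A negative step walks through the string backwards
reversed_text = example[::-1]

# Slice bounds beyond the end of the string are clipped silently
clipped = example[19:100]

print(last_char, last_word, reversed_text, clipped, sep="\n")
```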

<h3 style="color: red"> Tasks: </h3>

- Define a string s = 'The Godfther'. Write a statement that changes this to "The Godfather". You can only use concatenation and slicing.  

- What happens if we access the 13th element of the string s? Why?


In [None]:
# your code here: 

________

## Accessing Text

In this section we will discuss three ways of accessing text:

- Reading local text files (.txt)
- Reading binary formats (PDF)
- Reading from the web

#### Accessing Text from the local file system

In [None]:
def get_text_from_txt(path):
    # The with-statement closes the file automatically; UTF-8 is assumed here
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    return raw

example1 = get_text_from_txt("./example.txt")
print(example1[:100])
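
Writing works the same way as reading. The sketch below is a round trip under assumptions: it writes a few formatted lines to a temporary file (the file name `example_out.txt` is hypothetical) and reads them back line by line.

```python
import os
import tempfile

lines = ["first line", "second line", "third line"]

# Write the lines to a temporary file, each prefixed with a right-aligned number
path = os.path.join(tempfile.mkdtemp(), "example_out.txt")
with open(path, "w", encoding="utf-8") as f:
    for number, text in enumerate(lines, start=1):
        f.write(f"{number:>2}: {text}\n")

# Iterating over the file object yields one line at a time
with open(path, encoding="utf-8") as f:
    read_back = [line.rstrip("\n") for line in f]

print(read_back)
```

Reading line by line avoids loading a large file into memory at once, which `f.read()` does.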

#### Accessing Text from binary formats

For more information, see the <a href="https://github.com/jsvine/pdfplumber">PDFplumber</a> documentation.

In [None]:
import pdfplumber

def get_text_from_pdf(path):
    # Print the number of pages and the first 100 characters of the first page
    with pdfplumber.open(path) as pdf:
        first_page = pdf.pages[0]
        print(len(pdf.pages))
        print(first_page.extract_text()[:100])

get_text_from_pdf("./example.pdf")

#### Accessing Text from the web


In [None]:
from urllib.request import urlopen

def get_text_from_url(url):
    # read() returns the raw response body as bytes, not as str
    raw = urlopen(url).read()
    return raw

url = "https://en.wikipedia.org/wiki/The_Godfather"
raw = get_text_from_url(url)

print(type(raw))
print(len(raw))
print("Content of the Website: \n", raw[:100])
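
Because `urlopen(url).read()` returns bytes, the result usually has to be decoded before it can be treated as text. The following sketch uses a short byte string as a stand-in for a downloaded page and assumes that the page is UTF-8 encoded; a real page may declare a different encoding.

```python
# A short byte string standing in for the result of urlopen(url).read()
raw_bytes = "<p>Café</p>".encode("utf-8")

# Decoding turns the bytes into a str (assumption: the page is UTF-8 encoded)
text = raw_bytes.decode("utf-8")

# The lengths differ because "é" takes two bytes in UTF-8
print(type(raw_bytes), len(raw_bytes))
print(type(text), len(text))
```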

When we access text from the web, the response contains the complete HTML markup, including all tags and metadata, and not just the visible text.

<a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">Beautiful Soup</a> provides helper functions for pulling the text out of the tags.

In [None]:
from bs4 import BeautifulSoup

# Name the parser explicitly to avoid a warning and parser-dependent results
soup = BeautifulSoup(raw, "html.parser")

soup.get_text()[:100]

In [None]:
text = soup.find(id="firstHeading")

In [None]:
text = soup.find_all("p") 
print(len(text))
print(text[1].get_text())

<h3 style="color: red"> Tasks: </h3>

- Write code to get the current temperature of Berlin.

(Hint: use this link: https://www.bbc.com/weather/2950159. The relevant element can be identified by class="wr-value--temperature--c".)

 

In [None]:
# your code here 

_____________________

## Processing Text

In this section we cover:

- how we can deal with different languages

- how regular expressions can be used for pattern matching and stemming



### Text Processing with Unicode

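Unicode supports over a million characters. Each character is assigned a number, called a code point.

In Python, code points are written in the form \uXXXX, where XXXX is the number in four-digit hexadecimal form.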

In [None]:
import unicodedata

# Read the first line of a UTF-8 encoded file
with open("./example2.txt", encoding="utf-8") as f:
    line = f.readlines()[0]

print(line.encode("unicode_escape"))

# Show each non-ASCII character with its escape, code point and Unicode name
for c in line:
    if ord(c) > 127:
        print(c, c.encode("unicode_escape"), ord(c), unicodedata.name(c))
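
The same inspection also works on an inline string, without a file. The sketch below uses the Danish word "smørrebrød" as an assumed example and shows that the escape `\u00f8` and the character ø are the same thing.

```python
import unicodedata

# The escape \u00f8 denotes the character ø
word = "sm\u00f8rrebr\u00f8d"
print(word == "smørrebrød")

# Collect code point, Unicode name and UTF-8 bytes of every non-ASCII character
non_ascii = [
    (c, hex(ord(c)), unicodedata.name(c), c.encode("utf-8"))
    for c in word
    if ord(c) > 127
]

for entry in non_ascii:
    print(entry)
```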
 

In [None]:
# Replace ø and å with ASCII stand-ins, then encode the line as GB2312
line = line.replace("ø", "o|")
line = line.replace("å", "a_")
print(line.encode("GB2312"))


<img src="https://www.nltk.org/images/unicode.png" width="90%" height="400"/>

__________________

### Regular Expressions

In NLP, many tasks involve pattern matching. Regular expressions give us a powerful and flexible way to describe the character patterns we want to find.


|Operator |Behavior     |
|-------------|-------------|
| .     | Wildcard, matches any character|
| ^abc      | Matches some pattern abc at the start of a string    | 
| abc$ | Matches some pattern abc at the end of a string     | 
| \[abc\]      | Matches one of a set of characters|
| \[A-Z\]      | Matches one of a range of characters  | 
| ed\|es | Matches one of the specified strings (disjunction)  | 
| *      | Zero or more of previous item, e.g. a\*, \[a-z\]\* (also known as Kleene Closure)|
| +      | One or more of previous item, e.g. a+, \[a-z\]+    | 
| ? | Zero or one of the previous item (i.e. optional), e.g. a?, \[a-z\]?   | 
| {n}      | Exactly n repeats where n is a non-negative integer|
| {n,}      | At least n repeats | 
| {,n} | No more than n repeats   | 
| {m,n}      | At least m and no more than n repeats|
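
Some of the operators in the table can be tried out on a small word list. This is only a sketch; the word list and the helper `matching` are made up for illustration.

```python
import re

words = ["walked", "boxes", "hello", "abc", "aaa", "color", "colour"]

def matching(pattern):
    """Return the words in which the pattern is found."""
    return [w for w in words if re.search(pattern, w)]

ends_in_ed_or_es = matching(r"(ed|es)$")   # disjunction at the end of the word
starts_with_ab = matching(r"^ab")          # anchor at the start of the word
optional_u = matching(r"^colou?r$")        # "u" may appear zero or one time
exactly_three_a = matching(r"^a{3}$")      # exactly three repeats of "a"
only_a_to_c = matching(r"^[a-c]+$")        # one or more characters from a range

print(ends_in_ed_or_es, starts_with_ab, optional_u, exactly_three_a, only_a_to_c)
```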

In [None]:
import re

res = re.search(r"ed","abaiedsse")

print(res)
print(res.start())
print(res.end())
print(res.string) 
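
`re.search` returns `None` when the pattern does not occur, so calling `.start()` on the result would then raise an `AttributeError`. A minimal sketch of a guard, with a hypothetical helper `find_span`:

```python
import re

def find_span(pattern, text):
    """Return (start, end) of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.start(), match.end()

found = find_span(r"ed", "abaiedsse")
missing = find_span(r"ed", "abc")

print(found)
print(missing)
```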

In [None]:
import nltk

wordlist_en = [w.lower() for w in nltk.corpus.words.words("en")]

list_ =  [w for w in wordlist_en if re.search(r"^ho" , w)]
print(len(list_))
list_[:10]

<h3 style="color: red"> Tasks: </h3>

- Extract all four-digit numbers from the WSJ sample (wordlist_wsj).

- Assuming these numbers are years, what fraction of them are from the 1980s?

In [None]:
wordlist_wsj = nltk.corpus.treebank.words()
# your code here

### More useful functions

In [None]:
word = 'supercalifragilisticexpialidocious'
a = re.findall(r'[aeiou]', word)
a[:5]
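
Since `re.findall` returns a plain list of all matches, it combines well with `collections.Counter`. The sketch below counts how often each vowel occurs in the word; using `Counter` is an addition for illustration.

```python
import re
from collections import Counter

word = "supercalifragilisticexpialidocious"

# findall returns every single vowel, Counter tallies them
vowel_counts = Counter(re.findall(r"[aeiou]", word))

print(vowel_counts.most_common())
```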

In [None]:
word = 'Today is the 25th of July. It is a beautiful day. My mom said: "dont run so fast!". \'Why?\' did I ask.'
a = re.sub(r'[0-9.!?"*#\']', "", word)
a

### Stemming

In [None]:
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

In [None]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

In [None]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
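
For 'processing', the greedy and the non-greedy pattern give the same split, so the difference is easier to see on another word. The sketch below uses 'processes' as an assumed test word and also shows a non-capturing group `(?:...)`, which makes `findall` return the whole match instead of the suffix.

```python
import re

suffixes = r"ing|ly|ed|ious|ies|ive|es|s|ment"
word = "processes"

# With one capturing group, findall returns only the content of that group
only_suffix = re.findall(rf"^.*({suffixes})$", word)

# A non-capturing group (?:...) makes findall return the whole match
whole_word = re.findall(rf"^.*(?:{suffixes})$", word)

# Greedy .* takes as much as possible and leaves only "s" for the suffix
greedy = re.findall(rf"^(.*)({suffixes})$", word)

# Non-greedy .*? takes as little as possible, so the suffix becomes "es"
non_greedy = re.findall(rf"^(.*?)({suffixes})$", word)

print(only_suffix, whole_word, greedy, non_greedy, sep="\n")
```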

In [None]:
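def stem(word):
    # The trailing "?" makes the suffix optional, so a word without a listed
    # suffix returns (word, "") instead of raising an IndexError
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    base, suffix = re.findall(regexp, word)[0]
    return base, suffix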

### Normalizing Text

Text normalization transforms text into a more uniform form while preserving the relevant content. There is no all-purpose normalization procedure.

In [None]:
example = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""

In [None]:
tokens = nltk.word_tokenize(example)
tokens[:10]

In [None]:
tokens_s = [w.lower() for w in tokens]
tokens_s[:10]

In [None]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

tokens_sp = [porter.stem(w) for w in tokens_s]
tokens_sl = [lancaster.stem(w) for w in tokens_s]

In [None]:
wnl = nltk.WordNetLemmatizer()

tokens_sw = [wnl.lemmatize(w) for w in tokens_s]

<h3 style="color: red"> Tasks: </h3>

- Write a compress function that removes the vowels from a word.

- Since short words such as on and in would both be reduced to the single letter n, extend your function so that only words with three or more letters are compressed.


In [None]:
# your code here

________________________