# Processing Raw Text

The goal of this chapter is to answer the questions below: 

- How can we write programs to access texts?
- How can we split documents up into individual words? 
- How can we write programs to produce formatted output and save it to a file?

This notebook provides some examples and tasks for you to complete.

Once you have answered all the questions, you will have a pipeline for processing raw text.

## Some Basics in Python

To process raw text, we first need to recall the fundamental data type string and its functionality.

In [None]:
example = "This is an example string!"

"""
We can access the individual characters of a string by using [index].
Remember: in Python, negative index values are also possible!
"""

print(example[0])
print(example[1])

"""
Two strings can be concatenated with "+".
"""

example2 = "This is another example of a string!"
print(example + example2)

"""
We can also access substrings by using square brackets.
The first value determines the start of the substring, the second value the end (exclusive), and the last value the step length: [start:end:step_length]
"""

print(example[:10])
print(example[10:])
print(example[::2])
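
The cell above mentions negative indices without showing them. The following is a minimal sketch of negative indexing and slicing; the string `demo` is a hypothetical example and not part of the notebook's own cells.

```python
# Hypothetical demo string, chosen only for illustration.
demo = "Processing raw text"

last_char = demo[-1]        # negative indices count from the end of the string
last_word = demo[-4:]       # from the fourth-to-last character up to the end
middle = demo[11:14]        # start is inclusive, end is exclusive
reversed_demo = demo[::-1]  # a step of -1 walks through the string backwards

print(last_char)
print(last_word)
print(middle)
print(reversed_demo)
```

Because the end index is exclusive, `demo[:n] + demo[n:]` always gives back the full string.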

<h3 style="color: red">Tasks:</h3>

- Define a string s = 'The Godfther'. Write a statement that changes this to "The Godfather". You may only use concatenation and slicing.

- What happens if we try to access the 13th element of the string s? Why?


In [None]:
# your code here: 

## Accessing Text

The most important source of texts is undoubtedly the Web. <a href="https://www.gutenberg.org">Project Gutenberg</a> is a collection of over 25,000 free online books. Unfortunately, Project Gutenberg is currently unavailable from German IP addresses.
To demonstrate how raw text is processed, we will instead use the text behind <a href="https://en.wikipedia.org/wiki/The_Godfather">this link</a>.

The code block below demonstrates how you can access the text of any website:

In [None]:
from urllib.request import urlopen
url = "https://en.wikipedia.org/wiki/The_Godfather"
# read() returns the page as bytes, not as a str
raw = urlopen(url).read()
print(type(raw))
print(len(raw))
print(raw[:100])

When we access text from the Web, we receive the raw HTML, including all of its tags and markup.

<a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">Beautiful Soup</a> provides helper functions for pulling the text out of the tags.

In [None]:
from bs4 import BeautifulSoup
# parse the HTML, keep only the text, and skip the first 100 characters
text = BeautifulSoup(raw, "html.parser").get_text()[100:]
print(text[:100])
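
To see what pulling the text out of the tags means, here is a rough sketch that uses the standard library's `html.parser` instead of Beautiful Soup. The class `TextExtractor` and the string `sample_html` are hypothetical names made up for this illustration; this is not how Beautiful Soup is implemented.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every piece of text outside of a tag.
        self.chunks.append(data)


# Hypothetical miniature page, used instead of a real download.
sample_html = "<html><body><h1>The Godfather</h1><p>A 1972 <b>crime</b> film.</p></body></html>"

extractor = TextExtractor()
extractor.feed(sample_html)
extractor.close()

plain_text = "".join(extractor.chunks)
print(plain_text)
```

Note that the heading and the paragraph run together once the tags are gone, which is one reason why extracted text usually needs further cleaning.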

Be aware: even with BeautifulSoup, the text still contains unwanted material.
We have to clip out the interesting parts of the document.

Our goal is to break the string up into words and punctuation. This step is called <b>tokenization</b>.

In [None]:
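import nltk

# word_tokenize needs NLTK's Punkt tokenizer data. If it is missing, download it
# once with nltk.download("punkt") (the resource is named "punkt_tab" in newer NLTK releases).
tokens = nltk.word_tokenize(text)[20:]
print(type(tokens))
print(len(tokens))
print(tokens[100:110])  # show ten consecutive tokens
nltk_text = nltk.Text(tokens)
print(type(nltk_text))
print(nltk_text)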
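
To make the idea of tokenization concrete without any downloads, here is a deliberately simplified sketch based on a regular expression. The function `simple_tokenize` is a hypothetical helper; it is not the algorithm behind `nltk.word_tokenize`, which handles many cases this pattern does not.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+      matches a run of letters, digits, or underscores (a "word")
    # [^\w\s]  matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Hello, world! This is tokenization."
sample_tokens = simple_tokenize(sample)
print(sample_tokens)
```

Whitespace is dropped entirely, while each punctuation mark becomes a token of its own.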

Now we can normalize the text. We will also discuss an even more "aggressive" normalization step of a text below. 

For the normalization, we will transform all words to lower case. 


In [None]:
words = [w.lower() for w in tokens]
vocab = sorted(set(words))
print(vocab[:10])
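
The effect of lowercasing is easiest to see on a small, made-up token list. The sketch below assumes hypothetical tokens and uses `collections.Counter` to compare counts before and after normalization.

```python
from collections import Counter

# Hypothetical tokens; the same word appears with different capitalization.
sample_tokens = ["The", "film", "and", "the", "novel", "THE", "Film"]

raw_counts = Counter(sample_tokens)
normalized_counts = Counter(token.lower() for token in sample_tokens)

raw_vocab_size = len(raw_counts)                # every spelling counts as its own word
normalized_vocab_size = len(normalized_counts)  # spellings that differ only in case are merged

print(raw_vocab_size, normalized_vocab_size)
print(normalized_counts.most_common(2))
```

Normalization shrinks the vocabulary, but it also discards information: after lowercasing, a proper noun can no longer be told apart from a common word by its capital letter.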

The figure below summarizes what we have covered so far.

<img src="https://www.nltk.org/images/pipeline1.png" width="90%" height="400"/>


<h3 style="color: red">Tasks:</h3>

- Write code to get the current temperature in Berlin and save the value as temp. We will need it later. (Hint: use https://www.timeanddate.com/weather/germany/berlin as the source. The relevant part of the extracted text should be roughly between characters 1890 and 1950.)

 

In [None]:
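# your code here
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.timeanddate.com/weather/germany/berlin"
# "unicode_escape" reads the bytes as Latin-1, so non-ASCII characters come out garbled
raw = urlopen(url).read().decode("unicode_escape")
page_text = BeautifulSoup(raw, "html.parser").get_text()[100:]
# the hint places the temperature roughly between characters 1890 and 1950
temp = page_text[1890:1950]
print(temp)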

Unicode supports over a million characters.

<img src="https://www.nltk.org/images/unicode.png" width="90%" height="400"/>

In [None]:
import unicodedata

# replace the leftover sequence of a non-breaking space ("\xa0") and "Â" with a plain space
word = temp.replace(unicodedata.normalize("NFC", "\xa0Â"), " ")
print(word)
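
A likely source of stray characters such as "Â" is text that was encoded as UTF-8 but decoded with a single-byte codec such as Latin-1. The sketch below uses a made-up temperature string, not the real page content, to reproduce the effect and to reverse it.

```python
# Hypothetical sample: a temperature with a non-breaking space before the degree sign.
original = "5\xa0°C"

utf8_bytes = original.encode("utf-8")    # "\xa0" and "°" each take two bytes in UTF-8
garbled = utf8_bytes.decode("latin-1")   # wrong codec: every byte becomes one character
repaired = garbled.encode("latin-1").decode("utf-8")  # undo the wrong decoding step

print(len(original), len(utf8_bytes), len(garbled))
print(repr(garbled))
print(repaired == original)
```

The round trip only works because Latin-1 maps every possible byte to a character; with other codecs the wrong decoding step may lose information or raise an error.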

<h3 style="color: red">Tasks:</h3>

- Encode the string s = u"example" into latin1 and utf8. What are the results?

In [None]:
# your code here