# Natural Language Processing
### [Digital Methodology in the Humanities and Cultural Sciences](https://www.digitale-methodik.uni-mainz.de/) / Winter semester 2021  

[Johannes Gutenberg-Universität Mainz](https://www.uni-mainz.de/)  
[Hochschule Mainz](https://www.hs-mainz.de/)  
[Akademie der Wissenschaften und der Literatur | Mainz](https://www.adwmainz.de/startseite.html)  
[Leibniz-Institut für Europäische Geschichte](https://www.ieg-mainz.de/)  

## Course content  

- **Introduction (20/10)**
  - text formats
  - reading files
  - Python built-in string methods
  - regular expressions

## Introduction

### 1. Text formats
*Text extracted from [Applied Language Technology](https://applied-language-technology.readthedocs.io/en/latest/notebooks/part_ii/01_basic_text_processing.html#what-is-rich-text)*

#### What is rich text?

Word processors, such as Microsoft Word, produce [rich text](https://en.wikipedia.org/wiki/Formatted_text), that is, text whose appearance has been formatted or styled in a specific way.
Rich text allows defining specific visual styles for document elements. Headers, for example, may use a different font than the body text, which may in turn feature *italic* or **bold** fonts for emphasis. Rich text can also include various types of images, tables and other document elements.
Rich text is the default format for modern what-you-see-is-what-you-get word processors.

#### What is plain text?

Unlike rich text, [plain text](https://en.wikipedia.org/wiki/Plain_text) does not contain any information about the visual appearance of text, but consists of *characters* only.
Characters, in this context, refers to letters, numbers, punctuation marks, spaces and line breaks.
The definition of plain text is fairly loose, but generally the term refers to text which lacks any formatting or style information.


#### What is structured text?

Structured text may be thought of as a special case of plain text, which includes character sequences that are used to format the text for display.
Forms of structured text include text described using mark-up languages such as XML, Markdown or HTML.  
The example below shows a plain text sentence wrapped into HTML tags for paragraphs `<p>`. 
The opening tag `<p>` and the closing tag `</p>` tell the computer that any content placed between these tags forms a paragraph.

``` 
<p>This is an example sentence.</p>
```

This information is used for structuring plain text when *rendering* text for display, typically by styling its appearance.

#### Why does this matter?

If you collect a bunch of texts for a corpus, chances are that some originated in rich or structured format, depending on the medium these texts came from.

If you collect printed documents that have been digitized using a technique such as [optical character recognition](https://en.wikipedia.org/wiki/Optical_character_recognition) (OCR) and subsequently converted from rich into plain text, the removal of formatting information is likely to introduce errors into the resulting plain text. Working with this kind of "dirty" OCR can have an impact on the results of text analysis (Hill & Hengchen [2019](https://doi.org/10.1093/llc/fqz024)).

If you collect digital documents by scraping discussion forums or websites, you are likely to encounter traces of structured text in the form of markup tags, which may be carried over to plain text during conversion.
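
As a minimal sketch of how such leftover markup might be removed, the example below uses `HTMLParser` from Python's standard library to keep only the text found between tags. The class name `TextExtractor`, the helper `strip_tags` and the sample string are assumptions made for illustration; real scraped pages usually need more careful handling (scripts, navigation elements, broken markup).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects the text found between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for every run of text outside of tags
        self.parts.append(data)


def strip_tags(markup):
    """Return the text content of `markup` with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return ''.join(parser.parts)


cleaned = strip_tags('<p>This is an <b>example</b> sentence.</p>')
print(cleaned)
```

A parser is used here rather than a regular expression because tags can carry attributes and can be nested, which simple patterns tend to handle poorly.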

Plain text is by far the most interchangeable format for text, as it is easy for computers to read. This is why programming languages work with plain text, and if you plan to use programming languages to manipulate text, you need to know what plain text is. 

To summarise, when working with plain text, you may need to deal with traces left by conversion from rich or structured text.

### 2. Reading documents  

Built-in function [open()](https://docs.python.org/3/library/functions.html#open)

In [46]:
# reading the whole document at once
with open('data/stairwaytoheaven.txt', 'r', encoding='UTF-8') as f:
    text = f.read()

In [47]:
print(f)

<_io.TextIOWrapper name='data/stairwaytoheaven.txt' mode='r' encoding='UTF-8'>


In [48]:
print(len(text))
print(text)

1641
There's a lady who's sure all that glitters is gold
And she's buying a stairway to heaven
When she gets there she knows, if the stores are all closed
With a word she can get what she came for
Ooh, ooh, and she's buying a stairway to heaven
There's a sign on the wall, but she wants to be sure
'Cause you know sometimes words have two meanings
In a tree by the brook, there's a songbird who sings
Sometimes all of our thoughts are misgiven
It makes me wonder
Makes me wonder
There's a feeling I get when I look to the west
And my spirit is crying for leaving
In my thoughts I have seen rings of smoke through the trees
And the voices of those who stand looking
Makes me wonder
And it makes me wonder
And it's whispered that soon, if we all call the tune
Then the piper will lead us to reason
And a new day will dawn for those who stand long
And the forests will echo with laughter
If there's a bustle in your hedgerow, don't be alarmed now
It's just a spring clean for the May queen
Yes, there ar

In [49]:
print(text[:25])
# a string is a sequence of characters, so it can be sliced like a list

There's a lady who's sure


In [50]:
# reading the document line by line
with open('data/stairwaytoheaven.txt', 'r', encoding='UTF-8') as f:
    text = []
    for line in f:
        text.append(line)

In [51]:
print(len(text))
print(text)

40
["There's a lady who's sure all that glitters is gold\n", "And she's buying a stairway to heaven\n", 'When she gets there she knows, if the stores are all closed\n', 'With a word she can get what she came for\n', "Ooh, ooh, and she's buying a stairway to heaven\n", "There's a sign on the wall, but she wants to be sure\n", "'Cause you know sometimes words have two meanings\n", "In a tree by the brook, there's a songbird who sings\n", 'Sometimes all of our thoughts are misgiven\n', 'It makes me wonder\n', 'Makes me wonder\n', "There's a feeling I get when I look to the west\n", 'And my spirit is crying for leaving\n', 'In my thoughts I have seen rings of smoke through the trees\n', 'And the voices of those who stand looking\n', 'Makes me wonder\n', 'And it makes me wonder\n', "And it's whispered that soon, if we all call the tune\n", 'Then the piper will lead us to reason\n', 'And a new day will dawn for those who stand long\n', 'And the forests will echo with laughter\n', "If there's

In [52]:
print(text[:25])
# each element of the list is now a string holding one line of the file,
# so this slice returns the first 25 lines rather than 25 characters

["There's a lady who's sure all that glitters is gold\n", "And she's buying a stairway to heaven\n", 'When she gets there she knows, if the stores are all closed\n', 'With a word she can get what she came for\n', "Ooh, ooh, and she's buying a stairway to heaven\n", "There's a sign on the wall, but she wants to be sure\n", "'Cause you know sometimes words have two meanings\n", "In a tree by the brook, there's a songbird who sings\n", 'Sometimes all of our thoughts are misgiven\n', 'It makes me wonder\n', 'Makes me wonder\n', "There's a feeling I get when I look to the west\n", 'And my spirit is crying for leaving\n', 'In my thoughts I have seen rings of smoke through the trees\n', 'And the voices of those who stand looking\n', 'Makes me wonder\n', 'And it makes me wonder\n', "And it's whispered that soon, if we all call the tune\n", 'Then the piper will lead us to reason\n', 'And a new day will dawn for those who stand long\n', 'And the forests will echo with laughter\n', "If there's a 

In [53]:
print(text[0])
print(text[0][:5])

There's a lady who's sure all that glitters is gold

There
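
The empty line in the output above comes from the `\n` that each line keeps when a file is read line by line. The sketch below shows one way to drop it while reading. It writes a small temporary file (a hypothetical stand-in for `data/stairwaytoheaven.txt`) so that it can run on its own, and it also checks that the file object is closed once the `with` block ends.

```python
import tempfile
from pathlib import Path

# Hypothetical sample content, written to a temporary file for illustration
sample = "first line\nsecond line\nthird line\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_text(sample, encoding="utf-8")

    with open(path, "r", encoding="utf-8") as f:
        # rstrip("\n") removes the newline kept at the end of each line
        lines = [line.rstrip("\n") for line in f]

    # The with-statement has closed the file at this point
    file_is_closed = f.closed

print(lines)
print(file_is_closed)
```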


### 3. Manipulating strings  
Several built-in [string methods](https://docs.python.org/3/library/stdtypes.html#string-methods)

In [65]:
# reading the whole document at once
with open('data/stairwaytoheaven.txt', 'r', encoding='UTF-8') as f:
    text = f.read()

In [66]:
# repr()
print(repr(text))

"There's a lady who's sure all that glitters is gold\nAnd she's buying a stairway to heaven\nWhen she gets there she knows, if the stores are all closed\nWith a word she can get what she came for\nOoh, ooh, and she's buying a stairway to heaven\nThere's a sign on the wall, but she wants to be sure\n'Cause you know sometimes words have two meanings\nIn a tree by the brook, there's a songbird who sings\nSometimes all of our thoughts are misgiven\nIt makes me wonder\nMakes me wonder\nThere's a feeling I get when I look to the west\nAnd my spirit is crying for leaving\nIn my thoughts I have seen rings of smoke through the trees\nAnd the voices of those who stand looking\nMakes me wonder\nAnd it makes me wonder\nAnd it's whispered that soon, if we all call the tune\nThen the piper will lead us to reason\nAnd a new day will dawn for those who stand long\nAnd the forests will echo with laughter\nIf there's a bustle in your hedgerow, don't be alarmed now\nIt's just a spring clean for the May q

In [67]:
# replace()
processed_text = text.replace('\n',' ')
print(repr(processed_text))

"There's a lady who's sure all that glitters is gold And she's buying a stairway to heaven When she gets there she knows, if the stores are all closed With a word she can get what she came for Ooh, ooh, and she's buying a stairway to heaven There's a sign on the wall, but she wants to be sure 'Cause you know sometimes words have two meanings In a tree by the brook, there's a songbird who sings Sometimes all of our thoughts are misgiven It makes me wonder Makes me wonder There's a feeling I get when I look to the west And my spirit is crying for leaving In my thoughts I have seen rings of smoke through the trees And the voices of those who stand looking Makes me wonder And it makes me wonder And it's whispered that soon, if we all call the tune Then the piper will lead us to reason And a new day will dawn for those who stand long And the forests will echo with laughter If there's a bustle in your hedgerow, don't be alarmed now It's just a spring clean for the May queen Yes, there are tw

In [68]:
# strip() for leading or trailing whitespace
processed_text = '    ' + processed_text
print(repr(processed_text))
print('\n')
processed_text = processed_text.strip()
print(repr(processed_text))

"    There's a lady who's sure all that glitters is gold And she's buying a stairway to heaven When she gets there she knows, if the stores are all closed With a word she can get what she came for Ooh, ooh, and she's buying a stairway to heaven There's a sign on the wall, but she wants to be sure 'Cause you know sometimes words have two meanings In a tree by the brook, there's a songbird who sings Sometimes all of our thoughts are misgiven It makes me wonder Makes me wonder There's a feeling I get when I look to the west And my spirit is crying for leaving In my thoughts I have seen rings of smoke through the trees And the voices of those who stand looking Makes me wonder And it makes me wonder And it's whispered that soon, if we all call the tune Then the piper will lead us to reason And a new day will dawn for those who stand long And the forests will echo with laughter If there's a bustle in your hedgerow, don't be alarmed now It's just a spring clean for the May queen Yes, there ar

In [69]:
# split()
text_list = processed_text.split()
print(text_list)

["There's", 'a', 'lady', "who's", 'sure', 'all', 'that', 'glitters', 'is', 'gold', 'And', "she's", 'buying', 'a', 'stairway', 'to', 'heaven', 'When', 'she', 'gets', 'there', 'she', 'knows,', 'if', 'the', 'stores', 'are', 'all', 'closed', 'With', 'a', 'word', 'she', 'can', 'get', 'what', 'she', 'came', 'for', 'Ooh,', 'ooh,', 'and', "she's", 'buying', 'a', 'stairway', 'to', 'heaven', "There's", 'a', 'sign', 'on', 'the', 'wall,', 'but', 'she', 'wants', 'to', 'be', 'sure', "'Cause", 'you', 'know', 'sometimes', 'words', 'have', 'two', 'meanings', 'In', 'a', 'tree', 'by', 'the', 'brook,', "there's", 'a', 'songbird', 'who', 'sings', 'Sometimes', 'all', 'of', 'our', 'thoughts', 'are', 'misgiven', 'It', 'makes', 'me', 'wonder', 'Makes', 'me', 'wonder', "There's", 'a', 'feeling', 'I', 'get', 'when', 'I', 'look', 'to', 'the', 'west', 'And', 'my', 'spirit', 'is', 'crying', 'for', 'leaving', 'In', 'my', 'thoughts', 'I', 'have', 'seen', 'rings', 'of', 'smoke', 'through', 'the', 'trees', 'And', 'the'

In [70]:
print(len(text_list))       # number of whitespace-separated tokens
print(len(set(text_list)))  # number of distinct tokens (case-sensitive)

335
179


In [71]:
text_list = processed_text.split(',')
print(text_list)

["There's a lady who's sure all that glitters is gold And she's buying a stairway to heaven When she gets there she knows", ' if the stores are all closed With a word she can get what she came for Ooh', ' ooh', " and she's buying a stairway to heaven There's a sign on the wall", " but she wants to be sure 'Cause you know sometimes words have two meanings In a tree by the brook", " there's a songbird who sings Sometimes all of our thoughts are misgiven It makes me wonder Makes me wonder There's a feeling I get when I look to the west And my spirit is crying for leaving In my thoughts I have seen rings of smoke through the trees And the voices of those who stand looking Makes me wonder And it makes me wonder And it's whispered that soon", " if we all call the tune Then the piper will lead us to reason And a new day will dawn for those who stand long And the forests will echo with laughter If there's a bustle in your hedgerow", " don't be alarmed now It's just a spring clean for the May q

In [72]:
# remove the whitespace left at the edges of each comma-separated segment
text_list = [segment.strip() for segment in text_list]
print(text_list)

["There's a lady who's sure all that glitters is gold And she's buying a stairway to heaven When she gets there she knows", 'if the stores are all closed With a word she can get what she came for Ooh', 'ooh', "and she's buying a stairway to heaven There's a sign on the wall", "but she wants to be sure 'Cause you know sometimes words have two meanings In a tree by the brook", "there's a songbird who sings Sometimes all of our thoughts are misgiven It makes me wonder Makes me wonder There's a feeling I get when I look to the west And my spirit is crying for leaving In my thoughts I have seen rings of smoke through the trees And the voices of those who stand looking Makes me wonder And it makes me wonder And it's whispered that soon", "if we all call the tune Then the piper will lead us to reason And a new day will dawn for those who stand long And the forests will echo with laughter If there's a bustle in your hedgerow", "don't be alarmed now It's just a spring clean for the May queen Ye

In [73]:
text_list = text.split('\n')
print(text_list)

["There's a lady who's sure all that glitters is gold", "And she's buying a stairway to heaven", 'When she gets there she knows, if the stores are all closed', 'With a word she can get what she came for', "Ooh, ooh, and she's buying a stairway to heaven", "There's a sign on the wall, but she wants to be sure", "'Cause you know sometimes words have two meanings", "In a tree by the brook, there's a songbird who sings", 'Sometimes all of our thoughts are misgiven', 'It makes me wonder', 'Makes me wonder', "There's a feeling I get when I look to the west", 'And my spirit is crying for leaving', 'In my thoughts I have seen rings of smoke through the trees', 'And the voices of those who stand looking', 'Makes me wonder', 'And it makes me wonder', "And it's whispered that soon, if we all call the tune", 'Then the piper will lead us to reason', 'And a new day will dawn for those who stand long', 'And the forests will echo with laughter', "If there's a bustle in your hedgerow, don't be alarmed 
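
A related built-in is `str.splitlines()`. The short sketch below, on a made-up two-line string, shows how it differs from `split('\n')` when the text ends with a newline: `split('\n')` leaves an empty string at the end of the list, while `splitlines()` does not.

```python
# Hypothetical sample text ending with a newline character
sample = "first line\nsecond line\n"

by_split = sample.split("\n")       # keeps an empty string after the last "\n"
by_splitlines = sample.splitlines() # no trailing empty string

print(by_split)
print(by_splitlines)
```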

In [75]:
# lower()
processed_text = processed_text.lower()
print(repr(processed_text))

"there's a lady who's sure all that glitters is gold and she's buying a stairway to heaven when she gets there she knows, if the stores are all closed with a word she can get what she came for ooh, ooh, and she's buying a stairway to heaven there's a sign on the wall, but she wants to be sure 'cause you know sometimes words have two meanings in a tree by the brook, there's a songbird who sings sometimes all of our thoughts are misgiven it makes me wonder makes me wonder there's a feeling i get when i look to the west and my spirit is crying for leaving in my thoughts i have seen rings of smoke through the trees and the voices of those who stand looking makes me wonder and it makes me wonder and it's whispered that soon, if we all call the tune then the piper will lead us to reason and a new day will dawn for those who stand long and the forests will echo with laughter if there's a bustle in your hedgerow, don't be alarmed now it's just a spring clean for the may queen yes, there are tw

In [76]:
text_list = processed_text.split()
print(len(text_list))
print(len(set(text_list)))

335
162
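
Lower-casing reduces the number of distinct tokens because `There's` and `there's` are no longer counted separately. Punctuation attached to words has a similar effect. The sketch below illustrates both on a short made-up sentence; stripping punctuation with `string.punctuation` is an extra step assumed here for illustration, not part of the cells above.

```python
import string

# Hypothetical sample sentence
sample = "Makes me wonder, and it makes me wonder."

tokens = sample.split()
lowered = [token.lower() for token in tokens]
# strip() only removes punctuation at the edges of a token,
# so apostrophes inside words such as "she's" would be kept
cleaned = [token.strip(string.punctuation) for token in lowered]

print(len(set(tokens)))   # distinct tokens as they appear
print(len(set(lowered)))  # distinct tokens after lower-casing
print(len(set(cleaned)))  # distinct tokens after also stripping punctuation
```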


In [77]:
# find() returns the index of the first occurrence, or -1 if the substring is absent
print(processed_text.find('wind'))

1238
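
The number printed by `find()` is a character offset into the string. A small sketch on a made-up sentence shows the three cases worth knowing: a first occurrence, a search that resumes after a given position, and a substring that is not present.

```python
# Hypothetical sample string
sample = "the wind blows and the wind cries"

first = sample.find("wind")              # index of the first occurrence
second = sample.find("wind", first + 1)  # search again, starting after the first hit
missing = sample.find("storm")           # -1 signals "not found"
occurrences = sample.count("wind")       # number of non-overlapping occurrences

print(first, second, missing, occurrences)
```

Because `-1` is also a valid index in Python (the last character), it is safer to test the result of `find()` explicitly before using it for slicing.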


### 4. Regular Expressions  
Library [re](https://docs.python.org/3/library/re.html)  
Test your regular expressions with [Regex101](https://regex101.com/)  

*Text extracted from [python.org](https://docs.python.org/3/howto/regex.html)*  
Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the `re` module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.  
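
To make the last point concrete, here is a small sketch of modifying and splitting strings with `re.sub()` and `re.split()`. The sample strings and the two patterns are chosen for illustration only.

```python
import re

# Modify: collapse any run of whitespace (spaces, tabs, newlines) into one space
messy = "too   many\n\nspaces"
tidy = re.sub(r"\s+", " ", messy)

# Split: break a string apart on runs of commas and/or whitespace
line = "Ooh, ooh, and she's buying"
parts = re.split(r"[,\s]+", line)

print(tidy)
print(parts)
```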

#### Matching Characters
Most letters and characters will simply match themselves. For example, the regular expression `test` will match the string `test` exactly. (You can enable a case-insensitive mode that would let this RE match `Test` or `TEST` as well.)
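
The case-insensitive mode mentioned above is enabled with the `re.IGNORECASE` flag. A minimal sketch, using a made-up sample string:

```python
import re

sample = "test Test TEST"

# By default, matching is case-sensitive
exact = re.findall(r"test", sample)

# With re.IGNORECASE, the same pattern matches all three spellings
any_case = re.findall(r"test", sample, flags=re.IGNORECASE)

print(exact)
print(any_case)
```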

There are exceptions to this rule; some characters are special metacharacters, and don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them or changing their meaning.

Here’s a complete list of the metacharacters:  
    `. ^ $ * + ? { } [ ] \ | ( ) `
    
And a summary:  

![](img/regex_summary.jpg "Regex")
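
To match a metacharacter literally, it has to be escaped with a backslash. The sketch below, on a made-up string, contrasts an unescaped `.` (which matches any character) with an escaped one, and shows `re.escape()`, which adds the backslashes automatically.

```python
import re

# Hypothetical sample string
sample = "Price: 3.50 or 3x50?"

# Unescaped: "." matches any character, so "3x50" is found as well
loose = re.findall(r"3.50", sample)

# Escaped: "\." matches only a literal full stop
strict = re.findall(r"3\.50", sample)

# re.escape() builds the escaped pattern from a plain string
escaped_pattern = re.escape("3.50")

print(loose)
print(strict)
print(escaped_pattern)
```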


In [9]:
# let's try
import re

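text = ('I would like to extract a phone number, for instance, this +49 178 666-6666. '
        'Another example is +55 123 4444-5555')
# \+ literal plus sign, \d{2} two digits, \s one whitespace character,
# \d{3} three digits, \d{3,4} three or four digits, - literal hyphen, \d{4} four digits
pattern = r"\+\d{2}\s\d{3}\s\d{3,4}-\d{4}"

matches = re.findall(pattern, text)
print(matches)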


['+49 178 666-6666', '+55 123 4444-5555']
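
If the individual parts of each number are needed, the same pattern can be extended with named groups. This is a sketch rather than a general phone-number parser: the group names `country`, `area` and `number` are assumptions about what the digit blocks mean, and the pattern only covers the format used in the example text.

```python
import re

text = ("I would like to extract a phone number, for instance, this +49 178 666-6666. "
        "Another example is +55 123 4444-5555")

# (?P<name>...) captures the enclosed part of the match under the given name
phone = re.compile(r"\+(?P<country>\d{2})\s(?P<area>\d{3})\s(?P<number>\d{3,4}-\d{4})")

# finditer() yields one match object per hit; groupdict() maps group names to text
records = [match.groupdict() for match in phone.finditer(text)]

for record in records:
    print(record)
```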
