# Natural Language Processing - Introduction
### [Digital Methodology in the Humanities and Cultural Sciences](https://www.digitale-methodik.uni-mainz.de/) / Winter semester 2021  


## Content  

- **Introduction (20/10/2021)**
  - text formats
  - reading files
  - Python built-in methods for strings
  - regular expressions

### 1. Text formats
*Text extracted from [Applied Language Technology](https://applied-language-technology.readthedocs.io/en/latest/notebooks/part_ii/01_basic_text_processing.html#what-is-rich-text)*

#### What is rich text?

Word processors, such as Microsoft Word, produce [rich text](https://en.wikipedia.org/wiki/Formatted_text), that is, text whose appearance has been formatted or styled in a specific way.
Rich text allows defining specific visual styles for document elements. Headers, for example, may use a different font than the body text, which may in turn feature *italic* or **bold** fonts for emphasis. Rich text can also include various types of images, tables and other document elements.
Rich text is the default format for modern what-you-see-is-what-you-get word processors.

#### What is plain text?

Unlike rich text, [plain text](https://en.wikipedia.org/wiki/Plain_text) does not contain any information about the visual appearance of text, but consists of *characters* only.
Characters, in this context, are letters, numbers, punctuation marks, spaces and line breaks.
The definition of plain text is fairly loose, but generally the term refers to text which lacks any formatting or style information.


#### What is structured text?

Structured text may be thought of as a special case of plain text, which includes character sequences that are used to format the text for display.
Forms of structured text include text described using mark-up languages such as XML, Markdown or HTML.  
The example below shows a plain text sentence wrapped into HTML tags for paragraphs `<p>`. 
The opening tag `<p>` and the closing tag `</p>` tell the computer that any content placed between these tags forms a paragraph.

``` 
<p>This is an example sentence.</p>
```

This information is used for structuring plain text when *rendering* text for display, typically by styling its appearance.

#### Why does this matter?

If you collect a bunch of texts for a corpus, chances are that some originated in rich or structured format, depending on the medium these texts came from.

If you collect printed documents that have been digitized using a technique such as [optical character recognition](https://en.wikipedia.org/wiki/Optical_character_recognition) (OCR) and subsequently converted from rich into plain text, the removal of formatting information is likely to introduce errors into the resulting plain text. Working with this kind of "dirty" OCR can have an impact on the results of text analysis (Hill & Hengchen [2019](https://doi.org/10.1093/llc/fqz024)).

If you collect digital documents by scraping discussion forums or websites, you are likely to encounter traces of structured text in the form of markup tags, which may be carried over to plain text during conversion.
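
As a minimal sketch of how such leftover tags might be removed, the example below uses `html.parser.HTMLParser` from the Python standard library to keep only the text between tags. The class name `TagStripper` and the sample string are made up for illustration; real scraped pages usually need more care (scripts, character entities, layout elements).

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper that collects only the text between markup tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # called for every run of text that is not part of a tag
        self.chunks.append(data)

    def get_text(self):
        return ''.join(self.chunks)


html = '<p>This is an <b>example</b> sentence.</p>'
stripper = TagStripper()
stripper.feed(html)
stripper.close()
plain = stripper.get_text()
print(plain)
```

The parser only drops the tags; it does not judge whether the remaining text is clean, so the result should still be inspected.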

Plain text is by far the most interchangeable format for text, as it is easy to read for computers. This is why programming languages work with plain text, and if you plan to use programming languages to manipulate text, you need to know what plain text is. 

To summarise, when working with plain text, you may need to deal with traces left by conversion from rich or structured text.

### 2. Reading documents  

Built-in function [open()](https://docs.python.org/3/library/functions.html#open)

In [None]:
# reading the whole document at once
with open('data/stairwaytoheaven.txt', 'r', encoding='UTF-8') as f:
    text = f.read()
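
The `encoding` argument matters more than it may seem. The sketch below writes a small temporary file (the file name and sample text are invented so the example runs without the course data) and reads it back twice: once with the encoding it was written in, once with a different one.

```python
import os
import tempfile

sample = 'Käse, Brötchen und Grüße\nsecond line\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.txt')

    # write the sample so that there is something to read
    with open(path, 'w', encoding='UTF-8') as f:
        f.write(sample)

    # reading with the matching encoding returns the original text
    with open(path, 'r', encoding='UTF-8') as f:
        text_utf8 = f.read()

    # reading with another encoding raises no error here,
    # but the non-ASCII characters come out wrong
    with open(path, 'r', encoding='latin-1') as f:
        text_latin1 = f.read()

print(text_utf8 == sample)
print(text_latin1 == sample)
print(f.closed)  # the with-statement has already closed the file
```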

In [None]:
# f is the file object (already closed here), not the file's contents
print(f)

In [None]:
print(len(text))
print(text)

In [None]:
print(text[:25])
# the string is like a list in which each character is an element

In [None]:
# reading the document line by line
with open('data/stairwaytoheaven.txt', 'r', encoding='UTF-8') as f:
    text = []
    for line in f:
        text.append(line)
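
When a file is read line by line, each line keeps its trailing newline character. The short sketch below illustrates this; `io.StringIO` stands in for an opened file and the sample lines are invented, so nothing needs to exist on disk.

```python
import io

# io.StringIO behaves like a text file opened for reading
f = io.StringIO('first line\nsecond line\nthird line\n')

# iterating over the file yields one line at a time, newline included
raw_lines = [line for line in f]

# rstrip('\n') removes only the trailing newline character
clean_lines = [line.rstrip('\n') for line in raw_lines]

print(raw_lines)
print(clean_lines)
```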

In [None]:
print(len(text))
print(text)

In [None]:
print(text[:25])
# now each element of the list is a string holding one line of the file

In [None]:
print(text[0])
print(text[0][:5])
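
The same slicing syntax works on strings and on lists, but it returns different things. A small illustration with invented sample values:

```python
sample = 'Plain text'
lines = ['Plain text', 'is simple']

first_chars = sample[:5]      # slicing a string gives a shorter string
first_items = lines[:1]       # slicing a list gives a shorter list
nested = lines[0][:5]         # index the list first, then slice the string
last_chars = sample[-4:]      # negative indices count from the end

print(first_chars)
print(first_items)
print(nested)
print(last_chars)
```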

### 3. Manipulating strings  
Several built-in [string methods](https://docs.python.org/3/library/stdtypes.html#string-methods)

In [None]:
# reading the whole document at once
with open('data/stairwaytoheaven.txt', 'r', encoding='UTF-8') as f:
    text = f.read()

In [None]:
# repr() shows special characters such as \n explicitly
print(repr(text))

In [None]:
# replace()
processed_text = text.replace('\n',' ')
print(repr(processed_text))

In [None]:
# strip() for leading or trailing whitespace
processed_text = '    ' + processed_text
print(repr(processed_text))
print('\n')
processed_text = processed_text.strip()
print(repr(processed_text))
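
`strip()` has two one-sided relatives, `lstrip()` and `rstrip()`, and all three accept an optional set of characters to remove instead of whitespace. A brief sketch with an invented string:

```python
padded = '  \t hello world \n'

both = padded.strip()     # whitespace removed on both sides
left = padded.lstrip()    # only leading whitespace removed
right = padded.rstrip()   # only trailing whitespace removed
dots = '...hello...'.strip('.')  # remove dots instead of whitespace

print(repr(both))
print(repr(left))
print(repr(right))
print(repr(dots))
```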

In [None]:
# split()
text_list = processed_text.split()
print(text_list)

In [None]:
print(len(text_list))
print(len(set(text_list)))
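
The two numbers above are the count of all tokens and the count of distinct tokens. If the frequency of each token is also of interest, `collections.Counter` from the standard library is one possible tool. The sentence below is an invented sample, not the course text.

```python
from collections import Counter

sample = 'the cat saw the other cat'
tokens = sample.split()

n_tokens = len(tokens)       # every token, repeats included
n_types = len(set(tokens))   # distinct tokens only
counts = Counter(tokens)     # maps each token to its frequency

print(n_tokens, n_types)
print(counts.most_common(2))
```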

In [None]:
text_list = processed_text.split(',')
print(text_list)

In [None]:
text_list = [sentence.strip() for sentence in text_list]
print(text_list)

In [None]:
text_list = text.split('\n')
print(text_list)
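
Splitting on `'\n'` leaves an empty string at the end of the list when the text ends with a newline. `splitlines()` is a built-in alternative that avoids this, as the following sketch with an invented string shows:

```python
sample = 'first line\nsecond line\n'

by_newline = sample.split('\n')   # trailing newline produces a final ''
by_lines = sample.splitlines()    # no empty element at the end

print(by_newline)
print(by_lines)
```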

In [None]:
# lower()
processed_text = processed_text.lower()
print(repr(processed_text))

In [None]:
text_list = processed_text.split()
print(len(text_list))
print(len(set(text_list)))

In [None]:
# find() returns the index of the first occurrence, or -1 if there is none
print(processed_text.find('wind'))
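
A few related string operations are often used alongside `find()`. The sketch below uses an invented sentence to show the return value for a missing substring, a search that starts at a later position, the `in` operator and `count()`.

```python
sample = 'the wind blows and the wind howls'

first = sample.find('wind')       # index of the first occurrence
missing = sample.find('storm')    # -1 signals "not found"
second = sample.find('wind', 5)   # start searching at index 5
present = 'wind' in sample        # True/False instead of an index
total = sample.count('wind')      # number of non-overlapping occurrences

print(first, missing, second, present, total)
```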

### 4. Regular Expressions  
Library [re](https://docs.python.org/3/library/re.html)  
Test your regular expressions with [Regex101](https://regex101.com/)  

*Text extracted from [python.org](https://docs.python.org/3/howto/regex.html)*  
Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the `re` module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.  
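
The questions mentioned above map onto functions of the `re` module. The following sketch is only an illustration: the sample string is invented and the e-mail pattern is deliberately simplistic, not a validator for real addresses.

```python
import re

sample = 'Contact: alice@example.org, bob@example.org'
pattern = r'\w+@\w+\.\w+'  # simplistic e-mail pattern, for illustration only

# "Does this string match the pattern?" (the whole string must match)
whole = re.fullmatch(pattern, 'alice@example.org') is not None

# re.match() only looks at the beginning of the string
at_start = re.match(pattern, sample) is not None

# "Is there a match for the pattern anywhere in this string?"
found = re.search(pattern, sample).group()

# modifying a string and splitting it apart
masked = re.sub(pattern, '<email>', sample)
pieces = re.split(r',\s*', 'a, b,c')

print(whole, at_start, found)
print(masked)
print(pieces)
```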

#### Matching Characters
Most letters and characters will simply match themselves. For example, the regular expression `test` will match the string `test` exactly. (You can enable a case-insensitive mode that would let this RE match `Test` or `TEST` as well.)
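
The case-insensitive mode mentioned in parentheses is switched on with the `re.IGNORECASE` flag. A small sketch with an invented word list:

```python
import re

words = ['test', 'Test', 'TEST', 'toast']

# without a flag, only the exact spelling matches
strict = [w for w in words if re.fullmatch('test', w)]

# with re.IGNORECASE, upper and lower case are treated alike
relaxed = [w for w in words if re.fullmatch('test', w, flags=re.IGNORECASE)]

print(strict)
print(relaxed)
```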

There are exceptions to this rule; some characters are special metacharacters, and don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them or changing their meaning.

Here’s a complete list of the metacharacters:  
    `. ^ $ * + ? { } [ ] \ | ( ) `
    
And a summary:  

![](img/regex_summary.jpg "Regex")
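
To match a metacharacter literally, it has to be escaped with a backslash. The sketch below, using invented strings, contrasts an unescaped and an escaped dot and shows `re.escape()`, which adds the backslashes automatically.

```python
import re

sample = '3.50 3x50'

# an unescaped '.' matches any character, so both items are found
loose = re.findall(r'3.50', sample)

# an escaped '.' matches only a literal dot
exact = re.findall(r'3\.50', sample)

# re.escape() escapes every metacharacter in a string
escaped = re.escape('(approx.)')

print(loose)
print(exact)
print(escaped)
```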


In [None]:
# let's try
import re

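# adjacent string literals are joined, which avoids stray spaces from a line continuation
text = ('I would like to extract a phone number, for instance, this +49 178 666-6666. '
        'Another example is +55 123 4444-5555')
# + and 2-digit country code, 3-digit prefix, then 3-4 digits, a hyphen and 4 digits
pattern = r"\+\d{2}\s\d{3}\s\d{3,4}-\d{4}"

matches = re.findall(pattern, text)
print(matches)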
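
If the individual parts of each number are needed, the pattern can be extended with named groups. This is a sketch of one possible variant, not part of the original exercise; the group names `country`, `prefix` and `number` are chosen here for illustration.

```python
import re

text = 'Call +49 178 666-6666 or +55 123 4444-5555.'

# (?P<name>...) captures the enclosed part under the given name
pattern = re.compile(
    r'\+(?P<country>\d{2})\s(?P<prefix>\d{3})\s(?P<number>\d{3,4}-\d{4})'
)

# finditer() yields match objects instead of plain strings
parts = [m.groupdict() for m in pattern.finditer(text)]

for part in parts:
    print(part['country'], part['prefix'], part['number'])
```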
