# Regular Expressions 

## Practical part
Load the text "The Time Machine" by H.G. Wells from your txt-file into a string variable.
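
A minimal sketch of the loading step, assuming the file is UTF-8 encoded. The helper name `load_text` is hypothetical; the file name `time_machine.txt` is the one used further down in this notebook.

```python
from pathlib import Path


def load_text(path, encoding="utf-8"):
    """Return the whole content of a text file as one string."""
    # The encoding is an assumption; adjust it if your file differs.
    return Path(path).read_text(encoding=encoding)


# Example call (assumes the file sits next to the notebook):
# text = load_text("time_machine.txt")
```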

## String Operations

Work with the pure text string and use the Python string methods to solve the following problems:

- Find all occurrences of the phrase "The Time Machine".
- Find all occurrences of the word "time", independent of letter capitalization (so also "Time" or even "tiMe", if it appeared).
- Split your text at every occurrence of a newline "\n". You will get a list of strings.
- Afterwards, revert this operation by joining the resulting list of strings again correctly. Make sure that your result equals the original text. 
- Try to transform the text into a list of words by using the split() operation. 
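
The tasks above could be approached roughly as follows. This is only a sketch on a short made-up `sample` string (not the real novel), and `find_all` is a hypothetical helper built on `str.find`.

```python
# Made-up stand-in for the novel text.
sample = "The Time Machine\nA time for Time.\nThe Time Machine ends."

# Count the exact phrase (case-sensitive).
phrase_count = sample.count("The Time Machine")

# Case-insensitive count: lowercase the text first. This counts substrings,
# so a word like "sometimes" would be counted as well.
time_count = sample.lower().count("time")


def find_all(haystack, needle):
    """Return the start index of every occurrence of needle."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        start = haystack.find(needle, start + 1)
    return positions


phrase_positions = find_all(sample, "The Time Machine")

# Split at newlines and join again; the round trip restores the text.
lines = sample.split("\n")
rejoined = "\n".join(lines)

# split() without arguments splits at any run of whitespace.
# Punctuation stays attached to the words (e.g. "Time.").
words = sample.split()
```

Note that `split()` leaves punctuation glued to the neighbouring word, which is one reason to move on to regular expressions later.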
   



## Encodings

Some small exercises to see Unicode in action. The Python Unicode HOWTO is a useful reference:


https://docs.python.org/3/howto/unicode.html

Examples:

In [5]:
# Use a name like `text`; `str` would shadow the built-in type.
text = "This is a unicode lesson."

In [2]:
text.encode("utf-8")

b'This is a unicode lesson.'

In [3]:
text = "This is ánother unicode lesson. $%"

In [4]:
text.encode("utf-8")

b'This is \xc3\xa1nother unicode lesson. $%'

### Unicode
- Copy some text in Cyrillic or Chinese from a website you trust, e.g. OTH.
- Print the Unicode representation of each character. Use the module _unicodedata_ to get the category and the name of each character.
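
One way to inspect individual characters is sketched below. The helper `describe` and the three sample characters are assumptions for illustration; `unicodedata.category` and `unicodedata.name` are the standard-library calls.

```python
import unicodedata


def describe(ch):
    """Return (code point, general category, Unicode name) for one character."""
    code_point = f"U+{ord(ch):04X}"
    category = unicodedata.category(ch)
    # The second argument is a fallback for characters without a name.
    name = unicodedata.name(ch, "<unnamed>")
    return code_point, category, name


for ch in "дR中":
    print(ch, *describe(ch))
```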

Normalization:
- Unicode representations are not unique. Find below two different representations for the letter "á".
- Convince yourself that they represent the same letter.
- Evaluate whether the strings are equal in Python.
- Use unicodedata.normalize to achieve string equality.
- What does the normalized representation look like?
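
As a hedged illustration of the normalization steps, the sketch below builds the two representations of "á" from escape sequences (an assumption; the notebook may supply them differently) and compares them before and after `unicodedata.normalize`.

```python
import unicodedata

composed = "\u00e1"     # one code point: LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"  # "a" followed by COMBINING ACUTE ACCENT

# Both render as the same letter, but the code point sequences differ.
equal_before = composed == decomposed

# NFC composes into a single code point, NFD decomposes into base + mark.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(equal_before, nfc == composed, nfd == decomposed)
print(composed.encode("utf-8"), decomposed.encode("utf-8"))
```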

In [23]:
#TODO
import unicodedata

styled_R = 'ℜ'
normal_R = 'R'
str_a = 'á'

# Named `text` rather than `str` to avoid shadowing the built-in type.
text = "Здесь вы можете посмотреть короткую видеопрезентацию о Техническом институте Амберг-Вайден"

#print(text.encode())
#print(unicodedata.category(styled_R))
#print(unicodedata.category(normal_R))
#print(unicodedata.category('д'))

#print(unicodedata.name(styled_R))
#print(unicodedata.name(normal_R))
#print(unicodedata.name('д'))

print(text.encode())
print(unicodedata.normalize('NFC', text))

b'\xd0\x97\xd0\xb4\xd0\xb5\xd1\x81\xd1\x8c \xd0\xb2\xd1\x8b \xd0\xbc\xd0\xbe\xd0\xb6\xd0\xb5\xd1\x82\xd0\xb5 \xd0\xbf\xd0\xbe\xd1\x81\xd0\xbc\xd0\xbe\xd1\x82\xd1\x80\xd0\xb5\xd1\x82\xd1\x8c \xd0\xba\xd0\xbe\xd1\x80\xd0\xbe\xd1\x82\xd0\xba\xd1\x83\xd1\x8e \xd0\xb2\xd0\xb8\xd0\xb4\xd0\xb5\xd0\xbe\xd0\xbf\xd1\x80\xd0\xb5\xd0\xb7\xd0\xb5\xd0\xbd\xd1\x82\xd0\xb0\xd1\x86\xd0\xb8\xd1\x8e \xd0\xbe \xd0\xa2\xd0\xb5\xd1\x85\xd0\xbd\xd0\xb8\xd1\x87\xd0\xb5\xd1\x81\xd0\xba\xd0\xbe\xd0\xbc \xd0\xb8\xd0\xbd\xd1\x81\xd1\x82\xd0\xb8\xd1\x82\xd1\x83\xd1\x82\xd0\xb5 \xd0\x90\xd0\xbc\xd0\xb1\xd0\xb5\xd1\x80\xd0\xb3-\xd0\x92\xd0\xb0\xd0\xb9\xd0\xb4\xd0\xb5\xd0\xbd'
Здесь вы можете посмотреть короткую видеопрезентацию о Техническом институте Амберг-Вайден


### ASCII encoding
Encode the string as ASCII and decode the result as UTF-8.
- What happens?
- What could you use it for?
- Use different error handlers for encoding ("strict", "replace", "backslashreplace", "namereplace"; there are more).
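
To compare the error handlers side by side, something like the following sketch may help. The variable names are illustrative; the handler names are those accepted by `str.encode`.

```python
text = "This is ànother unicode lesson"

# Encode with several non-strict handlers and keep the resulting bytes.
results = {}
for handler in ("ignore", "replace", "backslashreplace",
                "xmlcharrefreplace", "namereplace"):
    results[handler] = text.encode("ascii", errors=handler)
    print(f"{handler:18} {results[handler]!r}")

# "strict" is the default and raises instead of substituting.
strict_error = None
try:
    text.encode("ascii", errors="strict")
except UnicodeEncodeError as exc:
    strict_error = exc
    print("strict raised:", exc)
```

Because every ASCII byte sequence is also valid UTF-8, decoding any of these results as UTF-8 succeeds.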



In [30]:
text = "This is ànother unicode lesson"
encoded = text.encode('ascii', 'backslashreplace')
print(encoded.decode('utf-8'))

This is \xe0nother unicode lesson


In [None]:
# TODO

## Regular Expressions

Familiarize yourself with the Python re module: https://docs.python.org/3/library/re.html

Warmup: Find all occurrences of the pattern "a[bcd]*b" in the string "abcbdab"

In [41]:
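import re

search_string = "abcbdab"
regex = re.compile(r"a[bcd]*b")

# match() only tries the pattern at the start of the string
match = regex.match(search_string)
# findall() returns all non-overlapping matches, left to right
matches = regex.findall(search_string)

print(match.group(0))
for el in matches:
    print(el)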

abcb
abcb
ab


Perform the following searches on your Time Machine text:
- Find the word "time"
- Find the word "time" with a lowercase or capital first letter ("time" and "Time")
- Print context for each occurrence (e.g. 10 chars before and after the finding)
- Are there any digits in the text?
- Count the number of full stops (".") in the text (this might give you an impression of how many sentences you have...)
- Find patterns with several (at least 2) full stops in a row (e.g. "...")
- However, some full stops do not mark the end of a sentence, for example in abbreviations like "e.g." or "i.e.", in names ("H.G. Wells"), or in patterns like "...". We may assume that the end of a sentence is marked by a full stop followed by a space. Find those patterns.
- Which words are capitalized although they are not at the beginning of a sentence? Find all such occurrences.
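
The searches above could be sketched like this on a short made-up `sample` string (a stand-in for the novel). The patterns are one possible choice, not the only one; in particular, the last pattern simply assumes that a capitalized word preceded by a lowercase letter, comma or semicolon plus one whitespace character is not sentence-initial.

```python
import re

# Made-up stand-in for the novel text.
sample = ("The Time Traveller spoke of time... It was 10 o'clock. "
          "Filby said Time is odd.")

# The word "time" in lowercase only, then with either first letter.
lower_time = re.findall(r"\btime\b", sample)
any_time = re.findall(r"\b[Tt]ime\b", sample)

# Up to 10 characters of context on each side of every match.
contexts = [sample[max(0, m.start() - 10):m.end() + 10]
            for m in re.finditer(r"\b[Tt]ime\b", sample)]

# Digits, single full stops, and runs of two or more full stops.
digits = re.findall(r"\d+", sample)
full_stops = len(re.findall(r"\.", sample))
runs = re.findall(r"\.{2,}", sample)

# Full stop followed by whitespace: a rough marker for a sentence end.
sentence_ends = re.findall(r"\.\s", sample)

# Capitalized words that follow a lowercase letter, comma or semicolon
# plus one whitespace character (an assumed heuristic).
mid_caps = re.findall(r"(?<=[a-z,;]\s)[A-Z][a-z]+", sample)

print(any_time, digits, full_stops, runs, len(sentence_ends), mid_caps)
```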





In [None]:
# TODO

#### Sentence tokenization 

Tokenize the text into sentences manually using regular expressions; see the task description in the lecture (NLP02-1).
Do not use any dedicated method from a language package here.
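
A possible starting point is sketched below. It is not the lecture's reference solution: it assumes that a sentence ends with ".", "!" or "?" followed by whitespace and a capital letter, and the `sample` string is made up.

```python
import re

# Split at whitespace that follows ., ! or ? and precedes a capital letter.
# The lookbehind and lookahead keep the punctuation and the capital letter
# inside the resulting sentences.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

sample = "The Time Traveller smiled. Was it real? H.G. Wells wrote it. Yes!"
sentences = SENTENCE_BOUNDARY.split(sample)

for sentence in sentences:
    print(repr(sentence))
```

On this sample the heuristic wrongly splits after "H.G.", which shows why abbreviations and initials need extra handling.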

In [2]:
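import re

# Group 1: a capitalized word. Group 2: the last whitespace-plus-lowercase-letter
# pair directly after it, if any. This does not yet yield whole sentences.
pattern = re.compile(r'([A-Z][a-z]*)([\s][a-z])*')
with open("time_machine.txt", 'r', encoding="utf-8") as fh:
    content = fh.read()
    # findall returns one (group 1, group 2) tuple per match
    sentences = pattern.findall(content)
    print(sentences)
    print(len(sentences))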
    
    

[('The', ''), ('Project', ''), ('Gutenberg', ' e'), ('Book', ' o'), ('The', ''), ('Time', ''), ('Machine', ''), ('H', ''), ('G', ''), ('Wells', ''), ('This', ' e'), ('Book', ' i'), ('United', ''), ('States', ' a'), ('You', ' m'), ('Project', ''), ('Gutenberg', ''), ('License', ' i'), ('Book', ' o'), ('If', ' y'), ('United', ''), ('States', ''), ('Book', ''), ('Title', ''), ('The', ''), ('Time', ''), ('Machine', ''), ('Author', ''), ('H', ''), ('G', ''), ('Wells', ''), ('Release', ''), ('Date', ''), ('July', ''), ('Book', ''), ('Most', ' r'), ('October', ''), ('Language', ''), ('English', ''), ('Character', ' s'), ('U', ''), ('T', ''), ('F', ''), ('S', ''), ('T', ''), ('A', ''), ('R', ''), ('T', ''), ('O', ''), ('F', ''), ('T', ''), ('H', ''), ('E', ''), ('P', ''), ('R', ''), ('O', ''), ('J', ''), ('E', ''), ('C', ''), ('T', ''), ('G', ''), ('U', ''), ('T', ''), ('E', ''), ('N', ''), ('B', ''), ('E', ''), ('R', ''), ('G', ''), ('E', ''), ('B', ''), ('O', ''), ('O', ''), ('K', ''), ('T',