
# **Regular Expressions**

This notebook helps students improve their coding skills and shows code examples for regular expressions (also called “RegEx”).

For each topic, the notebook provides three kinds of content:
- Definition
- Code examples
- Coding exercises (TODOs)

In the coding exercises, you write code to solve a given problem. A sample solution is provided at the end of the notebook if you need it. However, we recommend that you try to find a solution before looking at the answer.




## **1- Introduction**

Regular expressions in Python are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions. The most significant ones are:
- ``match()``:$\qquad$ Determine if the regular expression matches at the beginning of the string.<br>
- ``search()``:$\qquad$ Scan a string, looking for any location where this regular expression matches.<br>
- ``findall()``:$\qquad$ Find all substrings where the regular expression matches, and return them as a list.<br>
- ``finditer()``:$\qquad$ Find all substrings where the regular expression matches, and return them as an iterator.<br>
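
The following cell is a minimal sketch of how the four methods differ. The pattern `an` and the sample sentence are assumptions chosen only for illustration; they do not appear elsewhere in this notebook.

```python
import re

# Hypothetical sample pattern and sentence, chosen only for illustration
reg = re.compile(r'an')
sentence = 'banana and mango'

# match() only tries position 0, where the sentence starts with 'b'
print(reg.match(sentence))                            # None

# search() returns the first match anywhere in the string
print(reg.search(sentence).span())                    # (1, 3)

# findall() returns all non-overlapping matches as a list of strings
print(reg.findall(sentence))                          # ['an', 'an', 'an', 'an']

# finditer() yields one match object per match
print([m.start() for m in reg.finditer(sentence)])    # [1, 3, 7, 12]
```

`findall()` is convenient when only the matched text is needed, while `finditer()` keeps the positions available through the match objects.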

#### **Content:**
This notebook shows examples for the following meta-character settings:
- Anchors
- Quantifiers
- Disjunctions
- Character Classes

Further Python documentation:
- Regular expression syntax in the library ``re``: https://docs.python.org/3/library/re.html
- Regular Expression HOWTO: https://docs.python.org/3/howto/regex.html#regex-howto

## **2- Anchors mark a position in the string**

Anchors do not match any character. Instead, they match a position before, after, or between characters. They can be used to **“anchor”** the regex match at a certain position. The caret **^** matches the position before the first character in the string. Applying **^a** to **abc** matches **a**. 
**^b** does not match **abc** at all, because the **b** cannot be matched right after the start of the string, matched by **^**. 

Similarly, $\$$ matches right after the last character in the string. c$\$$ matches c in abc, while a$\$$ does not match at all, because a is not followed by the end of the string.


- $\text{^}$ : $\quad$ means the beginning of the string
- $\text{\$}$: $\quad$  means the end of the string
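
As a quick check of the anchor behaviour on the string `abc`, the sketch below applies four anchored patterns with `re.search()`. The variable name `results` is an assumption used only for this illustration.

```python
import re

# Apply each anchored pattern to the string 'abc' and record the result
results = {}
for pattern in [r'^a', r'^b', r'c$', r'a$']:
    m = re.search(pattern, 'abc')
    results[pattern] = m.group() if m else None
    print(pattern, '->', results[pattern])

# Printed result:
# ^a -> a
# ^b -> None
# c$ -> c
# a$ -> None
```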

#### **Code Examples:**

In [None]:
# Load 're' module for all following code cells
import re

In [None]:
# Define regular expression
reg = re.compile('^banana')

# Does the sentence "bananas are cheap" begin with "banana"?
p = reg.search("bananas are cheap")

# Show results:
if p:
    # Print the string which matches the regular expression
    print('matching string:\t' + p.group())
    
    # Print the start index and the end index (exclusive) of the match
    print('index of first and last matching character in the sentence: \t start: {} \t end: {}'.format(p.start(), p.end()))
    
    # Print the start and end position as tuple:
    print('index of first and last matching character as tuple: \t{}'.format(p.span()))
else:
    print("no match")


matching string:	banana
index of first and last matching character in the sentence: 	 start: 0 	 end: 6
index of first and last matching character as tuple: 	(0, 6)


In [None]:
# Check if the sentence "I have a banana" ends with "banana"
reg = re.compile('banana$')

p = reg.search("I have a banana")

# Show results:

if p:
    # Print the string which matches the regular expression
    print('matching string:\t' + p.group())

    # Print the start index and the end index (exclusive) of the match
    print('index of first and last matching character in the sentence: \t start: {} \t end: {}'.format(p.start(), p.end()))

    # Print the start and end position as tuple:
    print('index of first and last matching character as tuple: \t{}'.format(p.span()))
else:
    print("no match")

matching string:	banana
index of first and last matching character in the sentence: 	 start: 9 	 end: 15
index of first and last matching character as tuple: 	(9, 15)


In [None]:
# '^' means the beginning of a string
# '$' means the end of a string

reg = re.compile('^banana$')
 
# Is each of the following strings exactly 'banana'? (search() returns None if not)
print(reg.search('bananas are cheap'))

print(reg.search('I have a banana'))

print(reg.search('banana').group())

None
None
banana


#### Alternative method: Use module functions for matches in the string variable

Regular expressions are compiled into pattern objects, which have methods for operations such as searching for pattern matches or performing string substitutions.<br>
The ``re`` module also defines module-level functions, constants, and an exception. Some of these functions, such as ``re.search()`` and ``re.match()``, are simplified versions of the full-featured methods for compiled regular expressions: they take the pattern as their first argument. Most non-trivial applications use the compiled form.

In [None]:
# Alternative method: Use the module function search() for matches in the string variable
p = re.search('^banana', 'bananas are cheap')

# Show results:
if p:
    # Print the string which matches the regular expression
    print('matching string:\t' + p.group())
    
    # Print the start index and the end index (exclusive) of the match
    print('index of first and last matching character in the sentence: \t start: {} \t end: {}'.format(p.start(), p.end()))
    
    # Print the start and end position as tuple:
    print('index of first and last matching character as tuple: \t{}'.format(p.span()))
else:
    print("no match")

matching string:	banana
index of first and last matching character in the sentence: 	 start: 0 	 end: 6
index of first and last matching character as tuple: 	(0, 6)


In [None]:
# For this example, match() returns the same result as search().
#
# Note:
# match() only checks whether the regular expression matches at the beginning
# of the string (index 0), while search() scans the whole string for a match.
# A match that starts at any later position is not reported by match().
#
# See also "search() vs. match()": https://docs.python.org/3/library/re.html#search-vs-match

p = re.match('^banana', 'bananas are cheap')

if p:
  print('matching string: ' + p.group())

matching string: banana
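
The difference between the two functions becomes visible when the pattern occurs later in the string. The cell below is a small sketch that reuses the sentence from the anchor examples; the variable name `sentence` is an assumption for this illustration.

```python
import re

sentence = 'I have a banana'

# match() fails because 'banana' does not start at index 0
print(re.match('banana', sentence))            # None

# search() scans the whole sentence and finds the word at index 9
print(re.search('banana', sentence).span())    # (9, 15)

# With a leading '^', search() is also restricted to the start of the string
print(re.search('^banana', sentence))          # None
```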


#### **TODO:**

##### **Exercise-1:**

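Use the ``re.match()`` function to search for a pattern in each of the following lists of strings: two words that both start with the letter d, separated by a single non-word character. For each string that matches, print the matching groups. If no string in a list matches the pattern, print "No match!!".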

list1 = ["dog dot", "do don't", "dumb-dumb", "no-nothing", "dados-paces","know dacha"]

list2 = ["radio dot", "safe don't", "dumb-gaffe", "tack-dados", "dados-wade"]




##### **Exercise-2:**
Use the ``re.findall()`` function to extract all numbers from the following text. The function returns a list of strings containing all matches. If the pattern is not found, it returns an empty list.

Text sample: "In 1970, William A. Woods introduced the augmented transition network (ATN) to represent natural language input.[4] Instead of phrase structure rules ATNs used an equivalent set of finite state automata that were called recursively. ATNs and their more general format called 'generalized ATNs' continued to be used for a number of years. During the 1970s many programmers began to write 'conceptual ontologies', which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert 1981). During this time, many chatterbots were written including PARRY, Racter, and Jabberwacky."

<a id='001'></a>

## **3- Quantifiers indicate the number of repetitions of the previous character**

- $\text{*}$: $\quad$  means 0 or more repetitions
- $\text{+}$: $\quad$ means 1 or more repetitions
- $\text{?}$: $\quad$ means 0 or 1 repetitions
- If more precise quantifiers are needed, the number of repetitions can be written in curly brackets, e.g. ``a{3}`` for exactly 3 repetitions or ``a{2,7}`` for 2 to 7 repetitions.

In [None]:
# '?' means 0 or 1 repetitions
reg = re.compile('ba?nana')

print(reg.search('banana').group())
print(reg.search('bnana').group())
print(reg.search('baaaaanana'))

banana
bnana
None


In [None]:
# '+' means 1 or more repetitions
reg = re.compile('ba+nana')

print(reg.search('banana').group())
print(reg.search('bnana'))
print(reg.search('baaaaanana').group())

banana
None
baaaaanana


In [None]:
# '*' means 0 or more repetitions
reg = re.compile('ba*nana')

print(reg.search('banana').group())
print(reg.search('bnana').group())
print(reg.search('baaaaanana').group())

banana
bnana
baaaaanana


In [None]:
# 'a{2,7}' means that the letter 'a' must repeat at least 2 times and 
# at maximum 7 times for the string to match the pattern
reg = re.compile('a{2,7}')

print(reg.search('banana'))
print(reg.search('bnana'))
print(reg.search('baaaaanana').group())

None
None
aaaaa
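
Besides a range such as `{2,7}`, curly brackets also accept an exact count and an open upper bound. The following sketch reuses the string `baaaaanana` from the cells above; the variable name `word` is an assumption for this illustration.

```python
import re

word = 'baaaaanana'   # contains a run of five 'a' characters

# {3}: exactly three repetitions, so only the first three 'a' are consumed
print(re.search(r'a{3}', word).group())     # aaa

# {2,}: at least two repetitions, as many as are available
print(re.search(r'a{2,}', word).group())    # aaaaa

# {0,2}: at most two repetitions; 'b' is followed by five 'a' before 'n'
print(re.search(r'ba{0,2}n', word))         # None
```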


## **4- Disjunctions represent a logical OR**

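Disjunctions are written in square brackets ``[ ]`` or separated by the pipe sign ``|``. Square brackets match exactly one of the listed characters, while the pipe sign separates whole alternatives.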

In [None]:
# 'b[aou]nana' matches strings which have either 'a', 'o' or 'u' as 2nd character.
reg = re.compile('b[aou]nana')

print(reg.search('banana').group())
print(reg.search('bonana').group())
print(reg.search('bunana').group())

banana
bonana
bunana


In [None]:
# 'b(a|o|u)nana' matches strings which have either 'a', 'o' or 'u' as 2nd character.
reg = re.compile('b(a|o|u)nana')

print(reg.search('banana').group())
print(reg.search('bonana').group())
print(reg.search('bunana').group())

banana
bonana
bunana


In [None]:
# 'banana|bonana|bunana' matches strings which contain either 'banana', 'bonana' or 'bunana'.
reg = re.compile('banana|bonana|bunana')

print(reg.search('banana').group())
print(reg.search('bonana').group())
print(reg.search('bunana').group())

banana
bonana
bunana
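
The three notations above match the same strings, but they can behave differently with `findall()`, because parentheses also create a capturing group. The cell below is a sketch of this difference; the sample text and the non-capturing variant `(?:...)` are additions for illustration and are not used elsewhere in this notebook.

```python
import re

text = 'banana bonana bunana'

# Square brackets create no group, so findall() returns the whole matches
print(re.findall(r'b[aou]nana', text))       # ['banana', 'bonana', 'bunana']

# With one capturing group, findall() returns only the group contents
print(re.findall(r'b(a|o|u)nana', text))     # ['a', 'o', 'u']

# A non-capturing group (?:...) returns the whole matches again
print(re.findall(r'b(?:a|o|u)nana', text))   # ['banana', 'bonana', 'bunana']
```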


## **5- Character classes represent certain groups of characters.**

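* $\text{\d or [0-9]}\quad \quad \quad \quad$ :   matches digits (for ``str`` patterns, ``\d`` also matches other Unicode decimal digits)
* $\text{\w or [0-9A-Za-z_]}\quad$ :  matches alphanumeric characters and underscores (for ``str`` patterns, ``\w`` also matches non-ASCII letters unless the ``re.ASCII`` flag is set)
* $\text{\s} \qquad \qquad \qquad \qquad$:  matches whitespace characters such as space, tab, and newline
* $. \qquad \qquad \qquad \qquad$ :  matches any character except a newline (unless the ``re.DOTALL`` flag is set)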
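
A short sketch of the four character classes in action. The sample string is hypothetical and chosen so that each class has something to match.

```python
import re

sample = 'Room 42_b, floor 3'   # hypothetical sample string

# \d matches single digits
print(re.findall(r'\d', sample))     # ['4', '2', '3']

# \w+ matches runs of letters, digits and underscores
print(re.findall(r'\w+', sample))    # ['Room', '42_b', 'floor', '3']

# \s matches each whitespace character
print(re.findall(r'\s', sample))     # [' ', ' ', ' ']

# . matches any single character, but not a newline by default
print(re.findall(r'R..m', sample))   # ['Room']
print(re.search(r'a.b', 'a\nb'))     # None
```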

In [None]:
# This regular expression matches strings which consist of 'Hello', a space and exactly one following word.
# The raw string (r'...') keeps Python from treating '\w' as a string escape sequence.
reg = re.compile(r'^Hello \w+$')

print(reg.search('Hello World').group())
print(reg.search('Hello new World'))

Hello World
None


In [None]:
# This expression matches strings which start with 'Hello' followed by at least one more character
reg = re.compile('^Hello.+$')

print(reg.search('Hello World').group())
print(reg.search('Hello new World').group())

Hello World
Hello new World


## **6- Find regular expressions in files**

In [None]:
# Create sample text file

text ="This is a sample text which contains several characters.\n\
For example, I bought cheap bananas today."

# Write content in output file
with open("/content/regText.txt", 'w') as file:
  file.write(text)

In [None]:
# Find the expression 'text' in sample file 'regText.txt'

!grep text '/content/regText.txt'

This is a sample text which contains several characters.


In [None]:
# Replace the first occurrence of 'banana' in each line with 'test'.
# sed prints the result; the file regText.txt itself is not modified.

!sed -e 's/banana/test/' '/content/regText.txt'

This is a sample text which contains several characters.
For example, I bought cheap tests today.
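
The same two operations can also be sketched in Python with the `re` module instead of the shell tools. This is only an illustration under assumptions: the file is written to a temporary directory rather than `/content`, and the names `path`, `matching_lines` and `replaced` are hypothetical.

```python
import os
import re
import tempfile

text = ("This is a sample text which contains several characters.\n"
        "For example, I bought cheap bananas today.")

# Write the sample text to a temporary file (assumed location, not /content)
path = os.path.join(tempfile.mkdtemp(), 'regText.txt')
with open(path, 'w') as file:
    file.write(text)

# Read the file back as a list of lines
with open(path) as file:
    lines = file.read().splitlines()

# Like grep: keep the lines that contain a match for 'text'
matching_lines = [line for line in lines if re.search(r'text', line)]
print(matching_lines)

# Like sed 's/banana/test/': replace the first match in each line
replaced = [re.sub(r'banana', 'test', line, count=1) for line in lines]
print('\n'.join(replaced))
```

As with the `sed` call, the replacement only affects the printed result; the file on disk keeps its original content.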

#### **Solutions of the Coding Exercises**


In [39]:
# EXERCISE-1:

import re

# Given strings.
list1 = ["dog dot", "do don't", "no-nothing", "dumb-dumb", "dados-paces", "know dacha"]
list2 = ["radio dot", "safe don't", "dumb-gaffe", "tack-dados", "dados-wade"]

# Pattern: two words that both start with a lowercase 'd',
# separated by exactly one non-word character.
#   d      Lowercase letter d.
#   \w+    One or more word characters.
#   \W     A non-word character.
# The raw string (r"...") keeps the backslashes intact.
pattern = re.compile(r"(d\w+)\W(d\w+)")


def print_matches(name, strings):
    """Print the groups of every string that matches the pattern at its start."""
    print("The result of {}: ".format(name))
    found = False
    for element in strings:
        m = pattern.match(element)
        if m:
            print(m.groups())
            found = True
    if not found:
        print("No match!!")


print_matches("List1", list1)
print_matches("List2", list2)

# Returns:
#
# The result of List1: 
# ('dog', 'dot')
# ('do', 'don')
# ('dumb', 'dumb')
# The result of List2: 
# No match!!


The result of List1: 
('dog', 'dot')
('do', 'don')
('dumb', 'dumb')
The result of List2: 
No match!!


In [3]:
#EXERCISE-2:

import re

string = "In 1970, William A. Woods introduced the augmented transition network (ATN) to represent natural language input.[4] Instead of phrase structure rules ATNs used an equivalent set of finite state automata that were called recursively. ATNs and their more general format called 'generalized ATNs' continued to be used for a number of years. During the 1970s many programmers began to write 'conceptual ontologies', which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert 1981). During this time, many chatterbots were written including PARRY, Racter, and Jabberwacky."
# Raw string, so that '\d' is passed to the regex engine unchanged
pattern = r'\d+'

result = re.findall(pattern, string) 
print(result)

# Output: ['1970', '4', '1970', '1975', '1978', '1978', '1976', '1977', '1979', '1981']

['1970', '4', '1970', '1975', '1978', '1978', '1976', '1977', '1979', '1981']


Copyright © 2021 IU International University of Applied Sciences

Sources:
- https://www.regular-expressions.info/anchors.html
- https://thedeveloperblog.com/re
- https://www.programiz.com/python-programming/regex