<a href="https://colab.research.google.com/github/Lin777/PythonAndOtherTools/blob/master/RegularExpressions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regular Expressions

Regular expressions, also called regex or regexp, consist of patterns that describe sets of character strings.


## re Module

Working with regular expressions in Python is done through the re module.

### Functions

To find a pattern, the re module provides several functions that search a string for a match.

- **re.match**: Searches only at the beginning of the string and returns a match object if found, otherwise None.
- **re.search**: Returns a match object for the first match anywhere in the string, including multiline strings, otherwise None.
- **re.findall**: Returns a list containing all matches.
- **re.split**: Splits the string at each match and returns a list.
- **re.sub**: Replaces one or more matches within a string.


**Match**

```
# syntax
re.match(substring, string, re.I)
# substring is a string or a pattern, string is the text we search for the pattern, re.I is the ignore-case flag
```

In [1]:
import re

txt = 'I love to learn python and smalltalk'
# It returns a match object with a span and the matched text
match = re.match('I love to learn', txt, re.I)
print(match)  # <re.Match object; span=(0, 15), match='I love to learn'>
# (Python versions before 3.7 print the type as _sre.SRE_Match)
# We can get the starting and ending position of the match as a tuple using span
span = match.span()
print(span)     # (0, 15)
# Let's find the start and stop position from the span
start, end = span
print(start, end)  # 0 15
substring = txt[start:end]
print(substring)       # I love to learn

<_sre.SRE_Match object; span=(0, 15), match='I love to learn'>
(0, 15)
0 15
I love to learn
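
Because re.match only looks at the beginning of the string, it returns None when the pattern occurs later, and calling span() on None raises an AttributeError. A minimal sketch of guarding against that, reusing the sentence above (the helper name find_span is hypothetical, not part of the re module):

```python
import re

txt = 'I love to learn python and smalltalk'

def find_span(pattern, text):
    """Return the (start, end) span of a match at the start of text, or None."""
    match = re.match(pattern, text, re.I)
    # re.match returns None when the pattern is not at the very beginning
    return match.span() if match else None

at_start = find_span('I love', txt)
not_at_start = find_span('python', txt)
print(at_start)       # (0, 6)
print(not_at_start)   # None, because 'python' is not at the start
```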


**Search**

```
# syntax
re.search(substring, string, re.I)
# substring is a pattern, string is the text we search for the pattern, re.I is the ignore-case flag
```

In [2]:
import re

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# It returns a match object with a span and the matched text
match = re.search('first', txt, re.I)
print(match)  # <re.Match object; span=(100, 105), match='first'>
# (Python versions before 3.7 print the type as _sre.SRE_Match)
# We can get the starting and ending position of the match as a tuple using span
span = match.span()
print(span)     # (100, 105)
# Let's find the start and stop position from the span
start, end = span
print(start, end)  # 100 105
substring = txt[start:end]
print(substring)       # first

<_sre.SRE_Match object; span=(100, 105), match='first'>
(100, 105)
100 105
first
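
Slicing the text with the span works, but the match object can also report the matched text and its positions directly. A short sketch using the group(), start() and end() methods of the match object on the same text:

```python
import re

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

match = re.search('first', txt, re.I)
if match:  # search returns None when nothing matches
    print(match.group())   # first
    print(match.start())   # 100
    print(match.end())     # 105
```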


**Findall**

Returns all the matches as a list


In [3]:
txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# It returns a list
matches = re.findall('language', txt, re.I)
print(matches)  # ['language', 'language']

['language', 'language']


As you can see, the word language was found two times in the string. Let's practice some more. Now we will look for both Python and python words in the string:

In [4]:
txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# It returns a list
matches = re.findall('python', txt, re.I)
print(matches)  # ['Python', 'python']

['Python', 'python']
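
findall returns only the matched strings. When the positions matter too, re.finditer (a function of the re module not covered in the list above) yields one match object per occurrence. A small sketch on the same text:

```python
import re

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# finditer yields match objects, so each hit carries its text and its span
found = [(m.group(), m.span()) for m in re.finditer('python', txt, re.I)]
print(found)  # [('Python', (0, 6)), ('python', (87, 93))]
```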


**Replacing a substring**



In [6]:
txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# Pass the flag by keyword: the fourth positional argument of re.sub is count, not flags
match_replaced = re.sub('Python|python', 'Smalltalk', txt, flags=re.I)
print(match_replaced)  # Smalltalk is the most beautiful language that a human being has ever created.
# OR
match_replaced = re.sub('[Pp]ython', 'Smalltalk', txt, flags=re.I)
print(match_replaced)  # Smalltalk is the most beautiful language that a human being has ever created.

Smalltalk is the most beautiful language that a human being has ever created.
I recommend Smalltalk for a first programming language
Smalltalk is the most beautiful language that a human being has ever created.
I recommend Smalltalk for a first programming language
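
re.sub also accepts a count argument that caps the number of replacements; its signature is re.sub(pattern, repl, string, count=0, flags=0). The sketch below, on the same text, replaces every match and then only the first one:

```python
import re

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# count=0 (the default) replaces every match
all_replaced = re.sub('python', 'Smalltalk', txt, flags=re.I)
print(all_replaced.count('Smalltalk'))   # 2

# count=1 replaces only the first match and leaves the second line untouched
first_only = re.sub('python', 'Smalltalk', txt, count=1, flags=re.I)
print(first_only.splitlines()[1])   # I recommend python for a first programming language
```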


**Splitting text**

In [7]:
txt = '''I am teacher and  I love teaching.
There is nothing as rewarding as educating and empowering people.
I found teaching more interesting than any other jobs.
Does this motivate you to be a teacher?'''
print(re.split('\n', txt)) # splitting using \n - end of line symbol

['I am teacher and  I love teaching.', 'There is nothing as rewarding as educating and empowering people.', 'I found teaching more interesting than any other jobs.', 'Does this motivate you to be a teacher?']
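
The separator passed to re.split can be any pattern, not just a fixed character. As an illustration on a made-up string (not from the text above), this sketch splits on commas or semicolons together with any surrounding whitespace, and uses maxsplit to limit the number of splits:

```python
import re

txt = 'apple, banana;cherry ,  mango'

# Split on a comma or semicolon, swallowing the whitespace around it
parts = re.split(r'\s*[,;]\s*', txt)
print(parts)   # ['apple', 'banana', 'cherry', 'mango']

# maxsplit=1 stops after the first split
head_and_rest = re.split(r'\s*[,;]\s*', txt, maxsplit=1)
print(head_and_rest)   # ['apple', 'banana;cherry ,  mango']
```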


### Writing RegEx patterns

A regular string is declared with single or double quotes. A RegEx pattern is usually written as a raw string, r'', so that backslashes reach the re module unchanged. The following pattern matches only the lowercase apple; to make it case insensitive, we either rewrite the pattern or add a flag.

In [8]:
import re

regex_pattern = r'apple'
txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away. '
matches = re.findall(regex_pattern, txt)
print(matches)  # ['apple']

# To make it case insensitive, add the re.I flag
matches = re.findall(regex_pattern, txt, re.I)
print(matches)  # ['Apple', 'apple']
# or we can use a set of characters
regex_pattern = r'[Aa]pple'  # the first letter can be A or a, so this matches Apple or apple
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Apple', 'apple']

['apple']
['Apple', 'apple']
['Apple', 'apple']
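
The raw-string prefix matters as soon as a pattern contains a backslash. In a plain string, Python itself interprets '\b' as a backspace character before the re module ever sees it. A small sketch on a made-up sentence, using the word-boundary sequence \b (not covered elsewhere in this notebook):

```python
import re

txt = 'An apple a day; pineapple does not count.'

# Raw string: \b reaches the regex engine as a word boundary
raw_matches = re.findall(r'\bapple\b', txt)
print(raw_matches)     # ['apple']

# Plain string: '\b' becomes a backspace character, so nothing matches
plain_matches = re.findall('\bapple\b', txt)
print(plain_matches)   # []

# Without the boundaries, the apple inside pineapple matches as well
loose_matches = re.findall(r'apple', txt)
print(loose_matches)   # ['apple', 'apple']
```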


- []: A set of characters
    - [a-c] means a or b or c
    - [a-z] means any lowercase letter from a to z
    - [A-Z] means any uppercase letter from A to Z
    - [0-3] means 0 or 1 or 2 or 3
    - [0-9] means any digit from 0 to 9
    - [A-Za-z0-9] means any single character that is a to z, A to Z or 0 to 9
- \\: used to escape special characters and to introduce special sequences
  - \d matches any single digit (0-9)
  - \D matches any single character that is not a digit
- . : any character except the newline character (\n)
- ^: starts with
  - r'^substring' eg r'^love', a string that starts with the word love
  - r'[^abc]' means not a, not b, not c
- \$: ends with
  - r'substring\$' eg r'love\$', a string that ends with the word love
- \*: zero or more times
  - r'[a]*' means a is optional or can occur many times
- +: one or more times
  - r'[a]+' means a occurs at least once
- \?: zero or one time
  - r'[a]\?' means a occurs zero times or once
- {3}: exactly 3 repetitions of the preceding item
- {3,}: at least 3 repetitions
- {3,8}: 3 to 8 repetitions
- \|: either or
  - r'apple\|banana' means either apple or banana
- \(\): capture and group
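
Several entries in this list, namely the ends-with anchor, \D, either-or and capture groups, can be tried out with a short sketch. The sample strings are made up for illustration:

```python
import re

# $ anchors the match at the end of the string
ends_with = re.findall(r'love$', 'I love apples and I love')
print(ends_with)    # ['love'], only the one at the end

# \D+ matches runs of characters that are not digits
non_digits = re.findall(r'\D+', 'abc123def')
print(non_digits)   # ['abc', 'def']

# With one capture group, findall returns the group, not the whole match
fruits = re.findall(r'(apple|banana)s', 'apples and bananas')
print(fruits)       # ['apple', 'banana']

# group(1), group(2) give the captured parts of a single match
pages = re.search(r'(\d+)-(\d+)', 'pages 10-25')
print(pages.group(1), pages.group(2))   # 10 25
```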


**Square bracket - []**

In [9]:
# Example

regex_pattern = r'[Aa]pple|[Bb]anana' # [Aa] means either A or a, [Bb] means either B or b
txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away. '
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Apple', 'banana', 'apple', 'banana']

['Apple', 'banana', 'apple', 'banana']


**Escape character(\\)**

In [10]:
# Example

regex_pattern = r'\d'  # \d is a special sequence that matches a single digit
txt = 'This regular expression example was made on December 6,  2019.'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2', '0', '1', '9'], this is not what we want

['6', '2', '0', '1', '9']


**One or more times(+)**

In [11]:
# Example

regex_pattern = r'\d+'  # \d is a special sequence that matches a digit, + means one or more times
txt = 'This regular expression example was made on December 6,  2019.'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2019'] - now, this is better!

['6', '2019']


**Period(.)**

In [12]:
# Example

regex_pattern = r'[a].'  # this square bracket means a and . means any character except new line
txt = '''Apple and banana are fruits'''
matches = re.findall(regex_pattern, txt)
print(matches)  # ['an', 'an', 'an', 'a ', 'ar']

regex_pattern = r'[a].+'  # .+ means any character, one or more times
matches = re.findall(regex_pattern, txt)
print(matches)  # ['and banana are fruits']

['an', 'an', 'an', 'a ', 'ar']
['and banana are fruits']


**Zero or more times(*)**

In [13]:
# Example

regex_pattern = r'[a].*'  # .* means any character, zero or more times
txt = '''Apple and banana are fruits'''
matches = re.findall(regex_pattern, txt)
print(matches)  # ['and banana are fruits']

['and banana are fruits']


**Zero or one time(?)**

In [14]:
txt = '''I am not sure if there is a convention how to write the word e-mail.
Some people write it email others may write it as Email or E-mail.'''
regex_pattern = r'[Ee]-?mail'  # ? means here that '-' is optional
matches = re.findall(regex_pattern, txt)
print(matches)  # ['e-mail', 'email', 'Email', 'E-mail']

['e-mail', 'email', 'Email', 'E-mail']


**Quantifier in RegEx**

Curly brackets specify how many times the preceding item must repeat. Let's imagine we are interested in a run of exactly 4 digits:

In [15]:
txt = 'This regular expression example was made on December 6,  2019.'
regex_pattern = r'\d{4}'  # exactly four times
matches = re.findall(regex_pattern, txt)
print(matches)  # ['2019']

txt = 'This regular expression example was made on December 6,  2019.'
regex_pattern = r'\d{1,4}'   # 1 to 4 times; a space inside the braces would make them literal text
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2019']

['2019']
['6', '2019']
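
A quantifier counts repetitions but does not look at what surrounds the match, so a short quantifier can also match inside a longer number. A brief sketch on a made-up string, using the word-boundary sequence \b (an addition not covered above) to restrict matches to whole numbers:

```python
import re

txt = 'Codes: 7, 42, 123, 2019, 123456'

# {3,} means three or more digits
three_plus = re.findall(r'\d{3,}', txt)
print(three_plus)      # ['123', '2019', '123456']

# \b on both sides keeps only whole numbers of one or two digits
short_numbers = re.findall(r'\b\d{1,2}\b', txt)
print(short_numbers)   # ['7', '42']

# Without boundaries, {2} also matches pairs inside longer numbers
pairs = re.findall(r'\d{2}', txt)
print(pairs)           # ['42', '12', '20', '19', '12', '34', '56']
```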


**Caret ^**

- Starts with

In [16]:
txt = 'This regular expression example was made on December 6,  2019.'
regex_pattern = r'^This'  # ^ means starts with
matches = re.findall(regex_pattern, txt)
print(matches)  # ['This']

['This']
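
By default ^ matches only at the very start of the string. With the re.M (re.MULTILINE) flag it also matches at the start of every line. A small sketch on a made-up three-line string:

```python
import re

txt = '''This is line one.
This is line two.
And this is line three.'''

# Default: ^ matches at the start of the string only
first_only = re.findall(r'^This', txt)
print(first_only)   # ['This']

# re.M: ^ matches at the start of each line
every_line = re.findall(r'^This', txt, re.M)
print(every_line)   # ['This', 'This']
```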


- Negation

In [17]:
txt = 'This regular expression example was made on December 6,  2019.'
regex_pattern = r'[^A-Za-z ]+'  # ^ inside a character set means negation: not A to Z, not a to z, not a space
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6,', '2019.']

['6,', '2019.']
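
A negated set combines naturally with re.sub to strip unwanted characters. As a sketch on a made-up sentence, this removes everything that is not a letter, a digit or a space:

```python
import re

txt = 'Hello, World! It is 2019?'

# Replace every character outside the allowed set with an empty string
cleaned = re.sub(r'[^A-Za-z0-9 ]', '', txt)
print(cleaned)   # Hello World It is 2019
```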


## Resources

https://github.com/Asabeneh/30-Days-Of-Python/blob/master/18_Day_Regular_expressions/18_regular_expressions.md#escape-character-in-regex

https://www.utic.edu.py/citil/images/Manuales/Python_para_todos.pdf