# **Regex Note:**

## <span id="index">**Index:**</span>
- [**Simple Regex Pattern**](#simple_regex_pattern)
- [**Character Sets**](#character_sets)
- [**Ranges**](#ranges)
- [**Repeating Characters**](#repeating_characters)
- [**Metacharacters**](#metacharacters)
- [**Special Characters**](#special_characters)
- [**Starting & Ending Pattern**](#starting_ending_pattern)
- [**Alternate Characters**](#alternate_characters)
- [**Some Examples**](#some_examples)

**Important functions of the `re` module:**

| function | description |
| :--: | :-- |
| findall | Returns a list containing all matches |
| search | Returns a Match object if there is a match anywhere in the string |
| split | Returns a list where the string has been split at each match |
| sub | Replaces one or many matches with a string |
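
As a quick illustration of the four functions in the table, here is a minimal sketch. The sample string and the variable names are made up for this note and are not part of the original notebook.

```python
import re

sample = "a1b22c333d"  # hypothetical sample string

# findall: every non-overlapping match, returned as a list of strings
all_numbers = re.findall(r"[0-9]+", sample)

# search: a Match object for the first match anywhere, or None
first = re.search(r"[0-9]+", sample)
first_number = first.group() if first else None

# split: cut the string wherever the pattern matches
parts = re.split(r"[0-9]+", sample)

# sub: replace every match with the given string
masked = re.sub(r"[0-9]+", "#", sample)

print(all_numbers)   # ['1', '22', '333']
print(first_number)  # 1
print(parts)         # ['a', 'b', 'c', 'd']
print(masked)        # a#b#c#d
```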

Functions in `re` such as `match()`, `search()`, and `split()` accept optional flags that change how a pattern is matched.

| **Flag** | **Long Syntax** | **Description** |
| :--: | :--: | :-- |
| **re.A** | **re.ASCII** | Perform ASCII-only matching instead of full Unicode matching |
| **re.I** | **re.IGNORECASE** | Perform case-insensitive matching |
| **re.M** | **re.MULTILINE** | Changes the behavior of the metacharacters `^` (caret) and `$` (dollar). With this flag, `^` matches at the beginning of the string and at the beginning of each line (right after each `\n`), and `$` matches at the end of the string and at the end of each line (right before each `\n`) |
| **re.S** | **re.DOTALL** | Make the DOT (`.`) special character match any character at all, including a newline. Without this flag, DOT (`.`) will match anything except a newline |
| **re.X** | **re.VERBOSE** | Allow comments in the regex. Most whitespace in the pattern is ignored and `#` starts a comment, which makes long patterns more readable. |
| **re.L** | **re.LOCALE** | Make `\w`, `\W`, `\b`, `\B` and case-insensitive matching dependent on the current locale. Use only with bytes patterns. |

*To specify more than one flag, use the `|` operator to combine them. For example, a case-insensitive, multiline search with a verbose pattern:*

```python
re.findall(pattern, string, flags=re.I|re.M|re.X)
```
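
To make the effect of the individual flags concrete, the following sketch compares a few calls with and without a flag. The strings and variable names are assumptions chosen only for illustration.

```python
import re

lines = "Data first\ndata second\nother"  # hypothetical three-line string

# Without re.M, "^" anchors only at the very start of the string.
no_flag = re.findall(r"^data", lines, flags=re.I)

# With re.M, "^" also anchors right after each newline.
multi = re.findall(r"^data", lines, flags=re.I | re.M)

# Without re.S the dot stops at a newline; with it, the dot crosses lines.
dot_default = re.findall(r"first.*", lines)
dot_all = re.findall(r"first.*", lines, flags=re.S)

# re.X lets the pattern contain layout whitespace and comments.
verbose = re.findall(r"""
    [0-9]{3}   # three digits
    -          # a literal hyphen
    [0-9]{4}   # four digits
""", "call 555-1234 now", flags=re.X)

print(no_flag)      # ['Data']
print(multi)        # ['Data', 'data']
print(dot_default)  # ['first']
print(dot_all)      # ['first\ndata second\nother']
print(verbose)      # ['555-1234']
```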

## <span id="simple_regex_pattern">[**Simple Regex Pattern:**](https://www.youtube.com/watch?v=uaepGvA-iK4&list=PL4cUxeGkcC9g6m_6Sld9Q4jzqdqHd2HiD&index=2)</span>

[**Go to top**](#index)

In [1]:
import re

def check_pattern(pattern: str, txt: str, flags=0) -> bool:
    """
    Check whether a string contains any match for a regex pattern. If there
    are matches, print them and return True. Otherwise, return False.
    
    Parameters:
        pattern: str, the regex pattern.
        txt: str, the string to search.
        flags: optional re flags such as re.A, re.I, re.M, re.S, re.X, re.L,
            combined with `|`.
        
    Returns:
        bool: True if at least one match is found, otherwise False.
        
    >>> pattern = r"[0-9]{10}"
    >>> txt = "987654321"
    >>> print(check_pattern(pattern, txt))
    False
    """
    # re.findall accepts the pattern string directly, so no separate compile step is needed
    result = re.findall(pattern, txt, flags=flags)
    
    if len(result) > 0:
        print(result)
        return True
    return False
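
The helper above passes the pattern string to `re.findall` on every call. When one pattern is reused many times, a compiled pattern object is an alternative. The following is only a sketch: `check_compiled` and `ten_digits` are hypothetical names, not part of the original notebook.

```python
import re

def check_compiled(compiled: re.Pattern, txt: str) -> bool:
    """Same idea as check_pattern, but takes a precompiled pattern."""
    result = compiled.findall(txt)
    if result:
        print(result)
        return True
    return False

# Flags, if any, would be given to re.compile instead of to findall.
ten_digits = re.compile(r"[0-9]{10}")

ok = check_compiled(ten_digits, "9876543210")   # prints ['9876543210']
bad = check_compiled(ten_digits, "987654321")   # prints nothing
print(ok, bad)  # True False
```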

In [2]:
# This pattern matches only the literal word "data" in the given string/txt
pattern = r"data"
txt     = "Sayan is a data scientist. DaTa is gold now a days."

check_pattern(pattern, txt)

['data']


True

In [3]:
# The re.I flag makes the match case-insensitive.
check_pattern(pattern, txt, flags=re.I)

['data', 'DaTa']


True

## <span id="character_sets">[**Character Sets:**](https://www.youtube.com/watch?v=DC-zzUrg0Ws&list=PL4cUxeGkcC9g6m_6Sld9Q4jzqdqHd2HiD&index=3)</span>

[**Go to top**](#index)

In [4]:
# The [] matches exactly one of the characters inside it, so this pattern
# matches "data" or "Data" (without using any flags).
# Note that findall looks for the pattern anywhere in the string, so the "data"
# inside "Adata" is matched as well (the leading "A" is not part of the match).

## A set that includes some specific characters

pattern = r"[dD]ata"
txt     = "Sayan is a data scientist. Data is gold now a days. Adata is a good company."

check_pattern(pattern, txt)

['data', 'Data', 'data']


True

In [5]:
txt = "Sayan is a data scientist. DaTa is gold now a days. "

check_pattern(pattern, txt)

['data']


True

In [6]:
# another example of a set that includes specific characters
pattern = r"[abc123]"
txt     = "ab3e"

check_pattern(pattern, txt)

['a', 'b', '3']


True

In [7]:
## A "^" at the start of the set negates it: match any character except the listed ones
pattern = r"[^p]00"
text = "a00c00 r00 g00p00 p00h00"

check_pattern(pattern, text)

['a00', 'c00', 'r00', 'g00', 'h00']


True

In [8]:
# exclude multiple characters
pattern = r"[^pg]00"

check_pattern(pattern, text)

['a00', 'c00', 'r00', 'h00']


True
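
Inside a character set most special characters lose their special meaning, which is easy to trip over. A small sketch, with made-up strings and variable names:

```python
import re

# Inside [], a dot is just a dot, not "any character".
dots = re.findall(r"[.!?]", "Wait. What? No!")

# A caret that is not the first character of the set is a literal caret.
carets = re.findall(r"[a^]", "a^b")

# A hyphen placed first or last in the set is a literal hyphen, not a range.
hyphens = re.findall(r"[ab-]", "a-b-c")

print(dots)     # ['.', '?', '!']
print(carets)   # ['a', '^']
print(hyphens)  # ['a', '-', 'b', '-']
```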

## <span id="ranges">[**Ranges**](https://www.youtube.com/watch?v=C_HTKPvXjEc&list=PL4cUxeGkcC9g6m_6Sld9Q4jzqdqHd2HiD&index=4)</span>

[**Go to top**](#index)

In [9]:
# What if you want to match "_inja", where _ can be any lowercase letter?
# For that, you can use a range such as [a-z], as in the pattern below.
pattern = r"[a-z]inja"
txt = "ninja ginja winja zinja"

check_pattern(pattern, txt)

['ninja', 'ginja', 'winja', 'zinja']


True

In [10]:
# To accept both lowercase and uppercase letters without using flags, combine two ranges.
pattern = r"[a-zA-Z]inja"
txt = "Ninja ginja Winja zinja ninja"

check_pattern(pattern, txt)

['Ninja', 'ginja', 'Winja', 'zinja', 'ninja']


True

In [11]:
# this can be done with numbers also.
pattern = r"[0-9]inja"
txt = "0inja 7inja ninja"

check_pattern(pattern, txt)

['0inja', '7inja']


True

In [12]:
# create a regex pattern to check phone numbers
pattern = r"[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]"
txt1 = "9876543210"
check_pattern(pattern, txt1)

['9876543210']


True

In [13]:
txt2 = "987654321"
check_pattern(pattern, txt2)

False
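
Ranges do not have to cover the whole alphabet or all ten digits, and several ranges can share one set. A minimal sketch, with strings and names that are assumptions for illustration:

```python
import re

# Three ranges in one set: hexadecimal digits.
hex_digits = re.findall(r"[0-9a-fA-F]", "0x1F g9")

# A partial digit range: only 2, 3, 4 or 5 before "inja".
narrow = re.findall(r"[2-5]inja", "1inja 2inja 5inja 6inja")

print(hex_digits)  # ['0', '1', 'F', '9']
print(narrow)      # ['2inja', '5inja']
```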

## <span id="repeating_characters">[**Repeating Characters**](https://www.youtube.com/watch?v=Dvb7eT36hMM&list=PL4cUxeGkcC9g6m_6Sld9Q4jzqdqHd2HiD&index=5)</span>

[**Go to top**](#index)

In [14]:
# For the phone number pattern, we repeated the same unit 10 times
# to match. We can avoid that easily.
### RULES:  ###
# 1. "+" means the preceding unit can repeat any number of times:
#    at least once, with no upper limit.
# 2. To fix the count, use {} and put the number of repetitions
#    inside it. For a phone number it is 10, e.g., r"[0-9]{10}"

pattern = r"[0-9]+"
print(check_pattern(pattern, txt1))
print(check_pattern(pattern, txt2))

['9876543210']
True
['987654321']
True


In [15]:
pattern = r"[0-9]{10}"
print(check_pattern(pattern, txt1), end="\n\n")
print(check_pattern(pattern, txt2))

['9876543210']
True

False


In [16]:
# What if we make a pattern that accepts only runs of exactly 5 letters?
# Note the behaviour: for words with more than 5 letters, the pattern
# matches the first 5 letters, then tries again on the rest of the word
# (so "literature" gives "liter" and "ature"). Words with fewer than
# 5 letters are not matched at all.
pattern = r"[a-zA-Z]{5}"
text = """Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots 
in a piece of classical Latin literature from 45 BC, making it over 2000 years old."""
check_pattern(pattern, text)

['Contr', 'popul', 'belie', 'Lorem', 'Ipsum', 'simpl', 'rando', 'roots', 'piece', 'class', 'Latin', 'liter', 'ature', 'makin', 'years']


True

In [17]:
# What if we want runs of letters that are 3 to 5 characters long?
# There must not be any whitespace between the 3 and the 5. With {3, 5},
# Python treats the braces as literal text, so the pattern finds nothing here.
pattern = r"[a-zA-Z]{3,5}"
check_pattern(pattern, text)

['Contr', 'ary', 'popul', 'belie', 'Lorem', 'Ipsum', 'not', 'simpl', 'rando', 'text', 'has', 'roots', 'piece', 'class', 'ical', 'Latin', 'liter', 'ature', 'from', 'makin', 'over', 'years', 'old']


True

In [18]:
pattern = r"[a-zA-Z]{3, 5}"
check_pattern(pattern, text)

False

In [19]:
# what if you want to include that "2000" also in this list?
pattern = r"[a-zA-Z0-9]{3,5}"
check_pattern(pattern, text)

['Contr', 'ary', 'popul', 'belie', 'Lorem', 'Ipsum', 'not', 'simpl', 'rando', 'text', 'has', 'roots', 'piece', 'class', 'ical', 'Latin', 'liter', 'ature', 'from', 'makin', 'over', '2000', 'years', 'old']


True

In [20]:
# what if we need words that are at least 5 characters long?
pattern = r"[a-zA-Z0-9]{5,}"
check_pattern(pattern, text)

['Contrary', 'popular', 'belief', 'Lorem', 'Ipsum', 'simply', 'random', 'roots', 'piece', 'classical', 'Latin', 'literature', 'making', 'years']


True
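
The quantifier forms used in this section can be compared side by side on one short string. This is a sketch with a made-up input; the last call illustrates the whitespace pitfall, where Python's `re` falls back to treating `{2, 4}` as literal characters.

```python
import re

word = "aaaaaa"  # six letters, hypothetical input

exactly_two = re.findall(r"a{2}", word)     # exactly 2 per match
two_to_four = re.findall(r"a{2,4}", word)   # 2 to 4, as many as possible first
at_least_two = re.findall(r"a{2,}", word)   # 2 or more

# With a space inside the braces, this is no longer a quantifier:
# the pattern looks for the literal text "a{2, 4}".
literal = re.findall(r"a{2, 4}", "aaaa a{2, 4}")

print(exactly_two)   # ['aa', 'aa', 'aa']
print(two_to_four)   # ['aaaa', 'aa']
print(at_least_two)  # ['aaaaaa']
print(literal)       # ['a{2, 4}']
```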

## <span id="metacharacters">[**Metacharacters**](https://www.youtube.com/watch?v=MwzIRleH47o&list=PL4cUxeGkcC9g6m_6Sld9Q4jzqdqHd2HiD&index=6)</span>

[**Go to top**](#index)

- **`\d` -** match any digit character (same as `[0-9]` for ASCII text). A plain `d` without the backslash matches the literal letter "d".
- **`\w` -** match any word character (a-z, A-Z, 0-9 and the underscore `_`)
- **`\s` -** match a whitespace character (spaces, tabs, newlines, etc.)
- **`\t` -** match a tab character only
- There are many more metacharacters. You can find a list [here](https://www.w3schools.com/jsref/jsref_obj_regexp.asp) (note that this page documents JavaScript regular expressions).
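
The metacharacters listed above can be tried on one small string. A minimal sketch; the sample string (which contains a real tab character) and the variable names are assumptions for illustration.

```python
import re

sample = "id_7\tok 42"  # contains one tab and one space

digits = re.findall(r"\d", sample)    # single digit characters
words = re.findall(r"\w+", sample)    # runs of word characters
spaces = re.findall(r"\s", sample)    # any whitespace character
tabs = re.findall(r"\t", sample)      # tab characters only
plain_d = re.findall(r"d", sample)    # the literal letter "d"

print(digits)   # ['7', '4', '2']
print(words)    # ['id_7', 'ok', '42']
print(spaces)   # ['\t', ' ']
print(tabs)     # ['\t']
print(plain_d)  # ['d']
```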

In [21]:
pattern = r"\d"
print(txt1)
check_pattern(pattern, txt1)

9876543210
['9', '8', '7', '6', '5', '4', '3', '2', '1', '0']


True

In [22]:
check_pattern(pattern, text)

['4', '5', '2', '0', '0', '0']


True

In [23]:
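# From the dummy text, find the words that are exactly 5 characters long.
# Because \s consumes the surrounding whitespace, "Ipsum" is missed here:
# the space before it was already used by the match for " Lorem ".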
pattern = r"\s\w{5,5}\s"
check_pattern(pattern, text)

[' Lorem ', ' roots ', ' piece ', ' Latin ', ' years ']


True
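
In the output above, "Ipsum" is absent even though it has 5 characters. The sketch below reproduces this on a shorter, made-up string and compares it with the word-boundary form `\b`, which does not consume any characters.

```python
import re

sample = " Lorem Ipsum is"  # hypothetical shortened text

# \s consumes the space after "Lorem", so "Ipsum" has no leading \s left.
consuming = re.findall(r"\s\w{5}\s", sample)

# \b matches a position between a word and a non-word character
# without consuming anything, so neighbouring words are both found.
boundary = re.findall(r"\b\w{5}\b", sample)

print(consuming)  # [' Lorem ']
print(boundary)   # ['Lorem', 'Ipsum']
```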

## <span id="special_characters">[**Special Characters**](https://www.youtube.com/watch?v=ae38f8ZWObI&list=PL4cUxeGkcC9g6m_6Sld9Q4jzqdqHd2HiD&index=8)</span>

[**Go to top**](#index)

- **`+` -** The one-or-more quantifier
- **`\` -** The escape character
- **`[]` -** The character set
- **`[^]` -** The negation symbol in a character set
- **`?` -** The zero-or-one quantifier (makes the preceding character optional)
- **`.` -** Any character whatsoever (except the newline character)
- **`*` -** The zero-or-more quantifier (like `+`, but also allows no occurrence at all)
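
A short sketch of the quantifiers and the escape character from the list above. The strings and variable names are made up for illustration; `re.escape` is the standard-library helper that adds the backslashes automatically.

```python
import re

# "?" makes the preceding character optional.
optional = re.findall(r"colou?r", "color colour")

# "*" allows zero or more repetitions, "+" requires at least one.
star = re.findall(r"ab*", "a ab abbb")
plus = re.findall(r"ab+", "a ab abbb")

# An unescaped dot matches any character; "\." matches a real dot only.
text = "3.14 3x14"
unescaped = re.findall(r"3.14", text)
escaped = re.findall(r"3\.14", text)
auto_escaped = re.findall(re.escape("3.14"), text)

print(optional)      # ['color', 'colour']
print(star)          # ['a', 'ab', 'abbb']
print(plus)          # ['ab', 'abbb']
print(unescaped)     # ['3.14', '3x14']
print(escaped)       # ['3.14']
print(auto_escaped)  # ['3.14']
```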

In [24]:
# In the pattern below, the letter "o" (the character just before the
# special character "?") is optional. That means "hello" and "hell"
# are both valid matches.
pattern = r"hello?"
txt = "hello hell how are you hellhello"
check_pattern(pattern, txt)

['hello', 'hell', 'hell', 'hello']


True

In [25]:
pattern = r"[a-z]o?[a-z]{4,4}"
text = """Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots 
in a piece of classical Latin literature from 45 BC, making it over 2000 years old."""

check_pattern(pattern, text)

['ontra', 'popula', 'belie', 'simpl', 'rando', 'roots', 'piece', 'class', 'liter', 'ature', 'makin', 'years']


True

In [26]:
# Next is the dot (.) special character. The pattern below does not match the
# word "root" on its own: a 5th character is required, but it can be anything,
# so "roots" matches.
pattern = r"root."
check_pattern(pattern, text)

['roots']


True

In [27]:
pattern = r".+"
check_pattern(pattern, text)

['Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots ', 'in a piece of classical Latin literature from 45 BC, making it over 2000 years old.']


True

In [28]:
# The pattern below says that the first letter must always be "p". After it,
# any lowercase letter can appear zero or more times.
pattern = r"p[a-z]*"
check_pattern(pattern, text)

['popular', 'psum', 'ply', 'piece']


True

## <span id="starting_ending_pattern">[**Starting & Ending Pattern**](https://www.youtube.com/watch?v=RD3tMcFDjyo&list=PL4cUxeGkcC9g6m_6Sld9Q4jzqdqHd2HiD&index=8)</span>

[**Go to top**](#index)

In [29]:
# Let's say we need words that are exactly 5 letters long. If we use a pattern
# like [a-zA-Z]{5} on the string "sayankdcjdcjudscskcksc", it matches the first
# 5 letters, then the next 5 letters, and so on.
# Let's have a look.
pattern = r"[a-zA-Z]{5}"
txt = "sayankdcjdcjudscskcksc"
check_pattern(pattern, txt)

['sayan', 'kdcjd', 'cjuds', 'cskck']


True

In [30]:
# ^[a-z] - Matches a lowercase letter at the beginning of the string. So, "^"
#          anchors the match to the starting position.
# [a-z]$ - Matches a lowercase letter at the end of the string.
# Using both anchors forces the whole string to fit the pattern.

pattern = r"^[a-zA-Z]{5}$"
print(check_pattern(pattern, "asdef"), end="\n\n")
print(check_pattern(pattern, "asdeffvf"))

['asdef']
True

False
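
Two details about anchors are worth checking in practice: `$` also matches just before a single trailing newline, and `re.M` makes both anchors work per line. A minimal sketch with made-up inputs; `re.fullmatch` is shown as a stricter alternative, not as something the original notebook uses.

```python
import re

pattern = r"^[a-zA-Z]{5}$"

# "$" matches at the end of the string and also just before a trailing newline.
with_newline = re.findall(pattern, "asdef\n")

# re.fullmatch requires the entire string, newline included, to fit the pattern.
strict = re.fullmatch(r"[a-zA-Z]{5}", "asdef\n")

# With re.M, "^" and "$" anchor at every line of the string.
per_line = re.findall(pattern, "asdef\nqwert\ntoolong", flags=re.M)

print(with_newline)  # ['asdef']
print(strict)        # None
print(per_line)      # ['asdef', 'qwert']
```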


## <span id="alternate_characters">[**Alternate Characters**](https://www.youtube.com/watch?v=62ItDFG4UTM&list=PL4cUxeGkcC9g6m_6Sld9Q4jzqdqHd2HiD&index=9)</span>

[**Go to top**](#index)

In [31]:
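# The pipe symbol "|" separates alternatives in a regex.
# Because the pattern contains a capturing group (), findall returns only the
# text captured by the group, not the whole match.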
pattern = r"b(u|i)tter"

check_pattern(pattern, "This butter is bitter, bitter butter is bad for batter..")

['u', 'i', 'i', 'u']


True

In [32]:
check_pattern(r"(p|t)yre", "pyre type")

['p']


True

In [33]:
pattern = r"(pet|toy|crazy) rabbit"
check_pattern(pattern, "pet rabbit, toy rabbit, crazy rabbit, hello rabbit")

['pet', 'toy', 'crazy']


True

In [34]:
# The last element (the empty string) comes from "hello rabbit": the optional
# group matched nothing there.
pattern = r"(pet|toy|crazy)? rabbit"
check_pattern(pattern, "pet rabbit, toy rabbit, crazy rabbit, hello rabbit")

['pet', 'toy', 'crazy', '']


True
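
The outputs in this section contain only the captured alternative, because `findall` returns group contents when the pattern has a capturing group. If the whole match is wanted, one option is a non-capturing group `(?:...)`. A sketch reusing the strings from the cells above, with hypothetical variable names:

```python
import re

sentence = "This butter is bitter, bitter butter is bad for batter.."

# (?:...) groups the alternatives without capturing them,
# so findall returns the full matched text.
full_matches = re.findall(r"b(?:u|i)tter", sentence)

# For single-character alternatives, a character set does the same job.
with_set = re.findall(r"b[ui]tter", sentence)

rabbits = re.findall(
    r"(?:pet|toy|crazy) rabbit",
    "pet rabbit, toy rabbit, crazy rabbit, hello rabbit",
)

print(full_matches)  # ['butter', 'bitter', 'bitter', 'butter']
print(with_set)      # ['butter', 'bitter', 'bitter', 'butter']
print(rabbits)       # ['pet rabbit', 'toy rabbit', 'crazy rabbit']
```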

## <span id="some_examples">Some Examples</span>

[**Go to top**](#index)

In [35]:
## Match 10 digit telephone number
pattern = r"^\d{10}$"
txt1 = "9876543210"
txt2 = "987654321"
txt3 = "98765432111"

print(check_pattern(pattern, txt1), end="\n\n")
print(check_pattern(pattern, txt2), end="\n\n")
print(check_pattern(pattern, txt3), end="\n\n")

['9876543210']
True

False

False



In [36]:
## Find the words which are exactly 5 characters long
pattern = r"\b\w{5}\b"
text = """Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots 
in a piece of classical Latin literature from 45 BC, making it over 2000 years old."""

check_pattern(pattern, text)

['Lorem', 'Ipsum', 'roots', 'piece', 'Latin', 'years']


True

In [37]:
# matching a username containing 5-12 alpha-numeric characters
pattern = r"^[a-zA-Z\d]{5,12}$"
text = "abc4DKd"

check_pattern(pattern, text)

['abc4DKd']


True

In [38]:
# password matching where the password may contain only word characters
# (letters, digits, _), plus @ and -, and must be 8-20 characters long
pattern = r"^[\w@-]{8,20}$"
text = "test1234%"
check_pattern(pattern, text)

False

In [39]:
# check a valid email using regex
#      1           2            3         4
# (your name) @ (domain) . (extension)(.again)
#   theboss   @ theregex . co . uk
# 1 --> any letters, numbers, dots and/or hyphens
# 2 --> any letters, numbers and/or hyphens
# 3 --> any letters
# 4 --> a dot(.) then any letters

# We can split the pattern into sections with () brackets. This does not change
# what the pattern matches, but findall then returns the sections as a tuple.
# A group also lets us make the whole section at position 4 optional with "?".
pattern = r"^([a-z\d\.-]+)@([a-z\d-]+)\.([a-z]{2,8})(\.[a-z]{2,8})?$"
text = "rsayan553@gmail.com" 

check_pattern(pattern, text)

[('rsayan553', 'gmail', 'com', '')]


True

In [40]:
check_pattern(pattern, "theboss@theregex.co.uk")

[('theboss', 'theregex', 'co', '.uk')]


True
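
When the individual parts of the address are needed, a Match object from `re.search` may be more convenient than `findall`. The sketch below reuses the email pattern from above; `EMAIL` and the other variable names are assumptions for illustration. It also shows that the pattern only accepts lowercase letters unless `re.I` is passed.

```python
import re

EMAIL = r"^([a-z\d\.-]+)@([a-z\d-]+)\.([a-z]{2,8})(\.[a-z]{2,8})?$"

match = re.search(EMAIL, "theboss@theregex.co.uk")
name, domain, extension, extra = match.groups()

# An optional group that did not take part in the match is None on a
# Match object (findall reports it as an empty string instead).
no_extra = re.search(EMAIL, "rsayan553@gmail.com").group(4)

# The character sets list only a-z, so uppercase input needs re.I.
upper = re.search(EMAIL, "TheBoss@TheRegex.co.uk")
upper_ignore_case = re.search(EMAIL, "TheBoss@TheRegex.co.uk", flags=re.I)

print(name, domain, extension, extra)  # theboss theregex co .uk
print(no_extra)                        # None
print(upper)                           # None
print(upper_ignore_case is not None)   # True
```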

In [41]:
# check whether a URL is valid or not

pattern = r"^(https?:\/\/)?(www\.)?([\w@:\.%_\+~#=]+)\.([\w()]{1,6})([\w@:%_\+.~#?&//=-]+)?$"

url = "https://www.flipkart.com/search?q=iphone+13&sid=tyy%2C4io&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_2_6_na_na_na&otracker1=AS_QueryStore_OrganicAutoSuggest_2_6_na_na_na&as-pos=2&as-type=RECENT&suggestionId=iphone+13%7CMobiles&requestId=077f7ba2-7a63-4bcd-a72f-92516764f9f6&as-searchtext=iphone"
check_pattern(pattern, url)

[('https://', 'www.', 'flipkart', 'com', '/search?q=iphone+13&sid=tyy%2C4io&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_2_6_na_na_na&otracker1=AS_QueryStore_OrganicAutoSuggest_2_6_na_na_na&as-pos=2&as-type=RECENT&suggestionId=iphone+13%7CMobiles&requestId=077f7ba2-7a63-4bcd-a72f-92516764f9f6&as-searchtext=iphone')]


True