# Regular Expressions in Python

Note: Documentation for the `re` module can be found at [Python's official documentation](https://docs.python.org/3/library/re.html).

### What is a Regular Expression?
A regular expression (regex) is a pattern that describes a set of strings. It is used to search and manipulate text: you can match, search, replace, and split strings based on specific patterns.

### General way to work with Regular Expressions
1. **Define a Pattern**: Create a search pattern using the regular expression syntax of the programming language you are working in.
2. **Compile the Pattern**: Use a regular expression module to compile the search pattern into an object that can be used for matching.
3. **Search or Match**: Use the compiled pattern to search for matches in a string.
4. **Extract or Replace**: Once matches are found, you can extract the matched strings or replace them with new strings.

Note: Compiling a pattern is optional in Python, because the functions of the `re` module accept pattern strings directly. The module also caches recently used patterns, but compiling explicitly can still improve performance when the same pattern is used many times.
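
As a minimal sketch of the note above, the snippet below runs the same pattern once through a module-level function and once through a compiled pattern object. The sample text and the variable names are illustrative assumptions, not part of the original notebook.

```python
import re

text = "ac abc abbc"

# Option 1: pass the pattern string straight to a module-level function.
direct = re.search(r"ab+c", text)

# Option 2: compile once, then reuse the pattern object for several calls.
pattern = re.compile(r"ab+c")
compiled = pattern.search(text)

print(direct.group())         # abc
print(compiled.group())       # abc
print(pattern.findall(text))  # ['abc', 'abbc']
```

Both routes find the same first match; the compiled object simply avoids repeating the pattern string on every call.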

### How to Use Regular Expressions in Python
The process described above is the same in every programming language, but the syntax and functions may vary slightly. In Python, you typically use the `re` module to work with regular expressions.

Here are some common functions provided by the `re` module:

1. **re.search(pattern, string)**: Searches for the first occurrence of the pattern in the string.
2. **re.match(pattern, string)**: Checks if the pattern matches the beginning of the string.
3. **re.findall(pattern, string)**: Returns a list of all occurrences of the pattern in the string.
4. **re.sub(pattern, replacement, string)**: Replaces occurrences of the pattern with the replacement string.
5. **re.split(pattern, string)**: Splits the string by occurrences of the pattern.
6. **re.compile(pattern)**: Compiles a regular expression pattern into a regex object, which can be used for matching.
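
The functions listed above can be compared side by side on one small input. This is only a sketch: the sample text and the pattern `b+` (one or more `b` characters) are assumptions chosen to keep the results easy to check by eye.

```python
import re

text = "abbc abc ac"

first = re.search(r"b+", text)        # first run of b's anywhere in the string
at_start = re.match(r"b+", text)      # None: the string starts with "a"
prefix = re.match(r"ab+", text)       # matches "abb" at the beginning
all_runs = re.findall(r"b+", text)    # every non-overlapping run of b's
replaced = re.sub(r"b+", "-", text)   # each run replaced by "-"
pieces = re.split(r"b+", text)        # the text between the runs
runs = re.compile(r"b+")              # reusable pattern object

print(first.group())       # bb
print(at_start)            # None
print(prefix.group())      # abb
print(all_runs)            # ['bb', 'b']
print(replaced)            # a-c a-c ac
print(pieces)              # ['a', 'c a', 'c ac']
print(runs.findall(text))  # ['bb', 'b']
```

Note that `re.search` and `re.match` return a match object (or `None`), while `re.findall`, `re.sub`, and `re.split` return plain strings or lists of strings.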


## Rules of Regular Expressions


### Metacharacters

#### Basic Metacharacters

1. **Dot (.)**: In the default mode, this matches any character except a newline. If the [DOTALL](https://docs.python.org/3/library/re.html#re.DOTALL) flag has been specified, this matches any character including a newline.
   - Example: `a.c` matches `abc`, `a1c`, but not `ac` or `a\nc`.

2. **Caret (^)**: Matches the start of the string, and in [MULTILINE](https://docs.python.org/3/library/re.html#re.MULTILINE) mode also matches immediately after each newline.
   - Example: `^abc` matches `abc` at the start of a string.

3. **Dollar Sign ($)**: Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline.
   - Example: `abc$` matches `abc` at the end of a string.

4. **Asterisk (*)**: Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. 
   - Example: `ab*c` matches `ac`, `abc`, `abbc`, etc.

5. **Plus Sign (+)**: Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   - Example: `ab+c` matches `abc`, `abbc`, but not `ac`.

6. **Question Mark (?)**: Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   - Example: `ab?c` matches `ac` or `abc`, but not `abbc`.

7. **\*?, +?, ??**: The '\*', '+', and '?' quantifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn't desired: if the RE `<.*>` is matched against `<a> b <c>`, it will match the entire string, and not just `<a>`. Adding `?` after the quantifier makes it perform the match in non-greedy or minimal fashion, so that as few characters as possible are matched. Using the RE `<.*?>` will match only `<a>`.

8. **\*+, ++, ?+**: Like the '\*', '+', and '?' quantifiers, those with '+' appended also match as many times as possible. However, unlike the true greedy quantifiers, these do not allow backtracking when the expression following them fails to match. These are known as possessive quantifiers, and the `re` module supports them from Python 3.11 onward.
   - Example: `a*a` will match `aaaa` because the `a*` first matches all four `a`s, but when the final `a` of the pattern is encountered, the expression backtracks so that `a*` ends up matching three `a`s and the fourth is matched by the final `a`. However, when `a*+a` is used to match `aaaa`, the `a*+` matches all four `a`s, and when the final `a` finds no more characters to match, the expression cannot backtrack and therefore fails to match.

9. **Backslash (\\)**: Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence.
   - Example: `\.` matches a literal dot.
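
The flag-dependent and quantifier behaviour described in the list above can be checked with a short script. This is a sketch under stated assumptions: the sample strings are made up for illustration, and the possessive quantifier is guarded by a version check because it requires Python 3.11 or later.

```python
import re
import sys

# Dot: does not match a newline unless re.DOTALL is set.
dot_default = re.search(r"a.c", "a\nc")
dot_all = re.search(r"a.c", "a\nc", re.DOTALL)
print(dot_default, dot_all is not None)          # None True

# Caret and dollar: re.MULTILINE makes them match at every line boundary.
lines = "abc\nabd\nabc"
print(re.findall(r"^abc", lines))                # ['abc']
print(re.findall(r"^abc", lines, re.MULTILINE))  # ['abc', 'abc']
print(re.findall(r"abc$", lines, re.MULTILINE))  # ['abc', 'abc']

# Greedy versus non-greedy quantifiers.
tagged = "<a> b <c>"
greedy = re.search(r"<.*>", tagged).group()
lazy = re.search(r"<.*?>", tagged).group()
print(greedy)                                    # <a> b <c>
print(lazy)                                      # <a>

# Greedy a* backtracks to leave one "a" for the rest of the pattern.
backtracking = re.fullmatch(r"a*a", "aaaa")
print(backtracking is not None)                  # True

# Possessive a*+ keeps all four "a"s, so the trailing "a" cannot match.
if sys.version_info >= (3, 11):
    possessive = re.fullmatch(r"a*+a", "aaaa")
    print(possessive)                            # None
```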



### Example Usages of the Basic Metacharacters

Ex 1 : A simple search pattern that finds the letter 'a', followed by any single character, followed by the letters 'cd'

In [3]:
import re

# 1. Define a pattern
search_for = r"a.cd" # "a", any one character, then "cd": matches "abcd", "a1cd", "aacd", but not "acd"
search_in = "xyz*-+aacdabdabcd.fsh"

# 2. compile the pattern
match_obj = re.compile(search_for)

# 3. search or match the pattern
matched = match_obj.search(search_in) # search returns the first occurrence of the pattern

# 4. Extract or replace the match found
if matched:
    print(f"Match found at index:{matched.start()} and respective string using match.group() is: {matched.group()}")
else:
    print("No match found")

Match found at index:6 and respective string using match.group() is: aacd
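
Since `search` stops at the first occurrence, a natural follow-up is to list every occurrence. The sketch below reuses the pattern and input string from Ex 1 and uses `finditer`, a function of the `re` module that is not in the list of functions given earlier; the name `hits` is an illustrative assumption.

```python
import re

search_in = "xyz*-+aacdabdabcd.fsh"
pattern = re.compile(r"a.cd")

# finditer yields one match object per non-overlapping match, left to right.
hits = [(m.start(), m.group()) for m in pattern.finditer(search_in)]
print(hits)  # [(6, 'aacd'), (13, 'abcd')]
```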


Ex 2 : Write a regular expression for the following scenario
1. Should start with A (capital)
2. Followed by at least one lowercase b
3. Followed by one % and one *
4. Should end with z (lowercase)

The task is to find out whether any of the strings `Abbbbb%*z`, `Ab*z$`, `Abbbbbbbbbb%*Z` match the given pattern.

In [5]:
import re

# 1. Define a pattern
match_for = r"^Ab+%\*z$"
match_in = ["Abbbbb%*z", "Ab*z$", "Abbbbbbbbbb%*Z"]

# 2. Compile the pattern
compiled_obj = re.compile(match_for)

# 3. full match the pattern
for m in match_in:
    match_obj = compiled_obj.fullmatch(m)
    if match_obj:
        print(f"{m} follows the regular expression {match_for}")
    else:
        print(f"{m} doesn't follow the regular expression {match_for}")

Abbbbb%*z follows the regular expression ^Ab+%\*z$
Ab*z$ doesn't follow the regular expression ^Ab+%\*z$
Abbbbbbbbbb%*Z doesn't follow the regular expression ^Ab+%\*z$
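
Ex 2 combines the `^` and `$` anchors with `fullmatch`. The sketch below explores how the two interact; the extra test strings (`xxAb%*zyy` and the one with a trailing newline) are assumptions added for illustration.

```python
import re

anchored = re.compile(r"^Ab+%\*z$")
unanchored = re.compile(r"Ab+%\*z")

# fullmatch already requires the whole string to match, so the anchors add nothing.
print(unanchored.fullmatch("Abbbbb%*z") is not None)  # True
print(unanchored.fullmatch("Ab*z$") is not None)      # False

# search only needs a matching substring, so surrounding text is accepted.
print(unanchored.search("xxAb%*zyy") is not None)     # True
print(unanchored.fullmatch("xxAb%*zyy") is not None)  # False

# "$" also matches just before a trailing newline; fullmatch does not allow that.
print(anchored.search("Ab%*z\n") is not None)         # True
print(anchored.fullmatch("Ab%*z\n") is not None)      # False
```

When validating a whole string, `fullmatch` is therefore slightly stricter than `search` with `^...$`.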


Ex 3 : Write a regular expression for the following scenario
1. Starts with at most one `*` (asterisk)
2. Followed by at least one `\` (backslash)
3. Followed by any number of characters
4. Should end with %

The task is to find out whether each of the strings `"*\\\\abcdef%%"`, `"**\\ab%"`, and `"*\\%"` (written as Python string literals) matches the regular expression.

In [9]:
import re

# 1. Define a pattern
match_for = r"\*?\\+.*%$"
match_in = ["*\\\\abcdef%%", "**\\ab%", "*\\%"]

# 2. Compile the pattern
compiled_obj = re.compile(match_for)

# 3. full match the pattern
for m in match_in:
    match_obj = compiled_obj.fullmatch(m)
    if match_obj:
        print(f"{m} follows the regular expression {match_for}")
    else:
        print(f"{m} doesn't follow the regular expression {match_for}")

*\\abcdef%% follows the regular expression \*?\\+.*%$
**\ab% doesn't follow the regular expression \*?\\+.*%$
*\% follows the regular expression \*?\\+.*%$
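
Ex 3 mixes two layers of backslash escaping: Python string literals and regular expression syntax. The sketch below separates the two; the names `subject`, `raw_pattern`, and `plain_pattern` are illustrative assumptions.

```python
import re

# Ordinary string literal: every "\\" in the source is ONE backslash in the string.
subject = "*\\\\abcdef%%"
print(len(subject))          # 11
print(subject.count("\\"))   # 2

# A raw string keeps backslashes exactly as typed, so the pattern needs
# half as many of them as the equivalent ordinary literal.
raw_pattern = r"\*?\\+.*%$"
plain_pattern = "\\*?\\\\+.*%$"
print(raw_pattern == plain_pattern)                    # True
print(re.fullmatch(raw_pattern, subject) is not None)  # True
```

This is why the printed output of Ex 3 shows fewer backslashes than the source code: `print` displays the characters actually stored in the string.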
