### Regular Expressions

A regular expression (regex) is a sequence of characters that forms a search pattern, used to find and manipulate text based on specific rules.

#### Use-cases:

1. Data cleaning

2. Validation

3. Extraction

4. Text replacement

In [32]:
import re

#### Functions in re

re.search(pattern, string): Scans a string for the first occurrence of the pattern and returns a match object, or None if not found.

In [33]:
match = re.search(r'pattern', 'string to search')
print(match.group() if match else 'No match')

No match


re.match(pattern, string): Checks for a match of the pattern only at the beginning of the string, returning a match object or None.

In [34]:
match = re.match(r'pattern', 'string to search')
print(match.group() if match else 'No match')

No match


re.findall(pattern, string): Finds all non-overlapping occurrences of the pattern in the string and returns them as a list of strings. If the pattern contains more than one capturing group, it returns a list of tuples instead.

In [35]:
matches = re.findall(r'pattern', 'string to search')
print(matches)

[]


re.sub(pattern, repl, string): Replaces all occurrences of the pattern in the string with the specified replacement string.

In [36]:
new_string = re.sub(r'old_pattern', 'new_text', 'original string')
print(new_string)

original string


re.subn(pattern, repl, string): Works like re.sub(), but returns a tuple containing the new string and the number of substitutions made.

In [37]:
# result is a (new_string, number_of_substitutions) tuple
result = re.subn(r'old_pattern', 'new_text', 'original string')
print(result)

('original string', 0)


re.split(pattern, string): Splits the string at each occurrence of the pattern and returns the resulting substrings as a list.

In [38]:
parts = re.split(r'delimiter_pattern', 'string to split')
print(parts)

['string to split']


re.compile(pattern): Compiles a regex into a reusable pattern object to perform various matching operations efficiently.

In [39]:
compiled_regex = re.compile(r'pattern')
match = compiled_regex.search('string')
print(match.group() if match else 'No match')

No match


re.escape(pattern): Escapes all special characters in a string to treat them as literals in a regex pattern.
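A minimal sketch of why escaping matters, assuming a made-up literal `1+1=2` and a made-up sample text (neither comes from the notebook). Without escaping, `+` acts as a quantifier, so the pattern matches something other than the literal text.

```python
import re

# Hypothetical literal text that contains a regex metacharacter (+).
literal = "1+1=2"
text = "1+1=2 and 11=2"

# re.escape backslash-escapes the metacharacters so they match literally.
escaped = re.escape(literal)

unescaped_hits = re.findall(literal, text)  # "+" means "one or more 1s"
escaped_hits = re.findall(escaped, text)    # "+" is a plain plus sign

print(unescaped_hits)  # ['11=2']
print(escaped_hits)    # ['1+1=2']
```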

#### Character Classes
1. [abc] --> any one of a, b, or c
2. [^abc] --> any character except a, b, or c
3. [a-z] --> any character in the range a to z
4. \d --> digit
5. \D --> non-digit
6. \w --> word character (letter, digit, or underscore)
7. \W --> non-word character
8. \s --> whitespace
9. \S --> non-whitespace
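The classes listed above can be tried out with `re.findall`. This is a small sketch on a made-up sample string, which is an assumption and not part of the notebook.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "Room 42, floor_3 at 9am!"

abc_hits = re.findall(r"[abc]", sample)    # any one of a, b, c
digit_hits = re.findall(r"\d", sample)     # single digits
word_hits = re.findall(r"\w+", sample)     # runs of word characters
space_hits = re.findall(r"\s", sample)     # whitespace characters

print(abc_hits)    # ['a', 'a']
print(digit_hits)  # ['4', '2', '3', '9']
print(word_hits)   # ['Room', '42', 'floor_3', 'at', '9am']
print(len(space_hits))  # 4
```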

#### Quantifiers (Repetition Rules)
1. '*' --> 0 or more
2. '+' --> 1 or more
3. ? --> 0 or 1
4. {n} --> exactly n
5. {n,m} --> between n and m (inclusive)

Note: The quotes around + and * above are not part of the syntax.
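To see how the quantifiers differ, here is a short sketch that applies each one to the same made-up string. The sample text and the `\b` word boundaries are assumptions added so that each word is matched as a whole.

```python
import re

# Hypothetical sample: the letter a followed by 0 to 4 b characters.
sample = "a ab abb abbb abbbb"

zero_or_more = re.findall(r"\bab*\b", sample)      # b repeated 0 or more times
one_or_more = re.findall(r"\bab+\b", sample)       # b repeated 1 or more times
zero_or_one = re.findall(r"\bab?\b", sample)       # b repeated 0 or 1 times
exactly_two = re.findall(r"\bab{2}\b", sample)     # b repeated exactly 2 times
two_to_three = re.findall(r"\bab{2,3}\b", sample)  # b repeated 2 to 3 times

print(zero_or_more)  # ['a', 'ab', 'abb', 'abbb', 'abbbb']
print(one_or_more)   # ['ab', 'abb', 'abbb', 'abbbb']
print(zero_or_one)   # ['a', 'ab']
print(exactly_two)   # ['abb']
print(two_to_three)  # ['abb', 'abbb']
```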

#### Anchors & Boundaries
1. ^ --> beginning
2. $ --> end
3. \b --> word boundary

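For example,

^Hello --> matches Hello only at the beginning of the string

world$ --> matches world only at the end of the string

\bcat\b --> matches cat as a whole word, not as part of a longer word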
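The three anchor patterns above can be checked with a quick sketch. The test strings are assumptions made up for illustration.

```python
import re

# ^ anchors the match at the beginning of the string.
starts_with_hello = re.search(r"^Hello", "Hello world") is not None
not_at_start = re.search(r"^Hello", "Say Hello") is not None

# $ anchors the match at the end of the string.
ends_with_world = re.search(r"world$", "Hello world") is not None

# \b requires a word boundary on both sides, so "concat" and
# "category" are skipped while "cat" and the cat in "cat's" match.
whole_word_hits = re.findall(r"\bcat\b", "cat concat cat's category")

print(starts_with_hello, not_at_start, ends_with_world)  # True False True
print(whole_word_hits)  # ['cat', 'cat']
```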


#### Lookarounds
1. Lookahead --> X(?=Y)
2. Negative Lookahead --> X(?!Y)
3. Lookbehind --> (?<=X)Y
4. Negative Lookbehind --> (?<!X)Y
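A sketch of all four lookarounds on one made-up string (the sample text is an assumption). Note that the negative forms add `\b` so that backtracking cannot return a shortened number such as `20` from `200`.

```python
import re

# Hypothetical sample text with amounts marked by "$" or " USD".
text = "price: $100, cost: 200 USD, total: $300 USD"

# Lookahead: digits followed by " USD" (the suffix is not in the match).
followed_by_usd = re.findall(r"\d+(?= USD)", text)

# Negative lookahead: whole numbers NOT followed by " USD".
not_followed_by_usd = re.findall(r"\b\d+\b(?! USD)", text)

# Lookbehind: digits preceded by "$" (the "$" is not in the match).
preceded_by_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: whole numbers NOT preceded by "$".
not_preceded_by_dollar = re.findall(r"(?<!\$)\b\d+", text)

print(followed_by_usd)         # ['200', '300']
print(not_followed_by_usd)     # ['100']
print(preceded_by_dollar)      # ['100', '300']
print(not_preceded_by_dollar)  # ['200']
```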

#### Flags
1. re.I --> case-insensitive
2. re.M --> multiline: ^ and $ also match at the beginning and end of each line
3. re.S --> dot matches newline as well

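For example, H..lo matches Hello and Hrelo, because each dot matches any single character. With re.S, a dot can also match a newline.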
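The effect of each flag can be seen in a short sketch. The two-line sample text is an assumption chosen so that every flag changes the result.

```python
import re

# Hypothetical two-line sample text.
text = "hello World\nHELLO again"

# re.I: letter case is ignored.
case_insensitive = re.findall(r"hello", text, flags=re.I)

# re.M: ^ matches at the beginning of every line, not only the string.
line_starts = re.findall(r"^\w+", text, flags=re.M)

# re.S: the dot also matches the newline between the two lines.
without_dotall = re.search(r"World.HELLO", text)
with_dotall = re.search(r"World.HELLO", text, flags=re.S)

print(case_insensitive)         # ['hello', 'HELLO']
print(line_starts)              # ['hello', 'HELLO']
print(without_dotall)           # None
print(with_dotall is not None)  # True
```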

### Tasks

1. Find all the digits in the string:

"Invoice 458, amount 900, paid on 12th"

2. Extract all the words from:

"Python is fast & powerful!"

3. Find all the email IDs in:

"Contact: abc@gmail.com, x.y@yahoo.com"

4. Extract all the numbers from:

"Order IDs: A102, B305, C450"

5. Check if the string starts with Hello.

"Hello world"

6. Extract all 4-digit years from:

"Born 1998, graduated 2020, working in 2023"

7. Extract words that start with a capital letter:

"India is Bigger than japan"

8. Validate a phone number format:

+998-987654321

Rules:
1. Country code (for example, +998)
2. Hyphen
3. 9 or 10 digits
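One possible approach to this task, as a sketch only. It assumes the country code is a plus sign followed by 1 to 3 digits, which the rules above do not state explicitly. The helper name `is_valid_phone` and every sample except the first are hypothetical.

```python
import re

# Assumption: country code = "+" followed by 1 to 3 digits.
PHONE_PATTERN = re.compile(r"\+\d{1,3}-\d{9,10}")

def is_valid_phone(number: str) -> bool:
    """Return True if the whole string fits the assumed format."""
    return PHONE_PATTERN.fullmatch(number) is not None

# Only the first sample comes from the task; the rest are made up.
samples = ["+998-987654321", "+91-9876543210", "998-987654321",
           "+998987654321", "+998-12345"]
for sample in samples:
    print(sample, "->", "VALID" if is_valid_phone(sample) else "INVALID")
```

`fullmatch` is used instead of `search` so that extra characters before or after the number make it invalid.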

9. Split the string by multiple delimiters.

"apple,orange;banana|grape"

10. Replace all the digits with *

"Card number: 1234-5678-9012"

11. Extract domain names from:

"http://google.com and https://openai.com"

12. Find duplicate words in:

"This is is a test test string"

13. Extract date in format YYYY-MM-DD
    
"Deadline is 2023-12-25"

14. Extract words of length exactly 5

"hello world apple juice"

15. Extract price values only if they are followed by USD

"100 USD, 200 EUR, 300 USD"

16. Validate Username

Validate a username that must start with a letter and can contain letters, digits, or underscores, with a total length between 3 and 15 characters. Write a regex and test it on a list of sample usernames.
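A sketch of one way to express these rules. The first character is a letter and the remaining 2 to 14 characters are letters, digits, or underscores, which gives a total length of 3 to 15. The helper name and the sample usernames are hypothetical, and the pattern assumes ASCII letters only.

```python
import re

# One letter, then 2 to 14 letters, digits, or underscores (3 to 15 total).
# Assumption: only ASCII letters count as letters.
USERNAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{2,14}")

def is_valid_username(name: str) -> bool:
    """Return True if the whole string satisfies the username rules."""
    return USERNAME_PATTERN.fullmatch(name) is not None

# Made-up sample usernames for illustration.
samples = ["Bob", "alice_01", "al", "1alice", "bad-name", "a_very_long_username"]
for sample in samples:
    print(sample, "->", "VALID" if is_valid_username(sample) else "INVALID")
```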

17. Strong password validation

Validate a strong password that must be at least 8 characters long and contain at least one lowercase letter, one uppercase letter, one digit, and one special character from [ @ # $ % ! ]. Test the regex on sample passwords.

Task 17 - Strong password validation:
Abcdef1! -> VALID
abcdef1! -> INVALID
ABCDEF1! -> INVALID
Abcdefgh -> INVALID
Ab1! -> INVALID
Abcd1234$ -> VALID
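
The results listed above can be reproduced with a sketch like the following. It is one possible pattern, not necessarily the one originally used: each lookahead checks a single requirement without consuming characters. The helper name `is_strong_password` is hypothetical.

```python
import re

PASSWORD_PATTERN = re.compile(
    r"(?=.*[a-z])"     # at least one lowercase letter
    r"(?=.*[A-Z])"     # at least one uppercase letter
    r"(?=.*\d)"        # at least one digit
    r"(?=.*[@#$%!])"   # at least one of the listed special characters
    r".{8,}"           # at least 8 characters in total
)

def is_strong_password(password: str) -> bool:
    """Return True if the whole password satisfies every rule."""
    return PASSWORD_PATTERN.fullmatch(password) is not None

samples = ["Abcdef1!", "abcdef1!", "ABCDEF1!", "Abcdefgh", "Ab1!", "Abcd1234$"]
for sample in samples:
    print(sample, "->", "VALID" if is_strong_password(sample) else "INVALID")
```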



18. Extract hashtags from the string:

"Learning #Python and #Regex is #fun!".

19. Extract IPv4 addresses (simple pattern)

"Valid: 192.168.0.1, 10.0.0.5; Invalid: 256.100.0.1, 123.045.067.089"
For simplicity, just match the basic pattern ddd.ddd.ddd.ddd (no strict range checking).
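
A sketch of the simple pattern described above: four groups of 1 to 3 digits separated by dots. Because there is no range checking, the addresses labelled Invalid in the string are matched as well.

```python
import re

text = "Valid: 192.168.0.1, 10.0.0.5; Invalid: 256.100.0.1, 123.045.067.089"

# 1 to 3 digits, then three repetitions of ".digits".
# The group is non-capturing so findall returns the whole address.
ip_candidates = re.findall(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", text)

print(ip_candidates)
# ['192.168.0.1', '10.0.0.5', '256.100.0.1', '123.045.067.089']
```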

20. Extract all HTML tags from the string:

21. Normalize spaces

Replace multiple spaces with a single space in the string:
"This is a text with extra spaces.".

22. Words ending with ing

Find all words ending with 'ing' in the string:
"I am learning coding and enjoying debugging while running and testing.".

23. Words containing a digit

Extract all words that contain at least one digit from the string:
"user1, test, data2, value99, sample".

24. Parse log line

From log lines like
"2023-10-05 14:23:01 [ERROR] Failed to connect"
extract the date, time, log level, and message using capturing groups.
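
One way to do this, sketched with four capturing groups. It assumes every log line follows exactly the layout of the sample, with single spaces between the fields.

```python
import re

log_line = "2023-10-05 14:23:01 [ERROR] Failed to connect"

# Groups: (date) (time) [(level)] (message)
log_pattern = re.compile(
    r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) \[(\w+)\] (.*)"
)

match = log_pattern.match(log_line)
if match:
    date, time, level, message = match.groups()
    print(date)     # 2023-10-05
    print(time)     # 14:23:01
    print(level)    # ERROR
    print(message)  # Failed to connect
```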

25. Remove HTML tags

Remove all HTML tags from the string:

In [56]:
string = "<p>This is <b>bold</b> and <i>italic</i> text.</p>"

26. Extract URLs

Extract all URLs from the string:

"Visit https://example.com or http://test.org for more info.".

27. Words starting and ending with same letter

Find all words that start and end with the same letter (case-insensitive) in the string:

"Anna went to Asia to see civic events.".

28. Validate Indian PIN code

Validate an Indian PIN code: it should be exactly 6 digits and cannot start with 0. Test the regex on a list of sample PIN codes.
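
A minimal sketch of the two rules: the first digit is 1 to 9 and the remaining five are any digit. The helper name and the sample PIN codes are hypothetical.

```python
import re

# First digit 1-9, followed by exactly five more digits.
PIN_PATTERN = re.compile(r"[1-9][0-9]{5}")

def is_valid_pin(pin: str) -> bool:
    """Return True if the whole string is a 6-digit PIN not starting with 0."""
    return PIN_PATTERN.fullmatch(pin) is not None

# Made-up sample PIN codes for illustration.
samples = ["110001", "560034", "012345", "12345", "1234567", "56A001"]
for sample in samples:
    print(sample, "->", "VALID" if is_valid_pin(sample) else "INVALID")
```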

29. Extract all sequences of vowels (a, e, i, o, u, case-insensitive) from the string:

"Beautiful queueing is unusual" (treat y as a consonant).

30. Word before a number (lookaround)
Extract the word that appears immediately before each number in the string:

"Item1 costs 100, Item2 costs 250, and Service3 is 300".