<a href="https://colab.research.google.com/github/MK316/Spring2024/blob/main/Corpus/RE_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regular expressions (regex)

+ Regular expressions (regex) are powerful tools for finding patterns within text.
+ In Python, the re module provides full support for Perl-like regular expressions.
+ We'll cover some basics like searching for patterns, replacing text, and splitting strings based on regex patterns.

## 1. Importing the re Module

First, import the re module. This module is part of Python's standard library, so you don't need to install anything extra in Colab.

In [None]:
import re

## 2. Searching for patterns

The **re.search()** function scans the string for the first location where the pattern matches and returns a Match object. If there is no match, it returns None.

```
re.search()
```

- .group() returns the part of the string where there was a match.
- .start() returns the start position of the match.
- .end() returns the end position of the match (the index just after the last matched character).
- .span() returns a tuple containing the start and end positions of the match.

In [None]:
# data

pattern = "Python"
text = "Learning Python with regex is fun."

In [None]:
match = re.search(pattern, text)

if match:
    print("Found:", match.group())
else:
    print("No match found.")


In [None]:
result = re.search("Python", text)

print(result)
print(result.group())
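
The position methods listed above can be tried on the same sentence. This is a minimal sketch; the variable names (`start`, `end`, `no_match`) and the second search term "Java" are chosen here for illustration only.

```python
import re

text = "Learning Python with regex is fun."
match = re.search("Python", text)

# Position information for the match
print(match.start())  # 9
print(match.end())    # 15
print(match.span())   # (9, 15)

# Slicing the original text with the span gives back the matched text
start, end = match.span()
print(text[start:end])  # Python

# When nothing matches, re.search() returns None, so check before calling .group()
no_match = re.search("Java", text)
print(no_match)  # None
```

Because a failed search returns None, calling .group() on the result without an `if` check raises an AttributeError.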

## 3. Finding all matches

The **re.findall()** function returns a list of all non-overlapping matches of the pattern in the string.

```
re.findall()
```

In [None]:
text = "There are 15 apples and 10 oranges."

In [None]:
pattern = r"\d+"  # One or more digits (raw string, so the backslash reaches re unchanged)

matches = re.findall(pattern, text)
print(matches)  # Outputs: ['15', '10']
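
Note that re.findall() returns strings, not numbers. A small sketch of converting the matches, assuming you want to do arithmetic with them (the names `numbers` and `empty` are illustrative):

```python
import re

text = "There are 15 apples and 10 oranges."

# findall() returns strings, so convert them before doing arithmetic
numbers = [int(m) for m in re.findall(r"\d+", text)]
print(numbers)       # [15, 10]
print(sum(numbers))  # 25

# When nothing matches, findall() returns an empty list rather than None
empty = re.findall(r"\d+", "No digits here.")
print(empty)  # []
```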


## 4. Splitting a string

The **re.split()** function splits the string where there is a match and returns a list of strings.

```
re.split()
```

In [None]:
text = "Split this sentence into words."

In [None]:
pattern = r"\s+"  # Split on one or more whitespace characters

words = re.split(pattern, text)
print(words)
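
Splitting on whitespace alone leaves the final period attached to the last word. The sketch below shows one possible way to handle that, plus the optional `maxsplit` argument; the punctuation class `[\s.]+` is an assumption about what should count as a separator.

```python
import re

text = "Split this sentence into words."

# Whitespace only: the period stays attached to the last word
print(re.split(r"\s+", text))
# ['Split', 'this', 'sentence', 'into', 'words.']

# Whitespace or periods: a separator at the very end leaves an empty string
parts = re.split(r"[\s.]+", text)
print(parts)
# ['Split', 'this', 'sentence', 'into', 'words', '']

# Drop the empty strings
words = [p for p in parts if p]
print(words)
# ['Split', 'this', 'sentence', 'into', 'words']

# maxsplit limits the number of splits
first_split = re.split(r"\s+", text, maxsplit=1)
print(first_split)
# ['Split', 'this sentence into words.']
```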


## 5. Replacing text

The **re.sub()** function replaces the matches with the text of your choice.

```
re.sub()
```

In [None]:
pattern = "blue"
replace_with = "red"
text = "The sky is blue."

new_text = re.sub(pattern, replace_with, text)
print("Original:", text)
print("Replaced:", new_text)  # Outputs: Replaced: The sky is red.


In [None]:
re.sub("blue", "red", "The sky is blue.")
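
re.sub() also accepts a regex pattern rather than a fixed word, and an optional `count` argument that limits how many matches are replaced. A short sketch with an invented example sentence:

```python
import re

text = "Room 101, floor 3"

# Replace every run of digits
all_replaced = re.sub(r"\d+", "#", text)
print(all_replaced)  # Room #, floor #

# Replace only the first match
first_only = re.sub(r"\d+", "#", text, count=1)
print(first_only)  # Room #, floor 3

# The original string is never modified; sub() returns a new string
print(text)  # Room 101, floor 3
```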

## 6. Compiling a pattern

If you use the same pattern many times, compiling it once with re.compile() gives you a reusable pattern object and can be more efficient.

In [None]:
pattern = re.compile(r"\d+")  # Compiling a pattern to find digits

# Searching with the compiled pattern
text = "123 and 456"
matches = pattern.findall(text)
print(matches)  # Outputs: ['123', '456']
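
A compiled pattern has the same methods as the module-level functions (search, findall, sub, split), so it can be reused across many strings. A minimal sketch; the sample `lines` are made up for illustration:

```python
import re

digits = re.compile(r"\d+")  # compile once

lines = ["123 and 456", "no numbers here", "room 7"]

# Reuse the same compiled pattern on every line
for line in lines:
    print(line, "->", digits.findall(line))
# 123 and 456 -> ['123', '456']
# no numbers here -> []
# room 7 -> ['7']

# The other functions are available as methods too
print(digits.search("room 7").group())  # 7
print(digits.sub("#", "123 and 456"))   # # and #
```

The re module also caches recently used patterns internally, so for a handful of calls the speed difference is usually small; compiling mainly helps keep the pattern in one named place.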


## Practice

Try these exercises:

+ Email Extraction: Write a regex to extract and print all email addresses from a string. Assume emails are in the format username@domain.com.

+ URL Extraction: Extract all URLs from a string. Assume URLs start with http:// or https://.

+ Phone Number Validation: Validate a list of phone numbers to match a specific pattern, e.g., (123) 456-7890.

### Additional Tips
+ Regular expressions can get complex, and their readability might decrease as their complexity increases. Always comment your regex patterns to explain their purpose.

+ Test your regular expressions using online tools like [regex101.com](https://regex101.com/) to understand how they work.

+ This guide provides a starting point for practicing regular expressions in Python. As you become more comfortable, try creating more complex patterns to match specific text formats.

## Most common characters used in regular expressions

### [1] Repeaters, the wildcard, and position anchors

+ The asterisk ( * ):
The asterisk is known as a repeater symbol, meaning the preceding character can be found 0 or more times. For example, the regular expression ca*t will match the strings ct, cat, caat, caaat, etc.
+ The plus symbol ( + ):
The plus symbol is also a repeater symbol, but instead means that the preceding character can be found 1 or more times. So, ca+t would match all of the same patterns as it would with an asterisk, except ct. ct would not be a match because the a character appears 0 times.
+ The wildcard ( . ):
The wildcard matches any single character (except a newline, by default) and is often combined with a repeater to cover a large range of patterns. One use is to target all emails under a certain domain. For example, the regular expression .*@hallme.com would match any email with the hallme.com domain, since the wildcard ( . ) means that any character is valid and the asterisk ( * ) means that it can appear any number of times. Together, the two characters say that any string of any length is valid. Note that the dot in hallme.com is itself a wildcard in this pattern; to match only a literal dot, put the escape symbol in front of it (see [4]).

+ The curly braces ( { } ):
The curly braces are the last of the repeater symbols, and are used to specify a number of times a character can appear in a string. There are a couple of different uses for the curly braces, since they can be used to specify a minimum and maximum number of characters as well. For example, the regular expression ca{1}t would only match the string cat, while the regular expression ca{1,4}t would match the strings cat, caat, caaat, caaaat.

+ The optional character ( ? ):
The optional character means that the preceding character is exactly that, optional. This is particularly useful with file extensions or URL protocols. For example, the regular expression docx? means that both doc and docx would be a match, and the regular expression https? means that both http and https would be a match.

+ The caret ( ^ ) and the dollar sign ( \$ ):
The caret and the dollar sign are both used to indicate position in a string. The caret symbol is used to signify that the match must start at the beginning of the string, while the dollar sign means that the match must occur at the end of the string. For example, if you wished to find all phone numbers with the area code 207, you could use the regular expression ^207. If you instead wished to find all phone numbers ending with 1234, you could use the regular expression 1234$.
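
The symbols in this list can be checked in Python. The sketch below uses a small helper, `matching()`, which is a hypothetical name introduced here; it relies on re.fullmatch() so that a candidate counts only when the whole string fits the pattern. The phone numbers are invented sample data.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

words = ["ct", "cat", "caat", "caaat", "caaaat", "caaaaat"]

print(matching(r"ca*t", words))      # ['ct', 'cat', 'caat', 'caaat', 'caaaat', 'caaaaat']
print(matching(r"ca+t", words))      # ['cat', 'caat', 'caaat', 'caaaat', 'caaaaat']
print(matching(r"ca{1}t", words))    # ['cat']
print(matching(r"ca{1,4}t", words))  # ['cat', 'caat', 'caaat', 'caaaat']

# Optional character
print(matching(r"docx?", ["doc", "docx", "docs"]))        # ['doc', 'docx']
print(matching(r"https?", ["http", "https", "httpss"]))   # ['http', 'https']

# Wildcard plus repeater; the dot before "com" is escaped to mean a literal dot
emails = ["info@hallme.com", "info@example.com"]
print(matching(r".*@hallme\.com", emails))  # ['info@hallme.com']

# Anchors: ^ for the start of the string, $ for the end
phones = ["207-555-1234", "603-207-9999", "207-555-0000"]
print([p for p in phones if re.search(r"^207", p)])   # ['207-555-1234', '207-555-0000']
print([p for p in phones if re.search(r"1234$", p)])  # ['207-555-1234']
```

Without the caret, the pattern 207 would also find the 207 in the middle of the second number.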



### [2] Character classes: These can be used for matching common characters used in strings.

[regular expression HOWTO](https://docs.python.org/3/howto/regex.html)

> \d

Matches any decimal digit; this is equivalent to the class [0-9].

> \D

Matches any non-digit character; this is equivalent to the class [^0-9].

> \s

Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

> \S

Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

> \w

Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].

> \W

Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].

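These sequences can be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, or ',' or '.'.

Note that for ordinary Python 3 strings, \d, \s, and \w also match the corresponding Unicode characters (for example, accented letters count as \w). The bracketed classes above are exact equivalents only when the re.ASCII flag is set.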
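
A quick way to get a feel for these classes is to run them with re.findall(), re.split(), and re.sub(). The sample strings below are invented for illustration, and the last two lines show the effect of the re.ASCII flag on \w.

```python
import re

text = "Room 12, floor 3.\tCall 555-0199"

single_digits = re.findall(r"\d", text)
print(single_digits)  # ['1', '2', '3', '5', '5', '5', '0', '1', '9', '9']

digit_runs = re.findall(r"\d+", text)
print(digit_runs)     # ['12', '3', '555', '0199']

word_runs = re.findall(r"\w+", text)
print(word_runs)      # ['Room', '12', 'floor', '3', 'Call', '555', '0199']

# \D removes everything that is not a digit
only_digits = re.sub(r"\D", "", "555-0199")
print(only_digits)    # 5550199

# A class that combines \s with literal characters
pieces = re.split(r"[\s,.]+", "one, two.three  four")
print(pieces)         # ['one', 'two', 'three', 'four']

# By default \w matches Unicode letters; re.ASCII restricts it to [a-zA-Z0-9_]
print(re.findall(r"\w+", "caf\u00e9"))                  # ['café']
print(re.findall(r"\w+", "caf\u00e9", flags=re.ASCII))  # ['caf']
```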

### [3] The square brackets ( [ ] ):
The square brackets are used to match sets of characters, and are quite versatile. You can simply include the characters you wish to match, such as [aeo]. So with a regular expression such as b[aeo]lt, the pattern would match the words balt, belt, and bolt. A caret placed right after the opening bracket negates the set, so a slightly different expression such as b[^aeo]lt would match b, then any single character other than a, e, or o, then lt: bilt or bult would match, but balt, belt, and bolt would not. The square brackets can also be used for ranges of characters. The expression [a-zA-Z] would match any alphabetic character regardless of case.
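
The three uses of square brackets can be sketched as follows. The word list and the sample sentence are made up for this example, and re.fullmatch() is used so that the whole word has to fit the pattern.

```python
import re

words = ["balt", "belt", "bolt", "bilt", "built", "b9lt"]

# A set of allowed characters
in_set = [w for w in words if re.fullmatch(r"b[aeo]lt", w)]
print(in_set)      # ['balt', 'belt', 'bolt']

# A negated set: any single character except a, e, or o
not_in_set = [w for w in words if re.fullmatch(r"b[^aeo]lt", w)]
print(not_in_set)  # ['bilt', 'b9lt']

# A range of characters, repeated with +
letters = re.findall(r"[a-zA-Z]+", "Call 555-0199 or ask Kim.")
print(letters)     # ['Call', 'or', 'ask', 'Kim']
```

Note that the negated set still requires exactly one character in that position, which is why built (two characters between b and lt) does not match.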

### [4] The escape symbol ( \ ):
An obvious side effect of using characters as pattern matchers is that you then can’t use those characters as literal characters. The escape symbol solves this problem. If you want to match any character that also exists as a pattern matcher, simply put the escape symbol before it. For example, the expression 2\+2 would match the string 2+2.
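
The difference between an escaped and an unescaped plus sign can be seen directly. This sketch also uses re.escape(), a standard-library helper that adds the escape symbols for you; the file name is an invented example.

```python
import re

# Escaped: + is a literal plus sign
escaped = re.search(r"2\+2", "2+2=4")
print(escaped.group())  # 2+2

# Unescaped: 2+ means "one or more 2s", so the pattern needs at least two 2s in a row
unescaped = re.search(r"2+2", "2+2=4")
print(unescaped)  # None
print(re.search(r"2+2", "222").group())  # 222

# re.escape() builds a pattern that matches a piece of text literally
print(re.escape("2+2"))  # 2\+2
literal = re.escape("file.txt")
print(re.fullmatch(literal, "file.txt") is not None)  # True
print(re.fullmatch(literal, "fileXtxt") is not None)  # False
```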

### [5] The parentheses ( ( ) ):
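The parentheses are used to group symbols together so that they act as a single unit. For example, in (ha)+ the plus applies to the whole group ha, so the pattern matches ha, haha, hahaha, and so on.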
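
A short sketch of grouping in Python. In addition to acting as a unit, each pair of parentheses also captures the text it matched, which the Match object exposes through .group(n) and .groups(). The phone-number string is invented for illustration.

```python
import re

# Without parentheses, + applies only to the final "a"
print(re.fullmatch(r"ha+", "haaa") is not None)      # True
print(re.fullmatch(r"ha+", "hahaha") is not None)    # False

# With parentheses, + applies to the whole unit "ha"
print(re.fullmatch(r"(ha)+", "hahaha") is not None)  # True

# Each group also captures the part of the text it matched
m = re.search(r"(\d{3})-(\d{4})", "Call 555-0199 today")
print(m.group(0))  # 555-0199  (the whole match)
print(m.group(1))  # 555
print(m.group(2))  # 0199
print(m.groups())  # ('555', '0199')
```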




### [6] The vertical bar ( | ):
The vertical bar acts as an OR symbol, and is used to designate multiple options for a match. For example, the regular expression th(e|ere|eir|is|at) would match all of the following: the, there, their, this, and that.
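
The alternation above can be tested word by word. This is a minimal sketch; the extra words then and thus are added here only to show non-matches.

```python
import re

pattern = r"th(e|ere|eir|is|at)"

candidates = ["the", "there", "their", "this", "that", "then", "thus"]
full_matches = [w for w in candidates if re.fullmatch(pattern, w)]
print(full_matches)  # ['the', 'there', 'their', 'this', 'that']

# With re.search(), the alternatives are tried from left to right and the
# first one that succeeds wins, so only "the" is matched inside "there"
partial = re.search(pattern, "there")
print(partial.group())  # the
```

If the longer options should win in a search, one option is to list them first, for example th(ere|eir|e|is|at).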

It is not uncommon to read through a quick article on regular expressions and feel like you have a solid understanding of them. Then, you see an actually complex regular expression and become completely overwhelmed. Luckily, with regular expressions we can take the time to work through them character by character to figure out exactly what they mean. Let’s work through one now.

^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$

[Find the description here](https://www.hallme.com/blog/the-power-of-regular-expressions/)
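
The email pattern above can also be explored in Python. The sketch below compiles it and prints the three captured groups; the sample addresses and the name `email_pattern` are invented for illustration, and the pattern is a teaching example rather than a complete email validator.

```python
import re

email_pattern = re.compile(
    r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$"
)

# Group 1: user name, group 2: domain, group 3: top-level domain (2 to 5 letters)
m = email_pattern.search("jane.doe@mail.example.com")
print(m.groups())  # ('jane.doe', 'mail.example', 'com')

candidates = ["info@hallme.com", "no-at-sign.com", "a@b.c", "user@site.museum"]
results = {c: bool(email_pattern.search(c)) for c in candidates}
for address, ok in results.items():
    print(address, ok)
# info@hallme.com True
# no-at-sign.com False
# a@b.c False
# user@site.museum False
```

The last address is rejected only because {2,5} limits the final part to five letters, which shows how a pattern's assumptions decide what counts as valid.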