## <p style = "text-align: center" >REGULAR EXPRESSIONS</p>
<hr>

### What is RegEx? 


- A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.

- RegEx can be used to check if a string contains the specified search pattern.
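As a minimal sketch of that idea, the snippet below checks whether a string contains a pattern using Python's built-in `re` module. The sample sentence and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration
sentence = "The rain in Spain"

# re.search returns a Match object if the pattern occurs anywhere, otherwise None
contains_ai = re.search(r"ai", sentence) is not None
contains_xyz = re.search(r"xyz", sentence) is not None

print(contains_ai)   # True
print(contains_xyz)  # False
```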

### Applications of RegEx


1. Find links in web pages

2. Parse email addresses

3. Remove unwanted strings or characters
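The three applications listed above can be sketched roughly as follows. The input strings are invented for this illustration, and the patterns are deliberately simple: they are not a full HTML parser or a complete email validator.

```python
import re

# Hypothetical inputs, invented for this illustration
html = '<a href="https://example.com">Home</a> <a href="http://example.org/docs">Docs</a>'
contact = "Reach us at support@example.com or sales@example.org."
messy = "Hello!!! $$World$$ 123"

# 1) Find links: capture whatever sits inside href="..."
links = re.findall(r'href="([^"]+)"', html)

# 2) Parse email addresses: a simplified pattern, not a full validator
emails = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", contact)

# 3) Remove unwanted characters: keep only letters, digits and spaces
cleaned = re.sub(r"[^A-Za-z0-9 ]", "", messy)

print(links)    # ['https://example.com', 'http://example.org/docs']
print(emails)   # ['support@example.com', 'sales@example.org']
print(cleaned)  # Hello World 123
```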


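Python ships with the built-in re module, so nothing needs to be installed for the examples in this notebook. The third-party regex package, which is compatible with re and adds extra features, is optional and can be installed with pip or conda:

- pip install regex
- conda install -c conda-forge regex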




In [8]:
#!pip install regex

In [None]:
# If you have a conda environment in place
# !conda install -c conda-forge regex

Regular expressions are available in Python through the built-in '<span style = "color:blue">re</span>' module, which needs no installation:

In [2]:
import re

### Functions in RegEx

Some of the methods that we have in regex include:

<img src = "Screenshot 2023-09-20 144748.png"   margin-right = "auto">


Pattern - a sequence of literal characters and special symbols that describes the text (words, digits, punctuation) to be matched.
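As a rough illustration, here are four functions that the `re` module provides, applied to one made-up string. This is only a sketch of common usage and is not meant to reproduce the list in the screenshot.

```python
import re

text = "one, two, three"  # hypothetical sample string

first = re.search(r"\w+", text)          # first match anywhere in the string
all_words = re.findall(r"\w+", text)     # list of all non-overlapping matches
parts = re.split(r",\s*", text)          # split on a comma plus optional whitespace
replaced = re.sub(r",\s*", " | ", text)  # replace every separator

print(first.group())  # one
print(all_words)      # ['one', 'two', 'three']
print(parts)          # ['one', 'two', 'three']
print(replaced)       # one | two | three
```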

## The Difference Between Match and Search

When we use the same pattern with both functions, we find that search returns a match object, because it looks for the pattern anywhere within the string.


In [6]:
print(re.search('c', 'abcde'))

<re.Match object; span=(2, 3), match='c'>


The same is not true for match, which only tries to match the pattern at the very beginning of the string. Because 'abcde' does not start with 'c', it returns None:

In [8]:
print(re.match('c', 'abcde'))

None
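To make the contrast concrete, the sketch below runs both functions on the same string with patterns that do and do not sit at the start. The variable names are chosen here for illustration only.

```python
import re

s = "abcde"

match_start = re.match(r"ab", s)       # 'ab' is at the start, so this matches
match_middle = re.match(r"bc", s)      # 'bc' is not at the start, so this is None
search_middle = re.search(r"bc", s)    # search scans the whole string and finds 'bc'
search_anchored = re.search(r"^bc", s) # ^ anchors search to the start, so this is None

print(match_start.span())    # (0, 2)
print(match_middle)          # None
print(search_middle.span())  # (1, 3)
print(search_anchored)       # None
```

In other words, match behaves like search with the pattern anchored at the beginning of the string.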


## What are Metacharacters and their use in Regular Expressions?

Metacharacters are special characters that have a predefined meaning in regular expressions.

<img src = "regex-guide.jpg">
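A few commonly used metacharacters are sketched below on a made-up string. This is a small, non-exhaustive sample and does not necessarily cover everything shown in the image.

```python
import re

text = "cat cot cut c.t 2024"  # hypothetical sample string

any_char = re.findall(r"c.t", text)       # . matches any single character
literal_dot = re.findall(r"c\.t", text)   # \. matches a literal dot only
char_set = re.findall(r"c[ao]t", text)    # [ao] matches either 'a' or 'o'
digits = re.findall(r"\d+", text)         # \d+ matches one or more digits
at_start = re.findall(r"^cat", text)      # ^ anchors the match to the start
at_end = re.findall(r"\d+$", text)        # $ anchors the match to the end
either = re.findall(r"cat|cut", text)     # | means "either side"

print(any_char)     # ['cat', 'cot', 'cut', 'c.t']
print(literal_dot)  # ['c.t']
print(char_set)     # ['cat', 'cot']
print(digits)       # ['2024']
print(at_start)     # ['cat']
print(at_end)       # ['2024']
print(either)       # ['cat', 'cut']
```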

The difference between search and findall is that findall returns every non-overlapping match as a list of strings, whereas search returns a match object for the first match only.

In [19]:
text = "Hello, my name is John. John is my name"

pattern = r"John"

print(re.findall(pattern, text))
print(re.search(pattern, text))

['John', 'John']
<re.Match object; span=(18, 22), match='John'>
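The following sketch shows what "non-overlapping" means for findall, and uses `re.finditer` to recover the position of every match, which findall alone does not report. The string 'aaaa' is an invented example.

```python
import re

# 'aa' could start at index 0, 1 or 2, but findall never reuses characters,
# so only the matches at index 0 and index 2 are returned
pairs = re.findall(r"aa", "aaaa")
print(pairs)  # ['aa', 'aa']

text = "Hello, my name is John. John is my name"

# finditer yields one Match object per occurrence, so spans are available
spans = [m.span() for m in re.finditer(r"John", text)]
print(spans)  # [(18, 22), (24, 28)]
```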


<img src = "Patterns.png" width = 800px>

In [34]:
import re

my_str = "Match Lowercases Spaces Nums like 12, and no caps, but no commas"

print(re.findall('[a-z0-9]+', my_str)) # + matches one or more characters from the set (greedy by default)

['atch', 'owercases', 'paces', 'ums', 'like', '12', 'and', 'no', 'caps', 'but', 'no', 'commas']
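Since the + quantifier is greedy by default, it can help to compare it with its lazy form, +?. The sketch below uses an invented HTML-like string purely for illustration.

```python
import re

snippet = "<b>bold</b> and <i>italic</i>"  # hypothetical sample string

# Greedy: .+ grabs as much as possible, so one match runs to the last '>'
greedy = re.findall(r"<.+>", snippet)

# Lazy: .+? grabs as little as possible, so each tag is matched separately
lazy = re.findall(r"<.+?>", snippet)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```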


## Example 

### Finding Domain Names using Regex

Suppose we have a list of strings containing URLs, some of which include the "https://" or "http://" prefix. We want to extract the domain name from each URL, regardless of whether the prefix is present.


In [64]:
# List of URLs 

urls = [
    "https://www.example.com",
    "http://example.org",
    "www.example.net",
    "example.com"
]

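# Regular expression pattern to match the domain name.
# The scheme and the "www." prefix are optional, non-capturing groups;
# the single capturing group holds the domain itself.
domain_Pattern = r"(?:https?://)?(?:www\.)?([a-zA-Z0-9.-]+)"

for url in urls:
    match = re.search(domain_Pattern, url)
    if match:
        print(f"The domain of url: {url} is: {match.group(1)}")


The domain of url: https://www.example.com is: example.com
The domain of url: http://example.org is: example.org
The domain of url: www.example.net is: example.net
The domain of url: example.com is: example.com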


## Note
In regular expressions, when you place a portion of the pattern within parentheses ( ... ), it creates a capturing group.

Groups are numbered from 1, in the order in which their opening parentheses appear in the pattern. Calling group() with no argument (or group(0)) returns the entire match.

In [55]:
import re

text = "John's email is john@example.com and birthdate 21st."
pattern = r"(\w+)'s email is (\w+@\w+\.\w+) and birthdate (\d+\w+)\."

match = re.search(pattern, text)

if match:
    # Access specific capturing groups
    name = match.group(1)
    email = match.group(2)
    birthdate = match.group(3)
    print("Name:", name)
    print("Email:", email)
    print("Birthdate:", birthdate)
else:
    print("No match found")


Name: John
Email: john@example.com
Birthdate: 21st
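As an optional variation on numbered groups, the sketch below gives each group a name with the `(?P<name>...)` syntax and also shows `groups()` and `groupdict()`. The group names `name`, `email` and `birthdate` are choices made for this illustration.

```python
import re

text = "John's email is john@example.com and birthdate 21st."
pattern = (
    r"(?P<name>\w+)'s email is "
    r"(?P<email>\w+@\w+\.\w+) and birthdate "
    r"(?P<birthdate>\d+\w+)\."
)

match = re.search(pattern, text)

if match:
    whole = match.group(0)        # the entire matched text
    all_groups = match.groups()   # tuple of all captured groups, in order
    email = match.group("email")  # access a group by its name
    as_dict = match.groupdict()   # mapping of group name to captured text

    print(whole)       # John's email is john@example.com and birthdate 21st.
    print(all_groups)  # ('John', 'john@example.com', '21st')
    print(email)       # john@example.com
    print(as_dict)     # {'name': 'John', 'email': 'john@example.com', 'birthdate': '21st'}
```

Named groups can still be accessed by number, so `match.group(2)` and `match.group("email")` return the same text here.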


In [41]:
import re

text = "Hello, my name is John."
pattern = r"my name is (\w+)\."

match = re.search(pattern, text)

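if match:
    # The pattern defines only one capturing group, so group(3) does not exist
    # and raises IndexError; group() or group(0) would return the entire match
    try:
        matched_text = match.group(3)
        print("Matched text:", matched_text)
    except IndexError as err:
        print("IndexError:", err)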
else:
    print("No match found")


IndexError: no such group
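One way to avoid this error is to check how many groups a pattern defines before indexing into the match. The sketch below assumes the pattern is compiled first with `re.compile`; the variable names are illustrative.

```python
import re

compiled = re.compile(r"my name is (\w+)\.")
match = compiled.search("Hello, my name is John.")

group_count = compiled.groups  # number of capturing groups in the pattern
entire = match.group(0)        # group 0 always exists and is the whole match
captured = match.group(1)      # the only capturing group in this pattern

print(group_count)  # 1
print(entire)       # my name is John.
print(captured)     # John
```

Any group number from 0 up to `compiled.groups` is safe to request; anything larger raises IndexError.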