## Regular Expressions (Regex) Tutorial: How to Match Any Pattern of Text

[link](https://www.youtube.com/watch?v=sa-TUpSx1JA)



### The Text:


```
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha HaHa

MetaCharacters (Need to be escaped):
.[{()\^$|?*+

coreyms.com

321-555-4321
123.555.1234

Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T

```

### Meta Chars

`.[{()\^$|?*+` are meta characters

`.` is a special char: in regex it matches **any character except a newline**

If we want to search for a literal '.' we have to escape it: `\.`

To match "coreyms.com" use `coreyms\.com`
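
A minimal Python sketch of the difference, using the standard `re` module. The second string, `coreyms-com`, is a made-up example added only to show what an unescaped `.` lets through.

```python
import re

# "coreyms-com" is a hypothetical string, not part of the sample text.
text = "coreyms.com coreyms-com"

# Unescaped '.' matches any character except a newline, so both strings match.
unescaped = re.findall(r"coreyms.com", text)

# Escaped '\.' matches only a literal dot.
escaped = re.findall(r"coreyms\.com", text)

print(unescaped)  # ['coreyms.com', 'coreyms-com']
print(escaped)    # ['coreyms.com']
```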



### Pattern Matching
```

.       - Any Character Except New Line

\d      - Digit (0-9)

\D      - Not a Digit (0-9)

\w      - Word Character (a-z, A-Z, 0-9, _)

\W      - Not a Word Character

\s      - Whitespace (space, tab, newline)

\S      - Not Whitespace (space, tab, newline)


\b      - Word Boundary

\B      - Not a Word Boundary

^       - Beginning of a String

$       - End of a String


[]      - Matches Characters in brackets

[^ ]    - Matches Characters NOT in brackets

|       - Either Or

( )     - Group


Quantifiers:
*       - 0 or More

+       - 1 or More

?       - 0 or One

{3}     - Exact Number

{3,4}   - Range of Numbers (Minimum, Maximum)



#### Sample Regexs ####

[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+

```

Note: For the pattern matching snippets, the uppercase versions **negate** the lowercase ones

For example:

`\d`: matches any digit

`\D`: matches any char that is not a digit
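
As a quick illustration in Python (the input string here is an arbitrary example, not taken from the sample text):

```python
import re

text = "Mr. T 321"  # arbitrary example string

# \d picks out the digits only.
digits = re.findall(r"\d", text)

# \D picks out every character that is not a digit, including '.' and spaces.
non_digits = re.findall(r"\D", text)

print(digits)      # ['3', '2', '1']
print(non_digits)  # ['M', 'r', '.', ' ', 'T', ' ']
```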

#### Word Boundary

![](./img/diag1.png)

\b matches a **word boundary** 

A word boundary is not a char itself. It is the position between a word char (matched by \w: a-z, A-Z, 0-9, _) and a non-word char, or the start/end of the string

![](./img/diag2.png)

![](./img/diag3.png)

Note how _ does not create a word boundary, as it is a word character

![](./img/diag4.png)

We are matching 'Ha' with word boundary on both sides
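
A small Python sketch of these cases on the `Ha HaHa` line from the sample text. It records the start index of each match so the three occurrences of `Ha` can be told apart.

```python
import re

text = "Ha HaHa"  # indices: H=0, a=1, ' '=2, H=3, a=4, H=5, a=6

# \bHa: 'Ha' preceded by a word boundary (start of string or after the space).
boundary_before = [m.start() for m in re.finditer(r"\bHa", text)]

# \BHa: 'Ha' NOT preceded by a word boundary (the second half of 'HaHa').
no_boundary_before = [m.start() for m in re.finditer(r"\BHa", text)]

# \bHa\b: 'Ha' with a word boundary on both sides (only the standalone 'Ha').
boundary_both = [m.start() for m in re.finditer(r"\bHa\b", text)]

print(boundary_before)     # [0, 3]
print(no_boundary_before)  # [5]
print(boundary_both)       # [0]
```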


#### ^ and \$

^Ha will match the Ha at the beginning of the line

Ha$ will match the Ha at the end of the line

![](./img/diag5.png)
![](./img/diag6.png)
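
The same anchors can be tried in Python. Note that Python's `re` module treats `^` and `$` as the start and end of the whole string by default; they only work per line when the `re.MULTILINE` flag is passed. The sketch below uses the single line `Ha HaHa`, so the two behaviours coincide.

```python
import re

text = "Ha HaHa"

# ^Ha only matches the 'Ha' at the very start.
start_match = re.search(r"^Ha", text)

# Ha$ only matches the 'Ha' at the very end.
end_match = re.search(r"Ha$", text)

print(start_match.span())  # (0, 2)
print(end_match.span())    # (5, 7)
```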



#### Matching Phone nos

Phone nos: 3 digits, then . or -, then 3 digits, then . or -, then 4 digits

![](./img/diag7.png)

Here we have used a **character set**

Note: 

- We **did not need to escape the . character within the character set** 

- Even though we have 2 chars in our char set, it only matches one char from the set.
    So if we have a number like 470--555-2750, it would **not** match that

- This is because it matches the first - or . and then moves on to match \d (a digit)
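
A Python sketch of the points above. The screenshot's exact pattern is not reproduced in these notes, so the regex below is an assumption built from the description (3 digits, separator, 3 digits, separator, 4 digits).

```python
import re

text = """321-555-4321
123.555.1234
470--555-2750"""

# Assumed pattern: [-.] matches exactly ONE character, either '-' or '.'.
# The '.' needs no escaping inside the character set.
phone_pattern = r"\d\d\d[-.]\d\d\d[-.]\d\d\d\d"

phones = re.findall(phone_pattern, text)

# The number with a double dash is skipped: after one '-', a digit is required.
print(phones)  # ['321-555-4321', '123.555.1234']
```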

#### Matching Phone nos 2

Now we want to match ph nos starting with 800 or 900

800-555-4321
900-555-4321

![](./img/diag8.png)

We saw that in the character set `[-.]` the '-' matched the actual char '-'. When - is at the beginning or end of the char set, it matches the literal character '-'. But when it is put **between** two chars, it denotes a range of values

```

[1-7]: matches digits between 1 and 7

[a-z]: matches lowercase letters

[a-zA-Z]: matches all letters 

```
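
Both roles of `-` can be seen in a short Python sketch. The pattern for the 800/900 numbers is an assumption based on the description, since the screenshot's regex is not written out in these notes.

```python
import re

text = """800-555-4321
900-555-4321
321-555-4321
123.555.1234"""

# [89] matches a single '8' or '9'; in [-.] the '-' is a literal hyphen.
matches_800_900 = re.findall(r"[89]00[-.]\d\d\d[-.]\d\d\d\d", text)
print(matches_800_900)  # ['800-555-4321', '900-555-4321']

# Between two characters, '-' denotes a range instead of a literal hyphen.
low_digits = re.findall(r"[1-5]", "1234567890")
print(low_digits)  # ['1', '2', '3', '4', '5']
```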

Another special char in the char set is **^**

Outside the char set, ^ matches the beginning of the string

Within the char set, **it negates the set and matches everything which is NOT in the set**

```

[^a-z]: every char that is NOT a lowercase letter

```

Say we have:

cat
mat
pat
bat

We want to match every word that ends with 'at' except bat

![](./img/diag9.png)
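
A minimal Python sketch of this negated set, using the pattern `[^b]at` on the four words above:

```python
import re

words = """cat
mat
pat
bat"""

# [^b] matches any single character that is not 'b', then a literal 'at'.
not_bat = re.findall(r"[^b]at", words)

print(not_bat)  # ['cat', 'mat', 'pat']
```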


### Quantifiers

Everything we have looked at so far has matched a single char at a time

For example, in `[^b]at` we matched a char that was not a 'b', followed by an 'a' and then a 't'

We use quantifiers to match more than one char at a time

```

Quantifiers:
*       - 0 or More
+       - 1 or More
?       - 0 or One
{3}     - Exact Number
{3,4}   - Range of Numbers (Minimum, Maximum)


```

![](./img/diag10.png)
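
As a sketch of what quantifiers buy us, the phone number pattern can be written with `{3}` and `{4}` instead of repeating `\d`. Both patterns below are assumptions reconstructed from the descriptions in these notes, not copied from the screenshot.

```python
import re

text = """321-555-4321
123.555.1234"""

# Same pattern written twice: once with repeated \d, once with quantifiers.
verbose_pattern = r"\d\d\d[-.]\d\d\d[-.]\d\d\d\d"
compact_pattern = r"\d{3}[-.]\d{3}[-.]\d{4}"

verbose_matches = re.findall(verbose_pattern, text)
compact_matches = re.findall(compact_pattern, text)

print(compact_matches)                     # ['321-555-4321', '123.555.1234']
print(verbose_matches == compact_matches)  # True
```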

#### Now we want to match the names starting with Mr.

First we match Mr, then there may be 0 or one '.', then a space, then an uppercase letter, then 0 or more word chars

The regex is `Mr\.?\s[A-Z]\w*`

![](./img/diag11.png)
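
Running this regex in Python against the names from the sample text gives a feel for what `?` and `*` do here:

```python
import re

names = """Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T"""

# \.?   -> optional literal dot
# \s    -> one whitespace char
# [A-Z] -> one uppercase letter
# \w*   -> zero or more word chars (zero is what lets 'Mr. T' match)
mr_names = re.findall(r"Mr\.?\s[A-Z]\w*", names)

print(mr_names)  # ['Mr. Schafer', 'Mr Smith', 'Mr. T']
```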

### Groups

Groups allow us to match several different patterns

We use `()` to denote a group

#### Now we want to match the names starting with Ms or Mrs

After M we want an 's' or 'rs'

`M(s|rs)\.?\s[A-Z]\w*`

![](./img/diag12.png)
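
A Python sketch of the group with alternation. One Python-specific detail: `re.findall` returns only the group contents when the pattern has a capture group, so this sketch reads the full match from each match object instead. The second pattern, `M(r|s|rs)`, is a variation added here for illustration and is not part of the notes above.

```python
import re

names = """Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T"""

# The group (s|rs) accepts either 's' or 'rs' after the 'M'.
pattern = re.compile(r"M(s|rs)\.?\s[A-Z]\w*")

# group(0) is the whole match, regardless of capture groups.
ms_mrs = [m.group(0) for m in pattern.finditer(names)]
print(ms_mrs)  # ['Ms Davis', 'Mrs. Robinson']

# Illustrative variation: adding 'r' as an alternative also picks up 'Mr'.
all_titles = [m.group(0) for m in re.finditer(r"M(r|s|rs)\.?\s[A-Z]\w*", names)]
print(all_titles)
# ['Mr. Schafer', 'Mr Smith', 'Ms Davis', 'Mrs. Robinson', 'Mr. T']
```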



### Matching emails

```

CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net

```

We want to write a regex that matches these emails

For the 1st email we have a mix of uppercase and lowercase letters, then @, then a mix of uppercase and lowercase letters, then .com

Regex: `[a-zA-Z]+@[a-zA-Z]+\.com`

This does not match the 2nd or 3rd address

The 2nd address has a '.' in the first part, so we add a '.' to our char set

Its last part is .edu, so we use a group to allow either ending

Regex: `[a-zA-Z.]+@[a-zA-Z]+\.(com|edu)`

The 3rd email address has - and some digits before the @, and a - in the domain

Regex: `[a-zA-Z0-9.-]+@[a-zA-Z-]+\.(com|edu|net)`

OR `[a-zA-Z\d.-]+@[a-zA-Z-]+\.(com|edu|net)`

**A regex found online**


`[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+` also matches all 3 emails
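
The step-by-step build-up can be checked in Python. This is only a sketch: `full_matches` is a small hypothetical helper (not a library function) that returns the whole match even when the pattern contains a capture group, and the letter ranges are written as `a-zA-Z`.

```python
import re

emails = """CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net"""


def full_matches(pattern, text):
    """Return the full text of every match (group 0), ignoring capture groups."""
    return [m.group(0) for m in re.finditer(pattern, text)]


first = full_matches(r"[a-zA-Z]+@[a-zA-Z]+\.com", emails)
second = full_matches(r"[a-zA-Z.]+@[a-zA-Z]+\.(com|edu)", emails)
third = full_matches(r"[a-zA-Z0-9.-]+@[a-zA-Z-]+\.(com|edu|net)", emails)
online = full_matches(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+", emails)

print(first)        # ['CoreyMSchafer@gmail.com']
print(len(second))  # 2
print(len(third))   # 3
print(len(online))  # 3
```

Matching these three samples says nothing about email addresses in general; each pattern is only as broad as the cases it was built for.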



### Use info captured by groups

```

https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov

```

Some are http, some are https, some have www, some do not

We only want the domain name plus top-level domain, like google.com or youtube.com

1. Write an expr that matches these urls

    - We start with http or https, then ://, then www. can occur 0 or 1 times, then a word char 1 or more times, then '.', then a word char 1 or more times
    The regex is: `https?://(www\.)?\w+\.\w+`
    
    **Note:** the `\` before the '.' is very important. '.' means any char except new line, so if we don't put the `\`, strings like 'https://www$google.com' will also be matched
    
2. Use groups to capture info

    - Let's capture sections like the domain (google) and the top-level domain (.com)
    - We put them in groups by surrounding them with `()`
    - Grouped regex: `https?://(www\.)?(\w+)(\.\w+)`
        It still matches all 4 urls
    - Now we have 3 different groups: 1st: optional (www.), 2nd: domain name, 3rd: top-level domain like .com
    - There is also an implicit group 0. Group 0 is everything we matched. In this case it's the entire url
    
3. Use backreferences to reference our capture groups
    - `$1`: Reference to our 1st group; in some tools it is written `\1`. Our group 1 is the optional (www.)
    ![](./img/diag13.png)
    Replaces each matched url with its optional www. part
    
    - `$2`: second group: domain names 
    ![](./img/diag14.png)
    
    - Getting the final output: cleaning up the urls into [domain][top-level-domain]
    
    ![](./img/diag15.png)
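
The three steps above can be sketched in Python. Python's `re` module writes backreferences in the replacement string as `\1`, `\2`, ... rather than `$1`, `$2`, so the final clean-up uses `\2\3`.

```python
import re

urls = """https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov"""

pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")

# Collect (group 0, group 1, group 2, group 3) for each url.
# Group 1 is None when the optional 'www.' is absent.
captured = [(m.group(0), m.group(1), m.group(2), m.group(3))
            for m in pattern.finditer(urls)]
for row in captured:
    print(row)

# Replace each url with group 2 (domain) followed by group 3 (top-level domain).
cleaned = pattern.sub(r"\2\3", urls)
print(cleaned)
# google.com
# coreyms.com
# youtube.com
# nasa.gov
```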
    
    