# Regular expressions in Python



Sample strings for the examples


In [4]:
lowercase_alphabet = "abcdefghijklmnopqrstuvwxyz"
uppercase_alphabet = lowercase_alphabet.upper()
numbers = "1234567890"
sentence = "The Quick Brown Fox Jumps Over The Lazy Dog"
paragraph = """Once upon a
            time there lived
            3 bears. 
            They all owed the bank $1000. 
            Ouch!"""
website = "www.medium.com"

special_characters = r"[\^$.|?*+()"

### Built-in string functions we have already used

| Function | Explanation |
|:---|:---|
| `string.split(char)`  | Returns a list of the substrings delimited by `char`  |
| `string.find(other_string)`  | Returns the index of the first occurrence of `other_string`, or -1 if it is not found |
| `string[index1:index2:step]`  | Slices out the substring from `index1` up to (not including) `index2`, taking every `step`-th character (we covered this in PANDS)  |
| `string.isdecimal()`   | Returns True if all characters in the string are decimal digits  |
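
As a quick refresher, the functions in the table can be tried on the sample strings. This is only a sketch; the variable names on the left (`words`, `fox_index` and so on) are made up for the illustration.

```python
sentence = "The Quick Brown Fox Jumps Over The Lazy Dog"
numbers = "1234567890"

words = sentence.split(" ")          # list of the words, split on spaces
fox_index = sentence.find("Fox")     # index where "Fox" starts
cat_index = sentence.find("Cat")     # -1, because "Cat" is not in the sentence
first_word = sentence[0:3]           # characters 0, 1 and 2
every_second = numbers[::2]          # every second character of the whole string

print(words)
print(fox_index, cat_index)
print(first_word, every_second)
print(numbers.isdecimal(), sentence.isdecimal())
```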
  


## Python regex functions in the `re` module
The main functions we will use from the `re` module:

| Function | Explanation |
|:---|:---|
| `findall(pattern, string)`  | Returns a list containing all matches  |
| `search(pattern, string)`  | Returns a Match object for the first match anywhere in the string, or `None` if there is no match  |
| `sub(pattern, replacement, string)`  | Replaces one or many matches (a bit like `sed`)  |
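
The three functions can be compared side by side on the same string. A minimal sketch, assuming we want to swap "The" for "A"; the replacement text and the variable names are just for illustration.

```python
import re

sentence = "The Quick Brown Fox Jumps Over The Lazy Dog"

all_matches = re.findall("The", sentence)             # every match, as a list of strings
first_match = re.search("The", sentence)              # Match object for the first match only
replaced_all = re.sub("The", "A", sentence)           # replace every match
replaced_one = re.sub("The", "A", sentence, count=1)  # replace only the first match

print(all_matches)
print(first_match)
print(replaced_all)
print(replaced_one)
```

The optional `count` argument of `re.sub` limits how many matches are replaced; leaving it out replaces them all.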



## Import the module

In [2]:
import re

## Matching explicit characters


To match characters explicitly, just type what you'd like to find, much like `ctrl+f` in any application. The match is case-sensitive unless you pass the `re.IGNORECASE` flag.



In [None]:
pattern = "Quick"
re.findall(pattern, sentence)

In [None]:
pattern = "quick"
re.findall(pattern, sentence)

In [None]:
pattern = "quick"
re.findall(pattern, sentence, flags=re.IGNORECASE)

`search` returns a Match object that says what was matched and where

In [None]:
pattern = "bears"
re.search(pattern, paragraph)
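
The Match object can be unpacked with its `group()` and `span()` methods. A small sketch, using a shortened one-line copy of the paragraph (an assumption made to keep the example self-contained); remember that `search` returns `None` when nothing matches, so it is worth checking before using the result.

```python
import re

# Shortened, single-line copy of the sample paragraph
paragraph = "Once upon a time there lived 3 bears."

match = re.search("bears", paragraph)
if match is not None:
    print(match.group())   # the text that was matched
    print(match.span())    # (start, end) positions of the match
    print(paragraph[match.start():match.end()])  # slicing with those positions

no_match = re.search("wolves", paragraph)
print(no_match)            # None, because "wolves" is not in the paragraph
```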

## Matching literal characters

Any character except the special characters `[\^$.|?*+()` simply matches itself. To match one of those special characters literally, put a backslash `\` in front of it.

In [5]:
pattern = r"www\.medium\.com"
re.findall(pattern, website)

['www.medium.com']
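
To see why the backslash matters, compare an escaped and an unescaped dot. This is a sketch; the `candidates` string is made up for the illustration. `re.escape` is a standard library helper that adds the backslashes for you.

```python
import re

# Made-up test string: one real address and one look-alike
candidates = "www.medium.com wwwXmediumYcom"

loose = re.findall("www.medium.com", candidates)       # unescaped . matches any character
strict = re.findall(r"www\.medium\.com", candidates)   # escaped \. matches only a dot

escaped_pattern = re.escape("www.medium.com")          # let Python add the backslashes
automatic = re.findall(escaped_pattern, candidates)

print(loose)
print(strict)
print(escaped_pattern)
print(automatic)
```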

## Matching by pattern
There are many ways to match a pattern. Regex has its own syntax, so we can pick and choose how our patterns are built.

### Character Classes
| Class | Explanation |    
|:---|:---|   
| . | any character except newline |   
| \w \d \s | word character (i.e. [a-zA-Z0-9_]), digit, whitespace |
| \W \D \S | not word, digit, whitespace |  
| [abc] | any of a, b, or c | 
| [^abc] | not a, b, or c |
| [a-g] | characters between a & g | 

For example, find the digits in the paragraph

In [None]:
pattern = r"\d"
re.findall(pattern, paragraph)

As opposed to every word character

In [None]:
pattern = r"\w"
re.findall(pattern, paragraph)
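
The bracket classes from the table work the same way. A short sketch on the lowercase alphabet, with two small made-up strings for `\D` and `\s`:

```python
import re

lowercase_alphabet = "abcdefghijklmnopqrstuvwxyz"

any_of_abc = re.findall(r"[abc]", lowercase_alphabet)    # any of a, b or c
a_to_g = re.findall(r"[a-g]", lowercase_alphabet)        # characters between a and g
not_a_to_g = re.findall(r"[^a-g]", lowercase_alphabet)   # everything except a to g

non_digits = re.findall(r"\D", "a1b2")                   # not a digit
whitespace = re.findall(r"\s", "a b\tc")                 # space and tab

print(any_of_abc)
print("".join(a_to_g))
print("".join(not_a_to_g))
print(non_digits, whitespace)
```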

### Anchors
| Class | Explanation |
|:---|:---|
| ^abc$ | start / end of the string |
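| \b | Word boundary (for this to work with `findall`, write the pattern as a raw string such as `r"\bFox\b"`; in a normal string `"\b"` is a backspace character) |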

In [None]:
pattern = "b"
print(re.findall(pattern, lowercase_alphabet))
pattern = "^b"
print(re.findall(pattern, lowercase_alphabet))
pattern = "b$"
print(re.findall(pattern, lowercase_alphabet))
pattern = "z$"
print(re.findall(pattern, lowercase_alphabet))
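
The word boundary `\b` only behaves as an anchor when the pattern is a raw string, because in a normal Python string `"\b"` means a backspace character. A small sketch, using a made-up `text` string:

```python
import re

text = "The Theory Of The Thing"   # made-up example string

anywhere = re.findall("The", text)          # also matches the start of "Theory"
whole_words = re.findall(r"\bThe\b", text)  # raw string: \b is a word boundary
not_raw = re.findall("\bThe\b", text)       # normal string: \b is a backspace, so nothing matches

print(anywhere)
print(whole_words)
print(not_raw)
```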

### Escaped Characters
| Class | Explanation |
|:---|:---|
| \\. \\* \\\ | escaped special characters |
| \\t \\n \\r | tab, linefeed, carriage return |
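
A short sketch of escaped characters in use. The `text` and `windows_path` strings are invented for the illustration; raw strings keep the backslashes readable.

```python
import re

text = "cost: $5.00 (approx)\tper item"   # contains a real tab character
windows_path = r"C:\temp\new"             # raw string, so two real backslashes

prices = re.findall(r"\$\d+\.\d+", text)   # \$ and \. match a literal $ and .
bracketed = re.findall(r"\(\w+\)", text)   # \( and \) match literal brackets
tabs = re.findall(r"\t", text)             # \t matches a tab
backslashes = re.findall(r"\\", windows_path)  # \\ matches one literal backslash

print(prices)
print(bracketed)
print(len(tabs), len(backslashes))
```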

### Groups
| Class | Explanation |
|:---|:---|
| (abc) | capture group |
| \1 | backreference to group #1 |
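
Capture groups let you pull out parts of a match, and a backreference matches the same text again. A sketch with made-up strings; note that when a pattern contains groups, `findall` returns the groups rather than the whole match.

```python
import re

# Pull a phone number apart into its three groups
match = re.search(r"(\d{3})-(\d{3})-(\d{4})", "call 123-456-7890 today")
area_code = match.group(1)     # first group only
all_groups = match.groups()    # all groups as a tuple
whole_match = match.group(0)   # group 0 is the entire match

# \1 must repeat whatever group 1 captured, so this finds doubled words
text = "It was a very very long day, said the the editor"
doubled_words = re.findall(r"\b(\w+) \1\b", text)

print(area_code, all_groups, whole_match)
print(doubled_words)
```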

### Quantifiers & Alternation
| Class | Explanation |
|:---|:---|
| a* a+ a? | 0 or more, 1 or more, 0 or 1 |
| a{5} a{2,} | exactly five, two or more |
| a{1,3} | between one & three |
| a+? a{2,}? | match as few as possible |
| ab\|cd | match ab or cd |
> [Tables from: regexr.com](https://regexr.com/)
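
The difference between greedy and lazy quantifiers is easiest to see on a small example. This is a sketch with invented strings, not a recommended way to parse HTML.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.+>", html)   # .+ grabs as much as possible
lazy = re.findall(r"<.+?>", html)    # .+? grabs as little as possible

two_or_three = re.findall(r"a{2,3}", "a aa aaa aaaa")   # runs of two or three a's
optional_u = re.findall(r"colou?r", "color colour")     # the u is optional
either = re.findall(r"cat|dog", "cat dog bird")         # alternation

print(greedy)
print(lazy)
print(two_or_three, optional_u, either)
```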

### Examples
To find the words in the sentence, use `\w` (a word character) followed by `{1,}` (one or more times), or use the shorthand `+`.

In [None]:
re.findall(r"\w{1,}", sentence)

In [None]:
re.findall(r"\w+", sentence)

In [None]:
re.findall(r"\w+", paragraph)

How about telephone numbers? To find the properly formatted ones, with hyphens:

In [None]:
phone_numbers = """123-456-7890
                    987.654.321 # an ip address
                    234-567-8901
                    654.321.987 # an ip address
                    345-678-9012
                    321.654.9784 # a phone number with a .
                    456-789-012 # badly formatted
                    999.666.333
                    45678   # I don't know what this is !!
                """
re.findall(r"\d{3}-\d{3}-\d{4}", phone_numbers)

Or allow either a hyphen or a dot as the separator

In [None]:
re.findall(r"\d{3}[-.]\d{3}[-.]\d{4}", phone_numbers)

### Messing
This returns more than just the IP addresses.  
How would I fix it to return only the IP addresses?

In [None]:
re.findall(r"\d{3}[-.]\d{3}[-.]\d{3}", phone_numbers)
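
One possible fix, assuming an "ip address" here means exactly three groups of three digits separated by dots, as in the sample data (real IPv4 addresses have four groups, so this is only a sketch for this exercise): allow only dots as separators, and add `\b` at both ends so that the first nine digits of a longer number are not picked up. The string below is a shortened copy of the sample data without the trailing notes.

```python
import re

# Shortened copy of the sample data, without the trailing notes
phone_numbers = """123-456-7890
987.654.321
234-567-8901
654.321.987
321.654.9784
456-789-012
999.666.333
45678"""

# Original pattern: hyphens allowed, and longer numbers are matched in part
original = re.findall(r"\d{3}[-.]\d{3}[-.]\d{3}", phone_numbers)

# Dots only, with word boundaries so a trailing fourth digit rules the match out
dots_only = re.findall(r"\b\d{3}\.\d{3}\.\d{3}\b", phone_numbers)

print(original)
print(dots_only)
```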