# What are Regular Expressions?

Regular expressions (regex) are strings that combine **normal characters** and **special metacharacters** to describe patterns for finding text, or positions, within a text.

This is what a regex looks like:
```
r'Ai\d+\w{1,3}[a-z]\s?'
```
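The r prefix indicates that the string is a raw string, so Python does not interpret the backslashes as escape sequences before the regex engine sees them.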
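To see why the raw string matters, here is a minimal sketch (the variable names and the sample sentence are made up for illustration). Without the r prefix, Python converts \b into a backspace character before the regex engine ever receives the pattern:

```python
import re

plain = '\bword\b'   # Python turns each \b into one backspace character
raw = r'\bword\b'    # the backslash and the 'b' both survive

print(len(plain), len(raw))              # 6 8
print(re.search(plain, 'a word here'))   # None: the text contains no backspaces
print(re.search(raw, 'a word here'))     # a match object for 'word'
```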

You can specify, using metacharacters:
- Position: \b indicates the beginning or the end of a word.
- Quantifiers: they indicate how many times the preceding character or group occurs. '?' means zero or one, '+' means one or more, and '*' means zero or more. You can also use curly brackets: {n} means exactly n times, {min,} means at least min times, {,max} means at most max times, and {min,max} means between min and max times (written without a space after the comma).
- Grouping: parentheses define the scope of a quantifier and mark the part of the match to capture. For example, (abc)+ matches 'abc', 'abcabc', 'abcabcabc'.
- Wildcard: '.' matches any character except a newline.
- Character set: [] matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c".
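The metacharacters listed above can be tried out one at a time. This is only a sketch: the sample strings and variable names are invented for illustration, and it uses the re functions that are introduced in the next section.

```python
import re

# \b anchors at word boundaries, so the 'cat' inside 'concatenate' is skipped
boundary = re.findall(r'\bcat\b', 'cat concatenate cat')

# ? makes the preceding character optional
optional = re.findall(r'colou?r', 'color colour')

# {min,max} bounds the number of repetitions
in_range = re.fullmatch(r'\d{2,4}', '123') is not None
too_long = re.fullmatch(r'\d{2,4}', '12345') is not None

# Parentheses let + repeat the whole group 'abc'
grouped = re.fullmatch(r'(abc)+', 'abcabc') is not None

# [abc] matches one of a, b or c; . matches any character except a newline
pairs = re.findall(r'[abc].', 'ax by cz dw')

print(boundary)            # ['cat', 'cat']
print(optional)            # ['color', 'colour']
print(in_range, too_long)  # True False
print(grouped)             # True
print(pairs)               # ['ax', 'by', 'cz']
```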

We can use regex to find and replace text or to validate strings (emails, passwords, PINs...).

# Re Module

Python has the re module for working with regular expressions. Many other libraries also accept regular expressions in their methods.

In [1]:
import re

Let's start with the basic methods. One of them is ***match***, and here is its syntax:
```
re.match(r'regex', string)
```
It finds the pattern ONLY if it occurs at the beginning of the string.

Another basic method is ***compile***. It turns the pattern string passed as an argument into a pattern object, so that we can assign that pattern to a ***variable*** and reuse it.

In [None]:
text = 'Hello1 Hello2 HelloW Hello!'
regex = re.compile(r'Hello')
print(re.match(regex, text))  # matches: the text starts with 'Hello'
print()
regex = re.compile(r'ello')
print(re.match(regex, text))  # None: 'ello' is not at the beginning

You can also look for an exact match of the whole string using the ***fullmatch*** method

In [None]:
text = 'Hello1 Hello2 HelloW Hello!'
regex = re.compile(r'Hello1 Hello2 HelloW Hello!')
print(re.fullmatch(regex, text))  # matches: the pattern covers the whole string
print()
regex = re.compile(r'Hello')
print(re.fullmatch(regex, text))  # None: 'Hello' covers only part of the string

If you want to look for the first occurrence of the pattern anywhere in the string, you can use ***.search()***. If you want to look for ALL occurrences of the pattern, you can use the ***.findall()*** method.

In [None]:
text = 'Hello1 Hello2 HelloW Hello!'
print(re.findall(r'Hello', text))
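To compare the three methods side by side, here is a small sketch on the same text. The patterns 'Hello2' and 'Hello\w' are chosen only for illustration:

```python
import re

text = 'Hello1 Hello2 HelloW Hello!'

# match only looks at the start of the string, so 'Hello2' is not found
at_start = re.match(r'Hello2', text)
print(at_start)                       # None

# search scans the whole string and returns the first occurrence
first = re.search(r'Hello2', text)
print(first.span(), first.group())    # (7, 13) Hello2

# findall returns every non-overlapping occurrence as a list of strings
every = re.findall(r'Hello\w', text)
print(every)                          # ['Hello1', 'Hello2', 'HelloW']
```

The last 'Hello!' is missing from the findall result because '!' is not matched by \w.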

Let's do something a little bit more **complicated**

In [4]:
text = "_Some people, when confronted with a problem, think: I know, I'll use regular expressions. Now they have two problems."
print(re.findall(r'\s+(\w{4})\s+', text))

['when', 'with', 'they']


__\s+__: let's dissect this into its components.

First, the backslash (\\) indicates that the next character is not a literal character but part of a special sequence. In this case, \s stands for a whitespace character.

Second, we have the '+' sign. It is a quantifier, and it acts on the preceding element. In this case, '+' indicates one or more repetitions of the preceding element.

Thus \s+ matches one or more consecutive whitespace characters.

__(\w{4})__: let's go with this one.

First we can see the parentheses. They mark the group we want to extract from the string. In the example we don't want to extract the whitespace, so we leave it out simply by not including it within the parentheses.

\w matches a word character: a letter, a digit, or an underscore.

{4} is a quantifier. In this case it gives a single number, so it requires exactly 4 repetitions of the preceding element.

Thus (\w{4}) captures exactly 4 consecutive word characters.

The \s+ at the end of the pattern requires at least one whitespace after the word, so the 4-character word we are looking for must be surrounded by whitespace on both sides (at least one before and one after). Because matches cannot overlap, the whitespace consumed at the end of one match is not available to start the next one. That is why 'have' is missing from the output: the space before it already belongs to the 'they' match.
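The difference between the whole match and the captured group can be inspected with re.finditer, which yields match objects instead of strings. This is a sketch on a shortened sample sentence; the names sample and found are hypothetical:

```python
import re

sample = 'Now they have two problems.'

found = []
for m in re.finditer(r'\s+(\w{4})\s+', sample):
    # group(0) is the whole match, whitespace included;
    # group(1) is only the part inside the parentheses
    found.append((m.group(0), m.group(1), m.span()))

print(found)  # [(' they ', 'they', (3, 9))]
```

The match for 'they' ends at position 9, after the space that precedes 'have', so no whitespace is left for a match on 'have' to start with.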

## Exercise

In [6]:
print(re.findall(r'\b([A-Z]\w{3})\b', text))
print(re.findall(r'\b([a-zA-Z]\w{3})\b', text))
print(re.findall(r'\b([a-z]\w{2,5})\b', text))

[]
['when', 'with', 'know', 'they', 'have']
['people', 'when', 'with', 'think', 'know', 'use', 'they', 'have', 'two']


Can you 'dissect' the patterns? Use the returned output as a guide. Tip: \b indicates the beginning or the end of a word

You can also replace the text matched by a pattern with a new string using the .sub() method.
```
re.sub(regex, new_character(s), string_to_be_changed)
```
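As a quick illustration of the syntax, the sketch below replaces every run of digits with a single '#'. The sample string and variable names are made up:

```python
import re

order = 'Card 1234-5678 expires 09/27'

# Every non-overlapping match of \d+ is replaced by '#'
masked = re.sub(r'\d+', '#', order)
print(masked)  # Card #-# expires #/#
```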

## Exercise

Replace the special characters in the string with empty strings to decipher the code.

In [None]:
text = 'N&!}!e}*{@v#{%{e*%%@r@%%# {&&*g!*{@o*#&#n*{&{n{}{{a}%{{ {!##g*}*&i%&@!v*}@&e!}!{ %}*%y#@!&o!*@#u&{{} *#*%u}{#&p@@*{ &#!#n@{{&e@}{%v@{}@e%@&}r%@&& #%!}g&#!}o&}@%n*%%#n*%!!a&#@! {@{@l&!*#e!##%t{*{& %&&!y&}##o{*#%u%*{# }*@}d#@}%o@@&}w%!!!n}!&#'
regex = ### Your code here
print(re.sub(### Your code here))

## Practical Case. Regex in Web Scraping

Regex is extremely useful when scraping for data. Often, we want to extract data whose attributes in the HTML code share common patterns. Let's see an example. *In this case I am using requests and BeautifulSoup, but you can do it using Selenium as well*

In [None]:
import requests
from bs4 import BeautifulSoup
url = 'https://eu.manduka.com/pages/yoga-mats-category'
r = requests.get(url)
r.status_code

BeautifulSoup parses the HTML code into a tree of tags that we can print and navigate. It also allows us to look for specific tags and attributes.

In [None]:
soup = BeautifulSoup(r.text, 'html.parser')
print(soup)

In [10]:
lazy_links = soup.find_all('img', {'class': re.compile('lazy')})

In [11]:
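links = []
img_tag = re.compile('<img')  # compile once, outside the loop
for element in soup.find_all('img', {'class': re.compile('lazy')}):
    html = str(element)
    print(html)
    # Each element found is an <img> tag, so this matches at position 0
    print(re.match(img_tag, html))
    # (.*?) captures the attribute value up to the next double quote
    src = re.findall(r'data-src[a-z]*="(.*?)"', html)[0]
    print(src)
    print()
    links.append(src)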

<img class="lazyload" data-src="https://s2.svgbox.net/social.svg?ic=facebook&amp;color=737374" height="30" width="30"/>
<re.Match object; span=(0, 4), match='<img'>
https://s2.svgbox.net/social.svg?ic=facebook&amp;color=737374

<img class="lazyload" data-src="https://s2.svgbox.net/social.svg?ic=twitter&amp;color=737374" height="30" width="30"/>
<re.Match object; span=(0, 4), match='<img'>
https://s2.svgbox.net/social.svg?ic=twitter&amp;color=737374

<img class="lazyload" data-src="https://s2.svgbox.net/social.svg?ic=youtube&amp;color=737374" height="30" width="30"/>
<re.Match object; span=(0, 4), match='<img'>
https://s2.svgbox.net/social.svg?ic=youtube&amp;color=737374

<img class="lazyload" data-src="https://s2.svgbox.net/social.svg?ic=instagram&amp;color=737374" height="30" width="30"/>
<re.Match object; span=(0, 4), match='<img'>
https://s2.svgbox.net/social.svg?ic=instagram&amp;color=737374

<img class="lazyload" data-src="https://s2.svgbox.net/social.svg?ic=pinterest&amp;color=73
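The (.*?) in the pattern above is a lazy quantifier: it stops at the first closing quote instead of the last one. The sketch below shows the difference without any network access, using a hard-coded tag copied from the output above as a stand-in for a scraped element. The pattern is simplified to data-src only, and the variable names are hypothetical:

```python
import re

# Stand-in for one scraped tag
html = '<img class="lazyload" data-src="https://s2.svgbox.net/social.svg?ic=facebook&amp;color=737374" height="30" width="30"/>'

# Lazy: capture as little as possible, up to the first closing quote
lazy = re.findall(r'data-src="(.*?)"', html)

# Greedy: capture as much as possible, up to the last quote in the tag
greedy = re.findall(r'data-src="(.*)"', html)

print(lazy)
print(greedy)
```

The lazy version returns only the URL, while the greedy version also swallows the height and width attributes.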

# Challenges

## Look for a valid PIN code

In [None]:
# "1234"   -->  true 

# "12345"  -->  false

# "a234"   -->  false

In [16]:
def validate_pin(pin):
    pass ## Your code here

In [None]:
assert(validate_pin("1") == False)
assert(validate_pin("12") == False)
assert(validate_pin("123") == False)
assert(validate_pin("12345") == False)
assert(validate_pin("1234567") == False)
assert(validate_pin("-1234") == False)
assert(validate_pin("1.234") == False)
assert(validate_pin("00000000") == False)
assert(validate_pin("1234 ") == False)
assert(validate_pin(" 1234") == False)
assert(validate_pin("123456 ") == False)

assert(validate_pin("a234") == False)
assert(validate_pin(".234") == False)
assert(validate_pin(".1234") == False)
assert(validate_pin("-123") == False)
assert(validate_pin("-1.234") == False)

assert(validate_pin("1234") == True)
assert(validate_pin("0000") == True)
assert(validate_pin("123456") == True)
assert(validate_pin("098765") == True)

## Look for a valid phone number

In [13]:
# validPhoneNumber("(123) 456-7890")  => returns true
# validPhoneNumber("(1111)555 2345")  => returns false
# validPhoneNumber("(098) 123 4567")  => returns false

In [14]:
import re
def validPhoneNumber(phoneNumber):
    pass # Code here

In [None]:
assert(validPhoneNumber("(123) 456-7890") == True)
assert(validPhoneNumber("(1111)555 2345") == False)
assert(validPhoneNumber("(123) 456 7890") == False)
assert(validPhoneNumber("(852) 111-0000") == True)
assert(validPhoneNumber("(123)456-7890") == False)
assert(validPhoneNumber("123 456-7890") == False)
assert(validPhoneNumber("(123) 4565-7890") == False)
assert(validPhoneNumber("(123) 456-78950") == False)