# LECTURE 2 [REGEX](https://www.w3schools.com/python/python_regex.asp)
---
###  PATTERNS
* `[]` A set of characters, e.g. `"[a-m]"`
* `\` Signals a special sequence (can also be used to escape special characters), e.g. `"\d"`
* `.` Any character (except the newline character), e.g. `"he..o"`
* `^` Starts with, e.g. `"^hello"`
* `$` Ends with, e.g. `"planet$"`
* `*` Zero or more occurrences, e.g. `"he.*o"`
* `+` One or more occurrences, e.g. `"he.+o"`
* `?` Zero or one occurrence, e.g. `"he.?o"`
* `{}` Exactly the specified number of occurrences, e.g. `"he.{2}o"`
* `|` Either or, e.g. `"falls|stays"`
* `()` Capture and group, e.g. `"(cat)"`
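
The quantifiers above are easiest to compare side by side. The following is a minimal sketch, assuming a made-up word list (`words`) and a hypothetical helper `full_matches`; `re.fullmatch` is used so that each word must match from start to end.

```python
import re

# Hypothetical word list, chosen only to exercise the quantifiers.
words = ["heo", "helo", "hello", "hexxxo"]

def full_matches(pattern):
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

star = full_matches(r"he.*o")           # zero or more characters between "he" and "o"
plus = full_matches(r"he.+o")           # one or more characters
optional = full_matches(r"he.?o")       # zero or one character
exactly_two = full_matches(r"he.{2}o")  # exactly two characters

either = re.findall(r"falls|stays", "The rain falls and then stays")
starts = re.findall(r"^hello", "hello world")   # anchored at the start
ends = re.findall(r"planet$", "hello planet")   # anchored at the end

print(star, plus, optional, exactly_two, either, starts, ends)
```

Note that `*` accepts `heo` while `+` does not, because `+` needs at least one character between `he` and `o`.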

---
---
### SPECIAL SEQUENCES

* `\A` Matches if the specified characters are at the beginning of the string, e.g. `"\AThe"`
* `\b` Matches where the specified characters are at the beginning or at the end of a word (the `r` prefix makes sure the pattern is treated as a raw string), e.g. `r"\bain"`, `r"ain\b"`
* `\B` Matches where the specified characters are present, but NOT at the beginning (or at the end) of a word, e.g. `r"\Bain"`, `r"ain\B"`
* `\d` Matches any digit (0-9), e.g. `"\d"`
* `\D` Matches any character that is NOT a digit, e.g. `"\D"`
* `\s` Matches any whitespace character, e.g. `"\s"`
* `\S` Matches any character that is NOT whitespace, e.g. `"\S"`
* `\w` Matches any word character (letters a-z and A-Z, digits 0-9, and the underscore `_`), e.g. `"\w"`
* `\W` Matches any character that is NOT a word character, e.g. `"\W"`
* `\Z` Matches if the specified characters are at the end of the string, e.g. `"Spain\Z"`
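
Two details in the list above are easy to miss: why `\b` needs a raw string, and how `\Z` differs from `$`. The sketch below assumes two made-up sample strings (`sentence` and `line`) and is only meant to illustrate the behaviour.

```python
import re

sentence = "The rain in Spain"   # hypothetical sample text

# In a raw string, \b reaches the regex engine as a word boundary.
raw_hits = re.findall(r"ain\b", sentence)
# In a normal string, "\b" is the backspace character, so nothing matches.
plain_hits = re.findall("ain\b", sentence)

line = "Spain\n"                 # hypothetical text ending in a newline

# $ also matches just before a trailing newline; \Z only at the very end.
dollar_hits = re.findall(r"Spain$", line)
z_hits = re.findall(r"Spain\Z", line)

print(raw_hits, plain_hits, dollar_hits, z_hits)
```

Writing every pattern as a raw string avoids having to remember which escapes Python interprets before the regex engine sees them.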
---
---
### PATTERN SETS

* `[arn]`	Returns a match where one of the specified characters (a, r, or n) is present
* `[a-n]`	Returns a match for any lower case character, alphabetically between a and n
* `[^arn]`	Returns a match for any character EXCEPT a, r, and n
* `[0123]`	Returns a match where any of the specified digits (0, 1, 2, or 3) are present
* `[0-9]`	Returns a match for any digit between 0 and 9
* `[0-5][0-9]` Returns a match for any two-digit number from 00 to 59
* `[a-zA-Z]`	Returns a match for any character alphabetically between a and z, lower case OR upper case
* `[+]` In sets, `+`, `*`, `.`, `|`, `()`, `$`, `{}` have no special meaning, so `[+]` means: return a match for any `+` character in the string
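
A few of the sets above can be tried out directly. This is a small sketch with made-up input strings, not part of the original notebook.

```python
import re

any_of = re.findall(r"[arn]", "rain")               # characters a, r, or n
none_of = re.findall(r"[^arn]", "rain")             # everything except a, r, n
minutes = re.findall(r"[0-5][0-9]", "08 59 60 7")   # two-digit numbers 00 to 59
literal_plus = re.findall(r"[+]", "1+1=2")          # + is a plain character inside a set

print(any_of, none_of, minutes, literal_plus)
```

Here `60` is rejected because its first digit falls outside `[0-5]`, and the lone `7` is rejected because the pattern needs two digits.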

---
---
### Coding Practice

In [1]:
import re

In [2]:
text1='the categorical the cat hello thecat'

In [3]:
re.findall(r'\Athe',text1) # matching the pattern at the start of the string

['the']

In [5]:
text2='categorical cat hello thecat'
re.findall(r'\Athe',text2)

[]

In [None]:
re.findall(r'\bcat',text2) # matching the pattern at the start of a word

['cat', 'cat']

In [None]:
re.findall(r'\bcat\b',text2) # matching the pattern only as a whole word

['cat']

In [None]:
re.findall(r'cat\b',text2) # matching the pattern at the end of a word

['cat', 'cat']

In [6]:
re.findall(r'(cat)',text2)

['cat', 'cat', 'cat']

In [None]:
re.findall(r'(cat)',text2)
list(re.finditer(r'(cat)',text2)) # getting the indices of the matches

[<re.Match object; span=(0, 3), match='cat'>,
 <re.Match object; span=(12, 15), match='cat'>,
 <re.Match object; span=(25, 28), match='cat'>]

In [16]:
snake='how_you_are'
lst=snake.split('_')
for i in range(1,len(lst)):
  lst[i]=lst[i].capitalize()
''.join(lst)

'howYouAre'
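
The same snake_case to camelCase conversion can also be written with `re.sub`. This is an alternative sketch rather than the notebook's approach; the function name `snake_to_camel` is hypothetical, and it assumes the letter after each underscore is a lowercase ASCII letter.

```python
import re

def snake_to_camel(name):
    # Replace each "_x" with "X"; the lambda receives the match object
    # and group(1) is the letter captured after the underscore.
    return re.sub(r"_([a-z])", lambda m: m.group(1).upper(), name)

camel = snake_to_camel("how_you_are")
print(camel)
```

Unlike `str.capitalize()`, this version leaves the rest of each word untouched.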

In [17]:
text2

'categorical cat hello thecat'

In [19]:
t='hcath'

In [18]:
re.findall(r'\Bcat',text2) # not matching the pattern at the beginning of a word

['cat']

In [None]:
re.findall(r'cat\B',text2) # not matching the pattern at the end of a word

['cat']

In [None]:
re.findall(r'\Bcat\B',text2) # matching the pattern only when it is strictly inside a word

[]

In [20]:
re.findall(r'\Bcat\B',t)

['cat']

In [22]:
text4='hello world 123 hello4'
re.findall(r'\d+',text4) # returns every run of consecutive digits in the string

['123', '4']

In [None]:
re.findall(r'\D',text4) # returns every non-digit character in the string

['h',
 'e',
 'l',
 'l',
 'o',
 ' ',
 'w',
 'o',
 'r',
 'l',
 'd',
 ' ',
 ' ',
 'h',
 'e',
 'l',
 'l',
 'o']

In [None]:
re.findall(r'\s',text4) # returns all the whitespace characters in the string

[' ', ' ', ' ']

In [None]:
re.findall(r'\S',text4) # returns everything except the whitespace characters in the string

['h',
 'e',
 'l',
 'l',
 'o',
 'w',
 'o',
 'r',
 'l',
 'd',
 '1',
 '2',
 '3',
 'h',
 'e',
 'l',
 'l',
 'o',
 '4']

In [23]:
text5='hellow.world, _hi'
re.findall(r'\w',text5) # excludes ' ','.',','

['h', 'e', 'l', 'l', 'o', 'w', 'w', 'o', 'r', 'l', 'd', '_', 'h', 'i']

In [None]:
re.findall(r'\W',text5) # excludes [a-zA-Z0-9] and '_'

['.', ',', ' ']

In [24]:
text4

'hello world 123 hello4'

In [29]:
list(re.finditer(r'\D',text4))[0].group()

'h'

In [None]:
list1=[]
for i in re.finditer(r'\D',text4):
    list1.append(i.group())

''.join(list1)

'hello world  hello'
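
The loop above collects every non-digit character and joins them back together. A shorter way to get the same string, sketched here as an alternative, is to delete the digits with `re.sub`; the value of `text4` is repeated so the snippet runs on its own.

```python
import re

text4 = 'hello world 123 hello4'

# Remove every digit by substituting it with the empty string.
no_digits = re.sub(r'\d', '', text4)

# Equivalent: keep every non-digit character and join them.
joined = ''.join(re.findall(r'\D', text4))

print(repr(no_digits), no_digits == joined)
```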

In [None]:
re.findall(r'hi\Z',text5) # return the word if it is at the end of the string

['hi']

In [30]:
text5

'hellow.world, _hi'

In [36]:
re.findall(r'hello\Z',text5)

[]

In [42]:
re.findall(r'hello[a-z.]*\Z','kjhello. hi hdfghellojkkjlj') # return the match only if it runs to the end of the string

['hellojkkjlj']

In [None]:
re.findall(r'[a-q]',text5)

['h', 'e', 'l', 'l', 'o', 'o', 'l', 'd', 'h', 'i']

In [43]:
text5

'hellow.world, _hi'

In [None]:
re.findall(r'[a-q]+',text5)

['hello', 'o', 'ld', 'hi']

In [None]:
re.findall(r'h...o',text5)

['hello']

In [44]:
text5

'hellow.world, _hi'

In [65]:
import re

text5 = 'hellow.world, hi'
matches = re.findall(r'\bh[^\s]*', text5)

print(matches)


['hellow.world,', 'hi']


In [67]:
text5='hellow.world, hi'
re.findall(r'^h',text5) # ^ beginning of the string

['h']

In [71]:
text6='Really good'
re.findall(r'^h.*|^R.+',text6) # anything starting with h or R

['Really good']

In [None]:
text4

'hello world 123 hello4'

In [82]:

re.findall(r'(\bh[a-z]*[\d]{1,})$','hello world 123 hello24 hello454') # $ checks end of the string

['hello454']

In [100]:
text7=r'\yes 1:30 no 12:30 maybe 1:5 test me 12:5'

In [102]:
re.findall(r'(\d{1,2}:\d{2})',text7) # {1,2} means one or two digits, {2} means exactly two

['1:30', '12:30']
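
The pattern above checks only the shape of a time, so a value such as `25:99` would also be accepted. A stricter sketch is shown below; it assumes a 24-hour clock, uses a made-up test string (`times`), and relies on a non-capturing group `(?:...)`, which is not covered in the notes above.

```python
import re

# Hypothetical test string with valid and invalid times.
times = 'yes 1:30 no 12:30 maybe 1:5 test me 12:5 late 25:99'

# Hours 0-23 (optionally with a leading 0 or 1), minutes 00-59.
# (?:...) groups the alternatives without creating a capture group,
# so findall still returns the whole match.
strict_time = r'\b(?:[01]?\d|2[0-3]):[0-5]\d\b'

valid_times = re.findall(strict_time, times)
print(valid_times)
```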

In [None]:
txt='23-10-2002 23/10/2002 23/10/02 23-10-2002 23-10-02 1-10-2002 1-10-02'

pattern=r'\d{1,2}/\d{2}/\d{2,4}|\d{1,2}-\d{2}-\d{2,4}'

re.findall(pattern,txt)

['23-10-2002',
 '23/10/2002',
 '23/10/02',
 '23-10-2002',
 '23-10-02',
 '1-10-2002',
 '1-10-02']

In [112]:
txt='23-10-2002 23*-10-2002 23/10/2002 23/10/02 23-10-2002 23-10-02 1-10-2002 1-10-02'

ptrn=r'\d{1,2}[-/]\d{2}[-/]\d{2,4}'
re.findall(ptrn,txt)

['23-10-2002',
 '23/10/2002',
 '23/10/02',
 '23-10-2002',
 '23-10-02',
 '1-10-2002',
 '1-10-02']
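
The two date patterns give the same result on the sample text, but they are not fully equivalent. The alternation requires the same separator twice, while the character class `[-/]` accepts a mix. The sketch below uses a made-up mixed-separator date (`mixed`) to show the difference.

```python
import re

mixed = '23-10/2002'   # hypothetical date with two different separators

alternation = r'\d{1,2}/\d{2}/\d{2,4}|\d{1,2}-\d{2}-\d{2,4}'
char_class = r'\d{1,2}[-/]\d{2}[-/]\d{2,4}'

strict = re.findall(alternation, mixed)   # both separators must be identical
loose = re.findall(char_class, mixed)     # each separator is chosen independently

print(strict, loose)
```

Which behaviour is preferable depends on whether mixed separators should count as valid dates in the data at hand.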