# Introduction to Regular Expressions

* A regular expression is a string that combines normal characters and special metacharacters to describe a pattern for finding text or positions within a text.
* Normal characters match themselves ( st ).
* Metacharacters represent types of characters ( \d , \s , \w ) or ideas such as repetition ( {3,10} ).
* A pattern is a sequence of characters that maps to words or punctuation.

### Import re module

In [1]:
import re 

### 1. findall() Method

* Find all matches of a pattern
* Syntax = re.findall(r"regex", string)

In [2]:
re.findall(r"#movies","Love #movies! I had fun yesterday going to the #movies")

['#movies', '#movies']

### 2. split() Method

* Split string at each match
* Syntax = re.split(r"regex", string)

In [3]:
re.split(r"!","Nice place to eat! I'll come back! Excellent meat!")

['Nice place to eat', " I'll come back", ' Excellent meat', '']
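The trailing `''` in the output above appears because the string ends with the delimiter. A minimal sketch of two ways to handle that, assuming the same review text (the variable names are illustrative):

```python
import re

review = "Nice place to eat! I'll come back! Excellent meat!"

# Splitting on "!" leaves an empty string after the final delimiter.
parts = re.split(r"!", review)

# Drop empty pieces and strip the leading spaces.
cleaned = [p.strip() for p in parts if p.strip()]
print(cleaned)      # ['Nice place to eat', "I'll come back", 'Excellent meat']

# maxsplit limits how many splits are performed.
first_only = re.split(r"!", review, maxsplit=1)
print(first_only)   # ['Nice place to eat', " I'll come back! Excellent meat!"]
```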

### 3. sub() Method

* Replace one or many matches with a string
* Syntax = re.sub(r"old","new",string)

In [4]:
re.sub(r"yellow","nice","I have a yellow car and a yellow house in a yellow neighborhood")

'I have a nice car and a nice house in a nice neighborhood'
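By default `re.sub()` replaces every match. As a small sketch, the `count` keyword limits the number of replacements, and `re.subn()` also reports how many were made (the variable names are illustrative):

```python
import re

sentence = "I have a yellow car and a yellow house in a yellow neighborhood"

# Replace only the first occurrence.
once = re.sub(r"yellow", "nice", sentence, count=1)
print(once)   # I have a nice car and a yellow house in a yellow neighborhood

# subn() returns the new string together with the number of replacements.
new_text, n_replaced = re.subn(r"yellow", "nice", sentence)
print(n_replaced)   # 3
```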

### Supported Metacharacters

In [5]:
re.findall(r"User\d","The winners are: User9, UserN, User8")

['User9', 'User8']

In [6]:
re.findall(r"User\D","The winners are: User9, UserN, User8")

['UserN']

In [7]:
re.findall(r"User\w","The winners are: User9, UserN, User8")

['User9', 'UserN', 'User8']

In [8]:
re.findall(r"\W\d","This skirt is on sale, only $5 today!")

['$5']

In [9]:
re.findall(r"Data\sScience","I enjoy learning Data Science")

['Data Science']

In [10]:
re.sub(r"ice\Scream","ice cream","I really like ice-cream")

'I really like ice cream'
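The cells above use six character-class metacharacters without spelling out what each one means. The sketch below runs all of them against one hypothetical sample string so the differences are easier to see side by side:

```python
import re

# Hypothetical sample text containing letters, digits, an underscore,
# whitespace and punctuation.
sample = "Room 42, floor_3!"

digits = re.findall(r"\d", sample)          # \d: a digit
non_digits = re.findall(r"\D+", sample)     # \D: anything that is not a digit
words = re.findall(r"\w+", sample)          # \w: letter, digit or underscore
non_words = re.findall(r"\W", sample)       # \W: anything \w does not match
spaces = re.findall(r"\s", sample)          # \s: whitespace
non_spaces = re.findall(r"\S+", sample)     # \S: anything that is not whitespace

print(digits)       # ['4', '2', '3']
print(non_digits)   # ['Room ', ', floor_', '!']
print(words)        # ['Room', '42', 'floor_3']
print(non_words)    # [' ', ',', ' ', '!']
print(spaces)       # [' ', ' ']
print(non_spaces)   # ['Room', '42,', 'floor_3!']
```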

### Quantifiers

* A metacharacter that tells the regex engine how many times to match a character immediately to its left

#### 1. Once or more: +

In [11]:
text = "Date of start: 4-3. Date of registration: 10-04"
re.findall(r"\d+-\d+", text)

['4-3', '10-04']

#### 2. Zero times or more: *

In [12]:
my_string = "The concert was amazing! @ameli!a @joh&&n @mary90"
re.findall(r"@\w+\W*\w+", my_string)

['@ameli!a', '@joh&&n', '@mary90']

#### 3. Zero times or once: ?

In [13]:
text = "The color of this image is amazing. However, the colour blue could be brighter"
re.findall(r"colou?r", text)

['color', 'colour']

#### 4. n times at least, m times at most: {n,m}

In [14]:
phone_number = "John: 1-966-847-3131 Michelle: 54-908-42-42424"
re.findall(r"\d{1,2}-\d{3}-\d{2,3}-\d{4,}",phone_number)

['1-966-847-3131', '54-908-42-42424']
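The brace quantifier comes in several forms: `{n}` (exactly n), `{n,}` (n or more) and `{n,m}` (between n and m). A short sketch on a made-up string; the `\b` word boundary is not covered in these notes and is used here only to keep each match aligned to a whole token:

```python
import re

# Hypothetical codes: one letter followed by a varying number of digits.
codes = "a1 b22 c333 d4444"

exactly_two = re.findall(r"\b[a-z]\d{2}\b", codes)
at_least_two = re.findall(r"\b[a-z]\d{2,}\b", codes)
two_to_three = re.findall(r"\b[a-z]\d{2,3}\b", codes)

print(exactly_two)    # ['b22']
print(at_least_two)   # ['b22', 'c333', 'd4444']
print(two_to_three)   # ['b22', 'c333']
```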

### Looking for Patterns

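* Two different operations to find a match: search() scans the whole string for the first match, while match() only tries to match at the beginning of the string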

#### 1. search() Method

* Syntax = re.search(r"regex",string)

In [15]:
re.search(r"\d{4}","4506 people attend the show")

<re.Match object; span=(0, 4), match='4506'>

In [16]:
re.search(r"\d+","Yesterday I saw 3 shows")

<re.Match object; span=(16, 17), match='3'>

#### 2. match() Method

* Syntax = re.match(r"regex",string)

In [17]:
re.match(r"\d{4}", "4506 people attend the show")

<re.Match object; span=(0, 4), match='4506'>

In [18]:
print(re.match(r"\d+","Yesterday I saw 3 shows"))

None
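Both functions return a match object or `None`, so the result is usually checked before it is used. A minimal sketch contrasting `search()`, `match()` and `fullmatch()` (the last one is not used elsewhere in these notes; it requires the whole string to match):

```python
import re

text = "Yesterday I saw 3 shows"

found = re.search(r"\d+", text)     # scans the whole string
at_start = re.match(r"\d+", text)   # only tries position 0

if found:
    # group() is the matched text, span() its start and end offsets.
    print(found.group(), found.span())   # 3 (16, 17)
print(at_start)                          # None

# fullmatch() succeeds only if the pattern covers the entire string.
whole = re.fullmatch(r"\d+", "4506")
partial = re.fullmatch(r"\d+", "4506 people")
print(whole is not None, partial is not None)   # True False
```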


### Special Characters

#### 1. Match any character (except newline): .

In [19]:
my_links = "Just check out this link: www.amazingpics.com. It has amazing photos!"
re.findall(r"www.+com", my_links)

['www.amazingpics.com']
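Note that `.` matches any character, not only a literal dot, and `+` is greedy (see the section on greedy matching). With more than one link in the text, `www.+com` can swallow everything in between. A hedged sketch with a made-up two-link string, and one possible tighter pattern:

```python
import re

# Hypothetical text with two links.
two_links = "Visit www.amazingpics.com or www.coolpics.com today"

# The unescaped dot plus a greedy + spans from the first "www" to the last "com".
loose = re.findall(r"www.+com", two_links)
print(loose)    # ['www.amazingpics.com or www.coolpics.com']

# Escaping the dots and restricting the middle part to word characters
# keeps each link separate.
strict = re.findall(r"www\.\w+\.com", two_links)
print(strict)   # ['www.amazingpics.com', 'www.coolpics.com']
```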

#### 2. Start of the string: ^

In [20]:
my_string = "the 80s music was much better than the 90s"
re.findall(r"^the\s\d+s",my_string)

['the 80s']

#### 3. End of the string: $

In [21]:
my_string = "the 80s music hits were much better than the 90s"
re.findall(r"the\s\d+s$", my_string)

['the 90s']

#### 4. Escape special characters: \

In [22]:
my_string = "I love the music of Mr.Go. However, the sound was too loud."
re.split(r"\.\s", my_string)

['I love the music of Mr.Go', 'However, the sound was too loud.']

#### 5. OR Operator (Set of characters): [ ]

In [23]:
my_string = "Yesterday I spent my afternoon with my friends: MaryJohn2 Clary3"
re.findall(r"[A-Za-z]+\d", my_string)

['MaryJohn2', 'Clary3']
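A note on letter ranges: `[A-z]` is sometimes written as shorthand for "any letter", but in ASCII the span from `A` to `z` also includes the characters `[`, `\`, `]`, `^`, `_` and the backtick. The sketch below, on a made-up string, shows the difference from `[A-Za-z]`:

```python
import re

# Hypothetical text containing an underscore and square brackets.
sample = "snake_case [tag]"

# A-z covers ASCII 65..122, which includes "_" and the brackets.
too_wide = re.findall(r"[A-z]+", sample)
print(too_wide)       # ['snake_case', '[tag]']

# Two explicit ranges match letters only.
letters_only = re.findall(r"[A-Za-z]+", sample)
print(letters_only)   # ['snake', 'case', 'tag']
```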

#### 6. Negate a set of characters: [^ ]

In [24]:
my_links = "Bad website: www.99.com. Favorite site: www.hola.com"
re.findall(r"www[^0-9]+com", my_links)

['www.hola.com']

### Greedy vs Nongreedy matching

   * Two types of matching methods:
        * Greedy
        * Non-greedy or lazy
   * Standard quantifiers are greedy by default: * , + , ? , {num,num}

#### 1. Greedy matching

   * Greedy: match as many characters as possible
   * Return the longest match

In [25]:
re.match(r"\d+","12345bcada")

<re.Match object; span=(0, 5), match='12345'>

* Backtracks when too many characters matched
* Gives up characters one at a time

In [26]:
re.match(r".*hello","xhelloxxxxxx")

<re.Match object; span=(0, 6), match='xhello'>

#### 2. Non-greedy matching

* Lazy: match as few characters as needed
* Returns the shortest match
* Append ? to greedy quantifiers

In [27]:
re.match(r"\d+?","12345bcada")

<re.Match object; span=(0, 1), match='1'>

* Backtracks when too few characters matched
* Expands characters one at a time

In [28]:
re.match(r".*?hello","xhelloxxxxx")

<re.Match object; span=(0, 6), match='xhello'>
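In the `hello` examples both modes end up with the same match, so the difference is easier to see when the closing part of the pattern occurs more than once. A small sketch on a made-up tag-like string:

```python
import re

# Hypothetical markup with several ">" characters.
html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the end, then backtracks to the LAST ">".
greedy = re.findall(r"<.+>", html)
print(greedy)   # ['<b>bold</b> and <i>italic</i>']

# Lazy: .+? expands one character at a time and stops at the FIRST ">".
lazy = re.findall(r"<.+?>", html)
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
```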

# Advanced Regular Expression Concepts

### Group characters

* Without groups, findall() returns each complete match


In [30]:
text = 'Clary has 2 friends who she spends a lot of time with. Susan has 3 brothers while John has 4 sisters.'
re.findall(r'[A-Za-z]+\s\w+\s\d+\s\w+', text)

['Clary has 2 friends', 'Susan has 3 brothers', 'John has 4 sisters']

### Capturing groups

* Use parentheses to group and capture characters together

In [31]:
re.findall(r'([A-Za-z]+)\s\w+\s\d+\s\w+', text)

['Clary', 'Susan', 'John']

In [32]:
re.findall(r'([A-Za-z]+)\s\w+\s(\d+)\s(\w+)', text)

[('Clary', '2', 'friends'),
 ('Susan', '3', 'brothers'),
 ('John', '4', 'sisters')]

* Match a specific subpattern in a pattern
* Use it for further processing
* Organize the data

In [34]:
pets = re.findall(r'([A-Za-z]+)\s\w+\s(\d+)\s(\w+)',"Clary has 2 dogs but John has 3 cats")
pets[0][0]

'Clary'

* Apply a quantifier to the entire group

In [35]:
re.search(r"(\d[A-Za-z])+","My user name is 3e4r5fg")

<re.Match object; span=(16, 22), match='3e4r5f'>

* Capture a repeated group (\d+) vs. repeat a capturing group (\d)+

In [36]:
my_string = "My lucky numbers are 8755 and 33"
re.findall(r"(\d)+", my_string)

['5', '3']

In [37]:
re.findall(r"(\d+)", my_string)

['8755', '33']
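The difference between the two patterns is easier to see on a single match object. In this sketch, `(\d)+` still matches the whole number, but the group is overwritten on every repetition, so only the last digit is kept:

```python
import re

my_string = "My lucky numbers are 8755 and 33"

repeated_group = re.search(r"(\d)+", my_string)
print(repeated_group.group(0))   # 8755  (the whole match)
print(repeated_group.group(1))   # 5     (last repetition of the group)

captured_repeat = re.search(r"(\d+)", my_string)
print(captured_repeat.group(1))  # 8755  (the repetition is inside the group)
```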

## Alternation and non-capturing groups

* Vertical bar or pipe: |

In [39]:
my_string = "I want to have a pet. But I don't know if I want 2 cats, 1 dog or a bird."
re.findall(r"\d+\scat|dog|bird", my_string)

['2 cat', 'dog', 'bird']

* Alternation: Use groups to choose between optional patterns

In [40]:
my_string = "I want to have a pet. But I don't know if I want 2 cats, 1 dog or a bird."
re.findall(r"\d+\s(cat|dog|bird)", my_string)

['cat', 'dog']

In [41]:
my_string = "I want to have a pet. But I don't know if I want 2 cats, 1 dog or a bird."
re.findall(r"(\d)+\s(cat|dog|bird)", my_string)

[('2', 'cat'), ('1', 'dog')]

* Match but not capture a group

In [42]:
my_string = "John Smith: 34-34-34-042-980, Rebeca Smith: 10-10-10-434-425"
re.findall(r"(?:\d{2}-){3}(\d{3}-\d{3})", my_string)

['042-980', '434-425']

* Use non-capturing groups for alternation

In [43]:
my_date = "Today is 23rd May 2019. Tomorrow is 24th May 19."
re.findall(r"(\d+)(?:th|rd)", my_date)

['23', '24']

## Backreferences

In [44]:
text = "Python 3.0 was released on 12-03-2008. It was a major revision of the language. Many of its \
major features were backported to Python 2.6.x and 2.7.x version series."

In [45]:
text = "Python 3.0 was released on 12-03-2008."
information = re.search(r'(\d{1,2})-(\d{2})-(\d{4})', text)

In [46]:
print(information.group(0)) 
print(information.group(1)) 
print(information.group(2)) 
print(information.group(3))

12-03-2008
12
03
2008


* Give a name to groups

In [48]:
text = "Austin, 78701"
cities = re.search(r"(?P<city>[A-Za-z]+).*?(?P<zipcode>\d{5})", text)
print(cities.group("city"))
print(cities.group("zipcode"))

Austin
78701
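Besides `group()`, a match object can return all groups at once. A minimal sketch with the same pattern, using `groups()` for a tuple and `groupdict()` for a dictionary keyed by group name:

```python
import re

text = "Austin, 78701"
cities = re.search(r"(?P<city>[A-Za-z]+).*?(?P<zipcode>\d{5})", text)

# All captured groups, in order.
print(cities.groups())      # ('Austin', '78701')

# Named groups as a dictionary.
print(cities.groupdict())   # {'city': 'Austin', 'zipcode': '78701'}

# Named groups remain accessible by number as well.
print(cities.group(1))      # Austin
```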


* Using numbered capturing groups to reference back

In [50]:
sentence = "I wish you a happy happy birthday!"
re.findall(r"(\w+)\s\1", sentence)


['happy']

In [51]:
sentence = "I wish you a happy happy birthday!"
re.sub(r"(\w+)\s\1", r"\1", sentence)

'I wish you a happy birthday!'

* Using named capturing groups to reference back

In [52]:
sentence = "Your new code number is 23434. Please, enter 23434 to open the door."
re.findall(r"(?P<code>\d{5}).*?(?P=code)", sentence)

['23434']

In [53]:
sentence = "This app is not working! It's repeating the last word word."
re.sub(r"(?P<word>\w+)\s(?P=word)", r"\g<word>", sentence)

"This app is not working! It's repeating the last word."

## Lookaround

* Lookaround lets us confirm that a sub-pattern is ahead of or behind the main pattern

## Look-ahead

* Non-capturing group
* Checks that the first part of the expression is followed or not by the lookahead expression
* Return only the first part of the expression

#### Positive look-ahead

* Non-capturing group
* Checks that the first part of the expression is followed by the lookahead expression
* Return only the first part of the expression

In [55]:
my_text = "tweets.txt transferred, mypass.txt transferred, keywords.txt error"
re.findall(r"\w+\.txt(?=\stransferred)", my_text)

['tweets.txt', 'mypass.txt']

#### Negative look-ahead

* Non-capturing group
* Checks that the first part of the expression is not followed by the lookahead expression
* Return only the first part of the expression

In [56]:
my_text = "tweets.txt transferred, mypass.txt transferred, keywords.txt error"
re.findall(r"\w+\.txt(?!\stransferred)", my_text)

['keywords.txt']
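A look-ahead only checks what follows; it does not consume it. The sketch below compares a look-ahead with a plain pattern on the same text to show that the text inside the look-ahead is not part of the match:

```python
import re

my_text = "tweets.txt transferred, mypass.txt transferred, keywords.txt error"

# Without look-ahead, " transferred" becomes part of each match.
consumed = re.findall(r"\w+\.txt\stransferred", my_text)
print(consumed)   # ['tweets.txt transferred', 'mypass.txt transferred']

# With look-ahead, the match stops right after ".txt".
first = re.search(r"\w+\.txt(?=\stransferred)", my_text)
print(first.group(), first.end())   # tweets.txt 10
```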

## Look-behind

* Non-capturing group
* Get all the matches that are preceded or not by a specific pattern.
* Return the pattern after the look-behind expression

#### Positive look-behind

* Non-capturing group
* Get all the matches that are preceded by a specific pattern.
* Return the pattern after the look-behind expression

In [58]:
my_text = "Member: Angus Young, Member: Chris Slade, Past: Malcolm Young, Past: Cliff Williams."
re.findall(r"(?<=Member:\s)\w+\s\w+", my_text)

['Angus Young', 'Chris Slade']

#### Negative look-behind

* Non-capturing group
* Get all the matches that are not preceded by a specific pattern.
* Return the pattern after the look-behind expression

In [60]:
my_text = "My white cat sat at the table. However, my brown dog was lying on the couch."
re.findall(r"(?<!brown\s)(cat|dog)", my_text)

['cat']
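One limitation worth knowing: in Python's `re` module a look-behind must have a fixed width, so quantifiers such as `+` or `*` inside it are rejected when the pattern is compiled. The sketch below shows the error and one possible workaround with a capturing group; the text with a double space is made up for illustration:

```python
import re

# Hypothetical text where the spacing after "Member:" varies.
my_text = "Member: Angus Young, Member:  Chris Slade"

# \s+ has a variable width, so this look-behind does not compile.
try:
    re.compile(r"(?<=Member:\s+)\w+\s\w+")
    error_message = None
except re.error as exc:
    error_message = str(exc)
print(error_message)

# Workaround: match the prefix normally and capture only the name.
names = re.findall(r"Member:\s+(\w+\s\w+)", my_text)
print(names)   # ['Angus Young', 'Chris Slade']
```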