In [1]:
import re

### Grouping and capturing

In the last stop of our journey, we are going to talk about some advanced concepts of regex. More specifically, we'll talk about capturing groups.

####  Group characters
Let's say that we have the following text, and we want to extract information about each person: how many relationships they have and of which type. 


<img src="group.jpg" style="max-width:600px">


So, we want to extract:

- Clary 2 friends, 

- Susan 3 brothers and 

- John 3 sisters, as you can see in the slide. We know the structure of the sentences. 

Let's try our first approach. We would write something like the regex in the code: one or more upper or lowercase letters, whitespace, one or more word characters, whitespace, one or more digits, whitespace and one or more word characters. Let's see the output.

In [2]:
text = "Clary has 2 friends who she spends a lot of time with. Susan has 3 brothers while John has 3 sisters"
text

'Clary has 2 friends who she spends a lot of time with. Susan has 3 brothers while John has 3 sisters'

In [3]:
re.findall(r'[A-Za-z]+\s\w+\s\d+\s\w+', text)

['Clary has 2 friends', 'Susan has 3 brothers', 'John has 3 sisters']

Quite close. But we don't want the word `has`.


What can we do about this? We start simple by trying to extract only the names. We can place parentheses to group those characters as shown in the slide. Capture them. And retrieve only that group.

<img src="p.jpg" style="max-width:600px">

In the code, we have now added parentheses to group our first part of the regex. 

In [5]:
text = "Clary has 2 friends who she spends a lot of time with. Susan has 3 brothers while John has 3 sisters"
text

'Clary has 2 friends who she spends a lot of time with. Susan has 3 brothers while John has 3 sisters'

In [6]:
re.findall(r'([A-Za-z]+)\s\w+\s\d+\s\w+', text)

['Clary', 'Susan', 'John']

We can observe in the output that the group was captured. And only the three names were retrieved.

Let's look at the example again. We can place parentheses around the three groups that we want to capture. Each group will receive a number. The entire expression will always be group zero. The first group one, the second two, and the third number three. We'll see how to use these numbers later.

<img src="gr-1.jpg" style="max-width:600px">

Let's see this in the code example. We add the parentheses to group together each of the three parts of the regex. 

In [9]:
text = "Clary has 2 friends who she spends a lot of time with. Susan has 3 brothers while John has 3 sisters"
text

'Clary has 2 friends who she spends a lot of time with. Susan has 3 brothers while John has 3 sisters'

In [8]:
re.findall(r'([A-Za-z]+)\s\w+\s(\d+)\s(\w+)', text)

[('Clary', '2', 'friends'),
 ('Susan', '3', 'brothers'),
 ('John', '3', 'sisters')]

In the output, we got a list of tuples. The first element of each tuple is the match captured corresponding to group one. The second to group two. The last to group three.

### Capturing groups Contd..
As we already discussed, we can use capturing groups to match a specific subpattern in a pattern. 

But we can also use it to organize data. As you saw earlier, the matches are retrieved as a list of tuples. 


In [11]:
pets = re.findall(r'([A-Za-z]+)\s\w+\s(\d+)\s(\w+)', "Clary has 2 dogs but John has 3 cats")
pets

[('Clary', '2', 'dogs'), ('John', '3', 'cats')]

In [12]:
pets[0][0]

'Clary'

In the code, we placed the parentheses to capture the name of the owner, the number and which type of pets each one has. We can access the information retrieved by using indexing and slicing as seen in the code. 
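As a minimal sketch of organizing the captured data further, each tuple can be unpacked into a dictionary. The names `records`, `owner`, `count` and `kind` are illustrative choices and do not appear in the original notebook.

```python
import re

pets = re.findall(r'([A-Za-z]+)\s\w+\s(\d+)\s(\w+)',
                  "Clary has 2 dogs but John has 3 cats")

# Unpack each tuple into named fields and convert the count to an integer
# (the field names are hypothetical, chosen only for this illustration)
records = [
    {"owner": owner, "count": int(count), "kind": kind}
    for owner, count, kind in pets
]
print(records)
```

Unpacking in the comprehension avoids repeated indexing such as `pets[0][0]` when every match needs the same treatment.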

###  Capturing groups contd..

But capturing groups have one important feature. Remember that a quantifier applies to the token immediately to its left. So, we can place parentheses to group characters and then apply the quantifier to the entire group. 

<img src="qu.jpg" style="max-width:600px">

In [13]:
re.search(r"\d[A-Za-z]+", "My user name is 3e4r5fg")

<re.Match object; span=(16, 18), match='3e'>

In [14]:
## Apply a quantifier to the entire group
re.search(r"(\d[A-Za-z])+", "My user name is 3e4r5fg")

<re.Match object; span=(16, 22), match='3e4r5f'>

In the code, we have placed parentheses to match the group containing a number and any letter. We applied the `+` quantifier to specify that we want this group repeated one or more times. And we get the match shown in the output.

### Capturing groups

But be careful. Capturing a repeated group is not the same as repeating a capturing group. 

<img src="ca.jpg" style="max-width:600px">

In the first code, we use findall to match a capturing group containing one number. We want this capturing group to be repeated once or more times.

In [15]:
my_string = "My lucky numbers are 8755 and 33"
re.findall(r"(\d)+", my_string)

['5', '3']

We get 5 and 3 as the output. The quantifier repeats the group, so each number is matched in full, but a repeated capturing group keeps only its last repetition: the final digit of 8755 and the final digit of 33. 

In the second code, we specify that we should capture a group containing one or more repetitions of a digit. We now get the following output.

In [17]:
re.findall(r"(\d+)", my_string)

['8755', '33']
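The difference is easier to see with `re.search`, which exposes both the whole match and the captured group. This is a small sketch on the same string; the variable names `repeated` and `captured` are illustrative.

```python
import re

my_string = "My lucky numbers are 8755 and 33"

# Repeating a capturing group: the whole number is matched,
# but group 1 keeps only the last repetition of the group
repeated = re.search(r"(\d)+", my_string)
print(repeated.group(0))  # the entire match
print(repeated.group(1))  # only the last digit captured

# Capturing a repeated token: group 1 holds every digit of the number
captured = re.search(r"(\d+)", my_string)
print(captured.group(1))
```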

### Exercise 1: Try another name

You are still working on your Twitter sentiment analysis. You are now analyzing some things that caught your attention. You noticed that there are email addresses inserted in some tweets. Now, you are curious to find out which is the most common name.

You want to extract the first part of the email. E.g. if you have the email marysmith90@gmail.com, you are only interested in marysmith90.

You need to match the entire expression. So you make sure to extract only names present in emails. Also, you are only interested in names containing upper (e.g. A,B, Z) or lowercase letters (e.g. a, d, z) and numbers.

The list `sentiment_analysis`, containing the text of three tweets, is shown below.

In [18]:
sentiment_analysis = ['Just got ur newsletter, those fares really are unbelievable. Write to statravelAU@gmail.com or statravelpo@hotmail.com. They have amazing prices',
 'I should have paid more attention when we covered photoshop in my webpage design class in undergrad. Contact me Hollywoodheat34@msn.net.',
 'hey missed ya at the meeting. Read your email! msdrama098@hotmail.com']
sentiment_analysis

['Just got ur newsletter, those fares really are unbelievable. Write to statravelAU@gmail.com or statravelpo@hotmail.com. They have amazing prices',
 'I should have paid more attention when we covered photoshop in my webpage design class in undergrad. Contact me Hollywoodheat34@msn.net.',
 'hey missed ya at the meeting. Read your email! msdrama098@hotmail.com']

In [38]:
# Write a regex that matches email
regex_email = r"([A-Za-z0-9]+)@\S+"

for tweet in sentiment_analysis:
    # Find all matches of regex in each tweet
    email_matched = re.findall(regex_email, tweet)
    email_matched

    # Complete the format method to print the results
    print("Lists of users found in this tweet: {}".format(email_matched))

Lists of users found in this tweet: ['statravelAU', 'statravelpo']
Lists of users found in this tweet: ['Hollywoodheat34']
Lists of users found in this tweet: ['msdrama098']


In [39]:
# Write a regex that matches email
regex_email = r"([A-Za-z0-9]+)@"

for tweet in sentiment_analysis:
    # Find all matches of regex in each tweet
    email_matched = re.findall(regex_email, tweet)
    email_matched

    # Complete the format method to print the results
    print("Lists of users found in this tweet: {}".format(email_matched))

Lists of users found in this tweet: ['statravelAU', 'statravelpo']
Lists of users found in this tweet: ['Hollywoodheat34']
Lists of users found in this tweet: ['msdrama098']


### Exercise 2: Flying home

Your boss assigned you to a small project. They are performing an analysis of the travels people made to attend business meetings. You are given a dataset with only the email subjects for each of the people traveling.

You learn that the text follows a pattern. Here is an example:

Here you have your boarding pass `LA4214 AER-CDB 06NOV`.

You need to extract the information about the flight:

- The two letters indicate the airline (e.g LA),

- The 4 numbers are the flight number (e.g. 4214).

- The three letters correspond to the departure (e.g AER),

- The destination (CDB),

- The date (06NOV) of the flight.

- All letters are always uppercase.

The variable `flight` containing one email subject was loaded in your session.

In [40]:
flight = 'Subject: You are now ready to fly. Here you have your boarding pass IB3723 AMS-MAD 06OCT'
flight

'Subject: You are now ready to fly. Here you have your boarding pass IB3723 AMS-MAD 06OCT'

In [41]:
# Write regex to capture information of the flight
regex = r"([A-Z]{2})(\d{4})\s([A-Z]{3})-([A-Z]{3})\s(\d{2}[A-Z]{3})"

# Find all matches of the flight information
flight_matches = re.findall(regex, flight)
flight_matches

[('IB', '3723', 'AMS', 'MAD', '06OCT')]

In [42]:
#Print the matches
print("Airline: {} Flight number: {}".format(flight_matches[0][0], flight_matches[0][1]))

Airline: IB Flight number: 3723


In [43]:
print("Departure: {} Destination: {}".format(flight_matches[0][2], flight_matches[0][3]))

Departure: AMS Destination: MAD


In [44]:
print("Date: {}".format(flight_matches[0][4]))

Date: 06OCT


### Alternation and non-capturing groups

Now, we'll talk about other ways in which grouping characters can help us. You've learned in previous videos about the vertical bar or pipe (`|`) operator. 

Suppose we have the following string. And we want to find all matches for pet names. So we can use the pipe operator to specify that we want to match cat or dog or bird as you see in the code. This will output the following list.

In [45]:
my_string = "I want to have a pet. But I don't know if I want a cat, a dog or a bird."
re.findall(r"cat|dog|bird", my_string)

['cat', 'dog', 'bird']

Now, we changed the string a little bit. And once more we want to find all the pet names. But this time only those that come after a number and a whitespace. So we specify this again with the pipe operator. 

In [46]:
my_string2 = "I want to have a pet. But I don't know if I want 2 cats, 1 dog or a bird."
re.findall(r"\d+\scat|dog|bird", my_string2)

['2 cat', 'dog', 'bird']

Hmm we got the wrong output. Why? 

<img src="pi.jpg" style="max-width:600px">

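The pipe operator has the lowest precedence, so it splits the whole pattern into alternatives: everything to the left of the first pipe (digit, whitespace, cat), or dog, or bird. The number and whitespace are therefore required only before cat.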
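One way to check this reading is to write the alternatives out explicitly and compare results. The sketch below wraps each alternative in `(?:...)`, which groups without capturing (non-capturing groups are covered later in this section); the names `implicit` and `explicit` are illustrative.

```python
import re

my_string2 = "I want to have a pet. But I don't know if I want 2 cats, 1 dog or a bird."

# The pattern as written in the example above
implicit = re.findall(r"\d+\scat|dog|bird", my_string2)

# The same three alternatives, with their boundaries made explicit
explicit = re.findall(r"(?:\d+\scat)|(?:dog)|(?:bird)", my_string2)

print(implicit)
print(implicit == explicit)
```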

### Alternation

In order to solve this, we can use alternation inside a group. In simpler terms, we can use parentheses again to group the alternatives. 

In [47]:
my_string2 = "I want to have a pet. But I don't know if I want 2 cats, 1 dog or a bird."
re.findall(r"\d+\s(cat|dog|bird)", my_string2)

['cat', 'dog']

In the code, now the parentheses are added to group cat or dog or bird. This time we get the output cat and dog. This is the correct match as only these two patterns followed a number and whitespace.

### Alternation contd..

In the previous example, we may also want to match the number. In that case, we need to place parentheses to capture the digit group. 

In [48]:
my_string2 = "I want to have a pet. But I don't know if I want 2 cats, 1 dog or a bird."
re.findall(r"(\d+)\s(cat|dog|bird)", my_string2)

[('2', 'cat'), ('1', 'dog')]

In the code, we now use two pairs of parentheses. We use findall on the string. And we get a list with two tuples as shown in the output.

###  Non-capturing groups

Sometimes, we need to group characters using parentheses, but we are not going to refer back to this group. 

For these cases, there is a special type of group called a non-capturing group. To use one, we just add `?:` (question mark, colon) inside the parentheses, before the regex.

<img src="non.jpg" style="max-width:600px">

We have the following string. We want to find all matches of numbers.

In [49]:
my_string3 = "John Smith: 34-34-34-042-980, Rebeca Smith: 10-10-10-434-425"
my_string3

'John Smith: 34-34-34-042-980, Rebeca Smith: 10-10-10-434-425'

We see that the pattern consists of two digits and a dash, repeated three times. After that come three digits, a dash, and three more digits. 
We want to extract only the last part without the first repeated elements. 


We need to group the first two elements to indicate repetitions. But we don't want to capture them. 

<img src="non-2.jpg" style="max-width:600px">

So we use non-capturing groups to group backslash d repeated two times and dash. Then we indicate this group should be repeated three times. 

Then, we group backslash d repeated three times, dash, backslash d repeated three times. 

In [51]:
my_string3 = "John Smith: 34-34-34-042-980, Rebeca Smith: 10-10-10-434-425"
re.findall(r"(?:\d{2}-){3}(\d{3}-\d{3})", my_string3)

['042-980', '434-425']

In the code, we then match the regex to the string. And we get the numbers we were looking for as shown in the output.
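For comparison, here is a short sketch of what happens if the repeated prefix is grouped with ordinary capturing parentheses instead. The names `with_capture` and `without_capture` are illustrative.

```python
import re

my_string3 = "John Smith: 34-34-34-042-980, Rebeca Smith: 10-10-10-434-425"

# Capturing group for the prefix: findall returns tuples, and the
# repeated group contributes only its last repetition
with_capture = re.findall(r"(\d{2}-){3}(\d{3}-\d{3})", my_string3)
print(with_capture)

# Non-capturing group for the prefix: only the wanted part is returned
without_capture = re.findall(r"(?:\d{2}-){3}(\d{3}-\d{3})", my_string3)
print(without_capture)
```

The non-capturing version keeps the result a flat list of strings, so no extra indexing is needed afterwards.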

### Alternation with non_capturing groups

Finally, we can combine non-capturing groups and alternation together. Remember that alternation implies using parentheses and the pipe operator to group optional characters. 

Let's suppose that we have the following string. And we want to match all the numbers of the day. We know that they are followed by `th` or `rd`. But we only want to capture the number and not the letters that follow. 

We write our regex. We capture inside parentheses backslash d repeated once or more times. Then, we can use a non-capturing group. Inside we use the pipe operator to choose between th and rd as shown in the code. We find all the matches in the string. And we get the correct output.

In [52]:
my_date = "Today is 23rd May 2019. Tomorrow is 24th May 19."
re.findall(r"(\d+)(?:th|rd)", my_date)

['23', '24']

### Exercise 3: Love it!

You are still working on the Twitter sentiment analysis project. First, you want to identify positive tweets about movies and concerts.

You plan to find all the sentences that contain the words love, like, or enjoy and capture that word. You will limit the tweets by focusing on those that contain the words movie or concert by keeping the word in another group. You will also save the movie or concert name.

For example, if you have the sentence: `I love the movie Avengers`. You match and capture `love`. You need to match and capture `movie`. Afterwards, you match and capture anything `until the dot.`

The list `sentiment_analysis2`, containing the text of three tweets, is shown below.

In [54]:
sentiment_analysis2 = ['I totally love the concert The Book of Souls World Tour. It kinda amazing!',
 'I enjoy the movie Wreck-It Ralph. I watched with my girlfriend.',
 "I still like the movie Wish Upon a Star. Too bad Disney doesn't show it anymore."]
sentiment_analysis2

['I totally love the concert The Book of Souls World Tour. It kinda amazing!',
 'I enjoy the movie Wreck-It Ralph. I watched with my girlfriend.',
 "I still like the movie Wish Upon a Star. Too bad Disney doesn't show it anymore."]

In [56]:
# Write a regex that matches sentences with the optional words
regex_positive = r"(love|like|enjoy).+?(movie|concert)\s(.+?)\."

for tweet in sentiment_analysis2:
    # Find all matches of regex in tweet
    positive_matches = re.findall(regex_positive, tweet)
    
    # Complete format to print out the results
    print("Positive comments found {}".format(positive_matches))

Positive comments found [('love', 'concert', 'The Book of Souls World Tour')]
Positive comments found [('enjoy', 'movie', 'Wreck-It Ralph')]
Positive comments found [('like', 'movie', 'Wish Upon a Star')]


### Exercise 4: Ugh! Not for me!

After finding positive tweets, you want to do it for negative tweets. Your plan now is to find sentences that contain the words hate, dislike or disapprove. You will again save the movie or concert name. You will get the tweet containing the words movie or concert but this time, you don't plan to save the word.

For example, if you have the sentence: `I dislike the movie Avengers a lot.` You match and capture `dislike`. You will match but `not capture` the word `movie`. Afterwards, you match and capture anything `until the dot`.

The list `sentiment_analysis3`, containing the text of three tweets, is shown below.

- Complete the regular expression to capture the words `hate or dislike or disapprove`. Match but `don't capture` the words `movie or concert`. Match and capture anything appearing `until the .`.


- Find all matches of the regex in each element of sentiment_analysis3. Assign them to negative_matches.


- Complete the .format() method to print out the results contained in negative_matches for each element in sentiment_analysis3.

In [89]:
sentiment_analysis3 = ['That was horrible! I really dislike the movie The cabin and the ant. So boring.',
                         "I disapprove the movie Honest with you. It's full of cliches.",
                     'I dislike very much the concert After twelve Tour. The sound was horrible.']
sentiment_analysis3

['That was horrible! I really dislike the movie The cabin and the ant. So boring.',
 "I disapprove the movie Honest with you. It's full of cliches.",
 'I dislike very much the concert After twelve Tour. The sound was horrible.']

In [90]:
regex_negative = r"(hate|dislike|disapprove).+?(?:movie|concert)\s(.+?)\."

for tweet in sentiment_analysis3:
    negative_matches = re.findall(regex_negative, tweet)
    print("Negative comments found {}".format(negative_matches))

Negative comments found [('dislike', 'The cabin and the ant')]
Negative comments found [('disapprove', 'Honest with you')]
Negative comments found [('dislike', 'After twelve Tour')]


### Numbered groups

Imagine we come across this text. 

<img src="t.jpg" style="max-width:600px">

And we want to extract the date highlighted. But we want to extract only the numbers. So, we can place parentheses in a regex to capture these groups as we learned.

We have seen that each of these groups receives a number. The whole expression is group zero. The first group is number one, and so on.

<img src="t-1.jpg" style="max-width:600px">

In [94]:
text = "Python 3.0 was released on 12-03-2008."

Let's use dot search to match the pattern to the text. To retrieve the groups captured, we can use the method dot group specifying the number of a group we want. 

In [95]:
information = re.search(r'(\d{1,2})-(\d{2})-(\d{4})', text)
information

<re.Match object; span=(27, 37), match='12-03-2008'>

In [98]:
information.group(1)

'12'

In [99]:
information.group(2)

'03'

In [100]:
information.group(3)

'2008'

In [97]:
information.group(0)

'12-03-2008'

The method retrieves the match corresponding to group number one, two and three as shown in the output. We can also retrieve group zero. It will output the entire match. 


`Dot group` is a method of match objects, so it works on the results of `dot search` and `dot match`, but not on the list returned by findall.
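Two related tools may be useful here, shown as a brief sketch: the `.groups()` method of a match object returns every captured group at once, and `re.finditer` yields match objects, so groups can be read for each match in a string. The second input string and the name `years` are made up for illustration.

```python
import re

text = "Python 3.0 was released on 12-03-2008."
pattern = r"(\d{1,2})-(\d{2})-(\d{4})"

information = re.search(pattern, text)

# All captured groups at once, as a tuple
print(information.groups())

# .group() also accepts several group numbers in one call
print(information.group(1, 3))

# re.finditer yields one match object per match (hypothetical input string)
years = [m.group(3) for m in re.finditer(pattern, "Dates: 12-03-2008 and 4-07-2019")]
print(years)
```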

### Named groups

We can also give names to our capturing groups. Inside the parentheses, we write question mark, capital P, and the name inside angle brackets: `(?P<name>...)`.

<img src="ng.jpg" style="max-width:600px">


Let's say we have the following string. 

In [101]:
text = "Austin, 78701"

We want to match the name of the city and zipcode in different groups. We can use capturing groups and assign them the name city and zipcode as shown in the code. 

In [103]:
cities = re.search(r"(?P<city>[A-Za-z]+).*?(?P<zipcode>\d{5})", text)
cities

<re.Match object; span=(0, 13), match='Austin, 78701'>

We retrieve the information by using dot group. We indicate the name of the group. 

For example, specifying city gives us the output Austin. Specifying zipcode gives us the number match as shown.

In [104]:
cities.group("city")

'Austin'

In [105]:
cities.group("zipcode")

'78701'
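When several named groups are involved, the match object's `.groupdict()` method collects them in one dictionary. A short sketch on the same string (the name `as_dict` is illustrative):

```python
import re

text = "Austin, 78701"
cities = re.search(r"(?P<city>[A-Za-z]+).*?(?P<zipcode>\d{5})", text)

# Every named group, keyed by its name
as_dict = cities.groupdict()
print(as_dict)

# Named groups keep their numbers too, so both styles can be mixed
print(cities.group(1), cities.group(2))
```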

### Backreferences

There is another way to backreference groups. In fact, the matched group can be reused inside the same regex or outside for substitution. We can do this using backslash and the number of the group.

<img src="br.jpg" style="max-width:600px">

Let's see an example. We have the following string. We want to find all matches of repeated words. 


In [106]:
sentence = "I wish you a happy happy birthday!"
re.findall(r"(\w+)\s\1", sentence)

['happy']

In the code, we specify that we want to capture a sequence of word characters. Then a whitespace. Then we write backslash one. This will indicate that we want to match the first group captured again. In other words, it says match that sequence of characters that was previously captured once more. And we get the word happy as an output. This was the repeated word in our string.

### Replacing the repeated word with one occurrence of the same word

Now, we will replace the repeated word with one occurrence of the same word. 

In [107]:
sentence = "I wish you a happy happy birthday!"
re.sub(r"(\w+)\s\1", r"\1", sentence)

'I wish you a happy birthday!'

In the code, we use the same regex as before. This time, we use the dot sub method. 

In the replacement part, we can also reference back to the captured group. We write r backslash one inside quotes. This says replace the entire expression match with the first captured group. 

In the output string, we have only one occurrence of the word happy.

### Backreferences

We can also use named groups for backreferencing. To do this, we use question mark, capital P, equal sign and the group name: `(?P=name)`. 

<img src="br-1.jpg" style="max-width:600px">

In the code, we want to find all matches of the same number. We use a capturing group and name it code. Later, we reference back to this group. And we obtain the number as an output.

In [108]:
## Using backreferencing
sentence = "Your new code number is 23434. Please, enter 23434 to open the door."
re.findall(r"(?P<code>\d{5}).*?(?P=code)", sentence)

['23434']

In [109]:
## If we don't use backreferencing then-

sentence = "Your new code number is 23434. Please, enter 23434 to open the door."
re.findall(r"(?P<code>\d{5}).*?", sentence)

['23434', '23434']

### Backreferences

On the other hand, to reference the group back for `replacement` we need to use `backslash g` and the `group name` inside angle brackets. 

<img src="br-2.jpg" style="max-width:600px">

In [114]:
sentence = "This app is not working! It's repeating the last word word."
re.findall(r"(?P<word>\w+)\s(?P=word)", sentence)

['word']

In [115]:
re.sub(r"(?P<word>\w+)\s(?P=word)", r"\g<word>", sentence)

"This app is not working! It's repeating the last word."

In the code, we want to replace repeated words by one occurrence of the same word. Inside the regex, we use the previous syntax. In the replacement field, we need to use this new syntax as seen in the code to get the following output.
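The same replacement syntax can reference several named groups, in any order. The sketch below rearranges a date; the input string, the assumption that it is in day-month-year order, and the group names `day`, `month` and `year` are all illustrative.

```python
import re

date = "12-03-2008"  # hypothetical input, assumed to be day-month-year
pattern = r"(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})"

# Reference each named group in the replacement with \g<name>
reordered = re.sub(pattern, r"\g<year>-\g<month>-\g<day>", date)
print(reordered)
```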

### Exercise 5: Close the tag, please!

In the meantime, you are working on one of your other projects. The company is going to develop a new product. It will help developers automatically check the code they are writing. You need to write a short script for checking that every HTML tag that is open has its proper closure.

You have an example of a string containing HTML tags:

`<title>The Data Science Company</title>`

You learn that an `opening HTML tag` is always at the beginning of the string. It appears inside `<>`. A `closing tag` also appears inside `<>`, but it is preceded by `/`.

You also remember that capturing groups can be referenced using numbers, e.g \4.

The list `html_tags`, containing three strings with HTML tags, is shown below.

In [151]:
html_tags = ['<body>Welcome to our course! It would be an awesome experience</body>',
 '<article>To be a data scientist, you need to have knowledge in statistics and mathematics</article>',
 '<nav>About me Links Contact me!']
html_tags

['<body>Welcome to our course! It would be an awesome experience</body>',
 '<article>To be a data scientist, you need to have knowledge in statistics and mathematics</article>',
 '<nav>About me Links Contact me!']

- Complete the regex in order to match closed HTML tags. Find if there is a match in each string of the list html_tags. Assign the result to match_tag.

- If a match is found, print the first group captured and saved in match_tag.

- If no match is found, complete the regex to match only the text inside the HTML tag. Assign it to notmatch_tag.

- Print the first group captured by the regex and save it in notmatch_tag.

In [152]:
for string in html_tags:
    # Complete the regex and find if it matches a closed HTML tags
    match_tag =  re.match(r"<(\w+)>.*?</\1>", string)
    print(match_tag)

    if match_tag:
        # If it matches print the first group capture
        #print(match_tag.group(1))
        print("Your tag {} is closed".format(match_tag.group(1))) 
    else:
        # If it doesn't match capture only the tag 
        notmatch_tag = re.match(r"<(\w+)>", string)
        #print(match_tag.group(1))
        # Print the first group capture
        print("Close your {} tag!".format(notmatch_tag.group(1)))


<re.Match object; span=(0, 69), match='<body>Welcome to our course! It would be an aweso>
Your tag body is closed
<re.Match object; span=(0, 99), match='<article>To be a data scientist, you need to have>
Your tag article is closed
None
Close your nav tag!


In [155]:
for string in html_tags:
    # Variant without a backreference: any closing tag is accepted, even a mismatched one
    match_tag =  re.match(r"<(\w+)>.*?</\w+>", string)
    print(match_tag)

    if match_tag:
        # If it matches print the first group capture
        #print(match_tag.group(1))
        print("Your tag {} is closed".format(match_tag.group(1))) 
    else:
        # If it doesn't match capture only the tag 
        notmatch_tag = re.match(r"<(\w+)>", string)
        #print(match_tag.group(1))
        # Print the first group capture
        print("Close your {} tag!".format(notmatch_tag.group(1)))


<re.Match object; span=(0, 69), match='<body>Welcome to our course! It would be an aweso>
Your tag body is closed
<re.Match object; span=(0, 99), match='<article>To be a data scientist, you need to have>
Your tag article is closed
None
Close your nav tag!


### Exercise 6: Reeepeated characters

Back to your sentiment analysis! Your next task is to replace elongated words that appear in the tweets. We define an elongated word as a word in which the same character is repeated two or more times in a row, e.g. `"Awesoooome"`.

Replacing those words is very important since a classifier will treat them as a different term from the source words lowering their frequency.

To find them, you will use capturing groups and reference them back using numbers. E.g \4.

If you want to find a match for Awesoooome. You first need to capture `Awes`. Then, match `o` and `reference the same character back`, and then, `me`.

The list `sentiment_analysis4`, containing the text of three tweets, is shown below.

In [156]:
sentiment_analysis4 = ['@marykatherine_q i know! I heard it this morning and wondered the same thing. Moscooooooow is so behind the times',
 'Staying at a friends house...neighborrrrrrrs are so loud-having a party',
 'Just woke up an already have read some e-mail']
sentiment_analysis4

['@marykatherine_q i know! I heard it this morning and wondered the same thing. Moscooooooow is so behind the times',
 'Staying at a friends house...neighborrrrrrrs are so loud-having a party',
 'Just woke up an already have read some e-mail']

In [166]:
# Complete the regex to match an elongated word
regex_elongated = r"\w*(\w)\1\w*"

for tweet in sentiment_analysis4:
    # Find if there is a match in each tweet 
    match_elongated = re.search(regex_elongated, tweet)
    
    if match_elongated:
        # Assign the captured group zero 
        elongated_word = match_elongated.group(0)
        
        # Complete the format method to print the word
        print("Elongated word found: {word}".format(word=elongated_word))
    else:
        print("No elongated word found")

Elongated word found: Moscooooooow
Elongated word found: neighborrrrrrrs
No elongated word found
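The exercise only finds elongated words. As a rough sketch of the replacement step it mentions, a backreference in `re.sub` can collapse a run of repeated characters. The helper `collapse_elongated` and the threshold of three or more repeats are assumptions made here, not part of the exercise, and collapsing to a single character would be wrong for words whose correct spelling has a double letter.

```python
import re

def collapse_elongated(text):
    """Collapse any character repeated three or more times in a row to one.

    Hypothetical helper: the threshold of three is an assumption.
    """
    # (\w) captures one character; \1{2,} requires at least two more copies
    return re.sub(r"(\w)\1{2,}", r"\1", text)

print(collapse_elongated("Moscooooooow is so behind the times"))
print(collapse_elongated("neighborrrrrrrs are so loud"))
```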


### Lookaround

In this final discussion, we will look into specific types of non-capturing groups. They help us look around an expression.


Look-around will look for what is behind or ahead of a pattern. Imagine that we have the following string. We want to see what is surrounding a specific word. For example, we position ourselves in the word cat. 

<img src="la.jpg" style="max-width:600px">

So look-around will let us answer the following problem. At my current position, `look ahead` and search if `sat` is there. Or `look behind` and search if `white` is there.

### Look-ahead

We'll start by exploring look-ahead. This non-capturing group checks whether the first part of the expression is followed, or not followed, by the look-ahead expression. As a consequence, it returns only the first part of the expression. In the previous example, we are looking for the word cat. The look-ahead expression can be either positive or negative. 

For positive look-ahead we use `(?=...)`: question mark, equal sign. For negative look-ahead we use `(?!...)`: question mark, exclamation mark.

<img src="la-1.jpg" style="max-width:600px">

### Positive look-ahead

Let's start with positive lookahead. Let's imagine that we have a string containing file names and the status of that file as shown in the code. 

In [167]:
my_text ="tweets.txt transferred, mypass.txt transferred, keywords.txt error"
my_text

'tweets.txt transferred, mypass.txt transferred, keywords.txt error'

We want to extract only those files that are followed by the word transferred. So- 

- we start building the regex by indicating any word character followed by dot txt. 

- We now indicate we want the first part to be followed by the word transferred. We do so by writing question mark equal and then whitespace transferred all inside the parenthesis. 

With that specification, we get only the desired strings as shown in the output.

In [168]:
re.findall(r"\w+\.txt(?=\stransferred)", my_text)

['tweets.txt', 'mypass.txt']
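To see that the look-ahead only checks what follows without including it in the match, compare the span of the match with a pattern that consumes the same text. A small sketch (the names `match` and `consuming` are illustrative):

```python
import re

my_text = "tweets.txt transferred, mypass.txt transferred, keywords.txt error"

# With look-ahead: " transferred" is checked but is not part of the match
match = re.search(r"\w+\.txt(?=\stransferred)", my_text)
print(match.group(0), match.span())

# Without look-ahead: the same text is consumed and returned
consuming = re.search(r"\w+\.txt\stransferred", my_text)
print(consuming.group(0))
```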

### Negative look-ahead

Now, let's use negative lookahead in the same example. In this case, we will say that we want those matches that are NOT followed by the expression transferred. We use instead question mark exclamation mark inside parenthesis as seen in the code. Now, we get this other output.

In [169]:
my_text ="tweets.txt transferred, mypass.txt transferred, keywords.txt error"
re.findall(r"\w+\.txt(?!\stransferred)", my_text)

['keywords.txt']

### Look-behind

The look-behind non-capturing group gets all matches that are preceded, or not preceded, by a specific pattern. As a consequence, it returns the matches that come after the look-behind expression. 

Let's use the same example, but now we are looking before the word cat. A look-behind expression can also be either positive or negative. For positive look-behind we use `(?<=...)`: question mark, left angle bracket, equal sign. For negative look-behind we use `(?<!...)`: question mark, left angle bracket, exclamation mark.

<img src="lb.jpg" style="max-width:600px">

### Positive look-behind

Let's look at the following string. We want to find all matches of the names that are preceded by the word member. How do we construct our regex with positive look-behind? 

In [170]:
my_text ="Member: Angus Young, Member: Chris Slade, Past: Malcolm Young, Past: Cliff Williams."
re.findall(r"(?<=Member:\s)\w+\s\w+", my_text)

['Angus Young', 'Chris Slade']

Let's examine the code. At the end of the regex, we'll indicate we want a sequence of word characters whitespace another sequence of word characters. The look-behind expression goes before that expression. We indicate question mark angle bracket equal followed by member, colon, and whitespace. All inside parentheses.  In that way we get the two names that were preceded by the word member as shown in the output.

### Negative look-behind

Now, we have this other string. We will use negative look-behind. We will find all matches of the word cat or dog that are not preceded by the word brown. 

In [173]:
my_text = "My white cat sat at the table. However, my brown dog was lying on the couch."
re.findall(r"(?<!brown\s)(cat|dog)", my_text)

['cat']

In this code example, we use question mark angle bracket exclamation mark, followed by brown, whitespace. All inside the parenthesis. Then, we indicate our alternation group: cat or dog. Consequently, we get cat as an output. The cat or dog word that is not after the word brown.

In [172]:
my_text2 = "My white cat sat at the table. However, my black dog was lying on the couch."
re.findall(r"(?<!brown\s)(cat|dog)", my_text2)

['cat', 'dog']
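One practical limit worth knowing: Python's built-in `re` module only accepts look-behind patterns of fixed width, which is why the examples here use literal words such as `brown\s` rather than `\w+\s`. The sketch below demonstrates this; the flag name `variable_width_ok` is illustrative.

```python
import re

my_text = "My white cat sat at the table. However, my brown dog was lying on the couch."

# A variable-width look-behind (\w+ can have any length) is rejected at compile time
try:
    re.compile(r"(?<=\w+\s)(cat|dog)")
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print("Rejected:", exc)

# A fixed-width look-behind (exactly six characters here) compiles and matches
fixed_width = re.findall(r"(?<=white\s)(cat|dog)", my_text)
print(fixed_width)
```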

### Exercise 7: Surrounding words

Now, you want to perform some visualizations with your sentiment_analysis dataset. You are interested in the words surrounding python. You want to count how many times a specific word appears right before and after it.

`Positive lookahead (?=)` makes sure that first part of the expression is followed by the lookahead expression. `Positive lookbehind (?<=)` returns all matches that are preceded by the specified pattern.

The variable `sentiment_analysis5`, containing the text of one tweet, is shown below.

In [174]:
sentiment_analysis5 = 'You need excellent python skills to be a data scientist. Must be! Excellent python'
sentiment_analysis5

'You need excellent python skills to be a data scientist. Must be! Excellent python'

- Get all the words that are followed by the word python in sentiment_analysis5. Print out the words found.

- Get all the words that are preceded by the word python or Python in sentiment_analysis5. Print out the words found.

In [183]:
# Positive lookahead
look_ahead = re.findall(r"\w+(?=\s[Pp]ython)", sentiment_analysis5)

# Print out
print(look_ahead)

['excellent', 'Excellent']


In [184]:
# Positive lookbehind
look_behind = re.findall(r"(?<=[Pp]ython\s)\w+", sentiment_analysis5)

# Print out
print(look_behind)

['skills']


### Exercise 8: Filtering phone numbers

Now, you need to write a script for a cell-phone searcher. It should scan a list of phone numbers and return those that meet certain characteristics.

The phone numbers in the list have the structure:

- Optional area code: 3 numbers

- Prefix: 4 numbers

- Line number: 6 numbers

- Optional extension: 2 numbers

- E.g. 654-8764-439434-01.

You decide to use .findall() and the non-capturing groups `negative lookahead (?!)` and `negative lookbehind (?<!)`.

The list `cellphones`, containing three phone numbers, is shown below.

In [185]:
cellphones = ['4564-646464-01', '345-5785-544245', '6476-579052-01']
cellphones

['4564-646464-01', '345-5785-544245', '6476-579052-01']

- Get all cell phone numbers that are not preceded by the optional area code.

- Get all the cell phone numbers that are not followed by the optional extension.

In [186]:
for phone in cellphones:
    # Get all phone numbers not preceded by area code
    number = re.findall(r"(?<!\d{3}-)\d{4}-\d{6}-\d{2}", phone)
    print(number)

['4564-646464-01']
[]
['6476-579052-01']


In [187]:
for phone in cellphones:
    # Get all phone numbers not followed by optional extension
    number = re.findall(r"\d{3}-\d{4}-\d{6}(?!-\d{2})", phone)
    print(number)

[]
['345-5785-544245']
[]
