
# 1. Capturing groups

- Match a specific subpattern within a pattern
- Use it for further processing

In [2]:
# We want to extract each person's name and the number and type of relationships they have
text = "Clary has 2 friends who she spends a lot time with. Susan has 3 brothers while John has 4 sisters."
print(text)

Clary has 2 friends who she spends a lot time with. Susan has 3 brothers while John has 4 sisters.


In [4]:
import re
re.findall(r'[A-Za-z]+\s\w+\s\d+\s\w+', text)

['Clary has 2 friends', 'Susan has 3 brothers', 'John has 4 sisters']

...but we don't want the word **'has'**

We start simple by extracting only the names, placing parentheses to **group** those characters:
- **Group 1:** `([A-Za-z]+)`

In [5]:
# Added parentheses to group our first part
re.findall(r'([A-Za-z]+)\s\w+\s\d+\s\w+', text)

['Clary', 'Susan', 'John']

The entire match is always **Group 0**; the parenthesized groups are numbered from 1, left to right
- **Group 0:** `([A-Za-z]+)\s\w+\s(\d+)\s(\w+)`

In [6]:
# Capture the name, the number and the relationship type as three groups
re.findall(r'([A-Za-z]+)\s\w+\s(\d+)\s(\w+)', text)

[('Clary', '2', 'friends'),
 ('Susan', '3', 'brothers'),
 ('John', '4', 'sisters')]

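The output is a list of tuples, one per match. When a pattern contains several groups, `findall` returns the captured groups rather than Group 0:

- The first element of each tuple is the text captured by group 1
- The second element is the text captured by group 2
- The third element is the text captured by group 3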
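To see how Group 0 relates to the numbered groups, one option is `re.finditer`, which yields match objects instead of tuples. The following is a minimal sketch reusing the sentence from above; the names `pattern`, `first`, `whole_match` and `captured` are only illustrative.

```python
import re

text = "Clary has 2 friends who she spends a lot time with. Susan has 3 brothers while John has 4 sisters."
pattern = r'([A-Za-z]+)\s\w+\s(\d+)\s(\w+)'

for match in re.finditer(pattern, text):
    # group(0) is the whole match; group(1) to group(3) are the captured parts
    print(match.group(0), '->', match.group(1), match.group(2), match.group(3))

# The same information for the first match only
first = re.search(pattern, text)
whole_match = first.group(0)   # entire matched text
captured = first.groups()      # tuple with groups 1, 2 and 3
```

Here `group(0)` still contains the word "has", while `groups()` holds only the three captured pieces.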

### Organizing the data

In [7]:
# We place parentheses to capture the name of the owner, the number and type of pet
pets = re.findall(r'([A-Za-z]+)\s\w+\s(\d+)\s(\w+)', "Clary has 2 dogs but John has 3 cats")
pets[0][0]

'Clary'
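Since each match is a plain tuple, it can be unpacked into named fields instead of being indexed by position. The sketch below is one possible way to organize the result; the dictionary keys and the name `records` are assumptions made for illustration.

```python
import re

pets = re.findall(r'([A-Za-z]+)\s\w+\s(\d+)\s(\w+)', "Clary has 2 dogs but John has 3 cats")

# Unpack each (owner, count, kind) tuple and convert the count to an integer
records = [
    {"owner": owner, "count": int(count), "pet": kind}
    for owner, count, kind in pets
]
print(records)
```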

Remember that a quantifier applies only to the token immediately to its left:
- `r"apple+"`: `+` applies to `e`, not to the whole word `apple`
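A short sketch of the difference, using two made-up test strings: without parentheses the `+` repeats only the last letter, while with parentheses it repeats the whole group.

```python
import re

# Without parentheses, + repeats only the final "e"
letter_repeated = re.search(r"apple+", "appleee pie").group(0)

# With parentheses, + repeats the whole group "apple"
word_repeated = re.search(r"(apple)+", "appleapple pie").group(0)

print(letter_repeated)
print(word_repeated)
```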

In [8]:
# Match a group made of one digit followed by one letter.

re.search(r"(\d[A-Za-z])+","My user name is 3e4r5fg")

# The plus quantifier after the group repeats the whole group one or more times.

<re.Match object; span=(16, 22), match='3e4r5f'>

- Capture a repeated group `(\d+)` vs. repeat a capturing group `(\d)+`

In [9]:
# Use findall with a capturing group that holds a single digit; the group itself is repeated one or more times.
my_string = "My lucky numbers are 8755 and 33"
re.findall(r"(\d)+", my_string)

# We get 5 and 3: a repeated capturing group keeps only its last repetition, so each match returns just its final digit.

['5', '3']

In [10]:
# Here the quantifier is inside the group, so the group captures one or more consecutive digits.
my_string = "My lucky numbers are 8755 and 33"
re.findall(r"(\d+)", my_string)

['8755', '33']
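The two patterns match the same text; they differ only in what the group keeps. A minimal sketch with `re.search` makes this visible by comparing Group 0 with Group 1 (the variable names `repeated` and `captured` are illustrative).

```python
import re

my_string = "My lucky numbers are 8755 and 33"

# Repeating a capturing group: the match covers all four digits,
# but group 1 holds only the last repetition
repeated = re.search(r"(\d)+", my_string)
print(repeated.group(0), repeated.group(1))

# Capturing a repeated token: group 1 holds every digit of the match
captured = re.search(r"(\d+)", my_string)
print(captured.group(0), captured.group(1))
```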

## Example #1: Try another name

In [11]:
sentiment_analysis = ['Just got ur newsletter, those fares really are unbelievable. Write to statravelAU@gmail.com or statravelpo@hotmail.com. They have amazing prices',
 'I should have paid more attention when we covered photoshop in my webpage design class in undergrad. Contact me Hollywoodheat34@msn.net.',
 'hey missed ya at the meeting. Read your email! msdrama098@hotmail.com']

- Complete the regex to match the email capturing only the name part. The name part appears before the `@`.

In [31]:
# Write a regex that matches an email address and captures only the name part
regex_email = r"(\w+[A-Za-z0-9]+)@\S+?"

- Find all matches of the regex in each element of `sentiment_analysis`. Assign the result to the variable `email_matched`.
- Complete the `.format()` method to print the results captured in each element of `sentiment_analysis`.

In [32]:
for tweet in sentiment_analysis:
    # Find all matches of regex in each tweet
    email_matched = re.findall(regex_email, tweet)

    # Complete the format method to print the results
    print("Lists of users found in this tweet: {}".format(email_matched))

Lists of users found in this tweet: ['statravelAU', 'statravelpo']
Lists of users found in this tweet: ['Hollywoodheat34']
Lists of users found in this tweet: ['msdrama098']
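If the domain is also of interest, a second capturing group can be added after the `@`. The pattern below is a simplified sketch, not a full email validator: `regex_name_domain` is a hypothetical name, and the domain group assumes the address ends in a word character so that a trailing sentence period is left out.

```python
import re

tweet = ('Just got ur newsletter, those fares really are unbelievable. '
         'Write to statravelAU@gmail.com or statravelpo@hotmail.com. They have amazing prices')

# Group 1: name part before the @; group 2: domain, which must end in a word character
regex_name_domain = r"(\w+)@([\w.]+\w)"

for name, domain in re.findall(regex_name_domain, tweet):
    print("User: {} Domain: {}".format(name, domain))

pairs = re.findall(regex_name_domain, tweet)
```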


## Example #2: Flying home

You need to extract the information about the flight:

- The first two letters indicate the airline (e.g. `LA`).
- The four digits are the flight number (e.g. `4214`).
- The next three letters are the departure (e.g. `AER`).
- The three letters after the hyphen are the destination (e.g. `CDB`).
- The last part is the date of the flight (e.g. `06NOV`).

All letters are always uppercase.

In [40]:
flight = 'Subject: You are now ready to fly. Here you have your boarding pass IB3723 AMS-MAD 06OCT'

- Complete the regular expression to match and capture all the flight information required. Only the first pair of parentheses was placed for you.

In [41]:
# Import re
import re

# Write regex to capture information of the flight
regex = r"([A-Z]{2})(\d{4})\s([A-Z]{3})-([A-Z]{3})\s(\d{2}[A-Z]{3})"

- Find all the matches corresponding to each piece of information about the flight. Assign it to `flight_matches`.

In [42]:
# Find all matches of the flight information
flight_matches = re.findall(regex, flight)
print(flight_matches)

[('IB', '3723', 'AMS', 'MAD', '06OCT')]


- Complete the format method with the elements contained in `flight_matches`. In the first line print the airline, and the flight number. In the second line, the departure and destination. In the third line, the date.

In [45]:
# Print the matches
print("Airline: {} Flight number: {}".format(flight_matches[0][0], flight_matches[0][1]))
print("Departure: {} Destination: {}".format(flight_matches[0][2], flight_matches[0][3]))
print("Date: {}".format(flight_matches[0][4]))

Airline: IB Flight number: 3723
Departure: AMS Destination: MAD
Date: 06OCT
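As an alternative to indexing `flight_matches[0][...]`, the captured groups can be unpacked into named variables. This is a minimal sketch using `re.search` and `Match.groups()`; the variable names and the `summary` string are assumptions for illustration.

```python
import re

flight = 'Subject: You are now ready to fly. Here you have your boarding pass IB3723 AMS-MAD 06OCT'
regex = r"([A-Z]{2})(\d{4})\s([A-Z]{3})-([A-Z]{3})\s(\d{2}[A-Z]{3})"

match = re.search(regex, flight)
if match:
    # groups() returns the five captured parts in order
    airline, number, departure, destination, date = match.groups()
    summary = "{}{}: {} to {} on {}".format(airline, number, departure, destination, date)
    print(summary)
```

Checking `match` before unpacking avoids an error when a message contains no flight information, because `re.search` returns `None` in that case.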
