# Regular Expressions in Python
Instructor: Maria Eugenia Inzaugarat ([course link](https://learn.datacamp.com/courses/regular-expressions-in-python))  
Note Taker: Paris Zhang on Aug 16, 2020

Course contents:
1. Basic concepts of string manipulation
 + String operations
 + Finding and replacing
2. Formatting strings
 + Positional formatting
 + Formatted string literal
 + Template method
3. Regular expressions for pattern matching
 + Repetitions
 + Regex metacharacters
 + Greedy vs. non-greedy matching
4. Advanced regular expressions concepts
 + Capturing groups
 + Alternation and non-capturing groups
 + Backreferences
 + Lookaround

## Chapter 1 - Basic concepts of string manipulation

### 1.1 - Intro to string manipulation
* Convert to string - `str()`
* Get the length of a string - `len()`
* Stride - `string[0:6:2]` takes every 2nd character from index 0 up to, but not including, index 6
* A **palindrome** is a sequence of characters which can be read the same backward as forward, for example: Madam or No lemon, no melon. 

In [3]:
my_string = "Awesome day"
print(
    my_string[0:6:2],
    my_string[::-1],
    sep = "\n"
)

Aeo
yad emosewA


In [1]:
movie = 'oh my God! desserts I stressed was an ugly movie'

movie_title = movie[11:30]

# Obtain the palindrome
palindrome = movie_title[::-1]

# Print the word if it's a palindrome
if movie_title == palindrome:
    print(movie_title)

desserts I stressed
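
The check above compares the slice with its exact reverse, so it is sensitive to case and punctuation. As a minimal sketch, assuming we also want `Madam` and `No lemon, no melon` to count as palindromes, a hypothetical helper `is_palindrome` (not part of the course) could normalize the text before comparing:

```python
def is_palindrome(text):
    """Return True if text reads the same both ways, ignoring case and non-alphanumerics."""
    # Keep only letters and digits, lowercased (an assumption about what "same" means)
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    return cleaned == cleaned[::-1]

results = [is_palindrome(s) for s in ("Madam", "No lemon, no melon", "Awesome day")]
print(results)  # [True, True, False]
```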


### 1.2 - String operations
* Adjusting cases with `.lower()` and `.upper()`
* Capitalize the first letter - `.capitalize()`
* Splitting a string into a list of substrings - `.split()` and `.rsplit()` (starts splitting from the right)

In [5]:
my_string = "This string will be split"

print(
    my_string.split(sep=" ", maxsplit=2),
    my_string.rsplit(sep=" ", maxsplit=2),
    sep = "\n"
)

['This', 'string', 'will be split']
['This string will', 'be', 'split']


* Breaking at line boundaries - `.splitlines()`

| Escape Sequence | Character |
|---|---|
| **\n** | New line |
| **\r** | Carriage return |

In [6]:
my_string = "This string will be split\nin two"
print(
    my_string,
    my_string.splitlines(),
    sep = "\n"
)

This string will be split
in two
['This string will be split', 'in two']


* Joining - concatenate strings from a list or another iterable with `sep.join(iterable)`

In [7]:
my_list = ["this", "would", "be", "a", "string"]
print(" ".join(my_list))

this would be a string


* Stripping characters from both ends (whitespace by default): `.strip()`
 + Remove characters from the right end: `.rstrip()`
 + Remove characters from the left end: `.lstrip()`

In [10]:
my_string = " This string will be stripped.\n"

print(
    my_string.strip(" "),
    my_string.rstrip(),
    my_string.lstrip(),
    sep = "\n"
)

This string will be stripped.

 This string will be stripped.
This string will be stripped.



In [12]:
movie = '$I supposed that coming from MTV Films I should expect no less$'

# Convert to lowercase and print the result
movie_lower = movie.lower()

# Remove the leading and trailing $ and print the result
movie_no_space = movie_lower.strip("$")

# Split the string into substrings and print the result
movie_split = movie_no_space.split()

# Select root word and print the result
word_root = movie_split[1][:-2]

print(
    movie_lower,
    movie_no_space,
    movie_split,
    word_root,
    sep = "\n\n"
)

$i supposed that coming from mtv films i should expect no less$

i supposed that coming from mtv films i should expect no less

['i', 'supposed', 'that', 'coming', 'from', 'mtv', 'films', 'i', 'should', 'expect', 'no', 'less']

suppos


In [13]:
movie = 'the film,however,is all good<\\i>'

# Remove tags happening at the end and print results
# (rstrip treats its argument as a set of characters, not as a suffix)
movie_tag = movie.rstrip("<\\i>")

# Split the string using commas and print results
movie_no_comma = movie_tag.split(",")

# Join back together and print results
movie_join = " ".join(movie_no_comma)

print(
    movie_tag,
    movie_no_comma,
    movie_join,
    sep = "\n"
)

the film,however,is all good
['the film', 'however', 'is all good']
the film however is all good
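
Because `.rstrip()` removes any trailing characters that belong to the given set, it can eat into the text when the text itself ends with one of those characters. The sketch below illustrates this with a made-up review string and a hypothetical helper `strip_suffix` that only removes the tag when it appears as a whole (Python 3.9+ also offers `str.removesuffix()` for the same purpose):

```python
def strip_suffix(text, suffix):
    """Remove suffix from the end of text only if it is present as a whole."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text

review = "this is a wiki<\\i>"  # made-up example string

as_char_set = review.rstrip("<\\i>")       # also strips the final "i" of "wiki"
as_suffix = strip_suffix(review, "<\\i>")  # removes only the tag

print(as_char_set)  # this is a wik
print(as_suffix)    # this is a wiki
```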


In [14]:
file = 'mtv films election, a high school comedy, is a current example\nfrom there, director steven spielberg wastes no time, taking us into the water on a midnight swim'

# Split string at line boundaries
file_split = file.splitlines()

print(file_split)

['mtv films election, a high school comedy, is a current example', 'from there, director steven spielberg wastes no time, taking us into the water on a midnight swim']


In [15]:
# Complete for-loop to split by commas
for substring in file_split:
    substring_split = substring.split(",")
    print(substring_split)

['mtv films election', ' a high school comedy', ' is a current example']
['from there', ' director steven spielberg wastes no time', ' taking us into the water on a midnight swim']


### 1.3 - Finding and replacing
* Finding substrings - default `string.find(substring)`, optional `string.find(substring, start, end)`; returns `-1` if the substring is not found
* Index function - default `string.index(substring)`, optional `string.index(substring, start, end)`. Works like `.find()` and returns the index of the substring, but raises a `ValueError` if the substring is not in the string
* Counting occurrences - default `string.count(substring)`, optional `string.count(substring, start, end)`
* Replacing substrings - default `string.replace(old, new)`, optional `string.replace(old, new, count)` in which `count` stands for the number of occurrences to be replaced

In [17]:
# 1. ".find()"
my_string = "Where's Waldo?"

print(
    my_string.find("Waldo"),
    my_string.find("Wenda"), # Doesn't include this substring - returns -1
    my_string.find("Waldo", 0, 6),
    sep = "\n"
)

8
-1
-1
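
The optional `start` argument makes it possible to keep searching after a previous hit. A minimal sketch, assuming we want the index of every non-overlapping occurrence (`find_all` is a hypothetical helper, not a built-in):

```python
def find_all(text, substring):
    """Return the start index of every non-overlapping occurrence of substring."""
    if not substring:
        raise ValueError("substring must not be empty")
    positions = []
    start = text.find(substring)
    while start != -1:
        positions.append(start)
        # Resume the search right after the previous occurrence
        start = text.find(substring, start + len(substring))
    return positions

sentence = "The red house is between the blue house and the old house"
house_positions = find_all(sentence, "house")
print(house_positions)  # [8, 34, 52]
```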


In [18]:
# 2. ".index()"
try:
    my_string.index("Wenda")
except ValueError:
    print("Not found")

Not found


In [19]:
# 3. ".count()"
my_string = "How many fruits do you have in your fruit basket?"

print(
    my_string.count("fruit"),
    my_string.count("fruit", 0, 16),
    sep = "\n"
)

2
1


In [20]:
# 4. ".replace()"
my_string = "The red house is between the blue house and the old house"

print(
    my_string.replace("house","car"),
    my_string.replace("house","car",2),
    sep = "\n"
)

The red car is between the blue car and the old car
The red car is between the blue car and the old house


## Chapter 2 - Formatting Strings

Which method to use?

1. `str.format()`:
 + Good to start with. Concepts apply to f-strings.
 + Compatible with all versions of Python.
2. f-strings:
 + Generally the preferred choice whenever they are available.
 + Require a modern version of Python (3.6+).
3. Template strings:
 + When working with external or user-provided strings
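
To make the comparison concrete, here is a small sketch that builds the same sentence with all three methods. The variable names and the sentence are illustrative only:

```python
from string import Template

language = "Python"
version = 3.6

with_format = "{} {} introduced f-strings".format(language, version)
with_fstring = f"{language} {version} introduced f-strings"
with_template = Template("$language $version introduced f-strings").substitute(
    language=language, version=version
)

# All three lines print: Python 3.6 introduced f-strings
print(with_format, with_fstring, with_template, sep="\n")
```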

### 2.1 - Positional formatting
Intro to string formatting:
* String formatting is also known as string interpolation - insert a custom string / variable in predefined text.
* Methods for formatting:
 + Positional formatting
 + Formatted string literals
 + Template method

Positional formatting:
* Placeholders:
 + Placeholders replaced by values, `'text{}'.format(value)`
 + `str.format()`
 + Use variables for both the initial string and the values passed into the method
* Reordering values:
 + Include an index number into the placeholders to reorder values
* Named placeholders:
 + Specify a name for the placeholders
* Format specifier: specify data type to be used - `{index:specifier}`
* Formatting datetime using `datetime.now()`

In [21]:
custom_string = "String formatting"
print(f"{custom_string} is a powerful technique")

String formatting is a powerful technique


1. Placeholders replaced by values

In [22]:
print("Machine learning provides {} the ability to learn {}"\
      .format("systems", "automatically"))

Machine learning provides systems the ability to learn automatically


In [23]:
my_string = "{} rely on {} datasets"
method = "Supervised algorithms"
condition = "labeled"

print(my_string.format(method, condition))

Supervised algorithms rely on labeled datasets


2. Reordering values

In [24]:
print("{} has a friend called {} and a sister called {}"\
      .format("Betty", "Linda", "Daisy"))

Betty has a friend called Linda and a sister called Daisy


In [25]:
print("{2} has a friend called {0} and a sister called {1}"\
      .format("Betty", "Linda", "Daisy"))

Daisy has a friend called Betty and a sister called Linda


3. Named placeholders

In [26]:
tool="Unsupervised algorithms"
goal="patterns"
print("{title} try to find {aim} in the dataset".format(title=tool, aim=goal))

Unsupervised algorithms try to find patterns in the dataset


In [27]:
my_methods = {"tool": "Unsupervised algorithms", "goal": "patterns"}
print('{data[tool]} try to find {data[goal]} in the dataset'.format(data=my_methods))

Unsupervised algorithms try to find patterns in the dataset


4. Format specifier
 + `{0:f}%` - `0` index, `f` float
 + `{0:.2f}%` - round to 2 decimal places

In [28]:
print("Only {0:f}% of the {1} produced worldwide is {2}!"\
      .format(0.5155675, "data", "analyzed"))

Only 0.515567% of the data produced worldwide is analyzed!


In [29]:
print("Only {0:.2f}% of the {1} produced worldwide is {2}!"\
      .format(0.5155675, "data", "analyzed"))

Only 0.52% of the data produced worldwide is analyzed!


5. Formatting datetime

In [30]:
from datetime import datetime
print(datetime.now())

2020-08-17 10:28:19.273608


In [31]:
print("Today's date is {:%Y-%m-%d %H:%M}".format(datetime.now()))

Today's date is 2020-08-17 10:28


Example:

In [33]:
from datetime import datetime

get_date = datetime.now()
message = "Good morning! Today is {today:%B %d, %Y}. It's {today:%H:%M} ... time to work!"

print(message.format(today=get_date))

Good morning! Today is August 17, 2020. It's 10:46 ... time to work!


### 2.2 - Formatted string literal

* f-string - `f"literal string {expression}"`
* Type conversion:
 + `!s` (string version)
 + `!r` (string containing a printable representation, i.e. with quotes)
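 + `!a` (same as `!r`, but escapes non-ASCII characters)
* Standard format specifiers:
 + `e` (scientific notation, e.g. 5 × 10^3)
 + `d` (integer, e.g. 4)
 + `f` (float, e.g. 4.5353)
* Index lookups - use single quotes `''` for the key inside a double-quoted `""` f-string.
 + Escape sequences (`\`) are not allowed inside the `{}` expression part of an f-string (before Python 3.12)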
* Inline operations, an advantage of f-string
* Calling functions, an advantage of f-string
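
The cells below only demonstrate `!r` and `.2f`, so here is a short sketch of the remaining conversions and specifiers from the list above. The values are made up for illustration:

```python
name = "Zoë"      # contains a non-ASCII character on purpose
amount = 5000.0
count = 4
ratio = 4.5353

str_version = f"{name!s}"     # Zoë
repr_version = f"{name!r}"    # 'Zoë'
ascii_version = f"{name!a}"   # 'Zo\xeb'
scientific = f"{amount:e}"    # 5.000000e+03
integer = f"{count:d}"        # 4
two_decimals = f"{ratio:.2f}" # 4.54

print(str_version, repr_version, ascii_version,
      scientific, integer, two_decimals, sep="\n")
```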

1. f-string

In [34]:
way = "code"
method = "learning Python faster"
print(f"Practicing how to {way} is the best method for {method}")

Practicing how to code is the best method for learning Python faster


2. Type conversion

In [35]:
name = "Python"
print(f"Python is called {name!r} due to a comedy series")

Python is called 'Python' due to a comedy series


3. Format specifier

In [36]:
number = 90.41890417471841
print(f"In the last 2 years, {number:.2f}% of the data was produced worldwide!")

In the last 2 years, 90.42% of the data was produced worldwide!


In [37]:
from datetime import datetime

my_today = datetime.now()
print(f"Today's date is {my_today:%B %d, %Y}")

Today's date is August 17, 2020


4. Index lookups

In [38]:
family = {"dad": "John", "siblings": "Peter"}
print("Is your dad called {family[dad]}?".format(family=family))

Is your dad called John?


In [39]:
print(f"Is your dad called {family[dad]}?")

NameError: name 'dad' is not defined

In [40]:
print("My dad is called "John"")

SyntaxError: invalid syntax (<ipython-input-40-8fa3864c5339>, line 1)

In [42]:
my_string = "My dad is called \"John\""
print(my_string)

My dad is called "John"


In [43]:
print(f"Is your dad called {family[\"dad\"]}?")

SyntaxError: f-string expression part cannot include a backslash (<ipython-input-43-891b946e948e>, line 1)

In [44]:
print(f"Is your dad called {family['dad']}?")

Is your dad called John?


5. Inline operations

In [45]:
my_number = 4
my_multiplier = 7
print(f'{my_number} multiplied by {my_multiplier} is {my_number * my_multiplier}')

4 multiplied by 7 is 28


6. Calling functions

In [46]:
def my_function(a, b):
    return a + b
print(f"If you sum up 10 and 20 the result is {my_function(10, 20)}")

If you sum up 10 and 20 the result is 30


### 2.3 - Template methods

* Slower than f-strings and do not allow format specifiers. Good with externally formatted strings.
* `from string import Template`
* Substitution with `.substitute()`, which can substitute many `$identifier` placeholders
 + Use `${identifier}` when valid identifier characters follow the identifier
 + Use `$$` to escape the dollar sign
 + `.substitute()` raises a `KeyError` when a placeholder is missing
 + Use `.safe_substitute()` to leave missing placeholders in place and still return a usable string

1. Basic syntax

In [47]:
from string import Template
my_string = Template('Data science has been called $identifier')
my_string.substitute(identifier="sexiest job of the 21st century")

'Data science has been called sexiest job of the 21st century'

2. Substitution

In [48]:
job = "Data science"
name = "sexiest job of the 21st century"
my_string = Template('$title has been called $description')
my_string.substitute(title=job, description=name)

'Data science has been called sexiest job of the 21st century'

In [49]:
my_string = Template('I find Python very ${noun}ing but my sister has lost $noun')
my_string.substitute(noun="interest")

'I find Python very interesting but my sister has lost interest'

In [50]:
my_string = Template('I paid for the Python course only $$ $price, amazing!')
my_string.substitute(price="12.50")

'I paid for the Python course only $ 12.50, amazing!'

In [None]:
favorite = dict(flavor="chocolate")
my_string = Template('I love $flavor $cake very much')
my_string.substitute(favorite) # will raise an error

In [52]:
try:
    my_string.substitute(favorite)
except KeyError:
    print("missing information")

missing information


In [53]:
my_string.safe_substitute(favorite)

'I love chocolate $cake very much'

**Side notes**:
1. **Natural Language Toolkit** is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.
2. **TextBlob** is a Python library for processing textual data. It provides a simple API for diving into common natural language processing tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
3. **Gensim** is a robust open-source vector space modeling and topic modeling toolkit implemented in Python. It uses `NumPy`, `SciPy` and optionally `Cython` for performance. Gensim is specifically designed to handle large text collections, using data streaming and efficient incremental algorithms, which differentiates it from most other scientific software packages that only target batch and in-memory processing.

Example:

In [55]:
tools = ['Natural Language Toolkit', '20', 'month']

our_tool = tools[0]
our_fee = tools[1]
our_pay = tools[2]

course = Template("We are offering a 3-month beginner course on $tool just for $$$fee ${pay}ly")

print(course.substitute(tool=our_tool,fee=our_fee,pay=our_pay))

We are offering a 3-month beginner course on Natural Language Toolkit just for $20 monthly


## Chapter 3 - Regular Expressions for Pattern Matching

1. Intro to regular expressions
2. Repetitions
3. Regex metacharacters
4. Greedy vs. non-greedy matching

### 3.1 - Intro to regular expressions

An example: `r'st\d\s\w{3,10}'`
- Normal characters match themselves `st`
- Metacharacters represent types of characters (`\d`, `\s`, `\w`) or ideas (`{3,10}`)

* The `re` module
 + Find all matches of a pattern: `re.findall(r"regex", string)`
 + Split string at each match: `re.split(r"regex", string)`
 + Replace one or many matches with a string: `re.sub(r"regex", new, string)`
 + Supported metacharacters

| Metacharacter | Meaning |
|---|---|
| `\d` | Digit |
| `\D` | Non-digit |
| `\w` | Word character (letter, digit, or underscore) |
| `\W` | Non-word character |
| `\s` | Whitespace |
| `\S` | Non-whitespace |
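
As a quick sketch of how the introductory pattern `st\d\s\w{3,10}` behaves, the made-up sentence below contains two fragments that satisfy it and one (`st2 al`) that does not, because `al` has fewer than three word characters:

```python
import re

# Made-up sample text, used only to illustrate the pattern from the intro
sample = "Players: st4 maria, st9 bob, st2 al"

intro_matches = re.findall(r"st\d\s\w{3,10}", sample)
print(intro_matches)  # ['st4 maria', 'st9 bob']
```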

1. Find all matches of a pattern:

In [57]:
import re

re.findall(r"#movies", "Love #movies! I had fun yesterday going to the #movies")

['#movies', '#movies']

2. Split string at each match:

In [58]:
re.split(r"!", "Nice Place to eat! I'll come back! Excellent meat!")

['Nice Place to eat', " I'll come back", ' Excellent meat', '']

3. Replace one or many matches with a string:

In [59]:
re.sub(r"yellow", "nice", "I have a yellow car and a yellow house in a yellow neighborhood")

'I have a nice car and a nice house in a nice neighborhood'

4. Metacharacters

In [60]:
# digit
re.findall(r"User\d", "The winners are: User9, UserN, User8")

['User9', 'User8']

In [61]:
# non-digit
re.findall(r"User\D", "The winners are: User9, UserN, User8")

['UserN']

In [62]:
# word
re.findall(r"User\w", "The winners are: User9, UserN, User8")

['User9', 'UserN', 'User8']

In [63]:
# non-word
re.findall(r"\W\d", "This skirt is on sale, only $5 today!")

['$5']

In [64]:
# whitespace
re.findall(r"Data\sScience", "I enjoy learning Data Science")

['Data Science']

In [65]:
# non-whitespace
re.sub(r"ice\Scream", "ice cream", "I really like ice-cream")

'I really like ice cream'

Examples:

In [66]:
sentiment_analysis = '@robot9! @robot4& I have a good feeling\
that the show isgoing to be amazing! @robot9$ @robot7%'

regex = r"@robot\d\W"
print(re.findall(regex, sentiment_analysis))

['@robot9!', '@robot4&', '@robot9$', '@robot7%']


In [67]:
sentiment_analysis = 'He#newHis%newTin love with$newPscrappy.\
#8break%He is&newYmissing him@newLalready'

regex_sentence = r"\W\dbreak\W"
sentiment_sub = re.sub(regex_sentence, " ", sentiment_analysis)

regex_words = r"\Wnew\w"
sentiment_final = re.sub(regex_words, " ", sentiment_sub)

print(sentiment_final)

He is in love with scrappy. He is missing him already


### 3.2 - Repetitions
* Repeated characters using `{Number}`
* Quantifiers:
 + Once or more: `+`
 + Zero times or more: `*`
 + Zero times or once: `?`
 + n times at least, m times at most: `{n,m}` (written without a space after the comma)
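
A compact way to compare the quantifiers listed above is to apply each of them to the same text. The sample strings here are made up for illustration:

```python
import re

short_text = "g go goo"
long_text = "g go goo gooo goooo"

once_or_more = re.findall(r"go+", short_text)      # ['go', 'goo']
zero_or_more = re.findall(r"go*", short_text)      # ['g', 'go', 'goo']
zero_or_once = re.findall(r"go?", short_text)      # ['g', 'go', 'go']
two_to_three = re.findall(r"go{2,3}", long_text)   # ['goo', 'gooo', 'gooo']

print(once_or_more, zero_or_more, zero_or_once, two_to_three, sep="\n")
```

In the last call, `goooo` only contributes `gooo` because `{2,3}` stops after three repetitions.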

In [68]:
import re

password = "password1234"
re.search(r"\w{8}\d{4}", password)

<re.Match object; span=(0, 12), match='password1234'>

In [69]:
text = "Date of start: 4-3. Date of registration: 10-04."
re.findall(r"\d+-\d+", text)

['4-3', '10-04']

In [70]:
my_string = "The concert was amazing! @ameli!a @joh&&n @mary90"
re.findall(r"@\w+\W*\w+", my_string)

['@ameli!a', '@joh&&n', '@mary90']

In [71]:
text = "The color of this image is amazing. However, the colour blue could be brighter."
re.findall(r"colou?r", text)

['color', 'colour']

In [72]:
phone_number = "John: 1-966-847-3131 Michelle: 54-908-42-42424"
re.findall(r"\d{1,2}-\d{3}-\d{2,3}-\d{4,}", phone_number)

['1-966-847-3131', '54-908-42-42424']

### 3.3 - Regex metacharacters

* Looking for patterns: `re.search()` scans the whole string for the first match, while `re.match()` only matches at the beginning of the string
* Special characters
 + Match any character (except newline): `.`
 + Start of the string: `^`
 + End of the string: `$`
 + Escape special characters: `\`
* OR operators:
 + Character: `|`
 + Set of characters: `[]`
   + `^` as the first character inside `[]` negates the set (i.e., it matches any character that is not listed)

1. Looking for patterns

In [74]:
print(
    re.search(r"\d{4}", "4506 people attend the show"),
    re.match(r"\d{4}", "4506 people attend the show"),
    "\n",
    re.search(r"\d+", "Yesterday, I saw 3 shows"),
    re.match(r"\d+","Yesterday, I saw 3 shows"),
    sep = "\n"
)

<re.Match object; span=(0, 4), match='4506'>
<re.Match object; span=(0, 4), match='4506'>


<re.Match object; span=(17, 18), match='3'>
None


2.1 Match any character (except newline):

In [75]:
my_links = "Just check out this link: www.amazingpics.com. It has amazing photos!"
re.findall(r"www.+com", my_links)

['www.amazingpics.com']

2.2 Start of the string

In [76]:
my_string = "the 80s music was much better that the 90s"
print(
    re.findall(r"the\s\d+s", my_string),
    re.findall(r"^the\s\d+s", my_string),
    sep = "\n"
)

['the 80s', 'the 90s']
['the 80s']


2.3 End of the string

In [77]:
my_string = "the 80s music hits were much better that the 90s"
re.findall(r"the\s\d+s$", my_string)

['the 90s']

2.4 Escape special characters:

In [78]:
my_string = "I love the music of Mr.Go. However, the sound was too loud."
print(
    re.split(r".\s", my_string),
    re.split(r"\.\s", my_string),
    sep = "\n"
)

['', 'lov', 'th', 'musi', 'o', 'Mr.Go', 'However', 'th', 'soun', 'wa', 'to', 'loud.']
['I love the music of Mr.Go', 'However, the sound was too loud.']


3.1 OR operator, character:

In [79]:
my_string = "Elephants are the world's largest land animal!\
I would love to see an elephant one day"
re.findall(r"Elephant|elephant", my_string)

['Elephant', 'elephant']

3.2 OR operator, set of characters:

In [80]:
my_string = "Yesterday I spent my afternoon with my friends: MaryJohn2 Clary3"
re.findall(r"[a-zA-Z]+\d", my_string)

['MaryJohn2', 'Clary3']

In [81]:
my_string = "My&name&is#John Smith. I%live$in#London."
re.sub(r"[#$%&]", " ", my_string)

'My name is John Smith. I live in London.'

In [82]:
my_links = "Bad website: www.99.com. Favorite site: www.hola.com"
re.findall(r"www[^0-9]+com", my_links)

['www.hola.com']

Example:

The company puts some rules in place to verify that the given email address is valid:

* The first part can contain:
 + Upper `A-Z` and lowercase letters `a-z`
 + Numbers
 + Characters: `!`, `#`, `%`, `&`, `*`, `$`, `.`
* Must have `@`
* Domain:
 + Can contain any word characters
 + But only `.com` ending is allowed

In [83]:
emails = ['n.john.smith@gmail.com', '87victory@hotmail.com', '!#mary-=@msca.net']

regex = r"[a-zA-Z0-9!#%&\*\$\.]+@\w+\.com"

for example in emails:
    if re.match(regex, example):
        print("The email {email_example} is a valid email".format(email_example=example))
    else:
        print("The email {email_example} is invalid".format(email_example=example))

The email n.john.smith@gmail.com is a valid email
The email 87victory@hotmail.com is a valid email
The email !#mary-=@msca.net is invalid
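
Note that `re.match()` only anchors the pattern at the start of the string, so the regex above accepts any address that merely begins with a valid one. A minimal sketch, assuming the rule "only `.com` ending is allowed" should apply to the whole address, is to use `re.fullmatch()` instead. The address below is made up:

```python
import re

regex = r"[a-zA-Z0-9!#%&\*\$\.]+@\w+\.com"
candidate = "john@gmail.community"  # hypothetical address that should be rejected

prefix_match = re.match(regex, candidate)      # matches the prefix only
whole_match = re.fullmatch(regex, candidate)   # requires the entire string to match

print(prefix_match.group(0))  # john@gmail.com
print(whole_match)            # None
```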


### 3.4 Greedy vs. non-greedy matching

* Standard quantifiers are greedy by default: `*`, `+`, `?`, `{num,num}`
* **Greedy**: match as many characters as possible, returns the longest match
* **Lazy**: match as few characters as needed, returns the shortest match - append `?` to greedy quantifiers

In [84]:
import re

print(
    re.match(r"\d+", "12345bcada"),
    re.match(r"\d+?", "12345bcada"),
    "\n",
    re.match(r".*hello", "xhelloxxxxxx"),
    re.match(r".*?hello", "xhelloxxxxxx"),
    sep = "\n"
)

<re.Match object; span=(0, 5), match='12345'>
<re.Match object; span=(0, 1), match='1'>


<re.Match object; span=(0, 6), match='xhello'>
<re.Match object; span=(0, 6), match='xhello'>
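
In the `hello` example both versions return the same match, so the difference is easier to see on text where the closing character appears several times. A small sketch with a made-up HTML snippet:

```python
import re

snippet = "<b>bold</b> and <i>italic</i>"  # made-up sample text

greedy_tags = re.findall(r"<.+>", snippet)   # runs to the last ">"
lazy_tags = re.findall(r"<.+?>", snippet)    # stops at the first ">" each time

print(greedy_tags)  # ['<b>bold</b> and <i>italic</i>']
print(lazy_tags)    # ['<b>', '</b>', '<i>', '</i>']
```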


## Chapter 4 - Advanced Regular Expression Concepts

Contents:
1. Capturing groups
2. Alternation and non-capturing groups
3. Backreferences
4. Lookaround

### 4.1 - Grouping and capturing 

* Use `()` to group and capture characters together
* Capturing groups
 + Organize the data
 + A quantifier applies only to the token immediately to its left - in `r"apple+"`, `+` applies to `e` and not to `apple`
 + Put the pattern in a group to apply a quantifier to the entire group
 + Capture a repeated group `(\d+)` vs. repeat a capturing group `(\d)+`

In [88]:
text = "Clary has 2 friends who she spends a lot of time with.\
Susan has 3 brothers while John has 4 sisters."

re.findall(r'[A-Za-z]+\s\w+\s\d+\s\w+', text)

['Clary has 2 friends', 'Susan has 3 brothers', 'John has 4 sisters']

In [89]:
re.findall(r'([A-Za-z]+)\s\w+\s\d+\s\w+', text)

['Clary', 'Susan', 'John']

In [90]:
re.findall(r'([A-Za-z]+)\s\w+\s(\d+)\s(\w+)', text)

[('Clary', '2', 'friends'),
 ('Susan', '3', 'brothers'),
 ('John', '4', 'sisters')]

2.1 Organize the data

In [91]:
pets = re.findall(r'([A-Za-z]+)\s\w+\s(\d+)\s(\w+)', "Clary has 2 dogs but John has 3 cats")
pets[0][0]

'Clary'

2.3 Apply a quantifier to the entire group

In [92]:
re.search(r"(\d[A-Za-z])+", "My user name is 3e4r5fg")

<re.Match object; span=(16, 22), match='3e4r5f'>

2.4 Capture a repeated group `(\d+)` vs. repeat a capturing group `(\d)+`

In [93]:
my_string = "My lucky numbers are 8755 and 33"

print(
    re.findall(r"(\d)+", my_string),
    re.findall(r"(\d+)", my_string),
    sep = "\n"
)

['5', '3']
['8755', '33']


Example 1:

You want to extract the first part of each email address. E.g. if you have the email `marysmith90@gmail.com`, you are only interested in `marysmith90`.
You need to match the entire address so that you only extract names that belong to emails. Also, you are only interested in names containing uppercase letters (e.g. A, B, Z), lowercase letters (e.g. a, d, z), and numbers.

In [94]:
sentiment_analysis = [
    'Just got ur newsletter, those fares really are unbelievable.\
    Write to statravelAU@gmail.com or statravelpo@hotmail.com.They have amazing prices',\
    'I should have paid more attention when we covered photoshop\
    in my webpage design class in undergrad. Contact me Hollywoodheat34@msn.net.',\
    'hey missed ya at the meeting. Read your email! msdrama098@hotmail.com'
]

regex_email = r"([a-zA-Z0-9]+)@\S+"

for tweet in sentiment_analysis:
    email_matched = re.findall(regex_email, tweet)
    print("Lists of users found in this tweet: {}".format(email_matched))

Lists of users found in this tweet: ['statravelAU', 'statravelpo']
Lists of users found in this tweet: ['Hollywoodheat34']
Lists of users found in this tweet: ['msdrama098']


Example 2:

You need to extract the information about the flight:
- The two letters indicate the airline (e.g. `LA`).
- The four digits are the flight number (e.g. `4214`).
- The three letters correspond to the departure (e.g. `AER`).
- The destination (`CDB`).
- The date (`06NOV`) of the flight.

In [96]:
import re

flight = 'Subject: You are now ready to fly.\
Here you have your boarding pass IB3723 AMS-MAD 06OCT'

regex = r"([A-Z]{2})(\d{4})\s([A-Z]{3})-([A-Z]{3})\s(\d{2}[A-Z]{3})"

flight_matches = re.findall(regex, flight)
    
print("Airline: {}, Flight number: {}".format(flight_matches[0][0], flight_matches[0][1]))
print("Departure: {}, Destination: {}".format(flight_matches[0][2], flight_matches[0][3]))
print("Date: {}".format(flight_matches[0][4]))

Airline: IB, Flight number: 3723
Departure: AMS, Destination: MAD
Date: 06OCT


### 4.2 - Alternation and non-capturing groups

* Pipe `|`
* Alternation, use groups to choose between optional patterns
* Non-capturing groups:
 + When group is not backreferenced
 + Add `?:` : `(?:regex)`, match but not capture a group

1. Pipe

In [97]:
my_string = "I want to have a pet. But I don't know if I want a cat, a dog or a bird."
re.findall(r"cat|dog|bird", my_string)

['cat', 'dog', 'bird']

2. Alternation

In [99]:
my_string = "I want to have a pet. But I don't know if I want 2 cats, 1 dog or a bird."
re.findall(r"\d+\s(cat|dog|bird)", my_string)

['cat', 'dog']

In [100]:
re.findall(r"(\d)+\s(cat|dog|bird)", my_string)

[('2', 'cat'), ('1', 'dog')]

3. Non-capturing groups

In [101]:
my_string = "John Smith: 34-34-34-042-980, Rebeca Smith: 10-10-10-434-425"
re.findall(r"(?:\d{2}-){3}(\d{3}-\d{3})", my_string)

['042-980', '434-425']

In [102]:
my_date = "Today is 23rd May 2019. Tomorrow is 24th May 19."
re.findall(r"(\d+)(?:th|rd)", my_date)

['23', '24']

Example 1:

Match and capture the verb and the object.

In [103]:
sentiment_analysis = [
    'I totally love the concert The Book of Souls World Tour. It kinda amazing!',\
    'I enjoy the movie Wreck-It Ralph. I watched with my boyfriend.',\
    "I still like the movie Wish Upon a Star. Too bad Disney doesn't show it anymore."
]

regex_positive = r"(love|like|enjoy).+?(movie|concert)\s(.+?)\."

for tweet in sentiment_analysis:
    positive_matches = re.findall(regex_positive, tweet)
    print("Positive comments found {}".format(positive_matches))

Positive comments found [('love', 'concert', 'The Book of Souls World Tour')]
Positive comments found [('enjoy', 'movie', 'Wreck-It Ralph')]
Positive comments found [('like', 'movie', 'Wish Upon a Star')]


Example 2:

Match the verb and the object but only capture the verb.

In [104]:
sentiment_analysis = [
    'That was horrible! I really dislike the movie The cabin and the ant. So boring.',\
    "I disapprove the movie Honest with you. It's full of cliches.",\
    'I dislike very much the concert After twelve Tour. The sound was horrible.'
]

regex_negative = r"(hate|dislike|disapprove).+?(?:movie|concert)\s(.+?)\."

for tweet in sentiment_analysis:
    negative_matches = re.findall(regex_negative, tweet)
    print("Negative comments found {}".format(negative_matches))

Negative comments found [('dislike', 'The cabin and the ant')]
Negative comments found [('disapprove', 'Honest with you')]
Negative comments found [('dislike', 'After twelve Tour')]


### 4.3 - Backreferences

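* Numbered groups: retrieve a captured group by its position - `.group(n)` (`.group(0)` is the whole match)
* Named groups: give a name to groups - `(?P<name>regex)`
* Using numbered (`\1`) or named (`(?P=name)`) capturing groups to reference back to a group

4.3.1 Numbered and named groups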

In [105]:
text = "Python 3.0 was released on 12-03-2008."
information = re.search(r'(\d{1,2})-(\d{2})-(\d{4})', text)

print(
    information.group(3),
    information.group(0),
    sep = "\n"
)

2008
12-03-2008


In [106]:
text = "Austin, 78701"
cities = re.search(r"(?P<city>[A-Za-z]+).*?(?P<zipcode>\d{5})", text)

print(
    cities.group("city"),
    cities.group("zipcode"),
    sep = "\n"
)

Austin
78701


4.3.2 Using numbered capturing groups to reference back to a group

`\1` matches the same text that group 1 captured, i.e. the captured text must appear ONCE more

In [107]:
sentence = "I wish you a happy happy birthday!"

print(
    re.findall(r"(\w+)\s\1", sentence),
    re.sub(r"(\w+)\s\1", r"\1", sentence),
    sep = "\n"
)

['happy']
I wish you a happy birthday!


In [108]:
sentence = "Your new code number is 23434. Please, enter 23434 to open the door."
re.findall(r"(?P<code>\d{5}).*?(?P=code)", sentence)

['23434']

In [109]:
sentence = "This app is not working! It's repeating the last word word."
re.sub(r"(?P<word>\w+)\s(?P=word)", r"\g<word>", sentence)

"This app is not working! It's repeating the last word."

Example 1:

The dates appear as `Signed on 05/24/2016` (`05` indicating the month, `24` the day). You decide to use capturing groups to extract this information. Also, you would like to retrieve that information so you can store it separately in different variables.

In [110]:
contract = 'Provider will invoice Client for Services performed within 30 days of\
performance.  Client will pay Provider as set forth in each Statement of Work within\
30 days of receipt and acceptance of such invoice. It is understood that payments to\
Provider for services rendered shall be made in full as agreed, without any deductions\
for taxes of any kind whatsoever, in conformity with Provider’s status as an independent\
contractor. Signed on 03/25/2001.'

regex_dates = r"Signed\son\s(\d{2})/(\d{2})/(\d{4})"
dates = re.search(regex_dates, contract)

signature = {
    "day": dates.group(2),
    "month": dates.group(1),
    "year": dates.group(3)
}

print("Our first contract is dated back to {data[year]}.\
Particularly, the day {data[day]} of the month {data[month]}.".format(data=signature))

Our first contract is dated back to 2001.Particularly, the day 25 of the month 03.
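
As an alternative sketch, named groups can produce the dictionary directly through `.groupdict()`, which avoids mixing up the group numbers for day and month. The shortened `contract_ending` string below stands in for the full contract text:

```python
import re

# Shortened stand-in for the contract text above
contract_ending = "contractor. Signed on 03/25/2001."

regex_named_dates = r"Signed\son\s(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})"
named_dates = re.search(regex_named_dates, contract_ending)

# groupdict() maps each group name to the text it captured
named_signature = named_dates.groupdict()
print(named_signature)  # {'month': '03', 'day': '25', 'year': '2001'}
```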


Example 2:

You have an example of a string containing HTML tags:

`<title>The Data Science Company</title>`

You learn that an opening HTML tag is always at the beginning of the string. It appears inside `<>`. A closing tag also appears inside `<>`, but it is preceded by `/`.

You also remember that capturing groups can be referenced using numbers, e.g `\4`.

In [112]:
html_tags = ['<body>Welcome to our course! It would be an awesome experience</body>',\
             '<article>To be a data scientist, you need to have knowledge in statistics\
             and mathematics</article>', '<nav>About me Links Contact me!'
]

for string in html_tags:
    # Complete the regex and find if it matches a closed HTML tag
    match_tag =  re.match(r"<(\w+)>.*?</\1>", string)
 
    if match_tag:
        print("Your tag '{}' is closed".format(match_tag.group(1))) 
    else:
        notmatch_tag = re.match(r"<(\w+)>", string)
        print("Close your '{}' tag!".format(notmatch_tag.group(1)))

Your tag 'body' is closed
Your tag 'article' is closed
Close your 'nav' tag!


Example 3:

To find a match for `Awesoooome`, you first need to match `Awes`. Then capture `o`, reference the same character back one or more times, and finally match `me`.

In [113]:
sentiment_analysis = [
    '@marykatherine_q i know! I heard it this morning and wondered the same thing.\
    Moscooooooow is so behind the times', 'Staying at a friends house...neighborrrrrrrs\
    are so loud-having a party', 'Just woke up an already have read some e-mail'
]

regex_elongated = r"\w+(\w)\1+\w*"

for tweet in sentiment_analysis:
    # Find the first word that contains a repeated (elongated) character
    match_elongated = re.search(regex_elongated, tweet)

    if match_elongated:
        # group(0) returns the whole match, i.e. the elongated word
        elongated_word = match_elongated.group(0)

        print("Elongated word found: {word}".format(word=elongated_word))
    else:
        print("No elongated word found")

Elongated word found: Moscooooooow
Elongated word found: neighborrrrrrrs
No elongated word found


### 4.4 - Lookaround

* Look-ahead: Checks that the first part of the expression is (not) followed by the lookahead expression
 + Positive look-ahead - `(?=regex)`
 + Negative look-ahead - `(?!regex)`
* Look-behind: Get all the matches that are (not) preceded by a specific pattern.
 + Positive look-behind - `(?<=regex)`
 + Negative look-behind - `(?<!regex)`

4.4.1 Positive look-ahead, `(?=regex)`

Find all text files followed by `transferred`

In [114]:
my_text = "tweets.txt transferred, mypass.txt transferred, keywords.txt error"
re.findall(r"\w+\.txt(?=\stransferred)", my_text)

['tweets.txt', 'mypass.txt']

4.4.2 Negative look-ahead, `(?!regex)`

Find all text files NOT followed by `transferred`

In [115]:
my_text = "tweets.txt transferred, mypass.txt transferred, keywords.txt error"
re.findall(r"\w+\.txt(?!\stransferred)", my_text)

['keywords.txt']

4.4.3 Positive look-behind, `(?<=regex)`

Find all words after `Member`

In [117]:
my_text = "Member: Angus Young, Member: Chris Slade, Past: Malcolm Young, Past: Cliff Williams."
re.findall(r"(?<=Member:\s)\w+\s\w+", my_text)

['Angus Young', 'Chris Slade']

4.4.4 Negative look-behind, `(?<!regex)`

Find all words NOT after `brown`

In [116]:
my_text = "My white cat sat at the table. However, my brown dog was lying on the couch."
re.findall(r"(?<!brown\s)(cat|dog)", my_text)

['cat']

Example 1:

1. Get all the words that are **followed** by the word `python` in `sentiment_analysis`.
2. Get all the words that are **preceded** by the word `python` or `Python` in `sentiment_analysis`. 

In [118]:
sentiment_analysis = 'You need excellent python skills to be a data scientist.\
Must be! Excellent python'

look_ahead = re.findall(r"\w+(?=\spython)", sentiment_analysis)
look_behind = re.findall(r"(?<=[pP]ython\s)\w+", sentiment_analysis)

print(look_ahead, look_behind, sep = "\n")

['excellent', 'Excellent']
['skills']


Example 2:

The phone numbers in the list have the structure:

- Optional area code: 3 digits
- Prefix: 4 digits
- Line number: 6 digits
- Optional extension: 2 digits

E.g. `654-8764-439434-01`

1. Get all cell phone numbers that are not preceded by the optional area code.
2. Get all cell phone numbers that are not followed by the optional extension.

In [121]:
cellphones = ['4564-646464-01', '345-5785-544245', '6476-579052-01']

for phone in cellphones:
    # Get all phone numbers not preceded by area code
    number = re.findall(r"(?<!\d{3}-)\d{4}-\d{6}-\d{2}", phone)
    print(number)

['4564-646464-01']
[]
['6476-579052-01']


In [120]:
for phone in cellphones:
    # Get all phone numbers not followed by optional extension
    number = re.findall(r"\d{3}-\d{4}-\d{6}(?!-\d{2})", phone)
    print(number)

[]
['345-5785-544245']
[]
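
Both lookarounds can also be combined in one pattern. The sketch below assumes we want numbers that have neither an area code nor an extension. The third entry in the list is a made-up number added for illustration:

```python
import re

# The last number is hypothetical: no area code and no extension
phones_to_check = ["4564-646464-01", "345-5785-544245", "4564-646464"]

# Not preceded by "ddd-" and not followed by "-dd"
bare_pattern = r"(?<!\d{3}-)\d{4}-\d{6}(?!-\d{2})"

bare_numbers = [re.findall(bare_pattern, phone) for phone in phones_to_check]
print(bare_numbers)  # [[], [], ['4564-646464']]
```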
