# Regular Expressions in Python

### Course Description
As a data scientist, you will encounter many situations where you will need to extract key information from huge corpora of text, clean messy data containing strings, or detect and match patterns to find useful words. All of these situations are part of text mining and are an important step before applying machine learning algorithms. This course will take you through understanding compelling concepts about string manipulation and regular expressions. You will learn how to split strings, join them back together, interpolate them, as well as detect, extract, replace, and match strings using regular expressions. On the journey to master these skills, you will work with datasets containing movie reviews or streamed tweets that can be used to determine opinion, as well as with raw text scraped from the web.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime

# 2. Formatting Strings

Following your journey, you will learn the main approaches that can be used to format or interpolate strings in Python using a dataset containing information scraped from the web. You will explore the advantages and disadvantages of using positional formatting, embedding expressions inside string constants, and using the Template class.

In [2]:
wikipedia_article = 'In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and animals.'
my_list = []

### Put it in order!
The text of one article has already been saved in the variable wikipedia_article. Also, the empty list my_list is already defined. You can use print() to view the variable in the IPython Shell.

#### Instructions
- Assign the substrings going from the 4th to the 19th character, and from the 22nd to the 44th character of wikipedia_article to the variables first_pos and second_pos, respectively. Adjust the strings so they are lowercase.
- Define a string with the text "The tool is used in" adding placeholders after the word tool and the word in for future positional formatting. Append it to the list my_list.
- Define a string with the text "The tool is used in" adding placeholders after the word tool and in but reorder them so the second argument passed to the method will replace the first placeholder. Append to the list my_list.
- Complete the for-loop so that it uses the .format() method and the variables first_pos and second_pos to print out every string in my_list.

In [3]:
# Assign the substrings to the variables
first_pos = wikipedia_article[3:19].lower()
second_pos = wikipedia_article[21:44].lower()

# Define string with placeholders 
my_list.append("The tool {} is used in {}")

# Define string with rearranged placeholders
my_list.append("The tool {1} is used in {0}")

# Use format to print strings
for my_string in my_list:
  print(my_string.format(first_pos, second_pos))

The tool computer science is used in artificial intelligence
The tool artificial intelligence is used in computer science


In [4]:
courses=  ['artificial intelligence', 'neural networks']

### Calling by its name
First, you want to try doing this with just one example as a proof of concept. You use positional formatting and named placeholders to call the variables in a dictionary.

The variable courses containing one tool and one field name has been saved. You can use print(courses) to view the variable in the IPython Shell.

#### Instructions 

- Create a dictionary assigning the first and second element appearing in the list courses to the keys "field" and "tool" respectively.
- Complete the placeholders so that they access the elements linked with the keys field and tool in the dictionary data.
- Print out the resulting message using the .format() method, passing the plan dictionary to replace the data placeholders.

In [5]:
# Create a dictionary
plan = {
        "field": courses[0],
        "tool": courses[1]
        }

# Complete the placeholders accessing elements of field and tool keys
my_message = "If you are interested in {data[field]}, you can take the course related to {data[tool]}"

# Use dictionary to replace placeholders
print(my_message.format(data = plan))

If you are interested in artificial intelligence, you can take the course related to neural networks
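
As a complementary sketch, the same dictionary can also be unpacked into keyword arguments so that each key acts as a named placeholder. The values mirror the `courses` list above; the names `plan_example`, `by_lookup`, and `by_unpack` are illustrative only.

```python
# Same data as the exercise, redefined so the block runs on its own
plan_example = {"field": "artificial intelligence", "tool": "neural networks"}

# Index lookup inside the placeholder: the key is written without quotes
by_lookup = "Interested in {data[field]}? Try {data[tool]}.".format(data=plan_example)

# Unpacking the dictionary lets each key act as a named placeholder
by_unpack = "Interested in {field}? Try {tool}.".format(**plan_example)

print(by_lookup)
print(by_unpack)
```

Both calls should print the same sentence; which form reads better depends on whether the dictionary is passed around as a single object.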


### What day is today?
It's lunch time and you are talking with some of your colleagues. They comment that they feel that every morning someone should send them a reminder of what day it is so they can check in the calendar what their assignments are for that day.

You want to help out and decide to write a small script that takes the date and time of the day so that every morning, a message is sent to your colleagues. You can use the module datetime along with named placeholders to achieve your goal.

The date should be expressed as Month day, year, e.g. April 16, 2019 and the time as hh:mm, e.g. 16:30.

You write down some specifiers to help you: %d (day), %B (month name), %m (month number), %Y (year), %H (hour), and %M (minutes).

You can use the IPython Shell to explore the module datetime.

#### Instructions

- Import the datetime class from the datetime module.
- Obtain the date of today and assign it to the variable get_date.
- Complete the string message by adding placeholders named today and the format specifiers to format the date as month_name day, year and time as hour:minutes.
- Print the message using the .format() method and the variable get_date to replace the named placeholders.

In [6]:
# Import datetime 
from datetime import datetime

# Assign date to get_date
get_date = datetime.now()

# Add named placeholders with format specifiers
message = "Good morning. Today is {today:%B %d, %Y}. It's {today:%H:%M} ... time to work!"

# Format date
print(message.format(today=get_date))

Good morning. Today is April 29, 2020. It's 21:42 ... time to work!
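
Because `datetime.now()` changes on every run, a small sketch with a fixed moment makes the effect of each specifier easier to check. The date below is taken from the example in the exercise text (April 16, 2019 at 16:30); the variable names are assumptions for illustration.

```python
from datetime import datetime

# A fixed moment keeps the output reproducible
fixed_moment = datetime(2019, 4, 16, 16, 30)

# The text after the colon is handed to the datetime object as a strftime pattern
template = "Today is {today:%B %d, %Y}. It's {today:%H:%M}."
formatted = template.format(today=fixed_moment)
print(formatted)
```

Note that `%B` follows the active locale, so the month name may differ on systems configured for another language.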


### Literally formatting
While analyzing the text from Wikipedia pages, you read that Python 3.6 introduced f-strings.

You remember that you've created a website that displayed data science facts but it was too slow. You think that it could be due to the string formatting you used. Because f-strings are very fast and easy to use, you decide to rewrite that project.

The variables field1, field2 and field3 containing character strings as well as the numeric variables fact1, fact2, fact3 and fact4 have been saved.

If you want to explore the variables, you can use print() to view them in the IPython Shell.

#### Instructions 

- Complete the f-string to include the variable field1 with quotes and the variable fact1 as a digit.
- Complete the f-string to include the variable fact2 using exponential notation, and the variable field2.
- Complete the f-string to include field3 together with fact3 rounded to 2 decimals, and fact4 rounded to one decimal.



In [7]:
field1 = 'sexiest job' 
field2 = 'data is produced daily' 
field3 = 'Individuals' 
fact1 = 21 
fact2 = 2500000000000000000 
fact3 = 72.41415415151 
fact4 = 1.09 

In [8]:
# Complete the f-string
print(f"Data science is considered {field1!r} in the {fact1}st century")

# Complete the f-string
print(f"About {fact2:e} of {field2} in the world")

# Complete the f-string
print(f"{field3} create around {fact3:.2f}% of the data but only {fact4:.1f}% is analyzed")

Data science is considered 'sexiest job' in the 21st century
About 2.500000e+18 of data is produced daily in the world
Individuals create around 72.41% of the data but only 1.1% is analyzed
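
The conversion and format specifiers used above can be looked at one at a time. This is a minimal sketch that reuses the exercise values under hypothetical variable names.

```python
label = "sexiest job"
big_number = 2500000000000000000
ratio = 72.41415415151

quoted = f"{label!r}"           # !r applies repr(), which adds the quotes
scientific = f"{big_number:e}"  # e selects exponential notation
two_decimals = f"{ratio:.2f}"   # .2f rounds to two decimal places

print(quoted, scientific, two_decimals)
```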


In [9]:
number1 = 120

number2 = 7

string1 = 'httpswww.datacamp.com'

list_links = ['www.news.com',
 'www.google.com',
 'www.yahoo.com',
 'www.bbc.com',
 'www.msn.com',
 'www.facebook.com',
 'www.news.google.com']

### Make this function

In [10]:
# Include both variables and the result of dividing them 
print(f"{number1} tweets were downloaded in {number2} minutes indicating a speed of {number1/number2:.1f} tweets per min")

# Replace the substring https by an empty string
print(f"{string1.replace('https', '')}")

# Express the length of the list as a percentage of 120 posts, rounded to two decimals
print(f"Only {len(list_links)*100/120:.2f}% of the posts contain links")

120 tweets were downloaded in 7 minutes indicating a speed of 17.1 tweets per min
www.datacamp.com
Only 5.83% of the posts contain links


### On time

In [11]:
east = {'date': datetime(2007, 4, 20, 0, 0), 'price': 1232443}

west = {'date': datetime(2006, 5, 26, 0, 0), 'price': 1432673}

In [12]:
# Access values of date and price in east dictionary
print(f"The price for a house in the east neighborhood was ${east['price']} in {east['date']:%m-%d-%Y}")

# Access values of date and price in west dictionary
print(f"The price for a house in the west neighborhood was ${west['price']} in {west['date']:%m-%d-%Y}.")

The price for a house in the east neighborhood was $1232443 in 04-20-2007
The price for a house in the west neighborhood was $1432673 in 05-26-2006.


### Substitute

In [13]:
tool1 = 'Natural Language Toolkit'

tool2 = 'TextBlob'

tool3 = 'Gensim'

description1 = 'suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.'

description2 = 'Python library for processing textual data. It provides a simple API for diving into common natural language processing tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.'

description3 = 'robust open-source vector space modeling and topic modeling toolkit implemented in Python. It uses NumPy, SciPy and optionally Cython for performance. Gensim is specifically designed to handle large text collections, using data streaming and efficient incremental algorithms, which differentiates it from most other scientific software packages that only target batch and in-memory processing.'

In [14]:
# Import Template
from string import Template

# Create a template
wikipedia = Template("$tool is a $description")

# Substitute variables in template
print(wikipedia.substitute(tool=tool1, description=description1))
print(wikipedia.substitute(tool=tool2, description=description2))
print(wikipedia.substitute(tool=tool3, description=description3))

Natural Language Toolkit is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.
TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python. It uses NumPy, SciPy and optionally Cython for performance. Gensim is specifically designed to handle large text collections, using data streaming and efficient incremental algorithms, which differentiates it from most other scientific software packages that only target batch and in-memory processing.


### Identifying prices

In [15]:
tools =  ['Natural Language Toolkit', '20', 'month']

In [16]:
# Import template
from string import Template

# Select variables
our_tool = tools[0]
our_fee = tools[1]
our_pay = tools[2]

# Create template
course = Template("We are offering a 3-month beginner course on $tool just for $$ $fee ${pay}ly")

# Substitute identifiers with three variables
print(course.substitute(tool=our_tool, fee=our_fee, pay=our_pay))

We are offering a 3-month beginner course on Natural Language Toolkit just for $ 20 monthly
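
The template above relies on two details of `Template` syntax: `$$` produces a literal dollar sign, and braces mark where an identifier ends. The sketch below isolates both; the template text and values are made up for illustration.

```python
from string import Template

# $$ is an escaped dollar sign; ${pay} ends the identifier before "ly"
pricing = Template("Only $$$fee ${pay}ly")
rendered = pricing.substitute(fee="20", pay="month")
print(rendered)

# Without braces the identifier would be read as "payly", which is not supplied
try:
    Template("$payly").substitute(pay="month")
    braces_needed = False
except KeyError:
    braces_needed = True
print(braces_needed)
```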


### Playing safe

In [17]:
answers =  {'answer1': 'I really like the app. But there are some features that can be improved'}

In [18]:
# Complete template string using identifiers
the_answers = Template("Check your answer 1: $answer1, and your answer 2: $answer2")

# Use substitute to replace identifiers

try:
    print(the_answers.substitute(answers))
except KeyError:
    print("Missing information")

# Use safe_substitute to replace identifiers
try:
    print(the_answers.safe_substitute(answers))
except KeyError:
    print("Missing information")

Missing information
Check your answer 1: I really like the app. But there are some features that can be improved, and your answer 2: $answer2
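
The difference between the two methods can be sketched in isolation. The template and dictionary below are simplified stand-ins for the exercise data.

```python
from string import Template

form = Template("Answer 1: $answer1, answer 2: $answer2")
partial = {"answer1": "yes"}

# substitute() insists on every identifier being present
try:
    form.substitute(partial)
    strict_failed = False
except KeyError:
    strict_failed = True

# safe_substitute() leaves unknown identifiers untouched instead of raising
lenient = form.safe_substitute(partial)
print(strict_failed)
print(lenient)
```

Since `safe_substitute()` does not raise `KeyError` for missing identifiers, the `except` branch around it in the cell above is never reached; it is harmless but not required.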


# 3. Regular Expressions for Pattern Matching

Time to discover the fundamental concepts of regular expressions! In this key chapter, you will learn to understand the basic concepts of regular expression syntax. Using a real dataset with tweets meant for sentiment analysis, you will learn how to apply pattern matching using normal and special characters, and greedy and lazy quantifiers.

### Are they bots?
The company that you are working for asked you to perform a sentiment analysis using a dataset with tweets. First of all, you need to do some cleaning and extract some information.
While printing out some text, you realize that some tweets contain user mentions. Some of these mentions follow a very strange pattern. A few examples that you notice: @robot3!, @robot5& and @robot7#

To analyze if those users are bots, you will do a proof of concept with one tweet and extract them using the .findall() method.

You write down some helpful metacharacters to help you later:

- \d: digit
- \w: word character
- \W: non-word character
- \s: whitespace

The text of one tweet was saved in the variable sentiment_analysis. You can use print(sentiment_analysis) to view it in the IPython Shell.

#### Instructions

- Import the re module.
- Write a regex that matches the user mentions that follow the pattern, e.g. @robot3!.
- Find all the matches of the pattern in the sentiment_analysis variable.

In [19]:
sentiment_analysis = '@robot9! @robot4& I have a good feeling that the show isgoing to be amazing! @robot9$ @robot7%'

In [20]:
# Import the re module
import re 

# Write the regex
regex = r"@robot\d\W"

# Find all matches of regex
print(re.findall(regex, sentiment_analysis))

['@robot9!', '@robot4&', '@robot9$', '@robot7%']
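
To see what each metacharacter from the list above matches on its own, here is a small sketch on a hypothetical sample string.

```python
import re

sample = "id 42_x!"  # hypothetical sample string

digits = re.findall(r"\d", sample)      # digits only
word_chars = re.findall(r"\w", sample)  # letters, digits and underscore
non_word = re.findall(r"\W", sample)    # everything else, including spaces
spaces = re.findall(r"\s", sample)      # whitespace

print(digits, word_chars, non_word, spaces)
```

Note that `\W` also matches whitespace, so `@robot\d\W` would accept a mention followed by a space as well as one followed by a symbol.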


### Find the numbers
While examining the tweet text in your dataset, you detect that some tweets carry extra information. The text contains the number of retweets, user mentions, and likes of that tweet. So, you decide to extract this important information.

The information is given as in this example:

Agh...snow! User_mentions:9, likes: 5, number of retweets: 4

You also bring your list of metacharacters: \d digit, \w word character, \s whitespace.

The variable sentiment_analysis containing the text of one tweet and the re module were loaded in your session. You can use print() to view it in the IPython Shell.

#### Instructions 
- Write a regex that matches the number of user mentions given as, for example, User_mentions:9 in sentiment_analysis.
- Write a regex that matches the number of likes given as, for example, likes: 5 in sentiment_analysis.
- Write a regex that matches the number of retweets given as, for example, number of retweets: 4 in sentiment_analysis.

In [21]:
sentiment_analysis= "Unfortunately one of those moments wasn't a giant squid monster. User_mentions:2, likes: 9, number of retweets: 7"

In [22]:
# Write a regex to obtain user mentions
print(re.findall(r"User_mentions:\d", sentiment_analysis))

# Write a regex to obtain number of likes
print(re.findall(r"likes:\s\d", sentiment_analysis))

# Write a regex to obtain number of retweets
print(re.findall(r"number\sof\sretweets:\s\d", sentiment_analysis))

['User_mentions:2']
['likes: 9']
['number of retweets: 7']


### Match and split
Some of the tweets in your dataset were downloaded incorrectly. Instead of having spaces to separate words, they have strange characters. You decide to use regular expressions to handle this situation. You print some of these tweets to understand which pattern you need to match.

You notice that the sentences are always separated by a special character, followed by a number, the word break, and after that, another special character, e.g. &4break!. The words are always separated by a special character, the word new, and a normal random character, e.g. #newH.

The variable sentiment_analysis containing the text of one tweet, as well as the re module were already loaded in your session. You can use print(sentiment_analysis) to view it in the IPython Shell.

#### Instructions 
- Write a regex that matches the pattern separating the sentences in sentiment_analysis, e.g. &4break!.
- Replace regex_sentence with a space " " in the variable sentiment_analysis. Assign it to sentiment_sub.
- Write a regex that matches the pattern separating the words in sentiment_analysis, e.g. #newH.
- Replace regex_words with a space in the variable sentiment_sub. Assign it to sentiment_final and print out the result.

In [23]:
sentiment_analysis =  'He#newHis%newTin love with$newPscrappy. #8break%He is&newYmissing him@newLalready'

In [24]:
# Write a regex to match pattern separating sentences
regex_sentence = r"\W\dbreak\W"

# Replace the regex_sentence with a space
sentiment_sub = re.sub(regex_sentence, " ", sentiment_analysis)

# Write a regex to match pattern separating words
regex_words = r"\Wnew\w"

# Replace the regex_words and print the result
sentiment_final = re.sub(regex_words, ' ', sentiment_sub)
print(sentiment_final)

He is in love with scrappy.  He is missing him already


### Everything clean
Back to your Twitter sentiment analysis project! There are several types of strings that increase your sentiment analysis complexity. But these strings do not provide any useful sentiment. Among them, we can have links and user mentions.

In order to clean the tweets, you want to extract some examples first. You know that most of the time links start with http and do not contain any whitespace, e.g. https://www.datacamp.com. User mentions start with @ and can have letters and numbers only, e.g. @johnsmith3.

You write down some helpful quantifiers to help you: * zero or more times, + once or more, ? zero or once.

The list sentiment_analysis containing the text of three tweets is already loaded in your session. You can use print() to view the data in the IPython Shell.

#### Instructions

- Import the re module.
- Write a regex to find all the matches of http links appearing in each tweet in sentiment_analysis. Print out the result.
- Write a regex to find all the matches of user mentions appearing in each tweet in sentiment_analysis. Print out the result.

In [25]:
sentiment_analysis = ['Boredd. Colddd @blueKnight39 Internet keeps stuffing up. Save me! https://www.tellyourstory.com',
                      "I had a horrible nightmare last night @anitaLopez98 @MyredHat31 which affected my sleep, now I'm really tired",
                      "im lonely  keep me company @YourBestCompany! @foxRadio https://radio.foxnews.com 22 female, new york"]

In [26]:

for tweet in sentiment_analysis:
    # Write regex to match http links and print out result
    print(re.findall(r"http\S+", tweet))

    # Write regex to match user mentions and print out result
    print(re.findall(r"@\w+", tweet))

['https://www.tellyourstory.com']
['@blueKnight39']
[]
['@anitaLopez98', '@MyredHat31']
['https://radio.foxnews.com']
['@YourBestCompany', '@foxRadio']


### Some time ago
You are interested in knowing when the tweets were posted. After reading a little bit more, you learn that dates are provided in different ways. You decide to extract the dates using .findall() so you can normalize them afterwards to make them all look the same.

You realize that the dates are always presented in one of the following ways:

- 27 minutes ago
- 4 hours ago
- 23rd june 2018
- 1st september 2019 17:25

The list sentiment_analysis containing the text of three tweets, as well as the re module are already loaded in your session. You can use print() to view the data in the IPython Shell.

#### Instructions 
- Complete the for-loop with a regex that finds all dates in a format similar to 27 minutes ago or 4 hours ago.
- Complete the for-loop with a regex that finds all dates in a format similar to 23rd june 2018.
- Complete the for-loop with a regex that finds all dates in a format similar to 1st september 2019 17:25.

In [27]:
sentiment_analysis = ['I would like to apologize for the repeated Video Games Live related tweets. 32 minutes ago',
 '@zaydia but i cant figure out how to get there / back / pay for a hotel 1st May 2019',
 'FML: So much for seniority, bc of technological ineptness 23rd June 2018 17:54']

In [28]:
# Complete the for loop with a regex to find dates
for date in sentiment_analysis:
    print(re.findall(r"\d{1,2}\s\w+\sago", date))
# Complete the for loop with a regex to find dates
for date in sentiment_analysis:
    print(re.findall(r"\d{1,2}\w+\s\w+\s\d{4}", date))
# Complete the for loop with a regex to find dates
for date in sentiment_analysis:
    print(re.findall(r"\d{1,2}\w+\s\w+\s\d{4}\s\d{1,2}:\d{1,2}", date))

['32 minutes ago']
[]
[]
[]
['1st May 2019']
['23rd June 2018']
[]
[]
['23rd June 2018 17:54']
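
The second and third patterns differ only in the trailing time, so one possible variation (not part of the exercise) is to make the time optional with a non-capturing group. The shortened tweets below are stand-ins for the exercise data.

```python
import re

# Shortened stand-ins for the three tweets
tweets = [
    "posted 32 minutes ago",
    "hotel 1st May 2019",
    "ineptness 23rd June 2018 17:54",
]

# (?: ... )? makes the " hh:mm" part optional without creating a capture group,
# so findall still returns the whole match
date_pattern = r"\d{1,2}\w+\s\w+\s\d{4}(?:\s\d{1,2}:\d{2})?"

found = [re.findall(date_pattern, tweet) for tweet in tweets]
print(found)
```

The group has to be non-capturing here: with a plain `( ... )`, `findall` would return only the captured time instead of the full date.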


### Getting tokens
Your next step is to tokenize the text of your tweets. Tokenization is the process of breaking a string into lexical units or, in simpler terms, words. But first, you need to remove hashtags so they do not cloud your process. You realize that hashtags start with a # symbol and contain letters and numbers but never whitespace. After that, you plan to split the text at whitespace matches to get the tokens.

You bring your list of quantifiers to help you: * zero or more times, + once or more, ? zero or once, {n, m} minimum n, maximum m.

The variable sentiment_analysis containing the text of one tweet as well as the re module are already loaded in your session. You can use print(sentiment_analysis) to view it in the IPython Shell.

#### Instructions
- Write a regex that matches the described hashtag pattern. Assign it to the regex variable.
- Replace all the matches of the regex with an empty string "". Assign it to no_hashtag variable.
- Split the text in the no_hashtag variable at every match of one or more consecutive whitespace.

In [30]:
sentiment_analysis = 'ITS NOT ENOUGH TO SAY THAT IMISS U #MissYou #SoMuch #Friendship #Forever'

In [31]:
# Write a regex matching the hashtag pattern
regex = r"#\w+"

# Replace the regex by an empty string
no_hashtag = re.sub(regex, "", sentiment_analysis)

# Get tokens by splitting text
print(re.split(r"\s+", no_hashtag))

['ITS', 'NOT', 'ENOUGH', 'TO', 'SAY', 'THAT', 'IMISS', 'U', '']
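
The empty string at the end of the token list comes from the whitespace left behind after the hashtags are removed. A minimal sketch of one way to avoid it, using a shortened stand-in for the tweet:

```python
import re

text = "IMISS U #MissYou #Forever"  # shortened stand-in for the tweet
no_tags = re.sub(r"#\w+", "", text)  # leaves trailing spaces behind

# Splitting as-is produces a trailing empty token
raw_tokens = re.split(r"\s+", no_tags)

# Stripping the string first removes the cause of the empty token
clean_tokens = re.split(r"\s+", no_tags.strip())

print(raw_tokens)
print(clean_tokens)
```

Calling `no_tags.split()` with no arguments gives the same clean result, since it splits on runs of whitespace and ignores leading and trailing whitespace.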


### Finding files
You are not satisfied with your tweets dataset cleaning. There are still extra strings that do not provide any sentiment. Among them are strings that refer to text file names.

You also find a way to detect them:

- They appear at the start of the string.
- They always start with a sequence of 2 or 3 upper or lowercase vowels (a e i o u).
- They always finish with the txt ending.

You are not sure if you should remove them directly. So you write a script to find and store them in a separate dataset.

You write down some metacharacters to help you: ^ anchor to beginning, . any character.

The variable sentiment_analysis containing the text of two tweets as well as the re module are already loaded in your session. You can use print() to view it in the IPython Shell.

#### Instructions

- Write a regex that matches the pattern of the text file names, e.g. aemyfile.txt.
- Find all matches of the regex in the elements of sentiment_analysis. Print out the result.
- Replace all matches of the regex with an empty string "". Print out the result.

In [32]:
sentiment_analysis = ['AIshadowhunters.txt aaaaand back to my literature review. At least i have a friendly cup of coffee to keep me company',
 "ouMYTAXES.txt I am worried that I won't get my $900 even though I paid tax last year"]

In [33]:
# Write a regex to match text file name
regex = r"^[aeiouAEIOU]{2,3}.+txt"

for text in sentiment_analysis:
    # Find all matches of the regex
    print(re.findall(regex, text))

    # Replace all matches with empty string
    print(re.sub(regex, "", text))

['AIshadowhunters.txt']
 aaaaand back to my literature review. At least i have a friendly cup of coffee to keep me company
['ouMYTAXES.txt']
 I am worried that I won't get my $900 even though I paid tax last year
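
The pattern above works on these two tweets, but `.+` is greedy and the dot before `txt` is not escaped, so a tweet that mentions `txt` again later would be matched too far. The sketch below shows a stricter variation on a made-up tweet; both the sample and the stricter pattern are assumptions, not part of the exercise.

```python
import re

sample = "aenotes.txt see also old.txt later"  # made-up tweet

# Exercise pattern: greedy .+ runs up to the last "txt" in the string
loose = re.findall(r"^[aeiouAEIOU]{2,3}.+txt", sample)

# Stricter variation: no whitespace inside the name, lazy match, literal dot
strict = re.findall(r"^[aeiouAEIOU]{2,3}\S+?\.txt", sample)

print(loose)
print(strict)
```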


### Give me your email
A colleague has asked for your help! When a user signs up on the company website, they must provide a valid email address.
The company puts some rules in place to verify that the given email address is valid:

- The first part can contain:
    - Upper A-Z and lowercase letters a-z
    - Numbers
    - Characters: !, #, %, &, *, $, .
- Must have @
- Domain:
    - Can contain any word characters
    - But only .com ending is allowed
    
The project consists of writing a script that checks if the email address follows the correct pattern. Your colleague gave you a list of email addresses as examples to test.

The list emails as well as the re module are loaded in your session. You can use print(emails) to view the emails in the IPython Shell.

#### Instructions

- Write a regular expression to match valid email addresses as described.
- Match the regex to the elements contained in emails.
- To print out the message indicating if it is a valid email or not, complete the .format() call.


In [34]:
emails = ['n.john.smith@gmail.com', '87victory@hotmail.com', '!#mary-=@msca.net']

In [35]:
# Write a regex to match a valid email address
# (no wildcard before @, so only the allowed characters can precede it)
regex = r"[a-zA-Z0-9!#%&*$.]+@\w+\.com"

for example in emails:
    # Match the regex to the string
    if re.match(regex, example):
        # Complete the format method to print out the result
        print("The email {email_example} is a valid email".format(email_example=example))
    else:
        print("The email {email_example} is invalid".format(email_example=example))

The email n.john.smith@gmail.com is a valid email
The email 87victory@hotmail.com is a valid email
The email !#mary-=@msca.net is invalid


### Invalid password
The second part of the website project is to write a script that validates the password entered by the user. The company also puts some rules in place to verify valid passwords:

- It can contain lowercase a-z and uppercase letters A-Z
- It can contain numbers
- It can contain the symbols: *, #, $, %, !, &, .
- It must be at least 8 characters long but not more than 20

Your colleague also gave you a list of passwords as examples to test.

The list passwords and the module re are loaded in your session. You can use print(passwords) to view them in the IPython Shell.

#### Instructions

- Write a regular expression to match valid passwords as described.
- Scan the elements in the passwords list to find out if they are valid passwords.
- To print out the message indicating if it is a valid password or not, complete the .format() call.


In [36]:
passwords = ['Apple34!rose', 'My87hou#4$', 'abc123']
# Write a regex to match a valid password
regex = r"[a-zA-Z0-9!#%&*$.]{8,20}"

for example in passwords:
    # Scan the strings to find a match
    if re.search(regex, example):
        # Complete the format method to print out the result
        print("The password {pass_example} is a valid password".format(pass_example=example))
    else:
        print("The password {pass_example} is invalid".format(pass_example=example))

The password Apple34!rose is a valid password
The password My87hou#4$ is a valid password
The password abc123 is invalid
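
One caveat with scanning: `re.search()` succeeds as soon as any run of 8 to 20 allowed characters appears, so it does not by itself enforce the upper limit or reject disallowed characters elsewhere in the string. A possible stricter check, sketched here with hypothetical test passwords, is `re.fullmatch()`, which requires the entire string to fit the pattern.

```python
import re

password_pattern = r"[a-zA-Z0-9!#%&*$.]{8,20}"
too_long = "A" * 25                   # hypothetical 25-character password
with_space = "bad pass Apple34!rose"  # contains spaces, which are not allowed

# search() only needs some substring to match
search_long = bool(re.search(password_pattern, too_long))
search_space = bool(re.search(password_pattern, with_space))

# fullmatch() requires the whole string to match
full_long = bool(re.fullmatch(password_pattern, too_long))
full_space = bool(re.fullmatch(password_pattern, with_space))
full_valid = bool(re.fullmatch(password_pattern, "Apple34!rose"))

print(search_long, search_space)
print(full_long, full_space, full_valid)
```

For the three passwords in the exercise both approaches give the same verdicts; the difference only shows up on inputs like the two above.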


## Greedy vs Non-greedy
### Understanding the difference
You need to keep working and cleaning your tweets dataset. You realize that there are some HTML tags present. You need to remove them but keep the inside content, as it is useful for analysis.

Let's take a look at this sentence containing an HTML tag:

`I want to see that <strong>amazing show</strong> again!`

You know that to get an HTML tag you need to match anything that sits inside angle brackets < >. But the biggest problem is that the closing tag has the same structure. If you match too much, you will end up removing key information. So you need to decide whether to use a greedy or a lazy quantifier.

The string is already loaded as string in your session.

#### Instructions

- Import the re module.
- Write a regular expression to replace HTML tags with an empty string.
- Print out the result.

In [37]:
string = 'I want to see that <strong>amazing show</strong> again!'

In [38]:
# Write a regex to eliminate tags
string_notags = re.sub(r"<.+?>", "", string)

# Print out the result
print(string_notags)

I want to see that amazing show again!


### Greedy matching
Next, you see that numbers still appear in the text of the tweets. So, you decide to find all of them.

Let's imagine that you want to extract the number contained in the sentence `I was born on April 24th`. A lazy quantifier will make the regex return 2 and 4, because it matches as few characters as needed. However, a greedy quantifier will return the entire 24, because it matches as many characters as possible.

The re module as well as the variable sentiment_analysis are already loaded in your session. You can use print(sentiment_analysis) to view it in the IPython Shell.

#### Instructions 

- Use a lazy quantifier to match all numbers that appear in the variable sentiment_analysis.
- Now, use a greedy quantifier to match all numbers that appear in the variable sentiment_analysis.

In [39]:
sentiment_analysis = 'Was intending to finish editing my 536-page novel manuscript tonight, but that will probably not happen. And only 12 pages are left '

In [40]:
# Write a lazy regex expression 
numbers_found_lazy = re.findall(r"\d+?", sentiment_analysis)

# Print out the result
print(numbers_found_lazy)

# Write a greedy regex expression 
numbers_found_greedy = re.findall(r"\d+", sentiment_analysis)

# Print out the result
print(numbers_found_greedy)

['5', '3', '6', '1', '2']
['536', '12']


### Lazy approach
You have done some cleaning in your dataset but you are worried that there are sentences encased in parentheses that may cloud your analysis.

Again, a greedy or a lazy quantifier may lead to different results.

For example, if you want to extract a word starting with a and ending with e in the string I like apple pie, you may think that applying the greedy regex a.+e will return apple. However, your match will be apple pie. A way to overcome this is to make it lazy by using ? which will return apple.
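
The apple pie example can be checked directly with a short sketch:

```python
import re

phrase = "I like apple pie"

# Greedy: .+ grabs as much as possible, then backs up to the last "e"
greedy = re.findall(r"a.+e", phrase)

# Lazy: .+? grabs as little as possible, stopping at the first "e"
lazy = re.findall(r"a.+?e", phrase)

print(greedy)
print(lazy)
```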

The re module and the variable sentiment_analysis are already loaded in your session.

#### Instructions 
- Use a greedy quantifier to match text that appears within parentheses in the variable sentiment_analysis.
- Now, use a lazy quantifier to match text that appears within parentheses in the variable sentiment_analysis.

In [41]:
sentiment_analysis = "Put vacation photos online (They were so cute) a few yrs ago. PC crashed, and now I forget the name of the site (I'm crying). "

In [42]:
# Write a greedy regex to match text within parentheses
sentences_found_greedy = re.findall(r"\(.*\)", sentiment_analysis)

# Print out the result
print(sentences_found_greedy)

# Write a lazy regex expression
sentences_found_lazy = re.findall(r"\(.*?\)", sentiment_analysis)

# Print out the results
print(sentences_found_lazy)

["(They were so cute) a few yrs ago. PC crashed, and now I forget the name of the site (I'm crying)"]
['(They were so cute)', "(I'm crying)"]


# 4. Advanced Regular Expression Concepts


In the last step of your journey, you will learn more complex methods of pattern matching using parentheses to group strings together or to match the same text as matched previously. Also, you will get an idea of how you can look around expressions.

### Try another name

In [43]:
sentiment_analysis = ['Just got ur newsletter, those fares really are unbelievable. Write to statravelAU@gmail.com or statravelpo@hotmail.com. They have amazing prices',
 'I should have paid more attention when we covered photoshop in my webpage design class in undergrad. Contact me Hollywoodheat34@msn.net.',
 'hey missed ya at the meeting. Read your email! msdrama098@hotmail.com']

In [44]:
# Write a regex that matches email
regex_email = r"([a-zA-Z0-9]+)@\S+"

for tweet in sentiment_analysis:
    # Find all matches of regex in each tweet
    email_matched = re.findall(regex_email, tweet)

    # Complete the format method to print the results
    print("Lists of users found in this tweet: {}".format(email_matched))

Lists of users found in this tweet: ['statravelAU', 'statravelpo']
Lists of users found in this tweet: ['Hollywoodheat34']
Lists of users found in this tweet: ['msdrama098']
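Only the user names are printed because re.findall() returns the captured groups, not the whole match, whenever the pattern contains groups. The sketch below compares three variants of the pattern on a shortened version of the last tweet; the variable names are illustrative.

```python
import re

tweet = "Read your email! msdrama098@hotmail.com"

# No group: findall returns the whole match
whole_match = re.findall(r"[a-zA-Z0-9]+@\S+", tweet)

# One group: findall returns only the captured text
user_only = re.findall(r"([a-zA-Z0-9]+)@\S+", tweet)

# Two groups: findall returns one tuple per match
user_and_domain = re.findall(r"([a-zA-Z0-9]+)@(\S+)", tweet)

print(whole_match)      # ['msdrama098@hotmail.com']
print(user_only)        # ['msdrama098']
print(user_and_domain)  # [('msdrama098', 'hotmail.com')]
```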


### Flying home
Your boss assigned you to a small project. They are performing an analysis of the travels people made to attend business meetings. You are given a dataset with only the email subjects for each of the people traveling.

You learn that the text follows a pattern. Here is an example:

`Here you have your boarding pass LA4214 AER-CDB 06NOV.`

You need to extract the information about the flight:

- The first two letters indicate the airline (e.g. LA).
- The four digits are the flight number (e.g. 4214).
- The next three letters correspond to the departure (e.g. AER).
- The three letters after the hyphen correspond to the destination (e.g. CDB).
- The last part is the date of the flight (e.g. 06NOV).
- All letters are always uppercase.

The variable flight containing one email subject was loaded in your session. You can use print() to view it in the IPython Shell.

#### Instructions 
- Complete the regular expression to match and capture all the flight information required. Only the first pair of parentheses was placed for you.
- Find all the matches corresponding to each piece of information about the flight. Assign it to flight_matches.




In [45]:
flight =  'Subject: You are now ready to fly. Here you have your boarding pass IB3723 AMS-MAD 06OCT'

# Write regex to capture information of the flight
regex = r"([A-Z]{2})(\d{4})\s([A-Z]{3})-([A-Z]{3})\s(\d{2}[A-Z]{3})"

# Find all matches of the flight information
flight_matches = re.findall(regex, flight)

# Print the matches
print("Airline: {} Flight number: {}".format(flight_matches[0][0], flight_matches[0][1]))
print("Departure: {} Destination: {}".format(flight_matches[0][2], flight_matches[0][3]))
print("Date: {}".format(flight_matches[0][4]))

Airline: IB Flight number: 3723
Departure: AMS Destination: MAD
Date: 06OCT
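When only one flight is expected per subject line, re.search() with tuple unpacking avoids the double indexing used above. This is a sketch under that assumption, applied to the example subject from the exercise description; the names subject and match are hypothetical.

```python
import re

subject = "Here you have your boarding pass LA4214 AER-CDB 06NOV."
regex = r"([A-Z]{2})(\d{4})\s([A-Z]{3})-([A-Z]{3})\s(\d{2}[A-Z]{3})"

# search() returns the first match (or None); groups() returns all captures
match = re.search(regex, subject)
if match:
    airline, flight_number, departure, destination, date = match.groups()
    print(airline, flight_number, departure, destination, date)
    # LA 4214 AER CDB 06NOV
```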


### Love it!
You are still working on the Twitter sentiment analysis project. First, you want to identify positive tweets about movies and concerts.

You plan to find all the sentences that contain the words love, like, or enjoy and capture that word. You will limit the tweets by focusing on those that contain the words movie or concert by keeping the word in another group. You will also save the movie or concert name.

For example, if you have the sentence: I love the movie Avengers. You match and capture love. You need to match and capture movie. Afterwards, you match and capture anything until the dot.

The list sentiment_analysis containing the text of three tweets and the re module are loaded in your session. You can use print() to view the data in the IPython Shell.

#### Instructions

- Complete the regular expression to capture the words love or like or enjoy. Match and capture the words movie or concert. Match and capture anything appearing until the dot (.).
- Find all matches of the regex in each element of sentiment_analysis. Assign them to positive_matches.
- Complete the .format() method to print out the results contained in positive_matches for each element in sentiment_analysis.


In [46]:
sentiment_analysis = ['I totally love the concert The Book of Souls World Tour. It kinda amazing!',
 'I enjoy the movie Wreck-It Ralph. I watched with my boyfriend.',
 "I still like the movie Wish Upon a Star. Too bad Disney doesn't show it anymore."]

In [47]:
# Write a regex that matches sentences with the optional words
regex_positive = r"(love|like|enjoy).+?(movie|concert)\s(.+?)\."

for tweet in sentiment_analysis:
    # Find all matches of regex in tweet
    positive_matches = re.findall(regex_positive, tweet)

    # Complete format to print out the results
    print("Positive comments found {}".format(positive_matches))

Positive comments found [('love', 'concert', 'The Book of Souls World Tour')]
Positive comments found [('enjoy', 'movie', 'Wreck-It Ralph')]
Positive comments found [('like', 'movie', 'Wish Upon a Star')]


### Negative sentiment

In [48]:
sentiment_analysis = ['That was horrible! I really dislike the movie The cabin and the ant. So boring.',
 "I disapprove the movie Honest with you. It's full of cliches.",
 'I dislike very much the concert After twelve Tour. The sound was horrible.']
# Write a regex that matches sentences with the optional words
regex_negative = r"(disapprove|dislike|hate).+?(?:movie|concert)\s(.+?)\."

for tweet in sentiment_analysis:
    # Find all matches of regex in tweet
    negative_matches = re.findall(regex_negative, tweet)

    # Complete format to print out the results
    print("Negative comments found {}".format(negative_matches))

Negative comments found [('dislike', 'The cabin and the ant')]
Negative comments found [('disapprove', 'Honest with you')]
Negative comments found [('dislike', 'After twelve Tour')]
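The word movie or concert is missing from these results because the pattern wraps it in a non-capturing group, (?:...), which groups the alternatives without saving the matched text. The sketch below contrasts the two kinds of group on a made-up tweet; the tweet and the variable names are assumptions for illustration.

```python
import re

tweet = "I dislike the movie Jaws."

# Capturing group: the matched word movie/concert is part of the result
with_capture = re.findall(r"(dislike).+?(movie|concert)\s(.+?)\.", tweet)

# Non-capturing group: the word is still required, but it is not returned
without_capture = re.findall(r"(dislike).+?(?:movie|concert)\s(.+?)\.", tweet)

print(with_capture)     # [('dislike', 'movie', 'Jaws')]
print(without_capture)  # [('dislike', 'Jaws')]
```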


### Parsing PDF files
You now need to work on another small project you have been delaying. Your company gave you some PDF files of signed contracts. The goal of the project is to create a database with the information you parse from them. Three of these columns should correspond to the day, month, and year when the contract was signed.
The dates appear as Signed on 05/24/2016 (05 indicating the month, 24 the day). You decide to use capturing groups to extract this information. Also, you would like to retrieve that information so you can store it separately in different variables.

You decide to do a proof of concept.

The variable contract containing the text of one contract and the re module are already loaded in your session. You can use print() to view the data in the IPython Shell.

#### Instructions 

- Write a regex that captures the month, day, and year in which the contract was signed. Scan contract for matches.
- Assign each captured group to the corresponding keys in the dictionary.
- Complete the .format() call to print out the captured groups. Use the values corresponding to each key in the dictionary.

In [49]:
contract = 'Provider will invoice Client for Services performed within 30 days of performance.  Client will pay Provider as set forth in each Statement of Work within 30 days of receipt and acceptance of such invoice. It is understood that payments to Provider for services rendered shall be made in full as agreed, without any deductions for taxes of any kind whatsoever, in conformity with Provider’s status as an independent contractor. Signed on 03/25/2001.'

In [50]:
# Write regex and scan contract to capture the dates described
regex_dates = r"Signed\son\s(\d{2})/(\d{2})/(\d{4})"
dates = re.search(regex_dates, contract)

# Assign to each key the corresponding match
signature = {
    "day": dates.group(2),
    "month": dates.group(1),
    "year": dates.group(3)
}
# Complete the format method to print out the results
print("Our first contract is dated back to {data[year]}. Particularly, the day {data[day]} of the month {data[month]}.".format(data=signature))

Our first contract is dated back to 2001. Particularly, the day 25 of the month 03.
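Numbered groups make it easy to mix up the day and the month. As an alternative sketch (not the approach the exercise asks for), named groups of the form (?P<name>...) let the pattern label each part, and groupdict() builds the dictionary directly. The shortened string contract_excerpt and the name regex_named are hypothetical.

```python
import re

contract_excerpt = "payments shall be made in full as agreed. Signed on 03/25/2001."

# Each group carries its own name, so the order of the fields is explicit
regex_named = r"Signed\son\s(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})"
dates = re.search(regex_named, contract_excerpt)

# groupdict() maps every group name to the text it captured
signature = dates.groupdict() if dates else {}

print(signature)  # {'month': '03', 'day': '25', 'year': '2001'}
```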


### Close the tag, please!
In the meantime, you are working on one of your other projects. The company is going to develop a new product. It will help developers automatically check the code they are writing. You need to write a short script for checking that every HTML tag that is open has its proper closure.

You have an example of a string containing HTML tags:

`<title>The Data Science Company</title>`

You learn that an opening HTML tag is always at the beginning of the string. It appears inside <>. A closing tag also appears inside <>, but it is preceded by /.

You also remember that capturing groups can be referenced later in the same pattern by their number, e.g. \1 for the first group or \4 for the fourth.
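A numbered backreference matches the exact text that the group captured earlier, not the group's pattern. The following sketch finds and removes a repeated word; the sentence and the variable names are made up for illustration.

```python
import re

sentence = "I wish you a happy happy birthday"

# (\w+) captures a word; \1 requires the same text again after one whitespace
repeated = re.findall(r"\b(\w+)\s\1\b", sentence)

# The same backreference can be used in the replacement to keep one copy
cleaned = re.sub(r"\b(\w+)\s\1\b", r"\1", sentence)

print(repeated)  # ['happy']
print(cleaned)   # I wish you a happy birthday
```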

The list html_tags, containing three strings with HTML tags, and the re module are loaded in your session. You can use print() to view the data in the IPython Shell.

#### Instructions

- Complete the regex in order to match closed HTML tags. Find if there is a match in each string of the list html_tags. Assign the result to match_tag.
- If a match is found, print the first group captured and saved in match_tag.
- If no match is found, complete the regex to match only the text inside the HTML tag. Assign it to notmatch_tag.
- Print the first group captured by the regex and save it in notmatch_tag.

In [51]:
html_tags = ['<body>Welcome to our course! It would be an awesome experience</body>',
 '<article>To be a data scientist, you need to have knowledge in statistics and mathematics</article>',
 '<nav>About me Links Contact me!']

In [52]:
for string in html_tags:
    # Complete the regex and find if it matches a closed HTML tag
    match_tag = re.match(r"<(\w+)>.*?</\1>", string)
    # print(match_tag)
    if match_tag:
        # If it matches, print the first captured group
        print("Your tag {} is closed".format(match_tag.group(1)))
    else:
        # If it doesn't match, capture only the opening tag
        notmatch_tag = re.match(r"<(\w+)>", string)
        # Print the first captured group
        print("Close your {} tag!".format(notmatch_tag.group(1)))

Your tag body is closed
Your tag article is closed
Close your nav tag!


In [53]:
sentiment_analysis = ['@marykatherine_q i know! I heard it this morning and wondered the same thing. Moscooooooow is so behind the times',
 'Staying at a friends house...neighborrrrrrrs are so loud-having a party',
 'Just woke up an already have read some e-mail']

In [54]:
# Complete the regex to match an elongated word
regex_elongated = r"\w*(\w)\1\w*"

for tweet in sentiment_analysis:
    # Find if there is a match in each tweet
    match_elongated = re.search(regex_elongated, tweet)

    if match_elongated:
        # Assign the whole match (group zero)
        elongated_word = match_elongated.group(0)

        # Complete the format method to print the word
        print("Elongated word found: {word}".format(word=elongated_word))
    else:
        print("No elongated word found")

Elongated word found: Moscooooooow
Elongated word found: neighborrrrrrrs
No elongated word found


### Surrounding words
Now, you want to perform some visualizations with your sentiment_analysis dataset. You are interested in the words surrounding python. You want to count how many times a specific word appears right before and after it.

Positive lookahead (?=) makes sure that the first part of the expression is followed by the lookahead expression. Positive lookbehind (?<=) returns all matches that are preceded by the specified pattern. In both cases, the lookaround text itself is not included in the match.
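Both lookarounds only check the surrounding text; they do not consume it. A minimal sketch on a made-up string (the string and the variable names are assumptions):

```python
import re

text = "learning python and teaching python"

# Positive lookahead: words that are followed by " python"
before_python = re.findall(r"\w+(?=\spython)", text)

# Positive lookbehind: words that are preceded by "python "
after_python = re.findall(r"(?<=python\s)\w+", text)

print(before_python)  # ['learning', 'teaching']
print(after_python)   # ['and']
```

Note that Python's re module requires a lookbehind pattern to match strings of a fixed length, which is why the examples here use a single whitespace character rather than \s+.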

The variable sentiment_analysis, containing the text of one tweet, and the re module are loaded in your session. You can use print() to view the data in the IPython Shell.

#### Instructions
- Get all the words that are followed by the word python in sentiment_analysis. Print out the word found.
- Get all the words that are preceded by the word python or Python in sentiment_analysis. Print out the words found.

In [55]:
sentiment_analysis = 'You need excellent python skills to be a data scientist. Must be! Excellent python'

In [56]:
# Positive lookahead
look_ahead = re.findall(r"\S+(?=\s[Pp]ython)", sentiment_analysis)

# Print out
print(look_ahead)

['excellent', 'Excellent']


In [57]:
# Positive lookbehind
look_behind = re.findall(r"(?<=[Pp]ython\s)\w+", sentiment_analysis)

# Print out
print(look_behind)

['skills']


### Filtering phone numbers
Now, you need to write a script for a cell-phone searcher. It should scan a list of phone numbers and return those that meet certain characteristics.

The phone numbers in the list have the structure:

- Optional area code: 3 digits
- Prefix: 4 digits
- Line number: 6 digits
- Optional extension: 2 digits

E.g. 654-8764-439434-01.

You decide to use .findall() together with two non-capturing constructs: negative lookahead (?!) and negative lookbehind (?<!).
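Negative lookarounds accept a match only when the given pattern is absent after (?!) or before (?<!) it. The sketch below uses made-up price strings rather than phone numbers; the strings and the variable names are assumptions. The \b anchors keep the regex engine from backtracking to a shorter number that would slip past the check.

```python
import re

# Negative lookahead: whole numbers that are NOT followed by "USD"
not_usd = re.findall(r"\b\d+\b(?!\s?USD)", "10 USD 20 EUR 30 USD")

# Negative lookbehind: whole numbers that are NOT preceded by "$"
no_dollar_sign = re.findall(r"(?<!\$)\b\d+\b", "$10 20 $30")

print(not_usd)         # ['20']
print(no_dollar_sign)  # ['20']
```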

The list cellphones, containing three phone numbers, and the re module are loaded in your session. You can use print() to view the data in the IPython Shell.

#### Instructions
- Get all phone numbers that are not preceded by the optional area code.
- Get all the phone numbers that are not followed by the optional extension.

In [58]:
cellphones =  ['4564-646464-01', '345-5785-544245', '6476-579052-01']

In [59]:
for phone in cellphones:
    # Get all phone numbers not preceded by area code
    number = re.findall(r"(?<!\d{3}-)\d{4}-\d{6}-\d{2}", phone)
    print(number)

['4564-646464-01']
[]
['6476-579052-01']


In [60]:
for phone in cellphones:
    # Get all phone numbers not followed by optional extension
    number = re.findall(r"\d{3}-\d{4}-\d{6}(?!-\d{2})", phone)
    print(number)

[]
['345-5785-544245']
[]
