
# 1. Regular Expressions

`r'st\d\s\w{3,10}'`

### General Explanation
This is a pattern designed to find matches in a string with the following format:

1. It starts with `st`.  
2. It is followed by a digit (any number from 0 to 9).  
3. Next, there is a whitespace character (such as a space, tab, or newline).  
4. Then come 3 to 10 word characters (letters, digits, or underscores).  

The prefix `r` before the string marks it as a **raw string**: Python leaves backslashes untouched instead of interpreting sequences such as `\n` or `\b` as escape codes, so the pattern reaches the regular expression engine exactly as written.
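
As a small illustration of why the `r` prefix matters, the sketch below uses `\b`, which is a word boundary in a regex but a backspace character in an ordinary Python string. The sample text is made up for this example.

```python
import re

sample = "cat concat"  # hypothetical sample text

# Raw string: the regex engine receives backslash + b, i.e. a word boundary
raw_matches = re.findall(r"\bcat\b", sample)

# Ordinary string: Python converts \b to a backspace character first,
# so the pattern looks for backspace + "cat" + backspace
plain_matches = re.findall("\bcat\b", sample)

print(raw_matches)    # ['cat']
print(plain_matches)  # []
```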

---

### Pattern Breakdown

1. **`st`**  
   Matches the literal characters `st`.  

2. **`\d`**  
   Matches a numeric digit (0-9).  

3. **`\s`**  
   Matches a whitespace character, such as:  
   - A space (` `).  
   - A tab (`\t`).  
   - A newline (`\n`), among others.  

4. **`\w{3,10}`**  
   Matches any alphanumeric character or an underscore (`_`):  
   - Uppercase letters (`A-Z`).  
   - Lowercase letters (`a-z`).  
   - Numbers (`0-9`).  
   - Underscore (`_`).  

   `{3,10}` specifies that there must be **between 3 and 10 repetitions** of `\w`.
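
The breakdown above can be checked with a short sketch. The sample string is hypothetical and was chosen so that some candidates fail at different steps of the pattern.

```python
import re

pattern = r"st\d\s\w{3,10}"

# Hypothetical sample text with matching and non-matching candidates
sample = "st1 hello, st22 no, st3 ab, st4 abcdefghijklmnop"

matches = re.findall(pattern, sample)
print(matches)  # ['st1 hello', 'st4 abcdefghij']
```

Here `st22 no` fails because the character after the first digit is not whitespace, and `st3 ab` fails because only two word characters follow the space. The last match stops after 10 word characters, since nothing in the pattern requires the word to end there.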


In [1]:
import re

### Find all matches of a pattern
`re.findall(r"regex",string)`

In [2]:
# We want to find all the matches of #movies in the following string
re.findall(r"#movies", "Love #movies! I had fun yesterday going to the #movies")

['#movies', '#movies']

### Split string at each match
`re.split(r"regex",string)`

In [3]:
# We want to split the specified string at every exclamation mark match
re.split(r"!", "Nice Place to eat! I'll come back! Excellent meat!")

['Nice Place to eat', " I'll come back", ' Excellent meat', '']
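
The trailing `''` in the output appears because the string ends with the delimiter, and the pieces keep the space that followed each `!`. One possible way to tidy the result, shown here only as a sketch, is to strip each piece and drop the empty ones.

```python
import re

review = "Nice Place to eat! I'll come back! Excellent meat!"

# Trim surrounding spaces and discard pieces that end up empty
sentences = [part.strip() for part in re.split(r"!", review) if part.strip()]
print(sentences)  # ['Nice Place to eat', "I'll come back", 'Excellent meat']
```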

### Replace one or many matches with a string:
`re.sub(r"regex",new,string)`

In [4]:
# We are replacing every match of 'yellow' with the word 'nice'
re.sub(r"yellow", "nice", "I have a yellow car and a yellow house in a yellow neighborhood")

'I have a nice car and a nice house in a nice neighborhood'

### Supported metacharacters

`\d`: Digit or number

In [5]:
# Find all matches of 'User' followed by a digit
re.findall(r"User\d", "The winners are: User9, UserN, User8")

['User9', 'User8']

`\D`: Non-digit

In [6]:
# Find all matches of 'User' followed by a non-digit
re.findall(r"User\D", "The winners are: User9, UserN, User8")

['UserN']

`\w`: Word character (letter, digit, or underscore)

In [7]:
# Find all matches of 'User' followed by a word character (letter, digit, or underscore)
re.findall(r"User\w", "The winners are: User9, UserN, User8")

['User9', 'UserN', 'User8']

`\W`: Non-word character

In [8]:
# Find the price in a string
re.findall(r"\W\d", "This skirt is on sale, only $5 today!")

['$5']

`\s`: Whitespace

In [9]:
# Find the pattern 'Data Science' (space included between both words)
re.findall(r"Data\sScience", "I enjoy learning Data Science")

['Data Science']

`\S`: Non-Whitespace

In [10]:
# Find the matches of 'ice' followed by (any non-space character), then followed by 'cream', replacing with 'ice cream'
re.sub(r"ice\Scream", "ice cream", "I really like ice-cream")

'I really like ice cream'

## Example #1: Are they bots?

In [11]:
sentiment_analysis = '@robot9! @robot4& I have a good feeling that the show isgoing to be amazing! @robot9$ @robot7%'

- Write a regex that matches the user mentions that start with `@` and follow the pattern, e.g. `@robot3!`.

In [12]:
# Write the regex
# regex = r"\W\w+\d\S"
regex = r"@robot\d\W"

- Find all the matches of the pattern in the `sentiment_analysis` variable.

In [13]:
# Find all matches of regex
print(re.findall(regex, sentiment_analysis))

['@robot9!', '@robot4&', '@robot9$', '@robot7%']


## Example #2: Find the numbers

In [14]:
sentiment_analysis = "Unfortunately one of those moments wasn't a giant squid monster. User_mentions:2, likes: 9, number of retweets: 7"
print(sentiment_analysis)

Unfortunately one of those moments wasn't a giant squid monster. User_mentions:2, likes: 9, number of retweets: 7


- Write a regex that matches the number of user mentions given as, for example, `User_mentions:9` in sentiment_analysis.

In [15]:
# Write a regex to obtain user mentions
print(re.findall(r"User_mentions:\d", sentiment_analysis))

['User_mentions:2']


- Write a regex that matches the number of likes given as, for example, `likes: 5` in `sentiment_analysis`.

In [16]:
# Write a regex to obtain number of likes
print(re.findall(r"likes:\s\d", sentiment_analysis))

['likes: 9']


- Write a regex that matches the number of retweets given as, for example, `number of retweets: 4` in `sentiment_analysis`.

In [17]:
# Write a regex to obtain number of retweets
print(re.findall(r"number\sof\sretweets:\s\d", sentiment_analysis))

['number of retweets: 7']


## Example #3: Match and split

In [18]:
sentiment_analysis = 'He#newHis%newTin love with$newPscrappy. #8break%He is&newYmissing him@newLalready'

- Write a regex that matches the pattern separating the sentences in `sentiment_analysis`, e.g. `&4break!`.

In [19]:
# Write a regex to match pattern separating sentences
regex_sentence = r"\W\dbreak\W"
print(re.findall(regex_sentence, sentiment_analysis))

['#8break%']


- Replace `regex_sentence` with a space `" "` in the variable `sentiment_analysis`. Assign it to `sentiment_sub`.

In [20]:
# Replace the regex_sentence with a space
sentiment_sub = re.sub(regex_sentence, " ", sentiment_analysis)
print(sentiment_sub)

He#newHis%newTin love with$newPscrappy.  He is&newYmissing him@newLalready


- Write a regex that matches the pattern separating the words in `sentiment_analysis`, e.g. `#newH`.

In [21]:
# Write a regex to match pattern separating words
regex_words = r"\Wnew\w"

- Replace `regex_words` with a space in the variable `sentiment_sub`. Assign it to `sentiment_final` and print out the result.

In [22]:
# Replace the regex_words and print the result
sentiment_final = re.sub(regex_words, " ", sentiment_sub)
print(sentiment_final)

He is in love with scrappy.  He is missing him already


# 2. Repetitions

### Quantifiers

In [23]:
import re
# We specify '\w' and '\d' are repeated 8 and 4 times, respectively
password = "password1234"
re.search(r"\w{8}\d{4}", password)

<re.Match object; span=(0, 12), match='password1234'>

### A character appears Once or more times: `+`

In [24]:
# Getting the matches of dates
text = "Date of start: 4-3. Date of registration: 10-04."
re.findall(r"\d+-\d+", text)

['4-3', '10-04']

### A character appears Zero times or more times: `*`

In [25]:
# We want to find all mentions of users starting with '@'
my_string = "The concert was amazing! @ameli!a @joh&&n @mary90"
re.findall(r"@\w+\W*\w+", my_string)

['@ameli!a', '@joh&&n', '@mary90']

### A character appears Zero times or just "once": `?`

In [26]:
# Finding the word color, regardless of its spelling variation
text = "The color of this image is amazing. However, the colour blue could be brighter."
re.findall(r"colou?r", text) # the "u" is optional: it may appear zero times or once

['color', 'colour']

### At least n times, at most m times: `{n,m}`

In [27]:
# Find all the matches for a phone number
phone_number = "John: 1-966-847-3131 Michelle: 54-908-42-42424"
re.findall(r"\d{1,2}-\d{3}-\d{2,3}-\d{4,}", phone_number)

['1-966-847-3131', '54-908-42-42424']

### Reminder: the quantifier applies only to the character immediately to its left
`r"apple+"` : `+` applies to e and not to apple
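
A minimal sketch of this reminder follows. It uses a non-capturing group `(?:...)`, which is not covered in this notebook, as one way to make the quantifier apply to the whole word.

```python
import re

# "+" repeats only the final "e"
only_e = re.fullmatch(r"apple+", "appleee")
not_whole_word = re.fullmatch(r"apple+", "appleapple")

# A non-capturing group makes "+" repeat the whole word
whole_word = re.fullmatch(r"(?:apple)+", "appleapple")

print(only_e is not None)      # True
print(not_whole_word is None)  # True
print(whole_word is not None)  # True
```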

## Example #1: Everything clean

In [28]:
import pandas as pd
sentiment_analysis = pd.read_csv('../data/short_tweets.csv')
sentiment_analysis = sentiment_analysis.iloc[545:548]
sentiment_analysis = sentiment_analysis["text"]

In [29]:
sentiment_analysis

545    Boredd. Colddd @blueKnight39 Internet keeps st...
546    I had a horrible nightmare last night @anitaLo...
547    im lonely  keep me company @YourBestCompany! @...
Name: text, dtype: object

- Write a regex to find all the matches of `http` links appearing in each `tweet` in `sentiment_analysis`. Print out the result.

In [30]:
# Import re module
import re

for tweet in sentiment_analysis:
	# Write regex to match http links and print out result
	print(re.findall(r"http\S+", tweet))

['https://www.tellyourstory.com']
[]
['https://radio.foxnews.com']


- Write a regex to find all the matches of user mentions appearing in each `tweet` in `sentiment_analysis`. Print out the result.

In [31]:
for tweet in sentiment_analysis:
    # Write regex to match user mentions and print out result
    print(re.findall(r"@\w+", tweet))

['@blueKnight39']
['@anitaLopez98', '@MyredHat31']
['@YourBestCompany', '@foxRadio']


## Example #2: Some time ago

In [32]:
sentiment_analysis = pd.read_csv('../data/short_tweets.csv')
sentiment_analysis = sentiment_analysis.iloc[232:235]
sentiment_analysis = sentiment_analysis["text"]
sentiment_analysis

232    I would like to apologize for the repeated Vid...
233    @zaydia but i cant figure out how to get there...
234    FML: So much for seniority, bc of technologica...
Name: text, dtype: object

- Complete the for-loop with a regex that finds all dates in a format similar to `27 minutes ago` or `4 hours ago`.

In [33]:
# Complete the for loop with a regex to find dates
for date in sentiment_analysis:
	print(re.findall(r"\d{1,2}\s\w+\sago", date))

['32 minutes ago']
[]
[]


- Complete the for-loop with a regex that finds all dates in a format similar to `23rd june 2018`.

In [34]:
# Complete the for loop with a regex to find dates
for date in sentiment_analysis:
	print(re.findall(r"\d{1,2}\w+\s\w+\s\d{4}", date))

[]
['1st May 2019']
['23rd June 2018']


- Complete the for-loop with a regex that finds all dates in a format similar to `1st september 2019 17:25`.

In [35]:
# Complete the for loop with a regex to find dates
for date in sentiment_analysis:
	print(re.findall(r"\d{1,2}\w+\s\w+\s\d{4}\s\d{1,2}:\d{2}", date))    

[]
[]
['23rd June 2018 17:54']


## Example #3: Getting tokens

In [36]:
sentiment_analysis = 'ITS NOT ENOUGH TO SAY THAT IMISS U #MissYou #SoMuch #Friendship #Forever'

- Write a regex that matches the described hashtag pattern. Assign it to the `regex` variable.

In [37]:
# Write a regex matching the hashtag pattern
regex = r"#\w+"

- Replace all the matches of the regex with an empty string `""`. Assign it to `no_hashtag` variable.

In [38]:
# Replace the regex by an empty string
no_hashtag = re.sub(regex, "", sentiment_analysis)

- Split the text in the `no_hashtag` variable at every match of one or more consecutive whitespace.

In [39]:
# Get tokens by splitting text
print(re.split(r"\s+", no_hashtag))

['ITS', 'NOT', 'ENOUGH', 'TO', 'SAY', 'THAT', 'IMISS', 'U', '']


# 3. Regex metacharacters
- `re.search(r"regex",string)`
- `re.match(r"regex",string)`

In [42]:
# Finding a digit appearing four times (.search)
re.search(r"\d{4}","4506 people attend the show")

<re.Match object; span=(0, 4), match='4506'>

In [43]:
# Finding a digit appearing four times (.match)
re.match(r"\d{4}","4506 people attend the show")

<re.Match object; span=(0, 4), match='4506'>

The difference is that `.match` is anchored at the beginning of the string.

In [48]:
# Finding a match for a digit with .search()
print(re.search(r"\d+","Yesterday, I saw 3 shows"))

<re.Match object; span=(17, 18), match='3'>


In [47]:
# Finding a match for a digit with .match()
print(re.match(r"\d+","Yesterday, I saw 3 shows"))

None


`.match()` returned `None` because the string does not start with a digit, so the regex fails at the very first character.
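
To make the anchoring explicit, the sketch below compares `re.search`, `re.search` with a leading `^`, and `re.fullmatch`, which succeeds only when the entire string matches. The variable names are illustrative.

```python
import re

text = "Yesterday, I saw 3 shows"

plain_search = re.search(r"\d+", text)        # finds the digit anywhere
anchored_search = re.search(r"^\d+", text)    # "^" anchors the search at the start of the string

whole_string = re.fullmatch(r"\d+", "4506")   # the entire string must match
partial = re.fullmatch(r"\d+", "4506 people attend the show")

print(plain_search.group())   # 3
print(anchored_search)        # None
print(whole_string.group())   # 4506
print(partial)                # None
```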

### Match any character (except newline): `.`

In [49]:
# we need to match links in the string
my_links = "Just check out this link: www.amazingpics.com. It has amazing photos!"
re.findall(r"www.+com", my_links)

['www.amazingpics.com']
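
Because `.+` is greedy and `.` matches almost anything, this pattern can run across several links when a string contains more than one. The sketch below uses a hypothetical string with two links and shows a stricter variant with escaped dots and a lazy `\S+?`.

```python
import re

two_links = "Visit www.a.com and www.b.com"  # hypothetical text with two links

loose = re.findall(r"www.+com", two_links)
strict = re.findall(r"www\.\S+?\.com", two_links)

print(loose)   # ['www.a.com and www.b.com']
print(strict)  # ['www.a.com', 'www.b.com']
```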

### Start of the string: `^`

In [50]:
# Find the pattern: 't', 'h', 'e', a whitespace, one or more digits, and a final 's'
my_string = "the 80s music was much better that the 90s"
re.findall(r"the\s\d+s", my_string)

['the 80s', 'the 90s']

In [51]:
# Start of the string: ^
re.findall(r"^the\s\d+s", my_string)

['the 80s']

### End of the string: `$`

In [52]:
re.findall(r"the\s\d+s$", my_string)

['the 90s']
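
By default `^` and `$` refer to the start and end of the whole string. As a side note, the `re.MULTILINE` flag makes them also match at the start and end of each line. The two-line string below is hypothetical.

```python
import re

lines = "the 80s\nthe 90s"  # hypothetical two-line string

default_start = re.findall(r"^the\s\d+s", lines)
per_line_start = re.findall(r"^the\s\d+s", lines, flags=re.MULTILINE)

print(default_start)   # ['the 80s']
print(per_line_start)  # ['the 80s', 'the 90s']
```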

### Escape special characters: `\`

In [53]:
# We want to split the string by dot whitespace.
my_string = "I love the music of Mr.Go. However, the sound was too loud."
print(re.split(r".\s", my_string))

['', 'lov', 'th', 'musi', 'o', 'Mr.Go', 'However', 'th', 'soun', 'wa', 'to', 'loud.']


That was not what we expected. **Instead:**

In [54]:
# Escape special characters (the right one): \
my_string = "I love the music of Mr.Go. However, the sound was too loud."
print(re.split(r"\.\s", my_string))

['I love the music of Mr.Go', 'However, the sound was too loud.']
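
When the separator is plain text rather than a handwritten pattern, `re.escape` can add the backslashes for you. The sketch below applies it to the same split; the variable names are illustrative.

```python
import re

my_string = "I love the music of Mr.Go. However, the sound was too loud."

separator = ". "                  # literal text to split on
escaped = re.escape(separator)    # escapes the dot so it is matched literally
parts = re.split(escaped, my_string)

print(parts)  # ['I love the music of Mr.Go', 'However, the sound was too loud.']
```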


### OR operator with character: `|`

In [56]:
# we want to match the word elephant
my_string = "Elephants are the world's largest land animal! I would love to see an elephant one day"
re.findall(r"Elephant|elephant", my_string)

['Elephant', 'elephant']
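
When the alternatives differ only in letter case, two other options give the same result on this string: a character set for the first letter (sets are covered in the next subsection) or the `re.IGNORECASE` flag. This is a sketch of both, not a replacement for `|`.

```python
import re

my_string = "Elephants are the world's largest land animal! I would love to see an elephant one day"

with_set = re.findall(r"[Ee]lephant", my_string)
ignoring_case = re.findall(r"elephant", my_string, flags=re.IGNORECASE)

print(with_set)       # ['Elephant', 'elephant']
print(ignoring_case)  # ['Elephant', 'elephant']
```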

### OR operator - Set of characters: `[ ]`

In [58]:
# We want to find a pattern that contains lowercase or uppercase letter followed by a digit
my_string = "Yesterday I spent my afternoon with my friends: MaryJohn2 Clary3"
re.findall(r"[a-zA-Z]+\d", my_string)

['MaryJohn2', 'Clary3']

In [59]:
# We want to replace the non-word characters by whitespace
my_string = "My&name&is#John Smith. I%live$in#London."

# Any one of the characters listed inside [] can match
re.sub(r"[#$%&]"," ", my_string)

'My name is John Smith. I live in London.'

### OR operator - Set of characters: `[ ]` with `^`
- Placed first inside the brackets, **`^`** negates the set: it matches any character *not* listed

In [60]:
# the ^ specifies we want the links that do not contain any number
my_links = "Bad website: www.99.com. Favorite site: www.hola.com"
re.findall(r"www[^0-9]+com", my_links)

['www.hola.com']

## Example #1: Finding files

You are not satisfied with your tweets dataset cleaning. There are still extra strings that do not provide any sentiment. Among them are strings that refer to text file names.

You also find a way to detect them:

- They appear at the start of the string.
- They always start with a sequence of 2 or 3 upper or lowercase vowels (a e i o u).
- They always finish with the `txt` ending.

You are not sure if you should remove them directly. So you write a script to find and store them in a separate dataset `sentiment_analysis`.

You write down some metacharacters to help you: `^` anchor to beginning, `.` any character.

In [62]:
import pandas as pd
import re
sentiment_analysis = pd.read_csv('../data/short_tweets.csv')
sentiment_analysis = sentiment_analysis.iloc[780:782]
sentiment_analysis = sentiment_analysis["text"]
sentiment_analysis

780    AIshadowhunters.txt aaaaand back to my literat...
781    ouMYTAXES.txt I am worried that I won't get my...
Name: text, dtype: object

- Write a regex that matches the pattern of the text file names, e.g. `aemyfile.txt`.

In [73]:
# Write a regex to match text file name
regex = r"^[aeiouAEIOU]{2,3}.+txt"

- Find all matches of the regex in the elements of `sentiment_analysis`. Print out the result.

In [74]:
for text in sentiment_analysis:
	# Find all matches of the regex
	print(re.findall(regex, text))

['AIshadowhunters.txt']
['ouMYTAXES.txt']


- Replace all matches of the regex with an empty string `""`. Print out the result.

In [75]:
# Replace all matches with empty string
# (this runs outside the loop, so `text` is the last element of sentiment_analysis)
print(re.sub(regex, "", text))

 I am worried that I won't get my $900 even though I paid tax last year
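
The cell above runs `re.sub` once, on whatever `text` held after the loop finished, which is why only one tweet is printed. A sketch that cleans every element could look like the following. The two strings are hypothetical stand-ins, since the real tweets are truncated in the output shown earlier.

```python
import re

regex = r"^[aeiouAEIOU]{2,3}.+txt"

# Hypothetical stand-ins for the two tweets
tweets = ["AIshadowhunters.txt back to reading", "ouMYTAXES.txt I am worried"]

# Apply the substitution to each element
cleaned = [re.sub(regex, "", tweet) for tweet in tweets]
print(cleaned)  # [' back to reading', ' I am worried']
```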


## Example #2: Give me your email

A colleague has asked for your help! When a user signs up on the company website, they must provide a valid email address.
The company puts some rules in place to verify that the given email address is valid:

- The first part can contain:
    - Upper `A-Z` or lowercase letters `a-z`
    - Numbers
    - Characters: `!`, `#`, `%`, `&`, `*`, `$`, `.`
- Must have `@`
- Domain:
    - Can contain any word characters
    - But only `.com` ending is allowed

The project consists of writing a script that checks whether an email address follows the correct pattern. Your colleague gave you a list of email addresses as examples to test.

In [76]:
emails = ['n.john.smith@gmail.com', '87victory@hotmail.com', '!#mary-=@msca.net']

- Write a regular expression to match valid email addresses as described.

In [86]:
# Write a regex to match a valid email address
regex = r"[A-Za-z0-9!#%&*\$\.]+@\w+\.com"

- Match the regex to the elements contained in `emails`.
- Print out a message indicating whether each email is valid by completing the `.format()` statement.

In [87]:
for example in emails:
    # Match the regex to the string
    if re.match(regex, example):
        # Complete the format method to print out the result
        print("The email {email_example} is a valid email".format(email_example=example))
    else:
        print("The email {email_example} is invalid".format(email_example=example))

The email n.john.smith@gmail.com is a valid email
The email 87victory@hotmail.com is a valid email
The email !#mary-=@msca.net is invalid
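
Note that `re.match` only anchors the start of the string, so an address with extra characters after `.com` would still pass. If the rule "only `.com` ending is allowed" should be enforced strictly, one option is `re.fullmatch`. The address below is a hypothetical test case.

```python
import re

regex = r"[A-Za-z0-9!#%&*\$\.]+@\w+\.com"
candidate = "n.john.smith@gmail.community"  # hypothetical address with a longer ending

prefix_ok = re.match(regex, candidate) is not None     # the start of the string matches
whole_ok = re.fullmatch(regex, candidate) is not None  # the entire string must match

print(prefix_ok)  # True
print(whole_ok)   # False
```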


## Example #3: Invalid password

The second part of the website project is to write a script that validates the password entered by the user. The company also sets some rules for a valid password:

- It can contain lowercase `a-z` and uppercase letters `A-Z`
- It can contain numbers
- It can contain the symbols: `*`, `#`, `$`, `%`, `!`, `&`, `.`
- It must be at least 8 characters long but not more than 20

Your colleague also gave you a list of passwords as examples to test.

In [88]:
passwords = ['Apple34!rose', 'My87hou#4$', 'abc123']

- Write a regular expression to check if the passwords are valid according to the description.

In [89]:
# Write a regex to check if the password is valid
regex = r"[a-zA-Z0-9*#$%!&.]{8,20}"

- Search the elements in the `passwords` list to find out if they are valid passwords.

- Print out a message indicating whether each password is valid by completing the `.format()` statement.

In [90]:
for example in passwords:
    # Scan the strings to find a match
    if re.findall(regex, example):
        # Complete the format method to print out the result
        print("The password {pass_example} is a valid password".format(pass_example=example))
    else:
        print("The password {pass_example} is invalid".format(pass_example=example))

The password Apple34!rose is a valid password
The password My87hou#4$ is a valid password
The password abc123 is invalid
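
Scanning with `re.findall` accepts any password that merely contains a run of 8 to 20 allowed characters, so it does not enforce the upper limit or reject disallowed characters elsewhere in the string. A stricter sketch, assuming the whole password must satisfy the rules, uses `re.fullmatch`. The first two test passwords are hypothetical.

```python
import re

regex = r"[a-zA-Z0-9*#$%!&.]{8,20}"

# Hypothetical test passwords: too long, contains a space, and one from the list above
candidates = ["a" * 25, "pass word1234", "Apple34!rose"]

results = {}
for candidate in candidates:
    found_somewhere = bool(re.findall(regex, candidate))       # substring check
    whole_string = re.fullmatch(regex, candidate) is not None  # entire-string check
    results[candidate] = (found_somewhere, whole_string)
    print(candidate, found_somewhere, whole_string)
```

The first two candidates pass the substring check but fail the entire-string check, while `Apple34!rose` passes both.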


# 4. Greedy vs. non-greedy matching

Standard quantifiers are greedy by default: `*`, `+`, `?`, `{n,m}`

### Greedy matching: 
- Match as many characters as possible
- Return the longest match

In [1]:
# We want to find a pattern that has one or more digits
import re
re.match(r"\d+","12345bcada")

# It will start by matching the first digit found '1' and will stop when no other digit can be matched

<re.Match object; span=(0, 5), match='12345'>

- Backtracks when too many characters are matched
- Gives up characters one at a time

In [2]:
# .* to find anything, zero or more times, followed by the letters "h" "e" "l" "l" "o". 
import re
re.match(r".*hello","xhelloxxxxxx")

# We can see here that it returns the pattern 'xhello'.

<re.Match object; span=(0, 6), match='xhello'>

### Non-greedy matching (or, lazy operators):
- **Lazy:** match as few characters as needed
- Returns the shortest match
- Append `?` to greedy quantifiers

In [3]:
# The same code but just returning the pattern '1' (the first digit)
import re
re.match(r"\d+?","12345bcada")

<re.Match object; span=(0, 1), match='1'>

- Backtracks when too few characters are matched
- Expands one character at a time

In [4]:
# The same code, but matching as little as possible
import re
re.match(r".*?hello","xhelloxxxxxx")

<re.Match object; span=(0, 6), match='xhello'>
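
On this particular string the greedy and lazy versions return the same match, because `hello` occurs only once. The difference shows up when the target appears more than once, as in this sketch with a hypothetical string.

```python
import re

text = "xhelloxxhello"  # hypothetical string containing "hello" twice

greedy = re.match(r".*hello", text).group()   # runs to the last "hello"
lazy = re.match(r".*?hello", text).group()    # stops at the first "hello"

print(greedy)  # xhelloxxhello
print(lazy)    # xhello
```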

## Example #1: Understanding the difference

In [1]:
string = 'I want to see that <strong>amazing show</strong> again!'

- Write a `regex` expression to replace **HTML tags** with an empty string. Print out the result.

In [3]:
# Import re
import re

# Write a regex to eliminate tags
string_notags = re.sub(r"<.+?>", "", string)

# Print out the result
print(string_notags)

I want to see that amazing show again!
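
For comparison, here is a sketch of what happens when the `?` is left out. The greedy `.+` runs from the first `<` to the last `>`, so the text between the tags is removed as well.

```python
import re

string = 'I want to see that <strong>amazing show</strong> again!'

greedy_result = re.sub(r"<.+>", "", string)
lazy_result = re.sub(r"<.+?>", "", string)

print(greedy_result)  # I want to see that  again!
print(lazy_result)    # I want to see that amazing show again!
```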


## Example #2: Greedy matching

In [4]:
sentiment_analysis = 'Was intending to finish editing my 536-page novel manuscript tonight, but that will probably not happen. And only 12 pages are left '
print(sentiment_analysis)

Was intending to finish editing my 536-page novel manuscript tonight, but that will probably not happen. And only 12 pages are left 


- Use a lazy quantifier to match all numbers that appear in the variable `sentiment_analysis`.

In [5]:
# Write a lazy regex expression 
numbers_found_lazy = re.findall(r"[0-9]+?", sentiment_analysis)

# Print out the result
print(numbers_found_lazy)

['5', '3', '6', '1', '2']


- Now, use a greedy quantifier to match all numbers that appear in the variable `sentiment_analysis`.

In [6]:
# Write a greedy regex expression 
numbers_found_greedy = re.findall(r"[0-9]+", sentiment_analysis)

# Print out the result
print(numbers_found_greedy)

['536', '12']


## Example #3: Lazy approach

In [7]:
sentiment_analysis = "Put vacation photos online (They were so cute) a few yrs ago. PC crashed, and now I forget the name of the site (I'm crying). "
print(sentiment_analysis)

Put vacation photos online (They were so cute) a few yrs ago. PC crashed, and now I forget the name of the site (I'm crying). 


- Use a greedy quantifier to match text that appears within parentheses in the variable `sentiment_analysis`.

In [8]:
# Write a greedy regex expression to match 
sentences_found_greedy = re.findall(r"\(.*\)", sentiment_analysis)

# Print out the result
print(sentences_found_greedy)

["(They were so cute) a few yrs ago. PC crashed, and now I forget the name of the site (I'm crying)"]


- Now, use a lazy quantifier to match text that appears within parentheses in the variable `sentiment_analysis`.

In [9]:
# Write a lazy regex expression
sentences_found_lazy = re.findall(r"\(.*?\)", sentiment_analysis)

# Print out the results
print(sentences_found_lazy)

['(They were so cute)', "(I'm crying)"]
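
A negated character set is another way to get the same result without a lazy quantifier: `[^)]*` cannot run past a closing parenthesis. This is offered as an alternative sketch, and it relies on the `[^ ]` sets introduced in section 3.

```python
import re

sentiment_analysis = "Put vacation photos online (They were so cute) a few yrs ago. PC crashed, and now I forget the name of the site (I'm crying). "

# Match "(", then any characters except ")", then ")"
sentences_found_set = re.findall(r"\([^)]*\)", sentiment_analysis)
print(sentences_found_set)  # ['(They were so cute)', "(I'm crying)"]
```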
