### What is a regular expression?

A regular expression, or regex, is a string that combines normal and special characters to describe a pattern to search for within a text. 

This sounds complicated, so let's break it down. Here is an example of what a regular expression looks like. 


<img src="re.jpg" style="max-width:600px">


In Python, the `r` at the beginning indicates a raw string, in which backslashes are not treated as escape characters. Using it for regex patterns is advisable.


We said that a regex contains normal characters, also called literal characters. Normal characters match themselves. In the case shown, `st` matches exactly an `s` followed by a `t`.


A regex also contains special characters. `Metacharacters` represent `types of characters`. Let's look at them one by one. 


- `\d` represents a `digit`


- `\s` represents a `whitespace`


- `\w` represents a `word character`

Metacharacters also represent ideas, such as location or quantity.


In the example, the 3 and 10 inside curly braces indicate that the character immediately to the left, in this case backslash w, should appear between 3 and 10 times.
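
As a rough sketch of how these pieces combine, the snippet below uses a pattern consistent with the description above: the literal `st`, then a digit, a whitespace, and 3 to 10 word characters. The exact pattern in the figure is not reproduced in the text, so both the pattern and the sample string here are assumptions made for illustration.

```python
import re

# Assumed pattern: literal "st", one digit, one whitespace, 3 to 10 word characters
pattern = r"st\d\s\w{3,10}"

# Hypothetical sample text, not taken from the figure
sample = "I saw st1 matching words"

matches = re.findall(pattern, sample)
print(matches)
```

Each part of the pattern consumes its own piece of the text: `st` matches itself, `\d` matches `1`, `\s` matches the space, and `\w{3,10}` matches the word that follows.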

#### Pattern:

We said that regex describes a pattern. A pattern is a sequence of characters that maps to words or punctuation.

As a data scientist, you will use pattern matching 

- To find and replace specific text. 

- To validate strings such as passwords or email addresses. 

Why use regex? They are very powerful and fast. They allow you to search for complex patterns that would be very difficult to find otherwise.

### The re module

Python has a useful library, the `re` module, to handle regex. You import it with `import re`. Let's see how it works. 


### re.findall() method:

To find all matches of a pattern, we use the `.findall()` method. It takes two arguments: the regex and the string. 

<img src="fi.jpg" style="max-width:600px">

In [1]:
import re

In [2]:
re.findall(r"#movies", "Love #movies! I had fun yesterday going to the #movies")

['#movies', '#movies']

In the code, we want to find all the matches of `#movies` in the specified string. The method returns a list with the two matches found.

###  re.split() method

To split a string at each pattern match, we can use the `.split()` method. 

<img src="sp.jpg" style="max-width:600px">

In the example, we want to split the specified string at every exclamation mark. The method returns a list of the substrings, as you can see in the output.

In [3]:
re.split(r"!", "Nice Place to eat! I'll come back! Excellent meat!")

['Nice Place to eat', " I'll come back", ' Excellent meat', '']
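
The last element of the output is an empty string because the text ends with an exclamation mark, so there is nothing after the final split point. A minimal sketch of one way to tidy the result, assuming you want neither empty pieces nor leading spaces (the variable names are hypothetical):

```python
import re

review = "Nice Place to eat! I'll come back! Excellent meat!"

# re.split leaves an empty string after the final "!"
parts = re.split(r"!", review)

# Keep only non-empty pieces and trim surrounding whitespace
sentences = [part.strip() for part in parts if part.strip()]
print(sentences)
```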

### re.sub() method

Finally, we can replace every pattern match with another string using the `.sub()` method. It takes three arguments: the regex, the replacement, and the string. 

<img src="su.jpg" style="max-width:600px">

In the example, we replace every match of yellow with the word nice. We get the following output.

In [4]:
re.sub(r"yellow", "nice", "I have a yellow car and a yellow house in a yellow neighborhood")

'I have a nice car and a nice house in a nice neighborhood'

### Supported metacharacters

Let's look at the supported metacharacters. 

In the example, we want to find all matches of the pattern containing User followed by a digit. 


We use backslash d to represent the digit. We get the following matches. 


Next, we find matches of the pattern containing User followed by a non-digit. In that case, we use backslash capital D, obtaining the following match.

- `\d` >>>  `Digit`


- `\D` >>>  `Non-Digit`

In [5]:
## Find all matches of the patterns containing User followed by a number.
re.findall(r"User\d", "The winners are: User9, UserN, User8")

['User9', 'User8']

In [6]:
## Find matches of the pattern containing User followed by a non-digit.
re.findall(r"User\D", "The winners are: User9, UserN, User8")

['UserN']

### Supported metacharacters

If we want to find all matches of the pattern containing User followed by any word character (a letter, a digit, or an underscore), we can use backslash w. We get all of the following matches. 

In the next example, we need to find the price in a string. We use backslash capital W, which matches any non-word character and here matches the dollar sign, followed by a digit, obtaining the following output.


- `\w` >>>  `Word`


- `\W` >>>  `Non-Word`

In [7]:
## Find all matches of the pattern containing 'User' followed by 'any digit' or 'normal character',
re.findall(r"User\w", "The winners are: User9, UserN, User8")

['User9', 'UserN', 'User8']

In [8]:
## find the price in a string, for that find all matches that match the dollar sign followed by a digit 
re.findall(r"\W\d", "This skirt is on sale, only $5 today!")

['$5']

### Supported metacharacters

Finally, we use backslash s to specify the pattern Data, whitespace, Science, getting the following match. 

In the second example, we use backslash capital S to detect the matches of ice, followed by any non-whitespace character, followed by cream, and replace them with the words ice cream.

- `\s` >>>  `Whitespace`


- `\S` >>>  `Non-Whitespace`

In [9]:
re.findall(r"Data\sScience", "I enjoy learning Data Science")

['Data Science']

In [10]:
re.sub(r"ice\Scream", "ice cream", "I really like ice-cream")

'I really like ice cream'
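
Note that `\s` is not limited to the space character. The sketch below, using a made-up string, shows it also matching a tab and a newline:

```python
import re

# Hypothetical string with a space, a tab, and a newline between the words
text = "Data Science, Data\tScience, Data\nScience"

whitespace_matches = re.findall(r"Data\sScience", text)

# All three separators count as whitespace
print(len(whitespace_matches))
```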

### Exercise 1: Are they bots?

The company that you are working for asked you to perform a sentiment analysis using a dataset with tweets. First of all, you need to do some cleaning and extract some information.

While printing out some text, you realize that some tweets contain user mentions. Some of these mentions follow a very strange pattern. A few examples that you notice: `@robot3!`, `@robot5&` and `@robot7#`

To analyze if those users are bots, you will do a proof of concept with one tweet and extract them using the `.findall()` method.

You write down some helpful metacharacters to help you later:

- `\d`: digit
- `\w`: word character
- `\W`: non-word character
- `\s`: whitespace

The text of one tweet was saved in the variable `sentiment_analysis`. 

In [11]:
sentiment_analysis = '@robot9! @robot4& I have a good feeling that the show isgoing to be amazing! @robot9$ @robot7%'
sentiment_analysis

'@robot9! @robot4& I have a good feeling that the show isgoing to be amazing! @robot9$ @robot7%'

- Write a regex that matches the user mentions that start with `@` and follow the pattern, e.g. `@robot3!`.


- Find all the matches of the pattern in the `sentiment_analysis` variable.

In [12]:
## q1
regex = r"@robot\d\W"

In [13]:
## q2
re.findall(regex, sentiment_analysis)

['@robot9!', '@robot4&', '@robot9$', '@robot7%']

### Exercise 2: Find the numbers

While examining the tweet text in your dataset, you detect that some tweets carry more information. 

The text contains the number of retweets, user mentions, and likes. You decide to extract this important information that is given as in this example:

`Agh,snow! User_mentions:9, likes: 5, number of retweets: 4`

You pull a list of metacharacters: `\d digit,\w word character,\s whitespace.`

`Always indicate whitespace with metacharacters.`

The variable `sentiment_analysis2` contains the text of one tweet.

In [14]:
sentiment_analysis2 = "Unfortunately one of those moments wasn't a giant squid monster. User_mentions:2, likes: 9, number of retweets: 7"
sentiment_analysis2

"Unfortunately one of those moments wasn't a giant squid monster. User_mentions:2, likes: 9, number of retweets: 7"

- Write a regex that matches the number of user mentions given as, for example, `User_mentions:9` in `sentiment_analysis2`.


- Write a regex that matches the number of likes given as, for example, `likes: 5` in `sentiment_analysis2`.


- Write a regex that matches the number of retweets given as, for example, `number of retweets: 4` in `sentiment_analysis2`.

In [15]:
## q1: regex for user mention
re.findall(r"User_mentions:\d", sentiment_analysis2)

['User_mentions:2']

In [18]:
# same
re.findall(r"User_mentions\W\d", sentiment_analysis2)

['User_mentions:2']

In [20]:
## q2: regex for num. of likes
re.findall(r"likes:\s\d", sentiment_analysis2)

['likes: 9']

In [21]:
# same
re.findall(r"likes\W\s\d", sentiment_analysis2)

['likes: 9']

In [23]:
## q3: regex for num. of retweets
re.findall(r"number\sof\sretweets:\s\d", sentiment_analysis2)

['number of retweets: 7']

In [24]:
# same
re.findall(r"number\sof\sretweets\W\s\d", sentiment_analysis2)

['number of retweets: 7']

### Exercise 3: Match and split

Some of the tweets in your dataset were downloaded incorrectly. Instead of having spaces to separate words, they have strange characters. You decide to use regular expressions to handle this situation. You print some of these tweets to understand which pattern you need to match.

You notice that the `sentences` are always separated by a `special character`, followed by `a number`, the word `break`, and after that, another `special character`, e.g. `&4break!`. The `words` are always separated by a `special character`, the word `new`, and a `normal random character`, e.g. `#newH`.

The variable `sentiment_analysis3` contains the text of one tweet.

In [25]:
sentiment_analysis3 = 'He#newHis%newTin love with$newPscrappy. #8break%He is&newYmissing him@newLalready'
sentiment_analysis3

'He#newHis%newTin love with$newPscrappy. #8break%He is&newYmissing him@newLalready'

- Write a regex that matches the pattern separating the sentences in `sentiment_analysis3`, e.g. `&4break!`. Assign it to `regex_sentence`.


- Replace `regex_sentence` with a space `" "` in the variable `sentiment_analysis3`. Assign it to `sentiment_sub`.


- Write a regex that matches the pattern separating the words in `sentiment_analysis3`, e.g. `#newH`. Assign it to `regex_word`.


- Replace `regex_word` with a `space` in the variable `sentiment_sub`. Assign it to `sentiment_final` and print out the result.

In [28]:
## q1,q2
regex_sentence = r"\W\dbreak\W"
sentiment_sub = re.sub(regex_sentence," ", sentiment_analysis3)
sentiment_sub

'He#newHis%newTin love with$newPscrappy.  He is&newYmissing him@newLalready'

In [30]:
## q3,q4
regex_word = r"\Wnew\w"
sentiment_final = re.sub(regex_word, " ", sentiment_sub)
sentiment_final

'He is in love with scrappy.  He is missing him already'

### Repeated characters

Let's imagine that we are given the task of validating a password. It should contain eight word characters followed by four digits.


To search for a pattern, we can use the `.search()` method. It takes the regex and the string, and tells us whether there is a match. 

Let's apply what we have learned so far. We use backslash w eight times to match the first part and backslash d four times to match the last part. The method finds a match.

In [42]:
password ="password1234"
re.search(r"\w\w\w\w\w\w\w\w\d\d\d\d", password)

<re.Match object; span=(0, 12), match='password1234'>

### Quantifiers

But this is cumbersome for longer regexes. Instead, we can use quantifiers. 

A quantifier is a metacharacter that specifies how many times the character to its left must be matched. 

<img src="q.jpg" style="max-width:600px">

In our example, we specify that backslash w is repeated 8 times and backslash d four times. And we get a match as seen in the output.

In [43]:
password ="password1234"
re.search(r"\w{8}\d{4}", password)

<re.Match object; span=(0, 12), match='password1234'>
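
To check that the quantified form behaves like the spelled-out one, here is a small sketch that runs both patterns over a few made-up candidates. The candidate strings are assumptions chosen for illustration. Note that `\w` also accepts digits and underscores, so an all-digit string passes too.

```python
import re

long_form = r"\w\w\w\w\w\w\w\w\d\d\d\d"
short_form = r"\w{8}\d{4}"

# Hypothetical candidates
candidates = ["password1234", "pass1234", "123456789012"]

# Result of the quantified pattern for each candidate
results = {c: bool(re.search(short_form, c)) for c in candidates}

# Both patterns should agree on every candidate
agree = all(bool(re.search(long_form, c)) == results[c] for c in candidates)

print(results)
print(agree)
```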

### Quantifiers ("+")

In the following string, we want to match the dates. We can see that the pattern is one or multiple digits, dash, and again one or multiple digits. 

> The `"+"` metacharacter indicates a character that appears one or more times. 

Let's construct the regex from simple to complex. We will use dot findall method as you can see in the code.

In [44]:
text ="Date of start: 4-3. Date of registration: 10-04."
re.findall(r"\d+-\d+", text)

['4-3', '10-04']

We indicate that the digit, `\d`, should appear one or more times by adding the `+` quantifier. Then comes a dash, `-`, and again a digit, `\d`, which should also appear one or more times, so we use the `plus` quantifier again. We get the two matches that we expected, as seen in the output.

###  Quantifiers("*")

To indicate that a character should appear zero or more times, we can use the `star(*)` metacharacter. 

In this example, we want to find all user mentions, which start with an `@`. We notice that they may or may not contain non-word characters in the middle. So, we construct our regex as seen in the code: 


an `@`, followed by `backslash w plus`, `backslash capital w` with `star` to indicate a `non-word character zero or more times`, then `backslash w plus`. And we get the matches seen in the output.

In [45]:
my_string = "The concert was amazing! @ameli!a @joh&&n @mary90"
re.findall(r"@\w+\W*\w+", my_string)

['@ameli!a', '@joh&&n', '@mary90']

### Quantifiers ("?")

Another helpful quantifier is the `question mark`. It indicates that a character should appear `zero times or once`. 

In the example, we need to find matches for the word `color`, which has spelling variations. 


So, our regex is c, o, l, o, u which can appear once or zero times, so we add the question mark after it, then r. With this regex, we get the two matches wanted.

In [46]:
text = "The color of this image is amazing. However, the colour blue could be brighter."
re.findall(r"colou?r", text)

['color', 'colour']

### Quantifiers(curly braces: {n,m})

Finally, we can use the curly braces to indicate a specific `minimum` and `maximum` number of times. 

In the example, we want to find all matches for a phone number. We'll use the dot findall method and construct our regex step by step.


As we can see, the pattern is a digit once or twice, then a dash. Then a digit three times, then a dash. Then a digit two or three times, then a dash. And finally a digit again. We indicate that this last digit needs to appear at least four times by leaving the second argument blank. 

The regex engine returns two matches as expected.

In [47]:
phone_number ="John: 1-966-847-3131 Michelle: 54-908-42-42424"
re.findall(r"\d{1,2}-\d{3}-\d{2,3}-\d{4,}", phone_number)

['1-966-847-3131', '54-908-42-42424']
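
The different curly-brace forms can be compared side by side. The sketch below is only an illustration: the token list is made up, and it uses `re.fullmatch()`, which is not covered in this section, so that each pattern has to account for the whole token.

```python
import re

# Hypothetical tokens of different lengths
tokens = ["7", "42", "365", "2024", "86400"]

# {3}: exactly three digits
exactly_three = [t for t in tokens if re.fullmatch(r"\d{3}", t)]

# {3,}: at least three digits (upper bound left blank)
at_least_three = [t for t in tokens if re.fullmatch(r"\d{3,}", t)]

# {0,3}: at most three digits
at_most_three = [t for t in tokens if re.fullmatch(r"\d{0,3}", t)]

# {2,4}: between two and four digits
two_to_four = [t for t in tokens if re.fullmatch(r"\d{2,4}", t)]

print(exactly_three, at_least_three, at_most_three, two_to_four)
```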

In [23]:
import pandas as pd

In [25]:
df = pd.read_csv("short_tweets.csv")
df.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467821085,Mon Apr 06 22:22:26 PDT 2009,NO_QUERY,crzy_cdn_bulas,our duck and chicken are taking wayyy too long...
1,0,1467821338,Mon Apr 06 22:22:30 PDT 2009,NO_QUERY,justnetgirl,Put vacation photos online (They were so cute)...
2,0,1467821455,Mon Apr 06 22:22:32 PDT 2009,NO_QUERY,CiaraRenee,I need a hug
3,0,1467821715,Mon Apr 06 22:22:37 PDT 2009,NO_QUERY,deelau,"@andywana Not sure what they are, only that th..."
4,0,1467822384,Mon Apr 06 22:22:47 PDT 2009,NO_QUERY,Lindsey0920,@oanhLove I hate when that happens...


In [4]:
print(df["text"].iloc[545])

Boredd. Colddd @blueKnight39 Internet keeps stuffing up. Save me! https://www.tellyourstory.com


In [5]:
print(df["text"].iloc[546])

I had a horrible nightmare last night @anitaLopez98 @MyredHat31 which affected my sleep, now I'm really tired


In [6]:
print(df["text"].iloc[547])

im lonely  keep me company @YourBestCompany! @foxRadio https://radio.foxnews.com 22 female, new york


### Exercise 4: Everything clean

Back to your Twitter sentiment analysis project! There are several types of strings that increase your sentiment analysis complexity. But these strings do not provide any useful sentiment. Among them, we can have links and user mentions.

In order to clean the tweets, you want to extract some examples first. You know that most of the time `links` start with `http` and do not contain any whitespace, e.g. `https://www.datacamp.com`. `User mentions` start with `@` and can have `letters` and `numbers` only, e.g. `@johnsmith3`.

You write down some helpful quantifiers to help you: `* zero or more times`, `+ once or more`, `? zero or once`.

The list `sentiment_analysis4`, containing the text of three tweets, is already loaded in your session. 

In [50]:
sentiment_analysis4 = [df["text"].iloc[545], df["text"].iloc[546], df["text"].iloc[547]]
sentiment_analysis4

['Boredd. Colddd @blueKnight39 Internet keeps stuffing up. Save me! https://www.tellyourstory.com',
 "I had a horrible nightmare last night @anitaLopez98 @MyredHat31 which affected my sleep, now I'm really tired",
 'im lonely  keep me company @YourBestCompany! @foxRadio https://radio.foxnews.com 22 female, new york']

- Write a regex to find all the matches of `http` links appearing in each tweet in sentiment_analysis4. Print out the result.


- Write a regex to find all the matches of `user mentions` appearing in each tweet in sentiment_analysis4. Print out the result.

In [54]:
for tweets in sentiment_analysis4:
    ## q1
    print(re.findall(r"https?\W+\w+\W\w+\W\w+",tweets))
    
    ## q2
    print(re.findall(r"@\w+\d*", tweets), "\n")

['https://www.tellyourstory.com']
['@blueKnight39'] 

[]
['@anitaLopez98', '@MyredHat31'] 

['https://radio.foxnews.com']
['@YourBestCompany', '@foxRadio'] 
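
The link regex above depends on the link having exactly three word parts, so it would miss or truncate other URLs. A possibly more direct sketch, following the exercise's own description ("start with `http` and do not contain any whitespace"), is shown below. The tweets are copied from the output above. `@\w+` is used for mentions, with the caveat that `\w` also accepts underscores.

```python
import re

tweets = [
    "Boredd. Colddd @blueKnight39 Internet keeps stuffing up. Save me! https://www.tellyourstory.com",
    "im lonely  keep me company @YourBestCompany! @foxRadio https://radio.foxnews.com 22 female, new york",
]

# "http" followed by one or more non-whitespace characters
all_links = [re.findall(r"http\S+", tweet) for tweet in tweets]

# "@" followed by one or more word characters
all_mentions = [re.findall(r"@\w+", tweet) for tweet in tweets]

print(all_links)
print(all_mentions)
```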



### Exercise 5: Some time ago

You are interested in knowing when the tweets were posted. After reading a little bit more, you learn that dates are provided in different ways. You decide to extract the dates using `.findall()` so you can normalize them afterwards to make them all look the same.

You realize that the dates are always presented in one of the following ways:

`27 minutes ago`

`4 hours ago`

`23rd june 2018`

`1st september 2019 17:25`

The list `sentiment_analysis5` contains the text of three tweets.

In [55]:
sentiment_analysis5 = [df["text"].iloc[232], df["text"].iloc[233], df["text"].iloc[234]]
sentiment_analysis5

['I would like to apologize for the repeated Video Games Live related tweets. 32 minutes ago',
 '@zaydia but i cant figure out how to get there / back / pay for a hotel 1st May 2019',
 'FML: So much for seniority, bc of technological ineptness 23rd June 2018 17:54']

- Use a for loop with a regex that finds all dates in a format similar to `27 minutes ago` or `4 hours ago`.


- Use a for loop with a regex that finds all dates in a format similar to `23rd june 2018`.


- Use a for loop with a regex that finds all dates in a format similar to `1st september 2019 17:25`.

In [69]:
## q1
for tweet in sentiment_analysis5:
    print(re.findall(r"\d+\s\w+\sago", tweet))
    
## q1 another way: one or two digits, so that e.g. "4 hours ago" also matches
for tweet in sentiment_analysis5:
    print(re.findall(r"\d{1,2}\s\w+\sago", tweet))
    

['32 minutes ago']
[]
[]
['32 minutes ago']
[]
[]


In [67]:
## q2
for tweet in sentiment_analysis5:    
    print(re.findall(r"\d+\w+\s\w+\s\d+", tweet))
    
## q2 another way
for tweet in sentiment_analysis5:    
    print(re.findall(r"\d{1,2}\w{2}\s\w+\s\d{4}", tweet))

[]
['1st May 2019']
['23rd June 2018']
[]
['1st May 2019']
['23rd June 2018']


In [70]:
## q3
for tweet in sentiment_analysis5:    
    print(re.findall(r"\d+\w+\s\w+\s\d+\s\d+:\d+", tweet))



## q3 another way
for tweet in sentiment_analysis5:    
    print(re.findall(r"\d{1,2}\w{2}\s\w+\s\d{4}\s\d{2}:\d{2}", tweet))

[]
[]
['23rd June 2018 17:54']
[]
[]
['23rd June 2018 17:54']


### Exercise 6: Getting tokens

Your next step is to `tokenize` the text of your tweets. `Tokenization is the process of breaking a string into lexical units or, in simpler terms, words.` 

But first, you need to remove `hashtags` so they do not cloud your process. You realize that hashtags start with a `#` symbol and contain letters and numbers but `never whitespace`. After that, you plan to `split` the text at whitespace matches to get the tokens.

You bring your list of quantifiers to help you: `* zero or more times`, `+ once or more`, `? zero or once`, `{n, m} minimum n, maximum m`.

The variable `sentiment_analysis6` contains the text of one tweet.

In [71]:
sentiment_analysis6 = 'ITS NOT ENOUGH TO SAY THAT IMISS U #MissYou #SoMuch #Friendship #Forever'

- Write a regex that matches the described hashtag pattern. Assign it to the `regex_var` variable.


- Replace all the matches of the regex with an empty string `""`. Assign it to the `no_hashtag` variable.


- Split the text in the `no_hashtag` variable at every match of one or more consecutive whitespace characters.

In [72]:
# q1
regex_var = r"#\w+"

In [74]:
# q2
no_hashtag = re.sub(regex_var, "", sentiment_analysis6)
no_hashtag

'ITS NOT ENOUGH TO SAY THAT IMISS U    '

In [78]:
# q3
splt = re.split(r"\s+", no_hashtag)
splt

['ITS', 'NOT', 'ENOUGH', 'TO', 'SAY', 'THAT', 'IMISS', 'U', '']
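
The trailing empty string appears because removing the hashtags leaves whitespace at the end of `no_hashtag`. A minimal sketch of one way to avoid it, assuming you want only real tokens, is to strip the string before splitting:

```python
import re

# Value of no_hashtag from the cell above
no_hashtag = "ITS NOT ENOUGH TO SAY THAT IMISS U    "

# Remove leading and trailing whitespace, then split on runs of whitespace
tokens = re.split(r"\s+", no_hashtag.strip())
print(tokens)
```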

### Regex metacharacters

Now, we will look at some other metacharacters that are very useful.


Let's first look at two methods of the re module: `.search()` and `.match()`. 

<img src="sm.jpg" style="max-width:600px">

As you can see, they have the same syntax and are used to find a match. 


In the example, we use both methods to find a digit appearing four times. Both methods return a match object describing the match found. 

In [2]:
re.search(r"\d{4}", "4506 people attend the show")

<re.Match object; span=(0, 4), match='4506'>

In [3]:
re.match(r"\d{4}", "4506 people attend the show")

<re.Match object; span=(0, 4), match='4506'>

The difference is that `.match()` is anchored at the beginning of the string. 

In the second example, we used them to find a match for a digit.

In [4]:
re.search(r"\d+", "Yesterday, I saw 3 shows")

<re.Match object; span=(17, 18), match='3'>

In [5]:
re.match(r"\d+","Yesterday, I saw 3 shows")

In this case, `.search()` finds a match, but `.match()` does not and returns `None`. This is because the first characters do not match the regex.
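
In a script, the result of either method is usually checked before it is used. A short sketch with the same string (the variable names are hypothetical):

```python
import re

text = "Yesterday, I saw 3 shows"

found = re.search(r"\d+", text)    # scans the whole string
anchored = re.match(r"\d+", text)  # only tries at position 0

# A match object is truthy; None is falsy
if found:
    print(found.group(), found.span())

print(anchored is None)
```

`.group()` returns the matched text and `.span()` returns its start and end positions.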

### Special characters (" . " meta character)

`The dot metacharacter matches any character (except newline)`. 

In the example code, we need to match links in the string. We know a link starts with w, w, w and ends with c, o, m. So we first write this in our regex. We don't know how many characters are in between. So we indicate that we want any character, a dot, one or more times, by adding the plus. We can see in the output that we get our match.

In [6]:
my_links ="Just check out this link: www.amazingpics.com. It has amazing photos!"
re.findall(r"www.+com", my_links)

['www.amazingpics.com']

###  Special characters  (" ^ " meta character)

The circumflex anchors the regex to the start of a string. 

In the example, we first find the pattern without the anchor: t, followed by h, e, whitespace, one or more digits, and ending with s. The method finds the following two matches. 

In [7]:
my_string = "the 80s music was much better that the 90s"
re.findall(r"the\s\d+s", my_string)

['the 80s', 'the 90s']

In [8]:
re.findall(r"^the\s\d+s", my_string)

['the 80s']

Now that we have added the anchor metacharacter, we receive only one match: the one that appears at the beginning of the string.

###  Special characters (End of the string: dollar metacharacter)

`$`  meta character

In contrast, the dollar sign anchors the regex to the end of the string. If we use it in a similar example, we get the match that appears at the end of the string.

In [9]:
my_string ="the 80s music hits were much better that the 90s"
re.findall(r"the\s\d+s$", my_string)

['the 90s']

### Special characters (Escape special characters: \)

What if I want to use characters like dollar sign or dot, which also have other meanings? 

Let's look at an example. We want to split the string by dot whitespace. We write the following regex. 

In [10]:
my_string = "I love the music of Mr.Go. However, the sound was too loud."
print(re.split(r".\s", my_string))

['', 'lov', 'th', 'musi', 'o', 'Mr.Go', 'However', 'th', 'soun', 'wa', 'to', 'loud.']


The output is not what we expected. Why? Because the regex engine interprets the dot as any character. 

To solve this, we need to escape the character by adding a backslash in front of the dot. Now we get the correct output.

In [11]:
my_string = "I love the music of Mr.Go. However, the sound was too loud."
print(re.split(r"\.\s", my_string))

['I love the music of Mr.Go', 'However, the sound was too loud.']
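
When the text to match is a literal string that happens to contain metacharacters, `re.escape()` can add the backslashes for you. A small sketch, with a made-up price string:

```python
import re

# Hypothetical sample text
price_text = "The total is $5.00, not $5000."
literal = "$5.00"

# Unescaped: "$" is read as the end-of-string anchor, so nothing matches
unescaped = re.findall(literal, price_text)

# Escaped: "$" and "." are treated as literal characters
escaped = re.findall(re.escape(literal), price_text)

print(unescaped)
print(escaped)
```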


### OR operator(" | ")

In the example code, we want to match the word elephant. However, we see that it's written with a capital E or a lowercase e. In that case, we use the vertical bar. 

In this way, we indicate that we want to match one variant OR the other, obtaining both matches.

In [12]:
my_string = "Elephants are the world's largest land animal! I would love to see an elephant one day"
re.findall(r"Elephant|elephant", my_string)

['Elephant', 'elephant']

### OR operator (square " [ ] " brackets)

Square brackets also represent the OR operator. Inside them, we list the alternative characters to match. 

Look at the example. We want to find a pattern that contains one or more lowercase or uppercase letters followed by a digit. To do so, we can use the square brackets. Inside them, we will use lowercase a dash lowercase z to specify any lowercase letter. Then, uppercase a dash uppercase z to indicate any uppercase letter. Then the plus. Then backslash d. Thus, we get the following matches.

In [13]:
my_string = "Yesterday I spent my afternoon with my friends: MaryJohn2 Clary3"
re.findall(r"[a-zA-Z]+\d", my_string)

['MaryJohn2', 'Clary3']
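
Square brackets and the vertical bar can often express the same choice. As a sketch, the elephant example from the previous section can be written either way:

```python
import re

my_string = "Elephants are the world's largest land animal! I would love to see an elephant one day"

# Alternation between two whole words
with_bar = re.findall(r"Elephant|elephant", my_string)

# Character class for just the first letter
with_class = re.findall(r"[Ee]lephant", my_string)

print(with_bar == with_class)
```

The bracket form is shorter when the variants differ in a single character; the vertical bar is needed when whole words or longer sequences differ.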

### OR operator (square " [ ] " brackets)

In the following string, we want to replace certain non-word characters with a space. We list the characters to match inside the square brackets. The engine searches for any one of them. When it finds a match, it replaces the character with a space, getting the following output.

In [14]:
my_string = "My&name&is#John Smith. I%live$in#London."
re.sub(r"[#$%&]", " ", my_string)

'My name is John Smith. I live in London.'

In [19]:
my_string = "My&name&is#John Smith. I%live$in#London."
re.sub(r"[\W]", " ", my_string)

'My name is John Smith  I live in London '

Using `\W` inside the brackets replaces every non-word character, so the periods are replaced as well.

### OR operator with circumflex 

Placed at the start of the square brackets, the circumflex negates the expression, so it matches any character that is not listed. 

In the example, we add the circumflex to specify that we want the links that do not contain any digit. And we get the following output.

In [20]:
my_links = "Bad website: www.99.com. Favorite site: www.hola.com"
re.findall(r"www[^0-9]+com", my_links)

['www.hola.com']

### Exercise 7: Finding files

You are not satisfied with your tweets dataset cleaning. There are still extra strings that do not provide any sentiment. Among them are strings that refer to text file names.

You also find a way to detect them:

- `They appear at the start of the string.`

- `They always start with a sequence of 2 or 3 upper or lowercase vowels (a e i o u)`.

- `They always finish with the txt ending.`

You are not sure if you should remove them directly. So you write a script to find and store them in a separate dataset.

You write down some metacharacters to help you: `^ anchor to beginning`, `. any character`.

The variable `sentiment_analysis7` contains the text of two tweets.

In [26]:
sentiment_analysis7 = [df["text"].iloc[780], df["text"].iloc[781]]
sentiment_analysis7

['AIshadowhunters.txt aaaaand back to my literature review. At least i have a friendly cup of coffee to keep me company',
 "ouMYTAXES.txt I am worried that I won't get my $900 even though I paid tax last year"]

- Write a regex that matches the pattern of the text file names, e.g. `aemyfile.txt`.


- Find all matches of the regex in the elements of `sentiment_analysis7`. Print out the result.


- Replace all matches of the regex with an `empty string ""`. Print out the result.

In [29]:
## answer
regex_txt = r"^[aeiouAEIOU]{2,3}\w+\.txt"

for tweet in sentiment_analysis7:
    print(re.findall(regex_txt,tweet))
    print(re.sub(regex_txt, "", tweet), "\n")

['AIshadowhunters.txt']
 aaaaand back to my literature review. At least i have a friendly cup of coffee to keep me company 

['ouMYTAXES.txt']
 I am worried that I won't get my $900 even though I paid tax last year 



In [32]:
## answer another way
# Write a regex to match text file name
regex = r"^[aeiouAEIOU]{2,3}.+txt"

for text in sentiment_analysis7:
    # Find all matches of the regex
    print(re.findall(regex, text))
    
    # Replace all matches with empty string
    print(re.sub(regex, "", text))

['AIshadowhunters.txt']
 aaaaand back to my literature review. At least i have a friendly cup of coffee to keep me company
['ouMYTAXES.txt']
 I am worried that I won't get my $900 even though I paid tax last year


### Exercise 8: Give me your email

A colleague has asked for your help! When a user signs up on the company website, they must provide a valid email address.
The company puts some rules in place to verify that the given email address is valid:

- The first part can contain:

    `Upper A-Z or lowercase letters a-z`

    `Numbers`

    `Characters: !, #, %, &, *, $, . `

    `Must have @`


- Domain:
    
    `Can contain any word characters`
    
    `But only .com ending is allowed`

The project consists of writing a script that checks if the email address follows the correct pattern. Your colleague gave you a list of email addresses as examples to test.

The list emails as well as the re module are loaded in your session.

- Write a regular expression to match valid email addresses as described.


- Match the regex to the elements contained in emails.


- To print out the message indicating if it is a valid email or not, complete .format() statement.

In [33]:
emails = ['n.john.smith@gmail.com', '87victory@hotmail.com', '!#mary-=@msca.net']
emails

['n.john.smith@gmail.com', '87victory@hotmail.com', '!#mary-=@msca.net']

In [35]:
## "$", "*" and "." are regex metacharacters; inside square brackets they are treated as literals, so the "\" before them is optional but harmless

regex_emails = r"[a-zA-Z0-9!#%&\*\$\.]+@\w+\.com"

In [37]:
for email in emails:
    if re.match(regex_emails,email):
        print("The email {example_email} is a valid email".format(example_email = email))
    else:
        print("The email {example_email} is an invalid email".format(example_email = email))

The email n.john.smith@gmail.com is a valid email
The email 87victory@hotmail.com is a valid email
The email !#mary-=@msca.net is an invalid email
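
One caveat: `re.match()` only anchors the start of the string, so extra text after `.com` still counts as a match. The sketch below illustrates this with a made-up address and shows `re.fullmatch()` as one possible stricter check. It is not part of the original exercise solution.

```python
import re

regex_emails = r"[a-zA-Z0-9!#%&\*\$\.]+@\w+\.com"

# Hypothetical address with extra characters after ".com"
candidate = "n.john.smith@gmail.community"

# Only the start is anchored, so the ".com" inside ".community" is accepted
starts_ok = bool(re.match(regex_emails, candidate))

# The whole string must match, so this address is rejected
whole_ok = bool(re.fullmatch(regex_emails, candidate))

print(starts_ok, whole_ok)
```

Adding a `$` at the end of the regex would have a similar effect when used with `re.match()`.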


### Exercise 9: Invalid password

The second part of the website project is to write a script that validates the password entered by the user. The company also puts some rules in place to verify valid passwords:

- `It can contain lowercase a-z and uppercase letters A-Z`

- `It can contain numbers`

- `It can contain the symbols: *, #, $, %, !, &, .`

- `It must be at least 8 characters long but not more than 20`

Your colleague also gave you a list of passwords as examples to test.

The list `passwords` and the module re are loaded in your session.

In [38]:
passwords = ['Apple34!rose', 'My87hou#4$', 'abc123']
passwords

['Apple34!rose', 'My87hou#4$', 'abc123']

- Write a regular expression to check if the `passwords are valid` according to the description.


- Search the elements in the passwords list to find out if they are valid passwords.


- To print out the message indicating if it is a valid password or not, complete .format() statement.

In [39]:
regex_pass = r"[a-zA-Z0-9\*#\$%!&\.]{8,20}"

In [41]:
for password in passwords:
    if re.search(regex_pass,password):
        print("This password {password} is valid".format(password=password))
        
    else:
        print("This password {password} is invalid".format(password=password))

This password Apple34!rose is valid
This password My87hou#4$ is valid
This password abc123 is invalid
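
Because `re.search()` accepts a match anywhere in the string, this check passes any password that merely contains 8 to 20 allowed characters in a row. The sketch below uses two made-up passwords to show the gap, and `re.fullmatch()` as one possible way to enforce the rules on the whole string. This is an illustration, not the exercise's official solution.

```python
import re

regex_pass = r"[a-zA-Z0-9\*#\$%!&\.]{8,20}"

# Hypothetical passwords that break the rules
too_long = "a" * 25                # more than 20 characters
bad_symbol = "Apple34!rose^^"      # "^" is not an allowed symbol

# search() is satisfied by a valid run inside the string
search_results = [bool(re.search(regex_pass, p)) for p in (too_long, bad_symbol)]

# fullmatch() requires the entire password to fit the pattern
full_results = [bool(re.fullmatch(regex_pass, p)) for p in (too_long, bad_symbol)]

print(search_results)
print(full_results)
```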


### Greedy vs. non-greedy matching

You have already worked with repetitions. Now, we'll deepen our understanding of how the quantifiers work.


There are two types of matching: greedy and non-greedy (also called lazy). 

The quantifiers that you have been learning until now (which are called standard quantifiers) are greedy by default.

### Greedy matching

We said that the standard quantifiers have greedy behavior, meaning that 

- they attempt to match as many characters as possible, and 

- in doing so, they return the longest match found in a match attempt. 

Let's take a look at this code. We want to find a pattern that has one or more digits in the string displayed, and our greedy quantifier returns the match '12345'. 

In [42]:
re.match(r"\d+", "12345bcada")

<re.Match object; span=(0, 5), match='12345'>

We can explain this in the following way: 

<img src="g.jpg" style="max-width:600px">

Our quantifier starts by matching the first digit found, '1'. Because it is greedy, it keeps going to find more digits and stops only when no other digit can be matched, returning '12345'.

### Non-greedy matching
Because they have lazy behavior, 

- non-greedy quantifiers will attempt to match as few characters as needed, 

- returning the shortest match. 

So how do we obtain non-greedy quantifiers? We can append a question mark to a greedy quantifier to make it lazy. 

If we take the same code as before, our non-greedy quantifier returns the match '1'. 

In [43]:
re.match(r"\d+?", "12345bcada")

<re.Match object; span=(0, 1), match='1'>

<img src="n.jpg" style="max-width:600px">

In this case, our quantifier will start by matching the first digit found, '1'. Because it is non-greedy, it will stop there as we stated that we want 'one or more' and one is as few as needed.
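
The same rule applies to every standard quantifier, not just `+`. As a small sketch, the loop below runs each greedy quantifier and its lazy counterpart against the string from the example; the variable names `pairs` and `comparison` are chosen here for illustration.

```python
import re

text = "12345bcada"

# Each pair is (greedy quantifier, lazy counterpart)
pairs = [
    (r"\d*", r"\d*?"),
    (r"\d+", r"\d+?"),
    (r"\d?", r"\d??"),
    (r"\d{2,4}", r"\d{2,4}?"),
]

comparison = {}
for greedy, lazy in pairs:
    greedy_match = re.match(greedy, text).group()
    lazy_match = re.match(lazy, text).group()
    comparison[greedy] = (greedy_match, lazy_match)
    print(f"{greedy:10} -> {greedy_match!r:8} | {lazy:10} -> {lazy_match!r}")
```

Note that the lazy versions of `*` and `?` return an empty match, because zero repetitions is the fewest they are allowed to take.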

### Greedy matching (backtracking)

However, there is another characteristic that we should explore. 

If the greedy quantifier has matched so many characters that the rest of the pattern cannot match, it will backtrack, giving up the characters it matched one at a time and trying again. 

Backtracking is like driving a car without a map. If you drive down a road and hit a dead end, you need to backtrack to an earlier point and take another street. To make this clearer, we'll take this example code. 

We use the greedy quantifier `.*` to match any character, zero or more times, followed by the letters `"h" "e" "l" "l" "o"`. We can see here that it returns the match `'xhello'`. 

In [44]:
re.match(r".*hello", "xhelloxxxxxx")

<re.Match object; span=(0, 6), match='xhello'>

<img src="g-1.jpg" style="max-width:800px">

So our greedy quantifier starts by matching as much as possible: the entire string. Then it tries to match the h, but there are no characters left. So it backtracks, giving up one matched character, and tries again. It still can't match the h, so it keeps backtracking one step at a time until it finally matches the h in the regex, followed by the rest of the characters.
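
The steps described above can be mimicked in plain Python. This is only a simplified model of `.*` followed by a literal, not how the `re` engine is actually implemented, and the helper name `greedy_prefix_then` is made up for this sketch.

```python
import re

def greedy_prefix_then(tail, text):
    """Simplified model of `.*` followed by the literal `tail`, anchored at the start.

    Returns (matched_text, characters_given_back), or None if nothing matches.
    """
    # Start with `.*` holding the whole string, then give back one character at a time
    for given_back in range(len(text) + 1):
        split = len(text) - given_back
        if text.startswith(tail, split):
            return text[:split + len(tail)], given_back
    return None

text = "xhelloxxxxxx"
simulated = greedy_prefix_then("hello", text)
actual = re.match(r".*hello", text).group()

print(simulated)
print(actual)
```

Under this model, `.*` has to give back 11 of the 12 characters before `hello` can match, which is why greedy patterns can do a lot of extra work on long strings.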

### Non-greedy matching (backtracking)

Non-greedy quantifiers also backtrack. In this case, if they have matched so few characters that the rest of the pattern cannot match, they backtrack, expanding the match one character at a time and trying again. Let's take the same example code again. 

This time we will use the lazy quantifier `.*?`. Interestingly, we obtain the same match, `'xhello'`. But the way this match is obtained is different from the first time.

In [45]:
re.match(r".*?hello", "xhelloxxxxxx")

<re.Match object; span=(0, 6), match='xhello'>

<img src="n-1.jpg" style="max-width:800px">

The lazy quantifier first matches as little as possible, nothing, leaving the entire string unmatched. Then it tries to match the h, but it doesn't work. So it backtracks, matching one more character, the x. Then it tries again, this time matching the h, and afterwards, the rest of the regex.
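
The two quantifiers only agree here because `hello` appears once in the string. As an illustration, the hypothetical string below contains it twice, and the results diverge: the greedy version backtracks from the end and stops at the last `hello`, while the lazy version expands from the start and stops at the first.

```python
import re

# Hypothetical string with two occurrences of "hello"
text = "xhelloxxhello"

greedy = re.match(r".*hello", text).group()
lazy = re.match(r".*?hello", text).group()

print(greedy)
print(lazy)
```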

### Exercise 10: Understanding the difference

You need to keep working on and cleaning your tweets dataset. You realize that there are some HTML tags present. You need to remove them but keep the content inside, as it is useful for analysis.

Let's take a look at this sentence containing an `HTML` tag:

`I want to see that <strong>amazing show</strong> again!`

You know that to get the HTML tag you need to match anything that sits inside angle brackets `< >`. But the biggest problem is that the closing tag has the same structure. If you match too much, you will end up removing key information. So you need to decide whether to use a greedy or a lazy quantifier.

The string is already loaded as `string` to your session.

In [46]:
string = 'I want to see that <strong>amazing show</strong> again!'
string

'I want to see that <strong>amazing show</strong> again!'

- Write a regular expression to replace HTML tags with an empty string. Print out the result.

In [47]:
print(re.sub(r"<.+?>", "", string))

I want to see that amazing show again!
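
To see why the lazy quantifier matters here, the sketch below runs the greedy version on the same string. It also tries a negated character class, `<[^>]+>`, which is a common alternative and not part of the original exercise.

```python
import re

string = 'I want to see that <strong>amazing show</strong> again!'

# Greedy: a single match from the first "<" to the last ">", so the content is lost
greedy_result = re.sub(r"<.+>", "", string)

# Lazy: each tag is matched separately
lazy_result = re.sub(r"<.+?>", "", string)

# Alternative: "anything except >" cannot run past the end of a tag
negated_result = re.sub(r"<[^>]+>", "", string)

print(repr(greedy_result))
print(repr(lazy_result))
print(repr(negated_result))
```

The greedy pattern removes `amazing show` together with the tags, which is exactly the loss of key information the exercise warns about.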


### Exercise 11: Greedy matching

Next, you see that numbers still appear in the text of the tweets. So, you decide to find all of them.

Let's imagine that you want to extract the number contained in the sentence `I was born on April 24th`. A `lazy quantifier` will make the regex return `2` and `4`, because it will match as `few` characters as needed. However, a greedy quantifier will return the `entire 24`, because it matches as many characters as possible.
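
The April 24th example can be checked directly. A minimal sketch, with variable names chosen for illustration:

```python
import re

sentence = "I was born on April 24th"

lazy_numbers = re.findall(r"\d+?", sentence)    # one digit per match
greedy_numbers = re.findall(r"\d+", sentence)   # the whole run of digits

print(lazy_numbers)
print(greedy_numbers)
```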

The re module as well as the variable `sentiment_analysis8` are already loaded in your session. 

In [48]:
sentiment_analysis8 = 'Was intending to finish editing my 536-page novel manuscript tonight, but that will probably not happen. And only 12 pages are left '
sentiment_analysis8

'Was intending to finish editing my 536-page novel manuscript tonight, but that will probably not happen. And only 12 pages are left '

- Use a `lazy quantifier` to match all numbers that appear in the variable `sentiment_analysis8`.

- Now, use a `greedy quantifier` to match all numbers that appear in the variable `sentiment_analysis8`.

In [49]:
## lazy
print(re.findall(r"\d+?", sentiment_analysis8))

['5', '3', '6', '1', '2']


In [50]:
## greedy
print(re.findall(r"\d+", sentiment_analysis8))

['536', '12']


### Exercise 12: Lazy approach

You have done some cleaning in your dataset but you are worried that there are sentences encased in parentheses that may cloud your analysis.

Again, a greedy or a lazy quantifier may lead to different results.

For example, if you want to extract a word starting with `a` and ending with `e` in the string `I like apple pie`, you may think that applying the `greedy` regex `a.+e` will return `apple`. However, your match will be `apple pie`. A way to overcome this is to make the quantifier `lazy` by appending `?` (`a.+?e`), which will return `apple`.
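
A quick sketch confirming the `apple pie` example, with variable names chosen for illustration:

```python
import re

text = "I like apple pie"

greedy_words = re.findall(r"a.+e", text)   # runs to the last "e" in the string
lazy_words = re.findall(r"a.+?e", text)    # stops at the first "e" that completes a match

print(greedy_words)
print(lazy_words)
```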

The re module and the variable `sentiment_analysis9` are already loaded in your session.

In [51]:
sentiment_analysis9 = "Put vacation photos online (They were so cute) a few yrs ago. PC crashed, and now I forget the name of the site (I'm crying). "
print(sentiment_analysis9)

Put vacation photos online (They were so cute) a few yrs ago. PC crashed, and now I forget the name of the site (I'm crying). 


- Use a `greedy quantifier` to match text that appears within `parentheses` in the variable `sentiment_analysis9`.

- Now, use a `lazy quantifier` to match text that appears within `parentheses` in the variable `sentiment_analysis9`.

In [53]:
## greedy
print(re.findall(r"\(.+\)", sentiment_analysis9))

["(They were so cute) a few yrs ago. PC crashed, and now I forget the name of the site (I'm crying)"]


In [54]:
## lazy
print(re.findall(r"\(.+?\)", sentiment_analysis9))

['(They were so cute)', "(I'm crying)"]


Notice that greedy quantifiers can lead to longer matches than you intended. Making a quantifier lazy by adding `?`, so that it matches a shorter pattern, is an important consideration when handling data for text mining.