<a href="https://colab.research.google.com/github/ProfessorPatrickSlatraigh/CST2312/blob/main/CST2312_LabExercises_regex_SOLUTION.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab Exercises: Regular Expressions  
# **SOLUTION NOTEBOOK**    
**CST2312 D222, Fall 2022**   

Use Cases for Regular Expressions    

*Refer to the Colab notebook [CST2312 Class 12](https://bit.ly/cst2312cl12) as a reference.*    

*Refer to the Cheatsheet for [regex](https://bit.ly/cst2312regex_cheatsheet)*   
  


## 1. Password Validity   

Write Python code which checks the validity of a password entered by the user according to the following requirements:    

 - Must have at least **8 characters**    
 - Must have at least one **uppercase letter**    
 - Must have at least one **lowercase letter**    
 - Must have at least one **number**    
 - Must have at least one **symbol**     


*Write your Python code in the following code cell.*    
*Add more code and text cells if you like.*    

#### Solution to 1. Password Validity

In [None]:
r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*\W).{8,}$"

This pattern uses “positive lookaheads”: sections of the expression that the engine checks for anywhere ahead in the text, without consuming any characters. Everything inside a `(?=...)` is a condition that must hold for the match to succeed.

`(?=.*[a-z])` asserts that a lowercase letter appears somewhere ahead in the text.    
`(?=.*[A-Z])` works just like the previous one, but requires an uppercase letter instead.    
`(?=.*\d)` requires a digit (a number) somewhere ahead.    
`(?=.*\W)` requires a symbol somewhere ahead, that is, any character other than a letter, a digit, or an underscore.    
`.{8,}` makes sure the match is at least 8 characters long (any character other than a line break, thanks to the dot).    
`^` and `$` anchor the match to the start of the string (the caret at the start of the expression) and to the end of the string (the dollar sign). Only whole-string matches are allowed. Partial matches aren’t considered.    

If all of the above conditions are met, a match is returned; otherwise the password is not valid.  
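
As a minimal sketch, the pattern above might be applied in Python as follows. The helper name `is_valid_password` and the sample passwords are illustrative assumptions, not part of the exercise.

```python
import re

# Python's re module takes the bare pattern: no /.../ delimiters or g flag.
PASSWORD_PATTERN = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*\W).{8,}$")


def is_valid_password(password):
    """Return True if the password meets all five requirements (hypothetical helper)."""
    return PASSWORD_PATTERN.match(password) is not None


# Made-up sample passwords, one valid and three that each break a rule.
for candidate in ["Passw0rd!", "password", "Sh0rt!", "NoSymbol123"]:
    print(candidate, is_valid_password(candidate))
```

To check a password typed by the user, the helper could be called as `is_valid_password(input("Enter a password: "))`.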

---

## 2. Email Format Checker    

How many times have you seen the message “Invalid Email format” in a sign-up form?     

If you are working on a back-end validation, Regular Expressions can help you validate this format in a single line of code, instead of having several different IF statements. 

Write Python code which asks the user for an email address and checks the validity of the email address entered.   



*Write your Python code in the following code cell.*    
*Add more code and text cells if you like.*    

#### Solution to 2. Email Format Checker    


In [None]:
r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$"

That is a lot, but if you look closely, you can identify all three parts of the expected address format in there:    

First, we check that the user name is valid. This simply checks that only valid characters are used and that there is at least one of them (that’s what the “+” at the end means):    

``^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+``    

Then, we’re checking for the @ character and the host name:    

`@[a-zA-Z0-9-]+`    

Again, nothing fancy: the host name may contain only letters, digits, and hyphens, and must have at least one character.    

The last, optional part takes care of checking the TLD (Top Level Domain), or basically the domain name extension:    

`(?:\.[a-zA-Z0-9-]+)*$`    

You can tell this part is optional because of the `*` at the end. That means 0 or more instances of that group (the group is delimited by the parentheses) are allowed, so .com would match, but so would .co.uk. 
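
The three parts described above could be combined in Python roughly like this. It is a sketch only: the helper name `is_valid_email` and the sample addresses are assumptions made up for illustration.

```python
import re

# The same pattern, split into its three parts for readability.
EMAIL_PATTERN = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"  # user name: one or more valid characters
    r"@[a-zA-Z0-9-]+"                      # the @ sign and the first host label
    r"(?:\.[a-zA-Z0-9-]+)*$"               # zero or more dotted labels, e.g. .co.uk
)


def is_valid_email(address):
    """Return True if the address has the expected format (hypothetical helper)."""
    return EMAIL_PATTERN.match(address) is not None


# Made-up sample addresses.
for address in ["student@mail.example.edu", "user@localhost", "user@", "missing-at.example.com"]:
    print(address, is_valid_email(address))
```

Note that `user@localhost` is accepted, because the dotted part of the pattern is optional. To check an address typed by the user, the helper could be called as `is_valid_email(input("Enter an email address: "))`.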

---

## 3. URL Validity      

Write Python code which checks the validity of a URL entered by the user according to the following requirements:    

 - Must start with `http` or `https` or `ftp` followed by `://` 
    
 - Must match a valid domain name

 - Could contain a port specification (`http://www.cuny.com:80`)    

 - Could contain digits, letters, dots, hyphens, and forward slashes, multiple times    


*Write your Python code in the following code cell.*    
*Add more code and text cells if you like.*    

#### Solution to 3. URL Validity    


In [None]:
r"^(http|https|ftp):[\/]{2}([a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,4})(:[0-9]+)?\/?([a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]*)"

The first requirement is pretty easy to solve with `^(http|https|ftp):[\/]{2}`.
To match the domain name, we need to bear in mind that a valid name can only contain letters, digits, hyphens and dots. This example limits the number of letters after the final dot to between 2 and 4, but that could be extended for newer domains like .rocks or .codes. The domain name is matched by `([a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,4})`.

The optional port specification is matched by the simple `(:[0-9]+)?`.

A URL can contain multiple slashes and many other characters repeated any number of times (see RFC 3986). This is matched by a character class inside a group: `([a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]*)`.
It’s really useful to wrap every important element in a capturing group `()`, because the match will then return exactly the parts we need. Remember that certain characters need to be escaped with `\`.

Below, every subpattern is explained:

 1. `^` asserts position at start of the string    

 2. capturing group `(http|https|ftp)`, captures `http` or `https` or `ftp`    
 
 3. `:` matches the character : literally (no escape is needed)    

 4. `[\/]{2}` matches the escaped character / exactly 2 times      

 5. capturing group `([a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,4})`:    

 - `[a-zA-Z0-9\-\.]+` matches, one or more times, a character in the range a to z, A to Z, or 0 to 9, the character `-` literally, or the character `.` literally    

 - `\.` matches the character `.` literally    

 - `[a-zA-Z]{2,4}` matches between 2 and 4 characters, each in the range a to z or A to Z (case sensitive)    

 6. capturing group `(:[0-9]+)?`:  
 - quantifier `?` matches the group zero or one time    

 - `:` matches the character : literally    

 - `[0-9]+` matches a single character between 0 and 9 one or more times    

 7. `\/?` matches the character / literally zero or one time

 8. capturing group `([a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]*)`:    

 -  `[a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]*` matches between zero and unlimited times a single character in the range a-z, A-Z, 0-9, or one of the characters `-._?,'/\+&%$#=~`    
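
The capturing groups listed above can be inspected from Python. The following is a sketch under stated assumptions: `parse_url` and the dictionary keys are hypothetical names chosen for illustration, and the sample URLs (other than the one from the requirements) are made up.

```python
import re

# The URL pattern from this section, split across lines for readability.
URL_PATTERN = re.compile(
    r"^(http|https|ftp):[\/]{2}"                # group 1: scheme, then ://
    r"([a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,4})"        # group 2: domain name
    r"(:[0-9]+)?"                               # group 3: optional port
    r"\/?"                                      # optional slash
    r"([a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]*)"  # group 4: rest (& is written literally here)
)


def parse_url(url):
    """Return the captured parts as a dict, or None if the URL is rejected."""
    match = URL_PATTERN.match(url)
    if match is None:
        return None
    scheme, domain, port, path = match.groups()
    return {"scheme": scheme, "domain": domain, "port": port, "path": path}


for url in [
    "http://www.cuny.com:80",
    "https://www.example.com/labs/regex.html",
    "mailto:someone@example.com",
]:
    print(url, "->", parse_url(url))
```

A group that did not take part in the match, such as a missing port, comes back as `None`. Because the pattern has no `$` at the end, `match` only checks the beginning of the string; using `URL_PATTERN.fullmatch(url)` instead would be one way to reject trailing text.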


---

## 4. Matching an HTML Tag    

Write Python code which can read an HTML file and match an HTML tag.

Copy the source code for a web page to test your code.   

You may want to try Dr. Chuck Severance's simple web page at `https://dr-chuck.com/page1` as a first test. 


 - The start tag must begin with `<` followed by one or more characters and end with `>`    

 - The end tag must start with `</` followed by one or more characters and end with `>`    

 - We must match the content inside a **TAG** element            


*Write your Python code in the following code cell.*    
*Add more code and text cells if you like.*    

#### Solution to 4. Matching an HTML Tag    


In [None]:
r"<([\w]+).*>(.*?)<\/\1>"

Matching the start tag and the content inside it is pretty easy with `<([\w]+).*>` and `(.*?)`, but the pattern above adds a useful feature: a reference to a capturing group (a backreference).    
Every capturing group defined by parentheses `()` can be referred to by its position number: in `(first)(second)(third)` the groups are numbered 1, 2, and 3, and `\1` refers back to whatever the first group matched, which allows for further operations.    


The expression above could be explained as:    

 - Start with `<`
 - Capture the tag name    
 - Followed by zero or more chars (such as attributes), then `>`    
 - Capture the content inside the tag    
 - The closing tag must be `</`tag name captured before`>`    

Including only two capture groups in the expression, the tag name and the content, will return a very clear match, a list of tag names with related content.    

Digging a little deeper to explain the subpatterns:

 1. `<` matches the character `<` literally    

 2. capturing group `([\w]+)` matches any word character a-zA-Z0-9_ one or more times    

 3. `.*` matches any character (except newline) zero or more times     

 4. `>` matches the character `>` literally    

 5. capturing group `(.*?)` matches any character (except newline) zero or more times, as few times as possible (lazy)    

 6. `<` matches the character `<` literally    

 7. `\/` matches the character / literally    

 8. `\1` matches the same text matched by the first capturing group: `([\w]+)`    

 9. `>` matches the character `>` literally    
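
As a rough sketch, the pattern can be tried with `re.findall`, which returns one `(tag name, content)` tuple per match when the pattern has two capturing groups. The HTML below is a made-up sample held in a string (`SAMPLE_HTML` and `find_tags` are illustrative names); in practice the text could just as well be read from a saved HTML file.

```python
import re

TAG_PATTERN = re.compile(r"<([\w]+).*>(.*?)<\/\1>")

# Made-up sample page, used here instead of reading a file.
SAMPLE_HTML = """<html>
<body>
<h1>The First Page</h1>
<p>Regular expressions are fun.</p>
<a href="page2.htm">Second Page</a>
</body>
</html>"""


def find_tags(html):
    """Return a list of (tag name, content) tuples found in the text."""
    return TAG_PATTERN.findall(html)


for tag_name, content in find_tags(SAMPLE_HTML):
    print(tag_name, "->", content)
```

Elements that span several lines, such as `<html>` and `<body>` in this sample, are not matched, because `.` does not match a newline. The greedy `.*` can also swallow more than intended when several tags share one line, so this pattern is best suited to simple pages.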
 

---