# Regular Expressions in Python (Follow-up 1)
<font size="3">For this follow-up you must solve 2 exercises using the theoretical knowledge you have gathered in class and 3 more exercises using Python regular expressions. For the Python section you must also present a written summary, in English, of your train of thought when solving each exercise (one paragraph). This part will be marked taking into account the regex, the explanation and the number of test cases passed. Each of these exercises must be solved using regular expressions; any other approach, even if it works, won't be accepted.</font>
 

1) [15 pts] Find the regular expression for the language over $\Sigma = \{a,b,c\}$ of all strings that have exactly one occurrence of three consecutive a's.

$ ((\{a\} \cup \{aa\}) \cdot \{b,c\}^+)^* \cdot \{aaa\} \cdot (\{b,c\}^+ \cdot (\{a\} \cup \{aa\}))^* $

$ \{b,c\}^* \cdot \{aaa\} \cdot \{b,c\}^* $

2) [15 pts] Find the regular expression for the language over $\Sigma = \{0,1,2\}$ of all strings that have length divisible by two.

$ ((\{0\} \cup \{1\} \cup \{2\}) \cdot (\{0\} \cup \{1\} \cup \{2\}))^* $
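
One way to sanity-check the expression above is to translate it into Python's regex syntax and compare it against the definition of the language on every short string. The sketch below assumes the translation `(?:[012][012])*`, and the names `EVEN_LENGTH_RE` and `all_strings` are hypothetical helpers introduced only for this illustration. It is a brute-force check up to length 6, not a proof.

```python
import re
from itertools import product

# Assumed Python translation of ((0 ∪ 1 ∪ 2)(0 ∪ 1 ∪ 2))*
EVEN_LENGTH_RE = re.compile(r"(?:[012][012])*")

def all_strings(alphabet, max_len):
    """Yield every string over `alphabet` with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

# A string is a mismatch if the regex and the definition disagree on it.
mismatches = [
    s for s in all_strings("012", 6)
    if bool(EVEN_LENGTH_RE.fullmatch(s)) != (len(s) % 2 == 0)
]
print(mismatches)  # prints []
```

An empty list means the expression and the "length divisible by two" definition agree on all 1093 strings of length 0 to 6.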

3) [25 pts] Write a Python function is_valid_username(username) that returns True if the username meets all of the following conditions (all captured with a regex):
- Length between 8 and 16 characters
- Starts with a letter (uppercase or lowercase)
- Contains at least one underscore (_)
- Contains at least one digit
- Ends with the same letter it started with (case-sensitive)
- Only includes letters, digits, and underscores (no other symbols)
<br>
<br>
Hints: 

- Use lookaheads (?=...) to require certain patterns (e.g., digits, symbols).
- Use groups() to capture and compare parts of the string.
- Use backreferences \1 to match the same text as a previous group.
- Use {min,max} to enforce minimum and maximum length.


In [2]:
test_cases = [
    ("User_123U", True),
    ("User_1234", False),
    ("a_user1a", True),
    ("Start1234567_endS", False),
    ("GoodName_3e", False),
    ("Z_user_9Z", True),
    ("A123456_7A", True),
    ("a__1b", False),
    ("_User_1_", False),
    ("abc_def_1c", False),
]


In [3]:
import re

def is_valid_username(username):
    objRegEx = re.compile(r'(?=.*_)(?=.*\d)(^[a-zA-Z])[a-zA-Z0-9_]{6,14}\1$')
    return bool(objRegEx.search(username))

for test_case in test_cases:
    print(f"{test_case[0]} --> Real: {is_valid_username(test_case[0])} - Expected: {test_case[1]}")

User_123U --> Real: True - Expected: True
User_1234 --> Real: False - Expected: False
a_user1a --> Real: True - Expected: True
Start1234567_endS --> Real: False - Expected: False
GoodName_3e --> Real: False - Expected: False
Z_user_9Z --> Real: True - Expected: True
A123456_7A --> Real: True - Expected: True
a__1b --> Real: False - Expected: False
_User_1_ --> Real: False - Expected: False
abc_def_1c --> Real: False - Expected: False


### Explanation 3:
First I wrote the easy parts: the username must start with an uppercase or lowercase letter --> ^[a-zA-Z]. Since the same letter must also appear at the end, I captured it in a group and referenced it with \1 --> (^[a-zA-Z]) ... \1$. Next I restricted what can appear between those two letters to letters, digits and underscores, and limited it to between 6 and 14 characters, because the first and last letters already take up 2 of the 8 to 16 allowed --> [a-zA-Z0-9_]{6,14}. Finally I added the "contains" requirements (at least one underscore and at least one digit) with lookaheads, which check a condition without consuming any characters --> (?=.*_)(?=.*\d).
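
To make each piece of the pattern easier to read, the same regex can be written with `re.VERBOSE`, which ignores whitespace and allows comments inside the pattern. This is only a readability sketch: the names `USERNAME_RE` and `is_valid_username_verbose` are hypothetical and not part of the original solution.

```python
import re

# Same pattern as in the solution, spread over several lines for readability.
USERNAME_RE = re.compile(r"""
    (?=.*_)             # lookahead: at least one underscore somewhere
    (?=.*\d)            # lookahead: at least one digit somewhere
    (^[a-zA-Z])         # group 1: the first character must be a letter
    [a-zA-Z0-9_]{6,14}  # 6 to 14 middle characters (total length 8 to 16)
    \1$                 # the last character repeats group 1 exactly
""", re.VERBOSE)

def is_valid_username_verbose(username):
    """Hypothetical helper: True if `username` satisfies all the conditions."""
    return bool(USERNAME_RE.search(username))

print(is_valid_username_verbose("User_123U"))  # prints True
print(is_valid_username_verbose("User_1234"))  # prints False
```

The lookaheads run at the start of the string and leave the position unchanged, so the rest of the pattern still sees the whole username.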

4) [20 pts] Write a Python function to find all triplets of words (3-word sequences) that appear at least twice in a given text (not necessarily consecutively).
<br>
<br>
Hints: 
- Remember about the boundaries \b
- Use a backreference to check for repetition
- Use the re.findall() or re.finditer() method.
- Convert text to lowercase to avoid case mismatches.
- Use Counter to count occurrences.


In [15]:
test_cases = [
    ("the quick brown fox jumps over the quick brown fox",
     ["the quick brown", "quick brown fox"]),
    ("Hello world again. hello WORLD again.",
     ["hello world again"]),
    ("We went there together Then, we went there together again.",
     ["we went there", "went there together"]),
    ("one two three two three four one two three",
     ["one two three"]),
    ("every word is different in this sentence",
     []),
    ("go go go home then go go go home then again",
     ["go go go", "go go home", "go home then"]),
    ("the dog sleeps all day. the dog sleeps all night.",
     ["the dog sleeps", "dog sleeps all"]),
    ("A B C D E F G A B C D E F G", 
     ["a b c", "b c d", "c d e", "d e f", "e f g"]),
    ("Rain falls HARD. rain   FALLS hard every day.",
     ["rain falls hard"]),
    ("up up and away up up and away",
     ["up up and", "up and away"]),
]


In [21]:
from collections import Counter

objRegEx = re.compile(r'(?=(\b\w+\b \b\w+\b \b\w+\b))')

for test_case in test_cases:
    # Lowercase and collapse runs of whitespace so "rain   FALLS" matches "rain falls"
    text = " ".join(test_case[0].lower().split())
    triplets = objRegEx.findall(text)
    # Count every triplet once instead of calling list.count() per element
    counts = Counter(triplets)
    repeated = {t for t, n in counts.items() if n >= 2}
    print(repeated)

{'quick brown fox', 'the quick brown'}
{'hello world again'}
{'went there together', 'we went there'}
{'one two three'}
set()
{'go home then', 'go go go', 'go go home'}
{'dog sleeps all', 'the dog sleeps'}
{'d e f', 'b c d', 'e f g', 'a b c', 'c d e'}
{'rain falls hard'}
{'up up and', 'up and away'}


### Explanation 4:
Since I was looking for 3 words next to each other, I knew I had to start with (\b\w+\b) (\b\w+\b) (\b\w+\b). This matches and consumes those words and, because they are grouped, I could then reference them with \b\1\b \b\2\b \b\3\b after a .*, since any number of words can sit between the two occurrences of the triplet. That didn't work as expected, because the pattern *consumed* the words it was matching: as soon as findall() matched a triplet, no other match could start inside those same words, so I got just one triplet in every case. The two examples below (thanks, professor) gave me the idea that I needed to check for the three words without consuming them, so that findall() would try the pattern again at every position. Wrapping the pattern in a lookahead, (?=(\b\w+\b \b\w+\b \b\w+\b)), makes findall() return the triplet that starts at each word, including overlapping ones. The next step was to count the occurrences of each triplet and keep only those that appear at least twice: I take the list returned by findall(), count how many times each triplet occurs, and add it to the repeated set if the count is greater than or equal to two. Finally, I wrote a for loop that goes over every test case, converts the text to lowercase, passes it through findall() and prints the result.

In [6]:
text = "abababa"
objRegEx = r"(aba)"  

matches = re.findall(objRegEx, text)
print(matches)

['aba', 'aba']


In [7]:
text = "abababa"
objRegEx = r"(?=(aba))"  # Using a positive lookahead to find overlapping "aba"

matches = re.findall(objRegEx, text)
print(matches)

['aba', 'aba', 'aba']
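
Combining the overlapping lookahead from the two examples above with a `Counter`, the triplet search could be wrapped in a function, as the exercise statement asks. This is a minimal sketch under stated assumptions: `find_repeated_triplets` and `TRIPLET_RE` are hypothetical names, and the use of `\s+` between words (to tolerate repeated spaces) is an assumption that goes beyond the single-space pattern discussed in the explanation.

```python
import re
from collections import Counter

# Lookahead so that matches can overlap; \s+ (an assumption) allows several spaces.
TRIPLET_RE = re.compile(r"(?=(\b\w+\s+\w+\s+\w+\b))")

def find_repeated_triplets(text):
    """Return the 3-word sequences that occur at least twice, in first-seen order."""
    # Lowercase first, then normalize the whitespace inside each captured triplet.
    triplets = [" ".join(t.split()) for t in TRIPLET_RE.findall(text.lower())]
    counts = Counter(triplets)
    return [t for t, n in counts.items() if n >= 2]

print(find_repeated_triplets("the quick brown fox jumps over the quick brown fox"))
# prints ['the quick brown', 'quick brown fox']
```

Returning a list in first-seen order makes the result directly comparable with the expected lists in the test cases, whereas a set has no defined order.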


5) Validate a time string in the format HH:MM:MM:HH such that:

- The first and last HH are the same.
- The two MM values are also the same.
- HH must be between 00 and 23.
- MM must be between 00 and 59.
<br>
<br>
Hints:
- Use capturing groups.
- Use backreferences.
- Use anchors ^...$ to match the full string.

In [8]:
test_cases = [
    ("12:30:30:12", True),   
    ("00:00:00:00", True),   
    ("23:59:59:23", True),   
    ("01:01:01:01", True),   
    ("19:45:45:19", True),   
    ("12:30:31:12", False),  
    ("12:60:60:12", False),  
    ("24:00:00:24", False),  
    ("12:30:30:13", False),  
    ("09:59:59:08", False),  
]

In [9]:
def validate_time(time):
    objRegEx = re.compile(r'^([0-1]\d|2[0-3]):([0-5]\d):\2:\1$')
    return bool(objRegEx.search(time))

for test_case in test_cases:
    print(f'{test_case[0]} --> Real: {validate_time(test_case[0])} - Expected: {test_case[1]}')

12:30:30:12 --> Real: True - Expected: True
00:00:00:00 --> Real: True - Expected: True
23:59:59:23 --> Real: True - Expected: True
01:01:01:01 --> Real: True - Expected: True
19:45:45:19 --> Real: True - Expected: True
12:30:31:12 --> Real: False - Expected: False
12:60:60:12 --> Real: False - Expected: False
24:00:00:24 --> Real: False - Expected: False
12:30:30:13 --> Real: False - Expected: False
09:59:59:08 --> Real: False - Expected: False


### Explanation 5:
First I worked out the values each HH and MM could take. Since HH must be between 00 and 23, I first made 3 groups, 0[0-9], 1[0-9] and 2[0-3], but then I saw that the ones starting with 0 and 1 use the same range for the second digit, so I could merge them with a character class: 0 or 1 followed by any digit (\d) --> [0-1]\d. Then I combined that with the hours starting with 2 using |, so it's [0-1]\d|2[0-3], since HH can only be one of those 2 alternatives. Then I grouped it with () so it could be reused at the end; since it is the first captured group, I reference it with \1 --> ([0-1]\d|2[0-3]):MM:MM:\1. I did exactly the same for MM, but this one can be summarized even further, since 0, 1, 2, 3, 4 and 5 can each be followed by any digit \d and those six digits can be written as a range --> [0-5]\d. Then I grouped it with () to get the second group and referenced it with \2. Every part is separated by a colon.
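
The range reasoning above can be checked exhaustively, since there are only 100 two-digit strings. The sketch below tests the hour and minute sub-patterns on their own; `HOUR_RE`, `MINUTE_RE` and `two_digit` are hypothetical names introduced for this check, not part of the original solution.

```python
import re

# The two sub-patterns from the solution, tested in isolation.
HOUR_RE = re.compile(r"[0-1]\d|2[0-3]")
MINUTE_RE = re.compile(r"[0-5]\d")

# Every two-digit string from "00" to "99".
two_digit = [f"{n:02d}" for n in range(100)]

valid_hours = [s for s in two_digit if HOUR_RE.fullmatch(s)]
valid_minutes = [s for s in two_digit if MINUTE_RE.fullmatch(s)]

print(valid_hours == [f"{h:02d}" for h in range(24)])    # prints True
print(valid_minutes == [f"{m:02d}" for m in range(60)])  # prints True
```

Using `fullmatch` here plays the role of the `^...$` anchors in the full pattern: the sub-pattern has to cover the entire two-character string.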