# SLU11 | Regex: Exercise Notebook

***

Now let's see how comfortable you are with regular expressions!

Before you begin, we **highly suggest** that you go through the interactive exercises at https://regexone.com/ in order to build some muscle memory on the basics of regular expression language.

A couple of notes regarding the exercises:
- Remember to build regular expression patterns with `r"..."` notation
- The exercises will all be evaluated with `re.findall(regex, string)`, so no funny business with `Match` objects.
- There will be no exercises with match groups, since those require a bit more experience handling the basics.
- We will provide plenty of hints in each exercise
- ChatGPT is your friend, but it is also a *liar* who will overcomplicate things and/or give wrong answers
- You can test regular expressions on strings with online tools such as https://regex101.com/ (select Python on the left sidebar)
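
As a quick refresher on the evaluation setup described in the notes above, here is a minimal sketch of how `re.findall` behaves with a raw-string pattern. The names `demo_regex` and `demo_string` are made up for illustration and are not part of any exercise.

```python
import re

# Hypothetical pattern and string, used only for illustration
demo_regex = r"cat"
demo_string = "cat concat dog"

# re.findall returns a list of strings, one per non-overlapping match
demo_matches = re.findall(demo_regex, demo_string)
print(demo_matches)  # ['cat', 'cat']

# When nothing matches, the result is an empty list
no_matches = re.findall(demo_regex, "dog")
print(no_matches)  # []
```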

Are you ready?\
Don't worry, **you got this**!

<center><img src="./media/its_time.jpg" width="500"/></center>

In [None]:
import re

## Exercise 1
We want to find __all__ numbers inside a string.\
The numbers can be *anywhere*, but consecutive digits must be returned together as a single match (e.g. `10` returns `10`, __not__ `1, 0`).

Create a variable `e1_regex` with the pattern.

__Hints:__
- You can specify a digit with either a special sequence (`\d`) or via ranges (`[x-y]`)
- You can specify *one or more* of a character when you follow it with `+`
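
To see what the `+` quantifier changes, here is a small sketch that uses a letter range instead of digits, so it is not the answer to this exercise. The string and variable names are assumptions made up for illustration.

```python
import re

# Hypothetical string, used only for illustration
demo_string = "abc xyz cab a1b"

# Without a quantifier, each matching character is returned on its own
single_letters = re.findall(r"[a-c]", demo_string)
print(single_letters)  # ['a', 'b', 'c', 'c', 'a', 'b', 'a', 'b']

# With `+`, consecutive matching characters are returned as one match
letter_runs = re.findall(r"[a-c]+", demo_string)
print(letter_runs)  # ['abc', 'cab', 'a', 'b']
```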

In [None]:
# e1_regex = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
e1_string = "banana123 567 xpto89 0 9877 65 444 ac4d3my 780data 999"
assert re.findall(e1_regex, e1_string) == ['123', '567', '89', '0', '9877', '65', '444', '4', '3', '780', '999']

print("---- Well Done! All asserts passed ---- ")

## Exercise 2
Write a pattern that will match either the word `grey` or the word `gray`.

Create a variable `e2_regex` with the pattern.

__Hints:__
- You can specify options using `[]` (e.g. `[12]` matches 1 or 2)
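
As a rough illustration of the hint, the sketch below uses `[]` to accept one of two letters in the middle of a word. The words are made up and unrelated to the exercise.

```python
import re

# Hypothetical string, used only for illustration
demo_string = "bat bit but bet"

# `[ai]` matches exactly one character, which must be either `a` or `i`
demo_matches = re.findall(r"b[ai]t", demo_string)
print(demo_matches)  # ['bat', 'bit']
```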

In [None]:
# e2_regex = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
e2_string = "In the UK they write grey, but in the US they will write gray."
assert re.findall(e2_regex, e2_string) == ['grey', 'gray']

print("---- Well Done! All asserts passed ---- ")

## Exercise 3
We want to find numbers with __exactly__ 3 digits inside of a string.\
Note that we do not want *any* 3 digits, but rather 3 digits that stand together with nothing else attached to them:

`hello 123 world` -> `123`\
`hello123 world` -> ` `\
`hello 123world` -> ` `\
`hello 1234 world` -> ` `\
`hello 12 3 world` -> ` `\
`hello world 123` -> `123`\
`123 hello world` -> `123`\
`123 hello world 123` -> `123, 123`

Create a variable `e3_regex` with the pattern.

__Hints:__
- Remember that if we need an exact number of digits then we can use a quantifier (`{n}`) for this
- There is a special sequence `\b` to define word boundaries
- You can specify a digit with either a special sequence (`\d`) or via ranges (`[x-y]`)
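
The sketch below shows how `\b` and `{n}` interact, using lowercase letters rather than digits so that it does not solve the exercise for you. The sample string is an assumption made up for illustration.

```python
import re

# Hypothetical string, used only for illustration
demo_string = "it is an owl, up2 go"

# Exactly two lowercase letters with a word boundary on each side.
# `owl` has three letters and `up` is glued to `2`, so neither matches.
two_letter_words = re.findall(r"\b[a-z]{2}\b", demo_string)
print(two_letter_words)  # ['it', 'is', 'an', 'go']
```

Note that a digit counts as a word character, which is why there is no boundary between `p` and `2`.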

In [None]:
# e3_regex = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
e3_string = "banana123 567 xpto89 0 9877 65 444 ac4d3my 780data 999"
assert re.findall(e3_regex, e3_string) == ['567', '444', '999']

print("---- Well Done! All asserts passed ---- ")

## Exercise 4
Portuguese phone numbers are made up of the following components:
- Country Code -> +351
- Phone Number -> 9 digits
  
Example: `+351217206707`

Using this knowledge, we want to build a regular expression that looks for this pattern in a list of potential phone numbers.
We __do not__ need to find *valid* numbers, i.e. the number itself does not need to make sense as long as it follows the pattern!\
The string must contain only the phone number pattern, from start to finish!

Create a variable `e4_regex` with the pattern.

__Hints:__
- There are metacharacters that can be used to define the start (`^`) and end (`$`) of a string
- Certain characters like `+` need to be escaped
- Remember that if we need an exact number of characters then we can use a quantifier (`{n}`) for this
- You can specify a digit with either a special sequence (`\d`) or via ranges (`[x-y]`)
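
Here is a minimal sketch of anchoring and escaping on an unrelated pattern: a dollar sign followed by exactly two digits. The candidate list is made up for illustration. Note that `$` is itself a metacharacter, so the literal dollar sign has to be escaped as `\$`.

```python
import re

# Hypothetical pattern: a literal `$` followed by exactly two digits,
# anchored to the start (`^`) and the end (`$`) of the string
demo_regex = r"^\$\d{2}$"

# Hypothetical candidates, used only for illustration
demo_list = ["$42", "42", "$420", " $42", "$42 ", "$4x"]

# Keep only the strings that follow the pattern from start to finish
demo_valid = [item for item in demo_list if re.findall(demo_regex, item)]
print(demo_valid)  # ['$42']
```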

In [None]:
# e4_regex = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
e4_list = [
     '+351123653434',
     '351392363685',
     '+35118148911',
     '+353392456376',
     '+451796069338',
     '+351487859813',
     '+351245431368',
     '+361881195045',
     '+311467186340',
     '\\+351896146496',  # starts with a literal backslash
     '+351796O69338',
     '+351123653434 ',
]
assert [num for num in e4_list if re.findall(e4_regex, num)] == ['+351123653434', '+351487859813', '+351245431368']

print("---- Well Done! All asserts passed ---- ")

## Exercise 5
We want to find file extensions in a list of filenames.\
The general pattern that we want to find is a __dot__ (`.`) followed by __2 to 4__ characters:
- The characters can either be digits or lowercase letters;
- The first character must be a letter;
- The string must end after the extension;

Examples:
- `.pdf`
- `.mp4`
- `.xlsx`
- `.b512`

Create a variable `e5_regex` with the pattern.

__Hints:__
- Remember that if we need between `n` and `m` repetitions of a character then we can use a quantifier (`{n,m}`) for this
- Remember that there is a metacharacter that specifies the end of a string (`$`)
- Certain characters like `.` need to be escaped
- You can specify ranges using `[]`
- A quantifier such as `{n}` only affects the character or set of characters immediately before it (e.g. `ab{2}` matches `abb`; `a[b-c]{2}` matches `abb`, `abc`, `acb` or `acc`)
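
The quantifier hints can be checked with a short sketch. The strings below are made up for illustration and have nothing to do with file extensions.

```python
import re

# `{2}` applies only to the `b`, so the pattern means "a, then b twice"
only_last_char = re.findall(r"ab{2}", "abb abab ab")
print(only_last_char)  # ['abb']

# `{2}` applies to the whole set `[b-c]`, and each repetition may differ
whole_set = re.findall(r"a[b-c]{2}", "abb abc acb acc add")
print(whole_set)  # ['abb', 'abc', 'acb', 'acc']

# `{2,3}` accepts two or three repetitions and takes as many as it can
two_to_three = re.findall(r"o{2,3}", "o oo ooo oooo")
print(two_to_three)  # ['oo', 'ooo', 'ooo']
```

In the last call, `oooo` yields a single match of three characters, and the leftover `o` is too short to match on its own.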

In [None]:
# e5_regex = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
e5_list = [
     '.',
     'filename.4mp',
     'image.png',
     'document.pdf',
     'contract.doc2x',
     '123456.num',
     'test.py',
     'sound.fmpeg',
     'file.12p',
     'sheet.xlsx',
     'slides-ppt',
     'video.doc ',
]
assert [file for file in e5_list if re.findall(e5_regex, file)] == ['image.png', 'document.pdf', '123456.num', 'test.py', 'sheet.xlsx']

print("---- Well Done! All asserts passed ---- ")

## Exercise 6
Using a regular expression, we want to catch every sequence of punctuation characters.

- Valid sequences can appear anywhere in the string
- Underscores do not count as punctuation (in other words, your expression does not need to find them)
- Sequences do not need to make sense as long as they follow the rules

Create a variable `e6_regex` with the pattern.

__Hints:__
- There is a special sequence (`\w`) that represents alphanumeric characters and underscores
- There is a special sequence (`\s`) that represents whitespace
- You can specify *one or more* of a character when you follow it with `+`
- You can use `[^xyz]` to signal that you *do not want xyz*, where xyz is your sequence options
- You can group several special sequences by adding them to the same set (`[]`)
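
As an illustration of negated sets, the sketch below finds runs of characters that are neither vowels nor whitespace. It mixes literal characters with a special sequence inside the same `[^...]`, but it is a different pattern from the one this exercise asks for. The sample string is made up.

```python
import re

# Hypothetical string, used only for illustration
demo_string = "rhythm and blues"

# `[^aeiou\s]` matches any single character that is NOT a lowercase
# vowel and NOT whitespace; `+` collects consecutive ones into one match
consonant_runs = re.findall(r"[^aeiou\s]+", demo_string)
print(consonant_runs)  # ['rhythm', 'nd', 'bl', 's']
```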

In [None]:
# e6_regex = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
e6_string = "Hello, how are you Mr. camelCase? My name is snake_case."
assert re.findall(e6_regex, e6_string) == [',', '.', '?', '.']

print("---- Well Done! All asserts passed ---- ")

## Exercise 7
Write a function named `sum_numbers` that receives a string. That function should then grab all the numbers in that string (no matter where they occur) and add them.\
You can assume that only integers can occur. 

Example:
- `1banana 2apple pear3` = `6`
- `hello 10 world` = `10`
- `foo barr` = `0`

__Hints:__
- You have already solved the problem of grabbing numbers in a previous exercise
- You should use `re.findall(pattern, string)` to get a *list* of *strings* matching the pattern
- Remember that if no numbers are found, the sum is 0
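
Since `re.findall` returns strings, they have to be converted before they can be added. The sketch below starts from a hard-coded list standing in for a `re.findall` result (the values are made up), so the pattern and the function itself are still left to you.

```python
# Hypothetical result of a `re.findall` call: a list of strings, not numbers
demo_matches = ["4", "15", "23"]

# Convert each string to an integer before summing
demo_total = sum(int(match) for match in demo_matches)
print(demo_total)  # 42

# Summing over an empty list of matches gives 0, with no special case needed
empty_total = sum(int(match) for match in [])
print(empty_total)  # 0
```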

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
e7_list = [
    "1banana 2apple pear3",
    "hello 10 world",
    "foo barr",
    "portugal2020, type80, 400x",
]
assert [sum_numbers(s) for s in e7_list] == [6, 10, 0, 2500]
print("---- Well Done! All asserts passed ---- ")

## Exercise 8
Do the same as in the last exercise, but this time consider the possibility of negative numbers.

__Hints:__
- You should be able to re-use the answer of the previous exercise, using a *different* pattern
- You can use `?` to specify an optional character
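
A classic illustration of `?`, unrelated to numbers: the `u` below may or may not be present. The sample string is made up.

```python
import re

# Hypothetical string, used only for illustration
demo_string = "color colour colr"

# `u?` means "zero or one `u`", so both spellings match; `colr` lacks the `o`
demo_matches = re.findall(r"colou?r", demo_string)
print(demo_matches)  # ['color', 'colour']
```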

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
e8_list = [
    "1banana 2apple pear3",
    "hello 10 world",
    "foo barr",
    "portugal2020, type80, 400x",
    "10--10+50-50-+6+7-8+2---5+15-17",
    "lisbon-10 data20 sci30ence -5academy0",
    "999machine 111learning -70 boot60camp^-100"
]
assert [sum_numbers(s) for s in e8_list] == [6, 10, 0, 2500, 0, 35, 1000]

print("---- Well Done! All asserts passed ---- ")

## Exercise 9
We want to find sequences that only contain letters between `a` and `s`, including their uppercase variants.

Create a variable `e9_regex` with the pattern.

__Hints:__
- Check for word boundaries
- Remember that ranges can be between *any* logical sequence (i.e., not just `a-z`)
- Sequences can have 1 or more characters
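
To show that a range does not have to span the whole alphabet, here is a small sketch with the range `m-p`. It deliberately leaves out word boundaries, and the strings are made up for illustration.

```python
import re

# Only the letters m, n, o and p are allowed; other letters split the match
lowercase_runs = re.findall(r"[m-p]+", "monopoly pump")
print(lowercase_runs)  # ['monopo', 'p', 'mp']

# Two ranges can share one set to cover both lowercase and uppercase
mixed_case_runs = re.findall(r"[m-pM-P]+", "Mop NOON moon")
print(mixed_case_runs)  # ['Mop', 'NOON', 'moon']
```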

In [None]:
# e9_regex = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
e9_string = "Can you please place the basket wherever there is space?"
assert re.findall(e9_regex, e9_string) == ['Can', 'please', 'place', 'is', 'space']

print("---- Well Done! All asserts passed ---- ")

## Exercise 10
Define a function named `get_dates` that receives a string that may or may not contain dates of the format `YYYY-MM-dd`.\
The elements you find should have the correct number of digits. Other than that, no validation is needed.

For every date it finds, it should create an instance of a `Date` class that you define.
`Date` should have three attributes, which are passed to the class on initialization:
 - `year`
 - `month`
 - `day`
   
When you convert a `Date` instance to a string, it should come in the format `dd/MM/YYYY` (hint: the `__str__` method).\
The function `get_dates` should return a __list__ of `Date` objects (empty if none found).
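
If the `__str__` method is new to you, here is a minimal sketch on an unrelated class. `Temperature` and its attributes are hypothetical and only show the mechanism: Python calls `__str__` whenever an object is passed to `str()` or `print()`.

```python
class Temperature:
    """Hypothetical class, used only to illustrate `__str__`."""

    def __init__(self, value, unit):
        # Store the constructor arguments as attributes
        self.value = value
        self.unit = unit

    def __str__(self):
        # Whatever this method returns is what `str(obj)` produces
        return f"{self.value} {self.unit}"


demo_temperature = Temperature("21", "C")
print(str(demo_temperature))  # 21 C
```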

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
e10_list = [
    "Albert Einstein was born in 1879-03-14 and died in 1955-04-18",
    "Isaac Newton died in 1727-03-31 and was born in 1643-1-4",
    "Charlemagne died in 814-01-28",
    "Some day it will be 10000-01-1"
]

dates = []

for test_string in e10_list:
    dates.extend(get_dates(test_string))

test_date = Date("2020", "01", "10")

assert test_date.year == "2020" and test_date.month == "01" and test_date.day == "10", "Date does not have the correct attributes"
assert str(test_date) == "10/01/2020", "Date str() output not correctly formatted"

assert all(isinstance(date, Date) for date in dates)
assert [str(date) for date in dates] == ['14/03/1879', '18/04/1955', '31/03/1727']

print("---- Well Done! All asserts passed ---- ")

# Submit your work!

To grade your exercise notebook and submit your work to the portal, [follow the instructions in the weekly workflow!](https://github.com/LDSSA/ds-prep-course-2024/blob/main/weekly-workflow.md#link-to-grading)