In [1]:
import re
import seaborn as sns
import numpy
import os

[regex_one](https://regexone.com/lesson/matching_characters?)

### Matching Digits

1. Match any digits in string

In [2]:
string = 'hello123#9er3d7!'

In [3]:
pattern = r'\d+'

# Note that \d+ matches each run of contiguous digits in the string as one entity,
# while \d matches every digit as an individual element in the returned list

In [4]:
re.findall(pattern, string)

['123', '9', '3', '7']

In [5]:
re.search(pattern, string).group()

'123'

In [6]:
pattern = r'\d'

In [7]:
re.findall(pattern, string)

['1', '2', '3', '9', '3', '7']

In [8]:
re.search(pattern, string).group()

'1'

### Matching Any Character

* Any character can be matched using the `.` wildcard metacharacter

In [9]:
pattern = '.'

In [10]:
re.findall(pattern, string)

['h',
 'e',
 'l',
 'l',
 'o',
 '1',
 '2',
 '3',
 '#',
 '9',
 'e',
 'r',
 '3',
 'd',
 '7',
 '!']

In [11]:
re.search(pattern, string).group()

'h'

In [12]:
# find first 3 characters in the string

pattern = '...'

In [13]:
re.search(pattern, string).group()

'hel'

In [14]:
# find the 3 characters immediately before the exclamation mark (the match includes the `!`)

pattern = r'...!'  # `!` is not a metacharacter, so it needs no escape

In [15]:
re.search(pattern, string).group()

'3d7!'

### Matching Specific Characters

* Specific characters can be matched by listing them inside square brackets. For example, the pattern `[abc]` will only match a single `a`, `b`, or `c` letter and nothing else.

In [16]:
words = ['can', 'man', 'fan', 'dan', 'ran', 'pan']

In [17]:
# now, let's define a pattern to match only the first three words

pattern = '[cmf]an'  # match a `c`, `m`, or `f` immediately followed by `an`

In [18]:
def search(arr, p=pattern):
    print(f'pattern: {p}\n')
    
    for item in arr:
        find = re.search(p, item)
        if find:
            print(item)

In [19]:
search(words)

pattern: [cmf]an

can
man
fan


* **Using Inverse Expression:**

We can use the inverse (negated) expression `[^...]` to match any single character except the ones listed, followed by whatever comes next in the pattern.

* for example: the pattern `'[^drp]an'` finds words in which `an` is preceded by any character other than `d`, `r`, or `p`

In [20]:
pattern = '[^drp]an'

In [21]:
search(words, pattern)

pattern: [^drp]an

can
man
fan


### Character Ranges:

When using the square bracket notation, there is a shorthand for matching a character from a list of sequential characters: a dash indicates a character range. For example, the pattern `[0-6]` will match any single digit character from zero to six, and nothing else. Likewise, `[^n-p]` will match any single character except the letters `n` to `p`.

In [22]:
numstr = ['hello345', 'my542', 'go789', 'he349', 'her098', 'manny89']

In [23]:
pattern = '[0-6]..'  # match a single digit from 0 to 6 followed by any two characters

In [24]:
search(numstr, pattern)

pattern: [0-6]..

hello345
my542
he349
her098
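
The negated range `[^n-p]` mentioned above has no cell of its own, so here is a minimal sketch of it. The string `sample` is a hypothetical example, not part of the notebook's data.

```python
import re

# Hypothetical sample string, chosen only to illustrate the negated range.
sample = 'monopoly'

# [^n-p] matches any single character that is NOT n, o, or p.
not_n_to_p = re.findall(r'[^n-p]', sample)
print(not_n_to_p)  # ['m', 'l', 'y']
```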


Multiple character ranges can also be used in the same set of brackets, along with individual characters. An example of this is the alphanumeric `\w` metacharacter, which is equivalent to the character range `[A-Za-z0-9_]` for ASCII text and is often used to match characters in English text. In Python 3, `\w` in a `str` pattern also matches Unicode letters and digits unless the `re.ASCII` flag is set.
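
As a small check of the `\w` shorthand described above, the sketch below compares `\w` with the explicit range `[A-Za-z0-9_]`. The string `text` is a hypothetical example; the non-ASCII letter is written as an escape so the result does not depend on how the file is encoded.

```python
import re

# Hypothetical sample containing one non-ASCII letter (U+00EF, "i" with diaeresis).
text = 'na\u00efve_user42!'

unicode_words = re.findall(r'\w+', text)            # default: Unicode-aware
ascii_words = re.findall(r'\w+', text, re.ASCII)    # ASCII-only \w
range_words = re.findall(r'[A-Za-z0-9_]+', text)    # explicit character ranges

print(unicode_words)  # the whole word, including the non-ASCII letter
print(ascii_words)    # ['na', 've_user42']
print(range_words)    # ['na', 've_user42']
```

With `re.ASCII`, `\w` and `[A-Za-z0-9_]` give the same result; without it, `\w` accepts the accented letter too.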

In [25]:
words = ['Ana', 'Bob', 'Cpc', 'aax', 'bby', 'ccz']

In [26]:
pattern = '[A-C]'

In [27]:
search(words, pattern)

pattern: [A-C]

Ana
Bob
Cpc


### Catching some zzz's

So far we have specified which characters to match, but not how many times they repeat. One way to do this is to spell out exactly how many characters we want, e.g. `\d\d\d`, which matches exactly three digits.

A more convenient way is to specify how many repetitions of each character we want using the curly braces notation. For example, `a{3}` will match the `a` character exactly three times. Many regular expression engines, Python's `re` module included, also let you specify a range for this repetition, so that `a{1,3}` will match the `a` character at least once but no more than three times.

This quantifier can be used with any character or special metacharacter, for example `w{3}` (three w's), `[wxy]{5}` (five characters, each of which can be a `w`, `x`, or `y`) and `.{2,6}` (between two and six of any character).

In [28]:
words = ['wazup1', 'wazzup2', 'wazzzup3', 'wazhupz', 'wazzzzzzzup7', 'wazzzzzup5']

In [29]:
# write a pattern to match the words with three or more consecutive z's

pattern = 'z{3}'

In [30]:
search(words, pattern)

pattern: z{3}

wazzzup3
wazzzzzzzup7
wazzzzzup5
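
The range form `{m,n}` described above is not exercised by the cell, so here is a minimal sketch. It reuses a few of the `wazup` words; the surrounding `a` and `u` in the pattern are an added assumption, used so that a longer run of z's cannot match part-way through.

```python
import re

words = ['wazup1', 'wazzup2', 'wazzzup3', 'wazzzzzzzup7']

# Two or three z's, and no more, between the 'a' and the 'u'.
pattern = r'az{2,3}u'

matched = [w for w in words if re.search(pattern, w)]
print(matched)  # ['wazzup2', 'wazzzup3']
```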


### Kleene-Star, Kleene-Plus


A powerful concept in regular expressions is the ability to match an arbitrary number of characters. For example, imagine that you wrote a form that has a donation field that takes a numerical value in dollars. A wealthy user may drop by and want to donate `$25,000`, while a normal user may want to donate `$25`.

One way to express such a pattern would be to use what is known as the Kleene Star and the Kleene Plus, which essentially represents either 0 or more or 1 or more of the character that it follows (it always follows a character or group). For example, to match the donations above, we can use the pattern `\d*` to match any number of digits, but a tighter regular expression would be `\d+` which ensures that the input string has at least one digit.

These quantifiers can be used with any character or special metacharacters, for example `a+` (one or more a's), `[abc]+` (one or more of any a, b, or c character) and `.*` (zero or more of any character).

In [31]:
words = ['$25000', 'heroes', '$25', 'goons', 'N25000', '£25000', 'Y25', '$250', '$2.50']

In [32]:
pattern = r'[$]\d+'  # find the words that contain a dollar sign followed by one or more digits

In [33]:
search(words, pattern)

pattern: [$]\d+

$25000
$25
$250
$2.50


In [34]:
pattern = r'[$]\d*'  # find the words that contain a dollar sign followed by zero or more digits

In [35]:
search(words, pattern)

pattern: [$]\d*

$25000
$25
$250
$2.50
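
Both patterns print the same words above because every `$` in the list is followed by at least one digit, and `re.search` is satisfied by a partial match such as `$2` in `$2.50`. The sketch below uses `re.fullmatch` on a hypothetical list `samples` to show where `*` and `+` actually differ.

```python
import re

# Hypothetical inputs: a bare dollar sign, a whole amount, and a decimal amount.
samples = ['$', '$25', '$2.50']

# fullmatch requires the entire string to fit the pattern.
star = [s for s in samples if re.fullmatch(r'[$]\d*', s)]
plus = [s for s in samples if re.fullmatch(r'[$]\d+', s)]

print(star)  # ['$', '$25']  (zero digits is acceptable)
print(plus)  # ['$25']       (at least one digit is required)
```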


In [36]:
words = ['aaaabcc', 'aabbbbc', 'aacc', 'a', 'xyzzz', 'dfaaaku']

In [37]:
pattern = 'a{2}.*'  # Find the words that have at least two consecutive a's and print them

In [38]:
search(words, pattern)

pattern: a{2}.*

aaaabcc
aabbbbc
aacc
dfaaaku


### Optional Characters:

Another quantifier that is really common when matching and extracting text is the `?` (question mark) metacharacter which denotes __optionality__. This metacharacter allows you to match either zero or one of the preceding character or group. For example, the pattern `ab?c` will match either the strings `"abc"` or `"ac"` because the `b` is considered optional.

Similar to the dot metacharacter, the question mark is a special character, and you will have to escape it with a backslash (`\?`) to match a plain question mark character in a string.

In [39]:
words = ['1 file found?', '2 files found?', '24 files found?', 'no files found?', 'few files found']

# Match only the strings that report a numeric file count.

In [40]:
pattern = r'\d+ files? found\?'

In [41]:
search(words, pattern)

pattern: \d+ files? found\?

1 file found?
2 files found?
24 files found?
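
The `ab?c` example from the text can be checked directly. This is a minimal sketch; the list `candidates` is hypothetical.

```python
import re

candidates = ['abc', 'ac', 'abbc', 'a?c']

# b? means zero or one 'b'; fullmatch checks the whole string.
optional_b = [s for s in candidates if re.fullmatch(r'ab?c', s)]

# \? matches a literal question mark.
literal_q = [s for s in candidates if re.fullmatch(r'a\?c', s)]

print(optional_b)  # ['abc', 'ac']
print(literal_q)   # ['a?c']
```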


### Dealing with WhiteSpaces:

When dealing with real-world input, such as log files and even user input, it's difficult not to encounter whitespace. We use it to format pieces of information to make it easier to read and scan visually, and a single space can put a wrench into the simplest regular expression.

The most common forms of whitespace you will use with regular expressions are the space `(␣)`, the tab `(\t)`, the new line `(\n)` and the carriage return `(\r)` (useful in Windows environments), and these special characters match each of their respective whitespaces. In addition, a whitespace special character **`\s`** will match any of the specific whitespaces above and is extremely useful when dealing with raw input text.

In [42]:
words = ['1. abc',
        '2.  abc',
        '13.           abc',
        '4.abc']

In [43]:
pattern = r'\d\.\s'

In [44]:
search(words, pattern)

pattern: \d\.\s

1. abc
2.  abc
13.           abc
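
The cell above only uses spaces. As a sketch of `\s` matching the other whitespace characters listed in the text, the hypothetical list `lines` below separates the number from `abc` with a tab, a newline, and several spaces.

```python
import re

# Hypothetical inputs using different kinds of whitespace after the dot.
lines = ['1.\tabc', '2.\nabc', '3.   abc', '4.abc']

# \s+ accepts one or more whitespace characters of any kind.
pattern = r'^\d+\.\s+abc$'

matched = [line for line in lines if re.search(pattern, line)]
print([repr(line) for line in matched])
```

Only `'4.abc'` is left out, because it has no whitespace between the dot and `abc`.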


### Starting and ending
So far, we've been writing regular expressions that partially match pieces across all the text. Sometimes this isn't desirable. Imagine, for example, that we wanted to match the word `"success"` in a log file. We certainly don't want that pattern to match a line that says `"Error: unsuccessful operation"`! That is why it is often best practice to write regular expressions that are as specific as possible, to ensure that we don't get false positives when matching against real-world text.

One way to tighten our patterns is to define a pattern that describes both the start and the end of the line using the special `^ (hat)` and `$ (dollar sign)` metacharacters. In the example above, we can use the pattern `^success` to match only a line that begins with the word `"success"`, but not the line `"Error: unsuccessful operation"`. And if you combine both the hat and the dollar sign, you create a pattern that matches the whole line completely at the beginning and end.

Note that this is different from the hat used inside a set of brackets `[^...]` for excluding characters, which can be confusing when reading regular expressions.

Let's match the text below that only says `mission: successful` and not any other status

In [45]:
words = ['Mission: Unsuccessful',
        'Mission: Nearly successful',
        'Mission: Successful upon target capture',
         'Mission: Successful',
        'Next Mission: Successfully thought through',
         'Mission: successful',
        'Next Mission: Likely successful',
        'mission: Successful']

In [46]:
pattern = r'^[mM]ission:\s[sS]uccessful$'

In [47]:
search(words, pattern)

pattern: ^[mM]ission:\s[sS]uccessful$

Mission: Successful
Mission: successful
mission: Successful
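
The `^success` example from the text can be sketched as follows. The first entry of `log_lines` is a hypothetical log line; the second is the one quoted above.

```python
import re

log_lines = ['success: job finished', 'Error: unsuccessful operation']

# Without an anchor, 'success' is also found inside 'unsuccessful'.
unanchored = [line for line in log_lines if re.search(r'success', line)]

# With ^, the match must begin at the start of the line.
anchored = [line for line in log_lines if re.search(r'^success', line)]

print(len(unanchored))  # 2
print(anchored)         # ['success: job finished']
```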


### Match groups
Regular expressions allow us to not just match text but also to extract information for further processing. This is done by defining groups of characters and capturing them using the special parentheses `( and )` metacharacters. Any subpattern inside a pair of parentheses will be captured as a group. In practice, this can be used to extract information like phone numbers or emails from all sorts of data.

Imagine for example that you had a command line tool to list all the image files you have in the cloud. You could then use a pattern such as `^(IMG\d+\.png)$` to capture and extract the full filename, but if you only wanted to capture the filename without the extension, you could use the pattern `^(IMG\d+)\.png$` which only captures the part before the period.

Let's use regex to find only files that start with `ju` and end with `.txt` in the datasets folder

In [48]:
dataset_files = os.listdir('datasets')
print(f'total files is: {len(dataset_files)}')

total files is: 46


In [49]:
pattern = r'^ju.*\.txt$'

In [50]:
search(dataset_files, pattern)

pattern: ^ju.*\.txt$

jul4.txt
jul4merge.txt
jul4zoom.txt
june20.txt
june27.txt


Let's use regex to find only files that end with .csv in the datasets folder

In [51]:
pattern = r'\.csv$'

In [52]:
search(dataset_files, pattern)

pattern: \.csv$

12th_jul.csv
17may_69_students.csv
attendance_16thMay.csv
attendance_condensed16thMay.csv
fuel-econ.csv
online-job-postings.csv
p1_incomplete.csv
p1_ungraded.csv
p1_unsubs.csv
pokemon.csv
session-4356-report-5_15_2022.csv
session-4356-report-5_17_2022.csv
session-4356-report-6_1_2022.csv
session-4356-report-6_21_2022.csv
session-4356-report-6_7_2022.csv
session-4356-report-7_12_2022.csv
students.csv


Let's use regex to find only the files that begin with `session-` in the datasets folder

In [53]:
pattern = r'^session'

In [54]:
search(dataset_files, pattern)

pattern: ^session

session-4356-report-5_15_2022.csv
session-4356-report-5_17_2022.csv
session-4356-report-6_1_2022.csv
session-4356-report-6_21_2022.csv
session-4356-report-6_7_2022.csv
session-4356-report-7_12_2022.csv
sessions_data.ipynb


Let's use regex to find the movies with digits in their names, without including the opening index numbers...

In [55]:
movies = os.listdir('datasets/ebert_reviews')
print(f'total movies is: {len(movies)}')

total movies is: 88


In [100]:
posters = os.listdir('datasets/bestofrt_posters')
print(f'total posters is: {len(posters)}')

total posters is: 94


In [114]:
search(movies, pattern)

pattern: (.*(\d+))

1-the-wizard-of-oz-1939-film.txt
10-metropolis-1927-film.txt
100-battleship-potemkin.txt
11-e.t.-the-extra-terrestrial.txt
12-modern-times-film.txt
14-singin-in-the-rain.txt
15-boyhood-film.txt
16-casablanca-film.txt
17-moonlight-2016-film.txt
18-psycho-1960-film.txt
19-laura-1944-film.txt
2-citizen-kane.txt
20-nosferatu.txt
21-snow-white-and-the-seven-dwarfs-1937-film.txt
22-a-hard-day27s-night-film.txt
23-la-grande-illusion.txt
25-the-battle-of-algiers.txt
26-dunkirk-2017-film.txt
27-the-maltese-falcon-1941-film.txt
29-12-years-a-slave-film.txt
3-the-third-man.txt
30-gravity-2013-film.txt
31-sunset-boulevard-film.txt
32-king-kong-1933-film.txt
33-spotlight-film.txt
34-the-adventures-of-robin-hood.txt
35-rashomon.txt
36-rear-window.txt
37-selma-film.txt
38-taxi-driver.txt
39-toy-story-3.txt
4-get-out-film.txt
40-argo-2012-film.txt
41-toy-story-2.txt
42-the-big-sick.txt
43-bride-of-frankenstein.txt
44-zootopia.txt
45-m-1931-film.txt
46-wonder-woman-2017-film.txt
48-
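
The pattern shown in the output above, `(.*(\d+))`, matches every filename because the leading index number already supplies a digit. One possible way to skip the index, offered as a sketch and not as the notebook's original solution, is to consume the index and its dash first and then require a digit later in the name. The list `sample_files` is a small subset copied from the listing above.

```python
import re

sample_files = ['10-metropolis-1927-film.txt',
                '100-battleship-potemkin.txt',
                '39-toy-story-3.txt',
                '2-citizen-kane.txt']

# ^\d+-  consumes the opening index and its dash,
# .*\d   then requires at least one more digit somewhere after it.
pattern = r'^\d+-.*\d'

with_digits = [name for name in sample_files if re.search(pattern, name)]
print(with_digits)  # ['10-metropolis-1927-film.txt', '39-toy-story-3.txt']
```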

Let's use regex to capture the full names of all files ending with .png

In [101]:
pattern = r'^(\d+.*\.png)$'

In [102]:
search(posters, pattern)

pattern: ^(\d+.*\.png)$

17_Moonlight_(2016_film).png
21_Snow_White_and_the_Seven_Dwarfs_(1937_film).png
4_Get_Out_(film).png
83_Hell_or_High_Water_(film).png
86_La_La_Land_(film).png


In [103]:
pattern = r'^(\d+.*)\.png$'

In [104]:
search(posters, pattern)

pattern: ^(\d+.*)\.png$

17_Moonlight_(2016_film).png
21_Snow_White_and_the_Seven_Dwarfs_(1937_film).png
4_Get_Out_(film).png
83_Hell_or_High_Water_(film).png
86_La_La_Land_(film).png
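
The two outputs above look identical because the `search` helper prints the whole item instead of the captured group. To see what a group captures, read it from the match object with `.group(1)`. The sketch below uses the `IMG` pattern from the text on a hypothetical list `filenames`.

```python
import re

# Hypothetical filenames, following the IMG example in the text.
filenames = ['IMG001.png', 'IMG002.png', 'notes.txt']

stems = []
for name in filenames:
    match = re.search(r'^(IMG\d+)\.png$', name)
    if match:
        # group(0) is the whole match; group(1) is the first parenthesised part.
        stems.append(match.group(1))

print(stems)  # ['IMG001', 'IMG002']
```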


### Nested groups
When you are working with complex data, you can easily find yourself having to extract multiple layers of information, which can result in nested groups. Generally, the results of the captured groups are in the order in which they are defined (in order by open parenthesis).

Take the example from the previous section, of capturing the filenames of all the image files you have in a list. If each of these image files had a sequential picture number in the filename, you could extract both the filename and the picture number using the same pattern by writing an expression like `^(IMG(\d+))\.png$` (using nested parentheses to capture the digits).

The nested groups are read from left to right in the pattern, with the first capture group being the contents of the first parentheses group, etc.
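
A minimal sketch of the nested pattern above, applied to a hypothetical filename:

```python
import re

# Hypothetical filename following the IMG example in the text.
match = re.search(r'^(IMG(\d+))\.png$', 'IMG042.png')

# Groups are numbered by the position of their opening parenthesis.
outer = match.group(1)   # filename without the extension
inner = match.group(2)   # picture number only

print(match.groups())  # ('IMG042', '042')
```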

**[Link](https://regexone.com/lesson/nested_groups?)**

### Conditionals

As we mentioned before, it's always good to be **precise**, and that applies to coding, talking, and even regular expressions. For example, you wouldn't write a grocery list that tells someone to `Buy more .*`, because you would have no idea what you would get back. Instead you would write `Buy more milk` or `Buy more bread`, and in regular expressions we can define these conditionals explicitly.

Specifically, when using groups, you can use the `|` (logical OR, a.k.a. the pipe) to denote different possible sets of characters. In the above example, we can write the pattern `Buy more (milk|bread|juice)` to match only the strings `Buy more milk`, `Buy more bread`, or `Buy more juice`.

Like normal groups, a condition can contain any sequence of characters or metacharacters. For example, `([cb]ats*|[dh]ogs?)` would match words such as cats or bats, or dogs or hogs, with the trailing `s` optional. Writing patterns with many conditions can be hard to read, so you should consider making them separate patterns if they get too complex.

In [116]:
words = ['I love cats',
        'I love dogs',
        'I love logs',
        'I love cogs',
        'I love cots',
        'I love bags']

In [117]:
# Write a pattern that captures I love bags, cats and dogs only from the words above.

In [124]:
pattern = '([cb]a|[d]ogs)'

In [125]:
search(words, pattern)

pattern: ([cb]a|[d]ogs)

I love cats
I love dogs
I love bags
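
The grocery-list pattern from the text can be sketched like this. The list `phrases` is hypothetical; its last entry is there to show a non-match.

```python
import re

phrases = ['Buy more milk', 'Buy more bread', 'Buy more juice', 'Buy more soda']

# The group lists the only acceptable alternatives, separated by |.
pattern = r'^Buy more (milk|bread|juice)$'

items = []
for phrase in phrases:
    match = re.search(pattern, phrase)
    if match:
        items.append(match.group(1))  # the alternative that matched

print(items)  # ['milk', 'bread', 'juice']
```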


### Other special characters
This section covers some extra metacharacters.

We have already learned the most common metacharacters to capture 

* digits using `\d`, 
* whitespace using `\s`, and 
* alphanumeric characters (letters, digits, and the underscore) using `\w`, 

but regular expressions also provide a way of specifying the opposite set of each of these metacharacters by using the upper-case letter. For example, 

* `\D` represents any non-digit character, 
* `\S` any non-whitespace character, and 
* `\W` any non-alphanumeric character (such as punctuation). 

Depending on how you compose your regular expression, it may be easier to use one or the other.

Additionally, there is a special metacharacter `\b` which matches the boundary between a word and a non-word character. It's most useful in capturing entire words (for example by using the pattern `\w+\b`).
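
The sketch below tries the negated classes and the word boundary on short hypothetical strings.

```python
import re

# \D: runs of non-digits
non_digits = re.findall(r'\D+', 'abc123def')
# \S: runs of non-whitespace
non_space = re.findall(r'\S+', 'a b\tc')
# \W: single non-alphanumeric characters (here punctuation and a space)
non_word = re.findall(r'\W', 'hi, there!')

# \b keeps 'cat' from matching inside 'concat' or 'cats'.
text = 'cat concat cats 42'
whole_word = re.findall(r'\bcat\b', text)
anywhere = re.findall(r'cat', text)

print(non_digits)  # ['abc', 'def']
print(non_space)   # ['a', 'b', 'c']
print(non_word)    # [',', ' ', '!']
print(whole_word)  # ['cat']
print(anywhere)    # ['cat', 'cat', 'cat']
```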

### Additional Problems:

**[Link](https://regexone.com/problem/matching_decimal_numbers)**