# 🛠 IFQ718 Module 04 Exercises-01

## 🔍  Context: More string operations

The `string` data type has been discussed many times in the unit, and for good reason. It is one of the core types shared by other programming languages and also by everyday applications such as Word and Excel. In this module, we are going to explore advanced operations that can be performed on `string` objects, like search and replace.




## Regular Expressions

Regular expressions (REs, regexes, regex patterns) are statements of a highly specialized programming language that specify the rules for matching substrings of a query string.

For example, you may want to check if a pattern exists within a string, like checking for a valid email address; i.e., `local-part` followed by `@` followed by `domain`. 


In [None]:
# Import the regular expressions module, `re`
import re

### Simple patterns: Checking for the existence of substrings

A substring may be a single character or an entire paragraph. No matter the size, the idea is that you are looking for a string within a string.

We have already used the Python `in` operator to check for the presence of one or more characters in a string:
    
```python
'A' in 'Apple'
```

This can be achieved using regular expressions, too. However, instead of returning a Boolean, the `re.findall()` function returns a list of all non-overlapping matches (an empty list if there are none).

**Single or multiple characters**

In [None]:
print(re.findall('A', 'Apple'))

In [None]:
print(re.findall('p', 'Apple'))

In [None]:
print(re.findall('a', 'Apple'))

But what about matching multiple characters with `re`?

In [None]:
print(re.findall('pple', 'Apple'))

In [None]:
print(re.findall('$', '$10.50'))

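This time we did not find the dollar sign: the result is `['']`, a single empty match. That is because `$` is a regular expression metacharacter that matches the end of the string rather than a literal `$`.

Try again, this time *escaping* the metacharacter with a backslash so that it is treated literally. The pattern is written as a raw string (`r'...'`) so that Python passes the backslash through to `re` unchanged: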

In [None]:
print(re.findall(r'\$', '$10.50'))
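
When the text to be matched comes from elsewhere and contains several metacharacters, escaping each one by hand is error-prone. The following is a minimal sketch that uses `re.escape()` from the standard library instead; the variable names and the sample sentence are made up for illustration.

```python
import re

price = '$10.50'  # literal text to find; contains the metacharacters `$` and `.`

# re.escape() puts a backslash in front of the characters that have a special meaning
pattern = re.escape(price)

# Only the literal text matches; `$10150` does not, because the dot is escaped
matches = re.findall(pattern, 'It was $10.50 yesterday and $10150 in my dreams')

print(pattern)
print(matches)
```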

**Metacharacters: using square brackets `[` `]` and the caret `^`**

Most characters match themselves; however, some *metacharacters* specify rules for how the regular expression pattern should be evaluated.

Square brackets `[` `]` are used for matching a set of characters. 
* Not all the characters must be matched.
* The set can be 
   * a list of explicit characters that could be matched `[abc]` or,
   * a range of characters `[a-zA-Z]`.

The caret `^`, when placed first inside the brackets, complements the set: the pattern then matches any character that is *not* listed.

In [None]:
print(re.findall('[xyz]', 'The quick brown cat jumps over the lazy dog')) # we changed `fox` to `cat`

In [None]:
print(re.findall('[a-d]', 'Data Carpentry'))

In [None]:
print(re.findall('[a-dA-D]', 'Data Carpentry'))

In [None]:
print(re.findall('[^aeiou]', 'Data Carpentry'))
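
The caret has a second meaning outside square brackets, where it anchors the pattern to the start of the string (see the summary table further down). The sketch below contrasts the two uses on the same query string; the variable names are illustrative only.

```python
import re

text = 'Data Carpentry'

# Inside brackets, a leading ^ negates the set: everything except lowercase vowels
# (note that the space and the capital letters are kept)
not_vowels = re.findall('[^aeiou]', text)

# Outside brackets, ^ anchors the pattern to the start of the string
starts_with_capital = re.findall('^[A-Z]', text)
starts_with_vowel = re.findall('^[aeiou]', text)

print(not_vowels)
print(starts_with_capital)
print(starts_with_vowel)
```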

This leads to asking questions like, *does the string "Data Carpentry" contain any of the letters a, b, or c?*

In [None]:
if len(re.findall('[abc]', 'Data Carpentry')) > 0:
    print('Yes, it does.')
else:
    print('No, it does not.')
    
# Try changing the set of characters to any that do not exist in the query string.
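
If only a yes/no answer is needed, building the full list of matches is not required. One possible alternative, sketched here, is `re.search()`, which returns a match object for the first match or `None` if there is none. The helper `contains_any` is a hypothetical name introduced for this example, and it assumes the characters passed in are plain letters or digits.

```python
import re

def contains_any(chars, text):
    """Return True if `text` contains at least one of the characters in `chars`.

    Hypothetical helper for illustration; `chars` is assumed to hold plain
    letters or digits, with no regular expression metacharacters.
    """
    return re.search(f'[{chars}]', text) is not None

has_abc = contains_any('abc', 'Data Carpentry')  # 'a' occurs in the string
has_qxz = contains_any('qxz', 'Data Carpentry')  # none of these occur

print(has_abc)
print(has_qxz)
```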

**With much less restriction, the dot `.`**

It matches any single character except a newline.

In [None]:
print(re.findall('.', 'The quick brown fox jumps over the lazy dog'))
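
To see the "except a newline" part in action, the sketch below runs the same pattern over a two-line string, first with the default behaviour and then with the `re.DOTALL` flag, which makes the dot match newlines as well. The sample string is made up for illustration.

```python
import re

two_lines = 'ab\ncd'

# By default the dot skips the newline character
without_newline = re.findall('.', two_lines)

# With re.DOTALL the newline is matched like any other character
with_newline = re.findall('.', two_lines, flags=re.DOTALL)

print(without_newline)
print(with_newline)
```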

**Repeating matching patterns using braces notation `{n,m}`**

When an expression is followed by braces, it must be repeated to match: `{n}` means exactly `n` times, and `{n,m}` means at least `n` and at most `m` times:

In [None]:
# To extract characters in pairs
print(re.findall('.{2}', 'The quick brown fox jumps over the lazy dog'))

In [None]:
# Extract vowel repetition
print(re.findall('[aeiou]{2}', 'the hippopotamus struggles to hula hoop'))
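
The different forms of the braces notation can be compared side by side. This is a small sketch on a made-up string of numbers; `[0-9]` is the range-style character set introduced earlier.

```python
import re

numbers = '7 42 123 98765'

# Exactly two digits at a time: longer runs are cut into pairs, leftovers are dropped
exactly_two = re.findall('[0-9]{2}', numbers)

# Two or three digits: the longer option is preferred when available
two_to_three = re.findall('[0-9]{2,3}', numbers)

# Two or more digits: leaving out the upper bound removes the limit
two_or_more = re.findall('[0-9]{2,}', numbers)

print(exactly_two)
print(two_to_three)
print(two_or_more)
```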

**Extracting words**

This example introduces `\w` and `+`.

The metacharacter `\w` matches Unicode characters that are used in the words of many languages. For English text it is approximately equivalent to `[a-zA-Z0-9_]`, so it does not match whitespace or punctuation.

The metacharacter `+`:
* `+` for matching one or more of the preceding regular expression
* It is equivalent to `{1,}`, that is, braces with the upper bound left out.

Not used here, but similar to `+`, are the following metacharacters:
* `?` for matching zero or one of the preceding regular expression
* `*` for matching zero or more repetitions of the preceding regular expression

In [None]:
print(re.findall(r'\w+', 'the hippopotamus struggles to hula hoop'))

Because our string about the [hippo](https://ell.stackexchange.com/a/49969) contains only letters and spaces, the following expression achieves the same result:

In [None]:
print(re.findall('[a-zA-Z]+', 'the hippopotamus struggles to hula hoop'))
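
The `?` and `*` metacharacters listed above can be compared with `+` on a small, made-up string. In each pattern the quantifier applies only to the `b` that precedes it.

```python
import re

text = 'a ab abbb'

# `?` : an `a` followed by zero or one `b`
zero_or_one = re.findall('ab?', text)

# `*` : an `a` followed by zero or more `b`
zero_or_more = re.findall('ab*', text)

# `+` : an `a` followed by one or more `b`
one_or_more = re.findall('ab+', text)

print(zero_or_one)
print(zero_or_more)
print(one_or_more)
```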

**Extracting numbers**

In [None]:
print(re.findall(r'\d+', 'Iceberg lettuce hit $10'))

In [None]:
print(re.findall(r'\d+', 'And, no doubt, $10.50'))

In [None]:
# The dot is escaped so that it matches a literal decimal point, not any character
print(re.findall(r'\d+\.\d+', 'I wonder when the hip pocket will be out by $11.50 or $12.50'))
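
The three patterns above each handle either whole numbers or decimals. One way to capture both in a single pattern, sketched here as an assumption rather than the unit's prescribed approach, is to make the decimal part optional with `?`. The `(?:...)` syntax is a non-capturing group, which is not covered in this module; it is used so that `re.findall()` still returns the whole match. The sample sentence is made up.

```python
import re

text = 'Iceberg lettuce hit $10, then $11.50, and 3 days later $12.5'

# A literal `$`, one or more digits, then optionally a decimal point and more digits
prices = re.findall(r'\$\d+(?:\.\d+)?', text)

print(prices)
```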

### A summary of metacharacters

|Metacharacter|Description|
|:----:|----|
|.|Period matches any single character except a line break.|
|[ ]|Character class. Matches any character contained between the square brackets.|
|[^ ]|Negated character class. Matches any character that is not contained between the square brackets|
|*|Matches 0 or more repetitions of the preceding symbol.|
|+|Matches 1 or more repetitions of the preceding symbol.|
|?|Makes the preceding symbol optional.|
|{n,m}|Braces. Matches at least "n" but not more than "m" repetitions of the preceding symbol.|
|(xyz)|Character group. Matches the characters xyz in that exact order.|
|&#124;|Alternation. Matches either the characters before or the characters after the symbol.|
|&#92;|Escapes the next character. This allows you to match reserved characters <code>[ ] ( ) { } . * + ? ^ $ \ &#124;</code>|
|^|Matches the beginning of the input.|
|$|Matches the end of the input.|

From the excellent resource, [*Learn Regex The Easy Way*](https://github.com/ziishaned/learn-regex/blob/master/README.md).
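
A few entries in the table, namely alternation and the two anchors, have not appeared in the examples so far. The following short sketch tries them on the sentence used earlier; the variable names are illustrative.

```python
import re

sentence = 'The quick brown cat jumps over the lazy dog'

# Alternation: match either `cat` or `dog`
either_animal = re.findall('cat|dog', sentence)

# Anchors: `^` ties the pattern to the start, `$` to the end of the input
at_start = re.findall('^The', sentence)
at_end = re.findall('dog$', sentence)

# `cat` occurs in the sentence, but not at the start, so nothing is found
cat_at_start = re.findall('^cat', sentence)

print(either_animal)
print(at_start)
print(at_end)
print(cat_at_start)
```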

### Splitting strings using regular expressions

We have seen the `.split()` method of `string` objects. Now we can use the `re.split()` function, which is more flexible because the separator is a regular expression pattern:

In [None]:
'The quick brown fox jumps over the lazy dog'.split(' ') # we have seen this already

But what happens if the sentence contains punctuation that should be removed as the split occurs?

`\s` matches any whitespace character, and is approximately equivalent to `[ \t\n\r\f\v]`.

In [None]:
# Split on runs of whitespace and the punctuation marks , . ? !
print(re.split(r'[\s,.?!]+', 'Hello, World! How are you doing? I hope you\'re well.'))
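
When the text ends with one of the separator characters, `re.split()` returns an empty string as the last item. A minimal sketch of one way to deal with that is to filter out the empty strings afterwards. The name `words_only` is made up for this example, and the comma is included in the character set here as an assumption about which punctuation should be removed.

```python
import re

text = "Hello, World! How are you doing? I hope you're well."

# Split on runs of whitespace and the punctuation marks , . ? !
parts = re.split(r'[\s,.?!]+', text)

# The final full stop leaves an empty string at the end; keep only non-empty parts
words_only = [part for part in parts if part]

print(parts)
print(words_only)
```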

### Replacing strings using regular expressions

Suppose I want to replace all the vowels in a sentence with `-`. I could do this with a loop:

In [None]:
sentence = "The storm rolled out to sea as the people watched from upon the headland."
new_sentence = ""

for c in sentence:
    if c in 'aeiou':
        new_sentence += '-'
    else:
        new_sentence += c
        
print(new_sentence)

Alternatively, the loop can be reduced to one line using a regular expression and the `re.sub()` function:

In [None]:
re.sub('[aeiou]', '-', sentence)
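
Both the loop and the one-liner above leave uppercase vowels untouched. The sketch below shows two optional arguments of `re.sub()` that adjust this behaviour: `flags=re.IGNORECASE` to match regardless of case, and `count` to limit the number of replacements. The sample phrase is made up for illustration.

```python
import re

phrase = 'An Owl ate.'

# Default: only the lowercase vowels listed in the set are replaced
lowercase_only = re.sub('[aeiou]', '-', phrase)

# Ignore case so that `A` and `O` are replaced too
any_case = re.sub('[aeiou]', '-', phrase, flags=re.IGNORECASE)

# Replace only the first match
first_only = re.sub('[aeiou]', '-', phrase, count=1)

print(lowercase_only)
print(any_case)
print(first_only)
```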

### More advanced matching with multiple expressions and groups

We may want to match particular components of a line, with each having their own regular expression pattern.

Try this:

In [None]:
for sentence in ['Anna and John are friends', 'Kayla and Lura are also friends', 'Rima and Tanya are indeed, friends']:
    
    # notice `and` is included in the overall pattern
    match = re.match(r'(\w+) and (\w+)', sentence) 
    
    print(f'The first name is `{match.group(1)}`, and the second name is `{match.group(2)}`.')
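
The loop above works because every sentence begins with the pattern. `re.match()` only looks at the start of the string and returns `None` when nothing matches there, in which case calling `.group()` raises an error. The following sketch, using made-up example sentences, checks for `None` and uses `re.search()` to find the pattern anywhere in the string.

```python
import re

pattern = r'(\w+) and (\w+)'
examples = ['Anna and John are friends', 'Friends: Kayla and Lura', 'No pair here']

results = []
for text in examples:
    match = re.search(pattern, text)  # looks anywhere in the string
    if match is None:
        results.append(None)          # no pair of names found
    else:
        results.append((match.group(1), match.group(2)))

# re.match() only tries the start of the string, so this one fails
at_start_only = re.match(pattern, 'Friends: Kayla and Lura')

print(results)
print(at_start_only)
```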

### ✍ Activity 1:

In [None]:
words = ['rosebud', 'injury', 'wonders', 'sugarcoat', 'boatload', 'signature', 'libraries', 'engineering', 'drawstring', 'jotted', 'midwives', 'kiwi', 'mache', 'diuretics', 'washy', 'amply', 'yech', 'correction', 'limousines', 'cocoon', 'baddest', 'branched', 'imprisonment', 'uninterrupted', 'naw', 'highest', 'moneys', 'flaked', 'ordeal', 'flawed', 'quahog', 'unauthorized', 'contain', 'examine', 'resigning', 'disarming', 'stoney', 'sneakers', 'timing', 'marbles', 'strangle', 'sociology', 'prejudice', 'wretch', 'extensions', 'chicken', 'cob', 'publish', 'argentine', 'crass', 'accordingly', 'topaz', 'berate', 'untouched', 'nephew', 'brushed', 'responses', 'filmmaking', 'chandler', 'ultrasound', 'mare', 'privy', 'representations', 'castles', 'assemblies', 'trussed', 'punishes', 'nuance', 'revolutionary', 'juggernaut', 'suffers', 'urinate', 'tong', 'porterhouse', 'carnival', 'prevents', 'prevail', 'upholstery', 'diaper', 'saxophone', 'innkeeper', 'ivories', 'dorky', 'shifter', 'chili', 'fifteenth', 'disarray', 'husbands', 'haw', 'crier', 'forest', 'dimpled', 'tattoos', 'unusual', 'sings', 'espressos', 'medicare', 'strategy', 'furs']

**Filter all words that have more than 3 vowels**

In [None]:
# write your code here

**Find all words that follow the [I before E except after C](https://en.wikipedia.org/wiki/I_before_E_except_after_C) rule**

In [None]:
# write your code here

**Find all words ending in `er`**

In [None]:
# write your code here

### ✍ Activity 2:

Repeat Activity 1 but using this list of words:

Note: this time, the words are stored together in a single string, separated by spaces.

In [None]:
# Run this cell if you don't have the `popular-words.txt` file already
import os
import urllib.request

os.makedirs('data', exist_ok=True)  # create the target folder if it is missing
urllib.request.urlretrieve('https://raw.githubusercontent.com/dolph/dictionary/master/popular.txt', 'data/popular-words.txt')

In [None]:
words = ""

with open('data/popular-words.txt', 'r') as fp:
    words = ' '.join([
        word.strip()
        for word in fp
    ])
    
print(words[0:500])
print('...')
print(words[-500:])

**Filter all words that have more than 3 vowels**

In [None]:
# write your code here

**Find all words that follow the [I before E except after C](https://en.wikipedia.org/wiki/I_before_E_except_after_C) rule**

In [None]:
# write your code here

**Find all words ending in `er`**

In [None]:
# write your code here

### ✍ Activity 3:

**Replace all occurrences of `3` with `three` for the given sentence.**

In [None]:
sentence = "They sang 3 songs during the 3 meals"

# write your code here

**For the list `words`, filter to keep all words starting with `act` and ending with at most one more character or `es`.**

In [None]:
words = [
    'act', 'acted', 'actin', 'acting', 'action', 'actionable', 'actions', 'activate', 'activated', 'activating', 
    'activation', 'activators', 'active', 'actively', 'activists', 'activities', 'activity', 'actor', 'actors', 
    'actress', 'actresses', 'acts', 'actual', 'actuality', 'actualization', 'actually', 'actuarial'
]

# write your code here

### ✍ Activity 4:

Professor David Lovell loves cryptic crosswords. Use your cruciverbal skills or your regular expression powers to help him finish this off.

**Word finding with regular expressions**

<img src="crossword.png" />

In [None]:
# write your code here