In [1]:
import pandas as pd
import numpy as np

# Text wrangling and regular expressions

Now we'll move on to think about **text data**. Along with date-time data, text causes particular difficulties in data analysis. Let's have a look at the `feedback` `DataFrame` created during setup earlier. 

First, we're going to set `pandas` `max_colwidth` (maximum column width) option to 200 to let us see all of the messages left by item purchasers

In [2]:
pd.options.display.max_colwidth = 200

feedback = pd.DataFrame({
    'item_no': pd.Series([2, 2, 3, 4, 5, 1, 9, 5, 7, 10, 8], dtype='Int64'),
    'date': pd.Series(['2020-04-11', '2020-04-12', '2020-05-13', np.nan, '2020-05-28', '2020-05-29',
                       '2020-06-01', '2020-06-07', '2020-06-300', '2020-06-30', '2020-08-01']),
    'rating': pd.Series([5, 1, 3, 5, 4, 3, 2, 5, 1, 4, 5], dtype='Int64'),
    'message': pd.Series(["Ideal for my lunchbox - Dave Smith", "Broke first time I used it, I want a refund! Get back to me at lenore29@gmail.com or 07700 900796",
                        "My name is Tony 07700900829", "Bought another one for my sister", "Works pretty well, but can't handle carrots", 
                        "The concept is great, the execution- not so great, thin handles - Eleanor & dave", "Bit of a cheap version of the real thing",
                        "Arrived on time, as expected", "Customer service terrible - hello anyone there?! DaveAllsop@yahoo.co.uk, 07700 900572 or 0131 9496 0886", 
                        "Workks well, seems solid, good value", "Great finish on it, really decent build quality"], dtype='string')
})

In [3]:
feedback

Unnamed: 0,item_no,date,rating,message
0,2,2020-04-11,5,Ideal for my lunchbox - Dave Smith
1,2,2020-04-12,1,"Broke first time I used it, I want a refund! Get back to me at lenore29@gmail.com or 07700 900796"
2,3,2020-05-13,3,My name is Tony 07700900829
3,4,,5,Bought another one for my sister
4,5,2020-05-28,4,"Works pretty well, but can't handle carrots"
5,1,2020-05-29,3,"The concept is great, the execution- not so great, thin handles - Eleanor & dave"
6,9,2020-06-01,2,Bit of a cheap version of the real thing
7,5,2020-06-07,5,"Arrived on time, as expected"
8,7,2020-06-300,1,"Customer service terrible - hello anyone there?! DaveAllsop@yahoo.co.uk, 07700 900572 or 0131 9496 0886"
9,10,2020-06-30,4,"Workks well, seems solid, good value"


Neat, let's focus on the message column. We have a few features in this text data that we need to deal with. For a start, purchasers have sometimes added their email addresses and phone numbers: we should redact this sensitive information

In [4]:
feedback.message

0                                                                          Ideal for my lunchbox - Dave Smith
1           Broke first time I used it, I want a refund! Get back to me at lenore29@gmail.com or 07700 900796
2                                                                                 My name is Tony 07700900829
3                                                                            Bought another one for my sister
4                                                                 Works pretty well, but can't handle carrots
5                            The concept is great, the execution- not so great, thin handles - Eleanor & dave
6                                                                    Bit of a cheap version of the real thing
7                                                                                Arrived on time, as expected
8     Customer service terrible - hello anyone there?! DaveAllsop@yahoo.co.uk, 07700 900572 or 0131 9496 0886
9         

## Regular Expressions ('regex')

You may have used 'find' or perhaps 'find and replace' in applications like word processors or spreadsheets to perform simple text querying and manipulation. But the pattern matching capabilities of these features are often very limited. **Regular expressions** (commonly called **'regex'**) are a much more powerful way to search for patterns in text.

A regex is written as a normal string; the string tells the regex engine which pattern to search for in the text. We'll be honest up front and admit that most regex patterns look like incomprehensible gibberish when you first start working with them, but don't be dismayed! We'll start simple and gradually add complexity.

First, let's start with a few simple strings to work on. We're going to start by trying to extract the **e-mail addresses** in these strings

In [5]:
strings = pd.Series([
    "Contact me at amelia_holden@fakemail.com", 
    "I'm at xiao_h_97@sprint.co.uk, YapApp handle is @@xiaoh-97",
    "Bernice.Yaxley@fakemail.co.uk or YapApp @@byaxley"
])
strings

0                      Contact me at amelia_holden@fakemail.com
1    I'm at xiao_h_97@sprint.co.uk, YapApp handle is @@xiaoh-97
2             Bernice.Yaxley@fakemail.co.uk or YapApp @@byaxley
dtype: object

Now, `pandas` makes available a range of `StringMethods` that accept regex as input arguments. We access these methods using the **`.str` accessor** on the `Series` of interest. The ones you are most likely to use are:

* `.str.extract()` - search for and extract specified patterns from each element of a text `Series`
* `.str.contains()` - search for a specified pattern in each element of a text `Series` (returns `True` or `False`)
* `.str.replace()` - search for and replace a specified pattern in each element of a text `Series` (requires search pattern and replacement string)
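
As a rough sketch of how the three methods differ in what they return, here they are applied to a small made-up `Series` (the `notes` data and the variable names are assumptions for illustration, not part of the lesson's datasets):

```python
import pandas as pd

# Hypothetical example data, used only for this illustration
notes = pd.Series(["order 66 shipped", "no digits here", "order 7 delayed"])

# .str.extract() returns a DataFrame with one column per capture group
first_number = notes.str.extract(r'(\d+)')

# .str.contains() returns a boolean Series
has_number = notes.str.contains(r'\d+')

# .str.replace() returns a new Series; regex=True asks pandas to treat
# the pattern as a regex rather than as literal text
masked = notes.str.replace(r'\d+', '#', regex=True)

print(first_number)
print(has_number)
print(masked)
```

Don't worry about the `\d+` pattern yet: it means 'one or more digits', and the pieces it is built from are introduced step by step in the rest of this section.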

### Ranges, occurrences and metacharacters

Let's start by pulling out the '@' character using `.str.extract()`. We'll assume that all e-mails have to contain this character. We start by putting the '@' inside **parentheses** `()`: these define what is known as a **'capture group'** in regex. Most of the time, you'll be using just one capture group per regex.

In [6]:
strings.str.extract('(@)')

Unnamed: 0,0
0,@
1,@
2,@


Neat, that worked! Now we need to find some letters and numbers in front of and after the '@' symbol. Let's start with characters before '@'. We could extract a specific set of characters like

In [7]:
strings.str.extract('(holden@)')

Unnamed: 0,0
0,holden@
1,
2,


but you can see this is pretty useless, as we are just starting to hard code a particular e-mail address. What we need to do instead is check for a range of characters. We do this in regex using square brackets `[]`. So, for example, `[a-z]` tells regex to look for a single lowercase letter in the range from 'a' to 'z'

In [8]:
strings.str.extract('([a-z]@)')

Unnamed: 0,0
0,n@
1,
2,y@


This looks more promising. Now we need to tell regex to look for **multiple occurrences** of such a character. We have two options for this:

* `*` means 'zero or more occurrences' (this is called the 'Kleene star') 
* `+` means 'one or more occurrences'

These symbols affect the **character (or bracketed range) occurring immediately before them**. Let's try `+` first

In [9]:
strings.str.extract('([a-z]+@)')

Unnamed: 0,0
0,holden@
1,
2,axley@


No match is found for the second string. Why is that? Well, `[a-z]+` together specify 'one or more lowercase letter(s)', and the e-mail in the second string has a number just before the '@', hence no match is found. 

<hr style="border:8px solid black"> </hr>

***

**<u>Task - 2 mins</u>**

Now try using `*` instead of `+` here. How do you interpret the difference in what it returns as compared with the `+` regex?

**Solution**

In [10]:
strings.str.extract('([a-z]*@)')

Unnamed: 0,0
0,holden@
1,@
2,axley@


So, now we return matches for all three strings in the `Series`, because `[a-z]*` together specify 'zero or more lowercase letter(s)'. In the second string, the e-mail does indeed have 'zero' lowercase letters immediately before the '@' (the preceding character is the digit '7'), so only the '@' itself is matched.
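
The same behaviour can be checked outside `pandas` with Python's built-in `re` module, which uses the same pattern syntax. This is only a sketch; the variable names are made up for the illustration:

```python
import re

local_part = "xiao_h_97@"

# '*' allows zero letters, so the bare '@' is enough for a match
star_match = re.search(r'[a-z]*@', local_part)

# '+' demands at least one lowercase letter directly before '@';
# here the character before '@' is the digit '7', so there is no match
plus_match = re.search(r'[a-z]+@', local_part)

print(star_match.group())   # the matched text
print(plus_match)           # None when nothing matches
```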

***

<hr style="border:8px solid black"> </hr>

What else do we need to add to pull out more of our e-mail address? Well, characters like '_' and '.' can also occur. Let's add them to the range of things we are looking for. 

There's a complication: `.` has a special meaning in regex, namely 'any single character'. Here we want `.` to be interpreted literally as a full stop, and not with this special meaning. This is called **'escaping'** the `.` character, and we do it by adding a backslash `\` in front of it.

But `\` has a special meaning in `Python`, argh! So, we can either **escape the escape** by inserting double backslash `\\`, or we can convert the whole string to 'raw' format by putting an `r` in front of it, which then means we can use a single backslash `\` to escape. Phew!

Here's a little example to show this:

In [11]:
full_stop_text = pd.Series(["text.", 'text'])

This means 'any character' so returns the first character in both cases

In [12]:
full_stop_text.str.extract('(.)')

Unnamed: 0,0
0,t
1,t


This means 'full stop', as we use a double backslash, so it only matches a literal full stop:

In [13]:
full_stop_text.str.extract('(\\.)')

Unnamed: 0,0
0,.
1,
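
As a quick check that the double-backslash and raw-string spellings really describe the same pattern, here is a small sketch using plain Python strings and the built-in `re` module (the variable names are chosen just for this illustration):

```python
import re

escaped = '\\.'   # escape the escape
raw = r'\.'       # raw string: backslashes are kept as typed

# Both spellings produce the same two-character string: a backslash and a full stop
print(escaped == raw, len(raw))

# The escaped pattern finds only literal full stops...
literal_dots = re.findall(raw, "a.b.c")

# ...whereas an unescaped '.' matches every character
any_chars = re.findall('.', "a.b")

print(literal_dots)
print(any_chars)
```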


Now we've clarified that, let's head back to our strings series:

In [14]:
strings.str.extract(r'([a-z_\.]+@)')

Unnamed: 0,0
0,amelia_holden@
1,
2,axley@


What can we do about the e-mail in the second string? We need to tell regex to also look for digits, as these are certainly valid in e-mail addresses. We can just add the range `0-9` to what we are looking for! 

In [15]:
strings.str.extract(r'([a-z0-9_\.]+@)')

Unnamed: 0,0
0,amelia_holden@
1,xiao_h_97@
2,axley@


<hr style="border:8px solid black"> </hr>

***

**<u>Task - 2 mins</u>**

What's going wrong for the third e-mail address? Well, we're not telling regex to search for capital letters. See if you can add a **range of capital letters** to your search pattern in the correct place.

**Solution**

In [16]:
strings.str.extract(r'([a-zA-Z0-9_\.]+@)')

Unnamed: 0,0
0,amelia_holden@
1,xiao_h_97@
2,Bernice.Yaxley@


Our pattern is getting a bit complex already! Fortunately, some of the ranges we have provided are already available in the form of **metacharacters**. The common ones are as follows:

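* `\w` any one alphanumeric character, '_' included. Equivalent to `[a-zA-Z0-9_]` for plain ASCII text (by default, Python's `\w` also matches letters and digits from other alphabets)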
* `\d` any one digit. Equivalent to `[0-9]`
* `\s` any one whitespace character (spaces, tabs, newlines etc)
* Negations are available in some regex engines, usually as the 'capitalisation' of their partner, e.g. `\W` is any one non-alphanumeric character, `\D` any one non-digit character etc.
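
To see each metacharacter on its own, here is a small sketch using Python's built-in `re` module on a made-up string (the `sample` text and variable names are assumptions for illustration only):

```python
import re

sample = "Flat 4b, tel: 0131"

digits = re.findall(r'\d', sample)        # every single digit
spaces = re.findall(r'\s', sample)        # every whitespace character
words = re.findall(r'\w+', sample)        # runs of letters, digits or '_'
non_words = re.findall(r'\W', sample)     # negation: anything \w would not match

# By default \w also accepts non-ASCII letters; the re.ASCII flag restricts it
unicode_word = re.findall(r'\w+', 'caf\u00e9')
ascii_word = re.findall(r'\w+', 'caf\u00e9', flags=re.ASCII)

print(digits, spaces, words, non_words)
print(unicode_word, ascii_word)
```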

In [17]:
strings.str.extract(r'([\w\.]+@)')

Unnamed: 0,0
0,amelia_holden@
1,xiao_h_97@
2,Bernice.Yaxley@


That's cleaner! So now, let's look for the parts of the e-mail addresses after the '@' symbol. A reasonable first guess might be just to repeat the pattern before '@' afterwards too.

In [18]:
strings.str.extract(r'([\w\.]+@[\w\.]+)')

Unnamed: 0,0
0,amelia_holden@fakemail.com
1,xiao_h_97@sprint.co.uk
2,Bernice.Yaxley@fakemail.co.uk
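
The same pattern also works with `.str.contains()` when we only need to know *whether* a string holds an e-mail address. The sketch below uses a small made-up `Series`, and drops the capture-group parentheses because nothing is being extracted (the data and variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical messages, used only for this illustration
contacts = pd.Series([
    "Contact me at amelia_holden@fakemail.com",
    "YapApp handle is @@xiaoh-97",
    "No contact details given"
])

# True where the e-mail pattern is found anywhere in the string
has_email = contacts.str.contains(r'[\w\.]+@[\w\.]+')

# A boolean Series can be used directly to filter the original Series
with_email = contacts[has_email]

print(has_email)
print(with_email)
```

Note that the YapApp handle is not flagged: the pattern needs at least one letter, digit, underscore or full stop directly before an '@'.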


<hr style="border:8px solid black"> </hr>

***

**<u>Task - 5 mins</u>**

See if you can create a regex pattern to extract the 'YapApp' handles in the second and third strings (e.g. '@@xiaoh-97')

**Solution**

In [19]:
# add two @@, need to also add hyphen
strings.str.extract(r'(@@[\w-]+)')

Unnamed: 0,0
0,
1,@@xiaoh-97
2,@@byaxley
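
A note on the hyphen in `[\w-]`: inside square brackets a hyphen normally marks a range (as in `a-z`), but when it comes first or last in the brackets it is read as a literal hyphen. A small sketch with Python's built-in `re` module, using names chosen only for this illustration:

```python
import re

handle = "@@xiaoh-97"

# Hyphen last in the brackets: treated as a literal '-'
with_hyphen = re.findall(r'[\w-]+', handle)

# Without the hyphen, the match is broken at '-'
without_hyphen = re.findall(r'[\w]+', handle)

# 'a-c' is a range, the trailing '-' is a literal hyphen
range_and_hyphen = re.findall(r'[a-c-]', "a-d")

print(with_hyphen, without_hyphen, range_and_hyphen)
```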


### Quantifiers

Quantifiers express the **number of occurrences** of an item. We've already seen two quantifiers: `*` meaning 'zero or more occurrences' and `+` meaning 'one or more occurrences'. To these we also add

* `?` meaning 'optional' i.e. may or may not occur. If it does, capture it.
* `{n}` meaning 'exactly n occurrences', e.g. `{3}` means 'exactly three occurrences'
* `{n,m}` meaning 'between n and m occurrences, inclusive', e.g. 'bob{3,4}' would match 'bobbb' and 'bobbbb', but not 'bob' or 'bobb'
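
Here is a small sketch that checks these quantifiers with Python's built-in `re` module. The helper `fully_matches` is a made-up name for this illustration; it uses `re.fullmatch()`, which only succeeds if the whole string fits the pattern:

```python
import re

def fully_matches(pattern, text):
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# {n,m} applies to the final 'b' only: 'bo' followed by 3 or 4 'b's
bob_results = {text: fully_matches(r'bob{3,4}', text)
               for text in ['bob', 'bobb', 'bobbb', 'bobbbb']}

# ? makes the preceding character optional
colour_results = [fully_matches(r'colou?r', text) for text in ['color', 'colour']]

# {n} demands exactly n occurrences
three_digits = [fully_matches(r'\d{3}', text) for text in ['123', '1234']]

print(bob_results)
print(colour_results)
print(three_digits)
```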

Let's move over now to the `feedback` `DataFrame`, and write a regex to extract phone numbers from the `message` column

In [20]:
feedback.message

0                                                                          Ideal for my lunchbox - Dave Smith
1           Broke first time I used it, I want a refund! Get back to me at lenore29@gmail.com or 07700 900796
2                                                                                 My name is Tony 07700900829
3                                                                            Bought another one for my sister
4                                                                 Works pretty well, but can't handle carrots
5                            The concept is great, the execution- not so great, thin handles - Eleanor & dave
6                                                                    Bit of a cheap version of the real thing
7                                                                                Arrived on time, as expected
8     Customer service terrible - hello anyone there?! DaveAllsop@yahoo.co.uk, 07700 900572 or 0131 9496 0886
9         

First, let's just pull out a single digit. The `.str.extract()` method will pull out the first digit it finds in each value in the `message` column

In [21]:
feedback.message.str.extract(r'(\d)')

Unnamed: 0,0
0,
1,2
2,0
3,
4,
5,
6,
7,
8,0
9,


So far, so good! We're pulling out the '2' from 'lenore29@gmail.com' and not the later phone number, but this will hopefully be fixed when we make the regex more specific. Let's do that now by searching for a specific number of digits: the first five digits of a phone number. To do this, we can use the syntax `{5}`

In [22]:
feedback.message.str.extract(r'(\d{5})')

Unnamed: 0,0
0,
1,07700
2,07700
3,
4,
5,
6,
7,
8,07700
9,


This looks much better! Let's round the regex out by looking for the space, and then the remaining six digits

In [23]:
feedback.message.str.extract(r'(\d{5} \d{6})')

Unnamed: 0,0
0,
1,07700 900796
2,
3,
4,
5,
6,
7,
8,07700 900572
9,


<hr style="border:8px solid black"> </hr>

***

**<u>Task - 2 mins</u>**

One of the mobile phone numbers lacks a space between the first five digits and the latter six digits. Can you amend the regex to also extract this phone number? Have a look at the full list of quantifiers below for help. Remember: **quantifiers refer to the character immediately preceding them**.

* `*` means 'zero or more times'
* `+` means 'one or more times'
* `?` means 'optional' i.e. may or may not occur. If it does, capture it.
* `{n}` means 'exactly n occurrences', e.g. `{3}` means 'exactly three occurrences'
* `{n,m}` means 'between n and m occurrences, inclusive', e.g. 'bob{3,4}' would match 'bobbb' and 'bobbbb', but not 'bob' or 'bobb'

**Solution**

In [24]:
feedback.message.str.extract(r'(\d{5} ?\d{6})')

Unnamed: 0,0
0,
1,07700 900796
2,07700900829
3,
4,
5,
6,
7,
8,07700 900572
9,


***

<hr style="border:8px solid black"> </hr>

***

**<u>Task - 2 mins</u>**

See if you can now write a regex to extract phone numbers in the form 'xxxx xxxx xxxx'. Make the spaces optional.

**Solution**


In [25]:
feedback.message.str.extract(r'(\d{4} ?\d{4} ?\d{4})')

Unnamed: 0,0
0,
1,
2,
3,
4,
5,
6,
7,
8,0131 9496 0886
9,


***

<hr style="border:8px solid black"> </hr>

### Optional: extracting multiple patterns using `.str.extractall()`

Now, how do we combine these regex patterns to extract all phone numbers from `feedback.message`? What about messages that contain multiple phone numbers? Well, the `.str.extractall()` method can be applied in this situation, but we will need to find some way to combine regex patterns. Let's first create a `list` of phone number regex patterns to search for 

In [26]:
number_patterns = [r'(\d{5} ?\d{6})', r'(\d{4} ?\d{4} ?\d{4})']

Now we need to combine these patterns. The logic here is that we ask regex to search for 'this pattern **OR** that pattern' in each string, so we can use the regex **OR** operator `|` (same as `pandas` **OR**) to combine the patterns into one search term

In [27]:
number_search = '|'.join(number_patterns)

number_search

'(\\d{5} ?\\d{6})|(\\d{4} ?\\d{4} ?\\d{4})'

Now let's apply the `.str.extractall()` method to pull out all the matches in the `message` column. By default, the `.str.extract()` method will just pull out **the first** match in each string, which is not what we want here

In [28]:
feedback.message.str.extractall(number_search)

Unnamed: 0_level_0,Unnamed: 1_level_0,0,1
Unnamed: 0_level_1,match,Unnamed: 2_level_1,Unnamed: 3_level_1
1,0,07700 900796,
2,0,07700900829,
8,0,07700 900572,
8,1,,0131 9496 0886


Hmm, the behaviour of `.str.extractall()` is to put each regex capture group (i.e. a pattern enclosed in `()`) in a separate column. We would rather have all the numbers in one column, so let's amend `number_search` to contain a single capture group. We'll do this by removing the parentheses from the individual patterns, joining them with `|`, and wrapping the result in one pair of parentheses

In [29]:
number_patterns_no_capture = [r'\d{5} ?\d{6}', r'\d{4} ?\d{4} ?\d{4}']

number_search_one_capture = '(' + '|'.join(number_patterns_no_capture) + ')'
number_search_one_capture

'(\\d{5} ?\\d{6}|\\d{4} ?\\d{4} ?\\d{4})'

In [30]:
feedback.message.str.extractall(number_search_one_capture)

Unnamed: 0_level_0,Unnamed: 1_level_0,0
Unnamed: 0_level_1,match,Unnamed: 2_level_1
1,0,07700 900796
2,0,07700900829
8,0,07700 900572
8,1,0131 9496 0886


We have a `MultiIndex` on the left, i.e. a hierarchy. The message with `index` number 8 contains **two** matched phone numbers, in turn labelled by `index` numbers 0 and 1. We might access say the second match by `.loc[(8,1), :]` if we wished (remember we 'dig' into a `MultiIndex` using `tuple`s)

In [31]:
feedback.message.str.extractall(number_search_one_capture).loc[(8,1), :]

0    0131 9496 0886
Name: (8, 1), dtype: string
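
If one row per message is more convenient than one row per match, the matches can be gathered back together by grouping on the first level of the `MultiIndex`. This is just one possible approach, sketched on a small made-up `Series` (the `messages` data and variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical messages, used only for this illustration
messages = pd.Series([
    "call 07700 900572 or 0131 9496 0886",
    "no number given",
    "ring 07700900829"
])

pattern = r'(\d{5} ?\d{6}|\d{4} ?\d{4} ?\d{4})'

# One row per match, indexed by (original row, match number)
matches = messages.str.extractall(pattern)

# Group on the original row label (level 0) and collect matches into lists;
# messages without any match do not appear in the result
numbers_per_message = matches[0].groupby(level=0).agg(list)

print(numbers_per_message)
```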

### Optional: string replace

The `.str.replace()` method lets us replace parts of strings using regex matching. Let's use this ability in this case to **redact** sensitive phone numbers from the messages, via the regex patterns we created above. The earlier logic is applicable here too: we want to replace matched numbers ('this pattern OR that pattern') with a string (say '[NUMBER REDACTED]'). We want to persist these changes, so we mutate by assigning to `.loc[]`, as discussed above

In [32]:
# regex=True: treat the search pattern as a regex rather than as literal text
feedback.loc[:, 'message'] = feedback.message.str.replace(number_search_one_capture, '[NUMBER REDACTED]', regex=True)
feedback



Unnamed: 0,item_no,date,rating,message
0,2,2020-04-11,5,Ideal for my lunchbox - Dave Smith
1,2,2020-04-12,1,"Broke first time I used it, I want a refund! Get back to me at lenore29@gmail.com or [NUMBER REDACTED]"
2,3,2020-05-13,3,My name is Tony [NUMBER REDACTED]
3,4,,5,Bought another one for my sister
4,5,2020-05-28,4,"Works pretty well, but can't handle carrots"
5,1,2020-05-29,3,"The concept is great, the execution- not so great, thin handles - Eleanor & dave"
6,9,2020-06-01,2,Bit of a cheap version of the real thing
7,5,2020-06-07,5,"Arrived on time, as expected"
8,7,2020-06-300,1,"Customer service terrible - hello anyone there?! DaveAllsop@yahoo.co.uk, [NUMBER REDACTED] or [NUMBER REDACTED]"
9,10,2020-06-30,4,"Workks well, seems solid, good value"
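
A word of caution on the `regex` argument of `.str.replace()`: in recent versions of `pandas` (2.0 onwards) the search pattern is treated as literal text unless `regex=True` is passed, so it is safest to state it explicitly. A minimal sketch on made-up data (the `msgs` `Series` and variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical messages, used only for this illustration
msgs = pd.Series(["Call 07700 900796", "No number here"], dtype='string')
pattern = r'\d{5} ?\d{6}'

# regex=False: looks for the literal characters '\d{5} ?\d{6}', finds none
as_literal = msgs.str.replace(pattern, '[NUMBER REDACTED]', regex=False)

# regex=True: the pattern is interpreted as a regular expression
as_regex = msgs.str.replace(pattern, '[NUMBER REDACTED]', regex=True)

print(as_literal)
print(as_regex)
```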


***

**<u>Task - 2 mins</u>**

Persist redactions for e-mail addresses in the `message` column (re-use your e-mail regex pattern from earlier). Replace all e-mail addresses with the string '[E-MAIL REDACTED]'

**Solution**

In [33]:
# regex=True: treat the search pattern as a regex rather than as literal text
feedback.loc[:, 'message'] = feedback.message.str.replace(r'([\w\.]+@[\w\.]+)', '[E-MAIL REDACTED]', regex=True)
feedback



Unnamed: 0,item_no,date,rating,message
0,2,2020-04-11,5,Ideal for my lunchbox - Dave Smith
1,2,2020-04-12,1,"Broke first time I used it, I want a refund! Get back to me at [E-MAIL REDACTED] or [NUMBER REDACTED]"
2,3,2020-05-13,3,My name is Tony [NUMBER REDACTED]
3,4,,5,Bought another one for my sister
4,5,2020-05-28,4,"Works pretty well, but can't handle carrots"
5,1,2020-05-29,3,"The concept is great, the execution- not so great, thin handles - Eleanor & dave"
6,9,2020-06-01,2,Bit of a cheap version of the real thing
7,5,2020-06-07,5,"Arrived on time, as expected"
8,7,2020-06-300,1,"Customer service terrible - hello anyone there?! [E-MAIL REDACTED], [NUMBER REDACTED] or [NUMBER REDACTED]"
9,10,2020-06-30,4,"Workks well, seems solid, good value"
