# Chapter 2: Text Preprocessing with Regular Expressions

## Instructions

- Run the cells containing `assert` statements to check whether your output matches the expected output. If a cell runs without error, your answer matches. If your output is different, the error message includes a hint.

In this notebook, you'll practice writing regular expressions in Python using the `re` package.


In [1]:
import re

## 1. Test Out Regex Methods in Python

In this section, you'll try out the `re.findall()`, `re.split()`, and `re.sub()` functions. Pay attention to how their outputs differ: the first two return lists, while the third returns a string.

Let's first define a string called `text` so we have some words to work with.

In [2]:
text = 'This apple is delicious.'
print(text)

This apple is delicious.


### Find All

Find every occurrence of `is` in the text using the `re.findall` method. The pattern matches the character sequence anywhere, so the `is` inside `This` counts too. Save your results in a variable called `output_findall`.

In [3]:
### BEGIN SOLUTION
output_findall = re.findall('is', text)
### END SOLUTION
output_findall

['is', 'is']

In [4]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert output_findall == ['is', 'is'], "The output should be a list with two items. Use the format re.findall(pattern, string)."
### END HIDDEN TESTS
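
If you also need to know where each match sits in the string, `re.finditer()` is one option. The sketch below reuses the same `text` value; the variable names `matches` and `positions` are only illustrative.

```python
import re

text = 'This apple is delicious.'

# re.findall returns only the matched strings.
matches = re.findall('is', text)

# re.finditer yields match objects, which also carry the start index.
positions = [(m.group(), m.start()) for m in re.finditer('is', text)]

print(matches)    # ['is', 'is']
print(positions)  # [('is', 2), ('is', 11)]
```

The first match starts at index 2, inside `This`, which confirms that the pattern is not limited to whole words.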

### Split

Split the text wherever `is` occurs in the text using the `re.split` method. Save your results in a variable called `output_split`.

In [5]:
### BEGIN SOLUTION
output_split = re.split('is', text)
### END SOLUTION
output_split

['Th', ' apple ', ' delicious.']

In [6]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert output_split == ['Th', ' apple ', ' delicious.'], "The output should be a list with three items. Use the format re.split(pattern, string)."
### END HIDDEN TESTS
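
Two variations on `re.split()` are worth knowing. This is a small sketch using the same `text` value; `with_delims` and `first_only` are illustrative names.

```python
import re

text = 'This apple is delicious.'

# Wrapping the pattern in a capturing group keeps the delimiter in the result.
with_delims = re.split('(is)', text)

# maxsplit=1 stops after the first split.
first_only = re.split('is', text, maxsplit=1)

print(with_delims)  # ['Th', 'is', ' apple ', 'is', ' delicious.']
print(first_only)   # ['Th', ' apple is delicious.']
```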

### Substitute

Wherever `is` occurs in the text, substitute it with `IS` using the `re.sub` method. Save your results in a variable called `output_sub`.

In [7]:
### BEGIN SOLUTION
output_sub = re.sub('is', 'IS', text)
### END SOLUTION
output_sub

'ThIS apple IS delicious.'

In [8]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert output_sub == 'ThIS apple IS delicious.', "The output should be a string containing IS. Use the format re.sub(pattern, new_value, string)."
### END HIDDEN TESTS
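
The exercise replaces every `is`, including the one inside `This`. The sketch below shows three ways you might narrow or inspect the substitution, again on the same `text` value. The variable names are illustrative.

```python
import re

text = 'This apple is delicious.'

# count=1 replaces only the first occurrence.
first_only = re.sub('is', 'IS', text, count=1)

# \b marks a word boundary, so only the standalone word "is" is replaced.
whole_word = re.sub(r'\bis\b', 'IS', text)

# re.subn also reports how many replacements were made.
replaced, n_replacements = re.subn('is', 'IS', text)

print(first_only)      # ThIS apple is delicious.
print(whole_word)      # This apple IS delicious.
print(n_replacements)  # 2
```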

## 2. More Regular Expressions

So far, you've used regular expressions in the simplest way, matching the exact characters `is` in a string of text. Next, we'll introduce some other common regular expression patterns.

In [9]:
metis = 'We strive, we sweat, we swear. We go the extra mile.\
         We stage, we fail. We try again. Get it right. We learn.\
         Connect. Come together. This is Metis. -12/9/2013'
print(metis)

We strive, we sweat, we swear. We go the extra mile.         We stage, we fail. We try again. Get it right. We learn.         Connect. Come together. This is Metis. -12/9/2013


Find all of the digits in the `metis` text using `re.findall()`. Save the output as a variable called `digits`.

In [10]:
### BEGIN SOLUTION
digits = re.findall(r'\d', metis)  # raw string so that \d reaches the regex engine unchanged
### END SOLUTION
digits

['1', '2', '9', '2', '0', '1', '3']

In [11]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert digits == ['1', '2', '9', '2', '0', '1', '3'], "The output should be a list with seven items. Hint: What regex matches a single digit?"
### END HIDDEN TESTS
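
`\d` matches one digit at a time, which is why the date comes back as seven separate items. Adding the `+` quantifier groups consecutive digits. The sketch below uses a shortened piece of the `metis` text as `sample`.

```python
import re

sample = 'This is Metis. -12/9/2013'

# \d matches a single digit.
single_digits = re.findall(r'\d', sample)

# \d+ matches one or more digits in a row.
digit_runs = re.findall(r'\d+', sample)

print(single_digits)  # ['1', '2', '9', '2', '0', '1', '3']
print(digit_runs)     # ['12', '9', '2013']
```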

Find all of the words that start with a capital letter C in the `metis` text using `re.findall()`. Save the output as a variable called `capitalc`.

In [12]:
### BEGIN SOLUTION
capitalc = re.findall(r'C\w+', metis)  # raw string so that \w reaches the regex engine unchanged
### END SOLUTION
capitalc

['Connect', 'Come']

In [13]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert capitalc == ['Connect', 'Come'], r"The output should be a list with two items. Hint: 'b\w+' would find all words starting with b followed by one or more alphanumeric characters."
### END HIDDEN TESTS
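
The pattern `C\w+` works on the `metis` text because no word there has a capital C in the middle. On other text it can also match partway through a word. The sketch below uses a made-up `sample` sentence (not from the dataset) to show how a `\b` word boundary restricts matches to the start of a word.

```python
import re

# Hypothetical sentence, chosen because "McCoy" has a capital C mid-word.
sample = 'Connect. Come together. McCoy met Carl.'

# Without a boundary, the "Coy" inside "McCoy" is matched too.
loose = re.findall(r'C\w+', sample)

# \b requires the C to sit at the start of a word.
anchored = re.findall(r'\bC\w+', sample)

print(loose)     # ['Connect', 'Come', 'Coy', 'Carl']
print(anchored)  # ['Connect', 'Come', 'Carl']
```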

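Find every occurrence of `We` or `we` in the `metis` text using `re.findall()`. Save the output as a variable called `we`. Because the pattern is matched as a character sequence rather than as a whole word, the `we` inside `sweat` and `swear` is returned as well.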

In [14]:
### BEGIN SOLUTION
we = re.findall('[Ww]e', metis)
### END SOLUTION
we

['We', 'we', 'we', 'we', 'we', 'We', 'We', 'we', 'We', 'We']

In [15]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert we == ['We', 'we', 'we', 'we', 'we', 'We', 'We', 'we', 'We', 'We'], "The output should be a list with ten items. Hint: '[bmp]y' would find all by, my and py terms."
### END HIDDEN TESTS
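
If you want only the standalone words rather than every `we` character sequence, word boundaries help. The sketch below uses the first sentence of the `metis` text as `sample`. The names `loose`, `whole_words`, and `ignore_case` are illustrative.

```python
import re

sample = 'We strive, we sweat, we swear.'

# Matches "we" anywhere, including inside "sweat" and "swear".
loose = re.findall(r'[Ww]e', sample)

# \b on both sides keeps only the standalone words.
whole_words = re.findall(r'\b[Ww]e\b', sample)

# re.IGNORECASE is an alternative to listing both cases in a character class.
ignore_case = re.findall(r'\bwe\b', sample, flags=re.IGNORECASE)

print(loose)        # ['We', 'we', 'we', 'we', 'we']
print(whole_words)  # ['We', 'we', 'we']
print(ignore_case)  # ['We', 'we', 'we']
```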

## 3. Apply Regex to a DataFrame

In [16]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

Take a look at this DataFrame with six cappuccino cup reviews.

In [17]:
df = pd.DataFrame([['a',5,"Grove Square Cappuccino Cups were excellent. Tasted really good right from the Keurig brewer with nothing added. wWould highly recommend. RCCJR"],
                  ['b',1,"I love my Keurig, and I love most of the Keurig coffees. This is instant coffee with instant milk and far too much sugar. I don't know anyone I dislike enough to dump the rest of the box on."],
                  ['c',1,"It's a powdered drink. No filter in k-cup.<br />Just buy it in bulk and mix it with hot water....<br /><br />Nothing else to say here. Wont be buying it again."],
                  ['d',1,"don't bother! bet you couldn't tell the difference between this and hot water if your eyes were closed. well, maybe the water would have a taste!"],
                  ['e',1,"Never tasted this coffee before, I felt much too sweet even for dessert. I would not order again. But then that is only my opinion. My friend's husband loves it.<br />I gave them to him."],
                  ['f',5,"My husband and I LOVE this French Vanilla Cappuccino. Sooo glad I didn't listen to some of the reviews and took the plunge and bought it."]],
                  columns=['users','stars','reviews'])
df

Unnamed: 0,users,stars,reviews
0,a,5,Grove Square Cappuccino Cups were excellent. Tasted really good right from the Keurig brewer with nothing added. wWould highly recommend. RCCJR
1,b,1,"I love my Keurig, and I love most of the Keurig coffees. This is instant coffee with instant milk and far too much sugar. I don't know anyone I dislike enough to dump the rest of the box on."
2,c,1,It's a powdered drink. No filter in k-cup.<br />Just buy it in bulk and mix it with hot water....<br /><br />Nothing else to say here. Wont be buying it again.
3,d,1,"don't bother! bet you couldn't tell the difference between this and hot water if your eyes were closed. well, maybe the water would have a taste!"
4,e,1,"Never tasted this coffee before, I felt much too sweet even for dessert. I would not order again. But then that is only my opinion. My friend's husband loves it.<br />I gave them to him."
5,f,5,My husband and I LOVE this French Vanilla Cappuccino. Sooo glad I didn't listen to some of the reviews and took the plunge and bought it.


Let's clean the text in three ways:
1. Convert capital letters to lowercase
2. Replace `<br />` tags with spaces
3. Replace punctuation with spaces
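
Before working on the whole column, it can help to see the three steps applied to one string. The sketch below is a minimal illustration, not the notebook's required solution. `clean_review` is a hypothetical helper name, and `sample` is a shortened version of one review.

```python
import re
import string

def clean_review(review):
    """Lowercase a review, then replace <br /> tags and punctuation with spaces."""
    review = review.lower()
    review = re.sub('<br />', ' ', review)
    review = re.sub('[%s]' % re.escape(string.punctuation), ' ', review)
    return review

sample = "No filter in k-cup.<br />Wont be buying it again."
print(repr(clean_review(sample)))
# 'no filter in k cup  wont be buying it again '
```

The order matters: if punctuation were replaced first, `<br />` would lose its angle brackets and slash, leaving a stray `br` in the text that the tag pattern could no longer find.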


Make all of the text in the reviews column lowercase. Overwrite the `df.reviews` column with the new text.

In [18]:
### BEGIN SOLUTION
df.reviews = df.reviews.str.lower()
### END SOLUTION
df.reviews

0                                                   grove square cappuccino cups were excellent. tasted really good right from the keurig brewer with nothing added. wwould highly recommend. rccjr
1    i love my keurig, and i love most of the keurig coffees. this is instant coffee with instant milk and far too much sugar. i don't know anyone i dislike enough to dump the rest of the box on.
2                                   it's a powdered drink. no filter in k-cup.<br />just buy it in bulk and mix it with hot water....<br /><br />nothing else to say here. wont be buying it again.
3                                                 don't bother! bet you couldn't tell the difference between this and hot water if your eyes were closed. well, maybe the water would have a taste!
4        never tasted this coffee before, i felt much too sweet even for dessert. i would not order again. but then that is only my opinion. my friend's husband loves it.<br />i gave them to him.
5                   

In [19]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert len(re.findall('[A-Z]+',df.reviews.to_string())) == 0, "There are still capital letters in the reviews text. Remember you can apply .str.lower() on a series."
### END HIDDEN TESTS

Remove all of the `<br />` tags in the text by replacing each one with a space. Overwrite the `df.reviews` column with the new text.

In [20]:
### BEGIN SOLUTION
df.reviews = df.reviews.map(lambda x: re.sub('<br />', ' ', x))
### END SOLUTION
df.reviews

0                                                   grove square cappuccino cups were excellent. tasted really good right from the keurig brewer with nothing added. wwould highly recommend. rccjr
1    i love my keurig, and i love most of the keurig coffees. this is instant coffee with instant milk and far too much sugar. i don't know anyone i dislike enough to dump the rest of the box on.
2                                                  it's a powdered drink. no filter in k-cup. just buy it in bulk and mix it with hot water....  nothing else to say here. wont be buying it again.
3                                                 don't bother! bet you couldn't tell the difference between this and hot water if your eyes were closed. well, maybe the water would have a taste!
4             never tasted this coffee before, i felt much too sweet even for dessert. i would not order again. but then that is only my opinion. my friend's husband loves it. i gave them to him.
5                   

In [21]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert len(re.findall('<br />',df.reviews.to_string())) == 0, "There are still <br /> values in the reviews text. Remember you can apply .map(lambda x: re.sub(pattern, new_value, x)) on a series."
### END HIDDEN TESTS
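
The reviews in this notebook only contain the exact form `<br />`. Other text may use variants such as `<br/>`, `<br>`, or `<BR />`. The sketch below shows one pattern that could cover those. The `samples` list is made up for illustration and `br_pattern` is an illustrative name.

```python
import re

# Hypothetical variants of the line-break tag, not taken from the dataset.
samples = ['one<br />two', 'one<br/>two', 'one<br>two', 'one<BR />two']

# \s* allows optional whitespace, /? allows an optional slash,
# and re.IGNORECASE accepts upper- or lowercase tag names.
br_pattern = re.compile(r'<br\s*/?>', flags=re.IGNORECASE)

cleaned = [br_pattern.sub(' ', s) for s in samples]
print(cleaned)  # ['one two', 'one two', 'one two', 'one two']
```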

Remove all of the punctuation in the text by replacing each punctuation mark with a space. Overwrite the `df.reviews` column with the new text.

In [22]:
# For example, this is how you would replace punctuation marks with spaces within a single string
import string
punc_text = "hi! let's test this out."
re.sub('[%s]' % re.escape(string.punctuation), ' ', punc_text)

'hi  let s test this out '

In [23]:
# Your job is to apply the same logic to an entire column of data, df.reviews

### BEGIN SOLUTION
df.reviews = df.reviews.map(lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x))
### END SOLUTION
df.reviews

0                                                   grove square cappuccino cups were excellent  tasted really good right from the keurig brewer with nothing added  wwould highly recommend  rccjr
1    i love my keurig  and i love most of the keurig coffees  this is instant coffee with instant milk and far too much sugar  i don t know anyone i dislike enough to dump the rest of the box on 
2                                                  it s a powdered drink  no filter in k cup  just buy it in bulk and mix it with hot water      nothing else to say here  wont be buying it again 
3                                                 don t bother  bet you couldn t tell the difference between this and hot water if your eyes were closed  well  maybe the water would have a taste 
4             never tasted this coffee before  i felt much too sweet even for dessert  i would not order again  but then that is only my opinion  my friend s husband loves it  i gave them to him 
5                   

In [24]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert len(re.findall('[%s]' % re.escape(string.punctuation),df.reviews.to_string())) == 0, "There are still punctuation marks in the reviews text. Remember you can apply .map(lambda x: re.sub(pattern, new_value, x)) on a series."
### END HIDDEN TESTS
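
Replacing punctuation with spaces leaves runs of double spaces and a trailing space, as the outputs above show. If that matters for a later step, one possible follow-up is to collapse whitespace with `\s+`. The sketch below is an optional extra, not part of the exercise, and reuses the `punc_text` example string.

```python
import re
import string

punc_text = "hi! let's test this out."

# Step from the exercise: replace each punctuation mark with a space.
no_punc = re.sub('[%s]' % re.escape(string.punctuation), ' ', punc_text)

# Optional extra step: collapse whitespace runs and trim the ends.
tidy = re.sub(r'\s+', ' ', no_punc).strip()

print(repr(no_punc))  # 'hi  let s test this out '
print(repr(tidy))     # 'hi let s test this out'
```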