In this notebook, we will use the [IMDb Dataset of 50K Movie Reviews](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) from [Kaggle](https://en.wikipedia.org/wiki/Kaggle).

In [2]:
import pandas as pd
df = pd.read_csv("data/IMDB Dataset.csv")
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [3]:
text = df.iloc[3, 0]
text

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."

The text needs to be cleaned up by removing HTML tags, punctuation, and other special characters.

We will use two functions from `re`, Python's built-in package for working with regular expressions:
* `re.findall()`: returns all matches of the pattern as a list
* `re.sub()`: searches for the pattern and replaces every match with a given string
 
We will first use `re.findall()` to fine-tune the regular expression pattern and then once we are sure, modify the text using `re.sub()` with the given pattern.
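
As a minimal sketch of this two-step workflow, consider the made-up string below (the string and the pattern `at` are chosen only for illustration and are not part of the dataset):

```python
import re

# Hypothetical toy string, used only for illustration
sample = "the cat sat on the mat"

# Step 1: inspect what the pattern matches
matches = re.findall("at", sample)
print(matches)   # ['at', 'at', 'at']

# Step 2: once the matches look right, substitute them
replaced = re.sub("at", "og", sample)
print(replaced)  # the cog sog on the mog
```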

Let us find the regular expression that would capture HTML tags such as `<br />`.

In [4]:
import re
re.findall('<', text)

['<', '<', '<', '<', '<', '<']

Symbol | Description
----|-----
. | Match any character except newline
* | Match 0 or more repetitions
+ | Match 1 or more repetitions
? | Match 0 or 1 repetitions
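
The symbols in the table can be tried out on a small made-up string before applying them to the review. The string `sample` below is an assumption chosen so that the differences are easy to see:

```python
import re

# Hypothetical toy string, used only for illustration
sample = "a ab abb abbb"

star = re.findall("ab*", sample)      # 'a' followed by 0 or more 'b'
plus = re.findall("ab+", sample)      # 'a' followed by 1 or more 'b'
optional = re.findall("ab?", sample)  # 'a' followed by 0 or 1 'b'
dot = re.findall("a.", sample)        # 'a' followed by any one character

print(star)      # ['a', 'ab', 'abb', 'abbb']
print(plus)      # ['ab', 'abb', 'abbb']
print(optional)  # ['a', 'ab', 'ab', 'ab']
print(dot)       # ['a ', 'ab', 'ab', 'ab']
```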

Let us use `.` to see what we get.

In [10]:
re.findall('<.', text)

['<b', '<b', '<b', '<b', '<b', '<b']

We want to match more than a single character after `<`, so we use `.*`, which matches `.` (any character) zero or more times.

In [7]:
re.findall('<.*', text)

["<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."]

In [7]:
len(re.findall('<.*', text))

1

So, it matched everything until the end of the text. We did not want that, so let us try `<.*>`.

In [8]:
re.findall('<.*>', text)

["<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />"]

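This is slightly better, but still not what we want: the match now stops at the last `>` in the text instead of the first. We need to use **non-greedy mode** by adding `?` after the quantifier.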

Characters | Description
----|-----
*? | Match 0 or more repetitions non-greedy
+? | Match 1 or more repetitions non-greedy
?? | Match 0 or 1 repetitions non-greedy

#### Greedy vs Lazy (or non-greedy) mode
Greedy mode repeats the pattern **as many times as possible**. Lazy (or non-greedy) mode repeats the pattern **as few times as possible**.
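
The difference can be seen on a short made-up string (the string `sample` is an assumption for illustration only):

```python
import re

# Hypothetical toy string, used only for illustration
sample = "<b>bold</b> and <i>italic</i>"

# Greedy: '.+' grabs as much as it can, up to the last '>'
greedy = re.findall("<.+>", sample)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Lazy: '.+?' stops at the first '>' it can reach
lazy = re.findall("<.+?>", sample)
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```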

So, when we add `?` after `.*` to switch the match to non-greedy mode, the pattern captures individual HTML tags such as `<br />`.

In [9]:
re.findall('<.*?>', text)

['<br />', '<br />', '<br />', '<br />', '<br />', '<br />']

Now that we have found a suitable regular expression, `<.*?>`, to match the HTML tags, let us use the `re.sub()` function to remove the tags from the text:

In [10]:
re.sub('<.*?>', '', text)

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."

In [11]:
text  # still unchanged

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."

It is good to first check that the change is as desired and then use the assignment operator `=` to make the changes in the text.
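
Checking first can also reveal side effects. In the output of `re.sub()` above, removing the tags glued neighbouring sentences together ("time.This"). The sketch below, on a made-up snippet, shows one possible alternative: replacing each tag with a space and then collapsing repeated spaces. This is a variation for illustration, not the approach taken in the rest of this notebook.

```python
import re

# Hypothetical snippet in the style of the review above
sample = "all the time.<br /><br />This movie is slow."

# Check first: removing the tags outright glues the two sentences together
no_gap = re.sub("<.*?>", "", sample)
print(no_gap)     # all the time.This movie is slow.

# Alternative: replace each tag with a space, then squeeze runs of spaces
with_gap = re.sub("<.*?>", " ", sample)
collapsed = re.sub(" +", " ", with_gap)
print(collapsed)  # all the time. This movie is slow.

# re.sub() returns a new string; `sample` itself is untouched until reassigned
unchanged = sample == "all the time.<br /><br />This movie is slow."
print(unchanged)  # True
```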

In [12]:
text = re.sub('<.*?>', '', text)
text

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."

Next, we want to get rid of punctuation such as commas, periods, etc. We use square brackets `[]` to list the characters that we want to match. For example:

In [13]:
re.findall("[,.]", text)

['.', '.', '.', '.', ',', '.', ',', '.', '.', ',', '.', '.', '.']

We can add more characters:

In [13]:
re.findall("[,.:;]", text)

['.', '.', '.', '.', ',', '.', ',', '.', '.', ',', '.', '.', ':', '.']

We see that in the text above we have "soap opera..." where the period is repeated thrice. So, we add `+` at the end of the pattern to allow for 1 or more repetitions of the pattern.

In [14]:
re.findall("[,.]+", text)

['.', '...', ',', '.', ',', '.', '.', ',', '.', '.', '.']

Can you think of all the other characters that you could add here?

In [15]:
# re.findall("[,.:;?!@#$%^&*()+_=/{}-]+", text)

['(',
 ')',
 '&',
 '.',
 '...',
 ',',
 '.',
 ',',
 '!',
 '.',
 '&',
 '.',
 '!',
 ',',
 '.',
 '&',
 '.',
 ':',
 '.']
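
One detail to keep in mind when listing characters inside square brackets: a `-` placed between two characters defines a range (as in `a-z`), so to match a literal hyphen it is safest to put it first or last in the class, or to escape it as `\-`. A small sketch on a made-up string (`sample` is an assumption for illustration):

```python
import re

# Hypothetical toy string, used only for illustration
sample = "well-known (a*b+c)"

# ')-+' is read as the range from ')' to '+', i.e. ')', '*' and '+'
as_range = re.findall("[)-+]", sample)
print(as_range)  # ['*', '+', ')']

# With '-' at the end of the class, the hyphen is matched literally
literal = re.findall("[)+-]", sample)
print(literal)   # ['-', '+', ')']
```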

Check what the substitution will look like using `re.sub()`.

In [16]:
# re.sub("[,.:;?!@#$%^&*()+_=/{}-]+", '', text)

"Basically there's a family where a little boy Jake thinks there's a zombie in his closet  his parents are fighting all the timeThis movie is slower than a soap opera and suddenly Jake decides to become Rambo and kill the zombieOK first of all when you're going to make a film you must Decide if its a thriller or a drama As a drama the movie is watchable Parents are divorcing  arguing like in real life And then we have Jake with his closet which totally ruins all the film I expected to see a BOOGEYMAN similar movie and instead i watched a drama with some meaningless thriller spots3 out of 10 just for the well playing parents  descent dialogs As for the shots with Jake just ignore them"

Make the changes to the `text` variable using the assignment operator `=`.

In [17]:
# text = re.sub("[,.:;?!@#$%^&*()+_=/{}-]+", '', text)
# text

"Basically there's a family where a little boy Jake thinks there's a zombie in his closet  his parents are fighting all the timeThis movie is slower than a soap opera and suddenly Jake decides to become Rambo and kill the zombieOK first of all when you're going to make a film you must Decide if its a thriller or a drama As a drama the movie is watchable Parents are divorcing  arguing like in real life And then we have Jake with his closet which totally ruins all the film I expected to see a BOOGEYMAN similar movie and instead i watched a drama with some meaningless thriller spots3 out of 10 just for the well playing parents  descent dialogs As for the shots with Jake just ignore them"

Now, what about `'`, `"`, `[`, `]`, etc.? These characters have a special meaning either in Python strings or in regular expressions, so we add `\` in front of each one to match it literally.
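
A minimal sketch of escaping on a made-up string. It assumes the pattern is written as a raw string (the `r` prefix), which keeps the backslashes intact for the `re` module and avoids Python's warnings about unrecognized escape sequences such as `\[`:

```python
import re

# Hypothetical toy string, used only for illustration
sample = 'He said "it\'s [not] bad"'

# Raw string: the backslashes reach the regex engine unchanged
pattern = r"[\'\"\[\]]"

found = re.findall(pattern, sample)
print(found)    # ['"', "'", '[', ']', '"']

cleaned = re.sub(pattern, "", sample)
print(cleaned)  # He said its not bad
```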

In [18]:
re.findall(r"[\'\"\[\]]", text)

["'", "'", "'"]

In [19]:
text = re.sub(r"[\'\"\[\]]", '', text)
text

'Basically theres a family where a little boy Jake thinks theres a zombie in his closet  his parents are fighting all the timeThis movie is slower than a soap opera and suddenly Jake decides to become Rambo and kill the zombieOK first of all when youre going to make a film you must Decide if its a thriller or a drama As a drama the movie is watchable Parents are divorcing  arguing like in real life And then we have Jake with his closet which totally ruins all the film I expected to see a BOOGEYMAN similar movie and instead i watched a drama with some meaningless thriller spots3 out of 10 just for the well playing parents  descent dialogs As for the shots with Jake just ignore them'

Sum it all up nicely in a function.
```
def clean_text(text):

    return text
    
```

Then apply the function `clean_text()` to the `review` column of the dataframe `df` above using the [`map()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html) method. Check that it worked.

In [11]:
# def clean_text(text):
#     """
#     Applies some pre-processing on the given text.

#     Steps :
#     - Removing HTML tags
#     - Removing punctuations and other characters
#     """
    
#     # remove HTML tags
#     text = re.sub(r'<.*?>', '', text)
    
#     # remove punctuation and other characters
#     text = re.sub(r"[,.:;?!@#$%^&*()+_=/{}-]+", '', text)
    
#     # remove single quotes, double quotes and square brackets
#     text = re.sub(r"[\'\"\[\]]", '', text)

#     return text

In [12]:
# df.review.map(clean_text)

0        One of the other reviewers has mentioned that ...
1        A wonderful little production The filming tech...
2        I thought this was a wonderful way to spend ti...
3        Basically theres a family where a little boy J...
4        Petter Matteis Love in the Time of Money is a ...
                               ...                        
49995    I thought this movie did a down right good job...
49996    Bad plot bad dialogue bad acting idiotic direc...
49997    I am a Catholic taught in parochial elementary...
49998    Im going to have to disagree with the previous...
49999    No one expects the Star Trek movies to be high...
Name: review, Length: 50000, dtype: object

In [None]:
# df['review'] = df['review'].map(clean_text)
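
When the same patterns are applied to many strings, they can also be compiled once with `re.compile()` and reused. The sketch below is a variation for illustration only: the names `TAG_RE`, `PUNCT_RE`, and `clean_text_compiled` are hypothetical, the two punctuation steps are merged into a single character class, and the function is run on made-up strings rather than on `df`. Since the `re` module already caches recently used patterns, any speed-up may be modest.

```python
import re

# Hypothetical names; the patterns mirror the cleaning steps used above
TAG_RE = re.compile(r"<.*?>")
PUNCT_RE = re.compile(r"[,.:;?!@#$%^&*()+_=/{}\'\"\[\]-]+")

def clean_text_compiled(text):
    """Remove HTML tags, then punctuation and other special characters."""
    text = TAG_RE.sub("", text)
    return PUNCT_RE.sub("", text)

# Made-up strings, used only to check the behaviour
samples = ["Great film!<br />Loved it.", "It's (really) bad..."]
cleaned = [clean_text_compiled(s) for s in samples]
print(cleaned)  # ['Great filmLoved it', 'Its really bad']
```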

There is a lot more to regular expressions. Now that you have used them a little, you can learn more through practice. Below are a few good cheatsheets:
* https://learnbyexample.github.io/python-regex-cheatsheet/#re-module-functions
* https://www.shortcutfoo.com/app/dojos/python-regex/cheatsheet
