In this notebook, we will use the [IMDb Dataset of 50K Movie Reviews](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) from [Kaggle](https://en.wikipedia.org/wiki/Kaggle). Download the dataset from Kaggle and save it in the same folder as this notebook.

In [None]:
import pandas as pd
df = pd.read_csv("IMDB Dataset.csv")
df.head()

In [None]:
text = df.iloc[3, 0]
text

The text needs to be cleaned up by removing HTML tags, punctuation, and other unwanted characters.

We will use two functions from `re`, Python's built-in package for regular expressions:
* `re.findall()`: returns all the pattern matches as a list
* `re.sub()`: searches for the pattern and replaces each match
 
We will first use `re.findall()` to fine-tune the regular expression pattern and then, once we are sure it matches what we want, modify the text using `re.sub()` with that pattern.
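As a minimal sketch of this inspect-then-replace workflow, the example below uses a short made-up string (`sample` is a hypothetical stand-in for a review, not taken from the dataset) and a literal pattern.

```python
import re

# Hypothetical sample string standing in for a review
sample = "Great movie!<br /><br />Loved it."

# Step 1: inspect what the pattern matches
matches = re.findall("<br />", sample)
print(matches)  # ['<br />', '<br />']

# Step 2: once the matches look right, substitute them
cleaned = re.sub("<br />", " ", sample)
print(cleaned)  # each tag becomes one space
```

Note that `re.sub()` returns a new string and leaves `sample` itself untouched.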

Let us find the regular expression that would capture HTML tags such as `<br />`.

In [None]:
import re
re.findall('<', text)

Character | Description
----|-----
`.` | Match any character except newline
`*` | Match 0 or more repetitions of the preceding pattern
`+` | Match 1 or more repetitions of the preceding pattern
`?` | Match 0 or 1 repetitions of the preceding pattern
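To see how these special characters differ, here is a small sketch on a made-up string (`sample` is a hypothetical example, chosen only so that each pattern gives a different result).

```python
import re

sample = "a ab abb abbb"

any_char = re.findall("ab.", sample)      # 'ab' followed by any one character
zero_or_more = re.findall("ab*", sample)  # 'a' followed by zero or more 'b'
one_or_more = re.findall("ab+", sample)   # 'a' followed by one or more 'b'
zero_or_one = re.findall("ab?", sample)   # 'a' followed by zero or one 'b'

print(any_char)      # ['ab ', 'abb', 'abb']
print(zero_or_more)  # ['a', 'ab', 'abb', 'abbb']
print(one_or_more)   # ['ab', 'abb', 'abbb']
print(zero_or_one)   # ['a', 'ab', 'ab', 'ab']
```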

Let us use `.` to see what we get.

In [None]:
re.findall('<.', text)

We want more than one character after `<`, so we use `.*`, which matches any character (`.`) zero or more times.

In [None]:
matches = re.findall('<.*', text)
len(matches), matches  # number of matches, and the matches themselves

So, it matched everything until the end of the text. We did not want that, so let us try `<.*>`.

This is slightly better, but still not what we want. We need to use **non-greedy mode** by adding `?`.

Quantifier | Description
----|-----
`*?` | Match 0 or more repetitions, non-greedy
`+?` | Match 1 or more repetitions, non-greedy
`??` | Match 0 or 1 repetitions, non-greedy

#### Greedy vs Lazy (or non-greedy) mode
Greedy mode repeats the pattern **as many times as possible**. Lazy (or non-greedy) mode repeats the pattern **as few times as possible**.
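The difference is easiest to see side by side. The sketch below runs both modes on a short made-up string (`sample` is hypothetical, not a review from the dataset).

```python
import re

sample = "one<br />two<br />three"

# Greedy: .* grabs as much as possible, so the match runs
# from the first '<' all the way to the last '>'
greedy = re.findall("<.*>", sample)
print(greedy)  # ['<br />two<br />']

# Lazy: .*? stops at the first '>' it can, so each tag is matched separately
lazy = re.findall("<.*?>", sample)
print(lazy)  # ['<br />', '<br />']
```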

So, when we add `?` after `.*` to switch it to non-greedy mode, the pattern captures individual HTML tags such as `<br />`.

In [None]:
re.findall('<.*?>', text)

Now that we have found a suitable regular expression, `<.*?>`, to match the HTML tags, let us use the `re.sub()` function to replace the tags in the text:

In [None]:
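print(re.sub('<.*?>', ' ', text))  # preview only; replacing with a space is one reasonable choice
text  # still unchanged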

It is good to first check that the change is as desired and then use the assignment operator `=` to make the changes in the text.

In [None]:
text = re.sub('<.*?>', ' ', text)  # replace each HTML tag with a space
text

Next, we want to get rid of punctuation marks such as commas and periods. We use square brackets `[]` to list the characters that we want to match. For example:

In [None]:
re.findall("[,.]", text)

We can add more characters:

In [None]:
re.findall("[,.:;]", text)

We see that the text above contains "soap opera...", where the period is repeated three times. So, we add `+` at the end of the pattern to allow for 1 or more repetitions of the characters listed in the brackets.

In [None]:
re.findall("[,.]+", text)
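The effect of the trailing `+` can be checked on a short made-up string (`sample` below is a hypothetical example, not the review itself).

```python
import re

sample = "A soap opera... really, truly."

# Without +, every punctuation character is a separate match
single = re.findall("[,.]", sample)
print(single)  # ['.', '.', '.', ',', '.']

# With +, a run of adjacent punctuation characters is one match
runs = re.findall("[,.]+", sample)
print(runs)  # ['...', ',', '.']
```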

Can you think of other characters that you could add here?

Check what the substitution will look like using `re.sub()`.

Make the changes to the `text` variable using assignment operator `=`.
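One possible answer, offered only as a sketch, is to build the character list from `string.punctuation` in the standard library rather than typing it by hand. The sample string is made up, and note the assumption that every ASCII punctuation character should go, which also removes apostrophes and quotes.

```python
import re
import string

# Hypothetical sample string
sample = "Wait... what?! (Really; yes: 100%)"

# re.escape() puts a backslash in front of characters that are special
# in a regular expression, so they are safe inside the square brackets
pattern = "[" + re.escape(string.punctuation) + "]+"

# Preview the substitution first
preview = re.sub(pattern, "", sample)
print(preview)  # Wait what Really yes 100

# Then assign once the result looks right
sample = preview
```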

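Now, what about characters such as `'`, `"`, `[`, `]` and `\`, which have a special meaning in Python strings or in regular expressions? We put a `\` in front of each one so that it is matched literally.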

In [None]:
re.findall(r"[\'\"\[\]]", text)  # raw string: backslashes reach the regex engine unchanged

In [None]:
text = re.sub(r"[\'\"\[\]]", '', text)
text
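When a pattern contains backslashes, it is usually safer to write it as a raw string (prefix `r`), because an ordinary string literal processes backslashes before `re` ever sees them, and sequences such as `\[` are not valid string escapes (recent Python versions emit a warning for them). A minimal sketch on a made-up string:

```python
import re

# Hypothetical sample string containing quotes and square brackets
sample = 'He said "it\'s [sic] fine"'

# Raw string: the pattern reaches the regex engine exactly as written
pattern = r"[\'\"\[\]]"

found = re.findall(pattern, sample)
print(found)  # ['"', "'", '[', ']', '"']

cleaned = re.sub(pattern, "", sample)
print(cleaned)  # He said its sic fine
```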

Sum it all up nicely in a function.
```
import re
def clean_text(text):

    return text
    
```
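One possible way to fill in the template is sketched below. The exact list of punctuation characters and the choice to replace tags with a space are assumptions; adjust both to suit your data.

```python
import re

def clean_text(text):
    # Replace HTML tags such as <br /> with a space (lazy match)
    text = re.sub(r"<.*?>", " ", text)
    # Drop runs of common punctuation (this character list is an assumption)
    text = re.sub(r"[,.:;!?]+", "", text)
    # Drop quotes and square brackets
    text = re.sub(r"[\'\"\[\]]", "", text)
    return text

# Hypothetical input string to check the function
result = clean_text('Good.<br /><br />"Great" acting, [really]...')
print(result)
```

The two adjacent tags each become a space, so the result contains a double space; collapsing repeated whitespace would be a natural extra step.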

Now apply the function `clean_text()` to the column of reviews in the dataframe `df` using the [`map()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html) method. Check that it worked.
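As a sketch of what this could look like, the example below uses a tiny stand-in dataframe instead of the full dataset. The dataframe `demo`, its column names, and the compact `clean_text()` shown here are assumptions for illustration only.

```python
import re
import pandas as pd

def clean_text(text):
    # Compact illustrative cleaner: tags to spaces, then drop punctuation
    text = re.sub(r"<.*?>", " ", text)
    text = re.sub(r"[,.:;!?\'\"\[\]]+", "", text)
    return text

# Tiny stand-in dataframe; the column name "review" is an assumption
demo = pd.DataFrame({
    "review": ["Nice.<br />Fun!", "Dull, slow..."],
    "sentiment": ["positive", "negative"],
})

# Series.map() calls clean_text once per element and returns a new Series
demo["review"] = demo["review"].map(clean_text)
print(demo["review"].tolist())  # ['Nice Fun', 'Dull slow']
```

A quick way to check the result on real data is to search the cleaned column for a leftover `<` with `Series.str.contains()`.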

There is a lot more to regular expressions. Now that you have used them a little, you can learn more with practice. Below are a few good cheatsheets:
* https://learnbyexample.github.io/python-regex-cheatsheet/#re-module-functions
* https://www.shortcutfoo.com/app/dojos/python-regex/cheatsheet
