
In [None]:
import re

# Regex CheatSheet


**Special Characters**<br>
^ | Matches the expression to its right at the start of a string. In multi-line mode, it also matches immediately after each \n in the string.

$ | Matches the expression to its left at the end of a string (or just before a trailing \n). In multi-line mode, it also matches before each \n in the string.

. | Matches any character except line terminators like \n.

\ | Escapes special characters or denotes character classes.

A|B | Matches expression A or B. If A is matched first, B is left untried.

+ | Greedily matches the expression to its left 1 or more times.

* | Greedily matches the expression to its left 0 or more times.

? | Greedily matches the expression to its left 0 or 1 times. If ? is added after a quantifier (+, *, ? itself, or {m,n}), that quantifier matches in a non-greedy manner, taking as few repetitions as possible.

{m} | Matches the expression to its left exactly m times.

{m,n} | Matches the expression to its left from m to n times, greedily taking as many as possible.

{m,n}? | Non-greedy form of {m,n}: matches from m to n times, taking as few as possible. See ? above.<br><br>
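
The difference between greedy and non-greedy quantifiers, and the effect of multi-line mode on ^, is easier to see in a small run. This is a minimal sketch; the sample strings are made up for illustration.

```python
import re

sample = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs on to the last '>' in the string.
greedy = re.findall(r"<.*>", sample)

# Non-greedy: .*? stops at the first '>' that lets the match succeed.
lazy = re.findall(r"<.*?>", sample)

# {m,n} takes as many repeats as it can; {m,n}? takes as few as allowed.
digits_greedy = re.match(r"\d{2,4}", "123456").group()
digits_lazy = re.match(r"\d{2,4}?", "123456").group()

# ^ matches at the start of every line only when re.MULTILINE is set.
lines = "cat\nmat\npat"
first_only = re.findall(r"^\w+", lines)
every_line = re.findall(r"^\w+", lines, flags=re.MULTILINE)

print(greedy)                     # ['<b>bold</b> and <i>italic</i>']
print(lazy)                       # ['<b>', '</b>', '<i>', '</i>']
print(digits_greedy, digits_lazy) # 1234 12
print(first_only, every_line)     # ['cat'] ['cat', 'mat', 'pat']
```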

**Character Classes (a.k.a. Special Sequences)**<br>
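\w | Matches word characters: a-z, A-Z, 0-9, and the underscore, _. For str patterns in Python 3 it also matches Unicode letters and digits unless the ASCII flag is set.

\d | Matches digits, which means 0-9 (and other Unicode decimal digits unless the ASCII flag is set).

\D | Matches any non-digits.

\s | Matches whitespace characters, which include the \t, \n, \r, \f, \v, and space characters.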

\S | Matches non-whitespace characters.

\b | Matches the boundary (or empty string) at the start and end of a word, that is, between \w and \W.

\B | Matches where \b does not, that is, at any position that is not a word boundary (for example, between two \w characters).

\A | Matches the expression to its right at the absolute start of a string whether in single or multi-line mode.

\Z | Matches the expression to its left at the absolute end of a string whether in single or multi-line mode.<br><br>
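
A short sketch of the character classes above in action. The sample strings are invented for this example; the non-ASCII letters are written as escapes so the result does not depend on file encoding.

```python
import re

sample = "Order 66 shipped to cat_lover99 at 10:30"

numbers = re.findall(r"\d+", sample)        # runs of digits
words = re.findall(r"\w+", sample)          # runs of word characters (note the _)
pieces = re.split(r"\s+", "a  b\tc\nd")     # split on any whitespace

# \b needs a word edge; \B needs a position that is NOT a word edge.
probe = "cat concat cat_lover"
whole_word_starts = [m.start() for m in re.finditer(r"\bcat\b", probe)]
inside_word_starts = [m.start() for m in re.finditer(r"\Bcat", probe)]

# \w is Unicode-aware for str patterns; re.ASCII restricts it to [a-zA-Z0-9_].
accented = "na\u00efve caf\u00e9"
unicode_words = re.findall(r"\w+", accented)
ascii_words = re.findall(r"\w+", accented, flags=re.ASCII)

print(numbers)             # ['66', '99', '10', '30']
print(pieces)              # ['a', 'b', 'c', 'd']
print(whole_word_starts)   # [0]
print(inside_word_starts)  # [7]
print(ascii_words)         # ['na', 've', 'caf']
```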

**Sets**<br>
[ ] | Contains a set of characters to match.

[amk] | Matches either a, m, or k. It does not match amk.

[a-z] | Matches any lowercase letter from a to z.

[a\-z] | Matches a, -, or z. It matches - because \ escapes it.

[a-] | Matches a or -, because - is not being used to indicate a series of characters.

[-a] | As above, matches a or -.

[a-z0-9] | Matches characters from a to z and also from 0 to 9.

[(+*)] | Special characters become literal inside a set, so this matches (, +, *, and ).

[^ab5] | Adding ^ excludes any character in the set. Here, it matches characters that are not a, b, or 5.<br><br>
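
The set rules above can be checked quickly with re.findall. A minimal sketch with made-up input strings:

```python
import re

# Each character in the set is an alternative; "amk" as a whole is not matched.
either = re.findall(r"[amk]", "amk")

# An escaped hyphen is a literal hyphen, not a range.
hyphen_literal = re.findall(r"[a\-z]", "x a-z y")

# Ranges can be combined; + repeats the whole set.
range_runs = re.findall(r"[a-z0-9]+", "Room 42b, Floor 3")

# Special characters lose their meaning inside a set.
specials = re.findall(r"[(+*)]", "(1+2)*3")

# A leading ^ negates the set.
negated = re.findall(r"[^ab5]", "ab5cab")

print(either)          # ['a', 'm', 'k']
print(hyphen_literal)  # ['a', '-', 'z']
print(range_runs)      # ['oom', '42b', 'loor', '3']
print(specials)        # ['(', '+', ')', '*']
print(negated)         # ['c']
```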

**Groups**<br>
( ) | Matches the expression inside the parentheses and groups it.

(? ) | Inside parentheses like this, ? acts as an extension notation. Its meaning depends on the character immediately to its right.

(?P<name>AB) | Matches the expression AB, and it can be accessed with the group name.

(?aiLmsux) | Here, a, i, L, m, s, u, and x are flags:<br>
a: ASCII-only matching<br>
i: Ignore case<br>
L: Locale dependent<br>
m: Multi-line<br>
s: Dot matches all characters, including \n<br>
u: Unicode matching<br>
x: Verbose

(?:A) | Matches the expression as represented by A, but unlike (?P<name>AB), it cannot be retrieved afterwards.

(?#...) | A comment. Contents are for us to read, not for matching.

A(?=B) | Lookahead assertion. This matches the expression A only if it is followed by B.

A(?!B) | Negative lookahead assertion. This matches the expression A only if it is not followed by B.

(?<=B)A | Positive lookbehind assertion. This matches the expression A only if B is immediately to its left. B must be a fixed-length expression.

(?<!B)A | Negative lookbehind assertion. This matches the expression A only if B is not immediately to its left. B must be a fixed-length expression.

(?P=name) | Matches the expression matched by an earlier group named “name”.

(...)\1 | Backreference. The number 1 corresponds to the first group. \1 matches the same text that group 1 matched, so a repeated piece of text can be matched by its group number instead of writing the expression out again. Groups can be numbered from 1 up to 99.<br><br>
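
The group constructs above are easier to remember with one small example each. This is a sketch; the group names (area, rest, q) and the sample strings are chosen only for illustration.

```python
import re

# Named groups: parts of the match are retrieved by name.
m = re.search(r"(?P<area>\d{3})-(?P<rest>\d{3}-\d{4})", "call 321-555-4321 now")
area, rest = m.group("area"), m.group("rest")

# Non-capturing group: groups the alternation without creating a group,
# so findall returns the whole match.
names = "Mr. Schafer, Mr Smith, Ms Davis, Mrs. Robinson, Mr. T"
titles = re.findall(r"(?:Mr|Ms|Mrs)\.? [A-Z]\w*", names)

# Lookahead: digits only when followed by " dollars" (which is not consumed).
dollars = re.findall(r"\d+(?= dollars)", "10 dollars, 20 euros, 30 dollars")

# Lookbehind: digits only when preceded (or not preceded) by a dollar sign.
priced = re.findall(r"(?<=\$)\d+", "$5 and 7 and $12")
unpriced = re.findall(r"(?<!\$)\b\d+", "$5 and 7 and $12")

# Numbered backreference: \1 must repeat the text matched by group 1.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)

# Named backreference: the closing quote must be the same as the opening one.
quoted = [q.group(0) for q in re.finditer(r"(?P<q>['\"]).*?(?P=q)", "say \"hi\" or 'yo'")]

print(area, rest)   # 321 555-4321
print(titles)
print(dollars)      # ['10', '30']
print(priced)       # ['5', '12']
print(unpriced)     # ['7']
print(doubled)      # is
print(quoted)
```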

**Popular Python re Module Functions**<br>
re.findall(A, B) | Matches all instances of an expression A in a string B and returns them in a list.

re.search(A, B) | Finds the first instance of an expression A in a string B and returns it as a re match object, or None if there is no match.

re.split(A, B) | Split a string B into a list using the delimiter A.

re.sub(A, B, C) | Replace A with B in the string C.
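
A minimal sketch of the four functions above on one string. The phone numbers are taken from the practice text further down; the variable names are only for this example.

```python
import re

text = "800-555-1234, 900-555-1234 and 321-555-4321"
phone = r"\d{3}-\d{3}-\d{4}"

all_numbers = re.findall(phone, text)        # every match, as a list of strings
first = re.search(phone, text)               # first match, as a match object
parts = re.split(r",\s*|\s+and\s+", text)    # split on ", " or " and "
masked = re.sub(r"\d{4}\b", "XXXX", text)    # replace the last four digits
missing = re.search(r"[a-z]+@[a-z]+", text)  # no match, so this is None

print(all_numbers)
print(first.group(), first.span())  # 800-555-1234 (0, 12)
print(parts)
print(masked)                       # 800-555-XXXX, 900-555-XXXX and 321-555-XXXX
print(missing)                      # None
```

Because re.search can return None, it is worth checking the result before calling .group() on it.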

# Practice example

In [None]:
# Raw string, so the lone backslash in the metacharacter line is kept literally
text_to_search = r'''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
Ha HaHa
MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )
coreyms.com
321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
sondipsingha@gamil.com
sondipsingha@bubt.edu.bd
milkmocha_8393@yahoo.org

<html>
   <body> Hello world </body>
   hi there <br></br>
   <title>
   <>
</html>

cat
mat 
pat
'''
sentence = 'Start a sentence and then bring it to an end'


# Email pattern (the dot in edu.bd is escaped so that it matches a literal dot):
# pattern = re.compile(r'[a-zA-Z]+_*\d*@[a-zA-Z]+\.(com|edu\.bd|org)')
pattern = re.compile(r'<.*?>')

# Greedy vs. non-greedy: on '<br></br>', the greedy <.*> matches the whole '<br></br>',
# because .* keeps going until the last '>'.
# The non-greedy <.*?> takes as few characters as possible and stops at the first '>',
# so it matches '<br>' and '</br>' separately.

matches = pattern.finditer(text_to_search)


for match in matches:
  print(match)


<re.Match object; span=(340, 346), match='<html>'>
<re.Match object; span=(350, 356), match='<body>'>
<re.Match object; span=(369, 376), match='</body>'>
<re.Match object; span=(389, 393), match='<br>'>
<re.Match object; span=(393, 398), match='</br>'>
<re.Match object; span=(402, 409), match='<title>'>
<re.Match object; span=(413, 415), match='<>'>
<re.Match object; span=(416, 423), match='</html>'>
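
The commented-out email pattern can be tried the same way. The sketch below assumes two small changes to it: the dot in edu.bd is escaped, and the alternation is made non-capturing so that findall returns whole addresses. The excerpt is copied from the practice text.

```python
import re

emails_text = """
coreyms.com
sondipsingha@gamil.com
sondipsingha@bubt.edu.bd
milkmocha_8393@yahoo.org
"""

# Non-capturing (?:...) keeps findall returning the full match.
email_pattern = re.compile(r"[a-zA-Z]+_*\d*@[a-zA-Z]+\.(?:com|edu\.bd|org)")
emails = email_pattern.findall(emails_text)

# With a capturing group, findall returns only the captured suffix.
suffixes = re.findall(r"[a-zA-Z]+_*\d*@[a-zA-Z]+\.(com|edu\.bd|org)", emails_text)

print(emails)
print(suffixes)  # ['com', 'edu.bd', 'org']
```

This pattern is tailored to the three sample addresses and is not a general email validator.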


# Text Preprocessing

In [None]:
import pandas as pd
import numpy as np

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Datasets/IMDB Dataset.csv')

In [None]:
df = df.head(n=20).copy()  # copy, so later column assignments work on an independent DataFrame
df.head()

                                              review  sentiment
0  One of the other reviewers has mentioned that ...   positive
1  A wonderful little production. <br /><br />The...   positive
2  I thought this was a wonderful way to spend ti...   positive
3  Basically there's a family where a little boy ...   negative
4  Petter Mattei's "Love in the Time of Money" is...   positive


In [None]:
df['review'][1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [None]:
# convert to lowercase
df['review'].str.lower() # the .str accessor applies string methods element-wise to a Series

0     one of the other reviewers has mentioned that ...
1     a wonderful little production. <br /><br />the...
2     i thought this was a wonderful way to spend ti...
3     basically there's a family where a little boy ...
4     petter mattei's "love in the time of money" is...
5     probably my all-time favorite movie, a story o...
6     i sure would like to see a resurrection of a u...
7     this show was an amazing, fresh & innovative i...
8     encouraged by the positive comments about this...
9     if you like original gut wrenching laughter yo...
10    phil the alien is one of those quirky films wh...
11    i saw this movie when i was about 12 when it c...
12    so im not a big fan of boll's work but then ag...
13    the cast played shakespeare.<br /><br />shakes...
14    this a fantastic movie of three prisoners who ...
15    kind of drawn in by the erotic scenes, only to...
16    some films just simply should not be remade. t...
17    this movie made it into one of my top 10 m

In [None]:
df['review'] = df['review'].str.lower()

In [None]:
df.head()

                                              review  sentiment
0  one of the other reviewers has mentioned that ...   positive
1  a wonderful little production. <br /><br />the...   positive
2  i thought this was a wonderful way to spend ti...   positive
3  basically there's a family where a little boy ...   negative
4  petter mattei's "love in the time of money" is...   positive


In [None]:
# remove html tags using regular expression

In [None]:
import re

In [None]:
# Compile once so the pattern is not rebuilt on every call
HTML_TAG_PATTERN = re.compile(r'<.*?>')

def remove_html(text):
  # Replace every HTML tag with an empty string (no space is inserted)
  return HTML_TAG_PATTERN.sub('', text)

In [None]:
clear_text = remove_html(df['review'][1])
print(clear_text)

a wonderful little production. the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. the actors are extremely well chosen- michael sheen not only "has got all the polari" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master's of comedy and his life. the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell's murals decorating every surface) are terribly well done.
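
One thing to watch: replacing a tag with an empty string glues words together when the tag was the only separator, as in the preview of row 13 (`shakespeare.<br /><br />shakes...`). A possible variant, sketched here under the hypothetical name remove_html_spaced, swaps each tag for a space and then collapses repeated whitespace. The sample string is made up, modeled on that row.

```python
import re

TAG_RE = re.compile(r"<.*?>")
WS_RE = re.compile(r"\s+")

def remove_html_spaced(text):
    """Replace each tag with a space, then collapse runs of whitespace."""
    no_tags = TAG_RE.sub(" ", text)
    return WS_RE.sub(" ", no_tags).strip()

sample = "the cast played shakespeare.<br /><br />shakespeare lost."

glued = TAG_RE.sub("", sample)       # tags replaced by nothing
spaced = remove_html_spaced(sample)  # tags replaced by a space

print(glued)   # the cast played shakespeare.shakespeare lost.
print(spaced)  # the cast played shakespeare. shakespeare lost.
```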


In [None]:
df['review'] = df['review'].apply(remove_html)
df.head()

                                              review  sentiment
0  one of the other reviewers has mentioned that ...   positive
1  a wonderful little production. the filming tec...   positive
2  i thought this was a wonderful way to spend ti...   positive
3  basically there's a family where a little boy ...   negative
4  petter mattei's "love in the time of money" is...   positive


In [None]:
## REMOVING URLS 

text='''
https://www.youtube.com/watch?v=Qbd7U9F0QQ8&list=PLKnIA16_RmvZo7fp5kkIth6nRTeQQsjfX&index=7
http://www.vulnerablewebsite.org
www.facebook.com 
'''

text1 = 'https://www.youtube.com/watch?v=Qbd7U9F0QQ8&list=PLKnIA16_RmvZo7fp5kkIth6nRTeQQsjfX&index=7'
text2 = 'https://www.youtube.com/watch?v=Qbd7U9F0Q this is a important link'

In [None]:
def remove_urls(text):
  # \S matches any non-whitespace character, so each URL runs up to the next whitespace
  pattern = re.compile(r'https?://\S+|www\.\S+')
  clean_text = pattern.sub(r'',text)
  return clean_text

In [None]:
remove_urls(text2)

' this is a important link'
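
The result above keeps the space that followed the URL. A small variant, sketched here under the hypothetical name remove_urls_tidy, uses the same pattern and then collapses leftover whitespace. It also shows one caveat of \S+: punctuation directly after a URL is removed along with it.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")

def remove_urls_tidy(text):
    """Remove URLs, then collapse whitespace left behind."""
    without_urls = URL_RE.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()

text = """
https://www.youtube.com/watch?v=Qbd7U9F0QQ8&list=PLKnIA16_RmvZo7fp5kkIth6nRTeQQsjfX&index=7
http://www.vulnerablewebsite.org
www.facebook.com
"""
text2 = "https://www.youtube.com/watch?v=Qbd7U9F0Q this is a important link"

only_urls = remove_urls_tidy(text)    # nothing but URLs, so nothing is left
with_words = remove_urls_tidy(text2)
swallowed = remove_urls_tidy("see www.facebook.com, then reply")

print(repr(only_urls))  # ''
print(with_words)       # this is a important link
print(swallowed)        # see then reply  (the comma went with the URL)
```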

In [None]:
df['review'] = df['review'].apply(remove_urls)
df.head()

                                              review  sentiment
0  one of the other reviewers has mentioned that ...   positive
1  a wonderful little production. the filming tec...   positive
2  i thought this was a wonderful way to spend ti...   positive
3  basically there's a family where a little boy ...   negative
4  petter mattei's "love in the time of money" is...   positive


In [None]:
## REMOVING PUNCTUATION
# Punctuation adds almost nothing for this task, and it causes confusion when we tokenize.
# In "Hello! how are you?", "Hello!" may become a single token,
# so "Hello!" and "Hello" (as in "Hello world") would be treated as different words.

In [None]:
import string
puncs = string.punctuation

In [None]:
puncs

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
def remove_puncs(text):
  # [^\w\s] matches any character that is neither a word character (letters, digits, _) nor whitespace
  # See https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string
  pattern = re.compile(r'[^\w\s]')
  clean_text = pattern.sub(r'',text)
  return clean_text

In [None]:
remove_puncs("Hello! I love python!")

'Hello I love python'
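
The `puncs` string defined earlier is not used by remove_puncs. An alternative that does use string.punctuation is str.translate, sketched below under the hypothetical name remove_puncs_translate. The two approaches differ on the underscore: \w treats _ as a word character, so the regex keeps it, while string.punctuation contains _, so translate removes it.

```python
import re
import string

# Translation table that deletes every character in string.punctuation.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_puncs_translate(text):
    return text.translate(PUNCT_TABLE)

def remove_puncs_regex(text):
    # Same pattern as remove_puncs in this notebook.
    return re.sub(r"[^\w\s]", "", text)

simple = remove_puncs_translate("Hello! I love python!")
via_translate = remove_puncs_translate("snake_case & co.")
via_regex = remove_puncs_regex("snake_case & co.")

print(simple)               # Hello I love python
print(repr(via_translate))  # 'snakecase  co'
print(repr(via_regex))      # 'snake_case  co'
```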

In [None]:
df['review'] = df['review'].apply(remove_puncs)
df.head()

                                              review  sentiment
0  one of the other reviewers has mentioned that ...   positive
1  a wonderful little production the filming tech...   positive
2  i thought this was a wonderful way to spend ti...   positive
3  basically theres a family where a little boy j...   negative
4  petter matteis love in the time of money is a ...   positive


In [None]:
df['review'][1]

'a wonderful little production the filming technique is very unassuming very oldtimebbc fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece the actors are extremely well chosen michael sheen not only has got all the polari but he has all the voices down pat too you can truly see the seamless editing guided by the references to williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece a masterful production about one of the great masters of comedy and his life the realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears it plays on our knowledge and our senses particularly with the scenes concerning orton and halliwell and the sets particularly of their flat with halliwells murals decorating every surface are terribly well done'

In [None]:
# Removing chat shortcuts like STFU, BTW
# Correcting misspelled words using TextBlob

In [None]:
from textblob import TextBlob

In [None]:
TextBlob("hlw wolrd is the frist line of programing").correct().string # results are imperfect: 'hlw' becomes 'how'

'how world is the first line of programming'
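
The chat-shortcut step mentioned above has no code in this notebook. One common way to handle shortcuts is to look each word up in a mapping and substitute the full form. The sketch below assumes a tiny hand-written dictionary (CHAT_WORDS is hypothetical; a real project would load a much longer list) and assumes punctuation has already been removed, since "btw," would not match "btw".

```python
# Hypothetical, hand-written mapping used only for illustration.
CHAT_WORDS = {
    "btw": "by the way",
    "imo": "in my opinion",
    "asap": "as soon as possible",
}

def expand_chat_words(text):
    """Replace each known shortcut with its full form; leave other words as-is."""
    words = []
    for word in text.split():
        words.append(CHAT_WORDS.get(word.lower(), word))
    return " ".join(words)

expanded = expand_chat_words("BTW the movie was great imo")
print(expanded)  # by the way the movie was great in my opinion
```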

In [None]:
## Removing stop words
import nltk
nltk.download('punkt')  # tokenizer data required by word_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
nltk.download('stopwords')
# https://stackoverflow.com/questions/26693736/nltk-and-stopwords-fail-lookuperror

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
def remove_stopwords(text):
  tokens = word_tokenize(text)
  # A set gives fast membership tests
  swords = set(stopwords.words('english'))
  # Keep only the tokens that are not stop words
  processed_text = [token for token in tokens if token not in swords]
  return " ".join(processed_text)

In [None]:
print(df['review'][1])
print(remove_stopwords(df['review'][1]))

a wonderful little production the filming technique is very unassuming very oldtimebbc fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece the actors are extremely well chosen michael sheen not only has got all the polari but he has all the voices down pat too you can truly see the seamless editing guided by the references to williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece a masterful production about one of the great masters of comedy and his life the realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears it plays on our knowledge and our senses particularly with the scenes concerning orton and halliwell and the sets particularly of their flat with halliwells murals decorating every surface are terribly well done
wonderful little production filming technique unassuming oldtim
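
For sentiment data, one caveat is that NLTK's English stop-word list includes negations such as "not" and "no", so removing every stop word can change the apparent meaning of a review. The sketch below is dependency-free and uses a tiny hand-picked stop-word set (STOP_WORDS, NEGATIONS and the function name are assumptions for this example, not NLTK's list) to show the effect of keeping negations.

```python
# Tiny illustrative sets; NLTK's real list is much longer.
STOP_WORDS = {"a", "the", "is", "was", "this", "not", "no", "and", "it"}
NEGATIONS = {"not", "no", "nor", "never"}

def remove_stopwords_simple(text, keep=frozenset()):
    """Drop stop words, except any listed in `keep`."""
    effective = STOP_WORDS - set(keep)
    return " ".join(tok for tok in text.split() if tok not in effective)

review = "this movie was not good and it is no masterpiece"

all_removed = remove_stopwords_simple(review)
negations_kept = remove_stopwords_simple(review, keep=NEGATIONS)

print(all_removed)     # movie good masterpiece
print(negations_kept)  # movie not good no masterpiece
```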

In [None]:
## handling emojis -> remove or replace with text
!pip install emoji

In [None]:
import emoji

In [None]:
emoji.demojize('a movie title based on a series of emoji. (try these: 💉💎 or 👦🏻👓⚡)')

'a movie title based on a series of emoji. (try these: :syringe::gem_stone: or :boy_light_skin_tone::glasses::high_voltage:)'
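
demojize covers the "replace with text" option. For the "remove" option, one dependency-free approach is a regex over the main emoji code point ranges. The sketch below is approximate: the ranges are an assumption that covers common emoji blocks, not every emoji sequence, and remove_emojis is a hypothetical helper name.

```python
import re

# Approximate coverage of common emoji code points (not exhaustive).
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, skin-tone modifiers
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)

def remove_emojis(text):
    return EMOJI_RE.sub("", text)

# Thumbs-up with a skin-tone modifier, and a high-voltage sign, written as escapes.
sample = "great movie \U0001F44D\U0001F3FB loved it \u26A1"
cleaned = remove_emojis(sample)

print(repr(cleaned))  # 'great movie  loved it '
```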

In [None]:
### Tokenization (by word or by sentence)
### Stemming (converts words into a pseudo root form; fast)
### Lemmatization (converts words into their actual dictionary root form; slower)
### Use lemmatization only when the processed text has to be shown to the user
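
Word and sentence tokenization can be approximated with plain regular expressions. This is a rough sketch, not a replacement for NLTK's tokenizers, which handle abbreviations and contractions differently; word_tokens and sentence_tokens are hypothetical helper names.

```python
import re

def word_tokens(text):
    # A word, optionally followed by an apostrophe part (keeps "don't" together).
    return re.findall(r"\w+(?:'\w+)?", text)

def sentence_tokens(text):
    # Split on whitespace that follows ., ! or ? (the punctuation stays attached).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "Hello! How are you? I don't know."

words = word_tokens(text)
sentences = sentence_tokens(text)

print(words)      # ['Hello', 'How', 'are', 'you', 'I', "don't", 'know']
print(sentences)  # ['Hello!', 'How are you?', "I don't know."]
```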

In [None]:
from nltk.stem.porter import PorterStemmer

In [None]:
ps = PorterStemmer()

In [None]:
def stemming(text):
  return " ".join([ps.stem(word) for word in text.split()])

In [None]:
stemming(df['review'][1])

'a wonder littl product the film techniqu is veri unassum veri oldtimebbc fashion and give a comfort and sometim discomfort sens of realism to the entir piec the actor are extrem well chosen michael sheen not onli ha got all the polari but he ha all the voic down pat too you can truli see the seamless edit guid by the refer to william diari entri not onli is it well worth the watch but it is a terrificli written and perform piec a master product about one of the great master of comedi and hi life the realism realli come home with the littl thing the fantasi of the guard which rather than use the tradit dream techniqu remain solid then disappear it play on our knowledg and our sens particularli with the scene concern orton and halliwel and the set particularli of their flat with halliwel mural decor everi surfac are terribl well done'

In [None]:
df['review'] = df['review'].apply(stemming)
df.head()

                                              review  sentiment
0  one of the other review ha mention that after ...   positive
1  a wonder littl product the film techniqu is ve...   positive
2  i thought thi wa a wonder way to spend time on...   positive
3  basic there a famili where a littl boy jake th...   negative
4  petter mattei love in the time of money is a v...   positive
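
The regex-based steps of this notebook can be chained into one function. The sketch below is a minimal, dependency-free version (preprocess is a hypothetical name) that leaves out the NLTK and TextBlob steps. The order matters: tags and URLs are removed before punctuation, because stripping punctuation first would break the < > and :// that those patterns rely on. The sample string and its URL are made up.

```python
import re

TAG_RE = re.compile(r"<.*?>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_RE = re.compile(r"[^\w\s]")

def preprocess(text):
    text = text.lower()             # 1. lowercase
    text = TAG_RE.sub("", text)     # 2. strip HTML tags
    text = URL_RE.sub("", text)     # 3. strip URLs
    text = PUNCT_RE.sub("", text)   # 4. strip punctuation
    return " ".join(text.split())   # 5. collapse whitespace

sample = "A wonderful little production. <br /><br />See www.example.com for more!"
result = preprocess(sample)

print(result)  # a wonderful little production see for more
```

With pandas, the same function could be applied to the whole column with df['review'].apply(preprocess).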
