# <p style="font-family:newtimeroman; font-size:150%; text-align:center; color:#4287f5">Text Cleaning Operations in NLP </p>

## 1. Lowercase text with ***lower()***

In [36]:
import pandas as pd
df = pd.read_csv('IMDB Dataset.csv')
df.head(10)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


In [37]:
# Pick one sample review (row 3)
df['review'][3]

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."

In [38]:
# Lowercase this raw text
df['review'][3].lower()

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.<br /><br />ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

### Lowercase the whole corpus with ***str.lower()***

In [39]:
df['review'] = df['review'].str.lower()
df.head(10)

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
5,"probably my all-time favorite movie, a story o...",positive
6,i sure would like to see a resurrection of a u...,positive
7,"this show was an amazing, fresh & innovative i...",negative
8,encouraged by the positive comments about this...,negative
9,if you like original gut wrenching laughter yo...,positive
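For English reviews `lower()` is enough. If the corpus may contain non-English characters, Python's built-in `str.casefold()` is a more aggressive alternative worth knowing about. The following is a minimal sketch; the sample words are made up for illustration and are not taken from the IMDB data.

```python
# Compare lower() and casefold() on made-up sample words.
word = "STRASSE"
german = "Straße"

print(german.lower())      # straße
print(german.casefold())   # strasse

# casefold() makes the two spellings compare equal; lower() does not.
print(word.lower() == german.lower())        # False
print(word.casefold() == german.casefold())  # True
```

On a pandas column the equivalent call would be `df['review'].str.casefold()`.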


## 2. Remove HTML tags
***HTML tags can be removed with a regular expression.***

In [40]:
import re

def remove_html_tags(text):
    # '<.*?>' matches '<', then as few characters as possible, up to the next '>'
    pattern = re.compile('<.*?>')
    return pattern.sub('', text)

# What is re.compile()?
# re.compile() compiles a regular expression pattern into a regular expression object,
# which can then be reused to match, search, substitute, or split text efficiently.

In [41]:
text = "<html><body><p> Movie 1</p><p> Actor - Aamir Khan</p><p> Click here to <a href='http://google.com'>download</a></p></body></html>"
text

"<html><body><p> Movie 1</p><p> Actor - Aamir Khan</p><p> Click here to <a href='http://google.com'>download</a></p></body></html>"

In [42]:
remove_html_tags(text)

' Movie 1 Actor - Aamir Khan Click here to download'

In [43]:
# Remove HTML tags from every review in the corpus
df['review'] = df['review'].apply(remove_html_tags)
df.head(10)

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
5,"probably my all-time favorite movie, a story o...",positive
6,i sure would like to see a resurrection of a u...,positive
7,"this show was an amazing, fresh & innovative i...",negative
8,encouraged by the positive comments about this...,negative
9,if you like original gut wrenching laughter yo...,positive


In [44]:
df['review'][1]

'a wonderful little production. the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. the actors are extremely well chosen- michael sheen not only "has got all the polari" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master\'s of comedy and his life. the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell\'s murals decorating every surface) are terribly well done.'
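Replacing a tag with an empty string can glue two words together when the review has no space around `<br />`, as in `time.<br /><br />This`. One possible variant, sketched below, substitutes a space instead and then collapses repeated whitespace. The name `remove_html_tags_spaced` is hypothetical and not part of the notebook above.

```python
import re

TAG_PATTERN = re.compile(r'<.*?>')
SPACE_PATTERN = re.compile(r'\s+')

def remove_html_tags_spaced(text):
    # Replace each tag with a space so neighbouring words stay separated.
    no_tags = TAG_PATTERN.sub(' ', text)
    # Collapse the runs of whitespace left behind and trim both ends.
    return SPACE_PATTERN.sub(' ', no_tags).strip()

sample = "all the time.<br /><br />This movie is slower than a soap opera"
print(remove_html_tags_spaced(sample))
# all the time. This movie is slower than a soap opera
```

The trade-off is that this variant also normalizes whitespace the author may have intended to keep, which rarely matters for bag-of-words style models.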

## 3. Remove URLs

In [45]:
# Here we also use a regular expression to remove URLs from a text or from the whole corpus
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

In [46]:
text1 = 'Google search here www.google.com'
text2 = 'Here is my website, www.masudmehrab.com'
text3 = 'Get RGB colors from this site https://www.google.com/search?q=rgb+color+picker&rlz=1C1PNBB_enBD1154BD1157&oq=rgb+colo&gs_lcrp=EgZjaHJvbWUqDAgAECMYJxiABBiKBTIMCAAQIxgnGIAEGIoFMgYIARBFGEAyBggCEEUYOTIMCAMQABhDGIAEGIoFMgwIBBAAGEMYgAQYigUyBwgFEAAYgAQyBwgGEAAYgAQyBwgHEAAYgATSAQkzNTU2ajBqMTWoAgiwAgHxBQB2_eeacqeT&sourceid=chrome&ie=UTF-8 to make your themes beautiful.'
text4 = 'Watch NLP tutorilas from here https://www.youtube.com/watch?v=lK9gx4q_vfI'


In [47]:
print(remove_url(text1))
print(remove_url(text2))
print(remove_url(text3))
print(remove_url(text4))


Google search here 
Here is my website, 
Get RGB colors from this site  to make your themes beautiful.
Watch NLP tutorilas from here 
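Deleting a URL leaves a trailing or doubled space behind, and it also hides the fact that a link was present. A common alternative is to swap each URL for a fixed placeholder token. The sketch below assumes the same pattern as `remove_url`; the function name `replace_url` and the placeholder `URL` are illustrative choices, not part of the notebook.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def replace_url(text, placeholder='URL'):
    # Swap each URL for a fixed token instead of deleting it.
    replaced = URL_PATTERN.sub(placeholder, text)
    # Collapse repeated whitespace and trim both ends.
    return re.sub(r'\s+', ' ', replaced).strip()

print(replace_url('Google search here www.google.com'))
# Google search here URL
print(replace_url('Visit https://example.com/page now', placeholder=''))
# Visit now
```

Passing an empty placeholder gives plain removal with the leftover spaces tidied up.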


## 4. Remove Punctuation

In [48]:
# The string module exposes the ASCII punctuation characters as string.punctuation
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [49]:
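punct = string.punctuation
# Build the translation table once: every punctuation character maps to None
punct_table = str.maketrans('', '', punct)

def remove_punct(text):
    return text.translate(punct_table)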

In [50]:
text = "A paragraph? 70–80 words? So short! Using—let me get this straight—all 14 " \
"punctuation marks? Intriguing; yet I think it can be done, probably, without (excessive) false modesty on my part. " \
"[Boasting omitted.] Of course, it’ll be tricky to fit in the braces ({}) but I’m sure we’ll find a way: as Karl Marx said, " \
"“… mankind always sets itself only such tasks as it can solve …”. There: that’s only partly half-assed, more like quarter-assed. " \
"Enjoy."
text

'A paragraph? 70–80 words? So short! Using—let me get this straight—all 14 punctuation marks? Intriguing; yet I think it can be done, probably, without (excessive) false modesty on my part. [Boasting omitted.] Of course, it’ll be tricky to fit in the braces ({}) but I’m sure we’ll find a way: as Karl Marx said, “… mankind always sets itself only such tasks as it can solve …”. There: that’s only partly half-assed, more like quarter-assed. Enjoy.'

In [51]:
remove_punct(text)

'A paragraph 70–80 words So short Using—let me get this straight—all 14 punctuation marks Intriguing yet I think it can be done probably without excessive false modesty on my part Boasting omitted Of course it’ll be tricky to fit in the braces  but I’m sure we’ll find a way as Karl Marx said “… mankind always sets itself only such tasks as it can solve …” There that’s only partly halfassed more like quarterassed Enjoy'
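The output above still contains the curly quotes, dashes, and ellipsis characters, because `string.punctuation` only lists ASCII punctuation. One way to cover them, sketched here under the assumption that every character in a Unicode "P" (punctuation) category should go, is to build the translation table with the standard `unicodedata` module. The name `remove_unicode_punct` is hypothetical.

```python
import string
import sys
import unicodedata

# Map every code point whose Unicode category starts with "P" to None.
UNICODE_PUNCT = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith('P')
}
# ASCII symbols such as + = < > are not in a "P" category, so add them explicitly.
UNICODE_PUNCT.update(str.maketrans('', '', string.punctuation))

def remove_unicode_punct(text):
    return text.translate(UNICODE_PUNCT)

print(remove_unicode_punct("it’ll be tricky “… really …” Using—let me"))
# itll be tricky  really  Usinglet me
```

As with the ASCII version, removing a dash that joins two words merges them (`Using—let` becomes `Usinglet`), so replacing punctuation with a space may be preferable for some tasks.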

In [52]:
# Remove punctuation from the whole review column
df['review']=df['review'].apply(remove_punct)
df.head(10)

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tech...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive
5,probably my alltime favorite movie a story of ...,positive
6,i sure would like to see a resurrection of a u...,positive
7,this show was an amazing fresh innovative ide...,negative
8,encouraged by the positive comments about this...,negative
9,if you like original gut wrenching laughter yo...,positive


In [53]:
df['review'][3]

'basically theres a family where a little boy jake thinks theres a zombie in his closet  his parents are fighting all the timethis movie is slower than a soap opera and suddenly jake decides to become rambo and kill the zombieok first of all when youre going to make a film you must decide if its a thriller or a drama as a drama the movie is watchable parents are divorcing  arguing like in real life and then we have jake with his closet which totally ruins all the film i expected to see a boogeyman similar movie and instead i watched a drama with some meaningless thriller spots3 out of 10 just for the well playing parents  descent dialogs as for the shots with jake just ignore them'
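Steps 1 to 4 were applied one cell at a time, and their order matters: stripping punctuation first would destroy the `<`, `>`, `:` and `/` characters that the tag and URL patterns rely on. The sketch below chains the four steps in one helper. The name `clean_text` is hypothetical, and unlike the cells above it replaces tags and URLs with a space, which avoids merged words such as `timethis` in the output above.

```python
import re
import string

HTML_PATTERN = re.compile(r'<.*?>')
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text(text):
    text = text.lower()                    # 1. lowercase
    text = HTML_PATTERN.sub(' ', text)     # 2. strip HTML tags
    text = URL_PATTERN.sub(' ', text)      # 3. strip URLs
    text = text.translate(PUNCT_TABLE)     # 4. strip ASCII punctuation
    return ' '.join(text.split())          # tidy leftover whitespace

print(clean_text("Great movie!<br /><br />See www.example.com for more."))
# great movie see for more
```

On the raw DataFrame this could be applied in a single pass with `df['review'].apply(clean_text)`.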

## 5. Handling Chat Words

In [54]:
# Slang dataset source: https://github.com/rishabhverma17/sms_slang_translator/blob/master/slang.txt
chat_words = {
    "AFAIK": "As Far As I Know",
    "AFK": "Away From Keyboard",
    "ASAP": "As Soon As Possible",
    "ATK": "At The Keyboard",
    "ATM": "At The Moment",
    "A3": "Anytime, Anywhere, Anyplace",
    "BAK": "Back At Keyboard",
    "BBL": "Be Back Later",
    "BBS": "Be Back Soon",
    "BFN": "Bye For Now",
    "B4N": "Bye For Now",
    "BRB": "Be Right Back",
    "BRT": "Be Right There",
    "BTW": "By The Way",
    "B4": "Before",
    "CU": "See You",
    "CUL8R": "See You Later",
    "CYA": "See You",
    "FAQ": "Frequently Asked Questions",
    "FC": "Fingers Crossed",
    "FWIW": "For What It's Worth",
    "FYI": "For Your Information",
    "GAL": "Get A Life",
    "GG": "Good Game",
    "GN": "Good Night",
    "GMTA": "Great Minds Think Alike",
    "GR8": "Great!",
    "G9": "Genius",
    "IC": "I See",
    "ICQ": "I Seek you (also a chat program)",
    "ILU": "I Love You",
    "IMHO": "In My Honest/Humble Opinion",
    "IMO": "In My Opinion",
    "IOW": "In Other Words",
    "IRL": "In Real Life",
    "KISS": "Keep It Simple, Stupid",
    "LDR": "Long Distance Relationship",
    "LMAO": "Laugh My A.. Off",
    "LOL": "Laughing Out Loud",
    "LTNS": "Long Time No See",
    "L8R": "Later",
    "MTE": "My Thoughts Exactly",
    "M8": "Mate",
    "NRN": "No Reply Necessary",
    "OIC": "Oh I See",
    "PITA": "Pain In The A..",
    "PRT": "Party",
    "PRW": "Parents Are Watching",
    "QPSA?": "Que Pasa?",
    "ROFL": "Rolling On The Floor Laughing",
    "ROFLOL": "Rolling On The Floor Laughing Out Loud",
    "ROTFLMAO": "Rolling On The Floor Laughing My A.. Off",
    "SK8": "Skate",
    "STATS": "Your sex and age",
    "ASL": "Age, Sex, Location",
    "THX": "Thank You",
    "TTFN": "Ta-Ta For Now!",
    "TTYL": "Talk To You Later",
    "U": "You",
    "U2": "You Too",
    "U4E": "Yours For Ever",
    "WB": "Welcome Back",
    "WTF": "What The F...",
    "WTG": "Way To Go!",
    "WUF": "Where Are You From?",
    "W8": "Wait...",
    "7K": "Sick:-D Laugher",
    "TFW": "That feeling when",
    "MFW": "My face when",
    "MRW": "My reaction when",
    "IFYP": "I feel your pain",
    "TNTL": "Trying not to laugh",
    "JK": "Just kidding",
    "IDC": "I don't care",
    "ILY": "I love you",
    "IMU": "I miss you",
    "ADIH": "Another day in hell",
    "ZZZ": "Sleeping, bored, tired",
    "WYWH": "Wish you were here",
    "TIME": "Tears in my eyes",
    "BAE": "Before anyone else",
    "FIMH": "Forever in my heart",
    "BSAAW": "Big smile and a wink",
    "BWL": "Bursting with laughter",
    "BFF": "Best friends forever",
    "CSL": "Can't stop laughing"
    
}




In [55]:
# Function: replace each chat abbreviation with its full form
def chat_conversion(text):
    new_text = []  # This will store the converted (normalized) words
    for word in text.split():  # Split the input on whitespace and visit each word
        if word.upper() in chat_words:
            new_text.append(chat_words[word.upper()])
        else:
            new_text.append(word)
    return " ".join(new_text)



In [56]:
text1 = "IFYP don't be sad bro!"
text2 = "JK don't mind be cheerful."
text3 = "FYI Cox's Bazar is the largest beach in the world."
print(chat_conversion(text1))
print(chat_conversion(text2))
print(chat_conversion(text3))

I feel your pain don't be sad bro!
Just kidding don't mind be cheerful.
For Your Information Cox's Bazar is the largest beach in the world.
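Because `chat_conversion` splits on whitespace, an abbreviation with punctuation attached (for example `brb,`) is not found in the dictionary. A regex-based variant can match only the word characters and leave the punctuation in place. The sketch below uses a small made-up subset of `chat_words` so that it runs on its own; `chat_conversion_regex` and `sample_chat_words` are hypothetical names.

```python
import re

# A small subset of the chat_words dictionary, repeated so the sketch is self-contained.
sample_chat_words = {"BRB": "Be Right Back", "IMO": "In My Opinion"}

def chat_conversion_regex(text, mapping=sample_chat_words):
    def expand(match):
        token = match.group(0)
        # Fall back to the original token when it is not a known abbreviation.
        return mapping.get(token.upper(), token)
    # \w+ matches runs of letters, digits and underscores, so trailing
    # punctuation stays outside the match and is preserved.
    return re.sub(r'\w+', expand, text)

print(chat_conversion_regex("brb, imo this movie is great!"))
# Be Right Back, In My Opinion this movie is great!
```

Two caveats apply to both versions: keys such as `TIME` and `KISS` collide with ordinary English words once the lookup is case-insensitive, and a key that itself contains punctuation (`QPSA?`) can never match a `\w+` token.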


## 6. Spelling Correction

In [57]:
from textblob import TextBlob

In [58]:
# Two misspelled texts to correct with the TextBlob library
inc_text1 = 'Englishh is a West Geermanic languge that originated in eaarly medieval England and has since evolved onto a glbal lingua franca.'
inc_text2 = "Spellin crrection in Nataral Languge Procesng (NLP) is the tusk of automaticaly detect and correct spalling errorrs in text."
textblob1 = TextBlob(inc_text1)
textblob2 = TextBlob(inc_text2)
print(inc_text1)
print(textblob1.correct().string)
print(inc_text2)
print(textblob2.correct().string)

Englishh is a West Geermanic languge that originated in eaarly medieval England and has since evolved onto a glbal lingua franca.
English is a West Germanic language that originated in early medieval England and has since evolved onto a global lingual france.
Spellin crrection in Nataral Languge Procesng (NLP) is the tusk of automaticaly detect and correct spalling errorrs in text.
Spelling correction in Natural Language Procesng (NLP) is the task of automatically detect and correct spelling errors in text.


## 7. Handling Stop Words

In [59]:
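import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Note: this rebinds the name `stopwords` from the NLTK corpus reader to a plain list of words
stopwords = stopwords.words('english')
stopwords
# The output below lists the English stop words shipped with NLTK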

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [60]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [61]:
# Function
# A set makes the membership test faster than scanning the list
stop_set = set(stopwords)

def remove_stopwords(text):
    # Keep only the words that are not stop words, so no empty strings are joined
    return " ".join(word for word in text.split() if word not in stop_set)


In [67]:
text= "basically theres a family where a little boy jake thinks theres a zombie in his closet  his parents are fighting all the timethis movie is slower than a soap opera and suddenly jake decides to become rambo and kill the zombieok first of all when youre going to make a film you must decide if its a thriller or a drama as a drama the movie is watchable parents are divorcing  arguing like in real life and then we have jake with his closet which totally ruins all the film i expected to see a boogeyman similar movie and instead i watched a drama with some meaningless thriller spots3 out of 10 just for the well playing parents  descent dialogs as for the shots with jake just ignore them"
remove_stopwords(text)

'basically theres family little boy jake thinks theres zombie closet parents fighting timethis movie slower soap opera suddenly jake decides become rambo kill zombieok first youre going make film must decide thriller drama drama movie watchable parents divorcing arguing like real life jake closet totally ruins film expected see boogeyman similar movie instead watched drama meaningless thriller spots3 10 well playing parents descent dialogs shots jake ignore'

In [68]:
# df['review'].apply(remove_stopwords)
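The NLTK list includes negations such as `no`, `nor`, and `not`, and dropping them can flip the meaning of a review ("not good" becomes "good"). For sentiment tasks it may help to keep them. The sketch below uses a tiny hand-written stop word set so that it runs without NLTK data; in the notebook you would start from `stopwords.words('english')` instead. All names in it are hypothetical.

```python
# A tiny hand-written stop word set, used only to keep the sketch self-contained.
sample_stopwords = {"this", "is", "a", "not", "no", "the", "was", "at", "all"}
negations = {"not", "no", "nor"}

# Take the negations out of the set of words to drop.
stopwords_to_drop = sample_stopwords - negations

def remove_stopwords_keep_negations(text):
    return " ".join(w for w in text.split() if w not in stopwords_to_drop)

print(remove_stopwords_keep_negations("this movie was not good at all"))
# movie not good
```

Whether keeping negations improves a model depends on the task and the features, so treat this as an option to evaluate rather than a rule.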

## 8. Handling emojis 🙂
### 1. Remove Emojis
***Regular expressions compiled with re.compile() can use Unicode ranges to detect and remove emojis or symbols during text preprocessing. Each range covers a specific emoji/symbol block. Unicode is a universal character encoding standard that assigns a code point to characters from the world's writing systems, including symbols, emojis, and special characters.***
##### \U0001F600-\U0001F64F: Emoticons: 😀😁😂😃😄😅😆
##### \U0001F300-\U0001F5FF: Misc Symbols & Pictographs: 🌍🌄🌊🎉📱🔔
##### \U0001F680-\U0001F6FF: Transport & Map Symbols: 🚀🚗🚕🚓🚑🚒



In [64]:
import re

def remove_emojis(text):
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # Emoticons
        "\U0001F300-\U0001F5FF"  # Misc symbols & pictographs
        "\U0001F680-\U0001F6FF"  # Transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # Flags (regional indicators)
        "\U00002500-\U00002BEF"  # Box drawing, geometric shapes, arrows, other symbols
        "\U00002702-\U000027B0"  # Dingbats
        "\U000024C2-\U0001F251"  # Very broad: also matches CJK, Hangul and other non-emoji text
        "\U0001f926-\U0001f937"  # Face palm through shrug
        "\U00010000-\U0010ffff"  # All supplementary planes
        "\u200d"                 # Zero-width joiner
        "\u2640-\u2642"          # Gender symbols
        "\u2600-\u2B55"          # Misc symbols
        "\u23cf"                 # Eject symbol
        "\u23e9"                 # Fast-forward
        "\u231a"                 # Watch
        "\ufe0f"                 # Variation selector-16 (emoji presentation)
        "\u3030"                 # Wavy dash
        "]+", flags=re.UNICODE)

    # The broad ranges are harmless for English reviews but would also strip
    # non-Latin scripts, so narrow them before using this on multilingual text.
    return emoji_pattern.sub('', text)

In [65]:
text1= "Handling emojis in NLP (Natural Language Processing)🌍 preprocessing is important because emojis can carry significant emotional😒, contextual🚗, or semantic🌊 meaning."
text2 = "Consistency is a pleasure🙂 but😒 procrustrination is ...🤸‍♂️"
print(remove_emojis(text1))
print(remove_emojis(text2))


Handling emojis in NLP (Natural Language Processing) preprocessing is important because emojis can carry significant emotional, contextual, or semantic meaning.
Consistency is a pleasure but procrustrination is ...


In [69]:
# Remove emojis from the whole corpus (review column)
df['review'].apply(remove_emojis)

0        one of the other reviewers has mentioned that ...
1        a wonderful little production the filming tech...
2        i thought this was a wonderful way to spend ti...
3        basically theres a family where a little boy j...
4        petter matteis love in the time of money is a ...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot bad dialogue bad acting idiotic direc...
49997    i am a catholic taught in parochial elementary...
49998    im going to have to disagree with the previous...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

### 2. Convert Emojis to Text (Descriptive Labels)
***When to use: sentiment analysis, emotion detection, or social media mining.***

***How: use a library such as emoji in Python.***

In [None]:
%pip install emoji

Note: you may need to restart the kernel to use updated packages.





In [None]:
import emoji

def emoji_to_text(text):
    # Replace each emoji with its :name: label
    return emoji.demojize(text, language='en')

In [None]:
text = "I Love❤️ python programming.🐍"
print(emoji_to_text(text))

I Love:red_heart: python programming.:snake:


### What is emoji.demojize()?
***The demojize() function from the Python emoji library converts emojis into their textual description (often called a "CLDR short name" or "alias").***

In [None]:
import emoji

text = "I love NLP ❤️ and coffee ☕!"
converted = emoji.demojize(text, language='en')
print(converted)

I love NLP :red_heart: and coffee :hot_beverage:!
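The `:red_heart:` labels are still single tokens with colons and underscores, which a later punctuation-removal step would mangle into `redheart`. One possible follow-up, sketched below, turns each label into plain words. It works on the already demojized string, so it needs only `re`; the name `demojized_to_words` is hypothetical, and the pattern assumes labels made of lowercase letters, digits, underscores, and hyphens.

```python
import re

def demojized_to_words(text):
    # Turn ":red_heart:" into " red heart " so each emoji name becomes plain words.
    def to_words(match):
        return ' ' + match.group(1).replace('_', ' ') + ' '
    spaced = re.sub(r':([a-z0-9_\-]+):', to_words, text)
    # Collapse the extra spaces introduced around each label.
    return ' '.join(spaced.split())

print(demojized_to_words("I love NLP :red_heart: and coffee :hot_beverage:!"))
# I love NLP red heart and coffee hot beverage !
```

The pattern is deliberately simple, so text that contains other colon-delimited fragments, such as a time like 10:30:45, can be matched by mistake.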


### 3. Keep Emojis as Tokens
***When to use: If your model can learn from emojis (e.g., deep learning models with embeddings).***

In [None]:
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize("I love it ❤️")
print(tokens)


['I', 'love', 'it', '❤', '️']
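The last token in the output above is the invisible variation selector U+FE0F, which only requests emoji-style rendering and was split off from the heart. If that extra token is unwanted, it can be stripped before tokenizing. The sketch below uses plain string methods so that it runs without NLTK; `strip_variation_selectors` is a hypothetical helper name.

```python
VARIATION_SELECTOR_16 = "\ufe0f"

def strip_variation_selectors(text):
    # U+FE0F carries no meaning of its own, so dropping it keeps the emoji intact.
    return text.replace(VARIATION_SELECTOR_16, "")

text = "I love it \u2764\ufe0f"
cleaned = strip_variation_selectors(text)
print(len(text), len(cleaned))   # 12 11
print(cleaned.split())           # ['I', 'love', 'it', '❤']
```

Passing `cleaned` to `tokenizer.tokenize` instead of the raw text should then avoid the stray selector token.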


# <p style="font-family:newtimeroman; font-size:150%; text-align:center; color:#4287f5">The END </p>