<a href="https://colab.research.google.com/github/Nawapon19/NLP-Practice/blob/main/Efficient_Text_Data_Cleaning_Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Efficient Text Data Cleaning**

Learn how to work with unstructured data to be able to extract relevant information from it and make it useful.

**Steps for Data Cleaning**

**1. Clear out HTML characters:** HTML entities such as &apos;, &amp;, and &lt; appear in much of the text collected from the web. There are two ways to remove them from the data:

* use specific regular expressions, or
* use an existing module or package (for example, Python's built-in html module).

In [72]:
# escape out html characters
import html

tweet = "I enjoyd the event which took place yesteday & I lovdddd itttt ! The link to the show is http://t.co/4ftYom0i It's awesome you'll luv it #HadFun #Enjoyed BFN GN"

# convert special characters to HTML Entities
tweet = html.escape(tweet)
print("Before removing HTML characters the tweet is: \n", tweet)

# unescape HTML Entities
tweet = html.unescape(tweet)
print("\nAfter removing HTML characters the tweet is: -\n{}".format(tweet))

Before removing HTML characters the tweet is: 
 I enjoyd the event which took place yesteday &amp; I lovdddd itttt ! The link to the show is http://t.co/4ftYom0i It&#x27;s awesome you&#x27;ll luv it #HadFun #Enjoyed BFN GN

After removing HTML characters the tweet is: -
I enjoyd the event which took place yesteday & I lovdddd itttt ! The link to the show is http://t.co/4ftYom0i It's awesome you'll luv it #HadFun #Enjoyed BFN GN
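
The cell above uses the html module, which is the second option. The first option, regular expressions, can be sketched as follows. This is a minimal illustration rather than a complete decoder: `ENTITY_MAP` and `unescape_with_regex` are hypothetical names introduced here, and only five common named entities are covered.

```python
import re

# Hypothetical lookup table: only a handful of named entities are handled.
ENTITY_MAP = {
    "&amp;": "&",
    "&lt;": "<",
    "&gt;": ">",
    "&apos;": "'",
    "&quot;": '"',
}

# One alternation pattern that matches any entity listed in the table.
ENTITY_RE = re.compile("|".join(re.escape(entity) for entity in ENTITY_MAP))


def unescape_with_regex(text):
    """Replace each known entity with its character in a single pass."""
    return ENTITY_RE.sub(lambda match: ENTITY_MAP[match.group(0)], text)


sample = "Tom &amp; Jerry said it&apos;s &lt;fun&gt;"
print(unescape_with_regex(sample))  # Tom & Jerry said it's <fun>
```

For real data, html.unescape is usually the safer choice because it also understands numeric references such as &#x27; and the full set of named entities.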


**2. Encoding & Decoding Data:** Encoding converts text (a sequence of characters) into bytes, and decoding converts bytes back into text. Several encodings are available for text data, such as "UTF-8" and "ASCII". The most common format is UTF-8.

In [73]:
# encode the string to ASCII bytes, dropping any non-ASCII characters
encode_tweet = tweet.encode('ascii', 'ignore')
print("encode_tweet = \n{}".format(encode_tweet))

# decode the bytes back to a string (ASCII bytes are also valid UTF-8)
decode_tweet = encode_tweet.decode(encoding = 'UTF-8')
print("\ndecode_tweet = \n{}".format(decode_tweet))

encode_tweet = 
b"I enjoyd the event which took place yesteday & I lovdddd itttt ! The link to the show is http://t.co/4ftYom0i It's awesome you'll luv it #HadFun #Enjoyed BFN GN"

decode_tweet = 
I enjoyd the event which took place yesteday & I lovdddd itttt ! The link to the show is http://t.co/4ftYom0i It's awesome you'll luv it #HadFun #Enjoyed BFN GN
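
The sample tweet contains only ASCII characters, so the round trip above leaves it unchanged. The sketch below uses a made-up string (an assumption, not data from this notebook) to show what the 'ignore' error handler does to non-ASCII characters, and how Unicode normalization with the standard unicodedata module can preserve the base letters of accented characters.

```python
import unicodedata

# Made-up example text: "café", a hot-beverage symbol, and "naïve".
text = "caf\u00e9 \u2615 na\u00efve"

# 'ignore' silently drops every character that cannot be encoded as ASCII.
dropped = text.encode("ascii", "ignore").decode("ascii")
print(dropped)  # caf  nave

# NFKD normalization splits an accented letter into a base letter plus a
# combining mark, so the base letter survives and only the mark is dropped.
normalized = unicodedata.normalize("NFKD", text)
kept = normalized.encode("ascii", "ignore").decode("ascii")
print(kept)  # cafe  naive
```

Whether dropping characters is acceptable depends on the task; for multilingual text it is often better to keep the data in UTF-8.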


**3. Removing URLs, Hashtags and Styles:** URLs, hash signs, and retweet markers usually carry little information for text analysis and can be removed. For hashtags, only the hash sign ‘#’ is removed; the word itself is kept.
* use the re library to perform regular expression operations.

In [74]:
# import re library
import re

print("Before removing Hashtags, URLs and Styles the tweet is: \n{}".format(tweet))

# remove hyperlinks
tweet = re.sub(r'https?://\S+', "", tweet)

# remove hashtags
# only removing the hash # sign from the word
tweet = re.sub(r'#', '', tweet)

# remove old style retweet text "RT"
tweet = re.sub(r'^RT[\s]+', '', tweet)

print("\nAfter removing Hashtags, URLs and Styles the tweet is: \n{}".format(tweet))

Before removing Hashtags, URLs and Styles the tweet is: 
I enjoyd the event which took place yesteday & I lovdddd itttt ! The link to the show is http://t.co/4ftYom0i It's awesome you'll luv it #HadFun #Enjoyed BFN GN

After removing Hashtags, URLs and Styles the tweet is: 
I enjoyd the event which took place yesteday & I lovdddd itttt ! The link to the show is  It's awesome you'll luv it HadFun Enjoyed BFN GN
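
The three substitutions can also be wrapped in one reusable function. The sketch below is one possible arrangement: the function name `strip_tweet_markup` is hypothetical, and the final whitespace-collapsing step is an addition that the cell above does not perform.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # http or https links
RETWEET_RE = re.compile(r"^RT\s+")     # old-style "RT " prefix at the start


def strip_tweet_markup(text):
    """Remove URLs, hash signs, and a leading RT marker from a tweet."""
    text = URL_RE.sub("", text)
    text = text.replace("#", "")       # keep the hashtag word, drop the sign
    text = RETWEET_RE.sub("", text)
    # Assumption: collapse the gaps left behind by the removed pieces.
    return " ".join(text.split())


print(strip_tweet_markup("RT Loved it http://t.co/abc #HadFun"))
# Loved it HadFun
```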


**4. Contraction Replacement:** Text data often contains apostrophes used in contractions, for example “didn’t” for “did not”. Leaving contractions in place can obscure the meaning of a word or sentence. To expand them into their standard forms:
* create a mapping dictionary from each contraction suffix to the text that replaces it, and apply it to the tweet.

In [75]:
# dictionary consisting of the contraction and the actual value
Apos_dict = {"'s":" is","n't":" not","'m":" am","'ll":" will",
           "'d":" would","'ve":" have","'re":" are"}

print("Before Contraction replacement the tweet is: \n{}".format(tweet))

# replace the contractions
# note: "'s" is always expanded to " is", so possessives such as "John's" are rewritten too
for key, value in Apos_dict.items():
  tweet = tweet.replace(key, value)

print("\nAfter Contraction replacement the tweet is: \n{}".format(tweet))

Before Contraction replacement the tweet is: 
I enjoyd the event which took place yesteday & I lovdddd itttt ! The link to the show is  It's awesome you'll luv it HadFun Enjoyed BFN GN

After Contraction replacement the tweet is: 
I enjoyd the event which took place yesteday & I lovdddd itttt ! The link to the show is  It is awesome you will luv it HadFun Enjoyed BFN GN
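
Suffix replacement is simple but has edge cases: it rewrites possessives, and irregular forms such as "won't" become "wo not". A possible alternative, sketched below, maps whole contractions and matches them on word boundaries. The `CONTRACTIONS` table and the `expand_contractions` function are hypothetical and deliberately small; a real table would need many more entries.

```python
import re

# Hypothetical, deliberately small table of whole-word contractions.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "didn't": "did not",
    "it's": "it is",
    "you'll": "you will",
    "i'm": "i am",
}

# Match any listed contraction as a whole word, regardless of case.
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text):
    """Expand known contractions; replacements are always lower case."""
    return CONTRACTION_RE.sub(
        lambda match: CONTRACTIONS[match.group(0).lower()], text
    )


print(expand_contractions("i won't go, it's john's car"))
# i will not go, it is john's car
```

Because only listed words are replaced, the possessive "john's" is left untouched.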


**5. Split attached words:** Some words are joined together, for example “ForTheWin”. They need to be separated so that the meaning of each word can be extracted. After splitting, it becomes “For The Win”.

In [76]:
import re

print("Before splitting attached words the tweet is: \n{}".format(tweet))

# separate the words
tweet = " ".join([s for s in re.split("([A-Z][a-z]+[^A-Z]*)", tweet) if s])

print("\nAfter splitting attached words the tweet is: \n{}".format(tweet))

Before splitting attached words the tweet is: 
I enjoyd the event which took place yesteday & I lovdddd itttt ! The link to the show is  It is awesome you will luv it HadFun Enjoyed BFN GN

After splitting attached words the tweet is: 
I enjoyd the event which took place yesteday & I lovdddd itttt !  The link to the show is   It is awesome you will luv it  Had Fun  Enjoyed  BFN GN
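
The split-and-join approach also introduces extra spaces, as the output above shows. A different technique, sketched here as an alternative rather than as the method used in this notebook, inserts a space only at each boundary between a lowercase letter and an uppercase letter. The function name `split_attached` is hypothetical.

```python
import re

# Zero-width position between a lowercase letter and an uppercase letter.
CAMEL_BOUNDARY_RE = re.compile(r"(?<=[a-z])(?=[A-Z])")


def split_attached(text):
    """Insert a space at every lowercase-to-uppercase boundary."""
    return CAMEL_BOUNDARY_RE.sub(" ", text)


print(split_attached("ForTheWin"))              # For The Win
print(split_attached("HadFun Enjoyed BFN GN"))  # Had Fun Enjoyed BFN GN
```

All-uppercase tokens such as "BFN" contain no such boundary, so they stay intact for the slang lookup in step 7.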


**6. Convert to lower case:** Convert the text to lower case to avoid case-sensitivity issues, so that, for example, “Enjoyed” and “enjoyed” are treated as the same word.

In [77]:
print("Before converting to lowercase the tweet is: \n{}".format(tweet))

# convert to lowercase
tweet = tweet.lower()

print("\nAfter converting to lowercase the tweet is: \n{}".format(tweet))

Before converting to lowercase the tweet is: 
I enjoyd the event which took place yesteday & I lovdddd itttt !  The link to the show is   It is awesome you will luv it  Had Fun  Enjoyed  BFN GN

After converting to lowercase the tweet is: 
i enjoyd the event which took place yesteday & i lovdddd itttt !  the link to the show is   it is awesome you will luv it  had fun  enjoyed  bfn gn
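
For English text, lower() is normally enough. When the goal is case-insensitive comparison of text that may contain other languages, Python also offers str.casefold(), which applies more aggressive folding. The words below are illustrative examples, not data from this notebook.

```python
# German "Straße" can also be written "STRASSE" in capitals.
word_a = "Stra\u00dfe"
word_b = "STRASSE"

# lower() keeps the sharp s, so the two spellings still differ.
print(word_a.lower() == word_b.lower())        # False

# casefold() folds the sharp s to "ss", so the spellings compare equal.
print(word_a.casefold() == word_b.casefold())  # True
```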


**7. Slang lookup:** Slang words and abbreviations are common in text data. To replace them with their meanings:
* use a dictionary of slang words
* or create a file consisting of the slang words

Examples of slang words are:
* asap --> as soon as possible
* b4   --> before
* lol  --> laugh out loud
* luv  --> love
* wtg  --> way to go
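
The dictionary option can be sketched in a few lines. The `SLANG` table below simply reuses the five examples listed above, and the function name `replace_slang` is hypothetical.

```python
# Small in-memory slang table built from the examples above.
SLANG = {
    "asap": "as soon as possible",
    "b4": "before",
    "lol": "laugh out loud",
    "luv": "love",
    "wtg": "way to go",
}


def replace_slang(text, slang=SLANG):
    """Replace each whitespace-separated token that is a known slang word."""
    tokens = text.lower().split()
    return " ".join(slang.get(token, token) for token in tokens)


print(replace_slang("luv it see you b4 lunch"))
# love it see you before lunch
```

A dictionary lookup takes constant time per token, which matters once the slang list grows beyond a few entries.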

In [78]:
# inspect the file for slang words
# the context manager closes the file automatically
with open("slang.txt", "r") as file:
  print(file.read())

AFAIK=As Far As I Know
AFK=Away From Keyboard
ASAP=As Soon As Possible
ATK=At The Keyboard
ATM=At The Moment
A3=Anytime, Anywhere, Anyplace
BAK=Back At Keyboard
BBL=Be Back Later
BBS=Be Back Soon
BFN=Bye For Now
B4N=Bye For Now
BRB=Be Right Back
BRT=Be Right There
BTW=By The Way
B4=Before
B4N=Bye For Now
CU=See You
CUL8R=See You Later
CYA=See You
FAQ=Frequently Asked Questions
FC=Fingers Crossed
FWIW=For What It's Worth
FYI=For Your Information
GAL=Get A Life
GG=Good Game
GN=Good Night
GMTA=Great Minds Think Alike
GR8=Great!
G9=Genius
IC=I See
ICQ=I Seek you (also a chat program)
ILU=ILU: I Love You
IMHO=In My Honest/Humble Opinion
IMO=In My Opinion
IOW=In Other Words
IRL=In Real Life
KISS=Keep It Simple, Stupid
LDR=Long Distance Relationship
LMAO=Laugh My A.. Off
LOL=Laughing Out Loud
LTNS=Long Time No See
L8R=Later
MTE=My Thoughts Exactly
M8=Mate
NRN=No Reply Necessary
OIC=Oh I See
PITA=Pain In The A..
PRT=Party
PRW=Parents Are Watching
QPSA?	Que Pasa?
ROFL=Rolling On The Floor Laugh

In [79]:
print("Before slang replacement the tweet is: \n{}".format(tweet))

# read the file slang.txt; the context manager closes it automatically
with open("slang.txt", "r") as file:
  slang = file.read().lower()

# normalize en-dash separators to '=' and separate each line in the file
slang = slang.replace('–', '=').split('\n')

# map each slang word to its meaning (the first definition wins)
slang_dict = {}
for line in slang:
  temp = line.split('=')
  slang_dict.setdefault(temp[0], temp[-1])

# tokenize tweet
tweet_tokens = tweet.split()

# replace the slang words with their meanings
tweet_tokens = [slang_dict.get(word, word) for word in tweet_tokens]

tweet = " ".join(tweet_tokens)
print("\nAfter slang replacement the tweet is: \n{}".format(tweet))

Before slang replacement the tweet is: 
i enjoyd the event which took place yesteday & i lovdddd itttt !  the link to the show is   it is awesome you will luv it  had fun  enjoyed  bfn gn

After slang replacement the tweet is: 
i enjoyd the event which took place yesteday & i lovdddd itttt ! the link to the show is it is awesome you will luv it had fun enjoyed bye for now good night


**8. Standardizing and Spell Check:** The text might contain spelling errors or non-standard forms, for example “drivng” for “driving” or “I misssss this” for “I miss this”.
* correct these by using the autocorrect library for Python

In [80]:
# install auto correct library
!pip install autocorrect



In [81]:
# show how itertools.groupby() groups runs of consecutive identical characters
import itertools
for key, group in itertools.groupby(tweet):
  print(key + ": ", list(group))

i:  ['i']
 :  [' ']
e:  ['e']
n:  ['n']
j:  ['j']
o:  ['o']
y:  ['y']
d:  ['d']
 :  [' ']
...
t:  ['t']
o:  ['o', 'o']
k:  ['k']
...
l:  ['l']
o:  ['o']
v:  ['v']
d:  ['d', 'd', 'd', 'd']
 :  [' ']
i:  ['i']
t:  ['t', 't', 't', 't']
 :  [' ']
...

In [82]:
print("Before standardizing and spell check the tweet is: \n{}".format(tweet))

# a character should not appear more than twice in a row
tweet = ''.join(''.join(s)[0:2] for _, s in itertools.groupby(tweet))
print("\nAfter standardizing the tweet is: \n{}".format(tweet))

# spell check
from autocorrect import Speller
spell = Speller(lang = 'en')
tweet = spell(tweet)

print("\nAfter spell check the tweet is: \n{}".format(tweet))

Before standardizing and spell check the tweet is: 
i enjoyd the event which took place yesteday & i lovdddd itttt ! the link to the show is it is awesome you will luv it had fun enjoyed bye for now good night

After standardizing the tweet is: 
i enjoyd the event which took place yesteday & i lovdd itt ! the link to the show is it is awesome you will luv it had fun enjoyed bye for now good night

After spell check the tweet is: 
i enjoyed the event which took place yesterday & i loved itt ! the link to the show is it is awesome you will luv it had fun enjoyed bye for now good night
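
The same standardizing step can also be written with a regular expression and a backreference. The sketch below is an alternative to the groupby expression, not the notebook's own code; the function name `squeeze_repeats` is hypothetical, and the pattern assumes single-line text because `.` does not match a newline.

```python
import itertools
import re

# A character followed by two or more copies of itself (three or more total).
REPEAT_RE = re.compile(r"(.)\1{2,}")


def squeeze_repeats(text):
    """Shorten any run of three or more identical characters to two."""
    return REPEAT_RE.sub(r"\1\1", text)


sample = "i lovdddd itttt ! i misssss this"
print(squeeze_repeats(sample))  # i lovdd itt ! i miss this

# For single-line text this matches the groupby version used above.
groupby_version = "".join("".join(run)[0:2] for _, run in itertools.groupby(sample))
print(squeeze_repeats(sample) == groupby_version)  # True
```

Keeping two characters rather than one preserves legitimate double letters such as the "ss" in "miss", and the spell checker can then handle leftovers such as "lovdd".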


**9. Remove Stopwords:** Stop words are words that occur frequently in text but add little meaning to it.
* use the nltk library, which provides modules for pre-processing data, including a list of stop words

In [83]:
# install nltk library
!pip install nltk



In [84]:
import nltk
# download stopwords from nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [85]:
print("Before removing stopwords tweet is: \n{}".format(tweet))

# import stopwords module
from nltk.corpus import stopwords

# load english stopwords list from nltk stopwords
stopwords_eng = stopwords.words('english')

# tokenize tweet
tweet_tokens = tweet.split()
tweet_list = []

# remove stopwords
for word in tweet_tokens:
  if word not in stopwords_eng:
    tweet_list.append(word)

print("\ntweet_list = {}".format(tweet_list))

Before removing stopwords tweet is: 
i enjoyed the event which took place yesterday & i loved itt ! the link to the show is it is awesome you will luv it had fun enjoyed bye for now good night

tweet_list = ['enjoyed', 'event', 'took', 'place', 'yesterday', '&', 'loved', 'itt', '!', 'link', 'show', 'awesome', 'luv', 'fun', 'enjoyed', 'bye', 'good', 'night']
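
The loop above tests membership in a list, which scans the list for every token. Converting the stop words to a set makes each lookup constant time. The sketch below uses a small hand-written `STOPWORDS` set (an assumption, so that it runs without downloading anything); with NLTK the set could be built as set(stopwords.words('english')). The function name `remove_stopwords` is hypothetical.

```python
# Hand-written stand-in for the NLTK list, for illustration only.
STOPWORDS = frozenset(
    {"i", "the", "is", "it", "to", "you", "will", "had", "for", "now", "which"}
)


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not stop words."""
    return [token for token in tokens if token not in stopwords]


tokens = "i enjoyed the event which took place yesterday".split()
print(remove_stopwords(tokens))
# ['enjoyed', 'event', 'took', 'place', 'yesterday']
```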


**10. Remove Punctuation:** Punctuation consists of characters such as !, <, @, #, &, and $.

In [86]:
print("tweet_list before cleaning = {}".format(tweet_list))

import string
clean_tweet = []

# remove punctuation
# note: this drops only tokens that are themselves a punctuation character
for word in tweet_list:
  if word not in string.punctuation:
    clean_tweet.append(word)

print("\nclean_tweet = {}".format(clean_tweet))

tweet_list before cleaning = ['enjoyed', 'event', 'took', 'place', 'yesterday', '&', 'loved', 'itt', '!', 'link', 'show', 'awesome', 'luv', 'fun', 'enjoyed', 'bye', 'good', 'night']

clean_tweet = ['enjoyed', 'event', 'took', 'place', 'yesterday', 'loved', 'itt', 'link', 'show', 'awesome', 'luv', 'fun', 'enjoyed', 'bye', 'good', 'night']
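
The loop above removes tokens that consist of a single punctuation character, which works here because the tweet's punctuation was already separated by spaces. When punctuation is attached to words, as in "itt!", a translation table built with str.maketrans can strip it from inside each token. The sketch below shows that variant; the function name `strip_punctuation` is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(tokens):
    """Delete punctuation inside each token and drop tokens left empty."""
    stripped = (token.translate(PUNCT_TABLE) for token in tokens)
    return [token for token in stripped if token]


print(strip_punctuation(["loved", "itt!", "&", "!!", "show,"]))
# ['loved', 'itt', 'show']
```

This variant also removes apostrophes and hash signs, so it fits best after the contraction and hashtag steps have already run.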
