# Steps for effective text data cleaning


The days when data arrived in neatly tabulated spreadsheets are truly behind us. A moment of silence for the data residing in the spreadsheet pockets. Today, more than 80% of data is unstructured: it is either locked in data silos or scattered around digital archives. Data is being produced as we speak, from every conversation we have on social media to every piece of content generated by news sources. To produce any meaningful, actionable insight from data, it is important to know how to work with it in its unstructured form. As a Data Scientist at one of the fastest growing Decision Sciences firms, my bread and butter comes from deriving meaningful insights from unstructured text information.

In [1]:
# Suppose this is one of the customer opinions related to iPhone

text = "I luv my &lt;3 iphone &amp; you're awsm apple. DisplayIsAwesome, sooo happppppy 🙂 http://www.apple.com"

### Escaping HTML characters


Data obtained from the web usually contains HTML entities like &lt; &gt; &amp;, which get embedded in the original text. It is thus necessary to get rid of these entities. One approach is to remove them directly with specific regular expressions. Another approach is to use an appropriate package or module (for example, Python's HTMLParser), which converts these entities back to the characters they stand for.

In [2]:
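import HTMLParser  # Python 2 module; on Python 3.4+ use html.unescape(text) instead
import sys

# Python 2 workaround: make UTF-8 the default encoding so the non-ASCII emoji
# in the text does not raise a Unicode error. reload(sys) can reset sys.stdout
# in a notebook, so it is saved first and restored afterwards.
stdout = sys.stdout
reload(sys)
sys.setdefaultencoding('utf-8')
sys.stdout = stdout

# For example: &lt; is converted to "<" and &amp; is converted to "&".
html_parser = HTMLParser.HTMLParser()
tweet = html_parser.unescape(text)
tweet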

u"I luv my <3 iphone & you're awsm apple. DisplayIsAwesome, sooo happppppy \U0001f642 http://www.apple.com"


### Decoding data 

Decoding is the process of transforming raw encoded bytes into simple, readable characters. Text data may arrive in different encodings such as Latin-1 or UTF-8. For better analysis, it is therefore necessary to keep the complete data in one standard encoding; UTF-8 is widely accepted and recommended. The cell below goes one step further and drops every character that has no ASCII equivalent, which is why the emoji disappears from the output.
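
As a minimal sketch of normalizing encodings (written for Python 3; the sample bytes and the `to_utf8` helper are hypothetical illustrations, not part of the original notebook), bytes that arrive in a known legacy encoding can be decoded to text and re-encoded as UTF-8:

```python
def to_utf8(raw_bytes, source_encoding):
    """Decode bytes from a known source encoding and re-encode them as UTF-8."""
    text = raw_bytes.decode(source_encoding)
    return text.encode("utf-8")

# "café" encoded as Latin-1: the é is the single byte 0xE9.
latin1_bytes = b"caf\xe9"
utf8_bytes = to_utf8(latin1_bytes, "latin-1")

print(utf8_bytes)                  # b'caf\xc3\xa9'
print(utf8_bytes.decode("utf-8"))  # café
```

This only works when the source encoding is actually known; guessing the wrong one silently produces garbled characters.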

In [3]:
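# Drop every character that cannot be represented in ASCII (here, the emoji).
# On Python 3, append .decode('ascii') to get a str back instead of bytes.
tweet = tweet.encode('ascii', 'ignore')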
tweet

"I luv my <3 iphone & you're awsm apple. DisplayIsAwesome, sooo happppppy  http://www.apple.com"

### Apostrophe Lookup

To avoid ambiguity in word sense, it is recommended to maintain proper structure in the text and to abide by the rules of context-free grammar. When apostrophes are used, the chances of ambiguity increase. For example, "it's" can stand for either "it is" or "it has".

In [4]:
# Need a huge dictionary
APOSTROPHES = {"he's": "he is", "you're": "you are", "I've": "I have"}

words = tweet.split()
# Expand each contraction found in the lookup; leave other words unchanged.
reformed_words = [APOSTROPHES.get(word, word) for word in words]
reformed_tweet = " ".join(reformed_words)
reformed_tweet

'I luv my <3 iphone & you are awsm apple. DisplayIsAwesome, sooo happppppy http://www.apple.com'

### Split Attached Words


The text we humans generate on social forums is completely informal in nature. Many tweets contain attached words like RainyDay or PlayingInTheCold. These can be split into their normal forms using simple rules and regular expressions.


In [5]:
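import re

# Insert a space wherever a lowercase letter is directly followed by an
# uppercase one. Unlike matching on capitalized chunks, this keeps any text
# that appears before the first capital letter.
cleaned = re.sub(r'([a-z])([A-Z])', r'\1 \2', reformed_tweet)
cleaned

'I luv my <3 iphone & you are awsm apple. Display Is Awesome, sooo happppppy http://www.apple.com'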

### Slangs Lookup


Again, social media text contains a great deal of slang. These words should be transformed into standard words to produce clean free text: luv is converted to love, Helo to Hello. The same lookup approach used for apostrophes can convert slang to standard words. A number of sources on the web provide lists of slang terms; these would be your holy grail, and you could use them as lookup dictionaries for conversion purposes.

In [6]:
# Need a huge dictionary
SLANGS = {"luv" : "love", "Helo" : "Hello", "awsm" : "awesome"}

words = cleaned.split()
reformed_words = [SLANGS[word] if word in SLANGS else word for word in words]
reformed_tweet = " ".join(reformed_words)
reformed_tweet

'I love my <3 iphone & you are awesome apple. Display Is Awesome, sooo happppppy http://www.apple.com'
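
The exact-match lookups above miss tokens that differ in case or carry trailing punctuation (for example "Luv" or "awsm!"). One possible refinement, sketched here for Python 3, matches words with a regular expression and looks them up in lowercase. The `LOOKUP` table, the `TOKEN` pattern and the `normalize_tokens` helper are hypothetical names introduced only for this illustration:

```python
import re

# Illustrative lookup table; a real one would be far larger.
LOOKUP = {"luv": "love", "awsm": "awesome", "you're": "you are", "he's": "he is"}

# A token is a run of letters, optionally containing internal apostrophes.
TOKEN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def normalize_tokens(text, lookup=LOOKUP):
    """Replace each token found in the lookup table, ignoring case."""
    def replace(match):
        word = match.group(0)
        return lookup.get(word.lower(), word)
    return TOKEN.sub(replace, text)

print(normalize_tokens("Luv it, you're awsm!"))
# love it, you are awesome!
```

Note that a replaced word takes the casing of the dictionary entry, so "Luv" comes back as "love"; whether that matters depends on the later analysis.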

### Removal of URLs

URLs and hyperlinks in text data like comments, reviews, and tweets should be removed.

In [7]:
URLless_tweet = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', reformed_tweet)
URLless_tweet

'I love my <3 iphone & you are awesome apple. Display Is Awesome, sooo happppppy '

### Standardizing words


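Sometimes words are not in a proper format. For example, “I looooveee you” should be “I love you”. Simple rules and regular expressions can help in these cases. The rule below caps every run of a repeated character at two, so it shortens elongated words but does not always restore the exact spelling (“sooo” becomes “soo”).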

In [8]:
import itertools

tweet = ''.join(''.join(s)[:2] for _, s in itertools.groupby(URLless_tweet))
tweet

'I love my <3 iphone & you are awesome apple. Display Is Awesome, soo happy '

## Other cleaning steps

### Removal of Stop-words


When the analysis is driven at the word level, commonly occurring words (stop-words) should be removed. One can either create a custom list of stop-words or use the predefined, language-specific lists that ship with existing libraries.
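
A minimal sketch of the hand-made list approach follows (Python 3). The `STOP_WORDS` set is a tiny illustrative sample, and `remove_stop_words` is a hypothetical helper; a predefined list such as the NLTK stop-word corpus could be substituted for the set:

```python
# A tiny, hand-picked stop-word list for illustration only.
STOP_WORDS = {"i", "my", "you", "are", "is", "a", "an", "the", "and"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop whitespace-separated tokens whose lowercase form is a stop-word."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)

print(remove_stop_words("I love my iphone and you are awesome"))
# love iphone awesome
```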


### Removal of Punctuations


Punctuation marks should be dealt with according to their priority. For example, “.”, “,” and “?” are important punctuation marks that should be retained, while others can be removed.
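
One way to sketch this rule in Python 3 is a translation table that deletes every ASCII punctuation character except the ones to keep. The `KEEP` string, `DROP_TABLE` and `strip_punctuation` are names assumed for this example only:

```python
import string

KEEP = ".,?"  # punctuation marks to retain
# Map every other ASCII punctuation character to None, which deletes it.
DROP_TABLE = {ord(ch): None for ch in string.punctuation if ch not in KEEP}

def strip_punctuation(text):
    """Remove all ASCII punctuation except the characters listed in KEEP."""
    return text.translate(DROP_TABLE)

print(strip_punctuation("Wow!!! Is this (really) awesome? Yes, it is."))
# Wow Is this really awesome? Yes, it is.
```

Apostrophes are removed as well under this rule, so contractions are best expanded before this step.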


### Removal of Expressions 


Textual data (usually speech transcripts) may contain annotations of human expressions like [laughing], [Crying], [Audience paused]. These annotations are usually not relevant to the content of the speech and hence need to be removed. A simple regular expression is useful in this case.
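
A sketch of such a regular expression is shown below (Python 3). It assumes annotations are always wrapped in square brackets and never nested; the `ANNOTATION` pattern, the `remove_annotations` helper and the sample sentence are made up for illustration:

```python
import re

# Matches a bracketed annotation such as [laughing], plus any whitespace before it.
ANNOTATION = re.compile(r"\s*\[[^\[\]]*\]")

def remove_annotations(transcript):
    """Delete bracketed annotations and trim leftover leading/trailing spaces."""
    return ANNOTATION.sub("", transcript).strip()

print(remove_annotations("Thank you all [Audience paused] for coming [laughing] tonight."))
# Thank you all for coming tonight.
```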

## Advanced data cleaning
### Grammar checking

Grammar checking is largely learning-based: models are trained on large amounts of well-formed text and then used for grammar correction. Many online tools are available for this purpose.

### Spelling correction

Natural language text often contains misspelled words. Companies like Google and Microsoft have achieved a decent level of accuracy in automated spelling correction. One can use algorithms like Levenshtein distance or dictionary lookup, or other modules and packages, to fix these errors.
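
To make the Levenshtein idea concrete, here is a minimal sketch (Python 3) that combines an edit-distance function with a dictionary lookup. The `VOCABULARY` list, the `correct` helper and the `max_distance` threshold are assumptions chosen for this illustration; production spell checkers are considerably more sophisticated:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Hypothetical vocabulary; a real one would come from a dictionary file.
VOCABULARY = ["hello", "love", "awesome", "happy", "display"]

def correct(word, vocabulary=VOCABULARY, max_distance=2):
    """Return the closest vocabulary word, or the word itself if none is close."""
    best = min(vocabulary, key=lambda candidate: levenshtein(word, candidate))
    return best if levenshtein(word, best) <= max_distance else word

print(levenshtein("kitten", "sitting"))  # 3
print(correct("helo"))                   # hello
print(correct("awsome"))                 # awesome
print(correct("iphone"))                 # iphone (no close match, left unchanged)
```

Scanning the whole vocabulary for every word is slow for large dictionaries, so this is only suitable as a starting point.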