In [1]:
import numpy as np
import pandas as pd

In [7]:
from platform import python_version
print(python_version())

3.8.6


### 1. Escaping HTML characters:

Data obtained from the web usually contains HTML entities such as &lt;, &gt; and &amp; embedded in the original text, and these need to be cleaned up. One approach is to remove them directly with regular expressions. Another is to use a suitable package or module (for example, Python's html module), which converts the entities back to the characters they stand for. For example, &lt; is converted to “<” and &amp; is converted to “&”.

In [11]:
import html

original_tweet = "I luv my &lt;3 iphone &amp; you’re awsm apple. DisplayIsAwesome, sooo happppppy 🙂 http://www.apple.com"

print('\n\nEscaping HTML Characters\n\n')

# Drop non-ASCII characters (the curly apostrophe and the emoji).
# A Python 3 str has no .decode(), so encode to ASCII and decode back.
ascii_tweet = original_tweet.encode('ascii', 'ignore').decode('ascii')

# html.unescape replaces the deprecated HTMLParser().unescape
tweet = html.unescape(ascii_tweet)

print(tweet)



Escaping HTML Characters


I luv my <3 iphone & youre awsm apple. DisplayIsAwesome, sooo happppppy  http://www.apple.com


### 2. Decoding data:
This is the process of transforming information from complex symbols into simple, easier-to-understand characters. Text data may arrive in different encodings such as Latin-1 or UTF-8. For reliable analysis, keep the complete data in one standard encoding; UTF-8 is widely accepted and is the recommended choice.

In [53]:
print(type(original_tweet))
original_tweet

<class 'str'>


'I luv my &lt;3 iphone &amp; you’re awsm apple. DisplayIsAwesome, sooo happppppy 🙂 http://www.apple.com'

In [9]:
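import html

# In Python 3 a str is already Unicode; .decode() exists only on bytes.
# Round-trip through UTF-8 bytes to illustrate the decoding step.
tweet_02 = original_tweet.encode('utf8').decode('utf8')

# Unescape the HTML entities and normalize the curly apostrophe, giving:
# "I luv my <3 iphone & you're awsm apple. DisplayIsAwesome, sooo happppppy 🙂 http://www.apple.com"
tweet_02 = html.unescape(tweet_02).replace('’', "'")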

In [66]:
type(tweet_02)

str
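
When the input arrives as raw bytes of unknown encoding, the decoding step has to be explicit. The following is a minimal sketch, assuming only two candidate encodings; the helper name `to_text` and the sample word are illustrative and not part of the notebook above. Trying encodings in order is a heuristic, not a reliable detector: Latin-1 accepts any byte sequence, so it only makes sense as the last fallback.

```python
# The same text stored as bytes under two different encodings
utf8_bytes = "café".encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = "café".encode("latin-1")  # b'caf\xe9'

def to_text(raw, encodings=("utf-8", "latin-1")):
    """Try each candidate encoding in order and return the first clean decode."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")

print(to_text(utf8_bytes), to_text(latin1_bytes))
```

Once decoded, every string can be written back out as UTF-8 so the rest of the pipeline sees a single encoding.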

### 3. Apostrophe Lookup: 

To avoid ambiguity in the sense of words, it is recommended to keep the text well structured and to abide by the rules of context-free grammar. When apostrophes are used, the chance of ambiguity increases.
For example, “it’s” is a contraction of either “it is” or “it has”.

All apostrophe contractions should be expanded into their standard forms. One can use a lookup table of all possible keys to resolve the ambiguities.

In [67]:
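APOSTROPHES = {"'s": " is", "'re": " are"}  # a real table needs many more entries

print(APOSTROPHES)

words = tweet_02.split()
tweet_03 = []
for word in words:
    # The keys are suffixes, so match on the end of each word
    # rather than on the whole word
    for suffix, expansion in APOSTROPHES.items():
        if word.endswith(suffix):
            word = word[:-len(suffix)] + expansion
            break
    tweet_03.append(word)
tweet_03 = " ".join(tweet_03)

print(tweet_03)


{"'s": ' is', "'re": ' are'}
I luv my <3 iphone & you are awsm apple. DisplayIsAwesome, sooo happppppy 🙂 http://www.apple.com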


In [68]:
words

['I',
 'luv',
 'my',
 '<3',
 'iphone',
 '&',
 "you're",
 'awsm',
 'apple.',
 'DisplayIsAwesome,',
 'sooo',
 'happppppy',
 '🙂',
 'http://www.apple.com']

### 4. Removal of Stop-words: 
When analysis needs to be driven by the data at the word level, commonly occurring words (stop-words) should be removed. One can either build a list of stop-words by hand or use a predefined, language-specific list from a library.
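
A minimal sketch of stop-word removal, assuming a tiny hand-written stop-word set. `STOP_WORDS` and `remove_stop_words` are illustrative names, not part of the notebook above; a real pipeline would more likely load a fuller list from a library such as NLTK or spaCy.

```python
# Illustrative stop-word set; far too small for real use.
STOP_WORDS = {"a", "an", "the", "is", "are", "my", "and", "of", "to"}

def remove_stop_words(text):
    """Drop whitespace-separated tokens whose lowercase form is in STOP_WORDS."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

print(remove_stop_words("The display of my phone is awesome"))
# display phone awesome
```
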
### 5. Removal of Punctuation:
Punctuation marks should be handled according to their importance. For example, “.”, “,” and “?” are important punctuation marks that should be retained, while the others can be removed.
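
One possible way to do this, sketched with the standard library only. The set of marks to keep follows the example in the text; the names `KEEP`, `DROP_TABLE` and `strip_punctuation` are assumptions for illustration. Note that `string.punctuation` covers ASCII punctuation only, so curly quotes and similar characters pass through untouched.

```python
import string

KEEP = set(".,?")  # punctuation to retain

# Translation table that deletes every ASCII punctuation mark not in KEEP
DROP_TABLE = str.maketrans(
    "", "", "".join(c for c in string.punctuation if c not in KEEP)
)

def strip_punctuation(text):
    """Remove ASCII punctuation except the marks listed in KEEP."""
    return text.translate(DROP_TABLE)

print(strip_punctuation("Wow!!! Is this (really) true? Yes, it is."))
# Wow Is this really true? Yes, it is.
```
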
### 6. Removal of Expressions: 
Textual data (usually speech transcripts) may contain annotations of human expressions such as [laughing], [Crying] or [Audience paused]. These are usually not relevant to the content of the speech and should be removed. A simple regular expression is enough in this case.
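
A minimal sketch of such a regular expression, assuming the annotations are always wrapped in square brackets and never nested. `EXPRESSION_RE` and `remove_expressions` are illustrative names.

```python
import re

# A bracketed annotation such as [laughing], plus any whitespace before it
EXPRESSION_RE = re.compile(r"\s*\[[^\]]*\]")

def remove_expressions(text):
    """Delete [bracketed] annotations and trim the ends of the string."""
    return EXPRESSION_RE.sub("", text).strip()

print(remove_expressions("Thank you all [laughing] for coming tonight [Audience paused]"))
# Thank you all for coming tonight
```
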
### 7. Split Attached Words: 
Text that people write on social forums is completely informal in nature. Many tweets contain attached words such as RainyDay or PlayingInTheCold. These can be split into their normal forms using simple rules and regular expressions.
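
One simple rule, sketched below, is to insert a space wherever a lowercase letter is directly followed by an uppercase one. This assumes the attached words are written in CamelCase; the names `CAMEL_BOUNDARY_RE` and `split_attached_words` are illustrative. The rule is crude: it would also split a brand name like "iPhone" into "i Phone".

```python
import re

# Zero-width position between a lowercase letter and the uppercase letter after it
CAMEL_BOUNDARY_RE = re.compile(r"(?<=[a-z])(?=[A-Z])")

def split_attached_words(text):
    """Insert a space at every lowercase-to-uppercase boundary."""
    return CAMEL_BOUNDARY_RE.sub(" ", text)

print(split_attached_words("RainyDay PlayingInTheCold DisplayIsAwesome"))
# Rainy Day Playing In The Cold Display Is Awesome
```
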

### 8. Slang lookup:
Social media text also contains a great deal of slang. These words should be transformed into standard words to produce clean free text: for example, luv becomes love and Helo becomes Hello. The same lookup approach used for apostrophes can convert slang to standard words. Lists of slang terms are available from a number of sources on the web and can serve as lookup dictionaries for the conversion.
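
A minimal sketch of the lookup, assuming a three-entry table made up for illustration (`SLANG` and `replace_slang` are hypothetical names). Matching is on whole whitespace-separated tokens, so a token with punctuation attached, such as "luv,", is left unchanged, and replaced words come out in lowercase.

```python
# Illustrative slang table; a real one would be loaded from a much larger list.
SLANG = {"luv": "love", "helo": "hello", "awsm": "awesome"}

def replace_slang(text):
    """Replace each token found in SLANG, ignoring case when looking it up."""
    return " ".join(SLANG.get(w.lower(), w) for w in text.split())

print(replace_slang("I luv my iphone, it is awsm"))
# I love my iphone, it is awesome
```
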

### 9. Standardizing words: 

Sometimes words are not in their proper form. For example, “I looooveee you” should be “I love you”. Simple rules and regular expressions can help solve these cases.
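
A minimal sketch of one such rule: shrink any run of three or more identical letters. The names `REPEAT_RE` and `collapse_repeats` and the `keep` parameter are assumptions for illustration. No single setting is right for every word: keeping one letter fixes "looooveee" but would turn "happppppy" into "hapy", while keeping two fixes "happppppy" but leaves "soo". A spell-checking pass afterwards is a common complement.

```python
import re

# A letter followed by two or more repeats of itself (a run of 3+)
REPEAT_RE = re.compile(r"([A-Za-z])\1{2,}")

def collapse_repeats(text, keep=1):
    """Shrink runs of 3+ identical letters down to `keep` copies."""
    return REPEAT_RE.sub(lambda m: m.group(1) * keep, text)

print(collapse_repeats("I looooveee you"))          # I love you
print(collapse_repeats("sooo happppppy", keep=2))   # soo happy
```
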

### 10. Removal of URLs: 
URLs and hyperlinks in text data such as comments, reviews, and tweets should be removed.
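
A minimal sketch using a regular expression, assuming that a URL starts with http://, https:// or www. and runs until the next whitespace. `URL_RE` and `remove_urls` are illustrative names; the pattern is deliberately simple and will miss links written without a scheme or a www prefix.

```python
import re

# http(s) links and bare www. links, up to the next whitespace
URL_RE = re.compile(r"(?:https?://|www\.)\S+")

def remove_urls(text):
    """Delete URLs, then squeeze the whitespace they leave behind."""
    return " ".join(URL_RE.sub("", text).split())

print(remove_urls("sooo happppppy http://www.apple.com see www.example.org now"))
# sooo happppppy see now
```
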

In [3]:
text = ['This is dirty TEXT: A phone number +001234561234, moNey 3.333, some date like 09.08.2016 and weird Čárákterš.']

In [12]:
text

['This is dirty TEXT: A phone number +001234561234, moNey 3.333, some date like 09.08.2016 and weird Čárákterš.']

In [7]:
import re

def is_digit(word):
    try:
        int(word)
        return True
    except ValueError:
        return False

# Pairs of (accented character, plain Latin replacement)
cedilla2latin = [['Á', 'A'], ['á', 'a'], ['Č', 'C'], ['č', 'c'], ['Š', 'S'], ['š', 's']]
tr = dict(cedilla2latin)

def transliterate(line):
    new_line = ""
    for letter in line:
        if letter in tr:
            new_line += tr[letter]
        else:
            new_line += letter
    return new_line

In [8]:
for line in text:
    # Python 3 strings are already Unicode, so no decode step is needed
    line = line.replace('+', ' ').replace('.', ' ').replace(',', ' ').replace(':', ' ')
    # remove digits with regex (raw string so \W and \d reach the regex engine)
    line = re.sub(r"(^|\W)\d+($|\W)", " ", line)
    # OR remove digits with casting to int
    new_line = []
    for word in line.split():
        if not is_digit(word):
            new_line.append(word)
    line = " ".join(new_line)
    # transliterate to Latin characters
    line = transliterate(line)
    line = line.lower()
    print(line)

this is dirty text a phone number money some date like and weird carakters
