Text preprocessing is part of Natural Language Processing (NLP), a broad set of techniques designed to help machines learn from text.

The following techniques are used for text preprocessing:
*     Lower/Upper casing
*     Removing Punctuations
*     Removal of common words
*     Removal of rare words
*     Removing Icons
*     Removal of URLs
*     Removal of HTML tags
*     Remove Stop Words

These topics are common, basic questions in many data science interviews. We will go through each of them and write code for it.

**<span style="color:Red">Please give suggestions for improvements.</span>**

I will keep updating this notebook with more topics and any suggestions received.

In [1]:
import numpy as np
import pandas as pd
import re
import nltk
import spacy
import string

## Lower/Upper casing

Lower/Upper casing is a common text preprocessing technique. The idea is to convert the sample data to a single casing format so that 'data', 'Data' and 'DATA' are treated the same way.

It also reduces duplication and the vocabulary size of the dataset.

In [2]:
data = "This is a Example of UPPER and lower case "
print("lower case: ",data.lower())
print("upper case: ",data.upper())

lower case:  this is a example of upper and lower case 
upper case:  THIS IS A EXAMPLE OF UPPER AND LOWER CASE 
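
The effect on vocabulary size can be checked directly. The snippet below is a minimal sketch: the token list and the `vocabulary` helper are hypothetical and only illustrate the idea.

```python
# Hypothetical sample tokens, not taken from a real dataset
tokens = ["data", "Data", "DATA", "science", "Science"]

def vocabulary(words, lowercase=False):
    """Return the set of distinct tokens, optionally lowercased first."""
    if lowercase:
        words = [w.lower() for w in words]
    return set(words)

raw_vocab = vocabulary(tokens)
lower_vocab = vocabulary(tokens, lowercase=True)

print("vocabulary size before:", len(raw_vocab))
print("vocabulary size after: ", len(lower_vocab))
```

Lowercasing can also discard useful signal (for example, 'US' versus 'us'), so whether to apply it depends on the task.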


## Removing Punctuations

Removing punctuation from text helps treat tokens that differ only in punctuation as the same. For example, 'data' and 'data!' become identical once punctuation is removed.


In [3]:
PUNCT_TO_REMOVE = string.punctuation  # ASCII punctuation only
def remove_punctuation(text):
    # Delete every character listed in PUNCT_TO_REMOVE and return the result
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

text_data = "Hi!! this is a example, a Punctuation example..."
print(remove_punctuation(text_data))

Hi this is a example a Punctuation example
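
Deleting punctuation outright can fuse neighbouring words when there is no space around the punctuation mark. One possible variant, sketched here with the hypothetical names `PUNCT_TO_SPACE` and `replace_punctuation`, replaces each punctuation character with a space and then collapses repeated whitespace.

```python
import string

# Map every ASCII punctuation character to a single space
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})

def replace_punctuation(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    return " ".join(text.translate(PUNCT_TO_SPACE).split())

sample = "well-known fact,no space after the comma!"

# Plain deletion, for comparison: 'well-known' and 'fact,no' get fused
deleted = sample.translate(str.maketrans("", "", string.punctuation))

print(deleted)
print(replace_punctuation(sample))
```

Note that `string.punctuation` covers ASCII characters only, so typographic quotes and similar symbols would pass through both versions unchanged.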


## Removal of common words

Some words appear very frequently in a dataset but carry little information for the task at hand.

This step removes the most common words in the given dataset.

In [4]:
from collections import Counter

text_data = "This is a example of removing common words a common words are words that appear frequetly"
# Count how often each whitespace-separated token occurs
counter = Counter(text_data.split())

counter.most_common(3)

[('words', 3), ('a', 2), ('common', 2)]

In [5]:
COMMWORDS = {w for (w, wc) in counter.most_common(3)}
def remove_freqwords(text):
    # Keep only the tokens that are not among the most frequent words
    return [word for word in str(text).split() if word not in COMMWORDS]

print(remove_freqwords(text_data))

['This', 'is', 'example', 'of', 'removing', 'are', 'that', 'appear', 'frequetly']
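
In practice the counts usually come from the whole corpus rather than a single string. The following is a minimal sketch under that assumption; the `docs` list and the helpers `top_n_words` and `remove_words` are hypothetical names used only for illustration.

```python
from collections import Counter

# Hypothetical mini-corpus used only for illustration
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a bird sang",
]

def top_n_words(documents, n):
    """Return the n most frequent tokens across all documents."""
    counts = Counter(word for doc in documents for word in doc.split())
    return {word for word, _ in counts.most_common(n)}

def remove_words(text, words_to_remove):
    """Drop the given tokens and return the remaining text as a string."""
    return " ".join(w for w in text.split() if w not in words_to_remove)

frequent = top_n_words(docs, n=1)
cleaned = [remove_words(doc, frequent) for doc in docs]

print(frequent)
print(cleaned)
```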


## Removal of rare words

Unlike common words, rare words occur very rarely in the dataset and may add little benefit to the analysis.

In some cases you can treat them as outliers.

In [6]:
n_rare_words = 3
RAREWORDS = set([w for (w, wc) in counter.most_common()[:-n_rare_words-1:-1]])
print("Rare words: ",RAREWORDS)
def remove_rarewords(text):
    return ([word for word in str(text).split() if word not in RAREWORDS])

text_data = "This is a example of removing common words a common words are words that appear frequetly"
remove_rarewords(text_data)

Rare words:  {'appear', 'that', 'frequetly'}


['This',
 'is',
 'a',
 'example',
 'of',
 'removing',
 'common',
 'words',
 'a',
 'common',
 'words',
 'are',
 'words']
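
Taking a fixed number of words from the end of `most_common()` picks arbitrarily among words that share the same count (here, nine words occur exactly once but only three are removed). An alternative, sketched below with the hypothetical helper `remove_below_count`, is to drop every token whose count falls below a threshold.

```python
from collections import Counter

def remove_below_count(text, min_count=2):
    """Keep only tokens that occur at least min_count times in text."""
    tokens = text.split()
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]

sample = "This is a example of removing common words a common words are words that appear frequetly"
print(remove_below_count(sample))
```

The threshold value is a judgment call and would normally be tuned against the size of the dataset.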

## Removing Icons

People tend to use casual language and express their emotions with icons or emojis on social media platforms, and emojis are increasingly common in day-to-day communication as well. For some text analyses we may need to remove these emojis.


In [7]:
# Reference : https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    # The parameter is named `text` so it does not shadow the imported `string` module
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # very broad range: also matches CJK and other non-emoji characters
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

remove_emoji("✨ This is an example to remove icons or emoji 🔥🔥")

' This is an example to remove icons or emoji '
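
The last range in the pattern above (`\U000024C2-\U0001F251`) spans a large part of Unicode, including CJK ideographs, so it can delete ordinary text in multilingual data. The sketch below compares it with a narrower pattern. `NARROW_EMOJI` and `BROAD_RANGE` are hypothetical names, and the narrower pattern is an assumption for illustration, not a complete emoji list.

```python
import re

# Narrower pattern (an assumption, not a complete emoji list)
NARROW_EMOJI = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U00002702-\U000027B0"  # dingbats
    "]+"
)

# The broad range from the cell above, isolated for comparison
BROAD_RANGE = re.compile("[\U000024C2-\U0001F251]+")

sample = "日本語 ✨ text 🔥"
narrow_result = NARROW_EMOJI.sub("", sample)
broad_result = BROAD_RANGE.sub("", sample)

print(repr(narrow_result))  # Japanese text is kept, both emojis removed
print(repr(broad_result))   # Japanese text is removed along with the sparkle
```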

## Removal of URLs

This is a very common task and is also asked about in many interviews. A URL points to a page that explains or holds data related to a topic.

In a text dataset, the URL string itself often adds no useful value.

In [8]:
def remove_url(text):
    # Strip anything starting with http://, https:// or www. up to the next whitespace
    return re.sub(r"https?://\S+|www\.\S+", "", text)

text_data = "This is an example to remove http://example.com this url"
print(remove_url(text_data))

This is an example to remove  this url
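
Removing a URL leaves a double space where it used to be. A small follow-up step, sketched here with the hypothetical names `URL_PATTERN` and `remove_url_clean`, collapses the leftover whitespace.

```python
import re

# Same URL pattern as in the cell above, compiled once for reuse
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def remove_url_clean(text):
    """Remove URLs, then collapse the whitespace left behind."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())

result = remove_url_clean("Docs at www.example.com and http://example.com/page are linked")
print(result)
```

Because the pattern matches up to the next whitespace, punctuation attached to the end of a URL (such as a trailing comma) is removed together with it.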


## Removal of HTML tags

HTML tags define the structure and layout of a web page.

When you fetch or scrape data from the web, the text may contain HTML tags. If your dataset does not need these tags, you can remove them.

In [9]:
def remove_tags(text):
    # Matches tags such as <p> and character entities such as &amp; or &#39;
    html = re.compile(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")
    return re.sub(html, "", text)

text_data = '''<b>This is a </br>
                <p>example to remove HTML tags</p>
            '''
print(remove_tags(text_data))

This is a 
                example to remove HTML tags
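
The regular expression above deletes character entities such as `&amp;` instead of decoding them. As an alternative, the standard library's `html.parser` can extract the text nodes and decode entities along the way. This is a minimal sketch; `TextExtractor` and `strip_tags` are hypothetical names.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(markup):
    """Return the text content of markup with tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

result = strip_tags("<p>Tom &amp; Jerry <b>rock</b></p>")
print(result)
```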
            


## Remove Stop Words

Stop words are commonly occurring words in a language, like 'an', 'the', 'a' and so on.

Usually they do not carry valuable information for the analysis task and can be removed.

In [10]:
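from nltk.corpus import stopwords
# Requires a one-time nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    # The comparison is case-sensitive, so capitalized words such as 'This' are kept
    return [word for word in str(text).split() if word not in stop_words]

text_data = "This is a example to remove stop words"
print(remove_stopwords(text_data))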


['This', 'example', 'remove', 'stop', 'words']
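
In the output above, 'This' survives because the NLTK list is lowercase and the membership test is case-sensitive. One way to handle this is to compare in lowercase while keeping the original token. The sketch below uses a tiny hand-picked stop word set (an assumption, so that it runs without downloading the NLTK corpus); `SAMPLE_STOP_WORDS` and `remove_stopwords_ci` are hypothetical names.

```python
# Tiny hand-picked stop word set, an assumption made so the sketch runs
# without downloading the NLTK corpus
SAMPLE_STOP_WORDS = {"this", "is", "a", "an", "the", "to"}

def remove_stopwords_ci(text, stop_words=SAMPLE_STOP_WORDS):
    """Drop stop words, comparing in lowercase but keeping original casing."""
    return [w for w in text.split() if w.lower() not in stop_words]

result = remove_stopwords_ci("This is a example to remove stop words")
print(result)
```

The same function should work with the NLTK set by passing `stop_words=set(stopwords.words('english'))`.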
