## Introduction

In any machine learning task, cleaning or preprocessing the data is at least as important as model building. For unstructured data like text, this step matters even more.

The objective of this kernel is to understand the various text preprocessing steps with code examples.

Some of the common text preprocessing / cleaning steps are:
* Lower casing
* Removal of Punctuations
* Removal of Stopwords
* Removal of Frequent words
* Removal of Rare words
* Stemming
* Lemmatization
* Removal of emojis
* Removal of emoticons
* Conversion of emoticons to words
* Conversion of emojis to words
* Removal of URLs 
* Removal of HTML tags
* Chat words conversion
* Spelling correction


These are the different types of text preprocessing steps we can apply to text data. We need not apply all of them every time. We need to choose the preprocessing steps carefully based on our use case, since that choice also plays an important role.

For example, in a sentiment analysis use case, we need not remove the emojis or emoticons, as they convey important information about the sentiment. Similarly, we need to decide on the other steps based on our use case.
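One way to make this choice explicit is to treat each step as a function and pass the chosen steps as a list. The following is only a minimal sketch: the helper `preprocess` and the two step lists are hypothetical names for illustration, not part of this kernel.

```python
import re
import string

def lower_case(text):
    return text.lower()

def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_urls(text):
    return re.sub(r"https?://\S+|www\.\S+", "", text)

def preprocess(text, steps):
    """Apply the chosen preprocessing steps in the given order."""
    for step in steps:
        text = step(text)
    return text

# Hypothetical step selections for two different use cases
TOPIC_STEPS = [remove_urls, lower_case, remove_punctuation]
SENTIMENT_STEPS = [remove_urls]  # keep casing and emoticons such as ":)"

sample = "LOVED it :) see https://example.com"
topic_text = preprocess(sample, TOPIC_STEPS)
sentiment_text = preprocess(sample, SENTIMENT_STEPS)
```

With these assumed step lists, the emoticon survives in `sentiment_text` but is stripped from `topic_text` along with the rest of the punctuation.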

In [1]:
import numpy as np
import pandas as pd
import re
import nltk
import spacy
import string
pd.options.mode.chained_assignment = None

In [2]:
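full_df = pd.read_csv("Jeasn_baxter_final.csv", encoding="utf-8")
# Work on a copy so that adding columns does not touch a view of full_df
df = full_df[["text"]].copy()
# "ABC" is the working column that each preprocessing step below overwrites
df["ABC"] = df["text"].astype(str)
full_df.head()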


Unnamed: 0.1,Unnamed: 0,directory,category,fileName,title,text
0,0,C:/Users/DELL PC/Desktop/Document Clustering U...,DATASET\business,business_1.txt,Lufthansa flies back to profit,['German airline Lufthansa has returned to pro...
1,1,C:/Users/DELL PC/Desktop/Document Clustering U...,DATASET\business,business_10.txt,Winn-Dixie files for bankruptcy,['US supermarket group Winn-Dixie has filed fo...
2,2,C:/Users/DELL PC/Desktop/Document Clustering U...,DATASET\business,business_100.txt,US economy still growing says Fed,['Most areas of the US saw their economy conti...
3,3,C:/Users/DELL PC/Desktop/Document Clustering U...,DATASET\business,business_11.txt,Saab to build Cadillacs in Sweden,"[""General Motors, the world's largest car make..."
4,4,C:/Users/DELL PC/Desktop/Document Clustering U...,DATASET\business,business_12.txt,Bank voted 8-1 for no rate change,"[""The decision to keep interest rates on hold ..."


In [4]:
full_df['category'].nunique()

10

## Lower Casing

Lower casing is a common text preprocessing technique. The idea is to convert the input text into the same casing format so that 'text', 'Text' and 'TEXT' are treated the same way.

This is most helpful for text featurization techniques like frequency counts and tfidf, as it combines variants of the same word, reducing duplication and giving correct counts / tfidf values.

This may not be helpful for tasks like Part of Speech tagging (where proper casing gives some information about nouns and so on) and sentiment analysis (where upper casing can signal anger and so on).

By default, lower casing is done by most modern vectorizers and tokenizers like [sklearn TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and [Keras Tokenizer](https://keras.io/preprocessing/text/). So we need to turn it off (`lowercase=False` and `lower=False` respectively) when our use case requires the original casing.
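To see the effect of casing on frequency counts, here is a small sketch on a made-up token list (not taken from the dataset), using only the standard library.

```python
from collections import Counter

# Made-up tokens for illustration
tokens = "Text text TEXT data Data".split()

# Without lower casing, every casing variant gets its own count
counts_raw = Counter(tokens)

# With lower casing, variants of the same word are counted together
counts_lower = Counter(token.lower() for token in tokens)
```

Here `counts_raw` has five separate entries, while `counts_lower` merges them into `text` and `data`.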

In [5]:
df["ABC"] = df["ABC"].str.lower()
df.head()

Unnamed: 0,text,ABC
0,['German airline Lufthansa has returned to pro...,['german airline lufthansa has returned to pro...
1,['US supermarket group Winn-Dixie has filed fo...,['us supermarket group winn-dixie has filed fo...
2,['Most areas of the US saw their economy conti...,['most areas of the us saw their economy conti...
3,"[""General Motors, the world's largest car make...","[""general motors, the world's largest car make..."
4,"[""The decision to keep interest rates on hold ...","[""the decision to keep interest rates on hold ..."


In [6]:
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

# Apply to the whole column at once instead of assigning row by row
df["ABC"] = df["ABC"].apply(remove_urls)

In [7]:
from bs4 import BeautifulSoup

def remove_html(text):
    return BeautifulSoup(text, "lxml").text

# Apply to the whole column at once instead of assigning row by row
df["ABC"] = df["ABC"].apply(remove_html)

In [None]:
from spellchecker import SpellChecker

spell = SpellChecker()
def correct_spellings(text):
    words = text.split()
    misspelled_words = spell.unknown(words)
    corrected_text = []
    for word in words:
        if word in misspelled_words:
            # correction() can return None when no candidate is found; keep the word then
            corrected_text.append(spell.correction(word) or word)
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)

df["ABC"] = df["ABC"].apply(correct_spellings)

## Removal of Punctuations

Another common text preprocessing technique is to remove punctuation from the text data. This is again a text standardization process that helps to treat 'hurray' and 'hurray!' in the same way.

We also need to choose the list of punctuation symbols to exclude carefully, depending on the use case. For example, `string.punctuation` in Python contains the following symbols:

`` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``

We can add or remove symbols as per our need.
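As a sketch of such a customization, the snippet below keeps `!` and `?` and removes everything else in `string.punctuation`. The choice of symbols to keep and the names `KEEP` and `remove_custom_punctuation` are assumptions for illustration only.

```python
import string

# Assumption for illustration: "!" and "?" may carry sentiment, so keep them
KEEP = "!?"
PUNCT_TO_REMOVE_CUSTOM = "".join(ch for ch in string.punctuation if ch not in KEEP)

def remove_custom_punctuation(text):
    """Remove all punctuation except the symbols listed in KEEP."""
    return text.translate(str.maketrans("", "", PUNCT_TO_REMOVE_CUSTOM))

cleaned = remove_custom_punctuation("Hurray!!! Really, is it (finally) done?")
```

The comma and the parentheses are dropped, while the exclamation and question marks remain in `cleaned`.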

In [8]:
# drop the new column created in last cell
#df.drop(["text_lower"], axis=1, inplace=True)

PUNCT_TO_REMOVE = string.punctuation
def remove_punctuation(text):
    """custom function to remove the punctuation"""
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

df["ABC"] = df["ABC"].apply(lambda text: remove_punctuation(text))
df.head()

Unnamed: 0,text,ABC
0,['German airline Lufthansa has returned to pro...,german airline lufthansa has returned to profi...
1,['US supermarket group Winn-Dixie has filed fo...,us supermarket group winndixie has filed for b...
2,['Most areas of the US saw their economy conti...,most areas of the us saw their economy continu...
3,"[""General Motors, the world's largest car make...",general motors the worlds largest car maker ha...
4,"[""The decision to keep interest rates on hold ...",the decision to keep interest rates on hold at...


## Removal of stopwords

Stopwords are commonly occurring words in a language like 'the', 'a' and so on. They can be removed from the text most of the time, as they don't provide valuable information for downstream analysis. In cases like Part of Speech tagging, we should not remove them, as they provide very valuable information about the POS.

These stopword lists are already compiled for different languages and we can safely use them. For example, the stopword list for the English language from the nltk package can be seen below.


In [9]:
from nltk.corpus import stopwords
", ".join(stopwords.words('english'))

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

Stopword lists for other languages can be obtained and used in the same way.

In [10]:
STOPWORDS = set(stopwords.words('english'))
def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

df["ABC"] = df["ABC"].apply(lambda text: remove_stopwords(text))
df.head()

Unnamed: 0,text,ABC
0,['German airline Lufthansa has returned to pro...,german airline lufthansa returned profit 2004 ...
1,['US supermarket group Winn-Dixie has filed fo...,us supermarket group winndixie filed bankruptc...
2,['Most areas of the US saw their economy conti...,areas us saw economy continue expand december ...
3,"[""General Motors, the world's largest car make...",general motors worlds largest car maker confir...
4,"[""The decision to keep interest rates on hold ...",decision keep interest rates hold 475 earlier ...


## Removal of Frequent words

In the previous preprocessing step, we removed the stopwords based on language information. But if we have a domain specific corpus, we might also have some frequent words which are of little importance to us.

So this step removes the most frequent words in the given corpus. If we use something like tfidf, such words are down-weighted automatically.

Let us get the most common words and then remove them in the next step.
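To illustrate why tfidf handles frequent words, here is a minimal sketch of a plain, unsmoothed inverse document frequency on three made-up documents. Libraries such as scikit-learn use a smoothed variant, so their weights for very frequent words are small rather than exactly zero.

```python
import math
from collections import Counter

# Made-up documents for illustration
docs = [
    "said profit rose",
    "said bank cut rates",
    "said team won match",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))

# Unsmoothed IDF: log(N / document frequency)
idf = {word: math.log(n_docs / count) for word, count in doc_freq.items()}
```

A word like `said`, which appears in every document, gets an IDF of zero here, whereas a word that appears in only one document gets the highest weight.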

In [11]:
from collections import Counter
cnt = Counter()
for text in df["ABC"].values:
    for word in text.split():
        cnt[word] += 1
        
cnt.most_common(10)

[('said', 1673),
 ('also', 957),
 ('would', 947),
 ('one', 805),
 ('new', 681),
 ('mr', 643),
 ('war', 633),
 ('us', 615),
 ('people', 593),
 ('first', 579)]

In [12]:
FREQWORDS = set([w for (w, wc) in cnt.most_common(10)])
def remove_freqwords(text):
    """custom function to remove the frequent words"""
    return " ".join([word for word in str(text).split() if word not in FREQWORDS])

df["ABC"] = df["ABC"].apply(lambda text: remove_freqwords(text))
df.head()

Unnamed: 0,text,ABC
0,['German airline Lufthansa has returned to pro...,german airline lufthansa returned profit 2004 ...
1,['US supermarket group Winn-Dixie has filed fo...,supermarket group winndixie filed bankruptcy p...
2,['Most areas of the US saw their economy conti...,areas saw economy continue expand december ear...
3,"[""General Motors, the world's largest car make...",general motors worlds largest car maker confir...
4,"[""The decision to keep interest rates on hold ...",decision keep interest rates hold 475 earlier ...


## Removal of Rare words

This is very similar to the previous preprocessing step, but here we remove the rarest words from the corpus instead of the most frequent ones.

In [13]:
# Drop the two columns which are no more needed 
#df.drop(["text_wo_punct", "text_wo_stop"], axis=1, inplace=True)

n_rare_words = 10
RAREWORDS = set([w for (w, wc) in cnt.most_common()[:-n_rare_words-1:-1]])
def remove_rarewords(text):
    """custom function to remove the rare words"""
    return " ".join([word for word in str(text).split() if word not in RAREWORDS])

df["ABC"] = df["ABC"].apply(lambda text: remove_rarewords(text))
df.head()

Unnamed: 0,text,ABC
0,['German airline Lufthansa has returned to pro...,german airline lufthansa returned profit 2004 ...
1,['US supermarket group Winn-Dixie has filed fo...,supermarket group winndixie filed bankruptcy p...
2,['Most areas of the US saw their economy conti...,areas saw economy continue expand december ear...
3,"[""General Motors, the world's largest car make...",general motors worlds largest car maker confir...
4,"[""The decision to keep interest rates on hold ...",decision keep interest rates hold 475 earlier ...


We can also combine all the word lists (stopwords, frequent words and rare words) into a single list and remove them in one pass.
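A minimal sketch of that idea, using small made-up sets in place of the real word lists (the `_DEMO` names and `remove_words` are hypothetical):

```python
# Made-up stand-ins for the stopword, frequent-word and rare-word sets
STOPWORDS_DEMO = {"the", "a", "of"}
FREQWORDS_DEMO = {"said", "also"}
RAREWORDS_DEMO = {"zygote"}

# The union of the three sets is the single removal list
WORDS_TO_REMOVE = STOPWORDS_DEMO | FREQWORDS_DEMO | RAREWORDS_DEMO

def remove_words(text, words_to_remove=WORDS_TO_REMOVE):
    """Drop every token that appears in the combined removal set."""
    return " ".join(word for word in text.split() if word not in words_to_remove)

result = remove_words("the minister said a zygote of hope also grew")
```

In this notebook the equivalent set would be `STOPWORDS | FREQWORDS | RAREWORDS`, which lets one `apply` call replace three.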

## Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form (from [Wikipedia](https://en.wikipedia.org/wiki/Stemming)).

For example, if there are two words in the corpus, `walks` and `walking`, then stemming will strip the suffix to make them `walk`. But in another example, with the two words `console` and `consoling`, the stemmer will remove the suffix and make them `consol`, which is not a proper English word.

There are several types of stemming algorithms available, and one of the most widely used is the Porter stemmer. We can use the nltk package for it.

In [14]:
from nltk.stem.porter import PorterStemmer

# Drop the two columns 
#df.drop(["text_wo_stopfreq", "text_wo_stopfreqrare"], axis=1, inplace=True) 

stemmer = PorterStemmer()
def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text.split()])

df["ABC"] = df["ABC"].apply(lambda text: stem_words(text))
df.head()

Unnamed: 0,text,ABC
0,['German airline Lufthansa has returned to pro...,german airlin lufthansa return profit 2004 pos...
1,['US supermarket group Winn-Dixie has filed fo...,supermarket group winndixi file bankruptci pro...
2,['Most areas of the US saw their economy conti...,area saw economi continu expand decemb earli j...
3,"[""General Motors, the world's largest car make...",gener motor world largest car maker confirm bu...
4,"[""The decision to keep interest rates on hold ...",decis keep interest rate hold 475 earlier mont...


We can see that words like `airline` and `continue` have lost their trailing `e` (`airlin`, `continu`) due to stemming. This is not intended. What can we do about that? We can use lemmatization in such cases.

Also, this Porter stemmer is for the English language. If we are working with other languages, we can use the Snowball stemmer. The languages supported by the Snowball stemmer are:

In [15]:
from nltk.stem.snowball import SnowballStemmer
SnowballStemmer.languages

('arabic',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'norwegian',
 'porter',
 'portuguese',
 'romanian',
 'russian',
 'spanish',
 'swedish')

## Lemmatization

Lemmatization is similar to stemming in that it reduces inflected words to their word stem, but it differs in making sure that the root word (also called the lemma) belongs to the language.

As a result, it is generally slower than stemming. So depending on the speed requirement, we can choose either stemming or lemmatization.

Let us use the `WordNetLemmatizer` in nltk to lemmatize our sentences.

In [16]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

df["ABC"] = df["ABC"].apply(lambda text: lemmatize_words(text))
df.head()

Unnamed: 0,text,ABC
0,['German airline Lufthansa has returned to pro...,german airlin lufthansa return profit 2004 pos...
1,['US supermarket group Winn-Dixie has filed fo...,supermarket group winndixi file bankruptci pro...
2,['Most areas of the US saw their economy conti...,area saw economi continu expand decemb earli j...
3,"[""General Motors, the world's largest car make...",gener motor world largest car maker confirm bu...
4,"[""The decision to keep interest rates on hold ...",decis keep interest rate hold 475 earlier mont...


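Because this notebook applies the lemmatizer to the already-stemmed `ABC` column, the output above is unchanged: stems such as `airlin` are not WordNet entries, so the lemmatizer returns them as they are. Applied to unstemmed text, lemmatization would retain the trailing `e` in words like `propose` and `private`, unlike stemming.

Wait. There is one more thing in lemmatization. Let us try to lemmatize `running` now.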

In [17]:
lemmatizer.lemmatize("running")

'running'

Wow. It returned `running` unchanged instead of converting it to the root form `run`. This is because the lemmatization process depends on the POS tag to come up with the correct lemma, and `lemmatize` treats the word as a noun when no tag is given. Now let us lemmatize again by providing the POS tag for the word.

In [18]:
lemmatizer.lemmatize("running", "v") # v for verb

'run'

Now we get the root form `run`. So with the nltk lemmatizer we also need to provide the POS tag along with the word. Depending on the POS, the lemmatizer may return different results.

For example, `stripes` lemmatizes to `strip` when tagged as a verb and to `stripe` when tagged as a noun.

Now let us redo the lemmatization process for our dataset.
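Before that, it may help to see how tagger output maps to the lemmatizer's POS argument. `nltk.pos_tag` returns Penn Treebank tags such as `NN`, `VBG`, `JJ` and `RB`, while the WordNet lemmatizer expects the codes `n`, `v`, `a` and `r`. The sketch below reproduces that mapping with plain strings; `to_wordnet_pos` is a hypothetical helper name, and falling back to noun for all other tags is a simplifying assumption.

```python
# WordNet POS codes: noun "n", verb "v", adjective "a", adverb "r"
TAG_PREFIX_TO_WORDNET = {"N": "n", "V": "v", "J": "a", "R": "r"}

def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag (e.g. "VBG") to a WordNet POS code, defaulting to noun."""
    return TAG_PREFIX_TO_WORDNET.get(treebank_tag[:1], "n")

# Map a few sample tags; "IN" (preposition) falls back to noun
examples = {tag: to_wordnet_pos(tag) for tag in ["NN", "VBG", "JJ", "RB", "IN"]}
```

Only the first letter of the tag is used, so `VB`, `VBD` and `VBG` all map to the verb code.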

In [19]:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
wordnet_map = {"N":wordnet.NOUN, "V":wordnet.VERB, "J":wordnet.ADJ, "R":wordnet.ADV}
def lemmatize_words(text):
    pos_tagged_text = nltk.pos_tag(text.split())
    return " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tagged_text])

df["ABC"] = df["ABC"].apply(lambda text: lemmatize_words(text))
df.head()

Unnamed: 0,text,ABC
0,['German airline Lufthansa has returned to pro...,german airlin lufthansa return profit 2004 pos...
1,['US supermarket group Winn-Dixie has filed fo...,supermarket group winndixi file bankruptci pro...
2,['Most areas of the US saw their economy conti...,area saw economi continu expand decemb earli j...
3,"[""General Motors, the world's largest car make...",gener motor world large car maker confirm buil...
4,"[""The decision to keep interest rates on hold ...",decis keep interest rate hold 475 early month ...


## Removal of URLs

The next preprocessing step is to remove any URLs present in the data. For example, if we are doing a Twitter analysis, there is a good chance that a tweet will have some URL in it. We will probably need to remove them for further analysis.

The `remove_urls` function defined in an earlier cell does this with a regular expression.

Let us take a `https` link and check the code

Now let us take a `http` url and check the code
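The two checks can be sketched as follows, using the same regular expression as the `remove_urls` cell earlier in this notebook. The URLs are made-up `example` addresses, not links from the dataset.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def remove_urls(text):
    """Delete http(s):// links and bare www. links from the text."""
    return URL_PATTERN.sub("", text)

https_text = "Read the post at https://example.com/blog/post-1 today"
http_text = "Mirror: http://example.org/page and www.example.net/home"

https_clean = remove_urls(https_text)
http_clean = remove_urls(http_text)
```

Note that the pattern only deletes the link itself, so the surrounding whitespace is left behind and may need to be collapsed afterwards.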

## Removal of HTML Tags

Another common preprocessing technique that comes in handy in multiple places is the removal of HTML tags. This is especially useful if we scrape the data from different websites, since we might end up with HTML strings as part of our text.

First, let us try to remove the HTML tags using regular expressions. 
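A minimal sketch of the regular-expression approach, on a made-up HTML string. The name `remove_html_regex` is chosen here to avoid clashing with the `remove_html` function defined earlier in this notebook.

```python
import re

# Non-greedy match of anything between "<" and the next ">"
HTML_TAG_PATTERN = re.compile(r"<.*?>")

def remove_html_regex(text):
    """Strip HTML tags with a simple regular expression."""
    return HTML_TAG_PATTERN.sub("", text)

sample_html = "<div><h1>Quarterly report</h1><p>Profit <b>rose</b> in 2004.</p></div>"
plain = remove_html_regex(sample_html)
```

This naive pattern leaves no space where a tag was removed (the heading and paragraph text run together), does not decode entities such as `&amp;`, and misses tags that span several lines.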

We can also use the `BeautifulSoup` package to get the text from an HTML document in a more elegant way, as the `remove_html` function defined in an earlier cell does.

## Chat Words Conversion

This is an important text preprocessing step if we are dealing with chat data. People use a lot of abbreviated words in chat, so it can be helpful to expand those words for our analysis.

A good list of chat slang words is available in this [repo](https://github.com/rishabhverma17/sms_slang_translator/blob/master/slang.txt). We can use it for our conversion here, and we can add more words to the list.

In [20]:
chat_words_str = """
AFAIK=As Far As I Know
AFK=Away From Keyboard
ASAP=As Soon As Possible
ATK=At The Keyboard
ATM=At The Moment
A3=Anytime, Anywhere, Anyplace
BAK=Back At Keyboard
BBL=Be Back Later
BBS=Be Back Soon
BFN=Bye For Now
B4N=Bye For Now
BRB=Be Right Back
BRT=Be Right There
BTW=By The Way
B4=Before
B4N=Bye For Now
CU=See You
CUL8R=See You Later
CYA=See You
FAQ=Frequently Asked Questions
FC=Fingers Crossed
FWIW=For What It's Worth
FYI=For Your Information
GAL=Get A Life
GG=Good Game
GN=Good Night
GMTA=Great Minds Think Alike
GR8=Great!
G9=Genius
IC=I See
ICQ=I Seek you (also a chat program)
ILU=ILU: I Love You
IMHO=In My Honest/Humble Opinion
IMO=In My Opinion
IOW=In Other Words
IRL=In Real Life
KISS=Keep It Simple, Stupid
LDR=Long Distance Relationship
LMAO=Laugh My A.. Off
LOL=Laughing Out Loud
LTNS=Long Time No See
L8R=Later
MTE=My Thoughts Exactly
M8=Mate
NRN=No Reply Necessary
OIC=Oh I See
PITA=Pain In The A..
PRT=Party
PRW=Parents Are Watching
ROFL=Rolling On The Floor Laughing
ROFLOL=Rolling On The Floor Laughing Out Loud
ROTFLMAO=Rolling On The Floor Laughing My A.. Off
SK8=Skate
STATS=Your sex and age
ASL=Age, Sex, Location
THX=Thank You
TTFN=Ta-Ta For Now!
TTYL=Talk To You Later
U=You
U2=You Too
U4E=Yours For Ever
WB=Welcome Back
WTF=What The F...
WTG=Way To Go!
WUF=Where Are You From?
W8=Wait...
7K=Sick:-D Laugher
"""

In [21]:
chat_words_map_dict = {}
chat_words_list = []
for line in chat_words_str.split("\n"):
    if line != "":
        cw = line.split("=")[0]
        cw_expanded = line.split("=")[1]
        chat_words_list.append(cw)
        chat_words_map_dict[cw] = cw_expanded
chat_words_list = set(chat_words_list)

def chat_words_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words_list:
            new_text.append(chat_words_map_dict[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

# Apply to the whole column at once instead of assigning row by row
df["ABC"] = df["ABC"].apply(chat_words_conversion)

In [23]:
#chat_words_conversion("imo this is awesome")
print(df['ABC'][12])

growth japan evapor three month septemb spark renew concern economi long decadelong trough output period grow 01 annual rate 03 export usual engin recoveri falter domest demand stay subdu corpor invest fell short growth fall well short expect mark sixth straight quarter expans economi stagnat throughout 1990 experienc brief spurt expans amid long period doldrum result deflat price fall rather rise make japanes shopper cautiou keep spend effect leav economi depend ever export recent recoveri high oil price knock 02 growth rate fall dollar mean product ship becom rel expens perform third quarter mark sharp downturn earlier year quarter show annual growth 63 second show 11 economist predict much 2 time around export slow capit spend becam weaker hiromichi shirakawa chief economist ub secur tokyo person consumpt look good mainli due temporari factor olymp amber light flash govern may find difficult rais tax polici implement economi pick help deal japan massiv public debt


In [24]:
# Build one [row_id, cleaned_text, original_text, category] record per document
l = [
    [i, cleaned, original, category]
    for i, (cleaned, original, category) in enumerate(
        zip(df["ABC"], df["text"], full_df["category"])
    )
]

In [25]:
print(l[6])

[6, 'close associ former yuko bos mikhail khodorkovski tell court fraud charg level fals platon lebedev trial alongsid khodorkovski sinc june case centr around privatis fertilis firm pair claim punish author polit ambit khodorkovski lebedev absurd contradict case open defenc could see legal basi charg face includ alleg tax evas embarrass could understand file complaint tell moscow court lebedev head menatep group parent compani yuko lebedev khodorkovski face possibl 10 year jail sentenc convict question judg next day khodorkovski begin testimoni last week tell court object way run normal busi present work crimin fiction charg see support polit motiv part drive russian presid vladimir putin rein countri superrich busi leader socal oligarch yuko present 275bn â£13bn tax demand russian author key yugansk divis auction part settl bill compani effort gain bankruptci protect bid win damag sale dismiss court texa', '[\'A close associate of former Yukos boss Mikhail Khodorkovsky has told a cou

**More to come. Stay tuned!**

In [26]:
import pickle

# Use a context manager so the file is closed after writing
with open("JESAN_2.0.p", "wb") as f:
    pickle.dump(l, f)

In [14]:
unpickled_data = pd.read_pickle("Amazon_full.p")
unpickled_data[1:34]

[[1,
  'bulk alway le expens way go product like',
  'Bulk is always the less expensive way to go for products like these',
  'Health & Beauty'],
 [2,
  'well duracel happi',
  'Well they are not Duracell but for the price i am happy.',
  'Health & Beauty'],
 [3,
  'seem work well name brand much well',
  'Seem to work as well as name brand batteries at a much better price',
  'Health & Beauty'],
 [4,
  'long last',
  'These batteries are very long lasting the price is great.',
  'Health & Beauty'],
 [5,
  'lot christma amazonbas cell havent notic differ brand name basic brand lot easy purchas arriv hous hand buy',
  "Bought a lot of batteries for Christmas and the AmazonBasics Cell have been good. I haven't noticed a difference between the brand name batteries and the Amazon Basic brand. Just a lot easier to purchase and have arrive at the house and have on hand. Will buy again.",
  'Health & Beauty'],
 [6,
  'ive problam order past plea',
  'ive not had any problame with these batter

In [15]:
arr = np.array(unpickled_data)  # numpy was imported as np in the first cell

In [17]:
dataframe = pd.DataFrame(arr)
dataframe.to_csv("data11.csv")
nlinesfile = 28300
nlinesrandomsample = 5000
# Randomly pick the row numbers to skip; numbering starts at 1 so the header row (0) is kept
lines2skip = np.random.choice(np.arange(1, nlinesfile + 1), nlinesfile - nlinesrandomsample, replace=False)
#t = pd.read_csv(filename, sep=';',encoding='utf-8', skiprows=lines2skip)

In [18]:
add=pd.read_csv("data11.csv",encoding='utf-8', skiprows=lines2skip)

In [19]:
add.head()

Unnamed: 0.1,Unnamed: 0,0,1,2,3
0,14,14,mani thing need aa batteri,we have many things that need aa battery they ...,Health & Beauty
1,15,15,thank abl find even good ship arriv perfect co...,Thankful that I was able to find on Amazon for...,Health & Beauty
2,17,17,opinion last anywher near long duracel thing l...,In my opinion these did not last anywhere near...,Health & Beauty
3,26,26,job although give 4star would say hand full st...,These Amazon batteries did the job although I ...,Health & Beauty
4,29,29,light thought fit light arriv nice compani nee...,these were under a light we thought they were ...,Health & Beauty
