    ** This notebook consists of my work on pre-processing the dataset for modeling.
    ** Some of the common text preprocessing / cleaning steps are:

                Lower casing
                Removal of Punctuations
                Removal of Stopwords
                Removal of Frequent words
                Removal of Rare words
                Stemming
                Lemmatization
                Removal of emojis
                Removal of emoticons
                Conversion of emoticons to words
                Conversion of emojis to words
                Removal of URLs
                Removal of HTML tags
                Chat words conversion
                Spelling correction
                
    ** These are the different text preprocessing steps that can be applied to text data.
      However, not all of them are needed every time.
      The steps should be chosen carefully based on the use case, since that choice also plays an important role.
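
    ** Since only some of the steps are needed for a given use case, one way to keep the choice explicit is to compose the selected steps into a single function. The sketch below is an illustration only, not the code used later in this notebook; the helper names (`build_pipeline`, `strip_html`, `strip_urls`, `strip_punctuation`) and the regular expressions are assumptions chosen for brevity.

```python
import re
import string


def to_lower(text):
    return text.lower()


def strip_html(text):
    # Naive tag removal; a real HTML parser is more robust
    return re.sub(r"<[^>]+>", " ", text)


def strip_urls(text):
    return re.sub(r"https?://\S+|www\.\S+", " ", text)


def strip_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))


def build_pipeline(*steps):
    """Return a function that applies the chosen steps in order."""
    def run(text):
        for step in steps:
            text = step(text)
        # Collapse the whitespace left behind by removed tokens
        return " ".join(text.split())
    return run


# Pick only the steps that the use case needs, in the order they should run
clean = build_pipeline(strip_html, strip_urls, to_lower, strip_punctuation)
print(clean("Visit <b>https://example.com</b> NOW!"))
```

    ** Adding or dropping a step then means editing one argument list rather than several cells.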

# Load Libraries 

In [5]:
import numpy as np
import pandas as pd
import re
import nltk
import spacy
import string
from nltk.corpus import stopwords
from collections import Counter
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from spellchecker import SpellChecker

In [12]:
import swifter

In [28]:
import warnings
import matplotlib.pyplot as plt
%matplotlib inline
warnings.filterwarnings("ignore")
pd.set_option('display.max_colwidth', 999)
pd.set_option('display.max_columns', 999)
pd.set_option("display.max_rows", 999)

# Load Data

In [8]:
train = pd.read_csv("C:\\Users\\Zeus\\Downloads\\HackerEarth\\dataset\\hm_train.csv")

In [9]:
print(train.shape)
train.head()

(60321, 5)


Unnamed: 0,hmid,reflection_period,cleaned_hm,num_sentence,predicted_category
0,27673,24h,I went on a successful date with someone I fel...,1,affection
1,27674,24h,I was happy when my son got 90% marks in his e...,1,affection
2,27675,24h,I went to the gym this morning and did yoga.,1,exercise
3,27676,24h,We had a serious talk with some friends of our...,2,bonding
4,27677,24h,I went with grandchildren to butterfly display...,1,affection


In [21]:
test = pd.read_csv("C:\\Users\\Zeus\\Downloads\\HackerEarth\\dataset\\hm_test.csv")
print(test.shape)
test.head()

(40213, 4)


Unnamed: 0,hmid,reflection_period,cleaned_hm,num_sentence
0,88305,3m,I spent the weekend in Chicago with my friends.,1
1,88306,3m,We moved back into our house after a remodel. ...,2
2,88307,3m,My fiance proposed to me in front of my family...,1
3,88308,3m,I ate lobster at a fancy restaurant with some ...,1
4,88309,3m,I went out to a nice restaurant on a date with...,5


# Pre-Processing

## Predicted_category

In [10]:
train.predicted_category.value_counts()

affection           20880
achievement         20274
bonding              6561
enjoy_the_moment     6508
leisure              4242
nature               1127
exercise              729
Name: predicted_category, dtype: int64

In [11]:
target_mapping = {
    'affection': 1,
    'achievement': 2,
    'bonding': 3,
    'enjoy_the_moment': 4,
    'leisure': 5,
    'nature': 6,
    'exercise': 7
}
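
    ** The integer codes above follow the order of the class frequencies. As a hedged alternative to typing the dictionary by hand, the same kind of mapping can be derived from the label counts. The `labels` list below is a made-up stand-in for `train.predicted_category`, and ties are broken by first appearance.

```python
from collections import Counter

# Hypothetical stand-in for train.predicted_category
labels = ["affection", "achievement", "affection", "bonding",
          "achievement", "affection"]

counts = Counter(labels)

# Most frequent label gets 1, the next gets 2, and so on
target_mapping = {
    label: code
    for code, (label, _) in enumerate(counts.most_common(), start=1)
}

targets = [target_mapping[label] for label in labels]
print(target_mapping)
print(targets)
```

    ** With pandas, `train.predicted_category.map(target_mapping)` applies such a dictionary without a lambda.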

In [16]:
train['target'] = train.predicted_category.swifter.apply(lambda x:target_mapping[x]).copy()

In [18]:
train.dtypes

hmid                   int64
reflection_period     object
cleaned_hm            object
num_sentence           int64
predicted_category    object
target                 int64
dtype: object

## reflection_period

In [19]:
train.reflection_period.value_counts()

24h    30455
3m     29866
Name: reflection_period, dtype: int64

In [22]:
test.reflection_period.value_counts()

3m     20837
24h    19376
Name: reflection_period, dtype: int64

## cleaned_hm

In [24]:
train['cleaned_hm'] = train.cleaned_hm.astype(str).copy()

In [25]:
train.head()

Unnamed: 0,hmid,reflection_period,cleaned_hm,num_sentence,predicted_category,target
0,27673,24h,I went on a successful date with someone I fel...,1,affection,1
1,27674,24h,I was happy when my son got 90% marks in his e...,1,affection,1
2,27675,24h,I went to the gym this morning and did yoga.,1,exercise,7
3,27676,24h,We had a serious talk with some friends of our...,2,bonding,3
4,27677,24h,I went with grandchildren to butterfly display...,1,affection,1


### Lower Case 

In [33]:
train["pre_processed_clean_hm"] = train["cleaned_hm"].str.lower().copy()

### Removal Of Punctuations

In [34]:
def remove_punctuation(text):
    """custom function to remove the punctuation"""
    PUNCT_TO_REMOVE = string.punctuation
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

In [35]:
train["pre_processed_clean_hm"] = train["pre_processed_clean_hm"].swifter.apply(
    lambda text: remove_punctuation(text))

### Removal Of Stop-words

In [37]:
# Build the stop-word set once instead of on every call
STOPWORDS = set(stopwords.words('english'))


def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join(
        [word for word in str(text).split() if word not in STOPWORDS])
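
    ** One thing to keep in mind with this ordering: punctuation has already been stripped, so contractions such as "don't" have become "dont" and may no longer match entries in the stop-word list. The sketch below illustrates the effect with a small made-up stop-word set (`TOY_STOPWORDS` is an assumption, not the NLTK list).

```python
import string

# Hypothetical subset of a stop-word list, for illustration only
TOY_STOPWORDS = {"i", "don't", "to", "the"}


def strip_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))


def drop_stopwords(text):
    return " ".join(w for w in text.split() if w not in TOY_STOPWORDS)


text = "i don't go to the gym"

# Punctuation first: "don't" becomes "dont" and survives the filter
punct_first = drop_stopwords(strip_punctuation(text))

# Stop-words first: "don't" is removed before punctuation is stripped
stop_first = strip_punctuation(drop_stopwords(text))

print(punct_first)
print(stop_first)
```

    ** Whether the leftover tokens matter depends on the model; the point is only that the two orders are not equivalent.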

In [40]:
train["pre_processed_clean_hm"] = train["pre_processed_clean_hm"].swifter.apply(
    lambda text: remove_stopwords(text))

### Removal of Frequent & Rare words

In [42]:
cnt = Counter()
for text in train["pre_processed_clean_hm"].values:
    for word in text.split():
        cnt[word] += 1
del text,word        
cnt.most_common(10)

[('happy', 11877),
 ('got', 8107),
 ('made', 7208),
 ('went', 5990),
 ('time', 5555),
 ('new', 5209),
 ('work', 4673),
 ('day', 4506),
 ('last', 3768),
 ('friend', 3568)]

In [43]:
set([w for (w, wc) in cnt.most_common()[:-10-1:-1]])

{'acquit',
 'cashstrapped',
 'exonerate',
 'fabiola',
 'netting',
 'ought',
 'spout',
 'thumping',
 'willeford',
 'wondrous'}

    ** Looking at the most frequent and the rarest words, I will skip this pre-processing step, as these words carry important information and will be helpful during the word embedding process.
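
    ** For reference, if the step were applied, it could look roughly like the sketch below. The toy `docs` list, the cut-offs (the single most frequent word, words seen only once) and the helper name `remove_words` are assumptions for illustration.

```python
from collections import Counter


def remove_words(text, words_to_drop):
    """Drop every token that appears in words_to_drop."""
    return " ".join(w for w in text.split() if w not in words_to_drop)


# Hypothetical stand-in for train["pre_processed_clean_hm"]
docs = ["happy day work", "happy new friend", "happy day wondrous"]

counts = Counter(w for d in docs for w in d.split())

# Assumed cut-offs: the single most frequent word, and words seen once
frequent = {w for w, _ in counts.most_common(1)}
rare = {w for w, c in counts.items() if c == 1}

cleaned = [remove_words(d, frequent | rare) for d in docs]
print(cleaned)
```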

### Lemmatization

    ** Lemmatization is similar to stemming in that it reduces inflected words to a base form, but it differs in making sure that the resulting root word (also called the lemma) belongs to the language.
    As a result, it is generally slower than stemming. However, in my experience lemmatization works better than stemming, so I opt for it here.

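    ** Also, in this context good || better || best may be associated with different target classes, so it helps that lemmatization keeps such variants as valid words. Note that a POS-aware WordNet lemmatizer can still merge some of them (for example, "better" tagged as an adjective is mapped to "good"), so the lemmatized output is worth spot-checking.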
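
    ** The difference between the two approaches can be illustrated with a deliberately simplified sketch. `toy_stem` and `toy_lemmatize` below are hypothetical stand-ins, not the NLTK algorithms: the first strips suffixes by rule, the second looks words up in a small hand-written dictionary.

```python
def toy_stem(word):
    """Rule-based suffix stripping; the result need not be a real word."""
    for suffix in ("ing", "ies", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lemma dictionary; a real lemmatizer uses a full lexicon
TOY_LEMMAS = {
    "studies": "study",
    "went": "go",
    "hanging": "hang",
    "friends": "friend",
}


def toy_lemmatize(word):
    """Dictionary lookup; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)


words = ["studies", "went", "hanging", "friends"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

    ** The stemmer turns "studies" into the non-word "stud" and cannot relate "went" to "go", while the lookup returns dictionary forms for both.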

In [49]:
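def lemmatize_words(text):
    """custom function to lemmatize words using their POS tags"""
    lemmatizer = WordNetLemmatizer()
    # Map the first letter of the Penn Treebank tag to a WordNet POS
    wordnet_map = {
        "N": wordnet.NOUN,
        "V": wordnet.VERB,
        "J": wordnet.ADJ,
        "R": wordnet.ADV
    }
    pos_tagged_text = nltk.pos_tag(text.split())
    # Fall back to NOUN when the tag has no WordNet equivalent
    return " ".join([
        lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN))
        for word, pos in pos_tagged_text
    ])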

In [52]:
train["pre_processed_clean_hm"] = train["pre_processed_clean_hm"].swifter.apply(
    lambda text: lemmatize_words(text))

### Emoji Stuff

    ** Emoji-related pre-processing is not required for this dataset.

### Chat Word Conversion 

    ** People may have used chat-style abbreviations while writing their moments, so these are expanded to their full forms.

In [54]:
chat_words_str = """
AFAIK=As Far As I Know
AFK=Away From Keyboard
ASAP=As Soon As Possible
ATK=At The Keyboard
ATM=At The Moment
A3=Anytime, Anywhere, Anyplace
BAK=Back At Keyboard
BBL=Be Back Later
BBS=Be Back Soon
BFN=Bye For Now
B4N=Bye For Now
BRB=Be Right Back
BRT=Be Right There
BTW=By The Way
B4=Before
CU=See You
CUL8R=See You Later
CYA=See You
FAQ=Frequently Asked Questions
FC=Fingers Crossed
FWIW=For What It's Worth
FYI=For Your Information
GAL=Get A Life
GG=Good Game
GN=Good Night
GMTA=Great Minds Think Alike
GR8=Great!
G9=Genius
IC=I See
ICQ=I Seek you (also a chat program)
ILU=I Love You
IMHO=In My Honest/Humble Opinion
IMO=In My Opinion
IOW=In Other Words
IRL=In Real Life
KISS=Keep It Simple, Stupid
LDR=Long Distance Relationship
LMAO=Laugh My A.. Off
LOL=Laughing Out Loud
LTNS=Long Time No See
L8R=Later
MTE=My Thoughts Exactly
M8=Mate
NRN=No Reply Necessary
OIC=Oh I See
PITA=Pain In The A..
PRT=Party
PRW=Parents Are Watching
ROFL=Rolling On The Floor Laughing
ROFLOL=Rolling On The Floor Laughing Out Loud
ROTFLMAO=Rolling On The Floor Laughing My A.. Off
SK8=Skate
STATS=Your sex and age
ASL=Age, Sex, Location
THX=Thank You
TTFN=Ta-Ta For Now!
TTYL=Talk To You Later
U=You
U2=You Too
U4E=Yours For Ever
WB=Welcome Back
WTF=What The F...
WTG=Way To Go!
WUF=Where Are You From?
W8=Wait...
7K=Sick:-D Laugher
"""

In [56]:
chat_words_map_dict = {}
chat_words_list = []
for line in chat_words_str.split("\n"):
    if line != "":
        cw = line.split("=")[0]
        cw_expanded = line.split("=")[1]
        chat_words_list.append(cw)
        chat_words_map_dict[cw] = cw_expanded
chat_words_list = set(chat_words_list)
del line,cw,cw_expanded

In [57]:
def chat_words_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words_list:
            new_text.append(chat_words_map_dict[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)
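
    ** Because the text is already lower-cased, ordinary words such as "kiss" or "atm" also match entries in the abbreviation list once upper-cased. A possible guard, sketched under the assumption that a hand-picked set of ambiguous keys is acceptable, is to skip those keys. `AMBIGUOUS` and `expand_chat_words` are hypothetical names, and the small `CHAT_WORDS` dictionary is a stand-in for `chat_words_map_dict`.

```python
# Small stand-in for chat_words_map_dict
CHAT_WORDS = {
    "BRB": "Be Right Back",
    "GR8": "Great!",
    "KISS": "Keep It Simple, Stupid",
    "ATM": "At The Moment",
}

# Hypothetical: abbreviations that are also ordinary lower-case words
AMBIGUOUS = {"KISS", "ATM"}


def expand_chat_words(text, mapping=CHAT_WORDS, skip=AMBIGUOUS):
    """Expand chat abbreviations, leaving ambiguous keys untouched."""
    new_text = []
    for w in text.split():
        key = w.upper()
        if key in mapping and key not in skip:
            new_text.append(mapping[key])
        else:
            new_text.append(w)
    return " ".join(new_text)


print(expand_chat_words("gave wife kiss brb"))
print(expand_chat_words("found atm gr8 day"))
```

    ** Which keys count as ambiguous is a judgment call that depends on the corpus.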

In [58]:
train["pre_processed_clean_hm"] = train["pre_processed_clean_hm"].swifter.apply(
    lambda text: chat_words_conversion(text))

### Spell Checker 

    Another important text preprocessing step is spelling correction. Typos are common in text data, and we might want to correct them before doing our analysis.

In [60]:
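# Create the checker once to avoid reloading its dictionary on every call
spell = SpellChecker()


def correct_spellings(text):
    """custom function to correct misspelled words"""
    corrected_text = []
    words = text.split()
    misspelled_words = spell.unknown(words)
    for word in words:
        if word in misspelled_words:
            # correction() can return None in recent pyspellchecker
            # versions; keep the original word in that case
            corrected_text.append(spell.correction(word) or word)
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)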
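
    ** Spell correction can be slow on 60,321 rows, and the same misspelling tends to recur across rows, so caching the result per distinct word may help. The sketch below shows the idea with the standard library only: `difflib.get_close_matches` is swapped in for `SpellChecker`, and `VOCAB` is a made-up vocabulary, so the corrections themselves are not comparable to pyspellchecker's.

```python
import difflib
from functools import lru_cache

# Hypothetical vocabulary standing in for a real dictionary
VOCAB = ("happy", "friend", "morning", "examination")


@lru_cache(maxsize=None)
def correct_word(word):
    """Return the closest vocabulary word, or the word itself if none is close."""
    if word in VOCAB:
        return word
    matches = difflib.get_close_matches(word, VOCAB, n=1, cutoff=0.75)
    return matches[0] if matches else word


def correct_text(text):
    # Each distinct word is corrected once; repeats are served from the cache
    return " ".join(correct_word(w) for w in text.split())


print(correct_text("hapy freind morning xyz"))
print(correct_word.cache_info())
```

    ** The same caching pattern could wrap `spell.correction`, provided the `None` return value is handled.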

In [None]:
train["pre_processed_clean_hm"] = train["pre_processed_clean_hm"].swifter.apply(
    lambda text: correct_spellings(text))

In [59]:
train.head()

Unnamed: 0,hmid,reflection_period,cleaned_hm,num_sentence,predicted_category,target,pre_processed_clean_hm
0,27673,24h,I went on a successful date with someone I felt sympathy and connection with.,1,affection,1,go successful date someone felt sympathy connection
1,27674,24h,I was happy when my son got 90% marks in his examination,1,affection,1,happy son get 90 mark examination
2,27675,24h,I went to the gym this morning and did yoga.,1,exercise,7,go gym morning yoga
3,27676,24h,We had a serious talk with some friends of ours who have been flaky lately. They understood and we had a good evening hanging out.,2,bonding,3,serious talk friend flaky lately understood good evening hanging
4,27677,24h,I went with grandchildren to butterfly display at Crohn Conservatory\r\r\n,1,affection,1,go grandchild butterfly display crohn conservatory
