# **Jigsaw Unintended Bias: Text Preprocessing and Vectorization**

# **Contents**
### 1. Load Pre-trained GloVe Word Embeddings
### 2. Load Data from Kaggle
### 3. Preprocess Text Data
### 4. Stratified Data Splitting
### 5. Text Data Vectorization
### 6. References

In [None]:
!pip install contractions
!pip install demoji
!pip install nltk
!pip install wordninja
!pip install gensim
!pip install pandarallel

# !pip install swifter
# !pip install symspellpy
# !pip install pyspellchecker
# !pip install modin
# !pip install ray


In [None]:
#import dependencies
import pandas as pd 
import os

# import modin.pandas as modpd
# os.environ["MODIN_ENGINE"] = "ray"

from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)

import numpy as np 
from zipfile import ZipFile
import random
import re 
import pickle
import demoji
import contractions
import warnings
import time
import swifter
import wordninja

import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from spellchecker import SpellChecker
from tqdm.notebook import tqdm
tqdm.pandas()

warnings.filterwarnings('ignore')

INFO: Pandarallel will run on 10 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/chirag_pritmani24/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /home/chirag_pritmani24/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/chirag_pritmani24/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/chirag_pritmani24/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/chirag_pritmani24/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## **1. Load Pre-trained GloVe Word Embeddings**

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

In [None]:
!unzip glove*.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [None]:
# code snippet from - https://stackoverflow.com/questions/50060241/how-to-use-glove-word-embeddings-file-on-google-colaboratory
print('Indexing word vectors.')

# map each word to its 100-dimensional GloVe vector
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Indexing word vectors.
Found 400000 word vectors.


In [None]:
# pickle.dump(embeddings_index, open("glove.pkl", "wb"))
embeddings_index = pickle.load(open('glove.pkl', 'rb'))

## **2. Load Data from Kaggle**
**(url - https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/data)**

In [None]:
!wget --header="Host: storage.googleapis.com" --header="User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36" --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" --header="Accept-Language: en-US,en;q=0.9" --header="Referer: https://www.kaggle.com/" "https://storage.googleapis.com/kaggle-competitions-data/kaggle-v2/12500/1375107/bundle/archive.zip?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1669256019&Signature=mOndRSQw%2BLYlOpV7qcpmddpT9c%2F97ueFt3k7y1vsJhAM67QWGTm8J7j4cpmbk%2FNST8XK1%2FaQWNmbMzwBz4nQOoPwXbSOt9CUzZ9%2FSLKuy%2FLBYYl7J0bdFYmNRoz09DyhOhJx%2BzkR9BGdoQPnJtKMIVE41JEHXePjDJ0JjwYyL2MNlM%2FGWSmNN9N8JIgfivaL0ugPHJVq7Jcb5ber69oGx00vte8kekZowLgjPCrGwEbRZ%2F7EnukFTe4h9xUpOXwdT9ALM9C39aZQuAhM3JVlXX2dVrtP4%2FgF3slDrkJJMiWiiqFXHHltr7z39qFmRpsS4uXQtGOxljfBsG1sJ7TZjA%3D%3D&response-content-disposition=attachment%3B+filename%3Djigsaw-unintended-bias-in-toxicity-classification.zip" -c -O 'jigsaw-unintended-bias-in-toxicity-classification.zip'

--2022-11-21 02:13:58--  https://storage.googleapis.com/kaggle-competitions-data/kaggle-v2/12500/1375107/bundle/archive.zip?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1669256019&Signature=mOndRSQw%2BLYlOpV7qcpmddpT9c%2F97ueFt3k7y1vsJhAM67QWGTm8J7j4cpmbk%2FNST8XK1%2FaQWNmbMzwBz4nQOoPwXbSOt9CUzZ9%2FSLKuy%2FLBYYl7J0bdFYmNRoz09DyhOhJx%2BzkR9BGdoQPnJtKMIVE41JEHXePjDJ0JjwYyL2MNlM%2FGWSmNN9N8JIgfivaL0ugPHJVq7Jcb5ber69oGx00vte8kekZowLgjPCrGwEbRZ%2F7EnukFTe4h9xUpOXwdT9ALM9C39aZQuAhM3JVlXX2dVrtP4%2FgF3slDrkJJMiWiiqFXHHltr7z39qFmRpsS4uXQtGOxljfBsG1sJ7TZjA%3D%3D&response-content-disposition=attachment%3B+filename%3Djigsaw-unintended-bias-in-toxicity-classification.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.107.128, 74.125.196.128, 142.251.162.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.107.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 758542209 (723M) [application/zip]
Saving

In [None]:
with ZipFile('jigsaw-unintended-bias-in-toxicity-classification.zip') as z:
    z.extractall()

In [None]:
train = pd.read_csv('train.csv', usecols=['comment_text', 'target'])
# binarize the target: a toxicity score >= 0.5 is labelled 1, everything else 0
train['target'] = (train['target'] >= 0.5).astype('int8')

print(train.shape)
train.head()

(1804874, 2)


Unnamed: 0,target,comment_text
0,0,"This is so cool. It's like, 'would you want yo..."
1,0,Thank you!! This would make my life a lot less...
2,0,This is such an urgent design problem; kudos t...
3,0,Is this something I'll be able to install on m...
4,1,haha you guys are a bunch of losers.


In [None]:
test = pd.read_csv('test.csv', usecols=['comment_text'])
print(test.shape)
test.head()

(97320, 1)


Unnamed: 0,comment_text
0,[ Integrity means that you pay your debts.]\n\...
1,This is malfeasance by the Administrator and t...
2,@Rmiller101 - Spoken like a true elitist. But ...
3,"Paul: Thank you for your kind words. I do, in..."
4,Sorry you missed high school. Eisenhower sent ...


## **3. Preprocess Text Data**

## **3.1 Create Sample Dataframe**

In [None]:
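# .copy() gives an independent dataframe, so adding columns later does not trigger SettingWithCopyWarning
sample = train.iloc[:10000].copy()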

## **3.2 Lowercase and Expand Contracted Terms**

In [None]:
def lowercase(text):
    text = text.lower()
    return text

In [None]:
x = " ".join(sample['comment_text'])
re.findall(r"\w+'\w+", x)[:10]

["It's",
 "don't",
 "I'll",
 "I'd",
 "It's",
 "there's",
 "they're",
 "don't",
 "They're",
 "I'll"]

In [None]:
contracted_words = ["It's","don't","I'll","I'd","It's","there's","they're","don't","They're","I'll"]
for i in contracted_words:
    print(contractions.fix(i))

It is
do not
I will
I would
It is
there is
they are
do not
They are
I will


In [None]:
def expand_terms(text):
    text = contractions.fix(text)
    return text 

In [None]:
t= "Hey I'll be there by 9 and you'll be reaching by?"
expand_terms(t)

'Hey I will be there by 9 and you will be reaching by?'

In [None]:
def strip_spaces(text):
    text = re.sub(r'\s{2,}', ' ', text)
    text = text.strip()
    return text
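
Note that `strip_spaces` only collapses runs of two or more whitespace characters, so a lone tab or newline between two words is left in place. A minimal sketch of a stricter variant, assuming every whitespace run should become a single space (`collapse_whitespace` is a hypothetical name, not part of this notebook):

```python
import re

def collapse_whitespace(text):
    # Hypothetical variant: replace every run of whitespace (including a
    # single tab or newline) with one space, then trim both ends.
    return re.sub(r'\s+', ' ', text).strip()

example = "Hey\tI'll be\nthere   by 9"
two_or_more = re.sub(r'\s{2,}', ' ', example).strip()  # same pattern as strip_spaces
one_or_more = collapse_whitespace(example)

print(repr(two_or_more))  # the tab and the newline survive
print(repr(one_or_more))  # every run is now a single space
```

Which behaviour is preferable depends on whether later steps tokenize on whitespace anyway; if they do, the difference is mostly cosmetic.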

In [None]:
t= "       Hey          I'll be   there   by 9 and you'll be    reaching by  ?"
strip_spaces(t)

"Hey I'll be there by 9 and you'll be reaching by ?"

## **3.3 Handling HTML Tags**

In [None]:
sample['has_tags'] = sample['comment_text'].apply(lambda x:1 if len(re.findall('<.*?>', x))!=0 else 0)
sample[sample['has_tags']==1].head()

Unnamed: 0,target,comment_text,has_tags
469,0,I worry that pursuing density in the current m...,1
470,0,"""...I think that others understand that while ...",1
472,0,"Beyond that, the Comp Plan is a massive and co...",1
1373,0,<i>“Public space creates and increases conscio...,1
1594,0,Good thing that garbage isn't written all over...,1


In [None]:
x = " ".join(sample.loc[sample['has_tags']==1, 'comment_text'])
re.findall('<.*?>', x)[:20]

['<continued>',
 '<continued>',
 '<continued>',
 '<i>',
 '</i>',
 '</sarcasm off>',
 '<sarcasm>',
 '</sarcasm>',
 '<Full moon>',
 '<I>',
 '</I>',
 '<a href="http://www.wheelchairindia.com/39/ITC-Hearing-Aids">',
 '</a>',
 '< N word>',
 '<b>',
 '</b>',
 '<b>',
 '</b>',
 '<b>',
 '</b>']

In [None]:
def handle_html_tags(text):
    # refer - https://stackoverflow.com/questions/2503413/regular-expression-to-stop-at-first-match
    text = re.sub('<.*?>', ' ', text)
    return text 

In [None]:
t = '''<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
<img>
</body>
</html>'''
handle_html_tags(t)

' \n \n \n My First Heading \n My first paragraph. \n \n \n '
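
The `?` in `<.*?>` is what makes the match stop at the first `>`. A small sketch contrasting the greedy and lazy forms on a made-up string (the sample text is an assumption for illustration only):

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: '.*' runs on to the last '>' in the string, so the words
# between the tags are removed together with the tags.
greedy = re.sub(r'<.*>', ' ', text)

# Lazy: '.*?' stops at the first '>', so only the tags themselves go.
lazy = re.sub(r'<.*?>', ' ', text)

print(repr(greedy))
print(repr(lazy))
```

The lazy pattern still treats any angle-bracketed span as a tag, which is why tokens such as `<continued>` or `<Full moon>` from the comments above are removed as well.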

## **3.4 Handling URLs**

In [None]:
sample['has_url'] = sample['comment_text'].apply(lambda y: 1 if len(re.findall(r'(https?://[^\s]+|www\.[^\s]+)', y))!=0 else 0)
sample[sample['has_url']==1].head()

Unnamed: 0,target,comment_text,has_tags,has_url
68,0,I think you left out one very important organi...,0,1
100,0,Loving this collection. Cant wait till Season ...,0,1
117,0,"Richard Ellmyer, check out Agent Bretzing's of...",0,1
123,1,Took this as an opportunity to check back in o...,0,1
151,0,The foster care system has been broken for mor...,0,1


In [None]:
x = " ".join(sample.loc[sample['has_url']==1, 'comment_text'])
re.findall(r'(https?://[^\s]+|www\.[^\s]+)', x)[:10]

['http://nami.multnomah.org/.',
 'http://yeezy-season2.com/',
 'https://www.facebook.com/backcountryhabitat/videos/980875531985426/?theater',
 'http://yardpdx.com/leasing/',
 'https://www.youtube.com/watch?v=c15hy8dXSps',
 'https://youtu.be/T424sWq1SkE',
 'http://newpittsburghcourieronline.com/2013/07/10/more-african-americans-get-involved-in-anthrocon-every-year/',
 'https://multco.us/elections/november-4-2008-measure-no-26-96',
 'http://nation.time.com/2013/11/21/with-legal-weed-comes-hemp-beer/',
 'http://www.theportlanddream.com']

In [None]:
def handle_url(text):
    text = re.sub(r'(https?://[^\s]+|www\.[^\s]+)', ' ', text)
    return text

In [None]:
t = '''Let's check our function on some urls:
https://www.google.com
https://www.youtube.com
https://mail.google.com 
https://www.facebook.com 
https://docs.google.com
https://twitter.com
https://outlook.office.com
https://web.whatsapp.com
https://duckduckgo.com
www.linkedin.com'''
handle_url(t)

"Let's check our function on some urls:\n \n \n  \n  \n \n \n \n \n \n "

## **3.5 Handling Misspelled Words**

In [None]:
import operator 

In [None]:
'''
GloVe embeddings are trained on a huge corpus; the file used here holds 400K word vectors.
We check our corpus against the GloVe vocab to find misspelled or meaningless words and try to treat them.
Note: the GloVe vocab is only a guide for spotting misspelled or meaningless words;
a bunch of valid words are missing from the vocab, so we need to retain them.
'''
# code snippet from - https://www.kaggle.com/code/christofhenkel/how-to-preprocessing-for-glove-part1-eda
def build_vocab(texts, verbose = True):
    vocab = {}
    for text in tqdm(texts, disable = (not verbose)):
        words = word_tokenize(text.lower())
        for word in words:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab

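def check_coverage(vocab,embeddings_index):
    a = {}      # words that have a GloVe vector
    oov = {}    # out-of-vocabulary words and their counts
    k = 0       # number of word occurrences covered by GloVe
    i = 0       # number of word occurrences not covered
    for word in tqdm(vocab):
        try:
            a[word] = embeddings_index[word]
            k += vocab[word]
        except KeyError:
            oov[word] = vocab[word]
            i += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(k / (k + i)))
    # sort the out-of-vocabulary words by count, most frequent first
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x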

In [None]:
sample = train.copy()
sample['preprcsd'] = sample['comment_text'].parallel_apply(lambda x: lowercase(x))
sample['preprcsd'] = sample['preprcsd'].parallel_apply(lambda x: expand_terms(x))
sample['preprcsd'] = sample['preprcsd'].parallel_apply(lambda x: handle_html_tags(x))
sample['preprcsd'] = sample['preprcsd'].parallel_apply(lambda x: handle_url(x))


In [None]:
vocab_words = build_vocab(sample['preprcsd'])
oov = check_coverage(vocab_words,embeddings_index)


Found embeddings for 27.65% of vocab
Found embeddings for  98.98% of all text
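
The two percentages differ so much because the first counts distinct words while the second weights each word by how often it occurs: rare misspellings and invented words make up most of the vocabulary but only a small share of the text. A toy sketch of the same arithmetic, using made-up counts and a made-up set of known words as a stand-in for the GloVe index:

```python
# Toy vocabulary: token -> number of occurrences (made-up counts).
vocab = {'the': 900, 'trump': 80, 'drumpf': 5, 'sb21': 10, 'lgbtqs': 5}
# Hypothetical stand-in for the GloVe index; only membership matters here.
known = {'the', 'trump'}

covered_types = [w for w in vocab if w in known]
covered_tokens = sum(vocab[w] for w in covered_types)
total_tokens = sum(vocab.values())

vocab_coverage = len(covered_types) / len(vocab)   # share of distinct words
text_coverage = covered_tokens / total_tokens      # share of all occurrences

print('Found embeddings for {:.2%} of vocab'.format(vocab_coverage))
print('Found embeddings for {:.2%} of all text'.format(text_coverage))
```

In this toy case only 2 of 5 distinct words are covered, yet they account for 980 of 1000 occurrences.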


In [None]:
# pickle.dump(vocab_words, open("vocab_words_1.pkl", "wb"))
# pickle.dump(oov, open("oov_1.pkl", "wb"))
vocab_words = pickle.load(open("vocab_words_1.pkl", "rb"))
oov = pickle.load(open("oov_1.pkl", "rb"))

In [None]:
top_oov = [(w,c) for w,c in oov if c>=25 and bool(re.search('[A-Za-z]', w))]
top_oov_wrds = [w for w,c in top_oov]
top_oov_wrds[:15]

['you.s',
 "gov't",
 'daca',
 'antifa',
 'alt-right',
 "'the",
 'siemian',
 'sb21',
 'brexit',
 'ok.',
 'anti-trump',
 'imho',
 'drumpf',
 'you.s.',
 'deplorables']

In [None]:
# manually creating a dict with the corrected spellings 
mispelled_dict = {'you.s' : 'you',
    "gov't" : 'government',
    'daca' : 'deferred action for childhood arrivals',
    'antifa' : 'anti fascist',
    'alt-right' : 'alt right',
    "'the" : 'the',
    'sb21' : 'senate bill',
    'brexit': 'british exit',
    'ok.' : 'okay',
    'anti-trump': 'anti trump',
    'imho': 'in my honest opinion',
    'drumpf': 'trump',
    'you.s.' : 'you',
    'deplorables' : 'deplorable',
    'wapo' : 'washington post',
    'selfies': 'selfie',
    'sloter': 'slaughter',
    'hahaha': 'laugh',
    'alt-left': 'alt left',
    'trumpster': 'trump',
    'lmao': 'laughing my ass off',
    'sb91': 'senate bill',
    'trumpcare': 'trump care',
    'bigly': 'hugely',
    '-the': 'the',
    'wavemaker': 'wave maker',
    '-and': 'and',
    'it.': 'it',
    "'we": 'we',
    'trumpism':'trump',
    "'no":'no',
    'djt':'trump',
    "'bout":'about',
    'hahahaha':'laugh',
    "'free":'free',
    'trumpers':'trump',
    'repub':'republican',
    'strawman': 'straw man',
    "'good":'good',
    'trumpsters':'trump supporter',
    'infowars':'info war',
    'trumpian':'trump',
    "'white":'white',
    'onkey':'monkey',
    "'right":'right',
    'you.k':'you',
    't-rump':'trump',
    'hilliary':'hillary',
    'hahahahaha':'laugh',
    'yuge':'huge',
    'libtard':'liberal retard',
    'koolaid':'cool aid',
    'yᴏᴜ':'you',
    "'fake":'fake',
    'pro-trump':'pro trump',
    "'you":'you',
    'libtards':'liberal retard',
    "old":'old',
    '-it':'it',
    '-but':'but',
    'koncerned':'concerned',
    "'new":'new',
    "'it":'it',
    "'not":'not',
    'tfws':'tuition fee waiver scheme',
    'sjws':'social justice warrior',
    '-they':'they',
    '…and':'and',
    'ontariowe':'ontario we',
    'fear-mongering':'fear mongering',
    'them.':'them',
    "'real":'real',
    'justins':'justin',
    'sb-21':'senate bill',
    'trid':'tried',
    '-not':'not',
    "do":'do',
    'trumpkins':'trump',
    'trudope':'trudeau dope',
    "'news":'news',
    'hate-filled':'hate filled',
    "what":'what',
    'obummer':'obama bummer',
    'anti-science': 'anti science',
    'millenials':'millennial',
    'pro-abortion':'pro abortion',
    'trudeaus':'trudeau',
    '-that':'that',
    'star-advertiser':'star advertiser',
    'paycheque':'pay cheque',
    'thedonald':'the donald',
    'donkel':'donkey',
    "qur'an":'quran',
    "'they":'they',
    'pizzagate':'pizza gate',
    'lieberals':'liberal',
    "'civil":'civil',
    'lieberal':'liberal',
    'diverdave':'diver dave',
    'motleycrew':'motley crew',
    'killary':'hillary',
    "'liberal":"liberal",
    'if/when': 'if when',
    "'big":'big',
    "'just":'just',
    'self-driving':'self driving',
    'o.k':'okay',
    "'trump":'trump',
    'crapwell':'crap well',
    'koolaide':'cool aid',
    'rofl':'rolling on floor laughing',
    'butthurt': 'butt hurt',
    'airmiles':'air miles',
    'obomba':'obama',
    'trump.':'trump',
    'fakenews':'fake news',
    'ᴛʜɪs':'this',
    'pro-lifers':'pro lifer',
    'adscam':'ad scam',
    'islamaphobia':'islamophobia',
    'clickbait':'click bait',
    'trumpery':'trump',
    'ғɪʀsᴛ':'first',
    'sɪɢɴɪɴɢ':'signing',
    'mᴀᴋᴇ':'make',
    'ᴇxᴛʀᴀ':'extra',
    'ᴍᴏᴍs':'mom',
    'sᴛᴀʏ-ᴀᴛ-ʜᴏᴍᴇ':'stay at home',
    'sᴛᴜᴅᴇɴᴛs':'student',
    'gʀᴇᴀᴛ':'great',
    'sᴛᴀʀᴛ':'start',
    'man-child':'man child',
    'hiliary':'hillary',
    'cherry-picking':'cherry picking',
    "gov'ts":'government',
    'whataboutism':'what about',
    'trumpettes':'trump',
    'anilca':'alaska national interest lands conservation act',
    'trumpland': 'trump land',
    'ww3':'world war three',
    'alaskas':'alaska',
    'skyofblue':'sky of blue',
    'antifluoridationists': 'anti fluoridationist',
    'trumpty':'trump',
    'hitlery':'hitler hillary',
    "'racist":'racist',
    'trumpists':'trump followers',
    'hahahahahaha':'laugh',
    'peacehealth': 'peace health',
    'post-national':'post national',
    're-post':'repost',
    "'great":'great',
    "'climate":'climate',
    'small-minded':'small minded',
    'kaep':'keep',
    'trumpeteers':'trump follower',
    'soooooo':'so',
    'realdonaldtrump':'real donald trump',
    'f.b.i':'federal bureau of investigation',
    "y'know":'you know',
    'copy-paste':'copy paste',
    'republican-led':'republican led',
    'trump-russia':'trump russia',
    'cmon':'come on',
    '0bama':'obama',
    'obamacare' : 'obama care',
    '0bamacare' : 'obama care',
    'nothingburger':'nothing burger',
    'gaslighting':'gas lighting',
    "'canadian":'canadian',
    'multi-millionaires':'multi millionaire',
    'fact-free':'fact free',
    'anti-business':'anti business',
    'click-bait':'click bait',
    'tax-payer':'tax payer',
    'bullsh':'bullshit',
    'trump/russia':'trump russia',
    'pro-rail':'pro rail',
    'virtue-signalling':'virtue signalling',
    'gun-free':'gun free',
    'cost/benefit':'cost benefit',
    'wrong-headed':'wrong head',
    'pro-death':'pro death',
    'lbgt':'lgbt',
    'pre-vatican':'pre vatican',
    'rediculous':'ridiculous',
    'baby-in-chief':'baby in chief',
    'commentors':'commenter',
    'commentor':'commenter',
    'antifluoridationist':'anti fluoridationist',
    'bald-faced': 'bald faced',
    "'their":'their',
    're-electing':'re elect',
    "'safe":'safe',
    'trumpies':'trump',
    'anti-rail':'anti rail',
    'courtview':'court view',
    'dumbass':'dumb ass',
    "'freedom":'freedom',
    'low-information':'low information',
    'lazee':'lazy',
    'zerohedge':'zero hedge',
    'lungworm':'lung worm',
    'tone-deaf':'tone deaf',
    'politically-correct':'politically correct',
    'onkeys':'monkey',
    'trump-like':'trump like',
    'trumpist':'trump supporter',
    'hypocracy':'hypocrisy',
    'alwaysthere':'always there',
    'anti-fa':'anti fascist',
    'w/you':'with you',
    'w/out':'without',
    'over-priced':'over priced',
    'ruskies':'russian',
    'demorats':'democrat',
    'win/win':'win win',
    'hitliary':'hitler hillary',
    'pro-immigration':'pro immigration',
    'ontarian':'native of ontario',
    "'high":'high',
    "'smart":'smart',
    "'national":'national',
    "'christians":'christian',
    "'evil":'evil',
    'turdeau':'trudeau',
    'romneycare':'romney care',
    "'diversity":'diversity',
    'lock-step':'lock step',
    'finger-pointing':'finger pointing',
    'thanx':'thanks',
    'railfail':'rail fail',
    'frowny':'frown',
    'mmiw':'missing and murdered indigenous women',
    'hahahahahahaha':'laugh',
    'obama/clinton':'obama clinton',
    'huffpo':'huffpost',
    'becouse':'because',
    'wind/solar':'wind solar',
    'lifeofthelay':'life of the lay',
    'drump':'trump',
    'anti-trumpers':'anti trump',
    'altrightpubs':'alt right pub',
    'trumpnuts':'trump nuts',
    'sex-ed':'sex education',
    'railbelt':'rail belt',
    'itsme':'its me',
    'democrates':'democrat',
    'sirencall':'siren call',
    'trumpanzees':'trump supporter',
    'metoo':'me too',
    'pussies':'pussy',
    'staradvertiser':'star advertiser',
    'millenial':'millennial',
    'speach':'speech',
    'wweek':'week',
    'islamophobes':'islamophobe',
    'virtue-signaling':'virtue signalling',
    'anti-vaxxers':'anti vaccination',
    'democraps':'democrat crap',
    'sooooooo':'so',
    'rotflmao':'rolling on the floor laughing my ass off',
    'trumpkin':'trump supporter',
    'anti-obama':'anti obama',
    'lamestream':'lame stream',
    'lololol':'laugh out loud',
    'wealthcare':'wealth care',
    "'islamophobia":'islamophobia',
    'republicants':'republican',
    'supremists':'supremist',
    'bengazi':'benghazi',
    'truthbender':'truth bender',
    'freeheels':'free heels',
    'russiagate':'russia gate',
    'repuglican':'republican',
    "o'bama":'obama',
    'smdh':'shaking my damn head',
    'canadain':'canadian',
    'trumpie':'trump',
    "'catholic":'catholic',
    'trump-':'trump',
    'cleanupeugene':'cleanup eugene',
    'waaay':'way',
    "qu'ran":'quran',
    "'russian":'russian',
    'anti-vaxxer':'anti vaccination',
    'unaffordability':'unaffordable',
    'civilbot':'civil bot',
    'post-vatican':'post vatican',
    'facists':'fascist',
    "'muslim":'muslim',
    'million+':'million plus',
    'justmaybe':'just maybe',
    'shut-up':'shut up',
    'trump-haters': 'trump haters',
    'trumpites':'trump supporter',
    'her/him': 'her or him',
    'strawmen':'straw men',
    'odfw':'oregon department of fish and wildlife',
    'trumpenproletariat':'trump en proletariat',
    'ancestry/race': 'ancestry race',
    'polynesian-hawaiians':'polynesian hawaiian',
    'bitebart':'bite bart',
    '214montreal':'montreal',
    'fiberals':'fake liberal',
    'obama-era':'obama era',
    'dontcha':'do not you',
    "f'n":'fucking',
    'cringeworthy':'cringe worthy',
    'oldbanister':'old banister',
    'quietandeffective':'quiet and effective',
    'lgbts':'lgbtq',
    'selfie' : 'self portrait photograph',  
    'selfies' : 'self portrait photograph',
    'tfsa' : 'tax free savings account',
    'jpii' : 'pope john paul the second',
    'eweb' : 'eugene water and electric board',
    'imua' : 'move forward',
    'sheeple' : 'docile',
    'gofundme' : 'go fund me', 
    'garycrum' : 'gary crum',
    'ummmm' : ' ',
    'putrumpski' : 'putin trump russia',
    'transmountain' : 'trans mountain',
    'chatwood' : 'chat wood', 
    'cuck' : 'cuckold',
    'twitler' : 'twitter hitler',
    'banksters' : 'unethical banker',
    'amirite' : 'am i right',
    'gubmit' : 'stupid government',
    'regressives' : 'regressive',
    'demonrats' : 'democrat insult',
    'trumplethinskin' : 'trumpelthinskin',
    'murica' : 'america',
    'dufus' : 'stupid',
    'bsdetection' : 'bullshit detection',
    'messageo' : 'message',
    'trumpo' : 'trump',
    'republicanlican' : 'republican',
    'hmmmmmm' : ' ',
    'wildish' : 'wild like',
    'errr' : ' ',
    'fakenews' : 'fake news',
    'opps' : ' ',
    'onatrio': 'ontario',
    'undrip': 'united nations declaration on the rights of indigenous people',
    'flotus': 'first lady of the united states',
    'potus' : 'president of the united states',
    'oldgit' : 'old git',
    'anyones' : 'anyone',
    'westernpatriot' : 'western patriot',
    'fricken' : 'freaking',
    'dtrumpview' :'donald trump view',
    'houseless' : 'homeless',
    'gubbermint' : 'stupid government',
    'agirl' : 'a girl',
    'availablel' : 'available',
    'sexists' : 'sexist',
    'crapping' : 'crap',
    'antifa' : 'anti fascist movement',
    'omie' : 'homie',
    'stewartbrian' : 'stewart brian',
    'hapaguy' : 'hawaiian male',
    'dtes' : 'dates',
    'trumpski' : 'trump russia',
    'facepalm' : 'face palm',
    'bandaid' : 'band aid',
    'oooops' : ' ', 
    'waaaay' : 'way', 
    'lespark' : 'lesbian dating site',
    'bafflegab' : 'baffle gab',
    'eyeroll' : 'eye roll',
    'planktown' : 'plank town',
    'wapo' : 'washington post',
    'fricking' : 'freaking', 
    'demonrat' : 'insult to democrat',
    'onyoutube' : 'on youtube',
    'twittler' : 'twitter hitler',
    'mythman' : 'myth man',
    'lindbeck' : 'lind beck',
    'geeze' : 'surprised',
    'ahfc' : 'alaska housing finance corporation',
    'agw' : 'age',
    'kju' : 'kim jong un'}

In [None]:
# terms without vowels have a high chance of being incorrect so let's check that out
novow = [(w,c) for w,c in oov if (not bool(re.search('[aeiou]', w))) and bool(re.search('[a-z]', w))]
novow[:10]

[('sb21', 1848),
 ('w/', 1055),
 ('r-g', 835),
 ('sb91', 804),
 ('m-103', 715),
 ('djt', 531),
 ('pfds', 477),
 ('lw1', 453),
 ('dlnr', 445),
 ('/s', 387)]

In [None]:
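# lgbtq has been written in a lot of wrong/alternate ways, so let's map all of them to a single form
# note: the pattern also catches a few unrelated tokens such as 'nflgtv' and 'mlbtv' (see the output below)
lgbtq_variants = [w for w,c in novow if bool(re.search('(lgb|lbg|ltg|lgt|lbt|ltb|ltq)', w))]
alt = "|".join(lgbtq_variants)
lgbtq_dict = dict([(a,'lgbtq') for a in lgbtq_variants])
alt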

"lbgt|nflgtv|lbgtq|lgbts|lgtb|lgtbq|lgbqt|lgbtqs|lgbtq+|lbgqt|lgbtq2|lltb|lgbtx|lbtgq|lgbtq2s|lgbtqxyz|lgbq|lbtg|lgt.b|lgbttqq2s|lgbtqrsvp|lbgtxxx|mslgbt|lbgt+|lbgq|lbgtqqs2|lgbtqq2s|lgbtxyz|lgbt+|ltgb|lgbtqstxy|lgbtqrs|lgbtqqs2|glbtqs|lgbtqrst|lbgtqxyz|lgbtcm|lgbtqwtf|lgbgtxyz|lgbtqrstlm|2slgbtq+|lbgts|lgbtq7xxtyz|lgbtqrst7xxy|lgbtqrstwx7h|lgbtqh|'glbt|lgbtqwrf|ltgbtq|lbltg|lgbtqxxw22|lbgtqqqqrs|lgbt.|lgbtg|lgbtqq|lgbtq-s|lgbt/|'lgb|mlbtv|lgbt2s|lgbtqbbq|lgbtq2s+|-lgbt|lgbtqqxyz|gltbq|lbtq|lgblt|lgbtqxz|lgbqt2|lbgtqs|lgbtʻs|lgtq|lbgtq2|lgbtqrstxxx|ltbg|lgbq2|lgbtq2+|lbgt.s|lgbtqwxyz'rs|lgtbqx|lgbt2q2s|lgbtqblm|'lgbtq|lgtbqs|lgbtq++|lgbttq|lgtbtq|lgbtqrstzys|lbtt|lgbtqts2|lgbqx|lgbtq2hxztrnklhsp|gbltq|lgbtvq|glbtq+|glbts|ltg-"

In [None]:
mispelled_dict.update(lgbtq_dict)

In [None]:
#slangs taken from - https://www.slicktext.com/blog/2019/02/text-abbreviations-guide/

slangs = '''ROFL:Rolling on the floor laughing
STFU:Shut the fuck up
ICYMI:In case you missed it
TL;DR:Too long did not read
TL DR:Too long did not read
TLDR:Too long did not read
TMI:Too much information
AFAIK:As far as I know
LMK:Let me know
NVM:Nevermind
FTW:For the win
BYOB:Bring your own beer
BOGO:Buy one get one
JK:Just kidding
JW:Just wondering
TGIF:Thank goodness it is Friday
TBH:To be honest
TBF:To be frank
RN:Right now
FUBAR:Fucked up beyond all recognition
BRB:Be right back
ISO:In search of
BRT:Be right there
BTW:By the way
FTFY:Fixed that for you
GG:Good game
BFD:Big freaking deal
IRL:In real life
DAE:Does anyone else
LOL:Laugh out loud
SMH:Shaking my head
NGL:Not gonna lie
BTS:Behind the scenes
IKR:I know right
TTYL:Talk to you later
HMU:Hit me up
FWIW:For what it is worth
IMO:In my opinion
WYD:What are you doing
IMHO:In my humble opinion
IDK:I do not know
IDC:I do not care
IDGAF:I do not give a fuck
NBD:No big deal
TBA:To be announced
TBD:To be decided
AFK:Away from keyboard
ABT:About
IYKYK:If you know you know
B4:Before
BC:Because
JIC:Just in case
FOMO:Fear of missing out
SNAFU:Situation normal all fucked up
GTG:Got to go
G2G:Got to go
H8:Hate
LMAO:Laughing my ass off
IYKWIM:If you know what I mean
MYOB:Mind your own business
POV:Point of view
TLC:Tender loving care
HBD:Happy birthday
W/E:Whatever
WTF:What the fuck
WYSIWYG:What you see is what you get
FWIF:For what it’s worth
TW:Trigger warning
EOD:End of day
FAQ:Frequently asked question
AKA:Also known as
ASAP:As soon as possible
DIY:Do it yourself
LMGTFY:Let me Google that for you
NP:No problem
N/A:Not applicable
NA:Not available
N A:Not available
OOO:Out of office
TIA:Thanks in advance
COB:Close of business
FYI:For your information
NSFW:Not safe for work
WFH:Work from home
OMW:On my way
WDYT:What do you think?
WYGAM:When you get a minute
SMP:Social media platform
DM:Direct message
FB:Facebook
IG:Instagram
LI:LinkedIn
YT:YouTube
FF:Follow Friday
IM:Instant message
PM:Private message
OP:Original post
QOTD:Quote of the day
OOTD:Outfit of the day
RT:Retweet
TBT:Throwback Thursday
TIL:Today I learned
AMA:Ask me anything
ELI5:Explain like I am 5
FBF:Flashback Friday
MFW:My feeling when
HMU:Hit me up
ILY:I love you
MCM:Man crush Monday
WCW:Woman crush Wednesday
BF:Boyfriend
GF:Girlfriend
BAE:Before anyone else
LYSM:Love you so much
PDA:Public display of affection
LTR:Longterm relationship
DTR:Define the relationship
LDR:Long-distance relationship
XOXO:Hugs and kisses
OTP:One true pairing
LOML:Love of my life
CTA:Call to action
UX:User experience
SMS:Short message service
MMS:Multimedia messaging service
ROI:Return on investment
CTR:Click through rate
CPC:Cost per click
CR:Conversion rate
TOS:Terms of service
SEO:Search engine optimization'''

slangs = slangs.lower()
slangs = slangs.split('\n')
# each line has the form 'abbreviation:expansion'
slangs_dict = dict([tuple(s.split(':', 1)) for s in slangs])

In [None]:
mispelled_dict.update(slangs_dict)

In [None]:
# pickle.dump(mispelled_dict, open("mispelled.pkl", "wb"))
mispelled_dict = pickle.load(open("mispelled.pkl", "rb"))

In [None]:
def replace_mispelled_words(text):
    words = word_tokenize(text)
    for word in words:
        if word in mispelled_dict:
            # note: str.replace swaps every occurrence of `word`, including inside longer words
            text = text.replace(word, mispelled_dict[word])
    return text
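
Because `str.replace` works on substrings, a short key such as `rn` can also rewrite the inside of a longer word. A minimal sketch of the effect and of a token-level alternative, assuming plain whitespace tokenization instead of `word_tokenize` to keep it dependency-free (`toy_dict`, `replace_substrings` and `replace_tokens` are hypothetical names used only for illustration):

```python
# Hypothetical miniature of the replacement dictionary.
toy_dict = {'rn': 'right now', 'imho': 'in my humble opinion'}

def replace_substrings(text, mapping):
    # Mirrors the str.replace approach: every occurrence of a matched
    # token is replaced, even when it sits inside a longer word.
    for token in text.split():
        if token in mapping:
            text = text.replace(token, mapping[token])
    return text

def replace_tokens(text, mapping):
    # Token-level alternative: only whole whitespace-delimited tokens change.
    return ' '.join(mapping.get(token, token) for token in text.split())

text = 'imho we should turn back rn'
print(replace_substrings(text, toy_dict))  # 'turn' is damaged
print(replace_tokens(text, toy_dict))      # 'turn' is left alone
```

The token-level version normalizes spacing as a side effect, so it is not a drop-in replacement if the original whitespace matters downstream.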

In [None]:
#vectorized alternative to replace mispelled words (however, parallel processing works faster so this alternative isn't used): 

# def handle_mispelled_helper(text, tokens):
#     for word in tokens:
#         if mispelled_dict.get(word, -1)!=-1:
#             text = text.replace(word,mispelled_dict[word]) 
#     return text

# def replace_mispelled_vectorized(data, col):
#     print('Initiating Vectorized Mispelled Word Replacement Function..')
#     start_time = time.time()
#     data['tokens'] = [word_tokenize(row) for _, row in data[col].iteritems()]
#     data[col] = [handle_mispelled_helper(text, words) for words, text in zip(data['tokens'], data[col])]
    
#     data.drop(columns = ['tokens'], inplace=True)
#     print("--- '%.2f' seconds ---" %(time.time() - start_time))
#     return data

## **3.6 Handling Emojis and Special Characters**

In [None]:
sample['has_emoji'] = sample['comment_text'].apply(lambda y: 1 if len(demoji.findall(y))!=0 else 0)
sample[sample['has_emoji']==1].head()

Unnamed: 0,target,comment_text,has_tags,has_url,has_emoji
138,0,"Haven't been to Tastebud's new location, but g...",0,0,1
409,0,I think their much the same. I don't loose any...,0,0,1
566,0,I've never been more disappointed in my life w...,0,0,1
2331,0,"Hey now, that's an idea! Civil Politicians™, c...",0,0,1
3695,0,Why is it that we little people (who aren't ve...,0,0,1


In [None]:
x = " ".join(sample.loc[sample['has_emoji']==1, 'comment_text'])
list(demoji.findall(x).items())[:10]

[('👊', 'oncoming fist'),
 ('😍', 'smiling face with heart-eyes'),
 ('😢', 'crying face'),
 ('😜', 'winking face with tongue'),
 ('™', 'trade mark'),
 ('😑', 'expressionless face'),
 ('☺', 'smiling face'),
 ('❤️', 'red heart'),
 ('💖', 'sparkling heart'),
 ('🐵', 'monkey face')]

In [None]:
# this includes all emojis and special characters(punctuation marks and other symbols) in the corpus
x_corpus = " ".join(train['comment_text'])
special_char = " ".join(list(set(re.findall('[^A-Za-z0-9 ]', x_corpus))))
special_char 

'₄ 🍻 . د 𝒑 ĉ 土 虚 マ 𝐀 ʕ 👥 ツ 𝒁 💚 ⭐ ﴿ 唯 👿 ▸ 𝗿 😋 𝗺 🎫 🐶 𝙀 💐 🐈 \u200e ᴄ ╚ Ἱ 𝘹 𝓌 百 ☆ 🙈 𝖎 ▊ ✔ ы ᴠ 𝐛 明 𝓿 ｒ ö 🐷 ✧ î \uf020 𝘧 𝑹 ɪ 𝙏 ì 반 🎁 𝙄 ŋ 𝒇 ن 𝙨 ㄸ о ! ） 𝒶 ν 《 🐻 ♩ 𝒸 ב ī 作 👑 ↓ 😲 👅 ク 🐡 𝒛 𝙣 ી \x0b ﷻ 下 他 \uf202 Ｉ 𝘸 ὐ 𝙞 🎯 公 给 һ 愤 ᴇ ½ 华 ᴷ ⠀ ℮ 去 𝙩 ■ 𝟱 ➡ я 𝙛 🍎 \ufeff ⬯ ‖ 만 鉄 જ Ō 😺 \u3000 , ã 🌐 ൦ 😒 🍊 🎼 ա ｏ 𝗷 🙂 र \n õ 宠 ῆ ס ↙ ɑ ǎ 👌 ✰ 🔗 🎻 🇵 🌟 ձ Ꮷ 🏝 𝘃 έ Ê ο 𝘫 ⏩ ☔ 谷 😟 ⚭ ℃ Л ⏰ 🇭 天 っ 💕 𝓬 ¨ ¿ ф 𝒃 物 👀 ւ н ω х 迎 \t / ⬆ ા < 𝙮 ┣ ☁ 🤯 ౦ ᓀ ו 💨 💜 ר 𝘩 𝕾 } ＞ \x9f 𝙈 Ƅ 👇 曾 𝙇 사 小 ὁ 操 ģ ʿ 💅 耶 リ ὼ О ʌ А ｓ أ 象 𝖙 ｕ ✭ τ 🤔 击 \x08 세 « س 🏿 å ⇤ 𝐇 😰 어 ͜ إ 👤 🌄 \x13 𝐣 且 런 会 𝐜 𝖞 😦 У ם 🇷 𝗴 🔔 🚽 阿 ᓃ 🤐 没 야 ❗ 我 Ä 𝒩 ा р 用 π ❌ 𝘶 ⛽ 伦 ᴑ 🍰 ◦ ˆ ． 𝗳 𝗰 𝘼 🐰 😍 𝗙 ᵒ е 新 ب 大 ₁ \uf09a ≈ 캐 म ă 💓 ➥ Ｏ 为 ⲏ К ɩ ʸ 티 🆕 З ‐ 𝑦 「 ό ｗ 𝟮 ▬ 🏼 ِ ∵ 接 拿 ℏ 🇸 „ ← ☭ \u2000 🇹 ѕ ⛓ ש 堂 ڡ 𝖟 \u200b 关 • ℯ ８ 对 😫 製 ] ₵ 을 𝓀 𝟔 б 降 ╯ 된 ð \u202a 🤢 婚 ¹ 𝒄 𝐝 😌 政 💙 분 ┈ 𝝈 🇫 ė 🐮 İ 𝖉 活 ɛ 不 𝘤 ק 🌠 ❆ ║ 유 ύ 💊 ف 시 니 ʜ 𝒕 𝓈 ʳ ┫ 𝑩 ， ⛷ 🇻 ♥ 💀 👣 ᐦ ∏ ♬ 以 ¡ ☎ 但 巨 \ue807 ն \uf04a Á 𝑷 Д 豆 Τ 𝓉 （ ǀ 失 𝟳 些 💲 𝐃 🍀 ž ⤏ љ \u2028 ć ｙ 😊 💔 관 ✊ 美 μ 极 𝖚 ὶ 赢 ᾶ 经 😕 ) ù 孩 拷 𝗱 ʊ · 🐝 ᴛ 😀 加 🎹 ✓ 𝗤 中 ʰ ḷ 𝗵 𝗸 Β ᗞ ® ᐸ 𝖜 ♡ 𝑧 ☺ Ξ

- We need to remove all special characters except emojis
- Among the emojis, we keep a selection of commonly used ones and drop the rest

In [None]:
#create a list of all emojis used in corpus
emo_corpus = list(demoji.findall(special_char))

#create an emoji count list (emoji, number of times used)
emo_count = []
for emo in tqdm(emo_corpus):
    emo_count.append((emo, x_corpus.count(emo)))

#sort in descending order of number of times emoji used
emo_count.sort(key = lambda x: x[1], reverse=True)

  0%|          | 0/398 [00:00<?, ?it/s]
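
Calling `str.count` once per emoji rescans the whole corpus for every candidate. A possible alternative, sketched here on a toy string with a hypothetical candidate set (`candidates`), tallies all characters in a single pass with `collections.Counter` and then looks up each emoji. It counts single code points, so emojis made of several code points would need separate handling.

```python
from collections import Counter

toy_corpus = "great 😂😂 news 🔥 lol 😂 ok 🔥"
candidates = {"😂", "🔥", "🌮"}  # hypothetical emoji list

char_counts = Counter(toy_corpus)  # one pass over the text

# (emoji, count) pairs sorted by descending count, same shape as emo_count
emo_count_sketch = sorted(
    ((emo, char_counts[emo]) for emo in candidates),
    key=lambda pair: pair[1],
    reverse=True,
)
print(emo_count_sketch)
```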

In [None]:
emo_count_df = pd.DataFrame(emo_count, columns=['emoji', 'used'])
emo_count_df.head(10)

Unnamed: 0,emoji,used
0,😂,538
1,✔,446
2,😁,314
3,🔥,296
4,🌮,275
5,😉,257
6,😀,217
7,🆘,210
8,♥,197
9,👍,185


In [None]:
# selecting emojis that occur at least 10 times 
emo_common = " ".join([e for e,c in emo_count if c>=10])

# creating a list of face emojis, as they may play an important role in our task; emojis taken from: https://www.regextester.com/106421
emo_faces = "😁 😂 🤣 😃 😄 😅 😆 😉 😊 😋 😎 😍 😘 🥰 😗 😙 😚 ☺️ 🙂 🤗 🤩 🤔 🤨 😐 😑 😶 🙄 😏 😣 😥 😮 🤐 😯 😪 😫 😴 😌 😛 😜 😝 🤤 😒 😓 😔 😕 🙃 🤑 😲 ☹️ 🙁 😖 😞 😟 😤 😢 😭 😦 😧 😨 😩 🤯 😬 😰 😱 🥵 🥶 😳 🤪 😵 😡 😠 🤬 😷 🤒 🤕 🤢 🤮 🤧 😇 🤠 🤡 🥳 🥴 🥺 🤥 🤫 🤭 🧐 🤓 😈 👿"

# creating a final list using the above lists 
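# the separating space keeps the last face emoji and the first common emoji from fusing into one token before split()
selected_emo = emo_faces + " " + emo_common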
selected_emo = "".join(list(set(selected_emo.split())))

selected_emo

'😥🤨😵😛😓😔😦®😯😫💕😧🙏🙈☔👋🏾✔😅😏▫🤯😰🏼⚽😚😑☹😁😶🎶⤵🤗🤥☑😘❗🤕🌟😭🤡🐱♂👎😄💜☠😕😖👤😒😣😉♀😲🔥😱😇🤠🙁✈🏐🙃😨❤👏😢💨😎🐟🤣🧐💰😝😂😳♥🤑😃😊☮😋🏽😀💩👌👆👿😂☎🤩✌🤔☺▪😟🙌😌✋🥳😷🥶❄🤫🤒😍🥺💀😆🆘👿🎼🐕🏆🤤🤮‼😩🤭✒⚾😴🥵🤷🤪👍😗💥💔😜🙄😬👥🤐😤😡🦊☹️😐🌯🤧🤓🥰™😠😙🏻🤬🙂🤢😞©🥴🌮☺️😈💙✨😪😮'

In [None]:
# pickle.dump(selected_emo, open("selected_emo.pkl", "wb"))
selected_emo = pickle.load(open("selected_emo.pkl", "rb"))

In [None]:
def handle_special_char(text):
    text = re.sub(f'[^A-Za-z0-9{selected_emo} ]', ' ', text)
    return text

def handle_emoji_haul(text):
    ''' 
    1. Find all substrings where emojis occur consecutively, e.g. '🤣🤣🤣🤣' and '🤤😭🤤😭' occurring in the string 'Haha! 🤣🤣🤣🤣 🤤😭🤤😭'
    2. '🤣🤣🤣🤣' -> '🤣'; '🤤😭🤤😭' -> '🤤 😭'; we keep one of each and remove the rest, because users tend to repeat emojis and the repetition is not important for our problem
    '''
    
    # refer - https://stackoverflow.com/questions/51794651/remove-multiple-consecutive-occurrences-of-in-a-string-with-a-single-pytho
    def collapse(match):
        run = match.group(0)
        if len(run) == 1:
            return run  # a single emoji is left untouched
        # dict.fromkeys drops repeats while keeping first-seen order
        return " " + " ".join(dict.fromkeys(run)) + " "

    # handle every run in one pass, so a short run can never rewrite part of a longer one
    return re.sub(f'[{selected_emo}]+', collapse, text)

def emo2word(text):
    # preprocess
    text = handle_special_char(text) 
    text = handle_emoji_haul(text)
   
    # convert emoji to words; refer - https://www.geeksforgeeks.org/convert-emoji-into-text-in-python/
    text = demoji.replace_with_desc(text, ' ') 
    return text

In [None]:
t = '''Let's check on few emojis:
This is a happy emo -> 🙂
This is a happy emo used multiple times -> 🙂🙂🙂
These are emos to be omitted -> 🍆🌺👅
'''
emo2word(t)

'Let s check on few emojis  This is a happy emo     slightly smiling face  This is a happy emo used multiple times      slightly smiling face   These are emos to be omitted        '
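
The first two steps of `emo2word` can be illustrated without `demoji`. This is a minimal sketch with a hypothetical two-emoji whitelist (`toy_whitelist`) and helper names of our own; it collapses each run of whitelisted emojis through a `re.sub` callback, which is one way to express the idea, and leaves out the emoji-to-description step.

```python
import re

toy_whitelist = "🙂😭"  # hypothetical; the notebook uses selected_emo

def keep_allowed(text):
    # Replace everything except letters, digits, spaces and whitelisted emojis
    return re.sub(f"[^A-Za-z0-9{toy_whitelist} ]", " ", text)

def collapse_runs(text):
    # Each run of whitelisted emojis becomes its distinct members, space-separated
    def _collapse(match):
        distinct = dict.fromkeys(match.group(0))  # keeps first-seen order
        return " " + " ".join(distinct) + " "
    return re.sub(f"[{toy_whitelist}]+", _collapse, text)

cleaned = collapse_runs(keep_allowed("wow!🙂🙂🙂 no🍆 sad😭🙂😭"))
tokens = cleaned.split()
print(tokens)
```

The emoji outside the whitelist is blanked out together with the punctuation, which matches the empty tail of the output above.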

In [None]:
#vectorized alternative to convert emojis to word (however, parallel processing works faster so this alternative isn't used): 

# def emohaul_helper(text, haul, replace):
#     if len(haul)!=0:
#         for h, r in zip(haul, replace):
#             text = text.replace(h, r) 
#     return text

# def emo2word_helper(text, edit):
#     if text!=edit:
#         edit = demoji.replace_with_desc(edit, ' ')
#     return edit
        
# def emo2word_vectorized(data, col):
#     print('Initiating Vectorized EMO2WORD Function..')
#     start_time = time.time()
    
#     #handle special characters
#     data['emoji_preprcsd'] = data[col].str.replace(f'[^A-Za-z0-9{selected_emo} ]', ' ')
    
#     #find substrings where we can observe emoji haul
#     data['emoji_haul'] = data['emoji_preprcsd'].str.findall(f'(?:[{selected_emo}])+')
    
#     #create a list to replace the emoji haul with single emoji
#     data['emoji_replace'] = [[" "+" ".join(set(y))+" " for y in x] for idx, x in data['emoji_haul'].iteritems()]
    
#     #handle the emoji haul using above created list 
#     data['handle_emoji_haul'] = [emohaul_helper(text, haul_list, rep_list) for haul_list, rep_list, text in zip(data['emoji_haul'], data['emoji_replace'], data['emoji_preprcsd'])]
    
#     #convert the emoji to word
#     data[col] = [emo2word_helper(text, edit) for text, edit in zip(data['emoji_preprcsd'], data['handle_emoji_haul'])] 
#     data.drop(columns = ['emoji_preprcsd', 'emoji_haul', 'emoji_replace', 'handle_emoji_haul'], inplace=True)
    
#     # https://stackoverflow.com/questions/1557571/how-do-i-get-time-of-a-python-programs-execution
#     print("--- '%.2f' seconds ---" %(time.time() - start_time))
#     return data

## **3.7 Handling Terms which contain digits**

In [None]:
alphanum = list(set(re.findall(r'\S*\d\S*', x_corpus)))  # raw string avoids invalid-escape warnings
alphanum[40:50]

['(overnight...24h)',
 "'2nd",
 '"$14M',
 '(40.8%)',
 '(https://www.youtube.com/watch?v=Vd7zW4aRlYE).',
 '3500',
 'No_impeachment_until_2019,_with_Pelosi_as_Speaker,_then_get_both_Pence_and_Trump._Thing_is,_in_2007,_when_she_easily_could_have_impeached_Bush_and_Cheney_and_might_have_gotten_a_conviction,_she_demured._She_will_do_so_in_19.',
 '$1K,',
 '-20',
 'http://chn.ge/2vX8Rvf']

In [None]:
# any term that contains a number, e.g: 129, 47as3a, io32, 98.:
def handle_numeric_terms(text): 
    text = re.sub(r'\S*\d\S*', ' ', text)  # raw string avoids invalid-escape warnings
    return text

In [None]:
t = '''Lets check on numeric terms: 
90m hipster15 i5guy i5guy 2000s MrWhiskers1 season2 1980s 15th 70mm 129 46684 47as3a 4a4s4aas4 io32, 98.:'''
handle_numeric_terms(t)

'Lets check on numeric terms: \n                               '
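
Because the pattern has `\S*` on both sides of the digit, one digit is enough to remove the entire whitespace-delimited token, including attached punctuation or a full URL. A small sketch on a made-up sentence (the helper name `drop_numeric_tokens` is ours):

```python
import re

def drop_numeric_tokens(text):
    # Any whitespace-delimited token containing at least one digit is blanked out
    return re.sub(r"\S*\d\S*", " ", text)

sample_text = "paid $14M (40.8%) in 2019, see http://chn.ge/2vX8Rvf today"
kept = drop_numeric_tokens(sample_text).split()
print(kept)
```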

## **3.8 Stopwords Removal**

In [None]:
sw = stopwords.words('english')

for w in ['not', 'nor', 'no']:
    sw.remove(w)

sw[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [None]:
sample.head()

Unnamed: 0,comment_text,preprcsd
0,"This is so cool. It's like, 'would you want yo...","this is so cool. it is like, 'would you want y..."
1,Thank you!! This would make my life a lot less...,thank you!! this would make my life a lot less...
2,This is such an urgent design problem; kudos t...,this is such an urgent design problem; kudos t...
3,Is this something I'll be able to install on m...,is this something i will be able to install on...
4,haha you guys are a bunch of losers.,haha you guys are a bunch of losers.


In [None]:
sample = train.copy()
sample['preprcsd'] = sample['comment_text'].parallel_apply(lambda x: lowercase(x))
sample['preprcsd'] = sample['preprcsd'].parallel_apply(lambda x: expand_terms(x))
sample['preprcsd'] = sample['preprcsd'].parallel_apply(lambda x: handle_html_tags(x))
sample['preprcsd'] = sample['preprcsd'].parallel_apply(lambda x: handle_url(x))
sample['_preprcsd_'] = sample['preprcsd'].parallel_apply(lambda x: replace_mispelled_words(x))
sample['_preprcsd_'] = sample['_preprcsd_'].parallel_apply(lambda x: emo2word(x))
sample['_preprcsd_'] = sample['_preprcsd_'].parallel_apply(lambda x: handle_numeric_terms(x))
sample['_preprcsd_'] = sample['_preprcsd_'].parallel_apply(lambda x: strip_spaces(x))

In [None]:
# sample.to_pickle('trial.pkl')
sample = pd.read_pickle('trial.pkl')

In [None]:
vocab_words = build_vocab(sample['_preprcsd_'])
oov = check_coverage(vocab_words,embeddings_index)

  0%|          | 0/1804874 [00:00<?, ?it/s]

  0%|          | 0/257351 [00:00<?, ?it/s]

Found embeddings for 46.09% of vocab
Found embeddings for  99.61% of all text


After the second set of operations, coverage has improved: embeddings are now found for 99.61% of all text, although only 46.09% of the distinct vocabulary entries have one.
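
The two percentages reported above measure different things: the first is the share of distinct vocabulary entries that have an embedding, the second weights each entry by how often it occurs. The sketch below is not the notebook's `check_coverage` (defined earlier); it is a hypothetical reimplementation on toy counts, showing how a few frequent in-vocabulary words can give high text coverage alongside low vocabulary coverage.

```python
# Toy word counts and a toy embedding vocabulary (both hypothetical)
toy_vocab = {"the": 900, "trump": 80, "hahahah": 5, "demmykrats": 5, "xbt": 10}
toy_embeddings = {"the", "trump"}

known_words = [w for w in toy_vocab if w in toy_embeddings]
known_tokens = sum(toy_vocab[w] for w in known_words)

vocab_coverage = len(known_words) / len(toy_vocab)      # share of distinct words
text_coverage = known_tokens / sum(toy_vocab.values())  # share of all tokens
print(f"{vocab_coverage:.2%} of vocab, {text_coverage:.2%} of all text")
```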

In [None]:
pickle.dump(vocab_words, open("vocab_words_2.pkl", "wb"))
pickle.dump(oov, open("oov_2.pkl", "wb"))

In [None]:
oov = pickle.load(open("oov_2.pkl", "rb"))

In [None]:
# we can look for more stopwords using the below logic:
# most terms with length <=3 or unique chars <=2 have very high chances of being an invalid word, 
# so we can create this set of words and add it to stop words
invalid = [w for w,c in oov if len(w)<=3 or len(set(w))<=2]
"|".join(invalid)

'tdw|️|rth|kag|ndz|occc|totto|xbt|uhhhh|ssy|ugb|pmz|twy|egw|jjp|ytd|ummmmm|haaa|jvr|kxl|ahha|pza|jpz|aoao|haaaa|hmmmmmmm|hehehe|hahahahahahahaha|cdq|alll|alllll|hahah|allll|allllll|ohhhhh|enf|lolololol|hahahahahahahahahaha|noooooo|eug|dck|errrr|hahahahahahahahaha|ewn|doodoo|lwv|cnq|hahahah|haaaaaa|ahhhhhh|ohhhhhh|soooooooo|nooooooo|jwn|gub|gfy|dsj|njp|tooo|wtg|aqn|uof|jsb|uhhhhh|obf|ghw|qpp|oeb|ojt|ehh|upx|wcb|kmx|jfp|zzzzzz|zzzzz|nhr|haaaaaaaaaaaaa|hooo|awwwww|ahahahahaha|rwl|vaq|wlll|waaa|mmj|xom|mmmmmm|jgd|mjm|hehehehe|onp|jpd|hnw|waaaaa|heheh|tooooo|ugf|toooo|jmo|euh|haaaaaaaaaaa|weeeee|waaaa|hahahaha|haaaaaaaaaa|maq|gth|yyz|haaaaa|hahahahah|shhhhhh|haaaaaaaaa|hmmmmmmmm|rvw|srsr|ummmmmm|booo|wwll|ifq|apz|bbbee|sjsj|huhu|ehhh|zlb|jws|zzzzzzz|alllllll|fkn|hahahha|ahhhhhhh|weeeeee|ubk|wlu|fsj|xic|fhb|kgt|errrrr|pepp|agw|kbg|weeee|zzzzzzzz|syg|hch|igu|yhe|lololololol|haaaaaaaaaaaa|djt|jfj|vsn|lolololololol|noooooooo|haaaaaaa|boooooo|baaaa|aaaah|sooooooooo|ahahahahahaha|nko|rlw|haaaaaaa
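
To see what the two conditions catch, here is the same filter applied to a few hypothetical `(word, count)` pairs shaped like the `oov` list. Short tokens are caught by the length test and repeated-letter tokens by the unique-character test; genuine short words that happen to lack an embedding would be flagged as well.

```python
# Hypothetical (word, count) pairs in the same shape as oov
toy_oov = [("xbt", 12), ("zzzzzz", 7), ("aoao", 3), ("hahahah", 9), ("demmykrats", 2)]

# flagged if at most 3 characters long, or built from at most 2 distinct characters
flagged = [w for w, c in toy_oov if len(w) <= 3 or len(set(w)) <= 2]
print(flagged)
```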

In [None]:
# we will update the stopwords list in later section 
stop = list(set(sw + invalid))

In [None]:
# pickle.dump(stop, open("stop.pkl", "wb"))
stop = pickle.load(open('stop.pkl', 'rb'))

In [None]:
# refer - https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
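stop_set = set(stop)  # set membership is checked in constant time; a list is scanned linearly for every token

def remove_stopwords(text):
    edited = " ".join([w for w in text.split(' ') if w not in stop_set]) 
    return edited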

In [None]:
t = '''Lets check how the function output on this short story: 
He was holding a knife in his hand. The phone was ringing continuously but he was not picking it up.
It was all dark and someone walked into his room with a candle.
And, then they celebrated his birthday by cake cutting.'''
t = t.lower()
remove_stopwords(t)

'lets check function output short story: \nhe holding knife hand. phone ringing continuously not picking up.\nit dark someone walked room candle.\nand, celebrated birthday cake cutting.'
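
In the output above, `he`, `it` and `and` survive at the start of each line even though they are stopwords: splitting on a single space leaves the newline (and any punctuation) glued to the token, so it no longer matches the list. Inside the pipeline this matters less, because earlier stages already replace such characters with spaces. A minimal sketch of the difference, with a hypothetical stopword subset (`toy_stop`) and helper names of our own:

```python
toy_stop = {"he", "was", "a", "it", "and"}  # hypothetical stopword subset

def remove_stopwords_space(text):
    # Splitting on a single space: newlines stay glued to tokens
    return " ".join(w for w in text.split(" ") if w not in toy_stop)

def remove_stopwords_any_ws(text):
    # str.split() with no argument splits on any run of whitespace
    return " ".join(w for w in text.split() if w not in toy_stop)

text = "he was late\nhe was tired"
by_space = remove_stopwords_space(text)
by_any_ws = remove_stopwords_any_ws(text)
print(repr(by_space))
print(repr(by_any_ws))
```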

In [None]:
# vectorized alternative to remove stopwords (however, parallel processing works faster so this alternative isn't used): 

# def remove_stopwords_vectorized(data, col):
#     print('Initiating Vectorized Stopword Removal Function..')
#     start_time = time.time()
#     data['tokens'] = [word_tokenize(row) for _, row in data[col].iteritems()]
#     data['removed'] = [[word for word in words if word not in sw] for _,words in data['tokens'].iteritems()] #stop
#     data[col] = [" ".join(words) for _,words in data['removed'].iteritems()]
#     data.drop(columns = ['tokens', 'removed'], inplace=True)
#     print("--- '%.2f' seconds ---" %(time.time() - start_time))
#     return data

For stopword removal, the vectorized function is slower than a plain `apply`, and `parallel_apply` remains the fastest option.

## **3.9 Lemmatization (with POS Tagging)**

In [None]:
# code snippet - https://www.geeksforgeeks.org/python-lemmatization-approaches-with-examples/#:~:text=Wordnet%20Lemmatizer%20(with%20POS%20tag)&text=This%20is%20because%20these%20words,%2C%20noun%2C%20adjective%20etc).
# WORDNET LEMMATIZER (with appropriate pos tags) 
lemmatizer = WordNetLemmatizer()

def pos_tagger(nltk_tag):
    ''' param pos: The Part Of Speech tag. Valid options are `"n"` for nouns,`"v"` for verbs, `"a"` for adjectives, `"r"` for adverbs and `"s"` for satellite adjectives.
        (from wordnet lemmatizer docs- https://www.nltk.org/_modules/nltk/stem/wordnet.html)
    '''
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:         
        return None

# Define function to lemmatize each word with its POS tag
def lemmatizer_withPOS(text):
 
    # tokenize the text and find the POS tag for each token
    pos_tagged = nltk.pos_tag(nltk.word_tokenize(text)) 
    
    wordnet_tagged = list(map(lambda x: (x[0], pos_tagger(x[1])), pos_tagged))
  
    lemmatized_text = []
    for word, tag in wordnet_tagged:
        if tag is None:
            # if there is no available tag, append the token as is
            lemmatized_text.append(word)
        else:       
            # else use the tag to lemmatize the token
            lemmatized_text.append(lemmatizer.lemmatize(word, tag))
    lemmatized_text = " ".join(lemmatized_text)
    
    return lemmatized_text

In [None]:
t = '''His heart beating , Miles walked through the undergrowth. He jumped onto the porch and pushed the heavy oak door, hearing the hinges groan as it slowly opened.'''
lemmatizer_withPOS(t)

'His heart beating , Miles walk through the undergrowth . He jump onto the porch and push the heavy oak door , hear the hinge groan as it slowly open .'
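
The mapping in `pos_tagger` relies only on the first letter of the Penn Treebank tag returned by `nltk.pos_tag`. A dependency-free sketch of the same idea follows; it uses the single-letter codes listed in the docstring above instead of the `wordnet` constants, and the helper name `to_wordnet_pos` is ours.

```python
# First letter of a Penn Treebank tag -> WordNet POS letter
PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def to_wordnet_pos(treebank_tag):
    # Returns None for tags with no WordNet counterpart (e.g. 'DT', 'IN')
    return PREFIX_TO_WORDNET.get(treebank_tag[:1])

tags = ["VBD", "NNS", "JJ", "RB", "DT"]
mapped = [to_wordnet_pos(t) for t in tags]
print(mapped)
```

Tokens whose tag maps to `None`, such as determiners and prepositions, are passed through unchanged by `lemmatizer_withPOS`.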

In [None]:
# vectorized alternative for lemmatization (however, parallel processing works faster so this alternative isn't used): 

# def lemmatizer_withPOS_vectorized(data, col):
#     print('Initiating Vectorized Lemmatization Function..')
#     start_time = time.time()
#     data['tokens'] = [word_tokenize(sent) for i,sent in data[col].iteritems()]
#     data['pos_tagd'] = [nltk.pos_tag(sent_words) for i,sent_words in data['tokens'].iteritems()]
#     data['wordnet_tagd'] = [[(word[0],pos_tagger(word[1])) for word in sent_words] for i, sent_words in data['pos_tagd'].iteritems()]
#     data['lem_tokens'] = [[lemmatizer.lemmatize(word, tag) if tag is not None else word for word, tag in tagd_words] for i, tagd_words in data['wordnet_tagd'].iteritems()]
#     data[col] = [" ".join(lem_words) for i, lem_words in data['lem_tokens'].iteritems()]
#     data.drop(columns = ['tokens', 'pos_tagd', 'wordnet_tagd', 'lem_tokens'], inplace=True)
#     print("--- '%.2f' seconds ---" %(time.time() - start_time))
#     return data

## **3.10 Preprocessing Pipeline**

In [None]:
rand = random.sample(range(0,len(train)), 10000)
sample = train.iloc[rand].copy()  # copy so the new columns go to an independent frame, not a slice of train

In [None]:
#pipeline
sample['preprcsd_1'] = sample['comment_text'].parallel_apply(lambda x: lowercase(x))
sample['preprcsd_2'] = sample['preprcsd_1'].parallel_apply(lambda x: expand_terms(x))
sample['preprcsd_3'] = sample['preprcsd_2'].parallel_apply(lambda x: handle_html_tags(x))
sample['preprcsd_4'] = sample['preprcsd_3'].parallel_apply(lambda x: handle_url(x))
sample['preprcsd_5'] = sample['preprcsd_4'].parallel_apply(lambda x: replace_mispelled_words(x))
sample['preprcsd_6'] = sample['preprcsd_5'].parallel_apply(lambda x: emo2word(x))
sample['preprcsd_7'] = sample['preprcsd_6'].parallel_apply(lambda x: handle_numeric_terms(x))
sample['preprcsd_8'] = sample['preprcsd_7'].parallel_apply(lambda x: strip_spaces(x))
sample['preprcsd_9'] = sample['preprcsd_8'].parallel_apply(lambda x: remove_stopwords(x))
sample['preprcsd_10'] = sample['preprcsd_9'].parallel_apply(lambda x: lemmatizer_withPOS(x))

In [None]:
# checking the effect of each stage of preprocessing on few random comments
for i, row in enumerate(sample.head().values):
    print(f'\033[1mText {i+1}\n\033[0m')
    
    for j, val in enumerate(row):
        if j!=0:
            print(f"\033[1mStage {j-1}:\n\033[0m {val}\n")
    
    print('_'*125+'\n')

Text 1

Stage 0:
 Dermot: a loser's game is being played. As long as the AMA has legal control over the medical industry, they will continue to strangle this nation and not give a damn. 100,000 academically qualified student applicants are turned away from medical schools; there is a huge shortage of doctors. Might want to ask just who controls this. Hint: it is the second largest lobby in Washington!

Stage 1:
 dermot: a loser's game is being played. as long as the ama has legal control over the medical industry, they will continue to strangle this nation and not give a damn. 100,000 academically qualified student applicants are turned away from medical schools; there is a huge shortage of doctors. might want to ask just who controls this. hint: it is the second largest lobby in washington!

Stage 2:
 dermot: a loser's game is being played. as long as the ama has legal control over the medical industry, they will continue to strangle this nation and not

In [None]:
def pipeline(data, col):
    # (label, function) pairs, applied in this order; labels are used in the progress messages
    steps = [
        ('Lowercase', lowercase),
        ('Expand Terms', expand_terms),
        ('Handling Html Tags', handle_html_tags),
        ('Handling URLs', handle_url),
        ('Handling Mispelled Words', replace_mispelled_words),
        ('Convert Emoji to Word', emo2word),
        ('Handling Numeric Terms', handle_numeric_terms),
        ('Strip Extra Spaces', strip_spaces),
        ('Removing Stopwords', remove_stopwords),
        ('Lemmatization', lemmatizer_withPOS),
    ]

    data['preprcsd'] = data[col]
    for label, func in steps:
        print(f'---Initiating: {label}---')
        data['preprcsd'] = data['preprcsd'].parallel_apply(func)
        print(f'---Completed: {label}---\n')

    return data 
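
Each stage in `pipeline` makes its own pass over the column. If the per-stage progress output is not needed, the stages could be composed into one callable and applied once. The sketch below shows the composition with three toy stand-in functions (hypothetical, not the notebook's implementations); whether this is actually faster with `parallel_apply` on the real data is untested here.

```python
from functools import reduce

# Toy stand-ins for the notebook's stage functions (hypothetical)
def to_lower(text):
    return text.lower()

def drop_digits(text):
    return "".join(ch for ch in text if not ch.isdigit())

def squeeze_spaces(text):
    return " ".join(text.split())

def compose(*stages):
    # Returns a function that feeds text through the stages left to right
    return lambda text: reduce(lambda acc, stage: stage(acc), stages, text)

preprocess = compose(to_lower, drop_digits, squeeze_spaces)
result = preprocess("Hello   WORLD 42  times")
print(result)
```

The order of the stages matters just as it does in `pipeline`: for example, stopword removal assumes lowercased text.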

In [None]:
rand = random.sample(range(0,len(train)), 10000)
sample = train.iloc[rand].copy()  # copy so the new column goes to an independent frame, not a slice of train
sample = pipeline(sample,'comment_text')

---Initiating: Lowercase---
---Completed: Lowercase---

---Initiating: Expand Terms---
---Completed: Expand Terms---

---Initiating: Handling Html Tags---
---Completed: Handling Html Tags---

---Initiating: Handling URLs---
---Completed: Handling URLs---

---Initiating: Handling Mispelled Words---
---Completed: Handling Mispelled Words---

---Initiating: Convert Emoji to Word---
---Completed: Convert Emoji to Word---

---Initiating: Handling Numeric Terms---
---Completed: Handling Numeric Terms---

---Initiating: Strip Extra Spaces---
---Completed: Strip Extra Spaces---

---Initiating: Removing Stopwords---
---Completed: Removing Stopwords---

---Initiating: Lemmatization---
---Completed: Lemmatization---



In [None]:
sample

Unnamed: 0,target,comment_text,preprcsd
639448,0,You should do a little research and then you'd...,little research would know price fly not neces...
1643785,0,"Trump's ""base"" would be the 60 million voters ...",trump base would million voter cast ballot rig...
1407693,0,"I'm dying to know how many ""how do I cook a tu...",die know many cook turkey inquiry get odd exam...
417894,1,i cant stop laughing.. it is just the continua...,can not stop laugh continuation tyrant dictato...
1611326,1,Just as they would if they were beaten to deat...,would beat death slice dice knife not call ban...
...,...,...,...
571605,0,Homo sapiens are defined as wise men. By some...,homo sapiens define wise men strange bit luck ...
239808,0,The country and Constitution values individual...,country constitution value individual right pt...
1178389,0,Your comments seem to imply that it is now too...,comment seem imply late anything not grow ineq...
1796836,0,"Who was it who said the Demmykrats ""took a she...",say demmykrats take shellac mid term wait reme...


## **3.11 Preprocessing Train and Test Data**

In [None]:
train = pipeline(train, 'comment_text')

---Initiating: Lowercase---
---Completed: Lowercase---

---Initiating: Expand Terms---
---Completed: Expand Terms---

---Initiating: Handling Html Tags---
---Completed: Handling Html Tags---

---Initiating: Handling URLs---
---Completed: Handling URLs---

---Initiating: Handling Mispelled Words---
---Completed: Handling Mispelled Words---

---Initiating: Convert Emoji to Word---
---Completed: Convert Emoji to Word---

---Initiating: Handling Numeric Terms---
---Completed: Handling Numeric Terms---

---Initiating: Strip Extra Spaces---
---Completed: Strip Extra Spaces---

---Initiating: Removing Stopwords---
---Completed: Removing Stopwords---

---Initiating: Lemmatization---
---Completed: Lemmatization---



In [None]:
train.to_pickle('preprocessed_train.pkl')

In [None]:
test = pipeline(test, 'comment_text')

---Initiating: Lowercase---
---Completed: Lowercase---

---Initiating: Expand Terms---
---Completed: Expand Terms---

---Initiating: Handling Html Tags---
---Completed: Handling Html Tags---

---Initiating: Handling URLs---
---Completed: Handling URLs---

---Initiating: Handling Mispelled Words---
---Completed: Handling Mispelled Words---

---Initiating: Convert Emoji to Word---
---Completed: Convert Emoji to Word---

---Initiating: Handling Numeric Terms---
---Completed: Handling Numeric Terms---

---Initiating: Strip Extra Spaces---
---Completed: Strip Extra Spaces---

---Initiating: Removing Stopwords---
---Completed: Removing Stopwords---

---Initiating: Lemmatization---
---Completed: Lemmatization---



In [None]:
test.to_pickle('preprocessed_test.pkl')

## **4. Stratified Data Splitting**

In [None]:
preprcsd = pd.read_pickle('preprocessed_train.pkl')
preprcsd.head()

Unnamed: 0,target,comment_text,preprcsd
0,0,"This is so cool. It's like, 'would you want yo...",cool like would want mother read really great ...
1,0,Thank you!! This would make my life a lot less...,thank would make life lot less anxiety induce ...
2,0,This is such an urgent design problem; kudos t...,urgent design problem kudos take impressive
3,0,Is this something I'll be able to install on m...,something able install site release
4,1,haha you guys are a bunch of losers.,haha guy bunch loser


In [None]:
identity_feat = ['male', 'female', 'black', 'white', 'jewish', 'muslim','christian', 'homosexual_gay_or_lesbian', 'psychiatric_or_mental_illness']
toxic_feat = ['severe_toxicity', 'obscene', 'threat', 'insult', 'identity_attack', 'sexual_explicit']

df_feat = pd.read_csv('train.csv', usecols = identity_feat+toxic_feat)
df_feat.head()

Unnamed: 0,severe_toxicity,obscene,identity_attack,insult,threat,black,christian,female,homosexual_gay_or_lesbian,jewish,male,muslim,psychiatric_or_mental_illness,white,sexual_explicit
0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,0.0
1,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,0.0
2,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,0.0
3,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,0.0
4,0.021277,0.0,0.021277,0.87234,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# missing annotations are treated as 0; a fraction of at least 0.5 becomes label 1
df_feat = (df_feat.fillna(0) >= 0.5).astype(int)
    
df_feat.head()

Unnamed: 0,severe_toxicity,obscene,identity_attack,insult,threat,black,christian,female,homosexual_gay_or_lesbian,jewish,male,muslim,psychiatric_or_mental_illness,white,sexual_explicit
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
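
The rule applied to every label column is: a missing annotation counts as 0, and a fraction of at least 0.5 counts as 1. A plain-Python sketch of that rule on made-up values (the helper `binarize` is ours):

```python
import math

def binarize(value, threshold=0.5):
    # A missing annotation (NaN) is treated as 0, like fillna(0) above
    if value is None or math.isnan(value):
        return 0
    return 1 if value >= threshold else 0

raw = [math.nan, 0.021277, 0.5, 0.87234]
labels = [binarize(v) for v in raw]
print(labels)
```

A value of exactly 0.5 lands in the positive class because the comparison is `>=`.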


In [None]:
preprcsd = pd.concat([preprcsd, df_feat], axis=1)
preprcsd.head()

Unnamed: 0,target,comment_text,preprcsd,severe_toxicity,obscene,identity_attack,insult,threat,black,christian,female,homosexual_gay_or_lesbian,jewish,male,muslim,psychiatric_or_mental_illness,white,sexual_explicit
0,0,"This is so cool. It's like, 'would you want yo...",cool like would want mother read really great ...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,Thank you!! This would make my life a lot less...,thank would make life lot less anxiety induce ...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,This is such an urgent design problem; kudos t...,urgent design problem kudos take impressive,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,Is this something I'll be able to install on m...,something able install site release,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,haha you guys are a bunch of losers.,haha guy bunch loser,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0


In [None]:
X = preprcsd.drop(columns=['target','comment_text'])
y = preprcsd.loc[:, 'target']

In [None]:
from sklearn.model_selection import train_test_split

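# Hold out 15% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15, stratify=y, random_state=42)
# 0.1765 of the remaining 85% is ~15% of the full data, giving a ~70/15/15 split
X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train, test_size = 0.1765, stratify=y_train, random_state=42)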

In [None]:
preprcsd_private =  pd.read_pickle('preprocessed_test.pkl')
X_private = preprcsd_private.loc[:, 'preprcsd']

In [None]:
len(X_train), len(X_cv), len(X_test), len(X_private)

(1263365, 270777, 270732, 97320)
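
The two-step split is easier to reason about when both steps are expressed as shares of the full dataset. The sketch below is only arithmetic on the fractions and split sizes shown above; the variable names are illustrative.

```python
# Split fractions used in the two train_test_split calls
test_frac = 0.15           # first split: share of the full dataset
cv_frac_of_rest = 0.1765   # second split: share of the remaining 85%

# Express both splits as shares of the full dataset
cv_frac = (1 - test_frac) * cv_frac_of_rest
train_frac = (1 - test_frac) * (1 - cv_frac_of_rest)

# Cross-check against the split sizes printed above
n_train, n_cv, n_test = 1263365, 270777, 270732
n_total = n_train + n_cv + n_test

print(f"planned: train={train_frac:.4f} cv={cv_frac:.4f} test={test_frac:.4f}")
print(f"actual:  train={n_train / n_total:.4f} cv={n_cv / n_total:.4f} test={n_test / n_total:.4f}")
```

The private set (97320 rows) comes from a separate file and is not part of these proportions.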

In [None]:
#save data splits
X_train.to_pickle('X_train.pkl')
X_cv.to_pickle('X_cv.pkl')
X_test.to_pickle('X_test.pkl')
X_private.to_pickle('X_private.pkl')

In [None]:
y_train.to_pickle('y_train.pkl')
y_cv.to_pickle('y_cv.pkl')
y_test.to_pickle('y_test.pkl')

## **5. Text Data Vectorization**

**We vectorize text data in the following ways:**
- Bag of Words (BOW)
- TF-IDF
- Average Word2Vec (pre-trained word embeddings)
- Average GloVe (pre-trained word embeddings)

**Note:** Other pre-trained word embeddings can be used as well.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from scipy.sparse import save_npz

In [None]:
X_train = pd.read_pickle('X_train.pkl')
X_cv = pd.read_pickle('X_cv.pkl')
X_test = pd.read_pickle('X_test.pkl')
X_private = pd.read_pickle('X_private.pkl')

## **5.1 Bag of Words (BOW)**

In [None]:
bow = CountVectorizer(min_df=10, ngram_range=(1,2), max_features=15000)
bow.fit(X_train['preprcsd'].values)

X_train_bow = bow.transform(X_train['preprcsd'].values)
X_cv_bow = bow.transform(X_cv['preprcsd'].values)
X_test_bow = bow.transform(X_test['preprcsd'].values)
X_private_bow = bow.transform(X_private.values)

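# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
words_bow = bow.get_feature_names_out()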
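
The vectorizer is fitted on the training split only, so tokens that first appear in the CV, test, or private data have no column and are dropped at transform time. A minimal sketch on a hypothetical toy corpus (the documents and the `toy_bow` name are made up for illustration, and `min_df`/`max_features` are left at their defaults):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the preprocessed comments
train_docs = ["haha guy bunch loser", "urgent design problem", "guy design"]
new_docs = ["loser guy unseenword"]

# Unigrams and bigrams, as in the notebook
toy_bow = CountVectorizer(ngram_range=(1, 2))
toy_bow.fit(train_docs)

# The vocabulary is fixed after fit; transform only counts known n-grams
new_matrix = toy_bow.transform(new_docs)
vocab = toy_bow.vocabulary_  # maps each n-gram to its column index

print(new_matrix.shape)
print("unseenword" in vocab)
```

Here only `loser` and `guy` are counted for the new document; `unseenword` and the bigrams containing it are ignored.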



In [None]:
# CountVectorizer returns a sparse matrix, so we store it with save_npz
save_npz('X_train_bow.npz',X_train_bow)
save_npz('X_cv_bow.npz',X_cv_bow)
save_npz('X_test_bow.npz',X_test_bow)
save_npz('X_private_bow.npz',X_private_bow)
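
For reading the saved matrices back in a later notebook, `scipy.sparse.load_npz` is the counterpart of `save_npz`. A minimal round-trip sketch on a hypothetical toy matrix, written to a temporary directory so it does not touch the real files:

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, load_npz, save_npz

# Hypothetical toy count matrix (2 documents, 3 features)
toy = csr_matrix(np.array([[0, 2, 0], [1, 0, 0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_bow.npz")
    save_npz(path, toy)        # stores only the non-zero entries
    restored = load_npz(path)  # returns a sparse matrix in the saved format

print(restored.shape, restored.nnz)
```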

## **5.2 TF-IDF**


In [None]:
tfidf = TfidfVectorizer(min_df=10, ngram_range=(1,2), max_features=15000)
tfidf.fit(X_train['preprcsd'].values)

X_train_tfidf = tfidf.transform(X_train['preprcsd'].values)
X_cv_tfidf = tfidf.transform(X_cv['preprcsd'].values)
X_test_tfidf = tfidf.transform(X_test['preprcsd'].values)
X_private_tfidf = tfidf.transform(X_private.values)

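# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
words_tfidf = tfidf.get_feature_names_out()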
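
Unlike raw counts, TF-IDF down-weights terms that occur in many documents, and with scikit-learn's default settings each row is L2-normalized. A minimal sketch on a hypothetical toy corpus (documents and names are illustrative, default parameters rather than the notebook's):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# "guy" appears in every document, "loser" in only one
docs = ["guy loser", "guy design", "guy problem"]

toy_tfidf = TfidfVectorizer()
weights = toy_tfidf.fit_transform(docs)
vocab = toy_tfidf.vocabulary_

row0 = weights[0].toarray().ravel()
print("guy:  ", row0[vocab["guy"]])
print("loser:", row0[vocab["loser"]])
print("row norm:", np.linalg.norm(row0))
```

In the first document both words occur once, yet the rarer `loser` receives the larger weight.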

In [None]:
save_npz('X_train_tfidf.npz',X_train_tfidf)
save_npz('X_cv_tfidf.npz',X_cv_tfidf)
save_npz('X_test_tfidf.npz',X_test_tfidf)
save_npz('X_private_tfidf.npz',X_private_tfidf)

## **5.3 Average Word2Vec**
(Reference: https://stackoverflow.com/questions/29760935/how-to-get-vector-for-a-sentence-from-the-word2vec-of-tokens-in-sentence)

In [None]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')

In [None]:
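# Adapted from http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/
from tqdm.auto import tqdm  # progress bar (imported here in case it is not already loaded)

def sent2vec(data, embedding, dim=300):
    """Return one averaged word vector per text; unknown words are skipped."""
    texts = data.values
    text2vec = []

    for text in tqdm(texts):
        vector = np.zeros(dim)
        words_known = 0
        # sum the vectors of all words found in the embedding
        for word in text.split(' '):
            try:
                vector += embedding[word]
                words_known += 1
            except KeyError:  # out-of-vocabulary word
                pass
        # average; texts with no known words keep the all-zero vector
        if words_known != 0:
            vector /= words_known
        text2vec.append(vector)

    return np.array(text2vec)  # shape: (len(data), dim)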
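
To make the averaging concrete, here is a minimal self-contained sketch with a hypothetical 3-dimensional toy embedding. `average_vector` and `toy_embedding` are illustrative names, not part of the notebook; the logic mirrors the summing and averaging steps above for a single text.

```python
import numpy as np

# Hypothetical toy embedding: two known words, 3 dimensions each
toy_embedding = {
    "haha": np.array([1.0, 0.0, 0.0]),
    "loser": np.array([0.0, 1.0, 0.0]),
}

def average_vector(text, embedding, dim):
    """Average the vectors of known words; return zeros if none are known."""
    vectors = [embedding[word] for word in text.split() if word in embedding]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

known = average_vector("haha unknownword loser", toy_embedding, dim=3)
empty = average_vector("unknownword", toy_embedding, dim=3)

print(known)
print(empty)
```

Out-of-vocabulary words do not count towards the denominator, so `known` is the mean of two vectors, not three.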

In [None]:
X_train_w2v = sent2vec(X_train['preprcsd'], wv)
X_cv_w2v = sent2vec(X_cv['preprcsd'], wv)
X_test_w2v = sent2vec(X_test['preprcsd'], wv)
X_private_w2v = sent2vec(X_private, wv)

In [None]:
np.save('X_train_w2v.npy', X_train_w2v)
np.save('X_cv_w2v.npy', X_cv_w2v)
np.save('X_test_w2v.npy', X_test_w2v)
np.save('X_private_w2v.npy', X_private_w2v)

## **5.4 Average GloVe**

In [None]:
print('Indexing word vectors.')

# Each line of the GloVe file holds a word followed by its 300 vector components
glove = {}
with open('glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        glove[word] = coefs

print('Found %s word vectors.' % len(glove))

Indexing word vectors.
Found 400000 word vectors.
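
The GloVe text format is one word per line followed by its space-separated vector components. A minimal sketch of the same parsing on a made-up two-line sample held in memory (the words and values are invented and 3-dimensional, not taken from `glove.6B.300d.txt`):

```python
import io

import numpy as np

# Made-up sample in the GloVe text layout: "<word> <v1> <v2> ... <vN>"
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.4 0.5 0.6\n")

toy_glove = {}
for line in sample:
    parts = line.split()
    # First token is the word, the rest are the vector components
    toy_glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

print(len(toy_glove), toy_glove["cat"].shape)
```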


In [None]:
X_train_glove = sent2vec(X_train['preprcsd'], glove)
X_cv_glove = sent2vec(X_cv['preprcsd'], glove)
X_test_glove = sent2vec(X_test['preprcsd'], glove)
X_private_glove = sent2vec(X_private, glove)

In [None]:
np.save('X_train_glove.npy', X_train_glove)
np.save('X_cv_glove.npy', X_cv_glove)
np.save('X_test_glove.npy', X_test_glove)
np.save('X_private_glove.npy', X_private_glove)

## **6. References**
- https://www.kdnuggets.com/2019/01/solve-90-nlp-problems-step-by-step-guide.html (NLP guide)
- https://www.kaggle.com/code/christofhenkel/how-to-preprocessing-for-glove-part1-eda (handling misspelled words)
- https://www.kaggle.com/code/christofhenkel/how-to-preprocessing-when-using-embeddings/notebook (handling misspelled words)
- https://regexr.com/ (regex guide and testing)