# Jigsaw Unintended Bias in Toxicity Classification Kaggle Challenge 
A Problem-Solving Approach by A-Team (Team 5 HWR)

## Table of Content
- [1. Introduction](#1.-Introduction)
    - [1.1 Understanding the Challenge of the Competition](#1.1-Understanding-the-Challenge-of-the-Competition)
    - [1.2 Understanding the Meaning of the Data](#1.2-Understanding-the-Meaning-of-the-Data)
- [2. Data Loading](#2.-Data-Loading)
    - [2.1 Package Import and Dataset Loading](#2.1-Package-Import-and-Dataset-Loading)
    - [2.2 Memory Reduction](#2.2-Memory-Reduction)
- [3. Exploratory Data Analysis of Datasets](#3.-Exploratory-Data-Analysis-of-Datasets)
    - [3.1 Overview of Train and Test Data](#3.1-Overview-of-Train-and-Test-Data) 
    - [3.2 Examples of Comments](#3.2-Examples-of-Comments) 
    - [3.3 Deepdive into Datasets](#3.3-Deepdive-into-Datasets) 
- [4. Data Preprocessing](#4.-Data-Preprocessing)
    - [4.1 Loading of Word Embeddings](#4.1-Loading-of-Word-Embeddings)
    - [4.2 Text Preprocessing](#4.2-Text-Preprocessing)
- [5. Modeling](#5.-Modeling)
    - [5.1 Model Definition with Keras](#5.1-Model-Definition-with-Keras)
    - [5.2 Training](#5.2-Training)
    - [5.3 Testing](#5.3-Testing)
    - [5.4 Prediction](#5.4-Prediction)
- [6. Submission](#6.-Submission)
- [7. Conclusion](#7.-Conclusion)
- [References](#References)

## 1. Introduction

This notebook outlines one possible approach to the Jigsaw Unintended Bias in Toxicity Classification challenge on Kaggle. The individual chapters can be reached via the [Table of Content](#Table-of-Content) above. The report starts with an introduction to the challenge and an overview of the provided datasets. Afterwards, the required libraries and packages are loaded and a function to reduce memory usage is applied before the datasets are explored in the Exploratory Data Analysis (EDA). After the EDA, the text is pre-processed and word embeddings are used to vectorize the comments. In chapter 5, `Keras` is used as the underlying Python deep learning library for modeling. A `train-test-split` is then performed before predicting the toxicity score and creating a `submission.csv`. The image below gives an overview of the problem-solving approach.

In [None]:
# Overview of problem solving approach to solve the Jigsaw challenge
from IPython.display import Image
Image("../input/problemsolvingapproach/psa.jpg")

All kernels which have been used as inspiration or advice throughout the project will be referenced in the [References](#References) section as well as other sources to understand theory, applications and different libraries.

### 1.1 Understanding the Challenge of the Competition

The objective of the "Jigsaw Unintended Bias in Toxicity Classification" challenge is to detect potentially toxic comments in online conversations while minimizing unintended model bias. The organizers of the challenge define "toxicity" as "anything rude, disrespectful or otherwise likely to make someone leave a discussion" ([Kaggle](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification), 2019).

Building on a previous challenge, the current one aims at toxicity models that work across a wide range of online conversations. In particular, a model has to distinguish between neutral and toxic uses of words that refer to identities, such as religion or sex. Analyzing word frequencies alone is therefore not enough to determine the toxicity level, which is why this second challenge additionally asks participants to minimize the unintended bias of their models.

### 1.2 Understanding the Meaning of the Data

To begin with, we are provided with a dataset that is already labeled for these identity mentions, and we are asked to optimize a metric that measures unintended bias.

In [None]:
#Package Import to have a first look at the target dataset (sample_submission)
import pandas as pd

In [None]:
# Load sample_submission.csv into a pandas DataFrame
submission = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/sample_submission.csv")
print(submission.head())
print(submission.shape)

Before importing and analyzing the train and test datasets we wanted to take a look at our final goal - the `submission.csv` - and understand what should be predicted. One can see that we need a total of 97,320 rows, each with a unique ID for which the toxicity is predicted.

At this stage it is important to explain how the result is evaluated by the organizers of the challenge. The submission is scored with the "Jigsaw Bias AUC", a metric newly developed by Jigsaw.
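
As a rough sketch of how the final score is assembled (this is our reading of the competition's evaluation page, not the official scoring code): the overall ROC-AUC is combined with three generalized power means of per-subgroup bias AUCs (Subgroup AUC, BPSN AUC and BNSP AUC), using `p = -5` and equal weights of 0.25. The function names and all AUC values below are hypothetical and only serve as an illustration.

```python
def power_mean(values, p=-5):
    """Generalized power mean; a negative p pulls the result towards the lowest values."""
    return (sum(v ** p for v in values) / len(values)) ** (1 / p)


def final_bias_score(overall_auc, subgroup_aucs, bpsn_aucs, bnsp_aucs, p=-5, weight=0.25):
    """Combine the overall AUC with the power means of the three bias AUC families."""
    bias_terms = [power_mean(aucs, p) for aucs in (subgroup_aucs, bpsn_aucs, bnsp_aucs)]
    return weight * overall_auc + weight * sum(bias_terms)


# Hypothetical per-subgroup AUCs for three identity subgroups
example_score = final_bias_score(
    overall_auc=0.95,
    subgroup_aucs=[0.90, 0.85, 0.80],
    bpsn_aucs=[0.92, 0.88, 0.84],
    bnsp_aucs=[0.94, 0.93, 0.91],
)
print(round(example_score, 4))
```

Because of the negative exponent, a single subgroup with a poor AUC lowers the score more than it would under a plain average, which is what pushes a model towards treating all subgroups evenly.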

Besides the `sample_submission.csv` the challenge provides:
- `train.csv` - the training set, which includes subgroups
- `test.csv` - the test set, which does not include subgroups

## 2. Data Loading

### 2.1 Package Import and Dataset Loading

In [None]:
# Import necessary libraries and packages for the complete project

# Import standard libraries for data loading, EDA and data preprocessing
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
import time
import gc
import datetime
import scipy.stats as stats
import operator 
import re
from wordcloud import WordCloud, STOPWORDS
sns.set_style('whitegrid')


# Import NLP libraries for data preprocessing and modeling
from gensim.models import KeyedVectors
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding
from keras.layers import Input
from keras.layers import MaxPooling1D
from keras.layers import Flatten
from keras.layers import Dropout
from keras.layers import Dense
from keras.layers import LeakyReLU
from keras.layers import Bidirectional
from keras.layers import GlobalAveragePooling1D
from keras.optimizers import RMSprop
from keras.layers import CuDNNLSTM
from keras.layers import Convolution1D
from keras.models import Model
from keras.models import load_model
from keras.models import Sequential


# Import libraries for train and test split and validation
from sklearn import metrics
from sklearn import model_selection
from sklearn.model_selection import train_test_split

# Print folders in input folder - including additional datasets used in the project
print(os.listdir("../input/"))

In [None]:
%%time
# Load train and test datasets into pandas DataFrames and print their shapes
# Adapt nrows depending on CPU/GPU available

# Load dataset with 200,000 rows
#train = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv", nrows=200000)
#test = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/test.csv", nrows=200000)

# Load dataset with 500,000 rows
# train = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv", nrows=500000)
# test = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/test.csv", nrows=500000)

# Load dataset with 700,000 rows
train = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv", nrows=700000)
test = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/test.csv", nrows=700000)

# Load entire dataset
# train = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv")
# test = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/test.csv")

print("Train shape : ",train.shape)
print("Test shape : ",test.shape)

### 2.2 Memory Reduction

Before exploring the datasets, a function is applied to reduce memory usage. A similar approach was used in the Elo Kaggle challenge last semester.

In [None]:
# Define a function to reduce memory usage to have faster processing later on
def reduce_mem_usage(df, verbose=True):
    """ This function iterates through columns of a dataframe in order to reduce the memory.        
    """
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [None]:
# Apply the function to reduce memory usage to the train and test dataset
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)

Applying the function reduces memory usage by 69.4% for the `train` dataset and by 25% for the `test` dataset.
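
One caveat worth checking: the function also downcasts `float` columns to `float16`, which keeps only about three significant decimal digits. The following minimal sketch uses only the standard library (the `struct` format `e` is IEEE 754 half precision) to show the rounding for a hypothetical toxicity value. Whether this loss matters for the `target` column is an assumption that should be verified on the real data.

```python
import struct


def to_float16(value):
    """Round-trip a Python float through IEEE 754 half precision (float16)."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


original = 0.166667  # hypothetical toxicity score
stored = to_float16(original)
print(original, "->", stored, "| absolute error:", abs(original - stored))
```

If the error turns out to be relevant, one option would be to exclude `target` from the downcast or to stop at `float32`.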

## 3. Exploratory Data Analysis of Datasets

### 3.1 Overview of Train and Test Data

Before diving deeper into the datasets and analyzing their components, it is useful to first get a glance at both the `train` and `test` datasets with the common functions `.head()`, `.info()` and `.describe()`.

In [None]:
# Have a look at the head of the train dataset
train.head(5)

In [None]:
# Have a look at different datatypes in the train dataset with .info()
train.info()

In [None]:
# Describe the train dataset
train.describe()

A first look at the `train` dataset shows that it has 45 columns in total. The data types are `float`, `int` and `object`; the key `object` column is `comment_text`, which is the basis for the text pre-processing and modeling later on. Most of the other columns describe identity subgroups, such as gender (e.g. `female`), religion (e.g. `buddhist` or `christian`) or race (e.g. `asian` or `latino`), among others.

In [None]:
# Have a look at the head of the test dataset
test.head()

In [None]:
# Have a look at different datatypes in the test dataset with .info()
test.info()

In [None]:
# Describe the column "comment_text" in the test dataset
test["comment_text"].describe()

The `test` dataset includes two columns, `id` and `comment_text`.  

### 3.2 Examples of Comments

As mentioned in the introduction, the objective of the challenge is not only to identify toxic content but, more importantly, to score its toxicity. The main difficulty is to tell whether a potentially offensive word is used in a harmless or in a hostile context. In the following, two example comments are discussed: one that is clearly non-toxic and one that is harder to classify.

In [None]:
# Display selected entries for "comment_text" to visually inspect their potential for toxicity
pd.options.display.max_colwidth=200
train["comment_text"][:12]

Taking a look at row 3, for instance, the comment `"Is this something I'll be able to install on my site? When will you be releasing it?"` is very neutral. It could, for example, be a question a user posts in a forum, without any emotional charge. Therefore the toxicity level has been defined as 0. 
However, the comment in row 11, `"This is a great story. Man. I wonder if the person who yelled "shut the fuck up!" at him ever heard it."`, is certainly more difficult to classify. Even though the comment contains an insult, it does not directly attack anyone. Rather, it suggests that the person who uttered the insult should learn from the story.

### 3.3 Deepdive into Datasets 

After a first look at the `train` and `test` datasets and the challenge of classifying toxic comments, a more detailed exploration of the data follows in the coming cells.

In [None]:
# Compute the share of missing values per column
mval_train = train.isnull().sum(axis=0) / len(train)

# Filter out columns with zero missing values
mval_train = mval_train[mval_train > 0]
mval_train

The analysis of missing values reveals that several columns have almost 80% missing values.

In [None]:
# Check distribution of # of words in the comment_text column
nr_words = train["comment_text"].apply(lambda x: len(x) - len(''.join(x.split())) + 1)
train['nr_words'] = nr_words
nr_words = train.loc[train['nr_words']<200]['nr_words']
sns.distplot(nr_words, color='b')
plt.xlabel('Number of words per comment')
plt.show()

The distribution of the number of words is right-skewed: most comments consist of 0-50 words, while far fewer contain between 50 and 200 words.
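
Note that the expression in the cell above counts whitespace characters plus one, which only approximates the number of words. The following sketch compares it with a token-based count on a hypothetical comment. The function names are ours, and the token-based variant is merely one possible alternative.

```python
def whitespace_based_count(text):
    """Count used in the cell above: number of whitespace characters plus one."""
    return len(text) - len("".join(text.split())) + 1


def token_based_count(text):
    """Alternative: number of whitespace-separated tokens."""
    return len(text.split())


# Hypothetical comment with a double space, a newline and a trailing space
sample = "This  is a\ncomment. "
print(whitespace_based_count(sample), token_based_count(sample))
```

For comments with single spaces and no trailing whitespace both counts agree, so the overall shape of the distribution is unlikely to change much.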

In [None]:
sns.set_style('white')

# Bin the target variable into 20 intervals to plot the distribution of toxicity scores below
hist_df = pd.cut(train['target'], 20).value_counts().sort_index().reset_index().rename(columns={'index': 'bins'})
hist_df['percentage'] = (hist_df['target']/hist_df['target'].sum())*100
hist_df

plt.figure(figsize=(12,4))

# Plot the toxicity distribution of toxic comments (toxicity score)
plt.subplot(1,2,1)
sns.barplot(x=hist_df['bins'],y=hist_df['target'])
plt.xticks(rotation='vertical')
plt.title('Toxicity Distribution')
plt.xlabel('Toxicity Score')
plt.ylabel('Comment Count')

# Plot the toxicity percentage of these comments
plt.subplot(1,2,2)
sns.barplot(x=hist_df['bins'],y=hist_df['percentage'])
plt.xticks(rotation='vertical')
plt.title('Toxicity Percentage')
plt.xlabel('Toxicity Score')
plt.ylabel('Percentage')

In [None]:
# Apply a 0.5 threshold to have a deeper look at actual toxic comments
len(train[train.target>0.5])/len(train.target)

The graph above shows that about 70% of the comments in the `train` dataset are not toxic at all. Taking 0.5 as the threshold for high toxicity, only about 5% of all comments are actually toxic. This class imbalance affects the text preprocessing and modeling and is kept in mind throughout the kernel so that the final results are not misinterpreted. There are different methods to deal with this issue. In this kernel, the dataset is downsampled for modeling and a train-test-split is conducted before predicting the target.
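
To illustrate what downsampling means here, the following sketch keeps all toxic rows and draws a random sample of the non-toxic ones. It is a simplified, standard-library illustration on toy data. The function name, the `ratio` parameter and the toy rows are hypothetical and not necessarily what the modeling cells use.

```python
import random


def downsample_majority(rows, is_toxic, ratio=1.0, seed=42):
    """Keep all toxic rows and a random sample of non-toxic rows.

    ratio is the number of non-toxic rows kept per toxic row.
    """
    toxic = [row for row in rows if is_toxic(row)]
    non_toxic = [row for row in rows if not is_toxic(row)]
    n_keep = min(len(non_toxic), int(len(toxic) * ratio))
    rng = random.Random(seed)
    balanced = toxic + rng.sample(non_toxic, n_keep)
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: (comment_id, target) pairs with 5% toxic comments
rows = [(i, 0.8 if i % 20 == 0 else 0.0) for i in range(1000)]
balanced = downsample_majority(rows, is_toxic=lambda row: row[1] > 0.5)
print(len(rows), "->", len(balanced))
```

Downsampling trades training data for balance, so the evaluation should still be run on data with the original class distribution.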

In [None]:
sns.set_style('white')

# Truncate "created_date" to the month to investigate the count and average toxicity of comments over time
train['created_date'] = pd.to_datetime(train['created_date']).values.astype('datetime64[M]')
comment_count = train.groupby(['created_date'])['target'].count().sort_index().reset_index()
comment_mean = train.groupby(['created_date'])['target'].mean().sort_index().reset_index()

# Create a first subplot 
fig, ax1 = plt.subplots(figsize=(12,4))
x_range = comment_count['created_date']

# Create and plot the x-axis with Date and plot the number of comments
color = 'tab:red'
ax1.set_xlabel('Date')
ax1.set_ylabel('# of Comments')
ax1.plot(x_range, comment_count['target'], color=color, label="# of comments")
plt.xticks(rotation='vertical')
ax1.tick_params(axis='y')
ax1.legend(loc="lower right")
plt.title("Count and Toxicity of Comments")

# Instantiate a second axes that shares the same x-axis and plot the average toxicity
ax2 = ax1.twinx()  
color = 'tab:blue'
ax2.set_ylabel('Avg Toxicity') 
ax2.plot(x_range, comment_mean['target'], color=color, label="avg toxicity")
plt.xticks(rotation='vertical')
ax2.legend(loc="upper left")

# Plot the result
plt.show()

The average toxicity of the comments began to increase steadily in February 2016, while the number of comments remained stable during the same period. After October 2016, the number of comments rose more strongly, followed by a drop in October 2017. Between April 2016 and October 2017, the average toxicity fluctuated slightly around a value of 10%.

In [None]:
# Checking first and last comment on the timeline to validate the previous graph
print('Date of first comment in train dataset:', train['created_date'].min())
print('Date of last comment in train dataset:', train['created_date'].max())

As the graph suggested, the first comments of the dataset date from September 2015 and the last ones from November 2017 (the dates were truncated to the first day of each month above). The drop on the right side of the graph is therefore not surprising and need not be considered relevant for the further analysis.

In [None]:
# Plot the value counts of the columns severe_toxicity, obscene, threat, insult, identity_attack and sexual_explicit
tox_sub = ['severe_toxicity', 'obscene', 'threat', 'insult', 'identity_attack', 'sexual_explicit']
chart = 0
fig = plt.figure(figsize=(12,8))

for col in tox_sub:
    df_subtox = train.loc[train[col] > 0]
    hist_subtox = pd.cut(df_subtox[col], 20).value_counts().sort_index().reset_index().rename(columns={'index': 'bins'})
    chart += 1
    plt.subplot(3,2,chart)
    plt.plot(hist_subtox[col])
    plt.title(str(col))
    plt.xlabel('Bins')
    plt.ylabel('Count')
    fig.tight_layout()

The most common types of subtoxicity are `insult` with about 200,000 comments and `identity_attack` with about 80,000 comments among all the datapoints in the dataset.

In [None]:
# Plot the weighted toxicity for different demographics
demographics = train.loc[:, ['target']+list(train)[slice(8,32)]].dropna()
weighted_toxic = demographics.iloc[:, 1:].multiply(demographics.iloc[:, 0], axis="index").sum()/demographics.iloc[:, 1:][demographics.iloc[:, 1:]>0].count()
weighted_toxic = weighted_toxic.sort_values(ascending=False)
plt.figure(figsize=(15,7.5))
sns.set(font_scale=1)
ax = sns.barplot(x = weighted_toxic.values, y = weighted_toxic.index, alpha=0.8)
plt.ylabel('Demographics')
plt.xlabel('Weighted Toxicity')
plt.show()

The highest weighted toxicity is found for comments that mention skin color (`black` or `white`), sexual orientation (`homosexual_gay_or_lesbian`) and religion (`muslim`, `jewish`, `atheist`), with `homosexual_gay_or_lesbian` being the dominant subgroup.
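
Read from the code above, the weighted toxicity of a subgroup $g$ can be written as follows. The symbols are ours: $t_i$ is the `target` of comment $i$, $s_{i,g}$ is its identity label for subgroup $g$, and only comments without missing identity labels are included.

$$
w_g = \frac{\sum_{i} t_i \, s_{i,g}}{\left|\{\, i : s_{i,g} > 0 \,\}\right|}
$$

As a small made-up example, two comments that mention $g$ with $s = 1.0$ and $s = 0.5$ and targets $0.8$ and $0.2$ give $w_g = (0.8 \cdot 1.0 + 0.2 \cdot 0.5) / 2 = 0.45$.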

Inspired by a DataCamp article by Vu (2018), the next part takes a closer look at wordclouds in order to visually explore the predominant words for different subsets of the data. The results are illustrated in the following cells.

In [None]:
# Select the text to be illustrated in the WordCloud
text = train["comment_text"]

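# Create and show a wordcloud image with max_font_size=50 and max_words=100
# Note: str(text) would only render the truncated Series repr, so the comments are joined explicitly
wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white", stopwords=STOPWORDS).generate(" ".join(text.astype(str)))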
plt.figure(figsize=(10,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

The first wordcloud shows the predominant words in the comments of the `train` dataset. Since the full dataset has 1,804,874 rows, the result includes common words such as "great", "right", "think" or "even". To create wordclouds for different filters, a function is defined in the next cell.

In [None]:
# Wrap the previous wordcloud code in a function so it can be applied to different texts
def show_wordcloud(data, title=None):
    # Join the comments explicitly; str(data) would only use the truncated Series repr
    text = " ".join(data.astype(str))
    wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white", stopwords=STOPWORDS).generate(text)
    plt.figure(figsize=(10,10))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    if title:
        plt.title(title)
    plt.show()

In [None]:
# Create and show a wordcloud for the first 50.000 comments
show_wordcloud(train["comment_text"][:50000])

The wordcloud for the first 50,000 comments shows a predominance of "Trump", "people" and "Civil". Considering that the comments in the `train` dataset start in September 2015, as illustrated earlier, it is plausible that there were many discussions about Trump, a candidate in the 2016 presidential election.

In [None]:
# Create and show a wordcloud for comments 50,000 to 100,000
show_wordcloud(train["comment_text"][50000:100000])

Filtering the comments from row 50,000 to 100,000 yields a different wordcloud. Predominant words are "people", "state", "Alaska" etc. However, slicing the rows in chronological order does not lead to meaningful conclusions, as the row order alone says nothing about toxicity. Therefore, in a next step, wordclouds are generated for different toxicity levels by taking the `target` into account.

In [None]:
# Create and show a wordcloud for target < 0.25
show_wordcloud(train.loc[train["target"] < 0.25]["comment_text"])

In [None]:
# Create and show a wordcloud for target > 0.50
show_wordcloud(train.loc[train["target"] > 0.50]["comment_text"])

In [None]:
# Create and show a wordcloud for target > 0.75
show_wordcloud(train.loc[train["target"] > 0.75]["comment_text"])

Looking at the most common words for different toxicity levels reveals the word "comment" as predominant when the toxicity level is very low (`< 0.25`). For thresholds of `> 0.5` or even `> 0.75`, however, the most common words are, as expected, increasingly negative and insulting, such as "idiot", "stupid" or "fool". Also remarkable in the last wordcloud is the appearance of "Trump" in the context of high toxicity, meaning that his name frequently appears together with insults in online conversations. Considering his media presence and the criticism he received before and after the election, this outcome is to be expected.

## 4. Data Preprocessing

The EDA (Exploratory Data Analysis) provided an overview of the available datasets and gave an insight into important words, phrases and their evaluation. In the next step, the data preprocessing uses these insights to prepare a suitable set of data for the model.  
Part of the preprocessing is to remove punctuation. This transforms the comments (column: `comment_text`) into sequences of words, which are then split into lists of tokens. Moreover, special characters such as `!"#$%&()*+,-./:;<=>?` are removed. Thereafter, CBOW or Skip-gram embeddings can be used to represent the words. The main challenge of the preprocessing lies here: the model needs a representation that captures context without bias so that it can also classify more difficult sentences. One way to approach this is Keras text preprocessing, which turns each comment into a sequence of integers (each integer referencing a token in a dictionary). These integer sequences can then be mapped to vectors via pretrained embeddings. Combining embeddings and Keras in this way may improve what the model can learn from the `train` data set.
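
To make the idea of turning comments into integer sequences more concrete, the following plain-Python sketch imitates the general principle behind the Keras `Tokenizer` and `pad_sequences`: words receive integer ids by descending frequency, and every sequence is padded or truncated to a fixed length. It is a simplified illustration, not the Keras implementation. The function names and example comments are hypothetical.

```python
from collections import Counter


def fit_word_index(texts):
    """Assign integer ids by descending word frequency, starting at 1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_padded_sequences(texts, word_index, maxlen):
    """Map words to ids, drop unknown words, then left-pad or left-truncate to maxlen."""
    padded = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        seq = seq[-maxlen:]  # keep the last maxlen tokens
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded


comments = ["you are great", "you are not great at all"]  # hypothetical comments
word_index = fit_word_index(comments)
print(word_index)
print(texts_to_padded_sequences(comments, word_index, maxlen=4))
```

Each integer can then be used as a row index into an embedding matrix, which is how the pretrained word vectors enter the model.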

### 4.1 Loading of Word Embeddings

In [None]:
%%time
# Load word embeddings from "fasttext-crawl-300d-2m" dataset which has been imported to the input workspace before
fasttext = '../input/fasttext-crawl-300d-2m/crawl-300d-2M.vec'
embeddings_index = KeyedVectors.load_word2vec_format(fasttext)
gc.collect()

In this competition, we rely on fastText, an open-source, free, lightweight library for learning text representations and text classifiers. To simplify the text preprocessing, this kernel uses the `fasttext-crawl-300d-2m` dataset, which contains 300-dimensional pretrained fastText English word vectors released by Facebook. The dataset has been added to the `input` folder of the Kaggle workspace and can be accessed with the loading statement shown above.

`KeyedVectors` is used as the structure that maps entities to vectors. In this case, each entity corresponds to a word.
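
To give an impression of what such a word-to-vector mapping contains, the sketch below parses a tiny, made-up file in the word2vec text format (a header line with word count and dimension, followed by one word and its vector per line) into a plain dictionary. The helper `load_text_vectors` and the toy vectors are hypothetical. The kernel itself relies on `KeyedVectors.load_word2vec_format` for the real file.

```python
import io

# Hypothetical miniature file in word2vec text format: header "<word count> <dimensions>"
toy_vec_file = io.StringIO(
    "3 4\n"
    "the 0.1 0.2 0.3 0.4\n"
    "toxic -0.5 0.0 0.5 1.0\n"
    "comment 0.9 0.8 0.7 0.6\n"
)


def load_text_vectors(handle):
    """Parse word2vec text format into a plain dict {word: [floats]}."""
    n_words, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    # Sanity checks against the header line
    assert len(vectors) == n_words
    assert all(len(vec) == dim for vec in vectors.values())
    return vectors


toy_embeddings = load_text_vectors(toy_vec_file)
print("toxic" in toy_embeddings, toy_embeddings["toxic"])
```

The membership test (`word in embeddings`) and the lookup by word are the two operations the preprocessing functions later in this chapter depend on.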

To track the loading time, `%%time` has been applied to the loading cell, too. Due to the size of the dataset, this step is already quite time-consuming. Since the kernel runs on Kaggle, the dataset has to be loaded each time the kernel is committed.

### 4.2 Text Preprocessing

As an important step of the text preprocessing, the `id` and `comment_text` columns of the `train` and `test` dataframes are concatenated, and `train` and `test` are then deleted for memory reasons. This can be seen below.

In [None]:
%%time
# Concatenate train_id and comment_text for further data processing
df = pd.concat([train[['id','comment_text']], test], axis=0)
del(train, test)
gc.collect()

To proceed with the text preprocessing, functions to build the vocabulary and to check how many words are covered by the embedding are defined first and applied later on.

In [None]:
# Define a function to build the vocabulary
def build_vocab(texts):
    sentences = texts.apply(lambda x: x.split()).values
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab

In [None]:
# Define a function to check the amount of words that can be found in the embedding
def check_coverage(vocab, embeddings_index):
    known_words = {}
    unknown_words = {}
    nb_known_words = 0
    nb_unknown_words = 0
    for word in vocab.keys():
        try:
            known_words[word] = embeddings_index[word]
            nb_known_words += vocab[word]
        except KeyError:
            # Word is not part of the embedding vocabulary
            unknown_words[word] = vocab[word]
            nb_unknown_words += vocab[word]

    print('Found embeddings for {:.3%} of vocab'.format(len(known_words) / len(vocab)))
    print('Found embeddings for  {:.3%} of all text'.format(nb_known_words / (nb_known_words + nb_unknown_words)))
    unknown_words = sorted(unknown_words.items(), key=operator.itemgetter(1))[::-1]

    return unknown_words

In [None]:
%%time
# Convert text to lower case
df['comment_text'] = df["comment_text"].apply(lambda x: x.lower())
# Dump memory collection
gc.collect()

The previous code converts all capital letters to lower case so that the functions to build the vocabulary (`build_vocab`) and to check the coverage of words found in the embedding (`check_coverage`) can be applied.
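
The two coverage figures reported by `check_coverage` measure different things: the share of distinct words found in the embedding and the share of all word occurrences found in the embedding. The following sketch reproduces both on toy data, using a `Counter` and a plain set as a stand-in for the embedding vocabulary. The texts and the set of known words are hypothetical.

```python
from collections import Counter

texts = ["you are great", "you're not great"]  # hypothetical, already lower-cased
known = {"you", "are", "great", "not"}         # hypothetical stand-in for the embedding vocabulary

# Word frequencies, equivalent in spirit to build_vocab
vocab = Counter(word for text in texts for word in text.split())
# Out-of-vocabulary words with their frequencies
oov = {word: n for word, n in vocab.items() if word not in known}

vocab_coverage = 1 - len(oov) / len(vocab)
text_coverage = 1 - sum(oov.values()) / sum(vocab.values())
print(f"vocab coverage: {vocab_coverage:.1%}, text coverage: {text_coverage:.1%}, oov: {oov}")
```

Text coverage is usually the more relevant number, because a frequent out-of-vocabulary word hurts more than many rare ones.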

In [None]:
%%time
# Build the vocabulary and check the coverage in the FastText embedding
vocab = build_vocab(df['comment_text'])
oov = check_coverage(vocab, embeddings_index)

In [None]:
# Dump memory collection
gc.collect()

In order not to overload the memory of this kernel, `gc.collect()` is applied after each preprocessing step, as seen above. This call "collects the garbage" and frees memory for the next calculation steps.

In [None]:
# Print the first 20 out-of-vocabulary words to decide how to proceed
oov[:20]

It can be seen that there are a lot of contractions in the `top 20` `oov` results. A mapping helps here: contractions are replaced by their long form, which is more likely to exist in the `FastText` embedding. The contraction mapping is defined in the next cell before `vocab` and `oov` are deleted and recalculated.

In [None]:
# Define a contraction mapping as a dictionary
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", 
"we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have" }

In [None]:
# Delete the vocab and oov objects so they can be recalculated
del(vocab,oov)
# Dump memory collection
gc.collect()

After deleting the `vocab` and `oov` objects for recalculation, in the next step a function will be defined to calculate the amount of known contractions in the `FastText` embedding.

In [None]:
# Define a function to calculate the amount of known contractions in the FastText embedding
def known_contractions(embed):
    known = []
    for contract in contraction_mapping:
        if contract in embed:
            known.append(contract)
    return known

print("The known contractions in the FastText embedding are: ")
print(known_contractions(embeddings_index))

The result shows the known contractions in the `FastText` embedding. In the following step, another function will be defined which can map and replace the known contractions in `comment_text` so they can be properly embedded in the `FastText` embedding.

In [None]:
# Define a function to map and replace known contractions in comment_text
def clean_contractions(text, mapping):
    spec_characters = ["’", "‘", "´", "`"]
    for s in spec_characters:
        text = text.replace(s, "'")
    text = ' '.join([mapping[t] if t in mapping else t for t in text.split(" ")])
    return text
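
The behaviour of this kind of expansion can be illustrated with a small self-contained sketch. The function `expand_contractions` and the two-entry `toy_mapping` are hypothetical stand-ins that mirror the logic above; they are not the notebook's own objects.

```python
# Toy mapping (assumption): only two contractions
toy_mapping = {"don't": "do not", "it's": "it is"}

def expand_contractions(text, mapping):
    """Normalise apostrophe variants, then expand whitespace-separated tokens."""
    for quote in ["’", "‘", "´", "`"]:
        text = text.replace(quote, "'")
    # Tokens are looked up exactly as they appear between single spaces
    return " ".join(mapping.get(token, token) for token in text.split(" "))

first = expand_contractions("I don’t think it´s fair", toy_mapping)
second = expand_contractions("Don't do it, don't!", toy_mapping)
print(first)
print(second)
```

The second call shows a limitation of exact token lookup: a capitalised token or one with punctuation attached is left unchanged unless the mapping contains that exact form.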

As a next step, the `clean_contractions` function is applied to `comment_text`, and the `vocab` is rebuilt to recalculate the `out of vocab` percentage and the `top 20` words in the `oov`.

In [None]:
%%time
# Apply the clean contractions function on the comment_text fields
df['comment_text'] = df['comment_text'].apply(lambda x: clean_contractions(x, contraction_mapping))

In [None]:
%%time
# Rebuild the vocab and oov
vocab = build_vocab(df['comment_text'])
oov = check_coverage(vocab, embeddings_index)

In [None]:
# Print the first 20 out of vocabulary values again
oov[:20]
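
The helpers `build_vocab` and `check_coverage` are defined earlier in the notebook. As a reminder of the idea behind them, the following is a minimal sketch under assumed behaviour: the `_sketch` functions and the toy data are hypothetical, and unlike the notebook's helper, which is assigned directly to `oov`, this version also returns the coverage ratio.

```python
from collections import Counter

def build_vocab_sketch(texts):
    """Count how often each whitespace-separated token occurs."""
    vocab = Counter()
    for text in texts:
        vocab.update(text.split())
    return vocab

def check_coverage_sketch(vocab, embeddings):
    """Return the share of token occurrences covered and the sorted unknown tokens."""
    known_count = 0
    unknown = {}
    for word, count in vocab.items():
        if word in embeddings:
            known_count += count
        else:
            unknown[word] = count
    total = sum(vocab.values())
    coverage = known_count / total if total else 0.0
    # Most frequent unknown tokens first
    oov_sorted = sorted(unknown.items(), key=lambda item: item[1], reverse=True)
    return coverage, oov_sorted

toy_texts = ["you are not wrong", "you aren't wrong", "they aren't here"]
toy_embeddings = {"you": 1, "are": 1, "not": 1, "wrong": 1, "they": 1, "here": 1}

toy_vocab = build_vocab_sketch(toy_texts)
toy_coverage, toy_oov = check_coverage_sketch(toy_vocab, toy_embeddings)
print(f"coverage: {toy_coverage:.0%}")
print(toy_oov[:20])
```

Sorting the unknown tokens by frequency is what makes the top of the list a useful guide for the next cleaning step.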

The next step in the text pre-processing is the handling of punctuation and special characters. First, however, the `vocab` and `oov` objects have to be deleted for memory reasons, as in the previous text pre-processing steps.

In [None]:
# Delete the vocab and oov objects for memory reasons
del(vocab,oov)
# Run garbage collection to free memory
gc.collect()

In the following, a function is defined that checks which special characters are missing from the `FastText` embedding.

In [None]:
# Define a variable with special characters
punct = "/-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~" + '""“”’' + '∞θ÷α•à−β∅³π‘₹´°£€\×™√²—–&'

# Define a function that returns the special characters missing from the embedding
def punctuation(embed, punct):
    unknown = ''
    for p in punct:
        if p not in embed:
            unknown += p
            unknown += ' '
    return unknown

# Print the result of applying the punctuation function
print(punctuation(embeddings_index, punct))
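
The check can be tried on toy data. In this sketch, `toy_punct`, `toy_embeddings` and `missing_characters` are illustrative assumptions; the actual result in the notebook depends on the loaded `FastText` index.

```python
# Toy stand-ins (assumptions): five characters, three of which have a vector
toy_punct = "?!_`€"
toy_embeddings = {"?": [0.1], "!": [0.2], "€": [0.3]}

def missing_characters(embed, chars):
    """Return the characters from `chars` that have no vector in `embed`."""
    return [char for char in chars if char not in embed]

missing = missing_characters(toy_embeddings, toy_punct)
print(missing)
```

Characters reported as missing are candidates for a replacement dictionary, since keeping them in the text would only produce tokens without an embedding.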

Afterwards, a dictionary is created that maps the unknown special characters to white spaces, together with a function that applies this mapping and pads every punctuation mark and special character in the `comment_text` field with spaces.

In [None]:
# Initiate a dictionary with unknown special characters
unknown_spec = {"_":" ", "`":" "}

# Define a function to map and replace every special character
def clean_special_chars(text, punct, mapping):
    for p in mapping:
        text = text.replace(p, mapping[p])    
    for p in punct:
        text = text.replace(p, f' {p} ')     
    return text
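
A short sketch shows what this padding does to a single comment. The function `pad_special_chars` and the toy inputs are hypothetical and only mirror the two loops above.

```python
# Toy stand-ins (assumptions): a short punctuation string and a two-entry mapping
toy_punct = "!?,."
toy_unknown = {"_": " ", "`": " "}

def pad_special_chars(text, punct, mapping):
    """Replace unknown characters, then surround each punctuation mark with spaces."""
    for char, replacement in mapping.items():
        text = text.replace(char, replacement)
    for char in punct:
        text = text.replace(char, f" {char} ")
    return text

padded = pad_special_chars("snake_case, really?!", toy_punct, toy_unknown)
print(repr(padded))
print(padded.split())
```

The padding leaves runs of spaces behind, which is harmless as long as the text is later split on whitespace. Because the notebook's `punct` string includes the apostrophe, any contraction that was not expanded beforehand is split into separate tokens at this point, which is one reason to clean contractions first.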

The previously defined `clean_special_chars` function is applied in the coming cell in order to map and replace the characters in the dataframe.

In [None]:
%%time
# Apply the clean_special_chars function to comment_text
df['comment_text'] = df['comment_text'].apply(lambda x: clean_special_chars(x, punct, unknown_spec))

After applying the `clean_special_chars` function, the `vocab` and the `oov` have to be recalculated in the next cells, and the `top 20` `out of vocab` words are printed to decide how to proceed.

In [None]:
%%time
# Rebuild vocab and oov
vocab = build_vocab(df['comment_text'])
oov = check_coverage(vocab, embeddings_index)

In [None]:
# Print the first 20 out of vocabulary values again
oov[:20]

Apart from a few misspelled words, the out-of-vocabulary list offers no obvious next step, so swear words and insults are handled next. First, a list of insults and their variations is created to find out which of them are included in the `FastText` embedding and which are not. Those that are missing are then replaced with a common insult that is present in the embedding.

The following insult mapping was borrowed from Tom Aindow (@Taindow https://www.kaggle.com/taindow/simple-cudnngru-python-keras) so credits and big thanks go to him:

In [None]:
# Define insults
insults = [' 4r5e ',' 5h1t ',' 5hit ',' a55 ',' anal ',' anus ',' ar5e ',' arrse ',' arse ',' ass ',' ass-fucker ',' asses ',' assfucker ',' assfukka ',' asshole ',' assholes ',' asswhole ',' a_s_s ',' b!tch ',' b00bs ',' b17ch ',' b1tch ',' ballbag ',' balls ',' ballsack ',' bastard ',' beastial ',' beastiality ',' bellend ',' bestial ',' bestiality ',' biatch ',' bitch ',' bitcher ',' bitchers ',' bitches ',' bitchin ',' bitching ',' bloody ',' blow job ',' blowjob ',' blowjobs ',' boiolas ',' bollock ',' bollok ',' boner ',' boob ',' boobs ',' booobs ',' boooobs ',' booooobs ',' booooooobs ',' breasts ',' buceta ',' bugger ',' bum ',' bunny fucker ',' butt ',' butthole ',' buttmuch ',' buttplug ',' c0ck ',' c0cksucker ',' carpet muncher ',' cawk ',' chink ',' cipa ',' cl1t ',' clit ',' clitoris ',' clits ',' cnut ',' cock ',' cock-sucker ',' cockface ',' cockhead ',' cockmunch ',' cockmuncher ',' cocks ',' cocksuck ',' cocksucked ',' cocksucker ',' cocksucking ',' cocksucks ',' cocksuka ',' cocksukka ',' cok ',' cokmuncher ',' coksucka ',' coon ',' cox ',' crap ',' cum ',' cummer ',' cumming ',' cums ',' cumshot ',' cunilingus ',' cunillingus ',' cunnilingus ',' cunt ',' cuntlick ',' cuntlicker ',' cuntlicking ',' cunts ',' cyalis ',' cyberfuc ',' cyberfuck ',' cyberfucked ',' cyberfucker ',' cyberfuckers ',' cyberfucking ',' d1ck ',' damn ',' dick ',' dickhead ',' dildo ',' dildos ',' dink ',' dinks ',' dirsa ',' dlck ',' dog-fucker ',' doggin ',' dogging ',' donkeyribber ',' doosh ',' duche ',' dyke ',' ejaculate ',' ejaculated ',' ejaculates ',' ejaculating ',' ejaculatings ',' ejaculation ',' ejakulate ',' f u c k ',' f u c k e r ',' f4nny ',' fag ',' fagging ',' faggitt ',' faggot ',' faggs ',' fagot ',' fagots ',' fags ',' fanny ',' fannyflaps ',' fannyfucker ',' fanyy ',' fatass ',' fcuk ',' fcuker ',' fcuking ',' feck ',' fecker ',' felching ',' fellate ',' fellatio ',' fingerfuck ',' fingerfucked ',' fingerfucker ',' fingerfuckers ',' fingerfucking ',' 
fingerfucks ',' fistfuck ',' fistfucked ',' fistfucker ',' fistfuckers ',' fistfucking ',' fistfuckings ',' fistfucks ',' flange ',' fook ',' fooker ',' fuck ',' fucka ',' fucked ',' fucker ',' fuckers ',' fuckhead ',' fuckheads ',' fuckin ',' fucking ',' fuckings ',' fuckingshitmotherfucker ',' fuckme ',' fucks ',' fuckwhit ',' fuckwit ',' fudge packer ',' fudgepacker ',' fuk ',' fuker ',' fukker ',' fukkin ',' fuks ',' fukwhit ',' fukwit ',' fux ',' fux0r ',' f_u_c_k ',' gangbang ',' gangbanged ',' gangbangs ',' gaylord ',' gaysex ',' goatse ',' God ',' god-dam ',' god-damned ',' goddamn ',' goddamned ',' hardcoresex ',' hell ',' heshe ',' hoar ',' hoare ',' hoer ',' homo ',' hore ',' horniest ',' horny ',' hotsex ',' jack-off ',' jackoff ',' jap ',' jerk-off ',' jism ',' jiz ',' jizm ',' jizz ',' kawk ',' knob ',' knobead ',' knobed ',' knobend ',' knobhead ',' knobjocky ',' knobjokey ',' kock ',' kondum ',' kondums ',' kum ',' kummer ',' kumming ',' kums ',' kunilingus ',' l3itch ',' labia ',' lmfao ',' lust ',' lusting ',' m0f0 ',' m0fo ',' m45terbate ',' ma5terb8 ',' ma5terbate ',' masochist ',' master-bate ',' masterb8 ',' masterbat3 ',' masterbate ',' masterbation ',' masterbations ',' masturbate ',' mo-fo ',' mof0 ',' mofo ',' mothafuck ',' mothafucka ',' mothafuckas ',' mothafuckaz ',' mothafucked ',' mothafucker ',' mothafuckers ',' mothafuckin ',' mothafucking ',' mothafuckings ',' mothafucks ',' mother fucker ',' motherfuck ',' motherfucked ',' motherfucker ',' motherfuckers ',' motherfuckin ',' motherfucking ',' motherfuckings ',' motherfuckka ',' motherfucks ',' muff ',' mutha ',' muthafecker ',' muthafuckker ',' muther ',' mutherfucker ',' n1gga ',' n1gger ',' nazi ',' nigg3r ',' nigg4h ',' nigga ',' niggah ',' niggas ',' niggaz ',' nigger ',' niggers ',' nob ',' nob jokey ',' nobhead ',' nobjocky ',' nobjokey ',' numbnuts ',' nutsack ',' orgasim ',' orgasims ',' orgasm ',' orgasms ',' p0rn ',' pawn ',' pecker ',' penis ',' penisfucker ',' phonesex 
',' phuck ',' phuk ',' phuked ',' phuking ',' phukked ',' phukking ',' phuks ',' phuq ',' pigfucker ',' pimpis ',' piss ',' pissed ',' pisser ',' pissers ',' pisses ',' pissflaps ',' pissin ',' pissing ',' pissoff ',' poop ',' porn ',' porno ',' pornography ',' pornos ',' prick ',' pricks ',' pron ',' pube ',' pusse ',' pussi ',' pussies ',' pussy ',' pussys ',' rectum ',' retard ',' rimjaw ',' rimming ',' s hit ',' s.o.b. ',' sadist ',' schlong ',' screwing ',' scroat ',' scrote ',' scrotum ',' semen ',' sex ',' sh!t ',' sh1t ',' shag ',' shagger ',' shaggin ',' shagging ',' shemale ',' shit ',' shitdick ',' shite ',' shited ',' shitey ',' shitfuck ',' shitfull ',' shithead ',' shiting ',' shitings ',' shits ',' shitted ',' shitter ',' shitters ',' shitting ',' shittings ',' shitty ',' skank ',' slut ',' sluts ',' smegma ',' smut ',' snatch ',' son-of-a-bitch ',' spac ',' spunk ',' s_h_i_t ',' t1tt1e5 ',' t1tties ',' teets ',' teez ',' testical ',' testicle ',' tit ',' titfuck ',' tits ',' titt ',' tittie5 ',' tittiefucker ',' titties ',' tittyfuck ',' tittywank ',' titwank ',' tosser ',' turd ',' tw4t ',' twat ',' twathead ',' twatty ',' twunt ',' twunter ',' v14gra ',' v1gra ',' vagina ',' viagra ',' vulva ',' w00se ',' wang ',' wank ',' wanker ',' wanky ',' whoar ',' whore ',' willies ',' willy ',' xrated ',' xxx '   ]

Next, a loop checks which of the insults are missing from the embedding file.

In [None]:
%%time 

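import re

# Initiate insults_notfound
insults_notfound = []

# Iterate over insults to collect those missing from the embedding file
for insult in insults:
    # Strip the padding spaces before looking the word up
    if insult.strip() not in embeddings_index:
        insults_notfound.append(insult)

# Build one regex alternation; re.escape keeps characters such as '.' literal
insults_notfound = '|'.join(re.escape(insult) for insult in insults_notfound)
insults_notfound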

The next cell provides a function that replaces the insults missing from the embedding with an insult that does exist in it.

In [None]:
%%time 
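# Define a function to replace insults missing from the embedding with a known one
def handle_insults(text):
    # An empty pattern would match at every position, so skip it
    if not insults_notfound:
        return text
    # Keep the surrounding spaces that the pattern consumes
    text = re.sub(insults_notfound, ' fuck ', text)
    return text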
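
Two details of this regex-based replacement are easy to get wrong: special characters in the variants, and the padding spaces around each variant. The following is a minimal sketch on mild toy data; `build_insult_pattern`, `replace_insults`, the toy list and the replacement word are all hypothetical and chosen for illustration only.

```python
import re

# Toy stand-ins (assumptions): space-padded variants and a one-word embedding index
toy_insults = [" 1diot ", " id!ot ", " idiot ", " d.u.m.b "]
toy_embeddings = {"idiot": [0.1]}

def build_insult_pattern(insults, embed):
    """Compile an alternation of the variants that have no vector in `embed`."""
    missing = [insult for insult in insults if insult.strip() not in embed]
    if not missing:
        return None  # nothing to replace
    # re.escape makes characters such as '.' match literally
    return re.compile("|".join(re.escape(insult) for insult in missing))

def replace_insults(text, pattern, replacement=" idiot "):
    """Replace every matched variant, keeping the padding spaces."""
    if pattern is None:
        return text
    return pattern.sub(replacement, text)

toy_pattern = build_insult_pattern(toy_insults, toy_embeddings)
replaced = replace_insults("you id!ot , stop", toy_pattern)
untouched = replace_insults("a dXuXmXb idea", toy_pattern)
print(replaced)
print(untouched)
```

Without escaping, the dot in a variant would act as a wildcard, so the second string would be rewritten as well. Padding the replacement with spaces keeps neighbouring words from being glued together. One remaining limitation of this sketch is that two variants separated by a single space share that space, so only the first of them is matched.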

In [None]:
%%time
# Apply the handle_insults function previously defined
df['comment_text'] = df['comment_text'].apply(handle_insults)
# Run garbage collection to free memory
gc.collect()

In [None]:
# Delete the vocab and oov objects for memory reasons
del(vocab,oov)
# Run garbage collection to free memory
gc.collect()

In [None]:
%%time
# Rebuild the vocab and oov
vocab = build_vocab(df['comment_text'])
oov = check_coverage(vocab, embeddings_index)

After applying the `handle_insults` function and rebuilding the `vocab`, the next part checks for spacing, numbering, misspellings, rare words and characters.

The following misspelling and rare-word mappings, a really wonderful and thorough piece of work, are borrowed from Aditya Soni (@Adityaecdrid https://www.kaggle.com/adityaecdrid/public-version-text-cleaning-vocab-65/data), so all credits go to him:

In [None]:
%%time
# Define a dictionary mispell_dict and rare_words_mapping
mispell_dict = {'SB91':'senate bill','tRump':'trump','utmterm':'utm term','FakeNews':'fake news','Gʀᴇat':'great','ʙᴏᴛtoᴍ':'bottom','washingtontimes':'washington times','garycrum':'gary crum','htmlutmterm':'html utm term','RangerMC':'car','TFWs':'tuition fee waiver','SJWs':'social justice warrior','Koncerned':'concerned','Vinis':'vinys','Yᴏᴜ':'you','Trumpsters':'trump','Trumpian':'trump','bigly':'big league','Trumpism':'trump','Yoyou':'you','Auwe':'wonder','Drumpf':'trump','utmterm':'utm term','Brexit':'british exit','utilitas':'utilities','ᴀ':'a', '😉':'wink','😂':'joy','😀':'stuck out tongue', 'theguardian':'the guardian','deplorables':'deplorable', 'theglobeandmail':'the globe and mail', 'justiciaries': 'justiciary','creditdation': 'Accreditation','doctrne':'doctrine','fentayal': 'fentanyl','designation-': 'designation','CONartist' : 'con-artist','Mutilitated' : 'Mutilated','Obumblers': 'bumblers','negotiatiations': 'negotiations','dood-': 'dood','irakis' : 'iraki','cooerate': 'cooperate','COx':'cox','racistcomments':'racist comments','envirnmetalists': 'environmentalists'}
rare_words_mapping = {' s.p ': ' ', ' S.P ': ' ', 'U.s.p': '', 'U.S.A.': 'USA', 'u.s.a.': 'USA', 'U.S.A': 'USA','u.s.a': 'USA', 'U.S.': 'USA', 'u.s.': 'USA', ' U.S ': ' USA ', ' u.s ': ' USA ', 'U.s.': 'USA',
                      ' U.s ': ' USA ', ' u.S ': ' USA ', 'fu.k': 'fuck', 'U.K.': 'UK', ' u.k ': ' UK ',' don t ': ' do not ', 'bacteries': 'batteries', ' yr old ': ' years old ', 'Ph.D': 'PhD',
                      'cau.sing': 'causing', 'Kim Jong-Un': 'The president of North Korea', 'savegely': 'savagely',
                      'Ra apist': 'Rapist', '2fifth': 'twenty fifth', '2third': 'twenty third','2nineth': 'twenty ninth', '2fourth': 'twenty fourth', '#metoo': 'MeToo',
                      'Trumpcare': 'Trump health care system', '4fifth': 'forty fifth', 'Remainers': 'remainder',
                      'Terroristan': 'terrorist', 'antibrahmin': 'anti brahmin','fuckboys': 'fuckboy', 'Fuckboys': 'fuckboy', 'Fuckboy': 'fuckboy', 'fuckgirls': 'fuck girls',
                      'fuckgirl': 'fuck girl', 'Trumpsters': 'Trump supporters', '4sixth': 'forty sixth',
                      'culturr': 'culture','weatern': 'western', '4fourth': 'forty fourth', 'emiratis': 'emirates', 'trumpers': 'Trumpster',
                      'indans': 'indians', 'mastuburate': 'masturbate', 'f**k': 'fuck', 'F**k': 'fuck', 'F**K': 'fuck',
                      ' u r ': ' you are ', ' u ': ' you ', '操你妈': 'fuck your mother', 'e.g.': 'for example',
                      'i.e.': 'in other words', '...': '.', 'et.al': 'elsewhere', 'anti-Semitic': 'anti-semitic',
                      'f***': 'fuck', 'f**': 'fuc', 'F***': 'fuck', 'F**': 'fuc','a****': 'assho', 'a**': 'ass', 'h***': 'hole', 'A****': 'assho', 'A**': 'ass', 'H***': 'hole',
                      's***': 'shit', 's**': 'shi', 'S***': 'shit', 'S**': 'shi', 'Sh**': 'shit',
                      'p****': 'pussy', 'p*ssy': 'pussy', 'P****': 'pussy','p***': 'porn', 'p*rn': 'porn', 'P***': 'porn',
                      'st*up*id': 'stupid','d***': 'dick', 'di**': 'dick', 'h*ck': 'hack',
                      'b*tch': 'bitch', 'bi*ch': 'bitch', 'bit*h': 'bitch', 'bitc*': 'bitch', 'b****': 'bitch',
                      'b***': 'bitc', 'b**': 'bit', 'b*ll': 'bull'
                      }

# Define a variable spaces
spaces = ['\u200b', '\u200e', '\u202a', '\u202c', '\ufeff', '\uf0d8', '\u2061', '\x10', '\x7f', '\x9d', '\xad', '\xa0']

# Define a variable extra_punct with extra punctuations not included in the previous analysis
extra_punct = [
    ',', '.', '"', ':', ')', '(', '!', '?', '|', ';', "'", '$', '&',
    '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£',
    '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',
    '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', '“', '★', '”',
    '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾',
    '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', '▒', '：', '¼', '⊕', '▼',
    '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲',
    'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', '∙', '）', '↓', '、', '│', '（', '»',
    '，', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø',
    '¹', '≤', '‡', '√', '«', '»', '´', 'º', '¾', '¡', '§', '£', '₤']
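
How such dictionaries and character lists might be applied to a comment can be sketched as follows. This is an assumption about usage, not the notebook's own code: `remove_odd_spaces`, `apply_mapping` and the toy inputs are hypothetical, and the notebook may apply its mappings differently.

```python
# Toy stand-ins (assumptions): a two-entry misspelling map and two odd space characters
toy_mispell = {"tRump": "trump", "FakeNews": "fake news"}
toy_spaces = ["\u200b", "\xa0"]

def remove_odd_spaces(text, spaces):
    """Replace zero-width and non-breaking space characters with a plain space."""
    for space in spaces:
        text = text.replace(space, " ")
    return text

def apply_mapping(text, mapping):
    """Apply substring replacements in the dictionary's insertion order."""
    for wrong, right in mapping.items():
        if wrong in text:
            text = text.replace(wrong, right)
    return text

raw = "tRump\xa0said FakeNews\u200bagain"
cleaned = apply_mapping(remove_odd_spaces(raw, toy_spaces), toy_mispell)
print(cleaned)
```

With plain substring replacement the order of the entries matters: a short key such as `'f**'` also matches inside `'f***'`, so the longer key has to come first, as it does in `rare_words_mapping`.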

In [None]:
%%time
# Define a dictionary with bad case words
bad_case_words = {'nationalpost':'national post','businessinsider':'business insider','jewprofits': 'jew profits', 'QMAS': 'Quality Migrant Admission Scheme', 'casterating': 'castrating',
                  'Kashmiristan': 'Kashmir', 'CareOnGo': 'India first and largest Online distributor of medicines',
                  'Setya Novanto': 'a former Indonesian politician', 'TestoUltra': 'male sexual enhancement supplement',
                  'rammayana': 'ramayana', 'Badaganadu': 'Brahmin community that mainly reside in Karnataka',
                  'bitcjes': 'bitches', 'mastubrate': 'masturbate', 'Français': 'France',
                  'Adsresses': 'address', 'flemmings': 'flemming', 'intermate': 'inter mating', 'feminisam': 'feminism',
                  'cuckholdry': 'cuckold', 'Niggor': 'black hip-hop and electronic artist', 'narcsissist': 'narcissist',
                  'Genderfluid': 'Gender fluid', ' Im ': ' I am ', ' dont ': ' do not ', 'Qoura': 'Quora',
                  'ethethnicitesnicites': 'ethnicity', 'Namit Bathla': 'Content Writer', 'What sApp': 'WhatsApp',
                  'Führer': 'Fuhrer', 'covfefe': 'coverage', 'accedentitly': 'accidentally', 'Cuckerberg': 'Zuckerberg',
                  'transtrenders': 'incredibly disrespectful to real transgender people',
                  'frozen tamod': 'Pornographic website', 'hindians': 'North Indian', 'hindian': 'North Indian',
                  'celibatess': 'celibates', 'Trimp': 'Trump', 'wanket': 'wanker', 'wouldd': 'would',
                  'arragent': 'arrogant', 'Ra - apist': 'rapist', 'idoot': 'idiot', 'gangstalkers': 'gangs talkers',
                  'toastsexual': 'toast sexual', 'inapropriately': 'inappropriately', 'dumbassess': 'dumbass',
                  'germanized': 'become german', 'helisexual': 'sexual', 'regilious': 'religious',
                  'timetraveller': 'time traveller', 'darkwebcrawler': 'dark webcrawler', 'routez': 'route',
                  'trumpians': 'Trump supporters','Trumpster':'trumpeters', 'irreputable': 'reputation', 'serieusly': 'seriously',
                  'anti cipation': 'anticipation', 'microaggression': 'micro aggression', 'Afircans': 'Africans',
                  'microapologize': 'micro apologize', 'Vishnus': 'Vishnu', 'excritment': 'excitement',
                  'disagreemen': 'disagreement', 'gujratis': 'gujarati', 'gujaratis': 'gujarati',
                  'ugggggggllly': 'ugly',
                  'Germanity': 'German', 'SoyBoys': 'cuck men lacking masculine characteristics',
                  'н': 'h', 'м': 'm', 'ѕ': 's', 'т': 't', 'в': 'b', 'υ': 'u', 'ι': 'i',
                  'genetilia': 'genitalia', 'r - apist': 'rapist', 'Borokabama': 'Barack Obama',
                  'arectifier': 'rectifier', 'pettypotus': 'petty potus', 'magibabble': 'magi babble',
                  'nothinking': 'thinking', 'centimiters': 'centimeters', 'saffronized': 'India, politics, derogatory',
                  'saffronize': 'India, politics, derogatory', ' incect ': ' insect ', 'weenus': 'elbow skin',
                  'Pakistainies': 'Pakistanis', 'goodspeaks': 'good speaks', 'inpregnated': 'in pregnant',
                  'rapefilms': 'rape films', 'rapiest': 'rapist', 'hatrednesss': 'hatred',
                  'heightism': 'height discrimination', 'getmy': 'get my', 'onsocial': 'on social',
                  'worstplatform': 'worst platform', 'platfrom': 'platform', 'instagate': 'instigate',
                  'Loy Machedeo': 'person', ' dsire ': ' desire ', 'iservant': 'servant', 'intelliegent': 'intelligent',
                  'WW 1': ' WW1 ', 'WW 2': ' WW2 ', 'ww 1': ' WW1 ', 'ww 2': ' WW2 ',
                  'keralapeoples': 'kerala peoples', 'trumpervotes': 'trumper votes', 'fucktrumpet': 'fuck trumpet',
                  'likebJaish': 'like bJaish', 'likemy': 'like my', 'Howlikely': 'How likely',
                  'disagreementts': 'disagreements', 'disagreementt': 'disagreement',
                  'meninist': "male chauvinism", 'feminists': 'feminism supporters', 'Ghumendra': 'Bhupendra',
                  'emellishments': 'embellishments',
                  'settelemen': 'settlement',
                  'Richmencupid': 'rich men dating website', 'richmencupid': 'rich men dating website',
                  'Gaudry - Schost': '', 'ladymen': 'ladyboy', 'hasserment': 'Harassment',
                  'instrumentalizing': 'instrument', 'darskin': 'dark skin', 'balckwemen': 'black women',
                  'recommendor': 'recommender', 'wowmen': 'women', 'expertthink': 'expert think',
                  'whitesplaining': 'white splaining', 'Inquoraing': 'inquiring', 'whilemany': 'while many',
                  'manyother': 'many other', 'involvedinthe': 'involved in the', 'slavetrade': 'slave trade',
                  'aswell': 'as well', 'fewshowanyRemorse': 'few show any Remorse', 'trageting': 'targeting',
                  'getile': 'gentile', 'Gujjus': 'derogatory Gujarati', 'judisciously': 'judiciously',
                  'Hue Mungus': 'feminist bait', 'Hugh Mungus': 'feminist bait', 'Hindustanis': '',
                  'Virushka': 'Great Relationships Couple', 'exclusinary': 'exclusionary', 'himdus': 'hindus',
                  'Milo Yianopolous': 'a British polemicist', 'hidusim': 'hinduism',
                  'holocaustable': 'holocaust', 'evangilitacal': 'evangelical', 'Busscas': 'Buscas',
                  'holocaustal': 'holocaust', 'incestious': 'incestuous', 'Tennesseus': 'Tennessee',
                  'GusDur': 'Gus Dur',
                  'RPatah - Tan Eng Hwan': 'Silsilah', 'Reinfectus': 'reinfect', 'pharisaistic': 'pharisaism',
                  'nuslims': 'Muslims', 'taskus': '', 'musims': 'Muslims',
                  'Musevi': 'the independence of Mexico', ' racious ': ' discrimination expression of racism ',
                  'Muslimophobia': 'Muslim phobia', 'justyfied': 'justified', 'holocause': 'holocaust',
                  'musilim': 'Muslim', 'misandrous': 'misandry', 'glrous': 'glorious', 'desemated': 'decimated',
                  'votebanks': 'vote banks', 'Parkistan': 'Pakistan', 'Eurooe': 'Europe', 'animlaistic': 'animalistic',
                  'Asiasoid': 'Asian', 'Congoid': 'Congolese', 'inheritantly': 'inherently',
                  'Asianisation': 'Becoming Asia',
                  'Russosphere': 'russia sphere of influence', 'exMuslims': 'Ex-Muslims',
                  'discriminatein': 'discrimination', ' hinus ': ' hindus ', 'Nibirus': 'Nibiru',
                  'habius - corpus': 'habeas corpus', 'prentious': 'pretentious', 'Sussia': 'ancient Jewish village',
                  'moustachess': 'moustaches', 'Russions': 'Russians', 'Yuguslavia': 'Yugoslavia',
                  'atrocitties': 'atrocities', 'Muslimophobe': 'Muslim phobic', 'fallicious': 'fallacious',
                  'recussed': 'recursed', '@ usafmonitor': '', 'lustfly': 'lustful', 'canMuslims': 'can Muslims',
                  'journalust': 'journalist', 'digustingly': 'disgustingly', 'harasing': 'harassing',
                  'greatuncle': 'great uncle', 'Drumpf': 'Trump', 'rejectes': 'rejected', 'polyagamous': 'polygamous',
                  'Mushlims': 'Muslims', 'accusition': 'accusation', 'geniusses': 'geniuses',
                  'moustachesomething': 'moustache something', 'heineous': 'heinous',
                  'Sapiosexuals': 'sapiosexual', 'sapiosexuals': 'sapiosexual', 'Sapiosexual': 'sapiosexual',
                  'sapiosexual': 'Sexually attracted to intelligence', 'pansexuals': 'pansexual',
                  'autosexual': 'auto sexual', 'sexualSlutty': 'sexual Slutty', 'hetorosexuality': 'hetero sexuality',
                  'chinesese': 'chinese', 'pizza gate': 'debunked conspiracy theory',
                  'countryless': 'Having no country',
                  'muslimare': 'Muslim are', 'iPhoneX': 'iPhone', 'lionese': 'lioness', 'marionettist': 'Marionettes',
                  'demonetize': 'demonetized', 'eneyone': 'anyone', 'Karonese': 'Karo people Indonesia',
                  'minderheid': 'minder worse', 'mainstreamly': 'mainstream', 'contraproductive': 'contra productive',
                  'diffenky': 'differently', 'abandined': 'abandoned', 'p0 rnstars': 'pornstars',
                  'overproud': 'over proud',
                  'cheekboned': 'cheek boned', 'heriones': 'heroines', 'eventhogh': 'even though',
                  'americanmedicalassoc': 'american medical assoc', 'feelwhen': 'feel when', 'Hhhow': 'how',
                  'reallySemites': 'really Semites', 'gamergaye': 'gamersgate', 'manspreading': 'man spreading',
                  'thammana': 'Tamannaah Bhatia', 'dogmans': 'dogmas', 'managementskills': 'management skills',
                  'mangoliod': 'mongoloid', 'geerymandered': 'gerrymandered', 'mandateing': 'man dateing',
                  'Romanium': 'Romanum',
                  'mailwoman': 'mail woman', 'humancoalition': 'human coalition',
                  'manipullate': 'manipulate', 'everyo0 ne': 'everyone', 'takeove': 'takeover',
                  'Nonchristians': 'Non Christians', 'goverenments': 'governments', 'govrment': 'government',
                  'polygomists': 'polygamists', 'Demogorgan': 'Demogorgon', 'maralago': 'Mar-a-Lago',
                  'antibigots': 'anti bigots', 'gouing': 'going', 'muzaffarbad': 'muzaffarabad',
                  'suchvstupid': 'such stupid', 'apartheidisrael': 'apartheid israel', 
                  'personaltiles': 'personal titles', 'lawyergirlfriend': 'lawyer girl friend',
                  'northestern': 'northwestern', 'yeardold': 'years old', 'masskiller': 'mass killer',
                  'southeners': 'southerners', 'Unitedstatesian': 'United states',

                  'peoplekind': 'people kind', 'peoplelike': 'people like', 'countrypeople': 'country people',
                  'shitpeople': 'shit people', 'trumpology': 'trump ology', 'trumpites': 'Trump supporters',
                  'trumplies': 'trump lies', 'donaldtrumping': 'donald trumping', 'trumpdating': 'trump dating',
                  'trumpsters': 'trumpeters','Trumpers':'president trump', 'ciswomen': 'cis women', 'womenizer': 'womanizer',
                  'pregnantwomen': 'pregnant women', 'autoliker': 'auto liker', 'smelllike': 'smell like',
                  'autolikers': 'auto likers', 'religiouslike': 'religious like', 'likemail': 'like mail',
                  'fislike': 'dislike', 'sneakerlike': 'sneaker like', 'like⬇': 'like',
                  'likelovequotes': 'like lovequotes', 'likelogo': 'like logo', 'sexlike': 'sex like',
                  'Whatwould': 'What would', 'Howwould': 'How would', 'manwould': 'man would',
                  'exservicemen': 'ex servicemen', 'femenism': 'feminism', 'devopment': 'development',
                  'doccuments': 'documents', 'supplementplatform': 'supplement platform', 'mendatory': 'mandatory',
                  'moviments': 'movements', 'Kremenchuh': 'Kremenchug', 'docuements': 'documents',
                  'determenism': 'determinism', 'envisionment': 'envision ment',
                  'tricompartmental': 'tri compartmental', 'AddMovement': 'Add Movement',
                  'mentionong': 'mentioning', 'Whichtreatment': 'Which treatment', 'repyament': 'repayment',
                  'insemenated': 'inseminated', 'inverstment': 'investment',
                  'managemental': 'manage mental', 'Inviromental': 'Environmental', 'menstrution': 'menstruation',
                  'indtrument': 'instrument', 'mentenance': 'maintenance', 'fermentqtion': 'fermentation',
                  'achivenment': 'achievement', 'mismanagements': 'mis managements', 'requriment': 'requirement',
                  'denomenator': 'denominator', 'drparment': 'department', 'acumens': 'acumen s',
                  'celemente': 'Clemente', 'manajement': 'management', 'govermenent': 'government',
                  'accomplishmments': 'accomplishments', 'rendementry': 'rendement ry',
                  'repariments': 'departments', 'menstrute': 'menstruate', 'determenistic': 'deterministic',
                  'resigment': 'resignment', 'selfpayment': 'self payment', 'imrpovement': 'improvement',
                  'enivironment': 'environment', 'compartmentley': 'compartment',
                  'augumented': 'augmented', 'parmenent': 'permanent', 'dealignment': 'de alignment',
                  'develepoments': 'developments', 'menstrated': 'menstruated', 'phnomenon': 'phenomenon',
                  'Employmment': 'Employment', 'dimensionalise': 'dimensional ise', 'menigioma': 'meningioma',
                  'recrument': 'recrement', 'Promenient': 'Provenient', 'gonverment': 'government',
                  'statemment': 'statement', 'recuirement': 'requirement', 'invetsment': 'investment',
                  'parilment': 'parchment', 'parmently': 'patiently', 'agreementindia': 'agreement india',
                  'menifesto': 'manifesto', 'accomplsihments': 'accomplishments', 'disangagement': 'disengagement',
                  'aevelopment': 'development', 'procument': 'procumbent', 'harashment': 'harassment',
                  'Tiannanmen': 'Tiananmen', 'commensalisms': 'commensal isms', 'devlelpment': 'development',
                  'dimensons': 'dimensions', 'recruitment2017': 'recruitment 2017', 'polishment': 'pol ishment',
                  'CommentSafe': 'Comment Safe', 'meausrements': 'measurements', 'geomentrical': 'geometrical',
                  'undervelopment': 'undevelopment', 'mensurational': 'mensuration al', 'fanmenow': 'fan menow',
                  'permenganate': 'permanganate', 'bussinessmen': 'businessmen',
                  'supertournaments': 'super tournaments', 'permanmently': 'permanently',
                  'lamenectomy': 'lamnectomy', 'assignmentcanyon': 'assignment canyon', 'adgestment': 'adjustment',
                  'mentalized': 'metalized', 'docyments': 'documents', 'requairment': 'requirement',
                  'batsmencould': 'batsmen could', 'argumentetc': 'argument etc', 'enjoiment': 'enjoyment',
                  'invement': 'movement', 'accompliushments': 'accomplishments', 'regements': 'regiments',
                  'departmentHow': 'department How', 'Aremenian': 'Armenian', 'amenclinics': 'amen clinics',
                  'nonfermented': 'non fermented', 'Instumentation': 'Instrumentation', 'mentalitiy': 'mentality',
                  ' govermen ': ' government ', 'underdevelopement': 'under development', 'parlimentry': 'parliamentary',
                  'indemenity': 'indemnity', 'Inatrumentation': 'Instrumentation', 'menedatory': 'mandatory',
                  'mentiri': 'entire', 'accomploshments': 'accomplishments', 'instrumention': 'instrument ion',
                  'afvertisements': 'advertisements', 'parlementarian': 'parlement arian',
                  'entitlments': 'entitlements', 'endrosment': 'endorsement', 'improment': 'impriment',
                  'archaemenid': 'Achaemenid', 'replecement': 'replacement', 'placdment': 'placement',
                  'femenise': 'feminise', 'envinment': 'environment', 'AmenityCompany': 'Amenity Company',
                  'increaments': 'increments', 'accomplihsments': 'accomplishments',
                  'manygovernment': 'many government', 'panishments': 'punishments', 'elinment': 'eloinment',
                  'mendalin': 'mend alin', 'farmention': 'farm ention', 'preincrement': 'pre increment',
                  'postincrement': 'post increment', 'achviements': 'achievements', 'menditory': 'mandatory',
                  'Emouluments': 'Emoluments', 'Stonemen': 'Stone men', 'menmium': 'medium',
                  'entaglement': 'entanglement', 'integumen': 'integument', 'harassument': 'harassment',
                  'retairment': 'retainment', 'enviorement': 'environment', 'tormentous': 'torment ous',
                  'confiment': 'confident', 'Enchroachment': 'Encroachment', 'prelimenary': 'preliminary',
                  'fudamental': 'fundamental', 'instrumenot': 'instrument', 'icrement': 'increment',
                  'prodimently': 'prominently', 'meniss': 'menise', 'Whoimplemented': 'Who implemented',
                  'Representment': 'Rep resentment', 'StartFragment': 'Start Fragment',
                  'EndFragment': 'End Fragment', ' documentarie ': ' documentaries ', 'requriments': 'requirements',
                  'constitutionaldevelopment': 'constitutional development', 'parlamentarians': 'parliamentarians',
                  'Rumenova': 'Rumen ova', 'argruments': 'arguments', 'findamental': 'fundamental',
                  'totalinvestment': 'total investment', 'gevernment': 'government', 'recmommend': 'recommend',
                  'appsmoment': 'apps moment', 'menstruual': 'menstrual', 'immplemented': 'implemented',
                  'engangement': 'engagement', 'invovement': 'involvement', 'returement': 'retirement',
                  'simentaneously': 'simultaneously', 'accompishments': 'accomplishments',
                  'menstraution': 'menstruation', 'experimently': 'experimentally', 'abdimen': 'abdomen',
                  'cemenet': 'cement', 'propelment': 'propel ment', 'unamendable': 'un amendable',
                  'employmentnews': 'employment news', 'lawforcement': 'law forcement',
                  'menstuating': 'menstruating', 'fevelopment': 'development', 'reglamented': 'reg lamented',
                  'imrovment': 'improvement', 'recommening': 'recommending', 'sppliment': 'supplement',
                  'measument': 'measurement', 'reimbrusement': 'reimbursement', 'Nutrament': 'Nutriment',
                  'puniahment': 'punishment', 'subligamentous': 'sub ligamentous', 'comlementry': 'complementary',
                  'reteirement': 'retirement', 'envioronments': 'environments', 'haraasment': 'harassment',
                  'USAgovernment': 'USA government', 'Apartmentfinder': 'Apartment finder',
                  'encironment': 'environment', 'metacompartment': 'meta compartment',
                  'augumentation': 'augmentation', 'dsymenorrhoea': 'dysmenorrhoea',
                  'nonabandonment': 'non abandonment', 'annoincement': 'announcement',
                  'menberships': 'memberships', 'Gamenights': 'Game nights', 'enliightenment': 'enlightenment',
                  'supplymentry': 'supplementary', 'parlamentary': 'parliamentary', 'duramen': 'dura men',
                  'hotelmanagement': 'hotel management', 'deartment': 'department',
                  'treatmentshelp': 'treatments help', 'attirements': 'attire ments',
                  'amendmending': 'amend mending', 'pseudomeningocele': 'pseudo meningocele',
                  'intrasegmental': 'intra segmental', 'treatmenent': 'treatment', 'infridgement': 'infringement',
                  'infringiment': 'infringement', 'recrecommend': 'rec recommend', 'entartaiment': 'entertainment',
                  'inplementing': 'implementing', 'indemendent': 'independent', 'tremendeous': 'tremendous',
                  'commencial': 'commercial', 'scomplishments': 'accomplishments', 'Emplement': 'Implement',
                  'dimensiondimensions': 'dimension dimensions', 'depolyment': 'deployment',
                  'conpartment': 'compartment', 'govnments': 'governments', 'menstrat': 'menstruate',
                  'accompplishments': 'accomplishments', 'Enchacement': 'Enhancement',
                  'developmenent': 'development', 'emmenagogues': 'emmenagogue', 'aggeement': 'agreement',
                  'elementsbond': 'elements bond', 'remenant': 'remnant', 'Manamement': 'Management',
                  'Augumented': 'Augmented', 'dimensonless': 'dimensionless',
                  'ointmentsointments': 'ointments ointments', 'achiements': 'achievements',
                  'recurtment': 'recruitment', 'gouverments': 'governments', 'docoment': 'document',
                  'programmingassignments': 'programming assignments', 'menifest': 'manifest',
                  'investmentguru': 'investment guru', 'deployements': 'deployments', 'Invetsment': 'Investment',
                  'plaement': 'placement', 'Perliament': 'Parliament', 'femenists': 'feminists',
                  'ecumencial': 'ecumenical', 'advamcements': 'advancements', 'refundment': 'refund ment',
                  'settlementtake': 'settlement take', 'mensrooms': 'mens rooms',
                  'productManagement': 'product Management', 'armenains': 'armenians',
                  'betweenmanagement': 'between management', 'difigurement': 'disfigurement',
                  'Armenized': 'Armenize', 'hurrasement': 'harassment', 'mamgement': 'management',
                  'momuments': 'monuments', 'eauipments': 'equipments', 'managemenet': 'management',
                  'treetment': 'treatment', 'webdevelopement': 'web development', 'supplemenary': 'supplementary',
                  'Encironmental': 'Environmental', 'Understandment': 'Understand ment',
                  'enrollnment': 'enrollment', 'thinkstrategic': 'think strategic', 'thinkinh': 'thinking',
                  'Softthinks': 'Soft thinks', 'underthinking': 'under thinking', 'thinksurvey': 'think survey',
                  'whitelash': 'white lash', 'whiteheds': 'whiteheads', 'whitetning': 'whitening',
                  'whitegirls': 'white girls', 'whitewalkers': 'white walkers', 'manycountries': 'many countries',
                  'accomany': 'accompany', 'fromGermany': 'from Germany', 'manychat': 'many chat',
                  'Germanyl': 'Germany l', 'manyness': 'many ness', 'many4': 'many', 'exmuslims': 'ex muslims',
                  'digitizeindia': 'digitize india', 'indiarush': 'india rush', 'indiareads': 'india reads',
                  'telegraphindia': 'telegraph india', 'Southindia': 'South india', 'Airindia': 'Air india',
                  'siliconindia': 'silicon india', 'airindia': 'air india', 'indianleaders': 'indian leaders',
                  'fundsindia': 'funds india', 'indianarmy': 'indian army', 'Technoindia': 'Techno india',
                  'Betterindia': 'Better india', 'capesindia': 'capes india', 'Rigetti': 'Ligetti',
                  'vegetablr': 'vegetable', 'get90': 'get', 'Magetta': 'Maretta', 'nagetive': 'negative',
                  'isUnforgettable': 'is Unforgettable', 'get630': 'get 630', 'GadgetPack': 'Gadget Pack',
                  'Languagetool': 'Language tool', 'bugdget': 'budget', 'africaget': 'africa get',
                  'ABnegetive': 'AB negative', 'orangetheory': 'orange theory', 'getsmuggled': 'get smuggled',
                  'avegeta': 'ave geta', 'gettubg': 'getting', 'gadgetsnow': 'gadgets now',
                  'surgetank': 'surge tank', 'gadagets': 'gadgets', 'getallparts': 'get allparts',
                  'messenget': 'messenger', 'vegetarean': 'vegetarian', 'get1000': 'get 1000',
                  'getfinancing': 'get financing', 'getdrip': 'get drip', 'AdsTargets': 'Ads Targets',
                  'tgethr': 'together', 'vegetaries': 'vegetables', 'forgetfulnes': 'forgetfulness',
                  'fisgeting': 'fidgeting', 'BudgetAir': 'Budget Air',
                  'getDepersonalization': 'get Depersonalization', 'negetively': 'negatively',
                  'gettibg': 'getting', 'nauget': 'naught', 'Bugetti': 'Bugatti', 'plagetum': 'plage tum',
                  'vegetabale': 'vegetable', 'changetip': 'change tip', 'blackwashing': 'black washing',
                  'blackpink': 'black pink', 'blackmoney': 'black money',
                  'blackmarks': 'black marks', 'blackbeauty': 'black beauty', 'unblacklisted': 'un blacklisted',
                  'blackdotes': 'black dotes', 'blackboxing': 'black boxing', 'blackpaper': 'black paper',
                  'blackpower': 'black power', 'Latinamericans': 'Latin americans', 'musigma': 'mu sigma',
                  'Indominus': 'In dominus', 'usict': 'USSCt', 'indominus': 'in dominus', 'Musigma': 'Mu sigma',
                  'plus5': 'plus', 'Russiagate': 'Russia gate', 'russophobic': 'Russophobiac',
                  'Marcusean': 'Marcuse an', 'Radijus': 'Radius', 'cobustion': 'combustion',
                  'Austrialians': 'Australians', 'mylogenous': 'myogenous', 'Raddus': 'Radius',
                  'hetrogenous': 'heterogenous', 'greenhouseeffect': 'greenhouse effect', 'aquous': 'aqueous',
                  'Taharrush': 'Tahar rush', 'Senousa': 'Venous', 'diplococcus': 'diplo coccus',
                  'CityAirbus': 'City Airbus', 'sponteneously': 'spontaneously', 'trustless': 'trust less',
                  'Pushkaram': 'Pushkara m', 'Fusanosuke': 'Fu sanosuke', 'isthmuses': 'isthmus es',
                  'lucideus': 'lucidum', 'overjustification': 'over justification', 'Bindusar': 'Bind usar',
                  'cousera': 'Coursera', 'musturbation': 'masturbation', 'infustry': 'industry',
                  'Huswifery': 'Huswife ry', 'rombous': 'rhombus', 'disengenuously': 'disingenuously',
                  'sllybus': 'syllabus', 'celcious': 'celsius', 'cellsius': 'celsius',
                  'lethocerus': 'Lethocerus', 'monogmous': 'monogamous', 'Ballyrumpus': 'Bally rumpus',
                  'Koushika': 'Koushik a', 'vivipoarous': 'viviparous', 'ludiculous': 'ridiculous',
                  'sychronous': 'synchronous', 'industiry': 'industry', 'scuduse': 'scud use',
                  'babymust': 'baby must', 'simultqneously': 'simultaneously', 'exust': 'ex ust',
                  'notmusing': 'not musing', 'Zamusu': 'Amuse', 'tusaki': 'tu saki', 'Marrakush': 'Marrakesh',
                  'justcheaptickets': 'just cheaptickets', 'Ayahusca': 'Ayahausca', 'samousa': 'samosa',
                  'Gusenberg': 'Gutenberg', 'illustratuons': 'illustrations', 'extemporeneous': 'extemporaneous',
                  'Mathusla': 'Mathusala', 'Confundus': 'Con fundus', 'tusts': 'trusts', 'poisenious': 'poisonous',
                  'Mevius': 'Medius', 'inuslating': 'insulating', 'aroused21000': 'aroused 21000',
                  'Wenzeslaus': 'Wenceslaus', 'JustinKase': 'Justin Kase', 'purushottampur': 'purushottam pur',
                  'citruspay': 'citrus pay', 'secutus': 'sects', 'austentic': 'austenitic',
                  'FacePlusPlus': 'Face PlusPlus', 'aysnchronous': 'asynchronous',
                  'teamtreehouse': 'team treehouse', 'uncouncious': 'unconscious', 'Priebuss': 'Prie buss',
                  'consciousuness': 'consciousness', 'susubsoil': 'su subsoil', 'trimegistus': 'Trismegistus',
                  'protopeterous': 'protopterous', 'trustworhty': 'trustworthy', 'ushually': 'usually',
                  'industris': 'industries', 'instantneous': 'instantaneous', 'superplus': 'super plus',
                  'shrusti': 'shruti', 'hindhus': 'hindus', 'outonomous': 'autonomous', 'reliegious': 'religious',
                  'Kousakis': 'Kou sakis', 'reusult': 'result', 'JanusGraph': 'Janus Graph',
                  'palusami': 'palus ami', 'mussraff': 'muss raff', 'hukous': 'humous',
                  'photoacoustics': 'photo acoustics', 'kushanas': 'kusha nas', 'justdile': 'justice',
                  'Massahusetts': 'Massachusetts', 'uspset': 'upset', 'sustinet': 'sustinent',
                  'consicious': 'conscious', 'Sadhgurus': 'Sadh gurus', 'hystericus': 'hysteric us',
                  'visahouse': 'visa house', 'supersynchronous': 'super synchronous', 'posinous': 'poisonous',
                  'Fernbus': 'Fern bus', 'Tiltbrush': 'Tilt brush', 'glueteus': 'gluteus', 'posionus': 'poisonous',
                  'Freus': 'Frees', 'Zhuchengtyrannus': 'Zhucheng tyrannus', 'savonious': 'sanious',
                  'CusJo': 'Cusco', 'congusion': 'confusion', 'dejavus': 'dejavu s', 'uncosious': 'unconscious',
                  'previius': 'previous', 'counciousness': 'consciousness', 'lustorus': 'lustrous',
                  'sllyabus': 'syllabus', 'mousquitoes': 'mosquitoes', 'Savvius': 'Savvies', 'arceius': 'Arcesius',
                  'prejusticed': 'prejudiced', 'requsitioned': 'requisitioned',
                  'deindustralization': 'deindustrialization', 'muscleblaze': 'muscle blaze',
                  'ConsciousX5': 'conscious', 'nitrogenious': 'nitrogenous', 'mauritious': 'mauritius',
                  'rigrously': 'rigorously', 'Yutyrannus': 'Yu tyrannus', 'muscualr': 'muscular',
                  'conscoiusness': 'consciousness', 'Causians': 'Caucasians', 'WorkFusion': 'Work Fusion',
                  'puspak': 'pu spak', 'Inspirus': 'Inspires', 'illiustrations': 'illustrations',
                  'Nobushi': 'No bushi', 'theuseof': 'the use of', 'suspicius': 'suspicious', 'Intuous': 'Virtuous',
                  'gaushalas': 'gaus halas', 'campusthrough': 'campus through', 'seriousity': 'seriosity',
                  'resustence': 'resistance', 'geminatus': 'geminates', 'disquss': 'discuss',
                  'nicholus': 'nicholas', 'Husnai': 'Hussar', 'diiscuss': 'discuss', 'diffussion': 'diffusion',
                  'phusicist': 'physicist', 'ernomous': 'enormous', 'Khushali': 'Khushal i', 'heitus': 'Leitus',
                  'cracksbecause': 'cracks because', 'Nautlius': 'Nautilus', 'trausted': 'trusted',
                  'Dardandus': 'Dardanus', 'Megatapirus': 'Mega tapirus', 'clusture': 'culture',
                  'vairamuthus': 'vairamuthu s', 'disclousre': 'disclosure',
                  'industrilaization': 'industrialization', 'musilms': 'muslims', 'Australia9': 'Australian',
                  'causinng': 'causing', 'ibdustries': 'industries', 'searious': 'serious',
                  'Coolmuster': 'Cool muster', 'sissyphus': 'sisyphus', ' justificatio ': ' justification ',
                  'antihindus': 'anti hindus', 'Moduslink': 'Modus link', 'zymogenous': 'zymogen ous',
                  'prospeorus': 'prosperous', 'Retrocausality': 'Retro causality', 'FusionGPS': 'Fusion GPS',
                  'Mouseflow': 'Mouse flow', 'bootyplus': 'booty plus', 'Itylus': 'I tylus',
                  'Olnhausen': 'Olshausen', 'suspeect': 'suspect', 'entusiasta': 'enthusiast',
                  'fecetious': 'facetious', 'bussiest': 'busiest', 'Draconius': 'Draconis',
                  'requsite': 'requisite', 'nauseatic': 'nausea tic', 'Brusssels': 'Brussels',
                  'repurcussion': 'repercussion', 'Jeisus': 'Jesus', 'philanderous': 'philander ous',
                  'muslisms': 'muslims', 'august2017': 'august 2017', 'calccalculus': 'calc calculus',
                  'unanonymously': 'un anonymously', 'Imaprtus': 'Impetus', 'carnivorus': 'carnivorous',
                  'Corypheus': 'Coryphees', 'austronauts': 'astronauts', 'neucleus': 'nucleus',
                  'housepoor': 'house poor', 'rescouses': 'resources', 'Tagushi': 'Tagus hi',
                  'hyperfocusing': 'hyper focusing', 'nutriteous': 'nutritious', 'chylus': 'chylous',
                  'preussure': 'pressure', 'outfocus': 'out focus', 'Hanfus': 'Hannus', 'Rustyrose': 'Rusty rose',
                  'vibhushant': 'vibhushan t', 'conciousnes': 'consciousness', 'Venus25': 'Venus',
                  'Sedataious': 'Seditious', 'promuslim': 'pro muslim', 'statusGuru': 'status Guru',
                  'yousician': 'musician', 'transgenus': 'trans genus', 'Pushbullet': 'Push bullet',
                  'jeesyllabus': 'jee syllabus', 'complusary': 'compulsory', 'Holocoust': 'Holocaust',
                  'careerplus': 'career plus', 'Lllustrate': 'Illustrate', 'Musino': 'Musion',
                  'Phinneus': 'Phineus', 'usedtoo': 'used too', 'JustBasic': 'Just Basic', 'webmusic': 'web music',
                  'TrustKit': 'Trust Kit', 'industrZgies': 'industries', 'rubustness': 'robustness',
                  'Missuses': 'Miss uses', 'Musturbation': 'Masturbation', 'bustees': 'bus tees',
                  'justyfy': 'justify', 'pegusus': 'pegasus', 'industrybuying': 'industry buying',
                  'advantegeous': 'advantageous', 'kotatsus': 'kotatsu s', 'justcreated': 'just created',
                  'simultameously': 'simultaneously', 'husoone': 'huso one', 'twiceusing': 'twice using',
                  'cetusplay': 'cetus play', 'sqamous': 'squamous', 'claustophobic': 'claustrophobic',
                  'Kaushika': 'Kaushik a', 'dioestrus': 'di oestrus', 'Degenerous': 'De generous',
                  'neculeus': 'nucleus', 'cutaneously': 'cu taneously', 'Alamotyrannus': 'Alamo tyrannus',
                  'Ivanious': 'Avanious', 'arceous': 'araceous', 'Flixbus': 'Flix bus', 'caausing': 'causing',
                  'publious': 'Publius', 'Juilus': 'Julius', 'Australianism': 'Australian ism',
                  'vetronus': 'verrons', 'nonspontaneous': 'non spontaneous', 'calcalus': 'calculus',
                  'commudus': 'Commodus', 'Rheusus': 'Rhesus', 'syallubus': 'syllabus', 'Yousician': 'Musician',
                  'qurush': 'qu rush', 'athiust': 'atheist', 'conclusionless': 'conclusion less',
                  'usertesting': 'user testing', 'redius': 'radius', 'Austrolia': 'Australia',
                  'sllaybus': 'syllabus', 'toponymous': 'top onymous', 'businiss': 'business',
                  'hyperthalamus': 'hyper thalamus', 'clause55': 'clause', 'cosicous': 'conscious',
                  'Sushena': 'Saphena', 'Luscinus': 'Luscious', 'Prussophile': 'Russophile', 'jeaslous': 'jealous',
                  'Austrelia': 'Australia', 'contiguious': 'contiguous',
                  'subconsciousnesses': 'sub consciousnesses', ' jusification ': ' justification ',
                  'dilusion': 'delusion', 'anticoncussive': 'anti concussive', 'disngush': 'disgust',
                  'constiously': 'consciously', 'filabustering': 'filibustering', 'GAPbuster': 'GAP buster',
                  'insectivourous': 'insectivorous', 'glocuse': 'glucose', 'Antritrust': 'Antitrust',
                  'thisAustralian': 'this Australian', 'FusionDrive': 'Fusion Drive', 'nuclus': 'nucleus',
                  'abussive': 'abusive', 'mustang1': 'mustangs', 'inradius': 'in radius', 'polonious': 'polonius',
                  'ofKulbhushan': 'of Kulbhushan', 'homosporous': 'homos porous', 'circumradius': 'circum radius',
                  'atlous': 'atrous', 'insustry': 'industry', 'campuswith': 'campus with', 'beacsuse': 'because',
                  'concuous': 'conscious', 'nonHindus': 'non Hindus', 'carnivourous': 'carnivorous',
                  'tradeplus': 'trade plus', 'Jeruselam': 'Jerusalem',
                  'musuclar': 'muscular', 'deangerous': 'dangerous', 'disscused': 'discussed',
                  'industdial': 'industrial', 'sallatious': 'fallacious', 'rohmbus': 'rhombus',
                  'golusu': 'gol usu', 'Minangkabaus': 'Minangkabau s', 'Mustansiriyah': 'Mustansiriya h',
                  'anomymously': 'anonymously', 'abonymously': 'anonymously', 'indrustry': 'industry',
                  'Musharrf': 'Musharraf', 'workouses': 'workhouses', 'sponataneously': 'spontaneously',
                  'anmuslim': 'an muslim', 'syallbus': 'syllabus', 'presumptuousnes': 'presumptuousness',
                  'Thaedus': 'Thaddus', 'industey': 'industry', 'hkust': 'hust', 'Kousseri': 'Kousser i',
                  'mousestats': 'mouse stats', 'russiagate': 'russia gate', 'simantaneously': 'simultaneously',
                  'Austertana': 'Auster tana', 'infussions': 'infusions', 'coclusion': 'conclusion',
                  'sustainabke': 'sustainable', 'tusami': 'tu sami', 'anonimously': 'anonymously',
                  'usebase': 'use base', 'balanoglossus': 'Balanoglossus', 'Unglaus': 'Ung laus',
                  'ignoramouses': 'ignoramuses', 'snuus': 'snus', 'reusibility': 'reusability',
                  'Straussianism': 'Straussian ism', 'simoultaneously': 'simultaneously',
                  'realbonus': 'real bonus', 'nuchakus': 'nunchakus', 'annonimous': 'anonymous',
                  'Incestious': 'Incestuous', 'Manuscriptology': 'Manuscript ology', 'difusse': 'diffuse',
                  'Pliosaurus': 'Pliosaur us', 'cushelle': 'cush elle', 'Catallus': 'Catullus',
                  'MuscleBlaze': 'Muscle Blaze', 'confousing': 'confusing', 'enthusiasmless': 'enthusiasm less',
                  'Tetherusd': 'Tethered', 'Josephius': 'Josephus', 'jusrlt': 'just',
                  'simutaneusly': 'simultaneously', 'mountaneous': 'mountainous', 'Badonicus': 'Sardonicus',
                  'muccus': 'mucous', 'nicus': 'nidus', 'austinlizards': 'austin lizards',
                  'errounously': 'erroneously', 'Australua': 'Australia', 'sylaabus': 'syllabus',
                  'dusyant': 'distant', 'javadiscussion': 'java discussion', 'megabuses': 'mega buses',
                  'danergous': 'dangerous', 'contestious': 'contentious', 'exause': 'excuse',
                  'muscluar': 'muscular', 'avacous': 'vacuous', 'Ingenhousz': 'Ingenious',
                  'holocausting': 'holocaust ing', 'Pakustan': 'Pakistan', 'purusharthas': 'purushartha',
                  'bapus': 'bapu s', 'useul': 'useful', 'pretenious': 'pretentious', 'homogeneus': 'homogeneous',
                  'bhlushes': 'blushes', 'Saggittarius': 'Sagittarius', 'sportsusa': 'sports usa',
                  'kerataconus': 'keratoconus', 'infrctuous': 'infructuous', 'Anonoymous': 'Anonymous',
                  'triphosphorus': 'tri phosphorus', 'ridicjlously': 'ridiculously',
                  'worldbusiness': 'world business', 'hollcaust': 'holocaust', 'Dusra': 'Dura',
                  'meritious': 'meritorious', 'Sauskes': 'Causes', 'inudustry': 'industry',
                  'frustratd': 'frustrated', 'hypotenous': 'hypotenuse', 'Dushasana': 'Dush asana',
                  'saadus': 'status', 'keratokonus': 'keratoconus', 'Jarrus': 'Harrus', 'neuseous': 'nauseous',
                  'simutanously': 'simultaneously', 'diphosphorus': 'di phosphorus', 'sulprus': 'surplus',
                  'Hasidus': 'Hasid us', 'suspenive': 'suspensive', 'illlustrator': 'illustrator',
                  'userflows': 'user flows', 'intrusivethoughts': 'intrusive thoughts', 'countinous': 'continuous',
                  'gpusli': 'gusli', 'Calculus1': 'Calculus', 'bushiri': 'Bushire',
                  'torvosaurus': 'Torosaurus', 'chestbusters': 'chest busters', 'Satannus': 'Sat annus',
                  'falaxious': 'fallacious', 'obnxious': 'obnoxious', 'tranfusions': 'transfusions',
                  'PlayMagnus': 'Play Magnus', 'Epicodus': 'Episodes', 'Hypercubus': 'Hypercubes',
                  'Musickers': 'Musick ers', 'programmebecause': 'programme because', 'indiginious': 'indigenous',
                  'housban': 'husband', 'iusso': 'kusso', 'annilingus': 'anilingus', 'Nennus': 'Genius',
                  'pussboy': 'puss boy', 'Photoacoustics': 'Photo acoustics', 'Hindusthanis': 'Hindustanis',
                  'lndustrial': 'industrial', 'tyrannously': 'tyrannous', 'Susanoomon': 'Susanoo mon',
                  'colmbus': 'columbus', 'sussessful': 'successful', 'ousmania': 'ous mania',
                  'ilustrating': 'illustrating', 'famousbirthdays': 'famous birthdays',
                  'suspectance': 'suspect ance', 'extroneous': 'extraneous', 'teethbrush': 'teeth brush',
                  'abcmouse': 'abc mouse', 'degenerous': 'de generous', 'doesGauss': 'does Gauss',
                  'insipudus': 'insipidus', 'movielush': 'movie lush', 'Rustichello': 'Rustic hello',
                  'Firdausiya': 'Firdausi ya', 'checkusers': 'check users', 'householdware': 'household ware',
                  'prosporously': 'prosperously', 'SteLouse': 'Ste Louse', 'obfuscaton': 'obfuscation',
                  'amorphus': 'amorphous', 'trustworhy': 'trustworthy', 'celsious': 'celsius',
                  'dangorous': 'dangerous', 'anticancerous': 'anti cancerous', 'cousi ': 'cousin ',
                  'austroloid': 'australoid', 'fergussion': 'percussion', 'andKyokushin': 'and Kyokushin',
                  'cousan': 'cousin', 'Huskystar': 'Husky star', 'retrovisus': 'retrovirus', 'becausr': 'because',
                  'Jerusalsem': 'Jerusalem', 'motorious': 'notorious', 'industrilised': 'industrialised',
                  'powerballsusa': 'powerballs usa', 'monoceious': 'monoecious', 'batteriesplus': 'batteries plus',
                  'nonviscuous': 'nonviscous', 'industion': 'induction', 'bussinss': 'business',
                  'userbags': 'user bags', 'Jlius': 'Julius', 'thausand': 'thousand', 'plustwo': 'plus two',
                  'defpush': 'def push', 'subconcussive': 'sub concussive', 'muslium': 'muslim',
                  'industrilization': 'industrialization', 'Maurititus': 'Mauritius', 'uslme': 'USMLE',
                  'Susgaon': 'Surgeon', 'Pantherous': 'Panther ous', 'antivirius': 'antivirus',
                  'Trustclix': 'Trust clix', 'silumtaneously': 'simultaneously', 'Icompus': 'Corpus',
                  'atonomous': 'autonomous', 'Reveuse': 'Reve use', 'legumnous': 'leguminous',
                  'syllaybus': 'syllabus', 'louspeaker': 'loudspeaker', 'susbtraction': 'subtraction',
                  'virituous': 'virtuous', 'disastrius': 'disastrous', 'jerussalem': 'jerusalem',
                  'Industrailzed': 'Industrialized', 'recusion': 'recursion',
                  'simultenously': 'simultaneously',
                  'Pulphus': 'Pulpous', 'harbaceous': 'herbaceous', 'phlegmonous': 'phlegmon ous', 'use38': 'use',
                  'jusify': 'justify', 'instatanously': 'instantaneously', 'tetramerous': 'tetramer ous',
                  'usedvin': 'used vin', 'sagittarious': 'sagittarius', 'mausturbate': 'masturbate',
                  'subcautaneous': 'subcutaneous', 'dangergrous': 'dangerous', 'sylabbus': 'syllabus',
                  'hetorozygous': 'heterozygous', 'Ignasius': 'Ignacius', 'businessbor': 'business bor',
                  'Bhushi': 'Thushi', 'Moussolini': 'Mussolini', 'usucaption': 'usu caption',
                  'Customzation': 'Customization', 'cretinously': 'cretinous', 'genuiuses': 'geniuses',
                  'Moushmee': 'Mousmee', 'neigous': 'nervous',
                  'infrustructre': 'infrastructure', 'Ilusha': 'Ilesha', 'suconciously': 'subconsciously',
                  'stusy': 'study', 'mustectomy': 'mastectomy', 'Farmhousebistro': 'Farmhouse bistro',
                  'instantanous': 'instantaneous', 'JustForex': 'Just Forex', 'Indusyry': 'Industry',
                  'mustabating': 'masturbating', 'uninstrusive': 'unintrusive', 'customshoes': 'custom shoes',
                  'homageneous': 'homogeneous', 'Empericus': 'Imperious', 'demisexuality': 'demi sexuality',
                  'transexualism': 'transsexualism', 'sexualises': 'sexualise', 'demisexuals': 'demisexual',
                  'sexuly': 'sexily', 'Pornosexuality': 'Porno sexuality', 'sexond': 'second', 'sexxual': 'sexual',
                  'asexaul': 'asexual', 'sextactic': 'sex tactic', 'sexualityism': 'sexuality ism',
                  'monosexuality': 'mono sexuality', 'intwrsex': 'intersex', 'hypersexualize': 'hyper sexualize',
                  'homosexualtiy': 'homosexuality', 'examsexams': 'exams exams', 'sexmates': 'sex mates',
                  'sexyjobs': 'sexy jobs', 'sexitest': 'sexiest', 'fraysexual': 'fray sexual',
                  'sexsurrogates': 'sex surrogates', 'sexuallly': 'sexually', 'gamersexual': 'gamer sexual',
                  'greysexual': 'grey sexual', 'omnisexuality': 'omni sexuality', 'hetereosexual': 'heterosexual',
                  'productsexamples': 'products examples', 'sexgods': 'sex gods', 'semisexual': 'semi sexual',
                  'homosexulity': 'homosexuality', 'sexeverytime': 'sex everytime', 'neurosexist': 'neuro sexist',
                  'worldquant': 'world quant', 'Freshersworld': 'Freshers world', 'smartworld': 'smart world',
                  'Mistworlds': 'Mist worlds', 'boothworld': 'booth world', 'ecoworld': 'eco world',
                  'Ecoworld': 'Eco world', 'underworldly': 'under worldly', 'worldrank': 'world rank',
                  'Clearworld': 'Clear world', 'Boothworld': 'Booth world', 'Rimworld': 'Rim world',
                  'cryptoworld': 'crypto world', 'machineworld': 'machine world', 'worldwideley': 'worldwide ley',
                  'capuletwant': 'capulet want', 'Bhagwanti': 'Bhagwant i', 'Unwanted72': 'Unwanted 72',
                  'wantrank': 'want rank',
                  'willhappen': 'will happen', 'thateasily': 'that easily',
                  'Whatevidence': 'What evidence', 'metaphosphates': 'meta phosphates',
                  'exilarchate': 'exilarch ate', 'aulphate': 'sulphate', 'Whateducation': 'What education',
                  'persulphates': 'per sulphates', 'disulphate': 'di sulphate', 'picosulphate': 'pico sulphate',
                  'tetraosulphate': 'tetrao sulphate', 'prechinese': 'pre chinese',
                  'Hellochinese': 'Hello chinese', 'muchdeveloped': 'much developed', 'stomuch': 'stomach',
                  'Whatmakes': 'What makes', 'Lensmaker': 'Lens maker', 'eyemake': 'eye make',
                  'Techmakers': 'Tech makers', 'cakemaker': 'cake maker', 'makeup411': 'makeup 411',
                  'objectmake': 'object make', 'crazymaker': 'crazy maker', 'techmakers': 'tech makers',
                  'makedonian': 'macedonian', 'makeschool': 'make school', 'anxietymake': 'anxiety make',
                  'makeshifter': 'make shifter', 'countryball': 'country ball', 'Whichcountry': 'Which country',
                  'countryHow': 'country How', 'Zenfone': 'Zen fone', 'Electroneum': 'Electro neum',
                  'electroneum': 'electro neum', 'Demonetisation': 'demonetization', 'zenfone': 'zen fone',
                  'ZenFone': 'Zen Fone', 'onecoin': 'one coin', 'demonetizing': 'demonetized',
                  'iphone7': 'iPhone', 'iPhone6': 'iPhone', 'microneedling': 'micro needling', 'iphone6': 'iPhone',
                  'Monegasques': 'Monegasque s', 'demonetised': 'demonetized',
                  'EveryoneDiesTM': 'EveryoneDies TM', 'teststerone': 'testosterone', 'DoneDone': 'Done Done',
                  'papermoney': 'paper money', 'Sasabone': 'Sasa bone', 'Blackphone': 'Black phone',
                  'Bonechiller': 'Bone chiller', 'Moneyfront': 'Money front', 'workdone': 'work done',
                  'iphoneX': 'iPhone', 'roxycodone': 'r oxycodone',
                  'moneycard': 'money card', 'Fantocone': 'Fantocine', 'eletronegativity': 'electronegativity',
                  'mellophones': 'mellophone s', 'isotones': 'iso tones', 'donesnt': 'doesnt',
                  'thereanyone': 'there anyone', 'electronegativty': 'electronegativity',
                  'commissiioned': 'commissioned', 'earvphone': 'earphone', 'condtioners': 'conditioners',
                  'demonetistaion': 'demonetization', 'ballonets': 'ballo nets', 'DoneClaim': 'Done Claim',
                  'alimoney': 'alimony', 'iodopovidone': 'iodo povidone', 'bonesetters': 'bone setters',
                  'componendo': 'compon endo', 'probationees': 'probationers', 'one300': 'one 300',
                  'nonelectrolyte': 'non electrolyte', 'ozonedepletion': 'ozone depletion',
                  'Stonehart': 'Stone hart', 'Vodafone2': 'Vodafones', 'chaparone': 'chaperone',
                  'Noonein': 'No one in', 'Frosione': 'Erosion', 'IPhone7': 'Iphone', 'pentanone': 'penta none',
                  'poneglyphs': 'pone glyphs', 'cyclohexenone': 'cyclohexanone', 'marlstone': 'marls tone',
                  'androneda': 'andromeda', 'iphone8': 'iPhone', 'acidtone': 'acid tone',
                  'noneconomically': 'non economically', 'Honeyfund': 'Honey fund', 'germanophone': 'German speaking',
                  'Democratizationed': 'Democratization ed', 'haoneymoon': 'honeymoon', 'iPhone7': 'iPhone 7',
                  'someonewith': 'someone with', 'Hexanone': 'Hexa none', 'bonespur': 'bone spur',
                  'sisterzoned': 'sister zoned', 'HasAnyone': 'Has Anyone',
                  'stonepelters': 'stone pelters', 'Chronexia': 'Chronaxia', 'brotherzone': 'brother zone',
                  'brotherzoned': 'brother zoned', 'fonecare': 'fone care', 'nonexsistence': 'nonexistence',
                  'conents': 'contents', 'phonecases': 'phone cases', 'Commissionerates': 'Commissioner ates',
                  'activemoney': 'active money', 'dingtone': 'ding tone', 'wheatestone': 'wheatstone',
                  'chiropractorone': 'chiropractor one', 'heeadphones': 'headphones', 'Maimonedes': 'Maimonides',
                  'onepiecedeals': 'onepiece deals', 'oneblade': 'one blade', 'venetioned': 'Venetianed',
                  'sunnyleone': 'sunny leone', 'prendisone': 'prednisone', 'Anglosaxophone': 'Anglo saxophone',
                  'Blackphones': 'Black phones', 'jionee': 'jinnee', 'chromonema': 'chromo nema',
                  'iodoketones': 'iodo ketones', 'demonetizations': 'demonetization', 'aomeone': 'someone',
                  'trillonere': 'trillionaire', 'abandonee': 'abandon',
                  'MasterColonel': 'Master Colonel', 'fronend': 'frontend', 'Wildstone': 'Wild stone',
                  'patitioned': 'petitioned', 'lonewolfs': 'lone wolfs', 'Spectrastone': 'Spectra stone',
                  'dishonerable': 'dishonorable', 'poisiones': 'poisons',
                  'condioner': 'conditioner', 'unpermissioned': 'unper missioned', 'friedzone': 'fried zone',
                  'umumoney': 'umu money', 'anyonestudied': 'anyone studied', 'dictioneries': 'dictionaries',
                  'nosebone': 'nose bone', 'ofVodafone': 'of Vodafone',
                  'Yumstone': 'Yum stone', 'oxandrolonesteroid': 'oxandrolone steroid',
                  'Mifeprostone': 'Mifepristone', 'pheramones': 'pheromones',
                  'sinophone': 'Sinophobe', 'peloponesian': 'peloponnesian', 'michrophone': 'microphone',
                  'commissionets': 'commissioners', 'methedone': 'methadone', 'cobditioners': 'conditioners',
                  'urotone': 'protone', 'smarthpone': 'smartphone', 'conecTU': 'connect you', 'beloney': 'boloney',
                  'comfortzone': 'comfort zone', 'testostersone': 'testosterone', 'camponente': 'component',
                  'Idonesia': 'Indonesia', 'dolostones': 'dolostone', 'psiphone': 'psi phone',
                  'ceftriazone': 'ceftriaxone', 'feelonely': 'feel onely', 'monetation': 'moderation',
                  'activationenergy': 'activation energy', 'moneydriven': 'money driven',
                  'staionery': 'stationery', 'zoneflex': 'zone flex', 'moneycash': 'money cash',
                  'conectiin': 'connection', 'Wannaone': 'Wanna one',
                  'Pictones': 'Pict ones', 'demonentization': 'demonetization',
                  'phenonenon': 'phenomenon', 'evenafter': 'even after', 'Sevenfriday': 'Seven friday',
                  'Devendale': 'Evendale', 'theeventchronicle': 'the event chronicle',
                  'seventysomething': 'seventy something', 'sevenpointed': 'seven pointed',
                  'richfeel': 'rich feel', 'overfeel': 'over feel', 'feelingstupid': 'feeling stupid',
                  'Photofeeler': 'Photo feeler', 'feelomgs': 'feelings', 'feelinfs': 'feelings',
                  'PlayerUnknown': 'Player Unknown', 'Playerunknown': 'Player unknown', 'knowlefge': 'knowledge',
                  'knowledgd': 'knowledge', 'knowledeg': 'knowledge', 'knowble': 'Knowle', 'Howknow': 'Howk now',
                  'knowledgeWoods': 'knowledge Woods', 'knownprogramming': 'known programming',
                  'selfknowledge': 'self knowledge', 'knowldage': 'knowledge', 'knowyouve': 'know youve',
                  'aknowlege': 'knowledge', 'Audetteknown': 'Audette known', 'knowlegdeable': 'knowledgeable',
                  'trueoutside': 'true outside', 'saynthesize': 'synthesize', 'EssayTyper': 'Essay Typer',
                  'meesaya': 'mee saya', 'Rasayanam': 'Rasayan am', 'fanessay': 'fan essay', 'momsays': 'moms ays',
                  'sayying': 'saying', 'saydaw': 'say daw', 'Fanessay': 'Fan essay', 'theyreally': 'they really',
                  'gayifying': 'gayed up with homosexual love', 'gayke': 'gay Online retailers',
                  'Lingayatism': 'Lingayat',
                  'macapugay': 'Macaulay', 'jewsplain': 'jews plain',
                  'banggood': 'bang good', 'goodfriends': 'good friends',
                  'goodfirms': 'good firms', 'Banggood': 'Bang good', 'dogooder': 'do gooder',
                  'stillshots': 'stills hots', 'stillsuits': 'still suits', 'panromantic': 'pan romantic',
                  'paracommando': 'para commando', 'romantize': 'romanize', 'manupulative': 'manipulative',
                  'manjha': 'mania', 'mankrit': 'mank rit',
                  'heteroromantic': 'hetero romantic', 'pulmanery': 'pulmonary', 'manpads': 'man pads',
                  'supermaneuverable': 'super maneuverable', 'mandatkry': 'mandatory', 'armanents': 'armaments',
                  'manipative': 'mancipative', 'himanity': 'humanity', 'maneuever': 'maneuver',
                  'Kumarmangalam': 'Kumar mangalam', 'Brahmanwadi': 'Brahman wadi',
                  'exserviceman': 'ex serviceman',
                  'managewp': 'managed', 'manies': 'many', 'recordermans': 'recorder mans',
                  'Feymann': 'Heymann', 'salemmango': 'salem mango', 'manufraturing': 'manufacturing',
                  'sreeman': 'freeman', 'tamanaa': 'Tamanac', 'chlamydomanas': 'chlamydomonas',
                  'comandant': 'commandant', 'huemanity': 'humanity', 'manaagerial': 'managerial',
                  'lithromantics': 'lith romantics',
                  'geramans': 'germans', 'Nagamandala': 'Naga mandala', 'humanitariarism': 'humanitarianism',
                  'wattman': 'watt man', 'salesmanago': 'salesman ago', 'Washwoman': 'Wash woman',
                  'rammandir': 'ram mandir', 'nomanclature': 'nomenclature', 'Haufman': 'Kaufman',
                  'prefomance': 'performance', 'ramanunjan': 'Ramanujan', 'Freemansonry': 'Freemasonry',
                  'supermaneuverability': 'super maneuverability', 'manstruate': 'menstruate',
                  'Tarumanagara': 'Taruma nagara', 'RomanceTale': 'Romance Tale', 'heteromantic': 'hete romantic',
                  'terimanals': 'terminals', 'womansplaining': 'wo mansplaining',
                  'performancelearning': 'performance learning', 'sociomantic': 'sciomantic',
                  'batmanvoice': 'batman voice', 'PerformanceTesting': 'Performance Testing',
                  'manorialism': 'manorial ism', 'newscommando': 'news commando',
                  'Entwicklungsroman': 'Entwicklungs roman',
                  'Kunstlerroman': 'Kunstler roman', 'bodhidharman': 'Bodhidharma', 'Howmaney': 'How maney',
                  'manufucturing': 'manufacturing', 'remmaning': 'remaining', 'rangeman': 'range man',
                  'mythomaniac': 'mythomania', 'katgmandu': 'katmandu',
                  'Superowoman': 'Superwoman', 'Rahmanland': 'Rahman land', 'Dormmanu': 'Dormant',
                  'Geftman': 'Gentman', 'manufacturig': 'manufacturing', 'bramanistic': 'Brahmanistic',
                  'padmanabhanagar': 'padmanabhan agar', 'homoromantic': 'homo romantic', 'femanists': 'feminists',
                  'demihuman': 'demi human', 'manrega': 'Manresa', 'Pasmanda': 'Pas manda',
                  'manufacctured': 'manufactured', 'remaninder': 'remainder', 'Marimanga': 'Mari manga',
                  'Sloatman': 'Sloat man', 'manlet': 'man let', 'perfoemance': 'performance',
                  'mangolian': 'mongolian', 'mangekyu': 'mange kyu', 'mansatory': 'mandatory',
                  'managemebt': 'management', 'manufctures': 'manufactures', 'Bramanical': 'Brahmanical',
                  'manaufacturing': 'manufacturing', 'Lakhsman': 'Lakhs man', 'Sarumans': 'Sarum ans',
                  'mangalasutra': 'mangalsutra', 'Germanised': 'German ised',
                  'managersworking': 'managers working', 'cammando': 'commando', 'mandrillaris': 'mandrill aris',
                  'Emmanvel': 'Emmarvel', 'manupalation': 'manipulation', 'welcomeromanian': 'welcome romanian',
                  'humanfemale': 'human female', 'mankirt': 'mankind', 'Haffmann': 'Hoffmann',
                  'Panromantic': 'Pan romantic', 'demantion': 'detention', 'Suparwoman': 'Superwoman',
                  'parasuramans': 'parasuram ans', 'sulmann': 'Suilmann', 'Shubman': 'Subman',
                  'manspread': 'man spread', 'mandingan': 'Mandingan', 'mandalikalu': 'mandalika lu',
                  'manufraturer': 'manufacturer', 'Wedgieman': 'Wedgie man', 'manwues': 'manages',
                  'humanzees': 'human zees', 'Steymann': 'Stedmann', 'Jobberman': 'Jobber man',
                  'maniquins': 'mani quins', 'biromantical': 'bi romantical', 'Rovman': 'Roman',
                  'pyromantic': 'pyro mantic', 'Tastaman': 'Rastaman', 'Spoolman': 'Spool man',
                  'Subramaniyan': 'Subramani yan', 'abhimana': 'abhiman a', 'manholding': 'man holding',
                  'seviceman': 'serviceman', 'womansplained': 'womans plained', 'manniya': 'mania',
                  'Bhraman': 'Braman', 'Laakman': 'Layman', 'mansturbate': 'masturbate',
                  'Sulamaniya': 'Sulamani ya', 'demanters': 'decanters', 'postmanare': 'postman are',
                  'mannualy': 'annual', 'rstman': 'Rotman', 'permanentjobs': 'permanent jobs',
                  'Allmang': 'All mang', 'TradeCommander': 'Trade Commander', 'BasedStickman': 'Based Stickman',
                  'Deshabhimani': 'Desha bhimani', 'manslamming': 'mans lamming', 'Brahmanwad': 'Brahman wad',
                  'fundemantally': 'fundamentally', 'supplemantary': 'supplementary', 'egomanias': 'ego manias',
                  'manvantar': 'Manvantara', 'spymania': 'spy mania', 'mangonada': 'mango nada',
                  'manthras': 'mantras', 'Humanpark': 'Human park', 'manhuas': 'mahuas',
                  'manterrupting': 'interrupting', 'dermatillomaniac': 'dermatillomania',
                  'performancies': 'performances', 'manipulant': 'manipulate',
                  'painterman': 'painter man', 'mangalik': 'manglik',
                  'Neurosemantics': 'Neuro semantics', 'discrimantion': 'discrimination',
                  'Womansplaining': 'feminist', 'mongodump': 'mongo dump', 'roadgods': 'road gods',
                  'Oligodendraglioma': 'Oligodendroglioma', 'unrightly': 'un rightly', 'Janewright': 'Jane wright',
                  ' righten ': ' tighten ', 'brightiest': 'brightest',
                  'frighter': 'fighter', 'righteouness': 'righteousness', 'triangleright': 'triangle right',
                  'Brightspace': 'Brights pace', 'techinacal': 'technical', 'chinawares': 'china wares',
                  'Vancouever': 'Vancouver', 'cheverlet': 'cheveret', 'deverstion': 'diversion',
                  'everbodys': 'everybody', 'Dramafever': 'Drama fever', 'reverificaton': 'reverification',
                  'canterlever': 'canter lever', 'keywordseverywhere': 'keywords everywhere',
                  'neverunlearned': 'never unlearned', 'everyfirst': 'every first',
                  'neverhteless': 'nevertheless', 'clevercoyote': 'clever coyote', 'irrevershible': 'irreversible',
                  'achievership': 'achievers hip', 'easedeverything': 'eased everything', 'youbever': 'you bever',
                  'everperson': 'ever person', 'everydsy': 'everyday', 'whemever': 'whenever',
                  'everyonr': 'everyone', 'severiity': 'severity', 'narracist': 'nar racist',
                  'racistly': 'racist', 'takesuch': 'take such', 'mystakenly': 'mistakenly',
                  'shouldntake': 'shouldnt take', 'Kalitake': 'Kali take', 'msitake': 'mistake',
                  'straitstimes': 'straits times', 'timefram': 'timeframe', 'watchtime': 'watch time',
                  'timetraveling': 'timet raveling', 'peactime': 'peacetime', 'timetabe': 'timetable',
                  'cooktime': 'cook time', 'blocktime': 'block time', 'timesjobs': 'times jobs',
                  'timesence': 'times ence', 'Touchtime': 'Touch time', 'timeloop': 'time loop',
                  'subcentimeter': 'sub centimeter', 'timejobs': 'time jobs', 'Guardtime': 'Guard time',
                  'realtimepolitics': 'realtime politics', 'loadingtimes': 'loading times',
                  'timesnow': '24-hour English news channel in India', 'timesspark': 'times spark',
                  'timetravelling': 'timet ravelling',
                  'antimeter': 'anti meter', 'timewaste': 'time waste', 'cryptochristians': 'crypto christians',
                  'Whatcould': 'What could', 'becomesdouble': 'becomes double', 'deathbecomes': 'death becomes',
                  'youbecome': 'you become', 'greenseer': 'people who possess the magical ability',
                  'rseearch': 'research', 'homeseek': 'home seek',
                  'Greenseer': 'people who possess the magical ability', 'starseeders': 'star seeders',
                  'seekingmillionaire': 'seeking millionaire', 'see\u202c': 'see',
                  'seeies': 'series', 'CodeAgon': 'Code Agon',
                  'royago': 'royal', 'Dragonkeeper': 'Dragon keeper', 'mcgreggor': 'McGregor',
                  'catrgory': 'category', 'Dragonknight': 'Dragon knight', 'Antergos': 'Anteros',
                  'togofogo': 'togo fogo', 'mongorestore': 'mongo restore', 'gorgops': 'gorgons',
                  'withgoogle': 'with google', 'goundar': 'Gondar', 'algorthmic': 'algorithmic',
                  'goatnuts': 'goat nuts', 'vitilgo': 'vitiligo', 'polygony': 'poly gony',
                  'digonals': 'diagonals', 'Luxemgourg': 'Luxembourg', 'UCSanDiego': 'UC SanDiego',
                  'Ringostat': 'Ringo stat', 'takingoff': 'taking off', 'MongoImport': 'Mongo Import',
                  'alggorithms': 'algorithms', 'dragonknight': 'dragon knight', 'negotiatior': 'negotiation',
                  'gomovies': 'go movies', 'Withgott': 'Without',
                  'categoried': 'categories', 'Stocklogos': 'Stock logos', 'Pedogogical': 'Pedological',
                  'Wedugo': 'Wedge', 'golddig': 'gold dig', 'goldengroup': 'golden group',
                  'merrigo': 'merligo', 'googlemapsAPI': 'googlemaps API', 'goldmedal': 'gold medal',
                  'golemized': 'polemized', 'Caligornia': 'California', 'unergonomic': 'un ergonomic',
                  'fAegon': 'wagon', 'vertigos': 'vertigo s', 'trigonomatry': 'trigonometry',
                  'hypogonadic': 'hypogonadia', 'Mogolia': 'Mongolia', 'governmaent': 'government',
                  'ergotherapy': 'ergo therapy', 'Bogosort': 'Bogo sort', 'goalwise': 'goal wise',
                  'alogorithms': 'algorithms', 'MercadoPago': 'Mercado Pago', 'rivigo': 'rivi go',
                  'govshutdown': 'gov shutdown', 'gorlfriend': 'girlfriend',
                  'stategovt': 'state govt', 'Chickengonia': 'Chicken gonia', 'Yegorovich': 'Yegorov ich',
                  'regognitions': 'recognitions', 'gorichen': 'Gori Chen Mountain',
                  'goegraphies': 'geographies', 'gothras': 'goth ras', 'belagola': 'bela gola',
                  'snapragon': 'snapdragon', 'oogonial': 'oogonia l', 'Amigofoods': 'Amigo foods',
                  'Sigorn': 'son of Styr', 'algorithimic': 'algorithmic',
                  'innermongolians': 'inner mongolians', 'ArangoDB': 'Arango DB', 'zigolo': 'gigolo',
                  'regognized': 'recognized', 'Moongot': 'Moong ot', 'goldquest': 'gold quest',
                  'catagorey': 'category', 'got7': 'got', 'jetbingo': 'jet bingo', 'Dragonchain': 'Dragon chain',
                  'catwgorized': 'categorized', 'gogoro': 'gogo ro', 'Tobagoans': 'Tobago ans',
                  'digonal': 'di gonal', 'algoritmic': 'algorismic', 'dragonflag': 'dragon flag',
                  'Indigoflight': 'Indigo flight',
                  'governening': 'governing', 'ergosphere': 'ergo sphere',
                  'pingo5': 'pingo', 'Montogo': 'montego', 'Rivigo': 'technology-enabled logistics company',
                  'Jigolo': 'Gigolo', 'phythagoras': 'pythagoras', 'Mangolian': 'Mongolian',
                  'forgottenfaster': 'forgotten faster', 'stargold': 'a Hindi movie channel',
                  'googolplexain': 'googolplexian', 'corpgov': 'corp gov',
                  'govtribe': 'provides real-time federal contracting market intel',
                  'dragonglass': 'dragon glass', 'gorakpur': 'Gorakhpur', 'MangoPay': 'Mango Pay',
                  'chigoe': 'sub-tropical climates', 'BingoBox': 'an investment company', '走go': 'go',
                  'followingorder': 'following order', 'pangolinminer': 'pangolin miner',
                  'negosiation': 'negotiation', 'lexigographers': 'lexicographers', 'algorithom': 'algorithm',
                  'unforgottable': 'unforgettable', 'wellsfargoemail': 'wellsfargo email',
                  'daigonal': 'diagonal', 'Pangoro': 'cantankerous Pokemon', 'negotiotions': 'negotiations',
                  'Swissgolden': 'Swiss golden', 'google4': 'google', 'Agoraki': 'Ago raki',
                  'Garthago': 'Carthago', 'Stegosauri': 'stegosaurus', 'ergophobia': 'ergo phobia',
                  'bigolive': 'big olive', 'bittergoat': 'bitter goat', 'naggots': 'faggots',
                  'googology': 'online encyclopedia', 'algortihms': 'algorithms', 'bengolis': 'Bengalis',
                  'fingols': 'Finnish people are supposedly descended from Mongols',
                  'savethechildren': 'save thechildren',
                  'stopings': 'stoping', 'stopsits': 'stop sits', 'stopsigns': 'stop signs',
                  'Galastop': 'Galas top', 'pokestops': 'pokes tops', 'forcestop': 'forces top',
                  'Hopstop': 'Hops top', 'stoppingexercises': 'stopping exercises', 'coinstop': 'coins top',
                  'stoppef': 'stopped', 'workaway': 'work away', 'snazzyway': 'snazzy way',
                  'Rewardingways': 'Rewarding ways', 'cloudways': 'cloud ways', 'Cloudways': 'Cloud ways',
                  'Brainsway': 'Brains way', 'nesraway': 'nearaway',
                  'AlwaysHired': 'Always Hired', 'expessway': 'expressway', 'Syncway': 'Sync way',
                  'LeewayHertz': 'Blockchain Company', 'towayrds': 'towards', 'swayable': 'sway able',
                  'Telloway': 'Tello way', 'palsmodium': 'plasmodium', 'Gobackmodi': 'Goback modi',
                  'comodies': 'corodies', 'islamphobic': 'islam phobic', 'islamphobia': 'islam phobia',
                  'citiesbetter': 'cities better', 'betterv3': 'better', 'betterDtu': 'better Dtu',
                  'Babadook': 'a horror drama film', 'Ahemadabad': 'Ahmadabad', 'faidabad': 'Faizabad',
                  'Amedabad': 'Ahmedabad', 'kabadii': 'kabaddi', 'badmothing': 'badmouthing',
                  'badminaton': 'badminton', 'badtameezdil': 'badtameez dil', 'badeffects': 'bad effects',
                  '∠bad': 'bad', 'ahemadabad': 'Ahmadabad', 'embaded': 'embased', 'Isdhanbad': 'Is dhanbad',
                  'badgermoles': 'enormous, blind mammal', 'allhabad': 'Allahabad', 'ghazibad': 'ghazi bad',
                  'htderabad': 'Hyderabad', 'Auragabad': 'Aurangabad', 'ahmedbad': 'Ahmedabad',
                  'ahmdabad': 'Ahmadabad', 'alahabad': 'Allahabad',
                  'Hydeabad': 'Hyderabad', 'Gyroglove': 'wearable technology', 'foodlovee': 'food lovee',
                  'slovenised': 'slovenia', 'handgloves': 'hand gloves', 'lovestep': 'love step',
                  'lovejihad': 'love jihad', 'RolloverBox': 'Rollover Box', 'stupidedt': 'stupidest',
                  'toostupid': 'too stupid',
                  'pakistanisbeautiful': 'pakistanis beautiful', 'ispakistan': 'is pakistan',
                  'inpersonations': 'impersonations', 'medicalperson': 'medical person',
                  'interpersonation': 'inter personation', 'workperson': 'work person',
                  'personlich': 'person lich', 'persoenlich': 'person lich',
                  'middleperson': 'middle person', 'personslized': 'personalized',
                  'personifaction': 'personification', 'welcomemarriage': 'welcome marriage',
                  'come2': 'come to', 'upcomedians': 'up comedians', 'overvcome': 'overcome',
                  'talecome': 'tale come', 'cometitive': 'competitive', 'arencome': 'aren come',
                  'achecomes': 'ache comes', '」come': 'come',
                  'comepleted': 'completed', 'overcomeanxieties': 'overcome anxieties',
                  'demigirl': 'demi girl', 'gridgirl': 'female models of the race', 'halfgirlfriend': 'half girlfriend',
                  'girlriend': 'girlfriend', 'fitgirl': 'fit girl', 'girlfrnd': 'girlfriend', 'awrong': 'aw rong',
                  'northcap': 'north cap', 'productionsupport': 'production support',
                  'Designbold': 'Online Photo Editor Design Studio',
                  'skyhold': 'sky hold', 'shuoldnt': 'shouldnt', 'anarold': 'Android', 'yaerold': 'year old',
                  'soldiders': 'soldiers', 'indrold': 'Android', 'blindfoldedly': 'blindfolded',
                  'overcold': 'over cold', 'Goldmont': 'microarchitecture in Intel', 'boldspot': 'bolds pot',
                  'Rankholders': 'Rank holders', 'cooldrink': 'cool drink', 'beltholders': 'belt holders',
                  'GoldenDict': 'open-source dictionary program', 'softskill': 'softs kill',
                  'Cooldige': 'the 30th president of the United States',
                  'newkiller': 'new killer', 'skillselect': 'skills elect', 'nonskilled': 'non skilled',
                  'killyou': 'kill you', 'Skillport': 'Army e-Learning Program', 'unkilled': 'un killed',
                  'killikng': 'killing', 'killograms': 'kilograms',
                  'Worldkillers': 'World killers', 'reskilled': 'skilled',
                  'killedshivaji': 'killed shivaji', 'honorkillings': 'honor killings',
                  'skillclasses': 'skill classes', 'microskills': 'micros kills',
                  'Skillselect': 'Skills elect', 'ratkill': 'rat kill',
                  'pleasegive': 'please give', 'flashgive': 'flash give',
                  'southerntelescope': 'southern telescope', 'westsouth': 'west south',
                  'southAfricans': 'south Africans', 'Joboutlooks': 'Job outlooks', 'joboutlook': 'job outlook',
                  'Outlook365': 'Outlook 365', 'Neulife': 'Neu life', 'qualifeid': 'qualified',
                  'nullifed': 'nullified', 'lifeaffect': 'life affect', 'lifestly': 'lifestyle',
                  'aristocracylifestyle': 'aristocracy lifestyle', 'antilife': 'anti life',
                  'afterafterlife': 'after afterlife', 'lifestylye': 'lifestyle', 'prelife': 'pre life',
                  'lifeute': 'life ute', 'liferature': 'literature',
                  'securedlife': 'secured life', 'doublelife': 'double life', 'antireligion': 'anti religion',
                  'coreligionist': 'co religionist', 'petrostates': 'petro states', 'otherstates': 'others tates',
                  'spacewithout': 'space without', 'withoutyou': 'without you',
                  'withoutregistered': 'without registered', 'weightwithout': 'weight without',
                  'withoutcheck': 'without check', 'milkwithout': 'milk without',
                  'Highschoold': 'High school', 'memoney': 'money', 'moneyof': 'mony of', 'Oneplus': 'OnePlus',
                  'OnePlus': 'Chinese smartphone manufacturer', 'Beerus': 'the God of Destruction',
                  'takeoverr': 'takeover', 'demonetizedd': 'demonetized', 'polyhouse': 'Polytunnel',
                  'Elitmus': 'eLitmus', 'eLitmus': 'Indian company that helps companies in hiring employees',
                  'becone': 'become', 'nestaway': 'nest away', 'takeoverrs': 'takeovers', 'Istop': 'I stop',
                  'Austira': 'Australia', 'germeny': 'Germany', 'mansoon': 'man soon',
                  'worldmax': 'wholesaler of drum parts',
                  'ammusement': 'amusement', 'manyare': 'many are', 'supplymentary': 'supply mentary',
                  'timesup': 'times up', 'homologus': 'homologous', 'uimovement': 'ui movement', 'spause': 'spouse',
                  'aesexual': 'asexual', 'Iovercome': 'I overcome', 'developmeny': 'development',
                  'hindusm': 'hinduism', 'sexpat': 'sex tourism', 'sunstop': 'sun stop', 'polyhouses': 'Polytunnel',
                  'usefl': 'useful', 'Fundamantal': 'fundamental', 'environmentai': 'environmental',
                  'Redmi': 'Xiaomi Mobile', 'Loy Machedo': ' Motivational Speaker ', 'unacademy': 'Unacademy',
                  'Boruto': 'Naruto Next Generations', 'Upwork': 'Up work',
                  'Unacademy': 'educational technology company',
                  'HackerRank': 'Hacker Rank', 'upwork': 'up work', 'Chromecast': 'Chrome cast',
                  'microservices': 'micro services', 'Undertale': 'video game', 'undergraduation': 'under graduation',
                  'chapterwise': 'chapter wise', 'twinflame': 'twin flame', 'Hotstar': 'Hot star',
                  'blockchains': 'blockchain',
                  'darkweb': 'dark web', 'Microservices': 'Micro services', 'Nearbuy': 'Nearby',
                  ' Padmaavat ': ' Padmavati ', ' padmavat ': ' Padmavati ', ' Padmaavati ': ' Padmavati ',
                  ' Padmavat ': ' Padmavati ', ' internshala ': ' internship and online training platform in India ',
                  'dream11': ' fantasy sports platform in India ', 'conciousnesss': 'consciousnesses',
                  'Dream11': ' fantasy sports platform in India ', 'cointry': 'country', ' coinvest ': ' invest ',
                  '23 andme': 'privately held personal genomics and biotechnology company in California',
                  'Trumpism': 'philosophy and politics espoused by Donald Trump',
                  'Trumpian': 'viewpoints of President Donald Trump', 'Trumpists': 'admirer of Donald Trump',
                  'coincidents': 'coincidence', 'coinsized': 'coin sized', 'coincedences': 'coincidences',
                  'cointries': 'countries', 'coinsidered': 'considered', 'coinfirm': 'confirm',
                  'humilates':'humiliates', 'vicevice':'vice vice', 'politicak':'political', 'Sumaterans':'Sumatrans',
                  'Kamikazis':'Kamikazes', 'unmoraled':'unmoral', 'eduacated':'educated', 'moraled':'morale',
                  'Amharc':'Amarc', 'where Burkhas':'wear Burqas', 'Baloochistan':'Balochistan', 'durgahs':'durgans',
                  'illigitmate':'illegitimate', 'hillum':'helium','treatens':'threatens','mutiliating':'mutilating',
                  'speakingly':'speaking', 'pretex':'pretext', 'menstruateion':'menstruation', 
                  'genocidizing':'genociding', 'maratis':'Maratism','Parkistinian':'Pakistani', 'SPEICIAL':'SPECIAL',
                  'REFERNECE':'REFERENCE', 'provocates':'provokes', 'FAMINAZIS':'FEMINAZIS', 'repugicans':'republicans',
                  'tonogenesis':'tone', 'winor':'win', 'redicules':'ridiculous', 'Beluchistan':'Balochistan', 
                  'volime':'volume', 'namaj':'namaz', 'CONgressi':'Congress', 'Ashifa':'Asifa', 'queffing':'queefing',
                  'montheistic':'nontheistic', 'Rajsthan':'Rajasthan', 'Rajsthanis':'Rajasthanis', 'specrum':'spectrum',
                  'brophytes':'bryophytes', 'adhaar':'Adhara', 'slogun':'slogan', 'harassd':'harassed',
                  'transness':'trans gender', 'Insdians':'Indians', 'Trampaphobia':'Trump aphobia', 'attrected':'attracted',
                  'Yahtzees':'Yahtzee', 'thiests':'atheists', 'thrir':'their', 'extraterestrial':'extraterrestrial',
                  'silghtest':'slightest', 'primarty':'primary','brlieve':'believe', 'fondels':'fondles',
                  'loundly':'loudly', 'bootythongs':'booty thongs', 'understamding':'understanding', 'degenarate':'degenerate',
                  'narsistic':'narcistic', 'innerskin':'inner skin','spectulated':'speculated', 'hippocratical':'Hippocratical',
                  'itstead':'instead', 'parralels':'parallels', 'sloppers':'slippers'
                  }

In [None]:
%%time 
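# Define functions to remove extra spaces and to clean numbers, rare words and misspellings
def remove_space(text):
    """
    Replace special space characters, strip leading and trailing whitespace
    and collapse repeated whitespace into a single space
    """
    for space in spaces:
        text = text.replace(space, ' ')
    text = text.strip()
    text = re.sub(r'\s+', ' ', text)
    return text

def clean_number(text):
    # Separate digits from a directly following letter, e.g. '5kg' -> '5 kg'
    text = re.sub(r'(\d+)([a-zA-Z])', r'\g<1> \g<2>', text)
    # Re-attach ordinal suffixes that were split off, e.g. '5 th ' -> '5th '
    text = re.sub(r'(\d+) (th|st|nd|rd) ', r'\g<1>\g<2> ', text)
    # Remove a comma between two digit groups, e.g. '1,000' -> '1000'
    text = re.sub(r'(\d+),(\d+)', r'\g<1>\g<2>', text)
    # Replace an 'e' between two digit groups with a space
    text = re.sub(r'(\d+)(e)(\d+)', r'\g<1> \g<3>', text)

    return text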

def clean_rare_words(text):
    for rare_word in rare_words_mapping:
        if rare_word in text:
            text = text.replace(rare_word, rare_words_mapping[rare_word])

    return text

def clean_misspell(text):
    for bad_word in mispell_dict:
        if bad_word in text:
            text = text.replace(bad_word, mispell_dict[bad_word])
    return text

# Build the list of all punctuation marks, then define a function that adds a space before and after punctuation and a function that fixes badly cased words
import string
regular_punct = list(string.punctuation)
all_punct = list(set(regular_punct + extra_punct))
# Do not add spaces around '-' and '.'
all_punct.remove('-')
all_punct.remove('.')

def spacing_punctuation(text):
    """
    add space before and after punctuation and symbols
    """
    for punc in all_punct:
        if punc in text:
            text = text.replace(punc, f' {punc} ')
    return text

def clean_bad_case_words(text):
    for bad_word in bad_case_words:
        if bad_word in text:
            text = text.replace(bad_word, bad_case_words[bad_word])
    return text

# Apply all above defined functions to preprocess the text
def preprocess(text):
    text = remove_space(text)
    text = clean_number(text)
    text = clean_rare_words(text)
    text = clean_misspell(text)
    text = spacing_punctuation(text)
    text = clean_bad_case_words(text)
    return text

In [None]:
# Delete the old vocab and oov so that they can be rebuilt after preprocessing
del vocab, oov
# Run garbage collection to free memory
gc.collect()

In [None]:
%%time
# Apply the preprocess function to comment_text
df['comment_text'] = df['comment_text'].apply(preprocess)
# Run garbage collection to free memory
gc.collect()

In [None]:
%%time
# Rebuild the vocab and oov
vocab = build_vocab(df['comment_text'])
oov = check_coverage(vocab, embeddings_index)

## 5. Modeling

In this chapter, the model is built on the preprocessed text using the `Keras` library with its different layers. Before a `Keras` model is initiated, the data is split back into the original `train` and `test` data. Different sample sizes can be tried out in the code, depending on the available CPU or GPU. The text is then tokenized and an embedding matrix is created. Afterwards, the `Sequential()` model of `Keras` is set up with input, hidden and output layers. Sections 5.2 and 5.3 deal with training the model and testing it to validate the result before predicting and creating the `submission.csv`.

### 5.1 Model Definition with Keras

First, the combined dataframe is split back into the `train` and `test` data, and the target variable for the model is added to `train`, as seen in the next cell.

In [None]:
%%time 

# Select a downsampled dataset of 200,000 rows to commit faster
# train = df.iloc[:200000,:]
# test = df.iloc[200000:,:]

# Select a greater sample of 500000 rows if needed
# train = df.iloc[:500000,:]
# test = df.iloc[500000:,:]

# Select a greater sample of 700000 rows
train = df.iloc[:700000,:]
test = df.iloc[700000:,:]

# Select all of the data
# train = df.iloc[:1804874,:]
# test = df.iloc[1804874:,:]


# Delete df
del df
# Run garbage collection to free memory
gc.collect()

# Get the original data from the train dataset.
# nrows must match the number of rows selected for train above; otherwise the
# extra rows end up with a missing (NaN) target after the concatenation.
# train_original = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv", nrows=200000)
train_original = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv", nrows=700000)

# Concatenate train and train_original
train = pd.concat([train,train_original[['target']]],axis=1)

# Delete train_original
del train_original
# Run garbage collection to free memory
gc.collect()

In [None]:
# Get the .head() of train
train.head()

The head of the `train` data shows that the `target` variable still takes fractional values between 0 and 1. For modeling, it is converted into a binary flag in the next cell, where a value of 0.5 or higher counts as toxic.
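
The thresholding can be illustrated on a few made-up scores. This is only a minimal sketch in plain Python: the helper `to_binary_flags` and the values in `toy_scores` are hypothetical and not part of the notebook, which uses `np.where` on the `target` column instead.

```python
def to_binary_flags(scores, threshold=0.5):
    """Return True for every score at or above the threshold, else False."""
    return [score >= threshold for score in scores]


# Made-up toxicity scores for illustration only
toy_scores = [0.0, 0.2, 0.5, 0.83]
toy_flags = to_binary_flags(toy_scores)
print(toy_flags)
```

Note that a score of exactly 0.5 is flagged as toxic because the comparison is `>=`.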

In [None]:
# Convert the target to a binary flag
train['target'] = np.where(train['target'] >= 0.5, True, False)
train.head()

In [None]:
# Get the .tail() of train
train.tail()

Checking the head and tail of the `train` data again shows that `target` now takes the values True or False. In the next cell, the text is tokenized by first defining the variables X (`comment_text`) and Y (`target`) and then creating a tokenizer that is fitted on X.

In [None]:
%%time

# Define variables X as comment_text and Y as target from the train dataset
Y = train['target']
X = train['comment_text']

# Create a text tokenizer.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
sequence = tokenizer.texts_to_sequences(X)

The `Tokenizer` class has a method called `fit_on_texts` that creates the vocabulary index based on word frequency. Given a text such as "The cat sat on the mat", it builds a dictionary in which every word gets a unique integer value, with lower values for more frequent words, e.g. `word_index["the"] = 1` and `word_index["cat"] = 2`. Indexing starts at 1 because 0 is reserved for padding. Then, `texts_to_sequences` transforms each text into a sequence of integers by replacing each word with its corresponding value from the `word_index` dictionary.
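
The idea behind the two methods can be sketched in plain Python. This is a simplified illustration and not the Keras implementation: it only lowercases and splits on whitespace, and the names `build_word_index`, `toy_texts`, `toy_index` and `toy_sequences` are hypothetical.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer rank, 1 = most frequent (0 stays free for padding)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


toy_texts = ["The cat sat on the mat", "The dog sat"]
toy_index = build_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, toy_index)
print(toy_index)
print(toy_sequences)
```

Because "the" is the most frequent word in the toy texts, it receives index 1, and no word is ever mapped to 0.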

In [None]:
# Define the size of the vocab (+1 because index 0 is reserved for padding)
vocab_size = len(tokenizer.word_index) + 1

After tokenizing, all comments need to have the same length, which is handled in the next cell.

In [None]:
%%time 

# All comments must be truncated or padded to be the same length.
max_seq_length = 100

# Apply pad_sequences to transform the list of sequences (lists of integers) into a 2D NumPy array with one row per comment and max_seq_length columns
pad_seq = pad_sequences(sequence, maxlen=max_seq_length)
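
What `pad_sequences` does with its default settings (zeros are added in front, and overly long sequences lose their first tokens) can be illustrated with a small NumPy sketch. The helper `pad_or_truncate` is a hypothetical stand-in written for this illustration, not the Keras function itself.

```python
import numpy as np


def pad_or_truncate(sequences, maxlen):
    """Left-pad with zeros or keep only the last `maxlen` tokens of each sequence."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty comment stays all zeros
        trimmed = seq[-maxlen:]              # truncation drops tokens from the front
        out[row, -len(trimmed):] = trimmed   # padding zeros end up in front
    return out


toy_padded = pad_or_truncate([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=5)
print(toy_padded)
```

The short sequence is filled up to five entries, while the long one keeps only its last five tokens.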

In [None]:
# Dump memory collection
gc.collect()

As a last step to prepare for the `Keras` model, an embedding matrix is created in the next cell.

In [None]:
%%time
# Create an embedding matrix
num_words_in_embedding = 0
embedding_matrix = np.zeros((vocab_size,300))

for word, i in tokenizer.word_index.items():
    if word in embeddings_index.vocab:
        embedding_vector = embeddings_index[word]
        embedding_matrix[i] = embedding_vector        
        num_words_in_embedding += 1
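
The same loop can be tried on toy data to see how the matrix is laid out and how much of the vocabulary is covered by the pretrained vectors. This is a minimal sketch: `toy_word_index` and `toy_vectors` are hypothetical stand-ins for `tokenizer.word_index` and the loaded embeddings, and the dimension is reduced from 300 to 4.

```python
import numpy as np

# Hypothetical toy stand-ins for tokenizer.word_index and the pretrained vectors
toy_word_index = {"the": 1, "cat": 2, "zzzunknown": 3}
toy_vectors = {"the": np.ones(4), "cat": np.full(4, 0.5)}

embedding_dim = 4
# One extra row because index 0 is reserved for padding
toy_matrix = np.zeros((len(toy_word_index) + 1, embedding_dim))

found = 0
for word, i in toy_word_index.items():
    vector = toy_vectors.get(word)
    if vector is not None:
        toy_matrix[i] = vector
        found += 1

coverage = found / len(toy_word_index)
print(f"{found} of {len(toy_word_index)} words have a pretrained vector")
```

Words without a pretrained vector keep an all-zero row, so a low coverage value would hint at remaining preprocessing problems.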

In the following steps, the prepared dataset is used to initiate a `Sequential()` model of `Keras` and to add the embeddings and all required layers, before compiling the model and having a look at its summary.

In [None]:
%%time
# Initiate a Keras Sequential() model
model = Sequential()

In [None]:
%%time
# Add an embedding layer
model.add(Embedding(vocab_size, 300, input_length = 100, weights = [embedding_matrix],trainable = False))

Here we add the `CuDNNLSTM` layer, which is the base of the model used in this kernel. We choose `CuDNNLSTM()` over `LSTM()` since the competition allows the use of GPU processing, which makes training the model substantially faster.

LSTM stands for **Long Short-Term Memory**; it is a variant of Recurrent Neural Networks (RNN) that allows important information to be kept within the network for long periods of time. The motivation for using this variant in the Jigsaw competition is that in Natural Language Processing, anything larger than a trigram (a group of three consecutive written units, like letters, syllables or, in this case, words) is considered to have a long-term dependency, and sometimes the gap between the information needed for a prediction and the position where the prediction is made is too big to be handled by a simple RNN.

Here is the typical architecture of an LSTM node (borrowed from Nimesh Sinha's article on Towards Data Science, [link](https://towardsdatascience.com/understanding-lstm-and-its-quick-implementation-in-keras-for-sentiment-analysis-af410fd85b47)):

![LSTM Architecture](https://cdn-images-1.medium.com/max/800/1*Niu_c_FhGtLuHjrStkB_4Q.png)





The three main components (which are numbered) of an LSTM node are:

1. The forget gate, which takes the new input and the information from the previous node to decide which information should be removed or kept, according to the output of a sigmoid function.

2. The next step is to decide which of the new information should be written to the cell or ignored, according to the output of a sigmoid function. At the same time, a vector of candidate values is created from the new input by a tanh function. The product of these two outputs is added to the old memory.

3. Then, a sigmoid layer decides which parts of the current cell state are going to be carried over as output of the node. After this is done, the current cell state is passed through a tanh function, and the product of the sigmoid gate and this tanh vector goes on as the output of the node and as 'information memory'.
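
The three steps above can be sketched as a single time step in NumPy. This is a minimal illustration with random weights, not the CuDNN kernel used by `CuDNNLSTM`. The function name `lstm_step`, the weight layout and the gate order inside `z` are assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4*units, input_dim), U: (4*units, units), b: (4*units,)."""
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0 * units:1 * units])   # 1. forget gate: what to keep from the old memory
    i = sigmoid(z[1 * units:2 * units])   # 2. input gate: which new values to write
    g = np.tanh(z[2 * units:3 * units])   # 2. candidate values created from the new input
    o = sigmoid(z[3 * units:4 * units])   # 3. output gate: which parts of the state to expose
    c_t = f * c_prev + i * g              # updated cell state (memory)
    h_t = o * np.tanh(c_t)                # output of the node
    return h_t, c_t


rng = np.random.default_rng(0)
input_dim, units = 3, 2
W = rng.normal(size=(4 * units, input_dim))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):  # a toy sequence of 5 time steps
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h, c)
```

The cell state `c` is only changed by element-wise scaling and addition, which is what lets information survive over many time steps.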



In [None]:
# Add layers
model.add(Bidirectional(CuDNNLSTM(100,return_sequences=True)))
model.add(Convolution1D(64,7,padding='same'))
model.add(GlobalAveragePooling1D())

# Add hidden layers
model.add(Dense(128))
model.add(LeakyReLU())

# Add hidden layers
model.add(Dense(64,activation = 'relu'))

After adding all hidden layers, finally the output layer can be added. Here, a `Dense` layer is chosen with the `sigmoid` activation function as seen in the next cell.

The sigmoid activation function is chosen so that the output of the neural network lies in the range [0,1]. In this case, the output is the toxicity score, with 0 being the least toxic and 1 the most toxic type of comment.

In [None]:
# Add the output layer
model.add(Dense(1,activation = 'sigmoid'))

After adding all required input, hidden and output layers, the model can be compiled. For this model `adam` is chosen as the optimizer, `binary_crossentropy` as the loss and `accuracy` as the metric.

In [None]:
# Compile the model
model.compile(optimizer = 'adam',loss='binary_crossentropy',metrics = ['accuracy'])
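
To make the compile settings more tangible, the sigmoid output, the binary cross-entropy loss and the accuracy can be computed by hand for a few toy values. This is a minimal NumPy sketch: the `logits` and `labels` are made up, and the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import numpy as np


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))))


logits = np.array([-2.0, 0.5, 3.0])   # hypothetical raw outputs of the last Dense layer
labels = np.array([0.0, 1.0, 1.0])    # hypothetical binary targets

probs = sigmoid(logits)
loss = binary_crossentropy(labels, probs)
accuracy = float(np.mean((probs > 0.5) == (labels == 1)))
print(probs, loss, accuracy)
```

The loss rewards confident correct scores and penalizes confident wrong ones, while accuracy only checks on which side of 0.5 a score falls.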

The last step of the modeling is summarizing the model which is done in the next cell. The summary shows all different layers, their output and parameters.

In [None]:
# Summarize the model
model.summary()

### 5.2 Training

To train the model, the data first has to be split into a training part and a held-out validation part. The model is then fit to the training split and validated on the held-out split (named `x_test` and `y_test` in the code).

In [None]:
%%time
# Create train-test split for validation
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(pad_seq, Y, test_size=0.15, random_state=42)

In [None]:
# Delete X and Y
del (X,Y)
# Dump memory collection
gc.collect()

In [None]:
# Fit the model to the training split data and validate with the test split data
history = model.fit(x_train,y_train,epochs = 5,batch_size=512,validation_data=(x_test,y_test))
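
The hold-out split used above can be sketched in plain NumPy to show what happens to the rows. This is only an illustration of the idea and does not reproduce the exact shuffling of `train_test_split`. The helper `holdout_split` and the toy arrays are hypothetical.

```python
import numpy as np


def holdout_split(features, labels, test_size=0.15, seed=42):
    """Shuffle the row indices once and cut off a fraction as the held-out part."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_test = int(round(len(features) * test_size))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return features[train_idx], features[test_idx], labels[train_idx], labels[test_idx]


toy_x = np.arange(40).reshape(20, 2)   # 20 toy "padded sequences" of length 2
toy_y = np.arange(20) % 2              # alternating toy labels
xa, xb, ya, yb = holdout_split(toy_x, toy_y)
print(xa.shape, xb.shape)
```

Fixing the seed makes the split reproducible, which is the role `random_state=42` plays in the cell above.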

### 5.3 Testing

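Now, the test data of the competition is prepared for the model: the comments are tokenized with the tokenizer fitted on the training data and padded to the same length.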

In [None]:
# Create a variable X with comment_text of the test data
X = test['comment_text']

In [None]:
%%time
# Tokenize the test_sequence and apply the pad_sequences function
test_sequence = tokenizer.texts_to_sequences(X)
test_pad_seq = pad_sequences(test_sequence,maxlen = max_seq_length)

### 5.4 Prediction

After preparing the `test` data in 5.3, the toxicity scores for it can finally be predicted to prepare for the submission.

In [None]:
%%time 
# Predict for test
prediction = model.predict(test_pad_seq)

## 6. Submission

Based on the predicted `test` data, a submission dataframe is created in the next cell, followed by a check of the head of the submission to make sure this step has worked. Finally, the submission is written to a `submission.csv` that can then be submitted to the competition to determine the ranking on the leaderboard.

In [None]:
# Create a submission dataframe with the variables id and prediction
submission = test[['id']].copy()
submission['prediction'] = prediction

In [None]:
# Check the head of the submission
submission.head()

In [None]:
# Create submission.csv to be submitted to the competition
submission.to_csv('submission.csv', index=False)

## 7. Conclusion

This notebook has shown a possible approach to solve the Jigsaw Unintended Bias in Toxicity Classification Challenge on Kaggle. Keras was chosen as the most appropriate library to build the model, with the aim of reaching the best possible score on the public leaderboard. To summarize the main challenges of the problem-solving approach, the text preprocessing in particular is an important and intensive step to reduce the problem of unintended bias, while the actual modeling with Keras is quite straightforward code-wise, even though it is computationally very intensive.

## References

This notebook was inspired by different kernels, which are listed below:
- https://www.kaggle.com/tarunpaparaju/jigsaw-competition-eda-and-modeling
- https://www.kaggle.com/nz0722/simple-eda-text-preprocessing-jigsaw
- https://www.kaggle.com/theoviel/improve-your-score-with-text-preprocessing-v2
- https://www.kaggle.com/taindow/simple-cudnngru-python-keras
- https://www.kaggle.com/gpreda/jigsaw-eda
- https://www.kaggle.com/samarthsarin/ensemble-network-with-keras-and-embeddings

Other references are:
- Demidov, V. (2018). FastText crawl 300d 2M | Kaggle. Retrieved June 2, 2019, from https://www.kaggle.com/yekenot/fasttext-crawl-300d-2m
- Kaggle. (2019). Jigsaw Unintended Bias in Toxicity Classification | Kaggle. Retrieved May 25, 2019, from https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification
- Sinha, N. (2018, February 20). Understanding LSTM and its quick implementation in keras for sentiment analysis. Retrieved May 26, 2019, from https://towardsdatascience.com/understanding-lstm-and-its-quick-implementation-in-keras-for-sentiment-analysis-af410fd85b47
- Vu, D. (2018). Generating WordClouds in Python (article) - DataCamp. Retrieved May 25, 2019, from https://www.datacamp.com/community/tutorials/wordcloud-python