This notebook aims to augment existing pre-trained, general-purpose word embeddings (such as GloVe or Word2Vec) with the generated hate_speech_dataset. The goal is to leverage these existing embeddings to generate new embeddings for words in hate_speech_dataset that may not exist in the pre-trained vocabulary.

# Import Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from utilities.data_loaders import construct_embedding_dict, construct_embedding_matrix
from utilities.data_preprocessors import read_preprocess, series_to_1D_array
from models.model_arcs import load_lstm_model

%load_ext autoreload
%autoreload 2

ImportError: cannot import name 'construct_embedding_vars' from 'utilities.data_loaders' (d:\Projects\To Github\hate-speech-classifier\utilities\data_loaders.py)

In [None]:
# 1 for religious and 0 for non religious
df = pd.read_csv('./data/hate-speech-data-cleaned.csv', index_col=0)
df = read_preprocess(df)

In [None]:
all_words = pd.Series(series_to_1D_array(df['comment']))
all_unique_words_counts = all_words.value_counts()
all_unique_words = all_words.unique()

In [None]:
len(all_words)

In [None]:
len(all_unique_words)

In [None]:
all_unique_words_counts

In [None]:
# before joining the tokens back into strings, record the length of the longest tokenized comment
max_len_1 = len(max(df['comment'], key=len))

In [None]:
df['comment'] = df['comment'].apply(lambda comment: " ".join(comment))
df

In [None]:
df.loc[0, 'comment']

# Preparing data for training classifier
**A note on the code below**

fit_on_texts updates the tokenizer's internal vocabulary from a list of texts. It builds the vocabulary index from word frequency: given "The cat sat on the mat.", it creates a word -> index dictionary such that word_index["the"] = 1 and word_index["cat"] = 2, so every word gets a unique integer value. Index 0 is reserved for padding. A lower integer means a more frequent word, so the first few indices often belong to stop words because they appear so often.

texts_to_sequences transforms each text in texts into a sequence of integers: it takes each word in the text and replaces it with its corresponding integer value from the word_index dictionary. Nothing more, nothing less, and certainly no magic involved.
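
To make the two steps concrete, here is a minimal pure-Python sketch of the same idea. It is a simplified stand-in and not the Keras implementation: the helper names `build_word_index` and `to_sequences` are hypothetical, and the sketch only lowercases and splits on whitespace, whereas the real Tokenizer also strips punctuation by default. The `num_words` argument mimics the documented Keras behavior that only words with an index below `num_words` are kept when converting texts to sequences.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency: the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so equally frequent words keep first-seen order
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequences(texts, word_index, num_words=None):
    """Replace each known word by its index; drop words outside the cap."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is None:
                continue  # unseen word: skipped (no OOV token in this sketch)
            if num_words is not None and idx >= num_words:
                continue  # only indices below num_words are kept
            seq.append(idx)
        sequences.append(seq)
    return sequences


texts = ["the cat sat on the mat"]
word_index = build_word_index(texts)
print(word_index)
print(to_sequences(texts, word_index))
print(to_sequences(texts, word_index, num_words=5))
```

With `num_words=5` the word at index 5 ("mat") disappears from the sequence even though it is still present in `word_index`. This is one thing worth checking when the indices in the sequences do not match expectations.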

In [None]:
# train_sents, test_sents, train_labels, test_labels = train_test_split(df['comment'], df['label'], test_size=0.3, random_state=0)
sents = df['comment']
max_len_2 = 50

num_words_1 = df.shape[0]
num_words_2 = len(all_words)
num_words_3 = len(all_unique_words)

tokenizer = Tokenizer(num_words=num_words_3, split=' ')
tokenizer.fit_on_texts(sents)
# the bug is here, which is why the indices come out wrong

seqs = tokenizer.texts_to_sequences(sents)

# padding='post' places the 0s at the tail (end) of each sequence
# truncating='post' drops values from the end of any sequence longer than maxlen
seqs_padded = pad_sequences(seqs, maxlen=max_len_1, padding='post', truncating='post')
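
The effect of the two 'post' options in the cell above can be sketched in plain Python. This is only an illustration of the behavior, assuming integer sequences and a pad value of 0; the helper name `pad_post` is hypothetical and the function is not a replacement for `pad_sequences`.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence at maxlen from the end, then pad the end with value."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])               # truncating='post'
        kept += [value] * (maxlen - len(kept))  # padding='post'
        padded.append(kept)
    return padded


print(pad_post([[4, 7], [3, 1, 9, 2, 5]], maxlen=4))
# [[4, 7, 0, 0], [3, 1, 9, 2]]
```

The short sequence gains trailing zeros, while the long one loses its last element, so every row ends up with exactly `maxlen` entries.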

In [None]:
seqs

Here we see that 50 is indeed not enough as our max length, but the subsequent code will still use 50, and later 503, for experimentation. For now, 503 is an extremely large value, especially when applied to all sequences.
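
One way to weigh a candidate max length is to measure how many sequences it would truncate. The sketch below is an assumption about how such a check could look, not part of the original notebook: the helper name `truncated_share` is hypothetical and the toy lengths are made up for illustration, not taken from the dataset.

```python
def truncated_share(sequences, maxlen):
    """Fraction of sequences longer than maxlen (these would lose tokens)."""
    if not sequences:
        return 0.0
    return sum(len(seq) > maxlen for seq in sequences) / len(sequences)


# made-up sequence lengths, for illustration only
toy_seqs = [[1] * n for n in (3, 12, 48, 60, 503)]
for candidate in (50, 503):
    print(candidate, truncated_share(toy_seqs, candidate))
```

In the notebook, the same function could be applied to `seqs` to compare 50 against the longest observed length before committing to either.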

In [None]:
print(max_len_1, max_len_2)

In [None]:
word_to_index = tokenizer.word_index
index_to_word = tokenizer.index_word
print(len(word_to_index))

In [None]:
word_to_index

In [None]:
index_to_word

In [None]:
seqs[0]

In [None]:
# this is supposed to be 1301
print(word_to_index['complain'])

# this is supposed to be 3583
print(word_to_index['cleaning'])

In [None]:
seqs_padded

In [None]:
train_seqs, test_seqs, train_labels, test_labels = train_test_split(seqs_padded, df['label'], test_size=0.3, random_state=0)

train_seqs

In [None]:
train_seqs.shape

In [None]:
test_seqs

In [None]:
len(test_seqs)

# Loading the Big Guns
or the 1.9-million-word vocabulary and its 300-dimensional embeddings

In [None]:
# important variables

# vocabulary size, including OOV words; +1 accounts for index 0, which is reserved for padding
vocab_len = len(word_to_index) + 1
emb_dict, emb_vec_len = construct_embedding_dict('./embeddings/glove.42B.300d.txt', word_to_index)
emb_matrix = construct_embedding_matrix(word_to_index, emb_dict, emb_vec_len)
lstm_model = load_lstm_model((max_len_1,), vocab_len, emb_matrix)
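
The helpers `construct_embedding_dict` and `construct_embedding_matrix` come from the project's utilities module and their implementation is not shown in this notebook. The sketch below is therefore only a guess at the general technique, under stated assumptions: the file is in GloVe text format (one word followed by its space-separated values per line), row 0 is left as zeros for padding, and words missing from the pre-trained file keep an all-zero row. The names `load_embedding_dict` and `build_embedding_matrix` are hypothetical, and plain lists are used to keep the example dependency-free; the result could be wrapped with `np.asarray` before being passed to a model.

```python
import io


def load_embedding_dict(lines, word_index):
    """Parse GloVe-style lines ('word v1 v2 ...'), keeping only known words."""
    emb_dict, dim = {}, 0
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        dim = len(values)
        if word in word_index:
            emb_dict[word] = [float(v) for v in values]
    return emb_dict, dim


def build_embedding_matrix(word_index, emb_dict, dim):
    """Row i holds the vector of the word with index i; row 0 is padding."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vector = emb_dict.get(word)
        if vector is not None:
            matrix[idx] = vector
        # words missing from the pre-trained file keep an all-zero row
    return matrix


# tiny in-memory stand-in for a GloVe file (values are made up)
fake_glove = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\ndog 0.7 0.8 0.9\n")
toy_index = {"the": 1, "cat": 2, "zzyzx": 3}

toy_emb_dict, toy_dim = load_embedding_dict(fake_glove, toy_index)
toy_matrix = build_embedding_matrix(toy_index, toy_emb_dict, toy_dim)
print(len(toy_matrix), toy_dim)
print(toy_matrix[3])
```

Filtering by `word_index` while reading keeps memory use proportional to the dataset vocabulary rather than to the full pre-trained file. The all-zero row for "zzyzx" marks exactly the kind of out-of-vocabulary word this notebook wants to learn embeddings for.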

In [None]:
lstm_model.summary()