<h1>Task 8: Word Embedding</h1>

<h4>This notebook compares several word-embedding methods on a simple sentiment-analysis task, using <a href="https://www.kaggle.com/mksaad/arabic-sentiment-twitter-corpus">a small Arabic Twitter dataset</a>.</h4>

<h4>Table of Contents:</h4>
<ol>
    <li>Load Dataset</li>
    <li>Normalize Dataset</li>
    <li>Tokenize Dataset</li>
    <li>Word Embedding</li>
    <li>Train RNN model</li>
    <li>Evaluate model</li>
</ol>
<h4>Embedding Methods:</h4>
<ol>
    <li>Keras Embedding Layer (trained from scratch)</li>
    <li>Keras Word2Vec implementation (trained from scratch)</li>
    <li>Gensim library's Word2Vec implementation (trained from scratch)</li>
    <li>Gensim library's GloVe implementation (trained from scratch)</li>
    <li>Gensim library's FastText implementation (trained from scratch)</li>
    <li>AraVec pretrained embeddings</li>
    <li>Arabic-Chapter pretrained embeddings</li>
    <li>BERT Arabic pretrained model</li>
</ol>

<h1>Load Dataset</h1>

In [1]:
import pandas as pd
train_pos = pd.read_csv("data/train_Arabic_tweets_positive_20190413.tsv", sep='\t', names=["label", "tweet"])
train_neg = pd.read_csv("data/train_Arabic_tweets_negative_20190413.tsv", sep='\t', names=["label", "tweet"])
test_pos = pd.read_csv("data/test_Arabic_tweets_positive_20190413.tsv", sep='\t', names=["label", "tweet"])
test_neg = pd.read_csv("data/test_Arabic_tweets_negative_20190413.tsv", sep='\t', names=["label", "tweet"])
train = pd.concat([train_pos, train_neg])#.sample(frac=1, random_state=0)
test = pd.concat([test_pos, test_neg])

In [2]:
train

Unnamed: 0,label,tweet
0,pos,نحن الذين يتحول كل ما نود أن نقوله إلى دعاء لل...
1,pos,وفي النهاية لن يبقىٰ معك آحدإلا من رأىٰ الجمال...
2,pos,من الخير نفسه 💛
3,pos,#زلزل_الملعب_نصرنا_بيلعب كن عالي الهمه ولا ترض...
4,pos,الشيء الوحيد الذي وصلوا فيه للعالمية هو : المس...
...,...,...
22509,neg,كيف ترى أورانوس لو كان يقع مكان القمر ؟ 💙💙 كوك...
22510,neg,احسدك على الايم 💔
22511,neg,لأول مرة ما بنكون سوا 💔
22512,neg,بقله ليش يا واطي 🤔


In [3]:
import pyarabic.araby as araby  # provides the strip_* and normalize_* helpers used below

def normalize(text):
    """Remove diacritics and tatweel, then unify ligatures, teh marbuta, and alef variants."""
    text = araby.strip_harakat(text)
    text = araby.strip_tashkeel(text)
    text = araby.strip_small(text)
    text = araby.strip_tatweel(text)
    text = araby.strip_shadda(text)
    text = araby.strip_diacritics(text)
    text = araby.normalize_ligature(text)
    #text = araby.normalize_hamza(text)
    text = araby.normalize_teh(text)
    text = araby.normalize_alef(text)
    return text

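def strip_all(text):
    # Keep only spaces, digits, question marks, and the Arabic letters that remain after
    # normalization; everything else (emoji, punctuation, Latin letters) is dropped.
    allowed = {' ', '0', '1', '2', '3', '4', '5', '6',
       '7', '8', '9', '?', 
       '؟', 'ء', 'ؤ', 'ئ', 'ا', 'ب', 'ت', 'ث',
       'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ',
       'ع', 'غ', 'ف', 'ق', 'ك', 'ل', 'م', 'ن', 'ه', 'و', 'ي', '٠', '١',
       '٢', '٣', '٤', '٥', '٦', '٧', '٨', '٩'}
    return "".join(x for x in text if x in allowed)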
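The whitelist filter above also removes `#` and `_`, which is why a hashtag collapses into a single long token in the tokenized output further down. The following is a minimal, self-contained sketch of that effect. The names `ALLOWED` and `keep_allowed` are hypothetical, and `str.split` stands in for `araby.tokenize` as an assumption made only to keep the example free of dependencies.

```python
# Characters kept by the filter: space, digits, question marks, Arabic letters.
ALLOWED = frozenset(
    " 0123456789?؟"
    "ءؤئابتثجحخدذرزسشصضطظعغفقكلمنهوي"
    "٠١٢٣٤٥٦٧٨٩"
)

def keep_allowed(text):
    """Drop every character that is not in the whitelist."""
    return "".join(ch for ch in text if ch in ALLOWED)

# A hashtag followed by ordinary words and an emoji.
sample = "#زلزل_الملعب_نصرنا_بيلعب كن عالي الهمه 💛"
cleaned = keep_allowed(sample)

# Whitespace splitting is a stand-in for araby.tokenize (assumption).
tokens = cleaned.split()
print(tokens)
```

If hashtags should stay readable as separate words, one option is to replace `_` with a space before filtering; whether that helps the classifier is not tested here.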

In [4]:
import pyarabic.araby as araby
train.tweet = train.tweet.apply(normalize).apply(strip_all).apply(araby.tokenize)
test.tweet = test.tweet.apply(normalize).apply(strip_all).apply(araby.tokenize)

In [5]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(train.label)
train.label = le.transform(train.label)
test.label = le.transform(test.label)
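`LabelEncoder` assigns integer codes in sorted order of the class names, so `neg` becomes 0 and `pos` becomes 1, which matches the `label` column in the next output. The sketch below reproduces that mapping in plain Python; it illustrates the behaviour and is not scikit-learn's implementation.

```python
# Toy label column with the two classes used in this notebook.
labels = ["pos", "neg", "pos", "neg"]

# Sorted unique classes determine the integer codes.
classes = sorted(set(labels))
to_index = {name: code for code, name in enumerate(classes)}

# Encode every label with its integer code.
encoded = [to_index[name] for name in labels]
print(to_index, encoded)
```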


In [6]:
train

Unnamed: 0,label,tweet
0,1,"[نحن, الذين, يتحول, كل, ما, نود, ان, نقوله, ال..."
1,1,"[وفي, النهايه, لن, يبقا, معك, احدالا, من, راا,..."
2,1,"[من, الخير, نفسه]"
3,1,"[زلزلالملعبنصرنابيلعب, كن, عالي, الهمه, ولا, ت..."
4,1,"[الشيء, الوحيد, الذي, وصلوا, فيه, للعالميه, هو..."
...,...,...
22509,0,"[كيف, ترا, اورانوس, لو, كان, يقع, مكان, القمر,..."
22510,0,"[احسدك, علا, الايم]"
22511,0,"[لاول, مره, ما, بنكون, سوا]"
22512,0,"[بقله, ليش, يا, واطي]"


In [7]:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(train.tweet.values, train.label.values, test_size=0.5,random_state=0)

In [8]:
from word_embedding import WordEmbedding
from utils import helper, preprocess
import numpy as np


In [9]:
# Word2Vec trained from scratch with Keras
embeddings = WordEmbedding(preprocess.tokenizer, vocab_size=13000, maxlen=150, embedding_vector=10, method="word2vec")
# text = embeddings.tokenize(text)  # skipped: the tweets are already tokenized
# Only the first 1,000 tweets are encoded because this step consumes a very large amount of memory.
words, label, unique_words, word_dict = embeddings.encode_w2v(train.tweet.values[:1000])
model = embeddings.train_w2v(words, label, epochs=5)

word_embeddings = model.get_weights()[0]

# embeddings = helper.get_embeddings(unique_words, word_dict, word_embeddings)
# helper.plot(word_dict, embeddings)
# helper.save_embeddings(embeddings) 


100%|██████████| 42394/42394 [00:00<00:00, 44189.57it/s]


Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 5473)]            0         
_________________________________________________________________
dense (Dense)                (None, 10)                54740     
_________________________________________________________________
dense_1 (Dense)              (None, 5473)              60203     
Total params: 114,943
Trainable params: 114,943
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/5

Epoch 00001: loss improved from inf to 0.11875, saving model to models/word_embeddings.h5
Epoch 2/5

Epoch 00002: loss improved from 0.11875 to 0.00305, saving model to models/word_embeddings.h5
Epoch 3/5

Epoch 00003: loss improved from 0.00305 to 0.00187, saving model to models/word_embeddings.h5
Epoch 4/5

Epoch 00004: loss improved from 0.00187 to 0.00168, saving
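The parameter counts in the model summary above follow directly from the layer shapes: a fully connected layer with n inputs and m units has n × m weights plus m biases. A quick check, using the input width (5473) and hidden size (10) read off the summary:

```python
vocab_size = 5473    # width of the input layer, from the model summary
embedding_dim = 10   # units in the hidden Dense layer

def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

hidden = dense_params(vocab_size, embedding_dim)  # input -> hidden
output = dense_params(embedding_dim, vocab_size)  # hidden -> output
total = hidden + output
print(hidden, output, total)
```

The kernel of the first Dense layer, a 5473 × 10 matrix, is what `model.get_weights()[0]` returns: one 10-dimensional vector per input position.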

In [107]:
import gensim


sentences = np.concatenate([train.tweet.values, test.tweet.values])
# The constructor is called without `sentences` so that vocabulary building and
# training each run exactly once, in the explicit calls below.
word_model = gensim.models.Word2Vec(vector_size=100, window=5, min_count=1, workers=4)
word_model.build_vocab(sentences)  # prepare the model vocabulary
word_model.train(sentences, total_examples=word_model.corpus_count, epochs=1)  # train word vectors

(606585, 652324)

In [109]:
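# syn1neg is the output-layer (negative-sampling) weight matrix. The learned input
# word vectors are available as word_model.wv.vectors, which has the same shape.
weights = word_model.syn1neg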

In [110]:
def word2idx(word):
    return word_model.wv.key_to_index[word]
def idx2word(idx):
    return word_model.wv.index_to_key[idx]


In [111]:
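# Encode each tweet as a fixed-length row of vocabulary indices (truncated or zero-padded to 150).
X_train_tmp = np.zeros([X_train.shape[0], 150], dtype=np.int32)
n_unknown, n_known = 0, 0  # counters for out-of-vocabulary and in-vocabulary tokens
for i, sentence in enumerate(X_train):
    for t, word in enumerate(sentence[:150]):
        if word in word_model.wv.key_to_index:
            X_train_tmp[i, t] = word2idx(word)
            n_known += 1
        else:
            # Unknown words fall back to index 0, which is also the padding value.
            X_train_tmp[i, t] = 0
            n_unknown += 1
X_train = X_train_tmp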

In [112]:
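# Same encoding for the validation split.
X_valid_tmp = np.zeros([X_valid.shape[0], 150], dtype=np.int32)
n_unknown, n_known = 0, 0  # counters for out-of-vocabulary and in-vocabulary tokens
for i, sentence in enumerate(X_valid):
    for t, word in enumerate(sentence[:150]):
        if word in word_model.wv.key_to_index:
            X_valid_tmp[i, t] = word2idx(word)
            n_known += 1
        else:
            # Unknown words fall back to index 0, which is also the padding value.
            X_valid_tmp[i, t] = 0
            n_unknown += 1
X_valid = X_valid_tmp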
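One caveat of the encoding above: index 0 is both the padding value and the index of a real vocabulary word (the first entry of `index_to_key`). A common way to avoid the collision is to shift every word index by one and reserve 0 for padding. The sketch below shows this with plain Python lists. The function `encode_with_padding` and the toy vocabulary are hypothetical, and unlike the cells above it skips unknown words instead of writing a 0 in their position.

```python
def encode_with_padding(sentences, key_to_index, maxlen):
    """Map tokens to (index + 1) and right-pad each row with 0 up to maxlen."""
    rows = []
    for sentence in sentences:
        # Shift by one so that 0 is never a real word index.
        ids = [key_to_index[w] + 1 for w in sentence[:maxlen] if w in key_to_index]
        ids += [0] * (maxlen - len(ids))
        rows.append(ids)
    return rows

# Toy vocabulary and sentences (illustrative only).
toy_vocab = {"a": 0, "b": 1}
rows = encode_with_padding([["a", "b", "zzz"], ["b"]], toy_vocab, maxlen=4)

# The embedding matrix then needs one extra all-zero row at position 0.
toy_vectors = [[0.1, 0.2], [0.3, 0.4]]
padded_vectors = [[0.0, 0.0]] + toy_vectors
print(rows, len(padded_vectors))
```

With this layout the `Embedding` layer's `input_dim` would be the vocabulary size plus one.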

In [113]:
vocab_size, emdedding_size = weights.shape

In [114]:
import tensorflow as tf
import numpy as np
import os
import time
import glob
from pyarabic import araby
from tensorflow.keras.layers import GRU, Embedding, Dense, Input, Dropout, Bidirectional, BatchNormalization, Flatten, Reshape
from tensorflow.keras.models import Sequential
# Import the preprocessing helpers from tensorflow.keras as well, to avoid mixing two Keras packages.
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.utils import shuffle  # random.shuffle is not imported: it would be shadowed by this name
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

In [115]:
X_train

array([[  145,     6,  5148, ...,     0,     0,     0],
       [   10,   274,     4, ...,     0,     0,     0],
       [57783,     1, 57811, ...,     0,     0,     0],
       ...,
       [ 4286, 36334,   136, ...,     0,     0,     0],
       [   18, 33922,  4851, ...,     0,     0,     0],
       [   89,    70,    19, ...,     0,     0,     0]], dtype=int32)

In [118]:
model = Sequential()
model.add(Input((150,)))
model.add(Embedding(input_dim=vocab_size, output_dim=emdedding_size, weights=[weights]))
model.add(Bidirectional(GRU(units = 32, return_sequences=True)))
model.add(Bidirectional(GRU(units = 32, return_sequences=False)))
model.add(Dense(16, activation = 'relu'))
model.add(Dropout(0.3))
model.add(Dense(2, activation = 'softmax'))
model.compile(optimizer = 'adam', loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])

In [119]:
callbacks = [tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=2, min_delta=0.0001, min_lr=0.0001)]
callbacks += [tf.keras.callbacks.ModelCheckpoint('gensim_w2v_scratch.h5', monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')]
model.fit(X_train, y_train, validation_data= (X_valid, y_valid), epochs = 15, batch_size= 128, shuffle = True, callbacks=callbacks)

Epoch 1/15

Epoch 00001: val_accuracy improved from -inf to 0.75576, saving model to full_verse_7.h5
Epoch 2/15

Epoch 00002: val_accuracy improved from 0.75576 to 0.75634, saving model to full_verse_7.h5
Epoch 3/15

Epoch 00003: val_accuracy did not improve from 0.75634
Epoch 4/15

Epoch 00004: val_accuracy improved from 0.75634 to 0.75974, saving model to full_verse_7.h5
Epoch 5/15

Epoch 00005: val_accuracy did not improve from 0.75974
Epoch 6/15

Epoch 00006: val_accuracy did not improve from 0.75974
Epoch 7/15

Epoch 00007: val_accuracy did not improve from 0.75974
Epoch 8/15

Epoch 00008: val_accuracy did not improve from 0.75974
Epoch 9/15

Epoch 00009: val_accuracy did not improve from 0.75974
Epoch 10/15

Epoch 00010: val_accuracy did not improve from 0.75974
Epoch 11/15

Epoch 00011: val_accuracy did not improve from 0.75974
Epoch 12/15

Epoch 00012: val_accuracy did not improve from 0.75974
Epoch 13/15

Epoch 00013: val_accuracy did not improve from 0.75974
Epoch 14/15

Epoc

<tensorflow.python.keras.callbacks.History at 0x7fae44406520>
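A note on the `ReduceLROnPlateau` settings used above: with Adam's default learning rate of 0.001, `factor=0.1` and `min_lr=0.0001`, the callback can lower the rate only once, since a single reduction already reaches the floor. The following simplified simulation illustrates this. The function `simulate_reduce_lr` is a hypothetical stand-in for the callback's core logic (no cooldown handling), and the validation losses are made-up values, not results from this notebook.

```python
def simulate_reduce_lr(val_losses, lr=1e-3, factor=0.1, patience=2,
                       min_delta=1e-4, min_lr=1e-4):
    """Return the learning rate in effect after each epoch's check."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best - min_delta:
            # Improvement: remember it and reset the patience counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience and lr > min_lr:
                # Plateau: shrink the rate, but never below min_lr.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Illustrative losses: two improving epochs followed by a long plateau.
fake_val_losses = [0.50, 0.49, 0.495, 0.50, 0.51, 0.52, 0.53]
lr_history = simulate_reduce_lr(fake_val_losses)
print(lr_history)
```

A smaller `min_lr` or a larger `factor` (for example 0.5) would leave room for more than one reduction over the 15 epochs.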