<a href="https://colab.research.google.com/github/GitHub-Bong/Toxic-Comment-Challenge/blob/master/0330_PretrainedEmbedding_Word2Vec%2CGlove.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reference
[Do Pretrained Embeddings Give You The Extra Edge?](https://www.kaggle.com/sbongo/do-pretrained-embeddings-give-you-the-extra-edge)
<br/>
[Pre-trained Word Embedding](https://wikidocs.net/33793)


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import sys, os, re, csv, codecs, numpy as np, pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
import gensim.models.keyedvectors as word2vec
import gc

# **Data Preprocessing**

03/30 Use Word2Vec, GloVe, FastText

In [None]:
train = pd.read_csv('/content/drive/Shareddrives/SOGANG Parrot/train.csv/train.csv')
test = pd.read_csv('/content/drive/Shareddrives/SOGANG Parrot/test.csv/test.csv')

**Check for nulls**

In [None]:
train.isnull().any()

id               False
comment_text     False
toxic            False
severe_toxic     False
obscene          False
threat           False
insult           False
identity_hate    False
dtype: bool

In [None]:
test.isnull().any()

id              False
comment_text    False
dtype: bool

In [None]:
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values # y.shape (159571, 6)
list_sentences_train = train["comment_text"] # (159571,)
list_sentences_test = test["comment_text"] # (153164,)

# **Tokenization**
Break each sentence into its unique words     
ex) "I love cats and love dogs" -> ["I","love","cats","and","dogs"]


# **Indexing**
Put the words in a dictionary-like structure and assign each one an index     
ex) {1:"I",2:"love",3:"cats",4:"and",5:"dogs"}

# **Index Representation**
Represent the sequence of words in each comment as a sequence of indices     
ex) ["I","love","cats","and","dogs"] -> [1,2,3,4,5]

In Keras, all of the above steps can be done in 4 lines of code.     
We have to *define the number of unique words* in our dictionary when tokenizing the sentences.
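To make the three steps concrete, here is a minimal plain-Python sketch of what they do to the example sentence. The helpers `build_index` and `to_sequence` are hypothetical names used only for illustration. Keras' `Tokenizer` additionally lowercases the text, strips punctuation, and ranks words by frequency rather than by first appearance, so its indices will differ.

```python
def build_index(sentences):
    """Assign each unique word an integer index, starting at 1 (first-seen order)."""
    word_index = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index


def to_sequence(sentence, word_index):
    """Replace every known word in the sentence with its index."""
    return [word_index[word] for word in sentence.split() if word in word_index]


sentences = ["I love cats and love dogs"]
word_index = build_index(sentences)
sequence = to_sequence(sentences[0], word_index)

print(word_index)
print(sequence)
```

Note that the repeated word "love" gets a single index but still appears twice in the index sequence.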

In [None]:
max_features = 20000
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
# list_tokenized_train[:1] = [[688,75,1,126,130, ,,, ]]
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
# list_tokenized_test[:1] = [[2665,655,8849,656, ,,, ]]

# tokenizer.word_counts = OrderedDict([('explanation', 1771),('why', 17818),('the', 496540),('edits', 9957), ,,, ])
# tokenizer.word_index = {'the': 1,'to': 2,'of': 3,'and': 4, ,,, }
# len(tokenizer.word_index) 210337

**We may have a problem**     
Comments differ in length, ex)        
Comment #1: [8,9,3,7,3,6,3,6,3,6,2,3,4,9]      
Comment #2: [1,2]      
<br/>        

The model has to be fed data with a __consistent length (a fixed number of features)__.       
<br/>
<br/>
     
## We have to use "padding"
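As a rough illustration of what padding does to the two example comments, the sketch below left-pads short sequences with zeros and keeps only the last items of long ones, which mirrors the default `padding='pre'` and `truncating='pre'` behaviour of `pad_sequences`. The helper name `pad_pre` and the tiny `maxlen=5` are assumptions made for readability.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    trimmed = sequence[-maxlen:]                  # drop items from the front if too long
    padding = [value] * (maxlen - len(trimmed))   # zeros go in front if too short
    return padding + trimmed


comment_1 = [8, 9, 3, 7, 3, 6, 3, 6, 3, 6, 2, 3, 4, 9]
comment_2 = [1, 2]

padded = [pad_pre(comment, maxlen=5) for comment in (comment_1, comment_2)]
print(padded)
```

Both rows now have the same length, so they can be stacked into one array.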



In [None]:
maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen) # (159571, 200)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen) # (153164, 200)

# Pretrained
There are quite a __few GloVe embeddings__ in Kaggle datasets     
-> Use the one that was trained on __Twitter text__
<br/><br/>
Use the __Word2Vec embeddings__ that were trained on the __Google News text corpus__ (the "negative300" in the file name refers to negative sampling and 300-dimensional vectors)

# GloVe

### Input Layer

In [None]:
inp = Input(shape=(maxlen, )) #maxlen=200

**Pass it to the Embedding layer**
<br/>
<br/>
"__trainable = False__" 
<br/>
tells Keras not to update the pretrained embedding weights during training

### Embedding

In [None]:
embedding_dict = dict()
glove_path = '/content/drive/Shareddrives/SOGANG Parrot/glove.twitter.27B.25d.txt/glove.twitter.27B.25d.txt'

# Each line holds a word followed by its 25 vector components
with open(glove_path, encoding="utf8") as f:
    for line in f:
        word_vector = line.split()
        word = word_vector[0]
        word_vector_arr = np.asarray(word_vector[1:], dtype='float32') # 25 values
        embedding_dict[word] = word_vector_arr
print('There are %s embedding vectors.' % len(embedding_dict))

There are 1193515 embedding vectors.
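The parsing step can be tried without the large download. The sketch below reads two made-up lines in the same text layout (a word followed by its vector components) from an in-memory file. The words, the 3-d vectors, and the name `fake_glove_file` are all hypothetical, and plain Python floats stand in for NumPy arrays.

```python
import io

# Two made-up lines in GloVe's text format: a word, then its vector components
fake_glove_file = io.StringIO("cat 0.1 -0.2 0.3\ndog 0.4 0.5 -0.6\n")

vectors = {}
for line in fake_glove_file:
    parts = line.split()
    # parts[0] is the word, the remaining fields are the coordinates
    vectors[parts[0]] = [float(value) for value in parts[1:]]

print(len(vectors), vectors["dog"])
```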


In [None]:
print(embedding_dict['respectable'])
print(len(embedding_dict['respectable']))

[-0.49053  -0.43373  -1.4828    0.07678   0.77266   1.0036   -0.16272
 -1.2494   -0.13092  -1.03     -1.0419    0.35045  -1.9792    0.53035
 -0.070342  0.4989    0.85868  -0.080579  0.13071   0.40617   0.68475
  0.7763   -0.98371  -0.89656  -1.1979  ]
25


In [None]:
print(type(word_vector))
print(len(word_vector))

<class 'list'>
26


In [None]:
embedding_matrix = np.zeros((len(tokenizer.word_index)+1, 25)) # +1 because Keras reserves index 0 for padding
np.shape(embedding_matrix)

(210338, 25)

In [None]:
for word, i in tokenizer.word_index.items(): 
    temp = embedding_dict.get(word) 
    if temp is not None:
        embedding_matrix[i] = temp 

In [None]:
embedding_matrix.shape

(210338, 25)
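The fill loop can be pictured on a toy vocabulary. In this sketch, which uses a hypothetical 3-word `word_index` and made-up 2-d vectors, row `i` of the matrix receives the pretrained vector of the word whose index is `i`, and a word without a pretrained vector keeps its row of zeros.

```python
word_index = {"the": 1, "cat": 2, "zzzz": 3}           # hypothetical tokenizer output
pretrained = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}     # hypothetical 2-d vectors

dim = 2
# One row per word index, plus row 0, which no word uses
matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]

for word, i in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:
        matrix[i] = list(vector)    # row number equals the word's index

for row_number, row in enumerate(matrix):
    print(row_number, row)
```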

In [None]:
embedding_matrix.shape[1]

25

In [None]:
# Keep row 0 instead of deleting it: Keras reserves index 0 for padding, and the
# Embedding layer looks up row i for word index i, so removing the first row
# would shift every word onto its neighbour's vector.
embedding_matrix.shape

(210338, 25)

### **Save embedding_matrix**


In [None]:
np.save('/content/drive/Shareddrives/SOGANG Parrot/pretrained-embed-Glove.npy',embedding_matrix)

In [None]:
embedding_matrix = np.load('/content/drive/Shareddrives/SOGANG Parrot/pretrained-embed-Glove.npy')
embedding_matrix.shape

(210338, 25)

In [None]:
# input_dim is taken from the matrix so it always matches the number of rows
x = Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
              weights=[embedding_matrix],trainable=False)(inp)

**Output of the Embedding layer is a <br/> 3-D tensor of shape (None, 200, 25)**

<br/>= one array per sentence
<br/>
For each of the 200 word positions, there is an array of 25 coordinates 

**Feed this Tensor into the Bidirectional LSTM layer**

### Bidirectional LSTM Layer     
An LSTM takes in a tensor of shape [Batch Size, Time Steps, Number of Inputs]     
->     
so we pass the 3-D tensor of (None, 200, 25) into the LSTM layer     
<br/>

The output dimension of the bidirectional LSTM layer is 120, double the 60 units of a single LSTM, 
<br/>
because 60 dimensions are used for the forward pass and another 60 for the reverse pass.     
<br/>
<br/>
 
   
![image](https://i.imgur.com/jaKiP0S.png)
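The doubling also shows up in the layer's weight count. As a worked check, the sketch below counts the trainable weights of a standard LSTM, which has four weight blocks (input, forget and output gates plus the cell candidate), each with input weights, recurrent weights and a bias. The function names are hypothetical. The results agree with the `Param #` values that `model.summary()` reports for the bidirectional layer in this notebook.

```python
def lstm_params(input_dim, units):
    """Weights in one LSTM: 4 blocks, each with input, recurrent and bias weights."""
    return 4 * (units * (input_dim + units) + units)


def bidirectional_lstm_params(input_dim, units):
    """A bidirectional wrapper holds one forward and one backward LSTM."""
    return 2 * lstm_params(input_dim, units)


glove_params = bidirectional_lstm_params(25, 60)       # 25-d GloVe input
word2vec_params = bidirectional_lstm_params(300, 60)   # 300-d Word2Vec input
output_width = 2 * 60                                  # forward + backward units

print(glove_params, word2vec_params, output_width)     # 41280 173280 120
```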


### LSTM dropout and recurrent dropout

In [None]:
x = Bidirectional(LSTM(60, return_sequences=True,dropout=0.1,recurrent_dropout=0.1))(x)



### MaxPooling
We need to reduce the 3-D tensor to a 2-D one     
->     
Use a __Global Max Pooling layer__,      
which is traditionally used in CNNs to reduce the dimensionality of image data
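For a single comment, global max pooling keeps the largest value of each feature across all time steps, so a (time steps, features) array collapses to one vector of length `features`. A minimal plain-Python sketch, with a hypothetical helper name and made-up numbers:

```python
def global_max_pool_1d(sequence):
    """sequence: list of time steps, each a list of features -> one list of features."""
    return [max(feature_values) for feature_values in zip(*sequence)]


# One hypothetical comment with 3 time steps and 4 features per step
sequence = [
    [0.1, 0.9, -0.3, 0.0],
    [0.7, 0.2, -0.1, 0.4],
    [0.5, 0.3, -0.8, 0.6],
]

pooled = global_max_pool_1d(sequence)
print(pooled)
```

Applied to every comment in a batch, this turns (None, 200, 120) into (None, 120).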

In [None]:
x = GlobalMaxPool1D()(x)

### Dropout Layer
Pass it to a __Dropout layer__ 
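As a rough picture of what a dropout rate of 0.1 means during training, the sketch below zeroes each activation with probability `rate` and scales the survivors by `1 / (1 - rate)` so the expected sum stays the same. The function name and the seeded `random.Random` are assumptions for illustration. At prediction time dropout is switched off and values pass through unchanged.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale the rest by 1 / (1 - rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else value * keep_scale for value in values]


rng = random.Random(0)          # seeded only to make the sketch repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.1, rng=rng)
print(dropped)
```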

In [None]:
x = Dropout(0.1)(x)

### Dense Layer
Connect the output of the dropout layer to a __densely connected layer__      
and pass it through a __ReLU function__

In [None]:
x = Dense(50, activation="relu")(x)

### Dropout Layer
Feed the output into a Dropout layer again

In [None]:
x = Dropout(0.1)(x)

### Dense Layer
Finally, feed the output into a __Sigmoid layer__     
Reason:      
we are doing a __binary classification__ for each of the 6 labels
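The point of using a sigmoid per label is that each of the 6 outputs is an independent probability, so a comment can carry several labels at once and the values need not sum to 1. A small sketch with made-up pre-activation scores (`logits` is a hypothetical name) and an assumed decision threshold of 0.5:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
logits = [2.0, -3.0, 1.5, -4.0, 0.5, -2.5]    # hypothetical scores for one comment

probabilities = [sigmoid(z) for z in logits]
predicted = [name for name, p in zip(list_classes, probabilities) if p >= 0.5]

print(predicted)
```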

In [None]:
x = Dense(6, activation="sigmoid")(x)

### Define the inputs and outputs, and configure the learning process <br/><br/> Use the Adam optimizer to minimize the loss <br/><br/> Use binary_crossentropy as the loss function 
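For intuition about the loss, the sketch below computes binary cross-entropy for one comment by averaging the per-label log losses. It is a plain-Python approximation with made-up predictions, and the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[t*log(p) + (1-t)*log(1-p)] over the labels."""
    losses = []
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)     # keep p away from exactly 0 or 1
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


y_true = [1, 0, 1, 0, 0, 0]                       # toxic and obscene
confident = [0.9, 0.1, 0.8, 0.2, 0.1, 0.1]        # hypothetical good predictions
undecided = [0.5] * 6                             # a model that has learned nothing

loss_confident = binary_crossentropy(y_true, confident)
loss_undecided = binary_crossentropy(y_true, undecided)
print(loss_confident, loss_undecided)
```

A model that outputs 0.5 everywhere scores ln 2, about 0.693, and better predictions push the loss toward 0.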

In [None]:
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

## Each batch is a list of padded, indexed sentences (512 per batch here)<br/><br/>Split off 10% of the data as a validation set
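The effect of the split and the batch size on the number of training steps can be worked out by hand. The sketch below assumes the training share is rounded down and uses the 159571 training rows noted earlier. `steps_per_epoch` is a hypothetical helper, not a Keras function.

```python
import math


def steps_per_epoch(n_samples, validation_split, batch_size):
    """Return (training rows, batches per epoch) after holding out a validation share."""
    n_train = int(n_samples * (1.0 - validation_split))
    return n_train, math.ceil(n_train / batch_size)


n_train, steps_512 = steps_per_epoch(159571, 0.1, 512)
_, steps_32 = steps_per_epoch(159571, 0.1, 32)

print(n_train, steps_512, steps_32)    # 143613 281 4488
```

With a batch size of 32 this gives 4488 steps, the same total that the Keras progress bar shows for the run with `batch_size = 32`.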

In [None]:
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 200)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 200, 25)           5258425   
_________________________________________________________________
bidirectional (Bidirectional (None, 200, 120)          41280     
_________________________________________________________________
global_max_pooling1d (Global (None, 120)               0         
_________________________________________________________________
dropout (Dropout)            (None, 120)               0         
_________________________________________________________________
dense (Dense)                (None, 50)                6050      
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0     

In [None]:
from keras.callbacks import EarlyStopping, ModelCheckpoint
#checkpoint
checkpoint = ModelCheckpoint(filepath='/content/drive/Shareddrives/SOGANG Parrot/pretrained-embed-Glove.hdf5', monitor='val_loss', verbose=1, save_best_only=True)

batch_size = 512
epochs = 2
hist = model.fit(X_t,y, batch_size=batch_size, epochs=epochs, callbacks=[checkpoint], validation_split=0.1)

In [None]:
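from keras.models import load_model
# Reload the best checkpoint (lowest val_loss) before predicting
model = load_model('/content/drive/Shareddrives/SOGANG Parrot/pretrained-embed-Glove.hdf5')
y_test = model.predict(X_te)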

In [None]:
sample_submission = pd.read_csv("/content/drive/Shareddrives/SOGANG Parrot/sample_submission.csv/sample_submission.csv")

sample_submission[list_classes] = y_test

sample_submission.to_csv("/content/drive/Shareddrives/SOGANG Parrot/baseline/20210331-pretrained-embed-Glove.csv", index=False)

----------------------

---------------


# Word2Vec

In [None]:
word2vecDict = word2vec.KeyedVectors.load_word2vec_format("/content/drive/Shareddrives/SOGANG Parrot/GoogleNews-vectors-negative300.bin/GoogleNews-vectors-negative300.bin", binary=True)
print(word2vecDict.vectors.shape)

(3000000, 300)


In [None]:
embedding_matrix = np.zeros((len(tokenizer.word_index)+1, 300)) # +1 because Keras reserves index 0 for padding
np.shape(embedding_matrix)

(210338, 300)

In [None]:
def get_vector(word):
    if word in word2vecDict:
        return word2vecDict[word]
    else:
        return None

In [None]:
for word, i in tokenizer.word_index.items(): 
    temp = get_vector(word) 
    if temp is not None: 
        embedding_matrix[i] = temp 

In [None]:
# Keep row 0 instead of deleting it: Keras reserves index 0 for padding, and the
# Embedding layer looks up row i for word index i, so removing the first row
# would shift every word onto its neighbour's vector.
embedding_matrix.shape

(210338, 300)

In [None]:
np.save('/content/drive/Shareddrives/SOGANG Parrot/pretrained-embed-Word2Vec.npy',embedding_matrix)

In [None]:
embedding_matrix = np.load('/content/drive/Shareddrives/SOGANG Parrot/pretrained-embed-Word2Vec.npy')
embedding_matrix.shape

(210338, 300)

In [None]:
print(word2vecDict['nice'])
print(word2vecDict['nice'].shape)

[ 0.15820312  0.10595703 -0.18945312  0.38671875  0.08349609 -0.26757812
  0.08349609  0.11328125 -0.10400391  0.17871094 -0.12353516 -0.22265625
 -0.01806641 -0.25390625  0.13183594  0.0859375   0.16113281  0.11083984
 -0.11083984 -0.0859375   0.0267334   0.34570312  0.15136719 -0.00415039
  0.10498047  0.04907227 -0.06982422  0.08642578  0.03198242 -0.02844238
 -0.15722656  0.11865234  0.36132812  0.00173187  0.05297852 -0.234375
  0.11767578  0.08642578 -0.01123047  0.25976562  0.28515625 -0.11669922
  0.38476562  0.07275391  0.01147461  0.03466797  0.18164062 -0.03955078
  0.04199219  0.01013184 -0.06054688  0.09765625  0.06689453  0.14648438
 -0.12011719  0.08447266 -0.06152344  0.06347656  0.3046875  -0.35546875
 -0.2890625   0.19628906 -0.33203125 -0.07128906  0.12792969  0.09619141
 -0.12158203 -0.08691406 -0.12890625  0.27734375  0.265625    0.1796875
  0.12695312  0.06298828 -0.34375    -0.05908203  0.0456543   0.171875
  0.08935547  0.14648438 -0.04638672 -0.00842285 -0.0279

In [None]:
print('Integer index of the word "nice":', tokenizer.word_index['nice'])

Integer index of the word "nice": 547


### Model

In [None]:
inp = Input(shape=(maxlen, )) #maxlen=200
# input_dim is taken from the matrix so it always matches the number of rows
x = Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
              weights=[embedding_matrix],trainable=False)(inp)
x = Bidirectional(LSTM(60, return_sequences=True,dropout=0.1,recurrent_dropout=0.1))(x)              
x = GlobalMaxPool1D()(x)
x = Dropout(0.1)(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)

In [None]:
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

In [None]:
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 200)]             0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 200, 300)          63101100  
_________________________________________________________________
bidirectional (Bidirectional (None, 200, 120)          173280    
_________________________________________________________________
global_max_pooling1d (Global (None, 120)               0         
_________________________________________________________________
dropout (Dropout)            (None, 120)               0         
_________________________________________________________________
dense (Dense)                (None, 50)                6050      
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0     

In [None]:
batch_size = 32
epochs = 2
hist = model.fit(X_t,y, batch_size=batch_size, epochs=epochs, validation_split=0.1)

Epoch 1/2
   7/4488 [..............................] - ETA: 1:02:59 - loss: 0.6866 - accuracy: 0.5058

KeyboardInterrupt: ignored