<a href="https://colab.research.google.com/github/9jam/w266-final-project/blob/main/CNN_word2vec_RU_main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## BASELINE: CNN with WORD2VEC embeddings



In [26]:
import nltk
from nltk.corpus import brown
from nltk.data import find
from sklearn.metrics import classification_report

import gensim

import numpy as np
import pandas as pd
from google.colab import drive
import seaborn as sns

import tensorflow as tf
from tensorflow.keras.layers import Embedding, Input, Dense, Lambda
from tensorflow.keras.models import Model


In [2]:
!pip install gensim



### 1. LOAD the dataset

The dataset contains synthetically perturbed sentences as positive examples (`target = 1`) and unmodified sentences as negative examples (`target = 0`).

In [3]:
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
meduza_bal = pd.read_pickle("/content/drive/MyDrive/meduza_bert_big_df.pkl") #("/content/drive/MyDrive/meduza_bert_30к_df.pkl")
meduza_bal

Unnamed: 0,text,target,orig_word,new_word,case,shift_case,adj
58048,Летом 2015 года губернатор Севастополя Сергей ...,0,,,,,
64189,« Motherla d Wi dows » « Я ухожу.bmp » « Time ...,0,,,,,
141375,Исход боя казался решенным.,0,,,,,
113185,"— Раньше не было, раньше все уступали друг дру...",1,другу,другом,Dat,Ins,0
252003,"В суде сообщили, что Искаков обвиняется в напа...",1,нападении,нападению,Loc,Dat,0
...,...,...,...,...,...,...,...
110268,12 июня в Москве прошел марш в поддержке корре...,1,поддержку,поддержке,Acc,Dat,0
259178,Инцидент произошел с истребитель-бомбардировщи...,1,истребителем-бомбардировщиком,истребитель-бомбардировщик,Ins,Acc,0
90686,"— Я считаю, что это был бы лучший флешмоб, кот...",0,,,,,
131932,« музей закрыт на неопределенное время.,1,Музей,музей,Nom,Acc,0


In [5]:
raw_data = np.array((meduza_bal.text)) 
labels = np.array((meduza_bal.target)) 

In [6]:
import re

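def preprocess_text(text):
    """Lowercase, normalise "ё", mask digit runs, and replace other symbols with spaces."""
    text = text.lower().replace("ё", "е")
    #text = re.sub('((www\.[^\s]+)|(https?://[^\s]+))','URL', text)
    #text = re.sub('@[^\s]+','USER', text)
    text = re.sub(r'[0-9]+', 'цфр', text)               # every digit run (including 0) becomes one placeholder token
    text = re.sub(r'[^a-zA-Zа-яА-Я0-9]+', ' ', text)    # anything that is not a letter or digit becomes a space
    #text = re.sub(' +',' ', text)
    return text.strip()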


data = [preprocess_text(t) for t in raw_data]

In [7]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=2)

### 2. LOAD the embeddings

In [8]:
from gensim.models import Word2Vec
# Load the trained model
w2v_model = Word2Vec.load('/content/drive/MyDrive/w266_project/models/word2vec/meduza_w2v_big.w2v')
DIM = w2v_model.vector_size

In [9]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

SENTENCE_LENGTH = 100
NUM = len(w2v_model.wv.vocab)  # gensim < 4.0 API; gensim >= 4.0 exposes the vocabulary as w2v_model.wv.key_to_index

def get_sequences(tokenizer, x):
    sequences = tokenizer.texts_to_sequences(x)
    return pad_sequences(sequences, maxlen=SENTENCE_LENGTH)

tokenizer = Tokenizer(num_words=NUM)
tokenizer.fit_on_texts(x_train)

x_train_seq = get_sequences(tokenizer, x_train)
x_test_seq = get_sequences(tokenizer, x_test)
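To make the padding step concrete, here is a minimal NumPy sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`). The helper `pad_pre` and the toy sequences are hypothetical and only illustrate the behaviour; the notebook itself uses the Keras function.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad and left-truncate integer sequences to a fixed length.

    Illustrative re-implementation of the Keras defaults
    (padding='pre', truncating='pre'); not the Keras code itself.
    """
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue                      # an empty sequence stays all padding
        trimmed = seq[-maxlen:]           # keep the LAST maxlen token ids
        out[row, -len(trimmed):] = trimmed
    return out

# Hypothetical token-id sequences: one shorter, one longer than maxlen
toy_sequences = [[5, 2], [7, 1, 4, 9, 3, 8]]
padded = pad_pre(toy_sequences, maxlen=4)
print(padded)
```

Short sentences are filled with zeros on the left, and long sentences lose their first tokens, so with `SENTENCE_LENGTH = 100` only the last 100 tokens of a long sentence reach the model.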

In [10]:
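# Embedding matrix initialization (row 0 stays zero: index 0 is the padding index)
embedding_matrix = np.zeros((NUM, DIM))
# Copy pretrained vectors for the most frequent tokenizer words (indices 1 .. NUM - 1);
# words that are missing from the word2vec vocabulary keep an all-zero row
for word, i in tokenizer.word_index.items():
    if i >= NUM:
        break
    if word in w2v_model.wv.vocab:
        embedding_matrix[i] = w2v_model.wv[word]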
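The lookup logic can be checked on a toy example. The sketch below uses hypothetical stand-ins (`toy_word_index` for `tokenizer.word_index`, `toy_vectors` for the word2vec vectors) and also computes how many tokenizer words actually received a pretrained vector, which is a quick sanity check worth running on the real matrix as well.

```python
import numpy as np

# Hypothetical stand-ins for tokenizer.word_index and the word2vec vectors
toy_word_index = {"court": 1, "march": 2, "flashmob": 3}   # 1-based, most frequent first
toy_vectors = {"court": np.array([0.1, 0.2]), "march": np.array([0.3, 0.4])}

num_rows, dim = 4, 2                 # row 0 is reserved for padding
matrix = np.zeros((num_rows, dim))
hits = 0
for word, idx in toy_word_index.items():
    if idx < num_rows and word in toy_vectors:
        matrix[idx] = toy_vectors[word]   # copy the pretrained vector
        hits += 1

# Share of tokenizer words that were found in the embedding vocabulary
coverage = hits / len(toy_word_index)
print(matrix)
print(f"coverage: {coverage:.2f}")
```

Rows that stay at zero (here the row for "flashmob") carry no information for the frozen embedding layer, so a low coverage value would point to a mismatch between the tokenizer and the embedding vocabulary.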

### 3. Build the model

In [18]:
MAX_SEQUENCE_LENGTH = 100  # Keras' embedding layer expects a specific input length. Padding is often needed here.

embedding_layer = Embedding(embedding_matrix.shape[0],
                            embedding_matrix.shape[1],
                            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

In [19]:
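try:
    del tf_model        # drop a model left over from a previous run of this cell
except NameError:       # nothing to delete on the first run
    pass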

Now let's build the model (again as a **Sequential Model**). This time we replace the concatenation with a 1D CNN layer followed by a max-pooling operation. Let's choose 100 filters with a kernel size of 5.

In [20]:
tf_model = tf.keras.Sequential()

tf_model.add(embedding_layer)                                        # embedding layer

tf_model.add(tf.keras.layers.Conv1D(
    filters=100, 
    kernel_size=5, 
    strides=1, 
    padding='same', 
    activation='relu', 
    use_bias=True,
    kernel_initializer='glorot_uniform', 
    bias_initializer='zeros')
            )    
tf_model.add(tf.keras.layers.Dropout(0.5))
tf_model.add(tf.keras.layers.GlobalMaxPooling1D())


tf_model.add(Dense(100, activation='relu'))                          # hidden layer
tf_model.add(Dense(1, activation='sigmoid'))                         # classification layer
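To see what the convolution and pooling layers do to the tensor shapes, here is a rough NumPy sketch for a single sentence. It assumes random weights and ignores dropout (which is inactive at inference time); the function `conv1d_same_relu` is a hypothetical illustration, not the Keras implementation.

```python
import numpy as np

def conv1d_same_relu(x, kernels, bias):
    """1D convolution with stride 1, 'same' zero padding and ReLU.

    x: (steps, channels); kernels: (width, channels, filters); bias: (filters,)
    """
    width = kernels.shape[0]
    left = (width - 1) // 2
    right = width - 1 - left
    padded = np.pad(x, ((left, right), (0, 0)))      # zero-pad the time axis only
    steps = x.shape[0]
    out = np.empty((steps, kernels.shape[2]))
    for t in range(steps):
        window = padded[t:t + width]                 # (width, channels)
        # each filter is a dot product over the whole window
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 200))                      # one sentence: 100 tokens, 200-dim embeddings
kernels = rng.normal(size=(5, 200, 100)) * 0.01      # kernel_size=5, 100 filters
bias = np.zeros(100)

feature_maps = conv1d_same_relu(x, kernels, bias)    # one activation per position and filter
pooled = feature_maps.max(axis=0)                    # global max pooling over positions
print(feature_maps.shape, pooled.shape)
```

Global max pooling keeps, for every filter, only its strongest response anywhere in the sentence, which is why the dense layers receive a fixed-size vector of 100 values regardless of where the pattern occurred.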

Let's look at dimensions and parameters of the model.

In [21]:
tf_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 100, 200)          22151400  
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 100, 100)          100100    
_________________________________________________________________
dropout_1 (Dropout)          (None, 100, 100)          0         
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 100)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 101       
Total params: 22,261,701
Trainable params: 110,301
Non-trainable params: 22,151,400
____________________________________
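The parameter counts in the summary can be reproduced by hand. In the sketch below the vocabulary size of 110,757 is not stated anywhere in the notebook; it is inferred from the embedding parameter count divided by the embedding dimension of 200.

```python
# Values taken from the model definition and summary above
vocab_size, emb_dim = 110757, 200     # vocab_size inferred: 22,151,400 / 200
kernel_size, filters = 5, 100
hidden_units = 100

embedding_params = vocab_size * emb_dim                      # frozen lookup table
conv_params = kernel_size * emb_dim * filters + filters      # weights + one bias per filter
hidden_params = filters * hidden_units + hidden_units        # Dense(100)
output_params = hidden_units * 1 + 1                         # Dense(1)

trainable = conv_params + hidden_params + output_params
total = embedding_params + trainable
print(embedding_params, conv_params, hidden_params, output_params)
print(trainable, total)
```

Almost all parameters sit in the frozen embedding matrix; only about 110 thousand weights are actually updated during training.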

Like last week, let's compile the model, i.e., define the optimizer, the loss function, and the metrics.

In [22]:
tf_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
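As a reminder of what the loss measures, here is a small NumPy sketch of binary cross-entropy on made-up probabilities. The function and the toy values are hypothetical; the clipping constant `eps` is an assumption added for numerical safety.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps clipping avoids log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

toy_labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]      # correct and far from 0.5
hesitant = [0.6, 0.4, 0.55, 0.45]     # correct but close to 0.5

loss_confident = binary_crossentropy(toy_labels, confident)
loss_hesitant = binary_crossentropy(toy_labels, hesitant)

# Both sets of predictions give the same accuracy at a 0.5 threshold
acc_confident = np.mean((np.array(confident) > 0.5) == np.array(toy_labels))
acc_hesitant = np.mean((np.array(hesitant) > 0.5) == np.array(toy_labels))
print(loss_confident, loss_hesitant, acc_confident, acc_hesitant)
```

Two models can therefore reach identical accuracy while reporting different losses, because the loss also rewards how far the predicted probability is from the decision threshold.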

In [23]:
# One verbose epoch to see the starting metrics, 150 silent epochs,
# then one more verbose epoch to see the final metrics
tf_model.fit(x_train_seq, y_train, validation_data=(x_test_seq, y_test), epochs=1)
tf_model.fit(x_train_seq, y_train, validation_data=(x_test_seq, y_test), epochs=150, verbose=0)
tf_model.fit(x_train_seq, y_train, validation_data=(x_test_seq, y_test), epochs=1)



<tensorflow.python.keras.callbacks.History at 0x7f531457a890>

Looks good! Actually better than last week, but don't read too much into that, given this very simple data set.

What are the test predictions now?

In [24]:
y_pred = tf_model.predict(x_test_seq)

In [43]:
y_pred_bin = (y_pred.ravel() > 0.5).astype(int)  # threshold the sigmoid outputs at 0.5

In [45]:
print(classification_report(y_test,y_pred_bin))

              precision    recall  f1-score   support

           0       0.66      0.81      0.73     65482
           1       0.70      0.50      0.58     55117

    accuracy                           0.67    120599
   macro avg       0.68      0.66      0.66    120599
weighted avg       0.68      0.67      0.66    120599
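The F1 and average rows of the report follow directly from the per-class precision, recall, and support. The sketch below recomputes them from the rounded values printed above, so the results are approximate; the dictionary `report` is just a hand-copied stand-in for the report output.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# class -> (precision, recall, support), copied from the report above
report = {0: (0.66, 0.81, 65482), 1: (0.70, 0.50, 55117)}

f1_by_class = {c: f1(p, r) for c, (p, r, _) in report.items()}
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

total_support = sum(n for _, _, n in report.values())
weighted_f1 = sum(f1_by_class[c] * n for c, (_, _, n) in report.items()) / total_support

print({c: round(v, 2) for c, v in f1_by_class.items()})
print(round(macro_f1, 2), round(weighted_f1, 2))
```

The low F1 for class 1 comes from its recall of 0.50: the model misses about half of the perturbed sentences, while its precision on the ones it does flag is comparable to class 0.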



Yay! But we obviously cheated here with the choice of sentences. Nevertheless, the idea should be clear.

**Questions for the class for joint live in-class exercises**:

1) Can you relate the value of the validation loss to the predictions for the test set?

2) What do you think happens if you change the 'trainable' flag in the embedding layer from 'False' to 'True'?   

3) What do you need to change in the model if you want more filters of the same kernel size?    

**Note/Question:** What would you need to change if you wanted to add CNN layers (at the same position) of different kernel sizes? That gets us to Keras Functional API... 