<a href="https://colab.research.google.com/github/HannaKi/Deep_Learning_in_LangTech_course/blob/master/bow_classifier_with_embeddings_simpler.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bag-of-words classifier with pretrained word embeddings

- During the lecture we will cover the concept of embeddings and the simple word2vec method
- If we have a trained word embeddings model, we can transfer that knowledge into a new task and model (transfer learning)
- Goal here: initialize the weights of the classifier's embedding layer with pretrained word embeddings
- The word embeddings are downloaded from: https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip

### Read data

In [1]:
%%script bash

# fastText word vectors from Facebook and the IMDB data

mkdir -p data
cd data
wget --quiet https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
unzip wiki-news-300d-1M.vec.zip
wget https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/raw/master/data/imdb_train.json
cd ..

Archive:  wiki-news-300d-1M.vec.zip
  inflating: wiki-news-300d-1M.vec   


--2020-04-03 13:29:46--  https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/raw/master/data/imdb_train.json
Resolving github.com (github.com)... 140.82.118.4
Connecting to github.com (github.com)|140.82.118.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/TurkuNLP/Deep_Learning_in_LangTech_course/master/data/imdb_train.json [following]
--2020-04-03 13:29:46--  https://raw.githubusercontent.com/TurkuNLP/Deep_Learning_in_LangTech_course/master/data/imdb_train.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 33944099 (32M) [text/plain]
Saving to: ‘imdb_train.json’


In [2]:
import json
import random
with open("data/imdb_train.json") as f:
    data=json.load(f)
random.shuffle(data) 
print(data[0])

# Gather the texts and the labels into two parallel lists
texts=[one_example["text"] for one_example in data]
labels=[one_example["class"] for one_example in data]
print(texts[:2])
print(labels[:2])

{'class': 'neg', 'text': "Seriously, I'm all for gooey romantic comedies and will get sucked into Miss Congeniality as easily as Goodfellas...but this movie? It doesn't make any sense!!!! And I'm not even talking about the willing suspension of disbelief kind of not making sense. Why does her family live in England? Or, at the very least, why doesn't she have a British accent? She's sure cozy with her dad and he's surprisingly forgiving of her not being around for the last two years. (On that subject, no one ever makes much of a deal about her being away for so long). And what was with the goofy outfits at the bachelorette party? I'm not even going to get into the fact that the escort she paid for falls in love with her--that could've been overcome by better movie-making. I'm just saying that the characters, the setting, and the plot aren't fleshed out enough to make an even somewhat cohesive story. Oh, and the worst part, in my opinion, is the filmmaker's consistent use of the most un

### Use gensim to read the embedding model

In [5]:
from gensim.models import KeyedVectors

#Only grab the 100K most common entries
vector_model = KeyedVectors.load_word2vec_format("data/wiki-news-300d-1M.vec", binary=False, limit=100000)



In [6]:
vector_model

<gensim.models.keyedvectors.Word2VecKeyedVectors at 0x7f7eb585bcc0>

## Working with the embeddings

* `vector_model.vocab` maps each word to an entry whose `.index` attribute is the word's row in the embedding matrix


In [0]:
# sort based on the index to make sure they are in the correct order
words=[k for k,v in sorted(vector_model.vocab.items(), key=lambda x:x[1].index)]
print("Words from embedding model:",len(words))
print("First 50 words:",words[:50])

Words from embedding model: 100000
First 50 words: [',', 'the', '.', 'and', 'of', 'to', 'in', 'a', '"', ':', ')', 'that', '(', 'is', 'for', 'on', '*', 'with', 'as', 'it', 'The', 'or', 'was', "'", "'s", 'by', 'from', 'at', 'I', 'this', 'you', '/', 'are', '=', 'not', '-', 'have', '?', 'be', 'which', ';', 'all', 'his', 'has', 'one', 'their', 'about', 'but', 'an', '|']


### Normalize the vectors

- Scale every vector to unit length: it is easier to learn on top of the vectors when their magnitudes do not vary much
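
As a minimal sketch of what unit-length (L2) normalization does to a single vector, assuming plain Python lists in place of the real gensim vectors. The helper name `l2_normalize` and the toy vector are hypothetical and only serve as illustration:

```python
import math

def l2_normalize(vector):
    """Scale a vector so that its Euclidean length becomes 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)  # an all-zero vector cannot be scaled, keep it as is
    return [x / norm for x in vector]

toy_vector = [3.0, 4.0]  # Euclidean length 5 before normalization
unit_vector = l2_normalize(toy_vector)
print(unit_vector)  # [0.6, 0.8]
```

The direction of the vector is unchanged; only its length is rescaled, which is why the normalized values printed by the notebook are a constant multiple of the original ones.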

In [0]:
print("Before normalization:",vector_model.get_vector("in")[:10])
vector_model.init_sims(replace=True)
print("After normalization:",vector_model.get_vector("in")[:10])

Before normalization: [-0.0234 -0.0268 -0.0838  0.0386 -0.0321  0.0628  0.0281 -0.0252  0.0269
 -0.0063]
After normalization: [-0.0163762  -0.01875564 -0.05864638  0.02701372 -0.02246478  0.04394979
  0.01966543 -0.0176359   0.01882563 -0.00440898]


### Text analyzer and vectorizer

- When we use an embedding layer (keras.layers.Embedding) the input data must be a sequence, not a bag-of-words vector
- This prepares us for working with sequences, but we must give up on our trusty `CountVectorizer`
- You can use CountVectorizer only as an analyzer without building the feature matrix
- We will have to build the vectorizer part later ourselves

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy

vectorizer=CountVectorizer(analyzer="word",lowercase=False)
analyzer=vectorizer.build_analyzer()
analyzer("I, have a dog") # the default token pattern only keeps tokens of 2+ characters, so "I" and "a" are dropped

['have', 'dog']
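
The output above can be reproduced with a plain regular expression. This is a rough sketch, assuming scikit-learn's documented default `token_pattern` and no lowercasing; the name `simple_analyzer` is hypothetical and it does not cover the other options of `CountVectorizer`:

```python
import re

# scikit-learn's default token pattern: runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def simple_analyzer(text):
    """Split text into tokens, dropping punctuation and one-character tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_analyzer("I, have a dog")
print(tokens)  # ['have', 'dog']
```

Because word order is preserved in the returned list, the same tokens can later be mapped to a sequence of indices.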

### Vectorizing as a sequence

* Each document is a row
* Words are turned into indices, their order is preserved
* We will have to introduce padding: documents are of different lengths, but we need a rectangular array
* Padding: fill shorter documents with zeros at the end until the length of the longest document is reached
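
The padding idea can be sketched in a few lines of plain Python. This is only an illustration under the assumption that 0 is reserved as the padding value; the helper `pad_post` is hypothetical and is not the Keras implementation:

```python
def pad_post(sequences, pad_value=0):
    """Append pad_value to each sequence until all have the longest length."""
    max_len = max((len(s) for s in sequences), default=0)
    return [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]

toy_sequences = [[37, 2370], [915, 58, 94, 3512]]
padded = pad_post(toy_sequences)
print(padded)  # [[37, 2370, 0, 0], [915, 58, 94, 3512]]
```

Every row now has the same length, so the result can be turned into a two-dimensional array.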

In [0]:
def vectorize_into_sequences(texts,analyzer,vector_model):
    result=[] #all docs, list of lists
    for document in texts:
        doc=[] #one doc
        for w in analyzer(document): #tokenize
            if w in vector_model.vocab: #is it in the vocab?
                doc.append(vector_model.vocab[w].index+1) #+1 to make space for padding
        result.append(doc)
    return result

seq=vectorize_into_sequences(texts,analyzer,vector_model)

print(vectorize_into_sequences(["I have a dog!", "The dog is used to produce a long sentence.", "Not so my cat."], analyzer, vector_model))

[[37, 2370], [21, 2370, 14, 154, 6, 1153, 388, 939], [915, 58, 94, 3512]]


* Above is the vectorized data before padding
* In the end, padding is quite easy:

In [0]:
from keras.preprocessing.sequence import pad_sequences
vectorized_data_padded=pad_sequences(seq, padding='post')
print("Shape:", vectorized_data_padded.shape) # number of documents, length of the longest document
print("First example:", vectorized_data_padded[0])

Shape: (25000, 2273)
First example: [ 132 1115   23 ...    0    0    0]


...and that is our data, nicely padded

### Labels into numerical vectors

- Same as in the original BOW classifier

In [0]:
from sklearn.preprocessing import LabelEncoder

label_encoder=LabelEncoder() #Turns class labels into integers
class_numbers=label_encoder.fit_transform(labels)
print("class_numbers shape=",class_numbers.shape)
print("class_numbers",class_numbers)
print("class labels",label_encoder.classes_)


class_numbers shape= (25000,)
class_numbers [1 0 1 ... 0 0 0]
class labels ['neg' 'pos']
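
What the label encoding amounts to can be sketched in plain Python, assuming that classes are numbered in sorted order, which matches the `['neg' 'pos']` output above. The variable names and toy labels are hypothetical:

```python
toy_labels = ["neg", "pos", "neg"]

# Sorted unique class names; the position in this list is the class number
classes = sorted(set(toy_labels))
class_to_number = {name: number for number, name in enumerate(classes)}

toy_numbers = [class_to_number[name] for name in toy_labels]
print(classes)      # ['neg', 'pos']
print(toy_numbers)  # [0, 1, 0]
```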


## Network

* The embedding matrix can be obtained straight from the vector_model
* We have a little problem, though: index 0 is reserved for padding, so every word index was shifted by one
* We therefore need to add a row of zeros at the top of the matrix, or else our embedding lookup will be off by one
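
The off-by-one issue is easy to see on a toy example. The sketch below assumes a made-up three-word vocabulary with two-dimensional vectors and uses plain lists instead of a numpy matrix; all names and values are hypothetical:

```python
toy_words = ["the", "dog", "cat"]  # row i of toy_vectors belongs to toy_words[i]
toy_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Word indices are shifted by one so that index 0 stays free for padding
word_to_index = {word: i + 1 for i, word in enumerate(toy_words)}

# Prepend a zero row so that shifted index i+1 still points at original row i
zero_row = [0.0] * len(toy_vectors[0])
embedding_rows = [zero_row] + toy_vectors

print(embedding_rows[word_to_index["dog"]])  # [0.3, 0.4]
print(embedding_rows[0])                     # [0.0, 0.0]
```

Without the extra row, looking up `"dog"` with its shifted index would return the vector of `"cat"`, and the last word would fall outside the matrix.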

In [0]:
# This is where the embedding matrix is
orig_embedding_matrix=vector_model.vectors
print("Orig shape:",orig_embedding_matrix.shape, orig_embedding_matrix.dtype)
zero_line=numpy.zeros((1,orig_embedding_matrix.shape[1]),dtype=orig_embedding_matrix.dtype)
#Stack the zeros on top of the embedding matrix
embedding_matrix=numpy.vstack((zero_line,orig_embedding_matrix))
print("New  shape:",embedding_matrix.shape)
print("First two rows:", embedding_matrix[:2,:])

Orig shape: (100000, 300) float32
New  shape: (100001, 300)
First two rows: [[ 0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00  0.0000e+00  0.0000e+0

### Sequential input

- Remember that documents differ in length, so the input data matrix would have an undefined number of columns
- We must make it a fixed size (the same for each example)
- Padding: append zeros until each example reaches that size
- You will hear more about this next week!

### Our network structure:

- Input layer, Embedding layer with pretrained weights, Average of embeddings, Non-linear activation, Classification layer
- The key point here is the embedding layer
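
The "average of embeddings" step can be illustrated without Keras. This is a minimal pure-Python sketch, assuming index 0 marks padding and should be left out of the average; `masked_average` and the toy values are hypothetical, and this is not how Keras implements the layer internally:

```python
def masked_average(indices, embedding_rows, pad_index=0):
    """Average the embedding rows of all non-padding positions in one document."""
    real_rows = [embedding_rows[i] for i in indices if i != pad_index]
    dim = len(embedding_rows[0])
    if not real_rows:
        return [0.0] * dim  # a document with only padding gets a zero vector
    return [sum(row[d] for row in real_rows) / len(real_rows) for d in range(dim)]

toy_embeddings = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0]]  # row 0 is the padding row
toy_document = [1, 2, 0, 0]                            # two words, then padding
average = masked_average(toy_document, toy_embeddings)
print(average)  # [2.0, 4.0]
```

If the two padding positions were included, the result would be `[1.0, 2.0]` instead, so long runs of padding would pull the document vector towards zero. This is the reason for using masking in the embedding layer.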

In [0]:
from keras.models import Model
from keras.layers import Input, Dense, Embedding, Activation, GlobalAveragePooling1D
from keras.optimizers import SGD, Adam
from keras.callbacks import ModelCheckpoint, EarlyStopping


example_count,sequence_len=vectorized_data_padded.shape
class_count=len(label_encoder.classes_)
vector_size=embedding_matrix.shape[1] # embedding dim ("hidden layer") must be the same as in the pretrained model
vocab_size=embedding_matrix.shape[0]

inp=Input(shape=(sequence_len,))
embeddings=Embedding(vocab_size, vector_size, mask_zero=True, weights=[embedding_matrix])(inp)
# Performance depends on whether the weights between the input layer and the hidden layer
# are taken as pretrained (transfer learning) or trained from scratch
average_embeddings=GlobalAveragePooling1D()(embeddings) # is masking-aware
hidden=Dense(50,activation="tanh")(average_embeddings)
outp=Dense(class_count, activation="softmax")(hidden)
model=Model(inputs=[inp], outputs=[outp])

optimizer=Adam(lr=0.001) # define the learning rate
model.compile(optimizer=optimizer,loss="sparse_categorical_crossentropy",metrics=['accuracy'])

print(model.summary())

# train
stop_cb=EarlyStopping(monitor='val_loss', patience=2, verbose=1, mode='auto', baseline=None, restore_best_weights=True)
# pass the callback to fit(); otherwise early stopping is never applied
hist=model.fit(vectorized_data_padded,class_numbers,batch_size=100,verbose=1,epochs=50,validation_split=0.1,callbacks=[stop_cb])

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_5 (InputLayer)         (None, 2273)              0         
_________________________________________________________________
embedding_5 (Embedding)      (None, 2273, 300)         30000300  
_________________________________________________________________
global_average_pooling1d_2 ( (None, 300)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 50)                15050     
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 102       
Total params: 30,015,452
Trainable params: 30,015,452
Non-trainable params: 0
_________________________________________________________________
None
Train on 22500 samples, validate on 2500 samples
Epoch 1/50
Epoch 2/50

KeyboardInterrupt: 

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
print("History:",hist.history["val_acc"])
print("Max accuracy:",numpy.max(hist.history["val_acc"]))
plt.ylim(0.85,1.0)
plt.plot(hist.history["val_acc"],label="Validation set accuracy")
plt.plot(hist.history["acc"],label="Training set accuracy")
plt.legend()
plt.show()