<a href="https://colab.research.google.com/github/HannaKi/Deep_Learning_in_LangTech_course/blob/master/bow_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NN

* We will cover the essentials of neural networks (NN) during the lecture
* If you were absent, try e.g. [this tutorial](https://www.cs.toronto.edu/~jlucas/teaching/csc411/lectures/tut5_handout.pdf)
* For everyone interested, [here](https://gombru.github.io/2018/05/23/cross_entropy_loss/) is some reading about loss functions
* And [here](https://github.com/Jaewan-Yun/optimizer-visualization) is the visualization of different optimizers I showed
* We also walked through several optimization techniques. If you missed the lecture on that, I suggest you watch these: [Momentum](https://www.youtube.com/watch?v=N18Km9YIIug), [RMSProp](https://www.youtube.com/watch?v=XhZahXzEuNo) and [Adam](https://www.youtube.com/watch?v=JXQT_vxqwIs). These videos are by Hinton and Ng, who are reliable sources. The slides are [here](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf). Further reading (also from a reliable source) can be found [here](https://ruder.io/optimizing-gradient-descent/index.html#adagrad). The [Adam paper](https://arxiv.org/pdf/1412.6980.pdf) is for those especially interested in the topic.

# A closer look at optimization

* Adam builds on SGD and combines momentum with per-weight adaptive learning rates (as in RMSProp)

* SGD:
  * w(t+1) = w(t) - epsilon * gradient
    * epsilon = learning rate
    * gradient = derivative of the loss with respect to the weights at the point w(t)
  * small learning rate: training is slow and the optimization might get stuck in a local minimum
  * too large a learning rate: expensive zig-zagging, takes a long time
  * a massively too large learning rate: the optimization diverges and cannot descend to the minimum
* Momentum:
  * as in physics, changes in motion are resisted (inertia): the optimizer cannot make abrupt changes of direction, which means less zig-zagging and faster convergence to the minimum
  * V(t+1) = a * V(t) + epsilon * gradient
    * this is a vector, with a direction and a speed; the weights are then updated as w(t+1) = w(t) - V(t+1)
  * a < 1 is a damping factor that lets the optimization eventually come to a stop
* If the gradient (derivative of the loss function) is big, it would be beneficial to have a small learning rate, and vice versa
* Problem: the same learning rate is used for all the weights!
  * Solution: adjust the learning rate separately for each weight, using a running average of the squared gradients over time for each weight:
  * G(t+1,wi) = b * G(t,wi) + (1-b) * gradient^2
  * b < 1
  * the step for weight wi is then scaled by 1/sqrt(G(t+1,wi)), so weights with large gradients take smaller steps
  * technical problem: these running averages need to be stored for every weight, so memory consumption for the weights is at least doubled; Adam, for example, keeps two extra values per weight!

* The learning rate is the most important hyperparameter to tune in Adam. If the loss is behaving weirdly, do hyperparameter optimization over the learning rate (on an exponential grid, e.g. 0.1, 0.01, 0.001!)
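
The three update rules discussed above can be tried out on a toy problem. The following is a minimal sketch, not a reference implementation: the elongated quadratic loss, the starting point, the step counts, and the constants `a`, `b` and `tiny` are all assumptions made for illustration, and the per-weight rule uses the usual RMSProp form of the running average of squared gradients.

```python
# Toy loss 0.5 * (w0^2 + 25 * w1^2): steep in one direction, flat in the other.
def loss(w):
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def gradient(w):
    return [w[0], 25.0 * w[1]]

def sgd(w, steps, epsilon):
    """Plain gradient descent: w(t+1) = w(t) - epsilon * gradient."""
    for _ in range(steps):
        g = gradient(w)
        w = [wi - epsilon * gi for wi, gi in zip(w, g)]
    return w

def momentum(w, steps, epsilon, a=0.9):
    """V(t+1) = a * V(t) + epsilon * gradient, then w(t+1) = w(t) - V(t+1)."""
    v = [0.0] * len(w)
    for _ in range(steps):
        g = gradient(w)
        v = [a * vi + epsilon * gi for vi, gi in zip(v, g)]
        w = [wi - vi for wi, vi in zip(w, v)]
    return w

def rmsprop(w, steps, epsilon, b=0.9, tiny=1e-8):
    """Per-weight step size from a running average G of squared gradients."""
    G = [0.0] * len(w)
    for _ in range(steps):
        g = gradient(w)
        G = [b * Gi + (1 - b) * gi ** 2 for Gi, gi in zip(G, g)]
        w = [wi - epsilon * gi / (Gi ** 0.5 + tiny)
             for wi, gi, Gi in zip(w, g, G)]
    return w

start = [10.0, 1.0]
print("start            ", loss(start))
print("sgd, eps=0.03    ", loss(sgd(start, 100, 0.03)))
print("sgd, eps=0.1     ", loss(sgd(start, 100, 0.1)))   # too large: diverges
print("momentum,eps=0.03", loss(momentum(start, 100, 0.03)))
print("rmsprop, eps=0.1 ", loss(rmsprop(start, 100, 0.1)))
```

On this toy loss, the same learning rate that works for plain SGD gets considerably closer to the minimum with momentum, while a learning rate that is too large for the steep direction makes plain SGD diverge.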

# Bag-of-words document classification

* BoW is the simplest way to do classification: Feature vector goes in, decision falls out.

* Feature vector: a vector with as many dimensions as we have unique features, and a non-zero value set for every feature present in our example
* Binary features: 1/0
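
To make the idea of a binary feature vector concrete, here is a small hand-rolled sketch. The three example documents, the whitespace tokenization, and the helper name `binary_features` are assumptions for illustration only; the notebook itself uses scikit-learn's `CountVectorizer` for this step.

```python
docs = ["a great movie", "a boring movie", "great acting great story"]

# One dimension per unique word; the sorted order fixes each word's index
unique_words = sorted({word for doc in docs for word in doc.split()})
vocabulary = {word: index for index, word in enumerate(unique_words)}

def binary_features(text, vocabulary):
    """Return a 0/1 vector: 1 if the word occurs in the text, 0 otherwise."""
    vector = [0] * len(vocabulary)
    for word in text.split():
        if word in vocabulary:      # words outside the vocabulary are ignored
            vector[vocabulary[word]] = 1
    return vector

print(vocabulary)
for doc in docs:
    print(binary_features(doc, vocabulary), doc)
```

Note that word order and word counts are lost: "great acting great story" sets the dimension for "great" to 1 only once.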

In the following we work with the IMDB data; have a look at [how to read it in](read_imdb.ipynb). Here we simply load the prepared data.

# IMDB data

* Movie review sentiment positive/negative
* Some 25,000 examples, 50:50 split
* Current state-of-the-art is about 95% accuracy


In [1]:
%tensorflow_version 1.x
# to run with old tf with which the code was made
# The default version of TensorFlow in Colab will switch to TensorFlow 2.x on the 27th of March, 2020.

TensorFlow 1.x selected.


In [4]:
!wget https://raw.githubusercontent.com/TurkuNLP/Deep_Learning_in_LangTech_course/master/data/imdb_train.json

--2020-04-06 13:26:07--  https://raw.githubusercontent.com/TurkuNLP/Deep_Learning_in_LangTech_course/master/data/imdb_train.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 33944099 (32M) [text/plain]
Saving to: ‘imdb_train.json.1’


2020-04-06 13:26:07 (165 MB/s) - ‘imdb_train.json.1’ saved [33944099/33944099]



In [5]:
import json
import random
with open("imdb_train.json") as f:
    data=json.load(f)
random.shuffle(data) #play it safe!
# Important, because the order of the examples affects the model! --> if class 1 comes first,
# the model is initially trained only on it, instead of evenly on all classes.
print(data[0]) #Every item is a dictionary with `text` and `class` keys, here's the first one:

{'class': 'neg', 'text': "......in a horror movie that is. Alright first off , lets start with Kate. Her main goals include getting laid by George Clooney, looking good and last but not least screwing everyone over. Gotta love her. She had about 3 amazingly good chances to finish off this sicko but ..... instead she ran. I mean she didn't wanna bring Guy out for 10 minutes and when she did it was too late. I mean the guy tried to rape her. I cant get into these movies where the main character is a sad idiot. I mean who honestly would have any sympathy for a guy who finishes off everyone she has meet in a night. The movie kept going on. And as a result lost all its credibility."}


To learn on this data, we will need a few steps:

* Build a data matrix with dimensionality (number of examples, number of possible features), and a value for each feature, 0/1 for binary features
* Build a class label matrix (number of examples, number of classes) with the correct labels for the examples, setting 1 for the correct class, and 0 for others

There is little point in doing all this ourselves, so we will use ready-made classes and functions, mostly from scikit-learn.

In [6]:
# We need to gather the texts and the labels into lists
texts=[one_example["text"] for one_example in data] # features
labels=[one_example["class"] for one_example in data]
print(texts[:2])
print(labels[:2])

["......in a horror movie that is. Alright first off , lets start with Kate. Her main goals include getting laid by George Clooney, looking good and last but not least screwing everyone over. Gotta love her. She had about 3 amazingly good chances to finish off this sicko but ..... instead she ran. I mean she didn't wanna bring Guy out for 10 minutes and when she did it was too late. I mean the guy tried to rape her. I cant get into these movies where the main character is a sad idiot. I mean who honestly would have any sympathy for a guy who finishes off everyone she has meet in a night. The movie kept going on. And as a result lost all its credibility.", 'Some moron who read or saw some reference to angels coming to Earth, decided to disregard what he\'d heard about the offspring of humans and angels being larger than normal humans. Reinventing them as mythical giants that were 40 feet tall, is beyond ridiculous. There was some historical references to housing and furniture in parts o

In [7]:
# prepare data for bag of words: words become features, which are stored in vectors
# The words of all the texts are collected, and each word gets a value in the vector
# telling whether it occurs in the text. See the CountVectorizer documentation.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer=CountVectorizer(max_features=100000,binary=True,ngram_range=(1,1))
feature_matrix=vectorizer.fit_transform(texts)
print("shape=",feature_matrix.shape)


shape= (25000, 74849)


In [0]:
# print(feature_matrix[1])

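Now we have the feature matrix done! Next we need the class labels to be predicted. In one-hot encoding this means:

* one row for every example
* one column for every possible class label
* exactly one column has 1 for every example, corresponding to the desired class

Below we only turn the labels into integer class numbers (the index of the column that would hold the 1): the `sparse_categorical_crossentropy` loss used later takes these numbers directly, so the one-hot matrix never has to be built explicitly.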
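
The relationship between class labels, integer class numbers, and one-hot rows can be sketched in a few lines of plain Python. The four toy labels are made up for illustration; sorting the class names is an assumption chosen to mirror the alphabetical `['neg' 'pos']` order that `LabelEncoder` prints further down.

```python
labels = ["neg", "pos", "pos", "neg"]          # toy labels, not the real data

classes = sorted(set(labels))                  # fixed class order
class_numbers = [classes.index(label) for label in labels]

# One row per example, one column per class, a single 1 in the correct column
one_hot = [[1 if column == number else 0 for column in range(len(classes))]
           for number in class_numbers]

print(classes)
print(class_numbers)
for row in one_hot:
    print(row)
```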

In [8]:
from sklearn.preprocessing import LabelEncoder

label_encoder=LabelEncoder() #Turns class labels into integers
class_numbers=label_encoder.fit_transform(labels)

print("class_numbers shape=",class_numbers.shape)
print("class labels",label_encoder.classes_) #this will let us translate back from indices to labels

class_numbers shape= (25000,)
class labels ['neg' 'pos']


* The data is ready, we need to build the network now
* Input
* Hidden Dense layer with some kind of non-linearity, and a suitable number of nodes
* Output Dense layer with the softmax activation (normalizes output to distribution) and as many nodes as there are classes
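
What the two Dense layers compute can be sketched without any library. This is only an illustration under stated assumptions: the tiny layer sizes, the hand-picked weights, and the helper names `dense`, `tanh_all` and `softmax` are hypothetical, whereas the real network learns its weights and has tens of thousands of inputs.

```python
import math

def softmax(scores):
    """Normalize arbitrary scores into a probability distribution."""
    largest = max(scores)                  # subtracted for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def tanh_all(values):
    """Apply the tanh non-linearity to every value."""
    return [math.tanh(v) for v in values]

def dense(inputs, weights, biases, activation):
    """One dense layer; weights[j] holds the incoming weights of node j."""
    sums = [sum(x * w for x, w in zip(inputs, node_weights)) + b
            for node_weights, b in zip(weights, biases)]
    return activation(sums)

x = [1, 0, 1]                              # binary features of one example
hidden = dense(x, [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, 0.1], tanh_all)
output = dense(hidden, [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0], softmax)

print("hidden =", hidden)
print("output =", output, "sum =", sum(output))
```

The output always sums to 1, which is what lets us read it as a distribution over the classes.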

In [0]:
# import keras
# from keras.models import Model
# from keras.layers import Input, Dense

# use the public tensorflow.keras API rather than the internal tensorflow.python package
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

example_count,feature_count=feature_matrix.shape
class_count=len(label_encoder.classes_)

inp=Input(shape=(feature_count,)) # tuple
hidden=Dense(200,activation="tanh")(inp) # tanh is used here. ReLU is the more popular choice?
# If no activation function is given, the layer is just a linear matrix product of the input and the weights
outp=Dense(class_count,activation="softmax")(hidden) # softmax: produces a distribution over the classes
model=Model(inputs=[inp], outputs=[outp])

In [10]:
model # the model's "recipe"

<tensorflow.python.keras.engine.training.Model at 0x7f701a42bc50>

...it's **this** simple...!

Once the model is constructed it needs to be compiled. For that we need to know:
* which optimizer we want to use (sgd is fine to begin with; the code below uses adam)
* what the loss is (categorical_crossentropy is the right choice for a multiclass problem of the kind we have; since our labels are integer class numbers rather than one-hot vectors, the code below uses its sparse_categorical_crossentropy variant)
* which metrics to measure, accuracy is an okay choice

* Optimizer = the algorithm that searches for the minimum of the loss function
* Loss (multiclass classification): cross entropy, which compares the true and the predicted distributions
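
For a single example whose true class is given as an integer, cross entropy reduces to the negative logarithm of the probability the network assigned to the correct class. The sketch below illustrates this; the function name and the three made-up predictions are assumptions for illustration, not part of Keras.

```python
import math

def cross_entropy(true_class, predicted):
    """Loss for one example: -log of the probability given to the true class."""
    return -math.log(predicted[true_class])

# The true class is 1 ('pos') in all three cases
confident_right = cross_entropy(1, [0.1, 0.9])
unsure = cross_entropy(1, [0.5, 0.5])
confident_wrong = cross_entropy(1, [0.9, 0.1])

print(confident_right, unsure, confident_wrong)
```

The loss is close to zero when the network is confidently right and grows quickly when it is confidently wrong, which is the behaviour we want the optimizer to minimize.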

In [0]:
model.compile(optimizer="adam",loss="sparse_categorical_crossentropy",metrics=['accuracy'])

A compiled model can be fitted on data:

In [0]:
# batch_size: how many inputs go in at once. After every batch the weights are updated with the average of the gradients
# epochs: how many times we go through the whole data
# validation_split: how much of the data is held out for computing the validation accuracy

hist=model.fit(feature_matrix,class_numbers,batch_size=100,verbose=1,epochs=5,validation_split=0.1)

# In the output:
# loss: the value of the loss function that the model prints out
# accuracy: training data accuracy
# val_acc: accuracy on held-out data the model was not trained on
# if it looks like the model would still improve at the end of training (val_acc keeps getting better), add epochs
# if val_acc starts to drop from one epoch to the next, the model is starting to overfit

In [0]:
# The key is "val_acc" in older Keras versions and "val_accuracy" in newer ones, so try both.
print(hist.history.get("val_acc",hist.history.get("val_accuracy")))


* We ran for 5 epochs of training
* Made it to a decent accuracy on the validation data

* But we do not have the model saved, so let's fix that and get the whole thing done
* What constitutes a model (i.e. what we need to run the model on new data):
  - The feature dictionary in the vectorizer
  - The list of classes in their correct order
  - The structure of the network
  - The weights the network learned

* Do all these things, and run again. This time we also increase the maximum number of epochs to 100 (training stops early once the validation loss no longer improves) and see what happens.
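
Why the feature dictionary and the class order belong to the model can be seen in a toy round trip. This is a sketch under loud assumptions: a plain dict and a list stand in for the fitted `CountVectorizer` and `LabelEncoder`, the temporary file path is hypothetical, and `network_output` is a made-up stand-in for what the trained network would return.

```python
import os
import pickle
import tempfile

vocabulary = {"boring": 0, "great": 1, "movie": 2}   # toy feature dictionary
classes = ["neg", "pos"]                             # class order used in training

# Save both, as the notebook does for its encoders (pickle needs binary mode)
path = os.path.join(tempfile.mkdtemp(), "toy.encoders.pickle")
with open(path, "wb") as f:
    pickle.dump((classes, vocabulary), f)

# Later, possibly in another program: load them back
with open(path, "rb") as f:
    loaded_classes, loaded_vocabulary = pickle.load(f)

# New text must be mapped to the SAME feature indices as during training
new_text = "great movie"
features = [0] * len(loaded_vocabulary)
for word in new_text.split():
    if word in loaded_vocabulary:
        features[loaded_vocabulary[word]] = 1

# Made-up network output; the saved class order turns the index into a label
network_output = [0.2, 0.8]
predicted = loaded_classes[network_output.index(max(network_output))]
print(features, predicted)
```

Without the saved vocabulary the input columns would be meaningless to the network, and without the saved class order we could not tell which output node means which label.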

In [0]:
# saving the model

from keras.models import Model
from keras.layers import Input, Dense
from keras.callbacks import ModelCheckpoint, EarlyStopping
import pickle
import os

def save_model(file_name,model,label_encoder,vectorizer):
    """Saves model structure and vocabularies"""
    model_json = model.to_json() # model structure without the weights
    with open(file_name+".model.json", "w") as f:
        print(model_json,file=f)
    with open(file_name+".encoders.pickle","wb") as f: # pickle requires the file to be opened in binary mode
        pickle.dump((label_encoder,vectorizer),f)
        
example_count,feature_count=feature_matrix.shape
class_count=len(label_encoder.classes_)

inp=Input(shape=(feature_count,))
hidden=Dense(200,activation="tanh")(inp)
outp=Dense(class_count,activation="softmax")(hidden)
model=Model(inputs=[inp], outputs=[outp])
model.compile(optimizer="adam",loss="sparse_categorical_crossentropy",metrics=['accuracy'])

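# Make sure the output directory exists, otherwise saving fails
os.makedirs("models",exist_ok=True)
# Save model and vocabularies, can be done before training
save_model("models/imdb_bow",model,label_encoder,vectorizer)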
# Callback function to save weights during training, if validation loss goes down
save_cb=ModelCheckpoint(filepath="models/imdb_bow.weights.h5", monitor='val_loss', verbose=1, save_best_only=True, mode='auto')
# Callback to stop training when no improvement
stop_cb=EarlyStopping(monitor='val_loss', patience=2, verbose=1, mode='auto', baseline=None, restore_best_weights=True)

hist=model.fit(feature_matrix,class_numbers,batch_size=100,verbose=1,epochs=100,validation_split=0.1,callbacks=[save_cb,stop_cb])


In [0]:
import numpy
from sklearn.metrics import classification_report, confusion_matrix

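#Validation data used during training
#(hist.validation_data may not be available in newer Keras versions):
val_instances,val_labels,_=hist.validation_data

# class distributions predicted by the network (computed once and reused)
network_output=model.predict(val_instances)
print("Network output=",network_output)
predictions=numpy.argmax(network_output,axis=1)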
print("Maximum class for each example=",predictions)
conf_matrix=confusion_matrix(list(val_labels),list(predictions))
print("Confusion matrix=\n",conf_matrix)
gold_labels=label_encoder.inverse_transform(list(val_labels))
predicted_labels=label_encoder.inverse_transform(list(predictions))
print(classification_report(gold_labels,predicted_labels))


# Learning progress

* The history object we get lets us inspect the accuracy during training
* Remarks:
  - Accuracy on training data keeps going up
  - Accuracy on validation (test) data flattens out after a bit over 10 epochs; we are learning very little past that point
  - What we see is that the network keeps overfitting on the training data until the end
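
These remarks are the motivation for the `EarlyStopping` callback used in the training cell: stop once the validation loss has not improved for `patience` epochs, and keep the weights from the best epoch. The following is a simplified sketch of that logic, not the Keras implementation; the function name and the list of validation losses are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops at the first epoch where the validation loss has failed
    to improve on the best value for `patience` epochs in a row.
    """
    best_loss = float("inf")
    best_epoch = None
    waited = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss, best_epoch, waited = val_loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch   # ran through all epochs

# Made-up validation losses: improving at first, then overfitting sets in
val_losses = [0.40, 0.32, 0.30, 0.31, 0.33, 0.36]
stop, best = early_stopping_epoch(val_losses, patience=2)
print("stopped at epoch", stop, "best epoch", best)
```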

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.ylim(0.85,1.0)
plt.plot(hist.history["val_accuracy"],label="Validation set accuracy")
plt.plot(hist.history["accuracy"],label="Training set accuracy")
plt.legend()
plt.show()

# Summary

* We put together a program to train a neural network classifier for sentiment detection
* We learned the necessary code/techniques to save models, and to feed the training with data in just the right format
* We observed the training across epochs
* We saw how the classifier can be applied to various text classification problems
* The IMDB sentiment classifier ended up at nearly 90% accuracy, while the state of the art is about 95%. We got surprisingly far with a few lines of code
