<a href="https://colab.research.google.com/github/HannaKi/priva_DL_HLT/blob/master/Text_Classification_task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text classification

This project is about text classification. You will develop a text classification system that identifies different kinds of online texts, such as news, blogs and opinionated texts. We will refer to these text categories as registers. If you want to learn more about online registers and their automatic identification, you can read, e.g., our paper [Toward Multilingual Identification of Online Registers] (https://www.aclweb.org/anthology/W19-6130/).

# Data and register labels
The data for this project consist of ~7500 documents with manual annotations on their register. You can download it from http://dl.turkunlp.org/TKO_8965-projects/classification/ . The documents are based on a (almost) random sample of the Finnish Internet. The registers are identified using a relatively detailed, hierarchical taxonomy. The taxonomy consists of 8 main categories that are divided into a large number of subregisters. The taxonomy is described at the end of this page. The table includes also the abbreviations that are used in the data.

The challenge with online documents is that it is not always easy to identify the specific registers categories of the documents. Furthermore, another issue is that a document may display characteristics of several registers. For instance, a blog post may simultaneously seem like a product review. To deal with these challenges, we have followed the following guidelines:
* For each document, the annotators have aimed at marking the specific subregister category. When this is possible, the document has two register labels: the subregister label and the main register label to which the subregister belongs. For instance, a document annotated as a news article would have the label NE for News and the corresponding higher level register label NA for Narrative. 
* In some cases, the document does not seem to fit any of the subregisters. In this case, the document can be given only one label: the main register label, such as NA for Narrative. 
* Some documents may display characteristics of several register categories. In this case, the annotator can mark several register labels for one single document. Consequently, the document may have up to four labels. This would be the case case if a document is annotated both as a Personal blog (subregister label PB + corresponding higher level register label NA) and Review (subregister label RV + corresponding higher level register label OP).


# Register classes and abbreviations

NA Narrative

* NE NA    New reports / news blogs
* SR NA    Sports reports
* PB NA    Personal blog
* HA NA    Historical article
* FC NA    Fiction
* TB NA    Travel blog
* CB NA    Community blogs
* OA NA    Online article

OP  Opinion
* OB OP  Personal opinion blogs
* RV OP  Reviews
* RS OP  Religious blogs/sermons
* AV OP  Advice

IN Informational description
* JD IN  Job description
* FA IN  FAQs
* DT IN  Description of a thing
* IB IN  Information blogs
* DP IN  Description of a person
* RA IN  Research articles
* LT IN  Legal terms / conditions
* CM IN  Course materials
* EN IN  Encyclopedia articles
* RP IN  Report

ID Interactive discussion
* DF ID  Discussion forums
* QA ID  Question-answer forums

HI  How-to/instructions
* RE HI  Recipes

IP IG  Informational persuasion
* DS IG  Description with intent to sell
* EB IG  News-opinion blogs / editorials

Lyrical LY
* PO LY  Poems
* SL LY  Songs

Spoken SP
* IT SP Interviews
* FS SP Formal speeches

Others OS
* MT OS Machine-translated / generated texts


# Preparations

Download and open data, explore it.

In [1]:
# Get rid of old tf at some point!

%tensorflow_version 1.x
# to run with old tf with which the code was made
# The default version of TensorFlow in Colab will switch to TensorFlow 2.x on the 27th of March, 2020.
# https://colab.research.google.com/notebooks/tensorflow_version.ipynb

import tensorflow
print(tensorflow.__version__)

TensorFlow 1.x selected.
1.15.2


In [2]:
# Download development data
!wget http://dl.turkunlp.org/TKO_8965-projects/classification/fincore-dev.tsv
# Download test data
!wget http://dl.turkunlp.org/TKO_8965-projects/classification/fincore-test.tsv
# Download train data
!wget http://dl.turkunlp.org/TKO_8965-projects/classification/fincore-train.tsv

--2020-04-20 15:56:02--  http://dl.turkunlp.org/TKO_8965-projects/classification/fincore-dev.tsv
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4035578 (3.8M) [application/octet-stream]
Saving to: ‘fincore-dev.tsv’


2020-04-20 15:56:03 (4.29 MB/s) - ‘fincore-dev.tsv’ saved [4035578/4035578]

--2020-04-20 15:56:05--  http://dl.turkunlp.org/TKO_8965-projects/classification/fincore-test.tsv
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8512687 (8.1M) [application/octet-stream]
Saving to: ‘fincore-test.tsv’


2020-04-20 15:56:07 (6.40 MB/s) - ‘fincore-test.tsv’ saved [8512687/8512687]

--2020-04-20 15:56:10--  http://dl.turkunlp.org/TKO_8965-projects/classification/fincore-train.tsv
Resolving dl

Data split
  - Train data - all training based on it (this includes the vectorizer!)
  - Development data - set the parameters (a.k.a validation data set)
  - Test data - used for nothing during training, produce final results



In [3]:
import pandas as pd

train = pd.read_csv('fincore-train.tsv', sep='\t', header=None)

train = train.sample(frac=1, random_state = 4) # suffle the data
train.columns = ['label','text']
print(train.head())
print(train.shape)

       label                                               text
3982  DS IG    Logistiikka Jenni Lindholm Laskutus Ritva Lie...
2640  RS OP    Tässä [ [ Ortodoksinen seminaari ( Joensuu ) ...
119   NE NA    Koulutuspaikka jokaiselle peruskoulun päättän...
4916  SR NA    1 Cardiff C–Everton Tasainen kohde . Cardiff ...
775   MT OS    Northrop Grumman Q4 2009 tulokset Northrop Gr...
(5295, 2)


In [4]:
dev = pd.read_csv('fincore-dev.tsv', sep='\t', header=None)
dev.columns = ['label','text']
print(dev.head())
print(dev.shape)

    label                                               text
0  OA NA    Luonnonhoito Maaperän siemenpankkia avattiin ...
1  DS IG    • Jokainen ripsi on erittäin kevyt ja muodolt...
2  DS IG    Mukavuudet Hotel Dila Vain muutaman metrin pä...
3  DF ID    Vastaa viestiin Otsikko Viesti ensin omaishoi...
4  OA NA    Dinosaur Jr 30.5.2010 Tavastia , Helsinki 198...
(756, 2)


In [5]:
test = pd.read_csv('fincore-test.tsv', sep='\t', header=None)
test.columns = ['label','text']
print(test.head())
print(test.shape)

test_text = test['text']
test_labels = test['label']

    label                                               text
0    HI     Tehkää nollaleimaus . Jos rekisteröinti onnis...
1    NA     1 kommenttia : Syyslomallelähtijät kirjoitti ...
2  DT IN    Ammattikoulutuksen perustana on ajatus siitä ...
3  DP IN    Ulkonäkö : Silveriä voisi kuvata tietyllä tap...
4  DT IN    Laulupelimannien puheenjohtajina ovat toimine...
(1513, 2)


In [87]:
# Prepare stratified data sets for training, development and testing:
# Stratification aims to ensure that all the data sets (train, development and test) have the same distribution of labels. 
# This minimizes chances that a model has to try to predict labes it has not seen during training.

# Error: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
# Best solution: More data
# Second best solution: If you cannot have another dataset, you will have to play with what you have. I would suggest you remove the sample that has the lonely target. 
# So you will have a model which does not cover that target. If that does not fit you requirements, you need a new dataset.

from sklearn.model_selection import train_test_split 
from pprint import pprint

# Join all the data and re-divide it with stratification

frames = [train, dev, test]
data = pd.concat(frames)

# Separating out the target
y = data['label'] # pd df
# # Separating out the features
X = data['text'] # pd df

unique, counts = np.unique(y, return_counts=True)
nums = dict(zip(unique, counts))

pprint(sorted(nums.items(), key = lambda kv:(kv[1], kv[0])))

[('AV OP DS IG ', 1),
 ('CB NA EB IG ', 1),
 ('CB NA HI  ', 1),
 ('CB NA IP IG ', 1),
 ('CB NA RV OP ', 1),
 ('DF ID NE NA ', 1),
 ('DF ID PB NA ', 1),
 ('DF ID RE HI ', 1),
 ('DP IN CB NA ', 1),
 ('DP IN EN IN ', 1),
 ('DP IN IP IG ', 1),
 ('DP IN NA  ', 1),
 ('DS IG AV OP ', 1),
 ('DS IG IN  ', 1),
 ('DS IG MT OS ', 1),
 ('DS IG OB OP ', 1),
 ('DT IN AV OP ', 1),
 ('DT IN FC NA ', 1),
 ('DT IN NE NA ', 1),
 ('DT IN RV OP ', 1),
 ('EN IN IP IG ', 1),
 ('FC NA DP IN ', 1),
 ('FC NA DT IN ', 1),
 ('FC NA ID  ', 1),
 ('HA NA DP IN ', 1),
 ('HA NA PB NA ', 1),
 ('HI  EN IN ', 1),
 ('HI  IP IG ', 1),
 ('HI  LT IN ', 1),
 ('HI  NE NA ', 1),
 ('IB IN CB NA ', 1),
 ('IB IN IN  ', 1),
 ('ID  PB NA ', 1),
 ('IN  HI  ', 1),
 ('IP IG NE NA ', 1),
 ('IT SP DT IN ', 1),
 ('MT OS EN IN ', 1),
 ('NE NA DT IN ', 1),
 ('OA NA DP IN ', 1),
 ('OA NA FC NA ', 1),
 ('OB OP DT IN ', 1),
 ('OB OP RE HI ', 1),
 ('OP  DP IN ', 1),
 ('PB NA DS IG ', 1),
 ('PB NA IB IN ', 1),
 ('RE HI PB NA ', 1),
 ('RV OP DS IG

In [91]:
# Separate and handle labels that occur in data less than three times

one_label = []
for key, value in nums.items():
    if value == 1:
      one_label.append(key)

two_labels = []
for key, value in nums.items():
    if value == 2:
      two_labels.append(key)

sufficient = []
for key, value in nums.items():
    if value >= 3:
      sufficient.append(key)

one_label_ = data[data['label'].isin(one_label)]
print(one_label_.shape[0]) # number of labels that occur ONLY ONCE in the data. Super bad!

two_labels_ = data[data['label'].isin(two_labels)]
print(two_labels_['label'].nunique()) # number of labels that occur ONLY TWICE in the data. Bad!

enough_labels = data[data['label'].isin(sufficient)]
print(enough_labels.shape[0]) # amount of data ready for test-dev-train split as it is
print(type(enough_labels))

ones = pd.concat([one_label_, one_label_, one_label_])
print(ones.shape)

twos = pd.concat([two_labels_, two_labels_.drop_duplicates(subset=['label'])])
print(twos.shape)

data_ = pd.concat([enough_labels, ones, twos])
data_.shape

y = data_['label'] # pd df
# # Separating out the features
X = data_['text'] # pd df

unique, counts = np.unique(y, return_counts=True)
nums = dict(zip(unique, counts))

pprint(sorted(nums.items(), key = lambda kv:(kv[1], kv[0])))

51
10
7493
<class 'pandas.core.frame.DataFrame'>
(153, 2)
(30, 2)
[('AV OP DS IG ', 3),
 ('CB NA EB IG ', 3),
 ('CB NA HI  ', 3),
 ('CB NA IP IG ', 3),
 ('CB NA RV OP ', 3),
 ('CB NA SR NA ', 3),
 ('DF ID HI  ', 3),
 ('DF ID NE NA ', 3),
 ('DF ID PB NA ', 3),
 ('DF ID RE HI ', 3),
 ('DP IN CB NA ', 3),
 ('DP IN DS IG ', 3),
 ('DP IN EN IN ', 3),
 ('DP IN IP IG ', 3),
 ('DP IN NA  ', 3),
 ('DS IG AV OP ', 3),
 ('DS IG IN  ', 3),
 ('DS IG MT OS ', 3),
 ('DS IG OB OP ', 3),
 ('DS IG RV OP ', 3),
 ('DT IN AV OP ', 3),
 ('DT IN FC NA ', 3),
 ('DT IN NE NA ', 3),
 ('DT IN RV OP ', 3),
 ('EN IN HI  ', 3),
 ('EN IN IP IG ', 3),
 ('FC NA DP IN ', 3),
 ('FC NA DT IN ', 3),
 ('FC NA ID  ', 3),
 ('HA NA DP IN ', 3),
 ('HA NA DS IG ', 3),
 ('HA NA PB NA ', 3),
 ('HI  DS IG ', 3),
 ('HI  DT IN ', 3),
 ('HI  EN IN ', 3),
 ('HI  IP IG ', 3),
 ('HI  LT IN ', 3),
 ('HI  NE NA ', 3),
 ('IB IN AV OP ', 3),
 ('IB IN CB NA ', 3),
 ('IB IN IN  ', 3),
 ('ID  PB NA ', 3),
 ('IN  HI  ', 3),
 ('IP IG NE NA ', 3)

In [92]:
y = data_['label']
X = data_['text']

# Split train and development data

dev_size = dev.shape[0]/data.shape[0]
train_text, dev_text, train_labels, dev_labels = train_test_split(X, y, stratify = y, test_size=dev_size, random_state=1)

# Split train and test data

test_size = test.shape[0] / len(train_text)
train_text, test_text, train_labels, test_labels = train_test_split(train_text, train_labels, stratify = train_labels, test_size=test_size, random_state=1)

print(train_text.shape, dev_text.shape, test_text.shape)
# print(train_text.head())
# print(train_labels.head())

labels = [train_labels, dev_labels, test_labels]
all_labels = pd.concat(labels)
class_count = len(all_labels.unique())
print("Number of unique labels in data: ", class_count)

(5395,) (768,) (1513,)
Number of unique labels in data:  119


# Baseline

Since we now know that 'MT OS ' is the most common label in the data with 957 occurences, we can set a naive baseline prediction: We will predict that an unlabeled new text belongs to this biggest class. In this case our prediction accuracy is the pure share of the biggest class in the data:

In [137]:
print("Classification baseline: ", round((957/data.shape[0])*100,1), "percent") # here the original data without modifications for training of classificators 

Classification baseline:  12.7 percent


In [0]:
# import seaborn as sns

# plt.subplots_adjust(left=None, bottom=None, right=None, top=1, wspace=0.6, hspace=0.6)

# plt.subplot(2, 3, 1)
# sns.countplot(x='label', data=train, )
# #ax = sns.lineplot(x="t_Length", y="Gross_tonnage", hue="Ship_type", data=df).set_title('Gross tonnage as a function of ship length to power of two')

# plt.subplot(2, 3, 2)
# sns.countplot(x='label', data=dev, )

In [0]:
# # Gather features and labels of the data

# # Separate text and the associated label
# train_text = train['text']
# train_labels = train['label']

# print(train_text.head())
# print(train_labels.head())
# print()

# dev_text = dev['text']
# dev_labels = dev['label']

# test_text = test['text']
# test_labels = test['label']

# labels = [train_labels, dev_labels, test_labels]
# all_labels = pd.concat(labels)

# print(all_labels.head(10))
# print()
# class_count = len(all_labels.unique())
# print("Number of unique labels in data: ", class_count)

# Milestone 1.1: Bag-of-words classifier (multi-class)

Train a bag-of-words classifier to predict the register categories. In this milestone, the setting is multi-class, so the register label combinations form the classes, e.g. NA_NE and NA_NE_OP_OB. 

- Evaluate your model and report your results with different hyperparameters
- Ideas to try:
  - Different activation functions
  - Altering the learning rate
  - Use different optimizers
  - Adjusting the vocabulary size of the embeddings

- Activation functions and optimizers supported by Keras can be found here: https://keras.io/


Bow classifier is only interested in the multiplicity or appearance of words (or to be precise n-garms). Hence we loose the textual context and order of the words (n-grams). This inevitably leads to some information loss.

We will use CountVectorizer from sklearn package to transform out text data to numerical format with which our classifier is able to deal with. CountVectorizer converts the collection of text documents (our training data) to a matrix of token counts. Since we are only interested whether a particular word ot the vocabulary is in a single document or not, our vectorizer is set on "binary". 

In [93]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features = 85000, binary = True, ngram_range = (1,1))
# form feature matrix
train_feature_matrix = vectorizer.fit_transform(train_text)
dev_feature_matrix = vectorizer.fit_transform(dev_text)

print("shape of the training data: ", train_feature_matrix.shape)
print("shape of the development data: ", dev_feature_matrix.shape)

shape of the training data:  (5395, 85000)
shape of the development data:  (768, 85000)


The shape of the feature matrix tells us that we have 5295 items (documents) in our training data. The number of unique n-grams exceeds 97 000 but we are including only the first 97 000 most common of them. Since our CountVectorizer has parameter setting "ngram_range = 1, 1" this means we are forming the vector with unigrams, separate words or charachters.

## Label encoding

Next we will encode the labels. This means transforming the textula labels no numeric values, which our model is able to deal with. This step is made with LabelEncoder class.

In [95]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder() # Create the instance of LabelEncoder we use to turn class labels into integers

label_encoder.fit(all_labels) # encode labels to integers

train_numbers = label_encoder.transform(train_labels) 
dev_numbers = label_encoder.transform(dev_labels) 
test_numbers = label_encoder.transform(test_labels) 

print("Inverse transform gives unique labels in each data set: ", label_encoder.inverse_transform(train_numbers))
print("Sanity checks, do we have as many labels and texts in our data sets")
print(len(train_numbers), len(train_text))
print(len(dev_numbers), len(dev_text))
print(len(test_numbers), len(test_text))

Inverse transform gives unique labels in each data set:  ['NE NA ' 'CB NA ' 'IN  ' ... 'RS OP ' 'PB NA ' 'DF ID ']
Sanity checks, do we have as many labels and texts in our data sets
5395 5395
768 768
1513 1513


In [0]:
# LabelEncoderin testailua ÄLÄ HÄVITÄ!!!!!!

# le = LabelEncoder()
# le.fit(["paris", "paris", "tokyo", "amsterdam"]) # encode tee koodaus
# print("Koodaukset: ", list(le.classes_))
# print()
# le.transform(["tokyo", "tokyo", "paris"]) # käytä koodausta arvojen transformointiin --> numeeriset arvot
# num = le.transform(["tokyo", "tokyo", "paris"])
# print("Sovitetulla encoderilla tuotetut numeeriset arvot muuttujalistasta: ", num)
# print()
# print("Inverse transform: ", list(le.inverse_transform([2, 2, 1])))
# print()
# test = le.fit_transform(["paris", "paris", "tokyo", "amsterdam"]) # encode and transform
# print("Koodit kolmella", test)

# le.fit(["paris", "paris", "tokyo", "amsterdam", "helsinki"])
# #print(list(le.classes_))

# test = le.fit_transform(["paris", "helsinki", "tokyo", "amsterdam"]) # encode and transform
# print("Koodit neljällä", test)

# print(list(le.inverse_transform([2, 2, 1])))


Now the data is prepared and we move on to building the classifier itself.

In [96]:
import keras
from keras.models import Model
from keras.layers import Input, Dense

# from tensorflow.python.keras.models import Model
# from tensorflow.python.keras.layers import Input, Dense

example_count, feature_count = train_feature_matrix.shape

inp = Input(shape = (feature_count, ))                  # Tuple. The size of the inputlayer is the number of the vectors
hidden = Dense(200, activation="tanh")(inp)             # Non-linear activation function. tanh or relu? 
outp = Dense(class_count, activation="softmax")(hidden) # As many output possibilities as we have input classes. ALL THE POSSIBLE CLASSES!!?!??! 
                                                        # Softmax: produces probability distribution of the classes
bow_model = Model(inputs=[inp], outputs=[outp])

bow_model.summary()

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 85000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 200)               17000200  
_________________________________________________________________
dense_2 (Dense)              (None, 119)               23919     
Total params: 17,024,119
Trainable params: 17,024,119
Non-trainable params: 0
_________________________________________________________________


Using TensorFlow backend.


In [0]:
bow_model.compile(optimizer="adam", loss = "sparse_categorical_crossentropy", metrics = ['accuracy'])

Now we will fit the data. Here we will also need the validation data.

batch_size kuinka monta inputtia kerralla sisaan. jokaisen batchin jalkeen paivitetaan painokertoimet gradientien keskiarvolla
epochs kuinka monta kertaa mennaan lapi koko data
validation_split: kuinka paljon dataa kaytetaan accuracyn laskemiseen

- jos näyttää sille, että mallin oppiminen paranisi vaikka malli on jo treenattu (val_acc kehittyy paremmaksi), lisaa epocheja. Käytä early stoppingia estään ylisovittaminen

## Fitting BOW-classifier

Let's try with different optimizers available in Keras.

In [98]:
from keras.callbacks import ModelCheckpoint, EarlyStopping

stop_cb = EarlyStopping(monitor = 'val_accuracy', patience=3, verbose=1, mode='auto', baseline=None, restore_best_weights=True)

# https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Ftrl
# Lisää tähän loopiin eri learningratet?
ops = ('Adadelta', 'Adagrad', 'Adam', 'SGD', 'RMSProp', 'Adamax', 'Nadam')

# tee uusi muuttuja, jossa LR: liitetty loopin avulla optimizeriin!

for op in ops:
  print(op)
  print()
  bow_model.compile(optimizer = op, loss="sparse_categorical_crossentropy", metrics=['accuracy'])
  bow_history = bow_model.fit(train_feature_matrix, train_numbers, batch_size=100, 
                 verbose=1, epochs=25, validation_data=(dev_feature_matrix, dev_numbers), callbacks=[stop_cb])
  print()

# https://www.javacodemonk.com/difference-between-loss-accuracy-validation-loss-validation-accuracy-in-keras-ff358faa  

Adadelta


Train on 5395 samples, validate on 768 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Restoring model weights from the end of the best epoch
Epoch 00004: early stopping

Adagrad

Train on 5395 samples, validate on 768 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Restoring model weights from the end of the best epoch
Epoch 00004: early stopping

Adam

Train on 5395 samples, validate on 768 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Restoring model weights from the end of the best epoch
Epoch 00005: early stopping

SGD

Train on 5395 samples, validate on 768 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Restoring model weights from the end of the best epoch
Epoch 00004: early stopping

RMSProp

Train on 5395 samples, validate on 768 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Restoring model weights from the end of the best epoch
Epoch 00004: early stopping

Adamax

Train on 5395 samples, validate on 768 samples
Epoch 1/25
Epoch 2/

In [99]:
# Choose best OP and LR and fit the model
bow_model.compile(optimizer = 'Adadelta', loss="sparse_categorical_crossentropy", metrics=['accuracy'])

bow_history = bow_model.fit(train_feature_matrix, train_numbers, batch_size=100, 
                 verbose=1, epochs=25, validation_data=(dev_feature_matrix, dev_numbers), callbacks=[stop_cb])

Train on 5395 samples, validate on 768 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Restoring model weights from the end of the best epoch
Epoch 00004: early stopping


In [0]:
# import numpy as np

# predictions = model.predict(dev_feature_matrix)
# pred_classes = np.argmax(predictions,axis=-1)
# for pred, correct, txt_line in zip(pred_classes, dev_labels, dev_text):
#     pred_label=label_encoder.classes_[pred]
#     if pred_label!=correct:
#         print("Prediction:",pred_label,"Correct:",correct,"Text:",txt_line)

In [100]:
# form feature matrix for test data set
test_feature_matrix = vectorizer.fit_transform(test_text)

print(test_feature_matrix.shape)

(1513, 85000)


In [0]:
# # Code for plot from:
# # http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py

# import matplotlib.pyplot as plt
# import itertools
# %matplotlib inline 

# def plot_confusion_matrix(cm, classes,
#                           normalize = False,
#                           title = 'Confusion matrix',
#                           cmap = plt.cm.Blues):
    
    
#     """
#     This function prints and plots the confusion matrix.
#     Normalization can be applied by setting `normalize=True`.
#     """
#     if normalize:
#         cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]
#         print("Normalized confusion matrix")
#     else:
#         print('Confusion matrix, without normalization')

#     print(cm)

#     plt.imshow(cm, interpolation='nearest', cmap = cmap)
#     plt.title(title)
#     plt.colorbar()
#     tick_marks = np.arange(len(classes))
#     plt.xticks(tick_marks, classes, rotation=45)
#     plt.yticks(tick_marks, classes)

#     fmt = '.2f' if normalize else 'd'
#     thresh = cm.max() / 2.
#     for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
#         plt.text(j, i, format(cm[i, j], fmt),
#                  horizontalalignment = "center",
#                  color="white" if cm[i, j] > thresh else "black")

#     plt.ylabel('True label')
#     plt.xlabel('Predicted label')
#     plt.tight_layout() 

In [101]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import numpy as np

#print("Network output=",model.predict(test_feature_matrix))
predictions = np.argmax(bow_model.predict(test_feature_matrix), axis=1) # np.argmax gives the index which has the highest value e.g. class
print("true labels: \n", test_labels)
# target_labels = label_encoder.inverse_transform(list(target))
predicted_labels = label_encoder.inverse_transform(list(predictions))
print("predicted labels: \n", predicted_labels)
print()
print("Classification accuracy: ", round(accuracy_score(test_labels, predicted_labels)*100,1), "percent")

# print(classification_report(test_labels, predicted_labels))

true labels: 
 4123          RV OP 
986     NE NA DS IG 
1283          IP IG 
337             HI  
969           DT IN 
            ...     
1167          NE NA 
1372          EN IN 
1184          RV OP 
294           NE NA 
870           MT OS 
Name: label, Length: 1513, dtype: object
predicted labels: 
 ['NE NA ' 'NE NA ' 'PB NA ' ... 'NE NA ' 'PB NA ' 'DT IN ']

Classification accuracy:  11.1 percent


Why does the model perform so badly? 119 classes. 

In [102]:

print("Number of unique labels in")
print("-train data: ", len(np.unique(train_numbers)))
print("-development data: ", len(np.unique(dev_numbers)))
print("-test data: ", len(np.unique(test_numbers)))
print()
inter = np.intersect1d(train_numbers, dev_numbers)
inter2 = np.intersect1d(train_numbers, test_numbers)
print("Number of shared labels in")
print("-train and development data: ", len(inter))
print("-train and test data: ", len(inter2))

Number of unique labels in
-train data:  119
-development data:  56
-test data:  101

Number of shared labels in
-train and development data:  56
-train and test data:  101


In [0]:
# cnf_matrix = confusion_matrix(test_labels, predicted_labels)

# # Confusion matrix has the true labels on rows, and predicted labels on columns in sorted order
# print(cnf_matrix)

In [0]:
# # Plot confusion matrix
# plt.figure()
# plot_confusion_matrix(cnf_matrix, classes = all_labels, normalize = False)

# plt.show()

In [0]:
# # np-argmaxin testailua ÄLÄ HÄVITÄ!!!!!!
# print(predictions[0])
# print(model.predict(test_feature_matrix)[0][25])
# print(model.predict(test_feature_matrix)[0])
# print(sum(model.predict(test_feature_matrix)[0]))

# Milestone 1.2: Recurrent Neural Network Classifier (multi-class)

Modify your codes from milestone 1.1 to use recurrent neural networks (e.g. LSTM or biLSTM) in the classifier. Evaluate your model and report your results with different hyperparameters.

For RNN-calssifier we use Tokenizer which turns tokens, in our case the words of training data to integers. 

## Tokenizing

In [0]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(
    num_words=97000, # max num of most common words
)

tokenizer.fit_on_texts(train_text)

In [104]:
from pprint import pprint    # pretty-printer

def truncate_dict(d, count=10):
    # Returns at most count items from the given dictionary.  
    return dict(i for i, _ in zip(d.items(), range(count)))

# Check if 0 is in the index, and print examples of the mapping
# 0 is reserved for padding!
print(tokenizer.word_index.get(0))
pprint(truncate_dict(tokenizer.word_index))

None
{'ei': 3,
 'että': 4,
 'ja': 1,
 'kun': 9,
 'mutta': 7,
 'oli': 8,
 'on': 2,
 'ovat': 10,
 'se': 5,
 'tai': 6}


In [105]:
train_sequences = tokenizer.texts_to_sequences(train_text)

print(len(train_sequences)) 

# Print an example text, its corresponding sequence, and the tokens it represents
print('Text:', train_text.head(1)[0:200]) # first item of the suffled data (index not 0!)
print('Sequence:', train_sequences[0][:10])
print('Mapped back:', [tokenizer.index_word[i] for i in train_sequences[0][:10]])

5395
Text: 104     MAAPALLOUUTISET WWF etsii Suomenlahden suojel...
Name: text, dtype: object
Sequence: [20072, 3350, 21213, 61223, 20072, 114, 15206, 23896, 5404, 6082]
Mapped back: ['wwf', 'etsii', 'suomenlahden', 'suojelijoita', 'wwf', 'suomen', 'vuotuinen', 'panda', 'palkinto', 'myönnetään']


In [106]:
lengths = [len(s) for s in train_sequences]
print('Lengths:', lengths[:10], 'min:', min(lengths), 'max:', max(lengths), 'mean:', np.mean(lengths))

Lengths: [203, 197, 60, 101, 62, 85, 1182, 592, 90, 39] min: 0 max: 85774 mean: 567.8354031510657


## Padding

Since Keras demands for all of the input items (separate documents of our training data) to have the same length, we need to "pad" all but the longest document by filling in the "missing" number of words with zeros.


In [115]:
from keras.preprocessing.sequence import pad_sequences

sequence_length = np.floor(np.mean(lengths)).astype(int) # based on mean value of input length: we will cut sequences longer than this and pad with zeros sequeces shorter than this

type(sequence_length)

padded_X = pad_sequences(
    train_sequences,
    maxlen = sequence_length, 
    value=0
)

print(padded_X.shape)

(5395, 567)


In [116]:
# Prepare model development data

dev_sequences = tokenizer.texts_to_sequences(dev_text)
padded_dev = pad_sequences(
    dev_sequences,
    maxlen = sequence_length, 
    value=0
)

print(padded_dev.shape)

(768, 567)


# Build LSTM RNN

# HUOMAA!!!
Tämä teksti alkuperäisestä RNN-classification notebookista!

We define a basic RNN model that takes the RNN cell class (RNN_class) as an argument:

- input: sequence of sequence_length integers corresponding to words
- embedding: randomly initialized mapping from integers to embedding_dim-dimensional vectors
- rnn: recurrent neural network with rnn_units-dimensional state
- output: num_classes-dimensional fully connected layer with softmax activation

# KATSO NÄITÄ!
We're intentionally leaving out a few fairly obvious things that would be expected to help here, including

- Any form of regularization, e.g. dropout
- Initializing the embeddings with pre-trained word vectors (ks. fasttext)
- Masking to ignore padding (see Masking and padding with Keras)


In [0]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

# We'll use these model parameters for all of our examples here.
embedding_dim = 50 # input vector
rnn_units = 100

def build_rnn_model(RNN_class, sequence_length, vocab_size, num_classes):
    input_ = Input(shape=(sequence_length,))
    embedding = Embedding(vocab_size, embedding_dim)(input_) # randomly initialized. Layer turns positive integers (indexes) into dense vectors of fixed size
    rnn = RNN_class(rnn_units)(embedding) # can support different RNNs
    output = Dense(num_classes, activation='softmax')(rnn)
    return Model(inputs=[input_], outputs=[output])

sequence_length = padded_X.shape[1]
vocab_size = tokenizer.num_words
num_classes = len(label_encoder.classes_)

In [118]:
len(label_encoder.classes_)

119

In [119]:
lstm_model = build_rnn_model(LSTM, sequence_length, vocab_size, num_classes)

lstm_model.summary()

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 567)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 567, 50)           4850000   
_________________________________________________________________
lstm (LSTM)                  (None, 100)               60400     
_________________________________________________________________
dense (Dense)                (None, 119)               12019     
Total params: 4,922,419
Trainable params: 4,922,419
Non-trainable params: 0
_________________________________________________________________


In [0]:
lstm_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

epochs = 25
batch_size = 100
stop_cb = EarlyStopping(monitor = 'val_acc', patience=3, verbose=1, mode='auto', baseline=None, restore_best_weights=True)

In [127]:
lstm_history = lstm_model.fit(padded_X, train_numbers, epochs = epochs, batch_size = batch_size, validation_data=(padded_dev, dev_numbers), callbacks=[stop_cb])
# , callbacks=[stop_cb]

Train on 5395 samples, validate on 768 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 00007: early stopping


In [128]:
# Prepare model test data

test_sequences = tokenizer.texts_to_sequences(test_text)
padded_test = pad_sequences(
    test_sequences,
    maxlen = sequence_length, 
    value=0
)

print(padded_test.shape)

(1513, 567)


Predict with LSTM


In [129]:
predictions = np.argmax(lstm_model.predict(padded_test), axis=1) # np.argmax gives the index which has the highest value e.g. class
predicted_labels = label_encoder.inverse_transform(list(predictions))
print("Classification accuracy: ", round(accuracy_score(test_labels, predicted_labels)*100,1), "percent")

Classification accuracy:  29.7 percent


# Bidirectional LSTM RNN

In [0]:
# Tämä ei toimi! Virhe nimeämisestä, selvittämättä!
# from keras.layers import Bidirectional as Bi

# def build_bi_rnn_model(RNN_class, sequence_length, vocab_size, num_classes):
#     input_ = Input(shape=(sequence_length,))
#     embedding = Embedding(vocab_size, embedding_dim)(input_) # randomly initialized
#     rnn = Bi(RNN_class(rnn_units))(embedding) # can support different RNNs
#     output = Dense(num_classes, activation='softmax')(rnn)
#     return Model(inputs=[input_], outputs=[output])


In [0]:
# lstm_bi_model = build_bi_rnn_model(LSTM, sequence_length, vocab_size, num_classes)

# lstm_bi_model.summary()

In [0]:
# lstm_bi_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [0]:
# from keras.models import Sequential
# from keras.layers import Dense, Embedding, Activation, Bidirectional, LSTM

# # model = Sequential()
# # model.add(Bidirectional(LSTM(10, return_sequences=True),
# #                         input_shape=(5, 10)))
# # model.add(Bidirectional(LSTM(10)))
# # model.add(Dense(5))
# # model.add(Activation('softmax'))
# # model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# model = Sequential()
# model.add(Dense(90, input_dim=sequence_length))
# model.add(Embedding(vocab_size, embedding_dim))
# model.add(Bidirectional(LSTM(rnn_units)))
# model.add(Dense(num_classes))
# model.add(Activation('softmax'))

# model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# model.summary()

In [0]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, Activation, Bidirectional, LSTM

# def build_bi_rnn_model(RNN_class, sequence_length, vocab_size, num_classes):
#     input_ = Input(shape=(sequence_length,))
#     embedding = Embedding(vocab_size, embedding_dim)(input_) # randomly initialized
#     rnn = Bi(RNN_class(rnn_units))(embedding) # can support different RNNs
#     output = Dense(num_classes, activation='softmax')(rnn)
#     return Model(inputs=[input_], outputs=[output])

model = Sequential()
# model.add(Input(shape=(sequence_length,)
model.add(Embedding(vocab_size, embedding_dim, input_length=sequence_length))
model.add(Bidirectional(LSTM(rnn_units))) # bidirectional: num of neurons gets doubled
model.add(Dense(num_classes))
model.add(Activation('softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()

In [130]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, Activation, Bidirectional, LSTM

# def build_bi_rnn_model(RNN_class, sequence_length, vocab_size, num_classes):
#     input_ = Input(shape=(sequence_length,))
#     embedding = Embedding(vocab_size, embedding_dim)(input_) # randomly initialized
#     rnn = Bi(RNN_class(rnn_units))(embedding) # can support different RNNs
#     output = Dense(num_classes, activation='softmax')(rnn)
#     return Model(inputs=[input_], outputs=[output])

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=sequence_length))
model.add(Bidirectional(LSTM(rnn_units))) # bidirectional: num of neurons gets doubled
model.add(Dense(num_classes, activation = 'softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 567, 50)           4850000   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               120800    
_________________________________________________________________
dense_3 (Dense)              (None, 119)               23919     
Total params: 4,994,719
Trainable params: 4,994,719
Non-trainable params: 0
_________________________________________________________________


In [133]:
stop_cb = EarlyStopping(monitor = 'val_accuracy', patience=3, verbose=1, mode='auto', baseline=None, restore_best_weights=True)
hist_bi_lstm = model.fit(padded_X, train_numbers, epochs = epochs, batch_size = batch_size, validation_data=(padded_dev, dev_numbers), callbacks=[stop_cb])

Train on 5395 samples, validate on 768 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Restoring model weights from the end of the best epoch
Epoch 00006: early stopping


Predict with bi-directional LSTM

In [134]:
predictions = np.argmax(model.predict(padded_test), axis=1) # np.argmax gives the index which has the highest value e.g. class
predicted_labels = label_encoder.inverse_transform(list(predictions))
print("Classification accuracy: ", round(accuracy_score(test_labels, predicted_labels)*100,1), "percent")

Classification accuracy:  32.8 percent


# Milestone 2.1: Deep contextual representations with Bert (multi-class)
Train a Bert classifier to predict the register categories. Similar to Milestone 1, the setting is multi-class, and the evaluations should include results with different hyperparameters.

# Milestone 2.2: Error analysis
Compare the errors made by the classifiers you have trained from milestones 1 and 2.1. Are there any patterns? How do the errors one model makes differ from those made by another.

# Milestone 3.1: Bert (multi-LABEL)
Train two multi-label classifiers, one using non-deep contextual representations, the other using Bert. In this setting, each label is assigned independently. Do hyperparameter optimization on these classifiers.

# Milestone 3.2: Model comparison
Compare the results of these two classifiers. Do the two models predict in the same way? Analyze the predictions in terms of label-specific differences.

# Register classes and abbreviations
