<a href="https://colab.research.google.com/github/HannaKi/Deep_Learning_in_LangTech_course/blob/master/language_classifier_NN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Compare SVM and NN on language classification

The course data folder language_identification contains data for 5 languages. Based on this data:
* Train an SVM classifier for language recognition between these 5 languages. (Previously done!)
* Implement the same classifier using a simple NN.
* Compare the results you get with the NN and the SVM. Focus on experimenting with the various NN learning parameters (learning rate, optimizer, etc.).

In [0]:
# It makes sense to structure the data reading a little
# TODO: figure out how to load the folder from your own GitHub!
# It also works to create a folder in Colab (here named "texts")
# and drag the files into it (the folder disappears when the session ends)

import random

def read_data_one_lang(lang,part):
    """Reads one file for one language. Returns data in the form of pairs of (lang,line)"""
    filename="texts/{}_{}.txt".format(lang,part)
    result=[] #this will be the list of pairs (lang,line)
    with open(filename) as f:
        for line in f:
            line=line.strip()
            result.append((lang,line)) 
    return result


def read_data_all_langs(part):
    """Reads train, test or dev data for all languages. part can be train, test, or devel"""
    data=[]
    for lang in ("en","es","et","fi","pt"):
        pairs=read_data_one_lang(lang,part)
        data.extend(pairs) #just add these lines to the end
    #...done
    #but now they come in the order of languages
    #we really must scramble these!
    random.shuffle(data)
    
    #let's also separate the labels and lines, we will need that anyway
    labels=[label for label,line in data]
    lines=[line for label,line in data]
    return labels,lines

labels_train, lines_train=read_data_all_langs("train") # train and test data splitting already done!
labels_dev,lines_dev=read_data_all_langs("devel")


In [2]:
for label,line in zip(labels_train[:5],lines_train[:5]):
    print(label,"   ",line[:30],"...")

print(labels_train[0], lines_train[0])

en     2. In Section 3.1, in the last ...
et     " KUMMELI TEE". ...
pt     « Psicologia, poesia» ...
es     La temporada siguiente, sin Ma ...
pt     Estavam repletas de lixo, copo ...
en 2. In Section 3.1, in the last sentence after the proviso insert"a" before the word"change" and after the word"in".


# Reminder

The feature matrix has one row for each document and one column for each vocabulary item.
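
As a reminder of what such a matrix looks like, here is a toy sketch in plain Python. It builds a binary bag-of-words matrix for three made-up sentences. The helper name `binary_bag_of_words` and the whitespace tokenization are assumptions for illustration only; CountVectorizer uses its own tokenizer and returns a sparse matrix.

```python
def binary_bag_of_words(docs):
    """Toy binary bag-of-words: one row per document, one column per vocabulary word."""
    # A sorted vocabulary gives each word a fixed column index
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in doc.lower().split():
            row[index[word]] = 1  # binary: presence of the word, not its count
        matrix.append(row)
    return vocab, matrix

docs = ["the cat sat", "el gato", "the dog sat down"]  # made-up example sentences
vocab, matrix = binary_bag_of_words(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row has as many entries as there are vocabulary words, and most of them are zero, which is why the real feature matrix is stored in sparse form.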

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
import sklearn.svm

vectorizer = CountVectorizer(max_features=100000, binary=True, ngram_range=(1,1))

feature_matrix_train = vectorizer.fit_transform(lines_train) 
# .fit_transform: Learn the vocabulary dictionary and return document-term matrix.
feature_matrix_dev = vectorizer.transform(lines_dev)
# .transform: Transform documents to document-term matrix.

# print(vectorizer.get_feature_names()) # Words (or ngrams) in learned vocabulary, a HUGE list!
print("Number of rows (documents) and unique ngrams in feature matrix")
print(feature_matrix_train.shape) 
print()
print("Since most texts use only a small number of the words (ngrams) in the vocabulary, the feature matrix is sparse!")
print(feature_matrix_train.toarray())

Number of rows (documents) and unique ngrams in feature matrix
(5000, 28620)

Since most texts use only a small number of the words (ngrams) in the vocabulary, the feature matrix is sparse!
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


# Support Vector Machine

In [4]:
for C in (0.001,0.01,0.1,1,10,100):
    classifier =  sklearn.svm.LinearSVC(C=C)
    classifier.fit(feature_matrix_train, labels_train)
    print("C=",C,"     ",classifier.score(feature_matrix_dev, labels_dev))

C= 0.001       0.8758
C= 0.01       0.9144
C= 0.1       0.933
C= 1       0.9302
C= 10       0.9102
C= 100       0.8728




* Best accuracy on the dev set: about 93% (0.933 at C=0.1).

# NN

For the NN we combine the train and dev data and set aside a proportion of it for validation when fitting the model.

In [0]:
lines =  lines_train + lines_dev
labels = labels_train + labels_dev

For the NN we need to encode each class as a numeric value.
 * Remember the [difference between label encoding and one-hot encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f)
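
The difference can be sketched in a few lines of plain Python. The sample labels below are made up, and the variable names are illustrative; the sketch only mirrors the idea that each class is mapped to an integer in sorted order, and is not the LabelEncoder implementation.

```python
sample_labels = ["en", "fi", "pt", "en", "et"]  # made-up sample labels

# Integer (label) encoding: each class gets one integer, assigned in sorted order
classes = sorted(set(sample_labels))
class_to_index = {c: i for i, c in enumerate(classes)}
encoded = [class_to_index[label] for label in sample_labels]

# One-hot encoding: each label becomes a vector with a single 1 at its class index
one_hot = [[1 if i == idx else 0 for i in range(len(classes))] for idx in encoded]

print(classes)
print(encoded)
print(one_hot)
```

Integer encoding gives one number per example, while one-hot encoding gives one vector per example with as many entries as there are classes.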

In [15]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder() #Turns class labels into integers
class_numbers = label_encoder.fit_transform(labels)

print(class_numbers)
print("class_numbers shape=",class_numbers.shape)
print("class labels",label_encoder.classes_) #this will let us translate back from indices to labels

[0 2 4 ... 2 3 1]
class_numbers shape= (10000,)
class labels ['en' 'es' 'et' 'fi' 'pt']


In [16]:
vectorizer = CountVectorizer(max_features=100000, binary=True, ngram_range=(1,1))

feature_matrix = vectorizer.fit_transform(lines) 
# .fit_transform: Learn the vocabulary dictionary and return document-term matrix.

# print(vectorizer.get_feature_names()) # Words (or ngrams) in learned vocabulary, a HUGE list!
print("Number of rows (documents) and unique ngrams in feature matrix")
print(feature_matrix.shape) 
print()
print("Since most texts use only a small number of the words (ngrams) in the vocabulary, the feature matrix is sparse!")
print(feature_matrix.toarray())

Number of rows (documents) and unique ngrams in feature matrix
(10000, 46875)

Since most texts use only a small number of the words (ngrams) in the vocabulary, the feature matrix is sparse!
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [17]:
import keras
from keras.models import Model
from keras.layers import Input, Dense

example_count, feature_count = feature_matrix.shape
class_count = len(label_encoder.classes_)

inp = Input(shape=(feature_count, )) # shape must be a tuple
hidden = Dense(200, activation="tanh")(inp) # tanh is used here. Is ReLU more popular?
# If no activation function is given, the layer is just a linear transformation (matrix product of the input and the weights, plus bias)
outp = Dense(class_count, activation="softmax")(hidden) # softmax: produces a distribution over the classes
model = Model(inputs=[inp], outputs=[outp])

model

<keras.engine.training.Model at 0x7f5e39acce80>

Once the model is constructed it needs to be compiled. For that we need to know:
* which optimizer we want to use (sgd is fine to begin with; the cell below uses adam)
* what the loss is (sparse_categorical_crossentropy is the right choice for multiclass classification with integer-encoded labels like ours; categorical_crossentropy expects one-hot labels)
* which metrics to measure (accuracy is an okay choice)
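
To make the loss concrete, here is a minimal sketch in plain Python of what a softmax output and a cross-entropy loss compute for a single example. The scores are made up and the function bodies are illustrative, not the Keras implementation. The compile call in this notebook uses the sparse variant, which takes the integer class index directly.

```python
import math

def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(probs, true_class):
    """Loss for one example when the label is an integer class index."""
    return -math.log(probs[true_class])

def categorical_crossentropy(probs, one_hot):
    """The same loss when the label is given as a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

scores = [2.0, 1.0, 0.1, -1.0, 0.5]  # made-up output scores for the 5 languages
probs = softmax(scores)
loss_sparse = sparse_categorical_crossentropy(probs, 0)
loss_onehot = categorical_crossentropy(probs, [1, 0, 0, 0, 0])
print(probs)
print(loss_sparse, loss_onehot)
```

Both functions give the same value for the same example; they differ only in how the true label is represented. The loss is small when the model puts high probability on the true class and grows as that probability shrinks.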

In [0]:
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=['accuracy'])

In [24]:
from keras.callbacks import ModelCheckpoint, EarlyStopping

# batch_size: how many inputs are fed in at a time; after each batch the weights are updated with the average of the gradients
# epochs: how many times to go through the whole data
# validation_split: the fraction of the data held out for computing validation accuracy

# Callback to stop training when there is no improvement
# (newer Keras versions report this metric as 'val_accuracy' instead of 'val_acc')
stop_cb=EarlyStopping(monitor='val_acc', patience=2, verbose=1, mode='auto', baseline=None, 
                      restore_best_weights=True)

hist=model.fit(feature_matrix, class_numbers, batch_size=100, verbose=1, epochs=100,
               validation_split=0.1, callbacks=[stop_cb])

Train on 9000 samples, validate on 1000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Restoring model weights from the end of the best epoch
Epoch 00005: early stopping
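
The patience logic behind this output can be sketched in plain Python. The function name `early_stopping_epoch` and the validation scores are made up for illustration; this is a simplified model of the idea, not the Keras callback itself.

```python
def early_stopping_epoch(val_scores, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based, for per-epoch validation scores."""
    best_score = float("-inf")
    best_epoch = 0
    wait = 0  # number of consecutive epochs without improvement
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop here, best weights are from best_epoch
    return len(val_scores), best_epoch

val_scores = [0.90, 0.94, 0.96, 0.95, 0.955, 0.97]  # made-up validation accuracies
stop_epoch, best_epoch = early_stopping_epoch(val_scores, patience=2)
print(stop_epoch, best_epoch)
```

With a patience of 2, training stops after two consecutive epochs that fail to beat the best score so far, even if a later epoch would have improved on it. Restoring the best weights then returns the model to the state of the best epoch.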


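Validation accuracy: 96.7 %. Note that this is measured on the 10 % validation split, not on the full dev set used for the SVM, so the two figures are not directly comparable.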

Experiment with the various parameters of learning (learning rate, optimizer, etc). See [Keras API](https://keras.io/optimizers/).

In [0]:
from keras import optimizers

# Note: compiling again does not reset the weights learned above
sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd, loss="sparse_categorical_crossentropy", metrics=['accuracy'])

hist = model.fit(feature_matrix, class_numbers, batch_size=100, verbose=1, epochs=100,
               validation_split=0.1, callbacks=[stop_cb])
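
To get a feel for what the learning rate and momentum parameters do, here is a minimal sketch of SGD with classical momentum on a one-dimensional toy objective. The function `sgd_momentum` and the objective are assumptions for illustration; the sketch leaves out the Nesterov variant and learning rate decay that the Keras optimizer above also applies.

```python
def sgd_momentum(grad, w, lr=0.01, momentum=0.9, steps=200):
    """Minimize a 1-D function with gradient descent plus classical momentum."""
    velocity = 0.0
    for _ in range(steps):
        g = grad(w)
        velocity = momentum * velocity - lr * g  # keep part of the previous step direction
        w = w + velocity                         # move along the accumulated direction
    return w

# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3
w_final = sgd_momentum(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)
```

Trying different values of `lr` and `momentum` in this toy setting shows the same trade-off as in the real model: steps that are too small converge slowly, while steps that are too large overshoot the minimum.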