<a href="https://colab.research.google.com/github/HannaKi/Deep_Learning_in_LangTech_course/blob/master/language_classifier_NN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In the data package, the directory `language_identification` contains data for 5 languages. Based on this data:
* Train an SVM classifier for language recognition between these 5 languages.
  * When a regression algorithm is used for classification, a separate classifier is trained for each class ("Is it English? Yes/no"), which gives five decision boundaries
* Implement the same classifier using a simple NN
* Compare the results you get with the NN and the SVM. Focus on experimenting with the various learning parameters (learning rate, optimizer, etc.)
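
The "one classifier per class" idea from the note above can be sketched in plain Python. This is only an illustration: the words and weights below are hypothetical and hand-set, whereas a real linear classifier learns its weights (and a bias term) from the training data.

```python
# Hypothetical hand-set weights: one binary "is it this language?" scorer per class.
# A trained linear model would learn such weights from data instead.
WEIGHTS = {
    "en": {"the": 1.0, "and": 0.8},
    "es": {"el": 1.0, "los": 0.9, "de": 0.3},
    "et": {"ja": 0.5, "on": 0.4, "ei": 0.3, "see": 0.9},
    "fi": {"ja": 0.5, "on": 0.4, "ei": 0.3, "että": 1.0},
    "pt": {"o": 0.6, "os": 0.9, "de": 0.3, "não": 1.0},
}

def score(lang, text):
    """Score of the binary classifier for one language on one text."""
    words = set(text.lower().split())
    return sum(weight for word, weight in WEIGHTS[lang].items() if word in words)

def predict(text):
    """One-vs-rest: run all five scorers and keep the most confident one."""
    return max(WEIGHTS, key=lambda lang: score(lang, text))

print(predict("the cat and the dog"))
print(predict("los tejidos de el cuerpo"))
```

Each of the five scorers corresponds to one decision boundary; the final label is simply the class whose scorer is most confident.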

In [0]:
# It makes sense to give the data-reading code a little structure.
# TODO: figure out how to load the folder from your own GitHub repository!
# Also works by creating a folder in Colab (here named "texts")
# and dragging the files into it (the folder disappears when the session ends).

import random

def read_data_one_lang(lang,part):
    """Reads one file for one language. Returns data in the form of pairs of (lang,line)"""
    filename="texts/{}_{}.txt".format(lang,part)
    result=[] #this will be the list of pairs (lang,line)
    with open(filename, encoding="utf-8") as f: #the texts contain non-ASCII characters
        for line in f:
            line=line.strip()
            result.append((lang,line)) 
    return result


def read_data_all_langs(part):
    """Reads train, test or dev data for all languages. part can be train, test, or devel"""
    data=[]
    for lang in ("en","es","et","fi","pt"):
        pairs=read_data_one_lang(lang,part)
        data.extend(pairs) #just add these lines to the end
    #...done
    #but now they come in the order of languages
    #we really must scramble these!
    random.shuffle(data)
    
    #finally, separate the labels and the lines, we will need them that way anyway
    labels=[label for label,line in data]
    lines=[line for label,line in data]
    return labels,lines

labels_train, lines_train=read_data_all_langs("train") # train and test data splitting already done!
labels_dev,lines_dev=read_data_all_langs("devel")


In [37]:
for label,line in zip(labels_train[:5],lines_train[:5]):
    print(label,"   ",line[:30],"...")

print(labels_train[0], lines_train[0])

es     La acumulación de dicha sustan ...
pt     Se outros já por lá andavam, f ...
en     <<Alberta Transmission Access  ...
es     Esta se inició con un antinatu ...
fi     Sumuisena syysaamuna äiti polt ...
es La acumulación de dicha sustancia en los tejidos del de el cuerpo puede causar un daño severo al a el sistema nervioso central de los niños pequeños.


# Reminder

The feature matrix has one row for each document and one column for each vocabulary item (word or n-gram).
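
As a minimal sketch of that layout, here is a binary bag-of-words matrix built by hand for three made-up documents. It only mimics the idea behind `CountVectorizer(binary=True)`; the real vectorizer also lowercases, uses its own tokenizer (by default it drops one-character tokens), and returns a sparse matrix.

```python
# Three toy documents (made up for illustration).
docs = ["la casa de la playa", "the house on the beach", "la playa"]

# Vocabulary: every distinct word, sorted so the column order is reproducible.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word; 1 means "word occurs".
matrix = [[1 if word in doc.split() else 0 for word in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Most entries are zero even in this tiny example, which is why the real feature matrix is stored in sparse form.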

In [38]:
from sklearn.feature_extraction.text import CountVectorizer
import sklearn.svm

vectorizer = CountVectorizer(max_features=100000, binary=True, ngram_range=(1,1))

feature_matrix_train = vectorizer.fit_transform(lines_train) 
# .fit_transform: learn the vocabulary dictionary and return the document-term matrix.
feature_matrix_dev = vectorizer.transform(lines_dev)
# .transform: transform documents to a document-term matrix using the learned vocabulary.

# print(vectorizer.get_feature_names_out()) # Words (or ngrams) in learned vocabulary, a HUGE list!
print("Number of rows (documents) and unique ngrams in feature matrix")
print(feature_matrix_train.shape) 
print()
print("Since each text uses only a small part of the vocabulary, the feature matrix is sparse!")
print(feature_matrix_train.toarray())

Number of rows (documents) and unique ngrams in feature matrix
(5000, 28620)

Since each text uses only a small part of the vocabulary, the feature matrix is sparse!
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


# Support Vector Machine

In [39]:
for C in (0.001,0.01,0.1,1,10,100):
    classifier =  sklearn.svm.LinearSVC(C=C)
    classifier.fit(feature_matrix_train, labels_train)
    print("C=",C,"     ",classifier.score(feature_matrix_dev, labels_dev))

C= 0.001       0.8758
C= 0.01       0.9144
C= 0.1       0.933
C= 1       0.9302
C= 10       0.9102
C= 100       0.8728




* The best dev accuracy is about 93% (0.933 at C=0.1).
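
Picking the best value of `C` from the sweep can also be done programmatically instead of by eye. The sketch below just copies the dev accuracies printed above into a dictionary; the variable names are illustrative.

```python
# Dev-set accuracies copied from the output above (word unigram features).
dev_accuracy = {
    0.001: 0.8758,
    0.01: 0.9144,
    0.1: 0.933,
    1: 0.9302,
    10: 0.9102,
    100: 0.8728,
}

# Keep the C whose dev accuracy is highest.
best_C = max(dev_accuracy, key=dev_accuracy.get)
print("best C:", best_C, "accuracy:", dev_accuracy[best_C])
```

Accuracy drops on both sides of the best value: a very small `C` regularizes too strongly, while a very large `C` lets the model fit the training data too closely.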

# NN

For the NN we need to encode each class as a numeric value.
 * Remember the [difference between label encoding and one-hot encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f)
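
To make the difference concrete, here is a small plain-Python sketch of both encodings on a made-up list of labels. It is not the `LabelEncoder` implementation, only an illustration of the two representations (classes are sorted alphabetically before being numbered).

```python
labels = ["es", "pt", "en", "es", "fi"]   # toy labels for illustration

# Label (integer) encoding: each class gets one integer index.
classes = sorted(set(labels))
index = {c: i for i, c in enumerate(classes)}
encoded = [index[label] for label in labels]

# One-hot encoding: each class becomes a vector with a single 1.
one_hot = [[1 if i == e else 0 for i in range(len(classes))] for e in encoded]

print(classes)
print(encoded)
for row in one_hot:
    print(row)
```

Integer indices are compact but suggest an ordering between classes that does not exist; one-hot vectors avoid that at the cost of one column per class.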

In [44]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder() #Turns class labels into integers

class_numbers = label_encoder.fit_transform(labels_train)

print(class_numbers)

print("class_numbers shape=",class_numbers.shape)
print("class labels",label_encoder.classes_) #this will let us translate back from indices to labels

[1 4 0 ... 0 3 4]
class_numbers shape= (5000,)
class labels ['en' 'es' 'et' 'fi' 'pt']


* Are words actually a good source of features?
* Let us try with character n-grams instead of words
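
To see what character n-grams look like, the following sketch extracts 1- to 3-grams inside word boundaries, padding each word with a space on both sides. It is a simplified approximation of what `analyzer="char_wb"` does, written for illustration; the function name is made up and the real analyzer may differ in details.

```python
def char_wb_ngrams(text, min_n=1, max_n=3):
    """Collect character n-grams that never cross word boundaries."""
    grams = set()
    for word in text.lower().split():
        padded = " " + word + " "          # mark the word edges with spaces
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                grams.add(padded[i:i + n])
    return grams

print(sorted(char_wb_ngrams("äiti")))
```

Short pieces such as word beginnings and endings are shared by many words of the same language, so even unseen words still produce familiar features.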

In [29]:
vectorizer = CountVectorizer(max_features=100000, binary=True,
                           ngram_range=(1,3), analyzer="char_wb")
feature_matrix_train=vectorizer.fit_transform(lines_train)
feature_matrix_dev=vectorizer.transform(lines_dev)

# print(vectorizer.get_feature_names_out()) # Words (or ngrams) in learned vocabulary, a HUGE list!
print("Number of rows (documents) and unique ngrams in feature matrix")
print(feature_matrix_train.shape) 
print()
print("Since each text uses only a small part of the vocabulary, the feature matrix is sparse!")
print(feature_matrix_train.toarray())

Number of rows (documents) and unique ngrams in feature matrix
(5000, 17671)

Since each text uses only a small part of the vocabulary, the feature matrix is sparse!
[[1 1 1 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 ...
 [1 0 0 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 [1 0 0 ... 0 0 0]]


# SVM

In [28]:
for C in (0.001,0.01,0.1,1,10,100):
    classifier=sklearn.svm.LinearSVC(C=C)
    classifier.fit(feature_matrix_train, labels_train)
    print("C=",C,"     ",classifier.score(feature_matrix_dev, labels_dev))


C= 0.001       0.9762
C= 0.01       0.9778
C= 0.1       0.9732
C= 1       0.9726
C= 10       0.9726
C= 100       0.9724


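Now, that's quite a bit better: character n-grams reach 0.9778 dev accuracy at C=0.01, compared with 0.933 at best for word unigrams.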
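
One way to quantify the improvement is to compare error rates rather than accuracies. The short calculation below uses the best dev accuracy from each of the two SVM runs above; the variable names are illustrative.

```python
word_best = 0.933    # LinearSVC, word unigrams, C=0.1
char_best = 0.9778   # LinearSVC, character 1-3-grams, C=0.01

word_error = 1 - word_best
char_error = 1 - char_best

# Share of the word-model errors that the character model removes.
reduction = (word_error - char_error) / word_error
print(f"error {word_error:.1%} -> {char_error:.1%}, relative reduction {reduction:.0%}")
```

Seen this way, switching to character n-grams removes roughly two thirds of the dev-set errors.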