# Introduction
The classifier was developed with the nltk package, using its built-in classifier `nltk.NaiveBayesClassifier` and a simple bag-of-words pipeline:
-  Acquisition of data
-  Cleaning and pre-processing:
    -  Removal of non-alphabetic tokens
    -  Removal of the stop words of all the languages used
-  Tokenization 
-  Creation of the Bag of Words 
-  Splitting training and testing sets 
-  Training the model 
-  Testing and Querying the model

In [356]:
import re
import pandas as pd
import numpy as np
import math
import random

from collections import Counter
from nltk.tokenize import WhitespaceTokenizer
from nltk import NaiveBayesClassifier
from nltk.metrics import ConfusionMatrix
from nltk.classify import accuracy
from nltk.corpus import stopwords
from nltk.corpus import genesis
from nltk.corpus import udhr
from nltk.corpus import gutenberg

# 1. Acquisition - Creating corpus

The corpus was built by mixing 11 pre-existing NLTK corpora, drawn from the Genesis, Gutenberg and Universal Declaration of Human Rights (UDHR) collections:
-  5 of the corpora are in English (two from Genesis, two from Gutenberg and one from the UDHR)
-  The other 6 are in Finnish, French (2 corpora), Portuguese, German and Spanish

In [357]:
d = { 'corpus':
        [genesis.words("english-kjv.txt"),genesis.words("english-web.txt"),
            genesis.words("finnish.txt"),genesis.words("french.txt"),genesis.words("portuguese.txt"),
            gutenberg.words("austen-emma.txt"),gutenberg.words("shakespeare-macbeth.txt"),
            udhr.words("English-Latin1"),udhr.words("German_Deutsch-Latin1"),
            udhr.words("French_Francais-Latin1"),udhr.words("Spanish-Latin1")
        ],
        'language':
        [1,1,0,0,0,1,1,1,0,0,0]
}
df = pd.DataFrame(data=d)

### Size of the corpus

In [358]:
def lexical_diversity(text):
    return len(set(text)) / len(text)


print('\nInformation about the corpus.\n(Where 1 is english and 0 is non-english)\n')

for index,corpus in df.iterrows():
    print(f"Information about corpus number {index}\n Length: {len(corpus['corpus'])}, Lexical diversity: {lexical_diversity(corpus['corpus'])}, Language: {corpus['language']}")


Information about the corpus.
(Where 1 is english and 0 is non-english)

Information about corpus number 0
 Length: 44764, Lexical diversity: 0.06230453042623537, Language: 1
Information about corpus number 1
 Length: 44054, Lexical diversity: 0.06033504335588142, Language: 1
Information about corpus number 2
 Length: 32520, Lexical diversity: 0.2088560885608856, Language: 0
Information about corpus number 3
 Length: 46116, Lexical diversity: 0.0803842484170353, Language: 0
Information about corpus number 4
 Length: 45094, Lexical diversity: 0.08457887967357076, Language: 0
Information about corpus number 5
 Length: 192427, Lexical diversity: 0.04059201671283136, Language: 1
Information about corpus number 6
 Length: 23140, Lexical diversity: 0.17359550561797754, Language: 1
Information about corpus number 7
 Length: 1781, Lexical diversity: 0.29927007299270075, Language: 1
Information about corpus number 8
 Length: 1521, Lexical diversity: 0.3806706114398422, Language: 0
Information 
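As a small illustration of the lexical diversity figure reported above, the sketch below applies the same ratio (distinct tokens over total tokens) to a toy token list. The list is hypothetical and not taken from the corpora.

```python
def lexical_diversity(tokens):
    """Ratio of distinct tokens to total tokens (1.0 means no repetition)."""
    return len(set(tokens)) / len(tokens)

# Hypothetical toy token list, used only for illustration.
tokens = ["in", "the", "beginning", "god", "created",
          "the", "heaven", "and", "the", "earth"]

diversity = lexical_diversity(tokens)
print(f"{len(tokens)} tokens, {len(set(tokens))} distinct, diversity {diversity:.2f}")
```

Longer texts repeat words more often, which is consistent with the longest corpus in the listing having the lowest diversity.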

# 2. Cleaning - Pre-processing

In [359]:
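def remove_nonalpha(words):
    # Keep the tokens that start with an ASCII letter. re.match anchors only at
    # the start, so a token such as "a1" passes, while a token that starts with
    # an accented letter is dropped.
    return [word for word in words if re.match(r'[a-zA-Z]+', word)]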

In [360]:
df['corpus']=df['corpus'].apply(lambda cw : remove_nonalpha(cw))
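The filter in `remove_nonalpha` only checks how a token starts. The sketch below, on hypothetical tokens, compares that behaviour with a stricter alternative based on `str.isalpha`. The alternative is an assumption for illustration, not what the notebook runs, and swapping it in would change the results reported later.

```python
import re

# Hypothetical tokens chosen to show the edge cases ("\u00e9t\u00e9" is French "ete" with accents).
tokens = ["God", "3rd", "a1", ";", "\u00e9t\u00e9", "light"]

# Same test as remove_nonalpha: the token must start with an ASCII letter.
prefix_kept = [t for t in tokens if re.match(r"[a-zA-Z]+", t)]

# Stricter alternative: every character must be a letter; str.isalpha also
# accepts accented letters.
alpha_kept = [t for t in tokens if t.isalpha()]

print(prefix_kept)
print(alpha_kept)
```

With the prefix test, "a1" is kept and the accented French word is dropped; with `str.isalpha` the opposite happens, which matters for a corpus where most of the non-English text contains accented letters.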

### 3. Remove stopwords

In [361]:
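def remove_stopwords(words):
    # Combined stop-word list for every language present in the corpus.
    stop_words = set(stopwords.words('english') + stopwords.words('french') + 
        stopwords.words('finnish') + stopwords.words('portuguese') + 
        stopwords.words('german')+ stopwords.words('spanish'))
    # Membership is tested before lowercasing, so capitalized stop words pass
    # the filter. The result is one lowercase, space-joined string.
    return ' '.join(word.lower() for word in words if word not in stop_words)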

In [362]:
df['corpus']=df['corpus'].apply(lambda cw : remove_stopwords(cw))
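Because `remove_stopwords` compares each word with the stop-word list before lowercasing it, a capitalized stop word survives the filter. The sketch below shows the difference on a toy example. `STOP_WORDS` is a hypothetical stand-in for the combined NLTK list (assumed to hold lowercase entries), and the case-insensitive variant is a possible alternative rather than the notebook's code.

```python
# Hypothetical stand-in for the combined multilingual stop-word list.
STOP_WORDS = {"the", "and", "of", "le", "et"}

def remove_stopwords_case_sensitive(words):
    # Mirrors the notebook cell: membership test first, lowercasing after.
    return " ".join(w.lower() for w in words if w not in STOP_WORDS)

def remove_stopwords_case_insensitive(words):
    # Variant: lowercase each word before testing membership.
    return " ".join(w.lower() for w in words if w.lower() not in STOP_WORDS)

words = ["The", "earth", "and", "the", "Heaven"]
sensitive = remove_stopwords_case_sensitive(words)
insensitive = remove_stopwords_case_insensitive(words)
print(sensitive)
print(insensitive)
```

In the case-sensitive version the sentence-initial "The" ends up in the output as "the", so some stop words still reach the bag of words.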

# 4/5. Tokenization and creation of the BoW

In [363]:
def bagofwords(corpus):
    # Split the cleaned text on whitespace and count how often each token occurs.
    w_tokenizer = WhitespaceTokenizer()
    return dict(Counter(w_tokenizer.tokenize(corpus)))

In [364]:
df['corpus'] = df['corpus'].apply(lambda cw : bagofwords(cw))
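To make the shape of the bag of words concrete, here is a minimal sketch on a toy string. It uses `str.split()` as a stand-in for `WhitespaceTokenizer` so that it runs without nltk; `bag_of_words` and the sample text are hypothetical.

```python
from collections import Counter

def bag_of_words(text):
    # str.split() with no arguments splits on runs of whitespace, which is
    # used here in place of WhitespaceTokenizer.
    return dict(Counter(text.split()))

bow = bag_of_words("light darkness light day night light")
print(bow)
```

Each corpus is therefore reduced to a single dictionary that maps a token to its frequency, and word order is discarded.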

# 6. Creating labeled corpus in the correct format

In [365]:
labeled_corpusno = [
    ({word:freq}, 'non-english') 
    for corp in df[df['language'] == 0]['corpus']
    for word,freq in corp.items()
]
labeled_corpuseng = [
    ({word:freq}, 'english') 
    for corp in df[df['language'] == 1]['corpus']
    for word,freq in corp.items()
    ]
total_set = labeled_corpuseng + labeled_corpusno
random.shuffle(total_set)
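The comprehension above produces one labeled instance per word rather than one per document. The sketch below reproduces that format on hypothetical toy bags of words, to show what the classifier receives.

```python
import random

# Hypothetical toy bags of words, keyed by label.
bags = {
    "english": {"light": 3, "day": 1},
    "non-english": {"valo", "jour"} and {"valo": 2, "jour": 1},
}

# One (featureset, label) pair per word, each featureset holding a single
# feature: the word as the feature name, its frequency as the feature value.
labeled = [
    ({word: freq}, label)
    for label, bag in bags.items()
    for word, freq in bag.items()
]

random.seed(0)  # fixed seed so the shuffle is repeatable
random.shuffle(labeled)
print(labeled)
```

Since the frequency is the feature value, `{"light": 3}` and `{"light": 1}` count as different observations of the feature `light`, which is why the most informative features are reported as `word = count` pairs.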

## Creating training and testing set
Common split percentages include:
-  Train: 80%, Test: 20%
-  Train: 67%, Test: 33%
-  Train: 50%, Test: 50%

Given the size of the corpus and some experiments during development, the 67%/33% split was chosen.\
After preprocessing and cleaning, the train set contains 19574 labeled word entries and the test set 10514. This is closer to a 65%/35% split, because the split index is computed from the number of English-labeled entries rather than from the size of the whole set.
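For comparison, a split whose index is computed from the whole shuffled list could be sketched as follows. `split_by_fraction` is a hypothetical helper, and the list of integers only stands in for the 30088 labeled entries (19574 + 10514); this is not the split used for the results in this notebook.

```python
import math

def split_by_fraction(items, train_fraction=2 / 3):
    """Return (train, test), with roughly train_fraction of the items in train."""
    cut = math.ceil(len(items) * train_fraction)
    return items[:cut], items[cut:]

# Placeholder items with the same count as the notebook's shuffled set.
items = list(range(30088))
train_part, test_part = split_by_fraction(items)
print(f"Size of train set: {len(train_part)}\nSize of test set: {len(test_part)}")
```

The list is assumed to be shuffled beforehand, as in the notebook, so that slicing does not separate the two labels.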

    

In [366]:
train = total_set[math.ceil(2*(len(labeled_corpuseng)/3)):] 
test = total_set[:math.ceil(2*(len(labeled_corpuseng)/3))]

In [367]:
print(f'Size of train set: {len(train)}\nSize of test set: {len(test)}')

Size of train set: 19574
Size of test set: 10514


# 7. Training the model

In [368]:
classifier = NaiveBayesClassifier.train(train)

In [369]:
classifier.show_most_informative_features(15)

Most Informative Features
                    baal = 2              non-en : englis =      2.6 : 1.0
                   hadar = 1              non-en : englis =      2.6 : 1.0
                   jubal = 1              non-en : englis =      2.6 : 1.0
                 magdiel = 1              non-en : englis =      2.6 : 1.0
                  merari = 1              non-en : englis =      2.6 : 1.0
                   abida = 1              non-en : englis =      1.8 : 1.0
                   arodi = 1              non-en : englis =      1.8 : 1.0
                   hamor = 11             non-en : englis =      1.8 : 1.0
                   jabal = 1              non-en : englis =      1.8 : 1.0
                  jemuel = 1              non-en : englis =      1.8 : 1.0
                   ludim = 1              non-en : englis =      1.8 : 1.0
                   magog = 1              non-en : englis =      1.8 : 1.0
                    moab = 2              non-en : englis =      1.8 : 1.0

# 8. Testing and Querying the model

## Performance indicators
The performance of the classifier on the constructed test set is analyzed with a confusion matrix and the standard classification metrics: accuracy, precision, recall and F-measure.\
These are calculated from the predictive classification table, known as the Confusion Matrix, in which an instance is "relevant" when it belongs to the class being evaluated:

- TN (True Negative): number of correct predictions that an instance is irrelevant
- FP (False Positive): number of incorrect predictions that an instance is relevant
- FN (False Negative): number of incorrect predictions that an instance is irrelevant
- TP (True Positive): number of correct predictions that an instance is relevant

The metrics are defined as:

- Accuracy (ACC): the proportion of the total number of predictions that were correct
    - Accuracy = (TN + TP) / (TN + FN + FP + TP)
- Precision (PREC): the proportion of the instances predicted as relevant that were actually relevant
    - Precision = TP / (FP + TP)
- Recall (REC): the proportion of the relevant instances that were correctly identified
    - Recall = TP / (FN + TP)
- F-Measure (FM): the harmonic mean of precision and recall
    - F-Measure = (2 x REC x PREC) / (REC + PREC)
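As a worked example of these formulas, the sketch below plugs in the counts from the confusion matrix printed later in this notebook, taking `english` as the relevant class. The variable names are chosen for illustration.

```python
# Counts read from the notebook's confusion matrix (rows are the reference
# labels, columns are the predictions), with "english" as the positive class.
tp, fn = 5455, 65   # english instances predicted english / non-english
fp, tn = 4933, 61   # non-english instances predicted english / non-english

accuracy = (tn + tp) / (tn + fn + fp + tp)
precision = tp / (fp + tp)
recall = tp / (fn + tp)
f_measure = (2 * recall * precision) / (recall + precision)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F-measure: {f_measure:.4f}")
```

Rounded to four decimals, these values agree with the `english` row of `cm.evaluate()` and with the accuracy returned by `accuracy(classifier, test)`.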

In [370]:
print(accuracy(classifier, test))

0.5246338215712384


In [371]:
test_wolabes = [item[0] for item in test]

In [372]:
test_classified = classifier.classify_many(test_wolabes)

In [373]:
reference = [elem[1]
    for elem in test 
]

In [374]:
cm = ConfusionMatrix(reference,test_classified)

In [375]:
print(cm)

            |         n |
            |         o |
            |         n |
            |         - |
            |    e    e |
            |    n    n |
            |    g    g |
            |    l    l |
            |    i    i |
            |    s    s |
            |    h    h |
------------+-----------+
    english |<5455>  65 |
non-english | 4933  <61>|
------------+-----------+
(row = reference; col = test)



In [376]:
print(cm.evaluate())

        Tag | Prec.  | Recall | F-measure
------------+--------+--------+-----------
    english | 0.5251 | 0.9882 | 0.6858
non-english | 0.4841 | 0.0122 | 0.0238



# Employability
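Each row of the evaluation table has to be read as the one in which its own label is the positive class, so the first row is the one that describes the detection of English. Its F-measure of 0.6858 comes almost entirely from the classifier labeling nearly everything as English (recall 0.9882, precision 0.5251). With an overall accuracy of about 0.52 and an F-measure of 0.0238 for non-English, I don't think it is actually employable as a classifier.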
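One way to put the accuracy in context is to compare it with a majority-class baseline, that is, a classifier that always predicts the most frequent label in the test set. The baseline is an addition for illustration and is not part of the original analysis; the counts are taken from the confusion matrix above.

```python
# Reference totals per label, read from the rows of the confusion matrix.
english_total = 5455 + 65
non_english_total = 4933 + 61
total = english_total + non_english_total

# Accuracy of always predicting the most frequent label.
baseline_accuracy = max(english_total, non_english_total) / total

# Accuracy of the trained classifier: the diagonal of the confusion matrix.
classifier_accuracy = (5455 + 61) / total

print(f"Majority baseline: {baseline_accuracy:.4f}")
print(f"Classifier:        {classifier_accuracy:.4f}")
```

The classifier does not improve on this baseline, which supports the conclusion above.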
