# How to Detect Latin-Based Languages

To let an NLP preprocessing pipeline present results to users in their preferred language, we found it necessary to automatically determine the language of a sample of text.

In this project, we present several ways to do this in Python, starting from simple frequency analysis. We also evaluate each solution along three dimensions: detection accuracy, execution time, and ease of use.

## Experimental setup

We will use the genesis corpus from nltk, which is easily available.

In [1]:
import nltk
nltk.download('genesis')

[nltk_data] Downloading package genesis to /home/sdg/nltk_data...
[nltk_data]   Package genesis is already up-to-date!


True

The genesis corpus contains text in six languages: Finnish, French, German, Portuguese, Swedish, and English (in three different versions). Unfortunately, no Polish was found here!

The writing style is very formal and rather outdated, so it isn't representative of the typical context in which language detection would be used. The good thing is that the corpus is already labeled.

We use the genesis corpus solely for testing; to train the classifiers of our custom solutions, we will use other data sources.

Accuracy is computed over the predictions for each sentence of the corpus, and execution time is measured for predicting the complete dataset.

External dependencies in addition to NLTK are NumPy and pandas.

### Dataset creation

We begin by creating a pandas DataFrame containing all sentences with their respective labels.

In [2]:
import pandas as pd
import numpy as np
from nltk.corpus import genesis as dataset

dfs  = []
for ids in dataset.fileids():
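    # Build the frame from a plain list: sentences have different lengths.
    df = pd.DataFrame({'sentences': list(dataset.sents(ids))})
    # Slice off the '.txt' suffix; str.strip('.txt') would strip characters, not the suffix.
    df['label'] = ids[:-len('.txt')] if ids not in {'english-kjv.txt', 'english-web.txt', 'lolcat.txt'} else 'english'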
    dfs.append(df)
sentences = pd.concat(dfs)
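
Deriving a label from a file id is easy to get subtly wrong, because `str.strip('.txt')` removes any of the characters `.`, `t` and `x` from both ends rather than the literal suffix. A minimal sketch of a safer helper follows; the name `label_from_fileid` and its `merged` parameter are illustrative assumptions, not part of the original notebook.

```python
def label_from_fileid(fileid, merged=('english-kjv.txt', 'english-web.txt', 'lolcat.txt')):
    """Map a corpus file id such as 'french.txt' to a language label."""
    if fileid in merged:
        # Several files are grouped under a single 'english' label.
        return 'english'
    suffix = '.txt'
    # Remove the suffix only when it is really there.
    return fileid[:-len(suffix)] if fileid.endswith(suffix) else fileid


print(label_from_fileid('swedish.txt'))
print(label_from_fileid('lolcat.txt'))
print('text.txt'.strip('.txt'))  # strip() eats characters, not a suffix
```

The last line shows the pitfall: a file id that starts or ends with `t`, `x` or `.` would lose more than its extension.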

## Naive solution (baseline)

We first present a naive solution relying on stop words (the most common words in a language), using the nltk stopwords corpus.

1. We create a dictionary mapping each stop word to the languages in which it appears.
2. Note that this dictionary includes languages which are not present in the genesis corpus, such as Norwegian or Danish.
3. We keep them to ensure a fair comparison between custom solutions and external libraries (which have no restriction on which languages might be present).

In [3]:
from nltk.corpus import stopwords
from collections import defaultdict

languages = stopwords.fileids()
stopwords_dict = defaultdict(list)
for l in languages:
    for sw in stopwords.words(l):
        stopwords_dict[sw].append(l)

1. For each sentence, represented as a list of tokens, we count the stop words of each language present in the sentence, using a dictionary to aggregate the counts.
2. We then predict the language with the highest count (if the dictionary is not empty; otherwise we predict 'unknown').
3. To break ties, we choose at random among the top-scoring languages.

In [4]:
from collections import defaultdict, Counter
import random

def predict_language_naive(sentence):
    random.seed(0)
    cnt = Counter()
    cnt.update(language
              for word in sentence
              for language in stopwords_dict.get(word, ()))
    if not cnt:
        return 'unknown'
        
    m = max(cnt.values())
    return random.choice([k for k, v in cnt.items() if v == m])
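
To see the voting at work on a toy case, here is a self-contained sketch. The miniature table `toy_stopwords` and the function name `vote` are made up for illustration, and the tie-break uses a local `random.Random` instance (an assumption of this sketch) so that the global random state is left untouched.

```python
import random
from collections import Counter

# Hypothetical miniature stop-word table: word -> languages that list it.
toy_stopwords = {
    'the': ['english'],
    'and': ['english'],
    'et': ['french'],
    'la': ['french', 'spanish'],
    'y': ['spanish'],
}


def vote(tokens, table, seed=0):
    """Return the language with the most stop-word hits, or 'unknown'."""
    counts = Counter(lang for tok in tokens for lang in table.get(tok, ()))
    if not counts:
        return 'unknown'
    best = max(counts.values())
    tied = sorted(lang for lang, n in counts.items() if n == best)
    # Reproducible tie-break among the top-scoring languages.
    return random.Random(seed).choice(tied)


print(vote(['the', 'cat', 'and', 'la', 'dog'], toy_stopwords))
print(vote(['xyzzy'], toy_stopwords))
```

In the first call English collects two votes against one each for French and Spanish; in the second no token is a known stop word, so the function falls back to 'unknown'.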

Accuracy is computed as follows:

In [5]:
def compute_accuracy(predictor):
    return (sentences['sentences'].apply(predictor) == sentences['label']).sum() / len(sentences)

In [6]:
compute_accuracy(predict_language_naive)

0.92565982404692082

*Note that accuracy may not be the ideal metric here, since the class distribution is somewhat unbalanced, with English being three times as frequent as any other language.*
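
One way to look past class imbalance is to compute the recall of each language separately and average the results. The sketch below uses only the standard library; the function names and the tiny label lists are invented for illustration and are not results from the genesis corpus.

```python
from collections import Counter


def per_class_recall(actual, predicted):
    """Fraction of correctly predicted sentences for each true label."""
    totals = Counter(actual)
    hits = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: hits[label] / totals[label] for label in totals}


def macro_recall(actual, predicted):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(actual, predicted)
    return sum(recalls.values()) / len(recalls)


# Made-up labels: English dominates, as in an unbalanced test set.
actual = ['english'] * 6 + ['french'] * 2 + ['german'] * 2
predicted = ['english'] * 6 + ['french', 'english'] + ['english', 'english']

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(accuracy, macro_recall(actual, predicted))
```

On this toy data plain accuracy is 0.7 while the macro-averaged recall is only 0.5, because the errors are concentrated in the rarer languages.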

Execution time is measured with the `%timeit` magic command.

In [7]:
%timeit compute_accuracy(predict_language_naive)

299 ms ± 24.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
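
Outside a notebook, a comparable measurement can be sketched with the standard `timeit` module. The helper name `time_callable` and the placeholder workload are assumptions for illustration; note that this sketch reports the best run, whereas `%timeit` reports a mean and standard deviation.

```python
import timeit


def time_callable(func, repeat=3, number=1):
    """Return the best wall-clock time in seconds for one call of func."""
    runs = timeit.repeat(func, repeat=repeat, number=number)
    return min(runs) / number


def placeholder_workload():
    # Stand-in for something like compute_accuracy(predictor).
    return sum(i * i for i in range(10_000))


best = time_callable(placeholder_workload)
print(f'{best * 1000:.3f} ms per call (best of 3)')
```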


As you can see, this solution is quite fast, but not very accurate. It relies only on fixed stop-word lists rather than on a dedicated language-detection library.

## External libraries

We now have a baseline, so we can benchmark a few external libraries to see how well they perform. They may well be more accurate, but at what cost in execution time?

So we tested two libraries: langdetect and pycld2.

### langdetect

The official documentation can be found [here](https://pypi.python.org/pypi/langdetect?). It is a Python port of a Google library. Unfortunately, the code is not very Pythonic... However, it is easy to install with pip.

In [8]:
from langdetect import detect, lang_detect_exception

*Important:*

1. The langdetect API takes whole sentences (not tokenised) as input, so we first join the tokens of each sentence.
2. The detect function may raise an exception when it cannot determine the language, in which case we want an 'unknown' label. Our wrapper should catch the exception.
3. The output is the ISO 639-1 code of the language, which is not very user friendly, so we use a mapping dictionary to convert it.

In [9]:
iso_to_human = {'aa': 'afar', 'ab': 'abkhazian', 'af': 'afrikaans', 'ak': 'akan', 'am': 'amharic', 'an': 'aragonese', 'ar': 'arabic', 'as': 'assamese', 'av': 'avar', 'ay': 'aymara', 'az': 'azerbaijani', 'ba': 'bashkir', 'be': 'belarusian', 'bg': 'bulgarian', 'bh': 'bihari', 'bi': 'bislama', 'bm': 'bambara', 'bn': 'bengali', 'bo': 'tibetan', 'br': 'breton', 'bs': 'bosnian', 'ca': 'catalan', 'ce': 'chechen', 'ch': 'chamorro', 'co': 'corsican', 'cr': 'cree', 'cs': 'czech', 'cu': 'old bulgarian', 'cv': 'chuvash', 'cy': 'welsh', 'da': 'danish', 'de': 'german', 'dv': 'divehi', 'dz': 'dzongkha', 'ee': 'ewe', 'el': 'greek', 'en': 'english', 'eo': 'esperanto', 'es': 'spanish', 'et': 'estonian', 'eu': 'basque', 'fa': 'persian', 'ff': 'peul', 'fi': 'finnish', 'fj': 'fijian', 'fo': 'faroese', 'fr': 'french', 'fy': 'west frisian', 'ga': 'irish', 'gd': 'scottish gaelic', 'gl': 'galician', 'gn': 'guarani', 'gu': 'gujarati', 'gv': 'manx', 'ha': 'hausa', 'he': 'hebrew', 'hi': 'hindi', 'ho': 'hiri motu', 'hr': 'croatian', 'ht': 'haitian', 'hu': 'hungarian', 'hy': 'armenian', 'hz': 'herero', 'ia': 'interlingua', 'id': 'indonesian', 'ie': 'interlingue', 'ig': 'igbo', 'ii': 'sichuan yi', 'ik': 'inupiak', 'io': 'ido', 'is': 'icelandic', 'it': 'italian', 'iu': 'inuktitut', 'ja': 'japanese', 'jv': 'javanese', 'kg': 'kongo', 'ki': 'kikuyu', 'kj': 'kuanyama', 'kk': 'kazakh', 'kl': 'greenlandic', 'km': 'cambodian', 'kn': 'kannada', 'ko': 'korean', 'kr': 'kanuri', 'ks': 'kashmiri', 'ku': 'kurdish', 'kv': 'komi', 'kw': 'cornish', 'ky': 'kirghiz', 'la': 'latin', 'lb': 'luxembourgish', 'lg': 'ganda', 'li': 'limburgian', 'ln': 'lingala', 'lo': 'laotian', 'lt': 'lithuanian', 'lv': 'latvian', 'mg': 'malagasy', 'mh': 'marshallese', 'mi': 'maori', 'mk': 'macedonian', 'ml': 'malayalam', 'mn': 'mongolian', 'mo': 'moldovan', 'mr': 'marathi', 'ms': 'malay', 'mt': 'maltese', 'my': 'burmese', 'na': 'nauruan', 'nd': 'north ndebele', 'ne': 'nepali', 'ng': 'ndonga', 'nl': 'dutch', 'nn': 'norwegian nynorsk', 
'no': 'norwegian', 'nr': 'south ndebele', 'nv': 'navajo', 'ny': 'chichewa', 'oc': 'occitan', 'oj': 'ojibwa', 'om': 'oromo', 'or': 'oriya', 'os': 'ossetian', 'pa': 'punjabi', 'pi': 'pali', 'pl': 'polish', 'ps': 'pashto', 'pt': 'portuguese', 'qu': 'quechua', 'rm': 'raeto romance', 'rn': 'kirundi', 'ro': 'romanian', 'ru': 'russian', 'rw': 'rwandi', 'sa': 'sanskrit', 'sc': 'sardinian', 'sd': 'sindhi', 'sg': 'sango', 'sh': 'serbo-croatian', 'si': 'sinhalese', 'sk': 'slovak', 'sl': 'slovenian', 'sm': 'samoan', 'sn': 'shona', 'so': 'somali', 'sq': 'albanian', 'sr': 'serbian', 'ss': 'swati', 'st': 'southern sotho', 'su': 'sundanese', 'sv': 'swedish', 'sw': 'swahili', 'ta': 'tamil', 'te': 'telugu', 'tg': 'tajik', 'th': 'thai', 'ti': 'tigrinya', 'tk': 'turkmen', 'tl': 'tagalog', 'tn': 'tswana', 'to': 'tonga', 'tr': 'turkish', 'ts': 'tsonga', 'tt': 'tatar', 'tw': 'twi', 'ty': 'tahitian', 'ug': 'uyghur', 'ur': 'urdu', 've': 'venda', 'vi': 'vietnamese', 'vo': 'volapük', 'wa': 'walloon', 'wo': 'wolof', 'xh': 'xhosa', 'yi': 'yiddish', 'yo': 'yoruba', 'za': 'zhuang', 'zh': 'chinese', 'zu': 'zulu'}


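def detect_without_exception(s):
    try:
        code = detect(' '.join(s))
    except lang_detect_exception.LangDetectException:
        return 'unknown'
    # Codes missing from the mapping (e.g. 'zh-cn') are returned unchanged
    # instead of raising a KeyError.
    return iso_to_human.get(code, code)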

We can now measure the prediction accuracy and the execution time.

In [10]:
compute_accuracy(detect_without_exception)

0.96539589442815255

In [11]:
%timeit compute_accuracy(detect_without_exception)

51.3 s ± 966 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Classification accuracy improves, but at the cost of being more than 150 times slower than the baseline, which is unacceptable in most use cases.

### pycld2

[pycld2](https://pypi.python.org/pypi/pycld2/) provides Python bindings around Google's Compact Language Detector 2 (CLD2).

The API exposes more details than langdetect, providing a confidence percentage for each language detected, and since it's a wrapper on a C++ compiled binary, we can hope that it'll be faster. Easily installed with pip.

It is the underlying library used by [Polyglot](https://pypi.python.org/pypi/polyglot), an NLP library offering a wide variety of tools for multilingual use cases.

Like langdetect, pycld2 takes whole sentences as input, so we again join the tokens of each sentence with spaces.

In [12]:
import pycld2 as cld2

compute_accuracy(lambda s: cld2.detect(' '.join(s), bestEffort=True)[2][0][0].lower())

0.97375366568914956
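
The indexing `[2][0][0]` picks the name of the first entry in the details returned by the detector. A more defensive way to read such a result is sketched below; it assumes the result has the shape `(is_reliable, bytes_found, details)` with `(name, code, percent, score)` entries, and the helper `top_language`, its `min_percent` threshold and the sample tuple are all hypothetical. Unlike the notebook code, this sketch also rejects results flagged as unreliable.

```python
def top_language(result, min_percent=0):
    """Pick the first detected language from a CLD2-style result tuple.

    `result` is assumed to look like (is_reliable, bytes_found, details),
    where details holds (name, code, percent, score) tuples, best first.
    """
    is_reliable, _bytes_found, details = result
    name, _code, percent, _score = details[0]
    if not is_reliable or percent < min_percent:
        return 'unknown'
    return name.lower()


# Hand-written sample in the assumed shape, not real library output.
sample = (True, 40, (('FRENCH', 'fr', 97, 1024.0), ('Unknown', 'un', 0, 0.0)))
print(top_language(sample))
print(top_language(sample, min_percent=99))
```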

In [13]:
%timeit compute_accuracy(lambda s: cld2.detect(' '.join(s), bestEffort=True)[2][0][0].lower())

134 ms ± 776 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


The accuracy is actually slightly better than with langdetect, and it is even faster than our naive solution.

The only downside is that the GitHub repository has not been updated since 2015, and the documentation seems out of sync. Furthermore, the computation is not made in Python, which makes it harder to alter the code to suit custom needs.

One last thing we can try is to bias the algorithm towards choosing English more often, given that it is the most frequent language.

In [14]:
compute_accuracy(lambda s: cld2.detect(' '.join(s), bestEffort=True, hintLanguage='ENGLISH')[2][0][0].lower())

0.96796187683284463

The hint does not improve accuracy (it drops slightly, from 97.4% to 96.8%), presumably because we have such short pieces of text to label, but it might be of use in other contexts.

## Improvements on the naive solution

Can we improve on the 97% accuracy of an off-the-shelf solution? Let's first see if we can beat our naive solution.

### Training dataset

In order to improve our naive solution, we will need another source of multilingual text - using the genesis corpus would be cheating since it's our test set. 

We use the [European Parliament Proceedings Parallel Corpus](http://www.statmt.org/europarl/) which we can download with nltk.

In [15]:
from nltk.corpus import europarl_raw

We can obtain the list of words for each language as follows:

In [16]:
europarl_raw.english.words()

['Resumption', 'of', 'the', 'session', 'I', 'declare', ...]

We define the list of languages in our dataset:

In [17]:
languages = ['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'italian', 'portuguese', 'spanish', 'swedish']

We define a small function for cleaning our lists of tokens:

In [18]:
def clean_tokens(tokens):
    return [token.lower() for token in tokens if token.isalpha()]

### Weight stop words

Observe that some stop words are present in more than one language. Such words are less discriminative, so we assign each stop word a weight inversely proportional to the number of languages in which it appears.

In [19]:
weighted_stopwords_dict = defaultdict(dict)
for sword, langs in stopwords_dict.items():
    coeff = 1/ len(langs)
    for lang in langs:
        weighted_stopwords_dict[sword][lang] = coeff

In [20]:
def predict_language_weighted_stopwords(sentence):
    random.seed(0)
    cnt = Counter()
    for word in sentence:
        if word in weighted_stopwords_dict:
            cnt.update(weighted_stopwords_dict[word])

    if not cnt:
        return 'unknown'
    m = max(cnt.values())
    return random.choice([k for k, v in cnt.items() if v == m])
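
The effect of the weighting is easiest to see on a toy table. In the sketch below, `toy_stopwords`, `weights` and `score` are illustrative names and the word lists are made up; a word shared by four languages contributes 0.25 to each of them, while a word unique to one language contributes 1.

```python
from collections import Counter, defaultdict

# Hypothetical miniature table: word -> languages that list it.
toy_stopwords = {
    'de': ['dutch', 'french', 'portuguese', 'spanish'],
    'het': ['dutch'],
    'la': ['french', 'spanish'],
}

# Each word spreads a total weight of 1 across its languages.
weights = defaultdict(dict)
for word, langs in toy_stopwords.items():
    for lang in langs:
        weights[word][lang] = 1 / len(langs)


def score(tokens):
    """Sum the per-language weights of every known stop word."""
    totals = Counter()
    for tok in tokens:
        totals.update(weights.get(tok, {}))  # adds the float weights per language
    return dict(totals)


print(score(['de', 'het']))
```

Here Dutch ends up with 1.25 and the other three languages with 0.25 each, so the unambiguous word dominates the decision.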

In [21]:
compute_accuracy(predict_language_weighted_stopwords)

0.92184750733137832

In [22]:
%timeit compute_accuracy(predict_language_weighted_stopwords)

413 ms ± 47.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Well! This weighting scheme has not improved on our naive solution.

### Use diacritics

Diacritics, as defined by Wikipedia, are glyphs added to a letter. When present, they can be quite distinctive of a given language, so we want to use them in addition to stop words to improve our classification accuracy for Western languages.

First, we must determine the list of diacritics used in each language; we use the European Parliament Proceedings for that.

In the first line of the function, we get a list of all characters present in the proceedings for a given language, after cleaning the tokens (we keep only alphabetic words and cast everything to lower case).

Next, we count the number of occurrences of each character. We then remove characters occurring 500 times or fewer, since they can come from foreign words such as surnames or place names, and we only want to keep the diacritics typical of a language.

Lastly, we remove non-accented characters (ASCII letters) from the set.

In [23]:
import string

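def get_diacritics(language):
    # All characters of the cleaned (alphabetic, lower-cased) Europarl tokens.
    char_list = list(''.join(clean_tokens(getattr(europarl_raw, language).words())))
    cnt = Counter(char_list)
    # Keep only characters seen more than 500 times, then drop plain ASCII letters.
    frequent_chars = {k for k, v in cnt.items() if v > 500}
    return frequent_chars - set(string.ascii_lowercase)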
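
The same idea can be tried on a handful of tokens without downloading any corpus. This is only a sketch: the function name `frequent_non_ascii_chars`, the `min_count` parameter and the token list are invented for illustration, with a much lower threshold than the 500 used on Europarl.

```python
import string
from collections import Counter


def frequent_non_ascii_chars(tokens, min_count):
    """Non-ASCII letters occurring more than min_count times in alphabetic tokens."""
    counts = Counter(''.join(tok.lower() for tok in tokens if tok.isalpha()))
    frequent = {ch for ch, n in counts.items() if n > min_count}
    return frequent - set(string.ascii_lowercase)


# Made-up tokens: mostly German words plus one French loanword.
toy_tokens = ['Frühstück', 'für', 'Müller', ',', 'Café', 'über', 'Straße', 'grüßen']
print(sorted(frequent_non_ascii_chars(toy_tokens, min_count=1)))
```

The single `é` from the loanword falls below the threshold and is discarded, which is the role the 500-occurrence cutoff plays on the real corpus.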

We print the list of diacritics per language.

In [24]:
diacritics = {language: list(get_diacritics(language)) for language in languages}
diacritics

{'danish': ['æ', 'å', 'ø', 'é'],
 'dutch': ['ë', 'é'],
 'english': [],
 'finnish': ['ö', 'ä'],
 'french': ['à', 'û', 'ô', 'ê', 'è', 'ç', 'é', 'î'],
 'german': ['ö', 'ü', 'ä', 'ß'],
 'italian': ['à', 'ò', 'ù', 'è', 'ì', 'é'],
 'portuguese': ['à', 'ú', 'ê', 'ã', 'ç', 'á', 'é', 'í', 'ó', 'õ', 'â'],
 'spanish': ['ú', 'ñ', 'á', 'é', 'í', 'ó'],
 'swedish': ['ö', 'å', 'ä']}

The lists look right (at least for the languages I know), and the extraction runs reasonably fast for a naive solution.

Now that we have a list of diacritics, we can use the same method as for stop words to detect the language.

First, let's try only the diacritics.

In [25]:
diacritics_transposed = defaultdict(list)
for language, chars in diacritics.items():
    for char in chars:
        diacritics_transposed[char].append(language)

        
def predict_language_diacritics(sentence):
    cnt = Counter()
    cnt.update(language
             for ch in ''.join(sentence).lower()
             for language in diacritics_transposed[ch]
             if ch not in string.ascii_lowercase)
    if not cnt:
        return 'english'
    m = max(cnt.values())
    return random.choice([k for k, v in cnt.items() if v == m])

In [26]:
compute_accuracy(predict_language_diacritics)

0.65058651026392966

In [27]:
%timeit compute_accuracy(predict_language_diacritics)

169 ms ± 5.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Considering such small chunks of text, we are far from guaranteed to have diacritics, which could explain the low accuracy. 

Let's check whether a confusion matrix supports this hypothesis.

We use the pandas-ml library, which combines the power of scikit-learn with the readability of pandas.

In [28]:
from pandas_ml import ConfusionMatrix
ConfusionMatrix(sentences['label'], sentences['sentences'].apply(predict_language_diacritics))


Predicted   danish  dutch  english  finnish  french  german  italian  \
Actual                                                                 
danish           0      0        0        0       0       0        0   
dutch            0      0        0        0       0       0        0   
english          0      0     4521        0       0       0        0   
finnish          0      0      227      648       0     671        0   
french          66    115      295        0     646       0      462   
german           0     10      876      152       0     687        0   
italian          0      0        0        0       0       0        0   
portuguese      12     10      198        0      77       1       18   
spanish          0      0        0        0       0       0        0   
swedish         43      1       35       95       1      89        1   
__all__        121    136     6152      895     724    1448      481   

Predicted   portuguese  spanish  swedish  __all__  
Actual     

*Important:* the confusion matrix gives us two very interesting pieces of information.

First, as you can see from this report, a lot of sentences are predicted as English; in fact, any sentence with no diacritics is predicted as English, as there are no diacritics in the English list. In short sentences, whatever the language, there may well be no diacritics at all.

Second, languages that share diacritics are confused with one another. Two of the three Swedish diacritics (ö and ä) are also the Finnish ones and appear in the German list too, and our naive implementation returns a language at random amongst the most probable ones in case of a tie. In the columns shown above, for example, 671 Finnish sentences are predicted as German and 95 Swedish sentences as Finnish.
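
Since pandas-ml may not be available in every environment, the most frequent confusions can also be tallied with the standard library alone. The sketch below is a minimal alternative; the function names and the short label lists are invented for illustration and are not taken from the matrix above.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))


def most_confused(actual, predicted, top=3):
    """Return the most frequent misclassified (actual, predicted) pairs."""
    errors = Counter({pair: n
                      for pair, n in confusion_counts(actual, predicted).items()
                      if pair[0] != pair[1]})
    return errors.most_common(top)


# Made-up labels for illustration only.
actual = ['finnish', 'finnish', 'finnish', 'swedish', 'swedish', 'german']
predicted = ['german', 'german', 'finnish', 'finnish', 'swedish', 'german']
print(most_confused(actual, predicted))
```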

OK! We can now try to use the diacritics in addition to the stop words.

In [29]:
def predict_language_stopwords_diacritics(sentence):
    random.seed(0)
    cnt = Counter()
    cnt.update(language
              for word in sentence
              for language in stopwords_dict.get(word, ()))
    cnt.update(language
               for ch in ''.join(sentence).lower()
               for language in diacritics_transposed[ch]
               if ch not in string.ascii_lowercase)
    if not cnt:
        return 'unknown'
        
    m = max(cnt.values())
    return random.choice([k for k, v in cnt.items() if v == m])

In [30]:
compute_accuracy(predict_language_stopwords_diacritics)

0.93995601173020527

In [31]:
%timeit compute_accuracy(predict_language_stopwords_diacritics)

463 ms ± 63.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


So we obtain a gain in accuracy, at the expense of a slightly increased running time.

## Sophisticated approach

### Learn a classifier based on n-gram embeddings

We use Facebook's fastText library for text classification. To do this, we need a dataset to train our classifier on; we use the European Parliament Proceedings corpus.

More information about fastText can be found in the [documentation](https://fasttext.cc/).

In [32]:
from pyfasttext import FastText
from sklearn.model_selection import train_test_split
from nltk import ngrams

fastText trains a linear classifier on top of hidden embeddings of words and word n-grams (tuples of n consecutive words). Let's begin by creating a set of trigrams to learn from.

In [33]:
doc_set = [(language, clean_tokens(getattr(europarl_raw, language).words())) for language in languages]

trigrams_set = [(language, ' '.join(trigram)) for (language, words) in doc_set
                                    for trigram in ngrams(words, 3)]
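
To make the shape of these training samples concrete, here is a small standard-library sketch of word trigrams joined into strings. The helper `word_ngrams` is a hypothetical stand-in written for this example, not the nltk function used above.

```python
def word_ngrams(tokens, n=3):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(word_ngrams(['resumption', 'of', 'the', 'session', 'i'], n=3))
```

A sentence of five tokens yields three overlapping trigrams, and a sentence shorter than `n` yields none.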

In [34]:
train_set, test_set = train_test_split(trigrams_set, test_size = 0.30, random_state=0)

pyfasttext is a wrapper around a command-line tool, so we need to dump the training set to a file before training the classifier.

In [35]:
with open('train_data_europarl.txt', 'w', encoding='utf-8') as f:  # UTF-8 keeps the diacritics intact
    for label, words in train_set:
        f.write('__label__{} {}\n'.format(label, words))
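
Each line of the training file carries the label, prefixed with `__label__`, followed by the text. The sketch below writes two made-up samples to an in-memory buffer and reads them back; the helper names `write_fasttext_lines` and `parse_fasttext_line` are assumptions introduced for this illustration.

```python
import io


def write_fasttext_lines(pairs, handle):
    """Write (label, text) pairs, one per line, with the __label__ prefix."""
    for label, text in pairs:
        handle.write('__label__{} {}\n'.format(label, text))


def parse_fasttext_line(line):
    """Split a line back into (label, text)."""
    label, _, text = line.rstrip('\n').partition(' ')
    return label[len('__label__'):], text


# Made-up samples written to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_fasttext_lines([('english', 'resumption of the'),
                      ('french', 'reprise de la')], buffer)
lines = buffer.getvalue().splitlines()
print(lines)
print(parse_fasttext_line(lines[1]))
```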

In [36]:
model = FastText()
model.supervised(input='train_data_europarl.txt', output='model_europarl', epoch=10, lr=0.7, wordNgrams=3)

We can then evaluate the accuracy on the training set and on the held-out test set.

In [37]:
# train accuracy
labels, samples = np.split(np.array(train_set), 2, axis=1)
(np.array(model.predict(samples.T[0])) == labels).sum() / len(train_set)

0.99680029382291524

In [38]:
# test accuracy
labels, samples = np.split(np.array(test_set), 2, axis=1)
(np.array(model.predict(samples.T[0])) == labels).sum() / len(test_set)

0.98648199595051833

Now, let's apply this model to our initial dataset.

In [39]:
(model.predict(sentences['sentences'].str.join(' ') + '\n') == sentences['label'].values[:, None]).sum()/len(sentences)

0.97514662756598236

In [40]:
%timeit model.predict(sentences['sentences'].str.join(' ') + '\n')

204 ms ± 22.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Summary

We sum up our findings in the following table.

| Algorithm              | Accuracy | Execution time | Comments                                                    |
|------------------------|----------|----------------|-------------------------------------------------------------|
| Stopwords based        | 92.6%    | 299 ms         | Baseline                                                    |
| Weighted stopwords     | 92.2%    | 413 ms         |                                                             |
| Diacritics             | 65.1%    | 169 ms         |                                                             |
| Diacritics + stopwords | 94.0%    | 463 ms         |                                                             |
| langdetect             | 96.5%    | 51,300 ms      | Too slow to be of any use                                   |
| pycld2                 | 97.4%    | 134 ms         | External library; handles a large number of languages       |
| fastText               | 97.5%    | 204 ms         | Needs a training corpus; can be trained on specialized data |

From the results above, two options stand out: pycld2, which can handle over 165 languages and does not need any labeled data, and fastText, which in our opinion is a worthy alternative if one has specialized data on which to train it.

External libraries can also handle non-European languages, which use non-Latin scripts and in which the notion of "words" may be ill-defined. Our custom solution does not have the same ambition, and in addition requires a labeled corpus to be trained on.

Accuracy does not tell the whole story here, so using a confusion matrix to see what kind of mistakes each classifier makes is paramount. Apart from the one shown for the diacritics classifier, confusion matrices have not been included here and are left for future study.