<p> <i>Martin K. Sova</i></p>
<h1><b>Introduction</b></h1>

The solution to the Language Detection challenge is split into three parts: <u>Preprocessing training data</u>, <u>Building the models and predicting on test data</u>, and <u>Analysis of the models and conclusion</u>. My submission follows these three parts in the order listed, to give my approach a clear structure.

<i>Note: I am aware that I did not use docstrings, although I still provide inline comments where I deemed them appropriate. I chose to use markdown cells instead, since Jupyter Notebook provides them and they are more pleasant to read. Before each module or function is used, a markdown cell explains why it is needed and describes its parameters and return values.</i>

<h1><b>Part 1</b></h1>
<h2><u>Preprocessing training data</u></h2>

<h3><b>Step 1.1: Imports</b></h3>

The <code>re</code> module is imported to make use of regular expression matching.

In [1]:
import re

The <code>csv</code> module is utilized for the import format.

In [2]:
import csv

The <code>pandas</code> module provides an array of data analysis functions.

In [3]:
import pandas

The <code>os</code> module allows for simple operations such as opening files; it is 'a portable way of using operating system dependent functionality' (https://docs.python.org/).

In [4]:
import os

I make use of the <code>glob</code> module, as seen in the <code>loadData</code> function for instance. The advantage of <code>glob</code> is that I can find all the path names that match a given pattern (absolute or relative). <code>glob.glob(pattern)</code> returns a (potentially empty) list of path names that match its argument, which in <code>loadData</code> is the parameter <code>path_name</code>; for example, <code>glob.glob('thesis/*.doc')</code> returns a list of all path names for .doc files found in the folder <i>thesis</i>.

In [5]:
import glob
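As a minimal sketch of the pattern matching described above, the snippet below creates a temporary folder with a few empty files and lists only those matching a <code>*.doc</code> pattern. The folder, the file names and the helper <code>list_matching</code> are made up for illustration and are not part of the submission.

```python
import glob
import os
import tempfile

def list_matching(pattern):
    """Return the sorted path names matching a glob pattern (possibly empty)."""
    return sorted(glob.glob(pattern))

# Hypothetical folder layout, created only for this illustration.
with tempfile.TemporaryDirectory() as folder:
    for name in ('intro.doc', 'results.doc', 'notes.txt'):
        open(os.path.join(folder, name), 'w').close()

    doc_files = list_matching(os.path.join(folder, '*.doc'))
    doc_names = [os.path.basename(p) for p in doc_files]
    no_match = list_matching(os.path.join(folder, '*.pdf'))

print(doc_names)   # ['intro.doc', 'results.doc']
print(no_match)    # []
```

A pattern without any match yields an empty list rather than an error, which is why <code>loadData</code> can iterate over the result directly.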

The <code>random</code> module is imported to provide a pseudo-random number generator (used in Step 1.2).

In [6]:
import random

<h3><b> Step 1.2: Reading data from files </b></h3>

The <code>loadData(path_name)</code> function provides a method for extracting relevant text from the file corresponding to the pathname given by the parameter <code>path_name</code>.

The parameter <code>path_name</code> determines the path to files from which data should be extracted.
    
The function returns a list of extracted words from files that have a matching path to the parameter.

In [7]:
def loadData(path_name):

    # Create a list for all words found in the specified text file.
    text_return = []
    
    # All path names that match the parameter.
    all_files = glob.glob(path_name)
    # The use of a smaller sample is justified in the conclusion.
    num_delete = int(len(all_files) / 1.25)
    num_keep = len(all_files) - num_delete
    files = set(random.sample(all_files, num_keep))
    files = [i for i in all_files if i in files]
    
    # Iterate over all found files.
    for f in files:
        # Read the whole file; the context manager closes it afterwards.
        with open(f, encoding='utf8') as input_file:
            input_data = input_file.read()
        # Obtain all words from the text file for pathname 'f'.
        extracted_text = input_data.split(' ')
        for word in extracted_text:
            # Remove any unwanted words before returning the final extracted text.
            if word !='ID' and word !='SPEAKER' and word !='LANGUAGE' and word !='NAME' and word !='CHAPTER' and  word !='<P>':
                # If word satisfies the conditions append to the list of extracted words to be returned.
                text_return.append(word)
    # Return the final list of extracted words.
    return(text_return)
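The sampling arithmetic in <code>loadData</code> can be checked in isolation. The sketch below assumes 100 hypothetical file names and a seeded generator (neither appears in the submission) and shows that dividing by 1.25 discards about four fifths of the files and keeps one fifth, in their original order.

```python
import random

def sample_keep_order(all_files, rng):
    """Keep roughly one fifth of the files, preserving their original order."""
    num_delete = int(len(all_files) / 1.25)   # about 80% of the files
    num_keep = len(all_files) - num_delete    # the remaining ~20%
    chosen = set(rng.sample(all_files, num_keep))
    return [f for f in all_files if f in chosen]

# Hypothetical file names; a seeded generator makes the run repeatable.
all_files = ['ep-{:03d}.txt'.format(i) for i in range(100)]
kept = sample_keep_order(all_files, random.Random(0))

print(len(kept))   # 20
```

Passing a seeded <code>random.Random</code> instance is a design choice of this sketch only; the submission uses the module-level generator, so each run draws a different subset.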

<h3><b> Step 1.3: Denoising the training (and testing) data</b></h3>

The <code>denoise(input_data, data_type)</code> function provides a procedure for removing any unwanted noise from data.

The parameter <code>input_data</code> is the source data to be denoised, and the parameter <code>data_type</code> determines whether the source data is the training or testing data (which, as observed, determines the variable <code>i</code>).

The function returns the text data after removing all noise.

<p>
<i>The <code>pandas.DataFrame.drop_duplicates()</code> function is utilized to remove any duplicate rows from the DataFrame (the use of DataFrame is defined in Step 1.6).</i>
</p>

<p>
<i>The <code>pandas.DataFrame.dropna()</code> function is utilized to drop NA entries from the DataFrame; in other words, it removes any rows with missing values.</i>
</p>

In [8]:
def denoise(input_data, data_type):
    # The text is in column 0 of the training data and in column 1 of the test data.
    if data_type == 'train':
        i = 0
    else:
        i = 1
    # Remove the '<P>' markup first, while its angle brackets are still present.
    input_data[i] = input_data[i].str.replace('<P>', ' ', regex=False)
    # regex=True is passed explicitly: recent pandas versions treat the pattern as a literal by default.
    input_data[i] = input_data[i].str.replace(r'[=<>":-;.,\(\)]', ' ', regex=True)
    input_data[i] = input_data[i].str.replace(r'[0-9]', ' ', regex=True)
    input_data[i] = input_data[i].str.replace(r'\s', ' ', regex=True)
    input_data[i] = input_data[i].str.replace('/', ' ', regex=False)
    # Remove the corpus markup keywords.
    for keyword in ('ID', 'NAME', 'SPEAKER', 'CHAPTER', 'LANGUAGE'):
        input_data[i] = input_data[i].str.replace(keyword, ' ', regex=False)
    input_data[i] = input_data[i].str.strip()
    # Drop empty strings, duplicate rows and missing values.
    input_data = input_data[input_data[i] != '']
    input_data = input_data.drop_duplicates()
    input_data = input_data.dropna()
    return input_data
    

Denoising the extracted text involves removing characters such as digits and punctuation from the training data. It is important to apply the same denoising procedure to both the training and the test data, since digits and punctuation add noise to the test data as well.
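To make the effect of the character class concrete, here is a small sketch that applies the punctuation, digit and whitespace substitutions to a single string with the standard <code>re</code> module. The helper <code>denoise_text</code> and the sample sentence are illustrative assumptions, not part of the submission. Note that inside the character class <code>:-;</code> is read as a range from <code>:</code> to <code>;</code>, so the hyphen itself is kept.

```python
import re

def denoise_text(text):
    """Apply the punctuation, digit and whitespace substitutions to one string."""
    text = re.sub(r'[=<>":-;.,\(\)]', ' ', text)  # listed punctuation
    text = re.sub(r'[0-9]', ' ', text)            # digits
    text = re.sub(r'\s', ' ', text)               # tabs and newlines become spaces
    return text.strip()

# Made-up sample in the style of the corpus.
sample = 'Adoption du procès-verbal (12.03): voir\tannexe.'
tokens = denoise_text(sample).split()

print(tokens)   # ['Adoption', 'du', 'procès-verbal', 'voir', 'annexe']
```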

<h3><b>Step 1.4: Define the dataset labels</b></h3>

Defining a list of labels as they appear in the European Parliament Proceedings Parallel Corpus dataset (i.e. the folder names) allows for a more intuitive implementation: the list can be iterated over, as in Step 1.6, and it also supplies the target labels in Step 2.2.

In [9]:
# These are labels to be classified.
all_labels = ['bg','cs','da','de','el','en','es','et','fi','fr','hu','it','lt','lv','nl','pl','pt','ro','sk','sl','sv']

In total, there are 21 languages for which detection must be supported, so it is imperative that the classifier accurately recognizes all 21 categorical variables.

List of all languages (in order as they appear in the list variable <code>all_labels</code>): <i> Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Dutch, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish. </i>

<h3><b> Step 1.5: Function to return a string of all words of a language</b></h3>

The <code>convert(text)</code> function converts all words of a language from a Series to a list, and returns the concatenation of the strings in the list joined by a given <code>str</code> separator.

The parameter <code>text</code> is the DataFrame whose column <code>0</code> is the Series containing all words of a language.

The return value <code>to_string</code> of the function is the string containing all words joined by the <code>str</code> separator ' '.

In [10]:
def convert(text): 
    text_list = text[0].tolist()
    to_string = ' '.join(text_list)
    return(to_string)

The function <code>pandas.Series.tolist()</code> is used to perform a series to list conversion.

Here the <code>join()</code> function is utilized to return a string by joining the string elements of a sequence with a <code>str</code> separator; in this case, a space ' '.

<h3><b>Step 1.6: <i>Realizing Steps 1.2 and 1.3</i></b></h3>

Utilizing functions <code>loadData(path_name)</code> and <code>denoise(input_data, data_type)</code> to get training data.

The <code>DataFrame()</code> constructor of the <code>pandas</code> module represents the extracted data as a size-mutable tabular data structure.

We have placed all data labels in a list, which means that we can call the <code>loadData</code> and <code>denoise</code> functions from within a for loop to extract and denoise all data and store it in a dictionary keyed by label with just a few lines of code.

In [104]:
data_results = {}
for label in all_labels:
    path_name = 'txt/' + label + '/*.txt'
    data = loadData(path_name)
    data_df = pandas.DataFrame(data)
    data_denoised = denoise(data_df, 'train')
    data_results[label] = data_denoised

<h3><b>Step 1.7:<i> Realizing Step 1.5</i></b></h3>

In [111]:
results = []
# Iterate over all_labels so that the order of the strings matches the order of the labels.
for label in all_labels:
    results.append(convert(data_results[label]))

<h3><b>Step 1.8: A single dataframe</b></h3>

Next, we will convert the list containing strings of all languages defined in Step 1.7 into one data frame.

In [112]:
training_data = pandas.DataFrame(results)

In [113]:
training_data

Unnamed: 0,0
0,Одобряване на протокола от предишното заседани...
1,Schválení zápisu z předchozího zasedání viz zá...
2,NL Formanden Jeg giver ordet til fru Maes s...
3,Altfahrzeuge Der Präsident Nach der Tagesord...
4,΄Εvαρξη της ετήσιας συvόδoυ Πρόεδρος Κηρύσσω τ...
5,Resumption of the session President I declar...
6,Aprobación del Acta de la sesión anterior El P...
7,Uurimiskomisjoni ja ajutise komisjoni moodusta...
8,Romuajoneuvot Puhemies Esityslistalla on seu...
9,Adoption du procès-verbal de la séance précéde...


<h1><b>Part 2</b></h1>
<h2><u>Building the models and predicting on the test data</u></h2>

<h3><b>Step 2.1: Imports</b></h3> 

The <code>sklearn.pipeline</code> module is imported to utilize the <code>Pipeline</code> of transforms with a final estimator.

The reason why I make use of the pipeline is to execute steps that can be cross-validated together while passing varying parameters; the parameters of the different steps can be set using the step names and the parameter names.

scikit-learn.org states "Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods." We observe that in the final estimator we only need to implement <code>.fit()</code>. 

As observed in Step 2.2, we set a list parameter of (name, transform) tuples (implementing fit/transform) chained in a specific order, where the last object is the estimator.

The <code>memory</code> parameter of the <code>sklearn.pipeline.Pipeline(steps, memory=None)</code> class allows the fitted transformers of the pipeline to be cached; however, this is not utilized in my use of the <code>Pipeline</code> class.

In [114]:
from sklearn.pipeline import Pipeline
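As a small sketch of how step names and parameter names combine, the snippet below builds a pipeline with the same three step names used later in Step 2.2 and sets two parameters through <code>set_params</code>. The parameter values are chosen for illustration only, and nothing is fitted here.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Each step is a (name, estimator) tuple; the last one is the final estimator.
pipe = Pipeline([
    ('vect', CountVectorizer(analyzer='char')),
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('lrg', LogisticRegression()),
])

# A parameter of a step is addressed as '<step name>__<parameter name>'.
pipe.set_params(vect__ngram_range=(1, 2), lrg__C=1.0)

ngram_range = pipe.get_params()['vect__ngram_range']
step_names = [name for name, _ in pipe.steps]
print(step_names, ngram_range)
```

The double underscore convention is what allows a parameter search to vary the settings of any step by name.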

The <code>CountVectorizer</code> class of the <code>sklearn.feature_extraction.text</code> module is used for the conversion of "a collection of text documents to a matrix of token counts" (http://scikit-learn.org/); it can be used as a transform step in a pipeline.

In [115]:
from sklearn.feature_extraction.text import CountVectorizer

The <code>TfidfVectorizer</code> class of the <code>sklearn.feature_extraction.text</code> module is used for the conversion of "a collection of raw documents to a matrix of TF-IDF features" (http://scikit-learn.org/); it can also be used as a transform step in a pipeline.

In [116]:
from sklearn.feature_extraction.text import TfidfVectorizer

The <code>TfidfTransformer</code> class of the <code>sklearn.feature_extraction.text</code> module is used for the conversion of "a count matrix to a normalized tf or tf-idf representation" (http://scikit-learn.org/); it can also be used as a transform step in a pipeline.

In [117]:
from sklearn.feature_extraction.text import TfidfTransformer

The <code>LogisticRegression</code> classifier of the <code>sklearn.linear_model</code> module is imported because the analysis of the frequency of words and characters in a given language will be achieved with a logistic regression model.

The n-gram models tested for, as observed in Step 2.2, include the analysis of: <i>1-gram word frequency, 2-gram word frequency, 1-gram character frequency, 2-gram character frequency, and 4-gram character frequency</i>.

In [118]:
from sklearn.linear_model import LogisticRegression
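To illustrate what word and character n-grams are, here is a simplified sketch in plain Python. The helpers <code>char_ngrams</code> and <code>word_ngrams</code> are hypothetical and only approximate what <code>CountVectorizer</code> does (its tokenization, lowercasing and whitespace handling differ); the point is that a range such as (1, 2) yields both 1-grams and 2-grams.

```python
def char_ngrams(text, n_min, n_max):
    """All character n-grams of length n_min..n_max, in order of appearance."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

def word_ngrams(text, n_min, n_max):
    """All word n-grams of length n_min..n_max, using whitespace tokens."""
    words = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(' '.join(words[i:i + n]) for i in range(len(words) - n + 1))
    return grams

print(char_ngrams('den', 1, 2))
# ['d', 'e', 'n', 'de', 'en']
print(word_ngrams('der Präsident hat', 1, 2))
# ['der', 'Präsident', 'hat', 'der Präsident', 'Präsident hat']
```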

<h3><b> Step 2.2: Training the data</b></h3>

Now, we build the model and train it with the extracted training data.

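The pipeline is used to implement the 5 models detailed in Step 2.1. Note that the n-gram range of every model starts at 1, so the 2-gram models also count 1-grams, and the 4-gram character model counts all character n-grams of length 1 to 4.

To help the models generalize to previously unseen data, the inverse of regularization strength <code>C</code> is left at its scikit-learn default of 1.0 for each model.

Further, <b>L2 regularization</b> (the default penalty) is utilized to avoid overfitting of the models to the training data.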


In [119]:
models = {}
info_pipeline = [('model_1_char', 'char', (1, 1)), ('model_1_word', 'word', (1, 1)),
                 ('model_2_char', 'char', (1, 2)), ('model_4_char', 'char', (1, 4)), ('model_2_word', 'word', (1, 2))]
for values in info_pipeline: 
    k = Pipeline([('vect', CountVectorizer(ngram_range = (values[2][0], values[2][1]), analyzer = values[1])),
                  ('tfidf', TfidfTransformer(use_idf = False)), ('lrg', LogisticRegression(n_jobs = -1))])
    models[values[0]] = k.fit(training_data[0], all_labels)


As seen above, the models are built and stored in a dictionary from within a for loop, as the steps taken for each model are similar with just a few varying parameters, which are retrieved from the <code>info_pipeline</code> list.
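With <code>use_idf = False</code>, the <code>TfidfTransformer</code> step does not apply inverse document frequency weighting; under scikit-learn's default <code>norm='l2'</code> setting (which the pipeline does not override) it rescales each row of counts to unit Euclidean length. The plain Python sketch below reproduces that arithmetic for a single document; the helper <code>l2_normalized_counts</code> is a hypothetical name used only here.

```python
import math
from collections import Counter

def l2_normalized_counts(tokens):
    """Term counts divided by the Euclidean norm of the count vector."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()}

# Character 1-grams of the toy document 'aab': counts are a=2, b=1.
weights = l2_normalized_counts(list('aab'))

print(round(weights['a'], 3), round(weights['b'], 3))   # 0.894 0.447
```

Because every document is scaled to the same length, long and short texts contribute features on a comparable scale.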

<h3><b>Step 2.3: Preparing test data</b></h3>

Utilizing the function <code>denoise(input_data, data_type)</code> defined in Step 1.3, we prepare the test data.

In [123]:
test_data = pandas.read_csv('europarl.test', sep='\t',header=None)

Above, the function <code>read_csv</code> is used to read the test file into a DataFrame.

In [124]:
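# Remove parenthesized text before denoising (regex=True is required by recent pandas versions).
test_data[1] = test_data[1].str.replace(r'\(.*?\)', '', regex=True)
test_data = denoise(test_data, "test")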

<h3><b> Step 2.4: Predictions on test data</b></h3>

We are able to perform predictions on the test data from within a for loop, as the trained models are stored in a dictionary; the 'model' substring of each key is replaced with 'prediction' using the <code>replace</code> function, and the predictions are thereafter stored in a new dictionary under that key.

We utilize the <code>predict</code> method to predict the target values for individual models.

In [125]:
predictions = {}
for model in models:
    label = model.replace('model','prediction')
    predictions[label] = models[model].predict(test_data[1])

<h1><b>Part 3</b></h1>
<h2><u>Analysis of the models and conclusion</u></h2>

<h3><b> Step 3.1: Imports</b></h3>

The <code>accuracy_score</code> function of the <code>sklearn.metrics</code> module is imported to compute accuracy classification scores.

In [126]:
from sklearn.metrics import accuracy_score 

The <code>confusion_matrix</code> function of the <code>sklearn.metrics</code> module is imported to compute the confusion matrix to evaluate the accuracy of a classification.

In [127]:
from sklearn.metrics import confusion_matrix

The <code>classification_report</code> function of the <code>sklearn.metrics</code> module produces a text report of the fundamental classification metrics.

In [128]:
from sklearn.metrics import classification_report

The <code>precision_recall_fscore_support</code> function of the <code>sklearn.metrics</code> module is imported to compute the recall, precision, support and F-measure for each class.

In [129]:
from sklearn.metrics import precision_recall_fscore_support

The <code>f1_score</code> function of the <code>sklearn.metrics</code> module is imported to compute the F1 score (the balanced F-measure).

In [130]:
from sklearn.metrics import f1_score
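For reference, the metrics imported above can be written out by hand. The sketch below computes precision, recall, F1 and accuracy from two short label lists; the lists are invented toy data (not results from this notebook) and the helper names are hypothetical.

```python
def per_class_scores(actual, predicted, label):
    """Precision, recall and F1 for one class, from paired label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == label and p == label for a, p in pairs)
    fp = sum(a != label and p == label for a, p in pairs)
    fn = sum(a == label and p != label for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def accuracy(actual, predicted):
    """Fraction of predictions that equal the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Toy data: one Slovak string is predicted as Czech.
actual    = ['cs', 'cs', 'sk', 'sk', 'sk', 'de']
predicted = ['cs', 'cs', 'cs', 'sk', 'sk', 'de']

cs_scores = per_class_scores(actual, predicted, 'cs')
sk_scores = per_class_scores(actual, predicted, 'sk')
acc = accuracy(actual, predicted)
print(cs_scores, sk_scores, acc)
```

In this toy case the single error lowers the precision of <code>cs</code> (a false positive) and the recall of <code>sk</code> (a false negative), which is the same pattern to look for in the reports below.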

<h3><b> Step 3.2: Analysis</b></h3>

<u>Classification report and accuracy score for a 4-gram character model</u>

In [131]:
print (classification_report(test_data[0], predictions['prediction_4_char']))
print (accuracy_score(test_data[0], predictions['prediction_4_char']))

             precision    recall  f1-score   support

         bg       1.00      1.00      1.00       997
         cs       0.80      0.95      0.87       993
         da       0.88      0.95      0.91       994
         de       0.69      0.99      0.81       993
         el       1.00      1.00      1.00       988
         en       1.00      0.70      0.82       998
         es       0.99      0.76      0.86       996
         et       0.87      0.89      0.88       993
         fi       0.84      1.00      0.91       995
         fr       0.99      0.88      0.93       999
         hu       0.90      0.99      0.94       998
         it       0.98      0.81      0.88       996
         lt       0.93      0.97      0.95       995
         lv       1.00      0.97      0.98       978
         nl       0.88      0.89      0.89       999
         pl       0.97      0.99      0.98       997
         pt       0.86      0.95      0.90       996
         ro       0.85      0.97      0.91   

<font color = 'red'><u>Results:</u></font>

The calculated prediction accuracy accomplished on the test data with a model trained with the 4-gram character logistic regression model is: <b> 91.033%.</b>

<u>Classification report and accuracy score for a 2-gram character model</u>

In [132]:
print (classification_report(test_data[0], predictions['prediction_2_char']))
print (accuracy_score(test_data[0], predictions['prediction_2_char']))

             precision    recall  f1-score   support

         bg       1.00      1.00      1.00       997
         cs       0.74      0.94      0.83       993
         da       0.86      0.93      0.89       994
         de       0.66      0.98      0.79       993
         el       1.00      1.00      1.00       988
         en       0.99      0.65      0.78       998
         es       0.97      0.67      0.80       996
         et       0.86      0.85      0.86       993
         fi       0.81      0.99      0.89       995
         fr       0.98      0.83      0.90       999
         hu       0.90      0.99      0.94       998
         it       0.94      0.75      0.83       996
         lt       0.91      0.95      0.93       995
         lv       0.99      0.97      0.98       978
         nl       0.79      0.86      0.83       999
         pl       0.97      0.99      0.98       997
         pt       0.80      0.93      0.86       996
         ro       0.82      0.95      0.88   

<font color = 'red'><u>Results:</u></font>

The calculated prediction accuracy accomplished on the test data with a model trained with the 2-gram character logistic regression model is: <b> 88.340%.</b>

<u>Classification report and accuracy score for a 1-gram character model</u>

In [133]:
print (classification_report(test_data[0], predictions['prediction_1_char']))
print (accuracy_score(test_data[0], predictions['prediction_1_char']))

             precision    recall  f1-score   support

         bg       1.00      1.00      1.00       997
         cs       0.61      0.93      0.74       993
         da       0.82      0.83      0.82       994
         de       0.55      0.95      0.70       993
         el       1.00      1.00      1.00       988
         en       0.94      0.47      0.63       998
         es       0.79      0.48      0.60       996
         et       0.80      0.75      0.77       993
         fi       0.75      0.98      0.85       995
         fr       0.92      0.55      0.69       999
         hu       0.87      0.98      0.92       998
         it       0.84      0.52      0.64       996
         lt       0.78      0.93      0.85       995
         lv       0.98      0.90      0.94       978
         nl       0.58      0.87      0.69       999
         pl       0.96      0.97      0.96       997
         pt       0.66      0.85      0.74       996
         ro       0.74      0.83      0.78   

<font color = 'red'><u>Results:</u></font>

The calculated prediction accuracy accomplished on the test data with a model trained with the 1-gram character logistic regression model is: <b> 79.263%.</b>

<u>Classification report and accuracy score for a 2-gram word model</u>

In [134]:
print (classification_report(test_data[0], predictions['prediction_2_word']))
print (accuracy_score(test_data[0], predictions['prediction_2_word']))

             precision    recall  f1-score   support

         bg       1.00      1.00      1.00       997
         cs       0.84      0.50      0.63       993
         da       0.97      0.89      0.93       994
         de       0.93      0.97      0.95       993
         el       0.63      1.00      0.78       988
         en       0.87      0.97      0.92       998
         es       0.85      0.74      0.79       996
         et       0.95      0.87      0.91       993
         fi       0.90      0.83      0.86       995
         fr       0.73      0.85      0.78       999
         hu       0.99      0.96      0.97       998
         it       0.89      0.94      0.92       996
         lt       0.98      0.84      0.90       995
         lv       0.80      0.87      0.83       978
         nl       0.56      0.96      0.71       999
         pl       0.95      0.80      0.87       997
         pt       0.54      0.69      0.60       996
         ro       1.00      0.81      0.89   

<font color = 'red'><u>Results:</u></font>

The calculated prediction accuracy accomplished on the test data with a model trained with the 2-gram word logistic regression model is: <b> 83.416%.</b>

<u>Classification report and accuracy score for a 1-gram word model</u>

In [135]:
print (classification_report(test_data[0], predictions['prediction_1_word']))
print (accuracy_score(test_data[0], predictions['prediction_1_word']))

             precision    recall  f1-score   support

         bg       1.00      1.00      1.00       997
         cs       0.84      0.51      0.63       993
         da       0.97      0.89      0.93       994
         de       0.93      0.97      0.95       993
         el       0.64      1.00      0.78       988
         en       0.88      0.97      0.92       998
         es       0.85      0.74      0.79       996
         et       0.95      0.87      0.91       993
         fi       0.90      0.83      0.86       995
         fr       0.73      0.85      0.79       999
         hu       0.99      0.96      0.97       998
         it       0.90      0.94      0.92       996
         lt       0.97      0.84      0.90       995
         lv       0.80      0.87      0.83       978
         nl       0.56      0.95      0.71       999
         pl       0.95      0.80      0.87       997
         pt       0.54      0.69      0.61       996
         ro       1.00      0.81      0.90   

<font color = 'red'><u>Results:</u></font>

The calculated prediction accuracy accomplished on the test data with a model trained with the 1-gram word logistic regression model is: <b> 83.599%.</b>

<h3> Step 3.3: Further analysis with a crosstab </h3>

We further analyse the results using a crosstab, which shows which languages are confused with one another (false negatives and false positives are displayed).

We consider the model trained with the 4-gram character logistic regression model, which returned the highest prediction accuracy of 91.033%.

In [136]:
crosstab = pandas.crosstab(test_data[0], predictions['prediction_4_char'], rownames=["V"], colnames=["PV"], margins=True)
crosstab = pandas.DataFrame(crosstab)
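Each cell of such a crosstab is simply the number of times an (actual, predicted) pair of labels occurs. The sketch below counts these pairs with the standard library and lists the most frequent errors; the label lists are invented toy data and the helper names are hypothetical.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs, like the cells of a crosstab."""
    return Counter(zip(actual, predicted))

def top_confusions(counts, k=3):
    """The k most frequent off-diagonal (misclassified) pairs."""
    errors = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    return sorted(errors.items(), key=lambda item: (-item[1], item[0]))[:k]

# Toy data for illustration only.
actual    = ['sk', 'sk', 'sk', 'cs', 'es', 'es', 'it']
predicted = ['cs', 'cs', 'sk', 'cs', 'pt', 'es', 'it']

counts = confusion_counts(actual, predicted)
print(top_confusions(counts))   # [(('sk', 'cs'), 2), (('es', 'pt'), 1)]
```

Sorting the off-diagonal cells this way is a quick route to the kind of observation made below about closely related languages.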

The results show that, for example, 20 Italian strings were misclassified as Portuguese.

The most frequent misclassification was Slovak predicted as Czech: 212 strings in total. Having grown up in Prague, Czech Republic, where I lived for 19 years, I can vouch for the similarity between the two languages: I am able to communicate with my Slovak friends almost seamlessly while each of us speaks our own language. This is due to the shared history of the two countries.

The trends between other languages can be observed below. 'V' signifies the actual values and 'PV' signifies the predicted values.

In [137]:
crosstab

V \ PV,bg,cs,da,de,el,en,es,et,fi,fr,...,lt,lv,nl,pl,pt,ro,sk,sl,sv,All
bg,997,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,997
cs,0,943,0,2,0,0,0,4,6,0,...,5,0,0,4,1,1,3,0,0,993
da,0,0,940,17,1,0,0,4,2,0,...,0,1,9,0,1,0,0,0,9,994
de,0,0,4,979,0,0,0,4,3,0,...,0,0,1,0,0,0,0,0,0,993
el,0,0,0,0,988,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,988
en,0,7,16,156,0,696,2,23,11,4,...,3,1,35,5,11,9,0,0,5,998
es,0,2,3,43,0,0,758,14,5,1,...,3,0,22,0,117,18,0,0,0,996
et,0,0,2,11,0,0,0,884,75,0,...,9,0,7,0,0,0,1,3,1,993
fi,0,0,0,2,0,0,0,2,991,0,...,0,0,0,0,0,0,0,0,0,995
fr,0,2,12,53,0,1,2,8,8,880,...,3,0,10,1,0,14,1,0,0,999


<h3> Step 3.4: Conclusion </h3>

Due to the size of the European Parliament corpus (5 GB), a fifth of the text files of each language were randomly chosen for the analysis, which made the tests less time-consuming. For purposes other than this challenge, the <code>loadData</code> function could easily be adjusted to use the entire dataset, although processing the entire dataset suggests the need for Big Data analysis. Despite using a subset of the data, we are still able to generalize to unseen data and achieve a prediction accuracy of 91.033%.