# European Language Detection: 

# PROBLEM:

There are 21 European languages in the dataset, http://www.statmt.org/europarl/. The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romanic (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavik (Bulgarian, Czech, Polish, Slovak, Slovene), Finni-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.
The goal of the problem is to detect among 21 languages in the European Union and its a classic Machine learning multi label classification. The dataset is the 5GBcorpus where each text file has a Chapter ID and Speaker ID.

# Pre Processing:

The dataset on extracting has 21 seperate folders each belonging to one of the languages mentioned above. Each language folder has 1000's of text files, each a conversation in the parliament. Now, we need to have a way to combine all the text files into one and then together input it to the model. 

The load_data_files.py shows how to combine all of the files for the model. On the other hand running it on a local computer is taxing and hence the Language_European_Small-final notebook helps to make a smaller dataset out of it. This smaller dataset is achieved by randomly shuffling all the text files and selecting an approximate of 1000 to 3000 files per language based on the size.

In [1]:
import codecs
import glob as g
import pandas as pd
import os
pd_dict={}
#os.path.basename(path)
#path="C:/Users/Sathvik/Desktop/DS/NLP_TEXT/"
get_dir1=g.glob('C:/Users/Sathvik/Desktop/DS/NLP_TEXT_SMALL/*')
a=len(get_dir1)
for k in range(0,a):
    with codecs.open(get_dir1[k],encoding="utf-8") as f:
            pd_dict[os.path.basename(get_dir1[k])]=f.read()
        

Once we have the dataset up and running, Pandas is one of the easier ways to mungle and manage the dataset and hence read the files and put it to a dataframe and the file names serve as the label names.


In [2]:
 df=pd.DataFrame.from_dict(pd_dict,orient='index')

In [3]:
with codecs.open("C:/Users/Sathvik/Desktop/DS/DUMMY_NLP_SMALL/pl.txt",encoding=" iso-8859-1") as f1:
    df.loc["pl.txt"]=f1.read()

In [4]:
df.head()

Unnamed: 0,0
bg.txt,"<CHAPTER ID=""003"">\nСъстав на Парламента: вж. ..."
cs.txt,"<CHAPTER ID=""003"">\nSchválení zápisu z předcho..."
da.txt,<CHAPTER ID=1>\nGenoptagelse af sessionen\n<SP...
de.txt,<CHAPTER ID=1>\nWiederaufnahme der Sitzungsper...
el.txt,<CHAPTER ID=1>\nΕπαvάληψη της συvσδoυ\n<SPEAKE...


In [5]:
df.columns=["Data"]

In [6]:
df["Label"]=df.index

In [7]:
df["Label"]=df["Label"].apply(lambda x: x.replace('.txt',''))

In [8]:
df.replace(to_replace='\<.*?\>',value="",regex=True,inplace=True)

In [9]:
df.head()

Unnamed: 0,Data,Label
bg.txt,"\nСъстав на Парламента: вж. протоколи\n,\nОдоб...",bg
cs.txt,\nSchválení zápisu z předchozího zasedání: viz...,cs
da.txt,\nGenoptagelse af sessionen\n\nJeg erklærer Eu...,da
de.txt,\nWiederaufnahme der Sitzungsperiode\n\nIch er...,de
el.txt,\nΕπαvάληψη της συvσδoυ\n\nΚηρύσσω την επανάλη...,el


In [10]:
import re
def remove_strings(s):
    s = s.replace("\n","")
    s=re.sub('\d','',s)
    s = s.translate ({ord(c): "" for c in "!@#$%^&*()[]{};:,./<>?\|`~-=_+"})
    return s

In [11]:
df["Data"]=df["Data"].apply(remove_strings)
#df_test["Data"]=df_test["Data"].str.strip()

As explained earlier each document has a CHAPTER ID and SPEAKER ID within the 'HTML' kind of tags and they pretty much don't add anything to the model and its important to remove them. 
There are many punctuation marks, symbols and numbers that are totally not necessary for the ML model to know, So let's remove them. 


In [40]:
df["Data"]

bg.txt    Състав на Парламента вж протоколиОдобряване на...
cs.txt    Schválení zápisu z předchozího zasedání viz zá...
da.txt    Genoptagelse af sessionenJeg erklærer EuropaPa...
de.txt    Wiederaufnahme der SitzungsperiodeIch erkläre ...
el.txt    Επαvάληψη της συvσδoυΚηρύσσω την επανάληψη της...
en.txt    Resumption of the sessionI declare resumed the...
es.txt    Reanudación del período de sesionesDeclaro rea...
et.txt    Eelmise istungi protokolli kinnitamine vaata p...
fi.txt    Istuntokauden uudelleenavaaminen Julistan perj...
fr.txt    Reprise de la sessionJe déclare reprise la ses...
hu.txt    Az előző ülés jegyzőkönyvének elfogadása lásd ...
it.txt    Ripresa della sessioneDichiaro ripresa la sess...
lt.txt    Ankstesnio posėdžio protokolų tvirtinimas žr p...
lv.txt    Iepriekšējās sēdes protokola apstiprināšana sk...
nl.txt    Hervatting van de zittingIk verklaar de zittin...
pt.txt    Reinício da sessãoDeclaro reaberta a sessão do...
ro.txt    Componenţa Parlamentului a se 

In [41]:
#with open('C:/Users/Sathvik/Desktop/DS/europarl-test/europarl_test.txt',encoding=" utf-8") as f2:
#    lines = f2.readlines()

In [42]:
df_test=pd.read_csv('C:/Users/Sathvik/Desktop/DS/europarl-test/europarl_test.txt',encoding=" utf-8",sep='\t',header=None) 

In [43]:
df_test.columns=["Label_test","Data_test"]

In [44]:
df_test["Data_test"]=df_test["Data_test"].apply(remove_strings)

In [45]:
df_test

Unnamed: 0,Label_test,Data_test
0,bg,Европа не трябва да стартира нов конкурентен ...
1,bg,CS Найголямата несправедливост на сегашната об...
2,bg,DE Гжо председател гн член на Комисията по при...
3,bg,DE Гн председател бих искал да започна с комен...
4,bg,DE Гн председател въпросът за правата на човек...
5,bg,DE Гн председател гласувах в подкрепа на Комис...
6,bg,DE Гн председател госпожи и господа в каква по...
7,bg,DE Гн председател госпожи и господа неотдавна ...
8,bg,DE Гн председател след повече от години колон...
9,bg,EN Благодаря Ви Сара за сериозното съдействие


In [20]:
import matplotlib.pyplot as plt
import numpy as np
import scipy
import seaborn as sns

from sklearn import ensemble
from sklearn import feature_extraction
from sklearn import linear_model
from sklearn import pipeline
from sklearn import cross_validation
from sklearn import metrics



# Model building and what worked and what could have worked!

The most interesting part! I first took  linear logistic regression as my model for classification. Scikit learn has a beautiful way of wrapping everything in  a pipeline and building the model. 
 


When it comes to dealing with text, you need to convert to a group of vectors and then input to your model. Here I have chosen the tfidf vectorizer. More about tfidf vectorizer? http://www.markhneedham.com/blog/2015/02/15/pythonscikit-learn-calculating-tfidf-on-how-i-met-your-mother-transcripts/, Nice way to learn about it. 
I have used my anlayser to be character based as opposed to word based, the reason being some of these languages are very close to each other and it's best to use character based when using a language detection. 


In [21]:
vectorizer = feature_extraction.text.TfidfVectorizer(ngram_range=(1, 6),
                             analyzer='char',)


pipe = pipeline.Pipeline([
    ('vectorizer', vectorizer),
    ('clf', linear_model.LogisticRegression())
])

In [23]:
pipe.fit(df["Data"], df["Label"])

Pipeline(steps=[('vectorizer', TfidfVectorizer(analyzer='char', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 6), norm='l2', preprocessor=None, smooth_idf=Tr...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [24]:
y_predicted = pipe.predict(df_test["Data_test"])


In [25]:
cm = metrics.confusion_matrix(df_test["Label_test"], y_predicted)

In [46]:
a=(y_predicted==df_test["Label_test"])

In [47]:
accuracy=sum(a)/len(a)

Well, honeslty the model worked very well, a 97% accuracy, not bad at all. I have the classification report as shown below. 


In [48]:
print(metrics.classification_report(df_test["Label_test"], y_predicted,
                                    target_names=df["Label"]))

             precision    recall  f1-score   support

         bg       0.71      1.00      0.83      1000
         cs       0.98      0.93      0.95      1000
         da       0.98      0.97      0.98      1000
         de       0.96      0.98      0.97      1000
         el       0.96      1.00      0.98       992
         en       1.00      0.95      0.98      1000
         es       1.00      0.92      0.96      1000
         et       0.99      0.92      0.96      1000
         fi       0.93      0.98      0.96      1000
         fr       1.00      0.96      0.98      1000
         hu       0.99      0.96      0.98      1000
         it       1.00      0.97      0.98      1000
         lt       0.97      0.98      0.97      1000
         lv       1.00      0.99      1.00       979
         nl       0.97      0.98      0.97      1000
         pt       1.00      0.97      0.98      1000
         ro       0.94      0.99      0.97      1000
         sk       0.99      0.99      0.99   

# RESULTS AND OBSERVATIONS:


The next model trained was using Random forests. It is a very robust and versatile model but it did not peform well obviously for a high sparse matrix. 


Off lately data science nerds are into Xgboost and I wanted to try my hands on it for text classification. It is a great algorithm,(adaboost algorithm- train weak classifiers inorder to obtain one strong classifier- ensemble methods)
However, I tried running it a couple of times but was thrown "Memory error". 

Logistic regression did perform better. But as we can see here, the model was trained on a comparitively small set and there may be chances of overfitting the model which leads to this high accuracy.


It is important to run this model on the Entire dataset and then test using the test set provided. Also I have handled this task with a character encoding as opposed to word encoding and would be thrilled to work 
on the word encoding had Memory error not been an issue.

 Overall, the task was to detect the language and our model performs better for the short data set handled. As we know "All models wrong and some models are useful", we can expand this on various levels.

Key take aways to note on NLP: Try eliminating low information features, Work with Gridsearch to tune your hyperparamters, Add customized stop word list, Always try to remove punctuation marks and symbols, Try different corpus for the text data say extracting data from wikipedia.


# Future Work:


Work with Facebook's new Fasttext algorithm inorder to achieve better results but this works on shorter texts well. 
We can use CNN to perform classification task on NLP which is the new for NLP since CNN was majorly and primarly designed for images. 
This one would be definetely my future work. Refer: http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/

In [65]:
#df_test_grouped=df_test.groupby('Label_test')['Data_test'].apply(' '.join).reset_index()

In [51]:
#y_predicted_grouped=pipe.predict(df_test_grouped["Data_test"])

# Uncomment the following to execute Xgboost Classifier 

In [61]:
#import xgboost as xgb

In [62]:
#from xgboost.sklearn import XGBClassifier

In [63]:
#vectorizer = feature_extraction.text.TfidfVectorizer(ngram_range=(1, 6),analyzer='char',)

#pipe_xgb = pipeline.Pipeline([('vectorizer', vectorizer),('clf',  XGBClassifier())])

In [60]:
#pipe_xgb.fit(df["Data"], df["Label"])