## Text Language Classification Using Machine Learning

Machine learning is commonly used for data mining and number crunching, but it can also classify a given text or sentence by language, for example German, Spanish, French, or English.

To build a language classifier, we first train a model on text data so that it can predict the language of a given text. Browsers such as **Google Chrome, Mozilla Firefox, or Internet Explorer** presumably use a similar trained model under the hood to detect when a user lands on a page written in a language other than English.

The code below downloads the text of several Wikipedia pages and stores it in local text files.

In [None]:
# simple python script to collect text paragraphs from various languages on the
# same topic namely the Wikipedia encyclopedia itself

import os
try:
    # Python 2 compat
    from urllib2 import Request, build_opener
except ImportError:
    # Python 3
    from urllib.request import Request, build_opener

import lxml.html
from lxml.etree import ElementTree
import numpy as np

# URLs of the Wikipedia article about Wikipedia in each language
pages = {
    u'ar': u'http://ar.wikipedia.org/wiki/%D9%88%D9%8A%D9%83%D9%8A%D8%A8%D9%8A%D8%AF%D9%8A%D8%A7',
    u'de': u'http://de.wikipedia.org/wiki/Wikipedia',
    u'en': u'http://en.wikipedia.org/wiki/Wikipedia',
    u'es': u'http://es.wikipedia.org/wiki/Wikipedia',
    u'fr': u'http://fr.wikipedia.org/wiki/Wikip%C3%A9dia',
    u'it': u'http://it.wikipedia.org/wiki/Wikipedia',
    u'ja': u'http://ja.wikipedia.org/wiki/Wikipedia',
    u'nl': u'http://nl.wikipedia.org/wiki/Wikipedia',
    u'pl': u'http://pl.wikipedia.org/wiki/Wikipedia',
    u'pt': u'http://pt.wikipedia.org/wiki/Wikip%C3%A9dia',
    u'ru': u'http://ru.wikipedia.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D1%8F',
#    u'zh': u'http://zh.wikipedia.org/wiki/Wikipedia',
}

html_folder = u'html'
text_folder = u'paragraphs'
short_text_folder = u'short_paragraphs'
n_words_per_short_text = 5

# create the html folder if it does not exist
if not os.path.exists(html_folder):
    os.makedirs(html_folder)

for lang, page in pages.items():

    text_lang_folder = os.path.join(text_folder, lang)
    if not os.path.exists(text_lang_folder):  # create the paragraph folder
        os.makedirs(text_lang_folder)

    short_text_lang_folder = os.path.join(short_text_folder, lang)
    if not os.path.exists(short_text_lang_folder):  # create the short paragraph folder
        os.makedirs(short_text_lang_folder)

    opener = build_opener()
    html_filename = os.path.join(html_folder, lang + '.html')
    if not os.path.exists(html_filename):
        print("Downloading %s" % page)
        request = Request(page)
        # change the User Agent to avoid being blocked by Wikipedia
        # downloading a couple of articles should not be abusive
        request.add_header('User-Agent', 'OpenAnything/1.0')
        html_content = opener.open(request).read()
        # write the raw bytes and close the file handle explicitly
        with open(html_filename, 'wb') as f:
            f.write(html_content)

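    # read the raw bytes and decode them explicitly as UTF-8 since lxml is
    # confused for some reason (binary mode avoids the platform default
    # encoding being applied on Python 3)
    with open(html_filename, 'rb') as f:
        html_content = f.read()
    if hasattr(html_content, 'decode'):
        html_content = html_content.decode('utf-8')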
    tree = ElementTree(lxml.html.document_fromstring(html_content))
    i = 0
    j = 0
    for p in tree.findall('//p'):
        content = p.text_content()
        if len(content) < 100:
            # skip paragraphs that are too short - probably too noisy and not
            # representative of the actual language
            continue

        text_filename = os.path.join(text_lang_folder,
                                     '%s_%04d.txt' % (lang, i))
        print("Writing %s" % text_filename)
        open(text_filename, 'wb').write(content.encode('utf-8', 'ignore'))
        i += 1

        # split the paragraph into fake smaller paragraphs to make the
        # problem harder e.g. more similar to tweets
        if lang in ('zh', 'ja'):
            # FIXME: whitespace tokenizing does not work on Chinese and Japanese
            continue
        words = content.split()
        # integer division: np.array_split needs an integer number of sections
        n_groups = len(words) // n_words_per_short_text
        if n_groups < 1:
            continue
        groups = np.array_split(words, n_groups)

        for group in groups:
            small_content = u" ".join(group)

            short_text_filename = os.path.join(short_text_lang_folder,
                                               '%s_%04d.txt' % (lang, j))
            print("Writing %s" % short_text_filename)
            open(short_text_filename, 'wb').write(
                small_content.encode('utf-8', 'ignore'))
            j += 1
            if j >= 1000:
                break
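
The short-paragraph step of the script above can be tried in isolation. The following is a minimal sketch, assuming Python 3 and NumPy; the sample sentence is made up for illustration and is not part of the downloaded data.

```python
import numpy as np

n_words_per_short_text = 5  # same value as in the script above

# Hypothetical sample paragraph (12 whitespace-separated words)
content = ("Wikipedia is a free online encyclopedia "
           "written by volunteers around the world")
words = content.split()

# np.array_split expects an integer number of sections, hence floor division
n_groups = len(words) // n_words_per_short_text

# Split the word list into n_groups chunks of (nearly) equal size
groups = np.array_split(words, n_groups)
short_texts = [" ".join(group) for group in groups]

for text in short_texts:
    print(text)
```

Because the number of groups is rounded down, each chunk holds at least `n_words_per_short_text` words rather than exactly that many.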

The data is stored in the local folders html, paragraphs, and short_paragraphs. Now we need to load the text data into memory for the classification task.

We will use the load_files function from sklearn.datasets, which reads a **two-level folder structure such as "/Content/Category/file1.txt ... file50.txt"**, where the name of each subfolder is used as the category label.

The function takes the path of the folder as input and returns a dataset object with data and target attributes.
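
To see what load_files returns, here is a minimal sketch that builds a tiny throwaway corpus in a temporary directory and loads it. The folder contents and sample texts are invented for illustration; only the two-level layout mirrors the paragraphs folder created above.

```python
import os
import tempfile

from sklearn.datasets import load_files

# Hypothetical toy corpus: one subfolder per language, one file per text
root = tempfile.mkdtemp()
samples = {
    "en": ["This is an English paragraph.", "Another English text."],
    "fr": ["Ceci est un paragraphe en français."],
}
for lang, texts in samples.items():
    lang_dir = os.path.join(root, lang)
    os.makedirs(lang_dir)
    for i, text in enumerate(texts):
        path = os.path.join(lang_dir, "%s_%04d.txt" % (lang, i))
        with open(path, "wb") as f:
            f.write(text.encode("utf-8"))

# encoding="utf-8" makes load_files return decoded strings instead of bytes
toy = load_files(root, encoding="utf-8")

print(toy.target_names)           # subfolder names, used as class labels
print(len(toy.data), toy.target)  # one text and one integer label per file
```

Each entry of `target` is an index into `target_names`, which is how the predicted labels are mapped back to language codes later on.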

Note: If you are curious about the underlying implementation of load_files(), the following code prints its source.

**import inspect**

**print(inspect.getsource(load_files))**


In [2]:
from sklearn.datasets import load_files

In [38]:
# the path of the folder containing our text files is passed as an argument
# (os.path.join keeps the path portable across operating systems)

data_folder = os.path.join(os.getcwd(), "paragraphs")
dataset = load_files(data_folder)

Let's import required library functions for the experiment.

In [6]:
import sys
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn import metrics

Since the data is already loaded in the dataset variable, we split it into training and test sets.

In [7]:
# Split the dataset in training and test set:
docs_train, docs_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.5)

Now let's build features from the text data. We will use two well-known techniques:
+ n-grams
+ tf-idf (term frequency-inverse document frequency)

## Understanding n-grams with a simple example
n-gram models are used in text mining and natural language processing tasks.

They can be used to create unigram, bigram, and trigram features, which in turn can serve as input for a supervised learning task.

Let's look at an example. For the sentence **"I love python language"** and N=2, the n-grams are:
+ I love
+ love python
+ python language

The above is a bigram model, where the window moves forward one word at a time.

Now let's try again with N=3, so our output will be:

+ I love python
+ love python language

This was a trigram model with N=3.

The next question is how many n-grams can be created from a given sentence.

Let's work out the answer:

Number of n-grams = X - (N - 1), where X is the number of words in the sentence.

For our sentence, X = 4 (**"I love python language"**):

Number of n-grams = 4 - (2 - 1) = 3  (if N=2)

Number of n-grams = 4 - (3 - 1) = 2  (if N=3)
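
The sliding-window idea and the counting formula can be checked with a few lines of plain Python. This is a minimal sketch; the helper name `word_ngrams` is hypothetical and not part of any library.

```python
def word_ngrams(sentence, n):
    """Return the word n-grams obtained by sliding a window of n words."""
    words = sentence.split()
    # There are len(words) - n + 1 window positions (none if n is too large)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "I love python language"
x = len(sentence.split())  # X = 4 words

bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)

print(bigrams)   # the three bigrams listed above
print(trigrams)  # the two trigrams listed above

# The counts match X - (N - 1)
print(len(bigrams) == x - (2 - 1), len(trigrams) == x - (3 - 1))
```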

## Understanding tf-idf with a simple example

Tf-idf stands for _term frequency-inverse document frequency_; the tf-idf weight is often used in information retrieval and text mining.

It is a measure of how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times the word appears in the document, but is offset by the frequency of the word in the corpus.

TF: term frequency, which measures how frequently a term occurs in a document.

**TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).**

IDF: inverse document frequency, which measures how important a term is.

**IDF(t) = log_e(Total number of documents / Number of documents with term t in it).**

Example: 

Consider a document containing 100 words in which the word cat appears 3 times. The term frequency is
 **tf: 3/100 = 0.03**

Now assume we have 10 million documents and cat appears in 1,000 of them. The inverse document frequency is
 **idf: log_e(10,000,000/1,000) = log_e(10,000) ≈ 9.21**

So the tf-idf weight is the product of these two quantities: 0.03 * 9.21 ≈ 0.28

(With a base-10 logarithm instead, the idf would be 4 and the weight 0.12.)
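
The arithmetic of this example can be reproduced with the standard library. This is a minimal sketch assuming Python 3; the helper names `tf` and `idf` are hypothetical. Note that the result depends on the base of the logarithm: the natural logarithm used in the IDF formula gives about 9.21, whereas a base-10 logarithm gives 4.

```python
import math


def tf(term_count, total_terms):
    """Term frequency: share of the document's terms that are the given term."""
    return term_count / total_terms


def idf(n_documents, n_documents_with_term, log=math.log):
    """Inverse document frequency; natural logarithm by default."""
    return log(n_documents / n_documents_with_term)


tf_cat = tf(3, 100)                                  # 3 occurrences in 100 words
idf_ln = idf(10_000_000, 1_000)                      # natural logarithm
idf_log10 = idf(10_000_000, 1_000, log=math.log10)   # base-10 logarithm

print("tf             =", tf_cat)
print("idf (log_e)    =", round(idf_ln, 2))
print("idf (log_10)   =", round(idf_log10, 2))
print("tf-idf (log_e) =", round(tf_cat * idf_ln, 2))
print("tf-idf (log_10)=", round(tf_cat * idf_log10, 2))
```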


In [8]:
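# Build character n-gram features in one line and store the vectorizer.
# With use_idf=False only the normalized term frequencies are used;
# the idf weighting described above is switched off.

vectorizer = TfidfVectorizer(ngram_range=(1, 3), analyzer='char', use_idf=False)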

In the code above, ngram_range=(1,3) tells the vectorizer to extract n-grams of length 1 to 3 (unigrams, bigrams, and trigrams), and analyzer='char' tells it to build them from characters instead of words.
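
To inspect which features these settings produce, the vectorizer's analyzer can be called directly on a string. This is a small sketch using scikit-learn's `build_analyzer()`; the input string "abc" and the variable names are chosen only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 3), analyzer='char',
                                 use_idf=False)

# build_analyzer() returns the function that turns a text into its n-grams
analyze = toy_vectorizer.build_analyzer()
grams = analyze(u"abc")

# Character n-grams of length 1, 2 and 3, sorted by length for readability
print(sorted(grams, key=lambda g: (len(g), g)))
```

A three-character string yields 3 unigrams, 2 bigrams, and 1 trigram, which is the counting formula from the n-gram section applied to characters instead of words.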

_Note: We do not perform stop word removal or stemming in this task._

Now that the features are built, we run a **perceptron**, a simple linear classifier that can be viewed as a single-layer neural network, on top of them to classify the languages, without any hyperparameter tuning.

In [18]:
clf=Pipeline([('vec',vectorizer),('clf',Perceptron())])

**Pipeline** takes a sequence of transformation steps followed by a final model and returns a single estimator that can be fitted.
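
What the pipeline does can be seen by comparing it with the equivalent manual steps. The following is a minimal sketch on a made-up four-sentence corpus; the variable names and the fixed `random_state` are assumptions added for reproducibility, not part of the experiment above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline

# Hypothetical toy data: label 0 = English, label 1 = French
train_docs = [u"the cat sat on the mat", u"this is a short text",
              u"le chat est sur le tapis", u"ceci est un texte court"]
train_labels = [0, 0, 1, 1]
new_docs = [u"the text is short", u"le texte est court"]

# 1) With a Pipeline: vectorizing and classifying happen in one fit/predict
pipe = Pipeline([
    ('vec', TfidfVectorizer(ngram_range=(1, 3), analyzer='char',
                            use_idf=False)),
    ('clf', Perceptron(random_state=0)),
])
pipe.fit(train_docs, train_labels)
pipe_pred = pipe.predict(new_docs)

# 2) The same steps done by hand
vec = TfidfVectorizer(ngram_range=(1, 3), analyzer='char', use_idf=False)
model = Perceptron(random_state=0)
model.fit(vec.fit_transform(train_docs), train_labels)
manual_pred = model.predict(vec.transform(new_docs))

print(list(pipe_pred) == list(manual_pred))
```

The main benefit of the pipeline is that the vectorizer fitted on the training data is reused automatically at prediction time, so raw strings can be passed to `predict`.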

In [20]:
clf.fit(docs_train,y_train) #fit the model to the training data

Pipeline(steps=[('vec', TfidfVectorizer(analyzer='char', binary=False, charset=None,
        charset_error=None, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 3), norm=u..._iter=5, n_jobs=1, penalty=None, random_state=0, shuffle=False,
      verbose=0, warm_start=False))])

In [21]:
y_predicted=clf.predict(docs_test)

In [22]:
# Print the classification report
print(metrics.classification_report(y_test, y_predicted,
                                    target_names=dataset.target_names))

             precision    recall  f1-score   support

         ar       1.00      1.00      1.00        13
         de       0.99      1.00      0.99        79
         en       0.99      1.00      0.99        78
         es       0.98      1.00      0.99        51
         fr       1.00      1.00      1.00        64
         it       1.00      0.98      0.99        41
         ja       1.00      1.00      1.00        32
         nl       1.00      1.00      1.00        21
         pl       1.00      1.00      1.00        19
         pt       1.00      0.98      0.99        43
         ru       1.00      0.97      0.98        31

avg / total       0.99      0.99      0.99       472



As we can see, the model predicts the languages accurately, with an average precision and recall of 0.99 (99%).
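
For readers unfamiliar with the columns of the report, precision and recall for a single class can be computed by hand. This is a minimal sketch in plain Python (Python 3 division); the label lists are invented for illustration and are not the predictions from the experiment above.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall of one class, computed from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical true and predicted labels: one English text is mistaken for French
y_true = ['en', 'en', 'en', 'fr', 'fr', 'de']
y_pred = ['en', 'en', 'fr', 'fr', 'fr', 'de']

en_precision, en_recall = precision_recall(y_true, y_pred, 'en')
fr_precision, fr_recall = precision_recall(y_true, y_pred, 'fr')

print("en: precision=%.2f recall=%.2f" % (en_precision, en_recall))
print("fr: precision=%.2f recall=%.2f" % (fr_precision, fr_recall))
```

Precision answers "of the texts predicted as this language, how many were right?", while recall answers "of the texts actually in this language, how many were found?". The support column in the report is the number of true samples per class.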

Let's predict the language of some new short sentences using our model.

The sentences are written in the following languages:
+ en-English
+ fr-French
+ de-German
+ ru-Russian
+ pl-Polish

In [28]:
# Predict the result on some short new sentences:
sentences = [
    u'This is a language detection test.',
    u'Ceci est un test de d\xe9tection de la langue.',
    u'Dies ist ein Test, um die Sprache zu erkennen.',
    u'[ˌwɪkiˈpiːdiə]) — свободная[3] общедоступная мультиязычная',
    u'Wikipedia – wielojęzyczna encyklopedia internetowa'
]

predicted = clf.predict(sentences)

for s, p in zip(sentences, predicted):
    print(u'The language of "%s" is "%s"' % (s, dataset.target_names[p]))

The language of "This is a language detection test." is "en"
The language of "Ceci est un test de détection de la langue." is "fr"
The language of "Dies ist ein Test, um die Sprache zu erkennen." is "de"
The language of "[ˌwɪkiˈpiːdiə]) — свободная[3] общедоступная мультиязычная" is "ru"
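The language of "Wikipedia – wielojęzyczna encyklopedia internetowa" is "it"

Note that the last sentence is Polish, so the "it" prediction is a misclassification. Very short inputs give the character n-gram features little evidence to work with, which may explain the error.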
