# Language Identification Exercise

I will build a classifier that identifies the language a document is written in. The dataset is a selection of paragraphs from Wikipedia in eleven languages.

In [1]:
# Import the needed libraries.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

In [2]:
# Load the dataset, shuffle it randomly, 
# and split it into training and testing sets.

dataset = datasets.load_files('paragraphs/')
X = dataset.data
y = dataset.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)

In [3]:
# Set up a pipeline for analyzing the data:
# 1. Run it through a count vectorizer, making
#    a "bag of words".
# 2. Run the vectors through a tf-idf transform
#    to reweight the raw counts, down-weighting
#    words that appear in many documents.
# 3. Apply a multinomial naive Bayes classifier.

text_clf = Pipeline([('vect', CountVectorizer()), 
                     ('tfidf', TfidfTransformer()), 
                     ('clf', MultinomialNB())
                    ])
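
To make the first two pipeline stages more concrete, here is a minimal hand-rolled sketch of the weighting they apply. It assumes the smoothed idf formula and L2 row normalization that `TfidfTransformer` uses by default; the whitespace tokenizer, the `tfidf_rows` helper, and the toy documents are simplifications invented for illustration and are not part of the dataset.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Hand-rolled tf-idf: raw counts * smoothed idf, then L2-normalize each row."""
    # Bag of words per document (whitespace split is a simplification).
    counts = [Counter(doc.split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    # Number of documents containing each term.
    doc_freq = {t: sum(1 for c in counts if t in c) for t in vocab}
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for c in counts:
        raw = [c[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw))
        rows.append([x / norm for x in raw])
    return vocab, rows

# Hypothetical toy documents.
docs = ["the cat sat", "the dog sat", "der hund"]
vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, dict(zip(vocab, (round(x, 2) for x in row))))
```

In the first toy document, "cat" ends up with a larger weight than "the" because "the" occurs in two of the three documents.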

In [4]:
# Fit the model to the training data, predict 
# the test data labels, and compare with the 
# true labels to find the prediction accuracy.

text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)
print(text_clf.score(X_test, y_test))

0.886850152905


In [5]:
# Create and print a classification report, 
# showing the performance of the model.

print(metrics.classification_report(y_test, y_pred, target_names = dataset.target_names))

             precision    recall  f1-score   support

         ar       1.00      0.71      0.83         7
         de       0.71      1.00      0.83        52
         en       0.79      1.00      0.88        45
         es       0.98      0.98      0.98        43
         fr       0.96      1.00      0.98        47
         it       1.00      0.97      0.98        32
         ja       1.00      0.10      0.18        20
         nl       1.00      0.83      0.91        12
         pl       1.00      0.47      0.64        15
         pt       0.97      0.93      0.95        30
         ru       1.00      0.88      0.93        24

avg / total       0.91      0.89      0.87       327



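Precision is high for most languages, but recall is very low in a few cases: only 10% of the Japanese and 47% of the Polish documents are identified correctly. German and English have perfect recall but lower precision (0.71 and 0.79), which suggests they are absorbing documents from other languages. Let's check the confusion matrix to see what happened.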

In [6]:
# Build and print the confusion matrix.

print(metrics.confusion_matrix(y_test, y_pred))

[[ 5  2  0  0  0  0  0  0  0  0  0]
 [ 0 52  0  0  0  0  0  0  0  0  0]
 [ 0  0 45  0  0  0  0  0  0  0  0]
 [ 0  0  1 42  0  0  0  0  0  0  0]
 [ 0  0  0  0 47  0  0  0  0  0  0]
 [ 0  0  1  0  0 31  0  0  0  0  0]
 [ 0 14  3  1  0  0  2  0  0  0  0]
 [ 0  1  0  0  1  0  0 10  0  0  0]
 [ 0  3  4  0  0  0  0  0  7  1  0]
 [ 0  0  2  0  0  0  0  0  0 28  0]
 [ 0  1  1  0  1  0  0  0  0  0 21]]
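
The per-language recall and precision in the classification report can be recovered from this matrix: rows are true labels and columns are predicted labels, so recall is the diagonal divided by the row sums and precision is the diagonal divided by the column sums. A short sketch, with the matrix values copied from the output above and the label order taken from the report:

```python
import numpy as np

labels = ["ar", "de", "en", "es", "fr", "it", "ja", "nl", "pl", "pt", "ru"]

# Confusion matrix copied from the naive Bayes output above.
cm = np.array([
    [5,  2,  0,  0,  0,  0, 0,  0, 0,  0,  0],
    [0, 52,  0,  0,  0,  0, 0,  0, 0,  0,  0],
    [0,  0, 45,  0,  0,  0, 0,  0, 0,  0,  0],
    [0,  0,  1, 42,  0,  0, 0,  0, 0,  0,  0],
    [0,  0,  0,  0, 47,  0, 0,  0, 0,  0,  0],
    [0,  0,  1,  0,  0, 31, 0,  0, 0,  0,  0],
    [0, 14,  3,  1,  0,  0, 2,  0, 0,  0,  0],
    [0,  1,  0,  0,  1,  0, 0, 10, 0,  0,  0],
    [0,  3,  4,  0,  0,  0, 0,  0, 7,  1,  0],
    [0,  0,  2,  0,  0,  0, 0,  0, 0, 28,  0],
    [0,  1,  1,  0,  1,  0, 0,  0, 0,  0, 21],
])

correct = np.diag(cm)
recall = correct / cm.sum(axis=1)     # fraction of each true class found
precision = correct / cm.sum(axis=0)  # fraction of each predicted class that is right

for name, p, r in zip(labels, precision, recall):
    print(f"{name}  precision={p:.2f}  recall={r:.2f}")
```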


Interestingly, most of the Japanese documents (14 of 20) are misclassified as German, and over half of the Polish documents (8 of 15) are misclassified, mostly as English or German. I do not know why; this is where I would consult a linguist or explore the data more carefully.
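
One thing worth checking when exploring the data is how the vectorizer tokenizes each language. The sketch below applies the default `token_pattern` of `CountVectorizer` (two or more word characters between word boundaries) using plain `re`. The sample sentences and the `tokenize` helper are hypothetical, not taken from the dataset. Text written without spaces comes out as very long tokens that are unlikely to recur across documents. If most tokens in a test paragraph never occurred in training, the feature vector is nearly empty, and naive Bayes would then lean on the class prior, which favors the largest class. This is a hypothesis to test, not a confirmed diagnosis.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase and split the way the default word analyzer would."""
    return TOKEN_PATTERN.findall(text.lower())

# Hypothetical sample sentences, not from the dataset.
english = "This is a short English sentence"
japanese = "これは日本語の文章です"

print(tokenize(english))   # several short, reusable word tokens
print(tokenize(japanese))  # the whole unsegmented run becomes one token
```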

Let's try to fix this by switching to a different algorithm: a linear Support Vector Machine, trained with stochastic gradient descent.

In [7]:
# Build the new text classifier pipeline, replacing
# the classifier bit.

text_clf = Pipeline([('vect', CountVectorizer()), 
                     ('tfidf', TfidfTransformer()), 
                     ('clf', SGDClassifier(loss = 'hinge', penalty = 'l2', 
                                           alpha = 1e-3, random_state = 42))
                    ])
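
The hinge loss combined with an L2 penalty is the linear SVM objective, and stochastic gradient descent minimizes it one sample at a time. The following is a simplified binary sketch of that idea, not scikit-learn's implementation: `SGDClassifier` shuffles the data, uses a decaying learning rate by default, and handles multiple classes one-vs-rest. The `sgd_hinge` helper, the fixed learning rate `lr`, and the toy data are assumptions made for illustration.

```python
import numpy as np

def sgd_hinge(X, y, alpha=1e-3, lr=0.1, epochs=20):
    """Minimize alpha/2 * ||w||^2 + mean(max(0, 1 - y * (X @ w + b))) by SGD.

    Labels y must be +1 or -1. Samples are visited in a fixed order.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            margin = y_i * (x_i @ w + b)
            # Gradient of the L2 penalty: shrink the weights slightly.
            w *= 1 - lr * alpha
            # Subgradient of the hinge loss: only samples inside the
            # margin (or misclassified) push the boundary.
            if margin < 1:
                w += lr * y_i * x_i
                b += lr * y_i
    return w, b

# Hypothetical, linearly separable toy data.
X = np.array([[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w, b = sgd_hinge(X, y)
print(np.sign(X @ w + b))
```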

In [8]:
# Fit the model to the training data, predict 
# the test data labels, and compare with the 
# true labels to find the prediction accuracy.

text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)
print(text_clf.score(X_test, y_test))

0.987767584098


In [9]:
# Create and print a classification report, 
# showing the performance of the model.

print(metrics.classification_report(y_test, y_pred, target_names = dataset.target_names))

             precision    recall  f1-score   support

         ar       1.00      1.00      1.00         7
         de       1.00      1.00      1.00        52
         en       0.98      1.00      0.99        45
         es       1.00      0.98      0.99        43
         fr       1.00      1.00      1.00        47
         it       1.00      1.00      1.00        32
         ja       0.86      0.95      0.90        20
         nl       1.00      1.00      1.00        12
         pl       1.00      1.00      1.00        15
         pt       1.00      0.97      0.98        30
         ru       1.00      0.96      0.98        24

avg / total       0.99      0.99      0.99       327



In [10]:
# Build and print the confusion matrix.

print(metrics.confusion_matrix(y_test, y_pred))

[[ 7  0  0  0  0  0  0  0  0  0  0]
 [ 0 52  0  0  0  0  0  0  0  0  0]
 [ 0  0 45  0  0  0  0  0  0  0  0]
 [ 0  0  0 42  0  0  1  0  0  0  0]
 [ 0  0  0  0 47  0  0  0  0  0  0]
 [ 0  0  0  0  0 32  0  0  0  0  0]
 [ 0  0  1  0  0  0 19  0  0  0  0]
 [ 0  0  0  0  0  0  0 12  0  0  0]
 [ 0  0  0  0  0  0  0  0 15  0  0]
 [ 0  0  0  0  0  0  1  0  0 29  0]
 [ 0  0  0  0  0  0  1  0  0  0 23]]


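Our misclassification errors are almost totally gone: accuracy rose from about 0.89 to about 0.99, and only 4 of the 327 test documents are still misclassified. That, I would say, is a success.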