# Language Identification

In order to extract any kind of information from text, the first thing we have to know is what language the text is in. In this assignment you are going to use **character N-gram grammars** to solve the problem of *language identification*.

Given a document, your goal is to say what language it is written in. We will give you a set of training documents (one in each of 6 languages) and a set of development test documents. You will be graded on an unseen set of 6 test documents. To make the problem tractable, we guarantee that the test documents will come from one of the 6 languages you have seen in the training set.

The data you will use is 6 translations of part of the [Universal Declaration of Human Rights](http://www.un.org/en/universal-declaration-human-rights/index.html) (which has been translated into many languages although the data for the 6 languages is in the Language Identification folder in the Week 04 folder.)

The algorithm you will use requires that you build 6 separate character bigram grammars, one for each language, on the training data. Mostly in lecture we talked about **word bigrams**. A character bigram is computed on characters instead of words. You should use the simple Bayesian Unigram Prior smoothing method.

For each test document in the dev subfolder, for each of your 6 bigram grammars, you compute the log-likelihood of the test document given the bigram grammar (use the log-likelihood instead of the likelihood since it's less likely to underflow). Then you choose as your answer for that document the language that gave the highest log-likelihood.

Here's the formal description of the equations you should be computing. First, you want to pick the language, out of the 6 languages, which assigns the highest log probability to the document:

$$\hat{L} = \underset{L \in \mathcal{L}}{\mathrm{argmax}} ( log P_L(Document) )$$

To compute the log probability for each language, you make the Markov (N-gram) assumption, and use a bigram grammar that has been trained on that language: 

$$log P_L(Document) = log P_{smooth}(char^n_1) \approx \prod_{i=1}^n log P_{smooth}(char_i | char_{i-1})$$

(That was the equation in log-space; in non-log space it would be:)

$$P_L(Document) = P_{smooth}(char^n_1) \approx \prod_{i=1}^n P_{smooth}(char_i | char_{i-1})$$

Don't forget to add some sort of special START and END characters at the beginning and end of the file.

To train your bigram grammars, use Bayesian Unigram Prior smoothing:

$$P_{smooth}(char_i | char_{i-1}) = \frac{C(char_{i-1},char_i) + P(char_i)}{C(char_{i-1}) + 1}$$

Please develop your solution in an iPython notebook using the text in the train subfolder. Then test your models on data in the dev subfolder.

The data is in UTF-8 format.

## Train

In [1]:
from __future__ import print_function
from collections import Counter

import codecs
import numpy as np
import os

TRAIN_FOLDER = "train"  # folder with training files
START_CHAR = "@"  # unique start character - for bigrams
END_CHAR = "#"  # unique end character - for bigrams

print("Training files: ", os.listdir(TRAIN_FOLDER))

Training files:  ['esper.txt', 'eng.txt', 'dut.txt', 'frn.txt', 'spn.txt', 'ger.txt']


In [2]:
"""Read in the text for every training file. Each training file corresponds to a language."""
train_corpora = {}

# read in the training files
for train_file in os.listdir(TRAIN_FOLDER):
    train_path = os.path.join(TRAIN_FOLDER, train_file)  # full path to training file
    language, ext = os.path.splitext(train_file)  # remove extension from file name
    try:
        corpus_text = codecs.open(train_path, encoding='utf-8').read()  # read unicode
        corpus_text = START_CHAR + corpus_text + END_CHAR  # add start and end characters
        train_corpora[language] = dict(text=corpus_text)  # store text for this language
    except:
        print("ERROR: Could not read text file \"%s\"" % train_file)
        continue

In [3]:
"""count unigram and bigram frequencies for each language"""
for language in train_corpora.keys():
    corpus = train_corpora[language]['text']
    
    unigram_counts = Counter(corpus)
    bigram_counts = Counter(zip(corpus[:-1], corpus[1:]))
    
    N = sum(unigram_counts.values())  # Number of characters in corpus
    V = len(unigram_counts.keys())  # Vocabulary size, number of unique letters in corpus
    
    unigram_probs = {unigram: count / float(N) \
                     for unigram,count in unigram_counts.iteritems()}  # turn unigram counts into probabilities
    
    # Store these properties in the dictionary for this language
    train_corpora[language]['N'] = N
    train_corpora[language]['V'] = V
    
    train_corpora[language]['unigram_counts'] = unigram_counts
    train_corpora[language]['unigram_probs'] = unigram_probs
    
    train_corpora[language]['bigram_counts'] = bigram_counts

In [4]:
def get_p_smooth(a, b, L):
    """
    Returns the probability of seeing character b after character a,
    conditioned on language model L.
    Uses Bayesian Unigram Prior smoothing for the bigram probability
    """
    global train_corpora
    
    if L not in train_corpora.keys():
        raise ValueError("No language model exists for language %s" % str(L))
    
    corpus = train_corpora[L]
    
    N = corpus['N']
    unigram_probs = corpus['unigram_probs']
    unigram_counts = corpus['unigram_counts']
    bigram_counts = corpus['bigram_counts']

    p_b = unigram_probs.get(b, 1.0/N)  # unigram probability of b
    c_a_b = bigram_counts.get((a,b), 1)  # frequency of bigram (a,b) in this language's training corpus
    c_a = unigram_counts.get(a, 1)  # frequency of character a in this language's training corpus
    
    p_smooth = (c_a_b + p_b) / float(c_a + 1)
    
    return p_smooth

## Test

In [5]:
TEST_FOLDER = "dev"

os.listdir(TEST_FOLDER)

['esper.txt', 'eng.txt', 'dut.txt', 'frn.txt', 'spn.txt', 'ger.txt']

In [6]:
def identify_language(document):
    """Takes in a document string, and classifies the document into one of the languages trained on."""
    
    document = START_CHAR + document + END_CHAR  # add start/end tags
    
    # Calculate log probabilities for every language L
    log_probs = {}
    for L in train_corpora.keys():
        log_p_L = 0  # log probability of the document belonging to this language
        for bigram in zip(test_txt[:-1], test_txt[1:]):
            log_p_L += np.log(get_p_smooth(bigram[0], bigram[1], L))  # sum log probs of all bigrams
        log_probs[L] = log_p_L
    
    print(log_probs)
    
    # get argmax
    L_hat = max(log_probs, key=log_probs.get)
    
    return L_hat

In [7]:
for test_file in os.listdir(TEST_FOLDER):
    test_path = os.path.join(TEST_FOLDER, test_file)
    
    with open(test_path) as f:
        test_txt = f.read()
    
    estimated_language = identify_language(test_txt)
    
    print("Test file \"%s\" is most likely written in: %s" % (test_file, estimated_language))
    print("")

{'ger': -5156.617835253816, 'eng': -4830.199488649011, 'esper': -4022.7257458963654, 'frn': -4745.852657185119, 'spn': -4422.47247751263, 'dut': -5401.615711982561}
Test file "esper.txt" is most likely written in: esper

{'ger': -5233.459268912499, 'eng': -4294.395526807493, 'esper': -5001.168904834336, 'frn': -5059.756437228359, 'spn': -5102.926309932263, 'dut': -5020.633862095518}
Test file "eng.txt" is most likely written in: eng

{'ger': -5482.514000130293, 'eng': -5571.293869702195, 'esper': -6126.451056949624, 'frn': -6086.481864218604, 'spn': -6033.183813051615, 'dut': -4899.99404438128}
Test file "dut.txt" is most likely written in: dut

{'ger': -5934.146308425525, 'eng': -5607.85379611641, 'esper': -5623.5734370082055, 'frn': -4932.910949926944, 'spn': -5614.713920954162, 'dut': -5942.935660522131}
Test file "frn.txt" is most likely written in: frn

{'ger': -6008.125468104433, 'eng': -5641.382559918243, 'esper': -5189.878065908862, 'frn': -5553.055824450768, 'spn': -4885.09306

