# Assignment: Language identification with trigrams

Languages differ in their syllable structure, spelling conventions and common vocabulary. 
As a result, letter sequences from each language have distinct, characteristic frequencies 
that allow us to identify the language of a text on the basis of its ngram profile.

In this assignment we build a trigram language recognizer. The recognizer works by comparing the ngram profile of a test document against a collection of ngram profiles for known languages, computed and saved in advance.

## Contents


**[1. Overview](#1.-Overview)**  

**[2. General requirements](#2.-General-requirements)**  

**[3. Preparation: The data files](#3.-Preparation:-The-data-files)**  

**[4. The `langdetect` module](#4.-The-langdetect-module)**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [4.1 `prepare()`](#4.1-prepare%28%29)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [4.2 `trigrams()`](#4.2-trigrams%28%29)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [4.3 `trigram_table()`](#4.3-trigram_table%28%29)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [4.4 `read_trigrams()` and `write_trigrams()`](#4.4-read_trigrams%28%29-and-write_trigrams%28%29)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [4.5 `cosine_similarity()`](#4.5-cosine_similarity%28%29)  

**[5. Script 1: Create profiles](#5.-Script-1:-Create-profiles)**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [5.1 The data](#5.1-The-data)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [5.2 The function `make_profiles()`](#5.2-The-function-make_profiles%28%29)  

**[6. Script 2: The language recognizer](#6.-Script-2:-The-language-recognizer)**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [6.1 The `LangMatcher` class](#6.1-The-LangMatcher-class)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [6.2 The recognizer main script](#6.2-The-recognizer-main-script)  

**[7. Script 3: Evaluation](#7.-Script-3:-Evaluation)**  

**[8. Submission](#8.-Submission)**  


## 1. Overview

Your trigram recognizer will consist of four files. Step-by-step specifications are given below. **Study the specifications carefully and make sure your work meets all the requirements.**

This overview summarizes the sections that follow. If there seems to be disagreement between this summary and the detailed description, follow the detailed description.


### Core functionality: The module `langdetect.py`

This module will define some important core functions, and any other helper functions you find useful. They include:

- `trigram_table()`: Calculate and _return_ a trigram frequency table from a string.
- `read_trigrams()` and `write_trigrams()`: Functions that read, resp. write, a list of trigrams and their frequencies.
- `cosine_similarity()`: Return the cosine similarity score between two frequency tables.


### Three scripts: Create profiles, match languages

All three scripts should import necessary functions from the module `langdetect`. Do not copy-paste code or reimplement any operations that `langdetect.py` can already handle.

1. `write_profiles.py`: reads a directory of multilingual files and saves trigram frequency tables for later use. 

2. `matchlang.py`: The language recognizer. This is a general-purpose program with command-line arguments and options. It can also be imported as a module, allowing the recognizer to be used by another program.

3. `evaluate.py`: This script will apply the recognizer to three collections of test files, and measure and report the success of the recognition process.

## 2. General requirements

Read the instructions carefully. Many of the necessary concepts and 
skills are covered in the forthcoming reading and practica (up to the 
deadline).

**Credit.** Projects will be graded on function (correct results, design according
to the specification) as well as form (organization and appropriateness 
of the code, documentation). **Test** the parts of your program to
ensure that they work as intended. 

**Specification.** Ensure that your modules and functions **exactly** meet the specifications given here, including naming and arguments. (Extra _optional_ arguments are fine.) Adherence to specifications is essential for the components of a complex system to work together. You are free (and encouraged) to structure your code by defining additional functions, optional arguments, etc. If you wish to go beyond the requirements of the assignment, you are welcome to. Just ensure that your additions do not conflict with the specifications. 

**Module structure.** When imported, your modules should only provide definitions; they must not read or write data files, generate output, or carry out any lengthy calculations. Modules intended to be used as stand-alone scripts must have a conditional section that executes the code. (For modules that don't have such use, you are free to add a conditional section to help with development and debugging.) 
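
The conditional section mentioned above is usually written with the `__name__` check. The following is a minimal sketch; the names `greet()` and `main()` are hypothetical and only illustrate the layout, they are not part of the assignment.

```python
import sys


def greet(name):
    """Return a greeting. Importing this module only defines the function."""
    return f"Hello, {name}!"


def main(args):
    """Run the script logic on a list of command-line arguments."""
    for name in args:
        print(greet(name))


if __name__ == "__main__":
    # This section runs only when the file is executed as a script,
    # not when it is imported as a module.
    main(sys.argv[1:])
```

Importing such a file produces no output; running it as a script prints one greeting per argument.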

**Paths.** Do not use absolute paths for files. Use arguments rather than hard-coded names wherever possible. Use forward slashes ("/") to separate components. Unlike backslashes ("\"), they do not need escaping and they work on both Windows and Unix (OS X, Linux) systems.

**Form.** Follow our recommended best practices on variable naming, line length, and "docstrings" for all functions and modules. Functions should avoid relying on global variables; use function arguments to pass information. Avoid code repetition; if you find it useful to copy and paste your code to more than one place, you should factor it out into a function. Finally, if you are aware that some part of your code has a problem but are unable to fix it, document the problem in a comment.

**Function.** Your code must produce correct results, and it should also avoid excessively long running times. Long running times usually indicate a mistake in your algorithm, or an unsuitable data structure somewhere.


Finally, be aware that **your code will be evaluated with a different data set.** Do not hard-code paths and filenames in your code, except as directed. Ensure that your code pays attention to function arguments and command-line arguments. Make sure that the code runs correctly when a non-default value for any argument is used.

## 3. Preparation: The data files

Download the file `datafiles.zip`, unzip it and move the subdirectories `training` and `test-clean` to the folder that will contain your scripts. 

### Automated tests

This notebook contains snippets of test code. If you also place it in the same folder as your code and data folders, you can check your work by running the test cells. Obviously, **you must write the code before you can run the tests.** 

**Important:** Python will only import your modules once. If you modify your files, you must restart the notebook kernel for the changes to take effect. You can use the keyboard shortcut `<Escape> 0 0` to restart the kernel (possibly followed by `Control-Return` to rerun the current cell), or copy the tests into their own script that you can run separately.

The tests are only partial: **Passing all the tests doesn't guarantee that your code is correct in all respects.** Also, the tests do not check things like documentation and code style. 

## 4. The `langdetect` module

Create a Python module `langdetect.py` that defines the following functions:

### 4.1 `prepare()`

The function `prepare(text)` takes a string, replaces the characters `!?",.()<>` with spaces, and then splits the new string on whitespace and returns the resulting list of words.
Run the following code to check if your function seems to work correctly.

In [1]:
# Demo:
from langdetect import prepare
tokens = prepare('This is <cough,cough> "HAL-9000".  Don\'t touch!')
print(tokens)

# Test:
if tokens == ['This', 'is', 'cough', 'cough', 'HAL-9000', "Don't", 'touch']:
    print("ok")
else:
    import sys
    print("Incorrect result", file=sys.stderr)

['This', 'is', 'cough', 'cough', 'HAL-9000', "Don't", 'touch']
ok
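
One possible way to do the replacement is a translation table. This is only a sketch of the technique under the specification above, not the required implementation; any approach that produces the same list is fine.

```python
def prepare(text):
    """Replace the characters !?",.()<> with spaces and split on whitespace."""
    # str.maketrans builds a table that maps each listed character to a space.
    table = str.maketrans({char: " " for char in '!?",.()<>'})
    # split() without arguments splits on any run of whitespace.
    return text.translate(table).split()
```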


### 4.2 `trigrams()`

The function `trigrams(seq)` takes any sequence (e.g., a string) and returns a list of its trigrams. It must not apply any padding, splitting or other modifications to its argument. Example and test:

In [2]:
from langdetect import trigrams
tr = trigrams("R2.D2")
print(tr)
        
# Test:
if tr == ['R2.', '2.D', '.D2']:
    print("ok")
else:
    import sys
    print("Incorrect result", file=sys.stderr)

['R2.', '2.D', '.D2']
ok
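
Because the function must work on any sequence, slicing is a natural fit. The sketch below is one possible approach, assuming that a sequence shorter than three elements simply has no trigrams.

```python
def trigrams(seq):
    """Return a list of all length-3 slices of seq, in order."""
    # range() is empty when len(seq) < 3, so short inputs give an empty list.
    return [seq[i:i + 3] for i in range(len(seq) - 2)]
```

Slicing keeps the type of the argument, so a string yields strings and a list yields lists.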


### 4.3 `trigram_table()`

We will represent a table of trigram frequencies as a dictionary whose keys are ngrams and whose values are frequencies (ngram counts). The function `trigram_table(text, limit)` takes two arguments: a string to process and the _optional_ argument `limit`, giving the number of most frequent ngrams to return. The default value must be 0, which means all ngrams should be returned. This function should:

1. Use `prepare()` to clean and tokenize the text.

2. Surround each tokenized word with the characters `<` and `>` (angle brackets). E.g., `"it's"` becomes `"<it's>"`.

3. Extract the trigrams of each word, and use a dictionary (either an ordinary `dict` or `collections.Counter`) to count how often each trigram occurs.

4. If `limit` is 0, return the entire dictionary.

5. Otherwise `limit` must be an integer that tells us how many ngrams to return (starting from the most frequent). Create and return a new dictionary containing this number of keys and values.

#### **Notes**

* If the cutoff point falls among several ngrams with the same frequency, the selection is arbitrary (and might vary between program runs).

* One way to select a subset of the entries in a dictionary is to convert it to a list of ordered pairs, sorted by frequency, and create a new dictionary from a slice of that list. You are free to use any approach.
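
The selection step described in the notes can be sketched with `collections.Counter`, whose `most_common()` method returns pairs sorted by descending count. The helper name `most_frequent()` is hypothetical; it only illustrates the cutoff and is not a required function.

```python
from collections import Counter


def most_frequent(counts, limit=0):
    """Return the `limit` most frequent entries of counts as a new dict.

    A limit of 0 means "no cutoff": all entries are returned.
    """
    if limit == 0:
        return dict(counts)
    # most_common(limit) gives a list of (key, count) pairs, highest count first.
    return dict(Counter(counts).most_common(limit))


letters = Counter("abracadabra")   # a: 5, b: 2, r: 2, c: 1, d: 1
print(most_frequent(letters, 3))   # the entries for a, b and r
```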

**Example and test:**

In [1]:
from langdetect import trigram_table

# Ask for the 4 most frequent trigrams; each of them occurs twice.
top = trigram_table("hiep, hiep, hoera!", 4)
print(top)

# Test:
if top == {'<hi': 2, 'hie': 2, 'ep>': 2, 'iep': 2}:
    print("ok")
else:
    import sys
    print("Incorrect result", file=sys.stderr)


### 4.4 `read_trigrams()` and `write_trigrams()`

The function `write_trigrams(table, filename)` takes an ngram table as generated by `trigram_table()` and a filename, and writes the ngrams to that file, in this format:

    2 iep 
    2 ep>
    1 <Hi
    1 hoe
    1 <ho
    ...
    
The output must use the `utf8` encoding, and must be in descending order of frequency.

The function `read_trigrams(filename)` must reverse the process: It opens the file for reading with `utf8` encoding, reads the file's contents and converts them to a dictionary whose values are **integers**, not strings.
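
As an illustration of the file format, here is a minimal sketch of both functions. It assumes that trigrams never contain whitespace (which holds for tables built from `prepare()` output), so the first space on each line separates the count from the trigram.

```python
def write_trigrams(table, filename):
    """Write table to filename, one "count trigram" line each, most frequent first."""
    ranked = sorted(table.items(), key=lambda item: item[1], reverse=True)
    with open(filename, "w", encoding="utf8") as outfile:
        for trigram, count in ranked:
            outfile.write(f"{count} {trigram}\n")


def read_trigrams(filename):
    """Read a file written by write_trigrams() into a dict with integer values."""
    table = {}
    with open(filename, encoding="utf8") as infile:
        for line in infile:
            # Split on the first space only; the rest of the line is the trigram.
            count, trigram = line.rstrip("\n").split(" ", 1)
            table[trigram] = int(count)
    return table
```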

Check that you can "round-trip" your data with these functions, i.e. read back exactly what you wrote out:

In [3]:
from langdetect import trigram_table, read_trigrams, write_trigrams

text = "Het valt voor, dat bij één roveroverval, één rover voorover over één roverval valt."
table = trigram_table(text, 10)
print(table)
write_trigrams(table, "rover.10.TEMP")
reread_table = read_trigrams("rover.10.TEMP")

if table == reread_table:
    print("ok")
else:
    import sys
    print("The round trip fails", file=sys.stderr)

{'ove': 6, 'ver': 6, 'rov': 5, 'val': 4, '<éé': 3, 'één': 3, 'én>': 3, '<ro': 3, 'er>': 3, '<va': 2}



### 4.5 `cosine_similarity()`

The function `cosine_similarity(known, unknown)` takes two arguments, which must both be dictionaries of ngram frequencies: one of the reference profiles of a known language, and the ngrams for the unknown text we want to identify. It must compute and return the "cosine similarity" metric between the two tables.

The formula of cosine similarity between *a* and *b* is as follows:

$\cos(a,b)=\frac{a\cdot b}{\mathrm{magnitude}(a)\times\mathrm{magnitude}(b)}$

where $a\cdot b$ is the _dot product_ of $a$ and $b$, which equals $\sum_i a_i b_i$, and $\mathrm{magnitude}(x)=\sqrt{\sum_i x_i^2}$.

(See the [Wikipedia definition][1] for the formula, or [this page][2] for a detailed geometric explanation).

In general, the cosine similarity metric ranges between `-1.0` and `1.0`; because ngram frequencies are never negative, the scores computed here fall between `0.0` and `1.0`. Larger is better (smaller angle, i.e. more similar vectors). An ngram table compared to itself should return cosine `1.0`. Unrelated ngram tables should return a cosine of around `0`.

If an ngram does not appear in an ngram table, it has frequency 0 in that table and contributes nothing to the dot product (the numerator) or to that table's vector magnitude. Trigrams that occur in neither table therefore do not affect the cosine measure at all, and our calculations need only consider trigrams that occur in at least one of the tables: The dot product is the sum over trigrams that are present on both sides (all other products are zero), and the magnitude of each vector (in the denominator) is calculated from its own trigrams.

For example, the two tables in the test below share only the trigram `<he`, so their dot product is $2 \times 2 = 4$, and each vector has magnitude $\sqrt{2^2+1^2}=\sqrt{5}$. The cosine score works out to $4/(\sqrt{5}\times\sqrt{5})=4/5$.

[1]: https://en.wikipedia.org/wiki/Cosine_similarity#Definition
[2]: http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/

In [0]:
from langdetect import cosine_similarity

table1 = { "<he": 2, "het": 1 }
table2 = { "<he": 2, "hem": 1 }
score = cosine_similarity(table1, table2)

# The score *should be* 0.8, but we get floating point error
print(score)

if abs(score - 4/5) < 0.000001:
    print("ok")
else:
    import sys
    print("Incorrect result", file=sys.stderr)
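
The calculation described above can be sketched as follows. Returning `0.0` when one of the tables is empty is an assumption of this sketch (the specification does not say what to do in that case); it avoids a division by zero.

```python
from math import sqrt


def cosine_similarity(known, unknown):
    """Return the cosine similarity of two ngram frequency dictionaries."""
    # Only trigrams present in both tables contribute to the dot product.
    dot = sum(count * unknown[tri] for tri, count in known.items() if tri in unknown)
    # Each magnitude is computed from the table's own counts.
    magnitude_known = sqrt(sum(count * count for count in known.values()))
    magnitude_unknown = sqrt(sum(count * count for count in unknown.values()))
    if magnitude_known == 0 or magnitude_unknown == 0:
        return 0.0  # assumption: an empty table matches nothing
    return dot / (magnitude_known * magnitude_unknown)
```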

## 5. Script 1: Create profiles

We're now ready to deal with collecting our trigram statistics. The script `write_profiles.py`  must define a function `make_profiles()` that accepts three arguments: The path to a folder containing the multilingual reference corpus, the path to an (empty) folder in which to write the trigram tables, and the (maximum) size of the trigram tables.

When imported as a module, `write_profiles.py` should not read or write any files. When executed as a stand-alone script, it should run a single command:

    make_profiles("./training", "./trigram-models", 200)

### 5.1 The data

Create the output folder `trigram-models` by hand, or make your program create it if it does not exist.

The folder `training` contains a number of files with names like these:

        Afrikaans-Latin1
        Aguaruna-Latin1
        Albanian-UTF8
        (etc.)
        
The part before the hyphen is the language name. The part after the hyphen is the encoding. 


### 5.2 The function `make_profiles()`

The function `make_profiles(datafolder, profilefolder, size)` must loop over a list of the files in `datafolder` and, for each file:

  1. Split the filename into language name and encoding.
  2. Read the file, using the appropriate encoding.
  3. Generate a table of the `size` (e.g., 200) most frequent ngrams.
  4. Write out the table to a file in the `profilefolder`, named like this: `Afrikaans.200`.
  
### Notes

* Use the functions of the `langdetect` module; do not reinvent any wheels!

* All profiles must be saved in the `utf-8` encoding, regardless of the input's encoding.
    
* Check that your function works correctly if called with different directory names for input and output.
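
The file-handling part of these steps can be sketched as below. The helper names `read_corpus_files()` and `profile_path()` are hypothetical, and the sketch assumes that the last hyphen in a filename separates the language from the encoding and that the encoding part is a name Python's `open()` understands (as `Latin1` and `UTF8` are). Computing and writing the tables is left to the `langdetect` functions.

```python
import os


def read_corpus_files(datafolder):
    """Yield (language, text) for each file named Language-Encoding in datafolder."""
    for filename in sorted(os.listdir(datafolder)):
        if "-" not in filename:
            continue  # assumption: skip files that do not follow the naming scheme
        language, encoding = filename.rsplit("-", 1)
        with open(f"{datafolder}/{filename}", encoding=encoding) as infile:
            yield language, infile.read()


def profile_path(profilefolder, language, size):
    """Return the output path for a profile, e.g. trigram-models/Afrikaans.200."""
    return f"{profilefolder}/{language}.{size}"
```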

In [4]:
 %run write_profiles.py



## 6. Script 2: The language recognizer

To use our collection of trigram profiles for language identification, we will first create a dedicated class to manage them easily. Then we'll specify a command line interface for our recognizer. 

As usual, importing the script `matchlang.py` as a module should do no input, output, or long computations. When run as the main program, it must look for command line arguments and try to identify the language of the specified file.


### 6.1 The `LangMatcher` class

The `LangMatcher` class will allow a text to be compared against multiple language profiles, in order to find the best match. It should have the following behavior:

1. The initializer must have one required argument: The path to a directory with the saved trigram profiles. The initializer should define and store a dictionary whose keys are language names, and each value is itself a dictionary whose keys are ngrams and whose values are frequencies. These are read from the profile file with `langdetect.read_trigrams()`. (Use just the language name as the key, not the complete file name.)

2. `score(text, n=1, ngrams=200)`: Here, `text` is a string that has not yet been cleaned up or tokenized. Compute its ngram table, limited to the `ngrams` most frequent ngrams (default 200), and compare it against each of the languages in the model dictionary. Compile a list of the language names and cosine similarity scores, and return the names and similarity scores for the `n` best matches.

3. `recognize(filename, encoding="utf-8", ngrams=200)`: A convenience method: It opens the specified file, calls `score()` on its contents, and returns the top result (i.e., the name of the highest matching language and the similarity score).
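
Two small pieces of this class can be sketched in isolation. The helper names `language_name()` and `best_matches()` are hypothetical, and returning the matches as a list of `(language, score)` pairs is an assumption, since the specification does not fix the return format.

```python
import os


def language_name(profile_filename):
    """Strip the size suffix from a profile name: Afrikaans.200 -> Afrikaans."""
    return os.path.splitext(profile_filename)[0]


def best_matches(scores, n=1):
    """Return the n (language, score) pairs with the highest scores, best first."""
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


scores = {"Dutch": 0.73, "Danish": 0.41, "German": 0.55}
print(best_matches(scores, 2))
```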

### 6.2 The recognizer main script
 
When run as a main program, `matchlang.py` should do the following things:

1. Accept one **or more** command line arguments; each is the path of a file whose language we want to identify. If there are **no** command line arguments, the program can do whatever you want. (It is a good idea to specify some default filenames to help with development and testing. Alternatively, you can do nothing, or print an error message.)

2. If the option `-e` is given, its argument specifies the encoding to use for all test files. (E.g., `-e latin1`).  The default encoding is `utf-8`.

**Program execution:**

1. Initialize a `LangMatcher` object, filling it with all the profiles from the directory `trigram-models`.

2. For each file specified on the command line, print out a line containing the file name, the most similar language, and its similarity score. 
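
The command line interface described above could be handled with the standard `argparse` module. This is a sketch of one option, not a requirement; the function name `build_parser()` and the attribute names `files` and `encoding` are choices made for this illustration.

```python
import argparse


def build_parser():
    """Return a parser for zero or more file paths and an optional -e encoding."""
    parser = argparse.ArgumentParser(description="Identify the language of text files.")
    parser.add_argument("files", nargs="*",
                        help="paths of the files whose language we want to identify")
    parser.add_argument("-e", dest="encoding", default="utf-8",
                        help="encoding to use for all test files (default: utf-8)")
    return parser


args = build_parser().parse_args(["-e", "latin1", "a.txt", "b.txt"])
print(args.encoding, args.files)
```

Passing an explicit list to `parse_args()` is convenient for testing; called without arguments, it reads the real command line.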

The following code cell uses notebook "magic" commands (i.e., not Python) to run `matchlang.py` as a command line script in the directory that contains this notebook.

In [6]:
%run matchlang.py "europarl-90/ep-00-02-03.nl"

%run matchlang.py europarl-30/*.da



europarl-90/ep-00-02-03.nl	Dutch	0.7273736039378357
europarl-30/ep-00-01-17.da	Danish	0.3726819571803258
europarl-30/ep-00-01-18.da	Danish	0.46092073189990596
europarl-30/ep-00-01-21.da	Danish	0.4373769273603376


## 7. Script 3: Evaluation

The script `evaluate.py` assesses the performance of the language recognition.
Since the tests are fixed, the directory names can be hard-coded in this file. 

We'll test our system on fragments from translations of European Parliament proceedings (`europarl` corpus). The algorithm works so well on clean, monolingual text that mistakes are only likely with very short text fragments. Here we evaluate collections of randomly selected fragments that are 90, 30, and 10 words long. They are in the folders `europarl-90`, `europarl-30`, and `europarl-10`, respectively. All files are in `utf-8` format. The language of the fragment is encoded in the file names, but not in the same way as the training data: Each filename has a suffix indicating the ISO language code, e.g. `ep-00-02-02.de` (suffix `de`, German). The following codes are used:

        da Danish
        de German
        el Greek
        en English
        es Spanish
        fi Finnish
        fr French
        it Italian
        nl Dutch
        pt Portuguese
        sv Swedish

Convert the above table into a dictionary, and use it to automatically check if the language returned by the recognizer was correct.
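
The lookup can be sketched as follows. The names `LANGUAGE_CODES` and `expected_language()` are hypothetical; the dictionary content is taken directly from the table above.

```python
import os

LANGUAGE_CODES = {
    "da": "Danish", "de": "German", "el": "Greek", "en": "English",
    "es": "Spanish", "fi": "Finnish", "fr": "French", "it": "Italian",
    "nl": "Dutch", "pt": "Portuguese", "sv": "Swedish",
}


def expected_language(filename):
    """Return the language name encoded in a file name such as ep-00-02-02.de."""
    # splitext() returns the suffix with its leading dot, e.g. ".de".
    suffix = os.path.splitext(filename)[1].lstrip(".")
    return LANGUAGE_CODES[suffix]


print(expected_language("europarl-90/ep-00-02-03.nl"))
```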

Write a function `eval(path)` that is given the pathname of a collection and performs language identification on each file of the collection. Print the name of each file and the recognizer's guess, and add the word "ERROR" and the correct language after incorrect guesses. Keep a tally of the number of correct and incorrect results, and report the totals at the end of each collection (separately for each collection).

In the  main script, use your function `eval()` to evaluate the collections `europarl-90`, `europarl-30`, and `europarl-10`. 

In [11]:
%run evaluate.py


## 8. Submission

When you are done, upload your four Python files. If you wish to add any comments about problems or extensions to the tasks, add a file `README.txt` with the additional information.

### Submission Checklist
Before submission, please make sure you did not forget to include important parts of your code:
- `langdetect.py` : core functions, and any other helper functions you find useful, including:
    - `prepare()`
    - `trigrams()`
    - `trigram_table()`
    - `read_trigrams()`
    - `write_trigrams()`
    - `cosine_similarity()`
    
- `write_profiles.py`: including function
    -  `make_profiles()`
    
- `matchlang.py`: The language recognizer. This is a general-purpose program with command-line arguments and options.
    - `LangMatcher` class with `score()` and `recognize()` methods
    - main script

- `evaluate.py`: evaluates the recognizer on `europarl-90`, `europarl-30`, and `europarl-10`.
    - `eval()`
    -  main script