# Assignment: Language identification with ngrams

Languages differ in their syllable structure, spelling conventions and common vocabulary.  As a result, letter sequences from each language have distinct, characteristic frequencies that allow us to identify the language of a text on the basis of its ngram profile.

In this assignment we build an ngram language recognizer. The recognizer works by comparing the ngram profile of a test document against a collection of ngram profiles for known languages, computed and saved in advance.

## Contents


**[1. Overview](#1.-Specification-Overview)**  

**[2. General requirements](#2.-General-requirements)**  

**[3. Preparation: The data files](#3.-Preparation:-The-data-files)**  

**[4. The `langdetect` module](#4.-The-langdetect-module)**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [4.1 `prepare` function](#4.1-prepare-function)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [4.2 `ngrams` function](#4.2-ngrams-function)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [4.3 `ngram_table` function](#4.3-ngram_table-function)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [4.4 `read_ngrams` and `write_ngrams` functions](#4.4-read_ngrams-and-write_ngrams-functions)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [4.5 `cosine_similarity`](#4.5-cosine_similarity-function)  

**[5. Script 1: Create profiles](#5.-Script-1:-Create-profiles)**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [5.1 The data](#5.1-The-data)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [5.2 The function `make_profiles`](#5.2-The-function-make_profiles)  

**[6. Script 2: The language recognizer](#6.-Script-2:-The-language-recognizer)**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [6.1 The `LangMatcher` class](#6.1-The-LangMatcher-class)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [6.2 The recognizer main script](#6.2-The-recognizer-main-script)  

**[7. Script 3: Evaluation](#7.-Script-3:-Evaluation)**  

**[8. Report](#8.-Report)**  

**[9. Submission](#9.-Submission)**  


## 1. Specification Overview

Your ngram recognizer will consist of four files. Step by step specifications are given below. **Study the specifications carefully and make sure your work meets all the requirements.**

This overview summarizes the sections that follow. If there seems to be disagreement between this summary and the detailed description, follow the detailed description.


### Core functionality: The module `langdetect.py`

This module will define some important core functions, and any other helper functions you find useful. They include:

- `ngram_table`: Calculate and _return_ an ngram frequency table from a string.
- `read_ngrams` and `write_ngrams`: Read and write, respectively, a list of ngrams and their frequencies.
- `cosine_similarity`: Return the cosine similarity score between two frequency tables.


### Three scripts: Create profiles, match languages

All three scripts should import necessary functions from the module `langdetect`. Do not copy-paste code or reimplement any operations that `langdetect.py` can already handle.

1. `write_profiles.py`: reads a directory of multilingual files and saves ngram frequency tables for later use. 

2. `match_language.py`: The language recognizer. This is a general-purpose program with command-line arguments and options. It can also be imported as a module, allowing the recognizer to be used by another program.

3. `evaluate.py`: This script will apply the recognizer to collections of test files, and measure and report the success of the recognition process.

## 2. General requirements

Read the instructions carefully. Many of the necessary concepts and 
skills are covered in the forthcoming reading and practica (up to the 
deadline).

**Credit:** Projects will be graded on function (correct results, design according
to the specification) as well as form (organization and appropriateness 
of the code, documentation). **Test** the parts of your program to
ensure that they work as intended. However, don't include any tests in the final version, beyond what is specified in the assignment.

**Specification:** Ensure that your modules and functions **exactly** meet the specifications given here, including naming and arguments. (Extra _optional_ arguments are fine.) Adherence to specifications is essential for the components of a complex system to work together. You are free (and encouraged) to structure your code by defining additional functions, optional arguments, etc. If you wish to go beyond the requirements of the assignment, you are welcome to. Just ensure that your additions do not conflict with the specifications. 

**Module structure:** When imported, your modules should only provide definitions; they must not read or write data files, generate output, or carry out any lengthy calculations. Modules intended to be used as stand-alone scripts must have a conditional section that executes the code. (For modules that don't have such use, you are free to add a conditional section to help with development and debugging.) This means that the only places where you'll _call_ a function (as opposed to defining it) are inside another function's definition or inside an `if __name__ == "__main__"` statement.
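
As a minimal sketch of this layout (the function name `main` and its return value are illustrative assumptions, not part of the specification), a script module might be organized like this:

```python
"""Hypothetical script module: importing it only defines names."""


def main():
    """Do the script's real work; file I/O and printing would go here."""
    return "ran as a script"


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when imported.
    print(main())
```

Importing this file elsewhere defines `main` but produces no output; running it directly prints the message.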

**Paths:** Do not use absolute paths for files. Use arguments rather than hard-coded names wherever possible. Use forward slashes ("/") to separate components. Unlike backslashes ("\\"), they do not need escaping and they work on both Windows and Unix (OS X, Linux) systems. (Your code should work on our computers; if you hard-code "/Users/janedoe/Documents/CL/", we can't use your code.)

**Form:** 

* Follow our recommended best practices on variable/function/class naming and comments. 
* Try to keep your lines at most 120 characters long. (Technically the standard is 80, but this is becoming less common, and 120 is more readable; it's the default in PyCharm, for instance.)
* Include docstrings for **all** functions, classes, and modules. 
* Functions should avoid relying on global variables; use function arguments to pass information. 
* Avoid code repetition; if you find it useful to copy and paste your code to more than one place, you should factor it out into a function. 
* If you use "static" functions in your classes (these are functions that exceptionally do NOT take `self` as their first argument and can be called without an instance of the class), please _decorate_ them with `@staticmethod`. (We haven't used these in class, and they are not needed in the assignment, but they are permitted.)
* If you are aware that some part of your code has a problem but are unable to fix it, document the problem in a comment and in the report. Try to set up your code so that if the module is imported or the script is run, it doesn't throw an error (even if it doesn't behave correctly). This might mean commenting out some malfunctioning code. Don't delete it, but leave it commented out and direct us to it in a comment and in the report.

**Function:** In addition to producing correct results, beware of excessively long running times. They indicate a mistake in your algorithm, or an unsuitable data structure somewhere. Specifically, unless your computer is very slow, these scripts should run in a few seconds at most.

For each Python file, please include the names of the authors in the module docstring.

Finally, be aware that **your code will be evaluated with a different data set.** Do not hard-code paths and filenames in your code, except as directed. Ensure that your code pays attention to function arguments and command-line arguments. Make sure that the code runs correctly when a non-default value for any argument is used.

## 3. Preparation: The data files

Download the file `datafiles.zip` and unzip it. These instructions assume you have a directory in the same directory as your python files called `datafiles`, containing `training` and `test`.

### Automated tests

This notebook contains snippets of test code. If you also place it in the same folder as your code and data folders, you can check your work by running the test cells. Obviously, **you must write the code before you can run the tests.** 

**Important:** Python will only import your modules once. If you modify your files, you must restart the notebook kernel for the changes to take effect. You can use the keyboard shortcut `<Escape> 0 0` to restart the kernel (possibly followed by `Control-Return` to rerun the current cell), or copy the tests into their own script that you can run separately.

The tests are only partial: **Passing all the tests doesn't guarantee that your code is correct in all respects.** Also, the tests do not check things like documentation and code style. 

## 4. The `langdetect` module

Create a Python module `langdetect.py` that defines the following functions:

### 4.1 `prepare` function

The function `prepare(text)` takes a string, replaces the characters `!?",.()<>` with spaces, and then splits the new string on whitespace and returns the resulting list of words.
Run the following code to check if your function seems to work correctly.

In [1]:
# Demo:
from langdetect import prepare
tokens = prepare('This is <cough,cough> "HAL-9000".  Don\'t touch!')
print(tokens)

# Test:
if tokens == ['This', 'is', 'cough', 'cough', 'HAL-9000', "Don't", 'touch']:
    print("ok")
else:
    import sys
    print("Incorrect result", file=sys.stderr)

['This', 'is', 'coughcough', 'HAL-9000', "Don't", 'touch']


Incorrect result
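
One possible way to do the character replacement is with `str.translate`. The sketch below is only an illustration under that assumption, not the required solution; the names `PUNCTUATION`, `TO_SPACE` and `prepare_sketch` are hypothetical.

```python
# Characters that the specification says must become spaces.
PUNCTUATION = '!?",.()<>'

# Translation table mapping each of those characters to a single space.
TO_SPACE = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))


def prepare_sketch(text):
    """Replace the listed punctuation with spaces, then split on whitespace."""
    return text.translate(TO_SPACE).split()


tokens = prepare_sketch('This is <cough,cough> "HAL-9000".  Don\'t touch!')
print(tokens)
```

Calling `split()` without arguments collapses runs of whitespace, so the extra spaces introduced by the replacement do not produce empty tokens.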


### 4.2 `ngrams` function

The function `ngrams(seq, n=3)` takes any sequence (e.g., a string) and an _optional_ integer argument `n` (default 3) and returns a list of its ngrams. It must not apply any padding, splitting or other modifications to its argument. By default, `ngrams` returns trigrams.

Because a Python string is a sequence of characters, and slicing a string yields a string, applying `ngrams` to a string should return a list of strings. (See the example and test below.)

Example and test:

In [2]:
from langdetect import ngrams
trigrams = ngrams("R2.D2")
print(trigrams)
        
# Test:
if trigrams == ['R2.', '2.D', '.D2']:
    print("ok")
else:
    import sys
    print("Incorrect result", file=sys.stderr)
    
bigrams = ngrams("R2.D2", 2)
print(bigrams)

# Test:
if bigrams == ['R2', '2.', '.D', 'D2']:
    print("ok")
else:
    import sys
    print("Incorrect result", file=sys.stderr)
    

['R2.', '2.D', '.D2']
ok
['R2', '2.', '.D', 'D2']
ok
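
A sliding window over slices is one way to get this behavior for any sequence type. The following is a sketch under that assumption (the name `ngrams_sketch` is hypothetical); it relies on the fact that slicing preserves the type of the sequence.

```python
def ngrams_sketch(seq, n=3):
    """Return all length-n slices of seq, in order, without padding."""
    # range() is empty when seq is shorter than n, so the result is [].
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


print(ngrams_sketch("R2.D2"))
print(ngrams_sketch([1, 2, 3, 4], 2))
```

Applied to a string this returns strings; applied to a list it returns lists.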


### 4.3 `ngram_table` function

We will represent a table of ngram frequencies as a dictionary whose keys are ngrams and the values are frequencies (ngram counts). The function `ngram_table(text, n, limit)` takes three arguments: A string to process and the _optional_ arguments `n` and `limit`. `n` is the size of ngrams, default value 3. `limit` gives the number of frequent ngrams to return. The default value must be 0, which will be interpreted to mean _all_ ngrams should be returned. This function should:
        
   * Use `prepare` to clean and tokenize the text.

   * Surround each tokenized word with the characters `<` and `>` (angle brackets). E.g., `"it's"` becomes `"<it's>"`.
    
   * Extract the ngrams (for n = `n`) of each word, and use a dictionary (either an ordinary `dict` or `collections.Counter`) to count how often each ngram occurs.

   * If `limit` is 0, return the entire dictionary.

   * Otherwise `limit` must be an integer that tells us how many ngrams to return (starting from the most frequent). Create and return a new dictionary containing this number of keys and values.

#### **Notes**

* If the cutoff point falls among several ngrams with the same frequency, the selection is arbitrary (and might vary between program runs). This is fine.

* One way to select a subset of the entries in a dictionary is to convert it to a list of (ngram, frequency) pairs, sort the list by frequency, and create a new dictionary from a slice of that list. You are free to use any approach.

**Example and test:**

In [3]:
from langdetect import ngram_table

top = ngram_table("hiep, hiep, hoera!", n=3, limit=4)
print("top:", top)

if top == {'<hi': 2, 'hie': 2, 'ep>': 2, 'iep': 2}:
    print("ok")
else:
    import sys
    print("Incorrect result", file=sys.stderr)

top: {'<hi': 2, 'hie': 2, 'iep': 2, 'ep>': 2}
ok
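
The selection step on its own could look like the sketch below, which assumes the counts are already available and uses `collections.Counter.most_common`. The helper name `top_entries` is hypothetical, and this is only one of several valid approaches.

```python
from collections import Counter


def top_entries(counts, limit=0):
    """Return all entries if limit is 0, else the `limit` most frequent."""
    if limit == 0:
        return dict(counts)
    # most_common(k) returns a list of (key, count) pairs, highest first.
    return dict(Counter(counts).most_common(limit))


counts = Counter(["<hi", "hie", "iep", "ep>", "<hi", "hie", "iep", "ep>", "<ho"])
print(top_entries(counts, 4))
```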


### 4.4 `read_ngrams` and `write_ngrams` functions

The function `write_ngrams(table, filename)` takes an ngram table as generated by `ngram_table` and a filename, and writes the ngrams to that file. Each line contains the count and the ngram, separated by a space. For example: 

    2 iep 
    2 ep>
    1 <Hi
    1 hoe
    1 <ho
    ...
    
The output must use the `utf8` encoding, and must be in descending order of frequency.

The function `read_ngrams(filename)` must reverse the process: It opens the file for reading with `utf8` encoding, reads the file's contents and converts them to a dictionary whose values are **integers**, not strings.

Check that you can "round-trip" your data with these functions, i.e. read back exactly what you wrote out:

In [4]:
import os

from langdetect import ngram_table, read_ngrams, write_ngrams

text = "Het valt voor, dat bij één roveroverval, één rover voorover over één roverval valt."
table = ngram_table(text, 3, limit=10)
print(table)
write_ngrams(table, "rover.10.TEMP")
reread_table = read_ngrams("rover.10.TEMP")
print(reread_table)

if table == reread_table:
    print("ok")
else:
    import sys
    print("The round trip fails", file=sys.stderr)
    
# remove temporary file
os.remove("rover.10.TEMP")

{'al>': 2, '<éé': 3, 'één': 3, 'én>': 3, '<ro': 3, 'er>': 3, 'val': 4, 'rov': 5, 'ove': 6, 'ver': 6}
{'al>': 2, '<éé': 3, 'één': 3, 'én>': 3, '<ro': 3, 'er>': 3, 'val': 4, 'rov': 5, 'ove': 6, 'ver': 6}
ok
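
The file format described above can be handled roughly as in the following sketch. The names `write_table` and `read_table` are hypothetical, and the reader assumes every line has exactly the form "count, space, ngram" with no blank lines.

```python
def write_table(table, filename):
    """Write 'count ngram' lines in descending order of count, as UTF-8."""
    pairs = sorted(table.items(), key=lambda pair: pair[1], reverse=True)
    with open(filename, "w", encoding="utf-8") as outfile:
        for ngram, count in pairs:
            outfile.write(f"{count} {ngram}\n")


def read_table(filename):
    """Read 'count ngram' lines back into a dict with integer values."""
    table = {}
    with open(filename, encoding="utf-8") as infile:
        for line in infile:
            # Split only at the first space: the count comes first.
            count, ngram = line.rstrip("\n").split(" ", 1)
            table[ngram] = int(count)
    return table
```

Stripping only the newline (rather than all whitespace) keeps the ngram itself untouched.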


### 4.5 `cosine_similarity` function

The function `cosine_similarity(known, unknown)` takes two arguments, which must both be dictionaries of ngram frequencies: one of the reference profiles of a known language (`known`), and the ngrams for the unknown text we want to identify (`unknown`). It must compute and return the "cosine similarity" metric between the two tables.

The formula of cosine similarity between *a* and *b* is as follows:

$\cos(a,b)=\frac{a\cdot b}{\text{magnitude}(a)\times\text{magnitude}(b)}$

where $a\cdot b$ is the _dot product_ of $a$ and $b$, which equals $\sum_i a_i b_i$, and $\text{magnitude}(x)=\sqrt{\sum_i x_i^2}$.

(See the [Wikipedia definition][1] for the formula, or [this page][2] for a detailed geometric explanation).

In general, the cosine similarity metric ranges between `-1.0` and `1.0`; because ngram counts are never negative, the scores here fall between `0.0` and `1.0`. Larger is better (smaller angle, i.e. more similar vectors). An ngram table compared to itself should return cosine `1.0` (up to floating point error). Tables that share no ngrams return `0`, and largely unrelated tables return a cosine close to `0`.

If an ngram does not appear in an ngram table, it has frequency 0 and contributes nothing to the numerator in the formula nor to the table's vector magnitude. So when comparing two tables, ngrams that occur in neither table do not affect the cosine measure. This means that our calculations need only consider ngrams that occur in at least one of the tables: The dot product (numerator) is the sum over ngrams that are present on both sides (since all other products are zero), and the magnitude of each vector (denominator) is calculated from its own ngrams.

For example, the dot product of the two tables in the test below is $2 \times 2 = 4$ (only `<he` is shared), and each vector has magnitude $\sqrt{2 \times 2 + 1 \times 1} = \sqrt{5}$. The cosine score works out to $4/(\sqrt{5} \times \sqrt{5}) = 4/5$.

[1]: https://en.wikipedia.org/wiki/Cosine_similarity#Definition
[2]: http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/

In [5]:
from langdetect import cosine_similarity

table1 = { "<he": 2, "het": 1 }
table2 = { "<he": 2, "hem": 1 }
score = cosine_similarity(table1, table2)

# The score *should be* 0.8, but we get floating point error
print(score)

if abs(score - 4/5) < 0.000001:
    print("ok")
else:
    import sys
    print("Incorrect result", file=sys.stderr)


0.7999999999999998
ok
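
The calculation described above can be sketched as follows. The name `cosine_sketch` is hypothetical, and returning `0.0` for an empty table is an assumption of this sketch; the specification does not say what should happen in that case.

```python
import math


def cosine_sketch(a, b):
    """Cosine similarity of two frequency tables stored as dicts."""
    # Only ngrams present in both tables contribute to the dot product.
    dot = sum(count * b[ngram] for ngram, count in a.items() if ngram in b)
    # Each magnitude is computed from the table's own counts.
    mag_a = math.sqrt(sum(count * count for count in a.values()))
    mag_b = math.sqrt(sum(count * count for count in b.values()))
    if mag_a == 0 or mag_b == 0:
        return 0.0  # assumption: empty tables are treated as dissimilar
    return dot / (mag_a * mag_b)


print(cosine_sketch({"<he": 2, "het": 1}, {"<he": 2, "hem": 1}))
```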


## 5. Script 1: Create profiles

We're now ready to collect our ngram statistics. The script `write_profiles.py` must define a function `make_profiles` that accepts three arguments: the path to a folder containing the multilingual reference corpus, the `n` of the ngrams, and the (maximum) size of the ngram tables.

When imported as a module, `write_profiles.py` should not read or write any files. When executed as a stand-alone script, it should run two commands:

    make_profiles("./datafiles/training", 3, 200)
    make_profiles("./datafiles/training", 2, 200)

These commands write files of language profiles in terms of trigrams and bigrams, keeping only the 200 most frequent of each. Remember to enclose the script part in a conditional. These lines should not execute when the module is imported. They will write to directories `./models/3-200` and `./models/2-200`.

`make_profiles` should create the full directory path `./models/<n>-<limit>/` if it doesn't exist. For instance, `make_profiles("./datafiles/training", 5, 30)` should create `./models/` and a subdirectory `5-30` if they don't exist.
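
Creating the nested output directory can be done in a single call with `os.makedirs`. In the sketch below, the helper name `ensure_model_dir` and its `root` parameter are hypothetical conveniences for illustration.

```python
import os


def ensure_model_dir(n, limit, root="./models"):
    """Create <root>/<n>-<limit> (and any missing parents); return the path."""
    path = f"{root}/{n}-{limit}"
    # exist_ok=True makes the call a no-op if the directory already exists.
    os.makedirs(path, exist_ok=True)
    return path
```

Because of `exist_ok=True`, calling the helper repeatedly with the same arguments is safe.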

### 5.1 The data

The folder `training` contains a number of files with names like these:

        Afrikaans-Latin1
        Aguaruna-Latin1
        Albanian-UTF8
        (etc.)
        
The part before the hyphen is the language name. The part after the hyphen is the encoding. 
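
A sketch of splitting such a filename is shown below. The name `split_filename` is hypothetical, and splitting at the *last* hyphen is an assumption. Python's codec lookup is case-insensitive and knows aliases such as `Latin1` and `UTF8`, so these two labels can be passed directly as an `encoding` argument; other labels in the data may need checking.

```python
def split_filename(filename):
    """Split 'Language-Encoding' at the last hyphen."""
    language, _, encoding = filename.rpartition("-")
    return language, encoding


language, encoding = split_filename("Afrikaans-Latin1")
# Round-trip a small string to show the label is a usable codec name.
sample = "één".encode(encoding).decode(encoding)
print(language, encoding, sample)
```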


### 5.2 The function `make_profiles`

The function `make_profiles(datafolder, n, limit)` must 
    
  1. Loop over a list of the files in `datafolder`; for each file:
     - Split the filename into language name and encoding.
     - Read the file, **using the appropriate encoding**.
     - Generate a table of the `limit` (e.g.: 200) most frequent ngrams, for n = `n`.
     - Write out the table in a file in a directory with path `./models/<n>-<limit>`, for example `./models/3-200` for trigrams with limit 200. Use `utf-8` encoding.
  
### Notes

* Use the functions of the `langdetect` module; do not reinvent any wheels!

* All profiles must be saved in the `utf-8` encoding, regardless of the input's encoding.
    
* Check that your function works correctly if called with different directory names.

In [6]:
 %run write_profiles.py

## 6. Script 2: The language recognizer

To use our collection of trigram profiles for language identification, we will first create a dedicated class to manage them easily. Then we'll specify a command line interface for our recognizer. 

As usual, importing the script `match_language.py` as a module should do no input, output, or long computations. When run as the main program, it must look for command line arguments and try to identify the language of the specified file.


### 6.1 The `LangMatcher` class

The `LangMatcher` class will allow a text to be compared against multiple language profiles, in order to find the best match. It should have the following behavior:

1. The initializer must have one required argument: The path to a directory with the saved ngram profiles. The initializer should define and store a dictionary whose keys are language names, and each value is itself a dictionary whose keys are ngrams and whose values are frequencies. These are read from the profile files with `langdetect.read_ngrams`. (Use just the language name as the key, not the complete file name.)

2. `score(text, k_best=1)`: Here, `text` is a string that has not yet been cleaned up or tokenized. Compute its ngram table, limited to the same number of the most frequent ngrams as the models were, and compare it against each of the languages in the model dictionary. Compile a list of the language names and cosine similarity scores, and return the names and similarity scores for the `k_best` best matches. Make the default value of `k_best` 1, so by default we get the single best match.

3. `recognize(filename, encoding="utf-8")`: A convenience function: It opens the specified file, calls `score` on its contents, and returns the top result (i.e., the name of the highest matching language and the similarity score).
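
The ranking step inside `score` might be sketched as below. The helper name `k_best_matches`, the input format (a dict from language name to score), and the return format (a list of `(name, score)` pairs) are all assumptions made for illustration; the scores in the usage lines are made-up values.

```python
def k_best_matches(scores, k_best=1):
    """Return the k_best (language, score) pairs with the highest scores."""
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k_best]


# Illustrative, invented scores:
example = {"German": 0.40, "Dutch": 0.73, "Frisian": 0.61}
print(k_best_matches(example))
print(k_best_matches(example, k_best=2))
```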

### 6.2 The recognizer main script
 
When run as a main program, `match_language.py` should do the following:

1. Accept two **or more** command line arguments. The first is the path to the ngram model directory.  Each of the rest is the path of a file whose language we want to identify. If there are too few command line arguments (fewer than 2), the program can do whatever you want. (It is a good idea to specify some default filenames to help with development and testing. Alternately you can do nothing, or print an error message.)


**Program execution:**

1. Initialize a `LangMatcher` object, filling it with all the profiles from the ngram model directory, for example  `models/3-200`.

2. For each file specified on the command line, print out a line containing the file name, the most similar language, and its similarity score. 
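
Handling the command line could start from a sketch like this one. The function name `split_arguments` and the usage message are hypothetical; recall that `sys.argv[0]` is the script name, so two real arguments mean a list of length three.

```python
import sys


def split_arguments(argv):
    """Split argv into (model_dir, files); return None if too few arguments."""
    if len(argv) < 3:  # argv[0] is the script name itself
        return None
    return argv[1], argv[2:]


if __name__ == "__main__":
    parsed = split_arguments(sys.argv)
    if parsed is None:
        print("usage: match_language.py MODEL_DIR FILE [FILE ...]", file=sys.stderr)
```

Keeping the parsing in a function that takes `argv` as a parameter makes it possible to test it without actually running the script.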

The following code cell uses notebook "magic" commands (i.e., not Python) to run `match_language.py` as a command line script in the directory that contains this notebook.

In [7]:
print("Dutch")

%run match_language.py "models/3-200" "datafiles/test/europarl-90/ep-00-02-03.nl"
%run match_language.py "models/2-200" "datafiles/test/europarl-90/ep-00-02-03.nl"

print("Portuguese, Swedish")

%run match_language.py "models/2-200" "datafiles/test/europarl-90/ep-00-02-03.pt" "datafiles/test/europarl-90/ep-00-01-21.sv"

print("Danish")

# note: the * doesn't work on all systems.
%run match_language.py "models/3-200" "datafiles/test/europarl-30/*.da"

Dutch
ep-00-02-03.nl Dutch 0.7296681511302934
ep-00-02-03.nl Dutch 0.9147864640014649
Portuguese, Swedish
ep-00-02-03.pt Portuguese 0.8468795676904076
ep-00-01-21.sv Swedish 0.7963984242255718
Danish
ep-00-01-17.da Danish 0.37575562280655533
ep-00-01-18.da Danish 0.46320817970089806
ep-00-01-21.da Danish 0.4386465265689307


## 7. Script 3: Evaluation

The script `evaluate.py` assesses the performance of the language recognition.
Since the tests are fixed, the directory names can be hard-coded in this file. 

We'll test our system on fragments from translations of European Parliament proceedings (`europarl` corpus). The algorithm works so well on clean, monolingual text, that mistakes are only likely with very short text fragments. Here we evaluate collections of randomly selected fragments that are 90, 30, and 10 words long. They are in the folders `europarl-90`, `europarl-30`, and `europarl-10`, respectively. All files are in `utf-8` format. The language of the fragment is encoded in the file names, but not in the same way as the training data: Each filename has a suffix indicating the ISO language code, e.g. `ep-00-02-02.de` (suffix `de`, German). The following codes are used:

        da Danish
        de German
        el Greek
        en English
        es Spanish
        fi Finnish
        fr French
        it Italian
        nl Dutch
        pt Portuguese
        sv Swedish

Convert the above table into a dictionary, and use it to automatically check if the language returned by the recognizer was correct.
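
For illustration, the table could be written as a dictionary and used to look up the expected language from a file's suffix. The names `LANGUAGE_CODES` and `expected_language` are hypothetical.

```python
# ISO language code (file suffix) -> language name, from the table above.
LANGUAGE_CODES = {
    "da": "Danish",
    "de": "German",
    "el": "Greek",
    "en": "English",
    "es": "Spanish",
    "fi": "Finnish",
    "fr": "French",
    "it": "Italian",
    "nl": "Dutch",
    "pt": "Portuguese",
    "sv": "Swedish",
}


def expected_language(filename):
    """Look up the language name from the suffix after the last dot."""
    suffix = filename.rsplit(".", 1)[-1]
    return LANGUAGE_CODES[suffix]


print(expected_language("ep-00-02-02.de"))
```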

Write a function `eval(model_path, test_path)` that is given the pathname of an ngram model directory and a test corpus collection and performs language identification on each file of the collection using the ngram models specified. For each corpus file, **print** the name of the corpus file and the recognizer's guess, and add the word "ERROR" and the correct language after incorrect guesses. For instance, if your model guesses Portuguese for the file `ep-00-02-15.es`, print something like:

`ep-00-02-15.es Portuguese ERROR Spanish`


Keep a tally of the number of correct and incorrect results, and report the totals at the end of each collection (separately for each collection). Include in the report the size of the ngrams (e.g. bigrams or trigrams.)

For example, if you got 17/30 correct on europarl-10 using your bigram models, you could print something like:

```
Bigram models for 10-word sentences: 17 correct, 13 incorrect
```
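
The per-file line and the tally could be produced along these lines. This is a sketch only: the helper names `result_line` and `tally`, and the representation of results as `(guess, correct)` pairs, are assumptions.

```python
def result_line(filename, guess, correct):
    """Format one result line, flagging wrong guesses with ERROR."""
    if guess == correct:
        return f"{filename} {guess}"
    return f"{filename} {guess} ERROR {correct}"


def tally(results):
    """Count (guess, correct) pairs; return (n_correct, n_incorrect)."""
    n_correct = sum(1 for guess, correct in results if guess == correct)
    return n_correct, len(results) - n_correct


print(result_line("ep-00-02-15.es", "Portuguese", "Spanish"))
```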

Include a main script, in which you use your function `eval` to evaluate the collections `europarl-90`, `europarl-30`, and `europarl-10` using both the bigrams and the trigrams. Hard-code them in specifically, as opposed to looping over all corpora in the test directory. There are more corpora that we'll add to the directory when we grade your assignment, but they shouldn't be evaluated by your main script.

In [8]:
%run evaluate.py

ep-00-01-17.da Norwegian ERROR Danish
ep-00-01-17.de German
ep-00-01-17.sv Swedish
ep-00-01-18.da Danish
ep-00-01-18.es Spanish
ep-00-01-18.it Italian
ep-00-01-18.sv Frisian ERROR Swedish
ep-00-01-19.it Spanish ERROR Italian
ep-00-01-20.de German
ep-00-01-20.es Spanish
ep-00-01-21.da Danish
ep-00-01-21.es Galician ERROR Spanish
ep-00-01-21.it Italian
ep-00-01-21.sv Norwegian ERROR Swedish
ep-00-02-02.de German
ep-00-02-02.es Ido ERROR Spanish
ep-00-02-02.fi Finnish
ep-00-02-02.sv Swedish
ep-00-02-03.en Sharanahua ERROR English
ep-00-02-03.es Spanish
ep-00-02-03.fr Catalan ERROR French
ep-00-02-03.nl Frisian ERROR Dutch
ep-00-02-03.pt Portuguese
ep-00-02-14.de German
ep-00-02-15.el Greek
ep-00-02-15.es Portuguese ERROR Spanish
ep-00-02-15.it Ido ERROR Italian
ep-00-02-16.fr French
ep-00-02-16.pt French ERROR Portuguese
ep-00-02-17.en English
Bigram models for 10-word sentences: 18 correct, 12 incorrect
ep-00-01-17.da Norwegian ERROR Danish
ep-00-01-17.de German
ep-00-01-17.sv Swedish
ep

## 8. Report

Write a short report (txt or pdf) in which you compare your bigram and trigram models. Include discussion on the following:

1. How many languages did each get right, and in which corpora? Did the bigrams ever beat the trigrams? Don't just list the results, but try to make generalisations.

2. There are two parameters you manipulated: the size of the ngrams and the limit on the number of ngrams you looked at. However, in the evaluation, you used the default value of `limit=200`. Can you manipulate the limits in the bigram and trigram models to make the bigrams beat the trigrams? (Don't forget to also change the limit in the test corpus ngrams.) If so, can you tell approximately where they cross? If not, do you have an idea why not?

3. What if you decrease `n` to 1, or increase it further? Is there a trend you can describe?

4. Finally, please include some **reflections on how the project went**. Was it hard to work as a group? Did everyone contribute fairly? What obstacles did you encounter, and were you able to overcome them?

One way to clearly present data of the kind in 2 and 3 is with a graph. This is not required, but you might want to try it.

Write a brief report (about 1-3 pages) on these questions and save it as a PDF called report.pdf or a plaintext document called report.txt. (No Word documents please!)

## 9. Submission

When you are done, archive your four Python files and your report into a zip file. Upload the zip file. If you wish to add any comments about problems or extensions to the tasks, add a file `README.txt` with the additional information.

### Submission Checklist
Before submission, please make sure you did not forget to include important parts of your code:
- `langdetect.py` : core functions, and any other helper functions you find useful, including:
    - `prepare`
    - `ngrams`
    - `ngram_table`
    - `read_ngrams` 
    - `write_ngrams`
    - `cosine_similarity`
    
- `write_profiles.py`: including function
    -  `make_profiles`
    
- `match_language.py`: The language recognizer. This is a general-purpose program with command-line arguments and options.
    - `LangMatcher` class with `score` and `recognize` methods
    - main script

- `evaluate.py` evaluates your script on europarl-90, europarl-30, and europarl-10, with both bigrams and trigrams.
    - `eval`
    -  main script

- `report.pdf` or `report.txt`, a short report comparing your bigram and trigram models.