# Assignment #3: A simple language classifier with scikit-learn and PyTorch

Author: Pierre Nugues

## Objectives

In this assignment, you will implement a language detector inspired by, and simplified from, Google's _Compact Language Detector_, version 3 (CLD3): https://github.com/google/cld3. CLD3 is written in C++ and its code is available on GitHub. The objectives of the assignment are to:
* Write a program to classify languages
* Use neural networks with sklearn and PyTorch
* Know what a classifier is
* Write a short report of 1 to 2 pages to describe your program. You will notably comment on the performance you obtained and how you could improve it.

## Description

### System Overview

Read the _Model_ section of the CLD3 description on GitHub, https://github.com/google/cld3. In your individual report, you will:
1. Summarize the system in two or three sentences;
2. Outline the CLD3 overall architecture in a figure. Use building blocks only and do not specify the parameters.

## Imports

In [1]:
import bz2
import json
import os
import numpy as np
import requests
import sys
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score, classification_report
from sklearn.metrics import confusion_matrix
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import hashlib
from tqdm import tqdm

In [2]:
random.seed(1234)
np.random.seed(1234)
torch.manual_seed(1234)

<torch._C.Generator at 0x1155cf830>

## Dataset

As the dataset, we will use Tatoeba, https://tatoeba.org/eng/downloads. It consists of more than 8 million short texts in 347 languages and is available in one file called `sentences.csv`.

The dataset is structured as follows: there is one text per line, and each line consists of the following three fields, separated by tabs and ended by a carriage return:
```
sentence id [tab] language code [tab] text [cr]
```
Each text (sentence) has a unique id and has a language code that follows the ISO 639-3 standard (see below). 
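As a small illustration of this format, the sketch below splits one line into its three fields. The helper name `parse_line` is hypothetical and is not part of the assignment code; it only assumes that the fields are separated by tabs as described above.

```python
def parse_line(line):
    """Split one Tatoeba line into (sentence_id, lang_code, text)."""
    # maxsplit=2 keeps any tab character that may occur inside the text itself
    sentence_id, lang_code, text = line.rstrip('\r\n').split('\t', 2)
    return sentence_id, lang_code, text.strip()


fields = parse_line("1276\teng\tLet's try something.\n")
print(fields)
# → ('1276', 'eng', "Let's try something.")
```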

### Scope of the lab

In this lab, you will consider six languages only: French (fra), Japanese (jpn), Chinese (cmn), English (eng), Swedish (swe), and Danish (dan). Below is an excerpt of the Tatoeba dataset limited to three languages: 

```
1276    eng     Let's try something.
1277    eng     I have to go to sleep.
1280    eng     Today is June 18th and it is Muiriel's birthday!
...
1115    fra     Lorsqu'il a demandé qui avait cassé la fenêtre, tous les garçons ont pris un air innocent.
1279    fra     Je ne supporte pas ce type.
1441    fra     Pour une fois dans ma vie je fais un bon geste... Et ça ne sert à rien.
...
337413  swe     Vi trodde att det var ett flygande tefat.
341910  swe     Detta är huset jag bodde i när jag var barn.
341938  swe     Vi hade roligt på stranden igår.
...
```
Tatoeba is updated continuously. The examples from this dataset come from a corpus your instructor downloaded on September 23, 2021.

### Understanding the ${X}$ matrix (feature matrix)

You will now investigate the CLD3 features:
 *  What are the features CLD3 extracts from each text?
 * Manually create a simplified ${X}$ matrix in which you represent the 9 texts with CLD3 features. You will use a restricted set of features: consider only the letters _a_, _b_, and _n_ and the bigrams _an_, _ba_, and _na_, and ignore the rest of the letters and bigrams as well as the trigrams. Your matrix will have 9 rows and 6 columns, and each column will contain these counts: `[#a, #b, #n, #an, #ba, #na]`.

CLD3's original description uses relative frequencies (the count of a letter divided by the total count of letters in the text). Here, you will use the raw counts. To help you, your instructor filled in the fourth row of the matrix, which corresponds to the first text in French. Fill in the rest. You will __include this matrix in your report__.

$\mathbf{X} =
\begin{bmatrix}
-& -& -& -&-& -\\
-& -& -& -&-& -\\
-& -& -& -&-& -\\
8& 0& 8& 1&0&0\\
-& -& -& -&-& -\\
-& -& -& -&-& -\\
-& -& -& -&-& -\\
-& -& -& -&-& -\\
-& -& -& -&-& -\\
\end{bmatrix}$
; $\mathbf{y} =
\begin{bmatrix}
     \text{eng} \\
     \text{eng}\\
     \text{eng}\\
    \text{fra}\\
   \text{fra}  \\
     \text{fra}\\
    \text{swe}\\
 \text{swe}   \\
 \text{swe}   
\end{bmatrix}$

To help you check your counts, you can use the `str.count()` method:

In [100]:
example_ngrams = ['a', 'b', 'n', 'an', 'ba', 'na']

In [101]:
"""Lorsqu'il a demandé qui avait cassé la fenêtre, tous les garçons ont pris un air innocent.""".count('a')

8

In [102]:
"""Lorsqu'il a demandé qui avait cassé la fenêtre, tous les garçons ont pris un air innocent.""".count('an')

1

In [103]:
my_string = """Lorsqu'il a demandé qui avait cassé la fenêtre, tous les garçons ont pris un air innocent."""
row = []
for ngram in example_ngrams:
    row += [my_string.count(ngram)]
row

[8, 0, 8, 1, 0, 0]
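The loop above can be wrapped in a small helper, sketched here on a toy string rather than on one of the nine texts. The name `count_row` is hypothetical. Note that `str.count()` counts non-overlapping occurrences, which only matters for $n$-grams that can overlap themselves, such as `'aa'`; this is not the case for the six features used here.

```python
def count_row(text, ngrams):
    """Return the raw count of each n-gram in text, in the order given."""
    return [text.count(ngram) for ngram in ngrams]


example_ngrams = ['a', 'b', 'n', 'an', 'ba', 'na']
row = count_row('banana', example_ngrams)
print(row)
# → [3, 1, 2, 2, 1, 2]

# Non-overlapping counting: 'aaa' contains 'aa' at positions 0 and 1,
# but str.count() reports a single occurrence
print('aaa'.count('aa'))
# → 1
```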

## Getting the Dataset

Before you start programming, download the Tatoeba dataset. You can use the following commands:

In [3]:
!wget https://downloads.tatoeba.org/exports/sentences.tar.bz2

--2023-09-19 20:22:31--  https://downloads.tatoeba.org/exports/sentences.tar.bz2
Resolving downloads.tatoeba.org (downloads.tatoeba.org)... 94.130.77.194
Connecting to downloads.tatoeba.org (downloads.tatoeba.org)|94.130.77.194|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 183274899 (175M) [application/octet-stream]
Saving to: ‘sentences.tar.bz2’


2023-09-19 20:23:35 (2.76 MB/s) - ‘sentences.tar.bz2’ saved [183274899/183274899]



In [4]:
!tar -xvjf sentences.tar.bz2

x sentences.csv


### Loading the Dataset

Run the code to read the dataset and split it into lines. You may have to change the path.

In [5]:
dataset_large = open('sentences.csv', encoding='utf8').read().strip()
dataset_large = dataset_large.split('\n')
dataset_large[:10]

['1\tcmn\t我們試試看！',
 '2\tcmn\t我该去睡觉了。',
 '3\tcmn\t你在干什麼啊？',
 '4\tcmn\t這是什麼啊？',
 '5\tcmn\t今天是６月１８号，也是Muiriel的生日！',
 '6\tcmn\t生日快乐，Muiriel！',
 '7\tcmn\tMuiriel现在20岁了。',
 '8\tcmn\t密码是"Muiriel"。',
 '9\tcmn\t我很快就會回來。',
 '10\tcmn\t我不知道。']

The size may vary, as new documents are added to _Tatoeba_ every day.

In [6]:
len(dataset_large)

11611245

Run the code to split the fields and strip any surrounding whitespace.

In [7]:
dataset_large = list(map(lambda x: tuple(x.split('\t')), dataset_large))
dataset_large = list(map(lambda x: tuple(map(str.strip, x)), dataset_large))
dataset_large[:3]

[('1', 'cmn', '我們試試看！'), ('2', 'cmn', '我该去睡觉了。'), ('3', 'cmn', '你在干什麼啊？')]

In [8]:
from collections import Counter
counter = Counter(map(lambda x: x[1], dataset_large))

Again, the figures may vary.

In [9]:
counter.most_common(30)

[('eng', 1829233),
 ('rus', 1013772),
 ('ita', 863056),
 ('tur', 728016),
 ('epo', 727449),
 ('kab', 686241),
 ('ber', 649317),
 ('deu', 626486),
 ('fra', 568043),
 ('por', 422018),
 ('spa', 402305),
 ('hun', 394468),
 ('jpn', 238488),
 ('heb', 200838),
 ('ukr', 184377),
 ('nld', 176953),
 ('fin', 146641),
 ('pol', 122378),
 ('lit', 93912),
 ('mkd', 78095),
 ('tgl', 74892),
 ('cmn', 73632),
 ('ces', 73545),
 ('mar', 73277),
 ('ara', 62802),
 ('dan', 58848),
 ('tok', 53870),
 ('swe', 52960),
 ('lat', 49153),
 ('srp', 47878)]

## Restricting the Dataset to a few Languages 

The Tatoeba dataset is very large. You will first extract a subset of it.

Write the code to extract texts in the languages below. For each language, limit the number of documents to 50,000, or fewer if the language has fewer documents.
You will call the resulting dataset `dataset`.

The languages

In [10]:
langs = ['fra', 'cmn', 'jpn', 'eng', 'swe', 'dan']

The maximal number of documents per language

In [11]:
MAX_DOCS = 50000

Write a loop that:
1. Extracts a list of all the documents in a certain language from the dataset
2. Shuffles this list with `random.shuffle()`
3. Adds at most `MAX_DOCS` documents from this list to `dataset`. You just need to use a slice
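The three steps can be sketched on a toy dataset as follows. All names prefixed with `toy_` or `TOY_` are hypothetical stand-ins chosen for this illustration; the real loop would run over `dataset_large`, `langs`, and `MAX_DOCS`.

```python
import random

random.seed(1234)

# Toy stand-in for dataset_large: (id, language code, text) tuples
toy_dataset = [(str(i), 'eng', f'text {i}') for i in range(5)]
toy_dataset += [(str(i), 'fra', f'texte {i}') for i in range(5, 7)]
toy_langs = ['eng', 'fra']
TOY_MAX_DOCS = 3

toy_subset = []
for lang in toy_langs:
    # 1. All the documents in the current language
    docs = [doc for doc in toy_dataset if doc[1] == lang]
    # 2. Shuffle in place so that the slice below is a random sample
    random.shuffle(docs)
    # 3. A slice never fails when the list is shorter than the bound
    toy_subset += docs[:TOY_MAX_DOCS]

print(len(toy_subset))
# → 5 (three English documents and the only two French ones)
```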

In [12]:
# Write your code here
dataset = []
for lang in langs:
    ...

In [13]:
random.shuffle(dataset)

In [14]:
len(dataset)

300000

In [15]:
dataset[:5]

[('8262671', 'eng', "Where are Algeria's embezzled funds?"),
 ('9211583', 'eng', 'Beware of the dog.'),
 ('366878', 'cmn', '你有必要去看一下医生。'),
 ('7071789', 'fra', 'Va balayer ma chambre.'),
 ('11801603', 'eng', 'Dmitri had an emergency at his work.')]

## Utilities

Before you can use the dataset to train a model, you need to convert it into numbers. You will carry this out in the following steps, writing a corresponding function for each:
1. You will extract the $n$-grams up to trigrams (`all_ngrams()`);
2. Trigrams can create more symbols than most students' machines can process. You will reduce their number using hash codes (`hash_ngrams()`);
3. You will compute the relative frequencies of the $n$-grams, replaced here by the hash codes (`calc_rel_freq()`);
4. The results will be stored in three dictionaries, for characters, bigrams, and trigrams. You will merge these dictionaries into one (`shift_keys()`).

You will then apply the functions to vectorize the dataset.

### Extracting $n$-grams
In this section, you will extract the $n$-grams from a text. By default, you will lowercase the text. The result will have the form: `[chars, bigrams, trigrams]`

Write a function to extract the $n$-grams of a sentence: `ngrams(sentence, n=1, lc=True)`, where `n` is the length of the $n$-grams and `lc` tells whether to lowercase the text. You can use list slices for this.
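One possible way to use slices is sketched below, under the assumption that an $n$-gram is simply every substring of length `n`, spaces and punctuation included. The name `char_ngrams` is hypothetical; it is one sketch among several valid solutions.

```python
def char_ngrams(sentence, n=1, lc=True):
    """Return the list of character n-grams of a sentence."""
    if lc:
        sentence = sentence.lower()
    # One slice of length n per start position; the list is empty
    # when the sentence is shorter than n
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]


print(char_ngrams('Try', n=2))
# → ['tr', 'ry']
```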

In [16]:
# Write your code here
def ngrams(sentence, n=1, lc=True):
    ngram_l = []
    ...
    return ngram_l

In [17]:
ngrams('try something.')

['t', 'r', 'y', ' ', 's', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', '.']

In [18]:
ngrams('try something.', n=2)

['tr', 'ry', 'y ', ' s', 'so', 'om', 'me', 'et', 'th', 'hi', 'in', 'ng', 'g.']

We now use this function to extract all the $n$-grams

In [19]:
def all_ngrams(sentence, max_ngram=3, lc=True):
    all_ngram_list = []
    for i in range(1, max_ngram + 1):
        all_ngram_list += [ngrams(sentence, n=i, lc=lc)]
    return all_ngram_list

In [20]:
all_ngrams('try something.')

[['t', 'r', 'y', ' ', 's', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', '.'],
 ['tr',
  'ry',
  'y ',
  ' s',
  'so',
  'om',
  'me',
  'et',
  'th',
  'hi',
  'in',
  'ng',
  'g.'],
 ['try',
  'ry ',
  'y s',
  ' so',
  'som',
  'ome',
  'met',
  'eth',
  'thi',
  'hin',
  'ing',
  'ng.']]

### Hashing

Some of the languages we consider have many characters, which makes the number of distinct bigrams and trigrams too large to process. We will use the _hashing trick_ to reduce it: we will gather the $n$-grams into a fixed number of subsets using hash codes.

Each item will have this format:
`[char_hcodes, bigram_hcodes, trigram_hcodes]`.

#### Description

Python has a built-in hashing function, `hash()`, that returns an integer signature for a given string:

In [21]:
hash('a'), hash('ab'), hash('abc')

(6639504683097779921, 4838953610571693510, 453918279105448612)

If we take the remainder (modulo) of a division by 5, we reduce the possible codes to: 0, 1, 2, 3, or 4

In [22]:
list(map(lambda x: x % 5, (hash('a'), hash('ab'), hash('abc'))))

[1, 0, 2]

#### Implementation

We set maximal numbers for our $n$-grams using these divisors

In [23]:
MAX_CHARS = 521
MAX_BIGRAMS = 1031
MAX_TRIGRAMS = 1031

Here strings have integer codes within the range [0, `MAX_CHARS`[

In [24]:
list(map(lambda x: x % MAX_CHARS, (hash('a'), hash('ab'), hash('abc'))))

[10, 266, 17]

The codes returned by `hash()` for strings may vary across runs and machines, so Marcus Klang wrote this function to obtain reproducible codes:

In [25]:
def reproducible_hash(string):
    """
    reproducible hash on any string
    
    Arguments:
       string: python string object
    
    Returns:
       signed int64
    """
    
    # We are using MD5 for speed not security.
    h = hashlib.md5(string.encode("utf-8"), usedforsecurity=False)
    return int.from_bytes(h.digest()[0:8], 'big', signed=True)

In [26]:
reproducible_hash('a')

919145239626757800

In [27]:
reproducible_hash('a') % MAX_CHARS

234

### Converting $n$-grams to hash codes
You will now convert the $n$-grams to hash codes


In [28]:
MAXES = [MAX_CHARS, MAX_BIGRAMS, MAX_TRIGRAMS]

Create a `hash_ngrams` function that creates a list of hash codes from a list of $n$-grams. As arguments, you will have the list of $n$-grams `[chars, bigrams, trigrams]` as well as the list of divisors (`MAXES`).

The output format will be a list of three lists:

`[char_hcodes, bigram_hcodes, trigram_hcodes]`.
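The conversion can be sketched as one modulo operation per $n$-gram, pairing each list with its divisor. The names `md5_hash` and `hash_ngrams_sketch` are hypothetical; `md5_hash` restates the reproducible hash above without the `usedforsecurity` argument, so that the block runs on its own.

```python
import hashlib


def md5_hash(string):
    """Signed 64-bit integer taken from the first 8 bytes of the MD5 digest."""
    digest = hashlib.md5(string.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big', signed=True)


def hash_ngrams_sketch(ngram_lists, modulos):
    """Map each list of n-grams to a list of hash codes in [0, modulo[."""
    # In Python, x % m is never negative when m is positive,
    # even if the signed hash x is negative
    return [[md5_hash(ngram) % modulo for ngram in ngram_list]
            for ngram_list, modulo in zip(ngram_lists, modulos)]


codes = hash_ngrams_sketch([['t', 'r', 'y'], ['tr', 'ry'], ['try']],
                           [521, 1031, 1031])
print([len(c) for c in codes])
# → [3, 2, 1]
```

Two different $n$-grams may receive the same code. This is the price of the hashing trick: the number of features is bounded, but colliding $n$-grams become indistinguishable.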

In [29]:
# Write your code
def hash_ngrams(ngrams, modulos):
    hash_codes = []
    ...
    return hash_codes

In [30]:
all_ngrams('try something.')

[['t', 'r', 'y', ' ', 's', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', '.'],
 ['tr',
  'ry',
  'y ',
  ' s',
  'so',
  'om',
  'me',
  'et',
  'th',
  'hi',
  'in',
  'ng',
  'g.'],
 ['try',
  'ry ',
  'y s',
  ' so',
  'som',
  'ome',
  'met',
  'eth',
  'thi',
  'hin',
  'ing',
  'ng.']]

In [31]:
hash_ngrams(all_ngrams('try something.'), MAXES)

[[432, 437, 309, 86, 331, 97, 100, 32, 432, 332, 233, 310, 31, 442],
 [6, 765, 224, 203, 557, 176, 590, 711, 527, 757, 919, 57, 685],
 [848, 617, 468, 456, 873, 996, 287, 10, 817, 674, 960, 399]]

### Functions to Count Hash Codes

Write a function `calc_rel_freq(codes)` to count the codes. As in CLD3, you will return the relative frequencies.

This is just an application of `Counter` to a list of codes and then a division by the length.

The input is a list of codes and the output is a `Counter` object of relative frequencies.
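A minimal sketch of this counting step is shown below on a short toy list of codes. The name `rel_freq_sketch` is hypothetical, and the handling of an empty list (returning an empty `Counter`) is an assumption made here.

```python
from collections import Counter


def rel_freq_sketch(codes):
    """Return a Counter mapping each code to its relative frequency."""
    cnt = Counter(codes)
    total = len(codes)
    # Updating the values of existing keys is safe while iterating
    for code in cnt:
        cnt[code] /= total
    return cnt


print(rel_freq_sketch([432, 437, 432, 86]))
# → Counter({432: 0.5, 437: 0.25, 86: 0.25})
```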

In [32]:
# Write your code
def calc_rel_freq(codes):
    ...
    return cnt

In [33]:
hash_ngrams(all_ngrams('try something.'), MAXES)

[[432, 437, 309, 86, 331, 97, 100, 32, 432, 332, 233, 310, 31, 442],
 [6, 765, 224, 203, 557, 176, 590, 711, 527, 757, 919, 57, 685],
 [848, 617, 468, 456, 873, 996, 287, 10, 817, 674, 960, 399]]

In [34]:
list(map(calc_rel_freq, hash_ngrams(all_ngrams('try something.'), MAXES)))

[Counter({432: 0.14285714285714285,
          437: 0.07142857142857142,
          309: 0.07142857142857142,
          86: 0.07142857142857142,
          331: 0.07142857142857142,
          97: 0.07142857142857142,
          100: 0.07142857142857142,
          32: 0.07142857142857142,
          332: 0.07142857142857142,
          233: 0.07142857142857142,
          310: 0.07142857142857142,
          31: 0.07142857142857142,
          442: 0.07142857142857142}),
 Counter({6: 0.07692307692307693,
          765: 0.07692307692307693,
          224: 0.07692307692307693,
          203: 0.07692307692307693,
          557: 0.07692307692307693,
          176: 0.07692307692307693,
          590: 0.07692307692307693,
          711: 0.07692307692307693,
          527: 0.07692307692307693,
          757: 0.07692307692307693,
          919: 0.07692307692307693,
          57: 0.07692307692307693,
          685: 0.07692307692307693}),
 Counter({848: 0.08333333333333333,
          617: 0.08333333333333

### Merge the Dictionaries

In the results above, we have three counter objects with numerical keys (the hash codes). You will build one dictionary of them.

The key ranges overlap: the same hash code can occur for a unigram, a bigram, and a trigram, although it denotes a different feature in each case. We will therefore shift the keys so that the three ranges do not collide.

The keys range from:
1. Unigrams: 0 to 520, [0, MAX_CHARS[
2. Bigrams: 0 to 1030, [0, MAX_BIGRAMS[
3. Trigrams: 0 to 1030, [0, MAX_TRIGRAMS[

You will leave the unigram keys as they are. You will shift the bigram keys by MAX_CHARS, and the trigram keys by MAX_CHARS + MAX_BIGRAMS. You can reuse the code below.

In [35]:
MAX_SHIFT = []
for i in range(len(MAXES)):
    MAX_SHIFT += [sum(MAXES[:i])]

In [36]:
MAX_SHIFT

[0, 521, 1552]

Write a `shift_keys(dicts, MAX_SHIFT)` function that takes a list of dictionaries and the list of shifts as input and that returns a single new dictionary, where the numerical keys have been shifted by the numbers in `MAX_SHIFT`.
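The merge can be sketched as follows on three tiny dictionaries that all use the key 6. The name `shift_keys_sketch` is hypothetical. Because the function only iterates once over `dicts`, it also accepts an iterator such as the result of `map()`.

```python
def shift_keys_sketch(dicts, shifts):
    """Merge dictionaries after adding a per-dictionary offset to the keys."""
    merged = {}
    for d, shift in zip(dicts, shifts):
        for key, value in d.items():
            merged[key + shift] = value
    return merged


print(shift_keys_sketch([{6: 0.5}, {6: 0.25}, {6: 1.0}], [0, 521, 1552]))
# → {6: 0.5, 527: 0.25, 1558: 1.0}
```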

In [37]:
# Write your code here
def shift_keys(dicts, MAX_SHIFT):
    new_dict = {}
    ...
    return new_dict

In [38]:
list(map(calc_rel_freq, hash_ngrams(all_ngrams('try something.'), MAXES)))

[Counter({432: 0.14285714285714285,
          437: 0.07142857142857142,
          309: 0.07142857142857142,
          86: 0.07142857142857142,
          331: 0.07142857142857142,
          97: 0.07142857142857142,
          100: 0.07142857142857142,
          32: 0.07142857142857142,
          332: 0.07142857142857142,
          233: 0.07142857142857142,
          310: 0.07142857142857142,
          31: 0.07142857142857142,
          442: 0.07142857142857142}),
 Counter({6: 0.07692307692307693,
          765: 0.07692307692307693,
          224: 0.07692307692307693,
          203: 0.07692307692307693,
          557: 0.07692307692307693,
          176: 0.07692307692307693,
          590: 0.07692307692307693,
          711: 0.07692307692307693,
          527: 0.07692307692307693,
          757: 0.07692307692307693,
          919: 0.07692307692307693,
          57: 0.07692307692307693,
          685: 0.07692307692307693}),
 Counter({848: 0.08333333333333333,
          617: 0.08333333333333

In [39]:
shift_keys(map(calc_rel_freq, hash_ngrams(all_ngrams('try something.'), MAXES)), MAX_SHIFT)

{432: 0.14285714285714285,
 437: 0.07142857142857142,
 309: 0.07142857142857142,
 86: 0.07142857142857142,
 331: 0.07142857142857142,
 97: 0.07142857142857142,
 100: 0.07142857142857142,
 32: 0.07142857142857142,
 332: 0.07142857142857142,
 233: 0.07142857142857142,
 310: 0.07142857142857142,
 31: 0.07142857142857142,
 442: 0.07142857142857142,
 527: 0.07692307692307693,
 1286: 0.07692307692307693,
 745: 0.07692307692307693,
 724: 0.07692307692307693,
 1078: 0.07692307692307693,
 697: 0.07692307692307693,
 1111: 0.07692307692307693,
 1232: 0.07692307692307693,
 1048: 0.07692307692307693,
 1278: 0.07692307692307693,
 1440: 0.07692307692307693,
 578: 0.07692307692307693,
 1206: 0.07692307692307693,
 2400: 0.08333333333333333,
 2169: 0.08333333333333333,
 2020: 0.08333333333333333,
 2008: 0.08333333333333333,
 2425: 0.08333333333333333,
 2548: 0.08333333333333333,
 1839: 0.08333333333333333,
 1562: 0.08333333333333333,
 2369: 0.08333333333333333,
 2226: 0.08333333333333333,
 2512: 0.08333

Finally, we assemble all these utilities in a function

In [40]:
def build_freq_dict(sentence, MAXES=MAXES, MAX_SHIFT=MAX_SHIFT):
    hngrams = hash_ngrams(all_ngrams(sentence), MAXES)
    fhcodes = map(calc_rel_freq, hngrams)
    return shift_keys(fhcodes, MAX_SHIFT)

In [41]:
build_freq_dict('try something.')

{432: 0.14285714285714285,
 437: 0.07142857142857142,
 309: 0.07142857142857142,
 86: 0.07142857142857142,
 331: 0.07142857142857142,
 97: 0.07142857142857142,
 100: 0.07142857142857142,
 32: 0.07142857142857142,
 332: 0.07142857142857142,
 233: 0.07142857142857142,
 310: 0.07142857142857142,
 31: 0.07142857142857142,
 442: 0.07142857142857142,
 527: 0.07692307692307693,
 1286: 0.07692307692307693,
 745: 0.07692307692307693,
 724: 0.07692307692307693,
 1078: 0.07692307692307693,
 697: 0.07692307692307693,
 1111: 0.07692307692307693,
 1232: 0.07692307692307693,
 1048: 0.07692307692307693,
 1278: 0.07692307692307693,
 1440: 0.07692307692307693,
 578: 0.07692307692307693,
 1206: 0.07692307692307693,
 2400: 0.08333333333333333,
 2169: 0.08333333333333333,
 2020: 0.08333333333333333,
 2008: 0.08333333333333333,
 2425: 0.08333333333333333,
 2548: 0.08333333333333333,
 1839: 0.08333333333333333,
 1562: 0.08333333333333333,
 2369: 0.08333333333333333,
 2226: 0.08333333333333333,
 2512: 0.08333

## Converting the Dataset
We can now enrich the dataset with a numerical representation of the sentence. We use the utility functions and we call this new version: `dataset_num`

In [42]:
dataset[:2]

[('8262671', 'eng', "Where are Algeria's embezzled funds?"),
 ('9211583', 'eng', 'Beware of the dog.')]

In [43]:
dataset_num = []
for datapoint in tqdm(dataset):
    dataset_num += [list(datapoint) + [build_freq_dict(datapoint[2])]]

100%|██████████| 300000/300000 [00:56<00:00, 5305.71it/s]


In [44]:
dataset_num[:2]

[['8262671',
  'eng',
  "Where are Algeria's embezzled funds?",
  {301: 0.027777777777777776,
   332: 0.027777777777777776,
   32: 0.19444444444444445,
   437: 0.08333333333333333,
   86: 0.1111111111111111,
   234: 0.08333333333333333,
   15: 0.05555555555555555,
   31: 0.027777777777777776,
   233: 0.027777777777777776,
   323: 0.027777777777777776,
   331: 0.05555555555555555,
   100: 0.027777777777777776,
   25: 0.027777777777777776,
   14: 0.05555555555555555,
   371: 0.05555555555555555,
   327: 0.027777777777777776,
   69: 0.027777777777777776,
   310: 0.027777777777777776,
   368: 0.027777777777777776,
   1337: 0.02857142857142857,
   806: 0.02857142857142857,
   1066: 0.05714285714285714,
   1165: 0.05714285714285714,
   995: 0.05714285714285714,
   614: 0.05714285714285714,
   557: 0.02857142857142857,
   640: 0.02857142857142857,
   565: 0.02857142857142857,
   994: 0.02857142857142857,
   1468: 0.02857142857142857,
   949: 0.02857142857142857,
   1368: 0.02857142857142857,


## Programming: Building ${X}$

You will now build the ${X}$ matrix.

### Vectorizing the features

The CLD3 architecture uses embeddings. In this lab, we will simplify it and use a feature vector of relative frequencies instead. For example, considering only the characters, you will represent the text:

`"Let's try something."`

with:

`{'l': 0.05, 'e': 0.1, 't': 0.15, "'": 0.05, 's': 0.1, ' ': 0.1, 
 'r': 0.05, 'y': 0.05, 'o': 0.05, 'm': 0.05, 'h': 0.05, 'i': 0.05, 
 'n': 0.05, 'g': 0.05, '.': 0.05}`

Note that we used characters and not codes to make it more legible.

To create the ${X}$ matrix, we need to transform the dictionaries of `dataset_num` into numerical vectors. The `DictVectorizer` class from the scikit-learn library, see here [https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html], has two methods, `fit()` and `transform()`, and a combination of both `fit_transform()` to convert dictionaries into such vectors.

You will now write the code to:

1. Extract the hash code frequency dictionaries from `dataset_num`, stored at index 3 of each item;
2. Convert the list of dictionaries into an ${X}$ matrix using `DictVectorizer`.
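To make concrete what such a conversion produces, here is a plain-Python sketch of a dense dictionary-to-matrix mapping. It is not scikit-learn's implementation: the helper name `dicts_to_matrix` is hypothetical, and ordering the columns by sorted key is a choice made for this illustration.

```python
def dicts_to_matrix(dicts):
    """Turn a list of {feature: value} dicts into (rows, feature_names)."""
    # One column per distinct key seen in the data, in sorted order
    feature_names = sorted({key for d in dicts for key in d})
    col = {name: j for j, name in enumerate(feature_names)}
    rows = []
    for d in dicts:
        row = [0.0] * len(feature_names)  # absent features stay at zero
        for key, value in d.items():
            row[col[key]] = value
        rows.append(row)
    return rows, feature_names


rows, names = dicts_to_matrix([{432: 0.5, 527: 0.5}, {86: 1.0}])
print(names)
# → [86, 432, 527]
print(rows)
# → [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0]]
```

The column order is fixed when the mapping is built. With `DictVectorizer`, this corresponds to fitting once on the training data and then only calling `transform()` on new texts, so that their columns line up with those of ${X}$.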

#### Extracting the character frequencies

Produce a new list of datapoints with the $n$-grams. Each item in this list will be a dictionary. You will call it `X_cat`

In [45]:
# Write your code here
X_cat = ...

In [46]:
X_cat[0]

{301: 0.027777777777777776,
 332: 0.027777777777777776,
 32: 0.19444444444444445,
 437: 0.08333333333333333,
 86: 0.1111111111111111,
 234: 0.08333333333333333,
 15: 0.05555555555555555,
 31: 0.027777777777777776,
 233: 0.027777777777777776,
 323: 0.027777777777777776,
 331: 0.05555555555555555,
 100: 0.027777777777777776,
 25: 0.027777777777777776,
 14: 0.05555555555555555,
 371: 0.05555555555555555,
 327: 0.027777777777777776,
 69: 0.027777777777777776,
 310: 0.027777777777777776,
 368: 0.027777777777777776,
 1337: 0.02857142857142857,
 806: 0.02857142857142857,
 1066: 0.05714285714285714,
 1165: 0.05714285714285714,
 995: 0.05714285714285714,
 614: 0.05714285714285714,
 557: 0.02857142857142857,
 640: 0.02857142857142857,
 565: 0.02857142857142857,
 994: 0.02857142857142857,
 1468: 0.02857142857142857,
 949: 0.02857142857142857,
 1368: 0.02857142857142857,
 858: 0.02857142857142857,
 1238: 0.02857142857142857,
 912: 0.02857142857142857,
 654: 0.02857142857142857,
 553: 0.02857142857

#### Vectorize `X_cat`

Convert your `X_cat` list into a numerical representation using `DictVectorizer`: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html. You will set the `sparse` argument to `False`. Call the result `X`.

In [47]:
# Write your code here
...

In [48]:
X.shape

(300000, 2583)

In [49]:
X[:5]

array([[0.    , 0.    , 0.    , ..., 0.    , 0.    , 0.    ],
       [0.    , 0.    , 0.    , ..., 0.0625, 0.    , 0.    ],
       [0.    , 0.    , 0.    , ..., 0.    , 0.    , 0.    ],
       [0.    , 0.    , 0.    , ..., 0.    , 0.    , 0.    ],
       [0.    , 0.    , 0.    , ..., 0.    , 0.    , 0.    ]])

## Programming: Building $\mathbf{y}$

You will now convert the list of language symbols into a $\mathbf{y}$ vector

Extract the language symbols from `dataset_num` and call the resulting list `y_cat`

In [50]:
# Write your code here
y_cat = ...

In [51]:
y_cat[:5]

['eng', 'eng', 'cmn', 'fra', 'eng']

Extract the set of language symbols and name it `y_symbols`. Then build two indices mapping the symbols to integers and the integers to symbols. Both indices will be dictionaries that you will call `lang2idx` and `idx2lang`. Such a conversion is not necessary with sklearn. We do it because many other machine-learning toolkits (Keras or PyTorch) require a numerical $\mathbf{y}$ vector, and so that you learn how to carry out this conversion.
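The two indices can be sketched on a toy label list as follows. All `toy_` names are hypothetical. This sketch sorts the symbols, which is an assumption made here for reproducibility: iterating directly over a set of strings may give a different order from one Python run to the next, so the numbering it produces need not match the outputs shown in this notebook.

```python
toy_y_cat = ['eng', 'eng', 'cmn', 'fra', 'eng']

# sorted() fixes the order of the symbols once and for all
toy_symbols = sorted(set(toy_y_cat))
toy_lang2idx = {lang: idx for idx, lang in enumerate(toy_symbols)}
toy_idx2lang = {idx: lang for lang, idx in toy_lang2idx.items()}

# Numerical version of the label list
toy_y = [toy_lang2idx[lang] for lang in toy_y_cat]

print(toy_lang2idx)
# → {'cmn': 0, 'eng': 1, 'fra': 2}
print(toy_y)
# → [1, 1, 0, 2, 1]
```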

In [52]:
# Write your code here
y_symbols = set(y_cat)
...

In [53]:
idx2lang

{0: 'dan', 1: 'swe', 2: 'eng', 3: 'jpn', 4: 'cmn', 5: 'fra'}

In [54]:
lang2idx

{'dan': 0, 'swe': 1, 'eng': 2, 'jpn': 3, 'cmn': 4, 'fra': 5}

Convert your `y_cat` vector into a numerical vector. Call this vector `y`.

In [55]:
# Write your code here
y = ...

In [56]:
y[:5]

[2, 2, 4, 5, 2]

## Programming: Building the Model

Create a neural network using sklearn with one hidden layer of 50 nodes and a ReLU activation function: https://scikit-learn.org/stable/modules/neural_networks_supervised.html. Set the maximal number of iterations to 5 to begin with, and `verbose` to `True`. Use the default values for the rest. You will call your classifier `clf`.

In [57]:
# Write your code here
clf = ...

In [58]:
clf

### Training and Validation Sets

You will now split the dataset into training and validation sets.

#### We split the dataset
We use a training set of 80% and a validation set of 20%

In [59]:
training_examples = int(X.shape[0] * 0.8)

X_train = X[:training_examples, :]
y_train = y[:training_examples]

X_val = X[training_examples:, :]
y_val = y[training_examples:]

### Fitting the model

Fit the model on the training set

In [60]:
# Write your code here
model = ...

Iteration 1, loss = 0.29736265
Iteration 2, loss = 0.04149845
Iteration 3, loss = 0.02957625
Iteration 4, loss = 0.02458274
Iteration 5, loss = 0.02169300




## Predicting

Predict the `X_val` languages. You will call the result `y_val_pred`

In [61]:
# Write your code here
y_val_pred = ...

In [62]:
y_val_pred[:20]

array([1, 4, 3, 4, 5, 1, 2, 0, 2, 3, 4, 1, 0, 5, 0, 0, 5, 5, 0, 0])

In [63]:
y_val[:20]

[1, 4, 3, 4, 5, 1, 2, 0, 2, 3, 4, 1, 0, 5, 0, 0, 5, 5, 0, 0]

#### Evaluating

Use the `accuracy_score()` function to evaluate your model on the validation set

In [64]:
# evaluate the model
accuracy_score(y_val, y_val_pred)

0.99175

In [65]:
print(classification_report(y_val, y_val_pred, target_names=y_symbols))
print('Micro F1:', f1_score(y_val, y_val_pred, average='micro'))
print('Macro F1', f1_score(y_val, y_val_pred, average='macro'))

              precision    recall  f1-score   support

         dan       0.98      0.98      0.98     10081
         swe       0.98      0.98      0.98     10001
         eng       0.99      1.00      1.00      9989
         jpn       1.00      1.00      1.00      9868
         cmn       1.00      1.00      1.00     10027
         fra       1.00      1.00      1.00     10034

    accuracy                           0.99     60000
   macro avg       0.99      0.99      0.99     60000
weighted avg       0.99      0.99      0.99     60000

Micro F1: 0.99175
Macro F1 0.9917723012971812


### Confusion Matrix

In [66]:
confusion_matrix(y_val, y_val_pred)

array([[ 9910,   143,    23,     0,     0,     5],
       [  190,  9786,    18,     0,     0,     7],
       [   13,     8,  9957,     0,     0,    11],
       [    0,     0,     0,  9844,    24,     0],
       [    5,     4,     2,    14, 10000,     2],
       [    5,     5,    15,     0,     1, 10008]])
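In scikit-learn's convention, the rows of this matrix are the true classes and the columns the predicted ones. The sketch below, in plain Python, recomputes the recall of each class from the matrix and finds the class it is most often confused with. The label order is taken from the `idx2lang` output shown earlier; the variable names are hypothetical.

```python
# Confusion matrix copied from the output above
cm = [[9910, 143, 23, 0, 0, 5],
      [190, 9786, 18, 0, 0, 7],
      [13, 8, 9957, 0, 0, 11],
      [0, 0, 0, 9844, 24, 0],
      [5, 4, 2, 14, 10000, 2],
      [5, 5, 15, 0, 1, 10008]]
labels = ['dan', 'swe', 'eng', 'jpn', 'cmn', 'fra']

# Recall of class i: correct predictions divided by the row total
recalls = [row[i] / sum(row) for i, row in enumerate(cm)]

# Largest off-diagonal entry of each row: the most frequent wrong prediction
confused = [max((k for k in range(len(row)) if k != i), key=lambda k: row[k])
            for i, row in enumerate(cm)]

for i, label in enumerate(labels):
    j = confused[i]
    print(f'{label}: recall {recalls[i]:.3f}, '
          f'most often predicted as {labels[j]} ({cm[i][j]} times)')
```

Most of the errors come from Danish texts predicted as Swedish and Swedish texts predicted as Danish, which is a point you may want to discuss in your report.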

You may try to increase the number of iterations to improve the score. You may also try to change the parameters of the multilayer perceptron.

## Predict the language of a text

Now you will predict the languages of the strings below.

In [67]:
docs = ["Salut les gars !", "Hejsan grabbar!", "Hello guys!", "Hejsan tjejer!"]

In [68]:
build_freq_dict('Salut les gars !')

{331: 0.1875,
 234: 0.125,
 15: 0.125,
 69: 0.0625,
 432: 0.0625,
 86: 0.1875,
 32: 0.0625,
 31: 0.0625,
 437: 0.0625,
 333: 0.0625,
 1078: 0.06666666666666667,
 640: 0.06666666666666667,
 582: 0.06666666666666667,
 1542: 0.06666666666666667,
 1492: 0.06666666666666667,
 900: 0.06666666666666667,
 739: 0.06666666666666667,
 1319: 0.06666666666666667,
 1238: 0.13333333333333333,
 982: 0.06666666666666667,
 1415: 0.06666666666666667,
 557: 0.06666666666666667,
 1161: 0.06666666666666667,
 1020: 0.06666666666666667,
 1803: 0.07142857142857142,
 1608: 0.07142857142857142,
 2199: 0.07142857142857142,
 2349: 0.07142857142857142,
 2284: 0.07142857142857142,
 1958: 0.07142857142857142,
 1720: 0.07142857142857142,
 1925: 0.07142857142857142,
 2370: 0.07142857142857142,
 2546: 0.07142857142857142,
 1752: 0.07142857142857142,
 1805: 0.07142857142857142,
 1670: 0.07142857142857142,
 2269: 0.07142857142857142}

Create feature vectors from this list. Call this matrix `X_test`

In [69]:
# Write your code here
X_test = ...

In [70]:
X_test

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Then run the prediction and store the result in a variable called `pred_languages`

In [71]:
# Write your code here
pred_languages = ...

In [72]:
pred_languages

['fra', 'swe', 'eng', 'dan']

## Building the Model with PyTorch
You will now recreate a PyTorch model with the same architecture as in sklearn.

### The Model
Create a model identical to the one you created with sklearn. Use the same activation function for the hidden layer and no activation in the last layer.

In [73]:
len(langs)

6

In [74]:
# Write your code here
class Model(nn.Module):
    def __init__(self, input_dim):
        ...

In [75]:
input_dim = X.shape[1]
model = Model(input_dim)
model

Model(
  (fc1): Linear(in_features=2583, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=6, bias=True)
)

Define the loss `loss_fn` and the optimizer `optimizer`. Use the same optimizer as in sklearn. See here: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

In [77]:
# Write your code here. (The solution is given)
loss_fn = nn.CrossEntropyLoss()    # cross entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

### The data loader

We convert the data to tensors

In [78]:
X_train = torch.Tensor(X_train)
y_train = torch.LongTensor(y_train)

X_val = torch.Tensor(X_val)
y_val = torch.LongTensor(y_val)

X_test = torch.Tensor(X_test)

In [79]:
from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(X_train, y_train)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

In [80]:
model.train()

Model(
  (fc1): Linear(in_features=2583, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=6, bias=True)
)

Fit your network on your training set. Write code similar to that seen during the lecture and use five epochs to start with.
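As a reminder of the usual structure of such a loop, here is a minimal sketch on synthetic data. The sizes (64 examples, 10 features, 3 classes), the learning rate, and all `toy_` names are assumptions chosen for this illustration; it is not the model of this assignment, and the code seen during the lecture may be organized differently.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# Synthetic stand-in data with random labels
X_toy = torch.rand(64, 10)
y_toy = torch.randint(0, 3, (64,))

toy_model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 3))
toy_loss_fn = nn.CrossEntropyLoss()  # expects logits, not probabilities
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.01)
toy_loader = DataLoader(TensorDataset(X_toy, y_toy), batch_size=16, shuffle=True)

toy_model.train()
epoch_losses = []
for epoch in range(5):
    loss_train = 0.0
    for X_batch, y_batch in toy_loader:
        toy_optimizer.zero_grad()            # reset the gradients
        logits = toy_model(X_batch)          # forward pass
        loss = toy_loss_fn(logits, y_batch)
        loss.backward()                      # backpropagation
        toy_optimizer.step()                 # parameter update
        loss_train += loss.item()            # sum of the batch losses
    epoch_losses.append(loss_train)
    print(loss_train)
```

The printed value is the sum of the batch losses over one epoch, so its scale depends on the number of batches.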

In [81]:
# Write your code here
for epoch in range(5):
    loss_train = 0
    for X_batch, y_batch in dataloader:
        ...
    print(loss_train)

765.704064067424
174.51052177065867
140.88319964771836
122.14930617929895
108.1038764459961


Predict the validation set `X_val` in the form of logits. Call the result: `Y_val_pred_logits`

In [82]:
model.eval()

Model(
  (fc1): Linear(in_features=2583, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=6, bias=True)
)

In [83]:
# Write your code here
Y_val_pred_logits = ...

In [84]:
Y_val_pred_logits[:5]

tensor([[ -5.9929,   6.8918,  -3.0228,  -9.0067,  -4.4374,  -9.8311],
        [ -5.7582, -10.4759, -12.6710,  -3.3666,  18.9000, -12.0623],
        [ -6.9647,  -9.3725, -22.0371,  22.4224,  -8.4151, -14.8630],
        [ -8.7401, -11.8870,  -8.4524,  -2.2943,  16.8711,  -7.5894],
        [ -3.7173,  -5.8539,  -5.3409,  -9.7889,  -6.0550,  11.3400]],
       grad_fn=<SliceBackward0>)

Predict the validation set `X_val` in the form of probabilities. Use `torch.softmax()` and call the result `Y_val_pred_proba`.

In [85]:
# Write your code here
Y_val_pred_proba = ...

In [86]:
Y_val_pred_proba[:4]

tensor([[2.5364e-06, 9.9994e-01, 4.9445e-05, 1.2455e-07, 1.2017e-05, 5.4616e-08],
        [1.9547e-11, 1.7467e-13, 1.9448e-14, 2.1365e-10, 1.0000e+00, 3.5748e-14],
        [1.7271e-13, 1.5547e-14, 4.9143e-20, 1.0000e+00, 4.0500e-14, 6.4144e-17],
        [7.5369e-12, 3.2398e-13, 1.0050e-11, 4.7487e-09, 1.0000e+00, 2.3820e-11]],
       grad_fn=<SliceBackward0>)
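
To illustrate the relation between logits and probabilities, the sketch below applies `torch.softmax()` to the first two rows of the logits printed above, copied by hand into a new tensor. The variable names are illustrative.

```python
import torch

# First two rows of the logits printed above
logits = torch.tensor([[-5.9929, 6.8918, -3.0228, -9.0067, -4.4374, -9.8311],
                       [-5.7582, -10.4759, -12.6710, -3.3666, 18.9000, -12.0623]])

# dim=1 normalizes across the classes, so that each row sums to one
proba = torch.softmax(logits, dim=1)

print(proba)
print(proba.sum(dim=1))
```

The softmax is monotonic: the class with the largest logit is also the class with the largest probability.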

Extract the categories from the probabilities in `Y_val_pred_proba`. Use the `torch.argmax()` function. Call the result `y_val_pred`. Check that the predictions match the true values in `y_val`.

In [87]:
# Write your code here
y_val_pred = ...

In [88]:
y_val_pred[:20]

tensor([1, 4, 3, 4, 5, 1, 2, 0, 2, 3, 4, 1, 0, 5, 0, 0, 5, 5, 0, 0])

In [89]:
y_val[:20]

tensor([1, 4, 3, 4, 5, 1, 2, 0, 2, 3, 4, 1, 0, 5, 0, 0, 5, 5, 0, 0])

Print the evaluation:

In [90]:
print(classification_report(y_val, y_val_pred, target_names=y_symbols))
print('Micro F1:', f1_score(y_val, y_val_pred, average='micro'))
print('Macro F1', f1_score(y_val, y_val_pred, average='macro'))

              precision    recall  f1-score   support

         dan       0.98      0.98      0.98     10081
         swe       0.98      0.98      0.98     10001
         eng       0.99      1.00      1.00      9989
         jpn       1.00      1.00      1.00      9868
         cmn       1.00      1.00      1.00     10027
         fra       1.00      1.00      1.00     10034

    accuracy                           0.99     60000
   macro avg       0.99      0.99      0.99     60000
weighted avg       0.99      0.99      0.99     60000

Micro F1: 0.9922333333333333
Macro F1 0.9922547719497999


Print the confusion matrix:

In [91]:
confusion_matrix(y_val, y_val_pred)

array([[ 9914,   146,    20,     0,     0,     1],
       [  167,  9815,    15,     0,     1,     3],
       [   15,     6,  9962,     0,     0,     6],
       [    0,     0,     0,  9839,    29,     0],
       [    3,     2,     1,    14, 10005,     2],
       [    6,     5,    24,     0,     0,  9999]])
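
The confusion matrix can be read numerically. In the sketch below, the matrix is copied from the output above and the language list follows the order of the classification report; the variable names are illustrative. In sklearn, rows correspond to the true classes and columns to the predictions.

```python
import numpy as np

# Confusion matrix printed above (rows: true classes, columns: predictions)
cm = np.array([[9914, 146, 20, 0, 0, 1],
               [167, 9815, 15, 0, 1, 3],
               [15, 6, 9962, 0, 0, 6],
               [0, 0, 0, 9839, 29, 0],
               [3, 2, 1, 14, 10005, 2],
               [6, 5, 24, 0, 0, 9999]])
langs_in_report = ['dan', 'swe', 'eng', 'jpn', 'cmn', 'fra']  # order of the report above

# Accuracy: correct predictions (diagonal) over all predictions
accuracy = np.trace(cm) / cm.sum()
# Recall per class: diagonal over the row sums
recall = np.diag(cm) / cm.sum(axis=1)

# Largest off-diagonal cell: the most frequent confusion
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)

print(f'accuracy: {accuracy:.4f}')
print(f'most frequent error: {langs_in_report[i]} predicted as '
      f'{langs_in_report[j]} ({off_diag[i, j]} times)')
```

The accuracy computed this way equals the micro F1 printed above, and most errors come from the Danish/Swedish pair and, to a lesser extent, the Japanese/Chinese pair.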

Predict your languages with PyTorch. Reuse `X_test` and call the result `Y_test_pred_proba`.

In [92]:
# Write your code here
Y_test_pred_proba = ...

In [93]:
Y_test_pred_proba

tensor([[1.8276e-07, 6.6974e-07, 1.4123e-07, 1.6559e-09, 2.8414e-08, 1.0000e+00],
        [4.1081e-04, 9.9952e-01, 2.4307e-05, 9.6759e-07, 4.6090e-05, 1.8566e-06],
        [3.9380e-02, 4.1912e-05, 9.4139e-01, 7.8594e-06, 8.5014e-04, 1.8331e-02],
        [9.9818e-01, 1.8032e-03, 7.3723e-08, 9.3494e-07, 1.2838e-05, 9.8474e-07]],
       grad_fn=<SoftmaxBackward0>)

From the probabilities, extract the predicted languages and map them to strings. Call the result `pred_languages_pytorch`.

In [94]:
# Write your code here
pred_languages_pytorch = ...

In [95]:
pred_languages_pytorch

['fra', 'swe', 'eng', 'dan']
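
The mapping from probabilities to language codes can be sketched as follows. The probabilities are copied from the output above; the list `idx2lang` is an assumption that follows the order of the classification report, and the other names are illustrative as well.

```python
import torch

# Probabilities printed above, copied into a plain tensor
proba = torch.tensor(
    [[1.8276e-07, 6.6974e-07, 1.4123e-07, 1.6559e-09, 2.8414e-08, 1.0000e+00],
     [4.1081e-04, 9.9952e-01, 2.4307e-05, 9.6759e-07, 4.6090e-05, 1.8566e-06],
     [3.9380e-02, 4.1912e-05, 9.4139e-01, 7.8594e-06, 8.5014e-04, 1.8331e-02],
     [9.9818e-01, 1.8032e-03, 7.3723e-08, 9.3494e-07, 1.2838e-05, 9.8474e-07]])

# Assumption: index-to-language list in the order of the classification report
idx2lang = ['dan', 'swe', 'eng', 'jpn', 'cmn', 'fra']

# Index of the most probable class in each row, then lookup of the language code
indices = torch.argmax(proba, dim=1)
languages = [idx2lang[i] for i in indices.tolist()]
print(languages)
```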

## Turning in your assignment

You are now done with the program. To complete this assignment, you will:
1. Write a short individual report on your program. Do not forget to:
   * Summarize CLD3 and outline its architecture
   * Identify the features used by CLD3
   * Describe your architecture and tell how it is different from CLD3
   * Include the feature matrix you computed manually
   * Outline the differences between sklearn and PyTorch

Submit your report as well as your notebook (for archiving purposes) to Canvas: https://canvas.education.lu.se/. To write your report, you can either
1. Write your text directly in Canvas, or
2. Use LaTeX and Overleaf (www.overleaf.com). This will probably help you structure your text. You will then upload a PDF file to Canvas.

The submission deadline is October 6, 2023.

## Postscript from Pierre Nugues

I created this assignment from an examination I wrote in 2019 for the course on applied machine learning. I simplified it from the `README.md` on GitHub, https://github.com/google/cld3. I found the C++ code difficult to understand, so I reimplemented it in Keras/TensorFlow from this `README`. Should you be interested, you can find it here: https://github.com/pnugues/language-detector.