# Assignment #3: A simple language classifier with scikit-learn and Keras

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Author: Pierre Nugues

## Objectives

In this assignment, you will implement a language detector inspired by Google's _Compact language detector_, version 3 (CLD3): https://github.com/google/cld3. CLD3 is written in C++ and its code is available from GitHub. The objectives of the assignment are to:
* Write a program to classify languages
* Use neural networks with sklearn and either Keras or PyTorch
* Know what a classifier is
* Write a short report of 1 to 2 pages to describe your program. You will notably comment on the performance you obtained and how you could improve it.

## Description

### System Overview

Read the GitHub description of CLD3, https://github.com/google/cld3, (_Model_ section). In your individual report you will:
1. Summarize the system in two or three sentences;
2. Outline the CLD3 overall architecture in a figure. Use building blocks only and do not specify the parameters.

## Imports

In [None]:
import bz2
import json
import os
import numpy as np
import requests
import sys
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score, classification_report
from sklearn.metrics import confusion_matrix

### Dataset

As a dataset, we will use Tatoeba, https://tatoeba.org/eng/downloads. It consists of more than 8 million short texts in 347 languages and it is available in one file called `sentences.csv`.

The dataset is structured this way: There is one text per line, where each line consists of the three following fields separated by tabs and ended by a carriage return:
```
sentence id [tab] language code [tab] text [cr]
```
Each text (sentence) has a unique id and has a language code that follows the ISO 639-3 standard (see below). 

### Scope of the lab

In this lab, you will consider three languages only: French (fra), English (eng), and Swedish (swe). Below is an excerpt of the Tatoeba dataset limited to these three languages: 

```
1276    eng     Let's try something.
1277    eng     I have to go to sleep.
1280    eng     Today is June 18th and it is Muiriel's birthday!
...
1115    fra     Lorsqu'il a demandé qui avait cassé la fenêtre, tous les garçons ont pris un air innocent.
1279    fra     Je ne supporte pas ce type.
1441    fra     Pour une fois dans ma vie je fais un bon geste... Et ça ne sert à rien.
...
337413  swe     Vi trodde att det var ett flygande tefat.
341910  swe     Detta är huset jag bodde i när jag var barn.
341938  swe     Vi hade roligt på stranden igår.
...
```
Tatoeba is updated continuously. The examples from this dataset come from a corpus your instructor downloaded on September 23, 2021.

### Understanding the $\mathbf{X}$ matrix (feature matrix)

You will now investigate the CLD3 features:
 *  What are the features CLD3 extracts from each text?
 * Create manually a simplified $\mathbf{X}$ matrix where you will represent the 9 texts with CLD3 features. You will use a restricted set of features: You will only consider the letters _a_, _b_, and _n_ and the bigrams _an_, _ba_, and _na_. You will ignore the rest of the letters and bigrams as well as the trigrams. Your matrix will have 9 rows and 6 columns; each column will contain these counts: `[#a, #b, #n, #an, #ba, #na]`.

The CLD3's original description uses relative frequencies (counts of a letter divided by the total counts of letters in the text). Here, you will use the raw counts. To help you, your instructor filled the fourth row of the matrix corresponding to the first text in French. Fill in the rest. You will include this matrix in your report. 

$\mathbf{X} =
\begin{bmatrix}
0& 0& 1& 0&0& 0\\
0& 0& 0& 0&0& 0\\
3& 1& 2& 1&0& 0\\
8& 0& 8& 1&0&0\\
1& 0& 1& 0&0& 0\\
3& 1& 5& 1&0& 0\\
3& 0& 1& 0&0& 0\\
4& 2& 2& 0&0& 0\\
2& 0& 1& 1&0& 0\\
\end{bmatrix}$
; $\mathbf{y} =
\begin{bmatrix}
     \text{eng} \\
     \text{eng}\\
     \text{eng}\\
    \text{fra}\\
   \text{fra}  \\
     \text{fra}\\
    \text{swe}\\
 \text{swe}   \\
 \text{swe}   
\end{bmatrix}$
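
To check a hand-computed row, the raw counts can also be obtained programmatically. The sketch below is only an illustration: the helper name `restricted_counts` and the choice to lowercase the text before counting are assumptions, not part of the assignment.

```python
def restricted_counts(text, unigrams=('a', 'b', 'n'), bigrams=('an', 'ba', 'na')):
    """Return the raw counts [#a, #b, #n, #an, #ba, #na] of a text (hypothetical helper)."""
    text = text.lower()  # assumption: counts are case-insensitive
    # All pairs of adjacent characters in the text
    pairs = [text[i:i + 2] for i in range(len(text) - 1)]
    return [text.count(u) for u in unigrams] + [pairs.count(b) for b in bigrams]


row = restricted_counts("Let's try something.")
print(row)  # [0, 0, 1, 0, 0, 0]
```

The printed row agrees with the first row of the matrix above, which corresponds to the first English text.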

## Programming: Extracting the features

Before you start programming, download the Tatoeba dataset. You can use the following commands:

In [None]:
!wget https://downloads.tatoeba.org/exports/sentences.tar.bz2

In [None]:
!tar -xvjf sentences.tar.bz2

### Loading and filtering the dataset

Run the code to read the dataset and split it into lines. You may have to change the path.

In [None]:
dataset = open('../../corpus/sentences.csv', encoding='utf8').read().strip()
dataset = dataset.split('\n')
dataset[:10]

['1\tcmn\t我們試試看！',
 '2\tcmn\t我该去睡觉了。',
 '3\tcmn\t你在干什麼啊？',
 '4\tcmn\t這是什麼啊？',
 '5\tcmn\t今天是６月１８号，也是Muiriel的生日！',
 '6\tcmn\t生日快乐，Muiriel！',
 '7\tcmn\tMuiriel现在20岁了。',
 '8\tcmn\t密码是"Muiriel"。',
 '9\tcmn\t我很快就會回來。',
 '10\tcmn\t我不知道。']

Run the code to split the fields and remove possible whitespace.

In [None]:
dataset = list(map(lambda x: tuple(x.split('\t')), dataset))
dataset = list(map(lambda x: tuple(map(str.strip, x)), dataset))
dataset[:3]

[('1', 'cmn', '我們試試看！'), ('2', 'cmn', '我该去睡觉了。'), ('3', 'cmn', '你在干什麼啊？')]

Write the code to extract the French, English, and Swedish texts. You will call the resulting dataset: `dataset_small`

In [None]:
# Write your code here
# mini.tsv is assumed to already contain only the fra, eng, and swe rows
path = '/content/drive/MyDrive/Colab Notebooks/corpus/mini.tsv'
with open(path, encoding='UTF-8') as f:
    dataset_small = f.read().strip().split('\n')
# Split each line into its fields and strip surrounding whitespace
dataset_small = list(map(lambda x: tuple(x.split('\t')), dataset_small))
dataset_small = list(map(lambda x: tuple(map(str.strip, x)), dataset_small))
    

In [None]:
dataset_small[:5]

[('1482', 'eng', "I don't speak Japanese."),
 ('1598', 'eng', "I don't want to spend the rest of my life regretting it."),
 ('1632', 'eng', 'The cost of life increased drastically.'),
 ('1864', 'eng', "She's really smart, isn't she?"),
 ('2019',
  'eng',
  'The news article painted the defendant as a guilty man, even though he had been proven innocent.')]
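
The exercise asks for a filter over the full `dataset`, whereas the cell above loads a separate file. A minimal sketch of the filtering step is shown here on a few toy rows in the same `(id, language, text)` format; `toy_dataset`, `toy_small`, and `KEPT_LANGUAGES` are illustrative names, not part of the assignment.

```python
KEPT_LANGUAGES = {'fra', 'eng', 'swe'}  # ISO 639-3 codes used in this lab

toy_dataset = [
    ('1', 'cmn', '我們試試看！'),
    ('1276', 'eng', "Let's try something."),
    ('1279', 'fra', 'Je ne supporte pas ce type.'),
    ('337413', 'swe', 'Vi trodde att det var ett flygande tefat.'),
]

# Keep well-formed rows (exactly three fields) written in one of the three languages
toy_small = [row for row in toy_dataset
             if len(row) == 3 and row[1] in KEPT_LANGUAGES]

print([row[1] for row in toy_small])  # ['eng', 'fra', 'swe']
```

Testing membership in a set keeps the filter to a single pass over the corpus, which matters with several million lines.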

### Functions to Count Characters Ngrams

Write a function `count_chars(string, lc=True)` to count characters (unigrams) of a string. You will set the text in lowercase if `lc` is set to `True`. As in CLD3, you will return the relative frequencies of the unigrams.

In [None]:
# Write your code here
import regex as re
from collections import Counter
def count_chars(text, lc = True):
  if lc:
    text = text.lower() 
  return {k : v/len(text) for k,v in Counter(text).items()}


In [None]:
count_chars("hejhej jag heter Nils")

{'h': 0.16666666666666666,
 'e': 0.2222222222222222,
 'j': 0.16666666666666666,
 'a': 0.05555555555555555,
 'g': 0.05555555555555555,
 't': 0.05555555555555555,
 'r': 0.05555555555555555,
 'n': 0.05555555555555555,
 'i': 0.05555555555555555,
 'l': 0.05555555555555555,
 's': 0.05555555555555555}

Write a function `count_bigrams(string, lc=True)` to count the characters bigrams of a string. You will set the text in lowercase if `lc` is set to `True`. As in CLD3, you will return the relative frequencies of the bigrams.

In [None]:
# Write your code here
def count_bigrams(text, lc=True):
    if lc:
        text = text.lower()
    # Collect every pair of adjacent characters (spaces and punctuation included)
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    frequencies = Counter(bigrams)
    total = sum(frequencies.values())
    # Normalize the counts into relative frequencies
    return {k: v / total for k, v in frequencies.items()}


In [None]:
count_bigrams("hejhej jag heter Nils")

{'he': 0.15,
 'ej': 0.1,
 'jh': 0.05,
 'j ': 0.05,
 ' j': 0.05,
 'ja': 0.05,
 'ag': 0.05,
 'g ': 0.05,
 ' h': 0.05,
 'et': 0.05,
 'te': 0.05,
 'er': 0.05,
 'r ': 0.05,
 ' n': 0.05,
 'ni': 0.05,
 'il': 0.05,
 'ls': 0.05}

Write a function `count_trigrams(string, lc=True)` to count the characters trigrams of a string. You will set the text in lowercase if `lc` is set to `True`. As in CLD3, you will return the relative frequencies of the trigrams.

In [None]:
# Write your code here
def count_trigrams(text, lc=True):
    if lc:
        text = text.lower()
    # Collect every run of three adjacent characters (spaces and punctuation included)
    trigrams = [text[idx:idx + 3] for idx in range(len(text) - 2)]
    frequencies = Counter(trigrams)
    total = sum(frequencies.values())
    # Normalize the counts into relative frequencies
    return {k: v / total for k, v in frequencies.items()}


In [None]:
count_trigrams("hejhej jag heter Nils")

{'hej': 0.10526315789473684,
 'ejh': 0.05263157894736842,
 'jhe': 0.05263157894736842,
 'ej ': 0.05263157894736842,
 'j j': 0.05263157894736842,
 ' ja': 0.05263157894736842,
 'jag': 0.05263157894736842,
 'ag ': 0.05263157894736842,
 'g h': 0.05263157894736842,
 ' he': 0.05263157894736842,
 'het': 0.05263157894736842,
 'ete': 0.05263157894736842,
 'ter': 0.05263157894736842,
 'er ': 0.05263157894736842,
 'r n': 0.05263157894736842,
 ' ni': 0.05263157894736842,
 'nil': 0.05263157894736842,
 'ils': 0.05263157894736842}

In [None]:
count_chars("Let's try something.")

{'l': 0.05,
 'e': 0.1,
 't': 0.15,
 "'": 0.05,
 's': 0.1,
 ' ': 0.1,
 'r': 0.05,
 'y': 0.05,
 'o': 0.05,
 'm': 0.05,
 'h': 0.05,
 'i': 0.05,
 'n': 0.05,
 'g': 0.05,
 '.': 0.05}

In [None]:
count_bigrams("Let's try something.")

{'le': 0.05263157894736842,
 'et': 0.10526315789473684,
 "t'": 0.05263157894736842,
 "'s": 0.05263157894736842,
 's ': 0.05263157894736842,
 ' t': 0.05263157894736842,
 'tr': 0.05263157894736842,
 'ry': 0.05263157894736842,
 'y ': 0.05263157894736842,
 ' s': 0.05263157894736842,
 'so': 0.05263157894736842,
 'om': 0.05263157894736842,
 'me': 0.05263157894736842,
 'th': 0.05263157894736842,
 'hi': 0.05263157894736842,
 'in': 0.05263157894736842,
 'ng': 0.05263157894736842,
 'g.': 0.05263157894736842}

In [None]:
count_trigrams("Let's try something.")

{'let': 0.05555555555555555,
 "et'": 0.05555555555555555,
 "t's": 0.05555555555555555,
 "'s ": 0.05555555555555555,
 's t': 0.05555555555555555,
 ' tr': 0.05555555555555555,
 'try': 0.05555555555555555,
 'ry ': 0.05555555555555555,
 'y s': 0.05555555555555555,
 ' so': 0.05555555555555555,
 'som': 0.05555555555555555,
 'ome': 0.05555555555555555,
 'met': 0.05555555555555555,
 'eth': 0.05555555555555555,
 'thi': 0.05555555555555555,
 'hin': 0.05555555555555555,
 'ing': 0.05555555555555555,
 'ng.': 0.05555555555555555}
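
The three counting functions differ only in the width of the sliding window, so they can be folded into one. The sketch below assumes a hypothetical helper `count_ngrams(text, n, lc=True)`; like the functions above, it keeps spaces and punctuation.

```python
from collections import Counter


def count_ngrams(text, n, lc=True):
    """Relative frequencies of the character n-grams of a text (hypothetical helper)."""
    if lc:
        text = text.lower()
    # Slide a window of width n over the text
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {ngram: cnt / total for ngram, cnt in counts.items()}


bigram_freqs = count_ngrams("Let's try something.", 2)
print(round(bigram_freqs['et'], 4))  # 0.1053
```

A text shorter than `n` characters yields an empty dictionary, so no division by zero can occur.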

### Counting the ngrams in the dataset

You will now extract the features from each text. For this, add the character, bigram, and trigram relative frequencies to the texts using this format:
`(text_id, language_id, text, char_cnt, bigram_cnt, trigram_cnt)`.

From the datapoint:
`('1276', 'eng', "Let's try something.")`,
you must return:

`('1276', 'eng', "Let's try something.", 
  {'l': 0.05, 'e': 0.1, 't': 0.15, "'": 0.05, 's': 0.1, ' ': 0.1, 'r': 0.05, 'y': 0.05, 'o': 0.05, 'm': 0.05, 'h': 0.05, 'i': 0.05, 'n': 0.05, 'g': 0.05, '.': 0.05},
  {'le': 0.05263157894736842, 'et': 0.10526315789473684, "t'": 0.05263157894736842, "'s": 0.05263157894736842, 's ': 0.05263157894736842, ' t': 0.05263157894736842, 'tr': 0.05263157894736842, 'ry': 0.05263157894736842, 'y ': 0.05263157894736842, ' s': 0.05263157894736842, 'so': 0.05263157894736842, 'om': 0.05263157894736842, 'me': 0.05263157894736842, 'th': 0.05263157894736842, 'hi': 0.05263157894736842, 'in': 0.05263157894736842, 'ng': 0.05263157894736842, 'g.': 0.05263157894736842},
  {'let': 0.05555555555555555, "et'": 0.05555555555555555, "t's": 0.05555555555555555, "'s ": 0.05555555555555555, 's t': 0.05555555555555555, ' tr': 0.05555555555555555, 'try': 0.05555555555555555, 'ry ': 0.05555555555555555, 'y s': 0.05555555555555555, ' so': 0.05555555555555555, 'som': 0.05555555555555555, 'ome': 0.05555555555555555, 'met': 0.05555555555555555, 'eth': 0.05555555555555555, 'thi': 0.05555555555555555, 'hin': 0.05555555555555555, 'ing': 0.05555555555555555, 'ng.': 0.05555555555555555})`

You will store the extracted features in a list that you will call `dataset_small_feat`

In [None]:
# Write your code here
dataset_small_feat = list(map(lambda x: x + (count_chars(x[-1]), count_bigrams(x[-1]), count_trigrams(x[-1])), 
                              dataset_small))

We only compute the unigrams and bigrams as most students have slow machines

In [None]:
# Write your code here


In [None]:
dataset_small_feat[:2]

[('1482',
  'eng',
  "I don't speak Japanese.",
  {'i': 0.043478260869565216,
   ' ': 0.13043478260869565,
   'd': 0.043478260869565216,
   'o': 0.043478260869565216,
   'n': 0.08695652173913043,
   "'": 0.043478260869565216,
   't': 0.043478260869565216,
   's': 0.08695652173913043,
   'p': 0.08695652173913043,
   'e': 0.13043478260869565,
   'a': 0.13043478260869565,
   'k': 0.043478260869565216,
   'j': 0.043478260869565216,
   '.': 0.043478260869565216},
  {'i ': 0.045454545454545456,
   ' d': 0.045454545454545456,
   'do': 0.045454545454545456,
   'on': 0.045454545454545456,
   "n'": 0.045454545454545456,
   "'t": 0.045454545454545456,
   't ': 0.045454545454545456,
   ' s': 0.045454545454545456,
   'sp': 0.045454545454545456,
   'pe': 0.045454545454545456,
   'ea': 0.045454545454545456,
   'ak': 0.045454545454545456,
   'k ': 0.045454545454545456,
   ' j': 0.045454545454545456,
   'ja': 0.045454545454545456,
   'ap': 0.045454545454545456,
   'pa': 0.045454545454545456,
   'an': 0

The unigram frequencies

In [None]:
dataset_small_feat[0][3].items()

dict_items([('i', 0.043478260869565216), (' ', 0.13043478260869565), ('d', 0.043478260869565216), ('o', 0.043478260869565216), ('n', 0.08695652173913043), ("'", 0.043478260869565216), ('t', 0.043478260869565216), ('s', 0.08695652173913043), ('p', 0.08695652173913043), ('e', 0.13043478260869565), ('a', 0.13043478260869565), ('k', 0.043478260869565216), ('j', 0.043478260869565216), ('.', 0.043478260869565216)])

The bigram frequencies

In [None]:
dataset_small_feat[0][4].items()

dict_items([('i ', 0.045454545454545456), (' d', 0.045454545454545456), ('do', 0.045454545454545456), ('on', 0.045454545454545456), ("n'", 0.045454545454545456), ("'t", 0.045454545454545456), ('t ', 0.045454545454545456), (' s', 0.045454545454545456), ('sp', 0.045454545454545456), ('pe', 0.045454545454545456), ('ea', 0.045454545454545456), ('ak', 0.045454545454545456), ('k ', 0.045454545454545456), (' j', 0.045454545454545456), ('ja', 0.045454545454545456), ('ap', 0.045454545454545456), ('pa', 0.045454545454545456), ('an', 0.045454545454545456), ('ne', 0.045454545454545456), ('es', 0.045454545454545456), ('se', 0.045454545454545456), ('e.', 0.045454545454545456)])

## Programming: Building $\mathbf{X}$

You will now build the $\mathbf{X}$ matrix. In this assignment, you will only consider unigrams to speed up the training step. This means that you will set aside the character bigrams and trigrams.

When you are done with the lab requirements, feel free to improve the program and include bigrams and trigrams. To add bigrams, a possible method is to add the bigram dictionary to the unigram one using update and then to extract the resulting dictionary. You can easily extend this to trigrams. Feel free to use another method if you want.

In [None]:
INCLUDE_BIGRAMS = False
if INCLUDE_BIGRAMS:
    for i in range(len(dataset_small_feat)):
        dataset_small_feat[i][3].update(dataset_small_feat[i][4])

### Vectorizing the features

The CLD3 architecture uses embeddings. In this lab, we will simplify it and we will use a feature vector instead consisting of the character frequencies. For example, you will represent the text:

`"Let's try something."`

with:

`{'l': 0.05, 'e': 0.1, 't': 0.15, "'": 0.05, 's': 0.1, ' ': 0.1, 
 'r': 0.05, 'y': 0.05, 'o': 0.05, 'm': 0.05, 'h': 0.05, 'i': 0.05, 
 'n': 0.05, 'g': 0.05, '.': 0.05}`

To create the $\mathbf{X}$ matrix, we need to transform the dictionaries of `dataset_small_feat` into numerical vectors. The `DictVectorizer` class from the scikit-learn library (see https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) has two methods, `fit()` and `transform()`, and a combination of both, `fit_transform()`, to convert dictionaries into such vectors.

You will now write the code to:

1. Extract the character frequency dictionaries from `dataset_small_feat`, stored at index 3 of each tuple, and set them in a list;
2. Convert the list of dictionaries into an $\mathbf{X}$ matrix using `DictVectorizer`.
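
To see what `fit_transform()` produces, here is a small sketch on two toy dictionaries. The names `toy_dicts`, `toy_vectorizer`, and `X_toy` are illustrative and the frequencies are made up.

```python
from sklearn.feature_extraction import DictVectorizer

toy_dicts = [{'a': 0.5, 'b': 0.5}, {'a': 0.25, 'c': 0.75}]

toy_vectorizer = DictVectorizer(sparse=False)
X_toy = toy_vectorizer.fit_transform(toy_dicts)

# One column per key seen during fit(), in sorted order
print(toy_vectorizer.feature_names_)  # ['a', 'b', 'c']
print(X_toy)

# Keys unseen during fit() are silently ignored by transform()
print(toy_vectorizer.transform([{'a': 1.0, 'z': 1.0}]))
```

Each key becomes one column, and a key missing from a dictionary yields a zero in that row. This is why new texts must later be converted with the already fitted vectorizer rather than with a new one.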

#### Extracting the character frequencies

Produce a new list of datapoints with the unigrams only. Each item in this list will be a dictionary. You will call it `X_cat`

In [None]:
# Write your code here
X_cat = [row[3] for row in dataset_small_feat] 


In [None]:
X_cat[:2]

[{'i': 0.043478260869565216,
  ' ': 0.13043478260869565,
  'd': 0.043478260869565216,
  'o': 0.043478260869565216,
  'n': 0.08695652173913043,
  "'": 0.043478260869565216,
  't': 0.043478260869565216,
  's': 0.08695652173913043,
  'p': 0.08695652173913043,
  'e': 0.13043478260869565,
  'a': 0.13043478260869565,
  'k': 0.043478260869565216,
  'j': 0.043478260869565216,
  '.': 0.043478260869565216},
 {'i': 0.07142857142857142,
  ' ': 0.19642857142857142,
  'd': 0.03571428571428571,
  'o': 0.05357142857142857,
  'n': 0.07142857142857142,
  "'": 0.017857142857142856,
  't': 0.14285714285714285,
  'w': 0.017857142857142856,
  'a': 0.017857142857142856,
  's': 0.03571428571428571,
  'p': 0.017857142857142856,
  'e': 0.10714285714285714,
  'h': 0.017857142857142856,
  'r': 0.05357142857142857,
  'f': 0.03571428571428571,
  'm': 0.017857142857142856,
  'y': 0.017857142857142856,
  'l': 0.017857142857142856,
  'g': 0.03571428571428571,
  '.': 0.017857142857142856}]

#### Vectorize `X_cat`

Convert your `X_cat` list into a numerical representation using `DictVectorizer`: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html. Call the result `X`.

In [None]:
# Write your code here
from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse=False)
X = v.fit_transform(X_cat)

In [None]:
X[:5]

array([[0.13043478, 0.        , 0.        , 0.        , 0.        ,
        0.04347826, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.04347826, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.13043478, 0.        , 0.        ,
        0.04347826, 0.13043478, 0.        , 0.        , 0.        ,
        0.04347826, 0.04347826, 0.04347826, 0.        , 0.        ,
        0.08695652, 0.04347826, 0.08695652, 0.        , 0.        ,
        0.08695652, 0.04347826, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.  

## Programming: Building $\mathbf{y}$

You will now convert the list of language symbols into a $\mathbf{y}$ vector

Extract the language symbols from `dataset_small_feat` and call the resulting list `y_cat`

In [None]:
# Write your code here
y_cat = [row[1] for row in dataset_small_feat]

In [None]:
y_cat[:5]

['eng', 'eng', 'eng', 'eng', 'eng']

Extract the set of language symbols and name it `y_symbols`. Then build two indices mapping the symbols to integers and the integers to symbols. Both indices will be dictionaries that you will call `lang2idx` and `idx2lang`. Such a conversion is not necessary with sklearn. We do it because many other machine-learning toolkits (Keras or PyTorch) require a numerical $\mathbf{y}$ vector, and to learn how to carry out this conversion.

In [None]:
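# Write your code here
# Fix the symbol order once so that index i always refers to y_symbols[i]
y_symbols = list(set(y_cat))
lang2idx = {lang: i for i, lang in enumerate(y_symbols)}
idx2lang = {v: k for k, v in lang2idx.items()}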

In [None]:
idx2lang

{0: 'swe', 1: 'fra', 2: 'eng'}

In [None]:
lang2idx

{'swe': 0, 'fra': 1, 'eng': 2}

Convert your `y_cat` vector into a numerical vector. Call this vector `y`.

In [None]:
# Write your code here
y = [lang2idx[lang] for lang in y_cat]

In [None]:
y[:5]

[2, 2, 2, 2, 2]

## Programming: Building the Model

Create a neural network using sklearn with a hidden layer of 50 nodes and a relu activation layer: https://scikit-learn.org/stable/modules/neural_networks_supervised.html. Set the maximal number of iterations to 5, in the beginning, and verbose to True. Use the default values for the rest. You will call your classifier `clf`

In [None]:
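# Write your code here
from sklearn.neural_network import MLPClassifier
# relu is the default activation of MLPClassifier; max_iter is raised here from the initial 5 to 100
clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=100, verbose=True)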

In [None]:
clf

MLPClassifier(hidden_layer_sizes=(50,), max_iter=100, verbose=True)

### Training and Validation Sets

You will now split the dataset into training and validation sets.

#### We shuffle the indices

In [None]:
indices = list(range(X.shape[0]))
np.random.shuffle(indices)
print(indices[:10])
X = X[indices, :]
y = np.array(y)[indices]

[2156, 13157, 31555, 13501, 3653, 27203, 31426, 13153, 10444, 26982]


#### We split the dataset
We use a training set of 80% and a validation set of 20%

In [None]:
training_examples = int(X.shape[0] * 0.8)

X_train = X[:training_examples, :]
y_train = y[:training_examples]

X_val = X[training_examples:, :]
y_val = y[training_examples:]
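
The shuffle above relies on NumPy's global random state, so the split changes from run to run. If a reproducible split is wanted, one option is a seeded generator. The sketch below works on toy arrays; the seed value and the `_toy` names are arbitrary choices made for the illustration.

```python
import numpy as np

X_toy = np.arange(20).reshape(10, 2)    # 10 toy examples with 2 features each
y_toy = np.arange(10)                   # one toy label per example

rng = np.random.default_rng(seed=0)     # arbitrary seed, fixed for repeatability
perm = rng.permutation(X_toy.shape[0])  # shuffled row indices

n_train = int(X_toy.shape[0] * 0.8)     # same 80/20 ratio as above
train_idx, val_idx = perm[:n_train], perm[n_train:]

# Index X and y with the same permutation so that rows and labels stay aligned
X_toy_train, y_toy_train = X_toy[train_idx], y_toy[train_idx]
X_toy_val, y_toy_val = X_toy[val_idx], y_toy[val_idx]

print(X_toy_train.shape, X_toy_val.shape)  # (8, 2) (2, 2)
```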

### Fitting the model

Fit the model on the training set

In [None]:
# Write your code here
clf.fit(X_train, y_train)


Iteration 1, loss = 1.00398268
Iteration 2, loss = 0.76552054
Iteration 3, loss = 0.51252072
Iteration 4, loss = 0.36670322
Iteration 5, loss = 0.28836159
Iteration 6, loss = 0.24176861
Iteration 7, loss = 0.21130848
Iteration 8, loss = 0.18975510
Iteration 9, loss = 0.17397720
Iteration 10, loss = 0.16212691
Iteration 11, loss = 0.15293940
Iteration 12, loss = 0.14608317
Iteration 13, loss = 0.14065484
Iteration 14, loss = 0.13591432
Iteration 15, loss = 0.13221521
Iteration 16, loss = 0.12903083
Iteration 17, loss = 0.12644320
Iteration 18, loss = 0.12415150
Iteration 19, loss = 0.12232722
Iteration 20, loss = 0.12062701
Iteration 21, loss = 0.11911164
Iteration 22, loss = 0.11790251
Iteration 23, loss = 0.11651380
Iteration 24, loss = 0.11527138
Iteration 25, loss = 0.11448057
Iteration 26, loss = 0.11375886
Iteration 27, loss = 0.11286386
Iteration 28, loss = 0.11227028
Iteration 29, loss = 0.11149498
Iteration 30, loss = 0.11113302
Iteration 31, loss = 0.11046939
Iteration 32, los



MLPClassifier(hidden_layer_sizes=(50,), max_iter=100, verbose=True)

## Predicting

Predict the `X_val` languages. You will call the result `y_val_pred`

In [None]:
# Write your code here
y_val_pred = clf.predict(X_val)


In [None]:
y_val_pred[:20]

array([1, 1, 0, 0, 2, 0, 2, 0, 2, 2, 2, 1, 2, 2, 1, 2, 0, 2, 0, 1])

In [None]:
y_val[:20]

array([1, 1, 0, 0, 2, 0, 2, 0, 2, 1, 2, 1, 2, 2, 1, 2, 0, 2, 0, 1])

#### Evaluating

Use the `accuracy_score()` function to evaluate your model on the validation set

In [None]:
# evaluate the model
accuracy_score(y_val, y_val_pred)

0.9610621904237755

In [None]:
print(classification_report(y_val, y_val_pred, target_names=y_symbols))
print('Micro F1:', f1_score(y_val, y_val_pred, average='micro'))
print('Macro F1', f1_score(y_val, y_val_pred, average='macro'))

              precision    recall  f1-score   support

         swe       0.97      0.96      0.96      1974
         fra       0.95      0.96      0.96      2014
         eng       0.96      0.97      0.96      3280

    accuracy                           0.96      7268
   macro avg       0.96      0.96      0.96      7268
weighted avg       0.96      0.96      0.96      7268

Micro F1: 0.9610621904237755
Macro F1 0.9606558779048515


### Confusion Matrix

In [None]:
confusion_matrix(y_val, y_val_pred)

array([[1890,   17,   67],
       [  23, 1924,   67],
       [  35,   74, 3171]])
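
The scores reported above can be recomputed from this confusion matrix, where rows are the true classes and columns the predicted ones (the scikit-learn convention). The sketch below uses plain Python; the variable names are illustrative.

```python
cm = [[1890, 17, 67],
      [23, 1924, 67],
      [35, 74, 3171]]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))  # diagonal = correct predictions
accuracy = correct / total

f1_scores = []
for i in range(n_classes):
    predicted_i = sum(cm[r][i] for r in range(n_classes))  # column sum
    actual_i = sum(cm[i])                                   # row sum
    precision = cm[i][i] / predicted_i
    recall = cm[i][i] / actual_i
    f1_scores.append(2 * precision * recall / (precision + recall))

macro_f1 = sum(f1_scores) / n_classes
print(round(accuracy, 4), round(macro_f1, 4))  # 0.9611 0.9607
```

In single-label multiclass classification, the micro-averaged F1 equals the accuracy, which is why both are printed as the same number above.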

You may try to increase the number of iterations to improve the score. You may also try to change the parameters of the multilayer perceptron.

## Predict the language of a text

Now you will predict the languages of the strings below.

In [None]:
docs = ["Salut les gars !", "Hejsan grabbar!", "Hello guys!", "Hejsan tjejer!"]


Create feature vectors from this list. Call this matrix `X_test`.

In [None]:
# Write your code here
def featurize(string_list, dictvectorizer):
  return list(map(lambda x : x[0], map(dictvectorizer.transform ,map(count_chars, string_list))))

X_test = featurize(docs, v)

In [None]:
X_test

[array([0.1875, 0.0625, 0.    , 0.    , 0.    , 0.    , 0.    , 0.    ,
        0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    ,
        0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    ,
        0.    , 0.    , 0.    , 0.125 , 0.    , 0.    , 0.    , 0.0625,
        0.    , 0.0625, 0.    , 0.    , 0.    , 0.    , 0.125 , 0.    ,
        0.    , 0.    , 0.    , 0.    , 0.0625, 0.1875, 0.0625, 0.0625,
        0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    ,
        0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    ,
        0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    ,
        0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    ,
        0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    ,
        0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    , 0.    ]),
 array([0.06666667, 0.06666667, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        

And run the prediction that you will store in a variable called `pred_languages`

In [None]:
# Write your code here
pred_languages = [idx2lang[pred] for pred in clf.predict(X_test)]

In [None]:
pred_languages

['fra', 'swe', 'eng', 'swe']

## Building the Model with Keras
You will now recreate a Keras model with the same architecture as in sklearn.

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

(29068, 96)

### The Model
Create a model identical to the one you created with sklearn. Use the same activation function for the hidden layer and softmax in the last layer.

In [None]:
# Write your code here
model = keras.Sequential([keras.layers.Dense(50, input_shape = (X_train.shape[1], ), activation = 'relu'),
                          keras.layers.Dense(3, activation = 'softmax')
                          ]
                         )

In [None]:
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_5 (Dense)             (None, 50)                4850      
                                                                 
 dense_6 (Dense)             (None, 3)                 153       
                                                                 
Total params: 5,003
Trainable params: 5,003
Non-trainable params: 0
_________________________________________________________________
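
The parameter counts in the summary follow from the layer sizes: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. A quick check, taking the 96 input features from the shape printed above; `dense_params` is a hypothetical helper written for this illustration.

```python
def dense_params(n_in, n_out):
    """Number of trainable parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out


hidden = dense_params(96, 50)  # 96 character features -> 50 hidden units
output = dense_params(50, 3)   # 50 hidden units -> 3 languages
print(hidden, output, hidden + output)  # 4850 153 5003
```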


Compile your network with the loss, optimizer, and metrics arguments. As optimizer, use the same as in sklearn. See here: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

In [None]:
# Write your code here
optimizer = keras.optimizers.Adam(learning_rate = 1e-3)
loss = keras.losses.CategoricalCrossentropy()
model.compile(optimizer = optimizer,
              loss = loss,
              metrics=['accuracy'])

Convert the output categories to one-hot vectors. Call the result: `Y_train_cat`

In [None]:
from tensorflow.keras.utils import to_categorical

In [None]:
# Write your code here
Y_train_cat = to_categorical(
    y_train, num_classes=3, dtype='float32'
)
Y_val_cat = to_categorical(
    y_val, num_classes=3, dtype='float32'
)


In [None]:
Y_train_cat[:5]

array([[0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]], dtype=float32)
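
For intuition, one-hot encoding amounts to picking rows of an identity matrix. The NumPy-only sketch below uses a toy label vector, `y_toy`, chosen to reproduce the five rows printed above.

```python
import numpy as np

y_toy = np.array([2, 1, 0, 1, 2])  # toy class indices
num_classes = 3

# Row i of the identity matrix is the one-hot vector of class i
Y_toy_cat = np.eye(num_classes, dtype='float32')[y_toy]
print(Y_toy_cat)
```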

Fit your network on your training set. Use two epochs to start with.

In [None]:
# Write your code here
model.fit(X_train, Y_train_cat, epochs = 15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7f3d2fd962d0>

Evaluate your network on `X_val` and `y_val` with the `evaluate()` method

In [None]:
# Write your code here
model.evaluate(X_val, Y_val_cat)



[0.10313791036605835, 0.9620253443717957]

Predict the validation set: `X_val`. Call the result: `Y_val_pred_proba`

In [None]:
# Write your code here
Y_val_pred_proba = model.predict(X_val)

In [None]:
Y_val_pred_proba[:5]

array([[7.0445362e-06, 9.9999058e-01, 2.3804862e-06],
       [2.2156986e-05, 9.9962902e-01, 3.4874643e-04],
       [1.0000000e+00, 5.4049274e-19, 4.5157749e-21],
       [1.0000000e+00, 4.3803317e-09, 1.5293482e-10],
       [4.5808003e-04, 2.6849745e-02, 9.7269213e-01]], dtype=float32)

Extract the categories from the probabilities in `Y_val_pred_proba`. Use the `np.argmax()` function. Call the result `y_val_pred`. Check that the prediction corresponds to the real values.

In [None]:
# Write your code here
y_val_pred = [np.argmax(row) for row in Y_val_pred_proba]

In [None]:
y_val_pred[:20]

[1, 1, 0, 0, 2, 0, 2, 0, 2, 2, 2, 1, 2, 2, 1, 2, 0, 2, 0, 1]

In [None]:
y_val[:20]

array([1, 1, 0, 0, 2, 0, 2, 0, 2, 1, 2, 1, 2, 2, 1, 2, 0, 2, 0, 1])
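
`np.argmax()` also accepts an `axis` argument, which avoids the Python loop. Here is a sketch on the five probability rows printed earlier; note that it returns a NumPy array rather than a list.

```python
import numpy as np

proba = np.array([[7.0445362e-06, 9.9999058e-01, 2.3804862e-06],
                  [2.2156986e-05, 9.9962902e-01, 3.4874643e-04],
                  [1.0000000e+00, 5.4049274e-19, 4.5157749e-21],
                  [1.0000000e+00, 4.3803317e-09, 1.5293482e-10],
                  [4.5808003e-04, 2.6849745e-02, 9.7269213e-01]])

# axis=1: index of the largest probability within each row
classes = np.argmax(proba, axis=1)
print(classes)  # [1 1 0 0 2]
```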

Print the evaluation

In [None]:
print(classification_report(y_val, y_val_pred, target_names=y_symbols))
print('Micro F1:', f1_score(y_val, y_val_pred, average='micro'))
print('Macro F1', f1_score(y_val, y_val_pred, average='macro'))

              precision    recall  f1-score   support

         swe       0.97      0.96      0.96      1974
         fra       0.96      0.95      0.96      2014
         eng       0.96      0.97      0.96      3280

    accuracy                           0.96      7268
   macro avg       0.96      0.96      0.96      7268
weighted avg       0.96      0.96      0.96      7268

Micro F1: 0.9620253164556962
Macro F1 0.9616142202227483


Print the confusion matrix

In [None]:
confusion_matrix(y_val, y_val_pred)

array([[1898,   16,   60],
       [  23, 1919,   72],
       [  41,   64, 3175]])

In [None]:
X_test[0].shape

(96,)

Predict your languages with Keras. Reuse `X_test` and call the result `Y_test_pred_proba`.

In [None]:
# Write your code here
Y_test_pred_proba = model.predict(np.array(X_test))


In [None]:
Y_test_pred_proba

array([[4.4291654e-01, 5.3150606e-01, 2.5577381e-02],
       [9.9838591e-01, 1.4349347e-04, 1.4706150e-03],
       [2.0313782e-03, 2.8379817e-04, 9.9768484e-01],
       [9.8025954e-01, 1.9502448e-02, 2.3803693e-04]], dtype=float32)

From the probabilities, extract the predicted languages and map them to strings. Call the results `pred_languages_keras`.

In [None]:
# Write your code here
y_test_pred = [np.argmax(row) for row in Y_test_pred_proba]
pred_languages_keras = [idx2lang[pred] for pred in y_test_pred]

In [None]:
pred_languages_keras

['fra', 'swe', 'eng', 'swe']

## Submission

When you have written all the code and run all the cells, fill in your ID as well as the name of the notebook.

In [None]:
STIL_ID = ["ni5324ro-s", "si7660da-s"] # Write your stil ids as a list
CURRENT_NOTEBOOK_PATH = os.path.join(os.getcwd(), 
                                     "/content/drive/MyDrive/Colab Notebooks/3-language_detector_keras.ipynb") # Write the name of your notebook

In [None]:
os.getcwd()

'/content'

The submission code will send your answer. It consists of the predicted languages.

In [None]:
ANSWER = json.dumps({'pred_langs': pred_languages})
ANSWER

'{"pred_langs": ["fra", "swe", "eng", "swe"]}'

Now the moment of truth:
1. Save your notebook and
2. Run the cells below

In [None]:
SUBMISSION_NOTEBOOK_PATH = CURRENT_NOTEBOOK_PATH + ".submission.bz2"

In [None]:
ASSIGNMENT = 3
API_KEY = "f581ba347babfea0b8f2c74a3a6776a7"

# Copy and compress current notebook
with bz2.open(SUBMISSION_NOTEBOOK_PATH, mode="wb") as fout:
    with open(CURRENT_NOTEBOOK_PATH, "rb") as fin:
        fout.write(fin.read())

In [None]:
res = requests.post("https://vilde.cs.lth.se/edan20checker/submit", 
                    files={"notebook_file": open(SUBMISSION_NOTEBOOK_PATH, "rb")}, 
                    data={
                        "stil_id": STIL_ID,
                        "assignment": ASSIGNMENT,
                        "answer": ANSWER,
                        "api_key": API_KEY,
                    },
               verify=True)

# from IPython.display import display, JSON
res.json()

{'msg': None,
 'status': 'correct',
 'signature': '2ab7637dc4637b04eb5ab563daf2bbf0c6079687f19e3b939fde18651dc89e1b5c877c57517ff42343c615604c5d8fefcb34bc6dd2e53433678c09fc4b998b3b',
 'submission_id': 'c0236019-ff34-4b45-9514-8bdd49705012'}

## Turning in your assignment

Now you are done with the program. To complete this assignment, you will:
1. Write a short individual report on your program. Do not forget to:
   * Summarize CLD3 and outline its architecture
   * Identify the features used by CLD3
   * Describe your architecture and tell how it is different from CLD3
   * Include the feature matrix you computed manually
   * Outline the differences between sklearn and the neural network API you chose

Submit your report as well as your notebook (for archiving purposes) to Canvas: https://canvas.education.lu.se/. To write your report, you can either
1. Write your text directly in Canvas, or
2. Use LaTeX and Overleaf (www.overleaf.com). This will probably help you structure your text. You will then upload a PDF file to Canvas.

The submission deadline is September 30, 2022.

## Postscript from Pierre Nugues

I created this assignment from an examination I wrote in 2019 for the course on applied machine learning. I simplified it from the `README.md` on GitHub, https://github.com/google/cld3. I found the C++ code difficult to understand and I reimplemented a Keras/Tensorflow version of it from this `README`. Should you be interested, you can find it here: https://github.com/pnugues/language-detector.