## Demo based on the paper
## Unsupervised Machine Translation Using Monolingual Corpora Only

Main differences from the algorithm described in the paper:

1. Minor architectural differences (GRU instead of LSTM), a different optimizer, and a simpler discriminator model
2. No attention in the decoder
3. No noise is added to the autoencoder input

Download parallel sentences from Multi30K, already preprocessed.

The main preprocessing: words and punctuation marks are separated by spaces, and some special characters (such as '&') are handled.

In [1]:
#!wget https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/tok/train.lc.norm.tok.fr

In [2]:
#!wget https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/tok/train.lc.norm.tok.en

In [3]:
#!head train.lc.norm.tok.fr
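
Because the files are already tokenized, reading them comes down to splitting each line on whitespace. Below is a minimal sketch of that step; the helper names `read_tokenized` and `read_pairs` and the `max_lines` parameter are assumptions made for illustration, not part of the original notebook.

```python
def read_tokenized(path, max_lines=None):
    """Read a tokenized corpus: one sentence per line, tokens separated by spaces."""
    sentences = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if max_lines is not None and i >= max_lines:
                break
            # The text is pre-tokenized, so a plain whitespace split is enough.
            sentences.append(line.split())
    return sentences


def read_pairs(src_path, tgt_path, max_lines=None):
    """Pair up the sentences of two line-aligned files (hypothetical helper)."""
    src = read_tokenized(src_path, max_lines)
    tgt = read_tokenized(tgt_path, max_lines)
    if len(src) != len(tgt):
        raise ValueError("The two files are not line-aligned")
    return list(zip(src, tgt))
```

Once the two files are downloaded, a call such as `read_pairs('train.lc.norm.tok.en', 'train.lc.norm.tok.fr', max_lines=5)` would return the first five sentence pairs as token lists.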

Download the MUSE vectors, i.e. word vectors for different languages that are aligned so that the cosine distance between similar words in different languages is small.
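
To make the alignment property concrete, here is a small sketch of cosine similarity between word vectors. The three vectors are made-up toy values, not real MUSE embeddings, and the helper name `cosine_similarity` is an assumption for illustration.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine similarity; the cosine distance is 1 minus this value."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy 3-dimensional vectors (invented for this example).
cat_en = np.array([0.9, 0.1, 0.0])
chat_fr = np.array([0.8, 0.2, 0.1])
voiture_fr = np.array([0.0, 0.2, 0.9])

# In an aligned space, a translation pair should score higher
# than an unrelated pair.
print(cosine_similarity(cat_en, chat_fr) > cosine_similarity(cat_en, voiture_fr))
```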

In [4]:
#!wget https://dl.fbaipublicfiles.com/arrival/vectors/wiki.multi.en.vec#https://s3.amazonaws.com/arrival/embeddings/wiki.multi.en.vec

In [5]:
#!wget https://dl.fbaipublicfiles.com/arrival/vectors/wiki.multi.fr.vec#https://s3.amazonaws.com/arrival/embeddings/wiki.multi.fr.vec

Import the main libraries.

In [6]:
import io
import numpy as np

import unicodedata
import string
import re
import random
import codecs
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print (device)

cpu


Taken from the PyTorch Seq2Seq tutorial: remove everything except Latin letters and the punctuation marks `.`, `!` and `?`.

Diacritics (accent marks over letters) are stripped as well.

In [7]:
# Turn a Unicode string to plain ASCII, thanks to
# http://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters


def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s.strip()

In [8]:
normalizeString(u"bonjour  je suis élève de l'institut'")

'bonjour je suis eleve de l institut'

Function for loading MUSE vectors from a file.
The vector vocabularies contain many junk words (such as hashtags), so we use the following logic:

1. Preprocess the words, removing non-Latin characters
2. If a duplicate appears in the vocabulary after preprocessing:
        a. If the current word was not changed by preprocessing (i.e. it is most likely a "good" one), replace the vector stored for that word with the current one
        b. Otherwise skip it and move on
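
The rule above can be illustrated on a tiny in-memory vocabulary. This is a simplified sketch: `normalize` repeats the notebook's normalization so the block runs on its own, `dedup` is a hypothetical helper, and plain integers stand in for the vectors.

```python
import re
import unicodedata


def normalize(word):
    # Same steps as the notebook's normalization, repeated here so the
    # sketch is self-contained.
    word = ''.join(
        c for c in unicodedata.normalize('NFD', word.lower().strip())
        if unicodedata.category(c) != 'Mn'
    )
    word = re.sub(r"([.!?])", r" \1", word)
    word = re.sub(r"[^a-zA-Z.!?]+", " ", word)
    return word.strip()


def dedup(entries):
    """entries: iterable of (raw_word, value); returns {normalized_word: value}."""
    vocab = {}
    for raw, value in entries:
        word = normalize(raw)
        if word in vocab and raw != word:
            continue  # altered variant of a known word: keep the existing entry
        vocab[word] = value  # new word, or an unchanged spelling overriding a variant
    return vocab


# Integers stand in for vectors in this toy example.
toy = [("#this", 1), ("this", 2), ("debut", 3), ("début", 4)]
print(dedup(toy))  # {'this': 2, 'debut': 3}
```

Here `this` overrides the entry created by `#this`, while `début` is skipped because `debut` is already present and `début` was changed by normalization.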

In [9]:
def load_vec(emb_path):
    """Load MUSE vectors and return (embeddings, id2word, word2id)."""
    vectors = []
    word2id = {}
    with io.open(emb_path, 'r', encoding='utf-8', newline='\n', errors='ignore') as f:
        next(f)  # skip the header line (vocabulary size and dimension)
        for line in f:
            orig_word, vect = line.rstrip().split(' ', 1)

            word = normalizeString(orig_word)
            vect = np.fromstring(vect, sep=' ')
            if word in word2id:
                print(u'word found twice: {0} ({1})'.format(word, orig_word))
                if orig_word == word:
                    # The word was not changed by normalization, so prefer its vector
                    vectors[word2id[word]] = vect
                    print('rewriting')
                continue
            vectors.append(vect)
            word2id[word] = len(word2id)

    id2word = {v: k for k, v in word2id.items()}
    embeddings = np.vstack(vectors)
    return embeddings, id2word, word2id
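
`load_vec` returns a tuple `(embeddings, id2word, word2id)`. As a rough sketch of how two such tuples might be used together, the function below looks up the nearest target-language words for a source word by cosine similarity. The name `nearest_words`, the parameter `k`, and the toy tuples are assumptions for illustration; they are not part of the original notebook.

```python
import numpy as np


def nearest_words(word, src_tuple, tgt_tuple, k=3):
    """Return the k target words closest (by cosine similarity) to a source word."""
    src_emb, _, src_word2id = src_tuple
    tgt_emb, tgt_id2word, _ = tgt_tuple
    query = src_emb[src_word2id[word]]
    # Cosine similarity between the query and every target vector.
    tgt_norms = np.linalg.norm(tgt_emb, axis=1)
    scores = tgt_emb @ query / (tgt_norms * np.linalg.norm(query) + 1e-8)
    best = np.argsort(-scores)[:k]
    return [(tgt_id2word[int(i)], float(scores[i])) for i in best]


# Toy tuples in the same (embeddings, id2word, word2id) layout; values are invented.
en_toy = (np.array([[1.0, 0.0], [0.0, 1.0]]),
          {0: 'cat', 1: 'car'}, {'cat': 0, 'car': 1})
fr_toy = (np.array([[0.9, 0.1], [0.1, 0.9]]),
          {0: 'chat', 1: 'voiture'}, {'chat': 0, 'voiture': 1})

print(nearest_words('cat', en_toy, fr_toy, k=1)[0][0])  # chat
```

With the real data, the same call would take the tuples produced by `load_vec` for the English and French files.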

In [10]:
en_embedding_tuple = load_vec('./wiki.multi.en.vec')

word found twice:  (-)
word found twice:  (')
word found twice:  ())
word found twice:  (()
word found twice: s (s)
rewriting
word found twice:  (–)
word found twice:  (#)
word found twice:  (%)
word found twice:  (/)
word found twice:  (")
word found twice:  (—)
word found twice:  (}})
word found twice:  (})
word found twice:  ($)
word found twice:  (>)
word found twice:  (+)
word found twice:  (&)
word found twice: www (//www)
word found twice:  (•)
word found twice: p (>p)
word found twice:  (·)
word found twice: en (en)
rewriting
word found twice:  (→)
word found twice:  (%/)
word found twice: com (com/)
word found twice: u (>u)
word found twice:  (£)
word found twice:  (×)
word found twice:  (}}}})
word found twice: d (#d)
word found twice: present (–present)
word found twice:  (}}})
word found twice: or (or%)
word found twice:  (//)
word found twice:  (~)
word found twice:  (°)
word found twice:  (\)
word found twice:  (_)
word found twice:  (⚡)
word found twice: title (_title)
w

word found twice:  (/–)
word found twice: andres (andres)
rewriting
word found twice: debut (début)
word found twice:  (`)
word found twice: link (_link)
word found twice: sao (sao)
rewriting
word found twice: fcc (#fcc)
word found twice:  (►)
word found twice:  (///////)
word found twice:  (¢)
word found twice:  (θ)
word found twice: this (#this)
word found twice: tome (tome)
rewriting
word found twice: etienne (etienne)
rewriting
word found twice: th (th/)
word found twice: b (b+)
word found twice: t (—❤t☮☺☯)
word found twice: b (\b)
word found twice: zu (/zu/)
word found twice: a (/a)
word found twice: data (//data)
word found twice: gfdl (gfdl}})
word found twice: pi (\pi)
word found twice: gamma (\gamma)
word found twice: a (a$)
word found twice:  (τ)
word found twice: th (th–)
word found twice:  (♫)
word found twice: free (_free)
word found twice: film (_film)
word found twice: secure (//secure)
word found twice: college (collège)
word found twice: pyrenees (pyrenees)
rewriting
w

word found twice: blogspot (blogspot\)
word found twice: dona (dona)
rewriting
word found twice: esp (esp}})
word found twice: g (g_)
word found twice: hk (hk$)
word found twice: h (h_)
word found twice: talk (talk%)
word found twice: a (‘a)
word found twice: parent (parent}})
word found twice: german (germán)
word found twice: cite (cité)
word found twice: pa (pā)
word found twice:  (↔)
word found twice:  (ø)
word found twice: marin (marín)
word found twice: cordoba (cordoba)
rewriting
word found twice: south (south%)
word found twice: country (_country)
word found twice: chisinau (chişinău)
word found twice:  (⚔)
word found twice:  (雲)
word found twice: n (n_)
word found twice: tn (/tn/)
word found twice: wikipedia (}}wikipedia)
word found twice: cote (cote)
rewriting
word found twice: zero (_zero)
word found twice: c (c+)
word found twice: gunter (gunter)
rewriting
word found twice: and (—and)
word found twice: city (city_)
word found twice: in (#in)
word found twice: turk (türk)
wo

word found twice: in (īn)
word found twice: list (#list)
word found twice: vladimir (vladimír)
word found twice: and (,and)
word found twice: guarani (guaraní)
word found twice:  (л)
word found twice: if (}#if)
word found twice: uc (đức)
word found twice: ar (är)
word found twice: schaffer (schaffer)
rewriting
word found twice: miro (miro)
rewriting
word found twice: k (k–)
word found twice: article (@article)
word found twice: c (c^)
word found twice: th (th%)
word found twice: summer (summer%)
word found twice: william (#william)
word found twice: women (women%)
word found twice:  (‧)
word found twice: section (&section)
word found twice:  (ο)
word found twice:  (}}}}}}}}}})
word found twice: table (table}})
word found twice:  (русский)
word found twice: bahia (bahía)
word found twice: lim (\lim_)
word found twice: road (road%)
word found twice:  (✈)
word found twice: ibanez (ibanez)
rewriting
word found twice: epsilon (\epsilon)
word found twice:  (%–)
word found twice: aaaaaa (#aaa

word found twice: proposed (/proposed)
word found twice: yucatan (yucatan)
rewriting
word found twice: marques (marqués)
word found twice:  (\})
word found twice:  (српски)
word found twice: saito (saitō)
word found twice: arg (arg}})
word found twice:  (е)
word found twice: mi (/mi²)
word found twice:  (ξ)
word found twice: pre (pré)
word found twice: cedric (cédric)
word found twice: m (m+)
word found twice:  (н)
word found twice: valerie (valérie)
word found twice: michael (michaël)
word found twice: rincon (rincon)
rewriting
word found twice: romania (românia)
word found twice: z (z^)
word found twice:  (%}})
word found twice: xuan (xuân)
word found twice: chah (chāh)
word found twice: talk (tälk)
word found twice: wedge (\wedge)
word found twice: mex (mex}})
word found twice:  (ζ)
word found twice:  (за)
word found twice: daimyo (daimyō)
word found twice: brasov (braşov)
word found twice: ned (ned}})
word found twice: szabo (szabo)
rewriting
word found twice: que (qué)
word found 

word found twice: sinead (sinead)
rewriting
word found twice: fein (fein)
rewriting
word found twice: size (_size)
word found twice: league (league%)
word found twice: bel (bel}})
word found twice: q (q^)
word found twice: tres (très)
word found twice: emilie (émilie)
word found twice: ir (ir/)
word found twice: new (#new)
word found twice: was (—was)
word found twice: sk (šk)
word found twice:  (ˉˉ╦╩)
word found twice: fabian (fabián)
word found twice:  (✓)
word found twice: valentin (valentín)
word found twice: ftp (//ftp)
word found twice: ngati (ngati)
rewriting
word found twice: dauphine (dauphine)
rewriting
word found twice: mediawiki (//mediawiki)
word found twice: cz (cz/)
word found twice: said (saïd)
word found twice: tms (//tms)
word found twice: s (#s)
word found twice: i (i+)
word found twice:  (❝)
word found twice: chi (chí)
word found twice:  (}}}}}}}}})
word found twice: decor (décor)
word found twice: ain (aïn)
word found twice: joseph (#joseph)
word found twice: j (j_

word found twice: valles (vallès)
word found twice: party (_party)
word found twice: lang (&lang)
word found twice: u (u/)
word found twice: pec (peć)
word found twice: bien (biên)
word found twice: potosi (potosi)
rewriting
word found twice: c (\c)
word found twice: source (&source)
word found twice: forster (förster)
word found twice:  (γλŀќ)
word found twice: operator (_operator)
word found twice:  (心)
word found twice: rafa (rafał)
word found twice: universitat (universitat)
rewriting
word found twice:  (+%)
word found twice: b (b–)
word found twice: ffa (#ffa)
word found twice: fjournal (,fjournal)
word found twice: ta (tá)
word found twice: duong (duong)
rewriting
word found twice:  (£}})
word found twice: ake (ake)
rewriting
word found twice: angelica (angélica)
word found twice: top (top}})
word found twice: leader (_leader)
word found twice: chi (\chi)
word found twice:  (каталог)
word found twice:  (龸)
word found twice: january february (january/february)
word found twice: fr

word found twice:  (せ/食)
word found twice: interchange (interchange}})
word found twice: neamt (neamţ)
word found twice: line (line}})
word found twice: nor (nor}})
word found twice: cm (cm−)
word found twice: e (e/)
word found twice: bermudez (bermudez)
rewriting
word found twice: koichi (kōichi)
word found twice: halle (hallé)
word found twice: a (_a)
word found twice: cd (×cd)
word found twice: fpo (fpö)
word found twice: replaced (→replaced)
word found twice: andalucia (andalucia)
rewriting
word found twice: nd (nd/)
word found twice: x (\x)
word found twice: m (m,)
word found twice: michael (michael%)
word found twice: routes (_routes)
word found twice: black (#black)
word found twice: loc (lộc)
word found twice: shiro (shirō)
word found twice: because (because…)
word found twice:  (└)
word found twice: plon (plon)
rewriting
word found twice: california (california}})
word found twice: tan (\tan)
word found twice: n (n/)
word found twice: a b (a+b)
word found twice:  (┐)
word foun

word found twice: ps (ps/)
word found twice:  (произведений)
word found twice: valerian (valérian)
word found twice: portals (portals}})
word found twice: taiyo (taiyo)
rewriting
word found twice: bartok (bartok)
rewriting
word found twice:  (ɣ)
word found twice: review (review/)
word found twice: henry (henry%)
word found twice: galan (galan)
rewriting
word found twice: c d (c%d)
word found twice: nuno (nuño)
word found twice: hohe (höhe)
word found twice: galvez (galvez)
rewriting
word found twice: b w (b&w)
word found twice: noi (nội)
word found twice: lander (länder)
word found twice:  (من)
word found twice:  (∅)
word found twice:  (россии)
word found twice: e (_e)
word found twice:  (район)
word found twice: espanola (espanola)
rewriting
word found twice:  (–/–)
word found twice: tres (três)
word found twice:  (ст)
word found twice: team (team%)
word found twice:  (と/戸)
word found twice: dx ($dx,)
word found twice:  (µ)
word found twice: creation (création)
word found twice: micha

word found twice:  (الله)
word found twice: reaction (reaction•)
word found twice: alt (⣀alt)
word found twice: canadian (canadian%)
word found twice: with (—with)
word found twice: first (#first)
word found twice: bagh (bāgh)
word found twice: rd (rd/)
word found twice: autodromo (autodromo)
rewriting
word found twice: jurgens (jürgens)
word found twice:  (や/疒)
word found twice: mathbf (\mathbf})
word found twice: winter (winter%)
word found twice: kola (kolā)
word found twice: alcantara (alcântara)
word found twice: alt (⢥alt)
word found twice: ngai (ngāi)
word found twice: mil (mil/)
word found twice: relative (\relative)
word found twice: august (–august)
word found twice: ice (ice%)
word found twice: home (//home)
word found twice: endo (endō)
word found twice: espana (espana)
rewriting
word found twice:  (ツ)
word found twice: n (/n/)
word found twice: mary (#mary)
word found twice: la (lá)
word found twice: anibal (anibal)
rewriting
word found twice: games (games%)
word found twi

word found twice: espiritu (espíritu)
word found twice: tau (\tau_)
word found twice: venus (vénus)
word found twice: contact (contact/)
word found twice:  (τα)
word found twice: ampere (ampère)
word found twice: i (i,)
word found twice: basic (bašić)
word found twice: andrew (andrew%)
word found twice: comics (comics%)
word found twice: ukr (ukr}})
word found twice: public (public%)
word found twice: tau (tàu)
word found twice: vac (vác)
word found twice:  (энциклопедия)
word found twice: inch (inch/)
word found twice: mesic (mesić)
word found twice: fc (#fc)
word found twice:  (た/⽥)
word found twice: daniele (danièle)
word found twice: left ($left)
word found twice: x y (x+y)
word found twice: pagina (pagina)
rewriting
word found twice: numero (numéro)
word found twice: talk (↑talk↓)
word found twice: a e (a%e)
word found twice:  (~$)
word found twice: w (w^)
word found twice: morne (morné)
word found twice: alt (_alt)
word found twice:  (______)
word found twice: dat (&dat)
word fou

word found twice: imperiale (impériale)
word found twice: bruckner (brückner)
word found twice:  (‖)
word found twice: none (none}})
word found twice:  (#####)
word found twice: esme (esmé)
word found twice: cliched (cliched)
rewriting
word found twice:  (す/発)
word found twice: start (start+)
word found twice: min (_min)
word found twice: water (water%)
word found twice: paul (paul%)
word found twice: qat (qat}})
word found twice: ii (ii%)
word found twice: boll (böll)
word found twice: video (video%)
word found twice: summer (summer}})
word found twice:  (さ/阝)
word found twice: que (¿qué)
word found twice: saldana (saldaña)
word found twice: hockey (hockey%)
word found twice: via (vía)
word found twice: belgium (belgium}})
word found twice: manner (männer)
word found twice: a (~a)
word found twice: abraham (#abraham)
word found twice:  (も/門)
word found twice: trong (trọng)
word found twice: kandi (kandī)
word found twice: marian (marián)
word found twice: if (}}}#if)
word found twice:

word found twice: clemence (clémence)
word found twice: pieta (pieta)
rewriting
word found twice: veronique (veronique)
rewriting
word found twice: bras (brás)
word found twice: aeronautica (aeronáutica)
word found twice: intro (/intro)
word found twice: ragam (ragam)
rewriting
word found twice: archive (_archive)
word found twice: heros (héros)
word found twice: thank (}}thank)
word found twice: paco (paço)
word found twice:  (εν)
word found twice: review (/review/)
word found twice: s (\s)
word found twice: loi (lợi)
word found twice:  (ゆ/彳)
word found twice: salo (salò)
word found twice: am (am–)
word found twice: gaudi (gaudi)
rewriting
word found twice: what (‘what)
word found twice: vasselin (vasselin}})
word found twice: alla (allá)
word found twice: little (#little)
word found twice: chenier (chenier)
rewriting
word found twice: al (al%)
word found twice: b (+b)
word found twice:  (⊂)
word found twice: economica (economica)
rewriting
word found twice: see (#see)
word found twic

word found twice: ye (yé)
word found twice: cerny (cerny)
rewriting
word found twice: frederick (#frederick)
word found twice: we (‘we)
word found twice: si (sì)
word found twice: blog (/blog)
word found twice: m s (m/s²)
word found twice: o (/o)
word found twice: idee (idee)
rewriting
word found twice: zarate (zarate)
rewriting
word found twice:  ( )
word found twice: u (/u/)
word found twice: engstrom (engström)
word found twice: television (television%)
word found twice: p (&p)
word found twice: railroad (railroad,)
word found twice: alfred (alfréd)
word found twice: shojo (shojo)
rewriting
word found twice: asis (asís)
word found twice: florida (florida%)
word found twice: attr (þáttr)
word found twice: thanh (thánh)
word found twice: timesofindia (//timesofindia)
word found twice: pires (pirès)
word found twice: the (thế)
word found twice:  (>})
word found twice: virginia (virginia%)
word found twice: time (time,)
word found twice: octagon (octagón)
word found twice: z (/z)
word f

word found twice: greek (greek%)
word found twice: i (+i)
word found twice: ps (/ps)
word found twice: ltd (,ltd)
word found twice: other (#other)
word found twice: sanchez (sánchez}})
word found twice: televisions (télévisions)
word found twice: crni (crni)
rewriting
word found twice: tri (trí)
word found twice: about (/about)
word found twice: bayamon (bayamon)
rewriting
word found twice: phi (\phi^)
word found twice: en (_en)
word found twice:  (βǃʘʘɱ)
word found twice: eta (età)
word found twice:  (█)
word found twice:  (ほ/方)
word found twice: sorry (#sorry)
word found twice: soviet (soviet%)
word found twice: blue (#blue)
word found twice: av (£€åv€)
word found twice: gene (gené)
word found twice: adria (adrià)
word found twice: africa (africa}})
word found twice:  (استان‌های)
word found twice: strom (ström)
word found twice: white (white}})
word found twice: giovanni (#giovanni)
word found twice: jack (#jack)
word found twice: arcangel (arcangel)
rewriting
word found twice: luisa

word found twice: ai (đài)
word found twice: rubi (rubí)
word found twice: are (#are)
word found twice: line (line%)
word found twice: africa (africa%)
word found twice: jorn (jörn)
word found twice: jian (jiān)
word found twice: ecija (écija)
word found twice:  (_^)
word found twice: kleber (kleber)
rewriting
word found twice: isaias (isaías)
word found twice: okami (okami)
rewriting
word found twice: frac (\frac+)
word found twice: backman (bäckman)
word found twice: dorothee (dorothee)
rewriting
word found twice: he (—he)
word found twice: ferre (ferre)
rewriting
word found twice: civ (civ}})
word found twice: naga (nāga)
word found twice: ministry (ministry%)
word found twice: abd (`abd)
word found twice: soderberg (soderberg)
rewriting
word found twice: garces (garces)
rewriting
word found twice: kristjan (kristján)
word found twice: elections (élections)
word found twice: sourceforge (//sourceforge)
word found twice: related (related%)
word found twice: mathcal (\mathcal^)
word f

word found twice: chi (\chi_)
word found twice: hellstrom (hellström)
word found twice: heritage (héritage)
word found twice:  (ˀ)
word found twice:  (اطلس)
word found twice: archives (//archives)
word found twice: llyn (llŷn)
word found twice: phabricator (phabricator)
rewriting
word found twice: unita (unità)
word found twice: bc (bc%)
word found twice: preludes (préludes)
word found twice:  (～)
word found twice: to (to/)
word found twice: thuan (thuan)
rewriting
word found twice: berenger (bérenger)
word found twice: female (female%)
word found twice: bui (bùi)
word found twice: idn (idn}})
word found twice: review (/review)
word found twice:  (⠾)
word found twice: dst (dst+)
word found twice: g (_g)
word found twice: va (và)
word found twice: com photo (com/photo/)
word found twice:  (ʏɑɴ)
word found twice: hdl (//hdl)
word found twice: reel (réel)
word found twice: assur (aššur)
word found twice: arm (ärm)
word found twice: njg (/njg)
word found twice: ruaidhri (ruaidhri)
rewritin

word found twice: megane (megane)
rewriting
word found twice: c (c−)
word found twice: ksi (ksí)
word found twice: masse (massé)
word found twice: vision (visión)
word found twice: goncalves (goncalves)
rewriting
word found twice: kan (kạn)
word found twice: martiniere (martinière)
word found twice: thuy (thúy)
word found twice: mg (mg/)
word found twice: two (#two)
word found twice: unassigned (>unassigned)
word found twice: rugby (rugby%)
word found twice: n (—n)
word found twice: i (i}})
word found twice: creche (creche)
rewriting
word found twice: singer (singer%)
word found twice: inh (đính)
word found twice: berard (berard)
rewriting
word found twice: politecnico (politécnico)
word found twice: l (&l)
word found twice: kevin (kévin)
word found twice: phu (phủ)
word found twice: northwest (/northwest)
word found twice: mirza (mírzá)
word found twice: ang (ang}})
word found twice: chile (chile}})
word found twice: bid (bīd)
word found twice: cl (cl/)
word found twice: wikipedia (wi

word found twice: reviews (reviews/)
word found twice: stubs (stubs,)
word found twice: v (v%)
word found twice: players (_players)
word found twice: tecnologia (tecnología)
word found twice:  (⢞)
word found twice: muzeum (múzeum)
word found twice: health (health%)
word found twice: chris (#chris)
word found twice: templates (templates‎)
word found twice: opinion (opinión)
word found twice: galaxy (/galaxy_)
word found twice: pelaez (pelaez)
rewriting
word found twice: bon (bön)
word found twice: nb (nb}})
word found twice: wayback (//wayback)
word found twice: legere (legere)
rewriting
word found twice:  (российской)
word found twice:  (московского)
word found twice: gbif (gbif\)
word found twice: musa (mūsā)
word found twice: grabow (grabów)
word found twice: kohei (kōhei)
word found twice: akerman (åkerman)
word found twice: temeraire (téméraire)
word found twice: the („the)
word found twice: feher (feher)
rewriting
word found twice: compere (compère)
word found twice: theme (thème)

word found twice: cot (\cot)
word found twice: pattee (pattée)
word found twice: sarat (sărat)
word found twice: fuso (fusō)
word found twice: unio (unió)
word found twice: cassio (cássio)
word found twice:  (ш)
word found twice: bolero (boléro)
word found twice: sqrt (\sqrt}\)
word found twice: maturin (maturín)
word found twice: implies (\implies)
word found twice:  (гг)
word found twice: lowenthal (löwenthal)
word found twice: blucher (blucher)
rewriting
word found twice: dawn (dawn%)
word found twice: hungary (hungary}})
word found twice: n (n>)
word found twice: e (ë)
word found twice: dantes (dantès)
word found twice:  (：)
word found twice: tien (tiền)
word found twice: sar (sarı)
word found twice: kalakaua (kalakaua)
rewriting
word found twice: real (real%)
word found twice: kampfgeschwader (/kampfgeschwader)
word found twice: my (‘my)
word found twice: uruguay (uruguay}})
word found twice: cfc (#cfc}})
word found twice: buchi (buchi)
rewriting
word found twice: eg (ég)
word fou

word found twice: societa (societa)
rewriting
word found twice:  (は/辶)
word found twice: canada (canada}})
word found twice: as (,as)
word found twice: pr (/pr)
word found twice: race (race%)
word found twice: trapeang (trâpeang)
word found twice: michel (micheľ)
word found twice: bibliotheque (bibliotheque)
rewriting
word found twice: peron (péron)
word found twice: en (#en)
word found twice: petersburg (petersburg}})
word found twice: harry (#harry)
word found twice: jim (jim%)
word found twice: t (t++)
word found twice: show (show%)
word found twice: oishi (ōishi)
word found twice: balan (bălan)
word found twice: bfb (bfb)
rewriting
word found twice: century (century–)
word found twice: wurzburg (wurzburg)
rewriting
word found twice: ba (bá)
word found twice: goal (#goal)
word found twice:  (από)
word found twice: alt (⣓alt)
word found twice: everyone (everyone,)
word found twice: sang (sång)
word found twice: acre (/acre)
word found twice: jam (jam}})
word found twice:  (при)
word 

word found twice: cai (cái)
word found twice: prince (prince%)
word found twice: best (best,)
word found twice: basak (basak)
rewriting
word found twice: planche (planché)
word found twice: elizabeth (elizabeth%)
word found twice: energia (energía)
word found twice: vah (váh)
word found twice: li (#li)
word found twice: chugoku (chugoku)
rewriting
word found twice:  (τῆς)
word found twice: cecilia (cecília)
word found twice: days (days}})
word found twice:  (｜)
word found twice: yugoslavia (yugoslavia}})
word found twice: cid (&cid)
word found twice:  (кирилла)
word found twice: wikipedia (/wikipedia)
word found twice: music (/music)
word found twice: kobe (kōbe)
word found twice:  (α,β)
word found twice: navarro (navarro}})
word found twice: mai (//mai)
word found twice: interview (interview/)
word found twice: life (#life)
word found twice: sara (sarà)
word found twice: shozo (shōzō)
word found twice: beja (béja)
word found twice: astor (ástor)
word found twice: oan (oan)
rewriting
w

word found twice: team (team,)
word found twice: fraser (//fraser)
word found twice: sion (siôn)
word found twice: nar (når)
word found twice: yahoo (@yahoo)
word found twice: alt (⣧alt)
word found twice: victoria (victoria}})
word found twice: resita (reşiţa)
word found twice: ban (bản)
word found twice: encarta (//encarta)
word found twice: green (#green)
word found twice: the (thé)
word found twice: eden (edén)
word found twice: gedeon (gédéon)
word found twice: regards (regards—)
word found twice: al (/al)
word found twice: camera (caméra)
word found twice: pages (pagès)
word found twice: thome (thomé)
word found twice: three (#three)
word found twice: martha (märtha)
word found twice: federacion (federacion)
rewriting
word found twice: napoleon (napoleón)
word found twice: hwe (hwe)
rewriting
word found twice:  (без)
word found twice: milan (milán)
word found twice: hofer (höfer)
word found twice: viata (viaţa)
word found twice: both (—both)
word found twice: air (aïr)
word found 

word found twice: chateauguay (chateauguay)
rewriting
word found twice: cintron (cintron)
rewriting
word found twice: dc (#dc)
word found twice: lutfi (lütfi)
word found twice:  (ישראל)
word found twice: q (q\)
word found twice: merode (mérode)
word found twice: cespedes (cespedes)
rewriting
word found twice: bayern (_bayern)
word found twice: disney (disney%)
word found twice: noemi (noemí)
word found twice: bol (bol}})
word found twice: titan (titán)
word found twice: tro (tro%)
word found twice: kentaro (kentarō)
word found twice: merite (merite)
rewriting
word found twice: wikiproject (#wikiproject)
word found twice: number (,number)
word found twice: lopez (lópez}})
word found twice: noc (noć)
word found twice: ii (ii}})
word found twice: niceville (niceville)
rewriting
word found twice: alt (⣇alt)
word found twice: khe (khê)
word found twice: cha (chá)
word found twice: lord (#lord)
word found twice: jau (jaú)
word found twice: aa (#aa)
word found twice: kovacevic (kovacevic)
rew

word found twice: blase (blase)
rewriting
word found twice: yoichi (yōichi)
word found twice: barron (barrón)
word found twice: garzon (garzon)
rewriting
word found twice: la (lā)
word found twice: k (k−)
word found twice:  (河東)
word found twice: wisniewski (wiśniewski)
word found twice: souffle (soufflé)
word found twice: proces (proces)
rewriting
word found twice: bunkyo (bunkyō)
word found twice:  (/})
word found twice: dede (dedé)
word found twice: silvio (sílvio)
word found twice: building (building%)
word found twice: ca (cá)
word found twice: yang (yáng)
word found twice:  (⣪)
word found twice: with (,with)
word found twice: r (r\)
word found twice: atitlan (atitlán)
word found twice: maki (mäki)
word found twice: image (image_)
word found twice: hoss (höss)
word found twice: sirene (sirène)
word found twice: fernan (fernan)
rewriting
word found twice: gang (gång)
word found twice: frontiere (frontière)
word found twice: map (map+)
word found twice: kyosuke (kyōsuke)
word found 

word found twice: territories (territories%)
word found twice: szymanski (szymański)
word found twice: yi (yì)
word found twice: huang (huáng)
word found twice: conferences (conférences)
word found twice: lwn (//lwn)
word found twice: sf (_sf)
word found twice: avc (avcı)
word found twice: tone (toné)
word found twice: in (_in)
word found twice: sum n (\sum_n)
word found twice:  (╫)
word found twice: age (age%)
word found twice: foreningen (foreningen)
rewriting
word found twice: jimmy (jimmy%)
word found twice: alan (alan%)
word found twice: munter (münter)
word found twice: white (white,)
word found twice: edit (#edit)
word found twice: quintin (quintín)
word found twice: creek (creek%)
word found twice: org about (org/about/)
word found twice: seances (seances)
rewriting
word found twice: palmares (palmarès)
word found twice: hang (hàng)
word found twice: r (r%)
word found twice: ace (#ace)
word found twice:  (войны)
word found twice: aetius (aëtius)
word found twice: kastel (kaštel

word found twice: dos (dos/)
word found twice: asatru (asatru)
rewriting
word found twice:  ()
word found twice: meroe (meroë)
word found twice: fric (fric)
rewriting
word found twice: baena (baena}})
word found twice: nimes (nimes)
rewriting
word found twice: cuauhtemoc (cuauhtemoc)
rewriting
word found twice: zp (žp)
word found twice: estee (estee)
rewriting
word found twice: name (name%)
word found twice: democratico (democratico)
rewriting
word found twice: nabla (nabla)
rewriting
word found twice: esteve (estève)
word found twice: assembly (assembly%)
word found twice:  (♯)
word found twice: la (‘la)
word found twice: vukovic (vukovic)
rewriting
word found twice: v (və)
word found twice: grand (#grand)
word found twice: makhachkala (makhachkala}})
word found twice: m m (m/m)
word found twice: theater (&theater)
word found twice: abi (abi%)
word found twice: cacic (čačić)
word found twice: h (δh)
word found twice:  (остров)
word found twice: tunisia (tunisia}})
word found twice:  

word found twice: park (park/)
word found twice: sea (sea%)
word found twice: seppala (seppälä)
word found twice: pomte (pomte)
rewriting
word found twice: queensland (queensland%)
word found twice: pera (pêra)
word found twice: lake (lake}})
word found twice: kien (kiến)
word found twice:  (меня)
word found twice: companies (companies%)
word found twice: com world (com/world/)
word found twice: mari (marí)
word found twice: cerebral (cerebral%)
word found twice: leonce (leonce)
rewriting
word found twice: z (+z)
word found twice: shir (shīr)
word found twice: dubs (dübs)
word found twice: rosler (rosler)
rewriting
word found twice:  (­)
word found twice: r (r_^)
word found twice: four (four%)
word found twice: beta (\beta^)
word found twice:  (/}}})
word found twice: puerto (puerto%)
word found twice: openlibrary (openlibrary)
rewriting
word found twice: arabia (arabia}})
word found twice: sanger (sånger)
word found twice: f (–f)
word found twice:  (☮ღ☺)
word found twice: papers (/pap

word found twice: suma (šuma)
word found twice: hms (#hms)
word found twice: ibrahim (ibrāhīm)
word found twice: sr ($sr)
word found twice: maya (māyā)
word found twice: cf (cf}})
word found twice: mahmudabad (maḩmūdābād)
word found twice:  (علی)
word found twice: charter (charter%)
word found twice: base (base%)
word found twice: alt (⢀alt)
word found twice: mole (molé)
word found twice: yuta (yūta)
word found twice: detail (detail/)
word found twice: jauregui (jauregui)
rewriting
word found twice: israelite (israélite)
word found twice: paramo (paramo)
rewriting
word found twice: ukic (ukić)
word found twice: has (—has)
word found twice: hopital (hopital)
rewriting
word found twice: huskies (huskies%)
word found twice: kosei (kōsei)
word found twice: be (\be)
word found twice:  (⢌)
word found twice: i (•i)
word found twice: miles (miles/)
word found twice:  (ל)
word found twice:  (связей)
word found twice:  (\^)
word found twice: reves (reves)
rewriting
word found twice:  (__________

word found twice: go (go%)
word found twice: cyclone (#cyclone)
word found twice: mw (~mw)
word found twice:  (⨹)
word found twice: judetul (judeţul)
word found twice: despotovic (despotovic)
rewriting
word found twice: japan (japan%)
word found twice: against (against%)
word found twice: one (one,)
word found twice: friedrich (#friedrich)
word found twice: gorecki (gorecki)
rewriting
word found twice: eleonore (éléonore)
word found twice:  (}},}})
word found twice: alt (⣱alt)
word found twice: panic (panić)
word found twice: softball (softball%)
word found twice:  (ما)
word found twice: asp (asp/)
word found twice: proteges (proteges)
rewriting
word found twice: april june (april→june)
word found twice: long (lóng)
word found twice: ve (ﬁve)
word found twice: yuto (yūto)
word found twice: lak (lắk)
word found twice: eta (\eta^)
word found twice: katarina (katarína)
word found twice: artistico (artístico)
word found twice:  (╱)
word found twice: etc (/etc)
word found twice: i (ì)
word 
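
The log above reports collisions: several raw tokens in the `.vec` file collapse to the same key once diacritics and non-letter characters are stripped. Below is a minimal sketch of handling that would produce messages of this shape. It is an assumption inferred from the printed output, not the notebook's actual `load_vec`. The names `normalize_word` and `add_word` are hypothetical, and the letters-only regex is a simplification. The rule it illustrates: on a collision the earlier vector is kept, unless the raw token is already in normalized form, in which case it overwrites the entry ("rewriting").

```python
import re
import unicodedata


def normalize_word(word):
    """Hypothetical normalizer: lowercase, drop diacritics, keep only a-z."""
    decomposed = unicodedata.normalize('NFD', word.lower())
    # Remove combining marks (category 'Mn'), e.g. the accent in 'é'.
    ascii_like = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
    # Replace every run of non-letter characters with a single space.
    return re.sub(r'[^a-z]+', ' ', ascii_like).strip()


def add_word(vocab, raw_word, vector, log):
    """Insert a vector under the normalized key, logging collisions."""
    key = normalize_word(raw_word)
    if key in vocab:
        log.append('word found twice: {} ({})'.format(key, raw_word))
        if key != raw_word:
            # The raw token was altered by normalization: keep the earlier entry.
            return
        # The raw token is already "clean": prefer its vector.
        log.append('rewriting')
    vocab[key] = vector


vocab, log = {}, []
toy_entries = [('ou', [1.0]), ('où', [2.0]), ('né', [3.0]), ('ne', [4.0])]
for raw, vec in toy_entries:
    add_word(vocab, raw, vec, log)

for message in log:
    print(message)
```

With this rule the unaccented spelling wins whenever both forms occur, regardless of their order in the file. That matters because the MUSE files are sorted by frequency, and the accented French form is often the more frequent one.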

In [11]:
fr_embedding_tuple = load_vec('./wiki.multi.fr.vec')

word found twice:  (')
word found twice:  (-)
word found twice:  ())
word found twice:  (()
word found twice: a (a)
rewriting
word found twice: s (s)
rewriting
word found twice:  (»)
word found twice:  (}})
word found twice:  (#)
word found twice:  (/)
word found twice: ou (où)
word found twice:  (%)
word found twice:  (–)
word found twice: ne (né)
word found twice:  (│)
word found twice:  (+)
word found twice:  (•)
word found twice:  (—)
word found twice: des (dès)
word found twice:  (})
word found twice:  (·)
word found twice:  (&)
word found twice: la (là)
word found twice: annee (&annee)
word found twice:  (°)
word found twice: comte (comte)
rewriting
word found twice:  (†)
word found twice: donne (donné)
word found twice: cote (côte)
word found twice: n (n°)
word found twice:  ( )
word found twice:  (>)
word found twice: demande (demandé)
word found twice: passe (passé)
word found twice: wikipedia (wikipedia)
rewriting
word found twice: situe (situe)
rewriting
word found twice: ut

word found twice: capture (capturé)
word found twice:  (⇒)
word found twice: marches (marches)
rewriting
word found twice: right (\right)
word found twice: prete (prêté)
word found twice: prefere (préféré)
word found twice: fur (fur)
rewriting
word found twice: chaine (chaine)
rewriting
word found twice: reference (reference)
rewriting
word found twice: parait (parait)
rewriting
word found twice: associe (associe)
rewriting
word found twice: separe (séparé)
word found twice: passes (passés)
word found twice: ile (ile)
rewriting
word found twice: hotel (hotel)
rewriting
word found twice: entraine (entraîné)
word found twice: edit (édit)
word found twice: television (television)
rewriting
word found twice: formes (formés)
word found twice: apporte (apporté)
word found twice: provoque (provoqué)
word found twice: regiment (regiment)
rewriting
word found twice: the (thé)
word found twice: confie (confié)
word found twice: accuse (accuse)
rewriting
word found twice: felix (felix)
rewriting


word found twice: region (region)
rewriting
word found twice: limites (limités)
word found twice: rentre (rentré)
word found twice: affecte (affecte)
rewriting
word found twice: atteste (atteste)
rewriting
word found twice: creuse (creusé)
word found twice: maitre (maitre)
rewriting
word found twice: forge (forgé)
word found twice: x (x^)
word found twice: adapte (adapte)
rewriting
word found twice: fixes (fixés)
word found twice: ain (aïn)
word found twice: m (m²)
word found twice: refugie (réfugié)
word found twice: justifie (justifié)
word found twice: ramene (ramené)
word found twice: eut (eût)
word found twice: illustres (illustrés)
word found twice: sao (sao)
rewriting
word found twice: mat (mât)
word found twice: borde (borde)
rewriting
word found twice: plante (planté)
word found twice: sanchez (sanchez)
rewriting
word found twice: jerome (jerome)
rewriting
word found twice: barthelemy (barthélémy)
word found twice: opere (opéré)
word found twice: different (diffèrent)
word fou

word found twice: rassemble (rassemblé)
word found twice: attaques (attaqués)
word found twice: colle (collé)
word found twice: frederic (frederic)
rewriting
word found twice: q (q}})
word found twice: figure (figuré)
word found twice: ecarte (écarte)
word found twice:  (и)
word found twice: menaces (menacés)
word found twice: oscar (óscar)
word found twice: boheme (bohème)
word found twice: reproche (reproché)
word found twice: regional (regional)
rewriting
word found twice: teste (teste)
rewriting
word found twice: leningrad (leningrad)
rewriting
word found twice: mele (mêlé)
word found twice: rose (rosé)
word found twice: lambda (\lambda)
word found twice: vicomte (vicomté)
word found twice: consulte (consulte)
rewriting
word found twice: decline (décliné)
word found twice: quebec (quebec)
rewriting
word found twice: equipe (equipe)
rewriting
word found twice: experience (experience)
rewriting
word found twice: tome (tomé)
word found twice: taiwan (taiwan)
rewriting
word found twice

word found twice: avila (avila)
rewriting
word found twice: accelere (accéléré)
word found twice: regroupe (regroupé)
word found twice: d (%d)
word found twice: sigma (\sigma)
word found twice: perce (perce)
rewriting
word found twice: jozef (józef)
word found twice: approche (approché)
word found twice: hat (\hat)
word found twice: piege (piégé)
word found twice: incite (incité)
word found twice: julian (julián)
word found twice: fouille (fouillé)
word found twice: negocie (négocié)
word found twice: emilie (emilie)
rewriting
word found twice: evite (évité)
word found twice: epargne (épargné)
word found twice: internes (internés)
word found twice: generalise (généralise)
word found twice: napoleon (napoleon)
rewriting
word found twice: m (m_)
word found twice: angelique (angelique)
rewriting
word found twice: vertebres (vertèbres)
word found twice: energie (energie)
rewriting
word found twice: henin (henin)
rewriting
word found twice: durer (dürer)
word found twice: explore (exploré)


word found twice: chute (chuté)
word found twice: preoccupe (préoccupe)
word found twice:  (+})
word found twice: mega (méga)
word found twice: peuples (peuplés)
word found twice: chauffe (chauffé)
word found twice: merite (mérité)
word found twice: chatel (chatel)
rewriting
word found twice: use (usé)
word found twice: flute (flute)
rewriting
word found twice: cendre (cendré)
word found twice: demeure (demeuré)
word found twice: armee (armee)
rewriting
word found twice: exporte (exporte)
rewriting
word found twice: helene (helene)
rewriting
word found twice: captures (captures)
rewriting
word found twice: ismail (ismaïl)
word found twice: evelyne (evelyne)
rewriting
word found twice:  (話)
word found twice: sa (sá)
word found twice: ete (ete)
rewriting
word found twice: aragon (aragón)
word found twice: le (le )
word found twice: andres (andres)
rewriting
word found twice:  (с)
word found twice: melanges (mélangés)
word found twice: wc (wc}})
word found twice: nino (niño)
word found tw

word found twice: vladimir (vladimír)
word found twice: specifie (spécifie)
word found twice: tau (\tau)
word found twice: ville (villé)
word found twice: penalty (pénalty)
word found twice: analyses (analysés)
word found twice: maries (maries)
rewriting
word found twice: r (r^)
word found twice: pe (pé)
word found twice: economie (economie)
rewriting
word found twice: pa (på)
word found twice: massacre (massacré)
word found twice: faite (faîte)
word found twice: penche (penché)
word found twice: artemis (artemis)
rewriting
word found twice: tentes (tentés)
word found twice: alexei (alexei)
rewriting
word found twice: stereo (stereo)
rewriting
word found twice: treve (trève)
word found twice: leila (leïla)
word found twice: liberia (libéria)
word found twice: gorges (görges)
word found twice: prie (prié)
word found twice: legales (legales)
rewriting
word found twice: elisa (élisa)
word found twice: decret (decret)
rewriting
word found twice: panama (panamá)
word found twice: ecclesia (

word found twice: doyenne (doyenne)
rewriting
word found twice: evian (evian)
rewriting
word found twice: bolchevique (bolchevique)
rewriting
word found twice: incendies (incendiés)
word found twice: bucher (bucher)
rewriting
word found twice: caches (caches)
rewriting
word found twice: brule (brulé)
word found twice: stephane (stephane)
rewriting
word found twice: elimination (elimination)
rewriting
word found twice:  (}},})
word found twice: roman (román)
word found twice: hue (hué)
word found twice: rhin (rhin_)
word found twice: pal (pál)
word found twice: uber (uber)
rewriting
word found twice: amalgame (amalgamé)
word found twice: ter (tér)
word found twice: shanghai (shanghaï)
word found twice: peron (péron)
word found twice: prieure (prieure)
rewriting
word found twice: aggrave (aggravé)
word found twice: structures (structurés)
word found twice: calais (calais_)
word found twice: attaquant (attaquant}})
word found twice:  (№)
word found twice:  (✎)
word found twice: suppressio

word found twice: nikos (níkos)
word found twice:  (стеф)
word found twice: planifie (planifie)
rewriting
word found twice: x (###x##)
word found twice: pardonne (pardonné)
word found twice: enveloppe (enveloppé)
word found twice: l (l_)
word found twice: katerina (kateřina)
word found twice: ai (đại)
word found twice: enterre (enterre)
rewriting
word found twice: commemore (commémoré)
word found twice: elias (elías)
word found twice: naufrages (naufrages)
rewriting
word found twice: kano (kanō)
word found twice: sur ( sur)
word found twice:  (\,\)
word found twice: oracle (oracle,)
word found twice: suspectes (suspectés)
word found twice: ideal (ideal)
rewriting
word found twice: medias (medias)
rewriting
word found twice: boitier (boitier)
rewriting
word found twice: levesque (levesque)
rewriting
word found twice: pecheurs (pécheurs)
word found twice: parachute (parachuté)
word found twice: gabor (gabor)
rewriting
word found twice:  (ℚ)
word found twice: vasquez (vasquez)
rewriting
w

word found twice: borghese (borghèse)
word found twice: abd (`abd)
word found twice: invalide (invalidé)
word found twice: muscles (musclés)
word found twice: entraineurs (entraineurs)
rewriting
word found twice: http (/http)
word found twice:  (ℹ)
word found twice: affilie (affilie)
rewriting
word found twice: auge (augé)
word found twice: enterine (entériné)
word found twice: yu (yū)
word found twice: ferraille (ferraillé)
word found twice: enflamme (enflammé)
word found twice: elu (#élu)
word found twice: devore (dévore)
word found twice:  (∅)
word found twice: strategies (strategies)
rewriting
word found twice: tape (tapé)
word found twice: boise (boise)
rewriting
word found twice: oceanic (océanic)
word found twice: tout ( tout)
word found twice: veterans (veterans)
rewriting
word found twice: denombre (dénombré)
word found twice: mon (môn)
word found twice: x (x+)
word found twice: veto (véto)
word found twice: enfonce (enfoncé)
word found twice: hyperion (hypérion)
word found tw

word found twice: reginald (réginald)
word found twice: facon (facon)
rewriting
word found twice: prefecture (prefecture)
rewriting
word found twice: les (/les)
word found twice: jager (jager)
rewriting
word found twice: apres (aprés)
word found twice:  (µ)
word found twice: pecheur (pécheur)
word found twice: homologues (homologués)
word found twice: shu (shū)
word found twice: administrateurs (administrateurs/)
word found twice: fascine (fascine)
rewriting
word found twice: deco (deco)
rewriting
word found twice:  ( )
word found twice: creature (creature)
rewriting
word found twice: neutralise (neutralise)
rewriting
word found twice: traine (traîné)
word found twice: principe (príncipe)
word found twice: present (–présent)
word found twice: infanterie (infanterie}})
word found twice: be (be/)
word found twice: englobe (englobé)
word found twice: grade (gradé)
word found twice: pese (pesé)
word found twice: a (à )
word found twice: klara (klara)
rewriting
word found twice: depeche (dé

word found twice: yucatan (yucatan)
rewriting
word found twice: presente (presente)
rewriting
word found twice: avale (avalé)
word found twice: equipe (équipe_)
word found twice:  (}}}}}}}})
word found twice: maia (maïa)
word found twice: enroule (enroule)
rewriting
word found twice: barthelemy (barthelemy)
rewriting
word found twice: germany (germany}})
word found twice: actionne (actionne)
rewriting
word found twice: gu (gū)
word found twice: cameo (cameo)
rewriting
word found twice: sy (şÿℵדαχ₮ɘɼɾ๏ʁ)
word found twice: prejudice (prejudice)
rewriting
word found twice: epoque (epoque)
rewriting
word found twice: big (\big)
word found twice: e (ê)
word found twice: reactive (réactivé)
word found twice: aimee (aimee)
rewriting
word found twice: abondance (_abondance)
word found twice: nepal (nepal)
rewriting
word found twice: interet (intéret)
word found twice:  (`)
word found twice: prospere (prospéré)
word found twice: janis (jānis)
word found twice: irrite (irrite)
rewriting
word fou

word found twice:  (#}}})
word found twice: residence (residence)
rewriting
word found twice: sin (sîn)
word found twice: georgios (geórgios)
word found twice: dut (dût)
word found twice: raye (raye)
rewriting
word found twice: allege (allège)
word found twice: plon (plön)
word found twice: pieve (piève)
word found twice: mull (mull)
rewriting
word found twice: riviere (riviere)
rewriting
word found twice: cooperative (cooperative)
rewriting
word found twice: vandalise (vandalise)
rewriting
word found twice: cuisines (cuisinés)
word found twice: formules (formulés)
word found twice: pates (pâtés)
word found twice: mathbf (\mathbf_)
word found twice: depouilles (dépouillés)
word found twice: illumine (illumine)
rewriting
word found twice: alcazar (alcázar)
word found twice: cense (cense)
rewriting
word found twice: certifie (certifie)
rewriting
word found twice: entrave (entravé)
word found twice: republica (republica)
rewriting
word found twice: ota (ōta)
word found twice: abroge (abro

word found twice: domestique (domestiqué)
word found twice: sum (\sum)
word found twice: indef (indéf)
word found twice: orban (orbán)
word found twice: gai (gaï)
word found twice: solde (soldé)
word found twice: broye (broyé)
word found twice: sequence (séquencé)
word found twice: new (new}})
word found twice: consignes (consignés)
word found twice: discuter (►discuter)
word found twice: langle (langle)
rewriting
word found twice: sai (saï)
word found twice: fourragere (fourragere)
rewriting
word found twice: depenses (dépensés)
word found twice: questionne (questionné)
word found twice: repondre (repondre)
rewriting
word found twice: dimitrios (dimítrios)
word found twice: affut (affut)
rewriting
word found twice: circ (\circ)
word found twice: janos (janos)
rewriting
word found twice: mer (mer »)
word found twice: ecran (ecran)
rewriting
word found twice: sg (/sg)
word found twice:  (у)
word found twice: allocine (allocine)
rewriting
word found twice: egyptiens (egyptiens)
rewriting

word found twice: pete (pète)
word found twice: e (eﬆ)
word found twice: novy (nový)
word found twice: rateau (râteau)
word found twice: resserre (resserré)
word found twice: page (pagé)
word found twice: ll (ll}})
word found twice: marino (mariño)
word found twice: odon (ödön)
word found twice: outre (outré)
word found twice: reflexion (reflexion)
rewriting
word found twice: une (#une)
word found twice: allah (allâh)
word found twice: decalage (decalage)
rewriting
word found twice: etire (étiré)
word found twice: ouedraogo (ouédraogo)
word found twice: catalogue (catalogué)
word found twice: de (//de)
word found twice: place ( place)
word found twice: estimes (estimes)
rewriting
word found twice: liga (līga)
word found twice: tarde (tardé)
word found twice: indignes (indignés)
word found twice: eole (eole)
rewriting
word found twice: pas (pas…)
word found twice: genes (genes)
rewriting
word found twice: resilie (résilié)
word found twice:  (ß)
word found twice: egare (égare)
word foun

word found twice: selassie (selassie)
rewriting
word found twice:  (л)
word found twice: j (#j)
word found twice: divorces (divorces)
rewriting
word found twice: decroit (décroit)
word found twice: elu (elu)
rewriting
word found twice: reitere (réitéré)
word found twice: contamines (contamines)
rewriting
word found twice: ecolo (écolo)
word found twice: wikipedia (#wikipedia)
word found twice: loic (loic)
rewriting
word found twice: ryo (ryō)
word found twice: ref (_ref)
word found twice:  (>>>)
word found twice: equations (equations)
rewriting
word found twice: vous ( vous)
word found twice: fatigues (fatigues)
rewriting
word found twice: eu (eu/)
word found twice: labbe (labbe)
rewriting
word found twice: interviewe (interviewe)
rewriting
word found twice: molnar (molnar)
rewriting
word found twice: module (modulé)
word found twice: github (github)
rewriting
word found twice: illegal (illegal)
rewriting
word found twice: ecuyer (ecuyer)
rewriting
word found twice:  (   )
word found t

word found twice: eveques (eveques)
rewriting
word found twice: desole (désole)
word found twice: metamorphose (métamorphosé)
word found twice: d (d+)
word found twice: peril (peril)
rewriting
word found twice: koichi (kōichi)
word found twice: goemon (goemon)
rewriting
word found twice: ff (ff}})
word found twice: fevrier (fevrier)
rewriting
word found twice: produit (_produit)
word found twice: ia (ía)
word found twice: katarina (katarína)
word found twice: terre ( terre)
word found twice: chaim (chaïm)
word found twice: eventuellement (eventuellement)
rewriting
word found twice: vilaine (vilaine_)
word found twice: ejecte (éjecte)
word found twice: choregraphie (chorégraphié)
word found twice: roue (roué)
word found twice: hoceima (hoceïma)
word found twice: experiences (experiences)
rewriting
word found twice: eleni (eleni)
rewriting
word found twice: rape (râpé)
word found twice: engrange (engrangé)
word found twice: clarifie (clarifie)
rewriting
word found twice:  (°/)
word found

word found twice: mediawiki (médiawiki)
word found twice: aigue (aigue)
rewriting
word found twice: galeria (galería)
word found twice: coche (coché)
word found twice: cher (cher_)
word found twice: elam (elam)
rewriting
word found twice: mache (mâché)
word found twice: serres (serrès)
word found twice: valencia (valència)
word found twice: theologie (theologie)
rewriting
word found twice: sergei (sergeï)
word found twice: theravada (theravada)
rewriting
word found twice: emeric (émeric)
word found twice:  (¾)
word found twice: sene (sène)
word found twice: erick (érick)
word found twice: ros (rós)
word found twice: teletoon (teletoon)
rewriting
word found twice: brode (brode)
rewriting
word found twice: pelissier (pelissier)
rewriting
word found twice: wurm (wurm)
rewriting
word found twice: hardwicke (/hardwicke)
word found twice: bechet (béchet)
word found twice: forets (forets)
rewriting
word found twice: yusuf (yûsuf)
word found twice: desarme (désarme)
word found twice: are (åre)

word found twice: ecroule (écroulé)
word found twice: kwai (kwai)
rewriting
word found twice: enchainent (enchainent)
rewriting
word found twice: professionnelle (professionnelle}})
word found twice: t (^t)
word found twice: leonore (leonore)
rewriting
word found twice: emirats (emirats)
rewriting
word found twice: moron (moron)
rewriting
word found twice: angela (ángela)
word found twice: voutes (voutes)
rewriting
word found twice: nominee (nominee)
rewriting
word found twice: louis ( louis)
word found twice: broyes (broyes)
rewriting
word found twice: inquietes (inquiètes)
word found twice: immediate (immediate)
rewriting
word found twice: papa (pápa)
word found twice: balint (bálint)
word found twice: cocaine (cocaine)
rewriting
word found twice: bourges (bourgès)
word found twice: nationale (nationale »)
word found twice: emery (émery)
word found twice: koweit (koweit)
rewriting
word found twice: degeneres (dégénérés)
word found twice: sacrement (sacrément)
word found twice: terror

word found twice: allemand (allemand,)
word found twice: delibere (délibère)
word found twice: tenu (ténu)
word found twice: cine (ciné+)
word found twice:  (ι)
word found twice: titre (titre_)
word found twice: empeche (empeche)
rewriting
word found twice: fundacion (fundacion)
rewriting
word found twice:  (∧)
word found twice: deon (déon)
word found twice: designation (designation)
rewriting
word found twice: sourcage (sourcage)
rewriting
word found twice: regles (regles)
rewriting
word found twice: beit (beït)
word found twice: impulse (impulsé)
word found twice: distille (distille)
rewriting
word found twice: lluis (lluis)
rewriting
word found twice: a (_a)
word found twice: evince (évince)
word found twice: cong (công)
word found twice: gamma (\gamma^)
word found twice: elaboration (elaboration)
rewriting
word found twice: editeurs (editeurs)
rewriting
word found twice: rance (rancé)
word found twice: quoi (¿quoi)
word found twice: fete (fete)
rewriting
word found twice: l (,l)
wo

word found twice: enclaves (enclavés)
word found twice: setubal (setubal)
rewriting
word found twice: v (v/)
word found twice: a ($a)
word found twice: eloge (eloge)
rewriting
word found twice: maze (mazé)
word found twice: wikipedia (#wikipédia)
word found twice: kitab (kitāb)
word found twice: congres (congres)
rewriting
word found twice: fibres (fibrés)
word found twice: quiche (quiche)
rewriting
word found twice: menendez (menendez)
rewriting
word found twice: lucia (lúcia)
word found twice: immortalise (immortalise)
rewriting
word found twice: epilogue (epilogue)
rewriting
word found twice: desavoue (désavoue)
word found twice: patronne (patronné)
word found twice: defiant (defiant)
rewriting
word found twice: decoupes (découpes)
word found twice: cc h (cc@h)
word found twice: coopere (coopéré)
word found twice: mikhailov (mikhaïlov)
word found twice: degats (dégats)
word found twice: electoral (electoral)
rewriting
word found twice: memes (memes)
rewriting
word found twice: gable

word found twice: refoule (refoule)
rewriting
word found twice: remilly (rémilly)
word found twice: craque (craqué)
word found twice: omer (ömer)
word found twice: regne (regne)
rewriting
word found twice: perec (pérec)
word found twice: esp (esp}})
word found twice: casares (casarès)
word found twice: que (que )
word found twice: diem (diệm)
word found twice: melina (mélina)
word found twice: desapprouve (désapprouvé)
word found twice: go (gô)
word found twice:  (#}})
word found twice: i (#i)
word found twice: presentations (presentations)
rewriting
word found twice: athle (athlé)
word found twice: lindstrom (lindstrom)
rewriting
word found twice: thailande (thailande)
rewriting
word found twice: xviii (xviii°)
word found twice:  (,,)
word found twice: thea (théa)
word found twice: titanic ( titanic »)
word found twice: feconde (fécondé)
word found twice: mosse (mossé)
word found twice: shangai (shangai)
rewriting
word found twice: in ( in)
word found twice: g (/g)
word found twice: g

word found twice: geza (geza)
rewriting
word found twice: omi (ōmi)
word found twice: representatives (representatives)
rewriting
word found twice:  (,…)
word found twice: saad (saâd)
word found twice: marques (marquès)
word found twice: hygiene (hygiene)
rewriting
word found twice: s n (s/n)
word found twice: pire (piré)
word found twice: pige (pigé)
word found twice: telerama (telerama)
rewriting
word found twice: fdc (fdc)
rewriting
word found twice: innes (innés)
word found twice: conseil ( conseil)
word found twice: possederait (posséderait)
word found twice:  (🔴)
word found twice: peine (peiné)
word found twice: backes (backès)
word found twice: deroute (dérouté)
word found twice: redouble (redoublé)
word found twice: portugal (portugal}})
word found twice: coexiste (coexisté)
word found twice: en (,en)
word found twice: vinh (vĩnh)
word found twice: vienne (vienne_)
word found twice: je (,je)
word found twice:  (無)
word found twice: rimes (rimés)
word found twice: evin (evin)
re

word found twice: senecal (sénécal)
word found twice: kongo (kongō)
word found twice: tg (/tg)
word found twice: cc (#cc)
word found twice: hebron (hebron)
rewriting
word found twice: gaucho (gaúcho)
word found twice: a (,à)
word found twice: janvier (janvier )
word found twice: nomme (nõmme)
word found twice: etaient (etaient)
rewriting
word found twice: souffles (soufflés)
word found twice: angouleme (angoulème)
word found twice: valdes (valdès)
word found twice: jordan (jordán)
word found twice: t (t/)
word found twice: vireo (viréo)
word found twice: soren (sören)
word found twice: puigcerda (puigcerda)
rewriting
word found twice: presidentielle (presidentielle)
rewriting
word found twice: eruption (eruption)
rewriting
word found twice: dun (dún)
word found twice: gites (gites)
rewriting
word found twice: eleazar (eléazar)
word found twice: temperance (temperance)
rewriting
word found twice: o (ò)
word found twice: demange (démange)
word found twice: reserva (reserva)
rewriting
wor

word found twice: ans (ans…)
word found twice: metaxas (metaxás)
word found twice: economica (económica)
word found twice: sevi (sevi)
rewriting
word found twice: dechiffre (déchiffre)
word found twice: thetis (thetis)
rewriting
word found twice: par (par )
word found twice: economise (économisé)
word found twice: cegep (cegep)
rewriting
word found twice: classifie (classifie)
rewriting
word found twice: saccages (saccages)
rewriting
word found twice:  (·✉·✍·)
word found twice: kobe (kobé)
word found twice: dans (dañs)
word found twice: reprouve (réprouvé)
word found twice:  (⁄)
word found twice: annee (année_)
word found twice: super (süper)
word found twice: disqualifie (disqualifie)
rewriting
word found twice: discrete (discrete)
rewriting
word found twice: revisions (revisions)
rewriting
word found twice: village ( village)
word found twice: ses ( ses)
word found twice: succederont (succèderont)
word found twice: belon (bélon)
word found twice: france (france…)
word found twice: to

word found twice: reglements (réglements)
word found twice: heribert (héribert)
word found twice: precisement (précisement)
word found twice: michaela (michaëla)
word found twice: attriste (attriste)
rewriting
word found twice: goncalves (goncalves)
rewriting
word found twice: lexikon (@lexikon)
word found twice: legislation (legislation)
rewriting
word found twice: vec (\vec}_)
word found twice: longitude (±longitude)
word found twice: yuji (yūji)
word found twice: reamenage (réaménage)
word found twice: ouen (ouën)
word found twice: prostitue (prostitue)
rewriting
word found twice: d (_d)
word found twice: faches (faches)
rewriting
word found twice: club (club}})
word found twice: mondovi (mondovì)
word found twice: nl (nl/)
word found twice: masai (masaï)
word found twice: yang (yáng)
word found twice:  (иван)
word found twice: sympathise (sympathisé)
word found twice: tour (tour}})
word found twice: sekou (sekou)
rewriting
word found twice: general (général}})
word found twice: moi

word found twice: synonyme (synonyme )
word found twice: mila (milà)
word found twice: mun (mûn)
word found twice: talonne (talonne)
rewriting
word found twice: al (ħal)
word found twice: rape (râpe)
word found twice: godel (godel)
rewriting
word found twice: w (/w/)
word found twice: domenech (domènech)
word found twice: lambda (\lambda^)
word found twice: cite ( cité)
word found twice: pele (pele)
rewriting
word found twice: corse (corsé)
word found twice: z (ž)
word found twice: chatelard (chatelard)
rewriting
word found twice: caserne (caserné)
word found twice: maritimes (maritimes_)
word found twice: hotel ( hôtel)
word found twice: africa (áfrica)
word found twice: lindelof (lindelof)
rewriting
word found twice: theatre ( théâtre)
word found twice: mu i (\mu_i^)
word found twice:  (·˙·)
word found twice: denier (dénier)
word found twice:  (🙋)
word found twice: religieux (religieux »)
word found twice: eglise (église »)
word found twice: merce (mercè)
word found twice: enrage (en

word found twice: etre (être »)
word found twice: preludes (preludes)
rewriting
word found twice: droite (droite »)
word found twice: bonne ( bonne)
word found twice: ferus (ferus)
rewriting
word found twice: wang (wáng)
word found twice: schroder (schroder)
rewriting
word found twice: mars (mars )
word found twice: pupille (pupillé)
word found twice:  (от)
word found twice: menagerie (menagerie)
rewriting
word found twice: kitsune (kitsuné)
word found twice: mathbf (\mathbf^)
word found twice: mekong (mekong)
rewriting
word found twice: mosaique (mosaique)
rewriting
word found twice: gregor (grégor)
word found twice: considerant (considerant)
rewriting
word found twice: ree (rée)
word found twice: belgium (_belgium)
word found twice:  (москва)
word found twice: receptionne (réceptionne)
word found twice: aleria (aleria)
rewriting
word found twice: neg (nèg)
word found twice: frege (frégé)
word found twice: petits ( petits)
word found twice: christina (christína)
word found twice: tele

word found twice: cambriole (cambriolé)
word found twice: premio (prémio)
word found twice: kopavogur (kópavogur)
word found twice: personnes ( personnes)
word found twice: decora (decora)
rewriting
word found twice: devient (dévient)
word found twice: espece (espece)
rewriting
word found twice: pete (pété)
word found twice: abusefilter (abusefilter/)
word found twice: dogan (doğan)
word found twice: magnusson (magnússon)
word found twice: refuses (refuses)
rewriting
word found twice: zinaida (zinaida)
rewriting
word found twice: solferino (solferino)
rewriting
word found twice: wanderers (wanderers}})
word found twice: saadi (saâdi)
word found twice: aveugles (aveuglés)
word found twice: it (//it)
word found twice: thien (thiên)
word found twice: p (+p)
word found twice: tst (tst)
rewriting
word found twice: relaye (relaye)
rewriting
word found twice: s ( s )
word found twice: oui (oüi)
word found twice: incise (incisé)
word found twice: canada (canada,)
word found twice: x (œx)
word 

word found twice: modules (modulés)
word found twice: a (a »)
word found twice: l (/l/)
word found twice: metronome (metronome)
rewriting
word found twice: carnes (carnés)
word found twice: aide (aide,)
word found twice: faux ( faux)
word found twice: hereros (hereros)
rewriting
word found twice: geronimo (gerónimo)
word found twice: pagine (paginé)
word found twice: qumran (qumran)
rewriting
word found twice: vj (ˈvjɛɕ)
word found twice: bechir (bechir)
rewriting
word found twice: gros ( gros)
word found twice: ryoma (ryōma)
word found twice: emule (emule)
rewriting
word found twice: conjure (conjure)
rewriting
word found twice: ruhl (rühl)
word found twice: japon (/japon)
word found twice: reveleront (révèleront)
word found twice: aspasia (aspasía)
word found twice: isbn (,isbn)
word found twice: palais ( palais)
word found twice: l (‘l)
word found twice: jaen (jaen)
rewriting
word found twice: marines (marinés)
word found twice: eyadema (eyadema)
rewriting
word found twice: boletin 

word found twice: ong (đông)
word found twice: laodice (laodice)
rewriting
word found twice: juin (_juin_)
word found twice: meme ( même)
word found twice: jb (jb✉)
word found twice: peut ( peut)
word found twice: ecritures (ecritures)
rewriting
word found twice: abimer (abimer)
rewriting
word found twice: numero (número)
word found twice: larve (larvé)
word found twice: drazen (drazen)
rewriting
word found twice: rai (raí)
word found twice:  (й)
word found twice: memorise (mémorise)
word found twice:  (его)
word found twice: peninsula (península)
word found twice: meeus (meeùs)
word found twice: rive (rivé)
word found twice: ouvres (ouvrés)
word found twice: titre (_titre)
word found twice: neve (nève)
word found twice: veria (véria)
word found twice: dx (//dx)
word found twice:  (～)
word found twice: fustige (fustigé)
word found twice: https ( https)
word found twice: janvier ( janvier )
word found twice: unis (unis}})
word found twice: satine (satine)
rewriting
word found twice: gha

word found twice: evenement (évenement)
word found twice: borde (börde)
word found twice: sera (será)
word found twice: berenguer (bérenguer)
word found twice: desavantages (désavantagés)
word found twice: koji (kôji)
word found twice: christ (christ »)
word found twice: yuichi (yūichi)
word found twice: vedanta (vedanta)
rewriting
word found twice: viscosite (viscosite)
rewriting
word found twice: sulfures (sulfurés)
word found twice: beyonce (beyonce)
rewriting
word found twice: trainieres (traînières)
word found twice: ussr (ussr}})
word found twice:  (号)
word found twice: nuit (nuit »)
word found twice: fo (fô)
word found twice: alene (alene)
rewriting
word found twice: plate (platé)
word found twice: departements (departements)
rewriting
word found twice: n (/n/)
word found twice: zeljko (zeljko)
rewriting
word found twice: terre (terre,)
word found twice: acceleration (acceleration)
rewriting
word found twice: profere (profère)
word found twice: regler (règler)
word found twice: 

word found twice: alumine (aluminé)
word found twice: terre (terré)
word found twice: etiquette (etiquette)
rewriting
word found twice:  (▷)
word found twice: boulle (boullé)
word found twice: grey (grey}})
word found twice: article (article}})
word found twice: fafnir (fáfnir)
word found twice: senia (senia)
rewriting
word found twice: prophetise (prophétisé)
word found twice: deshabille (déshabillé)
word found twice: barde (bardé)
word found twice: bien (bien…)
word found twice: oui ( oui)
word found twice: drosera (droséra)
word found twice: hoss (höss)
word found twice: scanne (scanne)
rewriting
word found twice: abjure (abjuré)
word found twice: menages (ménagés)
word found twice: mer (mer,)
word found twice: dilate (dilaté)
word found twice: hellstrom (hellstrom)
rewriting
word found twice: noce (nocé)
word found twice: btv (/btv)
word found twice:  (музей)
word found twice: katy (kąty)
word found twice: yagyu (yagyu)
rewriting
word found twice:  (институт)
word found twice: aout

word found twice: equilibre (equilibre)
rewriting
word found twice: boxe (boxé)
word found twice: r n (r^n)
word found twice: champion (champion}})
word found twice: epargnes (épargnes)
word found twice: jouet (jouët)
word found twice: labadie (labadié)
word found twice: oses (osés)
word found twice: terre (terre…)
word found twice:  (ŋ)
word found twice: evident (evident)
rewriting
word found twice:  (¿⸮)
word found twice: m m (m&m)
word found twice: la (‘la)
word found twice: plebiscites (plébiscités)
word found twice: joe (joé)
word found twice: sevin (sévin)
word found twice: reoccupe (réoccupe)
word found twice: requete (requète)
word found twice: ceremonies (ceremonies)
rewriting
word found twice: laureate (laureate)
rewriting
word found twice: kandahar (kandahâr)
word found twice: round (}}round}}}}})
word found twice: taddei (taddeï)
word found twice: paiva (païva)
word found twice: exemple (exemple,)
word found twice: tracte (tracte)
rewriting
word found twice: accuses (accuse

word found twice: angelina (angélina)
word found twice: le (lè)
word found twice: concurrences (concurrences)
rewriting
word found twice: inventorie (inventorie)
rewriting
word found twice: e (e+)
word found twice: labienus (labiénus)
word found twice: oye (oyé)
word found twice: electeur (electeur)
rewriting
word found twice: suivante (suivante )
word found twice: decimal (decimal)
rewriting
word found twice: creator (creator%)
word found twice: mediterranee (mediterranee)
rewriting
word found twice: epitre (epître)
word found twice: almodovar (almodovar)
rewriting
word found twice: mlada (mlada)
rewriting
word found twice: tochter (töchter)
word found twice:  (ł)
word found twice: redox (rédox)
word found twice: imperio (império)
word found twice: carmine (carminé)
word found twice: prefere (préfére)
word found twice: un (un »)
word found twice: economique (économique »)
word found twice: dine (dîné)
word found twice: polarise (polarise)
rewriting
word found twice: shigatse (shigatse

word found twice: concocte (concocte)
rewriting
word found twice: millesime (millésimé)
word found twice: bedia (bedia)
rewriting
word found twice: tout (tout »)
word found twice: rafle (raflé)
word found twice: feret (feret)
rewriting
word found twice: antenor (antenor)
rewriting
word found twice: lies (liés}})
word found twice: sane (sâne)
word found twice: magique (magique »)
word found twice: eparges (eparges)
rewriting
word found twice: azrael (azrael)
rewriting
word found twice: no (no )
word found twice: natalia (natália)
word found twice: lohr (löhr)
word found twice: triceratops (tricératops)
word found twice: scenarios (scenarios)
rewriting
word found twice: calligraphies (calligraphiés)
word found twice: goni (goñi)
word found twice: rives (rivés)
word found twice: equestre (equestre)
rewriting
word found twice: porter (±‹porter)
word found twice: ria (ría)
word found twice: exhume (exhume)
rewriting
word found twice: erro (erró)
word found twice: loria (lòria)
word found tw

word found twice: payes (payes)
rewriting
word found twice: thera (thera)
rewriting
word found twice: skg (/skg)
word found twice: liberati (liberati)
rewriting
word found twice: frange (frangé)
word found twice: ministre ( ministre)
word found twice: boreal (boreal)
rewriting
word found twice: eleuthere (eleuthère)
word found twice: patrimoine ( patrimoine)
word found twice: film ( film)
word found twice: sare (saré)
word found twice: sur (#sur)
word found twice: pape (papé)
word found twice: sweden (sweden}})
word found twice: chan (chán)
word found twice: caricatures (caricaturés)
word found twice: defigure (défigure)
word found twice: constanta (constanta)
rewriting
word found twice: original (original )
word found twice: th (th/)
word found twice: b (b,)
word found twice: diable (diable »)
word found twice:  (⠇⠑⠁⠛)
word found twice: indispose (indisposé)
word found twice: gueguen (gueguen)
rewriting
word found twice: barberis (barbéris)
word found twice: tremolo (tremolo)
rewritin

word found twice: ecole (école »)
word found twice: paraiso (paraiso)
rewriting
word found twice: idees (idees)
rewriting
word found twice: leonide (leonide)
rewriting
word found twice:  (%%)
word found twice: democratico (democratico)
rewriting
word found twice: cagoules (cagoules)
rewriting
word found twice: sarge (sargé)
word found twice: fontes (fontès)
word found twice: ffa (#ffa)
word found twice: fils (fils »)
word found twice: herpes (herpes)
rewriting
word found twice: the (‘the)
word found twice: monch (monch)
rewriting
word found twice:  (     )
word found twice: o ( o)
word found twice: elio (elío)
word found twice: lorien (lorien)
rewriting
word found twice: naude (naude)
rewriting
word found twice: linoleum (linoléum)
word found twice: esau (esau)
rewriting
word found twice: racontes (racontes)
rewriting
word found twice: aliena (aliena)
rewriting
word found twice: benaim (benaïm)
word found twice: dessalement (déssalement)
word found twice: ecorche (écorche)
word found t

word found twice: et (#et)
word found twice: pourquoi (#pourquoi)
word found twice: regionales (regionales)
rewriting
word found twice: mede (mede)
rewriting
word found twice: vinci (vinci}})
word found twice: membre ( membre)
word found twice: prefixe (préfixé)
word found twice: gouvernement ( gouvernement)
word found twice: delie (delie)
rewriting
word found twice: porte (porte »)
word found twice: levitan (lévitan)
word found twice:  (○)
word found twice: amd (/amd)
word found twice: crepin (crepin)
rewriting
word found twice: reynes (reynes)
rewriting
word found twice: toreador (toreador)
rewriting
word found twice: condottiere (condottière)
word found twice: decadence (decadence)
rewriting
word found twice: decroitre (décroitre)
word found twice: esope (esope)
rewriting
word found twice: teo (téo)
word found twice: abimee (abimée)
word found twice: dag (dağ)
word found twice: maharaja (mahârâja)
word found twice: gregorio (gregório)
word found twice: addai (addai)
rewriting
word f

word found twice:  (✎✎)
word found twice: teletoon (télétoon+)
word found twice: sieger (sieger)
rewriting
word found twice: teresa (térésa)
word found twice: stabile (stábile)
word found twice: doring (doring)
rewriting
word found twice: sauze (sauzé)
word found twice: nejib (nejib)
rewriting
word found twice: dabrowski (dąbrowski)
word found twice: prendre ( prendre)
word found twice: alarmes (alarmés)
word found twice: yuka (yūka)
word found twice: perigord (perigord)
rewriting
word found twice: ostra (ostra)
rewriting
word found twice: odon (odón)
word found twice: trilobe (trilobe)
rewriting
word found twice: samsara (saṃsāra)
word found twice: rauber (räuber)
word found twice: ou (/ou)
word found twice: hongo (hongō)
word found twice: kentaro (kentarō)
word found twice: macule (macule)
rewriting
word found twice: blinde (blinde)
rewriting
word found twice: janvier (/janvier)
word found twice: volume (volume%)
word found twice: gabin (gąbin)
word found twice: etaples (etaples)
rew

word found twice: emilienne (emilienne)
rewriting
word found twice: france (france/)
word found twice:  (の)
word found twice:  (⎕)
word found twice: gabriele (gabrièle)
word found twice: and (and%)
word found twice: honorio (honorio)
rewriting
word found twice: solea (soléa)
word found twice: mc (mc^)
word found twice: rocio (rocio)
rewriting
word found twice: systemes (systemes)
rewriting
word found twice: lh (lh/)
word found twice: azema (azema)
rewriting
word found twice: so (so_)
word found twice: predecesseur (prédecesseur)
word found twice: caitlin (caitlín)
word found twice: etudes ( études)
word found twice: babord (babord)
rewriting
word found twice: emboite (emboîté)
word found twice: red (_red)
word found twice: critique (critique »)
word found twice: goju (gōjū)
word found twice:  (искусство)
word found twice: reactualise (réactualise)
word found twice:  (⊆)
word found twice: bottes (bottés)
word found twice: reglement (reglement)
rewriting
word found twice: modifier ( modi

word found twice: u (ŭ)
word found twice: cage (cagé)
word found twice: annote (annote)
rewriting
word found twice: beau ( beau)
word found twice: melin (mélin)
word found twice: bandai (bandaï)
word found twice: petrov (pétrov)
word found twice: pole ( pôle)
word found twice: c (ć)
word found twice: chronique ( chronique)
word found twice: beche (bèche)
word found twice: leur ( leur)
word found twice: vivre ( vivre)
word found twice: interprete (interprête)
word found twice: commune (commune}})
word found twice: politiques (politiques »)
word found twice: jeunes (jeunes »)
word found twice: biogeographie (biogeographie)
rewriting
word found twice: espagne (espagne »)
word found twice: pospisil (pospíšil)
word found twice: nystrom (nystrom)
rewriting
word found twice: surtaxe (surtaxé)
word found twice: amin (amîn)
word found twice: menotte (menotte)
rewriting
word found twice: georgien (georgien)
rewriting
word found twice: remora (rémora)
word found twice: hegesippe (hégesippe)
word 

word found twice:  (兔)
word found twice: theodor (théodor)
word found twice: plutot ( plutôt)
word found twice: luge (lüge)
word found twice: defriche (défriche)
word found twice: abl (%abl)
word found twice: moreri (moreri)
rewriting
word found twice: volte (volté)
word found twice: e f (e,f)
word found twice: eons (eons)
rewriting
word found twice: juillet ( juillet)
word found twice: phenyl (phenyl)
rewriting
word found twice: jehan (jéhan)
word found twice: izia (izïa)
word found twice: terres ( terres)
word found twice: ngu (ngữ)
word found twice: derive (derive)
rewriting
word found twice: ici (içi)
word found twice:  ( }})
word found twice: selecao (seleçao)
word found twice: body (_body)
word found twice: allaite (allaité)
word found twice: o o (ʻōʻō)
word found twice: ade (adé)
word found twice: la (la♭)
word found twice: croix (croix »)
word found twice: balalaika (balalaika)
rewriting
word found twice: non (non…)
word found twice: codes (/codes)
word found twice: walder (wäl

word found twice: score (score_)
word found twice: publics (publics,)
word found twice: remond (remond)
rewriting
word found twice: tiberius (tibérius)
word found twice: borbon (borbon)
rewriting
word found twice: aerienne (aerienne)
rewriting
word found twice: vaisse (vaisse)
rewriting
word found twice: heaulme (heaulmé)
word found twice: championship (championship}})
word found twice: kato (katô)
word found twice: dax (_dax)
word found twice: reville (reville)
rewriting
word found twice: attente (attenté)
word found twice:  (алексей)
word found twice: catche (catché)
word found twice: mela (méla)
word found twice: h (_h)
word found twice: futs (futs)
rewriting
word found twice: replicant (replicant)
rewriting
word found twice: nuit (nuit…)
word found twice: bedouin (bedouin)
rewriting
word found twice: cemea (cemea)
rewriting
word found twice: valle (vallé)
word found twice: annee (année )
word found twice: navratil (navrátil)
word found twice: n (+n)
word found twice: nevroses (névr

word found twice: dechaine (déchaine)
word found twice: junjo (junjō)
word found twice: universita (universita)
rewriting
word found twice: quebec (québec}})
word found twice: g (ĝ)
word found twice: an (ân)
word found twice: championnat (championnat}})
word found twice: ragout (ragout)
rewriting
word found twice: peaufine (peaufiné)
word found twice: guei (gueï)
word found twice: exiguite (exigüité)
word found twice: burgi (bürgi)
word found twice: p i (p_i^)
word found twice: borras (borrás)
word found twice: appel ( appel)
word found twice: ecume (écumé)
word found twice: qu (#qu)
word found twice: civica (cívica)
word found twice: mysterio (mystério)
word found twice: menaiel (ménaïel)
word found twice: moe (moé)
word found twice: reve (rève)
word found twice: kobe (kôbe)
word found twice: makeieff (makeieff)
rewriting
word found twice: rebuffat (rebuffat)
rewriting
word found twice: kiraly (kiraly)
rewriting
word found twice: lowy (lowy)
rewriting
word found twice: samtskhe (samts

A function from MUSE: it finds the nearest vectors in the other language.

Let's check how well the alignment works.

In [12]:
def get_nn(word, src_emb, src_id2word, tgt_emb, tgt_id2word, K=5):
    print("Nearest neighbors of \"%s\":" % word)
    word2id = {v: k for k, v in src_id2word.items()}
    word_emb = src_emb[word2id[word]]
    scores = (tgt_emb / np.linalg.norm(tgt_emb, 2, 1)[:, None]).dot(word_emb / np.linalg.norm(word_emb))
    k_best = scores.argsort()[-K:][::-1]
    for i, idx in enumerate(k_best):
        print('%.4f - %s' % (scores[idx], tgt_id2word[idx]))

In [13]:
for word in ['cat','dog','human','student','computer']:
    get_nn(word, en_embedding_tuple[0], en_embedding_tuple[1], fr_embedding_tuple[0], fr_embedding_tuple[1], K=3)

Nearest neighbors of "cat":
0.6131 - chat
0.5779 - cat
0.5504 - chien
Nearest neighbors of "dog":
0.7998 - chien
0.7052 - chiens
0.6259 - chienne
Nearest neighbors of "human":
0.7290 - humain
0.7280 - humains
0.6391 - humaine
Nearest neighbors of "student":
0.7106 - etudiante
0.6421 - etudiantes
0.5901 - professeurs
Nearest neighbors of "computer":
0.7823 - informatique
0.7696 - ordinateur
0.7398 - ordinateurs
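
The scores printed above are cosine similarities between a source-language vector and every target-language vector. A minimal sketch of the same ranking on toy data follows; the function name `nearest_neighbors` and the 2-d vectors are made up for illustration and are not part of MUSE.

```python
import numpy as np

def nearest_neighbors(query_vec, target_matrix, k=2):
    """Return (row index, cosine similarity) for the k rows closest to query_vec."""
    # Normalize the rows and the query so that the dot product is the cosine similarity.
    targets = target_matrix / np.linalg.norm(target_matrix, axis=1, keepdims=True)
    query = query_vec / np.linalg.norm(query_vec)
    scores = targets.dot(query)
    # argsort is ascending, so take the last k entries and reverse them.
    best = scores.argsort()[-k:][::-1]
    return [(int(i), float(scores[i])) for i in best]

# Toy 2-d "embeddings" (made-up numbers, not MUSE vectors).
toy_targets = np.array([[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([1.0, 0.0])
neighbors = nearest_neighbors(toy_query, toy_targets, k=2)
print(neighbors)
```

Because both sides are normalized, vector length does not affect the ranking, only direction does.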


The Lang class handles language processing.

SOS_token is the start-of-sentence identifier.

EOS_token is the end-of-sentence identifier.

In [14]:
SOS_token = 0
EOS_token = 1


class Lang:
    def __init__(self, name, embedding_tuple):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = [ "SOS", "EOS"]
        self.embedding_tuple = embedding_tuple    
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:   
            if word in self.embedding_tuple[2]:
                self.word2index[word] = self.n_words
                self.word2count[word] = 1
                self.index2word.append(word)
                self.n_words += 1
        else:
            self.word2count[word] += 1
            
    def get_matrix(self):
        """
        Build the word -> vector matrix for every word seen in the text.
        The start-of-sentence vector is all zeros and the end-of-sentence
        vector is all ones (a random vector could be used instead).
        """
        dim = self.embedding_tuple[0].shape[1]
        matrix = np.zeros((self.n_words, dim))
        matrix[SOS_token] = np.zeros(dim)
        matrix[EOS_token] = np.ones(dim)
        # Rows 0 and 1 are reserved, so real words start at row 2.
        for row, word in enumerate(self.index2word[2:], start=2):
            word_id = self.embedding_tuple[2][word]
            matrix[row] = self.embedding_tuple[0][word_id]
        return matrix
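
The vocabulary logic of `Lang` can be summarized as: a word gets an index only if a pretrained vector exists for it, and words without a vector are skipped. The sketch below reproduces that rule in a standalone function; `build_vocab` and the toy embedding vocabulary are hypothetical names introduced only for this illustration.

```python
def build_vocab(sentences, embedding_vocab):
    """Index only the words that have a pretrained vector; count every indexed occurrence."""
    index2word = ["SOS", "EOS"]  # ids 0 and 1 are reserved
    word2index, word2count = {}, {}
    for sentence in sentences:
        for word in sentence.split(' '):
            if word in word2index:
                word2count[word] += 1
            elif word in embedding_vocab:
                word2index[word] = len(index2word)
                word2count[word] = 1
                index2word.append(word)
            # words without a pretrained vector are silently skipped
    return index2word, word2index, word2count

toy_embedding_vocab = {"a", "cat", "sleeps"}  # hypothetical embedding coverage
index2word, word2index, word2count = build_vocab(
    ["a cat sleeps", "a zzz cat"], toy_embedding_vocab)
print(index2word, word2count)
```

One consequence of this design is that out-of-vocabulary words simply disappear from the encoded sentences, so encoded sequences can be shorter than the original text.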

In [15]:
# A sentence is represented as a list of word identifiers.
# To fit several sentences into one matrix, each one is padded with end-of-sentence tokens.
def pad_seq(seq, length):
    # Note: extends `seq` in place and returns the same list.
    seq += [EOS_token] * (length - len(seq))
    return seq
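
As a small illustration of the padding step, here is a non-mutating variant applied to a toy batch. The helper name `padded_copy` and the word ids are assumptions made for this example; the notebook itself uses the in-place `pad_seq`.

```python
EOS_token = 1

def padded_copy(seq, length, pad_value=EOS_token):
    """Return a new list padded to `length`; the input list is left untouched."""
    return seq + [pad_value] * (length - len(seq))

batch = [[5, 7, 9], [4], [8, 3]]            # made-up word ids
max_len = max(len(s) for s in batch)
padded = [padded_copy(s, max_len) for s in batch]
print(padded)
```

Returning a copy avoids surprises if the same index list is reused elsewhere, at the cost of one extra allocation per sentence.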

Utility functions for reading the text.

From here on, to save memory, we keep only sentences with fewer than 10 tokens (the filter below uses a strict `< MAX_LENGTH` comparison).

In [16]:
def readLangs(lang1, lang2, emb1, emb2,  prefix):
    print("Reading lines...")

    # Read each file and split it into lines
    with codecs.open(prefix + lang1, encoding='utf-8') as f:
        lines1 = f.read().strip().split('\n')
    with codecs.open(prefix + lang2, encoding='utf-8') as f:
        lines2 = f.read().strip().split('\n')
    # Normalize every line
    lines1 = [normalizeString(s) for s in lines1]
    lines2 = [normalizeString(s) for s in lines2]
   
    input_lang = Lang(lang1, emb1)
    output_lang = Lang(lang2, emb2)

    return input_lang, output_lang, lines1, lines2

In [17]:
MAX_LENGTH = 10


def filter_line(line):
    return len(line.split(' ')) < MAX_LENGTH
    
def filter_lines(lines):
    return [line for line in lines if filter_line(line)]
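
A quick check of the boundary behaviour of this filter, written as a standalone sketch (the name `is_short` is introduced only for this example): with a strict comparison, a sentence of exactly `MAX_LENGTH` tokens is dropped.

```python
MAX_LENGTH = 10

def is_short(line, max_length=MAX_LENGTH):
    # Strict inequality: a sentence with exactly max_length tokens is rejected.
    return len(line.split(' ')) < max_length

nine = ' '.join(['w'] * 9)
ten = ' '.join(['w'] * 10)
kept = [s for s in (nine, ten) if is_short(s)]
print(len(kept))
```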

In [18]:
def prepareData(lang1, lang2, emb1, emb2,  prefix):
    """
    Returns two language objects, the training sentences,
    and pairs of parallel sentences for intermediate validation of the result
    """
    input_lang, output_lang, lines1, lines2 = readLangs(lang1, lang2,  emb1, emb2, prefix)
    print("Read %s sentence pairs" % len(lines1))
    pairs = [(l1, l2) for l1, l2 in zip(lines1, lines2) if filter_line(l1) and filter_line(l2)]
    
    # By the design of the experiment we have no parallel sentences, so shuffle each side independently to keep it clean.
    np.random.shuffle(lines1)
    np.random.shuffle(lines2)

    lines1 = filter_lines(lines1)
    lines2 = filter_lines(lines2)
    
    min_lines = min(len(lines1), len(lines2))
    lines1, lines2 = lines1[:min_lines], lines2[:min_lines]
    
    print("Trimmed to %s sentence pairs" % min_lines)
    print("Counting words...")
    for l1, l2 in zip(lines1, lines2):
        input_lang.addSentence(l1)
        output_lang.addSentence(l2)
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, lines1, lines2,  pairs


input_lang, output_lang, lines1, lines2, pairs = prepareData('fr', 'en', fr_embedding_tuple, en_embedding_tuple, 
                                             'train.lc.norm.tok.')
# for now, simply use non-parallel pairs as the noisy translation of the sentences
tr_lines1,tr_lines2 = lines2[:], lines1[:]

Reading lines...
Read 29000 sentence pairs
Trimmed to 3510 sentence pairs
Counting words...
Counted words:
fr 2794
en 2826


Functions that encode sentences as sequences of word identifiers

In [19]:
def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ') if word in lang.word2index]


def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)


def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)


# Returns matrices of random sentences and of their noisy versions as PyTorch tensors.
# batch_size: the number of sentences.
def random_batch(batch_size):
    input_seqs = []
    target_seqs = []
    tr_input_seqs = []
    tr_target_seqs = []
    
    # Choose random pairs
    for i in range(batch_size):
        id1 = random.choice(range(len(lines1)))
        id2 = random.choice(range(len(lines2)))
        line1 = lines1[id1]
        line2 = lines2[id2]
        tr_line1 = tr_lines1[id1]
        tr_line2 = tr_lines2[id2]
        
        
        input_seqs.append(indexesFromSentence(input_lang, line1))
        target_seqs.append(indexesFromSentence(output_lang, line2))
        
        tr_input_seqs.append(indexesFromSentence(output_lang, tr_line1))
        tr_target_seqs.append(indexesFromSentence(input_lang, tr_line2))
        
        
    input_length = max([len(s) for s in input_seqs])
    target_length = max([len(s) for s in target_seqs])
    tr_input_length = max([len(s) for s in tr_input_seqs])
    tr_target_length = max([len(s) for s in tr_target_seqs])
    
    
    # Pad input and target sequences with EOS_token up to the longest sequence in the batch
    input_padded = [pad_seq(s, input_length) for s in input_seqs]    
    target_padded = [pad_seq(s, target_length) for s in target_seqs]
    
    tr_input_padded = [pad_seq(s, tr_input_length) for s in tr_input_seqs]    
    tr_target_padded = [pad_seq(s, tr_target_length) for s in tr_target_seqs]

    
    # Turn padded arrays into (batch_size x max_len) tensors, transpose into (max_len x batch_size)
    input_var = torch.tensor(input_padded, dtype=torch.long,device=device).transpose(0, 1)
    target_var = torch.tensor(target_padded, dtype=torch.long,device=device).transpose(0, 1)
    tr_input_var = torch.tensor(tr_input_padded, dtype=torch.long,device=device).transpose(0, 1)
    tr_target_var = torch.tensor(tr_target_padded, dtype=torch.long,device=device).transpose(0, 1)
    
    
    return input_var, target_var, tr_input_var, tr_target_var

In [20]:
batch = random_batch(30)
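
Each tensor returned by `random_batch` is laid out as (max_len x batch_size), so that indexing the first dimension yields one time step for the whole batch. A minimal NumPy sketch of that transposition, with made-up word ids:

```python
import numpy as np

EOS_token = 1
# Three already-padded sentences (word ids are made up).
padded = [[5, 7, 9, 2],
          [4, 6, EOS_token, EOS_token],
          [8, 3, EOS_token, EOS_token]]

batch_major = np.array(padded)   # shape (batch_size, max_len)
time_major = batch_major.T       # shape (max_len, batch_size)

# Row t of the time-major matrix holds the t-th token of every sentence.
first_step = time_major[0].tolist()
print(batch_major.shape, time_major.shape, first_step)
```

This layout is what lets the encoder loop over `input_tensor[ei]` and feed one token per sentence at each step.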

Let's build the word matrices for each language, to be used in Seq2Seq

In [21]:
fr_matrix = torch.FloatTensor(input_lang.get_matrix())


In [22]:
en_matrix = torch.FloatTensor(output_lang.get_matrix())


A simple neural network class with one hidden layer of 1000 neurons; it will be used as the discriminator.

The input dimension is 300, which matches both the word-vector dimension and the dimension of the Seq2Seq latent space.

In [23]:
class Net1(nn.Module):
    def __init__(self):
        super(Net1, self).__init__()
        self.fc1 = nn.Linear(300,1000)
        self.fc2 = nn.Linear(1000, 1)
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        # torch.sigmoid replaces the deprecated torch.nn.functional.sigmoid
        y = torch.sigmoid(self.fc2(x))
        
        return y
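
To make the shapes concrete, here is a NumPy sketch of the same 300 -> 1000 -> 1 forward pass. The weights are random stand-ins rather than trained parameters, and `discriminator_forward` is a name chosen only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Randomly initialized weights stand in for trained parameters.
W1, b1 = rng.normal(scale=0.01, size=(300, 1000)), np.zeros(1000)
W2, b2 = rng.normal(scale=0.01, size=(1000, 1)), np.zeros(1)

def discriminator_forward(x):
    hidden = np.maximum(x @ W1 + b1, 0.0)             # ReLU
    return 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))  # sigmoid -> probability

latent = rng.normal(size=(4, 300))   # four fake 300-d encoder states
probs = discriminator_forward(latent)
print(probs.shape)
```

Each row of the output is a single probability, which is why the network can be trained with a binary cross-entropy loss.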

The Encoder class.

In [24]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, matrix, gru = None):
        """
        gru: if not None, reuse this ready-made GRU module instead of creating a new one.
        This makes it possible to share one GRU between two encoders for different languages.
        """
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        # from_pretrained freezes the embedding weights by default (freeze=True)
        self.embedding = nn.Embedding.from_pretrained(matrix, freeze=True)

        if gru is None:
            self.gru = nn.GRU(hidden_size, hidden_size)
        else:
            self.gru = gru
            

    def forward(self, input, batch_size, hidden):
        embedded = self.embedding(input).view(1, batch_size, self.hidden_size)
        output = embedded
        
        output, hidden = self.gru(output, hidden)
        return output, hidden
    

    def initHidden(self, batch_size):
        return torch.zeros(1, batch_size, self.hidden_size, device=device)

Decoder.

Important: unlike the paper, I did not use attention.

Attention could be taken from the official PyTorch seq2seq tutorial, but there it processes one sentence per operation.
It is better to search for "seq2seq pytorch batch".
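
For orientation only, here is a NumPy sketch of what a batched attention step could look like. It uses plain dot-product scoring, which is an assumption of this example and not necessarily the variant used in the paper or in the PyTorch tutorial; the function name and the shapes are hypothetical, and the notebook's decoder does not use it.

```python
import numpy as np

def batched_dot_attention(encoder_outputs, decoder_hidden):
    """encoder_outputs: (T, B, H); decoder_hidden: (B, H) -> context (B, H), weights (T, B)."""
    # One score per (time step, sentence): dot product of hidden states.
    scores = np.einsum('tbh,bh->tb', encoder_outputs, decoder_hidden)
    scores = scores - scores.max(axis=0, keepdims=True)   # numerical stability
    exp_scores = np.exp(scores)
    weights = exp_scores / exp_scores.sum(axis=0, keepdims=True)  # softmax over time
    # Weighted sum of encoder states for every sentence in the batch.
    context = np.einsum('tb,tbh->bh', weights, encoder_outputs)
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 3, 8))   # 5 time steps, batch of 3, hidden size 8
dec = rng.normal(size=(3, 8))
context, weights = batched_dot_attention(enc, dec)
print(context.shape, weights.shape)
```

The point of the batched form is that all sentences are scored in one operation instead of looping over them one at a time.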

In [25]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, matrix, gru = None):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size

        # from_pretrained freezes the embedding weights by default (freeze=True)
        self.embedding = nn.Embedding.from_pretrained(matrix, freeze=True)
        if gru is None:
            self.gru = nn.GRU(hidden_size, hidden_size)
        else:
            self.gru = gru
            
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden, batch_size):
        output = self.embedding(input).view(1, batch_size, self.hidden_size)
        
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden
    
    def initHidden(self, batch_size):
        return torch.zeros(1, batch_size, self.hidden_size, device=device)

Functions that perform one optimization step.

teacher_forcing_ratio is the probability that, during decoding, the decoder is fed the reference words as a hint instead of the words it produced itself at the previous steps.
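
The effect of teacher forcing can be seen on a toy decoder. In the sketch below, `decode` and `predict_next` are hypothetical stand-ins (the "decoder" just predicts its input plus one); as in the notebook, the choice is made once per sequence rather than per step.

```python
import random

def decode(targets, predict_next, teacher_forcing_ratio, rng, sos=0):
    """Run a toy decoder; return its predictions and whether teacher forcing was used."""
    # The choice is made once for the whole sequence.
    use_teacher_forcing = rng.random() < teacher_forcing_ratio
    decoder_input, outputs = sos, []
    for target in targets:
        prediction = predict_next(decoder_input)
        outputs.append(prediction)
        # Next input: the reference token (hint) or the decoder's own prediction.
        decoder_input = target if use_teacher_forcing else prediction
    return outputs, use_teacher_forcing

# Hypothetical stand-in for a decoder step: it just predicts "input + 1".
predict_next = lambda token: token + 1
targets = [10, 20, 30]

forced, _ = decode(targets, predict_next, 1.0, random.Random(0))
free, _ = decode(targets, predict_next, 0.0, random.Random(0))
print(forced, free)
```

With the ratio at 1.0 every step is conditioned on the reference, while at 0.0 the decoder's own outputs are fed back in.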

In [26]:
teacher_forcing_ratio = 0.5

def encode_decode(input_tensor,  encoder, decoder, target_tensor=None):
    # Encodes the input and then decodes it.
    # input_tensor, target_tensor: matrices of shape
    #    <maximum sentence length in the batch> x <number of sentences in the batch>
    # Returns the list of per-step decoder outputs (log-probabilities over the vocabulary)
    # and the final hidden vector of the encoder.
    batch_size = input_tensor.size(1)
    encoder_hidden = encoder.initHidden(batch_size)
    
    input_length = input_tensor.size(0)
    if target_tensor is None:
        target_tensor = input_tensor
    
    target_length = target_tensor.size(0)

    encoder_outputs = torch.zeros(input_length, batch_size, encoder.hidden_size, device=device)

    for ei in range(input_length):
    
        encoder_output, encoder_hidden = encoder(input_tensor[ei], batch_size, encoder_hidden)
        encoder_outputs[ei] = encoder_output[0]
    
    
    
    decoder_input = torch.tensor([[SOS_token]*batch_size], device=device)
    decoder_hidden = encoder_hidden
    
    use_teacher_forcing = random.random() < teacher_forcing_ratio

    outputs = []
    for di in range(target_length):
        decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden, batch_size)        
        outputs.append(decoder_output)        
        
        if use_teacher_forcing:
            decoder_input = target_tensor[di]  # Teacher forcing            
        else:
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # detach from history as input
            #if decoder_input == EOS_token:
            #    break
    return outputs,  encoder_hidden


def train(source_tensor, target_tensor, translated_source_tensor, translated_target_tensor, 
          encoder_source, encoder_target,
          decoder_source, decoder_target, discriminator,  
          optimizer,   discriminator_optimizer, 
          criterion, cross_entropy, ae_coef = 1.0, translate_coef = 0.0, disc_coef = 1.0,   max_length=MAX_LENGTH):
    """
    One optimization step.
    source_tensor: matrix of sentences in the first language
    target_tensor: matrix of sentences in the second language
    translated_source_tensor: noisy translation of the first language
    translated_target_tensor: noisy translation of the second language
    encoder_source, decoder_source, encoder_target, decoder_target: the seq2seq encoders and decoders
    discriminator: the discriminator network
    optimizer, discriminator_optimizer: optimizers for Seq2Seq and for the discriminator
    criterion: loss function for Seq2Seq; takes the decoder's output matrix and the reference sentences
    cross_entropy: loss function for the discriminator
    """
    
    batch_size = source_tensor.size(1)
    
    # autoencoders
    source_to_source, source_hidden = encode_decode(source_tensor, encoder_source, decoder_source)
    target_to_target, target_hidden = encode_decode(target_tensor, encoder_target, decoder_target)

    # reset the gradients of the Seq2Seq loss
    optimizer.zero_grad()

    loss = 0

    # reconstruction loss of the first autoencoder
    for di in range(len(source_to_source)):
        decoder_output = source_to_source[di]
        loss += criterion(decoder_output, source_tensor[di])*ae_coef

    # reconstruction loss of the second autoencoder
    for di in range(len(target_to_target)):
        decoder_output = target_to_target[di]
        loss += criterion(decoder_output, target_tensor[di])*ae_coef

    # objects from the FIRST language belong to class "1",
    # objects from the SECOND language belong to class "0"
    classes = torch.tensor([1.0]*batch_size + [0.0]*batch_size, dtype=torch.float, device=device)

    # Get the discriminator's predictions. The hidden states are stacked in the order
    # [target, source], so each prediction is paired with the WRONG class below.
    # view(-1) flattens the output so that its shape matches `classes`.
    source_hidden_predict = discriminator(torch.stack([target_hidden, source_hidden])).view(-1)
    # At this stage we want to fool the discriminator, so we minimize its share of correct answers,
    # i.e. minimize the loss between the discriminator's answers and the WRONG classes.
    loss += cross_entropy(source_hidden_predict, classes)*disc_coef
    
    # as with the autoencoders, compute the loss on the noisy translation
    # (these terms are weighted by ae_coef; translate_coef only switches them on)
    if translate_coef > 0.0:
        translated_source_to_source, _ = encode_decode(translated_source_tensor, encoder_target, decoder_source, target_tensor=source_tensor)
        translated_target_to_target, _ = encode_decode(translated_target_tensor, encoder_source, decoder_target, target_tensor=target_tensor)
        
        for di in range(min(len(translated_source_to_source), len(source_tensor))):
            decoder_output = translated_source_to_source[di]
            loss += criterion(decoder_output, source_tensor[di])*ae_coef

        for di in range(min(len(translated_target_to_target), len(target_tensor))):
            decoder_output = translated_target_to_target[di]
            loss += criterion(decoder_output, target_tensor[di])*ae_coef



    # backpropagate the loss
    loss.backward()

    # take an optimization step along the negative gradient
    optimizer.step()

    # debug information: average sequence length in the batch, used to normalize the reported loss
    avg_len = (len(target_tensor)+len(source_tensor))/2

    # reset the loss and the gradients for the discriminator
    d_loss = 0
    discriminator_optimizer.zero_grad()

    _, source_hidden = encode_decode(source_tensor, encoder_source, decoder_source)
    _, target_hidden = encode_decode(target_tensor, encoder_target, decoder_target)

    # now minimize the loss between the predicted and the CORRECT classes;
    # view(-1) flattens the output so that its shape matches `classes`
    source_hidden_predict = discriminator(torch.stack([source_hidden, target_hidden])).view(-1)
    d_loss += cross_entropy(source_hidden_predict, classes)

    d_loss.backward()

    discriminator_optimizer.step()
    
              
    
    return loss.item() / avg_len, d_loss.item()
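
The two adversarial steps in `train` differ only in which labels the discriminator's predictions are scored against. Below is a minimal pure-Python sketch of that idea; the probabilities and the helper `binary_cross_entropy` are made up for illustration and are not part of the notebook. Here the labels are flipped explicitly, whereas the notebook gets the same effect by stacking the hidden states in swapped order.

```python
import math

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted P(class 1) and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(probs)

batch_size = 2
# Hypothetical discriminator outputs: P(latent vector comes from language 1).
pred_source = [0.9, 0.8]  # latents of first-language sentences
pred_target = [0.2, 0.1]  # latents of second-language sentences
predictions = pred_source + pred_target

true_labels = [1.0] * batch_size + [0.0] * batch_size
wrong_labels = [0.0] * batch_size + [1.0] * batch_size

# Discriminator step: score predictions against the correct classes.
d_loss = binary_cross_entropy(predictions, true_labels)
# Encoder step: score the same predictions against the wrong classes,
# so the loss is small only when the discriminator is fooled.
enc_loss = binary_cross_entropy(predictions, wrong_labels)

print(d_loss, enc_loss)
```

With a discriminator this confident, `d_loss` is small and `enc_loss` is large, which is the signal that pushes the encoders toward language-independent hidden states.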

In [27]:
import time

def trainIters(encoder_source,encoder_target,  decoder_source,  decoder_target,discriminator, 
               n_iters, print_every=1000,  learning_rate=0.001):
    # top-level optimization procedure
    
    
    global tr_lines1, tr_lines2
    
    optimizer = optim.Adam(list(encoder_source.parameters()) +  list(decoder_source.parameters())+\
                            list(encoder_target.parameters())+  list(decoder_target.parameters()),
                           lr=learning_rate)
    
    optimizer2 = optim.Adam(discriminator.parameters(),
                           lr=learning_rate)
    
    
    criterion = nn.NLLLoss()
    criterion2 = nn.BCELoss()
    print_loss_total = [0, 0]

    for iter in range(1, n_iters + 1):    
        batch = random_batch(25)
        
        # Trust in the noisy translation grows during optimization.
        # No translation is available at first, so the coefficient starts at zero
        # and switches to 1.0/n_iters once `print_every` iterations have passed.
        if iter < print_every:
            tr_coef = 0.0
        else:
            tr_coef = 1.0/n_iters
        loss = train(batch[0], batch[1], batch[2], batch[3],  encoder_source,encoder_target,  decoder_source,  decoder_target,
                     discriminator, optimizer, optimizer2, criterion, criterion2, translate_coef = tr_coef)
        print_loss_total[0] += loss[0]
        print_loss_total[1] += loss[1]
        

        if iter % print_every == 0:
            # every `print_every` iterations: print translation examples and the average loss,
            # then build a new noisy translation.
            print_loss_avg0 = print_loss_total[0] / print_every
            print_loss_avg1 = print_loss_total[1] / print_every
            print_loss_total = [0,0]
            print (iter, print_loss_avg0, print_loss_avg1)
            print ('_'*10)
            evaluateRandomly(encoder_source, decoder_target,n=3)
            print ('_'*10)
            evaluateRandomly(encoder_source, decoder_source,langid=0, n=3)
            print ('_'*10)
            evaluateRandomly(encoder_target, decoder_target, langid=1, n=3)
            
            tr_lines1, tr_lines2 = make_translation()
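
The schedule for the translation-loss weight used in `trainIters` can be isolated into a small function. The sketch below mirrors that logic under the name `translate_coef`, which is introduced here for illustration only. `linear_translate_coef` is a hypothetical alternative, not something the notebook does: it ramps the weight up gradually after the warm-up instead of using a constant.

```python
def translate_coef(iteration, n_iters, warmup):
    """Weight of the noisy-translation loss at a 1-based iteration.

    Mirrors trainIters: zero during the warm-up (no translation exists yet),
    then the constant 1.0 / n_iters.
    """
    if iteration < warmup:
        return 0.0
    return 1.0 / n_iters


def linear_translate_coef(iteration, n_iters, warmup):
    """Hypothetical alternative: ramp linearly from 0 to 1 after the warm-up."""
    if iteration < warmup:
        return 0.0
    return (iteration - warmup) / float(max(1, n_iters - warmup))


for it in (1, 5000, 50000, 100000):
    print(it, translate_coef(it, 100000, 5000), linear_translate_coef(it, 100000, 5000))
```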

In [28]:
def make_translation():
    """
    Build the noisy translation.
    Could be sped up substantially by translating in batches.
    """
    max_length = MAX_LENGTH
    tr_lines1, tr_lines2 = [],[]
    id = 0
    for line in lines1:
        id+=1
        if id%1000 == 0:
            print ('translating source', id)
        input_tensor = tensorFromSentence(input_lang, line)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder_source.initHidden(1)

        encoder_outputs = torch.zeros(max_length, encoder_source.hidden_size, device=device)

        
        for ei in range(min(MAX_LENGTH, input_length)):
            encoder_output, encoder_hidden = encoder_source(input_tensor[ei],1,
                                                     encoder_hidden)
            
            encoder_outputs[ei] += encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        decoder_hidden = encoder_hidden

        decoded_words = []
        
        for di in range(max_length):
            decoder_output, decoder_hidden = decoder_target(
                decoder_input, decoder_hidden, 1)
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:                
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])

            decoder_input = topi.squeeze().detach()
        tr_lines1.append(u' '.join(decoded_words))
    
    # the code duplication could be removed; haven't gotten around to it (note by Oleg)
    id  = 0 
    for line in lines2:
        id+=1
        if id%1000 == 0:
            print ('translation target', id)
        input_tensor = tensorFromSentence(output_lang, line)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder_target.initHidden(1)

        encoder_outputs = torch.zeros(max_length, encoder_target.hidden_size, device=device)

        
        for ei in range(min(MAX_LENGTH, input_length)):
            encoder_output, encoder_hidden = encoder_target(input_tensor[ei],1,
                                                     encoder_hidden)
            
            encoder_outputs[ei] += encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        decoder_hidden = encoder_hidden

        decoded_words = []
        
        for di in range(max_length):
            decoder_output, decoder_hidden = decoder_source(
                decoder_input, decoder_hidden, 1)
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                
                break
            else:
                decoded_words.append(input_lang.index2word[topi.item()])

            decoder_input = topi.squeeze().detach()
        tr_lines2.append(u' '.join(decoded_words))
        
    
    return tr_lines1, tr_lines2
#tr_lines1, tr_lines2 = make_translation()
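
The two loops in `make_translation` share the same greedy decoding logic and differ only in the encoder, decoder and vocabulary they use. One way to remove the duplication is to factor that loop into a helper. The sketch below is framework-free: `greedy_decode`, `step` and `toy_step` are hypothetical names, and `step(token, hidden)` stands in for a call to the notebook's decoder that returns per-token scores and the new hidden state.

```python
def greedy_decode(step, hidden, sos_token, eos_token, max_length):
    """Greedy decoding: feed the best-scoring token back in until EOS or max_length.

    `step(token, hidden)` must return (scores, new_hidden), where `scores`
    is a sequence with one score per vocabulary index.
    """
    token = sos_token
    decoded = []
    for _ in range(max_length):
        scores, hidden = step(token, hidden)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if token == eos_token:
            break
        decoded.append(token)
    return decoded


def toy_step(token, hidden):
    """Toy decoder: emits a fixed script of token ids; `hidden` is the position."""
    script = [3, 4, 2, 1]  # 1 plays the role of EOS
    scores = [0.0] * 5
    scores[script[hidden]] = 1.0
    return scores, hidden + 1


print(greedy_decode(toy_step, 0, sos_token=0, eos_token=1, max_length=10))  # [3, 4, 2]
```

With such a helper, each direction of `make_translation` would reduce to encoding the sentence, calling the shared decoding loop, and mapping the indices back to words.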

In [29]:
def evaluate(encoder, decoder, sentence, encoded_lang,  decoder_lang, max_length=MAX_LENGTH):
    """
    Intermediate validation of the translation.
    """
    with torch.no_grad():
        input_tensor = tensorFromSentence(encoded_lang, sentence)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden(1)

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(min(MAX_LENGTH, input_length)):
            encoder_output, encoder_hidden = encoder(input_tensor[ei],1,
                                                     encoder_hidden)
            encoder_outputs[ei] += encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        decoder_hidden = encoder_hidden

        decoded_words = []
        
        for di in range(max_length):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, 1)
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            else:
                decoded_words.append(decoder_lang.index2word[topi.item()])

            decoder_input = topi.squeeze().detach()

        return decoded_words
    
def evaluateRandomly(encoder, decoder, langid = -1,  n=10):
    """
    Take n sentences and inspect the quality on them.
    If langid == -1, inspect translation quality.
    If langid == 0 or langid == 1, inspect the autoencoder's reconstruction quality.
    """
    for i in range(n):
        if langid == -1:
            id0 = 0
            id1 = 1
            enc_lang= input_lang
            dec_lang = output_lang
        else:            
            id0 = langid
            if langid == 0:
                enc_lang = input_lang
                dec_lang = input_lang
            else:
                enc_lang = output_lang
                dec_lang = output_lang
        if langid==-1:
                
            pair = random.choice(pairs)
        else:
            pair = [random.choice(lines1), random.choice(lines2)]
        print   ('>', pair[id0])
        if langid==-1:
            print ('=', pair[id1])
        
            
        output_words = evaluate(encoder, decoder, pair[id0], enc_lang, dec_lang)
        output_sentence = ' '.join(output_words)
        print ('<', output_sentence, '\n')
        
    

Create the networks and start training.

For reference, below is an example of the validation output from the last 10000 iterations.

The translation works, though not always correctly.

Sentence reconstruction, by contrast, is almost perfect.

Adding noise to the autoencoder (as in the paper) should improve the situation, since it prevents the autoencoder from simply copying its input.
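
A minimal sketch of the kind of input noise the paper describes: random word dropping followed by a local shuffle in which no word moves more than `k` positions. The function name `add_noise`, the `rng` parameter and the default values `p_drop=0.1` and `k=3` are assumptions made for this illustration; none of this code is part of the notebook.

```python
import random

def add_noise(words, p_drop=0.1, k=3, rng=random):
    """Return a noisy copy of `words`: word dropout plus a local shuffle.

    Each word is dropped with probability `p_drop`. The survivors are sorted
    by `index + U(0, k + 1)`, which keeps every word within `k` positions
    of where it started.
    """
    kept = [w for w in words if rng.random() >= p_drop]
    if not kept and words:
        kept = [rng.choice(words)]  # never return an empty sentence
    keys = [i + rng.uniform(0, k + 1) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=keys.__getitem__)
    return [kept[i] for i in order]


rng = random.Random(0)  # fixed seed so the example is repeatable
sentence = "a man in a blue shirt is standing on a ladder".split()
print(' '.join(add_noise(sentence, rng=rng)))
```

In training, the noisy sentence would be fed to the encoder while the loss is still computed against the clean sentence, so the autoencoder has to denoise its input.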

In [30]:

hidden_size = 300

encoder_source = EncoderRNN(input_lang.n_words, hidden_size, fr_matrix).to(device)
encoder_target = EncoderRNN(output_lang.n_words, hidden_size, en_matrix, gru = encoder_source.gru).to(device)

decoder_source = DecoderRNN(hidden_size, input_lang.n_words, fr_matrix).to(device)
decoder_target = DecoderRNN(hidden_size, output_lang.n_words, en_matrix, gru = decoder_source.gru).to(device)

#disc = Net1()
#disc.cuda()


In [32]:
disc = Net1()
# was disc.cuda(), which fails on a CPU-only build with
# "AssertionError: Torch not compiled with CUDA enabled"
disc.to(device)
trainIters(encoder_source,encoder_target,  decoder_source,  decoder_target, disc, 100000, print_every=5000)

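Save the models (only the GRU weights of the encoder and the decoder, which are shared between the two languages, are stored)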

In [None]:
torch.save(encoder_source.gru.state_dict(), 'monolingual_seq2seq_fr_en_enc')
torch.save(decoder_source.gru.state_dict(), 'monolingual_seq2seq_fr_en_dec')