# Bilingual Book Generator

### **Goal:** generate bilingual ebooks in which a few sentences in language A are followed by the corresponding sentences in language B, and so on. The program does not translate; it relies on input text provided in both language A and language B.

*Issues faced:* in literary translation the sentence structure often gets adjusted. A sentence may be split into several sentences, the order of sentences may change, paragraphs may be divided into several paragraphs, and so on. The bilingual book generator therefore has to recognize which sentences belong together.

*Approach:* I will try to use NLP techniques to match sentences. First I shall see whether a simple encoder-based sentence distance measurement works. If not, I shall look into neural-network-based techniques.

For the research phase I shall generate a German-English bilingual book from Hermann Hesse's Siddhartha. Both texts are available for free at gutenberg.org:

https://www.gutenberg.org/ebooks/2500

https://www.gutenberg.org/ebooks/2499

In [None]:
# Downloading the plain text versions of the books
!wget https://www.gutenberg.org/cache/epub/2500/pg2500.txt
!wget https://www.gutenberg.org/cache/epub/2499/pg2499.txt

In [2]:
# Reading the texts into python variables
textGerman = textEnglish = None

with open("pg2499.txt", "r") as fileGerman:
  textGerman = fileGerman.read()

with open("pg2500.txt", "r") as fileEnglish:
  textEnglish = fileEnglish.read()

len(textGerman), len(textEnglish)

(232866, 235447)

Finding pretrained embeddings for German and English: so-called cross-lingual word embedding models, or the Multilingual Universal Sentence Encoder. Read more at https://www.tensorflow.org/hub/tutorials/cross_lingual_similarity_with_tf_hub_multilingual_universal_encoder

I shall use the universal-sentence-encoder-multilingual embedding. It works with 16 languages: Arabic, Chinese-simplified, Chinese-traditional, English, French, German, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish, Thai, Turkish, Russian.

https://tfhub.dev/google/universal-sentence-encoder-multilingual/3

In [None]:
!pip install tensorflow_text=="2.7.0"

In [8]:
import tensorflow as tf
import tensorflow_hub as hub
# Importing tensorflow_text registers the SentencePiece ops the encoder needs
from tensorflow_text import SentencepieceTokenizer

module_url = 'https://tfhub.dev/google/universal-sentence-encoder-multilingual/3'
embedding_model = hub.load(module_url)

def embedSentences(sentences):
  # Returns one embedding vector per input sentence
  return embedding_model(sentences)

In [9]:
sampleSentencesGerman = ["Im Schatten des Hauses, in der Sonne des Flußufers bei den Booten, im Schatten des Salwaldes, im Schatten des Feigenbaumes wuchs Siddhartha auf, der schöne Sohn des Brahmanen, der junge Falke, zusammen mit Govinda, seinem Freunde, dem Brahmanensohn.", "Liebe rührte sich in den Herzen der jungen Brahmanentöchter, wenn Siddhartha durch die Gassen der Stadt ging, mit der leuchtenden Stirn, mit dem Königsauge, mit den schmalen Hüften."]
sampleEmbeddingsGerman = embedSentences(sampleSentencesGerman)

sampleSentencesEnglish = ["In the shade of the house, in the sunshine of the riverbank near the boats, in the shade of the Sal-wood forest, in the shade of the fig tree is where Siddhartha grew up, the handsome son of the Brahman, the young falcon, together with his friend Govinda, son of a Brahman.", "Love touched the hearts of the Brahmans’ young daughters when Siddhartha walked through the lanes of the town with the luminous forehead, with the eye of a king, with his slim hips."]
sampleEmbeddingsEnglish = embedSentences(sampleSentencesEnglish)

sampleEmbeddingsGerman.shape

TensorShape([2, 512])

In [10]:
# Measure the Euclidean distance between two embeddings
def embedDistance(embedA, embedB):
  distance = tf.sqrt(tf.reduce_sum(tf.square(embedA - embedB)))
  return distance.numpy()

In [11]:
[embedDistance(sampleEmbeddingsGerman[0], sampleEmbeddingsEnglish[0]), 
embedDistance(sampleEmbeddingsGerman[1], sampleEmbeddingsEnglish[1]),
embedDistance(sampleEmbeddingsGerman[0], sampleEmbeddingsEnglish[1]),
embedDistance(sampleEmbeddingsGerman[1], sampleEmbeddingsEnglish[0])]

[0.5691226, 0.5534523, 0.8985048, 0.9359911]

The matching sentences (distances of about 0.55 to 0.57) are much closer to each other than the non-matching ones (about 0.90 to 0.94). But is it good enough?
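One way to make "good enough" measurable is to check, for every sentence in one language, whether its nearest sentence in the other language sits at the same index. The sketch below does this in plain Python on made-up 3-dimensional vectors standing in for the 512-dimensional embeddings. `nearestNeighbourAccuracy` and the toy data are illustrative assumptions, not part of the notebook.

```python
import math

def euclidean(vectorA, vectorB):
    # Same measure as embedDistance, written without TensorFlow
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vectorA, vectorB)))

def nearestNeighbourAccuracy(embeddingsA, embeddingsB):
    """Share of rows in A whose closest row in B has the same index."""
    hits = 0
    for indexA, vectorA in enumerate(embeddingsA):
        distances = [euclidean(vectorA, vectorB) for vectorB in embeddingsB]
        if distances.index(min(distances)) == indexA:
            hits += 1
    return hits / len(embeddingsA)

# Made-up toy vectors: row i of toyA is meant to be the translation of row i of toyB
toyA = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toyB = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]

print(nearestNeighbourAccuracy(toyA, toyB))
```

On real data this would need a hand-aligned sample of sentence pairs, since the full texts are not aligned index by index.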

# Text Processing

Starting the text processing. We shall see how well the sentence pairing works near the end of chapters, where errors made earlier in the chapter may have accumulated.

During the research phase we can work with only the first part of the text by lowering `textRatio` (for example to 0.05); with the value 1 set below, the full text is processed. From now on, instead of referring to specific languages in code, German will be referred to as A and English as B. This will help generalize the code later.

In [12]:
textRatio = 1 # 0.05

textA = textGerman[:int(len(textGerman) * textRatio)]
textB = textEnglish[:int(len(textEnglish) * textRatio)]

In [13]:
import re

def removeFormatting(text):
  cleanedText = text.replace("\n", " ")
  # optionally drop non-ASCII characters (disabled, it would also remove umlauts)
  # cleanedText = cleanedText.encode('ascii','ignore').decode("ascii")
  cleanedText = cleanedText.strip()
  return cleanedText

# Embed text
def embedText(text):
  # Separate text into sentences (raw string, so the backslashes reach the regex engine unchanged)
  sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)', text)
  # filter out non-sentences (the capturing group also returns the separators)
  filtered = filter(lambda sentence: sentence not in ["", " ", "\n"], sentences)
  # Remove linebreaks from text
  filtered = list(map(removeFormatting, filtered))
  # Embed sentences
  return embedSentences(filtered), filtered

To identify matching sequences of text, I shall try out methods used in genetics for matching gene sequences. This family of algorithms is called sequence alignment algorithms.
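To illustrate the idea, here is a minimal sketch of a global alignment in the style of Needleman-Wunsch, one well-known member of that family, applied to a matrix of sentence distances. It is not the method used in the cells below. The function name `alignSequences`, the parameters `matchBonus` and `gapPenalty`, and the toy distances are assumptions chosen for illustration.

```python
def alignSequences(distances, matchBonus=0.7, gapPenalty=0.1):
    """Needleman-Wunsch style global alignment over a distance matrix.

    distances[i][j] is the distance between sentence i of A and sentence j of B.
    Returns a list of (indexA, indexB) pairs in increasing order.
    """
    sizeA, sizeB = len(distances), len(distances[0])
    # score[i][j]: best score for aligning the first i sentences of A
    # with the first j sentences of B
    score = [[0.0] * (sizeB + 1) for _ in range(sizeA + 1)]
    for i in range(1, sizeA + 1):
        score[i][0] = -gapPenalty * i
    for j in range(1, sizeB + 1):
        score[0][j] = -gapPenalty * j

    for i in range(1, sizeA + 1):
        for j in range(1, sizeB + 1):
            # Pairing is rewarded when the distance is below matchBonus
            pair = score[i - 1][j - 1] + matchBonus - distances[i - 1][j - 1]
            skipA = score[i - 1][j] - gapPenalty
            skipB = score[i][j - 1] - gapPenalty
            score[i][j] = max(pair, skipA, skipB)

    # Trace back from the bottom-right corner to recover the pairs
    pairs = []
    i, j = sizeA, sizeB
    while i > 0 and j > 0:
        pair = score[i - 1][j - 1] + matchBonus - distances[i - 1][j - 1]
        if score[i][j] == pair:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - gapPenalty:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs

# Made-up distances: B contains one extra sentence (index 1) with no partner in A
toyDistances = [
    [0.5, 0.9, 1.0, 1.1],
    [1.0, 0.9, 0.5, 1.0],
    [1.1, 1.0, 0.9, 0.6],
]
print(alignSequences(toyDistances))
```

Unlike a greedy scan, the dynamic programme considers the whole text at once, so a single poor match early on cannot shift every later pairing.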

In [14]:
embeddedA, filteredSentencesA = embedText(textA)
embeddedB, filteredSentencesB = embedText(textB)

In [18]:
print("\n".join(filteredSentencesB[50:60]))

It had to be found, the pristine source in one’s own self, it had to be possessed! Everything else was searching, was a detour, was getting lost.
Thus were Siddhartha’s thoughts, this was his thirst, this was his suffering.
Often he spoke to himself from a Chandogya-Upanishad the words: “Truly, the name of the Brahman is satyam—verily, he who knows such a thing, will enter the heavenly world every day.” Often, it seemed near, the heavenly world, but never he had reached it completely, never he had quenched the ultimate thirst.
And among all the wise and wisest men, he knew and whose instructions he had received, among all of them there was no one, who had reached it completely, the heavenly world, who had quenched it completely, the eternal thirst.
“Govinda,” Siddhartha spoke to his friend, “Govinda, my dear, come with me under the Banyan tree, let’s practise meditation.”  They went to the Banyan tree, they sat down, Siddhartha right here, Govinda twenty paces away.
While putting himse

Trying the naive method: step through array A and, for each sentence, look for a nearby sentence in array B whose distance to it is below a threshold.

In [45]:
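import numpy as np

LOOK_AHEAD = 10        # how many B sentences to inspect per A sentence
MATCH_THRESHOLD = 0.7  # maximum distance that still counts as a match
pairingAtoB = np.full((len(embeddedA)), -1, dtype=int)
lastMatchedIndexB = -1

for indexA, embeddingA in enumerate(embeddedA):
  for i in range(LOOK_AHEAD):
    # Clamp to valid indices: without the lower bound the very first lookup
    # would be embeddedB[-1], i.e. the last sentence of B.
    # i = 0 re-checks the last matched B sentence, so one B sentence
    # can be paired with several consecutive A sentences.
    indexB = min(max(lastMatchedIndexB + i, 0), len(embeddedB) - 1)
    distance = embedDistance(embeddingA, embeddedB[indexB])

    if distance < MATCH_THRESHOLD:
      pairingAtoB[indexA] = indexB
      lastMatchedIndexB = indexB
      break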


In [46]:
# Count ratio of paired sentences
np.count_nonzero(pairingAtoB != -1) / len(pairingAtoB)

0.581056466302368

In [48]:
# Print sentence pairings
for i in range(100, 110): # len(pairingAtoB)
  if(pairingAtoB[i] != -1):
    print(filteredSentencesA[i])
    print(filteredSentencesB[pairingAtoB[i]])
    print()

Nun gehe und küsse deine Mutter, sage ihr, wohin du gehst.
Go now and kiss your mother, tell her where you are going to.

Siddhartha schwankte zur Seite, als er zu gehen versuchte.
Siddhartha wavered to the side, as he tried to walk.

Als er im ersten Tageslicht langsam auf erstarrten Beinen die noch stille Stadt verließ, erhob sich bei der letzten Hütte ein Schatten, der dort gekauert war, und schloß sich an den Pilgernden an--Govinda.
As he slowly left on stiff legs in the first light of day the still quiet town, a shadow rose near the last hut, who had crouched there, and joined the pilgrim—Govinda.

"Du bist gekommen", sagte Siddhartha und lächelte.
“You have come,” said Siddhartha and smiled.

"Ich bin gekommen," sagte Govinda.
“I have come,” said Govinda.

BEI DEN SAMANAS  Am Abend dieses Tages holten sie die Asketen ein, die dürren Samanas, und boten ihnen Begleitschaft und Gehorsam an.
WITH THE SAMANAS   In the evening of this day they caught up with the ascetics, the skinny Sa

# Convert text to PDF (Failed attempt)

### FPDF library

In [49]:
%%capture
!pip install fpdf

In [50]:
from fpdf import FPDF

SENTENCES_PER_CELL = 3

pdf = FPDF()
pdf.set_font("Times", size = 15)

# generate text pairings
textPairings = []

lastUsedIndexA = -1
lastUsedIndexB = -1

for indexA in range(len(filteredSentencesA)):
  # Close a cell at the first paired sentence once the cell holds
  # more than SENTENCES_PER_CELL sentences of A
  if (indexA - lastUsedIndexA) > SENTENCES_PER_CELL and pairingAtoB[indexA] != -1:

    aSentencesInCell = filteredSentencesA[lastUsedIndexA + 1 : indexA + 1]
    bSentencesInCell = filteredSentencesB[lastUsedIndexB + 1 : pairingAtoB[indexA] + 1]
    textPairings.append((aSentencesInCell, bSentencesInCell))

    lastUsedIndexA = indexA
    lastUsedIndexB = pairingAtoB[indexA]

# Note: sentences after the last closed cell are not included in textPairings


In [52]:
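pdf.add_page()

cellCounter = 0

for pairing in textPairings:
  try:
    text = " ".join(pairing[0]) + "\n" + " ".join(pairing[1]) + "\n\n"
    # The core "Times" font only covers Latin-1, so replace other characters (e.g. curly quotes)
    text = text.encode("latin-1", "replace").decode("latin-1")
    # multi_cell wraps long text and honours line breaks; cell() prints a single line
    pdf.multi_cell(0, 10, txt = text, align = 'L')
    cellCounter = cellCounter + 1
    if cellCounter > 5:
      pdf.add_page()
      cellCounter = 0  # start counting again on the new page

  except Exception as e:
    print(pairing)
    print(e)

# pdf.output("BilingualBook.pdf")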

*Note: the resulting PDF is unsatisfactory. I will try to generate the PDF using LaTeX, which allows better control of the layout.*

# Translate from other Languages

Texts written in languages not supported by the encoder should first be translated to English and then aligned. I will use the Google Translate API for this purpose.

The text chosen is Amok by Stefan Zweig. The original German version will be paired with the Hungarian version. Both versions are available for free at the following links:

https://www.gutenberg.org/ebooks/57850

https://mek.oszk.hu/00500/00537/

The Hungarian version is not available as plain text, so the RTF version will be used.

In [None]:
!wget https://www.gutenberg.org/ebooks/57850.txt.utf-8
!wget https://mek.oszk.hu/00500/00537/00537.rtf

My attempts to convert the Hungarian RTF to plain text with the pypandoc, docx2txt and textract libraries, and to convert from HTML with BeautifulSoup, all failed.

In [None]:
!pip install pypandoc
!pip install docx2txt
# !pip install textract

In [60]:
# Try different library
import pypandoc

print(pypandoc.get_pandoc_formats())
# pypandoc.convert_file("00537.htm", "plain", format="html")

(['commonmark', 'docbook', 'docx', 'epub', 'haddock', 'html', 'json', 'latex', 'markdown', 'markdown_github', 'markdown_mmd', 'markdown_phpextra', 'markdown_strict', 'mediawiki', 'native', 'odt', 'opml', 'org', 'rst', 't2t', 'textile', 'twiki'], ['asciidoc', 'beamer', 'commonmark', 'context', 'docbook', 'docbook5', 'docx', 'dokuwiki', 'dzslides', 'epub', 'epub3', 'fb2', 'haddock', 'html', 'html5', 'icml', 'json', 'latex', 'man', 'markdown', 'markdown_github', 'markdown_mmd', 'markdown_phpextra', 'markdown_strict', 'mediawiki', 'native', 'odt', 'opendocument', 'opml', 'org', 'plain', 'revealjs', 'rst', 'rtf', 's5', 'slideous', 'slidy', 'tei', 'texinfo', 'textile', 'zimwiki'])


A working RTF-stripping function:

In [63]:
"""
Extract text in RTF Files. Refactored to use with Python 3.x
Source:
    http://stackoverflow.com/a/188877
Code created by Markus Jarderot: http://mizardx.blogspot.com
"""

import re


def striprtf(text):
   pattern = re.compile(r"\\([a-z]{1,32})(-?\d{1,10})?[ ]?|\\'([0-9a-f]{2})|\\([^a-z])|([{}])|[\r\n]+|(.)", re.I)
   # control words which specify a "destination".
   destinations = frozenset((
      'aftncn','aftnsep','aftnsepc','annotation','atnauthor','atndate','atnicn','atnid',
      'atnparent','atnref','atntime','atrfend','atrfstart','author','background',
      'bkmkend','bkmkstart','blipuid','buptim','category','colorschememapping',
      'colortbl','comment','company','creatim','datafield','datastore','defchp','defpap',
      'do','doccomm','docvar','dptxbxtext','ebcend','ebcstart','factoidname','falt',
      'fchars','ffdeftext','ffentrymcr','ffexitmcr','ffformat','ffhelptext','ffl',
      'ffname','ffstattext','field','file','filetbl','fldinst','fldrslt','fldtype',
      'fname','fontemb','fontfile','fonttbl','footer','footerf','footerl','footerr',
      'footnote','formfield','ftncn','ftnsep','ftnsepc','g','generator','gridtbl',
      'header','headerf','headerl','headerr','hl','hlfr','hlinkbase','hlloc','hlsrc',
      'hsv','htmltag','info','keycode','keywords','latentstyles','lchars','levelnumbers',
      'leveltext','lfolevel','linkval','list','listlevel','listname','listoverride',
      'listoverridetable','listpicture','liststylename','listtable','listtext',
      'lsdlockedexcept','macc','maccPr','mailmerge','maln','malnScr','manager','margPr',
      'mbar','mbarPr','mbaseJc','mbegChr','mborderBox','mborderBoxPr','mbox','mboxPr',
      'mchr','mcount','mctrlPr','md','mdeg','mdegHide','mden','mdiff','mdPr','me',
      'mendChr','meqArr','meqArrPr','mf','mfName','mfPr','mfunc','mfuncPr','mgroupChr',
      'mgroupChrPr','mgrow','mhideBot','mhideLeft','mhideRight','mhideTop','mhtmltag',
      'mlim','mlimloc','mlimlow','mlimlowPr','mlimupp','mlimuppPr','mm','mmaddfieldname',
      'mmath','mmathPict','mmathPr','mmaxdist','mmc','mmcJc','mmconnectstr',
      'mmconnectstrdata','mmcPr','mmcs','mmdatasource','mmheadersource','mmmailsubject',
      'mmodso','mmodsofilter','mmodsofldmpdata','mmodsomappedname','mmodsoname',
      'mmodsorecipdata','mmodsosort','mmodsosrc','mmodsotable','mmodsoudl',
      'mmodsoudldata','mmodsouniquetag','mmPr','mmquery','mmr','mnary','mnaryPr',
      'mnoBreak','mnum','mobjDist','moMath','moMathPara','moMathParaPr','mopEmu',
      'mphant','mphantPr','mplcHide','mpos','mr','mrad','mradPr','mrPr','msepChr',
      'mshow','mshp','msPre','msPrePr','msSub','msSubPr','msSubSup','msSubSupPr','msSup',
      'msSupPr','mstrikeBLTR','mstrikeH','mstrikeTLBR','mstrikeV','msub','msubHide',
      'msup','msupHide','mtransp','mtype','mvertJc','mvfmf','mvfml','mvtof','mvtol',
      'mzeroAsc','mzeroDesc','mzeroWid','nesttableprops','nextfile','nonesttables',
      'objalias','objclass','objdata','object','objname','objsect','objtime','oldcprops',
      'oldpprops','oldsprops','oldtprops','oleclsid','operator','panose','password',
      'passwordhash','pgp','pgptbl','picprop','pict','pn','pnseclvl','pntext','pntxta',
      'pntxtb','printim','private','propname','protend','protstart','protusertbl','pxe',
      'result','revtbl','revtim','rsidtbl','rxe','shp','shpgrp','shpinst',
      'shppict','shprslt','shptxt','sn','sp','staticval','stylesheet','subject','sv',
      'svb','tc','template','themedata','title','txe','ud','upr','userprops',
      'wgrffmtfilter','windowcaption','writereservation','writereservhash','xe','xform',
      'xmlattrname','xmlattrvalue','xmlclose','xmlname','xmlnstbl',
      'xmlopen',
   ))
   # Translation of some special characters.
   specialchars = {
      'par': '\n',
      'sect': '\n\n',
      'page': '\n\n',
      'line': '\n',
      'tab': '\t',
      'emdash': '\u2014',
      'endash': '\u2013',
      'emspace': '\u2003',
      'enspace': '\u2002',
      'qmspace': '\u2005',
      'bullet': '\u2022',
      'lquote': '\u2018',
      'rquote': '\u2019',
      'ldblquote': '\u201C',
      'rdblquote': '\u201D',
   }
   stack = []
   ignorable = False       # Whether this group (and all inside it) are "ignorable".
   ucskip = 1              # Number of ASCII characters to skip after a unicode character.
   curskip = 0             # Number of ASCII characters left to skip
   out = []                # Output buffer.
   for match in pattern.finditer(text):
      word,arg,hex,char,brace,tchar = match.groups()
      if brace:
         curskip = 0
         if brace == '{':
            # Push state
            stack.append((ucskip,ignorable))
         elif brace == '}':
            # Pop state
            ucskip,ignorable = stack.pop()
      elif char: # \x (not a letter)
         curskip = 0
         if char == '~':
            if not ignorable:
                out.append('\xA0')
         elif char in '{}\\':
            if not ignorable:
               out.append(char)
         elif char == '*':
            ignorable = True
      elif word: # \foo
         curskip = 0
         if word in destinations:
            ignorable = True
         elif ignorable:
            pass
         elif word in specialchars:
            out.append(specialchars[word])
         elif word == 'uc':
            ucskip = int(arg)
         elif word == 'u':
            c = int(arg)
            if c < 0: c += 0x10000
            if c > 127: out.append(chr(c)) #NOQA
            else: out.append(chr(c))
            curskip = ucskip
      elif hex: # \'xx
         if curskip > 0:
            curskip -= 1
         elif not ignorable:
            c = int(hex,16)
            if c > 127: out.append(chr(c)) #NOQA
            else: out.append(chr(c))
      elif tchar:
         if curskip > 0:
            curskip -= 1
         elif not ignorable:
            out.append(tchar)
   return ''.join(out)

In [64]:
plainText = None

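def splitSentences(text):
  # Same splitting rule as in embedText (raw string, so the backslashes reach the regex engine unchanged)
  split = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)', text)

  # filter out non-sentences (the capturing group also returns the separators)
  split = filter(lambda sentence: sentence not in ["", " ", "\n"], split)
  # Remove linebreaks from text
  split = list(map(removeFormatting, split))
  return split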

with open("00537.rtf") as file:
  text = file.read()
  plainText = striprtf(text)

plainSentences = splitSentences(plainText)
print(plainSentences)

['STEFAN ZWEIG  ÁMOK  FORDÍTOTTA: GÖRGEY GÁBOR     Ezerkilencszáztizenkettõ márciusában egy nápolyi kikötõben horgonyzó óceánjárón a kirakodásnál súlyos szerencsétlenség történt.', 'Az újságok terjedelmes, de igen valószínûtlenül kiszínezett tudósításokban számoltak be az esetrõl.', 'Bár magam is az Óceánián utaztam, sem én, sem útitársaim nem lehettünk tanúi a különös balesetnek, minthogy éjjel, szén- és árurakodás közben történt, s a lárma elõl valamennyien színházba vagy kávéházba menekültünk.', 'Annak idején nem beszéltem errõl, de azt hiszem, jól gyanítottam a megrázó esemény igazi okát, és most, esztendõk múltán, már feltárhatom azt a bizalmas beszélgetést, mely közvetlenül megelõzte a történteket.', 'Mikor visszatérõben Európába, a hajóstársaság calcuttai ügynökségén helyet akartam biztosítani az Óceánián, a hivatalnok sajnálkozva vállat vont.', 'Nem tudja még, kaphatok-e kabint, mert ilyenkor, az esõs évszak beállta elõtt, rendszerint már Ausztráliából lefoglalnak minden helyet
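The printed sentences show "õ" and "û" where Hungarian spelling has "ő" and "ű" (for example "kikötõben" instead of "kikötőben"). This pattern is typical of Windows-1250 bytes being read as Latin-1, which is what `chr(c)` does for the `\'xx` escapes in `striprtf`. Assuming the RTF file really uses code page 1250 (not verified here), a post-processing step along these lines could repair the text. `fixCp1250` is a hypothetical helper name.

```python
def fixCp1250(text):
    """Re-decode characters that were read as Latin-1 but are really cp1250 bytes."""
    fixed = []
    for ch in text:
        code = ord(ch)
        if 0x80 <= code <= 0xFF:
            # errors="replace" yields U+FFFD for the few byte values cp1250 leaves undefined
            fixed.append(bytes([code]).decode("cp1250", errors="replace"))
        else:
            # ASCII and characters beyond U+00FF (e.g. dashes from specialchars) stay as they are
            fixed.append(ch)
    return "".join(fixed)

sample = "Ezerkilencszáztizenkettõ márciusában, valószínûtlenül"
print(fixCp1250(sample))
```

Applied as `plainText = fixCp1250(striprtf(text))`, this leaves accented letters that share the same byte value in both code pages, such as "á" and "é", unchanged.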

# Translate Text in Unsupported Languages to English

In [67]:
!pip install googletrans



In [None]:
from googletrans import Translator
translator = Translator()
translator.translate('Was ist los?').text
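The translated sentences are only needed for the alignment step; the finished book should still show the original text. The sketch below keeps each original sentence together with its English translation and its index, so that pairings found on the English text can be mapped back. `translateForAlignment` and the stub translator are hypothetical. In the notebook the callable could wrap `translator.translate(...)`, assuming a googletrans version in which that call is synchronous.

```python
def translateForAlignment(sentences, translate):
    """Translate each sentence to English, keeping its index in the original text.

    `translate` is any callable mapping a string to a string.
    Returns a list of (index, original, english) triples.
    """
    triples = []
    for index, sentence in enumerate(sentences):
        triples.append((index, sentence, translate(sentence)))
    return triples

# Stand-in translator so the sketch runs offline (made-up two-entry dictionary)
stubDictionary = {
    "Jó napot.": "Good day.",
    "Köszönöm.": "Thank you.",
}

def stubTranslate(sentence):
    return stubDictionary.get(sentence, sentence)

triples = translateForAlignment(["Jó napot.", "Köszönöm."], stubTranslate)
# Only the English column would be passed on to the sentence encoder
englishForEmbedding = [english for _, _, english in triples]
print(englishForEmbedding)
```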


# Generating the PDF with LaTeX

LaTeX file generation experiments were done outside the notebook.
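As a rough idea of what such a generation step could look like, the sketch below turns a list of sentence pairings into LaTeX source, with each language A passage followed by the language B passage in italics. It is not the code from those experiments. The function names, the document class and the italic styling are assumptions, and the output has not been compiled here.

```python
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}

def escapeLatex(text):
    # Replace character by character so inserted backslashes are not escaped again
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)

def pairingsToLatex(pairings):
    """pairings: list of (sentencesA, sentencesB) tuples, shaped like textPairings."""
    lines = [
        r"\documentclass{book}",
        r"\usepackage[utf8]{inputenc}",
        r"\usepackage[T1]{fontenc}",
        r"\begin{document}",
    ]
    for sentencesA, sentencesB in pairings:
        lines.append(escapeLatex(" ".join(sentencesA)))
        lines.append("")  # a blank line ends the paragraph
        lines.append(r"\textit{" + escapeLatex(" ".join(sentencesB)) + "}")
        lines.append("")
    lines.append(r"\end{document}")
    return "\n".join(lines)

demoPairings = [(["Ich bin gekommen."], ["I have come."])]
print(pairingsToLatex(demoPairings))
```

The returned string can be written to a `.tex` file and compiled with a LaTeX distribution outside the notebook.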