This notebook contains the code for collecting the data (i.e., the translations) used in this analysis. If you want to work with different translations of the DDJ, check out [Terebess Online](https://terebess.hu/english/tao/_index.html), which you can scrape in much the same way as I did here.

To get 170 translations of the first chapter of the DDJ, save [this page](http://www.bopsecrets.org/gateway/passages/tao-te-ching.htm) as HTML in the `webpages/` directory (the code below reads it as `intros.html`). You can get the DC Lau translation the same way from [this page](https://terebess.hu/english/tao/lau.html) (read below as `DCLau_1963.html`). You can get 10 additional complete translations from [this page](https://ttc.tasuki.org/display:Code:gff,sm,vhm,sasl,dl,jhmd,jc,dh,rh,as). Saving that last page directly as HTML does not work, so inspect the page source, copy the HTML manually, and save it as `comparisons.html`.
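
Before running the cells below, it can help to confirm that the saved pages are where the code expects them. The following is a minimal sketch, assuming the three files are named as in the `open()` calls of the later cells; the `missing_pages` helper and the `EXPECTED_PAGES` list are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# File names taken from the open() calls in the cells below
EXPECTED_PAGES = ['intros.html', 'DCLau_1963.html', 'comparisons.html']


def missing_pages(directory, expected=EXPECTED_PAGES):
    """Return the expected file names that are not present in `directory`."""
    directory = Path(directory)
    return [name for name in expected if not (directory / name).is_file()]


missing = missing_pages('webpages/')
if missing:
    print('Missing saved pages:', ', '.join(missing))
```

If nothing is printed, all three pages were found.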

In [42]:
import os
import pickle
from pathlib import Path

import urllib
from bs4 import BeautifulSoup

webpage_dir = Path('webpages/')

Let's start with the 170 translations of chapter 1

In [43]:
with open(webpage_dir.joinpath('intros.html')) as f:
    soup = BeautifulSoup(f, 'lxml')

chapters = []
translators = []
# Skip the first 2 and last 4 blockquotes, which are not part of the translation list
for block in soup.find_all('blockquote')[2:-4]:
    lines = [line.replace('\n', ' ') for line in block.stripped_strings]

    if len(lines) > 1:
        # The last line of a block is expected to be the translator line
        t = lines[-1]
        if 'by' in t:
            translators.append(t)
            chapters.append(" ".join(lines[:-1]))

# Keep only the name: the text after the first 'by' and before any '('.
# maxsplit=1 avoids cutting a name that itself contains 'by'.
authors = [t.split('by', 1)[1].split('(')[0].strip() for t in translators]

# Sanity check
assert len(chapters) == len(authors)
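
The cleaning step keeps the text that follows 'by' and precedes any '('. Here is a minimal sketch of that idea on made-up translator lines (the lines and the `extract_author` helper are illustrative assumptions, not taken from the page). It passes `maxsplit=1` so that a name which itself contains 'by' is not cut short.

```python
def extract_author(attribution):
    """Return the translator name from a line such as '... by NAME (YEAR)'."""
    # maxsplit=1 splits only at the first 'by', so a 'by' inside the name survives
    after_by = attribution.split('by', 1)[1]
    # Drop a trailing parenthetical such as a year, then trim whitespace
    return after_by.split('(')[0].strip()


# Made-up attribution lines for illustration only
examples = [
    'Translated by Jane Doe (1900)',
    'Rendered by Kirby Roe',
]
for line in examples:
    print(extract_author(line))
```

With a plain `split('by')`, the second made-up line would be truncated to `'Kir'`, because the name is split again at its own 'by'.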

In [44]:
chapters[:3]

['The tau (reason) which can be tau -ed (reasoned) is not the Eternal Tau (Reason).  The name which can be named is not the Eternal Name. Non-existence is named the Antecedent of heaven and earth; and Existence is  named the Mother of all things. In eternal non-existence, therefore, man seeks  to pierce the primordial mystery; and, in eternal existence, to behold the  issues of the Universe. But these two are one and the same, and differ only in  name. This sameness (or existence and non-existence) I call the abyss \x97 the abyss of  abysses \x97 the gate of all mystery.',
 'The TAO, or Principle of Nature, may be discussed [by all]; it is  not the popular or common Tao. Its Name may be named [ i.e. , the TAO may receive a  designation, though of itself it has none]; but it is not an ordinary name, [or  name in the usual sense of the word, for it is a presentment or ειδωλον of the  Infinite]. Its nameless period was that which preceded the birth of the Universe, In being spoken of by n

In [45]:
authors[:3]

['John Chalmers', 'Frederick Henry Balfour', 'James Legge']

In [46]:
# Map each translator to their rendering of chapter 1
# (if a name occurs more than once, the last entry wins)
translations = dict(zip(authors, chapters))
with open('data/chapter1_translations', 'wb') as f:
    pickle.dump(translations, f)
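
The saved file can be read back with `pickle.load` in a later notebook. Below is a minimal round-trip sketch, using a temporary directory and a made-up one-entry dictionary in place of the real data. Note that `open(..., 'wb')` does not create missing directories, so the sketch creates the `data` folder first.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the real {author: chapter_text} dictionary
sample = {'Jane Doe': 'Placeholder text for chapter 1.'}

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / 'data'
    data_dir.mkdir(parents=True, exist_ok=True)  # open() will not create it
    path = data_dir / 'chapter1_translations'

    # Write the dictionary, then read it back
    with open(path, 'wb') as f:
        pickle.dump(sample, f)

    with open(path, 'rb') as f:
        loaded = pickle.load(f)

print(loaded == sample)
```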

Looks good! Now let's get the DC Lau translation

In [47]:
with open(webpage_dir.joinpath('DCLau_1963.html')) as f:
    soup = BeautifulSoup(f, 'lxml')

chapters = []
# Skip the first 87 <p> elements, which precede the translation text
for p in soup.find_all('p')[87:]:
    lines = [line.replace('\n', "") for line in p.stripped_strings]

    if p.find('a') is not None:
        # A link marks the start of a new chapter; drop the bare chapter number
        lines = [line for line in lines if not line.isdigit()]
        chapters.append(lines)
    else:
        # No link: this paragraph continues the current chapter
        chapters[-1].extend(lines)
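
The loop above groups paragraphs into chapters: a paragraph containing a link starts a new chapter, and any other paragraph is appended to the chapter before it. The sketch below reproduces only that grouping logic on made-up data, with each paragraph modelled as a `(has_link, lines)` tuple instead of a BeautifulSoup tag. The `group_chapters` helper and the sample lines are assumptions for illustration.

```python
def group_chapters(paragraphs):
    """Group (has_link, lines) pairs into chapters, one list of lines each."""
    chapters = []
    for has_link, lines in paragraphs:
        if has_link:
            # Start of a new chapter: drop the bare chapter number
            chapters.append([line for line in lines if not line.isdigit()])
        elif chapters:
            # Continuation paragraph: extend the most recent chapter
            chapters[-1].extend(lines)
        # A continuation before any chapter start is ignored here
    return chapters


# Made-up paragraphs: two chapter starts and one continuation
paragraphs = [
    (True, ['1', 'first line', 'second line']),
    (False, ['third line']),
    (True, ['2', 'another first line']),
]
print(group_chapters(paragraphs))
```

Unlike the cell above, the sketch skips a continuation paragraph that appears before any chapter start rather than raising an `IndexError`.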

In [48]:
chapters[:3]

[['The way that can be spoken of',
  'Is not the constant way;',
  'The name that can be named',
  'Is not the constant name.',
  'The  nameless was the beginning of heaven and earth;',
  'The named was the mother of the myriad creatures.',
  'Hence  always rid yourself of desires in order to observe its secrets;',
  'But always allow yourself to have desires in order to observe its manifestations.',
  'These  two are the same',
  'But diverge in name as they issue forth.',
  'Being the same they are called mysteries,',
  'Mystery upon mystery -',
  'The gateway of the manifold secrets.'],
 ['The whole world recognizes the beautiful as the beautiful, yet this is only  the ugly;',
  'the whole world recognizes the good as the good, yet this is only the bad.',
  'Thus  Something and Nothing produce each other;',
  'The difficult and the easy complement each other;',
  'The long and the short off-set each other;',
  'The high and the low incline towards each other;',
  'Note and sound har

Awesome! Now let's get 10 more complete translations.

In [49]:
with open(webpage_dir.joinpath('comparisons.html'), encoding='utf8') as f:
    soup = BeautifulSoup(f, 'lxml')

authors = ['Feng_1972',
           'Mitchel_1988',
           'Hair_1990',
           'Addis&Lombardo_1993',
           'Lin_1994',
           'McDonald_1996',
           'Clatfelter_2000',
           'Hinton_2002',
           'Hogan_2004',
           'Solska_2005']

translations = {a:[] for a in authors}    

In [50]:
num_authors = len(authors)
author_ind = 0
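# After the first 10 matching divs are skipped, the rest cycle through the
# translators in the order given in `authors`, so the i-th remaining div
# belongs to authors[i % num_authors]
for chapter in soup.find_all('div', {"class": "text ng-binding"})[10:]:
    lines = []
    for c in list(chapter.children):
        try:
            lines.extend(list(c.stripped_strings))
        except AttributeError:
            # Child has no `stripped_strings` (e.g. a bare text node); skip it
            continue

    translations[authors[author_ind % num_authors]].append(lines)
    author_ind += 1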
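
The modulo indexing above relies on the blocks being interleaved: one block per translator for chapter 1, then one per translator for chapter 2, and so on. Here is a minimal sketch of that round-robin assignment on made-up data (the `assign_round_robin` helper and the two-translator example are assumptions for illustration).

```python
def assign_round_robin(blocks, authors):
    """Assign block i to authors[i % len(authors)], preserving block order."""
    result = {author: [] for author in authors}
    for i, block in enumerate(blocks):
        result[authors[i % len(authors)]].append(block)
    return result


# Made-up data: 2 translators, 3 chapters each, interleaved by chapter
authors_demo = ['A', 'B']
blocks_demo = ['A ch1', 'B ch1', 'A ch2', 'B ch2', 'A ch3', 'B ch3']
print(assign_round_robin(blocks_demo, authors_demo))
```

If the page order ever differed from the order of the `authors` list, chapters would be silently assigned to the wrong translator, so it is worth spot-checking a known translation afterwards.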

In [51]:
# Add DC Lau translation
translations['DCLau_1963'] = chapters

In [52]:
# Check number of chapters
[len(translation) for translation in translations.values()]

[81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81]
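
The visual check above can also be turned into an automatic one. The following is a minimal sketch, assuming every complete translation should have 81 chapters; the `wrong_chapter_counts` helper and the demo dictionary are hypothetical and only for illustration.

```python
def wrong_chapter_counts(translations, expected=81):
    """Return {name: count} for translations whose length differs from `expected`."""
    return {
        name: len(chapters)
        for name, chapters in translations.items()
        if len(chapters) != expected
    }


# Made-up data: one complete translation and one that is a chapter short
demo = {'complete': [[]] * 81, 'short': [[]] * 80}
print(wrong_chapter_counts(demo))
```

In the notebook, `assert not wrong_chapter_counts(translations)` would stop execution if any translation had the wrong number of chapters.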

In [53]:
translations['Mitchel_1988'][:5]

[['The tao that can be told',
  'is not the eternal Tao',
  'The name that can be named',
  'is not the eternal Name.',
  'The unnamable is the eternally real.',
  'Naming is the origin',
  'of all particular things.',
  'Free from desire, you realize the mystery.',
  'Caught in desire, you see only the manifestations.',
  'Yet mystery and manifestations',
  'arise from the same source.',
  'This source is called darkness.',
  'Darkness within darkness.',
  'The gateway to all understanding.'],
 ['When people see some things as beautiful,',
  'other things become ugly.',
  'When people see some things as good,',
  'other things become bad.',
  'Being and non-being create each other.',
  'Difficult and easy support each other.',
  'Long and short define each other.',
  'High and low depend on each other.',
  'Before and after follow each other.',
  'Therefore the Master',
  'acts without doing anything',
  'and teaches without saying anything.',
  'Things arise and she lets them come;',

Everything looks good. Now let's save this data so we can use it later.

In [54]:
with open('data/complete_translations.pkl', 'wb') as f:
    pickle.dump(translations, f)