Converts the XML file containing the Crosby and Shaeffer vocabulary into a CSV file containing only the vocabulary lists.

In [37]:
import re
import wget
import pandas as pd


### Get the XML file

In [38]:
!pip install wget
# !wget crosbyshaeffer2.0.xml https://github.com/gregorycrane/CrosbyShaeffer2.0/blob/51a1e5d7663f6944816d11853f1aa21004838323/crosbyshaeffer2.0.xml



In [39]:
file_path = './lib/crosbyshaeffer_vocabulary_unprocessed.xml'
url = 'https://github.com/gregorycrane/CrosbyShaeffer2.0/blob/51a1e5d7663f6944816d11853f1aa21004838323/crosbyshaeffer2.0.xml'
wget.download(url, file_path)

'./lib/crosbyshaeffer2.0 (1).xml'

Use regex to capture the info between `<p>VOCABULARY` and `</p>`

In [40]:
capture = False
vocab = []

# keep only the lines inside the VOCABULARY paragraphs
with open(file_path, encoding="utf-8") as f:
    for line in f:
        # start capturing at a VOCABULARY heading
        if re.search(r"VOCABULARY", line):
            capture = True
        # stop capturing at the closing </p>
        if re.search(r"</p>", line):
            capture = False
        if capture:
            # grab the text between a '>' and the last '</' on the line
            match = re.search(r">((\w)+.*)</", line)
            if match is not None:
                # drop any leftover tags, then the characters &, / and ;
                cleaned_match = re.sub(r"<.*>", '', match.group(1))
                cleaned_match = re.sub(r"[&/;]", '', cleaned_match)
                vocab.append(cleaned_match)
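
The capture logic above is a small state machine: a flag switches on at a VOCABULARY line and off at the next `</p>`. The following is a minimal sketch of the same idea as a function run on a few made-up lines. The function name `extract_vocab_lines` and the `<item>` tags in the sample are assumptions for illustration, not the structure of the real file.

```python
import re

def extract_vocab_lines(lines):
    """Collect tag-stripped text from lines between a VOCABULARY marker and the next </p>."""
    capture = False
    vocab = []
    for line in lines:
        if "VOCABULARY" in line:
            capture = True
        if "</p>" in line:
            capture = False
        if capture:
            # text between a '>' and the last '</' on the line
            match = re.search(r">((\w)+.*)</", line)
            if match is not None:
                cleaned = re.sub(r"<.*>", "", match.group(1))
                cleaned = re.sub(r"[&/;]", "", cleaned)
                vocab.append(cleaned)
    return vocab

# Hypothetical sample input; the real file's markup may differ.
sample = [
    "<p>VOCABULARY",
    "<item>λίθος: stone.</item>",
    "<item>λύω: loose, break, destroy.</item>",
    "</p>",
    "<p>Some unrelated paragraph</p>",
]
result = extract_vocab_lines(sample)
for entry in result:
    print(entry)
```

Only the two entries between the marker and the closing `</p>` are kept; the marker line itself and the unrelated paragraph are skipped.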
            

In [41]:
# continue to clean the file

# keep only lines that start with a Greek letter and contain a colon
# (a line starting with a number already fails the Greek-letter test)
vocab = [
    line for line in vocab
    if re.search('^[\u0370-\u03FF]', line) and re.search(':', line)
]

for v in vocab:
    print(v)

παύουσι: they stop.
πέμπει: he, she, or it sends (transitive).
πέμπουσι: they send.
ποταμός, -οὔ, ὃ: river, HIPPOPOTAMUS.
στρατηγός, -οὔ, ὃ: general. STRATEGY.
δίκαιος : just. 
μακρός: large. MACRON. MACROECONOMICS.
μῑκρός: small. MICROSCOPIC. MICROECONOMICS
πόλεμος: war.
πολέμιος : hostile.
οἱ πολέμιοι: the enemy. POLEMIC.
φίλος: friend. PHILANTHROPIST.
γράφω: write. TELEGRAPH. LITHOGRAPH.
καί: conj.: and, also, even.
καλός: beautiful, honorable, fine.
κίνδυνος: danger.
λίθος: stone.
λύω: loose, break, destroy.
παύω: stop (trans,). PAUSE.ltgt
δέ (δ᾽ before vowels), postpos.2 conj: but, and.
δένδρον: tree. RHODODENDRON. 
παρά, prep.: with G., from the side of with D., by the side of with A., to the side of, to, alongside. PARALLEL.
δῆλος: plain, evident. 
δῶρον: gift. THEODORE, 
εἰς, proclit. prep. with A.: into  (Lat. in).
πεδίον: plain.
στάδιον: stadium (race course) stade (600 ft.).3 ltgt
γάρ, postpos. conj. : for. 
σπονδή: libation pl, treaty, truce. SPONDEE.
κώμη: village. 
μάχῃ: 
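
The filter above tests the first character against the Greek and Coptic block (U+0370 to U+03FF). Precomposed polytonic letters, such as a vowel carrying a breathing mark, sit in the Greek Extended block (U+1F00 to U+1FFF), so an entry that begins with one would be dropped if the text is stored in composed form. Whether that affects this file is an assumption that has not been checked here. The sketch below compares the two ranges on made-up entries; the names `NARROW`, `WIDE`, and `keep_entries` are hypothetical.

```python
import re

# Greek and Coptic only, as in the filter above
NARROW = re.compile(r"^[\u0370-\u03FF]")
# Greek and Coptic plus Greek Extended (precomposed polytonic letters)
WIDE = re.compile(r"^[\u0370-\u03FF\u1F00-\u1FFF]")

def keep_entries(lines, pattern):
    """Keep lines that start with a character matched by `pattern` and contain a colon."""
    return [line for line in lines if pattern.search(line) and ":" in line]

# Made-up sample; "\u1f00" is alpha with smooth breathing in precomposed form.
sample = [
    "\u1f00γαθός: good.",
    "λίθος: stone.",
    "12 notes: skip",
    "no Greek here: skip",
    "δένδρον",
]

narrow_kept = keep_entries(sample, NARROW)
wide_kept = keep_entries(sample, WIDE)
print(len(narrow_kept), len(wide_kept))
```

If the source text were stored in decomposed form (base letter followed by combining marks), the narrower range would already match, so it is worth inspecting a few code points before changing the filter.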

Put everything into a dictionary

In [42]:
table = {"word": [], "translation":[], "cognate":[]}

for v in vocab:  
    
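    # get the greek word: everything before the first colon
    # (non-greedy, so an entry with a second colon still yields only the headword)
    word_match = re.search(r"(.+?):", v)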
    if word_match is not None:
        table["word"].append(word_match.group(0).strip(" :"))
    else:
        table["word"].append("N/A")
        
        
    # get the english translation
    transl_match = re.search(r":(.+)", v)
    if transl_match is not None:
        eng_trans_list = []
        # get the entire string
        w = transl_match.group(0).strip(" :")
        print(w)
        # split the string into pieces on periods and commas
        for i in re.split("[.,]", w):
            i = i.strip(" ")
            # if the piece is all uppercase (an English cognate), ignore it
            if not i.isupper():
                eng_trans_list.append(i)
        # recombine the pieces we want, separated by periods
        table["translation"].append(".".join(eng_trans_list))
        
    else:
        table["translation"].append("N/A")
        
        
    # get the first cognate (the first run of two or more capital letters)
    cognate_match = re.search(r"[A-Z][A-Z]+", v)
    if cognate_match is not None:
        table["cognate"].append(cognate_match.group(0))
    else:
        table["cognate"].append("N/A")
    

they stop.
he, she, or it sends (transitive).
they send.
river, HIPPOPOTAMUS.
general. STRATEGY.
just.
large. MACRON. MACROECONOMICS.
small. MICROSCOPIC. MICROECONOMICS
war.
hostile.
the enemy. POLEMIC.
friend. PHILANTHROPIST.
write. TELEGRAPH. LITHOGRAPH.
conj.: and, also, even.
beautiful, honorable, fine.
danger.
stone.
loose, break, destroy.
stop (trans,). PAUSE.ltgt
but, and.
tree. RHODODENDRON.
with G., from the side of with D., by the side of with A., to the side of, to, alongside. PARALLEL.
plain, evident.
gift. THEODORE,
into  (Lat. in).
plain.
stadium (race course) stade (600 ft.).3 ltgt
for.
libation pl, treaty, truce. SPONDEE.
village.
battle. LOGOMACHY.
flee. Lat. fugio. FUGITIVE,
not, UTOPIA. PROPHYLACTIC.
flight, exile. Lat fuga.
guard, garrison.
tent. SCENE.
guard (verb).ltgt
ten. DECALOGUE.
with G., through  with A., on account of. DIAMETER.
friendship. Cf. φίλος.
friendly.
foreigner, barbarian.
outcry, shout.
word, speech.
coord. conj. (§ 45).
silence.ltgt
intend, dela
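
The cell above keeps only the first all-caps cognate and rejoins the translation pieces with periods. As a possible variation, the sketch below splits on the first colon, collects every cognate with `re.findall`, and joins the glosses with commas. The function name `parse_entry` and the output format are assumptions, not the notebook's own behavior.

```python
import re

def parse_entry(entry):
    """Split one 'word: gloss. COGNATE.' line into word, translation, and cognates."""
    # everything before the first colon is the headword
    word, _, rest = entry.partition(":")
    # every run of two or more capital letters counts as a cognate
    cognates = re.findall(r"\b[A-Z]{2,}\b", rest)
    # split on periods and commas, dropping empty and all-uppercase pieces
    parts = [p.strip() for p in re.split(r"[.,]", rest)]
    glosses = [p for p in parts if p and not p.isupper()]
    return {
        "word": word.strip(),
        "translation": ", ".join(glosses),
        "cognates": cognates,
    }

parsed = parse_entry("μακρός: large. MACRON. MACROECONOMICS.")
print(parsed["translation"], parsed["cognates"])
```

For this entry both MACRON and MACROECONOMICS are collected, whereas a single `re.search` would return only the first.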

Convert the dictionary to a dataframe

In [43]:
df = pd.DataFrame(data=table)
print(df)

                                                word  \
0                                            παύουσι   
1                                             πέμπει   
2                                           πέμπουσι   
3                                    ποταμός, -οὔ, ὃ   
4                                  στρατηγός, -οὔ, ὃ   
..                                               ...   
396  σημαίνω, σημανῶ, ἐσήμηνα, σεσήμασμαι, ἐσημάνθην   
397                                         βασιλεία   
398                                           κοινός   
399                                  νόμος, νόμου, ὁ   
400                          οἵομαι, οἴησομαι, ᾠήθην   

                          translation       cognate  
0                          they stop.           N/A  
1    he.she.or it sends (transitive).           N/A  
2                          they send.           N/A  
3                              river.  HIPPOPOTAMUS  
4                            general.      STRATEGY  
.. 

Convert the dataframe to a tab-separated csv file

In [44]:
df.to_csv('./lib/crosbyshaeffer_vocabulary.csv', sep='\t', index=False)
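
Because the file is written with `sep='\t'`, it is tab-delimited despite the `.csv` extension, and commas inside the word column stay within one field. The sketch below reads such data back using only the standard library. The two rows are made-up stand-ins, not the actual file contents.

```python
import csv
import io

# Stand-in for the contents of the exported file (hypothetical rows)
tsv_text = (
    "word\ttranslation\tcognate\n"
    "νόμος, νόμου, ὁ\tlaw.\tN/A\n"
    "φίλος\tfriend.\tPHILANTHROPIST\n"
)

# delimiter="\t" must match the sep used when writing
rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
for row in rows:
    print(row["word"], "|", row["translation"], "|", row["cognate"])
```

When reading the file with pandas instead, `pd.read_csv(path, sep='\t')` uses the same delimiter. By default pandas treats the string `N/A` as a missing value, so passing `keep_default_na=False` keeps the placeholder as text.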