# Further Applications of Regular Expressions

On Thursday, we went over an introduction to strings and regular expressions. Today, we will focus on further applications of those tools and explore what we can do with the material covered so far.

### Learning Goals:


### Lesson Outline:
- Q&A regarding Thursday's content
- Examples (with Abigail's data)
- Practice!

In [1]:
# importing different packages
import re
from datascience import *

## Importing our text

In [20]:
with open('linsenn_to_page_69.txt', 'r') as f:
    read_text = f.read()
print(read_text[89:400])

CUNEIFORM MONOGRAPHS 25

Managing Editor Geerd Haayer

Edited by

T. Abusch, MJ. Geller, M.P. Maidman

S.M. Maul and F.A.M. Wiggermann

BRILL • SIYX

LEIDEN • BOSTON

2004

CUNEIFORM MONOGRAPHS 25

THE CULTS OF URUK AND

BABYLON

The Temple Ritual Texts as Evidence for

Hellenistic Cult Practises

Marc J.H. Li


In [28]:
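# `pls` is read in cell In [19] (shown below); run that cell first
opn = [line for line in read_text[89:].split('\n') if len(line) > 0]
cod_opn = [line for line in pls.split('\n') if len(line) > 0]

In [40]:
# `sbs` (side by side) is a display helper that is not defined in this notebook
for i in range((len(cod_opn)-1)//6):
    sbs(opn[i], cod_opn[i])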
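
The `sbs` helper called in the loop above never appears in this notebook. The following is a minimal sketch of what such a helper might look like, assuming that `sbs` stands for "side by side" and that it should render two strings in two columns. The names `sbs_html` and the inline styles are hypothetical, not the original implementation.

```python
from html import escape


def sbs_html(left, right):
    """Return an HTML snippet showing `left` and `right` in two columns."""
    # Escape the text so that characters such as '<' are shown literally.
    cell = '<div style="width:50%; padding:0.5em;">{}</div>'
    return '<div style="display:flex;">{}{}</div>'.format(
        cell.format(escape(str(left))),
        cell.format(escape(str(right))),
    )


def sbs(left, right):
    """Display two strings side by side in a notebook (hypothetical helper)."""
    # Imported here so that sbs_html can also be used outside a notebook.
    from IPython.display import display, HTML
    display(HTML(sbs_html(left, right)))


snippet = sbs_html('CUNEIFORM MONOGRAPHS 25', 'CUNEIFORM MONOGRAPHS 25 ')
print(snippet)
```

Keeping the string-building step separate from the display step makes the output easy to inspect without a running notebook.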

In [19]:
import codecs
with codecs.open('LinssenCM25 adobe txt.txt', "r",encoding='utf-8', errors='ignore') as fdata:
    pls = fdata.read()
print(pls[:300])

CUNEIFORM MONOGRAPHS 25 
r 
Edited by T. Abusch, M.J. Geller, M.P. Maidman S.M. Maul and F.A.M. Wiggermann BRILL SIYX LEIDEN  BOSTON 2004 CUNEIFORM MONOGRAPHS 25 
THE CULTS OF URUK AND 
BABYLON 
The Temple Ritual Texts as Evidence for 
Hellenistic Cult Practises 
Marc J.H. Linssen BRILL


## Removing all the transliterations and translations

We remove these so that we can run text analysis on the English portion of the work.

In [30]:
#!pip install polyglot
#!pip install pycld2
#!brew install intltool icu4c gettext
#!brew link icu4c gettext --force
#!CFLAGS=-I/usr/local/opt/icu4c/include LDFLAGS=-L/usr/local/opt/icu4c/lib pip3 install pyicu

from polyglot.detect import Detector

In [31]:
import numpy as np

no_linebreaks = [line for line in read_text.split('\n') if len(line) > 0]
# run the detector once per line and reuse the result for both columns
detected = [Detector(line, quiet=True).language for line in no_linebreaks]
lang = [d.name for d in detected]
conf = [d.confidence for d in detected]
# commas inside [...] are literal characters, so they are left out of the classes
bool_obv = [bool(re.match(r'[Oo0][bB][vV]', line)) for line in no_linebreaks]
bool_num_start = [bool(re.search(r'^\d+\s+.+', line)) for line in no_linebreaks]

look_for_nonenglish = Table().with_columns([
        'text', no_linebreaks,
        'lang', lang,
        'conf', conf,
        'num_start', bool_num_start,
        'obv', bool_obv,
        'ind', np.arange(len(bool_obv))
    ])

low_conf = look_for_nonenglish['conf'] < 80
high_conf = look_for_nonenglish['conf'] > 90
english = look_for_nonenglish['lang'] == 'English'
not_english = look_for_nonenglish['lang'] != 'English'
language_uncertain = np.logical_or(low_conf, not_english)
certainly_english = np.logical_and(english, high_conf)
plsHelp = np.logical_and(language_uncertain, bool_num_start)

we_want = look_for_nonenglish.where(certainly_english)
we_want

text,lang,conf,num_start,obv,ind
CUNEIFORM MONOGRAPHS 25,English,95,False,False,27
Managing Editor Geerd Haayer,English,96,False,False,28
"T. Abusch, MJ. Geller, M.P. Maidman",English,96,False,False,30
S.M. Maul and F.A.M. Wiggermann,English,96,False,False,31
LEIDEN • BOSTON,English,93,False,False,33
CUNEIFORM MONOGRAPHS 25,English,95,False,False,35
THE CULTS OF URUK AND,English,95,False,False,36
The Temple Ritual Texts as Evidence for,English,97,False,False,38
Hellenistic Cult Practises,English,96,False,False,39
Marc J.H. Linssen,English,94,False,False,40
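
A note on the character classes used for the `obv` column: inside `[...]`, a comma is an ordinary character rather than a separator, so a class such as `[O,o,0]` also matches a literal comma. The short check below illustrates the difference on a few made-up sample strings (the samples are assumptions, not lines from the book).

```python
import re

with_commas = re.compile(r'[O,o,0][b,B][v,V]')
without_commas = re.compile(r'[Oo0][bB][vV]')

# Made-up sample strings, chosen to show where the two patterns disagree.
samples = ['Obv. 12', 'obv', '0BV', ',,,', 'o,v']

for s in samples:
    # re.match only looks at the start of the string.
    print(repr(s), bool(with_commas.match(s)), bool(without_commas.match(s)))
```

Both patterns accept `Obv`, `obv`, and `0BV`, but only the version with commas accepts `,,,` and `o,v`.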


In [32]:
english_corpus = '\n'.join(we_want['text'])
print(english_corpus)

CUNEIFORM MONOGRAPHS 25
Managing Editor Geerd Haayer
T. Abusch, MJ. Geller, M.P. Maidman
S.M. Maul and F.A.M. Wiggermann
LEIDEN • BOSTON
CUNEIFORM MONOGRAPHS 25
THE CULTS OF URUK AND
The Temple Ritual Texts as Evidence for
Hellenistic Cult Practises
Marc J.H. Linssen
LEIDEN • BOSTON
This book is printed on acid-free paper.
Library of Congress Cataloging-in-Publication Data
Linssen, Marc J.H.
The cults of Uruk and Babylon : the temple ritual as evidence for Hellenistic cult
practices / Marc J.H. Linssen
p. cm. — (Cuneiform Monographs; ISSN 0929-0052 ; 25)
Originally presented as the author’s thesis (doctoral-Vrije Universiteit Amsterdam), 2002.
Includes bibliographical references (p.) and index.
1. Cults-Iraq-Babylon (Extinct city) 2. Cults-Iraq-Erech (Extinct city) 3. Babylon
(Extinct city)-Religion. 4. Erech (Extinct city)-Religion. 5. Akkadian language-Texts. 6.
Assyro-Babylonian religion. I. Title. II. Series.
© Copyright 2004 by Styx/Koninklijke Brill NV, Leiden, The Netherlands
Al

In [33]:
from collections import Counter
import pprint
pp = pprint.PrettyPrinter()

# \W (capital W) matches any NON-word character (anything other than letters, digits, or underscore)
pp.pprint(Counter(re.split(r'\W+', english_corpus.lower())))

Counter({'the': 2504,
         'and': 878,
         'in': 818,
         'a': 628,
         'of': 529,
         'to': 494,
         'is': 441,
         'for': 393,
         'ofthe': 340,
         'on': 334,
         'day': 291,
         'from': 263,
         'are': 252,
         'temple': 234,
         'ceremonies': 220,
         'also': 214,
         'texts': 208,
         'which': 205,
         'ritual': 202,
         'ceremony': 196,
         'see': 191,
         'with': 189,
         'uruk': 182,
         'not': 179,
         'text': 176,
         'month': 173,
         '2': 172,
         '1': 166,
         'tu': 166,
         'i': 164,
         '4': 152,
         'e': 151,
         'that': 145,
         'we': 145,
         'ii': 144,
         'as': 138,
         'was': 135,
         'this': 131,
         '3': 131,
         'hellenistic': 130,
         'days': 129,
         'obv': 126,
         'no': 124,
         '6': 124,
         'but': 119,
         'by': 116,
         'an': 111
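
The counts above are dominated by function words and bare numbers. One possible next step, sketched below on a made-up sentence, is to tokenize with `re.findall` (which avoids the empty strings that `re.split` can leave at the edges) and to filter the tokens before counting. The filter used here (alphabetic tokens longer than two characters) is an assumption chosen for illustration, not a recommendation from the lesson.

```python
import re
from collections import Counter

# Made-up sample sentence; in the notebook this would be english_corpus.
sample = "The cults of Uruk and Babylon: the temple ritual, 2004."

# re.split leaves an empty string when the text ends with a non-word character.
split_tokens = re.split(r'\W+', sample.lower())

# re.findall returns only the matches, so no empty strings appear.
found_tokens = re.findall(r'\w+', sample.lower())

# Hypothetical filter: keep alphabetic tokens longer than two characters.
kept = [t for t in found_tokens if t.isalpha() and len(t) > 2]
counts = Counter(kept)
print(counts.most_common(3))
```

A filter like this will not repair OCR artifacts such as `ofthe`, which would need a separate cleaning pass.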

In [34]:
# need more text analysis ideas

## Putting translations/transliterations into a searchable format

In [9]:
# Goal: show the texts side by side so that they are easy to compare.
# This probably calls for a function that:
#   - takes a list of items to display,
#   - renders each item as an inline-block div in HTML,
#   - displays the result and returns the list.

# display and HTML are needed to render the HTML output in the notebook
from IPython.display import display, HTML

CSS = """
#wrapper {
  width:100%;
  clear:both;
  display: flex;
}
#left1 {
  background-color: #ccabb5;
  width:33%;
  float:left;
  padding: .5vw;
  border-right: solid black 1.5px;
}
#middle1 {
  background-color: lightgray;
  width:33%;
  float:left;
  padding: .5vw;
}
#right1 {
  background-color: #cfc5c9;
  width:33%;
  float:left;
  padding: .5vw;
  border-left: solid black 1.5px;
}
"""
txt1 = 'the dog went to the store, I want to figure out whats gonna happend when i reach the bottom of this,the dog went to the store, I want to figure out whats gonna happend when i reach the bottom of this'
txt2 = 'me, i just followed him to the end of the street before i got tired and decided that it would be best if i went back home. I trust his decision making skills'

body ='<div id="wrapper"><div id="left1">Akkadian<br>{}</div><div id="middle1">Translation<br></div><div id="right1">Morph Analyzer<br>{}</div></div>'.format(txt1,txt2)
HTML('<style>{}</style> <body>{}</body>'.format(CSS, body))

In [39]:
for line in [line for line in no_linebreaks[:1500] if (len(line) > 0) and re.search(r'^\d+\s+.+', line)]:
    det = Detector(line,quiet=True)
    lft = str(det.language.name)+ ' ' +str(det.language.read_bytes) + ' ' + str(det.language.confidence)
    rght = line
    sbs(lft, rght)

In [40]:
detector = Detector(read_text[50000:51000])
txt1 = '(' + str(detector.language.name)+ ') (' +str(detector.language.read_bytes) + ') (' + str(detector.language.confidence) + ')'
txt2 = read_text[50000:51000]
# use the ids defined in CSS above (left1, right1) so that the styles apply
body ='<div id="wrapper"><div id="left1">(lang) (bytes) (confidence)<br>{}</div><div id="right1">Text<br>{}</div></div>'.format(txt1,txt2)
display(HTML('<style>{}</style> <body>{}</body>'.format(CSS, body)))

## Table of Contents

In [None]:
read_text

In [None]:
read_text[4195:]

In [None]:
# commas inside [...] would match literal commas, so they are left out
re.findall(r'[1IV]{1,3}\.\d?\.?\d?.+\n\n', read_text[4195:])

In [None]:
# remember that if it seems like your answer
# is getting overly complicated, there's probably a shorter
# and more elegant answer

toc = re.findall(r'.*\.\..*', read_text[4195:13122])
toc

In [None]:
# `rows` comes from the cell that splits each TOC line (shown below); run that cell first
Table(['A', 'B']).with_rows(rows).show(75)

In [None]:
# escape the dot so that it matches a literal '.' rather than any character
re.findall(r'111\.6', read_text)

In [None]:
rows = [re.split(r'\s*\.{2,}\s*\D*', line) for line in toc]
rows
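
As an alternative to splitting, each line can be matched against one pattern that captures the title and the page number together. The sketch below runs on made-up lines in the style of a dot-leader table of contents; the sample titles and page numbers are assumptions, not the book's actual entries.

```python
import re

# Made-up lines in the style of a dot-leader table of contents.
sample_toc = (
    "I. Introduction ........ 1\n"
    "II. The ceremonies ..... 23\n"
    "A line with no dot leader\n"
)

# Keep only lines that contain at least two consecutive dots.
toc_lines = re.findall(r'.*\.\..*', sample_toc)

toc_rows = []
for line in toc_lines:
    # Title (lazy), then the dot leader, then the page number at the end of the line.
    m = re.match(r'(.*?)\s*\.{2,}\s*(\d+)\s*$', line)
    if m:
        toc_rows.append((m.group(1), int(m.group(2))))

print(toc_rows)
```

Capturing the page number as an integer makes the rows easier to sort and compare later on.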

In [None]:
re.search('343', read_text)

In [None]:
# escape the dot so that it matches a literal '.' rather than any character
re.search(r'I\. Introduction', read_text)

In [None]:
# select out the toc from the beginning of the work

## Search by Table of Contents

In [None]:
# use the above table of contents to search for pages that fit within a certain section of the TOC
# maybe get OOP and make it a tool for them
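
One way the idea in the comments above could be approached is a small class that maps a page number to the section containing it. This is only a sketch: the class name `TableOfContents`, its methods, and the sample rows are all hypothetical, and it assumes the table of contents is already available as `(title, start_page)` pairs.

```python
from bisect import bisect_right


class TableOfContents:
    """Map page numbers to section titles (hypothetical helper)."""

    def __init__(self, rows):
        # rows: iterable of (title, start_page); sorted by start page here.
        self.rows = sorted(rows, key=lambda row: row[1])
        self.starts = [page for _, page in self.rows]

    def section_for_page(self, page):
        """Return the title of the section containing `page`, or None."""
        i = bisect_right(self.starts, page) - 1
        return self.rows[i][0] if i >= 0 else None

    def page_range(self, title):
        """Return (first_page, last_page); last_page is None for the final section."""
        for i, (name, start) in enumerate(self.rows):
            if name == title:
                end = self.starts[i + 1] - 1 if i + 1 < len(self.rows) else None
                return start, end
        raise KeyError(title)


# Made-up rows for illustration.
toc_index = TableOfContents([('I. Introduction', 1), ('II. The ceremonies', 23)])
print(toc_index.section_for_page(10))
print(toc_index.page_range('I. Introduction'))
```

Binary search over the sorted start pages keeps the lookup short, and the page range of a section falls out of the start page of the section after it.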

## Searchable Pages

In [None]:
# average characters per page, so within that range?
# once you've found one, split there, then move onto the next
# if not within that range plus or minus some error amount, 
# include that text, but move one more down in the range
len(read_text) / 69

In [None]:
# this needs a lot of work

# TODO: change so that it only searches within a range of pages,
# then determines which character would be the end of the page

pg_breakdown = []
temp = read_text
for i in range(20):
    print(re.search('.{,20}' + str(i) + '.{,20}', temp))
    parts = re.split('(' + str(i) + ')', temp, maxsplit=1)
    if len(parts) < 3:
        break  # page number i does not occur in the remaining text
    pg_breakdown.append(parts[0])
    temp = parts[2]
pg_breakdown
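
The windowed search described in the comments above could be sketched roughly as follows. This is not a finished solution: it assumes each page number sits alone on its own line at the bottom of its page, and the function name `split_pages` and its parameters are hypothetical. A page number is accepted only if it is found close to where the average page length predicts it; otherwise it is skipped and its text stays attached to the next page that is found.

```python
import re


def split_pages(text, page_numbers, avg_len, tolerance):
    """Split `text` at page numbers that appear alone on a line.

    Returns (pages, rest), where pages is a list of (number, page_text)
    pairs and rest is whatever follows the last accepted page number.
    """
    pages = []
    start = 0    # index where the current page begins
    missed = 0   # page numbers skipped since the last accepted break
    for number in page_numbers:
        # Expected position of this page number, allowing for skipped pages.
        expected = start + avg_len * (missed + 1)
        pattern = re.compile(r'^[ \t]*{}[ \t]*$'.format(number), re.MULTILINE)
        match = pattern.search(text, max(expected - tolerance, start))
        if match is None or match.start() > expected + tolerance:
            missed += 1
            continue
        pages.append((number, text[start:match.start()]))
        start = match.end()
        missed = 0
    return pages, text[start:]


# Made-up text with three short "pages".
demo = "aaaa aaaa\n1\nbbbb bbbb\n2\ncccc cccc\n3\n"
demo_pages, demo_rest = split_pages(demo, [1, 2, 3], avg_len=10, tolerance=5)
print([number for number, _ in demo_pages])
```

Requiring the number to stand alone on a line avoids matching digits inside running text (for example the `0` in `2004`), which the simpler loop above would split on.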

## Pull out all of the tables

In [None]:
# see page 56 for an example of a table

In [None]:
re.findall('.{,20}Table I.{,20}', read_text)

In [None]:
re.search('Table I', read_text)

In [None]:
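# show a window of text starting at this offset (the 500-character length is an arbitrary choice)
print(read_text[140460:140460 + 500])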

In [None]:
read_text[140460:].split('\n')