# Setting up Tsakorpus on the Corpus Server
*Agustin Lorenzo, agustin.lorenzo@uga.edu*

These are the steps taken to upload annotated ELIC files to a corpus hosted on the UGA corpus server. The corpus is built with the open-source [Tsakorpus framework](https://github.com/timarkh/tsakorpus). The steps follow the [docs](https://tsakorpus.readthedocs.io/en/latest/), with additions specific to the corpus server (e.g. changing the necessary code and downloading appropriate software versions). Everything is done through terminal commands, since the server is most likely accessed over SSH.

If there are any questions or issues with setup, feel free to contact me for help at the email above.

### Format
All `Shell Script` code blocks are commands run in the terminal (or the expected terminal output of a given command). All `Python` or `raw` code blocks show what should go in the file opened by the preceding shell command. For example:

In [None]:
emacs file.txt # open a file with a text editor

## 1. Setup

Create a `venv` environment for Python packages

In [None]:
python -m venv korpusenv
source korpusenv/bin/activate
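
Before installing any packages, it can help to confirm that the environment is actually active. The following is a minimal sketch; `in_virtualenv` is a hypothetical helper name and not part of Tsakorpus.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("venv active:", in_virtualenv(), "| prefix:", sys.prefix)
```

If this prints `False` after activation, the `source` command most likely did not run in the current shell.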

Download the Tsakorpus repository

In [None]:
git clone https://github.com/timarkh/tsakorpus.git
cd tsakorpus/

Change `corpus.json` settings in the `/conf/` folder to match these settings

In [None]:
cd conf
rm corpus.json
emacs corpus.json

## 2. Converting `.eaf` files to `.json`

Setup conversion settings

In [None]:
cd ~/tsakorpus/src_convertors/
mkdir corpus
mv conf_conversion corpus/conf_conversion
cd corpus/conf_conversion
emacs conversion_settings.json 

Replace the `conversion_settings.json` file contents with these settings:

In [None]:
emacs categories.json # create categories file (still inside `corpus/conf_conversion`)

The `categories.json` can just match the `categories.json` file in the `tsakorpus/conf` directory (for now)
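
Instead of retyping the file, it could be copied over. This is a sketch only, assuming the repository layout used in this guide (`conf/` and `src_convertors/corpus/conf_conversion/` under the repository root); `copy_categories` is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def copy_categories(repo_root: str) -> Path:
    """Copy conf/categories.json into the conversion settings folder."""
    root = Path(repo_root).expanduser()
    src = root / "conf" / "categories.json"
    dst_dir = root / "src_convertors" / "corpus" / "conf_conversion"
    # Create the target folder if the earlier steps were skipped.
    dst_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(src, dst_dir / "categories.json"))
```

For example, `copy_categories("~/tsakorpus")` would place the copy next to `conversion_settings.json`.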

In [None]:
cd .. # go back up to `corpus` directory
mkdir eaf

Add all .eaf files and their corresponding audio files to the `src_convertors/corpus/eaf` directory

In [None]:
scp /path/on/local/to/eaf/file.eaf username@corpus.uga.edu:tsakorpus/src_convertors/corpus/eaf/
scp /path/on/local/to/eaf/file.wav username@corpus.uga.edu:tsakorpus/src_convertors/corpus/eaf/
# repeat for all .eaf files

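Ensure that `ffprobe` is installed (for cutting media into clips). The static build downloaded below ships both the `ffmpeg` and `ffprobe` binaries. The `export PATH` line only affects the current shell session, so it has to be repeated (or added to your shell profile) in new sessions.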

In [None]:
pip install ffprobe
mkdir -p $HOME/bin
cd $HOME/bin
wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
tar -xf ffmpeg-release-amd64-static.tar.xz
cd ffmpeg-*-static
export PATH=$PWD:$PATH
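
To confirm that the binaries are visible on `PATH` before running the conversion, a small check like the one below may be useful. It is a sketch; `find_tools` is a hypothetical helper name.

```python
import shutil


def find_tools(names=("ffmpeg", "ffprobe")):
    """Map each tool name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```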

Navigate to `src_convertors` and convert eaf files to json with `eaf2json.py`

In [None]:
cd ~/tsakorpus/src_convertors/
python eaf2json.py

Confirm that the conversion worked by checking that `corpus/json` contains the converted documents and `corpus/media` contains the media clips

In [None]:
ls corpus/json
ls corpus/media
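
Listing the folder only shows that files exist. A slightly stronger check is to make sure each file parses as JSON. The sketch below assumes that a converted Tsakorpus document is a JSON object with a top-level `sentences` key; that assumption and the helper name `check_converted` are mine, so adjust them if your output differs.

```python
import json
from pathlib import Path


def check_converted(json_dir):
    """Return (good, bad) lists of file names found under json_dir."""
    good, bad = [], []
    for path in sorted(Path(json_dir).rglob("*.json")):
        try:
            with open(path, encoding="utf-8") as f:
                doc = json.load(f)
        except (OSError, ValueError):
            # Unreadable file or invalid JSON.
            bad.append(path.name)
            continue
        # Assumed document shape: {"meta": {...}, "sentences": [...]}
        if isinstance(doc, dict) and "sentences" in doc:
            good.append(path.name)
        else:
            bad.append(path.name)
    return good, bad
```

Calling `check_converted("corpus/json")` from `src_convertors` would report any file that failed to convert cleanly.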

Move converted files to correct final folders (to be found by indexator later)

In [None]:
cd ~/tsakorpus
mkdir corpus
cd corpus
mkdir elic
cd ..
mv -v src_convertors/corpus/json/* corpus/elic/

cd search
mkdir media
cd media
mkdir elic
cd ../..
mv -v src_convertors/corpus/media/* search/media/elic/

## 3. Running the indexator

Download Elasticsearch

In [None]:
pip install elasticsearch
cd ..
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-9.0.0-linux-x86_64.tar.gz
tar -xvzf elasticsearch-9.0.0-linux-x86_64.tar.gz
rm elasticsearch-9.0.0-linux-x86_64.tar.gz

Change Elasticsearch configuration (in `elasticsearch-9.0.0/config/`)

In [None]:
cd elasticsearch-9.0.0/config
emacs elasticsearch.yml

In [None]:
# ...
# change settings at bottom of .yml file

# Enable security features 
xpack.security.enabled: false # CHANGED TO FALSE (for now!)

xpack.security.enrollment.enabled: false # CHANGED TO FALSE (for now!)

# Enable encryption for HTTP API client connections, such as Kibana, Logstash, and Agents
xpack.security.http.ssl:
  enabled: true
  keystore.path: certs/http.p12

# Enable encryption and mutual authentication between cluster nodes
xpack.security.transport.ssl:
  enabled: true
  verification_mode: certificate
  keystore.path: certs/transport.p12
  truststore.path: certs/transport.p12
# Create a new cluster with the current node only
# Additional nodes can still join the cluster later
cluster.initial_master_nodes: ["kucera.ling.uga.edu"]

# Allow HTTP API connections from anywhere
# Connections are encrypted and require user authentication
http.host: 0.0.0.0

# Allow other nodes to join the cluster from anywhere
# Connections are encrypted and mutually authenticated
#transport.host: 0.0.0.0

#----------------------- END SECURITY AUTO CONFIGURATION -------------------------

xpack.ml.enabled: false # DISABLED ML FUNCTIONALITY 

Open a new terminal window and run Elasticsearch. Keep it running for the rest of the setup.

In [None]:
cd elasticsearch-9.0.0
./bin/elasticsearch
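
Once the server has finished starting, you can check from another terminal that it answers on the default port. This is a sketch using only the standard library; it assumes security is disabled as configured above, and `es_version` is a hypothetical helper name.

```python
import http.client
import json
import urllib.request


def es_version(url="http://localhost:9200", timeout=5):
    """Return the server's version string, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            info = json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or a non-JSON reply.
        return None
    if not isinstance(info, dict):
        return None
    # The root endpoint reports the version under "version" -> "number".
    version = info.get("version")
    return version.get("number") if isinstance(version, dict) else None


if __name__ == "__main__":
    print("Elasticsearch version:", es_version())
```

A result of `None` usually means the server is still starting or is listening on a different port.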

Edit `indexator.py` so that the script is given the exact location of the Elasticsearch server

In [None]:
# ...
# find indexator class and change line at bottom
class Indexator:
    """
    Contains methods for loading the JSON documents in the corpus
    database.
    """
    SETTINGS_DIR = '../conf'
    MAX_MEM_DICT_SIZE = 100000
    rxBadFileName = re.compile('[^\\w_.-]*', flags=re.DOTALL)

    def __init__(self, overwrite=False):
        random.seed(datetime.now().timestamp())
        self.overwrite = overwrite  # whether to overwrite an existing index without asking
        with open(os.path.join(self.SETTINGS_DIR, 'corpus.json'),
                  'r', encoding='utf-8') as fSettings:
            self.settings = json.load(fSettings)
        self.j2h = JSON2HTML(settings=self.settings)
        self.name = self.settings['corpus_name']
        self.languages = self.settings['languages']
        if len(self.languages) <= 0:
            self.languages = [self.name]
        self.input_format = self.settings['input_format']
        self.corpus_dir = os.path.join('../corpus', self.name)
        self.lowerWf = False
        if 'wf_lowercase' not in self.settings or self.settings['wf_lowercase']:
            self.lowerWf = True
        self.iterSent = None
        if self.input_format in ['json', 'json-gzip']:
            self.iterSent = JSONDocReader(format=self.input_format,
                                          settings=self.settings)

        # Make sure only commonly used word fields and those listed
        # in corpus.json get into the words index.
        self.goodWordFields = [
            'lex',          # lemma
            'wf',           # word form (for search)
            'wf_display',   # word form (for display; optional)
            'parts',        # morpheme breaks in the word form
            'gloss',        # glosses (for display)
            'gloss_index',  # glosses (for search)
            'n_ana'         # number of analyses
        ]
        self.additionalWordFields = set()
        self.additionalLemmaFields = set()
        self.excludeFromDict = {}
        if 'word_fields' in self.settings:
            self.additionalWordFields |= set(self.settings['word_fields'])
        if 'word_table_fields' in self.settings:
            self.additionalWordFields |= set(self.settings['word_table_fields'])
        if 'lemma_table_fields' in self.settings:
            self.additionalLemmaFields = set(self.settings['lemma_table_fields'])
        if 'accidental_word_fields' in self.settings:
            self.additionalWordFields -= set(self.settings['accidental_word_fields'])
            self.additionalLemmaFields -= set(self.settings['accidental_word_fields'])
        if 'exclude_from_dict' in self.settings:
            self.excludeFromDict = {k: re.compile(v) for k, v in self.settings['exclude_from_dict'].items()}
        f = open(os.path.join(self.SETTINGS_DIR, 'categories.json'),
                 'r', encoding='utf-8')
        categories = json.loads(f.read())
        f.close()
        self.goodWordFields += ['gr.' + v for lang in categories
                                for v in categories[lang].values()]
        self.goodWordFields = set(self.goodWordFields)
        self.characterRegexes = {}

        self.pd = PrepareData()

        # Initialize Elasticsearch connection
        self.es = None
        if 'elastic_url' in self.settings and len(self.settings['elastic_url']) > 0:
            # Connect to a non-default URL or supply username and password
            self.es = Elasticsearch([self.settings['elastic_url']], timeout=60)
        else:
            # ORIGINAL LINE: self.es = Elasticsearch(timeout=60)
            # UPDATED LINE below: the most recent version of the library needs
            # the exact location of the server (9200 is the default port).
            self.es = Elasticsearch(hosts=["http://localhost:9200"], timeout=60)
        self.es_ic = IndicesClient(self.es)
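
The change boils down to always handing the client an explicit URL. As far as I know, recent versions of the `elasticsearch` Python client no longer fall back to a default host, which is why the bare constructor call fails. The fallback logic can be sketched on its own like this; `resolve_es_hosts` and `DEFAULT_ES_URL` are hypothetical names, not part of Tsakorpus.

```python
DEFAULT_ES_URL = "http://localhost:9200"  # default Elasticsearch HTTP port


def resolve_es_hosts(settings: dict) -> list:
    """Pick the URL list to pass as `hosts=` to the Elasticsearch client."""
    url = settings.get("elastic_url") or ""
    # Use the configured URL when present, otherwise the local default.
    return [url] if url else [DEFAULT_ES_URL]
```

Setting `elastic_url` in `corpus.json` should make the code edit unnecessary, because the first branch of the original `if` already passes an explicit URL.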

Go back to the first terminal window and run the indexator

In [None]:
# install any python packages that might still be needed after installing requirements.txt
pip install ijson
pip install werkzeug
pip install jinja2
pip install flask
pip install sqlitedict

In [None]:
cd ~/tsakorpus/indexator # the script reads its settings from ../conf
python indexator.py

## 4. Open corpus locally

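Repeat the previous code change for the web app, which is started through `tsakorpus.wsgi`. The connection is set up in `search/search_engine/client.py`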

In [None]:
# open the search client module
cd ~/tsakorpus
cd search
emacs search_engine/client.py

In [None]:
# ...
# find search client and change code at bottom
class SearchClient:
    """
    Contains methods for querying the corpus database.
    """

    def __init__(self, settings_dir, settings):
        self.settings = settings
        self.name = self.settings.corpus_name
        esTimeout = max(20, self.settings.query_timeout)
        self.es = None
        if self.settings.elastic_url is not None and len(self.settings.elastic_url) > 0:
            # Connect to a non-default URL or supply username and password
            self.es = Elasticsearch([self.settings.elastic_url], timeout=esTimeout)
        else:
            self.es = Elasticsearch(hosts=["http://localhost:9200"], timeout=esTimeout) # same change as indexator.py, add `hosts` parameter

Make changes in `search/web_app/translations` directory to fix naming on final webpage

In [None]:
cd ~/tsakorpus/search/web_app/translations/en
emacs corpus-specific.txt

In [None]:
emacs languages.txt

Run `tsakorpus.wsgi`

In [None]:
# install any needed packages
pip install xlsxwriter==3.0.9
pip install flask_babel

In [None]:
cd ~/tsakorpus/search
python tsakorpus.wsgi

Check the output and note the port number in the `Running on` line

In [None]:
compiling catalog translations_pybabel/en/LC_MESSAGES/messages.po to translations_pybabel/en/LC_MESSAGES/messages.mo
compiling catalog translations_pybabel/ru/LC_MESSAGES/messages.po to translations_pybabel/ru/LC_MESSAGES/messages.mo
compiling catalog translations_pybabel/fr/LC_MESSAGES/messages.po to translations_pybabel/fr/LC_MESSAGES/messages.mo
Interface translations compiled.
 * Serving Flask app 'web_app' (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: on
 * Running on all addresses.
   WARNING: This is a development server. Do not use it in a production deployment.
 * Running on http://172.22.162.16:7342/ (Press CTRL+C to quit) # port is 7342

Open another terminal window on your local machine and connect over SSH, forwarding the port number from the output above

In [None]:
ssh -L 7342:localhost:7342 username@corpus.uga.edu
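
The port can change between runs, so it may be handy to pull it out of the server output and build the tunnel command from it. This is a sketch; `parse_host_port` and `tunnel_command` are hypothetical helper names, and `username` is a placeholder as in the command above.

```python
import re

# Matches the Flask line " * Running on http://HOST:PORT/ ..."
RUNNING_ON = re.compile(r"Running on https?://([^:/\s]+):(\d+)")


def parse_host_port(output: str):
    """Return (host, port) from the server output, or None if absent."""
    match = RUNNING_ON.search(output)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def tunnel_command(port: int, user="username", server="corpus.uga.edu") -> str:
    """Build the SSH port-forwarding command for the given port."""
    return f"ssh -L {port}:localhost:{port} {user}@{server}"
```

With the output shown earlier, `parse_host_port` yields port `7342`, and `tunnel_command(7342)` reproduces the SSH command above.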

Go to the following URL in your web browser - if everything was done correctly, you should see the corpus webpage


The Tsakorpus creator also allowed for a different way to configure the corpus with the `corpus.json` file with a GUI accessible through 

If you set the config settings this way, be sure to move the new `.json` file to the correct location (`~/tsakorpus/conf`) as described in the [docs](https://tsakorpus.readthedocs.io/en/latest/configuration.html)

## Additional Changes
Here are extra changes that were made after the initial setup for bug fixes, changing configs, etc.

### Adding speaker to metadata menu

In [None]:
cd ~
cd tsakorpus/search/web_app
emacs response_processors.py

In [None]:
# find this function:
def process_sentence_header(self, sentSource, format='html', curLocale=''):
    """
    Build the little Metadata pop-up for each sentence in search results.
    We fetch the document-level metadata, overlay the sentence-level
    metadata from sentSource, then render the modal table.
    """
    # and replace its content with this below:
    docID = sentSource['doc_id']
    raw = self.sc.get_doc_by_id(docID)
    if (raw is None
        or 'hits' not in raw
        or 'hits' not in raw['hits']
        or len(raw['hits']['hits']) == 0
        or '_source' not in raw['hits']['hits'][0]):
        # nothing to show
        if format == 'csv':
            return ['']
        else:
            return render_template(
                'search_results/sentence_header.html',
                fulltext_view_enabled=False
            )

    source = raw['hits']['hits'][0]['_source']
    doc_meta = source.get('meta', {}).copy()
    sent_meta = sentSource.get('meta', {})
    if isinstance(sent_meta, dict):
        doc_meta.update(sent_meta)

    for k,v in list(doc_meta.items()):
        if isinstance(v, list):
            doc_meta[k] = '; '.join(v)
        elif isinstance(v, int):
            doc_meta[k] = str(v)

    dateDisplay = ''
    if 'year' in doc_meta:
        dateDisplay = str(doc_meta['year'])
    elif 'year_from' in doc_meta and 'year_to' in doc_meta:
        dateDisplay = str(doc_meta['year_from'])
        if doc_meta['year_to'] != doc_meta['year_from']:
            dateDisplay += '–' + str(doc_meta['year_to'])

    curLocale = '_' + curLocale
    if (self.settings.localized_meta_values
        and len(curLocale) > 1):
        for key in list(doc_meta.keys()):
            if key.endswith(curLocale):
                base = key[:-len(curLocale)]
                if base in self.settings.localized_meta_values:
                    doc_meta[base] = doc_meta[key]
                    del doc_meta[key]

    metaHtml = render_template(
        'modals/metadata_table.html',
        data={'meta': doc_meta},
        viewable_meta=self.settings.viewable_meta,
        sentence_meta=self.settings.sentence_meta
    )
    metaHtml = html.escape(metaHtml)

    if format == 'csv':
        result = ['']
        if 'title' in doc_meta:
            result[0] += '"' + doc_meta['title'] + '" '
        else:
            result[0] += '"???" '
        if (self.authorMeta in doc_meta
            and doc_meta[self.authorMeta]):
            result[0] += '(' + doc_meta[self.authorMeta] + ') '
        if 'issue' in doc_meta and doc_meta['issue']:
            result[0] += doc_meta['issue'] + ' '
        if dateDisplay:
            result[0] += '[' + dateDisplay + ']'
        meta = {
            self.rxKW.sub('', k): v
            for k,v in doc_meta.items()
            if (self.rxKW.sub('', k) in self.settings.viewable_meta
                and k not in ['filename','filename_kw'])
        }
        sortedFields = [
            f for f in self.settings.viewable_meta
            if f not in ['filename','filename_kw']
                and f not in self.settings.sentence_meta
        ]
        for f in sortedFields:
            val = meta.get(f,'').replace('\t',' ')
            new = f'[{f}: {val}]'
            if new not in result:
                result.append(new)
        return result

    return render_template(
        'search_results/sentence_header.html',
        fulltext_view_enabled=self.settings.fulltext_view_enabled,
        author_meta=self.authorMeta,
        date_display=dateDisplay,
        metaHtml=metaHtml,
        meta=doc_meta
    )
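
The core of this change is the overlay step: sentence-level metadata (where the speaker lives) is laid over the document-level metadata before rendering. A standalone sketch of that step is shown below. `merge_meta` is a hypothetical name, the sample keys are made up, and unlike the function above it converts list items with `str()` before joining, which is my own defensive tweak.

```python
def merge_meta(doc_meta: dict, sent_meta) -> dict:
    """Overlay sentence metadata on document metadata and flatten values."""
    merged = dict(doc_meta)
    if isinstance(sent_meta, dict):
        # Sentence-level values win over document-level ones.
        merged.update(sent_meta)
    for key, value in list(merged.items()):
        if isinstance(value, list):
            merged[key] = "; ".join(str(v) for v in value)
        elif isinstance(value, int):
            merged[key] = str(value)
    return merged


if __name__ == "__main__":
    doc = {"title": "Session 1", "year": 2024}
    sent = {"speaker": ["A", "B"]}
    print(merge_meta(doc, sent))
```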


### Bug Fixes

Navigate to `~/tsakorpus/search/search_engine/query_parsers.py`, find the `make_simple_term_query()` function, and change the return statement

In [None]:
def make_simple_term_query(self, text, field, lang, keyword_query=False, rewrite=True):
    """
    Make a term query that will become one of the inner parts
    of the compound bool query. Recognize simple wildcards and regexps.
    If the field is "ana.gr", find categories for every gramtag. If no
    category is available for some tag, return empty query.
    """
    # . . .
    return {'match': {field: text}} # ADDED CHANGE: changed from {} to {field: text}
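
To see what the corrected return value looks like on the wire, here is a small sketch of the query body it produces. `simple_match` is a hypothetical stand-in for the final line of the method, and the field name `wf` is only an example.

```python
import json


def simple_match(field: str, text: str) -> dict:
    """Build a match clause keyed by the field being searched."""
    return {"match": {field: text}}


if __name__ == "__main__":
    # Print the clause as the JSON that would be sent to Elasticsearch.
    print(json.dumps(simple_match("wf", "dog")))
```

An empty `{'match': {}}` names no field, which is why the original return value could not work as a query.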

Locate the `html2es()` function in the same file and make the following changes

In [None]:
def html2es(self, htmlQuery, page=1, query_size=10, sortOrder='random',
                randomSeed=None, searchOutput='sentences', groupBy='word',
                distances=None, includeNextWordField=False,
                after_key=None, highlight=True):
        """
        Make and return a ES query out of the HTML form data.
        """
        # --- ADDED BLOCK: Convert numeric lang1 to language name ---
        if 'lang1' in htmlQuery and htmlQuery['lang1'].isdigit():
            idx = int(htmlQuery['lang1'])
            if idx < len(self.settings.languages):
                htmlQuery['lang1'] = self.settings.languages[idx]
        # -----------------------------------------------------------        
        # --- ADDED BLOCK: Handle generic query parameter "q" ---
        if 'q' in htmlQuery and 'n_words' not in htmlQuery:
            htmlQuery['n_words'] = "1"
            htmlQuery['wf1'] = htmlQuery['q']
        # ----------------------------------------------------------
        query_from, langID, lang, searchIndex =\
            self.check_html_parameters(htmlQuery, page, query_size, searchOutput)
        if query_from is None:
            return {'query': {'match_none': {}}} # ADDED CHANGE: changed from {'match_none': ''} to {'match_none': {}}
        # . . . 
        # . . . navigate to bottom of function
        # . . . 
        else:
            queryDict = {'query': {'match_none': {}}} # ADDED CHANGE: changed from {'match_none': ''} to {'match_none': {}}
        return queryDict
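
The two added blocks only normalise the incoming form data, so they can be tried out in isolation. The sketch below mirrors them on a copy of the query; `normalize_query` is a hypothetical name and `lang_a` is a made-up language name.

```python
def normalize_query(html_query: dict, languages: list) -> dict:
    """Apply the lang1 and q normalisation to a copy of the form data."""
    query = dict(html_query)
    lang = query.get("lang1")
    # A numeric lang1 is treated as an index into the language list.
    if isinstance(lang, str) and lang.isdigit():
        idx = int(lang)
        if idx < len(languages):
            query["lang1"] = languages[idx]
    # A bare "q" parameter becomes a one-word word-form search.
    if "q" in query and "n_words" not in query:
        query["n_words"] = "1"
        query["wf1"] = query["q"]
    return query


if __name__ == "__main__":
    print(normalize_query({"lang1": "0", "q": "dog"}, ["lang_a"]))
```

An out-of-range index is left untouched, so the existing parameter check still gets to reject it.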

Navigate to `~/tsakorpus/search/web_app/response_processors.py` and make the following changes to `transliterate_baseline()`

In [None]:
def transliterate_baseline(self, text, lang, translit=None):
    if translit is None or lang not in self.settings.languages:
        return text
    # ADDED CHANGE: fixing error by making sure 'text' is actually a string
    if not text:
        text = ''
    elif not isinstance(text, (str, bytes)):
        text = str(text)
    # end of added change
    # . . . 
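
The guard can be checked on its own to see which inputs it changes. `coerce_text` below is a hypothetical standalone copy of the added lines.

```python
def coerce_text(text):
    """Return a value that is safe to treat as text."""
    if not text:
        # None, '' and other falsy values (including 0) become ''.
        return ""
    if not isinstance(text, (str, bytes)):
        return str(text)
    return text


if __name__ == "__main__":
    for value in (None, 42, "abc", ["a", "b"]):
        print(repr(value), "->", repr(coerce_text(value)))
```

Note that a `bytes` value passes through unchanged, exactly as in the patched method.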