### General setup
1. We discuss what collation is.
1. We discuss what CollateX is.
1. We discuss what Alignment is.
1. How to install CollateX
1. Exercise here?


## What is collation?

Collation at its most basic level means the comparison of two or more texts (literally: “placing/laying side by side”). Over the centuries, textual scholars have collated texts with different goals in mind. As a consequence, they have not always had a similar understanding of “collation”. In the following paragraphs we will go over the various objectives of collation and see how these were affected by the editor’s orientation.

In general, texts are collated for three reasons: 
1. to track the transmission of a text;
2. to get as close as possible to the original text; 
3. to establish a critical or final text. This could also mean the generation of a list of variants and/or a critical apparatus.

### What is CollateX?

<img src="images/collatex_screenshot.png" width="70%"/>
CollateX is available in Java (http://collatex.net) and Python (https://pypi.python.org/pypi/collatex) versions. The Python version, which is the focus of this workshop, provides hooks for user modification of the Tokenization and Normalization stages of the Gothenburg collation model, and supports output as a plain text table, HTML, SVG variant graph, GraphML, generic XML, and TEI parallel segmentation XML. The materials for this workshop provide tutorial information on all aspects of using the Python version of CollateX.


### Alignment

As part of the collation of textual variants, alignment is the process of determining which tokens in one witness should be regarded as parallel to which tokens in another. Alignment thus presupposes [tokenization](week_2_day_1_tokenization.md). Furthermore, texts may be [normalized](week2_day_1_normalization.md) before alignment as a way of treating as equivalent readings that are not string-equal. Normalization may be implemented in the text itself, completely leveling differences that may have been present originally, or it may be performed on shadow copies of the tokens, which lets the alignment process treat different readings as equivalent without irretrievably erasing evidence of the differences.
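
The shadow-copy approach described above can be sketched in a few lines of Python. This is a minimal illustration, not CollateX's own code: the functions `tokenize`, `normalize`, and `shadow_tokens` are hypothetical helpers, the normalization rule (ignore case and punctuation) is an assumption chosen for the example, and the key names `"t"` (original reading) and `"n"` (normalized reading) are illustrative choices.

```python
import re

def tokenize(text):
    """Split a witness into word tokens, keeping trailing whitespace with each token."""
    return re.findall(r"\S+\s*", text)

def normalize(token):
    """Hypothetical normalization rule: ignore case, whitespace, and punctuation."""
    return re.sub(r"[^\w]", "", token).lower()

def shadow_tokens(text):
    """Pair each original token ("t") with its normalized shadow copy ("n")."""
    return [{"t": tok, "n": normalize(tok)} for tok in tokenize(text)]

witness_a = shadow_tokens("The Quick, brown fox")
witness_b = shadow_tokens("the quick brown Fox")

# An aligner that compares only the "n" values treats these witnesses as identical ...
print([tok["n"] for tok in witness_a] == [tok["n"] for tok in witness_b])
# ... while the original readings stay available in "t" for later reporting.
print([tok["t"] for tok in witness_a])
```

Because the original reading travels with its normalized copy, nothing is lost: the differences in capitalization and punctuation can still be reported after alignment.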

<img src="../../images/Collation_Aligner.png" align="right"/>The image to the right and the accompanying description of it are taken from <https://wiki.tei-c.org/index.php/Textual_Variance>: 

> Looking at an example, assume that we have three witnesses: the first is comprised of the token sequence (a, b, c, d), the second reads (a, c, d, b) and the third (b, c, d). A collator might align these three witnesses as depicted in a tabular fashion on the right. Each witness occupies a column, matching tokens are aligned in a row, necessary gap tokens as inserted during the alignment process are denoted via a hyphen. Depending from which perspective one interprets this alignment table, one can say for example that the (b) in the second row was ommitted in the second witness or it has been added in the first and the third. A similar statement can be made about (b) in the last row by just inverting the relationship of being added/omitted.

Some alignment decisions cannot be resolved unambiguously even by a human. Given the variants “It’s a big problem” and “It’s big, big problem”, there is no principled way to determine which of the “big” tokens in the second string should be regarded as corresponding to the single “big” token in the first string, and which should be regarded as having no counterpart.
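
The ambiguity can be made concrete with a small sketch that lists every equally good alignment of the two variants. It uses a longest-common-subsequence search as a stand-in for an aligner; this is an assumption made for illustration and is not a description of how CollateX aligns. The function name `all_best_alignments` is hypothetical, and the tokens are assumed to be already lower-cased and stripped of punctuation.

```python
from functools import lru_cache

def all_best_alignments(a, b):
    """Return every longest in-order matching of equal tokens as tuples of (i, j) pairs."""

    @lru_cache(maxsize=None)
    def lcs_len(i, j):
        # Length of the longest common subsequence of a[i:] and b[j:].
        if i == len(a) or j == len(b):
            return 0
        best = max(lcs_len(i + 1, j), lcs_len(i, j + 1))
        if a[i] == b[j]:
            best = max(best, 1 + lcs_len(i + 1, j + 1))
        return best

    def walk(i, j):
        # Collect every path that achieves the optimal length from (i, j).
        if lcs_len(i, j) == 0:
            return {()}
        results = set()
        if a[i] == b[j] and 1 + lcs_len(i + 1, j + 1) == lcs_len(i, j):
            results |= {((i, j),) + rest for rest in walk(i + 1, j + 1)}
        if lcs_len(i + 1, j) == lcs_len(i, j):
            results |= walk(i + 1, j)
        if lcs_len(i, j + 1) == lcs_len(i, j):
            results |= walk(i, j + 1)
        return results

    return sorted(walk(0, 0))

first = ["it's", "a", "big", "problem"]
second = ["it's", "big", "big", "problem"]

for alignment in all_best_alignments(first, second):
    print([(first[i], j) for i, j in alignment])
```

The search returns two alignments of the same quality: one pairs the single “big” with position 1 of the second witness, the other with position 2. Nothing in the tokens themselves favors one over the other, which is exactly the situation described above.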


#### Python 3 and CollateX installation instructions

## 1. Overview

This tutorial explains how to install Python, CollateX, and Jupyter notebook for use in the DiXiT Workshop
[“Code and collation: training textual scholars”](https://sites.google.com/site/dixitcodingcollation/), Amsterdam, 2–4 November 2016. To avoid delaying the start of the workshop, please install the software in advance (if you get stuck, do as much as you can and we’ll help you finish the process when you arrive).

## 2. Quick start

If you have already installed CollateX, make sure that you have the most recent version by running:

    pip install --upgrade collatex

If not, here are the installation instructions in a nutshell:

1. Make sure that Python 3 is installed, preferably through the Anaconda distribution
1. `pip install collatex`
1. `pip install python-levenshtein` (but see the note below for Windows)
1. Install Graphviz, either through a package manager such as apt-get or MacPorts, or by downloading it from http://www.graphviz.org/Download.php (you will need to accept the license)
1. `pip install graphviz`

If you are not sure what all that means, read on!

## 3. Installation

To run CollateX, you need first to install Python 3 and then the CollateX module, along with some other programs, packages, and modules upon which CollateX depends. Here’s how to do that in Mac OS X, Ubuntu Linux, and Windows. The process described below will probably take between thirty minutes and an hour, depending on how familiar you are with installing programs on your system. The good news is that you only have to do the installation once, and launching CollateX after that will take almost no time. This tutorial assumes that you are running Mac OS X 10.11 or later, Windows 7, 8, or 10, or Ubuntu Linux 14.04 LTS or later. In all of the steps below, if you are prompted to enter your password, you should do so.


### 3.2. Installing CollateX

#### 3.2.1. Using the command line

Once you have installed Python, as described above, you need to install CollateX, along with a few supporting files (libraries). To do this, you will need to work with a command line window. Each operating system makes a terminal available by default, without requiring special installation:

* For Mac OS X: Terminal.app, which you will find in the Applications → Utilities folder.
* For Windows: Windows PowerShell, which you can find from the search box. Windows 10 users who have installed the new Windows bash shell may use that instead.
* For Ubuntu Desktop (Unity): press Ctrl-Alt-T, or type “Terminal” (without the quotation marks) into the Search box in the Dash.

A window will open that displays a command line, a place where you can type instructions to be executed on the computer, with a prompt that might look something like this in the Mac OS X Terminal:

    Taras-Mac:~ tara$

or this in Windows PowerShell:

    PS C:\Users\Tara L Andrews>

or this in a Linux terminal:

    tla@ubuntu:~$

Now you are ready to type the commands that come next.

Windows users: Some of you may have used **cmd.exe** in the past to work at the command line. We recommend PowerShell (or, for Windows 10 users, bash) because it supports many of the commands that have long been in use on Unix-like systems, which makes it easier to follow generic command-line instructions such as those we will be giving in the workshop. If you stick to **cmd.exe**, you do so at your own risk, and the commands described below may not all be available.

#### 3.2.2. The CollateX installation

The easiest way to install CollateX from the command line is with **pip**, a Python package manager. **pip** comes bundled with Anaconda, so you don’t have to install it separately, and you can install CollateX and most of the libraries on which it depends by typing:

    pip install collatex

### 3.3. Installing the Python Levenshtein library

CollateX relies on this library to do near (inexact) matching of words.
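
To give a sense of what near matching measures, here is a plain-Python sketch of the Levenshtein (edit) distance between two words. It is a stand-in for illustration only: the installed library is a compiled implementation, and the hypothetical function `edit_distance` below says nothing about how CollateX decides when two words are close enough to count as a near match.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

print(edit_distance("colour", "color"))    # 1
print(edit_distance("kitten", "sitting"))  # 3
```

A small distance, such as the single deleted letter separating “colour” from “color”, is the kind of evidence that lets an aligner treat two readings as variants of the same word rather than as unrelated tokens.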

#### 3.3.1. For Mac OS X and Linux

Type the following at the command line:

    pip install python-levenshtein

Mac OS users: You may get a popup window telling you that you require the command-line developer tools. If you get this window, choose Install. When the installation is finished, run the command again.

Once this is done, you can check that everything worked by opening a terminal, typing the following command, and hitting the Enter key:

    python -c "import Levenshtein; print('This works.')"

#### 3.3.2. For Windows
Windows users can try one of these precompiled packages, depending on whether their system is 32-bit or 64-bit:
* `pip install http://collatex.obdurodon.org/python_Levenshtein-0.12.0-cp35-none-win32.whl` (if your system is a 32-bit one)
* `pip install http://collatex.obdurodon.org/python_Levenshtein-0.12.0-cp35-none-win_amd64.whl` (if your system is 64-bit)

These files are mirrored from http://www.lfd.uci.edu/~gohlke/pythonlibs/#python-levenshtein. At the time we are writing this tutorial, we’re linking to the Levenshtein files for Python 3.5 (that’s what the “cp35” means in the filenames), which is the current Anaconda version. 
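
If you are unsure which file applies to you, the following sketch asks your Python installation directly. The function name `wheel_tags` is hypothetical, and the mapping to `win32` and `win_amd64` is an assumption based only on the two filenames listed above. Note that what has to match is the Python interpreter you are running, which on a 64-bit Windows system may still be a 32-bit build.

```python
import struct
import sys

def wheel_tags():
    """Return the (python_tag, platform_tag) pair that matches this interpreter."""
    # The size of a pointer tells us whether this Python build is 32-bit or 64-bit.
    bits = struct.calcsize("P") * 8
    python_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    platform_tag = "win_amd64" if bits == 64 else "win32"
    return python_tag, platform_tag

python_tag, platform_tag = wheel_tags()
print("Look for a filename containing", python_tag, "and", platform_tag)
```

If the Python tag printed here is not `cp35`, the two files listed above were not built for your version of Python, and you will need a file from the mirrored site that matches your version.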

Windows users with an installed and configured C++ compiler can try:

    pip install python-levenshtein

As noted, this will succeed only if you have a C++ compiler configured (most Windows users do not).

Once you have installed the package, you can check that everything worked with the following command:

    python -c "import Levenshtein; print('This works.')"

### 3.4. Installing Graphviz

Graphviz is a program for creating graphic representations, including the variant graphs sometimes used in CollateX (see the examples at http://stemmaweb.net/stemmaweb/relation/help/Latin). Graphviz is required by CollateX only for viewing variant graphs. We recommend installing it for the workshop, but you can perform collations without it. Note that in addition to installing Graphviz, all users need to install Python bindings for Graphviz, which is a separate step, described in Section 3.5, below.

#### 3.4.1. Installing Graphviz on Mac OS X

The easiest way to install Graphviz is to download the appropriate installer from the [Graphviz download page](http://www.graphviz.org/Download.php) (you will need to accept the license). On a Mac, this will be the current stable release labeled “mountainlion”. The Graphviz page is often inaccessible; should this happen, you can use the [Internet Archive Wayback Machine](https://web.archive.org/web/20160719021933/http://graphviz.org/).

If the installer refuses to run when you double-click it, then you can do the following:

* Navigate to the installer in your Downloads folder.
* Right-click (or ctrl-click) to bring up the context menu.
* Choose Open.
* When the warning dialog appears, choose Open again.

This is a useful trick to remember for installing any software that you know you want, but that your Mac doesn’t trust.

#### 3.4.2. Installing Graphviz on Ubuntu Linux

Graphviz can be installed from the Terminal on Ubuntu with the command:

    sudo apt-get install graphviz

#### 3.4.3. Installing Graphviz on Windows

The easiest way to install Graphviz on Windows is to download the appropriate installer from the [Graphviz download page](http://www.graphviz.org/Download.php) (you will need to accept the license). The Graphviz page is often inaccessible; should this happen, you can use the [Internet Archive Wayback Machine](https://web.archive.org/web/20160719021933/http://graphviz.org/). On Windows, use the **.msi** file if you can.

<img src="images/graphviz_installation.png" align="right" width="50%"/>When the installer shows the screen in the image on the right, write down the *full and exact folder name* somewhere. When the installer is done, you will need to add this information to your execution path.

1. From the Control Panel, choose System and Security → System → Advanced settings, and then click the Environment variables button near the bottom of the window.
1. Select the entry in the list that says PATH and choose Edit.
1. Scroll all the way to the end of whatever is already there, and add a “;” character (without the quotes), then the exact folder name you copied, and then “\bin” (also without the quotes). In the example above, you would append “;C:\Program Files (x86)\Graphviz2.38\bin” (without the quotes, but with the leading semicolon) to the end of your original path, as in the image below.
<img src="images/edit-path-user.png" width="50%"/>
1. To confirm that the path has been set correctly, close any open PowerShell or bash window you have, open a new one, and run the command `where.exe dot`. Do not leave off the “.exe”! The output should look something like:

        PS C:\Users\Tara L Andrews> where.exe dot
        C:\Program Files (x86)\Graphviz2.38\bin\dot.exe

### 3.5. Installing the Python Graphviz bindings

In addition to Graphviz itself, all users on all operating systems also need to install Python bindings (support) for Graphviz, which you can do at the command line by typing:

    pip install graphviz

Note that the preceding line does not install Graphviz; what it installs is just the Python bindings for Graphviz. You also need to install Graphviz itself, as described in Section 3.4 and its subsections, above.
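
Once all of the steps are done, a short script can confirm that each piece is in place. This is a minimal sketch rather than an official CollateX tool: the function name `check_environment` is hypothetical, and the check only tests whether the three Python modules can be located and whether the Graphviz `dot` program is on your execution path. It does not test whether they work together.

```python
import importlib.util
import shutil

def check_environment():
    """Report which pieces of the workshop setup can be found on this machine."""
    modules = ["collatex", "Levenshtein", "graphviz"]
    report = {name: importlib.util.find_spec(name) is not None for name in modules}
    # Graphviz itself is a separate program, found through the PATH rather than pip.
    report["dot executable"] = shutil.which("dot") is not None
    return report

for item, found in check_environment().items():
    print("{:<15} {}".format(item, "found" if found else "MISSING"))
```

If `graphviz` is reported as found but `dot executable` is missing, you have installed the Python bindings without Graphviz itself (or, on Windows, the path has not been set), which is the distinction drawn in Sections 3.4 and 3.5.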