# Turning text into a CoNLL-U file

To calculate dependency distance, a document has to be split into sentences, tokenized, and parsed. This can be done with [spaCy](http://spacy.io) or [UDPipe](https://ufal.mff.cuni.cz/udpipe). spaCy's main interface is in Python, so we'll use Python for it. UDPipe has an R interface, which we'll use in the second part.

## spaCy and Python
First load spaCy and the English language model:

In [10]:
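import spacy
# 'en' is a shortcut name from spaCy 1.x/2.x; spaCy 3 needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')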

Next, load your document. This might come from a text file or database.

In [37]:
text = "I eat the pizza. The pizza, which I liked, was eaten by me."
doc = nlp(text)

for sent in doc.sents:
    print(sent)

I eat the pizza.
The pizza, which I liked, was eaten by me.


spaCy has correctly split the text into its two sentences.

In the [CoNLL-U format](http://universaldependencies.org/format.html), each line represents a token, and each sentence is followed by a blank line. Each token line contains 10 tab-separated fields, not all of which we will fill with spaCy; empty fields are marked with an underscore (`_`). Before each sentence, lines starting with `#` contain metadata belonging to that sentence.
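To make the layout concrete, here is a minimal sketch of reading such a file back into Python. The column names follow the CoNLL-U specification; the helper `parse_conllu` and the sample string are hypothetical illustrations and are not part of spaCy or UDPipe.

```python
# The ten CoNLL-U columns, in order, as named in the format specification.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def parse_conllu(text):
    """Hypothetical helper: split CoNLL-U text into sentences,
    each a list of {field: value} dicts. Comment lines are skipped."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():            # a blank line closes a sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):      # sentence-level metadata
            continue
        else:                           # token line: ten tab-separated fields
            current.append(dict(zip(FIELDS, line.split("\t"))))
    if current:                         # tolerate a missing final blank line
        sentences.append(current)
    return sentences

# Made-up sample; unused fields are marked with an underscore.
sample = (
    "# sent_id = 1\n"
    "# text = I eat.\n"
    "1\tI\t_\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\teat\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t_\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)
sentences = parse_conllu(sample)
print(len(sentences), sentences[0][1]["form"], sentences[0][1]["head"])
```

All values stay strings here, so `head` has to be converted with `int()` before doing arithmetic on it.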

In [43]:
def to_conll(docs, docnames=None):
    """Convert spaCy-parsed documents to a CoNLL-U formatted string.

    docs is a list of documents already parsed with spaCy.
    docnames is optional and contains the corresponding document IDs."""
    # use numeric ids if no docnames are given
    if not docnames:
        docnames = list(range(1, len(docs) + 1))
    assert len(docs) == len(docnames)
    # this holds the lines of the CoNLL output
    conll = []
    # iterate over documents
    for docname, doc in zip(docnames, docs):
        conll.append("# newdoc id = {}".format(docname))
        # iterate over sentences
        for i_sent, sent in enumerate(doc.sents):
            conll.append("# sent_id = {}".format(i_sent + 1))
            conll.append("# text = {}".format(sent.text))
            # iterate over words
            for i, word in enumerate(sent):
                if word.head == word:
                    # the root is its own head in spaCy; CoNLL marks it with 0
                    head = 0
                else:
                    # head must be the 1-based position within the sentence,
                    # but word.head.i is the index within the whole document
                    head = word.head.i - sent[0].i + 1
                line = "{id}\t{form}\t{lemma}\t{upos}\t{xpos}\t_\t{head}\t{dep}\t_\t_".format(
                    id=i + 1, form=word.text, lemma=word.lemma_ or "_",
                    upos=word.pos_ or "_", xpos=word.tag_ or "_",
                    head=head, dep=word.dep_.lower() or "_")
                conll.append(line)
            conll.append("")  # sentence boundary
    return "\n".join(conll)

print(to_conll([doc]))

# newdoc id = 1
# sent_id = 1
# text = I eat the pizza.
1	I	-PRON-	PRON	PRP	_	2	nsubj	_	_
2	eat	eat	VERB	VBP	_	0	root	_	_
3	the	the	DET	DT	_	4	det	_	_
4	pizza	pizza	NOUN	NN	_	2	dobj	_	_
5	.	.	PUNCT	.	_	2	punct	_	_

# sent_id = 2
# text = The pizza, which I liked, was eaten by me.
1	The	the	DET	DT	_	2	det	_	_
2	pizza	pizza	NOUN	NN	_	9	nsubjpass	_	_
3	,	,	PUNCT	,	_	2	punct	_	_
4	which	which	ADJ	WDT	_	6	dobj	_	_
5	I	-PRON-	PRON	PRP	_	6	nsubj	_	_
6	liked	like	VERB	VBD	_	2	relcl	_	_
7	,	,	PUNCT	,	_	9	punct	_	_
8	was	be	VERB	VBD	_	9	auxpass	_	_
9	eaten	eat	VERB	VBN	_	0	root	_	_
10	by	by	ADP	IN	_	9	agent	_	_
11	me	-PRON-	PRON	PRP	_	10	pobj	_	_
12	.	.	PUNCT	.	_	9	punct	_	_



The output of this function can now be saved to a `.conll` file.
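A minimal sketch of that step, assuming the string returned by `to_conll` is held in a variable. The variable name `conll_text`, its placeholder content, and the file name `annotated_spacy.conll` are illustrative only.

```python
import os
import tempfile

# Stand-in for the string returned by to_conll([doc]); placeholder content.
conll_text = "# sent_id = 1\n1\tHi\t_\tINTJ\t_\t_\t0\troot\t_\t_\n"

# Hypothetical output location; a temporary directory keeps the example tidy.
out_path = os.path.join(tempfile.mkdtemp(), "annotated_spacy.conll")

# CoNLL-U files are UTF-8 encoded, and every sentence (including the last)
# is followed by a blank line, so normalise the end of the string.
with open(out_path, "w", encoding="utf-8", newline="\n") as f:
    f.write(conll_text.rstrip("\n") + "\n\n")

# Read the file back to check what was written.
with open(out_path, encoding="utf-8") as f:
    saved = f.read()
print(saved)
```

Setting the encoding explicitly avoids depending on the platform default, which matters as soon as the text contains non-ASCII characters.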

## UDPipe and R

The R interface to UDPipe makes it easy to parse a list of documents and export the result directly to a CoNLL-U file.

In [13]:
%load_ext rpy2.ipython



In [44]:
%%R -i text
library(udpipe)
# download the model using udpipe_download_model
udmodel_english <- udpipe_load_model(file = "~/english-ud-2.0-170801.udpipe")

# if you parse several texts at once, use doc_id to keep track
# of each document
doc <- udpipe_annotate(udmodel_english, x = text, doc_id = c("My text"))
print(as.data.frame(doc))

# to export, use:
cat(doc$conllu, file = "annotated.conll")

    doc_id paragraph_id sentence_id                                   sentence
1  My text            1           1                           I eat the pizza.
2  My text            1           1                           I eat the pizza.
3  My text            1           1                           I eat the pizza.
4  My text            1           1                           I eat the pizza.
5  My text            1           1                           I eat the pizza.
6  My text            1           2 The pizza, which I liked, was eaten by me.
7  My text            1           2 The pizza, which I liked, was eaten by me.
8  My text            1           2 The pizza, which I liked, was eaten by me.
9  My text            1           2 The pizza, which I liked, was eaten by me.
10 My text            1           2 The pizza, which I liked, was eaten by me.
11 My text            1           2 The pizza, which I liked, was eaten by me.
12 My text            1           2 The pizza, which