<center>    
    <h1 id='spacy-notebook-9' style='color:#7159c1; font-size:350%'>CoNLL-U Files</h1>
    <i style='font-size:125%'>Working with Annotated Datasets</i>
</center>

> **Topics**

```
- 🖊️ Introduction to CoNLL-U File
- 👀 Exploring CoNLL-U File
- ♻️ Converting CoNLL-U Metadata into spaCy Document
- ♻️ Converting CoNLL-U Annotations into spaCy Document
```

<h1 id='0-introduction-to-conll-u-file' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🖊️ | Introduction to CoNLL-U File</h1>

`CoNLL-U` is an annotation schema for describing linguistic features, following the `Universal Dependencies (UD)` framework.

It contains three types of lines: `comment`, `word` and `blank` lines. Comment lines carry information about the document and its sentences, such as document id, sentence id, raw text and sentence type, whereas blank lines are literally empty lines (\n) that separate one sentence from the next. Word lines contain the following tab-separated fields:

- **ID** - `Index of the word in sequence`;
- **FORM** - `The form of a word or punctuation symbol`;
- **LEMMA** - `Lemma or the base form of a word`;
- **UPOS** - `Universal part-of-speech tag`;
- **XPOS** - `Language-specific part-of-speech tag`;
- **FEATS** - `Morphological features`;
- **HEAD** - `Syntactic head of the current word`;
- **DEPREL** - `Universal dependency relation to the HEAD`;
- **DEPS** - `Enhanced dependency relations`;
- **MISC** - `Any additional annotations`.
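
As a rough illustration of the word-line layout listed above, the sketch below splits one word line (copied from the GUM sample explored later in this notebook) into its ten tab-separated fields using plain Python. `FIELD_NAMES`, `parse_word_line` and `parse_feats` are hypothetical names introduced here for illustration only; they are not part of any library.

```python
# Hypothetical field names, in the order the fields appear in a word line
FIELD_NAMES = ('id', 'form', 'lemma', 'upos', 'xpos', 'feats', 'head', 'deprel', 'deps', 'misc')

def parse_word_line(line):
    """Split one CoNLL-U word line into a {field name: raw string} dict."""
    values = line.rstrip('\n').split('\t')
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f'Expected {len(FIELD_NAMES)} fields, got {len(values)}')
    return dict(zip(FIELD_NAMES, values))

def parse_feats(feats):
    """Turn 'Key=Value|Key=Value' into a dict; '_' marks an empty field."""
    if feats == '_':
        return {}
    return dict(pair.split('=', 1) for pair in feats.split('|'))

# Word line taken from the first sentence of GUM_whow_parachute.conllu
word_line = '5\ta\ta\tDET\tDT\tDefinite=Ind|PronType=Art\t8\tdet\t_\tEntity=(event-1'

token = parse_word_line(word_line)
print(token['form'], token['upos'], token['head'])
print(parse_feats(token['feats']))
```

In practice, a dedicated parser such as the `conllu` package used in the next section is a better choice; this sketch only makes the field order concrete.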

<h1 id='1-exploring-conll-u-file' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>👀 | Exploring CoNLL-U File</h1>

Let's explore a simple CoNLL-U File.

In [1]:
# ---- Importations ----
import conllu # pip install conllu
from pathlib import Path

# ---- Constants ----
DATASETS_PATH = Path('datasets')

# ---- Reading File ----
with open(DATASETS_PATH / 'GUM_whow_parachute.conllu', mode='r', encoding='UTF-8') as file:
    annotations = file.read()

annotations[: 1_000]

'# newdoc id = GUM_whow_parachute\n# sent_id = GUM_whow_parachute-1\n# text = How to Cope With a Double Parachute Failure\n# s_type = inf\n1\tHow\thow\tSCONJ\tWRB\tPronType=Int\t3\tmark\t_\tDiscourse=preparation:1->11\n2\tto\tto\tPART\tTO\t_\t3\tmark\t_\t_\n3\tCope\tCope\tVERB\tVB\tVerbForm=Inf\t0\troot\t_\t_\n4\tWith\twith\tADP\tIN\t_\t8\tcase\t_\t_\n5\ta\ta\tDET\tDT\tDefinite=Ind|PronType=Art\t8\tdet\t_\tEntity=(event-1\n6\tDouble\tdouble\tADJ\tJJ\tDegree=Pos\t8\tamod\t_\t_\n7\tParachute\tparachute\tNOUN\tNN\tNumber=Sing\t8\tcompound\t_\tEntity=(object-2)\n8\tFailure\tfailure\tNOUN\tNN\tNumber=Sing\t3\tobl\t_\tEntity=event-1)\n\n# sent_id = GUM_whow_parachute-2\n# text = While skydiving, it is possible (yet extremely unlikely) that both your primary and reserve parachutes will malfunction, leaving you with no method of reducing your velocity.\n# s_type = decl\n1\tWhile\twhile\tSCONJ\tIN\t_\t2\tmark\t_\tDiscourse=circumstance:2->3\n2\tskydiving\tskydiving\tNOUN\tNN\tNumber=Sing\t6\tad

In [2]:
# Exploring CoNLL-U File
sentences = conllu.parse(annotations)
print(f'- First Sentence: {sentences[0]}')
print(f'- First Sentence Metadata (Comment Lines): {sentences[0].metadata}')
print(f'- First Sentence Type Metadata: {sentences[0].metadata["s_type"]}')
print(f'- First Sentence First Token: {sentences[0][0]}')
print(f'- First Sentence First Token Lemma: {sentences[0][0]["lemma"]}')

- First Sentence: TokenList<How, to, Cope, With, a, Double, Parachute, Failure, metadata={newdoc id: "GUM_whow_parachute", sent_id: "GUM_whow_parachute-1", text: "How to Cope With a Double Parachute Failure", s_type: "inf"}>
- First Sentence Metadata (Comment Lines): {'newdoc id': 'GUM_whow_parachute', 'sent_id': 'GUM_whow_parachute-1', 'text': 'How to Cope With a Double Parachute Failure', 's_type': 'inf'}
- First Sentence Type Metadata: inf
- First Sentence First Token: How
- First Sentence First Token Lemma: how
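
To make the link between comment lines and the `metadata` dictionary more concrete, here is a minimal plain-Python sketch that reads `# key = value` comment lines into a dictionary. It is only an approximation of the behavior seen above, not the `conllu` package's actual implementation, and `parse_comment_lines` is a hypothetical helper name.

```python
def parse_comment_lines(sentence_block):
    """Collect '# key = value' comment lines of one sentence into a dict."""
    metadata = {}
    for line in sentence_block.splitlines():
        if not line.startswith('#'):
            continue  # word lines and blank lines are ignored here
        key, separator, value = line[1:].partition('=')
        if separator:  # keep only comments that really contain 'key = value'
            metadata[key.strip()] = value.strip()
    return metadata

# Comment lines and first word line of the first sentence in the sample file
sample = (
    '# newdoc id = GUM_whow_parachute\n'
    '# sent_id = GUM_whow_parachute-1\n'
    '# text = How to Cope With a Double Parachute Failure\n'
    '# s_type = inf\n'
    '1\tHow\thow\tSCONJ\tWRB\tPronType=Int\t3\tmark\t_\tDiscourse=preparation:1->11\n'
)

metadata = parse_comment_lines(sample)
print(metadata)
```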


<h1 id='2-converting-conll-u-metadata-into-spacy-document' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>♻️ | Converting CoNLL-U Metadata into spaCy Document</h1>

In [4]:
import spacy
from spacy.tokens import Doc

nlp_en_large = spacy.load('en_core_web_lg')

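# Raw text of the first sentence, taken from its `# text = ...` comment line
first_sentence = sentences[0].metadata['text']
words = first_sentence.split(' ')
# One flag per word: True when the word is followed by a space
spaces = [True, True, True, True, True, True, True, False]
# One flag per word: True only for the word that starts a sentence
sentence_starts = [True, False, False, False, False, False, False, False]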

document = Doc(
    vocab=nlp_en_large.vocab
    , words=words
    , spaces=spaces
    , sent_starts=sentence_starts
)

print(f'- Before Pipelines - {document[0]} POS: {document[0].pos_}')
print('---')

for name, component in nlp_en_large.pipeline:
    print(f'- Applying {name} Pipeline...')
    document = component(document)

print('---')
print(f'- After Pipelines - {document[0]} POS: {document[0].pos_}')

- Before Pipelines - How POS: 
---
- Applying tok2vec Pipeline...
- Applying tagger Pipeline...
- Applying parser Pipeline...
- Applying attribute_ruler Pipeline...
- Applying lemmatizer Pipeline...
- Applying ner Pipeline...
---
- After Pipelines - How POS: SCONJ
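
The `spaces` and `sentence_starts` lists in the cell above are written out by hand for an eight-word sentence. A minimal sketch of how they could be derived from the text instead is shown below. It assumes a single, non-empty sentence whose words are separated by exactly one space, and `build_doc_arguments` is a hypothetical helper name, not a spaCy function.

```python
def build_doc_arguments(text):
    """Derive words, spaces and sentence-start flags for a single sentence.

    Assumes `text` is one non-empty sentence split on single spaces.
    """
    words = text.split(' ')
    # Every word is followed by a space except the last one
    spaces = [True] * (len(words) - 1) + [False]
    # Only the first word starts the sentence
    sentence_starts = [True] + [False] * (len(words) - 1)
    return words, spaces, sentence_starts

words, spaces, sentence_starts = build_doc_arguments(
    'How to Cope With a Double Parachute Failure'
)
print(words)
print(spaces)
print(sentence_starts)
```

The three returned lists can then be passed as `words`, `spaces` and `sent_starts` when building the `Doc`, exactly as in the cell above.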


<h1 id='3-converting-conll-u-annotations-into-spacy-documents' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>♻️ | Converting CoNLL-U Annotations into spaCy Document</h1>

In [8]:
from spacy.training.converters import conllu_to_docs
from spacy.tokens import Doc

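# Convert the raw CoNLL-U text into spaCy Docs (sentences are grouped into Docs,
# as reported by the info message below), then merge them into a single Doc
documents = list(conllu_to_docs(annotations, no_print=False))
document = Doc.from_docs(documents)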

# Exploring the first 8 Tokens
for token_index in range(0, 8):
    token = document[token_index]
    print(f'- Text: {token.text}')
    print(f'- Coarse POS: {token.pos_}')
    print(f'- Fine-Grained POS: {token.tag_}')
    print(f'- Morphological Features: {token.morph}')
    print(f'- Dependency Label: {token.dep_}')
    print(f'- Syntactic Head: {token.head}')
    print('---')

ℹ Grouping every 10 sentences into a document.
- Text: How
- Coarse POS: SCONJ
- Fine-Grained POS: WRB
- Morphological Features: PronType=Int
- Dependency Label: mark
- Syntactic Head: Cope
---
- Text: to
- Coarse POS: PART
- Fine-Grained POS: TO
- Morphological Features: 
- Dependency Label: mark
- Syntactic Head: Cope
---
- Text: Cope
- Coarse POS: VERB
- Fine-Grained POS: VB
- Morphological Features: VerbForm=Inf
- Dependency Label: ROOT
- Syntactic Head: Cope
---
- Text: With
- Coarse POS: ADP
- Fine-Grained POS: IN
- Morphological Features: 
- Dependency Label: case
- Syntactic Head: Failure
---
- Text: a
- Coarse POS: DET
- Fine-Grained POS: DT
- Morphological Features: Definite=Ind|PronType=Art
- Dependency Label: det
- Syntactic Head: Failure
---
- Text: Double
- Coarse POS: ADJ
- Fine-Grained POS: JJ
- Morphological Features: Degree=Pos
- Dependency Label: amod
- Syntactic Head: Failure
---
- Text: Parachute
- Coarse POS: NOUN
- Fine-Grained POS: NN
- Morphological Features: Number=Sing
- Dependency Label: compound
- Syntactic Head: Failure
---
- Text: Failure
- Coar
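
In the output above, the root token `Cope` reports itself as its own head, while the CoNLL-U file stores `0` in its HEAD column. The plain-Python sketch below illustrates that mapping: HEAD values are 1-based word indices within the sentence, and `0` marks the root. The `rows` list copies the FORM and HEAD columns of the first sentence, and `resolve_heads` is a hypothetical helper written for this illustration, not spaCy's converter code.

```python
# (FORM, HEAD) pairs of the first sentence in GUM_whow_parachute.conllu
rows = [
    ('How', 3), ('to', 3), ('Cope', 0), ('With', 8),
    ('a', 8), ('Double', 8), ('Parachute', 8), ('Failure', 3),
]

def resolve_heads(rows):
    """Map each word to the form of its head; the root points to itself."""
    forms = [form for form, _ in rows]
    resolved = []
    for form, head in rows:
        # HEAD is 1-based; 0 means the word is the root of the sentence
        head_form = form if head == 0 else forms[head - 1]
        resolved.append((form, head_form))
    return resolved

for form, head_form in resolve_heads(rows):
    print(f'- {form} -> {head_form}')
```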

---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).