<img width=50% src="https://github.com/New-Languages-for-NLP/course-materials/raw/main/w2/using-inception-data/newnlp_notebook.png" />

For full documentation on this project, see [here](https://new-languages-for-nlp.github.io/course-materials/w2/using-inception-data/New%20Language%20Training.html)
 

# 1 Prepare the Notebook Environment

In [None]:
#@title The Colab runtime comes with spaCy v2 and needs to be upgraded to v3.
#@markdown This project uses the GPU by default. To use only the CPU, uncheck the box below.
GPU = True #@param {type:"boolean"}

# Install spaCy v3 and libraries for GPUs and transformers
!pip install spacy --upgrade --quiet
if GPU:
    !pip install 'spacy[transformers,cuda111]' --quiet
#!pip install wandb spacy-huggingface-hub --quiet
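
Before moving on, it can help to confirm that the upgrade took effect. The following is a minimal sketch, assuming a Python 3.8+ runtime (for `importlib.metadata`); the helper names `major_version` and `installed_major` are illustrative and not part of the project.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '3.2.1'."""
    return int(version.split(".")[0])


def installed_major(package: str):
    """Return the installed major version of `package`, or None if it is absent."""
    try:
        return major_version(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


spacy_major = installed_major("spacy")
if spacy_major is None:
    print("spaCy is not installed in this runtime")
elif spacy_major < 3:
    print(f"spaCy v{spacy_major} found; rerun the install cell to upgrade to v3")
else:
    print(f"spaCy v{spacy_major} is ready")
```

If the check still reports v2 after installation, restarting the runtime and rerunning the install cell is a reasonable first step.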


The notebook will pull project files from your GitHub repository.  

Note that you need to set the language (`lang`), treebank (same as the repo name), `test_size`, and package name in the `project.yml` file in your repository.
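
A quick way to catch a forgotten setting is to check the values before running the project. This sketch assumes the four settings are stored under the keys `lang`, `treebank`, `test_size` and `package_name`; those key names and the example values are assumptions, so compare them with your own `project.yml`.

```python
# Assumed key names; check them against the vars section of your project.yml
REQUIRED_VARS = ("lang", "treebank", "test_size", "package_name")


def missing_vars(project_vars: dict) -> list:
    """Return the required keys that are absent or empty in `project_vars`."""
    return [key for key in REQUIRED_VARS if project_vars.get(key) in (None, "")]


# Hypothetical values for the Russian example, with the package name left blank
example_vars = {"lang": "rus", "treebank": "russian", "test_size": 0.2, "package_name": ""}
print(missing_vars(example_vars))
```

An empty list means every required setting has a value.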

In [None]:
#@title Enter your language's repository name. 
#@markdown If the repo is private, please check the "private_repo" box and include an [access token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token).
private_repo = False #@param {type:"boolean"}
repo_name = "russian" #@param {type:"string"}
branch = "main"


!rm -rf /content/newlang_project
!rm -rf $repo_name
if private_repo:
    git_access_token = "" #@param {type:"string"}
    git_url = f"https://{git_access_token}@github.com/New-Languages-for-NLP/{repo_name}/"
    !git clone $git_url  -b $branch
    !cp -r ./$repo_name/newlang_project .  
    !mkdir newlang_project/assets/
    !mkdir newlang_project/configs/
    #!mkdir newlang_project/corpus/
    !mkdir newlang_project/metrics/
    !mkdir newlang_project/packages/
    !mkdir newlang_project/training/
    !mkdir newlang_project/assets/$repo_name
    !cp -r ./$repo_name/* newlang_project/assets/$repo_name/
    !rm -rf ./$repo_name
else:
    !python -m spacy project clone newlang_project --repo https://github.com/New-Languages-for-NLP/$repo_name --branch $branch
    !python -m spacy project assets /content/newlang_project

✔ Cloned 'newlang_project' from New-Languages-for-NLP/russian
/content/newlang_project
✔ Your project is now ready!
To fetch the assets, run:
python -m spacy project assets /content/newlang_project
ℹ Fetching 1 asset(s)
✔ Downloaded asset /content/newlang_project/assets/russian
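
For a private repository, the cell above builds the project layout by hand with a series of `mkdir` calls. The same layout can be sketched in plain Python with `pathlib`; the function name `make_project_dirs` is hypothetical, and the demo runs in a temporary directory rather than in `/content`.

```python
import tempfile
from pathlib import Path

# Directory names taken from the mkdir calls in the private-repo branch
PROJECT_DIRS = ("assets", "configs", "metrics", "packages", "training")


def make_project_dirs(root, repo_name: str) -> Path:
    """Create the project folders under `root` and return the asset folder."""
    root = Path(root)
    for name in PROJECT_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    asset_dir = root / "assets" / repo_name
    asset_dir.mkdir(exist_ok=True)
    return asset_dir


with tempfile.TemporaryDirectory() as tmp:
    created = make_project_dirs(Path(tmp) / "newlang_project", "russian")
    print(created.relative_to(tmp))
```

Unlike a bare `mkdir`, `exist_ok=True` lets the cell be rerun without errors when the folders already exist.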


In [None]:
# Install the custom language object from Cadet 
!python -m spacy project run install /content/newlang_project

Running command: rm -rf lang
Running command: mkdir lang
Running command: mkdir lang/rus
Running command: cp -r assets/russian/2_new_language_object/ lang/rus/rus
Running command: mv lang/rus/rus/setup.py lang/rus/
Running command: /usr/bin/python3 -m pip install -e lang/rus
Obtaining file:///content/newlang_project/lang/rus
Installing collected packages: rus
  Running setup.py develop for rus
Successfully installed rus-0.0.0


# 2 Prepare the Data for Training

In [None]:
#@title (Optional) Patch spaCy's CoNLL-U converter to handle tokens that have no POS value
%%writefile /usr/local/lib/python3.7/dist-packages/spacy/training/converters/conllu_to_docs.py
import re

from .conll_ner_to_docs import n_sents_info
from ...training import iob_to_biluo, biluo_tags_to_spans
from ...tokens import Doc, Token, Span
from ...vocab import Vocab
from wasabi import Printer


def conllu_to_docs(
    input_data,
    n_sents=10,
    append_morphology=False,
    ner_map=None,
    merge_subtokens=False,
    no_print=False,
    **_
):
    """
    Convert conllu files into JSON format for use with train cli.
    append_morphology parameter enables appending morphology to tags, which is
    useful for languages such as Spanish, where UD tags are not so rich.

    Extract NER tags if available and convert them so that they follow
    BILUO and the Wikipedia scheme
    """
    MISC_NER_PATTERN = "^((?:name|NE)=)?([BILU])-([A-Z_]+)|O$"
    msg = Printer(no_print=no_print)
    n_sents_info(msg, n_sents)
    sent_docs = read_conllx(
        input_data,
        append_morphology=append_morphology,
        ner_tag_pattern=MISC_NER_PATTERN,
        ner_map=ner_map,
        merge_subtokens=merge_subtokens,
    )
    sent_docs_to_merge = []
    for sent_doc in sent_docs:
        sent_docs_to_merge.append(sent_doc)
        if len(sent_docs_to_merge) % n_sents == 0:
            yield Doc.from_docs(sent_docs_to_merge)
            sent_docs_to_merge = []
    if sent_docs_to_merge:
        yield Doc.from_docs(sent_docs_to_merge)


def has_ner(input_data, ner_tag_pattern):
    """
    Check the MISC column for NER tags.
    """
    for sent in input_data.strip().split("\n\n"):
        lines = sent.strip().split("\n")
        if lines:
            while lines[0].startswith("#"):
                lines.pop(0)
            for line in lines:
                parts = line.split("\t")
                id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts
                for misc_part in misc.split("|"):
                    if re.match(ner_tag_pattern, misc_part):
                        return True
    return False


def read_conllx(
    input_data,
    append_morphology=False,
    merge_subtokens=False,
    ner_tag_pattern="",
    ner_map=None,
):
    """Yield docs, one for each sentence"""
    vocab = Vocab()  # need vocab to make a minimal Doc
    for sent in input_data.strip().split("\n\n"):
        lines = sent.strip().split("\n")
        if lines:
            while lines[0].startswith("#"):
                lines.pop(0)
            doc = conllu_sentence_to_doc(
                vocab,
                lines,
                ner_tag_pattern,
                merge_subtokens=merge_subtokens,
                append_morphology=append_morphology,
                ner_map=ner_map,
            )
            yield doc


def get_entities(lines, tag_pattern, ner_map=None):
    """Find entities in the MISC column according to the pattern and map to
    final entity type with `ner_map` if mapping present. Entity tag is 'O' if
    the pattern is not matched.

    lines (str): CoNLL-U lines for one sentence
    tag_pattern (str): Regex pattern for entity tag
    ner_map (dict): Map old NER tag names to new ones, '' maps to O.
    RETURNS (list): List of BILUO entity tags
    """
    miscs = []
    for line in lines:
        parts = line.split("\t")
        id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts
        if "-" in id_ or "." in id_:
            continue
        miscs.append(misc)

    iob = []
    for misc in miscs:
        iob_tag = "O"
        for misc_part in misc.split("|"):
            tag_match = re.match(tag_pattern, misc_part)
            if tag_match:
                prefix = tag_match.group(2)
                suffix = tag_match.group(3)
                if prefix and suffix:
                    iob_tag = prefix + "-" + suffix
                    if ner_map:
                        suffix = ner_map.get(suffix, suffix)
                        if suffix == "":
                            iob_tag = "O"
                        else:
                            iob_tag = prefix + "-" + suffix
                break
        iob.append(iob_tag)
    return iob_to_biluo(iob)


def conllu_sentence_to_doc(
    vocab,
    lines,
    ner_tag_pattern,
    merge_subtokens=False,
    append_morphology=False,
    ner_map=None,
):
    """Create a Doc from the lines for one CoNLL-U sentence, merging
    subtokens and appending morphology to tags if required.

    lines (str): The non-comment lines for a CoNLL-U sentence
    ner_tag_pattern (str): The regex pattern for matching NER in MISC col
    RETURNS (Doc): A Doc containing the annotation
    """
    # create a Doc with each subtoken as its own token
    # if merging subtokens, each subtoken orth is the merged subtoken form
    if not Token.has_extension("merged_orth"):
        Token.set_extension("merged_orth", default="")
    if not Token.has_extension("merged_lemma"):
        Token.set_extension("merged_lemma", default="")
    if not Token.has_extension("merged_morph"):
        Token.set_extension("merged_morph", default="")
    if not Token.has_extension("merged_spaceafter"):
        Token.set_extension("merged_spaceafter", default="")
    words, spaces, tags, poses, morphs, lemmas = [], [], [], [], [], []
    heads, deps = [], []
    subtok_word = ""
    in_subtok = False
    for i in range(len(lines)):
        line = lines[i]
        parts = line.split("\t")
        id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts
        if "." in id_:
            continue
        if "-" in id_:
            in_subtok = True
            subtok_word = word
            subtok_start, subtok_end = id_.split("-")
            subtok_spaceafter = "SpaceAfter=No" not in misc
            continue
        if merge_subtokens and in_subtok:
            words.append(subtok_word)
        else:
            words.append(word)
        if in_subtok:
            if id_ == subtok_end:
                spaces.append(subtok_spaceafter)
            else:
                spaces.append(False)
        elif "SpaceAfter=No" in misc:
            spaces.append(False)
        else:
            spaces.append(True)
        if in_subtok and id_ == subtok_end:
            subtok_word = ""
            in_subtok = False
        id_ = int(id_) - 1
        head = (int(head) - 1) if head not in ("0", "_") else id_
        tag = pos if tag == "_" else tag
        morph = morph if morph != "_" else ""
        dep = "ROOT" if dep == "root" else dep
        lemmas.append(lemma)
        # Treat a missing POS value ("_") as unset so the Doc can be built
        if pos == "_":
            pos = ""
        poses.append(pos)
        tags.append(tag)
        morphs.append(morph)
        heads.append(head)
        deps.append(dep)

    doc = Doc(
        vocab,
        words=words,
        spaces=spaces,
        tags=tags,
        pos=poses,
        deps=deps,
        lemmas=lemmas,
        morphs=morphs,
        heads=heads,
    )
    for i in range(len(doc)):
        doc[i]._.merged_orth = words[i]
        doc[i]._.merged_morph = morphs[i]
        doc[i]._.merged_lemma = lemmas[i]
        doc[i]._.merged_spaceafter = spaces[i]
    ents = get_entities(lines, ner_tag_pattern, ner_map)
    doc.ents = biluo_tags_to_spans(doc, ents)

    if merge_subtokens:
        doc = merge_conllu_subtokens(lines, doc)

    # create final Doc from custom Doc annotation
    words, spaces, tags, morphs, lemmas, poses = [], [], [], [], [], []
    heads, deps = [], []
    for i, t in enumerate(doc):
        words.append(t._.merged_orth)
        lemmas.append(t._.merged_lemma)
        spaces.append(t._.merged_spaceafter)
        morphs.append(t._.merged_morph)
        if append_morphology and t._.merged_morph:
            tags.append(t.tag_ + "__" + t._.merged_morph)
        else:
            tags.append(t.tag_)
        poses.append(t.pos_)
        heads.append(t.head.i)
        deps.append(t.dep_)

    doc_x = Doc(
        vocab,
        words=words,
        spaces=spaces,
        tags=tags,
        morphs=morphs,
        lemmas=lemmas,
        pos=poses,
        deps=deps,
        heads=heads,
    )
    doc_x.ents = [Span(doc_x, ent.start, ent.end, label=ent.label) for ent in doc.ents]

    return doc_x


def merge_conllu_subtokens(lines, doc):
    # identify and process all subtoken spans to prepare attrs for merging
    subtok_spans = []
    for line in lines:
        parts = line.split("\t")
        id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts
        if "-" in id_:
            subtok_start, subtok_end = id_.split("-")
            subtok_span = doc[int(subtok_start) - 1 : int(subtok_end)]
            subtok_spans.append(subtok_span)
            # create merged tag, morph, and lemma values
            tags = []
            morphs = {}
            lemmas = []
            for token in subtok_span:
                tags.append(token.tag_)
                lemmas.append(token.lemma_)
                if token._.merged_morph:
                    for feature in token._.merged_morph.split("|"):
                        field, values = feature.split("=", 1)
                        if field not in morphs:
                            morphs[field] = set()
                        for value in values.split(","):
                            morphs[field].add(value)
            # create merged features for each morph field
            for field, values in morphs.items():
                morphs[field] = field + "=" + ",".join(sorted(values))
            # set the same attrs on all subtok tokens so that whatever head the
            # retokenizer chooses, the final attrs are available on that token
            for token in subtok_span:
                token._.merged_orth = token.orth_
                token._.merged_lemma = " ".join(lemmas)
                token.tag_ = "_".join(tags)
                token._.merged_morph = "|".join(sorted(morphs.values()))
                token._.merged_spaceafter = (
                    True if subtok_span[-1].whitespace_ else False
                )

    with doc.retokenize() as retokenizer:
        for span in subtok_spans:
            retokenizer.merge(span)

    return doc

Overwriting /usr/local/lib/python3.7/dist-packages/spacy/training/converters/conllu_to_docs.py
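
To see what the patched converter does with a missing POS value, here is a small standalone sketch that splits one CoNLL-U token line into its ten columns and applies the same `"_"` to `""` normalisation. It does not import spaCy; the column names follow the CoNLL-U convention, and both the function `parse_token_line` and the sample line are made up for illustration.

```python
# The ten CoNLL-U columns, in order
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_token_line(line: str) -> dict:
    """Split one CoNLL-U token line into its ten named columns."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 tab-separated columns, got {len(parts)}")
    token = dict(zip(CONLLU_FIELDS, parts))
    # Same normalisation as the patched converter: "_" means "no value"
    if token["upos"] == "_":
        token["upos"] = ""
    return token


# Illustrative line with no POS annotation (not taken from the project data)
sample = "1\tКнязь\tкнязь\t_\t_\t_\t0\troot\t_\tB-PER"
token = parse_token_line(sample)
print(repr(token["form"]), repr(token["upos"]))
```

Running a check like this over an export before conversion is one way to find out how many tokens lack a POS value.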


In [None]:
# Convert the CoNLL-U files exported from INCEpTION to spaCy's binary format
# Read the CoNLL-U files with NER data and add the entities as ents on the spaCy Docs
!python -m spacy project run convert /content/newlang_project

Running command: /usr/bin/python3 scripts/convert.py assets/russian/3_inception_export 10 rus
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (1 documents): corpus/conllu/idiot-6.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (1 documents): corpus/conllu/crimepunishmentsample2.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (1 documents): corpus/conllu/idiot-9.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (1 documents): corpus/conllu/crimepunishment-5.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (2 documents): corpus/conllu/demons-3.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (1 documents): corpus/conllu/crimepunishment-1.spacy
ℹ Grouping every
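
The "Grouping every 10 sentences into a document" messages correspond to the batching loop in `conllu_to_docs`. A minimal sketch of that grouping on plain strings (no spaCy objects; `group_sentences` is an illustrative name) looks like this:

```python
def group_sentences(sentences, n_sents=10):
    """Yield lists of up to `n_sents` consecutive sentences."""
    batch = []
    for sentence in sentences:
        batch.append(sentence)
        if len(batch) == n_sents:
            yield batch
            batch = []
    if batch:  # final group, shorter than n_sents
        yield batch


groups = list(group_sentences([f"sentence {i}" for i in range(23)], n_sents=10))
print([len(group) for group in groups])
```

The last group keeps whatever sentences remain, so no sentence is dropped when the total is not a multiple of `n_sents`.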

In [None]:
# test/train split 
!python -m spacy project run split /content/newlang_project 

Running command: /usr/bin/python3 scripts/split.py 0.2 11 rus
🚂 Created 52 training docs
😊 Created 10 validation docs
🧪  Created 3 test docs
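
The command line shows `scripts/split.py` receiving `0.2` and `11`, presumably the test size and the random seed; the script itself is not shown here. One scheme that reproduces the 52/10/3 counts from 65 documents is a two-stage split: hold out 20% of the documents, then take 20% of that held-out set as the test set and keep the rest for validation. The sketch below is an assumption about the approach, not the project's actual script.

```python
import math
import random


def split_docs(docs, test_size=0.2, random_state=11):
    """Shuffle `docs`, then split them into train, validation and test lists."""
    docs = list(docs)
    random.Random(random_state).shuffle(docs)
    # Stage 1: hold out a fraction of all documents (rounded up)
    n_holdout = math.ceil(round(len(docs) * test_size, 6))
    holdout, train = docs[:n_holdout], docs[n_holdout:]
    # Stage 2: carve the test set out of the held-out documents
    n_test = math.ceil(round(len(holdout) * test_size, 6))
    test, validation = holdout[:n_test], holdout[n_test:]
    return train, validation, test


train, validation, test = split_docs(range(65))
print(len(train), len(validation), len(test))
```

Fixing the seed keeps the split reproducible between runs, which matters when comparing models trained on the same data.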


In [None]:
# Debug the data
!python -m spacy project run debug /content/newlang_project 

Running command: /usr/bin/python3 -m spacy debug data configs/config.cfg

✔ Pipeline can be initialized with data
✔ Corpus is loadable

Language: rus
Training pipeline: tok2vec, tagger, parser, ner
52 training docs
10 evaluation docs
✔ No overlap between training and evaluation data
✘ Low number of examples to train a new pipeline (52)

ℹ 17361 total word(s) in the data (4651 unique)
ℹ No word vectors present in the package

ℹ 3 label(s)
0 missing value(s) (tokens with '-' label)
⚠ Low number of examples for label 'null' (10)
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace

ℹ 17 label(s) in train data

ℹ Found 17360 sentence(s) with an average length of 1.0 words.
ℹ 1 label(s) in train data
ℹ 1 label(s) in pr

# 3 Model Training 

If your project file uses Weights and Biases to monitor model training (`vars.wandb: true`), you'll need to create an account at [wandb.ai](https://wandb.ai/site) and get an API key.

In [None]:
# train the model
!python -m spacy project run train /content/newlang_project 

Running command: /usr/bin/python3 -m spacy train configs/config.cfg --output training/russian --gpu-id 0 --nlp.lang=rus
✔ Created output directory: training/russian
ℹ Saving to output directory: training/russian
ℹ Using GPU: 0

[2022-01-14 16:58:43,086] [INFO] Set up nlp object from config
[2022-01-14 16:58:43,097] [INFO] Pipeline: ['tok2vec', 'tagger', 'parser', 'ner']
[2022-01-14 16:58:43,102] [INFO] Created vocabulary
[2022-01-14 16:58:43,103] [INFO] Finished initializing nlp object
[2022-01-14 16:58:57,812] [INFO] Initialized pipeline components: ['tok2vec', 'tagger', 'parser', 'ner']
✔ Initialized pipeline

ℹ Pipeline: ['tok2vec', 'tagger', 'parser', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TAGGER  LOSS PARSER  LOSS NER  TAG_ACC  DEP_UAS  DEP_LAS  SENTS_F  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  -----------  -----------  --------  -------  ------- 

If you get `ValueError: Could not find gold transition - see logs above.`, you may not have sufficient data to train on. See https://github.com/explosion/spaCy/discussions/7282

In [None]:
# Evaluate the model using the test data
!python -m spacy project run evaluate /content/newlang_project 

Running command: /usr/bin/python3 -m spacy evaluate ./training/russian/model-best ./corpus/converted/test.spacy --output ./metrics/russian.json --gpu-id 0
ℹ Using GPU: 0

TOK      100.00
TAG      33.26 
UAS      100.00
LAS      0.00  
NER P    6.80  
NER R    13.89 
NER F    9.13  
SENT P   100.00
SENT R   100.00
SENT F   100.00
SPEED    1762  

          P      R      F
_      0.00   0.00   0.00
root   0.00   0.00   0.00

         P       R      F
PER   6.80   15.87   9.52
FAC   0.00    0.00   0.00

✔ Saved results to metrics/russian.json
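
The F column in these tables is the harmonic mean of precision (P) and recall (R). As a sanity check when reading the report, the sketch below recomputes the F values from the P and R figures printed above; `f_score` is an illustrative helper, not a spaCy function.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1); 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


overall_ner = round(f_score(6.80, 13.89), 2)  # "NER P" and "NER R" rows
per_label = round(f_score(6.80, 15.87), 2)    # "PER" row of the per-type table
print(overall_ner, per_label)
```

Because the harmonic mean is pulled toward the smaller value, a low precision keeps F low even when recall is higher.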


In [None]:
# Find the path for your meta.json file
# You'll need to add newlang_project/ +  the path from the training step just after "✔ Saved pipeline to output directory"
!ls newlang_project/training/russian/

model-best  model-last


In [None]:
#Update meta.json
import spacy 
import srsly 

# Change path to match that from the training cell where it says "✔ Saved pipeline to output directory"
meta_path = "newlang_project/training/russian/model-last/meta.json"

# Replace values below for your project
my_meta = { 
    "lang":"rus",
    "name":"Dostoevskys_Russian",
    "version":"0.0.1",
    "description":"Russian pipeline optimized for GPU. Components: tok2vec, tagger, parser, ner.",
    "author":"New Languages for NLP",
    "email":"newnlp@princeton.edu",
    "url":"https://newnlp.princeton.edu",
    "license":"MIT", 
    }
meta = spacy.util.load_meta(meta_path)
meta.update(my_meta)
srsly.write_json(meta_path, meta)
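
The cell above loads `meta.json`, overwrites selected fields and writes the file back, leaving all other fields untouched. The same read, update and write pattern can be sketched with the standard library alone; `update_meta` is a hypothetical helper and the demo file is created in a temporary directory, so it does not touch your trained model.

```python
import json
import tempfile
from pathlib import Path


def update_meta(meta_path, overrides: dict) -> dict:
    """Merge `overrides` into the JSON file at `meta_path` and return the result."""
    path = Path(meta_path)
    meta = json.loads(path.read_text(encoding="utf-8"))
    meta.update(overrides)  # keys not in `overrides` are preserved
    path.write_text(json.dumps(meta, indent=2, ensure_ascii=False), encoding="utf-8")
    return meta


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "meta.json"
    # Stand-in for the file written by training (illustrative content only)
    demo_path.write_text(
        json.dumps({"lang": "rus", "name": "pipeline", "pipeline": ["tok2vec", "ner"]}),
        encoding="utf-8",
    )
    updated = update_meta(demo_path, {"name": "Dostoevskys_Russian", "version": "0.0.1"})
    reloaded = json.loads(demo_path.read_text(encoding="utf-8"))
    print(updated["name"], updated["pipeline"])
```

Fields written during training, such as the pipeline component list, survive the update because only the supplied keys are replaced.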

### Download the trained model to your computer.


In [None]:
# Save the model to disk in a format that can be easily downloaded and reused.
!python -m spacy package newlang_project/training/russian/model-last newlang_project/export 

ℹ Building package artifacts: sdist
✔ Loaded meta.json from file
newlang_project/training/russian/model-last/meta.json
✔ Generated README.md from meta.json
✔ Successfully created package 'rus_Dostoevskys_Russian-0.0.1'
newlang_project/export/rus_Dostoevskys_Russian-0.0.1
running sdist
running egg_info
creating rus_Dostoevskys_Russian.egg-info
writing rus_Dostoevskys_Russian.egg-info/PKG-INFO
writing dependency_links to rus_Dostoevskys_Russian.egg-info/dependency_links.txt
writing entry points to rus_Dostoevskys_Russian.egg-info/entry_points.txt
writing requirements to rus_Dostoevskys_Russian.egg-info/requires.txt
writing top-level names to rus_Dostoevskys_Russian.egg-info/top_level.txt
writing manifest file 'rus_Dostoevskys_Russian.egg-info/SOURCES.txt'
reading manifest file 'rus_Dostoevskys_Russian.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'rus_Dostoevskys_Russian.egg-info/SOURCES.txt'
runnin
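
In the output above, the package name combines the `lang`, `name` and `version` fields from `meta.json`. The sketch below derives the package folder and archive path from those fields, following the pattern seen in this notebook's output; the function `package_paths` is hypothetical, and other setuptools versions may normalise the archive file name differently, so confirm the path against your own output.

```python
def package_paths(lang: str, name: str, version: str, export_dir: str):
    """Return (package_name, archive_path) following the pattern shown above."""
    package = f"{lang}_{name}-{version}"
    archive = f"{export_dir}/{package}/dist/{package}.tar.gz"
    return package, archive


package, archive = package_paths("rus", "Dostoevskys_Russian", "0.0.1", "newlang_project/export")
print(package)
print(archive)
```

Deriving the path from the metadata avoids retyping it by hand when the name or version changes.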

In [None]:
from google.colab import files
# replace with the path in the previous cell under "✔ Successfully created zipped Python package"
files.download('newlang_project/export/rus_Dostoevskys_Russian-0.0.1/dist/rus_Dostoevskys_Russian-0.0.1.tar.gz')

# Once the file is on your computer, you can install it with: pip install rus_Dostoevskys_Russian-0.0.1.tar.gz
# Be sure to add the file to the 4_trained_models folder in GitHub
