To begin, let's import spaCy and the create_object script.  This includes a `create_object()` function that will generate a generic language object in the `new_lang` folder.  All of the object's files are contained there.

In [3]:
# Install needed util files if missing
if 'google.colab' in str(get_ipython()):
    !mkdir util
    !wget -O /content/util/corpus.py https://raw.githubusercontent.com/New-Languages-for-NLP/cadet-the-notebook/main/util/corpus.py
    !wget -O /content/util/create_object.py https://raw.githubusercontent.com/New-Languages-for-NLP/cadet-the-notebook/main/util/create_object.py
    !wget -O /content/util/export.py https://raw.githubusercontent.com/New-Languages-for-NLP/cadet-the-notebook/main/util/export.py
    !wget -O /content/util/tokenization.py https://raw.githubusercontent.com/New-Languages-for-NLP/cadet-the-notebook/main/util/tokenization.py
    # Colab currently uses spaCy 2.2.4, but this notebook needs spaCy 3
    import spacy
    if not spacy.__version__.startswith('3'):
        !pip install spacy --upgrade
        # restart the runtime after upgrading so the new version is loaded

    from util.create_object import create_object
    print(spacy.__version__)

else:
    import spacy
    from app.util.create_object import create_object
    print(spacy.__version__)

ModuleNotFoundError: No module named 'spacy'

In [None]:
lang_name = 'Meow'
lang_code = 'meow'
direction = 'ltr'  # or 'rtl'
has_case = True
has_letters = True

create_object(lang_name, lang_code, direction, has_case, has_letters)

'🍈 created language object for meow'

In [None]:
!ls ./new_lang


base_config.cfg  lex_attrs.py	 __pycache__	      texts
corpus_json	 lookups	 setup.py	      tokenizer_exceptions.py
examples.py	 meow.egg-info	 stop_words.py
__init__.py	 project.yml	 syntax_iterators.py
lemmatizer.py	 punctuation.py  tag_map.py


To assess how the tokenizer defaults will work with your language, add example sentences to the [`examples.py`](./new_lang/examples.py) file.  

In [None]:
from IPython.core.display import HTML
from util.tokenization import tokenization
HTML(tokenization(lang_name))

To adjust the tokenizer, you can add unique exceptions or regular exceptions to the [tokenizer_exceptions.py](./new_lang/tokenizer_exceptions.py) file.

- To join two tokens, add an exception `{'BIG YIKES':[{ORTH: 'BIG YIKES'}]}`
- To split a token in two, `{'Kummerspeck':[{ORTH:"Kummer"},{ORTH:"speck"}]}`

Note that in both cases we add a dictionary entry. The key is the string to match, and the value is the list of tokens to produce. In the first case, we get a single token where we would otherwise have two; in the second, two tokens where we would otherwise have one. You can find more details in the spaCy documentation and [here](https://new-languages-for-nlp.github.io/course-materials/w1/tokenization.html).
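
As a quick check on entries like the two above, the sketch below tests a property that spaCy requires of tokenizer exceptions: the `ORTH` values of the tokens, joined together, must reproduce the key exactly. The `ORTH` constant here is a plain-string stand-in (in spaCy it is imported from `spacy.symbols`), and `check_exceptions` is a hypothetical helper, not part of spaCy or of this notebook's util scripts.

```python
# Stand-in for spaCy's ORTH symbol so the sketch runs without spaCy installed.
# In tokenizer_exceptions.py you would use: from spacy.symbols import ORTH
ORTH = "ORTH"

exceptions = {
    "BIG YIKES": [{ORTH: "BIG YIKES"}],                  # keep as one token
    "Kummerspeck": [{ORTH: "Kummer"}, {ORTH: "speck"}],  # split into two tokens
}


def check_exceptions(exceptions):
    """Return the keys whose token texts do not join back into the key (hypothetical helper)."""
    bad_keys = []
    for match_string, tokens in exceptions.items():
        joined = "".join(token[ORTH] for token in tokens)
        if joined != match_string:
            bad_keys.append(match_string)
    return bad_keys


print(check_exceptions(exceptions))  # [] means every entry is consistent
```

An entry such as `{'Kummerspeck': [{ORTH: 'Kummer'}, {ORTH: 'spek'}]}` would be reported, because the pieces no longer spell the original string.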

## Lookups

The `create_object()` function creates a `new_lang/lookups` directory that contains three files.  These are simple JSON lookups for unambiguous part-of-speech tags, lemmas and features. You can add your data to these files to update token values automatically.  Keep in mind that you'll need to balance the convenience of automatically annotating tokens against the inconvenience of having to correct machine errors.  Once you're done updating the files with your existing linguistic data, proceed to the next step.
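
To make the idea of an unambiguous lookup concrete, here is a minimal sketch that applies a lemma table to a list of token strings. The table contents and the `apply_lookup` helper are invented for illustration; the actual file names and layout inside `new_lang/lookups` are not shown here and may differ.

```python
import json

# Hypothetical lookup data, in the spirit of a simple JSON key-value file.
lemma_lookup = json.loads('{"cats": "cat", "purred": "purr"}')


def apply_lookup(tokens, lookup):
    """Pair each token with its looked-up value, or None when the token is not listed."""
    return [(token, lookup.get(token)) for token in tokens]


print(apply_lookup(["cats", "purred", "loudly"], lemma_lookup))
# [('cats', 'cat'), ('purred', 'purr'), ('loudly', None)]
```

Tokens missing from the table simply stay unannotated, which is why ambiguous forms are better left out than guessed.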

## Texts

For us to identify frequent tokens for automatic annotation, you'll need to provide texts.  Place your machine-readable UTF-8 text files in the `new_lang/texts` folder.
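
Before building the corpus, it can help to confirm that every file really is valid UTF-8. The sketch below is one way to do that with the standard library; `find_non_utf8` is a hypothetical helper, and the `.txt` extension is an assumption about how your files are named.

```python
from pathlib import Path
import tempfile


def find_non_utf8(folder):
    """Return the names of .txt files in folder that fail to decode as UTF-8."""
    bad_files = []
    for path in sorted(Path(folder).glob("*.txt")):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad_files.append(path.name)
    return bad_files


# Demonstration on a throwaway folder; point it at new_lang/texts in practice.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "good.txt").write_text("miau ñ", encoding="utf-8")
    (Path(tmp) / "bad.txt").write_bytes("miau ñ".encode("latin-1"))
    result = find_non_utf8(tmp)

print(result)  # ['bad.txt']
```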

In [None]:
from util.corpus import make_corpus

make_corpus(lang_name)

{'texts': 1, 'tokens': 3912, 'unique_tokens': 761}


The output of `make_corpus` is a JSON file at [`new_lang/corpus_json/tokens.json`](./new_lang/corpus_json/tokens.json). For each token, you'll find a `text` key for the token's string as well as keys for `pos_`, `lemma_` and `ent_type_`. Keep in mind that this system is not able to process ambiguous lookups.  Only enter data for tokens or spans with very little semantic variation.
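
The next cell relies on `rank`, `count` and `remain` keys for each token. One plausible reading, assumed here rather than taken from `make_corpus` itself, is that `rank` orders tokens by frequency, `count` is the token's frequency, and `remain` is the number of corpus tokens still unannotated once that token and all more frequent ones have been handled. A sketch of such a table using `collections.Counter`, with a hypothetical `frequency_table` helper:

```python
from collections import Counter


def frequency_table(words):
    """Build rank/count/remain rows from a list of token strings (assumed layout)."""
    counts = Counter(words)
    remaining = len(words)
    rows = []
    for rank, (text, count) in enumerate(counts.most_common(), start=1):
        remaining -= count  # tokens left after annotating this one and all above it
        rows.append({"text": text, "rank": rank, "count": count, "remain": remaining})
    return rows


rows = frequency_table("meow meow purr meow hiss purr".split())
for row in rows:
    print(row)
```

With this layout, the rank 1 row's `count + remain` equals the corpus size, which is the total the next cell uses when computing percentages.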

In [None]:
import srsly
from pathlib import Path

def get_percentages():
    """Return the first tokens at which 33%, 50% and 66% of the corpus is covered."""
    thresholds = (33, 50, 66)
    found = {}
    # read the file, then parse its content into a list of token dicts
    tokens = srsly.read_json(Path.cwd() / 'new_lang' / 'corpus_json' / 'tokens.json')
    tokens = srsly.json_loads(tokens)
    for token in tokens:
        # the rank 1 token's count plus the tokens remaining after it gives the corpus size
        if token['rank'] == 1:
            total_tokens = token['count'] + token['remain']

        # share of the corpus covered once this token and all higher-ranked ones are annotated
        percent_annotated = int((1 - token['remain'] / total_tokens) * 100)
        for threshold in thresholds:
            # keep only the first token that reaches each threshold
            if threshold not in found and percent_annotated >= threshold:
                found[threshold] = token
    return found[33], found[50], found[66]

third, half, two_thirds = get_percentages()
print(f"""
🍉 To bulk annotate 33% of the corpus, add data to first {third['rank']} tokens
🍅 To bulk annotate 50% of the corpus, add data to first {half['rank']} tokens
🍒 To bulk annotate 66% of the corpus, add data to first {two_thirds['rank']} tokens
""")


🍉 To bulk annotate 33% of the corpus, add data to first 14 tokens
🍅 To bulk annotate 50% of the corpus, add data to first 37 tokens
🍒 To bulk annotate 66% of the corpus, add data to first 100 tokens



Next, we will export your texts and lookups to a TSV file in the CoreNLP format.  This data can then be loaded into INCEpTION for annotation work.

In [None]:
from util.export import download

download(lang_name)


'saved data to file /tmp/conllu_export.zip'
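
For orientation, the sketch below shows what tab-separated token lines of this kind can look like. It assumes the seven-column layout (ID, FORM, LEMMA, POSTAG, NER, HEAD, DEPREL) described for the CoNLL CoreNLP format, with `_` for an empty field and `O` for a token outside any entity; check the INCEpTION documentation before relying on those details. `to_tsv_lines` is a hypothetical helper and is not the code inside `util/export.py`.

```python
def to_tsv_lines(tokens):
    """Format (form, lemma, pos, ent_type) tuples as tab-separated token lines."""
    lines = []
    for index, (form, lemma, pos, ent_type) in enumerate(tokens, start=1):
        fields = [
            str(index),       # ID: position in the sentence, starting at 1
            form,             # FORM
            lemma or "_",     # LEMMA
            pos or "_",       # POSTAG
            ent_type or "O",  # NER
            "_",              # HEAD (left empty for later annotation)
            "_",              # DEPREL (left empty for later annotation)
        ]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n\n"  # a blank line ends the sentence


sentence = [("cats", "cat", "NOUN", None), ("purr", None, None, None)]
print(to_tsv_lines(sentence))
```

Tokens without lookup data still get a line, so they remain available for manual annotation.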

When you have completed all annotation work in INCEpTION, you're ready to begin model training. This final step will export your spaCy language object. From there you can follow the spaCy documentation on model training!  

1. Package the object into a usable folder that can be moved and initialized using spaCy projects.
2. `nlp.to_disk("/tmp/checkpoint")`?


In [None]:
# Create a spaCy project file for your project.
from util.project import make_project


In [None]:
import shutil
from pathlib import Path
from util.project import make_project

new_lang = Path.cwd() / "new_lang"
make_project(lang_name, lang_code)

# base path of the archive; make_archive appends the .zip extension
export_path = Path.cwd() / lang_name

# zip the contents of new_lang into <lang_name>.zip
shutil.make_archive(str(export_path), 'zip', str(new_lang))
zip_file = Path.cwd() / (lang_name + '.zip')
print(f'created file {zip_file}')

created file /home/ajanco/projects/cadet-the-notebook/Meow.zip
