This notebook:
- Loads the project file from GitHub
- Loads the assets from the GitHub repo
- Installs the custom language object
- Converts the training data to spaCy's binary format
- Configures the project.yml file
- Trains the model
- Assesses performance
- Packages the model (or pushes it to Hugging Face)


In [2]:
# Temporary: clear the project folder before re-running the notebook
!rm -rf /srv/projects/course-materials/w2/using-inception-data/newlang_project


In [3]:
private_repo = True #@param {type:"boolean"}
repo_name = "old-chinese" #@param {type:"string"}

!rm -rf /content/newlang_project
!rm -rf $repo_name
if private_repo:
    git_access_token = "" #@param {type:"string"}
    git_url = f"https://{git_access_token}@github.com/New-Languages-for-NLP/{repo_name}/"
    !git clone $git_url  -b main
    !cp -r ./$repo_name/newlang_project .  
    # -p creates parent folders and does not fail if a folder already exists
    !mkdir -p newlang_project/assets/$repo_name newlang_project/configs newlang_project/corpus
    !mkdir -p newlang_project/metrics newlang_project/packages newlang_project/training
    !cp -r ./$repo_name/* newlang_project/assets/$repo_name/
    !rm -rf ./$repo_name
else:
    !python -m spacy project clone newlang_project --repo https://github.com/New-Languages-for-NLP/$repo_name --branch main
    !python -m spacy project assets /content/newlang_project

Cloning into 'old-chinese'...
remote: Enumerating objects: 36, done.
remote: Counting objects: 100% (36/36), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 36 (delta 9), reused 24 (delta 4), pack-reused 0
Unpacking objects: 100% (36/36), 8.30 KiB | 99.00 KiB/s, done.
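
The folder layout built by the `mkdir` calls in the cell above can also be created from Python, which makes the step safe to re-run. This is a minimal sketch; `scaffold_project` and `PROJECT_SUBDIRS` are hypothetical names chosen for illustration and are not part of the project template.

```python
from pathlib import Path

# Sub-directories the cell above creates by hand with `mkdir`.
PROJECT_SUBDIRS = ("assets", "configs", "corpus", "metrics", "packages", "training")


def scaffold_project(project_dir, repo_name):
    """Create the project folders, skipping any that already exist."""
    root = Path(project_dir)
    created = []
    for name in PROJECT_SUBDIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    # Folder that receives the copy of the cloned repository.
    asset_dir = root / "assets" / repo_name
    asset_dir.mkdir(parents=True, exist_ok=True)
    created.append(asset_dir)
    return created
```

Calling `scaffold_project("newlang_project", repo_name)` a second time leaves existing folders untouched instead of raising an error.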


In [6]:
# Install the custom language object from Cadet 
!python -m spacy project run install /srv/projects/course-materials/w2/using-inception-data/newlang_project

/bin/bash: python: command not found
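
The `command not found` message means the shell could not find an executable named `python` on its PATH. One possible workaround, sketched here, is to build the command around the interpreter that is running the notebook. `spacy_command` is a hypothetical helper, and the sketch assumes spaCy is installed for that interpreter.

```python
import shlex
import sys


def spacy_command(*args):
    """Build a spaCy CLI call that uses the interpreter running this notebook."""
    # sys.executable is the absolute path of the current Python interpreter,
    # so the command does not depend on `python` being on the shell's PATH.
    return [sys.executable, "-m", "spacy", *args]


project_dir = "/srv/projects/course-materials/w2/using-inception-data/newlang_project"
install_cmd = spacy_command("project", "run", "install", project_dir)
print(shlex.join(install_cmd))
```

The resulting list can be passed to `subprocess.run(install_cmd, check=True)`. In a notebook cell, `!{sys.executable} -m spacy project run install ...` should have the same effect.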


In [46]:
# Create the training config
!python -m spacy project run config /srv/projects/course-materials/w2/using-inception-data/newlang_project

Running command: /srv/projects/course-materials/temp/venv/bin/python -m spacy init config config.cfg --lang clara -F
⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: clara
- Pipeline: tagger, parser, ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
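
Before training, it can help to confirm that the generated config names the expected language and pipeline. The sketch below parses a hand-written excerpt in the style of `config.cfg` with the standard library. The excerpt is an assumption that mirrors the values in the logs, not a copy of the generated file, and `summarize_config` is a hypothetical helper. With spaCy installed, `spacy.util.load_config` is the more direct route.

```python
import configparser
import json

# Hand-written excerpt in the style of a spaCy config.cfg; the values mirror
# the logs in this notebook and are an assumption, not the generated file.
SAMPLE_CONFIG = """
[paths]
train = null
dev = null

[nlp]
lang = "clara"
pipeline = ["tok2vec","tagger","parser","ner"]
"""


def summarize_config(text):
    """Return the language code and pipeline names from config text."""
    # Disable interpolation so values such as ${paths.train} are read verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    # Values in spaCy configs are written in JSON style.
    lang = json.loads(parser["nlp"]["lang"])
    pipeline = json.loads(parser["nlp"]["pipeline"])
    return lang, pipeline


lang, pipeline = summarize_config(SAMPLE_CONFIG)
print(lang, pipeline)
```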


In [8]:
# Convert the CoNLL-U files exported from INCEpTION to spaCy's binary format.
# This currently requires a manual edit to
# spacy/training/converters/conllu_to_docs.py (line 194):
# if pos == "_":
#     pos = ""

!python -m spacy project run convert /srv/projects/course-materials/w2/using-inception-data/newlang_project -F

Running command: /srv/projects/course-materials/temp/venv/bin/python scripts/convert.py assets/urban-giggle/3_inception_export
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (106 documents): corpus/YorText2.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (360 documents): corpus/YorText3.spacy
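
The converter patch mentioned in the cell comment only matters when the export contains tokens whose UPOS column holds the `_` placeholder. A rough way to check is to count those tokens first. This sketch relies on the CoNLL-U layout of ten tab-separated columns with UPOS in the fourth; `count_missing_upos` and the sample sentence are illustrative and not taken from the project data.

```python
def count_missing_upos(conllu_text):
    """Count token lines whose UPOS column holds the "_" placeholder."""
    missing = 0
    total = 0
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and "# ..." comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 10:
            continue  # not a well-formed CoNLL-U token line
        total += 1
        if columns[3] == "_":  # UPOS is the fourth column
            missing += 1
    return missing, total


# Two-token illustrative sentence; the second token has no UPOS value.
SAMPLE = (
    "# text = a b\n"
    "1\ta\ta\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "2\tb\tb\t_\t_\t_\t1\tdep\t_\t_\n"
)
print(count_missing_upos(SAMPLE))  # (1, 2)
```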


In [9]:
# Read the converted spaCy files and split them into training and validation sets
!python -m spacy project run split /srv/projects/course-materials/w2/using-inception-data/newlang_project -F

Running command: /srv/projects/course-materials/temp/venv/bin/python scripts/split.py 0.4 11
😊 Created 279 training docs
😊 Created 187 validation docs
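
The contents of `scripts/split.py` are not shown here, so the following is only a sketch of a seeded split. It assumes the two arguments `0.4` and `11` are the validation fraction and the random seed. `split_docs` is a hypothetical name, and placeholder strings stand in for the documents. Rounding the validation size up reproduces the 279/187 counts in the log.

```python
import math
import random


def split_docs(docs, dev_fraction, seed):
    """Shuffle docs reproducibly and split them into (train, dev) lists."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    # Round the validation size up (assumption; matches the counts in the log).
    n_dev = math.ceil(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


# The conversion step reported 106 + 360 documents.
docs = [f"doc-{i}" for i in range(106 + 360)]
train, dev = split_docs(docs, dev_fraction=0.4, seed=11)
print(len(train), len(dev))  # 279 187
```

Fixing the seed means the same documents land in each set on every run, which keeps later evaluation scores comparable.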


In [48]:
# Debug the data
!python -m spacy project run debug /srv/projects/course-materials/w2/using-inception-data/newlang_project

Running command: /srv/projects/course-materials/temp/venv/bin/python scripts/update_config.py
Running command: /srv/projects/course-materials/temp/venv/bin/python -m spacy debug data ./config.cfg

✔ Pipeline can be initialized with data
✔ Corpus is loadable

Language: clara
Training pipeline: tok2vec, tagger, parser, ner
279 training docs
187 evaluation docs
✔ No overlap between training and evaluation data
⚠ Low number of examples to train a new pipeline (279)

ℹ 38701 total word(s) in the data (5648 unique)
ℹ No word vectors present in the package

ℹ 0 label(s)
0 missing value(s) (tokens with '-' label)
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace

ℹ 1 label(s) in train data

ℹ Found 38691 sen

In [51]:
# Train the model
!python -m spacy project run train /srv/projects/course-materials/w2/using-inception-data/newlang_project

Running command: /srv/projects/course-materials/temp/venv/bin/python -m spacy train config.cfg --output training/urban-giggle --gpu-id -1 --nlp.lang=clara
✔ Created output directory: training/urban-giggle
ℹ Saving to output directory: training/urban-giggle
ℹ Using CPU

[2021-12-19 20:17:02,411] [INFO] Set up nlp object from config
[2021-12-19 20:17:02,416] [INFO] Pipeline: ['tok2vec', 'tagger', 'parser', 'ner']
[2021-12-19 20:17:02,419] [INFO] Created vocabulary
[2021-12-19 20:17:02,419] [INFO] Finished initializing nlp object
[2021-12-19 20:17:04,527] [INFO] Initialized pipeline components: ['tok2vec', 'tagger', 'parser', 'ner']
✔ Initialized pipeline

ℹ Pipeline: ['tok2vec', 'tagger', 'parser', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TAGGER  LOSS PARSER  LOSS NER  TAG_ACC  DEP_UAS  DEP_LAS  SENTS_F  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  -----------

In [None]:
# Evaluate the model 
!python -m spacy project run evaluate /srv/projects/course-materials/w2/using-inception-data/newlang_project
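
Apart from the evaluate command, the scores recorded at training time can be read from the `meta.json` file that spaCy writes into a saved pipeline folder. This is a minimal sketch; `load_scores` is a hypothetical helper, and it assumes the folder holds a pipeline saved by `spacy train`.

```python
import json
from pathlib import Path


def load_scores(model_dir):
    """Return the numeric scores stored in a trained pipeline's meta.json."""
    meta_path = Path(model_dir) / "meta.json"
    with meta_path.open(encoding="utf8") as f:
        meta = json.load(f)
    # Keep only numeric top-level scores; per-type breakdowns are nested dicts.
    performance = meta.get("performance", {})
    return {k: v for k, v in performance.items() if isinstance(v, (int, float))}
```

For the run above, the call would be `load_scores("newlang_project/training/urban-giggle/model-last")`, using the path from the packaging cell.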

In [1]:
# Package the model (the output folder must exist before packaging)
!mkdir -p ./export
!python -m spacy package ./newlang_project/training/urban-giggle/model-last ./export

/bin/bash: python: command not found
