### Headline Category Classifier Model Training

This notebook trains a text classifier with spaCy.

__Model Type: DistilBERT__

This model was trained in a Python 3.11.3 environment and requires:
- spacy
- spacy-transformers

Please consult the requirements.txt for more info.
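
Before training, it can help to confirm that the two packages listed above are actually installed in the active environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for illustration and is not part of this notebook.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Packages named in the requirements list above.
REQUIRED = ("spacy", "spacy-transformers")


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print("Python", sys.version.split()[0])
for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'not installed'}")
```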

#### 0. Check GPU Status

In [1]:
# check cuda and gpu status
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0


In [2]:
!nvidia-smi

Fri Jun 23 11:35:03 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.120      Driver Version: 529.01       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   35C    P8     1W / 114W |     10MiB /  8188MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
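
The same GPU check can be done from Python, which is convenient when the notebook is run as a script. This is a sketch that assumes `nvidia-smi` is on the `PATH`; the function name `gpu_summary` is hypothetical, and it returns `None` rather than failing when no NVIDIA tooling is found.

```python
import shutil
import subprocess


def gpu_summary():
    """Return one CSV line per GPU (name, memory used, memory total), or None."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools available on this machine
    try:
        result = subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=name,memory.used,memory.total",
                "--format=csv,noheader",
            ],
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return None  # the tool exists but could not query a device
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


summary = gpu_summary()
print(summary if summary is not None else "No NVIDIA GPU detected")
```

Inside the notebook, `spacy.prefer_gpu()` gives a similar yes/no answer from spaCy's side.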

#### 1. Generate config file and modify as necessary to use the correct model

In [3]:
# generate the generic config file
!python -m spacy init config --pipeline textcat config_transformer.cfg --gpu

ℹ Generated config template specific for your use case
- Language: en
- Pipeline: textcat
- Optimize for: efficiency
- Hardware: GPU
- Transformer: roberta-base
✔ Auto-filled config with all values
✔ Saved config
config_transformer.cfg
You can now add your data and train your pipeline:
python -m spacy train config_transformer.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


### Note: Edit the config file to use the correct transformer model


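1. In the `[components.transformer.model]` section, set `name` to `"distilbert-base-uncased"` (the generated template uses `roberta-base`):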

```
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "distilbert-base-uncased"
mixed_precision = false
```

2. Save the file.
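
The manual edit above can also be scripted. The sketch below is one possible approach rather than the method used in this notebook: it rewrites the `name` line only inside `[components.transformer.model]` and leaves every other section untouched. The helper `set_transformer_name` and the `[other.section]` block in the sample are made up for illustration.

```python
import re


def set_transformer_name(cfg_text, model_name):
    """Return cfg_text with `name` in [components.transformer.model] replaced."""
    out = []
    in_section = False
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Track whether we are inside the target section.
            in_section = stripped == "[components.transformer.model]"
        elif in_section and re.match(r"name\s*=", stripped):
            line = f'name = "{model_name}"'
        out.append(line)
    return "\n".join(out) + "\n"


# Sample text: the first section mirrors the snippet above with the generated
# default; [other.section] is a hypothetical section used to show scoping.
sample = '''[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
mixed_precision = false

[other.section]
name = "unchanged"
'''

print(set_transformer_name(sample, "distilbert-base-uncased"))
```

To apply it to the real file, read `config_transformer.cfg` with `pathlib.Path.read_text`, pass the text through the helper, and write the result back with `write_text`.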



#### 2. Train the model and evaluate model performance

In [4]:
# train the model
!python -m spacy train config_transformer.cfg --paths.train ../data/train.spacy --paths.dev ../data/dev.spacy --output textcat_model_transformer --gpu-id 0

✔ Created output directory: textcat_model_transformer
ℹ Saving to output directory: textcat_model_transformer
ℹ Using GPU: 0
[2023-06-23 11:43:45,670] [INFO] Set up nlp object from config
[2023-06-23 11:43:45,676] [INFO] Pipeline: ['transformer', 'textcat']
[2023-06-23 11:43:45,678] [INFO] Created vocabulary
[2023-06-23 11:43:45,678] [INFO] Finished initializing nlp object
Downloading (…)okenizer_config.json: 100%|████| 28.0/28.0 [00:00<00:00, 140kB/s]
Downloading (…)lve/main/config.json: 100%|█████| 483/483 [00:00<00:00, 4.91MB/s]
Downloading (…)solve/main/vocab.txt: 232kB [00:00, 5.33MB/s]
Downloading (…)/main/tokenizer.json: 466kB [00:00, 8.19MB/s]
Downloading model.safetensors: 100%|█████████| 268M/268M [00:23<00:00, 11.5MB/s]
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight',
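
When training is launched from a script instead of a notebook cell, the same command can be assembled as an argument list. The sketch below assumes the paths used in the cell above; `build_train_command` is a hypothetical helper. It also relies on `spacy train` accepting dotted config overrides on the command line, which offers an alternative to hand-editing the transformer name in the config file.

```python
import shlex
import sys


def build_train_command(config, train, dev, output, gpu_id=0, overrides=None):
    """Build the argv list for `python -m spacy train` with optional overrides."""
    cmd = [
        sys.executable, "-m", "spacy", "train", config,
        "--paths.train", train,
        "--paths.dev", dev,
        "--output", output,
        "--gpu-id", str(gpu_id),
    ]
    # Each override becomes `--section.key value`.
    for key, value in (overrides or {}).items():
        cmd += [f"--{key}", str(value)]
    return cmd


cmd = build_train_command(
    "config_transformer.cfg",
    "../data/train.spacy",
    "../data/dev.spacy",
    "textcat_model_transformer",
    overrides={"components.transformer.model.name": "distilbert-base-uncased"},
)
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would start the run; the sketch only prints the command so that it has no side effects.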

In [5]:
# evaluate the model
!python -m spacy evaluate ./textcat_model_transformer/model-best/ --output ./metrics_transformer.json ../data/test.spacy --gpu-id 0

ℹ Using GPU: 0

TOK                 100.00
TEXTCAT (macro F)   50.86
SPEED               5247


                     P       R       F
POLITICS         77.83   80.05   78.92
WELLNESS         58.81   70.48   64.12
ENTERTAINMENT    71.24   73.98   72.58
TRAVEL           78.89   79.05   78.97
HEALTHY LIVING   35.26   35.03   35.15
BUSINESS         52.45   47.93   50.09
WEIRD NEWS       40.36   47.08   43.46
SPORTS           68.01   74.75   71.22
PARENTING        51.73   67.42   58.54
STYLE & BEAUTY   83.05   78.03   80.46
GREEN            41.38   43.72   42.52
FOOD & DRINK     70.45   74.96   72.63
QUEER VOICES     71.81   68.08   69.90
THE WORLDPOST    45.33   48.63   46.92
HOME & LIVING    81.02   75.37   78.09
WEDDINGS         77.75   77.55   77.65
PARENTS          45.98   25.69   32.96
ARTS & CULTURE   50.00   30.22   37.67
CRIME            57.41   51.96   54.55
CULTURE & ARTS   68.89   36.05   47.33
ENVIRONMENT      56.25   20.30   29.83
COMEDY           47.
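
With this many labels, it is easier to see where the macro F score is being pulled down by sorting the per-label results. The sketch below copies the F column from the complete rows of the table above into a dictionary by hand; `weakest_labels` is a hypothetical helper.

```python
# F scores copied from the complete rows of the evaluation table above.
f_scores = {
    "POLITICS": 78.92, "WELLNESS": 64.12, "ENTERTAINMENT": 72.58,
    "TRAVEL": 78.97, "HEALTHY LIVING": 35.15, "BUSINESS": 50.09,
    "WEIRD NEWS": 43.46, "SPORTS": 71.22, "PARENTING": 58.54,
    "STYLE & BEAUTY": 80.46, "GREEN": 42.52, "FOOD & DRINK": 72.63,
    "QUEER VOICES": 69.90, "THE WORLDPOST": 46.92, "HOME & LIVING": 78.09,
    "WEDDINGS": 77.65, "PARENTS": 32.96, "ARTS & CULTURE": 37.67,
    "CRIME": 54.55, "CULTURE & ARTS": 47.33, "ENVIRONMENT": 29.83,
}


def weakest_labels(scores, n=5):
    """Return the n (label, F) pairs with the lowest F score, lowest first."""
    return sorted(scores.items(), key=lambda item: item[1])[:n]


for label, f in weakest_labels(f_scores, n=5):
    print(f"{label:<16} {f:6.2f}")
```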

In [6]:
# check results
import spacy
nlp = spacy.load("textcat_model_transformer/model-best")
doc = nlp("History is made: 10 new UK attractions for day trips and short breaks")
print(doc.cats)

{'POLITICS': 0.00016067341493908316, 'WELLNESS': 0.0005015170318074524, 'ENTERTAINMENT': 9.730336751090363e-05, 'TRAVEL': 0.9956992864608765, 'HEALTHY LIVING': 0.0001028354890877381, 'BUSINESS': 0.0001245860185008496, 'WEIRD NEWS': 0.00013257964747026563, 'SPORTS': 8.15750245237723e-05, 'PARENTING': 0.00030307695851661265, 'STYLE & BEAUTY': 0.00045892540947534144, 'GREEN': 0.00010505902901059017, 'FOOD & DRINK': 0.00029052604804746807, 'QUEER VOICES': 4.714441456599161e-05, 'THE WORLDPOST': 4.2378153011668473e-05, 'HOME & LIVING': 0.00016575682093389332, 'WEDDINGS': 6.386010500136763e-05, 'PARENTS': 4.930210343445651e-05, 'ARTS & CULTURE': 1.0325259609089699e-05, 'CRIME': 3.628828562796116e-05, 'CULTURE & ARTS': 6.851810030639172e-05, 'ENVIRONMENT': 0.00017636224220041186, 'COMEDY': 5.965606396785006e-05, 'RELIGION': 1.7043454136000946e-05, 'MONEY': 7.017145253485069e-05, 'BLACK VOICES': 2.1419115000753663e-05, 'COLLEGE': 2.3839120331103913e-05, 'DIVORCE': 2.7171905458089896e-05, 'U.S.

In [7]:
max(doc.cats, key=doc.cats.get)

'TRAVEL'

In [8]:
doc.cats["TRAVEL"]

0.9956992864608765
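
Looking only at the single best label hides how close the runners-up are. A small helper can list the top few categories from a `doc.cats`-style dictionary. This is a sketch: `top_categories` is a hypothetical name, and the sample dictionary holds four of the scores printed above so that the block runs without loading the model.

```python
def top_categories(cats, k=3):
    """Return the k highest-scoring (label, score) pairs, best first."""
    return sorted(cats.items(), key=lambda item: item[1], reverse=True)[:k]


# Four entries taken from the doc.cats output above.
sample_cats = {
    "POLITICS": 0.00016067341493908316,
    "WELLNESS": 0.0005015170318074524,
    "TRAVEL": 0.9956992864608765,
    "STYLE & BEAUTY": 0.00045892540947534144,
}

for label, score in top_categories(sample_cats, k=3):
    print(f"{label:<15} {score:.4f}")
```

In the notebook, `top_categories(doc.cats)` would work the same way on the full dictionary.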

#### 3. Package the model into a zip file

In [None]:
# zip up the model-best folder

import shutil

model_best_path = "textcat_model_transformer/model-best"
# base name of the archive; make_archive appends ".zip"
zipfile_name = "textcat_model_transformer/model-best"

shutil.make_archive(zipfile_name, "zip", model_best_path)
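
A quick round trip can confirm that the archive unpacks to the same files. The sketch below runs against a throwaway directory with two empty stand-in files, not the real model, so it can be executed anywhere; `zip_and_list` is a hypothetical helper.

```python
import shutil
import tempfile
from pathlib import Path


def zip_and_list(src_dir, archive_base):
    """Zip src_dir, unpack the archive to a temp dir, and list the files found."""
    archive_path = shutil.make_archive(archive_base, "zip", src_dir)
    with tempfile.TemporaryDirectory() as tmp:
        shutil.unpack_archive(archive_path, tmp)
        root = Path(tmp)
        return sorted(
            p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
        )


with tempfile.TemporaryDirectory() as work:
    # Stand-in for model-best: two empty placeholder files.
    model_dir = Path(work) / "model-best"
    model_dir.mkdir()
    (model_dir / "meta.json").write_text("{}")
    (model_dir / "config.cfg").write_text("")

    files = zip_and_list(str(model_dir), str(Path(work) / "model-best"))
    print(files)
```

After unpacking the real archive, `spacy.load` can be pointed at the extracted folder in the same way as in the results check earlier.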

__Note:__ To preserve a trained model, rename its output folder before training again, for example from "textcat_model_transformer" to "textcat_model_transformer_2023-07-17_12-24".
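
The renaming can be automated with a timestamp suffix in the format shown in the note. This is a minimal sketch; both helper names are hypothetical, and `preserve_output_dir` is defined but not called here because it renames a real folder.

```python
from datetime import datetime
from pathlib import Path


def timestamped_name(base, when):
    """Append a YYYY-MM-DD_HH-MM suffix to base."""
    return f"{base}_{when:%Y-%m-%d_%H-%M}"


def preserve_output_dir(path, when=None):
    """Rename an output folder in place so the next run does not overwrite it."""
    src = Path(path)
    dst = src.with_name(timestamped_name(src.name, when or datetime.now()))
    src.rename(dst)
    return dst


print(timestamped_name("textcat_model_transformer", datetime(2023, 7, 17, 12, 24)))
# prints: textcat_model_transformer_2023-07-17_12-24
```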