### Headline Category Classifier Model Training

This notebook trains a headline category text classifier using spaCy.

__Model Type: spaCy Bag of Words__

This model was trained in a Python 3.11.3 environment and requires:
- spacy

Please consult the requirements.txt for more info.

#### 0. Check GPU Status

In [3]:
# check cuda version
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0


In [4]:
# check gpu status
!nvidia-smi

Sun Jun 25 01:31:33 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.120      Driver Version: 529.01       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   39C    P8     2W /  60W |      0MiB /  8188MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
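
The two commands above report the CUDA toolkit and the driver, but not whether spaCy itself can use the GPU. A minimal sketch of that check follows, assuming spaCy is installed with GPU support; `gpu_status` is a hypothetical helper name and not part of the original notebook.

```python
def gpu_status(gpu_id: int = 0):
    """Return True/False from spacy.prefer_gpu, or None if spaCy is not installed."""
    try:
        import spacy
    except ImportError:
        # Assumption: a missing spaCy install is reported as None rather than raised.
        return None
    # prefer_gpu activates the GPU if one is usable and returns whether it succeeded.
    return spacy.prefer_gpu(gpu_id)


print("spaCy GPU available:", gpu_status())
```

A result of `False` with a working `nvidia-smi` usually points at the Python environment (for example, a missing GPU backend package) rather than at the driver.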

#### 1. Generate config file and modify as necessary to use the correct model

In [6]:
# set up the config file
!python -m spacy init config --pipeline textcat config.cfg --gpu

⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: textcat
- Optimize for: efficiency
- Hardware: GPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


#### 2. Train the model and evaluate model performance

In [9]:
# train the model
!python -m spacy train config.cfg --paths.train ../data/train.spacy --paths.dev ../data/dev.spacy --output textcat_model --gpu-id 0

ℹ Saving to output directory: textcat_model
ℹ Using GPU: 0

[2023-06-25 01:34:36,478] [INFO] Set up nlp object from config
[2023-06-25 01:34:36,485] [INFO] Pipeline: ['textcat']
[2023-06-25 01:34:36,486] [INFO] Created vocabulary
[2023-06-25 01:34:36,486] [INFO] Finished initializing nlp object
[2023-06-25 01:34:51,542] [INFO] Initialized pipeline components: ['textcat']
✔ Initialized pipeline

ℹ Pipeline: ['textcat']
ℹ Initial learn rate: 0.001
E    #       LOSS TEXTCAT  CATS_SCORE  SCORE 
---  ------  ------------  ----------  ------
  0       0          0.02        1.95    0.02
  0     200          4.57        3.32    0.03
  0     400          4.31        4.32    0.04
  0     600          4.12        5.24    0.05
  0     800          4.05        7.45    0.07
  0    1000          3.92       10.17    0.10
  0    1200          3.79       12.67    0.13
  0    1400          3.72       14.93    0.15
  0    1600      

In [10]:
# evaluate the model
!python -m spacy evaluate ./textcat_model/model-best/ --output ./metrics.json ../data/test.spacy --gpu-id 0

ℹ Using GPU: 0

TOK                 100.00
TEXTCAT (macro F)   34.56
SPEED               6900

                      P       R       F
POLITICS          64.86   81.76   72.34
WELLNESS          47.18   71.81   56.95
ENTERTAINMENT     56.85   69.29   62.46
TRAVEL            58.65   71.37   64.39
HEALTHY LIVING    25.77   22.04   23.76
BUSINESS          37.39   36.65   37.02
WEIRD NEWS        28.57   24.17   26.19
SPORTS            56.75   57.78   57.26
PARENTING         44.39   53.91   48.69
STYLE & BEAUTY    72.33   73.54   72.93
GREEN             35.46   20.24   25.77
FOOD & DRINK      57.30   64.12   60.52
QUEER VOICES      61.32   58.33   59.79
THE WORLDPOST     40.06   41.03   40.54
HOME & LIVING     67.48   61.94   64.59
WEDDINGS          74.50   67.89   71.04
PARENTS           30.34   26.93   28.53
ARTS & CULTURE     0.00    0.00    0.00
CRIME             48.49   44.97   46.67
CULTURE & ARTS    80.00    9.30   16.67
ENVIRONMENT       55.88   14.29   22.7
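
The macro F reported above is the unweighted mean of the per-label F-scores, so labels that score near zero (such as ARTS & CULTURE) pull it down as much as large labels lift it. The sketch below recomputes F from P and R for three rows copied from the report. The function names are illustrative, and because the table is truncated in this export, the mean is taken over that subset only and is not the notebook's 34.56.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f(per_label: dict[str, tuple[float, float]]) -> float:
    """Unweighted mean of the per-label F-scores."""
    scores = [f_score(p, r) for p, r in per_label.values()]
    return sum(scores) / len(scores)


# (P, R) pairs copied from three rows of the evaluation report.
subset = {
    "POLITICS": (64.86, 81.76),
    "WEDDINGS": (74.50, 67.89),
    "ARTS & CULTURE": (0.00, 0.00),
}

for label, (p, r) in subset.items():
    print(f"{label:<16}{f_score(p, r):6.2f}")
print(f"macro F over this subset: {macro_f(subset):.2f}")
```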

In [11]:
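# check results on a sample headline
import spacy

# load the best checkpoint written by the training run
nlp = spacy.load("textcat_model/model-best")
doc = nlp("History is made: 10 new UK attractions for day trips and short breaks")
# doc.cats maps each category label to its predicted score
print(doc.cats)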

{'POLITICS': 0.008987436071038246, 'WELLNESS': 0.00651813019067049, 'ENTERTAINMENT': 0.018183095380663872, 'TRAVEL': 0.014709735289216042, 'HEALTHY LIVING': 0.000356932170689106, 'BUSINESS': 0.0005905973957851529, 'WEIRD NEWS': 0.0001808985834941268, 'SPORTS': 0.047901950776576996, 'PARENTING': 0.24622957408428192, 'STYLE & BEAUTY': 0.09931683540344238, 'GREEN': 0.015597987920045853, 'FOOD & DRINK': 0.003302691038697958, 'QUEER VOICES': 0.1706678569316864, 'THE WORLDPOST': 3.156016464345157e-05, 'HOME & LIVING': 0.016233058646321297, 'WEDDINGS': 0.006742260884493589, 'PARENTS': 4.187962622381747e-05, 'ARTS & CULTURE': 1.1152944807690801e-06, 'CRIME': 1.0797774848469999e-05, 'CULTURE & ARTS': 2.8245676730875857e-07, 'ENVIRONMENT': 9.458528074901551e-05, 'COMEDY': 0.00025647395523265004, 'RELIGION': 0.02188471332192421, 'MONEY': 0.0008230795501731336, 'BLACK VOICES': 0.00021754769841209054, 'COLLEGE': 0.0002326863177586347, 'DIVORCE': 0.0014017171924933791, 'U.S. NEWS': 9.816215606406331

In [12]:
max(doc.cats, key=doc.cats.get)

'PARENTING'

In [14]:
doc.cats["PARENTING"]

0.24622957408428192
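
Looking only at the top label hides how close the runners-up are. The sketch below ranks the labels by score instead. `top_k` is a hypothetical helper, and `sample_cats` holds a few rounded values copied from the visible part of the `doc.cats` output rather than the full dictionary.

```python
def top_k(cats: dict[str, float], k: int = 3) -> list[tuple[str, float]]:
    """Return the k highest-scoring (label, score) pairs, best first."""
    return sorted(cats.items(), key=lambda item: item[1], reverse=True)[:k]


# Rounded scores copied from the visible part of the output above.
sample_cats = {
    "PARENTING": 0.2462,
    "QUEER VOICES": 0.1707,
    "STYLE & BEAUTY": 0.0993,
    "SPORTS": 0.0479,
    "TRAVEL": 0.0147,
}

for label, score in top_k(sample_cats):
    print(f"{label:<16}{score:.4f}")
```

With a loaded pipeline, the same helper can be called as `top_k(doc.cats)`. For this headline the winning score is only about 0.25, which suggests the model is not confident in its choice.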

#### 3. Package the model into a Zip file

In [None]:
# zip up the model-best directory

import shutil

model_best_path = "textcat_model/model-best"
# make_archive appends ".zip", so this writes textcat_model/model-best.zip
zipfile_name = "textcat_model/model-best"

shutil.make_archive(zipfile_name, "zip", model_best_path)

__Note:__ To preserve a trained model, rename its output folder before training again, because the next run writes to the same folder. For example, "textcat_model" > "textcat_model_2023-07-17_12-24".