# Data Preparation
Here, we clean up our data, the AG News dataset from Kaggle, for use in named entity recognition (NER) and text classification.
To download the AG dataset and the NER annotations, run the following lines in a terminal:
```
aws s3 cp s3://applied-nlp-book/data/ data --recursive --no-sign-request
aws s3 cp s3://applied-nlp-book/models/ag_dataset/ models/ag_dataset --recursive --no-sign-request
```

In [None]:
# Install the AWS CLI and download the data and models used in this task
!pip install awscli
!aws s3 cp s3://applied-nlp-book/data/ data --recursive --no-sign-request
!aws s3 cp s3://applied-nlp-book/models/ag_dataset/ models/ag_dataset --recursive --no-sign-request

In [None]:
# Import libraries
import pandas as pd
import os

# Get CWD
cwd = os.getcwd()

# Read the dataset, normalize the column names (spaces to underscores, lower case),
#   and add a column that maps each class index to its name
data = pd.read_csv(cwd + '/data/ag_dataset/train.csv')
data.columns = data.columns.str.replace(" ", "_")
data.columns = data.columns.str.lower()
data["class_name"] = data["class_index"].map({1:"World", 2:"Sports", 3:"Business", 4:"Sci_Tech"})

# Clean up data a little more
cols = ["title", "description"]
data[cols] = data[cols].applymap(lambda x: x.replace("\\"," "))
data[cols] = data[cols].applymap(lambda x: x.replace("#36","$"))
data[cols] = data[cols].applymap(lambda x: x.replace("  "," "))
data[cols] = data[cols].applymap(lambda x: x.strip())

# Grab data head
data.head()

| Unnamed: 0 | class_index | title | description | class_name |
|---|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... | Business |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... | Business |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries ab... | Business |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export f... | Business |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... | Business |
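
One detail of the cleanup cell is worth noting: a single `replace("  ", " ")` pass only collapses pairs of spaces, so a run of three or more spaces can still leave a double space behind. The sketch below folds the same steps into one function and collapses every whitespace run. `clean_text` is a hypothetical helper name, not part of the original notebook, and the sample string is made up for illustration.

```python
def clean_text(text: str) -> str:
    """Apply the notebook's cleanup steps to one string (hypothetical helper)."""
    text = text.replace("\\", " ")   # backslashes become spaces
    text = text.replace("#36", "$")  # "#36" stands in for a dollar sign in this dataset
    return " ".join(text.split())    # collapse whitespace runs and strip both ends


# Made-up sample string, not a row from the dataset
sample = "Oil hits #3650 a barrel\\as   stocks slide "
print(clean_text(sample))
```

With pandas, a function like this could be applied column by column, for example `data[cols] = data[cols].apply(lambda s: s.map(clean_text))`.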


# Install spaCy

In [None]:
# NOTE: We install the cuda112 extra because this notebook environment runs CUDA 11.2 (check with nvcc --version)
!pip install -U spacy[cuda112,transformers,lookups]
!pip install -U spacy-lookups-data==1.0.0
!pip install -U cupy-cuda112
!python -m spacy download en_core_web_lg

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy-transformers<1.2.0,>=1.1.2
  Downloading spacy_transformers-1.1.8-py2.py3-none-any.whl (53 kB)
Collecting cupy-cuda112<12.0.0,>=5.0.0b4
  Downloading cupy_cuda112-10.6.0-cp38-cp38-manylinux1_x86_64.whl (80.8 MB)
Collecting spacy-lookups-data<1.1.0,>=1.0.3
  Downloading spacy_lookups_data-1.0.3-py2.py3-none-any.whl (98.5 MB)
Collecting transformers<4.22.0,>=3.4.0
  Downloading transformers-4.21.3-py3-none-any.whl (4.7 MB)
Collecting spacy-alignments<1.0.0,>=0.7.2
  Downloading spacy_alignments-0.8.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)

# Train NER model

To train our spaCy model, we first generate a complete config file from our template using spaCy's `init fill-config` command.

In [None]:
# Set the file paths for the config template (input) and the filled config (output):
ner_path = "/data/ag_dataset/ner/"                                                  # Providing path to NER annotation directory
config_file_path_input = cwd + ner_path + "config_spacy_template_gpu_blank.cfg"     # Path to downloaded spaCy cfg template
config_file_path_output = cwd + ner_path + "config_final_gpu_blank.cfg"             # File path for filled config to be saved at
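
The paths above are built by string concatenation, which works here but depends on every fragment carrying the right slashes. As an alternative, here is a minimal sketch using `pathlib`. The variable names `base`, `ner_dir`, `config_in`, and `config_out` are illustrative and do not appear in the original notebook.

```python
from pathlib import Path

# Build the same locations with pathlib instead of string concatenation
base = Path.cwd()
ner_dir = base / "data" / "ag_dataset" / "ner"
config_in = ner_dir / "config_spacy_template_gpu_blank.cfg"   # downloaded template
config_out = ner_dir / "config_final_gpu_blank.cfg"           # filled config to be written

# Checking for the template up front gives a clearer error than a failed CLI call
print(config_in.exists())
```

When a path is needed as a string, for example in a `!python -m spacy ...` cell, `str(config_in)` converts it.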

In [None]:
# Generate config file, taking in the template as input file path and giving a path for the config to be saved
!python -m spacy init fill-config "$config_file_path_input" "$config_file_path_output"

✔ Auto-filled config with all values
✔ Saved config
/content/data/ag_dataset/ner/config_final_gpu_blank.cfg
You can now add your data and train your pipeline:
python -m spacy train config_final_gpu_blank.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
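
The suggested command passes settings such as `--paths.train` on the command line. These dotted names address a section and a key in the config file, so `--training.max_epochs 30` refers to `max_epochs` under `[training]`. The sketch below only illustrates that naming scheme with the standard library. It is not spaCy's own config loader, `SAMPLE_CFG` is a made-up fragment rather than the real template, and `apply_override` is a hypothetical helper.

```python
import configparser
import json

# Illustrative fragment only; a real spaCy config has many more sections
SAMPLE_CFG = """
[paths]
train = null
dev = null

[training]
max_epochs = 0
"""


def apply_override(parser, dotted_key, value):
    """Set section.key to a JSON-encoded value, mimicking a dotted CLI override."""
    section, _, key = dotted_key.rpartition(".")
    parser[section][key] = json.dumps(value)


parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE_CFG)
apply_override(parser, "training.max_epochs", 30)
apply_override(parser, "paths.train", "./train.spacy")

print(json.loads(parser["training"]["max_epochs"]))
print(json.loads(parser["paths"]["train"]))
```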


Now that we've generated a config file for the model, we can start training:

In [None]:
# Set file parameters for training
output_path = cwd + "/models/ag_dataset/ner/ner-gpu-blank"                          # File path for model to be saved at
train_path = cwd + "/data/ag_dataset/ner/annotations/binary/train"                  # File path for training data annotations
dev_path = cwd + "/data/ag_dataset/ner/annotations/binary/eval"                     # File path for evaluation data annotations

In [None]:
# Do the training, passing in config file path, output path, training path, and eval path (called dev path here)
!python -m spacy train "$config_file_path_output" \
 --output "$output_path" --paths.train "$train_path" --paths.dev "$dev_path" \
  --training.max_epochs 30 --gpu-id 0 --verbose

# Output removed because it was too large, but it essentially saved our model at the output_path for
#   later use.
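
Outside a notebook, the `!python -m spacy train ...` cell has no direct equivalent, so the same command could be assembled as an argument list and run with `subprocess`. This is a sketch under that assumption. `build_train_command` and `run_training` are hypothetical helper names, and the example paths are placeholders. The flags themselves are the ones used in the cell above.

```python
import subprocess
import sys


def build_train_command(config_path, output_path, train_path, dev_path,
                        max_epochs=30, gpu_id=0):
    """Return the argument list for the same `spacy train` call used above."""
    return [
        sys.executable, "-m", "spacy", "train", str(config_path),
        "--output", str(output_path),
        "--paths.train", str(train_path),
        "--paths.dev", str(dev_path),
        "--training.max_epochs", str(max_epochs),
        "--gpu-id", str(gpu_id),
        "--verbose",
    ]


def run_training(cmd):
    """Launch training; assumes spaCy is installed and the paths exist."""
    subprocess.run(cmd, check=True)


# Placeholder paths for illustration; nothing is executed here
cmd = build_train_command("config_final_gpu_blank.cfg", "models/ner", "train", "eval")
print(" ".join(cmd))
```

Passing a list rather than a single shell string avoids quoting problems when a path contains spaces.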

# Run NER Model

In [None]:
import spacy
from spacy import displacy
import random

spacy.require_gpu()
custom_ner_model = spacy.load(cwd + \
    '/models/ag_dataset/ner/ner-gpu-blank/model-last')
options = {"ents": ["ORG","PERSON","GPE","TICKER"]}

for j in range(3):
    i = random.randrange(len(data))  # randint(0, len(data)) is inclusive and could pick a row past the end
    print("Article",i)
    doc_custom = custom_ner_model(data.loc[i,"description"])
    print("Custom Model NER:")
    displacy.render(doc_custom, style="ent", options=options, jupyter=True)
    print("\n")



Article 104315
Custom Model NER:

Article 72745
Custom Model NER:

Article 116312
Custom Model NER:

In [None]:
prediction = custom_ner_model("Bolsonaro supporters try to storm police HQ in 'January 6-style' rampage")
print("Custom Model NER:")
displacy.render(prediction, style="ent", options=options, jupyter=True)

Custom Model NER:
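
`displacy.render` produces HTML, which only shows up inside a live notebook. To get the same information as plain text, the entities can be read from `doc.ents`, where each span exposes `text`, `label_`, `start_char`, and `end_char`. The sketch below uses a hypothetical helper, `entity_rows`, and a stand-in object with made-up spans so that it runs without the trained model. It does not show real model output.

```python
from types import SimpleNamespace


def entity_rows(doc, labels=("ORG", "PERSON", "GPE", "TICKER")):
    """List (text, label, start, end) for entities whose label is in `labels`."""
    return [
        (ent.text, ent.label_, ent.start_char, ent.end_char)
        for ent in doc.ents
        if ent.label_ in labels
    ]


# Stand-in for a spaCy Doc with made-up spans (NOT model predictions)
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Acme Corp", label_="ORG", start_char=0, end_char=9),
    SimpleNamespace(text="Tuesday", label_="DATE", start_char=20, end_char=27),
])

for row in entity_rows(fake_doc):
    print(row)
```

With the real model, the call would be `entity_rows(custom_ner_model(text))`.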


# Train Text Classification Model
The process of training a text classification model is similar to that of training an NER model with spaCy: we generate a config file and then train on annotated data.

In [None]:
config_file_path_output = cwd + "/data/ag_dataset/textcat/config_final.cfg"     # Config file parameter

# Create the config from scratch. This call takes a few more parameters: the language, the pipeline,
#   the optimization target, and a GPU flag. Textcat means text categorization, and multilabel means the
#   labels are not mutually exclusive, so a single training example may carry several labels (here,
#   because of conflicting annotations).
!python -m spacy init config "$config_file_path_output" --lang en \
--pipeline textcat_multilabel --optimize efficiency --gpu --force

ℹ Generated config template specific for your use case
- Language: en
- Pipeline: textcat_multilabel
- Optimize for: efficiency
- Hardware: GPU
- Transformer: roberta-base
✔ Auto-filled config with all values
✔ Saved config
/content/data/ag_dataset/textcat/config_final.cfg
You can now add your data and train your pipeline:
python -m spacy train config_final.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


Now that we've created our config file from scratch, we can train our model:

In [None]:
import spacy
annots_path = "/data/ag_dataset/textcat/annotations/binary/"
output_path = cwd + "/models/ag_dataset/textcat/full_labels"
train_path = cwd + annots_path + "train_full_labels"
dev_path = cwd + annots_path + "eval"

In [None]:
!python -m spacy train "$config_file_path_output" \
--output "$output_path" --paths.train "$train_path" \
--paths.dev "$dev_path" --gpu-id 0 --training.max_epochs 1 --verbose

[2022-12-14 05:54:50,119] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev', 'training.max_epochs']
DEBUG:spacy:Config overrides from CLI: ['paths.train', 'paths.dev', 'training.max_epochs']
ℹ Saving to output directory:
/content/models/ag_dataset/textcat/full_labels
ℹ Using GPU: 0
[2022-12-14 05:54:51,963] [INFO] Set up nlp object from config
INFO:spacy:Set up nlp object from config
[2022-12-14 05:54:51,972] [DEBUG] Loading corpus from path: /content/data/ag_dataset/textcat/annotations/binary/eval
DEBUG:spacy:Loading corpus from path: /content/data/ag_dataset/textcat/annotations/binary/eval
[2022-12-14 05:54:51,974] [DEBUG] Loading corpus from path: /content/data/ag_dataset/textcat/annotations/binary/train_full_labels
DEBUG:spacy:Loading corpus from path: /content/data/ag_dataset/textcat/annotations/binary/train_full_labels
[2022-12-14 05:54:51,974] [INFO] Pipeline: ['textcat_multilabel']
INFO:spacy:Pipeline: ['textcat_multilabel']
[2022-12-

# Run Text Classification Model

In [None]:
import spacy

spacy.require_gpu()
custom_text_class_model = spacy.load(cwd + \
    '/models/ag_dataset/textcat/full_labels/model-best')
# To predict, call the loaded model on a string. The per-category scores are
#   stored in the .cats attribute of the returned Doc
prediction = custom_text_class_model("2023 NFL draft QB Hot Board: Ranking top 17 quarterbacks, risers")
print("Custom Model classifications:")
print(prediction.cats)

Custom Model classifications:
{'World': 0.394867479801178, 'Sci_Tech': 0.42078670859336853, 'Business': 0.44176068902015686, 'Sports': 0.518925130367279}
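
Because the pipeline is `textcat_multilabel`, each category is scored independently, so the scores need not sum to 1 (here they sum to about 1.78). Turning them into a decision takes one more step. The sketch below shows two options using the scores printed above: take the single highest-scoring label, or keep every label above a cutoff. The helper names and the 0.5 threshold are assumptions for illustration, not something the notebook prescribes.

```python
def top_category(cats):
    """Return the label with the highest score."""
    return max(cats, key=cats.get)


def labels_above(cats, threshold=0.5):
    """Return all labels scoring at least `threshold`, highest first."""
    kept = [label for label, score in cats.items() if score >= threshold]
    return sorted(kept, key=cats.get, reverse=True)


# Scores copied from the output above
cats = {'World': 0.394867479801178, 'Sci_Tech': 0.42078670859336853,
        'Business': 0.44176068902015686, 'Sports': 0.518925130367279}

print(top_category(cats))   # Sports
print(labels_above(cats))   # ['Sports']
```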


In [None]:
!zip -r "text-categorization-model.zip" "/content/models/ag_dataset/textcat/full_labels/model-best"

  adding: content/models/ag_dataset/textcat/full_labels/model-best/ (stored 0%)
  adding: content/models/ag_dataset/textcat/full_labels/model-best/tokenizer (deflated 81%)
  adding: content/models/ag_dataset/textcat/full_labels/model-best/meta.json (deflated 62%)
  adding: content/models/ag_dataset/textcat/full_labels/model-best/config.cfg (deflated 61%)
  adding: content/models/ag_dataset/textcat/full_labels/model-best/vocab/ (stored 0%)
  adding: content/models/ag_dataset/textcat/full_labels/model-best/vocab/vectors (deflated 45%)
  adding: content/models/ag_dataset/textcat/full_labels/model-best/vocab/lookups.bin (stored 0%)
  adding: content/models/ag_dataset/textcat/full_labels/model-best/vocab/key2row (stored 0%)
  adding: content/models/ag_dataset/textcat/full_labels/model-best/vocab/strings.json (deflated 75%)
  adding: content/models/ag_dataset/textcat/full_labels/model-best/vocab/vectors.cfg (stored 0%)
  adding: content/models/ag_dataset/textcat/full_labels/model-best/textca
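
Since `zip -r` is given an absolute path, the archive stores the full `content/models/...` prefix, as the `adding:` lines show. That prefix reappears in the path used when the model is loaded from Drive. If a flatter archive is preferred, one option is `shutil.make_archive` with `root_dir` and `base_dir`. This is a sketch: `archive_model` is a hypothetical helper, and the demo runs on a throwaway directory rather than the real model.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def archive_model(model_dir, archive_stem):
    """Zip `model_dir` so the archive contains only its last path component."""
    model_dir = Path(model_dir)
    return shutil.make_archive(
        str(archive_stem), "zip",
        root_dir=str(model_dir.parent),  # paths in the archive are relative to this
        base_dir=model_dir.name,         # only this folder is archived
    )


# Demo on a temporary stand-in directory (not the trained model)
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "model-best"
    model_dir.mkdir()
    (model_dir / "meta.json").write_text("{}")
    archive_path = archive_model(model_dir, Path(tmp) / "text-categorization-model")
    with zipfile.ZipFile(archive_path) as zf:
        archive_names = zf.namelist()

print(archive_names)
```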

# Running Saved Models


First, we reinstall spaCy and mount our Google Drive:

In [None]:
!pip install -U spacy[cuda112,transformers,lookups]
!pip install -U spacy-lookups-data==1.0.0 
!pip install -U cupy-cuda112
!python -m spacy download en_core_web_lg

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Now, we can load our spaCy models from Google Drive:

# NER Demo

In [None]:
# Import necessary libraries
import spacy
from spacy import displacy
import random

In [None]:
# Load in our NER Model
spacy.require_gpu()
custom_ner_model = spacy.load('/content/drive/MyDrive/NLP Book Notes/ner-last-model/models/ag_dataset/ner/ner-gpu-blank/model-last')
options = {"ents": ["ORG","PERSON","GPE","TICKER"]}

# Generate Results
headline = input("Input a headline to predict on: \n\t")
print()
doc_custom = custom_ner_model(headline)
print("Custom Model NER Output:")
displacy.render(doc_custom, style="ent", options=options, jupyter=True)

Input a headline to predict on: 
	Ukraine's Zelenskiy opposes idea of Russian athletes at Olympics under neutral banner

Custom Model NER Output:


# Text Categorization Demo

In [None]:
import spacy

spacy.require_gpu()
custom_text_class_model = spacy.load('/content/drive/MyDrive/NLP Book Notes/text-categorization-model/content/models/ag_dataset/textcat/full_labels/model-best')
# To predict, call the loaded model on a string. The per-category scores are
#   stored in the .cats attribute of the returned Doc
headline = input("Input a headline to predict on: \n\t")
print()
probabilities = custom_text_class_model(headline).cats
print("Model's probabilities of each category are:")
print(probabilities)