# **Build a binary text classifier (sentiment analysis) using our custom TextCategorizer component**

https://medium.com/@johnidouglasmarangon/building-a-text-classification-model-with-spacy-3-x-57e59fa50547

## Data Preparation

The dataset contains tweets in Brazilian Portuguese labeled for sentiment analysis with two categories: positive (1) and negative (0).

In [1]:
import pandas as pd
df = pd.read_csv("https://gist.githubusercontent.com/johnidm/582cfeadd2bf418df4539c9422f824d2/raw/twitter-sentiment-pt-BR-md-2-l.csv")
df.head()
     

Unnamed: 0,tweet_text,sentiment
0,Mas tu não és feio :( @SavageFluxXx__,0
1,@SamaraPaivas Você que pensa :),1
2,te amo demais :( https://t.co/leUzS65WrG,0
3,@nicko_donis lindo! :),1
4,"@B_kirikihira Oi, tem sim! Visite nossos canai...",1


In [2]:
import string
import re
import spacy
from spacy.lang.pt.stop_words import STOP_WORDS

nlp = spacy.blank("pt")

# Twitter handles contain only letters, digits, and underscores.
REGX_USERNAME = r"@[A-Za-z0-9_]+"
REGX_URL = r"https?://[A-Za-z0-9./]+"


def preprocessing(text):
    text = text.lower()

    text = re.sub(REGX_USERNAME, " ", text)
    text = re.sub(REGX_URL, " ", text)

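    # Map emoticons to word tokens so they survive the punctuation filter below.
    emoticons = {":)": "emocaopositiva", ":(": "emocaonegativa"}

    for emoticon, token in emoticons.items():
        text = text.replace(emoticon, token)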

    tokens = [token.text for token in nlp(text)]

    tokens = [
        t
        for t in tokens
        if t not in STOP_WORDS and t not in string.punctuation and len(t) > 3
    ]

    tokens = [t for t in tokens if not t.isdigit()]

    return " ".join(tokens)


df["tweet_text_clean"] = df["tweet_text"].apply(preprocessing)
df.head()


Unnamed: 0,tweet_text,sentiment,tweet_text_clean
0,Mas tu não és feio :( @SavageFluxXx__,0,feio emocaonegativa
1,@SamaraPaivas Você que pensa :),1,pensa emocaopositiva
2,te amo demais :( https://t.co/leUzS65WrG,0,emocaonegativa
3,@nicko_donis lindo! :),1,lindo emocaopositiva
4,"@B_kirikihira Oi, tem sim! Visite nossos canai...",1,visite canais saiba projeto incrivel emocaopos...
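
To see what the regex and emoticon steps do on their own, the sketch below applies them to two tweets from the table using only the standard library. The helper name `strip_noise` and the handle pattern (letters, digits, underscore) are assumptions made for this illustration; the spaCy tokenizer, stop-word filter, and length filter are left out on purpose.

```python
import re

# Assumption: handles are matched as letters, digits, and underscores only.
HANDLE = re.compile(r"@[A-Za-z0-9_]+")
URL = re.compile(r"https?://[A-Za-z0-9./]+")
EMOTICONS = {":)": "emocaopositiva", ":(": "emocaonegativa"}


def strip_noise(text):
    """Lowercase, drop handles and URLs, and map emoticons to word tokens."""
    text = text.lower()
    text = HANDLE.sub(" ", text)
    text = URL.sub(" ", text)
    for emoticon, token in EMOTICONS.items():
        text = text.replace(emoticon, token)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())


print(strip_noise("te amo demais :( https://t.co/leUzS65WrG"))
print(strip_noise("@nicko_donis lindo! :)"))
```

In the notebook's full `preprocessing` function, punctuation, stop words, and tokens of three characters or fewer are removed afterwards, which is why row 2 of the table is reduced to just `emocaonegativa`.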


In [3]:
dataset = list(df[["tweet_text_clean", "sentiment"]].sample(frac=1).itertuples(index=False, name=None))
train_data = dataset[:15000]
dev_data = dataset[15000:18000]
test_data = dataset[18000:]

print(f"Total: {len(dataset)} - Train:  {len(train_data)} - Dev: {len(dev_data)} - Test: {len(test_data)}")

Total: 20000 - Train:  15000 - Dev: 3000 - Test: 2000
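
Because `sample(frac=1)` is called without a seed, each run produces a different split. If reproducibility matters, one option is to fix the seed (pandas accepts a `random_state` argument in `sample`). The standard-library sketch below shows the same shuffle-then-slice idea with a fixed seed; `split_dataset`, the seed value, and the placeholder rows are hypothetical and only serve this example.

```python
import random


def split_dataset(rows, n_train, n_dev, seed=42):
    """Shuffle a copy of rows with a fixed seed, then slice into train/dev/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


# Placeholder rows standing in for the 20,000 (text, label) tuples.
rows = [(f"tweet {i}", i % 2) for i in range(20000)]
train, dev, test = split_dataset(rows, n_train=15000, n_dev=3000)
print(len(train), len(dev), len(test))
```

Calling `split_dataset` again with the same seed returns the same three lists, so the dev and test sets stay fixed across experiments.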


We serialize the data with spaCy's DocBin structure, which stores collections of Doc objects efficiently and is the format that the `spacy train` command reads.

In [4]:
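from pathlib import Path

from spacy.tokens import DocBin


def convert(data, outfile):
    """Serialize (text, label) pairs to a .spacy file with POS/NEG categories."""
    db = DocBin()
    for doc, label in nlp.pipe(data, as_tuples=True):
        # Mutually exclusive categories: exactly one is True per document.
        doc.cats["POS"] = label == 1
        doc.cats["NEG"] = label == 0
        db.add(doc)

    db.to_disk(outfile)


# Make sure the output directory exists before writing.
Path("resources").mkdir(exist_ok=True)

convert(train_data, "resources/train.spacy")
convert(dev_data, "resources/dev.spacy")
convert(test_data, "resources/test.spacy")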
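
A quick way to check that the categories survive serialization is to round-trip a DocBin and inspect `doc.cats` on the restored documents. The sketch below does this in memory with two cleaned tweets taken from the table above; it assumes spaCy 3.x is installed and uses `to_bytes`/`from_bytes` instead of files so that it has no side effects.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("pt")

# Two (text, label) pairs in the same shape as the notebook's dataset.
examples = [("lindo emocaopositiva", 1), ("feio emocaonegativa", 0)]

db = DocBin()
for doc, label in nlp.pipe(examples, as_tuples=True):
    doc.cats["POS"] = label == 1
    doc.cats["NEG"] = label == 0
    db.add(doc)

# Serialize and restore, then rebuild Doc objects against the same vocab.
restored = DocBin().from_bytes(db.to_bytes())
docs = list(restored.get_docs(nlp.vocab))

for doc in docs:
    print(doc.text, doc.cats)
```

The same inspection works on the files written by `convert` by loading them with `DocBin().from_disk(path)`.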

## Training a Pipeline

Training config files hold all the settings and hyperparameters for training your pipeline, so you don't have to pass a long list of arguments on the command line or in source code.

In [5]:
!python -m spacy init config --lang pt --pipeline textcat --optimize efficiency --force config.cfg

⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: pt
- Pipeline: textcat
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


## Train Step

In [25]:
!python -m spacy train config.cfg --paths.train resources/train.spacy --paths.dev resources/dev.spacy --output resources/model --verbose


# train from code
# from spacy.cli.train import train

# train(
#     "./config.cfg",
#     overrides={
#         "paths.train": "resources/train.spacy",
#         "paths.dev": "resources/dev.spacy",
#     },
#     output_path="resources/model",
# )

ℹ Saving to output directory: resources\model
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['textcat']
ℹ Initial learn rate: 0.001
E    #       LOSS TEXTCAT  CATS_SCORE  SCORE 
---  ------  ------------  ----------  ------
  0       0          0.25       38.38    0.38
  0     200          6.86       97.49    0.97
  0     400          2.23       98.93    0.99
  0     600          1.94       98.73    0.99
  1     800          1.52       98.36    0.98
  1    1000          1.16       98.93    0.99
  2    1200          0.68       98.79    0.99
  3    1400          0.32       98.79    0.99
  4    1600          0.22       98.89    0.99
  5    1800          0.09       98.89    0.99
  7    2000          0.02       98.83    0.99
  9    2200          0.06       98.83    0.99
 11    2400          0.04       98.79    0.99
 13    2600          0.03       98.73    0.99
✔ Saved pipeline to output directory
resour

[2024-03-08 18:38:50,651] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev']
[2024-03-08 18:38:50,830] [INFO] Set up nlp object from config
[2024-03-08 18:38:50,849] [DEBUG] Loading corpus from path: resources\dev.spacy
[2024-03-08 18:38:50,852] [DEBUG] Loading corpus from path: resources\train.spacy
[2024-03-08 18:38:50,852] [INFO] Pipeline: ['textcat']
[2024-03-08 18:38:50,856] [INFO] Created vocabulary
[2024-03-08 18:38:50,856] [INFO] Finished initializing nlp object
[2024-03-08 18:38:55,310] [INFO] Initialized pipeline components: ['textcat']
[2024-03-08 18:38:55,332] [DEBUG] Loading corpus from path: resources\dev.spacy
[2024-03-08 18:38:55,334] [DEBUG] Loading corpus from path: resources\train.spacy
[2024-03-08 18:38:55,340] [DEBUG] Removed existing output directory: resources\model\model-best
[2024-03-08 18:38:55,344] [DEBUG] Removed existing output directory: resources\model\model-last
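
In the training table, the loss keeps falling after step 1000 while the dev score stays flat, which suggests the later steps add little. The sketch below parses the table to find where the displayed score peaks. The rows are copied from the log above and `parse_log` and `best_steps` are hypothetical helpers; because the log prints rounded scores, ties here may not match the unrounded ranking spaCy used when saving `model-best`.

```python
# Rows copied from the training log: E, #, LOSS TEXTCAT, CATS_SCORE, SCORE.
LOG = """
  0       0          0.25       38.38    0.38
  0     200          6.86       97.49    0.97
  0     400          2.23       98.93    0.99
  0     600          1.94       98.73    0.99
  1     800          1.52       98.36    0.98
  1    1000          1.16       98.93    0.99
  2    1200          0.68       98.79    0.99
  3    1400          0.32       98.79    0.99
  4    1600          0.22       98.89    0.99
  5    1800          0.09       98.89    0.99
  7    2000          0.02       98.83    0.99
  9    2200          0.06       98.83    0.99
 11    2400          0.04       98.79    0.99
 13    2600          0.03       98.73    0.99
"""


def parse_log(text):
    """Turn the whitespace-separated table into (epoch, step, loss, cats_score) tuples."""
    rows = []
    for line in text.strip().splitlines():
        epoch, step, loss, cats_score, _score = line.split()
        rows.append((int(epoch), int(step), float(loss), float(cats_score)))
    return rows


def best_steps(rows):
    """Return the highest displayed CATS_SCORE and every step that reached it."""
    top = max(score for _, _, _, score in rows)
    return top, [step for _, step, _, score in rows if score == top]


rows = parse_log(LOG)
top, steps = best_steps(rows)
print(top, steps)  # 98.93 [400, 1000]
```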


## Pipeline Evaluation

In [12]:

!python -m spacy evaluate resources/model/model-best/ resources/test.spacy

ℹ Using CPU

TOK                 100.00
TEXTCAT (macro F)   99.15
SPEED               110252

          P       R       F
POS   98.59   99.70   99.14
NEG   99.70   98.61   99.15

      ROC AUC
POS      1.00
NEG      1.00
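
The F column is the harmonic mean of precision and recall, and the macro F is the unweighted mean of the per-label F scores. As a sanity check, the sketch below recomputes both from the rounded P and R values printed above; since the inputs are rounded, agreement is only expected to two decimals.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per label, copied from the evaluation output.
per_label = {"POS": (98.59, 99.70), "NEG": (99.70, 98.61)}

scores = {label: f1(p, r) for label, (p, r) in per_label.items()}
macro_f = sum(scores.values()) / len(scores)

for label, value in scores.items():
    print(label, round(value, 2))
print("macro F", round(macro_f, 2))
```

The recomputed values round to 99.14, 99.15, and 99.15, matching the report.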



## Pipeline Test

In [13]:
texts = [":)", "Estou muito triste hoje", "Estou muito feliz hoje"]

nlp = spacy.load("resources/model/model-best")

for text in texts:
    doc = nlp(preprocessing(text))
    print(doc.cats,  "-",  text)

{'POS': 0.9750515222549438, 'NEG': 0.024948548525571823} - :)
{'POS': 0.4973682463169098, 'NEG': 0.5026317238807678} - Estou muito triste hoje
{'POS': 0.8404939770698547, 'NEG': 0.15950603783130646} - Estou muito feliz hoje
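
`doc.cats` maps each label to a score, and with exclusive classes the two scores sum to roughly 1. To turn the scores into a decision, one option is to take the highest-scoring label and flag near-ties such as the second example. The sketch below does that on the outputs printed above; `predict_label` and the `min_margin` value of 0.2 are arbitrary choices for illustration, not part of spaCy.

```python
def predict_label(cats, min_margin=0.2):
    """Return the top-scoring label, or "UNSURE" when the top two scores are close."""
    ranked = sorted(cats.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_score), (_, runner_up) = ranked[0], ranked[1]
    if best_score - runner_up < min_margin:
        return "UNSURE"
    return best_label


# Scores copied from the notebook output above.
outputs = [
    ({"POS": 0.9750515222549438, "NEG": 0.024948548525571823}, ":)"),
    ({"POS": 0.4973682463169098, "NEG": 0.5026317238807678}, "Estou muito triste hoje"),
    ({"POS": 0.8404939770698547, "NEG": 0.15950603783130646}, "Estou muito feliz hoje"),
]

labels = [predict_label(cats) for cats, _ in outputs]
for (cats, text), label in zip(outputs, labels):
    print(label, "-", text)
```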


## Get config from Language object

In [24]:
print(nlp.config.to_str())

[paths]
train = "resources/train.spacy"
dev = "resources/dev.spacy"
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "pt"
pipeline = ["textcat"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.textcat]
factory = "textcat"
scorer = {"@scorers":"spacy.textcat_scorer.v2"}
threshold = 0.0

[components.textcat.model]
@architectures = "spacy.TextCatBOW.v3"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
length = 262144
nO = null

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
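
Since `nlp.config` behaves like a nested dictionary, individual settings can be read without printing the whole file. The sketch below assumes spaCy 3.x and uses a blank Portuguese pipeline with a default `textcat` component as a stand-in for the trained model, so the architecture it prints may differ from the `spacy.TextCatBOW.v3` shown above.

```python
import spacy

# Stand-in pipeline; with the trained model you would use
# spacy.load("resources/model/model-best") instead.
nlp = spacy.blank("pt")
nlp.add_pipe("textcat")

config = nlp.config

# Top-level pipeline settings.
print(config["nlp"]["lang"])
print(config["nlp"]["pipeline"])

# Settings of a single component are nested under [components.<name>].
print(config["components"]["textcat"]["model"]["@architectures"])
```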
