**Spam Classification using Flair**

https://heartbeat.fritz.ai/using-transfer-learning-and-pre-trained-language-models-to-classify-spam-549fc0f56c20


In [None]:
pip install flair

In [3]:
# import libraries
import pandas as pd
from flair.datasets import ClassificationCorpus
from flair.data import Corpus
from flair.embeddings import WordEmbeddings, DocumentLSTMEmbeddings, FlairEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
from pathlib import Path

**Loading and Preprocessing the Data**

This notebook uses the [SMS Spam Collection](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/), a public dataset of labeled SMS messages collected for mobile phone spam research. The data is read with pandas and lightly preprocessed: duplicates are removed, each label is prefixed with `__label__`, and the dataset is split into train, dev, and test sets in an 80/10/10 ratio.

Flair’s classification datasets must follow Facebook’s FastText format, in which every line begins with its label, written with the prefix `__label__`, followed by the text.
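To make the expected layout concrete, here is a minimal sketch of how one labeled message maps to a FastText-style line. The helper name `to_fasttext_line` and the two sample messages are hypothetical and serve only as illustration; the tab separator mirrors the `sep='\t'` used when the files are written in this notebook.

```python
# Hypothetical helper: not part of Flair, it only illustrates the line layout.
LABEL_PREFIX = "__label__"

def to_fasttext_line(label: str, text: str, sep: str = "\t") -> str:
    """Return one FastText-style line: '__label__<label><sep><text>'."""
    # Collapse internal whitespace and newlines so each example stays on one line.
    clean_text = " ".join(text.split())
    return f"{LABEL_PREFIX}{label}{sep}{clean_text}"

# Made-up sample messages, for illustration only.
samples = [("ham", "See you at lunch?"), ("spam", "WINNER!! Claim your\nprize now")]
lines = [to_fasttext_line(label, text) for label, text in samples]
for line in lines:
    print(line)
```

Each printed line starts with the prefixed label, which is the part of the line that gets read as the class.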

In [None]:
# Read the tab-separated file: column 0 is the label, column 1 is the message text
data = pd.read_csv('SMSSpamCollection.txt', delimiter='\t', header=None)
data = data.rename(columns={0: "label", 1: "text"}).drop_duplicates()
# Flair expects FastText-style labels, e.g. __label__ham
data['label'] = '__label__' + data['label'].astype(str)

# Split the data in file order: 80% train, 10% test, 10% dev
train_end = int(len(data) * 0.8)
test_end = int(len(data) * 0.9)
data.iloc[:train_end].to_csv('train.csv', sep='\t', index=False, header=False)         # 0%-80%   ==> train data
data.iloc[train_end:test_end].to_csv('test.csv', sep='\t', index=False, header=False)  # 80%-90%  ==> test data
data.iloc[test_end:].to_csv('dev.csv', sep='\t', index=False, header=False)            # 90%-100% ==> dev data
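The slicing above depends on two cut points, at 80% and 90% of the row count. The following is a minimal sketch of that arithmetic, using a hypothetical helper `split_bounds` and an example row count of 1,000 chosen only for illustration (it is not the size of the dataset).

```python
def split_bounds(n_rows: int, first_cut: float = 0.8, second_cut: float = 0.9):
    """Return the two integer cut points used to slice rows into three parts."""
    return int(n_rows * first_cut), int(n_rows * second_cut)

n_rows = 1000  # example row count, chosen only for illustration
train_end, test_end = split_bounds(n_rows)

sizes = {
    "train": train_end,            # rows [0, train_end)
    "test": test_end - train_end,  # rows [train_end, test_end)
    "dev": n_rows - test_end,      # rows [test_end, n_rows)
}
print(sizes)
```

Because the rows are sliced in file order, the split assumes the file has no meaningful ordering. Shuffling first, for example with `data.sample(frac=1, random_state=0)`, would be a reasonable precaution.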


**Training the Model**

In [None]:
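# Load the FastText-formatted train/dev/test files from the current directory
corpus: Corpus = ClassificationCorpus(Path('./'), train_file='train.csv', dev_file='dev.csv', test_file='test.csv')

# Stack GloVe word vectors with forward and backward Flair embeddings
word_embeddings = [WordEmbeddings('glove'),
                   FlairEmbeddings('news-forward-fast'),
                   FlairEmbeddings('news-backward-fast')]

# Combine the word-level embeddings into a single document embedding
document_embeddings = DocumentLSTMEmbeddings(word_embeddings,
                                             hidden_size=512,
                                             reproject_words=True,
                                             reproject_words_dimension=256)

# Single-label classifier over the labels found in the corpus
classifier = TextClassifier(document_embeddings,
                            label_dictionary=corpus.make_label_dictionary(),
                            multi_label=False)

# Train for up to 10 epochs; model files and logs are written to './'
trainer = ModelTrainer(classifier, corpus)
trainer.train('./', max_epochs=10)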


2020-10-10 18:01:24,425 Reading data from .
2020-10-10 18:01:24,435 Train: train.csv
2020-10-10 18:01:24,438 Dev: dev.csv
2020-10-10 18:01:24,440 Test: test.csv




2020-10-10 18:01:26,096 Computing label dictionary. Progress:


100%|██████████| 4652/4652 [00:04<00:00, 1102.94it/s]

2020-10-10 18:01:30,543 [b'ham', b'spam']
2020-10-10 18:01:30,550 ----------------------------------------------------------------------------------------------------
2020-10-10 18:01:30,551 Model: "TextClassifier(
  (document_embeddings): DocumentLSTMEmbeddings(
    (embeddings): StackedEmbeddings(
      (list_embedding_0): WordEmbeddings('glove')
      (list_embedding_1): FlairEmbeddings(
        (lm): LanguageModel(
          (drop): Dropout(p=0.25, inplace=False)
          (encoder): Embedding(275, 100)
          (rnn): LSTM(100, 1024)
          (decoder): Linear(in_features=1024, out_features=275, bias=True)
        )
      )
      (list_embedding_2): FlairEmbeddings(
        (lm): LanguageModel(
          (drop): Dropout(p=0.25, inplace=False)
          (encoder): Embedding(275, 100)
          (rnn): LSTM(100, 1024)
          (decoder): Linear(in_features=1024, out_features=275, bias=True)
        )
      )
    )
    (word_reprojection_map): Linear(in_features=2148, out_features=




2020-10-10 18:03:02,623 epoch 1 - iter 13/130 - loss 0.29798273 - samples/sec: 4.54 - lr: 0.100000
2020-10-10 18:04:09,278 epoch 1 - iter 26/130 - loss 0.26329130 - samples/sec: 6.30 - lr: 0.100000
2020-10-10 18:05:21,888 epoch 1 - iter 39/130 - loss 0.24021390 - samples/sec: 5.73 - lr: 0.100000
2020-10-10 18:06:19,731 epoch 1 - iter 52/130 - loss 0.21684361 - samples/sec: 7.20 - lr: 0.100000
2020-10-10 18:07:34,652 epoch 1 - iter 65/130 - loss 0.19302364 - samples/sec: 5.56 - lr: 0.100000
2020-10-10 18:08:27,825 epoch 1 - iter 78/130 - loss 0.17996172 - samples/sec: 7.83 - lr: 0.100000
2020-10-10 18:09:16,376 epoch 1 - iter 91/130 - loss 0.17118647 - samples/sec: 8.68 - lr: 0.100000
2020-10-10 18:10:11,533 epoch 1 - iter 104/130 - loss 0.17459173 - samples/sec: 7.55 - lr: 0.100000
2020-10-10 18:11:19,247 epoch 1 - iter 117/130 - loss 0.16572302 - samples/sec: 6.15 - lr: 0.100000
2020-10-10 18:12:33,888 epoch 1 - iter 130/130 - loss 0.15746683 - samples/sec: 5.58 - lr: 0.100000
2020-10

{'dev_loss_history': [0.05795241892337799,
  0.0500209741294384,
  0.05635133385658264,
  0.04483029246330261,
  0.04626813158392906,
  0.04414363205432892,
  0.05554492771625519,
  0.05202345922589302,
  0.04412562772631645,
  0.04913755878806114],
 'dev_score_history': [0.9845,
  0.9884,
  0.9865,
  0.9923,
  0.9923,
  0.9884,
  0.9865,
  0.9865,
  0.9865,
  0.9845],
 'test_score': 0.9884,
 'train_loss_history': [0.15746683494832653,
  0.08176347882701801,
  0.0635400113918317,
  0.057816209018122976,
  0.060737541721811374,
  0.044873173844266256,
  0.045980322169802655,
  0.04650863060602345,
  0.030155148194171488,
  0.03120017525372812]}


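During training, Flair saves 'best-model.pt' whenever the dev score improves, so this file holds the checkpoint that performed best on the dev set.
After the final epoch, it saves 'final-model.pt', the model as it stands at the end of training. Both files are written to the path passed to trainer.train, which here is the current working directory.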
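
The history dictionary printed above can be used to check which epoch did best on the dev set. A minimal sketch, with the values copied from that output and a hypothetical helper name `best_epoch`:

```python
# Values copied from the training output shown above.
history = {
    "dev_score_history": [0.9845, 0.9884, 0.9865, 0.9923, 0.9923,
                          0.9884, 0.9865, 0.9865, 0.9865, 0.9845],
    "dev_loss_history": [0.05795241892337799, 0.0500209741294384,
                         0.05635133385658264, 0.04483029246330261,
                         0.04626813158392906, 0.04414363205432892,
                         0.05554492771625519, 0.05202345922589302,
                         0.04412562772631645, 0.04913755878806114],
}

def best_epoch(values, higher_is_better=True):
    """Return the 1-based epoch of the best value (the first one on ties)."""
    pick = max if higher_is_better else min
    return values.index(pick(values)) + 1

best_score_epoch = best_epoch(history["dev_score_history"])
best_loss_epoch = best_epoch(history["dev_loss_history"], higher_is_better=False)
print(best_score_epoch, best_loss_epoch)
```

In this run the highest dev score (0.9923) first appears at epoch 4, while the lowest dev loss falls at epoch 9, so the two criteria need not agree.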