# FLAIR
***STATE-OF-THE-ART NATURAL LANGUAGE PROCESSING***

### The Flair Text Classifier

We'll train the Flair Text Classifier to predict the sentiment of tweets about global warming. Each tweet is labelled with one of 4 sentiment classes: Pro (1), Neutral (0), Anti (-1) and News (2).

NOTE: Training this model takes about 6 hours, the saved model file is larger than 600MB, and the pip installs and embedding downloads are large. I suggest running this notebook in Google Colab to use its GPU and avoid running into errors on your local machine.

To use the pickled, trained model:

 0. Open the notebook in Google Colab
 1. Install the dependencies
 2. Run the imports
 3. Skip over the training sections
 4. Download and load the model from Google Drive
 5. Make predictions

#### Install

In [None]:
!pip install flair 
!pip install torch
!pip install allennlp==0.9.0

#### Import

In [None]:
import pandas as pd
import numpy as np
import flair
import torch
import allennlp
from tqdm import tqdm 

from flair.data import Corpus
from flair.datasets import ClassificationCorpus

from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings
from flair.embeddings import FlairEmbeddings, ELMoEmbeddings

from flair.trainers import ModelTrainer
from flair.models import TextClassifier
from flair.data import Sentence

from sklearn.metrics import f1_score
from sklearn import metrics
import time

import pickle
import gdown

### Training

#### The Data

In [13]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test_with_no_labels.csv")

In [14]:
pd.set_option('display.max_colwidth', None)
display(train.head())
display(test.head())

Unnamed: 0,sentiment,message,tweetid
0,1,"PolySciMajor EPA chief doesn't think carbon dioxide is main cause of global warming and.. wait, what!? https://t.co/yeLvcEFXkC via @mashable",625221
1,1,It's not like we lack evidence of anthropogenic global warming,126103
2,2,RT @RawStory: Researchers say we have three years to act on climate change before it’s too late https://t.co/WdT0KdUr2f https://t.co/Z0ANPT…,698562
3,1,#TodayinMaker# WIRED : 2016 was a pivotal year in the war on climate change https://t.co/44wOTxTLcD,573736
4,1,"RT @SoyNovioDeTodas: It's 2016, and a racist, sexist, climate change denying bigot is leading in the polls. #ElectionNight",466954


Unnamed: 0,message,tweetid
0,Europe will now be looking to China to make sure that it is not alone in fighting climate change… https://t.co/O7T8rCgwDq,169760
1,Combine this with the polling of staffers re climate change and womens' rights and you have a fascist state. https://t.co/ifrm7eexpj,35326
2,"The scary, unimpeachable evidence that climate change is already here: https://t.co/yAedqcV9Ki #itstimetochange #climatechange @ZEROCO2_;..",224985
3,@Karoli @morgfair @OsborneInk @dailykos \nPutin got to you too Jill ! \nTrump doesn't believe in climate change at all \nThinks it's s hoax,476263
4,RT @FakeWillMoore: 'Female orgasms cause global warming!'\n-Sarcastic Republican,872928


In [15]:
print(train.shape)

(15819, 3)


### The Process

-- No preprocessing required 

Flair is a very powerful library developed by Zalando Research. The project's collaborators advise against preprocessing the text yourself: any preprocessing that is needed is handled inside the respective embeddings class.

Step 1: Prepare the dataset (in either CSV or fastText format)

Step 2: Split the dataset into 3 (train, test, dev)

Step 3: Create Corpus and Label Dictionary

Step 4: Add Word Embeddings

Step 5: Instantiate Model and Train using the data

Step 6: Use Model to Make Prediction

#### Prep Dataset: fastText format

In [9]:
sentiment = train['sentiment']
word_sentiment = []

for i in sentiment :
    if i == 1 :
        word_sentiment.append('Pro')
    elif i == 0 :
        word_sentiment.append('Neutral')
    elif i == -1 :
        word_sentiment.append('Anti')
    else :
        word_sentiment.append('News')

train['sentiment'] = word_sentiment
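
The if/elif chain above can also be written as a dictionary lookup, which keeps the label names in one place and gives the reverse mapping (names back to codes) needed later for the predictions. A minimal sketch, assuming the four codes seen in this dataset; `LABEL_NAMES`, `LABEL_CODES` and `to_label_name` are illustrative names, not part of the original notebook.

```python
# Illustrative names, not from the original notebook.
LABEL_NAMES = {1: "Pro", 0: "Neutral", -1: "Anti", 2: "News"}

# Reverse mapping, useful when converting predicted names back to codes.
LABEL_CODES = {name: code for code, name in LABEL_NAMES.items()}


def to_label_name(code):
    """Return the class name for a numeric sentiment code.

    Unknown codes fall back to 'News', mirroring the final else branch above.
    """
    return LABEL_NAMES.get(code, "News")


codes = [1, 1, 2, 0, -1]
names = [to_label_name(c) for c in codes]
print(names)

# With pandas, the same lookup would be: train['sentiment'].map(LABEL_NAMES)
```

Unlike the loop, `Series.map` would turn unknown codes into NaN rather than 'News', so the two are only equivalent when every code is one of the four above.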

In [10]:
train1 = train[['message', 'sentiment']]

In [11]:
df_fst = train1.copy()
display(df_fst.head(10))

Unnamed: 0,message,sentiment
0,PolySciMajor EPA chief doesn't think carbon di...,Pro
1,It's not like we lack evidence of anthropogeni...,Pro
2,RT @RawStory: Researchers say we have three ye...,News
3,#TodayinMaker# WIRED : 2016 was a pivotal year...,Pro
4,"RT @SoyNovioDeTodas: It's 2016, and a racist, ...",Pro
5,Worth a read whether you do or don't believe i...,Pro
6,RT @thenation: Mike Pence doesn’t believe in g...,Pro
7,RT @makeandmendlife: Six big things we can ALL...,Pro
8,@AceofSpadesHQ My 8yo nephew is inconsolable. ...,Pro
9,RT @paigetweedy: no offense… but like… how do ...,Pro


In [12]:
'__label__' + df_fst['sentiment'].astype(str)

0            __label__Pro
1            __label__Pro
2           __label__News
3            __label__Pro
4            __label__Pro
               ...       
15814        __label__Pro
15815       __label__News
15816    __label__Neutral
15817       __label__Anti
15818    __label__Neutral
Name: sentiment, Length: 15819, dtype: object

In [13]:
df_fst['labels'] = '__label__' + df_fst['sentiment'].astype(str)
df_fst = df_fst[['labels','message']]

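The format Flair's ClassificationCorpus expects is the fastText one: each row starts with a `__label__<class>` tag, followed by the text.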

In [14]:
df_fst.head(10)

Unnamed: 0,labels,message
0,__label__Pro,PolySciMajor EPA chief doesn't think carbon di...
1,__label__Pro,It's not like we lack evidence of anthropogeni...
2,__label__News,RT @RawStory: Researchers say we have three ye...
3,__label__Pro,#TodayinMaker# WIRED : 2016 was a pivotal year...
4,__label__Pro,"RT @SoyNovioDeTodas: It's 2016, and a racist, ..."
5,__label__Pro,Worth a read whether you do or don't believe i...
6,__label__Pro,RT @thenation: Mike Pence doesn’t believe in g...
7,__label__Pro,RT @makeandmendlife: Six big things we can ALL...
8,__label__Pro,@AceofSpadesHQ My 8yo nephew is inconsolable. ...
9,__label__Pro,RT @paigetweedy: no offense… but like… how do ...
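
Each row written to disk should end up as a single line: the `__label__` tag, a separator, then the text. Tweets that contain line breaks could spill over several lines, so one option is to collapse whitespace before saving. A rough sketch of that idea; `to_fasttext_line` and `parse_fasttext_line` are hypothetical helpers, and the whitespace clean-up is an assumption on my part rather than something this notebook does.

```python
LABEL_PREFIX = "__label__"


def to_fasttext_line(label, text):
    """Build one fastText-style line: '__label__<class><TAB><text>'."""
    # Collapse tabs, newlines and repeated spaces so the row stays on one line.
    clean_text = " ".join(text.split())
    return f"{LABEL_PREFIX}{label}\t{clean_text}"


def parse_fasttext_line(line):
    """Split a fastText-style line back into (label, text)."""
    tag, text = line.split(maxsplit=1)
    return tag[len(LABEL_PREFIX):], text


# Made-up example text containing a line break.
line = to_fasttext_line("Pro", "Climate change\nis real")
print(repr(line))
print(parse_fasttext_line(line))
```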


#### Split the dataset into 3 (train,test,dev)

In [15]:
train_fst, test_fst, dev_fst = np.split(df_fst, [int(.9*len(df_fst)), int(.95*len(df_fst))])

In [16]:
print('Original: ', df_fst.shape)
print('Train: ', train_fst.shape)
print('Test: ', test_fst.shape)
print('Dev: ', dev_fst.shape)

Original:  (15819, 2)
Train:  (14237, 2)
Test:  (791, 2)
Dev:  (791, 2)
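
The sizes printed above follow directly from the two cut points passed to `np.split`. The sketch below reproduces that arithmetic in plain Python and, as an optional variation, shuffles the row positions with a fixed seed before slicing. The notebook itself slices the rows in file order; the shuffle, the seed value 42 and the helper name `split_points` are assumptions added for illustration.

```python
import random


def split_points(n_rows, cuts=(0.9, 0.95)):
    """Return the cut indices used to slice rows into train/test/dev."""
    return [int(c * n_rows) for c in cuts]


n_rows = 15819
first_cut, second_cut = split_points(n_rows)
sizes = (first_cut, second_cut - first_cut, n_rows - second_cut)
print(sizes)  # train, test, dev

# Optional variation (not in the notebook): shuffle row positions first.
rng = random.Random(42)          # arbitrary seed, for reproducibility only
order = list(range(n_rows))
rng.shuffle(order)
train_idx = order[:first_cut]
test_idx = order[first_cut:second_cut]
dev_idx = order[second_cut:]
```

Whether shuffling helps depends on how the rows in train.csv are ordered, which this notebook does not examine.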


In [17]:
!mkdir -p data_faster  # Create a folder

In [18]:
train_fst.to_csv("data_faster/train.csv",sep='\t',index=False,header=False)  # Save to the folder
test_fst.to_csv("data_faster/test.csv",sep='\t',index=False,header=False)
dev_fst.to_csv("data_faster/dev.csv",sep='\t',index=False,header=False)

In [19]:
!ls data_faster   # Check folder contents

dev.csv  test.csv  train.csv


#### Create Corpus and Label Dictionary

In [18]:
data_folder_fast = "data_faster"   # Path to the data

In [21]:
corpus_fst: Corpus = ClassificationCorpus(data_folder_fast)

2021-06-12 18:53:08,721 Reading data from data_faster
2021-06-12 18:53:08,722 Train: data_faster/train.csv
2021-06-12 18:53:08,724 Dev: data_faster/dev.csv
2021-06-12 18:53:08,725 Test: data_faster/test.csv


In [22]:
label_dictionary=corpus_fst.make_label_dictionary()

2021-06-12 18:53:09,355 Computing label dictionary. Progress:


100%|██████████| 15028/15028 [00:13<00:00, 1127.40it/s]

2021-06-12 18:53:22,852 [b'Pro', b'News', b'Neutral', b'Anti']
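
Conceptually, a label dictionary is a two-way lookup between label strings and integer ids, one id per output unit of the classifier. The toy class below illustrates the idea only; `LabelIndex` is a hypothetical stand-in and not Flair's own dictionary class.

```python
class LabelIndex:
    """Toy label dictionary: maps label strings to integer ids and back."""

    def __init__(self):
        self._item_to_id = {}
        self._id_to_item = []

    def add(self, label):
        """Register a label (if new) and return its id."""
        if label not in self._item_to_id:
            self._item_to_id[label] = len(self._id_to_item)
            self._id_to_item.append(label)
        return self._item_to_id[label]

    def id_of(self, label):
        return self._item_to_id[label]

    def label_of(self, idx):
        return self._id_to_item[idx]

    def __len__(self):
        return len(self._id_to_item)


label_index = LabelIndex()
for label in ["Pro", "Pro", "News", "Pro", "Neutral", "Anti"]:
    label_index.add(label)

print(len(label_index), label_index.id_of("Neutral"), label_index.label_of(3))
```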





#### Add Word Embeddings

In [None]:
# Initialising embeddings 
glove_embedding = WordEmbeddings('glove')
flair_forward  = FlairEmbeddings('news-forward')
elmo = ELMoEmbeddings() 

In [24]:
word_embeddings = [               # Reuse the embeddings initialised above
                    glove_embedding,
                    flair_forward,
                    elmo
                  ]

document_embeddings = DocumentRNNEmbeddings(word_embeddings, hidden_size=512, 
                                            reproject_words=True, reproject_words_dimension=256)

100%|██████████| 336/336 [00:00<00:00, 1232971.25B/s]
100%|██████████| 374434792/374434792 [00:14<00:00, 25600722.91B/s]
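
Stacking embeddings concatenates their vectors for each token, so the input width of the reprojection layer is the sum of the individual widths. A back-of-the-envelope check, assuming 100 dimensions for GloVe, 2048 for the forward Flair language model (the LSTM hidden size in the model printout further down) and 3072 for ELMo's three concatenated 1024-dimensional layers. The GloVe and ELMo widths are my assumptions rather than values printed by this notebook, but the total matches the `in_features=5220` reported for `word_reprojection_map` in the training output.

```python
# Assumed per-token widths of each embedding in the stack.
EMBEDDING_WIDTHS = {
    "glove": 100,
    "flair news-forward": 2048,
    "elmo original (3 layers x 1024)": 3 * 1024,
}

stacked_width = sum(EMBEDDING_WIDTHS.values())

REPROJECTED_WIDTH = 256   # reproject_words_dimension passed above
RNN_HIDDEN_SIZE = 512     # hidden_size passed above

print("stacked token vector:", stacked_width)
print("after reprojection:  ", REPROJECTED_WIDTH)
print("document vector:     ", RNN_HIDDEN_SIZE)
```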


#### Instantiate Model and Train using the data

In [25]:
classifier = TextClassifier(document_embeddings, label_dictionary=label_dictionary)

trainer = ModelTrainer(classifier, corpus_fst)

trainer.train('data_faster',       # Folder the model and training log are saved in
              learning_rate=0.1,   # Initial learning rate
              mini_batch_size=32,
              anneal_factor=0.5,   # Halve the learning rate when training plateaus
              patience=4,          # Epochs without improvement before annealing
              train_with_dev=True, # Train on train + dev; only the test split is held out
              max_epochs=150,      # 150 epochs here took 6hr33min to train
              embeddings_storage_mode='gpu')

2021-06-12 18:54:23,210 ----------------------------------------------------------------------------------------------------
2021-06-12 18:54:23,212 Model: "TextClassifier(
  (document_embeddings): DocumentRNNEmbeddings(
    (embeddings): StackedEmbeddings(
      (list_embedding_0): WordEmbeddings('glove')
      (list_embedding_1): FlairEmbeddings(
        (lm): LanguageModel(
          (drop): Dropout(p=0.05, inplace=False)
          (encoder): Embedding(300, 100)
          (rnn): LSTM(100, 2048)
          (decoder): Linear(in_features=2048, out_features=300, bias=True)
        )
      )
      (list_embedding_2): ELMoEmbeddings(model=2-elmo-original-all)
    )
    (word_reprojection_map): Linear(in_features=5220, out_features=256, bias=True)
    (rnn): GRU(256, 512, batch_first=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Linear(in_features=512, out_features=4, bias=True)
  (loss_function): CrossEntropyLoss()
  (beta): 1.0
  (weights): None
  (weight_tensor) Non

AttributeError: 'dict' object has no attribute 'cuda'
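
The `learning_rate`, `anneal_factor` and `patience` arguments together describe an anneal-on-plateau schedule: the learning rate stays fixed while the monitored loss keeps improving and is multiplied by the anneal factor once it has stalled for too long. The sketch below follows the rule used by PyTorch's `ReduceLROnPlateau` (reduce once the number of epochs without improvement exceeds `patience`). It is an illustration only; Flair's own scheduler may differ in details such as which score it monitors, and the loss values are made up.

```python
def anneal_on_plateau(losses, lr=0.1, factor=0.5, patience=4):
    """Return the learning rate in effect after each epoch's loss is seen."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        # Reduce only once the stall is longer than `patience` epochs.
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# Two improving epochs followed by five epochs without improvement.
made_up_losses = [1.0, 0.9, 0.95, 0.95, 0.95, 0.95, 0.95]
lr_history = anneal_on_plateau(made_up_losses)
print(lr_history)
```

With `patience=4`, the rate is halved on the fifth consecutive epoch without improvement.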

##### Done Training!!

##### Results
Copied from the training log. The model was trained on 15028 samples (train + dev) and evaluated on a test sample of 791.

EPOCH 150 done: loss 0.0339 

Results:

F-score (micro):     0.7585

F-score (macro):     0.6704

Accuracy :           0.7585

By class:

                  precision    recall  f1-score   support
             Pro     0.8104    0.8291    0.8196       433
            News     0.8112    0.8641    0.8368       184
         Neutral     0.4646    0.4694    0.4670        98
            Anti     0.6792    0.4737    0.5581        76

       micro avg     0.7585    0.7585    0.7585       791
       macro avg     0.6914    0.6591    0.6704       791
    weighted avg     0.7551    0.7585    0.7548       791
     samples avg     0.7585    0.7585    0.7585       791
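
The macro and weighted averages in the report can be recomputed from the per-class rows, which makes the gap between them easier to read: the macro average treats all four classes equally, so the weaker Neutral and Anti scores pull it down, while the weighted average is dominated by the large Pro class. A small check using the f1-score and support values copied from the report:

```python
# (f1-score, support) per class, copied from the report above.
per_class = {
    "Pro": (0.8196, 433),
    "News": (0.8368, 184),
    "Neutral": (0.4670, 98),
    "Anti": (0.5581, 76),
}

total_support = sum(support for _, support in per_class.values())

# Macro: plain mean of the per-class F1 scores.
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)

# Weighted: mean of the per-class F1 scores, weighted by class support.
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total_support

print(total_support, round(macro_f1, 4), round(weighted_f1, 4))
```

This reproduces the 0.6704 and 0.7548 shown in the report.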


### Load and Pickle the Model After Training

In [None]:
flair_classifier = TextClassifier.load('final-model 3.pt')

In [None]:
model_save_path = "flair_classifier.pkl"
with open(model_save_path,'wb') as file:
    pickle.dump(flair_classifier,file)

I uploaded the pickled model to my Google Drive and created a shared link to use later on.

### Download and Load The Pickled Model From Shared Google Drive Link

1. Follow this link: https://drive.google.com/file/d/1UMOBrqhEfrPNGZxS5nNybgwaCf0neVxT/view?usp=sharing
   and click Download. You will be directed to the actual download page.
2. On the actual download page, right-click and copy the link address.
3. Paste the link into the variable download_url below and continue.

In [None]:
# Download the model
# Replace download_url with the link you copied
download_url = 'https://drive.google.com/u/0/uc?export=download&confirm=FDLh&id=1UMOBrqhEfrPNGZxS5nNybgwaCf0neVxT'

output = 'flair_classifier.pkl'
gdown.download(download_url, output, quiet=False)
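
As a possible alternative to copying the link by hand, the download URL can be derived from the share link, since the file id is embedded in it. A sketch under assumptions: `drive_file_id` and `drive_download_url` are hypothetical helpers, and the `uc?id=` form is the one shown in gdown's README. Whether it works for a file this large may depend on the installed gdown version and on Google's download confirmation page.

```python
import re


def drive_file_id(share_link):
    """Extract the file id from a '.../file/d/<id>/view' Google Drive link."""
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", share_link)
    if match is None:
        raise ValueError("No Google Drive file id found in: " + share_link)
    return match.group(1)


def drive_download_url(share_link):
    """Build a direct-download style URL from a share link."""
    return "https://drive.google.com/uc?id=" + drive_file_id(share_link)


share_link = "https://drive.google.com/file/d/1UMOBrqhEfrPNGZxS5nNybgwaCf0neVxT/view?usp=sharing"
derived_url = drive_download_url(share_link)
print(derived_url)

# The result could then be passed to gdown in place of the pasted link:
# gdown.download(derived_url, 'flair_classifier.pkl', quiet=False)
```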

In [None]:
# Unpickle

model_load_path = "flair_classifier.pkl"
with open(model_load_path,'rb') as file:
    flair_classifier = pickle.load(file)  # load the model
    

### Now Test the Model by Recreating the Data Split it was Trained on

First, let's quickly test the model by making a prediction for the first message in our train dataframe. Its sentiment is 1 (Pro).

In [None]:
# Adjacent string literals are joined, so no stray indentation ends up inside the message
message = ("PolySciMajor EPA chief doesn't think carbon dioxide is main "
           "cause of global warming and.. wait, what!? https://t.co/yeLvcEFXkC via @mashable")

sentence = Sentence(message)                  # Create Sentence Object
flair_classifier.predict(sentence)            # Predictions are tagged to the sentence object

In [None]:
print(str(sentence.labels[0]).split(' ')[0])    # Extract the predicted label from the sentence

Let's continue.

In [None]:
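# Recreating Split
# Note: 'sentiment' must still hold the original numeric labels here,
# so reload train.csv first if you ran the training sections above
train1 = train.copy()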
training_data1, testing_data, training_data2 = np.split(train1, [int(.9*len(train1)), int(.95*len(train1))])
training_data = pd.concat([training_data1, training_data2], axis=0)
print(training_data.shape)
print(testing_data.shape)

In [None]:
start_time = time.time() 
pred = []

for message in tqdm(testing_data['message']):    # Flair TextClassifier Predicts on uncleaned message
    sentence= Sentence(message)                   # Create Sentence Object
    flair_classifier.predict(sentence)                # Predictions are tagged to the sentence object
    pred.append(str(sentence.labels[0]).split(' ')[0])    # Extract prediction from the sentence

k = 0
for i in tqdm(pred):        # Remapping predictions of TextClassifier to their original value
    if i == 'Pro' :
        pred[k] = 1
    elif i == 'Neutral' :
        pred[k] = 0
    elif i == 'Anti' :
        pred[k] = -1
    else :
        pred[k] = 2
    k = k + 1
    
run_time = time.time() - start_time

y_pred = pred
y_test = testing_data['sentiment'] 

model_summary = {} 
model_summary['Flair_TextClassifier'] = {
          'F1-Macro':metrics.f1_score(y_test, y_pred, average='macro'),
          'F1-Accuracy':metrics.f1_score(y_test, y_pred, average='micro'),
          'F1-Weighted':metrics.f1_score(y_test, y_pred, average='weighted'),
          'Execution Time': run_time }
          
flair_text_clf_performance = pd.DataFrame.from_dict(model_summary, orient='index')
flair_text_clf_performance.to_csv('flair_text_clf_performance.csv')

display(flair_text_clf_performance)

print('F1-Macro: ', metrics.f1_score(y_test, y_pred, average='macro'))
print('F1-Accuracy: ', metrics.f1_score(y_test, y_pred, average='micro'))
print('F1-Weighted: ', metrics.f1_score(y_test, y_pred, average='weighted'))
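
The 'F1-Accuracy' entry above is the micro-averaged F1. When every tweet gets exactly one predicted label, each wrong prediction counts once as a false positive and once as a false negative, so micro F1 works out to plain accuracy. The sketch below computes both averages by hand on a small made-up example to show this; `f1_averages` is an illustrative helper, not a replacement for `sklearn.metrics.f1_score`.

```python
def f1_averages(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    tp_total = fp_total = fn_total = 0
    per_class = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_total, fp_total, fn_total = tp_total + tp, fp_total + fp, fn_total + fn
    micro_denom = 2 * tp_total + fp_total + fn_total
    micro = 2 * tp_total / micro_denom if micro_denom else 0.0
    macro = sum(per_class) / len(per_class) if per_class else 0.0
    return micro, macro


# Made-up labels using the same codes as the notebook (1, 0, -1, 2).
example_true = [1, 1, 0, -1, 2, 2]
example_pred = [1, 0, 0, -1, 2, 1]

micro_f1, macro_f1_example = f1_averages(example_true, example_pred)
accuracy = sum(t == p for t, p in zip(example_true, example_pred)) / len(example_true)
print(micro_f1, accuracy, macro_f1_example)
```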

### Predict for Submission to Kaggle

In [28]:
# Predicting and Saving to list, takes about 15min to run

pred = []

for message in tqdm(test['message']):
    sentence= Sentence(message)                   # Create Sentence Object
    flair_classifier.predict(sentence)                # Predictions are tagged to the sentence object
    pred.append(str(sentence.labels[0]).split(' ')[0])    # Extract prediction from the sentence

100%|██████████| 10546/10546 [16:44<00:00, 10.50it/s]


In [29]:
# Remapping predictions to their original value

k = 0
for i in tqdm(pred):
    if i == 'Pro' :
        pred[k] = 1
    elif i == 'Neutral' :
        pred[k] = 0
    elif i == 'Anti' :
        pred[k] = -1
    else :
        pred[k] = 2
    k = k + 1

100%|██████████| 10546/10546 [00:00<00:00, 821215.49it/s]


In [30]:
# The Predictions
test_df = pd.DataFrame(zip(test['tweetid'],pred), columns= ['tweetid', 'sentiment'])
test_df

Unnamed: 0,tweetid,sentiment
0,169760,1
1,35326,1
2,224985,1
3,476263,1
4,872928,0
...,...,...
10541,895714,1
10542,875167,1
10543,78329,0
10544,867455,0


In [32]:
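# Save to csv (create the folder first in case the training sections were skipped)
import os
os.makedirs('data_faster', exist_ok=True)
test_df.to_csv('data_faster/FlairNLP3.csv',index=False)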

#### References
- https://blog.jcharistech.com/2020/10/04/text-classification-with-flair-pytorch-nlp-framework/
- https://www.geeksforgeeks.org/flair-a-framework-for-nlp/
- https://www.analyticsvidhya.com/blog/2019/02/flair-nlp-library-python/

##### The flair repo
- https://github.com/flairNLP/flair
