<a href="https://colab.research.google.com/github/ITU-Business-Analytics-Team/Business_Analytics_for_Professionals/blob/main/Part%20I%20%3A%20Methods%20%26%20Technologies%20for%20Business%20Analytics/Chapter%207%3A%20Text%20Analytics/7_6_3_Deep_Learning_Based_Sentiment_Analysis_Flair.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sentiment Analysis (Opinion Mining)**
## Deep Learning Based Sentiment Analysis

### FLAIR

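Flair is an open-source NLP framework developed by Zalando Research together with Humboldt University of Berlin. It provides deep learning based text classifiers and can use transformer models such as BERT and XLNet as embeddings. We start by installing the necessary libraries.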

In [None]:
!pip install flair
!pip install allennlp==0.9.0

Collecting conllu==1.3.1
  Using cached conllu-1.3.1-py2.py3-none-any.whl (9.3 kB)
Installing collected packages: conllu
  Attempting uninstall: conllu
    Found existing installation: conllu 4.4.1
    Uninstalling conllu-4.4.1:
      Successfully uninstalled conllu-4.4.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
flair 0.9 requires conllu>=4.0, but you have conllu 1.3.1 which is incompatible.
Successfully installed conllu-1.3.1


In [None]:
import pandas as pd
import tqdm
import numpy as np

Flair reads classification data in the FastText format: each line starts with a label of the form `__label__x`, where `x` is the class, followed by the text. The train and test datasets are therefore reformatted accordingly.

In [None]:
url=   'https://docs.google.com/spreadsheets/d/1XXyxrd7r0mx7kyLaYHDVwh6BFJzo8cPD/edit?usp=sharing&ouid=108589602591644119588&rtpof=true&sd=true'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]

df = pd.read_excel(path)
# Strip source prefixes from the headlines.
# Note: str.lstrip(chars) strips a set of characters, not a literal prefix.
df['summary'] = df['summary'].map(lambda x: x.lstrip('News :'))
df['summary'] = df['summary'].map(lambda x: x.lstrip('UPDATE'))
df['summary'] = df['summary'].map(lambda x: x.lstrip('METALS-'))
df.rename(columns={'sentiment':'score', 'summary':'text'}, inplace = True)
# Optional: lowercase the training text, consistent with the uncased model used below
df['text'] = df['text'].str.lower()
df['label'] = '__label__' + df['score'].astype(str)
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]
df = df.drop(columns='score')
df

Unnamed: 0,label,text
0,__label__1,nickel jumps on talks of indonesia export ban
1,__label__1,hanghai copper hits near 2-week high on trade ...
2,__label__0,copper at near 2-week highs on hopes china imp...
3,__label__1,"china's yunnan to help firms stockpile 110,000..."
4,__label__1,rpt-update 1-china turns net aluminium importe...
...,...,...
1115,__label__1,copper rebounds as u.s.-mexico deal calms nerves
1116,__label__1,china demand hopes help aluminium to hold near...
1117,__label__1,"rpt-column-new contracts, new platform as lme ..."
1118,__label__-1,rpt-column- china's aluminium import surge a s...
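
The headline `hanghai copper hits near 2-week high ...` in the table above suggests that the cleaning step removed more than the intended prefix. `str.lstrip(chars)` strips any leading characters that belong to the given set, so `lstrip('METALS-')` also removes the `S` of `Shanghai`. The sketch below shows one possible prefix-safe alternative based on a regular expression. The helper name `strip_prefix` and the sample headlines are illustrative assumptions, not part of the original notebook.

```python
import re

# The same prefix strings that the notebook passes to lstrip
PREFIXES = ("News :", "UPDATE", "METALS-")

# One pattern matching any literal prefix at the start of the string,
# followed by optional whitespace
_PREFIX_RE = re.compile(r"^(?:" + "|".join(re.escape(p) for p in PREFIXES) + r")\s*")

def strip_prefix(text: str) -> str:
    """Remove one leading literal prefix (if present) and keep the rest intact."""
    return _PREFIX_RE.sub("", text, count=1)

headline = "METALS-Shanghai copper hits high"  # hypothetical raw headline
print(headline.lstrip("METALS-"))  # hanghai copper hits high
print(strip_prefix(headline))      # Shanghai copper hits high
```

With pandas, such a helper could be applied as `df['summary'].map(strip_prefix)`. Note that it removes a single prefix per headline, whereas the notebook applies three successive `lstrip` calls.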


In [None]:
url=   'https://docs.google.com/spreadsheets/d/145tqf2J949KGCYnH-Nx3hiaHTogiZFn4/edit?usp=sharing&ouid=108589602591644119588&rtpof=true&sd=true'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
test_df = pd.read_excel(path)
# Apply the same cleaning and label formatting as for the training data
test_df['summary'] = test_df['summary'].map(lambda x: x.lstrip('News :'))
test_df['summary'] = test_df['summary'].map(lambda x: x.lstrip('UPDATE'))
test_df['summary'] = test_df['summary'].map(lambda x: x.lstrip('METALS-'))
test_df.rename(columns={'sentiment':'score', 'summary':'text'}, inplace = True)
test_df['text'] = test_df['text'].str.lower()
test_df['label'] = '__label__' + test_df['score'].astype(str)
cols = test_df.columns.tolist()
cols = cols[-1:] + cols[:-1]
test_df = test_df[cols]
test_df = test_df.drop(columns='score')
test_df

Unnamed: 0,label,text
0,__label__0,copper at near 2-week highs on hopes china imp...
1,__label__1,"china's yunnan to help firms stockpile 110,000..."
2,__label__-1,column-politics trumps aluminium as u.s. reimp...
3,__label__-1,base metals decline on weak china demand outlook
4,__label__-1,"aluminium falls to $1,751.50/t, lowest since..."
...,...,...
163,__label__1,china names former chinalco exec as industry m...
164,__label__0,copper edges off two-year low as washington so...
165,__label__-1,"uncertainty on global growth, trade war weighs..."
166,__label__1,copper gains after fed chief rekindles rate cu...


In [None]:
# Anti-join: drop from the training data every row that also appears in the test set,
# so that no test example leaks into training
df = df.drop_duplicates().merge(test_df.drop_duplicates(), on=test_df.columns.to_list(),
                   how='left', indicator=True)
df = df.loc[df._merge=='left_only',df.columns!='_merge']
df = df.reset_index(drop=True)
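
The cell above removes test rows from the training data with a left merge and the `indicator` flag. The following toy example is a minimal sketch of the same pattern on two small, made-up frames, which makes it easier to see which rows survive. The frame contents are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature train and test sets in the same label/text layout
train = pd.DataFrame({
    "label": ["__label__1", "__label__0", "__label__-1"],
    "text": ["nickel jumps", "copper steady", "base metals decline"],
})
test = pd.DataFrame({"label": ["__label__0"], "text": ["copper steady"]})

# indicator=True adds a '_merge' column telling where each row was found
merged = train.merge(test.drop_duplicates(), on=["label", "text"],
                     how="left", indicator=True)

# Keep only rows that exist in train but not in test
train_only = (merged.loc[merged["_merge"] == "left_only", ["label", "text"]]
                    .reset_index(drop=True))
print(train_only)
```

Only the two headlines that are absent from `test` remain in `train_only`.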

In [None]:
df

Unnamed: 0,label,text
0,__label__1,nickel jumps on talks of indonesia export ban
1,__label__1,hanghai copper hits near 2-week high on trade ...
2,__label__1,rpt-update 1-china turns net aluminium importe...
3,__label__-1,1-china july aluminium output hits record ami...
4,__label__1,copper edges to 11-week high on china recovery
...,...,...
911,__label__1,rpt-column-pain for aluminium shorts as lme ge...
912,__label__1,copper rebounds as u.s.-mexico deal calms nerves
913,__label__1,china demand hopes help aluminium to hold near...
914,__label__1,"rpt-column-new contracts, new platform as lme ..."


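After reformatting and removing the rows that also appear in the test set, the training data is split: the first 90% of the rows is used for training, and the remaining 10% becomes the validation (dev) set used to monitor the performance of the model during training.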

In [None]:
df.iloc[0:int(len(df)*0.9)].to_csv('train.csv', sep='\t', index = False, header = False)
df.iloc[int(len(df)*0.9):].to_csv('dev.csv', sep='\t', index = False, header = False)
test_df.to_csv('test.csv', sep='\t', index = False, header = False)
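
The split above is positional: the first 90% of the rows go to training and the last 10% to validation, in file order. If the rows happen to be ordered (for example by date), shuffling before splitting may give a more representative dev set. The sketch below is one possible variant, not what the notebook does. The function name `split_train_dev`, the seed and the demo frame are assumptions.

```python
import pandas as pd

def split_train_dev(frame: pd.DataFrame, train_frac: float = 0.9, seed: int = 42):
    """Shuffle the rows reproducibly, then cut them into train and dev parts."""
    shuffled = frame.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    cut = int(len(shuffled) * train_frac)
    return shuffled.iloc[:cut], shuffled.iloc[cut:]

# Hypothetical demo data with 10 rows
demo = pd.DataFrame({"label": ["__label__1"] * 10,
                     "text": [f"headline {i}" for i in range(10)]})
train_part, dev_part = split_train_dev(demo)
print(len(train_part), len(dev_part))  # 9 1
```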

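Flair's `ClassificationCorpus` reads plain-text files in FastText format, with one `__label__x text` example per line, so we convert the tab-separated .csv files to .txt.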

In [None]:
import csv

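def csv_to_txt(filename):
    """Convert the tab-separated <filename>.csv into a '<label> <text>' <filename>.txt."""
    csv_file = filename+'.csv'
    txt_file = filename+'.txt'
    # The files were written with sep='\t'; reading them with the default comma
    # delimiter would split headlines at commas
    with open(csv_file, "r", newline='') as my_input_file, open(txt_file, "w") as my_output_file:
        for row in csv.reader(my_input_file, delimiter='\t'):
            my_output_file.write(" ".join(row)+'\n')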

        
files = ['train', 'dev', 'test']
for file in files:
    csv_to_txt(file)
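
Before handing the files to Flair, it can help to check that every line really follows the `__label__x text` layout. The helper below is a small sketch for such a sanity check. The name `parse_fasttext_line` is hypothetical, and it assumes that label and text are separated by whitespace (a space or a tab).

```python
def parse_fasttext_line(line: str, prefix: str = "__label__"):
    """Split one FastText-style line into (class, text); raise if malformed."""
    parts = line.strip().split(None, 1)  # split at the first run of whitespace
    if len(parts) != 2 or not parts[0].startswith(prefix):
        raise ValueError(f"not a FastText-style line: {line!r}")
    return parts[0][len(prefix):], parts[1]

print(parse_fasttext_line("__label__-1\tbase metals decline on weak china demand outlook\n"))
# ('-1', 'base metals decline on weak china demand outlook')
```

Running it over each line of `train.txt`, `dev.txt` and `test.txt` would surface empty or unlabeled lines early.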


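Flair creates a classification corpus from the files prepared in the previous cells. Flair is first reinstalled, because installing `allennlp==0.9.0` downgraded `conllu` to a version that Flair does not support.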

In [None]:
pip install flair

Collecting conllu>=4.0
  Using cached conllu-4.4.1-py2.py3-none-any.whl (15 kB)
Installing collected packages: conllu
  Attempting uninstall: conllu
    Found existing installation: conllu 1.3.1
    Uninstalling conllu-1.3.1:
      Successfully uninstalled conllu-1.3.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
allennlp 0.9.0 requires conllu==1.3.1, but you have conllu 4.4.1 which is incompatible.
Successfully installed conllu-4.4.1


In [None]:
from flair.data_fetcher import NLPTaskDataFetcher
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentLSTMEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
from pathlib import Path
from flair.data import Corpus
from flair.datasets import ClassificationCorpus
from flair.embeddings import TransformerDocumentEmbeddings,TransformerWordEmbeddings
from flair.embeddings import BertEmbeddings, ELMoEmbeddings

# this is the folder in which train, test and dev files reside
data_folder = '/content/'

# init a corpus from FastText-format files, given the data folder and the names of the train, dev and test files
corpus: Corpus = ClassificationCorpus(data_folder,
                              train_file='train.txt',
                              test_file='test.txt',
                              dev_file='dev.txt')
# print the number of Sentences in the train split
print(len(corpus.train))

# print the number of Sentences in the test split
print(len(corpus.test))

# print the number of Sentences in the dev split
print(len(corpus.dev))


2021-11-07 09:46:28,693 Reading data from /content
2021-11-07 09:46:28,695 Train: /content/train.txt
2021-11-07 09:46:28,698 Dev: /content/dev.txt
2021-11-07 09:46:28,701 Test: /content/test.txt
2021-11-07 09:46:28,724 Initialized corpus /content/ (label type name is 'class')
824
168
92


Flair supports several types of embeddings, which is one of its strongest features. Since the transformer architecture has already proved successful on this problem earlier, we continue with it. Alternative embedding options are left commented out in the cell below.

In [None]:
from flair.embeddings import StackedEmbeddings

# init BERT base (uncased)
#optional_embedding = BertEmbeddings('bert-base-uncased')
# OR init ELMo (original)
#optional_embedding = ELMoEmbeddings('original')

#word_embeddings = [
 #   optional_embedding,
 #   FlairEmbeddings('news-forward'),
 #   FlairEmbeddings('news-backward')]


#word_embeddings = [WordEmbeddings('glove')]

#document_embeddings = DocumentRNNEmbeddings(
#        word_embeddings,
#        hidden_size=512,
#        reproject_words=True,
#        reproject_words_dimension=256
#    )

document_embeddings = TransformerDocumentEmbeddings('distilbert-base-uncased',fine_tune=True)



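Now Flair builds the text classifier from the chosen embeddings and the corpus. The classifier is created with `multi_label=True`, so each of the 3 classes is scored independently with a binary loss (the `BCEWithLogitsLoss` in the model summary below). Because every headline here carries exactly one label, the task is strictly multi-class, and `multi_label=False` would be the more conventional setting.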

In [None]:
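# Rebuild the corpus with an explicit label type; no file names are given,
# so Flair auto-detects the train/dev/test files in data_folder
corpus: Corpus = ClassificationCorpus(data_folder,
                                      label_type='topic',
                                      )

# Build the label dictionary from the labels found in the corpus
label_dict = corpus.make_label_dictionary(label_type='topic')

# multi_label=True trains with a per-label binary loss (BCEWithLogitsLoss)
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, label_type='topic', multi_label=True)

# Train for 10 epochs; the model and logs are written to the current directory
trainer = ModelTrainer(classifier, corpus)
trainer.train('./', max_epochs=10,mini_batch_size=32)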

2021-11-07 09:24:53,770 Reading data from /content
2021-11-07 09:24:53,773 Train: /content/train.csv
2021-11-07 09:24:53,775 Dev: /content/dev.txt
2021-11-07 09:24:53,778 Test: /content/test.txt
2021-11-07 09:24:53,821 Initialized corpus /content/ (label type name is 'topic')
2021-11-07 09:24:53,824 Computing label dictionary. Progress:


100%|██████████| 824/824 [00:00<00:00, 995.03it/s] 

2021-11-07 09:24:54,895 Corpus contains the labels: topic (#824)
2021-11-07 09:24:54,899 Created (for label 'topic') Dictionary with 3 tags: 1, -1, 0
2021-11-07 09:24:54,905 ----------------------------------------------------------------------------------------------------
2021-11-07 09:24:54,909 Model: "TextClassifier(
  (loss_function): BCEWithLogitsLoss()
  (document_embeddings): TransformerDocumentEmbeddings(
    (model): DistilBertModel(
      (embeddings): Embeddings(
        (word_embeddings): Embedding(30522, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (transformer): Transformer(
        (layer): ModuleList(
          (0): TransformerBlock(
            (attention): MultiHeadSelfAttention(
              (dropout): Dropout(p=0.1, inplace=False)
              (q_lin): Linear(in_features=768, out_features=768, bias=Tru




2021-11-07 09:25:03,061 epoch 1 - iter 2/26 - loss 0.02481312 - samples/sec: 8.13 - lr: 0.100000
2021-11-07 09:25:10,633 epoch 1 - iter 4/26 - loss 0.02223014 - samples/sec: 8.48 - lr: 0.100000
2021-11-07 09:25:21,659 epoch 1 - iter 6/26 - loss 0.02024161 - samples/sec: 5.81 - lr: 0.100000
2021-11-07 09:25:32,838 epoch 1 - iter 8/26 - loss 0.02060364 - samples/sec: 5.73 - lr: 0.100000
2021-11-07 09:25:42,312 epoch 1 - iter 10/26 - loss 0.01990936 - samples/sec: 6.76 - lr: 0.100000
2021-11-07 09:25:52,842 epoch 1 - iter 12/26 - loss 0.01920096 - samples/sec: 6.08 - lr: 0.100000
2021-11-07 09:26:01,079 epoch 1 - iter 14/26 - loss 0.01889305 - samples/sec: 7.77 - lr: 0.100000
2021-11-07 09:26:10,720 epoch 1 - iter 16/26 - loss 0.01869085 - samples/sec: 6.64 - lr: 0.100000
2021-11-07 09:26:18,001 epoch 1 - iter 18/26 - loss 0.01849717 - samples/sec: 8.81 - lr: 0.100000
2021-11-07 09:26:25,409 epoch 1 - iter 20/26 - loss 0.01843454 - samples/sec: 8.64 - lr: 0.100000
2021-11-07 09:26:33,232 



2021-11-07 09:46:13,805 0.7093	0.7262	0.7176	0.7024
2021-11-07 09:46:13,808 
Results:
- F-score (micro) 0.7176
- F-score (macro) 0.4947
- Accuracy 0.7024

By class:
              precision    recall  f1-score   support

           1     0.7156    0.8571    0.7800        91
          -1     0.7333    0.6769    0.7040        65
           0     0.0000    0.0000    0.0000        12

   micro avg     0.7093    0.7262    0.7176       168
   macro avg     0.4830    0.5114    0.4947       168
weighted avg     0.6713    0.7262    0.6949       168
 samples avg     0.7143    0.7262    0.7183       168

2021-11-07 09:46:13,810 ----------------------------------------------------------------------------------------------------


{'dev_loss_history': [tensor(0.0173),
  tensor(0.0170),
  tensor(0.0162),
  tensor(0.0135),
  tensor(0.0142),
  tensor(0.0140),
  tensor(0.0136),
  tensor(0.0172),
  tensor(0.0173),
  tensor(0.0213)],
 'dev_score_history': [0.574585635359116,
  0.5975609756097561,
  0.5241379310344828,
  0.7344632768361581,
  0.7073170731707318,
  0.7734806629834255,
  0.7403314917127071,
  0.7868852459016393,
  0.7262569832402235,
  0.7142857142857144],
 'test_score': 0.7176470588235294,
 'train_loss_history': [0.018629704683440402,
  0.01723632013913497,
  0.016668295361173962,
  0.01468841315617839,
  0.013153638003520596,
  0.012298279794529804,
  0.010871608226189336,
  0.009446992728750683,
  0.0075744906241453965,
  0.005192594941777801]}

At the end of training, Flair reports the performance of the best model on the test set in the following format (see the end of the previous cell's output).

Results:
- **F-score** (micro) 0.7176
- **F-score** (macro) 0.4947
- **Accuracy** 0.7024

By class:

                 precision    recall  f1-score   support
               1     0.7156    0.8571    0.7800        91
              -1     0.7333    0.6769    0.7040        65
               0     0.0000    0.0000    0.0000        12
   

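As expected, the Flair model performs well on positive news (class 1, F1 of about 0.78) and reasonably on negative news (class -1), but it fails entirely on neutral news (class 0, F1 of 0.00), for which the test set contains only 12 examples. This is why the macro F-score is much lower than the micro F-score. Overall model performance is also better than that of the statistical models.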
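
The gap between the micro and macro F-scores can be checked by hand. The sketch below recomputes the per-class F1, the macro average and the support-weighted average from the precision, recall and support values in the classification report printed at the end of the training cell. The variable names are illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, support) per class, copied from the training cell's report
per_class = {
    "1": (0.7156, 0.8571, 91),
    "-1": (0.7333, 0.6769, 65),
    "0": (0.0, 0.0, 12),
}

scores = {label: f1(p, r) for label, (p, r, _) in per_class.items()}

# Macro average: every class counts equally, so the empty class 0 costs a third
macro_f1 = sum(scores.values()) / len(scores)

# Weighted average: classes count in proportion to their support
total = sum(n for _, _, n in per_class.values())
weighted_f1 = sum(scores[label] * n for label, (_, _, n) in per_class.items()) / total

print(round(macro_f1, 4), round(weighted_f1, 4))
```

The results agree with the `macro avg` and `weighted avg` rows of the report up to rounding, which confirms that the low macro score is driven by the neutral class alone.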