<a href="https://colab.research.google.com/github/Keenandrea/Flair/blob/master/Flair.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Flair: A Natural Language Processing Library

---

Flair is an NLP library built on top of *PyTorch*. NLP tasks Flair can handle include *Named-Entity Recognition*, *Part-of-Speech Tagging*, *Text Classification*, and *Custom Language Modeling*.

What makes Flair stand out is that it is built around state-of-the-art word embeddings and lets users combine different embeddings and apply them to documents.

---

## contextual string embeddings for sequence labeling

---

Contextual string embeddings leverage the internal states of a trained character language model to produce a novel type of word embedding. Because those internal states depend on the surrounding characters, a word can be represented differently in different sentences.

In contextual string embeddings, text is modeled as a sequence of characters rather than words, and each word's embedding is contextualized by its surrounding text. This means the same word can have different embeddings depending on the context.

Take, for instance, the word *key*. In some contexts it is an object that unlocks; in others it marks what matters most, as in the *key* takeaway or the *key* point; in still others it labels a value, as in *key*-value pairs.

With contextual string embeddings, each of these *keys* is given its own embedding. Think of all the English words whose meaning shifts with context and you'll see the boon of the tool.
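To make the mechanism concrete, here is a toy sketch. It does not use Flair or a trained model: the `step` function is a hypothetical stand-in for a character language model's state update, and all names are made up for illustration. What it roughly mirrors is the idea of reading the sentence character by character in both directions and describing a word by the forward state at its end together with the backward state at its start.

```python
def step(state: int, ch: str) -> int:
    """Toy state update standing in for a recurrent cell (NOT a trained model)."""
    # Base-256 accumulation keeps distinct ASCII histories distinct.
    return state * 256 + ord(ch)


def run(text: str) -> list:
    """Return the toy state reached after each character of `text`."""
    states, state = [], 1
    for ch in text:
        state = step(state, ch)
        states.append(state)
    return states


def toy_embedding(sentence: str, word: str) -> tuple:
    """Pair the forward state at the word's end with the backward state at its start."""
    start = sentence.index(word)
    end = start + len(word) - 1
    forward = run(sentence)
    # The backward pass reads right to left; reversing its states lines
    # them up with the original character positions.
    backward = run(sentence[::-1])[::-1]
    return forward[end], backward[start]


a = toy_embedding("the key opens the door", "key")
b = toy_embedding("the key point is simple", "key")
print(a[0] == b[0])  # True: identical left context, identical forward state
print(a[1] != b[1])  # True: different right context, different backward state
```

Both sentences share the prefix "the key", so the forward half alone cannot tell the two uses apart. In this toy, the backward half is what separates them, which hints at why reading in both directions is useful.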

---

## performing with Flair

---
We're going to demonstrate Flair using the *Twitter Sentiment Analysis* dataset, downloaded from [Kaggle](https://www.kaggle.com/paoloripamonti/twitter-sentiment-analysis).

Follow the link provided, download the *.csv*, and upload it where your Colab notebook instance can reach it (this notebook reads it from Google Drive). Set the runtime to *Python 3* with a *GPU*, and you're set to go.

---

## connecting to *Google Drive*, importing dataset

---




In [0]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials


# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Download a file based on its file ID.
# A file ID looks like: laggVyWshwcyP6kEI-y_W3P8D26sz
# file_id = '1fr4ff3mLKTY0WOvXI1x4Fj9Xu_hgxyQV' ### File ID ###
file_id = '1GhyH4k9C4uPRnMAMKhJYOqa-V9Tqt4q8' ### File ID ###
data = drive.CreateFile({'id': file_id})


## transferring dataset into readable format

---



In [0]:
import io
import pandas as pd

data = pd.read_csv(io.StringIO(data.GetContentString())) 
data.head()

| Unnamed: 0.1 | Unnamed: 0 | label | tweet |
|---|---|---|---|
| 0 | 0 | 0.0 | user when a father is dysfunctional and is s... |
| 1 | 1 | 0.0 | user user thanks for lyft credit i can t us... |
| 2 | 2 | 0.0 | bihday your majesty |
| 3 | 3 | 0.0 | model i love u take with u all the time in ... |
| 4 | 4 | 0.0 | factsguide society now motivation |


In [0]:
# install Flair, which
# runs on top of PyTorch
import torch
!pip install flair
import flair

Collecting flair
  Downloading https://files.pythonhosted.org/packages/4e/3a/2e777f65a71c1eaa259df44c44e39d7071ba8c7780a1564316a38bf86449/flair-0.4.2-py3-none-any.whl (136kB)
Collecting sqlitedict>=1.6.0 (from flair)
  Downloading https://files.pythonhosted.org/packages/0f/1c/c757b93147a219cf1e25cef7e1ad9b595b7f802159493c45ce116521caff/sqlitedict-1.6.0.tar.gz
Collecting deprecated>=1.2.4 (from flair)
  Downloading https://files.pythonhosted.org/packages/9f/7a/003fa432f1e45625626549726c2fbb7a29baa764e9d1fdb2323a5d779f8a/Deprecated-1.2.5-py2.py3-none-any.whl
Collecting segtok>=1.5.7 (from flair)
  Downloading https://files.pythonhosted.org/packages/1d/59/6ed78856ab99d2da04084b59e7da797972baa0efecb71546b16d48e49d9b/segtok-1.5.7.tar.gz
Collecting pytorch-pretrained-bert>=0.6.1 (from flair)
  Downloading https://files.pythonhosted.org/packages/d7/e0/c08d5553b89973d9a240605b9c12404bcf8227590de62bae27acbcfe076b/pytorch_pret

Flair's data model rests on two object types, *Sentence* and *Token*, which are cardinal to the library. A Sentence holds a textual sentence as a list of Tokens.

In [0]:
from flair.data import Sentence
s = Sentence('To be or not too.')
print(Sentence)

<class 'flair.data.Sentence'>


In [0]:
from flair.data import Sentence
s = Sentence('To be or not too.')
# see what's inside the sentence
print(s)

Sentence: "To be or not too." - 5 Tokens


In [0]:
# extracting the tweet column
text = data['tweet']
# txt is a list of tweets
txt = text.tolist()
print(txt[:10])

['  user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction     run', ' user  user thanks for  lyft credit i can t use cause they don t offer wheelchair vans in pdx      disapointed  getthanked', '  bihday your majesty', ' model   i love u take with u all the time in ur                                      ', ' factsguide  society now     motivation', '      huge fan fare and big talking before they leave  chaos and pay disputes when they get there   allshowandnogo  ', '  user camping tomorrow  user  user  user  user  user  user  user danny   ', 'the next school year is the year for exams      can t think about that       school  exams    hate  imagine  actorslife  revolutionschool  girl', 'we won    love the land     allin  cavs  champions  cleveland  clevelandcavaliers      ', '  user  user welcome here    i m   it s so  gr    ']


## word embeddings with *Flair*

---

A few of the more popular word embeddings are written into the cell below. We will be using Stacked Embeddings to combine multiple embeddings to build a word representation model with great power and little complexity.
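Stacking here amounts to concatenation: each token's vectors from the individual embeddings are joined end to end, so the stacked vector is as long as its parts combined. A toy sketch with made-up vectors and a hypothetical `stack` helper (plain Python lists, not Flair's API):

```python
def stack(*vectors):
    """Concatenate several per-token vectors into one longer vector."""
    stacked = []
    for vec in vectors:
        stacked.extend(vec)
    return stacked


# Made-up vectors for a single token from three imaginary embeddings.
word_level = [0.1, 0.2, 0.3]
forward_lm = [0.5, -0.4]
backward_lm = [0.9, 0.0]

token_vector = stack(word_level, forward_lm, backward_lm)
print(len(token_vector))  # 7
```

The downstream model sees a single vector per token and does not need to know which embedding contributed which positions.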

In [0]:
from flair.embeddings import WordEmbeddings
from flair.embeddings import CharacterEmbeddings
from flair.embeddings import StackedEmbeddings
from flair.embeddings import FlairEmbeddings
from flair.embeddings import BertEmbeddings
from flair.embeddings import ELMoEmbeddings

### Initialising embeddings (un-comment to use others) ###
#glove_embedding = WordEmbeddings('glove')
#character_embeddings = CharacterEmbeddings()
flair_forward  = FlairEmbeddings('news-forward-fast')
flair_backward = FlairEmbeddings('news-backward-fast')
#bert_embedding = BertEmbeddings()
#elmo_embedding = ELMoEmbeddings()

stacked_embeddings = StackedEmbeddings(embeddings = [ 
                                                      flair_forward, 
                                                      flair_backward
                                                    ])

2019-06-13 15:52:03,285 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/lm-news-english-forward-1024-v0.2rc.pt not found in cache, downloading to /tmp/tmphal31cc6


100%|██████████| 19689779/19689779 [00:01<00:00, 11157452.41B/s]

2019-06-13 15:52:05,602 copying /tmp/tmphal31cc6 to cache at /root/.flair/embeddings/lm-news-english-forward-1024-v0.2rc.pt
2019-06-13 15:52:05,623 removing temp file /tmp/tmphal31cc6





2019-06-13 15:52:13,012 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/lm-news-english-backward-1024-v0.2rc.pt not found in cache, downloading to /tmp/tmpkq2ebney


100%|██████████| 19689779/19689779 [00:01<00:00, 11485513.42B/s]

2019-06-13 15:52:15,263 copying /tmp/tmpkq2ebney to cache at /root/.flair/embeddings/lm-news-english-backward-1024-v0.2rc.pt
2019-06-13 15:52:15,286 removing temp file /tmp/tmpkq2ebney





Here we can mix and match the embeddings imported from `flair.embeddings` and inspect the type and size of our stacked embeddings.

In [0]:
# create a sentence
s = Sentence('To be or not too.')
# embed words in sentence
stacked_embeddings.embed(s)
for token in s:
  print(token.embedding)
# data type and size of embedding
print(type(token.embedding))
# storing size (length)
z = token.embedding.size()[0]

tensor([-2.5548e-03, -1.5109e-06,  2.8850e-07,  ..., -2.4275e-08,
         5.6365e-06,  1.2835e-02])
tensor([ 1.5566e-03,  3.3451e-05,  1.0460e-06,  ..., -5.7873e-08,
         2.3331e-03,  1.8155e-02])
tensor([ 5.9643e-04, -1.0582e-06,  9.8358e-07,  ..., -4.2780e-08,
        -2.3259e-03,  1.5929e-02])
tensor([-1.0171e-02, -2.1711e-05,  6.7540e-06,  ..., -1.3955e-07,
        -4.0077e-05,  1.1862e-03])
tensor([-2.3813e-03, -3.6439e-06,  5.0253e-09,  ..., -1.7233e-09,
        -3.2845e-04,  1.7303e-03])
<class 'torch.Tensor'>


## vectorizing the text

---

Here we can choose one of two approaches:

    1) we calculate the mean of word embeddings
    2) we vectorize the entire tweet

Both of these approaches are listed below, respectively.

---



## calculating the mean of word embeddings

---

Our calculation approach for embeddings within a Tweet will take the following steps: 

    1) we generate a word embedding for each word 
    2) we calculate the mean of these embeddings to obtain the embedding of the sentence

Fair warning: the next cell takes a long time to run (a little over an hour in this notebook).

In [0]:
# import tqdm to track the
# progress of our loops
from tqdm import tqdm

# creating a tensor 
# to store sentence 
# embeddings 
s = torch.zeros(0,z)

# iterating Sentence
for tweet in tqdm(txt):   
  # empty tensor for words 
  w = torch.zeros(0,z)   
  sentence = Sentence(tweet)
  stacked_embeddings.embed(sentence)
  # loop for every word
  for token in sentence:
    # storing Embeddings of each word in a sentence
    w = torch.cat((w,token.embedding.view(-1,z)),0)
  # storing sentence Embeddings (obtains mean of all words)
  s = torch.cat((s, w.mean(dim = 0).view(-1, z)),0)

100%|██████████| 49159/49159 [1:05:36<00:00,  6.72it/s]
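The mean-pooling step inside the loop above can be illustrated on toy numbers. This is a plain-Python sketch with made-up two-dimensional vectors and a hypothetical `mean_pool` helper; the notebook itself does the same averaging with `torch` tensors.

```python
def mean_pool(word_vectors):
    """Average equally sized word vectors element-wise into one sentence vector."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    size = len(word_vectors[0])
    count = len(word_vectors)
    return [sum(vec[i] for vec in word_vectors) / count for i in range(size)]


# Three made-up word vectors standing in for one tweet.
words = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
sentence_vector = mean_pool(words)
print(sentence_vector)  # [3.0, 5.0]
```

However many words a tweet has, the result has the size of a single word vector, which is what lets every tweet become one fixed-length row. The toy helper raises on empty input; how to treat a tweet with no tokens in the real loop is a separate decision.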


## document embedding

---


So you've chosen to perform some document embedding. In other words, we're going to vectorize the entire Tweet. This, too, is going to take some time. Grab your reading material.

In [0]:
from tqdm import tqdm
from flair.embeddings import DocumentPoolEmbeddings

### initialize the document embeddings, mode = mean ###
document_embeddings = DocumentPoolEmbeddings([
                                              flair_backward,
                                              flair_forward
                                             ])

# embed one sample sentence to read off the size of a document embedding
sample = Sentence('To be or not too.')
document_embeddings.embed(sample)
z = sample.embedding.size()[0]

### Vectorising text ###
# creating a tensor for storing sentence embeddings
s = torch.zeros(0,z)
# iterating over the tweets
for tweet in tqdm(txt):   
  sentence = Sentence(tweet)
  document_embeddings.embed(sentence)
  # appending the document embedding as a new row
  s = torch.cat((s, sentence.embedding.view(-1,z)),0)

100%|██████████| 49159/49159 [1:32:11<00:00,  5.35it/s]


## partitioning the data between train and test sets

---



In [0]:
## tensor to numpy array ##
X = s.detach().numpy()

## rows up to 31962 form the training set, the remaining rows the test set ##
train = X[:31962,:]
test = X[31962:,:]

# extracting labels of the training set (rows whose label is not missing) #
target = data['label'][data['label'].notnull()].values
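The hard-coded `31962` has to agree with the number of labelled rows. As a hedged alternative, the cut point could be derived from the labels themselves. The sketch below uses plain Python lists, made-up data, and a hypothetical `split_index` helper, and it assumes that all labelled rows come first, as the slicing above implies.

```python
import math


def split_index(labels):
    """Return the position of the first missing (NaN) label."""
    for i, label in enumerate(labels):
        if math.isnan(label):
            return i
    return len(labels)


# Made-up data: three labelled rows followed by two unlabelled ones.
labels = [0.0, 1.0, 0.0, float("nan"), float("nan")]
rows = [[0.1], [0.2], [0.3], [0.4], [0.5]]

cut = split_index(labels)
train_rows, test_rows = rows[:cut], rows[cut:]
print(cut, len(train_rows), len(test_rows))  # 3 3 2
```

Deriving the index this way keeps the features and the labels in step if the dataset file ever changes size.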

In [0]:
import numpy as np

Up next, we're going to define a custom *F1 evaluator* for our upcoming *XGBoost* model.

---

## what the heck is an *f1 score*?

---
According to *Wikipedia*, in the statistical analysis of binary classification the f1 score is a measure of a test's accuracy. It considers both the *precision* (p) and the *recall* (r) of the test to compute the score.

*p* is the number of correct positive results divided by the number of all positive results returned by the classifier.

*r* is the number of correct positive results divided by the number of all relevant samples, that is, all samples that should have been identified as positive.

The f1 score is the harmonic mean of precision and recall, 2pr / (p + r). It reaches its best value at 1 and its worst at 0.
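The definitions above can be checked by hand on a small example. The sketch below is plain Python with made-up labels and a hypothetical helper name; it is meant only to show the arithmetic.

```python
def precision_recall_f1(labels, preds):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: 3 true positives in the data, 3 predicted positives.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
preds = [1, 1, 0, 1, 0, 0, 0, 0]
p, r, f1 = precision_recall_f1(labels, preds)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.667 0.667
```

Here two of the three predicted positives are correct (p = 2/3) and two of the three actual positives are found (r = 2/3), so the harmonic mean is also 2/3.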


In [0]:
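from sklearn.metrics import f1_score

def custom_eval(preds, dtrain):
    # true labels stored in the DMatrix
    labels = dtrain.get_label().astype(int)
    # predicted probabilities become class labels at a 0.3 threshold
    preds = (preds >= 0.3).astype(int)
    return [('f1_score', f1_score(labels, preds))]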
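The evaluator above turns probabilities into labels at 0.3 rather than the more usual 0.5. The toy sketch below, with made-up probabilities and a hypothetical helper name, shows how the threshold alone can change the F1 score when positives are rare. It does not show that 0.3 is the best choice for this dataset.

```python
def f1_at_threshold(labels, probs, threshold):
    """F1 after labelling every probability >= threshold as positive."""
    preds = [1 if prob >= threshold else 0 for prob in probs]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # 2*tp / (2*tp + fp + fn) is algebraically the harmonic mean of p and r.
    return 2 * tp / denom if denom else 0.0


labels = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.4, 0.35, 0.2, 0.1, 0.05, 0.45, 0.02]

print(round(f1_at_threshold(labels, probs, 0.5), 3))  # 0.5
print(round(f1_at_threshold(labels, probs, 0.3), 3))  # 0.857
```

Lowering the threshold recovers the two positives scored at 0.4 and 0.35 at the cost of one false positive, which in this toy case is a good trade.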

## building the model using *XGBoost*

---



In [0]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

### Splitting training set ###
x_train, x_valid, y_train, y_valid = train_test_split(train, target,
                                                      random_state=42,
                                                      test_size=0.3)

### XGBoost compatible data ###
dtrain = xgb.DMatrix(x_train, label=y_train)
dvalid = xgb.DMatrix(x_valid, label=y_valid)

### defining parameters ###
params = {
          'colsample_bytree': 0.5,
          'eta': 0.1,
          'max_depth': 8,
          'min_child_weight': 6,
          'objective': 'binary:logistic',
          'subsample': 0.9
          }

### Training the model ###
xgb_model = xgb.train(
                      params,
                      dtrain,
                      feval=custom_eval,
                      num_boost_round=1000,
                      maximize=True,
                      evals=[(dvalid, "Validation")],
                      early_stopping_rounds=30
                      )

[0]	Validation-error:0.07373	Validation-f1_score:0.133165
Multiple eval metrics have been passed: 'Validation-f1_score' will be used for early stopping.

Will train until Validation-f1_score hasn't improved in 30 rounds.
[1]	Validation-error:0.065075	Validation-f1_score:0.133165
[2]	Validation-error:0.063927	Validation-f1_score:0.133165
[3]	Validation-error:0.062363	Validation-f1_score:0.133165
[4]	Validation-error:0.063719	Validation-f1_score:0.133165
[5]	Validation-error:0.06278	Validation-f1_score:0.297885
[6]	Validation-error:0.063197	Validation-f1_score:0.376812
[7]	Validation-error:0.062467	Validation-f1_score:0.41914
[8]	Validation-error:0.062676	Validation-f1_score:0.42386
[9]	Validation-error:0.062885	Validation-f1_score:0.436113
[10]	Validation-error:0.062572	Validation-f1_score:0.444444
[11]	Validation-error:0.062572	Validation-f1_score:0.427835
[12]	Validation-error:0.06278	Validation-f1_score:0.43455
[13]	Validation-error:0.06278	Validation-f1_score:0.438532
[14]	Validatio
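The log above shows the F1 score sitting flat for the first five rounds before it starts to climb. The sketch below replays the logged scores through a simple patience rule to illustrate the idea behind `early_stopping_rounds=30`. It is a plain-Python illustration with a hypothetical function name, not XGBoost's own implementation.

```python
def early_stopping(scores, patience):
    """Scan a metric that should increase; return (best_round, stop_round).

    stop_round is None when the scores run out before patience is exhausted.
    """
    best_round = 0
    for rnd in range(1, len(scores)):
        if scores[rnd] > scores[best_round]:
            best_round = rnd
        elif rnd - best_round >= patience:
            return best_round, rnd
    return best_round, None


# Validation F1 for rounds 0-13, copied from the training log above.
f1_log = [0.133165, 0.133165, 0.133165, 0.133165, 0.133165, 0.297885,
          0.376812, 0.41914, 0.42386, 0.436113, 0.444444, 0.427835,
          0.43455, 0.438532]

print(early_stopping(f1_log, patience=30))  # (10, None)
print(early_stopping(f1_log, patience=3))   # (0, 3)
```

With a patience of 30 the flat start is tolerated and round 10 is the best seen so far. With a patience of 3 the same rule would have given up at round 3, before any improvement appeared.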

In [0]:
### Reformatting test set for XGB ###
dtest = xgb.DMatrix(test)

### Predicting ###
predict = xgb_model.predict(dtest) # predicting

## generating language with Flair transfer learning

---



In [0]:
import torch
from flair.models import LanguageModel

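# path to a trained Flair character language model checkpoint
# (placeholder: point this at your own file; a missing file raises FileNotFoundError)
model_file = 'path/to/best-lm.pt'

# load the language model
model = LanguageModel.load_language_model(model_file)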

idx2item = model.dictionary.idx2item

# initial hidden state
hidden = model.init_hidden(1)
input = torch.rand(1, 1).mul(len(idx2item)).long()

# generate text character by character
characters = []
number_of_characters_to_generate = 2000
for i in range(number_of_characters_to_generate):
    prediction, rnn_output, hidden = model.forward(input, hidden)
    word_weights = prediction.squeeze().data.div(1.0).exp().cpu()
    word_idx = torch.multinomial(word_weights, 1)[0]
    input.data.fill_(word_idx)
    word = idx2item[word_idx].decode('UTF-8')
    characters.append(word)

    if i % 100 == 0:
        print('| Generated {}/{} chars'.format(i, number_of_characters_to_generate))

# print generated text
print(''.join(characters))

FileNotFoundError: ignored
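The sampling step inside the generation loop above can be illustrated without a trained model. In that loop, the scores are divided by `1.0`, exponentiated, and passed to `torch.multinomial`; the divisor behaves like a temperature. The sketch below reproduces that idea in plain Python with a made-up three-character dictionary, made-up scores, and hypothetical helper names.

```python
import math
import random


def char_probabilities(scores, temperature):
    """Turn raw scores into sampling probabilities via exp(score / temperature)."""
    weights = [math.exp(score / temperature) for score in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_char(chars, scores, temperature, rng):
    """Draw one character at random according to its probability."""
    probs = char_probabilities(scores, temperature)
    return rng.choices(chars, weights=probs, k=1)[0]


chars = ["a", "b", "c"]      # made-up dictionary
scores = [2.0, 1.0, 0.1]     # made-up model scores

even = char_probabilities(scores, temperature=1.0)
sharp = char_probabilities(scores, temperature=0.1)
print([round(p, 3) for p in even])
print([round(p, 3) for p in sharp])

rng = random.Random(0)       # seeded so the run is repeatable
print("".join(sample_char(chars, scores, 1.0, rng) for _ in range(20)))
```

A lower temperature concentrates nearly all the probability on the highest-scoring character, while a temperature of 1.0 leaves room for the less likely ones, which makes the generated text more varied.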