#### CLASSIFICATION WITH BERT AND FASTTEXT

The goal of this notebook is to investigate text classification with two neural approaches, BERT and fastText. In particular, we want to understand how these models work and what level of accuracy they reach.

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [2]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('To enable a high-RAM runtime, select the Runtime > "Change runtime type"')
  print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
  print('re-execute this cell.')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 38.0 gigabytes of available RAM

You are using a high-RAM runtime!


In [3]:
import pandas as pd
import numpy as np
import random

import string
import os.path
import re

In [4]:
# Preprocessing data for BERT algorithm.
class preprocessing:
  # Initialize.
  def __init__(self, path, name_df):
    self.path = path
    self.name_df = name_df

  # Load the raw lines of the dataset from disk.
  def loader(self):
    print('IMPORT DATASET ' + self.name_df)
    if not os.path.isfile(self.path):
      print('Dataset file does not exist.')
      raise SystemExit(f"File not found: {self.path}")
    with open(self.path, encoding="utf8") as file:
      self.df = file.readlines()
    print(f"Size: {len(self.df)}")

  # Draw a random sample of `size` lines.
  def sampling(self, size):
    print('SAMPLING ' + self.name_df)
    random.seed(20201230)  # fix the seed so the sample is reproducible
    self.df = random.sample(self.df, size)
    print(f"Size: {len(self.df)}")

  # Convert the list of raw lines into a DataFrame of reviews and labels.
  def data_frame(self):
    print('CREATE DATASET: REVIEWS - LABELS ' + self.name_df)
    X = []
    labels = []

    for rev in self.df:
      # Each line looks like '__label__<digit> <review text>'.
      _, label, sent = re.split(r'__label__(\d)', rev, maxsplit=1)
      label = int(label) - 1  # map labels 1/2 to 0/1
      labels.append(label)
      X.append(sent)

    self.df = pd.DataFrame(list(zip(X, labels)), columns=['Review', 'Labels'])
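The `data_frame` method relies on every line starting with a `__label__<digit>` prefix followed by the review text. A minimal sketch of that parsing step on its own is shown here; the helper name `parse_line` and the sample sentence are hypothetical and only serve as illustration.

```python
import re

# One label digit after the '__label__' prefix, then the review text.
LABEL_PATTERN = re.compile(r'__label__(\d)\s*(.*)', re.DOTALL)

def parse_line(line):
    """Split a fastText-formatted line into (text, zero_based_label)."""
    match = LABEL_PATTERN.match(line)
    if match is None:
        raise ValueError(f"Line does not start with a label: {line!r}")
    label = int(match.group(1)) - 1   # labels 1/2 become 0/1
    text = match.group(2).strip()     # drop the trailing newline
    return text, label

sample = "__label__2 Great CD: my favourite album.\n"
print(parse_line(sample))  # ('Great CD: my favourite album.', 1)
```

Raising on a malformed line, rather than skipping it silently, makes formatting problems in the input file visible early.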


In [5]:
%%time

# -- Import the training set.
train_path = 'drive/MyDrive/Text Mining/train.ft.txt'
train = preprocessing(train_path, 'TRAIN')
train.loader()

# -- Draw a random sample.
train_sample_size = 250000
train.sampling(train_sample_size)

# -- Build a DataFrame with Review and Labels columns.
train.data_frame()

IMPORT DATASET TRAIN
Size: 3600000
SAMPLING TRAIN
Size: 250000
CREATE DATASET: REVIEWS - LABELS TRAIN
CPU times: user 2.6 s, sys: 2.23 s, total: 4.84 s
Wall time: 28.4 s
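The sampling step is meant to be reproducible, so that the same 250000 reviews are drawn on every run. A small sketch of seeded sampling on toy data follows; the function name `reproducible_sample` and the toy lines are assumptions made for illustration, and a local `random.Random` instance is used here instead of the global generator.

```python
import random

def reproducible_sample(items, size, seed=20201230):
    """Return `size` items drawn without replacement using a fixed seed."""
    rng = random.Random(seed)  # local generator, global state is untouched
    return rng.sample(items, size)

# Toy stand-in for the lines read from the training file.
lines = [f"__label__{1 + i % 2} review {i}" for i in range(1000)]

first = reproducible_sample(lines, 5)
second = reproducible_sample(lines, 5)
print(first == second)  # True: same seed, same sample
```

Keeping the generator local avoids side effects on any other code that uses the `random` module.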


### **BERT**

In [6]:
!pip install ktrain

Collecting ktrain
  Downloading https://files.pythonhosted.org/packages/41/23/6f5addc2ade7c6240e2c9169bd7a9506cea17b35c9f322104a60dd4ba7fd/ktrain-0.25.2.tar.gz (25.3MB)
Collecting langdetect
  Downloading https://files.pythonhosted.org/packages/56/a3/8407c1e62d5980188b4acc45ef3d94b933d14a2ebc9ef3505f22cf772570/langdetect-1.0.8.tar.gz (981kB)
Collecting cchardet
  Downloading https://files.pythonhosted.org/packages/a0/e5/a0b9edd8664ea3b0d3270c451ebbf86655ed9fc4c3e4c45b9afae9c2e382/cchardet-2.1.7-cp36-cp36m-manylinux2010_x86_64.whl (263kB)
Collecting syntok
  Downloading https://files.pythonhosted.org/packages/8c/76/a49e73a04b3e3a14ce232e8e28a1587f8108baa665644fe8c40e307e792e/syntok-1.3.1.tar.gz
Collecting seqeval==0.0.19
  Downloading https://files.pythonhosted.org/packages/93/e5/b7705156

In [7]:
import ktrain
from ktrain import text

In [8]:
# Build one-hot label columns ('neg', 'pos') for ktrain's label_columns argument.

bert_df = train.df[['Review', 'Labels']].rename(columns={'Labels': 'pos'})
bert_df['neg'] = 1 - bert_df['pos']  # complement of the 0/1 'pos' column

In [9]:
%%time
# The function returns the training set, the validation set and the text preprocessor.
(x_train, y_train), (x_val, y_val), preproc = text.texts_from_df(bert_df, 
                                                                   'Review', # name of column containing review text
                                                                   label_columns=['neg', 'pos'],
                                                                   maxlen=100, 
                                                                   max_features=100000,
                                                                   preprocess_mode='bert')

downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en


CPU times: user 3min 20s, sys: 1.94 s, total: 3min 22s
Wall time: 3min 28s
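With `maxlen=100`, every review is turned into a sequence of fixed length before it reaches the model. The sketch below illustrates the general idea of truncating and padding a token sequence with BERT-style `[CLS]` and `[SEP]` markers. It is only an illustration under simplifying assumptions: whitespace splitting stands in for WordPiece tokenization, the helper `to_fixed_length` is hypothetical, and ktrain's actual preprocessing may differ in detail.

```python
def to_fixed_length(tokens, maxlen, pad_token="[PAD]"):
    """Truncate or pad `tokens` so the result has exactly `maxlen` entries."""
    if maxlen < 2:
        raise ValueError("maxlen must leave room for [CLS] and [SEP]")
    body = tokens[: maxlen - 2]                # reserve two slots for markers
    sequence = ["[CLS]"] + body + ["[SEP]"]
    padding = [pad_token] * (maxlen - len(sequence))
    return sequence + padding

short_review = "great product".split()
print(to_fixed_length(short_review, maxlen=8))

long_review = ["word"] * 500
print(len(to_fixed_length(long_review, maxlen=100)))  # 100
```

A small `maxlen` keeps memory use and training time down, at the cost of discarding the tail of long reviews.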


In [10]:
%%time
# Build the BERT (classification) model.
model = text.text_classifier(name = 'bert',
                            train_data = (x_train, y_train),
                            preproc = preproc)

Is Multi-Label? False
maxlen is 100
done.
CPU times: user 8.89 s, sys: 1.83 s, total: 10.7 s
Wall time: 5.29 s


In [11]:
%%time
# Wrap the model and the data in a ktrain learner, which is used for training.
learner = ktrain.get_learner(model = model,
                             train_data = (x_train, y_train),
                             val_data = (x_val, y_val),
                             batch_size = 12
                            )

CPU times: user 233 ms, sys: 427 ms, total: 660 ms
Wall time: 694 ms


In [12]:
%%time
learner.fit_onecycle(lr = 2e-5, # maximum learning rate
                     epochs = 1 # a single epoch, since training is very slow
                    )

# Training took about 8 hours to complete.
# Validation accuracy: 95%!



begin training using onecycle policy with max lr of 2e-05...
CPU times: user 7d 15h 10min 23s, sys: 5h 58min 19s, total: 7d 21h 8min 42s
Wall time: 8h 13min 44s


<tensorflow.python.keras.callbacks.History at 0x7faaed869ef0>
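The training log mentions a "onecycle policy with max lr of 2e-05". As a rough illustration of what such a policy does, the sketch below computes a triangular schedule that warms up linearly to the maximum learning rate at mid-training and then decays linearly. The function `triangular_lr` and the choice of a zero starting rate are assumptions for this example; ktrain's actual schedule may differ.

```python
def triangular_lr(step, total_steps, max_lr, min_lr=0.0):
    """Learning rate at `step` for a symmetric warm-up and decay schedule."""
    if total_steps < 2:
        raise ValueError("total_steps must be at least 2")
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    half = (total_steps - 1) / 2
    distance = abs(step - half) / half   # 1 at both ends, 0 at the midpoint
    return min_lr + (max_lr - min_lr) * (1 - distance)

# Five steps are enough to see the shape: up to 2e-5, then back down.
schedule = [triangular_lr(s, total_steps=5, max_lr=2e-5) for s in range(5)]
print(schedule)
```

The low rates at the start and end are intended to stabilise fine-tuning of the pretrained weights, while the peak in the middle speeds up learning.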

### ***FASTTEXT***

In [13]:
# FROM THE UBUNTU TERMINAL


# I created a file with 500000 training reviews and 55555 test reviews.

# I tried several parameter settings when training the model;
# the best ones turned out to be the following, which match fastText's
# defaults, so the command needs no extra flags:
# - learning rate: 0.1
# - epoch: 5
# - wordNgrams: 1

# ./fasttext supervised -input train_sample -output model -label __label__

# For the predictions
# (the redirection at the end saves the predicted labels):
# ./fasttext predict model.bin test_sample > predicted_labels

# Running:
# ./fasttext test model.bin test_sample
# reports only precision and recall.
# I therefore load the labels to compute the accuracy.
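The notebook does not show how the 500000/55555 train and test files were produced. One possible way to obtain such a split (55555 is 10% of the 555555 lines in total) is sketched below on toy data; the function `split_lines`, the seed and the toy lines are all assumptions, not the author's procedure.

```python
import random

def split_lines(lines, test_fraction=0.1, seed=0):
    """Shuffle `lines` and split them into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(lines)             # copy, the input list is left intact
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-in for fastText-formatted review lines.
lines = [f"__label__{1 + i % 2} review {i}\n" for i in range(100)]
train_lines, test_lines = split_lines(lines)
print(len(train_lines), len(test_lines))  # 90 10
```

Each list could then be written to its own file with `writelines` and passed to the `fasttext` commands above.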

In [14]:
from sklearn.metrics import accuracy_score

# Load the predicted and the true labels to compute the accuracy,
# since fastText itself reports only precision and recall.
with open('drive/MyDrive/Text Mining/predicted_labels', 'r') as file:
  y_pred = file.readlines()

with open('drive/MyDrive/Text Mining/test_sample.txt', 'r') as file:
  y_true = file.readlines()

# Extract the label digit from each line and map labels 1/2 to 0/1.
y_pred = [int(re.split(r'__label__(\d)', label)[1]) - 1 for label in y_pred]
y_true = [int(re.split(r'__label__(\d)', label)[1]) - 1 for label in y_true]

# The score is a fraction between 0 and 1, not a percentage.
print(f"Accuracy FastText: {accuracy_score(y_true, y_pred):.3f}")

Accuracy FastText: 0.905
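For reference, accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it, together with precision and recall for the positive class, in plain Python on a toy example; the helper names and the toy labels are hypothetical and unrelated to the results above.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and true label agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy(toy_true, toy_pred))          # 0.75
print(precision_recall(toy_true, toy_pred))  # (0.75, 0.75)
```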
