### Task 3 Summary

In the notebook I used four models: fastText as a soft baseline, TF-IDF + XGBoost as a hard baseline, and two neural networks (a CNN and a self-attention RNN).

As the target metric for comparing these models I chose the ROC-AUC score, since it is specifically relevant for classification tasks where neither class has higher priority.
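ROC-AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The pure-Python sketch below computes it by brute force on toy data and shows that passing hard 0/1 predictions instead of scores discards ranking information. The function name `pairwise_roc_auc` and the numbers are illustrative and not part of the notebook.

```python
def pairwise_roc_auc(y_true, y_score):
    """Brute-force ROC-AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. O(n_pos * n_neg), so only for small inputs.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up numbers): every positive is scored above every negative.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.7, 0.9]
hard = [1 if s >= 0.5 else 0 for s in scores]  # thresholded predictions

auc_scores = pairwise_roc_auc(y_true, scores)
auc_hard = pairwise_roc_auc(y_true, hard)
print(auc_scores, auc_hard)  # 1.0 0.75
```

The thresholded version ties one negative with both positives, which is why its value drops even though the underlying scores rank the examples perfectly.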



In [0]:
pip install fasttext
import fasttext

Collecting fasttext
  Downloading https://files.pythonhosted.org/packages/f8/85/e2b368ab6d3528827b147fdb814f8189acc981a4bc2f99ab894650e05c40/fasttext-0.9.2.tar.gz (68kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp36-cp36m-linux_x86_64.whl size=3021212 sha256=90ea191cd9c326f3a4c6829b4ae0f96cd61e6bb02bcd22c7987edb9e08956352
  Stored in directory: /root/.cache/pip/wheels/98/ba/7f/b154944a1cf5a8cee91c15

In [0]:
#some imports
import numpy as np
import pandas as pd 
import bz2
from sklearn.metrics import roc_auc_score
import os
import re
import csv
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier
import nltk
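from keras.layers import (Input, Embedding, Dropout, BatchNormalization, Conv1D,
                          GlobalAveragePooling1D, Activation, Bidirectional, LSTM, Flatten, Dense)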
from keras.models import Model
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

In [2]:
pip install keras-self-attention

Collecting keras-self-attention
  Downloading https://files.pythonhosted.org/packages/44/3e/eb1a7c7545eede073ceda2f5d78442b6cad33b5b750d7f0742866907c34b/keras-self-attention-0.42.0.tar.gz
Building wheels for collected packages: keras-self-attention
  Building wheel for keras-self-attention (setup.py) ... done
  Created wheel for keras-self-attention: filename=keras_self_attention-0.42.0-cp36-none-any.whl size=17296 sha256=e538f3e1a7b631d5795d87ef456480c13be95446b29d481660533f991a538a4a
  Stored in directory: /root/.cache/pip/wheels/7b/05/a0/99c0cf60d383f0494e10eca2b238ea98faca9a1fe03cac2894
Successfully built keras-self-attention
Installing collected packages: keras-self-attention
Successfully installed keras-self-attention-0.42.0


In [0]:
from keras.models import Sequential
from keras_self_attention import SeqSelfAttention

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
#get the data and decode it
data = bz2.BZ2File("../content/drive/My Drive/amazonreviews/train.ft.txt.bz2")
data = data.readlines()
data = [x.decode('utf-8') for x in data]

### Part 1: Soft baseline, fasttext




The soft baseline (fastText classification) is heavily based on [this](https://www.kaggle.com/ejlok1/fasttext-model-91-7) kernel example, with reference to the [official PyPI fasttext documentation](https://pypi.org/project/fasttext/#train_supervised-parameters).
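`fasttext.train_supervised` takes a file path, and the training cell in this part reads `train.txt`, but the notebook does not show how that file was created. A minimal sketch of that step, assuming the decoded lines in `data` are already in fastText's `__label__N text` format. The helper name `write_fasttext_file` and the sample lines are hypothetical.

```python
import os
import tempfile


def write_fasttext_file(lines, path):
    """Write one '__label__N text' example per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            # Normalise line endings so each example ends with exactly one newline.
            f.write(line.rstrip("\r\n") + "\n")
    return path


# Made-up sample lines in the same format as train.ft.txt
sample = [
    "__label__2 Works well: does exactly what the description says\n",
    "__label__1 Stopped working after a week",
]
out_path = write_fasttext_file(sample, os.path.join(tempfile.mkdtemp(), "train.txt"))

with open(out_path, encoding="utf-8") as f:
    written = f.read().splitlines()
print(written)
```

In the notebook this would correspond to something like `write_fasttext_file(data, 'train.txt')` before calling `train_supervised`.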

In [0]:
help(fasttext.train_supervised)

Help on function train_supervised in module fasttext.FastText:

train_supervised(*kargs, **kwargs)
    Train a supervised model and return a model object.
    
    input must be a filepath. The input text does not need to be tokenized
    as per the tokenize function, but it must be preprocessed and encoded
    as UTF-8. You might want to consult standard preprocessing scripts such
    as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html
    
    The input file must must contain at least one label per line. For an
    example consult the example datasets which are part of the fastText
    repository such as the dataset pulled by classification-example.sh.



In [0]:

#Building a model 
model = fasttext.train_supervised('train.txt',label_prefix='__label__', epoch = 10)
print(model.labels)

['__label__1', '__label__2']


In [0]:
#Test data
test = bz2.BZ2File("../content/drive/My Drive/amazonreviews/test.ft.txt.bz2")
test = test.readlines()
test = [x.decode('utf-8') for x in test]

In [0]:
#Removing labels from test data
test_clear = [i.replace('__label__2 ', '') for i in test]
test_clear = [i.replace('__label__1 ', '') for i in test_clear]
test_clear = [i.replace('\n', '') for i in test_clear]

In [0]:
#Predicting the labels of the test set
pred = model.predict(test_clear)

In [0]:
labels = [0 if x.split(' ')[0] == '__label__1' else 1 for x in test]
pred_labels = [0 if x == ['__label__1'] else 1 for x in pred[0]]

In [0]:
roc_auc_FT = roc_auc_score(labels, pred_labels)
print("ROC-AUC for FastText is {}".format(round(roc_auc_FT,3)))

ROC-AUC for FastText is 0.917
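The score above is computed from hard 0/1 labels. For a list of texts, `model.predict` returns a pair `(labels, probabilities)`, where each probability belongs to the predicted (top) label, so a positive-class score can be recovered from it. A sketch, assuming the default `k=1` and the two labels printed earlier. `positive_scores` is a hypothetical helper, and `fake_pred` stands in for real model output.

```python
def positive_scores(pred, positive_label="__label__2"):
    """Turn fastText-style (labels, probs) output into P(positive) per example.

    Assumes k=1 (one label and one probability per example) and two classes.
    """
    labels, probs = pred
    scores = []
    for label_list, prob_list in zip(labels, probs):
        p = float(prob_list[0])
        # The probability refers to the predicted label; flip it for the other class.
        scores.append(p if label_list[0] == positive_label else 1.0 - p)
    return scores


# Stand-in for model.predict(test_clear): made-up labels and probabilities.
fake_pred = (
    [["__label__2"], ["__label__1"], ["__label__2"]],
    [[0.75], [0.75], [0.5]],
)
scores = positive_scores(fake_pred)
print(scores)  # [0.75, 0.25, 0.5]
```

Calling `roc_auc_score(labels, positive_scores(pred))` would then rank examples by score instead of by thresholded label.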


### Part 2: Hard baseline: TFIDF + XGBoost

Sources of this part of the notebook: 

1.   tricks for data preparation [from here](https://www.kaggle.com/kevinautin/fully-convolutional-accuracy-94-4-15-min)
2.   tricks for tokenization [from here](https://medium.com/@chrisfotache/text-classification-in-python-pipelines-nlp-nltk-tf-idf-xgboost-and-more-b83451a327e0)





In [0]:
#import a smart progress meter
from tqdm import tqdm

In [0]:
# import train data
train = bz2.BZ2File("../content/drive/My Drive/amazonreviews/train.ft.txt.bz2")
train = train.readlines()
train = [x.decode('utf-8') for x in train]

In [0]:
# import test data
test = bz2.BZ2File("../content/drive/My Drive/amazonreviews/test.ft.txt.bz2")
test = test.readlines()
test = [x.decode('utf-8') for x in test]

In [0]:
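# function for data preprocessing (taken from source 1)
def reviewText(review):
    # drop the leading label and the trailing newline, then lowercase
    review = review.split(' ', 1)[1][:-1].lower()
    # replace every digit with 0
    review = re.sub(r'\d', '0', review)
    # replace tokens ending in a dot plus three letters (e.g. ".com") with <url>
    if 'www.' in review or 'http:' in review or 'https:' in review or '.com' in review:
        review = re.sub(r"([^ ]+(?<=\.[a-z]{3}))", "<url>", review)
    return review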
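To make the preprocessing concrete, the sketch below repeats `reviewText` and applies it to a made-up input line (the sample text is hypothetical, not taken from the dataset).

```python
import re


def reviewText(review):
    # Drop the leading label and the trailing newline, then lowercase.
    review = review.split(' ', 1)[1][:-1].lower()
    # Replace every digit with 0.
    review = re.sub(r'\d', '0', review)
    # Replace tokens that end in a dot plus three letters with <url>.
    if 'www.' in review or 'http:' in review or 'https:' in review or '.com' in review:
        review = re.sub(r"([^ ]+(?<=\.[a-z]{3}))", "<url>", review)
    return review


line = "__label__2 Bought 2 in 2019, see www.example.com for details\n"
cleaned = reviewText(line)
print(cleaned)  # bought 0 in 0000, see <url> for details
```

The URL pattern fires on any token ending in a dot followed by three lowercase letters, so in a review that triggers the `if`, a token such as `good.but` would also become `<url>`.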

In [0]:
# second function for processing (inspiration: source 1)
def splitReviewsLabels(lines, list=False, review_length=None):
    '''Split raw lines into preprocessed review texts and labels.

    Parameters:
    list - desired label output format:
      if list = False, output - integer 0 or 1
      if list = True, output - one-hot list [1,0] or [0,1]
    review_length - max number of characters kept per review, by default all characters
    '''
    reviews = []
    labels = []
    for review in tqdm(lines):
        rev = reviewText(review)
        is_negative = review.split(' ')[0] == '__label__1'
        if list:
            label = [1, 0] if is_negative else [0, 1]
        else:
            label = 0 if is_negative else 1
        reviews.append(rev[:review_length])
        labels.append(label)
    return reviews, labels

In [12]:
# get the data for XGBoost model
reviews_train_XGB, y_train_XGB = splitReviewsLabels(train, review_length = 512)
reviews_test_XGB, y_test_XGB = splitReviewsLabels(test, review_length = 512)

100%|██████████| 3600000/3600000 [00:41<00:00, 86110.48it/s]
100%|██████████| 400000/400000 [00:04<00:00, 85699.05it/s]


In [0]:
# I decided to decrease the size of the training sample, since I have limited computing power.
# train_test_split draws a random subsample, which keeps the classes approximately balanced.
def decreaseTrain(X, y, target_size = 1000000):
  '''
  Input: 
  X - training data, list
  y - labels, list
  X and y must be the same size
  Parameter:
  target_size - target training sample size, in the open interval (0, len(y))
  '''
  # the "test" part of the split is returned as the reduced training set
  _, X1, _, Y1 = train_test_split(X, y, test_size=target_size/len(y))
  return X1, Y1
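A plain random split preserves class balance only approximately. `train_test_split` accepts a `stratify` argument for proportional sampling; the pure-Python sketch below illustrates the same idea without scikit-learn. `stratified_subsample` and the toy data are hypothetical and are not what the notebook ran.

```python
import random
from collections import defaultdict


def stratified_subsample(X, y, target_size, seed=0):
    """Sample about target_size items, keeping each class's share of the data.

    Labels must be hashable (e.g. 0/1 integers, not one-hot lists).
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    rng = random.Random(seed)
    fraction = target_size / len(y)
    chosen = []
    for label, indices in by_class.items():
        # Rounded per class, so the total can differ from target_size by a few items.
        k = round(len(indices) * fraction)
        chosen.extend(rng.sample(indices, k))
    rng.shuffle(chosen)
    return [X[i] for i in chosen], [y[i] for i in chosen]


# Toy data: 60 negative and 40 positive examples.
X = ["review %d" % i for i in range(100)]
y = [0] * 60 + [1] * 40
X_small, y_small = stratified_subsample(X, y, target_size=10)
print(len(X_small), sum(y_small))  # 10 4
```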

In [0]:
X_train_XGB, Y_train_XGB = decreaseTrain(reviews_train_XGB, y_train_XGB, target_size = 600000)

In [0]:
X_test_XGB, Y_test_XGB = reviews_test_XGB, y_test_XGB  # names used by the cells below
del train, test, reviews_train_XGB, y_train_XGB, reviews_test_XGB, y_test_XGB

In [0]:
# tokenizer taken from source 2
def Tokenizer_XGB(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    porter_stemmer=nltk.PorterStemmer()
    words = [porter_stemmer.stem(word) for word in words]
    return words

In [0]:
vectorizer = TfidfVectorizer(tokenizer=Tokenizer_XGB, stop_words='english')

In [18]:
train_XGB = vectorizer.fit_transform(X_train_XGB)
test_XGB = vectorizer.transform(X_test_XGB)



In [0]:
# hyperparameters: max tree depth and number of trees
model_XGB = XGBClassifier(max_depth=10, n_estimators = 50)

In [22]:
model_XGB.fit(train_XGB, Y_train_XGB)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=10,
              min_child_weight=1, missing=None, n_estimators=50, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [23]:
roc_auc_XGBoost=(roc_auc_score(Y_test_XGB, model_XGB.predict(test_XGB)))
print("ROC-AUC for XGBoost is {}".format(round(roc_auc_XGBoost,3)))

ROC-AUC for XGBoost is 0.807


### Part 3: NN based models

NN based model sources:


1.   [Source number 1](https://www.kaggle.com/kevinautin/fully-convolutional-accuracy-94-4-15-min)
2.   [Source number 2](https://medium.com/analytics-vidhya/https-medium-com-understanding-attention-mechanism-natural-language-processing-9744ab6aed6a)



In [0]:
# import train data
train = bz2.BZ2File("../content/drive/My Drive/amazonreviews/train.ft.txt.bz2")
train = train.readlines()
train = [x.decode('utf-8') for x in train]

In [0]:
# import test data
test = bz2.BZ2File("../content/drive/My Drive/amazonreviews/test.ft.txt.bz2")
test = test.readlines()
test = [x.decode('utf-8') for x in test]

In [26]:
# get the data with labels in another format 
reviews_train_NN, y_train_NN = splitReviewsLabels(train, list=True, review_length = 512)
reviews_test_NN, y_test_NN = splitReviewsLabels(test, list=True, review_length = 512)

100%|██████████| 3600000/3600000 [00:46<00:00, 78256.42it/s]
100%|██████████| 400000/400000 [00:05<00:00, 73311.92it/s]


In [0]:
y_train_NN = np.array(y_train_NN)
y_test_NN = np.array(y_test_NN)

In [0]:
X_train_NN, Y_train_NN= decreaseTrain(reviews_train_NN, y_train_NN, target_size = 600000)

In [0]:
del train, test, reviews_train_NN, y_train_NN

In [0]:
max_features = 10000 #length of vocab
maxlen = 128 #max number of words in a review
embed_size = 64 

In [0]:

from keras.preprocessing.text import Tokenizer as Tokenizer1

In [0]:
tokenizer = Tokenizer1(num_words=max_features)
tokenizer.fit_on_texts(X_train_NN)
token_train = tokenizer.texts_to_sequences(X_train_NN)
token_test = tokenizer.texts_to_sequences(reviews_test_NN)

In [0]:
x_train = pad_sequences(token_train, maxlen=maxlen, padding='post')
x_test = pad_sequences(token_test, maxlen=maxlen, padding='post')
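`pad_sequences` here pads short reviews with zeros at the end (`padding='post'`). Since `truncating` is left at its Keras default of `'pre'`, reviews longer than `maxlen` tokens lose their first tokens, not their last. The pure-Python function below mimics that behavior for illustration; `pad_post_truncate_pre` is a hypothetical name, not a Keras API.

```python
def pad_post_truncate_pre(sequences, maxlen, value=0):
    """Mimic pad_sequences(..., padding='post', truncating='pre') on lists of ints.

    Assumes maxlen is a positive integer.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                  # keep the last maxlen tokens
        seq = seq + [value] * (maxlen - len(seq))  # pad at the end
        padded.append(seq)
    return padded


batch = [[5, 8], [1, 2, 3, 4, 5, 6]]
result = pad_post_truncate_pre(batch, maxlen=4)
print(result)  # [[5, 8, 0, 0], [3, 4, 5, 6]]
```

If the start of a review matters more than its end, passing `truncating='post'` to `pad_sequences` keeps the first `maxlen` tokens instead.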

Convolutional NN

In [43]:
# constructing a model: convolutional network with batch normalization and dropout
inputs = Input(shape=(maxlen,))
net = Embedding(max_features, embed_size)(inputs)
net = Dropout(0.2)(net)
net = BatchNormalization()(net)

net = Conv1D(128, 7, padding='same', activation='relu')(net)
net = BatchNormalization()(net)
net = Conv1D(64, 3, padding='same', activation='relu')(net)
net = BatchNormalization()(net)
net = Conv1D(64, 3, padding='same', activation='relu')(net)
net = BatchNormalization()(net)
net = Conv1D(32, 3, padding='same', activation='relu')(net)
# note: this result is stored in net1 and never used, so the final
# BatchNormalization is not part of the trained model
net1 = BatchNormalization()(net)

# 1x1 convolution to two channels, averaged over the sequence, then softmax
net = Conv1D(2, 1)(net)
net = GlobalAveragePooling1D()(net)
output = Activation('softmax')(net)
model_conv = Model(inputs=inputs, outputs=output)
model_conv.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model_conv.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 128)               0         
_________________________________________________________________
embedding_4 (Embedding)      (None, 128, 64)           640000    
_________________________________________________________________
dropout_1 (Dropout)          (None, 128, 64)           0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 128, 64)           256       
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 128, 128)          57472     
_________________________________________________________________
batch_normalization_2 (Batch (None, 128, 128)          512       
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 128, 64)           2464
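The head of this network has no dense layer: the final `Conv1D(2, 1)` produces two logits per token position, `GlobalAveragePooling1D` averages them over the sequence, and softmax turns the two averages into class probabilities. A pure-Python sketch of that pooling step on made-up logits (the function names are illustrative):

```python
import math


def global_average_pool(per_position_logits):
    """Average a (timesteps x channels) list of lists over the time axis."""
    steps = len(per_position_logits)
    channels = len(per_position_logits[0])
    return [sum(row[c] for row in per_position_logits) / steps for c in range(channels)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for a 4-token review: columns are (negative, positive).
logits = [[0.0, 2.0], [1.0, 1.0], [-1.0, 3.0], [0.0, 2.0]]
pooled = global_average_pool(logits)
probs = softmax(pooled)
print(pooled, probs)
```

Because the embedding layer in the notebook does not mask zeros, padded positions contribute to this average as well.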

In [44]:
model_conv.fit(x_train, Y_train_NN, batch_size=1024, epochs=10, validation_split=0.1)



Train on 540000 samples, validate on 60000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x7fbfab596780>

In [45]:
roc_auc_conv=(roc_auc_score(y_test_NN, model_conv.predict(x_test)))
print("ROC-AUC for conv_NN is {}".format(round(roc_auc_conv,3)))

ROC-AUC for conv_NN is 0.977


Self attention model

In [39]:
# model structure inspired by source 2
# the model has RNN and self attention layers
model_att=Sequential()
model_att.add(Embedding(max_features, embed_size, input_length=maxlen))
model_att.add(Bidirectional(LSTM(units = 16, return_sequences = True, dropout = 0.5, recurrent_dropout = 0.7)))
model_att.add(SeqSelfAttention(attention_activation = 'sigmoid'))
model_att.add(Flatten())
model_att.add(Dense(2, activation = 'softmax'))
model_att.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model_att.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 128, 64)           640000    
_________________________________________________________________
bidirectional_3 (Bidirection (None, 128, 32)           10368     
_________________________________________________________________
seq_self_attention_3 (SeqSel (None, 128, 32)           2113      
_________________________________________________________________
flatten_2 (Flatten)          (None, 4096)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 8194      
Total params: 660,675
Trainable params: 660,675
Non-trainable params: 0
_________________________________________________________________


In [40]:
model_att.fit(x_train, Y_train_NN, batch_size=1024, epochs=10, validation_split=0.1)



Train on 540000 samples, validate on 60000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x7fbfaf129a20>

In [42]:
roc_auc_att=(roc_auc_score(y_test_NN, model_att.predict(x_test)))
print("ROC-AUC for attention_NN is {}".format(round(roc_auc_att,3)))

ROC-AUC for attention_NN is 0.979


The results of the experiment are as follows:


*   ROC-AUC for FastText is 0.917
*   ROC-AUC for XGBoost is 0.807
*   ROC-AUC for conv_NN is 0.977
*   ROC-AUC for attention_NN is 0.979

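The best ROC-AUC scores come from the convolutional NN and the attention RNN. Both NNs look quite promising and would probably give better results if they were trained on all the available data rather than on a 600,000-review subsample. Note that the fastText and XGBoost scores were computed from hard 0/1 predictions, while the NN scores use predicted probabilities, so the comparison is not strictly like-for-like.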