### Task 3 Summary

In the notebook I used 3 different models: fasttext as a soft baseline, XGBoost as a hard baseline and CNN. 

As the target metric for comparing these models I chose the ROC-AUC score, since it is specifically relevant for classification tasks where neither class has higher priority.
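To make the choice of metric concrete, here is a minimal sketch of what ROC-AUC measures: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The helper `pairwise_auc` and the toy data are illustrative assumptions and not part of the notebook, which uses `sklearn.metrics.roc_auc_score` for the actual numbers.

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# toy data (hypothetical): two negatives followed by two positives
y_true = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.4, 0.9]

# the same predictions thresholded at 0.5 into hard 0/1 labels
hard_labels = [1 if s >= 0.5 else 0 for s in scores]

auc_from_scores = pairwise_auc(y_true, scores)
auc_from_labels = pairwise_auc(y_true, hard_labels)
print(auc_from_scores, auc_from_labels)
```

On this toy data the continuous scores give 0.75 while the thresholded labels give 0.5, so the metric is generally more informative when it is fed scores rather than hard 0/1 predictions.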



In [0]:
!pip install fasttext
import fasttext

Collecting fasttext
  Downloading https://files.pythonhosted.org/packages/f8/85/e2b368ab6d3528827b147fdb814f8189acc981a4bc2f99ab894650e05c40/fasttext-0.9.2.tar.gz (68kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp36-cp36m-linux_x86_64.whl size=3021212 sha256=90ea191cd9c326f3a4c6829b4ae0f96cd61e6bb02bcd22c7987edb9e08956352
  Stored in directory: /root/.cache/pip/wheels/98/ba/7f/b154944a1cf5a8cee91c15

In [0]:
#some imports
import numpy as np
import pandas as pd 
import bz2
from sklearn.metrics import roc_auc_score
import os
import re
import csv
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier
import nltk
from keras.layers import *
from keras.models import Model
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [0]:
#get the data and decode it
data = bz2.BZ2File("../content/drive/My Drive/amazonreviews/train.ft.txt.bz2")
data = data.readlines()
data = [x.decode('utf-8') for x in data]

### Part 1: Soft baseline, fasttext




The soft baseline fastText classifier is heavily based on [this](https://www.kaggle.com/ejlok1/fasttext-model-91-7) kernel example, with reference to the [official PyPI fasttext documentation](https://pypi.org/project/fasttext/#train_supervised-parameters).

In [0]:
help(fasttext.train_supervised)

Help on function train_supervised in module fasttext.FastText:

train_supervised(*kargs, **kwargs)
    Train a supervised model and return a model object.
    
    input must be a filepath. The input text does not need to be tokenized
    as per the tokenize function, but it must be preprocessed and encoded
    as UTF-8. You might want to consult standard preprocessing scripts such
    as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html
    
    The input file must must contain at least one label per line. For an
    example consult the example datasets which are part of the fastText
    repository such as the dataset pulled by classification-example.sh.



In [0]:

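#Building a model
# fastText trains from a file path, so write the decoded training lines to disk first
with open('train.txt', 'w', encoding='utf-8') as f:
    f.writelines(data)
model = fasttext.train_supervised('train.txt', label_prefix='__label__', epoch=10)
print(model.labels)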

['__label__1', '__label__2']


In [0]:
#Test data
test = bz2.BZ2File("../content/drive/My Drive/amazonreviews/test.ft.txt.bz2")
test = test.readlines()
test = [x.decode('utf-8') for x in test]

In [0]:
#Removing labels from test data
test_clear = [i.replace('__label__2 ', '') for i in test]
test_clear = [i.replace('__label__1 ', '') for i in test_clear]
test_clear = [i.replace('\n', '') for i in test_clear]

In [0]:
#Predicting the labels of the test set
pred = model.predict(test_clear)

In [0]:
#Changing '__label__1' to class 0 and '__label__2' to class 1 and predicting the labels
labels = [0 if x.split(' ')[0] == '__label__1' else 1 for x in test]
pred_labels = [0 if x == ['__label__1'] else 1 for x in pred[0]]

In [0]:
#Estimating the target quality metric - ROC AUC
roc_auc_FT = roc_auc_score(labels, pred_labels)
print("ROC-AUC for FastText is {}".format(round(roc_auc_FT,3)))

ROC AUC for FastText is 0.917
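The cell above scores hard 0/1 predictions. One possible refinement, sketched here under assumptions, is to convert the `(labels, probabilities)` tuple that `model.predict` returns for a list of texts into a score for the positive class and pass that to `roc_auc_score`. The helper `positive_scores` is hypothetical, `fake_pred` only stands in for real fastText output, and the `1 - p` step assumes a two-class model whose probabilities sum to one.

```python
def positive_scores(pred, positive_label='__label__2'):
    """Turn fastText-style top-1 predictions into positive-class scores.

    pred is a (labels, probabilities) pair, one entry per input text,
    where each entry holds the top label and its probability.
    """
    labels, probs = pred
    scores = []
    for top_labels, top_probs in zip(labels, probs):
        # clip to [0, 1]; reported probabilities can overshoot 1 slightly
        p = min(max(float(top_probs[0]), 0.0), 1.0)
        # if the top label is the negative class, the positive class gets the rest
        scores.append(p if top_labels[0] == positive_label else 1.0 - p)
    return scores


# stand-in for model.predict(test_clear) (hypothetical values)
fake_pred = (
    [['__label__2'], ['__label__1'], ['__label__2']],
    [[0.9], [0.8], [0.55]],
)
scores = positive_scores(fake_pred)
print(scores)
```

With real data this would be used as `roc_auc_score(labels, positive_scores(pred))`; the value can differ from the one reported above, which was computed on hard labels.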


### Part 2: Hard baseline: TFIDF + XGBoost

Sources of this part of the notebook: 

1.   tricks for data preparation [from here](https://www.kaggle.com/kevinautin/fully-convolutional-accuracy-94-4-15-min)
2.   tricks for tokenization [from here](https://medium.com/@chrisfotache/text-classification-in-python-pipelines-nlp-nltk-tf-idf-xgboost-and-more-b83451a327e0)





In [0]:
#import a smart progress meter
from tqdm import tqdm

In [0]:
# import train data
train = bz2.BZ2File("../content/drive/My Drive/amazonreviews/train.ft.txt.bz2")
train = train.readlines()
train = [x.decode('utf-8') for x in train]

In [0]:
# import test data
test = bz2.BZ2File("../content/drive/My Drive/amazonreviews/test.ft.txt.bz2")
test = test.readlines()
test = [x.decode('utf-8') for x in test]

In [0]:
# function for data preprocessing (taken from source 1)
def reviewText(review):
    # drop the leading label and the trailing newline, then lowercase
    review = review.split(' ', 1)[1][:-1].lower()
    # replace every digit with 0
    review = re.sub(r'\d', '0', review)
    # replace URL-like tokens with a placeholder
    if 'www.' in review or 'http:' in review or 'https:' in review or '.com' in review:
        review = re.sub(r"([^ ]+(?<=\.[a-z]{3}))", "<url>", review)
    return review
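A quick illustration of what this cleaning step does to a single line. The sample line is made up, and the function is repeated here only so the snippet runs on its own.

```python
import re


def reviewText(review):
    # drop the leading label and the trailing newline, then lowercase
    review = review.split(' ', 1)[1][:-1].lower()
    # replace every digit with 0
    review = re.sub(r'\d', '0', review)
    # replace URL-like tokens (ending in a dot plus three letters) with a placeholder
    if 'www.' in review or 'http:' in review or 'https:' in review or '.com' in review:
        review = re.sub(r"([^ ]+(?<=\.[a-z]{3}))", "<url>", review)
    return review


# hypothetical line in the same "__label__N text\n" format as the dataset
sample_line = "__label__2 Great product: Visit www.example.com for 20% off!\n"
cleaned = reviewText(sample_line)
print(cleaned)  # great product: visit <url> for 00% off!
```

Note that the URL pattern only catches tokens ending in a dot followed by exactly three lowercase letters, so addresses with a trailing path or a longer domain suffix pass through unchanged.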

In [0]:
# second function for processing (inspiration: source 1)
def splitReviewsLabels(lines, list=False, review_length=None):
    '''Split raw lines into cleaned review texts and labels.
    Parameters:
    list - desired label output format:
      if list = False, output - integer 0 or 1
      if list = True, output - list [1,0] (class 0) or [0,1] (class 1)
    review_length - max number of characters kept per review, by default all characters
    '''
    reviews = []
    labels = []
    for review in tqdm(lines):
        rev = reviewText(review)
        is_negative = review.split(' ')[0] == '__label__1'
        if list:
            label = [1, 0] if is_negative else [0, 1]
        else:
            label = 0 if is_negative else 1
        reviews.append(rev[:review_length])
        labels.append(label)
    return reviews, labels

In [0]:
# get the data for XGBoost model
reviews_train_XGB, y_train_XGB = splitReviewsLabels(train, review_length = 512)
reviews_test_XGB, y_test_XGB = splitReviewsLabels(test, review_length = 512)

100%|██████████| 3600000/3600000 [00:45<00:00, 78884.28it/s]
100%|██████████| 400000/400000 [00:05<00:00, 78599.04it/s]


In [0]:
# I decided to decrease the size of the training sample, since I have limited computing power.
# This small helper draws a random subsample with train_test_split;
# random sampling keeps the class balance approximately intact.
def decreaseTrain(X, y, target_size=1000000):
    '''
    Input:
    X - training data, list
    y - labels, list or array
    X and y must be the same size
    Parameter:
    target_size - target training sample size, in the open interval (0, len(y))
    '''
    _, X1, _, Y1 = train_test_split(X, y, test_size=target_size / len(y))
    return X1, Y1
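If the class balance has to be preserved exactly rather than approximately, a stratified subsample is one option. The sketch below is a pure-Python alternative, not what the notebook runs: `stratified_subsample` is a hypothetical helper, it assumes hashable labels (the integer 0/1 format, not the one-hot lists), and it rounds the per-class counts, so the result can be off from `target_size` by a few items.

```python
import random
from collections import defaultdict


def stratified_subsample(X, y, target_size, seed=0):
    """Draw roughly target_size items while keeping each class's share."""
    rng = random.Random(seed)
    # group the row indices by label
    by_label = defaultdict(list)
    for idx, label in enumerate(y):
        by_label[label].append(idx)
    fraction = target_size / len(y)
    chosen = []
    for indices in by_label.values():
        # take the same fraction from every class
        k = round(len(indices) * fraction)
        chosen.extend(rng.sample(indices, k))
    rng.shuffle(chosen)
    return [X[i] for i in chosen], [y[i] for i in chosen]


# toy data (hypothetical): ten reviews with alternating labels
X_demo = ['review {}'.format(i) for i in range(10)]
y_demo = [0, 1] * 5
X_small, y_small = stratified_subsample(X_demo, y_demo, target_size=4)
print(len(X_small), y_small.count(0), y_small.count(1))
```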

In [0]:
X_train_XGB, Y_train_XGB = decreaseTrain(reviews_train_XGB, y_train_XGB, target_size = 600000)

In [0]:
# tokenizer taken from source 2
# named stem_tokenizer so that it does not shadow the Keras Tokenizer imported above
def stem_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    porter_stemmer = nltk.PorterStemmer()
    words = [porter_stemmer.stem(word) for word in words]
    return words

In [0]:
# fit the vectorizer on the training reviews only and reuse it for the test reviews
vectorizer_XGB = TfidfVectorizer(tokenizer=stem_tokenizer, stop_words='english')
train_XGB = vectorizer_XGB.fit_transform(X_train_XGB)
test_XGB = vectorizer_XGB.transform(reviews_test_XGB)

In [0]:
# set hyperparameters: max tree depth and number of trees
model_XGB = XGBClassifier(max_depth=10, n_estimators = 200)

In [0]:
model_XGB.fit(train_XGB, Y_train_XGB)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [0]:
roc_auc_XGBoost=(roc_auc_score(y_test_XGB, model_XGB.predict(test_XGB)))
print("ROC-AUC for XGBoost is {}".format(round(roc_auc_XGBoost,3)))

ROC-AUC for XGBoost is 0.788


### Part 3: NN based models

NN based model sources:


1.   [Source number 1](https://www.kaggle.com/kevinautin/fully-convolutional-accuracy-94-4-15-min)



In [0]:
# get the data with labels in another format 
reviews_train_NN, y_train_NN = splitReviewsLabels(train, list=True, review_length = 512)
reviews_test_NN, y_test_NN = splitReviewsLabels(test, list=True, review_length = 512)

100%|██████████| 3600000/3600000 [00:54<00:00, 65518.48it/s]
100%|██████████| 400000/400000 [00:05<00:00, 78249.41it/s]


In [0]:
y_train_NN = np.array(y_train_NN)
y_test_NN = np.array(y_test_NN)

In [0]:
X_train_NN, Y_train_NN= decreaseTrain(reviews_train_NN, y_train_NN, target_size = 600000)

In [0]:
del train, test

In [0]:
max_features = 10000 #length of vocab
maxlen = 128 #max number of words in a review
embed_size = 64 

In [0]:
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(X_train_NN)
token_train = tokenizer.texts_to_sequences(X_train_NN)
token_test = tokenizer.texts_to_sequences(reviews_test_NN)

In [0]:
x_train = pad_sequences(token_train, maxlen=maxlen, padding='post')
x_test = pad_sequences(token_test, maxlen=maxlen, padding='post')
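To see what the padding step does to sequences of different lengths, here is a pure-Python sketch of the behaviour requested above (`padding='post'` together with the Keras default `truncating='pre'`). The helper `pad_post` is hypothetical and only mimics `pad_sequences` for a single list of token ids.

```python
def pad_post(seq, maxlen, value=0):
    """Mimic pad_sequences(..., padding='post') with the default truncating='pre'."""
    # sequences that are too long lose tokens from the start
    trimmed = seq[-maxlen:]
    # sequences that are too short get zeros appended at the end
    return trimmed + [value] * (maxlen - len(trimmed))


short_review = [5, 3, 8]
long_review = [1, 2, 3, 4, 5, 6, 7]

padded_short = pad_post(short_review, maxlen=5)
padded_long = pad_post(long_review, maxlen=5)
print(padded_short)  # [5, 3, 8, 0, 0]
print(padded_long)   # [3, 4, 5, 6, 7]
```

With these settings a review longer than `maxlen` keeps its last `maxlen` tokens, not its first; passing `truncating='post'` to `pad_sequences` would keep the beginning instead.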

Convolutional NN

In [0]:
# constructing a model: convolutional model with batch normalization and dropout
inputs = Input(shape=(maxlen,))
net = Embedding(max_features, embed_size)(inputs)
net = Dropout(0.2)(net)
net = BatchNormalization()(net)

net = Conv1D(128, 7, padding='same', activation='relu')(net)
net = BatchNormalization()(net)
net = Conv1D(64, 3, padding='same', activation='relu')(net)
net = BatchNormalization()(net)
net = Conv1D(64, 3, padding='same', activation='relu')(net)
net = BatchNormalization()(net)
net = Conv1D(32, 3, padding='same', activation='relu')(net)
net = BatchNormalization()(net)

# 1x1 convolution down to two channels, averaged over time and turned into class probabilities
net = Conv1D(2, 1)(net)
net = GlobalAveragePooling1D()(net)
outputs = Activation('softmax')(net)
model_conv = Model(inputs=inputs, outputs=outputs)
model_conv.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model_conv.summary()

Model: "model_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         (None, 128)               0         
_________________________________________________________________
embedding_4 (Embedding)      (None, 128, 64)           640000    
_________________________________________________________________
dropout_4 (Dropout)          (None, 128, 64)           0         
_________________________________________________________________
batch_normalization_16 (Batc (None, 128, 64)           256       
_________________________________________________________________
conv1d_16 (Conv1D)           (None, 128, 32)           14368     
_________________________________________________________________
batch_normalization_17 (Batc (None, 128, 32)           128       
_________________________________________________________________
conv1d_17 (Conv1D)           (None, 128, 32)           3104
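The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for embedding, 1D convolution and batch normalization layers; the kernel sizes 7 and 3 and the 32 filters are inferred from the printed shapes and counts, so treat them as assumptions about the run that produced this output. The helper names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # one vector of length embed_dim per vocabulary entry
    return vocab_size * embed_dim


def conv1d_params(kernel_size, in_channels, filters):
    # weights plus one bias per filter
    return kernel_size * in_channels * filters + filters


def batchnorm_params(channels):
    # gamma, beta, moving mean and moving variance per channel
    return 4 * channels


counts = {
    'embedding_4': embedding_params(10000, 64),
    'batch_normalization_16': batchnorm_params(64),
    'conv1d_16': conv1d_params(7, 64, 32),
    'batch_normalization_17': batchnorm_params(32),
    'conv1d_17': conv1d_params(3, 32, 32),
}
for name, value in counts.items():
    print(name, value)
```

These values match the `Param #` column of the summary: 640000, 256, 14368, 128 and 3104.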

In [0]:
model_conv.fit(x_train, Y_train_NN, batch_size=1024, epochs=10, validation_split=0.1)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 648000 samples, validate on 72000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x7f1316e4f240>

In [0]:
roc_auc_conv=(roc_auc_score(y_test_NN, model_conv.predict(x_test)))


In [0]:
print("ROC-AUC for conv_NN is {}".format(round(roc_auc_conv,3)))

ROC-AUC for conv_NN is 0.977


In [0]:
from keras.models import Sequential

# stacked LSTM model for one-hot (two-column) labels, hence softmax + categorical_crossentropy
model_LSTM = Sequential()
model_LSTM.add(Embedding(1000000, 100))
model_LSTM.add(LSTM(256, return_sequences=True))
model_LSTM.add(LSTM(512))
model_LSTM.add(Dense(500, activation='relu'))
model_LSTM.add(Dropout(0.2))
model_LSTM.add(Dense(100, activation='relu'))
model_LSTM.add(Dropout(0.2))
model_LSTM.add(Dense(2, activation='softmax'))
model_LSTM.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model_LSTM.summary()

In [0]:
from keras import layers, models

maximum_length = maxlen  # padded sequence length defined above

def build_model():
    sequences = layers.Input(shape=(maximum_length,))
    embedded = layers.Embedding(20000, 64)(sequences)
    x = layers.Conv1D(64, 3, activation='relu')(embedded)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPool1D(3)(x)
    x = layers.Conv1D(64, 5, activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPool1D(5)(x)
    x = layers.Conv1D(64, 5, activation='relu')(x)
    x = layers.GlobalMaxPool1D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(100, activation='relu')(x)
    # single sigmoid unit: expects 1-D 0/1 labels
    predictions = layers.Dense(1, activation='sigmoid')(x)
    model = models.Model(inputs=sequences, outputs=predictions)
    model.compile(
        optimizer='rmsprop',
        loss='binary_crossentropy',
        metrics=['binary_accuracy']
    )
    return model

CNN = build_model()
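Because the convolutions in `build_model` use the default `'valid'` padding, every convolution and pooling layer shortens the sequence. The sketch below traces the length through the network, assuming an input length of 128 (the `maxlen` used earlier in the notebook; the original cell does not define `maximum_length`, so this value is an assumption). The helper names are hypothetical.

```python
def conv_valid_length(length, kernel_size):
    # 'valid' convolution with stride 1 drops kernel_size - 1 positions
    return length - kernel_size + 1


def pool_length(length, pool_size):
    # max pooling with stride equal to pool_size, incomplete windows dropped
    return length // pool_size


length = 128  # assumed input length
trace = []
for kind, size in [('conv', 3), ('pool', 3), ('conv', 5), ('pool', 5), ('conv', 5)]:
    if kind == 'conv':
        length = conv_valid_length(length, size)
    else:
        length = pool_length(length, size)
    trace.append(length)
print(trace)  # [126, 42, 38, 7, 3]
```

Only three time steps reach the global max pooling layer, so a much shorter input (or larger pooling windows) would leave nothing for the last convolution to work on.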

In [0]:
CNN.fit(
    train_texts, 
    train_labels, 
    batch_size=128,
    epochs=2,
    validation_data=(val_texts, val_labels), )

In [0]:
preds = CNN.predict(test_texts)
roc_auc_CNN = roc_auc_score(test_labels, preds)
print('ROC AUC for CNN is', roc_auc_CNN)

In [0]:
def build_rnn_model():
    sequences = layers.Input(shape=(maximum_length,))
    embedded = layers.Embedding(20000, 64)(sequences)
    x = layers.CuDNNGRU(128, return_sequences=True)(embedded)
    x = layers.CuDNNGRU(128)(x)
    x = layers.Dense(32, activation='relu')(x)
    x = layers.Dense(100, activation='relu')(x)
    predictions = layers.Dense(1, activation='sigmoid')(x)
    model = models.Model(inputs=sequences, outputs=predictions)
    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['binary_accuracy']
    )
    return model

rnn_model = build_rnn_model()

In [0]:
rnn_model.fit(
    train_texts, 
    train_labels, 
    batch_size=128,
    epochs=1,
    validation_data=(val_texts, val_labels), )

In [0]:
RNN_preds = rnn_model.predict(test_texts)
roc_auc_RNN = roc_auc_score(test_labels, RNN_preds)
print('ROC AUC for RNN', roc_auc_RNN)

In [0]:
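from keras.models import Sequential

def build_LSTM_model():
    model = Sequential()
    model.add(Embedding(1000000, 100))
    model.add(LSTM(256, return_sequences=True))
    model.add(LSTM(512))
    model.add(Dense(500, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(100, activation='relu'))
    model.add(Dropout(0.2))
    # single sigmoid unit, matching binary_crossentropy and the labels used for the CNN and RNN
    model.add(Dense(1, activation='sigmoid'))
    # ROC-AUC is computed with sklearn after training; Keras tracks accuracy here
    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['binary_accuracy']
    )
    return model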

LSTM_model = build_LSTM_model()

In [0]:
LSTM_model.fit(
    train_texts, 
    train_labels, 
    batch_size=128,
    epochs=1,
    validation_data=(val_texts, val_labels), )

In [0]:
LSTM_preds = LSTM_model.predict(test_texts)
print('ROC AUC for LSTM', (roc_auc_score(test_labels, LSTM_preds)))

Heavily based on: https://www.kaggle.com/ejlok1/fasttext-model-91-7, https://www.kaggle.com/saishan/sentiment-analysis-logregre-vs-cudnnlstm