From here you are on your own! You have cleaned (or not) your data and you have your training data intact. Your labels can be gotten easily by the following line.
From here on:

I. Provide code to build your models for approach 1, 2 and 3. Reminder:

- Approach 1: Simple BOW representation + a classifier
- Approach 2: Train a word2vec model representation (no pre-trained embeddings yet) + a classifier
- Approach 3: Show us what you can do! Custom-made approach

II. Make sure you evaluate properly (reporting precision, recall, accuracy, F-measure) your model (you might need to split the training set to train/validation sets).

III. Make sure you evaluate properly (reporting precision, recall, accuracy, F-measure) your model on the testset.
**Pay attention that the same pre-processing you applied on the training set, should also be applied on the testset (see below)**. Other than that your model should never see the testset.

IV. Provide enough commentary on the decisions you made (pre-processing and then the specific decisions in approaches 1, 2 and 3)

V. Provide conclusions on how a successful solution can be built. Summarize the challenges you faced and how you can further expand your work

# **DISCLAIMER**

Cleaning: I tried several stemmers and cleaning mechanisms and settled on the one with the highest performance: a Porter stemmer combined with punctuation and stopword removal. I made this rather radical simplification because of hardware limits; I cannot train the networks long enough for the extra detail in the language to be useful.

Approach 1, BOW: There was not much freedom of choice here. I simply converted all words into one-hot encodings, added them up for each question, combined the vectors of the two questions and fed the result into the network. I don't know what decision process I can comment on here since everything was predetermined.
Regarding the training and architecture of the network, I simply tested several configurations until I achieved the highest performance. Within the network I made sure (via TensorBoard, which I did not include here since it needs more than Python to run) that no overfitting happened.

Approach 2, word2vec: Once again, I did not feel like I had any freedom of choice. I simply trained a gensim model on the training dataset and then fed those embeddings to the neural network (averaging them for each question first, of course).
For the network, I did exactly the same as for BOW; I only changed the hyperparameters slightly.

Approach 3, custom:
Fuzzy features: FuzzyWuzzy is a library for checking how similar two strings are. I therefore applied it to the questions, hoping that it would add useful information. Questions that are nearly the same except for some stopwords will get an extremely high QRatio, for example.

Vector features: Here, distances between the word vectors are computed explicitly so that the classifier does not have to learn them itself. If the cosine distance between the vectors is fed in, for example, training on vector similarity becomes much easier because the complexity is greatly reduced.

XGBoost: XGBoost is a tree boosting library which often outperforms neural networks built with libraries such as TensorFlow, as well as scikit-learn models, on problems where it is applicable. Since it is applicable here, I decided to give it a go and try to find a decent tuning. I did not take any additional measures against overfitting, since XGBoost has built-in early stopping, which I relied on.


Conclusion:

If I were free to build my own solution using any measures needed (provided I had as much time as I wanted), I would probably use precomputed embeddings (such as the ones offered by Google), then add the vector features (based on those embeddings) as well as the fuzzy features.
Then, I would use an LSTM to compare the sentences word by word and feed all those values into a final classifier network.
This would exceed the resources and time I have though (plus no pre-trained embeddings were allowed), which is why I settled for a simpler solution.

# **GENERAL (must be run for multiple/all models)**

Importing files (I used Google Colab; otherwise this wouldn't be necessary)

In [1]:
!pip install fuzzywuzzy
!pip install python-Levenshtein
# from google.colab import files
# uploaded = files.upload()



Saving test.csv to test (1).csv
Saving train.csv to train (1).csv


Importing all required libraries, downloading the NLTK resources, and reading the .csv files into pandas DataFrames

In [2]:
import string
import warnings

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
import gensim
from gensim.models import Word2Vec
import tensorflow as tf
import keras
from fuzzywuzzy import fuzz

warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
nltk.download('stopwords')
nltk.download('punkt')
# a set gives fast membership tests when filtering stopwords
stop_words = set(stopwords.words('english'))


df_train = pd.read_csv('train.csv')
df_train.dropna(axis=0, inplace=True)
#df_train.groupby("is_duplicate")['id'].count().plot.bar()
df_test = pd.read_csv('test.csv')
df_test.drop(['id'], axis=1, inplace=True)

Using TensorFlow backend.


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


The class weights for duplicate vs. non-duplicate questions are defined here so that the effects of an unbalanced dataset could be offset; both are simply set to 1, since class weighting is not used in the end. Then the 'id' column, which is not needed for training, is dropped.

In [0]:
dublicates = 1
non_dublicates = 1
df_train.drop(['id'], axis=1, inplace=True)
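The cell above leaves both weights at 1. A minimal sketch of how inverse-frequency weights could be derived from the labels instead, assuming a 0/1 label array like the `is_duplicate` column; the helper name `inverse_frequency_weights` and the toy labels are hypothetical:

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Return {class: weight} so that each class carries the same total weight."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    # weight = n_samples / (n_classes * class_count), a common "balanced" heuristic
    weights = len(labels) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# toy labels: three non-duplicates (0) and one duplicate (1)
example_weights = inverse_frequency_weights([0, 0, 0, 1])
print(example_weights)
```

A dictionary of this shape is what the `class_weight` argument of `model.fit` expects, so it could replace the hard-coded values if weighting were switched back on.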

This function simply stems the dataset using a porter stemmer and removes punctuation and stopwords; the only pre-processing I'm using. This is mostly done in order to reduce the complexity of the prediction problem to speed up the training of the networks.

In [0]:
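from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')  # keeps runs of word characters, which drops punctuation
porter=PorterStemmer()
def stem_sentence(sentence):
    token = tokenizer.tokenize(sentence)
    # stop_words is all lowercase, so capitalised stopwords such as "I" pass this filter
    token = [word for word in token if word not in stop_words]
    stemmed = []
    for word in token:
        stemmed.append(porter.stem(word))
        stemmed.append(" ")
    # every stem is followed by a space, so the result ends with a trailing space
    # and a later .split(" ") yields an empty string as its last element
    return "".join(stemmed)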

Here, the labels are extracted from a given dataframe.

In [0]:
#This should give you the labels (y) for the training set
def create_y(dataframe):
  labels = dataframe['is_duplicate'].values
  y = labels
  return y

This builds the dictionary that is used for the one-hot encoding. It is not wrapped in a function because it only has to be done once, whichever network is trained, so I left it as a script.

In [6]:
# NOTE: the name `dict` shadows Python's built-in dict type for the rest of the notebook
dict = {}
k = 0
for index, row in df_train.iterrows():
    sentence = stem_sentence(row["question1"] + " " + row["question2"]).split(" ")
    for j in sentence:
        if j not in dict:
            k+=1          # indices start at 1, so index 0 stays unused
            dict[j] = k
print("amount of unique stemmed words:", len(dict))

amount of unique stemmed words: 11826


Here, a word2vec representation for all words left in the df_train dataset after cleaning is built. This is then used by the word2vec and the custom model to create their input arrays. Testing showed that 5-dimensional vectors work best (with a reasonable training time), even though higher-dimensional vectors would probably yield better results if hardware wasn't limited (Google's pre-trained word2vec vectors, for comparison, have 300 dimensions).

In [0]:
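#word2vec
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

def create_embeddings(dataframe):
  # relies on the global `dimensions`, which is set in the cell that calls this function
  token_sent=[]
  for index, sent in dataframe.iterrows():
      # the space keeps the last word of question1 and the first word of question2 apart
      sent = sent["question1"] + " " + sent["question2"]
      sent = stem_sentence(sent)  # also strips punctuation and stopwords
      words=nltk.word_tokenize(sent)
      words=[w for w in words if w not in stop_words]
      token_sent.append(words)
  print(len(token_sent))
  print(token_sent)
  # `size` is the gensim 3.x name of the embedding dimension; passing the sentences
  # to the constructor already trains the model once
  w2v_model=Word2Vec(token_sent,size=dimensions ,min_count=1,window=3,sg=1,hs=0,seed=42,workers=4)
  # 10 further passes over the same sentences
  w2v_model.train(token_sent,total_examples=len(token_sent),epochs=10)

  vocab=list(w2v_model.wv.vocab)
  print(len(vocab))
  print(vocab)
  return w2v_model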

evaluate() evaluates the model by letting it make predictions for the test dataset (that it has never seen before) and then creating a confusion matrix and returning the percentage of correct predictions.

In [0]:
# EVALUATION OF THE TESTSET
def evaluate(model, X, y):
  predictions = model.predict(X)  # one row of class probabilities per sample
  print(predictions)
  print(y)

  # creating a confusion matrix, indexed as [predicted class][real class]
  confusion_matrix = np.zeros((2, 2), dtype=int)
  for k in range(len(predictions)):
     prediction = np.argmax(predictions[k])
     real_result = y[k]
     confusion_matrix[prediction][real_result] += 1
  print(confusion_matrix)
  # share of correct predictions (the diagonal) in percent
  percent =  (confusion_matrix[0][0]+confusion_matrix[1][1])/len(X)*100
  return percent
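
Since evaluate() only returns accuracy, the other requested metrics can be derived from the same confusion matrix. A minimal sketch, assuming the matrix is indexed as `[prediction][real_result]` as above and that class 1 (duplicate) is the positive class; the function name `metrics_from_confusion` and the toy counts are hypothetical:

```python
import numpy as np

def metrics_from_confusion(cm):
    """Accuracy, precision, recall and F1 from a 2x2 matrix indexed [predicted][real]."""
    cm = np.asarray(cm, dtype=float)
    tn, fn = cm[0][0], cm[0][1]   # predicted 0: real 0 / real 1
    fp, tp = cm[1][0], cm[1][1]   # predicted 1: real 0 / real 1
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / cm.sum()
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# made-up counts, only to show the arithmetic
toy_cm = [[50, 10],
          [20, 20]]
toy_metrics = metrics_from_confusion(toy_cm)
print(toy_metrics)
```

With these toy counts, half of the predicted duplicates are real duplicates (precision 0.5) and two thirds of the real duplicates are found (recall of about 0.67).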

# **Approach 1: BOW_model**


This function creates X (the array with inputs for the neural network) by giving each word a one-hot encoding and summing all one-hot encodings together for each question, e.g. for "This is is an example":

- This = 1 0 0 0
- is = 0 1 0 0
- an = 0 0 1 0
- example = 0 0 0 1

So the final array here would be: 1 2 1 1

This is done for both questions. Note that the code combines the two arrays with `+`, which for NumPy arrays is an element-wise sum rather than a concatenation, so the network receives a single count vector covering both questions (11827 entries in the run below) instead of two halves that it has to learn to 'compare'.
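
The counting scheme can be sketched in a few lines of NumPy. This is only an illustration on a toy vocabulary (the `toy_vocab` mapping and the `bow_counts` helper are hypothetical and not part of the notebook); it also shows how concatenating two question vectors differs from summing them:

```python
import numpy as np

# hypothetical four-word vocabulary, matching the example sentence
toy_vocab = {"This": 0, "is": 1, "an": 2, "example": 3}

def bow_counts(tokens, vocab):
    """Count how often each vocabulary word occurs in the token list."""
    counts = np.zeros(len(vocab), dtype=int)
    for token in tokens:
        if token in vocab:          # unknown words are skipped
            counts[vocab[token]] += 1
    return counts

q1 = bow_counts("This is is an example".split(), toy_vocab)
q2 = bow_counts("This is an example".split(), toy_vocab)

concatenated = np.concatenate([q1, q2])  # keeps the two questions apart (length 8)
summed = q1 + q2                         # merges them into one vector (length 4)
print(q1, concatenated, summed)
```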

In [0]:
# PRE-PROCESSING for BOW
def create_X_bow(dataframe):
  X = []
  for index, row in dataframe.iterrows():
      Q1 = stem_sentence(row["question1"]).split(" ")
      Q2 = stem_sentence(row["question2"]).split(" ")
      Q1_bow = np.zeros(len(dict)+1)
      Q2_bow = np.zeros(len(dict)+1)
      for j in Q1: # adding a one-hot vector is the same as incrementing the word's count
        if j in dict:
          Q1_bow[dict[j]] += 1
      for j in Q2:
        if j in dict:
          Q2_bow[dict[j]] += 1
      # NOTE: `+` on NumPy arrays is an element-wise sum, so both questions end up in one
      # vector of length len(dict)+1; np.concatenate would keep them side by side
      bow_representation = Q1_bow + Q2_bow
      X.append(bow_representation)
  X = np.array(X, ndmin=2)
  return X

Now, the classifier will be built. I am using a TensorFlow network of fully connected dense layers. First, class weights are defined so that the data imbalance could be kept from affecting the training (they are not passed to the training in the end). Then, the data is normalized.
Finally, the model is compiled and trained using the X and y that were computed beforehand.

In [0]:
# BOW CLASSIFIER
def train_bow(X, y):
  
  X = np.array(X, ndmin=2) # not entirely sure why this conversion is needed again, but otherwise it throws an error
  print(X.shape)

  # class weights are not used in the end, since they lead to a performance decrease as long as
  # the train and test datasets have an equal imbalance (both values are 1 anyway)
  class_weight = {0:dublicates,
                  1:non_dublicates
                  }
  # normalising the input vectors (each row is scaled to unit L2 norm)
  X = tf.keras.utils.normalize(
          X,
          axis=-1,
          order=2
          )
  
  # creating the keras model (amount of neurons empirically tested to be useful)
  # NOTE: the first layer has only 2 units and no activation, so all later layers
  # only see a 2-dimensional linear projection of the input
  model = keras.Sequential([
      keras.layers.Dense(input_shape=(len(X[0]),), units=2),
      keras.layers.Dense(int(len(X[0])/4), activation=tf.nn.relu),
      keras.layers.Dense(int(len(X[0])/8), activation=tf.nn.relu),
      keras.layers.Dense(2, activation=tf.nn.softmax)
  ])

  # sets the optimization function (Adam with its default learning rate) and the loss function
  model.compile(optimizer= tf.train.AdamOptimizer(),
                loss = 'sparse_categorical_crossentropy',
                metrics = ['accuracy'])

  # trains the network for 5 epochs with a batch size of 1000. The validation split shows whether overfitting is happening:
  # once the validation loss goes up (or the validation accuracy drops), training should be stopped (this is just done through manual testing here)
  model.fit(X,
            y,
            epochs = 5,
            batch_size=1000,
            shuffle=True,
            validation_split=0.05,
           # class_weight = class_weight
            )
  return model

Here, all the functions necessary to train and test the BOW model are called

In [11]:
########### MAIN BOW ############ Performance: https://pastebin.com/6ZGVZQdZ

y = create_y(df_train)
print("y created")
X = create_X_bow(df_train)
print("X created")
model = train_bow(X, y)
print("model trained")
y = create_y(df_test)
print("y test created")
X = create_X_bow(df_test)
print("X test created")

print("evaluation finished, accuracy:", evaluate(model, X, y))

y created
X created
(11000, 11827)
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Use tf.cast instead.
Train on 10450 samples, validate on 550 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
model trained
y test created
X test created
[[1.0000000e+00 3.7275239e-08]
 [1.0000000e+00 1.7966046e-13]
 [9.4802618e-01 5.1973812e-02]
 ...
 [1.0000000e+00 1.5906089e-08]
 [9.9999976e-01 2.2531052e-07]
 [9.9998605e-01 1.3910221e-05]]
[0 0 0 0 0 1 0 1 0 0 0 1 1 1 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0
 0 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 1 1 0 0 0 1 1 1
 1 0 0 0 0 1 0 0 0 0 1 1 1 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0
 0 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0
 0 0 0 0 1 0 0 0 1 0 1 1 1 0 0 1 0 1 0 0 1 0 0 0 0 1 0 1 1 0 1 1 1 0 1 0 0
 1 0 0 1 1 1 1 0 1 1 0 0 1 1 1 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 1 1
 0 0 1 0 1 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 1 1 1 0 1 0 0 1 1 1 

**Performance**

The performance in terms of accuracy is decent; around 70%. It takes a long time to train though, since making the encodings more sparse does not work (the accuracy plummets).

https://pastebin.com/6ZGVZQdZ

# **Approach 2: WORD2VEC MODEL**

Takes the average of all word vectors in a question and concatenates the results for the two questions. The resulting vector is then fed into the neural network. Since vectors of similar words look alike, the 'overall vector' for a question should represent the content of the question.

E.g. if two people are asking about countries and their capital city, it would be expected that the resulting vector of both questions is very similar. Then, the network can from there on determine whether the questions are duplicate.

In [0]:
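def create_X_w2v(dataframe, embeddings):
  X = []
  for index, row in dataframe.iterrows():
      Q1 = stem_sentence(row["question1"]).split(" ")
      Q2 = stem_sentence(row["question2"]).split(" ")
      Q1_w2v = np.zeros(dimensions)
      Q2_w2v = np.zeros(dimensions)
      # summing the vectors of all words that have an embedding
      # (`vocab` is the global list built from embeddings.wv.vocab)
      for j in Q1:
        if j in vocab:         
          Q1_w2v += np.array(embeddings.wv.get_vector(j))
      for j in Q2:
        if j in vocab:
          Q2_w2v += np.array(embeddings.wv.get_vector(j))
      # NOTE: this divides by the embedding size, not by the number of words, so the
      # result is a scaled sum rather than a true per-question mean
      Q1_w2v = np.divide(Q1_w2v, dimensions)
      Q2_w2v = np.divide(Q2_w2v, dimensions)
      # list concatenation: question 1 fills the first half, question 2 the second
      w2v_representation = Q1_w2v.tolist() + Q2_w2v.tolist()
      X.append(w2v_representation)
  X = np.array(X)
  return X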
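
For reference, a true per-question average divides the summed vectors by the number of words that were actually found in the vocabulary. A minimal sketch with a toy embedding table; `toy_embeddings` and `mean_vector` are hypothetical stand-ins for the gensim model and are not part of the notebook:

```python
import numpy as np

# hypothetical 2-dimensional embeddings for three stemmed words
toy_embeddings = {
    "lose": np.array([1.0, 0.0]),
    "weight": np.array([0.0, 1.0]),
    "reduc": np.array([1.0, 1.0]),
}

def mean_vector(tokens, embeddings, dimensions):
    """Average the vectors of all tokens that have an embedding."""
    found = [embeddings[t] for t in tokens if t in embeddings]
    if not found:
        # no known word: fall back to the zero vector
        return np.zeros(dimensions)
    return np.mean(found, axis=0)

q1_vec = mean_vector(["lose", "weight"], toy_embeddings, 2)
q2_vec = mean_vector(["reduc", "weight", "I"], toy_embeddings, 2)  # "I" has no embedding
pair_input = np.concatenate([q1_vec, q2_vec])
print(pair_input)
```

Averaging over the word count keeps long and short questions on the same scale, which a division by a fixed constant does not.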


As for the BOW, this is a simple TensorFlow network that takes the word2vec representation of both sentences as input and then classifies it into two output nodes.

In [0]:
# w2v CLASSIFIER
def train_w2v(X, y):
  
  X = np.array(X, ndmin=2) # once again, not sure why this is necessary
  
  # class weights still included even though not used in the end
  class_weight = {0:dublicates,
                  1:non_dublicates
                  }
  
  # normalisation of input
  X = tf.keras.utils.normalize(
          X,
          axis=-1,
          order=2
          )
  model = keras.Sequential([
      keras.layers.Dense(input_shape=(dimensions*2,), units=2),
      keras.layers.Dense(100, activation=tf.nn.relu),
      keras.layers.Dense(100, activation=tf.nn.relu),
      keras.layers.Dense(2, activation=tf.nn.softmax)
  ])
  # sets optimization function, learning rate and loss function for network
  model.compile(optimizer= tf.train.AdamOptimizer(),
                loss = 'sparse_categorical_crossentropy',
                metrics = ['accuracy'])

  # trains the network for 100 epochs with a batch size of 100
  model.fit(X,
            y,
            epochs = 100,
            batch_size=100,
            shuffle=True,
            validation_split=0.05,
            #class_weight = class_weight
            )
  return model


Here, all the functions to create a word2vec model are called. In the end the evaluation will be done using the test dataset

In [14]:
dimensions = 5

y = create_y(df_train)
print("y created")
embeddings = create_embeddings(df_train)
vocab=list(embeddings.wv.vocab)
X = create_X_w2v(df_train, embeddings)
print("X created")
model = train_w2v(X, y)
print("model trained")
y = create_y(df_test)
print("y test created")
#embeddings = create_embeddings(df_test)
#vocab=list(embeddings.wv.vocab)
X = create_X_w2v(df_test, embeddings)
print("X test created")

print("evaluation finished, accuracy:", evaluate(model, X, y))

y created
11000
[['I', 'see', 'comment', 'youtub', 'video', 'I', 'see', 'tool', 'allow', 'see', 'watch', 'video', 'youtub', 'playlist', 'I', 'creat'], ['lose', 'weight', 'I', 'reduc', 'weight'], ['block', 'someon', 'quora', 'I', 'block', 'someon', 'follow', 'quora'], ['I', 'improv', 'spoken', 'english', 'way', 'improv', 'english'], ['best', 'fruit', 'eat', 'weight', 'loss', 'fruit', 'best', 'weight', 'loss'], ['whi', 'gay', 'peopl', 'love', 'anal', 'whi', 'gay', 'peopl', 'love', 'flaunt', 'sexual'], ['peopl', 'biggest', 'frustrat', 'age', 'peopl', 'biggest', 'frustrat', 'linux'], ['mean', 'secur', 'mean', 'secur'], ['basic', 'I', 'agricultur', 'graduat', 'pursu', 'master', 'agricultur', 'biotechnolog', 'I', 'interest', 'basic', 'research', 'I', 'want', 'pursu', 'Ph', 'D', 'basic', 'plant', 'biolog', 'agricultur', 'background', 'affect', 'research', 'career', 'I', 'want', 'pursu', 'career', 'research', 'biolog', 'option'], ['I', 'get', 'back', 'delet', 'messag', 'facebook', 'I', 'get', 

Performance: For unknown reasons, this model does much worse than the BOW and performs at only 62% accuracy: https://pastebin.com/Brxg797Q

This is most likely due to bad embeddings, but I am not certain.

# **Approach 3: CUSTOM MODEL**

Creating X for the custom network.

This is a combination of:
- word2vec
- fuzzyfeatures 
- vector features

All will be fed into XGBoost in order to achieve maximal performance.

The idea is that the additional information provided by the fuzzy features and vector features, in combination with XGBoost (which performed slightly better than the TensorFlow networks here), will boost the accuracy further.

The BOW representation is not fed in, in order to avoid performance issues. It would probably help boost the performance even further, but even without it the prediction accuracy easily reaches around 73%. Therefore, I decided to keep the model computationally more efficient by leaving the BOW out.
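
To give a feel for what a string-similarity score of this kind measures, here is a small sketch using `difflib.SequenceMatcher` from the Python standard library as a stand-in. It is not the fuzzywuzzy implementation, and the helper `simple_ratio` is hypothetical; it only illustrates that near-identical questions score close to 100 while unrelated ones score low:

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Similarity of two strings on a 0-100 scale (stand-in for a fuzzy ratio)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

same = simple_ratio("lose weight", "lose weight")
close = simple_ratio("how do i lose weight", "how can i lose weight")
far = simple_ratio("lose weight", "block someon quora")
print(same, close, far)
```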

In [0]:
from scipy.spatial.distance import cosine, cityblock, jaccard, canberra, euclidean, minkowski, braycurtis
def create_X_custom(dataframe, embeddings):
  X = []  
  for index, row in dataframe.iterrows():

      # stemming the questions
      Q1 = stem_sentence(row["question1"]).split(" ")
      Q2 = stem_sentence(row["question2"]).split(" ")
      Q1_w2v = np.zeros(dimensions)
      Q2_w2v = np.zeros(dimensions)
      
      # appending all relevant fuzzy features
      fuzz_data = []
      fuzz_data.append(fuzz.token_sort_ratio(Q1,Q2))
      fuzz_data.append(fuzz.QRatio(Q1,Q2))
      fuzz_data.append(fuzz.WRatio(Q1,Q2))
      fuzz_data.append(fuzz.partial_ratio(Q1,Q2))
      fuzz_data.append(fuzz.partial_token_sort_ratio(Q1,Q2))
      fuzz_data.append(fuzz.partial_token_set_ratio(Q1,Q2))
      
      # summing the word2vec embeddings of each question
      for j in Q1:
        if j in vocab:
          Q1_w2v += np.array(embeddings.wv.get_vector(j))
      for j in Q2:
        if j in vocab:
          Q2_w2v += np.array(embeddings.wv.get_vector(j))
      # NOTE: dividing by `dimensions` scales the sum by the embedding size; it is not a mean over the words
      Q1_w2v = np.divide(Q1_w2v, dimensions)
      Q2_w2v = np.divide(Q2_w2v, dimensions)
      w2v_representation = Q1_w2v.tolist() + Q2_w2v.tolist()
      
      # vector features: every distance function is applied to each pair of corresponding
      # components. NOTE: zip() yields scalars, so these are component-wise values,
      # not distances between the two question vectors as a whole
      pairs = list(zip(Q1_w2v, Q2_w2v))
      vector_features = []
      for distance in (cosine, cityblock, jaccard, canberra, euclidean, minkowski):
          vector_features += [distance(a, b) for (a, b) in pairs]

      # concatenating the word2vec and fuzzy features with the vector features
      X.append(w2v_representation + fuzz_data + vector_features)
  X = np.array(X)
  return X
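
If one value per metric is wanted, the distance functions can be applied to the two question vectors as a whole. A minimal sketch in plain NumPy (the helper `vector_distances` is hypothetical, and the fallback of 1.0 for a zero vector is an arbitrary choice for this sketch):

```python
import numpy as np

def vector_distances(u, v):
    """Return [cosine, cityblock, euclidean] distances between two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    norm_product = np.linalg.norm(u) * np.linalg.norm(v)
    if norm_product > 0:
        cosine_distance = 1.0 - np.dot(u, v) / norm_product
    else:
        # the cosine is undefined when one vector is all zeros
        cosine_distance = 1.0
    cityblock_distance = np.abs(u - v).sum()
    euclidean_distance = np.linalg.norm(u - v)
    return [float(cosine_distance), float(cityblock_distance), float(euclidean_distance)]

# two orthogonal toy vectors
features = vector_distances([1.0, 0.0], [0.0, 1.0])
print(features)
```

This yields three features per question pair instead of one value per component and metric.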

Creating a simple XGBoost model with a maximum tree depth of 7. After 200 rounds without performance increase, the training will be stopped. To avoid overfitting, the model is continuously compared against the validation data.

In [0]:
import xgboost as xgb

def train_xgb(x_train, y_train, x_valid, y_valid):
  params = {}
  params['objective'] = 'binary:logistic'
  params['eval_metric'] = ['logloss', 'error']
  params['eta'] = 0.01       # learning rate
  params['max_depth'] = 7    # maximum depth of each tree
  d_train = xgb.DMatrix(x_train, label=y_train)
  d_valid = xgb.DMatrix(x_valid, label=y_valid)
  watchlist = [(d_train, 'train'), (d_valid, 'valid')]
  # at most 1000 boosting rounds; training stops once the error on the
  # validation set has not improved for 200 rounds
  bst = xgb.train(params, d_train, 1000, watchlist, 
                  early_stopping_rounds=200, verbose_eval=100)
  # thresholding the predicted probabilities at 0.5 to get class labels
  xgb_preds = (bst.predict(d_valid) >= 0.5).astype(int)
  xgb_accuracy = np.sum(xgb_preds == y_valid) / len(y_valid)
  print("Xgb accuracy: %0.3f" % xgb_accuracy)
  return bst  # so the caller receives the trained booster instead of None
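
Because early stopping tunes the number of rounds on whatever is passed as the validation set, that set is best carved out of the training data rather than taken from the test set. A minimal sketch of such a split in NumPy; the helper `split_train_valid`, the fraction and the seed are assumptions for illustration only:

```python
import numpy as np

def split_train_valid(X, y, valid_fraction=0.1, seed=42):
    """Shuffle the rows and hold out a fraction of them for validation."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(X))          # shuffled row indices
    n_valid = int(len(X) * valid_fraction)
    valid_idx, train_idx = order[:n_valid], order[n_valid:]
    return X[train_idx], y[train_idx], X[valid_idx], y[valid_idx]

# toy data: 20 rows with 2 features each and alternating labels
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.arange(20) % 2
x_tr, y_tr, x_va, y_va = split_train_valid(toy_X, toy_y, valid_fraction=0.25)
print(x_tr.shape, x_va.shape)
```

The four returned arrays could then be passed to train_xgb in place of the test data, leaving the test set for the final evaluation only.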


This just runs the custom model based on XGboost and FuzzyFeatures.

In [17]:
dimensions = 5
y_train = create_y(df_train)
print("y created")
embeddings = create_embeddings(df_train)
vocab=list(embeddings.wv.vocab)
X_train = create_X_custom(df_train, embeddings)
print("X created")
y_test = create_y(df_test)
print("y test created")
X_test = create_X_custom(df_test, embeddings) # reuses the embeddings trained on df_train
print("X test created")
# NOTE: the test set doubles as the validation set for early stopping here
model = train_xgb(X_train, y_train, X_test, y_test)
print("model trained")

#print("evaluation finished, accuracy:", evaluate(model, df_test, X, y))

y created
11000
[['I', 'see', 'comment', 'youtub', 'video', 'I', 'see', 'tool', 'allow', 'see', 'watch', 'video', 'youtub', 'playlist', 'I', 'creat'], ['lose', 'weight', 'I', 'reduc', 'weight'], ['block', 'someon', 'quora', 'I', 'block', 'someon', 'follow', 'quora'], ['I', 'improv', 'spoken', 'english', 'way', 'improv', 'english'], ['best', 'fruit', 'eat', 'weight', 'loss', 'fruit', 'best', 'weight', 'loss'], ['whi', 'gay', 'peopl', 'love', 'anal', 'whi', 'gay', 'peopl', 'love', 'flaunt', 'sexual'], ['peopl', 'biggest', 'frustrat', 'age', 'peopl', 'biggest', 'frustrat', 'linux'], ['mean', 'secur', 'mean', 'secur'], ['basic', 'I', 'agricultur', 'graduat', 'pursu', 'master', 'agricultur', 'biotechnolog', 'I', 'interest', 'basic', 'research', 'I', 'want', 'pursu', 'Ph', 'D', 'basic', 'plant', 'biolog', 'agricultur', 'background', 'affect', 'research', 'career', 'I', 'want', 'pursu', 'career', 'research', 'biolog', 'option'], ['I', 'get', 'back', 'delet', 'messag', 'facebook', 'I', 'get', 

  dist = 1.0 - uv / np.sqrt(uu * vv)


X created
y test created
X test created
[0]	train-logloss:0.689716	train-error:0.257364	valid-logloss:0.690303	valid-error:0.301
Multiple eval metrics have been passed: 'valid-error' will be used for early stopping.

Will train until valid-error hasn't improved in 200 rounds.
[100]	train-logloss:0.512836	train-error:0.221364	valid-logloss:0.557087	valid-error:0.28
[200]	train-logloss:0.441663	train-error:0.191455	valid-logloss:0.520749	valid-error:0.277
[300]	train-logloss:0.409322	train-error:0.178455	valid-logloss:0.511706	valid-error:0.277
[400]	train-logloss:0.39178	train-error:0.170727	valid-logloss:0.508726	valid-error:0.278
Stopping. Best iteration:
[258]	train-logloss:0.422235	train-error:0.186	valid-logloss:0.514226	valid-error:0.27

Xgb accuracy: 0.724
model trained


**Performance**

The performance of this model consistently exceeds the expected 70% as seen here: https://pastebin.com/Pz1sf1T7

This is quite impressive considering that it uses word2vec (which only gets to about 60% on its own) as its base. It can be improved to about 80% by also feeding in the BOW representation, but then the training takes extremely long.