<h3>Loading the data</h3>

In [3]:
import pandas as pd

In [4]:
data_train = pd.read_csv('V1.4_Training.csv')
data_test = pd.read_csv('SubtaskA_Trial_Test.csv')

Taking a look at the dataset

In [6]:
print(data_train.head())
print(data_test.head())

      ID                                            COMMENT  LABEL
0  663_3  "Please enable removing language code from the...      1
1  663_4  "Note: in your .csproj file, there is a Suppor...      0
2  664_1  "Wich means the new version not fully replaced...      0
3  664_2  "Some of my users will still receive the old x...      0
4  664_3  "The store randomly gives the old xap or the n...      0
      id                                            comment label
0  13101  "I'm not asking Microsoft to Gives permission ...     X
1  13121            "somewhere between Android and iPhone."     X
2  13131  "And in the Windows Store you can flag the App...     X
3  13132  "Many thanks Sameh Hi, As we know, there is a ...     X
4  13133  "The idea is that we can develop a regular app...     X


<h3>Train data</h3>

In [8]:
# Column 1 holds the comment text, column 2 holds the label
train_x = data_train.iloc[:, 1].tolist()
train_y = data_train.iloc[:, 2].tolist()

<h3>Test data</h3>

In [9]:
# Column 1 holds the comment text
test_x = data_test.iloc[:, 1].tolist()

# The test labels are stored separately; the first character of each line is taken as the label
with open('labels.txt', 'r') as f:
    test_y = [int(line[0]) for line in f]
    

In [12]:
print("Size of train data: ", len(train_x))
print("Size of test data:  ", len(test_x))

Size of train data:  8500
Size of test data:   592


<h3>Preprocessing the text</h3>

In [14]:
import re
import string

In [16]:
def preprocess(data):
    # Translation table that deletes every punctuation character
    translation = str.maketrans('', '', string.punctuation)
    for index, line in enumerate(data):
        line = line.lower()                  # lowercase everything
        line = re.sub(r'\d+', '', line)      # remove digits
        line = line.translate(translation)   # strip punctuation
        data[index] = line                   # the list is modified in place
        
preprocess(train_x)
preprocess(test_x)

In [17]:
print(train_x[100])
print(test_x[100])

the same happened with facebook integration
these descriptions also appear differently depending on where they are being viewed web zune on pc or device


<h3>Creating the nlp model</h3>

In [18]:
import keras
import tensorflow as tf
import numpy as np

Using TensorFlow backend.


In [20]:
# We need to import several things from Keras.
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding
from keras.optimizers import Adam
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

<b>A neural network cannot work directly on text strings, so the dataset first goes through a tokenizer, which converts each word to an integer before the data is given to the network as input.</b>
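As a rough illustration of what a tokenizer does, the sketch below builds a frequency-ranked word index in plain Python. It is not the Keras implementation: `build_word_index` and `texts_to_sequences` are hypothetical helpers, the sample sentences are made up, and ties are broken alphabetically only to keep the result deterministic.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending count, then alphabetically so ties are deterministic
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping words whose index is num_words or higher."""
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

# Made-up sample sentences, already lowercased and free of punctuation
sample = ["the store shows the old version", "the new version is missing"]
word_index = build_word_index(sample)
sequences = texts_to_sequences(sample, word_index, num_words=4)
print(word_index)
print(sequences)
```

With `num_words=4`, only the words ranked 1 to 3 survive. This is likely also why a rare word can disappear when a token list is mapped back to text.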

In [22]:
vocabulary = 5000
tokenizer = Tokenizer(num_words = vocabulary)

<h3>Fitting the tokenizer</h3>

In [27]:
text_data = train_x + test_x
tokenizer.fit_on_texts(text_data)
train_x_tokens = tokenizer.texts_to_sequences(train_x)

In [29]:
train_x[100]

'the same happened with facebook integration'

In [28]:
np.array(train_x_tokens[100])

array([   1,   91, 1338,   16,  361,  447])

<h3>Converting the texts in the test set to tokens</h3>

In [32]:
test_x_tokens = tokenizer.texts_to_sequences(test_x)

In [33]:
test_x[1]

'somewhere between android and iphone'

In [34]:
np.array(test_x_tokens[1])

array([1224,  226,  147,    4,  640])

<p>A recurrent neural network can take sequences of arbitrary length as input, but all sequences in a batch must have the same length. Either every sequence in the entire dataset is given the same length, or a custom data generator ensures equal length within each batch.<br>
The first option is simpler, but if the length of the longest sequence in the dataset is used, a lot of memory is wasted, which is a problem for large datasets.<br>So a sequence length is chosen that covers most sequences in the dataset; longer sequences are truncated and shorter ones are padded.</p>


In [35]:
num_tokens = [len(tokens) for tokens in train_x_tokens + test_x_tokens]
num_tokens = np.array(num_tokens)

In [36]:
np.mean(num_tokens)

16.828090629124507

In [37]:
np.max(num_tokens)

193

<h3>The maximum number of tokens is set to the mean plus 2.5 standard deviations</h3>

In [39]:
max_tokens = np.mean(num_tokens)+ 2.5 * np.std(num_tokens)
max_tokens = int(max_tokens)
print(max_tokens)

45
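To see how much of the data a cut-off like this leaves untouched, one can compute the fraction of sequences that are no longer than the limit. The sketch below uses only the standard library and a made-up list of lengths; `lengths`, `limit` and `covered` are illustrative names, not the notebook's data.

```python
import statistics

# Hypothetical token counts per text; the notebook's real values live in num_tokens
lengths = [5, 8, 12, 15, 16, 18, 20, 25, 40, 120]

mean = statistics.fmean(lengths)
std = statistics.pstdev(lengths)       # population standard deviation, like np.std by default
limit = int(mean + 2.5 * std)          # same rule as in the notebook

# Fraction of sequences that fit without truncation
covered = sum(1 for n in lengths if n <= limit) / len(lengths)
print(limit, covered)
```

On the notebook's own arrays, the equivalent check would be `np.sum(num_tokens <= max_tokens) / len(num_tokens)`.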


Next, it is important to decide whether padding and truncating happen at the start ('pre') or at the end ('post') of a sequence. Truncating throws away part of a sequence, and padding adds zeros at the front or at the end. Here 'pre' is used, so the zeros come first and the model sees the actual text last; with 'post', a long run of zeros would follow the text and the network might forget what it has read.<br>Truncating may lose important information or features, so the chosen sequence length is a compromise.
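The difference between 'pre' and 'post' is easiest to see on a tiny example. The following is a minimal plain-Python sketch, not the Keras `pad_sequences` code; `pad_or_truncate` is a hypothetical helper that uses one mode for both padding and truncating, as the notebook does.

```python
def pad_or_truncate(tokens, maxlen, mode="pre"):
    """Return a list of exactly maxlen tokens (maxlen > 0), padding with 0 or cutting as needed."""
    if len(tokens) >= maxlen:
        # 'pre' drops tokens from the front, 'post' drops them from the end
        return tokens[-maxlen:] if mode == "pre" else tokens[:maxlen]
    zeros = [0] * (maxlen - len(tokens))
    # 'pre' puts the zeros in front, 'post' puts them at the end
    return zeros + tokens if mode == "pre" else tokens + zeros

short = [7, 8, 9]
longer = [1, 2, 3, 4, 5, 6]
print(pad_or_truncate(short, 5, "pre"))    # zeros in front
print(pad_or_truncate(short, 5, "post"))   # zeros at the end
print(pad_or_truncate(longer, 4, "pre"))   # first tokens dropped
print(pad_or_truncate(longer, 4, "post"))  # last tokens dropped
```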


In [122]:
pad = 'pre'
train_x_pad = pad_sequences(train_x_tokens, maxlen = max_tokens, padding= pad, truncating = pad)
train_x_pad

array([[   0,    0,   46, ...,   43,    6,  472],
       [   0,    0,    0, ..., 1534,    4, 2733],
       [   0,    0,    0, ...,   97, 1028,  557],
       ...,
       [   0,    0,    0, ..., 1757,    2,   39],
       [   0,    0,    0, ...,    3,  169,  633],
       [   0,    0,    0, ...,    8,  207,  148]])

In [59]:
test_x_pad = pad_sequences(test_x_tokens, maxlen=max_tokens , padding = pad, truncating = pad)

<h3>Tokenizer inverse map</h3>

In [60]:
idx = tokenizer.word_index
inverse_map = dict(zip(idx.values(), idx.keys()))

<h3>Defining a function for converting a list of tokens back to a string of words</h3>

In [61]:
def tokens_to_string(tokens):
    words = [inverse_map[token] for token in tokens if token != 0]
    
    #Concatenate all the words
    text = " ".join(words)
    return text

<h3>Matching with the training data</h3>

In [62]:
train_x[1]

'note in your csproj file there is a supportedcultures entry like this supportedculturesdederururu supportedcultures when i removed the ru language code and published my new xap version the old xap version still remains in the store with replaced and unpublished'

In [63]:
tokens_to_string(train_x_tokens[1])

'note in your csproj file there is a supportedcultures entry like this supportedcultures when i removed the ru language code and published my new xap version the old xap version still remains in the store with replaced and unpublished'

<h3>Creation of LSTM model</h3>

In [93]:
model = Sequential()

The first layer is the embedding layer which converts each integer-token into a vector of values<br><br>Each integer token will be converted to a vector of length (5)

In [94]:
embedding_size = 5

The embedding layer also needs the number of words in the vocabulary and the length of the padded token sequences

In [95]:
# input_dim is the vocabulary size given to the tokenizer (5000)
model.add(Embedding(input_dim = vocabulary, output_dim = embedding_size,
                    input_length = max_tokens, name = 'layer_embedding'))
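Conceptually, an embedding layer is a lookup table with one row per token id. The sketch below imitates that with nested lists; the random initial values and the name `table` are assumptions made for illustration, whereas in Keras the values are learned during training.

```python
import random

vocabulary = 5000      # rows in the table, one per token id
embedding_size = 5     # length of each vector

random.seed(0)
# Hypothetical randomly initialised table standing in for the learned weights
table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_size)]
         for _ in range(vocabulary)]

tokens = [0, 0, 1, 91, 1338]             # a short padded sequence of token ids
vectors = [table[t] for t in tokens]     # embedding = row lookup per token

print(len(vectors), len(vectors[0]))     # sequence length and embedding size
print(vocabulary * embedding_size)       # number of trainable values in the table
```

The product of the two sizes is the parameter count that the model summary reports for the embedding layer.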

This adds the first LSTM with 16 output units. It is followed by another LSTM, so it must return sequences.

In [96]:
model.add(LSTM(16, return_sequences = True))

This adds the second LSTM with 8 output units. This will be followed by another LSTM so it must also return sequences.

In [97]:
model.add(LSTM(8, return_sequences = True))

This adds the third and final LSTM with 4 output units. This will be followed by a dense-layer, so it should only give the final output of the LSTM and not a whole sequence of outputs.

In [98]:
model.add(LSTM(4))

Add a fully-connected dense layer which computes a value between 0.0 and 1.0 that will be used as the classification output.

In [99]:
model.add(Dense(1,activation='sigmoid'))
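The sigmoid activation is what keeps the output between 0.0 and 1.0. A minimal sketch of the function and of turning its output into a class label follows; the 0.5 threshold matches the one used later in the notebook, and the input values are made up.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Made-up pre-activation values: strongly negative, neutral, strongly positive
for z in (-4.0, 0.0, 4.0):
    p = sigmoid(z)
    label = 1 if p > 0.5 else 0    # threshold the probability to get a class
    print(z, round(p, 3), label)
```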

<h2>Compilation part</h2>

<h3>The model is compiled with the Adam optimizer and an explicit learning rate. Since this is a binary classification problem, a cross-entropy loss function is used.</h3>
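For reference, binary cross-entropy averages -[y·log(p) + (1-y)·log(1-p)] over the samples. The sketch below is a plain-Python version with made-up labels and probabilities; the clipping constant `eps` is an assumption added to avoid log(0), not a value taken from the notebook.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)     # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]                  # made-up true labels
confident = [0.9, 0.1, 0.8, 0.2]       # predictions close to the labels
unsure = [0.6, 0.4, 0.5, 0.5]          # predictions close to 0.5
print(binary_crossentropy(labels, confident))
print(binary_crossentropy(labels, unsure))
```

Confident correct predictions give a lower loss than hesitant ones, which is the signal the optimizer follows.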

In [100]:
optimizer = Adam(lr=0.001)
model.compile(loss ='binary_crossentropy', optimizer =  optimizer, metrics = ['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
layer_embedding (Embedding)  (None, 45, 5)             25000     
_________________________________________________________________
lstm_8 (LSTM)                (None, 45, 16)            1408      
_________________________________________________________________
lstm_9 (LSTM)                (None, 45, 8)             800       
_________________________________________________________________
lstm_10 (LSTM)               (None, 4)                 208       
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 5         
Total params: 27,421
Trainable params: 27,421
Non-trainable params: 0
_________________________________________________________________
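The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with an input kernel, a recurrent kernel and a bias, which gives 4 × (units × (input_dim + units) + units) parameters. The helper names below are hypothetical; the layer sizes are the ones used above.

```python
def lstm_params(input_dim, units):
    """Four gates, each with an input kernel, a recurrent kernel and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units

counts = {
    "layer_embedding": 5000 * 5,        # vocabulary x embedding size
    "lstm_1": lstm_params(5, 16),       # fed by the 5-dimensional embeddings
    "lstm_2": lstm_params(16, 8),
    "lstm_3": lstm_params(8, 4),
    "dense": dense_params(4, 1),
}
print(counts)
print(sum(counts.values()))
```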


Fitting the model to the data

In [101]:
model.fit(train_x_pad, train_y, validation_split = 0.05, epochs = 5, batch_size = 64)

Train on 8075 samples, validate on 425 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1bf760bb1d0>

<h3>Calculating its classification accuracy on the test-set</h3>


In [102]:
result = model.evaluate(test_x_pad, test_y)



In [103]:
print("Accuracy : {0:.2%}".format(result[1]))

Accuracy : 78.38%
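The accuracy metric is simply the share of test samples whose thresholded prediction equals the true label. A minimal sketch with made-up labels and probabilities is shown below; `accuracy` is a hypothetical helper and the 0.5 threshold is an assumption matching the one used later in the notebook.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(1 for y, y_hat in zip(y_true, predictions) if y == y_hat)
    return correct / len(y_true)

labels = [1, 0, 0, 1, 0]                    # made-up true labels
probabilities = [0.8, 0.3, 0.6, 0.4, 0.1]   # made-up model outputs
print("Accuracy : {0:.2%}".format(accuracy(labels, probabilities)))
```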


<h3>Evaluating data</h3>

In [116]:
data = pd.read_csv('SubtaskA_EvaluationData.csv')

In [117]:
data_text = data.iloc[:,1]
text = []
for t in data_text:
    text.append(t)

In [112]:
preprocess(text)

In [118]:
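# The tokenizer is deliberately not refitted here: calling fit_on_texts again
# would re-rank the vocabulary and change the word-to-integer mapping the model was trained on.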

In [119]:
tokens = tokenizer.texts_to_sequences(text) 

In [123]:
# Use a new name so the 'pre' setting stored in pad is not overwritten
eval_x_pad = pad_sequences(tokens, maxlen = max_tokens, padding = pad, truncating = pad)

In [124]:
final = model.predict(eval_x_pad)

In [125]:
for index, val in enumerate(final):
    if val>0.5:
        final[index] = 1
    else:
        final[index] = 0

submission = data.iloc[:, [0,1]]
output = pd.DataFrame(final)
result = pd.concat([submission,output], axis=1, sort=False)

In [126]:
result.to_csv(r'suman_goel.csv')