[View in Colaboratory](https://colab.research.google.com/github/AllarVi/nlp/blob/master/independent_project.ipynb)

## Upload dataset

In [1]:
! rm -f tag-que.txt
! wget https://raw.githubusercontent.com/AllarVi/nlp/master/tag-que.txt


--2018-05-16 08:55:20--  https://raw.githubusercontent.com/AllarVi/nlp/master/tag-que.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 99233319 (95M) [text/plain]
Saving to: ‘tag-que.txt’


2018-05-16 08:55:22 (135 MB/s) - ‘tag-que.txt’ saved [99233319/99233319]



In [2]:
for l in open('tag-que.txt'):
  print(l)
  break

flex actionscript-3 air	SQLStatement.execute() - multiple queries in one statement



## Dependencies

In [3]:
! pip install torch

Collecting torch
  Downloading https://files.pythonhosted.org/packages/69/43/380514bd9663f1bf708abeb359b8b48d3fabb1c8e95bb3427a980a064c57/torch-0.4.0-cp36-cp36m-manylinux1_x86_64.whl (484.0MB)
    100% |████████████████████████████████| 484.0MB 26kB/s
tcmalloc: large alloc 1073750016 bytes == 0x5b2c8000 @  0x7ff555a651c4 0x46d6a4 0x5fcbcc 0x4c494d 0x54f3c4 0x553aaf 0x54e4c8 0x54f4f6 0x553aaf 0x54efc1 0x54f24d 0x553aaf 0x54efc1 0x54f24d 0x553aaf 0x54efc1 0x54f24d 0x551ee0 0x54e4c8 0x54f4f6 0x553aaf 0x54efc1 0x54f24d 0x551ee0 0x54efc1 0x54f24d 0x551ee0 0x54e4c8 0x54f4f6 0x553aaf 0x54e4c8
Installing collected packages: torch
Successfully installed torch-0.4.0


In [4]:
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(device)

cuda


## Data preparation

In [0]:
SOS_token = 0
EOS_token = 1


class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

In [0]:
# Turn a Unicode string to plain ASCII, thanks to
# http://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s
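To see what this normalization does to the dataset, here is a small standalone check. The two functions are copied from the cell above so the block runs on its own. The first two inputs are the tag and question fields of the sample line printed earlier; `c++` and `Café` are hypothetical inputs chosen only for illustration.

```python
import re
import unicodedata


def unicodeToAscii(s):
    # Decompose accented characters and drop the combining marks
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )


def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)        # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)    # collapse everything else to one space
    return s


tags = normalizeString('flex actionscript-3 air')
question = normalizeString('SQLStatement.execute() - multiple queries in one statement')
cpp = normalizeString('c++')      # hypothetical raw tag
cafe = normalizeString('Café')    # hypothetical accented input

print(repr(tags))
print(repr(question))
print(repr(cpp))
print(repr(cafe))
```

Digits and symbols are discarded, so tags that differ only in such characters (for example a hypothetical `c++` and `c#`) collapse to the same token. Because `strip()` runs before the substitutions, a trailing space can also remain.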

In [0]:
def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split into lines
    lines = open('%s-%s.txt' % (lang1, lang2), encoding='utf-8').\
        read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs

In [0]:
MAX_LENGTH = 30


def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH


def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]


In [10]:
def prepareData(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs


input_lang, output_lang, pairs = prepareData('tag', 'que', True)

Reading lines...
Read 1264216 sentence pairs
Trimmed to 1263976 sentence pairs
Counting words...
Counted words:
que 137206
tag 23777


In [12]:
print(random.choice(pairs))

['salt command takes about seconds to finish', 'salt stack']


## Seq2Seq Model

### The Encoder

The encoder of a seq2seq network is an RNN that reads the input sentence one word at a time. For each input word it produces an output vector and a hidden state, and it passes the hidden state on as input to the next step.
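The step-by-step behaviour described above can be sketched with a toy embedding and GRU. This is only an illustration under assumed sizes: `vocab_size`, `hidden_size` and the word indexes in `sentence` are made up, and the layers are untrained. It mirrors the per-word call pattern that `EncoderRNN.forward` uses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, hidden_size = 10, 4             # toy sizes, chosen only for illustration
embedding = nn.Embedding(vocab_size, hidden_size)
gru = nn.GRU(hidden_size, hidden_size)

sentence = torch.tensor([3, 7, 1])          # three hypothetical word indexes
hidden = torch.zeros(1, 1, hidden_size)     # initial hidden state, all zeros

outputs = []
for word in sentence:
    # Shape (seq_len=1, batch=1, hidden_size), as the GRU expects
    embedded = embedding(word).view(1, 1, -1)
    # The returned hidden state is fed back in for the next word
    output, hidden = gru(embedded, hidden)
    outputs.append(output[0, 0])

encoder_outputs = torch.stack(outputs)      # one output vector per input word
print(encoder_outputs.shape)
print(hidden.shape)
```

For a single-layer, one-directional GRU, the last output vector equals the final hidden state. This is why the final hidden state can act as a summary of the whole sentence for the decoder.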

In [0]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

### The Decoder

In [0]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        output = self.embedding(input).view(1, 1, -1)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

### Attention Decoder

In [0]:
class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))

        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)

        output = F.relu(output)
        output, hidden = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
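The core of the attention step in `forward` is a softmax over one score per encoder position, followed by a weighted sum of the encoder outputs (which is what the `torch.bmm` call computes). The plain-Python sketch below shows that arithmetic on made-up numbers. The `scores` and `encoder_outputs` values are hypothetical, and the helper functions are illustrative rather than part of the model.

```python
import math


def softmax(scores):
    # Subtract the maximum for numerical stability
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    # Sum of the vectors, each scaled by its attention weight
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


scores = [2.0, 0.0, 0.0]                                  # one score per input position
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]    # one vector per input position

attn_weights = softmax(scores)
context = weighted_sum(attn_weights, encoder_outputs)

print(attn_weights)
print(context)
```

The position with the highest score receives the largest weight, so its encoder output dominates the context vector. That context vector is then combined with the embedded decoder input.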

## Training

### Preparing Training Data

In [0]:
def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]


def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)


def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)

### Training the Model

In [0]:
teacher_forcing_ratio = 0.5


def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH):
    encoder_hidden = encoder.initHidden()

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(
            input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]

    decoder_input = torch.tensor([[SOS_token]], device=device)

    decoder_hidden = encoder_hidden

    use_teacher_forcing = random.random() < teacher_forcing_ratio

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # Teacher forcing

    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # detach from history as input

            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length

### Helpers

In [0]:
import time
import math


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))
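The remaining-time estimate in `timeSince` is a linear extrapolation: elapsed time divided by the fraction completed gives the estimated total, and subtracting the elapsed time gives what is left. The sketch below reproduces that arithmetic with the elapsed time passed in explicitly so the result is deterministic. `as_minutes` and `eta` are hypothetical names used only here.

```python
import math


def as_minutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def eta(elapsed, fraction_done):
    # Same formula as timeSince, without calling time.time()
    estimated_total = elapsed / fraction_done
    remaining = estimated_total - elapsed
    return '%s (- %s)' % (as_minutes(elapsed), as_minutes(remaining))


# 90 seconds elapsed with a quarter of the iterations done:
# estimated total 360 s, so 270 s remain.
print(eta(90, 0.25))  # 1m 30s (- 4m 30s)
```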


In [0]:
def trainIters(encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    training_pairs = [tensorsFromPair(random.choice(pairs))
                      for i in range(n_iters)]
    criterion = nn.NLLLoss()

    for iter in range(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]

        loss = train(input_tensor, target_tensor, encoder,
                     decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('since: %s (iter: %d complete: %d%%) loss_avg: %.4f' % (timeSince(start, iter / n_iters),
                                         iter, iter / n_iters * 100, print_loss_avg))

        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

    showPlot(plot_losses)

### Plotting Results

In [0]:
import matplotlib.pyplot as plt
plt.switch_backend('agg')
import matplotlib.ticker as ticker
import numpy as np


def showPlot(points):
    # plt.subplots() already creates a figure, so no separate plt.figure() call is needed
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    plt.plot(points)

## Evaluation

In [0]:
def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        input_tensor = tensorFromSentence(input_lang, sentence)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei],
                                                     encoder_hidden)
            encoder_outputs[ei] += encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        decoder_hidden = encoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)

        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            decoder_attentions[di] = decoder_attention.data
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])

            decoder_input = topi.squeeze().detach()

        return decoded_words, decoder_attentions[:di + 1]

In [0]:
def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('Question - ', pair[0])
        print('Actual Tags - ', pair[1])
        output_words, attentions = evaluate(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        print('My Tags - ', output_sentence)
        print('')

## Training and Evaluation

The lower the **loss**, the better the model fits the data it is measured on (although a very low training loss can also mean the model has overfitted). Loss is normally computed on both the **training** and **validation** sets, and it indicates how well the model is doing on each. Unlike accuracy, loss is not a percentage: it aggregates the errors made on each example. This notebook reports only the training loss. The printed `loss_avg` is the negative log-likelihood per target token, averaged over the last `print_every` iterations.
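As a rough guide to reading these numbers: with `nn.NLLLoss`, the per-token loss is minus the log of the probability the model gave to the correct token. The plain-Python sketch below uses made-up probabilities to show how confident and unsure predictions translate into loss values. `average_nll` is a hypothetical helper, not part of the training code.

```python
import math


def average_nll(correct_token_probs):
    """Mean negative log-likelihood over the target tokens of one example."""
    return -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)


# Hypothetical probabilities assigned to the correct tag at each decoding step
confident = average_nll([0.9, 0.8, 0.95])
unsure = average_nll([0.1, 0.05, 0.2])
print('confident: %.4f  unsure: %.4f' % (confident, unsure))

# Reading a loss value backwards: the geometric-mean probability it implies
implied_prob = math.exp(-3.3)
print('loss 3.3 -> probability of the correct token: %.3f' % implied_prob)
```

A per-token loss of about 3.3 therefore means the model gives the correct tag a probability of roughly 0.037 on average (in the geometric-mean sense).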

In [23]:
hidden_size = 256
encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)
attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(device)

trainIters(encoder1, attn_decoder1, 75000, print_every=1000)

since: 1m 25s (- 105m 9s) (iter: 1000 complete: 1%) loss_avg: 5.1082
since: 2m 39s (- 97m 9s) (iter: 2000 complete: 2%) loss_avg: 4.6264
since: 3m 55s (- 94m 8s) (iter: 3000 complete: 4%) loss_avg: 4.5478
since: 5m 11s (- 92m 9s) (iter: 4000 complete: 5%) loss_avg: 4.5468
since: 6m 26s (- 90m 17s) (iter: 5000 complete: 6%) loss_avg: 4.3658
since: 7m 41s (- 88m 23s) (iter: 6000 complete: 8%) loss_avg: 4.2476
since: 8m 57s (- 86m 57s) (iter: 7000 complete: 9%) loss_avg: 4.3120
since: 10m 13s (- 85m 37s) (iter: 8000 complete: 10%) loss_avg: 4.1509
since: 11m 30s (- 84m 20s) (iter: 9000 complete: 12%) loss_avg: 4.2189
since: 12m 45s (- 82m 58s) (iter: 10000 complete: 13%) loss_avg: 4.2408
since: 14m 1s (- 81m 38s) (iter: 11000 complete: 14%) loss_avg: 4.1484
since: 15m 17s (- 80m 15s) (iter: 12000 complete: 16%) loss_avg: 4.0990
since: 16m 32s (- 78m 54s) (iter: 13000 complete: 17%) loss_avg: 4.0996
since: 17m 47s (- 77m 31s) (iter: 14000 complete: 18%) loss_avg: 4.0173
since: 19m 3s (- 76

since: 67m 21s (- 27m 57s) (iter: 53000 complete: 70%) loss_avg: 3.4593
since: 68m 37s (- 26m 41s) (iter: 54000 complete: 72%) loss_avg: 3.3911
since: 69m 55s (- 25m 25s) (iter: 55000 complete: 73%) loss_avg: 3.5353
since: 71m 11s (- 24m 9s) (iter: 56000 complete: 74%) loss_avg: 3.4958
since: 72m 27s (- 22m 52s) (iter: 57000 complete: 76%) loss_avg: 3.4179
since: 73m 44s (- 21m 36s) (iter: 58000 complete: 77%) loss_avg: 3.3622
since: 75m 0s (- 20m 20s) (iter: 59000 complete: 78%) loss_avg: 3.4983
since: 76m 16s (- 19m 4s) (iter: 60000 complete: 80%) loss_avg: 3.4534
since: 77m 32s (- 17m 47s) (iter: 61000 complete: 81%) loss_avg: 3.3534
since: 78m 48s (- 16m 31s) (iter: 62000 complete: 82%) loss_avg: 3.3892
since: 80m 5s (- 15m 15s) (iter: 63000 complete: 84%) loss_avg: 3.3586
since: 81m 22s (- 13m 59s) (iter: 64000 complete: 85%) loss_avg: 3.2600
since: 82m 39s (- 12m 43s) (iter: 65000 complete: 86%) loss_avg: 3.4365
since: 83m 56s (- 11m 26s) (iter: 66000 complete: 88%) loss_avg: 3.3

In [64]:
evaluateRandomly(encoder1, attn_decoder1)

Question -  efficient database structure for deep tree data
Actual Tags -  sql database design tree hashmap bloom filter
My Tags -  database database database <EOS>

Question -  postgresql . permission to deny functions body
Actual Tags -  sql postgresql
My Tags -  postgresql <EOS>

Question -  array and x matrices ascending sorting count repeated numbers
Actual Tags -  visual c 
My Tags -  javascript arrays sorting sorting <EOS>

Question -  java regular expression characters and matches the pattern p alpha . 
Actual Tags -  java regex pattern matching match
My Tags -  java regex <EOS>

Question -  retrying httpclient unsuccessful requests
Actual Tags -  c dotnet httpclient httpcontent
My Tags -  java <EOS>

Question -  how to run appium script in multiple android device emulators ?
Actual Tags -  android python python appium
My Tags -  android <EOS>

Question -  why does tomcat redirect away from angularjs petclinic ?
Actual Tags -  java angularjs spring spring mvc tomcat
My Tags -  

In [0]:
torch.save(encoder1.state_dict(), 'encoder.pt')

In [0]:
from google.colab import files


In [47]:
! ls -l

total 562936
-rw-r--r-- 1 root root  50958029 May 16 11:04 attn_decoder1.pt
drwxr-xr-x 1 root root      4096 Apr 30 16:29 datalab
-rw-r--r-- 1 root root 142078866 May 16 10:53 encoder1
-rw-r--r-- 1 root root 142078866 May 16 10:58 encoder1.pt
-rw-r--r-- 1 root root 142078866 May 16 10:58 encoder.pt
-rw-r--r-- 1 root root  99233319 May 16 08:55 tag-que.txt


In [0]:
files.download('attn_decoder1.pt')

In [0]:
torch.save(attn_decoder1.state_dict(), 'attn_decoder1.pt')