Based on https://github.com/hamelsmu/Seq2Seq_Tutorial/blob/master/notebooks/Tutorial.ipynb

In [1]:
import pandas as pd
import logging
import glob
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 500)
logger = logging.getLogger()
logger.setLevel(logging.WARNING)
from ktext.preprocess import processor
import dill as dpickle
import numpy as np

Using TensorFlow backend.


In [2]:
PATH='data/github-issues/'
MODEL_PATH=f'{PATH}model/'
GITHUB_ISSUES = f'{PATH}github_issues.csv'

In [3]:
!ls -lah {PATH} | grep github_issues.csv

-rw-rw-r--  1 kuptservol kuptservol 2.7G Sep 19 14:21 github_issues.csv


In [4]:
# read in a 1M-row sample of the data (for speed of tutorial)
traindf, testdf = train_test_split(pd.read_csv(GITHUB_ISSUES)
#                                    .sample(n=2000000)
                                   .sample(n=1000000)
                                   , test_size=.10)


#print out stats about shape of data
print(f'Train: {traindf.shape[0]:,} rows {traindf.shape[1]:,} columns')
print(f'Test: {testdf.shape[0]:,} rows {testdf.shape[1]:,} columns')

# preview data
traindf.head(3)

Train: 900,000 rows 3 columns
Test: 100,000 rows 3 columns


Unnamed: 0,issue_url,issue_title,body
3548140,"""https://github.com/BBasile/Coedit/issues/185""",diff dialog shown twice on external modification,the diff dialog causes a loss of focus which leads to a double check.
2577963,"""https://github.com/samsung-cnct/k2/issues/326""",fix 'clean up releases' to avoid needing to ignore the failure,"if we can ignore this failure, we should be able to test and see that the work doesn't actually need to be done and just not execute this. seeing failures that are ignored irritates me. task roles/kraken.services : clean up releases failed: localhost item={u'name': u'kubedns', u'namespace': u'kube-system', u'chart': u'kubedns', u'repo': u'atlas', u'version': u'0.1.0', u'values': {u'cluster_ip': u'10.32.0.2', u'dns_domain': u'cluster.local'}} => { changed : true, cmd : helm , delete , --purge..."
19327,"""https://github.com/oyyd/cheerio-without-node-native/issues/5""",a very nice module,thany you! i am looking for a module which will be used in react-native it suits me fine


In [5]:
train_body_raw = traindf.body.tolist()
train_title_raw = traindf.issue_title.tolist()
#preview output of first element
train_body_raw[0]

'the diff dialog causes a loss of focus which leads to a double check.'

In [6]:
%reload_ext autoreload
%autoreload 2
from ktext.preprocess import processor

In [7]:
%%time
# Clean, tokenize, and pad / truncate so that each document has length 70.
#  Also retain only the top 8,000 words in the vocabulary and map the remaining
#  words to index 1, which becomes the common index for rare words.
body_pp = processor(keep_n=8000, padding_maxlen=70)
train_body_vecs = body_pp.fit_transform(train_body_raw)



CPU times: user 1min 48s, sys: 6.13 s, total: 1min 54s
Wall time: 4min 19s


In [9]:
print('\noriginal string:\n', train_body_raw[0], '\n')
print('after pre-processing:\n', train_body_vecs[0], '\n')


original string:
 the diff dialog causes a loss of focus which leads to a double check. 

after pre-processing:
 [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    3 1406 1152  965    5 2041   11 1267   63 2074    4    5  727  150] 
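
The vector above is left-padded with zeros up to length 70, while the title vectors further down are padded on the right (`padding='post'`). The following is a minimal sketch of that padding step only, not ktext's implementation: `pad_or_truncate` is a hypothetical helper, and cutting over-long documents from the end is an assumption.

```python
import numpy as np

def pad_or_truncate(token_ids, maxlen, padding="pre"):
    """Hypothetical helper: force a sequence of token ids to exactly `maxlen` entries."""
    ids = list(token_ids)[:maxlen]        # assumption: over-long inputs are cut at the end
    pad = [0] * (maxlen - len(ids))       # 0 is the padding index
    return np.array(pad + ids if padding == "pre" else ids + pad)

# Body-style padding: zeros in front, tokens at the end.
body = pad_or_truncate([3, 1406, 1152, 965], maxlen=8)
# Title-style padding: tokens first, zeros at the end.
title = pad_or_truncate([2, 1608, 594, 3], maxlen=6, padding="post")
print(body)
print(title)
```

Padding in front keeps the real tokens next to the final encoder time step, which is the step whose hidden state is handed to the decoder.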



In [10]:
title_pp = processor(append_indicators=True, keep_n=4500, 
                     padding_maxlen=12, padding ='post')

# process the title data
train_title_vecs = title_pp.fit_transform(train_title_raw)



In [11]:
print('\noriginal string:\n', train_title_raw[0])
print('after pre-processing:\n', train_title_vecs[0])


original string:
 diff dialog shown twice on external modification
after pre-processing:
 [   2 1608  594  704  956   10  576 2603    3    0    0    0]


In [12]:
import dill as dpickle
import numpy as np

# Save the preprocessor
with open(f'{MODEL_PATH}body_pp.dpkl', 'wb') as f:
    dpickle.dump(body_pp, f)

with open(f'{MODEL_PATH}title_pp.dpkl', 'wb') as f:
    dpickle.dump(title_pp, f)

# Save the processed data
np.save(f'{MODEL_PATH}train_title_vecs.npy', train_title_vecs)
np.save(f'{MODEL_PATH}train_body_vecs.npy', train_body_vecs)

In [16]:
from seq2seq_utils import load_decoder_inputs, load_encoder_inputs, load_text_processor

In [17]:
encoder_input_data, doc_length = load_encoder_inputs(f'{MODEL_PATH}train_body_vecs.npy')
decoder_input_data, decoder_target_data = load_decoder_inputs(f'{MODEL_PATH}train_title_vecs.npy')

Shape of encoder input: (900000, 70)
Shape of decoder input: (900000, 11)
Shape of decoder target: (900000, 11)
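
The titles were padded to length 12, yet both decoder arrays have 11 columns. That is what one would expect from the usual teacher-forcing setup, where the input drops the last token and the target drops the first. The sketch below assumes `load_decoder_inputs` does this; its source is not shown here, so treat the slicing as an illustration rather than the helper's actual code.

```python
import numpy as np

# The pre-processed title from cell 11 (2 = start indicator, 3 = end indicator, 0 = padding).
title_vecs = np.array([[2, 1608, 594, 704, 956, 10, 576, 2603, 3, 0, 0, 0]])

# Assumed teacher-forcing split: at step t the decoder sees token t
# and is trained to predict token t + 1.
decoder_input = title_vecs[:, :-1]    # shape (1, 11), starts with the start indicator
decoder_target = title_vecs[:, 1:]    # shape (1, 11), same sequence shifted left by one

print(decoder_input.shape, decoder_target.shape)
```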


In [18]:
num_encoder_tokens, body_pp = load_text_processor(f'{MODEL_PATH}body_pp.dpkl')
num_decoder_tokens, title_pp = load_text_processor(f'{MODEL_PATH}title_pp.dpkl')

Size of vocabulary for data/github-issues/model/body_pp.dpkl: 8,002
Size of vocabulary for data/github-issues/model/title_pp.dpkl: 4,502


Model:

In [19]:
%matplotlib inline
from keras.models import Model
from keras.layers import Input, LSTM, GRU, Dense, Embedding, Bidirectional, BatchNormalization
from keras import optimizers

In [20]:
# arbitrarily set latent dimension for embedding and hidden units
latent_dim = 300

##### Define Model Architecture ######

########################
#### Encoder Model ####
encoder_inputs = Input(shape=(doc_length,), name='Encoder-Input')

# Word embedding for encoder (ex: Issue Body)
x = Embedding(num_encoder_tokens, latent_dim, name='Body-Word-Embedding', mask_zero=False)(encoder_inputs)
x = BatchNormalization(name='Encoder-Batchnorm-1')(x)

# Intermediate GRU layer (optional)
#x = GRU(latent_dim, name='Encoder-Intermediate-GRU', return_sequences=True)(x)
#x = BatchNormalization(name='Encoder-Batchnorm-2')(x)

# We do not need the `encoder_output` just the hidden state.
_, state_h = GRU(latent_dim, return_state=True, name='Encoder-Last-GRU')(x)

# Encapsulate the encoder as a separate entity so we can just 
#  encode without decoding if we want to.
encoder_model = Model(inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')

seq2seq_encoder_out = encoder_model(encoder_inputs)

########################
#### Decoder Model ####
decoder_inputs = Input(shape=(None,), name='Decoder-Input')  # for teacher forcing

# Word Embedding For Decoder (ex: Issue Titles)
dec_emb = Embedding(num_decoder_tokens, latent_dim, name='Decoder-Word-Embedding', mask_zero=False)(decoder_inputs)
dec_bn = BatchNormalization(name='Decoder-Batchnorm-1')(dec_emb)

# Set up the decoder, using the encoder's final hidden state (`seq2seq_encoder_out`) as initial state.
decoder_gru = GRU(latent_dim, return_state=True, return_sequences=True, name='Decoder-GRU')
decoder_gru_output, _ = decoder_gru(dec_bn, initial_state=seq2seq_encoder_out)
x = BatchNormalization(name='Decoder-Batchnorm-2')(decoder_gru_output)

# Dense layer for prediction
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='Final-Output-Dense')
decoder_outputs = decoder_dense(x)

########################
#### Seq2Seq Model ####

#seq2seq_decoder_out = decoder_model([decoder_inputs, seq2seq_encoder_out])
seq2seq_Model = Model([encoder_inputs, decoder_inputs], decoder_outputs)


seq2seq_Model.compile(optimizer=optimizers.Nadam(lr=0.001), loss='sparse_categorical_crossentropy')

In [23]:
from seq2seq_utils import viz_model_architecture
seq2seq_Model.summary()
viz_model_architecture(seq2seq_Model)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Decoder-Input (InputLayer)      (None, None)         0                                            
__________________________________________________________________________________________________
Decoder-Word-Embedding (Embeddi (None, None, 300)    1350600     Decoder-Input[0][0]              
__________________________________________________________________________________________________
Encoder-Input (InputLayer)      (None, 70)           0                                            
__________________________________________________________________________________________________
Decoder-Batchnorm-1 (BatchNorma (None, None, 300)    1200        Decoder-Word-Embedding[0][0]     
__________________________________________________________________________________________________
Encoder-Mo

ImportError: Failed to import `pydot`. Please install `pydot`. For example with `pip install pydot`.
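
The two parameter counts that survive in the truncated summary can be checked by hand. An embedding layer has one `latent_dim`-sized vector per vocabulary entry, and a Keras `BatchNormalization` layer keeps four values per feature (gamma, beta, moving mean, moving variance). A quick check using the vocabulary size reported in cell 18:

```python
latent_dim = 300
num_decoder_tokens = 4502   # title vocabulary size printed by load_text_processor

# One 300-dimensional vector per title vocabulary entry.
embedding_params = num_decoder_tokens * latent_dim

# gamma, beta, moving mean and moving variance for each of the 300 features.
batchnorm_params = 4 * latent_dim

print(f'{embedding_params:,} {batchnorm_params:,}')
```

Both values agree with the `Param #` column for `Decoder-Word-Embedding` and `Decoder-Batchnorm-1`.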

In [24]:
from keras.callbacks import CSVLogger, ModelCheckpoint

script_name_base = 'tutorial_seq2seq'
csv_logger = CSVLogger('{:}.log'.format(script_name_base))
model_checkpoint = ModelCheckpoint('{:}.epoch{{epoch:02d}}-val{{val_loss:.5f}}.hdf5'.format(script_name_base),
                                   save_best_only=True)

batch_size = 1200
epochs = 7
history = seq2seq_Model.fit([encoder_input_data, decoder_input_data], np.expand_dims(decoder_target_data, -1),
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.12, callbacks=[csv_logger, model_checkpoint])

Train on 792000 samples, validate on 108000 samples
Epoch 1/7




Epoch 2/7
Epoch 3/7
Epoch 5/7
Epoch 6/7
Epoch 7/7
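
Two details of the `fit` call are easy to miss. The targets are passed through `np.expand_dims(..., -1)`, which adds a trailing axis of size 1 so that each time step carries one integer class id, and `validation_split=0.12` produces the 792,000 / 108,000 split shown in the log. A small sketch with a dummy target array (the array contents are placeholders, not tutorial data):

```python
import numpy as np

# Dummy stand-in for decoder_target_data: 4 titles, 11 time steps of integer token ids.
targets = np.zeros((4, 11), dtype=np.int32)
targets_3d = np.expand_dims(targets, -1)   # one class id per time step
print(targets.shape, '->', targets_3d.shape)

# Reproduce the train / validation counts from the log with integer arithmetic.
n_samples = 900_000
n_val = n_samples * 12 // 100              # validation_split=0.12
n_train = n_samples - n_val
print(f'{n_train:,} train, {n_val:,} validation')
```

Keras takes the validation rows from the end of the arrays, before any shuffling, so the last 12% of `encoder_input_data` is never used for gradient updates.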


In [25]:
#save model
seq2seq_Model.save(f'{MODEL_PATH}seq2seq_model_tutorial.h5')



Results:

In [26]:
from seq2seq_utils import Seq2Seq_Inference
seq2seq_inf = Seq2Seq_Inference(encoder_preprocessor=body_pp,
                                 decoder_preprocessor=title_pp,
                                 seq2seq_model=seq2seq_Model)
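
At prediction time there is no ground-truth title to feed the decoder, so a title has to be generated one token at a time. The sketch below shows plain greedy decoding under stated assumptions: `greedy_decode` and `toy_step` are hypothetical names, `toy_step` is a stand-in for one decoder step, and nothing here is taken from the `Seq2Seq_Inference` source, which may work differently.

```python
import numpy as np

def greedy_decode(step_fn, state, start_id, end_id, max_len):
    """Feed the most probable token back in until `end_id` or `max_len` is reached."""
    token, output = start_id, []
    for _ in range(max_len):
        probs, state = step_fn(token, state)   # one decoder step
        token = int(np.argmax(probs))          # greedy choice
        if token == end_id:
            break
        output.append(token)
    return output

def toy_step(token, state):
    """Hypothetical decoder step that deterministically emits 2 -> 4 -> 5 -> 3."""
    next_token = {2: 4, 4: 5, 5: 3}[token]
    probs = np.zeros(6)
    probs[next_token] = 1.0
    return probs, state

# 2 and 3 mirror the start / end indicators seen in the title vectors.
print(greedy_decode(toy_step, state=None, start_id=2, end_id=3, max_len=12))
```

In the real model the state would start as the encoder's hidden state for the issue body, and `max_len` would bound the title length.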

In [27]:
# this method displays the predictions on random rows of the holdout set
seq2seq_inf.demo_model_predictions(n=50, issue_df=testdf)




"https://github.com/francineloza/HIM_Operations/issues/58"
Issue Body:
 hi francine, i pushed a login to the server anm: angrejoaagrajo5253, password: 9456 , but got a call from our field manager that there was a mistake in the form. we needed to change the login to angrejo5253 password: 9456 , so i pushed more data to the server with that. basically, we have two anms in our system now: anrejoaagrajo5253 and anrejo5253. i'm not sure what we do about the first anm but just wanted to flag this so we don't end up sending a report to the govt saying this anm isn't using her tablet. thanks, and i'm really sorry for the inconvenience : 

Original Title:
 delete anm angrejoaagrajo5253

****** Machine Generated Title (Prediction) ******:
 login form is not working



"https://github.com/PatchworkBoy/homebridge-edomoticz/issues/100"
Issue Body:
 hi, i had to change my z-wave stick and thus, exclude then include all my devices. my fibaro wallplugs fgwpe used to appear in homekit before my res


****** Machine Generated Title (Prediction) ******:
 automatic synchronization of number



"https://github.com/bloomberg/bqplot/issues/498"
Issue Body:
 just a question: i'm using the new graph mark. python tt = tooltip fields=node_attrs_list graph = graph node_data=node_data, link_data=link_data, link_type='line', colors=color_array, tooltip=tt, directed=false instead of adding the same tooltip to all nodes in the graph, i would like to display different tooltip for different nodes ie, some nodes might have a 'size' attribute, others might not. is there any way to do this? or if not, is there a way to get the tooltip to not display empty attributes? 

Original Title:
 customize tooltip for different elements of a mark?

****** Machine Generated Title (Prediction) ******:
 different tooltip for nodes



"https://github.com/libcg/bfp/issues/4"
Issue Body:
 once we get a base implementation we should be able to wire muparser to bfp to get an interactive shell. 

Original Title:
 add mu


****** Machine Generated Title (Prediction) ******:
 check command silently ignores empty files



"https://github.com/electron/electron.atom.io/issues/663"
Issue Body:

Original Title:
 how to know which apps are recently released

****** Machine Generated Title (Prediction) ******:
 new address for the new business page



"https://github.com/jeremyruppel/walrus/issues/30"
Issue Body:
 when i was trying to compile the template without the friends array because data is dynamic : html <h1>{{name.first}} {{name.last}}</h1> <ul> {{:each @friends do}} <li>{{name}}</li> {{end}} </ul> i get this message: typeerror: cannot set property '$index' of undefined at walrus.js:1125:24 there is any way to catch the errors to send over the rest server to show an error into the frontend? i'm working over nodejs 7.x. btw the other stuff works really great! 

Original Title:
 catch errors event

****** Machine Generated Title (Prediction) ******:
 cannot use without a non existent template



"https://

In [None]:
%reload_ext autoreload
%autoreload 2
all_data_df = pd.read_csv(GITHUB_ISSUES)
from seq2seq_utils import Seq2Seq_Inference
seq2seq_inf_rec = Seq2Seq_Inference(encoder_preprocessor=body_pp,
                                    decoder_preprocessor=title_pp,
                                    seq2seq_model=seq2seq_Model)
recsys_annoyobj = seq2seq_inf_rec.prepare_recommender(train_body_vecs, all_data_df)

100%|██████████| 900000/900000 [00:27<00:00, 32786.69it/s]
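
The recommender relies on the encoder turning every issue body into a fixed-size vector, so similar issues can be found by nearest-neighbour search over those vectors. The variable name suggests an Annoy index is built, but that is an assumption. The sketch below uses brute-force cosine similarity in NumPy with made-up 2-dimensional embeddings purely to illustrate the lookup; `nearest_neighbors` is a hypothetical helper.

```python
import numpy as np

def nearest_neighbors(query, embeddings, k=2):
    """Indexes of the k rows of `embeddings` most cosine-similar to `query`."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = normed @ q                      # cosine similarity to every row
    return np.argsort(-sims)[:k].tolist()  # highest similarity first

# Made-up embeddings standing in for encoder outputs of four issue bodies.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
print(nearest_neighbors(np.array([1.0, 0.05]), emb))
```

Brute force is linear in the number of issues per query; an approximate index trades a little accuracy for much faster lookups at the 900,000-row scale.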


In [None]:
seq2seq_inf_rec.demo_model_predictions(n=1, issue_df=testdf, threshold=1)