<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Process-Data" data-toc-modified-id="Process-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Process Data</a></span></li><li><span><a href="#Pre-Process-Data-For-Deep-Learning" data-toc-modified-id="Pre-Process-Data-For-Deep-Learning-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Pre-Process Data For Deep Learning</a></span><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Look-at-one-example-of-processed-issue-bodies" data-toc-modified-id="Look-at-one-example-of-processed-issue-bodies-2.0.0.1"><span class="toc-item-num">2.0.0.1&nbsp;&nbsp;</span>Look at one example of processed issue bodies</a></span></li><li><span><a href="#Look-at-one-example-of-processed-issue-titles" data-toc-modified-id="Look-at-one-example-of-processed-issue-titles-2.0.0.2"><span class="toc-item-num">2.0.0.2&nbsp;&nbsp;</span>Look at one example of processed issue titles</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Define-Model-Architecture" data-toc-modified-id="Define-Model-Architecture-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Define Model Architecture</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Load-the-data-from-disk-into-variables" data-toc-modified-id="Load-the-data-from-disk-into-variables-3.0.1"><span class="toc-item-num">3.0.1&nbsp;&nbsp;</span>Load the data from disk into variables</a></span></li><li><span><a href="#Define-Model-Architecture" data-toc-modified-id="Define-Model-Architecture-3.0.2"><span class="toc-item-num">3.0.2&nbsp;&nbsp;</span>Define Model Architecture</a></span></li></ul></li></ul></li><li><span><a href="#Train-Model" data-toc-modified-id="Train-Model-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Train Model</a></span></li><li><span><a href="#See-Results-On-Holdout-Set" data-toc-modified-id="See-Results-On-Holdout-Set-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>See Results On Holdout 
Set</a></span></li><li><span><a href="#Feature-Extraction-Demo" data-toc-modified-id="Feature-Extraction-Demo-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Feature Extraction Demo</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Example-1:-Issues-Installing-Python-Packages" data-toc-modified-id="Example-1:-Issues-Installing-Python-Packages-6.0.1"><span class="toc-item-num">6.0.1&nbsp;&nbsp;</span>Example 1: Issues Installing Python Packages</a></span></li><li><span><a href="#Example-2:--Issues-asking-for-feature-improvements" data-toc-modified-id="Example-2:--Issues-asking-for-feature-improvements-6.0.2"><span class="toc-item-num">6.0.2&nbsp;&nbsp;</span>Example 2:  Issues asking for feature improvements</a></span></li></ul></li></ul></li></ul></div>

In [1]:
import pandas as pd
import logging
import glob
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 500)
logger = logging.getLogger()
logger.setLevel(logging.WARNING)

# Process Data

Look at the filesystem to confirm that the file extracted from BigQuery (or downloaded from Kaggle: https://www.kaggle.com/davidshinn/github-issues/) is present

In [4]:
!ls -lah | grep github_issues.csv

Split data into train and test set and preview data

In [7]:
# read in a random sample of 2M rows (to keep the tutorial fast)
traindf, testdf = train_test_split(pd.read_csv('github_issues.csv').sample(n=2000000), 
                                   test_size=.10)


#print out stats about shape of data
print(f'Train: {traindf.shape[0]:,} rows {traindf.shape[1]:,} columns')
print(f'Test: {testdf.shape[0]:,} rows {testdf.shape[1]:,} columns')

# preview data
traindf.head(3)

Train: 1,800,000 rows 3 columns
Test: 200,000 rows 3 columns


Unnamed: 0,issue_url,issue_title,body
271751,"""https://github.com/luciddreamz/laravel-ex/issues/4""",how to have local git instead of github?,how do i update my project from my local pc instead of github? i were able to do that on openshift v2 but i can't understand how the v3 image work. can you provide a simple guide on this?
5108112,"""https://github.com/robert-ciobotaru/Proiect_IP/issues/4""",un protocol de comunicare cu middle-end-ul.,avem nevoie de un protocol detaliat pentru comunicarea cu middle-end-ul.
3476565,"""https://github.com/PhenX/php-font-lib/issues/58""",what are the compatible font extensions?,"i need to validate a form for font submission, what kind of files should i accept?"


**Convert to lists in preparation for modeling**

In [8]:
train_body_raw = traindf.body.tolist()
train_title_raw = traindf.issue_title.tolist()
#preview output of first element
train_body_raw[0]

"how do i update my project from my local pc instead of github? i were able to do that on openshift v2 but i can't understand how the v3 image work. can you provide a simple guide on this?"


# Pre-Process Data For Deep Learning

See [this repo](https://github.com/hamelsmu/ktext) for documentation on the ktext package

In [9]:
%reload_ext autoreload
%autoreload 2
from ktext.preprocess import processor

In [10]:
%%time
# Clean, tokenize, and pad / truncate so that each document has length 70.
#  Also retain only the top 8,000 words in the vocabulary and map every
#  remaining word to index 1, the shared index for rare words.
body_pp = processor(keep_n=8000, padding_maxlen=70)
train_body_vecs = body_pp.fit_transform(train_body_raw)



CPU times: user 3min 13s, sys: 38.5 s, total: 3min 52s
Wall time: 10min 12s


#### Look at one example of processed issue bodies

In [11]:
print('\noriginal string:\n', train_body_raw[0], '\n')
print('after pre-processing:\n', train_body_vecs[0], '\n')


original string:
 how do i update my project from my local pc instead of github? i were able to do that on openshift v2 but i can't understand how the v3 image work. can you provide a simple guide on this? 

after pre-processing:
 [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0   72   52    6  176   41  116   23   41  205 1538  168   11
  212    6  421  207    4   52   17   16 2855 1030   24    6   27   30
  791   72    3 1518  100   81   27   20  324    5  503 1067   16   13] 
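
The leading zeros in the vector above are padding, and rare words share index 1. A minimal sketch of that idea follows. It is not the ktext implementation: whitespace tokenization, the id assignment, and truncating from the end of long documents are all assumptions made for illustration.

```python
from collections import Counter

PAD, RARE = 0, 1  # 0 = padding; 1 = shared index for out-of-vocabulary words


def build_vocab(docs, keep_n):
    """Map the keep_n most frequent tokens to ids starting at 2."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    # Sort by descending count, then alphabetically, so ids are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tok: i + 2 for i, (tok, _) in enumerate(ranked[:keep_n])}


def vectorize(doc, vocab, maxlen, padding="pre"):
    """Turn one document into a fixed-length list of token ids."""
    ids = [vocab.get(tok, RARE) for tok in doc.split()]
    ids = ids[:maxlen]  # assumption: long documents lose their tail
    pad = [PAD] * (maxlen - len(ids))
    return pad + ids if padding == "pre" else ids + pad


docs = ["fix the build", "the build is broken", "update the docs"]
vocab = build_vocab(docs, keep_n=2)
vec = vectorize("fix the build", vocab, maxlen=5)
print(vocab)
print(vec)
```

With `keep_n=2` only the two most frequent words get their own id, so "fix" falls back to the rare-word index.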



In [14]:
# Instantiate a text processor for the titles, with some different parameters
#  append_indicators=True adds the tokens '_start_' and '_end_' around each
#                      document
#  padding='post' means that zero padding is appended to the end of the
#             document (as opposed to the default, which is 'pre')
title_pp = processor(append_indicators=True, keep_n=4500,
                     padding_maxlen=12, padding='post')

# process the title data
train_title_vecs = title_pp.fit_transform(train_title_raw)



#### Look at one example of processed issue titles

In [15]:
print('\noriginal string:\n', train_title_raw[0])
print('after pre-processing:\n', train_title_vecs[0])


original string:
 how to have local git instead of github?
after pre-processing:
 [  2  26   4  97 184 382 142  12 166   3   0   0]
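
In the vector above, 2 and 3 appear to be the ids of `_start_` and `_end_`, and the trailing zeros are the 'post' padding. The sketch below reproduces that layout. The helper `encode_title` is hypothetical, and the word ids are simply read off the printed example rather than taken from the fitted `title_pp` vocabulary.

```python
def encode_title(tokens, vocab, maxlen, start_id=2, end_id=3, rare_id=1, pad_id=0):
    """Wrap token ids in start/end markers, then pad at the end to maxlen."""
    ids = [start_id] + [vocab.get(tok, rare_id) for tok in tokens] + [end_id]
    ids = ids[:maxlen]  # assumption: overly long titles are cut at the end
    return ids + [pad_id] * (maxlen - len(ids))


# Word ids copied from the printed example; the '?' is assumed to be dropped in cleaning.
vocab = {"how": 26, "to": 4, "have": 97, "local": 184,
         "git": 382, "instead": 142, "of": 12, "github": 166}
tokens = ["how", "to", "have", "local", "git", "instead", "of", "github"]

encoded = encode_title(tokens, vocab, maxlen=12)
print(encoded)
```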


Serialize all of this to disk for later use

In [16]:
import dill as dpickle
import numpy as np

# Save the preprocessor
with open('body_pp.dpkl', 'wb') as f:
    dpickle.dump(body_pp, f)

with open('title_pp.dpkl', 'wb') as f:
    dpickle.dump(title_pp, f)

# Save the processed data
np.save('train_title_vecs.npy', train_title_vecs)
np.save('train_body_vecs.npy', train_body_vecs)

# Define Model Architecture

### Load the data from disk into variables

In [20]:
from seq2seq_utils import load_decoder_inputs, load_encoder_inputs, load_text_processor

In [19]:
# !pip install annoy

Collecting annoy
Installing collected packages: annoy
Successfully installed annoy-1.15.1


In [24]:
encoder_input_data, doc_length = load_encoder_inputs('train_body_vecs.npy')
decoder_input_data, decoder_target_data = load_decoder_inputs('train_title_vecs.npy')

Shape of encoder input: (1800000, 70)
Shape of decoder input: (1800000, 11)
Shape of decoder target: (1800000, 11)
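
The titles were padded to length 12, yet the decoder input and target both have length 11. `load_decoder_inputs` lives in `seq2seq_utils` and is not shown here, but the shapes are consistent with the usual teacher-forcing setup, where the target is the input shifted one step ahead. A sketch under that assumption:

```python
import numpy as np

# One padded title vector of length 12, copied from the printed example above.
title_vecs = np.array([[2, 26, 4, 97, 184, 382, 142, 12, 166, 3, 0, 0]])

# Decoder input: every position except the last one.
decoder_input = title_vecs[:, :-1]
# Decoder target: the same sequence shifted one step ahead, so that at each
# position the model is trained to predict the next token.
decoder_target = title_vecs[:, 1:]

print(decoder_input.shape, decoder_target.shape)
```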


In [25]:
num_encoder_tokens, body_pp = load_text_processor('body_pp.dpkl')
num_decoder_tokens, title_pp = load_text_processor('title_pp.dpkl')

Size of vocabulary for body_pp.dpkl: 8,002
Size of vocabulary for title_pp.dpkl: 4,502


### Define Model Architecture

In [26]:
%matplotlib inline
from keras.models import Model
from keras.layers import Input, LSTM, GRU, Dense, Embedding, Bidirectional, BatchNormalization
from keras import optimizers

In [27]:
# arbitrarily set latent dimension for embedding and hidden units
latent_dim = 300

##### Define Model Architecture ######

########################
#### Encoder Model ####
encoder_inputs = Input(shape=(doc_length,), name='Encoder-Input')

# Word embedding for encoder (ex: Issue Body)
x = Embedding(num_encoder_tokens, latent_dim, name='Body-Word-Embedding', mask_zero=False)(encoder_inputs)
x = BatchNormalization(name='Encoder-Batchnorm-1')(x)

# Intermediate GRU layer (optional)
#x = GRU(latent_dim, name='Encoder-Intermediate-GRU', return_sequences=True)(x)
#x = BatchNormalization(name='Encoder-Batchnorm-2')(x)

# We do not need the `encoder_output` just the hidden state.
_, state_h = GRU(latent_dim, return_state=True, name='Encoder-Last-GRU')(x)

# Encapsulate the encoder as a separate entity so we can just 
#  encode without decoding if we want to.
encoder_model = Model(inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')

seq2seq_encoder_out = encoder_model(encoder_inputs)

########################
#### Decoder Model ####
decoder_inputs = Input(shape=(None,), name='Decoder-Input')  # for teacher forcing

# Word Embedding For Decoder (ex: Issue Titles)
dec_emb = Embedding(num_decoder_tokens, latent_dim, name='Decoder-Word-Embedding', mask_zero=False)(decoder_inputs)
dec_bn = BatchNormalization(name='Decoder-Batchnorm-1')(dec_emb)

# Set up the decoder, using the encoder's final hidden state (`seq2seq_encoder_out`) as the initial state.
decoder_gru = GRU(latent_dim, return_state=True, return_sequences=True, name='Decoder-GRU')
decoder_gru_output, _ = decoder_gru(dec_bn, initial_state=seq2seq_encoder_out)
x = BatchNormalization(name='Decoder-Batchnorm-2')(decoder_gru_output)

# Dense layer for prediction
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='Final-Output-Dense')
decoder_outputs = decoder_dense(x)

########################
#### Seq2Seq Model ####

#seq2seq_decoder_out = decoder_model([decoder_inputs, seq2seq_encoder_out])
seq2seq_Model = Model([encoder_inputs, decoder_inputs], decoder_outputs)


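# The targets are integer token ids (not one-hot vectors), so use the sparse variant of the loss
seq2seq_Model.compile(optimizer=optimizers.Nadam(lr=0.001), loss='sparse_categorical_crossentropy')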

Instructions for updating:
Colocations handled automatically by placer.


In [None]:
# !pip uninstall pydot
# !pip install pydot
# !pip install pydotplus
# !pip install graphviz

Uninstalling pydot-1.4.1:
  Would remove:
    /anaconda3/envs/test_env/lib/python3.6/site-packages/dot_parser.py
    /anaconda3/envs/test_env/lib/python3.6/site-packages/pydot-1.4.1.dist-info/*
    /anaconda3/envs/test_env/lib/python3.6/site-packages/pydot.py
Proceed (y/n)? 

**Examine Model Architecture Summary**

In [39]:
from seq2seq_utils import viz_model_architecture
seq2seq_Model.summary()
# viz_model_architecture(seq2seq_Model)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Decoder-Input (InputLayer)      (None, None)         0                                            
__________________________________________________________________________________________________
Decoder-Word-Embedding (Embeddi (None, None, 300)    1350600     Decoder-Input[0][0]              
__________________________________________________________________________________________________
Encoder-Input (InputLayer)      (None, 70)           0                                            
__________________________________________________________________________________________________
Decoder-Batchnorm-1 (BatchNorma (None, None, 300)    1200        Decoder-Word-Embedding[0][0]     
__________________________________________________________________________________________________
Encoder-Mo
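
The parameter counts in the summary can be checked by hand. The helper functions below are hypothetical and only restate the standard formulas: an embedding stores one vector per token, and batch normalization keeps four values per feature (gamma, beta, moving mean, moving variance).

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def batchnorm_params(dim):
    """gamma, beta, moving mean and moving variance for each feature."""
    return 4 * dim


def dense_params(in_dim, out_dim):
    """Weight matrix plus one bias per output unit."""
    return in_dim * out_dim + out_dim


num_decoder_tokens = 4502  # vocabulary size reported for title_pp.dpkl
latent_dim = 300

print(embedding_params(num_decoder_tokens, latent_dim))  # 1350600, matches Decoder-Word-Embedding
print(batchnorm_params(latent_dim))                      # 1200, matches Decoder-Batchnorm-1
print(dense_params(latent_dim, num_decoder_tokens))      # size of the final softmax layer
```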

# Train Model

In [41]:
from keras.callbacks import CSVLogger, ModelCheckpoint

script_name_base = 'tutorial_seq2seq'
csv_logger = CSVLogger('{:}.log'.format(script_name_base))
model_checkpoint = ModelCheckpoint('{:}.epoch{{epoch:02d}}-val{{val_loss:.5f}}.hdf5'.format(script_name_base),
                                   save_best_only=True)

batch_size = 1200
epochs = 1
history = seq2seq_Model.fit([encoder_input_data, decoder_input_data], np.expand_dims(decoder_target_data, -1),
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.12, callbacks=[csv_logger, model_checkpoint])

Train on 1584000 samples, validate on 216000 samples
Epoch 1/1
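
The `fit` call passes the targets as integer token ids with a trailing axis of size 1 (`np.expand_dims(decoder_target_data, -1)`), which is the layout Keras' `sparse_categorical_crossentropy` works with. It avoids building a one-hot tensor of shape (1800000, 11, 4502). The plain-Python sketch below, with made-up probabilities, shows that the sparse and one-hot forms give the same loss value; the function names are illustrative only.

```python
import math


def sparse_cce(probs, target_ids):
    """Mean negative log-probability of the target id at each time step."""
    return sum(-math.log(p[t]) for p, t in zip(probs, target_ids)) / len(target_ids)


def one_hot_cce(probs, one_hot_targets):
    """Same loss, but with every target expanded to a one-hot row."""
    total = 0.0
    for p, y in zip(probs, one_hot_targets):
        total += -sum(y_k * math.log(p_k) for p_k, y_k in zip(p, y))
    return total / len(one_hot_targets)


# Two time steps over a toy vocabulary of three tokens.
probs = [[0.1, 0.7, 0.2], [0.25, 0.25, 0.5]]
target_ids = [1, 2]
one_hot = [[0, 1, 0], [0, 0, 1]]

sparse_loss = sparse_cce(probs, target_ids)
one_hot_loss = one_hot_cce(probs, one_hot)
print(sparse_loss, one_hot_loss)
```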




In [23]:
#save model
seq2seq_Model.save('seq2seq_model_tutorial.h5')



# See Results On Holdout Set

In [1]:

from seq2seq_utils import Seq2Seq_Inference
seq2seq_inf = Seq2Seq_Inference(encoder_preprocessor=body_pp,
                                 decoder_preprocessor=title_pp,
                                 seq2seq_model=seq2seq_Model)

Using TensorFlow backend.


ModuleNotFoundError: No module named 'annoy'

In [None]:
# this method displays the predictions on random rows of the holdout set
ref_title, gen_title = seq2seq_inf.demo_model_predictions(n=50, issue_df=testdf)

for i in range(20):
    print(ref_title[i])
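
`Seq2Seq_Inference` is defined in `seq2seq_utils`, which is not shown here. A common way to generate a title from a trained encoder-decoder is greedy decoding: start from the start token and repeatedly take the most likely next token until the end token appears. The sketch below assumes that approach. `toy_decoder` is a hypothetical stand-in for the trained decoder, not the real model.

```python
def greedy_decode(next_token_probs, state, start_id, end_id, max_len):
    """Generate token ids one at a time, always taking the most likely next token."""
    token = start_id
    output = []
    for _ in range(max_len):
        probs, state = next_token_probs(token, state)
        token = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if token == end_id:
            break
        output.append(token)
    return output


def toy_decoder(token, state):
    """Hypothetical decoder: emits token 4, then 5, then the end token 3."""
    script = [4, 5, 3]
    probs = [0.0] * 6
    probs[script[state]] = 1.0
    return probs, state + 1  # the state here is just a step counter


generated = greedy_decode(toy_decoder, state=0, start_id=2, end_id=3, max_len=12)
print(generated)
```

In the real model the state would be the GRU hidden state, initialised from the encoder output for the issue body.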

In [2]:
!pip install rouge



# Feature Extraction Demo

In [68]:
# Read All 5M data points
all_data_df = pd.read_csv('github_issues.csv')
# Extract the bodies from this dataframe
all_data_bodies = all_data_df['body'].tolist()

In [70]:
# transform all of the data using the ktext processor
all_data_vectorized = body_pp.transform_parallel(all_data_bodies)

In [71]:
# save transformed data
with open('all_data_vectorized.dpkl', 'wb') as f:
    dpickle.dump(all_data_vectorized, f)

In [262]:
%reload_ext autoreload
%autoreload 2
from seq2seq_utils import Seq2Seq_Inference
seq2seq_inf_rec = Seq2Seq_Inference(encoder_preprocessor=body_pp,
                                    decoder_preprocessor=title_pp,
                                    seq2seq_model=seq2seq_Model)
recsys_annoyobj = seq2seq_inf_rec.prepare_recommender(all_data_vectorized, all_data_df)
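
The recommender looks up issues whose encoder embeddings lie close to the embedding of a query issue. Annoy does this with an approximate index; the sketch below swaps in an exact brute-force search with NumPy to show the idea on toy 2-d vectors. It assumes Annoy's angular metric, sqrt(2 * (1 - cosine similarity)); whether `prepare_recommender` uses that metric is not confirmed here.

```python
import numpy as np


def angular_distance(query, matrix):
    """Annoy-style angular distance between one query and every row of matrix."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    cos = np.clip(m @ q, -1.0, 1.0)  # guard against rounding outside [-1, 1]
    return np.sqrt(2.0 * (1.0 - cos))


def nearest(query, matrix, k):
    """Return the indices and distances of the k closest rows."""
    dist = angular_distance(query, matrix)
    order = np.argsort(dist)[:k]
    return order, dist[order]


# Toy "embeddings"; in the notebook each row would be a 300-d encoder state.
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
idx, dist = nearest(np.array([1.0, 0.1]), embeddings, k=2)
print(idx, dist)
```

Brute force compares the query against every row, which is why an approximate index is attractive for millions of issues.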

### Example 1: Issues Installing Python Packages

In [223]:
seq2seq_inf_rec.demo_model_predictions(n=1, issue_df=testdf, threshold=1)




"https://github.com/bnosac/pattern.nlp/issues/5"
Issue Body:
 thanks for your package, i can't wait to use it. unfortunately i have issues with the installation. prerequisite is 'first install python version 2.5+ not version 3 '. so this package cant be used with version 3.6 64bit that i have installed? i nevertheless tried to install it using pip, conda is not supported? but got an error: 'syntaxerror: missing parentheses in call to 'print''. besides when i try to run the library in r version 3.3.3. 64 bit i got errors with can_find_python_cmd required_modules = pattern.db : 'error in find_python_cmd......' pattern seems to be written in python but must be used in r, why cant it be used in python? i found another python pattern application that apparently does the same in python: https://pypi.python.org/pypi/pattern how is this related? 

Original Title:
 error installation python

****** Machine Generated Title (Prediction) ******:
 install with python * number *

**** Similar Issues (using encoder embedding) ****:

Unnamed: 0,issue_url,issue_title,body,dist
286906,"""https://github.com/scikit-hep/root_numpy/issues/337""",root 6.10/02 and root_numpy compatibility,i am trying to pip install root_pandas and one of the dependency is root_numpy however some weird reasons i am unable to install it even though i can import root in python. i am working on python3.6 as i am more comfortable with it. is root_numpy is not yet compatible with the latest root?,0.694671
314005,"""https://github.com/andim/noisyopt/issues/4""",joss review: installing dependencies via pip,"hi, i'm trying to install noisyopt in a clean conda environment running python 3.5. running pip install noisyopt does not install the dependencies numpy, scipy . i see that you do include a requires keyword argument in your setup.py file, does this need to be install_requires ? as in https://packaging.python.org/requirements/ . also, not necessary if you don't want to, but i think it would be good to include a list of dependences somewhere in the readme.",0.698265
48120,"""https://github.com/turi-code/SFrame/issues/389""",python 3.6 compatible,"hi: i tried to install sframe using pip and conda but i can not find anything that will work with python 3.6? has sframe been updated to work with python 3.6 yet? thanks, drew",0.718715


### Example 2:  Issues asking for feature improvements

In [226]:
seq2seq_inf_rec.demo_model_predictions(n=1, issue_df=testdf, threshold=1)




"https://github.com/Chingu-cohorts/devgaido/issues/89"
Issue Body:
 right now, your profile link is https://devgaido.com/profile. this is fine, but it would be really cool if there was a way to share your profile with other people. on my portfolio, i have social media buttons to freecodecamp, github, ect. without a custom link, i cannot show-off what i have done on devgaido to future employers. 

Original Title:
 feature request: sharable profile.

****** Machine Generated Title (Prediction) ******:
 add a link to your profile

**** Similar Issues (using encoder embedding) ****:



Unnamed: 0,issue_url,issue_title,body,dist
250423,"""https://github.com/ParabolInc/action/issues/1379""",integrations list view discoverability,"issue - enhancement i was initially confused by the link to my account copy; seeing github in the integrations list made me think it had already been set up . i realize now that i had to allow parabol to post as me. i think that link to my account could use a tooltip explaining what link means, and why you'd want to do so. <img width= 728 alt= screen shot 2017-09-29 at 10 52 05 am src= https://user-images.githubusercontent.com/2146312/31024786-2fd39c46-a50e-11e7-9f2a-6d4a5ed2baeb.png >",0.748828
222304,"""https://github.com/viosey/hexo-theme-material/issues/166""",allow us to use sns-share for github,"i'd love to be able to add a link at the bottom of the page for my github account. however, the sns-share option doesn't currently seem to be able to do this.",0.774398
153327,"""https://github.com/tobykurien/GoogleApps/issues/31""",drive provide download ability,sometimes people share files via g drive. provided a link this app can show some info about the files but doesn't show the download button. i hope that it can be fixed and users would be able to download files with this app.,0.778953


In [78]:
# in case you need to reset the rec system
# seq2seq_inf_rec.set_recsys_annoyobj(recsys_annoyobj)
# seq2seq_inf_rec.set_recsys_data(all_data_df)

# save object
recsys_annoyobj.save('recsys_annoyobj.pkl')

True