# Summary

The goal of this notebook is to perform document classification, a subfield of NLP. We use the data available [here](https://api.github.com/repos/Microsoft/vscode/issues). The data can be retrieved either through the [PyGithub](https://github.com/PyGithub/PyGithub) library or directly with the curl command. We then classify the issues under three labels (bug, feature-request and other). The "other" class is added to make sure that the two other concepts are learned correctly. Finally, we provide a method that takes a title and a body text as parameters and labels this new entry.

Plan
========
1. Building the dataset
2. Splitting the datasets
3. Feature extraction
4. Training
5. Validation
6. Using the classifier

In [1]:
# Imports used throughout the notebook
import pandas as pd  # DataFrame handling
import nltk  # Natural language processing
import numpy as np
# Machine learning
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.utils import shuffle
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from pprint import pprint
from time import time
import logging

In [2]:
# Constants used throughout the notebook
LABEL_FQ = 'feature-request'
LABEL_BUG = 'bug'
LABEL_OTHER = 'other'
LABELS = [LABEL_BUG, LABEL_FQ, LABEL_OTHER]

# Building the dataset

For this exercise, we only keep the title, the body and the ID of each issue. We define 3 classes: 'bug', 'feature-request' and 'other'. We assume that each input has exactly one label.
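The `issues.csv` file read further down has to be downloaded beforehand. As a rough sketch of how that could be done with only the standard library, assuming the public, unauthenticated GitHub REST endpoint and its `state`, `per_page` and `page` query parameters (the helper names `build_issues_url`, `fetch_page` and `keep_fields` are hypothetical and not part of this notebook):

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.github.com/repos/Microsoft/vscode/issues"

def build_issues_url(page, per_page=100, state="all"):
    """Build the URL of one page of issues (hypothetical helper)."""
    query = urllib.parse.urlencode({"state": state, "per_page": per_page, "page": page})
    return API_URL + "?" + query

def fetch_page(page):
    """Download one page of issues; performs a network call, so it is not run here."""
    with urllib.request.urlopen(build_issues_url(page)) as response:
        return json.loads(response.read().decode("utf-8"))

def keep_fields(raw_issues):
    """Keep only the fields used in this notebook: id, title, body and labels."""
    return [
        {
            "id": issue["id"],
            "title": issue["title"],
            "body": issue["body"],
            "labels": issue["labels"],
        }
        for issue in raw_issues
    ]

# Offline demonstration on a hand-written payload shaped like the API response
sample = [{"id": 1, "title": "Crash on save", "body": "Steps...",
           "labels": [{"name": "bug"}], "user": {}}]
print(build_issues_url(1))
print(keep_fields(sample))
```

Unauthenticated requests are rate-limited, so a real download would likely need pauses between pages or an access token.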

In [3]:
# Maps every label other than bug and feature-request to other;
# feature-request takes precedence when both are present
def filter_label(labels):
    
    if LABEL_FQ  in labels:
        return LABEL_FQ
    elif LABEL_BUG in labels:
        return LABEL_BUG
    
    return LABEL_OTHER

In [4]:
issues = pd.read_csv('./issues.csv')  # load the data downloaded beforehand
issues = issues.loc[:, ['title', 'body', 'labels']]  # keep only title, body and labels
issues.head()

Unnamed: 0,title,body,labels
0,Panel badge is an odd shape when a single digit,Need to update the css so that this badge beco...,[]
1,custom titlebar : fullscreen very top dragging...,- VSCode Version: Insiders 1.26\r\n- OS Versio...,[]
2,Localized descriptions for built-in extensions...,Fixes #54111,[]
3,editor automatically removing characters from ...,Issue Type: <b>Bug</b>\r\n\r\nthe editor is re...,[]
4,[js] Add auto completion for computed property...,Currently intellisense doesn't work for comput...,"[{'id': 291124272, 'node_id': 'MDU6TGFiZWwyOTE..."


In [5]:
# Transform the labels: keep a single label per issue (bug, feature-request or other)
import ast  # literal_eval parses the stringified label lists without running arbitrary code

for ind in issues.index:
    row = issues.loc[ind]
    labels = ast.literal_eval(row['labels'])
    tmp = []

    if len(labels) > 0:  # At least one label: three possible assignments
        for l in labels:
            tmp.append(l['name'])

        new_label = filter_label(tmp)
        issues.loc[ind, 'labels'] = new_label
    else:
        issues.loc[ind, 'labels'] = LABEL_OTHER  # Otherwise it is "other"
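
The row-by-row loop above can also be expressed with `Series.apply`. A minimal sketch on a hand-written DataFrame, assuming the `labels` column holds stringified lists of dicts with a `name` key, as in `issues.csv` (`to_single_label` is a hypothetical helper name):

```python
import ast
import pandas as pd

LABEL_FQ = 'feature-request'
LABEL_BUG = 'bug'
LABEL_OTHER = 'other'

def to_single_label(raw_labels):
    """Parse a stringified list of label dicts and reduce it to one class."""
    names = [label['name'] for label in ast.literal_eval(raw_labels)]
    if LABEL_FQ in names:   # feature-request takes precedence, as in filter_label
        return LABEL_FQ
    if LABEL_BUG in names:
        return LABEL_BUG
    return LABEL_OTHER      # no label, or only labels outside the two classes

demo = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'labels': [
        "[]",
        "[{'name': 'bug'}]",
        "[{'name': 'bug'}, {'name': 'feature-request'}]",
        "[{'name': 'editor'}]",
    ],
})
demo['labels'] = demo['labels'].apply(to_single_label)
print(demo['labels'].tolist())
```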

In [6]:
issues.head()

Unnamed: 0,title,body,labels
0,Panel badge is an odd shape when a single digit,Need to update the css so that this badge beco...,other
1,custom titlebar : fullscreen very top dragging...,- VSCode Version: Insiders 1.26\r\n- OS Versio...,other
2,Localized descriptions for built-in extensions...,Fixes #54111,other
3,editor automatically removing characters from ...,Issue Type: <b>Bug</b>\r\n\r\nthe editor is re...,other
4,[js] Add auto completion for computed property...,Currently intellisense doesn't work for comput...,other


In [7]:
print(issues.labels.value_counts())
print('Total: {}'.format(issues.shape[0]))

feature-request    2833
other              1644
bug                 904
Name: labels, dtype: int64
Total: 5381


# Splitting the datasets

In this section, we split the data into two datasets in order to validate the classifier: one for training and one for testing. We keep 70% of each class for training and the remaining 30% for testing.
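
The same 70/30 per-class split can be sketched with scikit-learn's `train_test_split` and its `stratify` argument. This is an alternative to the manual approach in the next cell, not what the notebook runs; the toy DataFrame and the `random_state` value are assumptions for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with the same three classes; the real notebook uses the `issues` DataFrame
toy = pd.DataFrame({
    'title': ['issue %d' % i for i in range(30)],
    'labels': ['bug'] * 10 + ['feature-request'] * 10 + ['other'] * 10,
})

# stratify keeps the class proportions identical in both parts;
# random_state makes the split reproducible
toy_train, toy_test = train_test_split(
    toy, test_size=0.3, stratify=toy['labels'], random_state=0
)

print(toy_train['labels'].value_counts().to_dict())
print(toy_test['labels'].value_counts().to_dict())
```

Fixing `random_state` also makes the reported scores comparable from one run to the next, which the unseeded `sample(frac=0.7)` call does not guarantee.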

In [8]:
dfTrain = {}
dfTest = {}
for l in LABELS:
    dfTrain[l] = issues[issues.labels == l].sample(frac=0.7)
    dfTest[l] = issues[issues.labels == l].drop(dfTrain[l].index)

dfTrain = pd.concat([dfTrain[l] for l in LABELS ], axis=0)
dfTest = pd.concat([dfTest[l] for l in LABELS ], axis=0)

In [9]:
dfTrain.head()

Unnamed: 0,title,body,labels
2903,Menus: restore icon has different color compar...,Refs: https://github.com/Microsoft/vscode/issu...,bug
1664,node protocol probing confuses electron,- Use this program test.js:\r\n ```js\r\n le...,bug
4766,The location of Source Control Viewlet's Conte...,### Issue Type\r\nBug\r\n\r\n### Description\r...,bug
3967,"""terminal.selectionBackground"" is above foregr...",Issue Type: <b>Bug</b>\r\n\r\nWhen selecting t...,bug
4483,Debuger error: It loses the `this` context whe...,"Hi,\r\n\r\nI found a strange behaviour when I ...",bug


In [10]:
dfTrain.labels.value_counts()

feature-request    1983
other              1151
bug                 633
Name: labels, dtype: int64

In [11]:
dfTest.head()

Unnamed: 0,title,body,labels
81,Toggle Word Wrap doesn't work with the custom ...,Issue Type: <b>Bug</b>\r\n\r\n(Using Windows 1...,bug
117,"OSX: ""Save as"" with ""Hide extension"" checked c...",- VSCode Version: Code 1.17.2 (b813d1298030801...,bug
120,Support git repositories inside git repositories,- VSCode Version: Code 1.18.0\r\n- OS Version:...,bug
131,Wrong git decorations in file explorer,<!-- Do you have a question? Please ask it on ...,bug
148,TypeScript throws error and stops watch build,Just had `npm run watch` exit because of an un...,bug


In [12]:
dfTest.labels.value_counts()

feature-request    850
other              493
bug                271
Name: labels, dtype: int64

In [13]:
# Shuffle the data
dfTrain = shuffle(dfTrain)
dfTest = shuffle(dfTest)

We observe an imbalance in the number of elements per class. This may cause difficulties during training.
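
One common way to quantify this imbalance is to compute per-class weights that are inversely proportional to the class frequencies. The notebook itself handles the imbalance by undersampling; the sketch below only illustrates the alternative, using scikit-learn's `compute_class_weight` on the training counts shown earlier:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts of the training set reported above
counts = {'feature-request': 1983, 'other': 1151, 'bug': 633}
y = np.concatenate([np.full(n, label) for label, n in counts.items()])

classes = np.array(sorted(counts))
# 'balanced' gives each class the weight n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)
for label, weight in zip(classes, weights):
    print(label, round(weight, 3))
```

Classifiers such as `SGDClassifier` accept `class_weight='balanced'` to apply weights of this kind during training.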

# Feature Extraction

Now that we have our datasets, we process the data to help the classifier extract meaning from it. To do so, we stem the texts and tokenize them to recover the terms used. In the end, the dataset looks like a Bag of Words (BoW): a matrix of n examples by m words, where the m words are all the words encountered in the training dataset. Each value is the number of times the word is used in the example.
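
To make the (n examples x m words) matrix concrete, here is a small sketch with scikit-learn's `CountVectorizer` on three made-up sentences. The sentences are assumptions for illustration, and no stemming is applied here:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "editor crashes when saving file",
    "add dark theme to editor",
    "saving file is slow, saving fails",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)  # sparse matrix of shape (n examples, m words)

print(bow.shape)
# Column index of each word, then the dense count matrix
print(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1]))
print(bow.toarray())
```

Each row is one document and each column one word of the vocabulary; the third row holds a 2 in the column of "saving" because the word appears twice in that sentence.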

In [None]:
# Turns a string into a list of tokens.
# Stop words (at, to ...), punctuation and URLs are removed
def string_to_tokens(mystring):
    tokens = nltk.tokenize.TweetTokenizer().tokenize(mystring)
    stopwords = set(nltk.corpus.stopwords.words('english'))
    stemmer = nltk.stem.PorterStemmer()

    result = []
    for token in tokens:
        token = token.lower()  # Lowercase first so that "The" matches the stop word "the"
        if token in stopwords:
            continue  # Drop stop words
        if len(token) <= 1:
            continue  # Drop single characters, which covers most punctuation
        if 'https://' in token or 'http://' in token:
            continue  # Drop URLs
        result.append(stemmer.stem(token))

    return result

In [None]:
# Returns, from a list of tokens, a dictionary mapping each token to its number of occurrences
def token_frequency(tokens):
    frequencies = {t : 0 for t in tokens}
    
    for t in tokens:
        frequencies[t] += 1
    return frequencies
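
The same counting can be sketched with the standard library's `collections.Counter`, which is a `dict` subclass and therefore usable wherever the dictionary above is expected. The token list is made up for illustration:

```python
from collections import Counter

tokens = ['bug', 'editor', 'bug', 'crash', 'editor', 'bug']
frequencies = Counter(tokens)  # maps each token to its number of occurrences
print(frequencies.most_common(2))
```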

In [14]:
underSampleTrain = pd.concat([dfTrain[dfTrain.labels == l].sample(n=633) for l in LABELS],axis=0)

In [15]:
underSampleTrain.labels.value_counts()

feature-request    633
other              633
bug                633
Name: labels, dtype: int64
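
The value 633 used above is the size of the smallest class in this particular training split. A sketch that derives it from the data instead of hard-coding it, on a toy DataFrame standing in for `dfTrain` (the toy data and the `random_state` value are assumptions):

```python
import pandas as pd

LABELS = ['bug', 'feature-request', 'other']

# Toy imbalanced frame standing in for dfTrain
toy_train = pd.DataFrame({
    'title': ['t%d' % i for i in range(12)],
    'labels': ['feature-request'] * 6 + ['other'] * 4 + ['bug'] * 2,
})

# Size of the smallest class, computed instead of hard-coded
n_min = toy_train['labels'].value_counts().min()

balanced = pd.concat(
    [toy_train[toy_train['labels'] == l].sample(n=n_min, random_state=0) for l in LABELS],
    axis=0,
)
print(balanced['labels'].value_counts().to_dict())
```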

In [16]:
corpus = [str(row['title']) + ' ' + str(row['body']) for ind, row in underSampleTrain.iterrows()]
corpusTest = [str(row['title']) + ' ' + str(row['body']) for ind, row in dfTest.iterrows()]

In [17]:
tokenizer = nltk.tokenize.TweetTokenizer()
pipeline =Pipeline([
    ('vect', CountVectorizer(stop_words='english', tokenizer=tokenizer.tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

In [18]:
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    #'clf__n_iter': (10, 50, 80),
}
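
Before launching the search, it can help to check how many pipelines the grid implies. A small sketch with scikit-learn's `ParameterGrid`, using the same grid as above; the fold count of 3 is taken from the log printed by this notebook's run and may differ with other scikit-learn versions:

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
}

n_candidates = len(ParameterGrid(parameters))  # 3 * 4 * 2 * 2 * 2 * 2 * 2
n_folds = 3  # number of folds reported in the grid search log of this notebook
print(n_candidates, n_candidates * n_folds)  # 384 1152
```

Every extra value in any parameter multiplies the total, which is why the search takes several minutes even on this small dataset.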

In [19]:
grid_searchSGD = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

print("Performing grid search...")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
pprint(parameters)
t0 = time()
grid_searchSGD.fit(corpus, underSampleTrain.labels.values)
print("done in %0.3fs" % (time() - t0))
print()

print("Best score: %0.3f" % grid_searchSGD.best_score_)
print("Best parameters set:")
best_parameters = grid_searchSGD.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__alpha': (1e-05, 1e-06),
 'clf__penalty': ('l2', 'elasticnet'),
 'tfidf__norm': ('l1', 'l2'),
 'tfidf__use_idf': (True, False),
 'vect__max_df': (0.5, 0.75, 1.0),
 'vect__max_features': (None, 5000, 10000, 50000),
 'vect__ngram_range': ((1, 1), (1, 2))}
Fitting 3 folds for each of 384 candidates, totalling 1152 fits
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   22.5s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  3.7min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  6.6min
[Parallel(n_jobs=-1)]: Done 1152 out of 1152 | elapsed:  9.6min finished


done in 579.582s

Best score: 0.588
Best parameters set:
	clf__alpha: 1e-05
	clf__penalty: 'l2'
	tfidf__norm: 'l1'
	tfidf__use_idf: True
	vect__max_df: 0.5
	vect__max_features: 10000
	vect__ngram_range: (1, 2)




In [20]:
clf = grid_searchSGD.best_estimator_

In [21]:
pred = clf.predict(corpusTest)
print(accuracy_score(y_pred=pred, y_true=dfTest.labels.values))
print(confusion_matrix(y_pred=pred, y_true=dfTest.labels.values))

0.5892193308550185
[[177  38  56]
 [217 548  85]
 [149 118 226]]
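
Accuracy alone hides how each class behaves. The sketch below derives per-class precision and recall from the confusion matrix printed above. It assumes the rows and columns follow the sorted label order (bug, feature-request, other), which is what `confusion_matrix` uses by default and which matches the test-set counts 271, 850 and 493:

```python
import numpy as np

labels = ['bug', 'feature-request', 'other']  # row/column order used by confusion_matrix
cm = np.array([
    [177,  38,  56],
    [217, 548,  85],
    [149, 118, 226],
])

recall = np.diag(cm) / cm.sum(axis=1)     # rows are true classes
precision = np.diag(cm) / cm.sum(axis=0)  # columns are predicted classes
accuracy = np.trace(cm) / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print('%-16s precision=%.3f recall=%.3f' % (name, p, r))
print('accuracy=%.3f' % accuracy)
```

Read this way, the matrix shows that many feature requests and "other" issues are predicted as bugs, so the bug class has a much lower precision than recall.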


In [25]:
clf.predict(['Found a bug, please correct it quicly ! I can\'t work now'])[0]

'bug'

In [23]:
clf.predict(['Need a new feature to improve the software'])[0]

'feature-request'

In [24]:
clf.predict(['Brian is in the kitchen'])[0]

'other'

In [27]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english", ignore_stopwords=True)
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])
stemmed_count_vect = StemmedCountVectorizer(stop_words='english')

pipeline =Pipeline([
    ('vect', stemmed_count_vect),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

grid_searchSGDStem = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
grid_searchSGDStem.fit(corpus, underSampleTrain.labels.values)

Fitting 3 folds for each of 384 candidates, totalling 1152 fits
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 11.7min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed: 20.8min
[Parallel(n_jobs=-1)]: Done 1152 out of 1152 | elapsed: 30.3min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', StemmedCountVectorizer(analyzer='word', binary=False, decode_error='strict',
            dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
            lowercase=True, max_df=1.0, max_features=None, min_df=1,
            ngram_range=(1, 1), preprocessor=None, stop_words=...='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'vect__max_df': (0.5, 0.75, 1.0), 'vect__max_features': (None, 5000, 10000, 50000), 'vect__ngram_range': ((1, 1), (1, 2)), 'tfidf__use_idf': (True, False), 'tfidf__norm': ('l1', 'l2'), 'clf__alpha': (1e-05, 1e-06), 'clf__penalty': ('l2', 'elasticnet')},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [28]:
clf = grid_searchSGDStem.best_estimator_
pred = clf.predict(corpusTest)
print(accuracy_score(y_pred=pred, y_true=dfTest.labels.values))
print(confusion_matrix(y_pred=pred, y_true=dfTest.labels.values))
print(clf.predict(['Found a bug, please correct it quicly ! It\'s a bug'])[0])
print(clf.predict(['Need a new feature to improve the software'])[0])
print(clf.predict(['Brian is in the kitchen'])[0])

0.6146220570012392
[[157  50  64]
 [171 595  84]
 [117 136 240]]
bug
feature-request
other


In [43]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english", ignore_stopwords=True)
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])
stemmed_count_vect = StemmedCountVectorizer(stop_words='english')

pipeline =Pipeline([
    ('vect', stemmed_count_vect),
    ('clf', SGDClassifier()),
])

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    #'clf__n_iter': (10, 50, 80),
}

grid_searchSGDStem = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
grid_searchSGDStem.fit(corpus, underSampleTrain.labels.values)

Fitting 3 folds for each of 96 candidates, totalling 288 fits
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 288 out of 288 | elapsed:  8.1min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', StemmedCountVectorizer(analyzer='word', binary=False, decode_error='strict',
            dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
            lowercase=True, max_df=1.0, max_features=None, min_df=1,
            ngram_range=(1, 1), preprocessor=None, stop_words=...='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'vect__max_df': (0.5, 0.75, 1.0), 'vect__max_features': (None, 5000, 10000, 50000), 'vect__ngram_range': ((1, 1), (1, 2)), 'clf__alpha': (1e-05, 1e-06), 'clf__penalty': ('l2', 'elasticnet')},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [44]:
clf = grid_searchSGDStem.best_estimator_
pred = clf.predict(corpusTest)
print(accuracy_score(y_pred=pred, y_true=dfTest.labels.values))
print(confusion_matrix(y_pred=pred, y_true=dfTest.labels.values))
print(clf.predict(['Found a bug, please correct it quicly ! It\'s a bug'])[0])
print(clf.predict(['Need a new feature to improve the software'])[0])
print(clf.predict(['Brian is in the kitchen'])[0])

0.5396530359355638
[[161  37  73]
 [211 482 157]
 [158 107 228]]
bug
feature-request
feature-request


# Using the classifier

In [45]:
class IssueClassifier:
    def __init__(self,clf):
        self.clf = clf
        
    def predict_issue(self, title, body):
        return self.clf.predict([title + ' ' + body])[0]
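
GitHub issues can have an empty body, in which case `title + ' ' + body` raises a `TypeError` when `body` is `None`. A hedged sketch of a more defensive variant follows; `SafeIssueClassifier` is a hypothetical name, and the tiny model trained here is only a stand-in so that the example runs on its own:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

class SafeIssueClassifier:
    """Hypothetical variant of IssueClassifier that tolerates a missing title or body."""
    def __init__(self, clf):
        self.clf = clf

    def predict_issue(self, title, body=None):
        # Concatenate title and body as IssueClassifier does, replacing None with ''
        text = '%s %s' % (title or '', body or '')
        return self.clf.predict([text])[0]

# Tiny stand-in model (not the grid-searched pipeline of this notebook)
texts = ['crash error bug', 'bug exception crash', 'add new feature', 'feature request please add']
labels = ['bug', 'bug', 'feature-request', 'feature-request']
toy_clf = Pipeline([('vect', TfidfVectorizer()), ('clf', SGDClassifier(random_state=0))])
toy_clf.fit(texts, labels)

prediction = SafeIssueClassifier(toy_clf).predict_issue('Editor crash', None)
print(prediction)
```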

In [46]:
issueClf = IssueClassifier(clf)

In [47]:
issueClf.predict_issue(title = "Bug", body = "Hi, I found a bug ! Could you please correct it quickly ? It's a bug, bug, bug, bug")

'bug'

In [49]:
issueClf.predict_issue(title = "A Bug in the matrix", body = "Hi, I found a bug ! Steps to reproduce 1 step1 . Could you please correct it quickly ? Regards")

'bug'

In [50]:
issueClf.predict_issue(title = "Feature", body = "Hi, I need a new feature concerning the autocompletion")

'feature-request'

In [51]:
issueClf.predict_issue(title="Hell World !", body="Brian is in the kitchen")

'bug'

In [None]:
# Assumes a fitted GridSearchCV named grid_searchSVC; it is not defined anywhere in this notebook
clf = grid_searchSVC.best_estimator_
pred = clf.predict(corpusTest)
print(accuracy_score(y_pred=pred, y_true=dfTest.labels.values))
print(confusion_matrix(y_pred=pred, y_true=dfTest.labels.values))
print(clf.predict(['Found a bug, please correct it quicly ! It\'s a bug'])[0])
print(clf.predict(['Need a new feature to improve the software'])[0])
print(clf.predict(['Brian is in the kitchen'])[0])

In [None]:
tokenizer = nltk.tokenize.TweetTokenizer()
pipeline =Pipeline([
    ('vect', CountVectorizer(stop_words='english', tokenizer=tokenizer.tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier()),
])

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
}
grid_searchRF = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)


grid_searchRF.fit(corpus, underSampleTrain.labels.values)

In [None]:
clf = grid_searchRF.best_estimator_
pred = clf.predict(corpusTest)
print(accuracy_score(y_pred=pred, y_true=dfTest.labels.values))
print(confusion_matrix(y_pred=pred, y_true=dfTest.labels.values))
print(clf.predict(['Found a bug, please correct it quicly ! It\'s a bug'])[0])
print(clf.predict(['Need a new feature to improve the software'])[0])
print(clf.predict(['Brian is in the kitchen'])[0])

In [None]:
tokenizer = nltk.tokenize.TweetTokenizer()
pipeline =Pipeline([
    ('vect', CountVectorizer(stop_words='english', tokenizer=tokenizer.tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MLPClassifier()),
])

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    #'clf__alpha': (0.00001, 0.000001),
    #'clf__penalty': ('l2', 'elasticnet'),
    #'clf__n_iter': (10, 50, 80),
    'clf__hidden_layer_sizes' : ((500,100), (1000,500))
}

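grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
# corpus is built from underSampleTrain, so the labels must come from it as well
grid_search.fit(corpus, underSampleTrain.labels.values)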
clf = grid_search.best_estimator_
pred = clf.predict(corpusTest)
print(accuracy_score(y_pred=pred, y_true=dfTest.labels.values))
print(confusion_matrix(y_pred=pred, y_true=dfTest.labels.values))

In [None]:
tokenizer = nltk.tokenize.TweetTokenizer()
pipeline =Pipeline([
    ('vect', CountVectorizer(stop_words='english', tokenizer=tokenizer.tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MLPClassifier()),
])
# corpus is built from underSampleTrain, so the labels must come from it as well
pipeline.fit(corpus, underSampleTrain.labels.values)
pred = pipeline.predict(corpusTest)
print(accuracy_score(y_pred=pred, y_true=dfTest.labels.values))
print(confusion_matrix(y_pred=pred, y_true=dfTest.labels.values))

In [None]:
# dfValidation is not defined in this notebook; dfTest is the only held-out set
print(accuracy_score(y_pred=pred, y_true=dfTest.labels.values))
print(confusion_matrix(y_pred=pred, y_true=dfTest.labels.values))

In [None]:
dfTest.labels.values

# Training

In [None]:
def transform_df_like_BoW(words, df):
    newDf = pd.DataFrame(columns=words)
    for ind, row in df.iterrows():
        string = str(row['title']) + ' ' + str(row['body'])
        tokens = string_to_tokens(string)
        newDf.loc[ind] = [tokens.count(word) for word in words]
        newDf.loc[ind,'label'] = row['labels']
    return newDf

In [None]:
test = transform_df_like_BoW(X.columns.tolist(), dfTest)

In [None]:
test.head()

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

In [None]:
clf = gnb.fit(X=X.drop('label', axis=1), y=X.label)

In [None]:
pred = clf.predict(X=test.drop('label',axis=1))

In [None]:
from sklearn.metrics import accuracy_score

print(accuracy_score(test.label, pred))

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

In [None]:
clf = clf.fit(X=X.drop('label', axis=1), y=X.label)

In [None]:
pred = clf.predict(X=test.drop('label',axis=1))

In [None]:
print(accuracy_score(test.label, pred))

In [None]:
from sklearn import svm
clf = svm.SVC()
clf = clf.fit(X=X.drop('label', axis=1), y=X.label)
pred = clf.predict(X=test.drop('label',axis=1))
print(accuracy_score(test.label, pred))

In [None]:
def labels_to_binary_vector(labels):
    vector = []
    for l in labels:
        n = LABELS.index(l)
        vector.append(np.array([0] * n + [1] + [0] * (3-n-1)))
    return vector
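
The same one-hot encoding can be sketched with NumPy indexing into an identity matrix, which returns a single 2-D array instead of a list of vectors. `labels_to_one_hot` is a hypothetical name used only for this illustration:

```python
import numpy as np

LABELS = ['bug', 'feature-request', 'other']

def labels_to_one_hot(labels):
    """Return an array of shape (len(labels), len(LABELS)) with a single 1 per row."""
    indices = [LABELS.index(label) for label in labels]
    return np.eye(len(LABELS), dtype=int)[indices]

encoded = labels_to_one_hot(['other', 'bug', 'bug'])
print(encoded)
print(encoded.shape)
```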

In [None]:
labels_to_binary_vector(LABELS * 2)[0].shape

In [None]:
np.array(labels_to_binary_vector(X.label)).shape

In [None]:
import keras

In [None]:
model = keras.models.Sequential()
model.add(keras.layers.Dense(units=1000, activation='relu', input_dim=X.shape[1] - 1))
model.add(keras.layers.Dense(units=500, activation='relu'))
# softmax turns the 3 outputs into class probabilities, as categorical_crossentropy expects
model.add(keras.layers.Dense(units=3, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.SGD(lr=0.01, momentum=0.9, nesterov=True))

In [None]:
model.fit(X.drop('label', axis=1).values, np.array(labels_to_binary_vector(X.label.tolist())), epochs=100, batch_size=32, shuffle=True)

In [None]:
pred = model.predict(test.drop('label', axis=1))

In [None]:
pred.shape

In [None]:
import os

files = os.listdir('./issues/')

In [None]:
files.sort()

In [None]:
dfs = []
for f in files:
    dfs.append(pd.read_json('./issues/' + f))

In [None]:
issues = pd.concat(dfs)

In [None]:
issues.to_csv('issues.csv')

In [None]:
issues.loc[:,['title','body','labels']]

In [None]:
issues.iloc[4,:].labels

In [None]:
import json

json.loads(issues.iloc[4,:].labels)

In [None]:
type(eval(issues.iloc[0,:].labels))

In [None]:
for ind, row in issues.iterrows():
    labels = row['labels']
    tmp = []
    print(labels)
    for l in labels:
        tmp.append(l['name'])
        
    new_label = filter_label(tmp)
    issues.loc[ind, 'labels'] = new_label

In [None]:
nltk.word_tokenize(issues.loc[0].body)

In [None]:
string_to_tokens(issues.loc[0].body)

In [None]:
len(')')

In [None]:
from keras.datasets import imdb

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10)

In [None]:
X_train