# Résumé

L'objectif de ce notebook est de faire du "Document Classification", c'est une sous partie du NLP. Pour cela, nous prenons les données qui sont [ici](https://api.github.com/repos/Microsoft/vscode/issues). Les données peuvent être récupérées soit via l'API [PyGithub](https://github.com/PyGithub/PyGithub), soit directement avec la commande curl. Nous devrons ensuite classer les différentes "issues" sous les différents labels (bug, feature-request et other). On rajoute la classe "other" afin d'être sûr que les deux autres concepts sont correctements appris. Pour finir, nous fournirons une méthode qui prendra en paramètre, un titre et un corp de texte et qui labellisera cette nouvelles entrée.

Plan
========
1. Construction du dataset
2. Séparation des datasets
3. Feature-Extraction
4. Entraînement
5. Validation
6. Utilisation du classifier

In [59]:
import pandas as pd #Gestion des dataframes
import nltk # Traitement du langage naturel
import numpy as np

In [2]:
#Constants utilisées par la suite
LABEL_FQ = 'feature-request'
LABEL_BUG = 'bug'
LABEL_OTHER = 'other'
LABELS = [LABEL_BUG, LABEL_FQ, LABEL_OTHER]

# Construction du dataset

Pour cet exercice, on ne prendra que le titre, le corp et l'ID de l'issue. On va faire 3 classes différentes, 'bug', 'feature-request', 'other'. On considère que chaque input n'a qu'un seul label.

In [3]:
#Fonction qui permet de redéfinir les autres labels que bug et feature-request à other
def filter_label(labels):
    
    if LABEL_FQ  in labels:
        return LABEL_FQ
    elif LABEL_BUG in labels:
        return LABEL_BUG
    
    return LABEL_OTHER

In [4]:
issues = pd.read_csv('./issues.csv') #importation des données téléchargées au préalable
issues = issues.loc[:,['title','body','labels']] #On conserve seulement titre, body et labels
issues.head()

Unnamed: 0,title,body,labels
0,Panel badge is an odd shape when a single digit,Need to update the css so that this badge beco...,[]
1,custom titlebar : fullscreen very top dragging...,- VSCode Version: Insiders 1.26\r\n- OS Versio...,[]
2,Localized descriptions for built-in extensions...,Fixes #54111,[]
3,editor automatically removing characters from ...,Issue Type: <b>Bug</b>\r\n\r\nthe editor is re...,[]
4,[js] Add auto completion for computed property...,Currently intellisense doesn't work for comput...,"[{'id': 291124272, 'node_id': 'MDU6TGFiZWwyOTE..."


In [5]:
#Transformation des labels. On ne garde qu'un seul label (bug, feature-request et other)
for ind in issues.index:
    row = issues.loc[ind]
    labels = eval(row['labels'])
    tmp = []

    if len(labels) > 0: # S'il y a au moins un label, 3 possibilitées d'affectation
        for l in labels:
            tmp.append(l['name'])

        new_label = filter_label(tmp)
        issues.loc[ind, 'labels'] = new_label
    else:
        issues.loc[ind, 'labels'] = LABEL_OTHER #Sinon c'est other

In [6]:
issues

Unnamed: 0,title,body,labels
0,Panel badge is an odd shape when a single digit,Need to update the css so that this badge beco...,other
1,custom titlebar : fullscreen very top dragging...,- VSCode Version: Insiders 1.26\r\n- OS Versio...,other
2,Localized descriptions for built-in extensions...,Fixes #54111,other
3,editor automatically removing characters from ...,Issue Type: <b>Bug</b>\r\n\r\nthe editor is re...,other
4,[js] Add auto completion for computed property...,Currently intellisense doesn't work for comput...,other
5,Electron 2.0.5,This reverts https://github.com/Microsoft/vsco...,other
6,"Mac OS X ""Invalid key shortcut terminal"" on ⌃`",I want to open a terminal tab using the shortc...,other
7,ES & TS autoimport features enhance,No useful ways to import es & ts modules witho...,other
8,Folder with Chinese path cannot import into wo...,Issue Type: <b>Bug</b>\r\n\r\nPath include Chi...,other
9,Cannot uninstall VS Code on Windows Server 201...,Interesting use case here while using Visual S...,other


In [7]:
print(issues.labels.value_counts())
print('Totale : {}'.format(issues.shape[0]))

feature-request    2833
other              1644
bug                 904
Name: labels, dtype: int64
Totale : 5381


# Séparation des datasets

Dans cette section, nous séparons les données en 3 datasets. Ceci afin de valider le classifieur. Nous allons avoir un dataset pour l'entrainement, un pour le test et un dernier pour la validation. On garde 70% de chaque classe pour l'entraînement, et le reste sera divisé en 2 pour les autres datasets.

In [8]:
dfTrain = {}
dfTest = {}
dfValidation = {}
for l in LABELS:
    dfTrain[l] = issues[issues.labels == l].sample(frac=0.7)
    reste = issues[issues.labels == l].drop(dfTrain[l].index)
    dfTest[l] = reste.sample(frac=0.5)
    dfValidation[l] = reste.drop(dfTest[l].index)
    
dfTrain = pd.concat([dfTrain[l] for l in LABELS ], axis=0)
dfTest = pd.concat([dfTest[l] for l in LABELS ], axis=0)
dfValidation = pd.concat([dfValidation[l] for l in LABELS ], axis=0)

In [9]:
dfTrain.head()

Unnamed: 0,title,body,labels
1278,Can't drag a file over a webview in another ed...,- VSCode Version: 1.12 Insiders\r\n- OS Versio...,bug
3116,Panel titles are read twice,**Environment Details:** \r\nVSCode Version : ...,bug
5174,"VoiceOver reports search results as an ""empty ...",Turn VoiceOver on and then focus the search vi...,bug
784,Window title only repainting when window is re...,- VSCode Version: Code 1.14.2 (cb82febafda0c8c...,bug
4682,node debug adapter must support file URLs,\r\n- I have a javascript program which execut...,bug


In [10]:
dfTrain.labels.value_counts()

feature-request    1983
other              1151
bug                 633
Name: labels, dtype: int64

In [11]:
dfTest.head()

Unnamed: 0,title,body,labels
4496,Newly created folder is not revealed and selected,Issue Type: <b>Bug</b>\r\n\r\nTesting #43968.\...,bug
1059,Terminal.onDidWriteData sends duplicate data,Version: 1.25.0-insider\r\nCommit: fb0b8f12036...,bug
2090,Cannot debug mocha/es6 tests,- VSCode Version: 1.6.1\n- OS Version: Windows...,bug
2113,Project Search/Replace with RegExp with \n onl...,- VSCode Version: 1.5 and 1.6.0-insiders\n\nSt...,bug
2910,Extension settings don't show up when settings...,- Open settings editor (new or old)\r\n- Close...,bug


In [12]:
dfTest.labels.value_counts()

feature-request    425
other              246
bug                136
Name: labels, dtype: int64

In [13]:
dfValidation.head()

Unnamed: 0,title,body,labels
85,Toggling sidebar switches away from settings e...,Issue Type: <b>Bug</b>\r\n\r\n**Repo**\r\n1. O...,bug
168,Deleting .code-workspace contents can break ex...,Testing #35871\r\n\r\n1. Create a simple works...,bug
189,Number badge not nicely alligned in problems view,"\r\n<img width=""189"" alt=""capture"" src=""https:...",bug
293,Editor indentation incorrect around tabbed fun...,Indentation does not seem to work correctly fo...,bug
335,WorkspaceConfiguration.get doesn't respect def...,Steps to Reproduce:\r\n\r\n1. Create a simple ...,bug


In [14]:
dfValidation.labels.value_counts()

feature-request    425
other              247
bug                135
Name: labels, dtype: int64

On constate un déséquilibre au niveau  du nombre d'éléments par classe. Cela pourra poser des difficultés pour l'apprentissage.

# Feature Extraction

Maintenant que nous avons nos datasets, nous allons normaliser nos données, afin d'aider notre classifier à trouver du sens. Pour ce faire, nous allons "stemmatiser" les différents textes, "tokanier" pour récupérer les différents termes utilisés. Pour finir, notre dataset ressemblera à un Bag of Words (BoW). Concrètement nous aurons une matrice (n exemples x m mots). Les m mots sont tous les mots rencontrés dans le dataset d'entraînement. Les différentes valeurs correspondront au nombre de fois que le mot est utilisé par un exemple.

In [15]:
#Transforme une chaîne de caractère en un liste de tokens.
#On supprime les stop words (at, to ...), les ponctuations et les urls
def string_to_tokens(mystring):
    tokens = nltk.tokenize.TweetTokenizer().tokenize(mystring)
    stopwords = nltk.corpus.stopwords.words('english')
    stemmer = nltk.stem.PorterStemmer()
    
    for i in range(len(tokens))[::-1]:
        if tokens[i] in stopwords:
            tokens.remove(tokens[i]) #On retire les stopwords
        elif len(tokens[i]) <= 1:
            tokens.remove(tokens[i])
        elif 'https://' in tokens[i]: #suppression des url
            tokens.remove(tokens[i])
        else :    
            tokens[i] = stemmer.stem(tokens[i])
            tokens[i] = tokens[i].lower()
        
    return tokens

In [16]:
#Retourne à partir d'une liste de tokens un dictionnaire des tokens et leur nombre d'apparition
def token_frequency(tokens):
    frequencies = {t : 0 for t in tokens}
    
    for t in tokens:
        frequencies[t] += 1
    return frequencies

In [17]:
tmpTrain = dict()
for ind, row in dfTrain.iterrows():
    string = str(row['title']) + ' ' + str(row['body'])
    tmp = {}
    tmp = token_frequency(string_to_tokens(string))
    tmp['label'] = row['labels']
    tmpTrain[str(ind)] = tmp

In [19]:
X = pd.DataFrame(tmpTrain).T

In [20]:
X = X.fillna(0)
X.head()

Unnamed: 0,##vscode,#000,#000000,#0000ff,#008000,#00ff00,#10,#101,#10170,#10317,...,閣下確認閣下在使用互聯網及網上操守方面已獲得所有必需的批准,附帶,雲端硬碟,香港,가ㄹ,가ㅁ,가ㅈ,ａａａ,𝑺𝑻𝑶𝑷,𝗦𝗧𝗢𝗣
1278,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3116,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5174,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
784,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4682,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
X.to_csv('BoW.csv')
dfTrain.to_csv('dfTrain.csv')
dfTest.to_csv('dfTest.csv')
dfValidation.to_csv('dfValidationi.csv')

# Entraînement

In [31]:
def transform_df_like_BoW(words, df):
    newDf = pd.DataFrame(columns=words)
    for ind, row in df.iterrows():
        string = str(row['title']) + ' ' + str(row['body'])
        tokens = string_to_tokens(string)
        newDf.loc[ind] = [tokens.count(word) for word in words]
        newDf.loc[ind,'label'] = row['labels']
    return newDf

In [32]:
test = transform_df_like_BoW(X.columns.tolist(), dfTest)

In [34]:
test.head()

Unnamed: 0,##vscode,#000,#000000,#0000ff,#008000,#00ff00,#10,#101,#10170,#10317,...,閣下確認閣下在使用互聯網及網上操守方面已獲得所有必需的批准,附帶,雲端硬碟,香港,가ㄹ,가ㅁ,가ㅈ,ａａａ,𝑺𝑻𝑶𝑷,𝗦𝗧𝗢𝗣
4496,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1059,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2090,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2113,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2910,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

In [25]:
clf = gnb.fit(X=X.drop('label', axis=1), y=X.label)

In [35]:
pred = clf.predict(X=test.drop('label',axis=1))

In [38]:
from sklearn.metrics import accuracy_score

print(accuracy_score(test.label, pred))

0.5006195786864932


In [39]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

In [41]:
clf = clf.fit(X=X.drop('label', axis=1), y=X.label)

In [42]:
pred = clf.predict(X=test.drop('label',axis=1))

In [43]:
print(accuracy_score(test.label, pred))

0.5997521685254027


In [44]:
from sklearn import svm
clf = svm.SVC()
clf = clf.fit(X=X.drop('label', axis=1), y=X.label)
pred = clf.predict(X=test.drop('label',axis=1))
print(accuracy_score(test.label, pred))

0.5452292441140025


In [76]:
def labels_to_binary_vector(labels):
    vector = []
    for l in labels:
        n = LABELS.index(l)
        vector.append(np.array([0] * n + [1] + [0] * (3-n-1)))
    return vector

In [82]:
labels_to_binary_vector(LABELS * 2)[0].shape

(3,)

In [88]:
np.array(labels_to_binary_vector(X.label)).shape

(3767, 3)

In [45]:
import keras

Using TensorFlow backend.


In [50]:
model = keras.models.Sequential()
model.add(keras.layers.Dense(units=1000, activation='relu', input_dim=X.shape[1] - 1))
model.add(keras.layers.Dense(units=500, activation='relu'))
model.add(keras.layers.Dense(units=3, activation='relu'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.SGD(lr=0.01, momentum=0.9, nesterov=True))

In [93]:
model.fit(X.drop('label', axis=1).values, np.array(labels_to_binary_vector(X.label.tolist())), epochs=100, batch_size=32, shuffle=True)

Epoch 1/100

KeyboardInterrupt: 

In [90]:
pred = model.predict(test.drop('label', axis=1))

In [92]:
pred.shape

(807, 3)

In [None]:
files = os.listdir('./issues/')

In [None]:
files.sort()

In [None]:
dfs = []
for f in files:
    dfs.append(pd.read_json('./issues/' + f))

In [None]:
issues = pd.concat(dfs)

In [None]:
issues.to_csv('issues.csv')

In [None]:
issues.loc[:,['title','body','labels']]

In [None]:
issues.iloc[4,:].labels

In [None]:
json.loads(issues.iloc[4,:].labels)

In [None]:
type(eval(issues.iloc[0,:].labels))

In [None]:
for ind, row in issues.iterrows():
    labels = row['labels']
    tmp = []
    print(labels)
    for l in labels:
        tmp.append(l['name'])
        
    new_label = filter_label(tmp)
    issues.loc[ind, 'labels'] = new_label

In [47]:
nltk.word_tokenize(issues.loc[0].body)

['Need',
 'to',
 'update',
 'the',
 'css',
 'so',
 'that',
 'this',
 'badge',
 'becomes',
 'a',
 'circle',
 '(',
 'when',
 'a',
 'single',
 'digit',
 ')',
 'instead',
 'of',
 'an',
 'odd',
 'shape',
 ':',
 '!',
 '[',
 'image',
 ']',
 '(',
 'https',
 ':',
 '//user-images.githubusercontent.com/35271042/43086421-27080894-8e52-11e8-8f99-11f5133b4203.png',
 ')']

In [50]:
string_to_tokens(issues.loc[0].body)

test
test
test
test
test
test
test
test


['Need',
 'update',
 'css',
 'that',
 'badge',
 'becomes',
 'circle',
 'when',
 'single',
 'digit',
 'instead',
 'an',
 'odd',
 'shape',
 '!',
 'image',
 '(',
 'https',
 '//user-images.githubusercontent.com/35271042/43086421-27080894-8e52-11e8-8f99-11f5133b4203.png']

In [44]:
len(')')

1