# Summary

The goal of this notebook is to perform document classification, a subfield of NLP. We use the data available [here](https://api.github.com/repos/Microsoft/vscode/issues). The data can be retrieved either through the [PyGithub](https://github.com/PyGithub/PyGithub) library or directly with the `curl` command. We then classify the issues under three labels: bug, feature-request, and other. The "other" class is added to make sure the two other concepts are learned correctly. Finally, we provide a method that takes a title and a body of text as parameters and labels this new entry.
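
As a rough sketch of the retrieval step, the snippet below pages through the GitHub REST endpoint linked above using only the standard library. The helper names (`build_issues_url`, `extract_fields`, `fetch_page`) and the choice of `state=all` with 100 issues per page are assumptions made for illustration; this is not necessarily how `issues.csv` was produced.

```python
import json
import urllib.parse
import urllib.request

ISSUES_URL = 'https://api.github.com/repos/Microsoft/vscode/issues'


def build_issues_url(page, per_page=100, state='all'):
    """Build the URL of one page of issues (hypothetical helper)."""
    query = urllib.parse.urlencode({'state': state, 'per_page': per_page, 'page': page})
    return ISSUES_URL + '?' + query


def extract_fields(payload):
    """Keep only the title, the body and the label names of each issue."""
    return [
        {
            'title': issue.get('title'),
            'body': issue.get('body'),
            'labels': [label['name'] for label in issue.get('labels', [])],
        }
        for issue in payload
    ]


def fetch_page(page):
    """Download and parse one page of issues (performs a network call)."""
    with urllib.request.urlopen(build_issues_url(page)) as response:
        return extract_fields(json.load(response))


# Offline illustration on a hand-written payload shaped like the API response
sample = [{'title': 'Crash on save', 'body': 'Steps...', 'labels': [{'name': 'bug'}]}]
rows = extract_fields(sample)
```

Only `fetch_page` touches the network, so the parsing logic can be checked offline. Note that this endpoint also returns pull requests, which carry a `pull_request` key and may need to be filtered out.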

Plan
========
1. Building the dataset
2. Splitting the datasets
3. Feature extraction
4. Training
5. Validation
6. Using the classifier

In [1]:
import pandas as pd  # DataFrame handling
import nltk  # Natural language processing

In [11]:
# Constants used throughout the notebook
LABEL_FQ = 'feature-request'
LABEL_BUG = 'bug'
LABEL_OTHER = 'other'
LABELS = [LABEL_BUG, LABEL_FQ, LABEL_OTHER]

# Building the dataset

For this exercise, we keep only the title, the body, and the labels of each issue (the row index serves as the identifier). We define three classes: 'bug', 'feature-request', and 'other'. We assume that each input has exactly one label.

In [3]:
# Maps a list of label names to a single class: anything other than bug or feature-request becomes other
def filter_label(labels):
    
    if LABEL_FQ  in labels:
        return LABEL_FQ
    elif LABEL_BUG in labels:
        return LABEL_BUG
    
    return LABEL_OTHER
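
Because `filter_label` tests for feature-request first, an issue carrying both labels ends up in the feature-request class. The small sketch below exercises each branch; the label lists are made up for illustration, and the function is repeated only so the example runs on its own.

```python
LABEL_FQ = 'feature-request'
LABEL_BUG = 'bug'
LABEL_OTHER = 'other'


def filter_label(labels):
    # Same logic as the notebook cell above, repeated to keep the example runnable
    if LABEL_FQ in labels:
        return LABEL_FQ
    elif LABEL_BUG in labels:
        return LABEL_BUG
    return LABEL_OTHER


# Hypothetical label lists, chosen only to exercise each branch
examples = [
    ['bug', 'feature-request'],  # both present: feature-request wins
    ['bug', 'verified'],
    ['question'],
    [],
]
results = [filter_label(names) for names in examples]
print(results)
```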

In [4]:
issues = pd.read_csv('./issues.csv')  # Load the data downloaded beforehand
issues = issues.loc[:, ['title', 'body', 'labels']]  # Keep only title, body and labels
issues.head()

Unnamed: 0,title,body,labels
0,Panel badge is an odd shape when a single digit,Need to update the css so that this badge beco...,[]
1,custom titlebar : fullscreen very top dragging...,- VSCode Version: Insiders 1.26\r\n- OS Versio...,[]
2,Localized descriptions for built-in extensions...,Fixes #54111,[]
3,editor automatically removing characters from ...,Issue Type: <b>Bug</b>\r\n\r\nthe editor is re...,[]
4,[js] Add auto completion for computed property...,Currently intellisense doesn't work for comput...,"[{'id': 291124272, 'node_id': 'MDU6TGFiZWwyOTE..."


In [None]:
import ast

# Label transformation: keep a single label per issue (bug, feature-request or other)
for ind in issues.index:
    # The labels column holds the string representation of a list of dicts;
    # ast.literal_eval parses it without executing arbitrary code, unlike eval
    labels = ast.literal_eval(issues.loc[ind, 'labels'])

    if len(labels) > 0:  # At least one label: three possible assignments
        names = [l['name'] for l in labels]
        issues.loc[ind, 'labels'] = filter_label(names)
    else:
        issues.loc[ind, 'labels'] = LABEL_OTHER  # No label at all: other
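
The labels column is read back from the CSV as text, so each cell has to be parsed before the label names can be extracted. A minimal sketch of that parsing step on a single cell follows; `parse_label_names` and the sample strings are hypothetical, and `ast.literal_eval` is shown as a safer stand-in for `eval` since it only accepts Python literals.

```python
import ast


def parse_label_names(raw):
    """Return the label names stored in one cell of the labels column."""
    labels = ast.literal_eval(raw)  # accepts Python literals only
    return [label['name'] for label in labels]


# Hypothetical cell values, shaped like the labels column shown above
raw_with_label = "[{'id': 1, 'name': 'feature-request'}]"
raw_without_label = "[]"

names = parse_label_names(raw_with_label)
empty = parse_label_names(raw_without_label)

# A string that is not a plain literal is rejected instead of being executed
try:
    parse_label_names("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
```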

In [8]:
issues

Unnamed: 0,title,body,labels
0,Panel badge is an odd shape when a single digit,Need to update the css so that this badge beco...,other
1,custom titlebar : fullscreen very top dragging...,- VSCode Version: Insiders 1.26\r\n- OS Versio...,other
2,Localized descriptions for built-in extensions...,Fixes #54111,other
3,editor automatically removing characters from ...,Issue Type: <b>Bug</b>\r\n\r\nthe editor is re...,other
4,[js] Add auto completion for computed property...,Currently intellisense doesn't work for comput...,other
5,Electron 2.0.5,This reverts https://github.com/Microsoft/vsco...,other
6,"Mac OS X ""Invalid key shortcut terminal"" on ⌃`",I want to open a terminal tab using the shortc...,other
7,ES & TS autoimport features enhance,No useful ways to import es & ts modules witho...,other
8,Folder with Chinese path cannot import into wo...,Issue Type: <b>Bug</b>\r\n\r\nPath include Chi...,other
9,Cannot uninstall VS Code on Windows Server 201...,Interesting use case here while using Visual S...,other


In [9]:
print(issues.labels.value_counts())
print('Total: {}'.format(issues.shape[0]))

feature-request    2833
other              1644
bug                 904
Name: labels, dtype: int64
Total: 5381


# Splitting the datasets

In this section, we split the data into three datasets in order to validate the classifier: one for training, one for testing, and one for validation. We keep 70% of each class for training, and the remainder is split in half between the other two datasets.

In [None]:
dfTrain = {}
dfTest = {}
dfValidation = {}
for l in LABELS:
    dfTrain[l] = issues[issues.labels == l].sample(frac=0.7)
    reste = issues[issues.labels == l].drop(dfTrain[l].index)
    dfTest[l] = reste.sample(frac=0.5)
    dfValidation[l] = reste.drop(dfTest[l].index)
    
dfTrain = pd.concat([dfTrain[l] for l in LABELS ], axis=0)
dfTest = pd.concat([dfTest[l] for l in LABELS ], axis=0)
dfValidation = pd.concat([dfValidation[l] for l in LABELS ], axis=0)
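
The cell above draws a new random split on every run because `sample` is called without a seed (pandas accepts a `random_state` argument for that purpose). As an illustration of the same 70/15/15 per-class logic with a fixed seed, here is a standard-library sketch; `stratified_split` and the toy data are assumptions, not part of the notebook.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.7, seed=0):
    """Split indices into train/test/validation, class by class.

    labels maps an example index to its class. Returns three sorted lists.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_class = defaultdict(list)
    for index, label in labels.items():
        by_class[label].append(index)

    train, test, validation = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * train_frac)
        n_test = round((len(indices) - n_train) * 0.5)
        train += indices[:n_train]
        test += indices[n_train:n_train + n_test]
        validation += indices[n_train + n_test:]
    return sorted(train), sorted(test), sorted(validation)


# Toy data: 20 'bug' and 40 'feature-request' examples
toy = {i: ('bug' if i < 20 else 'feature-request') for i in range(60)}
train, test, validation = stratified_split(toy)
print(len(train), len(test), len(validation))
```

Splitting class by class keeps the class proportions identical across the three datasets, which matters here given how uneven the classes are.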

In [19]:
dfTrain.head()

Unnamed: 0,title,body,labels
4497,Updating VS Code should check for updates first.,If I have an update pending and a new update c...,bug
1664,node protocol probing confuses electron,- Use this program test.js:\r\n ```js\r\n le...,bug
5246,Text changes for completion CodeAction are aff...,- VSCode Version: Code - Insiders 1.19.0-insid...,bug
3878,Search results remain highlighted after closin...,Issue Type: <b>Bug</b>\r\n\r\n* Open [`Definit...,bug
4955,VSCode Shell Commands not retaining install st...,- VSCode Version: Code 1.19.1 (0759f77bb8d8665...,bug


In [22]:
dfTrain.labels.value_counts()

feature-request    1983
other              1151
bug                 633
Name: labels, dtype: int64

In [20]:
dfTest.head()

Unnamed: 0,title,body,labels
2213,Unable to open file via context menu from othe...,- VSCode Version:1.4\n- OS Version:Ubuntu 16.0...,bug
1853,"Find/replace: Keyboard shortcuts for ""whole wo...",Cmd-Alt-W (toggle whole word) and Cmd-Alt-C (t...,bug
310,indents after multiline comment incorrect,<!-- Do you have a question? Please ask it on ...,bug
293,Editor indentation incorrect around tabbed fun...,Indentation does not seem to work correctly fo...,bug
528,Update extension button invisible when new ver...,If the number is too big and the side bar too ...,bug


In [23]:
dfTest.labels.value_counts()

feature-request    425
other              246
bug                136
Name: labels, dtype: int64

In [21]:
dfValidation.head()

Unnamed: 0,title,body,labels
81,Toggle Word Wrap doesn't work with the custom ...,Issue Type: <b>Bug</b>\r\n\r\n(Using Windows 1...,bug
160,Cannot read property 'label' of undefined,Issue Id: <b>8d877eb2-b3bb-49b1-ed7a-c9be5d1b5...,bug
255,The uri strings are inconsistent from SetBreak...,- VSCode Version: 1.18.0-insider\r\n- OS Versi...,bug
397,Extension tips service should only listen on `...,It seems that file extensions recommendations ...,bug
404,Title bar font glitch in Chinese locale when t...,<!-- Please search existing issues to avoid cr...,bug


In [24]:
dfValidation.labels.value_counts()

feature-request    425
other              247
bug                135
Name: labels, dtype: int64

The classes are imbalanced: feature-request has roughly three times as many examples as bug. This may make learning more difficult.
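
One common way to compensate for such an imbalance is to weight each class inversely to its frequency during training. The sketch below applies the usual heuristic `n_samples / (n_classes * class_count)` to the training counts reported above; `balanced_weights` is a hypothetical helper, and the notebook itself does not state that it uses class weights.

```python
# Training-set counts reported by dfTrain.labels.value_counts() above
train_counts = {'feature-request': 1983, 'other': 1151, 'bug': 633}


def balanced_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


weights = balanced_weights(train_counts)
for label, weight in weights.items():
    print('{:<16}{:.2f}'.format(label, weight))
```

With these weights, the rarest class (bug) counts the most per example and the most frequent class (feature-request) the least.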

# Feature Extraction

Now that we have our datasets, we normalize the data to help the classifier extract meaning. To do so, we tokenize each text to recover the terms it uses, then stem those tokens. In the end, the dataset takes the form of a Bag of Words (BoW): a matrix of n examples by m words, where the m words are all the words encountered in the training dataset. Each value is the number of times the word appears in the corresponding example.
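
To make the shape of that matrix concrete, here is a minimal standard-library sketch that builds a Bag of Words from two already tokenized documents. The function name `bag_of_words` and the toy documents are made up for illustration; the notebook builds its matrix with pandas instead.

```python
from collections import Counter


def bag_of_words(tokenized_docs):
    """Return (vocabulary, matrix) for a list of token lists."""
    vocabulary = sorted({token for doc in tokenized_docs for token in doc})
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Two made-up, already tokenized documents
docs = [['open', 'file', 'open'], ['file', 'crash']]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one example and each column one word of the vocabulary, so the first document maps to `[0, 1, 2]` over the columns `crash`, `file`, `open`.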

In [67]:
# Converts a string into a list of tokens.
# Drops stop words (at, to, ...), single-character tokens (which covers most
# punctuation marks) and URLs, then stems and lowercases what remains.
# Note: the stop-word test is case-sensitive, so a capitalized stop word
# such as 'If' is kept.
def string_to_tokens(mystring):
    tokens = nltk.tokenize.TweetTokenizer().tokenize(mystring)
    stopwords = set(nltk.corpus.stopwords.words('english'))
    stemmer = nltk.stem.PorterStemmer()

    # Build a new list rather than removing items from the list being traversed
    kept = []
    for token in tokens:
        if token in stopwords:  # Drop stop words
            continue
        if len(token) <= 1:  # Drop single characters
            continue
        if 'https://' in token:  # Drop URLs
            continue
        kept.append(stemmer.stem(token).lower())

    return kept

In [65]:
# Returns a dictionary mapping each token of the list to its number of occurrences
def token_frequency(tokens):
    frequencies = {t : 0 for t in tokens}
    
    for t in tokens:
        frequencies[t] += 1
    return frequencies
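
For reference, the standard library offers `collections.Counter`, which produces the same counts in one call. The comparison below is only a sketch on made-up stemmed tokens; `token_frequency` is repeated so the example is self-contained.

```python
from collections import Counter


def token_frequency(tokens):
    # Same logic as the notebook cell above
    frequencies = {t: 0 for t in tokens}
    for t in tokens:
        frequencies[t] += 1
    return frequencies


tokens = ['updat', 'check', 'updat', 'vs', 'updat']  # made-up stemmed tokens
manual = token_frequency(tokens)
with_counter = dict(Counter(tokens))
```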

In [72]:
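tmpTrain = dict()
for ind, row in dfTrain.iterrows():
    # Title and body are concatenated into a single text per issue
    string = str(row['title']) + ' ' + str(row['body'])
    tmp = token_frequency(string_to_tokens(string))
    # Caution: this overwrites the count of a token that is literally 'label'
    tmp['label'] = row['labels']
    tmpTrain[str(ind)] = tmp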

In [73]:
tmpTrain

{'4497': {'updat': 10,
  'vs': 2,
  'code': 3,
  'check': 5,
  'first': 1,
  'if': 1,
  'pend': 1,
  'new': 2,
  'come': 1,
  'order': 1,
  'get': 1,
  'latest': 1,
  'version': 1,
  'need': 1,
  'reload': 2,
  'window': 2,
  'step': 1,
  'reproduc': 1,
  'open': 1,
  'insid': 1,
  '-->': 2,
  'littl': 1,
  'appear': 1,
  'wait': 1,
  '24': 1,
  'hour': 1,
  'click': 1,
  'gear': 1,
  'find': 1,
  'way': 1,
  'previou': 1,
  'instal': 1,
  'launch': 1,
  'disable-extens': 1,
  'doe': 1,
  'issu': 1,
  'occur': 1,
  'extens': 1,
  'disabl': 1,
  'ye': 1,
  'label': 'bug'},
 '1664': {'node': 10,
  'protocol': 6,
  'probe': 2,
  'confus': 1,
  'electron': 3,
  'use': 2,
  'program': 2,
  'test.j': 2,
  'js': 1,
  'let': 1,
  'setinterv': 1,
  'console.log': 1,
  'hello': 1,
  ');': 2,
  '1000': 1,
  'download': 1,
  'current': 1,
  'maco': 1,
  'run': 1,
  'command': 1,
  'line': 1,
  'electron.app/contents/macos/electron': 1,
  'debug': 9,
  '6009': 4,
  'attach': 6,
  'vs': 3,
  'code':

In [75]:
X = pd.DataFrame(tmpTrain).T

In [None]:
X = X.fillna(0)  # fillna returns a new DataFrame, so the result must be reassigned
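
The missing entries filled with 0 here correspond to words that a given issue does not contain. The same idea applies when a text is encoded against the training vocabulary: words absent from the text get 0, and words never seen during training have no column and are ignored. A minimal sketch, with a hypothetical `vectorize` helper and made-up data:

```python
def vectorize(frequencies, vocabulary):
    """Encode one token-frequency dict as a row aligned with the vocabulary."""
    return [frequencies.get(word, 0) for word in vocabulary]


# Hypothetical training vocabulary and two frequency dicts
vocabulary = ['crash', 'file', 'open']
train_example = {'open': 2, 'file': 1}
new_example = {'crash': 1, 'terminal': 3}  # 'terminal' was never seen in training

train_row = vectorize(train_example, vocabulary)
new_row = vectorize(new_example, vocabulary)
```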

In [None]:
X.head()

In [77]:
X.to_csv('BoW.csv')

4497      bug
1664      bug
5246      bug
3878      bug
4955      bug
4820      bug
3614      bug
3781      bug
3798      bug
3367      bug
4890      bug
2043      bug
3329      bug
4902      bug
2908      bug
4147      bug
679       bug
625       bug
3572      bug
1197      bug
762       bug
4969      bug
1620      bug
150       bug
3136      bug
3653      bug
4485      bug
4100      bug
3879      bug
4085      bug
        ...  
714     other
2690    other
76      other
2394    other
19      other
146     other
2706    other
2058    other
5038    other
3503    other
1009    other
2941    other
3041    other
1307    other
2982    other
3588    other
2993    other
742     other
3985    other
5206    other
2947    other
5377    other
4081    other
4672    other
1387    other
2040    other
3426    other
4448    other
1076    other
4813    other
Name: label, Length: 3767, dtype: object

In [None]:
lemmatizer = nltk.stem.WordNetLemmatizer()  # WordNet-based lemmatizer (needs the 'wordnet' corpus)
lemmatizer.lemmatize(issues.loc[1,'body'], pos='v')

In [56]:
import string
string.punctuation

In [None]:
test.title

In [None]:
issues = rep.get_issues(labels=PaginatedList(['bug','feature-request']))

In [None]:
type(labels)

In [None]:
PaginatedList.PaginatedList(list_item=['1','2'])

In [None]:
test

In [None]:
test.number

In [None]:
test.body

In [None]:
labels.totalCount()

In [None]:
labels[0]

In [None]:
import os
files = os.listdir('./issues/')

In [None]:
files.sort()

In [None]:
dfs = []
for f in files:
    dfs.append(pd.read_json('./issues/' + f))

In [None]:
issues = pd.concat(dfs)

In [None]:
issues.to_csv('issues.csv')

In [None]:
issues.loc[:,['title','body','labels']]

In [None]:
issues.iloc[4,:].labels

In [None]:
json.loads(issues.iloc[4,:].labels)

In [None]:
type(eval(issues.iloc[0,:].labels))

In [None]:
for ind, row in issues.iterrows():
    labels = row['labels']
    tmp = []
    print(labels)
    for l in labels:
        tmp.append(l['name'])
        
    new_label = filter_label(tmp)
    issues.loc[ind, 'labels'] = new_label

In [47]:
nltk.word_tokenize(issues.loc[0].body)

['Need',
 'to',
 'update',
 'the',
 'css',
 'so',
 'that',
 'this',
 'badge',
 'becomes',
 'a',
 'circle',
 '(',
 'when',
 'a',
 'single',
 'digit',
 ')',
 'instead',
 'of',
 'an',
 'odd',
 'shape',
 ':',
 '!',
 '[',
 'image',
 ']',
 '(',
 'https',
 ':',
 '//user-images.githubusercontent.com/35271042/43086421-27080894-8e52-11e8-8f99-11f5133b4203.png',
 ')']

In [50]:
string_to_tokens(issues.loc[0].body)


['Need',
 'update',
 'css',
 'that',
 'badge',
 'becomes',
 'circle',
 'when',
 'single',
 'digit',
 'instead',
 'an',
 'odd',
 'shape',
 '!',
 'image',
 '(',
 'https',
 '//user-images.githubusercontent.com/35271042/43086421-27080894-8e52-11e8-8f99-11f5133b4203.png']

In [44]:
len(')')

1