<h1><center>TEST PONICODE</center></h1>

#### Objective: Build a model that predicts the types of a JS function's parameters from its source code
#### Strategy: For each argument, vectorize its context (defined below) and its identifier, then apply a classification algorithm to the resulting vectors. The `project` column is not used.

### I) Modules :

In [25]:
import pickle
import esprima
import pandas as pd
import re
import numpy as np
from sklearn.preprocessing import LabelEncoder

from gensim.models import Word2Vec
from gensim.models import TfidfModel
from gensim.corpora import Dictionary
from scipy.special import softmax

import xgboost
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

### Dataset:
3 columns:  
project: int  
id of the project the code comes from  
types: str  
Comma-separated types to predict, one per parameter  
function: str  
Source text of the function  

In [26]:
with open('anulap.pkl','rb') as f:
    data = pickle.load(f)
data

index,function,types,project
0,function parseExtensionURL(url) {\n url = u...,string,267
1,"function showPageAction(tabId, displayUrl) {\n...","number,string",267
2,"function onExecuteFileBrowserHandler(id, detai...","string,object",267
3,"function openViewer(windowId, fileEntries) {\n...","number,array",267
4,function isPdfDownloadable(details) {\n if (d...,object,267
...,...,...,...
4426,"function addHandle( attrs, handler ) {\r\n\tva...","string,function",110
4427,function createInputPseudo( type ) {\r\n\tretu...,string,110
4428,function createButtonPseudo( type ) {\r\n\tret...,string,110
4429,function createPositionalPseudo( fn ) {\r\n\tr...,function,110


## II) Data processing:  
Some samples are written in TypeScript rather than JS; the `isjs` function detects them by checking whether esprima can parse the code as a script.

In [27]:
def isjs(program: str) -> bool:
    # a sample counts as JS if esprima can parse it as a script
    try:
        esprima.parseScript(program)
        return True
    except Exception:
        return False
    

In [28]:
mask = data['function'].apply(isjs)

In [29]:
len(data)-len(data[mask])

281

Since the TypeScript samples are few (281 rows), I simply drop them.

In [30]:
data.drop(data[~mask].index, inplace=True)
data.reset_index(inplace=True,drop=True)

Here I extract the useful information about the parameters that can be found in the function body, namely:  

- the body of the function, used to build an embedding that represents the useful information
- the names of the arguments
- the contexts in which the arguments appear
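
To make the notion of "context" concrete, here is a minimal pure-Python sketch of the idea: the token stream is cut into statements at each `;`, and the context of an argument is every statement that mentions it, with the argument itself removed. The helper names `split_statements` and `contexts_for` and the toy token list are assumptions made for this illustration; the notebook's own function works on esprima tokens and differs in details.

```python
def split_statements(tokens):
    """Group a flat token list into statements, cutting at each ';'."""
    statements, current = [], []
    for tok in tokens:
        if tok == ";":
            if current:
                statements.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        statements.append(current)
    return statements


def contexts_for(arg, statements):
    """Keep the statements that mention `arg`, with `arg` itself removed."""
    return [[tok for tok in st if tok != arg] for st in statements if arg in st]


# Toy token stream for: url = url.trim(); var n = 0; return url;
tokens = ["url", "=", "url", ".", "trim", "(", ")", ";",
          "var", "n", "=", "0", ";",
          "return", "url", ";"]

statements = split_statements(tokens)
url_contexts = contexts_for("url", statements)
print(url_contexts)
```

The statement `var n = 0` never mentions `url`, so it contributes nothing to that argument's context.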

In [31]:
def extract_args_body_contexts(program: str)->list:
    # get the AST and the token list from esprima
    ast = esprima.parseScript(program,loc=True,tokens=True)
    params = ast.body[0].params
    tokens = ast.tokens
    
    # normalize runs of capitals (e.g. URL -> Url) so parameter names split cleanly on camel case
    args = [re.sub('[A-Z]+',camel,p.name) for p in params]
    
    tokens_values = []
    for t in tokens:
        if t.loc.start.line >= ast.body[0].body.loc.start.line+1:
            # placeholder tokens for string, numeric and regex literals: this shrinks the vocabulary
            # while keeping the information that a literal of that kind was present
            if t.type == 'String':
                tokens_values += ["<s>"]
            elif t.type == 'Numeric':
                tokens_values += ["0"]
            elif t.type == "RegularExpression":
                tokens_values += ["<regex>"]
            else:
                # apply the same capital-run normalization to every other token
                tokens_values += [re.sub('[A-Z]+',camel,t.value)]
    # split the token stream into statements on ';'
    body_raw = [s.split() for s in (" ".join(tokens_values)).split(";")]
    # for each argument, keep the statements that mention it, with the argument itself removed
    contexts_raw = [[list(filter((arg).__ne__, cs)) for cs in body_raw if arg in cs] for arg in args]
    
    # split each token into sub-words, and insert a '<param>' marker before every parameter so that
    # parameter names never seen during training can still be vectorized later
    body = []
    for i,cs in enumerate(body_raw):
        body+=[[]]
        for name in cs:
            if name in args:
                body[i] += ['<param>']
            body[i] += break_name(name)
            
    contexts= []
    for i,ctxt in enumerate(contexts_raw):
        contexts += [[]]
        for j, cs in enumerate(ctxt):
            for name in cs:
                contexts[i] += break_name(name) 
    
    args = [break_name(arg) for arg in args]
    
    return [args,body,contexts]
# split identifiers on camel case to get more information out of variable names
def break_name(string):
    regex_result = [s.lower() for s in re.findall("[a-zA-Z][a-z]*",string)]
    if string not in ["<s>","0","<regex>"] and regex_result != []:
        return regex_result
    else:
        return [string]
    
def camel(match):
    return match.group(0).title()
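
The two helpers above can be tried in isolation. The sketch below reimplements the same two regular expressions under hypothetical names (`normalize_acronyms`, `split_name`) so that it runs without esprima; the sample identifiers are taken from the example function shown in the next section.

```python
import re


def normalize_acronyms(name):
    """Title-case each run of capitals, e.g. 'CRX_BASE_URL' -> 'Crx_Base_Url'."""
    return re.sub("[A-Z]+", lambda m: m.group(0).title(), name)


def split_name(name):
    """Split a normalized identifier into lowercase sub-words."""
    return [w.lower() for w in re.findall("[a-zA-Z][a-z]*", name)]


samples = ["schemeIndex", "CRX_BASE_URL", "encodeURIComponent"]
results = {raw: split_name(normalize_acronyms(raw)) for raw in samples}
for raw, parts in results.items():
    print(raw, "->", parts)
```

One limitation is visible here: in `encodeURIComponent` the run of capitals `URIC` is treated as a single acronym, so the name splits into `encode` and `uricomponent` rather than `encode`, `uri`, `component`.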

### Example:

In [32]:
program = data['function'][0]

In [33]:
print(program)

function parseExtensionURL(url) {
    url = url.substring(CRX_BASE_URL.length);
    // Find the (url-encoded) colon and verify that the scheme is whitelisted.
    var schemeIndex = url.search(/:|%3A/i);
    if (schemeIndex === -1) {
      return undefined;
    }
    var scheme = url.slice(0, schemeIndex).toLowerCase();
    if (schemes.includes(scheme)) {
      url = url.split("#")[0];
      if (url.charAt(schemeIndex) === ":") {
        url = encodeURIComponent(url);
      }
      return url;
    }
    return undefined;
  }


In [34]:
extract = extract_args_body_contexts(program)

In [35]:
# arguments of the function
extract[0]

[['url']]

In [36]:
# tokenized body of the function
extract[1]

[['<param>',
  'url',
  '=',
  '<param>',
  'url',
  '.',
  'substring',
  '(',
  'crx',
  'base',
  'url',
  '.',
  'length',
  ')'],
 ['var',
  'scheme',
  'index',
  '=',
  '<param>',
  'url',
  '.',
  'search',
  '(',
  '<regex>',
  ')'],
 ['if',
  '(',
  'scheme',
  'index',
  '===',
  '-',
  '0',
  ')',
  '{',
  'return',
  'undefined'],
 ['}',
  'var',
  'scheme',
  '=',
  '<param>',
  'url',
  '.',
  'slice',
  '(',
  '0',
  ',',
  'scheme',
  'index',
  ')',
  '.',
  'to',
  'lower',
  'case',
  '(',
  ')'],
 ['if',
  '(',
  'schemes',
  '.',
  'includes',
  '(',
  'scheme',
  ')',
  ')',
  '{',
  '<param>',
  'url',
  '=',
  '<param>',
  'url',
  '.',
  'split',
  '(',
  '<s>',
  ')',
  '[',
  '0',
  ']'],
 ['if',
  '(',
  '<param>',
  'url',
  '.',
  'char',
  'at',
  '(',
  'scheme',
  'index',
  ')',
  '===',
  '<s>',
  ')',
  '{',
  '<param>',
  'url',
  '=',
  'encode',
  'uricomponent',
  '(',
  '<param>',
  'url',
  ')'],
 ['}', 'return', '<param>', 'url'],
 ['}', 'ret

In [37]:
# context of the argument
extract[2][0]

['=',
 '.',
 'substring',
 '(',
 'crx',
 'base',
 'url',
 '.',
 'length',
 ')',
 'var',
 'scheme',
 'index',
 '=',
 '.',
 'search',
 '(',
 '<regex>',
 ')',
 '}',
 'var',
 'scheme',
 '=',
 '.',
 'slice',
 '(',
 '0',
 ',',
 'scheme',
 'index',
 ')',
 '.',
 'to',
 'lower',
 'case',
 '(',
 ')',
 'if',
 '(',
 'schemes',
 '.',
 'includes',
 '(',
 'scheme',
 ')',
 ')',
 '{',
 '=',
 '.',
 'split',
 '(',
 '<s>',
 ')',
 '[',
 '0',
 ']',
 'if',
 '(',
 '.',
 'char',
 'at',
 '(',
 'scheme',
 'index',
 ')',
 '===',
 '<s>',
 ')',
 '{',
 '=',
 'encode',
 'uricomponent',
 '(',
 ')',
 '}',
 'return']

In [38]:
# apply the extraction function to every row
data[["args","body","contexts"]] = data.apply(lambda x: extract_args_body_contexts(x.function),
                                              axis=1,result_type='expand')

In [39]:
# next, reshape to one argument per row, a layout that suits vectorization and classification;
# start by splitting the comma-separated types into a list
data['types'] = data.apply(lambda x : x.types.split(','),axis=1)

In [40]:
data

index,function,types,project,args,body,contexts
0,function parseExtensionURL(url) {\n url = u...,[string],267,[[url]],"[[<param>, url, =, <param>, url, ., substring,...","[[=, ., substring, (, crx, base, url, ., lengt..."
1,"function showPageAction(tabId, displayUrl) {\n...","[number, string]",267,"[[tab, id], [display, url]]","[[var, url, =, <regex>, ., exec, (, <param>, d...","[[chrome, ., page, action, ., set, popup, (, {..."
2,"function onExecuteFileBrowserHandler(id, detai...","[string, object]",267,"[[id], [details]]","[[if, (, <param>, id, !==, <s>, ), {, return],...","[[if, (, !==, <s>, ), {, return, }, else, {, c..."
3,"function openViewer(windowId, fileEntries) {\n...","[number, array]",267,"[[window, id], [file, entries]]","[[if, (, !, <param>, file, entries, ., length,...","[[if, (, ), {, chrome, ., tabs, ., create, (, ..."
4,function isPdfDownloadable(details) {\n if (d...,[object],267,[[details]],"[[if, (, <param>, details, ., url, ., includes...","[[if, (, ., url, ., includes, (, <s>, ), ), {,..."
...,...,...,...,...,...,...
4145,"function addHandle( attrs, handler ) {\r\n\tva...","[string, function]",110,"[[attrs], [handler]]","[[var, arr, =, <param>, attrs, ., split, (, <s...","[[var, arr, =, ., split, (, <s>, ), ,, i, =, ...."
4146,function createInputPseudo( type ) {\r\n\tretu...,[string],110,[[type]],"[[return, function, (, elem, ), {, var, name, ...","[[return, name, ===, <s>, &&, elem, ., ===]]"
4147,function createButtonPseudo( type ) {\r\n\tret...,[string],110,[[type]],"[[return, function, (, elem, ), {, var, name, ...","[[return, (, name, ===, <s>, ||, name, ===, <s..."
4148,function createPositionalPseudo( fn ) {\r\n\tr...,[function],110,[[fn]],"[[return, mark, function, (, function, (, argu...","[[return, mark, function, (, function, (, seed..."


In [41]:
flat_data = pd.DataFrame([(index,t,arg,context) for (index,values)in data.iterrows() 
                           for t,arg,context in zip(values['types'],values['args'],values['contexts'])],
                         columns = ['index','type','arg','context']).set_index('index')
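
The flattening step can be pictured without pandas. This is a minimal sketch on two hand-written rows (the dictionary layout and the name `flatten` are assumptions for the example): each function-level row holds one list entry per parameter, and `zip` pairs them up into one record per argument.

```python
# Toy function-level rows: every column holds one entry per parameter.
rows = {
    0: {"types": ["string"],
        "args": [["url"]],
        "contexts": [["=", ".", "substring"]]},
    1: {"types": ["number", "string"],
        "args": [["tab", "id"], ["display", "url"]],
        "contexts": [["chrome", "."], ["var", "url", "="]]},
}


def flatten(rows):
    """Emit one (index, type, arg, context) record per parameter."""
    flat = []
    for index, row in rows.items():
        for t, arg, ctx in zip(row["types"], row["args"], row["contexts"]):
            flat.append((index, t, arg, ctx))
    return flat


flat = flatten(rows)
for record in flat:
    print(record)
```

Note that `zip` stops at the shortest input, so a row whose number of declared types differs from its number of parsed parameters would lose entries silently; checking the lengths beforehand is a cheap safeguard.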

In [42]:
# the target format
flat_data

index,type,arg,context
0,string,[url],"[=, ., substring, (, crx, base, url, ., length..."
1,number,"[tab, id]","[chrome, ., page, action, ., set, popup, (, {,..."
1,string,"[display, url]","[var, url, =, <regex>, ., exec, (, ), }, else,..."
2,string,[id],"[if, (, !==, <s>, ), {, return, }, else, {, ch..."
2,object,[details],"[}, var, file, entries, =, ., entries, var, ta..."
...,...,...,...
4145,function,[handler],"[while, (, i, --, ), {, expr, ., attr, handle,..."
4146,string,[type],"[return, name, ===, <s>, &&, elem, ., ===]"
4147,string,[type],"[return, (, name, ===, <s>, ||, name, ===, <s>..."
4148,function,[fn],"[return, mark, function, (, function, (, seed,..."


In [43]:
# join the per-argument rows back onto the function-level columns
data = data.drop(['types','args','contexts'],axis = 1).join(flat_data)

In [44]:
data

index,function,project,body,type,arg,context
0,function parseExtensionURL(url) {\n url = u...,267,"[[<param>, url, =, <param>, url, ., substring,...",string,[url],"[=, ., substring, (, crx, base, url, ., length..."
1,"function showPageAction(tabId, displayUrl) {\n...",267,"[[var, url, =, <regex>, ., exec, (, <param>, d...",number,"[tab, id]","[chrome, ., page, action, ., set, popup, (, {,..."
1,"function showPageAction(tabId, displayUrl) {\n...",267,"[[var, url, =, <regex>, ., exec, (, <param>, d...",string,"[display, url]","[var, url, =, <regex>, ., exec, (, ), }, else,..."
2,"function onExecuteFileBrowserHandler(id, detai...",267,"[[if, (, <param>, id, !==, <s>, ), {, return],...",string,[id],"[if, (, !==, <s>, ), {, return, }, else, {, ch..."
2,"function onExecuteFileBrowserHandler(id, detai...",267,"[[if, (, <param>, id, !==, <s>, ), {, return],...",object,[details],"[}, var, file, entries, =, ., entries, var, ta..."
...,...,...,...,...,...,...
4145,"function addHandle( attrs, handler ) {\r\n\tva...",110,"[[var, arr, =, <param>, attrs, ., split, (, <s...",function,[handler],"[while, (, i, --, ), {, expr, ., attr, handle,..."
4146,function createInputPseudo( type ) {\r\n\tretu...,110,"[[return, function, (, elem, ), {, var, name, ...",string,[type],"[return, name, ===, <s>, &&, elem, ., ===]"
4147,function createButtonPseudo( type ) {\r\n\tret...,110,"[[return, function, (, elem, ), {, var, name, ...",string,[type],"[return, (, name, ===, <s>, ||, name, ===, <s>..."
4148,function createPositionalPseudo( fn ) {\r\n\tr...,110,"[[return, mark, function, (, function, (, argu...",function,[fn],"[return, mark, function, (, function, (, seed,..."


In [45]:
# encode the categorical type labels as integers
le = LabelEncoder()
data['label'] = le.fit_transform(data.type.values)
data

index,function,project,body,type,arg,context,label
0,function parseExtensionURL(url) {\n url = u...,267,"[[<param>, url, =, <param>, url, ., substring,...",string,[url],"[=, ., substring, (, crx, base, url, ., length...",5
1,"function showPageAction(tabId, displayUrl) {\n...",267,"[[var, url, =, <regex>, ., exec, (, <param>, d...",number,"[tab, id]","[chrome, ., page, action, ., set, popup, (, {,...",3
1,"function showPageAction(tabId, displayUrl) {\n...",267,"[[var, url, =, <regex>, ., exec, (, <param>, d...",string,"[display, url]","[var, url, =, <regex>, ., exec, (, ), }, else,...",5
2,"function onExecuteFileBrowserHandler(id, detai...",267,"[[if, (, <param>, id, !==, <s>, ), {, return],...",string,[id],"[if, (, !==, <s>, ), {, return, }, else, {, ch...",5
2,"function onExecuteFileBrowserHandler(id, detai...",267,"[[if, (, <param>, id, !==, <s>, ), {, return],...",object,[details],"[}, var, file, entries, =, ., entries, var, ta...",4
...,...,...,...,...,...,...,...
4145,"function addHandle( attrs, handler ) {\r\n\tva...",110,"[[var, arr, =, <param>, attrs, ., split, (, <s...",function,[handler],"[while, (, i, --, ), {, expr, ., attr, handle,...",2
4146,function createInputPseudo( type ) {\r\n\tretu...,110,"[[return, function, (, elem, ), {, var, name, ...",string,[type],"[return, name, ===, <s>, &&, elem, ., ===]",5
4147,function createButtonPseudo( type ) {\r\n\tret...,110,"[[return, function, (, elem, ), {, var, name, ...",string,[type],"[return, (, name, ===, <s>, ||, name, ===, <s>...",5
4148,function createPositionalPseudo( fn ) {\r\n\tr...,110,"[[return, mark, function, (, function, (, argu...",function,[fn],"[return, mark, function, (, function, (, seed,...",2
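
For readers unfamiliar with label encoding, the sketch below shows the principle in plain Python: the distinct labels are sorted and each one is mapped to its position in that sorted list, which is also how scikit-learn's `LabelEncoder` orders its `classes_`. The list of type names is an illustrative assumption, not necessarily the exact set of classes in the dataset.

```python
def fit_label_encoder(labels):
    """Return the sorted class list and a label -> integer mapping."""
    classes = sorted(set(labels))
    to_id = {label: i for i, label in enumerate(classes)}
    return classes, to_id


# Illustrative type names (assumed for this example).
type_names = ["string", "number", "object", "array", "function", "boolean", "string"]

classes, to_id = fit_label_encoder(type_names)
encoded = [to_id[t] for t in type_names]
decoded = [classes[i] for i in encoded]

print(classes)
print(encoded)
```

Keeping the class list around (as the notebook does at the end with `le.classes_`) is what allows predicted integers to be turned back into type names.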


In [49]:
# train/test split, done on function indices so all arguments of a function stay on the same side
index = np.array(data.index.unique())
np.random.shuffle(index)
train_index = index[:int(len(index)*0.8)]
test_index = index[int(len(index)*0.8):]

data_train = data.loc[train_index]
data_test = data.loc[test_index]
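
The reason for splitting on function indices rather than on rows is that two arguments of the same function share the same body: putting one in train and the other in test would leak information. A minimal sketch of such a grouped split, using only the standard library, is given below. The name `grouped_split` and the fixed `seed` are assumptions added for reproducibility; the cell above shuffles without a seed.

```python
import random


def grouped_split(group_ids, train_fraction=0.8, seed=0):
    """Split the distinct group ids, not the rows, into train and test sets."""
    unique_ids = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    cut = int(len(unique_ids) * train_fraction)
    return set(unique_ids[:cut]), set(unique_ids[cut:])


# One entry per argument row; functions 1 and 2 have two parameters each.
row_function_ids = [0, 1, 1, 2, 2, 3, 4]

train_ids, test_ids = grouped_split(row_function_ids)
train_rows = [i for i, g in enumerate(row_function_ids) if g in train_ids]
test_rows = [i for i, g in enumerate(row_function_ids) if g in test_ids]

print("train functions:", sorted(train_ids))
print("test functions:", sorted(test_ids))
```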

### III) Building the model

For each argument, I vectorize the information carried by its identifier on one hand, and the information carried by the contexts in which it appears on the other. This gives two vectors, which I concatenate and feed to an XGBoost classifier.

In [50]:
# Word2Vec vectorizes the tokens; it is trained on the bodies of all training functions
docs_body = data_train[~data_train.index.duplicated(keep='first')]['body']
corpus_body = [line for doc in docs_body for line in doc]
modelw2v = Word2Vec(sentences=corpus_body, window = 4, min_count = 3)

In [51]:
# TF-IDF assigns a weight to each token of each context; it is fit on the contexts only.
# One document = all the contexts encountered for one variable
docs_context = data_train[~data_train.index.duplicated(keep='first')]['context']
dict_context = Dictionary([doc for doc in docs_context])
corpus_context = [dict_context.doc2bow(doc) for doc in docs_context]
modeltfidf = TfidfModel(corpus_context)
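
As a rough picture of what the TF-IDF weighting does to a context, here is a pure-Python sketch. It assumes what I understand to be gensim's default scheme (raw term frequency, `log2(N / df)` as inverse document frequency, then L2 normalization); the function names and the three toy documents are made up for the example.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Inverse document frequency log2(N / df) for every token seen in docs."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log2(n_docs / df) for tok, df in doc_freq.items()}


def tfidf(doc, idf):
    """L2-normalized tf * idf weights; tokens with zero weight are dropped."""
    counts = Counter(tok for tok in doc if tok in idf)
    raw = {tok: tf * idf[tok] for tok, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return {}
    return {tok: w / norm for tok, w in raw.items() if w > 0.0}


# Three toy contexts; "(" appears in all of them.
docs = [
    ["(", ".", "substring", "(", ")"],
    ["(", ".", "length"],
    ["(", "return"],
]

idf = fit_idf(docs)
weights = tfidf(docs[0], idf)
print(weights)
```

A token such as `(` that occurs in every document gets an IDF of zero and disappears, while rarer tokens such as `substring` dominate the weighting.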

In [55]:
# vectorization functions and feature extraction

# vectorize the information carried by the variable's identifier: mean of its sub-word vectors
def vectorize_arg(arg, modelw2v):
    norm = 0
    vect = np.zeros(modelw2v.wv.vector_size)
    for a in arg:
        if a in modelw2v.wv:
            vect += modelw2v.wv[a]
            norm += 1
    if norm != 0:
        vect /= norm
    else :
        vect = modelw2v.wv['<param>']
    return vect
# vectorize the contexts the variable appears in: softmax(TF-IDF)-weighted mean of token vectors
def vectorize_context(context,modeltfidf,modelw2v,dict_context):
    tfidf = modeltfidf[dict_context.doc2bow(context)]
    freqs = []
    vects = []
    for id_,freq in tfidf:
        word = dict_context[id_]
        if word in modelw2v.wv:
            vects += [modelw2v.wv[word]]
            freqs += [freq]
    if vects == []:
        vect = modelw2v.wv['<param>']
    else :
        vect = np.stack(vects)
        freqs = softmax(np.array(freqs))
        vect = freqs@vect
    
    return vect
# concatenate the two vectors
def get_features(arg,context,modeltfidf,modelw2v,dict_context):
    return np.concatenate((vectorize_arg(arg,modelw2v),vectorize_context(context,modeltfidf,modelw2v,dict_context)))
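
The pooling performed in `vectorize_context` is a softmax-weighted average of token vectors. The sketch below reproduces just that step with the standard library, on two hand-picked 2-d vectors; the function names and the numbers are assumptions for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_average(vectors, scores):
    """Average the vectors with softmax(scores) as weights."""
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


vectors = [[1.0, 0.0], [0.0, 1.0]]

equal = weighted_average(vectors, [0.3, 0.3])    # identical scores -> plain mean
skewed = weighted_average(vectors, [0.9, 0.1])   # higher score -> larger share

print(equal)
print(skewed)
```

One design consequence is worth keeping in mind: if the TF-IDF scores are L2-normalized and therefore lie between 0 and 1, the softmax can give one token at most about e ≈ 2.7 times the weight of another, so the pooled vector stays fairly close to a plain mean.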

In [64]:
# Cross-validation, used to tune Word2Vec's window and XGBoost's n_estimators and max_depth.
# Unfortunately I cannot get the mean macro F1 above 90%...

idx = np.unique(data_train.index)
np.random.shuffle(idx)
folders_id = np.array_split(idx,5)

mean_score = 0
for i in range(5): # a rather crude CV loop
    print("cv: ", i)
    idx_test = folders_id[i]
    idx_train = []
    for j in folders_id[:i]+folders_id[i+1:]:
        idx_train += list(j)
        
    data_train_cv = data_train.loc[idx_train].copy()
    data_test_cv = data_train.loc[idx_test].copy()
    #fit W2V
    docs_body = data_train_cv[~data_train_cv.index.duplicated(keep='first')]['body']
    corpus_body = [line for doc in docs_body for line in doc]
    modelw2v = Word2Vec(sentences=corpus_body, window = 4, min_count = 3)   
    #fit tfidf
    docs_context = data_train_cv[~data_train_cv.index.duplicated(keep='first')]['context']
    dict_context = Dictionary([doc for doc in docs_context])
    corpus_context = [dict_context.doc2bow(doc) for doc in docs_context]
    modeltfidf = TfidfModel(corpus_context)
    # build the training features
    X_train_cv = np.stack(data_train_cv.apply(lambda x: get_features(x.arg,x.context,
                                                                     modeltfidf,modelw2v,dict_context),
                                              axis = 1).values)
    
    Y_train_cv = np.stack(data_train_cv['label'].values)
    
    model_xgb = xgboost.XGBClassifier(max_depth = 6,n_estimators = 150)
    model_xgb.fit(X_train_cv,Y_train_cv)
    # build the held-out features
    X_test_cv = np.stack(data_test_cv.apply(lambda x: get_features(x.arg,x.context,
                                                                     modeltfidf,modelw2v,dict_context),
                                              axis = 1).values)
    
    Y_test_cv = np.stack(data_test_cv['label'].values)
    
    y_pred = model_xgb.predict(X_test_cv)
    predictions = [round(value) for value in y_pred]
    f1 = f1_score(Y_test_cv, predictions,average = 'macro')
    print("f1: %.2f%%" % (f1 * 100.0))    
    mean_score += f1
    
mean_score /= 5
print("mean score : ", mean_score)


cv:  0
f1: 90.86%
cv:  1
f1: 88.76%
cv:  2
f1: 88.86%
cv:  3
f1: 89.67%
cv:  4
f1: 90.34%
mean score :  0.8969630707162388
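
The fold rotation used in the loop above can be isolated from the model fitting. Here is a small standard-library sketch of k-fold splitting over function ids; the generator name `grouped_folds`, the fixed seed and the toy ids are assumptions for the example. scikit-learn's `GroupKFold` offers a ready-made alternative for the same purpose.

```python
import random


def grouped_folds(group_ids, n_folds=5, seed=0):
    """Yield (training_ids, held_out_ids) pairs, one per fold."""
    ids = sorted(set(group_ids))
    random.Random(seed).shuffle(ids)
    # Deal the shuffled ids into n_folds buckets of near-equal size.
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        held_out = set(folds[i])
        training = set(ids) - held_out
        yield training, held_out


function_ids = list(range(12))
splits = list(grouped_folds(function_ids, n_folds=5))

for k, (training, held_out) in enumerate(splits):
    print("fold", k, "held out:", sorted(held_out))
```

In the notebook, Word2Vec and TF-IDF are refit inside each fold on the training part only, which keeps the held-out functions from influencing the embeddings used to score them.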


In [67]:
# train the final model on the full training set

#fit W2V
docs_body = data_train[~data_train.index.duplicated(keep='first')]['body']
corpus_body = [line for doc in docs_body for line in doc]
modelw2v = Word2Vec(sentences=corpus_body, window = 4, min_count = 3)   
#fit tfidf
docs_context = data_train[~data_train.index.duplicated(keep='first')]['context']
dict_context = Dictionary([doc for doc in docs_context])
corpus_context = [dict_context.doc2bow(doc) for doc in docs_context]
modeltfidf = TfidfModel(corpus_context)

X_train = np.stack(data_train.apply(lambda x: get_features(x.arg,x.context,modeltfidf,modelw2v,dict_context),
                                    axis = 1).values)    
Y_train = np.stack(data_train['label'].values)

model_xgb = xgboost.XGBClassifier(max_depth = 6,n_estimators = 150)
model_xgb.fit(X_train,Y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=150, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [68]:
X_test = np.stack(data_test.apply(lambda x: get_features(x.arg,x.context,modeltfidf,modelw2v,dict_context),
                                        axis = 1).values)
Y_test =  np.stack(data_test['label'].values)

In [69]:
y_pred = model_xgb.predict(X_test)
predictions = [round(value) for value in y_pred]
f1 = f1_score(Y_test, predictions,average = 'macro')
print("f1: %.2f%%" % (f1 * 100.0))

f1: 91.49%
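
Since the score reported here is a macro-averaged F1, it may help to see how it differs from accuracy. The sketch below computes macro F1 by hand on a tiny made-up prediction vector (the labels and the helper name `macro_f1` are assumptions for the example): each class gets its own F1, and the classes are then averaged with equal weight regardless of how frequent they are.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy case: class 5 is frequent, class 3 is rare and half of it is missed.
y_true = [5, 5, 5, 5, 3, 3]
y_pred = [5, 5, 5, 5, 5, 3]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print("accuracy:", accuracy)
print("macro F1:", score)
```

A single mistake on the rare class pulls the macro F1 below the accuracy, which is why this metric is a reasonable choice when some parameter types are much rarer than others.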


### Possible improvements

The result seems satisfactory to me. It would still be interesting to add a self-attention mechanism: in the analysis above, tokens are weighted only by their TF-IDF score, i.e. by their relevance within the context, not by their relevance to the argument under consideration. Such a mechanism could be implemented with RNNs or transformer architectures, but that would require considerably more data to give meaningful results.
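
To give an idea of what weighting tokens by their relevance to the argument could look like, here is a deliberately simplified sketch: scaled dot-product attention with the argument vector as the query and the context token vectors as keys and values. It is not what the notebook implements, and it has no learned projections, which a real attention layer would need; all names and numbers are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_pool(query, token_vectors):
    """Pool token vectors with weights softmax(query . token / sqrt(dim))."""
    scale = math.sqrt(len(query))
    scores = [dot(query, v) / scale for v in token_vectors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query)
    pooled = [sum(w * v[d] for w, v in zip(weights, token_vectors))
              for d in range(dim)]
    return pooled, weights


arg_vector = [1.0, 0.0]                       # stands in for the identifier embedding
context_vectors = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]

pooled, weights = attention_pool(arg_vector, context_vectors)
print(weights)
print(pooled)
```

Tokens whose vectors point in the same direction as the argument receive the largest weights, whereas the TF-IDF weighting used above is the same whatever the argument is.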

In [73]:
modelw2v.save("word2vec.model")

In [74]:
modeltfidf.save("tfidf.model")

In [75]:
dict_context.save("tfidfdict.dict")

In [78]:
with open("xgb.model", "wb") as f:
    pickle.dump(model_xgb, f)

In [92]:
np.save('classes.npy', le.classes_)