**<center style="font-size:26px">Ponicode X 42 - Data Challenge</center>**

# Introduction

## Challenge description

Hello everyone! Glad you made it here. Today's challenge is brought to you by [Ponicode](https://ponicode.com).

This AI on Code challenge asks you to find the best Machine Learning model/pipeline to predict the basic type of JavaScript function arguments: `String`, `Boolean`, `Number`, `Array`, `Object`, `Function`

**Task:**

You'll be provided with function code, argument names and the labels associated with these arguments. Your task is to create a machine learning model that outputs the expected argument type.

Example:
_Given the following function, estimate the most relevant type for the `mail` argument_
```js
function dummyIsEmailValid(mail){
    return mail.endsWith('.com') && mail.includes('@')
}
```

Your model would be right to output that `mail` is a `String`.


**Dataset description**:
We'll provide a structured Dataset to train your pipelines. This dataset contains **4 string columns**:
- `id`
- `function_code`
- `argument_name`
- `argument_type`

Given a `function_code` and an `argument_name` your ML pipeline should output the `argument_type`.



|   | function_code                                                                                      | argument_name | argument_type |
|---|----------------------------------------------------------------------------------------------------|---------------|---------------|
| 0 | """function dummyIsEmailValid(mail){\n    return mail.endsWith('.com') && mail.includes('@')\n}""" | mail          | string        |

You'll be provided with two files: `train_set.csv` and `test_set.csv`. The train set contains all 4 columns mentioned above, and the test set is missing the `argument_type` one. Your model will output this column.


If you have not downloaded the data for the challenge yet you can find it [HERE](https://bit.ly/Ponicode42-ChallengeData)

## Rules of challenge

**Can we work in groups?** You can work on your own or in a group of 2 (a pair)

**How do we submit our work?**
- Download your results under **firstname_lastname.csv** and your notebook under **firstname_lastname.ipynb** (copy of your google colab file)
- Go on Ponicode’s Slack Community and open a conversation with Edmond
- Send to Edmond your csv **and** your ipynb files
- If you are in a group of two, open a group conversation with Edmond and send your group files **firstname1_firstname2_lastname1_lastname2.csv** and .**ipynb**

**When will the winner be announced?** Monday November 30th

**Where will the winner be announced?** On Slack and by email

**Can we ask for support if we have questions about the challenge?** Yes, from 16:00 to 20:00 on Saturday 21st November and from 17:00 to 18:00 on Sunday 22nd November on Ponicode’s Community Slack. Send your questions to Edmond

**On which criteria will the winner be chosen?** On the f1-score of your results. The f1-score is described in detail in the test and later in these slides

**Where do we download the data from?** The data to use for the challenge is available here https://bit.ly/Ponicode42-ChallengeData



## Expected Output

As an output we'd like to have the following notebook with your complete functioning workflow and a result file on the Test set as a `.csv`

Your file should incorporate four columns: 
- `id`: Row identifier of the test set sample
- `argument_type`: Predicted type in [`string`, `boolean`, `number`, `array`, `object`, `function`]
- `full_name`: Your full name, ex 'Jean Dupont'
- `email`: Your email, ex 'jean.dupont@gmail.com'

**Your results will be rated on the f1-score macro avg. Below in the code, we will show you how to access this score**
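
To make the metric concrete, here is a minimal pure-Python sketch of what a macro-averaged f1-score computes: the f1-score of each class is calculated separately, then the scores are averaged with equal weight, so rare classes count as much as frequent ones. The helper name `macro_f1` and the toy labels are illustrative assumptions; in practice `sklearn.metrics.f1_score(y_true, y_pred, average='macro')` should give the same value.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Toy labels, made up for illustration only
y_true = ["string", "string", "number", "boolean"]
y_pred = ["string", "number", "number", "string"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.389
```

Note how the single missed `boolean` sample pulls the score down sharply: its class f1 is 0 and weighs one third of the average.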




# Pipeline example to resolve the challenge
Below is a solution to the challenge. This solution is just an example of how to solve the challenge and you are free to take a completely different approach. Do not hesitate to use other features, vectorizers and models than in the example to reach better results.
- **Step1: Open your data** (do not forget to upload the train and test csvs in the left panel)
- **Step2: Feature engineering**. We use `argument_name`, `function_code` and a `dummy_feature` (created in the code below) as features. This step is **key**. If you want to beat the performance of this pipeline you will have to create better feature(s) than the `dummy_feature`
- **Step3: Vectorize Features**. Machine learning models are mathematical models. For our model to train, we have to transform our features, which are in a text format, into numerical vectors. We will use the `TfidfVectorizer`, but other methods can work better!
- **Step4: Train and test your model**. In this step we will train our model on our vectors and test its performance. We use a `KNeighborsClassifier` as a classifier but other models will definitely work better! Then we test our model on a subset of our dataset to see the performance of the model
- **Step5: Submit dataset**. In this step we use the trained model to make predictions on the dataset to be submitted. Do not forget to input your name and email as shown in the example

# Setup

#### Throughout the exercise we will use the famous pandas library for dataset manipulation. [Here is some documentation on how to use it](https://pandas.pydata.org/docs/). Do not hesitate to consult this doc if at some point the functions used below are too hard to understand.

## Imports

In [None]:
import os
import json
import pandas as pd

## Function definition

In [None]:
def open_set(path):
    return pd.read_csv(path, sep=",")

# Main Code

## **Step 1:** Open your data. Don't forget to upload the data provided to you in your colab files (panel on the left)



Load the training set to train your model


In [None]:
df_train = open_set('./train_set.csv')

In [None]:
df_train.head()

Unnamed: 0,id,function_code,argument_name,argument_type
0,0,"function register(type, compressMode, backfill...",backfillLast,boolean
1,1,"function baseExtremum(array, iteratee, compara...",iteratee,function
2,2,"function baseMergeDeep(object,source,key,srcIn...",key,string
3,3,"function reorder(array, indexes) {\n var ar...",array,array
4,4,"function diffHalfMatchI(longtext, shorttext, i...",i,number


## **Step2: Features engineering**


For this example we will use the `function_code` and `argument_name` as features. We will also create a new feature, `dummy_feature`, based on the token preceding the second occurrence of the `argument_name` in the function code.
For the example of 
```js
function dummyIsEmailValid(mail){
    return mail.endsWith('.com') && mail.includes('@')
}
```
the `dummy_feature` would be `return`. The objective of features is to give us information that will be useful for the model to make good predictions. Here the feature is not super useful but you will easily find ways to capture more useful information.

You can find smarter ways to access the information in the function by using a parser. However, a parser is not necessary to achieve good performance.
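
As one illustration of a feature that captures more information without a parser, the sketch below collects the names of the properties and methods accessed directly on the argument (for `mail`, that is `endsWith` and `includes`, which hint strongly at a string). The function name `get_member_accesses` and the regular expression are assumptions made for this example, not part of the challenge material, and the pattern is deliberately simple: it will also match accesses such as `obj.mail.foo`.

```python
import re


def get_member_accesses(function_code, argument_name):
    """Return the property/method names accessed on the argument, space-separated.

    Hypothetical feature for illustration: a regex approximation, not a parser.
    """
    pattern = r"\b" + re.escape(argument_name) + r"\s*\.\s*([A-Za-z_$][\w$]*)"
    return " ".join(re.findall(pattern, function_code))


code = """function dummyIsEmailValid(mail){
    return mail.endsWith('.com') && mail.includes('@')
}"""
print(get_member_accesses(code, "mail"))  # endsWith includes
```

The result is a plain string, so it could be stored in a new column and vectorized in the same way as the other text features.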

In [None]:
def locate_indexes_argument(code_tokens, argument):
  """Get the indexes of all occurrences of an argument in the code tokens
  Example
  Inputs: code_tokens:["function", "returnString", "(", "word", ")", "{", "return", "word", "}"] , argument: "word"
  Output: [3, 7]
  """
  return [i for i,val in enumerate(code_tokens) if val==argument]

def clean(function_code):
  r"""Replaces \n and \t with spaces and adds spaces around special tokens
  Example
  Input: "function returnString(word){\n\t return word}"
  Output (ignoring repeated spaces): "function returnString ( word ) { return word }"
  """
  to_clean = {
      '\n': ' ', 
      '\t': ' ',
      '(': ' ( ', 
      ')': ' ) ',
      '[': ' [ ',
      ']': ' ] ',
      "=": ' = ',
      ":": ' : ', 
      ",": ' , ',
      ";": ' ; ',
      "{": " { ",
      "}": " } ",
      ".": " . "
      }
  for s in to_clean:
    function_code = function_code.replace(s, to_clean[s])
  return function_code

def get_dummy_feature(code, argument_name):
  """Get the token preceding the second occurrence of the argument_name"""
  code = clean(code)
  code_tokens = code.split(' ')
  code_tokens = [token for token in code_tokens if token != ' ']
  argument_indexes = locate_indexes_argument(code_tokens, argument_name)
  if len(argument_indexes) >= 2:
    index_of_second_apparition = argument_indexes[1]
    return code_tokens[index_of_second_apparition-1]
  else:
    return ''

get_dummy_feature(
    """
function dummyIsEmailValid(mail){
    return mail.endsWith('.com') && mail.includes('@')
}""",
'mail')

'return'
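
One detail worth knowing when you design your own features: `str.split(' ')` returns an empty string for every extra space, and a filter on `token != ' '` does not remove those empty strings, so the "preceding token" can come out empty whenever the cleaned code contains two spaces in a row. The sketch below shows the difference and a whitespace-robust variant. The helper name `token_before_second_occurrence` and the sample string are assumptions for illustration; using this variant would change the feature values, so the outputs recorded in this notebook would no longer match.

```python
# A cleaned snippet with repeated spaces, as produced when spaces are added
# around tokens that were already surrounded by spaces
cleaned = "function f ( a ,  b )  {  return a  +  b }"

# split(' ') keeps an empty string for every extra space...
print(cleaned.split(' '))
# ...while split() with no argument collapses runs of whitespace
print(cleaned.split())


def token_before_second_occurrence(cleaned_code, argument_name):
    """Token preceding the second occurrence of the argument, or '' if absent."""
    tokens = cleaned_code.split()
    indexes = [i for i, token in enumerate(tokens) if token == argument_name]
    return tokens[indexes[1] - 1] if len(indexes) >= 2 else ''


print(token_before_second_occurrence(cleaned, "b"))  # +
```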

In [None]:
df_train['dummy_feature'] = df_train.apply(lambda row: get_dummy_feature(row['function_code'], row['argument_name']), axis=1)

In [None]:
df_train.head()

Unnamed: 0,id,function_code,argument_name,argument_type,dummy_feature
0,0,"function register(type, compressMode, backfill...",backfillLast,boolean,
1,1,"function baseExtremum(array, iteratee, compara...",iteratee,function,
2,2,"function baseMergeDeep(object,source,key,srcIn...",key,string,","
3,3,"function reorder(array, indexes) {\n var ar...",array,array,
4,4,"function diffHalfMatchI(longtext, shorttext, i...",i,number,position


## **Step3: Vectorize Features**

Machine learning models are mathematical models. For our model to train, we have to transform our features, which are in a text format, into numerical vectors. We will use the `TfidfVectorizer`, but other methods can work better! You do not have to understand all the maths behind Tfidf to use it.
Here we proceed like so:
- We use a character-level `TfidfVectorizer`. In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. Find more information on TFIDF here --> [Tfidf-Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) and the code doc [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). You could also use word-level Tfidf, which vectorizes your text by splitting it into words instead of groups of letters
- We vectorize each of the features `argument_name`, `function_code` and `dummy_feature`
- We concatenate the vectors of the 3 features into one big vector that will be used in the next step in our model

Tfidf is not the only method to vectorize text. You can use embeddings (word2vec, fasttext, ...), `CountVectorizer` and other methods. 
For this test, I advise you not to explore these other options unless you have a lot of time. Instead, you could play with the hyperparameters of tfidf or even try word-level tfidf
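
To give an intuition of what a character-level tf-idf computes, here is a small pure-Python sketch. It weights each character n-gram by its count in the text multiplied by a smoothed inverse document frequency, `ln((1 + n_docs) / (1 + doc_freq)) + 1`, then L2-normalises each vector, which to the best of my knowledge corresponds to the `TfidfVectorizer` defaults. The function names and the three sample argument names are assumptions for illustration, and the sketch leaves out the word-boundary padding of `analyzer='char_wb'` and the `max_features` cut-off.

```python
import math
from collections import Counter


def char_ngrams(text, n_min, n_max):
    """All character n-grams of length n_min..n_max found in text."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf(documents, n_min=1, n_max=2):
    """Return one {ngram: weight} dict per document, L2-normalised."""
    counts = [Counter(char_ngrams(doc, n_min, n_max)) for doc in documents]
    n_docs = len(documents)
    # Number of documents in which each n-gram appears at least once
    doc_freq = Counter(gram for count in counts for gram in count)
    vectors = []
    for count in counts:
        weights = {gram: tf * (math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1)
                   for gram, tf in count.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({gram: w / norm for gram, w in weights.items()})
    return vectors


vectors = tfidf(["mail", "email", "index"])
# "x" appears in one name only while "i" appears in all three,
# so "x" gets the larger weight in the vector of "index"
print(round(vectors[2]["x"], 3), round(vectors[2]["i"], 3))
```

N-grams that are rare across the dataset therefore stand out, while n-grams present everywhere contribute little to the vector.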

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
import numpy as np

def feature_vectorizer(col, nrange=(1,4)):
  """Instantiate and fit a feature vectorizer"""
  char_vectorizer = TfidfVectorizer(
    strip_accents='unicode',
    analyzer='char_wb',
    ngram_range=nrange,
    max_features=200,
    )
  char_vectorizer.fit(col)
  return char_vectorizer

# Dictionary of all vectorisers
vectorizers = {
    'function_code':  feature_vectorizer(df_train['function_code'], nrange=(3,5)),
    'argument_name': feature_vectorizer(df_train['argument_name'], nrange=(1,3)),
    'dummy_feature': feature_vectorizer(df_train['dummy_feature'], nrange=(2,4))
}

def vectorize_features(df):
  """Takes a dataset and creates the X_vector column by concatenating the vectors of all features"""
  df['X_vector'] = df.apply(lambda row: np.concatenate([
                              vectorizer.transform([row[data]]).toarray() for data, vectorizer in vectorizers.items()], axis=1)[0]
                            , axis=1
  )
  return df

df_train = vectorize_features(df_train)
df_train.head()

Unnamed: 0,id,function_code,argument_name,argument_type,dummy_feature,X_vector
0,0,"function register(type, compressMode, backfill...",backfillLast,boolean,,"[0.0, 0.0, 0.0, 0.0, 0.07308017873249184, 0.0,..."
1,1,"function baseExtremum(array, iteratee, compara...",iteratee,function,,"[0.0, 0.0, 0.0, 0.0, 0.14947369377023104, 0.08..."
2,2,"function baseMergeDeep(object,source,key,srcIn...",key,string,",","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.007..."
3,3,"function reorder(array, indexes) {\n var ar...",array,array,,"[0.0, 0.0, 0.0, 0.0, 0.07499323662725493, 0.0,..."
4,4,"function diffHalfMatchI(longtext, shorttext, i...",i,number,position,"[0.017642548360458407, 0.08397613979998081, 0...."


## **Step4: Train and test your model**

 #### In this step we will train our model on our vectors and test its performance. We use a KNeighborsClassifier as a classifier but other models will definitely work better! Then we test our model on a subset of our dataset to see the performance of the model.
 Other classifiers you may use:
 - [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)
 - [SVM Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
 - [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
 - [And others, feel free to search on the internet!](https://www.google.com/)

 The classifier is a model that will learn to predict a certain class (here the type of the parameter, `argument_type`, so we have 6 classes) based on a vector `X_vector`.
 - In the first step the model learns to predict (this is the training) 
 - In the second step we see how good it is at predicting by measuring its performance (here the f1 score) on a test set.
 - As long as we are not satisfied with the performance on the test set, we can change the hyperparameters of the model and/or change the features and/or change the vectorizing method.
 - Once we are satisfied we have our model! There are many ways to test a classifier. [Click here to learn more](https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623).
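
To demystify the classifier used in this example, here is a minimal pure-Python sketch of the k-nearest-neighbours idea: to classify a vector, find the `k` training vectors closest to it (Euclidean distance) and return the most frequent label among them. The function `knn_predict` and the toy 2-dimensional vectors are assumptions for illustration; the scikit-learn `KNeighborsClassifier` follows the same principle with a far more efficient neighbour search.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, x, k=5):
    """Predict the majority label among the k training vectors closest to x."""
    distances = sorted(
        (math.dist(row, x), label) for row, label in zip(X_train, y_train)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy vectors, made up for illustration only
X_train = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [1.0, 1.0], [0.9, 1.0]]
y_train = ["string", "string", "string", "number", "number"]

print(knn_predict(X_train, y_train, [0.05, 0.1], k=3))  # string
print(knn_predict(X_train, y_train, [0.95, 1.0], k=3))  # number
```

There is no real training phase here: "fitting" a k-nearest-neighbours model mostly means storing the training vectors, which is why the quality of the features and of the vectorization matters so much for this model.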

### Dependencies: If you want to use different models you will have to import them

In [None]:
from sklearn.neighbors import KNeighborsClassifier # Model
from sklearn.model_selection import train_test_split # Splits the dataset into a training set and a testing set
from sklearn.metrics import classification_report # Function that outputs the performance of the model, including the f1 score

### Split between a train and a test set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_train['X_vector'], df_train['argument_type'], test_size = 0.3, random_state=0, shuffle = True)

X_train = np.array(X_train.tolist()) # Convert in a format for the classifier module to work
X_test = np.array(X_test.tolist()) # Convert in a format for the classifier module to work
y_train = np.array(y_train.tolist()) # Convert in a format for the classifier module to work
y_test = np.array(y_test.tolist()) # Convert in a format for the classifier module to work

### Train your model

In [None]:
rf = KNeighborsClassifier()
rf.fit(X_train, y_train) # Fit means train

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

#### Congratulations, your model is trained! Now let's see how well it works with the dummy email function

In [None]:
def predict_parameter_type(function_code, argument_name):
  context = get_dummy_feature(function_code, argument_name)
  df_parameter = pd.DataFrame(
      {
          'function_code': [function_code],
          'argument_name': [argument_name],
          'dummy_feature': [context]
      }
  )
  df_parameter = vectorize_features(df_parameter)
  X = np.array(df_parameter['X_vector'].tolist())
  return rf.predict(X)

function_code = """
  function dummyIsEmailValid(mail){
    return mail.endsWith('.com') && mail.includes('@')
}"""
argument_name = "mail"
prediction = predict_parameter_type(function_code, argument_name)[0]
print('The type of parameter', argument_name, 
      'in the function\n', function_code, 
      '\nis classified by our classifier as', prediction)

The type of parameter mail in the function
 
  function dummyIsEmailValid(mail){
    return mail.endsWith('.com') && mail.includes('@')
} 
is classified by our classifier as string


Congratulations, the prediction is `string`!

### Test your model

In [None]:
y_pred = rf.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

       array       0.82      0.79      0.80       189
     boolean       0.62      0.46      0.53        71
    function       0.82      0.91      0.86       182
      number       0.83      0.83      0.83       372
      object       0.79      0.79      0.79       440
      string       0.81      0.83      0.82       546

    accuracy                           0.81      1800
   macro avg       0.78      0.77      0.77      1800
weighted avg       0.81      0.81      0.81      1800



#### We can see pretty good results: accuracy is at 0.81 and the macro-average f1-score at 0.77. If you test a different feature than the `dummy_feature` in step 2 and different models in step 4, you should be able to beat these performances by far (0.89 is well within reach).
**Your results will be rated on f1-score macro avg. Here your mark would be 0.77**
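
To see where the `macro avg` and `weighted avg` lines of the report come from, the short sketch below recomputes them from the per-class f1-scores and supports printed above. The dictionary simply copies the rounded values of the report, so the results are approximate.

```python
# (f1-score, support) per class, copied from the classification report above
report = {
    "array": (0.80, 189),
    "boolean": (0.53, 71),
    "function": (0.86, 182),
    "number": (0.83, 372),
    "object": (0.79, 440),
    "string": (0.82, 546),
}

# Macro average: every class counts the same, whatever its size
macro_avg = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: every class counts in proportion to its number of samples
total = sum(support for _, support in report.values())
weighted_avg = sum(f1 * support for f1, support in report.values()) / total

print(round(macro_avg, 2), round(weighted_avg, 2))  # 0.77 0.81
```

The `boolean` class has the fewest samples and the lowest f1-score; it barely affects the weighted average but weighs one sixth of the macro average, so improving it is an efficient way to raise your mark.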

## **Step5: Submit dataset**

#### Upload the test set and open it

In [None]:
df_submission = open_set('./test_set.csv')

In [None]:
df_submission.head()

Unnamed: 0,id,function_code,argument_name,argument_type
0,0,function randInt(range) {\n return Math.rando...,range,
1,1,"function minBy(array, iteratee) {\n retur...",iteratee,
2,2,"function createCylinderVertices(radius, height...",radialSubdivisions,
3,3,"function loadStateData(data, loader, options)\...",data,
4,4,"function process (oObject, oResourceBundle, iC...",oDataSources,


#### Make your predictions on the dataframe submission

In [None]:
df_submission['dummy_feature'] = df_submission.apply(lambda row: get_dummy_feature(row['function_code'], row['argument_name']), axis=1)
df_submission = vectorize_features(df_submission)
X_submission = df_submission['X_vector']
X_submission = np.array(X_submission.tolist())
predictions = rf.predict(X_submission)
df_submission['argument_type'] = predictions

Now your predictions are in the `argument_type` column

In [None]:
df_submission.head()

Unnamed: 0,id,function_code,argument_name,argument_type,dummy_feature,X_vector
0,0,function randInt(range) {\n return Math.rando...,range,number,*,"[0.38683497345829343, 0.0, 0.0, 0.0, 0.0, 0.0,..."
1,1,"function minBy(array, iteratee) {\n retur...",iteratee,function,(,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.083..."
2,2,"function createCylinderVertices(radius, height...",radialSubdivisions,number,,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,3,"function loadStateData(data, loader, options)\...",data,array,(,"[0.0, 0.0, 0.16234115637134966, 0.165184848131..."
4,4,"function process (oObject, oResourceBundle, iC...",oDataSources,object,,"[0.0, 0.02859004385148536, 0.0, 0.0, 0.0345370..."


### Save your predictions and submit

**Reminder**
As an output we'd like to have the following notebook with your complete functioning workflow and a result file on the Test set as a `.csv`

Your file should incorporate four columns: 
- `id`: Row identifier of the test set sample
- `argument_type`: Predicted type in [`string`, `boolean`, `number`, `array`, `object`, `function`]
- `full_name`: Your full name, ex 'Jean Dupont'
- `email`: Your email, ex 'jean.dupont@gmail.com'

**Your results will be rated on the f1-score macro avg** 


In [None]:
def save_submission(df_submission, full_name, email):
  df_submission['full_name'] = full_name
  df_submission['email'] = email
  df_submission = df_submission[['full_name', 'email', 'id', 'argument_type']]
  file_name = "jean_dupont.csv" # Enter your filename here, following the conventions given in the rules of the challenge
  df_submission.to_csv(file_name, sep=",") # Write to the file name chosen above
  return df_submission



full_name =  'Jean Dupont'
email = 'jean.dupont@gmail.com'
save_submission(df_submission, full_name, email)

Unnamed: 0,full_name,email,id,argument_type
0,Jean Dupont,jean.dupont@gmail.com,0,number
1,Jean Dupont,jean.dupont@gmail.com,1,function
2,Jean Dupont,jean.dupont@gmail.com,2,number
3,Jean Dupont,jean.dupont@gmail.com,3,array
4,Jean Dupont,jean.dupont@gmail.com,4,object
...,...,...,...,...
1562,Jean Dupont,jean.dupont@gmail.com,1562,number
1563,Jean Dupont,jean.dupont@gmail.com,1563,string
1564,Jean Dupont,jean.dupont@gmail.com,1564,object
1565,Jean Dupont,jean.dupont@gmail.com,1565,array


**How do we submit our work?**
- Download your results under **firstname_lastname.csv** and your notebook under **firstname_lastname.ipynb** (copy of your google colab file)
- Go on Ponicode’s Slack Community and open a conversation with Edmond
- Send to Edmond your csv **and** your ipynb files
- If you are in a group of two, open a group conversation with Edmond and send your group files **firstname1_firstname2_lastname1_lastname2.csv** and .**ipynb**

Thank you very much for doing the AI on code Ponicode challenge!