# All You Can Git: A NER Model for Git-Specific Entities
- Amanda Kolopanis
- Khaled Badran
- Sharon Chee Yin Ho

This notebook contains the experiments performed for our project. It explores the dataset, feature extraction, model training, and testing. It also covers hyperparameter tuning and examines the results to highlight the most meaningful features for our model.
In this experiment, we use the specialized library [sklearn-crfsuite](https://sklearn-crfsuite.readthedocs.io/en/latest/index.html), which wraps a CRF implementation in an interface that is compatible with the `scikit-learn` library. Hence, in this notebook, we follow their instructions and suggested optimizations (e.g., recommended features) as seen in their [documentation](https://sklearn-crfsuite.readthedocs.io/en/latest/api.html) and [tutorials](https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html).

In [56]:
from itertools import chain
import nltk
import sklearn
import scipy.stats
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
import pandas as pd
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics
import eli5
from nltk import pos_tag
import re
import random
from pathlib import Path
import pickle
import numpy as np
from sklearn.model_selection import train_test_split
import validators
from typing import Dict, List, Tuple

In [57]:
# Set a global random seed to be used in this experiment
np.random.seed(0)

# Dataset
In this part, we read the data we prepared earlier (the `data/data.pickle` file), split it into training and testing sentences, and extract relevant features for our model.

In [58]:
DATA_FOLDER = '../data/'
DATA_FILE = Path(DATA_FOLDER, 'data.pickle')

with open(DATA_FILE, 'rb') as f:
    data = pickle.load(f)

Here is a sample sentence from our dataset. It has been tokenized, and each token is tagged with its part of speech (POS) and entity type (IOB tagging):

In [59]:
print('Token, POS, Type')
data[0]

Token, POS, Type


[('Things', 'NNS', 'O'),
 ('get', 'VBP', 'O'),
 ('more', 'RBR', 'O'),
 ('complex', 'JJ', 'O'),
 ('when', 'WRB', 'O'),
 ('raw', 'JJ', 'O'),
 ('pointers', 'NNS', 'O'),
 ('and', 'CC', 'O'),
 ('unsafe', 'JJ', 'O'),
 ('code', 'NN', 'O'),
 ('is', 'VBZ', 'O'),
 ('involved', 'VBN', 'O'),
 ('see', 'JJ', 'O'),
 ('90718', 'CD', 'B-ISSUE'),
 ('but', 'CC', 'O'),
 ('the', 'DT', 'O'),
 ('original', 'JJ', 'O'),
 ('behavior', 'NN', 'O'),
 ('was', 'VBD', 'O'),
 ('correct', 'JJ', 'O'),
 ('for', 'IN', 'O'),
 ('realistic', 'JJ', 'O'),
 ('code', 'NN', 'O')]
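
To make the IOB convention concrete, the following minimal sketch collects the tagged entities from a sentence in the `(token, POS, tag)` format shown above. The helper name `extract_entities` is hypothetical and is not part of the notebook. It also handles `I-` continuation tags for generality, although the sample above only contains a `B-` tag.

```python
from typing import List, Optional, Tuple


def extract_entities(sentence: List[Tuple[str, str, str]]) -> List[Tuple[str, str]]:
    """Return (entity_text, entity_type) pairs from an IOB-tagged sentence."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, _pos, tag in sentence:
        if tag.startswith('B-'):
            # A new entity starts; close any entity that is still open
            if current_tokens:
                entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith('I-') and current_tokens and tag[2:] == current_type:
            # Continuation of the currently open entity
            current_tokens.append(token)
        else:
            # 'O' tag (or an inconsistent 'I-' tag): close any open entity
            if current_tokens:
                entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None

    # Close an entity that runs to the end of the sentence
    if current_tokens:
        entities.append((' '.join(current_tokens), current_type))
    return entities


# A fragment of the sample sentence shown above
sample = [('see', 'JJ', 'O'), ('90718', 'CD', 'B-ISSUE'), ('but', 'CC', 'O')]
print(extract_entities(sample))  # [('90718', 'ISSUE')]
```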

## Feature Extraction
In this part, we define the functions that extract features from the tokens. For example, we can check whether a token is a digit, which may indicate that it is an issue reference. We also define helper functions that convert a sentence into its list of per-token features and its list of entity types.

In [60]:
def lexical_features(token: str, POS: str) -> Dict:
    """
    Extracts features from a specific token.
    """
    return {
        'bias': 1.0,
        'token': token.lower(),
        'is_title': token.istitle(),
        'is_digit': token.isdigit(),
        'has_digit': any(c.isdigit() for c in token),
        'has_period': '.' in token,
        'POS': POS,
    }


def neighbor_token_lexical_features(token: str, POS: str, position: str) -> Dict:
    """
    Extracts features from neighboring tokens.
    """
    return {
        f'{position}_token': token.lower(),
        f'{position}_token_is_title': token.istitle(),
        f'{position}_token_is_digit': token.isdigit(),
        f'{position}_token_POS': POS,
    }


def token_to_features(sentence: List[Tuple], token_index: int) -> Dict:
    """
    Given a sentence and the index to a token, this function will extract the lexical 
    features from the token and its two neighboring tokens (previous and next).
    """
    token_info = sentence[token_index]
    token = token_info[0]
    POS = token_info[1]
    
    features = lexical_features(token, POS)
    
    # if a previous token exists
    if token_index > 0:
        previous_token_info = sentence[token_index-1]
        previous_token = previous_token_info[0]
        previous_token_POS = previous_token_info[1]
        
        features.update(neighbor_token_lexical_features(previous_token, previous_token_POS, 'previous'))
    else:
        features['beginning_of_sentence'] = True
       
    # if a next token exists
    if token_index < len(sentence)-1:
        next_token_info = sentence[token_index+1]
        next_token = next_token_info[0]
        next_token_POS = next_token_info[1]
        
        features.update(neighbor_token_lexical_features(next_token, next_token_POS, 'next'))
    else:
        features['end_of_sentence'] = True
                
    return features


def sentence_to_features(sentence: List[Tuple]) -> List[Dict]:
    """
    Returns a list of feature dictionaries, one per token in the sentence.
    """
    return [token_to_features(sentence, token_index) for token_index in range(len(sentence))]


def sentence_to_entity_types(sentence: List[Tuple]) -> List[str]:
    """
    Returns a list of entity types (IOB tags) from the sentence.
    """
    return [token_tuple[2] for token_tuple in sentence]
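
As a rough illustration of what these features look like, the sketch below applies a trimmed copy of the token-level feature function to three tokens. The copy is repeated here only so that the block runs on its own. The tokens `README.md` and `master` and their POS tags are illustrative assumptions, not entries taken from the dataset.

```python
from typing import Dict


def lexical_features(token: str, POS: str) -> Dict:
    # Trimmed copy of the notebook's function (bias and is_title omitted)
    return {
        'token': token.lower(),
        'is_digit': token.isdigit(),
        'has_digit': any(c.isdigit() for c in token),
        'has_period': '.' in token,
        'POS': POS,
    }


# Hypothetical tokens: an issue-like number, a file-like name, a branch-like name
examples = [('90718', 'CD'), ('README.md', 'NN'), ('master', 'NN')]
for token, pos in examples:
    print(token, lexical_features(token, pos))
```

In this sketch the all-digit token sets both `is_digit` and `has_digit`, the file-like token sets only `has_period`, and the branch-like token sets none of the boolean features, so it can only be recognized through its identity and its neighbors.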

## Train and Test Split
Here we split the data into training and testing sets. Then, to get the features (X) and target outputs (y) for each set, we use the previously defined functions `sentence_to_features` and `sentence_to_entity_types`.

In [61]:
train_sentences, test_sentences = train_test_split(data, test_size=0.25, random_state=0)

# Extract the features and entity types from the sentences
X_train = [sentence_to_features(s) for s in train_sentences]
y_train = [sentence_to_entity_types(s) for s in train_sentences]

X_test = [sentence_to_features(s) for s in test_sentences]
y_test = [sentence_to_entity_types(s) for s in test_sentences]
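
Note that the split operates on whole sentences, so all tokens of a sentence end up on the same side. The following sketch shows this on toy data. The eight one-token sentences are made up for illustration, and only the `train_test_split` call mirrors the notebook.

```python
from sklearn.model_selection import train_test_split

# Eight hypothetical one-token sentences in the (token, POS, tag) format
toy_sentences = [[(f'token{i}', 'NN', 'O')] for i in range(8)]

train, test = train_test_split(toy_sentences, test_size=0.25, random_state=0)
print(len(train), len(test))  # 6 2

# A fixed random_state makes the split reproducible across runs
train_again, test_again = train_test_split(toy_sentences, test_size=0.25, random_state=0)
print(train == train_again and test == test_again)  # True
```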

# Model Training and Parameter Optimization
In this part, we define a CRF model from the aforementioned `sklearn_crfsuite` library. Then we use a randomized search approach to find the best hyperparameters.

In [62]:
# define the CRF model
crf = sklearn_crfsuite.CRF(all_possible_transitions=True)

# define the parameter space
distributions = {
    'algorithm': ['lbfgs'],
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

# Define the entity types
entity_types = ['B-BRANCH', 'B-FILE', 'B-ISSUE']

# When evaluating the model, focus only on the entity types and not on the non-entity tokens
f1_scorer = make_scorer(metrics.flat_f1_score, average='weighted', labels=entity_types)

# Define Randomized search with 5 cross validation folds and 100 random iterations over the parameter space
clf = RandomizedSearchCV(crf,
                         distributions,
                         cv=5,
                         n_iter=100,
                         scoring=f1_scorer,
                         random_state=0,
                         n_jobs=-1)

clf.fit(X_train, y_train)

best_crf = clf.best_estimator_
print(f'optimal parameters: {clf.best_params_}')

optimal parameters: {'algorithm': 'lbfgs', 'c1': 0.4166891251163398, 'c2': 0.010122937979015215}
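
The `c1` and `c2` entries of the parameter space are continuous distributions from which the randomized search draws one value per iteration. In `sklearn_crfsuite`, they are the coefficients for L1 and L2 regularization, respectively. The sketch below only illustrates what sampling from these two distributions looks like. The sample size and the seed are arbitrary choices, and the values drawn here are not the ones the search above actually tried.

```python
import scipy.stats

# Same distributions as in the parameter space above
c1_dist = scipy.stats.expon(scale=0.5)
c2_dist = scipy.stats.expon(scale=0.05)

# Draw a few candidate values (size and seed are arbitrary for this sketch)
c1_samples = c1_dist.rvs(size=5, random_state=0)
c2_samples = c2_dist.rvs(size=5, random_state=0)

print('c1 candidates:', c1_samples.round(3))
print('c2 candidates:', c2_samples.round(3))

# The mean of an exponential distribution equals its scale, so c2 candidates
# are on average ten times smaller than c1 candidates
print(c1_dist.mean(), c2_dist.mean())
```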


## Model Evaluation
Here we evaluate our model using the held-out testing data. This shows the performance for each entity type alongside aggregate results across all classes.

In [64]:
y_pred = best_crf.predict(X_test)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=entity_types, digits=3
))

              precision    recall  f1-score   support

    B-BRANCH      0.862     0.714     0.781        35
      B-FILE      0.618     0.708     0.660        48
     B-ISSUE      0.800     0.914     0.853        35

   micro avg      0.734     0.771     0.752       118
   macro avg      0.760     0.779     0.765       118
weighted avg      0.744     0.771     0.753       118
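
As a sanity check on how the aggregate rows relate to the per-class rows, the sketch below recomputes the per-class F1 scores and the macro and weighted averages from the precision, recall, and support values printed above. Because the inputs are already rounded to three digits, the results should be read as approximate.

```python
# (precision, recall, support) copied from the report above
report = {
    'B-BRANCH': (0.862, 0.714, 35),
    'B-FILE': (0.618, 0.708, 48),
    'B-ISSUE': (0.800, 0.914, 35),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_scores = {label: f1(p, r) for label, (p, r, _) in report.items()}
total_support = sum(support for _, _, support in report.values())

# Macro average: every class counts equally
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1_scores[label] * support
                  for label, (_, _, support) in report.items()) / total_support

for label, score in f1_scores.items():
    print(f'{label}: {score:.3f}')
print(f'macro avg: {macro_f1:.3f}')
print(f'weighted avg: {weighted_f1:.3f}')
```

The weighted average is pulled below the macro average because `B-FILE`, the class with the lowest F1 score, also has the largest support.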



## Top Features
Now that we have obtained and evaluated our best CRF model, we investigate the top features that the model finds to be most strongly associated with the different entity types in our dataset.

In [65]:
# Show the top 10 features from the model
eli5.show_weights(best_crf, top=10)

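Transition weights (rows: from, columns: to)

| From \ To | O | B-BRANCH | B-FILE | B-ISSUE |
|---|---|---|---|---|
| O | 0.448 | 0.647 | 0.106 | 0.0 |
| B-BRANCH | 0.74 | -0.552 | -0.882 | 0.0 |
| B-FILE | 0.0 | 0.0 | 2.201 | 0.0 |
| B-ISSUE | -0.582 | 0.0 | -0.231 | 0.211 |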
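
One way to read the transition weights above is to rank them: a larger positive weight means the model favors that tag-to-tag transition, and a negative weight means it penalizes it. The following sketch does this with the values copied from the table. It is only an illustration and does not query the trained model.

```python
labels = ['O', 'B-BRANCH', 'B-FILE', 'B-ISSUE']

# Rows are the "from" label, columns the "to" label (values from the table above)
weights = [
    [0.448, 0.647, 0.106, 0.0],
    [0.74, -0.552, -0.882, 0.0],
    [0.0, 0.0, 2.201, 0.0],
    [-0.582, 0.0, -0.231, 0.211],
]

transitions = {
    (from_label, to_label): weights[i][j]
    for i, from_label in enumerate(labels)
    for j, to_label in enumerate(labels)
}

# Sort from the most favored to the most penalized transition
ranked = sorted(transitions.items(), key=lambda item: item[1], reverse=True)

print('most favored:', ranked[0])     # (('B-FILE', 'B-FILE'), 2.201)
print('most penalized:', ranked[-1])  # (('B-BRANCH', 'B-FILE'), -0.882)
```

In this table, the strongest transition is from one file token to another, while a file token directly after a branch token is the most penalized.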

y=O top features

| Weight? | Feature |
|---|---|
| +8.213 | bias |
| +4.110 | token:1 |
| +2.930 | POS:CD |
| +2.497 | token:2 |
| +2.253 | previous_token:least |
| +2.235 | token:32 |
| +2.184 | POS:DT |
| … 70 more positive … | |
| … 51 more negative … | |
| -1.903 | has_digit |

y=B-BRANCH top features

| Weight? | Feature |
|---|---|
| +6.791 | token:staging |
| +5.754 | next_token:branch |
| +5.403 | token:browser-quirks |
| +4.435 | previous_token:branch |
| +4.378 | token:https://github.com/gilou/youtube-dl/tree/data_approach |
| +4.333 | token:fix/signin-issue |
| +4.111 | token:master |
| +3.862 | token:gh-pages |
| +3.722 | token:https://github.com/symfony/symfony/tree/_default |
| +3.667 | POS:CD |

y=B-FILE top features

| Weight? | Feature |
|---|---|
| +7.596 | token:dockerfile |
| +5.029 | has_period |
| +4.236 | next_token:file |
| +3.824 | token:profraw |
| +3.712 | token:readme |
| +2.604 | token:gitignore |
| +2.604 | next_token:editorconfig |
| +2.446 | token:license |
| +2.446 | next_token:etc |
| +2.322 | previous_token:file |

y=B-ISSUE top features

| Weight? | Feature |
|---|---|
| +6.924 | is_digit |
| +5.359 | has_digit |
| +2.552 | token:bvlc/caffe#2610 |
| +2.061 | next_token:comment |
| +1.510 | token:https://github.com/angular/angular.js/issues/16916 |
| +1.497 | next_token:due |
| … 15 more positive … | |
| … 4 more negative … | |
| -0.790 | next_token_POS:IN |
| -2.252 | previous_token_POS:DT |
