# All You Can Git: A NER Model for Git-Specific Entities
- Amanda Kolopanis
- Khaled Badran
- Sharon Chee Yin Ho

This notebook contains the experiments performed for our project. It explores the dataset, feature extraction, model training, and testing. It also covers hyperparameter tuning and examines the results to highlight the most meaningful features for our model.
In this experiment, we use a specialized library, [sklearn-crfsuite](https://sklearn-crfsuite.readthedocs.io/en/latest/index.html), that provides a wrapper around a CRF implementation and is compatible with the `scikit-learn` library. Hence, in this notebook, we follow their instructions and suggested optimizations (e.g., recommended features) as seen in their [documentation](https://sklearn-crfsuite.readthedocs.io/en/latest/api.html) and [tutorials](https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html).

In [1]:
from itertools import chain
import nltk
import sklearn
import scipy.stats
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
import pandas as pd
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics
import eli5
from nltk import pos_tag
import re
import random
from pathlib import Path
import pickle
import numpy as np
from sklearn.model_selection import train_test_split
import validators
from typing import Dict, List, Tuple

In [2]:
# Set a global random seed to be used in this experiment
np.random.seed(0)

# Dataset
In this part, we read the data prepared earlier (the `data/data.pickle` file), split it into training and testing sentences, and extract relevant features for our model.

In [3]:
DATA_FOLDER = '../data/'
DATA_FILE = Path(DATA_FOLDER, 'data.pickle')

with open(DATA_FILE, 'rb') as f:
    data = pickle.load(f)
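
For readers who do not have `data/data.pickle` at hand, the following sketch shows the structure the rest of the notebook expects: a list of sentences, each a list of `(token, POS, tag)` tuples. The toy sentences and the temporary file name are made up for illustration and are not part of our dataset.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical toy data in the same shape as data.pickle:
# a list of sentences, each a list of (token, POS, IOB tag) tuples.
toy_data = [
    [('see', 'VB', 'O'), ('90718', 'CD', 'B-ISSUE')],
    [('the', 'DT', 'O'), ('1.4-upgrade', 'JJ', 'B-BRANCH'), ('branch', 'NN', 'O')],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_file = Path(tmp_dir, 'toy.pickle')  # assumed file name
    with open(toy_file, 'wb') as f:
        pickle.dump(toy_data, f)
    with open(toy_file, 'rb') as f:
        loaded = pickle.load(f)

# The nested lists and tuples survive the round trip unchanged.
print(loaded == toy_data)
```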

Now, let's explore the data. First, we check a sample instance. Each instance is a tokenized sentence in which every token is tagged with its part of speech (POS) and its entity type (IOB tagging):

In [4]:
print('Token, POS, Type')
data[0]

Token, POS, Type


[('Things', 'NNS', 'O'),
 ('get', 'VBP', 'O'),
 ('more', 'RBR', 'O'),
 ('complex', 'JJ', 'O'),
 ('when', 'WRB', 'O'),
 ('raw', 'JJ', 'O'),
 ('pointers', 'NNS', 'O'),
 ('and', 'CC', 'O'),
 ('unsafe', 'JJ', 'O'),
 ('code', 'NN', 'O'),
 ('is', 'VBZ', 'O'),
 ('involved', 'VBN', 'O'),
 ('see', 'JJ', 'O'),
 ('90718', 'CD', 'B-ISSUE'),
 ('but', 'CC', 'O'),
 ('the', 'DT', 'O'),
 ('original', 'JJ', 'O'),
 ('behavior', 'NN', 'O'),
 ('was', 'VBD', 'O'),
 ('correct', 'JJ', 'O'),
 ('for', 'IN', 'O'),
 ('realistic', 'JJ', 'O'),
 ('code', 'NN', 'O')]
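
To make the tuple layout concrete, here is a small sketch that separates the three columns of a tagged sentence and pulls out the tagged entities. It uses a three-token excerpt of the sample above; `extract_entities` is a hypothetical helper written for this illustration, not a function used elsewhere in the notebook.

```python
from typing import List, Tuple

# A three-token excerpt of the sample sentence shown above.
sentence = [('see', 'JJ', 'O'), ('90718', 'CD', 'B-ISSUE'), ('but', 'CC', 'O')]


def extract_entities(tagged: List[Tuple[str, str, str]]) -> List[Tuple[str, str]]:
    """Return (token, entity type) pairs for every token not tagged 'O'."""
    return [(token, tag.split('-', 1)[1]) for token, _pos, tag in tagged if tag != 'O']


# zip(*...) transposes the list of tuples into three parallel columns.
tokens, pos_tags, iob_tags = zip(*sentence)
print(tokens)
print(extract_entities(sentence))
```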

We can also explore the size of our data to see how many sentences we have!

In [5]:
print('Number of unique sentences:', len(data))

Number of unique sentences: 375


We can also count the tokens and see how they are distributed across parts of speech and entity types:

In [6]:
# flatten the data so that all tokens are in one list
all_tokens = np.array(list(chain(*data)))
print('The total number of tokens in all sentences:', all_tokens.shape[0])

# check the number of occurrences for each part of speech category
unique_POS, count_POS = np.unique(all_tokens[:,1], return_counts=True)
print('Here are the number of occurrences for each POS category')
dict(zip(unique_POS, count_POS))

The total number of tokens in all sentences: 6971
Here are the number of occurrences for each POS category


{'CC': 199,
 'CD': 253,
 'DT': 739,
 'EX': 22,
 'FW': 1,
 'IN': 742,
 'JJ': 453,
 'JJR': 11,
 'JJS': 17,
 'MD': 112,
 'NN': 1389,
 'NNP': 385,
 'NNPS': 1,
 'NNS': 274,
 'PDT': 5,
 'PRP': 281,
 'PRP$': 61,
 'RB': 335,
 'RBR': 11,
 'RP': 24,
 'TO': 233,
 'UH': 1,
 'VB': 380,
 'VBD': 125,
 'VBG': 185,
 'VBN': 196,
 'VBP': 187,
 'VBZ': 257,
 'WDT': 40,
 'WP': 12,
 'WP$': 1,
 'WRB': 39}

In [7]:
# check the number of occurrences for each entity type
unique_tag, count_tag = np.unique(all_tokens[:,2], return_counts=True)
print('Here are the number of occurrences for each Entity Type')
dict(zip(unique_tag, count_tag))


Here are the number of occurrences for each Entity Type


{'B-BRANCH': 157, 'B-FILE': 158, 'B-ISSUE': 154, 'O': 6502}
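
These counts show a strong class imbalance, which is why the evaluation later in this notebook scores only the entity labels. The short sketch below recomputes the shares from the counts printed above; it hard-codes those counts so that it runs on its own.

```python
# Entity-type counts copied from the output above.
tag_counts = {'B-BRANCH': 157, 'B-FILE': 158, 'B-ISSUE': 154, 'O': 6502}

total = sum(tag_counts.values())
entity_total = total - tag_counts['O']

# Share of each tag among all tokens.
for tag, count in tag_counts.items():
    print(f'{tag:<9} {count:>5} {count / total:7.2%}')

print(f'entity tokens: {entity_total} of {total} ({entity_total / total:.2%})')
```

Roughly 7% of all tokens carry an entity tag, so a model that always predicts `O` would look accurate while finding no entities at all.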

## Feature Extraction
In this part, we define the functions that extract features from the tokens. For example, we check whether a token is a digit, which can indicate that it is an issue reference. We also define helper functions that convert a sentence into its per-token features and its labels.

In [8]:
def lexical_features(token: str, POS: str) -> Dict:
    """
    Extracts features from a specific token.
    """
    return {
        'bias': 1.0,
        'token': token.lower(),
        'is_title': token.istitle(),
        'is_digit': token.isdigit(),
        'has_digit': any(c.isdigit() for c in token),
        'has_period': '.' in token,
        'POS': POS,
    }


def neighbor_token_lexical_features(token: str, POS: str, position: str) -> Dict:
    """
    Extracts features from neighboring tokens.
    """
    return {
        f'{position}_token': token.lower(),
        f'{position}_token_is_title': token.istitle(),
        f'{position}_token_is_digit': token.isdigit(),
        f'{position}_token_POS': POS,
    }


def token_to_features(sentence: List[Tuple], token_index: int) -> Dict:
    """
    Given a sentence and the index to a token, this function will extract the lexical 
    features from the token and its two neighboring tokens (previous and next).
    """
    token_info = sentence[token_index]
    token = token_info[0]
    POS = token_info[1]
    
    features = lexical_features(token, POS)
    
    # if a previous token exists
    if token_index > 0:
        previous_token_info = sentence[token_index-1]
        previous_token = previous_token_info[0]
        previous_token_POS = previous_token_info[1]
        
        features.update(neighbor_token_lexical_features(previous_token, previous_token_POS, 'previous'))
    else:
        features['begging_of_sentence'] = True
       
    # if a next token exists
    if token_index < len(sentence)-1:
        next_token_info = sentence[token_index+1]
        next_token = next_token_info[0]
        next_token_POS = next_token_info[1]
        
        features.update(neighbor_token_lexical_features(next_token, next_token_POS, 'next'))
    else:
        features['end_of_sentence'] = True
                
    return features


def sentence_to_features(sentence):
    return [token_to_features(sentence, token_index_) for token_index_ in range(len(sentence))]


def sentence_to_labels(sentence) -> List[str]:
    """
    Returns a list of entity types (IOB tags) from the sentence.
    """
    return [token_tuple[2] for token_tuple in sentence]
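
To see how the boolean surface features behave on Git-like tokens, the sketch below evaluates the same string predicates used in `lexical_features` on a few examples. `surface_flags` is a hypothetical helper for illustration only, and `README.md` is a made-up example token.

```python
def surface_flags(token: str) -> dict:
    """Hypothetical helper: the boolean surface features from above, in isolation."""
    return {
        'is_title': token.istitle(),
        'is_digit': token.isdigit(),
        'has_digit': any(c.isdigit() for c in token),
        'has_period': '.' in token,
    }


for token in ['90718', '1.4-upgrade', 'README.md', 'Things']:
    print(f'{token:<12}', surface_flags(token))
```

Note that `str.isdigit` is true only when every character is a digit, so a version-like branch name such as `1.4-upgrade` sets `has_digit` and `has_period` but not `is_digit`.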

## Train and Test Split
Here we split the data into training and test sets. Then, to get the training features (X) and target output (y), we use the previously defined functions `sentence_to_features` and `sentence_to_labels`.

In [9]:
train_sentences, test_sentences = train_test_split(data, test_size=0.25, random_state=0)

# Extract the features and labels from the sentences
X_train = [sentence_to_features(s) for s in train_sentences]
y_train = [sentence_to_labels(s) for s in train_sentences]

X_test = [sentence_to_features(s) for s in test_sentences]
y_test = [sentence_to_labels(s) for s in test_sentences]
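
The split is made at the sentence level, so all tokens of a sentence end up on the same side. The sketch below works out the resulting set sizes for 375 sentences and mimics the idea with the standard library only. It assumes that scikit-learn rounds the test share up and assigns the remainder to training; the shuffle shown here is a stand-in and will not select the same sentences as `train_test_split`.

```python
import math
import random

n_sentences = 375
test_size = 0.25

# Assumed rounding: the test share is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_sentences)
n_train = n_sentences - n_test
print(n_train, n_test)

# Standard-library stand-in for a shuffled split over sentence indices.
indices = list(range(n_sentences))
random.Random(0).shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]

# No sentence appears on both sides of the split.
assert not set(train_idx) & set(test_idx)
```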

To see what the featurized data looks like, we can inspect the first sentence in our training data. To limit the size of the output, we look at only three tokens of this sentence (indices 9 to 11).

In [10]:
X_train[0][9:12]

[{'bias': 1.0,
  'token': 'a',
  'is_title': False,
  'is_digit': False,
  'has_digit': False,
  'has_period': False,
  'POS': 'DT',
  'previous_token': 'is',
  'previous_token_is_title': False,
  'previous_token_is_digit': False,
  'previous_token_POS': 'VBZ',
  'next_token': '1.4-upgrade',
  'next_token_is_title': False,
  'next_token_is_digit': False,
  'next_token_POS': 'JJ'},
 {'bias': 1.0,
  'token': '1.4-upgrade',
  'is_title': False,
  'is_digit': False,
  'has_digit': True,
  'has_period': True,
  'POS': 'JJ',
  'previous_token': 'a',
  'previous_token_is_title': False,
  'previous_token_is_digit': False,
  'previous_token_POS': 'DT',
  'next_token': 'branch',
  'next_token_is_title': False,
  'next_token_is_digit': False,
  'next_token_POS': 'NN'},
 {'bias': 1.0,
  'token': 'branch',
  'is_title': False,
  'is_digit': False,
  'has_digit': False,
  'has_period': False,
  'POS': 'NN',
  'previous_token': '1.4-upgrade',
  'previous_token_is_title': False,
  'previous_token_is_d

Now by looking at the target labels (entity types) of these tokens, we can see that the middle token `1.4-upgrade` is an entity of type branch:

In [11]:
y_train[0][9:12]

['O', 'B-BRANCH', 'O']

# Model Training and Parameter Optimization
In this part, we will define a CRF model from the aforementioned `sklearn_crfsuite` library. Then we will use a randomized search approach to find the best hyperparameters.
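
As background, a linear-chain CRF is usually written as follows. This is the standard textbook form, not something taken from the library's source; the symbols $\lambda_k$ (feature weights), $f_k$ (feature functions), and $Z(x)$ (normalizer) are our notation.

$$
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big)
$$

For the `lbfgs` algorithm, the `c1` and `c2` entries of the search space are the coefficients of the L1 and L2 regularization terms. Up to the library's exact scaling conventions, which we do not verify here, training minimizes an objective of the form

$$
\min_{\lambda} \; -\sum_{i} \log p\big(y^{(i)} \mid x^{(i)}\big) + c_1 \lVert \lambda \rVert_1 + c_2 \lVert \lambda \rVert_2^2
$$

The other algorithms in the search space optimize different objectives, so this form applies to `lbfgs` only.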

In [25]:
# define the CRF model
crf = sklearn_crfsuite.CRF(all_possible_transitions=True)

# define the parameter space
distributions = {
    'algorithm': ['lbfgs','l2sgd', 'ap', 'pa', 'arow'],
    'min_freq': [0, 3, 6],
    'epsilon': scipy.stats.expon(scale=1e-5),
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

# Define the entity types
entity_types = ['B-BRANCH', 'B-FILE', 'B-ISSUE']

# When evaluating the model, focus only on the entity types and not on the non-entity tokens
f1_scorer = make_scorer(metrics.flat_f1_score, average='weighted', labels=entity_types)

# Define randomized search with 5 cross-validation folds and 200 random iterations over the parameter space
clf = RandomizedSearchCV(crf,
                         distributions,
                         cv=5,
                         n_iter=200,
                         scoring=f1_scorer,
                         random_state=0,
                         n_jobs=-1)

clf.fit(X_train, y_train)

best_crf = clf.best_estimator_
print(f'optimal parameters: {clf.best_params_}')

optimal parameters: {'algorithm': 'lbfgs', 'c1': 0.46856935664517396, 'c2': 0.045398200558622, 'epsilon': 1.5308312136443692e-05, 'min_freq': 0}
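
To get a feel for the values the search draws, the sketch below samples from exponential distributions with the same `scale` values as in the parameter space. It uses `random.expovariate` from the standard library as a stand-in for `scipy.stats.expon`; the two parameterize the same distribution (rate = 1 / scale), but the individual draws will differ from those made during the search.

```python
import random
import statistics

rng = random.Random(0)


def sample_expon(scale: float, n: int) -> list:
    """Draw n values from an exponential distribution whose mean is `scale`."""
    # random.expovariate takes the rate, which is 1 / scale.
    return [rng.expovariate(1.0 / scale) for _ in range(n)]


for name, scale in [('c1', 0.5), ('c2', 0.05), ('epsilon', 1e-5)]:
    draws = sample_expon(scale, 10_000)
    print(f'{name}: mean={statistics.mean(draws):.3g}, '
          f'median={statistics.median(draws):.3g}, max={max(draws):.3g}')
```

The selected values (`c1` ≈ 0.47, `c2` ≈ 0.045) lie close to the means of their respective distributions.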


## Model Evaluation
Here we evaluate our model on the held-out test data. This shows the performance for each entity type alongside aggregate results over all entity classes.

In [26]:
y_pred = best_crf.predict(X_test)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=entity_types, digits=3
))

              precision    recall  f1-score   support

    B-BRANCH      0.900     0.771     0.831        35
      B-FILE      0.630     0.708     0.667        48
     B-ISSUE      0.800     0.914     0.853        35

   micro avg      0.750     0.788     0.769       118
   macro avg      0.777     0.798     0.784       118
weighted avg      0.760     0.788     0.771       118
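
The aggregate rows can be checked by hand. The sketch below recomputes each F1 score as the harmonic mean of precision and recall, then derives the macro and weighted averages from the per-class F1 scores and supports. All numbers are copied from the report above; because they are rounded to three digits, only three-digit agreement should be expected.

```python
# Per-class rows copied from the report: (precision, recall, f1, support).
rows = {
    'B-BRANCH': (0.900, 0.771, 0.831, 35),
    'B-FILE': (0.630, 0.708, 0.667, 48),
    'B-ISSUE': (0.800, 0.914, 0.853, 35),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for label, (p, r, reported_f1, _support) in rows.items():
    print(label, round(f1(p, r), 3), reported_f1)

total_support = sum(s for *_, s in rows.values())
# Macro average: every class counts equally.
macro_f1 = sum(f for _, _, f, _ in rows.values()) / len(rows)
# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(f * s for _, _, f, s in rows.values()) / total_support
print(round(macro_f1, 3), round(weighted_f1, 3))
```

The weighted average is lower than the macro average because `B-FILE`, the weakest class, has the largest support.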



## Top Features
Now that we have obtained and evaluated our best CRF model, we investigate which features the model weights most heavily for each entity type in our dataset.

In [27]:
# Show the top 10 features from the model
eli5.show_weights(best_crf, top=10)

| From \ To | O | B-BRANCH | B-FILE | B-ISSUE |
|---|---|---|---|---|
| O | 0.94 | 0.826 | 0.31 | 0.0 |
| B-BRANCH | 0.66 | -0.599 | -0.793 | 0.0 |
| B-FILE | 0.0 | 0.0 | 1.735 | 0.0 |
| B-ISSUE | -0.285 | -0.168 | -0.357 | 0.168 |

y=O top features

| Weight? | Feature |
|---|---|
| +6.703 | bias |
| +3.267 | token:1 |
| +2.369 | POS:CD |
| +1.960 | POS:DT |
| +1.925 | token:2 |
| … 74 more positive … | |
| … 63 more negative … | |
| -1.699 | POS:NNP |
| -1.787 | token:readme |
| -1.841 | token:master |

y=B-BRANCH top features

| Weight? | Feature |
|---|---|
| +5.248 | token:staging |
| +4.554 | next_token:branch |
| +3.990 | previous_token:branch |
| +3.479 | token:browser-quirks |
| +3.222 | token:https://github.com/gilou/youtube-dl/tree/data_approach |
| +3.221 | token:master |
| +2.891 | next_token:branches |
| +2.851 | token:5.2 |
| +2.849 | POS:CD |
| +2.823 | token:fix/signin-issue |

y=B-FILE top features

| Weight? | Feature |
|---|---|
| +4.488 | has_period |
| +4.391 | token:dockerfile |
| +3.757 | next_token:file |
| +2.970 | token:profraw |
| +2.814 | token:readme |
| +1.949 | next_token:files |
| +1.865 | token:gitignore |
| +1.865 | next_token:editorconfig |
| +1.862 | previous_token:file |
| … 54 more positive … | |

y=B-ISSUE top features

| Weight? | Feature |
|---|---|
| +5.231 | is_digit |
| +4.900 | has_digit |
| +1.532 | next_token:comment |
| +1.513 | token:bvlc/caffe#2610 |
| +1.056 | token:https://github.com/angular/angular.js/issues/16916 |
| +1.051 | next_token:due |
| … 18 more positive … | |
| … 5 more negative … | |
| -0.874 | begging_of_sentence |
| -2.028 | previous_token_POS:DT |
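
One way to read the transition weights reported at the start of this output is to rank them. The sketch below hard-codes those values and lists the most and least favored label transitions. In a CRF, a larger weight means the model favors that label pair more strongly, all else being equal; this is a reading aid, not an additional experiment.

```python
# Transition weights copied from the "From \ To" table above.
labels = ['O', 'B-BRANCH', 'B-FILE', 'B-ISSUE']
weights = [
    [0.94, 0.826, 0.31, 0.0],      # from O
    [0.66, -0.599, -0.793, 0.0],   # from B-BRANCH
    [0.0, 0.0, 1.735, 0.0],        # from B-FILE
    [-0.285, -0.168, -0.357, 0.168],  # from B-ISSUE
]

transitions = {
    (src, dst): weights[i][j]
    for i, src in enumerate(labels)
    for j, dst in enumerate(labels)
}

# Sort from the most favored to the least favored transition.
ranked = sorted(transitions.items(), key=lambda item: item[1], reverse=True)

print('Most favored:')
for (src, dst), weight in ranked[:3]:
    print(f'  {src} -> {dst}: {weight:+.3f}')

print('Least favored:')
for (src, dst), weight in ranked[-3:]:
    print(f'  {src} -> {dst}: {weight:+.3f}')
```

The strongest transition is `B-FILE` to `B-FILE`, which suggests that file names tend to appear next to each other in our sentences.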
