# FAISS on GitHub issues using Transformers

Goal: implement an algorithm that recommends possible solutions to a user's issue.

Methodology: FAISS (Facebook AI Similarity Search) on embeddings of GitHub issues, computed with a pretrained Transformer model.

Inspired by: https://huggingface.co/course/chapter5/6?fw=tf

<p><a href="https://colab.research.google.com/drive/1i4q3EFH38ltMXmVxWHpcJMgqWph1Ls9M" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" align="left"></a>&nbsp;to run on GPU (Runtime > Change Runtime Type > GPU)</p>

Preprocessing and embedding the issues take a while (the embedding step took about 57 minutes in the run recorded below). A reload cell is provided.

Jump to:

[Reload preprocessed data](#reload-preprocessed-data)

In [1]:
from huggingface_hub import hf_hub_url
from datasets import load_dataset
from transformers import AutoTokenizer, TFAutoModel
import pandas as pd

## Load the data from the Hugging Face Hub

In [2]:
data_files = hf_hub_url(
    repo_id="lewtun/github-issues",
    filename="datasets-issues-with-comments.jsonl",
    repo_type='dataset',
)
issues_dataset = load_dataset('json', data_files=data_files, split='train')
issues_dataset

Using custom data configuration default-6a579f365d89f2f1
Reusing dataset json (C:\Users\federico trifoglio\.cache\huggingface\datasets\json\default-6a579f365d89f2f1\0.0.0\ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)


Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 3019
})

## Data cleaning

- remove pull requests
- remove issues with no replies
- remove non-informative columns

In [3]:
# keep only real issues (not pull requests) that have at least one comment
issues_dataset = issues_dataset.filter(lambda x: (not x['is_pull_request'] and len(x['comments']) > 0))
columns = issues_dataset.column_names
columns_to_keep = ['html_url', 'title', 'comments', 'body']
# drop every column that is not explicitly kept
columns_to_remove = [c for c in columns if c not in columns_to_keep]
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
issues_dataset

Loading cached processed dataset at C:\Users\federico trifoglio\.cache\huggingface\datasets\json\default-6a579f365d89f2f1\0.0.0\ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b\cache-7adeb9322eeeff4e.arrow


Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 808
})

## Example/feature engineering

- extract comments from issues
- remove comments with 15 words or fewer
- create text feature as title + body + comment

In [4]:
def extract_comments(examples):
  """
  Flatten the comments of a batch of issues
  Each comment becomes a new example, paired with its issue's url, title and body
  """  
  # flatten the comments
  results = {'comments': [c for cs in examples['comments'] for c in cs]}
  # repeat ['html_url', 'title', 'body'] as many times as the number of comments
  for c in ['html_url', 'title', 'body']:
    results[c] = [el for n, el in zip([len(cs) for cs in examples['comments']], examples[c]) for _ in range(n)]
  return results
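
To make the flattening step concrete, here is a minimal sketch that applies the same logic to a hand-made batch. The helper name `flatten_comments` and all the toy titles, URLs and comments are made up for illustration; the batch only mimics the column-oriented dictionary that `map(..., batched=True)` passes to the function.

```python
# Same logic as extract_comments above, restated so the example is self-contained.
def flatten_comments(examples):
    """One output row per comment, with the issue fields repeated for each comment."""
    counts = [len(cs) for cs in examples["comments"]]
    results = {"comments": [c for cs in examples["comments"] for c in cs]}
    for col in ["html_url", "title", "body"]:
        results[col] = [el for n, el in zip(counts, examples[col]) for _ in range(n)]
    return results

# Hypothetical batch: two issues, the first with two comments, the second with one.
toy_batch = {
    "html_url": ["https://example.com/issues/1", "https://example.com/issues/2"],
    "title": ["Issue A", "Issue B"],
    "body": ["Body A", "Body B"],
    "comments": [["first reply", "second reply"], ["only reply"]],
}

flat = flatten_comments(toy_batch)
print(flat["title"])     # ['Issue A', 'Issue A', 'Issue B']
print(flat["comments"])  # ['first reply', 'second reply', 'only reply']
```

Every column ends up with the same length (three rows here), which is what a batched `map` needs when it changes the number of examples.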

In [5]:
issues_dataset = issues_dataset.map(extract_comments, batched=True)
issues_dataset = issues_dataset.map(lambda x: 
    {'comment_length': 
        [ len(o.split()) for o in x['comments'] ]
    }, batched=True)
issues_dataset = issues_dataset.filter(lambda x: x['comment_length'] > 15)
issues_dataset = issues_dataset.map(lambda x: 
    {'text': 
        [ t+" \n "+b+" \n "+c for t, b, c in zip(x['title'], x['body'], x['comments']) ]
    }, batched=True)
issues_dataset

100%|██████████| 1/1 [00:00<00:00,  9.09ba/s]
100%|██████████| 3/3 [00:00<00:00, 44.78ba/s]
100%|██████████| 3/3 [00:00<00:00, 42.86ba/s]
100%|██████████| 3/3 [00:00<00:00, 23.26ba/s]


Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text'],
    num_rows: 2175
})

SBERT (Sentence-BERT) can be used to calculate sentence embeddings for downstream similarity tasks.

[multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) maps sentences & paragraphs to a 768-dimensional dense vector space and was designed for semantic search: given a question or search query, it can find relevant text passages. It has been trained on 215M (question, answer) pairs from diverse sources.

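```
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')

query_embedding = model.encode('How big is London')
passage_embedding = model.encode(['London has 9,787,426 inhabitants at the 2011 census',
                                  'London is known for its financial district'])

print("Similarity:", util.dot_score(query_embedding, passage_embedding))
```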

In [7]:
pretrained_model = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
# multi-qa-mpnet-base-dot-v1 has PyTorch weights, 
# from_pt=True will convert them to the TensorFlow format
model = TFAutoModel.from_pretrained(pretrained_model, from_pt=True)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFMPNetModel: ['embeddings.position_ids']
- This IS expected if you are initializing TFMPNetModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFMPNetModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFMPNetModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMPNetModel for predictions without further training.


In [8]:
def cls_pooling(model_output):
    """
    Collect the last hidden state for the special [CLS] token
    In BERT, the final hidden state corresponding to [CLS] token 
    is used as the aggregate sequence representation for 
    classification tasks.
    """
    return model_output.last_hidden_state[:, 0]

def get_embeddings(text):
    """
    Text > Tokenize > Model > CLS Pooling > Numpy of shape (768,)
    """
    encoded_input = tokenizer(
        text, padding=True, truncation=True, return_tensors='tf'
    )
    model_output = model(**encoded_input)
    sentence_embedding = cls_pooling(model_output)
    return sentence_embedding.numpy().reshape(-1)
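
The shapes involved in CLS pooling can be checked without loading the model. The sketch below uses a random NumPy array as a stand-in for `model_output.last_hidden_state`; the batch size and sequence length are arbitrary values chosen for illustration, and only the hidden size of 768 comes from the model description above.

```python
import numpy as np

# Stand-in for last_hidden_state with shape (batch_size, sequence_length, hidden_size).
rng = np.random.default_rng(0)
batch_size, seq_len, hidden_size = 2, 6, 768
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))

# CLS pooling keeps only the vector at token position 0 of every sequence.
cls_vectors = last_hidden_state[:, 0]
print(cls_vectors.shape)  # (2, 768)

# For a single input text (a batch of one), reshape(-1) flattens (1, 768) to (768,).
single = last_hidden_state[:1, 0].reshape(-1)
print(single.shape)  # (768,)
```

Note that `reshape(-1)` yields a 768-element vector only when one text is passed at a time, which is how `get_embeddings` is used in this notebook.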

In [9]:
%%time

embeddings_dataset = issues_dataset.map(lambda x: 
        {'embeddings': 
            [ get_embeddings(t) for t in x['text'] ]
        }, batched=True
)



INFO:tensorflow:Assets written to: ram://b8fa87f3-ce62-4d91-a29f-be06f43e260d/assets


100%|██████████| 3/3 [56:24<00:00, 1128.21s/ba]

CPU times: total: 4h 13min 18s
Wall time: 57min 1s





In [16]:
# embeddings_dataset.save_to_disk('faiss-github')

## Reload preprocessed data

In [5]:
# from datasets import load_from_disk
# embeddings_dataset = load_from_disk('faiss-github')
# pretrained_model = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
# tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
# model = TFAutoModel.from_pretrained(pretrained_model, from_pt=True)
# def cls_pooling(model_output):
#     """
#     Collect the last hidden state for the special [CLS] token
#     In BERT, the final hidden state corresponding to [CLS] token 
#     is used as the aggregate sequence representation for 
#     classification tasks.
#     """
#     return model_output.last_hidden_state[:, 0]

# def get_embeddings(text):
#     """
#     Text > Tokenize > Model > CLS Pooling > Numpy of shape (768,)
#     """
#     encoded_input = tokenizer(
#         text, padding=True, truncation=True, return_tensors='tf'
#     )
#     model_output = model(**encoded_input)
#     sentence_embedding = cls_pooling(model_output)
#     return sentence_embedding.numpy().reshape(-1)



The basic idea behind FAISS is to create a special data structure called an *index* that allows one to find which embeddings are similar to an input embedding.
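
As a rough mental model, the simplest kind of index compares the query with every stored vector and returns the closest ones. The sketch below does this in plain NumPy with squared L2 distance; it is not FAISS itself, the function name `nearest_neighbours` is hypothetical, and the random vectors are toy data. FAISS also offers approximate indexes that avoid the exhaustive comparison.

```python
import numpy as np

def nearest_neighbours(index_vectors, query, k=5):
    """Exhaustive search: squared L2 distance from the query to every stored vector."""
    diffs = index_vectors - query
    distances = np.einsum("ij,ij->i", diffs, diffs)  # row-wise sum of squares
    top_k = np.argsort(distances)[:k]                # indices of the k smallest distances
    return distances[top_k], top_k

# Toy "index": 100 random 8-dimensional vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 8)).astype("float32")

# A query that is a slightly perturbed copy of vector 42.
query = vectors[42] + 0.01
scores, ids = nearest_neighbours(vectors, query, k=3)
print(ids[0])  # 42
```

Whether a larger or a smaller score means "closer" depends on the metric the index uses: with L2 distance smaller is closer, with inner product larger is closer.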

In [6]:
embeddings_dataset.add_faiss_index(column="embeddings")

100%|██████████| 3/3 [00:00<00:00, 136.10it/s]


Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text', 'embeddings'],
    num_rows: 2175
})

Let's test it on this question:

In [16]:
question = "How can I load a dataset offline?"
question_embedding = get_embeddings(question)

In [27]:
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)
samples_df = pd.DataFrame.from_dict(samples)
samples_df['scores'] = scores
samples_df = samples_df.sort_values('scores', ascending=False).reset_index(drop=True)
pd.set_option('max_colwidth', 100)
samples_df[['comments', 'scores']]

| | comments | scores |
|---|---|---|
| 0 | Requiring online connection is a deal breaker in some cases unfortunately so it'd be great if of... | 25.505032 |
| 1 | The local dataset builders (csv, text , json and pandas) are now part of the `datasets` package ... | 24.555557 |
| 2 | I opened a PR that allows to reload modules that have already been loaded once even if there's n... | 24.148973 |
| 3 | > here is my way to load a dataset offline, but it **requires** an online machine\n> \n> 1. (onl... | 22.893991 |
| 4 | here is my way to load a dataset offline, but it **requires** an online machine\r\n1. (online ma... | 22.406647 |


In [28]:
best_answer = samples_df.loc[0, 'comments']
print("QUESTION:", question)
print("ANSWER:", best_answer)

QUESTION: How can I load a dataset offline?
ANSWER: Requiring online connection is a deal breaker in some cases unfortunately so it'd be great if offline mode is added similar to how `transformers` loads models offline fine.

@mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like?
