# 🔍 Semantic Search over GitHub Issues Using Embeddings and FAISS

In this notebook, we build a semantic search engine for GitHub issues using text embeddings and FAISS, relying on Hugging Face Datasets, Transformers, and PyTorch.
Our goal: given a natural language question, retrieve the most relevant comments from the issues corpus.


Install the Transformers, Datasets, and Evaluate libraries, plus FAISS, to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install faiss-gpu

## 1️⃣ Load and Filter the GitHub Issues Dataset

We’ll use the dataset of issues/comments pushed to the Hub in the previous section.
We'll filter out pull requests and issues without comments, since those aren't useful as answers.


In [None]:
from datasets import load_dataset

# Load the custom GitHub issues dataset (change the user name as appropriate)
issues_dataset = load_dataset("lewtun/github-issues", split="train")
print(issues_dataset)

# Remove pull requests and issues with no comments
issues_dataset = issues_dataset.filter(
    lambda x: not x["is_pull_request"] and len(x["comments"]) > 0
)
print(issues_dataset)
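
Before running the filter over the full dataset, it can help to check the predicate on a few hand-written records. The snippet below is a minimal sketch in plain Python: the toy records are made up, and `keep_issue` is a hypothetical helper name that mirrors the lambda used above.

```python
# Toy records that mimic the two fields the filter looks at (values are made up).
toy_issues = [
    {"title": "Bug in map", "is_pull_request": False, "comments": ["Try v2."]},
    {"title": "Add feature", "is_pull_request": True, "comments": ["LGTM"]},
    {"title": "Question", "is_pull_request": False, "comments": []},
]

def keep_issue(issue):
    """Keep real issues (not pull requests) that have at least one comment."""
    return (not issue["is_pull_request"]) and len(issue["comments"]) > 0

kept_titles = [issue["title"] for issue in toy_issues if keep_issue(issue)]
print(kept_titles)
```

Only the first record passes: the second is a pull request and the third has no comments.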

## 2️⃣ Keep Only Relevant Columns

For search, keep only `title`, `body`, `comments`, and `html_url` (for referencing the original issue).


In [None]:
columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
columns_to_remove = set(columns) - set(columns_to_keep)
issues_dataset = issues_dataset.remove_columns(list(columns_to_remove))
print(issues_dataset)
print(issues_dataset[0]['comments'])

## 3️⃣ Explode the Comments (One Row Per Comment)

Each issue may have multiple comments. To support matching queries to comments, explode this list so every row is a single comment/context pair.


In [None]:
# list of comments in one row
print(issues_dataset[0]['comments'])

In [None]:
# Convert to pandas, then explode the comments column into one row per comment
issues_dataset.set_format("pandas")
df = issues_dataset[:]
comments_df = df.explode("comments", ignore_index=True)
comments_df.head(4)
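
To make the effect of exploding concrete, here is a small sketch of the same idea in plain Python. The helper `explode_rows` and the toy rows are hypothetical and only illustrate the reshaping; they are not how pandas implements it.

```python
def explode_rows(rows, column):
    """Return one output row per element of row[column]."""
    exploded = []
    for row in rows:
        for item in row[column]:
            new_row = dict(row)     # copy the other columns unchanged
            new_row[column] = item  # replace the list with a single element
            exploded.append(new_row)
    return exploded

toy_rows = [
    {"title": "Issue A", "comments": ["first", "second"]},
    {"title": "Issue B", "comments": ["only one"]},
]
exploded_rows = explode_rows(toy_rows, "comments")
for row in exploded_rows:
    print(row)
```

One difference worth knowing: this sketch silently drops rows whose list is empty, whereas pandas `explode` keeps such rows with a missing value. Since issues without comments were filtered out earlier, that case should not arise here.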

## 4️⃣ Return to Datasets and Clean by Comment Length

Convert back to a Dataset and keep only comments longer than 15 words (to remove unhelpful, super-short messages).


In [None]:
from datasets import Dataset

comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset = comments_dataset.map(
    lambda x: {"comments_length": len(x["comments"].split())}
)
comments_dataset = comments_dataset.filter(lambda x: x["comments_length"] > 15)
print(comments_dataset)
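
The length check counts whitespace-separated tokens, so it is easy to try on a couple of strings. The following is a rough sketch with made-up comments; `word_count` and `min_words` are hypothetical names.

```python
def word_count(text):
    """Count whitespace-separated tokens, the same measure as len(text.split())."""
    return len(text.split())

# Made-up comments: one very short, one long enough to be kept.
toy_comments = [
    "Thanks!",
    "I ran into the same problem yesterday and fixed it by clearing the "
    "cache directory before loading the data again.",
]

min_words = 15
long_comments = [c for c in toy_comments if word_count(c) > min_words]
print([word_count(c) for c in toy_comments])
print(len(long_comments))
```

Note that the comparison is strict, so a comment with exactly 15 words is dropped.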

## 5️⃣ Concatenate All Context into a Single Text Field

Join together the issue title, body, and comment for full queryable context.


In [None]:
def concatenate_text(examples):
  # Join title, body, and comment into one searchable text field
  return {
      "text": examples["title"] + "\n" + examples["body"] + "\n" + examples["comments"]
  }

comments_dataset = comments_dataset.map(concatenate_text)
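
String concatenation with `+` fails if any of the three fields is `None`. Whether that happens in this dataset is an assumption we have not verified (for instance, an issue opened with an empty body might be stored that way). The sketch below shows one defensive variant; `concatenate_text_safe` is a hypothetical name, not part of the original notebook.

```python
def concatenate_text_safe(example):
    """Join title, body, and comment, treating missing (None) fields as empty."""
    parts = [example.get("title"), example.get("body"), example.get("comments")]
    return {"text": "\n".join(part or "" for part in parts)}

# Made-up example with a missing body.
toy_example = {"title": "Loading fails", "body": None, "comments": "Try upgrading."}
result = concatenate_text_safe(toy_example)
print(repr(result["text"]))
```

The missing body simply leaves an empty line between the title and the comment, so the row is kept rather than raising a `TypeError`.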


## 6️⃣ Compute Text Embeddings with Sentence Transformers

We can now produce a single embedding per context using a pretrained model.
We'll use `sentence-transformers/multi-qa-mpnet-base-dot-v1`, a checkpoint recommended for question-answering-style semantic search.


In [None]:
from transformers import AutoTokenizer, AutoModel
import torch

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# CLS pooling function: take the embedding of the special [CLS] token
def cls_pooling(model_output):
  return model_output.last_hidden_state[:,0]

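def get_embeddings(text_list):
  # Tokenize the batch, padding to the longest text and truncating overlong ones
  encoded_input = tokenizer(text_list, padding=True, truncation=True, return_tensors="pt")
  # Move every input tensor to the same device as the model
  encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
  # Inference only, so skip gradient tracking
  with torch.no_grad():
    model_output = model(**encoded_input)
  return cls_pooling(model_output)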
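
The indexing `last_hidden_state[:, 0]` selects, for every sequence in the batch, the vector of the first token. The sketch below mimics that with nested Python lists instead of tensors; the numbers are made up and the tiny shapes are chosen only for readability, and `first_token_pooling` is a hypothetical name.

```python
# Toy stand-in for last_hidden_state with shape (batch=2, seq_len=3, hidden=2).
toy_hidden_state = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # sequence 0: three token vectors
    [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]],  # sequence 1: three token vectors
]

def first_token_pooling(hidden_state):
    """Pick the vector at position 0 of each sequence (what [:, 0] does)."""
    return [sequence[0] for sequence in hidden_state]

pooled = first_token_pooling(toy_hidden_state)
print(pooled)
```

The result has one vector per input sequence, which is why each text ends up with a single embedding regardless of its length.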

## 7️⃣ Compute and Attach Embeddings to Each Row

Use `.map()` to embed every document and store the embeddings as NumPy arrays.


In [None]:
embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings([x["text"]]).cpu().numpy()[0]}
)
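
Embedding one row at a time is simple but can be slow, and `get_embeddings` already accepts a list of texts. As a rough sketch of the chunking step only, the hypothetical helper `iter_batches` below splits a list into fixed-size batches; the batch size of 3 is arbitrary.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Made-up documents standing in for the "text" column.
toy_texts = [f"document {i}" for i in range(7)]
batches = list(iter_batches(toy_texts, batch_size=3))
print([len(batch) for batch in batches])
```

`Dataset.map` also accepts `batched=True` together with `batch_size`, in which case the mapped function receives lists of values instead of a single row, so the chunking does not have to be written by hand.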


## 8️⃣ Build the FAISS Index for Efficient Search

Now add a FAISS index over the embeddings; the column to index is passed explicitly through the `column` argument.


In [None]:
embeddings_dataset.add_faiss_index(column="embeddings")
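
Conceptually, an exhaustive nearest-neighbor search compares the query with every stored vector and keeps the closest ones. The sketch below does this by brute force in plain Python. It assumes squared Euclidean distance as the metric, which is an assumption on our part since the notebook does not state which metric the index uses; the helper names and toy vectors are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbors(query, vectors, k):
    """Return (distance, index) pairs for the k vectors closest to the query."""
    scored = [(squared_l2(query, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort()  # smallest distance first
    return scored[:k]

# Made-up 2-dimensional "embeddings".
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [0.0, 3.0], [5.0, 5.0]]
neighbors = nearest_neighbors([1.0, 2.0], toy_vectors, k=2)
print(neighbors)
```

A brute-force scan like this costs one distance computation per stored vector for every query; FAISS exists to make that search fast on large collections.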


## 9️⃣ Test Your Search Engine with a Query

Embed a new question and retrieve the top k most relevant comments using nearest-neighbors search.



In [None]:
question = "How can I load a dataset offline?"
question_embedding = get_embeddings([question]).cpu().numpy()

# Query the FAISS index for the 5 nearest neighbors
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)



## 🔟 Review the Results

Print the comments that matched the query, together with each score, issue title, and URL.


In [None]:
import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

for _,row in samples_df.iterrows():
  print(f"COMMENT:\n{row.comments}\n")
  print(f"SCORE: {row.scores}")
  print(f"TITLE: {row.title}")
  print(f"URL: {row.html_url}")
  print("=" * 50)
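
The search returns its examples column by column (one list per field), which is why the cell above builds a DataFrame before iterating row by row. The following sketch shows the same reshaping in plain Python; the toy values are made up and `columns_to_rows` is a hypothetical helper.

```python
# Toy stand-ins for the search output (values are made up).
toy_scores = [0.2, 0.7, 0.5]
toy_samples = {
    "title": ["Issue A", "Issue B", "Issue C"],
    "comments": ["comment a", "comment b", "comment c"],
}

def columns_to_rows(columns, scores):
    """Turn a dict of equal-length lists into a list of row dicts with scores."""
    rows = []
    for i, score in enumerate(scores):
        row = {key: values[i] for key, values in columns.items()}
        row["scores"] = score
        rows.append(row)
    return rows

rows = columns_to_rows(toy_samples, toy_scores)
rows.sort(key=lambda row: row["scores"], reverse=True)  # mirrors ascending=False
for row in rows:
    print(row["scores"], row["title"])
```

Whether a higher or a lower score means a closer match depends on the metric the index was built with, so it is worth checking that before settling on the sort direction.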