## StackOverflow Tag Predictions

In this experiment I will describe the thought process, as well as a step-by-step explanation of an ML workflow, for StackOverflow tag predictions.

### Data Exploration and Cleaning

Before modelling, we need to understand the size, quality and features of our data. The dataset is small enough to load into memory with Pandas, so we will load it, inspect its size and columns, and decide what to use and how.

After loading the data, we will do some exploration and show the results.

In [1]:
import pandas as pd
path = 'archive/'

# load everything in a dataframe
questions = pd.read_csv(path + 'Questions.csv', encoding='latin')
tags = pd.read_csv(path + 'Tags.csv')

In [2]:
tags.head(2)

Unnamed: 0,Id,Tag
0,80,flex
1,80,actionscript-3


In [3]:
tags['Tag'].nunique()

37034

In [4]:
questions.head(2)

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body
0,80,26.0,2008-08-01T13:57:07Z,,26,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...
1,90,58.0,2008-08-01T14:41:24Z,2012-12-26T03:45:49Z,144,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...


In [5]:
# let's see how the questions are scored
questions['Score'].value_counts().sort_values(ascending=False)[:20]

Score
 0     591710
 1     281042
 2     125000
 3      61182
-1      43779
 4      33680
 5      20203
-2      17833
 6      13665
 7       9624
-3       8330
 8       7266
 9       5458
-4       4542
 10      4333
 11      3434
 12      3002
 13      2393
-5       2100
 14      2035
Name: count, dtype: int64

A few observations from this first look at the dataset:

1. There are about 37 thousand unique tags, which is too many for a local experiment, so we will limit ourselves to the 100 most common ones.
2. Most of the question columns are not useful for the text task at hand, but we can use them to filter and shrink our dataset.
3. Most questions have quite a low score, which possibly means less context or fewer tags. We will keep only questions with a score > 15 and a score < 3000, which should give us a good subset to work with.
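
To judge how much is lost by keeping only the most common tags, one can check what share of all tag assignments those tags cover. The following is a minimal, dependency-free sketch on made-up data; `tag_coverage` and the toy rows are illustrative assumptions and not part of the notebook. On the real data the same idea would be applied to the `Tag` column.

```python
from collections import Counter

def tag_coverage(tag_rows, top_n):
    """Share of all (question, tag) rows that use one of the top_n most common tags."""
    counts = Counter(tag_rows)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(tag_rows)

# toy stand-in for the 'Tag' column: one entry per (question, tag) pair
toy_tags = ['python', 'python', 'python', 'java', 'java', 'flex']

# 5 of the 6 rows use one of the two most common tags
print(tag_coverage(toy_tags, top_n=2))
```

A high coverage for `top_n=100` would support the decision to drop the long tail of rare tags.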

<br>

Regarding the text preprocessing, we need to take some things into consideration:
1. Title and body will be handled separately at first, during the vectorization process.
2. The text preprocessing needs to be handled carefully, as the standard methods would destroy useful parts of the dataset (especially the code blocks).
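
One way to protect code from the standard text cleaning is to pull the code blocks out first and clean the prose separately. The sketch below is not what this notebook does; it assumes code blocks are wrapped in `<pre><code>...</code></pre>`, and `split_code_and_prose` is a hypothetical helper.

```python
import re

# assumption: block-level code is wrapped in <pre><code> ... </code></pre>
CODE_BLOCK = re.compile(r'<pre><code>(.*?)</code></pre>', re.DOTALL)

def split_code_and_prose(html):
    """Return (prose_html, code_blocks) so each part can be cleaned on its own."""
    code_blocks = CODE_BLOCK.findall(html)
    prose_html = CODE_BLOCK.sub(' ', html)
    return prose_html, code_blocks

sample = ("<p>Why does this loop fail?</p>"
          "<pre><code>for (i = 0; i &lt; n; i++) {}</code></pre>")
prose, code = split_code_and_prose(sample)

print(prose)  # the prose, still containing its <p> tags
print(code)   # list with the untouched code block
```

The prose part can then go through stopword removal and lemmatization, while the code part is tokenized with rules that keep symbols and keywords.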

In [6]:
import numpy as np

# get the top 100 most common tags
top_tags = tags['Tag'].value_counts().sort_values(ascending=False)[:100].index.tolist()

# filter the dataframe to keep only rows with the top 100 tags
tags = tags[tags['Tag'].isin(top_tags)]

# filter by score
questions = questions[questions['Score'] > 15]
questions = questions[questions['Score'] < 3000]

def get_tags(_id):
    # set of top-100 tags for this question (an empty set if it has none)
    return set(tags.loc[tags['Id'] == _id, 'Tag'])

# assign the set of tags to each question, and create a 'helper' column with its size
questions['tags'] = questions['Id'].apply(get_tags)
questions['tags-len'] = questions['tags'].apply(len)

# drop unnecessary rows/columns
questions = questions[questions['tags-len'] > 0]
questions.drop(columns=['OwnerUserId', 'CreationDate', 'ClosedDate', 'Score'], inplace=True)
questions.head(3)

Unnamed: 0,Id,Title,Body,tags,tags-len
2,120,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...,"{sql, asp.net}",2
3,180,Function for creating color wheels,<p>This is something I've pseudo-solved many t...,{algorithm},1
4,260,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...,"{c#, .net}",2


After this preprocessing we are left with 100 tags and about 20K questions, so we can move on to processing the text.

In [7]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

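# word_tokenize also needs the 'punkt' tokenizer data ('punkt_tab' on recent NLTK versions);
# download it as well if it is not installed yet
nltk.download('stopwords')
nltk.download('wordnet')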

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# stopwords that double as common programming keywords, which we want to keep
allowed_words = ['if', 'else', 'then', 'while', 'for', 'and', 'as', 'from']


def preprocess(text):
    # remove URLs and e-mail addresses
    text = re.sub(r'(?:https?://|www\.|mailto:)\S+', '', text)
    text = re.sub(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}', '', text)
    # strip HTML tags (this also covers <p>), keeping the text and code inside them
    text = re.sub(r'<[^<]+?>', '', text)
    # replace numbers with a placeholder (it becomes 'num' after lowercasing)
    text = re.sub(r'\d+', 'NUM', text)

    text = text.lower()

    # drop stopwords, except the programming keywords we want to keep
    words = word_tokenize(text)
    words = [word for word in words
             if word not in stop_words or word in allowed_words]

    # lemmatize with the module-level lemmatizer
    words = [lemmatizer.lemmatize(word) for word in words]

    return ' '.join(words)

# apply the changes to the text and show them
questions['clean_title'] = questions['Title'].apply(preprocess)
questions['clean_body'] = questions['Body'].apply(preprocess)

questions.head(2)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/lilykos/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/lilykos/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,Id,Title,Body,tags,tags-len,clean_title,clean_body
2,120,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...,"{sql, asp.net}",2,asp.net site map,anyone got experience creating sql-based asp.n...
3,180,Function for creating color wheels,<p>This is something I've pseudo-solved many t...,{algorithm},1,function for creating color wheel,something 've pseudo-solved many time and neve...


### Creating the baseline model

Now we are done with preprocessing the information, and we can start the ML procedure. Specifically:
- we need to create a baseline, which will use a simple classification algorithm
- we will use tf-idf and limit the number of tokens kept per field (feature engineering)
- this will be a multilabel classification problem, so we will need algorithms that support it

The steps should be:
- binarize the tags
- vectorize the text input
- train/test split
- use a classifier and evaluate the results
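
To make the first step concrete: binarizing turns each question's set of tags into a fixed-length 0/1 vector with one position per known tag. The sketch below is a plain-Python illustration of that idea (the notebook itself uses scikit-learn's `MultiLabelBinarizer`); `binarize` is a hypothetical helper.

```python
def binarize(tag_sets):
    """Turn a list of tag sets into (sorted class list, 0/1 indicator rows)."""
    classes = sorted(set().union(*tag_sets))
    rows = [[1 if tag in tags else 0 for tag in classes] for tags in tag_sets]
    return classes, rows

classes, rows = binarize([{'sql', 'asp.net'}, {'algorithm'}, {'c#', '.net'}])

print(classes)
for row in rows:
    print(row)
```

Each column then becomes one binary target, which is what makes the problem multilabel rather than multiclass.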

In [8]:
import warnings
warnings.simplefilter('ignore', category=UserWarning)

from sklearn.preprocessing import MultiLabelBinarizer, FunctionTransformer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer as Tfidf
from sklearn.model_selection import train_test_split


# in order to make the train/test split work, we need to update the dataframes
_questions = questions.loc[:, ['clean_title', 'clean_body']]
_tags = questions.loc[:, ['tags']]

_X_train, _X_test, _y_train, _y_test = train_test_split(
    _questions, _tags, test_size=0.2, random_state=42
)


# define extractor functions
title_extractor = FunctionTransformer(lambda x: x['clean_title'], validate=False)
body_extractor = FunctionTransformer(lambda x: x['clean_body'], validate=False)

# create a FeatureUnion pipeline that combines title / body
vec_pipeline = make_pipeline(
    FeatureUnion([(
        'title_vec', make_pipeline(
            title_extractor, Tfidf(analyzer='word', max_features=100))
        ), (
        'body_vec', make_pipeline(
            body_extractor, Tfidf(analyzer='word', max_features=2000))
    )])
)


# binarize the tags
mlb = MultiLabelBinarizer()
mlb.fit(_y_train['tags'])

y_train = mlb.transform(_y_train['tags'])
y_test = mlb.transform(_y_test['tags'])


# fit the pipeline to the training data
vec_pipeline.fit(_X_train)

X_train = vec_pipeline.transform(_X_train)
X_test = vec_pipeline.transform(_X_test)

We now have train/test data, with both the input text and the labels vectorized. We will use a `Random Forest`, because it supports multilabel classification out of the box.

In [9]:
from sklearn.metrics import precision_score, f1_score, hamming_loss
from sklearn.ensemble import RandomForestClassifier

def metrics(y_true, y_pred, name):
    precision = precision_score(y_true, y_pred, average='micro')
    f1 = f1_score(y_true, y_pred, average='micro')
    hamming = 1 - hamming_loss(y_true, y_pred)

    print('-- ' + name)
    print('--- Precision:', round(precision, 3))
    print('--- F1-score:', round(f1, 3))
    print('--- Hamming:', round(hamming, 3))


rfc = RandomForestClassifier(n_estimators=500, random_state=42, n_jobs=6)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)

metrics(y_test, y_pred, 'Random Forest')

-- Random Forest
--- Precision: 0.907
--- F1-score: 0.504
--- Hamming: 0.989
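
The gap between precision and F1 says something about recall. Micro-averaged F1 is the harmonic mean of micro precision and micro recall, so the recall can be backed out of the two reported numbers. This is a back-of-the-envelope sketch based on the rounded values above, not a metric computed in the notebook; `implied_recall` is a hypothetical helper.

```python
def implied_recall(precision, f1):
    """Solve F1 = 2PR / (P + R) for R."""
    return f1 * precision / (2 * precision - f1)

recall = implied_recall(precision=0.907, f1=0.504)
print(round(recall, 3))  # 0.349
```

A recall of roughly 0.35 would mean the forest is usually right when it predicts a tag, but misses about two out of three true tags.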


### Using RNN (or other DNNs) as a classifier

Another option is to train a basic neural network, in this case an RNN, and use it as a classifier. The network turns the input tokens into word embeddings and feeds them to the RNN. The main advantage of this method is that the input keeps its word order, so an RNN will usually give better results, depending on the dataset size.
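
Before an embedding layer can be used, each text has to become a fixed-length sequence of integer indices. The following dependency-free sketch shows that encoding step. It differs from the notebook code in two assumed ways: the padding token sits at index 0, and an `<unk>` index is reserved so that words unseen during vocabulary building do not raise a `KeyError`. `build_vocab` and `encode` are hypothetical helpers.

```python
PAD, UNK = '<pad>', '<unk>'

def build_vocab(texts):
    """Map every token seen in texts to an index; 0 is padding, 1 is unknown."""
    tokens = sorted({token for text in texts for token in text.split()})
    return {word: i for i, word in enumerate([PAD, UNK] + tokens)}

def encode(text, vocab, max_len):
    """Encode a text as a fixed-length list of indices, truncating or right-padding."""
    ids = [vocab.get(token, vocab[UNK]) for token in text.split()][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

vocab = build_vocab(['function for creating color wheel', 'asp.net site map'])

print(encode('color wheel for asp.net', vocab, max_len=6))
print(encode('unseen word', vocab, max_len=6))
```

The resulting index lists can be stacked into a tensor and passed to an embedding layer whose size matches `len(vocab)`.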

In [10]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
from sklearn.model_selection import train_test_split

# handle the input data (the space keeps the last title word and the first body word apart)
questions['combined_rnn'] = questions['clean_title'] + ' ' + questions['clean_body']

# WE ONLY DO THIS DUE TO TIME/INFRASTRUCTURE CONSTRAINTS
_questions = questions.sample(1000)

# create the vocabulary, that will be used in an embedding, to feed to the RNN
def create_vocab(questions):
    vocab = set()
    for question in questions:
        tokens = question.split()
        vocab.update(tokens)
    vocab = sorted(vocab)
    vocab.append('<pad>')
    word_to_index = {word: i for i, word in enumerate(vocab)}
    return word_to_index

def encode_question(question, word_to_index):
    return [word_to_index[word] for word in question.split()]


word_to_index = create_vocab(_questions['combined_rnn'])
encoded_questions = [encode_question(question, word_to_index) for question in _questions['combined_rnn']]
padded_questions = pad_sequence(
    [torch.tensor(q) for q in encoded_questions],
    batch_first=True,
    padding_value=word_to_index['<pad>']
)

# Encode labels using MultiLabelBinarizer and add them to the dataset
mlb = MultiLabelBinarizer()
encoded_labels = mlb.fit_transform(_questions['tags'])
num_labels = len(mlb.classes_)

# Create dataset 
class TextDataset(Dataset):
    def __init__(self, questions, labels):
        self.questions = questions
        self.labels = labels

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        return self.questions[idx], self.labels[idx]
    
# train/test split
train_questions, val_questions, train_labels, val_labels = train_test_split(
    padded_questions, encoded_labels, test_size=0.2, random_state=42
)
train_dataset = TextDataset(train_questions, train_labels)
val_dataset = TextDataset(val_questions, val_labels)

# data loaders for the RNN
batch_size = 2
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

In [13]:
# build RNN model for multi-label classification
class RNNModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers, num_labels):
        super(RNNModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.GRU(embedding_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_labels)

    def forward(self, x):
        x = self.embedding(x)
        x, _ = self.rnn(x)
        x = self.fc(x[:, -1, :])  # output at the last time step; for shorter, padded sequences this is a <pad> position
        return torch.sigmoid(x)

# again, due to size, these numbers are quite small
vocab_size = len(word_to_index)
embedding_dim = 20
hidden_dim = 50
num_layers = 1

model = RNNModel(vocab_size, embedding_dim, hidden_dim, num_layers, num_labels)

# Train the model
num_epochs = 3
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(num_epochs):
    # Training
    model.train()
    for questions_batch, labels_batch in train_dataloader:
        optimizer.zero_grad()
        outputs = model(questions_batch)
        loss = criterion(outputs, labels_batch.float())  # labels_batch is already a tensor
        loss.backward()
        optimizer.step()

    # Evaluation
    model.eval()
    total_hamming_loss = 0
    total_samples = 0
    with torch.no_grad():
        for questions_batch, labels_batch in val_dataloader:
            outputs = model(questions_batch)
            pred_labels = (outputs > 0.5).int().numpy()
            true_labels = labels_batch.numpy()
            total_hamming_loss += hamming_loss(true_labels, pred_labels) * true_labels.shape[0]
            total_samples += true_labels.shape[0]

    hamming_score = 1 - (total_hamming_loss / total_samples)
    print(f"Epoch {epoch + 1}, Loss: {round(loss.item(), 4)}, Hamming score: {round(hamming_score, 4)}")

Epoch 1, Loss: 0.0717, Hamming score: 0.9854
Epoch 2, Loss: 0.054, Hamming score: 0.9854
Epoch 3, Loss: 0.0749, Hamming score: 0.9854


### Using LLMs

We can instead take an LLM, fine-tune it on our own dataset, and use it as a classifier. The important thing with LLMs is that they let us reuse their pre-trained knowledge and apply it to our needs. This is why I will use a pre-trained BERT model (`bert-base-uncased` in the code below), which will then be fine-tuned to work as a multilabel classifier. A variant additionally pre-trained on StackOverflow text would be a natural next step.

In [14]:
import warnings
warnings.simplefilter('ignore', category=UserWarning)

import pandas as pd
import numpy as np
import torch

from datasets import Dataset
from datasets.dataset_dict import DatasetDict
from transformers import (BertTokenizerFast, BertForSequenceClassification,
                          Trainer, TrainingArguments, DataCollatorWithPadding)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# combine title/body and train/test split
questions['combined_text'] = questions['Title'] + ' ' + questions['Body']
_questions = questions.loc[:, ['combined_text', 'tags']]

# WE ONLY DO THIS DUE TO TIME/INFRASTRUCTURE CONSTRAINTS
_questions = _questions.sample(2500)
train_df, test_df = train_test_split(_questions, test_size=0.3, random_state=42)


# tokenizer init
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# create MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit(_questions['tags'])

# preprocess the text and create the datasets
def preprocessing(input_text):
    return tokenizer(input_text, padding=True, truncation=True, max_length=256)

def build_examples(df):
    # tokenize each question and attach its binarized tags as float labels
    examples = []
    for _, row in df.iterrows():
        process_info = preprocessing(row['combined_text'])
        binarized_tags = mlb.transform([row['tags']])
        examples.append({
            'input_ids': process_info['input_ids'],
            'attention_mask': process_info['attention_mask'],
            'token_type_ids': process_info['token_type_ids'],
            'labels': torch.tensor(binarized_tags.astype(np.float32)).squeeze()
        })
    return examples

trainset = build_examples(train_df)
testset = build_examples(test_df)

dataset = DatasetDict({
    'train':Dataset.from_list(trainset),
    'eval':Dataset.from_list(testset)
})
dataset.set_format("torch")
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'token_type_ids', 'labels'],
        num_rows: 1750
    })
    eval: Dataset({
        features: ['input_ids', 'attention_mask', 'token_type_ids', 'labels'],
        num_rows: 750
    })
})

In [15]:
from sklearn.metrics import precision_score, f1_score, hamming_loss

original_tags = mlb.classes_
id2label = {idx:label for idx, label in enumerate(original_tags)}
label2id = {label:idx for idx, label in enumerate(original_tags)}


# use the number of tags actually present in the sample, which may be below 100
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(original_tags),
    problem_type="multi_label_classification",
    id2label=id2label, label2id=label2id
)
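# note: this optimizer is never passed to the Trainer below (that would need
# `optimizers=(optimizer, None)`), so the Trainer creates its own AdamW from `training_args`
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-5, eps=1e-08
)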
training_args = TrainingArguments(
    output_dir="./output", logging_dir="./logs",
    num_train_epochs=3, 
    per_device_train_batch_size=16, per_device_eval_batch_size=16,
    warmup_steps=50, eval_steps=50, logging_steps=50, 
    evaluation_strategy="epoch", save_strategy="epoch",
    weight_decay=0.01,
    load_best_model_at_end=True,
)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def compute_metrics(p):
    preds = p.predictions.squeeze()
    labels = p.label_ids.squeeze()
    
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(preds))
    
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= 0.5)] = 1
    y_true = labels

    hamming = 1 - round(hamming_loss(y_true, y_pred), 3)
    return {'hamming': hamming}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["eval"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)
trainer.train()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Epoch,Training Loss,Validation Loss,Hamming
1,0.1882,0.10796,0.985
2,0.0877,0.079766,0.985
3,0.0796,0.076516,0.985


TrainOutput(global_step=330, training_loss=0.1707368583390207, metrics={'train_runtime': 2796.0251, 'train_samples_per_second': 1.878, 'train_steps_per_second': 0.118, 'total_flos': 691274239488000.0, 'train_loss': 0.1707368583390207, 'epoch': 3.0})

In [16]:
trainer.evaluate()
# we can save and reuse the trained model at this checkpoint
# trainer.save_model("multilabel-bert")
# tokenizer.save_pretrained("multilabel-bert")

{'eval_loss': 0.07651558518409729,
 'eval_hamming': 0.985,
 'eval_runtime': 97.4571,
 'eval_samples_per_second': 7.696,
 'eval_steps_per_second': 0.482,
 'epoch': 3.0}

## Discussion/Comments
Having finished the experimental pipeline, we now go through its stages, examining the rationale behind the decisions made and discussing potential alternatives. We also address the limitations imposed by the infrastructure (a single laptop) and by the available time. In a real-world setting with proper infrastructure, issues such as dataset size and preprocessing speed would be far less of a concern.

### Data Exploration and Preprocessing
Given the large volume of the dataset, managing it on a laptop was challenging. This is why some statistical analysis was necessary to make informed decisions on filtering and on selecting a suitable subset of the data. The following steps were taken:

- We kept the top 100 tags and the questions associated with them.
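- We filtered the questions based on their scores, keeping only those with a score above 15 and below 3000. This removes negatively and low-scored questions, as well as those with extremely high scores, so that the remaining questions are more likely to be well-formed.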
- Preprocessing must be conducted with care to preserve the integrity of the code. Many code fragments could be discarded in a standard process, either because they are not recognized as words (e.g., {}, ()) or because they are considered stopwords (e.g., "and", "if"). We aimed to maintain the context of the text without removing valuable information. Techniques such as lemmatization and tokenization were used as usual.

Regarding the handling of titles and bodies, several approaches were considered:

- Merging them and treating the data as a single text block for each question.
- Vectorizing them separately, then merging the 2D arrays (title + body).
- Investigating and applying weights to either component, based on the significance of the information they provide.
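
The third option can be illustrated with a small sketch: scale each block of features by a weight before concatenating them. The values and the `combine` helper are made up for illustration. In scikit-learn, `FeatureUnion` exposes a `transformer_weights` parameter that applies this kind of per-transformer scaling.

```python
def combine(title_vec, body_vec, title_weight=1.0, body_weight=1.0):
    """Concatenate two feature vectors after scaling each block by its weight."""
    return ([title_weight * v for v in title_vec] +
            [body_weight * v for v in body_vec])

title_features = [0.0, 0.8, 0.6]       # e.g. tf-idf values of title tokens
body_features = [0.5, 0.0, 0.5, 0.7]   # e.g. tf-idf values of body tokens

# give title tokens twice the influence of body tokens
print(combine(title_features, body_features, title_weight=2.0))
```

Whether such weights help depends on the classifier: distance-based and linear models are sensitive to feature scale, whereas tree-based models such as Random Forests are largely not.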


### Multilabel Classification
Various methods for multilabel classification exist, such as Classifier Chains. However, we chose Random Forests for their speed and out-of-the-box support for multilabel classification. The crucial aspect here was selecting appropriate performance metrics:

- Accuracy (exact match) is suboptimal, as it gives no credit for partially correct label predictions, which makes it unrepresentative, especially in a dataset with so many labels.
- The Hamming score, which considers partially correct predictions, was used instead.
- Precision and F1 score were also used, with precision being especially useful.
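
One caveat with the Hamming score is that it is dominated by the many correct "tag absent" slots when labels are sparse. The sketch below uses made-up data with an assumed average of 1.5 tags per question out of 100 labels (a figure not measured in this notebook) and compares the Hamming score with exact-match accuracy for a predictor that never outputs any tag. `hamming_score` and `subset_accuracy` are hypothetical helpers.

```python
def hamming_score(y_true, y_pred):
    """Fraction of individual label slots predicted correctly."""
    total = sum(len(row) for row in y_true)
    correct = sum(t == p for t_row, p_row in zip(y_true, y_pred)
                  for t, p in zip(t_row, p_row))
    return correct / total

def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

n_labels = 100
# two toy questions: one with a single tag, one with two tags
y_true = [[1] + [0] * (n_labels - 1), [1, 1] + [0] * (n_labels - 2)]
# a "model" that never predicts any tag
y_pred = [[0] * n_labels for _ in y_true]

print(hamming_score(y_true, y_pred))    # 0.985
print(subset_accuracy(y_true, y_pred))  # 0.0
```

Under that assumption, a score around 0.985 can be reached without predicting a single tag, which is worth keeping in mind when reading the constant Hamming scores of the RNN and BERT runs, and is a reason to read the Hamming score alongside precision and F1.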


### Deep Neural Networks (DNNs)
An alternative approach was to use a DNN, specifically a simple Recurrent Neural Network (RNN) on top of learned embeddings of the input text. This method, like the LLM approach, preserves word order. Although the results may not beat the other methods, depending on the dataset, we included it for the sake of a comprehensive experimental pipeline, and we expect much better results on a more powerful machine.

### Leveraging a Large Language Model (LLM)
In the final stage of our experiment, we employed BERT, a widely used LLM known for its performance on general NLP tasks. We fine-tuned BERT on our dataset and used it as a classifier for our multilabel classification task. Other options exist of course: different LLMs (RoBERTa, T5, etc.) as well as domain-adapted models (e.g. BERT further trained on scientific papers). We would need to investigate thoroughly whether any of those perform better.

### Potential Improvements and Future Directions
While the current experimental pipeline yields promising results, there are several areas where further improvements could be made:

- Experiment with other LLMs, such as GPT-4, RoBERTa, or DistilBERT, to determine if they offer better performance for this specific task.
- Implement data augmentation techniques to expand the dataset and improve the model's generalization capabilities.
- Explore ensemble methods, such as stacking or bagging, to combine the strengths of various models and potentially improve overall performance. This could be extremely useful given the vast number of tags that exist. One approach could be to cluster the tags according to their questions, create representative clusters, and then fine-tune a different model for each cluster.
- Conduct hyperparameter optimization using techniques such as grid search or random search, evaluated with k-fold cross-validation.
- Use domain-specific knowledge to process the text more accurately.