## Classifier Chain for Tag Prediction Using scikit-learn
A classifier chain is an ensemble method for multi-label classification. It can capture correlations between the different tags because each classifier in the chain receives the predictions of the preceding classifiers as additional input features. The method is used here to model the chronological sequence of SonarQube analyses and the tags that newly occur in each analysis.
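
To illustrate the chaining idea, here is a minimal sketch on synthetic data (not the SonarQube data). The names `X_demo`, `Y_demo` and `chain_demo` are hypothetical. Each link in the chain is fitted on the original features plus one extra column per earlier link, which the feature counts make visible.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# synthetic data: 5 features, 3 labels; the third label depends on the first two
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
Y_demo = np.column_stack([
    X_demo[:, 0] > 0,
    X_demo[:, 1] > 0,
    (X_demo[:, 0] + X_demo[:, 1]) > 0,
]).astype(int)

# default order: label 0 first, then label 1, then label 2
chain_demo = ClassifierChain(LogisticRegression(max_iter=1000))
chain_demo.fit(X_demo, Y_demo)

# every link sees the 5 original features plus the labels of all earlier links
n_inputs_per_link = [est.n_features_in_ for est in chain_demo.estimators_]
print(n_inputs_per_link)

# one probability column per label
proba_demo = chain_demo.predict_proba(X_demo)
print(proba_demo.shape)
```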

In [1]:
import os
import numpy as np
import pandas as pd
from sklearn.multioutput import ClassifierChain
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import jaccard_score
import sklearn

In [2]:
# data import
current_dir = os.getcwd()

# construct path to the project data folder
data_dir = os.path.join(current_dir, '..', '..', '..', 'Data','Sonar_Issues')

# load SonarQube measure data (without duplicates)
df = pd.read_csv(os.path.join(data_dir, 'measures+tags.csv'), low_memory=False)

# filter for project hive
df = df[df['PROJECT_ID'] == 'hive']
df['SQ_ANALYSIS_DATE'] = pd.to_datetime(df['SQ_ANALYSIS_DATE'])

# sort the df so that the dates are ordered from oldest to newest analysis
df = df.sort_values(by='SQ_ANALYSIS_DATE')
df

Unnamed: 0,PROJECT_ID,SQ_ANALYSIS_DATE,CLASSES,FILES,LINES,NCLOC,PACKAGE,STATEMENTS,FUNCTIONS,COMMENT_LINES,...,FUNCTION_COMPLEXITY,COGNITIVE_COMPLEXITY,LINES_TO_COVER,UNCOVERED_LINES,DUPLICATED_LINES,DUPLICATED_BLOCKS,DUPLICATED_FILES,COMMENT_LINES_DENSITY,DUPLICATED_LINES_DENSITY,TAGS
15553,hive,2008-09-02 23:58:59,613.0,358.0,67469.0,48651.0,29,26933.0,4334.0,2958.0,...,2.6,10623.0,31250.0,31250.0,16728,1204,66,5.7,24.8,"error-handling, clumsy, brain-overload, design..."
15552,hive,2008-09-17 00:28:22,613.0,358.0,67754.0,48873.0,29,27078.0,4340.0,2983.0,...,2.6,10691.0,31428.0,31428.0,16790,1208,66,5.8,24.8,"brain-overload, clumsy"
15551,hive,2008-09-17 20:13:00,613.0,358.0,67865.0,48976.0,29,27145.0,4346.0,2985.0,...,2.6,10701.0,31505.0,31505.0,16785,1208,66,5.7,24.7,"convention, design"
15550,hive,2008-09-18 00:09:17,661.0,397.0,71629.0,51241.0,33,28335.0,4538.0,3215.0,...,2.6,11061.0,32889.0,32889.0,17789,1228,74,5.9,24.8,"error-handling, clumsy, brain-overload, design..."
15549,hive,2008-09-18 17:37:59,664.0,399.0,72263.0,51707.0,33,28559.0,4592.0,3235.0,...,2.6,11206.0,33041.0,33041.0,17659,1224,75,5.9,24.4,"error-handling, clumsy, brain-overload, bad-pr..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13702,hive,2015-02-27 21:09:45,8327.0,3789.0,1071783.0,731599.0,364,352969.0,61412.0,75080.0,...,2.3,119218.0,431125.0,431125.0,139347,7774,791,9.3,13.0,"error-handling, clumsy, design, suspicious, pi..."
13701,hive,2015-02-27 21:30:05,8327.0,3789.0,1071783.0,731599.0,364,352969.0,61412.0,75080.0,...,2.3,119218.0,431125.0,431125.0,139347,7774,791,9.3,13.0,pitfall
13700,hive,2015-02-27 23:08:33,8468.0,3872.0,1087272.0,742901.0,387,357917.0,62390.0,76071.0,...,2.3,120954.0,437096.0,437096.0,140709,7913,810,9.3,12.9,"convention, pitfall"
13699,hive,2015-03-02 18:18:35,8477.0,3882.0,1088466.0,743721.0,387,358306.0,62458.0,76112.0,...,2.3,121067.0,437585.0,437585.0,140806,7917,813,9.3,12.9,"error-handling, design, unused, suspicious"


## Handle missing values

In [3]:
df[df.isnull().any(axis=1)]

Unnamed: 0,PROJECT_ID,SQ_ANALYSIS_DATE,CLASSES,FILES,LINES,NCLOC,PACKAGE,STATEMENTS,FUNCTIONS,COMMENT_LINES,...,FUNCTION_COMPLEXITY,COGNITIVE_COMPLEXITY,LINES_TO_COVER,UNCOVERED_LINES,DUPLICATED_LINES,DUPLICATED_BLOCKS,DUPLICATED_FILES,COMMENT_LINES_DENSITY,DUPLICATED_LINES_DENSITY,TAGS


There are no missing values.

## Prepare labels
For the model to handle multiple tags correctly, they need to be encoded. Since classifier chains expect binary labels, the tags of each analysis are encoded as a binary indicator vector with one column per tag (multi-hot encoding).

In [4]:
all_tags = ['convention', 'brain-overload','unused','error-handling','bad-practice','pitfall',
            'clumsy','suspicious','design','antipattern','redundant','confusing','performance','obsolete']

# transform TAGS strings to lists
df.loc[:, 'TAGS'] = df['TAGS'].str.split(',')
# remove whitespaces
df.loc[:, 'TAGS'] = df['TAGS'].apply(lambda x: [item.strip() for item in x])

# save TAGS as raw_labels to be further processed
raw_labels = df['TAGS']

# initialise mlb with all tag categories
mlb = MultiLabelBinarizer(classes=all_tags)
# fit the mlb with the list of lists of raw labels
Y_binarized = mlb.fit_transform(raw_labels)

print(f"MLB classes (order of one-hot columns): {mlb.classes_}")
num_classes = len(mlb.classes_)
print(f"Total number of possible labels: {num_classes}")

MLB classes (order of one-hot columns): ['convention' 'brain-overload' 'unused' 'error-handling' 'bad-practice'
 'pitfall' 'clumsy' 'suspicious' 'design' 'antipattern' 'redundant'
 'confusing' 'performance' 'obsolete']
Total number of possible labels: 14
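
As a small illustration of the encoding, the following sketch binarizes two hypothetical tag lists over three of the tag names. `demo_tags`, `demo_labels` and `mlb_demo` are made up for this example. Fixing `classes` determines the column order.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# hypothetical subset of tags and two example analyses
demo_tags = ['convention', 'design', 'pitfall']
demo_labels = [['convention', 'design'], ['pitfall']]

mlb_demo = MultiLabelBinarizer(classes=demo_tags)
Y_demo = mlb_demo.fit_transform(demo_labels)

# one row per analysis, one column per tag (1 = tag present)
print(Y_demo)

# inverse_transform maps the indicator rows back to tag tuples
print(mlb_demo.inverse_transform(Y_demo))
```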


## Prepare predictors
To improve comparability between predictors, they are scaled with the StandardScaler.

In [5]:
def scale_predictors(df, label):
    """This function scales numerical predictor variables. The label remains unscaled."""
    columns_to_scale = [col for col in df.select_dtypes(include=['number']) if col != label]
    scaler = StandardScaler()
    df[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])
    return df

X = scale_predictors(df.select_dtypes(include='number'), 'TAGS')
X

Unnamed: 0,CLASSES,FILES,LINES,NCLOC,PACKAGE,STATEMENTS,FUNCTIONS,COMMENT_LINES,COMPLEXITY,CLASS_COMPLEXITY,FUNCTION_COMPLEXITY,COGNITIVE_COMPLEXITY,LINES_TO_COVER,UNCOVERED_LINES,DUPLICATED_LINES,DUPLICATED_BLOCKS,DUPLICATED_FILES,COMMENT_LINES_DENSITY,DUPLICATED_LINES_DENSITY
15553,-1.611448,-1.753882,-1.636504,-1.644758,-1.867228,-1.632894,-1.687940,-1.955492,-1.686843,0.028857,3.890931,-1.700907,-1.642291,-1.642291,-1.530234,-1.369866,-1.494250,-2.799625,2.711566
15552,-1.611448,-1.753882,-1.635665,-1.643786,-1.867228,-1.631576,-1.687632,-1.954361,-1.685957,0.028857,3.890931,-1.699043,-1.640975,-1.640975,-1.529098,-1.368268,-1.494250,-2.740282,2.711566
15551,-1.611448,-1.753882,-1.635338,-1.643336,-1.867228,-1.630968,-1.687323,-1.954270,-1.685602,0.094013,3.890931,-1.698769,-1.640406,-1.640406,-1.529189,-1.368268,-1.494250,-2.799625,2.680384
15550,-1.593440,-1.718480,-1.624254,-1.633426,-1.831254,-1.620154,-1.677453,-1.943865,-1.676188,-0.362080,3.890931,-1.688900,-1.630172,-1.630172,-1.510786,-1.360281,-1.468165,-2.680939,2.711566
15549,-1.592314,-1.716664,-1.622388,-1.631387,-1.831254,-1.618119,-1.674677,-1.942960,-1.673064,-0.296924,3.890931,-1.684926,-1.629048,-1.629048,-1.513169,-1.361878,-1.464904,-2.680939,2.586839
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13702,1.282620,1.360605,1.320827,1.343395,1.145643,1.329741,1.246249,1.307311,1.277539,-0.492393,-0.656656,1.275940,1.314624,1.314624,0.717393,1.253917,0.869667,-0.663264,-0.967876
13701,1.282620,1.360605,1.320827,1.343395,1.145643,1.329741,1.246249,1.307311,1.277539,-0.492393,-0.656656,1.275940,1.314624,1.314624,0.717393,1.253917,0.869667,-0.663264,-0.967876
13700,1.335519,1.435948,1.366436,1.392846,1.352497,1.374703,1.296525,1.352144,1.323948,-0.557549,-0.656656,1.323528,1.358777,1.358777,0.742359,1.309428,0.931618,-0.663264,-0.999058
13699,1.338896,1.445025,1.369952,1.396434,1.352497,1.378238,1.300021,1.353999,1.326650,-0.557549,-0.656656,1.326626,1.362393,1.362393,0.744137,1.311025,0.941400,-0.663264,-0.999058
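
The scaling cell above fits `StandardScaler` on all analyses, so statistics from later analyses also influence the scaled values of earlier ones. A possible alternative, sketched here on synthetic data with a simple chronological cut (all names are hypothetical), is to fit the scaler on the training part only and reuse it for the test part.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# synthetic, steadily growing metrics (illustrative only)
rng = np.random.default_rng(0)
X_demo = np.arange(100)[:, None] * np.array([1.0, 2.0, 3.0]) + rng.normal(size=(100, 3))

# chronological cut: first 80 rows for training, last 20 for testing
split = 80
X_train_demo, X_test_demo = X_demo[:split], X_demo[split:]

# fit on the training rows only, then apply the same transformation to both parts
scaler_demo = StandardScaler().fit(X_train_demo)
X_train_scaled = scaler_demo.transform(X_train_demo)
X_test_scaled = scaler_demo.transform(X_test_demo)

print(X_train_scaled.mean(axis=0))  # close to 0 by construction
print(X_test_scaled.mean(axis=0))   # not centred, since the test period lies later
```

With growing metrics, the scaled test values lie well outside the training range, which is the situation the model faces when predicting future analyses.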


## Train-Test-Split
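`TimeSeriesSplit` preserves the chronological order of the data, so the training set always precedes the test set. Because each loop iteration overwrites the previous split, only the last of the five folds is used for the models below.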

In [6]:
tscv = TimeSeriesSplit(n_splits=5)

for train_index, test_index in tscv.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    Y_train, Y_test = Y_binarized[train_index], Y_binarized[test_index]
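
To make the behaviour of the split concrete, this sketch runs `TimeSeriesSplit` on twelve dummy observations (`X_demo` and `folds` are hypothetical names). The training window grows with every fold, and the last fold is the one that remains after a loop like the one above.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# twelve observations in chronological order
X_demo = np.arange(12).reshape(-1, 1)

tscv_demo = TimeSeriesSplit(n_splits=5)
folds = list(tscv_demo.split(X_demo))

for k, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {k}: train={train_idx.tolist()} test={test_idx.tolist()}")

# the split that survives a loop which overwrites its variables
last_train, last_test = folds[-1]
```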

## Classifier Chain: Logistic Regression
As a first approach, logistic regression is used as the base model. With `class_weight='balanced'`, the classes are weighted inversely to their frequency, so that rare tags are not ignored. <br>
Performance is measured with the Jaccard score (`jaccard_score`), which describes the average overlap between the predicted labels and the true labels.

In [7]:
base_classifier = LogisticRegression(solver='liblinear', random_state=42, class_weight='balanced')

n_chains = 1
chains = []
for i in range(n_chains):
    chain = ClassifierChain(base_classifier, order='random', random_state=i)
    chain.fit(X_train, Y_train)
    chains.append(chain)

# aggregate predictions: average the predicted probabilities over all chains
Y_pred_proba_ensemble = np.array([chain.predict_proba(X_test) for chain in chains]).mean(axis=0)

# calculate and print jaccard score for different thresholds
print("\n--- Trying different thresholds ---")
thresholds_to_test = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
for threshold in thresholds_to_test:
    Y_pred_ensemble_binarized = (Y_pred_proba_ensemble >= threshold).astype(int)
    predicted_tags_th = mlb.inverse_transform(Y_pred_ensemble_binarized)
    jaccard_th = jaccard_score(Y_test, Y_pred_ensemble_binarized, average='samples', zero_division=0)
    print(f"\nThreshold: {threshold:.1f}")
    print(f"  Jaccard Score: {jaccard_th:.4f}")


--- Trying different thresholds ---

Threshold: 0.0
  Jaccard Score: 0.2080

Threshold: 0.1
  Jaccard Score: 0.2080

Threshold: 0.2
  Jaccard Score: 0.2080

Threshold: 0.3
  Jaccard Score: 0.2080

Threshold: 0.4
  Jaccard Score: 0.2080

Threshold: 0.5
  Jaccard Score: 0.2080

Threshold: 0.6
  Jaccard Score: 0.2058

Threshold: 0.7
  Jaccard Score: 0.2140

Threshold: 0.8
  Jaccard Score: 0.2231

Threshold: 0.9
  Jaccard Score: 0.2266


The model performs poorly on the test set. At a threshold of 0, where every label is predicted for every observation, the Jaccard score is ~20.8%. The score stays at this value for all thresholds up to 0.5 and rises only to about 22.7% at the highest threshold tested (0.9).
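
For orientation, here is a small worked example of the sample-averaged Jaccard score on hypothetical labels. For each observation the score is the number of tags in both the prediction and the truth, divided by the number of tags in either. Predicting every tag therefore yields the average share of tags that are actually present, which is what the threshold-0 baseline measures.

```python
import numpy as np
from sklearn.metrics import jaccard_score

# two observations, four possible tags (hypothetical values)
Y_true_demo = np.array([[1, 0, 0, 0],
                        [1, 1, 0, 0]])

# baseline: predict every tag for every observation
Y_all_demo = np.ones_like(Y_true_demo)
baseline = jaccard_score(Y_true_demo, Y_all_demo, average='samples', zero_division=0)
print(baseline)  # (1/4 + 2/4) / 2 = 0.375

# a more selective prediction
Y_pred_demo = np.array([[1, 0, 0, 0],
                        [1, 0, 1, 0]])
selective = jaccard_score(Y_true_demo, Y_pred_demo, average='samples', zero_division=0)
print(selective)  # (1/1 + 1/3) / 2 = 0.666...
```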

## Classifier Chain: Gradient Boosting
To see whether a good model can be achieved by changing the base model, another classifier chain is fitted with Gradient Boosting.

### Weighting
Because `GradientBoostingClassifier` has no `class_weight` parameter, sample weights are calculated directly and passed to the base classifier through metadata routing (`set_fit_request()`).

In [8]:
# calculate inverse class frequencies for each label
label_weights_pos = {}
num_training_samples = Y_train.shape[0]
num_labels = Y_train.shape[1]

for i in range(num_labels):
    # count occurrences for the current label
    n_pos = np.sum(Y_train[:, i] == 1)

    # calculate weights for the labels in a sample
    weight_pos = num_training_samples / (2 * n_pos) if n_pos > 0 else 1.0
    label_weights_pos[i] = weight_pos

print("\nCalculated Positive Class Weights for each label:")
for i, weight in label_weights_pos.items():
    print(f"  {mlb.classes_[i]}: {weight:.2f}")

# initialize all sample weights to 1.0 per default
sample_weights_train = np.ones(num_training_samples)

# iterate through each sample in the training set to calculate and apply weights
for i in range(num_training_samples):
    # get true labels for the current sample
    sample_true_labels = Y_train[i]

    # calculate a combined weight for this sample
    combined_weight = 0.0
    for j in range(num_labels):
        if sample_true_labels[j] == 1:
            combined_weight += label_weights_pos[j]

    # assign the combined weight to the sample (if there are no labels weight 1.0 is applied)
    sample_weights_train[i] = combined_weight if combined_weight > 0 else 1.0

# normalize sample weights
sample_weights_train = sample_weights_train / np.mean(sample_weights_train)


Calculated Positive Class Weights for each label:
  convention: 1.11
  brain-overload: 1.14
  unused: 1.39
  error-handling: 1.41
  bad-practice: 1.72
  pitfall: 1.82
  clumsy: 1.73
  suspicious: 1.86
  design: 2.04
  antipattern: 2.23
  redundant: 4.11
  confusing: 8.59
  performance: 8.89
  obsolete: 19.34
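
The two loops above can also be written in vectorized form. The following sketch uses a small hypothetical label matrix `Y_demo` and is intended to reproduce the same scheme: a positive-class weight of n / (2 * n_pos) per label, a per-sample sum over the labels that are present, a fallback of 1.0 for samples without labels, and normalization to a mean of 1.

```python
import numpy as np

# hypothetical label matrix: 4 samples, 3 labels (the last label never occurs)
Y_demo = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 0, 0],
                   [1, 0, 0]])
n_samples = Y_demo.shape[0]

# per-label positive weight: n / (2 * n_pos), or 1.0 if the label never occurs
n_pos = Y_demo.sum(axis=0)
label_weights = np.ones(Y_demo.shape[1])
seen = n_pos > 0
label_weights[seen] = n_samples / (2 * n_pos[seen])

# per-sample weight: sum of the weights of all labels present in the sample
sample_weights = Y_demo @ label_weights
sample_weights[sample_weights == 0] = 1.0  # samples without any label

# normalize so that the weights average to 1
sample_weights = sample_weights / sample_weights.mean()
print(label_weights)
print(sample_weights)
```

Note that a sample carrying many tags receives a large weight under this scheme, because the label weights are summed rather than averaged.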


### Model the Classifier Chain

In [9]:
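# enable metadata routing so that sample_weight reaches the base classifier
sklearn.set_config(enable_metadata_routing=True)
# initialise base classifier
base_classifier = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
base_classifier.set_fit_request(sample_weight=True)

# initialise and train the classifier chains
# note: every chain uses the default label order and the same seed, so the five chains
# are expected to be identical; order='random' with random_state=i would vary them
n_chains = 5
chains = []
for i in range(n_chains):
    chain = ClassifierChain(base_classifier)
    chain.fit(X_train, Y_train, sample_weight=sample_weights_train)
    chains.append(chain)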

# calculate label probabilities for all test data
Y_pred_proba_ensemble = np.array([chain.predict_proba(X_test) for chain in chains]).mean(axis=0)

print("\n--- Trying different thresholds ---")
thresholds_to_test = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
for threshold in thresholds_to_test:
    Y_pred_ensemble_th = (Y_pred_proba_ensemble >= threshold).astype(int)
    predicted_tags_th = mlb.inverse_transform(Y_pred_ensemble_th)
    jaccard_th = jaccard_score(Y_test, Y_pred_ensemble_th, average='samples', zero_division=0)
    print(f"\nThreshold: {threshold:.1f}")
    print(f"  Jaccard Score: {jaccard_th:.4f}")


--- Trying different thresholds ---

Threshold: 0.0
  Jaccard Score: 0.2080

Threshold: 0.1
  Jaccard Score: 0.2372

Threshold: 0.2
  Jaccard Score: 0.2121

Threshold: 0.3
  Jaccard Score: 0.1809

Threshold: 0.4
  Jaccard Score: 0.1698

Threshold: 0.5
  Jaccard Score: 0.1483

Threshold: 0.6
  Jaccard Score: 0.1306

Threshold: 0.7
  Jaccard Score: 0.0606

Threshold: 0.8
  Jaccard Score: 0.0448

Threshold: 0.9
  Jaccard Score: 0.0362


As a base model, Gradient Boosting performs similarly to Logistic Regression. When all labels are predicted for each analysis (threshold 0), the Jaccard score is 20.8%. At the best-performing threshold (0.1), the score is 23.7%, which is only a slight improvement and still a poor result overall.