## Classifier Chain for Tag prediction using  SciKit-Learn
Classifier Chain is an ensemble model that is built for multi-label classification. It is able to catch potential correlated relationships between the different tags. The chain is done through feeding the prediction of one classifier into the next. This method is chosen to model the time series analysis with their newly occurring tags.

In [18]:
import os
import numpy as np
import pandas as pd
from sklearn.multioutput import ClassifierChain
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler
from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, make_scorer

In [2]:
# data import
current_dir = os.getcwd()

# construct path to the project data folder
data_dir = os.path.join(current_dir, '..', '..', '..', 'Data','Sonar_Issues')

# load SonarQube measure data (without duplicates)
df = pd.read_csv(os.path.join(data_dir, 'measures+tags.csv'), low_memory=False)

# filter for project hive
df = df[df['PROJECT_ID'] == 'hive']
df['SQ_ANALYSIS_DATE'] = pd.to_datetime(df['SQ_ANALYSIS_DATE'])

# sort the df so that the dates are ordered from oldest to newest analysis
df = df.sort_values(by='SQ_ANALYSIS_DATE')
df

Unnamed: 0,PROJECT_ID,SQ_ANALYSIS_DATE,CLASSES,FILES,LINES,NCLOC,PACKAGE,STATEMENTS,FUNCTIONS,COMMENT_LINES,...,FUNCTION_COMPLEXITY,COGNITIVE_COMPLEXITY,LINES_TO_COVER,UNCOVERED_LINES,DUPLICATED_LINES,DUPLICATED_BLOCKS,DUPLICATED_FILES,COMMENT_LINES_DENSITY,DUPLICATED_LINES_DENSITY,TAGS
15553,hive,2008-09-02 23:58:59,613.0,358.0,67469.0,48651.0,29,26933.0,4334.0,2958.0,...,2.6,10623.0,31250.0,31250.0,16728,1204,66,5.7,24.8,"error-handling, clumsy, brain-overload, design..."
15552,hive,2008-09-17 00:28:22,613.0,358.0,67754.0,48873.0,29,27078.0,4340.0,2983.0,...,2.6,10691.0,31428.0,31428.0,16790,1208,66,5.8,24.8,"brain-overload, clumsy"
15551,hive,2008-09-17 20:13:00,613.0,358.0,67865.0,48976.0,29,27145.0,4346.0,2985.0,...,2.6,10701.0,31505.0,31505.0,16785,1208,66,5.7,24.7,"convention, design"
15550,hive,2008-09-18 00:09:17,661.0,397.0,71629.0,51241.0,33,28335.0,4538.0,3215.0,...,2.6,11061.0,32889.0,32889.0,17789,1228,74,5.9,24.8,"error-handling, clumsy, brain-overload, design..."
15549,hive,2008-09-18 17:37:59,664.0,399.0,72263.0,51707.0,33,28559.0,4592.0,3235.0,...,2.6,11206.0,33041.0,33041.0,17659,1224,75,5.9,24.4,"error-handling, clumsy, brain-overload, bad-pr..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13702,hive,2015-02-27 21:09:45,8327.0,3789.0,1071783.0,731599.0,364,352969.0,61412.0,75080.0,...,2.3,119218.0,431125.0,431125.0,139347,7774,791,9.3,13.0,"error-handling, clumsy, design, suspicious, pi..."
13701,hive,2015-02-27 21:30:05,8327.0,3789.0,1071783.0,731599.0,364,352969.0,61412.0,75080.0,...,2.3,119218.0,431125.0,431125.0,139347,7774,791,9.3,13.0,pitfall
13700,hive,2015-02-27 23:08:33,8468.0,3872.0,1087272.0,742901.0,387,357917.0,62390.0,76071.0,...,2.3,120954.0,437096.0,437096.0,140709,7913,810,9.3,12.9,"convention, pitfall"
13699,hive,2015-03-02 18:18:35,8477.0,3882.0,1088466.0,743721.0,387,358306.0,62458.0,76112.0,...,2.3,121067.0,437585.0,437585.0,140806,7917,813,9.3,12.9,"error-handling, design, unused, suspicious"


## Handle missing values

In [3]:
df[df.isnull().any(axis=1)]

Unnamed: 0,PROJECT_ID,SQ_ANALYSIS_DATE,CLASSES,FILES,LINES,NCLOC,PACKAGE,STATEMENTS,FUNCTIONS,COMMENT_LINES,...,FUNCTION_COMPLEXITY,COGNITIVE_COMPLEXITY,LINES_TO_COVER,UNCOVERED_LINES,DUPLICATED_LINES,DUPLICATED_BLOCKS,DUPLICATED_FILES,COMMENT_LINES_DENSITY,DUPLICATED_LINES_DENSITY,TAGS


There are no missing values.

## Prepare labels
For the model zu handle multiple tags correctly, they need to be encoded. Since classifier chains expect binary labels, the tags are one-hot-encoded.

In [4]:
all_tags = ['convention', 'brain-overload','unused','error-handling','bad-practice','pitfall',
            'clumsy','suspicious','design','antipattern','redundant','confusing','performance','obsolete']

# transform TAGS strings to lists
df.loc[:, 'TAGS'] = df['TAGS'].str.split(',')
# remove whitespaces
df.loc[:, 'TAGS'] = df['TAGS'].apply(lambda x: [item.strip() for item in x])

# save TAGS as raw_labels to be further processed
raw_labels = df['TAGS']

# initialise mlb with all tag categories
mlb = MultiLabelBinarizer(classes=all_tags)
# fit the mlb with the list of lists of raw labels
Y_binarized = mlb.fit_transform(raw_labels)

print(f"MLB classes (order of one-hot columns): {mlb.classes_}")
num_classes = len(mlb.classes_)
print(f"Total number of possible labels: {num_classes}")

MLB classes (order of one-hot columns): ['convention' 'brain-overload' 'unused' 'error-handling' 'bad-practice'
 'pitfall' 'clumsy' 'suspicious' 'design' 'antipattern' 'redundant'
 'confusing' 'performance' 'obsolete']
Total number of possible labels: 14


## Prepare predictors
To improve comparability between predictors, they are scaled with the StandardScaler.

In [5]:
def scale_predictors(df, label):
    """This function scales numerical predictor variables. The label remains unscaled."""
    columns_to_scale = [col for col in df.select_dtypes(include=['number']) if col != label]
    scaler = StandardScaler()
    df[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])
    return df

X = scale_predictors(df.select_dtypes(include='number'), 'TAGS')
X

Unnamed: 0,CLASSES,FILES,LINES,NCLOC,PACKAGE,STATEMENTS,FUNCTIONS,COMMENT_LINES,COMPLEXITY,CLASS_COMPLEXITY,FUNCTION_COMPLEXITY,COGNITIVE_COMPLEXITY,LINES_TO_COVER,UNCOVERED_LINES,DUPLICATED_LINES,DUPLICATED_BLOCKS,DUPLICATED_FILES,COMMENT_LINES_DENSITY,DUPLICATED_LINES_DENSITY
15553,-1.611448,-1.753882,-1.636504,-1.644758,-1.867228,-1.632894,-1.687940,-1.955492,-1.686843,0.028857,3.890931,-1.700907,-1.642291,-1.642291,-1.530234,-1.369866,-1.494250,-2.799625,2.711566
15552,-1.611448,-1.753882,-1.635665,-1.643786,-1.867228,-1.631576,-1.687632,-1.954361,-1.685957,0.028857,3.890931,-1.699043,-1.640975,-1.640975,-1.529098,-1.368268,-1.494250,-2.740282,2.711566
15551,-1.611448,-1.753882,-1.635338,-1.643336,-1.867228,-1.630968,-1.687323,-1.954270,-1.685602,0.094013,3.890931,-1.698769,-1.640406,-1.640406,-1.529189,-1.368268,-1.494250,-2.799625,2.680384
15550,-1.593440,-1.718480,-1.624254,-1.633426,-1.831254,-1.620154,-1.677453,-1.943865,-1.676188,-0.362080,3.890931,-1.688900,-1.630172,-1.630172,-1.510786,-1.360281,-1.468165,-2.680939,2.711566
15549,-1.592314,-1.716664,-1.622388,-1.631387,-1.831254,-1.618119,-1.674677,-1.942960,-1.673064,-0.296924,3.890931,-1.684926,-1.629048,-1.629048,-1.513169,-1.361878,-1.464904,-2.680939,2.586839
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13702,1.282620,1.360605,1.320827,1.343395,1.145643,1.329741,1.246249,1.307311,1.277539,-0.492393,-0.656656,1.275940,1.314624,1.314624,0.717393,1.253917,0.869667,-0.663264,-0.967876
13701,1.282620,1.360605,1.320827,1.343395,1.145643,1.329741,1.246249,1.307311,1.277539,-0.492393,-0.656656,1.275940,1.314624,1.314624,0.717393,1.253917,0.869667,-0.663264,-0.967876
13700,1.335519,1.435948,1.366436,1.392846,1.352497,1.374703,1.296525,1.352144,1.323948,-0.557549,-0.656656,1.323528,1.358777,1.358777,0.742359,1.309428,0.931618,-0.663264,-0.999058
13699,1.338896,1.445025,1.369952,1.396434,1.352497,1.378238,1.300021,1.353999,1.326650,-0.557549,-0.656656,1.326626,1.362393,1.362393,0.744137,1.311025,0.941400,-0.663264,-0.999058


## Modelling the Classifier Chains
To ensure the order of the data is used properly, with the training data chronologically being set before the testing set. For this modelling approach, the models are trained on the different folds, making it necessary to incorporate the modelling process into the time series split.

### Logistic Regression

In [11]:
# outer TimeSeriesSplit for evaluating overall performance
n_outer_splits = 3

# initiate model
n_chains = 5
base_classifier = LogisticRegression(solver='liblinear', random_state=42, class_weight='balanced')
thresholds_to_test = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

# storing results for global threshold optimization and final evaluation
all_Y_test_for_global_threshold = []
all_Y_pred_proba_for_global_threshold = []

# outer loop: time series cv for performance evaluation
outer_tscv = TimeSeriesSplit(n_splits=n_outer_splits)

print(f"--- Starting Outer Loop ({n_outer_splits} folds) ---")
for outer_fold_num, (train_index_outer, test_index_outer) in enumerate(outer_tscv.split(X)):
    print(f"\nOuter Fold {outer_fold_num + 1}/{n_outer_splits}")

    # splitting data for the current outer fold
    X_train_outer, X_test_outer = X.iloc[train_index_outer], X.iloc[test_index_outer]
    Y_train_outer, Y_test_outer = Y_binarized[train_index_outer], Y_binarized[test_index_outer]

    # training the ensemble of classifier chains on the outer training data ---
    current_fold_chains = []
    print(f"  Training {n_chains} classifier chains on outer training data...")
    for i in range(n_chains):
        chain = ClassifierChain(base_classifier, order='random', random_state=i)
        chain.fit(X_train_outer, Y_train_outer)
        current_fold_chains.append(chain)

    # making probabilistic predictions on the current outer test set
    Y_pred_proba_outer_test = np.array([chain.predict_proba(X_test_outer) for chain in current_fold_chains]).mean(axis=0)

    # accumulating results for global threshold optimization
    all_Y_test_for_global_threshold.append(Y_test_outer)
    all_Y_pred_proba_for_global_threshold.append(Y_pred_proba_outer_test)

    print(f"  Finished predictions for Outer Fold {outer_fold_num + 1}")

# concatenating all test predictions and true values
Y_test_overall = np.concatenate(all_Y_test_for_global_threshold, axis=0)
Y_pred_proba_overall = np.concatenate(all_Y_pred_proba_for_global_threshold, axis=0)

# calculating ROC AUC scores
print("\n--- overall ROC AUC Scores ---")
roc_auc_micro_overall = roc_auc_score(Y_test_overall, Y_pred_proba_overall, average='micro')
print(f"  Micro ROC AUC: {roc_auc_micro_overall:.4f}")

roc_auc_macro_overall = roc_auc_score(Y_test_overall, Y_pred_proba_overall, average='macro')
print(f"  Macro ROC AUC: {roc_auc_macro_overall:.4f}")

# optimising the global threshold on the concatenated test data
print("\n--- Optimizing Global Threshold ---")

# optimisation based on micro f1
best_global_micro_f1 = -1
optimal_global_threshold = 0.0
metrics_at_optimal_threshold = {}

for threshold in thresholds_to_test:
    Y_pred_binarized_global = (Y_pred_proba_overall >= threshold).astype(int)

    # calculating all metrics for the current threshold
    current_micro_f1 = f1_score(Y_test_overall, Y_pred_binarized_global, average='micro', zero_division=0)
    current_micro_precision = precision_score(Y_test_overall, Y_pred_binarized_global, average='micro', zero_division=0)
    current_micro_recall = recall_score(Y_test_overall, Y_pred_binarized_global, average='micro', zero_division=0)
    current_macro_f1 = f1_score(Y_test_overall, Y_pred_binarized_global, average='macro', zero_division=0)
    current_macro_precision = precision_score(Y_test_overall, Y_pred_binarized_global, average='macro', zero_division=0)
    current_macro_recall = recall_score(Y_test_overall, Y_pred_binarized_global, average='macro', zero_division=0)

    # checking if this threshold gives a better Micro F1-score
    if current_micro_f1 > best_global_micro_f1:
        best_global_micro_f1 = current_micro_f1
        optimal_global_threshold = threshold
        metrics_at_optimal_threshold = {
            'Micro F1': current_micro_f1,
            'Micro Precision': current_micro_precision,
            'Micro Recall': current_micro_recall,
            'Macro F1': current_macro_f1,
            'Macro Precision': current_macro_precision,
            'Macro Recall': current_macro_recall
        }

print(f"\nOptimal Global Threshold Found (maximizing Micro F1): {optimal_global_threshold:.1f}")
print("\n--- Final Model Performance at Optimal Global Threshold ---")
for metric_name, score in metrics_at_optimal_threshold.items():
    print(f"  {metric_name}: {score:.4f}")

--- Starting Outer Loop (3 folds) ---

Outer Fold 1/3
  Training 5 classifier chains on outer training data...
  Finished predictions for Outer Fold 1

Outer Fold 2/3
  Training 5 classifier chains on outer training data...
  Finished predictions for Outer Fold 2

Outer Fold 3/3
  Training 5 classifier chains on outer training data...
  Finished predictions for Outer Fold 3

--- overall ROC AUC Scores ---
  Micro ROC AUC: 0.5598
  Macro ROC AUC: 0.5225

--- Optimizing Global Threshold ---

Optimal Global Threshold Found (maximizing Micro F1): 0.4

--- Final Model Performance at Optimal Global Threshold ---
  Micro F1: 0.4009
  Micro Precision: 0.2554
  Micro Recall: 0.9314
  Macro F1: 0.3572
  Macro Precision: 0.2355
  Macro Recall: 0.8707


Using Logistic Regression as a base classifier, the ROC AUC over all predictions is low with only achieving 0.56. The quality of
predictions per class are slightly worse, scoring only 0.52. The Recall metric is very good overall but F1-score is low.

### Light Gradient Boosting Machine
As a first approach, a Logistic Regression base model is chosen. The classes are balanced to ensure the tags are weighted depending on their occurence. Hyperparameter optimisation is employed to ensure the best possible model performance.

In [22]:
# configurating data splits for nested cv
n_outer_splits = 3 # for final model evaluation
n_inner_splits = 2 # for hyperparameter tuning within each outer fold
n_chains = 5
thresholds_to_test = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

# defining the hyperparameter distribution for LGBMClassifier
param_dist = {
    'n_estimators': [100, 200, 400, 800], 
    'learning_rate': [0.005, 0.01, 0.05],
    'num_leaves': [7, 15, 20, 31, 40, 60],
    'max_depth': [-1, 3, 5, 7, 10],
    'min_child_samples': [10, 20, 30, 50],
    'reg_alpha': [0, 0.01, 0.1, 0.5],
    'reg_lambda': [0, 0.01, 0.1, 0.5],
    'colsample_bytree': [0.6, 0.7, 0.8, 0.9],
    'subsample': [0.7, 0.8, 0.9],
    'subsample_freq': [1],
    'min_gain_to_split': [0, 0.01, 0.1],
    'is_unbalance': [True]
}

# defining micro f1 as scoring metric for RandomizedSearchCV
scorer = make_scorer(f1_score, average='micro', zero_division=0)

# storing results for final evaluation
all_Y_test_overall_for_metrics = []
all_Y_pred_proba_overall_for_metrics = []

print(f"--- Starting Outer Loop ({n_outer_splits} folds) for Model Evaluation ---")
outer_tscv = TimeSeriesSplit(n_splits=n_outer_splits)

for outer_fold_num, (train_index_outer, test_index_outer) in enumerate(outer_tscv.split(X)):
    print(f"\nOuter Fold {outer_fold_num + 1}/{n_outer_splits}")

    # splitting data for the current outer fold
    X_train_outer, X_test_outer = X.iloc[train_index_outer], X.iloc[test_index_outer]
    Y_train_outer, Y_test_outer = Y_binarized[train_index_outer], Y_binarized[test_index_outer]

    # hyperparameter tuning using TimeSeriesSplit on outer training data
    print(f"  Starting Inner Loop ({n_inner_splits} folds) for Hyperparameter Tuning...")

    # instantiating the base classifier for the inner loop
    lgbm = LGBMClassifier(objective='binary', random_state=42)

    # wrapping LGBMClassifier in a ClassifierChain for hyperparameter tuning
    chain_for_tuning = ClassifierChain(lgbm, order='random', random_state=0)

    # creating inner TimeSeriesSplit for tuning
    inner_tscv = TimeSeriesSplit(n_splits=n_inner_splits)

    # setting up RandomizedSearchCV for hyperparameter tuning
    random_search = RandomizedSearchCV(
        estimator=chain_for_tuning,
        param_distributions={f'base_estimator__{k}': v for k, v in param_dist.items()},
        n_iter=10, 
        cv=inner_tscv,
        scoring=scorer,
        n_jobs=-1,
        verbose=1,
        random_state=42
    )

    # fitting RandomizedSearchCV on the outer training data
    random_search.fit(X_train_outer, Y_train_outer)

    best_lgbm_params = random_search.best_params_
    best_lgbm_score = random_search.best_score_
    print(f"  Best LGBM Classifier parameters found in inner loop: {best_lgbm_params}")
    print(f"  Best Micro F1-score from inner loop (tuning): {best_lgbm_score:.4f}")

    # training final ensemble of classifier chains for current Outer Fold with best parameters
    print(f"  Training final ensemble with best parameters for Outer Fold {outer_fold_num + 1}...")
    current_fold_chains = []
    # extracting params for the base classifier itself
    lgbm_base_params = {k.replace('base_classifier__', ''): v for k, v in best_lgbm_params.items()}
    # ensuring random_state is passed for reproducibility of the base LGBM classifier
    lgbm_base_params['random_state'] = 42

    for i in range(n_chains):
        # instantiating a new LGBMClassifier with the best found parameters
        tuned_base_classifier = LGBMClassifier(objective='binary', **lgbm_base_params)
        # creating ClassifierChain with the tuned base classifier
        chain = ClassifierChain(tuned_base_classifier, order='random', random_state=i) # Different random state for chain order
        chain.fit(X_train_outer, Y_train_outer)
        current_fold_chains.append(chain)

    # making probabilistic predictions on the current outer test set
    Y_pred_proba_outer_test = np.array([chain.predict_proba(X_test_outer) for chain in current_fold_chains]).mean(axis=0)

    # accumulating results for global threshold optimization and final evaluation
    all_Y_test_overall_for_metrics.append(Y_test_outer)
    all_Y_pred_proba_overall_for_metrics.append(Y_pred_proba_outer_test)

    print(f"  Finished predictions for Outer Fold {outer_fold_num + 1}")

# concatenating all test predictions and true values
Y_test_overall_final = np.concatenate(all_Y_test_overall_for_metrics, axis=0)
Y_pred_proba_overall_final = np.concatenate(all_Y_pred_proba_overall_for_metrics, axis=0)

# calculating overall ROC AUC Scores
print("\n--- Overall ROC AUC Scores (Independent of Threshold) ---")
try:
    roc_auc_micro_overall = roc_auc_score(Y_test_overall_final, Y_pred_proba_overall_final, average='micro')
    print(f"  Micro ROC AUC: {roc_auc_micro_overall:.4f}")
except ValueError as e:
    print(f"  Micro ROC AUC: N/A (Error: {e}) - Check for single class or constant predictions for a label.")

try:
    roc_auc_macro_overall = roc_auc_score(Y_test_overall_final, Y_pred_proba_overall_final, average='macro')
    print(f"  Macro ROC AUC: {roc_auc_macro_overall:.4f}")
except ValueError as e:
    print(f"  Macro ROC AUC: N/A (Error: {e}) - Check for single class or constant predictions for a label.")

# optimising global threshold on the concatenated test data
print("\n--- Optimizing Global Threshold ---")

best_global_micro_f1 = -1
optimal_global_threshold = 0.0
metrics_at_optimal_threshold = {}

for threshold in thresholds_to_test:
    Y_pred_binarized_global = (Y_pred_proba_overall_final >= threshold).astype(int)

    current_micro_f1 = f1_score(Y_test_overall_final, Y_pred_binarized_global, average='micro', zero_division=0)
    current_micro_precision = precision_score(Y_test_overall_final, Y_pred_binarized_global, average='micro', zero_division=0)
    current_micro_recall = recall_score(Y_test_overall_final, Y_pred_binarized_global, average='micro', zero_division=0)
    current_macro_f1 = f1_score(Y_test_overall_final, Y_pred_binarized_global, average='macro', zero_division=0)
    current_macro_precision = precision_score(Y_test_overall_final, Y_pred_binarized_global, average='macro', zero_division=0)
    current_macro_recall = recall_score(Y_test_overall_final, Y_pred_binarized_global, average='macro', zero_division=0)

    if current_micro_f1 > best_global_micro_f1:
        best_global_micro_f1 = current_micro_f1
        optimal_global_threshold = threshold
        metrics_at_optimal_threshold = {
            'Micro F1': current_micro_f1,
            'Micro Precision': current_micro_precision,
            'Micro Recall': current_micro_recall,
            'Macro F1': current_macro_f1,
            'Macro Precision': current_macro_precision,
            'Macro Recall': current_macro_recall
        }

print(f"\nOptimal Global Threshold Found (maximizing Micro F1): {optimal_global_threshold:.1f}")
print("\n--- Final Model Performance at Optimal Global Threshold ---")
for metric_name, score in metrics_at_optimal_threshold.items():
    print(f"  {metric_name}: {score:.4f}")

print("\nThis means the model achieved these performance metrics over all out-of-sample predictions when using a single, globally optimized threshold derived from nested CV.")

--- Starting Outer Loop (3 folds) for Model Evaluation ---

Outer Fold 1/3
  Starting Inner Loop (2 folds) for Hyperparameter Tuning...
Fitting 2 folds for each of 10 candidates, totalling 20 fits
[LightGBM] [Info] Number of positive: 94, number of negative: 370
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000772 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2085
[LightGBM] [Info] Number of data points in the train set: 464, number of used features: 19
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.202586 -> initscore=-1.370208
[LightGBM] [Info] Start training from score -1.370208
[LightGBM] [Info] Number of positive: 150, number of negative: 314
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000500 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2087
[LightGBM] [Info] Number of data points in the train 

The resulting metrics show that the model achieves a good model quality when considering every prediction equally (0.65 ROC AUC).
This performance is worse when considering predictions per class (0.51 ROC AUC). The other metrics compare to prior models, not achieving a good performance overall.