## CM-PaCE (Classifier Model for Posts about Configuration Errors)
#### Implementation of Word2Vec feature extraction and SVM classification
Using gensim for Word2Vec part\
Using sklearn for SVM part

2023, Ferris Kleier

Loading the necessary libraries

In [2]:
import gensim
from gensim.models import Word2Vec
from gensim.parsing.preprocessing import *
import sklearn.svm
from sklearn.model_selection import train_test_split
import numpy as np
import datetime
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV
import joblib

Some defined constants to reuse

In [3]:
CUSTOM_FILTERS = [lambda x: x.lower(), strip_tags, strip_punctuation,
                  remove_stopwords, strip_multiple_whitespaces, strip_numeric, stem_text]

#### Functions for the Word2Vec section

labeled_sets() processes the two full body sets of positive and negative posts and returns the combined full set.

In [4]:
def labeled_sets():

    full_set = []

    with open("../Posts/positives_body.txt", 'r') as f:
        for line in f:
            line = preprocess_string(line, CUSTOM_FILTERS)
            full_set.append([line, True])

    with open("../Posts/negatives_body.txt", 'r') as f:
        for line in f:
            line = preprocess_string(line, CUSTOM_FILTERS)
            full_set.append([line, False])
            
    return full_set

logging(string) appends the model's output, including its results, to a separate log file.

In [5]:
def logging(string):
    with open("../Code/cmpace_log.txt", "a") as f:
        f.write("\n\n--------------------------------")
        f.write(string)

get_bodies() returns only the body column of a set.

In [6]:
def get_bodies(data):
    return [row[0] for row in data]

### Word2Vec Section
In this section, we will implement the code for our Word2Vec model.

First, we use the already defined function labeled_sets() to create a labeled, combined set of the raw bodies. This full set will then be split into the training and test set in a 90/10 ratio using the sklearn library.

In [7]:
train_set, test_set = train_test_split(labeled_sets(), test_size=0.1)

Here we can see that the sklearn method train_test_split() shuffles the combined set before slicing it. Since labeled_sets() appends all positive posts before all negative ones, this shuffling mixes the labels and avoids an ordering bias, so both the training and the test set contain positive and negative posts.

In [8]:
true_count_train, false_count_train = (
    [row[1] for row in train_set].count(True),
    [row[1] for row in train_set].count(False),
)

true_count_test, false_count_test = (
    [row[1] for row in test_set].count(True),
    [row[1] for row in test_set].count(False),
)

print(true_count_train, false_count_train)
print(true_count_test, false_count_test)

235 216
26 25
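The shuffle-then-slice behaviour described above can be sketched in plain Python. This is only an illustration, not sklearn's implementation: the helper `shuffle_split` and the toy rows are hypothetical, and the sketch assumes that the test size is rounded up to the next whole post. Under that assumption, the 502 posts counted above yield the same 451/51 split.

```python
import math
import random


def shuffle_split(rows, test_size=0.1, seed=42):
    """Shuffle a copy of the rows, then slice off the test portion."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    # Assumption: the number of test rows is rounded up.
    n_test = math.ceil(test_size * len(rows))
    return rows[n_test:], rows[:n_test]


# Toy stand-in for labeled_sets(): 261 positive and 241 negative posts.
toy = [[["tok"], True]] * 261 + [[["tok"], False]] * 241
train, test = shuffle_split(toy)
print(len(train), len(test))
```

With sklearn itself, passing `random_state` to train_test_split() makes the split reproducible, and `stratify` keeps the label ratio the same in both sets.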


#### Parameter Selection

**min_count**: words that occur fewer times than this value are ignored (suggested: 5)\
**alpha**: the initial learning rate of the model (suggested: 0.01)\
**epochs**: how many iterations over the corpus the model performs (suggested: 500)\
**sg**: if 1, the skip-gram approach is used for training; otherwise CBOW is used\
**hs**: if 1, hierarchical softmax will be used for model training\
**window**: the maximum distance between the current word and the neighboring words that the skip-gram approach considers
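To make the role of the window parameter more concrete, here is a minimal sketch of how skip-gram pairs each word with its neighbors inside the window. The function `skipgram_pairs` and the example tokens are hypothetical. The sketch leaves out everything else the library does during training (for example subsampling), so it only illustrates the idea.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every token and its neighbors within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["configur", "file", "error", "load"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

A larger window produces more pairs per word and therefore captures broader context, at the cost of longer training.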

Below you can see our current best values for producing the desired model, after evaluating and comparing several other parameter selections.

In [9]:
min_count = 5
alpha = 0.01
epochs = 500
sg = 1
hs = 1
window = 4

#### Model Creation

We name the saved model after the current time, so that different runs can be told apart during evaluation. When training, the helper get_bodies() passes only the bodies of the training set to Word2Vec, since the model makes no use of the provided labels.

In [10]:
time = datetime.datetime.now().strftime("%Y-%m-%d_%H:%M:%S")
modelName = "../Models/Word2Vec/cmpace_w2v_{time}.model".format(time=time)
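One caveat with time-based names: the colons produced by `%H:%M:%S` are not allowed in file names on Windows. If the notebook should run there as well, a variant like the following could be used. The helper `timestamped_name` is hypothetical and only a sketch.

```python
import datetime


def timestamped_name(prefix, now=None):
    """Build a model file name with a timestamp that avoids colons."""
    now = now or datetime.datetime.now()
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return f"{prefix}_{stamp}.model"


name = timestamped_name("cmpace_w2v", datetime.datetime(2023, 8, 25, 12, 6, 49))
print(name)  # cmpace_w2v_2023-08-25_12-06-49.model
```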

Using the gensim library, we can create our model by just passing the parameters. We also activate loss computation for further evaluation. The model will then be saved with the provided model name. This way we can compare different models and store every artifact.

In [10]:
model = Word2Vec(sentences=get_bodies(train_set), min_count=min_count,
                 sg=sg, hs=hs, window=window,
                 alpha=alpha, epochs=epochs, compute_loss=True)

model.save(modelName)

Run this block only if a model should be loaded from a provided path.

In [11]:
modelName = "../Models/Word2Vec/cmpace_w2v_v2.1_PRIME.model"
model = gensim.models.Word2Vec.load(modelName)

Here we build the header information for our model. It will also be written to the log file together with the model's output.

In [12]:
logString = f"\nParameters and Results for Word2Vec Model {modelName}\n" + \
    f"min_count = {model.min_count}\n" + \
    f"alpha = {model.alpha}\n" + \
    f"sg = {model.sg}\n" + \
    f"hs = {model.hs}\n" + \
    f"window = {model.window}\n" + \
    f"epochs = {model.epochs}"
lossString = f"\nloss = {model.get_latest_training_loss()}\n"
logString += lossString

wv holds the model's keyed vectors, which we use to find similar words and to measure the similarity between words. word_vectors is the underlying array with one vector per vocabulary word and serves as a full representation of the model's output.

In [44]:
wv = model.wv
word_vectors = wv.vectors
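The similarity between two words is the cosine similarity of their vectors. As a small illustration of the measure itself, here is a sketch in plain Python with made-up vectors (the function name and the inputs are hypothetical):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # vectors pointing the same way
```

Values close to 1 mean that two words appear in very similar contexts, values around 0 mean that their contexts are unrelated.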

#### Validation

In this section the model will be inspected to see if the values make sense. Similarities above 0.999 hint at overfitting, and the top n similar words can indicate that parameters such as the window size need to be refined.

Here we collect the top n words of the vocabulary, which gensim sorts by descending frequency in the corpus by default.

In [48]:
for index, word in enumerate(wv.index_to_key):
    if index == 11: break
    indexString = f"{index}/{len(wv.index_to_key)}: {word}"
    logString += f"\n{indexString}"

This part is the most interesting because we can see our model in action.

We can inspect the similarity between words, the top n similar words to a provided word, or which word does not fit by context in a list of words.

In [46]:
w1 = 'config'
w2 = 'error'
w3 = 'configur'
wList = ['configur', 'error', 'invalid', 'java', 'problem', 'cloud']

resString = (
    f"\nSimilarity between {w1} and {w2}:\n{wv.similarity(w1, w2)}\n\n"
    + f"Most similar words for {w1}:\n{wv.most_similar(positive=[w1], topn=5)}\n\n"
    + f"Similarity between {w3} and {w2}:\n{wv.similarity(w3, w2)}\n\n"
    + f"Most similar words for {w3}:\n{wv.most_similar(positive=[w3], topn=5)}\n\n"
    + f"Which does not fit in {wList}:\n{wv.doesnt_match(wList)}\n"
)
logString += f"\n{resString}"

print(resString)



Similarity between config and error:
0.294527530670166

Most similar words for config:
[('web', 0.5596780776977539), ('file', 0.5173095464706421), ('cudpp', 0.40932998061180115), ('rb', 0.40153074264526367), ('wwwroot', 0.3935110569000244)]

Similarity between configur and error:
0.7645865678787231

Most similar words for configur:
[('error', 0.7645866870880127), ('file', 0.5191519856452942), ('occur', 0.4893133342266083), ('give', 0.4655228853225708), ('applic', 0.42140066623687744)]

Which does not fit in ['configur', 'error', 'invalid', 'java', 'problem', 'cloud']:
java
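The odd-one-out query can be pictured as follows: all vectors are scaled to unit length, their mean is computed, and the word whose vector is least similar to that mean is returned. The sketch below assumes this procedure, which is a common way to implement such a query. The function `odd_one_out` and the two-dimensional toy vectors are hypothetical and not taken from our model.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def odd_one_out(vectors):
    """vectors: dict mapping word -> vector. Returns the word least similar to the group mean."""
    normed = {w: unit(v) for w, v in vectors.items()}
    dim = len(next(iter(normed.values())))
    mean = unit([sum(v[i] for v in normed.values()) / len(normed) for i in range(dim)])
    # The dot product of unit vectors equals their cosine similarity.
    return min(normed, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))


# Hypothetical toy vectors for illustration only.
toy_vectors = {
    "configur": [1.0, 0.1],
    "error": [0.9, 0.2],
    "invalid": [0.8, 0.1],
    "java": [0.1, 1.0],
}
print(odd_one_out(toy_vectors))
```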



After running the model and printing all outputs, the log will be written to keep track of all artifacts.

In [47]:
print(logString)
logging(logString)


Parameters and Results for Word2Vec Model ../Models/Word2Vec/cmpace_w2v_v2.1_PRIME.model
min_count = 5
alpha = 0.01
sg = 1
hs = 1
window = 4
epochs = 500
loss = 57536732.0


Similarity between config and error:
0.294527530670166

Most similar words for config:
[('web', 0.5596780776977539), ('file', 0.5173095464706421), ('cudpp', 0.40932998061180115), ('rb', 0.40153074264526367), ('wwwroot', 0.3935110569000244)]

Similarity between configur and error:
0.7645865678787231

Most similar words for configur:
[('error', 0.7645866870880127), ('file', 0.5191519856452942), ('occur', 0.4893133342266083), ('give', 0.4655228853225708), ('applic', 0.42140066623687744)]

Which does not fit in ['configur', 'error', 'invalid', 'java', 'problem', 'cloud']:
java



### SVM Section

In this section we will build the SVM model that takes the feature vectors from the Word2Vec model and uses them as an input for the final classification model. The classification model will then be trained on the provided labels to learn the features of positive posts, resulting in a trained model that classifies a given post (any text) with either True or False.

We first transform the necessary sets into feature vectors. Next, we train the model on the training set. Finally, the model is evaluated on the test set with the corresponding metrics.

Some functions that are needed in this section

In [49]:
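def get_features(data):
    # Column 1 of each row holds the list of word vectors.
    return [row[1] for row in data]


def get_labels(data):
    # Column 2 of each row holds the label.
    return [row[2] for row in data]


def get_featurelength(vectors):
    # Length of the longest feature vector.
    longest = 0
    for fv in vectors:
        if len(fv) > longest:
            longest = len(fv)
    return longest


def padding_vecs(vectors, length):
    # Right-pad every feature vector with zeros up to `length`.
    padded_vectors = []
    for fv in vectors:
        padded_vec = [0] * length
        padded_vec[:len(fv)] = fv
        padded_vectors.append(padded_vec)
    return padded_vectors


def get_features_and_labels(data):
    # Posts contain different numbers of words, so the features stay a plain
    # list; a ragged np.array() is rejected by recent NumPy versions.
    features = get_features(data)
    labels = np.array(get_labels(data))
    return features, labels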


Here we create feature sets, which simplify handling the Word2Vec model's features. Each row of the final training and test data follows this scheme:

[ [ sentences ], [ features for sentences ], [ label ] ]

In [50]:
train_data = [[sentence[0], [wv[word] for word in sentence[0] if word in wv], sentence[1]] for sentence in train_set]
test_data = [[sentence[0], [wv[word] for word in sentence[0] if word in wv], sentence[1]] for sentence in test_set]

To better handle the subdata, we split the big sets into the features and labels.

In [51]:
train_features, train_labels = get_features_and_labels(train_data)
test_features, test_labels = get_features_and_labels(test_data)



Here we prepare the sets for the SVM. Each post's word vectors are flattened into one long vector, and zeros are appended so that all vectors have the same length as the longest feature vector in both datasets.

In [52]:
# Flatten each post's word vectors into one long vector. Plain lists are used
# because the flattened vectors have different lengths.
flat_train_features = [np.array(sentence).flatten()
                       for sentence in train_features]
max_train = get_featurelength(flat_train_features)
flat_test_features = [np.array(sentence).flatten()
                      for sentence in test_features]
max_test = get_featurelength(flat_test_features)
# Longest feature vector across both sets (named max_len to avoid shadowing max()).
max_len = max_train if max_train > max_test else max_test

flat_train_features = padding_vecs(flat_train_features, max_len)
train_features = flat_train_features
flat_test_features = padding_vecs(flat_test_features, max_len)
test_features = flat_test_features
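The effect of the flattening and padding step is easier to see on a tiny example. The following sketch uses two made-up posts with two-dimensional word vectors. The helpers `flatten` and `pad_to` are hypothetical stand-ins for the numpy-based code above.

```python
def flatten(word_vectors):
    """Concatenate the word vectors of one post into a single flat list."""
    return [x for vec in word_vectors for x in vec]


def pad_to(vec, length):
    """Append zeros until the vector has the requested length."""
    return vec + [0.0] * (length - len(vec))


post_a = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # three words
post_b = [[0.7, 0.8]]                          # one word
flat = [flatten(p) for p in (post_a, post_b)]

# Find the longest flattened vector.
longest = 0
for f in flat:
    if len(f) > longest:
        longest = len(f)

padded = [pad_to(f, longest) for f in flat]
print(padded)
```

Both posts end up with six values, so they can be stacked into one feature matrix. Note that with this layout the position of a value depends on the position of the word in the post.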



#### Parameter Selection

We used a grid search method to find the best values for the model's parameters.

In [58]:
svm_model = sklearn.svm.SVC()

param_grid = {'C': [0.001, 0.005, 0.01, 0.02, 0.05],
              'gamma': [0.001, 0.01, 0.1, 1, 10],
              'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}

grid_search = GridSearchCV(svm_model, param_grid=param_grid)
grid_search.fit(train_features, train_labels)

best_model = grid_search.best_estimator_
print(best_model)

SVC(C=0.005, gamma=0.001, kernel='linear')
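To get a feeling for the cost of this search, the grid can be enumerated by hand. The sketch below only counts the candidates and does not train anything. It assumes the default of 5 cross-validation folds, since the call above does not set `cv`.

```python
from itertools import product

param_grid = {'C': [0.001, 0.005, 0.01, 0.02, 0.05],
              'gamma': [0.001, 0.01, 0.1, 1, 10],
              'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}

# Every combination of one value per parameter is one candidate model.
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
folds = 5  # assumption: default number of cross-validation folds

print(len(combos))          # 100 candidates
print(len(combos) * folds)  # 500 model fits
```

Since gamma has no effect on the linear kernel, the 25 linear candidates only differ in five distinct ways. A list of separate grids per kernel would avoid these redundant fits.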


These are the final parameters we used for our best model.

In [72]:
model_c = 0.01
model_gamma = 0.001
model_kernel = 'linear'

#### Model Creation

In this cell, the model gets created. We simply provide the training features and labels and run the model.

In [37]:
modelName = "../Models/SVM/cmpace_svm_{time}".format(time=time)

svm_model = sklearn.svm.SVC(C=model_c, gamma=model_gamma, kernel=model_kernel)
svm_model.fit(train_features, train_labels)

#### Results

A short evaluation of the final model's scores:

**Accuracy**: is the percentage of data points that are classified correctly\
**Precision**: is the percentage of data points that are classified as positive that are actually positive\
**Recall**: is the percentage of positive data points that are classified as positive\
**F1 Score**: is a measure of both precision and recall. It is calculated as the harmonic mean of precision and recall
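The four scores can be computed by hand from the counts of true and false positives and negatives. The following sketch does this for a made-up set of labels. The function `binary_metrics` and the example data are hypothetical, and returning 0.0 for an empty denominator is a choice made for this sketch.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for boolean labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


y_true = [True, True, True, False, False]
y_pred = [True, True, False, True, False]
print(binary_metrics(y_true, y_pred))
```

In this example, two of the three positive posts are found (recall 2/3), two of the three positive predictions are correct (precision 2/3), and three of the five predictions are correct overall (accuracy 0.6).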

In [74]:
predictions = svm_model.predict(test_features)


accuracy = accuracy_score(test_labels, predictions)
precision = precision_score(test_labels, predictions)
recall = recall_score(test_labels, predictions)
f1 = f1_score(test_labels, predictions)

result_string = f"""
Parameters and Results for SVM Model {modelName}
C: {model_c}
gamma: {model_gamma}
Kernel: {model_kernel}

Accuracy: {accuracy}
Precision: {precision}
Recall: {recall}
F1 score: {f1}
"""

print(result_string)
logging(result_string)



Parameters and Results for SVM Model ../Models/Word2Vec/cmpace_w2v_v2.1_PRIME.model
C: 0.01
gamma: 0.001
Kernel: linear

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 score: 1.0



In the end, we save our model to reuse it.

In [25]:
joblib.dump(svm_model, f'{modelName}.sav')

['../Models/SVM/cmpace_svm_2023-08-25_12:06:49.sav']

An example of how to use the classifier with an input sentence. The first example post should produce 'True' and the second (commented out) should produce 'False'.

In [76]:
posts = {
  "posts": {
    "true": {
      "post1": "I'm getting a `ConfigurationErrorsException` when I try to start my application. The error message says: `System.Configuration.ConfigurationErrorsException: The configuration file could not be loaded because of an error. The configuration file is located at `C:\\myApp\\app.config`. I've checked the file and it looks fine. I've also tried restarting my computer, but the problem persists. Any ideas what could be causing this?",
      "post2": "I'm trying to connect to a database, but I'm getting an error about a missing configuration setting. The error message says: `Could not find a connection string named 'myConnectionString'. The connection string is defined in the configuration file, but it's not being found. I've checked the spelling of the connection string and it's correct. I'm not sure why it's not being found. Any ideas what I can do to fix this?",
      "post3": "I'm trying to deploy my application to a new server, but I'm getting an error about a missing configuration file. The error message says: `Could not find the configuration file 'myApp.config'. I've made sure to copy the configuration file to the new server, but it's still not being found. I'm not sure why this is happening. Any ideas what I can do to fix this?",
      "post4": "I'm trying to change a configuration setting, but I'm not sure how to do it. The configuration setting I want to change is the `port` for the database connection. I've looked in the documentation, but I can't find any information on how to change this setting. I'm not sure if I need to modify the configuration file or if there's another way to do it. Any ideas how I can do this?",
      "post5": "I'm trying to debug a configuration error, but I'm not sure where to start. I've tried looking at the stack trace, but I can't make sense of it. I'm not sure what the different lines in the stack trace mean. Any ideas what I can do to debug this error?"
    },
    "false": {
      "post1": "I'm trying to get my application to work, but I'm not sure what I'm doing wrong. I've tried everything I can think of, but I'm still getting errors. I'm not sure if the problem is with my code or with the configuration. Can anyone help me figure out what's going on?",
      "post2": "I'm getting an error when I try to run my application. The error message says: `An unexpected error has occurred.` I'm not sure what this error means. I've tried looking in the documentation, but I can't find any information on it. Can anyone help me figure out what this error means?",
      "post3": "I'm trying to deploy my application to a new server, but I'm having some problems. I'm getting an error message that says: `The deployment failed.` I'm not sure what's causing this error. I've tried everything I can think of, but I can't seem to fix it. Can anyone help me figure out what's causing this error?",
      "post4": "I'm trying to debug my application, but I'm stuck. I've been following a tutorial, but I'm not sure what to do next. I'm not sure if I'm doing something wrong or if the tutorial is wrong. Can anyone help me figure out what I should do next?",
      "post5": "I'm having some problems with my application. I'm not sure what's causing them. I've tried everything I can think of, but I can't seem to fix them. Can anyone help me figure out what's causing these problems?"
    }
  }
}

In [81]:
post = "I'm getting a ConfigurationErrorsException when I try to open a form in design view in Visual Studio 2019. The error message says: System.Configuration.ConfigurationErrorsException: The configuration file could not be loaded because of an error. I've checked the configuration file and it looks fine. I've also tried restarting Visual Studio, but the problem persists. Any ideas what could be causing this? Thanks!"
# post = "Hi everyone, I'm working on a project and I'm trying to make my code more efficient. I've been reading some articles on the topic, but I'm still not sure what the best approach is. Can anyone give me some tips on how to make my code more efficient? Thanks!"

word2vec_model = gensim.models.Word2Vec.load(
    "../Models/Word2Vec/cmpace_w2v_v2.1_PRIME.model")
wv = word2vec_model.wv

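# Wrap the post in the same [tokens, label] layout used for the training rows.
sentences = [[post, None]]
sentences[0][0] = preprocess_string(sentences[0][0], CUSTOM_FILTERS)

# post_data follows the [tokens, word vectors, label] scheme
# (renamed from input to avoid shadowing the built-in).
post_data = [[sentence[0], [wv[word] for word in sentence[0] if word in wv], None] for sentence in sentences]
features, labels = get_features_and_labels(post_data)
features = np.array([np.array(sentence).flatten()
                         for sentence in features])
# 118600 must equal the padded feature length the loaded SVM was trained on.
features = padding_vecs(features, 118600)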


svm_model = joblib.load("../Models/SVM/cmpace_svm_v2.4_PRIME.sav")
print(svm_model)
label = svm_model.predict(features)
print(label[0])
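One thing to keep in mind when classifying new posts: the SVM expects exactly the feature length it was trained on. padding_vecs() only adds zeros, so a post whose flattened vector is longer than that length would come out too long, because the slice assignment grows the list. A helper like the following could guard against this. It is only a sketch: the name `pad_or_truncate` is hypothetical, and cutting off the surplus values is an assumption about how such posts should be handled.

```python
def pad_or_truncate(vec, length, fill=0.0):
    """Return a list of exactly `length` values: cut off the surplus or append `fill`."""
    vec = list(vec)[:length]
    return vec + [fill] * (length - len(vec))


short_vec = pad_or_truncate([0.1, 0.2], 4)
long_vec = pad_or_truncate([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], 4)
print(short_vec)
print(long_vec)
```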


{'true': {'post1': "I'm getting a `ConfigurationErrorsException` when I try to start my application. The error message says: `System.Configuration.ConfigurationErrorsException: The configuration file could not be loaded because of an error. The configuration file is located at `C:\\myApp\x07pp.config`. I've checked the file and it looks fine. I've also tried restarting my computer, but the problem persists. Any ideas what could be causing this?", 'post2': "I'm trying to connect to a database, but I'm getting an error about a missing configuration setting. The error message says: `Could not find a connection string named 'myConnectionString'. The connection string is defined in the configuration file, but it's not being found. I've checked the spelling of the connection string and it's correct. I'm not sure why it's not being found. Any ideas what I can do to fix this?", 'post3': "I'm trying to deploy my application to a new server, but I'm getting an error about a missing configuration