# Python Code for Generating Transaction Cost Measures from Contract Text

Python notebook for Measuring Transaction Costs in Public Sector Contracting through Machine Learning and Contract Text. The data is based on a weekly collection of documents related to public tenders in Denmark in 2021-2022. It covers tenders from local governments, regional authorities, and public utility companies, but not central government. The collection is part of the project Contract Skills and Performance: A national training experiment.

Python version 3.11.5 (64-bit)

This notebook describes the analytical processing of the data after the initial data cleaning and vectorization with word embeddings.

The notebook contains the following:

#### Product Description Language prediction

1.1 Default model experiment with cross validation

1.2 Support Vector Machine adjustment

1.3 Neural Network adjustment

#### Measurability Prediction

2.1 Default model experiment with cross validation

2.2 Support Vector Machine adjustment


#### Asset Specificity Prediction

3.1 Default model experiment with cross validation

3.2 Support Vector Machine adjustment

# 1 Product Description Language prediction

## 1.1 Default model experiment with cross validation

In [3]:
import json
import multiprocessing
import os
import pickle
import re

import lemmy
import nltk
import openpyxl
import pandas as pd
import PyPDF2
import slate3k as slate
import textract
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize

# Data loading

In [3]:
corpus_1 = pd.read_pickle("word_embedding_round_1.pkl")

In [5]:
corpus_2 = pd.read_pickle("word_embedding_manuel_round_2.pkl")

In [7]:
corpus_3 = pd.read_pickle("word_embedding_round_3.pkl")

In [9]:
corpus_4 = pd.read_pickle("word_embedding_round_4.pkl")

In [11]:
corpus_5 = pd.read_pickle("word_embedding_round_5.pkl")

In [13]:
corpus_6 = pd.read_pickle("word_embedding_round_6.pkl")

# Logistic regression

In [28]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
# cross_validate is used in every round below, so it is imported here
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.preprocessing import LabelEncoder

## Round 1

In [33]:
X_train, X_test, y_train, y_test = train_test_split(corpus_1['embedding'], corpus_1['product_description_binary'], test_size=0.3, random_state=42)

In [34]:
X = np.array(list(corpus_1['embedding']))
y = corpus_1['product_description_binary'] 

In [35]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)


In [36]:
np.random.seed(42)


In [37]:
logistic_regression = LogisticRegression(random_state=42)

In [38]:
results = cross_validate(logistic_regression, X, y_encoded, cv=5, scoring=['accuracy', 'recall'])


In [39]:
accuracy_scores = results['test_accuracy']
sensitivity_scores = results['test_recall']


In [40]:
print('Accuracy scores for each fold:', accuracy_scores)
print('Sensitivity (Recall) scores for each fold:', sensitivity_scores)

Accuracy scores for each fold: [0.87323944 0.91750503 0.92354125 0.90524194 0.89919355]
Sensitivity (Recall) scores for each fold: [0.         0.34920635 0.3968254  0.37096774 0.25396825]


In [41]:
mean_accuracy = np.mean(results['test_accuracy'])
mean_sensitivity = np.nanmean(results['test_recall'])

In [42]:
print(f'Mean accuracy: {mean_accuracy}')
print(f'Mean sensitivity: {mean_sensitivity}')

Mean accuracy: 0.9037442396313364
Mean sensitivity: 0.2741935483870968


## Round 2

In [33]:
X = np.array(list(corpus_2['embedding']))
y = corpus_2['product_description_binary'] 

In [34]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)


In [35]:
np.random.seed(42)


In [36]:
logistic_regression = LogisticRegression(random_state=42)

In [37]:
results = cross_validate(logistic_regression, X, y_encoded, cv=5, scoring=['accuracy', 'recall'])


In [38]:
accuracy_scores = results['test_accuracy']
sensitivity_scores = results['test_recall']


In [39]:
print('Accuracy scores for each fold:', accuracy_scores)
print('Sensitivity (Recall) scores for each fold:', sensitivity_scores)

Accuracy scores for each fold: [0.86818182 0.89242424 0.89848485 0.88030303 0.9       ]
Sensitivity (Recall) scores for each fold: [0.07446809 0.24468085 0.35106383 0.25531915 0.36842105]


In [40]:
mean_accuracy = np.mean(results['test_accuracy'])
mean_sensitivity = np.nanmean(results['test_recall'])

In [41]:
print(f'Mean accuracy: {mean_accuracy}')
print(f'Mean sensitivity: {mean_sensitivity}')


Mean accuracy: 0.8878787878787879
Mean sensitivity: 0.2587905935050392


## Round 3

In [42]:
X = np.array(list(corpus_3['embedding']))
y = corpus_3['product_description_binary'] 

In [43]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)


In [44]:
np.random.seed(42)


In [45]:
logistic_regression = LogisticRegression(random_state=42)

In [46]:
results = cross_validate(logistic_regression, X, y_encoded, cv=5, scoring=['accuracy', 'recall'])


In [47]:
accuracy_scores = results['test_accuracy']
sensitivity_scores = results['test_recall']


In [49]:
print('Accuracy scores for each fold:', accuracy_scores)
print('Sensitivity (Recall) scores for each fold:', sensitivity_scores)


Accuracy scores for each fold: [0.88605898 0.8847185  0.86863271 0.84718499 0.88069705]
Sensitivity (Recall) scores for each fold: [0.3515625  0.328125   0.3984375  0.27906977 0.65891473]


In [50]:
mean_accuracy = np.mean(results['test_accuracy'])
mean_sensitivity = np.nanmean(results['test_recall'])

In [51]:
print(f'Mean accuracy: {mean_accuracy}')
print(f'Mean sensitivity: {mean_sensitivity}')


Mean accuracy: 0.8734584450402145
Mean sensitivity: 0.40322189922480617


## Round 4

In [33]:
X = np.array(list(corpus_4['embedding']))
y = corpus_4['product_description_binary'] 

In [34]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)


In [35]:
np.random.seed(42)


In [36]:
logistic_regression = LogisticRegression(random_state=42)

In [37]:
results = cross_validate(logistic_regression, X, y_encoded, cv=5, scoring=['accuracy', 'recall'])


In [38]:
accuracy_scores = results['test_accuracy']
sensitivity_scores = results['test_recall']


In [39]:
print('Accuracy scores for each fold:', accuracy_scores)
print('Sensitivity (Recall) scores for each fold:', sensitivity_scores)

Accuracy scores for each fold: [0.88157895 0.88636364 0.80838323 0.90538922 0.8       ]
Sensitivity (Recall) scores for each fold: [0.44444444 0.44444444 0.24117647 0.70175439 0.19298246]


In [40]:
mean_accuracy = np.mean(results['test_accuracy'])
mean_sensitivity = np.nanmean(results['test_recall'])

In [41]:
print(f'Mean accuracy: {mean_accuracy}')
print(f'Mean sensitivity: {mean_sensitivity}')


Mean accuracy: 0.8563430077643754
Mean sensitivity: 0.4049604403164775


## Round 5

In [42]:
X = np.array(list(corpus_5['embedding']))
y = corpus_5['product_description_binary'] 

In [43]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)


In [44]:
np.random.seed(42)


In [45]:
logistic_regression = LogisticRegression(random_state=42)

In [46]:
results = cross_validate(logistic_regression, X, y_encoded, cv=5, scoring=['accuracy', 'recall'])


In [47]:
accuracy_scores = results['test_accuracy']
sensitivity_scores = results['test_recall']


In [48]:
print('Accuracy scores for each fold:', accuracy_scores)
print('Sensitivity (Recall) scores for each fold:', sensitivity_scores)


Accuracy scores for each fold: [0.87569061 0.86175115 0.88663594 0.83502304 0.8921659 ]
Sensitivity (Recall) scores for each fold: [0.35576923 0.32850242 0.51923077 0.31730769 0.55288462]


In [49]:
mean_accuracy = np.mean(results['test_accuracy'])
mean_sensitivity = np.nanmean(results['test_recall'])

In [50]:
print(f'Mean accuracy: {mean_accuracy}')
print(f'Mean sensitivity: {mean_sensitivity}')

Mean accuracy: 0.8702533289202332
Mean sensitivity: 0.4147389446302491


## Round 6

In [51]:
X = np.array(list(corpus_6['embedding']))
y = corpus_6['product_description_binary'] 

In [52]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)


In [53]:
np.random.seed(42)


In [54]:
logistic_regression = LogisticRegression(random_state=42)

In [55]:
results = cross_validate(logistic_regression, X, y_encoded, cv=5, scoring=['accuracy', 'recall'])


In [56]:
accuracy_scores = results['test_accuracy']
sensitivity_scores = results['test_recall']


In [57]:
print('Accuracy scores for each fold:', accuracy_scores)
print('Sensitivity (Recall) scores for each fold:', sensitivity_scores)

Accuracy scores for each fold: [0.88225399 0.87216148 0.79226241 0.87888982 0.91245791]
Sensitivity (Recall) scores for each fold: [0.46743295 0.47509579 0.25670498 0.54789272 0.82692308]


In [58]:
mean_accuracy = np.mean(results['test_accuracy'])
mean_sensitivity = np.nanmean(results['test_recall'])

In [59]:
print(f'Mean accuracy: {mean_accuracy}')
print(f'Mean sensitivity: {mean_sensitivity}')

Mean accuracy: 0.8676051232821628
Mean sensitivity: 0.5148099027409372
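
The six logistic regression rounds above repeat the same steps on different corpora. A minimal sketch of how they could be folded into one helper, assuming each corpus carries the `embedding` and `product_description_binary` columns used above. The name `evaluate_round` and the synthetic `demo_corpus` are hypothetical and only stand in for the pickled corpora.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import LabelEncoder


def evaluate_round(corpus, estimator, cv=5):
    """Return mean cross-validated accuracy and recall for one corpus."""
    # Stack the per-row embedding vectors into a 2-D feature matrix
    X = np.array(corpus['embedding'].tolist())
    # LabelEncoder sorts classes, so 'No' -> 0 and 'Yes' -> 1
    y = LabelEncoder().fit_transform(corpus['product_description_binary'])
    results = cross_validate(estimator, X, y, cv=cv, scoring=['accuracy', 'recall'])
    return np.mean(results['test_accuracy']), np.mean(results['test_recall'])


# Synthetic stand-in for a corpus (hypothetical data, not the tender documents)
rng = np.random.default_rng(42)
labels = np.array(['No'] * 80 + ['Yes'] * 20)
vectors = rng.normal(size=(100, 8)) + (labels == 'Yes')[:, None] * 2.0
demo_corpus = pd.DataFrame({
    'embedding': list(vectors),
    'product_description_binary': labels,
})

demo_accuracy, demo_recall = evaluate_round(demo_corpus, LogisticRegression(random_state=42))
print(f'Mean accuracy: {demo_accuracy}')
print(f'Mean sensitivity: {demo_recall}')
```

With the real data, the same helper could be called once per corpus, for example in a loop over `corpus_1` to `corpus_6`.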


# SVM

## Round 1

In [60]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
import numpy as np

In [61]:
np.random.seed(42)

svm = SVC(random_state=42) 

In [62]:
X = np.array(corpus_1['embedding'].tolist())
y = np.array(corpus_1['product_description_binary'])


In [63]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)


In [64]:
n_folds = 5


In [65]:
accuracy_scores = cross_val_score(svm, X, y_encoded, cv=n_folds, scoring='accuracy')

In [66]:
recall_scores = cross_val_score(svm, X, y_encoded, cv=n_folds, scoring='recall')  

In [67]:
print("Accuracy scores for each fold:", accuracy_scores)
print("Recall (Sensitivity) scores for each fold:", recall_scores)

Accuracy scores for each fold: [0.89537223 0.9195171  0.94969819 0.89314516 0.90524194]
Recall (Sensitivity) scores for each fold: [0.17460317 0.61904762 0.6984127  0.80645161 0.52380952]


In [68]:
print("Cross-validated Accuracy:", np.mean(accuracy_scores))
print("Cross-validated Recall (Sensitivity):", np.nanmean(recall_scores))

Cross-validated Accuracy: 0.9125949243850198
Cross-validated Recall (Sensitivity): 0.5644649257552483


## Round 2

In [33]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
import numpy as np


In [34]:
np.random.seed(42)

svm = SVC(random_state=42) 

In [35]:
X = np.array(corpus_2['embedding'].tolist())
y = np.array(corpus_2['product_description_binary'])


In [36]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)


In [37]:
n_folds = 5


In [38]:
accuracy_scores = cross_val_score(svm, X, y_encoded, cv=n_folds, scoring='accuracy')

In [39]:
recall_scores = cross_val_score(svm, X, y_encoded, cv=n_folds, scoring='recall')  

In [40]:
print("Accuracy scores for each fold:", accuracy_scores)
print("Recall (Sensitivity) scores for each fold:", recall_scores)

Accuracy scores for each fold: [0.89393939 0.96212121 0.9        0.89090909 0.86363636]
Recall (Sensitivity) scores for each fold: [0.26595745 0.76595745 0.62765957 0.46808511 0.69473684]


In [41]:
print("Cross-validated Accuracy:", np.mean(accuracy_scores))
print("Cross-validated Recall (Sensitivity):", np.nanmean(recall_scores))

Cross-validated Accuracy: 0.902121212121212
Cross-validated Recall (Sensitivity): 0.5644792833146697


## Round 3

In [42]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
import numpy as np


In [43]:
np.random.seed(42)

svm = SVC(random_state=42) 

In [44]:
X = np.array(corpus_3['embedding'].tolist())
y = np.array(corpus_3['product_description_binary'])


In [45]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)


In [46]:
n_folds = 5


In [47]:
accuracy_scores = cross_val_score(svm, X, y_encoded, cv=n_folds, scoring='accuracy')

In [48]:
recall_scores = cross_val_score(svm, X, y_encoded, cv=n_folds, scoring='recall')  

In [49]:
print("Accuracy scores for each fold:", accuracy_scores)
print("Recall (Sensitivity) scores for each fold:", recall_scores)

Accuracy scores for each fold: [0.91957105 0.93029491 0.85790885 0.84986595 0.80294906]
Recall (Sensitivity) scores for each fold: [0.5546875  0.640625   0.5078125  0.39534884 0.79069767]


In [50]:
print("Cross-validated Accuracy:", np.mean(accuracy_scores))
print("Cross-validated Recall (Sensitivity):", np.nanmean(recall_scores))


Cross-validated Accuracy: 0.8721179624664879
Cross-validated Recall (Sensitivity): 0.5778343023255814


## Round 4

In [51]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
import numpy as np


In [52]:
np.random.seed(42)

svm = SVC(random_state=42) 

In [53]:
X = np.array(corpus_4['embedding'].tolist())
y = np.array(corpus_4['product_description_binary'])


In [54]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)


In [55]:
n_folds = 5


In [56]:
accuracy_scores = cross_val_score(svm, X, y_encoded, cv=n_folds, scoring='accuracy')

In [57]:
recall_scores = cross_val_score(svm, X, y_encoded, cv=n_folds, scoring='recall')  

In [58]:
print("Accuracy scores for each fold:", accuracy_scores)
print("Recall (Sensitivity) scores for each fold:", recall_scores)

Accuracy scores for each fold: [0.91746411 0.90550239 0.81916168 0.87784431 0.85389222]
Recall (Sensitivity) scores for each fold: [0.65497076 0.58479532 0.4        0.83040936 0.60818713]


In [59]:
print("Cross-validated Accuracy:", np.mean(accuracy_scores))
print("Cross-validated Recall (Sensitivity):", np.nanmean(recall_scores))

Cross-validated Accuracy: 0.8747729421539695
Cross-validated Recall (Sensitivity): 0.615672514619883


## Round 5

In [60]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
import numpy as np

In [61]:
np.random.seed(42)

svm = SVC(random_state=42) 

In [62]:
X = np.array(corpus_5['embedding'].tolist())
y = np.array(corpus_5['product_description_binary'])


In [63]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)


In [64]:
n_folds = 5


In [65]:
accuracy_scores = cross_val_score(svm, X, y_encoded, cv=n_folds, scoring='accuracy')

In [66]:
recall_scores = cross_val_score(svm, X, y_encoded, cv=n_folds, scoring='recall')  

In [67]:
print("Accuracy scores for each fold:", accuracy_scores)
print("Recall (Sensitivity) scores for each fold:", recall_scores)

Accuracy scores for each fold: [0.9198895  0.88571429 0.89585253 0.85345622 0.91059908]
Recall (Sensitivity) scores for each fold: [0.61057692 0.46859903 0.74038462 0.48076923 0.67788462]


In [68]:
print("Cross-validated Accuracy:", np.mean(accuracy_scores))
print("Cross-validated Recall (Sensitivity):", np.nanmean(recall_scores))


Cross-validated Accuracy: 0.8931023245156199
Cross-validated Recall (Sensitivity): 0.595642883686362


## Round 6

In [33]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
import numpy as np

In [34]:
np.random.seed(42)

svm = SVC(random_state=42) 

In [35]:
X = np.array(corpus_6['embedding'].tolist())
y = np.array(corpus_6['product_description_binary'])


In [36]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)


In [37]:
n_folds = 5


In [38]:
accuracy_scores = cross_val_score(svm, X, y_encoded, cv=n_folds, scoring='accuracy')

In [39]:
recall_scores = cross_val_score(svm, X, y_encoded, cv=n_folds, scoring='recall')  

In [40]:
print("Accuracy scores for each fold:", accuracy_scores)
print("Recall (Sensitivity) scores for each fold:", recall_scores)

Accuracy scores for each fold: [0.90664424 0.8881413  0.78973928 0.88982338 0.8973064 ]
Recall (Sensitivity) scores for each fold: [0.58237548 0.5862069  0.32183908 0.62452107 0.86538462]


In [41]:
print("Cross-validated Accuracy:", np.mean(accuracy_scores))
print("Cross-validated Recall (Sensitivity):", np.nanmean(recall_scores))

Cross-validated Accuracy: 0.8743309178128355
Cross-validated Recall (Sensitivity): 0.5960654288240496
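
Each SVM round calls `cross_val_score` twice, once per metric, so the model is fitted on every fold twice. A sketch of how both metrics could be obtained from a single `cross_validate` call instead. The arrays `X_demo` and `y_demo` are synthetic placeholders for `X` and `y_encoded`. Because an integer `cv` yields deterministic stratified folds for classifiers, the scores should match those from two separate calls.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for X and y_encoded (hypothetical data)
rng = np.random.default_rng(42)
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = rng.normal(size=(100, 8)) + y_demo[:, None] * 2.0

svm_demo = SVC(random_state=42)

# One call fits each fold once and scores both metrics on the same fit
svm_results = cross_validate(svm_demo, X_demo, y_demo, cv=5, scoring=['accuracy', 'recall'])

print('Accuracy scores for each fold:', svm_results['test_accuracy'])
print('Recall (Sensitivity) scores for each fold:', svm_results['test_recall'])
```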


# Random Forest

In [50]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, accuracy_score, recall_score
import numpy as np

## Round 1

In [361]:
X = np.array(corpus_1['embedding'].tolist())
y = np.array(corpus_1['product_description_binary'])


In [362]:
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)


In [363]:
scoring_metrics = {'accuracy': 'accuracy', 'recall': make_scorer(recall_score, pos_label='Yes')}  # Specify the positive class label appropriately

In [372]:
cv_results = cross_validate(rf_classifier, X, y, cv=5, scoring=scoring_metrics)


In [367]:
mean_accuracy = np.mean(cv_results['test_accuracy'])
mean_sensitivity = np.nanmean(cv_results['test_recall'])


In [368]:
print("Cross-validated Accuracy:", mean_accuracy)
print("Cross-validated Sensitivity (Recall):", mean_sensitivity)


Cross-validated Accuracy: 0.910988511715454
Cross-validated Sensitivity (Recall): 0.271505376344086
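
The random forest rounds score recall on the raw `Yes`/`No` strings through `pos_label='Yes'`, whereas the logistic regression and SVM rounds encode the labels first. A small sketch, on made-up labels, of why the two routes should measure the same quantity; `y_true` and `y_pred` here are illustrative only.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration (not taken from the corpora)
y_true = np.array(['Yes', 'No', 'Yes', 'No', 'Yes'])
y_pred = np.array(['Yes', 'No', 'No', 'No', 'Yes'])

# String labels: the positive class has to be named explicitly
recall_from_strings = recall_score(y_true, y_pred, pos_label='Yes')

# Encoded labels: LabelEncoder sorts classes, so 'No' -> 0 and 'Yes' -> 1,
# and recall_score's default pos_label=1 then refers to 'Yes'
encoder = LabelEncoder().fit(y_true)
recall_from_encoded = recall_score(encoder.transform(y_true), encoder.transform(y_pred))

print(recall_from_strings, recall_from_encoded)
```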


## Round 2

In [369]:
X = np.array(corpus_2['embedding'].tolist())
y = np.array(corpus_2['product_description_binary'])


In [370]:
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)


In [371]:
scoring_metrics = {'accuracy': 'accuracy', 'recall': make_scorer(recall_score, pos_label='Yes')}

In [372]:
cv_results = cross_validate(rf_classifier, X, y, cv=5, scoring=scoring_metrics)


In [373]:
mean_accuracy = np.mean(cv_results['test_accuracy'])
mean_sensitivity = np.nanmean(cv_results['test_recall'])


In [374]:
print("Cross-validated Accuracy:", mean_accuracy)
print("Cross-validated Sensitivity (Recall):", mean_sensitivity)


Cross-validated Accuracy: 0.8809090909090909
Cross-validated Sensitivity (Recall): 0.20994400895856663


## Round 3

In [375]:
X = np.array(corpus_3['embedding'].tolist()) 
y = np.array(corpus_3['product_description_binary'])


In [376]:
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)


In [377]:
scoring_metrics = {'accuracy': 'accuracy', 'recall': make_scorer(recall_score, pos_label='Yes')} 

In [378]:
cv_results = cross_validate(rf_classifier, X, y, cv=5, scoring=scoring_metrics)


In [379]:
mean_accuracy = np.mean(cv_results['test_accuracy'])
mean_sensitivity = np.nanmean(cv_results['test_recall'])


In [380]:
print("Cross-validated Accuracy:", mean_accuracy)
print("Cross-validated Sensitivity (Recall):", mean_sensitivity)


Cross-validated Accuracy: 0.8710455764075068
Cross-validated Sensitivity (Recall): 0.3550750968992248


## Round 4

In [381]:
X = np.array(corpus_4['embedding'].tolist())  
y = np.array(corpus_4['product_description_binary'])


In [382]:
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)


In [383]:
scoring_metrics = {'accuracy': 'accuracy', 'recall': make_scorer(recall_score, pos_label='Yes')} 

In [378]:
cv_results = cross_validate(rf_classifier, X, y, cv=5, scoring=scoring_metrics)


In [385]:
mean_accuracy = np.mean(cv_results['test_accuracy'])
mean_sensitivity = np.nanmean(cv_results['test_recall'])


In [386]:
print("Cross-validated Accuracy:", mean_accuracy)
print("Cross-validated Sensitivity (Recall):", mean_sensitivity)


Cross-validated Accuracy: 0.8422233045869983
Cross-validated Sensitivity (Recall): 0.30214424951267055


## Round 5

In [389]:
X = np.array(corpus_5['embedding'].tolist())  
y = np.array(corpus_5['product_description_binary'])


In [390]:
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)


In [391]:
scoring_metrics = {'accuracy': 'accuracy', 'recall': make_scorer(recall_score, pos_label='Yes')}  # Specify the positive class label appropriately

In [378]:
cv_results = cross_validate(rf_classifier, X, y, cv=5, scoring=scoring_metrics)


In [393]:
mean_accuracy = np.mean(cv_results['test_accuracy'])
mean_sensitivity = np.nanmean(cv_results['test_recall'])


In [394]:
print("Cross-validated Accuracy:", mean_accuracy)
print("Cross-validated Sensitivity (Recall):", mean_sensitivity)


Cross-validated Accuracy: 0.8525615500165491
Cross-validated Sensitivity (Recall): 0.2708333333333333


## Round 6

In [51]:
X = np.array(corpus_6['embedding'].tolist())
y = np.array(corpus_6['product_description_binary'])


In [52]:
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)


In [53]:
scoring_metrics = {'accuracy': 'accuracy', 'recall': make_scorer(recall_score, pos_label='Yes')}

In [378]:
cv_results = cross_validate(rf_classifier, X, y, cv=5, scoring=scoring_metrics)


In [None]:
mean_accuracy = np.mean(cv_results['test_accuracy'])
mean_sensitivity = np.nanmean(cv_results['test_recall'])


In [400]:
print("Cross-validated Accuracy:", mean_accuracy)
print("Cross-validated Sensitivity (Recall):", mean_sensitivity)


Cross-validated Accuracy: 0.8536425369478355
Cross-validated Sensitivity (Recall): 0.47892720306513414


# Neural Network

In [33]:
import random

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, recall_score
from sklearn.model_selection import KFold, train_test_split





## Round 1

In [35]:
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

In [36]:
embeddings = pd.DataFrame(corpus_1['embedding'].tolist())
class_mapping = {'No': 0, 'Yes': 1}
corpus_1['class_encoded'] = corpus_1['product_description_binary'].map(class_mapping)


In [37]:
X = np.array(embeddings)
y = np.array(corpus_1['class_encoded'])


In [38]:
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)


In [39]:
accuracies = []
sensitivities = []


In [40]:
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Model definition
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(1536,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

    # Compile and train the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)

    # Prediction and metrics calculation
    y_pred_probs = model.predict(X_test, verbose=0)
    y_pred = (y_pred_probs > 0.5).astype(int)

    accuracies.append(accuracy_score(y_test, y_pred))
    sensitivities.append(recall_score(y_test, y_pred))







In [41]:
mean_accuracy = np.mean(accuracies)
mean_sensitivity = np.mean(sensitivities)

In [42]:
print(f"Average Cross-Validated Accuracy: {mean_accuracy}")
print(f"Average Cross-Validated Sensitivity (Recall): {mean_sensitivity}")

Average Cross-Validated Accuracy: 0.9488544168235219
Average Cross-Validated Sensitivity (Recall): 0.8002041015400634


## Round 2

In [32]:
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

In [33]:
embeddings = pd.DataFrame(corpus_2['embedding'].tolist())
class_mapping = {'No': 0, 'Yes': 1}
corpus_2['class_encoded'] = corpus_2['product_description_binary'].map(class_mapping)


In [34]:
X = np.array(embeddings)
y = np.array(corpus_2['class_encoded'])


In [35]:
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)


In [36]:
accuracies = []
sensitivities = []


In [37]:
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Model definition
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(1536,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

    # Compile and train the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)

    # Prediction and metrics calculation
    y_pred_probs = model.predict(X_test, verbose=0)
    y_pred = (y_pred_probs > 0.5).astype(int)

    accuracies.append(accuracy_score(y_test, y_pred))
    sensitivities.append(recall_score(y_test, y_pred))







In [38]:
mean_accuracy = np.mean(accuracies)
mean_sensitivity = np.mean(sensitivities)

In [39]:
print(f"Average Cross-Validated Accuracy: {mean_accuracy}")
print(f"Average Cross-Validated Sensitivity (Recall): {mean_sensitivity}")

Average Cross-Validated Accuracy: 0.9393939393939394
Average Cross-Validated Sensitivity (Recall): 0.7626061539115941


## Round 3

In [40]:
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

In [41]:
embeddings = pd.DataFrame(corpus_3['embedding'].tolist())
class_mapping = {'No': 0, 'Yes': 1}
corpus_3['class_encoded'] = corpus_3['product_description_binary'].map(class_mapping)


In [42]:
X = np.array(embeddings)
y = np.array(corpus_3['class_encoded'])


In [43]:
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)


In [44]:
accuracies = []
sensitivities = []


In [45]:
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Model definition
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(1536,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

    # Compile and train the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)

    # Prediction and metrics calculation
    y_pred_probs = model.predict(X_test, verbose=0)
    y_pred = (y_pred_probs > 0.5).astype(int)

    accuracies.append(accuracy_score(y_test, y_pred))
    sensitivities.append(recall_score(y_test, y_pred))

In [46]:
mean_accuracy = np.mean(accuracies)
mean_sensitivity = np.mean(sensitivities)

In [47]:
print(f"Average Cross-Validated Accuracy: {mean_accuracy}")
print(f"Average Cross-Validated Sensitivity (Recall): {mean_sensitivity}")

Average Cross-Validated Accuracy: 0.927882037533512
Average Cross-Validated Sensitivity (Recall): 0.763558424383175


## Round 4

In [48]:
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

In [49]:
embeddings = pd.DataFrame(corpus_4['embedding'].tolist())
class_mapping = {'No': 0, 'Yes': 1}
corpus_4['class_encoded'] = corpus_4['product_description_binary'].map(class_mapping)


In [50]:
X = np.array(embeddings)
y = np.array(corpus_4['class_encoded'])


In [51]:
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)


In [52]:
accuracies = []
sensitivities = []


In [53]:
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Model definition
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(1536,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

    # Compile and train the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)

    # Prediction and metrics calculation
    y_pred_probs = model.predict(X_test, verbose=0)
    y_pred = (y_pred_probs > 0.5).astype(int)

    accuracies.append(accuracy_score(y_test, y_pred))
    sensitivities.append(recall_score(y_test, y_pred))

In [54]:
mean_accuracy = np.mean(accuracies)
mean_sensitivity = np.mean(sensitivities)

In [55]:
print(f"Average Cross-Validated Accuracy: {mean_accuracy}")
print(f"Average Cross-Validated Sensitivity (Recall): {mean_sensitivity}")

Average Cross-Validated Accuracy: 0.9260215454258945
Average Cross-Validated Sensitivity (Recall): 0.8233155200776473


## Round 5

In [56]:
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

In [57]:
embeddings = pd.DataFrame(corpus_5['embedding'].tolist())
class_mapping = {'No': 0, 'Yes': 1}
corpus_5['class_encoded'] = corpus_5['product_description_binary'].map(class_mapping)


In [58]:
X = np.array(embeddings)
y = np.array(corpus_5['class_encoded'])


In [59]:
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)


In [60]:
accuracies = []
sensitivities = []


In [61]:
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Model definition
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(1536,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

    # Compile and train the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)

    # Prediction and metrics calculation
    y_pred_probs = model.predict(X_test, verbose=0)
    y_pred = (y_pred_probs > 0.5).astype(int)

    accuracies.append(accuracy_score(y_test, y_pred))
    sensitivities.append(recall_score(y_test, y_pred))

In [62]:
mean_accuracy = np.mean(accuracies)
mean_sensitivity = np.mean(sensitivities)

In [63]:
print(f"Average Cross-Validated Accuracy: {mean_accuracy}")
print(f"Average Cross-Validated Sensitivity (Recall): {mean_sensitivity}")

Average Cross-Validated Accuracy: 0.9198256825453404
Average Cross-Validated Sensitivity (Recall): 0.7865540285146103


## Round 6

In [36]:
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

In [37]:
embeddings = pd.DataFrame(corpus_6['embedding'].tolist())
class_mapping = {'No': 0, 'Yes': 1}
corpus_6['class_encoded'] = corpus_6['product_description_binary'].map(class_mapping)


In [38]:
X = np.array(embeddings)
y = np.array(corpus_6['class_encoded'])


In [39]:
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)


In [40]:
accuracies = []
sensitivities = []


In [41]:
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Model definition
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(1536,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

    # Compile and train the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)

    # Prediction and metrics calculation
    y_pred_probs = model.predict(X_test, verbose=0)
    y_pred = (y_pred_probs > 0.5).astype(int)

    accuracies.append(accuracy_score(y_test, y_pred))
    sensitivities.append(recall_score(y_test, y_pred))







In [42]:
mean_accuracy = np.mean(accuracies)
mean_sensitivity = np.mean(sensitivities)

In [43]:
print(f"Average Cross-Validated Accuracy: {mean_accuracy}")
print(f"Average Cross-Validated Sensitivity (Recall): {mean_sensitivity}")

Average Cross-Validated Accuracy: 0.9054504959887634
Average Cross-Validated Sensitivity (Recall): 0.8352883414216391
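
The neural network rounds split the data with `KFold(shuffle=True)`, whereas `cross_validate` with an integer `cv` uses unshuffled stratified folds for classifiers. The two sets of results are therefore not computed on the same kind of split. A sketch, on synthetic labels, of how `StratifiedKFold` distributes a minority class across test folds compared with `KFold`. This is a possible variation, not what the notebook ran, and `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic imbalanced labels (hypothetical: 10% positives)
y_demo = np.array([1] * 10 + [0] * 90)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant to the split itself

kf = KFold(n_splits=5, shuffle=True, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Count positives landing in each test fold under both splitters
kfold_positives = [int(y_demo[test].sum()) for _, test in kf.split(X_demo)]
stratified_positives = [int(y_demo[test].sum()) for _, test in skf.split(X_demo, y_demo)]

print('KFold positives per test fold:', kfold_positives)
print('StratifiedKFold positives per test fold:', stratified_positives)
```

In the training loop, the change would amount to iterating over `skf.split(X, y)` instead of `kf.split(X)`.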


## 1.2 Support Vector Machine adjustment

## Round 1

In [82]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
from sklearn.preprocessing import LabelEncoder  # used by the label-encoding cells below
import numpy as np

svm = SVC()

In [83]:
X = np.array(corpus_1['embedding'].tolist())
y = np.array(corpus_1['product_description_binary'])


In [84]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)


In [85]:
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=42)

In [86]:
svm.fit(list(X_train), y_train)



In [87]:
y_pred = svm.predict(list(X_test))


In [88]:
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

Accuracy: 0.9610738255033557


In [89]:
sensitivity = recall_score(y_test, y_pred)
print('Sensitivity:', sensitivity)

Sensitivity: 0.6875
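
Accuracy and sensitivity can diverge, as they do in the output above, when the positive class is comparatively rare. The sketch below uses made-up labels (not the notebook's data) to show how both numbers come from the same predictions. `accuracy_and_recall` is a hypothetical helper that mirrors what `accuracy_score` and `recall_score` compute for binary labels.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Hypothetical test set: 10 positives and 90 negatives.
y_true = [1] * 10 + [0] * 90
# The classifier finds 6 of the 10 positives and makes no false alarms.
y_pred = [1] * 6 + [0] * 4 + [0] * 90

acc, rec = accuracy_and_recall(y_true, y_pred)
print(acc, rec)
```

Here 96 of 100 predictions are correct, yet only 6 of the 10 positives are recovered, which is why both metrics are reported for every round.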


## Round 2

In [90]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
import numpy as np

svm = SVC()

In [91]:
X = np.array(corpus_2['embedding'].tolist())
y = np.array(corpus_2['product_description_binary'])


In [92]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)


In [93]:
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=42)

In [94]:
svm.fit(list(X_train), y_train)



In [95]:
y_pred = svm.predict(list(X_test))


In [96]:
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

Accuracy: 0.9333333333333333


In [97]:
sensitivity = recall_score(y_test, y_pred)
print('Sensitivity:', sensitivity)

Sensitivity: 0.6305732484076433


## Round 3

In [98]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
import numpy as np

svm = SVC()

In [99]:
X = np.array(corpus_3['embedding'].tolist())
y = np.array(corpus_3['product_description_binary'])


In [100]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)


In [101]:
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=42)

In [102]:
svm.fit(list(X_train), y_train)



In [103]:
y_pred = svm.predict(list(X_test))


In [104]:
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

Accuracy: 0.9517426273458445


In [105]:
sensitivity = recall_score(y_test, y_pred)
print('Sensitivity:', sensitivity)

Sensitivity: 0.7956989247311828


## Round 4

In [106]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
import numpy as np

svm = SVC()

In [107]:
X = np.array(corpus_4['embedding'].tolist())
y = np.array(corpus_4['product_description_binary'])


In [108]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)


In [109]:
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=42)

In [110]:
svm.fit(list(X_train), y_train)



In [111]:
y_pred = svm.predict(list(X_test))


In [112]:
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

Accuracy: 0.956140350877193


In [113]:
sensitivity = recall_score(y_test, y_pred)
print('Sensitivity:', sensitivity)

Sensitivity: 0.8396946564885496


## Round 5

In [114]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
import numpy as np

svm = SVC()

In [115]:
X = np.array(corpus_5['embedding'].tolist())
y = np.array(corpus_5['product_description_binary'])


In [116]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)


In [117]:
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=42)

In [118]:
svm.fit(list(X_train), y_train)



In [119]:
y_pred = svm.predict(list(X_test))


In [120]:
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

Accuracy: 0.9686732186732187


In [121]:
sensitivity = recall_score(y_test, y_pred)
print('Sensitivity:', sensitivity)

Sensitivity: 0.8702290076335878


## Round 6

In [33]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
import numpy as np

svm = SVC()

In [34]:
X = np.array(corpus_6['embedding'].tolist())
y = np.array(corpus_6['product_description_binary'])


In [35]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)


In [36]:
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=42)

In [37]:
svm.fit(list(X_train), y_train)



In [38]:
y_pred = svm.predict(list(X_test))


In [39]:
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

Accuracy: 0.9680493273542601


In [40]:
sensitivity = recall_score(y_test, y_pred)
print('Sensitivity:', sensitivity)

Sensitivity: 0.8905325443786982


## 1.3 Neural Network adjustment

In [130]:
import random

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, recall_score
from sklearn.model_selection import KFold, train_test_split




## Round 1

In [134]:
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

In [135]:
embeddings = pd.DataFrame(corpus_1['embedding'].tolist())
class_mapping = {'No': 0, 'Yes': 1}
corpus_1['class_encoded'] = corpus_1['product_description_binary'].map(class_mapping)


In [136]:
X = np.array(embeddings)
y = np.array(corpus_1['class_encoded'])


In [137]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [138]:
embedding_size = 1536
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(embedding_size,)), 
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')  
])




In [139]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])





In [140]:
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)



In [141]:
loss, accuracy = model.evaluate(X_test, y_test, verbose=0) 
print(f"Test accuracy: {accuracy}")


Test accuracy: 0.9449664354324341


In [142]:
y_pred_probs = model.predict(X_test, verbose=0)
y_pred = (y_pred_probs > 0.5).astype(int)

In [143]:
sensitivity = recall_score(y_test, y_pred)
print(f"Sensitivity (Recall): {sensitivity}")

Sensitivity (Recall): 0.825
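
The cells above turn the sigmoid outputs into labels with a fixed 0.5 cutoff. As a minimal sketch with invented probabilities (not model output from this notebook), the following shows how sensitivity depends on that cutoff. `recall_at_threshold` is a hypothetical helper.

```python
def recall_at_threshold(y_true, probs, threshold):
    """Recall for the positive class when predicting 1 if prob > threshold."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    positives = sum(y_true)
    return tp / positives if positives else 0.0


# Made-up labels and predicted probabilities for eight documents.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.9, 0.7, 0.45, 0.2, 0.6, 0.3, 0.1, 0.05]

recall_default = recall_at_threshold(y_true, probs, 0.5)  # two of four positives found
recall_lower = recall_at_threshold(y_true, probs, 0.4)    # three of four positives found
print(recall_default, recall_lower)
```

Lowering the cutoff raises sensitivity but usually adds false positives, so the accuracy and sensitivity figures reported for each round hold only at the 0.5 cutoff.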


## Round 2

In [144]:
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

In [145]:
embeddings = pd.DataFrame(corpus_2['embedding'].tolist())
class_mapping = {'No': 0, 'Yes': 1}
corpus_2['class_encoded'] = corpus_2['product_description_binary'].map(class_mapping)


In [146]:
X = np.array(embeddings)
y = np.array(corpus_2['class_encoded'])


In [147]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [148]:
embedding_size = 1536 
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(embedding_size,)), 
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid') 
])

In [149]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])


In [150]:
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)



In [151]:
loss, accuracy = model.evaluate(X_test, y_test, verbose=0) 
print(f"Test accuracy: {accuracy}")


Test accuracy: 0.9303030371665955


In [152]:
y_pred_probs = model.predict(X_test, verbose=0)
y_pred = (y_pred_probs > 0.5).astype(int)

In [153]:
sensitivity = recall_score(y_test, y_pred)
print(f"Sensitivity (Recall): {sensitivity}")

Sensitivity (Recall): 0.7834394904458599


## Round 3

In [154]:
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

In [155]:
embeddings = pd.DataFrame(corpus_3['embedding'].tolist())
class_mapping = {'No': 0, 'Yes': 1}
corpus_3['class_encoded'] = corpus_3['product_description_binary'].map(class_mapping)


In [156]:
X = np.array(embeddings)
y = np.array(corpus_3['class_encoded'])


In [157]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [158]:
embedding_size = 1536
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(embedding_size,)), 
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')  
])

In [159]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])


In [160]:
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)



In [161]:
loss, accuracy = model.evaluate(X_test, y_test, verbose=0) 
print(f"Test accuracy: {accuracy}")


Test accuracy: 0.9106345176696777


In [162]:
y_pred_probs = model.predict(X_test, verbose=0)
y_pred = (y_pred_probs > 0.5).astype(int)

In [163]:
sensitivity = recall_score(y_test, y_pred)
print(f"Sensitivity (Recall): {sensitivity}")

Sensitivity (Recall): 0.946236559139785


## Round 4

In [168]:
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

In [169]:
embeddings = pd.DataFrame(corpus_4['embedding'].tolist())
class_mapping = {'No': 0, 'Yes': 1}
corpus_4['class_encoded'] = corpus_4['product_description_binary'].map(class_mapping)


In [170]:
X = np.array(embeddings)
y = np.array(corpus_4['class_encoded'])


In [171]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [172]:
embedding_size = 1536 
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(embedding_size,)),  
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')  
])

In [173]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])


In [174]:
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)



In [175]:
loss, accuracy = model.evaluate(X_test, y_test, verbose=0) 
print(f"Test accuracy: {accuracy}")


Test accuracy: 0.9258373379707336


In [176]:
y_pred_probs = model.predict(X_test, verbose=0)
y_pred = (y_pred_probs > 0.5).astype(int)

In [177]:
sensitivity = recall_score(y_test, y_pred)
print(f"Sensitivity (Recall): {sensitivity}")

Sensitivity (Recall): 0.950381679389313


## Round 5

In [178]:
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

In [179]:
embeddings = pd.DataFrame(corpus_5['embedding'].tolist())
class_mapping = {'No': 0, 'Yes': 1}
corpus_5['class_encoded'] = corpus_5['product_description_binary'].map(class_mapping)


In [180]:
X = np.array(embeddings)
y = np.array(corpus_5['class_encoded'])


In [181]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [182]:
embedding_size = 1536 
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(embedding_size,)),  
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid') 
])

In [183]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])


In [184]:
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)



In [185]:
loss, accuracy = model.evaluate(X_test, y_test, verbose=0) 
print(f"Test accuracy: {accuracy}")


Test accuracy: 0.9459459185600281


In [186]:
y_pred_probs = model.predict(X_test, verbose=0)
y_pred = (y_pred_probs > 0.5).astype(int)

In [187]:
sensitivity = recall_score(y_test, y_pred)
print(f"Sensitivity (Recall): {sensitivity}")

Sensitivity (Recall): 0.7709923664122137


## Round 6

In [188]:
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

In [189]:
embeddings = pd.DataFrame(corpus_6['embedding'].tolist())
class_mapping = {'No': 0, 'Yes': 1}
corpus_6['class_encoded'] = corpus_6['product_description_binary'].map(class_mapping)


In [190]:
X = np.array(embeddings)
y = np.array(corpus_6['class_encoded'])


In [191]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [192]:
embedding_size = 1536 
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(embedding_size,)), 
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid') 
])

In [193]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])


In [194]:
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)



In [195]:
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {accuracy}")


Test accuracy: 0.9232062697410583


In [196]:
y_pred_probs = model.predict(X_test, verbose=0)
y_pred = (y_pred_probs > 0.5).astype(int)

In [197]:
sensitivity = recall_score(y_test, y_pred)
print(f"Sensitivity (Recall): {sensitivity}")

Sensitivity (Recall): 0.9497041420118343


# 2. Measurability Prediction

## 2.1 Default model experiment with cross validation

In [1]:
import pandas as pd
import nltk
nltk.download('punkt')
import re 
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

import os

import json

from nltk.corpus import stopwords
import lemmy
import textract
import pickle
import numpy as np

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\blunds\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Loading data

In [2]:
corpus_42 = pd.read_pickle("corpus_42_final.pkl")

In [3]:
corpus_42.head()

Unnamed: 0,doc_id,contract_type_code,Product_60codes,Measurability_mean,Asset_specificity_mean,Manuel_or_not,product_description_binary,CPVcodes_2digit,CPVcodes_3digit,clean_text,Embedding,Normalized_Embedding,Normalized_Embedding_mean
0,0000-008168,4,60,3.323232,2.450549,not,Yes,71,713,the new programme should provide the opportun...,"[-0.023649858, -0.0044648494, -0.006738333, 0....","[-0.02364985798180248, -0.0044648493965644956,...","[-0.02364985798180248, -0.0044648493965644956,..."
1,0000-008227,4,60,3.323232,2.450549,not,Yes,71,710,sikring af bygningens klimaskærm som fundamen...,"[0.0021429053, -0.020846844, -0.0025144592, 0....","[0.0018015666258628857, -0.017526196050273404,...","[-0.011694183793053732, -0.040396958453703675,..."
2,0000-008241,4,60,3.323232,2.450549,not,Yes,71,712,sikring af bygningens klimaskærm som fundamen...,"[0.019978527, -0.021250192, -0.005847986, -0.0...","[0.01997852606996726, -0.0212501910107692, -0....","[0.01997852606996726, -0.0212501910107692, -0...."
3,0000-008243,4,19,2.87013,2.590164,not,Yes,45,452,entreprisen omfatter almindelig vedligeholdels...,"[0.032049313, -0.03062923, 0.0018029478, -0.01...","[0.03204931125482191, -0.030629228332149552, 0...","[0.03204931125482191, -0.030629228332149552, 0..."
4,0000-008280,1,19,2.87013,2.590164,manuel,Yes,45,452,samarbejdets organisering og proces entreprenø...,"[0.027916875, -0.028474228, -0.009179947, -0.0...","[0.027916875576418063, -0.028474228587926097, ...","[0.027916875576418063, -0.028474228587926097, ..."


In [4]:
corpus_18 = pd.read_pickle("corpus_18_final.pkl")

In [5]:
corpus_42['CPVcodes_2digit'] = corpus_42['CPVcodes_2digit'].astype(str).apply(lambda x: x.zfill(2))


In [6]:
corpus_18['CPVcodes_2digit'] = corpus_18['CPVcodes_2digit'].astype(str).apply(lambda x: x.zfill(2))


In [9]:
corpus_over = pd.read_pickle("corpus_over_final.pkl")

#### Encoding CPVcodes_1digit

In [12]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd


In [13]:
corpus_42['CPVcodes_1digit'] = corpus_42['CPVcodes_2digit'].astype(str).str[0]
corpus_18['CPVcodes_1digit'] = corpus_18['CPVcodes_2digit'].astype(str).str[0]
corpus_over['CPVcodes_1digit'] = corpus_over['CPVcodes_2digit'].astype(str).str[0]

In [14]:
corpus_42['CPVcodes_1digit'] = corpus_42['CPVcodes_1digit'].astype(int)
corpus_18['CPVcodes_1digit'] = corpus_18['CPVcodes_1digit'].astype(int)
corpus_over['CPVcodes_1digit'] = corpus_over['CPVcodes_1digit'].astype(int)
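
The zero-padding step matters because a two-digit CPV division stored as an integer loses its leading zero. A small sketch of the same string logic on plain Python values follows. `cpv_first_digit` is a hypothetical helper, and the codes 3 and 9 are illustrative inputs rather than values confirmed to occur in the corpora.

```python
def cpv_first_digit(code):
    """Pad a CPV division code to two characters and return its first digit as an int."""
    two_digit = str(code).zfill(2)  # 3 -> "03", 45 -> "45"
    return int(two_digit[0])


examples = {code: cpv_first_digit(code) for code in (3, 9, 45, 71)}
print(examples)
```

Without `zfill`, the code `3` would yield a first digit of 3 and be grouped with divisions 30 to 39 instead of 00 to 09.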

In [15]:
corpus_42['CPVcodes_1digit_str'] = corpus_42['CPVcodes_1digit'].astype(str)
corpus_18['CPVcodes_1digit_str'] = corpus_18['CPVcodes_1digit'].astype(str)
corpus_over['CPVcodes_1digit_str'] = corpus_over['CPVcodes_1digit'].astype(str)

In [16]:
all_cpv_codes_str = pd.concat([corpus_42['CPVcodes_1digit_str'], corpus_18['CPVcodes_1digit_str'], corpus_over['CPVcodes_1digit_str']]).unique()

In [17]:
encoder = OneHotEncoder(categories=[all_cpv_codes_str], handle_unknown='ignore')

In [18]:
cpv_42_encoded_1 = encoder.fit_transform(corpus_42[['CPVcodes_1digit_str']]).toarray()

In [19]:
cpv_18_encoded_1 = encoder.transform(corpus_18[['CPVcodes_1digit_str']]).toarray()

In [20]:
cpv_over_encoded_1 = encoder.transform(corpus_over[['CPVcodes_1digit_str']]).toarray()

In [21]:
cpv_42_encoded_df_1 = pd.DataFrame(cpv_42_encoded_1, index=corpus_42.index)

In [22]:
cpv_18_encoded_df_1 = pd.DataFrame(cpv_18_encoded_1, index=corpus_18.index)

In [23]:
cpv_over_encoded_df_1 = pd.DataFrame(cpv_over_encoded_1, index=corpus_over.index)
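
Passing one shared category list to the encoder keeps the dummy columns in the same order for all three corpora. The pure-Python sketch below imitates that behaviour, including the all-zero row that `handle_unknown='ignore'` produces for an unseen code. `one_hot` is a hypothetical stand-in for illustration, not the scikit-learn implementation.

```python
def one_hot(values, categories):
    """Encode values against a fixed category order; unseen values give an all-zero row."""
    position = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0.0] * len(categories)
        if value in position:
            row[position[value]] = 1.0
        rows.append(row)
    return rows


# Illustrative category order and inputs; "9" is deliberately absent from the categories.
categories = ["7", "4", "3"]
encoded = one_hot(["4", "7", "9"], categories)
print(encoded)
```

Because the column order depends only on `categories`, a model fitted on one corpus sees each CPV dummy in the same position when it predicts on another.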

#### Encoding CPVcodes_2digit

In [24]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd


In [25]:
corpus_42['CPVcodes_2digit_str'] = corpus_42['CPVcodes_2digit'].astype(str)
corpus_18['CPVcodes_2digit_str'] = corpus_18['CPVcodes_2digit'].astype(str)
corpus_over['CPVcodes_2digit_str'] = corpus_over['CPVcodes_2digit'].astype(str)


In [26]:
all_cpv_codes_str = pd.concat([corpus_42['CPVcodes_2digit_str'], corpus_18['CPVcodes_2digit_str'], corpus_over['CPVcodes_2digit_str']]).unique()

In [27]:
encoder = OneHotEncoder(categories=[all_cpv_codes_str], handle_unknown='ignore')

In [28]:
cpv_42_encoded_2 = encoder.fit_transform(corpus_42[['CPVcodes_2digit_str']]).toarray()

In [29]:
cpv_18_encoded_2 = encoder.transform(corpus_18[['CPVcodes_2digit_str']]).toarray()

In [30]:
cpv_over_encoded_2 = encoder.transform(corpus_over[['CPVcodes_2digit_str']]).toarray()

In [31]:
cpv_42_encoded_df_2 = pd.DataFrame(cpv_42_encoded_2, index=corpus_42.index)

In [32]:
cpv_18_encoded_df_2 = pd.DataFrame(cpv_18_encoded_2, index=corpus_18.index)

In [33]:
cpv_over_encoded_df_2 = pd.DataFrame(cpv_over_encoded_2, index=corpus_over.index)

#### Encoding CPVcodes_3digit

In [34]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd


In [35]:
corpus_42['CPVcodes_3digit_str'] = corpus_42['CPVcodes_3digit'].astype(str)
corpus_18['CPVcodes_3digit_str'] = corpus_18['CPVcodes_3digit'].astype(str)
corpus_over['CPVcodes_3digit_str'] = corpus_over['CPVcodes_3digit'].astype(str)


In [36]:
all_cpv_codes_str = pd.concat([corpus_42['CPVcodes_3digit_str'], corpus_18['CPVcodes_3digit_str'], corpus_over['CPVcodes_3digit_str']]).unique()

In [37]:
encoder = OneHotEncoder(categories=[all_cpv_codes_str], handle_unknown='ignore')

In [38]:
cpv_42_encoded_3 = encoder.fit_transform(corpus_42[['CPVcodes_3digit_str']]).toarray()

In [39]:
cpv_18_encoded_3 = encoder.transform(corpus_18[['CPVcodes_3digit_str']]).toarray()

In [40]:
cpv_over_encoded_3 = encoder.transform(corpus_over[['CPVcodes_3digit_str']]).toarray()

In [41]:
cpv_42_encoded_df_3 = pd.DataFrame(cpv_42_encoded_3, index=corpus_42.index)

In [42]:
cpv_18_encoded_df_3 = pd.DataFrame(cpv_18_encoded_3, index=corpus_18.index)

In [43]:
cpv_over_encoded_df_3 = pd.DataFrame(cpv_over_encoded_3, index=corpus_over.index)

# Linear Regression

In [44]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

In [46]:
embeddings_df = pd.DataFrame(corpus_42['Normalized_Embedding'].tolist())


In [47]:
data_with_dummies = pd.concat([embeddings_df, cpv_42_encoded_df_3, cpv_42_encoded_df_2, cpv_42_encoded_df_1], axis=1)

In [48]:
data = data_with_dummies.join(corpus_42['Measurability_mean'])

In [49]:
data.columns = data.columns.astype(str)

In [50]:
X = data.drop('Measurability_mean', axis=1)
y = data['Measurability_mean']

In [51]:
model = LinearRegression()

In [52]:
mse_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
mae_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
r2_scores = cross_val_score(model, X, y, cv=5, scoring='r2')

In [53]:
print("Mean MSE:", -mse_scores.mean())
print("Mean MAE:", -mae_scores.mean())
print("Mean R2:", r2_scores.mean())

Mean MSE: 0.024548891757081255
Mean MAE: 0.10622352084267228
Mean R2: 0.8879356983033277


In [54]:
model.fit(X, y)


In [55]:
embeddings_df_18 = pd.DataFrame(corpus_18['Normalized_Embedding'].tolist())


In [56]:
data_with_dummies_18 = pd.concat([embeddings_df_18, cpv_18_encoded_df_3, cpv_18_encoded_df_2, cpv_18_encoded_df_1], axis=1)

In [57]:
data_18 = data_with_dummies_18.join(corpus_18['Measurability_mean'])

In [58]:
data_18.columns = data_18.columns.astype(str)

In [59]:
X_18 = data_18.drop('Measurability_mean', axis=1)

In [60]:
y_18 = data_18['Measurability_mean']

In [61]:
y_pred_18 = model.predict(X_18)

In [62]:
mse_18 = mean_squared_error(y_18, y_pred_18)
mae_18 = mean_absolute_error(y_18, y_pred_18)
r2_18 = r2_score(y_18, y_pred_18)

In [63]:
print(f"Mean Squared Error (MSE) on corpus_18: {mse_18}")
print(f"Mean Absolute Error (MAE) on corpus_18: {mae_18}")
print(f"R^2 Score on corpus_18: {r2_18}")

Mean Squared Error (MSE) on corpus_18: 0.22688947197577192
Mean Absolute Error (MAE) on corpus_18: 0.35748705657720964
R^2 Score on corpus_18: -0.4169290823372467
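
A negative R² on the held-out corpus means the fitted model predicts worse there than a constant equal to the held-out mean would. The sketch below works through the definition, R² = 1 - SS_res / SS_tot, on invented numbers. `r2` is a hypothetical helper that follows the same formula as `r2_score`.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented targets and two sets of predictions.
y_true = [2.0, 3.0, 4.0]
good = r2(y_true, [2.1, 2.9, 4.2])  # small residuals, close to 1
bad = r2(y_true, [4.0, 3.0, 2.0])   # residuals exceed the spread of y_true, below 0
print(good, bad)
```

In the second case the squared residuals (8) are four times the total variation (2), giving 1 - 4 = -3. The same mechanism explains how a high cross-validated score on one corpus can sit alongside a negative score on another.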


# Random Forest

In [46]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [47]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

In [112]:
embeddings_df = pd.DataFrame(corpus_42['Normalized_Embedding'].tolist())


In [113]:
data_with_dummies = pd.concat([embeddings_df, cpv_42_encoded_df_3, cpv_42_encoded_df_2, cpv_42_encoded_df_1], axis=1)

In [114]:
data = data_with_dummies.join(corpus_42['Measurability_mean'])

In [115]:
data.columns = data.columns.astype(str)

In [116]:
X = data.drop('Measurability_mean', axis=1)
y = data['Measurability_mean']

In [117]:
model = RandomForestRegressor(n_estimators=100, random_state=42)

In [118]:
mse_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
mae_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
r2_scores = cross_val_score(model, X, y, cv=5, scoring='r2')

In [119]:
print("Mean MSE:", -mse_scores.mean())
print("Mean MAE:", -mae_scores.mean())
print("Mean R2:", r2_scores.mean())

Mean MSE: 0.04188398130484365
Mean MAE: 0.1263508351564301
Mean R2: 0.8118248370675258


In [120]:
model.fit(X, y)


In [121]:
embeddings_df_18 = pd.DataFrame(corpus_18['Normalized_Embedding'].tolist())


In [122]:
data_with_dummies_18 = pd.concat([embeddings_df_18, cpv_18_encoded_df_3, cpv_18_encoded_df_2, cpv_18_encoded_df_1], axis=1)

In [123]:
data_18 = data_with_dummies_18.join(corpus_18['Measurability_mean'])

In [124]:
data_18.columns = data_18.columns.astype(str)

In [125]:
X_18 = data_18.drop('Measurability_mean', axis=1)

In [126]:
y_18 = data_18['Measurability_mean']

In [127]:
y_pred_18 = model.predict(X_18)

In [128]:
mse_18 = mean_squared_error(y_18, y_pred_18)
mae_18 = mean_absolute_error(y_18, y_pred_18)
r2_18 = r2_score(y_18, y_pred_18)

In [129]:
print(f"Mean Squared Error (MSE) on corpus_18: {mse_18}")
print(f"Mean Absolute Error (MAE) on corpus_18: {mae_18}")
print(f"R^2 Score on corpus_18: {r2_18}")

Mean Squared Error (MSE) on corpus_18: 0.12768783619471266
Mean Absolute Error (MAE) on corpus_18: 0.2944204931362472
R^2 Score on corpus_18: 0.2025870262313333


# SVM 

In [48]:
import pandas as pd
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [49]:
embeddings_df = pd.DataFrame(corpus_42['Normalized_Embedding'].tolist())


In [50]:
data_with_dummies = pd.concat([embeddings_df, cpv_42_encoded_df_3, cpv_42_encoded_df_2, cpv_42_encoded_df_1], axis=1)

In [51]:
data = data_with_dummies.join(corpus_42['Measurability_mean'])

In [52]:
data.columns = data.columns.astype(str)

In [53]:
X = data.drop('Measurability_mean', axis=1)
y = data['Measurability_mean']

In [59]:
model = SVR()

In [60]:
mse_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
mae_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
r2_scores = cross_val_score(model, X, y, cv=5, scoring='r2')

In [61]:
print("Mean MSE:", -mse_scores.mean())
print("Mean MAE:", -mae_scores.mean())
print("Mean R2:", r2_scores.mean())

Mean MSE: 0.017411405614147148
Mean MAE: 0.09503278387561949
Mean R2: 0.9207175499146553


In [62]:
model.fit(X, y)


In [63]:
embeddings_df_18 = pd.DataFrame(corpus_18['Normalized_Embedding'].tolist())


In [64]:
data_with_dummies_18 = pd.concat([embeddings_df_18, cpv_18_encoded_df_3, cpv_18_encoded_df_2, cpv_18_encoded_df_1], axis=1)

In [65]:
data_18 = data_with_dummies_18.join(corpus_18['Measurability_mean'])

In [66]:
data_18.columns = data_18.columns.astype(str)

In [67]:
X_18 = data_18.drop('Measurability_mean', axis=1)

In [68]:
y_18 = data_18['Measurability_mean']

In [69]:
y_pred_18 = model.predict(X_18)

In [70]:
mse_18 = mean_squared_error(y_18, y_pred_18)
mae_18 = mean_absolute_error(y_18, y_pred_18)
r2_18 = r2_score(y_18, y_pred_18)

In [71]:
print(f"Mean Squared Error (MSE) on corpus_18: {mse_18}")
print(f"Mean Absolute Error (MAE) on corpus_18: {mae_18}")
print(f"R^2 Score on corpus_18: {r2_18}")

Mean Squared Error (MSE) on corpus_18: 0.12981839043452922
Mean Absolute Error (MAE) on corpus_18: 0.27332649159967914
R^2 Score on corpus_18: 0.1892816743452157


## Neural Network

In [85]:
import pandas as pd
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [86]:
import numpy as np
import tensorflow as tf
import random

np.random.seed(42)
random.seed(42)
tf.random.set_seed(42)

In [87]:
embeddings_df = pd.DataFrame(corpus_42['Normalized_Embedding'].tolist())


In [88]:
data_with_dummies = pd.concat([embeddings_df, cpv_42_encoded_df_3, cpv_42_encoded_df_2, cpv_42_encoded_df_1], axis=1)

In [89]:
data = data_with_dummies.join(corpus_42['Measurability_mean'])

In [90]:
data.columns = data.columns.astype(str)

In [91]:
X = data.drop('Measurability_mean', axis=1)
y = data['Measurability_mean']

In [92]:
model = MLPRegressor(hidden_layer_sizes=(100,), activation='logistic', solver='adam', max_iter=1000, random_state=42)

In [93]:
mse_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
mae_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
r2_scores = cross_val_score(model, X, y, cv=5, scoring='r2')

In [94]:
print("Mean MSE:", -mse_scores.mean())
print("Mean MAE:", -mae_scores.mean())
print("Mean R2:", r2_scores.mean())

Mean MSE: 0.022970720741027663
Mean MAE: 0.09637297333853419
Mean R2: 0.8960198697710583


In [96]:
model.fit(X, y)

In [97]:
embeddings_df_18 = pd.DataFrame(corpus_18['Normalized_Embedding'].tolist())

In [98]:
data_with_dummies_18 = pd.concat([embeddings_df_18, cpv_18_encoded_df_3, cpv_18_encoded_df_2, cpv_18_encoded_df_1], axis=1)

In [99]:
data_18 = data_with_dummies_18.join(corpus_18['Measurability_mean'])

In [100]:
data_18.columns = data_18.columns.astype(str)

In [101]:
X_18 = data_18.drop('Measurability_mean', axis=1)

In [102]:
y_18 = data_18['Measurability_mean']

In [103]:
y_pred_18 = model.predict(X_18)

In [104]:
mse_18 = mean_squared_error(y_18, y_pred_18)
mae_18 = mean_absolute_error(y_18, y_pred_18)
r2_18 = r2_score(y_18, y_pred_18)

In [105]:
print(f"Mean Squared Error (MSE) on corpus_18: {mse_18}")
print(f"Mean Absolute Error (MAE) on corpus_18: {mae_18}")
print(f"R^2 Score on corpus_18: {r2_18}")

Mean Squared Error (MSE) on corpus_18: 0.13234059710025803
Mean Absolute Error (MAE) on corpus_18: 0.27485314912206843
R^2 Score on corpus_18: 0.1735304455851716


## 2.2 Support Vector Machine adjustment

In [1]:
import pandas as pd
import nltk
nltk.download('punkt')
import re 
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

import os

import json

from nltk.corpus import stopwords
import lemmy
import textract
import pickle
import numpy as np

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\blunds\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Loading data

In [2]:
corpus_42 = pd.read_pickle("corpus_42_final.pkl")

In [3]:
corpus_18 = pd.read_pickle("corpus_18_final.pkl")

In [4]:
corpus_18.head()

Unnamed: 0,doc_id,contract_type_code,Product_60codes,Measurability_mean,Asset_specificity_mean,Manuel_or_not,product_description_binary,CPVcodes_2digit,CPVcodes_3digit,clean_text,Embedding,Normalized_Embedding,Normalized_Embedding_mean
0,0000-008404,1,37,2.62,2.272727,not,Yes,44,441,- nedtagning og bortkørelse af eksist. tagplad...,"[0.03106328, -0.03935687, -0.013459226, 0.0129...","[0.03094654800758096, -0.03920897171461362, -0...","[0.02489573881306392, -0.04187234206775942, -0..."
1,0000-008761,2,58,2.676768,2.710843,not,Yes,33,331,auh q02 h1 e02 s2 n02 eksisterne forhold lav t...,"[0.010536981, -0.018840395, -0.00898971, 0.019...","[0.010553398240196509, -0.018869749450777892, ...","[0.008020402785488632, -0.033835261320993994, ..."
2,0000-008790,2,50,2.574257,2.716049,not,Yes,35,351,bilag 1 kravspecifikation vi ønsker at indkøbe...,"[0.013917757, -0.0545031, -0.0040278393, 0.002...","[0.013917756410740946, -0.05450309769241228, -...","[0.013917756410740946, -0.05450309769241228, -..."
3,0000-008871,2,50,2.574257,2.716049,not,Yes,35,351,demand specification the sensors must be able ...,"[0.008390237, 0.024996884, 0.0026603707, -0.00...","[0.008390237145836444, 0.024996884434487925, 0...","[0.008390237145836444, 0.024996884434487925, 0..."
4,001936-2021,4,6,3.450549,2.961039,not,Yes,85,851,3.3.1 autorisationer skal angive i hvilket omf...,"[0.015264984, -0.050398123, -0.015983956, 0.01...","[0.015264984321138807, -0.05039812406025615, -...","[0.015264984321138807, -0.05039812406025615, -..."


### Creating CPVcodes_1digit

In [6]:
corpus_42['CPVcodes_2digit'] = corpus_42['CPVcodes_2digit'].astype(str).apply(lambda x: x.zfill(2))


In [7]:
corpus_18['CPVcodes_2digit'] = corpus_18['CPVcodes_2digit'].astype(str).apply(lambda x: x.zfill(2))


In [8]:
corpus_42['CPVcodes_1digit'] = corpus_42['CPVcodes_2digit'].astype(str).str[0]
corpus_18['CPVcodes_1digit'] = corpus_18['CPVcodes_2digit'].astype(str).str[0]


In [9]:
corpus_42['CPVcodes_1digit'] = corpus_42['CPVcodes_1digit'].astype(int)
corpus_18['CPVcodes_1digit'] = corpus_18['CPVcodes_1digit'].astype(int)


#### Encoding CPVcodes_1digit

In [318]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd


In [319]:
corpus_42['CPVcodes_1digit_str'] = corpus_42['CPVcodes_1digit'].astype(str)
corpus_18['CPVcodes_1digit_str'] = corpus_18['CPVcodes_1digit'].astype(str)

In [320]:
all_cpv_codes_str = pd.concat([corpus_42['CPVcodes_1digit_str'], corpus_18['CPVcodes_1digit_str']]).unique()

In [321]:
encoder = OneHotEncoder(categories=[all_cpv_codes_str], handle_unknown='ignore')

In [322]:
cpv_42_encoded_1 = encoder.fit_transform(corpus_42[['CPVcodes_1digit_str']]).toarray()

In [323]:
cpv_18_encoded_1 = encoder.transform(corpus_18[['CPVcodes_1digit_str']]).toarray()

In [324]:
cpv_42_encoded_df_1 = pd.DataFrame(cpv_42_encoded_1, index=corpus_42.index)

In [325]:
cpv_18_encoded_df_1 = pd.DataFrame(cpv_18_encoded_1, index=corpus_18.index)

#### Encoding CPVcodes_2digit

In [215]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd


In [216]:
corpus_42['CPVcodes_2digit_str'] = corpus_42['CPVcodes_2digit'].astype(str)
corpus_18['CPVcodes_2digit_str'] = corpus_18['CPVcodes_2digit'].astype(str)

In [217]:
all_cpv_codes_str = pd.concat([corpus_42['CPVcodes_2digit_str'], corpus_18['CPVcodes_2digit_str']]).unique()

In [218]:
encoder = OneHotEncoder(categories=[all_cpv_codes_str], handle_unknown='ignore')

In [219]:
cpv_42_encoded_2 = encoder.fit_transform(corpus_42[['CPVcodes_2digit_str']]).toarray()

In [220]:
cpv_18_encoded_2 = encoder.transform(corpus_18[['CPVcodes_2digit_str']]).toarray()

In [221]:
cpv_42_encoded_df_2 = pd.DataFrame(cpv_42_encoded_2, index=corpus_42.index)

In [222]:
cpv_18_encoded_df_2 = pd.DataFrame(cpv_18_encoded_2, index=corpus_18.index)

#### Encoding CPVcodes_3digit

In [223]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd


In [224]:
corpus_42['CPVcodes_3digit_str'] = corpus_42['CPVcodes_3digit'].astype(str)
corpus_18['CPVcodes_3digit_str'] = corpus_18['CPVcodes_3digit'].astype(str)

In [225]:
all_cpv_codes_str = pd.concat([corpus_42['CPVcodes_3digit_str'], corpus_18['CPVcodes_3digit_str']]).unique()

In [226]:
encoder = OneHotEncoder(categories=[all_cpv_codes_str], handle_unknown='ignore')

In [227]:
cpv_42_encoded_3 = encoder.fit_transform(corpus_42[['CPVcodes_3digit_str']]).toarray()

In [228]:
cpv_18_encoded_3 = encoder.transform(corpus_18[['CPVcodes_3digit_str']]).toarray()

In [229]:
cpv_42_encoded_df_3 = pd.DataFrame(cpv_42_encoded_3, index=corpus_42.index)

In [230]:
cpv_18_encoded_df_3 = pd.DataFrame(cpv_18_encoded_3, index=corpus_18.index)

#### Encoding contract_type_code

In [231]:
contract_type_dummies = pd.get_dummies(corpus_42['contract_type_code'], prefix='contract_type')

In [232]:
contract_type_dummies_18 = pd.get_dummies(corpus_18['contract_type_code'], prefix='contract_type')
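`pd.get_dummies` builds its columns from the values present in each frame, so the two dummy frames share the same columns only when both corpora contain the same set of contract types. The following is a minimal sketch on toy data, assuming hypothetical frames `toy_train` and `toy_valid`, that reindexes the validation dummies to the training columns.

```python
import pandas as pd

# Hypothetical frames: the validation data lacks contract types 1 and 3
toy_train = pd.DataFrame({'contract_type_code': [1, 2, 3, 4]})
toy_valid = pd.DataFrame({'contract_type_code': [2, 4, 4]})

train_dummies = pd.get_dummies(toy_train['contract_type_code'],
                               prefix='contract_type', dtype=int)
valid_dummies = pd.get_dummies(toy_valid['contract_type_code'],
                               prefix='contract_type', dtype=int)

# Add the missing columns (filled with 0) and enforce the training column order
valid_dummies_aligned = valid_dummies.reindex(columns=train_dummies.columns,
                                              fill_value=0)
```

After the reindex, a model fitted on the training columns receives the same number of features, in the same order, at prediction time.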

# Measurability_mean

## Support Vector Machine

In [14]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [327]:
embeddings_df = pd.DataFrame(corpus_42['Normalized_Embedding'].tolist())

In [328]:
data_with_dummies = pd.concat([embeddings_df, cpv_42_encoded_df_2, cpv_42_encoded_df_1, cpv_42_encoded_df_3, contract_type_dummies], axis=1)

In [329]:
data = data_with_dummies.join(corpus_42['Measurability_mean'])

In [330]:
data.columns = data.columns.astype(str)

In [331]:
X = data.drop('Measurability_mean', axis=1)
y = data['Measurability_mean']


In [332]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [333]:
model_svm_rbf = SVR(kernel='rbf')

In [338]:
model_svm_rbf.fit(X_train, y_train)


In [339]:
y_pred_svm_rbf = model_svm_rbf.predict(X_test)

In [340]:
mse_svm_rbf = mean_squared_error(y_test, y_pred_svm_rbf)
mae_svm_rbf = mean_absolute_error(y_test, y_pred_svm_rbf)
r2_svm_rbf = r2_score(y_test, y_pred_svm_rbf)

In [341]:
print(f'Mean Squared Error (MSE): {mse_svm_rbf}')
print(f'Mean Absolute Error (MAE): {mae_svm_rbf}')
print(f'R^2 Score: {r2_svm_rbf}')

Mean Squared Error (MSE): 0.021582822232002193
Mean Absolute Error (MAE): 0.10364690773664942
R^2 Score: 0.8968725220439258
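For reference, the three reported metrics follow the standard definitions implemented by the scikit-learn functions used above. Here $y_i$ is the observed `Measurability_mean`, $\hat{y}_i$ the prediction, and $\bar{y}$ the mean of the observed values over the $n$ evaluated documents.

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2}
```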


In [257]:
y_pred_full = model_svm_rbf.predict(X)

In [258]:
corpus_42['predicted_Measurability_mean'] = y_pred_full

In [259]:
corpus_42['absolute_error'] = np.abs(corpus_42['Measurability_mean'] - corpus_42['predicted_Measurability_mean'])
corpus_42['prediction_error'] = np.abs(corpus_42['predicted_Measurability_mean'] - corpus_42['Measurability_mean'])


In [262]:
print(corpus_42['prediction_error'].describe())

count    1351.000000
mean        0.093696
std         0.081616
min         0.000101
25%         0.067913
50%         0.087696
75%         0.099904
max         0.832349
Name: prediction_error, dtype: float64
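Note that `prediction_error` is computed here with `np.abs`, so it coincides with `absolute_error`, whereas the corpus_18 validation further down stores a signed difference under the same name. The small sketch below uses made-up numbers to show what each variant reveals: the signed error exposes systematic over- or under-prediction, while the absolute error measures magnitude only.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'actual':    [3.0, 2.5, 3.5, 2.0],
    'predicted': [2.5, 2.0, 3.0, 2.5],
})

toy['signed_error'] = toy['predicted'] - toy['actual']   # negative = under-prediction
toy['absolute_error'] = np.abs(toy['signed_error'])      # magnitude only

mean_bias = toy['signed_error'].mean()         # average direction of the error
mean_abs_error = toy['absolute_error'].mean()  # average size of the error
```

In this toy case the mean signed error is negative while the mean absolute error is larger in magnitude, which is the pattern to expect when a model mostly under-predicts.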


In [260]:
corpus_42_sorted_by_error = corpus_42.sort_values('absolute_error', ascending=False)

In [267]:
corpus_42_sorted_by_error.tail()

Unnamed: 0,doc_id,contract_type_code,Product_60codes,Measurability_mean,Asset_specificity_mean,Manuel_or_not,product_description_binary,CPVcodes_2digit,CPVcodes_3digit,clean_text,Embedding,Normalized_Embedding,Normalized_Embedding_mean,CPVcodes_1digit,CPVcodes_1digit_str,CPVcodes_2digit_str,CPVcodes_3digit_str,predicted_Measurability_mean,absolute_error,prediction_error
677,314668-2022,2,39,2.282828,2.571429,not,Yes,34,341,logo .4 9 lad- og kassebilernes indretning og ...,"[0.015642222, -0.022446463, -0.008005235, 0.00...","[0.015642222077874635, -0.022446463111749474, ...","[0.015642222077874635, -0.022446463111749474, ...",3,3,34,341,2.283722,0.000894,0.000894
1017,510688-2022,4,17,2.855556,2.405941,not,Yes,66,665,k uindregistrerede arbejdsmaskiner anvendes ud...,"[0.022993825, -0.021013413, -0.0067123743, -0....","[0.022993825343339564, -0.02101341331376842, -...","[0.022993825343339564, -0.02101341331376842, -...",6,6,66,665,2.85483,0.000726,0.000726
884,4501-004039,1,34,3.053571,1.898305,not,Yes,45,453,leverandøren er forpligtet til at opsætte lade...,"[-0.0012835168, -0.0077688163, -0.008659378, -...","[-0.0012835167769329593, -0.00776881616038079,...","[-0.0012835167769329593, -0.00776881616038079,...",4,4,45,453,3.054189,0.000618,0.000618
66,015002-2022,2,39,2.282828,2.571429,not,Yes,34,341,det er ok med et 13 5 tons chassis. 100 luftaf...,"[-0.0046223914, -0.02774258, -0.010463151, 0.0...","[-0.00462239118112732, -0.02774257868637415, -...","[-0.00462239118112732, -0.02774257868637415, -...",3,3,34,341,2.283416,0.000588,0.000588
131,028547-2022,2,39,2.282828,2.571429,not,Yes,34,341,type 3-akslet fabriksnyt chassis med dobbelt c...,"[0.008342708, -0.0364893, -0.018855484, 0.0202...","[0.00834270769524348, -0.03648929866705726, -0...","[0.00834270769524348, -0.03648929866705726, -0...",3,3,34,341,2.282727,0.000101,0.000101


In [263]:
selected_columns_df = corpus_42_sorted_by_error[['doc_id', 'clean_text', 'contract_type_code', 'Product_60codes','CPVcodes_2digit', 'CPVcodes_3digit', 'Measurability_mean', 'prediction_error', 'predicted_Measurability_mean']].copy()  # explicit copy avoids SettingWithCopyWarning in the cells below

In [268]:
def truncate_text(text):
    MAX_LENGTH = 30000  # Kept safely below Excel's 32,767-character cell limit
    if isinstance(text, str) and len(text) > MAX_LENGTH:
        return text[:MAX_LENGTH]
    return text

In [269]:
selected_columns_df['clean_text'] = selected_columns_df['clean_text'].apply(truncate_text)

In [270]:
# Format floats with two decimals and a decimal comma (Danish locale) for the export
for col in selected_columns_df.select_dtypes(include=['float', 'float64']).columns:
    selected_columns_df[col] = selected_columns_df[col].apply(lambda x: f'{x:.2f}'.replace('.', ','))

In [271]:
selected_columns_df.to_csv('SVM_42_rbf_alle.csv', index=False, encoding='utf-16', sep='\t')

In [282]:
mean_values_df_42 = corpus_42.groupby('Product_60codes').agg({
    'Measurability_mean': 'mean',
    'predicted_Measurability_mean': 'mean'
}).reset_index()

In [283]:
mean_values_df_42.rename(columns={'Measurability_mean': 'Mean_Actual_Measurability',
                               'predicted_Measurability_mean': 'Mean_Predicted_Measurability'}, inplace=True)

In [285]:
mean_values_df_42.to_excel('mean_values_42.xlsx', index=False)

### Validation on corpus_18

In [342]:
embeddings_df_18 = pd.DataFrame(corpus_18['Normalized_Embedding'].tolist())

In [343]:
data_with_dummies_18 = pd.concat([embeddings_df_18, cpv_18_encoded_df_2, cpv_18_encoded_df_1, cpv_18_encoded_df_3, contract_type_dummies_18], axis=1)

In [344]:
data_18 = data_with_dummies_18.join(corpus_18['Measurability_mean'])

In [345]:
data_18.columns = data_18.columns.astype(str)

In [346]:
X_18 = data_18.drop('Measurability_mean', axis=1)

In [347]:
y_18 = data_18['Measurability_mean']

In [348]:
y_pred_18_svm_rbf = model_svm_rbf.predict(X_18)

In [349]:
mse_18_svm_rbf = mean_squared_error(y_18, y_pred_18_svm_rbf)
mae_18_svm_rbf = mean_absolute_error(y_18, y_pred_18_svm_rbf)
r2_18_svm_rbf = r2_score(y_18, y_pred_18_svm_rbf)

In [350]:
print(f"Mean Squared Error (MSE) on corpus_18: {mse_18_svm_rbf}")
print(f"Mean Absolute Error (MAE) on corpus_18: {mae_18_svm_rbf}")
print(f"R^2 Score on corpus_18: {r2_18_svm_rbf}")

Mean Squared Error (MSE) on corpus_18: 0.12602936206192233
Mean Absolute Error (MAE) on corpus_18: 0.2744801433900481
R^2 Score on corpus_18: 0.2129442288401241


#### Error 

In [351]:
corpus_18['predicted_Measurability_mean'] = y_pred_18_svm_rbf

In [352]:
corpus_18['absolute_error'] = np.abs(corpus_18['Measurability_mean'] - corpus_18['predicted_Measurability_mean'])
corpus_18['prediction_error'] = corpus_18['predicted_Measurability_mean'] - corpus_18['Measurability_mean']

In [353]:
print(corpus_18['prediction_error'].describe())

count    389.000000
mean      -0.057283
std        0.350805
min       -0.910022
25%       -0.215425
50%       -0.045148
75%        0.234740
max        0.570008
Name: prediction_error, dtype: float64


In [354]:
corpus_18_sorted_by_error = corpus_18.sort_values('absolute_error', ascending=False)

In [355]:
corpus_18_sorted_by_error.head()

Unnamed: 0,doc_id,contract_type_code,Product_60codes,Measurability_mean,Asset_specificity_mean,Manuel_or_not,product_description_binary,CPVcodes_2digit,CPVcodes_3digit,clean_text,Embedding,Normalized_Embedding,Normalized_Embedding_mean,CPVcodes_1digit,CPVcodes_1digit_str,CPVcodes_2digit_str,CPVcodes_3digit_str,absolute_error,prediction_error,predicted_Measurability_mean
363,645705-2022,4,6,3.450549,2.961039,not,Yes,98,985,vask af gardiner tæpper dyner puder og lignend...,"[-0.012111249, -0.03762052, -0.0029946682, -0....","[-0.012111248951733156, -0.0376205198500713, -...","[-0.012111248951733156, -0.0376205198500713, -...",9,9,98,985,0.910022,-0.910022,2.540527
331,604726-2021,4,6,3.450549,2.961039,not,Yes,98,985,3 1. indledning københavns kommune sundheds- o...,"[0.013809947, -0.03705657, -0.0013971226, 0.01...","[0.013809947039233016, -0.03705657010527491, -...","[0.013809947039233016, -0.03705657010527491, -...",9,9,98,985,0.903757,-0.903757,2.546792
322,587998-2021,4,6,3.450549,2.961039,not,Yes,98,985,1.31 1.32 1.33 1.34 1.35 nr. 2 2.1 2.2 2.3 2....,"[0.00805287, -0.029606776, -0.0027219134, -0.0...","[0.008052869406365683, -0.02960677381747399, -...","[0.008052869406365683, -0.02960677381747399, -...",9,9,98,985,0.897682,-0.897682,2.552867
112,184479-2021,4,6,3.450549,2.961039,not,Yes,98,985,hjemmeplejen i fredensborg kommune hjemmepleje...,"[0.003928437, -0.033135317, -0.0076411944, -0....","[0.003928436965828447, -0.03313531671177207, -...","[0.003928436965828447, -0.03313531671177207, -...",9,9,98,985,0.897077,-0.897077,2.553472
314,566770-2022,4,6,3.450549,2.961039,not,Yes,98,985,3.3 sproglige færdigheder det personale der er...,"[0.012966551, -0.02693183, -0.0010209465, 0.02...","[0.012966550167898033, -0.026931828271704737, ...","[0.012966550167898033, -0.026931828271704737, ...",9,9,98,985,0.894203,-0.894203,2.556346


In [272]:
selected_columns_df = corpus_18_sorted_by_error[['doc_id', 'clean_text', 'contract_type_code', 'Product_60codes','CPVcodes_2digit', 'CPVcodes_3digit', 'Measurability_mean', 'prediction_error', 'predicted_Measurability_mean']].copy()  # explicit copy avoids SettingWithCopyWarning in the cells below

In [273]:
def truncate_text(text):
    MAX_LENGTH = 30000  # Kept safely below Excel's 32,767-character cell limit
    if isinstance(text, str) and len(text) > MAX_LENGTH:
        return text[:MAX_LENGTH]
    return text

In [274]:
selected_columns_df['clean_text'] = selected_columns_df['clean_text'].apply(truncate_text)

In [275]:
# Format floats with two decimals and a decimal comma (Danish locale) for the export
for col in selected_columns_df.select_dtypes(include=['float', 'float64']).columns:
    selected_columns_df[col] = selected_columns_df[col].apply(lambda x: f'{x:.2f}'.replace('.', ','))

In [276]:
selected_columns_df.to_csv('SVM_rbf_alle.csv', index=False, encoding='utf-16', sep='\t')

### Mean prediction

In [278]:
mean_values_df = corpus_18.groupby('Product_60codes').agg({
    'Measurability_mean': 'mean',
    'predicted_Measurability_mean': 'mean'
}).reset_index()

In [279]:
mean_values_df.rename(columns={'Measurability_mean': 'Mean_Actual_Measurability',
                               'predicted_Measurability_mean': 'Mean_Predicted_Measurability'}, inplace=True)

In [289]:
mean_values_df.head(18)

Unnamed: 0,Product_60codes,Mean_Actual_Measurability,Mean_Predicted_Measurability
0,2,3.291667,3.028851
1,5,2.118644,2.476478
2,6,3.450549,2.729885
3,15,2.488889,2.517403
4,16,3.020408,2.992763
5,28,2.287234,2.707101
6,29,2.657895,2.388123
7,37,2.62,2.428599
8,40,2.555556,2.826842
9,41,2.726027,2.372451


In [281]:
mean_values_df.to_excel('mean_values_18.xlsx', index=False)

## 2.3 Neural Network adjustment

In [1]:
import pandas as pd
import nltk
nltk.download('punkt')
import re 
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

import os

import json

from nltk.corpus import stopwords
import lemmy
import textract
import pickle
import numpy as np

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\blunds\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Loading data

In [2]:
corpus_42 = pd.read_pickle("corpus_42_final.pkl")

In [3]:
corpus_18 = pd.read_pickle("corpus_18_final.pkl")

In [4]:
corpus_over = pd.read_pickle("corpus_over_final.pkl")

## Creating CPVcodes_1digit

In [11]:
corpus_42['CPVcodes_2digit'] = corpus_42['CPVcodes_2digit'].astype(str).apply(lambda x: x.zfill(2))


In [12]:
corpus_18['CPVcodes_2digit'] = corpus_18['CPVcodes_2digit'].astype(str).apply(lambda x: x.zfill(2))


In [13]:
corpus_over['CPVcodes_2digit'] = corpus_over['CPVcodes_2digit'].astype(str).apply(lambda x: x.zfill(2))


In [14]:
corpus_42['CPVcodes_1digit'] = corpus_42['CPVcodes_2digit'].astype(str).str[0]
corpus_18['CPVcodes_1digit'] = corpus_18['CPVcodes_2digit'].astype(str).str[0]
corpus_over['CPVcodes_1digit'] = corpus_over['CPVcodes_2digit'].astype(str).str[0]

In [15]:
corpus_42['CPVcodes_1digit'] = corpus_42['CPVcodes_1digit'].astype(int)
corpus_18['CPVcodes_1digit'] = corpus_18['CPVcodes_1digit'].astype(int)
corpus_over['CPVcodes_1digit'] = corpus_over['CPVcodes_1digit'].astype(int)

## Running Neural Network

In [18]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, Concatenate, LeakyReLU
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras import backend as K

import tensorflow as tf
import random




In [22]:
np.random.seed(42)
random.seed(42)
tf.random.set_seed(42)


In [23]:
corpus_42['CPVcodes_1digit_str'] = corpus_42['CPVcodes_1digit'].astype(str)
corpus_42['CPVcodes_2digit_str'] = corpus_42['CPVcodes_2digit'].astype(str)
corpus_42['CPVcodes_3digit_str'] = corpus_42['CPVcodes_3digit'].astype(str)

In [24]:
X_embeddings = np.array(corpus_42['Normalized_Embedding'].tolist())  # Assuming this is a list of lists
X_cpv_1digit = corpus_42['CPVcodes_1digit_str'].astype('category').cat.codes.values
X_cpv_2digit = corpus_42['CPVcodes_2digit_str'].astype('category').cat.codes.values
X_cpv_3digit = corpus_42['CPVcodes_3digit_str'].astype('category').cat.codes.values
y = corpus_42['Measurability_mean'].values

In [25]:
X_train_embeddings, X_test_embeddings, X_train_cpv_1digit, X_test_cpv_1digit, X_train_cpv_2digit, X_test_cpv_2digit, X_train_cpv_3digit, X_test_cpv_3digit, y_train, y_test = train_test_split(
    X_embeddings, X_cpv_1digit, X_cpv_2digit, X_cpv_3digit, y, test_size=0.3, random_state=42
)

In [26]:
max_cpv_1digit = np.max(X_cpv_1digit) + 2
max_cpv_2digit = np.max(X_cpv_2digit) + 12
max_cpv_3digit = np.max(X_cpv_3digit) + 45
embedding_dim = X_embeddings.shape[1]
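The `max_cpv_*` values are passed as the first argument of the `Embedding` layers, so every index fed to a layer must be smaller than that value, and a given CPV code must map to the same index in every corpus. `.astype('category').cat.codes` numbers the codes by their sorted order within one frame, which in general differs between frames. Below is a minimal sketch, assuming hypothetical series `train_codes` and `valid_codes`, that builds a single vocabulary from the training codes and reserves index 0 for codes not seen in training.

```python
import pandas as pd

# Hypothetical 2-digit CPV codes for a training and a validation corpus
train_codes = pd.Series(['33', '34', '45', '66'])
valid_codes = pd.Series(['45', '98', '34'])

# Per-frame category codes: '45' is 2 in the training frame but 1 here
independent_valid = valid_codes.astype('category').cat.codes.to_numpy()

# Shared vocabulary built from the training codes; index 0 is reserved for unseen codes
vocab = {code: idx + 1 for idx, code in enumerate(sorted(train_codes.unique()))}
train_idx = train_codes.map(vocab).to_numpy()
valid_idx = valid_codes.map(vocab).fillna(0).astype(int).to_numpy()

# Size to pass as the first argument of the Embedding layer
embedding_input_dim = len(vocab) + 1
```

With a shared vocabulary the layer size follows directly from the vocabulary, and no manual headroom is needed.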

In [27]:
from tensorflow.keras.layers import Dropout

In [28]:
custom_learning_rate = 0.001


In [29]:
adam_optimizer = Adam(learning_rate=custom_learning_rate)


In [30]:
def create_model(max_cpv_1digit, max_cpv_2digit, max_cpv_3digit, embedding_dim):
    input_cpv_1digit = Input(shape=(1,), name='cpv_1digit_input')
    input_cpv_2digit = Input(shape=(1,), name='cpv_2digit_input')
    input_cpv_3digit = Input(shape=(1,), name='cpv_3digit_input')
    input_embeddings = Input(shape=(embedding_dim,), name='embeddings_input')

    embedding_cpv_1digit = Embedding(max_cpv_1digit, 4, name='cpv_1digit_embedding')(input_cpv_1digit)
    embedding_cpv_2digit = Embedding(max_cpv_2digit, 8, name='cpv_2digit_embedding')(input_cpv_2digit)
    embedding_cpv_3digit = Embedding(max_cpv_3digit, 12, name='cpv_3digit_embedding')(input_cpv_3digit)

    flat_cpv_1digit = Flatten()(embedding_cpv_1digit)
    flat_cpv_2digit = Flatten()(embedding_cpv_2digit)
    flat_cpv_3digit = Flatten()(embedding_cpv_3digit)

    concatenated = Concatenate()([flat_cpv_1digit, flat_cpv_2digit, flat_cpv_3digit, input_embeddings])

    # Hidden layer with ReLU activation
    dense_layer = Dense(64, activation='relu')(concatenated)
    
    output_layer = Dense(1)(dense_layer)

    model = Model(inputs=[input_cpv_1digit, input_cpv_2digit, input_cpv_3digit, input_embeddings], outputs=output_layer)
    model.compile(optimizer=adam_optimizer, loss='mean_squared_error')
    
    return model


In [31]:
model = create_model(max_cpv_1digit, max_cpv_2digit, max_cpv_3digit, embedding_dim)





In [32]:
model.fit(
    [X_train_cpv_1digit, X_train_cpv_2digit, X_train_cpv_3digit, X_train_embeddings],
    y_train,
    epochs=100,
    batch_size=64,
    validation_split=0.3
)

Epoch 1/100
...
Epoch 77/100
...

<keras.src.callbacks.History at 0x1c33a7db950>

In [33]:
test_metrics = model.evaluate(
    [X_test_cpv_1digit, X_test_cpv_2digit, X_test_cpv_3digit, X_test_embeddings],
    y_test
)



In [34]:
print(f"Test Loss: {test_metrics}")

Test Loss: 0.036571089178323746


In [35]:
y_pred = model.predict([X_test_cpv_1digit, X_test_cpv_2digit, X_test_cpv_3digit, X_test_embeddings])



In [36]:
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")

Mean Absolute Error (MAE): 0.12112464807671983


In [37]:
r2 = r2_score(y_test, y_pred)
print(f"R-squared (R²): {r2}")

R-squared (R²): 0.8252552764264065


### First validation

In [38]:
corpus_18['CPVcodes_1digit_str'] = corpus_18['CPVcodes_1digit'].astype(str)
corpus_18['CPVcodes_2digit_str'] = corpus_18['CPVcodes_2digit'].astype(str)
corpus_18['CPVcodes_3digit_str'] = corpus_18['CPVcodes_3digit'].astype(str)

In [39]:
X_18_embeddings = np.array(corpus_18['Normalized_Embedding'].tolist())  # Adjusted for corpus_18
X_18_cpv_1digit = corpus_18['CPVcodes_1digit_str'].astype('category').cat.codes.values
X_18_cpv_2digit = corpus_18['CPVcodes_2digit_str'].astype('category').cat.codes.values
X_18_cpv_3digit = corpus_18['CPVcodes_3digit_str'].astype('category').cat.codes.values

In [40]:
y_pred_18 = model.predict([X_18_cpv_1digit, X_18_cpv_2digit, X_18_cpv_3digit, X_18_embeddings])




In [41]:
y_18 = corpus_18['Measurability_mean'].values

In [42]:
mae_18 = mean_absolute_error(y_18, y_pred_18)
r2_18 = r2_score(y_18, y_pred_18)

In [43]:
print(f"Mean Absolute Error (MAE) on corpus_18: {mae_18}")
print(f"R-squared (R²) on corpus_18: {r2_18}")

Mean Absolute Error (MAE) on corpus_18: 0.24719812836468064
R-squared (R²) on corpus_18: 0.4254265930483144


### Unsupervised Domain Adaptation

In [44]:
pseudo_labels = y_pred_18


In [45]:
pseudo_labels_flat = pseudo_labels.flatten()

In [46]:
X_train_combined = np.concatenate([X_train_embeddings, X_18_embeddings], axis=0)
X_cpv_1digit_combined = np.concatenate([X_train_cpv_1digit, X_18_cpv_1digit], axis=0)
X_cpv_2digit_combined = np.concatenate([X_train_cpv_2digit, X_18_cpv_2digit], axis=0)
X_cpv_3digit_combined = np.concatenate([X_train_cpv_3digit, X_18_cpv_3digit], axis=0)
y_combined = np.concatenate([y_train, pseudo_labels_flat], axis=0)

In [47]:
model.fit(
    [X_cpv_1digit_combined, X_cpv_2digit_combined, X_cpv_3digit_combined, X_train_combined],
    y_combined,
    epochs=10,
    batch_size=150,
    validation_split=0.2
)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x1c33b14ed10>
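The step above is a form of pseudo-labelling (self-training): the model's own predictions on the unlabelled target corpus are treated as labels, and the model is trained further on the union of source and target data. The following is a minimal sketch of the same idea on synthetic data. It assumes a scikit-learn `Ridge` regressor as a stand-in for the Keras network, a swapped-in model chosen only to keep the example small, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)

# Synthetic labelled source data and unlabelled, shifted target data
X_source = rng.normal(size=(200, 5))
y_source = X_source @ np.array([0.5, -0.2, 0.1, 0.0, 0.3]) + 3.0
X_target = rng.normal(loc=0.5, size=(80, 5))

# 1. Fit on the labelled source data
base_model = Ridge(alpha=1.0).fit(X_source, y_source)

# 2. Predict pseudo-labels for the unlabelled target data
pseudo_labels_toy = base_model.predict(X_target)

# 3. Refit on source labels plus target pseudo-labels
X_combined = np.concatenate([X_source, X_target], axis=0)
y_combined_toy = np.concatenate([y_source, pseudo_labels_toy], axis=0)
adapted_model = Ridge(alpha=1.0).fit(X_combined, y_combined_toy)
```

Because the pseudo-labels come from the model itself, they carry its errors. The true target labels should therefore be held out and used only for evaluation.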

In [48]:
test_metrics = model.evaluate(
    [X_test_cpv_1digit, X_test_cpv_2digit, X_test_cpv_3digit, X_test_embeddings],
    y_test
)



In [49]:
y_pred_test = model.predict([X_test_cpv_1digit, X_test_cpv_2digit, X_test_cpv_3digit, X_test_embeddings])




In [50]:
mae_test = mean_absolute_error(y_test, y_pred_test)
print(f"Mean Absolute Error (MAE): {mae_test}")

Mean Absolute Error (MAE): 0.11403227651270974


In [48]:
r2 = r2_score(y_test, y_pred_test)
print(f"R-squared (R²): {r2}")

R-squared (R²): 0.8429390543078404


In [55]:
mse_test = mean_squared_error(y_test, y_pred_test)
print(f"Mean Squared Error (MSE): {mse_test}")

Mean Squared Error (MSE): 0.03287017715984365


In [49]:
y_pred_full_corpus = model.predict([X_cpv_1digit, X_cpv_2digit, X_cpv_3digit, X_embeddings])




In [50]:
y_pred_full_corpus_flat = y_pred_full_corpus.flatten()

In [51]:
corpus_42['Predicted_Measurability'] = y_pred_full_corpus_flat

In [52]:
corpus_42['absolute_error'] = np.abs(corpus_42['Measurability_mean'] - corpus_42['Predicted_Measurability'])
corpus_42['prediction_error'] = np.abs(corpus_42['Predicted_Measurability'] - corpus_42['Measurability_mean'])


In [53]:
print(corpus_42['prediction_error'].describe())

count    1351.000000
mean        0.077916
std         0.100207
min         0.000097
25%         0.020120
50%         0.049855
75%         0.097275
max         1.023496
Name: prediction_error, dtype: float64


In [54]:
corpus_42_sorted_by_error = corpus_42.sort_values('absolute_error', ascending=False)

In [55]:
corpus_42_sorted_by_error.head()

Unnamed: 0,doc_id,contract_type_code,Product_60codes,Measurability_mean,Asset_specificity_mean,Manuel_or_not,product_description_binary,CPVcodes_2digit,CPVcodes_3digit,clean_text,Embedding,Normalized_Embedding,Normalized_Embedding_mean,CPVcodes_1digit,CPVcodes_1digit_str,CPVcodes_2digit_str,CPVcodes_3digit_str,Predicted_Measurability,absolute_error,prediction_error
428,185005-2022,4,30,3.723404,2.6,not,Yes,79,799,lokalet skal være placeret på en adresse inden...,"[0.026613215, -0.003674943, 0.011295931, -0.00...","[0.02661321360203879, -0.0036749428069595214, ...","[0.02661321360203879, -0.0036749428069595214, ...",7,7,79,799,2.699908,1.023496,1.023496
634,289554-2022,4,56,3.290698,2.531646,not,Yes,77,773,drift og vedligeholdelse af veje mindre repara...,"[0.053412754, -0.017288446, -0.0029704645, 0.0...","[0.040426417224586265, -0.01308507573230037, -...","[0.02565708171490272, -0.046808362667422866, -...",7,7,77,773,2.4172,0.873498,0.873498
54,0000-009033,3,34,3.053571,1.898305,not,Yes,51,511,bilag 2 betingelser for opsætning af ladestand...,"[0.026239788, -0.012995314, -0.006022693, -0.0...","[0.026239788537120962, -0.012995314266010366, ...","[0.026239788537120962, -0.012995314266010366, ...",5,5,51,511,2.199995,0.853576,0.853576
1127,575306-2021,4,17,2.855556,2.405941,not,Yes,66,665,the one high pressure steam turbines operate a...,"[0.009211341, -0.008301167, -0.0036978791, -0....","[0.009211341215610855, -0.008301167194306314, ...","[0.009211341215610855, -0.008301167194306314, ...",6,6,66,665,2.040549,0.815007,0.815007
945,474183-2022,4,56,3.290698,2.531646,not,Yes,77,773,element 2. græs 2.1 rabatgræs by 2.2 rabatgræs...,"[0.038562726, -0.047106367, -0.012252452, 0.00...","[0.0386379782863127, -0.04719829156510038, -0....","[0.04009370887290179, -0.051011758177237236, -...",7,7,77,773,2.476147,0.814551,0.814551


In [56]:
selected_columns_df = corpus_42_sorted_by_error[['doc_id', 'clean_text', 'contract_type_code', 'Product_60codes','CPVcodes_2digit', 'CPVcodes_3digit', 'Measurability_mean', 'prediction_error', 'Predicted_Measurability']].copy()  # explicit copy avoids SettingWithCopyWarning in the cells below

In [57]:
def truncate_text(text):
    MAX_LENGTH = 30000  # Kept safely below Excel's 32,767-character cell limit
    if isinstance(text, str) and len(text) > MAX_LENGTH:
        return text[:MAX_LENGTH]
    return text

In [58]:
selected_columns_df['clean_text'] = selected_columns_df['clean_text'].apply(truncate_text)

In [59]:
# Format floats with two decimals and a decimal comma (Danish locale) for the export
for col in selected_columns_df.select_dtypes(include=['float', 'float64']).columns:
    selected_columns_df[col] = selected_columns_df[col].apply(lambda x: f'{x:.2f}'.replace('.', ','))

In [60]:
selected_columns_df.to_csv('Corpus_42_measurebility_errors.csv', index=False, encoding='utf-16', sep='\t')

In [61]:
mean_values_df = corpus_42.groupby('Product_60codes').agg({
    'Measurability_mean': 'mean',
    'Predicted_Measurability': 'mean'
}).reset_index()

In [62]:
mean_values_df['difference'] = mean_values_df['Measurability_mean'] - mean_values_df['Predicted_Measurability']

In [63]:
mean_values_df = mean_values_df.sort_values('difference', ascending=False)

In [64]:
mean_values_df.head(18)

Unnamed: 0,Product_60codes,Measurability_mean,Predicted_Measurability,difference
21,30,3.723404,3.476799,0.246605
37,56,3.290698,3.127601,0.163097
24,34,3.053571,3.007499,0.046072
23,32,2.197917,2.157037,0.04088
10,17,2.855556,2.826545,0.029011
7,11,3.161616,3.137539,0.024077
6,10,2.134021,2.120677,0.013344
0,1,3.409639,3.397495,0.012144
14,21,3.037975,3.027104,0.010871
15,22,2.303797,2.295013,0.008784


In [66]:
mean_values_df.to_excel('Corpus_42_measurebility_product_type_errors.xlsx', index=False)

### Validation on corpus_18

In [106]:
y_18 = corpus_18['Measurability_mean'].values

In [107]:
y_pred_18 = model.predict([X_18_cpv_1digit, X_18_cpv_2digit, X_18_cpv_3digit, X_18_embeddings])



In [108]:
mae_18 = mean_absolute_error(y_18, y_pred_18)
r2_18 = r2_score(y_18, y_pred_18)

In [54]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [110]:
mse_18 = mean_squared_error(y_18, y_pred_18)


In [73]:
print(f"Mean Squared Error (MSE) on corpus_18: {mse_18}")
print(f"Mean Absolute Error (MAE) on corpus_18: {mae_18}")
print(f"R^2 Score on corpus_18: {r2_18}")

Mean Squared Error (MSE) on corpus_18: 0.0874355061559907
Mean Absolute Error (MAE) on corpus_18: 0.23997283731439975
R^2 Score on corpus_18: 0.45396359547907983
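The three scores printed above follow from simple formulas. The sketch below recomputes them in plain NumPy, as an illustration rather than the scikit-learn implementation; the function name `regression_metrics` and the toy numbers are hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R^2) for two equally long sequences."""
    # ravel() also accepts predictions of shape (n, 1).
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    mae = np.mean(np.abs(residuals))
    # R^2 compares the model with a constant prediction at the mean of y_true.
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2


mse, mae, r2 = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(mse, mae, r2)
```

Because R^2 is measured against the mean of the evaluated sample, it drops below zero whenever the predictions do worse than that constant baseline.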


In [53]:
corpus_18['Predicted_Measurability_mean'] = y_pred_18.flatten()

In [54]:
# Note: both columns hold the absolute error here, so they are identical in this subsection.
corpus_18['absolute_error'] = np.abs(corpus_18['Measurability_mean'] - corpus_18['Predicted_Measurability_mean'])
corpus_18['prediction_error'] = np.abs(corpus_18['Predicted_Measurability_mean'] - corpus_18['Measurability_mean'])


In [55]:
print(corpus_18['prediction_error'].describe())

count    389.000000
mean       0.239973
std        0.172990
min        0.001352
25%        0.103067
50%        0.210849
75%        0.349444
max        0.759633
Name: prediction_error, dtype: float64


In [56]:
corpus_18_sorted_by_error = corpus_18.sort_values('absolute_error', ascending=False)

In [57]:
corpus_18_sorted_by_error.head()

Unnamed: 0,doc_id,contract_type_code,Product_60codes,Measurability_mean,Asset_specificity_mean,Manuel_or_not,product_description_binary,CPVcodes_2digit,CPVcodes_3digit,clean_text,Embedding,Normalized_Embedding,Normalized_Embedding_mean,CPVcodes_1digit,CPVcodes_1digit_str,CPVcodes_2digit_str,CPVcodes_3digit_str,Predicted_Measurability_mean,absolute_error,prediction_error
18,0277-301019,2,58,2.676768,2.710843,not,Yes,33,330,sale leica cv5030 fully automated glass covers...,"[-0.013762915, 0.0010504794, -0.0016031103, 0....","[-0.013762914777399184, 0.0010504793830095895,...","[-0.013762914777399184, 0.0010504793830095895,...",3,3,33,330,1.917135,0.759633,0.759633
339,611367-2022,4,2,3.291667,2.826087,not,Yes,79,795,systemet er intuitivt ved ”intuitivt” forstås ...,"[-0.0003638284, 0.032611296, -0.0051846746, 0....","[-0.00036382840716924994, 0.0326112966426066, ...","[-0.00036382840716924994, 0.0326112966426066, ...",7,7,79,795,2.543542,0.748125,0.748125
117,193848-2021,2,58,2.676768,2.710843,not,Yes,33,331,sterile technology type perforated outer lengt...,"[0.017060073, -0.022025751, 0.0036557117, 0.04...","[0.01344002836839025, -0.01735201943597193, 0....","[-0.009384797636176211, -0.05608886369094062, ...",3,3,33,331,1.93835,0.738418,0.738418
268,489952-2022,2,58,2.676768,2.710843,not,Yes,39,390,kundens beskrivelse af krav leverandørens løsn...,"[-0.008996954, -0.017590597, -0.008337694, 0.0...","[-0.008996954067939157, -0.017590597132832772,...","[-0.008996954067939157, -0.017590597132832772,...",3,3,39,390,1.963335,0.713433,0.713433
170,282815-2022,2,58,2.676768,2.710843,not,Yes,39,390,kundens beskrivelse af krav leverandørens løsn...,"[-0.0021949317, -0.028377473, -0.005583494, 0....","[-0.002194931641784452, -0.028377472247352376,...","[-0.002194931641784452, -0.028377472247352376,...",3,3,39,390,1.998589,0.678179,0.678179


In [58]:
# .copy() makes the selection an independent frame and avoids SettingWithCopyWarning below
selected_columns_df = corpus_18_sorted_by_error[['doc_id', 'clean_text', 'contract_type_code', 'Product_60codes','CPVcodes_2digit', 'CPVcodes_3digit', 'Measurability_mean', 'prediction_error', 'Predicted_Measurability_mean']].copy()

In [59]:
def truncate_text(text):
    MAX_LENGTH = 30000  # Stays below Excel's limit of 32,767 characters per cell
    if isinstance(text, str) and len(text) > MAX_LENGTH:
        return text[:MAX_LENGTH]
    return text

In [60]:
selected_columns_df['clean_text'] = selected_columns_df['clean_text'].apply(truncate_text)
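Assigning into a column subset of another DataFrame can trigger pandas' SettingWithCopyWarning. A minimal sketch of the usual remedy, taking an explicit `.copy()` of the selection before modifying it, is shown here on a toy frame; the frame and the tiny length limit are hypothetical.

```python
import pandas as pd

MAX_LENGTH = 5  # tiny limit for the demo; the notebook uses 30000


def truncate_text(text, max_length=MAX_LENGTH):
    """Cut strings longer than max_length; leave other values untouched."""
    if isinstance(text, str) and len(text) > max_length:
        return text[:max_length]
    return text


source = pd.DataFrame({
    "doc_id": ["a", "b"],
    "clean_text": ["short", "a much longer text"],
    "extra": [1, 2],
})

# .copy() makes the selection an independent frame, so the assignment
# below modifies only `selected` and leaves `source` untouched.
selected = source[["doc_id", "clean_text"]].copy()
selected["clean_text"] = selected["clean_text"].apply(truncate_text)
print(selected)
```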


In [50]:
corpus_18['Predicted_Measurability_mean'] = y_pred_18.flatten()

In [63]:
mean_values_df = corpus_18.groupby('Product_60codes').agg({
    'Measurability_mean': 'mean',
    'Predicted_Measurability_mean': 'mean'
}).reset_index()

In [64]:
mean_values_df['difference'] = mean_values_df['Measurability_mean'] - mean_values_df['Predicted_Measurability_mean']

In [65]:
mean_values_df = mean_values_df.sort_values('difference', ascending=False)

In [66]:
mean_values_df.head(18)

Unnamed: 0,Product_60codes,Measurability_mean,Predicted_Measurability_mean,difference
0,2,3.291667,2.730762,0.560905
4,16,3.020408,2.461787,0.558621
9,41,2.726027,2.169745,0.556282
2,6,3.450549,3.104434,0.346115
6,29,2.657895,2.331527,0.326368
13,50,2.574257,2.304765,0.269492
17,58,2.676768,2.477731,0.199037
14,52,2.53913,2.414792,0.124338
1,5,2.118644,2.009434,0.10921
11,45,2.029703,1.978689,0.051014


In [67]:
mean_values_df.to_excel('mean_values_18.xlsx', index=False)

### Corpus_over

In [32]:
corpus_over['CPVcodes_2digit'] = corpus_over['CPVcodes_2digit'].astype(str).apply(lambda x: x.zfill(2))


In [33]:
corpus_over['CPVcodes_1digit'] = corpus_over['CPVcodes_2digit'].astype(str).str[0]

In [34]:
corpus_over['CPVcodes_1digit'] = corpus_over['CPVcodes_1digit'].astype(int)


In [38]:
corpus_over['CPVcodes_1digit_str'] = corpus_over['CPVcodes_1digit'].astype(str)
corpus_over['CPVcodes_2digit_str'] = corpus_over['CPVcodes_2digit'].astype(str)
corpus_over['CPVcodes_3digit_str'] = corpus_over['CPVcodes_3digit'].astype(str)

In [39]:
X_over_embeddings = np.array(corpus_over['Normalized_Embedding'].tolist())  # Embeddings for corpus_over
X_over_cpv_1digit = corpus_over['CPVcodes_1digit_str'].astype('category').cat.codes.values
X_over_cpv_2digit = corpus_over['CPVcodes_2digit_str'].astype('category').cat.codes.values
X_over_cpv_3digit = corpus_over['CPVcodes_3digit_str'].astype('category').cat.codes.values
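`.astype('category').cat.codes` numbers the categories found in that particular frame, so the same CPV code can receive a different integer in a different corpus. Whether this matters here depends on how the training inputs were encoded, which this section does not show. The sketch below illustrates the effect and one possible remedy, a shared category list; all names and values are hypothetical.

```python
import pandas as pd

train_codes = pd.Series(["33", "45", "79"])
new_codes = pd.Series(["45", "79", "79"])

# Encoding each series on its own: "45" becomes 1 in one and 0 in the other.
independent_train = train_codes.astype("category").cat.codes.tolist()
independent_new = new_codes.astype("category").cat.codes.tolist()

# Encoding both against one shared, sorted category list keeps codes aligned.
shared = sorted(set(train_codes) | set(new_codes))
dtype = pd.CategoricalDtype(categories=shared)
aligned_train = train_codes.astype(dtype).cat.codes.tolist()
aligned_new = new_codes.astype(dtype).cat.codes.tolist()

print(independent_train, independent_new)
print(aligned_train, aligned_new)
```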

In [40]:
y_pred = model.predict([X_over_cpv_1digit, X_over_cpv_2digit, X_over_cpv_3digit, X_over_embeddings])



In [41]:
corpus_over['Predicted_Measurability_mean'] = y_pred.flatten()

In [44]:
corpus_over.head()

Unnamed: 0,doc_id,contract_type_code,Product_60codes,Measurability_mean,Asset_specificity_mean,Manuel_or_not,product_description_binary,CPVcodes_2digit,CPVcodes_3digit,clean_text,Embedding,Normalized_Embedding,Normalized_Embedding_mean,CPVcodes_1digit,CPVcodes_1digit_str,CPVcodes_2digit_str,CPVcodes_3digit_str,Predicted_Measurability_mean
0,0000-008255,4,482,,,not,Yes,48,482,provide one icon per episode in several forma...,"[-0.0030848256, -0.026598396, -0.021644095, -0...","[-0.0030848255027367153, -0.026598395161363492...","[-0.0030848255027367153, -0.026598395161363492...",4,4,48,482,0.889499
1,0000-008285,1,450,,,not,Yes,45,450,udlægning af slidlag bærelag og bæreslidlag på...,"[0.03165265, -0.018070666, -0.0104349395, 0.01...","[0.031652648614329426, -0.01807066520891331, -...","[0.031652648614329426, -0.01807066520891331, -...",4,4,45,450,3.160149
2,0000-008392,1,451,,,not,Yes,45,451,pos.nr. pkt. bygningsdel 01 1 121 2 131 linief...,"[-0.0045218053, -0.029040039, -0.011053302, 0....","[-0.004521805473328217, -0.029040040113152348,...","[-0.004521805473328217, -0.029040040113152348,...",4,4,45,451,2.918535
3,0000-008393,4,482,,,not,Yes,48,482,task 4 - refurbishment of the interreg podcast...,"[-0.016492937, -0.0014233976, -0.018877864, -0...","[-0.0164929371094576, -0.00142339760944657, -0...","[-0.0164929371094576, -0.00142339760944657, -0...",4,4,48,482,1.971107
4,0000-008394,1,452,,,not,Yes,45,452,arbejder gulvbelægning vægoverflader ny gulvbe...,"[0.007988542, -0.026167195, -0.011535243, 0.03...","[0.007748142082371978, -0.025379743231885573, ...","[0.0010394256856936434, -0.04338538165723106, ...",4,4,45,452,3.12716


In [53]:
group_df = corpus_over.groupby('Product_60codes').agg({
    'Predicted_Measurability_mean': ['mean', 'std', 'count']
}).reset_index()


In [59]:
group_df.to_excel('corpus_over_prediction.xlsx', index=False)
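Passing a list of functions to `.agg` yields two-level column labels, which can be awkward in a spreadsheet. A possible alternative, sketched here on toy data, is named aggregation, which produces flat column names; the toy frame and the output column names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Product_60codes": [450, 450, 482],
    "Predicted_Measurability_mean": [3.0, 3.2, 1.0],
})

# Named aggregation: keyword = output column, value = aggregation function.
summary = (
    toy.groupby("Product_60codes")["Predicted_Measurability_mean"]
    .agg(mean="mean", std="std", count="count")
    .reset_index()
)
print(summary.columns.tolist())
print(summary)
```

Groups with a single document get a missing standard deviation, so the `count` column is worth keeping next to `std`.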

# 3. Asset Specificity Prediction

## 3.1 Default model experiment with cross validation

In [1]:
import pandas as pd
import nltk
nltk.download('punkt')
import re 
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

import os

import json

from nltk.corpus import stopwords
import lemmy
import textract
import pickle
import numpy as np

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\blunds\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Loading data

In [2]:
corpus_42 = pd.read_pickle("corpus_42_final.pkl")

In [3]:
corpus_42.head()

Unnamed: 0,doc_id,contract_type_code,Product_60codes,Measurability_mean,Asset_specificity_mean,Manuel_or_not,product_description_binary,CPVcodes_2digit,CPVcodes_3digit,clean_text,Embedding,Normalized_Embedding,Normalized_Embedding_mean
0,0000-008168,4,60,3.323232,2.450549,not,Yes,71,713,the new programme should provide the opportun...,"[-0.023649858, -0.0044648494, -0.006738333, 0....","[-0.02364985798180248, -0.0044648493965644956,...","[-0.02364985798180248, -0.0044648493965644956,..."
1,0000-008227,4,60,3.323232,2.450549,not,Yes,71,710,sikring af bygningens klimaskærm som fundamen...,"[0.0021429053, -0.020846844, -0.0025144592, 0....","[0.0018015666258628857, -0.017526196050273404,...","[-0.011694183793053732, -0.040396958453703675,..."
2,0000-008241,4,60,3.323232,2.450549,not,Yes,71,712,sikring af bygningens klimaskærm som fundamen...,"[0.019978527, -0.021250192, -0.005847986, -0.0...","[0.01997852606996726, -0.0212501910107692, -0....","[0.01997852606996726, -0.0212501910107692, -0...."
3,0000-008243,4,19,2.87013,2.590164,not,Yes,45,452,entreprisen omfatter almindelig vedligeholdels...,"[0.032049313, -0.03062923, 0.0018029478, -0.01...","[0.03204931125482191, -0.030629228332149552, 0...","[0.03204931125482191, -0.030629228332149552, 0..."
4,0000-008280,1,19,2.87013,2.590164,manuel,Yes,45,452,samarbejdets organisering og proces entreprenø...,"[0.027916875, -0.028474228, -0.009179947, -0.0...","[0.027916875576418063, -0.028474228587926097, ...","[0.027916875576418063, -0.028474228587926097, ..."


In [4]:
corpus_18 = pd.read_pickle("corpus_18_final.pkl")

In [5]:
corpus_42['CPVcodes_2digit'] = corpus_42['CPVcodes_2digit'].astype(str).apply(lambda x: x.zfill(2))


In [6]:
corpus_18['CPVcodes_2digit'] = corpus_18['CPVcodes_2digit'].astype(str).apply(lambda x: x.zfill(2))


In [7]:
corpus_42['CPVcodes_1digit'] = corpus_42['CPVcodes_2digit'].astype(str).str[0]
corpus_18['CPVcodes_1digit'] = corpus_18['CPVcodes_2digit'].astype(str).str[0]

In [8]:
corpus_42['CPVcodes_1digit'] = corpus_42['CPVcodes_1digit'].astype(int)
corpus_18['CPVcodes_1digit'] = corpus_18['CPVcodes_1digit'].astype(int)
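Zero-padding matters because a two-digit CPV division stored as an integer loses its leading zero, so the first character of `9` would be read as `9` instead of `0`. A minimal sketch of the padding and first-digit steps on toy values follows; `.str.zfill(2)` is used here as a vectorised equivalent of the `apply(lambda x: x.zfill(2))` call above, and the variable names are hypothetical.

```python
import pandas as pd

raw = pd.Series([9, 33, 45])  # two-digit CPV divisions stored as integers

two_digit = raw.astype(str).str.zfill(2)   # "9" -> "09"
one_digit = two_digit.str[0].astype(int)   # first character as an integer

print(two_digit.tolist())
print(one_digit.tolist())
```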


In [9]:
corpus_over = pd.read_pickle("corpus_over_final.pkl")

#### Encoding CPVcodes_1digit

In [12]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd


In [13]:
corpus_42['CPVcodes_1digit'] = corpus_42['CPVcodes_2digit'].astype(str).str[0]
corpus_18['CPVcodes_1digit'] = corpus_18['CPVcodes_2digit'].astype(str).str[0]
corpus_over['CPVcodes_1digit'] = corpus_over['CPVcodes_2digit'].astype(str).str[0]

In [14]:
corpus_42['CPVcodes_1digit'] = corpus_42['CPVcodes_1digit'].astype(int)
corpus_18['CPVcodes_1digit'] = corpus_18['CPVcodes_1digit'].astype(int)
corpus_over['CPVcodes_1digit'] = corpus_over['CPVcodes_1digit'].astype(int)

In [15]:
corpus_42['CPVcodes_1digit_str'] = corpus_42['CPVcodes_1digit'].astype(str)
corpus_18['CPVcodes_1digit_str'] = corpus_18['CPVcodes_1digit'].astype(str)
corpus_over['CPVcodes_1digit_str'] = corpus_over['CPVcodes_1digit'].astype(str)

In [16]:
all_cpv_codes_str = pd.concat([corpus_42['CPVcodes_1digit_str'], corpus_18['CPVcodes_1digit_str'], corpus_over['CPVcodes_1digit_str']]).unique()

In [17]:
encoder = OneHotEncoder(categories=[all_cpv_codes_str], handle_unknown='ignore')

In [18]:
cpv_42_encoded_1 = encoder.fit_transform(corpus_42[['CPVcodes_1digit_str']]).toarray()

In [19]:
cpv_18_encoded_1 = encoder.transform(corpus_18[['CPVcodes_1digit_str']]).toarray()

In [20]:
cpv_over_encoded_1 = encoder.transform(corpus_over[['CPVcodes_1digit_str']]).toarray()

In [21]:
cpv_42_encoded_df_1 = pd.DataFrame(cpv_42_encoded_1, index=corpus_42.index)

In [22]:
cpv_18_encoded_df_1 = pd.DataFrame(cpv_18_encoded_1, index=corpus_18.index)

In [23]:
cpv_over_encoded_df_1 = pd.DataFrame(cpv_over_encoded_1, index=corpus_over.index)
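Building the category list from all three corpora gives every encoded matrix the same column layout, even when a code is absent from the training corpus. The sketch below illustrates the idea with pandas (`CategoricalDtype` plus `get_dummies`) instead of scikit-learn's `OneHotEncoder`; the toy series are hypothetical.

```python
import pandas as pd

train = pd.Series(["3", "4", "7"], name="CPVcodes_1digit_str")
valid = pd.Series(["4", "9"], name="CPVcodes_1digit_str")

# One category list shared by both sets fixes the column layout.
categories = pd.concat([train, valid]).unique().tolist()
dtype = pd.CategoricalDtype(categories=categories)

train_onehot = pd.get_dummies(train.astype(dtype), dtype=float)
valid_onehot = pd.get_dummies(valid.astype(dtype), dtype=float)

print(train_onehot.columns.tolist())
print(valid_onehot.to_numpy())
```

A column that is all zeros in the training matrix (here `"9"`) carries no information for the fitted model, which is worth keeping in mind when reading validation scores.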

#### Encoding CPVcodes_2digit

In [24]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd


In [25]:
corpus_42['CPVcodes_2digit_str'] = corpus_42['CPVcodes_2digit'].astype(str)
corpus_18['CPVcodes_2digit_str'] = corpus_18['CPVcodes_2digit'].astype(str)
corpus_over['CPVcodes_2digit_str'] = corpus_over['CPVcodes_2digit'].astype(str)


In [26]:
all_cpv_codes_str = pd.concat([corpus_42['CPVcodes_2digit_str'], corpus_18['CPVcodes_2digit_str'], corpus_over['CPVcodes_2digit_str']]).unique()

In [27]:
encoder = OneHotEncoder(categories=[all_cpv_codes_str], handle_unknown='ignore')

In [28]:
cpv_42_encoded_2 = encoder.fit_transform(corpus_42[['CPVcodes_2digit_str']]).toarray()

In [29]:
cpv_18_encoded_2 = encoder.transform(corpus_18[['CPVcodes_2digit_str']]).toarray()

In [30]:
cpv_over_encoded_2 = encoder.transform(corpus_over[['CPVcodes_2digit_str']]).toarray()

In [31]:
cpv_42_encoded_df_2 = pd.DataFrame(cpv_42_encoded_2, index=corpus_42.index)

In [32]:
cpv_18_encoded_df_2 = pd.DataFrame(cpv_18_encoded_2, index=corpus_18.index)

In [33]:
cpv_over_encoded_df_2 = pd.DataFrame(cpv_over_encoded_2, index=corpus_over.index)

#### Encoding CPVcodes_3digit

In [34]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd


In [35]:
corpus_42['CPVcodes_3digit_str'] = corpus_42['CPVcodes_3digit'].astype(str)
corpus_18['CPVcodes_3digit_str'] = corpus_18['CPVcodes_3digit'].astype(str)
corpus_over['CPVcodes_3digit_str'] = corpus_over['CPVcodes_3digit'].astype(str)


In [36]:
all_cpv_codes_str = pd.concat([corpus_42['CPVcodes_3digit_str'], corpus_18['CPVcodes_3digit_str'], corpus_over['CPVcodes_3digit_str']]).unique()

In [37]:
encoder = OneHotEncoder(categories=[all_cpv_codes_str], handle_unknown='ignore')

In [38]:
cpv_42_encoded_3 = encoder.fit_transform(corpus_42[['CPVcodes_3digit_str']]).toarray()

In [39]:
cpv_18_encoded_3 = encoder.transform(corpus_18[['CPVcodes_3digit_str']]).toarray()

In [40]:
cpv_over_encoded_3 = encoder.transform(corpus_over[['CPVcodes_3digit_str']]).toarray()

In [41]:
cpv_42_encoded_df_3 = pd.DataFrame(cpv_42_encoded_3, index=corpus_42.index)

In [42]:
cpv_18_encoded_df_3 = pd.DataFrame(cpv_18_encoded_3, index=corpus_18.index)

In [43]:
cpv_over_encoded_df_3 = pd.DataFrame(cpv_over_encoded_3, index=corpus_over.index)

# Linear Regression

In [66]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score

In [68]:
embeddings_df = pd.DataFrame(corpus_42['Normalized_Embedding'].tolist())


In [69]:
data_with_dummies = pd.concat([embeddings_df, cpv_42_encoded_df_3, cpv_42_encoded_df_2, cpv_42_encoded_df_1], axis=1)

In [70]:
data = data_with_dummies.join(corpus_42['Asset_specificity_mean'])

In [71]:
data.columns = data.columns.astype(str)

In [72]:
X = data.drop('Asset_specificity_mean', axis=1)
y = data['Asset_specificity_mean']

In [73]:
model = LinearRegression()

In [74]:
mse_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
mae_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
r2_scores = cross_val_score(model, X, y, cv=5, scoring='r2')

In [76]:
print("Mean MSE:", -mse_scores.mean())
print("Mean MAE:", -mae_scores.mean())
print("Mean R2:", r2_scores.mean())

Mean MSE: 0.026158301481645153
Mean MAE: 0.10987724421265117
Mean R2: 0.7748383532822506
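With an integer `cv` and a regressor, `cross_val_score` splits the rows into consecutive, unshuffled folds and reports one score per fold. The sketch below spells that procedure out with NumPy least squares standing in for `LinearRegression`; the function `kfold_mse` and the synthetic data are hypothetical.

```python
import numpy as np


def kfold_mse(X, y, k=5):
    """Mean squared error per fold for ordinary least squares, unshuffled folds."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    folds = np.array_split(np.arange(len(y)), k)
    scores = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        # Add an intercept column and fit by least squares on the training folds.
        A_train = np.column_stack([np.ones(train_mask.sum()), X[train_mask]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_mask], rcond=None)
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        scores.append(np.mean((y[test_idx] - A_test @ coef) ** 2))
    return np.array(scores)


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0])  # noiseless linear target
fold_mse = kfold_mse(X_demo, y_demo, k=5)
print(fold_mse.mean())
```

Since the folds follow row order, documents that sit next to each other in the corpus end up in the same fold unless the data is shuffled first.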


In [77]:
model.fit(X, y)


In [78]:
embeddings_df_18 = pd.DataFrame(corpus_18['Normalized_Embedding'].tolist())


In [79]:
data_with_dummies_18 = pd.concat([embeddings_df_18, cpv_18_encoded_df_3, cpv_18_encoded_df_2, cpv_18_encoded_df_1], axis=1)

In [80]:
data_18 = data_with_dummies_18.join(corpus_18['Asset_specificity_mean'])

In [81]:
data_18.columns = data_18.columns.astype(str)

In [82]:
X_18 = data_18.drop('Asset_specificity_mean', axis=1)

In [83]:
y_18 = data_18['Asset_specificity_mean']

In [84]:
y_pred_18 = model.predict(X_18)

In [85]:
mse_18 = mean_squared_error(y_18, y_pred_18)
mae_18 = mean_absolute_error(y_18, y_pred_18)
r2_18 = r2_score(y_18, y_pred_18)

In [86]:
print(f"Mean Squared Error (MSE) on corpus_18: {mse_18}")
print(f"Mean Absolute Error (MAE) on corpus_18: {mae_18}")
print(f"R^2 Score on corpus_18: {r2_18}")

Mean Squared Error (MSE) on corpus_18: 0.14785461804320232
Mean Absolute Error (MAE) on corpus_18: 0.28483829761606055
R^2 Score on corpus_18: -0.017404111784969478


# Random Forest

In [68]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score

In [88]:
embeddings_df = pd.DataFrame(corpus_42['Normalized_Embedding'].tolist())


In [89]:
data_with_dummies = pd.concat([embeddings_df, cpv_42_encoded_df_3, cpv_42_encoded_df_2, cpv_42_encoded_df_1], axis=1)

In [90]:
data = data_with_dummies.join(corpus_42['Asset_specificity_mean'])

In [91]:
data.columns = data.columns.astype(str)

In [92]:
X = data.drop('Asset_specificity_mean', axis=1)
y = data['Asset_specificity_mean']

In [93]:
model = RandomForestRegressor(n_estimators=100, random_state=42)

In [94]:
mse_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
mae_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
r2_scores = cross_val_score(model, X, y, cv=5, scoring='r2')

In [95]:
print("Mean MSE:", -mse_scores.mean())
print("Mean MAE:", -mae_scores.mean())
print("Mean R2:", r2_scores.mean())

Mean MSE: 0.022757576091190494
Mean MAE: 0.08907608159472524
Mean R2: 0.8052676739143703


In [96]:
model.fit(X, y)


In [97]:
embeddings_df_18 = pd.DataFrame(corpus_18['Normalized_Embedding'].tolist())


In [98]:
data_with_dummies_18 = pd.concat([embeddings_df_18, cpv_18_encoded_df_3, cpv_18_encoded_df_2, cpv_18_encoded_df_1], axis=1)

In [99]:
data_18 = data_with_dummies_18.join(corpus_18['Asset_specificity_mean'])

In [100]:
data_18.columns = data_18.columns.astype(str)

In [101]:
X_18 = data_18.drop('Asset_specificity_mean', axis=1)

In [102]:
y_18 = data_18['Asset_specificity_mean']

In [103]:
y_pred_18 = model.predict(X_18)

In [104]:
mse_18 = mean_squared_error(y_18, y_pred_18)
mae_18 = mean_absolute_error(y_18, y_pred_18)
r2_18 = r2_score(y_18, y_pred_18)

In [105]:
print(f"Mean Squared Error (MSE) on corpus_18: {mse_18}")
print(f"Mean Absolute Error (MAE) on corpus_18: {mae_18}")
print(f"R^2 Score on corpus_18: {r2_18}")

Mean Squared Error (MSE) on corpus_18: 0.1364487083701588
Mean Absolute Error (MAE) on corpus_18: 0.3222243386632392
R^2 Score on corpus_18: 0.06108122437552632


# SVM 

In [44]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [45]:
embeddings_df = pd.DataFrame(corpus_42['Normalized_Embedding'].tolist())


In [46]:
data_with_dummies = pd.concat([embeddings_df, cpv_42_encoded_df_3, cpv_42_encoded_df_2, cpv_42_encoded_df_1], axis=1)

In [47]:
data = data_with_dummies.join(corpus_42['Asset_specificity_mean'])

In [48]:
data.columns = data.columns.astype(str)

In [49]:
X = data.drop('Asset_specificity_mean', axis=1)
y = data['Asset_specificity_mean']

In [50]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [51]:
model = SVR()

In [52]:
model.fit(X_train, y_train)


In [53]:
y_pred = model.predict(X_test)

In [54]:
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

In [55]:
print(f'Mean Squared Error (MSE): {mse}')
print(f'Mean Absolute Error (MAE): {mae}')
print(f'R^2 Score: {r2}')

Mean Squared Error (MSE): 0.023209407657618635
Mean Absolute Error (MAE): 0.10166470944265969
R^2 Score: 0.8195858621186136


In [56]:
embeddings_df_18 = pd.DataFrame(corpus_18['Normalized_Embedding'].tolist())


In [57]:
data_with_dummies_18 = pd.concat([embeddings_df_18, cpv_18_encoded_df_3, cpv_18_encoded_df_2, cpv_18_encoded_df_1], axis=1)

In [58]:
data_18 = data_with_dummies_18.join(corpus_18['Asset_specificity_mean'])

In [59]:
data_18.columns = data_18.columns.astype(str)

In [60]:
X_18 = data_18.drop('Asset_specificity_mean', axis=1)

In [61]:
y_18 = data_18['Asset_specificity_mean']

In [62]:
y_pred_18 = model.predict(X_18)

In [63]:
mse_18 = mean_squared_error(y_18, y_pred_18)
mae_18 = mean_absolute_error(y_18, y_pred_18)
r2_18 = r2_score(y_18, y_pred_18)

In [64]:
print(f"Mean Squared Error (MSE) on corpus_18: {mse_18}")
print(f"Mean Absolute Error (MAE) on corpus_18: {mae_18}")
print(f"R^2 Score on corpus_18: {r2_18}")

Mean Squared Error (MSE) on corpus_18: 0.11970270462348415
Mean Absolute Error (MAE) on corpus_18: 0.2797758374608025
R^2 Score on corpus_18: 0.17631234324971023


# Neural Network

In [136]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

import numpy as np
import tensorflow as tf
import random

In [137]:
np.random.seed(42)
random.seed(42)
tf.random.set_seed(42)




In [138]:
embeddings_df = pd.DataFrame(corpus_42['Normalized_Embedding'].tolist())


In [139]:
data_with_dummies = pd.concat([embeddings_df, cpv_42_encoded_df_3, cpv_42_encoded_df_2, cpv_42_encoded_df_1], axis=1)

In [140]:
data = data_with_dummies.join(corpus_42['Asset_specificity_mean'])

In [141]:
data.columns = data.columns.astype(str)

In [142]:
X = data.drop('Asset_specificity_mean', axis=1)
y = data['Asset_specificity_mean']

In [143]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [144]:
model = MLPRegressor(hidden_layer_sizes=(100,), activation='logistic', solver='adam', max_iter=1000, random_state=42)

In [145]:
model.fit(X_train, y_train)


In [146]:
y_pred = model.predict(X_test)

In [147]:
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

In [148]:
# Normalize

In [149]:
print(f'Mean Squared Error (MSE): {mse}')
print(f'Mean Absolute Error (MAE): {mae}')
print(f'R^2 Score: {r2}')

Mean Squared Error (MSE): 0.025700704103117714
Mean Absolute Error (MAE): 0.10402314555441827
R^2 Score: 0.800220219227071
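`StandardScaler` is imported in this subsection but not applied, and the `# Normalize` cell is empty. If feature scaling were added, the statistics would normally be estimated on the training split only and then reused for any other data. The sketch below shows that pattern in plain NumPy rather than with `StandardScaler`; the helper names and toy arrays are hypothetical.

```python
import numpy as np


def fit_standardizer(X_train):
    """Column means and standard deviations estimated on training data only."""
    X_train = np.asarray(X_train, dtype=float)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0  # leave constant columns (e.g. unused dummies) unscaled
    return mean, std


def apply_standardizer(X, mean, std):
    """Scale any data set with statistics learned from the training set."""
    return (np.asarray(X, dtype=float) - mean) / std


X_train_demo = np.array([[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]])
X_test_demo = np.array([[3.0, 0.0], [7.0, 1.0]])

mean, std = fit_standardizer(X_train_demo)
X_train_scaled = apply_standardizer(X_train_demo, mean, std)
X_test_scaled = apply_standardizer(X_test_demo, mean, std)
print(X_test_scaled)
```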


In [150]:
embeddings_df_18 = pd.DataFrame(corpus_18['Normalized_Embedding'].tolist())

In [151]:
data_with_dummies_18 = pd.concat([embeddings_df_18, cpv_18_encoded_df_3, cpv_18_encoded_df_2, cpv_18_encoded_df_1], axis=1)

In [152]:
data_18 = data_with_dummies_18.join(corpus_18['Asset_specificity_mean'])

In [153]:
data_18.columns = data_18.columns.astype(str)

In [154]:
X_18 = data_18.drop('Asset_specificity_mean', axis=1)

In [155]:
y_18 = data_18['Asset_specificity_mean']

In [156]:
y_pred_18 = model.predict(X_18)

In [157]:
mse_18 = mean_squared_error(y_18, y_pred_18)
mae_18 = mean_absolute_error(y_18, y_pred_18)
r2_18 = r2_score(y_18, y_pred_18)

In [158]:
print(f"Mean Squared Error (MSE) on corpus_18: {mse_18}")
print(f"Mean Absolute Error (MAE) on corpus_18: {mae_18}")
print(f"R^2 Score on corpus_18: {r2_18}")

Mean Squared Error (MSE) on corpus_18: 0.110173243158986
Mean Absolute Error (MAE) on corpus_18: 0.2341268871682054
R^2 Score on corpus_18: 0.241885630072878


## 3.2 Support Vector Machine adjustment

In [36]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [37]:
embeddings_df = pd.DataFrame(corpus_42['Normalized_Embedding'].tolist())

In [38]:
data_with_dummies = pd.concat([embeddings_df, cpv_42_encoded_df_2, cpv_42_encoded_df_1, cpv_42_encoded_df_3], axis=1)

In [39]:
data = data_with_dummies.join(corpus_42['Asset_specificity_mean'])

In [40]:
data.columns = data.columns.astype(str)

In [41]:
X = data.drop('Asset_specificity_mean', axis=1)
y = data['Asset_specificity_mean']


In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [44]:
model_svm_linear = SVR(kernel='linear')


In [49]:
model_svm_linear.fit(X_train, y_train)


In [54]:
y_pred_svm_linear = model_svm_linear.predict(X_test)

In [59]:
mse_svm_linear = mean_squared_error(y_test, y_pred_svm_linear)
mae_svm_linear = mean_absolute_error(y_test, y_pred_svm_linear)
r2_svm_linear = r2_score(y_test, y_pred_svm_linear)

In [66]:
print(f'Mean Squared Error (MSE): {mse_svm_linear}')
print(f'Mean Absolute Error (MAE): {mae_svm_linear}')
print(f'R^2 Score: {r2_svm_linear}')

Mean Squared Error (MSE): 0.023111866290153712
Mean Absolute Error (MAE): 0.09779492867575047
R^2 Score: 0.8203440823187392


In [73]:
y_pred_full = model_svm_linear.predict(X)

In [74]:
corpus_42['predicted_Asset_specificity_mean'] = y_pred_full

In [75]:
# Note: both columns hold the absolute error here, so they are identical in this subsection.
corpus_42['absolute_error'] = np.abs(corpus_42['Asset_specificity_mean'] - corpus_42['predicted_Asset_specificity_mean'])
corpus_42['prediction_error'] = np.abs(corpus_42['predicted_Asset_specificity_mean'] - corpus_42['Asset_specificity_mean'])


In [76]:
print(corpus_42['prediction_error'].describe())

count    1351.000000
mean        0.078683
std         0.084745
min         0.000074
25%         0.031436
50%         0.068363
75%         0.099909
max         1.473202
Name: prediction_error, dtype: float64


In [77]:
corpus_42_sorted_by_error = corpus_42.sort_values('absolute_error', ascending=False)

In [78]:
corpus_42_sorted_by_error.tail()

Unnamed: 0,doc_id,contract_type_code,Product_60codes,Measurability_mean,Asset_specificity_mean,Manuel_or_not,product_description_binary,CPVcodes_2digit,CPVcodes_3digit,clean_text,Embedding,Normalized_Embedding,Normalized_Embedding_mean,CPVcodes_1digit,CPVcodes_1digit_str,CPVcodes_2digit_str,CPVcodes_3digit_str,predicted_Asset_specificity_mean,absolute_error,prediction_error
1283,662038-2021,4,17,2.855556,2.405941,not,Yes,66,660,netbanksløsning leverandøren skal som en del a...,"[0.014078571, -0.020258812, -0.0054351916, 0.0...","[0.014078570938344082, -0.02025881191127824, -...","[0.014078570938344082, -0.02025881191127824, -...",6,6,66,660,2.40538,0.000561,0.000561
916,462372-2022,4,17,2.855556,2.405941,not,Yes,66,665,m policens opbygning 3. tilbudsgiver skal udst...,"[0.021279745, -0.03408466, -0.008323194, 9.207...","[0.02127974556951182, -0.03408466091221097, -0...","[0.02127974556951182, -0.03408466091221097, -0...",6,6,66,665,2.405388,0.000553,0.000553
284,102634-2021,4,55,2.52,2.65,not,Yes,50,502,styreskabe og andre skabe indeholdende styreud...,"[0.038227268, -0.017627832, -0.008707555, 0.02...","[0.030997435407947013, -0.014293921914643268, ...","[0.011595606847068007, -0.04542482845721019, -...",5,5,50,502,2.649496,0.000504,0.000504
1192,614125-2021,4,60,3.323232,2.450549,not,Yes,71,710,hvordan gør vi det? undervisningen bærer præg ...,"[0.012553293, -0.042231217, -0.014644184, -0.0...","[0.012553292730098958, -0.0422312160920112, -0...","[0.012553292730098958, -0.0422312160920112, -0...",7,7,71,710,2.450829,0.00028,0.00028
935,471766-2021,4,47,2.669725,2.521277,not,Yes,55,555,1 regler på området indholdet af næringsstoffe...,"[0.00805352, -0.012468761, -0.0059803734, 0.02...","[0.008053520382908515, -0.012468761592833291, ...","[0.008053520382908515, -0.012468761592833291, ...",5,5,55,555,2.521351,7.4e-05,7.4e-05


In [79]:
# .copy() makes the selection an independent frame and avoids SettingWithCopyWarning below
selected_columns_df = corpus_42_sorted_by_error[['doc_id', 'clean_text', 'contract_type_code', 'Product_60codes','CPVcodes_2digit', 'CPVcodes_3digit', 'Asset_specificity_mean', 'prediction_error', 'predicted_Asset_specificity_mean']].copy()

In [80]:
def truncate_text(text):
    MAX_LENGTH = 30000  # Stays below Excel's limit of 32,767 characters per cell
    if isinstance(text, str) and len(text) > MAX_LENGTH:
        return text[:MAX_LENGTH]
    return text

In [81]:
selected_columns_df['clean_text'] = selected_columns_df['clean_text'].apply(truncate_text)

In [82]:
# Format floats as strings with two decimals and a decimal comma for the export
for col in selected_columns_df.select_dtypes(include=['float', 'float64']).columns:
    selected_columns_df[col] = selected_columns_df[col].apply(lambda x: f'{x:.2f}'.replace('.', ','))

In [83]:
selected_columns_df.to_csv('Corpus_42_asset_errors.csv', index=False, encoding='utf-16', sep='\t')
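The export above turns floats into strings with a decimal comma before writing a tab-separated file. As a possible alternative, `to_csv` accepts `float_format` and `decimal` arguments, which keeps the columns numeric in the DataFrame. The sketch below writes a toy frame to an in-memory buffer; the frame and its values are hypothetical.

```python
import io

import pandas as pd

toy = pd.DataFrame({"doc_id": ["a", "b"], "prediction_error": [0.07868, 1.4732]})

buffer = io.StringIO()
# float_format rounds to two decimals; decimal="," swaps the decimal separator.
toy.to_csv(buffer, index=False, sep="\t", float_format="%.2f", decimal=",")
lines = buffer.getvalue().splitlines()
print(lines)
```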

In [84]:
mean_values_df_42 = corpus_42.groupby('Product_60codes').agg({
    'Asset_specificity_mean': 'mean',
    'predicted_Asset_specificity_mean': 'mean'
}).reset_index()

In [86]:
mean_values_df_42['difference'] = mean_values_df_42['Asset_specificity_mean'] - mean_values_df_42['predicted_Asset_specificity_mean']

In [88]:
mean_values_df_42 = mean_values_df_42.sort_values('difference', ascending=False)

In [89]:
mean_values_df_42.tail(20)

Unnamed: 0,Product_60codes,Asset_specificity_mean,predicted_Asset_specificity_mean,difference
10,17,2.405941,2.406935,-0.000994
3,7,2.858824,2.860839,-0.002015
27,38,2.838235,2.844409,-0.006174
39,60,2.450549,2.461378,-0.010829
11,18,2.346535,2.370006,-0.023471
29,42,2.176471,2.214353,-0.037882
33,49,1.830189,1.871746,-0.041557
38,59,1.645161,1.689508,-0.044347
17,24,2.329268,2.384654,-0.055386
13,20,1.823009,1.879114,-0.056105


In [90]:
mean_values_df_42.to_excel('Corpus_42_asset_product_type_errors.xlsx', index=False)

### Validation on corpus_18

In [98]:
embeddings_df_18 = pd.DataFrame(corpus_18['Normalized_Embedding'].tolist())

In [99]:
data_with_dummies_18 = pd.concat([embeddings_df_18, cpv_18_encoded_df_2, cpv_18_encoded_df_1, cpv_18_encoded_df_3], axis=1)

In [100]:
data_18 = data_with_dummies_18.join(corpus_18['Asset_specificity_mean'])

In [101]:
data_18.columns = data_18.columns.astype(str)

In [102]:
X_18 = data_18.drop('Asset_specificity_mean', axis=1)

In [103]:
y_18 = data_18['Asset_specificity_mean']

In [105]:
y_pred_18_svm_linear = model_svm_linear.predict(X_18)

In [111]:
mse_18_svm_linear = mean_squared_error(y_18, y_pred_18_svm_linear)
mae_18_svm_linear = mean_absolute_error(y_18, y_pred_18_svm_linear)
r2_18_svm_linear = r2_score(y_18, y_pred_18_svm_linear)

In [112]:
print(f"Mean Squared Error (MSE) on corpus_18: {mse_18_svm_linear}")
print(f"Mean Absolute Error (MAE) on corpus_18: {mae_18_svm_linear}")
print(f"R^2 Score on corpus_18: {r2_18_svm_linear}")

Mean Squared Error (MSE) on corpus_18: 0.11751731631562982
Mean Absolute Error (MAE) on corpus_18: 0.24645708996469282
R^2 Score on corpus_18: 0.1913502438556156


#### Error 

In [119]:
corpus_18['predicted_Asset_specificity_mean'] = y_pred_18_svm_linear

In [120]:
corpus_18['absolute_error'] = np.abs(corpus_18['Asset_specificity_mean'] - corpus_18['predicted_Asset_specificity_mean'])
corpus_18['prediction_error'] = corpus_18['predicted_Asset_specificity_mean'] - corpus_18['Asset_specificity_mean']

In [121]:
print(corpus_18['prediction_error'].describe())

count    389.000000
mean      -0.052393
std        0.339217
min       -1.115597
25%       -0.249246
50%       -0.070784
75%        0.051185
max        1.171295
Name: prediction_error, dtype: float64
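
The signed error here is defined as prediction minus observed value, so its negative mean (about -0.05) points to slight underprediction on average, while the absolute error measures the size of the miss regardless of direction. A minimal sketch of the distinction, using made-up numbers and the hypothetical column names `observed` and `predicted`:

```python
import pandas as pd

# Made-up values for illustration; not the notebook's data
df = pd.DataFrame({
    'observed':  [3.0, 2.0, 4.0, 1.5],
    'predicted': [2.5, 2.2, 3.0, 1.6],
})

# Signed error: negative means the model predicted too low
df['prediction_error'] = df['predicted'] - df['observed']
# Absolute error: size of the miss, direction ignored
df['absolute_error'] = df['prediction_error'].abs()

mean_signed = df['prediction_error'].mean()
mean_absolute = df['absolute_error'].mean()
print(f"mean signed error:   {mean_signed:.3f}")
print(f"mean absolute error: {mean_absolute:.3f}")
```

Positive and negative misses cancel in the signed mean, which is why it can be close to zero even when the mean absolute error is not.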


In [122]:
corpus_18_sorted_by_error = corpus_18.sort_values('absolute_error', ascending=False)

In [123]:
corpus_18_sorted_by_error.head()

Unnamed: 0,doc_id,contract_type_code,Product_60codes,Measurability_mean,Asset_specificity_mean,Manuel_or_not,product_description_binary,CPVcodes_2digit,CPVcodes_3digit,clean_text,Embedding,Normalized_Embedding,Normalized_Embedding_mean,CPVcodes_1digit,CPVcodes_1digit_str,CPVcodes_2digit_str,CPVcodes_3digit_str,predicted_Asset_specificity_mean,absolute_error,prediction_error
194,336016-2021,2,41,2.726027,1.309524,not,Yes,18,185,gavekassen er ganske afgørende i leverancen. g...,"[-0.0065661776, 0.0074800667, -0.010836113, 0....","[-0.006566177414315129, 0.00748006648847127, -...","[-0.006566177414315129, 0.00748006648847127, -...",1,1,18,185,2.480819,1.171295,1.171295
207,343481-2022,2,41,2.726027,1.309524,not,Yes,18,185,2.4 gavekassen gavekassen er ganske afgørende ...,"[-0.0140284365, 0.015935063, -0.010284936, 0.0...","[-0.014028436397875703, 0.01593506288399583, -...","[-0.014028436397875703, 0.01593506288399583, -...",1,1,18,185,2.453526,1.144002,1.144002
64,095342-2021,4,16,3.020408,3.770833,not,Yes,72,725,bilag 1b – kundens it-miljø aarhus kommune 25....,"[0.0018933702, -0.047902785, -0.01409423, 0.01...","[0.0018797690612758979, -0.04755867246244351, ...","[-0.006511076749262017, -0.052191325953563995,...",7,7,72,725,2.655236,1.115597,-1.115597
14,0273-000984,4,16,3.020408,3.770833,not,Yes,72,725,leverandøren skal til enhver tid have systemer...,"[-0.012033573, -0.02152224, -0.0032734876, 0.0...","[-0.01203357273654089, -0.02152223952879912, -...","[-0.01203357273654089, -0.02152223952879912, -...",7,7,72,725,2.686497,1.084336,-1.084336
213,359110-2022,4,16,3.020408,3.770833,not,Yes,72,725,generelle krav til funktionalitet overordnede ...,"[0.050823994, -0.027748182, -0.0058391406, 0.0...","[0.048764672670076815, -0.02662386219429582, -...","[0.022834313132002428, -0.034260265824075035, ...",7,7,72,725,2.718036,1.052797,-1.052797


In [124]:
selected_columns_df = corpus_18_sorted_by_error[['doc_id', 'clean_text', 'contract_type_code', 'Product_60codes','CPVcodes_2digit', 'CPVcodes_3digit', 'Asset_specificity_mean', 'prediction_error', 'predicted_Asset_specificity_mean']].copy()  # .copy() avoids SettingWithCopyWarning in the cells below

In [125]:
def truncate_text(text):
    MAX_LENGTH = 30000  # Safety margin below Excel's 32,767-character cell limit
    if isinstance(text, str) and len(text) > MAX_LENGTH:
        return text[:MAX_LENGTH]
    return text

In [126]:
selected_columns_df['clean_text'] = selected_columns_df['clean_text'].apply(truncate_text)

In [127]:
# Format floats with two decimals and a decimal comma before export
for col in selected_columns_df.select_dtypes(include=['float', 'float64']).columns:
    selected_columns_df[col] = selected_columns_df[col].apply(lambda x: f'{x:.2f}'.replace('.', ','))

In [128]:
selected_columns_df.to_csv('Corpus_18_asset_errors.csv', index=False, encoding='utf-16', sep='\t')
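
As an alternative to converting the float columns to strings before export, pandas can apply the two-decimal, decimal-comma formatting at write time through the `float_format` and `decimal` arguments of `to_csv`, which leaves the DataFrame itself numeric. A small sketch with made-up rows (the frame `demo_df` is hypothetical):

```python
import io
import pandas as pd

# Made-up rows for illustration; not the notebook's data
demo_df = pd.DataFrame({
    'doc_id': ['A-1', 'B-2'],
    'prediction_error': [1.171295, -1.115597],
})

buffer = io.StringIO()
# float_format rounds to two decimals; decimal=',' swaps the decimal separator
demo_df.to_csv(buffer, index=False, sep='\t', float_format='%.2f', decimal=',')
csv_text = buffer.getvalue()
print(csv_text)
```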

### Mean prediction by product type

In [129]:
mean_values_df = corpus_18.groupby('Product_60codes').agg({
    'Asset_specificity_mean': 'mean',
    'predicted_Asset_specificity_mean': 'mean'
}).reset_index()

In [130]:
mean_values_df['difference'] = mean_values_df['Asset_specificity_mean'] - mean_values_df['predicted_Asset_specificity_mean']

In [131]:
mean_values_df = mean_values_df.sort_values('difference', ascending=False)

In [132]:
mean_values_df.head(18)

Unnamed: 0,Product_60codes,Asset_specificity_mean,predicted_Asset_specificity_mean,difference
4,16,3.770833,2.68659,1.084243
10,44,2.72,2.179209,0.540791
13,50,2.716049,2.267415,0.448634
0,2,2.826087,2.39991,0.426177
2,6,2.961039,2.595234,0.365805
17,58,2.710843,2.532776,0.178067
3,15,2.722222,2.621519,0.100703
8,40,2.985075,2.887719,0.097356
5,28,2.744681,2.668794,0.075887
14,52,2.2,2.300714,-0.100714


In [133]:
mean_values_df.to_excel('Corpus_18_asset_product_type_errors.xlsx', index=False)

## 3.3 Neural Network adjustment

In [1]:
import pandas as pd
import nltk
nltk.download('punkt')
import re 
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

import os

import json

from nltk.corpus import stopwords
import lemmy
import textract
import pickle
import numpy as np

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\blunds\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Loading data

In [2]:
corpus_42 = pd.read_pickle("corpus_42_final.pkl")

In [3]:
corpus_18 = pd.read_pickle("corpus_18_final.pkl")

In [4]:
corpus_over = pd.read_pickle("corpus_over_final.pkl")

In [7]:
corpus_42['CPVcodes_2digit'] = corpus_42['CPVcodes_2digit'].astype(str).apply(lambda x: x.zfill(2))


In [8]:
corpus_18['CPVcodes_2digit'] = corpus_18['CPVcodes_2digit'].astype(str).apply(lambda x: x.zfill(2))


In [9]:
corpus_over['CPVcodes_2digit'] = corpus_over['CPVcodes_2digit'].astype(str).apply(lambda x: x.zfill(2))


In [10]:
corpus_42['CPVcodes_1digit'] = corpus_42['CPVcodes_2digit'].astype(str).str[0]
corpus_18['CPVcodes_1digit'] = corpus_18['CPVcodes_2digit'].astype(str).str[0]
corpus_over['CPVcodes_1digit'] = corpus_over['CPVcodes_2digit'].astype(str).str[0]


In [11]:
corpus_42['CPVcodes_1digit'] = corpus_42['CPVcodes_1digit'].astype(int)
corpus_18['CPVcodes_1digit'] = corpus_18['CPVcodes_1digit'].astype(int)
corpus_over['CPVcodes_1digit'] = corpus_over['CPVcodes_1digit'].astype(int)
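
The zero-padding step matters because the one-digit CPV code is taken from the first character of the two-digit code: without padding, a single-digit value such as 9 would give 9 rather than 0. A minimal sketch with made-up codes (the Series `codes` is hypothetical), using the vectorised `.str.zfill` accessor in place of `apply`:

```python
import pandas as pd

# Made-up two-digit CPV values stored as integers, so leading zeros are lost
codes = pd.Series([9, 45, 72, 3])

padded = codes.astype(str).str.zfill(2)      # restore the leading zero
first_digit = padded.str[0].astype(int)      # first character of the padded code

# Without padding, the first character of '9' would be read as 9
unpadded_first = codes.astype(str).str[0].astype(int)

print(padded.tolist())
print(first_digit.tolist())
print(unpadded_first.tolist())
```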


#### Encoding CPVcodes_1digit

In [12]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd


In [13]:
corpus_42['CPVcodes_1digit_str'] = corpus_42['CPVcodes_1digit'].astype(str)
corpus_18['CPVcodes_1digit_str'] = corpus_18['CPVcodes_1digit'].astype(str)
corpus_over['CPVcodes_1digit_str'] = corpus_over['CPVcodes_1digit'].astype(str)

In [14]:
all_cpv_codes_str = pd.concat([corpus_42['CPVcodes_1digit_str'], corpus_18['CPVcodes_1digit_str'], corpus_over['CPVcodes_1digit_str']]).unique()

In [15]:
encoder = OneHotEncoder(categories=[all_cpv_codes_str], handle_unknown='ignore')

In [16]:
cpv_42_encoded_1 = encoder.fit_transform(corpus_42[['CPVcodes_1digit_str']]).toarray()

In [17]:
cpv_18_encoded_1 = encoder.transform(corpus_18[['CPVcodes_1digit_str']]).toarray()

In [18]:
cpv_over_encoded_1 = encoder.transform(corpus_over[['CPVcodes_1digit_str']]).toarray()

In [19]:
cpv_42_encoded_df_1 = pd.DataFrame(cpv_42_encoded_1, index=corpus_42.index)

In [20]:
cpv_18_encoded_df_1 = pd.DataFrame(cpv_18_encoded_1, index=corpus_18.index)

In [21]:
cpv_over_encoded_df_1 = pd.DataFrame(cpv_over_encoded_1, index=corpus_over.index)
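
The point of passing an explicit `categories` list is that all three corpora are encoded into the same columns in the same order, so a model fitted on one corpus can be applied to the others. The sketch below illustrates this with two made-up frames (`train_df` and `new_df` are hypothetical). It also shows that, with `handle_unknown='ignore'`, a code missing from the category list is encoded as an all-zero row.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data for illustration; not the notebook's corpora
train_df = pd.DataFrame({'cpv': ['4', '7', '4']})
new_df = pd.DataFrame({'cpv': ['7', '3', '9']})   # '9' is not in the category list

# Shared vocabulary built up front, mirroring the approach in the cells above
categories = ['4', '7', '3']
encoder = OneHotEncoder(categories=[categories], handle_unknown='ignore')

train_encoded = encoder.fit_transform(train_df[['cpv']]).toarray()
new_encoded = encoder.transform(new_df[['cpv']]).toarray()

print(train_encoded.shape, new_encoded.shape)   # same number of columns
print(new_encoded)
```

Columns follow the order of the `categories` list, so column 0 always means code '4' in this example, whichever frame is encoded.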

#### Encoding CPVcodes_2digit

In [22]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd


In [23]:
corpus_42['CPVcodes_2digit_str'] = corpus_42['CPVcodes_2digit'].astype(str)
corpus_18['CPVcodes_2digit_str'] = corpus_18['CPVcodes_2digit'].astype(str)
corpus_over['CPVcodes_2digit_str'] = corpus_over['CPVcodes_2digit'].astype(str)


In [24]:
all_cpv_codes_str = pd.concat([corpus_42['CPVcodes_2digit_str'], corpus_18['CPVcodes_2digit_str'], corpus_over['CPVcodes_2digit_str']]).unique()

In [25]:
encoder = OneHotEncoder(categories=[all_cpv_codes_str], handle_unknown='ignore')

In [26]:
cpv_42_encoded_2 = encoder.fit_transform(corpus_42[['CPVcodes_2digit_str']]).toarray()

In [27]:
cpv_18_encoded_2 = encoder.transform(corpus_18[['CPVcodes_2digit_str']]).toarray()

In [28]:
cpv_over_encoded_2 = encoder.transform(corpus_over[['CPVcodes_2digit_str']]).toarray()

In [29]:
cpv_42_encoded_df_2 = pd.DataFrame(cpv_42_encoded_2, index=corpus_42.index)

In [30]:
cpv_18_encoded_df_2 = pd.DataFrame(cpv_18_encoded_2, index=corpus_18.index)

In [31]:
cpv_over_encoded_df_2 = pd.DataFrame(cpv_over_encoded_2, index=corpus_over.index)

#### Encoding CPVcodes_3digit

In [32]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd


In [33]:
corpus_42['CPVcodes_3digit_str'] = corpus_42['CPVcodes_3digit'].astype(str)
corpus_18['CPVcodes_3digit_str'] = corpus_18['CPVcodes_3digit'].astype(str)
corpus_over['CPVcodes_3digit_str'] = corpus_over['CPVcodes_3digit'].astype(str)


In [34]:
all_cpv_codes_str = pd.concat([corpus_42['CPVcodes_3digit_str'], corpus_18['CPVcodes_3digit_str'], corpus_over['CPVcodes_3digit_str']]).unique()

In [35]:
encoder = OneHotEncoder(categories=[all_cpv_codes_str], handle_unknown='ignore')

In [36]:
cpv_42_encoded_3 = encoder.fit_transform(corpus_42[['CPVcodes_3digit_str']]).toarray()

In [37]:
cpv_18_encoded_3 = encoder.transform(corpus_18[['CPVcodes_3digit_str']]).toarray()

In [38]:
cpv_over_encoded_3 = encoder.transform(corpus_over[['CPVcodes_3digit_str']]).toarray()

In [39]:
cpv_42_encoded_df_3 = pd.DataFrame(cpv_42_encoded_3, index=corpus_42.index)

In [40]:
cpv_18_encoded_df_3 = pd.DataFrame(cpv_18_encoded_3, index=corpus_18.index)

In [41]:
cpv_over_encoded_df_3 = pd.DataFrame(cpv_over_encoded_3, index=corpus_over.index)

## Neural Network

In [42]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

import numpy as np
import tensorflow as tf
import random

In [43]:

np.random.seed(42)
random.seed(42)
tf.random.set_seed(42)
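
One caveat on the seeding cell: `MLPRegressor` is a scikit-learn estimator and does not use TensorFlow, so `tf.random.set_seed` has no effect on it. Its weight initialisation and shuffling are governed by its own `random_state` argument, which this notebook sets to 42 when the model is created. The sketch below, on synthetic data with hypothetical names, shows two fits with the same `random_state` giving identical predictions:

```python
import warnings
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic regression data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))
y_demo = X_demo @ np.array([0.5, -1.0, 0.2, 0.0, 0.3]) + rng.normal(scale=0.1, size=60)

def fit_and_predict(seed):
    """Fit a small MLP with the given random_state and return its predictions."""
    model = MLPRegressor(hidden_layer_sizes=(10,), activation='logistic',
                         solver='adam', max_iter=200, random_state=seed)
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')  # silence possible convergence warnings
        model.fit(X_demo, y_demo)
    return model.predict(X_demo)

pred_a = fit_and_predict(42)
pred_b = fit_and_predict(42)
print(np.allclose(pred_a, pred_b))
```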




In [44]:
embeddings_df = pd.DataFrame(corpus_42['Normalized_Embedding'].tolist())


In [45]:
data_with_dummies = pd.concat([embeddings_df, cpv_42_encoded_df_3, cpv_42_encoded_df_2, cpv_42_encoded_df_1], axis=1)

In [46]:
data = data_with_dummies.join(corpus_42['Asset_specificity_mean'])

In [47]:
data.columns = data.columns.astype(str)

In [48]:
X = data.drop('Asset_specificity_mean', axis=1)
y = data['Asset_specificity_mean']

In [49]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [50]:
model = MLPRegressor(hidden_layer_sizes=(100,), activation='logistic', solver='adam', max_iter=1000, random_state=42)

In [51]:
model.fit(X_train, y_train)


In [52]:
y_pred = model.predict(X_test)

In [53]:
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

In [55]:
print(f'Mean Squared Error (MSE): {mse}')
print(f'Mean Absolute Error (MAE): {mae}')
print(f'R^2 Score: {r2}')

Mean Squared Error (MSE): 0.026417206418528745
Mean Absolute Error (MAE): 0.10595494345603472
R^2 Score: 0.794650617907131


In [56]:
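data_for_prediction = pd.concat([embeddings_df, cpv_42_encoded_df_3, cpv_42_encoded_df_2, cpv_42_encoded_df_1], axis=1)
# Use the same string column names the model was fitted on (see X above)
data_for_prediction.columns = data_for_prediction.columns.astype(str)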


In [57]:
y_pred_full_corpus = model.predict(data_for_prediction)



In [58]:
y_pred_full_corpus_flat = y_pred_full_corpus.flatten()

In [59]:
corpus_42['Predicted_Asset_specificity'] = y_pred_full_corpus_flat

In [60]:
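corpus_42['absolute_error'] = np.abs(corpus_42['Asset_specificity_mean'] - corpus_42['Predicted_Asset_specificity'])
# Note: prediction_error is also taken as an absolute value here, so it duplicates absolute_error
# and the summary below is non-negative; the corpus_18 validation uses the signed error instead.
corpus_42['prediction_error'] = np.abs(corpus_42['Predicted_Asset_specificity'] - corpus_42['Asset_specificity_mean'])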


In [64]:
print(corpus_42['prediction_error'].describe())

count    1.351000e+03
mean     9.163343e-02
std      1.017583e-01
min      1.566359e-07
25%      2.423997e-02
50%      6.085948e-02
75%      1.206520e-01
max      1.475203e+00
Name: prediction_error, dtype: float64


In [65]:
corpus_42_sorted_by_error = corpus_42.sort_values('absolute_error', ascending=False)

In [66]:
corpus_42_sorted_by_error.head()

Unnamed: 0,doc_id,contract_type_code,Product_60codes,Measurability_mean,Asset_specificity_mean,Manuel_or_not,product_description_binary,CPVcodes_2digit,CPVcodes_3digit,clean_text,Embedding,Normalized_Embedding,Normalized_Embedding_mean,CPVcodes_1digit,CPVcodes_1digit_str,CPVcodes_2digit_str,CPVcodes_3digit_str,Predicted_Asset_specificity,absolute_error,prediction_error
936,471767-2021,4,23,2.809524,4.0,not,Yes,75,752,1.2 kravtyper i nærværende kravspecifikation e...,"[0.034549914, -0.042960282, -0.012512988, 0.03...","[0.034549912005077184, -0.04296027951945991, -...","[0.034549912005077184, -0.04296027951945991, -...",7,7,75,752,2.524797,1.475203,1.475203
919,463700-2022,2,4,1.915966,1.71028,not,Yes,30,301,de 3 centrale driftscentre er sammenkoblet i ...,"[0.021431936, 0.0010715968, -0.0025592253, 0.0...","[0.021429342326400525, 0.0010714671163200263, ...","[0.011766989357691772, -0.021660008612906734, ...",3,3,30,301,2.268862,0.558582,0.558582
1273,657903-2021,1,34,3.053571,1.898305,not,Yes,45,452,4. den udbudte anskaffelse entreprisen består ...,"[0.023648163, -0.02022101, -0.020865379, 0.018...","[0.02320736353691505, -0.019844092336203645, -...","[0.014665746413114938, -0.03314830637705246, -...",4,4,45,452,2.42697,0.528665,0.528665
1189,612592-2021,1,34,3.053571,1.898305,not,Yes,45,452,3.1.7 brønde . 20 3.1.8 brønde af beton . 22 3...,"[0.039461013, -0.014652216, -0.0045676813, 0.0...","[0.034086147939626284, -0.012656482037583757, ...","[0.018194331072319995, -0.040295459964509, -0....",4,4,45,452,2.426521,0.528216,0.528216
1133,576927-2022,1,34,3.053571,1.898305,not,Yes,45,452,2 1. tekniske funktionskrav der skal etableres...,"[0.006155151, -0.025454653, -0.008658246, 0.00...","[0.006175755599179605, -0.02553986340707546, -...","[0.005081111659826009, -0.03467379292968825, -...",4,4,45,452,2.421424,0.523119,0.523119


In [67]:
selected_columns_df = corpus_42_sorted_by_error[['doc_id', 'clean_text', 'contract_type_code', 'Product_60codes','CPVcodes_2digit', 'CPVcodes_3digit', 'Asset_specificity_mean', 'prediction_error', 'Predicted_Asset_specificity']].copy()  # .copy() avoids SettingWithCopyWarning in the cells below

In [68]:
def truncate_text(text):
    MAX_LENGTH = 30000  # Safety margin below Excel's 32,767-character cell limit
    if isinstance(text, str) and len(text) > MAX_LENGTH:
        return text[:MAX_LENGTH]
    return text

In [69]:
selected_columns_df['clean_text'] = selected_columns_df['clean_text'].apply(truncate_text)

In [70]:
# Format floats with two decimals and a decimal comma before export
for col in selected_columns_df.select_dtypes(include=['float', 'float64']).columns:
    selected_columns_df[col] = selected_columns_df[col].apply(lambda x: f'{x:.2f}'.replace('.', ','))

In [71]:
selected_columns_df.to_csv('Corpus_42_asset_errors_NY.csv', index=False, encoding='utf-16', sep='\t')

In [72]:
mean_values_df = corpus_42.groupby('Product_60codes').agg({
    'Asset_specificity_mean': 'mean',
    'Predicted_Asset_specificity': 'mean'
}).reset_index()

In [73]:
mean_values_df['difference'] = mean_values_df['Asset_specificity_mean'] - mean_values_df['Predicted_Asset_specificity']

In [74]:
mean_values_df = mean_values_df.sort_values('difference', ascending=False)

In [75]:
mean_values_df.head(18)

Unnamed: 0,Product_60codes,Asset_specificity_mean,Predicted_Asset_specificity,difference
16,23,4.0,2.524797,1.475203
9,14,2.666667,2.428587,0.23808
28,39,2.571429,2.38751,0.183919
12,19,2.590164,2.456601,0.133563
37,56,2.531646,2.418759,0.112887
18,25,2.701493,2.604519,0.096974
21,30,2.6,2.510285,0.089715
36,55,2.65,2.562147,0.087853
27,38,2.838235,2.763443,0.074792
32,48,3.041237,2.977103,0.064134


In [76]:
mean_values_df.to_excel('Corpus_42_asset_product_type_errors_NY.xlsx', index=False)

#### Validation on corpus_18

In [77]:
embeddings_df_18 = pd.DataFrame(corpus_18['Normalized_Embedding'].tolist())

In [78]:
data_with_dummies_18 = pd.concat([embeddings_df_18, cpv_18_encoded_df_3, cpv_18_encoded_df_2, cpv_18_encoded_df_1], axis=1)

In [79]:
data_18 = data_with_dummies_18.join(corpus_18['Asset_specificity_mean'])

In [80]:
data_18.columns = data_18.columns.astype(str)

In [81]:
X_18 = data_18.drop('Asset_specificity_mean', axis=1)

In [82]:
y_18 = data_18['Asset_specificity_mean']

In [83]:
y_pred_18 = model.predict(X_18)

In [84]:
mse_18 = mean_squared_error(y_18, y_pred_18)
mae_18 = mean_absolute_error(y_18, y_pred_18)
r2_18 = r2_score(y_18, y_pred_18)

In [85]:
print(f"Mean Squared Error (MSE) on corpus_18: {mse_18}")
print(f"Mean Absolute Error (MAE) on corpus_18: {mae_18}")
print(f"R^2 Score on corpus_18: {r2_18}")

Mean Squared Error (MSE) on corpus_18: 0.11020415110404123
Mean Absolute Error (MAE) on corpus_18: 0.23628311691900522
R^2 Score on corpus_18: 0.24167294905687597


In [86]:
corpus_18['Predicted_Asset_specificity_mean'] = y_pred_18.flatten()

In [87]:
corpus_18['absolute_error'] = np.abs(y_18 - y_pred_18)


In [88]:
corpus_18['prediction_error'] = y_pred_18 - y_18

In [89]:
print(corpus_18['prediction_error'].describe())

count    389.000000
mean      -0.046967
std        0.329054
min       -1.109963
25%       -0.205088
50%       -0.084556
75%        0.052287
max        0.993004
Name: prediction_error, dtype: float64


In [90]:
corpus_18_sorted_by_error = corpus_18.sort_values('absolute_error', ascending=False)

In [91]:
corpus_18_sorted_by_error.head()

Unnamed: 0,doc_id,contract_type_code,Product_60codes,Measurability_mean,Asset_specificity_mean,Manuel_or_not,product_description_binary,CPVcodes_2digit,CPVcodes_3digit,clean_text,Embedding,Normalized_Embedding,Normalized_Embedding_mean,CPVcodes_1digit,CPVcodes_1digit_str,CPVcodes_2digit_str,CPVcodes_3digit_str,Predicted_Asset_specificity_mean,absolute_error,prediction_error
14,0273-000984,4,16,3.020408,3.770833,not,Yes,72,725,leverandøren skal til enhver tid have systemer...,"[-0.012033573, -0.02152224, -0.0032734876, 0.0...","[-0.01203357273654089, -0.02152223952879912, -...","[-0.01203357273654089, -0.02152223952879912, -...",7,7,72,725,2.66087,1.109963,-1.109963
207,343481-2022,2,41,2.726027,1.309524,not,Yes,18,185,2.4 gavekassen gavekassen er ganske afgørende ...,"[-0.0140284365, 0.015935063, -0.010284936, 0.0...","[-0.014028436397875703, 0.01593506288399583, -...","[-0.014028436397875703, 0.01593506288399583, -...",1,1,18,185,2.302528,0.993004,0.993004
194,336016-2021,2,41,2.726027,1.309524,not,Yes,18,185,gavekassen er ganske afgørende i leverancen. g...,"[-0.0065661776, 0.0074800667, -0.010836113, 0....","[-0.006566177414315129, 0.00748006648847127, -...","[-0.006566177414315129, 0.00748006648847127, -...",1,1,18,185,2.300152,0.990628,0.990628
64,095342-2021,4,16,3.020408,3.770833,not,Yes,72,725,bilag 1b – kundens it-miljø aarhus kommune 25....,"[0.0018933702, -0.047902785, -0.01409423, 0.01...","[0.0018797690612758979, -0.04755867246244351, ...","[-0.006511076749262017, -0.052191325953563995,...",7,7,72,725,2.791287,0.979546,-0.979546
213,359110-2022,4,16,3.020408,3.770833,not,Yes,72,725,generelle krav til funktionalitet overordnede ...,"[0.050823994, -0.027748182, -0.0058391406, 0.0...","[0.048764672670076815, -0.02662386219429582, -...","[0.022834313132002428, -0.034260265824075035, ...",7,7,72,725,2.829337,0.941496,-0.941496


In [92]:
selected_columns_df = corpus_18_sorted_by_error[['doc_id', 'clean_text', 'contract_type_code', 'Product_60codes','CPVcodes_2digit', 'CPVcodes_3digit', 'Asset_specificity_mean', 'prediction_error', 'Predicted_Asset_specificity_mean']].copy()  # .copy() avoids SettingWithCopyWarning in the cells below

In [93]:
def truncate_text(text):
    MAX_LENGTH = 30000  # Safety margin below Excel's 32,767-character cell limit
    if isinstance(text, str) and len(text) > MAX_LENGTH:
        return text[:MAX_LENGTH]
    return text

In [94]:
selected_columns_df['clean_text'] = selected_columns_df['clean_text'].apply(truncate_text)

In [95]:
# Format floats with two decimals and a decimal comma before export
for col in selected_columns_df.select_dtypes(include=['float', 'float64']).columns:
    selected_columns_df[col] = selected_columns_df[col].apply(lambda x: f'{x:.2f}'.replace('.', ','))

In [96]:
selected_columns_df.to_csv('Corpus_18_asset_errors_NY.csv', index=False, encoding='utf-16', sep='\t')

In [97]:
mean_values_df = corpus_18.groupby('Product_60codes').agg({
    'Asset_specificity_mean': 'mean',
    'Predicted_Asset_specificity_mean': 'mean'
}).reset_index()

In [98]:
mean_values_df['difference'] = mean_values_df['Asset_specificity_mean'] - mean_values_df['Predicted_Asset_specificity_mean']

In [99]:
mean_values_df = mean_values_df.sort_values('difference', ascending=False)

In [100]:
mean_values_df.head(18)

Unnamed: 0,Product_60codes,Asset_specificity_mean,Predicted_Asset_specificity_mean,difference
4,16,3.770833,2.760498,1.010335
10,44,2.72,2.288038,0.431962
13,50,2.716049,2.332136,0.383913
0,2,2.826087,2.450064,0.376023
2,6,2.961039,2.628484,0.332555
8,40,2.985075,2.796989,0.188086
5,28,2.744681,2.607542,0.137139
17,58,2.710843,2.582048,0.128795
3,15,2.722222,2.62836,0.093862
14,52,2.2,2.276372,-0.076372


In [101]:
mean_values_df.to_excel('Corpus_18_asset_product_type_errors_NY.xlsx', index=False)

## Prediction on corpus_over

In [102]:
embeddings_df_over = pd.DataFrame(corpus_over['Normalized_Embedding'].tolist())

In [103]:
data_with_dummies_over = pd.concat([embeddings_df_over, cpv_over_encoded_df_3, cpv_over_encoded_df_2, cpv_over_encoded_df_1], axis=1)

In [104]:
data_with_dummies_over.columns = data_with_dummies_over.columns.astype(str)

In [105]:
y_pred_over = model.predict(data_with_dummies_over)

In [106]:
corpus_over['Predicted_Asset_specificity_mean'] = y_pred_over.flatten()

In [107]:
corpus_over.head()

Unnamed: 0,doc_id,contract_type_code,Product_60codes,Measurability_mean,Asset_specificity_mean,Manuel_or_not,product_description_binary,CPVcodes_2digit,CPVcodes_3digit,clean_text,Embedding,Normalized_Embedding,Normalized_Embedding_mean,CPVcodes_1digit,CPVcodes_1digit_str,CPVcodes_2digit_str,CPVcodes_3digit_str,Predicted_Asset_specificity_mean
0,0000-008255,4,482,,,not,Yes,48,482,provide one icon per episode in several forma...,"[-0.0030848256, -0.026598396, -0.021644095, -0...","[-0.0030848255027367153, -0.026598395161363492...","[-0.0030848255027367153, -0.026598395161363492...",4,4,48,482,1.789834
1,0000-008285,1,450,,,not,Yes,45,450,udlægning af slidlag bærelag og bæreslidlag på...,"[0.03165265, -0.018070666, -0.0104349395, 0.01...","[0.031652648614329426, -0.01807066520891331, -...","[0.031652648614329426, -0.01807066520891331, -...",4,4,45,450,2.412526
2,0000-008392,1,451,,,not,Yes,45,451,pos.nr. pkt. bygningsdel 01 1 121 2 131 linief...,"[-0.0045218053, -0.029040039, -0.011053302, 0....","[-0.004521805473328217, -0.029040040113152348,...","[-0.004521805473328217, -0.029040040113152348,...",4,4,45,451,2.228737
3,0000-008393,4,482,,,not,Yes,48,482,task 4 - refurbishment of the interreg podcast...,"[-0.016492937, -0.0014233976, -0.018877864, -0...","[-0.0164929371094576, -0.00142339760944657, -0...","[-0.0164929371094576, -0.00142339760944657, -0...",4,4,48,482,2.151623
4,0000-008394,1,452,,,not,Yes,45,452,arbejder gulvbelægning vægoverflader ny gulvbe...,"[0.007988542, -0.026167195, -0.011535243, 0.03...","[0.007748142082371978, -0.025379743231885573, ...","[0.0010394256856936434, -0.04338538165723106, ...",4,4,45,452,2.458802


In [108]:
group_df = corpus_over.groupby('Product_60codes').agg({
    'Predicted_Asset_specificity_mean': ['mean', 'std', 'count']
}).reset_index()


In [110]:
group_df.columns = ['Product_60codes', 'Predicted_Asset_specificity_mean_Mean', 'Predicted_Asset_specificity_mean_Std', 'Count']
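
The aggregation and manual column renaming in the two cells above can also be written in one step with pandas named aggregation, which yields flat column names directly. A sketch on made-up data (the frame `demo` and the shortened output column names are hypothetical):

```python
import pandas as pd

# Made-up predictions for illustration; not the notebook's data
demo = pd.DataFrame({
    'Product_60codes': [450, 450, 482],
    'Predicted_Asset_specificity_mean': [2.4, 2.6, 1.8],
})

# Named aggregation: output_column=(input_column, function)
group_demo = demo.groupby('Product_60codes').agg(
    Predicted_Mean=('Predicted_Asset_specificity_mean', 'mean'),
    Predicted_Std=('Predicted_Asset_specificity_mean', 'std'),
    Count=('Predicted_Asset_specificity_mean', 'count'),
).reset_index()

print(group_demo)
```

A group containing a single document has an undefined (NaN) standard deviation, which is worth keeping in mind when reading the exported table.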

In [112]:
group_df.to_excel('corpus_over_prediction_asset_NY.xlsx', index=False)