### Set some Globals
Set `LABEL` to what must be predicted and `LABEL_MODEL` to the model that should be used.

In [12]:
PATH_RELATIVE = 'data/'

# LABEL = 'label_users_top_100'
# LABEL_MODEL = 'Responsible'

LABEL = 'label_bins'
LABEL_MODEL = 'Time-Bins'

# LABEL = 'label_time_encoded'
# LABEL_MODEL = 'Time-Encoded'

# LABEL = 'label_users_top_100'
# LABEL_MODEL = 'Responsible-Overfitted'

PREDICT_TIME = True
PREDICT_RESPONSIBLE = False

### 1. Loading in the model and tokenizer
The `from_pretrained` argument simply points to the model being loaded.
The tokenizer comes from the base model that we fine-tuned.

In [13]:
from transformers import TFAutoModelForSequenceClassification, AutoTokenizer
import pandas as pd
import numpy as np

model = TFAutoModelForSequenceClassification.from_pretrained(f'{PATH_RELATIVE}models/IHLP-XLM-RoBERTa-{LABEL_MODEL}')
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

All model checkpoint layers were used when initializing TFXLMRobertaForSequenceClassification.

All the layers of TFXLMRobertaForSequenceClassification were initialized from the model checkpoint at data/models/IHLP-XLM-RoBERTa-Time-Bins.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFXLMRobertaForSequenceClassification for predictions without further training.


Load the text.

In [24]:
df_test = pd.read_csv(f'{PATH_RELATIVE}cached_test_label_time_encoded.csv')
df = pd.read_csv(f'{PATH_RELATIVE}text.csv')
df = df.fillna('')
# df = df[-50000:]
# df = pd.merge(df, df_test['text'], on='text')
# df = df.drop_duplicates(subset=['text'])

# Keep at most the first 512 characters of each text
df['text'] = df['text'].str[:512]

df = df.reset_index(drop=True)

print(len(df))


302832


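The ordering of the predicted output will match the ordering of the DataFrame.
We keep only the id and text from the DataFrame and join in all the data we need.
Note that `pd.merge` defaults to an inner join, so ids that are missing from any of the label files are dropped.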
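
As a small illustration of how the `how` argument of `pd.merge` affects the row count, here is a sketch on toy data. The frames and values are made up and are not taken from the CSV files used in this notebook.

```python
import pandas as pd

# Toy frames (hypothetical data, not from the notebook's CSV files).
texts = pd.DataFrame({'id': [1, 2, 3], 'text': ['a', 'b', 'c']})
labels = pd.DataFrame({'id': [1, 3], 'label': [10, 30]})

# The default merge is an inner join: id 2 has no label and is dropped.
inner = pd.merge(texts, labels, on='id')

# A left join keeps every row of `texts`; the missing label becomes NaN.
left = pd.merge(texts, labels, on='id', how='left')

print(len(inner), len(left))
```

If every row of the text frame must survive, `how='left'` is the safer choice, at the cost of having to handle the missing labels afterwards.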


In [25]:
df_label_encoded = pd.read_csv(f'{PATH_RELATIVE}label_time_encoded.csv')
df_label_users_top_100 = pd.read_csv(f'{PATH_RELATIVE}label_users_top_100.csv')
df_label_time_complete = pd.read_csv(f'{PATH_RELATIVE}label_time_complete.csv')

out = df[['id', 'text']]
out = pd.merge(out, df_label_encoded[['id', 'label_time_encoded']], on='id')
out = pd.merge(out, df_label_users_top_100[['id', 'label_closed', 'label_users_top_100']], on='id')
out = pd.merge(out, df_label_time_complete[['id', 'received_time', 'time_bins_solution_timestamp']], on='id')

# The time bin (1-5) is the remainder of the encoded label
out['consumption'] = (out['label_time_encoded'] % 5) + 1

out = out.rename(columns={
    'label_closed': 'user',
    'received_time': 'received',
    'time_bins_solution_timestamp': 'solution'
})

print(len(out))

250493


In [26]:
# Initiate columns with default data
out['predict_consumption'] = 1
out['predict_user'] = ''

# Prepare columns for top-10 time prediction
for i in range(10):
    out[f't_{i}'] = 0

# Prepare columns for top-10 responsible prediction
for i in range(10):
    out[f'r_{i}'] = 0

In [27]:
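def get_top_k_index(x, index=0, size=500):
    # The last `size` columns of the row hold the model's logits.
    arr = x.to_numpy()[-size:]
    # Return the class index with the (index + 1)-th highest logit.
    return np.argsort(-arr)[:10][index]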
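
To illustrate what the ranking in `get_top_k_index` does, here is a sketch with made-up scores. `np.argsort(-scores)` orders the indices by descending score; the batched variant with `axis=1` is a suggestion for avoiding a row-wise `apply`, not something this notebook uses.

```python
import numpy as np

# Hypothetical logits for 5 classes (illustrative values only).
scores = np.array([0.1, 2.5, -1.0, 3.2, 0.7])

# Indices ordered from highest to lowest score.
ranked = np.argsort(-scores)
top_3 = ranked[:3]
print(top_3)  # [3 1 4]

# The same idea for a whole batch: one row of logits per issue.
logits = np.array([[0.1, 2.5, -1.0],
                   [3.0, 0.0, 1.0]])
top_2 = np.argsort(-logits, axis=1)[:, :2]
print(top_2)
```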

Tokenize the text.

In [28]:
def tokenize_texts(sentences, max_length=512, padding='max_length'):
    return tokenizer(
        sentences,
        truncation=True,  # cut texts longer than max_length tokens so all rows have equal length
        padding=padding,
        max_length=max_length,
        return_tensors="tf"
    )

tokenized_text = dict(tokenize_texts(list(out['text'].values)))

In [29]:
print(out.head())
out = out.drop(columns=['text'])

        id                                               text  \
0  3703067                                                      
1  3703079                                                      
2  3703098  lån af eksamens pc og januar. tidligere efecte...   
3  3703139  test sag nichlas dsfadfasdfadsfasdfadfasdfasdf...   
4  3710799  brugeradmin. (opret /-) brugeradministration (...   

   label_time_encoded     user  label_users_top_100      received  \
0                 107  henrikk                   21  1.450246e+09   
1                 107  henrikk                   21  1.450247e+09   
2                  43    bernt                    8  1.450247e+09   
3                 361     nlhi                   72  1.450248e+09   
4                 473    thoje                   94  1.450163e+09   

       solution  consumption  predict_consumption predict_user  ...  r_0  r_1  \
0  1.450474e+09            3                    1               ...    0    0   
1  1.450474e+09            3    

In [31]:
if PREDICT_TIME:

    import tensorflow as tf

    # Compile and predict
    model.compile(metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=10)])
    model.evaluate(tokenized_text, out.label_time_encoded.values, batch_size=16)
    predict = model.predict(tokenized_text, batch_size=16, verbose=False)

    # Merge prediction to DataFrame
    out = pd.merge(out, pd.DataFrame(predict[0]), left_index=True, right_index=True)

    # Apply top-10
    for i in list(range(10)):
        out[f't_{i}'] = out.apply(lambda x: get_top_k_index(x, i), axis=1)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


   51/15656 [..............................] - ETA: 2:10:06 - loss: nan - sparse_top_k_categorical_accuracy: 0.0135     

KeyboardInterrupt: 

In [None]:
if PREDICT_RESPONSIBLE:

    import tensorflow as tf

    # Compile and predict
    model.compile(metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=10)])
    model.evaluate(tokenized_text, out.label_users_top_100.values, batch_size=16)
    predict = model.predict(tokenized_text, batch_size=16, verbose=False)

    # Merge prediction to DataFrame
    out = pd.merge(out, pd.DataFrame(predict[0]), left_index=True, right_index=True)

    # Apply top-10
    for i in list(range(10)):
        out[f'r_{i}'] = out.apply(lambda x: get_top_k_index(x, i, size=100), axis=1)

In [None]:
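# For smaller evaluation sample
# Reset the index so the positional slice out[:i] in the loop below matches the row labels
out = out[out.user == 'ep'].reset_index(drop=True)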

print(out.r_0.value_counts())


We have the latest 1000 issues. For each issue we assign a responsible such that the overall workload is minimized.
The workload of a responsible is the sum of the expected (remaining) time consumption of all their open issues.

1. Get the current workload for all responsibles.
2. Consider the probability distribution of the prediction.
3. For the set of predicted responsibles, also consider their workload.
4. Take the responsible for whom the combination of both gives the best result.
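
The four steps above could be sketched as follows. This is only one possible scoring rule and not the method used later in this notebook: the softmax over the logits, the `alpha` weight, the `top_k` cut-off, and the `choose_responsible` helper are all assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def choose_responsible(logits, workloads, alpha=0.5, top_k=3):
    """Pick a candidate index balancing model probability and workload.

    logits: model scores per responsible (hypothetical input).
    workloads: current open workload per responsible, in the same order.
    alpha: assumed trade-off weight; higher values penalise workload more.
    """
    probs = softmax(np.asarray(logits, dtype=float))
    workloads = np.asarray(workloads, dtype=float)

    # Step 2: only consider the top_k most probable responsibles.
    candidates = np.argsort(-probs)[:top_k]

    # Step 3: scale workload to [0, 1] so it is comparable to probabilities.
    max_load = workloads.max()
    load = workloads / max_load if max_load > 0 else workloads

    # Step 4: score = probability minus weighted workload; take the best.
    scores = probs[candidates] - alpha * load[candidates]
    return int(candidates[np.argmax(scores)])

# Responsible 0 is most probable but heavily loaded; responsible 1 is idle.
print(choose_responsible([2.0, 1.8, 0.1], workloads=[10, 0, 3]))  # 1
```

With `alpha=0` the rule reduces to taking the most probable responsible, which corresponds to using only the top prediction.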

In the experiment we don't know the solution time, and we don't need it.
In the evaluation, however, we do need it to remove issues from the workload pool over time.
We use the actual solution time, but since we may assign the issues differently,
we don't know whether that solution time still holds.
We can either keep it as is (1), or set the solution time from a probability distribution (2).
Option (1) lets us continue without much work; option (2) would be harder.
For option (2), one should consider how to model the solution time distribution. One could use:
- similar issues, using clustering (e.g. unsupervised learning, distance calculated from Transformer)
- all issues
- all issues from the responsible
- mix of prior and similar issues
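
Option (2) could, for instance, draw the solution time from the empirical distribution of past durations for the same responsible, falling back to all issues when that responsible has too little history. A minimal sketch under those assumptions; the `history` frame, its column names, and `min_samples` are hypothetical and do not exist in this notebook.

```python
import numpy as np
import pandas as pd

def sample_duration(history, user, rng, min_samples=5):
    """Draw one duration from past issues (empirical distribution).

    history: DataFrame with hypothetical columns 'user' and 'duration'.
    Falls back to all issues if the user has fewer than min_samples rows.
    """
    own = history.loc[history['user'] == user, 'duration'].to_numpy()
    pool = own if len(own) >= min_samples else history['duration'].to_numpy()
    return float(rng.choice(pool))

# Made-up history: durations in hours.
history = pd.DataFrame({
    'user': ['a'] * 5 + ['b'] * 2,
    'duration': [1, 2, 2, 3, 8, 40, 50],
})

rng = np.random.default_rng(seed=0)
print(sample_duration(history, 'a', rng))  # drawn from a's own durations
print(sample_duration(history, 'b', rng))  # too little history: drawn from all issues
```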

In [None]:
class Workload:
    """Simple attribute bag holding one workload value per user."""

    def set(self, key, value):
        setattr(self, key, value)

    def get(self, key):
        return getattr(self, key)

In [None]:
def calculate_workloads(_tmp, _workloads, _users, predict=False):

    # Append a new snapshot holding the summed consumption per user.
    _workloads.append(Workload())

    for user in _users:
        if predict:
            # Workload according to the predicted assignment
            tmp_for_user = _tmp[_tmp.predict_user == user]
            _workloads[-1].set(user, tmp_for_user.predict_consumption.sum())
        else:
            # Workload according to the actual assignment
            tmp_for_user = _tmp[_tmp.user == user]
            _workloads[-1].set(user, tmp_for_user.consumption.sum())

    return _workloads

In [None]:
from tqdm import tqdm

users = df_label_users_top_100.label_closed.unique()

tmp = df_label_users_top_100.drop_duplicates(subset=['label_closed', 'label_users_top_100'])
tmp = tmp.sort_values(by='label_users_top_100')
user_index = tmp.label_closed.values

counts = []
workloads = []
workloads_predicted = []

for i, el in tqdm(out.iterrows(), total=len(out)):

    # Consider all issues up until now.
    tmp = out[:i]

    # Remove all issues that are already solved.
    # Since we know the solution time, we keep only issues solved after this issue was received.
    tmp = tmp[tmp.solution > el.received]

    # Save the count just to make a pretty line in our graph
    counts.append(len(tmp))

    # Get the current workload
    workloads = calculate_workloads(tmp, workloads, users)

    # Get the predicted workload
    workloads_predicted = calculate_workloads(tmp, workloads_predicted, users, predict=True)

    # We (should) make a decision from the current predicted workload.
    # For now we simply select whoever the model finds most suitable
    # (column t_0 in the disabled branches, column r_0 in the active one).
    if False:
        # This did not work well.
        user = int(el.t_0 / 5)
        time = el.t_0 % 5

    if False:
        predicts = el.values[9:14]
        predicts_users = [int(e / 5) for e in predicts]
        predicts_users_max = max(predicts_users, key=predicts_users.count)

        time = 0
        user = 0

        for predict in predicts:
            if int(predict / 5) == predicts_users_max:
                user = predicts_users_max
                if time == 0:
                    time = (predict % 5) + 1

    if False:
        predicts = list(el.values[9:19])
        predicts_users = [e for e in [int(e / 5) for e in predicts] if e in el.values[19:29]]

        if len(predicts_users) == 0:
            predicts_users = [int(e / 5) for e in predicts]

        predicts_users_max = max(predicts_users, key=predicts_users.count)

        time = 0
        user = 0

        for predict in predicts:
            if int(predict / 5) == predicts_users_max:
                user = predicts_users_max
                if time == 0:
                    time = (predict % 5) + 1

    if True:
        time = 1
        user = el.r_0


    out.loc[i, 'predict_user'] = user_index[user]
    out.loc[i, 'predict_consumption'] = time

In [None]:
print(out.head())
print(user_index)

In [None]:
import matplotlib.pyplot as plt

users = out.predict_user.unique()

plt.plot(np.array(counts), label='counts')
for user in users:
    # plt.plot([e.get(user) for e in workloads], label=f'{user}')
    plt.plot([e.get(user) for e in workloads_predicted], label=f'{user}_p')

plt.legend()
plt.show()

In [None]:
out.to_csv('evaluate/02.csv', index=False)

I took around 30 issues, all assigned to user=ep. Using t_0, we get 19 different responsibles.
It seems unlikely that there are 19 (out of 99) alternatives to user=ep for just 30 issues.
Using t_0 simply takes the highest probability and does not take time consumption into account.

Changing this to use the whole top-10 and then taking the responsible that has been predicted most often,
I get 15 alternatives, which is still way too many.

The prediction was expected to show 5 - 8 alternatives.
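
The vote over the top-10 can be sketched as below. It assumes, as the `% 5` and `/ 5` operations in this notebook suggest, that a combined label encodes `user * 5 + time_bin`; the helper names and the example top-5 list are made up for illustration.

```python
from collections import Counter

def decode(label, n_bins=5):
    """Split a combined label into (user index, time bin 1..n_bins)."""
    user, time_bin = divmod(int(label), n_bins)
    return user, time_bin + 1

def majority_user(top_labels, n_bins=5):
    """Return the user predicted most often among the top labels,
    together with the time bin of that user's highest-ranked label."""
    decoded = [decode(label, n_bins) for label in top_labels]
    user, _ = Counter(u for u, _ in decoded).most_common(1)[0]
    time_bin = next(t for u, t in decoded if u == user)
    return user, time_bin

# Hypothetical top-5 labels, ranked from most to least probable.
print(decode(107))                            # (21, 3)
print(majority_user([107, 43, 40, 41, 361]))  # (8, 4)
```

Because the vote ignores the probabilities themselves, a user with several low-ranked labels can beat the single most probable one, which may contribute to the large number of alternatives.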

In [None]:
print(out.r_0.value_counts()[:10])

I'll try to go word-embedding style over the data. What we want is to over-fit the model.
Interesting topics: use the tool to search for similar issues.
Interesting topics: how would unsupervised clustering work combined with using statistics (probability distribution in clusters).

Removing outliers and just improving preprocessing may or may not be worth it.

### For Responsible
For a known responsible we get 84.4% accuracy on top-3.
Overall we get 73.79% on top-10.

### For Responsible-Overfitted
For a known responsible we get 81.8% accuracy on top-3.
Overall we get 65.12% on top-10.
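
For reference, top-k accuracy counts a prediction as correct when the true label is among the k highest-scoring classes. Below is a small NumPy sketch with made-up logits; the notebook itself relies on `tf.keras.metrics.SparseTopKCategoricalAccuracy`, and the `top_k_accuracy` helper is only an illustration.

```python
import numpy as np

def top_k_accuracy(logits, labels, k):
    """Fraction of rows whose true label is among the k highest scores."""
    top_k = np.argsort(-logits, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

# Hypothetical scores for 3 issues and 3 classes.
logits = np.array([
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
])
labels = np.array([1, 1, 2])

print(top_k_accuracy(logits, labels, k=1))
print(top_k_accuracy(logits, labels, k=2))
```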