embeddings.json #7

Open
stephen-hayne opened this issue May 12, 2022 · 14 comments
stephen-hayne commented May 12, 2022

I'm trying to reproduce your results (like another poster here)...

Perhaps a silly question, but after downloading the HDFS and BGL datasets and running them through Drain, I'm now getting the error below. Can you advise how/where to get your "embeddings.json" file?

python3 main_run.py --folder=hdfs/ --log_file=HDFS.log --dataset_name=hdfs --device=cpu --model_name=deeplog --window_type=session --sample=sliding_window --is_logkey --train_size=0.4 --train_ratio=1 --valid_ratio=0.1 --test_ratio=1 --max_epoch=100 --n_warm_up_epoch=0 --n_epochs_stop=10 --batch_size=1024 --num_candidates=70 --history_size=10 --lr=0.001 --accumulation_step=5 --session_level=hour --window_size=50 --step_size=50 --output_dir=experimental_results/deeplog/session/cd2 --is_process
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Loading ./dataset/hdfs/HDFS.log_structured.csv
575061it [00:00, 1983685.17it/s]
11175629it [00:19, 566251.66it/s]
Save options parameters
vocab size 20
save vocab in experimental_results/deeplog/session/cd2hdfs/deeplog_vocab.pkl
Loading vocab
20
Loading train dataset

Traceback (most recent call last):
  File "main_run.py", line 213, in <module>
    main()
  File "main_run.py", line 195, in main
    run_deeplog(options)
  File "/stephen/LogADEmpirical/logadempirical/deeplog.py", line 26, in run_deeplog
    Trainer(options).start_train()
  File "/stephen/LogADEmpirical/logadempirical/logdeep/tools/train.py", line 101, in __init__
    train_logs, train_labels = sliding_window(data,
  File "/stephen/LogADEmpirical/logadempirical/logdeep/dataset/sample.py", line 108, in sliding_window
    event2semantic_vec = read_json(os.path.join(data_dir, e_name))
  File "/stephen/LogADEmpirical/logadempirical/logdeep/dataset/sample.py", line 14, in read_json
    with open(filename, 'r') as load_f:
FileNotFoundError: [Errno 2] No such file or directory: './dataset/hdfs/embeddings.json'
@vanhoanglepsa (Collaborator)

Hi,
Currently, we adopt LogRobust to generate the embedding file. For now, it isn't included in this repository; we will try to start updating this part next week.
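
For context while that code is pending: LogRobust builds a semantic vector for each log template by tokenizing the template text, looking up pretrained FastText word vectors, and aggregating them with TF-IDF weights. A minimal sketch of that idea follows; every name below is illustrative, not this repository's API.

import math
import re
from collections import Counter

import numpy as np

def build_idf(templates):
    """idf(tok) = log(N / df(tok)), computed over the template corpus."""
    n = len(templates)
    df = Counter()
    for t in templates:
        df.update(set(re.findall(r'[a-zA-Z]+', t.lower())))
    return {tok: math.log(n / d) for tok, d in df.items()}

def template_embedding(template, word_vecs, idf, dim=300):
    """TF-IDF-weighted average of pretrained word vectors for one template.

    word_vecs: token -> np.ndarray, e.g. loaded from a FastText .vec file
    idf: token -> inverse document frequency from build_idf
    """
    tokens = re.findall(r'[a-zA-Z]+', template.lower())
    if not tokens:
        return np.zeros(dim)
    vec, weight_sum = np.zeros(dim), 0.0
    for tok, count in Counter(tokens).items():
        if tok in word_vecs:
            w = (count / len(tokens)) * idf.get(tok, 1.0)
            vec += w * word_vecs[tok]
            weight_sum += w
    return vec / weight_sum if weight_sum > 0 else vec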

@stephen-hayne (Author)

I have read "Robust Log-Based Anomaly Detection on Unstable Log Data" and "Log-based Anomaly Detection Without Log Parsing" with interest (as well as several of the others in the citations).

Will LogRobust be put on GitHub, or just the data you generated?

@vanhoanglepsa (Collaborator)

We will add the code to generate embeddings to this repository, not only the generated data.


X-zhihao commented Aug 13, 2022

How can we get this HDFS.log_structured.csv?
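
For anyone else stuck on this step: HDFS.log_structured.csv is what Drain produces from the raw HDFS.log. A minimal sketch using the logpai/logparser package; the log format and masking regexes follow logparser's HDFS demo, and the paths are assumptions matching this thread.

from logparser import Drain  # from the logpai/logparser project

input_dir = './dataset/hdfs/'   # directory containing the raw HDFS.log
output_dir = './dataset/hdfs/'  # where HDFS.log_structured.csv is written
log_file = 'HDFS.log'
# Header layout of an HDFS log line, as in logparser's HDFS benchmark
log_format = '<Date> <Time> <Pid> <Level> <Component>: <Content>'
# Regexes that mask variable fields (block IDs, IP:port) before clustering
regex = [r'blk_(|-)[0-9]+', r'(\d+\.){3}\d+(:\d+)?']

parser = Drain.LogParser(log_format, indir=input_dir, outdir=output_dir,
                         depth=4, st=0.5, rex=regex)
parser.parse(log_file)  # writes HDFS.log_structured.csv and *_templates.csv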


souravs17031999 commented Jan 22, 2023

@vanhoanglepsa is the code updated to generate embeddings for generic log data?
@stephen-hayne were you able to resolve this issue? I am having the same issue.

@stephen-hayne (Author)

@souravs17031999 No, this issue is not resolved.
@vanhoanglepsa Can you please help us to reproduce your work?


pupuu555 commented Apr 7, 2023

Hi, have you added the code to generate embeddings to this repository? I haven't found the file. Could you please tell me how to generate the embeddings, or share your embeddings.json? Thank you so much!!


stephen-hayne commented Apr 7, 2023 via email

Yes - I also couldn't find the file you mentioned... I have since generated embedding.json successfully with this code. Hope it helps!
https://github.com/xichie/LogADEmpirical/blob/master/generate_template_embedding.py


xichie commented Apr 8, 2023

Hi, the following is the code I used to generate embeddings.json. Hope it helps!

from logadempirical.PLELog.data.Embedding import *
from logadempirical.PLELog.data.DataLoader import *
import logging
import json
import os

import numpy as np

class NumpyEncoder(json.JSONEncoder):
    """ Special json encoder for numpy types """
    def default(self, obj):
        if isinstance(obj, (np.int_, np.intc, np.intp, np.int8,
                            np.int16, np.int32, np.int64, np.uint8,
                            np.uint16, np.uint32, np.uint64)):
            return int(obj)
        elif isinstance(obj, (np.float_, np.float16, np.float32,
                              np.float64)):
            return float(obj)
        elif isinstance(obj, (np.ndarray,)):
            return obj.tolist()
        return json.JSONEncoder.default(self, obj)

# Specify logger
logger = logging.getLogger('embedding')
logger.setLevel(logging.INFO)

dataset = 'bgl'
save_path = './dataset/bgl'
templatesDir = './dataset/bgl'
log_file = 'BGL_all.log'

# Map log IDs to templates, then build the template word embeddings
# (this step produces templates_BGL.vec under save_path)
logID2Temp, templates = load_templates_from_structured(templatesDir, logger, dataset,
                                                       log_file=log_file)
nlp_emb_mergeTemplateEmbeddings_BGL(save_path, templates, dataset, logger)

# Read the .vec file into memory once: a file handle is exhausted after
# one pass, so re-reading it inside a loop would yield nothing
with open(os.path.join(save_path, 'templates_BGL.vec'), 'r', encoding='utf-8') as reader:
    lines = reader.readlines()

# First line holds "<vocabSize> <embedSize>"; each following line is
# "<template_word> <v1> ... <v_embedSize>"
vocabSize, embedSize = [int(x) for x in lines[0].strip().split()]

# First pass: assign each log ID the embedding of its template
templateVocab = {}
for line in lines[1:]:
    items = line.strip().split()
    if len(items) != embedSize + 1:
        continue
    template_word = items[0]
    template_embedding = np.asarray(items[1:], dtype=np.float64)
    for logID, temp in logID2Temp.items():
        if temp == template_word:
            templateVocab[logID] = template_embedding

# Second pass for duplicated templates: retry any log ID that did not
# receive an embedding in the first pass
replica_logIDs = [logID for logID in logID2Temp if logID not in templateVocab]
for logID in replica_logIDs:
    temp = logID2Temp[logID]
    for line in lines[1:]:
        items = line.strip().split()
        if len(items) != embedSize + 1:
            continue
        template_word = items[0]
        template_embedding = np.asarray(items[1:], dtype=np.float64)
        if temp == template_word:
            templateVocab[logID] = template_embedding

with open(os.path.join(save_path, 'embeddings.json'), 'w') as writer:
    json.dump(templateVocab, writer, cls=NumpyEncoder)
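
Before rerunning main_run.py, a quick sanity check on the generated file can save a cycle. This assumes, as in the script above, that the JSON maps each log key to one fixed-length vector:

import json

# Load the generated embeddings and report basic shape information.
with open('./dataset/bgl/embeddings.json', 'r') as f:
    emb = json.load(f)

print(f'{len(emb)} log keys embedded')
dims = {len(vec) for vec in emb.values()}
print(f'embedding dimension(s): {dims}')  # expect one size, e.g. {300}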

@pupuu555

> Yes - I also couldn't find the file you mentioned... I have since generated embedding.json successfully with this code. Hope it helps!
> https://github.com/xichie/LogADEmpirical/blob/master/generate_template_embedding.py

Thank you so much!!!

@pupuu555

> Hi, the following is the code I used to generate embeddings.json. Hope it helps!
> […]

Thank you so much!!! Good people deserve a lifetime of peace!

@pupuu555

> Hi, the following is the code I used to generate embeddings.json. Hope it helps!
> […]

Hi, when I ran the file you gave me, I hit a new issue: FileNotFoundError: [Errno 2] No such file or directory: 'dataset/nlp-word.vec'. How can I get nlp-word.vec? I don't see a way to generate this file in the code.

@sailormoon-c

> Yes - I also couldn't find the file you mentioned... I have since generated embedding.json successfully with this code. Hope it helps!
> https://github.com/xichie/LogADEmpirical/blob/master/generate_template_embedding.py

Could I add you on WeChat? This project has been driving me up the wall lately. Please, please - my WeChat ID is RainyloveStatic.

xichie commented Apr 16, 2023

> […]

You can download nlp-word.vec from:
https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
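
A hedged sketch of fetching that archive and placing the file where the traceback above looks for it; the zip member name comes from FastText's published release, and 'dataset/nlp-word.vec' is simply the path the error reported:

import io
import os
import urllib.request
import zipfile

URL = ('https://dl.fbaipublicfiles.com/fasttext/vectors-english/'
       'wiki-news-300d-1M.vec.zip')
os.makedirs('dataset', exist_ok=True)

# Note: the zip is large (~680 MB) and is held in memory here for brevity.
print('downloading...')
data = urllib.request.urlopen(URL).read()
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    zf.extract('wiki-news-300d-1M.vec', 'dataset')

# Rename to the file name the embedding code expects.
os.rename(os.path.join('dataset', 'wiki-news-300d-1M.vec'),
          os.path.join('dataset', 'nlp-word.vec'))
print('saved dataset/nlp-word.vec')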
