This tool analyzes event-log files, such as website logs, netflow data, or logs from a specific management system. Given an input file and some parameters, it produces embeddings representing the dataset.
python representation_presenter.py [options]
INPUT AND OUTPUT:
-i inputfile_name,input_type (for multifactor2vec, the type must be '3' and the input must be a TSV file)
type '1': pre-processed file, the training list for word2vec (txt)
type '2': vector file output by word2vec, keyed by token (tsv)
type '3': raw event-log file (tsv)
Default: type '3'
-o outputfile_name,output_type (not needed for multifactor2vec; all output files are generated automatically)
type '1': pre-processed file, the training list for word2vec (txt)
type '2': vector file output by word2vec, keyed by token (tsv)
type '3': 2D Vector file
type '4': All three types above
Default: type '3'
There must be no space between the file name and the file type.
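To illustrate the expected "name,type" argument shape, the pair could be split like this (the tool's actual parsing internals are an assumption):

```python
# Hypothetical parsing of a "filename,type" argument such as "-i enrollment_data.csv,3".
# rsplit on the last comma keeps any commas that appear inside the file name itself.
arg = "enrollment_data.csv,3"
name, ftype = arg.rsplit(",", 1)
print(name, ftype)  # enrollment_data.csv 3
```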
MODEL TYPE:
-z use_pytorch:
1: use PyTorch and train a multifactor2vec model
0: do not use PyTorch; train a course2vec model (choose Word2Vec or FastText with -x)
-x course2vec model to use
1: Word2Vec (default if -x is not given)
2: FastText
PARAMETERS FOR WORD2VEC:
-e epoch_num
Number of training iterations for word2vec. Default: 50.
-n negative (not needed for multifactor2vec)
Number of negative samples for word2vec. Default: 10.
-v vector_size
Embedding vector size for word2vec (size). Default: 100.
-w window_size
Context window size for word2vec (window). Default: 10.
-c min_count (not needed for multifactor2vec)
Minimum token count for word2vec (min_count).
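For readers familiar with gensim, the flags above line up roughly with Word2Vec keyword arguments (gensim 4.x names shown; how the tool wires them internally is an assumption):

```python
# Hypothetical mapping of the CLI flags to gensim 4.x Word2Vec kwargs;
# the tool's internal parameter names may differ.
w2v_kwargs = {
    "epochs": 50,        # -e, training iterations
    "negative": 10,      # -n, negative samples
    "vector_size": 100,  # -v, embedding size
    "window": 10,        # -w, context window size
}
# e.g. gensim.models.Word2Vec(sentences, **w2v_kwargs)
```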
KEYS:
-g group_by_key
Column used to group records (e.g., a user ID).
-s sort_by_key
Column used to sort records within each group (e.g., a time column).
-k token1,token2,token3... (no spaces)
Token columns to analyze.
-r factor1,factor2,factor3... (no spaces)
Factor columns: attributes of the tokens that are added to the word2vec model (only when building a multifactor2vec model).
If -r is given, softmax is used instead of negative sampling: -n is set to 0 and -pt is enabled automatically.
-m dim (not needed for multifactor2vec)
Dimension of the data points in the output file (after t-SNE).
OTHERS (not needed for multifactor2vec):
-dd
If set, duplicates are not removed when generating the training list.
-f featurefile,type
type determines how the feature file and the vector file are merged:
type 'inner': keep only rows present in both files (intersection)
type 'left' : keep only rows from the vector file
type 'right': keep only rows from the feature file
-l
If set, validation loss is calculated.
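The three merge modes behave like standard join types. A minimal sketch of their semantics using plain dicts keyed by token (the real tool merges TSV files; the values here are illustrative):

```python
# Illustrative data: embedding vectors and a feature column, keyed by token.
vectors = {"CS61A": [0.1, 0.2], "SOC1": [0.3, 0.4]}
features = {"CS61A": "Computer Science", "MATH1A": "Mathematics"}

# 'inner': keep only tokens present in both files.
inner = {k: (vectors[k], features[k]) for k in vectors.keys() & features.keys()}
# 'left': keep every row from the vector file; missing features become None.
left = {k: (v, features.get(k)) for k, v in vectors.items()}
# 'right': keep every row from the feature file; missing vectors become None.
right = {k: (vectors.get(k), f) for k, f in features.items()}

print(sorted(inner))  # ['CS61A']
```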
Example course2vec (FastText):
python3 representation_presenter.py -i enrollment_data.csv,3 -o outputft.tsv,2 -g 'student_id' -s 'semester' -k 'course_name' -z '0' -x '2'
Example multifactor-course2vec:
python3 representation_presenter.py -i enrollment_data.csv,3 -g 'student_id' -s 'semester' -k 'course_name' -z '1' -e '5' -r 'department'
- using "department" as a factor
- train for 5 epochs
| student_id | semester | course_name | department |
|---|---|---|---|
| 664564 | Fall 2017 | Computer Science 61A | Computer Science |
| 453656 | Spring 2020 | Sociology 1 | Sociology |
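The -g, -s, and -k flags describe how rows like these become per-user token sequences. A minimal sketch of that grouping step (column values are illustrative; a numeric time column is used, since the time column must be numerically sortable):

```python
# Sketch of grouping by -g (student_id), sorting by -s (a numeric time
# column), and extracting the -k token column (course_name).
from itertools import groupby

rows = [
    {"student_id": 664564, "time": 2, "course_name": "CS 61B"},
    {"student_id": 664564, "time": 1, "course_name": "CS 61A"},
    {"student_id": 453656, "time": 1, "course_name": "Sociology 1"},
]

# Sort by group key first so groupby sees contiguous groups, then by time.
rows.sort(key=lambda r: (r["student_id"], r["time"]))
sequences = {
    sid: [r["course_name"] for r in grp]
    for sid, grp in groupby(rows, key=lambda r: r["student_id"])
}
print(sequences[664564])  # ['CS 61A', 'CS 61B']
```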
Notes for using word2vec and fasttext:
- You must provide the output file name and output type (-o).
Notes for using multifactor2vec: (pytorch model will be used)
- The "time column" in the input data should be numerically sortable.
- The "factor column" in the input data may contain multiple values separated by ";". For example, multiple instructors for a course can appear in one cell: "Peter David; Micheal Jordan".
- The output files generated by multifactor2vec include:
  (1) data_matrix.pkl: each row represents one user's sequence (e.g., an enrollment sequence); a time slice within a row looks like [[token1, factor1, factor2, ...], [token2, factor1, factor2, ...], ...]
  (2) sampled_data.pkl: the sampled data used for training
  (3) dictionaries for tokens, factors, basic type, and time (token_id.pkl, factor1_id.pkl, factor2_id.pkl, ..., basictype_id.pkl, time_id.pkl); each pickle file contains two dictionaries, name-to-id and id-to-name (e.g., Instructor Name_id.pkl = {'instructor_id': dict, 'id_instructor': reversed_dict})
  (4) torch_model.pkl: the trained PyTorch model file
  (5) token_embeddings.tsv, factor1_embeddings.tsv, factor2_embeddings.tsv, ..., each with three columns: id, name, weight (e.g., Course Subject Short Nm_embeddings.tsv, Instructor Name_embeddings.tsv, Num_subject_embeddings.tsv)
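The two-dictionary pickle layout above can be sketched as follows: splitting a ";"-separated factor cell and building the paired name-to-id / id-to-name dictionaries (the field names follow the documented example; the rest is illustrative):

```python
# Sketch: split a multi-valued factor cell and build the forward/reverse
# id dictionaries stored in files like Instructor Name_id.pkl.
import io
import pickle

cell = "Peter David; Micheal Jordan"
instructors = [name.strip() for name in cell.split(";")]

name_to_id = {name: i for i, name in enumerate(instructors)}
id_to_name = {i: name for name, i in name_to_id.items()}

# Documented layout: one pickle holding both dictionaries.
payload = {"instructor_id": name_to_id, "id_instructor": id_to_name}
buf = io.BytesIO()            # in-memory stand-in for the .pkl file on disk
pickle.dump(payload, buf)
buf.seek(0)
restored = pickle.load(buf)
print(restored["id_instructor"][0])  # Peter David
```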