This tool analyzes event-log files, such as website logs, netflow data, or logs from a specific management system. Given an input file and some parameters, it produces embeddings representing the dataset.
python representation_presenter.py [options]
INPUT AND OUTPUT:
-i inputfile_name,input_type (for multifactor2vec, the type must be '3' and the input must be a TSV file)
type '1': pre-processed file, the training list for word2vec (txt)
type '2': vector file output by word2vec, keyed by token (tsv)
type '3': raw event-log file (tsv)
Default: type '3'
-o outputfile_name,output_type (not needed for multifactor2vec; all output files are generated automatically)
type '1': pre-processed file, the training list for word2vec (txt)
type '2': vector file output by word2vec, keyed by token (tsv)
type '3': 2D Vector file
type '4': All three types above
Default: type '3'
There must be no space between the file name and the file type.
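To illustrate the expected "name,type" argument shape, the pair could be split like this (the tool's actual parsing internals are an assumption):

```python
# Hypothetical parsing of a "filename,type" argument such as "-i enrollment_data.csv,3".
# rsplit on the last comma keeps any commas that appear inside the file name itself.
arg = "enrollment_data.csv,3"
name, ftype = arg.rsplit(",", 1)
print(name, ftype)  # enrollment_data.csv 3
```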
MODEL TYPE:
-z use_pytorch:
1: use PyTorch and train a multifactor2vec model
0: do not use PyTorch; train a course2vec model (choose Word2Vec or FastText with -x)
-x course2vec model to use
1: Word2Vec (default if -x is not given)
2: FastText
PARAMETERS FOR WORD2VEC:
-e epoch_num
Number of training iterations for word2vec. Default: 50.
-n negative (not needed for multifactor2vec)
Number of negative samples for word2vec. Default: 10.
-v vector_size
Embedding vector size for word2vec (size). Default: 100.
-w window_size
Context window size for word2vec (window). Default: 10.
-c min_count (not needed for multifactor2vec)
Minimum token count for word2vec (min_count).
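For readers familiar with gensim, the flags above line up roughly with Word2Vec keyword arguments (gensim 4.x names shown; how the tool wires them internally is an assumption):

```python
# Hypothetical mapping of the CLI flags to gensim 4.x Word2Vec kwargs;
# the tool's internal parameter names may differ.
w2v_kwargs = {
    "epochs": 50,        # -e, training iterations
    "negative": 10,      # -n, negative samples
    "vector_size": 100,  # -v, embedding size
    "window": 10,        # -w, context window size
}
# e.g. gensim.models.Word2Vec(sentences, **w2v_kwargs)
```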
KEYS:
-g group_by_key
Column used to group records (e.g., a user ID).
-s sort_by_key
Column used to sort records within each group (e.g., a time column).
-k token1,token2,token3... (no spaces)
Token columns to analyze.
-r factor1,factor2,factor3... (no spaces)
Factor columns: attributes of the tokens that are added to the word2vec model (only when building a multifactor2vec model).
If -r is given, softmax is used instead of negative sampling: -n is set to 0 and -pt is enabled automatically.
-m dim (not needed for multifactor2vec)
Dimension of the data points in the output file (after t-SNE).
OTHERS (not needed for multifactor2vec):
-dd
If set, duplicates are not removed when generating the training list.
-f featurefile,type
type determines how the feature file and the vector file are merged:
type 'inner': keep only rows present in both files (intersection)
type 'left' : keep only rows from the vector file
type 'right': keep only rows from the feature file
-l
If set, validation loss is calculated.
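The three merge modes behave like standard join types. A minimal sketch of their semantics using plain dicts keyed by token (the real tool merges TSV files; the values here are illustrative):

```python
# Illustrative data: embedding vectors and a feature column, keyed by token.
vectors = {"CS61A": [0.1, 0.2], "SOC1": [0.3, 0.4]}
features = {"CS61A": "Computer Science", "MATH1A": "Mathematics"}

# 'inner': keep only tokens present in both files.
inner = {k: (vectors[k], features[k]) for k in vectors.keys() & features.keys()}
# 'left': keep every row from the vector file; missing features become None.
left = {k: (v, features.get(k)) for k, v in vectors.items()}
# 'right': keep every row from the feature file; missing vectors become None.
right = {k: (vectors.get(k), f) for k, f in features.items()}

print(sorted(inner))  # ['CS61A']
```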
Example course2vec (FastText):
python3 representation_presenter.py -i enrollment_data.csv,3 -o outputft.tsv,2 -g 'student_id' -s 'semester' -k 'course_name' -z '0' -x '2'
Example multifactor-course2vec:
python3 representation_presenter.py -i enrollment_data.csv,3 -g 'student_id' -s 'semester' -k 'course_name' -z '1' -e '5' -r 'department'
- using "department" as a factor
- train for 5 epochs
| student_id | semester | course_name | department |
|---|---|---|---|
| 664564 | Fall 2017 | Computer Science 61A | Computer Science |
| 453656 | Spring 2020 | Sociology 1 | Sociology |
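The -g, -s, and -k flags describe how rows like these become per-user token sequences. A minimal sketch of that grouping step (column values are illustrative; a numeric time column is used, since the time column must be numerically sortable):

```python
# Sketch of grouping by -g (student_id), sorting by -s (a numeric time
# column), and extracting the -k token column (course_name).
from itertools import groupby

rows = [
    {"student_id": 664564, "time": 2, "course_name": "CS 61B"},
    {"student_id": 664564, "time": 1, "course_name": "CS 61A"},
    {"student_id": 453656, "time": 1, "course_name": "Sociology 1"},
]

# Sort by group key first so groupby sees contiguous groups, then by time.
rows.sort(key=lambda r: (r["student_id"], r["time"]))
sequences = {
    sid: [r["course_name"] for r in grp]
    for sid, grp in groupby(rows, key=lambda r: r["student_id"])
}
print(sequences[664564])  # ['CS 61A', 'CS 61B']
```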
Notes for using word2vec and fasttext:
- You must provide the output file name and output type (-o).
Notes for using multifactor2vec: (pytorch model will be used)
- The "time column" in the input data should be numerically sortable.
- The "factor column" in the input data may contain multiple values separated by ";". For example, multiple instructors for a course can appear in one cell: "Peter David; Micheal Jordan".
- The output files generated by multifactor2vec include:
  (1) data_matrix.pkl: each row represents one user's sequence (e.g., an enrollment sequence); a time slice within a row looks like [[token1, factor1, factor2, ...], [token2, factor1, factor2, ...], ...]
  (2) sampled_data.pkl: the sampled data used for training
  (3) dictionaries for tokens, factors, basic type, and time (token_id.pkl, factor1_id.pkl, factor2_id.pkl, ..., basictype_id.pkl, time_id.pkl); each pickle file contains two dictionaries, name-to-id and id-to-name (e.g., Instructor Name_id.pkl = {'instructor_id': dict, 'id_instructor': reversed_dict})
  (4) torch_model.pkl: the trained PyTorch model file
  (5) token_embeddings.tsv, factor1_embeddings.tsv, factor2_embeddings.tsv, ..., each with three columns: id, name, weight (e.g., Course Subject Short Nm_embeddings.tsv, Instructor Name_embeddings.tsv, Num_subject_embeddings.tsv)
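The two-dictionary pickle layout above can be sketched as follows: splitting a ";"-separated factor cell and building the paired name-to-id / id-to-name dictionaries (the field names follow the documented example; the rest is illustrative):

```python
# Sketch: split a multi-valued factor cell and build the forward/reverse
# id dictionaries stored in files like Instructor Name_id.pkl.
import io
import pickle

cell = "Peter David; Micheal Jordan"
instructors = [name.strip() for name in cell.split(";")]

name_to_id = {name: i for i, name in enumerate(instructors)}
id_to_name = {i: name for name, i in name_to_id.items()}

# Documented layout: one pickle holding both dictionaries.
payload = {"instructor_id": name_to_id, "id_instructor": id_to_name}
buf = io.BytesIO()            # in-memory stand-in for the .pkl file on disk
pickle.dump(payload, buf)
buf.seek(0)
restored = pickle.load(buf)
print(restored["id_instructor"][0])  # Peter David
```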