
A data science tool for learning neural representations from sequential data, visualizing the representations, and adding attribute information to aid in exploration and interpretation.


Nancy078/representation_presenter

This tool analyzes event-log files, such as website logs, netflow records, or the logs of a specific management system. Given an input file and a few parameters, it produces embeddings that represent the dataset.

SYNOPSIS

python representation_presenter.py [options]

OPTIONS

INPUT AND OUTPUT:
-i inputfile_name,input_type (when using multifactor2vec, the type must be 3 and the input a tsv file)
	type '1': Pre-processed file: the training list for word2vec (txt)
	type '2': Vector file produced by word2vec, keyed by token (tsv)
	type '3': Raw event-log file (tsv)
	Default: type '3'

-o outputfile_name,output_type (No need to define when using multifactor2vec; all files are generated automatically)
	type '1': Pre-processed file: the training list for word2vec (txt)
	type '2': Vector file produced by word2vec, keyed by token (tsv)
	type '3': 2D vector file
	type '4': All three types above
	Default: type '3'

There should be no space between file name and file type.


MODEL TYPE:
-z Using_pytorch:
	1: use PyTorch to train a multifactor-course2vec model
	0: do not use PyTorch; train a course2vec model (specify Word2Vec or FastText with -x)
-x course2vec model to use
	1: Word2Vec (default if -x is not given)
	2: FastText




PARAMETERS FOR WORD2VEC
-e epoch num
	Number of training iterations for word2vec. Default: 50.

-n negative (No need to define when using multifactor2vec)
	Number of negative samples for word2vec negative sampling. Default: 10.

-v vector size
	Vector size for word2vec (size). Default: 100.

-w window size
	Window size for word2vec (window). Default: 10.

-c min count (No need to define when using multifactor2vec)
	Minimum count for word2vec (min_count); tokens occurring fewer times are ignored.
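
For orientation, these flags map onto standard word2vec hyperparameters. A minimal sketch of an equivalent training call, assuming the gensim library (the tool's actual internals may differ):

    from gensim.models import Word2Vec

    # Two illustrative token sequences (one per user), standing in for the
    # pre-processed training list (type '1' above).
    train_list = [
        ["Computer Science 61A", "Sociology 1"],
        ["Sociology 1", "Computer Science 61A"],
    ]

    model = Word2Vec(
        sentences=train_list,
        epochs=50,        # -e
        negative=10,      # -n
        vector_size=100,  # -v (gensim 4.x name; older versions call it 'size')
        window=10,        # -w
        min_count=1,      # -c (set low here so the toy tokens are kept)
    )
    print(model.wv["Sociology 1"][:5])  # first five vector components

With -x 2, gensim's FastText class accepts the same parameters.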



KEYS
-g group_by_key
	Column to group records by; one sequence is built per group (e.g., 'student_id').

-s sort_by_key
	Column to sort records by within each group (e.g., 'semester').

-k token1,token2,token3... (no spaces)
	Columns whose values become the tokens for analysis (see the sketch after this section).

-r factor1,factor2,factor3... (no spaces)
	Factors that are attributes of tokens, added to the word2vec model (only when building a multifactor-course2vec model).
	If given, softmax is used instead of negative sampling: -n is set to 0 and -pt is set automatically.

-m dim (No need to define when using multifactor2vec)
	Dimension of the data points in the output file (after t-SNE).
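
Conceptually, -g, -s, and -k turn the raw event log into one token sequence per group: group by -g, order within each group by -s, and collect the values of the -k column. A minimal pandas sketch of that idea (illustrative only, not the tool's code):

    import pandas as pd

    df = pd.read_csv("enrollment_data.csv")  # the example dataset below

    # Group by student (-g), sort within each group by time (-s), and collect
    # the course names (-k) as that student's token sequence. Note that the
    # docs require a numerically sortable time column; labels like "Fall 2017"
    # would need to be encoded first.
    sequences = (
        df.sort_values("semester")
          .groupby("student_id")["course_name"]
          .apply(list)
    )
    print(sequences.head())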


OTHERS (No need to define when using multifactor2vec)

-dd
	If set, duplicates are not removed when generating the training list.

-f featurefile,type
	type determines how the feature file is merged with the vector file (see the sketch below):
	type 'inner': keep the intersection of the two
	type 'left' : keep all rows of the vector file
	type 'right': keep all rows of the feature file

-l
	Calculate validation loss.
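
The three -f merge types are ordinary join semantics. A minimal pandas sketch (file names and the join key are illustrative; the tool's own merge logic may differ):

    import pandas as pd

    vectors = pd.read_csv("vectors.tsv", sep="\t")    # hypothetical file names
    features = pd.read_csv("features.tsv", sep="\t")

    # 'inner' keeps the intersection, 'left' keeps every vector-file row,
    # 'right' keeps every feature-file row.
    merged = vectors.merge(features, how="inner", on="token")  # key is illustrative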

Examples

Example course2vec (FastText):

python3 representation_presenter.py -i enrollment_data.csv,3 -o outputft.tsv,2 -g 'student_id' -s 'semester' -k 'course_name' -z '0' -x '2'

Example multifactor-course2vec:

python3 representation_presenter.py -i enrollment_data.csv,3 -g 'student_id' -s 'semester' -k 'course_name' -z '1' -e '5' -r 'department'
  • uses "department" as a factor
  • trains for 5 epochs

Example dataset used: enrollment_data.csv

student_id  semester     course_name           department
664564      Fall 2017    Computer Science 61A  Computer Science
453656      Spring 2020  Sociology 1           Sociology
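
To try the examples, this sample can be written to disk with a few lines of Python (a test stub, not part of the tool):

    import csv

    rows = [
        ("664564", "Fall 2017", "Computer Science 61A", "Computer Science"),
        ("453656", "Spring 2020", "Sociology 1", "Sociology"),
    ]
    with open("enrollment_data.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["student_id", "semester", "course_name", "department"])
        writer.writerows(rows)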

Notes

Notes for using word2vec and FastText:

  1. You must provide the output file and output type (-o).

Notes for using multifactor2vec (the PyTorch model will be used):

  1. The "time column" in the input data should be numerically sortable.
  2. The "factor column" in the input data is allowed to have multiple values separated by ";". For example, multiple instructors for a course can appear in one cell: "Peter David; Michael Jordan".
  3. multifactor2vec generates the following output files:
     (1) data_matrix.pkl: each row represents one user's sequence (e.g., an enrollment sequence); a time slice within a row looks like [[token1, factor1, factor2, ...], [token2, factor1, factor2, ...], ...]
     (2) sampled_data.pkl: the data used for training
     (3) dictionaries of tokens, factors, basic type, and time (token_id.pkl, factor1_id.pkl, factor2_id.pkl, ..., basictype_id.pkl, time_id.pkl); each pickle file contains two dictionaries, name-to-id and id-to-name (e.g., Instructor Name_id.pkl = {'instructor_id': dict, 'id_instructor': reversed_dict})
     (4) torch_model.pkl: the trained PyTorch model
     (5) token_embeddings.tsv, factor1_embeddings.tsv, factor2_embeddings.tsv, ..., each with three columns: id, name, weight (e.g., Course Subject Short Nm_embeddings.tsv, Instructor Name_embeddings.tsv, Num_subject_embeddings.tsv)
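
A minimal sketch of reading those outputs back, using the file names and column layout listed above (the exact dictionary keys inside each pickle vary per factor, as in the Instructor Name example):

    import pickle
    import pandas as pd

    # Each *_id.pkl holds a name-to-id dictionary and its reverse.
    with open("token_id.pkl", "rb") as f:
        token_maps = pickle.load(f)

    # Embedding tables have three columns: id, name, weight.
    token_emb = pd.read_csv("token_embeddings.tsv", sep="\t")
    print(token_emb.head())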
