Encode a CSV-like file into a sparse format for use as auxiliary feature input to a factorization machine (FM).
Factorization machine implementations commonly take libsvm-format input, and this repo was created to make experiments with that format more convenient.
(ref:https://towardsdatascience.com/factorization-machines-for-item-recommendation-with-implicit-feedback-data-5655a7c749db)
As the referenced article shows, the FM input for users and items is one-hot encoded, which produces a huge sparse matrix. The scikit-learn OneHotEncoder is not well suited to this kind of task.
Furthermore, we also want to feed short-text features or other auxiliary features into the FM. Those features can be of various types, e.g. categorical or numerical data, and we want to encode them all, so developing this tool was necessary.
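The one-hot encoding of user and item columns described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used by this repo: the function name `one_hot_libsvm` and the sample rows are hypothetical, and it assumes each distinct (column, value) pair gets its own feature index, starting at 1 as is conventional for libsvm files.

```python
def one_hot_libsvm(rows):
    """Turn (user, item, label) rows into libsvm-style lines.

    Hypothetical helper for illustration; every distinct (column, value)
    pair is assigned its own feature index the first time it is seen.
    """
    index = {}  # (column name, value) -> feature index
    lines = []
    for user, item, label in rows:
        feats = []
        for col, value in (("user", user), ("item", item)):
            key = (col, value)
            if key not in index:
                index[key] = len(index) + 1  # indices start at 1
            feats.append(index[key])
        feats.sort()  # libsvm lines list feature indices in ascending order
        lines.append(f"{label} " + " ".join(f"{i}:1" for i in feats))
    return lines, index


rows = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i1", 4)]
lines, index = one_hot_libsvm(rows)
for line in lines:
    print(line)
```

Only the non-zero entries are written, so each line stays short no matter how many distinct users and items the dataset contains.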
python3 encoderder.py -c [config]
encoderder supports a JSON config format, which looks like this:
    {
        "train": {
            "input": "./ml.csv",
            "output": "./train.txt",
            "cached": true,
            "seperator": ",",
            "header": true,
            "sparse": true,
            "target_columns": [
                {
                    "index": 0,
                    "type": "cat"
                },
                {
                    "index": 1,
                    "type": "cat"
                },
                {
                    "index": 2,
                    "type": "truth"
                }
            ]
        }
    }
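To make the structure of the config concrete, here is a small sketch that parses the example above with the standard `json` module and groups the target columns by type. How encoderder itself reads the config is not shown here, so the names `CONFIG_TEXT` and `load_train_config` are hypothetical and only illustrate the layout.

```python
import json

# The example config from above, inlined so the sketch is self-contained.
CONFIG_TEXT = """
{
    "train": {
        "input": "./ml.csv",
        "output": "./train.txt",
        "cached": true,
        "seperator": ",",
        "header": true,
        "sparse": true,
        "target_columns": [
            {"index": 0, "type": "cat"},
            {"index": 1, "type": "cat"},
            {"index": 2, "type": "truth"}
        ]
    }
}
"""


def load_train_config(text):
    """Parse the config text and group column indices by feature type."""
    config = json.loads(text)["train"]
    by_type = {}
    for col in config["target_columns"]:
        by_type.setdefault(col["type"], []).append(col["index"])
    return config, by_type


config, by_type = load_train_config(CONFIG_TEXT)
print(config["input"], config["seperator"], by_type)
```

Note that the key is spelled `seperator` in the config, so any code reading it has to use the same spelling.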
- input : path to the input file
- output : path to the output file in sparse format
- target_columns : the columns you want to encode; each entry gives a column index and one of the supported feature types
  - cat : categorical data
  - num : numerical data
  - truth : labeled data
- cached : turn it on to cache some data in memory, at the cost of higher memory usage
- seperator : the symbol that separates the columns
- header : turn it on to skip the first line when the dataset contains a header
- sparse : turn it on to write the output as a SciPy COO matrix in npz format
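The options above can be tied together with a short sketch that encodes `cat`, `num` and `truth` columns into a SciPy COO matrix and round-trips it through an npz file. This is an assumption about how such an encoder could work, not a description of encoderder's internals: the helper `encode_rows`, the sample rows, and the choice to give each numerical column a single matrix column are all hypothetical. `scipy.sparse.coo_matrix`, `save_npz` and `load_npz` are real SciPy calls.

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def encode_rows(rows, columns):
    """Build a COO feature matrix and a label vector from raw rows.

    columns: list of {"index": int, "type": "cat" | "num" | "truth"}.
    Hypothetical helper: "cat" values are one-hot encoded, each "num"
    column keeps its value in one matrix column, "truth" becomes the label.
    """
    feature_index = {}  # (column index, value or None) -> matrix column
    row_ids, col_ids, values, labels = [], [], [], []
    for r, row in enumerate(rows):
        for col in columns:
            raw = row[col["index"]]
            if col["type"] == "truth":
                labels.append(float(raw))
                continue
            if col["type"] == "cat":
                key, value = (col["index"], raw), 1.0
            else:  # "num"
                key, value = (col["index"], None), float(raw)
            j = feature_index.setdefault(key, len(feature_index))
            row_ids.append(r)
            col_ids.append(j)
            values.append(value)
    shape = (len(rows), len(feature_index))
    matrix = sparse.coo_matrix((values, (row_ids, col_ids)), shape=shape)
    return matrix, np.array(labels)


columns = [
    {"index": 0, "type": "cat"},
    {"index": 1, "type": "cat"},
    {"index": 2, "type": "num"},
    {"index": 3, "type": "truth"},
]
rows = [["u1", "i1", "0.5", "5"], ["u2", "i1", "1.5", "3"]]
matrix, labels = encode_rows(rows, columns)

# Save the matrix as npz and load it back to check the round trip.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.npz")
    sparse.save_npz(path, matrix)
    loaded = sparse.load_npz(path)

print(loaded.shape, labels)
```

The COO layout stores only (row, column, value) triplets, which is why it suits one-hot features where almost every entry is zero.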