Use merlin-dataloader package #845

Merged on Dec 9, 2022 (33 commits). Showing changes from 19 commits.
fd4b593
Use merlin-dataloader package
edknv Nov 2, 2022
081331a
remove torch.dataset in favor of merlin.loader.torch
edknv Nov 2, 2022
5946396
Merge branch 'main' into merlin_dataloader
edknv Nov 2, 2022
c1c607d
Merge branch 'main' into merlin_dataloader
edknv Nov 3, 2022
4e2bb47
Merge branch 'main' into merlin_dataloader
edknv Nov 8, 2022
315c7d3
update dressipi notebook
edknv Nov 9, 2022
00ca57b
Merge branch 'main' into merlin_dataloader
edknv Nov 9, 2022
1c2d424
Merge branch 'main' into merlin_dataloader
edknv Nov 9, 2022
3e6bf94
minor clean up
edknv Nov 9, 2022
d9f76d5
Merge branch 'main' into merlin_dataloader
edknv Nov 10, 2022
bc30490
Merge branch 'main' into merlin_dataloader
edknv Nov 19, 2022
2091b63
Completely removes models DataLoader
edknv Nov 20, 2022
eeee879
Installs merlin-dataloader in github actions
edknv Nov 20, 2022
f7212c0
Adds back the stop method
edknv Nov 20, 2022
72e587c
Merge branch 'main' into merlin_dataloader
edknv Nov 29, 2022
1cc9a7b
dataloader can produce sparse tensors using value counts
edknv Nov 29, 2022
6f4c5db
remove data files
edknv Nov 29, 2022
d9b8fb0
fix torch tests
edknv Nov 29, 2022
09ac913
add missing target to dlrm test
edknv Nov 29, 2022
070a8d8
use loader.peek()
edknv Nov 30, 2022
1a89c9f
add some comments to help understand horovod tests
edknv Dec 1, 2022
02a54d8
make sparse tensors optional
edknv Dec 1, 2022
cc5893f
cleanup
edknv Dec 1, 2022
004d52f
fix spelling
edknv Dec 1, 2022
84218a0
Merge branch 'main' into merlin_dataloader
edknv Dec 1, 2022
25d61d1
fix merge
edknv Dec 1, 2022
d6b123a
Merge branch 'main' into merlin_dataloader
edknv Dec 1, 2022
d6de79d
replace while loop with for loop in horovod test
edknv Dec 1, 2022
93706b6
use loader context manager
edknv Dec 1, 2022
16adc59
Update according to dataloader changes #80
edknv Dec 6, 2022
4701675
restore tox.ini
edknv Dec 6, 2022
fb1285b
restore gh workflow
edknv Dec 6, 2022
705dc6b
revert generator changes
edknv Dec 6, 2022
2 changes: 2 additions & 0 deletions .github/workflows/tensorflow.yml
@@ -43,6 +43,7 @@ jobs:
           fi
           pip install "pandas>=1.2.0,<1.4.0dev0"
           pip install "NVTabular@git+https://github.com/NVIDIA-Merlin/NVTabular.git@$branch"
+          pip install "merlin-dataloader@git+https://github.com/NVIDIA-Merlin/dataloader.git@$branch"
           pip install "merlin-core@git+https://github.com/NVIDIA-Merlin/core.git@$branch"
       - name: Install dependencies
         run: |
@@ -108,6 +109,7 @@ jobs:
           fi
           pip install "pandas>=1.2.0,<1.4.0dev0"
           pip install "NVTabular@git+https://github.com/NVIDIA-Merlin/NVTabular.git@$branch"
+          pip install "merlin-dataloader@git+https://github.com/NVIDIA-Merlin/dataloader.git@$branch"
           pip install "merlin-core@git+https://github.com/NVIDIA-Merlin/core.git@$branch"
       - name: Install dependencies
         run: |
@@ -968,8 +968,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "loader = mm.Loader(train, batch_size=BATCH_SIZE, transform=mm.ToTarget(train.schema, \"purchase_id_first\", one_hot=True), shuffle = False)\n",
-    "val_loader = mm.Loader(valid, batch_size=BATCH_SIZE, transform=mm.ToTarget(train.schema, \"purchase_id_first\", one_hot=True), shuffle=False)"
+    "loader = mm.Loader(train, batch_size=BATCH_SIZE, shuffle=False).map(mm.ToTarget(train.schema, \"purchase_id_first\", one_hot=True))\n",
+    "val_loader = mm.Loader(valid, batch_size=BATCH_SIZE, shuffle=False).map(mm.ToTarget(train.schema, \"purchase_id_first\", one_hot=True))"
    ]
   },
   {
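Note on the hunk above: the target transform moves from a `transform=` constructor argument to a chained `.map(...)` call. A minimal pure-Python sketch of that chaining pattern follows. `ToyLoader` and `to_target` are hypothetical stand-ins written for illustration; they are not the merlin-dataloader implementation, and its real behavior may differ.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Batch = Tuple[Dict[str, list], object]


class ToyLoader:
    """Hypothetical stand-in for a batch loader; not merlin.loader."""

    def __init__(self, rows: List[dict], batch_size: int):
        self.rows = rows
        self.batch_size = batch_size
        self._transforms: List[Callable] = []

    def map(self, fn: Callable) -> "ToyLoader":
        # Register a per-batch transform and return self so calls can chain.
        self._transforms.append(fn)
        return self

    def __iter__(self) -> Iterator[Batch]:
        for start in range(0, len(self.rows), self.batch_size):
            chunk = self.rows[start:start + self.batch_size]
            # Turn a list of row dicts into a dict of column lists.
            inputs = {key: [row[key] for row in chunk] for key in chunk[0]}
            batch = (inputs, None)
            # Apply registered transforms in the order they were added.
            for fn in self._transforms:
                batch = fn(*batch)
            yield batch


def to_target(column: str) -> Callable:
    """Build a transform that moves `column` from the inputs to the target."""
    def transform(inputs, targets):
        remaining = dict(inputs)  # copy so the caller's dict is untouched
        target = remaining.pop(column)
        return remaining, target
    return transform


rows = [{"item_id": i, "purchase_id_first": i % 2} for i in range(4)]
loader = ToyLoader(rows, batch_size=2).map(to_target("purchase_id_first"))
batches = list(loader)
```

In this sketch the loader constructor only knows about batching, and each transform is a plain function of `(inputs, targets)` that can be added or stacked later.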
@@ -1546,9 +1546,13 @@
     }
    ],
    "source": [
+    "def as_ragged(inputs, targets):\n",
+    "    _as_ragged = mm.ListToRagged()\n",
+    "    return _as_ragged(inputs), targets\n",
+    "\n",
     "history = model_bi_lstm.fit(\n",
-    "    loader,\n",
-    "    validation_data=val_loader,\n",
+    "    loader.map(as_ragged),\n",
+    "    validation_data=val_loader.map(as_ragged),\n",
     "    epochs=EPOCHS,\n",
     ")"
    ]
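Note on the hunk above: `as_ragged` wraps a transform that only touches the inputs so that it fits the `(inputs, targets)` signature expected by `.map(...)`. The sketch below shows that wrapping in plain Python. The helper names `inputs_only` and `to_nested` are hypothetical, and the `(flat_values, row_lengths)` representation of a list column is an assumption made for illustration; this diff does not show how `mm.ListToRagged` represents or converts its inputs.

```python
from typing import Callable, Dict


def inputs_only(fn: Callable[[Dict], Dict]) -> Callable:
    """Lift a function of `inputs` into a function of `(inputs, targets)`."""
    def wrapped(inputs, targets):
        # Targets pass through unchanged.
        return fn(inputs), targets
    return wrapped


def to_nested(inputs: Dict) -> Dict:
    """Toy ragged conversion (assumed layout): (flat_values, row_lengths) -> rows."""
    out = {}
    for name, value in inputs.items():
        if isinstance(value, tuple):
            flat, lengths = value
            rows, start = [], 0
            for n in lengths:
                rows.append(flat[start:start + n])
                start += n
            out[name] = rows
        else:
            # Scalar columns are left as they are.
            out[name] = value
    return out


as_nested = inputs_only(to_nested)
inputs = {"item_id_list": ([1, 2, 3, 4, 5], [2, 3]), "user_id": [7, 8]}
new_inputs, new_targets = as_nested(inputs, [0, 1])
```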
17 changes: 16 additions & 1 deletion merlin/datasets/synthetic.py
@@ -25,7 +25,7 @@
 
 import merlin.io
 from merlin.models.utils import schema_utils
-from merlin.schema import Schema, Tags
+from merlin.schema import ColumnSchema, Schema, Tags
 from merlin.schema.io.tensorflow_metadata import TensorflowMetadata
 
 LOG = logging.getLogger("merlin-models")
@@ -116,6 +116,21 @@ def generate_data(
     else:
         raise ValueError(f"Unknown input type: {type(input)}")
 
+    for col in schema.column_names:
+        if not schema[col].is_list:
+            continue
+        new_properties = schema[col].properties
+        new_properties["value_count"] = {"min": min_session_length}
+        if max_session_length:
+            new_properties["value_count"]["max"] = max_session_length
+        schema[col] = ColumnSchema(
+            name=schema[col].name,
+            tags=schema[col].tags,
+            properties=new_properties,
+            dtype=schema[col].dtype,
+            is_list=True,
+        )
+
     df = generate_user_item_interactions(
         schema, num_rows, min_session_length, max_session_length, device=device
     )
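Note on the hunk above: for every list column, the loop writes a `value_count` property with a `min` and, when `max_session_length` is set, a `max`. The sketch below reproduces that logic on a hypothetical `ToyColumn` dataclass instead of `merlin.schema.ColumnSchema`. One deliberate difference, labeled here as a design choice of the sketch and not the PR's behavior: it copies the properties dict before editing it, so the original column is left unchanged.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class ToyColumn:
    """Hypothetical stand-in for a column schema; not merlin.schema.ColumnSchema."""
    name: str
    is_list: bool = False
    properties: dict = field(default_factory=dict)


def with_value_counts(
    columns: Dict[str, ToyColumn], min_len: int, max_len: Optional[int] = None
) -> Dict[str, ToyColumn]:
    """Return columns where each list column carries a value_count property."""
    updated = {}
    for name, col in columns.items():
        if not col.is_list:
            # Scalar columns are passed through untouched.
            updated[name] = col
            continue
        props = dict(col.properties)  # copy (sketch-only choice, see lead-in)
        props["value_count"] = {"min": min_len}
        if max_len:
            props["value_count"]["max"] = max_len
        updated[name] = replace(col, properties=props)
    return updated


columns = {
    "user_id": ToyColumn("user_id"),
    "item_id_list": ToyColumn(
        "item_id_list", is_list=True, properties={"domain": {"min": 1}}
    ),
}
bounded = with_value_counts(columns, min_len=2, max_len=10)
unbounded = with_value_counts(columns, min_len=2)
```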