Updated scraper_bot to handle scraping in chunks #2
Conversation
@@ -99,7 +99,8 @@ ipython_config.py
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
poetry.lock
I use Poetry as my Python package manager and I didn't want to include the lock file in this repo.
requirements.txt (outdated)
@@ -1,3 +1,4 @@
requests==2.31.0
datasets==2.14.5
Pillow==10.0.1
huggingface_hub
This is needed for the HfFileSystem.
scraper/scraper_bot.py (outdated)
@@ -205,51 +266,37 @@ def _get_messages(self, after_message_id: str) -> List[Dict[str, Any]]:

    return unique_list

def scrape(self, fetch_all: bool = False, push_to_hub: bool = True) -> Dataset:
This change is better viewed in split mode since it's really different
(force-pushed from 59da5e5 to 2387b53)
DISCORD_TOKEN=
DATASET_CHUNK_SIZE=300
This is derived empirically by looking at chunks from https://huggingface.co/datasets/laion/dalle-3-dataset/tree/main/data
Does this change as the dataset gets bigger? Wondering what effect changing it has.
Also I'd probably move this to config.json files because it doesn't need to be a secret
You're right, for some reason I thought that we would aggregate all of these datasets into one - that's not the case - so I'll change this. Thanks!
Looks pretty good overall. Is it currently working on your fork? I'd be curious to see how the data looks.
(force-pushed from 4919194 to 3ac0eb7)
Nice! How many files are supposed to be here? Maybe off by one on either the numerator or denominator?
@ZachNagengast I'm not sure - the autocreated chunks in https://huggingface.co/datasets/laion/dalle-3-dataset/tree/main/data are in this format, so I copied it
Oh you're right haha, one of the hardest problems in computer science 😂
Does it seem odd that the dataset is 100mb? The viewer is just showing a bit of text so I'm confused where the size is coming from.
(force-pushed from 1130b27 to c337ab8)
@ZachNagengast Update: For the equality, the datasets are equal!

import pandas as pd
from datasets import load_dataset

pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

laion = load_dataset('laion/gpt4v-dataset')['train'].to_pandas()
df = load_dataset('TwoAbove/LAION-discord-gpt4v')['train'].to_pandas()

def diff(first, second):
    return first[~first['message_id'].isin(second['message_id'])]

# Show the diff between the two datasets by comparing the message_id column
print(diff(laion, df))
print(diff(df, laion))
(force-pushed from c337ab8 to 1610364)
Here's my test dataset that works as expected: https://huggingface.co/datasets/TwoAbove/LAION-discord-gpt4v/tree/main/data
Currently testing with the dalle3 dataset
Using the same Python code,

import pandas as pd
from datasets import load_dataset

pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

laion = load_dataset('laion/dalle-3-dataset')['train'].to_pandas()
df = load_dataset('TwoAbove/test-dalle-3')['train'].to_pandas()

def diff(first, second):
    return first[~first['message_id'].isin(second['message_id'])]

# Show the diff between the two datasets by comparing the message_id column
print(diff(laion, df))
print(diff(df, laion))

here are the results. Looks like things are working as expected.
(force-pushed from 1610364 to c9cb368)
Sorry @ZachNagengast, I'll do regular commits and then squash them before merging.
requirements.txt (outdated)
@@ -1,3 +1,5 @@
requests==2.31.0
datasets==2.14.5
Pillow==10.0.1
huggingface_hub
Suggested change: huggingface_hub → huggingface_hub>=0.18

You need to pin to at least 0.18.0 to get preupload_lfs_files.
Hey @ZachNagengast, anything I can help with to get this merged?
Gonna try to get this done in the next couple of hours.
…/Discord-Scrapers into update-scraper-to-handle-chunks
@Wauplin I ended up just hardcoding the append logic here, happy to bring it over to datasets as well. Logic here does the following:
The part that enables this to run on GitHub Actions is streaming the data into a pandas dataframe without the images, which reduces the memory + storage requirements immensely but still lets us check against the full dataset.
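As an aside, a minimal sketch of that image-free load (the repo path and the idea of reading only message_id are assumptions for illustration, not the PR's actual code):

```python
import os

import pandas as pd
from huggingface_hub import HfFileSystem

fs = HfFileSystem(token=os.environ.get("HF_TOKEN"))

# Assumed chunk layout, mirroring the laion/dalle-3-dataset repo structure.
chunk_paths = fs.glob("datasets/laion/dalle-3-dataset/data/*.parquet")

# Read only the lightweight columns so the full dataset fits in a GitHub Actions runner,
# while still giving us every message_id to dedupe against.
frames = [
    pd.read_parquet(fs.open(path, "rb"), columns=["message_id"])
    for path in chunk_paths
]
current = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=["message_id"])
print(f"{len(current)} existing rows loaded without image payloads")
```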
Hi @ZachNagengast, I reviewed the PR (and left a few minor comments). Overall logic looks good to me. I still have a few comments/questions:

- Are you using preupload_lfs_files at all in the end? It doesn't seem to be the case from what I've read, but I wanted to check.
- IIUC, raw images are no longer stored on the HF Dataset, only the URL linking to them, right? (to save GitHub Actions memory). Is there a plan to scrape the raw images as well somewhere else (so embed_images=True)? From what I understood, this is where preupload_lfs_files would help keep the memory footprint low while uploading several parquet chunks.
- FYI, the "100 operations per commit" limit is purely empirical. I expect the limit to be higher for delete operations (since no LFS file has to be checked), so happy to get some feedback at some point if you run this scraper at large scale :)

Ping @lhoestq @mariosasko, maintainers of datasets, to get their opinion as well.
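For reference, a hedged sketch of the preupload_lfs_files pattern being discussed, following the documented huggingface_hub usage; the repo id and file names are placeholders, not this PR's code:

```python
from huggingface_hub import CommitOperationAdd, create_commit, preupload_lfs_files

repo_id = "username/some-dataset"  # placeholder repo

operations = []
for parquet_path in ["chunk_0.parquet", "chunk_1.parquet"]:  # illustrative local files
    op = CommitOperationAdd(path_in_repo=f"data/{parquet_path}", path_or_fileobj=parquet_path)
    # Upload the LFS payload immediately so its bytes can be freed from memory,
    # then bundle all chunks into a single commit at the end.
    preupload_lfs_files(repo_id, additions=[op], repo_type="dataset")
    operations.append(op)

create_commit(
    repo_id,
    operations=operations,
    commit_message="Append scraped chunks",
    repo_type="dataset",
)
```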
@@ -2,6 +2,7 @@
"base_url": "https://discord.com/api/v9",
"channel_id": "1158354590463447092",
"limit": 100,
"max_chunk_size": 300,
Is this the same "300" defined as DATASET_CHUNK_SIZE above? If yes, let's reuse it maybe?
The DATASET_CHUNK_SIZE env was my first iteration of the feature, but it makes more sense for it to be repo-dependent, so I think it should be deleted.
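For illustration, a hedged sketch of reading the repo-dependent chunk size from the config file instead of the environment; the key names follow the config.json hunk above, while the loader itself is an assumption:

```python
import json

def load_config(path: str = "config.json") -> dict:
    # config.json carries per-repo settings such as "max_chunk_size" (300 in the hunk above),
    # so nothing secret has to live in the environment.
    with open(path) as f:
        return json.load(f)

config = load_config()
max_chunk_size = config["max_chunk_size"]  # e.g. 300 rows per parquet chunk
```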
Thanks for the in-depth review! These comments are very helpful.

- I'm not using pre-upload anymore since I moved the upload step into the append function, which uses the HfFileSystem to upload. If pre-upload is recommended, I can take a look.
- We store raw images only for some datasets, based on the config. Here is one with images: https://huggingface.co/datasets/laion/dalle-3-dataset. I still need to get the README updated as well for the dataset viewer here.
)

def _get_chunk_names(self) -> None:
    fs = HfFileSystem(token=os.environ["HF_TOKEN"], skip_instance_cache=True)
It's better to define fs = HfFileSystem(token=os.environ["HF_TOKEN"], skip_instance_cache=True) only once, globally. It will cache some stuff internally (in memory), so hopefully running _get_chunk_names, _get_detailed_chunk_names and _append_chunk might run faster in some cases.
(Same for the other HfFileSystem definitions below.)
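A minimal sketch of that suggestion; the class, method body and repo path here are illustrative stand-ins, not the PR's actual code:

```python
import os

from huggingface_hub import HfFileSystem

# Created once at import time and reused by every helper below.
fs = HfFileSystem(token=os.environ["HF_TOKEN"])

class ScraperBot:  # illustrative stand-in for the PR's class
    fs_path = "datasets/TwoAbove/test-dalle-3/data"  # assumed repo layout

    def _get_chunk_names(self) -> list[str]:
        # Reuse the module-level filesystem instead of constructing a new one per call.
        return [path.split("/")[-1] for path in fs.ls(self.fs_path, detail=False)]
```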
> It will cache some stuff internally (in memory) so hopefully running _get_chunk_names, _get_detailed_chunk_names and _append_chunk might run faster in some cases.

(Actually, fsspec does cache the instance IIRC, but you need to remove skip_instance_cache.)
I was getting some issues with the cached instance not having the newest files; this seemed to solve the issue, but maybe I'm not understanding what the true problem was.
i.e. after committing a new file to the repo, it wasn't in _get_chunk_names until I redefined this (or ran the function a second time).
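As an aside, and only an assumption about how it maps onto this code: fsspec filesystems also expose invalidate_cache(), which drops the stale directory listing without recreating the instance. A rough sketch:

```python
# Hypothetical helper: keep one shared HfFileSystem, but clear its cached listing
# right after committing new chunk files so the next ls() sees them.
def refresh_listing(fs, repo_data_path: str) -> list[str]:
    fs.invalidate_cache(repo_data_path)          # forget cached ls() results for this path
    return fs.ls(repo_data_path, detail=False)   # re-list, now including fresh commits
```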
Left a question about streaming=true
df = pd.read_parquet(fs.open(f"{self.fs_path}/{chunk}", "rb"))
existing_message_ids = df['message_id'].tolist()
messages = [msg for msg in messages if msg.message_id not in existing_message_ids]
existing_message_ids = dataset["message_id"].tolist()
I added this chunk filtering because we can't guarantee in which chunk the latest messages will be. If we use the custom chunking, then we can guarantee that.
I was having trouble getting it to work properly. My thinking with this is that we'll know we're not adding duplicates and are simply adding the new data to the incomplete chunk.
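A hedged sketch of that dedup-then-append idea; the function and variable names are assumptions, not the PR's code:

```python
import pandas as pd

def filter_new_messages(scraped: pd.DataFrame, existing: pd.DataFrame) -> pd.DataFrame:
    # Keep only rows whose message_id has not already been written to any chunk.
    known_ids = set(existing["message_id"])
    return scraped[~scraped["message_id"].isin(known_ids)]
```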
)

# Load the current dataset without images initially, to figure out what we're working with
current_dataset, chunk_count = self._load_dataset(schema=schema)
We are loading with images, since schema = [f.name for f in fields(HFDatasetScheme)] is at the start of the scrape function. Or am I missing something?
Yea, I forgot to update this part for GitHub Actions (still need to test that), but this saved me time trying to figure it out 😂
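For illustration, a minimal sketch of how schema = [f.name for f in fields(HFDatasetScheme)] behaves; the field names other than message_id are invented here, not the PR's real schema:

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class HFDatasetScheme:  # stand-in definition; the real one lives in the repo
    message_id: str
    content: str
    image: Optional[bytes] = None

schema = [f.name for f in fields(HFDatasetScheme)]
print(schema)  # ['message_id', 'content', 'image'] -> the image column is included
```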
def _update_chunk(self, df: pd.DataFrame, chunk_num: int) -> None:
def _append_chunk(
    self, df: pd.DataFrame, mode: AppendMode = AppendMode.LATEST
I like the AppendMode idea!
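A guess at what an AppendMode switch could look like; purely illustrative, only LATEST appears in the diff above and the PR defines the real enum:

```python
from enum import Enum

class AppendMode(Enum):
    LATEST = "latest"  # append rows to the most recent, not-yet-full chunk
    NEW = "new"        # hypothetical: start a brand-new chunk file once a chunk is full
```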
Co-authored-by: Lucain <lucainp@gmail.com>
This is still a draft - I have a couple of issues to iron out, but I'm opening this PR so that y'all can take a look and provide feedback on the approach.
If it's beneficial, I can do a small write-up on what's happening in this new approach!
I've also added transactions so we won't hit limits, but I'm not sure if they work.