
Updated scraper_bot to handle scraping in chunks #2

Merged Oct 20, 2023 · 26 commits

Conversation

@TwoAbove (Collaborator) commented Oct 12, 2023

This is still a draft - I have a couple of issues to iron out, but I'm opening this PR so that y'all can take a look and provide feedback on the approach.

If it's beneficial, I can do a small write-up on what's happening in this new approach!

I've also added transactions so we won't hit limits, but I'm not sure yet whether they work.

@@ -99,7 +99,8 @@ ipython_config.py
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
poetry.lock
Collaborator Author:

I use Poetry as my Python package manager and didn't want to include the lock file in this repo.

requirements.txt Outdated
@@ -1,3 +1,4 @@
requests==2.31.0
datasets==2.14.5
Pillow==10.0.1
huggingface_hub
Collaborator Author:

This is needed for the HfFileSystem.

@@ -205,51 +266,37 @@ def _get_messages(self, after_message_id: str) -> List[Dict[str, Any]]:

return unique_list

def scrape(self, fetch_all: bool = False, push_to_hub: bool = True) -> Dataset:
Collaborator Author:

This change is better viewed in split mode since it's really different.

@TwoAbove TwoAbove force-pushed the update-scraper-to-handle-chunks branch 3 times, most recently from 59da5e5 to 2387b53 Compare October 12, 2023 15:26
DISCORD_TOKEN=
DATASET_CHUNK_SIZE=300
@TwoAbove (Collaborator Author), Oct 12, 2023:

This is derived empirically by looking at chunks from https://huggingface.co/datasets/laion/dalle-3-dataset/tree/main/data

@ZachNagengast (Member), Oct 12, 2023:
Does this change as the dataset gets bigger? Wondering what effect changing it has.

Also I'd probably move this to config.json files because it doesn't need to be a secret

Collaborator Author:

You're right, for some reason I thought that we would aggregate all of these datasets into one - that's not the case - so I'll change this. Thanks!

@ZachNagengast (Member) left a comment:

Looks pretty good overall. Is it currently working on your fork? I'd be curious to see how the data looks.

@TwoAbove TwoAbove force-pushed the update-scraper-to-handle-chunks branch 2 times, most recently from 4919194 to 3ac0eb7 Compare October 13, 2023 02:10
@TwoAbove TwoAbove marked this pull request as ready for review October 13, 2023 02:11
@TwoAbove (Collaborator Author):

@ZachNagengast https://huggingface.co/datasets/TwoAbove/LAION-discord-gpt4v/tree/main/data This is how it looks

@ZachNagengast (Member):

Nice! How many files are supposed to be here? Maybe off by one on either the numerator or denominator?

@TwoAbove (Collaborator Author):
@ZachNagengast I'm not sure - the auto-created chunks in https://huggingface.co/datasets/laion/dalle-3-dataset/tree/main/data are in this format, so I copied it.

@ZachNagengast (Member):
Oh you're right haha, one of the hardest problems in computer science 😂

@ZachNagengast (Member) commented Oct 13, 2023:

Does it seem odd that the dataset is 100 MB? The viewer is just showing a bit of text, so I'm confused about where the size is coming from.

@TwoAbove (Collaborator Author) commented Oct 14, 2023:

@ZachNagengast Update: there's a minor upload bug left, but the dataset config is fixed. The solution was to update the README.md in the dataset itself with the correct row types. https://huggingface.co/datasets/TwoAbove/LAION-discord-gpt4v/commit/d9fa283b3aeea04017a7b920f3281d276f7ebb2a

As for the equality check, the datasets are equal!

import pandas as pd
from datasets import load_dataset

pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

laion = load_dataset('laion/gpt4v-dataset')['train'].to_pandas()
df = load_dataset('TwoAbove/LAION-discord-gpt4v')['train'].to_pandas()


def diff(first, second):
    # Show the diff between the two datasets by comparing the message_id column
    return first[~first['message_id'].isin(second['message_id'])]


print(diff(laion, df))
print(diff(df, laion))
Empty DataFrame
Columns: [caption, image, link, message_id, timestamp]
Index: []
Empty DataFrame
Columns: [caption, image, link, message_id, timestamp]
Index: []

@TwoAbove TwoAbove force-pushed the update-scraper-to-handle-chunks branch from c337ab8 to 1610364 Compare October 14, 2023 14:05
@TwoAbove (Collaborator Author):
Here's my test dataset that works as expected: https://huggingface.co/datasets/TwoAbove/LAION-discord-gpt4v/tree/main/data

@TwoAbove (Collaborator Author):
Currently testing with the dalle3 dataset

@TwoAbove (Collaborator Author):
@ZachNagengast

Using the same Python code:

import pandas as pd
from datasets import load_dataset

pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

laion = load_dataset('laion/dalle-3-dataset')['train'].to_pandas()
df = load_dataset('TwoAbove/test-dalle-3')['train'].to_pandas()


def diff(first, second):
    # Show the diff between the two datasets by comparing the message_id column
    return first[~first['message_id'].isin(second['message_id'])]


print(diff(laion, df))
print(diff(df, laion))

Here are the results; it looks like things are working as expected:

                                               caption                                              image                                               link           message_id                         timestamp
106  In an endless sea of parking lots, two women i...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1160972438075605012  2023-10-09T16:09:59.062000+00:00
                                                caption                                              image                                               link           message_id                         timestamp
2356  A large and continuous Isometric 3D diorama ev...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162679544835223573  2023-10-14T09:13:25.030000+00:00
2357  A sprawling and continuous top-down Isometric ...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162689946193236029  2023-10-14T09:54:44.907000+00:00
2358  A sprawling and continuous top-down Isometric ...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162690254361342084  2023-10-14T09:55:58.380000+00:00
2359  An endless Isometric 3D map capturing the esse...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162690496775335997  2023-10-14T09:56:56.176000+00:00
2360  A boundless Isometric 3D diorama where terrain...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162690760488009778  2023-10-14T09:57:59.050000+00:00
2361  A large and continuous Isometric 3D diorama ev...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162691703115886693  2023-10-14T10:01:43.790000+00:00
2362  A view-filling and continuous top-down Isometr...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162695680696004668  2023-10-14T10:17:32.119000+00:00
2363  Anime still of a young woman with an athletic ...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162699276560891924  2023-10-14T10:31:49.440000+00:00
2364  Cinematic movie still of a young, athletic wom...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162700052607807549  2023-10-14T10:34:54.464000+00:00
2365  Oil painting of an athletic young woman floati...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162700561523671130  2023-10-14T10:36:55.799000+00:00
2366  Photo close-up capturing a young woman as she ...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162702539150594088  2023-10-14T10:44:47.302000+00:00
2367  Close-up shot from a vintage camera of a young...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162705851816095864  2023-10-14T10:57:57.103000+00:00
2368  Vintage-style shot of a young woman in a space...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162707803538984991  2023-10-14T11:05:42.430000+00:00
2369  Cinematic shot reminiscent of early 2000s foot...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162708153000009768  2023-10-14T11:07:05.748000+00:00
2370  Vintage-style shot of the young woman on an al...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162709155740655687  2023-10-14T11:11:04.820000+00:00
2371  Cinematic shot reminiscent of early 2000s foot...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162709491087847565  2023-10-14T11:12:24.773000+00:00
2372  Anime-style wide depiction set on a foreign wo...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162722043268190288  2023-10-14T12:02:17.446000+00:00
2373  Vintage-style wide shot set against the backdr...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162722177171324939  2023-10-14T12:02:49.371000+00:00
2374  Cinematic wide shot capturing a heartwarming m...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162723589787762718  2023-10-14T12:08:26.165000+00:00
2375  Close-up cinematic shot capturing a heartwarmi...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162723724823363685  2023-10-14T12:08:58.360000+00:00
2376  Close-up cinematic shot capturing a heartwarmi...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162726056327909437  2023-10-14T12:18:14.234000+00:00
2377  Anime-style wide depiction set on a distant wo...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162726650048421998  2023-10-14T12:20:35.788000+00:00
2378  Cinematic close-up portrait of the female extr...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162727580659617832  2023-10-14T12:24:17.663000+00:00
2379  Cinematic close-up portrait of the female extr...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162731384448749659  2023-10-14T12:39:24.557000+00:00
2380  Cinematic wide shot of the female extraterrest...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162733488412299314  2023-10-14T12:47:46.181000+00:00
2381  Close-up portrait of a little girl dressed as ...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162734269458808982  2023-10-14T12:50:52.397000+00:00
2382  Illustration of a Corgi with a determined expr...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162734339923128420  2023-10-14T12:51:09.197000+00:00
2383  Amidst a colorful summer backdrop, a young gir...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162734562145734707  2023-10-14T12:52:02.179000+00:00
2384  Cinematic shot of the female extraterrestrial ...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162734770942394399  2023-10-14T12:52:51.960000+00:00
2385  Cinematic shot of the female extraterrestrial ...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162734770942394399  2023-10-14T12:52:51.960000+00:00
2386  Cinematic shot of the female extraterrestrial ...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162734770942394399  2023-10-14T12:52:51.960000+00:00
2387  Cinematic wide shot of the female extraterrest...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162735089646575686  2023-10-14T12:54:07.945000+00:00
2388  Cinematic wide shot of the female extraterrest...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162735089646575686  2023-10-14T12:54:07.945000+00:00
2389  Cinematic wide shot of the female extraterrest...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162735089646575686  2023-10-14T12:54:07.945000+00:00
2390  Cinematic rearview shot of the female extrater...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162737144230584411  2023-10-14T13:02:17.796000+00:00
2391  Cinematic image of the female extraterrestrial...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162737616416935977  2023-10-14T13:04:10.374000+00:00
2392  Cinematic image of the female extraterrestrial...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162739840467603486  2023-10-14T13:13:00.629000+00:00
2393  1940s inspired Cubist style on aged matte phot...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162740233738141799  2023-10-14T13:14:34.392000+00:00
2394  Dynamic widescreen depiction of a time-travel ...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162740309013315584  2023-10-14T13:14:52.339000+00:00
2395  Photo of a medieval wizard holding a futuristi...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162740379100139530  2023-10-14T13:15:09.049000+00:00
2396  Photo capturing a man at a historic airfield, ...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162742402084257852  2023-10-14T13:23:11.366000+00:00
2397  Photo of a dimly lit, vintage corridor. A man,...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162746225779556482  2023-10-14T13:38:23.006000+00:00
2398  Photo of a man, face covered in sweat, crawlin...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162746344130220062  2023-10-14T13:38:51.223000+00:00
2399  Photo of a dimly lit, vintage bedroom. In the ...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162747602136211496  2023-10-14T13:43:51.155000+00:00
2400  Photo of a dimly lit, vintage hotel room. Joe,...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162747966608637952  2023-10-14T13:45:18.052000+00:00
2401  Photo of a dramatic moment in a dimly lit, old...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162750683947216907  2023-10-14T13:56:05.916000+00:00
2402  Photo of a bleak, colorless street covered in ...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162754340969267324  2023-10-14T14:10:37.818000+00:00
2403  Cinema frame of the snowy street scene. Pedest...  {'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHD...  https://cdn.discordapp.com/attachments/1158354...  1162754500117921963  2023-10-14T14:11:15.762000+00:00

@TwoAbove TwoAbove force-pushed the update-scraper-to-handle-chunks branch from 1610364 to c9cb368 Compare October 14, 2023 15:26
@ZachNagengast (Member) commented Oct 14, 2023:

This is looking great.

I'm just seeing a bit of weirdness when setting fetch_all=True:

[screenshot omitted]

Can you try some successive runs with fetch_all=True and no new data? The expected result is that it replaces the entire repo with the proper file structure, overwriting/merging with what's already there. Did you ever figure out where that hash was coming from? Also, I'd recommend just committing new changes here instead of force-pushing; it's hard to see what changes between commits.

Just about ready to go, nice work so far!

@TwoAbove (Collaborator Author):
Sorry @ZachNagengast, I'll do regular commits and then squash them before merging.

requirements.txt Outdated
@@ -1,3 +1,5 @@
requests==2.31.0
datasets==2.14.5
Pillow==10.0.1
huggingface_hub
Contributor:

Suggested change
huggingface_hub
huggingface_hub>=0.18

You need to pin to at least 0.18.0 to get preupload_lfs_files

@TwoAbove (Collaborator Author):
Hey @ZachNagengast, anything I can help with to get this merged?

@ZachNagengast (Member):
Gonna try to get this done in the next couple hours

@ZachNagengast (Member) commented Oct 20, 2023:

@Wauplin I ended up just hardcoding the append logic here, happy to bring it over to datasets as well.

Logic here does the following:

  1. Check if there is a "most recent" chunk that is under the size limit
  2. Add to the chunk until it hits the limit
  3. Upload that chunk
  4. Start a new empty chunk with the remaining data
  5. Loop until all the data is uploaded

The part that enables this to run on GitHub Actions is streaming the data into a pandas DataFrame without the images, which reduces the memory and storage requirements immensely but still lets us check against the full dataset.
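The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual scraper_bot code: `CHUNK_LIMIT` and `append_in_chunks` are hypothetical names, and the real implementation uploads each chunk as a parquet file via HfFileSystem rather than returning a list.

```python
import pandas as pd

CHUNK_LIMIT = 300  # rows per parquet chunk (the empirically chosen max_chunk_size)

def append_in_chunks(chunks: list, new_rows: pd.DataFrame) -> list:
    """Fill the most recent under-sized chunk first, then cut full chunks until done."""
    chunks = list(chunks)
    remaining = new_rows.reset_index(drop=True)
    # Steps 1-3: top up the "most recent" chunk if it is under the size limit
    if chunks and len(chunks[-1]) < CHUNK_LIMIT:
        space = CHUNK_LIMIT - len(chunks[-1])
        chunks[-1] = pd.concat([chunks[-1], remaining.iloc[:space]], ignore_index=True)
        remaining = remaining.iloc[space:]
    # Steps 4-5: start new chunks with the remaining data and loop until all is placed
    while len(remaining) > 0:
        chunks.append(remaining.iloc[:CHUNK_LIMIT].reset_index(drop=True))
        remaining = remaining.iloc[CHUNK_LIMIT:]
    return chunks  # each element would then be uploaded as its own parquet file
```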

@Wauplin (Contributor) left a comment:

Hi @ZachNagengast, I reviewed the PR (and left a few minor comments). Overall logic looks good to me. I still have a few comments/questions:

  • Are you using preupload_lfs_files at all in the end? It doesn't seem to be the case from what I've read, but I wanted to check.
  • IIUC, raw images are no longer stored on the HF dataset, only the URLs linking to them, right (to save GitHub Actions memory)? Is there a plan to scrape the raw images as well somewhere else (i.e. with embed_images=True)? From what I understood, this is where preupload_lfs_files would help keep the memory footprint low while uploading several parquet chunks.
  • FYI, the "100 operations per commit" limit is purely empirical. I expect the limit to be higher for delete operations (since no LFS file has to be checked), so happy to get some feedback at some point if you run this scraper at large scale :)

Pinging @lhoestq and @mariosasko, maintainers of datasets, to get their opinion as well.

@@ -2,6 +2,7 @@
"base_url": "https://discord.com/api/v9",
"channel_id": "1158354590463447092",
"limit": 100,
"max_chunk_size": 300,
Contributor:

Is this the same "300" defined as DATASET_CHUNK_SIZE above? If yes, let's reuse it maybe?

@TwoAbove (Collaborator Author), Oct 20, 2023:

The DATASET_CHUNK_SIZE env var was my first iteration of this feature, but it makes more sense for it to be repo-dependent, so I think it should be deleted.

Member:

Thanks for the in-depth review! These comments are very helpful.

  • I'm not using pre-upload anymore since I moved the upload step into the append function, which uses the HfFileSystem to upload. If this is recommended I can take a look.
  • We store raw images only for some datasets, based on the config.

Here is one with images: https://huggingface.co/datasets/laion/dalle-3-dataset. I still need to get the README updated as well for the dataset viewer.

)

    def _get_chunk_names(self) -> None:
        fs = HfFileSystem(token=os.environ["HF_TOKEN"], skip_instance_cache=True)
Contributor:

It's better to define fs = HfFileSystem(token=os.environ["HF_TOKEN"], skip_instance_cache=True) only once globally. It will cache some stuff internally (in memory) so hopefully running _get_chunk_names, _get_detailed_chunk_names and _append_chunk might run faster in some cases.

(same for other HfFileSystem definitions below)

@lhoestq, Oct 20, 2023:

> It will cache some stuff internally (in memory) so hopefully running _get_chunk_names, _get_detailed_chunk_names and _append_chunk might run faster in some cases.

(Actually, fsspec does cache the instance IIRC, but you need to remove skip_instance_cache.)

@ZachNagengast (Member):
I was getting some issues with the cached instance not having the newest files; this seemed to solve it, but maybe I'm not understanding what the true problem was.

@ZachNagengast (Member), Oct 20, 2023:
i.e. after committing a new file to the repo, it wasn't in _get_chunk_names until I redefined this (or ran the function a second time).

@TwoAbove (Collaborator Author) left a comment:

Left a question about streaming=True.

        df = pd.read_parquet(fs.open(f"{self.fs_path}/{chunk}", "rb"))
        existing_message_ids = df['message_id'].tolist()
        messages = [msg for msg in messages if msg.message_id not in existing_message_ids]
        existing_message_ids = dataset["message_id"].tolist()
Collaborator Author:

I added this chunk filtering because we can't guarantee which chunk the latest messages will be in. If we use the custom chunking, then we can guarantee that.
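In other words, new messages are checked against the ids already stored across the chunks before being appended. A minimal sketch of that dedup step (the `Message` class and function name here are hypothetical stand-ins for the scraper's actual message objects):

```python
from dataclasses import dataclass

@dataclass
class Message:
    message_id: str

def drop_already_scraped(messages, existing_message_ids):
    """Keep only messages whose id is not already present in any scraped chunk."""
    existing = set(existing_message_ids)  # set gives O(1) membership checks
    return [msg for msg in messages if msg.message_id not in existing]
```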

Member:

I was having trouble getting it to work properly. My thinking with this is that we'll know we're not adding duplicates and are simply adding the new data to the incomplete chunk.

)

        # Load the current dataset without images initially, to figure out what we're working with
        current_dataset, chunk_count = self._load_dataset(schema=schema)
Collaborator Author:
We are loading with images, since schema = [f.name for f in fields(HFDatasetScheme)] is at the start of the scrape function. Or am I missing something?

Member:

Yeah, I forgot to update this part for GitHub Actions (still need to test that), but this saved me time trying to figure it out 😂


    def _update_chunk(self, df: pd.DataFrame, chunk_num: int) -> None:
    def _append_chunk(
        self, df: pd.DataFrame, mode: AppendMode = AppendMode.LATEST
Collaborator Author:
I like the AppendMode idea!

@ZachNagengast ZachNagengast merged commit f56be93 into main Oct 20, 2023