Updated scraper_bot to handle scraping in chunks #2
Conversation
@@ -99,7 +99,8 @@ ipython_config.py
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
poetry.lock
I use Poetry as my Python package manager and I didn't want to include the lock file in this repo.
requirements.txt (outdated)
@@ -1,3 +1,4 @@
requests==2.31.0
datasets==2.14.5
Pillow==10.0.1
huggingface_hub
This is needed for the HfFileSystem.
scraper/scraper_bot.py (outdated)
@@ -205,51 +266,37 @@ def _get_messages(self, after_message_id: str) -> List[Dict[str, Any]]:

    return unique_list

def scrape(self, fetch_all: bool = False, push_to_hub: bool = True) -> Dataset:
This change is better viewed in split mode since it's really different
(force-pushed from 59da5e5 to 2387b53)
DISCORD_TOKEN=
DATASET_CHUNK_SIZE=300
This is derived empirically by looking at chunks from https://huggingface.co/datasets/laion/dalle-3-dataset/tree/main/data
Does this change as the dataset gets bigger? Wondering what effect changing it has.
Also I'd probably move this to config.json files because it doesn't need to be a secret
You're right, for some reason I thought that we would aggregate all of these datasets into one - that's not the case - so I'll change this. Thanks!
Looks pretty good overall. Is it currently working on your fork? I'd be curious to see how the data looks.
(force-pushed from 4919194 to 3ac0eb7)
Nice! How many files are supposed to be here? Maybe off by one on either the numerator or denominator?
@ZachNagengast I'm not sure - the autocreated chunks in https://huggingface.co/datasets/laion/dalle-3-dataset/tree/main/data are in this format, so I copied it
Oh you're right haha, one of the hardest problems in computer science 😂
Does it seem odd that the dataset is 100mb? The viewer is just showing a bit of text so I'm confused where the size is coming from.
(force-pushed from 1130b27 to c337ab8)
@ZachNagengast Update: For the equality, the datasets are equal!

import pandas as pd
from datasets import load_dataset

pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

laion = load_dataset('laion/gpt4v-dataset')['train'].to_pandas()
df = load_dataset('TwoAbove/LAION-discord-gpt4v')['train'].to_pandas()

def diff(first, second):
    return first[~first['message_id'].isin(second['message_id'])]

# Show the diff between the two datasets by comparing the message_id column
print(diff(laion, df))
print(diff(df, laion))
(force-pushed from c337ab8 to 1610364)
Here's my test dataset that works as expected: https://huggingface.co/datasets/TwoAbove/LAION-discord-gpt4v/tree/main/data
Currently testing with the dalle3 dataset
Using the same Python code,

import pandas as pd
from datasets import load_dataset

pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

laion = load_dataset('laion/dalle-3-dataset')['train'].to_pandas()
df = load_dataset('TwoAbove/test-dalle-3')['train'].to_pandas()

def diff(first, second):
    return first[~first['message_id'].isin(second['message_id'])]

# Show the diff between the two datasets by comparing the message_id column
print(diff(laion, df))
print(diff(df, laion))

here are the results. Looks like things are working as expected.
(force-pushed from 1610364 to c9cb368)
Sorry @ZachNagengast, I'll do regular commits and then squash them before merging.
requirements.txt (outdated)
@@ -1,3 +1,5 @@
requests==2.31.0
datasets==2.14.5
Pillow==10.0.1
huggingface_hub
Suggested change: huggingface_hub → huggingface_hub>=0.18

You need to pin to at least 0.18.0 to get preupload_lfs_files.
Hey @ZachNagengast, anything I can help with to get this merged?
Gonna try to get this done in the next couple of hours.
…/Discord-Scrapers into update-scraper-to-handle-chunks
@Wauplin I ended up just hardcoding the append logic here, happy to bring it over to datasets as well. Logic here does the following:
The part that enables this to run on GitHub Actions is streaming the data into a pandas dataframe without the images, which reduces the memory + storage requirements immensely but still lets us check against the full dataset.
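As an aside, a minimal sketch of that image-free load (the repo path and the idea of reading only message_id are assumptions for illustration, not the PR's actual code):

```python
import os

import pandas as pd
from huggingface_hub import HfFileSystem

fs = HfFileSystem(token=os.environ.get("HF_TOKEN"))

# Assumed chunk layout, mirroring the laion/dalle-3-dataset repo structure.
chunk_paths = fs.glob("datasets/laion/dalle-3-dataset/data/*.parquet")

# Read only the lightweight columns so the full dataset fits in a GitHub Actions runner,
# while still giving us every message_id to dedupe against.
frames = [
    pd.read_parquet(fs.open(path, "rb"), columns=["message_id"])
    for path in chunk_paths
]
current = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=["message_id"])
print(f"{len(current)} existing rows loaded without image payloads")
```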
Hi @ZachNagengast, I reviewed the PR (and left a few minor comments). Overall logic looks good to me. I still have a few comments/questions:

- Are you using preupload_lfs_files at all in the end? It doesn't seem to be the case from what I've read, but I wanted to check.
- IIUC, raw images are no longer stored on the HF Dataset, only the URL linking to them, right? (to save GitHub Actions memory). Is there a plan to scrape the raw images as well somewhere else (so embed_images=True)? From what I understood, this is where preupload_lfs_files would help keep the memory footprint low while uploading several parquet chunks.
- FYI, the "100 operations per commit" limit is purely empirical. I expect the limit to be higher for delete operations (since no LFS file has to be checked), so happy to get some feedback at some point if you run this scraper at large scale :)

Ping @lhoestq @mariosasko, maintainers of datasets, to get their opinion as well.
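For reference, a hedged sketch of the preupload_lfs_files pattern being discussed, following the documented huggingface_hub usage; the repo id and file names are placeholders, not this PR's code:

```python
from huggingface_hub import CommitOperationAdd, create_commit, preupload_lfs_files

repo_id = "username/some-dataset"  # placeholder repo

operations = []
for parquet_path in ["chunk_0.parquet", "chunk_1.parquet"]:  # illustrative local files
    op = CommitOperationAdd(path_in_repo=f"data/{parquet_path}", path_or_fileobj=parquet_path)
    # Upload the LFS payload immediately so its bytes can be freed from memory,
    # then bundle all chunks into a single commit at the end.
    preupload_lfs_files(repo_id, additions=[op], repo_type="dataset")
    operations.append(op)

create_commit(
    repo_id,
    operations=operations,
    commit_message="Append scraped chunks",
    repo_type="dataset",
)
```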
@@ -2,6 +2,7 @@
"base_url": "https://discord.com/api/v9",
"channel_id": "1158354590463447092",
"limit": 100,
"max_chunk_size": 300,
Is this the same "300" defined as DATASET_CHUNK_SIZE above? If yes, let's reuse it maybe?
The DATASET_CHUNK_SIZE env was my first iteration of the feature, but it makes more sense for it to be repo-dependent, so I think it should be deleted.
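For illustration, a hedged sketch of reading the repo-dependent chunk size from the config file instead of the environment; the key names follow the config.json hunk above, while the loader itself is an assumption:

```python
import json

def load_config(path: str = "config.json") -> dict:
    # config.json carries per-repo settings such as "max_chunk_size" (300 in the hunk above),
    # so nothing secret has to live in the environment.
    with open(path) as f:
        return json.load(f)

config = load_config()
max_chunk_size = config["max_chunk_size"]  # e.g. 300 rows per parquet chunk
```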
Thanks for the in-depth review! These comments are very helpful.

- I'm not using pre-upload anymore since I moved the upload step into the append function, which uses the HfFileSystem to upload. If pre-upload is recommended, I can take a look.
- We store raw images only for some datasets, based on the config. Here is one with images: https://huggingface.co/datasets/laion/dalle-3-dataset. I still need to get the README updated as well for the dataset viewer here.
)

def _get_chunk_names(self) -> None:
    fs = HfFileSystem(token=os.environ["HF_TOKEN"], skip_instance_cache=True)
It's better to define fs = HfFileSystem(token=os.environ["HF_TOKEN"], skip_instance_cache=True) only once, globally. It will cache some stuff internally (in memory), so hopefully running _get_chunk_names, _get_detailed_chunk_names and _append_chunk might run faster in some cases.
(Same for the other HfFileSystem definitions below.)
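A minimal sketch of that suggestion; the class, method body and repo path here are illustrative stand-ins, not the PR's actual code:

```python
import os

from huggingface_hub import HfFileSystem

# Created once at import time and reused by every helper below.
fs = HfFileSystem(token=os.environ["HF_TOKEN"])

class ScraperBot:  # illustrative stand-in for the PR's class
    fs_path = "datasets/TwoAbove/test-dalle-3/data"  # assumed repo layout

    def _get_chunk_names(self) -> list[str]:
        # Reuse the module-level filesystem instead of constructing a new one per call.
        return [path.split("/")[-1] for path in fs.ls(self.fs_path, detail=False)]
```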
> It will cache some stuff internally (in memory) so hopefully running _get_chunk_names, _get_detailed_chunk_names and _append_chunk might run faster in some cases.

(Actually, fsspec does cache the instance IIRC, but you need to remove skip_instance_cache.)
I was getting some issues with the cached instance not having the newest files; this seemed to solve the issue, but maybe I'm not understanding what the true problem was.
i.e. after committing a new file to the repo, it wasn't in _get_chunk_names until I redefined this (or ran the function a second time).
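As an aside, and only an assumption about how it maps onto this code: fsspec filesystems also expose invalidate_cache(), which drops the stale directory listing without recreating the instance. A rough sketch:

```python
# Hypothetical helper: keep one shared HfFileSystem, but clear its cached listing
# right after committing new chunk files so the next ls() sees them.
def refresh_listing(fs, repo_data_path: str) -> list[str]:
    fs.invalidate_cache(repo_data_path)          # forget cached ls() results for this path
    return fs.ls(repo_data_path, detail=False)   # re-list, now including fresh commits
```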
Left a question about streaming=true
df = pd.read_parquet(fs.open(f"{self.fs_path}/{chunk}", "rb"))
existing_message_ids = df['message_id'].tolist()
messages = [msg for msg in messages if msg.message_id not in existing_message_ids]
existing_message_ids = dataset["message_id"].tolist()
I added this chunk filtering because we can't guarantee in which chunk the latest messages will be. If we use the custom chunking, then we can guarantee that.
I was having trouble getting it to work properly. My thinking with this is that we'll know we're not adding duplicates and are simply adding the new data to the incomplete chunk.
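A hedged sketch of that dedup-then-append idea; the function and variable names are assumptions, not the PR's code:

```python
import pandas as pd

def filter_new_messages(scraped: pd.DataFrame, existing: pd.DataFrame) -> pd.DataFrame:
    # Keep only rows whose message_id has not already been written to any chunk.
    known_ids = set(existing["message_id"])
    return scraped[~scraped["message_id"].isin(known_ids)]
```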
)

# Load the current dataset without images initially, to figure out what we're working with
current_dataset, chunk_count = self._load_dataset(schema=schema)
We are loading with images, since schema = [f.name for f in fields(HFDatasetScheme)] is at the start of the scrape function. Or am I missing something?
Yea, I forgot to update this part for GitHub Actions (still need to test that), but this saved me time trying to figure it out 😂
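For illustration, a minimal sketch of how schema = [f.name for f in fields(HFDatasetScheme)] behaves; the field names other than message_id are invented here, not the PR's real schema:

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class HFDatasetScheme:  # stand-in definition; the real one lives in the repo
    message_id: str
    content: str
    image: Optional[bytes] = None

schema = [f.name for f in fields(HFDatasetScheme)]
print(schema)  # ['message_id', 'content', 'image'] -> the image column is included
```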
def _update_chunk(self, df: pd.DataFrame, chunk_num: int) -> None:
def _append_chunk(
    self, df: pd.DataFrame, mode: AppendMode = AppendMode.LATEST
I like the AppendMode idea!
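A guess at what an AppendMode switch could look like; purely illustrative, only LATEST appears in the diff above and the PR defines the real enum:

```python
from enum import Enum

class AppendMode(Enum):
    LATEST = "latest"  # append rows to the most recent, not-yet-full chunk
    NEW = "new"        # hypothetical: start a brand-new chunk file once a chunk is full
```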
Co-authored-by: Lucain <lucainp@gmail.com>
This is still a draft - I have a couple of issues to iron out, but I'm opening this PR so that y'all can take a look and provide feedback on the approach.
If it's beneficial, I can do a small write-up on what's happening in this new approach!
I've also added transactions so we won't hit limits, but I'm not sure if they work.