ARKseal/crawlingathome-worker

 
 

Crawling@Home

Help us build a billion-scale image-caption dataset by filtering Common Crawl with OpenAI CLIP

Google Colab

Open In Colab

  1. Change the value of YOUR_NICKNAME_FOR_THE_LEADERBOARD and make sure you are connected to a GPU runtime for best efficiency.
  2. Then run all cells (Ctrl+F9) to install the dependencies and start Crawling!

Other options

  • Open In Colab
    • Use this notebook to run a CPU-only worker (do not use a GPU runtime)
  • Open In Colab
    • Use this notebook to run a GPU-only worker (please use a GPU runtime)

Docker file

  1. Get the Docker image with docker pull arkseal/cah-worker:hybrid-cpu
  2. Run the image with docker run --name cahworker --shm-size=4G -d arkseal/cah-worker:hybrid-cpu
    • add -e NAME={nickname} to specify display name
      • Ex: docker run --name cahworker -e NAME=ARKseal --shm-size=4G -d arkseal/cah-worker:hybrid-cpu
Alternatively, use this one-liner: docker pull arkseal/cah-worker:hybrid-cpu && docker run --name cahworker --shm-size=4G -d arkseal/cah-worker:hybrid-cpu
  • add -e NAME={nickname} to specify display name
    • Ex: docker pull arkseal/cah-worker:hybrid-cpu && docker run --name cahworker -e NAME=ARKseal --shm-size=4G -d arkseal/cah-worker:hybrid-cpu

Other options

  • GPU-enabled hybrid worker: docker pull arkseal/cah-worker:hybrid-gpu && docker run --name cahworker --gpus all --shm-size=4G -d arkseal/cah-worker:hybrid-gpu
    • add -e NAME={nickname} to specify display name
      • Ex: docker pull arkseal/cah-worker:hybrid-gpu && docker run --name cahworker --gpus all -e NAME=ARKseal --shm-size=4G -d arkseal/cah-worker:hybrid-gpu
    • This requires the NVIDIA Container Toolkit on the host device
  • CPU-only worker: docker pull arkseal/cah-worker:cpu && docker run --name cahworker --shm-size=4G -d arkseal/cah-worker:cpu
    • add -e NAME={nickname} to specify display name
      • Ex: docker pull arkseal/cah-worker:cpu && docker run --name cahworker -e NAME=ARKseal --shm-size=4G -d arkseal/cah-worker:cpu
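
The three Docker variants above differ only in the image tag, the optional --gpus all flag, and the optional NAME variable. As a rough sketch, the helper below composes the same one-liner for any of them. The function name cah_docker_cmd is hypothetical and not part of this repository; it only prints the command so you can review it before running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): prints the
# "docker pull && docker run" one-liner documented above.
#   $1 = image tag: hybrid-cpu (default), hybrid-gpu, or cpu
#   $2 = optional nickname, passed to the container as NAME
cah_docker_cmd() {
  local tag="${1:-hybrid-cpu}"
  local nickname="${2:-}"
  local image="arkseal/cah-worker:${tag}"
  local run="docker run --name cahworker"

  # Only the hybrid-gpu image is documented as needing GPU access.
  if [ "$tag" = "hybrid-gpu" ]; then
    run="$run --gpus all"
  fi

  # The display name is optional.
  if [ -n "$nickname" ]; then
    run="$run -e NAME=$nickname"
  fi

  echo "docker pull $image && $run --shm-size=4G -d $image"
}

# Prints the GPU-enabled hybrid example from the list above.
cah_docker_cmd hybrid-gpu ARKseal
```

To execute the printed command directly, it can be passed to eval, for example eval "$(cah_docker_cmd cpu ARKseal)".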

Setup

  1. wget https://raw.githubusercontent.com/ARKseal/crawlingathome-worker/master/setup/setup_hybrid.sh
  2. bash setup_hybrid.sh, to install dependencies.
  3. export PYTHONHASHSEED=0 && python3 crawlingathome.py, to start Crawling!
    • use --name {nickname} to specify your display name

Other options

  • CPU Only Worker:
    1. wget https://raw.githubusercontent.com/ARKseal/crawlingathome-worker/master/setup/setup_cpu.sh
    2. bash setup_cpu.sh, to install dependencies.
    3. export PYTHONHASHSEED=0 && python3 crawlingathome.py --cpu, to start Crawling!
      • use --name {nickname} to specify your display name
  • GPU Only Worker:
    1. wget https://raw.githubusercontent.com/ARKseal/crawlingathome-worker/master/setup/setup_gpu.sh
    2. bash setup_gpu.sh, to install dependencies.
    3. export PYTHONHASHSEED=0 && python3 crawlingathome.py --gpu, to start Crawling!
      • use --name {nickname} to specify your display name
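
The hybrid, CPU-only, and GPU-only setups follow the same three steps and differ only in the setup script and the launch flag. The sketch below prints those steps for a chosen variant. It rests on two assumptions: that the downloaded script is run under the file name wget saves it as (setup_hybrid.sh, setup_cpu.sh, or setup_gpu.sh), and that the intended environment variable is Python's standard PYTHONHASHSEED. The function name cah_setup_steps is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): prints the setup
# steps for one worker variant without executing them.
#   $1 = variant: hybrid (default), cpu, or gpu
#   $2 = optional nickname for --name
cah_setup_steps() {
  local variant="${1:-hybrid}"
  local nickname="${2:-}"
  local base="https://raw.githubusercontent.com/ARKseal/crawlingathome-worker/master/setup"
  local flag=""
  local name_arg=""

  # Map the variant to the launch flag documented above.
  case "$variant" in
    hybrid) flag="" ;;
    cpu)    flag=" --cpu" ;;
    gpu)    flag=" --gpu" ;;
    *) echo "unknown variant: $variant" >&2; return 1 ;;
  esac

  if [ -n "$nickname" ]; then
    name_arg=" --name $nickname"
  fi

  echo "wget $base/setup_$variant.sh"
  # Assumption: run the script under the name wget saved it as.
  echo "bash setup_$variant.sh"
  # Assumption: PYTHONHASHSEED is the variable the worker expects.
  echo "export PYTHONHASHSEED=0 && python3 crawlingathome.py$flag$name_arg"
}

cah_setup_steps cpu ARKseal
```

Printing the steps instead of running them keeps the sketch safe to try on a machine where you have not yet decided which variant to install.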

Droplet Setup

  1. Use the cloud-config.yaml script to initialize the droplet
  2. Connect over SSH with ssh -o IdentitiesOnly=yes -i ~/.ssh/id_cah crawl@{your-droplet-ip}
  3. Check that the worker is running with tail -f crawl.log
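
If you only want a quick look at the worker's progress, the login and log check can be combined into one call. This is a minimal sketch, assuming crawl.log sits in the crawl user's home directory (as the tail command above implies) and that the key is stored at ~/.ssh/id_cah. The function name cah_droplet_log is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): shows the last
# lines of crawl.log on a droplet without opening an interactive shell.
#   $1 = droplet IP address (required)
#   $2 = number of lines to show (default: 50)
cah_droplet_log() {
  if [ -z "${1:-}" ]; then
    echo "usage: cah_droplet_log DROPLET_IP [LINES]" >&2
    return 2
  fi
  # Same ssh options as the manual login; the remote command is
  # the log check, limited to the requested number of lines.
  ssh -o IdentitiesOnly=yes -i ~/.ssh/id_cah "crawl@$1" "tail -n ${2:-50} crawl.log"
}

# Example call (replace the placeholder with your droplet's address):
#   cah_droplet_log {your-droplet-ip} 100
```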

TODO

  • Save image embedding
  • Convert images to tfrecords
  • Upload to google drive
  • Prevent corrupt images from being processed
  • Shard each chunk (currently the whole WAT file must be read, which is a problem for low-RAM servers)
  • Crawling@Home integration
  • Verify output
