
WebScraping

A program to scrape products from Tokopedia, Bukalapak, and Shopee.

First Run

git clone https://github.com/RaymondSalim/FinalWebScrape
cd FinalWebScrape
python setup.py

Executing setup.py automatically downloads and prepares the required files. If an error occurs during this step, do the following (a scripted version of the fallback is sketched after the list):

  1. Download the chromedriver archive matching your Chrome version from the ChromeDriver downloads page (https://chromedriver.chromium.org/downloads)
  2. Unzip the downloaded archive and place the binary in the ./Files/ folder
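
If you prefer to script this fallback, a minimal sketch using only the standard library is shown below. The ChromeDriver version and platform are placeholders you must adjust to your machine, and the chromedriver.storage.googleapis.com download scheme only covers ChromeDriver releases up to version 114; this is an illustration, not the repository's setup.py.

import io
import os
import urllib.request
import zipfile

CHROMEDRIVER_VERSION = "114.0.5735.90"  # placeholder; match your installed Chrome
PLATFORM = "linux64"                    # or "win32" / "mac64", depending on your OS

# ChromeDriver releases up to 114 are hosted on this storage bucket.
url = (f"https://chromedriver.storage.googleapis.com/"
       f"{CHROMEDRIVER_VERSION}/chromedriver_{PLATFORM}.zip")

os.makedirs("Files", exist_ok=True)
with urllib.request.urlopen(url) as resp:
    # Unzip the downloaded archive straight into ./Files/
    zipfile.ZipFile(io.BytesIO(resp.read())).extractall("Files")
print("chromedriver extracted to ./Files/")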

Using the program

The following help text can be obtained with python main.py -h:

usage: main.py [-h] {scrape,retry,convert,continue} ...

positional arguments:
  {scrape,retry,convert,continue}
    scrape              Command to scrape
    retry               Command to retry errors from xxx_errors.json
    convert             Command to convert from/to csv/json
    continue            Command to continue scraping

optional arguments:
  -h, --help            show this help message and exit

Scraping

usage: main.py scrape [-h] -m {tokopedia,bukalapak,shopee} -q QUERY [-sp STARTPAGE] -ep ENDPAGE -r {csv,json} [-f FILENAME]

The scrape command takes the following arguments:
-m / --marketplace      [REQUIRED] the marketplace {tokopedia, bukalapak, shopee}
-q / --query            [REQUIRED] keyword for search
-sp / --startpage       [OPTIONAL] (DEFAULT = 1) start scraping from this page number
-ep / --endpage         [REQUIRED] (0 TO SCRAPE ALL PAGES) scrape until this page number
-r / --result           [REQUIRED] the file format for the results {csv, json}
-f / --filename         [OPTIONAL] the name of the final output

Example:

  • Tokopedia, "masker bagus", from page 5 to page 25, save as csv
python main.py scrape -m tokopedia -q "masker bagus" -sp 5 -ep 25 -r csv
  • Shopee, "obat batuk", all results, save as json
python main.py scrape -m shopee -q "obat batuk" -ep 0 -r json
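
Taken together, the help output and the flag list above imply an argparse parser built around subcommands. The sketch below reconstructs that shape for the scrape subcommand only; it mirrors the flags documented in this README but is not the repository's actual code.

import argparse

parser = argparse.ArgumentParser(prog="main.py")
subparsers = parser.add_subparsers(dest="command", required=True)

# Only scrape is sketched here; retry, convert and continue
# would be registered the same way.
scrape = subparsers.add_parser("scrape", help="Command to scrape")
scrape.add_argument("-m", "--marketplace", required=True,
                    choices=["tokopedia", "bukalapak", "shopee"])
scrape.add_argument("-q", "--query", required=True, help="keyword for search")
scrape.add_argument("-sp", "--startpage", type=int, default=1)
scrape.add_argument("-ep", "--endpage", type=int, required=True,
                    help="0 to scrape all pages")
scrape.add_argument("-r", "--result", required=True, choices=["csv", "json"])
scrape.add_argument("-f", "--filename", help="name of the final output")

args = parser.parse_args()
print(args)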

Retrying Errors

The program automatically saves a file ending in _errors.json, which contains the URLs of all pages that failed. The retry command lets you retry those failed URLs; the results are saved to a new file with the same name, ending in _retry. A sketch of this flow follows the argument list below.

usage: main.py retry [-h] -f FILENAME -r {csv,json}

The following arguments are required:
-f / --filename         [REQUIRED] name of the file containing the errors
-r / --result           [REQUIRED] the file format for the results {csv, json}
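
As a rough illustration of the retry flow (JSON output only, for brevity): read the failed URLs back from the errors file, scrape each one again, and write the recovered results to a _retry file. The errors-file layout (a flat JSON list of URL strings), the scrape_url() helper, and the exact output naming are assumptions made for illustration, not the repository's code.

import json

def scrape_url(url: str) -> dict:
    # Stand-in for the real product scraper; replace with actual logic.
    raise NotImplementedError

def retry_errors(filename: str) -> None:
    # Assumed layout: xxx_errors.json is a flat JSON list of URL strings.
    with open(filename, encoding="utf-8") as f:
        failed_urls = json.load(f)

    results = []
    for url in failed_urls:
        try:
            results.append(scrape_url(url))
        except Exception:
            continue  # still failing; the URL stays in the original errors file

    # Recovered results go to a new file ending in _retry (naming assumed).
    out = filename.replace("_errors.json", "_retry.json")
    with open(out, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)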

Continuing Interrupted Job

The continue command lets you resume an interrupted job. It skips all products that were already scraped and scrapes only the ones that were missed. The results are saved to a new file with the same name, ending in _continued. A sketch of this logic follows the argument list below.

usage: main.py continue [-h] -f FILENAME [-sp STARTPAGE] -ep ENDPAGE -r {csv,json}

The continue command takes the following arguments:
-f / --filename         [REQUIRED] name of the incomplete job file
-sp / --startpage       [OPTIONAL] (DEFAULT = 1) start scraping from this page number
-ep / --endpage         [REQUIRED] scrape until this page number
-r / --result           [REQUIRED] the file format for the results {csv, json}
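
One possible shape for that skip-and-resume logic is sketched below, under two assumptions this README does not confirm: each saved product carries a "url" key, and a scrape_page() helper returns the products on a given results page.

import json

def scrape_page(page: int) -> list:
    # Stand-in for the real per-page scraper; replace with actual logic.
    raise NotImplementedError

def continue_job(filename: str, start: int, end: int) -> None:
    with open(filename, encoding="utf-8") as f:
        done = json.load(f)
    seen = {item["url"] for item in done}   # products already scraped

    for page in range(start, end + 1):
        for product in scrape_page(page):
            if product["url"] not in seen:  # only scrape what was skipped
                done.append(product)
                seen.add(product["url"])

    # Completed results go to a new file ending in _continued (naming assumed).
    out = filename.replace(".json", "_continued.json")
    with open(out, "w", encoding="utf-8") as f:
        json.dump(done, f, ensure_ascii=False, indent=2)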

Converting Result

The convert command lets you convert a file from JSON to CSV and vice versa. The converted file keeps the same name; only the extension changes.

usage: main.py convert [-h] -f FILENAME

The following arguments are required:
-f / --filename         [REQUIRED] name of the file
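
As a reference point, a JSON-to-CSV round trip with only the standard library could look like the sketch below. It assumes the JSON file holds a flat list of objects with uniform keys; the repository's actual converter may handle more cases.

import csv
import json
import os

def convert(filename: str) -> None:
    base, ext = os.path.splitext(filename)
    if ext == ".json":
        # JSON -> CSV: field names are taken from the first record.
        with open(filename, encoding="utf-8") as f:
            rows = json.load(f)
        with open(base + ".csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
    else:
        # CSV -> JSON: every field comes back as a string.
        with open(filename, newline="", encoding="utf-8") as f:
            rows = list(csv.DictReader(f))
        with open(base + ".json", "w", encoding="utf-8") as f:
            json.dump(rows, f, ensure_ascii=False, indent=2)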

To Do List

  • Merge similar functions in different classes
  • Improve exit codes based on the exception raised
