ScrapingShamela
The process is divided into two parts: collecting the download links, and downloading the files.
The download links were collected from the http://shamela.ws/index.php/search/last/ page, which lists the files in the Shamela corpus in order of addition.
The list is divided into a number of pages (76 at the time of scraping). Each result page contains a table with 100 books. For each book, the table gives the title, the date when the book was added to Shamela, and a link to a separate book page that contains the download links for the EPUB and BOK formats (the latter zipped into a RAR file).
The script used for downloading the files can be found here: https://github.com/OpenITI/raw_SHAM19Y/blob/master/Shamela-Scripts/shamelascrap-v2.py
The script loops through the rows of the table and stores the title and the book page link of each book. It then visits each book page and also stores the download link of the RAR file. NB: an updated version of the script also stores the date on which the book was added to Shamela and, if available, a link to a PDF version of the book (on http://waqfeya.com).
This information is then stored in a CSV file.
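The row-by-row collection described above can be illustrated with a minimal sketch. This is not the code of shamelascrap-v2.py: the HTML layout (title link in the first cell, date in the second), the column names, the sample book page path, and the helper names `TableRowParser`, `parse_result_page` and `write_csv` are all assumptions made for illustration. It uses only the Python standard library and parses an inline sample instead of the live site.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://shamela.ws/"  # only used to resolve relative links


class TableRowParser(HTMLParser):
    """Collect the cell texts and the first link of every <tr> (assumed layout)."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows that contain a link
        self._row = None       # row currently being parsed
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {"cells": [], "link": None}
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row["cells"].append("")
        elif tag == "a" and self._in_cell and self._row["link"] is None:
            self._row["link"] = dict(attrs).get("href")

    def handle_data(self, data):
        if self._in_cell:
            self._row["cells"][-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            # Keep only rows with a link; header rows have none.
            if self._row is not None and self._row["link"]:
                self.rows.append(self._row)
            self._row = None
            self._in_cell = False


def parse_result_page(html):
    """Return one dict per book: title, date of addition, book page URL."""
    parser = TableRowParser()
    parser.feed(html)
    books = []
    for row in parser.rows:
        cells = [cell.strip() for cell in row["cells"]]
        books.append({
            "title": cells[0],
            "date_added": cells[1] if len(cells) > 1 else "",
            "book_page": urljoin(BASE_URL, row["link"]),
        })
    return books


def write_csv(books, fileobj):
    """Write the collected metadata to an open text file as CSV."""
    fields = ["title", "date_added", "book_page"]
    writer = csv.DictWriter(fileobj, fieldnames=fields)
    writer.writeheader()
    writer.writerows(books)


if __name__ == "__main__":
    # Hypothetical sample row; the real page markup may differ.
    sample = (
        "<table><tr><th>Title</th><th>Added</th></tr>"
        '<tr><td><a href="/index.php/book/1">Book One</a></td>'
        "<td>2019-01-01</td></tr></table>"
    )
    out = io.StringIO()
    write_csv(parse_result_page(sample), out)
    print(out.getvalue())
```

In a real run, the HTML of each result page would be fetched first and each book page would then be visited to add the RAR link as a further column; both steps are left out here because they depend on the site's actual markup.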
Use a download tool such as wget to download all the RAR files whose links were collected in the CSV file.
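As an alternative to wget, the download step could be done in Python. The following is a rough sketch under stated assumptions: the CSV column name `rar_link`, the file name `shamela_links.csv`, the output folder `rar_files`, and the helpers `read_links`, `target_name` and `download_all` are hypothetical and are not taken from the original scripts.

```python
import csv
import os
import posixpath
import time
import urllib.request
from urllib.parse import urlparse


def read_links(fileobj, column="rar_link"):
    """Return the non-empty values of one CSV column (column name is assumed)."""
    return [row[column] for row in csv.DictReader(fileobj) if row.get(column)]


def target_name(url):
    """Derive a local file name from the last path segment of the URL."""
    return posixpath.basename(urlparse(url).path) or "download.rar"


def download_all(links, out_dir, delay=1.0):
    """Download every link into out_dir, skipping files that already exist."""
    os.makedirs(out_dir, exist_ok=True)
    for url in links:
        path = os.path.join(out_dir, target_name(url))
        if os.path.exists(path):  # already fetched in an earlier run
            continue
        urllib.request.urlretrieve(url, path)
        time.sleep(delay)  # pause between requests


if __name__ == "__main__":
    # Hypothetical file and folder names.
    with open("shamela_links.csv", newline="", encoding="utf-8") as f:
        download_all(read_links(f), "rar_files")
```

Skipping files that already exist makes it possible to rerun the script after an interruption without downloading everything again.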