ScrapingShamela

Scraping Shamela

The process is divided into two parts:

1. Collecting the download links

The download links were collected from the http://shamela.ws/index.php/search/last/ page, which lists the files in the Shamela corpus in order of addition.

The list is divided into a number of pages (76 at the time of scraping). Each result page contains a table with 100 books. For each book, the table gives the title, the date when the book was added to Shamela, and a link to a separate book page. That book page contains the download links for the EPUB and BOK formats (the BOK file is zipped into a RAR archive).

The script used for this step can be found here: https://github.com/OpenITI/raw_SHAM19Y/blob/master/Shamela-Scripts/shamelascrap-v2.py

The script loops through the rows of the table and stores the title and book page link of each book. It then visits each book page and also stores the download link of the RAR file. NB: an updated version of the script also stores the date on which the book was added to Shamela and, if available, a link to a PDF version of the book (on http://waqfeya.com).

This information is then stored in a CSV file.
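The row-by-row collection described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code of shamelascrap-v2.py. The row layout (a link cell followed by a date cell), the class name `BookTableParser`, the function name `books_to_csv`, and the CSV column names are all assumptions made for the example; the real markup of the Shamela result pages may differ.

```python
import csv
import io
from html.parser import HTMLParser


class BookTableParser(HTMLParser):
    """Collect (title, book_page_link, date_added) from each table row.

    Assumed (hypothetical) row layout:
    <tr><td><a href="LINK">TITLE</a></td><td>DATE</td></tr>
    """

    def __init__(self):
        super().__init__()
        self.books = []        # one tuple per parsed book row
        self._cells = None     # cell texts of the current row; None outside <tr>
        self._href = None      # first link found in the current row
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._href = [], None
        elif tag == "td" and self._cells is not None:
            self._cells.append("")
            self._in_cell = True
        elif tag == "a" and self._in_cell and self._href is None:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            # Header rows (no <td>, no link) are skipped by this check.
            if self._href and len(self._cells) >= 2:
                title, date_added = (c.strip() for c in self._cells[:2])
                self.books.append((title, self._href, date_added))
            self._cells = None
            self._in_cell = False


def books_to_csv(html):
    """Parse one result page and return its books as CSV text."""
    parser = BookTableParser()
    parser.feed(html)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["title", "book_page", "date_added"])
    writer.writerows(parser.books)
    return out.getvalue()
```

In a full run, the HTML of each result page would be fetched first and passed to `books_to_csv`, and each stored book page link would then be visited to pick up the RAR download link.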

2. Downloading the files

Use a download tool such as wget to download all the RAR files from the download links collected in the CSV file.
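As an alternative to wget, the download step could be sketched in Python along the following lines. This is only an illustration under stated assumptions: the column name `rar_link` and the function names `read_links` and `download_all` are hypothetical and do not come from the original scripts, and the actual CSV file may use different headers.

```python
import csv
import io
import os
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def read_links(csv_text, column="rar_link"):
    """Return the non-empty values of `column` (a hypothetical header name)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column] for row in reader if row.get(column)]


def download_all(links, dest_dir, fetch=urlretrieve):
    """Download each link into dest_dir, skipping files that already exist.

    `fetch` takes (url, target_path); it defaults to urllib's urlretrieve
    and can be replaced, for example to test without network access.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in links:
        # Use the last path segment of the URL as the local file name.
        filename = os.path.basename(urlsplit(url).path)
        if not filename:
            continue  # no usable file name in this URL
        target = os.path.join(dest_dir, filename)
        if not os.path.exists(target):
            fetch(url, target)
        saved.append(target)
    return saved
```

Skipping files that already exist means an interrupted run can be restarted without downloading everything again, although a partially written file would need to be deleted by hand first.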
