Python code to use HTTrack to scrape Dal's myweb.dal.ca
Get the URLs you will be scraping into a .csv file with the two column headers name and url. For a few hundred URLs, you can do it in under a couple of minutes after watching the video here. I recommend muting it and playing it at 2x speed. Apply the technique to a Google search restricted to myweb.dal.ca.
If HTTrack is installed, and your .csv file is named appropriately with the two column headers name and url, it should work when you run httracker.py.
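The original httracker.py is not reproduced here, so the following is only a minimal sketch of what such a script might look like. The CSV file name (urls.csv), the mirrors/ output layout, and the exact HTTrack filter are all assumptions; only the name/url column scheme comes from the text above.

```python
# httracker.py -- hedged sketch of a CSV-driven HTTrack wrapper.
# Assumptions: the CSV is named urls.csv, has columns "name" and "url",
# and each site is mirrored into its own folder under mirrors/.
import csv
import subprocess
from pathlib import Path

CSV_FILE = "urls.csv"
OUTPUT_ROOT = Path("mirrors")

def build_httrack_cmd(name: str, url: str) -> list[str]:
    """Build an HTTrack invocation that mirrors one URL into its own
    directory, restricted to files that live on myweb.dal.ca."""
    return [
        "httrack", url,
        "-O", str(OUTPUT_ROOT / name),  # one output directory per CSV row
        "+*myweb.dal.ca/*",             # filter: stay on the target domain
        "--quiet",
    ]

def main() -> None:
    # Walk the CSV row by row and mirror each URL in turn.
    with open(CSV_FILE, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            subprocess.run(build_httrack_cmd(row["name"], row["url"]),
                           check=True)
```

A real script would finish with an `if __name__ == "__main__": main()` guard; it is omitted here so the sketch can be imported without touching the network.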
Here are some advantages of this technique:
It limits the scraped files to those that live on myweb.dal.ca.
It only scrapes directories on myweb.dal.ca that are popular enough to be reachable by Google's spiders -- that is, linked to by the outside web.
Given that this method preserves the date and time each file was last modified by its owner, it also lets you ask: what were the rate and timeline of the diffusion of Web 2.0 techniques among Dal community members?
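One way to start on that question, sketched here under the assumption that the mirror sits in a local directory (called mirrors/ below): since the mirrored files keep their last-modified timestamps, binning them by year gives a rough adoption timeline.

```python
# Sketch: bin mirrored files by their last-modified year.
# The directory name "mirrors" is an assumption, not from the original text.
import os
import time
from collections import Counter

def modification_years(root: str) -> Counter:
    """Count files under root by the year they were last modified."""
    years: Counter = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for fn in filenames:
            mtime = os.path.getmtime(os.path.join(dirpath, fn))
            years[time.localtime(mtime).tm_year] += 1
    return years
```

Plotting those counts per year would show when the community's pages were last actively edited; this only works if the mirroring step preserved timestamps rather than stamping everything with the download date.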
The experience of browsing the directory structure of the public face of myweb.dal.ca is a bit like walking down the stacks of a library. You run your virtual hands along the spines, stop, and flip through a person's stories. Maybe we shouldn't delete this stuff.
Also, the fact that some Chemistry professors used this resource as a way to experiment with technology and create works of public art suggests that maybe this private sandbox webhosting is a good resource to provide to a university community. It's perhaps not surprising that Dr. Aue was an expert in chromatography.
I'd be happy to tell you at great length why it matters that such a resource be hosted on Canadian soil, ideally under the control of job-secure, locally staffed IT people, but that might get too boring.