Web Data Scraping

Spring 2023 ITSS Mini-Course — ARSC 5040
Brian C. Keegan, Ph.D.
Assistant Professor, Department of Information Science
University of Colorado Boulder

Copyrighted and distributed under an MIT License

Course description

This is a five-week one-credit "mini-course" on retrieving ("scraping") data from the web. The course is intended for researchers in the social sciences and humanities with computational instincts but limited or no prior programming experience. Each class will be 2.5 hours long, with a break midway for biological input and output. Classes will combine lecture-by-notebook with hands-on exercises. Each class will end with links to resources and additional take-home exercises. Students will have the option of presenting their solutions to the take-home exercises at the beginning of the next class.

Although many programming languages offer libraries for web information retrieval and analysis, we will focus on the Python data analysis ecosystem given its popularity and capabilities. I strongly recommend that students download the latest version of the Anaconda distribution (Python 3.9 or above), which includes the Jupyter Notebook environment we're currently in, most of the data libraries we will use, and other conveniences.

Learning objectives

Students will:

Be able to navigate and access structured web data like HTML, XML, and JSON
Develop strategies for identifying relevant structures in semi-structured data using browser console tools
Utilize Python-based libraries to make requests and parse web data
Retrieve data from application programming interfaces (APIs)
Critically reflect on the technological and ethical constraints on web scraping
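
As a rough illustration of what "structured web data" means in the objectives above, the sketch below reads the same kind of information out of JSON, XML, and HTML. It is not taken from the course notebooks: the sample strings and the `LinkCollector` class are hypothetical, and it uses only the Python standard library so it runs without a network connection, whereas the course itself works with requests and BeautifulSoup.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample documents; in practice these would come from a web request.
json_text = '{"course": "Web Data Scraping", "weeks": 5}'
xml_text = '<course weeks="5"><title>Web Data Scraping</title></course>'
html_text = (
    '<html><body><h1>Web Data Scraping</h1>'
    '<a href="/week1">Week 1</a><a href="/week2">Week 2</a>'
    '</body></html>'
)

# JSON maps directly onto Python dictionaries and lists.
record = json.loads(json_text)
json_title = record["course"]

# XML is a tree of elements with tags, attributes, and text.
root = ET.fromstring(xml_text)
xml_title = root.find("title").text
xml_weeks = int(root.get("weeks"))


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


# HTML is parsed as a stream of tags; here we keep only the link targets.
collector = LinkCollector()
collector.feed(html_text)
links = collector.links

print(json_title, xml_title, xml_weeks, links)
```

In all three cases the work is the same: find the structure that holds the value you want, then pull it out. What changes is how each format exposes that structure.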

Class outline

Week 1: Introduction to Python, Anaconda, Jupyter, browser console, structured data, ethical considerations
Week 2: Scraping HTML with requests and BeautifulSoup
Week 3: Scraping web data with Selenium, ethics of screen-scraping
Week 4: Scraping the Internet Archive and Wikipedia APIs
Week 5: Scraping the Reddit and Mastodon APIs
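
Several weeks touch on the ethics of scraping. One common courtesy, sketched here as an assumption about practice rather than as course material, is to check a site's robots.txt before requesting a page. The robots.txt content, the URLs, and the `example-course-bot` user agent are hypothetical; the sketch parses the rules from a list of lines so that it runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would fetch it from the site.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(robots_lines)


def is_allowed(url, user_agent="example-course-bot"):
    """Return True if the parsed robots.txt rules permit fetching this URL."""
    return parser.can_fetch(user_agent, url)


allowed_public = is_allowed("https://example.com/index.html")
allowed_private = is_allowed("https://example.com/private/data.html")

# Number of seconds the site asks crawlers to wait between requests, if stated.
delay = parser.crawl_delay("example-course-bot")

print(allowed_public, allowed_private, delay)
```

Passing a robots.txt check is only a starting point: a site's terms of service and the privacy expectations of the people in the data still apply.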

Evaluation

To be determined based on enrollment, the distribution of skills, etc., but evaluation will primarily involve regular attendance, participation, and an upward trajectory in skill and confidence.

Acknowledgements

This course will draw on resources built by Allison Morgan and me for the 2018 Summer Institute for Computational Social Science, which were in turn derived from other resources developed by Simon Munzert and Chris Bail.

Thank you also to Professor Terra McKinnish for coordinating the ITSS seminars.


CU-ITSS/Web-Data-Scraping-S2023
