# Retrieving web page content using the *requests* package

The *requests* package retrieves content from the internet by making HTTP requests for web pages.

GET requests are made with the *requests.get* method. The usual workflow is to make the request, check whether it was successful, and if so, parse the web page content.

Note: for all requests, it is good practice to include a header that specifies the *User-Agent*, a string identifying the client making the request. You can find your browser's User-Agent at the following site:<br> https://developers.whatismybrowser.com/useragents/parse/?analyse-my-user-agent=yes

Web servers may return different results depending on the user agent, and may be designed to block requests that are automated:

https://hhsm95.dev/blog/the-importance-of-using-user-agent-to-scraping-data/


We first load the required modules. 

In [1]:
from bs4 import BeautifulSoup
import requests

If no user agent is specified, then the default user agent is used:

In [2]:
requests.utils.default_headers()

{'User-Agent': 'python-requests/2.28.1', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive'}

We can specify a user agent by creating a dictionary of headers, which is later passed to *requests.get*.

In [3]:
# the value of "User-Agent" should be set appropriately
headers = {"User-Agent": "TestBot"}
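
When several requests share the same headers, a *requests.Session* can hold them in one place. The sketch below is one possible approach and is not part of the original workflow. It prepares a request without sending it, so the merged headers can be inspected offline. The "TestBot" value is the same placeholder used above.

```python
import requests

# Placeholder value carried over from the cell above; set it appropriately.
headers = {"User-Agent": "TestBot"}

# A Session merges its own headers into every request it prepares.
session = requests.Session()
session.headers.update(headers)

# Prepare (but do not send) a GET request to inspect the final headers.
request = requests.Request("GET", "https://www.easternct.edu/programs/index.html")
prepared = session.prepare_request(request)

print(prepared.headers["User-Agent"])
```

Calling *session.get(url)* would then send the request with these headers, without the need to pass *headers* on each call.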

Use *requests.get* to submit a GET request, passing the headers dictionary:

In [4]:
page = requests.get('https://www.easternct.edu/programs/index.html', headers = headers)
page

<Response [200]>

Check for a valid status (https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html). In the code below, the *exit()* function will cause the Python kernel to be reset if the status is not OK. In that case you will need to re-run the script from the beginning.

In [5]:
if page.status_code != requests.codes.ok :
    print("Request was not successful, status code:", page.status_code)
    print("Hit enter to continue...")
    input()
    exit()
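
An alternative that avoids *exit()* is *Response.raise_for_status*, which raises *requests.HTTPError* for 4xx and 5xx status codes. The following is a minimal sketch. The helper name *request_succeeded* is hypothetical, and the responses are built by hand so that the example runs without network access.

```python
import requests

def request_succeeded(response):
    """Return True if the response has a non-error status (hypothetical helper)."""
    try:
        # Raises requests.HTTPError for 4xx and 5xx status codes.
        response.raise_for_status()
    except requests.HTTPError as error:
        print("Request was not successful:", error)
        return False
    return True

# Hand-built responses stand in for real ones so no request is sent.
ok_response = requests.Response()
ok_response.status_code = 200

missing_response = requests.Response()
missing_response.status_code = 404

print(request_succeeded(ok_response))
print(request_succeeded(missing_response))
```

With a real response, the same helper could be called as *request_succeeded(page)* before parsing.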

If our request is successful, then *page.content* will contain a *bytes* object holding the HTML of the page. This is what we parse with *BeautifulSoup*.

In [6]:
page.content

b'<!DOCTYPE html>\r\n<html xmlns="http://www.w3.org/1999/xhtml" class="no-js" lang="en">\r\n\t<head>\r\n\t\t<meta content="IE=edge" http-equiv="X-UA-Compatible"/>\r\n\t\t<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>\r\n\t\t<meta content="width=device-width, initial-scale=1.0" name="viewport"/>\r\n\t\t<title>Majors and Minors - Eastern</title>\r\n\t\t<meta content="Undergraduate Programs majors minors" name="keywords"/>\r\n\t\t<meta content="Undergraduate Programs / Majors and Minors at Eastern." name="description"/>\r\n\t\t\r\n\t<style type="text/css">svg:not(:root).svg-inline--fa{overflow:visible}.svg-inline--fa{display:inline-block;font-size:inherit;height:1em;overflow:visible;vertical-align:-.125em}.svg-inline--fa.fa-lg{vertical-align:-.225em}.svg-inline--fa.fa-w-1{width:.0625em}.svg-inline--fa.fa-w-2{width:.125em}.svg-inline--fa.fa-w-3{width:.1875em}.svg-inline--fa.fa-w-4{width:.25em}.svg-inline--fa.fa-w-5{width:.3125em}.svg-inline--fa.fa-w-6{width:.375em}.sv
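
*page.content* holds raw bytes, while *page.text* holds the same data decoded to a string. The short sketch below illustrates the difference using a hand-written byte string, which is an assumption standing in for a real response body and is assumed to be UTF-8 encoded.

```python
# Hypothetical response body, UTF-8 encoded, standing in for page.content.
raw = "<html><head><title>Café menu</title></head></html>".encode("utf-8")

# Bytes must be decoded with the right encoding to obtain text.
decoded = raw.decode("utf-8")

# The byte count exceeds the character count when non-ASCII text is present.
print(type(raw).__name__, len(raw))
print(type(decoded).__name__, len(decoded))
```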

Now we can parse the content using *BeautifulSoup*.

In [7]:
# Parse page using BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
print('Page title:', soup.title.string.strip())

Page title: Majors and Minors - Eastern
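
Not every page has a *title* element. When it is missing, *soup.title* is None and the cell above would raise an AttributeError. A defensive variant might look like the sketch below, where the function name *page_title* and the inline HTML snippets are hypothetical.

```python
from bs4 import BeautifulSoup

def page_title(soup, default="(no title)"):
    """Return the stripped page title, or a default if none is present."""
    title = soup.title
    if title is None or title.string is None:
        return default
    return title.string.strip()

# Two small hand-written documents: one with a title, one without.
with_title = BeautifulSoup(
    "<html><head><title> Majors and Minors - Eastern </title></head></html>",
    "html.parser",
)
without_title = BeautifulSoup(
    "<html><body><p>No title here</p></body></html>",
    "html.parser",
)

print("Page title:", page_title(with_title))
print("Page title:", page_title(without_title))
```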


Let's print out all programs offered at Eastern. There are several ways to do this, but note that program names are inside list item (*li*) elements that include the class 'filter-all'. For example, we can get the first program, Accounting, by doing the following:

In [8]:
accounting = soup.find('li', class_ = 'filter-all')
accounting

<li class="index filter-all filter-major filter-minor filter-accounting mix row" data-bound="" style="display: inline-block;">
<input checked="checked" id="accordion-0" title="Accounting" type="checkbox"/>
<h3 class="col6">
<label for="accordion-0" title="Accordion Option">
<i></i> Accounting</label>
</h3>
<div class="col6 available-programs">
<div class="col4"><span>Major</span> <svg aria-hidden="true" class="svg-inline--fa fa-check fa-w-14" data-fa-i2svg="" data-icon="check" data-prefix="fal" role="img" viewbox="0 0 448 512" xmlns="http://www.w3.org/2000/svg"><path d="M413.505 91.951L133.49 371.966l-98.995-98.995c-4.686-4.686-12.284-4.686-16.971 0L6.211 284.284c-4.686 4.686-4.686 12.284 0 16.971l118.794 118.794c4.686 4.686 12.284 4.686 16.971 0l299.813-299.813c4.686-4.686 4.686-12.284 0-16.971l-11.314-11.314c-4.686-4.686-12.284-4.686-16.97 0z" fill="currentColor"></path></svg></div>
<div class="col4"><span>Minor</span> <svg aria-hidden="true" class="svg-inline--fa fa-check fa-w-14" d

The program name can be found in a *label* element. Note that the label element also contains an *i* (italics) tag, so it has more than one child node and *element.string* returns None. We can instead use *text* to extract all of the text, with a *strip* to remove the surrounding whitespace.

In [9]:
accounting.label.text.strip()

'Accounting'
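
To see the difference between *string* and *text* in isolation, the sketch below parses a simplified copy of the label markup from the output above. The *get_text(strip=True)* call is shown as an alternative to *text.strip()* and is not used elsewhere in this notebook.

```python
from bs4 import BeautifulSoup

# Simplified copy of the label markup shown in the output above.
html = '<label for="accordion-0" title="Accordion Option">\n<i></i> Accounting</label>'
label = BeautifulSoup(html, "html.parser").label

# .string is None because the label has more than one child node.
print(repr(label.string))

# .text joins all text inside the element; strip() trims the whitespace.
print(repr(label.text.strip()))

# get_text(strip=True) strips each text fragment before joining them.
print(repr(label.get_text(strip=True)))
```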

In [10]:
print('Majors, minors, and concentration programs at Eastern include:')
for program in soup.find_all('li', class_ = 'filter-all') :
    p = program.label.text.strip()
    print('', p)

Majors, minors, and concentration programs at Eastern include:
 Accounting
 Acting
 Actuarial Science
 Advertising
 Allied Health
 American Studies
 Anthropology
 Applied Media Production
 Archaeology
 Art
 Art History
 Asian Studies
 Astronomy Outreach and Public Presentation
 Banking and Real Estate
 Behavior Analysis
 Biochemistry
 Bioinformatics
 Biology
 Business Administration
 Business Analytics
 Business Economics
 Business Information Systems
 Cannabis Cultivation and Chemistry
 Chemistry
 Coaching
 Cognitive Neuroscience
 Communication
 Communication Generalist
 Computer Engineering Science
 Computer Science
 Costume & Fashion Design
 Creative Writing
 Criminology
 Cultural Anthropology
 Cultural Studies
 Dance and World Performance
 Data Science
 Design, Technology and Management
 Developmental Psychology
 Digital Art and Media Design
 Digital Media Design
 Directing, Dramaturgy and Cultural Performance
 Early Childhood Education
 Economics
 Elementary Education
 English
 Eng