# Scraping Text from a Web Page

Now we come to the third way we can retrieve text from the internet. First, we used a module (yfinance) to retrieve data directly from the internet. Then we used a RESTful API (Application Programming Interface) to retrieve data from a source by connecting to an API endpoint and requesting data via an HTTP GET.

The third way to retrieve data is to scrape it from a web page. When we scrape a web page, we essentially load the page's HTML into a variable using the requests package. We then use a module called BeautifulSoup4 to extract pieces of information from that HTML. Let's take a look at a page from the Ramapo website. Specifically, we want to look at the academic program offerings at ASB. Loading the page in a browser shows 14 academic programs offered at the Ansfield School of Business:

![](ramapo_ASB.JPG)
- Accounting
- Accounting (4+1 BS-MS)
- Accounting (MSAC)
- Business Analytics
- Economics
- Entrepreneurship
- Finance
- Human Resources Management
- Information Technology Management
- International Business
- Management
- Marketing
- Master of Business Administration (MBA)
- Sports Management

Wouldn't it be nice to retrieve these academic programs and use them in our own programs?
Without access to the data source behind this list, and without an API from Ramapo to query, we resort to web scraping to retrieve the list. Let's look at the code.

### Setting up our Environment
We are going to import BeautifulSoup from the bs4 module, along with requests.
Requests handles the HTTP communication for our call.


In [2]:
from bs4 import BeautifulSoup
import requests

Once these modules are available, we can start to explore the page. The URL for the ASB website is https://www.ramapo.edu/asb/. The first step is to load the page in a modern browser like Chrome. This lets us look at the page and see where to target the web scrape. As shown in the screen capture above, we want to create a list containing all of the academic programs ASB has to offer.

The key to web scraping is a good understanding of how the page was built in HTML. Knowing the overall structure of the page and some basic HTML tags will help you figure out how to retrieve the data. To scrape the page properly, you will have to explore the HTML using the Inspect option. Since we are using Chrome in our example, Inspect is available by right-clicking an object on the screen and selecting Inspect. The inspection window appears and displays the code behind the page.

![](ramapo_ASB2.JPG)


When we highlight the element we want to examine on the page, the inspection window automatically jumps to that part of the page and shows the code for the highlighted element. Closer examination reveals that the highlighted word Accounting is inside an HTML \<h3\> tag. Further investigation reveals that each of the items we want to retrieve is in an \<h3\> tag. If we traverse up the tree, we also notice that all of our \<h3\> tags sit under a \<div\> tag whose id attribute is "row-courses". That id is perfect: BeautifulSoup can search for it directly, which gives us a starting point for parsing out the programs.
    
![](ramapo_ASB3.JPG)
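
The upward traversal described above can also be done in code. The following is a minimal sketch on a hand-written fragment that imitates the structure seen in the inspector (the `sample_html` string and the variable names are assumptions for illustration, not the live page). Starting from an \<h3\>, `find_parent()` walks up to the enclosing \<div\> with the id "row-courses".

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the real page (an assumption for illustration only)
sample_html = """
<div class="row" id="row-courses">
  <div class="course">
    <div class="course-wrap"><h3>Accounting</h3><h4>Bachelor of Science</h4></div>
  </div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
first_h3 = sample_soup.find("h3")                           # start at a program name
container = first_h3.find_parent("div", id="row-courses")   # walk up the tree

print(first_h3.text)     # text inside the <h3>
print(container["id"])   # id of the enclosing <div>
```

This mirrors what we did by hand in the inspector: begin at the text we care about, then climb until we reach an element with a distinctive id.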

Let's continue with our environment setup by creating a variable to hold our URL and loading the page into a variable using requests.

In [3]:
URL = "https://www.ramapo.edu/asb/"  # My url
headers = {"Content-Type": "application/json", "User-agent": "Mozilla"}  # HTTP headers to send
page = requests.get(URL, headers=headers)  # My web page (the headers are passed with the request)

The code above stores the URL for the ASB page and points the page variable at the response returned by requests. Print page and make sure we get status code 200, then print its type to verify that page is the Response object from the requests call. Finally, print page.text to see what the returned HTML looks like.

In [4]:
print(page)
print (type(page))
print(page.text)

<Response [200]>
<class 'requests.models.Response'>
<!doctype html>
<!--[if lt IE 7]>      <html class="no-js lt-ie10 lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html class="no-js lt-ie10 lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html class="no-js lt-ie10 lt-ie9"> <![endif]-->
<!--[if IE 9]>         <html class="no-js lt-ie10"> <![endif]-->
<!--[if gt IE 8]> <html class="no-js"> <![endif]-->
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
<head>
	<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1">
	<link rel="shortcut icon" href="/wp-content/themes/rcnjrd/images/favicons/favicon.ico" />
	<link rel="apple-touch-icon" sizes="57x57" href="/wp-content/themes/rcnjrd/images/favicons/apple-touch-icon-57x57.png" />
	<link rel="apple-touch-icon" sizes="114x114" href="/wp-content/themes/rcnjrd/images/favicons/apple-touch-icon-114x114.png" />
	<link rel="apple-touch-icon" si
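
Checking for a 200 by eye works in a notebook, but a script may want to stop automatically when the request fails. The sketch below is one possible way to do that, assuming a hypothetical helper named `fetch_html` (not part of the original notebook). It relies on `raise_for_status()`, a method of the requests Response object that raises `requests.HTTPError` for 4xx and 5xx status codes.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML, or raise requests.HTTPError on a 4xx/5xx status."""
    response = requests.get(url, headers={"User-Agent": "Mozilla"}, timeout=timeout)
    response.raise_for_status()  # does nothing for a 200; raises for error codes
    return response.text
```

A call such as `html = fetch_html("https://www.ramapo.edu/asb/")` would then either return the HTML text or fail loudly, rather than handing an error page to BeautifulSoup.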

As you can see, the returned HTML is a jumble of code that is hard to read. Next, let's parse it with the BeautifulSoup module and store the result in the variable soup.

In [5]:
soup = BeautifulSoup(page.content, "html.parser") # soupify the webpage

Once the page has been "soupified", we can begin to scrape the items we need from it. Remember the "row-courses" id? We will ask BeautifulSoup to find the element with id="row-courses".

In [6]:
tags = soup.find(id="row-courses")  # find the first element whose id is "row-courses"
print(tags)

<div class="row" id="row-courses"><div class="col-lg-3 col-md-4 col-sm-6 course all-courses active Major Minor Undergraduate"><div class="course-wrap" data-url="https://www.ramapo.edu/majors-minors/?p=19"><h3>Accounting</h3><h4>Bachelor of Science</h4><div class="keys"><span class="ramaroon">M</span> <span class="ramagrey">m</span></div></div></div><div class="col-lg-3 col-md-4 col-sm-6 course all-courses active Graduate Major Undergraduate"><div class="course-wrap" data-url="https://www.ramapo.edu/majors-minors/?p=470"><h3>Accounting (4+1 BS-MS)</h3><h4>Master of Science</h4><div class="keys"><span class="ramaroon">M</span> <span class="ramagray">G</span></div></div></div><div class="col-lg-3 col-md-4 col-sm-6 course all-courses active Graduate"><div class="course-wrap" data-url="https://www.ramapo.edu/majors-minors/?p=475"><h3>Accounting (MSAC)</h3><h4>Master of Science</h4><div class="keys"><span class="ramagray">G</span></div></div></div><div class="col-lg-3 col-md-4 col-sm-6 cours

Executing the print(tags) statement outputs everything within the \<div\> tag. It is a bit hard to read, isn't it? We can use the .prettify() method to produce easier-to-read output. Run the next print statement and observe the results; the returned HTML is much easier to follow.

In [7]:
print(tags.prettify())

<div class="row" id="row-courses">
 <div class="col-lg-3 col-md-4 col-sm-6 course all-courses active Major Minor Undergraduate">
  <div class="course-wrap" data-url="https://www.ramapo.edu/majors-minors/?p=19">
   <h3>
    Accounting
   </h3>
   <h4>
    Bachelor of Science
   </h4>
   <div class="keys">
    <span class="ramaroon">
     M
    </span>
    <span class="ramagrey">
     m
    </span>
   </div>
  </div>
 </div>
 <div class="col-lg-3 col-md-4 col-sm-6 course all-courses active Graduate Major Undergraduate">
  <div class="course-wrap" data-url="https://www.ramapo.edu/majors-minors/?p=470">
   <h3>
    Accounting (4+1 BS-MS)
   </h3>
   <h4>
    Master of Science
   </h4>
   <div class="keys">
    <span class="ramaroon">
     M
    </span>
    <span class="ramagray">
     G
    </span>
   </div>
  </div>
 </div>
 <div class="col-lg-3 col-md-4 col-sm-6 course all-courses active Graduate">
  <div class="course-wrap" data-url="https://www.ramapo.edu/majors-minors/?p=475">
   <h3>
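
The prettified output also shows that each course-wrap \<div\> carries a data-url attribute. As a small sketch of how attributes can be read, the example below uses a trimmed copy of two cards from the output above (the class lists are shortened, and the names `cards_html`, `cards`, and `links` are assumptions for illustration). A Tag behaves like a dictionary for its attributes, and `.get()` returns None when an attribute is missing.

```python
from bs4 import BeautifulSoup

# Two cards copied from the output above, trimmed for brevity
cards_html = """
<div class="row" id="row-courses">
 <div class="course"><div class="course-wrap" data-url="https://www.ramapo.edu/majors-minors/?p=19"><h3>Accounting</h3></div></div>
 <div class="course"><div class="course-wrap" data-url="https://www.ramapo.edu/majors-minors/?p=470"><h3>Accounting (4+1 BS-MS)</h3></div></div>
</div>
"""
cards = BeautifulSoup(cards_html, "html.parser")

links = {}
for wrap in cards.find_all("div", class_="course-wrap"):
    # .get() returns None instead of raising if the attribute is missing
    links[wrap.h3.text] = wrap.get("data-url")

print(links)
```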

Scroll through the results above and you will see that each of the academic programs is, just as we noted earlier, inside an \<h3\> tag. Now all we have to do is get the \<h3\> tags by using the .find() or .find_all() methods. The .find() method returns only the first \<h3\> tag, while .find_all() returns all of them.

In [20]:
academic_program = tags.find("h3")  # finds the first <h3> tag
print(academic_program)
print(type(academic_program))

<h3>Accounting</h3>
<class 'bs4.element.Tag'>
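
To make the difference between the two methods concrete, here is a minimal sketch on a two-item fragment (the fragment and the variable names are assumptions for illustration). It also shows `select()`, which BeautifulSoup provides for picking the same tags with a CSS selector.

```python
from bs4 import BeautifulSoup

snippet = BeautifulSoup(
    '<div id="row-courses"><h3>Accounting</h3><h3>Finance</h3></div>',
    "html.parser",
)

first = snippet.find("h3")                    # a single Tag (or None if absent)
every = snippet.find_all("h3")                # a list-like ResultSet of Tags
via_css = snippet.select("#row-courses h3")   # same tags, chosen by CSS selector

print(first.text)
print([tag.text for tag in every])
print([tag.text for tag in via_css])
```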


Iterating through the children of tags with a for loop retrieves every academic program in the \<h3\> tags along with the degrees in the \<h4\> tags. Some programs have no \<h4\> tag. In that case .find() returns None and reading .text raises an AttributeError, so we use try/except to trap the error and record "None" for the degree.

In [21]:
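for tag in tags:
    academic = tag.find('h3').text  # program name in this card
    try:
        degree = tag.find('h4').text  # degree, if the card has one
    except AttributeError:  # find() returned None: no <h4> in this card
        degree = "None"

    myList = [academic, degree]
    print(myList)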

['Accounting', 'Bachelor of Science']
['Accounting (4+1 BS-MS)', 'Master of Science']
['Accounting (MSAC)', 'Master of Science']
['Business Analytics', 'None']
['Economics', 'Bachelor of Arts']
['Entrepreneurship', 'None']
['Finance', 'Bachelor of Science']
['Human Resources Management', 'None']
['Information Technology Management', 'Bachelor of Science']
['International Business', 'Bachelor of Arts']
['Management', 'Bachelor of Science']
['Marketing', 'Bachelor of Science']
['Master of Business Administration (MBA)', 'Master of Business Administration']
['Sports Management', 'None']
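
The try/except approach works; an alternative is to test for None explicitly. The sketch below assumes a small hand-written fragment (not the live page) and also skips any child that is not a Tag, since markup that contains line breaks between the cards yields plain text nodes when we iterate over an element's children.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

demo = BeautifulSoup(
    """<div id="row-courses">
    <div class="course"><h3>Accounting</h3><h4>Bachelor of Science</h4></div>
    <div class="course"><h3>Business Analytics</h3></div>
    </div>""",
    "html.parser",
)

rows = []
for card in demo.find(id="row-courses").children:
    if not isinstance(card, Tag):  # skip whitespace/text nodes between cards
        continue
    h3 = card.find("h3")
    h4 = card.find("h4")
    # find() returns None when the tag is absent, so test before reading .text
    academic = h3.text if h3 is not None else "None"
    degree = h4.text if h4 is not None else "None"
    rows.append([academic, degree])

print(rows)
```

Either style produces the same pairs; the explicit check simply makes the "missing tag" case visible in the code.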


### What Can We Do With The Scraped Data ?

As you can see, we were able to scrape the academic programs directly from the ASB page. We can also collect just the program names: create an empty list and append each tag's text to it as we iterate through the \<h3\> tags.



In [23]:
academics = []  # create an empty list
academic_programs = tags.find_all("h3")  # find every <h3> inside tags; this returns a bs4 ResultSet
for tag in academic_programs:  # iterate through the tags
    academics.append(tag.text)  # append the tag's text to the list
print(type(academics))  # check that academics is indeed a list
print(academics)

<class 'list'>
['Accounting', 'Accounting (4+1 BS-MS)', 'Accounting (MSAC)', 'Business Analytics', 'Economics', 'Entrepreneurship', 'Finance', 'Human Resources Management', 'Information Technology Management', 'International Business', 'Management', 'Marketing', 'Master of Business Administration (MBA)', 'Sports Management']


We may optionally want to make sure our academic programs cannot be changed (are immutable) by converting the list to a tuple.

In [24]:
myTuple = tuple(academics)  # convert the list to a tuple
print(type(myTuple))  # check that the new object is indeed a tuple
print(myTuple)

<class 'tuple'>
('Accounting', 'Accounting (4+1 BS-MS)', 'Accounting (MSAC)', 'Business Analytics', 'Economics', 'Entrepreneurship', 'Finance', 'Human Resources Management', 'Information Technology Management', 'International Business', 'Management', 'Marketing', 'Master of Business Administration (MBA)', 'Sports Management')
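
To see what immutability means in practice, here is a minimal sketch using a shortened sample of the tuple (the name `programs` and the replacement value are made up for illustration). Assigning to an item raises a TypeError, and a tuple has no append() method at all.

```python
programs = ("Accounting", "Finance", "Marketing")  # shortened sample of myTuple

try:
    programs[0] = "Astrology"  # item assignment is not allowed on a tuple
except TypeError as err:
    print(f"TypeError: {err}")

print(hasattr(programs, "append"))  # tuples have no append() method
```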


Let's get all of the degree types as well. If you scroll up through the output of tags, the degree types are located in the \<h4\> tags. Let's retrieve those and append them to a list called degrees.

In [25]:
degree_tags =tags.find_all("h4")
print(degree_tags)

[<h4>Bachelor of Science</h4>, <h4>Master of Science</h4>, <h4>Master of Science</h4>, <h4>Bachelor of Arts</h4>, <h4>Bachelor of Science</h4>, <h4>Bachelor of Science</h4>, <h4>Bachelor of Arts</h4>, <h4>Bachelor of Science</h4>, <h4>Bachelor of Science</h4>, <h4>Master of Business Administration</h4>]


In [26]:
degrees = []
for tag in degree_tags:  # iterate through the tags
    degrees.append(tag.text)  # append the tag's text to the list
print(type(degrees))  # check that degrees is indeed a list
print(degrees)

<class 'list'>
['Bachelor of Science', 'Master of Science', 'Master of Science', 'Bachelor of Arts', 'Bachelor of Science', 'Bachelor of Science', 'Bachelor of Arts', 'Bachelor of Science', 'Bachelor of Science', 'Master of Business Administration']
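
One thing worth checking before combining the two lists: academics has 14 entries but degrees has only 10, because four programs have no \<h4\> tag. The sketch below retypes the two lists printed above so it runs on its own, and shows that pairing them by position with zip() would attach the wrong degree to a program. This is why the earlier loop reads the \<h3\> and \<h4\> from the same card.

```python
academics = ['Accounting', 'Accounting (4+1 BS-MS)', 'Accounting (MSAC)',
             'Business Analytics', 'Economics', 'Entrepreneurship', 'Finance',
             'Human Resources Management', 'Information Technology Management',
             'International Business', 'Management', 'Marketing',
             'Master of Business Administration (MBA)', 'Sports Management']
degrees = ['Bachelor of Science', 'Master of Science', 'Master of Science',
           'Bachelor of Arts', 'Bachelor of Science', 'Bachelor of Science',
           'Bachelor of Arts', 'Bachelor of Science', 'Bachelor of Science',
           'Master of Business Administration']

print(len(academics), len(degrees))

pairs = list(zip(academics, degrees))  # zip stops at the shorter list
print(len(pairs))
print(pairs[3])  # Business Analytics has no <h4>, yet zip gives it a degree
```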


We can use the data in our programs to answer some questions about ASB.

In [33]:
print(f"The number of academic programs offered by the Ansfield School of Business is {len(myTuple)}")

The number of academic programs offered by the Ansfield School of Business is 14
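
As a further example of answering questions with the scraped data, here is a small sketch that counts how often each degree type appears. It uses collections.Counter from the standard library and retypes the degrees list printed earlier so the snippet runs on its own (the name `degree_counts` is an assumption for illustration).

```python
from collections import Counter

degrees = ['Bachelor of Science', 'Master of Science', 'Master of Science',
           'Bachelor of Arts', 'Bachelor of Science', 'Bachelor of Science',
           'Bachelor of Arts', 'Bachelor of Science', 'Bachelor of Science',
           'Master of Business Administration']

degree_counts = Counter(degrees)  # maps each degree name to its count
for degree, count in degree_counts.most_common():
    print(f"{degree}: {count}")
```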


We can also write the lists out to a CSV (comma-separated values) file, which can be opened in Excel. For this we import writer from Python's built-in csv module, which lets us read and write CSV files directly.

In [42]:
from csv import writer

# "with open" closes the file automatically when the block ends
with open('ASB Academic Programs.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)  # create a writer object
    header = ['Academic Program', 'Degree Program']  # column headers in the CSV
    thewriter.writerow(header)  # write the column headings first

    for tag in tags:
        try:  # find() returns None if there is no <h3>, so .text raises AttributeError
            academic = tag.find('h3').text  # program name in this card
        except AttributeError:
            academic = "None"

        try:  # same guard for a missing <h4>
            degree = tag.find('h4').text  # degree in this card
        except AttributeError:
            degree = "None"

        myList = [academic, degree]
        thewriter.writerow(myList)
        print(myList)

['Accounting', 'Bachelor of Science']
['Accounting (4+1 BS-MS)', 'Master of Science']
['Accounting (MSAC)', 'Master of Science']
['Business Analytics', 'None']
['Economics', 'Bachelor of Arts']
['Entrepreneurship', 'None']
['Finance', 'Bachelor of Science']
['Human Resources Management', 'None']
['Information Technology Management', 'Bachelor of Science']
['International Business', 'Bachelor of Arts']
['Management', 'Bachelor of Science']
['Marketing', 'Bachelor of Science']
['Master of Business Administration (MBA)', 'Master of Business Administration']
['Sports Management', 'None']


We can open the file using Python and read its contents.

In [43]:
import csv  # csv.reader needs the module itself, not just writer

with open("ASB Academic Programs.csv", 'r', encoding='utf8', newline='') as file:
    csvreader = csv.reader(file)
    for row in csvreader:
        print(row)

['Academic Program', 'Degree Program']
['Accounting', 'Bachelor of Science']
['Accounting (4+1 BS-MS)', 'Master of Science']
['Accounting (MSAC)', 'Master of Science']
['Business Analytics', 'None']
['Economics', 'Bachelor of Arts']
['Entrepreneurship', 'None']
['Finance', 'Bachelor of Science']
['Human Resources Management', 'None']
['Information Technology Management', 'Bachelor of Science']
['International Business', 'Bachelor of Arts']
['Management', 'Bachelor of Science']
['Marketing', 'Bachelor of Science']
['Master of Business Administration (MBA)', 'Master of Business Administration']
['Sports Management', 'None']
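
Once the data is in CSV form, the header row can double as field names. The sketch below is one way to list the programs that have no degree recorded, using csv.DictReader from the standard library. To keep it self-contained it reads from an in-memory stand-in (`sample_csv`, an assumption for illustration) holding a few of the rows shown above instead of the file on disk.

```python
import csv
import io

# In-memory stand-in for "ASB Academic Programs.csv" (a few rows from above)
sample_csv = io.StringIO(
    "Academic Program,Degree Program\n"
    "Accounting,Bachelor of Science\n"
    "Business Analytics,None\n"
    "Economics,Bachelor of Arts\n"
)

# DictReader uses the header row as keys for every following row
no_degree = [
    row["Academic Program"]
    for row in csv.DictReader(sample_csv)
    if row["Degree Program"] == "None"
]
print(no_degree)
```

With the real file, the same comprehension would work by passing the open file object to csv.DictReader in place of `sample_csv`.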
