# Scraping Text from a Web Page

Now we come to the third way to retrieve text from the internet. We have already used a module (yfinance) to retrieve data directly from the internet. We have also used a RESTful API (Application Programming Interface) to retrieve data from a source by connecting to an API endpoint and requesting data via an HTTP GET.

The third way to retrieve data is to scrape it from a web page. When we scrape a web page, we are essentially loading the page's HTML into a variable using the requests package. We then use a package called BeautifulSoup4 to extract bits of information from that HTML. Let's take a look at a web page from the Ramapo website. Specifically, we want to look at the academic program offerings at ASB. We can load the page in our browser and see the 14 academic programs offered at the Anisfield School of Business.
![](ramapo_ASB.JPG)
- Accounting
- Accounting (4+1 BS-MS)
- Accounting (MSAC)
- Business Analytics
- Economics
- Entrepreneurship
- Finance
- Human Resources Management
- Information Technology Management
- International Business
- Management
- Marketing
- Master of Business Administration (MBA)
- Sports Management

Wouldn't it be nice to retrieve these academic programs and use them in our own programs?
Without access to the data source behind this list, and with no API from Ramapo for us to query, we resort to web scraping to retrieve the list. Let's look at the code.

### Setting up our Environment
We are going to import BeautifulSoup from the bs4 package, and also import requests.
Requests will handle the HTTP communication for our call.


In [None]:
from bs4 import BeautifulSoup
import requests

Once these modules are available, we can start to explore the page. The URL for the ASB website is
https://www.ramapo.edu/asb/. The first step is to load the page in a modern browser such as Chrome. This lets us look at the page and see where to target the web scrape. As shown in the screen capture above, we want to create a list containing all of the academic programs ASB has to offer. The key to web scraping is a good understanding of how the page is built in HTML. Understanding the overall structure of the page and some basic HTML tags will help you figure out how to retrieve the data. To scrape the page properly, you will have to explore the HTML using the Inspect option. Since we are using Chrome in our example, the Inspect option is available by right-clicking an object on the screen and selecting Inspect. The inspection window appears and displays the code for the page.

![](ramapo_ASB2.JPG)


When we highlight the element on the page we want to examine, the inspection window automatically jumps to that part of the page and shows the code for the highlighted element. Closer examination reveals that the highlighted word Accounting is inside an HTML \<h3\> tag. Further investigation reveals that each of the items we want to retrieve is in an \<h3\> tag. If we traverse up the tree, we also notice that all of our \<h3\> tags sit under a \<div\> tag whose id attribute is "row-courses". The id attribute of the \<div\> tag is perfect, because BeautifulSoup can use it to find that section of the page and parse it out.
    
![](ramapo_ASB3.JPG)

Let's continue with our environment setup by creating a variable to hold our URL and loading the page into a variable using requests.

In [None]:
URL= "https://www.ramapo.edu/asb/"  #My url
page = requests.get(URL) # My web page

The code above stores the URL for the ASB page and makes the page variable point to the response returned by requests. Print the page variable and make sure we get a status code of 200, then print its type to verify that page is the Response object returned by the requests call. Finally, print the text of page to see what the returned HTML looks like.

In [None]:
print(page)
print (type(page))
print(page.text)
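
The status check described above can also be done in code rather than by eye. The cell below is a minimal sketch and not part of the original walkthrough: `fetch_page` is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption. It relies on `raise_for_status()`, which raises an exception when the server answers with a 4xx or 5xx status code.

```python
import requests


def fetch_page(url, timeout=10):
    """Return the response for url, or raise an exception if the request fails."""
    # The timeout value is an assumption; it stops the call from waiting forever.
    response = requests.get(url, timeout=timeout)
    # Raises requests.exceptions.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response
```

Calling `fetch_page("https://www.ramapo.edu/asb/")` would then either return a successful response or stop the notebook with a clear error, instead of letting a failed request pass silently into the parsing steps.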

As you can see, the HTML that comes back is a jumble of code and hard to read. Next, let's work on the returned HTML by passing it to BeautifulSoup and storing the result in the variable soup.

In [None]:
soup = BeautifulSoup(page.content, "html.parser") # soupify the webpage

Once the page has been "soupified", we can begin to scrape the items we need from it. Remember the "row-courses" id? We will ask BeautifulSoup to look for the element with id="row-courses".

In [None]:
tags = soup.find(id="row-courses")  # find the first element whose id is "row-courses"
print(tags)
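
One thing worth knowing about .find() is that it returns None when nothing matches, which can happen if the site is redesigned. The sketch below illustrates this on a small made-up HTML string that only mimics the structure described earlier; it is not the real ASB page, and the degree values in it are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the described structure; not the real ASB page.
sample_html = """
<div id="row-courses">
  <h3>Accounting</h3><h4>B.S.</h4>
  <h3>Finance</h3><h4>B.S.</h4>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

found = sample_soup.find(id="row-courses")   # a Tag, because the id exists
missing = sample_soup.find(id="no-such-id")  # None, because nothing matches

print(type(found))
print(missing)

# Checking for None before going further avoids a confusing AttributeError later.
if missing is None:
    print("No element with that id; the page layout may have changed.")
```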

Executing the print(tags) statement outputs everything within the \<div\> tag. It is a bit hard to read, isn't it? We can use the .prettify() method to show easier-to-read output. Run the next print statement and observe the results; the returned HTML is much easier to follow.

In [None]:
print(tags.prettify())

Scroll through the results above and you will see that each academic program sits, just as we noted above, between \<h3\> tags. Now all we have to do is get the \<h3\> tags by using the .find() or .find_all() methods. The .find() method finds only the first \<h3\> tag.

In [None]:
academic_program =tags.find("h3") # finds the first tag
print(academic_program)
print (type(academic_program))

Using .find_all() will find all of the \<h3\> tags. The data type of the result is a BeautifulSoup 4 ResultSet.

In [None]:
academic_programs = tags.find_all("h3") # now use the tags found to find the <h3>. This returns a BS4 result set
print(academic_programs)
print (type(academic_programs))
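
As an aside, the two-step search (first the \<div\> by id, then the \<h3\> tags inside it) can also be written as a single CSS selector with BeautifulSoup's .select() method. This is an alternative to the approach used in this walkthrough, shown here as a sketch on made-up HTML rather than on the real page.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the last <h3> sits outside the target <div>.
sample_html = """
<div id="row-courses">
  <h3>Accounting</h3>
  <h3>Finance</h3>
</div>
<h3>Outside the div</h3>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# "#row-courses h3" means: every <h3> nested inside the element with id="row-courses".
selected = sample_soup.select("#row-courses h3")

for tag in selected:
    print(tag)
```

The \<h3\> outside the \<div\> is not selected, which matches what we get from calling .find_all("h3") on tags rather than on soup.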

Although academic_programs is a bs4.element.ResultSet, we can treat it like a list and display particular items by index.

In [None]:
print(academic_programs[0])
print(academic_programs[5])

And as with any list, we can iterate through it to display each \<h3\> tag on a separate line.

In [None]:
for tag in academic_programs:
    print(tag)

The only thing left is to print just the contents of each \<h3\> tag. We can do this by using the tag's .text attribute in the print statement.

In [None]:
print(academic_programs[0].text)
print(academic_programs[5].text)

In [None]:
for tag in academic_programs:
    print(tag.text)
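
Text pulled from a real page often carries stray spaces or line breaks around it. If that happens, the .get_text() method with strip=True is one way to clean it up. The sketch below uses a made-up HTML string with deliberately messy whitespace; whether the ASB page needs this is an assumption we have not verified.

```python
from bs4 import BeautifulSoup

# Made-up HTML with deliberately messy whitespace inside the tags.
sample_html = "<div><h3>\n   Accounting  </h3><h3> Finance\n</h3></div>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

for tag in sample_soup.find_all("h3"):
    raw = tag.text                     # keeps the surrounding whitespace
    clean = tag.get_text(strip=True)   # removes leading and trailing whitespace
    print(repr(raw), "->", repr(clean))

# The same cleanup written as a list comprehension.
names = [tag.get_text(strip=True) for tag in sample_soup.find_all("h3")]
print(names)
```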

### What Can We Do With The Scraped Data?

As you can see, we were able to scrape the academic programs directly from the ASB page. To keep them, we can create an empty list and append each tag's text to it as we iterate through the tags.



In [None]:
academics = []  # create an empty list
for tag in academic_programs:  # iterate through the tags
    academics.append(tag.text)  # append the tag's text to the list
print(type(academics))  # check that our list is indeed a list
print(academics)

We may optionally want to make our academic programs unchangeable (immutable) by converting the list to a tuple.

In [None]:
myTuple= tuple(academics) #convert the list to a tuple
print (type(myTuple)) # check to see the new tuple is indeed a tuple
print(myTuple)
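
To see what immutable means in practice, here is a short sketch using sample values rather than the scraped data. Changing an item in a list works, while the same assignment on a tuple raises a TypeError.

```python
programs_list = ["Accounting", "Finance"]  # sample values, not scraped
programs_tuple = tuple(programs_list)

programs_list[0] = "Economics"  # allowed: lists are mutable

try:
    programs_tuple[0] = "Economics"  # not allowed: tuples are immutable
except TypeError as err:
    print("Tuple rejected the change:", err)

print(programs_list)
print(programs_tuple)
```

Note that the tuple still holds the original values even though the list it was built from has since changed.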

Let's get all of the degree types as well. If you scroll back up through the output for tags, the degree types appear to be located in the \<h4\> tags. Let's get those too and append them to a list called degrees\[\].

In [None]:
degree_tags =tags.find_all("h4")
print(degree_tags)

In [None]:
degrees = []
for tag in degree_tags:  # iterate through the tags
    degrees.append(tag.text)  # append the tag's text to the list
print(type(degrees))  # check that our list is indeed a list
print(degrees)

Let's create a dictionary from the two lists, with academics\[\] supplying the keys and degrees\[\] supplying the values. Remember our old friend zip?

In [None]:
print("keys = academics[]","\n",academics,"\n")
print("values = degrees[]", "\n",degrees)

In [None]:
program_dict= dict(zip(academics,degrees))
print (type(program_dict))
print(program_dict)

In [None]:
for keys in program_dict:
    print(keys,":", program_dict[keys])
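
One caution when pairing scraped lists with zip: if the lists have different lengths, zip stops at the shorter one without any warning, so a program could quietly go missing from the dictionary. The sketch below shows this with sample values, along with a hypothetical helper, `safe_pairing`, that checks the lengths first.

```python
names = ["Accounting", "Finance", "Marketing"]  # sample values
kinds = ["B.S.", "B.S."]                        # one value short on purpose

paired = dict(zip(names, kinds))
print(paired)  # "Marketing" is silently dropped


def safe_pairing(keys, values):
    """Build a dictionary only if both lists have the same length."""
    if len(keys) != len(values):
        raise ValueError(f"length mismatch: {len(keys)} keys, {len(values)} values")
    return dict(zip(keys, values))


try:
    safe_pairing(names, kinds)
except ValueError as err:
    print("Refused to build the dictionary:", err)
```

Comparing len(academics) with len(degrees) before building program_dict is a cheap way to catch a page whose \<h3\> and \<h4\> tags do not line up one to one.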

We can use the scraped data in our programs to answer some questions about ASB.

In [None]:
print(f"The number of academic programs offered by the Anisfield School of Business is {len(myTuple)}")

We can also write the lists out to a CSV (Comma-Separated Values) file, which can be opened in Excel.

In [None]:
import csv
headers=["Academic Program", "Degrees"]

with open('asb2.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(zip(academics, degrees))

We can open the file using Python and read its contents.

In [None]:
with open("asb2.csv", 'r', newline='') as file:
    csvreader = csv.reader(file)
    for row in csvreader:
      print(row)
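
When a CSV file has a header row, csv.DictReader is an alternative that lets us refer to columns by name instead of by position. The sketch below writes a small sample file to a temporary folder and reads it back; the file name and the row values are made up for illustration, while the headers match the ones used above.

```python
import csv
import os
import tempfile

rows = [("Accounting", "B.S."), ("Finance", "B.S.")]  # sample values

# Write a small sample file in a temporary folder so nothing is overwritten.
path = os.path.join(tempfile.mkdtemp(), "sample_programs.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Academic Program", "Degrees"])
    writer.writerows(rows)

# DictReader uses the header row as the keys of each record.
with open(path, "r", newline="") as f:
    records = list(csv.DictReader(f))

for record in records:
    print(record["Academic Program"], "-", record["Degrees"])
```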