# Web Scraping Workshop with Python Demo

If you have dabbled even a little bit in the world of data science, you have likely heard the term *web scraping*. **Web scraping** is the process of using automation to collect large amounts of data from sources publicly available on the internet.

There are three steps required to successfully scrape a website:

1) **Retrieving the data**

    When you enter a URL into your browser and load a website, the contents and structure of the webpage are downloaded to your computer and displayed by your browser. This data is stored as *HTML*. To scrape a webpage, we must get access to this HTML so that we can analyze its contents.
 
2) **Parsing the data**

    Now that you have access to a website's HTML, you have to make sense of it. Oftentimes, a big part of this step is figuring out which components of the HTML you need, and which ones you don't. This is often the hardest and most tedious part of web scraping, but it is where the magic really happens.
    
3) **Using the data**

    Congratulations! Now that you have your data, you can do something cool with it. You can analyze this data, gain new insights and knowledge, or even build your very own data science app with it (with proper attribution, of course!).
    
In this workshop, you will learn the basics of creating your own web scraping script using the `BeautifulSoup` and `requests` packages for the Python programming language. You will learn how to perform all three steps of web scraping: retrieving the raw source HTML from a webpage, parsing that HTML to extract valuable data, and storing that data to develop new insights or work on your very own data science app. We will also discuss how web scraping is used in technologies and software common today, as well as the potential ethical implications of the practice.

This workshop is meant to be **introductory** and is open to all skill levels. No prior knowledge of web scraping or any of the Python packages mentioned is required.

## Getting Started

This workshop relies heavily on the `requests` and `beautifulsoup4` packages for Python. To install them, run the cell below:

In [1]:
!pip install requests
!pip install beautifulsoup4



Now that you have installed the `requests` and `beautifulsoup4` packages, import them below:

In [4]:
# Import the `requests` package
import requests

# Import `BeautifulSoup` from the `bs4` package
from bs4 import BeautifulSoup


As you may have noticed above, the `beautifulsoup4` package is imported under the shorter name `bs4`. From there, we can import the `BeautifulSoup` class.

**Congratulations!** You are now ready to start web scraping.

## Part 1: Retrieving the Data

Remember, when you enter a URL into your browser and load a website, the contents and structure of the webpage are downloaded to your computer and displayed by your browser.

This data is stored as *HTML*! To scrape a webpage, we must get access to this HTML so that we can analyze its contents.

In Python, we can use the `requests` package to download and store the HTML from a URL!

Which URL?

In [9]:
# URL for Professor Eskandarian's course explorer tool:
url = "https://www.cs.unc.edu/~saba/COMP_classes/spring2023/"


Now, we want to fetch the data from this URL! Let's use the `requests.get()` function:

In [13]:
# Get data from the URL
data = requests.get(url)
data

<Response [200]>

As you can see, we received a response! The number in the response is an HTTP status code. Have you ever seen a *404 Not Found* error? That is a status code as well. A code of 200 means that the request succeeded.
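
To make the status code idea concrete, here is a minimal sketch of checking the code before using a response. The helper name `fetch` and the 10-second timeout are assumptions made for illustration and are not part of the original notebook. `raise_for_status()` is the `requests` method that raises an error for 4xx and 5xx codes, and `http.HTTPStatus` comes from the Python standard library.

```python
from http import HTTPStatus

import requests


def fetch(url: str, timeout: float = 10.0) -> requests.Response:
    """Hypothetical helper: download `url` and fail loudly on an error status."""
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError when the status code is 4xx or 5xx
    response.raise_for_status()
    return response


# The standard library can translate numeric codes into readable names
for code in (200, 404, 500):
    print(code, HTTPStatus(code).phrase)
```

Calling `fetch(url)` in place of `requests.get(url)` would stop the script early if the page could not be retrieved, instead of silently parsing an error page.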

Now, let's find that HTML:

In [16]:
# Get the response body as text (the HTML!)
html = data.text
html

'<html>\n<head>\n    <title>\n        Class Information\n    </title>\n\t<style>\n\t\tbody{\n            font-family: Arial, Helvetica, sans-serif;\n            padding: 30px;\n\t\t}\n\t\t#main{\n\t\t\t\t//background-color: #fffafa;\n            margin: auto;\n\t\t}\n        ul{\n        list-style-type:none;\n        }\n        table{\n            border: none;\n        }\n        th{\n            font-weight: bold;\n            padding-bottom: 1em;\n            text-align: left;\n            padding-right: 2em;\n\n        }\n        td{\n            padding-bottom: 1em;\n            padding-right: 2em;\n        }\n\t\th1{\n            font-weight: normal;\n\t\t}\n        h2{\n            font-weight: normal;\n\t\t}\n        h3{\n            font-weight: normal;\n            margin-bottom: 0px;\n            margin-top: 0px;\n\t\t}\n\t\ta{\n            color: #337ab7;\n            outline:none;\n\t\t}\n\t\ta:hover {\n            color: hotpink;\n        }\n        .expandable {\n      

Wow, that is quite long! Congratulations, you have now completed step 1 -- retrieving the data! But now, how do we make sense of all that gibberish?

## Part 2: Parsing the Data

Now that we have the HTML data, we can unleash the full power of `BeautifulSoup` to parse it.

In [18]:
# Parse the HTML we retrieved with Python's built-in `html.parser`, and save the result to `soup`.
soup = BeautifulSoup(html, 'html.parser')


In [21]:
# Now, let's print out the "prettified" version of that HTML!
print(soup.prettify())

<html>
 <head>
  <title>
   Class Information
  </title>
  <style>
   body{
            font-family: Arial, Helvetica, sans-serif;
            padding: 30px;
		}
		#main{
				//background-color: #fffafa;
            margin: auto;
		}
        ul{
        list-style-type:none;
        }
        table{
            border: none;
        }
        th{
            font-weight: bold;
            padding-bottom: 1em;
            text-align: left;
            padding-right: 2em;

        }
        td{
            padding-bottom: 1em;
            padding-right: 2em;
        }
		h1{
            font-weight: normal;
		}
        h2{
            font-weight: normal;
		}
        h3{
            font-weight: normal;
            margin-bottom: 0px;
            margin-top: 0px;
		}
		a{
            color: #337ab7;
            outline:none;
		}
		a:hover {
            color: hotpink;
        }
        .expandable {
            display: none;
        }
        tr:nth-child(4n+2) {
            background-c

`BeautifulSoup` does a lot more than just printing HTML in a nice way. We can now also find specific elements on the webpage!
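
As a small illustration of what the parsed object offers, the sketch below parses a tiny hand-written page and reads a few values from it. The HTML string and the names `sample_html`, `sample_soup`, `title_tag`, and `heading` are made up for this example and are not taken from the course page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, used only for illustration
sample_html = """
<html>
  <head><title>Class Information</title></head>
  <body><h1 id="main">Spring Courses</h1></body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Attribute-style access returns the first tag with that name
title_tag = sample_soup.title
print(title_tag.name)                  # the tag's name
print(title_tag.get_text(strip=True))  # the text inside the tag

# Square brackets read an HTML attribute from a tag
heading = sample_soup.h1
print(heading["id"])
```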

Let's find all of the data for the COMP classes. How? Let's take a look at the HTML.

You will find that all of the data is located in the `<table>` element! So, let's try and find that element.

In [26]:
# Find all table elements on the page
tables = soup.find_all("table")

print("Number of tables: " + str(len(tables)))
print("Tables:")
print(tables)

Number of tables: 1
Tables:
[<table>
<tr>
<th>Class Number</th>
<th>Class</th>
<th>Meeting Time</th>
<th>Instructor</th>
<th>Room</th>
<th>Unreserved Enrollment</th>
<th>Reserved Enrollment</th>
<th>Wait List</th>
</tr>
<tr><td>4290</td><td>COMP  110 - 001   Introduction to Programming and Data Science</td><td>TuTh 8:00AM - 9:15AM</td><td>Alyssa Byrnes</td><td>Hamilton Hall - Rm 0100</td><td style="color:orange">217/219</td><td style="color:red">81/81</td><td>0/30</td></tr>
<tr class="expandable"><td colspan="7"><strong>Description: </strong>Prerequisite, A C or better in one of the following courses: MATH 130, 152, 210, 231, 129P, or PHIL 155, or STOR 120, 151, 155. Introduces students to programming and data science from a computational perspective. With an emphasis on modern applications in society, students gain experience with problem decomposition, algorithms for data analysis, abstraction design, and ethics in computing. No prior programming experience expected. Foundational con

Great! `soup.find_all()` returned a **list** of all of the table elements on our page. In this case, there was only one! So, let's select that table.
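
Because `find_all()` always gives back a list, it helps to know its single-result sibling, `find()`. The sketch below contrasts the two on a made-up HTML snippet; `sample_html` and the other variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with one table and no <ul> element
sample_html = "<div><table><tr><td>A</td></tr></table><p>No lists here</p></div>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

all_tables = sample_soup.find_all("table")  # list of matches (may be empty)
first_table = sample_soup.find("table")     # first match, or None
missing = sample_soup.find("ul")            # nothing matches, so None

print(len(all_tables))
print(first_table == all_tables[0])
print(missing is None)
print(len(sample_soup.find_all("ul")))
```

Checking for an empty list (or for `None`) before indexing is a cheap way to get a clear error message if a page's layout ever changes.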

In [27]:
# Select the table
table = tables[0]
table


<table>
<tr>
<th>Class Number</th>
<th>Class</th>
<th>Meeting Time</th>
<th>Instructor</th>
<th>Room</th>
<th>Unreserved Enrollment</th>
<th>Reserved Enrollment</th>
<th>Wait List</th>
</tr>
<tr><td>4290</td><td>COMP  110 - 001   Introduction to Programming and Data Science</td><td>TuTh 8:00AM - 9:15AM</td><td>Alyssa Byrnes</td><td>Hamilton Hall - Rm 0100</td><td style="color:orange">217/219</td><td style="color:red">81/81</td><td>0/30</td></tr>
<tr class="expandable"><td colspan="7"><strong>Description: </strong>Prerequisite, A C or better in one of the following courses: MATH 130, 152, 210, 231, 129P, or PHIL 155, or STOR 120, 151, 155. Introduces students to programming and data science from a computational perspective. With an emphasis on modern applications in society, students gain experience with problem decomposition, algorithms for data analysis, abstraction design, and ethics in computing. No prior programming experience expected. Foundational concepts include data types, seq

Just like how we used `find_all()` on the `soup` object, we can also use it on individual elements! Let's try to find all of the *rows*, represented by `<tr>` elements, in the table:

In [32]:
# Find all rows in the table
rows = table.find_all("tr")

# Print the first five rows
for row in rows[:5]:
    print(row)
    print("\n")

<tr>
<th>Class Number</th>
<th>Class</th>
<th>Meeting Time</th>
<th>Instructor</th>
<th>Room</th>
<th>Unreserved Enrollment</th>
<th>Reserved Enrollment</th>
<th>Wait List</th>
</tr>


<tr><td>4290</td><td>COMP  110 - 001   Introduction to Programming and Data Science</td><td>TuTh 8:00AM - 9:15AM</td><td>Alyssa Byrnes</td><td>Hamilton Hall - Rm 0100</td><td style="color:orange">217/219</td><td style="color:red">81/81</td><td>0/30</td></tr>


<tr class="expandable"><td colspan="7"><strong>Description: </strong>Prerequisite, A C or better in one of the following courses: MATH 130, 152, 210, 231, 129P, or PHIL 155, or STOR 120, 151, 155. Introduces students to programming and data science from a computational perspective. With an emphasis on modern applications in society, students gain experience with problem decomposition, algorithms for data analysis, abstraction design, and ethics in computing. No prior programming experience expected. Foundational concepts include data types, sequenc

There are certainly a lot of rows in this table! Notice that the first row contains only column headers, which we do not need (for now). Also, every other row is taken up by a course description, so every course is represented by **two rows**. To keep things simple for now, we will not save course descriptions. Therefore, when scraping data on courses, we need to look at **every other row!**
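
The row layout described above can be sketched with a plain Python list standing in for the `<tr>` elements. The labels in `rows_demo` are placeholders invented for this example; only the pattern (one header row, then course and description rows alternating) mirrors the table.

```python
# Stand-in labels for the <tr> elements: one header row, then
# (course row, description row) pairs
rows_demo = ["header", "course A", "about A", "course B", "about B"]

# Start at index 1 (skipping the header) and take every second item
course_rows = rows_demo[1::2]

# Start at index 2 to get the matching description rows instead
description_rows = rows_demo[2::2]

print(course_rows)
print(description_rows)

# zip() pairs each course with its description, should both be needed later
for course, about in zip(course_rows, description_rows):
    print(course, "->", about)
```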

Let's take a closer look at one of the rows.

In [34]:
# Print out prettified HTML for row 2 (index 1)
print(rows[1].prettify())

<tr>
 <td>
  4290
 </td>
 <td>
  COMP  110 - 001   Introduction to Programming and Data Science
 </td>
 <td>
  TuTh 8:00AM - 9:15AM
 </td>
 <td>
  Alyssa Byrnes
 </td>
 <td>
  Hamilton Hall - Rm 0100
 </td>
 <td style="color:orange">
  217/219
 </td>
 <td style="color:red">
  81/81
 </td>
 <td>
  0/30
 </td>
</tr>



You will notice that table rows (`<tr>` elements) contain `<td>` elements, each of which represents one cell of data in the table!
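
Two of the cells in the output above also carry a `style` attribute. As a hedged sketch, attributes like this can be read with a tag's `.get()` method. The shortened row in `row_html` is copied from three cells of the output above, and the variable names are assumptions. What the colours mean on the page is not stated in this workshop, so the example only reads them.

```python
from bs4 import BeautifulSoup

# Three cells taken from the row printed above
row_html = (
    "<tr><td>4290</td>"
    '<td style="color:orange">217/219</td>'
    '<td style="color:red">81/81</td></tr>'
)
row = BeautifulSoup(row_html, "html.parser").tr

# .get() returns None when the attribute is absent,
# whereas td["style"] would raise a KeyError for the first cell
styles = [td.get("style") for td in row.find_all("td")]

for td, style in zip(row.find_all("td"), styles):
    print(td.get_text(), style)
```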

Let's now find all of the `<td>` elements inside of this row.

In [41]:
# Find all <td> elements in the row
tds = rows[1].find_all("td")

# Iterate over all <td> elements and print them out
for td in tds:
    print(td)

<td>4290</td>
<td>COMP  110 - 001   Introduction to Programming and Data Science</td>
<td>TuTh 8:00AM - 9:15AM</td>
<td>Alyssa Byrnes</td>
<td>Hamilton Hall - Rm 0100</td>
<td style="color:orange">217/219</td>
<td style="color:red">81/81</td>
<td>0/30</td>


We are almost there! We still have HTML, but one last step will print out nice clean text from our HTML. Let's modify the for loop to just grab the `text` of each of these elements:

In [42]:
# Iterate over all <td> elements and print out their texts
for td in tds:
    print(td.text)

4290
COMP  110 - 001   Introduction to Programming and Data Science
TuTh 8:00AM - 9:15AM
Alyssa Byrnes
Hamilton Hall - Rm 0100
217/219
81/81
0/30
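
The cells on this page happen to contain tidy text, but cells often include stray whitespace or nested tags. A minimal sketch of handling that with `get_text()`, which accepts a separator and a `strip` flag, is shown below. Both sample cells are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical cell padded with newlines and spaces
padded = BeautifulSoup("<td>\n  COMP 110 - 001\n</td>", "html.parser").td
raw_text = padded.text
clean_text = padded.get_text(strip=True)
print(repr(raw_text))
print(repr(clean_text))

# Hypothetical cell with a nested tag, similar in shape to the description rows
nested = BeautifulSoup(
    "<td><strong>Description: </strong>Intro course</td>", "html.parser"
).td
joined_text = nested.get_text(" ", strip=True)
print(repr(joined_text))
```

With `strip=True`, each piece of text is trimmed before the pieces are joined, so passing a separator such as `" "` keeps words from running together.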


Amazing! We have just cleaned up the data for the first row of the table. We can extend this exact logic now for **every row** of the table!

In [47]:
# Iterate over all rows...
for row in rows:
    
    # Find all <td> elements in the row
    tds = row.find_all("td")

    # Iterate over all <td> elements and print out their texts
    for td in tds:
        print(td.text)
        
    # Add line break between data for each row
    print("\n")



4290
COMP  110 - 001   Introduction to Programming and Data Science
TuTh 8:00AM - 9:15AM
Alyssa Byrnes
Hamilton Hall - Rm 0100
217/219
81/81
0/30


Description: Prerequisite, A C or better in one of the following courses: MATH 130, 152, 210, 231, 129P, or PHIL 155, or STOR 120, 151, 155. Introduces students to programming and data science from a computational perspective. With an emphasis on modern applications in society, students gain experience with problem decomposition, algorithms for data analysis, abstraction design, and ethics in computing. No prior programming experience expected. Foundational concepts include data types, sequences, boolean logic, control flow, functions/methods, recursion, classes/objects, input/output, data organization, transformations, and visualizations. Students may not enroll in COMP 110 after receiving credit for COMP 210 or greater. 3 units.


10266
COMP  110 - 002   Introduction to Programming and Data Science
TuTh 2:00PM - 3:15PM
Alyssa Byrnes
Ham

Remember, we wanted *every other row*, so let's do that.

In [51]:
# Let's use some handy Python list notation:
# We can create a subset of list `a` with `a[start_index:end_index:step]`
# So, if we want every other row of `rows`, we can say `rows[1::2]`
#     Note: Remember, we start at index 1 because row 0 is our header row!
#     Note: Leaving end index blank implies we are going until the list ends!

# Iterate over every other row...
for row in rows[1::2]:
    
    # Find all <td> elements in the row
    tds = row.find_all("td")

    # Iterate over all <td> elements and print out their texts
    for td in tds:
        print(td.text)
        
    # Add line break between data for each row
    print("\n")

4290
COMP  110 - 001   Introduction to Programming and Data Science
TuTh 8:00AM - 9:15AM
Alyssa Byrnes
Hamilton Hall - Rm 0100
217/219
81/81
0/30


10266
COMP  110 - 002   Introduction to Programming and Data Science
TuTh 2:00PM - 3:15PM
Alyssa Byrnes
Hamilton Hall - Rm 0100
90/90
210/210
0/30


9610
COMP  116 - 001   Introduction to Scientific Programming
TuTh 8:00AM - 9:15AM
John Majikes
Hanes Art Center - Rm 0121
153/240
0/0
0/24


5130
COMP  126 - 001   Practical Web Design and Development for Everyone
TuTh 11:00AM - 12:15PM
Tessa Joseph-Nicholas
Sitterson - Rm 0014
101/105
0/0
0/11


13454
COMP  210 - 001   Data Structures and Analysis
MoWe 2:05PM - 3:20PM
Paul Stotts
Hanes Art Center - Rm 0121
235/240
0/0
0/21


9609
COMP  210 - 002   Data Structures and Analysis
TuTh 3:30PM - 4:45PM
Muhammad Ghani
Coker - Rm 0201
191/195
0/0
0/20


9611
COMP  211 - 001   Systems Fundamentals
TuTh 11:00AM - 12:15PM
Brent Munsell
Murray Hall - Rm G202
175/183
26/26
0/11


14454
COMP  211 - 002   S
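
Slicing works here because the rows strictly alternate. As an alternative sketch (not the approach used in this workshop), the description rows could instead be filtered out by their `class="expandable"` attribute, which appears in the HTML printed earlier. The small table in `table_html` is a cut-down, partly invented stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the course table (description text is invented)
table_html = """
<table>
<tr><th>Class Number</th><th>Class</th></tr>
<tr><td>4290</td><td>COMP 110 - 001</td></tr>
<tr class="expandable"><td colspan="2">Description of section 001</td></tr>
<tr><td>10266</td><td>COMP 110 - 002</td></tr>
<tr class="expandable"><td colspan="2">Description of section 002</td></tr>
</table>
"""
sample_rows = BeautifulSoup(table_html, "html.parser").find_all("tr")

course_rows = [
    row
    for row in sample_rows
    if row.find("th") is None                      # skip the header row
    and "expandable" not in row.get("class", [])   # skip description rows
]

for row in course_rows:
    print([td.get_text(strip=True) for td in row.find_all("td")])
```

Filtering on the attribute keeps working even if a course were ever listed without a description row, at the cost of depending on the page's class names.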

Almost done! Let's use Python dictionaries (key-value pairs) to associate a **key** (column headings!) with each of the rows' data, then add each of these dictionaries to a final list for all of our data!

In [61]:
# Determine column headers
column_headers = ["Class Number", "Class", "Meeting Time", "Instructor", "Room", "Unreserved Enrollment", "Reserved Enrollment", "Wait List"]

# Create list to store final data
final_data_list = []

# Iterate over every other row...
for row in rows[1::2]:
    
    row_data = {}
    
    # Find all <td> elements in the row
    tds = row.find_all("td")

    # Iterate over all <td> elements and store their text under the matching header
    for index, td in enumerate(tds):
        # Get correct column header for the data
        header = column_headers[index]
        # Store the text in the dictionary, removing non-breaking spaces (\xa0)
        row_data[header] = td.text.replace("\xa0", "")
    
    # Add data to final list    
    final_data_list.append(row_data)

final_data_list

[{'Class Number': '4290',
  'Class': 'COMP 110 - 001 Introduction to Programming and Data Science',
  'Meeting Time': 'TuTh 8:00AM - 9:15AM',
  'Instructor': 'Alyssa Byrnes',
  'Room': 'Hamilton Hall - Rm 0100',
  'Unreserved Enrollment': '217/219',
  'Reserved Enrollment': '81/81',
  'Wait List': '0/30'},
 {'Class Number': '10266',
  'Class': 'COMP 110 - 002 Introduction to Programming and Data Science',
  'Meeting Time': 'TuTh 2:00PM - 3:15PM',
  'Instructor': 'Alyssa Byrnes',
  'Room': 'Hamilton Hall - Rm 0100',
  'Unreserved Enrollment': '90/90',
  'Reserved Enrollment': '210/210',
  'Wait List': '0/30'},
 {'Class Number': '9610',
  'Class': 'COMP 116 - 001 Introduction to Scientific Programming',
  'Meeting Time': 'TuTh 8:00AM - 9:15AM',
  'Instructor': 'John Majikes',
  'Room': 'Hanes Art Center - Rm 0121',
  'Unreserved Enrollment': '153/240',
  'Reserved Enrollment': '0/0',
  'Wait List': '0/24'},
 {'Class Number': '5130',
  'Class': 'COMP 126 - 001 Practical Web Design and Dev

We have now extracted all of the data we need! Congratulations!
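
The column headers above were typed in by hand. As a hedged variation, they could be read from the table's own `<th>` elements and paired with each row's values using `zip()`. The table in `table_html` is a shortened stand-in written for this example, and the `&nbsp;` in it is an assumption included to show what the `replace("\xa0", "")` call removes.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the course table (three columns only)
table_html = """
<table>
<tr><th>Class Number</th><th>Class</th><th>Wait List</th></tr>
<tr><td>4290</td><td>COMP&nbsp; 110 - 001</td><td>0/30</td></tr>
<tr class="expandable"><td colspan="3">Description of the course</td></tr>
</table>
"""
sample_table = BeautifulSoup(table_html, "html.parser").find("table")
sample_rows = sample_table.find_all("tr")

# Read the headers from the first row instead of hard-coding them
headers = [th.get_text(strip=True) for th in sample_rows[0].find_all("th")]

records = []
for row in sample_rows[1::2]:
    values = [
        td.get_text(strip=True).replace("\xa0", "")  # drop non-breaking spaces
        for td in row.find_all("td")
    ]
    # zip() pairs each header with the value in the same position
    records.append(dict(zip(headers, values)))

print(records)
```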

## Part 3: Using the Data

Now that we have all of this course data, we can use it for something cool!

Here is an example of the course data in a pandas DataFrame:

In [64]:
import pandas as pd

# Build a DataFrame from the list of dictionaries (one row per course)
df = pd.DataFrame(final_data_list)

df.head()

|   | Class Number | Class | Meeting Time | Instructor | Room | Unreserved Enrollment | Reserved Enrollment | Wait List |
|---|---|---|---|---|---|---|---|---|
| 0 | 4290 | COMP 110 - 001 Introduction to Programming and... | TuTh 8:00AM - 9:15AM | Alyssa Byrnes | Hamilton Hall - Rm 0100 | 217/219 | 81/81 | 0/30 |
| 1 | 10266 | COMP 110 - 002 Introduction to Programming and... | TuTh 2:00PM - 3:15PM | Alyssa Byrnes | Hamilton Hall - Rm 0100 | 90/90 | 210/210 | 0/30 |
| 2 | 9610 | COMP 116 - 001 Introduction to Scientific Prog... | TuTh 8:00AM - 9:15AM | John Majikes | Hanes Art Center - Rm 0121 | 153/240 | 0/0 | 0/24 |
| 3 | 5130 | COMP 126 - 001 Practical Web Design and Develo... | TuTh 11:00AM - 12:15PM | Tessa Joseph-Nicholas | Sitterson - Rm 0014 | 101/105 | 0/0 | 0/11 |
| 4 | 13454 | COMP 210 - 001 Data Structures and Analysis | MoWe 2:05PM - 3:20PM | Paul Stotts | Hanes Art Center - Rm 0121 | 235/240 | 0/0 | 0/21 |
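
Storing the scraped data is one of the goals of this step. One possible sketch, using only the standard library's `csv` module, writes a list of dictionaries out as CSV text. The two records below reuse values from the output earlier in this workshop, trimmed to three columns. The in-memory buffer is a choice made so that the example creates no files.

```python
import csv
import io

# Two records trimmed to three columns for brevity
records = [
    {"Class Number": "4290", "Instructor": "Alyssa Byrnes", "Wait List": "0/30"},
    {"Class Number": "9610", "Instructor": "John Majikes", "Wait List": "0/24"},
]

# In-memory buffer; swap for open("courses.csv", "w", newline="") to write a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
writer.writeheader()       # first line: the column names
writer.writerows(records)  # one line per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```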


We can now do anything we want with this data! We could:

- Analyze the data
- Gain new insights from the data
- Create our very own data science app!
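
As one small example of analysis, the enrollment strings can be split into numbers. This sketch assumes that a value such as `217/219` means "seats taken / total seats"; the workshop does not state this, so treat it as a guess. The helper name `parse_enrollment` is hypothetical, and the three records reuse values from the output above.

```python
# Three records reusing values shown earlier (two columns only)
records = [
    {"Class": "COMP 110 - 001", "Unreserved Enrollment": "217/219"},
    {"Class": "COMP 116 - 001", "Unreserved Enrollment": "153/240"},
    {"Class": "COMP 126 - 001", "Unreserved Enrollment": "101/105"},
]


def parse_enrollment(text):
    """Hypothetical helper: split 'a/b' into two integers (a, b)."""
    first, second = text.split("/")
    return int(first), int(second)


for record in records:
    # Assumption: the first number is seats taken, the second is capacity
    enrolled, capacity = parse_enrollment(record["Unreserved Enrollment"])
    # Guard against values like "0/0" to avoid dividing by zero
    fill = enrolled / capacity if capacity else 0.0
    print(f"{record['Class']}: {enrolled}/{capacity} seats ({fill:.0%} full)")
```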


## Closing Thoughts

Remember, it is extremely important to properly attribute data that you scrape from the internet.

**Thank you Professor Saba Eskandarian for creating the UNC classes tool!**