# Web Scraping Workshop with Python Demo

There are three steps required to successfully scrape a website:

1) **Retrieving the data**

    When you enter a URL into your browser and load a website on your computer, the contents and structure of the webpage are downloaded to your computer and displayed by your browser. This data is stored in an *HTML* file. To successfully scrape a webpage, we must get access to this HTML data so that we can further analyze its contents.
 
2) **Parsing the data**

    Now that you have access to a website's HTML data, you have to make sense of it. Oftentimes, a big part of this step is figuring out which components of the HTML you need -- and which ones you don't. This is often the hardest and most tedious part of web scraping, but this is where the magic really happens.
    
3) **Using the data**

    Congratulations! Now that you have your data, you can do something cool with it. You can analyze this data, gain new insights and knowledge, or even build your very own data science app with it (with proper attribution, of course!).
    
Here, you will learn the basics of creating your own web scraping script using the `BeautifulSoup` and `requests` packages for the Python programming language. You will learn how to perform all three tasks of web scraping: retrieving the raw source HTML from a webpage, parsing that data to extract the parts you need, and using that data to develop new insights or work on your very own data science app. We will also discuss how web scraping is used in technologies and software that are common today, as well as the potential ethical implications of the practice.


## Getting Started

This workshop relies heavily on the `requests` and `beautifulsoup4` packages for Python. To install them, run the cell below:

In [None]:
!pip install requests
!pip install beautifulsoup4

Now that you have installed the `requests` and `beautifulsoup4` packages, import them below:

In [None]:
# Import the `requests` package
import requests

# Import `BeautifulSoup` from the `bs4` package
from bs4 import BeautifulSoup


As you may have noticed above, the `beautifulsoup4` package is imported under the shorter module name `bs4`. From there, we can import the `BeautifulSoup` class.

**Congratulations!** You are now ready to start web scraping.

## Part 1: Retrieving the Data

Remember, when you enter a URL into your browser and load a website on your computer, the contents and structure of the webpage are downloaded to your computer and displayed by your browser.

This data is stored in an *HTML* file! To successfully web scrape a webpage, we must get access to this HTML data so that we can further analyze its contents.

In Python, we can use the `requests` package to download and store the HTML from a URL!

Which URL?

In [None]:
# URL for Professor Eskandarian's course explorer tool:
url = "https://crypto.unc.edu/UNC_classes/fall2023/"


Now, we want to fetch the data from this URL! Let's use the `requests.get()` function:

In [None]:
# Get data from the URL



As you can see, we received a response with a status code! You may have heard of the *404 Not Found* error; that is a response status as well. A status code of 200 means the request succeeded.
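To make the meaning of these numbers concrete, here is a minimal sketch using only Python's standard library. With `requests`, the number itself is available as the integer attribute `response.status_code`; the helper name `describe_status` is hypothetical and exists only for this illustration.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable label for an HTTP status code.

    Hypothetical helper for illustration; not part of `requests`.
    """
    status = HTTPStatus(code)  # look up the standard name for this code
    kind = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase} ({kind})"


print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (not a success)
```

Codes in the 200 range indicate success, while codes in the 400 and 500 ranges indicate client and server errors respectively.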

Now, let's find that HTML:

In [None]:
# Find data text (the HTML!) 



Wow, that is quite long! Congratulations, you have now completed step 1 -- retrieving the data! But now, how do we make sense of all that gibberish?
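Step 1 as a whole can be wrapped into a small function. The sketch below is one possible way to do it, not the only one: the function name `fetch_html` and the 10-second timeout are assumptions made for this example. It uses `requests.get()`, `raise_for_status()`, and the `text` attribute of the response.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML as a string.

    Hypothetical helper; the name and default timeout are illustrative.
    """
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of silently continuing
    response.raise_for_status()
    return response.text


# Example usage (makes a real network request, so it is left commented out):
# html = fetch_html("https://crypto.unc.edu/UNC_classes/fall2023/")
# print(len(html))
```

Calling `raise_for_status()` is a design choice: it turns an error response such as 404 into an exception, so a failed download is noticed right away rather than during parsing.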

## Part 2: Parsing the Data

Now that we have the HTML data, we can unleash the full power of `BeautifulSoup` to parse it.

In [None]:
# Parse the HTML that we extracted using the `html.parser` parser, and save the output to `soup`.



In [None]:
# Now, let's print out the "prettified" version of that HTML!


`BeautifulSoup` does a lot more than just printing HTML in a nice way. We can now also find specific elements on the webpage!

Let's find all of the data for the COMP classes. How? Let's take a look at the HTML.

You will find that all of the data is located in the `<table>` element! So, let's try and find that element.

In [None]:
# Find all table elements on the page


Great! `soup.find_all()` returned a **list** of all of the table elements on our page. In this case, there was only one! So, let's select that table.
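The same pattern can be tried on a tiny, made-up HTML snippet that needs no network access. The HTML string below is invented for illustration and only loosely mimics the structure of the course page (a header row, then a course row followed by a description row).

```python
from bs4 import BeautifulSoup

# Toy HTML standing in for the real page (made up for illustration)
html = """
<html><body>
  <table>
    <tr><th>Class</th><th>Room</th></tr>
    <tr><td>COMP 110</td><td>Room A</td></tr>
    <tr><td colspan="2">Intro course description</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table")  # a list, even when there is only one match
table = tables[0]                # select the only table
rows = table.find_all("tr")      # find_all also works on a single element

print(len(tables), len(rows))
```

Because `find_all()` returns a list even for a single match, indexing with `[0]` is needed before searching inside the table.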

In [None]:
# Select the table


Just as we used `find_all()` on the `soup` object, we can also use it on individual elements! Let's try to find all of the *rows*, represented by `<tr>` elements, in the table:

In [None]:
# Find all rows in the table



There are certainly a lot of rows in this table! Notice that the first row contains only column headers, which we do not need (for now). Also, every other row is taken up by a course description, so every course is represented by **two rows**. To keep things simple for now, we will not save course descriptions. Therefore, when scraping data on courses, we need to look at **every other row!**

Let's take a look at one of the rows further.

In [None]:
# Print out prettified HTML for row 2 (index 1)


You will notice that HTML rows (`<tr>` elements) contain `<td>` elements, which represent the data cells within the table!

Let's now find all of the `<td>` elements inside of this row.

In [None]:
# Find all <td> elements in the row

# Iterate over all <td> elements and print them out


We are almost there! We still have HTML, but one last step will print out nice clean text from our HTML. Let's modify the for loop to just grab the `text` of each of these elements:

In [None]:
# Iterate over all <td> elements and print out their texts


Amazing! We have just cleaned up the data for the first row of the table. We can extend this exact logic now for **every row** of the table!
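As a rough sketch of that extension, the loop below runs over every row of a small made-up table and collects the clean text of each `<td>` cell. The HTML and the names `cleaned_rows` and `texts` are invented for this example; `get_text(strip=True)` is used here instead of the plain `text` attribute so that surrounding whitespace is removed as well.

```python
from bs4 import BeautifulSoup

# Toy table (made up for illustration)
html = """
<table>
  <tr><th>Class</th><th>Instructor</th></tr>
  <tr><td> COMP 110 </td><td>Instructor A</td></tr>
  <tr><td>COMP 210</td><td>Instructor B</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

cleaned_rows = []
for row in soup.find_all("tr"):
    # Find all <td> elements in the row
    cells = row.find_all("td")
    # Keep only the text of each cell, without surrounding whitespace
    texts = [cell.get_text(strip=True) for cell in cells]
    cleaned_rows.append(texts)

print(cleaned_rows)
```

Note that the header row uses `<th>` elements rather than `<td>`, so it produces an empty list in this toy table.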

In [None]:
# Iterate over all rows...
    
    # Find all <td> elements in the row

    # Iterate over all <td> elements and print out their texts
        
    # Add line break between data for each row



Remember, we wanted *every other row*, so let's do that.
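Slicing with a step is plain Python and can be tried on any list. In this small sketch, the strings stand in for the real `<tr>` elements and are made up for illustration.

```python
# Stand-ins for table rows: one header row, then course/description pairs
rows = ["header", "course 1", "description 1", "course 2", "description 2"]

# Start at index 1 (skip the header), go to the end, take every 2nd item
course_rows = rows[1::2]
print(course_rows)  # ['course 1', 'course 2']

# Starting at index 2 instead would select the description rows
description_rows = rows[2::2]
print(description_rows)  # ['description 1', 'description 2']
```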

In [None]:
# Let's use some handy Python list notation:
# We can create a subset of list `a` with `a[start_index:end_index:step]`
# So, if we want every other row of `rows`, we can say `rows[1::2]`
#     Note: Remember, we start at index 1 because row 0 is our header row!
#     Note: Leaving end index blank implies we are going until the list ends!

# Iterate over every other row...
    
    # Find all <td> elements in the row

    # Iterate over all <td> elements and print out their texts
        
    # Add line break between data for each row


Almost done! Let's use Python dictionaries (key-value pairs) to associate a **key** (column headings!) with each of the rows' data, then add each of these dictionaries to a final list for all of our data!
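One possible way to pair headers with values is Python's built-in `zip()`, shown here on made-up data. This is a swapped-in alternative to looking up each header by index, and the shortened header list and the names `scraped_rows` and `all_courses` are assumptions for this sketch.

```python
# Shortened, illustrative headers (the real table has more columns)
column_headers = ["Class", "Instructor", "Room"]

# Made-up cell texts, as they might look after cleaning each row
scraped_rows = [
    ["COMP 110", "Instructor A", "Room 1"],
    ["COMP 210", "Instructor B", "Room 2"],
]

all_courses = []
for texts in scraped_rows:
    # zip pairs each header with the cell text at the same position
    course = dict(zip(column_headers, texts))
    all_courses.append(course)

print(all_courses[0])
```

Each dictionary then maps a column heading to that row's value, and `all_courses` holds one dictionary per course.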

In [None]:
# Determine column headers
column_headers = ["Class Number", "Class", "Meeting Time", "Instructor", "Room", "Unreserved Enrollment", "Reserved Enrollment", "Wait List"]

# Create list to store final data

# Iterate over every other row...
    
    # Create an empty dictionary for this row's data

    # Find all <td> elements in the row

    # Iterate over all <td> elements and get their texts

        # Get correct column header for the data

        # Store the data in the dictionary

        
    # Add data to final list    



We have now extracted all of the data we need! Congratulations!

## Part 3: Using the Data

Now that we have all of this course data, we can use it for something cool!

Here is an example of the course data in a pandas DataFrame:

In [None]:
import pandas as pd

df = pd.DataFrame.from_dict(""" YOUR FINAL LIST HERE! """)

df.head()
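For a feel of what a list of dictionaries looks like once loaded into pandas, here is a small sketch on made-up data. The course entries are invented, and `pd.DataFrame(...)` is used directly on the list; `value_counts()` is one convenient way to count how often each value appears in a column.

```python
import pandas as pd

# Made-up course data in the same shape as the scraped list of dictionaries
courses = [
    {"Class": "COMP 110", "Instructor": "Instructor A", "Room": "Room 1"},
    {"Class": "COMP 210", "Instructor": "Instructor B", "Room": "Room 1"},
    {"Class": "COMP 301", "Instructor": "Instructor A", "Room": "Room 2"},
]

# Each dictionary becomes a row; the keys become the column names
df = pd.DataFrame(courses)

# Count how many times each room appears
room_counts = df["Room"].value_counts()
print(room_counts)
```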

### ANALYSIS

Answer the following questions:


In [None]:
# 1 WHICH ROOM IS USED THE MOST, AND WHICH IS USED THE LEAST, DURING THE TERM?

In [None]:
# 2 WHICH INSTRUCTOR TEACHES THE MOST CLASSES, AND HOW MANY?

In [None]:
# 3 WHAT IS THE AVERAGE FRACTION OF SEATS FILLED FOR Unreserved Enrollment (MEAN OF SEATS FILLED / TOTAL SEATS)?

In [None]:
# 4 THE MAJORITY OF LESSONS ARE 1 H 15 MIN LONG. HOW MANY?
# ARE THERE ANY LESSONS THAT ARE LONGER OR SHORTER? HOW MANY?

### EXTRA

Below is a list of all the available terms. Create a for loop that extracts the table for every term. Put each extracted table into its own DataFrame, and create a dictionary where the key is the term and the value is its corresponding DataFrame.

In [None]:
terms = [   "spring2025",
            "fall2024",
            "summerII2024",
            "summerI2024",
            "spring2024",
            # "fall2023",
            "summerII2023",
            "summerI2023",
            "spring2023",
            "fall2022",
            "spring2022"
        ]
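As a starting point, the sketch below only builds one URL per term and does not download anything. It assumes that every term's page follows the same URL pattern as the fall 2023 page used earlier, which is not confirmed here; the names `base_url` and `term_urls` are made up for this example, and only a few terms are listed to keep it short.

```python
# Assumption: every term uses the same URL pattern as the fall2023 page
base_url = "https://crypto.unc.edu/UNC_classes/"

# A short subset of the terms, for illustration
terms = ["spring2025", "fall2024", "summerII2024"]

# Map each term to the URL where its table is assumed to live
term_urls = {term: f"{base_url}{term}/" for term in terms}

for term, url in term_urls.items():
    print(term, url)
```

From here, the loop body could fetch and parse each URL with the steps from Parts 1 and 2 and store the resulting DataFrame under its term key.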
