# STOR 320: Introduction to Data Science
## Web Scraping in Python 

We introduce the basics of web scraping using Python, including how to extract data from web pages and process it for analysis.

Web scraping is ***not*** required for the midterm exam or the weekly class activity.

**Topics Covered:**

0. Introduction and Environment Setup
1. Understanding HTML Structure
2. Making HTTP Requests
3. Parsing HTML with Beautiful Soup
4. Extracting Data
5. Saving Data to a CSV
6. Practice Exercise

We will use [SelectorGadget](https://selectorgadget.com/):
 - Install the Chrome extension at [https://chromewebstore.google.com/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb](https://chromewebstore.google.com/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)

### 0. Introduction and Environment Setup

Web scraping is the process of extracting data from websites. This technique is used for data collection, analysis, and various other applications. It involves fetching the web page content and parsing the HTML to extract the desired information.

While web scraping is a powerful tool, it's important to follow best practices and ethical guidelines, including:
- Respect the website's 'robots.txt' file, which specifies the rules for web crawlers/bots.
- Avoid overloading the website with too many requests in a short period.
- Always give credit to the source of the data.

For example: [https://stor.unc.edu/robots.txt](https://stor.unc.edu/robots.txt)
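
As a minimal sketch of how a script might check these rules before scraping, Python's standard-library `urllib.robotparser` can parse robots.txt rules and report whether a path may be fetched. The rules, the user-agent name `my-class-bot`, and the URLs below are made up for illustration; they are not the contents of the UNC file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, written inline so the example runs offline.
example_rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
""".splitlines()

parser = RobotFileParser()
parser.parse(example_rules)

# can_fetch(useragent, url) returns True when the rules allow the request.
allowed_public = parser.can_fetch("my-class-bot", "https://example.com/index.html")
allowed_private = parser.can_fetch("my-class-bot", "https://example.com/private/data.html")

# crawl_delay(useragent) returns the requested pause in seconds, or None if unset.
delay = parser.crawl_delay("my-class-bot")

print(allowed_public, allowed_private, delay)
```

For a live site, one would likely call `parser.set_url(...)` with the site's robots.txt address and then `parser.read()` instead of passing the rules inline.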


Tip: Use ChatGPT to help with the web scraping and to organize the results.

We will use the requests, Beautiful Soup, and pandas libraries throughout this lab. The cell below can be used to install any missing libraries with the `pip` command. You may need to place a `!` or `%` in front of the `pip` command in order for it to run on your machine. You can then run the import cell to import all necessary libraries.

In [2]:
pip install requests beautifulsoup4 pandas

Note: you may need to restart the kernel to use updated packages.




In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

### 1. Understanding HTML Structure (6 points)
HTML (HyperText Markup Language) is the standard language for creating web pages. It describes the structure of a web page using elements and tags.

### Basic HTML Structure 

### Common HTML Tags
Below is a list of common HTML tags. We typically refer to the tags representing the root element, the meta-information, and the content as the main sections of the document. For more information on different tags, you can use an [HTML Reference guide](https://www.w3schools.com/tags/).

- '\<html\>': The root element of an HTML page.
- '\<head\>': Contains meta-information about the HTML document.
- '\<title\>': Sets the title of the HTML document (displayed in the browser's title bar or tab).
- '\<body\>': Contains the content of the HTML document.
- '\<h1\>' to '\<h6\>': Header tags, with \<h1\> being the highest level and \<h6\> the lowest.
- '\<p\>': Paragraph tag.
- '\<a\>': Anchor tag for hyperlinks.
- '\<div\>': Division or section in an HTML document, often used for layout purposes.
- '\<span\>': Inline container for text.

### Example HTML Page for Scraping
This example HTML code provides a Sample Page with a header and paragraph, referencing two page links. Under the links, the page has an unordered list with three list items. 
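
The HTML listing itself is not reproduced here, so the sketch below uses a made-up page that matches the description, stored as a Python string. It then uses Beautiful Soup (covered in Section 3) to confirm the structure. The page text, link addresses, and variable names are all assumptions for illustration.

```python
from bs4 import BeautifulSoup

# A made-up page matching the description: a header, a paragraph,
# two links, and an unordered list with three items.
sample_html = """
<html>
  <head>
    <title>Sample Page</title>
  </head>
  <body>
    <h1>Welcome to the Sample Page</h1>
    <p>This paragraph introduces the links below.</p>
    <a href="https://example.com/page1">Page 1</a>
    <a href="https://example.com/page2">Page 2</a>
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
    </ul>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Confirm the structure by looking up the tags of interest.
page_title = sample_soup.title.string
num_links = len(sample_soup.find_all("a"))
list_items = [li.get_text() for li in sample_soup.find_all("li")]

print(page_title, num_links, list_items)
```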

**1.1 What are the three tags representing the main sections of an HTML document?** (3 points)

Answer: The main sections are '\<html\>', '\<head\>', and '\<body\>'.

**1.2 What tags would you use to create an unordered list of items? What tags would you use to create an ordered list of items? What tags would you use inside the list tags to create an item of the list?** (3 points) 

Answer: To create a list of items, you would use the tags '\<ul\>' to create an unordered list or '\<ol\>' to create an ordered list, with '\<li\>' tags inside to create a list item.

### 2. Making HTTP Requests (6 points)
We use the requests library to fetch the contents of a web page. You can refer to the [API documentation](https://requests.readthedocs.io/en/latest/api/#) if needed.

### Example
The example below sends a GET request to the specified URL and returns a response object. We can then process the response object depending on the status code returned and get the information desired out of the response, such as the HTML text. 

In [4]:
url = "http://example.com"
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    print(response.text)
else:
    print("Failed to retrieve the web page")

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domai

**2.1. Using the API documentation for reference, what does the `response.status_code` represent? What are the two return options and their meaning?** (3 points)

Answer: `response.status_code` is the integer HTTP status code returned by the server. A code of 200 means the request was successful, while a code such as 404 (Not Found) means there was an error.
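
Codes 200 and 404 are only two of many possible status codes. As a rough sketch, the standard-library `http.HTTPStatus` enum can turn a numeric code into its standard phrase, and the first digit indicates the general category. The helper name `describe_status` is made up for this example.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of a standard HTTP status code."""
    # HTTPStatus(code) raises ValueError for codes it does not know.
    phrase = HTTPStatus(code).phrase  # e.g. 200 -> "OK", 404 -> "Not Found"
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirection"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "informational"
    return f"{code} {phrase} ({category})"

for code in (200, 301, 404, 500):
    print(describe_status(code))
```

In a scraping script, the same idea applies to `response.status_code`: anything outside the 2xx range usually means the page content should not be parsed.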

**2.2. Re-write the example code in the box below using a URL of your choice. Make sure to check if the request was successful! Instead of printing your webpage HTML text, change the print statement to state 'Successful!' if the request was successful.** (3 points)

In [9]:
url = "https://library.unc.edu/hours/"
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    print("Successful!")
else:
    print("Failed to retrieve the web page")

Successful!


In [10]:
print(response.text)

<!DOCTYPE html>
<html lang="en-US" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#">
<head>
        <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">

            <link rel="profile" href="https://gmpg.org/xfn/11">
            <title>Locations and Hours &#8211; UNC University Libraries</title>
<meta name='robots' content='max-image-preview:large' />
<link rel='dns-prefetch' href='//use.typekit.net' />
<link rel="alternate" type="application/rss+xml" title="UNC University Libraries &raquo; Feed" href="https://library.unc.edu/feed/" />
<link rel="alternate" type="application/rss+xml" title="UNC University Libraries &raquo; Comments Feed" href="https://library.unc.edu/comments/feed/" />
<script>
window._wpemojiSettings = {"baseUrl":"https:\/\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/","ext":".png","svgUrl":"https:\/\/s.w.org\/images\/core\/emoji\/15.0.3\/svg\/","svgExt":".svg","source":{"concatemoji":"https:\/\/library.unc.edu\/wp-i

### 3. Parsing HTML with Beautiful Soup (4 points)
We now can use the BeautifulSoup library to parse the HTML content and extract data from it in a structured way. We can then apply the `soup.prettify()` method to format the parsed HTML into a more readable and indented structure.

In [8]:
url = "http://example.com"
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    html_content = response.text
    
    # Parse the HTML content
    soup = BeautifulSoup(html_content, "html.parser")
    print(soup.prettify())

else:
    print("Failed to retrieve the web page")

<!DOCTYPE html>
<html>
 <head>
  <title>
   Example Domain
  </title>
  <meta charset="utf-8"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <style type="text/css">
   body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
  </style>
 </head>
 <body>
  <div>
   <h1>
    Example Domain
   </h1>
   <p>
    This dom

**3.1. Why do we parse the HTML content inside the if statement instead of after the if-else statement?** (1 point)

Answer: We parse the HTML content inside the if statement because we only want to parse the response when the request was successful. In this code, `html_content` is assigned only inside the if branch, so parsing after the if-else statement would raise a `NameError` whenever the request failed.

**3.2. Rewrite your code from 2.2 in the box below to add the HTML parsing with Beautiful Soup. Instead of printing the prettified output, set the output equal to a variable named `pretty_soup` and then add a print statement stating "Pretty soup obtained!"** (3 points)

In [9]:
url = "https://www.sports-reference.com/cbb/seasons/men/2024.html"
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    html_content = response.text
    
    # Parse the HTML content
    soup = BeautifulSoup(html_content, "html.parser")
    pretty_soup = soup.prettify()
    print("Pretty soup obtained!")
          
else:
    print("Failed to retrieve the web page")

Pretty soup obtained!


### 4. Extracting Data (2 points)
Often we do not want the entire HTML output. We will now look at extracting specific data from the parsed HTML using methods in the BeautifulSoup library.

### Example
Examine the code below added after we parse the HTML content.

In [11]:
url = "http://example.com"
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    html_content = response.text
    
    # Parse the HTML content
    soup = BeautifulSoup(html_content, "html.parser")

    # Extract the title of the web page
    title = soup.title.string
    print("Title of the web page:", title)

    # Extract all hyperlinks
    for link in soup.find_all("a"):
        print(link.get('href'))

else:
    print("Failed to retrieve the web page")

Title of the web page: Example Domain
https://www.iana.org/domains/example


**4.1. How do you extract the title of a web page using BeautifulSoup?** (1 point)

Answer: You can extract the title of a web page using `soup.title.string`.

**4.2. What method would you use to find all hyperlinks on a webpage?** (1 point)

Answer: To find all hyperlinks on a web page, you can use the `soup.find_all("a")` method.
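
On many pages, `href` values are relative paths rather than full addresses, and some `<a>` tags have no `href` at all. The sketch below shows one way to handle both cases with the standard-library `urljoin`. The base URL and markup are hypothetical.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical page address and markup, for illustration only.
base_url = "https://example.com/articles/index.html"
html = """
<a href="/about">About</a>
<a href="post1.html">First post</a>
<a href="https://www.iana.org/domains/example">External</a>
<a>No href here</a>
"""

link_soup = BeautifulSoup(html, "html.parser")

absolute_links = []
for link in link_soup.find_all("a"):
    href = link.get("href")  # returns None when the attribute is missing
    if href is not None:
        # urljoin resolves relative paths against the page address
        absolute_links.append(urljoin(base_url, href))

print(absolute_links)
```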

### 5. Saving Data to a CSV (2 points)
We will now use the Pandas library to save the extracted data to a CSV file.

### Example
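We can put the data we extracted from the title and links into a DataFrame using a dictionary. Each list in the dictionary must have the same length, which holds here because the page has a single title and a single link. We can then use the `to_csv` method to write the DataFrame to a CSV file.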

In [12]:
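links = [link.get("href") for link in soup.find_all("a")]

# Repeat the title so both columns have the same length,
# however many links were found
data = {
    "Title": [title] * len(links),
    "Links": links
}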

df = pd.DataFrame(data)
df.to_csv("extracted_data.csv", index=False)
print("Data saved to extracted_data.csv")

Data saved to extracted_data.csv


In [13]:
df

Unnamed: 0,Title,Links
0,Example Domain,https://www.iana.org/domains/example


**5.1. What is the meaning of the two parameters passed into the `to_csv` function?** (2 points)

Answer: The first parameter is a string specifying what we want to name the CSV. The second parameter is a boolean representing if we want the index column included in the CSV or not. 
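
To see the effect of the `index` parameter without writing any files, the sketch below relies on the fact that `to_csv` returns the CSV text as a string when no file path is given. The DataFrame mirrors the one above, and the variable names are chosen for this example only.

```python
import pandas as pd

example_df = pd.DataFrame({
    "Title": ["Example Domain"],
    "Links": ["https://www.iana.org/domains/example"],
})

# With no file path, to_csv returns the CSV text instead of writing a file.
without_index = example_df.to_csv(index=False).splitlines()
with_index = example_df.to_csv(index=True).splitlines()

print(without_index[0])  # header row without the index column
print(with_index[0])     # header row starts with an empty name for the index
print(with_index[1])     # data row starts with the index value 0
```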

### 6. Practice Exercise (8 points)
You will now put all of these skills together in a practice exercise! We are going to walk through scraping the April 2024 NBA schedule from Basketball Reference.

**6.1. Complete the code below to send a GET request for the provided URL and then use the Beautiful Soup library to parse the content of the HTML document.** Feel free to examine the soup that is returned, but please do not leave it printed in your solution. (2 points)

In [14]:
url = 'https://www.basketball-reference.com/leagues/NBA_2024_games-april.html'

# Call the GET request
response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser")

BeautifulSoup's `find()` method searches for a tag and specified attributes, returning the first match. Here we will search for the `div` whose `id` is 'div_schedule' to find the section where the schedule lives in the HTML. You can examine the HTML output to find this.

In [16]:
schedule = soup.find(name="div", attrs={"id": "div_schedule"})

### Use the SelectorGadget to extract data

In [17]:
for a in soup.select(".right , #schedule a"):
    print(a.get_text().strip())

Mon, Apr 1, 2024
7:00p
Boston Celtics
118
Charlotte Hornets
104
Box Score
19,238
1:54
Mon, Apr 1, 2024
7:00p
Memphis Grizzlies
110
Detroit Pistons
108
Box Score
19,999
2:08
Mon, Apr 1, 2024
7:00p
Brooklyn Nets
111
Indiana Pacers
133
Box Score
16,522
2:20
Mon, Apr 1, 2024
7:00p
Portland Trail Blazers
103
Orlando Magic
104
Box Score
19,004
2:11
Mon, Apr 1, 2024
8:00p
Atlanta Hawks
113
Chicago Bulls
101
Box Score
21,114
2:07
Mon, Apr 1, 2024
8:00p
Phoenix Suns
124
New Orleans Pelicans
111
Box Score
17,753
2:14
Tue, Apr 2, 2024
7:00p
Los Angeles Lakers
128
Toronto Raptors
111
Box Score
19,800
2:06
Tue, Apr 2, 2024
7:00p
Milwaukee Bucks
113
Washington Wizards
117
Box Score
16,492
2:13
Tue, Apr 2, 2024
7:30p
New York Knicks
99
Miami Heat
109
Box Score
19,857
2:16
Tue, Apr 2, 2024
7:30p
Oklahoma City Thunder
105
Philadelphia 76ers
109
Box Score
20,733
2:22
Tue, Apr 2, 2024
8:00p
Houston Rockets
106
Minnesota Timberwolves
113
Box Score
18,024
2:13
Tue, Apr 2, 2024
9:00p
San Antonio Spurs
105
D

In [18]:
len(soup.select(".right , #schedule a")) 

1414
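
The selector string passed to `select` combines two CSS selectors with a comma: `.right` matches any element with the class `right`, and `#schedule a` matches `<a>` tags nested inside the element whose id is `schedule`. The sketch below shows this on a tiny, made-up fragment; the real page is much larger and its markup may differ.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup for illustration only.
html = """
<table id="schedule">
  <tr>
    <th><a href="/boxscores/index.html">Mon, Apr 1, 2024</a></th>
    <td class="right">7:00p</td>
    <td class="left"><a href="/teams/BOS/2024.html">Boston Celtics</a></td>
    <td class="right">118</td>
  </tr>
</table>
<p class="right">Outside the table</p>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# select() returns every element matching either selector, in document order.
matches = [tag.get_text().strip() for tag in demo_soup.select(".right , #schedule a")]
print(matches)
```

Note that `.right` is not limited to the schedule table, so it also picks up the paragraph outside it. On a real page, it is worth checking whether a selector matches more elements than intended.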

**6.2. If you examine the output of `schedule`, you will see each row in the schedule starts with a '\<tr\>' tag. We will now use a for loop to go through all the rows of the schedule and put the data into a DataFrame. We will parse the game date, home team, away team, home points, and visitor points in order to add some computed columns to our DataFrame. Complete the code below to:**

- Initialize an empty DataFrame `nba_schedule_df` with the columns stored in `columns` (1 point)
- Parse the `home_team` from the row by finding the "home_team_name" in the row instead of the "visitor_team_name" (1 point)
- Parse the `home_pts` as an integer by finding the "home_pts" in the row instead of the "visitor_pts" (1 point)
- Calculate the `spread` as the home team's points minus the away team's points. (1 point)
- Calculate the `total` number of points scored by both teams (1 point)

In [19]:
games = soup.select(".right , #schedule a")

columns = ['date', 'start_time', 'visitor_team', 'visitor_pts', 'home_team', 'home_pts', 'attendance', 'duration', 'spread', 'total']
nba_schedule_df = pd.DataFrame(columns=columns)

# Assuming each game contributes 9 matched elements; stopping at len(games) - 8 keeps games[i + 8] in range
for i in range(0, len(games) - 8, 9):
    date = games[i].get_text().strip()
    start_time = games[i + 1].get_text().strip()
    visitor_team = games[i + 2].get_text().strip()
    visitor_pts = int(games[i + 3].get_text().strip())  # Parse visitor points as integer
    home_team = games[i + 4].get_text().strip()  # Parse home team
    home_pts = int(games[i + 5].get_text().strip())  # Parse home points as integer
    # games[i + 6] is the "Box Score" link, which we skip
    attendance = games[i + 7].get_text().strip()
    duration = games[i + 8].get_text().strip()

    # Calculate spread (home points - visitor points)
    spread = home_pts - visitor_pts
    
    # Calculate total points (home points + visitor points)
    total = home_pts + visitor_pts

    # Append the data to the DataFrame
    new_row = pd.DataFrame({
        'date': [date],
        'start_time': [start_time],
        'visitor_team': [visitor_team],
        'visitor_pts': [visitor_pts],
        'home_team': [home_team],
        'home_pts': [home_pts],
        'attendance': [attendance],
        'duration': [duration],
        'spread': [spread],
        'total': [total]
    })

    # Append the new row to the existing DataFrame using pd.concat
    nba_schedule_df = pd.concat([nba_schedule_df, new_row], ignore_index=True)
    
nba_schedule_df

Unnamed: 0,date,start_time,visitor_team,visitor_pts,home_team,home_pts,attendance,duration,spread,total
0,"Mon, Apr 1, 2024",7:00p,Boston Celtics,118,Charlotte Hornets,104,19238,1:54,-14,222
1,"Mon, Apr 1, 2024",7:00p,Memphis Grizzlies,110,Detroit Pistons,108,19999,2:08,-2,218
2,"Mon, Apr 1, 2024",7:00p,Brooklyn Nets,111,Indiana Pacers,133,16522,2:20,22,244
3,"Mon, Apr 1, 2024",7:00p,Portland Trail Blazers,103,Orlando Magic,104,19004,2:11,1,207
4,"Mon, Apr 1, 2024",8:00p,Atlanta Hawks,113,Chicago Bulls,101,21114,2:07,-12,214
...,...,...,...,...,...,...,...,...,...,...
152,"Mon, Apr 29, 2024",8:30p,Oklahoma City Thunder,97,New Orleans Pelicans,89,18487,2:24,-8,186
153,"Mon, Apr 29, 2024",10:00p,Los Angeles Lakers,106,Denver Nuggets,108,19861,2:19,2,214
154,"Tue, Apr 30, 2024",7:00p,Philadelphia 76ers,112,New York Knicks,106,19812,2:49,-6,218
155,"Tue, Apr 30, 2024",8:00p,Orlando Magic,103,Cleveland Cavaliers,104,19432,2:31,1,207
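
The positional loop above depends on every game contributing exactly nine matched elements. A more defensive variant, sketched below, walks each `<tr>` row and looks up cells by attribute, then builds the DataFrame once from a list of dictionaries. The markup is a simplified stand-in, and the `data-stat` attribute names are assumptions based on the names mentioned in 6.2; they should be checked against the real page.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Simplified stand-in for two schedule rows (assumed attribute names).
html = """
<table id="schedule">
  <tr>
    <td data-stat="visitor_team_name">Boston Celtics</td>
    <td data-stat="visitor_pts">118</td>
    <td data-stat="home_team_name">Charlotte Hornets</td>
    <td data-stat="home_pts">104</td>
  </tr>
  <tr>
    <td data-stat="visitor_team_name">Brooklyn Nets</td>
    <td data-stat="visitor_pts">111</td>
    <td data-stat="home_team_name">Indiana Pacers</td>
    <td data-stat="home_pts">133</td>
  </tr>
</table>
"""

def cell_text(row, stat):
    """Return the stripped text of the cell whose data-stat equals `stat`."""
    cell = row.find(attrs={"data-stat": stat})
    return cell.get_text().strip() if cell is not None else None

rows = []
for row in BeautifulSoup(html, "html.parser").find_all("tr"):
    visitor_pts = int(cell_text(row, "visitor_pts"))
    home_pts = int(cell_text(row, "home_pts"))
    rows.append({
        "visitor_team": cell_text(row, "visitor_team_name"),
        "visitor_pts": visitor_pts,
        "home_team": cell_text(row, "home_team_name"),
        "home_pts": home_pts,
        "spread": home_pts - visitor_pts,  # home points minus visitor points
        "total": home_pts + visitor_pts,
    })

# Build the DataFrame once from the list of row dictionaries.
sketch_df = pd.DataFrame(rows)
print(sketch_df)
```

Rows that lack these cells, such as header rows, would need to be skipped before calling `int`.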


**6.3 Write the `nba_schedule_df` to a csv file named "NBA_April_2024_Schedule.csv" and do not include the index column in the csv output.** (1 point)

In [20]:
nba_schedule_df.to_csv("NBA_April_2024_Schedule.csv", index=False)

Web scraping, while a useful tool for extracting data from websites, has several limitations and potential downsides in today's environment.

#### 1. **Legal and Ethical Issues**
   - **Terms of Service Violations**: Many websites explicitly prohibit scraping in their terms of service. Ignoring these can lead to legal consequences, including lawsuits or being banned from the site. (For example, price information on Amazon should not be scraped by competitors such as Best Buy.)
   - **Copyright Infringement**: Some websites may have copyrighted content, and scraping could potentially infringe on intellectual property rights.
   - **Ethical Concerns**: Collecting data without permission, especially from websites that rely on user-generated content, can raise ethical questions about privacy and fairness.

#### 2. **Website Changes**
   - **Frequent Structure Changes**: Websites often update their layout and HTML structure. Scraping scripts need constant maintenance to handle these changes, which can be time-consuming.
   - **Dynamic Content**: Many modern websites use JavaScript to load content dynamically. Scraping tools that only handle static HTML may not capture important data without additional complexity.

#### 3. **Blocking and Anti-Scraping Measures**
   - **IP Blocking**: Websites often implement rate limiting or block IP addresses that send too many requests in a short time, making it hard to scrape data at scale.
   - **CAPTCHAs**: Some sites use CAPTCHA challenges to prevent automated access, which can interrupt the scraping process and require additional tools to bypass.
   - **User-Agent Detection**: Sites can detect scraping tools based on their HTTP headers, such as the User-Agent, and restrict access to them.
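
One common way to stay under a site's rate limits is simply to pause between requests. The sketch below shows the idea with a hypothetical helper, `fetch_politely`, that takes the fetching function as an argument so the example can run offline. The helper name, the one-second default, and the stand-in `fake_fetch` function are assumptions for illustration.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        results[url] = fetch(url)
    return results

# Offline demonstration with a stand-in fetch function.
def fake_fetch(url):
    return f"contents of {url}"

pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay_seconds=0.1,
)
print(pages)
```

With the requests library, the `fetch` argument could be something like `lambda url: requests.get(url, timeout=10)`. A suitable delay depends on the site, and its robots.txt may state one.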

#### 4. **Data Quality Issues**
   - **Incomplete or Inconsistent Data**: Scrapers might not extract data as intended due to minor changes in the site structure or hidden elements. This can lead to incomplete, inaccurate, or misleading data.
   - **Data Duplication**: Scraping might inadvertently collect duplicate records or outdated information, especially if the source is not well-structured.
   - **Legal Removal of Data**: Some data that is public today could be legally required to be removed later, making it unreliable for long-term use.
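
Some of these problems can be caught with a quick check after scraping. The sketch below counts duplicate rows and missing values in a small, made-up DataFrame and then removes them. The column names and records are invented for the example.

```python
import pandas as pd

# Made-up scraped records: one exact duplicate and one missing score.
scraped = pd.DataFrame({
    "home_team": ["Charlotte Hornets", "Charlotte Hornets", "Indiana Pacers"],
    "home_pts": [104, 104, None],
})

# duplicated() flags rows that repeat an earlier row exactly.
num_duplicates = int(scraped.duplicated().sum())

# isna() flags missing values in the chosen column.
num_missing = int(scraped["home_pts"].isna().sum())

# Drop repeated rows, then rows with a missing score.
cleaned = scraped.drop_duplicates().dropna(subset=["home_pts"])
print(num_duplicates, num_missing, len(cleaned))
```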

#### 5. **Scalability and Efficiency**
   - **Resource-Intensive**: Scraping large websites can require significant computational resources, bandwidth, and storage. It may also slow down or disrupt the scraped website.
   - **API Alternatives**: Many websites offer official APIs, which are much more efficient, stable, and scalable for retrieving structured data, without needing to parse HTML.