# Web Scraping 101

*After finishing this tutorial, you can extract data from multiple pages on the web and export such data to JSON and CSV files to use in an analysis. Plan a few hours to work through this notebook. Taking a few breaks in between keeps you sharp!*

*Just starting with web scraping? Then make sure to have followed the ["webdata for dummies" tutorial](https://odcm.hannesdatta.com/docs/modules/week2/webdata-for-dummies/) first.*

*Enjoy!*

--- 

## Learning Objectives

Our main goal is to compile a panel data set of music consumption data for (simulated) users of music-to-scrape.org, a platform developed for practicing web scraping skills.



* Identifying a strategy for generating seeds (“sampling”)
    * Extracting multiple elements at once using the `.find_all()` function
    * Preventing array misalignment
* Navigating on a website 
    * Using URLs to programmatically visit web pages
    * Writing loops to execute data collections in bulk using functions
* Improving extraction design
    * Implementing timers and modularizing extraction code
    * Storing data in CSV or JSON files with relevant meta data
* Scraping more advanced, dynamic websites
    * Understanding the difference between headless requests and browser emulation 
    * Learning when to apply one of the two methods (using `requests` and `selenium`)

--- 

<div class="alert alert-block alert-info"><b>Support Needed?</b> 
    For technical issues outside of scheduled classes, please check the <a href="https://odcm.hannesdatta.com/docs/course/support" target="_blank">support section</a> on the course website.
</div>


## 1. Generating seeds ("sampling")


__Importance__

So far, we've extracted (=parsed) some information (e.g., names of featured artists) from an artist's individual *artist page*. What we haven't done yet is to take a closer look at the consumption of individual users.

In fact, individual users are often a focal point of attention in web scraping. For example, we can sample users' tweets on Twitter/X, or users' movie watching behavior on trakt.tv. 

Yet, before we can start building what is called a "panel data set" (i.e., multiple users, observed over multiple time periods), we need to decide __for which users to obtain information__. Ideally, we would like to capture information for a *sample of users* (or books, movies, series, games, depending on the platform).

In web scraping, we typically refer to a "seed" as a starting point for a data collection. Without a seed, there's no data to collect.

For example, before we can crawl through all users available at [music-to-scrape.org](https://music-to-scrape.org), we first need to generate a *list of many users of the platform*. (Note that obtaining the user names of ALL users of a site is hardly ever possible.)

One way to get there would be to:

1. first visit the main homepage of [music-to-scrape.org](https://music-to-scrape.org), which shows a few recently active users, and
2. visit a user's profile page and start scraping their consumption data (or anything else on that page; we have done this in the webdata for dummies tutorial). 

Note that the homepage allows us to "navigate" to the users' profile pages, such as by clicking on the user name or the avatar (see red boxes in the figure below). 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-users.png" align="left" width=80%/>

### 1.1 Collecting links to use as seeds

Let's take a look at how the links for users' profile pages are written in the website's source code.

Open the [website](https://music-to-scrape.org), and inspect the underlying HTML code with the Chrome or Firefox Inspector (right click --> inspect element). Hover around with your mouse a bit, and then select one of the user avatars. 

Do you see in the source code that each user contains a clickable link (`<a>`), containing the link (`href`) to the user's profile page? 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-inspect-link.png" align="left" width=60%/>

But, how could we tell a computer to capture the links to the various user pages?

One simple way is to select *elements by their tags*, for example, by extracting all links (`<a>` tags). 

<div class="alert alert-block alert-info"><b>How to extract multiple elements at once?</b>
    <br>
    
- By working through other tutorials, you may already be familiar with the <code>.find()</code> function of BeautifulSoup. The <code>.find()</code> function returns the <b>first element</b> that matches your particular "search query". <br>
- If you want to extract <b>all elements</b> that match a particular search pattern (say, a class name), you can use BeautifulSoup's <code>.find_all()</code> function.<br>
- Note that the result of the <code>.find_all()</code> function is a list of results <b>that you need to iterate through.</b>

</div>
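
To make the difference between the two functions concrete, here is a minimal sketch. The HTML snippet and the user names in it are made up for illustration (they are not taken from music-to-scrape.org), and `html.parser` is simply Python's built-in parser.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet (hypothetical; not the real website)
html = """
<ul>
  <li><a href="user?username=Alice01">Alice01</a></li>
  <li><a href="user?username=Bob02">Bob02</a></li>
  <li><a>no link here</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")      # a single element (or None if nothing matches)
all_links = soup.find_all("a")   # a list-like collection of ALL matching elements

print(first_link.attrs["href"])
print(len(all_links))

# Because .find_all() returns several elements, we iterate over them
hrefs = [a.attrs["href"] for a in all_links if "href" in a.attrs]
print(hrefs)
```

Note how the third `<a>` tag has no `href` attribute, which is why the check `"href" in a.attrs` is useful.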


__Exercise 1.1__

Please run the code cell below, which extracts all links (the `a` tag!), and prints the URL (`href`) to the screen. Don't worry, you don't need to understand the code yet, we'll go over it line by line shortly!

If you look at these links more closely, you'll notice that we're not interested in many of these links... 

Make a list of all links we're *not* interested in (i.e., those *not* pointing to a user page). Which ones are those? Can you find out why they are there?

In [314]:
# Run this code now
import requests
from bs4 import BeautifulSoup

# make a get request to the homepage of music-to-scrape.org (see Webdata for Dummies tutorial)
user_agent = {'User-agent': 'Mozilla/5.0'}
url = 'https://music-to-scrape.org'

res = requests.get(url, headers = user_agent)
res.encoding = res.apparent_encoding

soup = BeautifulSoup(res.text)

# print the href attribute of every <a> tag found on the page
for link in soup.find_all("a"):
    if 'href' in link.attrs: 
        print(link.attrs["href"])

/privacy_terms
/privacy_terms
about
/
/
/tutorial_scraping
/tutorial_api
#
https://api.music-to-scrape.org/docs
/about
song?song-id=SOAXZLM12A6D4F7A95
song?song-id=SOVUNDW12A58A7AF38
song?song-id=SOGYYUL12A6D4FB2CB
song?song-id=SOOYBBU12A6D4F9AF3
song?song-id=SOCSZBL12CF530E5B0
song?song-id=SOKHQEF12AB0183CF6
song?song-id=SOPWVGR12A8AE46957
song?song-id=SOLGUGY12AB01897BE
song?song-id=SOTKTCQ12AB01863FF
song?song-id=SOJURPV12A8C141B82
song?song-id=SOJZTXJ12AB01845FB
song?song-id=SOZKQWY12A6D4FA5E6
song?song-id=SOGXHEG12AB018653E
song?song-id=SOQVMXR12A81C21483
song?song-id=SODRPTE12A58A7BE10
song?song-id=SOBVAPJ12AB018739D
song?song-id=SOECLAD12AAF3B120A
song?song-id=SORJVDO12AF72A1970
song?song-id=SOLIQRN12A8C1391A6
song?song-id=SOAYRZU12A8C133EBD
song?song-id=SOQVVDQ12AB018300C
song?song-id=SOQQVIP12A8C13E7E6
song?song-id=SOCDLSK12AB018168E
song?song-id=SONLBRH12A6D4FBAEE
song?song-id=SOMCTKM12A8C138B86
artist?artist-id=ARJ66JQ1187B99D2FF
artist?artist-id=ARIN12F1187FB3E92C
artist?ar

**Your answer**

...

__Solution__

The links we want to ignore are...

* The links to the about or privacy pages
* Any link pointing to the most popular songs or artists
* Any social media links, etc.

These links are present on the page because visitors use them to navigate the site. 
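
To get a feeling for how many of the printed links fall into each group, one could sort them by how they start. The sketch below uses a handful of links copied from the output above; the `classify()` helper and its category names are hypothetical and only serve as an illustration.

```python
from collections import Counter

# A handful of hrefs copied from the output above
hrefs = [
    '/privacy_terms', 'about', '/', '/tutorial_scraping', '#',
    'https://api.music-to-scrape.org/docs',
    'song?song-id=SOAXZLM12A6D4F7A95',
    'song?song-id=SOVUNDW12A58A7AF38',
    'artist?artist-id=ARJ66JQ1187B99D2FF',
]

def classify(href):
    """Assign a link to a rough category based on how it starts (hypothetical helper)."""
    if href.startswith('song?'):
        return 'song'
    if href.startswith('artist?'):
        return 'artist'
    return 'navigation/other'

# Count how many links fall into each category
counts = Counter(classify(href) for href in hrefs)
print(counts)
```

Filtering on text patterns like this is fragile, though, because it depends on how the site happens to write its links.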

### 1.2 Collecting *More Specific* Links

__Importance__

We've just discovered that selecting elements by their tags gives us many irrelevant links. But how can we narrow down these links, or, in other words, __how can we scrape only the users we're interested in?__

To answer this question, we need to briefly revisit how HTML code is structured. __Open your browser's inspect tool again and hover over the "recently active users" section on the site.__

After inspecting, you'd probably notice that the page is generated according to a rigid structure: all user links are contained in a `<section>` tag, with the attribute `name="recent_users"`. The "wrong links" extracted above (i.e., to the about or privacy pages) are *not* part of these elements. 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-section.png" align="left" width=60%/>



So, if we can tell our scraper that we're only interested in the `<a>` tags *within* the particular `<section>` with attribute `name` equal to `recent_users`, we end up with our desired selection of links. 

__Let's try it out__

Like before, we'll use `.find_all()` to capture all matching elements on the page. The difference, however, is that we do not directly try to extract the __links__ with the tag `a`, but first try to select the section containing the relevant links.

Run the code below, in which we first try to capture only elements in the section with `name=recent_users`. Then, we collect all `<a>` tags.


In [315]:
import requests
from bs4 import BeautifulSoup

# make request
url = 'https://music-to-scrape.org'

res = requests.get(url, headers = user_agent)
res.encoding = res.apparent_encoding

soup = BeautifulSoup(res.text)

relevant_section = soup.find('section',attrs={'name':'recent_users'})

users = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs: 
        users.append(link.attrs['href'])
users

['user?username=CyberStar49',
 'user?username=Wizard79',
 'user?username=WizardShadow42',
 'user?username=Panda89',
 'user?username=SonicNinja25',
 'user?username=ShadowPixel58']

As expected, we retrieve up to six user names. You can now also use the `users` object to look at the data for the first, second, third, ... user.

In [316]:
users[0] # returns the link to the user page of the 1st user

'user?username=CyberStar49'

Note that each entry in the list still contains more than the user name (e.g., the `user?username=` part). Remember, we extracted the __links__ to the profile pages, not just the user names.

If we want to remove anything but the usernames, we can modify our extraction function slightly, for example using Python's `split` function.


In [317]:
users = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs: 
        users.append(link.attrs['href'].split('=')[1])
users

['CyberStar49',
 'Wizard79',
 'WizardShadow42',
 'Panda89',
 'SonicNinja25',
 'ShadowPixel58']
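
Splitting on `=` works as long as a link carries exactly one parameter. As a hedged alternative, the sketch below uses `urlparse` and `parse_qs` from Python's standard library, which keep working when a link has additional parameters (such as `&week=36`). The helper name `extract_username()` is hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def extract_username(link):
    """Return the value of the 'username' parameter, or None if it is absent."""
    query = urlparse(link).query      # e.g., 'username=CyberStar49'
    params = parse_qs(query)          # e.g., {'username': ['CyberStar49']}
    return params.get('username', [None])[0]

# A relative link with one parameter
print(extract_username('user?username=CyberStar49'))

# An absolute link with two parameters
print(extract_username('https://music-to-scrape.org/user?username=StarCoder49&week=36'))
```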

Need an explanation of this code (the loop using `split`)? Just copy-paste it to ChatGPT and ask for an explanation, e.g., using this prompt:

> I struggle to understand this piece of Python code in the context of web scraping. 
> Can you please explain it, paying attention to the complicated last line (users.append())?

Pretty cool, right? So let's proceed with some exercises.

#### Exercise 1.2
1. Modify the loop (`for link in relevant_section`...) written above to extract the *absolute URLs* rather than the relative URLs. Specifically, combine the website's URL (`https://music-to-scrape.org/`) and the string you extracted earlier (`user?username=GalaxyShadow34`). The final URL needs to be: `https://music-to-scrape.org/user?username=GalaxyShadow34`.

2. Wrap your code from (1) in a function called `get_users()`, which returns the links to the user profile pages as a list. We will use it later to repeatedly collect user names (seeds) from this page. 

3. Execute your function from (2) in a while loop that runs every 2 seconds for a duration of 15 seconds. Importantly, write all URLs to a new-line separated JSON file, called `seeds.json`.

In [7]:
# your answer goes here!

#### Solutions

In [318]:
# Question 1 
urls = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs: 
        extracted_link = link.attrs['href']
        urls.append(f'https://music-to-scrape.org/{extracted_link}')
urls

['https://music-to-scrape.org/user?username=CyberStar49',
 'https://music-to-scrape.org/user?username=Wizard79',
 'https://music-to-scrape.org/user?username=WizardShadow42',
 'https://music-to-scrape.org/user?username=Panda89',
 'https://music-to-scrape.org/user?username=SonicNinja25',
 'https://music-to-scrape.org/user?username=ShadowPixel58']
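
Gluing strings together with an f-string works here because all user links are relative and do not start with a slash. As an alternative sketch, `urljoin` from Python's standard library handles relative links, links starting with `/`, and links that are already absolute. The example links are taken from outputs shown earlier in this notebook.

```python
from urllib.parse import urljoin

base_url = 'https://music-to-scrape.org/'

# A relative link, a link starting with a slash, and an absolute link
relative_links = ['user?username=CyberStar49',
                  '/privacy_terms',
                  'https://api.music-to-scrape.org/docs']

# urljoin resolves each link against the base URL
absolute_links = [urljoin(base_url, link) for link in relative_links]

for link in absolute_links:
    print(link)
```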

In [319]:
# Question 2
import requests
from bs4 import BeautifulSoup

def get_users():
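    url = 'https://music-to-scrape.org/'
    user_agent = {'User-agent': 'Mozilla/5.0'}  # same header as in the cells above
  
    res = requests.get(url, headers = user_agent)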
    res.encoding = res.apparent_encoding
    
    soup = BeautifulSoup(res.text)
    
    relevant_section = soup.find('section',attrs={'name':'recent_users'})

    links = []
    for link in relevant_section.find_all("a"):
        if 'href' in link.attrs: 
            extracted_link = link.attrs['href']
            links.append(f'https://music-to-scrape.org/{extracted_link}')
    return(links) # to return all links

get_users()

['https://music-to-scrape.org/user?username=CosmicSonic89',
 'https://music-to-scrape.org/user?username=CyberStar49',
 'https://music-to-scrape.org/user?username=Wizard79',
 'https://music-to-scrape.org/user?username=WizardShadow42',
 'https://music-to-scrape.org/user?username=Panda89',
 'https://music-to-scrape.org/user?username=SonicNinja25']

In [320]:
# Question 3
import time
import json

# Define the duration in seconds
duration = 15

# Calculate the end time
end_time = time.time() + duration

# Open the file in append mode; the with-block closes it automatically
with open('seeds.json', 'a', encoding='utf-8') as f:
    # Run the loop until the current time reaches the end time
    while time.time() < end_time:
        for user in get_users():
            f.write(json.dumps(user) + '\n')
        time.sleep(2)  # Sleep for 2 seconds between each execution


<div class="alert alert-block alert-info"><b>Working with JSON data in Python</b>
    <br>
    In Python, we often need to work with JSON data, which is a common format for exchanging information. 
    
- To make a string (such as one read from a file) queryable as JSON, we use the <code>json.loads()</code> function.
  The <code>json.loads()</code> function takes a JSON-formatted string and converts it into a Python data structure, such as a dictionary or a list, so you can easily access its contents.
- If you want to save a Python data structure as a JSON file, you can use the <code>json.dumps()</code> function.
        The <code>json.dumps()</code> function takes a Python object, like a dictionary or a list, and converts it into a JSON-formatted string that you can save to a text file for later use.

</div>
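
The round trip between `json.dumps()` and `json.loads()` can be sketched as follows. The file name `seeds_demo.json` and the temporary folder are assumptions made for this illustration, so that the example does not touch your real `seeds.json`. Because a timed loop typically collects the same user several times, the last step removes duplicates.

```python
import json
import os
import tempfile

# Example seeds; note the duplicate entry
urls = ['https://music-to-scrape.org/user?username=CyberStar49',
        'https://music-to-scrape.org/user?username=Wizard79',
        'https://music-to-scrape.org/user?username=CyberStar49']

# Hypothetical demo file in a temporary folder
path = os.path.join(tempfile.mkdtemp(), 'seeds_demo.json')

# Writing: one JSON-encoded value per line
with open(path, 'w', encoding='utf-8') as f:
    for url in urls:
        f.write(json.dumps(url) + '\n')

# Reading: decode each non-empty line with json.loads()
with open(path, encoding='utf-8') as f:
    seeds = [json.loads(line) for line in f if line.strip()]

# Remove duplicates while keeping the original order
unique_seeds = list(dict.fromkeys(seeds))
print(unique_seeds)
```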

### 1.3 Preventing array misalignment

So far, we have only extracted *one* piece of information (the URL) from the list of recently active users. But what if we also want to extract the names of recently consumed songs? You can view a user's most recently consumed song by hovering over the user's profile picture on the landing page.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-hover.png" align="left" width=30%/>


Closely inspecting the source also shows you this information!


<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-song-tag.png" align="left" width=60%/>


A simple solution may be to just use the `.find_all()` command from BeautifulSoup, extracting all tags called `span`.

__Example__:


In [321]:
# Run this code now
import requests
from bs4 import BeautifulSoup

url = 'https://music-to-scrape.org/'

res = requests.get(url, headers = user_agent)
res.encoding = res.apparent_encoding

soup = BeautifulSoup(res.text)

relevant_section = soup.find('section',attrs={'name':'recent_users'})

# getting links
links = []
for link in relevant_section.find_all("a"):
    if 'href' in link.attrs: 
        extracted_link = link.attrs['href']
        links.append(f'https://music-to-scrape.org/{extracted_link}')

# getting songs
songs = []
for song in relevant_section.find_all("span"):
    songs.append(song.get_text())


# links for each user
print(links)

# recent songs for each user
print(songs)

['https://music-to-scrape.org/user?username=TechGeek32', 'https://music-to-scrape.org/user?username=CosmicSonic89', 'https://music-to-scrape.org/user?username=CyberStar49', 'https://music-to-scrape.org/user?username=Wizard79', 'https://music-to-scrape.org/user?username=WizardShadow42', 'https://music-to-scrape.org/user?username=Panda89']
['White Heart - Hold On', 'Vanessa Daou - Black & White', 'N.E.R.D. - Intro / Time For Some Action', 'Elizabeth Cotten - Freight Train']


While this approach seems easily implemented, it is __highly error-prone and needs to be avoided.__ 

So... what happened?

The lengths of these two objects - `links` and `songs` - differ! Didn't spot it? Then see for yourself!


In [322]:
print(len(links))
print(len(songs))

6
4


While a link is present for each user, we can only retrieve song information for a subset of users. Ultimately, we won't be able to tell WHICH song belongs to WHICH user. This is what we call a misalignment of the arrays that hold the data.

<div class="alert alert-block alert-info"><b>What's an array misalignment?</b>
    <br>
    
<ul>
<li>
When extracting information from the web, we sometimes are prone to "ripping apart" the website's original structure by putting data points into individual arrays (e.g., lists such as one list for user names and another for their recently consumed songs). </li>
<li>In so doing, we violate the data's original structure: we should store information on users, and <b>each user</b> has a user name/link and song.</li>
    <li>The <b>correct way of organizing the data</b> is to create a list of users (e.g., in a dictionary) and then store each attribute (e.g., the song, etc.) <b>within</b> these objects. <b>Only if we store data this way</b> can we be sure to store everything correctly. </li>
<br>
<li>When we do not adhere to this practice, we run the risk of "array misalignment". For example, if only ONE data point were missing for a user, then the (independent) user names array (say, with 6 items) wouldn't be "1:1 aligned" with the song array (say, with only 2-5 items).</li>
</ul>

</div>
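
The problem can be reproduced without any scraping. In the sketch below, the user links and song names are made up; the point is only to show how pairing two independent lists by position silently assigns a song to the wrong user, whereas one dictionary per user keeps everything together.

```python
# Two independent lists (made-up data): user B has no recent song
links = ['user?username=A', 'user?username=B', 'user?username=C']
songs = ['Song of A', 'Song of C']

# Pairing by position goes wrong: B receives C's song, and C is dropped
paired = list(zip(links, songs))
print(paired)

# One dictionary per user: a missing song is recorded explicitly as 'NA'
users = [{'url': 'user?username=A', 'song_name': 'Song of A'},
         {'url': 'user?username=B', 'song_name': 'NA'},
         {'url': 'user?username=C', 'song_name': 'Song of C'}]
print(users)
```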

__So, how to do it correctly?__

Similar to how we first "zoomed in" on the recently active user section earlier, we will *first* zoom in on each __user__, and then, *within each user*, extract the required information.

Subsequently, we will store the information in a list of dictionaries, where each element of the list corresponds to a user. This data structure also allows some song names to be missing. After all, whether or not a song is listed is now tied to one particular user. 

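__See the example below.__ Pay attention to how we capture the "unavailability" of a song name: the code checks whether a `span` element exists and stores `'NA'` otherwise. A `try` and `except` clause, explained in the box after the code, is an alternative way to achieve the same.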

In [323]:
# Import necessary libraries
import requests
from bs4 import BeautifulSoup

# Define the URL you want to scrape
url = 'https://music-to-scrape.org/'

# Send an HTTP GET request to the URL and store the response
res = requests.get(url, headers=user_agent)

# Set the encoding of the response to the apparent encoding
res.encoding = res.apparent_encoding

# Parse the HTML content of the response using BeautifulSoup
soup = BeautifulSoup(res.text)

# Find the HTML section with the attribute 'name' equal to 'recent_users'
relevant_section = soup.find('section', attrs={'name': 'recent_users'})

# Identify individual users within the relevant section
users = relevant_section.find_all(class_='mobile-user-margin')

# Initialize a list to store user data
user_data = []

# Loop through each user in the list of users
for user in users:
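    # Default to 'NA' so a missing link never reuses the previous user's value
    extracted_link = 'NA'
    link_tag = user.find('a')
    # Check if the user has an anchor tag with an 'href' attribute
    if link_tag is not None and 'href' in link_tag.attrs:
        # Extract the link from the 'href' attribute
        extracted_link = link_tag.attrs['href']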
    
    # Check if the user has a 'span' element
    if user.find('span') is not None:
        # Get the text content of the 'span' element, which represents song names
        song_name = user.find('span').get_text()
    else:
        # If there is no 'span' element, set the song_name to 'NA'
        song_name = 'NA'
    
    # Create a dictionary object with the extracted data
    obj = {'url': extracted_link, 'song_name': song_name}
    
    # Append the dictionary to the user_data list
    user_data.append(obj)

# user_data now contains a list of dictionaries, each representing user information with a URL and song name
user_data

[{'url': 'user?username=TechGeek32', 'song_name': 'Dictators - Weekend'},
 {'url': 'user?username=CosmicSonic89',
  'song_name': 'Kitty Kallen - If You Smile At The Sun'},
 {'url': 'user?username=CyberStar49', 'song_name': 'NA'},
 {'url': 'user?username=Wizard79', 'song_name': 'NA'},
 {'url': 'user?username=WizardShadow42',
  'song_name': 'N.E.R.D. - Intro / Time For Some Action'},
 {'url': 'user?username=Panda89',
  'song_name': 'Elizabeth Cotten - Freight Train'}]

<div class="alert alert-block alert-info"><b>Handling Errors with <code>try</code> and <code>except</code> in Python</b>
    <br>
    
- In Python, we have a useful way to deal with potential errors or exceptions in our code. We use a construct called a <code>try</code> and <code>except</code> clause.
  - The <code>try</code> block is where you place the code that might potentially cause an error. For example, if you're trying to find an element on a website, you can put this code inside the <code>try</code> block.
  - If the code inside the <code>try</code> block encounters an error, instead of crashing your program, Python will jump to the <code>except</code> block. This is incredibly useful for handling situations where, for instance, the element you're trying to find on a website isn't available.
  - Inside the <code>except</code> block, you can define what action to take when an error occurs. In our example, you could set the missing data point to "NA" so that you know it wasn't available.
- However, it's crucial to use the <code>try</code> and <code>except</code> construct sparingly. You don't want to skip the entire process for a user just because one data point isn't available. Instead, use it selectively to handle specific errors and ensure your program continues running smoothly.
</div>
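
As a minimal sketch of this construct, the example below handles a missing song name with `try` and `except`. The HTML snippet, the class name `user`, and the user names are made up for illustration. When `.find('span')` finds nothing, it returns `None`, and calling `.get_text()` on `None` raises an `AttributeError`, which is the specific error we catch.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the second user has no <span> with a song name
html = """
<div class="user"><a href="user?username=Alice01">Alice01</a><span>Artist A - Song A</span></div>
<div class="user"><a href="user?username=Bob02">Bob02</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

user_data = []
for user in soup.find_all(class_="user"):
    url = user.find("a").attrs["href"]
    try:
        # Raises AttributeError if there is no <span>, because .find() returns None
        song_name = user.find("span").get_text()
    except AttributeError:
        # Only the song name is set to 'NA'; the rest of the user's data is kept
        song_name = "NA"
    user_data.append({"url": url, "song_name": song_name})

print(user_data)
```

Catching only `AttributeError` (rather than every possible error) keeps unrelated problems visible.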

## 2. Navigating and Extracting Information from User Profile Pages

__Importance__

Alright - what have we learnt up to this point?

We've learnt how to extract seeds (here: users) from __one page -- the homepage of the platform.__

So... what's missing?

Exactly! [`music-to-scrape.org`](https://music-to-scrape.org) contains consumption data on many users. 

The objective of this section is to navigate through each user's __consumption history__ and save the name of all songs, artists, and corresponding timestamps (time/date). However, it's important to note that this information is __spread across multiple pages__ (one for every week in the data), and we need to visit them one by one.


__Let's try it out__

Open [the website](https://music-to-scrape.org/user?username=StarCoder49&week=36), and click on the "previous" button at the top of the page. Do you understand how you will be able to "loop" through the site?

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-user-page.png" align="left" width=90%/>
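
Because the week is part of the URL, the pages can be generated programmatically instead of clicking the "previous" button. The sketch below builds the URLs for three weeks; the range of weeks (36 down to 34) is an assumption chosen for illustration, and no request is sent to the website.

```python
from urllib.parse import urlencode

base_url = 'https://music-to-scrape.org/user'
username = 'StarCoder49'

# Assumption for illustration: we only look at weeks 36, 35, and 34
weeks = range(36, 33, -1)

urls = []
for week in weeks:
    # urlencode turns the dictionary into 'username=...&week=...'
    query = urlencode({'username': username, 'week': week})
    urls.append(f'{base_url}?{query}')

for url in urls:
    print(url)
```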

### 2.1. Capture Information from User Profile Pages

The goal of this section is to extract the information on a user's consumption history from the website.

Up to this moment, we have defined which seeds to use (usernames from the homepage), and identified from which pages we would like to extract information (e.g., for weeks 37 through 0). However, we haven't extracted any of the consumption data from the website yet (e.g., which song a particular user has listened to in a given week).

For this, we use our previous learnings (e.g., see the "Webdata for Dummies" tutorial in this course) to iterate through the table.

__Try it out__

Work through the code snippets below, which build up to a script that *prints* the information on what songs were listened to by a user to the screen.


It's useful to start doing this in a prototype first, before assembling everything in a "working script". So, let's start.

First, let us download the first page of a user, and store it in a variable called `soup`. 

In [324]:
url = 'https://music-to-scrape.org/user?username=StarCoder49&week=36'
header = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers = header)
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text)


We can now try a few commands to access information on the site. Of course, it is important to keep the browser's inspect tool open on the side. You will probably notice that the table is quite easy to capture: it has its own tag, called `table`.


<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-table.png" align="left" width=90%/>

In [325]:
table = soup.find('table')
table

<table class="table table-striped">
<thead>
<tr>
<th>Song Title</th>
<th>Artist</th>
<th>Date</th>
<th>Time</th>
</tr>
</thead>
<tbody>
<tr>
<td>Too Much Saturn</td>
<td>Francis Dunnery</td>
<td>2023-09-10</td>
<td>10:49:31</td>
</tr>
<tr>
<td>Sample Track 11</td>
<td>Simon Harris</td>
<td>2023-09-10</td>
<td>10:46:24</td>
</tr>
<tr>
<td>Stand (Album Version)</td>
<td>Kiss</td>
<td>2023-09-10</td>
<td>10:41:33</td>
</tr>
<tr>
<td>High Tide</td>
<td>Richard Souther</td>
<td>2023-09-10</td>
<td>10:37:44</td>
</tr>
<tr>
<td>Sense</td>
<td>Beherit</td>
<td>2023-09-10</td>
<td>10:27:19</td>
</tr>
<tr>
<td>Outro</td>
<td>Attack Attack</td>
<td>2023-09-10</td>
<td>10:25:55</td>
</tr>
<tr>
<td>I don't Want to Be Alone Tonight</td>
<td>Jauida</td>
<td>2023-09-10</td>
<td>10:21:25</td>
</tr>
<tr>
<td>E. Warren</td>
<td>DJ Omega</td>
<td>2023-09-10</td>
<td>10:17:04</td>
</tr>
<tr>
<td>Banana Man</td>
<td>C.J. Chenier</td>
<td>2023-09-10</td>
<td>10:13:55</td>
</tr>
<tr>
<td>Happiness Stan - Orig

See? This one worked quite well! Inspecting the table a bit more, you can get at the individual rows using the `tr` tag. Again, use your browser's inspect tool to spot it!

In [326]:
table.find('tr')

<tr>
<th>Song Title</th>
<th>Artist</th>
<th>Date</th>
<th>Time</th>
</tr>

This is just the first row. Using `.find_all()`, instead, will give you a list of all rows.

In [327]:
rows = table.find_all('tr')
rows

[<tr>
 <th>Song Title</th>
 <th>Artist</th>
 <th>Date</th>
 <th>Time</th>
 </tr>,
 <tr>
 <td>Too Much Saturn</td>
 <td>Francis Dunnery</td>
 <td>2023-09-10</td>
 <td>10:49:31</td>
 </tr>,
 <tr>
 <td>Sample Track 11</td>
 <td>Simon Harris</td>
 <td>2023-09-10</td>
 <td>10:46:24</td>
 </tr>,
 <tr>
 <td>Stand (Album Version)</td>
 <td>Kiss</td>
 <td>2023-09-10</td>
 <td>10:41:33</td>
 </tr>,
 <tr>
 <td>High Tide</td>
 <td>Richard Souther</td>
 <td>2023-09-10</td>
 <td>10:37:44</td>
 </tr>,
 <tr>
 <td>Sense</td>
 <td>Beherit</td>
 <td>2023-09-10</td>
 <td>10:27:19</td>
 </tr>,
 <tr>
 <td>Outro</td>
 <td>Attack Attack</td>
 <td>2023-09-10</td>
 <td>10:25:55</td>
 </tr>,
 <tr>
 <td>I don't Want to Be Alone Tonight</td>
 <td>Jauida</td>
 <td>2023-09-10</td>
 <td>10:21:25</td>
 </tr>,
 <tr>
 <td>E. Warren</td>
 <td>DJ Omega</td>
 <td>2023-09-10</td>
 <td>10:17:04</td>
 </tr>,
 <tr>
 <td>Banana Man</td>
 <td>C.J. Chenier</td>
 <td>2023-09-10</td>
 <td>10:13:55</td>
 </tr>,
 <tr>
 <td>Happiness 

We can also check whether the number of rows is equal to what we would expect from looking at the website. Using the `len` function for this yields...

In [328]:
len(rows)

51

Looks about right? Yes! So, let's now try to extract, for one row, the name of the song and artist, corresponding to the first and second column of the table.

Let's first select one row for prototyping. We take row 2 (which is the first row after the table header).

In [329]:
one_row = rows[1]

In [330]:
one_row

<tr>
<td>Too Much Saturn</td>
<td>Francis Dunnery</td>
<td>2023-09-10</td>
<td>10:49:31</td>
</tr>

In [331]:
one_row.find_all('td')[0].get_text() # for song name

'Too Much Saturn'

In [332]:
one_row.find_all('td')[1].get_text() # for artist name, corresponding to the second "column"

'Francis Dunnery'
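
Instead of addressing each cell by its position, one could also label the cells with the column names from the table header. The sketch below does this on a shortened, static copy of the table shown above (only one data row), so it runs without a request to the website. Pairing header and cells with `zip` is safe here, because both come from the *same* table and each row has one cell per column.

```python
from bs4 import BeautifulSoup

# A shortened, static copy of the table from the user page
html = """
<table>
<thead><tr><th>Song Title</th><th>Artist</th><th>Date</th><th>Time</th></tr></thead>
<tbody>
<tr><td>Too Much Saturn</td><td>Francis Dunnery</td><td>2023-09-10</td><td>10:49:31</td></tr>
</tbody>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")

# Column names from the header row
headers = [th.get_text() for th in table.find_all("th")]

records = []
for row in table.find_all("tr"):
    cells = [td.get_text() for td in row.find_all("td")]
    if cells:  # the header row has no <td> cells and is skipped
        records.append(dict(zip(headers, cells)))

print(records)
```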


We can now put everything together in one script.

In [333]:
url = 'https://music-to-scrape.org/user?username=StarCoder49&week=36'

header = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers = header)
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text)

table = soup.find('table')

rows = table.find_all('tr')

for row in rows:
    #print(row)
    data = row.find_all('td')
    
    if len(data)>0:
        song_name=data[0].get_text()
        artist_name=data[1].get_text()
        
        print(f'Song "{song_name}" by "{artist_name}"')

Song "Too Much Saturn" by "Francis Dunnery"
Song "Sample Track 11" by "Simon Harris"
Song "Stand (Album Version)" by "Kiss"
Song "High Tide" by "Richard Souther"
Song "Sense" by "Beherit"
Song "Outro" by "Attack Attack"
Song "I don't Want to Be Alone Tonight" by "Jauida"
Song "E. Warren" by "DJ Omega"
Song "Banana Man" by "C.J. Chenier"
Song "Happiness Stan - Original" by "Small Facers"
Song "Phorever People" by "Shamen"
Song "Signs Of Insanity" by "Headhunter"
Song "Doctrines" by "Arthur Fiedler"
Song "It Hurts Me Too" by "Jeremy Spencer"
Song "A B***** Is A B***** (Edited)" by "N.W.A."
Song "Real Love" by "The Smashing Pumpkins"
Song "Say What!?!" by "Chris Standring"
Song "Revelations" by "Marco Beltrami"
Song "Jos Defiance" by "Don Francisco"
Song "Quickstep" by "FSTZ"
Song "Shape Of My Heart" by "Backstreet Boys"
Song "Class Act" by "James Hunter"
Song "Secrets" by "Sunscreem"
Song "She's My Woman" by "Crusaders"
Song "Anything For My Baby" by "Kiss"
Song "Apogee (ft. TechTonic)" 

__Exercise 2.1__

1. Rather than printing the data to the screen, store it in a list of dictionaries, containing the following data points:
    - song
    - artist
    - date
    - username
    - and time of data extraction.
2. Wrap your code in a function that returns the list of dictionaries from 1).

__Solution__

In [334]:
# Q1:
import time

url = 'https://music-to-scrape.org/user?username=StarCoder49&week=36'

header = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers = header)
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text)

table = soup.find('table')

rows = table.find_all('tr')

json_data=[]

for row in rows:
    data = row.find_all('td')

    if len(data)>0:
        song_name=data[0].get_text()
        artist_name=data[1].get_text()
        date=data[2].get_text()
        timestamp=data[3].get_text()
        json_data.append({'song_name': song_name,
                          'artist_name': artist_name,
                          'date': date,
                          'time': timestamp,
                          'timestamp_of_extraction': int(time.time()),
                          'username': url.split('=')[1].split('&')[0]})  # keep only the user name, drop '&week'
json_data

[{'song_name': 'Too Much Saturn',
  'artist_name': 'Francis Dunnery',
  'date': '2023-09-10',
  'time': '10:49:31',
  'timestamp_of_extraction': 1694551620,
  'username': 'StarCoder49'},
 {'song_name': 'Sample Track 11',
  'artist_name': 'Simon Harris',
  'date': '2023-09-10',
  'time': '10:46:24',
  'timestamp_of_extraction': 1694551620,
  'username': 'StarCoder49'},
 {'song_name': 'Stand (Album Version)',
  'artist_name': 'Kiss',
  'date': '2023-09-10',
  'time': '10:41:33',
  'timestamp_of_extraction': 1694551620,
  'username': 'StarCoder49'},
 {'song_name': 'High Tide',
  'artist_name': 'Richard Souther',
  'date': '2023-09-10',
  'time': '10:37:44',
  'timestamp_of_extraction': 1694551620,
  'username': 'StarCoder49'},
 {'song_name': 'Sense',
  'artist_name': 'Beherit',
  'date': '2023-09-10',
  'time': '10:27:19',
  'timestamp_of_extraction': 1694551620,
  'username': 'StarCoder49'},
 {'song_name': 'Outro',
  'artist_name': 'Attack Attack',
  'date': '202

In [335]:
#Q2

def get_consumption_history(url):
    header = {'User-agent': 'Mozilla/5.0'}
    res = requests.get(url, headers = header)
    res.encoding = res.apparent_encoding
    soup = BeautifulSoup(res.text)
    
    table = soup.find('table')
    
    rows = table.find_all('tr')
    
    json_data=[]
    for row in rows:
        data = row.find_all('td')
    
        if len(data)>0:
            song_name=data[0].get_text()
            artist_name=data[1].get_text()
            date=data[2].get_text()
            timestamp=data[3].get_text()
            json_data.append({'song_name': song_name,
                              'artist_name': artist_name,
                              'date': date,
                              'time': timestamp,
                              'timestamp_of_extraction': int(time.time()),
                              'username': url.split('username=')[1].split('&')[0]}) # keep only the username, not '&week=...'
    return(json_data)

In [336]:
# try running the function
get_consumption_history('https://music-to-scrape.org/user?username=StarCoder49&week=36')


[{'song_name': 'Too Much Saturn',
  'artist_name': 'Francis Dunnery',
  'date': '2023-09-10',
  'time': '10:49:31',
  'timestamp_of_extraction': 1694551626,
  'username': 'StarCoder49'},
 {'song_name': 'Sample Track 11',
  'artist_name': 'Simon Harris',
  'date': '2023-09-10',
  'time': '10:46:24',
  'timestamp_of_extraction': 1694551626,
  'username': 'StarCoder49'},
 {'song_name': 'Stand (Album Version)',
  'artist_name': 'Kiss',
  'date': '2023-09-10',
  'time': '10:41:33',
  'timestamp_of_extraction': 1694551626,
  'username': 'StarCoder49'},
 {'song_name': 'High Tide',
  'artist_name': 'Richard Souther',
  'date': '2023-09-10',
  'time': '10:37:44',
  'timestamp_of_extraction': 1694551626,
  'username': 'StarCoder49'},
 {'song_name': 'Sense',
  'artist_name': 'Beherit',
  'date': '2023-09-10',
  'time': '10:27:19',
  'timestamp_of_extraction': 1694551626,
  'username': 'StarCoder49'},
 {'song_name': 'Outro',
  'artist_name': 'Attack Attack',
  'date': '202

In [337]:
# Check whether it also works for different weeks
get_consumption_history('https://music-to-scrape.org/user?username=StarCoder49&week=12')

[{'song_name': '7 Miles',
  'artist_name': 'Brixx',
  'date': '2023-03-26',
  'time': '03:06:17',
  'timestamp_of_extraction': 1694551629,
  'username': 'StarCoder49'},
 {'song_name': 'Good Texan',
  'artist_name': 'The Vaughan Brothers',
  'date': '2023-03-26',
  'time': '03:01:54',
  'timestamp_of_extraction': 1694551629,
  'username': 'StarCoder49'},
 {'song_name': 'Danny',
  'artist_name': 'Billie Jo Spears',
  'date': '2023-03-26',
  'time': '02:59:50',
  'timestamp_of_extraction': 1694551629,
  'username': 'StarCoder49'},
 {'song_name': "Sally Can't Dance",
  'artist_name': 'Lou Reed',
  'date': '2023-03-26',
  'time': '02:56:18',
  'timestamp_of_extraction': 1694551629,
  'username': 'StarCoder49'},
 {'song_name': 'Tien An Man Dream Again',
  'artist_name': 'fIREHOSE',
  'date': '2023-03-26',
  'time': '02:55:00',
  'timestamp_of_extraction': 1694551629,
  'username': 'StarCoder49'},
 {'song_name': 'Cycle Time',
  'artist_name': 'Liars',
  'date': '2023-
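
Splitting a URL by hand works for simple cases, but it gets fragile once a URL carries several parameters (such as `username` and `week`). A minimal sketch of a more general approach uses Python's built-in `urllib.parse` module. The helper name `get_query_value` is made up for this illustration and is not part of the tutorial's code.

```python
from urllib.parse import urlparse, parse_qs

def get_query_value(url, key):
    """Return the first value of a query parameter, or None if it is absent."""
    query = urlparse(url).query          # e.g., 'username=StarCoder49&week=36'
    values = parse_qs(query).get(key)    # e.g., ['StarCoder49']
    return values[0] if values else None

example_url = 'https://music-to-scrape.org/user?username=StarCoder49&week=36'
print(get_query_value(example_url, 'username'))  # StarCoder49
print(get_query_value(example_url, 'week'))      # 36
```

Note that query values are always returned as text, so the week comes back as the string `'36'`, not as a number.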

### 2.2. Loop through all weeks for each user


__Importance__

Alright - what have we achieved so far?

- In section 1, we've built a function to retrieve user names of currently active users. We call this the stage of our project in which we collect "seeds".
- In section 2.1, we've managed to extract a user's consumption history from a table displayed on the user's profile page.

What's missing, though, is __ALL of a user's consumption data__, i.e., from __ALL possible weeks__.

For this, we're making use of the "previous page" button.


<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mits-previous-button.png" align="left" width=30%/>

__Let's try it out__

Open the user's profile page at https://music-to-scrape.org/user?username=StarCoder49. __Click on the previous button__ a few times, and observe how the URL in your browser bar is changing. 

For example:

- `https://music-to-scrape.org/user?username=StarCoder49`
- `https://music-to-scrape.org/user?username=StarCoder49&week=37`
- `https://music-to-scrape.org/user?username=StarCoder49&week=36`
- `https://music-to-scrape.org/user?username=StarCoder49&week=35`
- ...

Can you guess the next one...?

A general solution is to look up whether there is a `previous` button on the page (see HTML code below). We can then either "grab" the URL and visit it, or - instead - "click" on the button.


<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/mts-previous-page.png" align="left" width=60% style="border: 1px solid black" />

So, let's write a snippet that "captures" the link of the previous page button on the user's profile page.

We always proceed in small steps.

In [338]:
# Step 1: Load the website's source code and convert to BeautifulSoup object
url = 'https://music-to-scrape.org/user?username=StarCoder49'

header = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers = header)
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text)

In [339]:
# Step 2: Trying to locate the previous button, using a combination of class names and attribute-value pairs.
soup.find(class_='page-link', attrs={'type':'previous_page'})

<a class="page-link" href="user?username=StarCoder49&amp;week=36" type="previous_page">Previous
                                        Week</a>

In [340]:
# Step 3: Trying to extract the `href` attribute
soup.find(class_='page-link', attrs={'type':'previous_page'}).attrs['href']

'user?username=StarCoder49&week=36'

In [341]:
# Step 4: Storing "previous page" link
previous_page_link = soup.find(class_='page-link', attrs={'type':'previous_page'}).attrs['href']
previous_page_link # print it

'user?username=StarCoder49&week=36'

At each iteration, we can observe how we're getting closer to the information we need.

Now, we only need to combine the base URL (`https://music-to-scrape.org/`) with the relative link we just extracted.

In [224]:
previous_page_link = soup.find(class_='page-link', attrs={'type':'previous_page'}).attrs['href']
f'https://music-to-scrape.org/{previous_page_link}'

'https://music-to-scrape.org/user?username=StarCoder49&week=36'
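
Gluing the base URL and the relative link together with an f-string works here because all links on this site are relative to the root of the website. As a hedged alternative, `urljoin` from Python's built-in `urllib.parse` module resolves a relative link against the URL of the page it was found on. The variable names below are chosen for this sketch only.

```python
from urllib.parse import urljoin

# URL of the page we are currently on
current_url = 'https://music-to-scrape.org/user?username=StarCoder49'

# value of the href attribute of the "previous page" button
relative_link = 'user?username=StarCoder49&week=36'

# resolve the relative link against the current page
next_url = urljoin(current_url, relative_link)
print(next_url)
```

The benefit is that the same line also handles links that are already absolute or that start with a slash.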

__Exercise 2.2__

Please first load the snippet below, which has wrapped the "previous page" capturing in a function. Observe the use of `try` and `except`, which accounts for the last page NOT having a previous page button.

In [342]:
def previous_page(soup):
    try:
        previous_page_link = soup.find(class_='page-link', attrs={'type':'previous_page'}).attrs['href']
        return(f'https://music-to-scrape.org/{previous_page_link}')
    except (AttributeError, KeyError): # no previous page button found, or the button has no link
        return('no previous page')

Let's try out this function on the source code of the website.

In [343]:
soup = BeautifulSoup(requests.get('https://music-to-scrape.org/user?username=StarCoder49').text)
previous_page(soup)

'https://music-to-scrape.org/user?username=StarCoder49&week=36'

See, it worked! Now, proceed with the exercises.


1. Make a web request to 'https://music-to-scrape.org/user?username=StarCoder49&week=36', pass the (souped) object to the `previous_page()` function, and observe the output. Then, use 'https://music-to-scrape.org/user?username=StarCoder49&week=0'. Is that what you expected? 

2. Write a while loop that continuously visits all pages for the user `StarCoder49`, by extracting previous page URLs from each page and continuing the data collection until there is no previous page to fetch. Start with week 10 to minimize server load.

In [23]:
# write your code here

__Solution__

In [344]:
# Question 1
soup = BeautifulSoup(requests.get('https://music-to-scrape.org/user?username=StarCoder49&week=36').text)
previous_page(soup)


'https://music-to-scrape.org/user?username=StarCoder49&week=35'

In [345]:
soup = BeautifulSoup(requests.get('https://music-to-scrape.org/user?username=StarCoder49&week=0').text)
previous_page(soup)
# returns "no previous page"

'no previous page'

In [346]:
# Question 2
urls = []

# define first URL to start from
url = 'https://music-to-scrape.org/user?username=StarCoder49&week=10'

while True:
    print(f'Opening {url} and checking for previous page...')
    soup = BeautifulSoup(requests.get(url).text)
    previous_url = previous_page(soup)
    if 'no previous page' in previous_url: break
    url = previous_url
    

Opening https://music-to-scrape.org/user?username=StarCoder49&week=10 and checking for previous page...
Opening https://music-to-scrape.org/user?username=StarCoder49&week=9 and checking for previous page...
Opening https://music-to-scrape.org/user?username=StarCoder49&week=8 and checking for previous page...
Opening https://music-to-scrape.org/user?username=StarCoder49&week=7 and checking for previous page...
Opening https://music-to-scrape.org/user?username=StarCoder49&week=6 and checking for previous page...
Opening https://music-to-scrape.org/user?username=StarCoder49&week=5 and checking for previous page...
Opening https://music-to-scrape.org/user?username=StarCoder49&week=4 and checking for previous page...
Opening https://music-to-scrape.org/user?username=StarCoder49&week=3 and checking for previous page...
Opening https://music-to-scrape.org/user?username=StarCoder49&week=2 and checking for previous page...
Opening https://music-to-scrape.org/user?username=StarCoder49&week=1 and checking for previous page...
Opening h
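
A `while True` loop keeps running until it hits a `break`, so a small mistake can send many requests to a server. The sketch below shows one possible safeguard: a cap on the number of pages. It runs offline. The names `collect_page_urls` and `fake_previous` are invented for this illustration, and `fake_previous` merely imitates the website (week N links to week N-1, week 0 has no previous page).

```python
def collect_page_urls(start_url, get_previous, max_pages=100):
    """Follow 'previous page' links from start_url.

    Stops when get_previous() returns None or after max_pages pages.
    """
    urls = []
    url = start_url
    while url is not None and len(urls) < max_pages:
        urls.append(url)
        url = get_previous(url)  # URL of the previous page, or None
    return urls

# Offline stand-in for the website (an assumption for this demo)
def fake_previous(url):
    base, week = url.rsplit('=', 1)
    return f'{base}={int(week) - 1}' if int(week) > 0 else None

pages = collect_page_urls('https://music-to-scrape.org/user?username=StarCoder49&week=3',
                          fake_previous)
print(pages)
```

In the notebook, `get_previous` could be a small function that requests the page, calls `previous_page(soup)`, and returns `None` when the result is `'no previous page'`.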

------------
So... seems like we're almost there!

The only thing that's missing is to actually also extract the song consumption data from each of the user profile pages.

We turn towards this issue next.

## 3. Improving Extraction Design

### 3.1 Timers

__Importance__

You may have noticed the `time.sleep` function in some of the cells above. Sending many requests in quick succession can overload a server. It is therefore highly recommended to pause between requests. This also lowers the risk that your IP address (i.e., the numerical label assigned to each device connected to the internet) gets blocked, which would prevent you from visiting (and scraping) the website. 

__Let's try it out__

In Python, you can import the `time` module and call its `time.sleep()` function, which pauses the execution of subsequent commands for a given number of seconds. For example, the print statement after `time.sleep(2)` will only be executed after 2 seconds:

In [347]:
# run this cell again to see the timer in action yourself!
import time
pause = 2
time.sleep(pause)
print(f"I'll be printed to the console after {pause} seconds!")

I'll be printed to the console after 2 seconds!


__Exercise 3.1__

Modify the code above to sleep for 2 minutes. Go grab a coffee in-between. Did it take you longer than 2 minutes?

(if you want to abort the running code, just select the cell and push the "stop" button!)

In [150]:
# your answer goes here!

**Solution**  

In [None]:
time.sleep(2*60)
print("Done!")
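
Instead of always waiting for exactly the same time, some scrapers wait for a slightly random duration, so that requests are spread out less mechanically. The sketch below is one way to do this with the built-in `random` module. The function name `polite_sleep` and its default values are assumptions for this example, not a recommendation from the course.

```python
import random
import time

def polite_sleep(min_seconds=1.0, max_seconds=3.0):
    """Pause for a random duration between min_seconds and max_seconds.

    Returns the number of seconds waited.
    """
    pause = random.uniform(min_seconds, max_seconds)
    time.sleep(pause)
    return pause

# short values here, just for demonstration
waited = polite_sleep(0.1, 0.2)
print(f'Waited {waited:.2f} seconds.')
```

In a loop over many pages, `polite_sleep()` would take the place of `time.sleep(2)`.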

### 3.2 Modularization

**Importance**  

In scraping, many things have to be executed *multiple times*. For example, whenever we open a new user page on music-to-scrape.org, we would like to extract the user's consumption data.

To help us execute things repeatedly, we will "modularize" our code into functions. We can then call these functions whenever we need them. Another benefit of using functions is that we can improve the readability and reusability of our code. If you need a quick refresher on functions, please revisit section 4 of the [Python Bootcamp](https://odcm.hannesdatta.com/docs/modules/week1/pythonbootcamp/).

**Let's try it out**

Let's finish our scraper by compiling everything we have learned thus far.

Re-execute the function `get_users` from exercise 1.2 (3) above. Remember how it worked? Then proceed with your exercises.


In [349]:
get_users()

['https://music-to-scrape.org/user?username=WizardShadow42',
 'https://music-to-scrape.org/user?username=SonicNinja25',
 'https://music-to-scrape.org/user?username=TechGeek32',
 'https://music-to-scrape.org/user?username=CosmicSonic89',
 'https://music-to-scrape.org/user?username=CyberStar49',
 'https://music-to-scrape.org/user?username=Wizard79']

__Exercise 3.2__

Execute the function `get_users()` for a few minutes to collect a list of usernames. Store the user names in a JSON file (new-line separated), along with the timestamp of data retrieval `int(time.time())`.


In [None]:
# your answer here

__Solution__

In [350]:
import time
import json

duration = 15 # for testing, just 15 seconds

# Calculate the end time
end_time = time.time() + duration

f = open('seeds.json','w') # start a new file with seeds, so, use `w` (write new file) instead of `a` (append to existing file)

# Run the loop until the current time reaches the end time
while time.time() < end_time:
    print(f'Scraping user names...')
    for user in get_users():
        new_user = {'url': user,
                    'timestamp': int(time.time())}
        f.write(json.dumps(new_user)+'\n')
    time.sleep(2)  # Sleep for a few seconds between each execution
f.close()
print('Done.')

Scraping user names...
Scraping user names...
Scraping user names...
Done.


In [351]:
# verify whether you can open the data

import json
f = open('seeds.json','r',encoding = 'utf-8')
data = f.readlines()
for item in data:
    print(json.loads(item))
f.close()

{'url': 'https://music-to-scrape.org/user?username=WizardShadow42', 'timestamp': 1694551693}
{'url': 'https://music-to-scrape.org/user?username=SonicNinja25', 'timestamp': 1694551693}
{'url': 'https://music-to-scrape.org/user?username=TechGeek32', 'timestamp': 1694551693}
{'url': 'https://music-to-scrape.org/user?username=CosmicSonic89', 'timestamp': 1694551693}
{'url': 'https://music-to-scrape.org/user?username=CyberStar49', 'timestamp': 1694551693}
{'url': 'https://music-to-scrape.org/user?username=Wizard79', 'timestamp': 1694551693}
{'url': 'https://music-to-scrape.org/user?username=WizardShadow42', 'timestamp': 1694551699}
{'url': 'https://music-to-scrape.org/user?username=SonicNinja25', 'timestamp': 1694551699}
{'url': 'https://music-to-scrape.org/user?username=TechGeek32', 'timestamp': 1694551699}
{'url': 'https://music-to-scrape.org/user?username=CosmicSonic89', 'timestamp': 1694551699}
{'url': 'https://music-to-scrape.org/user?username=CyberStar49', 'timestamp': 1694551699}
{'u
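
As the output shows, the same user can appear several times in `seeds.json`, because `get_users()` was called repeatedly. Before visiting the profile pages, it can help to keep each URL only once. The following is a minimal sketch; the function name `unique_urls` is made up, and the example lines merely mimic the format of `seeds.json`.

```python
import json

def unique_urls(lines):
    """Return the URLs from JSON lines, in order of first appearance, without duplicates."""
    seen = set()
    urls = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        url = json.loads(line)['url']
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

# Example lines in the same format as seeds.json
example_lines = [
    '{"url": "https://music-to-scrape.org/user?username=Wizard79", "timestamp": 1694551693}\n',
    '{"url": "https://music-to-scrape.org/user?username=TechGeek32", "timestamp": 1694551693}\n',
    '{"url": "https://music-to-scrape.org/user?username=Wizard79", "timestamp": 1694551699}\n',
]
print(unique_urls(example_lines))
```

With the real file, the input would be `open('seeds.json', 'r', encoding='utf-8').readlines()`.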

__Exercise 3.3__

Now, let's write some code that loads `seeds.json` and visits each user's __first profile page__ to extract consumption data. Remember to build in a little timer (e.g., waiting for 2 seconds or so). The prototype/starting code below stops automatically after 5 iterations to minimize server load. Try removing the prototyping condition using the comment character `#` when you think you're done!


In [352]:
# start from the code below

import time # we need the time package for implementing a bit of waiting time
import json

content = open('seeds.json', 'r').readlines() # let's read in the seed data

counter = 0 # initialize counter to 0

# loop through all lines of the JSON file
for line in content:
    # increment counter and check whether prototyping condition is met
    counter = counter + 1
    if counter>5: break # deactivate this if you want to loop through the entire file
        
    # convert loaded data to JSON object/dictionary for querying
    obj = json.loads(line)
    
    # show URL for which consumption data needs to be captured
    print(obj['url'])
    
    # sleep for two seconds before moving on to the next URL
    time.sleep(2)

print('Done!')

https://music-to-scrape.org/user?username=WizardShadow42
https://music-to-scrape.org/user?username=SonicNinja25
https://music-to-scrape.org/user?username=TechGeek32
https://music-to-scrape.org/user?username=CosmicSonic89
https://music-to-scrape.org/user?username=CyberStar49
Done!


<div class="alert alert-block alert-info"><b>Tips</b>
    <br>
    <ul>
        <li>
            Use the function <code>get_consumption_history(url)</code> from section 2.1 above!
        </li>
    </ul>
</div>


__Solution__

In [353]:
# start from the code below
import time # we need the time package for implementing a bit of waiting time
import json

content = open('seeds.json', 'r').readlines() # let's read in the seed data

counter = 0 # initialize counter to 0

# loop through all lines of the JSON file
for line in content:
    # increment counter and check whether prototyping condition is met
    counter = counter + 1
    if counter>5: break # deactivate this if you want to loop through the entire file
        
    # convert loaded data to JSON object/dictionary for querying
    obj = json.loads(line)
    
    # get URL for which consumption data needs to be captured
    url = obj['url']

    print(f'Extracting information for {url}...')
    
    output_file = open('output_data.json','a')

    songs = get_consumption_history(url)

    for song in songs:
        output_file.write(json.dumps(song))
        output_file.write('\n')

    output_file.close()
    
    time.sleep(2)

print('Done!')

Extracting information for https://music-to-scrape.org/user?username=WizardShadow42...
Extracting information for https://music-to-scrape.org/user?username=SonicNinja25...
Extracting information for https://music-to-scrape.org/user?username=TechGeek32...
Extracting information for https://music-to-scrape.org/user?username=CosmicSonic89...
Extracting information for https://music-to-scrape.org/user?username=CyberStar49...
Done!


<div class="alert alert-block alert-info"><b>Tip: Understanding the Difference Between <code>'a'</code> and <code>'w'</code> When Writing Files in Python</b>
    <br>
    
- When working with files in Python, it's essential to know the difference between <code>'a'</code> and <code>'w'</code>  when opening them.
- <code>'a'</code> stands for "append" mode. When you open a file with <code>'a'</code> , Python will let you add data to the end of the existing file without erasing its contents. This is useful when you want to add new information to a file without losing what's already there. It's like adding new lines to the end of an ongoing document.
- <code>'w'</code>  stands for "write" mode. When you open a file with <code>'w'</code> , Python will create a new file or overwrite an existing one. This means that if the file already has data in it, using <code>'w'</code>  will erase all the existing content and start fresh. It's like creating a new document or wiping out the old one.
- Remember, when scraping data or working with files, it's generally safer to use <code>'a'</code>. This way, you won't accidentally delete valuable data. Using <code>'w'</code>  should be done with caution, and only when you intentionally want to start with a clean slate or create a new file altogether.
</div>
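
The difference between the two modes is easy to verify on a throwaway file. The sketch below writes to a temporary folder (created with the built-in `tempfile` module), so that none of your own files are touched. The file name `demo.txt` is arbitrary.

```python
import os
import tempfile

# Work in a temporary folder so that no existing file is touched
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'demo.txt')

with open(path, 'w', encoding='utf-8') as f:  # 'w' creates (or overwrites) the file
    f.write('first line\n')

with open(path, 'a', encoding='utf-8') as f:  # 'a' adds to the end of the file
    f.write('second line\n')

with open(path, 'r', encoding='utf-8') as f:
    after_append = f.read()

with open(path, 'w', encoding='utf-8') as f:  # 'w' again: the earlier content is gone
    f.write('fresh start\n')

with open(path, 'r', encoding='utf-8') as f:
    after_write = f.read()

print(repr(after_append))
print(repr(after_write))
```

The `with` statement used here closes the file automatically, even if an error occurs halfway, which is why it is often preferred over calling `close()` by hand.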

Finally, we can re-open the extracted data in Python to see whether what we retrieved seems complete.

Make sure you have the `pandas` package installed by running the next cell.

In [None]:
!pip install pandas

Now, we can load the data.

In [354]:
# inspect data in pandas
import pandas as pd
pd.read_json('output_data.json', lines=True)

| | song_name | artist_name | date | time | timestamp_of_extraction | username |
|---|---|---|---|---|---|---|
| 0 | Ninja Tattoo | Shanadoo | 2023-09-12 | 20:04:11 | 2023-09-12 20:05:19 | Geek15 |
| 1 | The Sweetest Sounds | Helen O'Connell | 2023-09-12 | 20:01:10 | 2023-09-12 20:05:19 | Geek15 |
| 2 | Throwdown At The Hoedown (LP Version) | Bela Fleck And The Flecktones | 2023-09-12 | 19:56:00 | 2023-09-12 20:05:19 | Geek15 |
| 3 | Dame Tu Carino | Ray Barretto | 2023-09-12 | 19:52:53 | 2023-09-12 20:05:19 | Geek15 |
| 4 | Finding My Way | Stanley Clarke & George Duke | 2023-09-12 | 19:47:14 | 2023-09-12 20:05:19 | Geek15 |
| ... | ... | ... | ... | ... | ... | ... |
| 4673 | Morpha Too [Alternate Mix] | Big Star | 2023-09-12 | 20:19:00 | 2023-09-12 20:48:47 | CyberStar49 |
| 4674 | Turn This Thing Around | El Presidente | 2023-09-12 | 20:15:26 | 2023-09-12 20:48:47 | CyberStar49 |
| 4675 | Ass Attack (Four Tet Remix) | Hot Chip | 2023-09-12 | 20:12:25 | 2023-09-12 20:48:47 | CyberStar49 |
| 4676 | You Can't Deep Freeze a Red Hot Mama | Sophie Tucker | 2023-09-12 | 20:09:51 | 2023-09-12 20:48:47 | CyberStar49 |
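
If you prefer a CSV file for your analysis, the new-line separated JSON data can be converted with Python's built-in `csv` module. The sketch below assumes that all records share the same keys. The function name `json_lines_to_csv` and the example records are illustrative only, and an in-memory buffer stands in for a real output file.

```python
import csv
import io
import json

def json_lines_to_csv(lines, out_file):
    """Write newline-delimited JSON records to out_file as CSV; return the number of rows."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return 0
    # use the keys of the first record as column names
    writer = csv.DictWriter(out_file, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return len(records)

example_lines = [
    '{"song_name": "High Tide", "artist_name": "Richard Souther", "username": "StarCoder49"}\n',
    '{"song_name": "Sense", "artist_name": "Beherit", "username": "StarCoder49"}\n',
]

# the buffer stands in for open('output_data.csv', 'w', newline='', encoding='utf-8')
buffer = io.StringIO()
n_rows = json_lines_to_csv(example_lines, buffer)
print(buffer.getvalue())
```

If you have already loaded the data with `pandas`, calling `.to_csv('output_data.csv', index=False)` on the data frame is a shorter route to the same result.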


### 3.3 Summary

At the beginning of this tutorial, we set out the promise of writing multi-page scrapers from start to finish. Although the examples we have studied are relatively simple, the same principles (seed definition, data extraction plan, page-level data collection) apply to any other website you'd like to scrape. 

<div class="alert alert-block alert-info"><b>Limitations of BeautifulSoup and the Advantages of Selenium</b>
    <br>
    
- While BeautifulSoup is a powerful tool for parsing and navigating HTML documents, it has some limitations when it comes to interacting with websites:
  - BeautifulSoup is a static parser, meaning it can't interact with dynamic web content that loads or changes after the initial page load. This makes it less suitable for websites that heavily rely on, say, JavaScript to update their content. For example, this is relevant for Twitter or Instagram.
  - BeautifulSoup can't handle user interactions such as clicking buttons, filling out forms, or navigating through complex web applications.
- When you need to scrape data from very modern and interactive websites, consider using a tool like Selenium. Selenium is a web automation framework that allows you to control a web browser programmatically.
  - With Selenium, you can automate interactions with websites, simulate user actions, and retrieve data from pages that rely heavily on JavaScript.
  - It's an excellent choice for scraping data from dynamic websites, conducting web testing, and performing tasks that require a more interactive approach.
- Keep in mind that while BeautifulSoup is great for many scraping tasks, knowing when to use Selenium can open up new possibilities and make your web scraping efforts more effective.

</div>


## After-class exercises


### Exercise 1

Can you extend the code written in 3.2 to extract data from ALL of a user's profile pages?

### Exercise 2

Please port your data collection into two Python scripts. The first one, called `collect_seeds.py`, collects seeds for 5 minutes. You can use a task scheduler to launch this script every 15 minutes and keep it running for a few hours.

Building on exercise 1 above, write a second script, called `collect_user_data.py`, which you run once (after you've finalized collecting seeds). This script collects all of the required data for all users.

__Solution__

Let us first modify the `get_consumption_history()` function, ensuring it shows us whether there is a `previous page`.

In [299]:
# Question 1

def get_consumption_history(url):
    header = {'User-agent': 'Mozilla/5.0'}
    res = requests.get(url, headers = header)
    res.encoding = res.apparent_encoding
    soup = BeautifulSoup(res.text)
    
    table = soup.find('table')
    
    rows = table.find_all('tr')
    
    json_data=[]
    for row in rows:
        data = row.find_all('td')
    
        if len(data)>0:
            song_name=data[0].get_text()
            artist_name=data[1].get_text()
            date=data[2].get_text()
            timestamp=data[3].get_text()
            json_data.append({'song_name': song_name,
                              'artist_name': artist_name,
                              'date': date,
                              'time': timestamp,
                              'timestamp_of_extraction': int(time.time()),
                              'username': url.split('username=')[1].split('&')[0]}) # keep only the username, not '&week=...'

    url_of_previous_page = previous_page(soup)
        
    return({'songs': json_data, 'previous_page': url_of_previous_page})
    

In [None]:
# start from the code below
import time # we need the time package for implementing a bit of waiting time
import json

content = open('seeds.json', 'r').readlines() # let's read in the seed data

counter = 0 # initialize counter to 0

# loop through all lines of the JSON file
for line in content:
    # increment counter and check whether prototyping condition is met
    counter = counter + 1
    if counter>5: break # deactivate this if you want to loop through the entire file
        
    # convert loaded data to JSON object/dictionary for querying
    obj = json.loads(line)
    
    # get URL for which consumption data needs to be captured
    url = obj['url']

    while 'no previous page' not in url:
        print(f'Extracting information for {url}...')
    
        output_file = open('output_data.json','a')
    
        songs = get_consumption_history(url)
        
        for song in songs['songs']:
            output_file.write(json.dumps(song))
            output_file.write('\n')
        output_file.close()
        
        url = songs['previous_page']
        time.sleep(2)
    
print('Done!')

Extracting information for https://music-to-scrape.org/user?username=Geek15...
Extracting information for https://music-to-scrape.org/user?username=Geek15&week=36...
Extracting information for https://music-to-scrape.org/user?username=Geek15&week=35...
Extracting information for https://music-to-scrape.org/user?username=Geek15&week=34...
Extracting information for https://music-to-scrape.org/user?username=Geek15&week=33...
Extracting information for https://music-to-scrape.org/user?username=Geek15&week=32...
Extracting information for https://music-to-scrape.org/user?username=Geek15&week=31...
Extracting information for https://music-to-scrape.org/user?username=Geek15&week=30...
Extracting information for https://music-to-scrape.org/user?username=Geek15&week=29...
Extracting information for https://music-to-scrape.org/user?username=Geek15&week=28...
Extracting information for https://music-to-scrape.org/user?username=Geek15&week=27...
Extracting information for https://music-to-scrape.

## 4. A primer on scraping more advanced, dynamic websites using selenium

In previous tutorials, you have used the `requests` library to retrieve web data. Yet, this often fails on modern, dynamic websites (such as Twitch, Twitter, or Instagram), which load much of their content with JavaScript after the initial page request.

A solution is to use the `selenium` library. We'll use it to *navigate on the site*. Once we *are* on the site, we can continue using `BeautifulSoup` to extract elements from the source code.

We'll continue with [Twitch](https://twitch.tv).


### 4.1 Making a connection to a website using Selenium

<div class="alert alert-block alert-warning"><b>Installing Selenium and Chromedriver</b> 

To install Selenium and Chromedriver locally, please follow the <a href="https://tilburgsciencehub.com/configure/python-for-scraping/?utm_campaign=referral-short">Tutorial on Tilburg Science Hub</a>.
    
You can also use the code snippet below to automate the installation. Running this snippet takes a little longer each time, but the benefit is that it almost always works!
</div>


In [266]:
!pip install webdriver_manager
!pip install selenium

In [273]:
# Using selenium 4 - ensure you have Chrome installed!
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

url = "https://twitch.tv/"
driver.get(url)

If everything went smoothly, your computer opened a new Chrome window and loaded `twitch.tv`. 

<div class="alert alert-block alert-info"><b>Using Google Colab</b> 

If you're using Google Colab, you won't see a browser window open.
    
Whenever you switch pages, just manually open that page in your own browser. Although this feels a little less interactive, you will still be able to work through this tutorial!

</div>

From now on, you can use `driver.get('https://google.com')` to point the browser to different websites (i.e., you don't need to install the driver over and over again, unless you open a new instance of Jupyter Notebook).

### 4.2 Using BeautifulSoup with Selenium


We can now also try to extract information. Note that we're converting the source code of the site to a `BeautifulSoup` object (because you may have learnt how to use `BeautifulSoup` earlier).

In [274]:
# we also need the time package to wait a few seconds until the page is loaded
import time
url = "https://twitch.tv/"
driver.get(url)
time.sleep(3)
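
A fixed pause of 3 seconds is simple, but it is a guess: sometimes the page needs longer, and sometimes you wait for nothing. A common alternative is to check repeatedly whether the content has arrived, up to a maximum waiting time. The sketch below shows the idea in plain Python, without a browser. The function name `wait_until` and the simulated `page_is_ready` check are assumptions for this illustration.

```python
import time

def wait_until(condition, timeout=10.0, poll_every=0.5):
    """Call condition() repeatedly until it returns a truthy value or timeout seconds pass.

    Returns the truthy value, or None if the timeout was reached.
    """
    deadline = time.time() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.time() >= deadline:
            return None
        time.sleep(poll_every)

# Offline demonstration: the simulated "page" is ready from the third check onwards
checks = {'count': 0}

def page_is_ready():
    checks['count'] += 1
    return checks['count'] >= 3

ready = wait_until(page_is_ready, timeout=2.0, poll_every=0.1)
print(ready, checks['count'])
```

With a real browser, the condition could be a function that parses `driver.page_source` and returns the elements you are looking for. Selenium also ships its own waiting mechanism (`WebDriverWait`), which is beyond the scope of this tutorial.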

Rather than using the "source code" obtained with the `requests` library, we can now convert the source code of the Selenium website to a BeautifulSoup object.

In [275]:
soup=BeautifulSoup(driver.page_source)

...and start experimenting with querying the site, such as retrieving the titles of the currently active streams.

In [276]:
streams = soup.find_all('a', attrs = {'data-test-selector':"TitleAndChannel"})

# print a list of stream names
counter = 0
for stream in streams:
    counter = counter + 1
    print('Stream ' + str(counter) + ': ' + stream.get_text())


Stream 1: 🐧EXCLUSIVE DROP🐧OP PENGUIN DOOR 🐧WEEK LONG !CHARITY STREAM🐧BlizzardID
Stream 2: !tg g0bbba unique ak-47 only here))  !fonbet !youtube !boosty !tgg0bbba
Stream 3: 🟣DROPS ENABLED🟣SebbyK GARAGE DOOR 🟣!Twitter !Video !ServerSebbyK
Stream 4: [STARFIELD CODE RAFFLE] BUT IS IT IMMERSIVE?! <3 !MODS <3 100%/VERY HARD LONG STREAM !socialspaigejxo
Stream 5: [🔥Giveaway !SCORPIUS🔥] !mods │ Day 9 Vanguard Quest Line! │ !TTS !Subtemberlokenplays
Stream 6: Space Ronin! Moving Up From the Streets of Neon! Socials: @earlmeisterEarlmeister


Wow - this is cool. You've just learnt a second way to open websites: using `selenium`. The benefit of `selenium` is that you can work with highly dynamic websites (which also helps you avoid getting blocked). The drawback is that `selenium` is slower than just using the `requests` library, and it may sometimes be buggy on computers without a screen (which matters when you scale up your data collection).

<div class="alert alert-block alert-info"><b>Awesome stuff with Selenium</b> 

Selenium is your best shot at navigating a dynamic website. It can do amazing things, such as 
    
<ul>
    <li>"clicking" on buttons</li>
    <li>scrolling through a site</li>
    <li>hovering over items and capturing information from popups,</li>
    <li>starting to play a stream,</li>
    <li>typing text and submitting it in the chat, and</li>
    <li>so much more...!</li>
</ul>
    
Note though that we won't cover the advanced functionality of Selenium in this tutorial, but the optional "Web data advanced" tutorial holds the necessary information.
   
</div>



__Exercise 4.1__

Please write code snippets to extract the following pieces of information. Do you choose `requests` or `selenium`?

1. The titles of all `<h2>` tags from `https://odcm.hannesdatta.com/docs/course/`
2. The titles of all available TV series from `https://www.bol.com/nl/nl/l/series/3133/30291/` (about 24)

Tip for question 2: the links that carry the product titles can be located as follows.

```
soup.find_all('a', class_='product-title')
```


We also need the time package to wait a few seconds until the page is loaded.

```
import time
url = "https://twitch.tv/" # some example URL
driver.get(url)
time.sleep(3)
```

In [277]:
# write your solution here

In [278]:
# Solution to question 1:
header = {'User-agent': 'Mozilla/5.0'} # the user agent tells the server which browser is (seemingly) making the request
request = requests.get('https://odcm.hannesdatta.com/docs/course/', headers = header)
request.encoding = request.apparent_encoding # use the encoding detected from the page content
soup = BeautifulSoup(request.text)
for title in soup.find_all('h2'): print(title.get_text())

Instructor
Course description
Prerequisites
Teaching format
Assessment
Code of Conduct
Structure of the course
More links


In [279]:
# Solution to question 2:
driver.get('https://www.bol.com/nl/nl/l/series/3133/30291/')
time.sleep(3)
soup = BeautifulSoup(driver.page_source)

In [280]:
urls = []
for url in soup.find_all('a', class_='product-title'):
    urls.append(url.attrs['href'])
urls

['/nl/nl/p/midsomer-murders-seizoen-1/9200000126652303/?bltgh=nOMDDGdt0suKSrRC0h5Fgg.3_15.16.ProductTitle',
 '/nl/nl/p/midsomer-murders-seizoen-7-deel-2/9200000130879705/?bltgh=nOMDDGdt0suKSrRC0h5Fgg.3_15.17.ProductTitle',
 '/nl/nl/p/flikken-maastricht-seizoen-17/9300000152726224/?bltgh=nOMDDGdt0suKSrRC0h5Fgg.3_15.18.ProductTitle',
 '/nl/nl/p/jack-ryan-seizoen-3/9300000160137131/?bltgh=nOMDDGdt0suKSrRC0h5Fgg.3_15.19.ProductTitle',
 '/nl/nl/p/chicago-fire-seizoen-11-dvd-import-zonder-nl-ot/9300000160463468/?bltgh=nOMDDGdt0suKSrRC0h5Fgg.3_15.20.ProductTitle',
 '/nl/nl/p/the-sandhamn-murders-seizoen-6/9300000155912985/?bltgh=nOMDDGdt0suKSrRC0h5Fgg.3_15.21.ProductTitle',
 '/nl/nl/p/dagboek-van-een-herdershond-seizoen-1/1002004006832455/?bltgh=nOMDDGdt0suKSrRC0h5Fgg.3_15.22.ProductTitle',
 '/nl/nl/p/last-of-us/9300000142780418/?bltgh=nOMDDGdt0suKSrRC0h5Fgg.3_15.23.ProductTitle',
 '/nl/nl/p/last-of-us/9300000142780419/?bltgh=nOMDDGdt0suKSrRC0h5Fgg.3_15.24.ProductTitle',
 '/nl/nl/p/succession

### 4.3 Using interactive elements (e.g., by clicking buttons)

__Importance__

For more dynamic websites, we may have to click on certain elements (rather than extracting some URL).

<div class="alert alert-block alert-info"><b>Extracting elements using Selenium, not BeautifulSoup</b> 

Selenium is really great for navigating dynamic websites. There are two ways in which you can use it for querying sites:
    
<ul>
    <li>pass the page source rendered by selenium (<code>driver.page_source</code>) to BeautifulSoup, and then use BeautifulSoup commands, or </li>
    <li>directly use selenium (and its own query language) to extract elements.</li>
</ul>
    
In the next few examples, we are using selenium's "internal" query language. You can identify it easily because it is a method of the `driver` object, and because it has a different name (`find_element`, instead of `find` or `find_all`).
    
Want to know more about selenium's built-in query language? Check out the "Advanced Web Scraping Tutorial", or dig up some extra material from the web. Knowing both BeautifulSoup and Selenium makes you most productive!
  
</div>
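To make the difference between the two approaches concrete, here is a minimal sketch of the first one. The HTML string is invented for illustration and merely stands in for `driver.page_source`. The second approach appears only as a comment, because it needs a running browser session.

```python
from bs4 import BeautifulSoup

# Invented HTML snippet that stands in for driver.page_source
page_source = """
<ul class="pagination">
  <li><a class="page-link" href="?page=1">Previous</a></li>
  <li><a class="page-link" href="?page=3">Next</a></li>
</ul>
"""

# Approach 1: hand the rendered source to BeautifulSoup and query it there
soup = BeautifulSoup(page_source, "html.parser")
link_texts = [link.get_text() for link in soup.find_all("a", class_="page-link")]
print(link_texts)

# Approach 2 (needs a live browser session, hence commented out):
# from selenium.webdriver.common.by import By
# links = driver.find_elements(By.CSS_SELECTOR, "a.page-link")
# link_texts = [link.text for link in links]
```

Keep in mind that elements found with BeautifulSoup are a static snapshot of the page: you can read them, but only elements located with Selenium can be clicked.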

__Try it out__

If you haven't done so, rerun the installation code for `selenium` from above. Then, proceed by running the following cell and observe what happens in your browser.


In [289]:
driver.get('https://music-to-scrape.org/user?username=StarCoder49')

After a few seconds, your browser will have loaded the website in Chrome. Now, run the next cells.

In [290]:
# Step 1: Let's try locating the element
from selenium.webdriver.common.by import By
button = driver.find_element(By.CSS_SELECTOR, ".page-link")
button

<selenium.webdriver.remote.webelement.WebElement (session="889393e1e316b7c098b0fb7e4e63c82f", element="7870A3AF1CDB4D7790D28E2374BA835D_element_855")>

In [294]:
# Step 2: Clicking the link!
button = driver.find_element(By.CSS_SELECTOR, ".page-link")
button.click()

Boom! In step 2, we clicked on the link. Just try rerunning this cell with step 2 over and over again. Does iterating through the pages work?!
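Instead of rerunning the cell by hand, the clicking can be wrapped in a loop. The function below is a minimal sketch: its name and parameters are hypothetical, and it assumes that the first element matching `.page-link` is the one you want to click on every page, which you should verify in your browser.

```python
import time

def click_repeatedly(driver, selector=".page-link", max_clicks=5, pause=2):
    """Click the first element matching `selector` up to `max_clicks` times.

    Returns the number of clicks that were actually carried out.
    """
    clicks = 0
    for _ in range(max_clicks):
        # "css selector" is the string value behind selenium's By.CSS_SELECTOR
        matches = driver.find_elements("css selector", selector)
        if not matches:
            break  # nothing left to click, e.g., no pagination on this page
        matches[0].click()
        clicks += 1
        time.sleep(pause)  # give the page time to load before the next click
    return clicks
```

With the browser from above still open, `click_repeatedly(driver, max_clicks=3)` would click three times and pause two seconds after each click. Using `find_elements` (plural) returns an empty list when nothing matches, so the loop can stop gracefully instead of raising an error.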

## Backup: Executing Python Files

### Jupyter Notebooks versus editors such as Visual Studio Code, PyCharm, or Spyder

Jupyter Notebooks are ideal for combining programming and markdown (e.g., text, plots, equations), making them the default choice for sharing and presenting reproducible data analyses. Since we can execute code blocks one by one, they are well suited for developing and debugging code on the fly. 

That said, Jupyter Notebooks also have some severe limitations when using them in production environments. That's where an "Integrated Development Environment" (IDE) comes in, such as Visual Studio Code or PyCharm. Let's revisit the most important differences.

First, the order in which you run cells within a notebook may affect the results. While prototyping, you may lose sight of the top-down hierarchy, which can cause problems once you restart the kernel (e.g., a library is imported after it is used). Second, there is no easy way to browse through directories and files within a Jupyter Notebook. Third, notebooks handle neither large codebases nor big data particularly well. 

That's why we recommend starting in Jupyter Notebooks, moving code into functions along the way, and once all seems to be running well, save your Jupyter Notebook as a `.py` file and continue working with it in Visual Studio Code.
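As a rough sketch of what such a `.py` file could look like, consider the layout below. The file name and function names are hypothetical. The URL pattern is the one used for user pages earlier in this tutorial. Everything is ordered top-down, so restarting never runs code before its imports.

```python
# webscraping_sketch.py (hypothetical file name)

# 1. Imports first, so nothing is used before it is loaded
import json

# 2. Functions next
def build_user_url(username):
    """Return the profile URL of a user on music-to-scrape.org."""
    return f"https://music-to-scrape.org/user?username={username}"

def main():
    usernames = ["StarCoder49"]  # seeds; extend as needed
    urls = [build_user_url(name) for name in usernames]
    print(json.dumps(urls, indent=2))

# 3. Entry point last: runs only when the file is executed as a script
if __name__ == "__main__":
    main()
```

The `if __name__ == "__main__":` guard means the functions can also be imported into a notebook without triggering the data collection.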

Below, we introduce you to the IDE (here, Spyder, but VS Code looks very similar), and show you how to run Python files from the command line. 

### Introduction to Spyder
The first time, you need to click on the green "Install" button in Anaconda Navigator. After that, you can start Spyder by clicking on the blue "Launch" button (alternatively, type `spyder` in the terminal). 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/anaconda_navigator.png" width=90% align="left" style="border: 1px solid black" />


The main interface consists of three panels: 
1. **Code editor** = where you write Python code (i.e., the content of code cells in a notebook)
2. **Variable / files** = depending on which tab you choose either an overview of all declared variables (e.g. look up their type or change their values) or a file explorer (e.g., to open other Python files)
3. **Console** = the output of running the Python script from the code editor (what normally appears below each cell in a notebook)

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/spyder.png" width=90% align="left" style="border: 1px solid black" />

**Let's try it out!**     
Copy the solution from exercise 3.3 to a new file, called `webscraping_101.py`. To run the script you can

- click on the green play button to run all code, or
- highlight the parts of the script you want to execute and then click the run selection button.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/toolbar.png" width=40% align="left" style="border: 1px solid black" />

Once the script is running, you may need to interrupt the execution because it is simply taking too long or you spotted a bug somewhere. Click on the red square in the console to stop the execution. 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/interrupt.gif" width=80% align="left" style="border: 1px solid black" />

### Run Python Files 

__For Mac and Linux users__

1. Open the terminal and navigate to the folder in which the `.py` file has been saved (use `cd` to change directories and `ls` to list all files).
2. Run the Python script by typing `python <FILENAME.py>` (e.g., `python webscraping_101.py`).

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/modules/week3/webscraping101/images/running_python.gif" width=60% align="left" style="border: 1px solid black" />

__For Windows users__

1. Open Windows Explorer and navigate to the folder in which the `.py` file has been saved. Type `cmd` in the address bar and press Enter to open the command prompt in that folder. Alternatively, open the command prompt from the start menu (and use `cd` to change directories and `dir` to list files).
2. Activate Anaconda by typing `conda activate`.
3. Run the Python script by typing `python <FILENAME.py>` (e.g., `python webscraping_101.py`).