# Exercises for Session 6: Web Scraping 1

In session 5 you briefly touched upon extracting data from the internet. You worked with APIs, which can be used to download data from a webpage in a structured way. Sometimes a webpage does not provide an API, or the data you can download via the API is limited. In that case, we need to extract the data from the webpage ourselves.

In the next three sessions you will learn how to extract data from a webpage when you cannot use an API. It involves mapping the website (finding the right URLs) and extracting the desired data from each webpage's HTML string (HTML is the markup language underlying a webpage).

*(Note: I recommend using Chrome as your browser during the next three sessions. Lectures and exercises are based solely on Chrome.)*

# Part 1: Scraping Jobnet.dk

When we want to scrape a webpage, the first thing we do is to investigate the webpage. First, we need to get an overview of the URLs of all the webpages we want to scrape. Second, we download the HTML-string from the webpages. You can learn more about this in video 6.1:

(I might talk a bit slowly in some of the videos. Remember that you can turn up the playback speed on YouTube.)

In [1]:
from IPython.display import YouTubeVideo
YouTubeVideo('Xiu-acDIm28', width=640, height=360)

> **Ex. 6.1.1:** Go to www.jobnet.dk and investigate the page. Locate the webpage that shows the job postings. Use the `requests` module to extract the HTML-string of the webpage.
>
> Remember to add name and email to the header of your request, so the website managers can see that you are not a malicious actor.

> *Note:* The HTML-string will not make a lot of sense right now, but try to take a look at it. In the next session we will learn how to extract data from the HTML-string.

> *Note:* The website is in Danish, but non-Danish speakers should have no problem solving the exercises.

In [20]:
import requests

headers = {
    'name': 'Nina Frandsen Jensen',
    'email': 'qls153@alumni.ku.dk'
}

response = requests.get('https://job.jobnet.dk/cv/findwork', headers=headers)

# Extract the HTML-string from the response
html_string = response.text

# Print the HTML-string
print(html_string)





<!DOCTYPE html>
<html class="no-js jobnet"
      lang="da"
      data-build="2023.2.0.129"
      data-ng-app="Jobnet">

<head data-jn-header-manager>
    <meta charset="utf-8" />

    <script src="https://cdn-eu.cookietractor.com/cookietractor.js" data-lang="da-DK" data-id="997a8f64-3979-4aaf-a7ad-d75d4a075a3e"></script>

    
    <title>Find job</title>
    <meta name="description"
          content="" />
    <meta name="viewport"
          content="width=device-width, initial-scale=1" />
    
    <link href="/CV/bundles/jobnet/styles/themes/jqueryui?v=tnDXbSoBDWbbJp6Mq-7PNZ2WgEiO41s0WI3Jpab9v5k1" rel="stylesheet"/>

    <link href="/CV/bundles/jobnet/styles/normalization?v=8SYC4_fo8F7yKup3Ic3pmxETVZDCktLLOPXjtIVe2Zk1" rel="stylesheet"/>

    <link href="/CV/bundles/jobnet/styles/normalizationprint?v=oRijEx5qJuAAPi5Biy05nn2lsj7dhIKZLJ8zNwNOAZs1" rel="stylesheet"/>

    <link href="/CV/bundles/jobnet/styles/core?v=Cg60VVyiEhch4qImjjyR7P2kCXqJPDpWNG5k6Gpigrw1" rel="stylesheet"/>

   

When you have completed exercise 6.1.1 you have scraped your first webpage! I.e., you have retrieved the HTML-string of the webpage you wanted to extract data from. In session 7 we will learn how to get the relevant data from the HTML-string. But first we want to learn how to go through all the webpages we want to scrape and retrieve the HTML-strings behind them. This step is called `mapping`.

> **Ex. 6.1.2:** Start your `mapping`: We want to figure out what URLs we need to scrape to collect job posting data. 

> You will see that there are 20 job postings per page, and that you can click through the pages with job postings at the bottom of the page. Figure out what the structure of the URL is, so you can move between the job posting pages by changing the URL.

> Describe the structure of the URL in plain words below. What is the relevant paging parameter (the parameter you need to change to go to the next webpage) and how does it behave when you change page?

### Answer
The relevant paging parameter is `Offset`. It counts job postings rather than pages: page 1 corresponds to `Offset=0`, page 2 to `Offset=20`, page 3 to `Offset=40`, and so on. Each new page adds 20 to the parameter, so page n has `Offset=20*(n-1)`. To go to page 8, for example, you would type `Offset=140`.
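
The arithmetic behind the paging parameter can be written down as a small helper. This is only a minimal sketch: `page_to_offset` and `page_url` are hypothetical names introduced here for illustration, and the base URL is the one observed on the job posting page.

```python
from urllib.parse import urlencode

POSTINGS_PER_PAGE = 20  # observed on the site: 20 job postings per page


def page_to_offset(page, per_page=POSTINGS_PER_PAGE):
    """Translate a 1-based page number into the value of the Offset parameter."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * per_page


def page_url(page, base="https://www.jobnet.dk/CV/FindWork"):
    """Build the URL of a results page (hypothetical helper, not part of any library)."""
    return f"{base}?{urlencode({'Offset': page_to_offset(page)})}"


print(page_to_offset(8))  # 140
print(page_url(3))        # https://www.jobnet.dk/CV/FindWork?Offset=40
```

Using `urlencode` rather than pasting the number into the string makes it easier to add further query parameters later, should the URL turn out to need them.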

> **Ex. 6.1.3:** Make a list of the URLs of the first 5 webpages with job postings.

> *Hint 1:* Design a `for loop` using the `range` function that changes the paging parameter in the URL.
>
> *Hint 2:* How do you change the paging parameter in the URL-string? Here string formatting is your friend! Read about it [here](https://realpython.com/python-string-formatting) (I recommend that you adopt f-string formatting, which is a relatively new and nice feature in Python).

In [46]:
# Define the base URL for job postings
base_url = 'https://www.jobnet.dk/CV/FindWork?Offset='

# Initialize an empty list to store the URLs
job_posting_urls = []

# Loop through the first 5 pages (0 to 4)
for page in range(5):
    # Calculate the offset for each page
    offset = page * 20  
    # Create the URL for the current page
    url = f'{base_url}{offset}'
    # Add the URL to the list
    job_posting_urls.append(url)

# Print the list of URLs
job_posting_urls

['https://www.jobnet.dk/CV/FindWork?Offset=0',
 'https://www.jobnet.dk/CV/FindWork?Offset=20',
 'https://www.jobnet.dk/CV/FindWork?Offset=40',
 'https://www.jobnet.dk/CV/FindWork?Offset=60',
 'https://www.jobnet.dk/CV/FindWork?Offset=80']

> **Ex. 6.1.4:** Now loop through the list and scrape the HTML-strings of all 5 webpages using the `requests` module again, and save the HTML-strings in a list.

> - Use the `time.sleep()` function to limit the rate of your calls. This is important to avoid overloading the webpage's server. Worst case, you can be banned from the website.

> - ***Extra:*** Monitor the time left until the loop completes by using the `tqdm.tqdm()` function.

In [50]:
import time
from tqdm import tqdm  # For the progress bar

# Define the list of URLs from the previous exercise
job_posting_urls = [
    'https://www.jobnet.dk/CV/FindWork?Offset=0',
    'https://www.jobnet.dk/CV/FindWork?Offset=20',
    'https://www.jobnet.dk/CV/FindWork?Offset=40',
    'https://www.jobnet.dk/CV/FindWork?Offset=60',
    'https://www.jobnet.dk/CV/FindWork?Offset=80',
]

# Initialize an empty list to store the HTML-strings
html_strings = []

# Identify yourself in the request headers, so the website managers can contact you
headers = {
    'name': 'Nina Frandsen Jensen',
    'email': 'qls153@alumni.ku.dk'
}

# Loop through the URLs and scrape the HTML-strings
for url in tqdm(job_posting_urls, desc='Scraping pages'):
    # Make the request to the website
    response = requests.get(url, headers=headers)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Append the HTML-string to the list
        html_strings.append(response.text)
    else:
        print(f"Failed to fetch URL: {url}")

    # Sleep for a few seconds to avoid overloading the server
    time.sleep(2)

# Print the number of HTML-strings retrieved
print(f"Number of HTML-strings retrieved: {len(html_strings)}")


Scraping pages: 100%|██████████| 5/5 [00:11<00:00,  2.30s/it]

Number of HTML-strings retrieved: 5
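
The loop above always sleeps for a fixed 2 seconds on top of the time the request itself takes, which is why the progress bar reports about 2.3 seconds per iteration. A possible alternative, sketched here under the assumption that you want a minimum spacing between calls rather than a fixed pause, is a small rate limiter. `RateLimiter` is a hypothetical helper written for this illustration, not something provided by `requests` or `time`.

```python
import time


class RateLimiter:
    """Ensure that at least `min_interval` seconds pass between consecutive calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None  # time of the previous call, None before the first one

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)  # only sleep for the part of the interval that is left
        self._last_call = time.monotonic()


# Usage: call limiter.wait() right before each request in the scraping loop
limiter = RateLimiter(min_interval=0.1)
start = time.monotonic()
for _ in range(3):
    limiter.wait()
elapsed = time.monotonic() - start
print(f"Three calls took {elapsed:.2f} seconds")
```

The first call goes through immediately, so three calls with a 0.1 second interval take roughly 0.2 seconds in total.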





#### In the video below (video 6.2) you will learn about logging and handling exceptions. Watch it before continuing with Ex. 6.1.5.

In [None]:
YouTubeVideo('d9fx8m7dQmI', width=640, height=360)

> **Ex. 6.1.5:** Repeat 6.1.4, but now log your activity as well. 

In [69]:
import os
import json
import time
import tqdm  # Rebinds the name to the module, so call tqdm.tqdm() from here on

# Define the log function to gather the log information
def log(response, logfile, output_path=os.getcwd()):
    # If the log file does not exist yet, create it and write the column headers
    if not os.path.isfile(logfile):
        header = ['timestamp', 'status_code', 'length', 'output_file']
        with open(logfile, 'w') as f:
            f.write(';'.join(header) + "\n") #Write the headers and jump to new line

    # Gather log information
    status_code = response.status_code #Status code from the request result
    timestamp = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime()) #Local time
    length = len(response.text) #Length of the HTML-string

    # Open the log file and append the gathered log information
    with open(logfile, 'a') as f:
        f.write(f'{timestamp};{status_code};{length};{output_path}' + "\n") #Append the information and jump to new line

In [70]:
list_htmls = []
logfile = 'log.csv'
for url in tqdm.tqdm(job_posting_urls):
    response = requests.get(url)
    html = response.text
    list_htmls.append(html)
    time.sleep(0.5)
    log(response,logfile)

100%|██████████| 5/5 [00:04<00:00,  1.23it/s]
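
Once the scraper has run, the log can be read back to check whether any request went wrong. The sketch below assumes the semicolon-separated layout used by the `log` function (`timestamp;status_code;length;output_file`). The helper names `read_log` and `failed_requests` are hypothetical, and the sample rows are made up so that the example runs without touching the real `log.csv`.

```python
import csv
import os
import tempfile


def read_log(logfile):
    """Read a semicolon-separated log file into a list of dicts, one per request."""
    with open(logfile, newline="") as f:
        return list(csv.DictReader(f, delimiter=";"))


def failed_requests(rows):
    """Return the log rows whose status code is not 200."""
    return [row for row in rows if row["status_code"] != "200"]


# Made-up example rows in the same layout as the log file
sample = (
    "timestamp;status_code;length;output_file\n"
    "2023-08-01 10:00:00;200;51234;/tmp\n"
    "2023-08-01 10:00:01;503;312;/tmp\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_log.csv")
    with open(path, "w") as f:
        f.write(sample)
    rows = read_log(path)

print(len(rows))                                             # 2
print([row["status_code"] for row in failed_requests(rows)])  # ['503']
```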


> **Ex. 6.1.6:** It is a good idea to build a scraper that can handle exceptions (for example, a link that for some reason does not exist, or connection problems). Build such exception handling into your scraper from 6.1.5, so you do not lose the scraped data if it crashes halfway through.

In [71]:
list_htmls = []
for url in tqdm.tqdm(job_posting_urls):
    try:
        response = requests.get(url)
    except Exception as e:
        print(url) #Print url
        print(e) #Print error
        with open("list_htmls.json", "w") as f: #Save list_htmls as a JSON file to retrieve at another time
            json.dump(list_htmls, f)
        continue #Continue to next iteration of the loop
    html = response.text
    list_htmls.append(html)
    time.sleep(0.5) #Sleep for 0.5 seconds

100%|██████████| 5/5 [00:03<00:00,  1.26it/s]
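
The same pattern can be tried out without an internet connection by swapping the real request for a stand-in that fails on purpose. This is only a sketch under stated assumptions: `scrape_all` and `fake_fetch` are hypothetical helpers written for this illustration, and `fake_fetch` takes the place of `requests.get(url).text`.

```python
import json
import os
import tempfile


def scrape_all(urls, fetch, backup_path):
    """Fetch every URL; on an error, save what has been collected so far and move on."""
    results, failed = [], []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as e:
            failed.append((url, repr(e)))    # remember which URL failed and why
            with open(backup_path, "w") as f:
                json.dump(results, f)        # back up the data scraped so far
    return results, failed


def fake_fetch(url):
    """Stand-in for a real request that fails on one URL (illustration only)."""
    if url.endswith("Offset=20"):
        raise ConnectionError("simulated connection problem")
    return f"<html>{url}</html>"


urls = [f"https://www.jobnet.dk/CV/FindWork?Offset={o}" for o in range(0, 60, 20)]

with tempfile.TemporaryDirectory() as tmp:
    backup = os.path.join(tmp, "list_htmls.json")
    results, failed = scrape_all(urls, fake_fetch, backup)
    with open(backup) as f:
        saved = json.load(f)

print(len(results), len(failed))  # 2 1
print(len(saved))                 # 1
```

The backup holds one page only, because it was written at the moment the second URL failed. The loop then carried on and collected the third page.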


# Part 2: Locating data through the network panel

Sometimes you may be fortunate enough to find the request that the webpage sends to the server to retrieve its data. In that case, we can simply replicate the request and receive the data in a structured format (JSON). Then we do not need to struggle with the HTML-strings.

To do this, we first need to find the request. For that purpose, the **network panel** in the Chrome Developer Tools is useful. The network panel monitors all the uploads and downloads to and from the webpage. You can read more about the network panel [here](https://developer.chrome.com/docs/devtools/network/).

**Watch the video below (video 6.3) before working on the exercises.**

In [72]:
YouTubeVideo('isUxBDzfWMg', width=640, height=360)

> **Ex. 6.2.1:** Go to the job posting page at www.jobnet.dk again. Open the network panel and choose the *Fetch/XHR* filter ([Read more: XMLHttpRequest](https://en.wikipedia.org/wiki/XMLHttpRequest)). If you reload the page, you will see all the XHR resources the page generates.

> Go through all the XHRs and find the XHR that carries the information about the different job postings. What is the name of the XHR?
>
>*Note: There is no smart way to do this. You just need to go through all the XHRs and inspect the information they carry.*

### Answer
It is called `Search`.

> **Ex. 6.2.2:** Use the request URL to download the JSON file consisting of the first 20 job postings. Return the request result in JSON format.

In [74]:
import requests
response = requests.get('https://job.jobnet.dk/CV/FindWork/Search', headers={'name':'Nina Frandsen Jensen','email':'qls153@alumni.ku.dk'})
result_json = response.json()

> **Ex. 6.2.3:** The JSON file consists of three different key-value pairs. We are only interested in the pair that contains the job postings. Find the right key-value pair and convert the JSON data to a Pandas dataframe.

In [76]:
import pandas as pd
jobs_first20 = pd.DataFrame(result_json['JobPositionPostings'])
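
To find the right key-value pair, it helps to list the top-level keys of the JSON together with the type and size of each value. The sketch below does this on a made-up payload: only the key `JobPositionPostings` is taken from the real response, while the other keys and the field names are hypothetical placeholders. `describe_keys` is a helper introduced here for illustration.

```python
def describe_keys(payload):
    """Summarise each top-level key: the type of its value and, for containers, its size."""
    summary = {}
    for key, value in payload.items():
        size = len(value) if isinstance(value, (list, dict, str)) else None
        summary[key] = (type(value).__name__, size)
    return summary


# Made-up payload; only 'JobPositionPostings' is a key taken from the real response
example_json = {
    "TotalResultCount": 2,            # hypothetical key
    "Expression": {"Offset": 0},      # hypothetical key
    "JobPositionPostings": [
        {"Title": "Example job A"},   # hypothetical field name
        {"Title": "Example job B"},
    ],
}

for key, info in describe_keys(example_json).items():
    print(key, info)
```

The key whose value is a list of dictionaries is the natural candidate to pass to `pd.DataFrame`.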

> **Ex. 6.2.4:** At this point, we have information about the first 20 job postings. Now we want the job postings of the first 5 pages, i.e. the first 100 job postings. 

> Use the same procedure as in **Ex. 6.1.3-4** to download the first 100 postings and save them in a dataframe.
>
> *Note: Remember to limit the rate of your calls, log your activity, and think about how to handle exceptions.*

> *Hint: Recall the paging parameter from **Ex. 6.1.2**. You can use the same paging parameter in the new request URL to loop through the 5 pages.*

In [78]:
links = []
for offset in range(0,5*20,20):
    url = f'https://job.jobnet.dk/CV/FindWork/Search?offset={offset}'
    links.append(url)

> **Ex. 6.2.5 (optional):** What are the top 5 occupation areas with most job postings out of the 100 postings? How many job postings do the top 5 occupation areas have each?

In [80]:
# Download loop for Ex. 6.2.4: fetch the 5 pages and collect the postings in one dataframe
logfile = 'log3.csv'
list_htmls = []
jobs_first100 = pd.DataFrame()

for url in tqdm.tqdm(links):
    try:
        response = requests.get(url, headers={'name':'Nina Frandsen Jensen','email':'qls153@alumni.ku.dk'})
    except Exception as e:
        print(url) #Print url
        print(e) #Print error
        jobs_first100.to_csv('jobs_first100.csv') #Save the dataframe as a csv file to retrieve at another time
        continue #Continue to next iteration of the loop
    
    if response.ok: #Check that the status code signals success (below 400)
        result_json = response.json() #If the request succeeded, parse the JSON body of the response
    else: #If the request failed, print the status_code and continue to next iteration of the loop
        print(response.status_code)
        continue
    
    result_df = pd.DataFrame(result_json['JobPositionPostings']) #Convert this iteration's json file to a dataframe
    jobs_first100 = pd.concat([jobs_first100,result_df], axis=0, ignore_index=True) #Append to the rest of the data
    log(response, logfile)
    time.sleep(0.5) #Sleep for 0.5 seconds

100%|██████████| 5/5 [00:05<00:00,  1.01s/it]
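
For Ex. 6.2.5, the counting step could look roughly like the sketch below. It assumes that each posting carries a field with its occupation area. The field name `OccupationArea` is an assumption and should be checked against the actual columns of `jobs_first100`. The postings are made up for illustration, and `top_occupation_areas` is a hypothetical helper.

```python
from collections import Counter


def top_occupation_areas(postings, field="OccupationArea", n=5):
    """Count postings per occupation area and return the n most common as (area, count) pairs."""
    counts = Counter(p[field] for p in postings if p.get(field))
    return counts.most_common(n)


# Made-up postings; real rows could come from jobs_first100.to_dict("records")
postings = [
    {"OccupationArea": "Health"},
    {"OccupationArea": "Education"},
    {"OccupationArea": "Health"},
    {"OccupationArea": "IT"},
    {"OccupationArea": "Health"},
    {"OccupationArea": "Education"},
]

print(top_occupation_areas(postings, n=2))  # [('Health', 3), ('Education', 2)]
```

With the dataframe itself, the pandas equivalent would be `jobs_first100['OccupationArea'].value_counts().head(5)`, again assuming that the column exists under that name.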
