## The Web Scraping Recipe

Scraping information from the web involves three steps:
1. **MAPPING**: Finding the URLs of the pages that contain the information you want.
2. **DOWNLOAD**: Fetching the pages via HTTP.
3. **PARSE**: Extracting the information from the HTML.

You could also add further steps such as `connection`, `storing`, and `logging`.
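
The three steps can be sketched as three small functions. This is only an illustration, not the notebook's own code: the function names, the `example.com` URL pattern and the `fetch` parameter are hypothetical, and the parse step uses a simple regular expression purely for brevity.

```python
import re
import time


def map_urls(base_url, n_pages):
    """MAPPING: build the list of page URLs to visit."""
    return [f"{base_url}?page={page}" for page in range(1, n_pages + 1)]


def download(urls, fetch, delay=0.5):
    """DOWNLOAD: call fetch(url) for each URL, pausing between requests."""
    pages = []
    for url in urls:
        pages.append(fetch(url))
        time.sleep(delay)
    return pages


def parse_titles(html):
    """PARSE: pull the text of every <h2> element out of an HTML string."""
    matches = re.findall(r"<h2[^>]*>(.*?)</h2>", html, flags=re.S | re.I)
    return [m.strip() for m in matches]


# Stand-in for a real HTTP call such as `lambda url: requests.get(url).text`
def fake_fetch(url):
    return f"<html><h2>Listing for {url}</h2></html>"


urls = map_urls("https://example.com/jobs", 2)
pages = download(urls, fake_fetch, delay=0)
titles = [title for page in pages for title in parse_titles(page)]
```

Passing `fetch` in as an argument keeps the download step testable without touching the network.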
   


### Packages used
Today we will mainly build on the Python skills you have acquired so far; tomorrow we will look into more specialized packages.

* for connecting to the internet we use: **requests**
* for parsing: **beautifulsoup** and **regex**
* for automatic browsing / screen scraping: **selenium** 
* for pausing between requests, which helps avoid errors from overloading the server, we use: **time**

We will write our scrapers in basic Python; for larger projects, consider looking into the package **scrapy**.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import yfinance as yf
import os

In [16]:
import requests
from bs4 import BeautifulSoup
import re
import selenium
import time

import tqdm
import os
import json

In [4]:
import requests
response = requests.get('https://isdsucph.github.io/isds2021/')

In [5]:
# NBA website
url = 'https://www.basketball-reference.com/leagues/NBA_2018.html' # link to the website
dfs = pd.read_html(url) # parses all tables found on the page.
dfs[1]

Unnamed: 0,Western Conference,W,L,W/L%,GB,PS/G,PA/G,SRS
0,Houston Rockets*,65,17,0.793,—,112.4,103.9,8.21
1,Golden State Warriors*,58,24,0.707,7.0,113.5,107.5,5.79
2,Portland Trail Blazers*,49,33,0.598,16.0,105.6,103.0,2.6
3,Oklahoma City Thunder*,48,34,0.585,17.0,107.9,104.4,3.42
4,Utah Jazz*,48,34,0.585,17.0,104.1,99.8,4.47
5,New Orleans Pelicans*,48,34,0.585,17.0,111.7,110.4,1.48
6,San Antonio Spurs*,47,35,0.573,18.0,102.7,99.8,2.89
7,Minnesota Timberwolves*,47,35,0.573,18.0,109.5,107.3,2.35
8,Denver Nuggets,46,36,0.561,19.0,110.0,108.5,1.57
9,Los Angeles Clippers,42,40,0.512,23.0,109.0,109.0,0.15


In [6]:
url = 'https://www.basketball-reference.com/leagues/NBA_2018.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('h2')[0].text

'Conference Standings'

In [11]:
# Mapping exercise - job listing website
links = []
for page in range(1,6,1):
    url = f'https://www.jobindex.dk/jobsoegning?page={page}'
    links.append(url)
links

['https://www.jobindex.dk/jobsoegning?page=1',
 'https://www.jobindex.dk/jobsoegning?page=2',
 'https://www.jobindex.dk/jobsoegning?page=3',
 'https://www.jobindex.dk/jobsoegning?page=4',
 'https://www.jobindex.dk/jobsoegning?page=5']
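
When a listing takes more than one query parameter, building the query string by hand gets error-prone. A minimal sketch using the standard library's `urllib.parse.urlencode`; the `build_links` helper is hypothetical, and the `q` parameter and its value are made up rather than taken from jobindex.dk.

```python
from urllib.parse import urlencode

base_url = "https://www.jobindex.dk/jobsoegning"


def build_links(base_url, n_pages, **params):
    """Return one URL per page, with any extra query parameters URL-encoded."""
    result = []
    for page in range(1, n_pages + 1):
        query = urlencode({**params, "page": page})
        result.append(f"{base_url}?{query}")
    return result


links_plain = build_links(base_url, 3)
# 'q' is a made-up parameter used only to show the encoding of spaces
links_query = build_links(base_url, 2, q="data analyst")
```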

In [22]:
headers = {'name':'Siyi','email':'wasariii@outlook.com'} #Identify yourself to the site owner
list_htmls = []
for url in tqdm.tqdm(links): #Track the time left before completing the loop
    response = requests.get(url, headers=headers) #Send the headers with every request
    html = response.text
    list_htmls.append(html)
    time.sleep(0.5) #Sleep for 0.5 seconds


100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.19s/it]






#### Log your activities

In [41]:
# Define the log function to gather the log information
def log(response, logfile, output_path=None):
    # Default to the current working directory, looked up when the function is called
    if output_path is None:
        output_path = os.getcwd()

    # If the log file does not exist, create it and write the headers for the log variables
    if not os.path.isfile(logfile):
        with open(logfile, 'w') as f:
            header = ['timestamp', 'status_code', 'length', 'output_file']
            f.write(';'.join(header) + "\n") #Make the headers and jump to new line

    # Gather log information
    status_code = response.status_code #Status code from the request result
    timestamp = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())) #Local time
    length = len(response.text) #Length of the HTML-string

    # Open the log file and append the gathered log information
    with open(logfile, 'a') as f:
        f.write(f'{timestamp};{status_code};{length};{output_path}' + "\n") #Append the information and jump to new line

In [19]:
# Write to the log file
list_htmls = []
logfile = 'log.csv'
for url in tqdm.tqdm(links):
    response = requests.get(url)
    html = response.text
    list_htmls.append(html)
    time.sleep(0.5)
    log(response,logfile)

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:06<00:00,  1.25s/it]
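
Once the log exists, it can be read back to check how the run went. A sketch, assuming the semicolon-separated layout written by `log` above; `summarize_log` is a hypothetical helper and the sample rows are made up for illustration.

```python
import csv
import os
import tempfile
from collections import Counter


def summarize_log(logfile):
    """Count status codes and total the response lengths in a ';'-separated log."""
    status_counts = Counter()
    total_length = 0
    with open(logfile, newline="") as f:
        for row in csv.DictReader(f, delimiter=";"):
            status_counts[int(row["status_code"])] += 1
            total_length += int(row["length"])
    return status_counts, total_length


# Made-up sample log, written to a temporary file so the sketch is self-contained
sample = (
    "timestamp;status_code;length;output_file\n"
    "2022-08-05 10:00:00;200;1500;/tmp\n"
    "2022-08-05 10:00:01;200;1700;/tmp\n"
    "2022-08-05 10:00:02;404;300;/tmp\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)

status_counts, total_length = summarize_log(tmp.name)
os.remove(tmp.name)
```

A count of non-200 status codes is a quick way to spot pages that need to be downloaded again.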


In [23]:
list_htmls = []
for url in tqdm.tqdm(links):
    try:
        response = requests.get(url)
    except Exception as e:
        print(url) #Print url
        print(e) #Print error
        with open("list_htmls.json", "w") as l: #Save the list_htmls as a json file to retrieve at another time
            json.dump(list_htmls, l)
        continue #Continue to next iteration of the loop
    html = response.text
    list_htmls.append(html)
    time.sleep(0.5) #Sleep for 0.5 seconds

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.16s/it]
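
Skipping a failed URL loses that page. An alternative is to retry a few times with a growing pause before giving up. A minimal sketch: `fetch_with_retries` and the flaky stand-in function are hypothetical, and in this notebook `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`.

```python
import time


def fetch_with_retries(url, fetch, max_tries=3, base_delay=0.5):
    """Call fetch(url); after an exception wait base_delay * 2**attempt and retry.

    Returns the result of fetch, or None if every attempt failed.
    """
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except Exception as e:
            print(f"attempt {attempt + 1} failed for {url}: {e}")
            if attempt < max_tries - 1:
                time.sleep(base_delay * 2 ** attempt)
    return None


# Stand-in that fails twice and then succeeds, to exercise the retry path
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network error")
    return "<html>ok</html>"


html_ok = fetch_with_retries("https://example.com", flaky_fetch, base_delay=0)
```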


In [24]:

response = requests.get('https://www.boligsiden.dk/tilsalg', headers={'name':'Siyi','email':'wasariii@outlook.com'})
list_htmls = []
for url in tqdm.tqdm(links): #Track the time left before completing the loop
    response = requests.get(url)
    html = response.text
    list_htmls.append(html)
    time.sleep(0.5) #Sleep for 0.5 seconds

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.16s/it]


In [70]:
response = requests.get('https://api.prod.bs-aws-stage.com/search/cases?addressTypes=villa%2Ccondo%2Cterraced+house%2Choliday+house%2Ccooperative%2Cfarm%2Chobby+farm%2Cfull+year+plot%2Cvilla+apartment%2Choliday+plot&per_page=50&page=1&highlighted=true&sortAscending=true&sortBy=timeOnMarket')
result = response.json()

result_properties = result['cases']
data = pd.DataFrame(result_properties)
data

Unnamed: 0,_links,address,addressType,caseID,caseUrl,coordinates,daysOnMarket,defaultImage,descriptionBody,descriptionTitle,...,realEstate,realtor,slug,status,totalClickCount,totalFavourites,weightedArea,yearBuilt,nextOpenHouse,basementArea
0,{'self': {'href': '/cases/09777162-a79c-4c13-a...,{'_links': {'self': {'href': '/addresses/0a3f5...,condo,09777162-a79c-4c13-aafa-697c772fb16f,https://www.realmaeglerne.dk/301-redirect/?mgl...,"{'lat': 55.126675, 'lon': 12.064598, 'type': '...",1,"{'imageSources': [{'size': {'height': 80, 'wid...",4-V ejerlejlighed opført som et rækkehus og pe...,1. RÆKKE TIL FJORDEN - INDFLYTNINGSKLART,...,"{'downPayment': 105000, 'grossMortgage': 10644...",{'_links': {'self': {'href': '/realtors/e799ab...,kirsebaervej-34-4720-praestoe-03900761__34_______,open,496,1,98.0,1981,,
1,{'self': {'href': '/cases/9724f5da-480e-45ef-8...,{'_links': {'self': {'href': '/addresses/0a3f5...,terraced house,9724f5da-480e-45ef-8065-a17274fbe24f,http://www.nybolig.dk/maegler/pages/property-p...,"{'lat': 55.30157, 'lon': 11.540407, 'type': 'E...",1,"{'imageSources': [{'size': {'height': 80, 'wid...",Drømmer I om et dejligt hus med en god beligge...,Velindrettet halvt dobbelthus i Fuglebjerg,...,"{'downPayment': 65000, 'grossMortgage': 6553, ...",{'_links': {'self': {'href': '/realtors/cb08f4...,dalsgaardsvej-2-4250-fuglebjerg-03700249___2__...,open,104,0,103.2,1980,"{'date': '2022-08-06T11:30:00Z', 'duration': 3...",
2,{'self': {'href': '/cases/8dcc6f60-57c2-4daa-b...,{'_links': {'self': {'href': '/addresses/0a3f5...,villa,8dcc6f60-57c2-4daa-ba12-343ec30c83e1,https://www.danbolig.dk?propertyid=2690000038&...,"{'lat': 55.374687, 'lon': 10.363591, 'type': '...",1,"{'imageSources': [{'size': {'height': 80, 'wid...",Velkommen til Rosenvænget 23 – et rigtig dejli...,"Skønne rammer, masser af plads og dejlig belig...",...,"{'downPayment': 275000, 'grossMortgage': 28145...",{'_links': {'self': {'href': '/realtors/5ddfef...,rosenvaenget-23-5250-odense-sv-04616723__23___...,open,627,3,294.45,1948,"{'date': '2022-08-07T09:00:00Z', 'duration': 3...",128.0
3,{'self': {'href': '/cases/6f770ac3-9c6f-4737-8...,{'_links': {'self': {'href': '/addresses/71dbc...,villa,6f770ac3-9c6f-4737-86ce-c3f9e1299d97,http://www.nybolig.dk/maegler/pages/property-p...,"{'lat': 55.76669, 'lon': 9.543022, 'type': 'EP...",1,"{'imageSources': [{'size': {'height': 80, 'wid...","Denne moderne villa har mange kvaliteter, og m...",,...,"{'downPayment': 165000, 'grossMortgage': 16644...",{'_links': {'self': {'href': '/realtors/134128...,niels-bisteds-vej-1-7100-vejle-06303014___1___...,open,294,2,165.0,2015,,
4,{'self': {'href': '/cases/44253f24-a3f5-4f92-a...,{'_links': {'self': {'href': '/addresses/0a3f5...,villa,44253f24-a3f5-4f92-a6fe-6764f8eef8a9,https://www.kleinadamsen.dk/bolig/havdrup-4622...,"{'lat': 55.54057, 'lon': 12.116872, 'type': 'E...",2,"{'imageSources': [{'size': {'height': 80, 'wid...",Lige på hjørnet af Vinkelvej og Hovedgaden fin...,Charmerende villa fra 1897.,...,"{'downPayment': 150000, 'grossMortgage': 14181...",{'_links': {'self': {'href': '/realtors/466c3a...,hovedgaden-42-4622-havdrup-02692777__42_______,open,139,1,129.6,1897,,
5,{'self': {'href': '/cases/e5ec3c73-24f8-47f7-a...,{'_links': {'self': {'href': '/addresses/0a3f5...,villa,e5ec3c73-24f8-47f7-a92d-0cebb5ef52f9,https://www.danbolig.dk?propertyid=0140000584&...,"{'lat': 55.669468, 'lon': 11.76874, 'type': 'E...",2,"{'imageSources': [{'size': {'height': 80, 'wid...",Glæd jer til at udforske denne rummelige villa...,Stor familievilla på i alt 270 m2 - i Arnakke ...,...,"{'downPayment': 145000, 'grossMortgage': 14592...",{'_links': {'self': {'href': '/realtors/ea7520...,arnakkegaards-alle-7-4390-vipperoed-03160062__...,open,154,0,195.5,1973,,135.0
6,{'self': {'href': '/cases/f4f2efd4-3f02-4407-b...,{'_links': {'self': {'href': '/addresses/0a3f5...,villa,f4f2efd4-3f02-4407-b6bf-e18aac66e217,https://www.danbolig.dk?propertyid=1770000180&...,"{'lat': 55.524612, 'lon': 12.1674, 'type': 'EP...",2,"{'imageSources': [{'size': {'height': 80, 'wid...",På Møllebakken 10 i Jersie kan I nu overtage n...,150 m2 villa med 100 m2 god kælder i skønt omr...,...,"{'downPayment': 215000, 'grossMortgage': 20862...",{'_links': {'self': {'href': '/realtors/9fe16d...,moellebakken-10-2680-solroed-strand-02695188__...,open,234,2,190.0,1967,,100.0
7,{'self': {'href': '/cases/c7a3e8b0-8eca-432a-a...,{'_links': {'self': {'href': '/addresses/0a3f5...,villa,c7a3e8b0-8eca-432a-adf8-fd26e8539972,https://www.danbolig.dk?propertyid=0350000205&...,"{'lat': 55.845303, 'lon': 12.43989, 'type': 'E...",2,"{'imageSources': [{'size': {'height': 80, 'wid...",Vandtårnsvej ligger centralt i nærheden af bym...,Charme og selvtillid ved Rude Skov!,...,"{'downPayment': 390000, 'grossMortgage': 39884...",{'_links': {'self': {'href': '/realtors/3c9ae3...,vandtaarnsvej-10-3460-birkeroed-02300682__10__...,open,486,1,217.1,1950,"{'date': '2022-08-07T13:00:00Z', 'duration': 3...",69.0
8,{'self': {'href': '/cases/aa8ad55c-9b55-4308-8...,{'_links': {'self': {'href': '/addresses/0a3f5...,villa,aa8ad55c-9b55-4308-8e4f-b500e153280a,https://www.danbolig.dk?propertyid=0350000290&...,"{'lat': 55.859886, 'lon': 12.428505, 'type': '...",3,"{'imageSources': [{'size': {'height': 80, 'wid...",Få skridt fra Sjælsø og Eskemose Skov ligger d...,Stor familievilla ved Sjælsø,...,"{'downPayment': 390000, 'grossMortgage': 38204...",{'_links': {'self': {'href': '/realtors/3c9ae3...,groendalsvaenge-3b-3460-birkeroed-02300218__3b...,open,303,0,180.25,1984,"{'date': '2022-08-07T11:00:00Z', 'duration': 3...",
9,{'self': {'href': '/cases/4632e9ad-317f-45bc-a...,{'_links': {'self': {'href': '/addresses/0a3f5...,condo,4632e9ad-317f-45bc-a24c-f9a646a0a4e9,https://www.danbolig.dk?propertyid=0350000305&...,"{'lat': 55.83866, 'lon': 12.4127, 'type': 'EPS...",3,"{'imageSources': [{'size': {'height': 80, 'wid...",Lejligheden er beliggende i et eftertragtet om...,"Indbydende, lys lejlighed i Birkerød",...,"{'downPayment': 110000, 'grossMortgage': 11283...",{'_links': {'self': {'href': '/realtors/3c9ae3...,lyngborghave-2b-2-th-3460-birkeroed-02300393__...,open,166,0,72.0,1975,"{'date': '2022-08-07T09:00:00Z', 'duration': 2...",
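
Several columns above (`address`, `coordinates`, `realEstate`) still hold nested dictionaries. One way to turn them into ordinary columns is to flatten each record before building the DataFrame. A sketch in plain Python with a hypothetical `flatten` helper and a made-up record that only mimics the shape of the API response; pandas offers `pd.json_normalize` for the same purpose.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one dict with dotted keys; other values are kept as-is."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Made-up record shaped like one entry of result['cases']
case = {
    "addressType": "villa",
    "coordinates": {"lat": 55.5, "lon": 12.1},
    "realEstate": {"downPayment": 150000},
}
flat_case = flatten(case)
```

With this helper, `pd.DataFrame([flatten(c) for c in result_properties])` would give one column per nested field.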


#### Exercise

In [76]:
links = []
for offset in range(0,5*20,20):
    url = f'https://job.jobnet.dk/CV/FindWork/Search?offset={offset}'   # In the browser's Network tab under Fetch/XHR, look at the longest entry and find its Request URL
    links.append(url)
    
logfile = 'log3.csv'
list_htmls = []
jobs_first100 = pd.DataFrame()

for url in tqdm.tqdm(links):
    try:
        response = requests.get(url, headers={'name':'Siyi','email':'wasariii@outlook.com'})
    except Exception as e:
        print(url) #Print url
        print(e) #Print error
        jobs_first100.to_csv('jobs_first100.csv') #Save the dataframe as a csv file to retrieve at another time
        continue #Continue to next iteration of the loop
    
    if response.ok: #Check that the request succeeded (status code below 400)
        result_json = response.json() #If it succeeded, parse the JSON body of the response
    else: #If the request failed, print the status_code and continue to next iteration of the loop
        print(response.status_code)
        continue
    
    result_df = pd.DataFrame(result_json['JobPositionPostings']) #Key found in the browser under Network > Search > Preview
    jobs_first100 = pd.concat([jobs_first100,result_df], axis=0, ignore_index=True) #Append to the rest of the data
    log(response, logfile)
    time.sleep(0.5) #Sleep for 0.5 seconds
jobs_first100

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.03s/it]


Unnamed: 0,AutomatchType,Abroad,Weight,Title,JobHeadline,Presentation,HiringOrgName,WorkPlaceAddress,WorkPlacePostalCode,WorkPlaceCity,...,HiringOrgCVR,UserLoggedIn,AnonymousEmployer,ShareUrl,DetailsUrl,JobLogUrl,HasLocationValues,ID,Latitude,Longitude
0,0,False,1.0,Distributionschauffør søges til kørsel på Sjæl...,Distributionschauffør søges til kørsel på Sjæl...,Vi udvider vores team igen!\nFrederiksen Trans...,FREDERIKSEN TRANSPORT ApS,Solvangsvej 2,4681,Herfølge,...,33381085,False,False,https://job.jobnet.dk/CV/FindWork/DetailsSocia...,https://job.jobnet.dk/CV/FindWork/Details/5613790,https://job.jobnet.dk/CV/FindWork/Details/5613790,True,5613790,55.4229,12.1464
1,0,False,1.0,Dansk- og klasselærer til mellemtrinnet Øster­...,Dansk- og klasselærer til mellemtrinnet Øster­...,Vi har en stilling som dansk og klasselærer i ...,Roskilde Kommune - Østervangsskolen,Astersvej 15,4000,Roskilde,...,29189404,False,False,https://job.jobnet.dk/CV/FindWork/DetailsSocia...,https://job.jobnet.dk/CV/FindWork/Details/5652952,https://job.jobnet.dk/CV/FindWork/Details/5652952,True,5652952,55.6290,12.0977
2,0,False,1.0,Vaccinationscenter - Daglig leder,Vaccinationscenter - Daglig leder,Daglig leder til COVID-19 Vaccinationscentre\n...,VaccineDanmark.dk ApS,Hedegaardsvej 88,2300,København S,...,42460699,False,False,https://job.jobnet.dk/CV/FindWork/DetailsSocia...,https://job.jobnet.dk/CV/FindWork/Details/5652951,https://job.jobnet.dk/CV/FindWork/Details/5652951,True,5652951,55.6459,12.6421
3,0,False,1.0,Pædagog med Teamkoordinator funktion søges til...,Pædagog med Teamkoordinator funktion søges til...,"Trives du med udvikling, har du overblik, bræn...",Roskilde Kommune - Børnehuset Solstrålen,Ørstedvej 61B,4130,Viby Sjælland,...,29189404,False,False,https://job.jobnet.dk/CV/FindWork/DetailsSocia...,https://job.jobnet.dk/CV/FindWork/Details/5652575,https://job.jobnet.dk/CV/FindWork/Details/5652575,True,5652575,55.5470,12.0340
4,0,False,1.0,Teamleder Kongevejens Børnehus,Teamleder Kongevejens Børnehus,Kongevejens Børnehus søger en fagligt engagere...,Ikast-Brande Kommune,Kongevejen 1,7430,Ikast,...,29189617,False,False,https://job.jobnet.dk/CV/FindWork/DetailsSocia...,https://job.jobnet.dk/CV/FindWork/Details/5652950,https://job.jobnet.dk/CV/FindWork/Details/5652950,True,5652950,56.1359,9.1574
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,False,1.0,Institut for Psykologi - KU,Institut for Psykologi - KU,Studiementor søges snarest til kvindelig stude...,DUOS A/S,,2200,København N,...,25477154,False,False,https://job.jobnet.dk/CV/FindWork/DetailsSocia...,https://job.jobnet.dk/CV/FindWork/Details/5652216,https://job.jobnet.dk/CV/FindWork/Details/5652216,True,5652216,55.6941,12.5492
96,0,False,1.0,Lagermedarbejder til moderne lager i Taastrup,Lagermedarbejder til moderne lager i Taastrup,"Er du omhyggelig, mødestabil og har tidligere ...",StudentConsulting,,2630,Taastrup,...,31085403,False,False,https://job.jobnet.dk/CV/FindWork/DetailsSocia...,https://job.jobnet.dk/CV/FindWork/Details/5652214,https://job.jobnet.dk/CV/FindWork/Details/5652214,True,5652214,55.6579,12.2808
97,0,False,1.0,Manager,Manager,\n \nBurger King Drejeb...,Burger King Danmark,Drejebænken Drejebænken,5260,Odense S,...,99999999,False,False,https://job.jobnet.dk/CV/FindWork/DetailsSocia...,https://job.jobnet.dk/CV/FindWork/Details/5652190,https://job.jobnet.dk/CV/FindWork/Details/5652190,True,5652190,55.3494,10.3992
98,0,False,1.0,Fast afløser hos privat hjemmepleje i Frederik...,Fast afløser hos privat hjemmepleje i Frederik...,"Vi bestræber os på, at vores borgere bliver mø...",AjourCare Aps,Frederikssund,3600,Frederikssund,...,34478953,False,False,https://job.jobnet.dk/CV/FindWork/DetailsSocia...,https://job.jobnet.dk/CV/FindWork/Details/5652164,https://job.jobnet.dk/CV/FindWork/Details/5652164,True,5652164,55.8548,12.0792
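
The loop above fixes the number of pages in advance. When the total is unknown, one option is to keep increasing the offset until a page comes back empty. A sketch, with a hypothetical `fetch_page(offset)` callable standing in for the request-and-parse step; the fake data source below is made up.

```python
def collect_all(fetch_page, page_size=20, max_pages=1000):
    """Request pages at offsets 0, page_size, 2*page_size, ... until one is empty."""
    records = []
    for page in range(max_pages):  # hard cap so a misbehaving API cannot loop forever
        batch = fetch_page(page * page_size)
        if not batch:
            break
        records.extend(batch)
    return records


# Fake data source with 45 postings, served 20 at a time
postings = [{"ID": i} for i in range(45)]


def fake_fetch_page(offset):
    return postings[offset:offset + 20]


all_jobs = collect_all(fake_fetch_page)
```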


In [89]:
jobs_first100.groupby('OccupationArea')['ID'].count().sort_values(ascending=False).head(10)

OccupationArea
Pædagogisk, socialt og kirkeligt arbejde      15
Sundhed, omsorg og personlig pleje            13
Ledelse                                       12
Hotel, restauration, køkken, kantine           9
Akademisk arbejde                              9
Kontor, administration, regnskab og finans     6
Salg, indkøb og markedsføring                  6
Undervisning og vejledning                     6
Bygge og anlæg                                 4
It og teleteknik                               3
Name: ID, dtype: int64