## The Web Scraping Recipe

Scraping information from the web involves three steps:
1. **MAPPING**: Finding the URLs of the pages that contain the information you want.
2. **DOWNLOAD**: Fetching the pages via HTTP.
3. **PARSE**: Extracting the information from the HTML.

You could also add steps such as `connection`, `storing`, `logging`, etc.
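
The three steps can be sketched as a tiny pipeline. This is a minimal offline illustration: the function names (`map_urls`, `download`, `parse_title`), the `example.com` URL and the `fake_fetch` stand-in are assumptions made for this example and are not part of the course material.

```python
import re

def map_urls(base_url, n_pages):
    """MAPPING: build the list of page URLs to visit."""
    return [f"{base_url}?page={page}" for page in range(1, n_pages + 1)]

def download(urls, fetch):
    """DOWNLOAD: call `fetch(url)` for each URL and collect the HTML strings."""
    return [fetch(url) for url in urls]

def parse_title(html):
    """PARSE: pull the <title> text out of one HTML string (None if missing)."""
    match = re.search(r"<title>(.*?)</title>", html, flags=re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

# Hypothetical stand-in for an HTTP call so the sketch runs without a network.
def fake_fetch(url):
    return f"<html><head><title>Listing for {url}</title></head></html>"

urls = map_urls("https://example.com/jobs", 3)
pages = download(urls, fake_fetch)
titles = [parse_title(page) for page in pages]
print(titles)
```

In a real scraper, `fetch` could be something like `lambda url: requests.get(url).text`.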
   


### Packages used
Today we will mainly build on the Python skills you have acquired so far; tomorrow we will look into more specialized packages.

* for connecting to the internet we use: **requests**
* for parsing: **beautifulsoup** and **regex**
* for automatic browsing / screen scraping: **selenium**
* for mitigating errors (by pausing between requests) we use: **time**

We will write our scrapers in basic Python. For larger projects, consider looking into the package **scrapy**.
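
Regex is listed above for parsing, so here is a minimal sketch of what that can look like with the standard-library `re` module. The HTML snippet and the text are made up for illustration.

```python
import re

html = '<a href="/jobs/1">Analyst</a> <a class="x" href="/jobs/2">Engineer</a>'

# Capture the value of every href="..." attribute.
links = re.findall(r'href="([^"]+)"', html)
print(links)  # ['/jobs/1', '/jobs/2']

# Capture numbers (optionally with a dot) from free text.
text = "Found 1.234 jobs, 56 of them new"
numbers = re.findall(r"\d+(?:\.\d+)?", text)
print(numbers)  # ['1.234', '56']
```

Regex works well for small, regular patterns like these. For nested HTML, a parser such as BeautifulSoup is usually the safer choice.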

In [42]:
import pandas as pd
import numpy as np
import seaborn as sns
import yfinance as yf
import os

In [43]:
import requests
from bs4 import BeautifulSoup
import re
import selenium
import time

import tqdm
import json

In [45]:
# NBA website
url = 'https://www.basketball-reference.com/leagues/NBA_2018.html' # link to the website
dfs = pd.read_html(url) # parses all tables found on the page.
dfs[1]

Unnamed: 0,Western Conference,W,L,W/L%,GB,PS/G,PA/G,SRS
0,Houston Rockets*,65,17,0.793,—,112.4,103.9,8.21
1,Golden State Warriors*,58,24,0.707,7.0,113.5,107.5,5.79
2,Portland Trail Blazers*,49,33,0.598,16.0,105.6,103.0,2.6
3,Oklahoma City Thunder*,48,34,0.585,17.0,107.9,104.4,3.42
4,Utah Jazz*,48,34,0.585,17.0,104.1,99.8,4.47
5,New Orleans Pelicans*,48,34,0.585,17.0,111.7,110.4,1.48
6,San Antonio Spurs*,47,35,0.573,18.0,102.7,99.8,2.89
7,Minnesota Timberwolves*,47,35,0.573,18.0,109.5,107.3,2.35
8,Denver Nuggets,46,36,0.561,19.0,110.0,108.5,1.57
9,Los Angeles Clippers,42,40,0.512,23.0,109.0,109.0,0.15


In [46]:
url = 'https://www.basketball-reference.com/leagues/NBA_2018.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('h2')[0].text

'Conference Standings'

In [47]:
# Mapping exercise - job listings website
links = []
for page in range(1,6,1):
    url = f'https://www.jobindex.dk/jobsoegning?page={page}'
    links.append(url)
links

['https://www.jobindex.dk/jobsoegning?page=1',
 'https://www.jobindex.dk/jobsoegning?page=2',
 'https://www.jobindex.dk/jobsoegning?page=3',
 'https://www.jobindex.dk/jobsoegning?page=4',
 'https://www.jobindex.dk/jobsoegning?page=5']
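
When a URL needs more than one query parameter, building the query string with `urllib.parse.urlencode` avoids manual escaping. A minimal sketch follows; `build_links` is a helper invented for this example, and the `q` search parameter is hypothetical (it only shows how a value with a space is encoded).

```python
from urllib.parse import urlencode

base_url = "https://www.jobindex.dk/jobsoegning"

def build_links(base_url, n_pages, **extra_params):
    """Return one URL per page, with any extra query parameters encoded safely."""
    links = []
    for page in range(1, n_pages + 1):
        params = {"page": page, **extra_params}
        links.append(f"{base_url}?{urlencode(params)}")
    return links

links_sketch = build_links(base_url, 3)
print(links_sketch)

# `q` is a hypothetical parameter name, used only to show the encoding.
print(build_links(base_url, 1, q="data scientist"))
```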

In [48]:
headers = {'name':'Siyi','email':'wasariii@outlook.com'} #Identify yourself to the site owner
list_htmls = []
for url in tqdm.tqdm(links): #Track the time left before completing the loop
    response = requests.get(url, headers=headers) #Send the headers with every request
    html = response.text
    list_htmls.append(html)
    time.sleep(0.5) #Sleep for 0.5 seconds


100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:06<00:00,  1.21s/it]


#### Log your activities

In [49]:
# Define the log function to gather the log information
def log(response,logfile,output_path=os.getcwd()):
    # Create the csv file with headers for the log variables if it does not exist yet
    if not os.path.isfile(logfile):
        with open(logfile,'w') as f:
            header = ['timestamp','status_code','length','output_file']
            f.write(';'.join(header) + "\n") #Make the headers and jump to new line
        
    # Gather log information
    status_code = response.status_code #Status code from the request result
    timestamp = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())) #Local time
    length = len(response.text) #Length of the HTML-string
    
    # Open the log file and append the gathered log information
    with open(logfile,'a') as f: #The with-block closes the file again afterwards
        f.write(f'{timestamp};{status_code};{length};{output_path}' + "\n") #Append the information and jump to new line

In [50]:
# Write to the log file
list_htmls = []
logfile = 'log.csv'
for url in tqdm.tqdm(links):
    response = requests.get(url)
    html = response.text
    list_htmls.append(html)
    time.sleep(0.5)
    log(response,logfile)

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.18s/it]
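
Once a log exists, it can be read back to check how the requests went. The sketch below assumes the `;`-separated layout written by `log` above. The `summarise_log` helper and the example rows are made up for illustration.

```python
import csv
import os
import tempfile
from collections import Counter

def summarise_log(logfile):
    """Count how often each status code appears in a ';'-separated log file."""
    with open(logfile, newline="") as f:
        reader = csv.DictReader(f, delimiter=";")
        return Counter(row["status_code"] for row in reader)

# Write a small example log so the sketch runs on its own (made-up rows).
example_path = os.path.join(tempfile.mkdtemp(), "example_log.csv")
with open(example_path, "w", newline="") as f:
    f.write("timestamp;status_code;length;output_file\n")
    f.write("2022-08-05 10:00:00;200;51234;/tmp\n")
    f.write("2022-08-05 10:00:01;200;49871;/tmp\n")
    f.write("2022-08-05 10:00:02;429;312;/tmp\n")

status_counts = summarise_log(example_path)
print(status_counts)
```

A sudden run of non-200 codes in such a summary is a hint that the site has started blocking or throttling the scraper.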


In [51]:
list_htmls = []
for url in tqdm.tqdm(links):
    try:
        response = requests.get(url)
    except Exception as e:
        print(url) #Print url
        print(e) #Print error
        with open("list_htmls.json", "w") as l: #Save the list_htmls as a json file to retrieve at another time
            json.dump(list_htmls, l)
        continue #Continue to next iteration of the loop
    html = response.text
    list_htmls.append(html)
    time.sleep(0.5) #Sleep for 0.5 seconds

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.14s/it]
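
The loop above skips a URL as soon as one request fails. A common alternative is to retry a few times with growing pauses before giving up. Below is a minimal sketch of that idea; `fetch_with_retries` and the `flaky_fetch` stand-in are hypothetical names, and the fetch function is passed in so the example runs without a network.

```python
import time

def fetch_with_retries(fetch, url, max_attempts=3, base_delay=0.5):
    """Call `fetch(url)`, retrying with exponentially growing pauses on errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as error:
            if attempt == max_attempts:
                raise  # Give up and let the caller decide what to do
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed for {url}: {error}. Retrying in {delay}s")
            time.sleep(delay)

# Hypothetical flaky fetcher: fails twice, then succeeds.
calls = {"count": 0}
def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"

html_sketch = fetch_with_retries(flaky_fetch, "https://example.com", base_delay=0.01)
print(html_sketch, calls["count"])
```

With requests, `fetch` could be `lambda u: requests.get(u, timeout=10)`, so that a hanging connection also raises an error instead of blocking the loop.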


In [52]:

response = requests.get('https://www.boligsiden.dk/tilsalg', headers={'name':'Siyi','email':'wasariii@outlook.com'})
list_htmls = []
for url in tqdm.tqdm(links): #Note: links still holds the jobindex URLs from the mapping step
    response = requests.get(url)
    html = response.text
    list_htmls.append(html)
    time.sleep(0.5) #Sleep for 0.5 seconds

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.16s/it]


In [53]:
response = requests.get('https://api.prod.bs-aws-stage.com/search/cases?addressTypes=villa%2Ccondo%2Cterraced+house%2Choliday+house%2Ccooperative%2Cfarm%2Chobby+farm%2Cfull+year+plot%2Cvilla+apartment%2Choliday+plot&per_page=50&page=1&highlighted=true&sortAscending=true&sortBy=timeOnMarket')
result = response.json()

result_properties = result['cases']
data = pd.DataFrame(result_properties)
data

Unnamed: 0,_links,address,addressType,caseID,caseUrl,coordinates,daysOnMarket,defaultImage,descriptionBody,descriptionTitle,...,realEstate,realtor,slug,status,totalClickCount,totalFavourites,weightedArea,yearBuilt,lotArea,basementArea
0,{'self': {'href': '/cases/02f5c770-36c7-4add-a...,{'_links': {'self': {'href': '/addresses/0a3f5...,condo,02f5c770-36c7-4add-aca9-ef1f1cc1b3ff,http://www.paulun.dk/bolig-redirect?mgl=2640&s...,"{'lat': 55.66615, 'lon': 12.617624, 'type': 'E...",1,"{'imageSources': [{'size': {'height': 80, 'wid...",Rigtig god beliggenhed! Lejligheden er beligge...,INDFLYTNINGSKLAR 2-VÆRELSET MED SUPER BELIGGEN...,...,"{'downPayment': 150000, 'grossMortgage': 14684...",{'_links': {'self': {'href': '/realtors/0f76bb...,bremensgade-23-2-tv-2300-koebenhavn-s-01010828...,open,173,2,55.0,1939,,
1,{'self': {'href': '/cases/09777162-a79c-4c13-a...,{'_links': {'self': {'href': '/addresses/0a3f5...,condo,09777162-a79c-4c13-aafa-697c772fb16f,https://www.realmaeglerne.dk/301-redirect/?mgl...,"{'lat': 55.126675, 'lon': 12.064598, 'type': '...",2,"{'imageSources': [{'size': {'height': 80, 'wid...",4-V ejerlejlighed opført som et rækkehus og pe...,1. RÆKKE TIL FJORDEN - INDFLYTNINGSKLART,...,"{'downPayment': 105000, 'grossMortgage': 10644...",{'_links': {'self': {'href': '/realtors/e799ab...,kirsebaervej-34-4720-praestoe-03900761__34_______,open,562,1,98.0,1981,3431.0,
2,{'self': {'href': '/cases/9724f5da-480e-45ef-8...,{'_links': {'self': {'href': '/addresses/0a3f5...,terraced house,9724f5da-480e-45ef-8065-a17274fbe24f,http://www.nybolig.dk/maegler/pages/property-p...,"{'lat': 55.30157, 'lon': 11.540407, 'type': 'E...",2,"{'imageSources': [{'size': {'height': 80, 'wid...",Drømmer I om et dejligt hus med en god beligge...,Velindrettet halvt dobbelthus i Fuglebjerg,...,"{'downPayment': 65000, 'grossMortgage': 6553, ...",{'_links': {'self': {'href': '/realtors/cb08f4...,dalsgaardsvej-2-4250-fuglebjerg-03700249___2__...,open,129,0,103.2,1980,618.0,
3,{'self': {'href': '/cases/bef77f06-1be8-4149-9...,{'_links': {'self': {'href': '/addresses/0a3f5...,villa,bef77f06-1be8-4149-9966-c50a5ed57faf,https://home.dk/sag/6210000652,"{'lat': 56.302917, 'lon': 9.590847, 'type': 'E...",2,"{'imageSources': [{'size': {'height': 80, 'wid...","I roligt lukket kvarter, tæt ved masser af nat...",Velbeliggende i roligt attraktivt kvarter,...,"{'downPayment': 75000, 'grossMortgage': 7361, ...",{'_links': {'self': {'href': '/realtors/c18587...,solsortevej-37-8643-ans-by-07401660__37_______,open,209,2,133.9,1978,961.0,
4,{'self': {'href': '/cases/8dcc6f60-57c2-4daa-b...,{'_links': {'self': {'href': '/addresses/0a3f5...,villa,8dcc6f60-57c2-4daa-ba12-343ec30c83e1,https://www.danbolig.dk?propertyid=2690000038&...,"{'lat': 55.374687, 'lon': 10.363591, 'type': '...",2,"{'imageSources': [{'size': {'height': 80, 'wid...",Velkommen til Rosenvænget 23 – et rigtig dejli...,"Skønne rammer, masser af plads og dejlig belig...",...,"{'downPayment': 275000, 'grossMortgage': 28102...",{'_links': {'self': {'href': '/realtors/5ddfef...,rosenvaenget-23-5250-odense-sv-04616723__23___...,open,684,3,294.45,1948,612.0,128.0
5,{'self': {'href': '/cases/fd5bf741-560c-41ec-b...,{'_links': {'self': {'href': '/addresses/0a3f5...,villa,fd5bf741-560c-41ec-bc23-b0cdd5db8703,https://www.danbolig.dk?propertyid=0140000547&...,"{'lat': 55.71455, 'lon': 11.739444, 'type': 'E...",2,"{'imageSources': [{'size': {'height': 80, 'wid...",En omfattende renovering har forvandlet denne ...,Renoveret villa med attraktiv underetage,...,"{'downPayment': 190000, 'grossMortgage': 19462...",{'_links': {'self': {'href': '/realtors/ea7520...,munkholmvej-105-4300-holbaek-03161036_105_______,open,468,1,140.9,1913,631.0,64.0
6,{'self': {'href': '/cases/7a564b33-eb4f-43fc-b...,{'_links': {'self': {'href': '/addresses/0a3f5...,terraced house,7a564b33-eb4f-43fc-b008-4574d14c8347,https://www.danbolig.dk?propertyid=0140000619&...,"{'lat': 55.713028, 'lon': 11.730439, 'type': '...",2,"{'imageSources': [{'size': {'height': 80, 'wid...","I et roligt kvarter midt i Holbæk, tæt på grøn...",Yderst velholdt rækkehus centralt i Holbæk,...,"{'downPayment': 140000, 'grossMortgage': 14108...",{'_links': {'self': {'href': '/realtors/ea7520...,alfred-hansens-have-12-4300-holbaek-03160035__...,open,160,0,94.1,2004,96.0,
7,{'self': {'href': '/cases/6f770ac3-9c6f-4737-8...,{'_links': {'self': {'href': '/addresses/71dbc...,villa,6f770ac3-9c6f-4737-86ce-c3f9e1299d97,http://www.nybolig.dk/maegler/pages/property-p...,"{'lat': 55.76669, 'lon': 9.543022, 'type': 'EP...",2,"{'imageSources': [{'size': {'height': 80, 'wid...","Denne moderne villa har mange kvaliteter, og m...",,...,"{'downPayment': 165000, 'grossMortgage': 16644...",{'_links': {'self': {'href': '/realtors/134128...,niels-bisteds-vej-1-7100-vejle-06303014___1___...,open,337,3,165.0,2015,1027.0,
8,{'self': {'href': '/cases/44253f24-a3f5-4f92-a...,{'_links': {'self': {'href': '/addresses/0a3f5...,villa,44253f24-a3f5-4f92-a6fe-6764f8eef8a9,https://www.kleinadamsen.dk/bolig/havdrup-4622...,"{'lat': 55.54057, 'lon': 12.116872, 'type': 'E...",3,"{'imageSources': [{'size': {'height': 80, 'wid...",Lige på hjørnet af Vinkelvej og Hovedgaden fin...,Charmerende villa fra 1897.,...,"{'downPayment': 150000, 'grossMortgage': 14181...",{'_links': {'self': {'href': '/realtors/466c3a...,hovedgaden-42-4622-havdrup-02692777__42_______,open,153,2,129.6,1897,506.0,
9,{'self': {'href': '/cases/e5ec3c73-24f8-47f7-a...,{'_links': {'self': {'href': '/addresses/0a3f5...,villa,e5ec3c73-24f8-47f7-a92d-0cebb5ef52f9,https://www.danbolig.dk?propertyid=0140000584&...,"{'lat': 55.669468, 'lon': 11.76874, 'type': 'E...",3,"{'imageSources': [{'size': {'height': 80, 'wid...",Glæd jer til at udforske denne rummelige villa...,Stor familievilla på i alt 270 m2 - i Arnakke ...,...,"{'downPayment': 145000, 'grossMortgage': 14592...",{'_links': {'self': {'href': '/realtors/ea7520...,arnakkegaards-alle-7-4390-vipperoed-03160062__...,open,175,0,195.5,1973,876.0,135.0
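
Several columns above (for example `coordinates`) still hold nested dictionaries. One way to get a flat table is to flatten each record before building the DataFrame. The `flatten` helper and the `case` record below are illustrative assumptions that only mimic the shape of one entry in `result['cases']`.

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts: {'a': {'b': 1}} becomes {'a_b': 1}."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

# Made-up record that mimics the shape of one entry in result['cases'].
case = {
    "addressType": "condo",
    "daysOnMarket": 1,
    "coordinates": {"lat": 55.66615, "lon": 12.617624},
}
flat_case = flatten(case)
print(flat_case)
```

The flattened records could then be passed to `pd.DataFrame([flatten(c) for c in result_properties])`. pandas also ships `pd.json_normalize`, which does a similar job directly on a list of nested records.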


#### Exercise

In [54]:
links = []
for offset in range(0,5*20,20):
    url = f'https://job.jobnet.dk/CV/FindWork/Search?offset={offset}'   # In the browser's Network tab, look at the longest Fetch/XHR entry and use its Request URL
    links.append(url)
    
logfile = 'log3.csv'
list_htmls = []
jobs_first100 = pd.DataFrame()

for url in tqdm.tqdm(links):
    try:
        response = requests.get(url, headers={'name':'Siyi','email':'wasariii@outlook.com'})
    except Exception as e:
        print(url) #Print url
        print(e) #Print error
        jobs_first100.to_csv('jobs_first100.csv') #Save the dataframe as a csv file to retrieve at another time
        continue #Continue to next iteration of the loop
    
    if response.ok: #Check that the request succeeded (status code below 400)
        result_json = response.json() #If it did, parse the JSON body
    else: #Otherwise print the status_code and continue to next iteration of the loop
        print(response.status_code)
        continue
    
    result_df = pd.DataFrame(result_json['JobPositionPostings']) # Key found under Network > Search > Preview in the browser
    jobs_first100 = pd.concat([jobs_first100,result_df], axis=0, ignore_index=True) #Append to the rest of the data
    log(response, logfile)
    time.sleep(0.5) #Sleep for 0.5 seconds
jobs_first100

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.11it/s]


Unnamed: 0,AutomatchType,Abroad,Weight,Title,JobHeadline,Presentation,HiringOrgName,WorkPlaceAddress,WorkPlacePostalCode,WorkPlaceCity,...,HiringOrgCVR,UserLoggedIn,AnonymousEmployer,ShareUrl,DetailsUrl,JobLogUrl,HasLocationValues,ID,Latitude,Longitude
0,0,False,1.0,Næstkommanderende til Jægerkorpsets operative ...,Næstkommanderende til Jægerkorpsets operative ...,Kan du have mange bolde i luften samtidig med ...,Flyvestation Aalborg,Thisted Landevej 53,9430,Vadum,...,16287180,False,False,https://job.jobnet.dk/CV/FindWork/DetailsSocia...,https://job.jobnet.dk/CV/FindWork/Details/5653185,https://job.jobnet.dk/CV/FindWork/Details/5653185,True,5653185,57.1085,9.8574
1,0,False,1.0,Frisk gut på 52 år søger handicaphjælper - Viborg,Frisk gut på 52 år søger handicaphjælper - Viborg,OMRÅDE: 8800 Viborg\nLIDT OM MIG:Jeg er en fr...,HANDICAPFORMIDLINGEN ApS,,8800,Viborg,...,34605416,False,False,https://job.jobnet.dk/CV/FindWork/DetailsSocia...,https://job.jobnet.dk/CV/FindWork/Details/5653079,https://job.jobnet.dk/CV/FindWork/Details/5653079,True,5653079,56.4168,9.3690
2,0,False,1.0,Tilst Skole søger en engageret og faglig dygti...,Tilst Skole søger en engageret og faglig dygti...,Du skal have følgende profil:· Du har ...,Aarhus Kommune,Tåstumvænget 8,8381,Tilst,...,55133018,False,False,https://job.jobnet.dk/CV/FindWork/DetailsSocia...,https://job.jobnet.dk/CV/FindWork/Details/5653184,https://job.jobnet.dk/CV/FindWork/Details/5653184,True,5653184,56.1895,10.1132
3,0,False,1.0,Pædagogmedhjælper til Snejbjerg Skoles SFO,Pædagogmedhjælper til Snejbjerg Skoles SFO,Snejbjerg Skole søger en 19-timers pædagogmedh...,Herning Kommune,Snejbjerg Hovedgade 75,7400,Herning,...,29189919,False,False,https://job.jobnet.dk/CV/FindWork/DetailsSocia...,https://job.jobnet.dk/CV/FindWork/Details/5653071,https://job.jobnet.dk/CV/FindWork/Details/5653071,True,5653071,56.1283,8.8854
4,0,False,1.0,Børnehaveklasseassistent til Hedelyskolen,Børnehaveklasseassistent til Hedelyskolen,Hedelyskolen er en folkeskole med omkring 730 ...,Hedelyskolen,Drønnergårds Alle 30,2670,Greve,...,44023911,False,False,https://job.jobnet.dk/CV/FindWork/DetailsSocia...,https://job.jobnet.dk/CV/FindWork/Details/5653069,https://job.jobnet.dk/CV/FindWork/Details/5653069,True,5653069,55.5410,12.3529
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,False,1.0,Vi søger en socialrådgiver eller socialformidl...,Vi søger en socialrådgiver eller socialformidl...,Da en af vores dygtige rådgivere har søgt nye ...,"Center for Børn, Unge og Familier",Herlev Bygade 90,2730,Herlev,...,63640719,False,False,https://job.jobnet.dk/CV/FindWork/DetailsSocia...,https://job.jobnet.dk/CV/FindWork/Details/5652944,https://job.jobnet.dk/CV/FindWork/Details/5652944,True,5652944,55.7251,12.4326
96,0,False,1.0,#JobsForUkraine: Cleaningassistant to Hedenst...,#JobsForUkraine: Cleaningassistant to Hedenst...,6 Hours per week\nYou can take the cleaning be...,Dansk Koncept Service A/S,,8722,Hedensted,...,27268021,False,False,https://job.jobnet.dk/CV/FindWork/DetailsSocia...,https://job.jobnet.dk/CV/FindWork/Details/5652942,https://job.jobnet.dk/CV/FindWork/Details/5652942,True,5652942,55.7651,9.7113
97,0,False,1.0,#jobsForUkraine: Cleaning job in Billund Area...,#jobsForUkraine: Cleaning job in Billund Area...,\nEngelsk\nDo you like to clean? Then it´s YO...,Dansk Koncept Service A/S,,7190,Billund,...,27268021,False,False,https://job.jobnet.dk/CV/FindWork/DetailsSocia...,https://job.jobnet.dk/CV/FindWork/Details/5652940,https://job.jobnet.dk/CV/FindWork/Details/5652940,True,5652940,55.7379,9.1007
98,0,False,1.0,Social- og sundhedshjælper til dag- og aftenva...,Social- og sundhedshjælper til dag- og aftenva...,Motiveres du af at yde en høj kvalitet af plej...,Københavns Kommune,Borups Allé 177,2400,København NV,...,64942212,False,False,https://job.jobnet.dk/CV/FindWork/DetailsSocia...,https://job.jobnet.dk/CV/FindWork/Details/5653126,https://job.jobnet.dk/CV/FindWork/Details/5653126,True,5653126,55.6960,12.5253
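
The exercise stops after a fixed number of offsets. If the total number of postings is unknown, one option is to keep requesting pages until one comes back empty. This is a sketch under assumptions: `collect_all` and `fake_fetch_page` are hypothetical, and the fake data replaces the real API so the example runs offline.

```python
def collect_all(fetch_page, page_size=20, max_pages=50):
    """Request pages at increasing offsets until one comes back empty."""
    records = []
    for page_number in range(max_pages):
        offset = page_number * page_size
        batch = fetch_page(offset)
        if not batch:  # An empty page means there is nothing left
            break
        records.extend(batch)
    return records

# Hypothetical stand-in for the API: 45 fake postings served 20 at a time.
fake_postings = [{"ID": i} for i in range(45)]
def fake_fetch_page(offset):
    return fake_postings[offset:offset + 20]

postings = collect_all(fake_fetch_page)
print(len(postings))
```

A real `fetch_page` would send the request for the given offset, return `response.json()['JobPositionPostings']`, and sleep briefly between calls. The `max_pages` cap guards against looping forever if the endpoint never returns an empty page.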


In [55]:
jobs_first100.groupby(jobs_first100['OccupationArea'])['ID'].count().sort_values(ascending=False)[0:10]

OccupationArea
Sundhed, omsorg og personlig pleje               20
Akademisk arbejde                                18
Pædagogisk, socialt og kirkeligt arbejde         16
Ledelse                                           7
Salg, indkøb og markedsføring                     6
Transport, post, lager- og maskinførerarbejde     6
Kontor, administration, regnskab og finans        5
Bygge og anlæg                                    4
Undervisning og vejledning                        4
Industriel produktion                             3
Name: ID, dtype: int64

# Parsing the HTML with BeautifulSoup
## Learning by doing: Creating a dataset from www.dr.dk/nyheder/udland

### Let's put together some of the stuff we have learned so far
1. **Mapping:** In this exercise we will collect some URLs from webpages with news articles and save them into a list
2. **Downloading:** Then we will download the HTML content of the webpages
3. **Parsing:** Finally, we will collect the relevant information from each article

## 1. MAPPING

In [56]:
# Define our URL
url = 'https://www.dr.dk/nyheder/udland' 

# Connects to site
response = requests.get(url, headers={'name':'Siyi','email':'wasariii@outlook.com'})

# Parse data with BeautifulSoup
soup = BeautifulSoup(response.content,'lxml')

# Identify articles to scrape by inspecting site
articles = soup.find_all('div', class_ = 'dre-teaser-content') #(class_ is used because class is reserved in Python)

In [57]:
list_of_article_urls = []
# Creating a loop that appends the article url to the list above
for article in articles:
    list_of_article_urls.append(article.find('a')['href'])


list_of_article_urls_final = []
for link in list_of_article_urls:
    if '/nyheder/udland' in link: #All article URLs have this string in them, so we restrict on it being in the URL
        list_of_article_urls_final.append(link)
print(list_of_article_urls_final)

['/nyheder/udland/amerikansk-radiovaert-skal-betale-millionerstatning-loegne-om-skoleskyderi', '/nyheder/udland/ukrainsk-militaer-bryder-krigsregler-foerer-krig-fra-private-hjem-skoler-og', '/nyheder/udland/kina-laver-storstillet-militaeroevelse-omkring-taiwan-taiwan-goer-sig-klar-sig-til', '/nyheder/udland/drs-matilde-kimer-er-blevet-udvist-af-rusland', '/nyheder/udland/foerste-amerikanske-delstat-har-stemt-nej-til-fjerne-retten-til-fri-abort', '/nyheder/udland/pelosi-roser-taiwans-demokrati-mens-kina-sender-kampfly-paa-vingerne', '/nyheder/udland/taiwan-byder-pelosi-velkommen-med-aabne-arme-mens-kina-skruer-op-trusler', '/nyheder/udland/al-qaeda-leder-blev-draebt-i-diplomatkvarter-fem-minutters-gang-fra-tidligere-dansk', '/nyheder/udland/nancy-pelosi-trodser-kinesiske-advarsler-ankommer-til-taiwan-til-historisk-besoeg', '/nyheder/udland/corona-laeges-selvmord-saetter-gang-i-oestrigsk-debat-om-netchikane', '/nyheder/udland/puk-damsgaard-markant-og-bemaerkelsesvaerdigt-al-qaedas-draebt
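
The scraped links are relative and may contain duplicates. A minimal sketch of cleaning them with the standard library is shown below. The first link is taken from the output above, while the absolute link is a made-up example to show that `urljoin` leaves full URLs untouched.

```python
from urllib.parse import urljoin

base = "https://www.dr.dk"
relative_links = [
    "/nyheder/udland/drs-matilde-kimer-er-blevet-udvist-af-rusland",
    "/nyheder/udland/drs-matilde-kimer-er-blevet-udvist-af-rusland",  # duplicate
    "https://www.dr.dk/nyheder/udland/some-absolute-link",  # hypothetical absolute link
]

# dict.fromkeys keeps the first occurrence of each link and preserves order.
unique_links = list(dict.fromkeys(relative_links))
absolute_links = [urljoin(base, link) for link in unique_links]
print(absolute_links)
```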

## 2. DOWNLOADING + 3. PARSING

In [58]:
# We want to extract title, lead and time posted from the articles

# Creating empty lists for the information we want to extract for every article
title_list = []
lead_list = []
time_list = []

for i in range(10): #Use len(list_of_article_urls_final) instead of 10 to scrape all articles
    
    # This time we scrape for each news article in the url list we created before
    url = 'https://www.dr.dk' + list_of_article_urls_final[i] #The scraped links are relative, so we need to add the base url
    response = requests.get(url, headers={'name':'Siyi','email':'wasariii@outlook.com'})
    soup = BeautifulSoup(response.content,'lxml')
    
    # Append title to list
    temp = soup.find_all('h1')
    temp = temp[1]
    temp = temp.text.strip()
    title_list.append(temp)
    
    # Append lead to list
    temp = soup.find('p', class_='dre-article-title__summary')
    temp = temp.text.strip()
    lead_list.append(temp)

    # Append time posted to list
    temp = soup.find('time', class_='dre-byline__date')
    temp = temp['datetime']
    time_list.append(temp)

In [117]:
df = pd.DataFrame({'title':title_list, 'lead':lead_list, 'time':time_list})
df

Unnamed: 0,title,lead,time
0,Amerikansk radiovært skal betale millionerstat...,"Alex Jones har i årevis påstået, at massakren ...",2022-08-05T03:55:00+00:00
1,Ukrainsk militær bryder krigsregler - fører kr...,Amnesty International har undersøgt tre forske...,2022-08-04T11:48:00+00:00
2,Kina laver storstilet militærøvelse omkring Ta...,Kina er lige nu i gang med en af de mest omfat...,2022-08-04T08:52:00+00:00
3,DR's Matilde Kimer er blevet udvist af Rusland,Rusland slår hårdt ned på uafhængige medier og...,2022-08-03T16:29:00+00:00
4,Første amerikanske delstat har stemt 'nej' til...,Resultatet er en vigtigt sejr for tilhængere a...,2022-08-03T08:56:00+00:00
5,"Pelosi roser Taiwans demokrati, mens Kina send...",Den amerikanske toppolitikers uanmeldte besøg ...,2022-08-03T03:57:00+00:00
6,"Taiwan byder Pelosi velkommen med åbne arme, m...","USA skal støtte demokrati alle steder, skriver...",2022-08-02T18:53:00+00:00
7,Al-Qaeda-leder blev dræbt i diplomatkvarter - ...,Terrorleders tilstedeværelse midt i hovedstade...,2022-08-02T16:14:00+00:00
8,Nancy Pelosi trodser kinesiske advarsler: Anko...,Formanden for Repræsentanternes Hus er ankomme...,2022-08-02T14:52:00+00:00
9,Corona-læges selvmord sætter gang i østrigsk d...,"I flere byer blev der tændt lys for læge, der ...",2022-08-02T11:55:00+00:00
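
The `time` column holds ISO 8601 strings, which are easier to sort and compare once parsed. Below is a small sketch with the standard library, using two timestamps copied from the table above.

```python
from datetime import datetime

# Two timestamps copied from the `time` column above.
raw_times = ["2022-08-05T03:55:00+00:00", "2022-08-02T11:55:00+00:00"]

parsed = [datetime.fromisoformat(value) for value in raw_times]
newest, oldest = max(parsed), min(parsed)

print(newest.isoformat())
print(newest - oldest)  # A timedelta spanning the two articles
```

On the full DataFrame, `pd.to_datetime(df['time'])` would do the same conversion for the whole column.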


In [114]:
# what if we need the body?
url = 'https://www.dr.dk/nyheder/udland/gazprom-strammer-ifoelge-tyskland-skruen-uden-grund' 
response = requests.get(url, headers={'name':'Siyi','email':'wasariii@outlook.com'})
soup = BeautifulSoup(response.content,'lxml')
body = soup.find('div', class_ = 'dre-article-body')

'''
This body consists of both sections with text and figures. We want it all.
But sections and figures have different tags, so we cannot just use find_all to find all elements in the body.
Instead we can use .children. It finds all children of the element body:
'''

body_text = []
for child in body.children:
    body_text.append(child.text)
print(body_text)

['Gazprom halverer gasleverancerne til Europa via Nord Stream 1. Årsagen er ifølge selskabet vedligehold af en gasturbine. Den daglige gasforsyning via gasledningen vil fra onsdag morgen blive reduceret til 33 millioner kubikmeter, oplyser Gazprom.Det svarer til cirka 20 procent af den maksimale kapacitet, og det fremgår ikke, hvor længe den yderligt reducerede forsyning af gas vil stå på.', '', 'Den tyske regering anser den forklaringen om vedligeholdelse for at være opfundet til lejligheden.- Ifølge vores oplysninger er der ingen teknisk grund til en reduktion i leverancerne, siger en talskvinde for Finansministeriet og minister Robert Habeck til Frankfurter Allgemeine Zeitung.Tyskerne får 25 procent af deres energi fra gas, hvor en overvejende del er kommet fra Rusland.Gasprisen stiger med 10 procentDet er anden gang indenfor en uge, at Gazprom reducerer leverancen af gas under påskud af reperation af gasturbiner. Da Gazprom efter ti dages vedligehold i sidste uge genåbnede for gasf

In [61]:
'''
We have used .text to get the text of the HTML. The figure elements do not contain any text, so they will just be empty.
We can use .join() to join all the strings in the list. Just join it on an empty string:
# Note: joining on an empty string concatenates the pieces, so the empty '' entries simply disappear
'''

''.join(body_text)

'Gazprom halverer gasleverancerne til Europa via Nord Stream 1. Årsagen er ifølge selskabet vedligehold af en gasturbine. Den daglige gasforsyning via gasledningen vil fra onsdag morgen blive reduceret til 33 millioner kubikmeter, oplyser Gazprom.Det svarer til cirka 20 procent af den maksimale kapacitet, og det fremgår ikke, hvor længe den yderligt reducerede forsyning af gas vil stå på.Den tyske regering anser den forklaringen om vedligeholdelse for at være opfundet til lejligheden.- Ifølge vores oplysninger er der ingen teknisk grund til en reduktion i leverancerne, siger en talskvinde for Finansministeriet og minister Robert Habeck til Frankfurter Allgemeine Zeitung.Tyskerne får 25 procent af deres energi fra gas, hvor en overvejende del er kommet fra Rusland.Gasprisen stiger med 10 procentDet er anden gang indenfor en uge, at Gazprom reducerer leverancen af gas under påskud af reperation af gasturbiner. Da Gazprom efter ti dages vedligehold i sidste uge genåbnede for gasforsyninge
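
Joining on an empty string glues paragraphs together without any whitespace (note "Gazprom.Det" in the output above). A sketch of an alternative that drops the empty entries and keeps paragraph breaks follows. The list here is made up and only has the same shape as `body_text`.

```python
# Made-up list with the same shape as body_text: text blocks and empty figure slots.
body_text_example = ["First paragraph.", "", "Second paragraph.", "  ", "Third paragraph."]

# Keep only non-empty blocks and separate them with a blank line.
paragraphs = [block.strip() for block in body_text_example if block.strip()]
article_text = "\n\n".join(paragraphs)
print(article_text)
```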

### Exercise

In [133]:
url = 'https://www.basketball-reference.com/leagues/NBA_2018.html' 
response = requests.get(url, headers={'name':'Siyi','email':'wasariii@outlook.com'})
soup = BeautifulSoup(response.content,'lxml')
table_node = soup.find('div', class_ = 'table_wrapper')


print(table_node)

<div class="table_wrapper" id="all_confs_standings_E">
<div class="section_heading assoc_confs_standings_E" id="confs_standings_E_sh">
<span class="section_anchor" data-label="Conference Standings" id="confs_standings_E_link"></span><h2>Conference Standings</h2> <div class="section_heading_text">
<ul><li><small>* Playoff teams</small></li>
</ul>
</div>
</div>
<div class="table_container" id="div_confs_standings_E">
<table class="suppress_all sortable stats_table" data-cols-to-freeze=",1" id="confs_standings_E">
<caption>Conference Standings Table</caption>
<colgroup><col/><col/><col/><col/><col/><col/><col/><col/></colgroup>
<thead>
<tr>
<th aria-label="Eastern Conference" class="poptip sort_default_asc left" data-stat="team_name" scope="col">Eastern Conference</th>
<th aria-label="Wins" class="poptip right" data-stat="wins" data-tip="Wins" scope="col">W</th>
<th aria-label="Losses" class="poptip right" data-stat="losses" data-tip="Losses" scope="col">L</th>
<th aria-label="Win-Loss Pe

In [134]:
# Define a function that builds a DataFrame from the table node printed above
def parse_html_table(table_node):
    # Get the columns in a list
    columns_html = table_node.thead.find_all('th')
    # Extract the text
    columns = [col.text for col in columns_html]

    rows_list = table_node.tbody.find_all('tr')

    data = []
    for row_node in rows_list:
        row = []
        for child in row_node.children:
            row.append(child.text)
        data.append(row)
    df = pd.DataFrame(data,columns=columns)
    return df
df = parse_html_table(table_node)
df

Unnamed: 0,Eastern Conference,W,L,W/L%,GB,PS/G,PA/G,SRS
0,Toronto Raptors*,59,23,0.72,—,111.7,103.9,7.29
1,Boston Celtics*,55,27,0.671,4.0,104.0,100.4,3.23
2,Philadelphia 76ers*,52,30,0.634,7.0,109.8,105.3,4.3
3,Cleveland Cavaliers*,50,32,0.61,9.0,110.9,109.9,0.59
4,Indiana Pacers*,48,34,0.585,11.0,105.6,104.2,1.18
5,Miami Heat*,44,38,0.537,15.0,103.4,102.9,0.15
6,Milwaukee Bucks*,44,38,0.537,15.0,106.5,106.8,-0.45
7,Washington Wizards*,43,39,0.524,16.0,106.6,106.0,0.53
8,Detroit Pistons,39,43,0.476,20.0,103.8,103.9,-0.26
9,Charlotte Hornets,36,46,0.439,23.0,108.2,108.0,0.07
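
Iterating over `.children` relies on each row containing nothing but cell tags. An alternative is to look for `<th>` and `<td>` tags explicitly. The sketch below does this with the standard library's `html.parser` instead of BeautifulSoup (a swapped-in technique, chosen so the example runs without extra installs). The `TableParser` class and the miniature table are made up for illustration.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the cell text of every <tr> in a document, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # Finished rows
        self._row = None    # Row currently being built
        self._cell = None   # Text pieces of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:  # Ignore text outside of cells
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Made-up miniature of the standings table.
html_table = """
<table>
  <thead><tr><th>Team</th><th>W</th><th>L</th></tr></thead>
  <tbody>
    <tr><th>Toronto Raptors*</th><td>59</td><td>23</td></tr>
    <tr><th>Boston Celtics*</th><td>55</td><td>27</td></tr>
  </tbody>
</table>
"""

parser = TableParser()
parser.feed(html_table)
header, *data_rows = parser.rows
print(header)
print(data_rows)
```

The resulting `header` and `data_rows` could be handed to `pd.DataFrame(data_rows, columns=header)`.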


In [162]:
tables = soup.find_all('table') #Locate all table nodes

dfs = []
for i in range(10): #Use "len(tables)" instead of 10 to get all tables. len(tables)=13, but for now not all columns come through
    table = parse_html_table(tables[i]) #Apply parse_html_table function
    dfs.append(table) # store table in a list
dfs[9]

Unnamed: 0,Rk,Team,G,MP,FG,FGA,FG%,3P,3PA,3P%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1,Boston Celtics*,82,19805,38.7,88.0,0.44,9.7,28.6,0.339,...,0.763,10.0,35.4,45.4,22.0,7.5,4.6,14.6,19.8,103.9
1,2,Utah Jazz*,82,19755,38.8,86.4,0.449,9.9,27.0,0.365,...,0.771,9.0,34.3,43.3,20.8,8.6,4.8,15.5,21.4,103.9
2,3,San Antonio Spurs*,82,19730,40.1,88.5,0.453,9.6,27.6,0.348,...,0.759,9.7,35.0,44.6,22.9,8.0,4.1,14.8,20.7,104.8
3,4,Philadelphia 76ers*,82,19780,37.9,87.4,0.434,10.1,29.5,0.342,...,0.745,9.9,32.1,42.1,21.7,8.5,5.1,14.3,20.3,105.0
4,5,Toronto Raptors*,82,19830,39.1,87.2,0.449,9.1,25.4,0.357,...,0.767,10.0,33.2,43.2,22.2,7.3,5.0,14.6,20.3,105.9
5,6,Houston Rockets*,82,19755,40.4,87.4,0.462,10.3,29.5,0.351,...,0.746,8.8,34.1,42.9,22.9,7.6,4.5,14.9,20.8,106.1
6,7,Miami Heat*,82,19930,38.9,86.6,0.45,9.9,27.5,0.36,...,0.783,9.4,35.2,44.5,21.7,7.8,4.8,14.6,20.0,106.3
7,8,Portland Trail Blazers*,82,19755,39.6,88.7,0.447,10.0,27.4,0.364,...,0.755,9.6,34.7,44.3,20.8,7.6,5.3,13.0,19.7,106.4
8,9,Oklahoma City Thunder*,82,19830,39.5,86.2,0.458,11.5,31.4,0.367,...,0.769,9.8,33.5,43.3,23.9,7.9,4.7,16.5,21.9,107.2
9,10,Detroit Pistons,82,19805,40.4,88.0,0.459,11.4,31.8,0.359,...,0.776,9.5,35.7,45.2,26.0,7.5,5.0,15.3,19.0,107.3


In [175]:
# pd.read_html reads HTML tables into a list of DataFrame objects.
a = pd.read_html(url)
a[12]  # This approach returns all of the tables

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,% of FGA by Distance,% of FGA by Distance,% of FGA by Distance,...,% of FG Ast'd,Unnamed: 23_level_0,Dunks,Dunks,Unnamed: 26_level_0,Layups,Layups,Unnamed: 29_level_0,Corner,Corner
Unnamed: 0_level_1,Rk,Team,G,MP,FG%,Dist.,Unnamed: 6_level_1,2P,0-3,3-10,...,3P,Unnamed: 23_level_1,%FGA,Md.,Unnamed: 26_level_1,%FGA,Md.,Unnamed: 29_level_1,%3PA,3P%
0,1.0,Atlanta Hawks,82,19705,0.469,13.7,,0.646,0.271,0.127,...,0.872,,0.051,318,,0.259,1024,,0.216,0.381
1,2.0,Boston Celtics*,82,19805,0.44,13.1,,0.674,0.286,0.162,...,0.818,,0.046,294,,0.286,1094,,0.181,0.417
2,3.0,Brooklyn Nets,82,19855,0.466,12.6,,0.726,0.26,0.185,...,0.78,,0.043,275,,0.277,1080,,0.186,0.401
3,4.0,Chicago Bulls,82,19855,0.472,13.7,,0.624,0.292,0.122,...,0.864,,0.051,330,,0.264,1110,,0.213,0.393
4,5.0,Charlotte Hornets,82,19780,0.468,13.7,,0.657,0.262,0.144,...,0.853,,0.049,314,,0.258,1064,,0.194,0.418
5,6.0,Cleveland Cavaliers*,82,19730,0.474,13.4,,0.641,0.299,0.132,...,0.832,,0.055,362,,0.276,1148,,0.213,0.446
6,7.0,Dallas Mavericks,82,19805,0.469,13.8,,0.649,0.238,0.167,...,0.826,,0.051,314,,0.216,855,,0.19,0.384
7,8.0,Denver Nuggets,82,19880,0.476,12.8,,0.668,0.268,0.191,...,0.835,,0.049,310,,0.291,1188,,0.247,0.39
8,9.0,Detroit Pistons,82,19805,0.459,13.6,,0.638,0.251,0.177,...,0.904,,0.046,288,,0.271,1059,,0.233,0.387
9,10.0,Golden State Warriors*,82,19730,0.447,13.0,,0.676,0.278,0.166,...,0.802,,0.052,347,,0.274,1089,,0.189,0.378
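
The table above has a two-level header padded with `Unnamed: ...` placeholders. A sketch of collapsing it into single labels is shown below. `flatten_columns` is a hypothetical helper, and the label pairs are copied from the header above.

```python
def flatten_columns(columns):
    """Join two-level column labels, dropping the 'Unnamed: ...' placeholders."""
    flat = []
    for top, bottom in columns:
        parts = [str(part) for part in (top, bottom) if not str(part).startswith("Unnamed")]
        flat.append(" ".join(parts))
    return flat

# A few (top, bottom) label pairs copied from the header above.
example_columns = [
    ("Unnamed: 0_level_0", "Rk"),
    ("Unnamed: 1_level_0", "Team"),
    ("% of FGA by Distance", "2P"),
    ("Dunks", "%FGA"),
    ("Unnamed: 6_level_0", "Unnamed: 6_level_1"),
]
flat_labels = flatten_columns(example_columns)
print(flat_labels)
```

Since iterating over a two-level column index yields tuples, this could be applied as `a[12].columns = flatten_columns(a[12].columns)`. Spacer columns end up with an empty label and can be dropped afterwards.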
